This is a note to let you know that I've just added the patch titled
x86/mm: Rework lazy TLB to track the actual loaded mm
to the 4.9-stable tree which can be found at:
http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=sum…
The filename of the patch is:
x86-mm-rework-lazy-tlb-to-track-the-actual-loaded-mm.patch
and it can be found in the queue-4.9 subdirectory.
If you, or anyone else, feels it should not be added to the stable tree,
please let <stable(a)vger.kernel.org> know about it.
From 3d28ebceaffab40f30afa87e33331560148d7b8b Mon Sep 17 00:00:00 2001
From: Andy Lutomirski <luto(a)kernel.org>
Date: Sun, 28 May 2017 10:00:15 -0700
Subject: x86/mm: Rework lazy TLB to track the actual loaded mm
From: Andy Lutomirski <luto(a)kernel.org>
commit 3d28ebceaffab40f30afa87e33331560148d7b8b upstream.
Lazy TLB state is currently managed in a rather baroque manner.
AFAICT, there are three possible states:
- Non-lazy. This means that we're running a user thread or a
kernel thread that has called use_mm(). current->mm ==
current->active_mm == cpu_tlbstate.active_mm and
cpu_tlbstate.state == TLBSTATE_OK.
- Lazy with user mm. We're running a kernel thread without an mm
and we're borrowing an mm_struct. We have current->mm == NULL,
current->active_mm == cpu_tlbstate.active_mm, cpu_tlbstate.state
!= TLBSTATE_OK (i.e. TLBSTATE_LAZY or 0). The current cpu is set
in mm_cpumask(current->active_mm). CR3 points to
current->active_mm->pgd. The TLB is up to date.
- Lazy with init_mm. This happens when we call leave_mm(). We
have current->mm == NULL, current->active_mm ==
cpu_tlbstate.active_mm, but that mm is only relevant insofar as
the scheduler is tracking it for refcounting. cpu_tlbstate.state
!= TLBSTATE_OK. The current cpu is clear in
mm_cpumask(current->active_mm). CR3 points to swapper_pg_dir,
i.e. init_mm->pgd.
This patch simplifies the situation. Other than perf, x86 stops
caring about current->active_mm at all. We have
cpu_tlbstate.loaded_mm pointing to the mm that CR3 references. The
TLB is always up to date for that mm. leave_mm() just switches us
to init_mm. There are no longer any special cases for mm_cpumask,
and switch_mm() switches mms without worrying about laziness.
After this patch, cpu_tlbstate.state serves only to tell the TLB
flush code whether it may switch to init_mm instead of doing a
normal flush.
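For illustration, the flush-path decision after this change boils down to
roughly the following (a simplified sketch of the logic described above,
not the literal hunks below):

    if (this_cpu_read(cpu_tlbstate.state) != TLBSTATE_OK) {
            /* Lazy: switch to init_mm instead of flushing user entries. */
            leave_mm(smp_processor_id());
            return;
    }
    /* Not lazy: cpu_tlbstate.loaded_mm owns the TLB, so really flush. */
    local_flush_tlb();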
This makes fairly extensive changes to xen_exit_mmap(), which used
to look a bit like black magic.
Perf is unchanged. With or without this change, perf may behave a bit
erratically if it tries to read user memory in kernel thread context.
We should build on this patch to teach perf to never look at user
memory when cpu_tlbstate.loaded_mm != current->mm.
Signed-off-by: Andy Lutomirski <luto(a)kernel.org>
Cc: Andrew Morton <akpm(a)linux-foundation.org>
Cc: Arjan van de Ven <arjan(a)linux.intel.com>
Cc: Borislav Petkov <bpetkov(a)suse.de>
Cc: Dave Hansen <dave.hansen(a)intel.com>
Cc: Linus Torvalds <torvalds(a)linux-foundation.org>
Cc: Mel Gorman <mgorman(a)suse.de>
Cc: Michal Hocko <mhocko(a)suse.com>
Cc: Nadav Amit <nadav.amit(a)gmail.com>
Cc: Nadav Amit <namit(a)vmware.com>
Cc: Peter Zijlstra <peterz(a)infradead.org>
Cc: Rik van Riel <riel(a)redhat.com>
Cc: Thomas Gleixner <tglx(a)linutronix.de>
Cc: linux-mm(a)kvack.org
Signed-off-by: Ingo Molnar <mingo(a)kernel.org>
Signed-off-by: Eduardo Valentin <eduval(a)amazon.com>
Signed-off-by: Eduardo Valentin <edubezval(a)gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
---
arch/x86/events/core.c | 3
arch/x86/include/asm/tlbflush.h | 12 +-
arch/x86/kernel/ldt.c | 7 -
arch/x86/mm/init.c | 2
arch/x86/mm/tlb.c | 216 ++++++++++++++++++++--------------------
arch/x86/xen/mmu.c | 51 ++++-----
6 files changed, 147 insertions(+), 144 deletions(-)
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -2100,8 +2100,7 @@ static int x86_pmu_event_init(struct per
static void refresh_pce(void *ignored)
{
- if (current->active_mm)
- load_mm_cr4(current->active_mm);
+ load_mm_cr4(this_cpu_read(cpu_tlbstate.loaded_mm));
}
static void x86_pmu_event_mapped(struct perf_event *event)
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -66,7 +66,13 @@ static inline void invpcid_flush_all_non
#endif
struct tlb_state {
- struct mm_struct *active_mm;
+ /*
+ * cpu_tlbstate.loaded_mm should match CR3 whenever interrupts
+ * are on. This means that it may not match current->active_mm,
+ * which will contain the previous user mm when we're in lazy TLB
+ * mode even if we've already switched back to swapper_pg_dir.
+ */
+ struct mm_struct *loaded_mm;
int state;
/*
@@ -249,7 +255,9 @@ void native_flush_tlb_others(const struc
static inline void reset_lazy_tlbstate(void)
{
this_cpu_write(cpu_tlbstate.state, 0);
- this_cpu_write(cpu_tlbstate.active_mm, &init_mm);
+ this_cpu_write(cpu_tlbstate.loaded_mm, &init_mm);
+
+ WARN_ON(read_cr3() != __pa_symbol(swapper_pg_dir));
}
static inline void arch_tlbbatch_add_mm(struct arch_tlbflush_unmap_batch *batch,
--- a/arch/x86/kernel/ldt.c
+++ b/arch/x86/kernel/ldt.c
@@ -23,14 +23,15 @@
#include <asm/syscalls.h>
/* context.lock is held for us, so we don't need any locking. */
-static void flush_ldt(void *current_mm)
+static void flush_ldt(void *__mm)
{
+ struct mm_struct *mm = __mm;
mm_context_t *pc;
- if (current->active_mm != current_mm)
+ if (this_cpu_read(cpu_tlbstate.loaded_mm) != mm)
return;
- pc = &current->active_mm->context;
+ pc = &mm->context;
set_ldt(pc->ldt->entries, pc->ldt->size);
}
--- a/arch/x86/mm/init.c
+++ b/arch/x86/mm/init.c
@@ -764,7 +764,7 @@ void __init zone_sizes_init(void)
}
DEFINE_PER_CPU_SHARED_ALIGNED(struct tlb_state, cpu_tlbstate) = {
- .active_mm = &init_mm,
+ .loaded_mm = &init_mm,
.state = 0,
.cr4 = ~0UL, /* fail hard if we screw up cr4 shadow initialization */
};
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -28,26 +28,25 @@
* Implement flush IPI by CALL_FUNCTION_VECTOR, Alex Shi
*/
-/*
- * We cannot call mmdrop() because we are in interrupt context,
- * instead update mm->cpu_vm_mask.
- */
void leave_mm(int cpu)
{
- struct mm_struct *active_mm = this_cpu_read(cpu_tlbstate.active_mm);
+ struct mm_struct *loaded_mm = this_cpu_read(cpu_tlbstate.loaded_mm);
+
+ /*
+ * It's plausible that we're in lazy TLB mode while our mm is init_mm.
+ * If so, our callers still expect us to flush the TLB, but there
+ * aren't any user TLB entries in init_mm to worry about.
+ *
+ * This needs to happen before any other sanity checks due to
+ * intel_idle's shenanigans.
+ */
+ if (loaded_mm == &init_mm)
+ return;
+
if (this_cpu_read(cpu_tlbstate.state) == TLBSTATE_OK)
BUG();
- if (cpumask_test_cpu(cpu, mm_cpumask(active_mm))) {
- cpumask_clear_cpu(cpu, mm_cpumask(active_mm));
- load_cr3(swapper_pg_dir);
- /*
- * This gets called in the idle path where RCU
- * functions differently. Tracing normally
- * uses RCU, so we have to call the tracepoint
- * specially here.
- */
- trace_tlb_flush_rcuidle(TLB_FLUSH_ON_TASK_SWITCH, TLB_FLUSH_ALL);
- }
+
+ switch_mm(NULL, &init_mm, NULL);
}
EXPORT_SYMBOL_GPL(leave_mm);
@@ -65,108 +64,109 @@ void switch_mm_irqs_off(struct mm_struct
struct task_struct *tsk)
{
unsigned cpu = smp_processor_id();
+ struct mm_struct *real_prev = this_cpu_read(cpu_tlbstate.loaded_mm);
- if (likely(prev != next)) {
- if (IS_ENABLED(CONFIG_VMAP_STACK)) {
- /*
- * If our current stack is in vmalloc space and isn't
- * mapped in the new pgd, we'll double-fault. Forcibly
- * map it.
- */
- unsigned int stack_pgd_index = pgd_index(current_stack_pointer());
-
- pgd_t *pgd = next->pgd + stack_pgd_index;
-
- if (unlikely(pgd_none(*pgd)))
- set_pgd(pgd, init_mm.pgd[stack_pgd_index]);
- }
+ /*
+ * NB: The scheduler will call us with prev == next when
+ * switching from lazy TLB mode to normal mode if active_mm
+ * isn't changing. When this happens, there is no guarantee
+ * that CR3 (and hence cpu_tlbstate.loaded_mm) matches next.
+ *
+ * NB: leave_mm() calls us with prev == NULL and tsk == NULL.
+ */
- this_cpu_write(cpu_tlbstate.state, TLBSTATE_OK);
- this_cpu_write(cpu_tlbstate.active_mm, next);
+ this_cpu_write(cpu_tlbstate.state, TLBSTATE_OK);
- cpumask_set_cpu(cpu, mm_cpumask(next));
+ if (real_prev == next) {
+ /*
+ * There's nothing to do: we always keep the per-mm control
+ * regs in sync with cpu_tlbstate.loaded_mm. Just
+ * sanity-check mm_cpumask.
+ */
+ if (WARN_ON_ONCE(!cpumask_test_cpu(cpu, mm_cpumask(next))))
+ cpumask_set_cpu(cpu, mm_cpumask(next));
+ return;
+ }
+ if (IS_ENABLED(CONFIG_VMAP_STACK)) {
/*
- * Re-load page tables.
- *
- * This logic has an ordering constraint:
- *
- * CPU 0: Write to a PTE for 'next'
- * CPU 0: load bit 1 in mm_cpumask. if nonzero, send IPI.
- * CPU 1: set bit 1 in next's mm_cpumask
- * CPU 1: load from the PTE that CPU 0 writes (implicit)
- *
- * We need to prevent an outcome in which CPU 1 observes
- * the new PTE value and CPU 0 observes bit 1 clear in
- * mm_cpumask. (If that occurs, then the IPI will never
- * be sent, and CPU 0's TLB will contain a stale entry.)
- *
- * The bad outcome can occur if either CPU's load is
- * reordered before that CPU's store, so both CPUs must
- * execute full barriers to prevent this from happening.
- *
- * Thus, switch_mm needs a full barrier between the
- * store to mm_cpumask and any operation that could load
- * from next->pgd. TLB fills are special and can happen
- * due to instruction fetches or for no reason at all,
- * and neither LOCK nor MFENCE orders them.
- * Fortunately, load_cr3() is serializing and gives the
- * ordering guarantee we need.
- *
+ * If our current stack is in vmalloc space and isn't
+ * mapped in the new pgd, we'll double-fault. Forcibly
+ * map it.
*/
- load_cr3(next->pgd);
+ unsigned int stack_pgd_index = pgd_index(current_stack_pointer());
- trace_tlb_flush(TLB_FLUSH_ON_TASK_SWITCH, TLB_FLUSH_ALL);
+ pgd_t *pgd = next->pgd + stack_pgd_index;
- /* Stop flush ipis for the previous mm */
- cpumask_clear_cpu(cpu, mm_cpumask(prev));
+ if (unlikely(pgd_none(*pgd)))
+ set_pgd(pgd, init_mm.pgd[stack_pgd_index]);
+ }
- /* Load per-mm CR4 state */
- load_mm_cr4(next);
+ this_cpu_write(cpu_tlbstate.loaded_mm, next);
-#ifdef CONFIG_MODIFY_LDT_SYSCALL
- /*
- * Load the LDT, if the LDT is different.
- *
- * It's possible that prev->context.ldt doesn't match
- * the LDT register. This can happen if leave_mm(prev)
- * was called and then modify_ldt changed
- * prev->context.ldt but suppressed an IPI to this CPU.
- * In this case, prev->context.ldt != NULL, because we
- * never set context.ldt to NULL while the mm still
- * exists. That means that next->context.ldt !=
- * prev->context.ldt, because mms never share an LDT.
- */
- if (unlikely(prev->context.ldt != next->context.ldt))
- load_mm_ldt(next);
-#endif
- } else {
- this_cpu_write(cpu_tlbstate.state, TLBSTATE_OK);
- BUG_ON(this_cpu_read(cpu_tlbstate.active_mm) != next);
+ WARN_ON_ONCE(cpumask_test_cpu(cpu, mm_cpumask(next)));
+ cpumask_set_cpu(cpu, mm_cpumask(next));
- if (!cpumask_test_cpu(cpu, mm_cpumask(next))) {
- /*
- * On established mms, the mm_cpumask is only changed
- * from irq context, from ptep_clear_flush() while in
- * lazy tlb mode, and here. Irqs are blocked during
- * schedule, protecting us from simultaneous changes.
- */
- cpumask_set_cpu(cpu, mm_cpumask(next));
+ /*
+ * Re-load page tables.
+ *
+ * This logic has an ordering constraint:
+ *
+ * CPU 0: Write to a PTE for 'next'
+ * CPU 0: load bit 1 in mm_cpumask. if nonzero, send IPI.
+ * CPU 1: set bit 1 in next's mm_cpumask
+ * CPU 1: load from the PTE that CPU 0 writes (implicit)
+ *
+ * We need to prevent an outcome in which CPU 1 observes
+ * the new PTE value and CPU 0 observes bit 1 clear in
+ * mm_cpumask. (If that occurs, then the IPI will never
+ * be sent, and CPU 0's TLB will contain a stale entry.)
+ *
+ * The bad outcome can occur if either CPU's load is
+ * reordered before that CPU's store, so both CPUs must
+ * execute full barriers to prevent this from happening.
+ *
+ * Thus, switch_mm needs a full barrier between the
+ * store to mm_cpumask and any operation that could load
+ * from next->pgd. TLB fills are special and can happen
+ * due to instruction fetches or for no reason at all,
+ * and neither LOCK nor MFENCE orders them.
+ * Fortunately, load_cr3() is serializing and gives the
+ * ordering guarantee we need.
+ */
+ load_cr3(next->pgd);
+
+ /*
+ * This gets called via leave_mm() in the idle path where RCU
+ * functions differently. Tracing normally uses RCU, so we have to
+ * call the tracepoint specially here.
+ */
+ trace_tlb_flush_rcuidle(TLB_FLUSH_ON_TASK_SWITCH, TLB_FLUSH_ALL);
+
+ /* Stop flush ipis for the previous mm */
+ WARN_ON_ONCE(!cpumask_test_cpu(cpu, mm_cpumask(real_prev)) &&
+ real_prev != &init_mm);
+ cpumask_clear_cpu(cpu, mm_cpumask(real_prev));
- /*
- * We were in lazy tlb mode and leave_mm disabled
- * tlb flush IPI delivery. We must reload CR3
- * to make sure to use no freed page tables.
- *
- * As above, load_cr3() is serializing and orders TLB
- * fills with respect to the mm_cpumask write.
- */
- load_cr3(next->pgd);
- trace_tlb_flush(TLB_FLUSH_ON_TASK_SWITCH, TLB_FLUSH_ALL);
- load_mm_cr4(next);
- load_mm_ldt(next);
- }
- }
+ /* Load per-mm CR4 state */
+ load_mm_cr4(next);
+
+#ifdef CONFIG_MODIFY_LDT_SYSCALL
+ /*
+ * Load the LDT, if the LDT is different.
+ *
+ * It's possible that prev->context.ldt doesn't match
+ * the LDT register. This can happen if leave_mm(prev)
+ * was called and then modify_ldt changed
+ * prev->context.ldt but suppressed an IPI to this CPU.
+ * In this case, prev->context.ldt != NULL, because we
+ * never set context.ldt to NULL while the mm still
+ * exists. That means that next->context.ldt !=
+ * prev->context.ldt, because mms never share an LDT.
+ */
+ if (unlikely(real_prev->context.ldt != next->context.ldt))
+ load_mm_ldt(next);
+#endif
}
/*
@@ -246,7 +246,7 @@ static void flush_tlb_func_remote(void *
inc_irq_stat(irq_tlb_count);
- if (f->mm && f->mm != this_cpu_read(cpu_tlbstate.active_mm))
+ if (f->mm && f->mm != this_cpu_read(cpu_tlbstate.loaded_mm))
return;
count_vm_tlb_event(NR_TLB_REMOTE_FLUSH_RECEIVED);
@@ -337,7 +337,7 @@ void flush_tlb_mm_range(struct mm_struct
info.end = TLB_FLUSH_ALL;
}
- if (mm == current->active_mm)
+ if (mm == this_cpu_read(cpu_tlbstate.loaded_mm))
flush_tlb_func_local(&info, TLB_LOCAL_MM_SHOOTDOWN);
if (cpumask_any_but(mm_cpumask(mm), cpu) < nr_cpu_ids)
flush_tlb_others(mm_cpumask(mm), &info);
--- a/arch/x86/xen/mmu.c
+++ b/arch/x86/xen/mmu.c
@@ -998,37 +998,32 @@ static void xen_dup_mmap(struct mm_struc
spin_unlock(&mm->page_table_lock);
}
-
-#ifdef CONFIG_SMP
-/* Another cpu may still have their %cr3 pointing at the pagetable, so
- we need to repoint it somewhere else before we can unpin it. */
-static void drop_other_mm_ref(void *info)
+static void drop_mm_ref_this_cpu(void *info)
{
struct mm_struct *mm = info;
- struct mm_struct *active_mm;
-
- active_mm = this_cpu_read(cpu_tlbstate.active_mm);
- if (active_mm == mm && this_cpu_read(cpu_tlbstate.state) != TLBSTATE_OK)
+ if (this_cpu_read(cpu_tlbstate.loaded_mm) == mm)
leave_mm(smp_processor_id());
- /* If this cpu still has a stale cr3 reference, then make sure
- it has been flushed. */
+ /*
+ * If this cpu still has a stale cr3 reference, then make sure
+ * it has been flushed.
+ */
if (this_cpu_read(xen_current_cr3) == __pa(mm->pgd))
- load_cr3(swapper_pg_dir);
+ xen_mc_flush();
}
+#ifdef CONFIG_SMP
+/*
+ * Another cpu may still have their %cr3 pointing at the pagetable, so
+ * we need to repoint it somewhere else before we can unpin it.
+ */
static void xen_drop_mm_ref(struct mm_struct *mm)
{
cpumask_var_t mask;
unsigned cpu;
- if (current->active_mm == mm) {
- if (current->mm == mm)
- load_cr3(swapper_pg_dir);
- else
- leave_mm(smp_processor_id());
- }
+ drop_mm_ref_this_cpu(mm);
/* Get the "official" set of cpus referring to our pagetable. */
if (!alloc_cpumask_var(&mask, GFP_ATOMIC)) {
@@ -1036,31 +1031,31 @@ static void xen_drop_mm_ref(struct mm_st
if (!cpumask_test_cpu(cpu, mm_cpumask(mm))
&& per_cpu(xen_current_cr3, cpu) != __pa(mm->pgd))
continue;
- smp_call_function_single(cpu, drop_other_mm_ref, mm, 1);
+ smp_call_function_single(cpu, drop_mm_ref_this_cpu, mm, 1);
}
return;
}
cpumask_copy(mask, mm_cpumask(mm));
- /* It's possible that a vcpu may have a stale reference to our
- cr3, because its in lazy mode, and it hasn't yet flushed
- its set of pending hypercalls yet. In this case, we can
- look at its actual current cr3 value, and force it to flush
- if needed. */
+ /*
+ * It's possible that a vcpu may have a stale reference to our
+ * cr3, because its in lazy mode, and it hasn't yet flushed
+ * its set of pending hypercalls yet. In this case, we can
+ * look at its actual current cr3 value, and force it to flush
+ * if needed.
+ */
for_each_online_cpu(cpu) {
if (per_cpu(xen_current_cr3, cpu) == __pa(mm->pgd))
cpumask_set_cpu(cpu, mask);
}
- if (!cpumask_empty(mask))
- smp_call_function_many(mask, drop_other_mm_ref, mm, 1);
+ smp_call_function_many(mask, drop_mm_ref_this_cpu, mm, 1);
free_cpumask_var(mask);
}
#else
static void xen_drop_mm_ref(struct mm_struct *mm)
{
- if (current->active_mm == mm)
- load_cr3(swapper_pg_dir);
+ drop_mm_ref_this_cpu(mm);
}
#endif
Patches currently in stable-queue which might be from luto(a)kernel.org are
queue-4.9/x86-mm-refactor-flush_tlb_mm_range-to-merge-local-and-remote-cases.patch
queue-4.9/x86-mm-pass-flush_tlb_info-to-flush_tlb_others-etc.patch
queue-4.9/x86-mm-rework-lazy-tlb-to-track-the-actual-loaded-mm.patch
queue-4.9/x86-mm-kvm-teach-kvm-s-vmx-code-that-cr3-isn-t-a-constant.patch
queue-4.9/x86-mm-use-new-merged-flush-logic-in-arch_tlbbatch_flush.patch
queue-4.9/x86-kvm-vmx-simplify-segment_base.patch
queue-4.9/x86-entry-unwind-create-stack-frames-for-saved-interrupt-registers.patch
queue-4.9/x86-mm-reduce-indentation-in-flush_tlb_func.patch
queue-4.9/x86-mm-remove-the-up-asm-tlbflush.h-code-always-use-the-formerly-smp-code.patch
queue-4.9/x86-mm-reimplement-flush_tlb_page-using-flush_tlb_mm_range.patch
queue-4.9/mm-x86-mm-make-the-batched-unmap-tlb-flush-api-more-generic.patch
queue-4.9/x86-kvm-vmx-defer-tr-reload-after-vm-exit.patch
queue-4.9/x86-mm-change-the-leave_mm-condition-for-local-tlb-flushes.patch
queue-4.9/x86-mm-be-more-consistent-wrt-page_shift-vs-page_size-in-tlb-flush-code.patch
This is a note to let you know that I've just added the patch titled
x86/mm: Reimplement flush_tlb_page() using flush_tlb_mm_range()
to the 4.9-stable tree which can be found at:
http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=sum…
The filename of the patch is:
x86-mm-reimplement-flush_tlb_page-using-flush_tlb_mm_range.patch
and it can be found in the queue-4.9 subdirectory.
If you, or anyone else, feels it should not be added to the stable tree,
please let <stable(a)vger.kernel.org> know about it.
From ca6c99c0794875c6d1db6e22f246699691ab7e6b Mon Sep 17 00:00:00 2001
From: Andy Lutomirski <luto(a)kernel.org>
Date: Mon, 22 May 2017 15:30:01 -0700
Subject: x86/mm: Reimplement flush_tlb_page() using flush_tlb_mm_range()
From: Andy Lutomirski <luto(a)kernel.org>
commit ca6c99c0794875c6d1db6e22f246699691ab7e6b upstream.
flush_tlb_page() was very similar to flush_tlb_mm_range() except that
it had a couple of issues:
- It was missing an smp_mb() in the case where
current->active_mm != mm. (This is a longstanding bug reported by Nadav Amit)
- It was missing tracepoints and vm counter updates.
The only reason that I can see for keeping it as a separate
function is that it could avoid a few branches that
flush_tlb_mm_range() needs to decide to flush just one page. This
hardly seems worthwhile. If we decide we want to get rid of those
branches again, a better way would be to introduce an
__flush_tlb_mm_range() helper and make both flush_tlb_page() and
flush_tlb_mm_range() use it.
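As the tlbflush.h hunk below shows, the replacement is just a thin inline
wrapper:

    static inline void flush_tlb_page(struct vm_area_struct *vma, unsigned long a)
    {
            flush_tlb_mm_range(vma->vm_mm, a, a + PAGE_SIZE, VM_NONE);
    }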
Signed-off-by: Andy Lutomirski <luto(a)kernel.org>
Acked-by: Kees Cook <keescook(a)chromium.org>
Cc: Andrew Morton <akpm(a)linux-foundation.org>
Cc: Borislav Petkov <bpetkov(a)suse.de>
Cc: Dave Hansen <dave.hansen(a)intel.com>
Cc: Linus Torvalds <torvalds(a)linux-foundation.org>
Cc: Mel Gorman <mgorman(a)suse.de>
Cc: Michal Hocko <mhocko(a)suse.com>
Cc: Nadav Amit <nadav.amit(a)gmail.com>
Cc: Nadav Amit <namit(a)vmware.com>
Cc: Peter Zijlstra <peterz(a)infradead.org>
Cc: Rik van Riel <riel(a)redhat.com>
Cc: Thomas Gleixner <tglx(a)linutronix.de>
Cc: linux-mm(a)kvack.org
Link: http://lkml.kernel.org/r/3cc3847cf888d8907577569b8bac3f01992ef8f9.149549206…
Signed-off-by: Ingo Molnar <mingo(a)kernel.org>
Signed-off-by: Eduardo Valentin <eduval(a)amazon.com>
Signed-off-by: Eduardo Valentin <edubezval(a)gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
---
arch/x86/include/asm/tlbflush.h | 5 ++++-
arch/x86/mm/tlb.c | 27 ---------------------------
2 files changed, 4 insertions(+), 28 deletions(-)
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -304,12 +304,15 @@ static inline void flush_tlb_kernel_rang
extern void flush_tlb_all(void);
extern void flush_tlb_current_task(void);
-extern void flush_tlb_page(struct vm_area_struct *, unsigned long);
extern void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
unsigned long end, unsigned long vmflag);
extern void flush_tlb_kernel_range(unsigned long start, unsigned long end);
#define flush_tlb() flush_tlb_current_task()
+static inline void flush_tlb_page(struct vm_area_struct *vma, unsigned long a)
+{
+ flush_tlb_mm_range(vma->vm_mm, a, a + PAGE_SIZE, VM_NONE);
+}
void native_flush_tlb_others(const struct cpumask *cpumask,
struct mm_struct *mm,
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -369,33 +369,6 @@ out:
preempt_enable();
}
-void flush_tlb_page(struct vm_area_struct *vma, unsigned long start)
-{
- struct mm_struct *mm = vma->vm_mm;
-
- preempt_disable();
-
- if (current->active_mm == mm) {
- if (current->mm) {
- /*
- * Implicit full barrier (INVLPG) that synchronizes
- * with switch_mm.
- */
- __flush_tlb_one(start);
- } else {
- leave_mm(smp_processor_id());
-
- /* Synchronize with switch_mm. */
- smp_mb();
- }
- }
-
- if (cpumask_any_but(mm_cpumask(mm), smp_processor_id()) < nr_cpu_ids)
- flush_tlb_others(mm_cpumask(mm), mm, start, start + PAGE_SIZE);
-
- preempt_enable();
-}
-
static void do_flush_tlb_all(void *info)
{
count_vm_tlb_event(NR_TLB_REMOTE_FLUSH_RECEIVED);
This is a note to let you know that I've just added the patch titled
x86/mm: Refactor flush_tlb_mm_range() to merge local and remote cases
to the 4.9-stable tree which can be found at:
http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=sum…
The filename of the patch is:
x86-mm-refactor-flush_tlb_mm_range-to-merge-local-and-remote-cases.patch
and it can be found in the queue-4.9 subdirectory.
If you, or anyone else, feels it should not be added to the stable tree,
please let <stable(a)vger.kernel.org> know about it.
From 454bbad9793f59f5656ce5971ee473a8be736ef5 Mon Sep 17 00:00:00 2001
From: Andy Lutomirski <luto(a)kernel.org>
Date: Sun, 28 May 2017 10:00:12 -0700
Subject: x86/mm: Refactor flush_tlb_mm_range() to merge local and remote cases
From: Andy Lutomirski <luto(a)kernel.org>
commit 454bbad9793f59f5656ce5971ee473a8be736ef5 upstream.
The local flush path is very similar to the remote flush path.
Merge them.
This is intended to make no difference to behavior whatsoever. It
removes some code and will make future changes to the flushing
mechanics simpler.
This patch does remove one small optimization: flush_tlb_mm_range()
now has an unconditional smp_mb() instead of using MOV to CR3 or
INVLPG as a full barrier when applicable. I think this is okay for
a few reasons. First, smp_mb() is quite cheap compared to the cost
of a TLB flush. Second, this rearrangement makes a bigger
optimization available: with some work on the SMP function call
code, we could do the local and remote flushes in parallel. Third,
I'm planning a rework of the TLB flush algorithm that will require
an atomic operation at the beginning of each flush, and that
operation will replace the smp_mb().
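Structurally, the merge (shown in full in the diff below) turns the old
flush_tlb_func() into a common helper plus thin local and remote wrappers,
roughly:

    static void flush_tlb_func_common(const struct flush_tlb_info *f,
                                      bool local, enum tlb_flush_reason reason);

    static void flush_tlb_func_local(void *info, enum tlb_flush_reason reason)
    {
            flush_tlb_func_common(info, true, reason);
    }

    static void flush_tlb_func_remote(void *info)
    {
            const struct flush_tlb_info *f = info;

            inc_irq_stat(irq_tlb_count);
            if (f->mm && f->mm != this_cpu_read(cpu_tlbstate.active_mm))
                    return;
            count_vm_tlb_event(NR_TLB_REMOTE_FLUSH_RECEIVED);
            flush_tlb_func_common(f, false, TLB_REMOTE_SHOOTDOWN);
    }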
Signed-off-by: Andy Lutomirski <luto(a)kernel.org>
Cc: Andrew Morton <akpm(a)linux-foundation.org>
Cc: Arjan van de Ven <arjan(a)linux.intel.com>
Cc: Borislav Petkov <bpetkov(a)suse.de>
Cc: Dave Hansen <dave.hansen(a)intel.com>
Cc: Linus Torvalds <torvalds(a)linux-foundation.org>
Cc: Mel Gorman <mgorman(a)suse.de>
Cc: Michal Hocko <mhocko(a)suse.com>
Cc: Nadav Amit <nadav.amit(a)gmail.com>
Cc: Nadav Amit <namit(a)vmware.com>
Cc: Peter Zijlstra <peterz(a)infradead.org>
Cc: Rik van Riel <riel(a)redhat.com>
Cc: Thomas Gleixner <tglx(a)linutronix.de>
Cc: linux-mm(a)kvack.org
Signed-off-by: Ingo Molnar <mingo(a)kernel.org>
Signed-off-by: Eduardo Valentin <eduval(a)amazon.com>
Signed-off-by: Eduardo Valentin <edubezval(a)gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
---
arch/x86/include/asm/tlbflush.h | 1
arch/x86/mm/tlb.c | 111 +++++++++++++++++-----------------------
2 files changed, 48 insertions(+), 64 deletions(-)
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -216,7 +216,6 @@ static inline void __flush_tlb_one(unsig
* ..but the i386 has somewhat limited tlb flushing capabilities,
* and page-granular flushes are available only on i486 and up.
*/
-
struct flush_tlb_info {
struct mm_struct *mm;
unsigned long start;
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -216,22 +216,9 @@ void switch_mm_irqs_off(struct mm_struct
* write/read ordering problems.
*/
-/*
- * TLB flush funcation:
- * 1) Flush the tlb entries if the cpu uses the mm that's being flushed.
- * 2) Leave the mm if we are in the lazy tlb mode.
- */
-static void flush_tlb_func(void *info)
+static void flush_tlb_func_common(const struct flush_tlb_info *f,
+ bool local, enum tlb_flush_reason reason)
{
- const struct flush_tlb_info *f = info;
-
- inc_irq_stat(irq_tlb_count);
-
- if (f->mm && f->mm != this_cpu_read(cpu_tlbstate.active_mm))
- return;
-
- count_vm_tlb_event(NR_TLB_REMOTE_FLUSH_RECEIVED);
-
if (this_cpu_read(cpu_tlbstate.state) != TLBSTATE_OK) {
leave_mm(smp_processor_id());
return;
@@ -239,7 +226,9 @@ static void flush_tlb_func(void *info)
if (f->end == TLB_FLUSH_ALL) {
local_flush_tlb();
- trace_tlb_flush(TLB_REMOTE_SHOOTDOWN, TLB_FLUSH_ALL);
+ if (local)
+ count_vm_tlb_event(NR_TLB_LOCAL_FLUSH_ALL);
+ trace_tlb_flush(reason, TLB_FLUSH_ALL);
} else {
unsigned long addr;
unsigned long nr_pages =
@@ -249,10 +238,32 @@ static void flush_tlb_func(void *info)
__flush_tlb_single(addr);
addr += PAGE_SIZE;
}
- trace_tlb_flush(TLB_REMOTE_SHOOTDOWN, nr_pages);
+ if (local)
+ count_vm_tlb_events(NR_TLB_LOCAL_FLUSH_ONE, nr_pages);
+ trace_tlb_flush(reason, nr_pages);
}
}
+static void flush_tlb_func_local(void *info, enum tlb_flush_reason reason)
+{
+ const struct flush_tlb_info *f = info;
+
+ flush_tlb_func_common(f, true, reason);
+}
+
+static void flush_tlb_func_remote(void *info)
+{
+ const struct flush_tlb_info *f = info;
+
+ inc_irq_stat(irq_tlb_count);
+
+ if (f->mm && f->mm != this_cpu_read(cpu_tlbstate.active_mm))
+ return;
+
+ count_vm_tlb_event(NR_TLB_REMOTE_FLUSH_RECEIVED);
+ flush_tlb_func_common(f, false, TLB_REMOTE_SHOOTDOWN);
+}
+
void native_flush_tlb_others(const struct cpumask *cpumask,
const struct flush_tlb_info *info)
{
@@ -269,11 +280,11 @@ void native_flush_tlb_others(const struc
cpu = smp_processor_id();
cpumask = uv_flush_tlb_others(cpumask, info);
if (cpumask)
- smp_call_function_many(cpumask, flush_tlb_func,
+ smp_call_function_many(cpumask, flush_tlb_func_remote,
(void *)info, 1);
return;
}
- smp_call_function_many(cpumask, flush_tlb_func,
+ smp_call_function_many(cpumask, flush_tlb_func_remote,
(void *)info, 1);
}
@@ -315,59 +326,33 @@ static unsigned long tlb_single_page_flu
void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
unsigned long end, unsigned long vmflag)
{
- unsigned long addr;
- struct flush_tlb_info info;
- /* do a global flush by default */
- unsigned long base_pages_to_flush = TLB_FLUSH_ALL;
-
- preempt_disable();
- if (current->active_mm != mm) {
- /* Synchronize with switch_mm. */
- smp_mb();
+ int cpu;
- goto out;
- }
-
- if (this_cpu_read(cpu_tlbstate.state) != TLBSTATE_OK) {
- leave_mm(smp_processor_id());
-
- /* Synchronize with switch_mm. */
- smp_mb();
+ struct flush_tlb_info info = {
+ .mm = mm,
+ };
- goto out;
- }
+ cpu = get_cpu();
- if ((end != TLB_FLUSH_ALL) && !(vmflag & VM_HUGETLB))
- base_pages_to_flush = (end - start) >> PAGE_SHIFT;
+ /* Synchronize with switch_mm. */
+ smp_mb();
- /*
- * Both branches below are implicit full barriers (MOV to CR or
- * INVLPG) that synchronize with switch_mm.
- */
- if (base_pages_to_flush > tlb_single_page_flush_ceiling) {
- base_pages_to_flush = TLB_FLUSH_ALL;
- count_vm_tlb_event(NR_TLB_LOCAL_FLUSH_ALL);
- local_flush_tlb();
+ /* Should we flush just the requested range? */
+ if ((end != TLB_FLUSH_ALL) &&
+ !(vmflag & VM_HUGETLB) &&
+ ((end - start) >> PAGE_SHIFT) <= tlb_single_page_flush_ceiling) {
+ info.start = start;
+ info.end = end;
} else {
- /* flush range by one by one 'invlpg' */
- for (addr = start; addr < end; addr += PAGE_SIZE) {
- count_vm_tlb_event(NR_TLB_LOCAL_FLUSH_ONE);
- __flush_tlb_single(addr);
- }
- }
- trace_tlb_flush(TLB_LOCAL_MM_SHOOTDOWN, base_pages_to_flush);
-out:
- info.mm = mm;
- if (base_pages_to_flush == TLB_FLUSH_ALL) {
info.start = 0UL;
info.end = TLB_FLUSH_ALL;
- } else {
- info.start = start;
- info.end = end;
}
- if (cpumask_any_but(mm_cpumask(mm), smp_processor_id()) < nr_cpu_ids)
+
+ if (mm == current->active_mm)
+ flush_tlb_func_local(&info, TLB_LOCAL_MM_SHOOTDOWN);
+ if (cpumask_any_but(mm_cpumask(mm), cpu) < nr_cpu_ids)
flush_tlb_others(mm_cpumask(mm), &info);
- preempt_enable();
+ put_cpu();
}
This is a note to let you know that I've just added the patch titled
x86/mm: Reduce indentation in flush_tlb_func()
to the 4.9-stable tree which can be found at:
http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=sum…
The filename of the patch is:
x86-mm-reduce-indentation-in-flush_tlb_func.patch
and it can be found in the queue-4.9 subdirectory.
If you, or anyone else, feels it should not be added to the stable tree,
please let <stable(a)vger.kernel.org> know about it.
From b3b90e5af7976e46541f5029a369c9c38c5e4cea Mon Sep 17 00:00:00 2001
From: Andy Lutomirski <luto(a)kernel.org>
Date: Mon, 22 May 2017 15:30:02 -0700
Subject: x86/mm: Reduce indentation in flush_tlb_func()
From: Andy Lutomirski <luto(a)kernel.org>
commit b3b90e5af7976e46541f5029a369c9c38c5e4cea upstream.
The leave_mm() case can just exit the function early so we don't
need to indent the entire remainder of the function.
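With the early return (see the hunk below), the lazy case reads simply:

    if (this_cpu_read(cpu_tlbstate.state) != TLBSTATE_OK) {
            leave_mm(smp_processor_id());
            return;
    }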
Signed-off-by: Andy Lutomirski <luto(a)kernel.org>
Acked-by: Kees Cook <keescook(a)chromium.org>
Cc: Andrew Morton <akpm(a)linux-foundation.org>
Cc: Borislav Petkov <bpetkov(a)suse.de>
Cc: Dave Hansen <dave.hansen(a)intel.com>
Cc: Linus Torvalds <torvalds(a)linux-foundation.org>
Cc: Mel Gorman <mgorman(a)suse.de>
Cc: Michal Hocko <mhocko(a)suse.com>
Cc: Nadav Amit <nadav.amit(a)gmail.com>
Cc: Nadav Amit <namit(a)vmware.com>
Cc: Peter Zijlstra <peterz(a)infradead.org>
Cc: Rik van Riel <riel(a)redhat.com>
Cc: Thomas Gleixner <tglx(a)linutronix.de>
Cc: linux-mm(a)kvack.org
Link: http://lkml.kernel.org/r/97901ddcc9821d7bc7b296d2918d1179f08aaf22.149549206…
Signed-off-by: Ingo Molnar <mingo(a)kernel.org>
Signed-off-by: Eduardo Valentin <eduval(a)amazon.com>
Signed-off-by: Eduardo Valentin <edubezval(a)gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
---
arch/x86/mm/tlb.c | 34 ++++++++++++++++++----------------
1 file changed, 18 insertions(+), 16 deletions(-)
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -237,24 +237,26 @@ static void flush_tlb_func(void *info)
return;
count_vm_tlb_event(NR_TLB_REMOTE_FLUSH_RECEIVED);
- if (this_cpu_read(cpu_tlbstate.state) == TLBSTATE_OK) {
- if (f->flush_end == TLB_FLUSH_ALL) {
- local_flush_tlb();
- trace_tlb_flush(TLB_REMOTE_SHOOTDOWN, TLB_FLUSH_ALL);
- } else {
- unsigned long addr;
- unsigned long nr_pages =
- (f->flush_end - f->flush_start) / PAGE_SIZE;
- addr = f->flush_start;
- while (addr < f->flush_end) {
- __flush_tlb_single(addr);
- addr += PAGE_SIZE;
- }
- trace_tlb_flush(TLB_REMOTE_SHOOTDOWN, nr_pages);
- }
- } else
+
+ if (this_cpu_read(cpu_tlbstate.state) != TLBSTATE_OK) {
leave_mm(smp_processor_id());
+ return;
+ }
+ if (f->flush_end == TLB_FLUSH_ALL) {
+ local_flush_tlb();
+ trace_tlb_flush(TLB_REMOTE_SHOOTDOWN, TLB_FLUSH_ALL);
+ } else {
+ unsigned long addr;
+ unsigned long nr_pages =
+ (f->flush_end - f->flush_start) / PAGE_SIZE;
+ addr = f->flush_start;
+ while (addr < f->flush_end) {
+ __flush_tlb_single(addr);
+ addr += PAGE_SIZE;
+ }
+ trace_tlb_flush(TLB_REMOTE_SHOOTDOWN, nr_pages);
+ }
}
void native_flush_tlb_others(const struct cpumask *cpumask,
This is a note to let you know that I've just added the patch titled
x86/mm, KVM: Teach KVM's VMX code that CR3 isn't a constant
to the 4.9-stable tree which can be found at:
http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=sum…
The filename of the patch is:
x86-mm-kvm-teach-kvm-s-vmx-code-that-cr3-isn-t-a-constant.patch
and it can be found in the queue-4.9 subdirectory.
If you, or anyone else, feels it should not be added to the stable tree,
please let <stable(a)vger.kernel.org> know about it.
From d6e41f1151feeb118eee776c09323aceb4a415d9 Mon Sep 17 00:00:00 2001
From: Andy Lutomirski <luto(a)kernel.org>
Date: Sun, 28 May 2017 10:00:17 -0700
Subject: x86/mm, KVM: Teach KVM's VMX code that CR3 isn't a constant
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
From: Andy Lutomirski <luto(a)kernel.org>
commit d6e41f1151feeb118eee776c09323aceb4a415d9 upstream.
When PCID is enabled, CR3's PCID bits can change during context
switches, so KVM won't be able to treat CR3 as a per-mm constant any
more.
I structured this like the existing CR4 handling. Under ordinary
circumstances (PCID disabled or if the current PCID and the value
that's already in the VMCS match), then we won't do an extra VMCS
write, and we'll never do an extra direct CR3 read. The overhead
should be minimal.
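Concretely, the check added to vmx_vcpu_run() (see the hunk below) mirrors
the existing CR4 pattern:

    cr3 = __get_current_cr3_fast();
    if (unlikely(cr3 != vmx->host_state.vmcs_host_cr3)) {
            vmcs_writel(HOST_CR3, cr3);
            vmx->host_state.vmcs_host_cr3 = cr3;
    }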
I disallowed using the new helper in non-atomic context because
PCID support will cause CR3 to stop being constant in non-atomic
process context.
(Frankly, it also scares me a bit that KVM ever treated CR3 as
constant, but it looks like it was okay before.)
Signed-off-by: Andy Lutomirski <luto(a)kernel.org>
Cc: Andrew Morton <akpm(a)linux-foundation.org>
Cc: Arjan van de Ven <arjan(a)linux.intel.com>
Cc: Borislav Petkov <bpetkov(a)suse.de>
Cc: Dave Hansen <dave.hansen(a)intel.com>
Cc: Linus Torvalds <torvalds(a)linux-foundation.org>
Cc: Mel Gorman <mgorman(a)suse.de>
Cc: Michal Hocko <mhocko(a)suse.com>
Cc: Nadav Amit <nadav.amit(a)gmail.com>
Cc: Nadav Amit <namit(a)vmware.com>
Cc: Paolo Bonzini <pbonzini(a)redhat.com>
Cc: Peter Zijlstra <peterz(a)infradead.org>
Cc: Radim Krčmář <rkrcmar(a)redhat.com>
Cc: Rik van Riel <riel(a)redhat.com>
Cc: Thomas Gleixner <tglx(a)linutronix.de>
Cc: kvm(a)vger.kernel.org
Cc: linux-mm(a)kvack.org
Signed-off-by: Ingo Molnar <mingo(a)kernel.org>
Signed-off-by: Eduardo Valentin <eduval(a)amazon.com>
Signed-off-by: Eduardo Valentin <edubezval(a)gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
---
arch/x86/include/asm/mmu_context.h | 19 +++++++++++++++++++
arch/x86/kvm/vmx.c | 25 +++++++++++++++++++++----
2 files changed, 40 insertions(+), 4 deletions(-)
--- a/arch/x86/include/asm/mmu_context.h
+++ b/arch/x86/include/asm/mmu_context.h
@@ -268,4 +268,23 @@ static inline bool arch_pte_access_permi
{
return __pkru_allows_pkey(pte_flags_pkey(pte_flags(pte)), write);
}
+
+/*
+ * This can be used from process context to figure out what the value of
+ * CR3 is without needing to do a (slow) read_cr3().
+ *
+ * It's intended to be used for code like KVM that sneakily changes CR3
+ * and needs to restore it. It needs to be used very carefully.
+ */
+static inline unsigned long __get_current_cr3_fast(void)
+{
+ unsigned long cr3 = __pa(this_cpu_read(cpu_tlbstate.loaded_mm)->pgd);
+
+ /* For now, be very restrictive about when this can be called. */
+ VM_WARN_ON(in_nmi() || !in_atomic());
+
+ VM_BUG_ON(cr3 != read_cr3());
+ return cr3;
+}
+
#endif /* _ASM_X86_MMU_CONTEXT_H */
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -48,6 +48,7 @@
#include <asm/kexec.h>
#include <asm/apic.h>
#include <asm/irq_remapping.h>
+#include <asm/mmu_context.h>
#include "trace.h"
#include "pmu.h"
@@ -572,6 +573,7 @@ struct vcpu_vmx {
int gs_ldt_reload_needed;
int fs_reload_needed;
u64 msr_host_bndcfgs;
+ unsigned long vmcs_host_cr3; /* May not match real cr3 */
unsigned long vmcs_host_cr4; /* May not match real cr4 */
} host_state;
struct {
@@ -4857,10 +4859,19 @@ static void vmx_set_constant_host_state(
u32 low32, high32;
unsigned long tmpl;
struct desc_ptr dt;
- unsigned long cr4;
+ unsigned long cr0, cr3, cr4;
- vmcs_writel(HOST_CR0, read_cr0() & ~X86_CR0_TS); /* 22.2.3 */
- vmcs_writel(HOST_CR3, read_cr3()); /* 22.2.3 FIXME: shadow tables */
+ cr0 = read_cr0();
+ WARN_ON(cr0 & X86_CR0_TS);
+ vmcs_writel(HOST_CR0, cr0); /* 22.2.3 */
+
+ /*
+ * Save the most likely value for this task's CR3 in the VMCS.
+ * We can't use __get_current_cr3_fast() because we're not atomic.
+ */
+ cr3 = read_cr3();
+ vmcs_writel(HOST_CR3, cr3); /* 22.2.3 FIXME: shadow tables */
+ vmx->host_state.vmcs_host_cr3 = cr3;
/* Save the most likely value for this task's CR4 in the VMCS. */
cr4 = cr4_read_shadow();
@@ -8836,7 +8847,7 @@ void vmx_arm_hv_timer(struct kvm_vcpu *v
static void __noclone vmx_vcpu_run(struct kvm_vcpu *vcpu)
{
struct vcpu_vmx *vmx = to_vmx(vcpu);
- unsigned long debugctlmsr, cr4;
+ unsigned long debugctlmsr, cr3, cr4;
/* Record the guest's net vcpu time for enforced NMI injections. */
if (unlikely(!cpu_has_virtual_nmis() && vmx->soft_vnmi_blocked))
@@ -8862,6 +8873,12 @@ static void __noclone vmx_vcpu_run(struc
if (test_bit(VCPU_REGS_RIP, (unsigned long *)&vcpu->arch.regs_dirty))
vmcs_writel(GUEST_RIP, vcpu->arch.regs[VCPU_REGS_RIP]);
+ cr3 = __get_current_cr3_fast();
+ if (unlikely(cr3 != vmx->host_state.vmcs_host_cr3)) {
+ vmcs_writel(HOST_CR3, cr3);
+ vmx->host_state.vmcs_host_cr3 = cr3;
+ }
+
cr4 = cr4_read_shadow();
if (unlikely(cr4 != vmx->host_state.vmcs_host_cr4)) {
vmcs_writel(HOST_CR4, cr4);
This is a note to let you know that I've just added the patch titled
x86/mm: Change the leave_mm() condition for local TLB flushes
to the 4.9-stable tree which can be found at:
http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=sum…
The filename of the patch is:
x86-mm-change-the-leave_mm-condition-for-local-tlb-flushes.patch
and it can be found in the queue-4.9 subdirectory.
If you, or anyone else, feels it should not be added to the stable tree,
please let <stable(a)vger.kernel.org> know about it.
From 59f537c1dea04287165bb11407921e095250dc80 Mon Sep 17 00:00:00 2001
From: Andy Lutomirski <luto(a)kernel.org>
Date: Sun, 28 May 2017 10:00:11 -0700
Subject: x86/mm: Change the leave_mm() condition for local TLB flushes
From: Andy Lutomirski <luto(a)kernel.org>
commit 59f537c1dea04287165bb11407921e095250dc80 upstream.
On a remote TLB flush, we leave_mm() if we're TLBSTATE_LAZY. For a
local flush_tlb_mm_range(), we leave_mm() if !current->mm. These
are approximately the same condition -- the scheduler sets lazy TLB
mode when switching to a thread with no mm.
I'm about to merge the local and remote flush code, but for ease of
verifying and bisecting the patch, I want the local and remote flush
behavior to match first. This patch changes the local code to match
the remote code.
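In code, the local check changes from "if (!current->mm)" to the same test
the remote path already uses (this is the whole one-line hunk below):

    if (this_cpu_read(cpu_tlbstate.state) != TLBSTATE_OK)
            leave_mm(smp_processor_id());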
Signed-off-by: Andy Lutomirski <luto(a)kernel.org>
Acked-by: Rik van Riel <riel(a)redhat.com>
Cc: Andrew Morton <akpm(a)linux-foundation.org>
Cc: Arjan van de Ven <arjan(a)linux.intel.com>
Cc: Borislav Petkov <bpetkov(a)suse.de>
Cc: Dave Hansen <dave.hansen(a)intel.com>
Cc: Linus Torvalds <torvalds(a)linux-foundation.org>
Cc: Mel Gorman <mgorman(a)suse.de>
Cc: Michal Hocko <mhocko(a)suse.com>
Cc: Nadav Amit <nadav.amit(a)gmail.com>
Cc: Nadav Amit <namit(a)vmware.com>
Cc: Peter Zijlstra <peterz(a)infradead.org>
Cc: Rik van Riel <riel(a)redhat.com>
Cc: Thomas Gleixner <tglx(a)linutronix.de>
Cc: linux-mm(a)kvack.org
Signed-off-by: Ingo Molnar <mingo(a)kernel.org>
Signed-off-by: Eduardo Valentin <eduval(a)amazon.com>
Signed-off-by: Eduardo Valentin <edubezval(a)gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
---
arch/x86/mm/tlb.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -328,7 +328,7 @@ void flush_tlb_mm_range(struct mm_struct
goto out;
}
- if (!current->mm) {
+ if (this_cpu_read(cpu_tlbstate.state) != TLBSTATE_OK) {
leave_mm(smp_processor_id());
/* Synchronize with switch_mm. */
This is a note to let you know that I've just added the patch titled
x86/mm: Be more consistent wrt PAGE_SHIFT vs PAGE_SIZE in tlb flush code
to the 4.9-stable tree which can be found at:
http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=sum…
The filename of the patch is:
x86-mm-be-more-consistent-wrt-page_shift-vs-page_size-in-tlb-flush-code.patch
and it can be found in the queue-4.9 subdirectory.
If you, or anyone else, feels it should not be added to the stable tree,
please let <stable(a)vger.kernel.org> know about it.
From be4ffc0d787fafb22b89a2f29e71fea3b119205e Mon Sep 17 00:00:00 2001
From: Andy Lutomirski <luto(a)kernel.org>
Date: Sun, 28 May 2017 10:00:16 -0700
Subject: x86/mm: Be more consistent wrt PAGE_SHIFT vs PAGE_SIZE in tlb flush code
From: Andy Lutomirski <luto(a)kernel.org>
commit be4ffc0d787fafb22b89a2f29e71fea3b119205e upstream.
Nadav pointed out that some code used PAGE_SIZE and other code used
PAGE_SHIFT. Use PAGE_SHIFT instead of multiplying or dividing by
PAGE_SIZE.
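For example (both taken from the hunks below), a page count becomes a shift
instead of a division, and the ceiling comparison becomes a shift instead of
a multiplication:

    nr_pages = (f->end - f->start) >> PAGE_SHIFT;

    (end - start) > tlb_single_page_flush_ceiling << PAGE_SHIFT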
Requested-by: Nadav Amit <nadav.amit(a)gmail.com>
Signed-off-by: Andy Lutomirski <luto(a)kernel.org>
Cc: Andrew Morton <akpm(a)linux-foundation.org>
Cc: Arjan van de Ven <arjan(a)linux.intel.com>
Cc: Borislav Petkov <bpetkov(a)suse.de>
Cc: Dave Hansen <dave.hansen(a)intel.com>
Cc: Linus Torvalds <torvalds(a)linux-foundation.org>
Cc: Mel Gorman <mgorman(a)suse.de>
Cc: Michal Hocko <mhocko(a)suse.com>
Cc: Nadav Amit <nadav.amit(a)gmail.com>
Cc: Nadav Amit <namit(a)vmware.com>
Cc: Peter Zijlstra <peterz(a)infradead.org>
Cc: Rik van Riel <riel(a)redhat.com>
Cc: Thomas Gleixner <tglx(a)linutronix.de>
Cc: linux-mm(a)kvack.org
Signed-off-by: Ingo Molnar <mingo(a)kernel.org>
Signed-off-by: Eduardo Valentin <eduval(a)amazon.com>
Signed-off-by: Eduardo Valentin <edubezval(a)gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
---
arch/x86/mm/tlb.c | 5 ++---
1 file changed, 2 insertions(+), 3 deletions(-)
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -220,8 +220,7 @@ static void flush_tlb_func_common(const
trace_tlb_flush(reason, TLB_FLUSH_ALL);
} else {
unsigned long addr;
- unsigned long nr_pages =
- (f->end - f->start) / PAGE_SIZE;
+ unsigned long nr_pages = (f->end - f->start) >> PAGE_SHIFT;
addr = f->start;
while (addr < f->end) {
__flush_tlb_single(addr);
@@ -374,7 +373,7 @@ void flush_tlb_kernel_range(unsigned lon
/* Balance as user space task's flush, a bit conservative */
if (end == TLB_FLUSH_ALL ||
- (end - start) > tlb_single_page_flush_ceiling * PAGE_SIZE) {
+ (end - start) > tlb_single_page_flush_ceiling << PAGE_SHIFT) {
on_each_cpu(do_flush_tlb_all, NULL, 1);
} else {
struct flush_tlb_info info;
This is a note to let you know that I've just added the patch titled
x86/kvm/vmx: Simplify segment_base()
to the 4.9-stable tree which can be found at:
http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=sum…
The filename of the patch is:
x86-kvm-vmx-simplify-segment_base.patch
and it can be found in the queue-4.9 subdirectory.
If you, or anyone else, feels it should not be added to the stable tree,
please let <stable(a)vger.kernel.org> know about it.
From 8c2e41f7ae1234c192ef497472ad306227c77c03 Mon Sep 17 00:00:00 2001
From: Andy Lutomirski <luto(a)kernel.org>
Date: Mon, 20 Feb 2017 08:56:12 -0800
Subject: x86/kvm/vmx: Simplify segment_base()
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
From: Andy Lutomirski <luto(a)kernel.org>
commit 8c2e41f7ae1234c192ef497472ad306227c77c03 upstream.
Use actual pointer types for pointers (instead of unsigned long) and
replace hardcoded constants with the appropriate self-documenting
macros.
The function is still a bit messy, but this seems a lot better than
before to me.
This is mostly borrowed from a patch by Thomas Garnier.
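The core of it (full hunk below) indexes the descriptor table directly
rather than doing arithmetic on an unsigned long; the LDT case just finds
the table first by recursing via segment_base(kvm_read_ldt()):

    table = (struct desc_struct *)gdt->address;
    v = get_desc_base(&table[selector >> 3]);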
Cc: Thomas Garnier <thgarnie(a)google.com>
Cc: Jim Mattson <jmattson(a)google.com>
Cc: Radim Krčmář <rkrcmar(a)redhat.com>
Cc: Paolo Bonzini <pbonzini(a)redhat.com>
Signed-off-by: Andy Lutomirski <luto(a)kernel.org>
Signed-off-by: Paolo Bonzini <pbonzini(a)redhat.com>
Signed-off-by: Eduardo Valentin <eduval(a)amazon.com>
Signed-off-by: Eduardo Valentin <edubezval(a)gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
---
arch/x86/kvm/vmx.c | 19 +++++++------------
1 file changed, 7 insertions(+), 12 deletions(-)
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -2030,28 +2030,23 @@ static unsigned long segment_base(u16 se
{
struct desc_ptr *gdt = this_cpu_ptr(&host_gdt);
struct desc_struct *d;
- unsigned long table_base;
+ struct desc_struct *table;
unsigned long v;
- if (!(selector & ~3))
+ if (!(selector & ~SEGMENT_RPL_MASK))
return 0;
- table_base = gdt->address;
+ table = (struct desc_struct *)gdt->address;
- if (selector & 4) { /* from ldt */
+ if ((selector & SEGMENT_TI_MASK) == SEGMENT_LDT) {
u16 ldt_selector = kvm_read_ldt();
- if (!(ldt_selector & ~3))
+ if (!(ldt_selector & ~SEGMENT_RPL_MASK))
return 0;
- table_base = segment_base(ldt_selector);
+ table = (struct desc_struct *)segment_base(ldt_selector);
}
- d = (struct desc_struct *)(table_base + (selector & ~7));
- v = get_desc_base(d);
-#ifdef CONFIG_X86_64
- if (d->s == 0 && (d->type == 2 || d->type == 9 || d->type == 11))
- v |= ((unsigned long)((struct ldttss_desc64 *)d)->base3) << 32;
-#endif
+ v = get_desc_base(&table[selector >> 3]);
return v;
}
This is a note to let you know that I've just added the patch titled
x86/kvm/vmx: remove unused variable in segment_base()
to the 4.9-stable tree which can be found at:
http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=sum…
The filename of the patch is:
x86-kvm-vmx-remove-unused-variable-in-segment_base.patch
and it can be found in the queue-4.9 subdirectory.
If you, or anyone else, feels it should not be added to the stable tree,
please let <stable(a)vger.kernel.org> know about it.
From 0fce546f9f07b94ccc9de09cf48d35e18946d2fa Mon Sep 17 00:00:00 2001
From: Jérémy Lefaure <jeremy.lefaure(a)lse.epita.fr>
Date: Sat, 25 Feb 2017 17:46:53 -0500
Subject: x86/kvm/vmx: remove unused variable in segment_base()
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
From: Jérémy Lefaure <jeremy.lefaure(a)lse.epita.fr>
commit 0fce546f9f07b94ccc9de09cf48d35e18946d2fa upstream.
The pointer 'struct desc_struct *d' is unused since commit 8c2e41f7ae12
("x86/kvm/vmx: Simplify segment_base()") so let's remove it.
Signed-off-by: Jérémy Lefaure <jeremy.lefaure(a)lse.epita.fr>
Reviewed-by: David Hildenbrand <david(a)redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar(a)redhat.com>
Signed-off-by: Eduardo Valentin <eduval(a)amazon.com>
Signed-off-by: Eduardo Valentin <edubezval(a)gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
---
arch/x86/kvm/vmx.c | 1 -
1 file changed, 1 deletion(-)
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -2016,7 +2016,6 @@ static bool update_transition_efer(struc
static unsigned long segment_base(u16 selector)
{
struct desc_ptr *gdt = this_cpu_ptr(&host_gdt);
- struct desc_struct *d;
struct desc_struct *table;
unsigned long v;
Patches currently in stable-queue which might be from jeremy.lefaure(a)lse.epita.fr are
queue-4.9/x86-kvm-vmx-remove-unused-variable-in-segment_base.patch
This is a note to let you know that I've just added the patch titled
x86/kvm/vmx: Defer TR reload after VM exit
to the 4.9-stable tree which can be found at:
http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=sum…
The filename of the patch is:
x86-kvm-vmx-defer-tr-reload-after-vm-exit.patch
and it can be found in the queue-4.9 subdirectory.
If you, or anyone else, feels it should not be added to the stable tree,
please let <stable(a)vger.kernel.org> know about it.
From b7ffc44d5b2ea163899d09289ca7743d5c32e926 Mon Sep 17 00:00:00 2001
From: Andy Lutomirski <luto(a)kernel.org>
Date: Mon, 20 Feb 2017 08:56:14 -0800
Subject: x86/kvm/vmx: Defer TR reload after VM exit
From: Andy Lutomirski <luto(a)kernel.org>
commit b7ffc44d5b2ea163899d09289ca7743d5c32e926 upstream.
Intel's VMX is daft and resets the hidden TSS limit register to 0x67
on VMX reload, and the 0x67 is not configurable. KVM currently
reloads TR using the LTR instruction on every exit, but this is quite
slow because LTR is serializing.
The 0x67 limit is entirely harmless unless ioperm() is in use, so
defer the reload until a task using ioperm() is actually running.
Here's some poorly done benchmarking using kvm-unit-tests:
Before:
cpuid 1313
vmcall 1195
mov_from_cr8 11
mov_to_cr8 17
inl_from_pmtimer 6770
inl_from_qemu 6856
inl_from_kernel 2435
outl_to_kernel 1402
After:
cpuid 1291
vmcall 1181
mov_from_cr8 11
mov_to_cr8 16
inl_from_pmtimer 6457
inl_from_qemu 6209
inl_from_kernel 2339
outl_to_kernel 1391
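To make the intended split of work easier to follow, here is a minimal sketch
(not part of the patch; the wrapper names are illustrative, it assumes the
helpers added to <asm/desc.h> below, and preemption is already disabled by the
caller):
    /* Host-state restore after a VM exit: cheap, no LTR unless needed. */
    static void example_after_vm_exit(void)
    {
            /*
             * Either reload TR right away (current task uses ioperm())
             * or just mark the cached TSS limit stale via need_tr_refresh.
             */
            invalidate_tss_limit();
    }
    /* Context switch to a task with TIF_IO_BITMAP set. */
    static void example_switch_to_ioperm_task(void)
    {
            /* The slow, serializing LTR runs here, only if the flag is set. */
            refresh_TR();
    }
In other words, the serializing LTR is paid only by tasks that actually
depend on the TSS limit being correct.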
Signed-off-by: Andy Lutomirski <luto(a)kernel.org>
[Force-reload TR in invalidate_tss_limit. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini(a)redhat.com>
Signed-off-by: Eduardo Valentin <eduval(a)amazon.com>
Signed-off-by: Eduardo Valentin <edubezval(a)gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
---
arch/x86/include/asm/desc.h | 48 ++++++++++++++++++++++++++++++++++++++++++++
arch/x86/kernel/ioport.c | 5 ++++
arch/x86/kernel/process.c | 10 +++++++++
arch/x86/kvm/vmx.c | 23 ++++++++-------------
4 files changed, 72 insertions(+), 14 deletions(-)
--- a/arch/x86/include/asm/desc.h
+++ b/arch/x86/include/asm/desc.h
@@ -213,6 +213,54 @@ static inline void native_load_tr_desc(v
asm volatile("ltr %w0"::"q" (GDT_ENTRY_TSS*8));
}
+static inline void force_reload_TR(void)
+{
+ struct desc_struct *d = get_cpu_gdt_table(smp_processor_id());
+ tss_desc tss;
+
+ memcpy(&tss, &d[GDT_ENTRY_TSS], sizeof(tss_desc));
+
+ /*
+ * LTR requires an available TSS, and the TSS is currently
+ * busy. Make it be available so that LTR will work.
+ */
+ tss.type = DESC_TSS;
+ write_gdt_entry(d, GDT_ENTRY_TSS, &tss, DESC_TSS);
+
+ load_TR_desc();
+}
+
+DECLARE_PER_CPU(bool, need_tr_refresh);
+
+static inline void refresh_TR(void)
+{
+ WARN_ON(preemptible());
+
+ if (unlikely(this_cpu_read(need_tr_refresh))) {
+ force_reload_TR();
+ this_cpu_write(need_tr_refresh, false);
+ }
+}
+
+/*
+ * If you do something evil that corrupts the cached TSS limit (I'm looking
+ * at you, VMX exits), call this function.
+ *
+ * The optimization here is that the TSS limit only matters for Linux if the
+ * IO bitmap is in use. If the TSS limit gets forced to its minimum value,
+ * everything works except that IO bitmap will be ignored and all CPL 3 IO
+ * instructions will #GP, which is exactly what we want for normal tasks.
+ */
+static inline void invalidate_tss_limit(void)
+{
+ WARN_ON(preemptible());
+
+ if (unlikely(test_thread_flag(TIF_IO_BITMAP)))
+ force_reload_TR();
+ else
+ this_cpu_write(need_tr_refresh, true);
+}
+
static inline void native_load_gdt(const struct desc_ptr *dtr)
{
asm volatile("lgdt %0"::"m" (*dtr));
--- a/arch/x86/kernel/ioport.c
+++ b/arch/x86/kernel/ioport.c
@@ -16,6 +16,7 @@
#include <linux/syscalls.h>
#include <linux/bitmap.h>
#include <asm/syscalls.h>
+#include <asm/desc.h>
/*
* this changes the io permissions bitmap in the current task.
@@ -45,6 +46,10 @@ asmlinkage long sys_ioperm(unsigned long
memset(bitmap, 0xff, IO_BITMAP_BYTES);
t->io_bitmap_ptr = bitmap;
set_thread_flag(TIF_IO_BITMAP);
+
+ preempt_disable();
+ refresh_TR();
+ preempt_enable();
}
/*
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -33,6 +33,7 @@
#include <asm/mce.h>
#include <asm/vm86.h>
#include <asm/switch_to.h>
+#include <asm/desc.h>
/*
* per-CPU TSS segments. Threads are completely 'soft' on Linux,
@@ -82,6 +83,9 @@ void idle_notifier_unregister(struct not
EXPORT_SYMBOL_GPL(idle_notifier_unregister);
#endif
+DEFINE_PER_CPU(bool, need_tr_refresh);
+EXPORT_PER_CPU_SYMBOL_GPL(need_tr_refresh);
+
/*
* this gets called so that we can store lazy state into memory and copy the
* current task into the new thread.
@@ -227,6 +231,12 @@ void __switch_to_xtra(struct task_struct
*/
memcpy(tss->io_bitmap, next->io_bitmap_ptr,
max(prev->io_bitmap_max, next->io_bitmap_max));
+
+ /*
+ * Make sure that the TSS limit is correct for the CPU
+ * to notice the IO bitmap.
+ */
+ refresh_TR();
} else if (test_tsk_thread_flag(prev_p, TIF_IO_BITMAP)) {
/*
* Clear any possible leftover bits:
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -1959,19 +1959,6 @@ static void add_atomic_switch_msr(struct
m->host[i].value = host_val;
}
-static void reload_tss(void)
-{
- /*
- * VT restores TR but not its size. Useless.
- */
- struct desc_ptr *gdt = this_cpu_ptr(&host_gdt);
- struct desc_struct *descs;
-
- descs = (void *)gdt->address;
- descs[GDT_ENTRY_TSS].type = 9; /* available TSS */
- load_TR_desc();
-}
-
static bool update_transition_efer(struct vcpu_vmx *vmx, int efer_offset)
{
u64 guest_efer = vmx->vcpu.arch.efer;
@@ -2141,7 +2128,7 @@ static void __vmx_load_host_state(struct
loadsegment(es, vmx->host_state.es_sel);
}
#endif
- reload_tss();
+ invalidate_tss_limit();
#ifdef CONFIG_X86_64
wrmsrl(MSR_KERNEL_GS_BASE, vmx->msr_host_kernel_gs_base);
#endif
@@ -2265,6 +2252,14 @@ static void vmx_vcpu_load(struct kvm_vcp
vmcs_writel(HOST_TR_BASE, kvm_read_tr_base()); /* 22.2.4 */
vmcs_writel(HOST_GDTR_BASE, gdt->address); /* 22.2.4 */
+ /*
+ * VM exits change the host TR limit to 0x67 after a VM
+ * exit. This is okay, since 0x67 covers everything except
+ * the IO bitmap and we have code to handle the IO bitmap
+ * being lost after a VM exit.
+ */
+ BUILD_BUG_ON(IO_BITMAP_OFFSET - 1 != 0x67);
+
rdmsrl(MSR_IA32_SYSENTER_ESP, sysenter_esp);
vmcs_writel(HOST_IA32_SYSENTER_ESP, sysenter_esp); /* 22.2.3 */
Patches currently in stable-queue which might be from luto(a)kernel.org are
queue-4.9/x86-mm-refactor-flush_tlb_mm_range-to-merge-local-and-remote-cases.patch
queue-4.9/x86-mm-pass-flush_tlb_info-to-flush_tlb_others-etc.patch
queue-4.9/x86-mm-rework-lazy-tlb-to-track-the-actual-loaded-mm.patch
queue-4.9/x86-mm-kvm-teach-kvm-s-vmx-code-that-cr3-isn-t-a-constant.patch
queue-4.9/x86-mm-use-new-merged-flush-logic-in-arch_tlbbatch_flush.patch
queue-4.9/x86-kvm-vmx-simplify-segment_base.patch
queue-4.9/x86-entry-unwind-create-stack-frames-for-saved-interrupt-registers.patch
queue-4.9/x86-mm-reduce-indentation-in-flush_tlb_func.patch
queue-4.9/x86-mm-remove-the-up-asm-tlbflush.h-code-always-use-the-formerly-smp-code.patch
queue-4.9/x86-mm-reimplement-flush_tlb_page-using-flush_tlb_mm_range.patch
queue-4.9/mm-x86-mm-make-the-batched-unmap-tlb-flush-api-more-generic.patch
queue-4.9/x86-kvm-vmx-defer-tr-reload-after-vm-exit.patch
queue-4.9/x86-mm-change-the-leave_mm-condition-for-local-tlb-flushes.patch
queue-4.9/x86-mm-be-more-consistent-wrt-page_shift-vs-page_size-in-tlb-flush-code.patch
This is a note to let you know that I've just added the patch titled
x86/entry/unwind: Create stack frames for saved interrupt registers
to the 4.9-stable tree which can be found at:
http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=sum…
The filename of the patch is:
x86-entry-unwind-create-stack-frames-for-saved-interrupt-registers.patch
and it can be found in the queue-4.9 subdirectory.
If you, or anyone else, feels it should not be added to the stable tree,
please let <stable(a)vger.kernel.org> know about it.
>From 946c191161cef10c667b5ee3179db1714fa5b7c0 Mon Sep 17 00:00:00 2001
From: Josh Poimboeuf <jpoimboe(a)redhat.com>
Date: Thu, 20 Oct 2016 11:34:40 -0500
Subject: x86/entry/unwind: Create stack frames for saved interrupt registers
From: Josh Poimboeuf <jpoimboe(a)redhat.com>
commit 946c191161cef10c667b5ee3179db1714fa5b7c0 upstream.
With frame pointers, when a task is interrupted, its stack is no longer
completely reliable because the function could have been interrupted
before it had a chance to save the previous frame pointer on the stack.
So the caller of the interrupted function could get skipped by a stack
trace.
This is problematic for live patching, which needs to know whether a
stack trace of a sleeping task can be relied upon. There's currently no
way to detect if a sleeping task was interrupted by a page fault
exception or preemption before it went to sleep.
Another issue is that when dumping the stack of an interrupted task, the
unwinder has no way of knowing where the saved pt_regs registers are, so
it can't print them.
This solves those issues by encoding the pt_regs pointer in the frame
pointer on entry from an interrupt or an exception.
This patch also updates the unwinder to be able to decode it, because
otherwise the unwinder would be broken by this change.
Note that this causes a change in the behavior of the unwinder: each
instance of a pt_regs on the stack is now considered a "frame". So
callers of unwind_get_return_address() will now get an occasional
'regs->ip' address that would have previously been skipped over.
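The encoding itself is tiny; as a rough illustration only (the real code is
the ENCODE_FRAME_POINTER macros and decode_frame_pointer() in the diff below,
with struct pt_regs coming from <asm/ptrace.h>):
    static inline unsigned long encode_regs_fp(struct pt_regs *regs)
    {
            /* pt_regs is word-aligned, so bit 0 is free to act as a tag. */
            return (unsigned long)regs | 0x1;
    }
    static inline struct pt_regs *decode_regs_fp(unsigned long bp)
    {
            if (!(bp & 0x1))
                    return NULL;    /* ordinary frame pointer */
            return (struct pt_regs *)(bp & ~0x1UL);
    }
A set low bit can never be a valid (word-aligned) stack address, which is
what lets the unwinder tell the two cases apart.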
Suggested-by: Andy Lutomirski <luto(a)amacapital.net>
Signed-off-by: Josh Poimboeuf <jpoimboe(a)redhat.com>
Cc: Andy Lutomirski <luto(a)kernel.org>
Cc: Borislav Petkov <bp(a)alien8.de>
Cc: Brian Gerst <brgerst(a)gmail.com>
Cc: Denys Vlasenko <dvlasenk(a)redhat.com>
Cc: H. Peter Anvin <hpa(a)zytor.com>
Cc: Linus Torvalds <torvalds(a)linux-foundation.org>
Cc: Peter Zijlstra <peterz(a)infradead.org>
Cc: Thomas Gleixner <tglx(a)linutronix.de>
Link: http://lkml.kernel.org/r/8b9f84a21e39d249049e0547b559ff8da0df0988.147697374…
Signed-off-by: Ingo Molnar <mingo(a)kernel.org>
Signed-off-by: Eduardo Valentin <eduval(a)amazon.com>
Signed-off-by: Eduardo Valentin <edubezval(a)gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
---
arch/x86/entry/calling.h | 20 ++++++++++
arch/x86/entry/entry_32.S | 33 +++++++++++++++--
arch/x86/entry/entry_64.S | 10 +++--
arch/x86/include/asm/unwind.h | 16 ++++++++
arch/x86/kernel/unwind_frame.c | 76 ++++++++++++++++++++++++++++++++++++-----
5 files changed, 139 insertions(+), 16 deletions(-)
--- a/arch/x86/entry/calling.h
+++ b/arch/x86/entry/calling.h
@@ -201,6 +201,26 @@ For 32-bit we have the following convent
.byte 0xf1
.endm
+/*
+ * This is a sneaky trick to help the unwinder find pt_regs on the stack. The
+ * frame pointer is replaced with an encoded pointer to pt_regs. The encoding
+ * is just setting the LSB, which makes it an invalid stack address and is also
+ * a signal to the unwinder that it's a pt_regs pointer in disguise.
+ *
+ * NOTE: This macro must be used *after* SAVE_EXTRA_REGS because it corrupts
+ * the original rbp.
+ */
+.macro ENCODE_FRAME_POINTER ptregs_offset=0
+#ifdef CONFIG_FRAME_POINTER
+ .if \ptregs_offset
+ leaq \ptregs_offset(%rsp), %rbp
+ .else
+ mov %rsp, %rbp
+ .endif
+ orq $0x1, %rbp
+#endif
+.endm
+
#endif /* CONFIG_X86_64 */
/*
--- a/arch/x86/entry/entry_32.S
+++ b/arch/x86/entry/entry_32.S
@@ -175,6 +175,22 @@
SET_KERNEL_GS %edx
.endm
+/*
+ * This is a sneaky trick to help the unwinder find pt_regs on the stack. The
+ * frame pointer is replaced with an encoded pointer to pt_regs. The encoding
+ * is just setting the LSB, which makes it an invalid stack address and is also
+ * a signal to the unwinder that it's a pt_regs pointer in disguise.
+ *
+ * NOTE: This macro must be used *after* SAVE_ALL because it corrupts the
+ * original rbp.
+ */
+.macro ENCODE_FRAME_POINTER
+#ifdef CONFIG_FRAME_POINTER
+ mov %esp, %ebp
+ orl $0x1, %ebp
+#endif
+.endm
+
.macro RESTORE_INT_REGS
popl %ebx
popl %ecx
@@ -624,6 +640,7 @@ common_interrupt:
ASM_CLAC
addl $-0x80, (%esp) /* Adjust vector into the [-256, -1] range */
SAVE_ALL
+ ENCODE_FRAME_POINTER
TRACE_IRQS_OFF
movl %esp, %eax
call do_IRQ
@@ -635,6 +652,7 @@ ENTRY(name) \
ASM_CLAC; \
pushl $~(nr); \
SAVE_ALL; \
+ ENCODE_FRAME_POINTER; \
TRACE_IRQS_OFF \
movl %esp, %eax; \
call fn; \
@@ -769,6 +787,7 @@ END(spurious_interrupt_bug)
ENTRY(xen_hypervisor_callback)
pushl $-1 /* orig_ax = -1 => not a system call */
SAVE_ALL
+ ENCODE_FRAME_POINTER
TRACE_IRQS_OFF
/*
@@ -823,6 +842,7 @@ ENTRY(xen_failsafe_callback)
jmp iret_exc
5: pushl $-1 /* orig_ax = -1 => not a system call */
SAVE_ALL
+ ENCODE_FRAME_POINTER
jmp ret_from_exception
.section .fixup, "ax"
@@ -1047,6 +1067,7 @@ error_code:
pushl %edx
pushl %ecx
pushl %ebx
+ ENCODE_FRAME_POINTER
cld
movl $(__KERNEL_PERCPU), %ecx
movl %ecx, %fs
@@ -1079,6 +1100,7 @@ ENTRY(debug)
ASM_CLAC
pushl $-1 # mark this as an int
SAVE_ALL
+ ENCODE_FRAME_POINTER
xorl %edx, %edx # error code 0
movl %esp, %eax # pt_regs pointer
@@ -1094,11 +1116,11 @@ ENTRY(debug)
.Ldebug_from_sysenter_stack:
/* We're on the SYSENTER stack. Switch off. */
- movl %esp, %ebp
+ movl %esp, %ebx
movl PER_CPU_VAR(cpu_current_top_of_stack), %esp
TRACE_IRQS_OFF
call do_debug
- movl %ebp, %esp
+ movl %ebx, %esp
jmp ret_from_exception
END(debug)
@@ -1121,6 +1143,7 @@ ENTRY(nmi)
pushl %eax # pt_regs->orig_ax
SAVE_ALL
+ ENCODE_FRAME_POINTER
xorl %edx, %edx # zero error code
movl %esp, %eax # pt_regs pointer
@@ -1139,10 +1162,10 @@ ENTRY(nmi)
* We're on the SYSENTER stack. Switch off. No one (not even debug)
* is using the thread stack right now, so it's safe for us to use it.
*/
- movl %esp, %ebp
+ movl %esp, %ebx
movl PER_CPU_VAR(cpu_current_top_of_stack), %esp
call do_nmi
- movl %ebp, %esp
+ movl %ebx, %esp
jmp restore_all_notrace
#ifdef CONFIG_X86_ESPFIX32
@@ -1159,6 +1182,7 @@ nmi_espfix_stack:
.endr
pushl %eax
SAVE_ALL
+ ENCODE_FRAME_POINTER
FIXUP_ESPFIX_STACK # %eax == %esp
xorl %edx, %edx # zero error code
call do_nmi
@@ -1172,6 +1196,7 @@ ENTRY(int3)
ASM_CLAC
pushl $-1 # mark this as an int
SAVE_ALL
+ ENCODE_FRAME_POINTER
TRACE_IRQS_OFF
xorl %edx, %edx # zero error code
movl %esp, %eax # pt_regs pointer
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -469,6 +469,7 @@ END(irq_entries_start)
ALLOC_PT_GPREGS_ON_STACK
SAVE_C_REGS
SAVE_EXTRA_REGS
+ ENCODE_FRAME_POINTER
testb $3, CS(%rsp)
jz 1f
@@ -985,6 +986,7 @@ ENTRY(xen_failsafe_callback)
ALLOC_PT_GPREGS_ON_STACK
SAVE_C_REGS
SAVE_EXTRA_REGS
+ ENCODE_FRAME_POINTER
jmp error_exit
END(xen_failsafe_callback)
@@ -1028,6 +1030,7 @@ ENTRY(paranoid_entry)
cld
SAVE_C_REGS 8
SAVE_EXTRA_REGS 8
+ ENCODE_FRAME_POINTER 8
movl $1, %ebx
movl $MSR_GS_BASE, %ecx
rdmsr
@@ -1075,6 +1078,7 @@ ENTRY(error_entry)
cld
SAVE_C_REGS 8
SAVE_EXTRA_REGS 8
+ ENCODE_FRAME_POINTER 8
xorl %ebx, %ebx
testb $3, CS+8(%rsp)
jz .Lerror_kernelspace
@@ -1259,6 +1263,7 @@ ENTRY(nmi)
pushq %r13 /* pt_regs->r13 */
pushq %r14 /* pt_regs->r14 */
pushq %r15 /* pt_regs->r15 */
+ ENCODE_FRAME_POINTER
/*
* At this point we no longer need to worry about stack damage
@@ -1272,11 +1277,10 @@ ENTRY(nmi)
/*
* Return back to user mode. We must *not* do the normal exit
- * work, because we don't want to enable interrupts. Fortunately,
- * do_nmi doesn't modify pt_regs.
+ * work, because we don't want to enable interrupts.
*/
SWAPGS
- jmp restore_c_regs_and_iret
+ jmp restore_regs_and_iret
.Lnmi_from_kernel:
/*
--- a/arch/x86/include/asm/unwind.h
+++ b/arch/x86/include/asm/unwind.h
@@ -13,6 +13,7 @@ struct unwind_state {
int graph_idx;
#ifdef CONFIG_FRAME_POINTER
unsigned long *bp;
+ struct pt_regs *regs;
#else
unsigned long *sp;
#endif
@@ -47,7 +48,15 @@ unsigned long *unwind_get_return_address
if (unwind_done(state))
return NULL;
- return state->bp + 1;
+ return state->regs ? &state->regs->ip : state->bp + 1;
+}
+
+static inline struct pt_regs *unwind_get_entry_regs(struct unwind_state *state)
+{
+ if (unwind_done(state))
+ return NULL;
+
+ return state->regs;
}
#else /* !CONFIG_FRAME_POINTER */
@@ -57,6 +66,11 @@ unsigned long *unwind_get_return_address
{
return NULL;
}
+
+static inline struct pt_regs *unwind_get_entry_regs(struct unwind_state *state)
+{
+ return NULL;
+}
#endif /* CONFIG_FRAME_POINTER */
--- a/arch/x86/kernel/unwind_frame.c
+++ b/arch/x86/kernel/unwind_frame.c
@@ -14,6 +14,9 @@ unsigned long unwind_get_return_address(
if (unwind_done(state))
return 0;
+ if (state->regs && user_mode(state->regs))
+ return 0;
+
addr = ftrace_graph_ret_addr(state->task, &state->graph_idx, *addr_p,
addr_p);
@@ -21,6 +24,20 @@ unsigned long unwind_get_return_address(
}
EXPORT_SYMBOL_GPL(unwind_get_return_address);
+/*
+ * This determines if the frame pointer actually contains an encoded pointer to
+ * pt_regs on the stack. See ENCODE_FRAME_POINTER.
+ */
+static struct pt_regs *decode_frame_pointer(unsigned long *bp)
+{
+ unsigned long regs = (unsigned long)bp;
+
+ if (!(regs & 0x1))
+ return NULL;
+
+ return (struct pt_regs *)(regs & ~0x1);
+}
+
static bool update_stack_state(struct unwind_state *state, void *addr,
size_t len)
{
@@ -43,26 +60,59 @@ static bool update_stack_state(struct un
bool unwind_next_frame(struct unwind_state *state)
{
- unsigned long *next_bp;
+ struct pt_regs *regs;
+ unsigned long *next_bp, *next_frame;
+ size_t next_len;
if (unwind_done(state))
return false;
- next_bp = (unsigned long *)*state->bp;
+ /* have we reached the end? */
+ if (state->regs && user_mode(state->regs))
+ goto the_end;
+
+ /* get the next frame pointer */
+ if (state->regs)
+ next_bp = (unsigned long *)state->regs->bp;
+ else
+ next_bp = (unsigned long *)*state->bp;
+
+ /* is the next frame pointer an encoded pointer to pt_regs? */
+ regs = decode_frame_pointer(next_bp);
+ if (regs) {
+ next_frame = (unsigned long *)regs;
+ next_len = sizeof(*regs);
+ } else {
+ next_frame = next_bp;
+ next_len = FRAME_HEADER_SIZE;
+ }
/* make sure the next frame's data is accessible */
- if (!update_stack_state(state, next_bp, FRAME_HEADER_SIZE))
+ if (!update_stack_state(state, next_frame, next_len))
return false;
-
/* move to the next frame */
- state->bp = next_bp;
+ if (regs) {
+ state->regs = regs;
+ state->bp = NULL;
+ } else {
+ state->bp = next_bp;
+ state->regs = NULL;
+ }
+
return true;
+
+the_end:
+ state->stack_info.type = STACK_TYPE_UNKNOWN;
+ return false;
}
EXPORT_SYMBOL_GPL(unwind_next_frame);
void __unwind_start(struct unwind_state *state, struct task_struct *task,
struct pt_regs *regs, unsigned long *first_frame)
{
+ unsigned long *bp, *frame;
+ size_t len;
+
memset(state, 0, sizeof(*state));
state->task = task;
@@ -73,12 +123,22 @@ void __unwind_start(struct unwind_state
}
/* set up the starting stack frame */
- state->bp = get_frame_pointer(task, regs);
+ bp = get_frame_pointer(task, regs);
+ regs = decode_frame_pointer(bp);
+ if (regs) {
+ state->regs = regs;
+ frame = (unsigned long *)regs;
+ len = sizeof(*regs);
+ } else {
+ state->bp = bp;
+ frame = bp;
+ len = FRAME_HEADER_SIZE;
+ }
/* initialize stack info and make sure the frame data is accessible */
- get_stack_info(state->bp, state->task, &state->stack_info,
+ get_stack_info(frame, state->task, &state->stack_info,
&state->stack_mask);
- update_stack_state(state, state->bp, FRAME_HEADER_SIZE);
+ update_stack_state(state, frame, len);
/*
* The caller can provide the address of the first frame directly
Patches currently in stable-queue which might be from jpoimboe(a)redhat.com are
queue-4.9/x86-entry-unwind-create-stack-frames-for-saved-interrupt-registers.patch
This is a note to let you know that I've just added the patch titled
vsock: cancel packets when failing to connect
to the 4.9-stable tree which can be found at:
http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=sum…
The filename of the patch is:
vsock-cancel-packets-when-failing-to-connect.patch
and it can be found in the queue-4.9 subdirectory.
If you, or anyone else, feels it should not be added to the stable tree,
please let <stable(a)vger.kernel.org> know about it.
>From foo@baz Thu Dec 21 09:02:40 CET 2017
From: Peng Tao <bergwolf(a)gmail.com>
Date: Wed, 15 Mar 2017 09:32:17 +0800
Subject: vsock: cancel packets when failing to connect
From: Peng Tao <bergwolf(a)gmail.com>
[ Upstream commit 380feae0def7e6a115124a3219c3ec9b654dca32 ]
Otherwise we'll leave the packets queued until the vsock device is released.
E.g., if the guest is slow to start up, resulting in ETIMEDOUT on connect, the
guest will still get the connect requests from the failed host sockets.
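The timeout handler in the diff below follows a "decide under the lock, act
after dropping it" shape, presumably because cancel_pkt() may need to take
transport-level locks; a simplified sketch (condition abbreviated, names as
in the patch):
    int cancel = 0;
    lock_sock(sk);
    if (sk->sk_state == SS_CONNECTING) {    /* connect timed out */
            sk->sk_state = SS_UNCONNECTED;
            sk->sk_err = ETIMEDOUT;
            cancel = 1;
    }
    release_sock(sk);
    if (cancel)
            vsock_transport_cancel_pkt(vsk);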
Reviewed-by: Stefan Hajnoczi <stefanha(a)redhat.com>
Reviewed-by: Jorgen Hansen <jhansen(a)vmware.com>
Signed-off-by: Peng Tao <bergwolf(a)gmail.com>
Signed-off-by: David S. Miller <davem(a)davemloft.net>
Signed-off-by: Sasha Levin <alexander.levin(a)verizon.com>
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
---
net/vmw_vsock/af_vsock.c | 14 ++++++++++++++
1 file changed, 14 insertions(+)
--- a/net/vmw_vsock/af_vsock.c
+++ b/net/vmw_vsock/af_vsock.c
@@ -1101,10 +1101,19 @@ static const struct proto_ops vsock_dgra
.sendpage = sock_no_sendpage,
};
+static int vsock_transport_cancel_pkt(struct vsock_sock *vsk)
+{
+ if (!transport->cancel_pkt)
+ return -EOPNOTSUPP;
+
+ return transport->cancel_pkt(vsk);
+}
+
static void vsock_connect_timeout(struct work_struct *work)
{
struct sock *sk;
struct vsock_sock *vsk;
+ int cancel = 0;
vsk = container_of(work, struct vsock_sock, dwork.work);
sk = sk_vsock(vsk);
@@ -1115,8 +1124,11 @@ static void vsock_connect_timeout(struct
sk->sk_state = SS_UNCONNECTED;
sk->sk_err = ETIMEDOUT;
sk->sk_error_report(sk);
+ cancel = 1;
}
release_sock(sk);
+ if (cancel)
+ vsock_transport_cancel_pkt(vsk);
sock_put(sk);
}
@@ -1223,11 +1235,13 @@ static int vsock_stream_connect(struct s
err = sock_intr_errno(timeout);
sk->sk_state = SS_UNCONNECTED;
sock->state = SS_UNCONNECTED;
+ vsock_transport_cancel_pkt(vsk);
goto out_wait;
} else if (timeout == 0) {
err = -ETIMEDOUT;
sk->sk_state = SS_UNCONNECTED;
sock->state = SS_UNCONNECTED;
+ vsock_transport_cancel_pkt(vsk);
goto out_wait;
}
Patches currently in stable-queue which might be from bergwolf(a)gmail.com are
queue-4.9/vsock-cancel-packets-when-failing-to-connect.patch
queue-4.9/vsock-track-pkt-owner-vsock.patch
queue-4.9/vhost-vsock-add-pkt-cancel-capability.patch
This is a note to let you know that I've just added the patch titled
virtio-balloon: use actual number of stats for stats queue buffers
to the 4.9-stable tree which can be found at:
http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=sum…
The filename of the patch is:
virtio-balloon-use-actual-number-of-stats-for-stats-queue-buffers.patch
and it can be found in the queue-4.9 subdirectory.
If you, or anyone else, feels it should not be added to the stable tree,
please let <stable(a)vger.kernel.org> know about it.
>From foo@baz Thu Dec 21 09:02:40 CET 2017
From: Ladi Prosek <lprosek(a)redhat.com>
Date: Tue, 28 Mar 2017 18:46:58 +0200
Subject: virtio-balloon: use actual number of stats for stats queue buffers
From: Ladi Prosek <lprosek(a)redhat.com>
[ Upstream commit 9646b26e85896ef0256e66649f7937f774dc18a6 ]
The virtio balloon driver contained a not-so-obvious invariant that
update_balloon_stats has to update exactly VIRTIO_BALLOON_S_NR counters
in order to send valid stats to the host. This commit fixes it by having
update_balloon_stats return the actual number of counters, and its
callers use it when pushing buffers to the stats virtqueue.
Note that it is still out of spec to change the number of counters
at run-time. "Driver MUST supply the same subset of statistics in all
buffers submitted to the statsq."
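The resulting calling convention looks roughly like this (a sketch of the
pattern, mirroring the hunks below rather than adding anything new):
    unsigned int num_stats;
    struct scatterlist sg;
    num_stats = update_balloon_stats(vb);   /* entries actually written */
    sg_init_one(&sg, vb->stats, sizeof(vb->stats[0]) * num_stats);
    virtqueue_add_outbuf(vb->stats_vq, &sg, 1, vb, GFP_KERNEL);
    virtqueue_kick(vb->stats_vq);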
Suggested-by: Arnd Bergmann <arnd(a)arndb.de>
Signed-off-by: Ladi Prosek <lprosek(a)redhat.com>
Signed-off-by: Michael S. Tsirkin <mst(a)redhat.com>
Signed-off-by: Sasha Levin <alexander.levin(a)verizon.com>
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
---
drivers/virtio/virtio_balloon.c | 17 ++++++++++-------
1 file changed, 10 insertions(+), 7 deletions(-)
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -241,11 +241,11 @@ static inline void update_stat(struct vi
#define pages_to_bytes(x) ((u64)(x) << PAGE_SHIFT)
-static void update_balloon_stats(struct virtio_balloon *vb)
+static unsigned int update_balloon_stats(struct virtio_balloon *vb)
{
unsigned long events[NR_VM_EVENT_ITEMS];
struct sysinfo i;
- int idx = 0;
+ unsigned int idx = 0;
long available;
all_vm_events(events);
@@ -265,6 +265,8 @@ static void update_balloon_stats(struct
pages_to_bytes(i.totalram));
update_stat(vb, idx++, VIRTIO_BALLOON_S_AVAIL,
pages_to_bytes(available));
+
+ return idx;
}
/*
@@ -290,14 +292,14 @@ static void stats_handle_request(struct
{
struct virtqueue *vq;
struct scatterlist sg;
- unsigned int len;
+ unsigned int len, num_stats;
- update_balloon_stats(vb);
+ num_stats = update_balloon_stats(vb);
vq = vb->stats_vq;
if (!virtqueue_get_buf(vq, &len))
return;
- sg_init_one(&sg, vb->stats, sizeof(vb->stats));
+ sg_init_one(&sg, vb->stats, sizeof(vb->stats[0]) * num_stats);
virtqueue_add_outbuf(vq, &sg, 1, vb, GFP_KERNEL);
virtqueue_kick(vq);
}
@@ -421,15 +423,16 @@ static int init_vqs(struct virtio_balloo
vb->deflate_vq = vqs[1];
if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ)) {
struct scatterlist sg;
+ unsigned int num_stats;
vb->stats_vq = vqs[2];
/*
* Prime this virtqueue with one buffer so the hypervisor can
* use it to signal us later (it can't be broken yet!).
*/
- update_balloon_stats(vb);
+ num_stats = update_balloon_stats(vb);
- sg_init_one(&sg, vb->stats, sizeof vb->stats);
+ sg_init_one(&sg, vb->stats, sizeof(vb->stats[0]) * num_stats);
if (virtqueue_add_outbuf(vb->stats_vq, &sg, 1, vb, GFP_KERNEL)
< 0)
BUG();
Patches currently in stable-queue which might be from lprosek(a)redhat.com are
queue-4.9/kvm-nvmx-fix-host_cr3-host_cr4-cache.patch
queue-4.9/virtio-balloon-use-actual-number-of-stats-for-stats-queue-buffers.patch
queue-4.9/virtio_balloon-prevent-uninitialized-variable-use.patch
This is a note to let you know that I've just added the patch titled
virtio_balloon: prevent uninitialized variable use
to the 4.9-stable tree which can be found at:
http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=sum…
The filename of the patch is:
virtio_balloon-prevent-uninitialized-variable-use.patch
and it can be found in the queue-4.9 subdirectory.
If you, or anyone else, feels it should not be added to the stable tree,
please let <stable(a)vger.kernel.org> know about it.
>From foo@baz Thu Dec 21 09:02:40 CET 2017
From: Arnd Bergmann <arnd(a)arndb.de>
Date: Tue, 28 Mar 2017 18:46:59 +0200
Subject: virtio_balloon: prevent uninitialized variable use
From: Arnd Bergmann <arnd(a)arndb.de>
[ Upstream commit f0bb2d50dfcc519f06f901aac88502be6ff1df2c ]
The latest gcc-7.0.1 snapshot reports a new warning:
virtio/virtio_balloon.c: In function 'update_balloon_stats':
virtio/virtio_balloon.c:258:26: error: 'events[2]' is used uninitialized in this function [-Werror=uninitialized]
virtio/virtio_balloon.c:260:26: error: 'events[3]' is used uninitialized in this function [-Werror=uninitialized]
virtio/virtio_balloon.c:261:56: error: 'events[18]' is used uninitialized in this function [-Werror=uninitialized]
virtio/virtio_balloon.c:262:56: error: 'events[17]' is used uninitialized in this function [-Werror=uninitialized]
This seems absolutely right, so we should add an extra check to
prevent copying uninitialized stack data into the statistics.
From all I can tell, this has been broken since the statistics code
was originally added in 2.6.34.
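For context, and as my reading rather than text from the patch: with
CONFIG_VM_EVENT_COUNTERS disabled, all_vm_events() is expected to be a no-op
stub roughly like the sketch below, so the local events[] array is never
written and the swap/fault counters would be copied from uninitialized stack
memory; compiling the four update_stat() calls out under the same symbol
removes those reads.
    /* Assumed shape of the stub when the option is off: */
    #ifndef CONFIG_VM_EVENT_COUNTERS
    static inline void all_vm_events(unsigned long *ret) { }  /* *ret untouched */
    #endif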
Fixes: 9564e138b1f6 ("virtio: Add memory statistics reporting to the balloon driver (V4)")
Signed-off-by: Arnd Bergmann <arnd(a)arndb.de>
Signed-off-by: Ladi Prosek <lprosek(a)redhat.com>
Signed-off-by: Michael S. Tsirkin <mst(a)redhat.com>
Signed-off-by: Sasha Levin <alexander.levin(a)verizon.com>
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
---
drivers/virtio/virtio_balloon.c | 2 ++
1 file changed, 2 insertions(+)
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -253,12 +253,14 @@ static unsigned int update_balloon_stats
available = si_mem_available();
+#ifdef CONFIG_VM_EVENT_COUNTERS
update_stat(vb, idx++, VIRTIO_BALLOON_S_SWAP_IN,
pages_to_bytes(events[PSWPIN]));
update_stat(vb, idx++, VIRTIO_BALLOON_S_SWAP_OUT,
pages_to_bytes(events[PSWPOUT]));
update_stat(vb, idx++, VIRTIO_BALLOON_S_MAJFLT, events[PGMAJFAULT]);
update_stat(vb, idx++, VIRTIO_BALLOON_S_MINFLT, events[PGFAULT]);
+#endif
update_stat(vb, idx++, VIRTIO_BALLOON_S_MEMFREE,
pages_to_bytes(i.freeram));
update_stat(vb, idx++, VIRTIO_BALLOON_S_MEMTOT,
Patches currently in stable-queue which might be from arnd(a)arndb.de are
queue-4.9/hwmon-asus_atk0110-fix-uninitialized-data-access.patch
queue-4.9/bna-avoid-writing-uninitialized-data-into-hw-registers.patch
queue-4.9/virtio-balloon-use-actual-number-of-stats-for-stats-queue-buffers.patch
queue-4.9/virtio_balloon-prevent-uninitialized-variable-use.patch
queue-4.9/isdn-kcapi-avoid-uninitialized-data.patch
This is a note to let you know that I've just added the patch titled
vfio/pci: Virtualize Maximum Payload Size
to the 4.9-stable tree which can be found at:
http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=sum…
The filename of the patch is:
vfio-pci-virtualize-maximum-payload-size.patch
and it can be found in the queue-4.9 subdirectory.
If you, or anyone else, feels it should not be added to the stable tree,
please let <stable(a)vger.kernel.org> know about it.
>From foo@baz Thu Dec 21 09:02:40 CET 2017
From: Alex Williamson <alex.williamson(a)redhat.com>
Date: Mon, 2 Oct 2017 12:39:09 -0600
Subject: vfio/pci: Virtualize Maximum Payload Size
From: Alex Williamson <alex.williamson(a)redhat.com>
[ Upstream commit 523184972b282cd9ca17a76f6ca4742394856818 ]
With virtual PCI-Express chipsets, we now see userspace/guest drivers
trying to match the physical MPS setting to a virtual downstream port.
Of course a lone physical device surrounded by virtual interconnects
cannot make a correct decision for a proper MPS setting. Instead,
let's virtualize the MPS control register so that writes through to
hardware are disallowed. Userspace drivers like QEMU assume they can
write anything to the device and we'll filter out anything dangerous.
Since mismatched MPS can lead to AER and other faults, let's add it
to the kernel side rather than relying on userspace virtualization to
handle it.
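Conceptually (a sketch of the mechanism, not the actual vfio-pci helpers):
each emulated register carries a write mask describing which bits of a user
write may reach the physical device, and dropping PCI_EXP_DEVCTL_PAYLOAD from
that mask means MPS writes live only in the virtual copy of config space:
    #include <linux/types.h>
    /* Merge a user write into the hardware value, letting only whitelisted
     * bits through; after this patch PCI_EXP_DEVCTL_PAYLOAD would no longer
     * be part of write_mask. */
    static u16 apply_devctl_write(u16 hw_val, u16 user_val, u16 write_mask)
    {
            return (hw_val & ~write_mask) | (user_val & write_mask);
    }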
Signed-off-by: Alex Williamson <alex.williamson(a)redhat.com>
Reviewed-by: Eric Auger <eric.auger(a)redhat.com>
Signed-off-by: Sasha Levin <alexander.levin(a)verizon.com>
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
---
drivers/vfio/pci/vfio_pci_config.c | 6 ++++--
1 file changed, 4 insertions(+), 2 deletions(-)
--- a/drivers/vfio/pci/vfio_pci_config.c
+++ b/drivers/vfio/pci/vfio_pci_config.c
@@ -851,11 +851,13 @@ static int __init init_pci_cap_exp_perm(
/*
* Allow writes to device control fields, except devctl_phantom,
- * which could confuse IOMMU, and the ARI bit in devctl2, which
+ * which could confuse IOMMU, MPS, which can break communication
+ * with other physical devices, and the ARI bit in devctl2, which
* is set at probe time. FLR gets virtualized via our writefn.
*/
p_setw(perm, PCI_EXP_DEVCTL,
- PCI_EXP_DEVCTL_BCR_FLR, ~PCI_EXP_DEVCTL_PHANTOM);
+ PCI_EXP_DEVCTL_BCR_FLR | PCI_EXP_DEVCTL_PAYLOAD,
+ ~PCI_EXP_DEVCTL_PHANTOM);
p_setw(perm, PCI_EXP_DEVCTL2, NO_VIRT, ~PCI_EXP_DEVCTL2_ARI);
return 0;
}
Patches currently in stable-queue which might be from alex.williamson(a)redhat.com are
queue-4.9/pci-avoid-bus-reset-if-bridge-itself-is-broken.patch
queue-4.9/vfio-pci-virtualize-maximum-payload-size.patch
This is a note to let you know that I've just added the patch titled
vhost-vsock: add pkt cancel capability
to the 4.9-stable tree which can be found at:
http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=sum…
The filename of the patch is:
vhost-vsock-add-pkt-cancel-capability.patch
and it can be found in the queue-4.9 subdirectory.
If you, or anyone else, feels it should not be added to the stable tree,
please let <stable(a)vger.kernel.org> know about it.
>From foo@baz Thu Dec 21 09:02:40 CET 2017
From: Peng Tao <bergwolf(a)gmail.com>
Date: Wed, 15 Mar 2017 09:32:15 +0800
Subject: vhost-vsock: add pkt cancel capability
From: Peng Tao <bergwolf(a)gmail.com>
[ Upstream commit 16320f363ae128d9b9c70e60f00f2a572f57c23d ]
To allow canceling all packets of a connection.
Reviewed-by: Stefan Hajnoczi <stefanha(a)redhat.com>
Reviewed-by: Jorgen Hansen <jhansen(a)vmware.com>
Signed-off-by: Peng Tao <bergwolf(a)gmail.com>
Signed-off-by: David S. Miller <davem(a)davemloft.net>
Signed-off-by: Sasha Levin <alexander.levin(a)verizon.com>
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
---
drivers/vhost/vsock.c | 41 +++++++++++++++++++++++++++++++++++++++++
include/net/af_vsock.h | 3 +++
2 files changed, 44 insertions(+)
--- a/drivers/vhost/vsock.c
+++ b/drivers/vhost/vsock.c
@@ -218,6 +218,46 @@ vhost_transport_send_pkt(struct virtio_v
return len;
}
+static int
+vhost_transport_cancel_pkt(struct vsock_sock *vsk)
+{
+ struct vhost_vsock *vsock;
+ struct virtio_vsock_pkt *pkt, *n;
+ int cnt = 0;
+ LIST_HEAD(freeme);
+
+ /* Find the vhost_vsock according to guest context id */
+ vsock = vhost_vsock_get(vsk->remote_addr.svm_cid);
+ if (!vsock)
+ return -ENODEV;
+
+ spin_lock_bh(&vsock->send_pkt_list_lock);
+ list_for_each_entry_safe(pkt, n, &vsock->send_pkt_list, list) {
+ if (pkt->vsk != vsk)
+ continue;
+ list_move(&pkt->list, &freeme);
+ }
+ spin_unlock_bh(&vsock->send_pkt_list_lock);
+
+ list_for_each_entry_safe(pkt, n, &freeme, list) {
+ if (pkt->reply)
+ cnt++;
+ list_del(&pkt->list);
+ virtio_transport_free_pkt(pkt);
+ }
+
+ if (cnt) {
+ struct vhost_virtqueue *tx_vq = &vsock->vqs[VSOCK_VQ_TX];
+ int new_cnt;
+
+ new_cnt = atomic_sub_return(cnt, &vsock->queued_replies);
+ if (new_cnt + cnt >= tx_vq->num && new_cnt < tx_vq->num)
+ vhost_poll_queue(&tx_vq->poll);
+ }
+
+ return 0;
+}
+
static struct virtio_vsock_pkt *
vhost_vsock_alloc_pkt(struct vhost_virtqueue *vq,
unsigned int out, unsigned int in)
@@ -669,6 +709,7 @@ static struct virtio_transport vhost_tra
.release = virtio_transport_release,
.connect = virtio_transport_connect,
.shutdown = virtio_transport_shutdown,
+ .cancel_pkt = vhost_transport_cancel_pkt,
.dgram_enqueue = virtio_transport_dgram_enqueue,
.dgram_dequeue = virtio_transport_dgram_dequeue,
--- a/include/net/af_vsock.h
+++ b/include/net/af_vsock.h
@@ -100,6 +100,9 @@ struct vsock_transport {
void (*destruct)(struct vsock_sock *);
void (*release)(struct vsock_sock *);
+ /* Cancel all pending packets sent on vsock. */
+ int (*cancel_pkt)(struct vsock_sock *vsk);
+
/* Connections. */
int (*connect)(struct vsock_sock *);
Patches currently in stable-queue which might be from bergwolf(a)gmail.com are
queue-4.9/vsock-cancel-packets-when-failing-to-connect.patch
queue-4.9/vsock-track-pkt-owner-vsock.patch
queue-4.9/vhost-vsock-add-pkt-cancel-capability.patch
This is a note to let you know that I've just added the patch titled
tracing: Exclude 'generic fields' from histograms
to the 4.9-stable tree which can be found at:
http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=sum…
The filename of the patch is:
tracing-exclude-generic-fields-from-histograms.patch
and it can be found in the queue-4.9 subdirectory.
If you, or anyone else, feels it should not be added to the stable tree,
please let <stable(a)vger.kernel.org> know about it.
>From foo@baz Thu Dec 21 09:02:40 CET 2017
From: Tom Zanussi <tom.zanussi(a)linux.intel.com>
Date: Fri, 22 Sep 2017 14:58:17 -0500
Subject: tracing: Exclude 'generic fields' from histograms
From: Tom Zanussi <tom.zanussi(a)linux.intel.com>
[ Upstream commit a15f7fc20389a8827d5859907568b201234d4b79 ]
There are a small number of 'generic fields' (comm/COMM/cpu/CPU) that
are found by trace_find_event_field() but are only meant for
filtering. Specifically, unlike normal fields, they have a size
of 0 and thus wreak havoc when used as a histogram key.
Exclude these (return -EINVAL) when used as histogram keys.
Link: http://lkml.kernel.org/r/956154cbc3e8a4f0633d619b886c97f0f0edf7b4.150610504…
Signed-off-by: Tom Zanussi <tom.zanussi(a)linux.intel.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt(a)goodmis.org>
Signed-off-by: Sasha Levin <alexander.levin(a)verizon.com>
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
---
kernel/trace/trace_events_hist.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
--- a/kernel/trace/trace_events_hist.c
+++ b/kernel/trace/trace_events_hist.c
@@ -449,7 +449,7 @@ static int create_val_field(struct hist_
}
field = trace_find_event_field(file->event_call, field_name);
- if (!field) {
+ if (!field || !field->size) {
ret = -EINVAL;
goto out;
}
@@ -547,7 +547,7 @@ static int create_key_field(struct hist_
}
field = trace_find_event_field(file->event_call, field_name);
- if (!field) {
+ if (!field || !field->size) {
ret = -EINVAL;
goto out;
}
Patches currently in stable-queue which might be from tom.zanussi(a)linux.intel.com are
queue-4.9/tracing-exclude-generic-fields-from-histograms.patch
This is a note to let you know that I've just added the patch titled
usb: gadget: udc: remove pointer dereference after free
to the 4.9-stable tree which can be found at:
http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=sum…
The filename of the patch is:
usb-gadget-udc-remove-pointer-dereference-after-free.patch
and it can be found in the queue-4.9 subdirectory.
If you, or anyone else, feels it should not be added to the stable tree,
please let <stable(a)vger.kernel.org> know about it.
>From foo@baz Thu Dec 21 09:02:40 CET 2017
From: "Gustavo A. R. Silva" <garsilva(a)embeddedor.com>
Date: Fri, 10 Mar 2017 15:39:32 -0600
Subject: usb: gadget: udc: remove pointer dereference after free
From: "Gustavo A. R. Silva" <garsilva(a)embeddedor.com>
[ Upstream commit 1f459262b0e1649a1e5ad12fa4c66eb76c2220ce ]
Remove pointer dereference after free.
Addresses-Coverity-ID: 1091173
Acked-by: Michal Nazarewicz <mina86(a)mina86.com>
Signed-off-by: Gustavo A. R. Silva <garsilva(a)embeddedor.com>
Signed-off-by: Felipe Balbi <felipe.balbi(a)linux.intel.com>
Signed-off-by: Sasha Levin <alexander.levin(a)verizon.com>
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
---
drivers/usb/gadget/udc/pch_udc.c | 1 -
1 file changed, 1 deletion(-)
--- a/drivers/usb/gadget/udc/pch_udc.c
+++ b/drivers/usb/gadget/udc/pch_udc.c
@@ -1523,7 +1523,6 @@ static void pch_udc_free_dma_chain(struc
td = phys_to_virt(addr);
addr2 = (dma_addr_t)td->next;
pci_pool_free(dev->data_requests, td, addr);
- td->next = 0x00;
addr = addr2;
}
req->chain_len = 1;
Patches currently in stable-queue which might be from garsilva(a)embeddedor.com are
queue-4.9/usb-gadget-udc-remove-pointer-dereference-after-free.patch
This is a note to let you know that I've just added the patch titled
usb: gadget: f_uvc: Sanity check wMaxPacketSize for SuperSpeed
to the 4.9-stable tree which can be found at:
http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=sum…
The filename of the patch is:
usb-gadget-f_uvc-sanity-check-wmaxpacketsize-for-superspeed.patch
and it can be found in the queue-4.9 subdirectory.
If you, or anyone else, feels it should not be added to the stable tree,
please let <stable(a)vger.kernel.org> know about it.
>From foo@baz Thu Dec 21 09:02:40 CET 2017
From: Roger Quadros <rogerq(a)ti.com>
Date: Wed, 8 Mar 2017 16:05:44 +0200
Subject: usb: gadget: f_uvc: Sanity check wMaxPacketSize for SuperSpeed
From: Roger Quadros <rogerq(a)ti.com>
[ Upstream commit 16bb05d98c904a4f6c5ce7e2d992299f794acbf2 ]
As per USB3.0 Specification "Table 9-20. Standard Endpoint Descriptor",
for interrupt and isochronous endpoints, wMaxPacketSize must be set to
1024 if the endpoint defines bMaxBurst to be greater than zero.
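As a concrete illustration of the adjustment below (numbers are made up):
with a non-zero streaming_maxburst, a configured streaming_maxpacket of 3000
is not a multiple of 1024, so it is rounded up via roundup(3000, 1024) to
3072 and the override is logged; values that are already multiples of 1024
pass through unchanged.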
Reviewed-by: Laurent Pinchart <laurent.pinchart(a)ideasonboard.com>
Signed-off-by: Roger Quadros <rogerq(a)ti.com>
Signed-off-by: Felipe Balbi <felipe.balbi(a)linux.intel.com>
Signed-off-by: Sasha Levin <alexander.levin(a)verizon.com>
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
---
drivers/usb/gadget/function/f_uvc.c | 8 ++++++++
1 file changed, 8 insertions(+)
--- a/drivers/usb/gadget/function/f_uvc.c
+++ b/drivers/usb/gadget/function/f_uvc.c
@@ -594,6 +594,14 @@ uvc_function_bind(struct usb_configurati
opts->streaming_maxpacket = clamp(opts->streaming_maxpacket, 1U, 3072U);
opts->streaming_maxburst = min(opts->streaming_maxburst, 15U);
+ /* For SS, wMaxPacketSize has to be 1024 if bMaxBurst is not 0 */
+ if (opts->streaming_maxburst &&
+ (opts->streaming_maxpacket % 1024) != 0) {
+ opts->streaming_maxpacket = roundup(opts->streaming_maxpacket, 1024);
+ INFO(cdev, "overriding streaming_maxpacket to %d\n",
+ opts->streaming_maxpacket);
+ }
+
/* Fill in the FS/HS/SS Video Streaming specific descriptors from the
* module parameters.
*
Patches currently in stable-queue which might be from rogerq(a)ti.com are
queue-4.9/usb-gadget-f_uvc-sanity-check-wmaxpacketsize-for-superspeed.patch
This is a note to let you know that I've just added the patch titled
tcp: fix under-evaluated ssthresh in TCP Vegas
to the 4.9-stable tree which can be found at:
http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=sum…
The filename of the patch is:
tcp-fix-under-evaluated-ssthresh-in-tcp-vegas.patch
and it can be found in the queue-4.9 subdirectory.
If you, or anyone else, feels it should not be added to the stable tree,
please let <stable(a)vger.kernel.org> know about it.
>From foo@baz Thu Dec 21 09:02:40 CET 2017
From: Hoang Tran <tranviethoang.vn(a)gmail.com>
Date: Wed, 27 Sep 2017 18:30:58 +0200
Subject: tcp: fix under-evaluated ssthresh in TCP Vegas
From: Hoang Tran <tranviethoang.vn(a)gmail.com>
[ Upstream commit cf5d74b85ef40c202c76d90959db4d850f301b95 ]
With the commit 76174004a0f19785 (tcp: do not slow start when cwnd equals
ssthresh), the comparison to the reduced cwnd in tcp_vegas_ssthresh() would
under-evaluate the ssthresh.
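A small worked example of the off-by-one (numbers are illustrative): with
snd_ssthresh = 20 and snd_cwnd = 10, the old min(tp->snd_ssthresh,
tp->snd_cwnd - 1) returns 9, one segment below the current cwnd, while the
fixed min(tp->snd_ssthresh, tp->snd_cwnd) returns 10.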
Signed-off-by: Hoang Tran <hoang.tran(a)uclouvain.be>
Signed-off-by: David S. Miller <davem(a)davemloft.net>
Signed-off-by: Sasha Levin <alexander.levin(a)verizon.com>
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
---
net/ipv4/tcp_vegas.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
--- a/net/ipv4/tcp_vegas.c
+++ b/net/ipv4/tcp_vegas.c
@@ -158,7 +158,7 @@ EXPORT_SYMBOL_GPL(tcp_vegas_cwnd_event);
static inline u32 tcp_vegas_ssthresh(struct tcp_sock *tp)
{
- return min(tp->snd_ssthresh, tp->snd_cwnd-1);
+ return min(tp->snd_ssthresh, tp->snd_cwnd);
}
static void tcp_vegas_cong_avoid(struct sock *sk, u32 ack, u32 acked)
Patches currently in stable-queue which might be from tranviethoang.vn(a)gmail.com are
queue-4.9/tcp-fix-under-evaluated-ssthresh-in-tcp-vegas.patch
This is a note to let you know that I've just added the patch titled
staging: greybus: light: Release memory obtained by kasprintf
to the 4.9-stable tree which can be found at:
http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=sum…
The filename of the patch is:
staging-greybus-light-release-memory-obtained-by-kasprintf.patch
and it can be found in the queue-4.9 subdirectory.
If you, or anyone else, feels it should not be added to the stable tree,
please let <stable(a)vger.kernel.org> know about it.
>From foo@baz Thu Dec 21 09:02:40 CET 2017
From: Arvind Yadav <arvind.yadav.cs(a)gmail.com>
Date: Sat, 23 Sep 2017 13:25:30 +0530
Subject: staging: greybus: light: Release memory obtained by kasprintf
From: Arvind Yadav <arvind.yadav.cs(a)gmail.com>
[ Upstream commit 04820da21050b35eed68aa046115d810163ead0c ]
Free the memory obtained by kasprintf() if gb_lights_channel_config() is not successful.
Signed-off-by: Arvind Yadav <arvind.yadav.cs(a)gmail.com>
Reviewed-by: Rui Miguel Silva <rmfrfs(a)gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
Signed-off-by: Sasha Levin <alexander.levin(a)verizon.com>
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
---
drivers/staging/greybus/light.c | 2 ++
1 file changed, 2 insertions(+)
--- a/drivers/staging/greybus/light.c
+++ b/drivers/staging/greybus/light.c
@@ -924,6 +924,8 @@ static void __gb_lights_led_unregister(s
return;
led_classdev_unregister(cdev);
+ kfree(cdev->name);
+ cdev->name = NULL;
channel->led = NULL;
}
Patches currently in stable-queue which might be from arvind.yadav.cs(a)gmail.com are
queue-4.9/staging-greybus-light-release-memory-obtained-by-kasprintf.patch
This is a note to let you know that I've just added the patch titled
tipc: fix nametbl deadlock at tipc_nametbl_unsubscribe
to the 4.9-stable tree which can be found at:
http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=sum…
The filename of the patch is:
tipc-fix-nametbl-deadlock-at-tipc_nametbl_unsubscribe.patch
and it can be found in the queue-4.9 subdirectory.
If you, or anyone else, feels it should not be added to the stable tree,
please let <stable(a)vger.kernel.org> know about it.
>From foo@baz Thu Dec 21 09:02:40 CET 2017
From: Ying Xue <ying.xue(a)windriver.com>
Date: Tue, 21 Mar 2017 10:47:49 +0100
Subject: tipc: fix nametbl deadlock at tipc_nametbl_unsubscribe
From: Ying Xue <ying.xue(a)windriver.com>
[ Upstream commit 557d054c01da0337ca81de9e9d9206d57245b57e ]
Until now, tipc_nametbl_unsubscribe() is called at subscriptions
reference count cleanup. Usually the subscriptions cleanup is
called at subscription timeout or at subscription cancel or at
subscriber delete.
We have ignored the possibility of this being called from other
locations, which causes deadlock as we try to grab the
tn->nametbl_lock while holding it already.
CPU1:                                        CPU2:
----------                                   ----------------
tipc_nametbl_publish
spin_lock_bh(&tn->nametbl_lock)
tipc_nametbl_insert_publ
tipc_nameseq_insert_publ
tipc_subscrp_report_overlap
tipc_subscrp_get
tipc_subscrp_send_event
                                             tipc_close_conn
                                             tipc_subscrb_release_cb
                                             tipc_subscrb_delete
                                             tipc_subscrp_put
tipc_subscrp_put
tipc_subscrp_kref_release
tipc_nametbl_unsubscribe
spin_lock_bh(&tn->nametbl_lock)
<<grab nametbl_lock again>>
CPU1:                                        CPU2:
----------                                   ----------------
tipc_nametbl_stop
spin_lock_bh(&tn->nametbl_lock)
tipc_purge_publications
tipc_nameseq_remove_publ
tipc_subscrp_report_overlap
tipc_subscrp_get
tipc_subscrp_send_event
                                             tipc_close_conn
                                             tipc_subscrb_release_cb
                                             tipc_subscrb_delete
                                             tipc_subscrp_put
tipc_subscrp_put
tipc_subscrp_kref_release
tipc_nametbl_unsubscribe
spin_lock_bh(&tn->nametbl_lock)
<<grab nametbl_lock again>>
In this commit, we advance the calling of tipc_nametbl_unsubscribe()
from the refcount cleanup to the intended callers.
Fixes: d094c4d5f5c7 ("tipc: add subscription refcount to avoid invalid delete")
Reported-by: John Thompson <thompa.atl(a)gmail.com>
Acked-by: Jon Maloy <jon.maloy(a)ericsson.com>
Signed-off-by: Ying Xue <ying.xue(a)windriver.com>
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan(a)ericsson.com>
Signed-off-by: David S. Miller <davem(a)davemloft.net>
Signed-off-by: Sasha Levin <alexander.levin(a)verizon.com>
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
---
net/tipc/subscr.c | 7 ++++++-
1 file changed, 6 insertions(+), 1 deletion(-)
--- a/net/tipc/subscr.c
+++ b/net/tipc/subscr.c
@@ -141,6 +141,11 @@ void tipc_subscrp_report_overlap(struct
static void tipc_subscrp_timeout(unsigned long data)
{
struct tipc_subscription *sub = (struct tipc_subscription *)data;
+ struct tipc_subscriber *subscriber = sub->subscriber;
+
+ spin_lock_bh(&subscriber->lock);
+ tipc_nametbl_unsubscribe(sub);
+ spin_unlock_bh(&subscriber->lock);
/* Notify subscriber of timeout */
tipc_subscrp_send_event(sub, sub->evt.s.seq.lower, sub->evt.s.seq.upper,
@@ -173,7 +178,6 @@ static void tipc_subscrp_kref_release(st
struct tipc_subscriber *subscriber = sub->subscriber;
spin_lock_bh(&subscriber->lock);
- tipc_nametbl_unsubscribe(sub);
list_del(&sub->subscrp_list);
atomic_dec(&tn->subscription_count);
spin_unlock_bh(&subscriber->lock);
@@ -205,6 +209,7 @@ static void tipc_subscrb_subscrp_delete(
if (s && memcmp(s, &sub->evt.s, sizeof(struct tipc_subscr)))
continue;
+ tipc_nametbl_unsubscribe(sub);
tipc_subscrp_get(sub);
spin_unlock_bh(&subscriber->lock);
tipc_subscrp_delete(sub);
Patches currently in stable-queue which might be from ying.xue(a)windriver.com are
queue-4.9/tipc-fix-nametbl-deadlock-at-tipc_nametbl_unsubscribe.patch