Context
=======
We've observed within Red Hat that isolated, NOHZ_FULL CPUs running a pure-userspace application get regularly interrupted by IPIs sent from housekeeping CPUs. Those IPIs are caused by activity on the housekeeping CPUs leading to various on_each_cpu() calls, e.g.:
  64359.052209596      NetworkManager       0    1405     smp_call_function_many_cond (cpu=0, func=do_kernel_range_flush)
      smp_call_function_many_cond+0x1
      smp_call_function+0x39
      on_each_cpu+0x2a
      flush_tlb_kernel_range+0x7b
      __purge_vmap_area_lazy+0x70
      _vm_unmap_aliases.part.42+0xdf
      change_page_attr_set_clr+0x16a
      set_memory_ro+0x26
      bpf_int_jit_compile+0x2f9
      bpf_prog_select_runtime+0xc6
      bpf_prepare_filter+0x523
      sk_attach_filter+0x13
      sock_setsockopt+0x92c
      __sys_setsockopt+0x16a
      __x64_sys_setsockopt+0x20
      do_syscall_64+0x87
      entry_SYSCALL_64_after_hwframe+0x65
The heart of this series is the thought that while we cannot remove NOHZ_FULL CPUs from the list of CPUs targeted by these IPIs, they may not have to execute the callbacks immediately. Anything that only affects kernelspace can wait until the next user->kernel transition, providing it can be executed "early enough" in the entry code.
The original implementation is from Peter [1]. Nicolas then added kernel TLB invalidation deferral to that [2], and I picked it up from there.
Deferral approach
=================
Storing each and every callback, as in a secondary call_single_queue, turned out to be a no-go: the whole point of deferral is to keep NOHZ_FULL CPUs in userspace for as long as possible - no signal of any form would be sent when deferring an IPI. This means that any form of queuing for deferred callbacks would end up as a convoluted memory leak.
Deferred IPIs must thus be coalesced, which this series achieves by assigning IPIs a "type" and having a mapping of IPI type to callback, leveraged upon kernel entry.
What about IPIs whose callbacks take a parameter, you may ask?
Peter suggested during OSPM23 [3] that since on_each_cpu() targets housekeeping CPUs *and* isolated CPUs, isolated CPUs can access either global or housekeeping-CPU-local state to "reconstruct" the data that would have been sent via the IPI.
This series does not affect any IPI callback that requires an argument, but the approach would remain the same (one coalescable callback executed on kernel entry).
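To make the coalescing scheme more concrete, here is a minimal sketch of the idea (illustrative only: the names below are made up, and the actual series folds the work bits into context_tracking's .state word and only defers when the target CPU isn't being watched by RCU, falling back to a regular IPI otherwise):

    /* Each deferrable IPI gets a type; one bit per type, per CPU. */
    enum ipi_defer_type {
            DEFER_SYNC_CORE,
            DEFER_KERNEL_TLB_FLUSH,
            DEFER_TYPE_MAX,
    };

    static DEFINE_PER_CPU(unsigned long, deferred_ipi_mask);

    /*
     * Housekeeping CPU: instead of IPI'ing a NOHZ_FULL CPU that is sitting
     * in userspace, set a bit. Repeated requests of the same type coalesce.
     */
    static void defer_ipi_work(int cpu, enum ipi_defer_type type)
    {
            set_bit(type, per_cpu_ptr(&deferred_ipi_mask, cpu));
    }

    /* Target CPU: run the coalesced callbacks early in the kernel entry path. */
    static void flush_deferred_ipi_work(void)
    {
            unsigned long work = xchg(this_cpu_ptr(&deferred_ipi_mask), 0);

            if (work & BIT(DEFER_SYNC_CORE))
                    sync_core();
            if (work & BIT(DEFER_KERNEL_TLB_FLUSH))
                    __flush_tlb_all();      /* no argument needed: flush everything */
    }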
Kernel entry vs execution of the deferred operation
===================================================
This is what I've referred to as the "Danger Zone" during my LPC24 talk [4].
There is a non-zero length of code that is executed upon kernel entry before the deferred operation can itself be executed (before we start getting into context_tracking.c proper), i.e.:
  idtentry_func_foo()                <--- we're in the kernel
    irqentry_enter()
      enter_from_user_mode()
        __ct_user_exit()
          ct_kernel_enter_state()
            ct_work_flush()          <--- deferred operation is executed here
This means one must take extra care about what can happen in the early entry code, and ensure that <bad things> cannot happen. For instance, we really don't want to hit instructions that have been modified by a remote text_poke() while we're on our way to execute a deferred sync_core(). Patches doing the actual deferral have more detail on this.
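As a contrived sketch of the hazard (example_key and example_mitigation() are made up; __ct_user_exit()/ct_work_flush() are the real entry hooks shown above):

    static DEFINE_STATIC_KEY_FALSE(example_key);

    /* Hypothetical early-entry path on a NOHZ_FULL CPU. */
    noinstr void example_enter_from_user(void)
    {
            /*
             * If example_key was just text_poke()'d by a remote CPU and the
             * accompanying sync_core() IPI was deferred, this CPU may still
             * execute a stale copy of the patched jump/nop: we haven't
             * reached ct_work_flush() yet, so the deferred sync_core()
             * hasn't run.
             */
            if (static_branch_unlikely(&example_key))
                    example_mitigation();

            __ct_user_exit();       /* eventually runs ct_work_flush() */
    }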
Where are we at with this whole thing?
======================================
Dave has been incredibly helpful wrt figuring out what would and wouldn't (mostly the latter) be safe to do when deferring kernel range TLB flush IPIs, see [5].
Long story short, there are ugly things I can still do to (safely) defer the TLB flush IPIs, but it's going to be a long session of pulling my own hair out, and I got plenty so I won't be done for a while.
In the meantime, I think everything leading up to deferring text poke IPIs is sane-ish and could get in. I'm not the biggest fan of adding an API with a single user, but hey, I've been working on this for "a little while" now and I'll still need to get the other IPIs sorted out.
TL;DR: Text patching IPI deferral LGTM so here it is for now, I'm still working on the TLB flush thing.
Patches
=======
o Patches 1-2 are standalone objtool cleanups.
o Patches 3-4 add an RCU testing feature.
o Patches 5-6 add infrastructure for annotating static keys and static calls
  that may be used in noinstr code (courtesy of Josh).
o Patches 7-20 use said annotations on relevant keys / calls.
o Patch 21 enforces proper usage of said annotations (courtesy of Josh).
o Patches 22-23 deal with detecting NOINSTR text in modules
o Patches 24-25 add the actual IPI deferral faff
Patches are also available at: https://gitlab.com/vschneid/linux.git -b redhat/isolirq/defer/v5
Testing
=======
Xeon E5-2699 system with SMToff, NOHZ_FULL, isolated CPUs. RHEL10 userspace.
Workload is using rteval (kernel compilation + hackbench) on housekeeping CPUs and a dummy stay-in-userspace loop on the isolated CPUs. The main invocation is:
  $ trace-cmd record -e "csd_queue_cpu" -f "cpu & CPUS{$ISOL_CPUS}" \
	     -e "ipi_send_cpumask" -f "cpumask & CPUS{$ISOL_CPUS}" \
	     -e "ipi_send_cpu" -f "cpu & CPUS{$ISOL_CPUS}" \
	     rteval --onlyload --loads-cpulist=$HK_CPUS \
	     --hackbench-runlowmem=True --duration=$DURATION
This only records IPIs sent to isolated CPUs, so any event there is interference (with a bit of fuzz at the start/end of the workload when spawning the processes). All tests were done with a duration of 3 hours.
v6.14

# This is the actual IPI count
$ trace-cmd report | grep callback | awk '{ print $(NF) }' | sort | uniq -c | sort -nr
     93 callback=generic_smp_call_function_single_interrupt+0x0
     22 callback=nohz_full_kick_func+0x0

# These are the different CSD's that caused IPIs
$ trace-cmd report | grep csd_queue | awk '{ print $(NF-1) }' | sort | uniq -c | sort -nr
   1456 func=do_flush_tlb_all
     78 func=do_sync_core
     33 func=nohz_full_kick_func
     26 func=do_kernel_range_flush

v6.14 + patches

# This is the actual IPI count
$ trace-cmd report | grep callback | awk '{ print $(NF) }' | sort | uniq -c | sort -nr
     86 callback=generic_smp_call_function_single_interrupt+0x0
     41 callback=nohz_full_kick_func+0x0

# These are the different CSD's that caused IPIs
$ trace-cmd report | grep csd_queue | awk '{ print $(NF-1) }' | sort | uniq -c | sort -nr
   1378 func=do_flush_tlb_all
     33 func=nohz_full_kick_func
So the TLB flush is still there driving most of the IPIs, but at least the instruction patching IPIs are gone. With kernel TLB flushes deferred, there are no IPIs sent to isolated CPUs in that 3hr window, but as stated above that still needs some more work.
Also note that tlb_remove_table_smp_sync() showed up during testing of v3, and has gone away as mysteriously as it showed up. Yair had a series addressing this [6] which per these results would be worth revisiting.
Acknowledgements
================
Special thanks to:
o Clark Williams for listening to my ramblings about this and throwing ideas my way
o Josh Poimboeuf for all his help with everything objtool-related
o All of the folks who attended various (too many?) talks about this and provided precious feedback.
o The mm folks for pointing out what I can and can't do with TLB flushes
Links
=====
[1]: https://lore.kernel.org/all/20210929151723.162004989@infradead.org/
[2]: https://github.com/vianpl/linux.git -b ct-work-defer-wip
[3]: https://youtu.be/0vjE6fjoVVE
[4]: https://lpc.events/event/18/contributions/1889/
[5]: http://lore.kernel.org/r/eef09bdc-7546-462b-9ac0-661a44d2ceae@intel.com
[6]: https://lore.kernel.org/lkml/20230620144618.125703-1-ypodemsk@redhat.com/
Revisions
=========
v4 -> v5
++++++++
o Rebased onto v6.15-rc3
o Collected Reviewed-by

o Annotated a few more static keys
o Added proper checking of noinstr sections that are in loadable code such
  as KVM early entry (Sean Christopherson)
o Switched to checking for CT_RCU_WATCHING instead of CT_STATE_KERNEL or CT_STATE_IDLE, which means deferral is now behaving sanely for IRQ/NMI entry from idle (thanks to Frederic!)
o Ditched the vmap TLB flush deferral (for now)
RFCv3 -> v4
+++++++++++
o Rebased onto v6.13-rc6
o New objtool patches from Josh
o More .noinstr static key/call patches
o Static calls now handled as well (again thanks to Josh)

o Fixed clearing the work bits on kernel exit
o Messed with IRQ hitting an idle CPU vs context tracking
o Various comment and naming cleanups

o Made RCU_DYNTICKS_TORTURE depend on !COMPILE_TEST (PeterZ)
o Fixed the CT_STATE_KERNEL check when setting a deferred work (Frederic)
o Cleaned up the __flush_tlb_all() mess thanks to PeterZ
RFCv2 -> RFCv3
++++++++++++++
o Rebased onto v6.12-rc6
o Added objtool documentation for the new warning (Josh)
o Added low-size RCU watching counter to TREE04 torture scenario (Paul)
o Added FORCEFUL jump label and static key types
o Added noinstr-compliant helpers for tlb flush deferral
RFCv1 -> RFCv2
++++++++++++++
o Rebased onto v6.5-rc1
o Updated the trace filter patches (Steven)
o Fixed __ro_after_init keys used in modules (Peter)
o Dropped the extra context_tracking atomic, squashed the new bits in the
  existing .state field (Peter, Frederic)
o Added an RCU_EXPERT config for the RCU dynticks counter size, and added an rcutorture case for a low-size counter (Paul)
o Fixed flush_tlb_kernel_range_deferrable() definition
Josh Poimboeuf (3):
  jump_label: Add annotations for validating noinstr usage
  static_call: Add read-only-after-init static calls
  objtool: Add noinstr validation for static branches/calls
Valentin Schneider (22):
  objtool: Make validate_call() recognize indirect calls to pv_ops[]
  objtool: Flesh out warning related to pv_ops[] calls
  rcu: Add a small-width RCU watching counter debug option
  rcutorture: Make TREE04 use CONFIG_RCU_DYNTICKS_TORTURE
  x86/paravirt: Mark pv_sched_clock static call as __ro_after_init
  x86/idle: Mark x86_idle static call as __ro_after_init
  x86/paravirt: Mark pv_steal_clock static call as __ro_after_init
  riscv/paravirt: Mark pv_steal_clock static call as __ro_after_init
  loongarch/paravirt: Mark pv_steal_clock static call as __ro_after_init
  arm64/paravirt: Mark pv_steal_clock static call as __ro_after_init
  arm/paravirt: Mark pv_steal_clock static call as __ro_after_init
  perf/x86/amd: Mark perf_lopwr_cb static call as __ro_after_init
  sched/clock: Mark sched_clock_running key as __ro_after_init
  KVM: VMX: Mark __kvm_is_using_evmcs static key as __ro_after_init
  x86/speculation/mds: Mark mds_idle_clear key as allowed in .noinstr
  sched/clock, x86: Mark __sched_clock_stable key as allowed in .noinstr
  KVM: VMX: Mark vmx_l1d_should_flush and vmx_l1d_flush_cond keys as allowed in .noinstr
  stackleak: Mark stack_erasing_bypass key as allowed in .noinstr
  module: Remove outdated comment about text_size
  module: Add MOD_NOINSTR_TEXT mem_type
  context-tracking: Introduce work deferral infrastructure
  context_tracking,x86: Defer kernel text patching IPIs
 arch/Kconfig                                  |   9 ++
 arch/arm/kernel/paravirt.c                    |   2 +-
 arch/arm64/kernel/paravirt.c                  |   2 +-
 arch/loongarch/kernel/paravirt.c              |   2 +-
 arch/riscv/kernel/paravirt.c                  |   2 +-
 arch/x86/Kconfig                              |   1 +
 arch/x86/events/amd/brs.c                     |   2 +-
 arch/x86/include/asm/context_tracking_work.h  |  18 +++
 arch/x86/include/asm/text-patching.h          |   1 +
 arch/x86/kernel/alternative.c                 |  39 ++++++-
 arch/x86/kernel/cpu/bugs.c                    |   2 +-
 arch/x86/kernel/kprobes/core.c                |   4 +-
 arch/x86/kernel/kprobes/opt.c                 |   4 +-
 arch/x86/kernel/module.c                      |   2 +-
 arch/x86/kernel/paravirt.c                    |   4 +-
 arch/x86/kernel/process.c                     |   2 +-
 arch/x86/kvm/vmx/vmx.c                        |  11 +-
 arch/x86/kvm/vmx/vmx_onhyperv.c               |   2 +-
 include/asm-generic/sections.h                |  15 +++
 include/linux/context_tracking.h              |  21 ++++
 include/linux/context_tracking_state.h        |  54 +++++++--
 include/linux/context_tracking_work.h         |  26 +++++
 include/linux/jump_label.h                    |  30 ++++-
 include/linux/module.h                        |   6 +-
 include/linux/objtool.h                       |   7 ++
 include/linux/static_call.h                   |  19 ++++
 kernel/context_tracking.c                     |  69 +++++++++++-
 kernel/kprobes.c                              |   8 +-
 kernel/module/main.c                          |  85 ++++++++++----
 kernel/rcu/Kconfig.debug                      |  15 +++
 kernel/sched/clock.c                          |   7 +-
 kernel/stackleak.c                            |   6 +-
 kernel/time/Kconfig                           |   5 +
 tools/objtool/Documentation/objtool.txt       |  34 ++++++
 tools/objtool/check.c                         | 106 +++++++++++++++---
 tools/objtool/include/objtool/check.h         |   1 +
 tools/objtool/include/objtool/elf.h           |   1 +
 tools/objtool/include/objtool/special.h       |   1 +
 tools/objtool/special.c                       |  15 ++-
 .../selftests/rcutorture/configs/rcu/TREE04   |   1 +
 40 files changed, 557 insertions(+), 84 deletions(-)
 create mode 100644 arch/x86/include/asm/context_tracking_work.h
 create mode 100644 include/linux/context_tracking_work.h
--
2.49.0
call_dest_name() does not get passed the file pointer of validate_call(), which means its invocation of insn_reloc() will always return NULL. Make it take a file pointer.
While at it, make sure call_dest_name() uses arch_dest_reloc_offset(), otherwise it gets the pv_ops[] offset wrong.
Fabricating an intentional warning shows the change; previously:
vmlinux.o: warning: objtool: __flush_tlb_all_noinstr+0x4: call to {dynamic}() leaves .noinstr.text section
now:
vmlinux.o: warning: objtool: __flush_tlb_all_noinstr+0x4: call to pv_ops[1]() leaves .noinstr.text section
Signed-off-by: Valentin Schneider vschneid@redhat.com Acked-by: Josh Poimboeuf jpoimboe@kernel.org --- tools/objtool/check.c | 14 ++++++++------ 1 file changed, 8 insertions(+), 6 deletions(-)
diff --git a/tools/objtool/check.c b/tools/objtool/check.c index 3a411064fa34b..973dfc8fde792 100644 --- a/tools/objtool/check.c +++ b/tools/objtool/check.c @@ -3319,7 +3319,7 @@ static inline bool func_uaccess_safe(struct symbol *func) return false; }
-static inline const char *call_dest_name(struct instruction *insn) +static inline const char *call_dest_name(struct objtool_file *file, struct instruction *insn) { static char pvname[19]; struct reloc *reloc; @@ -3328,9 +3328,9 @@ static inline const char *call_dest_name(struct instruction *insn) if (insn_call_dest(insn)) return insn_call_dest(insn)->name;
- reloc = insn_reloc(NULL, insn); + reloc = insn_reloc(file, insn); if (reloc && !strcmp(reloc->sym->name, "pv_ops")) { - idx = (reloc_addend(reloc) / sizeof(void *)); + idx = (arch_dest_reloc_offset(reloc_addend(reloc)) / sizeof(void *)); snprintf(pvname, sizeof(pvname), "pv_ops[%d]", idx); return pvname; } @@ -3409,17 +3409,19 @@ static int validate_call(struct objtool_file *file, { if (state->noinstr && state->instr <= 0 && !noinstr_call_dest(file, insn, insn_call_dest(insn))) { - WARN_INSN(insn, "call to %s() leaves .noinstr.text section", call_dest_name(insn)); + WARN_INSN(insn, "call to %s() leaves .noinstr.text section", call_dest_name(file, insn)); return 1; }
if (state->uaccess && !func_uaccess_safe(insn_call_dest(insn))) { - WARN_INSN(insn, "call to %s() with UACCESS enabled", call_dest_name(insn)); + WARN_INSN(insn, "call to %s() with UACCESS enabled", + call_dest_name(file, insn)); return 1; }
if (state->df) { - WARN_INSN(insn, "call to %s() with DF set", call_dest_name(insn)); + WARN_INSN(insn, "call to %s() with DF set", + call_dest_name(file, insn)); return 1; }
I had to look into objtool itself to understand what this warning was about; make it more explicit.
Signed-off-by: Valentin Schneider vschneid@redhat.com Acked-by: Josh Poimboeuf jpoimboe@kernel.org --- tools/objtool/check.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/tools/objtool/check.c b/tools/objtool/check.c index 973dfc8fde792..08e73765059fc 100644 --- a/tools/objtool/check.c +++ b/tools/objtool/check.c @@ -3357,7 +3357,7 @@ static bool pv_call_dest(struct objtool_file *file, struct instruction *insn)
list_for_each_entry(target, &file->pv_ops[idx].targets, pv_target) { if (!target->sec->noinstr) { - WARN("pv_ops[%d]: %s", idx, target->name); + WARN("pv_ops[%d]: indirect call to %s() leaves .noinstr.text section", idx, target->name); file->pv_ops[idx].clean = false; } }
A later commit will reduce the size of the RCU watching counter to free up some bits for another purpose. Paul suggested adding a config option to test the extreme case where the counter is reduced to its minimum usable width for rcutorture to poke at, so do that.
Make it only configurable under RCU_EXPERT. While at it, add a comment to explain the layout of context_tracking->state.
Link: http://lore.kernel.org/r/4c2cb573-168f-4806-b1d9-164e8276e66a@paulmck-laptop Suggested-by: Paul E. McKenney paulmck@kernel.org Signed-off-by: Valentin Schneider vschneid@redhat.com Reviewed-by: Paul E. McKenney paulmck@kernel.org Reviewed-by: Frederic Weisbecker frederic@kernel.org --- include/linux/context_tracking_state.h | 44 ++++++++++++++++++++++---- kernel/rcu/Kconfig.debug | 15 +++++++++ 2 files changed, 52 insertions(+), 7 deletions(-)
diff --git a/include/linux/context_tracking_state.h b/include/linux/context_tracking_state.h index 7b8433d5a8efe..0b81248aa03e2 100644 --- a/include/linux/context_tracking_state.h +++ b/include/linux/context_tracking_state.h @@ -18,12 +18,6 @@ enum ctx_state { CT_STATE_MAX = 4, };
-/* Odd value for watching, else even. */ -#define CT_RCU_WATCHING CT_STATE_MAX - -#define CT_STATE_MASK (CT_STATE_MAX - 1) -#define CT_RCU_WATCHING_MASK (~CT_STATE_MASK) - struct context_tracking { #ifdef CONFIG_CONTEXT_TRACKING_USER /* @@ -44,9 +38,45 @@ struct context_tracking { #endif };
+/* + * We cram two different things within the same atomic variable: + * + * CT_RCU_WATCHING_START CT_STATE_START + * | | + * v v + * MSB [ RCU watching counter ][ context_state ] LSB + * ^ ^ + * | | + * CT_RCU_WATCHING_END CT_STATE_END + * + * Bits are used from the LSB upwards, so unused bits (if any) will always be in + * upper bits of the variable. + */ #ifdef CONFIG_CONTEXT_TRACKING +#define CT_SIZE (sizeof(((struct context_tracking *)0)->state) * BITS_PER_BYTE) + +#define CT_STATE_WIDTH bits_per(CT_STATE_MAX - 1) +#define CT_STATE_START 0 +#define CT_STATE_END (CT_STATE_START + CT_STATE_WIDTH - 1) + +#define CT_RCU_WATCHING_MAX_WIDTH (CT_SIZE - CT_STATE_WIDTH) +#define CT_RCU_WATCHING_WIDTH (IS_ENABLED(CONFIG_RCU_DYNTICKS_TORTURE) ? 2 : CT_RCU_WATCHING_MAX_WIDTH) +#define CT_RCU_WATCHING_START (CT_STATE_END + 1) +#define CT_RCU_WATCHING_END (CT_RCU_WATCHING_START + CT_RCU_WATCHING_WIDTH - 1) +#define CT_RCU_WATCHING BIT(CT_RCU_WATCHING_START) + +#define CT_STATE_MASK GENMASK(CT_STATE_END, CT_STATE_START) +#define CT_RCU_WATCHING_MASK GENMASK(CT_RCU_WATCHING_END, CT_RCU_WATCHING_START) + +#define CT_UNUSED_WIDTH (CT_RCU_WATCHING_MAX_WIDTH - CT_RCU_WATCHING_WIDTH) + +static_assert(CT_STATE_WIDTH + + CT_RCU_WATCHING_WIDTH + + CT_UNUSED_WIDTH == + CT_SIZE); + DECLARE_PER_CPU(struct context_tracking, context_tracking); -#endif +#endif /* CONFIG_CONTEXT_TRACKING */
#ifdef CONFIG_CONTEXT_TRACKING_USER static __always_inline int __ct_state(void) diff --git a/kernel/rcu/Kconfig.debug b/kernel/rcu/Kconfig.debug index 12e4c64ebae15..625d75392647b 100644 --- a/kernel/rcu/Kconfig.debug +++ b/kernel/rcu/Kconfig.debug @@ -213,4 +213,19 @@ config RCU_STRICT_GRACE_PERIOD when looking for certain types of RCU usage bugs, for example, too-short RCU read-side critical sections.
+ +config RCU_DYNTICKS_TORTURE + bool "Minimize RCU dynticks counter size" + depends on RCU_EXPERT && !COMPILE_TEST + default n + help + This option sets the width of the dynticks counter to its + minimum usable value. This minimum width greatly increases + the probability of flushing out bugs involving counter wrap, + but it also increases the probability of extending grace period + durations. This Kconfig option should therefore be avoided in + production due to the consequent increased probability of OOMs. + + This has no value for production and is only for testing. + endmenu # "RCU Debugging"
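For reference, assuming a 32-bit ->state field, the bit budget under the layout introduced above works out as follows (illustrative arithmetic, not part of the patch):

    /*
     * With CT_STATE_MAX == 4 and a 32-bit ->state field:
     *
     *   CT_STATE_WIDTH            = bits_per(4 - 1)             = 2
     *   CT_RCU_WATCHING_MAX_WIDTH = 32 - 2                      = 30
     *   CT_RCU_WATCHING_WIDTH     = RCU_DYNTICKS_TORTURE ? 2 : 30
     *   CT_UNUSED_WIDTH           = 30 - CT_RCU_WATCHING_WIDTH  = 28 or 0
     *
     * i.e. the torture option shrinks the watching counter to 2 bits, and
     * the static_assert still holds: 2 + 2 + 28 == 32.
     */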
We now have an RCU_EXPERT config for testing a small-sized RCU dynticks counter: CONFIG_RCU_DYNTICKS_TORTURE.

Modify the TREE04 scenario to use this config in order to exercise a ridiculously small (2-bit) counter.
Link: http://lore.kernel.org/r/4c2cb573-168f-4806-b1d9-164e8276e66a@paulmck-laptop Suggested-by: Paul E. McKenney paulmck@kernel.org Signed-off-by: Valentin Schneider vschneid@redhat.com Reviewed-by: Paul E. McKenney paulmck@kernel.org Reviewed-by: Frederic Weisbecker frederic@kernel.org --- tools/testing/selftests/rcutorture/configs/rcu/TREE04 | 1 + 1 file changed, 1 insertion(+)
diff --git a/tools/testing/selftests/rcutorture/configs/rcu/TREE04 b/tools/testing/selftests/rcutorture/configs/rcu/TREE04 index dc4985064b3ad..67caf4276bb01 100644 --- a/tools/testing/selftests/rcutorture/configs/rcu/TREE04 +++ b/tools/testing/selftests/rcutorture/configs/rcu/TREE04 @@ -16,3 +16,4 @@ CONFIG_DEBUG_OBJECTS_RCU_HEAD=n CONFIG_RCU_EXPERT=y CONFIG_RCU_EQS_DEBUG=y CONFIG_RCU_LAZY=y +CONFIG_RCU_DYNTICKS_TORTURE=y
From: Josh Poimboeuf jpoimboe@kernel.org
Deferring a code patching IPI is unsafe if the patched code is in a noinstr region. In that case the text poke code must trigger an immediate IPI to all CPUs, which can rudely interrupt an isolated NO_HZ CPU running in userspace.
Some noinstr static branches may really need to be patched at runtime, despite the resulting disruption. Add DEFINE_STATIC_KEY_*_NOINSTR() variants for those. They don't do anything special yet; that will come later.
Signed-off-by: Josh Poimboeuf jpoimboe@kernel.org --- include/linux/jump_label.h | 17 +++++++++++++++++ 1 file changed, 17 insertions(+)
diff --git a/include/linux/jump_label.h b/include/linux/jump_label.h index fdb79dd1ebd8c..c4f6240ff4d95 100644 --- a/include/linux/jump_label.h +++ b/include/linux/jump_label.h @@ -388,6 +388,23 @@ struct static_key_false { #define DEFINE_STATIC_KEY_FALSE_RO(name) \ struct static_key_false name __ro_after_init = STATIC_KEY_FALSE_INIT
+/* + * The _NOINSTR variants are used to tell objtool the static key is allowed to + * be used in noinstr code. + * + * They should almost never be used, as they prevent code patching IPIs from + * being deferred, which can be problematic for isolated NOHZ_FULL CPUs running + * in pure userspace. + * + * If using one of these _NOINSTR variants, please add a comment above the + * definition with the rationale. + */ +#define DEFINE_STATIC_KEY_TRUE_NOINSTR(name) \ + DEFINE_STATIC_KEY_TRUE(name) + +#define DEFINE_STATIC_KEY_FALSE_NOINSTR(name) \ + DEFINE_STATIC_KEY_FALSE(name) + #define DECLARE_STATIC_KEY_FALSE(name) \ extern struct static_key_false name
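As an illustrative usage sketch (the key and function names below are made up, not part of this patch), a whitelisted key would carry its rationale right next to its definition:

    /*
     * Flipped at runtime (e.g. on SMT hotplug) *and* read from noinstr code.
     * Hotplug already perturbs every online CPU, so deferring the text
     * patching IPI buys us nothing here.
     */
    DEFINE_STATIC_KEY_FALSE_NOINSTR(example_idle_clear);

    noinstr void example_idle_enter(void)
    {
            if (static_branch_likely(&example_idle_clear))
                    example_clear_cpu_buffers();    /* hypothetical mitigation */
    }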
From: Josh Poimboeuf jpoimboe@kernel.org
Deferring a code patching IPI is unsafe if the patched code is in a noinstr region. In that case the text poke code must trigger an immediate IPI to all CPUs, which can rudely interrupt an isolated NO_HZ CPU running in userspace.
If a noinstr static call only needs to be patched during boot, its key can be made ro-after-init to ensure it will never be patched at runtime.
Signed-off-by: Josh Poimboeuf jpoimboe@kernel.org --- include/linux/static_call.h | 16 ++++++++++++++++ 1 file changed, 16 insertions(+)
diff --git a/include/linux/static_call.h b/include/linux/static_call.h index 78a77a4ae0ea8..ea6ca57e2a829 100644 --- a/include/linux/static_call.h +++ b/include/linux/static_call.h @@ -192,6 +192,14 @@ extern long __static_call_return0(void); }; \ ARCH_DEFINE_STATIC_CALL_TRAMP(name, _func)
+#define DEFINE_STATIC_CALL_RO(name, _func) \ + DECLARE_STATIC_CALL(name, _func); \ + struct static_call_key __ro_after_init STATIC_CALL_KEY(name) = {\ + .func = _func, \ + .type = 1, \ + }; \ + ARCH_DEFINE_STATIC_CALL_TRAMP(name, _func) + #define DEFINE_STATIC_CALL_NULL(name, _func) \ DECLARE_STATIC_CALL(name, _func); \ struct static_call_key STATIC_CALL_KEY(name) = { \ @@ -200,6 +208,14 @@ extern long __static_call_return0(void); }; \ ARCH_DEFINE_STATIC_CALL_NULL_TRAMP(name)
+#define DEFINE_STATIC_CALL_NULL_RO(name, _func) \ + DECLARE_STATIC_CALL(name, _func); \ + struct static_call_key __ro_after_init STATIC_CALL_KEY(name) = {\ + .func = NULL, \ + .type = 1, \ + }; \ + ARCH_DEFINE_STATIC_CALL_NULL_TRAMP(name) + #define DEFINE_STATIC_CALL_RET0(name, _func) \ DECLARE_STATIC_CALL(name, _func); \ struct static_call_key STATIC_CALL_KEY(name) = { \
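For illustration, a hypothetical init-only static call would be defined and updated like so (names are made up; the following patches convert real instances such as pv_sched_clock):

    /* Hypothetical clock hook which is only ever selected during boot. */
    static u64 example_clock_default(void)
    {
            return 0;
    }

    DEFINE_STATIC_CALL_RO(example_clock, example_clock_default);

    void __init example_clock_select(u64 (*fn)(void))
    {
            /* Before mark_rodata_ro(), so the __ro_after_init key is still writable. */
            static_call_update(example_clock, fn);
    }

    /* Callers, noinstr ones included, simply do: static_call(example_clock)(); */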
Later commits will cause objtool to warn about static calls being used in .noinstr sections in order to safely defer instruction patching IPIs targeted at NOHZ_FULL CPUs.
pv_sched_clock is updated in:
o __init vmware_paravirt_ops_setup()
o __init xen_init_time_common()
o kvm_sched_clock_init() <- __init kvmclock_init()
o hv_setup_sched_clock() <- __init hv_init_tsc_clocksource()
IOW purely init context, and can thus be marked as __ro_after_init.
Reported-by: Josh Poimboeuf jpoimboe@kernel.org Signed-off-by: Valentin Schneider vschneid@redhat.com --- arch/x86/kernel/paravirt.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c index 1ccd05d8999f1..0da0ec6cdecfb 100644 --- a/arch/x86/kernel/paravirt.c +++ b/arch/x86/kernel/paravirt.c @@ -68,7 +68,7 @@ static u64 native_steal_clock(int cpu) }
DEFINE_STATIC_CALL(pv_steal_clock, native_steal_clock); -DEFINE_STATIC_CALL(pv_sched_clock, native_sched_clock); +DEFINE_STATIC_CALL_RO(pv_sched_clock, native_sched_clock);
void paravirt_set_sched_clock(u64 (*func)(void)) {
Later commits will cause objtool to warn about static calls being used in .noinstr sections in order to safely defer instruction patching IPIs targeted at NOHZ_FULL CPUs.
x86_idle is updated in:
o xen_set_default_idle() <- __init xen_arch_setup()
o __init select_idle_routine()
IOW purely init context, and can thus be marked as __ro_after_init.
Reported-by: Josh Poimboeuf jpoimboe@kernel.org Signed-off-by: Valentin Schneider vschneid@redhat.com --- arch/x86/kernel/process.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c index 962c3ce39323e..90f31f8526aa4 100644 --- a/arch/x86/kernel/process.c +++ b/arch/x86/kernel/process.c @@ -752,7 +752,7 @@ void __cpuidle default_idle(void) EXPORT_SYMBOL(default_idle); #endif
-DEFINE_STATIC_CALL_NULL(x86_idle, default_idle); +DEFINE_STATIC_CALL_NULL_RO(x86_idle, default_idle);
static bool x86_idle_set(void) {
The static call is only ever updated in
  __init pv_time_init()
  __init xen_init_time_common()
  __init vmware_paravirt_ops_setup()
  __init xen_time_setup_guest()
so mark it appropriately as __ro_after_init.
Signed-off-by: Valentin Schneider vschneid@redhat.com --- arch/x86/kernel/paravirt.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c index 0da0ec6cdecfb..a08b9766b8a36 100644 --- a/arch/x86/kernel/paravirt.c +++ b/arch/x86/kernel/paravirt.c @@ -67,7 +67,7 @@ static u64 native_steal_clock(int cpu) return 0; }
-DEFINE_STATIC_CALL(pv_steal_clock, native_steal_clock); +DEFINE_STATIC_CALL_RO(pv_steal_clock, native_steal_clock); DEFINE_STATIC_CALL_RO(pv_sched_clock, native_sched_clock);
void paravirt_set_sched_clock(u64 (*func)(void))
The static call is only ever updated in:
  __init pv_time_init()
  __init xen_time_setup_guest()
so mark it appropriately as __ro_after_init.
Signed-off-by: Valentin Schneider vschneid@redhat.com Reviewed-by: Andrew Jones ajones@ventanamicro.com --- arch/riscv/kernel/paravirt.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/arch/riscv/kernel/paravirt.c b/arch/riscv/kernel/paravirt.c index fa6b0339a65de..dfe8808016fd8 100644 --- a/arch/riscv/kernel/paravirt.c +++ b/arch/riscv/kernel/paravirt.c @@ -30,7 +30,7 @@ static u64 native_steal_clock(int cpu) return 0; }
-DEFINE_STATIC_CALL(pv_steal_clock, native_steal_clock); +DEFINE_STATIC_CALL_RO(pv_steal_clock, native_steal_clock);
static bool steal_acc = true; static int __init parse_no_stealacc(char *arg)
The static call is only ever updated in
  __init pv_time_init()
  __init xen_time_setup_guest()
so mark it appropriately as __ro_after_init.
Signed-off-by: Valentin Schneider vschneid@redhat.com --- arch/loongarch/kernel/paravirt.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/arch/loongarch/kernel/paravirt.c b/arch/loongarch/kernel/paravirt.c index e5a39bbad0780..b011578d3e931 100644 --- a/arch/loongarch/kernel/paravirt.c +++ b/arch/loongarch/kernel/paravirt.c @@ -20,7 +20,7 @@ static u64 native_steal_clock(int cpu) return 0; }
-DEFINE_STATIC_CALL(pv_steal_clock, native_steal_clock); +DEFINE_STATIC_CALL_RO(pv_steal_clock, native_steal_clock);
static bool steal_acc = true;
The static call is only ever updated in
  __init pv_time_init()
  __init xen_time_setup_guest()
so mark it appropriately as __ro_after_init.
Signed-off-by: Valentin Schneider vschneid@redhat.com --- arch/arm64/kernel/paravirt.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/arch/arm64/kernel/paravirt.c b/arch/arm64/kernel/paravirt.c index aa718d6a9274a..ad28fa23c9228 100644 --- a/arch/arm64/kernel/paravirt.c +++ b/arch/arm64/kernel/paravirt.c @@ -32,7 +32,7 @@ static u64 native_steal_clock(int cpu) return 0; }
-DEFINE_STATIC_CALL(pv_steal_clock, native_steal_clock); +DEFINE_STATIC_CALL_RO(pv_steal_clock, native_steal_clock);
struct pv_time_stolen_time_region { struct pvclock_vcpu_stolen_time __rcu *kaddr;
The static call is only ever updated in
__init xen_time_setup_guest()
so mark it appropriately as __ro_after_init.
Signed-off-by: Valentin Schneider vschneid@redhat.com --- arch/arm/kernel/paravirt.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/arch/arm/kernel/paravirt.c b/arch/arm/kernel/paravirt.c index 7dd9806369fb0..632d8d5e06db3 100644 --- a/arch/arm/kernel/paravirt.c +++ b/arch/arm/kernel/paravirt.c @@ -20,4 +20,4 @@ static u64 native_steal_clock(int cpu) return 0; }
-DEFINE_STATIC_CALL(pv_steal_clock, native_steal_clock); +DEFINE_STATIC_CALL_RO(pv_steal_clock, native_steal_clock);
Later commits will cause objtool to warn about static calls being used in .noinstr sections in order to safely defer instruction patching IPIs targeted at NOHZ_FULL CPUs.
perf_lopwr_cb is used in .noinstr code, but is only ever updated in __init amd_brs_lopwr_init(), and can thus be marked as __ro_after_init.
Reported-by: Josh Poimboeuf jpoimboe@kernel.org Signed-off-by: Valentin Schneider vschneid@redhat.com --- arch/x86/events/amd/brs.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/arch/x86/events/amd/brs.c b/arch/x86/events/amd/brs.c index ec34274633824..218545ffe3e24 100644 --- a/arch/x86/events/amd/brs.c +++ b/arch/x86/events/amd/brs.c @@ -423,7 +423,7 @@ void noinstr perf_amd_brs_lopwr_cb(bool lopwr_in) } }
-DEFINE_STATIC_CALL_NULL(perf_lopwr_cb, perf_amd_brs_lopwr_cb); +DEFINE_STATIC_CALL_NULL_RO(perf_lopwr_cb, perf_amd_brs_lopwr_cb); EXPORT_STATIC_CALL_TRAMP_GPL(perf_lopwr_cb);
void __init amd_brs_lopwr_init(void)
sched_clock_running is only ever enabled in the __init functions sched_clock_init() and sched_clock_init_late(), and is never disabled. Mark it __ro_after_init.
Signed-off-by: Valentin Schneider vschneid@redhat.com --- kernel/sched/clock.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/kernel/sched/clock.c b/kernel/sched/clock.c index a09655b481402..200e5568b9894 100644 --- a/kernel/sched/clock.c +++ b/kernel/sched/clock.c @@ -66,7 +66,7 @@ notrace unsigned long long __weak sched_clock(void) } EXPORT_SYMBOL_GPL(sched_clock);
-static DEFINE_STATIC_KEY_FALSE(sched_clock_running); +static DEFINE_STATIC_KEY_FALSE_RO(sched_clock_running);
#ifdef CONFIG_HAVE_UNSTABLE_SCHED_CLOCK /*
The static key is only ever enabled in
__init hv_init_evmcs()
so mark it appropriately as __ro_after_init.
Reported-by: Sean Christopherson seanjc@google.com Signed-off-by: Valentin Schneider vschneid@redhat.com --- arch/x86/kvm/vmx/vmx_onhyperv.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/arch/x86/kvm/vmx/vmx_onhyperv.c b/arch/x86/kvm/vmx/vmx_onhyperv.c index b9a8b91166d02..ff3d80c9565bb 100644 --- a/arch/x86/kvm/vmx/vmx_onhyperv.c +++ b/arch/x86/kvm/vmx/vmx_onhyperv.c @@ -3,7 +3,7 @@ #include "capabilities.h" #include "vmx_onhyperv.h"
-DEFINE_STATIC_KEY_FALSE(__kvm_is_using_evmcs); +DEFINE_STATIC_KEY_FALSE_RO(__kvm_is_using_evmcs);
/* * KVM on Hyper-V always uses the latest known eVMCSv1 revision, the assumption
Later commits will cause objtool to warn about static keys being used in .noinstr sections in order to safely defer instruction patching IPIs targeted at NOHZ_FULL CPUs.
mds_idle_clear is used in .noinstr code, and can be modified at runtime (SMT hotplug). Suppressing the text_poke_sync() IPI has little benefit for this key, as hotplug implies eventually going through takedown_cpu() -> stop_machine_cpuslocked(), which is going to cause interference on all online CPUs anyway.
Mark it to let objtool know not to warn about it.
Signed-off-by: Valentin Schneider vschneid@redhat.com --- arch/x86/kernel/cpu/bugs.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/arch/x86/kernel/cpu/bugs.c b/arch/x86/kernel/cpu/bugs.c index 362602b705cc4..59a77ca1bb14c 100644 --- a/arch/x86/kernel/cpu/bugs.c +++ b/arch/x86/kernel/cpu/bugs.c @@ -117,7 +117,7 @@ DEFINE_STATIC_KEY_FALSE(switch_vcpu_ibpb); EXPORT_SYMBOL_GPL(switch_vcpu_ibpb);
/* Control MDS CPU buffer clear before idling (halt, mwait) */ -DEFINE_STATIC_KEY_FALSE(mds_idle_clear); +DEFINE_STATIC_KEY_FALSE_NOINSTR(mds_idle_clear); EXPORT_SYMBOL_GPL(mds_idle_clear);
/*
Later commits will cause objtool to warn about static keys being used in .noinstr sections in order to safely defer instruction patching IPIs targeted at NOHZ_FULL CPUs.
__sched_clock_stable is used in .noinstr code, and can be modified at runtime (e.g. time_cpufreq_notifier()). Suppressing the text_poke_sync() IPI has little benefit for this key, as NOHZ_FULL is incompatible with an unstable TSC anyway.
Mark it to let objtool know not to warn about it.
Signed-off-by: Valentin Schneider vschneid@redhat.com --- kernel/sched/clock.c | 5 ++++- 1 file changed, 4 insertions(+), 1 deletion(-)
diff --git a/kernel/sched/clock.c b/kernel/sched/clock.c
index 200e5568b9894..e59986bc14a43 100644
--- a/kernel/sched/clock.c
+++ b/kernel/sched/clock.c
@@ -75,8 +75,11 @@ static DEFINE_STATIC_KEY_FALSE_RO(sched_clock_running);
  *
  * Similarly we start with __sched_clock_stable_early, thereby assuming we
  * will become stable, such that there's only a single 1 -> 0 transition.
+ *
+ * Allowed in .noinstr as an unstable TSC is incompatible with NOHZ_FULL,
+ * thus the text patching IPI would be the least of our concerns.
  */
-static DEFINE_STATIC_KEY_FALSE(__sched_clock_stable);
+static DEFINE_STATIC_KEY_FALSE_NOINSTR(__sched_clock_stable);
 static int __sched_clock_stable_early = 1;
/*
Later commits will cause objtool to warn about static keys being used in .noinstr sections in order to safely defer instruction patching IPIs targeted at NOHZ_FULL CPUs.
These keys are used in .noinstr code, and can be modified at runtime (/proc/kernel/vmx* write). However it is not expected that they will be flipped during latency-sensitive operations, and thus shouldn't be a source of interference wrt the text patching IPI.
Mark them to let objtool know not to warn about them.
Reported-by: Josh Poimboeuf jpoimboe@kernel.org Signed-off-by: Valentin Schneider vschneid@redhat.com --- arch/x86/kvm/vmx/vmx.c | 11 +++++++++-- 1 file changed, 9 insertions(+), 2 deletions(-)
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c index 5c5766467a61d..00053458cd10c 100644 --- a/arch/x86/kvm/vmx/vmx.c +++ b/arch/x86/kvm/vmx/vmx.c @@ -225,8 +225,15 @@ module_param(pt_mode, int, S_IRUGO);
struct x86_pmu_lbr __ro_after_init vmx_lbr_caps;
-static DEFINE_STATIC_KEY_FALSE(vmx_l1d_should_flush); -static DEFINE_STATIC_KEY_FALSE(vmx_l1d_flush_cond); +/* + * Both of these static keys end up being used in .noinstr sections, however + * they are only modified: + * - at init + * - from a /proc/kernel/vmx* write + * thus during latency-sensitive operations they should remain stable. + */ +static DEFINE_STATIC_KEY_FALSE_NOINSTR(vmx_l1d_should_flush); +static DEFINE_STATIC_KEY_FALSE_NOINSTR(vmx_l1d_flush_cond); static DEFINE_MUTEX(vmx_l1d_flush_mutex);
/* Storage for pre module init parameter parsing */
Later commits will cause objtool to warn about static keys being used in .noinstr sections in order to safely defer instruction patching IPIs targeted at NOHZ_FULL CPUs.
stack_erasing_bypass is used in .noinstr code, and can be modified at runtime (/proc/sys/kernel/stack_erasing write). However it is not expected that it will be flipped during latency-sensitive operations, and thus shouldn't be a source of interference wrt the text patching IPI.
Mark it to let objtool know not to warn about it.
Reported-by: Josh Poimboeuf jpoimboe@kernel.org Signed-off-by: Valentin Schneider vschneid@redhat.com --- kernel/stackleak.c | 6 +++++- 1 file changed, 5 insertions(+), 1 deletion(-)
diff --git a/kernel/stackleak.c b/kernel/stackleak.c index bb65321761b43..51b24d1e04ba2 100644 --- a/kernel/stackleak.c +++ b/kernel/stackleak.c @@ -19,7 +19,11 @@ #include <linux/sysctl.h> #include <linux/init.h>
-static DEFINE_STATIC_KEY_FALSE(stack_erasing_bypass);
+/*
+ * This static key can only be modified via its sysctl interface. It is
+ * expected it will remain stable during latency-sensitive operations.
+ */
+static DEFINE_STATIC_KEY_FALSE_NOINSTR(stack_erasing_bypass);
#ifdef CONFIG_SYSCTL static int stack_erasing_sysctl(const struct ctl_table *table, int write,
From: Josh Poimboeuf jpoimboe@kernel.org
Warn about static branches/calls in noinstr regions, unless the corresponding key is RO-after-init or has been manually whitelisted with DEFINE_STATIC_KEY_*_NOINSTR().
Signed-off-by: Josh Poimboeuf jpoimboe@kernel.org [Added NULL check for insn_call_dest() return value] Signed-off-by: Valentin Schneider vschneid@redhat.com --- include/linux/jump_label.h | 17 +++-- include/linux/objtool.h | 7 ++ include/linux/static_call.h | 3 + tools/objtool/Documentation/objtool.txt | 34 +++++++++ tools/objtool/check.c | 92 ++++++++++++++++++++++--- tools/objtool/include/objtool/check.h | 1 + tools/objtool/include/objtool/elf.h | 1 + tools/objtool/include/objtool/special.h | 1 + tools/objtool/special.c | 15 +++- 9 files changed, 155 insertions(+), 16 deletions(-)
diff --git a/include/linux/jump_label.h b/include/linux/jump_label.h index c4f6240ff4d95..0ea203ebbc493 100644 --- a/include/linux/jump_label.h +++ b/include/linux/jump_label.h @@ -76,6 +76,7 @@ #include <linux/types.h> #include <linux/compiler.h> #include <linux/cleanup.h> +#include <linux/objtool.h>
extern bool static_key_initialized;
@@ -376,8 +377,9 @@ struct static_key_false { #define DEFINE_STATIC_KEY_TRUE(name) \ struct static_key_true name = STATIC_KEY_TRUE_INIT
-#define DEFINE_STATIC_KEY_TRUE_RO(name) \ - struct static_key_true name __ro_after_init = STATIC_KEY_TRUE_INIT +#define DEFINE_STATIC_KEY_TRUE_RO(name) \ + struct static_key_true name __ro_after_init = STATIC_KEY_TRUE_INIT; \ + ANNOTATE_NOINSTR_ALLOWED(name)
#define DECLARE_STATIC_KEY_TRUE(name) \ extern struct static_key_true name @@ -385,8 +387,9 @@ struct static_key_false { #define DEFINE_STATIC_KEY_FALSE(name) \ struct static_key_false name = STATIC_KEY_FALSE_INIT
-#define DEFINE_STATIC_KEY_FALSE_RO(name) \ - struct static_key_false name __ro_after_init = STATIC_KEY_FALSE_INIT +#define DEFINE_STATIC_KEY_FALSE_RO(name) \ + struct static_key_false name __ro_after_init = STATIC_KEY_FALSE_INIT; \ + ANNOTATE_NOINSTR_ALLOWED(name)
/* * The _NOINSTR variants are used to tell objtool the static key is allowed to @@ -400,10 +403,12 @@ struct static_key_false { * definition with the rationale. */ #define DEFINE_STATIC_KEY_TRUE_NOINSTR(name) \ - DEFINE_STATIC_KEY_TRUE(name) + DEFINE_STATIC_KEY_TRUE(name); \ + ANNOTATE_NOINSTR_ALLOWED(name)
#define DEFINE_STATIC_KEY_FALSE_NOINSTR(name) \ - DEFINE_STATIC_KEY_FALSE(name) + DEFINE_STATIC_KEY_FALSE(name); \ + ANNOTATE_NOINSTR_ALLOWED(name)
#define DECLARE_STATIC_KEY_FALSE(name) \ extern struct static_key_false name diff --git a/include/linux/objtool.h b/include/linux/objtool.h index 366ad004d794b..2d3661de4cf95 100644 --- a/include/linux/objtool.h +++ b/include/linux/objtool.h @@ -34,6 +34,12 @@ static void __used __section(".discard.func_stack_frame_non_standard") \ *__func_stack_frame_non_standard_##func = func
+#define __ANNOTATE_NOINSTR_ALLOWED(key) \ + static void __used __section(".discard.noinstr_allowed") \ + *__annotate_noinstr_allowed_##key = &key + +#define ANNOTATE_NOINSTR_ALLOWED(key) __ANNOTATE_NOINSTR_ALLOWED(key) + /* * STACK_FRAME_NON_STANDARD_FP() is a frame-pointer-specific function ignore * for the case where a function is intentionally missing frame pointer setup, @@ -130,6 +136,7 @@ #define STACK_FRAME_NON_STANDARD_FP(func) #define __ASM_ANNOTATE(label, type) "" #define ASM_ANNOTATE(type) +#define ANNOTATE_NOINSTR_ALLOWED(key) #else .macro UNWIND_HINT type:req sp_reg=0 sp_offset=0 signal=0 .endm diff --git a/include/linux/static_call.h b/include/linux/static_call.h index ea6ca57e2a829..0d4b16d348501 100644 --- a/include/linux/static_call.h +++ b/include/linux/static_call.h @@ -133,6 +133,7 @@
#include <linux/types.h> #include <linux/cpu.h> +#include <linux/objtool.h> #include <linux/static_call_types.h>
#ifdef CONFIG_HAVE_STATIC_CALL @@ -198,6 +199,7 @@ extern long __static_call_return0(void); .func = _func, \ .type = 1, \ }; \ + ANNOTATE_NOINSTR_ALLOWED(STATIC_CALL_TRAMP(name)); \ ARCH_DEFINE_STATIC_CALL_TRAMP(name, _func)
#define DEFINE_STATIC_CALL_NULL(name, _func) \ @@ -214,6 +216,7 @@ extern long __static_call_return0(void); .func = NULL, \ .type = 1, \ }; \ + ANNOTATE_NOINSTR_ALLOWED(STATIC_CALL_TRAMP(name)); \ ARCH_DEFINE_STATIC_CALL_NULL_TRAMP(name)
#define DEFINE_STATIC_CALL_RET0(name, _func) \ diff --git a/tools/objtool/Documentation/objtool.txt b/tools/objtool/Documentation/objtool.txt index 9e97fc25b2d8a..991e085e10d95 100644 --- a/tools/objtool/Documentation/objtool.txt +++ b/tools/objtool/Documentation/objtool.txt @@ -456,6 +456,40 @@ the objtool maintainers. these special names and does not use module_init() / module_exit() macros to create them.
+13. file.o: warning: func()+0x2a: key: non-RO static key usage in noinstr code + file.o: warning: func()+0x2a: key: non-RO static call usage in noinstr code + + This means that noinstr function func() uses a static key or + static call named 'key' which can be modified at runtime. This is + discouraged because it prevents code patching IPIs from being + deferred. + + You have the following options: + + 1) Check whether the static key/call in question is only modified + during init. If so, define it as read-only-after-init with + DEFINE_STATIC_KEY_*_RO() or DEFINE_STATIC_CALL_RO(). + + 2) Avoid the runtime patching. For static keys this can be done by + using static_key_enabled() or by getting rid of the static key + altogether if performance is not a concern. + + For static calls, something like the following could be done: + + target = static_call_query(foo); + if (target == func1) + func1(); + else if (target == func2) + func2(); + ... + + 3) Silence the warning by defining the static key/call with + DEFINE_STATIC_*_NOINSTR(). This decision should not + be taken lightly as it may result in code patching IPIs getting + sent to isolated NOHZ_FULL CPUs running in pure userspace. A + comment should be added above the definition explaining the + rationale for the decision. +
If the error doesn't seem to make sense, it could be a bug in objtool. Feel free to ask objtool maintainers for help. diff --git a/tools/objtool/check.c b/tools/objtool/check.c index 08e73765059fc..85f50777d7219 100644 --- a/tools/objtool/check.c +++ b/tools/objtool/check.c @@ -978,6 +978,45 @@ static int create_direct_call_sections(struct objtool_file *file) return 0; }
+static int read_noinstr_allowed(struct objtool_file *file) +{ + struct section *rsec; + struct symbol *sym; + struct reloc *reloc; + + rsec = find_section_by_name(file->elf, ".rela.discard.noinstr_allowed"); + if (!rsec) + return 0; + + for_each_reloc(rsec, reloc) { + switch (reloc->sym->type) { + case STT_OBJECT: + case STT_FUNC: + sym = reloc->sym; + break; + + case STT_SECTION: + sym = find_symbol_by_offset(reloc->sym->sec, + reloc_addend(reloc)); + if (!sym) { + WARN_FUNC(reloc->sym->sec, reloc_addend(reloc), + "can't find static key/call symbol"); + return -1; + } + break; + + default: + WARN("unexpected relocation symbol type in %s: %d", + rsec->name, reloc->sym->type); + return -1; + } + + sym->noinstr_allowed = 1; + } + + return 0; +} + /* * Warnings shouldn't be reported for ignored functions. */ @@ -1864,6 +1903,8 @@ static int handle_jump_alt(struct objtool_file *file, return -1; }
+ orig_insn->key = special_alt->key; + if (opts.hack_jump_label && special_alt->key_addend & 2) { struct reloc *reloc = insn_reloc(file, orig_insn);
@@ -2596,6 +2637,10 @@ static int decode_sections(struct objtool_file *file) if (ret) return ret;
+ ret = read_noinstr_allowed(file); + if (ret) + return ret; + return 0; }
@@ -3365,9 +3410,9 @@ static bool pv_call_dest(struct objtool_file *file, struct instruction *insn) return file->pv_ops[idx].clean; }
-static inline bool noinstr_call_dest(struct objtool_file *file, - struct instruction *insn, - struct symbol *func) +static inline bool noinstr_call_allowed(struct objtool_file *file, + struct instruction *insn, + struct symbol *func) { /* * We can't deal with indirect function calls at present; @@ -3387,10 +3432,10 @@ static inline bool noinstr_call_dest(struct objtool_file *file, return true;
/* - * If the symbol is a static_call trampoline, we can't tell. + * Only DEFINE_STATIC_CALL_*_RO allowed. */ if (func->static_call_tramp) - return true; + return func->noinstr_allowed;
/* * The __ubsan_handle_*() calls are like WARN(), they only happen when @@ -3403,14 +3448,29 @@ static inline bool noinstr_call_dest(struct objtool_file *file, return false; }
+static char *static_call_name(struct symbol *func) +{ + return func->name + strlen("__SCT__"); +} + static int validate_call(struct objtool_file *file, struct instruction *insn, struct insn_state *state) { - if (state->noinstr && state->instr <= 0 && - !noinstr_call_dest(file, insn, insn_call_dest(insn))) { - WARN_INSN(insn, "call to %s() leaves .noinstr.text section", call_dest_name(file, insn)); - return 1; + if (state->noinstr && state->instr <= 0) { + struct symbol *dest = insn_call_dest(insn); + + if (dest && dest->static_call_tramp) { + if (!dest->noinstr_allowed) { + WARN_INSN(insn, "%s: non-RO static call usage in noinstr", + static_call_name(dest)); + } + + } else if (dest && !noinstr_call_allowed(file, insn, dest)) { + WARN_INSN(insn, "call to %s() leaves .noinstr.text section", + call_dest_name(file, insn)); + return 1; + } }
if (state->uaccess && !func_uaccess_safe(insn_call_dest(insn))) { @@ -3475,6 +3535,17 @@ static int validate_return(struct symbol *func, struct instruction *insn, struct return 0; }
+static int validate_static_key(struct instruction *insn, struct insn_state *state) +{ + if (state->noinstr && state->instr <= 0 && !insn->key->noinstr_allowed) { + WARN_INSN(insn, "%s: non-RO static key usage in noinstr", + insn->key->name); + return 1; + } + + return 0; +} + static struct instruction *next_insn_to_validate(struct objtool_file *file, struct instruction *insn) { @@ -3662,6 +3733,9 @@ static int validate_branch(struct objtool_file *file, struct symbol *func, if (handle_insn_ops(insn, next_insn, &state)) return 1;
+ if (insn->key) + validate_static_key(insn, &state); + switch (insn->type) {
case INSN_RETURN: diff --git a/tools/objtool/include/objtool/check.h b/tools/objtool/include/objtool/check.h index 00fb745e72339..d79b08f55bcbc 100644 --- a/tools/objtool/include/objtool/check.h +++ b/tools/objtool/include/objtool/check.h @@ -81,6 +81,7 @@ struct instruction { struct symbol *sym; struct stack_op *stack_ops; struct cfi_state *cfi; + struct symbol *key; };
static inline struct symbol *insn_func(struct instruction *insn) diff --git a/tools/objtool/include/objtool/elf.h b/tools/objtool/include/objtool/elf.h index c7c4e87ebe882..b3c11869931fd 100644 --- a/tools/objtool/include/objtool/elf.h +++ b/tools/objtool/include/objtool/elf.h @@ -70,6 +70,7 @@ struct symbol { u8 local_label : 1; u8 frame_pointer : 1; u8 ignore : 1; + u8 noinstr_allowed : 1; struct list_head pv_target; struct reloc *relocs; }; diff --git a/tools/objtool/include/objtool/special.h b/tools/objtool/include/objtool/special.h index 72d09c0adf1a1..e84d704f3f20e 100644 --- a/tools/objtool/include/objtool/special.h +++ b/tools/objtool/include/objtool/special.h @@ -18,6 +18,7 @@ struct special_alt { bool group; bool jump_or_nop; u8 key_addend; + struct symbol *key;
struct section *orig_sec; unsigned long orig_off; diff --git a/tools/objtool/special.c b/tools/objtool/special.c index c80fed8a840ee..d77f3fa4bbbc9 100644 --- a/tools/objtool/special.c +++ b/tools/objtool/special.c @@ -110,13 +110,26 @@ static int get_alt_entry(struct elf *elf, const struct special_entry *entry,
if (entry->key) { struct reloc *key_reloc; + struct symbol *key; + s64 key_addend;
key_reloc = find_reloc_by_dest(elf, sec, offset + entry->key); if (!key_reloc) { ERROR_FUNC(sec, offset + entry->key, "can't find key reloc"); return -1; } - alt->key_addend = reloc_addend(key_reloc); + + key = key_reloc->sym; + key_addend = reloc_addend(key_reloc); + + if (key->type == STT_SECTION) + key = find_symbol_by_offset(key->sec, key_addend & ~3); + + /* embedded keys not supported */ + if (key) { + alt->key = key; + alt->key_addend = key_addend; + } }
return 0;
The text_size bit referred to by the comment has been removed as of commit
ac3b43283923 ("module: replace module_layout with module_memory")
and is thus no longer relevant. Remove it and comment about the contents of the masks array instead.
Signed-off-by: Valentin Schneider vschneid@redhat.com --- kernel/module/main.c | 9 ++++----- 1 file changed, 4 insertions(+), 5 deletions(-)
diff --git a/kernel/module/main.c b/kernel/module/main.c index a2859dc3eea66..b9f010daaa4c7 100644 --- a/kernel/module/main.c +++ b/kernel/module/main.c @@ -1562,12 +1562,11 @@ static void __layout_sections(struct module *mod, struct load_info *info, bool i { unsigned int m, i;
+ /* + * { Mask of required section header flags, + * Mask of excluded section header flags } + */ static const unsigned long masks[][2] = { - /* - * NOTE: all executable code must be the first section - * in this array; otherwise modify the text_size - * finder in the two loops below - */ { SHF_EXECINSTR | SHF_ALLOC, ARCH_SHF_SMALL }, { SHF_ALLOC, SHF_WRITE | ARCH_SHF_SMALL }, { SHF_RO_AFTER_INIT | SHF_ALLOC, ARCH_SHF_SMALL },
On 4/29/25 13:32, Valentin Schneider wrote:
> The text_size bit referred to by the comment has been removed as of commit
>
>   ac3b43283923 ("module: replace module_layout with module_memory")
>
> and is thus no longer relevant. Remove it and comment about the contents
> of the masks array instead.
>
> Signed-off-by: Valentin Schneider <vschneid@redhat.com>

This comment cleanup is independent of the rest of the series. I've picked
it separately on modules-next.
As pointed out by Sean [1], is_kernel_noinstr_text() will return false for an address contained within a module's .noinstr.text section. A later patch will require checking whether a text address is noinstr, and this can unfortunately be the case of modules - KVM is one such case.
A module's .noinstr.text section is already tracked as of commit 66e9b0717102 ("kprobes: Prevent probes in .noinstr.text section") for kprobe blacklisting purposes, but via an ad-hoc mechanism.
Add a MOD_NOINSTR_TEXT mem_type, and reorganize __layout_sections() so that it maps all the sections in a single invocation.
[1]: http://lore.kernel.org/r/Z4qQL89GZ_gk0vpu@google.com Signed-off-by: Valentin Schneider vschneid@redhat.com --- include/linux/module.h | 6 ++-- kernel/kprobes.c | 8 ++--- kernel/module/main.c | 76 ++++++++++++++++++++++++++++++++---------- 3 files changed, 66 insertions(+), 24 deletions(-)
diff --git a/include/linux/module.h b/include/linux/module.h index d94b196d5a34e..193d8d34eeee0 100644 --- a/include/linux/module.h +++ b/include/linux/module.h @@ -332,6 +332,7 @@ struct mod_tree_node {
enum mod_mem_type { MOD_TEXT = 0, + MOD_NOINSTR_TEXT, MOD_DATA, MOD_RODATA, MOD_RO_AFTER_INIT, @@ -502,8 +503,6 @@ struct module { void __percpu *percpu; unsigned int percpu_size; #endif - void *noinstr_text_start; - unsigned int noinstr_text_size;
#ifdef CONFIG_TRACEPOINTS unsigned int num_tracepoints; @@ -622,12 +621,13 @@ static inline bool module_is_coming(struct module *mod) return mod->state == MODULE_STATE_COMING; }
-struct module *__module_text_address(unsigned long addr); struct module *__module_address(unsigned long addr); +struct module *__module_text_address(unsigned long addr); bool is_module_address(unsigned long addr); bool __is_module_percpu_address(unsigned long addr, unsigned long *can_addr); bool is_module_percpu_address(unsigned long addr); bool is_module_text_address(unsigned long addr); +bool is_module_noinstr_text_address(unsigned long addr);
static inline bool within_module_mem_type(unsigned long addr, const struct module *mod, diff --git a/kernel/kprobes.c b/kernel/kprobes.c index ffe0c3d523063..9a799faee68a1 100644 --- a/kernel/kprobes.c +++ b/kernel/kprobes.c @@ -2547,9 +2547,9 @@ static void add_module_kprobe_blacklist(struct module *mod) kprobe_add_area_blacklist(start, end); }
- start = (unsigned long)mod->noinstr_text_start; + start = (unsigned long)mod->mem[MOD_NOINSTR_TEXT].base; if (start) { - end = start + mod->noinstr_text_size; + end = start + mod->mem[MOD_NOINSTR_TEXT].size; kprobe_add_area_blacklist(start, end); } } @@ -2570,9 +2570,9 @@ static void remove_module_kprobe_blacklist(struct module *mod) kprobe_remove_area_blacklist(start, end); }
- start = (unsigned long)mod->noinstr_text_start; + start = (unsigned long)mod->mem[MOD_NOINSTR_TEXT].base; if (start) { - end = start + mod->noinstr_text_size; + end = start + mod->mem[MOD_NOINSTR_TEXT].size; kprobe_remove_area_blacklist(start, end); } } diff --git a/kernel/module/main.c b/kernel/module/main.c index b9f010daaa4c7..0126bae64b698 100644 --- a/kernel/module/main.c +++ b/kernel/module/main.c @@ -1558,7 +1558,17 @@ bool module_init_layout_section(const char *sname) return module_init_section(sname); }
-static void __layout_sections(struct module *mod, struct load_info *info, bool is_init) +static bool module_noinstr_layout_section(const char *sname) +{ + return strstarts(sname, ".noinstr"); +} + +static bool module_default_layout_section(const char *sname) +{ + return !module_init_layout_section(sname) && !module_noinstr_layout_section(sname); +} + +static void __layout_sections(struct module *mod, struct load_info *info) { unsigned int m, i;
@@ -1567,20 +1577,44 @@ static void __layout_sections(struct module *mod, struct load_info *info, bool i * Mask of excluded section header flags } */ static const unsigned long masks[][2] = { + /* Core */ + { SHF_EXECINSTR | SHF_ALLOC, ARCH_SHF_SMALL }, + { SHF_EXECINSTR | SHF_ALLOC, ARCH_SHF_SMALL }, + { SHF_ALLOC, SHF_WRITE | ARCH_SHF_SMALL }, + { SHF_RO_AFTER_INIT | SHF_ALLOC, ARCH_SHF_SMALL }, + { SHF_WRITE | SHF_ALLOC, ARCH_SHF_SMALL }, + { ARCH_SHF_SMALL | SHF_ALLOC, 0 }, + /* Init */ { SHF_EXECINSTR | SHF_ALLOC, ARCH_SHF_SMALL }, { SHF_ALLOC, SHF_WRITE | ARCH_SHF_SMALL }, { SHF_RO_AFTER_INIT | SHF_ALLOC, ARCH_SHF_SMALL }, { SHF_WRITE | SHF_ALLOC, ARCH_SHF_SMALL }, - { ARCH_SHF_SMALL | SHF_ALLOC, 0 } + { ARCH_SHF_SMALL | SHF_ALLOC, 0 }, }; - static const int core_m_to_mem_type[] = { + static bool (*const section_filter[])(const char *) = { + /* Core */ + module_default_layout_section, + module_noinstr_layout_section, + module_default_layout_section, + module_default_layout_section, + module_default_layout_section, + module_default_layout_section, + /* Init */ + module_init_layout_section, + module_init_layout_section, + module_init_layout_section, + module_init_layout_section, + module_init_layout_section, + }; + static const int mem_type_map[] = { + /* Core */ MOD_TEXT, + MOD_NOINSTR_TEXT, MOD_RODATA, MOD_RO_AFTER_INIT, MOD_DATA, MOD_DATA, - }; - static const int init_m_to_mem_type[] = { + /* Init */ MOD_INIT_TEXT, MOD_INIT_RODATA, MOD_INVALID, @@ -1589,16 +1623,16 @@ static void __layout_sections(struct module *mod, struct load_info *info, bool i };
for (m = 0; m < ARRAY_SIZE(masks); ++m) { - enum mod_mem_type type = is_init ? init_m_to_mem_type[m] : core_m_to_mem_type[m]; + enum mod_mem_type type = mem_type_map[m];
for (i = 0; i < info->hdr->e_shnum; ++i) { Elf_Shdr *s = &info->sechdrs[i]; const char *sname = info->secstrings + s->sh_name;
- if ((s->sh_flags & masks[m][0]) != masks[m][0] - || (s->sh_flags & masks[m][1]) - || s->sh_entsize != ~0UL - || is_init != module_init_layout_section(sname)) + if ((s->sh_flags & masks[m][0]) != masks[m][0] || + (s->sh_flags & masks[m][1]) || + s->sh_entsize != ~0UL || + !section_filter[m](sname)) continue;
if (WARN_ON_ONCE(type == MOD_INVALID)) @@ -1638,10 +1672,7 @@ static void layout_sections(struct module *mod, struct load_info *info) info->sechdrs[i].sh_entsize = ~0UL;
pr_debug("Core section allocation order for %s:\n", mod->name); - __layout_sections(mod, info, false); - - pr_debug("Init section allocation order for %s:\n", mod->name); - __layout_sections(mod, info, true); + __layout_sections(mod, info); }
static void module_license_taint_check(struct module *mod, const char *license) @@ -2515,9 +2546,6 @@ static int find_module_sections(struct module *mod, struct load_info *info) } #endif
- mod->noinstr_text_start = section_objs(info, ".noinstr.text", 1, - &mod->noinstr_text_size); - #ifdef CONFIG_TRACEPOINTS mod->tracepoints_ptrs = section_objs(info, "__tracepoints_ptrs", sizeof(*mod->tracepoints_ptrs), @@ -3769,12 +3797,26 @@ struct module *__module_text_address(unsigned long addr) if (mod) { /* Make sure it's within the text section. */ if (!within_module_mem_type(addr, mod, MOD_TEXT) && + !within_module_mem_type(addr, mod, MOD_NOINSTR_TEXT) && !within_module_mem_type(addr, mod, MOD_INIT_TEXT)) mod = NULL; } return mod; }
+bool is_module_noinstr_text_address(unsigned long addr) +{ + scoped_guard(preempt) { + struct module *mod = __module_address(addr); + + /* Make sure it's within the .noinstr.text section. */ + if (mod) + return within_module_mem_type(addr, mod, MOD_NOINSTR_TEXT); + } + + return false; +} + /* Don't grab lock, we're oopsing. */ void print_modules(void) {
smp_call_function() & friends have the unfortunate habit of sending IPIs to isolated, NOHZ_FULL, in-userspace CPUs, as they blindly target all online CPUs.
Some callsites can be bent into doing the right thing, such as done by commit:
cc9e303c91f5 ("x86/cpu: Disable frequency requests via aperfmperf IPI for nohz_full CPUs")
Unfortunately, not all SMP callbacks can be omitted in this fashion. However, some of them only affect execution in kernelspace, which means they don't have to be executed *immediately* if the target CPU is in userspace: stashing the callback and executing it upon the next kernel entry would suffice. x86 kernel instruction patching and kernel TLB invalidation are prime examples of this.
Reduce the RCU dynticks counter width to free up some bits to be used as a deferred callback bitmask. Add some build-time checks to validate that setup.
Presence of CT_RCU_WATCHING in the ct_state (i.e. the target CPU is executing in the kernel) prevents queuing deferred work.
Later commits introduce the bit:callback mappings.
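For reference, the intended send-side usage - condensed from the text-patching patch later in this series, which introduces the CT_WORK_SYNC work type - boils down to using ct_set_cpu_work() as an on_each_cpu_cond() predicate, so the IPI is only sent to CPUs for which the work could not be deferred:

#include <linux/context_tracking.h>
#include <linux/smp.h>
#include <asm/sync_core.h>

static void do_sync_core(void *info)
{
	sync_core();
}

/* Return true if @cpu still needs the IPI, false if the work was deferred. */
static bool do_sync_core_defer_cond(int cpu, void *info)
{
	return !ct_set_cpu_work(cpu, CT_WORK_SYNC);
}

static void text_poke_sync_deferrable_sketch(void)
{
	on_each_cpu_cond(do_sync_core_defer_cond, do_sync_core, NULL, 1);
}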
Link: https://lore.kernel.org/all/20210929151723.162004989@infradead.org/ Signed-off-by: Nicolas Saenz Julienne nsaenzju@redhat.com Signed-off-by: Valentin Schneider vschneid@redhat.com --- arch/Kconfig | 9 +++ arch/x86/Kconfig | 1 + arch/x86/include/asm/context_tracking_work.h | 16 +++++ include/linux/context_tracking.h | 21 ++++++ include/linux/context_tracking_state.h | 30 ++++++--- include/linux/context_tracking_work.h | 26 ++++++++ kernel/context_tracking.c | 69 +++++++++++++++++++- kernel/time/Kconfig | 5 ++ 8 files changed, 165 insertions(+), 12 deletions(-) create mode 100644 arch/x86/include/asm/context_tracking_work.h create mode 100644 include/linux/context_tracking_work.h
diff --git a/arch/Kconfig b/arch/Kconfig index b0adb665041f1..e363fc0dc1f88 100644 --- a/arch/Kconfig +++ b/arch/Kconfig @@ -952,6 +952,15 @@ config HAVE_CONTEXT_TRACKING_USER_OFFSTACK - No use of instrumentation, unless instrumentation_begin() got called.
+config HAVE_CONTEXT_TRACKING_WORK + bool + help + Architecture supports deferring work while not in kernel context. + This is especially useful on setups with isolated CPUs that might + want to avoid being interrupted to perform housekeeping tasks (for + ex. TLB invalidation or icache invalidation). The housekeeping + operations are performed upon re-entering the kernel. + config HAVE_TIF_NOHZ bool help diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig index 4b9f378e05f6b..c3fbcbee07788 100644 --- a/arch/x86/Kconfig +++ b/arch/x86/Kconfig @@ -222,6 +222,7 @@ config X86 select HAVE_CMPXCHG_LOCAL select HAVE_CONTEXT_TRACKING_USER if X86_64 select HAVE_CONTEXT_TRACKING_USER_OFFSTACK if HAVE_CONTEXT_TRACKING_USER + select HAVE_CONTEXT_TRACKING_WORK if X86_64 select HAVE_C_RECORDMCOUNT select HAVE_OBJTOOL_MCOUNT if HAVE_OBJTOOL select HAVE_OBJTOOL_NOP_MCOUNT if HAVE_OBJTOOL_MCOUNT diff --git a/arch/x86/include/asm/context_tracking_work.h b/arch/x86/include/asm/context_tracking_work.h new file mode 100644 index 0000000000000..5f3b2d0977235 --- /dev/null +++ b/arch/x86/include/asm/context_tracking_work.h @@ -0,0 +1,16 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +#ifndef _ASM_X86_CONTEXT_TRACKING_WORK_H +#define _ASM_X86_CONTEXT_TRACKING_WORK_H + +static __always_inline void arch_context_tracking_work(enum ct_work work) +{ + switch (work) { + case CT_WORK_n: + // Do work... + break; + case CT_WORK_MAX: + WARN_ON_ONCE(true); + } +} + +#endif diff --git a/include/linux/context_tracking.h b/include/linux/context_tracking.h index af9fe87a09225..0b0faa040e9b5 100644 --- a/include/linux/context_tracking.h +++ b/include/linux/context_tracking.h @@ -5,6 +5,7 @@ #include <linux/sched.h> #include <linux/vtime.h> #include <linux/context_tracking_state.h> +#include <linux/context_tracking_work.h> #include <linux/instrumentation.h>
#include <asm/ptrace.h> @@ -137,6 +138,26 @@ static __always_inline unsigned long ct_state_inc(int incby) return raw_atomic_add_return(incby, this_cpu_ptr(&context_tracking.state)); }
+#ifdef CONFIG_CONTEXT_TRACKING_WORK +static __always_inline unsigned long ct_state_inc_clear_work(int incby) +{ + struct context_tracking *ct = this_cpu_ptr(&context_tracking); + unsigned long new, old, state; + + state = arch_atomic_read(&ct->state); + do { + old = state; + new = old & ~CT_WORK_MASK; + new += incby; + state = arch_atomic_cmpxchg(&ct->state, old, new); + } while (old != state); + + return new; +} +#else +#define ct_state_inc_clear_work(x) ct_state_inc(x) +#endif + static __always_inline bool warn_rcu_enter(void) { bool ret = false; diff --git a/include/linux/context_tracking_state.h b/include/linux/context_tracking_state.h index 0b81248aa03e2..d2c302133672f 100644 --- a/include/linux/context_tracking_state.h +++ b/include/linux/context_tracking_state.h @@ -5,6 +5,7 @@ #include <linux/percpu.h> #include <linux/static_key.h> #include <linux/context_tracking_irq.h> +#include <linux/context_tracking_work.h>
/* Offset to allow distinguishing irq vs. task-based idle entry/exit. */ #define CT_NESTING_IRQ_NONIDLE ((LONG_MAX / 2) + 1) @@ -39,16 +40,19 @@ struct context_tracking { };
/* - * We cram two different things within the same atomic variable: + * We cram up to three different things within the same atomic variable: * - * CT_RCU_WATCHING_START CT_STATE_START - * | | - * v v - * MSB [ RCU watching counter ][ context_state ] LSB - * ^ ^ - * | | - * CT_RCU_WATCHING_END CT_STATE_END + * CT_RCU_WATCHING_START CT_STATE_START + * | CT_WORK_START | + * | | | + * v v v + * MSB [ RCU watching counter ][ context work ][ context_state ] LSB + * ^ ^ ^ + * | | | + * | CT_WORK_END | + * CT_RCU_WATCHING_END CT_STATE_END * + * The [ context work ] region spans 0 bits if CONFIG_CONTEXT_WORK=n * Bits are used from the LSB upwards, so unused bits (if any) will always be in * upper bits of the variable. */ @@ -59,18 +63,24 @@ struct context_tracking { #define CT_STATE_START 0 #define CT_STATE_END (CT_STATE_START + CT_STATE_WIDTH - 1)
-#define CT_RCU_WATCHING_MAX_WIDTH (CT_SIZE - CT_STATE_WIDTH) +#define CT_WORK_WIDTH (IS_ENABLED(CONFIG_CONTEXT_TRACKING_WORK) ? CT_WORK_MAX_OFFSET : 0) +#define CT_WORK_START (CT_STATE_END + 1) +#define CT_WORK_END (CT_WORK_START + CT_WORK_WIDTH - 1) + +#define CT_RCU_WATCHING_MAX_WIDTH (CT_SIZE - CT_WORK_WIDTH - CT_STATE_WIDTH) #define CT_RCU_WATCHING_WIDTH (IS_ENABLED(CONFIG_RCU_DYNTICKS_TORTURE) ? 2 : CT_RCU_WATCHING_MAX_WIDTH) -#define CT_RCU_WATCHING_START (CT_STATE_END + 1) +#define CT_RCU_WATCHING_START (CT_WORK_END + 1) #define CT_RCU_WATCHING_END (CT_RCU_WATCHING_START + CT_RCU_WATCHING_WIDTH - 1) #define CT_RCU_WATCHING BIT(CT_RCU_WATCHING_START)
#define CT_STATE_MASK GENMASK(CT_STATE_END, CT_STATE_START) +#define CT_WORK_MASK GENMASK(CT_WORK_END, CT_WORK_START) #define CT_RCU_WATCHING_MASK GENMASK(CT_RCU_WATCHING_END, CT_RCU_WATCHING_START)
#define CT_UNUSED_WIDTH (CT_RCU_WATCHING_MAX_WIDTH - CT_RCU_WATCHING_WIDTH)
static_assert(CT_STATE_WIDTH + + CT_WORK_WIDTH + CT_RCU_WATCHING_WIDTH + CT_UNUSED_WIDTH == CT_SIZE); diff --git a/include/linux/context_tracking_work.h b/include/linux/context_tracking_work.h new file mode 100644 index 0000000000000..c68245f8d77c5 --- /dev/null +++ b/include/linux/context_tracking_work.h @@ -0,0 +1,26 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +#ifndef _LINUX_CONTEXT_TRACKING_WORK_H +#define _LINUX_CONTEXT_TRACKING_WORK_H + +#include <linux/bitops.h> + +enum { + CT_WORK_n_OFFSET, + CT_WORK_MAX_OFFSET +}; + +enum ct_work { + CT_WORK_n = BIT(CT_WORK_n_OFFSET), + CT_WORK_MAX = BIT(CT_WORK_MAX_OFFSET) +}; + +#include <asm/context_tracking_work.h> + +#ifdef CONFIG_CONTEXT_TRACKING_WORK +extern bool ct_set_cpu_work(unsigned int cpu, enum ct_work work); +#else +static inline bool +ct_set_cpu_work(unsigned int cpu, unsigned int work) { return false; } +#endif + +#endif diff --git a/kernel/context_tracking.c b/kernel/context_tracking.c index fb5be6e9b423f..3238bb1f41ff4 100644 --- a/kernel/context_tracking.c +++ b/kernel/context_tracking.c @@ -72,6 +72,70 @@ static __always_inline void rcu_task_trace_heavyweight_exit(void) #endif /* #ifdef CONFIG_TASKS_TRACE_RCU */ }
+#ifdef CONFIG_CONTEXT_TRACKING_WORK +static noinstr void ct_work_flush(unsigned long seq) +{ + int bit; + + seq = (seq & CT_WORK_MASK) >> CT_WORK_START; + + /* + * arch_context_tracking_work() must be noinstr, non-blocking, + * and NMI safe. + */ + for_each_set_bit(bit, &seq, CT_WORK_MAX) + arch_context_tracking_work(BIT(bit)); +} + +/** + * ct_set_cpu_work - set work to be run at next kernel context entry + * + * If @cpu is not currently executing in kernelspace, it will execute the + * callback mapped to @work (see arch_context_tracking_work()) at its next + * entry into ct_kernel_enter_state(). + * + * If it is already executing in kernelspace, this will be a no-op. + */ +bool ct_set_cpu_work(unsigned int cpu, enum ct_work work) +{ + struct context_tracking *ct = per_cpu_ptr(&context_tracking, cpu); + unsigned int old; + bool ret = false; + + if (!ct->active) + return false; + + preempt_disable(); + + old = atomic_read(&ct->state); + + /* + * The work bit must only be set if the target CPU is not executing + * in kernelspace. + * CT_RCU_WATCHING is used as a proxy for that - if the bit is set, we + * know for sure the CPU is executing in the kernel whether that be in + * NMI, IRQ or process context. + * Set CT_RCU_WATCHING here and let the cmpxchg do the check for us; + * the state could change between the atomic_read() and the cmpxchg(). + */ + old |= CT_RCU_WATCHING; + /* + * Try setting the work until either + * - the target CPU has entered kernelspace + * - the work has been set + */ + do { + ret = atomic_try_cmpxchg(&ct->state, &old, old | (work << CT_WORK_START)); + } while (!ret && !(old & CT_RCU_WATCHING)); + + preempt_enable(); + return ret; +} +#else +static __always_inline void ct_work_flush(unsigned long work) { } +static __always_inline void ct_work_clear(struct context_tracking *ct) { } +#endif + /* * Record entry into an extended quiescent state. This is only to be * called when not already in an extended quiescent state, that is, @@ -88,7 +152,7 @@ static noinstr void ct_kernel_exit_state(int offset) rcu_task_trace_heavyweight_enter(); // Before CT state update! // RCU is still watching. Better not be in extended quiescent state! WARN_ON_ONCE(IS_ENABLED(CONFIG_RCU_EQS_DEBUG) && !rcu_is_watching_curr_cpu()); - (void)ct_state_inc(offset); + (void)ct_state_inc_clear_work(offset); // RCU is no longer watching. }
@@ -99,7 +163,7 @@ static noinstr void ct_kernel_exit_state(int offset) */ static noinstr void ct_kernel_enter_state(int offset) { - int seq; + unsigned long seq;
/* * CPUs seeing atomic_add_return() must see prior idle sojourns, @@ -107,6 +171,7 @@ static noinstr void ct_kernel_enter_state(int offset) * critical section. */ seq = ct_state_inc(offset); + ct_work_flush(seq); // RCU is now watching. Better not be in an extended quiescent state! rcu_task_trace_heavyweight_exit(); // After CT state update! WARN_ON_ONCE(IS_ENABLED(CONFIG_RCU_EQS_DEBUG) && !(seq & CT_RCU_WATCHING)); diff --git a/kernel/time/Kconfig b/kernel/time/Kconfig index b0b97a60aaa6f..7e8106a0d981f 100644 --- a/kernel/time/Kconfig +++ b/kernel/time/Kconfig @@ -181,6 +181,11 @@ config CONTEXT_TRACKING_USER_FORCE Say N otherwise, this option brings an overhead that you don't want in production.
+config CONTEXT_TRACKING_WORK + bool + depends on HAVE_CONTEXT_TRACKING_WORK && CONTEXT_TRACKING_USER + default y + config NO_HZ bool "Old Idle dynticks config" help
text_poke_bp_batch() sends IPIs to all online CPUs to synchronize them vs the newly patched instruction. CPUs that are executing in userspace do not need this synchronization to happen immediately, and this is actually harmful interference for NOHZ_FULL CPUs.
As the synchronization IPIs are sent using a blocking call, returning from text_poke_bp_batch() implies all CPUs will observe the patched instruction(s), and this should be preserved even if the IPI is deferred. In other words, to safely defer this synchronization, any kernel instruction leading to the execution of the deferred instruction sync (ct_work_flush()) must *not* be mutable (patchable) at runtime.
This means we must pay attention to mutable instructions in the early entry code:
- alternatives
- static keys
- static calls
- all sorts of probes (kprobes/ftrace/bpf/???)
The early entry code leading to ct_work_flush() is noinstr, which gets rid of the probes.
Alternatives are safe, because it's boot-time patching (before SMP is even brought up) which is before any IPI deferral can happen.
This leaves us with static keys and static calls.
Any static key used in early entry code should only ever be enabled at boot time and never flipped afterwards, IOW it should be __ro_after_init (pretty much like alternatives). Exceptions are explicitly marked as allowed in .noinstr and will always generate an IPI when flipped.
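For illustration, a minimal sketch (names are made up, this is not code from the series) of a static key that is legitimate to use in early entry code: it lives in __ro_after_init and is only ever flipped during boot, long before any IPI can be deferred.

#include <linux/jump_label.h>
#include <linux/init.h>

/* Hypothetical key: enabled once at boot, never flipped again. */
DEFINE_STATIC_KEY_FALSE_RO(entry_feature_key);

static int __init entry_feature_setup(void)
{
	/* Stand-in for a real feature-detection check. */
	if (IS_ENABLED(CONFIG_X86_64))
		static_branch_enable(&entry_feature_key);
	return 0;
}
early_initcall(entry_feature_setup);

An entry-code user of such a key (e.g. a static_branch_unlikely(&entry_feature_key) check) never sees it change after boot, so patching it can never race with a deferred sync_core().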
The same applies to static calls - they should only be updated at boot time, or be manually marked as an exception.
Objtool is now able to point at static keys/calls that don't respect this, and all static keys/calls used in early entry code have now been verified as behaving appropriately.
Leverage the new context_tracking infrastructure to defer sync_core() IPIs to a target CPU's next kernel entry.
Signed-off-by: Peter Zijlstra (Intel) peterz@infradead.org Signed-off-by: Nicolas Saenz Julienne nsaenzju@redhat.com Signed-off-by: Valentin Schneider vschneid@redhat.com --- arch/x86/include/asm/context_tracking_work.h | 6 ++- arch/x86/include/asm/text-patching.h | 1 + arch/x86/kernel/alternative.c | 39 +++++++++++++++++--- arch/x86/kernel/kprobes/core.c | 4 +- arch/x86/kernel/kprobes/opt.c | 4 +- arch/x86/kernel/module.c | 2 +- include/asm-generic/sections.h | 15 ++++++++ include/linux/context_tracking_work.h | 4 +- 8 files changed, 60 insertions(+), 15 deletions(-)
diff --git a/arch/x86/include/asm/context_tracking_work.h b/arch/x86/include/asm/context_tracking_work.h index 5f3b2d0977235..485b32881fde5 100644 --- a/arch/x86/include/asm/context_tracking_work.h +++ b/arch/x86/include/asm/context_tracking_work.h @@ -2,11 +2,13 @@ #ifndef _ASM_X86_CONTEXT_TRACKING_WORK_H #define _ASM_X86_CONTEXT_TRACKING_WORK_H
+#include <asm/sync_core.h> + static __always_inline void arch_context_tracking_work(enum ct_work work) { switch (work) { - case CT_WORK_n: - // Do work... + case CT_WORK_SYNC: + sync_core(); break; case CT_WORK_MAX: WARN_ON_ONCE(true); diff --git a/arch/x86/include/asm/text-patching.h b/arch/x86/include/asm/text-patching.h index ab9e143ec9fea..9dfa46f721c1d 100644 --- a/arch/x86/include/asm/text-patching.h +++ b/arch/x86/include/asm/text-patching.h @@ -33,6 +33,7 @@ extern void apply_relocation(u8 *buf, const u8 * const instr, size_t instrlen, u */ extern void *text_poke(void *addr, const void *opcode, size_t len); extern void text_poke_sync(void); +extern void text_poke_sync_deferrable(void); extern void *text_poke_kgdb(void *addr, const void *opcode, size_t len); extern void *text_poke_copy(void *addr, const void *opcode, size_t len); #define text_poke_copy text_poke_copy diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c index bf82c6f7d6905..8c73ac6243809 100644 --- a/arch/x86/kernel/alternative.c +++ b/arch/x86/kernel/alternative.c @@ -18,6 +18,7 @@ #include <linux/mmu_context.h> #include <linux/bsearch.h> #include <linux/sync_core.h> +#include <linux/context_tracking.h> #include <asm/text-patching.h> #include <asm/alternative.h> #include <asm/sections.h> @@ -2450,9 +2451,24 @@ static void do_sync_core(void *info) sync_core(); }
+static bool do_sync_core_defer_cond(int cpu, void *info) +{ + return !ct_set_cpu_work(cpu, CT_WORK_SYNC); +} + +static void __text_poke_sync(smp_cond_func_t cond_func) +{ + on_each_cpu_cond(cond_func, do_sync_core, NULL, 1); +} + void text_poke_sync(void) { - on_each_cpu(do_sync_core, NULL, 1); + __text_poke_sync(NULL); +} + +void text_poke_sync_deferrable(void) +{ + __text_poke_sync(do_sync_core_defer_cond); }
/* @@ -2623,6 +2639,7 @@ static int tp_vec_nr; */ static void text_poke_bp_batch(struct text_poke_loc *tp, unsigned int nr_entries) { + smp_cond_func_t cond = do_sync_core_defer_cond; unsigned char int3 = INT3_INSN_OPCODE; unsigned int i; int do_sync; @@ -2658,11 +2675,21 @@ static void text_poke_bp_batch(struct text_poke_loc *tp, unsigned int nr_entries * First step: add a int3 trap to the address that will be patched. */ for (i = 0; i < nr_entries; i++) { - tp[i].old = *(u8 *)text_poke_addr(&tp[i]); - text_poke(text_poke_addr(&tp[i]), &int3, INT3_INSN_SIZE); + void *addr = text_poke_addr(&tp[i]); + + /* + * There's no safe way to defer IPIs for patching text in + * .noinstr, record whether there is at least one such poke. + */ + if (is_kernel_noinstr_text((unsigned long)addr) || + is_module_noinstr_text_address((unsigned long)addr)) + cond = NULL; + + tp[i].old = *((u8 *)addr); + text_poke(addr, &int3, INT3_INSN_SIZE); }
- text_poke_sync(); + __text_poke_sync(cond);
/* * Second step: update all but the first byte of the patched range. @@ -2724,7 +2751,7 @@ static void text_poke_bp_batch(struct text_poke_loc *tp, unsigned int nr_entries * not necessary and we'd be safe even without it. But * better safe than sorry (plus there's not only Intel). */ - text_poke_sync(); + __text_poke_sync(cond); }
/* @@ -2745,7 +2772,7 @@ static void text_poke_bp_batch(struct text_poke_loc *tp, unsigned int nr_entries }
if (do_sync) - text_poke_sync(); + __text_poke_sync(cond);
/* * Remove and wait for refs to be zero. diff --git a/arch/x86/kernel/kprobes/core.c b/arch/x86/kernel/kprobes/core.c index 09608fd936876..687e6805b7511 100644 --- a/arch/x86/kernel/kprobes/core.c +++ b/arch/x86/kernel/kprobes/core.c @@ -808,7 +808,7 @@ void arch_arm_kprobe(struct kprobe *p) u8 int3 = INT3_INSN_OPCODE;
text_poke(p->addr, &int3, 1); - text_poke_sync(); + text_poke_sync_deferrable(); perf_event_text_poke(p->addr, &p->opcode, 1, &int3, 1); }
@@ -818,7 +818,7 @@ void arch_disarm_kprobe(struct kprobe *p)
perf_event_text_poke(p->addr, &int3, 1, &p->opcode, 1); text_poke(p->addr, &p->opcode, 1); - text_poke_sync(); + text_poke_sync_deferrable(); }
void arch_remove_kprobe(struct kprobe *p) diff --git a/arch/x86/kernel/kprobes/opt.c b/arch/x86/kernel/kprobes/opt.c index 36d6809c6c9e1..b2ce4d9c3ba56 100644 --- a/arch/x86/kernel/kprobes/opt.c +++ b/arch/x86/kernel/kprobes/opt.c @@ -513,11 +513,11 @@ void arch_unoptimize_kprobe(struct optimized_kprobe *op) JMP32_INSN_SIZE - INT3_INSN_SIZE);
text_poke(addr, new, INT3_INSN_SIZE); - text_poke_sync(); + text_poke_sync_deferrable(); text_poke(addr + INT3_INSN_SIZE, new + INT3_INSN_SIZE, JMP32_INSN_SIZE - INT3_INSN_SIZE); - text_poke_sync(); + text_poke_sync_deferrable();
perf_event_text_poke(op->kp.addr, old, JMP32_INSN_SIZE, new, JMP32_INSN_SIZE); } diff --git a/arch/x86/kernel/module.c b/arch/x86/kernel/module.c index a7998f3517017..d89c9de0ca9f5 100644 --- a/arch/x86/kernel/module.c +++ b/arch/x86/kernel/module.c @@ -206,7 +206,7 @@ static int write_relocate_add(Elf64_Shdr *sechdrs, write, apply);
if (!early) { - text_poke_sync(); + text_poke_sync_deferrable(); mutex_unlock(&text_mutex); }
diff --git a/include/asm-generic/sections.h b/include/asm-generic/sections.h index 0755bc39b0d80..7d2403014010e 100644 --- a/include/asm-generic/sections.h +++ b/include/asm-generic/sections.h @@ -199,6 +199,21 @@ static inline bool is_kernel_inittext(unsigned long addr) addr < (unsigned long)_einittext; }
+ +/** + * is_kernel_noinstr_text - checks if the pointer address is located in the + * .noinstr section + * + * @addr: address to check + * + * Returns: true if the address is located in .noinstr, false otherwise. + */ +static inline bool is_kernel_noinstr_text(unsigned long addr) +{ + return addr >= (unsigned long)__noinstr_text_start && + addr < (unsigned long)__noinstr_text_end; +} + /** * __is_kernel_text - checks if the pointer address is located in the * .text section diff --git a/include/linux/context_tracking_work.h b/include/linux/context_tracking_work.h index c68245f8d77c5..2facc621be067 100644 --- a/include/linux/context_tracking_work.h +++ b/include/linux/context_tracking_work.h @@ -5,12 +5,12 @@ #include <linux/bitops.h>
enum { - CT_WORK_n_OFFSET, + CT_WORK_SYNC_OFFSET, CT_WORK_MAX_OFFSET };
enum ct_work { - CT_WORK_n = BIT(CT_WORK_n_OFFSET), + CT_WORK_SYNC = BIT(CT_WORK_SYNC_OFFSET), CT_WORK_MAX = BIT(CT_WORK_MAX_OFFSET) };
I don't think we should do this series.
If folks want this functionality, they should get a new CPU that can flush the TLB without IPIs.
On Tue, 29 Apr 2025 09:11:57 -0700 Dave Hansen dave.hansen@intel.com wrote:
I don't think we should do this series.
Could you provide more rationale for your decision?
If folks want this functionality, they should get a new CPU that can flush the TLB without IPIs.
That's a pretty heavy handed response. I'm not sure that's always a feasible solution.
From my experience in the world, software has always been around to fix the hardware, not the other way around ;-)
-- Steve
On 4/30/25 10:20, Steven Rostedt wrote:
On Tue, 29 Apr 2025 09:11:57 -0700 Dave Hansen dave.hansen@intel.com wrote:
I don't think we should do this series.
Could you provide more rationale for your decision?
I talked about it a bit in here:
https://lore.kernel.org/all/408ebd8b-4bfb-4c4f-b118-7fe853c6e897@intel.com/
But, basically, this series puts a new onus on the entry code: it can't touch the vmalloc() area ... except the LDT ... and except the PEBS buffers. If anyone touches vmalloc()'d memory (or anything else that eventually gets deferred), they crash. They _only_ crash on these NOHZ_FULL systems.
Putting new restrictions on the entry code is really nasty. Let's say a new hardware feature showed up that touched vmalloc()'d memory in the entry code. Probably, nobody would notice until they got that new hardware and tried to do a NOHZ_FULL workload. It might take years to uncover, once that hardware was out in the wild.
I have a substantial number of gray hairs from dealing with corner cases in the entry code.
You _could_ make it more debuggable. Could you make this work for all tasks, not just NOHZ_FULL? The same logic _should_ apply. It would be inefficient, but would provide good debugging coverage.
I also mentioned this earlier, but PTI could be leveraged here to ensure that the TLB is flushed properly. You could have the rule that anything mapped into the user page table can't have a deferred flush and then do deferred flushes at SWITCH_TO_KERNEL_CR3 time. Yeah, that's in arch-specific assembly, but it's a million times easier to reason about because the window where a deferred-flush allocation might bite you is so small.
Look at the syscall code for instance:
SYM_CODE_START(entry_SYSCALL_64)
	swapgs
	movq	%rsp, PER_CPU_VAR(cpu_tss_rw + TSS_sp2)
	SWITCH_TO_KERNEL_CR3 scratch_reg=%rsp
You can _trivially_ audit this and know that swapgs doesn't touch memory and that as long as PER_CPU_VAR()s and the process stack don't have their mappings munged and flushes deferred that this would be correct.
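To make the PTI-based suggestion concrete, here is a purely illustrative sketch (all names hypothetical, not from this series or from Dave's mail) of a per-CPU "deferred kernel TLB flush" flag that a housekeeping CPU would set instead of IPI'ing, and that the isolated CPU would consume conceptually right after SWITCH_TO_KERNEL_CR3, before it can touch any vmalloc()'d mapping. It glosses over the "is the target actually in userspace" check that the series gets from the context-tracking state.

#include <linux/percpu.h>
#include <asm/tlbflush.h>

static DEFINE_PER_CPU(bool, kernel_tlb_flush_pending);

/* Housekeeping side: called instead of sending the flush IPI to @cpu. */
static void defer_kernel_tlb_flush(unsigned int cpu)
{
	WRITE_ONCE(per_cpu(kernel_tlb_flush_pending, cpu), true);
}

/* Isolated CPU side: conceptually hooked near SWITCH_TO_KERNEL_CR3. */
static void consume_deferred_kernel_tlb_flush(void)
{
	if (!READ_ONCE(*this_cpu_ptr(&kernel_tlb_flush_pending)))
		return;

	__this_cpu_write(kernel_tlb_flush_pending, false);
	/* Coarse stand-in for whatever kernel-range flush was deferred. */
	__flush_tlb_all();
}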
If folks want this functionality, they should get a new CPU that can flush the TLB without IPIs.
That's a pretty heavy handed response. I'm not sure that's always a feasible solution.
From my experience in the world, software has always been around to fix the hardware, not the other way around ;-)
Both AMD and Intel have hardware to do it. ARM CPUs do it too, I think. You can go buy the Intel hardware off the shelf today.
On Wed, 30 Apr 2025 11:07:35 -0700 Dave Hansen dave.hansen@intel.com wrote:
On 4/30/25 10:20, Steven Rostedt wrote:
On Tue, 29 Apr 2025 09:11:57 -0700 Dave Hansen dave.hansen@intel.com wrote:
I don't think we should do this series.
Could you provide more rationale for your decision?
I talked about it a bit in here:
https://lore.kernel.org/all/408ebd8b-4bfb-4c4f-b118-7fe853c6e897@intel.com/
Hmm, that's easily missed. But thanks for linking it.
But, basically, this series puts a new onus on the entry code: it can't touch the vmalloc() area ... except the LDT ... and except the PEBS buffers. If anyone touches vmalloc()'d memory (or anything else that eventually gets deferred), they crash. They _only_ crash on these NOHZ_FULL systems.
Putting new restrictions on the entry code is really nasty. Let's say a new hardware feature showed up that touched vmalloc()'d memory in the entry code. Probably, nobody would notice until they got that new hardware and tried to do a NOHZ_FULL workload. It might take years to uncover, once that hardware was out in the wild.
I have a substantial number of gray hairs from dealing with corner cases in the entry code.
You _could_ make it more debuggable. Could you make this work for all tasks, not just NOHZ_FULL? The same logic _should_ apply. It would be inefficient, but would provide good debugging coverage.
I also mentioned this earlier, but PTI could be leveraged here to ensure that the TLB is flushed properly. You could have the rule that anything mapped into the user page table can't have a deferred flush and then do deferred flushes at SWITCH_TO_KERNEL_CR3 time. Yeah, that's in arch-specific assembly, but it's a million times easier to reason about because the window where a deferred-flush allocation might bite you is so small.
Look at the syscall code for instance:
SYM_CODE_START(entry_SYSCALL_64)
	swapgs
	movq	%rsp, PER_CPU_VAR(cpu_tss_rw + TSS_sp2)
	SWITCH_TO_KERNEL_CR3 scratch_reg=%rsp
You can _trivially_ audit this and know that swapgs doesn't touch memory and that as long as PER_CPU_VAR()s and the process stack don't have their mappings munged and flushes deferred that this would be correct.
Hmm, so there is still a path for this?
At least if it added more ways to debug it, and some other changes to make the locations where vmalloc is dangerous smaller?
If folks want this functionality, they should get a new CPU that can flush the TLB without IPIs.
That's a pretty heavy handed response. I'm not sure that's always a feasible solution.
From my experience in the world, software has always been around to fix the hardware, not the other way around ;-)
Both AMD and Intel have hardware to do it. ARM CPUs do it too, I think. You can go buy the Intel hardware off the shelf today.
Sure, but changing CPUs on machines is not always that feasible either.
-- Steve
On 4/30/25 12:42, Steven Rostedt wrote:
Look at the syscall code for instance:
SYM_CODE_START(entry_SYSCALL_64)
	swapgs
	movq	%rsp, PER_CPU_VAR(cpu_tss_rw + TSS_sp2)
	SWITCH_TO_KERNEL_CR3 scratch_reg=%rsp
You can _trivially_ audit this and know that swapgs doesn't touch memory and that as long as PER_CPU_VAR()s and the process stack don't have their mappings munged and flushes deferred that this would be correct.
Hmm, so there is still a path for this?
At least if it added more ways to debug it, and some other changes to make the locations where vmalloc is dangerous smaller?
Being able to debug it would be a good start. But, more generally, what we need is for more people to be able to run the code in the first place. Would a _normal_ system (without setups that are trying to do NOHZ_FULL) ever be able to defer TLB flush IPIs?
If the answer is no, then, yeah, I'll settle for some debugging options.
But if you shrink the window as small as I'm talking about, it would look very different from this series.
For instance, imagine when a CPU goes into the NOHZ mode. Could it just unconditionally flush the TLB on the way back into the kernel (in the same SWITCH_TO_KERNEL_CR3 spot)? Yeah, it'll make entry into the kernel expensive for NOHZ tasks, but it's not *THAT* bad. And if the entire point of a NOHZ_FULL task is to minimize the number of kernel entries then a little extra overhead there doesn't sound too bad.
Also, about the new hardware, I suspect there's some mystery customer lurking in the shadows asking folks for this functionality. Could you at least go _talk_ to the mystery customer(s) and see which hardware they care about? They might already even have the magic CPUs they need for this, or have them on the roadmap. If they've got Intel CPUs, I'd be happy to help figure it out.
On 30/04/25 13:00, Dave Hansen wrote:
On 4/30/25 12:42, Steven Rostedt wrote:
Look at the syscall code for instance:
SYM_CODE_START(entry_SYSCALL_64)
	swapgs
	movq	%rsp, PER_CPU_VAR(cpu_tss_rw + TSS_sp2)
	SWITCH_TO_KERNEL_CR3 scratch_reg=%rsp
You can _trivially_ audit this and know that swapgs doesn't touch memory and that as long as PER_CPU_VAR()s and the process stack don't have their mappings munged and flushes deferred that this would be correct.
Hmm, so there is still a path for this?
At least if it added more ways to debug it, and some other changes to make the locations where vmalloc is dangerous smaller?
Being able to debug it would be a good start. But, more generally, what we need is for more people to be able to run the code in the first place. Would a _normal_ system (without setups that are trying to do NOHZ_FULL) ever be able to defer TLB flush IPIs?
If the answer is no, then, yeah, I'll settle for some debugging options.
But if you shrink the window as small as I'm talking about, it would look very different from this series.
For instance, imagine when a CPU goes into the NOHZ mode. Could it just unconditionally flush the TLB on the way back into the kernel (in the same SWITCH_TO_KERNEL_CR3 spot)? Yeah, it'll make entry into the kernel expensive for NOHZ tasks, but it's not *THAT* bad. And if the entire point of a NOHZ_FULL task is to minimize the number of kernel entries then a little extra overhead there doesn't sound too bad.
Right, so my thought per your previous comments was to special-case the TLB flush, depend on kPTI and do it unconditionally in SWITCH_TO_KERNEL_CR3 just like you've described - but keep the context tracking mechanism for other deferrable operations.
My gripe with that was having two separate mechanisms:
- super early entry (around SWITCH_TO_KERNEL_CR3)
- later entry at context tracking
Shifting everything to SWITCH_TO_KERNEL_CR3 means we lose the context_tracking infra to dynamically defer operations (atomically reading and writing to context_tracking.state), which means we unconditionally run all possible deferrable operations. This doesn't scream scalable, even though as you say NOHZ_FULL kernel entry is already a "you lose" situation.
Yet another option is to duplicate the context tracking state specifically for IPI deferral and have it driven in/by SWITCH_TO_KERNEL_CR3, which is also not super savoury.
I suppose I can start poking around running deferred ops in that SWITCH_TO_KERNEL_CR3 region, and add state/infra on top. Let's see where this gets me :-)
Again, thanks for the insight and the suggestions Dave!
Also, about the new hardware, I suspect there's some mystery customer lurking in the shadows asking folks for this functionality. Could you at least go _talk_ to the mystery customer(s) and see which hardware they care about? They might already even have the magic CPUs they need for this, or have them on the roadmap. If they've got Intel CPUs, I'd be happy to help figure it out.
On 5/2/25 02:55, Valentin Schneider wrote:
My gripe with that was having two separate mechanisms
- super early entry (around SWITCH_TO_KERNEL_CR3)
- later entry at context tracking
What do you mean by "later entry"?
All of the paths to enter the kernel from userspace have some SWITCH_TO_KERNEL_CR3 variant. If they didn't, the userspace that they entered from could have attacked the kernel with Meltdown.
I'm theorizing that if this is _just_ about avoiding TLB flush IPIs that you can get away with a single mechanism.
On 02/05/25 06:53, Dave Hansen wrote:
On 5/2/25 02:55, Valentin Schneider wrote:
My gripe with that was having two separate mechanisms
- super early entry (around SWITCH_TO_KERNEL_CR3)
- later entry at context tracking
What do you mean by "later entry"?
I meant the point at which the deferred operation is run in the current patches, i.e. ct_kernel_enter() - kernel entry from the PoV of context tracking.
All of the paths to enter the kernel from userspace have some SWITCH_TO_KERNEL_CR3 variant. If they didn't, the userspace that they entered from could have attacked the kernel with Meltdown.
I'm theorizing that if this is _just_ about avoiding TLB flush IPIs that you can get away with a single mechanism.
So right now there would indeed be the TLB flush IPIs, but also the text_poke() ones (sync_core() after patching text).
These are the two NOHZ-breaking IPIs that show up on my HP box, and that I also got reports for from folks using NOHZ_FULL + CPU isolation in production, mostly on SPR "edge enhanced" type of systems.
There have been other sources of IPIs that were fixed with ad-hoc solutions - disable the mechanism for NOHZ_FULL CPUs, or do it differently such that an IPI isn't required, e.g.
https://lore.kernel.org/lkml/ZJtBrybavtb1x45V@tpad/
While I don't expect the list to grow much, it's unfortunately not just the TLB flush IPIs.
gah, the cc list here is rotund...
On 5/2/25 09:38, Valentin Schneider wrote: ...
All of the paths to enter the kernel from userspace have some SWITCH_TO_KERNEL_CR3 variant. If they didn't, the userspace that they entered from could have attacked the kernel with Meltdown.
I'm theorizing that if this is _just_ about avoiding TLB flush IPIs that you can get away with a single mechanism.
So right now there would indeed be the TLB flush IPIs, but also the text_poke() ones (sync_core() after patching text).
These are the two NOHZ-breaking IPIs that show up on my HP box, and that I also got reports for from folks using NOHZ_FULL + CPU isolation in production, mostly on SPR "edge enhanced" type of systems.
...
While I don't expect the list to grow much, it's unfortunately not just the TLB flush IPIs.
Isn't text patching way easier than TLB flushes? You just need *some* serialization. Heck, since TLB flushes are architecturally serializing, you could probably even reuse the exact same mechanism: implement deferred text patch serialization operations as a deferred TLB flush.
The hardest part is figuring out which CPUs are in the state where they can be deferred or not. But you have to solve that in any case, and you already have an algorithm to do it.
On 02/05/25 10:57, Dave Hansen wrote:
gah, the cc list here is rotund...
On 5/2/25 09:38, Valentin Schneider wrote: ...
All of the paths to enter the kernel from userspace have some SWITCH_TO_KERNEL_CR3 variant. If they didn't, the userspace that they entered from could have attacked the kernel with Meltdown.
I'm theorizing that if this is _just_ about avoiding TLB flush IPIs that you can get away with a single mechanism.
So right now there would indeed be the TLB flush IPIs, but also the text_poke() ones (sync_core() after patching text).
These are the two NOHZ-breaking IPIs that show up on my HP box, and that I also got reports for from folks using NOHZ_FULL + CPU isolation in production, mostly on SPR "edge enhanced" type of systems.
...
While I don't expect the list to grow much, it's unfortunately not just the TLB flush IPIs.
Isn't text patching way easier than TLB flushes? You just need *some* serialization. Heck, since TLB flushes are architecturally serializing, you could probably even reuse the exact same mechanism: implement deferred text patch serialization operations as a deferred TLB flush.
The hardest part is figuring out which CPUs are in the state where they can be deferred or not. But you have to solve that in any case, and you already have an algorithm to do it.
Alright, off to mess around SWITCH_TO_KERNEL_CR3 to see how shoving deferred operations there would look then.
On Wed, Apr 30, 2025 at 11:07:35AM -0700, Dave Hansen wrote:
Both AMD and Intel have hardware to do it. ARM CPUs do it too, I think. You can go buy the Intel hardware off the shelf today.
To be fair, the Intel RAR thing is pretty horrific :-( Definitely sub-par compared to the AMD and ARM things.
Furthermore, the paper states it is a uarch feature for SPR with no guarantee future uarchs will get it (and to be fair, I'd prefer it if they didn't).
Furthermore, I suspect it will actually be slower than IPIs for anything with more than 64 logical CPUs due to reduced parallelism.
On 5/2/25 04:22, Peter Zijlstra wrote:
On Wed, Apr 30, 2025 at 11:07:35AM -0700, Dave Hansen wrote:
Both AMD and Intel have hardware to do it. ARM CPUs do it too, I think. You can go buy the Intel hardware off the shelf today.
To be fair, the Intel RAR thing is pretty horrific 🙁 Definitely sub-par compared to the AMD and ARM things.
Furthermore, the paper states it is a uarch feature for SPR with no guarantee future uarchs will get it (and to be fair, I'd prefer it if they didn't).
I don't think any of that is set in stone, fwiw. It should be entirely possible to obtain a longer promise about its availability.
Or ask that AMD and Intel put their heads together in their fancy new x86 advisory group and figure out a single way forward. If you're right that RAR stinks and INVLPGB rocks, then it'll be an easy thing to advise.
Furthermore, I suspect it will actually be slower than IPIs for anything with more than 64 logical CPUs due to reduced parallelism.
Maybe my brain is crusty and I need to go back and read the spec, but I remember RAR using the normal old APIC programming that normal old TLB flush IPIs use. So they have similar restrictions. If it's inefficient to program a wide IPI, it's also inefficient to program a RAR operation. So the (theoretical) pro is that you program it like an IPI and it slots into the IPI code fairly easily. But the con is that it has the same limitations as IPIs.
I was actually concerned that INVLPGB won't be scalable. Since it doesn't have the ability to target specific CPUs in the ISA, it fundamentally needs to either have a mechanism to reach all CPUs, or some way to know which TLB entries each CPU might have.
Maybe AMD has something super duper clever to limit the broadcast scope. But if they don't, then a small range flush on a small number of CPUs might end up being pretty expensive, relatively.
I don't think this is a big problem in Rik's series because he had a floor on the size of processes that get INVLPGB applied. Also, if it turns out to be a problem, it's dirt simple to revert back to IPIs for problematic TLB flushes.
But I am deeply curious how the system will behave if there are a boatload of processes doing modestly-sized INVLPGBs that only apply to a handful of CPUs on a very large system.
AMD and Intel came at this from very different angles (go figure). The designs are prioritizing different things for sure. I can't wait to see both of them fighting it out under real workloads.
On Fri, May 02, 2025 at 07:33:55AM -0700, Dave Hansen wrote:
On 5/2/25 04:22, Peter Zijlstra wrote:
On Wed, Apr 30, 2025 at 11:07:35AM -0700, Dave Hansen wrote:
Both AMD and Intel have hardware to do it. ARM CPUs do it too, I think. You can go buy the Intel hardware off the shelf today.
To be fair, the Intel RAR thing is pretty horrific 🙁 Definitely sub-par compared to the AMD and ARM things.
Furthermore, the paper states it is a uarch feature for SPR with no guarantee future uarchs will get it (and to be fair, I'd prefer it if they didn't).
I don't think any of that is set in stone, fwiw. It should be entirely possible to obtain a longer promise about its availability.
Or ask that AMD and Intel put their heads together in their fancy new x86 advisory group and figure out a single way forward.
This might be a good thing regardless.
Furthermore, I suspect it will actually be slower than IPIs for anything with more than 64 logical CPUs due to reduced parallelism.
Maybe my brain is crusty and I need to go back and read the spec, but I remember RAR using the normal old APIC programming that normal old TLB flush IPIs use. So they have similar restrictions. If it's inefficient to program a wide IPI, it's also inefficient to program a RAR operation. So the (theoretical) pro is that you program it like an IPI and it slots into the IPI code fairly easily. But the con is that it has the same limitations as IPIs.
The problem is in the request structure. Sending an IPI is an async action. You do, done.
OTOH RAR has a request buffer where pending requests are put and 'polled' for completion. This buffer does not have room for more than 64 CPUs.
This means that if you want to invalidate across more, you need to do it in multiple batches.
So where IPI is:
- IPI all CPUs
- local invalidate
- wait for completion
This then becomes:
for ()
  - RAR some CPUs
  - wait for completion
Or so I think I understood it; the paper isn't the easiest to read.
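A rough model of that batching (rar_issue()/rar_wait() are hypothetical stand-ins for the request-buffer interface, not a real API), contrasted with the single fire-and-wait IPI case above:

#include <linux/cpumask.h>
#include <linux/gfp.h>

#define RAR_MAX_TARGETS	64	/* request buffer only covers 64 CPUs */

/* Model only: post requests to / poll completions from the RAR buffer. */
static inline void rar_issue(const struct cpumask *targets) { }
static inline void rar_wait(const struct cpumask *targets) { }

static void rar_flush_mask(const struct cpumask *mask)
{
	cpumask_var_t batch;
	unsigned int cpu, batched = 0;

	if (!zalloc_cpumask_var(&batch, GFP_KERNEL))
		return;

	for_each_cpu(cpu, mask) {
		cpumask_set_cpu(cpu, batch);
		if (++batched == RAR_MAX_TARGETS) {
			rar_issue(batch);
			rar_wait(batch);
			cpumask_clear(batch);
			batched = 0;
		}
	}
	if (batched) {
		rar_issue(batch);
		rar_wait(batch);
	}
	free_cpumask_var(batch);
}

Whereas the IPI variant is a single asynchronous broadcast followed by one wait, the 64-entry buffer forces wide invalidations through multiple issue/wait rounds, which is where the reduced parallelism would come from.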
I was actually concerned that INVLPGB won't be scalable. Since it doesn't have the ability to target specific CPUs in the ISA, it fundamentally need to either have a mechanism to reach all CPUs, or some way to know which TLB entries each CPU might have.
Maybe AMD has something super duper clever to limit the broadcast scope. But if they don't, then a small range flush on a small number of CPUs might end up being pretty expensive, relatively.
So the way I understand things:
Sending IPIs is sending a message on the interconnect. Mostly this is a cacheline in size (because MESI). Sparc (v9?) has a fun feature where you can actually put data payload in an IPI.
Now, we can target an IPI to a single CPU or to a (limited) set of CPU or broadcast to all CPUs. In fact, targeted IPIs might still be broadcast IPIs, except most CPUs will ignore it because it doesn't match them.
TLBI broadcast is like sending IPIs to all CPUs, the message goes out, everybody sees it.
Much like how snoop filters and the like function, a CPU can process these messages async -- your CPU doesn't stall for a cacheline invalidate message either (except of course if it is actively using that line). Same for TLBI: if the local TLB does not have anything that matches, it's done. Even if it does match, as long as nothing makes active use of it, it can just drop the TLB entry without disturbing the actual core.
Only if the CPU has a matching TLB entry *and* it is active, then we have options. One option is to interrupt the core, another option is to wait for it to stop using it.
IIUC the current AMD implementation does the 'interrupt' thing.
One thing to consider in all this is that if we TLBI for an executable page, we should very much also wipe the u-ops cache and all such related structures -- ARM might have an 'issue' here.
That is, I think the TLBI problem is very similar to the I in MESI -- except possibly simpler, because E must not happen until all CPUs have acknowledged the I, etc. TLBI does not have this constraint; it only has to be complete by the next TLBSYNC.
Anyway, I'm not a hardware person, but this is how I understand these things to work.