Context
=======
We've observed within Red Hat that isolated, NOHZ_FULL CPUs running a pure-userspace application get regularly interrupted by IPIs sent from housekeeping CPUs. Those IPIs are caused by activity on the housekeeping CPUs leading to various on_each_cpu() calls, e.g.:
  64359.052209596      NetworkManager       0    1405     smp_call_function_many_cond (cpu=0, func=do_kernel_range_flush)
      smp_call_function_many_cond+0x1
      smp_call_function+0x39
      on_each_cpu+0x2a
      flush_tlb_kernel_range+0x7b
      __purge_vmap_area_lazy+0x70
      _vm_unmap_aliases.part.42+0xdf
      change_page_attr_set_clr+0x16a
      set_memory_ro+0x26
      bpf_int_jit_compile+0x2f9
      bpf_prog_select_runtime+0xc6
      bpf_prepare_filter+0x523
      sk_attach_filter+0x13
      sock_setsockopt+0x92c
      __sys_setsockopt+0x16a
      __x64_sys_setsockopt+0x20
      do_syscall_64+0x87
      entry_SYSCALL_64_after_hwframe+0x65
The heart of this series is the thought that while we cannot remove NOHZ_FULL CPUs from the list of CPUs targeted by these IPIs, they may not have to execute the callbacks immediately. Anything that only affects kernelspace can wait until the next user->kernel transition, providing it can be executed "early enough" in the entry code.
The original implementation is from Peter [1]. Nicolas then added kernel TLB invalidation deferral to that [2], and I picked it up from there.
Deferral approach
=================
Storing each and every callback, as in a secondary call_single_queue, turned out to be a no-go: the whole point of deferral is to keep NOHZ_FULL CPUs in userspace for as long as possible - no signal of any form would be sent when deferring an IPI. This means that any form of queuing for deferred callbacks would end up as a convoluted memory leak.
Deferred IPIs must thus be coalesced, which this series achieves by assigning IPIs a "type" and having a mapping of IPI type to callback, leveraged upon kernel entry.
What about IPIs whose callbacks take a parameter, you may ask?
Peter suggested during OSPM23 [3] that since on_each_cpu() targets housekeeping CPUs *and* isolated CPUs, isolated CPUs can access either global or housekeeping-CPU-local state to "reconstruct" the data that would have been sent via the IPI.
This series does not affect any IPI callback that requires an argument, but the approach would remain the same (one coalescable callback executed on kernel entry).
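To make the coalescing scheme more concrete, here is a minimal sketch of the idea (illustrative only: the names below are made up, and the actual series folds the work bits into context_tracking's .state word and only defers when the target CPU isn't being watched by RCU, falling back to a regular IPI otherwise):

    /* Each deferrable IPI gets a type; one bit per type, per CPU. */
    enum ipi_defer_type {
            DEFER_SYNC_CORE,
            DEFER_KERNEL_TLB_FLUSH,
            DEFER_TYPE_MAX,
    };

    static DEFINE_PER_CPU(unsigned long, deferred_ipi_mask);

    /*
     * Housekeeping CPU: instead of IPI'ing a NOHZ_FULL CPU that is sitting
     * in userspace, set a bit. Repeated requests of the same type coalesce.
     */
    static void defer_ipi_work(int cpu, enum ipi_defer_type type)
    {
            set_bit(type, per_cpu_ptr(&deferred_ipi_mask, cpu));
    }

    /* Target CPU: run the coalesced callbacks early in the kernel entry path. */
    static void flush_deferred_ipi_work(void)
    {
            unsigned long work = xchg(this_cpu_ptr(&deferred_ipi_mask), 0);

            if (work & BIT(DEFER_SYNC_CORE))
                    sync_core();
            if (work & BIT(DEFER_KERNEL_TLB_FLUSH))
                    __flush_tlb_all();      /* no argument needed: flush everything */
    }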
Kernel entry vs execution of the deferred operation
===================================================
This is what I've referred to as the "Danger Zone" during my LPC24 talk [4].
There is a non-zero length of code that is executed upon kernel entry before the deferred operation can itself be executed (before we start getting into context_tracking.c proper), i.e.:
  idtentry_func_foo()                <--- we're in the kernel
    irqentry_enter()
      enter_from_user_mode()
        __ct_user_exit()
          ct_kernel_enter_state()
            ct_work_flush()          <--- deferred operation is executed here
This means one must take extra care about what can happen in the early entry code, and ensure that <bad things> cannot happen. For instance, we really don't want to hit instructions that have been modified by a remote text_poke() while we're on our way to execute a deferred sync_core(). Patches doing the actual deferral have more detail on this.
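As a contrived sketch of the hazard (example_key and example_mitigation() are made up; __ct_user_exit()/ct_work_flush() are the real entry hooks shown above):

    static DEFINE_STATIC_KEY_FALSE(example_key);

    /* Hypothetical early-entry path on a NOHZ_FULL CPU. */
    noinstr void example_enter_from_user(void)
    {
            /*
             * If example_key was just text_poke()'d by a remote CPU and the
             * accompanying sync_core() IPI was deferred, this CPU may still
             * execute a stale copy of the patched jump/nop: we haven't
             * reached ct_work_flush() yet, so the deferred sync_core()
             * hasn't run.
             */
            if (static_branch_unlikely(&example_key))
                    example_mitigation();

            __ct_user_exit();       /* eventually runs ct_work_flush() */
    }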
Where are we at with this whole thing?
======================================
Dave has been incredibly helpful wrt figuring out what would and wouldn't (mostly the latter) be safe to do when deferring kernel range TLB flush IPIs, see [5].
Long story short, there are ugly things I can still do to (safely) defer the TLB flush IPIs, but it's going to be a long session of pulling my own hair out, and I got plenty so I won't be done for a while.
In the meantime, I think everything leading up to deferring text poke IPIs is sane-ish and could get in. I'm not the biggest fan of adding an API with a single user, but hey, I've been working on this for "a little while" now and I'll still need to get the other IPIs sorted out.
TL;DR: Text patching IPI deferral LGTM so here it is for now, I'm still working on the TLB flush thing.
Patches
=======
o Patches 1-2 are standalone objtool cleanups.
o Patches 3-4 add an RCU testing feature.
o Patches 5-6 add infrastructure for annotating static keys and static calls
  that may be used in noinstr code (courtesy of Josh).
o Patches 7-20 use said annotations on relevant keys / calls.
o Patch 21 enforces proper usage of said annotations (courtesy of Josh).
o Patches 22-23 deal with detecting NOINSTR text in modules
o Patches 24-25 add the actual IPI deferral faff
Patches are also available at: https://gitlab.com/vschneid/linux.git -b redhat/isolirq/defer/v5
Testing
=======
Xeon E5-2699 system with SMToff, NOHZ_FULL, isolated CPUs. RHEL10 userspace.
Workload is using rteval (kernel compilation + hackbench) on housekeeping CPUs and a dummy stay-in-userspace loop on the isolated CPUs. The main invocation is:
  $ trace-cmd record -e "csd_queue_cpu" -f "cpu & CPUS{$ISOL_CPUS}" \
	     -e "ipi_send_cpumask" -f "cpumask & CPUS{$ISOL_CPUS}" \
	     -e "ipi_send_cpu" -f "cpu & CPUS{$ISOL_CPUS}" \
	     rteval --onlyload --loads-cpulist=$HK_CPUS \
	     --hackbench-runlowmem=True --duration=$DURATION
This only records IPIs sent to isolated CPUs, so any event there is interference (with a bit of fuzz at the start/end of the workload when spawning the processes). All tests were done with a duration of 3 hours.
v6.14

# This is the actual IPI count
$ trace-cmd report | grep callback | awk '{ print $(NF) }' | sort | uniq -c | sort -nr
     93 callback=generic_smp_call_function_single_interrupt+0x0
     22 callback=nohz_full_kick_func+0x0

# These are the different CSD's that caused IPIs
$ trace-cmd report | grep csd_queue | awk '{ print $(NF-1) }' | sort | uniq -c | sort -nr
   1456 func=do_flush_tlb_all
     78 func=do_sync_core
     33 func=nohz_full_kick_func
     26 func=do_kernel_range_flush

v6.14 + patches

# This is the actual IPI count
$ trace-cmd report | grep callback | awk '{ print $(NF) }' | sort | uniq -c | sort -nr
     86 callback=generic_smp_call_function_single_interrupt+0x0
     41 callback=nohz_full_kick_func+0x0

# These are the different CSD's that caused IPIs
$ trace-cmd report | grep csd_queue | awk '{ print $(NF-1) }' | sort | uniq -c | sort -nr
   1378 func=do_flush_tlb_all
     33 func=nohz_full_kick_func
So the TLB flush is still there driving most of the IPIs, but at least the instruction patching IPIs are gone. With kernel TLB flushes deferred, there are no IPIs sent to isolated CPUs in that 3hr window, but as stated above that still needs some more work.
Also note that tlb_remove_table_smp_sync() showed up during testing of v3, and has gone away as mysteriously as it showed up. Yair had a series addressing this [6] which per these results would be worth revisiting.
Acknowledgements
================
Special thanks to:
o Clark Williams for listening to my ramblings about this and throwing ideas my way
o Josh Poimboeuf for all his help with everything objtool-related
o All of the folks who attended various (too many?) talks about this and provided precious feedback.
o The mm folks for pointing out what I can and can't do with TLB flushes
Links
=====
[1]: https://lore.kernel.org/all/20210929151723.162004989@infradead.org/
[2]: https://github.com/vianpl/linux.git -b ct-work-defer-wip
[3]: https://youtu.be/0vjE6fjoVVE
[4]: https://lpc.events/event/18/contributions/1889/
[5]: http://lore.kernel.org/r/eef09bdc-7546-462b-9ac0-661a44d2ceae@intel.com
[6]: https://lore.kernel.org/lkml/20230620144618.125703-1-ypodemsk@redhat.com/
Revisions
=========
v4 -> v5
++++++++
o Rebased onto v6.15-rc3
o Collected Reviewed-by

o Annotated a few more static keys
o Added proper checking of noinstr sections that are in loadable code such
  as KVM early entry (Sean Christopherson)
o Switched to checking for CT_RCU_WATCHING instead of CT_STATE_KERNEL or CT_STATE_IDLE, which means deferral is now behaving sanely for IRQ/NMI entry from idle (thanks to Frederic!)
o Ditched the vmap TLB flush deferral (for now)
RFCv3 -> v4
+++++++++++
o Rebased onto v6.13-rc6
o New objtool patches from Josh
o More .noinstr static key/call patches
o Static calls now handled as well (again thanks to Josh)

o Fixed clearing the work bits on kernel exit
o Messed with IRQ hitting an idle CPU vs context tracking
o Various comment and naming cleanups

o Made RCU_DYNTICKS_TORTURE depend on !COMPILE_TEST (PeterZ)
o Fixed the CT_STATE_KERNEL check when setting a deferred work (Frederic)
o Cleaned up the __flush_tlb_all() mess thanks to PeterZ
RFCv2 -> RFCv3
++++++++++++++
o Rebased onto v6.12-rc6
o Added objtool documentation for the new warning (Josh)
o Added low-size RCU watching counter to TREE04 torture scenario (Paul)
o Added FORCEFUL jump label and static key types
o Added noinstr-compliant helpers for tlb flush deferral
RFCv1 -> RFCv2
++++++++++++++
o Rebased onto v6.5-rc1
o Updated the trace filter patches (Steven)
o Fixed __ro_after_init keys used in modules (Peter)
o Dropped the extra context_tracking atomic, squashed the new bits in the
  existing .state field (Peter, Frederic)
o Added an RCU_EXPERT config for the RCU dynticks counter size, and added an rcutorture case for a low-size counter (Paul)
o Fixed flush_tlb_kernel_range_deferrable() definition
Josh Poimboeuf (3):
  jump_label: Add annotations for validating noinstr usage
  static_call: Add read-only-after-init static calls
  objtool: Add noinstr validation for static branches/calls
Valentin Schneider (22):
  objtool: Make validate_call() recognize indirect calls to pv_ops[]
  objtool: Flesh out warning related to pv_ops[] calls
  rcu: Add a small-width RCU watching counter debug option
  rcutorture: Make TREE04 use CONFIG_RCU_DYNTICKS_TORTURE
  x86/paravirt: Mark pv_sched_clock static call as __ro_after_init
  x86/idle: Mark x86_idle static call as __ro_after_init
  x86/paravirt: Mark pv_steal_clock static call as __ro_after_init
  riscv/paravirt: Mark pv_steal_clock static call as __ro_after_init
  loongarch/paravirt: Mark pv_steal_clock static call as __ro_after_init
  arm64/paravirt: Mark pv_steal_clock static call as __ro_after_init
  arm/paravirt: Mark pv_steal_clock static call as __ro_after_init
  perf/x86/amd: Mark perf_lopwr_cb static call as __ro_after_init
  sched/clock: Mark sched_clock_running key as __ro_after_init
  KVM: VMX: Mark __kvm_is_using_evmcs static key as __ro_after_init
  x86/speculation/mds: Mark mds_idle_clear key as allowed in .noinstr
  sched/clock, x86: Mark __sched_clock_stable key as allowed in .noinstr
  KVM: VMX: Mark vmx_l1d_should_flush and vmx_l1d_flush_cond keys as allowed in .noinstr
  stackleak: Mark stack_erasing_bypass key as allowed in .noinstr
  module: Remove outdated comment about text_size
  module: Add MOD_NOINSTR_TEXT mem_type
  context-tracking: Introduce work deferral infrastructure
  context_tracking,x86: Defer kernel text patching IPIs
 arch/Kconfig                                  |   9 ++
 arch/arm/kernel/paravirt.c                    |   2 +-
 arch/arm64/kernel/paravirt.c                  |   2 +-
 arch/loongarch/kernel/paravirt.c              |   2 +-
 arch/riscv/kernel/paravirt.c                  |   2 +-
 arch/x86/Kconfig                              |   1 +
 arch/x86/events/amd/brs.c                     |   2 +-
 arch/x86/include/asm/context_tracking_work.h  |  18 +++
 arch/x86/include/asm/text-patching.h          |   1 +
 arch/x86/kernel/alternative.c                 |  39 ++++++-
 arch/x86/kernel/cpu/bugs.c                    |   2 +-
 arch/x86/kernel/kprobes/core.c                |   4 +-
 arch/x86/kernel/kprobes/opt.c                 |   4 +-
 arch/x86/kernel/module.c                      |   2 +-
 arch/x86/kernel/paravirt.c                    |   4 +-
 arch/x86/kernel/process.c                     |   2 +-
 arch/x86/kvm/vmx/vmx.c                        |  11 +-
 arch/x86/kvm/vmx/vmx_onhyperv.c               |   2 +-
 include/asm-generic/sections.h                |  15 +++
 include/linux/context_tracking.h              |  21 ++++
 include/linux/context_tracking_state.h        |  54 +++++++--
 include/linux/context_tracking_work.h         |  26 +++++
 include/linux/jump_label.h                    |  30 ++++-
 include/linux/module.h                        |   6 +-
 include/linux/objtool.h                       |   7 ++
 include/linux/static_call.h                   |  19 ++++
 kernel/context_tracking.c                     |  69 +++++++++++-
 kernel/kprobes.c                              |   8 +-
 kernel/module/main.c                          |  85 ++++++++++----
 kernel/rcu/Kconfig.debug                      |  15 +++
 kernel/sched/clock.c                          |   7 +-
 kernel/stackleak.c                            |   6 +-
 kernel/time/Kconfig                           |   5 +
 tools/objtool/Documentation/objtool.txt       |  34 ++++++
 tools/objtool/check.c                         | 106 +++++++++++++++---
 tools/objtool/include/objtool/check.h         |   1 +
 tools/objtool/include/objtool/elf.h           |   1 +
 tools/objtool/include/objtool/special.h       |   1 +
 tools/objtool/special.c                       |  15 ++-
 .../selftests/rcutorture/configs/rcu/TREE04   |   1 +
 40 files changed, 557 insertions(+), 84 deletions(-)
 create mode 100644 arch/x86/include/asm/context_tracking_work.h
 create mode 100644 include/linux/context_tracking_work.h
--
2.49.0
call_dest_name() does not get passed the file pointer of validate_call(), which means its invocation of insn_reloc() will always return NULL. Make it take a file pointer.
While at it, make sure call_dest_name() uses arch_dest_reloc_offset(), otherwise it gets the pv_ops[] offset wrong.
Fabricating an intentional warning shows the change; previously:
vmlinux.o: warning: objtool: __flush_tlb_all_noinstr+0x4: call to {dynamic}() leaves .noinstr.text section
now:
vmlinux.o: warning: objtool: __flush_tlb_all_noinstr+0x4: call to pv_ops[1]() leaves .noinstr.text section
Signed-off-by: Valentin Schneider vschneid@redhat.com Acked-by: Josh Poimboeuf jpoimboe@kernel.org --- tools/objtool/check.c | 14 ++++++++------ 1 file changed, 8 insertions(+), 6 deletions(-)
diff --git a/tools/objtool/check.c b/tools/objtool/check.c index 3a411064fa34b..973dfc8fde792 100644 --- a/tools/objtool/check.c +++ b/tools/objtool/check.c @@ -3319,7 +3319,7 @@ static inline bool func_uaccess_safe(struct symbol *func) return false; }
-static inline const char *call_dest_name(struct instruction *insn) +static inline const char *call_dest_name(struct objtool_file *file, struct instruction *insn) { static char pvname[19]; struct reloc *reloc; @@ -3328,9 +3328,9 @@ static inline const char *call_dest_name(struct instruction *insn) if (insn_call_dest(insn)) return insn_call_dest(insn)->name;
- reloc = insn_reloc(NULL, insn); + reloc = insn_reloc(file, insn); if (reloc && !strcmp(reloc->sym->name, "pv_ops")) { - idx = (reloc_addend(reloc) / sizeof(void *)); + idx = (arch_dest_reloc_offset(reloc_addend(reloc)) / sizeof(void *)); snprintf(pvname, sizeof(pvname), "pv_ops[%d]", idx); return pvname; } @@ -3409,17 +3409,19 @@ static int validate_call(struct objtool_file *file, { if (state->noinstr && state->instr <= 0 && !noinstr_call_dest(file, insn, insn_call_dest(insn))) { - WARN_INSN(insn, "call to %s() leaves .noinstr.text section", call_dest_name(insn)); + WARN_INSN(insn, "call to %s() leaves .noinstr.text section", call_dest_name(file, insn)); return 1; }
if (state->uaccess && !func_uaccess_safe(insn_call_dest(insn))) { - WARN_INSN(insn, "call to %s() with UACCESS enabled", call_dest_name(insn)); + WARN_INSN(insn, "call to %s() with UACCESS enabled", + call_dest_name(file, insn)); return 1; }
if (state->df) { - WARN_INSN(insn, "call to %s() with DF set", call_dest_name(insn)); + WARN_INSN(insn, "call to %s() with DF set", + call_dest_name(file, insn)); return 1; }
I had to look into objtool itself to understand what this warning was about; make it more explicit.
Signed-off-by: Valentin Schneider vschneid@redhat.com Acked-by: Josh Poimboeuf jpoimboe@kernel.org --- tools/objtool/check.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/tools/objtool/check.c b/tools/objtool/check.c index 973dfc8fde792..08e73765059fc 100644 --- a/tools/objtool/check.c +++ b/tools/objtool/check.c @@ -3357,7 +3357,7 @@ static bool pv_call_dest(struct objtool_file *file, struct instruction *insn)
list_for_each_entry(target, &file->pv_ops[idx].targets, pv_target) { if (!target->sec->noinstr) { - WARN("pv_ops[%d]: %s", idx, target->name); + WARN("pv_ops[%d]: indirect call to %s() leaves .noinstr.text section", idx, target->name); file->pv_ops[idx].clean = false; } }
A later commit will reduce the size of the RCU watching counter to free up some bits for another purpose. Paul suggested adding a config option to test the extreme case where the counter is reduced to its minimum usable width for rcutorture to poke at, so do that.
Make it only configurable under RCU_EXPERT. While at it, add a comment to explain the layout of context_tracking->state.
Link: http://lore.kernel.org/r/4c2cb573-168f-4806-b1d9-164e8276e66a@paulmck-laptop Suggested-by: Paul E. McKenney paulmck@kernel.org Signed-off-by: Valentin Schneider vschneid@redhat.com Reviewed-by: Paul E. McKenney paulmck@kernel.org Reviewed-by: Frederic Weisbecker frederic@kernel.org --- include/linux/context_tracking_state.h | 44 ++++++++++++++++++++++---- kernel/rcu/Kconfig.debug | 15 +++++++++ 2 files changed, 52 insertions(+), 7 deletions(-)
diff --git a/include/linux/context_tracking_state.h b/include/linux/context_tracking_state.h index 7b8433d5a8efe..0b81248aa03e2 100644 --- a/include/linux/context_tracking_state.h +++ b/include/linux/context_tracking_state.h @@ -18,12 +18,6 @@ enum ctx_state { CT_STATE_MAX = 4, };
-/* Odd value for watching, else even. */ -#define CT_RCU_WATCHING CT_STATE_MAX - -#define CT_STATE_MASK (CT_STATE_MAX - 1) -#define CT_RCU_WATCHING_MASK (~CT_STATE_MASK) - struct context_tracking { #ifdef CONFIG_CONTEXT_TRACKING_USER /* @@ -44,9 +38,45 @@ struct context_tracking { #endif };
+/* + * We cram two different things within the same atomic variable: + * + * CT_RCU_WATCHING_START CT_STATE_START + * | | + * v v + * MSB [ RCU watching counter ][ context_state ] LSB + * ^ ^ + * | | + * CT_RCU_WATCHING_END CT_STATE_END + * + * Bits are used from the LSB upwards, so unused bits (if any) will always be in + * upper bits of the variable. + */ #ifdef CONFIG_CONTEXT_TRACKING +#define CT_SIZE (sizeof(((struct context_tracking *)0)->state) * BITS_PER_BYTE) + +#define CT_STATE_WIDTH bits_per(CT_STATE_MAX - 1) +#define CT_STATE_START 0 +#define CT_STATE_END (CT_STATE_START + CT_STATE_WIDTH - 1) + +#define CT_RCU_WATCHING_MAX_WIDTH (CT_SIZE - CT_STATE_WIDTH) +#define CT_RCU_WATCHING_WIDTH (IS_ENABLED(CONFIG_RCU_DYNTICKS_TORTURE) ? 2 : CT_RCU_WATCHING_MAX_WIDTH) +#define CT_RCU_WATCHING_START (CT_STATE_END + 1) +#define CT_RCU_WATCHING_END (CT_RCU_WATCHING_START + CT_RCU_WATCHING_WIDTH - 1) +#define CT_RCU_WATCHING BIT(CT_RCU_WATCHING_START) + +#define CT_STATE_MASK GENMASK(CT_STATE_END, CT_STATE_START) +#define CT_RCU_WATCHING_MASK GENMASK(CT_RCU_WATCHING_END, CT_RCU_WATCHING_START) + +#define CT_UNUSED_WIDTH (CT_RCU_WATCHING_MAX_WIDTH - CT_RCU_WATCHING_WIDTH) + +static_assert(CT_STATE_WIDTH + + CT_RCU_WATCHING_WIDTH + + CT_UNUSED_WIDTH == + CT_SIZE); + DECLARE_PER_CPU(struct context_tracking, context_tracking); -#endif +#endif /* CONFIG_CONTEXT_TRACKING */
#ifdef CONFIG_CONTEXT_TRACKING_USER static __always_inline int __ct_state(void) diff --git a/kernel/rcu/Kconfig.debug b/kernel/rcu/Kconfig.debug index 12e4c64ebae15..625d75392647b 100644 --- a/kernel/rcu/Kconfig.debug +++ b/kernel/rcu/Kconfig.debug @@ -213,4 +213,19 @@ config RCU_STRICT_GRACE_PERIOD when looking for certain types of RCU usage bugs, for example, too-short RCU read-side critical sections.
+ +config RCU_DYNTICKS_TORTURE + bool "Minimize RCU dynticks counter size" + depends on RCU_EXPERT && !COMPILE_TEST + default n + help + This option sets the width of the dynticks counter to its + minimum usable value. This minimum width greatly increases + the probability of flushing out bugs involving counter wrap, + but it also increases the probability of extending grace period + durations. This Kconfig option should therefore be avoided in + production due to the consequent increased probability of OOMs. + + This has no value for production and is only for testing. + endmenu # "RCU Debugging"
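For reference, assuming a 32-bit ->state field, the bit budget under the layout introduced above works out as follows (illustrative arithmetic, not part of the patch):

    /*
     * With CT_STATE_MAX == 4 and a 32-bit ->state field:
     *
     *   CT_STATE_WIDTH            = bits_per(4 - 1)             = 2
     *   CT_RCU_WATCHING_MAX_WIDTH = 32 - 2                      = 30
     *   CT_RCU_WATCHING_WIDTH     = RCU_DYNTICKS_TORTURE ? 2 : 30
     *   CT_UNUSED_WIDTH           = 30 - CT_RCU_WATCHING_WIDTH  = 28 or 0
     *
     * i.e. the torture option shrinks the watching counter to 2 bits, and
     * the static_assert still holds: 2 + 2 + 28 == 32.
     */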
We now have an RCU_EXPERT config for testing a small-sized RCU dynticks counter: CONFIG_RCU_DYNTICKS_TORTURE.

Modify the TREE04 scenario to use this config in order to exercise a ridiculously small (2-bit) counter.
Link: http://lore.kernel.org/r/4c2cb573-168f-4806-b1d9-164e8276e66a@paulmck-laptop Suggested-by: Paul E. McKenney paulmck@kernel.org Signed-off-by: Valentin Schneider vschneid@redhat.com Reviewed-by: Paul E. McKenney paulmck@kernel.org Reviewed-by: Frederic Weisbecker frederic@kernel.org --- tools/testing/selftests/rcutorture/configs/rcu/TREE04 | 1 + 1 file changed, 1 insertion(+)
diff --git a/tools/testing/selftests/rcutorture/configs/rcu/TREE04 b/tools/testing/selftests/rcutorture/configs/rcu/TREE04 index dc4985064b3ad..67caf4276bb01 100644 --- a/tools/testing/selftests/rcutorture/configs/rcu/TREE04 +++ b/tools/testing/selftests/rcutorture/configs/rcu/TREE04 @@ -16,3 +16,4 @@ CONFIG_DEBUG_OBJECTS_RCU_HEAD=n CONFIG_RCU_EXPERT=y CONFIG_RCU_EQS_DEBUG=y CONFIG_RCU_LAZY=y +CONFIG_RCU_DYNTICKS_TORTURE=y
From: Josh Poimboeuf jpoimboe@kernel.org
Deferring a code patching IPI is unsafe if the patched code is in a noinstr region. In that case the text poke code must trigger an immediate IPI to all CPUs, which can rudely interrupt an isolated NO_HZ CPU running in userspace.
Some noinstr static branches may really need to be patched at runtime, despite the resulting disruption. Add DEFINE_STATIC_KEY_*_NOINSTR() variants for those. They don't do anything special yet; that will come later.
Signed-off-by: Josh Poimboeuf jpoimboe@kernel.org --- include/linux/jump_label.h | 17 +++++++++++++++++ 1 file changed, 17 insertions(+)
diff --git a/include/linux/jump_label.h b/include/linux/jump_label.h index fdb79dd1ebd8c..c4f6240ff4d95 100644 --- a/include/linux/jump_label.h +++ b/include/linux/jump_label.h @@ -388,6 +388,23 @@ struct static_key_false { #define DEFINE_STATIC_KEY_FALSE_RO(name) \ struct static_key_false name __ro_after_init = STATIC_KEY_FALSE_INIT
+/* + * The _NOINSTR variants are used to tell objtool the static key is allowed to + * be used in noinstr code. + * + * They should almost never be used, as they prevent code patching IPIs from + * being deferred, which can be problematic for isolated NOHZ_FULL CPUs running + * in pure userspace. + * + * If using one of these _NOINSTR variants, please add a comment above the + * definition with the rationale. + */ +#define DEFINE_STATIC_KEY_TRUE_NOINSTR(name) \ + DEFINE_STATIC_KEY_TRUE(name) + +#define DEFINE_STATIC_KEY_FALSE_NOINSTR(name) \ + DEFINE_STATIC_KEY_FALSE(name) + #define DECLARE_STATIC_KEY_FALSE(name) \ extern struct static_key_false name
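As an illustrative usage sketch (the key and function names below are made up, not part of this patch), a whitelisted key would carry its rationale right next to its definition:

    /*
     * Flipped at runtime (e.g. on SMT hotplug) *and* read from noinstr code.
     * Hotplug already perturbs every online CPU, so deferring the text
     * patching IPI buys us nothing here.
     */
    DEFINE_STATIC_KEY_FALSE_NOINSTR(example_idle_clear);

    noinstr void example_idle_enter(void)
    {
            if (static_branch_likely(&example_idle_clear))
                    example_clear_cpu_buffers();    /* hypothetical mitigation */
    }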
From: Josh Poimboeuf jpoimboe@kernel.org
Deferring a code patching IPI is unsafe if the patched code is in a noinstr region. In that case the text poke code must trigger an immediate IPI to all CPUs, which can rudely interrupt an isolated NO_HZ CPU running in userspace.
If a noinstr static call only needs to be patched during boot, its key can be made ro-after-init to ensure it will never be patched at runtime.
Signed-off-by: Josh Poimboeuf jpoimboe@kernel.org --- include/linux/static_call.h | 16 ++++++++++++++++ 1 file changed, 16 insertions(+)
diff --git a/include/linux/static_call.h b/include/linux/static_call.h index 78a77a4ae0ea8..ea6ca57e2a829 100644 --- a/include/linux/static_call.h +++ b/include/linux/static_call.h @@ -192,6 +192,14 @@ extern long __static_call_return0(void); }; \ ARCH_DEFINE_STATIC_CALL_TRAMP(name, _func)
+#define DEFINE_STATIC_CALL_RO(name, _func) \ + DECLARE_STATIC_CALL(name, _func); \ + struct static_call_key __ro_after_init STATIC_CALL_KEY(name) = {\ + .func = _func, \ + .type = 1, \ + }; \ + ARCH_DEFINE_STATIC_CALL_TRAMP(name, _func) + #define DEFINE_STATIC_CALL_NULL(name, _func) \ DECLARE_STATIC_CALL(name, _func); \ struct static_call_key STATIC_CALL_KEY(name) = { \ @@ -200,6 +208,14 @@ extern long __static_call_return0(void); }; \ ARCH_DEFINE_STATIC_CALL_NULL_TRAMP(name)
+#define DEFINE_STATIC_CALL_NULL_RO(name, _func) \ + DECLARE_STATIC_CALL(name, _func); \ + struct static_call_key __ro_after_init STATIC_CALL_KEY(name) = {\ + .func = NULL, \ + .type = 1, \ + }; \ + ARCH_DEFINE_STATIC_CALL_NULL_TRAMP(name) + #define DEFINE_STATIC_CALL_RET0(name, _func) \ DECLARE_STATIC_CALL(name, _func); \ struct static_call_key STATIC_CALL_KEY(name) = { \
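For illustration, a hypothetical init-only static call would be defined and updated like so (names are made up; the following patches convert real instances such as pv_sched_clock):

    /* Hypothetical clock hook which is only ever selected during boot. */
    static u64 example_clock_default(void)
    {
            return 0;
    }

    DEFINE_STATIC_CALL_RO(example_clock, example_clock_default);

    void __init example_clock_select(u64 (*fn)(void))
    {
            /* Before mark_rodata_ro(), so the __ro_after_init key is still writable. */
            static_call_update(example_clock, fn);
    }

    /* Callers, noinstr ones included, simply do: static_call(example_clock)(); */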
Later commits will cause objtool to warn about static calls being used in .noinstr sections in order to safely defer instruction patching IPIs targeted at NOHZ_FULL CPUs.
pv_sched_clock is updated in:
o __init vmware_paravirt_ops_setup()
o __init xen_init_time_common()
o kvm_sched_clock_init() <- __init kvmclock_init()
o hv_setup_sched_clock() <- __init hv_init_tsc_clocksource()
IOW purely init context, and can thus be marked as __ro_after_init.
Reported-by: Josh Poimboeuf jpoimboe@kernel.org Signed-off-by: Valentin Schneider vschneid@redhat.com --- arch/x86/kernel/paravirt.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c index 1ccd05d8999f1..0da0ec6cdecfb 100644 --- a/arch/x86/kernel/paravirt.c +++ b/arch/x86/kernel/paravirt.c @@ -68,7 +68,7 @@ static u64 native_steal_clock(int cpu) }
DEFINE_STATIC_CALL(pv_steal_clock, native_steal_clock); -DEFINE_STATIC_CALL(pv_sched_clock, native_sched_clock); +DEFINE_STATIC_CALL_RO(pv_sched_clock, native_sched_clock);
void paravirt_set_sched_clock(u64 (*func)(void)) {
Later commits will cause objtool to warn about static calls being used in .noinstr sections in order to safely defer instruction patching IPIs targeted at NOHZ_FULL CPUs.
x86_idle is updated in:
o xen_set_default_idle() <- __init xen_arch_setup()
o __init select_idle_routine()
IOW purely init context, and can thus be marked as __ro_after_init.
Reported-by: Josh Poimboeuf jpoimboe@kernel.org Signed-off-by: Valentin Schneider vschneid@redhat.com --- arch/x86/kernel/process.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c index 962c3ce39323e..90f31f8526aa4 100644 --- a/arch/x86/kernel/process.c +++ b/arch/x86/kernel/process.c @@ -752,7 +752,7 @@ void __cpuidle default_idle(void) EXPORT_SYMBOL(default_idle); #endif
-DEFINE_STATIC_CALL_NULL(x86_idle, default_idle); +DEFINE_STATIC_CALL_NULL_RO(x86_idle, default_idle);
static bool x86_idle_set(void) {
The static call is only ever updated in
  __init pv_time_init()
  __init xen_init_time_common()
  __init vmware_paravirt_ops_setup()
  __init xen_time_setup_guest()
so mark it appropriately as __ro_after_init.
Signed-off-by: Valentin Schneider vschneid@redhat.com --- arch/x86/kernel/paravirt.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c index 0da0ec6cdecfb..a08b9766b8a36 100644 --- a/arch/x86/kernel/paravirt.c +++ b/arch/x86/kernel/paravirt.c @@ -67,7 +67,7 @@ static u64 native_steal_clock(int cpu) return 0; }
-DEFINE_STATIC_CALL(pv_steal_clock, native_steal_clock); +DEFINE_STATIC_CALL_RO(pv_steal_clock, native_steal_clock); DEFINE_STATIC_CALL_RO(pv_sched_clock, native_sched_clock);
void paravirt_set_sched_clock(u64 (*func)(void))
The static call is only ever updated in:
  __init pv_time_init()
  __init xen_time_setup_guest()
so mark it appropriately as __ro_after_init.
Signed-off-by: Valentin Schneider vschneid@redhat.com Reviewed-by: Andrew Jones ajones@ventanamicro.com --- arch/riscv/kernel/paravirt.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/arch/riscv/kernel/paravirt.c b/arch/riscv/kernel/paravirt.c index fa6b0339a65de..dfe8808016fd8 100644 --- a/arch/riscv/kernel/paravirt.c +++ b/arch/riscv/kernel/paravirt.c @@ -30,7 +30,7 @@ static u64 native_steal_clock(int cpu) return 0; }
-DEFINE_STATIC_CALL(pv_steal_clock, native_steal_clock); +DEFINE_STATIC_CALL_RO(pv_steal_clock, native_steal_clock);
static bool steal_acc = true; static int __init parse_no_stealacc(char *arg)
The static call is only ever updated in
  __init pv_time_init()
  __init xen_time_setup_guest()
so mark it appropriately as __ro_after_init.
Signed-off-by: Valentin Schneider vschneid@redhat.com --- arch/loongarch/kernel/paravirt.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/arch/loongarch/kernel/paravirt.c b/arch/loongarch/kernel/paravirt.c index e5a39bbad0780..b011578d3e931 100644 --- a/arch/loongarch/kernel/paravirt.c +++ b/arch/loongarch/kernel/paravirt.c @@ -20,7 +20,7 @@ static u64 native_steal_clock(int cpu) return 0; }
-DEFINE_STATIC_CALL(pv_steal_clock, native_steal_clock); +DEFINE_STATIC_CALL_RO(pv_steal_clock, native_steal_clock);
static bool steal_acc = true;
The static call is only ever updated in
  __init pv_time_init()
  __init xen_time_setup_guest()
so mark it appropriately as __ro_after_init.
Signed-off-by: Valentin Schneider vschneid@redhat.com --- arch/arm64/kernel/paravirt.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/arch/arm64/kernel/paravirt.c b/arch/arm64/kernel/paravirt.c index aa718d6a9274a..ad28fa23c9228 100644 --- a/arch/arm64/kernel/paravirt.c +++ b/arch/arm64/kernel/paravirt.c @@ -32,7 +32,7 @@ static u64 native_steal_clock(int cpu) return 0; }
-DEFINE_STATIC_CALL(pv_steal_clock, native_steal_clock); +DEFINE_STATIC_CALL_RO(pv_steal_clock, native_steal_clock);
struct pv_time_stolen_time_region { struct pvclock_vcpu_stolen_time __rcu *kaddr;
The static call is only ever updated in
__init xen_time_setup_guest()
so mark it appropriately as __ro_after_init.
Signed-off-by: Valentin Schneider vschneid@redhat.com --- arch/arm/kernel/paravirt.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/arch/arm/kernel/paravirt.c b/arch/arm/kernel/paravirt.c index 7dd9806369fb0..632d8d5e06db3 100644 --- a/arch/arm/kernel/paravirt.c +++ b/arch/arm/kernel/paravirt.c @@ -20,4 +20,4 @@ static u64 native_steal_clock(int cpu) return 0; }
-DEFINE_STATIC_CALL(pv_steal_clock, native_steal_clock); +DEFINE_STATIC_CALL_RO(pv_steal_clock, native_steal_clock);
Later commits will cause objtool to warn about static calls being used in .noinstr sections in order to safely defer instruction patching IPIs targeted at NOHZ_FULL CPUs.
perf_lopwr_cb is used in .noinstr code, but is only ever updated in __init amd_brs_lopwr_init(), and can thus be marked as __ro_after_init.
Reported-by: Josh Poimboeuf jpoimboe@kernel.org Signed-off-by: Valentin Schneider vschneid@redhat.com --- arch/x86/events/amd/brs.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/arch/x86/events/amd/brs.c b/arch/x86/events/amd/brs.c index ec34274633824..218545ffe3e24 100644 --- a/arch/x86/events/amd/brs.c +++ b/arch/x86/events/amd/brs.c @@ -423,7 +423,7 @@ void noinstr perf_amd_brs_lopwr_cb(bool lopwr_in) } }
-DEFINE_STATIC_CALL_NULL(perf_lopwr_cb, perf_amd_brs_lopwr_cb); +DEFINE_STATIC_CALL_NULL_RO(perf_lopwr_cb, perf_amd_brs_lopwr_cb); EXPORT_STATIC_CALL_TRAMP_GPL(perf_lopwr_cb);
void __init amd_brs_lopwr_init(void)
sched_clock_running is only ever enabled in the __init functions sched_clock_init() and sched_clock_init_late(), and is never disabled. Mark it __ro_after_init.
Signed-off-by: Valentin Schneider vschneid@redhat.com --- kernel/sched/clock.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/kernel/sched/clock.c b/kernel/sched/clock.c index a09655b481402..200e5568b9894 100644 --- a/kernel/sched/clock.c +++ b/kernel/sched/clock.c @@ -66,7 +66,7 @@ notrace unsigned long long __weak sched_clock(void) } EXPORT_SYMBOL_GPL(sched_clock);
-static DEFINE_STATIC_KEY_FALSE(sched_clock_running); +static DEFINE_STATIC_KEY_FALSE_RO(sched_clock_running);
#ifdef CONFIG_HAVE_UNSTABLE_SCHED_CLOCK /*
The static key is only ever enabled in
__init hv_init_evmcs()
so mark it appropriately as __ro_after_init.
Reported-by: Sean Christopherson seanjc@google.com Signed-off-by: Valentin Schneider vschneid@redhat.com --- arch/x86/kvm/vmx/vmx_onhyperv.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/arch/x86/kvm/vmx/vmx_onhyperv.c b/arch/x86/kvm/vmx/vmx_onhyperv.c index b9a8b91166d02..ff3d80c9565bb 100644 --- a/arch/x86/kvm/vmx/vmx_onhyperv.c +++ b/arch/x86/kvm/vmx/vmx_onhyperv.c @@ -3,7 +3,7 @@ #include "capabilities.h" #include "vmx_onhyperv.h"
-DEFINE_STATIC_KEY_FALSE(__kvm_is_using_evmcs); +DEFINE_STATIC_KEY_FALSE_RO(__kvm_is_using_evmcs);
/* * KVM on Hyper-V always uses the latest known eVMCSv1 revision, the assumption
Later commits will cause objtool to warn about static keys being used in .noinstr sections in order to safely defer instruction patching IPIs targeted at NOHZ_FULL CPUs.
mds_idle_clear is used in .noinstr code, and can be modified at runtime (SMT hotplug). Suppressing the text_poke_sync() IPI has little benefit for this key, as hotplug implies eventually going through takedown_cpu() -> stop_machine_cpuslocked(), which is going to cause interference on all online CPUs anyway.
Mark it to let objtool know not to warn about it.
Signed-off-by: Valentin Schneider vschneid@redhat.com --- arch/x86/kernel/cpu/bugs.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/arch/x86/kernel/cpu/bugs.c b/arch/x86/kernel/cpu/bugs.c index 362602b705cc4..59a77ca1bb14c 100644 --- a/arch/x86/kernel/cpu/bugs.c +++ b/arch/x86/kernel/cpu/bugs.c @@ -117,7 +117,7 @@ DEFINE_STATIC_KEY_FALSE(switch_vcpu_ibpb); EXPORT_SYMBOL_GPL(switch_vcpu_ibpb);
/* Control MDS CPU buffer clear before idling (halt, mwait) */ -DEFINE_STATIC_KEY_FALSE(mds_idle_clear); +DEFINE_STATIC_KEY_FALSE_NOINSTR(mds_idle_clear); EXPORT_SYMBOL_GPL(mds_idle_clear);
/*
Later commits will cause objtool to warn about static keys being used in .noinstr sections in order to safely defer instruction patching IPIs targeted at NOHZ_FULL CPUs.
__sched_clock_stable is used in .noinstr code, and can be modified at runtime (e.g. time_cpufreq_notifier()). Suppressing the text_poke_sync() IPI has little benefit for this key, as NOHZ_FULL is incompatible with an unstable TSC anyway.
Mark it to let objtool know not to warn about it.
Signed-off-by: Valentin Schneider vschneid@redhat.com --- kernel/sched/clock.c | 5 ++++- 1 file changed, 4 insertions(+), 1 deletion(-)
diff --git a/kernel/sched/clock.c b/kernel/sched/clock.c
index 200e5568b9894..e59986bc14a43 100644
--- a/kernel/sched/clock.c
+++ b/kernel/sched/clock.c
@@ -75,8 +75,11 @@ static DEFINE_STATIC_KEY_FALSE_RO(sched_clock_running);
  *
  * Similarly we start with __sched_clock_stable_early, thereby assuming we
  * will become stable, such that there's only a single 1 -> 0 transition.
+ *
+ * Allowed in .noinstr as an unstable TSC is incompatible with NOHZ_FULL,
+ * thus the text patching IPI would be the least of our concerns.
  */
-static DEFINE_STATIC_KEY_FALSE(__sched_clock_stable);
+static DEFINE_STATIC_KEY_FALSE_NOINSTR(__sched_clock_stable);
 static int __sched_clock_stable_early = 1;
/*
Later commits will cause objtool to warn about static keys being used in .noinstr sections in order to safely defer instruction patching IPIs targeted at NOHZ_FULL CPUs.
These keys are used in .noinstr code, and can be modified at runtime (/proc/kernel/vmx* write). However it is not expected that they will be flipped during latency-sensitive operations, and thus shouldn't be a source of interference wrt the text patching IPI.
Mark them to let objtool know not to warn about them.
Reported-by: Josh Poimboeuf jpoimboe@kernel.org Signed-off-by: Valentin Schneider vschneid@redhat.com --- arch/x86/kvm/vmx/vmx.c | 11 +++++++++-- 1 file changed, 9 insertions(+), 2 deletions(-)
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c index 5c5766467a61d..00053458cd10c 100644 --- a/arch/x86/kvm/vmx/vmx.c +++ b/arch/x86/kvm/vmx/vmx.c @@ -225,8 +225,15 @@ module_param(pt_mode, int, S_IRUGO);
struct x86_pmu_lbr __ro_after_init vmx_lbr_caps;
-static DEFINE_STATIC_KEY_FALSE(vmx_l1d_should_flush); -static DEFINE_STATIC_KEY_FALSE(vmx_l1d_flush_cond); +/* + * Both of these static keys end up being used in .noinstr sections, however + * they are only modified: + * - at init + * - from a /proc/kernel/vmx* write + * thus during latency-sensitive operations they should remain stable. + */ +static DEFINE_STATIC_KEY_FALSE_NOINSTR(vmx_l1d_should_flush); +static DEFINE_STATIC_KEY_FALSE_NOINSTR(vmx_l1d_flush_cond); static DEFINE_MUTEX(vmx_l1d_flush_mutex);
/* Storage for pre module init parameter parsing */
Later commits will cause objtool to warn about static keys being used in .noinstr sections in order to safely defer instruction patching IPIs targeted at NOHZ_FULL CPUs.
stack_erasing_bypass is used in .noinstr code, and can be modified at runtime (/proc/sys/kernel/stack_erasing write). However it is not expected that it will be flipped during latency-sensitive operations, and thus shouldn't be a source of interference wrt the text patching IPI.
Mark it to let objtool know not to warn about it.
Reported-by: Josh Poimboeuf jpoimboe@kernel.org Signed-off-by: Valentin Schneider vschneid@redhat.com --- kernel/stackleak.c | 6 +++++- 1 file changed, 5 insertions(+), 1 deletion(-)
diff --git a/kernel/stackleak.c b/kernel/stackleak.c index bb65321761b43..51b24d1e04ba2 100644 --- a/kernel/stackleak.c +++ b/kernel/stackleak.c @@ -19,7 +19,11 @@ #include <linux/sysctl.h> #include <linux/init.h>
-static DEFINE_STATIC_KEY_FALSE(stack_erasing_bypass);
+/*
+ * This static key can only be modified via its sysctl interface. It is
+ * expected it will remain stable during latency-sensitive operations.
+ */
+static DEFINE_STATIC_KEY_FALSE_NOINSTR(stack_erasing_bypass);
#ifdef CONFIG_SYSCTL static int stack_erasing_sysctl(const struct ctl_table *table, int write,
From: Josh Poimboeuf jpoimboe@kernel.org
Warn about static branches/calls in noinstr regions, unless the corresponding key is RO-after-init or has been manually whitelisted with DEFINE_STATIC_KEY_*_NOINSTR().
Signed-off-by: Josh Poimboeuf jpoimboe@kernel.org [Added NULL check for insn_call_dest() return value] Signed-off-by: Valentin Schneider vschneid@redhat.com --- include/linux/jump_label.h | 17 +++-- include/linux/objtool.h | 7 ++ include/linux/static_call.h | 3 + tools/objtool/Documentation/objtool.txt | 34 +++++++++ tools/objtool/check.c | 92 ++++++++++++++++++++++--- tools/objtool/include/objtool/check.h | 1 + tools/objtool/include/objtool/elf.h | 1 + tools/objtool/include/objtool/special.h | 1 + tools/objtool/special.c | 15 +++- 9 files changed, 155 insertions(+), 16 deletions(-)
diff --git a/include/linux/jump_label.h b/include/linux/jump_label.h index c4f6240ff4d95..0ea203ebbc493 100644 --- a/include/linux/jump_label.h +++ b/include/linux/jump_label.h @@ -76,6 +76,7 @@ #include <linux/types.h> #include <linux/compiler.h> #include <linux/cleanup.h> +#include <linux/objtool.h>
extern bool static_key_initialized;
@@ -376,8 +377,9 @@ struct static_key_false { #define DEFINE_STATIC_KEY_TRUE(name) \ struct static_key_true name = STATIC_KEY_TRUE_INIT
-#define DEFINE_STATIC_KEY_TRUE_RO(name) \ - struct static_key_true name __ro_after_init = STATIC_KEY_TRUE_INIT +#define DEFINE_STATIC_KEY_TRUE_RO(name) \ + struct static_key_true name __ro_after_init = STATIC_KEY_TRUE_INIT; \ + ANNOTATE_NOINSTR_ALLOWED(name)
#define DECLARE_STATIC_KEY_TRUE(name) \ extern struct static_key_true name @@ -385,8 +387,9 @@ struct static_key_false { #define DEFINE_STATIC_KEY_FALSE(name) \ struct static_key_false name = STATIC_KEY_FALSE_INIT
-#define DEFINE_STATIC_KEY_FALSE_RO(name) \ - struct static_key_false name __ro_after_init = STATIC_KEY_FALSE_INIT +#define DEFINE_STATIC_KEY_FALSE_RO(name) \ + struct static_key_false name __ro_after_init = STATIC_KEY_FALSE_INIT; \ + ANNOTATE_NOINSTR_ALLOWED(name)
/* * The _NOINSTR variants are used to tell objtool the static key is allowed to @@ -400,10 +403,12 @@ struct static_key_false { * definition with the rationale. */ #define DEFINE_STATIC_KEY_TRUE_NOINSTR(name) \ - DEFINE_STATIC_KEY_TRUE(name) + DEFINE_STATIC_KEY_TRUE(name); \ + ANNOTATE_NOINSTR_ALLOWED(name)
#define DEFINE_STATIC_KEY_FALSE_NOINSTR(name) \ - DEFINE_STATIC_KEY_FALSE(name) + DEFINE_STATIC_KEY_FALSE(name); \ + ANNOTATE_NOINSTR_ALLOWED(name)
#define DECLARE_STATIC_KEY_FALSE(name) \ extern struct static_key_false name diff --git a/include/linux/objtool.h b/include/linux/objtool.h index 366ad004d794b..2d3661de4cf95 100644 --- a/include/linux/objtool.h +++ b/include/linux/objtool.h @@ -34,6 +34,12 @@ static void __used __section(".discard.func_stack_frame_non_standard") \ *__func_stack_frame_non_standard_##func = func
+#define __ANNOTATE_NOINSTR_ALLOWED(key) \ + static void __used __section(".discard.noinstr_allowed") \ + *__annotate_noinstr_allowed_##key = &key + +#define ANNOTATE_NOINSTR_ALLOWED(key) __ANNOTATE_NOINSTR_ALLOWED(key) + /* * STACK_FRAME_NON_STANDARD_FP() is a frame-pointer-specific function ignore * for the case where a function is intentionally missing frame pointer setup, @@ -130,6 +136,7 @@ #define STACK_FRAME_NON_STANDARD_FP(func) #define __ASM_ANNOTATE(label, type) "" #define ASM_ANNOTATE(type) +#define ANNOTATE_NOINSTR_ALLOWED(key) #else .macro UNWIND_HINT type:req sp_reg=0 sp_offset=0 signal=0 .endm diff --git a/include/linux/static_call.h b/include/linux/static_call.h index ea6ca57e2a829..0d4b16d348501 100644 --- a/include/linux/static_call.h +++ b/include/linux/static_call.h @@ -133,6 +133,7 @@
#include <linux/types.h> #include <linux/cpu.h> +#include <linux/objtool.h> #include <linux/static_call_types.h>
#ifdef CONFIG_HAVE_STATIC_CALL @@ -198,6 +199,7 @@ extern long __static_call_return0(void); .func = _func, \ .type = 1, \ }; \ + ANNOTATE_NOINSTR_ALLOWED(STATIC_CALL_TRAMP(name)); \ ARCH_DEFINE_STATIC_CALL_TRAMP(name, _func)
#define DEFINE_STATIC_CALL_NULL(name, _func) \ @@ -214,6 +216,7 @@ extern long __static_call_return0(void); .func = NULL, \ .type = 1, \ }; \ + ANNOTATE_NOINSTR_ALLOWED(STATIC_CALL_TRAMP(name)); \ ARCH_DEFINE_STATIC_CALL_NULL_TRAMP(name)
#define DEFINE_STATIC_CALL_RET0(name, _func) \ diff --git a/tools/objtool/Documentation/objtool.txt b/tools/objtool/Documentation/objtool.txt index 9e97fc25b2d8a..991e085e10d95 100644 --- a/tools/objtool/Documentation/objtool.txt +++ b/tools/objtool/Documentation/objtool.txt @@ -456,6 +456,40 @@ the objtool maintainers. these special names and does not use module_init() / module_exit() macros to create them.
+13. file.o: warning: func()+0x2a: key: non-RO static key usage in noinstr code + file.o: warning: func()+0x2a: key: non-RO static call usage in noinstr code + + This means that noinstr function func() uses a static key or + static call named 'key' which can be modified at runtime. This is + discouraged because it prevents code patching IPIs from being + deferred. + + You have the following options: + + 1) Check whether the static key/call in question is only modified + during init. If so, define it as read-only-after-init with + DEFINE_STATIC_KEY_*_RO() or DEFINE_STATIC_CALL_RO(). + + 2) Avoid the runtime patching. For static keys this can be done by + using static_key_enabled() or by getting rid of the static key + altogether if performance is not a concern. + + For static calls, something like the following could be done: + + target = static_call_query(foo); + if (target == func1) + func1(); + else if (target == func2) + func2(); + ... + + 3) Silence the warning by defining the static key/call with + DEFINE_STATIC_*_NOINSTR(). This decision should not + be taken lightly as it may result in code patching IPIs getting + sent to isolated NOHZ_FULL CPUs running in pure userspace. A + comment should be added above the definition explaining the + rationale for the decision. +
If the error doesn't seem to make sense, it could be a bug in objtool. Feel free to ask objtool maintainers for help. diff --git a/tools/objtool/check.c b/tools/objtool/check.c index 08e73765059fc..85f50777d7219 100644 --- a/tools/objtool/check.c +++ b/tools/objtool/check.c @@ -978,6 +978,45 @@ static int create_direct_call_sections(struct objtool_file *file) return 0; }
+static int read_noinstr_allowed(struct objtool_file *file) +{ + struct section *rsec; + struct symbol *sym; + struct reloc *reloc; + + rsec = find_section_by_name(file->elf, ".rela.discard.noinstr_allowed"); + if (!rsec) + return 0; + + for_each_reloc(rsec, reloc) { + switch (reloc->sym->type) { + case STT_OBJECT: + case STT_FUNC: + sym = reloc->sym; + break; + + case STT_SECTION: + sym = find_symbol_by_offset(reloc->sym->sec, + reloc_addend(reloc)); + if (!sym) { + WARN_FUNC(reloc->sym->sec, reloc_addend(reloc), + "can't find static key/call symbol"); + return -1; + } + break; + + default: + WARN("unexpected relocation symbol type in %s: %d", + rsec->name, reloc->sym->type); + return -1; + } + + sym->noinstr_allowed = 1; + } + + return 0; +} + /* * Warnings shouldn't be reported for ignored functions. */ @@ -1864,6 +1903,8 @@ static int handle_jump_alt(struct objtool_file *file, return -1; }
+ orig_insn->key = special_alt->key; + if (opts.hack_jump_label && special_alt->key_addend & 2) { struct reloc *reloc = insn_reloc(file, orig_insn);
@@ -2596,6 +2637,10 @@ static int decode_sections(struct objtool_file *file) if (ret) return ret;
+ ret = read_noinstr_allowed(file); + if (ret) + return ret; + return 0; }
@@ -3365,9 +3410,9 @@ static bool pv_call_dest(struct objtool_file *file, struct instruction *insn) return file->pv_ops[idx].clean; }
-static inline bool noinstr_call_dest(struct objtool_file *file, - struct instruction *insn, - struct symbol *func) +static inline bool noinstr_call_allowed(struct objtool_file *file, + struct instruction *insn, + struct symbol *func) { /* * We can't deal with indirect function calls at present; @@ -3387,10 +3432,10 @@ static inline bool noinstr_call_dest(struct objtool_file *file, return true;
/* - * If the symbol is a static_call trampoline, we can't tell. + * Only DEFINE_STATIC_CALL_*_RO allowed. */ if (func->static_call_tramp) - return true; + return func->noinstr_allowed;
/* * The __ubsan_handle_*() calls are like WARN(), they only happen when @@ -3403,14 +3448,29 @@ static inline bool noinstr_call_dest(struct objtool_file *file, return false; }
+static char *static_call_name(struct symbol *func) +{ + return func->name + strlen("__SCT__"); +} + static int validate_call(struct objtool_file *file, struct instruction *insn, struct insn_state *state) { - if (state->noinstr && state->instr <= 0 && - !noinstr_call_dest(file, insn, insn_call_dest(insn))) { - WARN_INSN(insn, "call to %s() leaves .noinstr.text section", call_dest_name(file, insn)); - return 1; + if (state->noinstr && state->instr <= 0) { + struct symbol *dest = insn_call_dest(insn); + + if (dest && dest->static_call_tramp) { + if (!dest->noinstr_allowed) { + WARN_INSN(insn, "%s: non-RO static call usage in noinstr", + static_call_name(dest)); + } + + } else if (dest && !noinstr_call_allowed(file, insn, dest)) { + WARN_INSN(insn, "call to %s() leaves .noinstr.text section", + call_dest_name(file, insn)); + return 1; + } }
if (state->uaccess && !func_uaccess_safe(insn_call_dest(insn))) { @@ -3475,6 +3535,17 @@ static int validate_return(struct symbol *func, struct instruction *insn, struct return 0; }
+static int validate_static_key(struct instruction *insn, struct insn_state *state) +{ + if (state->noinstr && state->instr <= 0 && !insn->key->noinstr_allowed) { + WARN_INSN(insn, "%s: non-RO static key usage in noinstr", + insn->key->name); + return 1; + } + + return 0; +} + static struct instruction *next_insn_to_validate(struct objtool_file *file, struct instruction *insn) { @@ -3662,6 +3733,9 @@ static int validate_branch(struct objtool_file *file, struct symbol *func, if (handle_insn_ops(insn, next_insn, &state)) return 1;
+ if (insn->key) + validate_static_key(insn, &state); + switch (insn->type) {
case INSN_RETURN: diff --git a/tools/objtool/include/objtool/check.h b/tools/objtool/include/objtool/check.h index 00fb745e72339..d79b08f55bcbc 100644 --- a/tools/objtool/include/objtool/check.h +++ b/tools/objtool/include/objtool/check.h @@ -81,6 +81,7 @@ struct instruction { struct symbol *sym; struct stack_op *stack_ops; struct cfi_state *cfi; + struct symbol *key; };
static inline struct symbol *insn_func(struct instruction *insn) diff --git a/tools/objtool/include/objtool/elf.h b/tools/objtool/include/objtool/elf.h index c7c4e87ebe882..b3c11869931fd 100644 --- a/tools/objtool/include/objtool/elf.h +++ b/tools/objtool/include/objtool/elf.h @@ -70,6 +70,7 @@ struct symbol { u8 local_label : 1; u8 frame_pointer : 1; u8 ignore : 1; + u8 noinstr_allowed : 1; struct list_head pv_target; struct reloc *relocs; }; diff --git a/tools/objtool/include/objtool/special.h b/tools/objtool/include/objtool/special.h index 72d09c0adf1a1..e84d704f3f20e 100644 --- a/tools/objtool/include/objtool/special.h +++ b/tools/objtool/include/objtool/special.h @@ -18,6 +18,7 @@ struct special_alt { bool group; bool jump_or_nop; u8 key_addend; + struct symbol *key;
struct section *orig_sec; unsigned long orig_off; diff --git a/tools/objtool/special.c b/tools/objtool/special.c index c80fed8a840ee..d77f3fa4bbbc9 100644 --- a/tools/objtool/special.c +++ b/tools/objtool/special.c @@ -110,13 +110,26 @@ static int get_alt_entry(struct elf *elf, const struct special_entry *entry,
if (entry->key) { struct reloc *key_reloc; + struct symbol *key; + s64 key_addend;
key_reloc = find_reloc_by_dest(elf, sec, offset + entry->key); if (!key_reloc) { ERROR_FUNC(sec, offset + entry->key, "can't find key reloc"); return -1; } - alt->key_addend = reloc_addend(key_reloc); + + key = key_reloc->sym; + key_addend = reloc_addend(key_reloc); + + if (key->type == STT_SECTION) + key = find_symbol_by_offset(key->sec, key_addend & ~3); + + /* embedded keys not supported */ + if (key) { + alt->key = key; + alt->key_addend = key_addend; + } }
return 0;
The text_size bit referred to by the comment has been removed as of commit
ac3b43283923 ("module: replace module_layout with module_memory")
and is thus no longer relevant. Remove it and comment about the contents of the masks array instead.
Signed-off-by: Valentin Schneider vschneid@redhat.com --- kernel/module/main.c | 9 ++++----- 1 file changed, 4 insertions(+), 5 deletions(-)
diff --git a/kernel/module/main.c b/kernel/module/main.c index a2859dc3eea66..b9f010daaa4c7 100644 --- a/kernel/module/main.c +++ b/kernel/module/main.c @@ -1562,12 +1562,11 @@ static void __layout_sections(struct module *mod, struct load_info *info, bool i { unsigned int m, i;
+ /* + * { Mask of required section header flags, + * Mask of excluded section header flags } + */ static const unsigned long masks[][2] = { - /* - * NOTE: all executable code must be the first section - * in this array; otherwise modify the text_size - * finder in the two loops below - */ { SHF_EXECINSTR | SHF_ALLOC, ARCH_SHF_SMALL }, { SHF_ALLOC, SHF_WRITE | ARCH_SHF_SMALL }, { SHF_RO_AFTER_INIT | SHF_ALLOC, ARCH_SHF_SMALL },
On 4/29/25 13:32, Valentin Schneider wrote:
> The text_size bit referred to by the comment has been removed as of commit
>
>   ac3b43283923 ("module: replace module_layout with module_memory")
>
> and is thus no longer relevant. Remove it and comment about the contents
> of the masks array instead.
>
> Signed-off-by: Valentin Schneider <vschneid@redhat.com>

This comment cleanup is independent of the rest of the series. I've picked
it separately on modules-next.
As pointed out by Sean [1], is_kernel_noinstr_text() will return false for an address contained within a module's .noinstr.text section. A later patch will require checking whether a text address is noinstr, and this can unfortunately be the case of modules - KVM is one such case.
A module's .noinstr.text section is already tracked as of commit 66e9b0717102 ("kprobes: Prevent probes in .noinstr.text section") for kprobe blacklisting purposes, but via an ad-hoc mechanism.
Add a MOD_NOINSTR_TEXT mem_type, and reorganize __layout_sections() so that it maps all the sections in a single invocation.
[1]: http://lore.kernel.org/r/Z4qQL89GZ_gk0vpu@google.com Signed-off-by: Valentin Schneider vschneid@redhat.com --- include/linux/module.h | 6 ++-- kernel/kprobes.c | 8 ++--- kernel/module/main.c | 76 ++++++++++++++++++++++++++++++++---------- 3 files changed, 66 insertions(+), 24 deletions(-)
diff --git a/include/linux/module.h b/include/linux/module.h index d94b196d5a34e..193d8d34eeee0 100644 --- a/include/linux/module.h +++ b/include/linux/module.h @@ -332,6 +332,7 @@ struct mod_tree_node {
enum mod_mem_type { MOD_TEXT = 0, + MOD_NOINSTR_TEXT, MOD_DATA, MOD_RODATA, MOD_RO_AFTER_INIT, @@ -502,8 +503,6 @@ struct module { void __percpu *percpu; unsigned int percpu_size; #endif - void *noinstr_text_start; - unsigned int noinstr_text_size;
#ifdef CONFIG_TRACEPOINTS unsigned int num_tracepoints; @@ -622,12 +621,13 @@ static inline bool module_is_coming(struct module *mod) return mod->state == MODULE_STATE_COMING; }
-struct module *__module_text_address(unsigned long addr); struct module *__module_address(unsigned long addr); +struct module *__module_text_address(unsigned long addr); bool is_module_address(unsigned long addr); bool __is_module_percpu_address(unsigned long addr, unsigned long *can_addr); bool is_module_percpu_address(unsigned long addr); bool is_module_text_address(unsigned long addr); +bool is_module_noinstr_text_address(unsigned long addr);
static inline bool within_module_mem_type(unsigned long addr, const struct module *mod, diff --git a/kernel/kprobes.c b/kernel/kprobes.c index ffe0c3d523063..9a799faee68a1 100644 --- a/kernel/kprobes.c +++ b/kernel/kprobes.c @@ -2547,9 +2547,9 @@ static void add_module_kprobe_blacklist(struct module *mod) kprobe_add_area_blacklist(start, end); }
- start = (unsigned long)mod->noinstr_text_start; + start = (unsigned long)mod->mem[MOD_NOINSTR_TEXT].base; if (start) { - end = start + mod->noinstr_text_size; + end = start + mod->mem[MOD_NOINSTR_TEXT].size; kprobe_add_area_blacklist(start, end); } } @@ -2570,9 +2570,9 @@ static void remove_module_kprobe_blacklist(struct module *mod) kprobe_remove_area_blacklist(start, end); }
- start = (unsigned long)mod->noinstr_text_start; + start = (unsigned long)mod->mem[MOD_NOINSTR_TEXT].base; if (start) { - end = start + mod->noinstr_text_size; + end = start + mod->mem[MOD_NOINSTR_TEXT].size; kprobe_remove_area_blacklist(start, end); } } diff --git a/kernel/module/main.c b/kernel/module/main.c index b9f010daaa4c7..0126bae64b698 100644 --- a/kernel/module/main.c +++ b/kernel/module/main.c @@ -1558,7 +1558,17 @@ bool module_init_layout_section(const char *sname) return module_init_section(sname); }
-static void __layout_sections(struct module *mod, struct load_info *info, bool is_init) +static bool module_noinstr_layout_section(const char *sname) +{ + return strstarts(sname, ".noinstr"); +} + +static bool module_default_layout_section(const char *sname) +{ + return !module_init_layout_section(sname) && !module_noinstr_layout_section(sname); +} + +static void __layout_sections(struct module *mod, struct load_info *info) { unsigned int m, i;
@@ -1567,20 +1577,44 @@ static void __layout_sections(struct module *mod, struct load_info *info, bool i * Mask of excluded section header flags } */ static const unsigned long masks[][2] = { + /* Core */ + { SHF_EXECINSTR | SHF_ALLOC, ARCH_SHF_SMALL }, + { SHF_EXECINSTR | SHF_ALLOC, ARCH_SHF_SMALL }, + { SHF_ALLOC, SHF_WRITE | ARCH_SHF_SMALL }, + { SHF_RO_AFTER_INIT | SHF_ALLOC, ARCH_SHF_SMALL }, + { SHF_WRITE | SHF_ALLOC, ARCH_SHF_SMALL }, + { ARCH_SHF_SMALL | SHF_ALLOC, 0 }, + /* Init */ { SHF_EXECINSTR | SHF_ALLOC, ARCH_SHF_SMALL }, { SHF_ALLOC, SHF_WRITE | ARCH_SHF_SMALL }, { SHF_RO_AFTER_INIT | SHF_ALLOC, ARCH_SHF_SMALL }, { SHF_WRITE | SHF_ALLOC, ARCH_SHF_SMALL }, - { ARCH_SHF_SMALL | SHF_ALLOC, 0 } + { ARCH_SHF_SMALL | SHF_ALLOC, 0 }, }; - static const int core_m_to_mem_type[] = { + static bool (*const section_filter[])(const char *) = { + /* Core */ + module_default_layout_section, + module_noinstr_layout_section, + module_default_layout_section, + module_default_layout_section, + module_default_layout_section, + module_default_layout_section, + /* Init */ + module_init_layout_section, + module_init_layout_section, + module_init_layout_section, + module_init_layout_section, + module_init_layout_section, + }; + static const int mem_type_map[] = { + /* Core */ MOD_TEXT, + MOD_NOINSTR_TEXT, MOD_RODATA, MOD_RO_AFTER_INIT, MOD_DATA, MOD_DATA, - }; - static const int init_m_to_mem_type[] = { + /* Init */ MOD_INIT_TEXT, MOD_INIT_RODATA, MOD_INVALID, @@ -1589,16 +1623,16 @@ static void __layout_sections(struct module *mod, struct load_info *info, bool i };
for (m = 0; m < ARRAY_SIZE(masks); ++m) { - enum mod_mem_type type = is_init ? init_m_to_mem_type[m] : core_m_to_mem_type[m]; + enum mod_mem_type type = mem_type_map[m];
for (i = 0; i < info->hdr->e_shnum; ++i) { Elf_Shdr *s = &info->sechdrs[i]; const char *sname = info->secstrings + s->sh_name;
- if ((s->sh_flags & masks[m][0]) != masks[m][0] - || (s->sh_flags & masks[m][1]) - || s->sh_entsize != ~0UL - || is_init != module_init_layout_section(sname)) + if ((s->sh_flags & masks[m][0]) != masks[m][0] || + (s->sh_flags & masks[m][1]) || + s->sh_entsize != ~0UL || + !section_filter[m](sname)) continue;
if (WARN_ON_ONCE(type == MOD_INVALID)) @@ -1638,10 +1672,7 @@ static void layout_sections(struct module *mod, struct load_info *info) info->sechdrs[i].sh_entsize = ~0UL;
pr_debug("Core section allocation order for %s:\n", mod->name); - __layout_sections(mod, info, false); - - pr_debug("Init section allocation order for %s:\n", mod->name); - __layout_sections(mod, info, true); + __layout_sections(mod, info); }
static void module_license_taint_check(struct module *mod, const char *license) @@ -2515,9 +2546,6 @@ static int find_module_sections(struct module *mod, struct load_info *info) } #endif
- mod->noinstr_text_start = section_objs(info, ".noinstr.text", 1, - &mod->noinstr_text_size); - #ifdef CONFIG_TRACEPOINTS mod->tracepoints_ptrs = section_objs(info, "__tracepoints_ptrs", sizeof(*mod->tracepoints_ptrs), @@ -3769,12 +3797,26 @@ struct module *__module_text_address(unsigned long addr) if (mod) { /* Make sure it's within the text section. */ if (!within_module_mem_type(addr, mod, MOD_TEXT) && + !within_module_mem_type(addr, mod, MOD_NOINSTR_TEXT) && !within_module_mem_type(addr, mod, MOD_INIT_TEXT)) mod = NULL; } return mod; }
+bool is_module_noinstr_text_address(unsigned long addr) +{ + scoped_guard(preempt) { + struct module *mod = __module_address(addr); + + /* Make sure it's within the .noinstr.text section. */ + if (mod) + return within_module_mem_type(addr, mod, MOD_NOINSTR_TEXT); + } + + return false; +} + /* Don't grab lock, we're oopsing. */ void print_modules(void) {
smp_call_function() & friends have the unfortunate habit of sending IPIs to isolated, NOHZ_FULL, in-userspace CPUs, as they blindly target all online CPUs.
Some callsites can be bent into doing the right thing, such as done by commit:
cc9e303c91f5 ("x86/cpu: Disable frequency requests via aperfmperf IPI for nohz_full CPUs")
Unfortunately, not all SMP callbacks can be omitted in this fashion. However, some of them only affect execution in kernelspace, which means they don't have to be executed *immediately* if the target CPU is in userspace: stashing the callback and executing it upon the next kernel entry would suffice. x86 kernel instruction patching and kernel TLB invalidation are prime examples of this.
Reduce the RCU dynticks counter width to free up some bits to be used as a deferred callback bitmask. Add some build-time checks to validate that setup.
Presence of CT_RCU_WATCHING in the ct_state (i.e. the target CPU is executing in the kernel) prevents queuing deferred work.
Later commits introduce the bit:callback mappings.
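For reference, the intended send-side usage - condensed from the text-patching patch later in this series, which introduces the CT_WORK_SYNC work type - boils down to using ct_set_cpu_work() as an on_each_cpu_cond() predicate, so the IPI is only sent to CPUs for which the work could not be deferred:

#include <linux/context_tracking.h>
#include <linux/smp.h>
#include <asm/sync_core.h>

static void do_sync_core(void *info)
{
	sync_core();
}

/* Return true if @cpu still needs the IPI, false if the work was deferred. */
static bool do_sync_core_defer_cond(int cpu, void *info)
{
	return !ct_set_cpu_work(cpu, CT_WORK_SYNC);
}

static void text_poke_sync_deferrable_sketch(void)
{
	on_each_cpu_cond(do_sync_core_defer_cond, do_sync_core, NULL, 1);
}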
Link: https://lore.kernel.org/all/20210929151723.162004989@infradead.org/ Signed-off-by: Nicolas Saenz Julienne nsaenzju@redhat.com Signed-off-by: Valentin Schneider vschneid@redhat.com --- arch/Kconfig | 9 +++ arch/x86/Kconfig | 1 + arch/x86/include/asm/context_tracking_work.h | 16 +++++ include/linux/context_tracking.h | 21 ++++++ include/linux/context_tracking_state.h | 30 ++++++--- include/linux/context_tracking_work.h | 26 ++++++++ kernel/context_tracking.c | 69 +++++++++++++++++++- kernel/time/Kconfig | 5 ++ 8 files changed, 165 insertions(+), 12 deletions(-) create mode 100644 arch/x86/include/asm/context_tracking_work.h create mode 100644 include/linux/context_tracking_work.h
diff --git a/arch/Kconfig b/arch/Kconfig index b0adb665041f1..e363fc0dc1f88 100644 --- a/arch/Kconfig +++ b/arch/Kconfig @@ -952,6 +952,15 @@ config HAVE_CONTEXT_TRACKING_USER_OFFSTACK - No use of instrumentation, unless instrumentation_begin() got called.
+config HAVE_CONTEXT_TRACKING_WORK + bool + help + Architecture supports deferring work while not in kernel context. + This is especially useful on setups with isolated CPUs that might + want to avoid being interrupted to perform housekeeping tasks (for + ex. TLB invalidation or icache invalidation). The housekeeping + operations are performed upon re-entering the kernel. + config HAVE_TIF_NOHZ bool help diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig index 4b9f378e05f6b..c3fbcbee07788 100644 --- a/arch/x86/Kconfig +++ b/arch/x86/Kconfig @@ -222,6 +222,7 @@ config X86 select HAVE_CMPXCHG_LOCAL select HAVE_CONTEXT_TRACKING_USER if X86_64 select HAVE_CONTEXT_TRACKING_USER_OFFSTACK if HAVE_CONTEXT_TRACKING_USER + select HAVE_CONTEXT_TRACKING_WORK if X86_64 select HAVE_C_RECORDMCOUNT select HAVE_OBJTOOL_MCOUNT if HAVE_OBJTOOL select HAVE_OBJTOOL_NOP_MCOUNT if HAVE_OBJTOOL_MCOUNT diff --git a/arch/x86/include/asm/context_tracking_work.h b/arch/x86/include/asm/context_tracking_work.h new file mode 100644 index 0000000000000..5f3b2d0977235 --- /dev/null +++ b/arch/x86/include/asm/context_tracking_work.h @@ -0,0 +1,16 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +#ifndef _ASM_X86_CONTEXT_TRACKING_WORK_H +#define _ASM_X86_CONTEXT_TRACKING_WORK_H + +static __always_inline void arch_context_tracking_work(enum ct_work work) +{ + switch (work) { + case CT_WORK_n: + // Do work... + break; + case CT_WORK_MAX: + WARN_ON_ONCE(true); + } +} + +#endif diff --git a/include/linux/context_tracking.h b/include/linux/context_tracking.h index af9fe87a09225..0b0faa040e9b5 100644 --- a/include/linux/context_tracking.h +++ b/include/linux/context_tracking.h @@ -5,6 +5,7 @@ #include <linux/sched.h> #include <linux/vtime.h> #include <linux/context_tracking_state.h> +#include <linux/context_tracking_work.h> #include <linux/instrumentation.h>
#include <asm/ptrace.h> @@ -137,6 +138,26 @@ static __always_inline unsigned long ct_state_inc(int incby) return raw_atomic_add_return(incby, this_cpu_ptr(&context_tracking.state)); }
+#ifdef CONFIG_CONTEXT_TRACKING_WORK +static __always_inline unsigned long ct_state_inc_clear_work(int incby) +{ + struct context_tracking *ct = this_cpu_ptr(&context_tracking); + unsigned long new, old, state; + + state = arch_atomic_read(&ct->state); + do { + old = state; + new = old & ~CT_WORK_MASK; + new += incby; + state = arch_atomic_cmpxchg(&ct->state, old, new); + } while (old != state); + + return new; +} +#else +#define ct_state_inc_clear_work(x) ct_state_inc(x) +#endif + static __always_inline bool warn_rcu_enter(void) { bool ret = false; diff --git a/include/linux/context_tracking_state.h b/include/linux/context_tracking_state.h index 0b81248aa03e2..d2c302133672f 100644 --- a/include/linux/context_tracking_state.h +++ b/include/linux/context_tracking_state.h @@ -5,6 +5,7 @@ #include <linux/percpu.h> #include <linux/static_key.h> #include <linux/context_tracking_irq.h> +#include <linux/context_tracking_work.h>
/* Offset to allow distinguishing irq vs. task-based idle entry/exit. */ #define CT_NESTING_IRQ_NONIDLE ((LONG_MAX / 2) + 1) @@ -39,16 +40,19 @@ struct context_tracking { };
/* - * We cram two different things within the same atomic variable: + * We cram up to three different things within the same atomic variable: * - * CT_RCU_WATCHING_START CT_STATE_START - * | | - * v v - * MSB [ RCU watching counter ][ context_state ] LSB - * ^ ^ - * | | - * CT_RCU_WATCHING_END CT_STATE_END + * CT_RCU_WATCHING_START CT_STATE_START + * | CT_WORK_START | + * | | | + * v v v + * MSB [ RCU watching counter ][ context work ][ context_state ] LSB + * ^ ^ ^ + * | | | + * | CT_WORK_END | + * CT_RCU_WATCHING_END CT_STATE_END * + * The [ context work ] region spans 0 bits if CONFIG_CONTEXT_WORK=n * Bits are used from the LSB upwards, so unused bits (if any) will always be in * upper bits of the variable. */ @@ -59,18 +63,24 @@ struct context_tracking { #define CT_STATE_START 0 #define CT_STATE_END (CT_STATE_START + CT_STATE_WIDTH - 1)
-#define CT_RCU_WATCHING_MAX_WIDTH (CT_SIZE - CT_STATE_WIDTH) +#define CT_WORK_WIDTH (IS_ENABLED(CONFIG_CONTEXT_TRACKING_WORK) ? CT_WORK_MAX_OFFSET : 0) +#define CT_WORK_START (CT_STATE_END + 1) +#define CT_WORK_END (CT_WORK_START + CT_WORK_WIDTH - 1) + +#define CT_RCU_WATCHING_MAX_WIDTH (CT_SIZE - CT_WORK_WIDTH - CT_STATE_WIDTH) #define CT_RCU_WATCHING_WIDTH (IS_ENABLED(CONFIG_RCU_DYNTICKS_TORTURE) ? 2 : CT_RCU_WATCHING_MAX_WIDTH) -#define CT_RCU_WATCHING_START (CT_STATE_END + 1) +#define CT_RCU_WATCHING_START (CT_WORK_END + 1) #define CT_RCU_WATCHING_END (CT_RCU_WATCHING_START + CT_RCU_WATCHING_WIDTH - 1) #define CT_RCU_WATCHING BIT(CT_RCU_WATCHING_START)
#define CT_STATE_MASK GENMASK(CT_STATE_END, CT_STATE_START) +#define CT_WORK_MASK GENMASK(CT_WORK_END, CT_WORK_START) #define CT_RCU_WATCHING_MASK GENMASK(CT_RCU_WATCHING_END, CT_RCU_WATCHING_START)
#define CT_UNUSED_WIDTH (CT_RCU_WATCHING_MAX_WIDTH - CT_RCU_WATCHING_WIDTH)
static_assert(CT_STATE_WIDTH + + CT_WORK_WIDTH + CT_RCU_WATCHING_WIDTH + CT_UNUSED_WIDTH == CT_SIZE); diff --git a/include/linux/context_tracking_work.h b/include/linux/context_tracking_work.h new file mode 100644 index 0000000000000..c68245f8d77c5 --- /dev/null +++ b/include/linux/context_tracking_work.h @@ -0,0 +1,26 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +#ifndef _LINUX_CONTEXT_TRACKING_WORK_H +#define _LINUX_CONTEXT_TRACKING_WORK_H + +#include <linux/bitops.h> + +enum { + CT_WORK_n_OFFSET, + CT_WORK_MAX_OFFSET +}; + +enum ct_work { + CT_WORK_n = BIT(CT_WORK_n_OFFSET), + CT_WORK_MAX = BIT(CT_WORK_MAX_OFFSET) +}; + +#include <asm/context_tracking_work.h> + +#ifdef CONFIG_CONTEXT_TRACKING_WORK +extern bool ct_set_cpu_work(unsigned int cpu, enum ct_work work); +#else +static inline bool +ct_set_cpu_work(unsigned int cpu, unsigned int work) { return false; } +#endif + +#endif diff --git a/kernel/context_tracking.c b/kernel/context_tracking.c index fb5be6e9b423f..3238bb1f41ff4 100644 --- a/kernel/context_tracking.c +++ b/kernel/context_tracking.c @@ -72,6 +72,70 @@ static __always_inline void rcu_task_trace_heavyweight_exit(void) #endif /* #ifdef CONFIG_TASKS_TRACE_RCU */ }
+#ifdef CONFIG_CONTEXT_TRACKING_WORK +static noinstr void ct_work_flush(unsigned long seq) +{ + int bit; + + seq = (seq & CT_WORK_MASK) >> CT_WORK_START; + + /* + * arch_context_tracking_work() must be noinstr, non-blocking, + * and NMI safe. + */ + for_each_set_bit(bit, &seq, CT_WORK_MAX) + arch_context_tracking_work(BIT(bit)); +} + +/** + * ct_set_cpu_work - set work to be run at next kernel context entry + * + * If @cpu is not currently executing in kernelspace, it will execute the + * callback mapped to @work (see arch_context_tracking_work()) at its next + * entry into ct_kernel_enter_state(). + * + * If it is already executing in kernelspace, this will be a no-op. + */ +bool ct_set_cpu_work(unsigned int cpu, enum ct_work work) +{ + struct context_tracking *ct = per_cpu_ptr(&context_tracking, cpu); + unsigned int old; + bool ret = false; + + if (!ct->active) + return false; + + preempt_disable(); + + old = atomic_read(&ct->state); + + /* + * The work bit must only be set if the target CPU is not executing + * in kernelspace. + * CT_RCU_WATCHING is used as a proxy for that - if the bit is set, we + * know for sure the CPU is executing in the kernel whether that be in + * NMI, IRQ or process context. + * Set CT_RCU_WATCHING here and let the cmpxchg do the check for us; + * the state could change between the atomic_read() and the cmpxchg(). + */ + old |= CT_RCU_WATCHING; + /* + * Try setting the work until either + * - the target CPU has entered kernelspace + * - the work has been set + */ + do { + ret = atomic_try_cmpxchg(&ct->state, &old, old | (work << CT_WORK_START)); + } while (!ret && !(old & CT_RCU_WATCHING)); + + preempt_enable(); + return ret; +} +#else +static __always_inline void ct_work_flush(unsigned long work) { } +static __always_inline void ct_work_clear(struct context_tracking *ct) { } +#endif + /* * Record entry into an extended quiescent state. This is only to be * called when not already in an extended quiescent state, that is, @@ -88,7 +152,7 @@ static noinstr void ct_kernel_exit_state(int offset) rcu_task_trace_heavyweight_enter(); // Before CT state update! // RCU is still watching. Better not be in extended quiescent state! WARN_ON_ONCE(IS_ENABLED(CONFIG_RCU_EQS_DEBUG) && !rcu_is_watching_curr_cpu()); - (void)ct_state_inc(offset); + (void)ct_state_inc_clear_work(offset); // RCU is no longer watching. }
@@ -99,7 +163,7 @@ static noinstr void ct_kernel_exit_state(int offset) */ static noinstr void ct_kernel_enter_state(int offset) { - int seq; + unsigned long seq;
/* * CPUs seeing atomic_add_return() must see prior idle sojourns, @@ -107,6 +171,7 @@ static noinstr void ct_kernel_enter_state(int offset) * critical section. */ seq = ct_state_inc(offset); + ct_work_flush(seq); // RCU is now watching. Better not be in an extended quiescent state! rcu_task_trace_heavyweight_exit(); // After CT state update! WARN_ON_ONCE(IS_ENABLED(CONFIG_RCU_EQS_DEBUG) && !(seq & CT_RCU_WATCHING)); diff --git a/kernel/time/Kconfig b/kernel/time/Kconfig index b0b97a60aaa6f..7e8106a0d981f 100644 --- a/kernel/time/Kconfig +++ b/kernel/time/Kconfig @@ -181,6 +181,11 @@ config CONTEXT_TRACKING_USER_FORCE Say N otherwise, this option brings an overhead that you don't want in production.
+config CONTEXT_TRACKING_WORK + bool + depends on HAVE_CONTEXT_TRACKING_WORK && CONTEXT_TRACKING_USER + default y + config NO_HZ bool "Old Idle dynticks config" help
text_poke_bp_batch() sends IPIs to all online CPUs to synchronize them vs the newly patched instruction. CPUs that are executing in userspace do not need this synchronization to happen immediately, and this is actually harmful interference for NOHZ_FULL CPUs.
As the synchronization IPIs are sent using a blocking call, returning from text_poke_bp_batch() implies all CPUs will observe the patched instruction(s), and this should be preserved even if the IPI is deferred. In other words, to safely defer this synchronization, any kernel instruction leading to the execution of the deferred instruction sync (ct_work_flush()) must *not* be mutable (patchable) at runtime.
This means we must pay attention to mutable instructions in the early entry code:
- alternatives
- static keys
- static calls
- all sorts of probes (kprobes/ftrace/bpf/???)
The early entry code leading to ct_work_flush() is noinstr, which gets rid of the probes.
Alternatives are safe, because it's boot-time patching (before SMP is even brought up) which is before any IPI deferral can happen.
This leaves us with static keys and static calls.
Any static key used in early entry code should only ever be enabled at boot time and never flipped afterwards, IOW it should be __ro_after_init (pretty much like alternatives). Exceptions are explicitly marked as allowed in .noinstr and will always generate an IPI when flipped.
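For illustration, a minimal sketch (names are made up, this is not code from the series) of a static key that is legitimate to use in early entry code: it lives in __ro_after_init and is only ever flipped during boot, long before any IPI can be deferred.

#include <linux/jump_label.h>
#include <linux/init.h>

/* Hypothetical key: enabled once at boot, never flipped again. */
DEFINE_STATIC_KEY_FALSE_RO(entry_feature_key);

static int __init entry_feature_setup(void)
{
	/* Stand-in for a real feature-detection check. */
	if (IS_ENABLED(CONFIG_X86_64))
		static_branch_enable(&entry_feature_key);
	return 0;
}
early_initcall(entry_feature_setup);

An entry-code user of such a key (e.g. a static_branch_unlikely(&entry_feature_key) check) never sees it change after boot, so patching it can never race with a deferred sync_core().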
The same applies to static calls - they should only be updated at boot time, or be manually marked as an exception.
Objtool is now able to point at static keys/calls that don't respect this, and all static keys/calls used in early entry code have now been verified as behaving appropriately.
Leverage the new context_tracking infrastructure to defer sync_core() IPIs to a target CPU's next kernel entry.
Signed-off-by: Peter Zijlstra (Intel) peterz@infradead.org Signed-off-by: Nicolas Saenz Julienne nsaenzju@redhat.com Signed-off-by: Valentin Schneider vschneid@redhat.com --- arch/x86/include/asm/context_tracking_work.h | 6 ++- arch/x86/include/asm/text-patching.h | 1 + arch/x86/kernel/alternative.c | 39 +++++++++++++++++--- arch/x86/kernel/kprobes/core.c | 4 +- arch/x86/kernel/kprobes/opt.c | 4 +- arch/x86/kernel/module.c | 2 +- include/asm-generic/sections.h | 15 ++++++++ include/linux/context_tracking_work.h | 4 +- 8 files changed, 60 insertions(+), 15 deletions(-)
diff --git a/arch/x86/include/asm/context_tracking_work.h b/arch/x86/include/asm/context_tracking_work.h index 5f3b2d0977235..485b32881fde5 100644 --- a/arch/x86/include/asm/context_tracking_work.h +++ b/arch/x86/include/asm/context_tracking_work.h @@ -2,11 +2,13 @@ #ifndef _ASM_X86_CONTEXT_TRACKING_WORK_H #define _ASM_X86_CONTEXT_TRACKING_WORK_H
+#include <asm/sync_core.h> + static __always_inline void arch_context_tracking_work(enum ct_work work) { switch (work) { - case CT_WORK_n: - // Do work... + case CT_WORK_SYNC: + sync_core(); break; case CT_WORK_MAX: WARN_ON_ONCE(true); diff --git a/arch/x86/include/asm/text-patching.h b/arch/x86/include/asm/text-patching.h index ab9e143ec9fea..9dfa46f721c1d 100644 --- a/arch/x86/include/asm/text-patching.h +++ b/arch/x86/include/asm/text-patching.h @@ -33,6 +33,7 @@ extern void apply_relocation(u8 *buf, const u8 * const instr, size_t instrlen, u */ extern void *text_poke(void *addr, const void *opcode, size_t len); extern void text_poke_sync(void); +extern void text_poke_sync_deferrable(void); extern void *text_poke_kgdb(void *addr, const void *opcode, size_t len); extern void *text_poke_copy(void *addr, const void *opcode, size_t len); #define text_poke_copy text_poke_copy diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c index bf82c6f7d6905..8c73ac6243809 100644 --- a/arch/x86/kernel/alternative.c +++ b/arch/x86/kernel/alternative.c @@ -18,6 +18,7 @@ #include <linux/mmu_context.h> #include <linux/bsearch.h> #include <linux/sync_core.h> +#include <linux/context_tracking.h> #include <asm/text-patching.h> #include <asm/alternative.h> #include <asm/sections.h> @@ -2450,9 +2451,24 @@ static void do_sync_core(void *info) sync_core(); }
+static bool do_sync_core_defer_cond(int cpu, void *info) +{ + return !ct_set_cpu_work(cpu, CT_WORK_SYNC); +} + +static void __text_poke_sync(smp_cond_func_t cond_func) +{ + on_each_cpu_cond(cond_func, do_sync_core, NULL, 1); +} + void text_poke_sync(void) { - on_each_cpu(do_sync_core, NULL, 1); + __text_poke_sync(NULL); +} + +void text_poke_sync_deferrable(void) +{ + __text_poke_sync(do_sync_core_defer_cond); }
/* @@ -2623,6 +2639,7 @@ static int tp_vec_nr; */ static void text_poke_bp_batch(struct text_poke_loc *tp, unsigned int nr_entries) { + smp_cond_func_t cond = do_sync_core_defer_cond; unsigned char int3 = INT3_INSN_OPCODE; unsigned int i; int do_sync; @@ -2658,11 +2675,21 @@ static void text_poke_bp_batch(struct text_poke_loc *tp, unsigned int nr_entries * First step: add a int3 trap to the address that will be patched. */ for (i = 0; i < nr_entries; i++) { - tp[i].old = *(u8 *)text_poke_addr(&tp[i]); - text_poke(text_poke_addr(&tp[i]), &int3, INT3_INSN_SIZE); + void *addr = text_poke_addr(&tp[i]); + + /* + * There's no safe way to defer IPIs for patching text in + * .noinstr, record whether there is at least one such poke. + */ + if (is_kernel_noinstr_text((unsigned long)addr) || + is_module_noinstr_text_address((unsigned long)addr)) + cond = NULL; + + tp[i].old = *((u8 *)addr); + text_poke(addr, &int3, INT3_INSN_SIZE); }
- text_poke_sync(); + __text_poke_sync(cond);
/* * Second step: update all but the first byte of the patched range. @@ -2724,7 +2751,7 @@ static void text_poke_bp_batch(struct text_poke_loc *tp, unsigned int nr_entries * not necessary and we'd be safe even without it. But * better safe than sorry (plus there's not only Intel). */ - text_poke_sync(); + __text_poke_sync(cond); }
/* @@ -2745,7 +2772,7 @@ static void text_poke_bp_batch(struct text_poke_loc *tp, unsigned int nr_entries }
if (do_sync) - text_poke_sync(); + __text_poke_sync(cond);
/* * Remove and wait for refs to be zero. diff --git a/arch/x86/kernel/kprobes/core.c b/arch/x86/kernel/kprobes/core.c index 09608fd936876..687e6805b7511 100644 --- a/arch/x86/kernel/kprobes/core.c +++ b/arch/x86/kernel/kprobes/core.c @@ -808,7 +808,7 @@ void arch_arm_kprobe(struct kprobe *p) u8 int3 = INT3_INSN_OPCODE;
text_poke(p->addr, &int3, 1); - text_poke_sync(); + text_poke_sync_deferrable(); perf_event_text_poke(p->addr, &p->opcode, 1, &int3, 1); }
@@ -818,7 +818,7 @@ void arch_disarm_kprobe(struct kprobe *p)
perf_event_text_poke(p->addr, &int3, 1, &p->opcode, 1); text_poke(p->addr, &p->opcode, 1); - text_poke_sync(); + text_poke_sync_deferrable(); }
void arch_remove_kprobe(struct kprobe *p) diff --git a/arch/x86/kernel/kprobes/opt.c b/arch/x86/kernel/kprobes/opt.c index 36d6809c6c9e1..b2ce4d9c3ba56 100644 --- a/arch/x86/kernel/kprobes/opt.c +++ b/arch/x86/kernel/kprobes/opt.c @@ -513,11 +513,11 @@ void arch_unoptimize_kprobe(struct optimized_kprobe *op) JMP32_INSN_SIZE - INT3_INSN_SIZE);
text_poke(addr, new, INT3_INSN_SIZE); - text_poke_sync(); + text_poke_sync_deferrable(); text_poke(addr + INT3_INSN_SIZE, new + INT3_INSN_SIZE, JMP32_INSN_SIZE - INT3_INSN_SIZE); - text_poke_sync(); + text_poke_sync_deferrable();
perf_event_text_poke(op->kp.addr, old, JMP32_INSN_SIZE, new, JMP32_INSN_SIZE); } diff --git a/arch/x86/kernel/module.c b/arch/x86/kernel/module.c index a7998f3517017..d89c9de0ca9f5 100644 --- a/arch/x86/kernel/module.c +++ b/arch/x86/kernel/module.c @@ -206,7 +206,7 @@ static int write_relocate_add(Elf64_Shdr *sechdrs, write, apply);
if (!early) { - text_poke_sync(); + text_poke_sync_deferrable(); mutex_unlock(&text_mutex); }
diff --git a/include/asm-generic/sections.h b/include/asm-generic/sections.h index 0755bc39b0d80..7d2403014010e 100644 --- a/include/asm-generic/sections.h +++ b/include/asm-generic/sections.h @@ -199,6 +199,21 @@ static inline bool is_kernel_inittext(unsigned long addr) addr < (unsigned long)_einittext; }
+ +/** + * is_kernel_noinstr_text - checks if the pointer address is located in the + * .noinstr section + * + * @addr: address to check + * + * Returns: true if the address is located in .noinstr, false otherwise. + */ +static inline bool is_kernel_noinstr_text(unsigned long addr) +{ + return addr >= (unsigned long)__noinstr_text_start && + addr < (unsigned long)__noinstr_text_end; +} + /** * __is_kernel_text - checks if the pointer address is located in the * .text section diff --git a/include/linux/context_tracking_work.h b/include/linux/context_tracking_work.h index c68245f8d77c5..2facc621be067 100644 --- a/include/linux/context_tracking_work.h +++ b/include/linux/context_tracking_work.h @@ -5,12 +5,12 @@ #include <linux/bitops.h>
enum { - CT_WORK_n_OFFSET, + CT_WORK_SYNC_OFFSET, CT_WORK_MAX_OFFSET };
enum ct_work { - CT_WORK_n = BIT(CT_WORK_n_OFFSET), + CT_WORK_SYNC = BIT(CT_WORK_SYNC_OFFSET), CT_WORK_MAX = BIT(CT_WORK_MAX_OFFSET) };
I don't think we should do this series.
If folks want this functionality, they should get a new CPU that can flush the TLB without IPIs.
On Tue, 29 Apr 2025 09:11:57 -0700 Dave Hansen dave.hansen@intel.com wrote:
I don't think we should do this series.
Could you provide more rationale for your decision?
If folks want this functionality, they should get a new CPU that can flush the TLB without IPIs.
That's a pretty heavy handed response. I'm not sure that's always a feasible solution.
From my experience in the world, software has always been around to fix the hardware, not the other way around ;-)
-- Steve
On 4/30/25 10:20, Steven Rostedt wrote:
On Tue, 29 Apr 2025 09:11:57 -0700 Dave Hansen dave.hansen@intel.com wrote:
I don't think we should do this series.
Could you provide more rationale for your decision?
I talked about it a bit in here:
https://lore.kernel.org/all/408ebd8b-4bfb-4c4f-b118-7fe853c6e897@intel.com/
But, basically, this series puts a new onus on the entry code: it can't touch the vmalloc() area ... except the LDT ... and except the PEBS buffers. If anyone touches vmalloc()'d memory (or anything else that eventually gets deferred), they crash. They _only_ crash on these NOHZ_FULL systems.
Putting new restrictions on the entry code is really nasty. Let's say a new hardware feature showed up that touched vmalloc()'d memory in the entry code. Probably, nobody would notice until they got that new hardware and tried to do a NOHZ_FULL workload. It might take years to uncover, once that hardware was out in the wild.
I have a substantial number of gray hairs from dealing with corner cases in the entry code.
You _could_ make it more debuggable. Could you make this work for all tasks, not just NOHZ_FULL? The same logic _should_ apply. It would be inefficient, but would provide good debugging coverage.
I also mentioned this earlier, but PTI could be leveraged here to ensure that the TLB is flushed properly. You could have the rule that anything mapped into the user page table can't have a deferred flush and then do deferred flushes at SWITCH_TO_KERNEL_CR3 time. Yeah, that's in arch-specific assembly, but it's a million times easier to reason about because the window where a deferred-flush allocation might bite you is so small.
Look at the syscall code for instance:
SYM_CODE_START(entry_SYSCALL_64)
	swapgs
	movq	%rsp, PER_CPU_VAR(cpu_tss_rw + TSS_sp2)
	SWITCH_TO_KERNEL_CR3 scratch_reg=%rsp
You can _trivially_ audit this and know that swapgs doesn't touch memory and that as long as PER_CPU_VAR()s and the process stack don't have their mappings munged and flushes deferred that this would be correct.
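To make the PTI-based suggestion concrete, here is a purely illustrative sketch (all names hypothetical, not from this series or from Dave's mail) of a per-CPU "deferred kernel TLB flush" flag that a housekeeping CPU would set instead of IPI'ing, and that the isolated CPU would consume conceptually right after SWITCH_TO_KERNEL_CR3, before it can touch any vmalloc()'d mapping. It glosses over the "is the target actually in userspace" check that the series gets from the context-tracking state.

#include <linux/percpu.h>
#include <asm/tlbflush.h>

static DEFINE_PER_CPU(bool, kernel_tlb_flush_pending);

/* Housekeeping side: called instead of sending the flush IPI to @cpu. */
static void defer_kernel_tlb_flush(unsigned int cpu)
{
	WRITE_ONCE(per_cpu(kernel_tlb_flush_pending, cpu), true);
}

/* Isolated CPU side: conceptually hooked near SWITCH_TO_KERNEL_CR3. */
static void consume_deferred_kernel_tlb_flush(void)
{
	if (!READ_ONCE(*this_cpu_ptr(&kernel_tlb_flush_pending)))
		return;

	__this_cpu_write(kernel_tlb_flush_pending, false);
	/* Coarse stand-in for whatever kernel-range flush was deferred. */
	__flush_tlb_all();
}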
If folks want this functionality, they should get a new CPU that can flush the TLB without IPIs.
That's a pretty heavy handed response. I'm not sure that's always a feasible solution.
From my experience in the world, software has always been around to fix the hardware, not the other way around ;-)
Both AMD and Intel have hardware to do it. ARM CPUs do it too, I think. You can go buy the Intel hardware off the shelf today.
On Wed, 30 Apr 2025 11:07:35 -0700 Dave Hansen dave.hansen@intel.com wrote:
On 4/30/25 10:20, Steven Rostedt wrote:
On Tue, 29 Apr 2025 09:11:57 -0700 Dave Hansen dave.hansen@intel.com wrote:
I don't think we should do this series.
Could you provide more rationale for your decision?
I talked about it a bit in here:
https://lore.kernel.org/all/408ebd8b-4bfb-4c4f-b118-7fe853c6e897@intel.com/
Hmm, that's easily missed. But thanks for linking it.
But, basically, this series puts a new onus on the entry code: it can't touch the vmalloc() area ... except the LDT ... and except the PEBS buffers. If anyone touches vmalloc()'d memory (or anything else that eventually gets deferred), they crash. They _only_ crash on these NOHZ_FULL systems.
Putting new restrictions on the entry code is really nasty. Let's say a new hardware feature showed up that touched vmalloc()'d memory in the entry code. Probably, nobody would notice until they got that new hardware and tried to do a NOHZ_FULL workload. It might take years to uncover, once that hardware was out in the wild.
I have a substantial number of gray hairs from dealing with corner cases in the entry code.
You _could_ make it more debuggable. Could you make this work for all tasks, not just NOHZ_FULL? The same logic _should_ apply. It would be inefficient, but would provide good debugging coverage.
I also mentioned this earlier, but PTI could be leveraged here to ensure that the TLB is flushed properly. You could have the rule that anything mapped into the user page table can't have a deferred flush and then do deferred flushes at SWITCH_TO_KERNEL_CR3 time. Yeah, that's in arch-specific assembly, but it's a million times easier to reason about because the window where a deferred-flush allocation might bite you is so small.
Look at the syscall code for instance:
SYM_CODE_START(entry_SYSCALL_64)
	swapgs
	movq	%rsp, PER_CPU_VAR(cpu_tss_rw + TSS_sp2)
	SWITCH_TO_KERNEL_CR3 scratch_reg=%rsp
You can _trivially_ audit this and know that swapgs doesn't touch memory and that as long as PER_CPU_VAR()s and the process stack don't have their mappings munged and flushes deferred that this would be correct.
Hmm, so there is still a path for this?
At least if it added more ways to debug it, and some other changes to make the locations where vmalloc is dangerous smaller?
If folks want this functionality, they should get a new CPU that can flush the TLB without IPIs.
That's a pretty heavy handed response. I'm not sure that's always a feasible solution.
From my experience in the world, software has always been around to fix the hardware, not the other way around ;-)
Both AMD and Intel have hardware to do it. ARM CPUs do it too, I think. You can go buy the Intel hardware off the shelf today.
Sure, but changing CPUs on machines is not always that feasible either.
-- Steve
On 4/30/25 12:42, Steven Rostedt wrote:
Look at the syscall code for instance:
SYM_CODE_START(entry_SYSCALL_64)
	swapgs
	movq	%rsp, PER_CPU_VAR(cpu_tss_rw + TSS_sp2)
	SWITCH_TO_KERNEL_CR3 scratch_reg=%rsp
You can _trivially_ audit this and know that swapgs doesn't touch memory and that as long as PER_CPU_VAR()s and the process stack don't have their mappings munged and flushes deferred that this would be correct.
Hmm, so there is still a path for this?
At least if it added more ways to debug it, and some other changes to make the locations where vmalloc is dangerous smaller?
Being able to debug it would be a good start. But, more generally, what we need is for more people to be able to run the code in the first place. Would a _normal_ system (without setups that are trying to do NOHZ_FULL) ever be able to defer TLB flush IPIs?
If the answer is no, then, yeah, I'll settle for some debugging options.
But if you shrink the window as small as I'm talking about, it would look very different from this series.
For instance, imagine when a CPU goes into the NOHZ mode. Could it just unconditionally flush the TLB on the way back into the kernel (in the same SWITCH_TO_KERNEL_CR3 spot)? Yeah, it'll make entry into the kernel expensive for NOHZ tasks, but it's not *THAT* bad. And if the entire point of a NOHZ_FULL task is to minimize the number of kernel entries then a little extra overhead there doesn't sound too bad.
Also, about the new hardware, I suspect there's some mystery customer lurking in the shadows asking folks for this functionality. Could you at least go _talk_ to the mystery customer(s) and see which hardware they care about? They might already even have the magic CPUs they need for this, or have them on the roadmap. If they've got Intel CPUs, I'd be happy to help figure it out.
On 30/04/25 13:00, Dave Hansen wrote:
On 4/30/25 12:42, Steven Rostedt wrote:
Look at the syscall code for instance:
SYM_CODE_START(entry_SYSCALL_64)
	swapgs
	movq	%rsp, PER_CPU_VAR(cpu_tss_rw + TSS_sp2)
	SWITCH_TO_KERNEL_CR3 scratch_reg=%rsp
You can _trivially_ audit this and know that swapgs doesn't touch memory and that as long as PER_CPU_VAR()s and the process stack don't have their mappings munged and flushes deferred that this would be correct.
Hmm, so there is still a path for this?
At least if it added more ways to debug it, and some other changes to make the locations where vmalloc is dangerous smaller?
Being able to debug it would be a good start. But, more generally, what we need is for more people to be able to run the code in the first place. Would a _normal_ system (without setups that are trying to do NOHZ_FULL) ever be able to defer TLB flush IPIs?
If the answer is no, then, yeah, I'll settle for some debugging options.
But if you shrink the window as small as I'm talking about, it would look very different from this series.
For instance, imagine when a CPU goes into the NOHZ mode. Could it just unconditionally flush the TLB on the way back into the kernel (in the same SWITCH_TO_KERNEL_CR3 spot)? Yeah, it'll make entry into the kernel expensive for NOHZ tasks, but it's not *THAT* bad. And if the entire point of a NOHZ_FULL task is to minimize the number of kernel entries then a little extra overhead there doesn't sound too bad.
Right, so my thought per your previous comments was to special-case the TLB flush, depend on kPTI and do it unconditionally in SWITCH_TO_KERNEL_CR3 just like you've described - but keep the context tracking mechanism for other deferrable operations.
My gripe with that was having two separate mechanisms:
- super early entry (around SWITCH_TO_KERNEL_CR3)
- later entry at context tracking
Shifting everything to SWITCH_TO_KERNEL_CR3 means we lose the context_tracking infra to dynamically defer operations (atomically reading and writing to context_tracking.state), which means we unconditionally run all possible deferrable operations. This doesn't scream scalable, even though as you say NOHZ_FULL kernel entry is already a "you lose" situation.
Yet another option is to duplicate the context tracking state specifically for IPI deferral and have it driven in/by SWITCH_TO_KERNEL_CR3, which is also not super savoury.
I suppose I can start poking around running deferred ops in that SWITCH_TO_KERNEL_CR3 region, and add state/infra on top. Let's see where this gets me :-)
Again, thanks for the insight and the suggestions Dave!
Also, about the new hardware, I suspect there's some mystery customer lurking in the shadows asking folks for this functionality. Could you at least go _talk_ to the mystery customer(s) and see which hardware they care about? They might already even have the magic CPUs they need for this, or have them on the roadmap. If they've got Intel CPUs, I'd be happy to help figure it out.
On 5/2/25 02:55, Valentin Schneider wrote:
My gripe with that was having two separate mechanisms
- super early entry (around SWITCH_TO_KERNEL_CR3)
- later entry at context tracking
What do you mean by "later entry"?
All of the paths to enter the kernel from userspace have some SWITCH_TO_KERNEL_CR3 variant. If they didn't, the userspace that they entered from could have attacked the kernel with Meltdown.
I'm theorizing that if this is _just_ about avoiding TLB flush IPIs that you can get away with a single mechanism.
On 02/05/25 06:53, Dave Hansen wrote:
On 5/2/25 02:55, Valentin Schneider wrote:
My gripe with that was having two separate mechanisms
- super early entry (around SWITCH_TO_KERNEL_CR3)
- later entry at context tracking
What do you mean by "later entry"?
I meant the point at which the deferred operation is run in the current patches, i.e. ct_kernel_enter() - kernel entry from the PoV of context tracking.
All of the paths to enter the kernel from userspace have some SWITCH_TO_KERNEL_CR3 variant. If they didn't, the userspace that they entered from could have attacked the kernel with Meltdown.
I'm theorizing that if this is _just_ about avoiding TLB flush IPIs that you can get away with a single mechanism.
So right now there would indeed be the TLB flush IPIs, but also the text_poke() ones (sync_core() after patching text).
These are the two NOHZ-breaking IPIs that show up on my HP box, and that I also got reports for from folks using NOHZ_FULL + CPU isolation in production, mostly on SPR "edge enhanced" type of systems.
There have been other sources of IPIs that were fixed with ad-hoc solutions - disable the mechanism for NOHZ_FULL CPUs, or do it differently such that an IPI isn't required, e.g.
https://lore.kernel.org/lkml/ZJtBrybavtb1x45V@tpad/
While I don't expect the list to grow much, it's unfortunately not just the TLB flush IPIs.
gah, the cc list here is rotund...
On 5/2/25 09:38, Valentin Schneider wrote: ...
All of the paths to enter the kernel from userspace have some SWITCH_TO_KERNEL_CR3 variant. If they didn't, the userspace that they entered from could have attacked the kernel with Meltdown.
I'm theorizing that if this is _just_ about avoiding TLB flush IPIs that you can get away with a single mechanism.
So right now there would indeed be the TLB flush IPIs, but also the text_poke() ones (sync_core() after patching text).
These are the two NOHZ-breaking IPIs that show up on my HP box, and that I also got reports for from folks using NOHZ_FULL + CPU isolation in production, mostly on SPR "edge enhanced" type of systems.
...
While I don't expect the list to grow much, it's unfortunately not just the TLB flush IPIs.
Isn't text patching way easier than TLB flushes? You just need *some* serialization. Heck, since TLB flushes are architecturally serializing, you could probably even reuse the exact same mechanism: implement deferred text patch serialization operations as a deferred TLB flush.
The hardest part is figuring out which CPUs are in the state where they can be deferred or not. But you have to solve that in any case, and you already have an algorithm to do it.
On 02/05/25 10:57, Dave Hansen wrote:
gah, the cc list here is rotund...
On 5/2/25 09:38, Valentin Schneider wrote: ...
All of the paths to enter the kernel from userspace have some SWITCH_TO_KERNEL_CR3 variant. If they didn't, the userspace that they entered from could have attacked the kernel with Meltdown.
I'm theorizing that if this is _just_ about avoiding TLB flush IPIs that you can get away with a single mechanism.
So right now there would indeed be the TLB flush IPIs, but also the text_poke() ones (sync_core() after patching text).
These are the two NOHZ-breaking IPIs that show up on my HP box, and that I also got reports for from folks using NOHZ_FULL + CPU isolation in production, mostly on SPR "edge enhanced" type of systems.
...
While I don't expect the list to grow much, it's unfortunately not just the TLB flush IPIs.
Isn't text patching way easier than TLB flushes? You just need *some* serialization. Heck, since TLB flushes are architecturally serializing, you could probably even reuse the exact same mechanism: implement deferred text patch serialization operations as a deferred TLB flush.
The hardest part is figuring out which CPUs are in the state where they can be deferred or not. But you have to solve that in any case, and you already have an algorithm to do it.
Alright, off to mess around SWITCH_TO_KERNEL_CR3 to see how shoving deferred operations there would look then.
On Wed, Apr 30, 2025 at 11:07:35AM -0700, Dave Hansen wrote:
Both AMD and Intel have hardware to do it. ARM CPUs do it too, I think. You can go buy the Intel hardware off the shelf today.
To be fair, the Intel RAR thing is pretty horrific :-( Definitely sub-par compared to the AMD and ARM things.
Furthermore, the paper states it is a uarch feature for SPR with no guarantee future uarchs will get it (and to be fair, I'd prefer it if they didn't).
Furthermore, I suspect it will actually be slower than IPIs for anything with more than 64 logical CPUs due to reduced parallelism.
On 5/2/25 04:22, Peter Zijlstra wrote:
On Wed, Apr 30, 2025 at 11:07:35AM -0700, Dave Hansen wrote:
Both AMD and Intel have hardware to do it. ARM CPUs do it too, I think. You can go buy the Intel hardware off the shelf today.
To be fair, the Intel RAR thing is pretty horrific 🙁 Definitely sub-par compared to the AMD and ARM things.
Furthermore, the paper states it is a uarch feature for SPR with no guarantee future uarchs will get it (and to be fair, I'd prefer it if they didn't).
I don't think any of that is set in stone, fwiw. It should be entirely possible to obtain a longer promise about its availability.
Or ask that AMD and Intel put their heads together in their fancy new x86 advisory group and figure out a single way forward. If you're right that RAR stinks and INVLPGB rocks, then it'll be an easy thing to advise.
Furthermore, I suspect it will actually be slower than IPIs for anything with more than 64 logical CPUs due to reduced parallelism.
Maybe my brain is crusty and I need to go back and read the spec, but I remember RAR using the normal old APIC programming that normal old TLB flush IPIs use. So they have similar restrictions. If it's inefficient to program a wide IPI, it's also inefficient to program a RAR operation. So the (theoretical) pro is that you program it like an IPI and it slots into the IPI code fairly easily. But the con is that it has the same limitations as IPIs.
I was actually concerned that INVLPGB won't be scalable. Since it doesn't have the ability to target specific CPUs in the ISA, it fundamentally needs to either have a mechanism to reach all CPUs, or some way to know which TLB entries each CPU might have.
Maybe AMD has something super duper clever to limit the broadcast scope. But if they don't, then a small range flush on a small number of CPUs might end up being pretty expensive, relatively.
I don't think this is a big problem in Rik's series because he had a floor on the size of processes that get INVLPGB applied. Also, if it turns out to be a problem, it's dirt simple to revert back to IPIs for problematic TLB flushes.
But I am deeply curious how the system will behave if there are a boatload of processes doing modestly-sized INVLPGBs that only apply to a handful of CPUs on a very large system.
AMD and Intel came at this from very different angles (go figure). The designs are prioritizing different things for sure. I can't wait to see both of them fighting it out under real workloads.
On Fri, May 02, 2025 at 07:33:55AM -0700, Dave Hansen wrote:
On 5/2/25 04:22, Peter Zijlstra wrote:
On Wed, Apr 30, 2025 at 11:07:35AM -0700, Dave Hansen wrote:
Both AMD and Intel have hardware to do it. ARM CPUs do it too, I think. You can go buy the Intel hardware off the shelf today.
To be fair, the Intel RAR thing is pretty horrific 🙁 Definitely sub-par compared to the AMD and ARM things.
Furthermore, the paper states it is a uarch feature for SPR with no guarantee future uarchs will get it (and to be fair, I'd prefer it if they didn't).
I don't think any of that is set in stone, fwiw. It should be entirely possible to obtain a longer promise about its availability.
Or ask that AMD and Intel put their heads together in their fancy new x86 advisory group and figure out a single way forward.
This might be a good thing regardless.
Furthermore, I suspect it will actually be slower than IPIs for anything with more than 64 logical CPUs due to reduced parallelism.
Maybe my brain is crusty and I need to go back and read the spec, but I remember RAR using the normal old APIC programming that normal old TLB flush IPIs use. So they have similar restrictions. If it's inefficient to program a wide IPI, it's also inefficient to program a RAR operation. So the (theoretical) pro is that you program it like an IPI and it slots into the IPI code fairly easily. But the con is that it has the same limitations as IPIs.
The problem is in the request structure. Sending an IPI is an async action. You do, done.
OTOH RAR has a request buffer where pending requests are put and 'polled' for completion. This buffer does not have room for more than 64 CPUs.
This means that if you want to invalidate across more, you need to do it in multiple batches.
So where IPI is:
- IPI all CPUs
- local invalidate
- wait for completion
This then becomes:
for ()
  - RAR some CPUs
  - wait for completion
Or so I think I understood it; the paper isn't the easiest to read.
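A rough model of that batching (rar_issue()/rar_wait() are hypothetical stand-ins for the request-buffer interface, not a real API), contrasted with the single fire-and-wait IPI case above:

#include <linux/cpumask.h>
#include <linux/gfp.h>

#define RAR_MAX_TARGETS	64	/* request buffer only covers 64 CPUs */

/* Model only: post requests to / poll completions from the RAR buffer. */
static inline void rar_issue(const struct cpumask *targets) { }
static inline void rar_wait(const struct cpumask *targets) { }

static void rar_flush_mask(const struct cpumask *mask)
{
	cpumask_var_t batch;
	unsigned int cpu, batched = 0;

	if (!zalloc_cpumask_var(&batch, GFP_KERNEL))
		return;

	for_each_cpu(cpu, mask) {
		cpumask_set_cpu(cpu, batch);
		if (++batched == RAR_MAX_TARGETS) {
			rar_issue(batch);
			rar_wait(batch);
			cpumask_clear(batch);
			batched = 0;
		}
	}
	if (batched) {
		rar_issue(batch);
		rar_wait(batch);
	}
	free_cpumask_var(batch);
}

Whereas the IPI variant is a single asynchronous broadcast followed by one wait, the 64-entry buffer forces wide invalidations through multiple issue/wait rounds, which is where the reduced parallelism would come from.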
I was actually concerned that INVLPGB won't be scalable. Since it doesn't have the ability to target specific CPUs in the ISA, it fundamentally need to either have a mechanism to reach all CPUs, or some way to know which TLB entries each CPU might have.
Maybe AMD has something super duper clever to limit the broadcast scope. But if they don't, then a small range flush on a small number of CPUs might end up being pretty expensive, relatively.
So the way I understand things:
Sending IPIs is sending a message on the interconnect. Mostly this is a cacheline in size (because MESI). Sparc (v9?) has a fun feature where you can actually put data payload in an IPI.
Now, we can target an IPI to a single CPU or to a (limited) set of CPU or broadcast to all CPUs. In fact, targeted IPIs might still be broadcast IPIs, except most CPUs will ignore it because it doesn't match them.
TLBI broadcast is like sending IPIs to all CPUs, the message goes out, everybody sees it.
Much like how snoop filters and the like function, a CPU can process these messages async -- your CPU doesn't stall for a cacheline invalidate message either (except of course if it is actively using that line). Same for TLBI: if the local TLB does not have anything that matches, it's done. Even if it does match, as long as nothing makes active use of it, it can just drop the TLB entry without disturbing the actual core.
Only if the CPU has a matching TLB entry *and* it is active, then we have options. One option is to interrupt the core, another option is to wait for it to stop using it.
IIUC the current AMD implementation does the 'interrupt' thing.
One thing to consider in all this is that if we TLBI for an executable page, we should very much also wipe the u-ops cache and all such related structures -- ARM might have an 'issue' here.
That is, I think the TLBI problem is very similar to the I in MESI -- except possibly simpler, because E must not happen until all CPUs have acknowledged the I, etc. TLBI does not have this constraint; it only has to be complete by the next TLBSYNC.
Anyway, I'm not a hardware person, but this is how I understand these things to work.