February 2025 - Linux-kselftest-mirror

[PATCH bpf-next v2 0/2] bpf: fix ktls panic with sockmap and add tests

by Jiayuan Chen

We can reproduce the issue using the existing test program: './test_sockmap --ktls' Or use the selftest I provided, which will cause a panic: ------------[ cut here ]------------ kernel BUG at lib/iov_iter.c:629! PKRU: 55555554 Call Trace: <TASK> ? die+0x36/0x90 ? do_trap+0xdd/0x100 ? iov_iter_revert+0x178/0x180 ? iov_iter_revert+0x178/0x180 ? do_error_trap+0x7d/0x110 ? iov_iter_revert+0x178/0x180 ? exc_invalid_op+0x50/0x70 ? iov_iter_revert+0x178/0x180 ? asm_exc_invalid_op+0x1a/0x20 ? iov_iter_revert+0x178/0x180 ? iov_iter_revert+0x5c/0x180 tls_sw_sendmsg_locked.isra.0+0x794/0x840 tls_sw_sendmsg+0x52/0x80 ? inet_sendmsg+0x1f/0x70 __sys_sendto+0x1cd/0x200 ? find_held_lock+0x2b/0x80 ? syscall_trace_enter+0x140/0x270 ? __lock_release.isra.0+0x5e/0x170 ? find_held_lock+0x2b/0x80 ? syscall_trace_enter+0x140/0x270 ? lockdep_hardirqs_on_prepare+0xda/0x190 ? ktime_get_coarse_real_ts64+0xc2/0xd0 __x64_sys_sendto+0x24/0x30 do_syscall_64+0x90/0x170 1. It looks like the issue started occurring after bpf being introduced to ktls and later the addition of assertions to iov_iter has caused a panic. If my fix tag is incorrect, please assist me in correcting the fix tag. 2. I make minimal changes for now, it's enough to make ktls work correctly. --- v1->v2: Added more content to the commit message https://lore.kernel.org/all/20250123171552.57345-1-mrpre@163.com/#r --- Jiayuan Chen (2): bpf: fix ktls panic with sockmap selftests/bpf: add ktls selftest net/tls/tls_sw.c | 8 +- .../selftests/bpf/prog_tests/sockmap_ktls.c | 174 +++++++++++++++++- .../selftests/bpf/progs/test_sockmap_ktls.c | 26 +++ 3 files changed, 205 insertions(+), 3 deletions(-) create mode 100644 tools/testing/selftests/bpf/progs/test_sockmap_ktls.c -- 2.47.1

8 months, 3 weeks

3
4
0 0

[PATCH v8 00/14] iommufd: Add vIOMMU infrastructure (Part-3: vEVENTQ)

by Nicolin Chen

As the vIOMMU infrastructure series part-3, this introduces a new vEVENTQ object. The existing FAULT object provides a nice notification pathway to the user space with a queue already, so let vEVENTQ reuse that. Mimicing the HWPT structure, add a common EVENTQ structure to support its derivatives: IOMMUFD_OBJ_FAULT (existing) and IOMMUFD_OBJ_VEVENTQ (new). An IOMMUFD_CMD_VEVENTQ_ALLOC is introduced to allocate vEVENTQ object for vIOMMUs. One vIOMMU can have multiple vEVENTQs in different types but can not support multiple vEVENTQs in the same type. The forwarding part is fairly simple but might need to replace a physical device ID with a virtual device ID in a driver-level event data structure. So, this also adds some helpers for drivers to use. As usual, this series comes with the selftest coverage for this new ioctl and with a real world use case in the ARM SMMUv3 driver. This is on Github: https://github.com/nicolinc/iommufd/commits/iommufd_veventq-v8 Paring QEMU branch for testing: https://github.com/nicolinc/qemu/commits/wip/for_iommufd_veventq-v8 Changelog v8 * Add Reviewed-by from Jason and Pranjal * Fix errno returned in arm_smmu_handle_event() * Validate domain->type outside of arm_smmu_attach_prepare_vmaster() * Drop unnecessary vmaster comparison in arm_smmu_attach_commit_vmaster() v7 https://lore.kernel.org/all/cover.1740238876.git.nicolinc@nvidia.com/ * Rebase on Jason's for-next tree for latest fault.c * Add Reviewed-by * Update commit logs * Add __reserved field sanity * Skip kfree() on the static header * Replace "bool on_list" with list_is_last() * Use u32 for flags in iommufd_vevent_header * Drop casting in iommufd_viommu_get_vdev_id() * Update the bounding logic to veventq->sequence * Add missing cpu_to_le64() around STRTAB_STE_1_MEV * Reuse veventq->common.lock to fence sequence and num_events * Rename overflow to lost_events and log it in upon kmalloc failure * Correct the error handling part in iommufd_veventq_deliver_fetch() * Add an arm_smmu_clear_vmaster() to simplify identity/blocked domain attach ops * Add additional four event records to forward to user space VM, and update the uAPI doc * Reuse the existing smmu->streams_mutex lock to fence master->vmaster pointer, instead of adding a new rwsem v6 https://lore.kernel.org/all/cover.1737754129.git.nicolinc@nvidia.com/ * Drop supports_veventq viommu op * Split bug/cosmetics fixes out of the series * Drop the blocking mutex around copy_to_user() * Add veventq_depth in uAPI to limit vEVENTQ size * Revise the documentation for a clear description * Fix sparse warnings in arm_vmaster_report_event() * Rework iommufd_viommu_get_vdev_id() to return -ENOENT v.s. 0 * Allow Abort/Bypass STEs to allocate vEVENTQ and set STE.MEV for DoS mitigations v5 https://lore.kernel.org/all/cover.1736237481.git.nicolinc@nvidia.com/ * Add Reviewed-by from Baolu * Reorder the OBJ list as well * Fix alphabetical order after renaming in v4 * Add supports_veventq viommu op for vEVENTQ type validation v4 https://lore.kernel.org/all/cover.1735933254.git.nicolinc@nvidia.com/ * Rename "vIRQ" to "vEVENTQ" * Use flexible array in struct iommufd_vevent * Add the new ioctl command to union ucmd_buffer * Fix the alphabetical order in union ucmd_buffer too * Rename _TYPE_NONE to _TYPE_DEFAULT aligning with vIOMMU naming v3 https://lore.kernel.org/all/cover.1734477608.git.nicolinc@nvidia.com/ * Rebase on Will's for-joerg/arm-smmu/updates for arm_smmu_event series * Add "Reviewed-by" lines from Kevin * Fix typos in comments, kdocs, and jump tags * Add a patch to sort struct iommufd_ioctl_op * Update iommufd's userpsace-api documentation * Update uAPI kdoc to quote SMMUv3 offical spec * Drop the unused workqueue in struct iommufd_virq * Drop might_sleep() in iommufd_viommu_report_irq() helper * Add missing "break" in iommufd_viommu_get_vdev_id() helper * Shrink the scope of the vmaster's read lock in SMMUv3 driver * Pass in two arguments to iommufd_eventq_virq_handler() helper * Move "!ops || !ops->read" validation into iommufd_eventq_init() * Move "fault->ictx = ictx" closer to iommufd_ctx_get(fault->ictx) * Update commit message for arm_smmu_attach_prepare/commit_vmaster() * Keep "iommufd_fault" as-is and rename "iommufd_eventq_virq" to just "iommufd_virq" v2 https://lore.kernel.org/all/cover.1733263737.git.nicolinc@nvidia.com/ * Rebase on v6.13-rc1 * Add IOPF and vIRQ in iommufd.rst (userspace-api) * Add a proper locking in iommufd_event_virq_destroy * Add iommufd_event_virq_abort with a lockdep_assert_held * Rename "EVENT_*" to "EVENTQ_*" to describe the objects better * Reorganize flows in iommufd_eventq_virq_alloc for abort() to work * Adde struct arm_smmu_vmaster to store vSID upon attaching to a nested domain, calling a newly added iommufd_viommu_get_vdev_id helper * Adde an arm_vmaster_report_event helper in arm-smmu-v3-iommufd file to simplify the routine in arm_smmu_handle_evt() of the main driver v1 https://lore.kernel.org/all/cover.1724777091.git.nicolinc@nvidia.com/ Thanks! Nicolin Nicolin Chen (14): iommufd/fault: Move two fault functions out of the header iommufd/fault: Add an iommufd_fault_init() helper iommufd: Abstract an iommufd_eventq from iommufd_fault iommufd: Rename fault.c to eventq.c iommufd: Add IOMMUFD_OBJ_VEVENTQ and IOMMUFD_CMD_VEVENTQ_ALLOC iommufd/viommu: Add iommufd_viommu_get_vdev_id helper iommufd/viommu: Add iommufd_viommu_report_event helper iommufd/selftest: Require vdev_id when attaching to a nested domain iommufd/selftest: Add IOMMU_TEST_OP_TRIGGER_VEVENT for vEVENTQ coverage iommufd/selftest: Add IOMMU_VEVENTQ_ALLOC test coverage Documentation: userspace-api: iommufd: Update FAULT and VEVENTQ iommu/arm-smmu-v3: Introduce struct arm_smmu_vmaster iommu/arm-smmu-v3: Report events that belong to devices attached to vIOMMU iommu/arm-smmu-v3: Set MEV bit in nested STE for DoS mitigations drivers/iommu/iommufd/Makefile | 2 +- drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h | 36 ++ drivers/iommu/iommufd/iommufd_private.h | 135 +++- drivers/iommu/iommufd/iommufd_test.h | 10 + include/linux/iommufd.h | 23 + include/uapi/linux/iommufd.h | 105 +++ tools/testing/selftests/iommu/iommufd_utils.h | 115 ++++ .../arm/arm-smmu-v3/arm-smmu-v3-iommufd.c | 64 ++ drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 82 ++- drivers/iommu/iommufd/driver.c | 72 +++ drivers/iommu/iommufd/eventq.c | 597 ++++++++++++++++++ drivers/iommu/iommufd/fault.c | 342 ---------- drivers/iommu/iommufd/hw_pagetable.c | 6 +- drivers/iommu/iommufd/main.c | 7 + drivers/iommu/iommufd/selftest.c | 54 ++ drivers/iommu/iommufd/viommu.c | 2 + tools/testing/selftests/iommu/iommufd.c | 36 ++ .../selftests/iommu/iommufd_fail_nth.c | 7 + Documentation/userspace-api/iommufd.rst | 17 + 19 files changed, 1304 insertions(+), 408 deletions(-) create mode 100644 drivers/iommu/iommufd/eventq.c delete mode 100644 drivers/iommu/iommufd/fault.c base-commit: 598749522d4254afb33b8a6c1bea614a95896868 -- 2.43.0

8 months, 3 weeks

6
32
0 0

[PATCH] selftest: rtc: skip some tests if the alarm only supports minutes

by Wolfram Sang

There are alarms which have only minute-granularity. The RTC core already has a flag to describe them. Use this flag to skip tests which require the alarm to support seconds. Signed-off-by: Wolfram Sang <wsa+renesas(a)sang-engineering.com> --- Tested with a Renesas RZ-N1D board. This RTC obviously has only minute resolution for the alarms. Output now looks like this: # RUN rtc.alarm_alm_set ... # SKIP Skipping test since alarms has only minute granularity. # OK rtc.alarm_alm_set ok 5 rtc.alarm_alm_set # SKIP Skipping test since alarms has only minute granularity. Before it was like this: # RUN rtc.alarm_alm_set ... # rtctest.c:255:alarm_alm_set:Alarm time now set to 09:40:00. # rtctest.c:275:alarm_alm_set:data: 1a0 # rtctest.c:281:alarm_alm_set:Expected new (1489743644) == secs (1489743647) tools/testing/selftests/rtc/rtctest.c | 19 ++++++++++++++----- 1 file changed, 14 insertions(+), 5 deletions(-) diff --git a/tools/testing/selftests/rtc/rtctest.c b/tools/testing/selftests/rtc/rtctest.c index 3e4f0d5c5329..e0a148261e6f 100644 --- a/tools/testing/selftests/rtc/rtctest.c +++ b/tools/testing/selftests/rtc/rtctest.c @@ -29,6 +29,7 @@ enum rtc_alarm_state { RTC_ALARM_UNKNOWN, RTC_ALARM_ENABLED, RTC_ALARM_DISABLED, + RTC_ALARM_RES_MINUTE, }; FIXTURE(rtc) { @@ -88,7 +89,7 @@ static void nanosleep_with_retries(long ns) } } -static enum rtc_alarm_state get_rtc_alarm_state(int fd) +static enum rtc_alarm_state get_rtc_alarm_state(int fd, int need_seconds) { struct rtc_param param = { 0 }; int rc; @@ -103,6 +104,10 @@ static enum rtc_alarm_state get_rtc_alarm_state(int fd) if ((param.uvalue & _BITUL(RTC_FEATURE_ALARM)) == 0) return RTC_ALARM_DISABLED; + /* Check if alarm has desired granularity */ + if (need_seconds && (param.uvalue & _BITUL(RTC_FEATURE_ALARM_RES_MINUTE))) + return RTC_ALARM_RES_MINUTE; + return RTC_ALARM_ENABLED; } @@ -227,9 +232,11 @@ TEST_F(rtc, alarm_alm_set) { SKIP(return, "Skipping test since %s does not exist", rtc_file); ASSERT_NE(-1, self->fd); - alarm_state = get_rtc_alarm_state(self->fd); + alarm_state = get_rtc_alarm_state(self->fd, 1); if (alarm_state == RTC_ALARM_DISABLED) SKIP(return, "Skipping test since alarms are not supported."); + if (alarm_state == RTC_ALARM_RES_MINUTE) + SKIP(return, "Skipping test since alarms has only minute granularity."); rc = ioctl(self->fd, RTC_RD_TIME, &tm); ASSERT_NE(-1, rc); @@ -295,9 +302,11 @@ TEST_F(rtc, alarm_wkalm_set) { SKIP(return, "Skipping test since %s does not exist", rtc_file); ASSERT_NE(-1, self->fd); - alarm_state = get_rtc_alarm_state(self->fd); + alarm_state = get_rtc_alarm_state(self->fd, 1); if (alarm_state == RTC_ALARM_DISABLED) SKIP(return, "Skipping test since alarms are not supported."); + if (alarm_state == RTC_ALARM_RES_MINUTE) + SKIP(return, "Skipping test since alarms has only minute granularity."); rc = ioctl(self->fd, RTC_RD_TIME, &alarm.time); ASSERT_NE(-1, rc); @@ -357,7 +366,7 @@ TEST_F_TIMEOUT(rtc, alarm_alm_set_minute, 65) { SKIP(return, "Skipping test since %s does not exist", rtc_file); ASSERT_NE(-1, self->fd); - alarm_state = get_rtc_alarm_state(self->fd); + alarm_state = get_rtc_alarm_state(self->fd, 0); if (alarm_state == RTC_ALARM_DISABLED) SKIP(return, "Skipping test since alarms are not supported."); @@ -425,7 +434,7 @@ TEST_F_TIMEOUT(rtc, alarm_wkalm_set_minute, 65) { SKIP(return, "Skipping test since %s does not exist", rtc_file); ASSERT_NE(-1, self->fd); - alarm_state = get_rtc_alarm_state(self->fd); + alarm_state = get_rtc_alarm_state(self->fd, 0); if (alarm_state == RTC_ALARM_DISABLED) SKIP(return, "Skipping test since alarms are not supported."); -- 2.39.2

9 months

2
1
0 0

[PATCH v2 0/4] tools/nolibc: MIPS: entrypoint cleanups and N32/N64 ABIs

by Thomas Weißschuh

Introduce support for the N32 and N64 ABIs. As preparation, the entrypoint is first simplified significantly. Thanks to Maciej for all the valuable information. Signed-off-by: Thomas Weißschuh <linux(a)weissschuh.net> --- Changes in v2: - Clean up entrypoint first - Annotate #endifs - Link to v1: https://lore.kernel.org/r/20250212-nolibc-mips-n32-v1-1-6892e58d1321@weisss… --- Thomas Weißschuh (4): tools/nolibc: MIPS: drop $gp setup tools/nolibc: MIPS: drop manual stack pointer alignment tools/nolibc: MIPS: drop noreorder option tools/nolibc: MIPS: add support for N64 and N32 ABIs tools/include/nolibc/arch-mips.h | 117 +++++++++++++++++++++------- tools/testing/selftests/nolibc/Makefile | 28 ++++++- tools/testing/selftests/nolibc/run-tests.sh | 2 +- 3 files changed, 118 insertions(+), 29 deletions(-) --- base-commit: 9c812b01f13d37410ea103e00bc47e5e0f6d2bad change-id: 20231105-nolibc-mips-n32-234901bd910d Best regards, -- Thomas Weißschuh <linux(a)weissschuh.net>

9 months

4
13
0 0

[RFC PATCH v3 0/8] PMU partitioning driver support

by Colton Lewis

This series introduces support in the KVM and ARM PMUv3 driver for partitioning PMU counters into two separate ranges by taking advantage of the MDCR_EL2.HPMN register field. The advantage of a partitioned PMU would be to allow KVM guests direct access to a subset of PMU functionality, greatly reducing the overhead of performance monitoring in guests. While this feature could be accepted on its own merits, practically there is a lot more to be done before it will be fully useful, so I'm sending as an RFC for now. v3: * Include cpucap definition for FEAT_HPMN0 to allow for setting HPMN to 0 * Include PMU header cleanup provided by Marc [1] with some minor changes so compilation works * Pull functions out of pmu-emul.c that aren't specific to the emulated PMU. This and the previous item aren't strictly needed but they provide a nicer starting point. * As suggested by Oliver, start a file for partitioned PMU functions and move the reserved_host_counters parameter and MDCR handling into KVM so the driver does not have to know about it and we need fewer hacks to keep the driver working on 32-bit ARM. This was not a complete separation because the driver still needs to start and stop the host counters all at once and needs to toggle MDCR_EL2.HPME to do that. Introduce kvm_pmu_host_counters_{enable,disable}() functions to handle this and define them as no ops on 32-bit ARM. * As suggested by Oliver, don't limit PMCR.N on emulated PMU. This value will be read correctly when the right traps are disabled to use the partitioned PMU v2: https://lore.kernel.org/kvm/20250208020111.2068239-1-coltonlewis@google.com/ v1: https://lore.kernel.org/kvm/20250127222031.3078945-1-coltonlewis@google.com/ [1] https://git.kernel.org/pub/scm/linux/kernel/git/maz/arm-platforms.git/log/?… Colton Lewis (7): arm64: cpufeature: Add cap for HPMN0 arm64: Generate sign macro for sysreg Enums KVM: arm64: Reorganize PMU functions KVM: arm64: Introduce module param to partition the PMU perf: arm_pmuv3: Generalize counter bitmasks perf: arm_pmuv3: Keep out of guest counter partition KVM: arm64: selftests: Reword selftests error Marc Zyngier (1): KVM: arm64: Cleanup PMU includes arch/arm/include/asm/arm_pmuv3.h | 2 + arch/arm64/include/asm/arm_pmuv3.h | 2 +- arch/arm64/include/asm/kvm_host.h | 199 +++++++- arch/arm64/include/asm/kvm_pmu.h | 47 ++ arch/arm64/kernel/cpufeature.c | 8 + arch/arm64/kvm/Makefile | 2 +- arch/arm64/kvm/arm.c | 1 - arch/arm64/kvm/debug.c | 10 +- arch/arm64/kvm/hyp/include/hyp/switch.h | 1 + arch/arm64/kvm/pmu-emul.c | 464 +----------------- arch/arm64/kvm/pmu-part.c | 63 +++ arch/arm64/kvm/pmu.c | 454 +++++++++++++++++ arch/arm64/kvm/sys_regs.c | 2 + arch/arm64/tools/cpucaps | 1 + arch/arm64/tools/gen-sysreg.awk | 1 + arch/arm64/tools/sysreg | 6 +- drivers/perf/arm_pmuv3.c | 73 ++- include/kvm/arm_pmu.h | 204 -------- include/linux/perf/arm_pmu.h | 16 +- include/linux/perf/arm_pmuv3.h | 27 +- .../selftests/kvm/arm64/vpmu_counter_access.c | 2 +- virt/kvm/kvm_main.c | 1 + 22 files changed, 882 insertions(+), 704 deletions(-) create mode 100644 arch/arm64/include/asm/kvm_pmu.h create mode 100644 arch/arm64/kvm/pmu-part.c delete mode 100644 include/kvm/arm_pmu.h base-commit: 2014c95afecee3e76ca4a56956a936e23283f05b -- 2.48.1.601.g30ceb7b040-goog

9 months, 1 week

3
16
0 0

[PATCH v7 0/3] Enable Zicbom in usermode

by Yunhui Cui

v1/v2: There is only the first patch: RISC-V: Enable cbo.clean/flush in usermode, which mainly removes the enabling of cbo.inval in user mode. v3: Add the functionality of Expose Zicbom and selftests for Zicbom. v4: Modify the order of macros, The test_no_cbo_inval function is added separately. v5: 1. Modify the order of RISCV_HWPROBE_KEY_ZICBOM_BLOCK_SIZE in hwprobe.rst 2. "TEST_NO_ZICBOINVAL" -> "TEST_NO_CBO_INVAL" v6: Change hwprobe_ext0_has's second param to u64. v7: Rebase to the latest code of linux-next. Yunhui Cui (3): RISC-V: Enable cbo.clean/flush in usermode RISC-V: hwprobe: Expose Zicbom extension and its block size RISC-V: selftests: Add TEST_ZICBOM into CBO tests Documentation/arch/riscv/hwprobe.rst | 6 ++ arch/riscv/include/asm/hwprobe.h | 2 +- arch/riscv/include/uapi/asm/hwprobe.h | 2 + arch/riscv/kernel/cpufeature.c | 8 +++ arch/riscv/kernel/sys_hwprobe.c | 8 ++- tools/testing/selftests/riscv/hwprobe/cbo.c | 66 +++++++++++++++++---- 6 files changed, 79 insertions(+), 13 deletions(-) -- 2.39.2

9 months, 1 week

2
4
0 0

[PATCH v4 00/30] context_tracking,x86: Defer some IPIs until a user->kernel transition

by Valentin Schneider

Context ======= We've observed within Red Hat that isolated, NOHZ_FULL CPUs running a pure-userspace application get regularly interrupted by IPIs sent from housekeeping CPUs. Those IPIs are caused by activity on the housekeeping CPUs leading to various on_each_cpu() calls, e.g.: 64359.052209596 NetworkManager 0 1405 smp_call_function_many_cond (cpu=0, func=do_kernel_range_flush) smp_call_function_many_cond+0x1 smp_call_function+0x39 on_each_cpu+0x2a flush_tlb_kernel_range+0x7b __purge_vmap_area_lazy+0x70 _vm_unmap_aliases.part.42+0xdf change_page_attr_set_clr+0x16a set_memory_ro+0x26 bpf_int_jit_compile+0x2f9 bpf_prog_select_runtime+0xc6 bpf_prepare_filter+0x523 sk_attach_filter+0x13 sock_setsockopt+0x92c __sys_setsockopt+0x16a __x64_sys_setsockopt+0x20 do_syscall_64+0x87 entry_SYSCALL_64_after_hwframe+0x65 The heart of this series is the thought that while we cannot remove NOHZ_FULL CPUs from the list of CPUs targeted by these IPIs, they may not have to execute the callbacks immediately. Anything that only affects kernelspace can wait until the next user->kernel transition, providing it can be executed "early enough" in the entry code. The original implementation is from Peter [1]. Nicolas then added kernel TLB invalidation deferral to that [2], and I picked it up from there. Deferral approach ================= Storing each and every callback, like a secondary call_single_queue turned out to be a no-go: the whole point of deferral is to keep NOHZ_FULL CPUs in userspace for as long as possible - no signal of any form would be sent when deferring an IPI. This means that any form of queuing for deferred callbacks would end up as a convoluted memory leak. Deferred IPIs must thus be coalesced, which this series achieves by assigning IPIs a "type" and having a mapping of IPI type to callback, leveraged upon kernel entry. What about IPIs whose callback take a parameter, you may ask? Peter suggested during OSPM23 [3] that since on_each_cpu() targets housekeeping CPUs *and* isolated CPUs, isolated CPUs can access either global or housekeeping-CPU-local state to "reconstruct" the data that would have been sent via the IPI. This series does not affect any IPI callback that requires an argument, but the approach would remain the same (one coalescable callback executed on kernel entry). Kernel entry vs execution of the deferred operation =================================================== This is what I've referred to as the "Danger Zone" during my LPC24 talk [4]. There is a non-zero length of code that is executed upon kernel entry before the deferred operation can be itself executed (i.e. before we start getting into context_tracking.c proper), i.e.: idtentry_func_foo() <--- we're in the kernel irqentry_enter() enter_from_user_mode() __ct_user_exit() ct_kernel_enter_state() ct_work_flush() <--- deferred operation is executed here This means one must take extra care to what can happen in the early entry code, and that <bad things> cannot happen. For instance, we really don't want to hit instructions that have been modified by a remote text_poke() while we're on our way to execute a deferred sync_core(). Patches doing the actual deferral have more detail on this. Patches ======= o Patches 1-2 are standalone objtool cleanups. o Patches 3-4 add an RCU testing feature. o Patches 5-6 add infrastructure for annotating static keys and static calls that may be used in noinstr code (courtesy of Josh). o Patches 7-19 use said annotations on relevant keys / calls. o Patch 20 enforces proper usage of said annotations (courtesy of Josh). o Patches 21-23 fiddle with CT_STATE* within context tracking o Patches 24-29 add the actual IPI deferral faff o Patch 30 adds a freebie: deferring IPIs for NOHZ_IDLE. Not tested that much! if you care about battery-powered devices and energy consumption, go give it a try! Patches are also available at: https://gitlab.com/vschneid/linux.git -b redhat/isolirq/defer/v4 Stuff I'd like eyes and neurons on ================================== Context-tracking vs idle. Patch 22 "improves" the situation by adding an IDLE->KERNEL transition when getting an IRQ while idle, but it leaves the following window: ~> IRQ ct_nmi_enter() state = state + CT_STATE_KERNEL - CT_STATE_IDLE [...] ct_nmi_exit() state = state - CT_STATE_KERNEL + CT_STATE_IDLE [...] /!\ CT_STATE_IDLE here while we're really in kernelspace! /!\ ct_cpuidle_exit() state = state + CT_STATE_KERNEL - CT_STATE_IDLE Said window is contained within cpu_idle_poll() and the cpuidle call within cpuidle_enter_state(), both being noinstr (the former is __cpuidle which is noinstr itself). Thus objtool will consider it as early entry and will warn accordingly of any static key / call misuse, so the damage is somewhat contained, but it's not ideal. I tried fiddling with this but idle polling likes being annoying, as it is shaped like so: ct_cpuidle_enter(); raw_local_irq_enable(); while (!tif_need_resched() && (cpu_idle_force_poll || tick_check_broadcast_expired())) cpu_relax(); raw_local_irq_disable(); ct_cpuidle_exit(); IOW, getting an IRQ that doesn't end up setting NEED_RESCHED while idle-polling doesn't come near ct_cpuidle_exit(), which prevents me from having the outermost ct_nmi_exit() leave the state as CT_STATE_KERNEL (rather than CT_STATE_IDLE). Testing ======= Xeon E5-2699 system with SMToff, NOHZ_FULL, isolated CPUs. RHEL9 userspace. Workload is using rteval (kernel compilation + hackbench) on housekeeping CPUs and a dummy stay-in-userspace loop on the isolated CPUs. The main invocation is: $ trace-cmd record -e "csd_queue_cpu" -f "cpu & CPUS{$ISOL_CPUS}" \ -e "ipi_send_cpumask" -f "cpumask & CPUS{$ISOL_CPUS}" \ -e "ipi_send_cpu" -f "cpu & CPUS{$ISOL_CPUS}" \ rteval --onlyload --loads-cpulist=$HK_CPUS \ --hackbench-runlowmem=True --duration=$DURATION This only records IPIs sent to isolated CPUs, so any event there is interference (with a bit of fuzz at the start/end of the workload when spawning the processes). All tests were done with a duration of 6 hours. v6.13-rc6 # This is the actual IPI count $ trace-cmd report | grep callback | awk '{ print $(NF) }' | sort | uniq -c | sort -nr 531 callback=generic_smp_call_function_single_interrupt+0x0 # These are the different CSD's that caused IPIs $ trace-cmd report | grep csd_queue | awk '{ print $(NF-1) }' | sort | uniq -c | sort -nr 12818 func=do_flush_tlb_all 910 func=do_kernel_range_flush 78 func=do_sync_core v6.13-rc6 + patches: # This is the actual IPI count $ trace-cmd report | grep callback | awk '{ print $(NF) }' | sort | uniq -c | sort -nr # Zilch! # These are the different CSD's that caused IPIs $ trace-cmd report | grep csd_queue | awk '{ print $(NF-1) }' | sort | uniq -c | sort -nr # Nada! Note that tlb_remove_table_smp_sync() showed up during testing of v3, and has gone as mysteriously as it showed up. Yair had a series adressing this [5] which would be worth revisiting. Acknowledgements ================ Special thanks to: o Clark Williams for listening to my ramblings about this and throwing ideas my way o Josh Poimboeuf for all his help with everything objtool-related o All of the folks who attended various (too many?) talks about this and provided precious feedback. Links ===== [1]: https://lore.kernel.org/all/20210929151723.162004989@infradead.org/ [2]: https://github.com/vianpl/linux.git -b ct-work-defer-wip [3]: https://youtu.be/0vjE6fjoVVE [4]: https://lpc.events/event/18/contributions/1889/ [5]: https://lore.kernel.org/lkml/20230620144618.125703-1-ypodemsk@redhat.com/ Revisions ========= RFCv3 -> v4 ++++++++++++++ o Rebased onto v6.13-rc6 o New objtool patches from Josh o More .noinstr static key/call patches o Static calls now handled as well (again thanks to Josh) o Fixed clearing the work bits on kernel exit o Messed with IRQ hitting an idle CPU vs context tracking o Various comment and naming cleanups o Made RCU_DYNTICKS_TORTURE depend on !COMPILE_TEST (PeterZ) o Fixed the CT_STATE_KERNEL check when setting a deferred work (Frederic) o Cleaned up the __flush_tlb_all() mess thanks to PeterZ RFCv2 -> RFCv3 ++++++++++++++ o Rebased onto v6.12-rc6 o Added objtool documentation for the new warning (Josh) o Added low-size RCU watching counter to TREE04 torture scenario (Paul) o Added FORCEFUL jump label and static key types o Added noinstr-compliant helpers for tlb flush deferral RFCv1 -> RFCv2 ++++++++++++++ o Rebased onto v6.5-rc1 o Updated the trace filter patches (Steven) o Fixed __ro_after_init keys used in modules (Peter) o Dropped the extra context_tracking atomic, squashed the new bits in the existing .state field (Peter, Frederic) o Added an RCU_EXPERT config for the RCU dynticks counter size, and added an rcutorture case for a low-size counter (Paul) o Fixed flush_tlb_kernel_range_deferrable() definition Josh Poimboeuf (3): jump_label: Add annotations for validating noinstr usage static_call: Add read-only-after-init static calls objtool: Add noinstr validation for static branches/calls Peter Zijlstra (1): x86,tlb: Make __flush_tlb_global() noinstr-compliant Valentin Schneider (26): objtool: Make validate_call() recognize indirect calls to pv_ops[] objtool: Flesh out warning related to pv_ops[] calls rcu: Add a small-width RCU watching counter debug option rcutorture: Make TREE04 use CONFIG_RCU_DYNTICKS_TORTURE x86/paravirt: Mark pv_sched_clock static call as __ro_after_init x86/idle: Mark x86_idle static call as __ro_after_init x86/paravirt: Mark pv_steal_clock static call as __ro_after_init riscv/paravirt: Mark pv_steal_clock static call as __ro_after_init loongarch/paravirt: Mark pv_steal_clock static call as __ro_after_init arm64/paravirt: Mark pv_steal_clock static call as __ro_after_init arm/paravirt: Mark pv_steal_clock static call as __ro_after_init perf/x86/amd: Mark perf_lopwr_cb static call as __ro_after_init sched/clock: Mark sched_clock_running key as __ro_after_init x86/speculation/mds: Mark mds_idle_clear key as allowed in .noinstr sched/clock, x86: Mark __sched_clock_stable key as allowed in .noinstr x86/kvm/vmx: Mark vmx_l1d_should flush and vmx_l1d_flush_cond keys as allowed in .noinstr stackleack: Mark stack_erasing_bypass key as allowed in .noinstr context_tracking: Explicitely use CT_STATE_KERNEL where it is missing context_tracking: Exit CT_STATE_IDLE upon irq/nmi entry context_tracking: Turn CT_STATE_* into bits context-tracking: Introduce work deferral infrastructure context_tracking,x86: Defer kernel text patching IPIs x86/tlb: Make __flush_tlb_local() noinstr-compliant x86/tlb: Make __flush_tlb_all() noinstr x86/mm, mm/vmalloc: Defer flush_tlb_kernel_range() targeting NOHZ_FULL CPUs context-tracking: Add a Kconfig to enable IPI deferral for NO_HZ_IDLE arch/Kconfig | 9 ++ arch/arm/kernel/paravirt.c | 2 +- arch/arm64/kernel/paravirt.c | 2 +- arch/loongarch/kernel/paravirt.c | 2 +- arch/riscv/kernel/paravirt.c | 2 +- arch/x86/Kconfig | 1 + arch/x86/events/amd/brs.c | 2 +- arch/x86/include/asm/context_tracking_work.h | 22 ++++ arch/x86/include/asm/invpcid.h | 13 +-- arch/x86/include/asm/paravirt.h | 4 +- arch/x86/include/asm/text-patching.h | 1 + arch/x86/include/asm/tlbflush.h | 3 +- arch/x86/include/asm/xen/hypercall.h | 11 +- arch/x86/kernel/alternative.c | 38 ++++++- arch/x86/kernel/cpu/bugs.c | 9 +- arch/x86/kernel/kprobes/core.c | 4 +- arch/x86/kernel/kprobes/opt.c | 4 +- arch/x86/kernel/module.c | 2 +- arch/x86/kernel/paravirt.c | 4 +- arch/x86/kernel/process.c | 2 +- arch/x86/kvm/vmx/vmx.c | 11 +- arch/x86/mm/tlb.c | 46 ++++++-- arch/x86/xen/mmu_pv.c | 10 +- arch/x86/xen/xen-ops.h | 12 +- include/asm-generic/sections.h | 15 +++ include/linux/context_tracking.h | 21 ++++ include/linux/context_tracking_state.h | 64 +++++++++-- include/linux/context_tracking_work.h | 28 +++++ include/linux/jump_label.h | 30 ++++- include/linux/objtool.h | 7 ++ include/linux/static_call.h | 19 ++++ kernel/context_tracking.c | 98 ++++++++++++++-- kernel/rcu/Kconfig.debug | 15 +++ kernel/sched/clock.c | 7 +- kernel/stackleak.c | 6 +- kernel/time/Kconfig | 19 ++++ mm/vmalloc.c | 35 +++++- tools/objtool/Documentation/objtool.txt | 34 ++++++ tools/objtool/check.c | 106 +++++++++++++++--- tools/objtool/include/objtool/check.h | 1 + tools/objtool/include/objtool/elf.h | 1 + tools/objtool/include/objtool/special.h | 1 + tools/objtool/special.c | 18 ++- .../selftests/rcutorture/configs/rcu/TREE04 | 1 + 44 files changed, 635 insertions(+), 107 deletions(-) create mode 100644 arch/x86/include/asm/context_tracking_work.h create mode 100644 include/linux/context_tracking_work.h -- 2.43.0

9 months, 1 week

11
85
0 0

[PATCH rcu 00/11] RCU torture changes for v6.15

by Boqun Feng

Hi, Please find the upcoming changes in rcutorture for v6.15. The changes can also be found at: git://git.kernel.org/pub/scm/linux/kernel/git/rcu/linux.git torture.2025.02.05a Regards, Boqun Paul E. McKenney (11): torture: Add get_torture_init_jiffies() for test-start time rcutorture: Add a test_boost_holdoff module parameter rcutorture: Include grace-period sequence numbers in failure/close-call rcutorture: Expand failure/close-call grace-period output rcu: Trace expedited grace-period numbers in hexadecimal rcutorture: Add ftrace-compatible timestamp to GP# failure/close-call output rcutorture: Make cur_ops->format_gp_seqs take buffer length rcutorture: Move RCU_TORTURE_TEST_{CHK_RDR_STATE,LOG_CPU} to bool rcutorture: Complain when invalid SRCU reader_flavor is specified srcu: Add FORCE_NEED_SRCU_NMI_SAFE Kconfig for testing torture: Make SRCU lockdep testing use srcu_read_lock_nmisafe() .../admin-guide/kernel-parameters.txt | 5 ++ include/linux/torture.h | 1 + include/trace/events/rcu.h | 2 +- kernel/rcu/Kconfig | 11 ++++ kernel/rcu/Kconfig.debug | 18 ++++- kernel/rcu/rcu.h | 2 + kernel/rcu/rcutorture.c | 65 +++++++++++++++++-- kernel/rcu/tiny.c | 14 ++++ kernel/rcu/tree.c | 20 ++++++ kernel/torture.c | 12 ++++ .../selftests/rcutorture/bin/srcu_lockdep.sh | 2 +- 11 files changed, 144 insertions(+), 8 deletions(-) -- 2.39.5 (Apple Git-154)

9 months, 1 week

3
18
0 0

[PATCHv4 RESEND net-next 0/2] selftests: wireguards: use nftables for testing

by Hangbin Liu

This patch set convert iptables to nftables for wireguard testing, as iptables is deparated and nftables is the default framework of most releases. v3: drop iptables directly (Jason A. Donenfeld) Also convert to using nft for qemu testing (Jason A. Donenfeld) v2: use one nft table for testing (Phil Sutter) Hangbin Liu (2): selftests: wireguards: convert iptables to nft selftests: wireguard: update to using nft for qemu test tools/testing/selftests/wireguard/netns.sh | 29 +++++++++----- .../testing/selftests/wireguard/qemu/Makefile | 40 ++++++++++++++----- .../selftests/wireguard/qemu/kernel.config | 7 ++-- 3 files changed, 53 insertions(+), 23 deletions(-) -- 2.46.0

9 months, 1 week

3
13
0 0

[PATCH 0/3] bitmap: convert self-test to KUnit

by Tamir Duberstein

This is one of just 3 remaining "Test Module" kselftests (the others being printf and scanf), the rest having been converted to KUnit. I tested this using: $ tools/testing/kunit/kunit.py run --arch arm64 --make_options LLVM=1 bitmap. I've already sent out a conversion series for each of printf[0] and scanf[1]. There was a previous attempt[2] to do this in July 2024. Please bear with me as I try to understand and address the objections from that time. I've spoken with Muhammad Usama Anjum, the author of that series, and received their approval to "take over" this work. Here we go... On 7/26/24 11:45 PM, John Hubbard wrote: > > This changes the situation from "works for Linus' tab completion > case", to "causes a tab completion problem"! :) > > I think a tests/ subdir is how we eventually decided to do this [1], > right? > > So: > > lib/tests/bitmap_kunit.c > > [1] https://lore.kernel.org/20240724201354.make.730-kees@kernel.org This is true and unfortunate, but not trivial to fix because new kallsyms tests were placed in lib/tests in commit 84b4a51fce4c ("selftests: add new kallsyms selftests") *after* the KUnit filename best practices were adopted. I propose that the KUnit maintainers blaze this trail using `string_kunit.c` which currently still lives in lib/ despite the KUnit docs giving it as an example at lib/tests/. On 7/27/24 12:24 AM, Shuah Khan wrote: > > This change will take away the ability to run bitmap tests during > boot on a non-kunit kernel. > > Nack on this change. I wan to see all tests that are being removed > from lib because they have been converted - also it doesn't make > sense to convert some tests like this one that add the ability test > during boot. This point was also discussed in another thread[3] in which: On 7/27/24 12:35 AM, Shuah Khan wrote: > > Please make sure you aren't taking away the ability to run these tests during > boot. > > It doesn't make sense to convert every single test especially when it > is intended to be run during boot without dependencies - not as a kunit test > but a regression test during boot. > > bitmap is one example - pay attention to the config help test - bitmap > one clearly states it runs regression testing during boot. Any test that > says that isn't a candidate for conversion. > > I am going to nack any such conversions. The crux of the argument seems to be that the config help text is taken to describe the author's intent with the fragment "at boot". I think this may be a case of confirmation bias: I see at least the following KUnit tests with "at boot" in their help text: - CPUMASK_KUNIT_TEST - BITFIELD_KUNIT - CHECKSUM_KUNIT - UTIL_MACROS_KUNIT It seems to me that the inference being made is that any test that runs "at boot" is intended to be run by both developers and users, but I find no evidence that bitmap in particular would ever provide additional value when run by users. There's further discussion about KUnit not being "ideal for cases where people would want to check a subsystem on a running kernel", but I find no evidence that bitmap in particular is actually testing the running kernel; it is a unit test of the bitmap functions, which is also stated in the config help text. David Gow made many of the same points in his final reply[4], which was never replied to. Link: https://lore.kernel.org/all/20250207-printf-kunit-convert-v2-0-057b23860823… [0] Link: https://lore.kernel.org/all/20250207-scanf-kunit-convert-v4-0-a23e2afaede8@… [1] Link: https://lore.kernel.org/all/20240726110658.2281070-1-usama.anjum@collabora.… [2] Link: https://lore.kernel.org/all/327831fb-47ab-4555-8f0b-19a8dbcaacd7@collabora.… [3] Link: https://lore.kernel.org/all/CABVgOSmMoPD3JfzVd4VTkzGL2fZCo8LfwzaVSzeFimPrhg… [4] Thanks for your attention. Signed-off-by: Tamir Duberstein <tamird(a)gmail.com> --- Tamir Duberstein (3): bitmap: remove _check_eq_u32_array bitmap: convert self-test to KUnit bitmap: break kunit into test cases MAINTAINERS | 2 +- arch/m68k/configs/amiga_defconfig | 1 - arch/m68k/configs/apollo_defconfig | 1 - arch/m68k/configs/atari_defconfig | 1 - arch/m68k/configs/bvme6000_defconfig | 1 - arch/m68k/configs/hp300_defconfig | 1 - arch/m68k/configs/mac_defconfig | 1 - arch/m68k/configs/multi_defconfig | 1 - arch/m68k/configs/mvme147_defconfig | 1 - arch/m68k/configs/mvme16x_defconfig | 1 - arch/m68k/configs/q40_defconfig | 1 - arch/m68k/configs/sun3_defconfig | 1 - arch/m68k/configs/sun3x_defconfig | 1 - arch/powerpc/configs/ppc64_defconfig | 1 - lib/Kconfig.debug | 24 +- lib/Makefile | 2 +- lib/{test_bitmap.c => bitmap_kunit.c} | 454 +++++++++++++--------------------- tools/testing/selftests/lib/bitmap.sh | 3 - tools/testing/selftests/lib/config | 1 - 19 files changed, 195 insertions(+), 304 deletions(-) --- base-commit: 2014c95afecee3e76ca4a56956a936e23283f05b change-id: 20250207-bitmap-kunit-convert-92d3147b2eee Best regards, -- Tamir Duberstein <tamird(a)gmail.com>

9 months, 1 week

8
26
0 0

2025

2024

2023

2022

2021

2020

2019

2018

2017

Linux-kselftest-mirror February 2025