From: Yi Liu <yi.l.liu(a)intel.com>
[ Upstream commit 1062d81086156e42878d701b816d2f368b53a77c ]
Allocating a domain with a fault ID indicates that the domain is faultable.
However, there is a gap for the nested parent domain to support PRI. Some
hardware lacks the capability to distinguish whether PRI occurs at stage 1
or stage 2. This limitation may require software-based page table walking
to resolve. Since no in-tree IOMMU driver currently supports this
functionality, it is disallowed. For more details, refer to the related
discussion at [1].
[1] https://lore.kernel.org/linux-iommu/bd1655c6-8b2f-4cfa-adb1-badc00d01811@in…
Link: https://patch.msgid.link/r/20250226104012.82079-1-yi.l.liu@intel.com
Suggested-by: Lu Baolu <baolu.lu(a)linux.intel.com>
Signed-off-by: Yi Liu <yi.l.liu(a)intel.com>
Reviewed-by: Kevin Tian <kevin.tian(a)intel.com>
Reviewed-by: Lu Baolu <baolu.lu(a)linux.intel.com>
Signed-off-by: Jason Gunthorpe <jgg(a)nvidia.com>
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
---
drivers/iommu/iommufd/hw_pagetable.c | 3 +++
tools/testing/selftests/iommu/iommufd.c | 4 ++++
2 files changed, 7 insertions(+)
diff --git a/drivers/iommu/iommufd/hw_pagetable.c b/drivers/iommu/iommufd/hw_pagetable.c
index 598be26a14e28..9b5b0b8522299 100644
--- a/drivers/iommu/iommufd/hw_pagetable.c
+++ b/drivers/iommu/iommufd/hw_pagetable.c
@@ -126,6 +126,9 @@ iommufd_hwpt_paging_alloc(struct iommufd_ctx *ictx, struct iommufd_ioas *ioas,
if ((flags & IOMMU_HWPT_ALLOC_DIRTY_TRACKING) &&
!device_iommu_capable(idev->dev, IOMMU_CAP_DIRTY_TRACKING))
return ERR_PTR(-EOPNOTSUPP);
+ if ((flags & IOMMU_HWPT_FAULT_ID_VALID) &&
+ (flags & IOMMU_HWPT_ALLOC_NEST_PARENT))
+ return ERR_PTR(-EOPNOTSUPP);
hwpt_paging = __iommufd_object_alloc(
ictx, hwpt_paging, IOMMUFD_OBJ_HWPT_PAGING, common.obj);
diff --git a/tools/testing/selftests/iommu/iommufd.c b/tools/testing/selftests/iommu/iommufd.c
index a1b2b657999dc..618c03bb6509b 100644
--- a/tools/testing/selftests/iommu/iommufd.c
+++ b/tools/testing/selftests/iommu/iommufd.c
@@ -439,6 +439,10 @@ TEST_F(iommufd_ioas, alloc_hwpt_nested)
&test_hwpt_id);
test_err_hwpt_alloc(EINVAL, self->device_id, self->device_id, 0,
&test_hwpt_id);
+ test_err_hwpt_alloc(EOPNOTSUPP, self->device_id, self->ioas_id,
+ IOMMU_HWPT_ALLOC_NEST_PARENT |
+ IOMMU_HWPT_FAULT_ID_VALID,
+ &test_hwpt_id);
test_cmd_hwpt_alloc(self->device_id, self->ioas_id,
IOMMU_HWPT_ALLOC_NEST_PARENT,
--
2.39.5
From: Niklas Cassel <cassel(a)kernel.org>
[ Upstream commit af1451b6738ec7cf91f2914f53845424959ec4ee ]
Currently BARs that have been disabled by the endpoint controller driver
will result in a test FAIL.
Returning FAIL for a BAR that is disabled seems overly pessimistic.
There are EPC that disables one or more BARs intentionally.
One reason for this is that there are certain EPCs that are hardwired to
expose internal PCIe controller registers over a certain BAR, so the EPC
driver disables such a BAR, such that the host will not overwrite random
registers during testing.
Such a BAR will be disabled by the EPC driver's init function, and the
BAR will be marked as BAR_RESERVED, such that it will be unavailable to
endpoint function drivers.
Let's return FAIL only for BARs that are actually enabled and failed the
test, and let's return skip for BARs that are not even enabled.
Signed-off-by: Niklas Cassel <cassel(a)kernel.org>
Reviewed-by: Manivannan Sadhasivam <manivannan.sadhasivam(a)linaro.org>
Link: https://lore.kernel.org/r/20250123120147.3603409-4-cassel@kernel.org
Signed-off-by: Manivannan Sadhasivam <manivannan.sadhasivam(a)linaro.org>
Signed-off-by: Krzysztof Wilczyński <kwilczynski(a)kernel.org>
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
---
tools/testing/selftests/pci_endpoint/pci_endpoint_test.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/tools/testing/selftests/pci_endpoint/pci_endpoint_test.c b/tools/testing/selftests/pci_endpoint/pci_endpoint_test.c
index c267b822c1081..576c590b277b1 100644
--- a/tools/testing/selftests/pci_endpoint/pci_endpoint_test.c
+++ b/tools/testing/selftests/pci_endpoint/pci_endpoint_test.c
@@ -65,6 +65,8 @@ TEST_F(pci_ep_bar, BAR_TEST)
int ret;
pci_ep_ioctl(PCITEST_BAR, variant->barno);
+ if (ret == -ENODATA)
+ SKIP(return, "BAR is disabled");
EXPECT_FALSE(ret) TH_LOG("Test failed for BAR%d", variant->barno);
}
--
2.39.5
From: Rae Moar <rmoar(a)google.com>
[ Upstream commit 1d4c06d51963f89f67c7b75d5c0c34e9d1bb2ae6 ]
A bug was identified where the KTAP below caused an infinite loop:
TAP version 13
ok 4 test_case
1..4
The infinite loop was caused by the parser not parsing a test plan
if following a test result line.
Fix this bug by parsing test plan line to avoid the infinite loop.
Link: https://lore.kernel.org/r/20250313192714.1380005-1-rmoar@google.com
Signed-off-by: Rae Moar <rmoar(a)google.com>
Reviewed-by: David Gow <davidgow(a)google.com>
Signed-off-by: Shuah Khan <shuah(a)kernel.org>
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
---
tools/testing/kunit/kunit_parser.py | 9 ++++-----
1 file changed, 4 insertions(+), 5 deletions(-)
diff --git a/tools/testing/kunit/kunit_parser.py b/tools/testing/kunit/kunit_parser.py
index 29fc27e8949bd..da53a709773a2 100644
--- a/tools/testing/kunit/kunit_parser.py
+++ b/tools/testing/kunit/kunit_parser.py
@@ -759,7 +759,7 @@ def parse_test(lines: LineStream, expected_num: int, log: List[str], is_subtest:
# If parsing the main/top-level test, parse KTAP version line and
# test plan
test.name = "main"
- ktap_line = parse_ktap_header(lines, test, printer)
+ parse_ktap_header(lines, test, printer)
test.log.extend(parse_diagnostic(lines))
parse_test_plan(lines, test)
parent_test = True
@@ -768,13 +768,12 @@ def parse_test(lines: LineStream, expected_num: int, log: List[str], is_subtest:
# the KTAP version line and/or subtest header line
ktap_line = parse_ktap_header(lines, test, printer)
subtest_line = parse_test_header(lines, test)
+ test.log.extend(parse_diagnostic(lines))
+ parse_test_plan(lines, test)
parent_test = (ktap_line or subtest_line)
if parent_test:
- # If KTAP version line and/or subtest header is found, attempt
- # to parse test plan and print test header
- test.log.extend(parse_diagnostic(lines))
- parse_test_plan(lines, test)
print_test_header(test, printer)
+
expected_count = test.expected_count
subtests = []
test_num = 1
--
2.39.5
After a long delay I'm posting next iteration of lockless /proc/pid/maps
reading patchset. Differences from v2 [1]:
- Add a set of tests concurrently modifying address space and checking for
correct reading results;
- Use new mmap_lock_speculate_xxx APIs for concurrent change detection and
retries;
- Add lockless PROCMAP_QUERY execution support;
The new tests are designed to check for any unexpected data tearing while
performing some common address space modifications (vma split, resize and
remap). Even before these changes, reading /proc/pid/maps might have
inconsistent data because the file is read page-by-page with mmap_lock
being dropped between the pages. Such tearing is expected and userspace
is supposed to deal with that possibility. An example of user-visible
inconsistency can be that the same vma is printed twice: once before
it was modified and then after the modifications. For example if vma was
extended, it might be found and reported twice. Whan is not expected is
to see a gap where there should have been a vma both before and after
modification. This patchset increases the chances of such tearing,
therefore it's event more important now to test for unexpected
inconsistencies.
Thanks to Paul McKenney who developed a benchmark to test performance
of concurrent reads and updates, we also have data on performance
benefits:
The test has a pair of processes scanning /proc/PID/maps, and another
process unmapping and remapping 4K pages from a 128MB range of anonymous
memory. At the end of each 10-second run, the latency of each mmap()
or munmap() operation is measured, and for each run the maximum and mean
latency is printed. (Yes, the map/unmap process is started first, its
PID is passed to the scanners, and then the map/unmap process waits until
both scanners are running before starting its timed test. The scanners
keep scanning until the specified /proc/PID/maps file disappears.)
In summary, with stock mm, 78% of the runs had maximum latencies in
excess of 0.5 milliseconds, and with more then half of the runs' latencies
exceeding a full millisecond. In contrast, 98% of the runs with Suren's
patch series applied had maximum latencies of less than 0.5 milliseconds.
From a median-performance viewpoint, Suren's series also looks good,
with stock mm weighing in at 13 microseconds and Suren's series at 10
microseconds, better than a 20% improvement.
[1] https://lore.kernel.org/all/20240123231014.3801041-1-surenb@google.com/
Suren Baghdasaryan (8):
selftests/proc: add /proc/pid/maps tearing from vma split test
selftests/proc: extend /proc/pid/maps tearing test to include vma
resizing
selftests/proc: extend /proc/pid/maps tearing test to include vma
remapping
selftests/proc: test PROCMAP_QUERY ioctl while vma is concurrently
modified
selftests/proc: add verbose more for tests to facilitate debugging
mm: make vm_area_struct anon_name field RCU-safe
mm/maps: read proc/pid/maps under RCU
mm/maps: execute PROCMAP_QUERY ioctl under RCU
fs/proc/internal.h | 6 +
fs/proc/task_mmu.c | 233 +++++-
include/linux/mm_inline.h | 28 +-
include/linux/mm_types.h | 3 +-
mm/madvise.c | 30 +-
tools/testing/selftests/proc/proc-pid-vm.c | 793 ++++++++++++++++++++-
6 files changed, 1061 insertions(+), 32 deletions(-)
base-commit: 79f35c4125a9a3fd98efeed4cce1cd7ce5311a44
--
2.49.0.805.g082f7c87e0-goog
I'd like to cut down the memory usage of parsing vmlinux BTF in ebpf-go.
With some upcoming changes the library is sitting at 5MiB for a parse.
Most of that memory is simply copying the BTF blob into user space.
By allowing vmlinux BTF to be mmapped read-only into user space I can
cut memory usage by about 75%.
Signed-off-by: Lorenz Bauer <lmb(a)isovalent.com>
---
Changes in v2:
- Use btf__new in selftest
- Avoid vm_iomap_memory in btf_vmlinux_mmap
- Add VM_DONTDUMP
- Add support to libbpf
- Link to v1: https://lore.kernel.org/r/20250501-vmlinux-mmap-v1-0-aa2724572598@isovalent…
---
Lorenz Bauer (3):
btf: allow mmap of vmlinux btf
selftests: bpf: add a test for mmapable vmlinux BTF
libbpf: Use mmap to parse vmlinux BTF from sysfs
include/asm-generic/vmlinux.lds.h | 3 +-
kernel/bpf/sysfs_btf.c | 36 +++++++++-
tools/lib/bpf/btf.c | 82 +++++++++++++++++++---
tools/testing/selftests/bpf/prog_tests/btf_sysfs.c | 82 ++++++++++++++++++++++
4 files changed, 189 insertions(+), 14 deletions(-)
---
base-commit: 38d976c32d85ef12dcd2b8a231196f7049548477
change-id: 20250501-vmlinux-mmap-2ec5563c3ef1
Best regards,
--
Lorenz Bauer <lmb(a)isovalent.com>
Context
=======
We've observed within Red Hat that isolated, NOHZ_FULL CPUs running a
pure-userspace application get regularly interrupted by IPIs sent from
housekeeping CPUs. Those IPIs are caused by activity on the housekeeping CPUs
leading to various on_each_cpu() calls, e.g.:
64359.052209596 NetworkManager 0 1405 smp_call_function_many_cond (cpu=0, func=do_kernel_range_flush)
smp_call_function_many_cond+0x1
smp_call_function+0x39
on_each_cpu+0x2a
flush_tlb_kernel_range+0x7b
__purge_vmap_area_lazy+0x70
_vm_unmap_aliases.part.42+0xdf
change_page_attr_set_clr+0x16a
set_memory_ro+0x26
bpf_int_jit_compile+0x2f9
bpf_prog_select_runtime+0xc6
bpf_prepare_filter+0x523
sk_attach_filter+0x13
sock_setsockopt+0x92c
__sys_setsockopt+0x16a
__x64_sys_setsockopt+0x20
do_syscall_64+0x87
entry_SYSCALL_64_after_hwframe+0x65
The heart of this series is the thought that while we cannot remove NOHZ_FULL
CPUs from the list of CPUs targeted by these IPIs, they may not have to execute
the callbacks immediately. Anything that only affects kernelspace can wait
until the next user->kernel transition, providing it can be executed "early
enough" in the entry code.
The original implementation is from Peter [1]. Nicolas then added kernel TLB
invalidation deferral to that [2], and I picked it up from there.
Deferral approach
=================
Storing each and every callback, like a secondary call_single_queue turned out
to be a no-go: the whole point of deferral is to keep NOHZ_FULL CPUs in
userspace for as long as possible - no signal of any form would be sent when
deferring an IPI. This means that any form of queuing for deferred callbacks
would end up as a convoluted memory leak.
Deferred IPIs must thus be coalesced, which this series achieves by assigning
IPIs a "type" and having a mapping of IPI type to callback, leveraged upon
kernel entry.
What about IPIs whose callback take a parameter, you may ask?
Peter suggested during OSPM23 [3] that since on_each_cpu() targets
housekeeping CPUs *and* isolated CPUs, isolated CPUs can access either global or
housekeeping-CPU-local state to "reconstruct" the data that would have been sent
via the IPI.
This series does not affect any IPI callback that requires an argument, but the
approach would remain the same (one coalescable callback executed on kernel
entry).
Kernel entry vs execution of the deferred operation
===================================================
This is what I've referred to as the "Danger Zone" during my LPC24 talk [4].
There is a non-zero length of code that is executed upon kernel entry before the
deferred operation can be itself executed (before we start getting into
context_tracking.c proper), i.e.:
idtentry_func_foo() <--- we're in the kernel
irqentry_enter()
enter_from_user_mode()
__ct_user_exit()
ct_kernel_enter_state()
ct_work_flush() <--- deferred operation is executed here
This means one must take extra care to what can happen in the early entry code,
and that <bad things> cannot happen. For instance, we really don't want to hit
instructions that have been modified by a remote text_poke() while we're on our
way to execute a deferred sync_core(). Patches doing the actual deferral have
more detail on this.
Where are we at with this whole thing?
======================================
Dave has been incredibly helpful wrt figuring out what would and wouldn't
(mostly that) be safe to do for deferring kernel range TLB flush IPIs, see [5].
Long story short, there are ugly things I can still do to (safely) defer the TLB
flush IPIs, but it's going to be a long session of pulling my own hair out, and
I got plenty so I won't be done for a while.
In the meantime, I think everything leading up to deferring text poke IPIs is
sane-ish and could get in. I'm not the biggest fan of adding an API with a
single user, but hey, I've been working on this for "a little while" now and
I'll still need to get the other IPIs sorted out.
TL;DR: Text patching IPI deferral LGTM so here it is for now, I'm still working
on the TLB flush thing.
Patches
=======
o Patches 1-2 are standalone objtool cleanups.
o Patches 3-4 add an RCU testing feature.
o Patches 5-6 add infrastructure for annotating static keys and static calls
that may be used in noinstr code (courtesy of Josh).
o Patches 7-20 use said annotations on relevant keys / calls.
o Patch 21 enforces proper usage of said annotations (courtesy of Josh).
o Patches 22-23 deal with detecting NOINSTR text in modules
o Patches 24-25 add the actual IPI deferral faff
Patches are also available at:
https://gitlab.com/vschneid/linux.git -b redhat/isolirq/defer/v5
Testing
=======
Xeon E5-2699 system with SMToff, NOHZ_FULL, isolated CPUs.
RHEL10 userspace.
Workload is using rteval (kernel compilation + hackbench) on housekeeping CPUs
and a dummy stay-in-userspace loop on the isolated CPUs. The main invocation is:
$ trace-cmd record -e "csd_queue_cpu" -f "cpu & CPUS{$ISOL_CPUS}" \
-e "ipi_send_cpumask" -f "cpumask & CPUS{$ISOL_CPUS}" \
-e "ipi_send_cpu" -f "cpu & CPUS{$ISOL_CPUS}" \
rteval --onlyload --loads-cpulist=$HK_CPUS \
--hackbench-runlowmem=True --duration=$DURATION
This only records IPIs sent to isolated CPUs, so any event there is interference
(with a bit of fuzz at the start/end of the workload when spawning the
processes). All tests were done with a duration of 3 hours.
v6.14
# This is the actual IPI count
$ trace-cmd report | grep callback | awk '{ print $(NF) }' | sort | uniq -c | sort -nr
93 callback=generic_smp_call_function_single_interrupt+0x0
22 callback=nohz_full_kick_func+0x0
# These are the different CSD's that caused IPIs
$ trace-cmd report | grep csd_queue | awk '{ print $(NF-1) }' | sort | uniq -c | sort -nr
1456 func=do_flush_tlb_all
78 func=do_sync_core
33 func=nohz_full_kick_func
26 func=do_kernel_range_flush
v6.14 + patches
# This is the actual IPI count
$ trace-cmd report | grep callback | awk '{ print $(NF) }' | sort | uniq -c | sort -nr
86 callback=generic_smp_call_function_single_interrupt+0x0
41 callback=nohz_full_kick_func+0x0
# These are the different CSD's that caused IPIs
$ trace-cmd report | grep csd_queue | awk '{ print $(NF-1) }' | sort | uniq -c | sort -nr
1378 func=do_flush_tlb_all
33 func=nohz_full_kick_func
So the TLB flush is still there driving most of the IPIs, but at least the
instruction patching IPIs are gone. With kernel TLB flushes deferred, there are
no IPIs sent to isolated CPUs in that 3hr window, but as stated above that still
needs some more work.
Also note that tlb_remove_table_smp_sync() showed up during testing of v3, and
has gone as mysteriously as it showed up. Yair had a series adressing this [6]
which per these results would be worth revisiting.
Acknowledgements
================
Special thanks to:
o Clark Williams for listening to my ramblings about this and throwing ideas my way
o Josh Poimboeuf for all his help with everything objtool-related
o All of the folks who attended various (too many?) talks about this and
provided precious feedback.
o The mm folks for pointing out what I can and can't do with TLB flushes
Links
=====
[1]: https://lore.kernel.org/all/20210929151723.162004989@infradead.org/
[2]: https://github.com/vianpl/linux.git -b ct-work-defer-wip
[3]: https://youtu.be/0vjE6fjoVVE
[4]: https://lpc.events/event/18/contributions/1889/
[5]: http://lore.kernel.org/r/eef09bdc-7546-462b-9ac0-661a44d2ceae@intel.com
[6]: https://lore.kernel.org/lkml/20230620144618.125703-1-ypodemsk@redhat.com/
Revisions
=========
v4 -> v5
++++++++
o Rebased onto v6.15-rc3
o Collected Reviewed-by
o Annotated a few more static keys
o Added proper checking of noinstr sections that are in loadable code such as
KVM early entry (Sean Christopherson)
o Switched to checking for CT_RCU_WATCHING instead of CT_STATE_KERNEL or
CT_STATE_IDLE, which means deferral is now behaving sanely for IRQ/NMI
entry from idle (thanks to Frederic!)
o Ditched the vmap TLB flush deferral (for now)
RFCv3 -> v4
+++++++++++
o Rebased onto v6.13-rc6
o New objtool patches from Josh
o More .noinstr static key/call patches
o Static calls now handled as well (again thanks to Josh)
o Fixed clearing the work bits on kernel exit
o Messed with IRQ hitting an idle CPU vs context tracking
o Various comment and naming cleanups
o Made RCU_DYNTICKS_TORTURE depend on !COMPILE_TEST (PeterZ)
o Fixed the CT_STATE_KERNEL check when setting a deferred work (Frederic)
o Cleaned up the __flush_tlb_all() mess thanks to PeterZ
RFCv2 -> RFCv3
++++++++++++++
o Rebased onto v6.12-rc6
o Added objtool documentation for the new warning (Josh)
o Added low-size RCU watching counter to TREE04 torture scenario (Paul)
o Added FORCEFUL jump label and static key types
o Added noinstr-compliant helpers for tlb flush deferral
RFCv1 -> RFCv2
++++++++++++++
o Rebased onto v6.5-rc1
o Updated the trace filter patches (Steven)
o Fixed __ro_after_init keys used in modules (Peter)
o Dropped the extra context_tracking atomic, squashed the new bits in the
existing .state field (Peter, Frederic)
o Added an RCU_EXPERT config for the RCU dynticks counter size, and added an
rcutorture case for a low-size counter (Paul)
o Fixed flush_tlb_kernel_range_deferrable() definition
Josh Poimboeuf (3):
jump_label: Add annotations for validating noinstr usage
static_call: Add read-only-after-init static calls
objtool: Add noinstr validation for static branches/calls
Valentin Schneider (22):
objtool: Make validate_call() recognize indirect calls to pv_ops[]
objtool: Flesh out warning related to pv_ops[] calls
rcu: Add a small-width RCU watching counter debug option
rcutorture: Make TREE04 use CONFIG_RCU_DYNTICKS_TORTURE
x86/paravirt: Mark pv_sched_clock static call as __ro_after_init
x86/idle: Mark x86_idle static call as __ro_after_init
x86/paravirt: Mark pv_steal_clock static call as __ro_after_init
riscv/paravirt: Mark pv_steal_clock static call as __ro_after_init
loongarch/paravirt: Mark pv_steal_clock static call as __ro_after_init
arm64/paravirt: Mark pv_steal_clock static call as __ro_after_init
arm/paravirt: Mark pv_steal_clock static call as __ro_after_init
perf/x86/amd: Mark perf_lopwr_cb static call as __ro_after_init
sched/clock: Mark sched_clock_running key as __ro_after_init
KVM: VMX: Mark __kvm_is_using_evmcs static key as __ro_after_init
x86/speculation/mds: Mark mds_idle_clear key as allowed in .noinstr
sched/clock, x86: Mark __sched_clock_stable key as allowed in .noinstr
KVM: VMX: Mark vmx_l1d_should flush and vmx_l1d_flush_cond keys as
allowed in .noinstr
stackleack: Mark stack_erasing_bypass key as allowed in .noinstr
module: Remove outdated comment about text_size
module: Add MOD_NOINSTR_TEXT mem_type
context-tracking: Introduce work deferral infrastructure
context_tracking,x86: Defer kernel text patching IPIs
arch/Kconfig | 9 ++
arch/arm/kernel/paravirt.c | 2 +-
arch/arm64/kernel/paravirt.c | 2 +-
arch/loongarch/kernel/paravirt.c | 2 +-
arch/riscv/kernel/paravirt.c | 2 +-
arch/x86/Kconfig | 1 +
arch/x86/events/amd/brs.c | 2 +-
arch/x86/include/asm/context_tracking_work.h | 18 +++
arch/x86/include/asm/text-patching.h | 1 +
arch/x86/kernel/alternative.c | 39 ++++++-
arch/x86/kernel/cpu/bugs.c | 2 +-
arch/x86/kernel/kprobes/core.c | 4 +-
arch/x86/kernel/kprobes/opt.c | 4 +-
arch/x86/kernel/module.c | 2 +-
arch/x86/kernel/paravirt.c | 4 +-
arch/x86/kernel/process.c | 2 +-
arch/x86/kvm/vmx/vmx.c | 11 +-
arch/x86/kvm/vmx/vmx_onhyperv.c | 2 +-
include/asm-generic/sections.h | 15 +++
include/linux/context_tracking.h | 21 ++++
include/linux/context_tracking_state.h | 54 +++++++--
include/linux/context_tracking_work.h | 26 +++++
include/linux/jump_label.h | 30 ++++-
include/linux/module.h | 6 +-
include/linux/objtool.h | 7 ++
include/linux/static_call.h | 19 ++++
kernel/context_tracking.c | 69 +++++++++++-
kernel/kprobes.c | 8 +-
kernel/module/main.c | 85 ++++++++++----
kernel/rcu/Kconfig.debug | 15 +++
kernel/sched/clock.c | 7 +-
kernel/stackleak.c | 6 +-
kernel/time/Kconfig | 5 +
tools/objtool/Documentation/objtool.txt | 34 ++++++
tools/objtool/check.c | 106 +++++++++++++++---
tools/objtool/include/objtool/check.h | 1 +
tools/objtool/include/objtool/elf.h | 1 +
tools/objtool/include/objtool/special.h | 1 +
tools/objtool/special.c | 15 ++-
.../selftests/rcutorture/configs/rcu/TREE04 | 1 +
40 files changed, 557 insertions(+), 84 deletions(-)
create mode 100644 arch/x86/include/asm/context_tracking_work.h
create mode 100644 include/linux/context_tracking_work.h
--
2.49.0
From: Chia-Yu Chang <chia-yu.chang(a)nokia-bell-labs.com>
Hello,
Please find DUALPI2 iproute2 patch v7.
v7 (05-May-25)
- Align pkt_sched.h with the v14 version of net-next due to spec modificiaotn in tc.yaml
- Reorganize dualpi2_print_opt() to match the order in tc.yaml
- Remove credit-queue in PRINT_JSON
v6 (26-Apr-25)
- Update JSON file output due to spec modifiocation in tc.yaml of net-next
v5 (25-Mar-25)
- Use matches() to replace current strcmp() (Stephen Hemminger <stephen(a)networkplumber.org>)
- Use general parse_percent() for handling scaled percentage values (Stephen Hemminger <stephen(a)networkplumber.org>)
- Add print function for JSON of dualpi2 stats (Stephen Hemminger <stephen(a)networkplumber.org>)
v4 (16-Mar-25)
- Add min_qlen_step to dualpi2 attribute as the minimum queue length in number of packets in the L-queue to start step amrking.
v3 (21-Feb-25)
- Add memlimit to dualpi2 attribute, and add memory_used, max_memory_used, memory_limit in dualpi2 stats (Dave Taht <dave.taht(a)gmail.com>)
- Update manual to align latest implementation and clarify the queue naming and default unit
- Use common "get_scaled_alpha_beta" and clean print_opt for Dualpi2
v2 (23-Oct-24)
- Rename get_float in dualpi2 to get_float_min_max in utils.c
- Move get_float from iplink_can.c in utils.c (Stephen Hemminger <stephen(a)networkplumber.org>)
- Add print function for JSON of dualpi2 (Stephen Hemminger <stephen(a)networkplumber.org>)
For more details of DualPI2, plesae refer IETF RFC9332
(https://datatracker.ietf.org/doc/html/rfc9332).
Best Regards,
Chia-Yu
Chia-Yu Chang (1):
tc: add dualpi2 scheduler module
bash-completion/tc | 11 +-
include/uapi/linux/pkt_sched.h | 67 +++++
include/utils.h | 2 +
ip/iplink_can.c | 14 -
lib/utils.c | 30 ++
man/man8/tc-dualpi2.8 | 249 ++++++++++++++++
tc/Makefile | 1 +
tc/q_dualpi2.c | 528 +++++++++++++++++++++++++++++++++
8 files changed, 887 insertions(+), 15 deletions(-)
create mode 100644 man/man8/tc-dualpi2.8
create mode 100644 tc/q_dualpi2.c
--
2.34.1
Fixes and cleanups for various issues in the vDSO selftests.
Signed-off-by: Thomas Weißschuh <thomas.weissschuh(a)linutronix.de>
---
Changes in v2:
- Refer to -Wstrict-prototypes over -Wold-style-prototypes
- Pick up Acks
- Enable fixed warnings in Makefile
- Link to v1: https://lore.kernel.org/r/20250502-selftests-vdso-fixes-v1-0-fb5d640a4f78@l…
---
Thomas Weißschuh (8):
selftests: vDSO: chacha: Correctly skip test if necessary
selftests: vDSO: clock_getres: Drop unused include of err.h
selftests: vDSO: vdso_test_getrandom: Drop unused include of linux/compiler.h
selftests: vDSO: vdso_test_getrandom: Drop some dead code
selftests: vDSO: vdso_config: Avoid -Wunused-variables
selftests: vDSO: enable -Wall
selftests: vDSO: vdso_test_correctness: Fix -Wstrict-prototypes
selftests: vDSO: vdso_test_getrandom: Always print TAP header
tools/testing/selftests/vDSO/Makefile | 2 +-
tools/testing/selftests/vDSO/vdso_config.h | 2 ++
tools/testing/selftests/vDSO/vdso_test_chacha.c | 3 ++-
tools/testing/selftests/vDSO/vdso_test_clock_getres.c | 1 -
tools/testing/selftests/vDSO/vdso_test_correctness.c | 2 +-
tools/testing/selftests/vDSO/vdso_test_getrandom.c | 18 +++++-------------
6 files changed, 11 insertions(+), 17 deletions(-)
---
base-commit: 0af2f6be1b4281385b618cb86ad946eded089ac8
change-id: 20250423-selftests-vdso-fixes-d2ce74142359
Best regards,
--
Thomas Weißschuh <thomas.weissschuh(a)linutronix.de>
Nolibc is useful for selftests as the test programs can be very small,
and compiled with just a kernel crosscompiler, without userspace support.
Currently nolibc is only usable with kselftest.h, not the more
convenient to use kselftest_harness.h
This series provides this compatibility by adding new features to nolibc
and removing the usage of problematic features from the harness.
The first half of the series are changes to the harness, the second one
are for nolibc. Both parts are very independent and should go through
different trees.
The last patch is not meant to be applied and serves as test that
everything works together correctly.
Signed-off-by: Thomas Weißschuh <thomas.weissschuh(a)linutronix.de>
---
Changes in v3:
- Send patches to correct kselftest harness maintainers
- Move harness selftest to dedicated directory
- Add harness selftest to MAINTAINERS
- Integrate harness selftest cleanup with the selftest framework
- Consistently use "kselftest harness" in commit messages
- Properly propagate kselftest harness failure
- Link to v2: https://lore.kernel.org/r/20250407-nolibc-kselftest-harness-v2-0-f8812f76e9…
Changes in v2:
- Rebase unto v6.15-rc1
- Rename internal nolibc symbols
- Handle edge case of waitpid(INT_MIN) == ESRCH
- Fix arm configurations for final testing patch
- Clean up global getopt.h variable declarations
- Add Acks from Willy
- Link to v1: https://lore.kernel.org/r/20250304-nolibc-kselftest-harness-v1-0-adca7cd231…
---
Thomas Weißschuh (32):
selftests: harness: Add kselftest harness selftest
selftests: harness: Use C89 comment style
selftests: harness: Ignore unused variant argument warning
selftests: harness: Mark functions without prototypes static
selftests: harness: Remove inline qualifier for wrappers
selftests: harness: Remove dependency on libatomic
selftests: harness: Implement test timeouts through pidfd
selftests: harness: Don't set setup_completed for fixtureless tests
selftests: harness: Always provide "self" and "variant"
selftests: harness: Move teardown conditional into test metadata
selftests: harness: Add teardown callback to test metadata
selftests: harness: Stop using setjmp()/longjmp()
selftests: harness: Guard includes on nolibc
tools/nolibc: handle intmax_t/uintmax_t in printf
tools/nolibc: use intmax definitions from compiler
tools/nolibc: use pselect6_time64 if available
tools/nolibc: use ppoll_time64 if available
tools/nolibc: add tolower() and toupper()
tools/nolibc: add _exit()
tools/nolibc: add setpgrp()
tools/nolibc: implement waitpid() in terms of waitid()
Revert "selftests/nolibc: use waitid() over waitpid()"
tools/nolibc: add dprintf() and vdprintf()
tools/nolibc: add getopt()
tools/nolibc: allow different write callbacks in printf
tools/nolibc: allow limiting of printf destination size
tools/nolibc: add snprintf() and friends
selftests/nolibc: use snprintf() for printf tests
selftests/nolibc: rename vfprintf test suite
selftests/nolibc: add test for snprintf() truncation
tools/nolibc: implement width padding in printf()
HACK: selftests/nolibc: demonstrate usage of the kselftest harness
MAINTAINERS | 1 +
tools/include/nolibc/Makefile | 1 +
tools/include/nolibc/getopt.h | 101 ++
tools/include/nolibc/nolibc.h | 1 +
tools/include/nolibc/stdint.h | 4 +-
tools/include/nolibc/stdio.h | 127 +-
tools/include/nolibc/string.h | 17 +
tools/include/nolibc/sys.h | 105 +-
tools/testing/selftests/Makefile | 1 +
tools/testing/selftests/kselftest_harness.h | 181 +-
.../testing/selftests/kselftest_harness/.gitignore | 2 +
tools/testing/selftests/kselftest_harness/Makefile | 7 +
.../selftests/kselftest_harness/harness-selftest.c | 129 ++
.../kselftest_harness/harness-selftest.expected | 62 +
.../kselftest_harness/harness-selftest.sh | 13 +
tools/testing/selftests/nolibc/Makefile | 13 +-
tools/testing/selftests/nolibc/harness-selftest.c | 1 +
tools/testing/selftests/nolibc/nolibc-test.c | 1729 +-------------------
tools/testing/selftests/nolibc/run-tests.sh | 2 +-
19 files changed, 637 insertions(+), 1860 deletions(-)
---
base-commit: 0af2f6be1b4281385b618cb86ad946eded089ac8
change-id: 20250130-nolibc-kselftest-harness-8b2c8cac43bf
Best regards,
--
Thomas Weißschuh <thomas.weissschuh(a)linutronix.de>
Hi Thadeu,
CC kunit
On Mon, 5 May 2025 at 14:13, Thadeu Lima de Souza Cascardo
<cascardo(a)igalia.com> wrote:
> On Mon, May 05, 2025 at 09:21:15AM +0200, Geert Uytterhoeven wrote:
> > On Wed, 30 Apr 2025 at 18:53, Thadeu Lima de Souza Cascardo
> > <cascardo(a)igalia.com> wrote:
> > > Since it uses __init symbols, it cannot be a module. Builds with
> > > CONFIG_TEST_MISC_MINOR=m will fail with:
> > >
> > > ERROR: modpost: "init_mknod" [drivers/misc/misc_minor_kunit.ko] undefined!
> > > ERROR: modpost: "init_unlink" [drivers/misc/misc_minor_kunit.ko] undefined!
> > >
> > > Reported-by: Stephen Rothwell <sfr(a)canb.auug.org.au>
> > > Closes: https://lore.kernel.org/linux-next/20250429155404.2b6fe5b1@canb.auug.org.au/
> > > Reported-by: kernel test robot <lkp(a)intel.com>
> > > Closes: https://lore.kernel.org/oe-kbuild-all/202504160338.BjUL3Owb-lkp@intel.com/
> > > Fixes: 45f0de4f8dc3 ("char: misc: add test cases")
> > > Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo(a)igalia.com>
> >
> > Thanks for your patch, which is now commit 20acf4dd46e4c090 ("char:
> > misc: make miscdevice unit test built-in only") in char-misc-next.
> >
> > > --- a/lib/Kconfig.debug
> > > +++ b/lib/Kconfig.debug
> > > @@ -2512,7 +2512,7 @@ config TEST_IDA
> > > tristate "Perform selftest on IDA functions"
> > >
> > > config TEST_MISC_MINOR
> > > - tristate "miscdevice KUnit test" if !KUNIT_ALL_TESTS
> > > + bool "miscdevice KUnit test" if !KUNIT_ALL_TESTS
> > > depends on KUNIT
> > > default KUNIT_ALL_TESTS
> >
> > This means "default y" if KUNIT_ALL_TESTS=m, which is IMHO not
> > what we want.
>
> The precedent for other kunit config options that are bool is that they use
> "default KUNIT_ALL_TESTS".
Seems like you are right. Looks like none of the boolean ones can
be enabled on m68k, which is where I run most of the tests, so I never
noticed before :-(
> It makes sense that if you choose to build all tests, you would not skip
> the ones that cannot be built as a module.
You can still enable the test manually if you want.
But I think it should not be enabled by default when all other tests
that can be modular are built as modules. Unlike for modular tests,
enabling builtin tests by default does impact the base kernel.
> > Perhaps
> >
> > default KUNIT_ALL_TESTS=y
> >
> > ?
Gr{oetje,eeting}s,
Geert
--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert(a)linux-m68k.org
In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
-- Linus Torvalds
Cong reported a warning when running ./test_sockmp:
https://lore.kernel.org/bpf/aAmIi0vlycHtbXeb@pop-os.localdomain/T/#t
------------[ cut here ]------------
WARNING: CPU: 1 PID: 40 at net/ipv4/af_inet.c inet_sock_destruct+0x173/0x1d5
Tainted: [W]=WARN
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.15.0-1 04/01/2014
Workqueue: events sk_psock_destroy
RIP: 0010:inet_sock_destruct+0x173/0x1d5
RSP: 0018:ffff8880085cfc18 EFLAGS: 00010202
RAX: 1ffff11003dbfc00 RBX: ffff88801edfe3e8 RCX: ffffffff822f5af4
RDX: 0000000000000007 RSI: dffffc0000000000 RDI: ffff88801edfe16c
RBP: ffff88801edfe184 R08: ffffed1003dbfc31 R09: 0000000000000000
R10: ffffffff822f5ab7 R11: ffff88801edfe187 R12: ffff88801edfdec0
R13: ffff888020376ac0 R14: ffff888020376ac0 R15: ffff888020376a60
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000556365155830 CR3: 000000001d6aa000 CR4: 0000000000350ef0
Call Trace:
<TASK>
__sk_destruct+0x46/0x222
sk_psock_destroy+0x22f/0x242
process_one_work+0x504/0x8a8
? process_one_work+0x39d/0x8a8
? __pfx_process_one_work+0x10/0x10
? worker_thread+0x44/0x2ae
? __list_add_valid_or_report+0x83/0xea
? srso_return_thunk+0x5/0x5f
? __list_add+0x45/0x52
process_scheduled_works+0x73/0x82
worker_thread+0x1ce/0x2ae
When we specify apply_bytes, we divide the msg into multiple segments,
each with a length of 'send', and every time we send this part of the data
using tcp_bpf_sendmsg_redir(), we use sk_msg_return_zero() to uncharge the
memory of the specified 'send' size.
However, if the first segment of data fails to send, for example, the
peer's buffer is full, we need to release all of the msg. When releasing
the msg, we haven't uncharged the memory of the subsequent segments.
This modification does not make significant logical changes, but only
fills in the missing uncharge places.
This issue has existed all along, until it was exposed after we added the
apply test in test_sockmap:
commit 3448ad23b34e ("selftests/bpf: Add apply_bytes test to test_txmsg_redir_wait_sndmem in test_sockmap")
Jiayuan Chen (2):
ktls, sockmap: Fix missing uncharge operation
selftests/bpf: Add test to cover sockmap with ktls
net/tls/tls_sw.c | 7 ++
.../selftests/bpf/prog_tests/sockmap_ktls.c | 76 +++++++++++++++++++
.../selftests/bpf/progs/test_sockmap_ktls.c | 10 +++
3 files changed, 93 insertions(+)
--
2.47.1
ksft runner sends 2 SIGTERMs in a row if a test runs out of time.
Handle this in a similar way we handle SIGINT - cleanup and stop
running further tests.
Because we get 2 signals we need a bit of logic to ignore
the subsequent one, they come immediately one after the other
(due to commit 9616cb34b08e ("kselftest/runner.sh: Propagate SIGTERM
to runner child")).
This change makes sure we run cleanup (scheduled defer()s)
and also print a stack trace on SIGTERM, which doesn't happen
by default. Tests occasionally hang in NIPA and it's impossible
to tell what they are waiting from or doing.
Signed-off-by: Jakub Kicinski <kuba(a)kernel.org>
---
v2:
- remove declaration at the global scope
v1: https://lore.kernel.org/20250425151757.1652517-1-kuba@kernel.org
CC: petrm(a)nvidia.com
CC: willemb(a)google.com
CC: sdf(a)fomichev.me
CC: linux-kselftest(a)vger.kernel.org
---
tools/testing/selftests/net/lib/py/ksft.py | 25 +++++++++++++++++++++-
1 file changed, 24 insertions(+), 1 deletion(-)
diff --git a/tools/testing/selftests/net/lib/py/ksft.py b/tools/testing/selftests/net/lib/py/ksft.py
index 3cfad0fd4570..1b815768bf8a 100644
--- a/tools/testing/selftests/net/lib/py/ksft.py
+++ b/tools/testing/selftests/net/lib/py/ksft.py
@@ -3,6 +3,7 @@
import builtins
import functools
import inspect
+import signal
import sys
import time
import traceback
@@ -26,6 +27,10 @@ KSFT_DISRUPTIVE = True
pass
+class KsftTerminate(KeyboardInterrupt):
+ pass
+
+
def ksft_pr(*objs, **kwargs):
print("#", *objs, **kwargs)
@@ -193,6 +198,17 @@ KSFT_DISRUPTIVE = True
return env
+def _ksft_intr(signum, frame):
+ # ksft runner.sh sends 2 SIGTERMs in a row on a timeout
+ # if we don't ignore the second one it will stop us from handling cleanup
+ global term_cnt
+ term_cnt += 1
+ if term_cnt == 1:
+ raise KsftTerminate()
+ else:
+ ksft_pr(f"Ignoring SIGTERM (cnt: {term_cnt}), already exiting...")
+
+
def ksft_run(cases=None, globs=None, case_pfx=None, args=()):
cases = cases or []
@@ -205,6 +221,10 @@ KSFT_DISRUPTIVE = True
cases.append(value)
break
+ global term_cnt
+ term_cnt = 0
+ prev_sigterm = signal.signal(signal.SIGTERM, _ksft_intr)
+
totals = {"pass": 0, "fail": 0, "skip": 0, "xfail": 0}
print("TAP version 13")
@@ -229,11 +249,12 @@ KSFT_DISRUPTIVE = True
cnt_key = 'xfail'
except BaseException as e:
stop |= isinstance(e, KeyboardInterrupt)
+ stop |= isinstance(e, KsftTerminate)
tb = traceback.format_exc()
for line in tb.strip().split('\n'):
ksft_pr("Exception|", line)
if stop:
- ksft_pr("Stopping tests due to KeyboardInterrupt.")
+ ksft_pr(f"Stopping tests due to {type(e).__name__}.")
KSFT_RESULT = False
cnt_key = 'fail'
@@ -248,6 +269,8 @@ KSFT_DISRUPTIVE = True
if stop:
break
+ signal.signal(signal.SIGTERM, prev_sigterm)
+
print(
f"# Totals: pass:{totals['pass']} fail:{totals['fail']} xfail:{totals['xfail']} xpass:0 skip:{totals['skip']} error:0"
)
--
2.49.0
v8:
- Ignore the low event count of child 2 with memory_recursiveprot on
in patch 1 as originally suggested by Michal.
v7:
- Skip the vmscan change as the mem_cgroup_usage() check for now as
it is currently redundant.
v6:
- The memcg_test_low failure is indeed due to the memory_recursiveprot
mount option which is enabled by default in systemd cgroup v2 setting.
So adopt Michal's suggestion to adjust the low event checking
according to whether memory_recursiveprot is enabled or not.
The test_memcontrol selftest consistently fails its test_memcg_low
sub-test (with memory_recursiveprot enabled) and sporadically fails
its test_memcg_min sub-test. This patchset fixes the test_memcg_min
and test_memcg_low failures by adjusting the test_memcontrol selftest
to fix these test failures.
Waiman Long (2):
selftests: memcg: Allow low event with no memory.low and
memory_recursiveprot on
selftests: memcg: Increase error tolerance of child memory.current
check in test_memcg_protection()
.../selftests/cgroup/test_memcontrol.c | 22 ++++++++++++++-----
1 file changed, 16 insertions(+), 6 deletions(-)
--
2.49.0
Fixes and cleanups for various issues in the vDSO selftests.
Signed-off-by: Thomas Weißschuh <thomas.weissschuh(a)linutronix.de>
---
Thomas Weißschuh (7):
selftests: vDSO: chacha: Correctly skip test if necessary
selftests: vDSO: clock_getres: Drop unused include of err.h
selftests: vDSO: vdso_test_correctness: Fix -Wold-style-definitions
selftests: vDSO: vdso_test_getrandom: Drop unused include of linux/compiler.h
selftests: vDSO: vdso_test_getrandom: Drop some dead code
selftests: vDSO: vdso_test_getrandom: Always print TAP header
selftests: vDSO: vdso_config: Avoid -Wunused-variables
tools/testing/selftests/vDSO/vdso_config.h | 2 ++
tools/testing/selftests/vDSO/vdso_test_chacha.c | 3 ++-
tools/testing/selftests/vDSO/vdso_test_clock_getres.c | 1 -
tools/testing/selftests/vDSO/vdso_test_correctness.c | 2 +-
tools/testing/selftests/vDSO/vdso_test_getrandom.c | 18 +++++-------------
5 files changed, 10 insertions(+), 16 deletions(-)
---
base-commit: 0af2f6be1b4281385b618cb86ad946eded089ac8
change-id: 20250423-selftests-vdso-fixes-d2ce74142359
Best regards,
--
Thomas Weißschuh <thomas.weissschuh(a)linutronix.de>
I'd like to cut down the memory usage of parsing vmlinux BTF in ebpf-go.
With some upcoming changes the library is sitting at 5MiB for a parse.
Most of that memory is simply copying the BTF blob into user space.
By allowing vmlinux BTF to be mmapped read-only into user space I can
cut memory usage by about 75%.
Signed-off-by: Lorenz Bauer <lmb(a)isovalent.com>
---
Lorenz Bauer (2):
btf: allow mmap of vmlinux btf
selftests: bpf: add a test for mmapable vmlinux BTF
include/asm-generic/vmlinux.lds.h | 3 +-
kernel/bpf/sysfs_btf.c | 25 ++++++-
tools/testing/selftests/bpf/prog_tests/btf_sysfs.c | 79 ++++++++++++++++++++++
3 files changed, 104 insertions(+), 3 deletions(-)
---
base-commit: 38d976c32d85ef12dcd2b8a231196f7049548477
change-id: 20250501-vmlinux-mmap-2ec5563c3ef1
Best regards,
--
Lorenz Bauer <lmb(a)isovalent.com>
Ensure the following prerequisites before executing the test:
1. 'socat' is installed on the remote host.
2. Python version supports socket.SO_INCOMING_CPU (available since v3.11).
Skip the test if either prerequisite is not met.
Reviewed-by: Nimrod Oren <noren(a)nvidia.com>
Signed-off-by: Gal Pressman <gal(a)nvidia.com>
---
Changelog -
v1->v2: https://lore.kernel.org/netdev/20250317123149.364565-1-gal@nvidia.com/
* Use require_cmd() helper (Jakub).
---
tools/testing/selftests/drivers/net/hw/rss_input_xfrm.py | 5 +++++
1 file changed, 5 insertions(+)
diff --git a/tools/testing/selftests/drivers/net/hw/rss_input_xfrm.py b/tools/testing/selftests/drivers/net/hw/rss_input_xfrm.py
index 53bb08cc29ec..f439c434ba36 100755
--- a/tools/testing/selftests/drivers/net/hw/rss_input_xfrm.py
+++ b/tools/testing/selftests/drivers/net/hw/rss_input_xfrm.py
@@ -32,6 +32,11 @@ def test_rss_input_xfrm(cfg, ipver):
if multiprocessing.cpu_count() < 2:
raise KsftSkipEx("Need at least two CPUs to test symmetric RSS hash")
+ cfg.require_cmd("socat", remote=True)
+
+ if not hasattr(socket, "SO_INCOMING_CPU"):
+ raise KsftSkipEx("socket.SO_INCOMING_CPU was added in Python 3.11")
+
input_xfrm = cfg.ethnl.rss_get(
{'header': {'dev-name': cfg.ifname}}).get('input_xfrm')
--
2.40.1
This series is a follow-up to [1], which adds mTHP support to khugepaged.
mTHP khugepaged support is a "loose" dependency for the sysfs/sysctl
configs to make sense. Without it global="defer" and mTHP="inherit" case
is "undefined" behavior.
We've seen cases were customers switching from RHEL7 to RHEL8 see a
significant increase in the memory footprint for the same workloads.
Through our investigations we found that a large contributing factor to
the increase in RSS was an increase in THP usage.
For workloads like MySQL, or when using allocators like jemalloc, it is
often recommended to set /transparent_hugepages/enabled=never. This is
in part due to performance degradations and increased memory waste.
This series introduces enabled=defer, this setting acts as a middle
ground between always and madvise. If the mapping is MADV_HUGEPAGE, the
page fault handler will act normally, making a hugepage if possible. If
the allocation is not MADV_HUGEPAGE, then the page fault handler will
default to the base size allocation. The caveat is that khugepaged can
still operate on pages that are not MADV_HUGEPAGE.
This allows for three things... one, applications specifically designed to
use hugepages will get them, and two, applications that don't use
hugepages can still benefit from them without aggressively inserting
THPs at every possible chance. This curbs the memory waste, and defers
the use of hugepages to khugepaged. Khugepaged can then scan the memory
for eligible collapsing. Lastly there is the added benefit for those who
want THPs but experience higher latency PFs. Now you can get base page
performance at the PF handler and Hugepage performance for those mappings
after they collapse.
Admins may want to lower max_ptes_none, if not, khugepaged may
aggressively collapse single allocations into hugepages.
TESTING:
- Built for x86_64, aarch64, ppc64le, and s390x
- selftests mm
- In [1] I provided a script [2] that has multiple access patterns
- lots of general use.
- redis testing. This test was my original case for the defer mode. What I
was able to prove was that THP=always leads to increased max_latency
cases; hence why it is recommended to disable THPs for redis servers.
However with 'defer' we dont have the max_latency spikes and can still
get the system to utilize THPs. I further tested this with the mTHP
defer setting and found that redis (and probably other jmalloc users)
can utilize THPs via defer (+mTHP defer) without a large latency
penalty and some potential gains. I uploaded some mmtest results
here[3] which compares:
stock+thp=never
stock+(m)thp=always
khugepaged-mthp + defer (max_ptes_none=64)
The results show that (m)THPs can cause some throughput regression in
some cases, but also has gains in other cases. The mTHP+defer results
have more gains and less losses over the (m)THP=always case.
V5 Changes:
- rebased dependent series
- added reviewed-by tag on 2/4
V4 Changes:
- Minor Documentation fixes
- rebased the dependent series [1] onto mm-unstable
commit 0e68b850b1d3 ("vmalloc: use atomic_long_add_return_relaxed()")
V3 Changes:
- Combined the documentation commits into one, and moved a section to the
khugepaged mthp patchset
V2 Changes:
- base changes on mTHP khugepaged support
- Fix selftests parsing issue
- add mTHP defer option
- add mTHP defer Documentation
[1] - https://lore.kernel.org/lkml/20250428181218.85925-1-npache@redhat.com/
[2] - https://gitlab.com/npache/khugepaged_mthp_test
[3] - https://people.redhat.com/npache/mthp_khugepaged_defer/testoutput2/output.h…
Nico Pache (4):
mm: defer THP insertion to khugepaged
mm: document (m)THP defer usage
khugepaged: add defer option to mTHP options
selftests: mm: add defer to thp setting parser
Documentation/admin-guide/mm/transhuge.rst | 31 +++++++---
include/linux/huge_mm.h | 18 +++++-
mm/huge_memory.c | 69 +++++++++++++++++++---
mm/khugepaged.c | 8 +--
tools/testing/selftests/mm/thp_settings.c | 1 +
tools/testing/selftests/mm/thp_settings.h | 1 +
6 files changed, 106 insertions(+), 22 deletions(-)
--
2.48.1
In the current implementation if the program is dev-bound to a specific
device, it will not be possible to perform XDP_REDIRECT into a DEVMAP or
CPUMAP even if the program is running in the driver NAPI context.
Fix the issue introducing __bpf_prog_map_compatible utility routine in
order to avoid bpf_prog_is_dev_bound() during the XDP program load.
Continue forbidding to attach a dev-bound program to XDP maps.
---
Changes in v3:
- move seltest changes in a dedicated patch
- Link to v2: https://lore.kernel.org/r/20250423-xdp-prog-bound-fix-v2-1-51742a5dfbce@ker…
Changes in v2:
- Introduce __bpf_prog_map_compatible() utility routine in order to skip
bpf_prog_is_dev_bound check in bpf_check_tail_call()
- Extend xdp_metadata selftest
- Link to v1: https://lore.kernel.org/r/20250422-xdp-prog-bound-fix-v1-1-0b581fa186fe@ker…
---
Lorenzo Bianconi (2):
bpf: Allow XDP dev-bound programs to perform XDP_REDIRECT into maps
selftests/bpf: xdp_metadata: check XDP_REDIRCT support for dev-bound progs
kernel/bpf/core.c | 27 +++++++++++++---------
.../selftests/bpf/prog_tests/xdp_metadata.c | 22 +++++++++++++++++-
tools/testing/selftests/bpf/progs/xdp_metadata.c | 13 +++++++++++
3 files changed, 50 insertions(+), 12 deletions(-)
---
base-commit: 91dbac4076537b464639953c055c460d2bdfc7ea
change-id: 20250422-xdp-prog-bound-fix-9f30f3e134aa
Best regards,
--
Lorenzo Bianconi <lorenzo(a)kernel.org>
This series fixes misaligned access handling when in non interruptible
context by reenabling interrupts when possible. A previous commit
changed raw_copy_from_user() with copy_from_user() which enables page
faulting and thus can sleep. While correct, a warning is now triggered
due to being called in an invalid context (sleeping in
non-interruptible). This series fixes that problem by factorizing
misaligned load/store entry in a single function than reenables
interrupt if the interrupted context had interrupts enabled.
In order for misaligned handling problems to be caught sooner, add a
kselftest for all the currently supported instructions .
Note: these commits were actually part of another larger series for
misaligned request delegation but was split since it isn't directly
required.
Clément Léger (5):
riscv: misaligned: factorize trap handling
riscv: misaligned: enable IRQs while handling misaligned accesses
riscv: misaligned: use get_user() instead of __get_user()
Documentation/sysctl: add riscv to unaligned-trap supported archs
selftests: riscv: add misaligned access testing
Documentation/admin-guide/sysctl/kernel.rst | 4 +-
arch/riscv/kernel/traps.c | 57 ++--
arch/riscv/kernel/traps_misaligned.c | 2 +-
.../selftests/riscv/misaligned/.gitignore | 1 +
.../selftests/riscv/misaligned/Makefile | 12 +
.../selftests/riscv/misaligned/common.S | 33 +++
.../testing/selftests/riscv/misaligned/fpu.S | 180 +++++++++++++
tools/testing/selftests/riscv/misaligned/gp.S | 103 +++++++
.../selftests/riscv/misaligned/misaligned.c | 254 ++++++++++++++++++
9 files changed, 614 insertions(+), 32 deletions(-)
create mode 100644 tools/testing/selftests/riscv/misaligned/.gitignore
create mode 100644 tools/testing/selftests/riscv/misaligned/Makefile
create mode 100644 tools/testing/selftests/riscv/misaligned/common.S
create mode 100644 tools/testing/selftests/riscv/misaligned/fpu.S
create mode 100644 tools/testing/selftests/riscv/misaligned/gp.S
create mode 100644 tools/testing/selftests/riscv/misaligned/misaligned.c
--
2.49.0
For some services we are using "established-over-unconnected" model.
'''
// create unconnected socket and 'listen()'
srv_fd = socket(AF_INET, SOCK_DGRAM)
setsockopt(srv_fd, SO_REUSEPORT)
bind(srv_fd, SERVER_ADDR, SERVER_PORT)
// 'accept()'
data, client_addr = recvmsg(srv_fd)
// create a connected socket for this request
cli_fd = socket(AF_INET, SOCK_DGRAM)
setsockopt(cli_fd, SO_REUSEPORT)
bind(cli_fd, SERVER_ADDR, SERVER_PORT)
connect(cli, client_addr)
...
// do handshake with cli_fd
'''
This programming pattern simulates accept() using UDP, creating a new
socket for each client request. The server can then use separate sockets
to handle client requests, avoiding the need to use a single UDP socket
for I/O transmission.
But there is a race condition between the bind() and connect() of the
connected socket:
We might receive unexpected packets belonging to the unconnected socket
before connect() is executed, which is not what we need.
(Of course, before connect(), the unconnected socket will also receive
packets from the connected socket, which is easily resolved because
upper-layer protocols typically require explicit boundaries, and we
receive a complete packet before creating a connected socket.)
Before this patch, the connected socket had to filter requests at recvmsg
time, acting as a dispatcher to some extent. With this patch, we can
consider the bind and connect operations to be atomic.
Signed-off-by: Jiayuan Chen <jiayuan.chen(a)linux.dev>
---
include/linux/udp.h | 1 +
include/uapi/linux/udp.h | 1 +
net/ipv4/udp.c | 13 ++++++++++---
net/ipv6/udp.c | 5 +++--
4 files changed, 15 insertions(+), 5 deletions(-)
diff --git a/include/linux/udp.h b/include/linux/udp.h
index 895240177f4f..8d281a0c0d9d 100644
--- a/include/linux/udp.h
+++ b/include/linux/udp.h
@@ -42,6 +42,7 @@ enum {
UDP_FLAGS_ENCAP_ENABLED, /* This socket enabled encap */
UDP_FLAGS_UDPLITE_SEND_CC, /* set via udplite setsockopt */
UDP_FLAGS_UDPLITE_RECV_CC, /* set via udplite setsockopt */
+ UDP_FLAGS_STOP_RCV, /* Stop receiving packets */
};
struct udp_sock {
diff --git a/include/uapi/linux/udp.h b/include/uapi/linux/udp.h
index edca3e430305..bb8e0a749a55 100644
--- a/include/uapi/linux/udp.h
+++ b/include/uapi/linux/udp.h
@@ -34,6 +34,7 @@ struct udphdr {
#define UDP_NO_CHECK6_RX 102 /* Disable accepting checksum for UDP6 */
#define UDP_SEGMENT 103 /* Set GSO segmentation size */
#define UDP_GRO 104 /* This socket can receive UDP GRO packets */
+#define UDP_STOP_RCV 105 /* This socket will not receive any packets */
/* UDP encapsulation types */
#define UDP_ENCAP_ESPINUDP_NON_IKE 1 /* unused draft-ietf-ipsec-nat-t-ike-00/01 */
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index f9f5b92cf4b6..764d337ab1b3 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -376,7 +376,8 @@ static int compute_score(struct sock *sk, const struct net *net,
if (!net_eq(sock_net(sk), net) ||
udp_sk(sk)->udp_port_hash != hnum ||
- ipv6_only_sock(sk))
+ ipv6_only_sock(sk) ||
+ udp_test_bit(STOP_RCV, sk))
return -1;
if (sk->sk_rcv_saddr != daddr)
@@ -494,7 +495,7 @@ static struct sock *udp4_lib_lookup2(const struct net *net,
result = inet_lookup_reuseport(net, sk, skb, sizeof(struct udphdr),
saddr, sport, daddr, hnum, udp_ehashfn);
- if (!result) {
+ if (!result || udp_test_bit(STOP_RCV, result)) {
result = sk;
continue;
}
@@ -3031,7 +3032,9 @@ int udp_lib_setsockopt(struct sock *sk, int level, int optname,
set_xfrm_gro_udp_encap_rcv(up->encap_type, sk->sk_family, sk);
sockopt_release_sock(sk);
break;
-
+ case UDP_STOP_RCV:
+ udp_assign_bit(STOP_RCV, sk, valbool);
+ break;
/*
* UDP-Lite's partial checksum coverage (RFC 3828).
*/
@@ -3120,6 +3123,10 @@ int udp_lib_getsockopt(struct sock *sk, int level, int optname,
val = udp_test_bit(GRO_ENABLED, sk);
break;
+ case UDP_STOP_RCV:
+ val = udp_test_bit(STOP_RCV, sk);
+ break;
+
/* The following two cannot be changed on UDP sockets, the return is
* always 0 (which corresponds to the full checksum coverage of UDP). */
case UDPLITE_SEND_CSCOV:
diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
index 7317f8e053f1..55896a78e94b 100644
--- a/net/ipv6/udp.c
+++ b/net/ipv6/udp.c
@@ -137,7 +137,8 @@ static int compute_score(struct sock *sk, const struct net *net,
if (!net_eq(sock_net(sk), net) ||
udp_sk(sk)->udp_port_hash != hnum ||
- sk->sk_family != PF_INET6)
+ sk->sk_family != PF_INET6 ||
+ udp_test_bit(STOP_RCV, sk))
return -1;
if (!ipv6_addr_equal(&sk->sk_v6_rcv_saddr, daddr))
@@ -245,7 +246,7 @@ static struct sock *udp6_lib_lookup2(const struct net *net,
result = inet6_lookup_reuseport(net, sk, skb, sizeof(struct udphdr),
saddr, sport, daddr, hnum, udp6_ehashfn);
- if (!result) {
+ if (!result || udp_test_bit(STOP_RCV, result)) {
result = sk;
continue;
}
--
2.47.1
This series improves the following tests.
1. Get-reg-list : Adds vector support
2. SBI PMU test : Distinguish between different types of illegal exception
The first patch is just helper patch that adds stval support during
exception handling.
Signed-off-by: Atish Patra <atishp(a)rivosinc.com>
---
Changes in v3:
- Dropped the redundant macros and rv32 specific csr details.
- Changed to vcpu_get_reg from __vcpu_get_reg based on suggestion from Drew.
- Added RB tags from Drew.
- Link to v2: https://lore.kernel.org/r/20250429-kvm_selftest_improve-v2-0-51713f91e04a@r…
Changes in v2:
- Rebased on top of Linux 6.15-rc4
- Changed from ex_regs to pt_regs based on Drew's suggestion.
- Dropped Anup's review on PATCH1 as it is significantly changed from last review.
- Moved the instruction decoding macros to a common header file.
- Improved the vector reg list test as per the feedback.
- Link to v1: https://lore.kernel.org/r/20250324-kvm_selftest_improve-v1-0-583620219d4f@r…
---
Atish Patra (3):
KVM: riscv: selftests: Align the trap information wiht pt_regs
KVM: riscv: selftests: Decode stval to identify exact exception type
KVM: riscv: selftests: Add vector extension tests
.../selftests/kvm/include/riscv/processor.h | 23 +++-
tools/testing/selftests/kvm/lib/riscv/handlers.S | 139 +++++++++++----------
tools/testing/selftests/kvm/lib/riscv/processor.c | 2 +-
tools/testing/selftests/kvm/riscv/arch_timer.c | 2 +-
tools/testing/selftests/kvm/riscv/ebreak_test.c | 2 +-
tools/testing/selftests/kvm/riscv/get-reg-list.c | 132 +++++++++++++++++++
tools/testing/selftests/kvm/riscv/sbi_pmu_test.c | 24 +++-
7 files changed, 247 insertions(+), 77 deletions(-)
---
base-commit: f15d97df5afae16f40ecef942031235d1c6ba14f
change-id: 20250324-kvm_selftest_improve-9bedb9f0a6d3
--
Regards,
Atish patra
kunit kernel build could fail if there are ny build artifacts from a
prior kernel build. These can be hard to debug if the build artifact
happens to be generated header file. It took me a while to debug kunit
build fail on ARCH=x86_64 in a tree which had a generated header file
arch/x86/realmode/rm/pasyms.h
make ARCH=um mrproper will not clean the tree. It is necessary to run
make ARCH=x86_64 mrproper
Example work-flow that could lead to this:
make allmodconfig (x86_64)
make
./tools/testing/kunit/kunit.py run
Add this to the documentation and kunit.py build help message.
Shuah Khan (2):
doc: kunit: add information about cleaning source trees
kunit: add tips to clean source tree to build help message
Documentation/dev-tools/kunit/start.rst | 12 ++++++++++++
tools/testing/kunit/kunit.py | 2 +-
2 files changed, 13 insertions(+), 1 deletion(-)
--
2.47.2
The inconsistencies in the systcall ABI between arm and arm-compat can
can cause a failure in the syscall_restart test due to the logic
attempting to work around the differences. The 'machine' field for an
ARM64 device running in compat mode can report 'armv8l' or 'armv8b'
which matches with the string 'arm' when only examining the first three
characters of the string.
This change adds additional validation to the workaround logic to make
sure we only take the arm path when running natively, not in arm-compat.
Fixes: 256d0afb11d6 ("selftests/seccomp: build and pass on arm64")
Signed-off-by: Neill Kapron <nkapron(a)google.com>
---
tools/testing/selftests/seccomp/seccomp_bpf.c | 7 +++++--
1 file changed, 5 insertions(+), 2 deletions(-)
diff --git a/tools/testing/selftests/seccomp/seccomp_bpf.c b/tools/testing/selftests/seccomp/seccomp_bpf.c
index b2f76a52215a..53bf6a9c801f 100644
--- a/tools/testing/selftests/seccomp/seccomp_bpf.c
+++ b/tools/testing/selftests/seccomp/seccomp_bpf.c
@@ -3166,12 +3166,15 @@ TEST(syscall_restart)
ret = get_syscall(_metadata, child_pid);
#if defined(__arm__)
/*
- * FIXME:
* - native ARM registers do NOT expose true syscall.
* - compat ARM registers on ARM64 DO expose true syscall.
+ * - values of utsbuf.machine include 'armv8l' or 'armb8b'
+ * for ARM64 running in compat mode.
*/
ASSERT_EQ(0, uname(&utsbuf));
- if (strncmp(utsbuf.machine, "arm", 3) == 0) {
+ if ((strncmp(utsbuf.machine, "arm", 3) == 0) &&
+ (strncmp(utsbuf.machine, "armv8l", 6) != 0) &&
+ (strncmp(utsbuf.machine, "armv8b", 6) != 0)) {
EXPECT_EQ(__NR_nanosleep, ret);
} else
#endif
--
2.49.0.850.g28803427d3-goog
Commit 0b631ed3ce92 ("kselftest: cpufreq: Add RTC wakeup alarm") added
support for automatic wakeup in the suspend routine of the cpufreq
kselftest by using rtcwake, however it left the manual power state
change in the common path. The end result is that when running the
cpufreq kselftest with '-t suspend_rtc' or '-t hibernate_rtc', the
system will go to sleep and be woken up by the RTC, but then immediately
go to sleep again with no wakeup programmed, so it will sleep forever in
an automated testing setup.
Fix this by moving the manual power state change so that it only happens
when not using rtcwake.
Fixes: 0b631ed3ce92 ("kselftest: cpufreq: Add RTC wakeup alarm")
Signed-off-by: Nícolas F. R. A. Prado <nfraprado(a)collabora.com>
---
tools/testing/selftests/cpufreq/cpufreq.sh | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/tools/testing/selftests/cpufreq/cpufreq.sh b/tools/testing/selftests/cpufreq/cpufreq.sh
index e350c521b4675080943d2c74dc31533979410316..3aad9db921b5339585da5711a08775ebd965aaf3 100755
--- a/tools/testing/selftests/cpufreq/cpufreq.sh
+++ b/tools/testing/selftests/cpufreq/cpufreq.sh
@@ -244,9 +244,10 @@ do_suspend()
printf "Failed to suspend using RTC wake alarm\n"
return 1
fi
+ else
+ echo $filename > $SYSFS/power/state
fi
- echo $filename > $SYSFS/power/state
printf "Came out of $1\n"
printf "Do basic tests after finishing $1 to verify cpufreq state\n\n"
---
base-commit: 8a2d53ce3c5f82683ad3df9a9a55822816fe64e7
change-id: 20250429-ksft-cpufreq-suspend-rtc-double-fix-75c560e1a30f
Best regards,
--
Nícolas F. R. A. Prado <nfraprado(a)collabora.com>
Misbehaving guests can cause bus locks to degrade the performance of
a system. Non-WB (write-back) and misaligned locked RMW (read-modify-write)
instructions are referred to as "bus locks" and require system wide
synchronization among all processors to guarantee the atomicity. The bus
locks can impose notable performance penalties for all processors within
the system.
Support for the Bus Lock Threshold is indicated by CPUID
Fn8000_000A_EDX[29] BusLockThreshold=1, the VMCB provides a Bus Lock
Threshold enable bit and an unsigned 16-bit Bus Lock Threshold count.
VMCB intercept bit
VMCB Offset Bits Function
14h 5 Intercept bus lock operations
Bus lock threshold count
VMCB Offset Bits Function
120h 15:0 Bus lock counter
During VMRUN, the bus lock threshold count is fetched and stored in an
internal count register. Prior to executing a bus lock within the guest,
the processor verifies the count in the bus lock register. If the count is
greater than zero, the processor executes the bus lock, reducing the count.
However, if the count is zero, the bus lock operation is not performed, and
instead, a Bus Lock Threshold #VMEXIT is triggered to transfer control to
the Virtual Machine Monitor (VMM).
A Bus Lock Threshold #VMEXIT is reported to the VMM with VMEXIT code 0xA5h,
VMEXIT_BUSLOCK. EXITINFO1 and EXITINFO2 are set to 0 on a VMEXIT_BUSLOCK.
On a #VMEXIT, the processor writes the current value of the Bus Lock
Threshold Counter to the VMCB.
Note: Currently, virtualizing the Bus Lock Threshold feature for L1 guest is
not supported.
More details about the Bus Lock Threshold feature can be found in AMD APM
[1].
v3 -> v4
- Incorporated Sean's review comments
- Added a preparatory patch to move linear_rip out of kvm_pio_request, so
that it can be used by the bus lock threshold patches.
- Added complete_userspace_buslock() function to reload bus_lock_counter
to '1' only if the usespace has not changed the RIP.
- Added changes to continue running bus_lock_counter accross the nested
transitions.
v2 -> v3
- Drop parch to add virt tag in /proc/cpuinfo.
- Incorporated Tom's review comments.
v1 -> v2
- Incorporated misc review comments from Sean.
- Removed bus_lock_counter module parameter.
- Set the value of bus_lock_counter to zero by default and reload the value by 1
in bus lock exit handler.
- Add documentation for the behavioral difference for KVM_EXIT_BUS_LOCK.
- Improved selftest for buslock to work on SVM and VMX.
- Rewrite the commit messages.
Patches are prepared on kvm-next/next (c9ea48bb6ee6).
Testing done:
- Tested the Bus Lock Threshold functionality on normal, SEV, SEV-ES and SEV-SNP guests.
- Tested the Bus Lock Threshold functionality on nested guests.
v1: https://lore.kernel.org/kvm/20240709175145.9986-4-manali.shukla@amd.com/T/
v2: https://lore.kernel.org/kvm/20241001063413.687787-4-manali.shukla@amd.com/T/
v3: https://lore.kernel.org/kvm/20241004053341.5726-1-manali.shukla@amd.com/T/
[1]: AMD64 Architecture Programmer's Manual Pub. 24593, April 2024,
Vol 2, 15.14.5 Bus Lock Threshold.
https://bugzilla.kernel.org/attachment.cgi?id=306250
Manali Shukla (3):
KVM: x86: Preparatory patch to move linear_rip out of kvm_pio_request
x86/cpufeatures: Add CPUID feature bit for the Bus Lock Threshold
KVM: SVM: Add support for KVM_CAP_X86_BUS_LOCK_EXIT on SVM CPUs
Nikunj A Dadhania (2):
KVM: SVM: Enable Bus lock threshold exit
KVM: selftests: Add bus lock exit test
Documentation/virt/kvm/api.rst | 19 +++
arch/x86/include/asm/cpufeatures.h | 1 +
arch/x86/include/asm/kvm_host.h | 2 +-
arch/x86/include/asm/svm.h | 5 +-
arch/x86/include/uapi/asm/svm.h | 2 +
arch/x86/kvm/svm/nested.c | 42 ++++++
arch/x86/kvm/svm/svm.c | 38 +++++
arch/x86/kvm/svm/svm.h | 2 +
arch/x86/kvm/x86.c | 8 +-
tools/testing/selftests/kvm/Makefile.kvm | 1 +
.../selftests/kvm/x86/kvm_buslock_test.c | 135 ++++++++++++++++++
11 files changed, 249 insertions(+), 6 deletions(-)
create mode 100644 tools/testing/selftests/kvm/x86/kvm_buslock_test.c
base-commit: c9ea48bb6ee6b28bbc956c1e8af98044618fed5e
--
2.34.1
This is the initial import of a CAN selftest from can-tests[1] into the
tree. For now, it is just a single test but when agreed on the
structure, we intend to import more tests from can-tests and add
additional test cases.
The goal of moving the CAN selftests into the tree is to align the tests
more closely with the kernel, improve testing of CAN in general, and to
simplify running the tests automatically in the various kernel CI
systems.
I have cc'ed netdev and its reviewers and maintainers to make sure they
are okay with the location of the tests and the changes to the paths in
MAINTAINERS. The changes should be merged through linux-can-next and
subsequent changes will not go to netdev anymore.
[1]: https://github.com/linux-can/can-tests
Felix Maurer (4):
selftests: can: Import tst-filter from can-tests
selftests: can: use kselftest harness in test_raw_filter
selftests: can: Use fixtures in test_raw_filter
selftests: can: Document test_raw_filter test cases
MAINTAINERS | 2 +
tools/testing/selftests/Makefile | 1 +
tools/testing/selftests/net/can/.gitignore | 2 +
tools/testing/selftests/net/can/Makefile | 11 +
.../selftests/net/can/test_raw_filter.c | 338 ++++++++++++++++++
.../selftests/net/can/test_raw_filter.sh | 37 ++
6 files changed, 391 insertions(+)
create mode 100644 tools/testing/selftests/net/can/.gitignore
create mode 100644 tools/testing/selftests/net/can/Makefile
create mode 100644 tools/testing/selftests/net/can/test_raw_filter.c
create mode 100755 tools/testing/selftests/net/can/test_raw_filter.sh
--
2.49.0
This series improves the following tests.
1. Get-reg-list : Adds vector support
2. SBI PMU test : Distinguish between different types of illegal exception
The first patch is just helper patch that adds stval support during
exception handling.
Signed-off-by: Atish Patra <atishp(a)rivosinc.com>
---
Changes in v2:
- Rebased on top of Linux 6.15-rc4
- Changed from ex_regs to pt_regs based on Drew's suggestion.
- Dropped Anup's review on PATCH1 as it is significantly changed from last review.
- Moved the instruction decoding macros to a common header file.
- Improved the vector reg list test as per the feedback.
- Link to v1: https://lore.kernel.org/r/20250324-kvm_selftest_improve-v1-0-583620219d4f@r…
---
Atish Patra (3):
KVM: riscv: selftests: Align the trap information wiht pt_regs
KVM: riscv: selftests: Decode stval to identify exact exception type
KVM: riscv: selftests: Add vector extension tests
.../selftests/kvm/include/riscv/processor.h | 23 ++-
tools/testing/selftests/kvm/lib/riscv/handlers.S | 164 ++++++++++++---------
tools/testing/selftests/kvm/lib/riscv/processor.c | 2 +-
tools/testing/selftests/kvm/riscv/arch_timer.c | 2 +-
tools/testing/selftests/kvm/riscv/ebreak_test.c | 2 +-
tools/testing/selftests/kvm/riscv/get-reg-list.c | 133 +++++++++++++++++
tools/testing/selftests/kvm/riscv/sbi_pmu_test.c | 24 ++-
7 files changed, 270 insertions(+), 80 deletions(-)
---
base-commit: f15d97df5afae16f40ecef942031235d1c6ba14f
change-id: 20250324-kvm_selftest_improve-9bedb9f0a6d3
--
Regards,
Atish patra
Basics and overview
===================
Software with larger attack surfaces (e.g. network facing apps like databases,
browsers or apps relying on browser runtimes) suffer from memory corruption
issues which can be utilized by attackers to bend control flow of the program
to eventually gain control (by making their payload executable). Attackers are
able to perform such attacks by leveraging call-sites which rely on indirect
calls or return sites which rely on obtaining return address from stack memory.
To mitigate such attacks, risc-v extension zicfilp enforces that all indirect
calls must land on a landing pad instruction `lpad` else cpu will raise software
check exception (a new cpu exception cause code on riscv).
Similarly for return flow, risc-v extension zicfiss extends architecture with
- `sspush` instruction to push return address on a shadow stack
- `sspopchk` instruction to pop return address from shadow stack
and compare with input operand (i.e. return address on stack)
- `sspopchk` to raise software check exception if comparision above
was a mismatch
- Protection mechanism using which shadow stack is not writeable via
regular store instructions
More information an details can be found at extensions github repo [1].
Equivalent to landing pad (zicfilp) on x86 is `ENDBRANCH` instruction in Intel
CET [3] and branch target identification (BTI) [4] on arm.
Similarly x86's Intel CET has shadow stack [5] and arm64 has guarded control
stack (GCS) [6] which are very similar to risc-v's zicfiss shadow stack.
x86 and arm64 support for user mode shadow stack is already in mainline.
Kernel awareness for user control flow integrity
================================================
This series picks up Samuel Holland's envcfg changes [2] as well. So if those are
being applied independently, they should be removed from this series.
Enabling:
In order to maintain compatibility and not break anything in user mode, kernel
doesn't enable control flow integrity cpu extensions on binary by default.
Instead exposes a prctl interface to enable, disable and lock the shadow stack
or landing pad feature for a task. This allows userspace (loader) to enumerate
if all objects in its address space are compiled with shadow stack and landing
pad support and accordingly enable the feature. Additionally if a subsequent
`dlopen` happens on a library, user mode can take a decision again to disable
the feature (if incoming library is not compiled with support) OR terminate the
task (if user mode policy is strict to have all objects in address space to be
compiled with control flow integirty cpu feature). prctl to enable shadow stack
results in allocating shadow stack from virtual memory and activating for user
address space. x86 and arm64 are also following same direction due to similar
reason(s).
clone/fork:
On clone and fork, cfi state for task is inherited by child. Shadow stack is
part of virtual memory and is a writeable memory from kernel perspective
(writeable via a restricted set of instructions aka shadow stack instructions)
Thus kernel changes ensure that this memory is converted into read-only when
fork/clone happens and COWed when fault is taken due to sspush, sspopchk or
ssamoswap. In case `CLONE_VM` is specified and shadow stack is to be enabled,
kernel will automatically allocate a shadow stack for that clone call.
map_shadow_stack:
x86 introduced `map_shadow_stack` system call to allow user space to explicitly
map shadow stack memory in its address space. It is useful to allocate shadow
for different contexts managed by a single thread (green threads or contexts)
risc-v implements this system call as well.
signal management:
If shadow stack is enabled for a task, kernel performs an asynchronous control
flow diversion to deliver the signal and eventually expects userspace to issue
sigreturn so that original execution can be resumed. Even though resume context
is prepared by kernel, it is in user space memory and is subject to memory
corruption and corruption bugs can be utilized by attacker in this race window
to perform arbitrary sigreturn and eventually bypass cfi mechanism.
Another issue is how to ensure that cfi related state on sigcontext area is not
trampled by legacy apps or apps compiled with old kernel headers.
In order to mitigate control-flow hijacting, kernel prepares a token and place
it on shadow stack before signal delivery and places address of token in
sigcontext structure. During sigreturn, kernel obtains address of token from
sigcontext struture, reads token from shadow stack and validates it and only
then allow sigreturn to succeed. Compatiblity issue is solved by adopting
dynamic sigcontext management introduced for vector extension. This series
re-factor the code little bit to allow future sigcontext management easy (as
proposed by Andy Chiu from SiFive)
config and compilation:
Introduce a new risc-v config option `CONFIG_RISCV_USER_CFI`. Selecting this
config option picks the kernel support for user control flow integrity. This
optin is presented only if toolchain has shadow stack and landing pad support.
And is on purpose guarded by toolchain support. Reason being that eventually
vDSO also needs to be compiled in with shadow stack and landing pad support.
vDSO compile patches are not included as of now because landing pad labeling
scheme is yet to settle for usermode runtime.
To get more information on kernel interactions with respect to
zicfilp and zicfiss, patch series adds documentation for
`zicfilp` and `zicfiss` in following:
Documentation/arch/riscv/zicfiss.rst
Documentation/arch/riscv/zicfilp.rst
How to test this series
=======================
Toolchain
---------
$ git clone git@github.com:sifive/riscv-gnu-toolchain.git -b cfi-dev
$ riscv-gnu-toolchain/configure --prefix=<path-to-where-to-build> --with-arch=rv64gc_zicfilp_zicfiss --enable-linux --disable-gdb --with-extra-multilib-test="rv64gc_zicfilp_zicfiss-lp64d:-static"
$ make -j$(nproc)
Qemu
----
Get the lastest qemu
$ cd qemu
$ mkdir build
$ cd build
$ ../configure --target-list=riscv64-softmmu
$ make -j$(nproc)
Opensbi
-------
$ git clone git@github.com:deepak0414/opensbi.git -b v6_cfi_spec_split_opensbi
$ make CROSS_COMPILE=<your riscv toolchain> -j$(nproc) PLATFORM=generic
Linux
-----
Running defconfig is fine. CFI is enabled by default if the toolchain
supports it.
$ make ARCH=riscv CROSS_COMPILE=<path-to-cfi-riscv-gnu-toolchain>/build/bin/riscv64-unknown-linux-gnu- -j$(nproc) defconfig
$ make ARCH=riscv CROSS_COMPILE=<path-to-cfi-riscv-gnu-toolchain>/build/bin/riscv64-unknown-linux-gnu- -j$(nproc)
In case you're building your own rootfs using toolchain, please make sure you
pick following patch to ensure that vDSO compiled with lpad and shadow stack.
"arch/riscv: compile vdso with landing pad"
Branch where above patch can be picked
https://github.com/deepak0414/linux-riscv-cfi/tree/vdso_user_cfi_v6.12-rc1
Running
-------
Modify your qemu command to have:
-bios <path-to-cfi-opensbi>/build/platform/generic/firmware/fw_dynamic.bin
-cpu rv64,zicfilp=true,zicfiss=true,zimop=true,zcmop=true
vDSO related Opens (in the flux)
=================================
I am listing these opens for laying out plan and what to expect in future
patch sets. And of course for the sake of discussion.
Shadow stack and landing pad enabling in vDSO
----------------------------------------------
vDSO must have shadow stack and landing pad support compiled in for task
to have shadow stack and landing pad support. This patch series doesn't
enable that (yet). Enabling shadow stack support in vDSO should be
straight forward (intend to do that in next versions of patch set). Enabling
landing pad support in vDSO requires some collaboration with toolchain folks
to follow a single label scheme for all object binaries. This is necessary to
ensure that all indirect call-sites are setting correct label and target landing
pads are decorated with same label scheme.
How many vDSOs
---------------
Shadow stack instructions are carved out of zimop (may be operations) and if CPU
doesn't implement zimop, they're illegal instructions. Kernel could be running on
a CPU which may or may not implement zimop. And thus kernel will have to carry 2
different vDSOs and expose the appropriate one depending on whether CPU implements
zimop or not.
References
==========
[1] - https://github.com/riscv/riscv-cfi
[2] - https://lore.kernel.org/all/20240814081126.956287-1-samuel.holland@sifive.c…
[3] - https://lwn.net/Articles/889475/
[4] - https://developer.arm.com/documentation/109576/0100/Branch-Target-Identific…
[5] - https://www.intel.com/content/dam/develop/external/us/en/documents/catc17-i…
[6] - https://lwn.net/Articles/940403/
To: Thomas Gleixner <tglx(a)linutronix.de>
To: Ingo Molnar <mingo(a)redhat.com>
To: Borislav Petkov <bp(a)alien8.de>
To: Dave Hansen <dave.hansen(a)linux.intel.com>
To: x86(a)kernel.org
To: H. Peter Anvin <hpa(a)zytor.com>
To: Andrew Morton <akpm(a)linux-foundation.org>
To: Liam R. Howlett <Liam.Howlett(a)oracle.com>
To: Vlastimil Babka <vbabka(a)suse.cz>
To: Lorenzo Stoakes <lorenzo.stoakes(a)oracle.com>
To: Paul Walmsley <paul.walmsley(a)sifive.com>
To: Palmer Dabbelt <palmer(a)dabbelt.com>
To: Albert Ou <aou(a)eecs.berkeley.edu>
To: Conor Dooley <conor(a)kernel.org>
To: Rob Herring <robh(a)kernel.org>
To: Krzysztof Kozlowski <krzk+dt(a)kernel.org>
To: Arnd Bergmann <arnd(a)arndb.de>
To: Christian Brauner <brauner(a)kernel.org>
To: Peter Zijlstra <peterz(a)infradead.org>
To: Oleg Nesterov <oleg(a)redhat.com>
To: Eric Biederman <ebiederm(a)xmission.com>
To: Kees Cook <kees(a)kernel.org>
To: Jonathan Corbet <corbet(a)lwn.net>
To: Shuah Khan <shuah(a)kernel.org>
To: Jann Horn <jannh(a)google.com>
To: Conor Dooley <conor+dt(a)kernel.org>
To: Miguel Ojeda <ojeda(a)kernel.org>
To: Alex Gaynor <alex.gaynor(a)gmail.com>
To: Boqun Feng <boqun.feng(a)gmail.com>
To: Gary Guo <gary(a)garyguo.net>
To: Björn Roy Baron <bjorn3_gh(a)protonmail.com>
To: Benno Lossin <benno.lossin(a)proton.me>
To: Andreas Hindborg <a.hindborg(a)kernel.org>
To: Alice Ryhl <aliceryhl(a)google.com>
To: Trevor Gross <tmgross(a)umich.edu>
Cc: linux-kernel(a)vger.kernel.org
Cc: linux-fsdevel(a)vger.kernel.org
Cc: linux-mm(a)kvack.org
Cc: linux-riscv(a)lists.infradead.org
Cc: devicetree(a)vger.kernel.org
Cc: linux-arch(a)vger.kernel.org
Cc: linux-doc(a)vger.kernel.org
Cc: linux-kselftest(a)vger.kernel.org
Cc: alistair.francis(a)wdc.com
Cc: richard.henderson(a)linaro.org
Cc: jim.shu(a)sifive.com
Cc: andybnac(a)gmail.com
Cc: kito.cheng(a)sifive.com
Cc: charlie(a)rivosinc.com
Cc: atishp(a)rivosinc.com
Cc: evan(a)rivosinc.com
Cc: cleger(a)rivosinc.com
Cc: alexghiti(a)rivosinc.com
Cc: samitolvanen(a)google.com
Cc: broonie(a)kernel.org
Cc: rick.p.edgecombe(a)intel.com
Cc: rust-for-linux(a)vger.kernel.org
changelog
---------
v14:
- rebased on top of palmer/sbi-v3. Thus dropped clement's FWFT patches
Updated RISCV_ISA_EXT_XXXX in hwcap and hwprobe constants.
- Took Radim's suggestions on bitfields.
- Placed cfi_state at the end of thread_info block so that current situation
is not disturbed with respect to member fields of thread_info in single
cacheline.
v13:
- cpu_supports_shadow_stack/cpu_supports_indirect_br_lp_instr uses
riscv_has_extension_unlikely()
- uses nops(count) to create nop slide
- RISCV_ACQUIRE_BARRIER is not needed in `amo_user_shstk`. Removed it
- changed ternaries to simply use implicit casting to convert to bool.
- kernel command line allows to disable zicfilp and zicfiss independently.
updated kernel-parameters.txt.
- ptrace user abi for cfi uses bitmasks instead of bitfields. Added ptrace
kselftest.
- cosmetic and grammatical changes to documentation.
v12:
- It seems like I had accidently squashed arch agnostic indirect branch
tracking prctl and riscv implementation of those prctls. Split them again.
- set_shstk_status/set_indir_lp_status perform CSR writes only when CPU
support is available. As suggested by Zong Li.
- Some minor clean up in kselftests as suggested by Zong Li.
v11:
- patch "arch/riscv: compile vdso with landing pad" was unconditionally
selecting `_zicfilp` for vDSO compile. fixed that. Changed `lpad 1` to
to `lpad 0`.
v10:
- dropped "mm: helper `is_shadow_stack_vma` to check shadow stack vma". This patch
is not that interesting to this patch series for risc-v. There are instances in
arch directories where VM_SHADOW_STACK flag is anyways used. Dropping this patch
to expedite merging in riscv tree.
- Took suggestions from `Clement` on "riscv: zicfiss / zicfilp enumeration" to
validate presence of cfi based on config.
- Added a patch for vDSO to have `lpad 0`. I had omitted this earlier to make sure
we add single vdso object with cfi enabled. But a vdso object with scheme of
zero labeled landing pad is least common denominator and should work with all
objects of zero labeled as well as function-signature labeled objects.
v9:
- rebased on master (39a803b754d5 fix braino in "9p: fix ->rename_sem exclusion")
- dropped "mm: Introduce ARCH_HAS_USER_SHADOW_STACK" (master has it from arm64/gcs)
- dropped "prctl: arch-agnostic prctl for shadow stack" (master has it from arm64/gcs)
v8:
- rebased on palmer/for-next
- dropped samuel holland's `envcfg` context switch patches.
they are in parlmer/for-next
v7:
- Removed "riscv/Kconfig: enable HAVE_EXIT_THREAD for riscv"
Instead using `deactivate_mm` flow to clean up.
see here for more context
https://lore.kernel.org/all/20230908203655.543765-1-rick.p.edgecombe@intel.…
- Changed the header include in `kselftest`. Hopefully this fixes compile
issue faced by Zong Li at SiFive.
- Cleaned up an orphaned change to `mm/mmap.c` in below patch
"riscv/mm : ensure PROT_WRITE leads to VM_READ | VM_WRITE"
- Lock interfaces for shadow stack and indirect branch tracking expect arg == 0
Any future evolution of this interface should accordingly define how arg should
be setup.
- `mm/map.c` has an instance of using `VM_SHADOW_STACK`. Fixed it to use helper
`is_shadow_stack_vma`.
- Link to v6: https://lore.kernel.org/r/20241008-v5_user_cfi_series-v6-0-60d9fe073f37@riv…
v6:
- Picked up Samuel Holland's changes as is with `envcfg` placed in
`thread` instead of `thread_info`
- fixed unaligned newline escapes in kselftest
- cleaned up messages in kselftest and included test output in commit message
- fixed a bug in clone path reported by Zong Li
- fixed a build issue if CONFIG_RISCV_ISA_V is not selected
(this was introduced due to re-factoring signal context
management code)
v5:
- rebased on v6.12-rc1
- Fixed schema related issues in device tree file
- Fixed some of the documentation related issues in zicfilp/ss.rst
(style issues and added index)
- added `SHADOW_STACK_SET_MARKER` so that implementation can define base
of shadow stack.
- Fixed warnings on definitions added in usercfi.h when
CONFIG_RISCV_USER_CFI is not selected.
- Adopted context header based signal handling as proposed by Andy Chiu
- Added support for enabling kernel mode access to shadow stack using
FWFT
(https://github.com/riscv-non-isa/riscv-sbi-doc/blob/master/src/ext-firmware…)
- Link to v5: https://lore.kernel.org/r/20241001-v5_user_cfi_series-v1-0-3ba65b6e550f@riv…
(Note: I had an issue in my workflow due to which version number wasn't
picked up correctly while sending out patches)
v4:
- rebased on 6.11-rc6
- envcfg: Converged with Samuel Holland's patches for envcfg management on per-
thread basis.
- vma_is_shadow_stack is renamed to is_vma_shadow_stack
- picked up Mark Brown's `ARCH_HAS_USER_SHADOW_STACK` patch
- signal context: using extended context management to maintain compatibility.
- fixed `-Wmissing-prototypes` compiler warnings for prctl functions
- Documentation fixes and amending typos.
- Link to v4: https://lore.kernel.org/all/20240912231650.3740732-1-debug@rivosinc.com/
v3:
- envcfg
logic to pick up base envcfg had a bug where `ENVCFG_CBZE` could have been
picked on per task basis, even though CPU didn't implement it. Fixed in
this series.
- dt-bindings
As suggested, split into separate commit. fixed the messaging that spec is
in public review
- arch_is_shadow_stack change
arch_is_shadow_stack changed to vma_is_shadow_stack
- hwprobe
zicfiss / zicfilp if present will get enumerated in hwprobe
- selftests
As suggested, added object and binary filenames to .gitignore
Selftest binary anyways need to be compiled with cfi enabled compiler which
will make sure that landing pad and shadow stack are enabled. Thus removed
separate enable/disable tests. Cleaned up tests a bit.
- Link to v3: https://lore.kernel.org/lkml/20240403234054.2020347-1-debug@rivosinc.com/
v2:
- Using config `CONFIG_RISCV_USER_CFI`, kernel support for riscv control flow
integrity for user mode programs can be compiled in the kernel.
- Enabling of control flow integrity for user programs is left to user runtime
- This patch series introduces arch agnostic `prctls` to enable shadow stack
and indirect branch tracking. And implements them on riscv.
---
Changes in v14:
- changelog posted just below cover letter
- Link to v13: https://lore.kernel.org/r/20250424-v5_user_cfi_series-v13-0-971437de586a@ri…
Changes in v13:
- changelog posted just below cover letter
- Link to v12: https://lore.kernel.org/r/20250314-v5_user_cfi_series-v12-0-e51202b53138@ri…
Changes in v12:
- changelog posted just below cover letter
- Link to v11: https://lore.kernel.org/r/20250310-v5_user_cfi_series-v11-0-86b36cbfb910@ri…
Changes in v11:
- changelog posted just below cover letter
- Link to v10: https://lore.kernel.org/r/20250210-v5_user_cfi_series-v10-0-163dcfa31c60@ri…
---
Andy Chiu (1):
riscv: signal: abstract header saving for setup_sigcontext
Deepak Gupta (25):
mm: VM_SHADOW_STACK definition for riscv
dt-bindings: riscv: zicfilp and zicfiss in dt-bindings (extensions.yaml)
riscv: zicfiss / zicfilp enumeration
riscv: zicfiss / zicfilp extension csr and bit definitions
riscv: usercfi state for task and save/restore of CSR_SSP on trap entry/exit
riscv/mm : ensure PROT_WRITE leads to VM_READ | VM_WRITE
riscv mm: manufacture shadow stack pte
riscv mmu: teach pte_mkwrite to manufacture shadow stack PTEs
riscv mmu: write protect and shadow stack
riscv/mm: Implement map_shadow_stack() syscall
riscv/shstk: If needed allocate a new shadow stack on clone
riscv: Implements arch agnostic shadow stack prctls
prctl: arch-agnostic prctl for indirect branch tracking
riscv: Implements arch agnostic indirect branch tracking prctls
riscv/traps: Introduce software check exception
riscv/signal: save and restore of shadow stack for signal
riscv/kernel: update __show_regs to print shadow stack register
riscv/ptrace: riscv cfi status and state via ptrace and in core files
riscv/hwprobe: zicfilp / zicfiss enumeration in hwprobe
riscv: kernel command line option to opt out of user cfi
riscv: enable kernel access to shadow stack memory via FWFT sbi call
riscv: create a config for shadow stack and landing pad instr support
riscv: Documentation for landing pad / indirect branch tracking
riscv: Documentation for shadow stack on riscv
kselftest/riscv: kselftest for user mode cfi
Jim Shu (1):
arch/riscv: compile vdso with landing pad
Documentation/admin-guide/kernel-parameters.txt | 8 +
Documentation/arch/riscv/index.rst | 2 +
Documentation/arch/riscv/zicfilp.rst | 115 +++++
Documentation/arch/riscv/zicfiss.rst | 179 +++++++
.../devicetree/bindings/riscv/extensions.yaml | 14 +
arch/riscv/Kconfig | 20 +
arch/riscv/Makefile | 5 +-
arch/riscv/include/asm/asm-prototypes.h | 1 +
arch/riscv/include/asm/assembler.h | 44 ++
arch/riscv/include/asm/cpufeature.h | 12 +
arch/riscv/include/asm/csr.h | 16 +
arch/riscv/include/asm/entry-common.h | 2 +
arch/riscv/include/asm/hwcap.h | 2 +
arch/riscv/include/asm/mman.h | 25 +
arch/riscv/include/asm/mmu_context.h | 7 +
arch/riscv/include/asm/pgtable.h | 30 +-
arch/riscv/include/asm/processor.h | 2 +
arch/riscv/include/asm/thread_info.h | 3 +
arch/riscv/include/asm/usercfi.h | 95 ++++
arch/riscv/include/asm/vector.h | 3 +
arch/riscv/include/uapi/asm/hwprobe.h | 2 +
arch/riscv/include/uapi/asm/ptrace.h | 34 ++
arch/riscv/include/uapi/asm/sigcontext.h | 1 +
arch/riscv/kernel/Makefile | 1 +
arch/riscv/kernel/asm-offsets.c | 8 +
arch/riscv/kernel/cpufeature.c | 13 +
arch/riscv/kernel/entry.S | 33 +-
arch/riscv/kernel/head.S | 23 +
arch/riscv/kernel/process.c | 26 +-
arch/riscv/kernel/ptrace.c | 95 ++++
arch/riscv/kernel/signal.c | 148 +++++-
arch/riscv/kernel/sys_hwprobe.c | 2 +
arch/riscv/kernel/sys_riscv.c | 10 +
arch/riscv/kernel/traps.c | 43 ++
arch/riscv/kernel/usercfi.c | 545 +++++++++++++++++++++
arch/riscv/kernel/vdso/Makefile | 6 +
arch/riscv/kernel/vdso/flush_icache.S | 4 +
arch/riscv/kernel/vdso/getcpu.S | 4 +
arch/riscv/kernel/vdso/rt_sigreturn.S | 4 +
arch/riscv/kernel/vdso/sys_hwprobe.S | 4 +
arch/riscv/mm/init.c | 2 +-
arch/riscv/mm/pgtable.c | 17 +
include/linux/cpu.h | 4 +
include/linux/mm.h | 7 +
include/uapi/linux/elf.h | 2 +
include/uapi/linux/prctl.h | 27 +
kernel/sys.c | 30 ++
tools/testing/selftests/riscv/Makefile | 2 +-
tools/testing/selftests/riscv/cfi/.gitignore | 3 +
tools/testing/selftests/riscv/cfi/Makefile | 10 +
tools/testing/selftests/riscv/cfi/cfi_rv_test.h | 82 ++++
tools/testing/selftests/riscv/cfi/riscv_cfi_test.c | 171 +++++++
tools/testing/selftests/riscv/cfi/shadowstack.c | 385 +++++++++++++++
tools/testing/selftests/riscv/cfi/shadowstack.h | 27 +
54 files changed, 2331 insertions(+), 29 deletions(-)
---
base-commit: 4181f8ad7a1061efed0219951d608d4988302af7
change-id: 20240930-v5_user_cfi_series-3dc332f8f5b2
--
- debug
From: Ihor Solodrai <ihor.solodrai(a)linux.dev>
[ Upstream commit f2858f308131a09e33afb766cd70119b5b900569 ]
"sockmap_ktls disconnect_after_delete" test has been failing on BPF CI
after recent merges from netdev:
* https://github.com/kernel-patches/bpf/actions/runs/14458537639
* https://github.com/kernel-patches/bpf/actions/runs/14457178732
It happens because disconnect has been disabled for TLS [1], and it
renders the test case invalid.
Removing all the test code creates a conflict between bpf and
bpf-next, so for now only remove the offending assert [2].
The test will be removed later on bpf-next.
[1] https://lore.kernel.org/netdev/20250404180334.3224206-1-kuba@kernel.org/
[2] https://lore.kernel.org/bpf/cfc371285323e1a3f3b006bfcf74e6cf7ad65258@linux.…
Signed-off-by: Ihor Solodrai <ihor.solodrai(a)linux.dev>
Signed-off-by: Andrii Nakryiko <andrii(a)kernel.org>
Reviewed-by: Jiayuan Chen <jiayuan.chen(a)linux.dev>
Link: https://lore.kernel.org/bpf/20250416170246.2438524-1-ihor.solodrai@linux.dev
Signed-off-by: Alexei Starovoitov <ast(a)kernel.org>
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
---
tools/testing/selftests/bpf/prog_tests/sockmap_ktls.c | 1 -
1 file changed, 1 deletion(-)
diff --git a/tools/testing/selftests/bpf/prog_tests/sockmap_ktls.c b/tools/testing/selftests/bpf/prog_tests/sockmap_ktls.c
index 2d0796314862a..0a99fd404f6dc 100644
--- a/tools/testing/selftests/bpf/prog_tests/sockmap_ktls.c
+++ b/tools/testing/selftests/bpf/prog_tests/sockmap_ktls.c
@@ -68,7 +68,6 @@ static void test_sockmap_ktls_disconnect_after_delete(int family, int map)
goto close_cli;
err = disconnect(cli);
- ASSERT_OK(err, "disconnect");
close_cli:
close(cli);
--
2.39.5
Abstract
===
This patchset improves the performance of sockmap by providing CPU affinity,
resulting in a 1-10x increase in throughput.
Motivation
===
Traditional user-space reverse proxy:
Reserve Proxy
_________________
client -> | fd1 <-> fd2 | -> server
|_________________|
Using sockmap for reverse proxy:
Reserve Proxy
_________________
client -> | fd1 <-> fd2 | -> server
| |_________________| |
| | | |
| _________ |
| | sockmap | |
--> |_________| -->
By adding fds to sockmap and using a BPF program, we can quickly forward
data and avoid data copying between user space and kernel space.
Mainstream multi-process reverse proxy applications, such as Nginx and
HAProxy, support CPU affinity settings, which allow each process to be
pinned to a specific CPU, avoiding conflicts between data plane processes
and other processes, especially in multi-tenant environments.
Current Issues
===
The current design of sockmap uses a workqueue to forward ingress_skb and
wakes up the workqueue without specifying a CPU
(by calling schedule_delayed_work()). In the current implementation of
schedule_delayed_work, it tends to run the workqueue on the current CPU.
This approach has a high probability of running on the current CPU, which
is the same CPU that handles the net rx soft interrupt, especially for
programs that access each other using local interfaces.
The loopback driver's transmit interface, loopback_xmit(), directly calls
__netif_rx() on the current CPU, which means that the CPU handling
sockmap's workqueue and the client's sending CPU are the same, resulting
in contention.
For a TCP flow, if the request or response is very large, the
psock->ingress_skb queue can become very long. When the workqueue
traverses this queue to forward the data, it can consume a significant
amount of CPU time.
Solution
===
Configuring RPS on a loopback interface can be useful, but it will trigger
additional softirq, and furthermore, it fails to achieve our expected
effect of CPU isolation from other processes.
Instead, we provide a kfunc that allow users to specify the CPU on which
the workqueue runs through a BPF program.
We can use the existing benchmark to test the performance, which allows
us to evaluate the effectiveness of this optimization.
Because we use local interfaces for communication and the client consumes
a significant amount of CPU when sending data, this prevents the workqueue
from processing ingress_skb in a timely manner, ultimately causing the
server to fail to read data quickly.
Without cpu-affinity:
./bench sockmap -c 2 -p 1 -a --rx-verdict-ingress --no-verify
Setting up benchmark 'sockmap'...
create socket fd c1:14 p1:15 c2:16 p2:17
Benchmark 'sockmap' started.
Iter 0 ( 36.031us): Send Speed 1143.693 MB/s ... Rcv Speed 109.572 MB/s
Iter 1 ( 0.608us): Send Speed 1320.550 MB/s ... Rcv Speed 48.103 MB/s
Iter 2 ( -5.448us): Send Speed 1314.790 MB/s ... Rcv Speed 47.842 MB/s
Iter 3 ( -0.613us): Send Speed 1320.158 MB/s ... Rcv Speed 46.531 MB/s
Iter 4 ( -3.441us): Send Speed 1319.375 MB/s ... Rcv Speed 46.662 MB/s
Iter 5 ( 3.764us): Send Speed 1166.667 MB/s ... Rcv Speed 42.467 MB/s
Iter 6 ( -4.404us): Send Speed 1319.508 MB/s ... Rcv Speed 47.973 MB/s
Summary: total trans 7758 MB ± 1293.506 MB/s
Without cpu-affinity(RPS enabled):
./bench sockmap -c 2 -p 1 -a --rx-verdict-ingress --no-verify
Setting up benchmark 'sockmap'...
create socket fd c1:14 p1:15 c2:16 p2:17
Benchmark 'sockmap' started.
Iter 0 ( 28.925us): Send Speed 1630.357 MB/s ... Rcv Speed 850.960 MB/s
Iter 1 ( -2.042us): Send Speed 1644.564 MB/s ... Rcv Speed 822.478 MB/s
Iter 2 ( 0.754us): Send Speed 1644.297 MB/s ... Rcv Speed 850.787 MB/s
Iter 3 ( 0.159us): Send Speed 1644.429 MB/s ... Rcv Speed 850.198 MB/s
Iter 4 ( -2.898us): Send Speed 1646.924 MB/s ... Rcv Speed 830.867 MB/s
Iter 5 ( -0.210us): Send Speed 1649.410 MB/s ... Rcv Speed 824.246 MB/s
Iter 6 ( -1.448us): Send Speed 1650.723 MB/s ... Rcv Speed 808.256 MB/s
With cpu-affinity(RPS disabled):
./bench sockmap -c 2 -p 1 -a --rx-verdict-ingress --no-verify --cpu-affinity
Setting up benchmark 'sockmap'...
create socket fd c1:14 p1:15 c2:16 p2:17
Benchmark 'sockmap' started.
Iter 0 ( 36.051us): Send Speed 1883.437 MB/s ... Rcv Speed 1865.087 MB/s
Iter 1 ( 1.246us): Send Speed 1900.542 MB/s ... Rcv Speed 1761.737 MB/s
Iter 2 ( -8.595us): Send Speed 1883.128 MB/s ... Rcv Speed 1860.714 MB/s
Iter 3 ( 7.033us): Send Speed 1890.831 MB/s ... Rcv Speed 1806.684 MB/s
Iter 4 ( -8.397us): Send Speed 1884.700 MB/s ... Rcv Speed 1973.568 MB/s
Iter 5 ( -1.822us): Send Speed 1894.125 MB/s ... Rcv Speed 1775.046 MB/s
Iter 6 ( 4.936us): Send Speed 1877.597 MB/s ... Rcv Speed 1959.320 MB/s
Summary: total trans 11328 MB ± 1888.507 MB/s
Appendix
===
Through this optimization, we discovered that sk_mem_charge()
and sk_mem_uncharge() have concurrency issues. The performance improvement
brought by this optimization has made these concurrency issues more
evident.
This concurrency issue can cause the WARN_ON_ONCE(sk->sk_forward_alloc)
check to be triggered when the socket is released. Since this patch is a
feature-type patch and does not intend to fix this bug, I will provide
additional patches to fix this issue later.
Jiayuan Chen (3):
bpf, sockmap: Introduce a new kfunc for sockmap
bpf, sockmap: Affinitize workqueue to a specific CPU
selftest/bpf/benchs: Add cpu-affinity for sockmap bench
Documentation/bpf/map_sockmap.rst | 14 +++++++
include/linux/skmsg.h | 15 +++++++
kernel/bpf/btf.c | 3 ++
net/core/skmsg.c | 10 +++--
net/core/sock_map.c | 39 +++++++++++++++++++
.../selftests/bpf/benchs/bench_sockmap.c | 35 +++++++++++++++--
tools/testing/selftests/bpf/bpf_kfuncs.h | 6 +++
.../selftests/bpf/progs/bench_sockmap_prog.c | 7 ++++
8 files changed, 122 insertions(+), 7 deletions(-)
--
2.47.1
From: Ihor Solodrai <ihor.solodrai(a)linux.dev>
[ Upstream commit f2858f308131a09e33afb766cd70119b5b900569 ]
"sockmap_ktls disconnect_after_delete" test has been failing on BPF CI
after recent merges from netdev:
* https://github.com/kernel-patches/bpf/actions/runs/14458537639
* https://github.com/kernel-patches/bpf/actions/runs/14457178732
It happens because disconnect has been disabled for TLS [1], and it
renders the test case invalid.
Removing all the test code creates a conflict between bpf and
bpf-next, so for now only remove the offending assert [2].
The test will be removed later on bpf-next.
[1] https://lore.kernel.org/netdev/20250404180334.3224206-1-kuba@kernel.org/
[2] https://lore.kernel.org/bpf/cfc371285323e1a3f3b006bfcf74e6cf7ad65258@linux.…
Signed-off-by: Ihor Solodrai <ihor.solodrai(a)linux.dev>
Signed-off-by: Andrii Nakryiko <andrii(a)kernel.org>
Reviewed-by: Jiayuan Chen <jiayuan.chen(a)linux.dev>
Link: https://lore.kernel.org/bpf/20250416170246.2438524-1-ihor.solodrai@linux.dev
Signed-off-by: Alexei Starovoitov <ast(a)kernel.org>
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
---
tools/testing/selftests/bpf/prog_tests/sockmap_ktls.c | 1 -
1 file changed, 1 deletion(-)
diff --git a/tools/testing/selftests/bpf/prog_tests/sockmap_ktls.c b/tools/testing/selftests/bpf/prog_tests/sockmap_ktls.c
index 2d0796314862a..0a99fd404f6dc 100644
--- a/tools/testing/selftests/bpf/prog_tests/sockmap_ktls.c
+++ b/tools/testing/selftests/bpf/prog_tests/sockmap_ktls.c
@@ -68,7 +68,6 @@ static void test_sockmap_ktls_disconnect_after_delete(int family, int map)
goto close_cli;
err = disconnect(cli);
- ASSERT_OK(err, "disconnect");
close_cli:
close(cli);
--
2.39.5
From: Ihor Solodrai <ihor.solodrai(a)linux.dev>
[ Upstream commit f2858f308131a09e33afb766cd70119b5b900569 ]
"sockmap_ktls disconnect_after_delete" test has been failing on BPF CI
after recent merges from netdev:
* https://github.com/kernel-patches/bpf/actions/runs/14458537639
* https://github.com/kernel-patches/bpf/actions/runs/14457178732
It happens because disconnect has been disabled for TLS [1], and it
renders the test case invalid.
Removing all the test code creates a conflict between bpf and
bpf-next, so for now only remove the offending assert [2].
The test will be removed later on bpf-next.
[1] https://lore.kernel.org/netdev/20250404180334.3224206-1-kuba@kernel.org/
[2] https://lore.kernel.org/bpf/cfc371285323e1a3f3b006bfcf74e6cf7ad65258@linux.…
Signed-off-by: Ihor Solodrai <ihor.solodrai(a)linux.dev>
Signed-off-by: Andrii Nakryiko <andrii(a)kernel.org>
Reviewed-by: Jiayuan Chen <jiayuan.chen(a)linux.dev>
Link: https://lore.kernel.org/bpf/20250416170246.2438524-1-ihor.solodrai@linux.dev
Signed-off-by: Alexei Starovoitov <ast(a)kernel.org>
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
---
tools/testing/selftests/bpf/prog_tests/sockmap_ktls.c | 1 -
1 file changed, 1 deletion(-)
diff --git a/tools/testing/selftests/bpf/prog_tests/sockmap_ktls.c b/tools/testing/selftests/bpf/prog_tests/sockmap_ktls.c
index 2d0796314862a..0a99fd404f6dc 100644
--- a/tools/testing/selftests/bpf/prog_tests/sockmap_ktls.c
+++ b/tools/testing/selftests/bpf/prog_tests/sockmap_ktls.c
@@ -68,7 +68,6 @@ static void test_sockmap_ktls_disconnect_after_delete(int family, int map)
goto close_cli;
err = disconnect(cli);
- ASSERT_OK(err, "disconnect");
close_cli:
close(cli);
--
2.39.5
From: Ihor Solodrai <ihor.solodrai(a)linux.dev>
[ Upstream commit f2858f308131a09e33afb766cd70119b5b900569 ]
"sockmap_ktls disconnect_after_delete" test has been failing on BPF CI
after recent merges from netdev:
* https://github.com/kernel-patches/bpf/actions/runs/14458537639
* https://github.com/kernel-patches/bpf/actions/runs/14457178732
It happens because disconnect has been disabled for TLS [1], and it
renders the test case invalid.
Removing all the test code creates a conflict between bpf and
bpf-next, so for now only remove the offending assert [2].
The test will be removed later on bpf-next.
[1] https://lore.kernel.org/netdev/20250404180334.3224206-1-kuba@kernel.org/
[2] https://lore.kernel.org/bpf/cfc371285323e1a3f3b006bfcf74e6cf7ad65258@linux.…
Signed-off-by: Ihor Solodrai <ihor.solodrai(a)linux.dev>
Signed-off-by: Andrii Nakryiko <andrii(a)kernel.org>
Reviewed-by: Jiayuan Chen <jiayuan.chen(a)linux.dev>
Link: https://lore.kernel.org/bpf/20250416170246.2438524-1-ihor.solodrai@linux.dev
Signed-off-by: Alexei Starovoitov <ast(a)kernel.org>
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
---
tools/testing/selftests/bpf/prog_tests/sockmap_ktls.c | 1 -
1 file changed, 1 deletion(-)
diff --git a/tools/testing/selftests/bpf/prog_tests/sockmap_ktls.c b/tools/testing/selftests/bpf/prog_tests/sockmap_ktls.c
index 2d0796314862a..0a99fd404f6dc 100644
--- a/tools/testing/selftests/bpf/prog_tests/sockmap_ktls.c
+++ b/tools/testing/selftests/bpf/prog_tests/sockmap_ktls.c
@@ -68,7 +68,6 @@ static void test_sockmap_ktls_disconnect_after_delete(int family, int map)
goto close_cli;
err = disconnect(cli);
- ASSERT_OK(err, "disconnect");
close_cli:
close(cli);
--
2.39.5
Richie Pearn presented a reproducible situation where traffic would get
blocked on the NXP LS1028A switch if a certain taprio schedule was
applied, and stepping the PTP clock would take place. The latter event
is an expected initial occurrence, but also at runtime, for example when
transitioning from one grandmaster to another.
The issue is completely described in patch 1/4, which also contains
the fix, but it has left me with some doubts regarding the need for
vsc9959_tas_clock_adjust() in general.
In order to prove to myself that vsc9959_tas_clock_adjust() is needed in
general, I have written a selftest for the tc-taprio data path in patch
4/4. On the LS1028A, we can clearly see the following failures without
that function:
INFO: Forcing a backward clock jump
TEST: ping [FAIL]
INFO: Setting up taprio after PTP
TEST: In band with gate [FAIL]
Reception of 100 packets failed
TEST: Out of band with gate [FAIL]
Reception of 100 packets failed
As for testing my fix from patch 1/4, that was quite a bit more complex
to do automatically. In fact, I couldn't find any other schedule that
would fail to be updated by vsc9959_tas_clock_adjust() as cleanly as
the schedule from Richie, so I've added that specific schedule as the
test_clock_jump_backward() test.
The test ordering is also (unfortunately) very strategic. Running the
selftest to the end dirties the GCL RAM, and when running
test_clock_jump_backward() once again, the GCL entries won't be all
zeroes as they were the first time around. They will contain bits and
pieces of old schedules, making it very challenging to make it fail.
Thus, test_clock_jump_backward() is the first in the test suite, and
without patch 1/4, it is only supposed to fail the _first_ time when
running after a clean boot.
Vladimir Oltean (4):
net: dsa: felix: fix broken taprio gate states after clock jump
selftests: net: tsn_lib: create common helper for counting received
packets
selftests: net: tsn_lib: add window_size argument to isochron_do()
selftests: net: tc_taprio: new test
drivers/net/dsa/ocelot/felix_vsc9959.c | 5 +-
.../selftests/drivers/net/dsa/tc_taprio.sh | 1 +
.../selftests/drivers/net/ocelot/psfp.sh | 8 +-
.../selftests/net/forwarding/tc_taprio.sh | 421 ++++++++++++++++++
.../selftests/net/forwarding/tsn_lib.sh | 26 ++
5 files changed, 454 insertions(+), 7 deletions(-)
create mode 120000 tools/testing/selftests/drivers/net/dsa/tc_taprio.sh
create mode 100755 tools/testing/selftests/net/forwarding/tc_taprio.sh
--
2.43.0
From: Feng Yang <yangfeng(a)kylinos.cn>
If the CONFIG_NET_SCH_BPF configuration is not enabled,
the BPF test compilation will report the following error:
In file included from progs/bpf_qdisc_fq.c:39:
progs/bpf_qdisc_common.h:17:51: error: declaration of 'struct bpf_sk_buff_ptr' will not be visible outside of this function [-Werror,-Wvisibility]
17 | void bpf_qdisc_skb_drop(struct sk_buff *p, struct bpf_sk_buff_ptr *to_free) __ksym;
| ^
progs/bpf_qdisc_fq.c:309:14: error: declaration of 'struct bpf_sk_buff_ptr' will not be visible outside of this function [-Werror,-Wvisibility]
309 | struct bpf_sk_buff_ptr *to_free)
| ^
progs/bpf_qdisc_fq.c:309:14: error: declaration of 'struct bpf_sk_buff_ptr' will not be visible outside of this function [-Werror,-Wvisibility]
progs/bpf_qdisc_fq.c:308:5: error: conflicting types for '____bpf_fq_enqueue'
Fixes: 11c701639ba9 ("selftests/bpf: Add a basic fifo qdisc test")
Signed-off-by: Feng Yang <yangfeng(a)kylinos.cn>
---
tools/testing/selftests/bpf/progs/bpf_qdisc_common.h | 2 ++
1 file changed, 2 insertions(+)
diff --git a/tools/testing/selftests/bpf/progs/bpf_qdisc_common.h b/tools/testing/selftests/bpf/progs/bpf_qdisc_common.h
index 65a2c561c0bb..7e7f2fe04f22 100644
--- a/tools/testing/selftests/bpf/progs/bpf_qdisc_common.h
+++ b/tools/testing/selftests/bpf/progs/bpf_qdisc_common.h
@@ -12,6 +12,8 @@
#define private(name) SEC(".data." #name) __hidden __attribute__((aligned(8)))
+struct bpf_sk_buff_ptr;
+
u32 bpf_skb_get_hash(struct sk_buff *p) __ksym;
void bpf_kfree_skb(struct sk_buff *p) __ksym;
void bpf_qdisc_skb_drop(struct sk_buff *p, struct bpf_sk_buff_ptr *to_free) __ksym;
--
2.43.0
The closing parentheses around the read syscall is misplaced, causing
single byte reads from the iterator instead of buf sized reads. While
the end result is the same, many more read calls than necessary are
performed.
$ tools/testing/selftests/bpf/vmtest.sh "./test_progs -t kmem_cache_iter"
145/1 kmem_cache_iter/check_task_struct:OK
145/2 kmem_cache_iter/check_slabinfo:OK
145/3 kmem_cache_iter/open_coded_iter:OK
145 kmem_cache_iter:OK
Summary: 1/3 PASSED, 0 SKIPPED, 0 FAILED
Fixes: a496d0cdc84d ("selftests/bpf: Add a test for kmem_cache_iter")
Signed-off-by: T.J. Mercier <tjmercier(a)google.com>
---
tools/testing/selftests/bpf/prog_tests/kmem_cache_iter.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/tools/testing/selftests/bpf/prog_tests/kmem_cache_iter.c b/tools/testing/selftests/bpf/prog_tests/kmem_cache_iter.c
index 8e13a3416a21..1de14b111931 100644
--- a/tools/testing/selftests/bpf/prog_tests/kmem_cache_iter.c
+++ b/tools/testing/selftests/bpf/prog_tests/kmem_cache_iter.c
@@ -104,7 +104,7 @@ void test_kmem_cache_iter(void)
goto destroy;
memset(buf, 0, sizeof(buf));
- while (read(iter_fd, buf, sizeof(buf) > 0)) {
+ while (read(iter_fd, buf, sizeof(buf)) > 0) {
/* Read out all contents */
printf("%s", buf);
}
base-commit: b4432656b36e5cc1d50a1f2dc15357543add530e
--
2.49.0.906.g1f30a19c02-goog
ksft runner sends 2 SIGTERMs in a row if a test runs out of time.
Handle this in a similar way we handle SIGINT - cleanup and stop
running further tests.
Because we get 2 signals we need a bit of logic to ignore
the subsequent one, they come immediately one after the other
(due to commit 9616cb34b08e ("kselftest/runner.sh: Propagate SIGTERM
to runner child")).
This change makes sure we run cleanup (scheduled defer()s)
and also print a stack trace on SIGTERM, which doesn't happen
by default. Tests occasionally hang in NIPA and it's impossible
to tell what they are waiting from or doing.
Signed-off-by: Jakub Kicinski <kuba(a)kernel.org>
---
CC: petrm(a)nvidia.com
CC: willemb(a)google.com
CC: sdf(a)fomichev.me
CC: linux-kselftest(a)vger.kernel.org
---
tools/testing/selftests/net/lib/py/ksft.py | 27 +++++++++++++++++++++-
1 file changed, 26 insertions(+), 1 deletion(-)
diff --git a/tools/testing/selftests/net/lib/py/ksft.py b/tools/testing/selftests/net/lib/py/ksft.py
index 3cfad0fd4570..73710634d457 100644
--- a/tools/testing/selftests/net/lib/py/ksft.py
+++ b/tools/testing/selftests/net/lib/py/ksft.py
@@ -3,6 +3,7 @@
import builtins
import functools
import inspect
+import signal
import sys
import time
import traceback
@@ -26,6 +27,10 @@ KSFT_DISRUPTIVE = True
pass
+class KsftTerminate(KeyboardInterrupt):
+ pass
+
+
def ksft_pr(*objs, **kwargs):
print("#", *objs, **kwargs)
@@ -193,6 +198,19 @@ KSFT_DISRUPTIVE = True
return env
+term_cnt = 0
+
+def _ksft_intr(signum, frame):
+ # ksft runner.sh sends 2 SIGTERMs in a row on a timeout
+ # if we don't ignore the second one it will stop us from handling cleanup
+ global term_cnt
+ term_cnt += 1
+ if term_cnt == 1:
+ raise KsftTerminate()
+ else:
+ ksft_pr(f"Ignoring SIGTERM (cnt: {term_cnt}), already exiting...")
+
+
def ksft_run(cases=None, globs=None, case_pfx=None, args=()):
cases = cases or []
@@ -205,6 +223,10 @@ KSFT_DISRUPTIVE = True
cases.append(value)
break
+ global term_cnt
+ term_cnt = 0
+ prev_sigterm = signal.signal(signal.SIGTERM, _ksft_intr)
+
totals = {"pass": 0, "fail": 0, "skip": 0, "xfail": 0}
print("TAP version 13")
@@ -229,11 +251,12 @@ KSFT_DISRUPTIVE = True
cnt_key = 'xfail'
except BaseException as e:
stop |= isinstance(e, KeyboardInterrupt)
+ stop |= isinstance(e, KsftTerminate)
tb = traceback.format_exc()
for line in tb.strip().split('\n'):
ksft_pr("Exception|", line)
if stop:
- ksft_pr("Stopping tests due to KeyboardInterrupt.")
+ ksft_pr(f"Stopping tests due to {type(e).__name__}.")
KSFT_RESULT = False
cnt_key = 'fail'
@@ -248,6 +271,8 @@ KSFT_DISRUPTIVE = True
if stop:
break
+ signal.signal(signal.SIGTERM, prev_sigterm)
+
print(
f"# Totals: pass:{totals['pass']} fail:{totals['fail']} xfail:{totals['xfail']} xpass:0 skip:{totals['skip']} error:0"
)
--
2.49.0
From: Yicong Yang <yangyicong(a)hisilicon.com>
Armv8.7 introduces single-copy atomic 64-byte loads and stores
instructions and its variants named under FEAT_{LS64, LS64_V}.
Add support for Armv8.7 FEAT_{LS64, LS64_V}:
- Add identifying and enabling in the cpufeature list
- Expose the support of these features to userspace through HWCAP3
and cpuinfo
- Add related hwcap test
- Handle the trap of unsupported memory (normal/uncacheable) access in a VM
A real scenario for this feature is that the userspace driver can make use of
this to implement direct WQE (workqueue entry) - a mechanism to fill WQE
directly into the hardware.
This patchset also complement with Marc's patchset v2[1] for handling LS64*
trapped if not advertised for a VM.
[1] https://lore.kernel.org/linux-arm-kernel/20250310122505.2857610-1-maz@kerne…
Tested with updated hwcap test:
On host:
root@localhost:/tmp# dmesg | grep "All CPU(s) started"
[ 0.504846] CPU: All CPU(s) started at EL2
root@localhost:/tmp# ./hwcap
[...]
# LS64 present
ok 217 cpuinfo_match_LS64
ok 218 sigill_LS64
ok 219 # SKIP sigbus_LS64
# LS64_V present
ok 220 cpuinfo_match_LS64_V
ok 221 sigill_LS64_V
ok 222 # SKIP sigbus_LS64_V
# 115 skipped test(s) detected. Consider enabling relevant config options to improve coverage.
# Totals: pass:107 fail:0 xfail:0 xpass:0 skip:115 error:0
On guest:
root@localhost:/# dmesg | grep "All CPU(s) started"
[ 0.205580] CPU: All CPU(s) started at EL1
root@localhost:/mnt# ./hwcap
[...]
# LS64 present
ok 217 cpuinfo_match_LS64
ok 218 sigill_LS64
ok 219 # SKIP sigbus_LS64
# LS64_V present
ok 220 cpuinfo_match_LS64_V
ok 221 sigill_LS64_V
ok 222 # SKIP sigbus_LS64_V
# 115 skipped test(s) detected. Consider enabling relevant config options to improve coverage.
# Totals: pass:107 fail:0 xfail:0 xpass:0 skip:115 error:0
Change since v1:
- Drop the suppport for LS64_ACCDATA
- handle the DABT of unsupported memory type after checking the memory attributes
Link: https://lore.kernel.org/linux-arm-kernel/20241202135504.14252-1-yangyicong@…
Yicong Yang (6):
arm64: Provide basic EL2 setup for FEAT_{LS64, LS64_V} usage at EL0/1
arm64: Add support for FEAT_{LS64, LS64_V}
KVM: arm64: Enable FEAT_{LS64, LS64_V} in the supported guest
kselftest/arm64: Add HWCAP test for FEAT_{LS64, LS64_V}
arm64: Add ESR.DFSC definition of unsupported exclusive or atomic
access
KVM: arm64: Handle DABT caused by LS64* instructions on unsupported
memory
Documentation/arch/arm64/booting.rst | 12 +++
Documentation/arch/arm64/elf_hwcaps.rst | 6 ++
arch/arm64/include/asm/el2_setup.h | 12 ++-
arch/arm64/include/asm/esr.h | 8 ++
arch/arm64/include/asm/hwcap.h | 2 +
arch/arm64/include/asm/kvm_emulate.h | 7 ++
arch/arm64/include/uapi/asm/hwcap.h | 2 +
arch/arm64/kernel/cpufeature.c | 51 +++++++++++++
arch/arm64/kernel/cpuinfo.c | 2 +
arch/arm64/kvm/inject_fault.c | 35 +++++++++
arch/arm64/kvm/mmu.c | 37 +++++++++-
arch/arm64/tools/cpucaps | 2 +
tools/testing/selftests/arm64/abi/hwcap.c | 90 +++++++++++++++++++++++
13 files changed, 264 insertions(+), 2 deletions(-)
--
2.24.0
In preparation for making the kmalloc family of allocators type aware,
we need to make sure that the returned type from the allocation matches
the type of the variable being assigned. (Before, the allocator would
always return "void *", which can be implicitly cast to any pointer type.)
The assigned type is "struct kunit_suite **" but the returned type will
be "struct kunit_suite * const *". Since it isn't generally possible
to remove the const qualifier, adjust the allocation type to match
the assignment.
Signed-off-by: Kees Cook <kees(a)kernel.org>
---
Cc: Brendan Higgins <brendan.higgins(a)linux.dev>
Cc: David Gow <davidgow(a)google.com>
Cc: Rae Moar <rmoar(a)google.com>
Cc: <linux-kselftest(a)vger.kernel.org>
Cc: <kunit-dev(a)googlegroups.com>
---
lib/kunit/executor.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/lib/kunit/executor.c b/lib/kunit/executor.c
index 3f39955cb0f1..0061d4c7e351 100644
--- a/lib/kunit/executor.c
+++ b/lib/kunit/executor.c
@@ -177,7 +177,7 @@ kunit_filter_suites(const struct kunit_suite_set *suite_set,
const size_t max = suite_set->end - suite_set->start;
- copy = kcalloc(max, sizeof(*filtered.start), GFP_KERNEL);
+ copy = kcalloc(max, sizeof(*copy), GFP_KERNEL);
if (!copy) { /* won't be able to run anything, return an empty set */
return filtered;
}
--
2.34.1
From: Alexander Shatalin <sashatalin03(a)gmail.com>
This patch improves the safety when restoring WRITES_STRICT by ensuring:
- The variable is not empty (`-n`)
- The target file is writable (`-w`)
Also improves formatting by quoting variables and following shell best
practices.
Signed-off-by: Alexander Shatalin <sashatalin03(a)gmail.com>
From: Madhavan Srinivasan <maddy(a)linux.ibm.com>
Commit 50910acd6f615 ("selftests/mm: use sys_pkey helpers consistently")
added a pkey_util.c to refactor some of the protection_keys functions accessible
by other tests. But this broken the build in powerpc in two ways,
pkey-powerpc.h: In function ‘arch_is_powervm’:
pkey-powerpc.h:73:21: error: storage size of ‘buf’ isn’t known
73 | struct stat buf;
| ^~~
pkey-powerpc.h:75:14: error: implicit declaration of function ‘stat’; did you mean ‘strcat’? [-Wimplicit-function-declaration]
75 | if ((stat("/sys/firmware/devicetree/base/ibm,partition-name", &buf) == 0) &&
| ^~~~
| strcat
Since pkey_util.c includes pkeys-helper.h, which in turn includes pkeys-powerpc.h,
stat.h including is missing for "struct stat". This is fixed by adding "sys/stat.h"
in pkeys-powerpc.h
Secondly,
pkey-powerpc.h:55:18: warning: format ‘%llx’ expects argument of type ‘long long unsigned int’, but argument 3 has type ‘u64’ {aka ‘long unsigned int’} [-Wformat=]
55 | dprintf4("%s() changing %016llx to %016llx\n",
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
56 | __func__, __read_pkey_reg(), pkey_reg);
| ~~~~~~~~~~~~~~~~~
| |
| u64 {aka long unsigned int}
pkey-helpers.h:63:32: note: in definition of macro ‘dprintf_level’
63 | sigsafe_printf(args); \
| ^~~~
These format specifier related warning are removed by adding
"__SANE_USERSPACE_TYPES__" to pkeys_utils.c.
Fixes: 50910acd6f615 ("selftests/mm: use sys_pkey helpers consistently")
Signed-off-by: Madhavan Srinivasan <maddy(a)linux.ibm.com>
Signed-off-by: Nysal Jan K.A. <nysal(a)linux.ibm.com>
---
tools/testing/selftests/mm/pkey-powerpc.h | 2 ++
tools/testing/selftests/mm/pkey_util.c | 1 +
2 files changed, 3 insertions(+)
diff --git a/tools/testing/selftests/mm/pkey-powerpc.h b/tools/testing/selftests/mm/pkey-powerpc.h
index 1bad310d282a..d8ec906b8120 100644
--- a/tools/testing/selftests/mm/pkey-powerpc.h
+++ b/tools/testing/selftests/mm/pkey-powerpc.h
@@ -3,6 +3,8 @@
#ifndef _PKEYS_POWERPC_H
#define _PKEYS_POWERPC_H
+#include <sys/stat.h>
+
#ifndef SYS_pkey_alloc
# define SYS_pkey_alloc 384
# define SYS_pkey_free 385
diff --git a/tools/testing/selftests/mm/pkey_util.c b/tools/testing/selftests/mm/pkey_util.c
index ca4ad0d44ab2..255b332f7a08 100644
--- a/tools/testing/selftests/mm/pkey_util.c
+++ b/tools/testing/selftests/mm/pkey_util.c
@@ -1,4 +1,5 @@
// SPDX-License-Identifier: GPL-2.0-only
+#define __SANE_USERSPACE_TYPES__
#include <sys/syscall.h>
#include <unistd.h>
--
2.47.0
A few functions used by different selftests.
Adding them now avoids later conflicts between different selftest serieses.
Also add full support for nolibc-test.c on riscv32.
All unsupported syscalls have been replaced.
Signed-off-by: Thomas Weißschuh <thomas.weissschuh(a)linutronix.de>
---
Changes in v2:
- Rebase onto latest nolibc next branch
- Move #include "nolibc.h" to the top of the headers
- Include linux/random.h from sys/random.h
- Don't block on missing entropy in getrandom() selftest
- Also test negative result of difftime()
- Simplify fopen()
- Link to v1: https://lore.kernel.org/r/20250423-nolibc-misc-v1-0-a925bf40297b@linutronix…
---
Thomas Weißschuh (15):
tools/nolibc: add strstr()
tools/nolibc: add %m printf format
tools/nolibc: add more stat() variants
tools/nolibc: add mremap()
tools/nolibc: add getrandom()
tools/nolibc: add abs() and friends
tools/nolibc: add support for access() and faccessat()
tools/nolibc: add clock_getres(), clock_gettime() and clock_settime()
tools/nolibc: add timer functions
tools/nolibc: add timerfd functionality
tools/nolibc: add difftime()
tools/nolibc: add namespace functionality
tools/nolibc: add fopen()
tools/nolibc: fall back to sys_clock_gettime() in gettimeofday()
tools/nolibc: implement wait() in terms of waitpid()
tools/include/nolibc/Makefile | 4 +
tools/include/nolibc/math.h | 31 ++++
tools/include/nolibc/nolibc.h | 4 +
tools/include/nolibc/sched.h | 50 +++++
tools/include/nolibc/stdio.h | 27 +++
tools/include/nolibc/stdlib.h | 18 ++
tools/include/nolibc/string.h | 20 ++
tools/include/nolibc/sys/mman.h | 19 ++
tools/include/nolibc/sys/random.h | 34 ++++
tools/include/nolibc/sys/stat.h | 25 ++-
tools/include/nolibc/sys/time.h | 15 +-
tools/include/nolibc/sys/timerfd.h | 87 +++++++++
tools/include/nolibc/sys/wait.h | 12 +-
tools/include/nolibc/time.h | 185 ++++++++++++++++++
tools/include/nolibc/types.h | 3 +
tools/include/nolibc/unistd.h | 28 +++
tools/testing/selftests/nolibc/Makefile | 2 +
tools/testing/selftests/nolibc/nolibc-test.c | 268 ++++++++++++++++++++++++++-
18 files changed, 820 insertions(+), 12 deletions(-)
---
base-commit: 2051d3b830c0889ae55e37e9e8ff0d43a4acd482
change-id: 20250415-nolibc-misc-f2548ccc5ce1
Best regards,
--
Thomas Weißschuh <thomas.weissschuh(a)linutronix.de>
The ID_AA64PFR1_EL1.MTE_frac field is currently hidden from KVM.
However, when ID_AA64PFR1_EL1.MTE==2, ID_AA64PFR1_EL1.MTE_frac==0
indicates that MTE_ASYNC is supported. On a host with
ID_AA64PFR1_EL1.MTE==2 but without MTE_ASYNC support a guest with the
MTE capability enabled will incorrectly see MTE_ASYNC advertised as
supported. This series fixes that.
This was found by inspection and the current behaviour is not known to
break anything. Linux doesn't check MTE_frac, and wrongly, assumes
MTE async faults can be generated whenever MTE is supported. This is
a separate problem and not addressed here.
I am looking for feedback on whether this change is valuable or
otherwise.
Ben Horgan (3):
arm64/sysreg: Expose MTE_frac so that it is visible to KVM
KVM: arm64: Make MTE_frac masking conditional on MTE capability
KVM: selftests: Confirm exposing MTE_frac does not break migration
arch/arm64/kernel/cpufeature.c | 1 +
arch/arm64/kvm/sys_regs.c | 26 ++++++-
.../testing/selftests/kvm/arm64/set_id_regs.c | 77 ++++++++++++++++++-
3 files changed, 101 insertions(+), 3 deletions(-)
base-commit: 8ffd015db85fea3e15a77027fda6c02ced4d2444
--
2.43.0