This patch improves the clarity and grammar of output messages in the acct()
selftest. Minor changes were made to user-facing strings and comments to make
them easier to understand and more consistent with the kselftest style.
Changes include:
- Fixing grammar in printed messages and comments.
- Rewording error and success outputs for clarity and professionalism.
- Making the "root check" message more direct.
These updates improve readability without affecting test logic or behavior.
Signed-off-by: Abdelrahman Fekry <Abdelrahmanfekry375(a)gmail.com>
---
tools/testing/selftests/acct/acct_syscall.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/tools/testing/selftests/acct/acct_syscall.c b/tools/testing/selftests/acct/acct_syscall.c
index 87c044fb9293..2c120a527574 100644
--- a/tools/testing/selftests/acct/acct_syscall.c
+++ b/tools/testing/selftests/acct/acct_syscall.c
@@ -22,9 +22,9 @@ int main(void)
ksft_print_header();
ksft_set_plan(1);
- // Check if test is run a root
+ // Check if test is run as root
if (geteuid()) {
- ksft_exit_skip("This test needs root to run!\n");
+ ksft_exit_skip("This test must be run as root!\n");
return 1;
}
@@ -52,7 +52,7 @@ int main(void)
child_pid = fork();
if (child_pid < 0) {
- ksft_test_result_error("Creating a child process to log failed\n");
+ ksft_test_result_error("Failed to create child process for logging\n");
acct(NULL);
return 1;
} else if (child_pid > 0) {
--
2.25.1
The bulk of these changes modify the cow and gup_longterm tests to
report unique and stable names for each test, bringing them into line
with the expectations of tooling that works with kselftest. The string
reported as a test result is used by tooling to both deduplicate tests
and track tests between test runs, using the same string for multiple
tests or changing the string depending on test result causes problems
for user interfaces and automation such as bisection.
It was suggested that converting to use kselftest_harness.h would be a
good way of addressing this, however that really wants the set of tests
to run to be known at compile time but both test programs dynamically
enumarate the set of huge page sizes the system supports and test each.
Refactoring to handle this would be even more invasive than these
changes which are large but straightforward and repetitive.
A version of the main gup_longterm cleanup was previously sent
separately, this version factors out the helpers for logging the start
of the test since the cow test looks very similar.
Signed-off-by: Mark Brown <broonie(a)kernel.org>
---
Mark Brown (4):
selftests/mm: Use standard ksft_finished() in cow and gup_longterm
selftest/mm: Add helper for logging test start and results
selftests/mm: Report unique test names for each cow test
selftests/mm: Fix test result reporting in gup_longterm
tools/testing/selftests/mm/cow.c | 340 +++++++++++++++++++-----------
tools/testing/selftests/mm/gup_longterm.c | 158 ++++++++------
tools/testing/selftests/mm/vm_util.h | 20 ++
3 files changed, 334 insertions(+), 184 deletions(-)
---
base-commit: a5806cd506af5a7c19bcd596e4708b5c464bfd21
change-id: 20250521-selftests-mm-cow-dedupe-33dcab034558
Best regards,
--
Mark Brown <broonie(a)kernel.org>
Hello,
this is the v2 of the many args series for arm64, being itself a revival
of Xu Kuhoai's work to enable larger arguments count for BPF programs on
ARM64 ([1]).
The discussions in v1 shed some light on some issues around specific
cases, for example with functions passing struct on stack with custom
packing/alignment attributes: those cases can not be properly detected
with the current BTF info. So this new revision aims to separate
concerns with a simpler implementation, just accepting additional args
on stack if we can make sure about the alignment constraints (and so,
refusing attachment to functions passing structs on stacks). I then
checked if the specific alignment constraints could be checked with
larger scalar types rather than structs, but it appears that this use
case is in fact rejected at the verifier level (see a9b59159d338 ("bpf:
Do not allow btf_ctx_access with __int128 types")). So in the end the
specific alignment corner cases raised in [1] can not really happen in
the kernel in its current state. This new revision still brings support
for the standard cases as a first step, it will then be possible to
iterate on top of it to add the more specific cases like struct passed
on stack and larger types.
[1] https://lore.kernel.org/all/20230917150752.69612-1-xukuohai@huaweicloud.com…
Signed-off-by: Alexis Lothoré (eBPF Foundation) <alexis.lothore(a)bootlin.com>
---
Changes in v2:
- remove alignment computation from btf.c
- deduce alignment constraints directly in jit compiler for simple types
- deny attachment to functions with "corner-cases" arguments (ie:
structs on stack)
- remove custom tests, as the corresponding use cases are locked either
by the JIT comp or the verifier
- drop RFC
- Link to v1: https://lore.kernel.org/r/20250411-many_args_arm64-v1-0-0a32fe72339e@bootli…
---
Alexis Lothoré (eBPF Foundation) (1):
selftests/bpf: enable many-args tests for arm64
Xu Kuohai (1):
bpf, arm64: Support up to 12 function arguments
arch/arm64/net/bpf_jit_comp.c | 234 ++++++++++++++++++++-------
tools/testing/selftests/bpf/DENYLIST.aarch64 | 2 -
2 files changed, 180 insertions(+), 56 deletions(-)
---
base-commit: 9435138c069117cd59a4912b5ea2ae44cc2c5ffa
change-id: 20250220-many_args_arm64-8bd3747e6948
Best regards,
--
Alexis Lothoré, Bootlin
Embedded Linux and Kernel engineering
https://bootlin.com
Corrects a spelling mistake "memebers" instead of "members" in
tools/testing/selftests/filesystems/mount-notify/mount-notify_test.c
Signed-off-by: Hendrik Hamerlinck <hendrik.hamerlinck(a)hammernet.be>
---
.../selftests/filesystems/mount-notify/mount-notify_test.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/tools/testing/selftests/filesystems/mount-notify/mount-notify_test.c b/tools/testing/selftests/filesystems/mount-notify/mount-notify_test.c
index 59a71f22fb11..af2b61224a61 100644
--- a/tools/testing/selftests/filesystems/mount-notify/mount-notify_test.c
+++ b/tools/testing/selftests/filesystems/mount-notify/mount-notify_test.c
@@ -230,7 +230,7 @@ static void verify_mount_ids(struct __test_metadata *const _metadata,
}
}
}
- // Check that all list1 memebers can be found in list2. Together with
+ // Check that all list1 members can be found in list2. Together with
// the above it means that the list1 and list2 represent the same sets.
for (i = 0; i < num; i++) {
for (j = 0; j < num; j++) {
--
2.43.0
This patch set convert the wireguard selftest to nftables, as iptables is
deparated and nftables is the default framework of most releases.
v7: re-post, no update.
v6: fix typo in patch 1/2. Update the description (Phil Sutter)
v5: remove the counter in nft rules and link nft statically (Jason A.
Donenfeld)
v4: no update, just re-send
v3: drop iptables directly (Jason A. Donenfeld)
Also convert to using nft for qemu testing (Jason A. Donenfeld)
v2: use one nft table for testing (Phil Sutter)
Hangbin Liu (2):
wireguard: selftests: convert iptables to nft
wireguard: selftests: update to using nft for qemu test
tools/testing/selftests/wireguard/netns.sh | 29 +++++++++------
.../testing/selftests/wireguard/qemu/Makefile | 36 ++++++++++++++-----
.../selftests/wireguard/qemu/kernel.config | 7 ++--
3 files changed, 49 insertions(+), 23 deletions(-)
--
2.46.0
Hi Linus,
Please pull the following kselftest next update for Linux 6.16-rc1.
-- Fixes
- cpufreq test to not double suspend in rtcwake case.
- compile error in pid_namespace test.
- run_kselftest.sh to use readlink if realpath is not available.
- cpufreq basic read and update testcases.
- ftrace to add poll to a gen_file so test can find it at run-time.
- spelling errors in perf_events test.
diff is attached.
thanks,
-- Shuah
----------------------------------------------------------------
The following changes since commit b4432656b36e5cc1d50a1f2dc15357543add530e:
Linux 6.15-rc4 (2025-04-27 15:19:23 -0700)
are available in the Git repository at:
git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest tags/linux_kselftest-next-6.16-rc1
for you to fetch changes up to 1107dc4c5b06188a3fb4897ceb197eb320a52e85:
selftests/run_kselftest.sh: Use readlink if realpath is not available (2025-05-15 16:52:47 -0600)
----------------------------------------------------------------
linux_kselftest-next-6.16-rc1
-- Fixes
- cpufreq test to not double suspend in rtcwake case.
- compile error in pid_namespace test.
- run_kselftest.sh to use readlink if realpath is not available.
- cpufreq basic read and update testcases.
- ftrace to add poll to a gen_file so test can find it at run-time.
- spelling errors in perf_events test.
----------------------------------------------------------------
Ayush Jain (1):
selftests/ftrace: Convert poll to a gen_file
Colin Ian King (1):
selftests/perf_events: Fix spelling mistake "sycnhronize" -> "synchronize"
Nícolas F. R. A. Prado (1):
kselftest: cpufreq: Get rid of double suspend in rtcwake case
Peter Seiderer (1):
selftests: pid_namespace: add missing sys/mount.h include in pid_max.c
Swapnil Sapkal (1):
selftests/cpufreq: Fix cpufreq basic read and update testcases
Thomas Weißschuh (3):
selftests/timens: Print TAP headers
selftests/timens: Make run_tests() functions static
selftests/timens: timerfd: Use correct clockid type in tclock_gettime()
Yosry Ahmed (1):
selftests/run_kselftest.sh: Use readlink if realpath is not available
tools/testing/selftests/cpufreq/cpufreq.sh | 18 +++++++++++++-----
tools/testing/selftests/ftrace/Makefile | 2 +-
tools/testing/selftests/perf_events/watermark_signal.c | 2 +-
tools/testing/selftests/pid_namespace/pid_max.c | 1 +
tools/testing/selftests/run_kselftest.sh | 9 ++++++++-
tools/testing/selftests/timens/clock_nanosleep.c | 4 +++-
tools/testing/selftests/timens/exec.c | 2 ++
tools/testing/selftests/timens/futex.c | 2 ++
tools/testing/selftests/timens/gettime_perf.c | 2 ++
tools/testing/selftests/timens/procfs.c | 2 ++
tools/testing/selftests/timens/timens.c | 2 ++
tools/testing/selftests/timens/timer.c | 4 +++-
tools/testing/selftests/timens/timerfd.c | 6 ++++--
tools/testing/selftests/timens/vfork_exec.c | 2 ++
14 files changed, 46 insertions(+), 12 deletions(-)
----------------------------------------------------------------
Hi Linus,
Please pull the following kunit next update for Linux 6.16-rc1.
- Enables qemu_config for riscv32, sparc 64-bit, PowerPC 32-bit BE and
64-bit LE.
- Enables CONFIG_SPARC32 to clearly differentiate between sparc 32-bit
and 64-bit configurations.
- Enables CONFIG_CPU_BIG_ENDIAN to clearly differentiate between powerpc
LE and BE configurations.
- Adds feature to list available architectures to kunit tool.
- Fixes to bugs and changes to documentation.
diff is attached.
thanks,
-- Shuah
----------------------------------------------------------------
The following changes since commit 8ffd015db85fea3e15a77027fda6c02ced4d2444:
Linux 6.15-rc2 (2025-04-13 11:54:49 -0700)
are available in the Git repository at:
git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest tags/linux_kselftest-kunit-6.16-rc1
for you to fetch changes up to 772e50a76ee664e75581624f512df4e45582605a:
kunit: Fix wrong parameter to kunit_deactivate_static_stub() (2025-05-21 09:51:23 -0600)
----------------------------------------------------------------
linux_kselftest-kunit-6.16-rc1
- Enables qemu_config for riscv32, sparc 64-bit, PowerPC 32-bit BE and
64-bit LE.
- Enables CONFIG_SPARC32 to clearly differentiate between sparc 32-bit
and 64-bit configurations.
- Enables CONFIG_CPU_BIG_ENDIAN to clearly differentiate between powerpc
LE and BE configurations.
- Add feature to list available architectures to kunit tool.
- Fixes to bugs and changes to documentation.
----------------------------------------------------------------
David Gow (1):
kunit: qemu_configs: Disable faulting tests on 32-bit SPARC
Kees Cook (1):
kunit: executor: Remove const from kunit_filter_suites() allocation type
Rae Moar (2):
Documentation: kunit: improve example on testing static functions
kunit: tool: add test counts to JSON output
Richard Fitzgerald (1):
kunit: configs: Enable CONFIG_INIT_STACK_ALL_PATTERN in all_tests
Thomas Weißschuh (6):
kunit: qemu_configs: Add riscv32 config
kunit: tool: Implement listing of available architectures
kunit: qemu_configs: powerpc: Explicitly enable CONFIG_CPU_BIG_ENDIAN=y
kunit: qemu_configs: Add PowerPC 32-bit BE and 64-bit LE
kunit: qemu_configs: sparc: Explicitly enable CONFIG_SPARC32=y
kunit: qemu_configs: Add 64-bit SPARC configuration
Tzung-Bi Shih (1):
kunit: Fix wrong parameter to kunit_deactivate_static_stub()
Documentation/dev-tools/kunit/run_wrapper.rst | 2 ++
Documentation/dev-tools/kunit/usage.rst | 38 +++++++++++++++++++++------
lib/kunit/executor.c | 2 +-
lib/kunit/static_stub.c | 2 +-
tools/testing/kunit/configs/all_tests.config | 1 +
tools/testing/kunit/kunit_json.py | 10 +++++++
tools/testing/kunit/kunit_kernel.py | 8 ++++++
tools/testing/kunit/qemu_configs/powerpc.py | 1 +
tools/testing/kunit/qemu_configs/powerpc32.py | 17 ++++++++++++
tools/testing/kunit/qemu_configs/powerpcle.py | 14 ++++++++++
tools/testing/kunit/qemu_configs/riscv32.py | 17 ++++++++++++
tools/testing/kunit/qemu_configs/sparc.py | 2 ++
tools/testing/kunit/qemu_configs/sparc64.py | 16 +++++++++++
13 files changed, 120 insertions(+), 10 deletions(-)
create mode 100644 tools/testing/kunit/qemu_configs/powerpc32.py
create mode 100644 tools/testing/kunit/qemu_configs/powerpcle.py
create mode 100644 tools/testing/kunit/qemu_configs/riscv32.py
create mode 100644 tools/testing/kunit/qemu_configs/sparc64.py
----------------------------------------------------------------
The madv_populate selftest has some repetitive code for several different
cases that it covers, included repeated test names used in ksft_test_result()
reports. This causes problems for automation, the test name is used to both
track the test between runs and distinguish between multiple tests within
the same run. Fix this by tweaking the messages with duplication to be more
specific about the contexts they're in.
Signed-off-by: Mark Brown <broonie(a)kernel.org>
---
tools/testing/selftests/mm/madv_populate.c | 18 +++++++++---------
1 file changed, 9 insertions(+), 9 deletions(-)
diff --git a/tools/testing/selftests/mm/madv_populate.c b/tools/testing/selftests/mm/madv_populate.c
index ef7d911da13e..b6fabd5c27ed 100644
--- a/tools/testing/selftests/mm/madv_populate.c
+++ b/tools/testing/selftests/mm/madv_populate.c
@@ -172,12 +172,12 @@ static void test_populate_read(void)
if (addr == MAP_FAILED)
ksft_exit_fail_msg("mmap failed\n");
ksft_test_result(range_is_not_populated(addr, SIZE),
- "range initially not populated\n");
+ "read range initially not populated\n");
ret = madvise(addr, SIZE, MADV_POPULATE_READ);
ksft_test_result(!ret, "MADV_POPULATE_READ\n");
ksft_test_result(range_is_populated(addr, SIZE),
- "range is populated\n");
+ "read range is populated\n");
munmap(addr, SIZE);
}
@@ -194,12 +194,12 @@ static void test_populate_write(void)
if (addr == MAP_FAILED)
ksft_exit_fail_msg("mmap failed\n");
ksft_test_result(range_is_not_populated(addr, SIZE),
- "range initially not populated\n");
+ "write range initially not populated\n");
ret = madvise(addr, SIZE, MADV_POPULATE_WRITE);
ksft_test_result(!ret, "MADV_POPULATE_WRITE\n");
ksft_test_result(range_is_populated(addr, SIZE),
- "range is populated\n");
+ "write range is populated\n");
munmap(addr, SIZE);
}
@@ -247,19 +247,19 @@ static void test_softdirty(void)
/* Clear any softdirty bits. */
clear_softdirty();
ksft_test_result(range_is_not_softdirty(addr, SIZE),
- "range is not softdirty\n");
+ "cleared range is not softdirty\n");
/* Populating READ should set softdirty. */
ret = madvise(addr, SIZE, MADV_POPULATE_READ);
- ksft_test_result(!ret, "MADV_POPULATE_READ\n");
+ ksft_test_result(!ret, "softdirty MADV_POPULATE_READ\n");
ksft_test_result(range_is_not_softdirty(addr, SIZE),
- "range is not softdirty\n");
+ "range is not softdirty after MADV_POPULATE_READ\n");
/* Populating WRITE should set softdirty. */
ret = madvise(addr, SIZE, MADV_POPULATE_WRITE);
- ksft_test_result(!ret, "MADV_POPULATE_WRITE\n");
+ ksft_test_result(!ret, "softdirty MADV_POPULATE_WRITE\n");
ksft_test_result(range_is_softdirty(addr, SIZE),
- "range is softdirty\n");
+ "range is softdirty after MADV_POPULATE_WRITE \n");
munmap(addr, SIZE);
}
---
base-commit: a5806cd506af5a7c19bcd596e4708b5c464bfd21
change-id: 20250521-selftests-mm-madv-populate-dedupe-95faf16c3c8f
Best regards,
--
Mark Brown <broonie(a)kernel.org>
Mark Rutland's recent SME fixes updated the SME ABI to reject any
attempt to write FPSIMD register data via the streaming mode SVE
register set but did not update the sve-ptrace test to take account of
this, resulting in spurious failures. Update the test for this, and
also fix another preexisting issue I noticed while looking at this.
Signed-off-by: Mark Brown <broonie(a)kernel.org>
---
Mark Brown (3):
kselftest/arm64: Fix check for setting new VLs in sve-ptrace
kselftest/arm64: Fix test for streaming FPSIMD write in sve-ptrace
kselftest/arm64: Specify SVE data when testing VL set in sve-ptrace
tools/testing/selftests/arm64/fp/sve-ptrace.c | 9 ++++++---
1 file changed, 6 insertions(+), 3 deletions(-)
---
base-commit: 1c1abfd151c824698830ee900cc8d9f62e9a5fbb
change-id: 20250523-kselftest-arm64-ssve-fixups-b68ae61c1ebf
Best regards,
--
Mark Brown <broonie(a)kernel.org>
In cpufreq basic selftests, one of the testcases is to read all cpufreq
sysfs files and print the values. This testcase assumes all the cpufreq
sysfs files have read permissions. However certain cpufreq sysfs files
(eg. stats/reset) are write only files and this testcase errors out
when it is not able to read the file.
Similarily, there is one more testcase which reads the cpufreq sysfs
file data and write it back to same file. This testcase also errors out
for sysfs files without read permission.
Fix these testcases by adding proper read permission checks.
Reported-by: Narasimhan V <narasimhan.v(a)amd.com>
Signed-off-by: Swapnil Sapkal <swapnil.sapkal(a)amd.com>
---
tools/testing/selftests/cpufreq/cpufreq.sh | 15 +++++++++++----
1 file changed, 11 insertions(+), 4 deletions(-)
diff --git a/tools/testing/selftests/cpufreq/cpufreq.sh b/tools/testing/selftests/cpufreq/cpufreq.sh
index e350c521b467..3484fa34e8d8 100755
--- a/tools/testing/selftests/cpufreq/cpufreq.sh
+++ b/tools/testing/selftests/cpufreq/cpufreq.sh
@@ -52,7 +52,14 @@ read_cpufreq_files_in_dir()
for file in $files; do
if [ -f $1/$file ]; then
printf "$file:"
- cat $1/$file
+ #file is readable ?
+ local rfile=$(ls -l $1/$file | awk '$1 ~ /^.*r.*/ { print $NF; }')
+
+ if [ ! -z $rfile ]; then
+ cat $1/$file
+ else
+ printf "$file is not readable\n"
+ fi
else
printf "\n"
read_cpufreq_files_in_dir "$1/$file"
@@ -83,10 +90,10 @@ update_cpufreq_files_in_dir()
for file in $files; do
if [ -f $1/$file ]; then
- # file is writable ?
- local wfile=$(ls -l $1/$file | awk '$1 ~ /^.*w.*/ { print $NF; }')
+ # file is readable and writable ?
+ local rwfile=$(ls -l $1/$file | awk '$1 ~ /^.*rw.*/ { print $NF; }')
- if [ ! -z $wfile ]; then
+ if [ ! -z $rwfile ]; then
# scaling_setspeed is a special file and we
# should skip updating it
if [ $file != "scaling_setspeed" ]; then
--
2.43.0
With joint effort from the upstream KVM community, we come up with the
4th version of mediated vPMU for x86. We have made the following changes
on top of the previous RFC v3.
v3 -> v4
- Rebase whole patchset on 6.14-rc3 base.
- Address Peter's comments on Perf part.
- Address Sean's comments on KVM part.
* Change key word "passthrough" to "mediated" in all patches
* Change static enabling to user space dynamic enabling via KVM_CAP_PMU_CAPABILITY.
* Only support GLOBAL_CTRL save/restore with VMCS exec_ctrl, drop the MSR
save/retore list support for GLOBAL_CTRL, thus the support of mediated
vPMU is constrained to SapphireRapids and later CPUs on Intel side.
* Merge some small changes into a single patch.
- Address Sandipan's comment on invalid pmu pointer.
- Add back "eventsel_hw" and "fixed_ctr_ctrl_hw" to avoid to directly
manipulate pmc->eventsel and pmu->fixed_ctr_ctrl.
Testing (Intel side):
- Perf-based legacy vPMU (force emulation on/off)
* Kselftests pmu_counters_test, pmu_event_filter_test and
vmx_pmu_caps_test pass.
* KUT PMU tests pmu, pmu_lbr, pmu_pebs pass.
* Basic perf counting/sampling tests in 3 scenarios, guest-only,
host-only and host-guest coexistence all pass.
- Mediated vPMU (force emulation on/off)
* Kselftests pmu_counters_test, pmu_event_filter_test and
vmx_pmu_caps_test pass.
* KUT PMU tests pmu, pmu_lbr, pmu_pebs pass.
* Basic perf counting/sampling tests in 3 scenarios, guest-only,
host-only and host-guest coexistence all pass.
- Failures. All above tests passed on Intel Granite Rapids as well
except a failure on KUT/pmu_pebs.
* GP counter 0 (0xfffffffffffe): PEBS record (written seq 0)
is verified (including size, counters and cfg).
* The pebs_data_cfg (0xb500000000) doesn't match with the
effective MSR_PEBS_DATA_CFG (0x0).
* This failure has nothing to do with this mediated vPMU patch set. The
failure is caused by Granite Rapids supported timed PEBS which needs
extra support on Qemu and KUT/pmu_pebs. These extra support would be
sent in separate patches later.
Testing (AMD side):
- Kselftests pmu_counters_test, pmu_event_filter_test and
vmx_pmu_caps_test all pass
- legacy guest with KUT/pmu:
* qmeu option: -cpu host, -perfctr-core
* when set force_emulation_prefix=1, passes
* when set force_emulation_prefix=0, passes
- perfmon-v1 guest with KUT/pmu:
* qmeu option: -cpu host, -perfmon-v2
* when set force_emulation_prefix=1, passes
* when set force_emulation_prefix=0, passes
- perfmon-v2 guest with KUT/pmu:
* qmeu option: -cpu host
* when set force_emulation_prefix=1, passes
* when set force_emulation_prefix=0, passes
- perf_fuzzer (perfmon-v2):
* fails with soft lockup in guest in current version.
* culprit could be between 6.13 ~ 6.14-rc3 within KVM
* Series tested on 6.12 and 6.13 without issue.
Note: a QEMU series is needed to run mediated vPMU v4:
- https://lore.kernel.org/all/20250324123712.34096-1-dapeng1.mi@linux.intel.c…
History:
- RFC v3: https://lore.kernel.org/all/20240801045907.4010984-1-mizhang@google.com/
- RFC v2: https://lore.kernel.org/all/20240506053020.3911940-1-mizhang@google.com/
- RFC v1: https://lore.kernel.org/all/20240126085444.324918-1-xiong.y.zhang@linux.int…
Dapeng Mi (18):
KVM: x86/pmu: Introduce enable_mediated_pmu global parameter
KVM: x86/pmu: Check PMU cpuid configuration from user space
KVM: x86: Rename vmx_vmentry/vmexit_ctrl() helpers
KVM: x86/pmu: Add perf_capabilities field in struct kvm_host_values{}
KVM: x86/pmu: Move PMU_CAP_{FW_WRITES,LBR_FMT} into msr-index.h header
KVM: VMX: Add macros to wrap around
{secondary,tertiary}_exec_controls_changebit()
KVM: x86/pmu: Check if mediated vPMU can intercept rdpmc
KVM: x86/pmu/vmx: Save/load guest IA32_PERF_GLOBAL_CTRL with
vm_exit/entry_ctrl
KVM: x86/pmu: Optimize intel/amd_pmu_refresh() helpers
KVM: x86/pmu: Setup PMU MSRs' interception mode
KVM: x86/pmu: Handle PMU MSRs interception and event filtering
KVM: x86/pmu: Switch host/guest PMU context at vm-exit/vm-entry
KVM: x86/pmu: Handle emulated instruction for mediated vPMU
KVM: nVMX: Add macros to simplify nested MSR interception setting
KVM: selftests: Add mediated vPMU supported for pmu tests
KVM: Selftests: Support mediated vPMU for vmx_pmu_caps_test
KVM: Selftests: Fix pmu_counters_test error for mediated vPMU
KVM: x86/pmu: Expose enable_mediated_pmu parameter to user space
Kan Liang (8):
perf: Support get/put mediated PMU interfaces
perf: Skip pmu_ctx based on event_type
perf: Clean up perf ctx time
perf: Add a EVENT_GUEST flag
perf: Add generic exclude_guest support
perf: Add switch_guest_ctx() interface
perf/x86: Support switch_guest_ctx interface
perf/x86/intel: Support PERF_PMU_CAP_MEDIATED_VPMU
Mingwei Zhang (5):
perf/x86: Forbid PMI handler when guest own PMU
perf/x86/core: Plumb mediated PMU capability from x86_pmu to
x86_pmu_cap
KVM: x86/pmu: Exclude PMU MSRs in vmx_get_passthrough_msr_slot()
KVM: x86/pmu: introduce eventsel_hw to prepare for pmu event filtering
KVM: nVMX: Add nested virtualization support for mediated PMU
Sandipan Das (4):
perf/x86/core: Do not set bit width for unavailable counters
KVM: x86/pmu: Add AMD PMU registers to direct access list
KVM: x86/pmu/svm: Set GuestOnly bit and clear HostOnly bit when guest
write to event selectors
perf/x86/amd: Support PERF_PMU_CAP_MEDIATED_VPMU for AMD host
Xiong Zhang (3):
x86/irq: Factor out common code for installing kvm irq handler
perf: core/x86: Register a new vector for KVM GUEST PMI
KVM: x86/pmu: Register KVM_GUEST_PMI_VECTOR handler
arch/x86/events/amd/core.c | 2 +
arch/x86/events/core.c | 40 +-
arch/x86/events/intel/core.c | 5 +
arch/x86/include/asm/hardirq.h | 1 +
arch/x86/include/asm/idtentry.h | 1 +
arch/x86/include/asm/irq.h | 2 +-
arch/x86/include/asm/irq_vectors.h | 5 +-
arch/x86/include/asm/kvm-x86-pmu-ops.h | 2 +
arch/x86/include/asm/kvm_host.h | 10 +
arch/x86/include/asm/msr-index.h | 18 +-
arch/x86/include/asm/perf_event.h | 1 +
arch/x86/include/asm/vmx.h | 1 +
arch/x86/kernel/idt.c | 1 +
arch/x86/kernel/irq.c | 39 +-
arch/x86/kvm/cpuid.c | 15 +
arch/x86/kvm/pmu.c | 254 ++++++++-
arch/x86/kvm/pmu.h | 45 ++
arch/x86/kvm/svm/pmu.c | 148 ++++-
arch/x86/kvm/svm/svm.c | 26 +
arch/x86/kvm/svm/svm.h | 2 +-
arch/x86/kvm/vmx/capabilities.h | 11 +-
arch/x86/kvm/vmx/nested.c | 68 ++-
arch/x86/kvm/vmx/pmu_intel.c | 224 ++++++--
arch/x86/kvm/vmx/vmx.c | 89 +--
arch/x86/kvm/vmx/vmx.h | 11 +-
arch/x86/kvm/x86.c | 63 ++-
arch/x86/kvm/x86.h | 2 +
include/linux/perf_event.h | 47 +-
kernel/events/core.c | 519 ++++++++++++++----
.../beauty/arch/x86/include/asm/irq_vectors.h | 5 +-
.../selftests/kvm/include/kvm_test_harness.h | 13 +
.../testing/selftests/kvm/include/kvm_util.h | 3 +
.../selftests/kvm/include/x86/processor.h | 8 +
tools/testing/selftests/kvm/lib/kvm_util.c | 23 +
.../selftests/kvm/x86/pmu_counters_test.c | 24 +-
.../selftests/kvm/x86/pmu_event_filter_test.c | 8 +-
.../selftests/kvm/x86/vmx_pmu_caps_test.c | 2 +-
37 files changed, 1480 insertions(+), 258 deletions(-)
base-commit: 0ad2507d5d93f39619fc42372c347d6006b64319
--
2.49.0.395.g12beb8f557-goog
Add small grammar fixes in perf events and Real Time Clock tests'
output messages.
Include braces around a single if statement, when there are multiple
statements in the else branch, to align with the kernel coding style.
Signed-off-by: Hanne-Lotta Mäenpää <hannelotta(a)gmail.com>
---
Notes:
v1 -> v2: Improved wording in RTC tests based on feedback from
Alexandre Belloni <alexandre.belloni(a)bootlin.com>
tools/testing/selftests/perf_events/watermark_signal.c | 7 ++++---
tools/testing/selftests/rtc/rtctest.c | 10 +++++-----
2 files changed, 9 insertions(+), 8 deletions(-)
diff --git a/tools/testing/selftests/perf_events/watermark_signal.c b/tools/testing/selftests/perf_events/watermark_signal.c
index 49dc1e831174..6176afd4950b 100644
--- a/tools/testing/selftests/perf_events/watermark_signal.c
+++ b/tools/testing/selftests/perf_events/watermark_signal.c
@@ -65,8 +65,9 @@ TEST(watermark_signal)
child = fork();
EXPECT_GE(child, 0);
- if (child == 0)
+ if (child == 0) {
do_child();
+ }
else if (child < 0) {
perror("fork()");
goto cleanup;
@@ -75,7 +76,7 @@ TEST(watermark_signal)
if (waitpid(child, &child_status, WSTOPPED) != child ||
!(WIFSTOPPED(child_status) && WSTOPSIG(child_status) == SIGSTOP)) {
fprintf(stderr,
- "failed to sycnhronize with child errno=%d status=%x\n",
+ "failed to synchronize with child errno=%d status=%x\n",
errno,
child_status);
goto cleanup;
@@ -84,7 +85,7 @@ TEST(watermark_signal)
fd = syscall(__NR_perf_event_open, &attr, child, -1, -1,
PERF_FLAG_FD_CLOEXEC);
if (fd < 0) {
- fprintf(stderr, "failed opening event %llx\n", attr.config);
+ fprintf(stderr, "failed to setup performance monitoring %llx\n", attr.config);
goto cleanup;
}
diff --git a/tools/testing/selftests/rtc/rtctest.c b/tools/testing/selftests/rtc/rtctest.c
index be175c0e6ae3..930bf0ce4fa6 100644
--- a/tools/testing/selftests/rtc/rtctest.c
+++ b/tools/testing/selftests/rtc/rtctest.c
@@ -138,10 +138,10 @@ TEST_F_TIMEOUT(rtc, date_read_loop, READ_LOOP_DURATION_SEC + 2) {
rtc_read = rtc_time_to_timestamp(&rtc_tm);
/* Time should not go backwards */
ASSERT_LE(prev_rtc_read, rtc_read);
- /* Time should not increase more then 1s at a time */
+ /* Time should not increase more than 1s per read */
ASSERT_GE(prev_rtc_read + 1, rtc_read);
- /* Sleep 11ms to avoid killing / overheating the RTC */
+ /* Sleep 11ms to avoid overheating the RTC */
nanosleep_with_retries(READ_LOOP_SLEEP_MS * 1000000);
prev_rtc_read = rtc_read;
@@ -236,7 +236,7 @@ TEST_F(rtc, alarm_alm_set) {
if (alarm_state == RTC_ALARM_DISABLED)
SKIP(return, "Skipping test since alarms are not supported.");
if (alarm_state == RTC_ALARM_RES_MINUTE)
- SKIP(return, "Skipping test since alarms has only minute granularity.");
+ SKIP(return, "Skipping test since alarm has only minute granularity.");
rc = ioctl(self->fd, RTC_RD_TIME, &tm);
ASSERT_NE(-1, rc);
@@ -306,7 +306,7 @@ TEST_F(rtc, alarm_wkalm_set) {
if (alarm_state == RTC_ALARM_DISABLED)
SKIP(return, "Skipping test since alarms are not supported.");
if (alarm_state == RTC_ALARM_RES_MINUTE)
- SKIP(return, "Skipping test since alarms has only minute granularity.");
+ SKIP(return, "Skipping test since alarm has only minute granularity.");
rc = ioctl(self->fd, RTC_RD_TIME, &alarm.time);
ASSERT_NE(-1, rc);
@@ -502,7 +502,7 @@ int main(int argc, char **argv)
if (access(rtc_file, R_OK) == 0)
ret = test_harness_run(argc, argv);
else
- ksft_exit_skip("[SKIP]: Cannot access rtc file %s - Exiting\n",
+ ksft_exit_skip("Cannot access RTC file %s - exiting\n",
rtc_file);
return ret;
--
2.39.5
This patch corrects minor coding style issues to comply with the Linux kernel coding style:
- Align closing parentheses to match opening ones in printf statements.
- Break long lines to keep them within the 100-column limit.
These changes address warnings reported by checkpatch.pl and do not
affect functionality.
changes in v2 :
- Resubmitted the patch with a properly formatted commit message,
following patch submission guidelines, as suggested by Shuah Khan.
Signed-off-by: Rujra Bhatt <braker.noob.kernel(a)gmail.com>
---
tools/testing/selftests/timers/valid-adjtimex.c | 7 ++++---
1 file changed, 4 insertions(+), 3 deletions(-)
diff --git a/tools/testing/selftests/timers/valid-adjtimex.c
b/tools/testing/selftests/timers/valid-adjtimex.c
index 6b7801055ad1..5110f9ee285c 100644
--- a/tools/testing/selftests/timers/valid-adjtimex.c
+++ b/tools/testing/selftests/timers/valid-adjtimex.c
@@ -157,7 +157,7 @@ int validate_freq(void)
if (tx.freq == outofrange_freq[i]) {
printf("[FAIL]\n");
printf("ERROR: out of range value %ld actually set!\n",
- tx.freq);
+ tx.freq);
pass = -1;
goto out;
}
@@ -172,7 +172,7 @@ int validate_freq(void)
if (ret >= 0) {
printf("[FAIL]\n");
printf("Error: No failure on invalid
ADJ_FREQUENCY %ld\n",
- invalid_freq[i]);
+ invalid_freq[i]);
pass = -1;
goto out;
}
@@ -238,7 +238,8 @@ int set_bad_offset(long sec, long usec, int use_nano)
tmx.time.tv_usec = usec;
ret = clock_adjtime(CLOCK_REALTIME, &tmx);
if (ret >= 0) {
- printf("Invalid (sec: %ld usec: %ld) did not fail! ",
tmx.time.tv_sec, tmx.time.tv_usec);
+ printf("Invalid (sec: %ld usec: %ld) did not fail! ",
+ tmx.time.tv_sec, tmx.time.tv_usec);
printf("[FAIL]\n");
return -1;
}
--
2.43.0
With newer compilers the pid_max test is failing to build with the
following build error:
...
pid_max.c: In function ‘pid_max_cb’:
pid_max.c:42:15: error: implicit declaration of function ‘mount’ [-Wimplicit-function-declaration]
42 | ret = mount("", "/", NULL, MS_PRIVATE | MS_REC, 0);
| ^~~~~
pid_max.c:42:36: error: ‘MS_PRIVATE’ undeclared (first use in this function); did you mean ‘MAP_PRIVATE’?
42 | ret = mount("", "/", NULL, MS_PRIVATE | MS_REC, 0);
| ^~~~~~~~~~
| MAP_PRIVATE
pid_max.c:42:36: note: each undeclared identifier is reported only once for each function it appears in
...
The fix seems to be including sys/mount.h which brings in the missing
defines and missing definition mount function.
Signed-off-by: Brahmajit Das <listout(a)listout.xyz>
---
tools/testing/selftests/pid_namespace/pid_max.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/tools/testing/selftests/pid_namespace/pid_max.c b/tools/testing/selftests/pid_namespace/pid_max.c
index 51c414faabb0..96f274f0582b 100644
--- a/tools/testing/selftests/pid_namespace/pid_max.c
+++ b/tools/testing/selftests/pid_namespace/pid_max.c
@@ -10,6 +10,7 @@
#include <stdlib.h>
#include <string.h>
#include <syscall.h>
+#include <sys/mount.h>
#include <sys/wait.h>
#include "../kselftest_harness.h"
--
2.49.0
sys_open_tree is once defined in filesystems/overlayfs/wrappers.h so
before we define it in mount_setattr_test.c we should check if has been
previously defined or not. Otherwise it results in the following build
error:
make[1]: Nothing to be done for 'all'.
CC mount_setattr_test
mount_setattr_test.c:176:19: error: redefinition of ‘sys_open_tree’
176 | static inline int sys_open_tree(int dfd, const char *filename, unsigned int flags)
| ^~~~~~~~~~~~~
In file included from mount_setattr_test.c:23:
../filesystems/overlayfs/wrappers.h:59:19: note: previous definition of
‘sys_open_tree’ with type ‘int(int, const char *, unsigned int)’
59 | static inline int sys_open_tree(int dfd, const char *filename, unsigned int flags)
| ^~~~~~~~~~~~~
make[1]: *** [../lib.mk:222: /home/listout/linux/tools/testing/selftests/mount_setattr/mount_setattr_test]
Error 1
Signed-off-by: Brahmajit Das <listout(a)listout.xyz>
---
tools/testing/selftests/mount_setattr/mount_setattr_test.c | 3 +++
1 file changed, 3 insertions(+)
diff --git a/tools/testing/selftests/mount_setattr/mount_setattr_test.c b/tools/testing/selftests/mount_setattr/mount_setattr_test.c
index 48a000cabc97..b0798777b822 100644
--- a/tools/testing/selftests/mount_setattr/mount_setattr_test.c
+++ b/tools/testing/selftests/mount_setattr/mount_setattr_test.c
@@ -173,10 +173,13 @@ static inline int sys_mount_setattr(int dfd, const char *path, unsigned int flag
#define AT_RECURSIVE 0x8000 /* Apply to the entire subtree */
#endif
+/* Do not define sys_open_tree if it's already defined in overlayfs/wrappers.h */
+#if !defined(__SELFTEST_OVERLAYFS_WRAPPERS_H__)
static inline int sys_open_tree(int dfd, const char *filename, unsigned int flags)
{
return syscall(__NR_open_tree, dfd, filename, flags);
}
+#endif
static ssize_t write_nointr(int fd, const void *buf, size_t count)
{
--
2.49.0
From: Chia-Yu Chang <chia-yu.chang(a)nokia-bell-labs.com>
Hello,
Please find the DualPI2 patch v16.
v16 (16-MAy-2025)
- Add qdisc_lock() to dualpi2_timer() in dualpi2_timer (Paolo Abeni <pabeni(a)redhat.com>)
- Introduce convert_ns_to_usec() to convert usec to nsec without overflow in #1 (Paolo Abeni <pabeni(a)redhat.com>)
- Update convert_us_tonsec() to convert nsec to usec without overflow in #2 (Paolo Abeni <pabeni(a)redhat.com>)
- Add more descriptions with respect to DualPI2 in the cover ltter and add changelog in each patch (Paolo Abeni <pabeni(a)redhat.com>)
v15 (09-May-2025)
- Add enum of TCA_DUALPI2_ECN_MASK_CLA_ECT to remove potential leakeage in #1 (Simon Horman <horms(a)kernel.org>)
- Fix one typo in comment of #2
- Update tc.yaml in #5 to aligh with the updated enum of pkt_sched.h
v14 (05-May-2025)
- Modify tc.yaml: (1) Replace flags with enum and remove enum-as-flags, (2) Remove credit-queue in xstats, and (3) Change attribute types (Donald Hunter <donald.hun
- Add enum and fix the ordering of variables in pkt_sched.h to align with the modified tc.yaml (Donald Hunter <donald.hunter(a)gmail.com>)
- Add validators for DROP_OVERLOAD, DROP_EARLY, ECN_MASK, and SPLIT_GSO in sch_dualpi2.c (Donald Hunter <donald.hunter(a)gmail.com>)
- Update dualpi2.json to align with the updated variable order in pkt_sched.h
- Reorder patches (Donald Hunter <donald.hunter(a)gmail.com>)
v13 (26-Apr-2025)
- Use dashes in member names to follow YNL conventions in tc.yaml (Donald Hunter <donald.hunter(a)gmail.com>)
- Define enumerations separately for flags of drop-early, drop-overload, ecn-mask, credit-queue in tc.yaml (Donald Hunter <donald.hunter(a)gmail.com>)
- Change the types of split-gso and step-packets into flag in tc.yaml (Donald Hunter <donald.hunter(a)gmail.com>)
- Revert to u32/u8 types for tc-dualpi2-xstats members in tc.yaml (Donald Hunter <donald.hunter(a)gmail.com>)
- Add new test cases in tc-tests/qdiscs/dualpi2.json to cover all dualpi2 parameters (Donald Hunter <donald.hunter(a)gmail.com>)
- Change the type of TCA_DUALPI2_STEP_PACKETS into NLA_FLAG (Donald Hunter <donald.hunter(a)gmail.com>)
v12 (22-Apr-2025)
- Remove anonymous struct in sch_dualpi2.c (Paolo Abeni <pabeni(a)redhat.com>)
- Replace u32/u8 with uint and s32 with int in tc spec document (Paolo Abeni <pabeni(a)redhat.com>)
- Introduce get_memory_limit function to handle potential overflow when multipling limit with MTU (Paolo Abeni <pabeni(a)redhat.com>)
- Double the packet length to further include packet overhead in memory_limit (Paolo Abeni <pabeni(a)redhat.com>)
- Remove the check of qdisc_qlen(sch) when calling qdisc_tree_reduce_backlog (Paolo Abeni <pabeni(a)redhat.com>)
v11 (15-Apr-2025)
- Replace hstimer_init with hstimer_setup in sch_dualpi2.c
v10 (25-Mar-2025)
- Remove leftover include in include/linux/netdevice.h and anonymous struct in sch_dualpi2.c (Paolo Abeni <pabeni(a)redhat.com>)
- Use kfree_skb_reason() and add SKB_DROP_REASON_DUALPI2_STEP_DROP drop reason (Paolo Abeni <pabeni(a)redhat.com>)
- Split sch_dualpi2.c into 3 patches (and overall 5 patches): Struct definition & parsing, Dump stats & configuration, Enqueue/Dequeue (Paolo Abeni <pabeni(a)redhat.com>)
v9 (16-Mar-2025)
- Fix mem_usage error in previous version
- Add min_qlen_step to the dualpi2 attribute as the minimum queue length in number of packets in the L-queue to start step threshold marking.
In previous versions, this value was fixed to 2, so the step threshold was applied to mark packets in the L queue only when the queue length of the L queue was greater than or equal to 2 packets.
This will cause larger queuing delays for L4S traffic at low rates (<20Mbps). So we parameterize it and change the default value to 0.
Comparison of tcp_1down run 'HTB 20Mbit + DUALPI2 + 10ms base delay'
Old versions:
avg median # data pts
Ping (ms) ICMP : 11.55 11.70 ms 350
TCP upload avg : 18.96 N/A Mbits/s 350
TCP upload sum : 18.96 N/A Mbits/s 350
New version (v9):
avg median # data pts
Ping (ms) ICMP : 10.81 10.70 ms 350
TCP upload avg : 18.91 N/A Mbits/s 350
TCP upload sum : 18.91 N/A Mbits/s 350
Comparison of tcp_1down run 'HTB 10Mbit + DUALPI2 + 10ms base delay'
Old versions:
avg median # data pts
Ping (ms) ICMP : 12.61 12.80 ms 350
TCP upload avg : 9.48 N/A Mbits/s 350
TCP upload sum : 9.48 N/A Mbits/s 350
New version (v9):
avg median # data pts
Ping (ms) ICMP : 11.06 10.80 ms 350
TCP upload avg : 9.43 N/A Mbits/s 350
TCP upload sum : 9.43 N/A Mbits/s 350
Comparison of tcp_1down run 'HTB 10Mbit + DUALPI2 + 10ms base delay'
Old versions:
avg median # data pts
Ping (ms) ICMP : 40.86 37.45 ms 350
TCP upload avg : 0.88 N/A Mbits/s 350
TCP upload sum : 0.88 N/A Mbits/s 350
TCP upload::1 : 0.88 0.97 Mbits/s 350
New version (v9):
avg median # data pts
Ping (ms) ICMP : 11.07 10.40 ms 350
TCP upload avg : 0.55 N/A Mbits/s 350
TCP upload sum : 0.55 N/A Mbits/s 350
TCP upload::1 : 0.55 0.59 Mbits/s 350
v8 (11-Mar-2025)
- Fix warning messages in v7
v7 (07-Mar-2025)
- Separate into 3 patches to avoid mixing changes of documentation, selftest, and code. (Cong Wang <xiyou.wangcong(a)gmail.com>)
v6 (04-Mar-2025)
- Add modprobe for dulapi2 in tc-testing script tc-testing/tdc.sh (Jakub Kicinski <kuba(a)kernel.org>)
- Update test cases in dualpi2.json
- Update commit message
v5 (22-Feb-2025)
- A comparison was done between MQ + DUALPI2, MQ + FQ_PIE, MQ + FQ_CODEL:
Unshaped 1gigE with 4 download streams test:
- Summary of tcp_4down run 'MQ + FQ_CODEL':
avg median # data pts
Ping (ms) ICMP : 1.19 1.34 ms 349
TCP download avg : 235.42 N/A Mbits/s 349
TCP download sum : 941.68 N/A Mbits/s 349
TCP download::1 : 235.19 235.39 Mbits/s 349
TCP download::2 : 235.03 235.35 Mbits/s 349
TCP download::3 : 236.89 235.44 Mbits/s 349
TCP download::4 : 234.57 235.19 Mbits/s 349
- Summary of tcp_4down run 'MQ + FQ_PIE'
avg median # data pts
Ping (ms) ICMP : 1.21 1.37 ms 350
TCP download avg : 235.42 N/A Mbits/s 350
TCP download sum : 941.61 N/A Mbits/s 350
TCP download::1 : 232.54 233.13 Mbits/s 350
TCP download::2 : 232.52 232.80 Mbits/s 350
TCP download::3 : 233.14 233.78 Mbits/s 350
TCP download::4 : 243.41 241.48 Mbits/s 350
- Summary of tcp_4down run 'MQ + DUALPI2'
avg median # data pts
Ping (ms) ICMP : 1.19 1.34 ms 349
TCP download avg : 235.42 N/A Mbits/s 349
TCP download sum : 941.68 N/A Mbits/s 349
TCP download::1 : 235.19 235.39 Mbits/s 349
TCP download::2 : 235.03 235.35 Mbits/s 349
TCP download::3 : 236.89 235.44 Mbits/s 349
TCP download::4 : 234.57 235.19 Mbits/s 349
Unshaped 1gigE with 128 download streams test:
- Summary of tcp_128down run 'MQ + FQ_CODEL':
avg median # data pts
Ping (ms) ICMP : 1.88 1.86 ms 350
TCP download avg : 7.39 N/A Mbits/s 350
TCP download sum : 946.47 N/A Mbits/s 350
- Summary of tcp_128down run 'MQ + FQ_PIE':
avg median # data pts
Ping (ms) ICMP : 1.88 1.86 ms 350
TCP download avg : 7.39 N/A Mbits/s 350
TCP download sum : 946.47 N/A Mbits/s 350
- Summary of tcp_128down run 'MQ + DUALPI2':
avg median # data pts
Ping (ms) ICMP : 1.88 1.86 ms 350
TCP download avg : 7.39 N/A Mbits/s 350
TCP download sum : 946.47 N/A Mbits/s 350
Unshaped 10gigE with 4 download streams test:
- Summary of tcp_4down run 'MQ + FQ_CODEL':
avg median # data pts
Ping (ms) ICMP : 0.22 0.23 ms 350
TCP download avg : 2354.08 N/A Mbits/s 350
TCP download sum : 9416.31 N/A Mbits/s 350
TCP download::1 : 2353.65 2352.81 Mbits/s 350
TCP download::2 : 2354.54 2354.21 Mbits/s 350
TCP download::3 : 2353.56 2353.78 Mbits/s 350
TCP download::4 : 2354.56 2354.45 Mbits/s 350
- Summary of tcp_4down run 'MQ + FQ_PIE':
avg median # data pts
Ping (ms) ICMP : 0.20 0.19 ms 350
TCP download avg : 2354.76 N/A Mbits/s 350
TCP download sum : 9419.04 N/A Mbits/s 350
TCP download::1 : 2354.77 2353.89 Mbits/s 350
TCP download::2 : 2353.41 2354.29 Mbits/s 350
TCP download::3 : 2356.18 2354.19 Mbits/s 350
TCP download::4 : 2354.68 2353.15 Mbits/s 350
- Summary of tcp_4down run 'MQ + DUALPI2':
avg median # data pts
Ping (ms) ICMP : 0.24 0.24 ms 350
TCP download avg : 2354.11 N/A Mbits/s 350
TCP download sum : 9416.43 N/A Mbits/s 350
TCP download::1 : 2354.75 2353.93 Mbits/s 350
TCP download::2 : 2353.15 2353.75 Mbits/s 350
TCP download::3 : 2353.49 2353.72 Mbits/s 350
TCP download::4 : 2355.04 2353.73 Mbits/s 350
Unshaped 10gigE with 128 download streams test:
- Summary of tcp_128down run 'MQ + FQ_CODEL':
avg median # data pts
Ping (ms) ICMP : 7.57 8.69 ms 350
TCP download avg : 73.97 N/A Mbits/s 350
TCP download sum : 9467.82 N/A Mbits/s 350
- Summary of tcp_128down run 'MQ + FQ_PIE':
avg median # data pts
Ping (ms) ICMP : 7.82 8.91 ms 350
TCP download avg : 73.97 N/A Mbits/s 350
TCP download sum : 9468.42 N/A Mbits/s 350
- Summary of tcp_128down run 'MQ + DUALPI2':
avg median # data pts
Ping (ms) ICMP : 6.87 7.93 ms 350
TCP download avg : 73.95 N/A Mbits/s 350
TCP download sum : 9465.87 N/A Mbits/s 350
From the results shown above, we see small differences between combinations.
- Update commit message to include results of no_split_gso and split_gso (Dave Taht <dave.taht(a)gmail.com> and Paolo Abeni <pabeni(a)redhat.com>)
- Add memlimit in the dualpi2 attribute, and add memory_used, max_memory_used, memory_limit in dualpi2 stats (Dave Taht <dave.taht(a)gmail.com>)
- Update note in sch_dualpi2.c related to BBRv3 status (Dave Taht <dave.taht(a)gmail.com>)
- Update license identifier (Dave Taht <dave.taht(a)gmail.com>)
- Add selftest in tools/testing/selftests/tc-testing (Cong Wang <xiyou.wangcong(a)gmail.com>)
- Use netlink policies for parameter checks (Jamal Hadi Salim <jhs(a)mojatatu.com>)
- Modify texts & fix typos in Documentation/netlink/specs/tc.yaml (Dave Taht <dave.taht(a)gmail.com>)
- Add descriptions of packet counter statistics and the reset function of sch_dualpi2.c
- Fix step_thresh in packets
- Update code comments in sch_dualpi2.c
v4 (22-Oct-2024)
- Update statement in Kconfig for DualPI2 (Stephen Hemminger <stephen(a)networkplumber.org>)
- Put a blank line after #define in sch_dualpi2.c (Stephen Hemminger <stephen(a)networkplumber.org>)
- Fix line length warning.
v3 (19-Oct-2024)
- Fix compilaiton error
- Update Documentation/netlink/specs/tc.yaml (Jakub Kicinski <kuba(a)kernel.org>)
v2 (18-Oct-2024)
- Add Documentation/netlink/specs/tc.yaml (Jakub Kicinski <kuba(a)kernel.org>)
- Use dualpi2 instead of skb prefix (Jamal Hadi Salim <jhs(a)mojatatu.com>)
- Replace nla_parse_nested_deprecated with nla_parse_nested (Jamal Hadi Salim <jhs(a)mojatatu.com>)
- Fix line length warning
This patch serise adds DualPI Improved with a Square (DualPI2) with following features:
* Supports congestion controls that comply with the Prague requirements in RFC9331 (e.g. TCP-Prague)
* Coupled dual-queue that separates the L4S traffic in a low latency queue (L-queue), without harming remaining traffic that is scheduled in classic queue (C-queue) due to congestion-coupling using PI2 as defined in RFC9332
* Configurable overload strategies
* Use of sojourn time to reliably estimate queue delay
* Supports ECN L4S-identifier (IP.ECN==0b*1) to classify traffic into respective queues
For more details of DualPI2, please refer IETF RFC9332 (https://datatracker.ietf.org/doc/html/rfc9332).
Best regards,
Chia-Yu
Chia-Yu Chang (4):
sched: Struct definition and parsing of dualpi2 qdisc
sched: Dump configuration and statistics of dualpi2 qdisc
selftests/tc-testing: Add selftests for qdisc DualPI2
Documentation: netlink: specs: tc: Add DualPI2 specification
Koen De Schepper (1):
sched: Add enqueue/dequeue of dualpi2 qdisc
Documentation/netlink/specs/tc.yaml | 156 +++
include/net/dropreason-core.h | 6 +
include/uapi/linux/pkt_sched.h | 68 +
net/sched/Kconfig | 12 +
net/sched/Makefile | 1 +
net/sched/sch_dualpi2.c | 1127 +++++++++++++++++
tools/testing/selftests/tc-testing/config | 1 +
.../tc-testing/tc-tests/qdiscs/dualpi2.json | 254 ++++
tools/testing/selftests/tc-testing/tdc.sh | 1 +
9 files changed, 1626 insertions(+)
create mode 100644 net/sched/sch_dualpi2.c
create mode 100644 tools/testing/selftests/tc-testing/tc-tests/qdiscs/dualpi2.json
--
2.34.1
Apparently, this test completes successfully when it completes execution
without either causing a kernel panic or being killed by the kernel.
This new test result message is more descriptive and grammatically
correct.
There are no changes in v2. I'm resending this patch because I addressed
it to the wrong email for Shuah.
Signed-off-by: Brigham Campbell <me(a)brighamcampbell.com>
---
tools/testing/selftests/x86/mov_ss_trap.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/tools/testing/selftests/x86/mov_ss_trap.c b/tools/testing/selftests/x86/mov_ss_trap.c
index f22cb6b382f9..d80033c0d7eb 100644
--- a/tools/testing/selftests/x86/mov_ss_trap.c
+++ b/tools/testing/selftests/x86/mov_ss_trap.c
@@ -269,6 +269,6 @@ int main()
);
}
- printf("[OK]\tI aten't dead\n");
+ printf("[OK]\tkernel handled MOV SS without crashing test\n");
return 0;
}
--
2.49.0
Basics and overview
===================
Software with larger attack surfaces (e.g. network facing apps like databases,
browsers or apps relying on browser runtimes) suffer from memory corruption
issues which can be utilized by attackers to bend control flow of the program
to eventually gain control (by making their payload executable). Attackers are
able to perform such attacks by leveraging call-sites which rely on indirect
calls or return sites which rely on obtaining return address from stack memory.
To mitigate such attacks, risc-v extension zicfilp enforces that all indirect
calls must land on a landing pad instruction `lpad` else cpu will raise software
check exception (a new cpu exception cause code on riscv).
Similarly for return flow, risc-v extension zicfiss extends architecture with
- `sspush` instruction to push return address on a shadow stack
- `sspopchk` instruction to pop return address from shadow stack
and compare with input operand (i.e. return address on stack)
- `sspopchk` to raise software check exception if comparision above
was a mismatch
- Protection mechanism using which shadow stack is not writeable via
regular store instructions
More information an details can be found at extensions github repo [1].
Equivalent to landing pad (zicfilp) on x86 is `ENDBRANCH` instruction in Intel
CET [3] and branch target identification (BTI) [4] on arm.
Similarly x86's Intel CET has shadow stack [5] and arm64 has guarded control
stack (GCS) [6] which are very similar to risc-v's zicfiss shadow stack.
x86 and arm64 support for user mode shadow stack is already in mainline.
Kernel awareness for user control flow integrity
================================================
This series picks up Samuel Holland's envcfg changes [2] as well. So if those are
being applied independently, they should be removed from this series.
Enabling:
In order to maintain compatibility and not break anything in user mode, kernel
doesn't enable control flow integrity cpu extensions on binary by default.
Instead exposes a prctl interface to enable, disable and lock the shadow stack
or landing pad feature for a task. This allows userspace (loader) to enumerate
if all objects in its address space are compiled with shadow stack and landing
pad support and accordingly enable the feature. Additionally if a subsequent
`dlopen` happens on a library, user mode can take a decision again to disable
the feature (if incoming library is not compiled with support) OR terminate the
task (if user mode policy is strict to have all objects in address space to be
compiled with control flow integirty cpu feature). prctl to enable shadow stack
results in allocating shadow stack from virtual memory and activating for user
address space. x86 and arm64 are also following same direction due to similar
reason(s).
clone/fork:
On clone and fork, cfi state for task is inherited by child. Shadow stack is
part of virtual memory and is a writeable memory from kernel perspective
(writeable via a restricted set of instructions aka shadow stack instructions)
Thus kernel changes ensure that this memory is converted into read-only when
fork/clone happens and COWed when fault is taken due to sspush, sspopchk or
ssamoswap. In case `CLONE_VM` is specified and shadow stack is to be enabled,
kernel will automatically allocate a shadow stack for that clone call.
map_shadow_stack:
x86 introduced `map_shadow_stack` system call to allow user space to explicitly
map shadow stack memory in its address space. It is useful to allocate shadow
for different contexts managed by a single thread (green threads or contexts)
risc-v implements this system call as well.
signal management:
If shadow stack is enabled for a task, kernel performs an asynchronous control
flow diversion to deliver the signal and eventually expects userspace to issue
sigreturn so that original execution can be resumed. Even though resume context
is prepared by kernel, it is in user space memory and is subject to memory
corruption and corruption bugs can be utilized by attacker in this race window
to perform arbitrary sigreturn and eventually bypass cfi mechanism.
Another issue is how to ensure that cfi related state on sigcontext area is not
trampled by legacy apps or apps compiled with old kernel headers.
In order to mitigate control-flow hijacting, kernel prepares a token and place
it on shadow stack before signal delivery and places address of token in
sigcontext structure. During sigreturn, kernel obtains address of token from
sigcontext struture, reads token from shadow stack and validates it and only
then allow sigreturn to succeed. Compatiblity issue is solved by adopting
dynamic sigcontext management introduced for vector extension. This series
re-factor the code little bit to allow future sigcontext management easy (as
proposed by Andy Chiu from SiFive)
config and compilation:
Introduce a new risc-v config option `CONFIG_RISCV_USER_CFI`. Selecting this
config option picks the kernel support for user control flow integrity. This
optin is presented only if toolchain has shadow stack and landing pad support.
And is on purpose guarded by toolchain support. Reason being that eventually
vDSO also needs to be compiled in with shadow stack and landing pad support.
vDSO compile patches are not included as of now because landing pad labeling
scheme is yet to settle for usermode runtime.
To get more information on kernel interactions with respect to
zicfilp and zicfiss, patch series adds documentation for
`zicfilp` and `zicfiss` in following:
Documentation/arch/riscv/zicfiss.rst
Documentation/arch/riscv/zicfilp.rst
How to test this series
=======================
Toolchain
---------
$ git clone git@github.com:sifive/riscv-gnu-toolchain.git -b cfi-dev
$ riscv-gnu-toolchain/configure --prefix=<path-to-where-to-build> --with-arch=rv64gc_zicfilp_zicfiss --enable-linux --disable-gdb --with-extra-multilib-test="rv64gc_zicfilp_zicfiss-lp64d:-static"
$ make -j$(nproc)
Qemu
----
Get the lastest qemu
$ cd qemu
$ mkdir build
$ cd build
$ ../configure --target-list=riscv64-softmmu
$ make -j$(nproc)
Opensbi
-------
$ git clone git@github.com:deepak0414/opensbi.git -b v6_cfi_spec_split_opensbi
$ make CROSS_COMPILE=<your riscv toolchain> -j$(nproc) PLATFORM=generic
Linux
-----
Running defconfig is fine. CFI is enabled by default if the toolchain
supports it.
$ make ARCH=riscv CROSS_COMPILE=<path-to-cfi-riscv-gnu-toolchain>/build/bin/riscv64-unknown-linux-gnu- -j$(nproc) defconfig
$ make ARCH=riscv CROSS_COMPILE=<path-to-cfi-riscv-gnu-toolchain>/build/bin/riscv64-unknown-linux-gnu- -j$(nproc)
In case you're building your own rootfs using toolchain, please make sure you
pick following patch to ensure that vDSO compiled with lpad and shadow stack.
"arch/riscv: compile vdso with landing pad"
Branch where above patch can be picked
https://github.com/deepak0414/linux-riscv-cfi/tree/vdso_user_cfi_v6.12-rc1
Running
-------
Modify your qemu command to have:
-bios <path-to-cfi-opensbi>/build/platform/generic/firmware/fw_dynamic.bin
-cpu rv64,zicfilp=true,zicfiss=true,zimop=true,zcmop=true
vDSO related Opens (in the flux)
=================================
I am listing these opens for laying out plan and what to expect in future
patch sets. And of course for the sake of discussion.
Shadow stack and landing pad enabling in vDSO
----------------------------------------------
vDSO must have shadow stack and landing pad support compiled in for task
to have shadow stack and landing pad support. This patch series doesn't
enable that (yet). Enabling shadow stack support in vDSO should be
straight forward (intend to do that in next versions of patch set). Enabling
landing pad support in vDSO requires some collaboration with toolchain folks
to follow a single label scheme for all object binaries. This is necessary to
ensure that all indirect call-sites are setting correct label and target landing
pads are decorated with same label scheme.
How many vDSOs
---------------
Shadow stack instructions are carved out of zimop (may be operations) and if CPU
doesn't implement zimop, they're illegal instructions. Kernel could be running on
a CPU which may or may not implement zimop. And thus kernel will have to carry 2
different vDSOs and expose the appropriate one depending on whether CPU implements
zimop or not.
References
==========
[1] - https://github.com/riscv/riscv-cfi
[2] - https://lore.kernel.org/all/20240814081126.956287-1-samuel.holland@sifive.c…
[3] - https://lwn.net/Articles/889475/
[4] - https://developer.arm.com/documentation/109576/0100/Branch-Target-Identific…
[5] - https://www.intel.com/content/dam/develop/external/us/en/documents/catc17-i…
[6] - https://lwn.net/Articles/940403/
To: Thomas Gleixner <tglx(a)linutronix.de>
To: Ingo Molnar <mingo(a)redhat.com>
To: Borislav Petkov <bp(a)alien8.de>
To: Dave Hansen <dave.hansen(a)linux.intel.com>
To: x86(a)kernel.org
To: H. Peter Anvin <hpa(a)zytor.com>
To: Andrew Morton <akpm(a)linux-foundation.org>
To: Liam R. Howlett <Liam.Howlett(a)oracle.com>
To: Vlastimil Babka <vbabka(a)suse.cz>
To: Lorenzo Stoakes <lorenzo.stoakes(a)oracle.com>
To: Paul Walmsley <paul.walmsley(a)sifive.com>
To: Palmer Dabbelt <palmer(a)dabbelt.com>
To: Albert Ou <aou(a)eecs.berkeley.edu>
To: Conor Dooley <conor(a)kernel.org>
To: Rob Herring <robh(a)kernel.org>
To: Krzysztof Kozlowski <krzk+dt(a)kernel.org>
To: Arnd Bergmann <arnd(a)arndb.de>
To: Christian Brauner <brauner(a)kernel.org>
To: Peter Zijlstra <peterz(a)infradead.org>
To: Oleg Nesterov <oleg(a)redhat.com>
To: Eric Biederman <ebiederm(a)xmission.com>
To: Kees Cook <kees(a)kernel.org>
To: Jonathan Corbet <corbet(a)lwn.net>
To: Shuah Khan <shuah(a)kernel.org>
To: Jann Horn <jannh(a)google.com>
To: Conor Dooley <conor+dt(a)kernel.org>
To: Miguel Ojeda <ojeda(a)kernel.org>
To: Alex Gaynor <alex.gaynor(a)gmail.com>
To: Boqun Feng <boqun.feng(a)gmail.com>
To: Gary Guo <gary(a)garyguo.net>
To: Björn Roy Baron <bjorn3_gh(a)protonmail.com>
To: Benno Lossin <benno.lossin(a)proton.me>
To: Andreas Hindborg <a.hindborg(a)kernel.org>
To: Alice Ryhl <aliceryhl(a)google.com>
To: Trevor Gross <tmgross(a)umich.edu>
Cc: linux-kernel(a)vger.kernel.org
Cc: linux-fsdevel(a)vger.kernel.org
Cc: linux-mm(a)kvack.org
Cc: linux-riscv(a)lists.infradead.org
Cc: devicetree(a)vger.kernel.org
Cc: linux-arch(a)vger.kernel.org
Cc: linux-doc(a)vger.kernel.org
Cc: linux-kselftest(a)vger.kernel.org
Cc: alistair.francis(a)wdc.com
Cc: richard.henderson(a)linaro.org
Cc: jim.shu(a)sifive.com
Cc: andybnac(a)gmail.com
Cc: kito.cheng(a)sifive.com
Cc: charlie(a)rivosinc.com
Cc: atishp(a)rivosinc.com
Cc: evan(a)rivosinc.com
Cc: cleger(a)rivosinc.com
Cc: alexghiti(a)rivosinc.com
Cc: samitolvanen(a)google.com
Cc: broonie(a)kernel.org
Cc: rick.p.edgecombe(a)intel.com
Cc: rust-for-linux(a)vger.kernel.org
changelog
---------
v16:
- If FWFT is not implemented or returns error for shadow stack activation, then
no_usercfi is set to disable shadow stack. Although this should be picked up
by extension validation and activation. Fixed this bug for zicfilp and zicfiss
both. Thanks to Charlie Jenkins for reporting this.
- If toolchain doesn't support cfi, cfi kselftest shouldn't build. Suggested by
Charlie Jenkins.
- Default for CONFIG_RISCV_USER_CFI is set to no. Charlie/Atish suggested to
keep it off till we have more hardware availibility with RVA23 profile and
zimop/zcmop implemented. Else this will start breaking people's workflow
- Includes the fix if "!RV64 and !SBI" then definitions for FWFT in
asm-offsets.c error.
v15:
- Toolchain has been updated to include `-fcf-protection` flag. This
exists for x86 as well. Updated kernel patches to compile vDSO and
selftest to compile with `fcf-protection=full` flag.
- selecting CONFIG_RISCV_USERCFI selects CONFIG_RISCV_SBI.
- Patch to enable shadow stack for kernel wasn't hidden behind
CONFIG_RISCV_USERCFI and CONFIG_RISCV_SBI. fixed that.
v14:
- rebased on top of palmer/sbi-v3. Thus dropped clement's FWFT patches
Updated RISCV_ISA_EXT_XXXX in hwcap and hwprobe constants.
- Took Radim's suggestions on bitfields.
- Placed cfi_state at the end of thread_info block so that current situation
is not disturbed with respect to member fields of thread_info in single
cacheline.
v13:
- cpu_supports_shadow_stack/cpu_supports_indirect_br_lp_instr uses
riscv_has_extension_unlikely()
- uses nops(count) to create nop slide
- RISCV_ACQUIRE_BARRIER is not needed in `amo_user_shstk`. Removed it
- changed ternaries to simply use implicit casting to convert to bool.
- kernel command line allows to disable zicfilp and zicfiss independently.
updated kernel-parameters.txt.
- ptrace user abi for cfi uses bitmasks instead of bitfields. Added ptrace
kselftest.
- cosmetic and grammatical changes to documentation.
v12:
- It seems like I had accidently squashed arch agnostic indirect branch
tracking prctl and riscv implementation of those prctls. Split them again.
- set_shstk_status/set_indir_lp_status perform CSR writes only when CPU
support is available. As suggested by Zong Li.
- Some minor clean up in kselftests as suggested by Zong Li.
v11:
- patch "arch/riscv: compile vdso with landing pad" was unconditionally
selecting `_zicfilp` for vDSO compile. fixed that. Changed `lpad 1` to
to `lpad 0`.
v10:
- dropped "mm: helper `is_shadow_stack_vma` to check shadow stack vma". This patch
is not that interesting to this patch series for risc-v. There are instances in
arch directories where VM_SHADOW_STACK flag is anyways used. Dropping this patch
to expedite merging in riscv tree.
- Took suggestions from `Clement` on "riscv: zicfiss / zicfilp enumeration" to
validate presence of cfi based on config.
- Added a patch for vDSO to have `lpad 0`. I had omitted this earlier to make sure
we add single vdso object with cfi enabled. But a vdso object with scheme of
zero labeled landing pad is least common denominator and should work with all
objects of zero labeled as well as function-signature labeled objects.
v9:
- rebased on master (39a803b754d5 fix braino in "9p: fix ->rename_sem exclusion")
- dropped "mm: Introduce ARCH_HAS_USER_SHADOW_STACK" (master has it from arm64/gcs)
- dropped "prctl: arch-agnostic prctl for shadow stack" (master has it from arm64/gcs)
v8:
- rebased on palmer/for-next
- dropped samuel holland's `envcfg` context switch patches.
they are in parlmer/for-next
v7:
- Removed "riscv/Kconfig: enable HAVE_EXIT_THREAD for riscv"
Instead using `deactivate_mm` flow to clean up.
see here for more context
https://lore.kernel.org/all/20230908203655.543765-1-rick.p.edgecombe@intel.…
- Changed the header include in `kselftest`. Hopefully this fixes compile
issue faced by Zong Li at SiFive.
- Cleaned up an orphaned change to `mm/mmap.c` in below patch
"riscv/mm : ensure PROT_WRITE leads to VM_READ | VM_WRITE"
- Lock interfaces for shadow stack and indirect branch tracking expect arg == 0
Any future evolution of this interface should accordingly define how arg should
be setup.
- `mm/map.c` has an instance of using `VM_SHADOW_STACK`. Fixed it to use helper
`is_shadow_stack_vma`.
- Link to v6: https://lore.kernel.org/r/20241008-v5_user_cfi_series-v6-0-60d9fe073f37@riv…
v6:
- Picked up Samuel Holland's changes as is with `envcfg` placed in
`thread` instead of `thread_info`
- fixed unaligned newline escapes in kselftest
- cleaned up messages in kselftest and included test output in commit message
- fixed a bug in clone path reported by Zong Li
- fixed a build issue if CONFIG_RISCV_ISA_V is not selected
(this was introduced due to re-factoring signal context
management code)
v5:
- rebased on v6.12-rc1
- Fixed schema related issues in device tree file
- Fixed some of the documentation related issues in zicfilp/ss.rst
(style issues and added index)
- added `SHADOW_STACK_SET_MARKER` so that implementation can define base
of shadow stack.
- Fixed warnings on definitions added in usercfi.h when
CONFIG_RISCV_USER_CFI is not selected.
- Adopted context header based signal handling as proposed by Andy Chiu
- Added support for enabling kernel mode access to shadow stack using
FWFT
(https://github.com/riscv-non-isa/riscv-sbi-doc/blob/master/src/ext-firmware…)
- Link to v5: https://lore.kernel.org/r/20241001-v5_user_cfi_series-v1-0-3ba65b6e550f@riv…
(Note: I had an issue in my workflow due to which version number wasn't
picked up correctly while sending out patches)
v4:
- rebased on 6.11-rc6
- envcfg: Converged with Samuel Holland's patches for envcfg management on per-
thread basis.
- vma_is_shadow_stack is renamed to is_vma_shadow_stack
- picked up Mark Brown's `ARCH_HAS_USER_SHADOW_STACK` patch
- signal context: using extended context management to maintain compatibility.
- fixed `-Wmissing-prototypes` compiler warnings for prctl functions
- Documentation fixes and amending typos.
- Link to v4: https://lore.kernel.org/all/20240912231650.3740732-1-debug@rivosinc.com/
v3:
- envcfg
logic to pick up base envcfg had a bug where `ENVCFG_CBZE` could have been
picked on per task basis, even though CPU didn't implement it. Fixed in
this series.
- dt-bindings
As suggested, split into separate commit. fixed the messaging that spec is
in public review
- arch_is_shadow_stack change
arch_is_shadow_stack changed to vma_is_shadow_stack
- hwprobe
zicfiss / zicfilp if present will get enumerated in hwprobe
- selftests
As suggested, added object and binary filenames to .gitignore
Selftest binary anyways need to be compiled with cfi enabled compiler which
will make sure that landing pad and shadow stack are enabled. Thus removed
separate enable/disable tests. Cleaned up tests a bit.
- Link to v3: https://lore.kernel.org/lkml/20240403234054.2020347-1-debug@rivosinc.com/
v2:
- Using config `CONFIG_RISCV_USER_CFI`, kernel support for riscv control flow
integrity for user mode programs can be compiled in the kernel.
- Enabling of control flow integrity for user programs is left to user runtime
- This patch series introduces arch agnostic `prctls` to enable shadow stack
and indirect branch tracking. And implements them on riscv.
---
Changes in v16:
- Link to v15: https://lore.kernel.org/r/20250502-v5_user_cfi_series-v15-0-914966471885@ri…
Changes in v15:
- changelog posted just below cover letter
- Link to v14: https://lore.kernel.org/r/20250429-v5_user_cfi_series-v14-0-5239410d012a@ri…
Changes in v14:
- changelog posted just below cover letter
- Link to v13: https://lore.kernel.org/r/20250424-v5_user_cfi_series-v13-0-971437de586a@ri…
Changes in v13:
- changelog posted just below cover letter
- Link to v12: https://lore.kernel.org/r/20250314-v5_user_cfi_series-v12-0-e51202b53138@ri…
Changes in v12:
- changelog posted just below cover letter
- Link to v11: https://lore.kernel.org/r/20250310-v5_user_cfi_series-v11-0-86b36cbfb910@ri…
Changes in v11:
- changelog posted just below cover letter
- Link to v10: https://lore.kernel.org/r/20250210-v5_user_cfi_series-v10-0-163dcfa31c60@ri…
---
Andy Chiu (1):
riscv: signal: abstract header saving for setup_sigcontext
Deepak Gupta (25):
mm: VM_SHADOW_STACK definition for riscv
dt-bindings: riscv: zicfilp and zicfiss in dt-bindings (extensions.yaml)
riscv: zicfiss / zicfilp enumeration
riscv: zicfiss / zicfilp extension csr and bit definitions
riscv: usercfi state for task and save/restore of CSR_SSP on trap entry/exit
riscv/mm : ensure PROT_WRITE leads to VM_READ | VM_WRITE
riscv mm: manufacture shadow stack pte
riscv mmu: teach pte_mkwrite to manufacture shadow stack PTEs
riscv mmu: write protect and shadow stack
riscv/mm: Implement map_shadow_stack() syscall
riscv/shstk: If needed allocate a new shadow stack on clone
riscv: Implements arch agnostic shadow stack prctls
prctl: arch-agnostic prctl for indirect branch tracking
riscv: Implements arch agnostic indirect branch tracking prctls
riscv/traps: Introduce software check exception
riscv/signal: save and restore of shadow stack for signal
riscv/kernel: update __show_regs to print shadow stack register
riscv/ptrace: riscv cfi status and state via ptrace and in core files
riscv/hwprobe: zicfilp / zicfiss enumeration in hwprobe
riscv: kernel command line option to opt out of user cfi
riscv: enable kernel access to shadow stack memory via FWFT sbi call
riscv: create a config for shadow stack and landing pad instr support
riscv: Documentation for landing pad / indirect branch tracking
riscv: Documentation for shadow stack on riscv
kselftest/riscv: kselftest for user mode cfi
Jim Shu (1):
arch/riscv: compile vdso with landing pad
Documentation/admin-guide/kernel-parameters.txt | 8 +
Documentation/arch/riscv/index.rst | 2 +
Documentation/arch/riscv/zicfilp.rst | 115 +++++
Documentation/arch/riscv/zicfiss.rst | 179 +++++++
.../devicetree/bindings/riscv/extensions.yaml | 14 +
arch/riscv/Kconfig | 21 +
arch/riscv/Makefile | 5 +-
arch/riscv/include/asm/asm-prototypes.h | 1 +
arch/riscv/include/asm/assembler.h | 44 ++
arch/riscv/include/asm/cpufeature.h | 12 +
arch/riscv/include/asm/csr.h | 16 +
arch/riscv/include/asm/entry-common.h | 2 +
arch/riscv/include/asm/hwcap.h | 2 +
arch/riscv/include/asm/mman.h | 25 +
arch/riscv/include/asm/mmu_context.h | 7 +
arch/riscv/include/asm/pgtable.h | 30 +-
arch/riscv/include/asm/processor.h | 2 +
arch/riscv/include/asm/thread_info.h | 3 +
arch/riscv/include/asm/usercfi.h | 95 ++++
arch/riscv/include/asm/vector.h | 3 +
arch/riscv/include/uapi/asm/hwprobe.h | 2 +
arch/riscv/include/uapi/asm/ptrace.h | 34 ++
arch/riscv/include/uapi/asm/sigcontext.h | 1 +
arch/riscv/kernel/Makefile | 1 +
arch/riscv/kernel/asm-offsets.c | 10 +
arch/riscv/kernel/cpufeature.c | 27 +
arch/riscv/kernel/entry.S | 33 +-
arch/riscv/kernel/head.S | 27 +
arch/riscv/kernel/process.c | 26 +-
arch/riscv/kernel/ptrace.c | 95 ++++
arch/riscv/kernel/signal.c | 148 +++++-
arch/riscv/kernel/sys_hwprobe.c | 2 +
arch/riscv/kernel/sys_riscv.c | 10 +
arch/riscv/kernel/traps.c | 43 ++
arch/riscv/kernel/usercfi.c | 545 +++++++++++++++++++++
arch/riscv/kernel/vdso/Makefile | 6 +
arch/riscv/kernel/vdso/flush_icache.S | 4 +
arch/riscv/kernel/vdso/getcpu.S | 4 +
arch/riscv/kernel/vdso/rt_sigreturn.S | 4 +
arch/riscv/kernel/vdso/sys_hwprobe.S | 4 +
arch/riscv/mm/init.c | 2 +-
arch/riscv/mm/pgtable.c | 17 +
include/linux/cpu.h | 4 +
include/linux/mm.h | 7 +
include/uapi/linux/elf.h | 2 +
include/uapi/linux/prctl.h | 27 +
kernel/sys.c | 30 ++
tools/testing/selftests/riscv/Makefile | 2 +-
tools/testing/selftests/riscv/cfi/.gitignore | 3 +
tools/testing/selftests/riscv/cfi/Makefile | 16 +
tools/testing/selftests/riscv/cfi/cfi_rv_test.h | 82 ++++
tools/testing/selftests/riscv/cfi/riscv_cfi_test.c | 173 +++++++
tools/testing/selftests/riscv/cfi/shadowstack.c | 385 +++++++++++++++
tools/testing/selftests/riscv/cfi/shadowstack.h | 27 +
54 files changed, 2360 insertions(+), 29 deletions(-)
---
base-commit: 4181f8ad7a1061efed0219951d608d4988302af7
change-id: 20240930-v5_user_cfi_series-3dc332f8f5b2
--
- debug
I'd like to cut down the memory usage of parsing vmlinux BTF in ebpf-go.
With some upcoming changes the library is sitting at 5MiB for a parse.
Most of that memory is simply copying the BTF blob into user space.
By allowing vmlinux BTF to be mmapped read-only into user space I can
cut memory usage by about 75%.
Signed-off-by: Lorenz Bauer <lmb(a)isovalent.com>
---
Changes in v4:
- Go back to remap_pfn_range for aarch64 compat
- Dropped btf_new_no_copy (Andrii)
- Fixed nits in selftests (Andrii)
- Clearer error handling in the mmap handler (Andrii)
- Fixed build on s390
- Link to v3: https://lore.kernel.org/r/20250505-vmlinux-mmap-v3-0-5d53afa060e8@isovalent…
Changes in v3:
- Remove slightly confusing calculation of trailing (Alexei)
- Use vm_insert_page (Alexei)
- Simplified libbpf code
- Link to v2: https://lore.kernel.org/r/20250502-vmlinux-mmap-v2-0-95c271434519@isovalent…
Changes in v2:
- Use btf__new in selftest
- Avoid vm_iomap_memory in btf_vmlinux_mmap
- Add VM_DONTDUMP
- Add support to libbpf
- Link to v1: https://lore.kernel.org/r/20250501-vmlinux-mmap-v1-0-aa2724572598@isovalent…
---
Lorenz Bauer (3):
btf: allow mmap of vmlinux btf
selftests: bpf: add a test for mmapable vmlinux BTF
libbpf: Use mmap to parse vmlinux BTF from sysfs
include/asm-generic/vmlinux.lds.h | 3 +-
kernel/bpf/sysfs_btf.c | 32 ++++++++
tools/lib/bpf/btf.c | 85 ++++++++++++++++++----
tools/testing/selftests/bpf/prog_tests/btf_sysfs.c | 81 +++++++++++++++++++++
4 files changed, 184 insertions(+), 17 deletions(-)
---
base-commit: 7220eabff8cb4af3b93cd021aa853b9f5df2923f
change-id: 20250501-vmlinux-mmap-2ec5563c3ef1
Best regards,
--
Lorenz Bauer <lmb(a)isovalent.com>
I'd like to cut down the memory usage of parsing vmlinux BTF in ebpf-go.
With some upcoming changes the library is sitting at 5MiB for a parse.
Most of that memory is simply copying the BTF blob into user space.
By allowing vmlinux BTF to be mmapped read-only into user space I can
cut memory usage by about 75%.
Signed-off-by: Lorenz Bauer <lmb(a)isovalent.com>
---
Changes in v5:
- Fix error return of btf_parse_raw_mmap (Andrii)
- Link to v4: https://lore.kernel.org/r/20250510-vmlinux-mmap-v4-0-69e424b2a672@isovalent…
Changes in v4:
- Go back to remap_pfn_range for aarch64 compat
- Dropped btf_new_no_copy (Andrii)
- Fixed nits in selftests (Andrii)
- Clearer error handling in the mmap handler (Andrii)
- Fixed build on s390
- Link to v3: https://lore.kernel.org/r/20250505-vmlinux-mmap-v3-0-5d53afa060e8@isovalent…
Changes in v3:
- Remove slightly confusing calculation of trailing (Alexei)
- Use vm_insert_page (Alexei)
- Simplified libbpf code
- Link to v2: https://lore.kernel.org/r/20250502-vmlinux-mmap-v2-0-95c271434519@isovalent…
Changes in v2:
- Use btf__new in selftest
- Avoid vm_iomap_memory in btf_vmlinux_mmap
- Add VM_DONTDUMP
- Add support to libbpf
- Link to v1: https://lore.kernel.org/r/20250501-vmlinux-mmap-v1-0-aa2724572598@isovalent…
---
Lorenz Bauer (3):
btf: allow mmap of vmlinux btf
selftests: bpf: add a test for mmapable vmlinux BTF
libbpf: Use mmap to parse vmlinux BTF from sysfs
include/asm-generic/vmlinux.lds.h | 3 +-
kernel/bpf/sysfs_btf.c | 32 ++++++++
tools/lib/bpf/btf.c | 89 +++++++++++++++++-----
tools/testing/selftests/bpf/prog_tests/btf_sysfs.c | 81 ++++++++++++++++++++
4 files changed, 186 insertions(+), 19 deletions(-)
---
base-commit: 7220eabff8cb4af3b93cd021aa853b9f5df2923f
change-id: 20250501-vmlinux-mmap-2ec5563c3ef1
Best regards,
--
Lorenz Bauer <lmb(a)isovalent.com>
Here is a series from Geliang, adding mptcp_subflow bpf_iter support.
We are working on extending MPTCP with BPF, e.g. to control the path
manager -- in charge of the creation, deletion, and announcements of
subflows (paths) -- and the packet scheduler -- in charge of selecting
which available path the next data will be sent to. These extensions
need to iterate over the list of subflows attached to an MPTCP
connection, and do some specific actions via some new kfunc that will be
added later on.
This preparation work is split in different patches:
- Patch 1: register some "basic" MPTCP kfunc.
- Patch 2: add mptcp_subflow bpf_iter support. Note that previous
versions of this single patch have already been shared to the
BPF mailing list. The changelog has been kept with a comment,
but the version number has been reset to avoid confusions.
- Patch 3: add more MPTCP endpoints in the selftests, in order to create
more than 2 subflows.
- Patch 4: add a very simple test validating mptcp_subflow bpf_iter
support. This test could be written without the new bpf_iter,
but it is there only to make sure this specific feature works
as expected.
- Patch 5: a small fix to drop an unused parameter in the selftests.
Signed-off-by: Matthieu Baerts (NGI0) <matttbe(a)kernel.org>
---
Changes in v3:
- Previous patches 1, 2 and 5 were no longer needed. (Martin)
- Patch 2: Switch to 'struct sock' and drop unneeded checks. (Martin)
- Patch 4: Adapt the test accordingly.
- Patch 5: New small fix for the selftests.
- Examples and questions for BPF maintainers have been added in Patch 2.
- Link to v2: https://lore.kernel.org/r/20241219-bpf-next-net-mptcp-bpf_iter-subflows-v2-…
Changes in v2:
- Patches 1-2: new ones.
- Patch 3: remove two kfunc, more restrictions. (Martin)
- Patch 4: add BUILD_BUG_ON(), more restrictions. (Martin)
- Patch 7: adaptations due to modifications in patches 1-4.
- Link to v1: https://lore.kernel.org/r/20241108-bpf-next-net-mptcp-bpf_iter-subflows-v1-…
---
Geliang Tang (5):
bpf: Register mptcp common kfunc set
bpf: Add mptcp_subflow bpf_iter
selftests/bpf: More endpoints for endpoint_init
selftests/bpf: Add mptcp_subflow bpf_iter subtest
selftests/bpf: Drop cgroup_fd of run_mptcpify
net/mptcp/bpf.c | 87 +++++++++++++-
tools/testing/selftests/bpf/bpf_experimental.h | 8 ++
tools/testing/selftests/bpf/prog_tests/mptcp.c | 133 +++++++++++++++++++--
tools/testing/selftests/bpf/progs/mptcp_bpf.h | 4 +
.../testing/selftests/bpf/progs/mptcp_bpf_iters.c | 59 +++++++++
5 files changed, 282 insertions(+), 9 deletions(-)
---
base-commit: dad704ebe38642cd405e15b9c51263356391355c
change-id: 20241108-bpf-next-net-mptcp-bpf_iter-subflows-027f6d87770e
Best regards,
--
Matthieu Baerts (NGI0) <matttbe(a)kernel.org>
Split out all headers which are used by nolibc-test.c.
This makes it easier to port existing applications to nolibc.
Signed-off-by: Thomas Weißschuh <linux(a)weissschuh.net>
---
Thomas Weißschuh (9):
tools/nolibc: move ioctl() to sys/ioctl.h
tools/nolibc: move mount() to sys/mount.h
tools/nolibc: move prctl() to sys/prctl.h
tools/nolibc: move reboot() to sys/reboot.h
tools/nolibc: move getrlimit() and friends to sys/resource.h
tools/nolibc: move makedev() and friends to sys/sysmacros.h
tools/nolibc: move uname() and friends to sys/utsname.h
tools/nolibc: move NULL and offsetof() to sys/stddef.h
selftests/nolibc: drop include guards around standard headers
tools/include/nolibc/Makefile | 8 ++
tools/include/nolibc/nolibc.h | 7 ++
tools/include/nolibc/std.h | 6 +-
tools/include/nolibc/stddef.h | 24 +++++
tools/include/nolibc/sys.h | 136 ---------------------------
tools/include/nolibc/sys/ioctl.h | 29 ++++++
tools/include/nolibc/sys/mount.h | 37 ++++++++
tools/include/nolibc/sys/prctl.h | 36 +++++++
tools/include/nolibc/sys/reboot.h | 34 +++++++
tools/include/nolibc/sys/resource.h | 53 +++++++++++
tools/include/nolibc/sys/sysmacros.h | 20 ++++
tools/include/nolibc/sys/utsname.h | 42 +++++++++
tools/include/nolibc/types.h | 11 ---
tools/testing/selftests/nolibc/nolibc-test.c | 5 -
14 files changed, 291 insertions(+), 157 deletions(-)
---
base-commit: 6a25f787912a73613f12e7eefbebd72ee3d43f85
change-id: 20250515-nolibc-sys-31a4fd76d897
Best regards,
--
Thomas Weißschuh <linux(a)weissschuh.net>
Corrects a spelling mistake "memebers" instead of "members" in
tools/testing/selftests/filesystems/mount-notify/mount-notify_test.c
Signed-off-by: Hendrik Hamerlinck <hendrik.hamerlinck(a)hammernet.be>
---
Changes since v1:
Improved commit message to be consistent with other commit messages.
.../selftests/filesystems/mount-notify/mount-notify_test.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/tools/testing/selftests/filesystems/mount-notify/mount-notify_test.c b/tools/testing/selftests/filesystems/mount-notify/mount-notify_test.c
index 59a71f22fb11..af2b61224a61 100644
--- a/tools/testing/selftests/filesystems/mount-notify/mount-notify_test.c
+++ b/tools/testing/selftests/filesystems/mount-notify/mount-notify_test.c
@@ -230,7 +230,7 @@ static void verify_mount_ids(struct __test_metadata *const _metadata,
}
}
}
- // Check that all list1 memebers can be found in list2. Together with
+ // Check that all list1 members can be found in list2. Together with
// the above it means that the list1 and list2 represent the same sets.
for (i = 0; i < num; i++) {
for (j = 0; j < num; j++) {
--
2.43.0
v2- fixed multiple trailing whitespace errors and
the Signed-off-by mismatch
The test file for the IR decoder used single-line comments
at the top to document its purpose and licensing,
which is inconsistent with the style used throughout the
Linux kernel.
In this patch i converted the file header to
a proper multi-line comment block
(/*) that aligns with standard kernel practices.
This improves readability, consistency across selftests,
and ensures the license and documentation are
clearly visible in a familiar format.
No functional changes have been made.
Signed-off-by: Abdelrahman Fekry <abdelrahmanfekry375(a)gmail.com>
---
tools/testing/selftests/ir/ir_loopback.c | 23 +++++++++++++----------
1 file changed, 13 insertions(+), 10 deletions(-)
diff --git a/tools/testing/selftests/ir/ir_loopback.c b/tools/testing/selftests/ir/ir_loopback.c
index f4a15cbdd5ea..c94faa975630 100644
--- a/tools/testing/selftests/ir/ir_loopback.c
+++ b/tools/testing/selftests/ir/ir_loopback.c
@@ -1,14 +1,17 @@
// SPDX-License-Identifier: GPL-2.0
-// test ir decoder
-//
-// Copyright (C) 2018 Sean Young <sean(a)mess.org>
-
-// When sending LIRC_MODE_SCANCODE, the IR will be encoded. rc-loopback
-// will send this IR to the receiver side, where we try to read the decoded
-// IR. Decoding happens in a separate kernel thread, so we will need to
-// wait until that is scheduled, hence we use poll to check for read
-// readiness.
-
+/* Copyright (C) 2018 Sean Young <sean(a)mess.org>
+ *
+ * Selftest for IR decoder
+ *
+ *
+ * When sending LIRC_MODE_SCANCODE, the IR will be encoded.
+ * rc-loopback will send this IR to the receiver side,
+ * where we try to read the decoded IR.
+ * Decoding happens in a separate kernel thread,
+ * so we will need to wait until that is scheduled,
+ * hence we use poll to check for read
+ * readiness.
+ */
#include <linux/lirc.h>
#include <errno.h>
#include <stdio.h>
--
2.25.1
This patch improves the clarity and grammar of output messages in the acct()
selftest. Minor changes were made to user-facing strings and comments to make
them easier to understand and more consistent with the kselftest style.
Changes include:
- Fixing grammar in printed messages and comments.
- Rewording error and success outputs for clarity and professionalism.
- Making the "root check" message more direct.
These updates improve readability without affecting test logic or behavior.
Signed-off-by: Abdelrahman Fekry <abdelrahmanfekry375(a)gmail.com>
---
tools/testing/selftests/acct/acct_syscall.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/tools/testing/selftests/acct/acct_syscall.c b/tools/testing/selftests/acct/acct_syscall.c
index 87c044fb9293..2c120a527574 100644
--- a/tools/testing/selftests/acct/acct_syscall.c
+++ b/tools/testing/selftests/acct/acct_syscall.c
@@ -22,9 +22,9 @@ int main(void)
ksft_print_header();
ksft_set_plan(1);
- // Check if test is run a root
+ // Check if test is run as root
if (geteuid()) {
- ksft_exit_skip("This test needs root to run!\n");
+ ksft_exit_skip("This test must be run as root!\n");
return 1;
}
@@ -52,7 +52,7 @@ int main(void)
child_pid = fork();
if (child_pid < 0) {
- ksft_test_result_error("Creating a child process to log failed\n");
+ ksft_test_result_error("Failed to create child process for logging\n");
acct(NULL);
return 1;
} else if (child_pid > 0) {
--
2.25.1
John, this revision introduces one more patch: "selftests/bpf: Introduce
verdict programs for sockmap_redir". I've kept you cross-series Acked-by. I
hope it's ok.
Jiayuan, I haven't heard back from you regarding [*], so I've kept things
unchanged for now. Please let me know what you think.
[*] https://lore.kernel.org/bpf/66bf942f-dfdb-4ce9-bd95-8b734e7afa53@rbox.co/
--
The idea behind this series is to comprehensively test the BPF redirection:
BPF_MAP_TYPE_SOCKMAP,
BPF_MAP_TYPE_SOCKHASH
x
sk_msg-to-egress,
sk_msg-to-ingress,
sk_skb-to-egress,
sk_skb-to-ingress
x
AF_INET, SOCK_STREAM,
AF_INET6, SOCK_STREAM,
AF_INET, SOCK_DGRAM,
AF_INET6, SOCK_DGRAM,
AF_UNIX, SOCK_STREAM,
AF_UNIX, SOCK_DGRAM,
AF_VSOCK, SOCK_STREAM,
AF_VSOCK, SOCK_SEQPACKET
New module is introduced, sockmap_redir: all supported and unsupported
redirect combinations are tested for success and failure respectively. Code
is pretty much stolen/adapted from Jakub Sitnicki's sockmap_redir_matrix.c
[1].
Usage:
$ cd tools/testing/selftests/bpf
$ make
$ sudo ./test_progs -t sockmap_redir
...
Summary: 1/576 PASSED, 0 SKIPPED, 0 FAILED
[1]: https://github.com/jsitnicki/sockmap-redir-matrix/blob/main/sockmap_redir_m…
Changes in v3:
- Drop unrelated changes; sockmap_listen, test_sockmap_listen, doc
- Collect tags [Jakub, John]
- Introduce BPF verdict programs especially for sockmap_redir [Jiayuan]
- Link to v2: https://lore.kernel.org/r/20250411-selftests-sockmap-redir-v2-0-5f9b018d670…
Changes in v2:
- Verify that the unsupported redirect combos do fail [Jakub]
- Dedup tests in sockmap_listen
- Cosmetic changes and code reordering
- Link to v1: https://lore.kernel.org/bpf/42939687-20f9-4a45-b7c2-342a0e11a014@rbox.co/
Suggested-by: Jakub Sitnicki <jakub(a)cloudflare.com>
Signed-off-by: Michal Luczaj <mhal(a)rbox.co>
---
Michal Luczaj (8):
selftests/bpf: Support af_unix SOCK_DGRAM socket pair creation
selftests/bpf: Add socket_kind_to_str() to socket_helpers
selftests/bpf: Add u32()/u64() to sockmap_helpers
selftests/bpf: Introduce verdict programs for sockmap_redir
selftests/bpf: Add selftest for sockmap/hashmap redirection
selftests/bpf: sockmap_listen cleanup: Drop af_vsock redir tests
selftests/bpf: sockmap_listen cleanup: Drop af_unix redir tests
selftests/bpf: sockmap_listen cleanup: Drop af_inet SOCK_DGRAM redir tests
.../selftests/bpf/prog_tests/socket_helpers.h | 84 +++-
.../selftests/bpf/prog_tests/sockmap_helpers.h | 25 +-
.../selftests/bpf/prog_tests/sockmap_listen.c | 457 --------------------
.../selftests/bpf/prog_tests/sockmap_redir.c | 465 +++++++++++++++++++++
.../selftests/bpf/progs/test_sockmap_redir.c | 68 +++
5 files changed, 623 insertions(+), 476 deletions(-)
---
base-commit: d0445d7dd3fd9b15af7564c38d7aa3cbc29778ee
change-id: 20240922-selftests-sockmap-redir-5d839396c75e
Best regards,
--
Michal Luczaj <mhal(a)rbox.co>
Improved the clarity and grammar in the header comment of nanosleep.c
for better readability and consistency with kernel documentation style.
Signed-off-by: Rahul Kumar <rk0006818(a)gmail.com>
---
tools/testing/selftests/timers/nanosleep.c | 22 +++++++++++-----------
1 file changed, 11 insertions(+), 11 deletions(-)
diff --git a/tools/testing/selftests/timers/nanosleep.c b/tools/testing/selftests/timers/nanosleep.c
index 252c6308c569..84adf8a4ab5d 100644
--- a/tools/testing/selftests/timers/nanosleep.c
+++ b/tools/testing/selftests/timers/nanosleep.c
@@ -1,12 +1,12 @@
-/* Make sure timers don't return early
- * by: john stultz (johnstul(a)us.ibm.com)
- * John Stultz (john.stultz(a)linaro.org)
- * (C) Copyright IBM 2012
- * (C) Copyright Linaro 2013 2015
- * Licensed under the GPLv2
+ /*
+ * Ensure timers do not return early.
+ * Author: John Stultz (john.stultz(a)linaro.org)
+ * Copyright (C) IBM 2012
+ * Copyright (C) Linaro 2013, 2015
+ * Licensed under the GPLv2
*
- * To build:
- * $ gcc nanosleep.c -o nanosleep -lrt
+ * To build:
+ * $ gcc nanosleep.c -o nanosleep -lrt
*
* This program is free software: you can redistribute it and/or modify
* it under the terms of the GNU General Public License as published by
@@ -61,7 +61,7 @@ char *clockstring(int clockid)
case CLOCK_TAI:
return "CLOCK_TAI";
};
- return "UNKNOWN_CLOCKID";
+ return "UNKNOWN_CLOCKID"; // Could not identify clockid
}
/* returns 1 if a <= b, 0 otherwise */
@@ -90,7 +90,7 @@ int nanosleep_test(int clockid, long long ns)
{
struct timespec now, target, rel;
- /* First check abs time */
+ /* First, check absolute time using clock_nanosleep with TIMER_ABSTIME */
if (clock_gettime(clockid, &now))
return UNSUPPORTED;
target = timespec_add(now, ns);
@@ -102,7 +102,7 @@ int nanosleep_test(int clockid, long long ns)
if (!in_order(target, now))
return -1;
- /* Second check reltime */
+ /* Then, test relative time sleep */
clock_gettime(clockid, &now);
rel.tv_sec = 0;
rel.tv_nsec = 0;
--
2.43.0
This patch adds a new robust_list() syscall. The current syscall
can't be expanded to cover the following use case, so a new one is
needed. This new syscall allows users to set multiple robust lists per
process and to have either 32bit or 64bit pointers in the list.
* Use case
FEX-Emu[1] is an application that runs x86 and x86-64 binaries on an
AArch64 Linux host. One of the tasks of FEX-Emu is to translate syscalls
from one platform to another. Existing set_robust_list() can't be easily
translated because of two limitations:
1) x86 apps can have 32bit pointers robust lists. For a x86-64 kernel
this is not a problem, because of the compat entry point. But there's
no such compat entry point for AArch64, so the kernel would do the
pointer arithmetic wrongly. Is also unviable to userspace to keep
track every addition/removal to the robust list and keep a 64bit
version of it somewhere else to feed the kernel. Thus, the new
interface has an option of telling the kernel if the list is filled
with 32bit or 64bit pointers.
2) Apps can set just one robust list (in theory, x86-64 can set two if
they also use the compat entry point). That means that when a x86 app
asks FEX-Emu to call set_robust_list(), FEX have two options: to
overwrite their own robust list pointer and make the app robust, or
to ignore the app robust list and keep the emulator robust. The new
interface allows for multiple robust lists per application, solving
this.
* Interface
This is the proposed interface:
long set_robust_list2(void *head, int index, unsigned int flags)
`head` is the head of the userspace struct robust_list_head, just as old
set_robust_list(). It needs to be a void pointer since it can point to a normal
robust_list_head or a compat_robust_list_head.
`flags` can be used for defining the list type:
enum robust_list_type {
ROBUST_LIST_32BIT,
ROBUST_LIST_64BIT,
};
`index` is the index in the internal robust_list's linked list (the naming
starts to get confusing, I reckon). If `index == -1`, that means that user wants
to set a new robust_list, and the kernel will append it in the end of the list,
assign a new index and return this index to the user. If `index >= 0`, that
means that user wants to re-set `*head` of an already existing list (similarly
to what happens when you call set_robust_list() twice with different `*head`).
If `index` is out of range, or it points to a non-existing robust_list, or if
the internal list is full, an error is returned.
* Implementation
The implementation re-uses most of the existing robust list interface as
possible. The new task_struct member `struct list_head robust_list2` is just a
linked list where new lists are appended as the user requests more lists, and by
futex_cleanup(), the kernel walks through the internal list feeding
exit_robust_list() with the robust_list's.
This implementation supports up to 10 lists (defined at ROBUST_LISTS_PER_TASK),
but it was an arbitrary number for this RFC. For the described use case above, 4
should be enough, I'm not sure which should be the limit.
It doesn't support list removal (should it support?). It doesn't have a proper
get_robust_list2() yet as well, but I can add it in a next revision. We could
also have a generic robust_list() syscall that can be used to set/get and be
controlled by flags.
The new interface has a `unsigned int flags` argument, making it
extensible for future use cases as well.
It refuses unaligned `head` addresses. It doesn't have a limit for elements in a
single list (like ROBUST_LIST_LIMIT), it destroys the list as it is parsed to be
safe against circular lists.
* Testing
This patcheset has a selftest patch that expands this one:
https://lore.kernel.org/lkml/20250212131123.37431-1-andrealmeid@igalia.com/
Also, FEX-Emu added support for this interface to validate it:
https://github.com/FEX-Emu/FEX/pull/3966
Feedback is very welcomed!
Thanks,
André
[1] https://github.com/FEX-Emu/FEX
Changelog:
- Rebased on top of new futex work (private hash)
v4: https://lore.kernel.org/lkml/20250225183531.682556-1-andrealmeid@igalia.com/
- Refuse unaligned head pointers
- Ignore ROBUST_LIST_LIMIT for lists created with this interface and make it
robust against circular lists
- Fix a get_robust_list() syscall bug for getting the list from another thread
- Adapt selftest to use the new interface
v3: https://lore.kernel.org/lkml/20241217174958.477692-1-andrealmeid@igalia.com/
- Old syscall set_robust_list() adds new head to the internal linked list of
robust lists pointers, instead of having a field just for them. Remove
tsk->robust_list and use only tsk->robust_list2
v2: https://lore.kernel.org/lkml/20241101162147.284993-1-andrealmeid@igalia.com/
- Added a patch to properly deal with exit_robust_list() in 64bit vs 32bit
- Wired-up syscall for all archs
- Added more of the cover letter to the commit message
v1: https://lore.kernel.org/lkml/20241024145735.162090-1-andrealmeid@igalia.com/
---
André Almeida (7):
selftests/futex: Add ASSERT_ macros
selftests/futex: Create test for robust list
futex: Use explicit sizes for compat_exit_robust_list
futex: Create set_robust_list2
futex: Wire up set_robust_list2 syscall
futex: Remove the limit of elements for sys_set_robust_list2 lists
selftests: futex: Expand robust list test for the new interface
arch/alpha/kernel/syscalls/syscall.tbl | 1 +
arch/arm/tools/syscall.tbl | 1 +
arch/m68k/kernel/syscalls/syscall.tbl | 1 +
arch/microblaze/kernel/syscalls/syscall.tbl | 1 +
arch/mips/kernel/syscalls/syscall_n32.tbl | 1 +
arch/mips/kernel/syscalls/syscall_n64.tbl | 1 +
arch/mips/kernel/syscalls/syscall_o32.tbl | 1 +
arch/parisc/kernel/syscalls/syscall.tbl | 1 +
arch/powerpc/kernel/syscalls/syscall.tbl | 1 +
arch/s390/kernel/syscalls/syscall.tbl | 1 +
arch/sh/kernel/syscalls/syscall.tbl | 1 +
arch/sparc/kernel/syscalls/syscall.tbl | 1 +
arch/x86/entry/syscalls/syscall_32.tbl | 1 +
arch/x86/entry/syscalls/syscall_64.tbl | 1 +
arch/xtensa/kernel/syscalls/syscall.tbl | 1 +
include/linux/compat.h | 12 +-
include/linux/futex.h | 16 +-
include/linux/sched.h | 5 +-
include/uapi/asm-generic/unistd.h | 2 +
include/uapi/linux/futex.h | 24 +
kernel/futex/core.c | 165 ++++-
kernel/futex/futex.h | 5 +
kernel/futex/syscalls.c | 85 ++-
kernel/sys_ni.c | 1 +
scripts/syscall.tbl | 1 +
.../testing/selftests/futex/functional/.gitignore | 1 +
tools/testing/selftests/futex/functional/Makefile | 3 +-
.../selftests/futex/functional/robust_list.c | 706 +++++++++++++++++++++
tools/testing/selftests/futex/include/logging.h | 38 ++
29 files changed, 1026 insertions(+), 53 deletions(-)
---
base-commit: 3ee84e3dd88e39b55b534e17a7b9a181f1d46809
change-id: 20250225-tonyk-robust_futex-60adeedac695
Best regards,
--
André Almeida <andrealmeid(a)igalia.com>
There is a spelling mistake in a ksft_test_result message. Fix it.
Signed-off-by: Colin Ian King <colin.i.king(a)gmail.com>
---
tools/testing/selftests/futex/functional/futex_priv_hash.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/tools/testing/selftests/futex/functional/futex_priv_hash.c b/tools/testing/selftests/futex/functional/futex_priv_hash.c
index 2dca18fefedc..0213eb0bb4af 100644
--- a/tools/testing/selftests/futex/functional/futex_priv_hash.c
+++ b/tools/testing/selftests/futex/functional/futex_priv_hash.c
@@ -242,7 +242,7 @@ int main(int argc, char *argv[])
join_max_threads();
ret = futex_hash_slots_get();
- ksft_test_result(ret == 2, "No more auto-resize after manaul setting, got %d\n",
+ ksft_test_result(ret == 2, "No more auto-resize after manual setting, got %d\n",
ret);
futex_hash_slots_set_must_fail(1 << 29, 0);
--
2.49.0
The SBI Firmware Feature extension allows the S-mode to request some
specific features (either hardware or software) to be enabled. This
series uses this extension to request misaligned access exception
delegation to S-mode in order to let the kernel handle it. It also adds
support for the KVM FWFT SBI extension based on the misaligned access
handling infrastructure.
FWFT SBI extension is part of the SBI V3.0 specifications [1]. It can be
tested using the qemu provided at [2] which contains the series from
[3]. Upstream kvm-unit-tests can be used inside kvm to tests the correct
delegation of misaligned exceptions. Upstream OpenSBI can be used.
Note: Since SBI V3.0 is not yet ratified, FWFT extension API is split
between interface only and implementation, allowing to pick only the
interface which do not have hard dependencies on SBI.
The tests can be run using the kselftest from series [4].
$ qemu-system-riscv64 \
-cpu rv64,trap-misaligned-access=true,v=true \
-M virt \
-m 1024M \
-bios fw_dynamic.bin \
-kernel Image
...
# ./misaligned
TAP version 13
1..23
# Starting 23 tests from 1 test cases.
# RUN global.gp_load_lh ...
# OK global.gp_load_lh
ok 1 global.gp_load_lh
# RUN global.gp_load_lhu ...
# OK global.gp_load_lhu
ok 2 global.gp_load_lhu
# RUN global.gp_load_lw ...
# OK global.gp_load_lw
ok 3 global.gp_load_lw
# RUN global.gp_load_lwu ...
# OK global.gp_load_lwu
ok 4 global.gp_load_lwu
# RUN global.gp_load_ld ...
# OK global.gp_load_ld
ok 5 global.gp_load_ld
# RUN global.gp_load_c_lw ...
# OK global.gp_load_c_lw
ok 6 global.gp_load_c_lw
# RUN global.gp_load_c_ld ...
# OK global.gp_load_c_ld
ok 7 global.gp_load_c_ld
# RUN global.gp_load_c_ldsp ...
# OK global.gp_load_c_ldsp
ok 8 global.gp_load_c_ldsp
# RUN global.gp_load_sh ...
# OK global.gp_load_sh
ok 9 global.gp_load_sh
# RUN global.gp_load_sw ...
# OK global.gp_load_sw
ok 10 global.gp_load_sw
# RUN global.gp_load_sd ...
# OK global.gp_load_sd
ok 11 global.gp_load_sd
# RUN global.gp_load_c_sw ...
# OK global.gp_load_c_sw
ok 12 global.gp_load_c_sw
# RUN global.gp_load_c_sd ...
# OK global.gp_load_c_sd
ok 13 global.gp_load_c_sd
# RUN global.gp_load_c_sdsp ...
# OK global.gp_load_c_sdsp
ok 14 global.gp_load_c_sdsp
# RUN global.fpu_load_flw ...
# OK global.fpu_load_flw
ok 15 global.fpu_load_flw
# RUN global.fpu_load_fld ...
# OK global.fpu_load_fld
ok 16 global.fpu_load_fld
# RUN global.fpu_load_c_fld ...
# OK global.fpu_load_c_fld
ok 17 global.fpu_load_c_fld
# RUN global.fpu_load_c_fldsp ...
# OK global.fpu_load_c_fldsp
ok 18 global.fpu_load_c_fldsp
# RUN global.fpu_store_fsw ...
# OK global.fpu_store_fsw
ok 19 global.fpu_store_fsw
# RUN global.fpu_store_fsd ...
# OK global.fpu_store_fsd
ok 20 global.fpu_store_fsd
# RUN global.fpu_store_c_fsd ...
# OK global.fpu_store_c_fsd
ok 21 global.fpu_store_c_fsd
# RUN global.fpu_store_c_fsdsp ...
# OK global.fpu_store_c_fsdsp
ok 22 global.fpu_store_c_fsdsp
# RUN global.gen_sigbus ...
[12797.988647] misaligned[618]: unhandled signal 7 code 0x1 at 0x0000000000014dc0 in misaligned[4dc0,10000+76000]
[12797.988990] CPU: 0 UID: 0 PID: 618 Comm: misaligned Not tainted 6.13.0-rc6-00008-g4ec4468967c9-dirty #51
[12797.989169] Hardware name: riscv-virtio,qemu (DT)
[12797.989264] epc : 0000000000014dc0 ra : 0000000000014d00 sp : 00007fffe165d100
[12797.989407] gp : 000000000008f6e8 tp : 0000000000095760 t0 : 0000000000000008
[12797.989544] t1 : 00000000000965d8 t2 : 000000000008e830 s0 : 00007fffe165d160
[12797.989692] s1 : 000000000000001a a0 : 0000000000000000 a1 : 0000000000000002
[12797.989831] a2 : 0000000000000000 a3 : 0000000000000000 a4 : ffffffffdeadbeef
[12797.989964] a5 : 000000000008ef61 a6 : 626769735f6e0000 a7 : fffffffffffff000
[12797.990094] s2 : 0000000000000001 s3 : 00007fffe165d838 s4 : 00007fffe165d848
[12797.990238] s5 : 000000000000001a s6 : 0000000000010442 s7 : 0000000000010200
[12797.990391] s8 : 000000000000003a s9 : 0000000000094508 s10: 0000000000000000
[12797.990526] s11: 0000555567460668 t3 : 00007fffe165d070 t4 : 00000000000965d0
[12797.990656] t5 : fefefefefefefeff t6 : 0000000000000073
[12797.990756] status: 0000000200004020 badaddr: 000000000008ef61 cause: 0000000000000006
[12797.990911] Code: 8793 8791 3423 fcf4 3783 fc84 c737 dead 0713 eef7 (c398) 0001
# OK global.gen_sigbus
ok 23 global.gen_sigbus
# PASSED: 23 / 23 tests passed.
# Totals: pass:23 fail:0 xfail:0 xpass:0 skip:0 error:0
With kvm-tools:
# lkvm run -k sbi.flat -m 128
Info: # lkvm run -k sbi.flat -m 128 -c 1 --name guest-97
Info: Removed ghost socket file "/root/.lkvm//guest-97.sock".
##########################################################################
# kvm-unit-tests
##########################################################################
... [test messages elided]
PASS: sbi: fwft: FWFT extension probing no error
PASS: sbi: fwft: get/set reserved feature 0x6 error == SBI_ERR_DENIED
PASS: sbi: fwft: get/set reserved feature 0x3fffffff error == SBI_ERR_DENIED
PASS: sbi: fwft: get/set reserved feature 0x80000000 error == SBI_ERR_DENIED
PASS: sbi: fwft: get/set reserved feature 0xbfffffff error == SBI_ERR_DENIED
PASS: sbi: fwft: misaligned_deleg: Get misaligned deleg feature no error
PASS: sbi: fwft: misaligned_deleg: Set misaligned deleg feature invalid value error
PASS: sbi: fwft: misaligned_deleg: Set misaligned deleg feature invalid value error
PASS: sbi: fwft: misaligned_deleg: Set misaligned deleg feature value no error
PASS: sbi: fwft: misaligned_deleg: Set misaligned deleg feature value 0
PASS: sbi: fwft: misaligned_deleg: Set misaligned deleg feature value no error
PASS: sbi: fwft: misaligned_deleg: Set misaligned deleg feature value 1
PASS: sbi: fwft: misaligned_deleg: Verify misaligned load exception trap in supervisor
SUMMARY: 50 tests, 2 unexpected failures, 12 skipped
This series is available at [5].
Link: https://github.com/riscv-non-isa/riscv-sbi-doc/releases/download/vv3.0-rc2/… [1]
Link: https://github.com/rivosinc/qemu/tree/dev/cleger/misaligned [2]
Link: https://lore.kernel.org/all/20241211211933.198792-3-fkonrad@amd.com/T/ [3]
Link: https://lore.kernel.org/linux-riscv/20250414123543.1615478-1-cleger@rivosin… [4]
Link: https://github.com/rivosinc/linux/tree/dev/cleger/fwft [5]
---
V7:
- Fix ifdefery build problems
- Move sbi_fwft_is_supported with fwft_set_req struct
- Added Atish Reviewed-by
- Updated KVM vcpu cfg hedeleg value in set_delegation
- Changed SBI ETIME error mapping to ETIMEDOUT
- Fixed a few typo reported by Alok
V6:
- Rename FWFT interface to remove "_local"
- Fix test for MEDELEG values in KVM FWFT support
- Add __init for unaligned_access_init()
- Rebased on master
V5:
- Return ERANGE as mapping for SBI_ERR_BAD_RANGE
- Removed unused sbi_fwft_get()
- Fix kernel for sbi_fwft_local_set_cpumask()
- Fix indentation for sbi_fwft_local_set()
- Remove spurious space in kvm_sbi_fwft_ops.
- Rebased on origin/master
- Remove fixes commits and sent them as a separate series [4]
V4:
- Check SBI version 3.0 instead of 2.0 for FWFT presence
- Use long for kvm_sbi_fwft operation return value
- Init KVM sbi extension even if default_disabled
- Remove revert_on_fail parameter for sbi_fwft_feature_set().
- Fix comments for sbi_fwft_set/get()
- Only handle local features (there are no globals yet in the spec)
- Add new SBI errors to sbi_err_map_linux_errno()
V3:
- Added comment about kvm sbi fwft supported/set/get callback
requirements
- Move struct kvm_sbi_fwft_feature in kvm_sbi_fwft.c
- Add a FWFT interface
V2:
- Added Kselftest for misaligned testing
- Added get_user() usage instead of __get_user()
- Reenable interrupt when possible in misaligned access handling
- Document that riscv supports unaligned-traps
- Fix KVM extension state when an init function is present
- Rework SBI misaligned accesses trap delegation code
- Added support for CPU hotplugging
- Added KVM SBI reset callback
- Added reset for KVM SBI FWFT lock
- Return SBI_ERR_DENIED_LOCKED when LOCK flag is set
Clément Léger (14):
riscv: sbi: add Firmware Feature (FWFT) SBI extensions definitions
riscv: sbi: remove useless parenthesis
riscv: sbi: add new SBI error mappings
riscv: sbi: add FWFT extension interface
riscv: sbi: add SBI FWFT extension calls
riscv: misaligned: request misaligned exception from SBI
riscv: misaligned: use on_each_cpu() for scalar misaligned access
probing
riscv: misaligned: use correct CONFIG_ ifdef for
misaligned_access_speed
riscv: misaligned: move emulated access uniformity check in a function
riscv: misaligned: add a function to check misalign trap delegability
RISC-V: KVM: add SBI extension init()/deinit() functions
RISC-V: KVM: add SBI extension reset callback
RISC-V: KVM: add support for FWFT SBI extension
RISC-V: KVM: add support for SBI_FWFT_MISALIGNED_DELEG
arch/riscv/include/asm/cpufeature.h | 12 +-
arch/riscv/include/asm/kvm_host.h | 5 +-
arch/riscv/include/asm/kvm_vcpu_sbi.h | 12 +
arch/riscv/include/asm/kvm_vcpu_sbi_fwft.h | 29 +++
arch/riscv/include/asm/sbi.h | 60 +++++
arch/riscv/include/uapi/asm/kvm.h | 1 +
arch/riscv/kernel/sbi.c | 81 ++++++-
arch/riscv/kernel/traps_misaligned.c | 112 ++++++++-
arch/riscv/kernel/unaligned_access_speed.c | 8 +-
arch/riscv/kvm/Makefile | 1 +
arch/riscv/kvm/vcpu.c | 4 +-
arch/riscv/kvm/vcpu_sbi.c | 54 +++++
arch/riscv/kvm/vcpu_sbi_fwft.c | 257 +++++++++++++++++++++
arch/riscv/kvm/vcpu_sbi_sta.c | 3 +-
14 files changed, 620 insertions(+), 19 deletions(-)
create mode 100644 arch/riscv/include/asm/kvm_vcpu_sbi_fwft.h
create mode 100644 arch/riscv/kvm/vcpu_sbi_fwft.c
--
2.49.0
This patch set convert the wireguard selftest to nftables, as iptables is
deparated and nftables is the default framework of most releases.
v6: fix typo in patch 1/2. Update the description (Phil Sutter)
v5: remove the counter in nft rules and link nft statically (Jason A. Donenfeld)
v4: no update, just re-send
v3: drop iptables directly (Jason A. Donenfeld)
Also convert to using nft for qemu testing (Jason A. Donenfeld)
v2: use one nft table for testing (Phil Sutter)
Hangbin Liu (2):
wireguard: selftests: convert iptables to nft
wireguard: selftests: update to using nft for qemu test
tools/testing/selftests/wireguard/netns.sh | 29 +++++++++------
.../testing/selftests/wireguard/qemu/Makefile | 36 ++++++++++++++-----
.../selftests/wireguard/qemu/kernel.config | 7 ++--
3 files changed, 49 insertions(+), 23 deletions(-)
--
2.46.0
This patch series introduces the Hornet LSM. The goal of Hornet is to
provide a signature verification mechanism for eBPF programs.
eBPF has similar requirements to that of modules when it comes to
loading: find symbol addresses, fix up ELF relocations, some struct
field offset handling stuff called CO-RE (compile-once run-anywhere),
and some other miscellaneous bookkeeping. During eBPF program
compilation, pseudo-values get written to the immediate operands of
instructions. During loading, those pseudo-values get rewritten with
concrete addresses or data applicable to the currently running system,
e.g., a kallsyms address or an fd for a map. This needs to happen
before the instructions for a bpf program are loaded into the kernel
via the bpf() syscall. Unlike modules, an in-kernel loader
unfortunately doesn't exist. Typically, the instruction rewriting is
done dynamically in userspace via libbpf. Since the relocations and
instruction modifications are happening in userspace, and their values
may change depending upon the running system, this breaks known
signature verification mechanisms.
Light skeleton programs were introduced in order to support early
loading of eBPF programs along with user-mode drivers. They utilize a
separate eBPF program that can load a target eBPF program and perform
all necessary relocations in-kernel without needing a working
userspace. Light skeletons were mentioned as a possible path forward
for signature verification.
Hornet takes a simple approach to light-skeleton-based eBPF signature
verification. A PKCS#7 signature of a data buffer containing the raw
instructions of an eBPF program, followed by the initial values of any
maps used by the program is used. A utility script is provided to
parse and extract the contents of autogenerated header files created
via bpftool. That payload can then be signed and appended to the light
skeleton executable.
Maps are checked that they are frozen to prevent TOCTOU bugs where a
sufficiently privileged user could rewrite map data between the calls
to BPF_PROG_LOAD and BPF_PROG_RUN. Additionally, both
sparse-array-based and fd_array_cnt-based map fd arrays are supported
for signature verification.
References:
[1] https://lore.kernel.org/bpf/20220209054315.73833-1-alexei.starovoitov@gmail…
[2] https://lore.kernel.org/bpf/CAADnVQ+wPK1KKZhCgb-Nnf0Xfjk8M1UpX5fnXC=cBzdEYb…
Change list:
- v2 -> v3
- Remove any and all usage of proprietary bpf APIs
- Add optional systemd/pid1 whitelisting
- Minor Makefile cleanup
- Fixed buffer leak
- Handled null current task
- Made magic number required
- Defensive checks against invalid buffer signature reads
- v1 -> v2
- Jargon clarification, maintainer entry and a few cosmetic fixes
Revisions:
- v1
https://lore.kernel.org/bpf/20250321164537.16719-1-bboscaccy@linux.microsof…
- v2
https://lore.kernel.org/linux-security-module/20250404215527.1563146-1-bbos…
Blaise Boscaccy (4):
security: Hornet LSM
hornet: Introduce sign-ebpf
hornet: Add a light skeleton data extractor script
selftests/hornet: Add a selftest for the Hornet LSM
Documentation/admin-guide/LSM/Hornet.rst | 65 +++
Documentation/admin-guide/LSM/index.rst | 1 +
MAINTAINERS | 9 +
crypto/asymmetric_keys/pkcs7_verify.c | 10 +
include/linux/kernel_read_file.h | 1 +
include/linux/verification.h | 1 +
include/uapi/linux/lsm.h | 1 +
scripts/Makefile | 1 +
scripts/hornet/Makefile | 5 +
scripts/hornet/extract-skel.sh | 29 ++
scripts/hornet/sign-ebpf.c | 411 ++++++++++++++++++
security/Kconfig | 3 +-
security/Makefile | 1 +
security/hornet/Kconfig | 24 +
security/hornet/Makefile | 4 +
security/hornet/hornet_lsm.c | 250 +++++++++++
security/selinux/hooks.c | 12 +-
security/selinux/include/classmap.h | 2 +-
tools/testing/selftests/Makefile | 1 +
tools/testing/selftests/hornet/Makefile | 58 +++
tools/testing/selftests/hornet/fail_loader.sh | 3 +
tools/testing/selftests/hornet/frozen_skel.h | 393 +++++++++++++++++
tools/testing/selftests/hornet/loader.c | 22 +
tools/testing/selftests/hornet/trivial.bpf.c | 33 ++
24 files changed, 1336 insertions(+), 4 deletions(-)
create mode 100644 Documentation/admin-guide/LSM/Hornet.rst
create mode 100644 scripts/hornet/Makefile
create mode 100755 scripts/hornet/extract-skel.sh
create mode 100644 scripts/hornet/sign-ebpf.c
create mode 100644 security/hornet/Kconfig
create mode 100644 security/hornet/Makefile
create mode 100644 security/hornet/hornet_lsm.c
create mode 100644 tools/testing/selftests/hornet/Makefile
create mode 100755 tools/testing/selftests/hornet/fail_loader.sh
create mode 100644 tools/testing/selftests/hornet/frozen_skel.h
create mode 100644 tools/testing/selftests/hornet/loader.c
create mode 100644 tools/testing/selftests/hornet/trivial.bpf.c
--
2.48.1
Fix typos in comments in pidfd_info_test.c
Signed-off-by: Ivan Pravdin <ipravdin.official(a)gmail.com>
---
tools/testing/selftests/pidfd/pidfd_info_test.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/tools/testing/selftests/pidfd/pidfd_info_test.c b/tools/testing/selftests/pidfd/pidfd_info_test.c
index 1758a1b0457b..53cb429f5fa0 100644
--- a/tools/testing/selftests/pidfd/pidfd_info_test.c
+++ b/tools/testing/selftests/pidfd/pidfd_info_test.c
@@ -229,7 +229,7 @@ static void *pidfd_info_pause_thread(void *arg)
close(ipc_socket);
- /* Sleep untill we're killed. */
+ /* Sleep until we're killed. */
pause();
return NULL;
}
@@ -518,7 +518,7 @@ TEST_F(pidfd_info, thread_group_exec)
EXPECT_EQ(sys_pidfd_send_signal(pidfd_leader, SIGKILL, NULL, 0), 0);
/*
- * Afte the exec we're dealing with an empty thread-group so now
+ * After the exec we're dealing with an empty thread-group so now
* we must see an exit notification on the thread-specific pidfd
* for the thread-group leader as there's no subthread that can
* revive the struct pid.
@@ -665,7 +665,7 @@ TEST_F(pidfd_info, thread_group_exec_thread)
EXPECT_EQ(sys_pidfd_send_signal(pidfd_leader, SIGKILL, NULL, 0), 0);
/*
- * Afte the exec we're dealing with an empty thread-group so now
+ * After the exec we're dealing with an empty thread-group so now
* we must see an exit notification on the thread-specific pidfd
* for the thread-group leader as there's no subthread that can
* revive the struct pid.
--
2.45.2
Cover three recent cases:
1. missing ops locking for the lowers during netdev_sync_lower_features
2. missing locking for dev_set_promiscuity (plus netdev_ops_assert_locked
with a comment on why/when it's needed)
3. rcu lock during team_change_rx_flags
Verified that each one triggers when the respective fix is reverted.
Not sure about the placement, but since it all relies on teaming,
added to the teaming directory.
One ugly bit is that I add NETIF_F_LRO to netdevsim; there is no way
to trigger netdev_sync_lower_features without it.
Signed-off-by: Stanislav Fomichev <stfomichev(a)gmail.com>
---
v2: fix cleanup and wrong exit code (Jakub)
---
drivers/net/netdevsim/netdev.c | 2 +
net/core/dev.c | 10 ++-
.../selftests/drivers/net/team/Makefile | 2 +-
.../testing/selftests/drivers/net/team/config | 1 +
.../selftests/drivers/net/team/propagation.sh | 80 +++++++++++++++++++
5 files changed, 93 insertions(+), 2 deletions(-)
create mode 100755 tools/testing/selftests/drivers/net/team/propagation.sh
diff --git a/drivers/net/netdevsim/netdev.c b/drivers/net/netdevsim/netdev.c
index 0e0321a7ddd7..3bd1f8cffee8 100644
--- a/drivers/net/netdevsim/netdev.c
+++ b/drivers/net/netdevsim/netdev.c
@@ -879,11 +879,13 @@ static void nsim_setup(struct net_device *dev)
NETIF_F_SG |
NETIF_F_FRAGLIST |
NETIF_F_HW_CSUM |
+ NETIF_F_LRO |
NETIF_F_TSO;
dev->hw_features |= NETIF_F_HW_TC |
NETIF_F_SG |
NETIF_F_FRAGLIST |
NETIF_F_HW_CSUM |
+ NETIF_F_LRO |
NETIF_F_TSO;
dev->max_mtu = ETH_MAX_MTU;
dev->xdp_features = NETDEV_XDP_ACT_HW_OFFLOAD;
diff --git a/net/core/dev.c b/net/core/dev.c
index 0d891634c692..4debd4b8e0f5 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -9188,8 +9188,16 @@ static int __dev_set_promiscuity(struct net_device *dev, int inc, bool notify)
dev_change_rx_flags(dev, IFF_PROMISC);
}
- if (notify)
+ if (notify) {
+ /* The ops lock is only required to ensure consistent locking
+ * for `NETDEV_CHANGE` notifiers. This function is sometimes
+ * called without the lock, even for devices that are ops
+ * locked, such as in `dev_uc_sync_multiple` when using
+ * bonding or teaming.
+ */
+ netdev_ops_assert_locked(dev);
__dev_notify_flags(dev, old_flags, IFF_PROMISC, 0, NULL);
+ }
return 0;
}
diff --git a/tools/testing/selftests/drivers/net/team/Makefile b/tools/testing/selftests/drivers/net/team/Makefile
index 2d5a76d99181..eaf6938f100e 100644
--- a/tools/testing/selftests/drivers/net/team/Makefile
+++ b/tools/testing/selftests/drivers/net/team/Makefile
@@ -1,7 +1,7 @@
# SPDX-License-Identifier: GPL-2.0
# Makefile for net selftests
-TEST_PROGS := dev_addr_lists.sh
+TEST_PROGS := dev_addr_lists.sh propagation.sh
TEST_INCLUDES := \
../bonding/lag_lib.sh \
diff --git a/tools/testing/selftests/drivers/net/team/config b/tools/testing/selftests/drivers/net/team/config
index b5e3a3aad4bf..636b3525b679 100644
--- a/tools/testing/selftests/drivers/net/team/config
+++ b/tools/testing/selftests/drivers/net/team/config
@@ -1,5 +1,6 @@
CONFIG_DUMMY=y
CONFIG_IPV6=y
CONFIG_MACVLAN=y
+CONFIG_NETDEVSIM=m
CONFIG_NET_TEAM=y
CONFIG_NET_TEAM_MODE_LOADBALANCE=y
diff --git a/tools/testing/selftests/drivers/net/team/propagation.sh b/tools/testing/selftests/drivers/net/team/propagation.sh
new file mode 100755
index 000000000000..4bea75b79878
--- /dev/null
+++ b/tools/testing/selftests/drivers/net/team/propagation.sh
@@ -0,0 +1,80 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+
+set -e
+
+NSIM_LRO_ID=$((256 + RANDOM % 256))
+NSIM_LRO_SYS=/sys/bus/netdevsim/devices/netdevsim$NSIM_LRO_ID
+
+NSIM_DEV_SYS_NEW=/sys/bus/netdevsim/new_device
+NSIM_DEV_SYS_DEL=/sys/bus/netdevsim/del_device
+
+cleanup()
+{
+ set +e
+ ip link del dummyteam &>/dev/null
+ ip link del team0 &>/dev/null
+ echo $NSIM_LRO_ID > $NSIM_DEV_SYS_DEL
+ modprobe -r netdevsim
+}
+
+# Trigger LRO propagation to the lower.
+# https://lore.kernel.org/netdev/aBvOpkIoxcr9PfDg@mini-arch/
+team_lro()
+{
+ # using netdevsim because it supports NETIF_F_LRO
+ NSIM_LRO_NAME=$(find $NSIM_LRO_SYS/net -maxdepth 1 -type d ! \
+ -path $NSIM_LRO_SYS/net -exec basename {} \;)
+
+ ip link add name team0 type team
+ ip link set $NSIM_LRO_NAME down
+ ip link set dev $NSIM_LRO_NAME master team0
+ ip link set team0 up
+ ethtool -K team0 large-receive-offload off
+
+ ip link del team0
+}
+
+# Trigger promisc propagation to the lower during IFLA_MASTER.
+# https://lore.kernel.org/netdev/20250506032328.3003050-1-sdf@fomichev.me/
+team_promisc()
+{
+ ip link add name dummyteam type dummy
+ ip link add name team0 type team
+ ip link set dummyteam down
+ ip link set team0 promisc on
+ ip link set dev dummyteam master team0
+ ip link set team0 up
+
+ ip link del team0
+ ip link del dummyteam
+}
+
+# Trigger promisc propagation to the lower via netif_change_flags (aka
+# ndo_change_rx_flags).
+# https://lore.kernel.org/netdev/20250514220319.3505158-1-stfomichev@gmail.co…
+team_change_flags()
+{
+ ip link add name dummyteam type dummy
+ ip link add name team0 type team
+ ip link set dummyteam down
+ ip link set dev dummyteam master team0
+ ip link set team0 up
+ ip link set team0 promisc on
+
+ # Make sure we can add more L2 addresses without any issues.
+ ip link add link team0 address 00:00:00:00:00:01 team0.1 type macvlan
+ ip link set team0.1 up
+
+ ip link del team0.1
+ ip link del team0
+ ip link del dummyteam
+}
+
+trap cleanup EXIT
+modprobe netdevsim || :
+echo $NSIM_LRO_ID > $NSIM_DEV_SYS_NEW
+udevadm settle
+team_lro
+team_promisc
+team_change_flags
--
2.49.0
Fix "withouth" to "without"
Fix "instaces" to "instances"
Signed-off-by: Sumanth Gavini <sumanth.gavini(a)yahoo.com>
---
tools/testing/selftests/net/srv6_end_flavors_test.sh | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/tools/testing/selftests/net/srv6_end_flavors_test.sh b/tools/testing/selftests/net/srv6_end_flavors_test.sh
index 50563443a4ad..318487eda671 100755
--- a/tools/testing/selftests/net/srv6_end_flavors_test.sh
+++ b/tools/testing/selftests/net/srv6_end_flavors_test.sh
@@ -399,7 +399,7 @@ __get_srv6_rtcfg_id()
# Given the description of a router <id:op> as an input, the function returns
# the <op> token which represents the operation (e.g. End behavior with or
-# withouth flavors) configured for the node.
+# without flavors) configured for the node.
# Note that when the operation represents an End behavior with a list of
# flavors, the output is the ordered version of that list.
@@ -480,7 +480,7 @@ setup_rt_local_sids()
# all SIDs start with a common locator. Routes and SRv6 Endpoint
- # behavior instaces are grouped together in the 'localsid' table.
+ # behavior instances are grouped together in the 'localsid' table.
ip -netns "${nsname}" -6 rule \
add to "${LOCATOR_SERVICE}::/16" \
lookup "${LOCALSID_TABLE_ID}" prio 999
--
2.43.0
Basics and overview
===================
Software with larger attack surfaces (e.g. network facing apps like databases,
browsers or apps relying on browser runtimes) suffer from memory corruption
issues which can be utilized by attackers to bend control flow of the program
to eventually gain control (by making their payload executable). Attackers are
able to perform such attacks by leveraging call-sites which rely on indirect
calls or return sites which rely on obtaining return address from stack memory.
To mitigate such attacks, risc-v extension zicfilp enforces that all indirect
calls must land on a landing pad instruction `lpad` else cpu will raise software
check exception (a new cpu exception cause code on riscv).
Similarly for return flow, risc-v extension zicfiss extends architecture with
- `sspush` instruction to push return address on a shadow stack
- `sspopchk` instruction to pop return address from shadow stack
and compare with input operand (i.e. return address on stack)
- `sspopchk` to raise software check exception if comparision above
was a mismatch
- Protection mechanism using which shadow stack is not writeable via
regular store instructions
More information an details can be found at extensions github repo [1].
Equivalent to landing pad (zicfilp) on x86 is `ENDBRANCH` instruction in Intel
CET [3] and branch target identification (BTI) [4] on arm.
Similarly x86's Intel CET has shadow stack [5] and arm64 has guarded control
stack (GCS) [6] which are very similar to risc-v's zicfiss shadow stack.
x86 and arm64 support for user mode shadow stack is already in mainline.
Kernel awareness for user control flow integrity
================================================
This series picks up Samuel Holland's envcfg changes [2] as well. So if those are
being applied independently, they should be removed from this series.
Enabling:
In order to maintain compatibility and not break anything in user mode, kernel
doesn't enable control flow integrity cpu extensions on binary by default.
Instead exposes a prctl interface to enable, disable and lock the shadow stack
or landing pad feature for a task. This allows userspace (loader) to enumerate
if all objects in its address space are compiled with shadow stack and landing
pad support and accordingly enable the feature. Additionally if a subsequent
`dlopen` happens on a library, user mode can take a decision again to disable
the feature (if incoming library is not compiled with support) OR terminate the
task (if user mode policy is strict to have all objects in address space to be
compiled with control flow integirty cpu feature). prctl to enable shadow stack
results in allocating shadow stack from virtual memory and activating for user
address space. x86 and arm64 are also following same direction due to similar
reason(s).
clone/fork:
On clone and fork, cfi state for task is inherited by child. Shadow stack is
part of virtual memory and is a writeable memory from kernel perspective
(writeable via a restricted set of instructions aka shadow stack instructions)
Thus kernel changes ensure that this memory is converted into read-only when
fork/clone happens and COWed when fault is taken due to sspush, sspopchk or
ssamoswap. In case `CLONE_VM` is specified and shadow stack is to be enabled,
kernel will automatically allocate a shadow stack for that clone call.
map_shadow_stack:
x86 introduced `map_shadow_stack` system call to allow user space to explicitly
map shadow stack memory in its address space. It is useful to allocate shadow
for different contexts managed by a single thread (green threads or contexts)
risc-v implements this system call as well.
signal management:
If shadow stack is enabled for a task, kernel performs an asynchronous control
flow diversion to deliver the signal and eventually expects userspace to issue
sigreturn so that original execution can be resumed. Even though resume context
is prepared by kernel, it is in user space memory and is subject to memory
corruption and corruption bugs can be utilized by attacker in this race window
to perform arbitrary sigreturn and eventually bypass cfi mechanism.
Another issue is how to ensure that cfi related state on sigcontext area is not
trampled by legacy apps or apps compiled with old kernel headers.
In order to mitigate control-flow hijacting, kernel prepares a token and place
it on shadow stack before signal delivery and places address of token in
sigcontext structure. During sigreturn, kernel obtains address of token from
sigcontext struture, reads token from shadow stack and validates it and only
then allow sigreturn to succeed. Compatiblity issue is solved by adopting
dynamic sigcontext management introduced for vector extension. This series
re-factor the code little bit to allow future sigcontext management easy (as
proposed by Andy Chiu from SiFive)
config and compilation:
Introduce a new risc-v config option `CONFIG_RISCV_USER_CFI`. Selecting this
config option picks the kernel support for user control flow integrity. This
optin is presented only if toolchain has shadow stack and landing pad support.
And is on purpose guarded by toolchain support. Reason being that eventually
vDSO also needs to be compiled in with shadow stack and landing pad support.
vDSO compile patches are not included as of now because landing pad labeling
scheme is yet to settle for usermode runtime.
To get more information on kernel interactions with respect to
zicfilp and zicfiss, patch series adds documentation for
`zicfilp` and `zicfiss` in following:
Documentation/arch/riscv/zicfiss.rst
Documentation/arch/riscv/zicfilp.rst
How to test this series
=======================
Toolchain
---------
$ git clone git@github.com:sifive/riscv-gnu-toolchain.git -b cfi-dev
$ riscv-gnu-toolchain/configure --prefix=<path-to-where-to-build> --with-arch=rv64gc_zicfilp_zicfiss --enable-linux --disable-gdb --with-extra-multilib-test="rv64gc_zicfilp_zicfiss-lp64d:-static"
$ make -j$(nproc)
Qemu
----
Get the lastest qemu
$ cd qemu
$ mkdir build
$ cd build
$ ../configure --target-list=riscv64-softmmu
$ make -j$(nproc)
Opensbi
-------
$ git clone git@github.com:deepak0414/opensbi.git -b v6_cfi_spec_split_opensbi
$ make CROSS_COMPILE=<your riscv toolchain> -j$(nproc) PLATFORM=generic
Linux
-----
Running defconfig is fine. CFI is enabled by default if the toolchain
supports it.
$ make ARCH=riscv CROSS_COMPILE=<path-to-cfi-riscv-gnu-toolchain>/build/bin/riscv64-unknown-linux-gnu- -j$(nproc) defconfig
$ make ARCH=riscv CROSS_COMPILE=<path-to-cfi-riscv-gnu-toolchain>/build/bin/riscv64-unknown-linux-gnu- -j$(nproc)
In case you're building your own rootfs using toolchain, please make sure you
pick following patch to ensure that vDSO compiled with lpad and shadow stack.
"arch/riscv: compile vdso with landing pad"
Branch where above patch can be picked
https://github.com/deepak0414/linux-riscv-cfi/tree/vdso_user_cfi_v6.12-rc1
Running
-------
Modify your qemu command to have:
-bios <path-to-cfi-opensbi>/build/platform/generic/firmware/fw_dynamic.bin
-cpu rv64,zicfilp=true,zicfiss=true,zimop=true,zcmop=true
vDSO related Opens (in the flux)
=================================
I am listing these opens for laying out plan and what to expect in future
patch sets. And of course for the sake of discussion.
Shadow stack and landing pad enabling in vDSO
----------------------------------------------
vDSO must have shadow stack and landing pad support compiled in for task
to have shadow stack and landing pad support. This patch series doesn't
enable that (yet). Enabling shadow stack support in vDSO should be
straight forward (intend to do that in next versions of patch set). Enabling
landing pad support in vDSO requires some collaboration with toolchain folks
to follow a single label scheme for all object binaries. This is necessary to
ensure that all indirect call-sites are setting correct label and target landing
pads are decorated with same label scheme.
How many vDSOs
---------------
Shadow stack instructions are carved out of zimop (may be operations) and if CPU
doesn't implement zimop, they're illegal instructions. Kernel could be running on
a CPU which may or may not implement zimop. And thus kernel will have to carry 2
different vDSOs and expose the appropriate one depending on whether CPU implements
zimop or not.
References
==========
[1] - https://github.com/riscv/riscv-cfi
[2] - https://lore.kernel.org/all/20240814081126.956287-1-samuel.holland@sifive.c…
[3] - https://lwn.net/Articles/889475/
[4] - https://developer.arm.com/documentation/109576/0100/Branch-Target-Identific…
[5] - https://www.intel.com/content/dam/develop/external/us/en/documents/catc17-i…
[6] - https://lwn.net/Articles/940403/
To: Thomas Gleixner <tglx(a)linutronix.de>
To: Ingo Molnar <mingo(a)redhat.com>
To: Borislav Petkov <bp(a)alien8.de>
To: Dave Hansen <dave.hansen(a)linux.intel.com>
To: x86(a)kernel.org
To: H. Peter Anvin <hpa(a)zytor.com>
To: Andrew Morton <akpm(a)linux-foundation.org>
To: Liam R. Howlett <Liam.Howlett(a)oracle.com>
To: Vlastimil Babka <vbabka(a)suse.cz>
To: Lorenzo Stoakes <lorenzo.stoakes(a)oracle.com>
To: Paul Walmsley <paul.walmsley(a)sifive.com>
To: Palmer Dabbelt <palmer(a)dabbelt.com>
To: Albert Ou <aou(a)eecs.berkeley.edu>
To: Conor Dooley <conor(a)kernel.org>
To: Rob Herring <robh(a)kernel.org>
To: Krzysztof Kozlowski <krzk+dt(a)kernel.org>
To: Arnd Bergmann <arnd(a)arndb.de>
To: Christian Brauner <brauner(a)kernel.org>
To: Peter Zijlstra <peterz(a)infradead.org>
To: Oleg Nesterov <oleg(a)redhat.com>
To: Eric Biederman <ebiederm(a)xmission.com>
To: Kees Cook <kees(a)kernel.org>
To: Jonathan Corbet <corbet(a)lwn.net>
To: Shuah Khan <shuah(a)kernel.org>
To: Jann Horn <jannh(a)google.com>
To: Conor Dooley <conor+dt(a)kernel.org>
To: Miguel Ojeda <ojeda(a)kernel.org>
To: Alex Gaynor <alex.gaynor(a)gmail.com>
To: Boqun Feng <boqun.feng(a)gmail.com>
To: Gary Guo <gary(a)garyguo.net>
To: Björn Roy Baron <bjorn3_gh(a)protonmail.com>
To: Benno Lossin <benno.lossin(a)proton.me>
To: Andreas Hindborg <a.hindborg(a)kernel.org>
To: Alice Ryhl <aliceryhl(a)google.com>
To: Trevor Gross <tmgross(a)umich.edu>
Cc: linux-kernel(a)vger.kernel.org
Cc: linux-fsdevel(a)vger.kernel.org
Cc: linux-mm(a)kvack.org
Cc: linux-riscv(a)lists.infradead.org
Cc: devicetree(a)vger.kernel.org
Cc: linux-arch(a)vger.kernel.org
Cc: linux-doc(a)vger.kernel.org
Cc: linux-kselftest(a)vger.kernel.org
Cc: alistair.francis(a)wdc.com
Cc: richard.henderson(a)linaro.org
Cc: jim.shu(a)sifive.com
Cc: andybnac(a)gmail.com
Cc: kito.cheng(a)sifive.com
Cc: charlie(a)rivosinc.com
Cc: atishp(a)rivosinc.com
Cc: evan(a)rivosinc.com
Cc: cleger(a)rivosinc.com
Cc: alexghiti(a)rivosinc.com
Cc: samitolvanen(a)google.com
Cc: broonie(a)kernel.org
Cc: rick.p.edgecombe(a)intel.com
Cc: rust-for-linux(a)vger.kernel.org
changelog
---------
v15:
- Toolchain has been updated to include `-fcf-protection` flag. This
exists for x86 as well. Updated kernel patches to compile vDSO and
selftest to compile with `fcf-protection=full` flag.
- selecting CONFIG_RISCV_USERCFI selects CONFIG_RISCV_SBI.
- Patch to enable shadow stack for kernel wasn't hidden behind
CONFIG_RISCV_USERCFI and CONFIG_RISCV_SBI. fixed that.
v14:
- rebased on top of palmer/sbi-v3. Thus dropped clement's FWFT patches
Updated RISCV_ISA_EXT_XXXX in hwcap and hwprobe constants.
- Took Radim's suggestions on bitfields.
- Placed cfi_state at the end of thread_info block so that current situation
is not disturbed with respect to member fields of thread_info in single
cacheline.
v13:
- cpu_supports_shadow_stack/cpu_supports_indirect_br_lp_instr uses
riscv_has_extension_unlikely()
- uses nops(count) to create nop slide
- RISCV_ACQUIRE_BARRIER is not needed in `amo_user_shstk`. Removed it
- changed ternaries to simply use implicit casting to convert to bool.
- kernel command line allows to disable zicfilp and zicfiss independently.
updated kernel-parameters.txt.
- ptrace user abi for cfi uses bitmasks instead of bitfields. Added ptrace
kselftest.
- cosmetic and grammatical changes to documentation.
v12:
- It seems like I had accidently squashed arch agnostic indirect branch
tracking prctl and riscv implementation of those prctls. Split them again.
- set_shstk_status/set_indir_lp_status perform CSR writes only when CPU
support is available. As suggested by Zong Li.
- Some minor clean up in kselftests as suggested by Zong Li.
v11:
- patch "arch/riscv: compile vdso with landing pad" was unconditionally
selecting `_zicfilp` for vDSO compile. fixed that. Changed `lpad 1` to
to `lpad 0`.
v10:
- dropped "mm: helper `is_shadow_stack_vma` to check shadow stack vma". This patch
is not that interesting to this patch series for risc-v. There are instances in
arch directories where VM_SHADOW_STACK flag is anyways used. Dropping this patch
to expedite merging in riscv tree.
- Took suggestions from `Clement` on "riscv: zicfiss / zicfilp enumeration" to
validate presence of cfi based on config.
- Added a patch for vDSO to have `lpad 0`. I had omitted this earlier to make sure
we add single vdso object with cfi enabled. But a vdso object with scheme of
zero labeled landing pad is least common denominator and should work with all
objects of zero labeled as well as function-signature labeled objects.
v9:
- rebased on master (39a803b754d5 fix braino in "9p: fix ->rename_sem exclusion")
- dropped "mm: Introduce ARCH_HAS_USER_SHADOW_STACK" (master has it from arm64/gcs)
- dropped "prctl: arch-agnostic prctl for shadow stack" (master has it from arm64/gcs)
v8:
- rebased on palmer/for-next
- dropped samuel holland's `envcfg` context switch patches.
they are in parlmer/for-next
v7:
- Removed "riscv/Kconfig: enable HAVE_EXIT_THREAD for riscv"
Instead using `deactivate_mm` flow to clean up.
see here for more context
https://lore.kernel.org/all/20230908203655.543765-1-rick.p.edgecombe@intel.…
- Changed the header include in `kselftest`. Hopefully this fixes compile
issue faced by Zong Li at SiFive.
- Cleaned up an orphaned change to `mm/mmap.c` in below patch
"riscv/mm : ensure PROT_WRITE leads to VM_READ | VM_WRITE"
- Lock interfaces for shadow stack and indirect branch tracking expect arg == 0
Any future evolution of this interface should accordingly define how arg should
be setup.
- `mm/map.c` has an instance of using `VM_SHADOW_STACK`. Fixed it to use helper
`is_shadow_stack_vma`.
- Link to v6: https://lore.kernel.org/r/20241008-v5_user_cfi_series-v6-0-60d9fe073f37@riv…
v6:
- Picked up Samuel Holland's changes as is with `envcfg` placed in
`thread` instead of `thread_info`
- fixed unaligned newline escapes in kselftest
- cleaned up messages in kselftest and included test output in commit message
- fixed a bug in clone path reported by Zong Li
- fixed a build issue if CONFIG_RISCV_ISA_V is not selected
(this was introduced due to re-factoring signal context
management code)
v5:
- rebased on v6.12-rc1
- Fixed schema related issues in device tree file
- Fixed some of the documentation related issues in zicfilp/ss.rst
(style issues and added index)
- added `SHADOW_STACK_SET_MARKER` so that implementation can define base
of shadow stack.
- Fixed warnings on definitions added in usercfi.h when
CONFIG_RISCV_USER_CFI is not selected.
- Adopted context header based signal handling as proposed by Andy Chiu
- Added support for enabling kernel mode access to shadow stack using
FWFT
(https://github.com/riscv-non-isa/riscv-sbi-doc/blob/master/src/ext-firmware…)
- Link to v5: https://lore.kernel.org/r/20241001-v5_user_cfi_series-v1-0-3ba65b6e550f@riv…
(Note: I had an issue in my workflow due to which version number wasn't
picked up correctly while sending out patches)
v4:
- rebased on 6.11-rc6
- envcfg: Converged with Samuel Holland's patches for envcfg management on per-
thread basis.
- vma_is_shadow_stack is renamed to is_vma_shadow_stack
- picked up Mark Brown's `ARCH_HAS_USER_SHADOW_STACK` patch
- signal context: using extended context management to maintain compatibility.
- fixed `-Wmissing-prototypes` compiler warnings for prctl functions
- Documentation fixes and amending typos.
- Link to v4: https://lore.kernel.org/all/20240912231650.3740732-1-debug@rivosinc.com/
v3:
- envcfg
logic to pick up base envcfg had a bug where `ENVCFG_CBZE` could have been
picked on per task basis, even though CPU didn't implement it. Fixed in
this series.
- dt-bindings
As suggested, split into separate commit. fixed the messaging that spec is
in public review
- arch_is_shadow_stack change
arch_is_shadow_stack changed to vma_is_shadow_stack
- hwprobe
zicfiss / zicfilp if present will get enumerated in hwprobe
- selftests
As suggested, added object and binary filenames to .gitignore
Selftest binary anyways need to be compiled with cfi enabled compiler which
will make sure that landing pad and shadow stack are enabled. Thus removed
separate enable/disable tests. Cleaned up tests a bit.
- Link to v3: https://lore.kernel.org/lkml/20240403234054.2020347-1-debug@rivosinc.com/
v2:
- Using config `CONFIG_RISCV_USER_CFI`, kernel support for riscv control flow
integrity for user mode programs can be compiled in the kernel.
- Enabling of control flow integrity for user programs is left to user runtime
- This patch series introduces arch agnostic `prctls` to enable shadow stack
and indirect branch tracking. And implements them on riscv.
---
Changes in v15:
- changelog posted just below cover letter
- Link to v14: https://lore.kernel.org/r/20250429-v5_user_cfi_series-v14-0-5239410d012a@ri…
Changes in v14:
- changelog posted just below cover letter
- Link to v13: https://lore.kernel.org/r/20250424-v5_user_cfi_series-v13-0-971437de586a@ri…
Changes in v13:
- changelog posted just below cover letter
- Link to v12: https://lore.kernel.org/r/20250314-v5_user_cfi_series-v12-0-e51202b53138@ri…
Changes in v12:
- changelog posted just below cover letter
- Link to v11: https://lore.kernel.org/r/20250310-v5_user_cfi_series-v11-0-86b36cbfb910@ri…
Changes in v11:
- changelog posted just below cover letter
- Link to v10: https://lore.kernel.org/r/20250210-v5_user_cfi_series-v10-0-163dcfa31c60@ri…
---
Andy Chiu (1):
riscv: signal: abstract header saving for setup_sigcontext
Deepak Gupta (25):
mm: VM_SHADOW_STACK definition for riscv
dt-bindings: riscv: zicfilp and zicfiss in dt-bindings (extensions.yaml)
riscv: zicfiss / zicfilp enumeration
riscv: zicfiss / zicfilp extension csr and bit definitions
riscv: usercfi state for task and save/restore of CSR_SSP on trap entry/exit
riscv/mm : ensure PROT_WRITE leads to VM_READ | VM_WRITE
riscv mm: manufacture shadow stack pte
riscv mmu: teach pte_mkwrite to manufacture shadow stack PTEs
riscv mmu: write protect and shadow stack
riscv/mm: Implement map_shadow_stack() syscall
riscv/shstk: If needed allocate a new shadow stack on clone
riscv: Implements arch agnostic shadow stack prctls
prctl: arch-agnostic prctl for indirect branch tracking
riscv: Implements arch agnostic indirect branch tracking prctls
riscv/traps: Introduce software check exception
riscv/signal: save and restore of shadow stack for signal
riscv/kernel: update __show_regs to print shadow stack register
riscv/ptrace: riscv cfi status and state via ptrace and in core files
riscv/hwprobe: zicfilp / zicfiss enumeration in hwprobe
riscv: kernel command line option to opt out of user cfi
riscv: enable kernel access to shadow stack memory via FWFT sbi call
riscv: create a config for shadow stack and landing pad instr support
riscv: Documentation for landing pad / indirect branch tracking
riscv: Documentation for shadow stack on riscv
kselftest/riscv: kselftest for user mode cfi
Jim Shu (1):
arch/riscv: compile vdso with landing pad
Documentation/admin-guide/kernel-parameters.txt | 8 +
Documentation/arch/riscv/index.rst | 2 +
Documentation/arch/riscv/zicfilp.rst | 115 +++++
Documentation/arch/riscv/zicfiss.rst | 179 +++++++
.../devicetree/bindings/riscv/extensions.yaml | 14 +
arch/riscv/Kconfig | 21 +
arch/riscv/Makefile | 5 +-
arch/riscv/include/asm/asm-prototypes.h | 1 +
arch/riscv/include/asm/assembler.h | 44 ++
arch/riscv/include/asm/cpufeature.h | 12 +
arch/riscv/include/asm/csr.h | 16 +
arch/riscv/include/asm/entry-common.h | 2 +
arch/riscv/include/asm/hwcap.h | 2 +
arch/riscv/include/asm/mman.h | 25 +
arch/riscv/include/asm/mmu_context.h | 7 +
arch/riscv/include/asm/pgtable.h | 30 +-
arch/riscv/include/asm/processor.h | 2 +
arch/riscv/include/asm/thread_info.h | 3 +
arch/riscv/include/asm/usercfi.h | 95 ++++
arch/riscv/include/asm/vector.h | 3 +
arch/riscv/include/uapi/asm/hwprobe.h | 2 +
arch/riscv/include/uapi/asm/ptrace.h | 34 ++
arch/riscv/include/uapi/asm/sigcontext.h | 1 +
arch/riscv/kernel/Makefile | 1 +
arch/riscv/kernel/asm-offsets.c | 8 +
arch/riscv/kernel/cpufeature.c | 13 +
arch/riscv/kernel/entry.S | 33 +-
arch/riscv/kernel/head.S | 27 +
arch/riscv/kernel/process.c | 26 +-
arch/riscv/kernel/ptrace.c | 95 ++++
arch/riscv/kernel/signal.c | 148 +++++-
arch/riscv/kernel/sys_hwprobe.c | 2 +
arch/riscv/kernel/sys_riscv.c | 10 +
arch/riscv/kernel/traps.c | 43 ++
arch/riscv/kernel/usercfi.c | 545 +++++++++++++++++++++
arch/riscv/kernel/vdso/Makefile | 6 +
arch/riscv/kernel/vdso/flush_icache.S | 4 +
arch/riscv/kernel/vdso/getcpu.S | 4 +
arch/riscv/kernel/vdso/rt_sigreturn.S | 4 +
arch/riscv/kernel/vdso/sys_hwprobe.S | 4 +
arch/riscv/mm/init.c | 2 +-
arch/riscv/mm/pgtable.c | 17 +
include/linux/cpu.h | 4 +
include/linux/mm.h | 7 +
include/uapi/linux/elf.h | 2 +
include/uapi/linux/prctl.h | 27 +
kernel/sys.c | 30 ++
tools/testing/selftests/riscv/Makefile | 2 +-
tools/testing/selftests/riscv/cfi/.gitignore | 3 +
tools/testing/selftests/riscv/cfi/Makefile | 10 +
tools/testing/selftests/riscv/cfi/cfi_rv_test.h | 82 ++++
tools/testing/selftests/riscv/cfi/riscv_cfi_test.c | 173 +++++++
tools/testing/selftests/riscv/cfi/shadowstack.c | 385 +++++++++++++++
tools/testing/selftests/riscv/cfi/shadowstack.h | 27 +
54 files changed, 2338 insertions(+), 29 deletions(-)
---
base-commit: 4181f8ad7a1061efed0219951d608d4988302af7
change-id: 20240930-v5_user_cfi_series-3dc332f8f5b2
--
- debug
Fix misspelling reported by codespell
Signed-off-by: Sumanth Gavini <sumanth.gavini(a)yahoo.com>
---
tools/testing/selftests/seccomp/seccomp_benchmark.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/tools/testing/selftests/seccomp/seccomp_benchmark.c b/tools/testing/selftests/seccomp/seccomp_benchmark.c
index 94886c82ae60..5822e25e0217 100644
--- a/tools/testing/selftests/seccomp/seccomp_benchmark.c
+++ b/tools/testing/selftests/seccomp/seccomp_benchmark.c
@@ -1,6 +1,6 @@
/*
* Strictly speaking, this is not a test. But it can report during test
- * runs so relative performace can be measured.
+ * runs so relative performance can be measured.
*/
#define _GNU_SOURCE
#include <assert.h>
--
2.43.0
The documentation on testing static functions using the KUnit macros
VISIBLE_IF_KUNIT and EXPORT_SYMBOL_IF_KUNIT is lacking clarity and
missing key steps in the example. This has caused bugs and confusion
among developers.
Improve wording of description and add missing steps to the example.
This entails adding the "#include <kunit/visibility.h>" line and the
"MODULE_IMPORT_NS("EXPORTED_FOR_KUNIT_TESTING");" line. Both of which
were missing from the original example and key to exposing static
functions.
Reviewed-by: David Gow <davidgow(a)google.com>
Signed-off-by: Rae Moar <rmoar(a)google.com>
---
Changes from v1 to v2:
- Add quotes around EXPORTED_FOR_KUNIT_TESTING to change it to be a
string
Documentation/dev-tools/kunit/usage.rst | 38 +++++++++++++++++++------
1 file changed, 30 insertions(+), 8 deletions(-)
diff --git a/Documentation/dev-tools/kunit/usage.rst b/Documentation/dev-tools/kunit/usage.rst
index 22955d56b379..066ecda1dd98 100644
--- a/Documentation/dev-tools/kunit/usage.rst
+++ b/Documentation/dev-tools/kunit/usage.rst
@@ -670,28 +670,50 @@ with ``kunit_remove_action``.
Testing Static Functions
------------------------
-If we do not want to expose functions or variables for testing, one option is to
-conditionally export the used symbol. For example:
+If you want to test static functions without exposing those functions outside of
+testing, one option is conditionally export the symbol. When KUnit is enabled,
+the symbol is exposed but remains static otherwise. To use this method, follow
+the template below.
.. code-block:: c
- /* In my_file.c */
+ /* In the file containing functions to test "my_file.c" */
- VISIBLE_IF_KUNIT int do_interesting_thing();
+ #include <kunit/visibility.h>
+ #include <my_file.h>
+ ...
+ VISIBLE_IF_KUNIT int do_interesting_thing()
+ {
+ ...
+ }
EXPORT_SYMBOL_IF_KUNIT(do_interesting_thing);
- /* In my_file.h */
+ /* In the header file "my_file.h" */
#if IS_ENABLED(CONFIG_KUNIT)
int do_interesting_thing(void);
#endif
-Alternatively, you could conditionally ``#include`` the test file at the end of
-your .c file. For example:
+ /* In the KUnit test file "my_file_test.c" */
+
+ #include <kunit/visibility.h>
+ #include <my_file.h>
+ ...
+ MODULE_IMPORT_NS("EXPORTED_FOR_KUNIT_TESTING");
+ ...
+ // Use do_interesting_thing() in tests
+
+For a full example, see this `patch <https://lore.kernel.org/all/20221207014024.340230-3-rmoar@google.com/>`_
+where a test is modified to conditionally expose static functions for testing
+using the macros above.
+
+As an **alternative** to the method above, you could conditionally ``#include``
+the test file at the end of your .c file. This is not recommended but works
+if needed. For example:
.. code-block:: c
- /* In my_file.c */
+ /* In "my_file.c" */
static int do_interesting_thing();
base-commit: c2493384e8110d5a4792fff4b9d46e47b78ea10a
--
2.49.0.1112.g889b7c5bd8-goog
Add the test counts to the JSON output from kunit.py. For example:
...
"git_branch": "kselftest",
"misc":
{
"tests": 2,
"passed": 1.
"failed": 1,
"crashed": 0,
"skipped": 0,
"errors": 0,
}
...
To output the JSON using the following command:
./tools/testing/kunit/kunit.py run example --json
This has been requested by KUnit users. The counts are in a "misc"
field because the JSON output needs to be compliant with the KCIDB
submission guide. There are no counts fields but there is a "misc" field
in the guide.
Reviewed-by: David Gow <davidgow(a)google.com>
Signed-off-by: Rae Moar <rmoar(a)google.com>
---
tools/testing/kunit/kunit_json.py | 10 ++++++++++
1 file changed, 10 insertions(+)
diff --git a/tools/testing/kunit/kunit_json.py b/tools/testing/kunit/kunit_json.py
index 10ff65689dd8..c1463e6819b6 100644
--- a/tools/testing/kunit/kunit_json.py
+++ b/tools/testing/kunit/kunit_json.py
@@ -39,10 +39,20 @@ def _get_group_json(test: Test, common_fields: JsonObj) -> JsonObj:
status = _status_map.get(subtest.status, "FAIL")
test_cases.append({"name": subtest.name, "status": status})
+ test_counts = test.counts
+ counts_json = {
+ "tests": test_counts.total(),
+ "passed": test_counts.passed,
+ "failed": test_counts.failed,
+ "crashed": test_counts.crashed,
+ "skipped": test_counts.skipped,
+ "errors": test_counts.errors,
+ }
test_group = {
"name": test.name,
"sub_groups": sub_groups,
"test_cases": test_cases,
+ "misc": counts_json,
}
test_group.update(common_fields)
return test_group
base-commit: c2493384e8110d5a4792fff4b9d46e47b78ea10a
--
2.49.0.1112.g889b7c5bd8-goog
The documentation on testing static functions using the KUnit macros
VISIBLE_IF_KUNIT and EXPORT_SYMBOL_IF_KUNIT is lacking clarity and
missing key steps in the example. This has caused bugs and confusion
among developers.
Improve wording of description and add missing steps to the example.
This entails adding the "#include <kunit/visibility.h>" line and the
"MODULE_IMPORT_NS(EXPORTED_FOR_KUNIT_TESTING);" line. Both of which were
missing from the original example and key to exposing static functions.
Signed-off-by: Rae Moar <rmoar(a)google.com>
---
Documentation/dev-tools/kunit/usage.rst | 38 +++++++++++++++++++------
1 file changed, 30 insertions(+), 8 deletions(-)
diff --git a/Documentation/dev-tools/kunit/usage.rst b/Documentation/dev-tools/kunit/usage.rst
index 22955d56b379..038f480074fd 100644
--- a/Documentation/dev-tools/kunit/usage.rst
+++ b/Documentation/dev-tools/kunit/usage.rst
@@ -670,28 +670,50 @@ with ``kunit_remove_action``.
Testing Static Functions
------------------------
-If we do not want to expose functions or variables for testing, one option is to
-conditionally export the used symbol. For example:
+If you want to test static functions without exposing those functions outside of
+testing, one option is conditionally export the symbol. When KUnit is enabled,
+the symbol is exposed but remains static otherwise. To use this method, follow
+the template below.
.. code-block:: c
- /* In my_file.c */
+ /* In the file containing functions to test "my_file.c" */
- VISIBLE_IF_KUNIT int do_interesting_thing();
+ #include <kunit/visibility.h>
+ #include <my_file.h>
+ ...
+ VISIBLE_IF_KUNIT int do_interesting_thing()
+ {
+ ...
+ }
EXPORT_SYMBOL_IF_KUNIT(do_interesting_thing);
- /* In my_file.h */
+ /* In the header file "my_file.h" */
#if IS_ENABLED(CONFIG_KUNIT)
int do_interesting_thing(void);
#endif
-Alternatively, you could conditionally ``#include`` the test file at the end of
-your .c file. For example:
+ /* In the KUnit test file "my_file_test.c" */
+
+ #include <kunit/visibility.h>
+ #include <my_file.h>
+ ...
+ MODULE_IMPORT_NS(EXPORTED_FOR_KUNIT_TESTING);
+ ...
+ // Use do_interesting_thing() in tests
+
+For a full example, see this `patch <https://lore.kernel.org/all/20221207014024.340230-3-rmoar@google.com/>`_
+where a test is modified to conditionally expose static functions for testing
+using the macros above.
+
+As an **alternative** to the method above, you could conditionally ``#include``
+the test file at the end of your .c file. This is not recommended but works
+if needed. For example:
.. code-block:: c
- /* In my_file.c */
+ /* In "my_file.c" */
static int do_interesting_thing();
base-commit: c2493384e8110d5a4792fff4b9d46e47b78ea10a
--
2.49.0.1101.gccaa498523-goog
Add the test counts to the JSON output from kunit.py. For example:
...
"git_branch": "kselftest",
"misc":
{
"tests": 2,
"passed": 1.
"failed": 1,
"crashed": 0,
"skipped": 0,
"errors": 0,
}
...
To output the JSON using the following command:
./tools/testing/kunit/kunit.py run example --json
This has been requested by KUnit users. The counts are in a "misc"
field because the JSON output needs to be compliant with the KCIDB
submission guide. There are no counts fields but there is a "misc" field
in the guide.
Signed-off-by: Rae Moar <rmoar(a)google.com>
---
tools/testing/kunit/kunit_json.py | 10 ++++++++++
1 file changed, 10 insertions(+)
diff --git a/tools/testing/kunit/kunit_json.py b/tools/testing/kunit/kunit_json.py
index 10ff65689dd8..80fa4e354a17 100644
--- a/tools/testing/kunit/kunit_json.py
+++ b/tools/testing/kunit/kunit_json.py
@@ -39,10 +39,20 @@ def _get_group_json(test: Test, common_fields: JsonObj) -> JsonObj:
status = _status_map.get(subtest.status, "FAIL")
test_cases.append({"name": subtest.name, "status": status})
+ test_counts = test.counts
+ counts_json = {
+ "tests": test_counts.total(),
+ "passed": test_counts.passed,
+ "failed": test_counts.failed,
+ "crashed": test_counts.crashed,
+ "skipped": test_counts.skipped,
+ "errors": test_counts.errors,
+ }
test_group = {
"name": test.name,
"sub_groups": sub_groups,
"test_cases": test_cases,
+ "misc": counts_json
}
test_group.update(common_fields)
return test_group
base-commit: c2493384e8110d5a4792fff4b9d46e47b78ea10a
--
2.49.0.1101.gccaa498523-goog
There is a spelling mistake in a fail error message. Fix it.
Signed-off-by: Colin Ian King <colin.i.king(a)gmail.com>
---
tools/testing/selftests/futex/functional/futex_numa_mpol.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/tools/testing/selftests/futex/functional/futex_numa_mpol.c b/tools/testing/selftests/futex/functional/futex_numa_mpol.c
index dd70532f293e..8864c610f165 100644
--- a/tools/testing/selftests/futex/functional/futex_numa_mpol.c
+++ b/tools/testing/selftests/futex/functional/futex_numa_mpol.c
@@ -181,7 +181,7 @@ int main(int argc, char *argv[])
test_futex(futex_ptr, 0);
if (futex_numa->numa == FUTEX_NO_NODE) {
- fail("NUMA node is left unitiliazed\n");
+ fail("NUMA node is left uninitialized\n");
return 1;
}
--
2.49.0
Fix misspelling reported by codespell
Signed-off-by: Sumanth Gavini <sumanth.gavini(a)yahoo.com>
---
tools/testing/selftests/drivers/net/lib/py/env.py | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/tools/testing/selftests/drivers/net/lib/py/env.py b/tools/testing/selftests/drivers/net/lib/py/env.py
index ad5ff645183a..3bccddf8cbc5 100644
--- a/tools/testing/selftests/drivers/net/lib/py/env.py
+++ b/tools/testing/selftests/drivers/net/lib/py/env.py
@@ -12,7 +12,7 @@ from .remote import Remote
class NetDrvEnvBase:
"""
- Base class for a NIC / host envirnoments
+ Base class for a NIC / host environments
Attributes:
test_dir: Path to the source directory of the test
--
2.43.0
This patchset adds KVM selftests for LoongArch system, currently only
some common test cases are supported and pass to run. These test cases
are listed as following:
coalesced_io_test
demand_paging_test
dirty_log_perf_test
dirty_log_test
guest_print_test
hardware_disable_test
kvm_binary_stats_test
kvm_create_max_vcpus
kvm_page_table_test
memslot_modification_stress_test
memslot_perf_test
set_memory_region_test
---
Changes in v12:
1. Replace vcpu_set_reg() with __vcpu_set_reg() in loongarch_set_csr()
2. Remove duplicated test case including in Makefile.kvm
3. Fix one issue reported by checkpatch
Changes in v11:
1. Fix a typo issue in notes of patch 2, it is kvm_util_arch.h rather than
kvm_util_base.h
Changes in v10:
1. Add PS_64K and remove PS_8K in file include/loongarch/processor.h
2. Fix a typo issue in file lib/loongarch/processor.c
3. Update file MAINTAINERS about LoongArch KVM selftests
Changes in v9:
1. Add vm mode VM_MODE_P47V47_16K, LoongArch VM uses this mode by
default, rather than VM_MODE_P36V47_16K.
2. Refresh some spelling issues in changelog.
Changes in v8:
1. Porting patch based on the latest version.
2. For macro PC_OFFSET_EXREGS, offsetof() method is used for C header file,
still hardcoded definition for assemble language.
Changes in v7:
1. Refine code to add LoongArch support in test case
set_memory_region_test.
Changes in v6:
1. Refresh the patch based on latest kernel 6.8-rc1, add LoongArch
support about testcase set_memory_region_test.
2. Add hardware_disable_test test case.
3. Drop modification about macro DEFAULT_GUEST_TEST_MEM, it is problem
of LoongArch binutils, this issue is raised to LoongArch binutils
owners.
Changes in v5:
1. In LoongArch kvm self tests, the DEFAULT_GUEST_TEST_MEM could be
0x130000000, it is different from the default value in memstress.h.
So we Move the definition of DEFAULT_GUEST_TEST_MEM into LoongArch
ucall.h, and add 'ifndef' condition for DEFAULT_GUEST_TEST_MEM
in memstress.h.
Changes in v4:
1. Remove the based-on flag, as the LoongArch KVM patch series
have been accepted by Linux kernel, so this can be applied directly
in kernel.
Changes in v3:
1. Improve implementation of LoongArch VM page walk.
2. Add exception handler for LoongArch.
3. Add dirty_log_test, dirty_log_perf_test, guest_print_test
test cases for LoongArch.
4. Add __ASSEMBLER__ macro to distinguish asm file and c file.
5. Move ucall_arch_do_ucall to the header file and make it as
static inline to avoid function calls.
6. Change the DEFAULT_GUEST_TEST_MEM base addr for LoongArch.
Changes in v2:
1. We should use ".balign 4096" to align the assemble code with 4K in
exception.S instead of "align 12".
2. LoongArch only supports 3 or 4 levels page tables, so we remove the
hanlders for 2-levels page table.
3. Remove the DEFAULT_LOONGARCH_GUEST_STACK_VADDR_MIN and use the common
DEFAULT_GUEST_STACK_VADDR_MIN to allocate stack memory in guest.
4. Reorganize the test cases supported by LoongArch.
5. Fix some code comments.
6. Add kvm_binary_stats_test test case into LoongArch KVM selftests.
---
Bibo Mao (5):
KVM: selftests: Add VM_MODE_P47V47_16K VM mode
KVM: selftests: Add KVM selftests header files for LoongArch
KVM: selftests: Add core KVM selftests support for LoongArch
KVM: selftests: Add ucall test support for LoongArch
KVM: selftests: Add supported test cases for LoongArch
MAINTAINERS | 2 +
tools/testing/selftests/kvm/Makefile | 2 +-
tools/testing/selftests/kvm/Makefile.kvm | 17 +
.../testing/selftests/kvm/include/kvm_util.h | 6 +
.../kvm/include/loongarch/kvm_util_arch.h | 7 +
.../kvm/include/loongarch/processor.h | 141 +++++++
.../selftests/kvm/include/loongarch/ucall.h | 20 +
tools/testing/selftests/kvm/lib/kvm_util.c | 3 +
.../selftests/kvm/lib/loongarch/exception.S | 59 +++
.../selftests/kvm/lib/loongarch/processor.c | 347 ++++++++++++++++++
.../selftests/kvm/lib/loongarch/ucall.c | 38 ++
.../selftests/kvm/set_memory_region_test.c | 2 +-
12 files changed, 642 insertions(+), 2 deletions(-)
create mode 100644 tools/testing/selftests/kvm/include/loongarch/kvm_util_arch.h
create mode 100644 tools/testing/selftests/kvm/include/loongarch/processor.h
create mode 100644 tools/testing/selftests/kvm/include/loongarch/ucall.h
create mode 100644 tools/testing/selftests/kvm/lib/loongarch/exception.S
create mode 100644 tools/testing/selftests/kvm/lib/loongarch/processor.c
create mode 100644 tools/testing/selftests/kvm/lib/loongarch/ucall.c
base-commit: a5806cd506af5a7c19bcd596e4708b5c464bfd21
--
2.39.3
Minor cleanups to the devmem tcp code, and not-so-minor improvements to
the ksft.
For the cleanups:
- Address comment from Paolo post-merge.
- Fix whitespace.
- Add improvement dropped from Taehee's fix patch.
For the ksft:
- Add support for ipv4 environment.
- Add support for drivers that are limited to 5-tuple flow steering.
- Improve test by sending 1K data instead of just "hello\nworld"
Cc: sdf(a)fomichev.me
Cc: ap420073(a)gmail.com
Cc: praan(a)google.com
Cc: shivajikant(a)google.com
Mina Almasry (9):
net: devmem: move list_add to net_devmem_bind_dmabuf.
page_pool: fix ugly page_pool formatting
net: devmem: preserve sockc_err
net: devmem: ksft: remove ksft_disruptive
net: devmem: ksft: add ipv4 support
net: devmem: ksft: add exit_wait to make rx test pass
net: devmem: ksft: add 5 tuple FS support
net: devmem: ksft: upgrade rx test to send 1K data
net: devmem: ncdevmem: remove unused variable
net/core/devmem.c | 5 +-
net/core/devmem.h | 5 +-
net/core/netdev-genl.c | 8 +--
net/core/page_pool.c | 4 +-
net/ipv4/tcp.c | 24 ++++-----
.../selftests/drivers/net/hw/devmem.py | 52 +++++++++++++------
.../selftests/drivers/net/hw/ncdevmem.c | 1 -
7 files changed, 59 insertions(+), 40 deletions(-)
base-commit: b8fa067c4a76e9a28f2003a50ff9b60f00b11168
--
2.49.0.1101.gccaa498523-goog
Hello,
It is the last week to submit your proposal!
The Automated Testing Summit (ATS) 2025 will be held as a co-located event at the Open Source Summit North America!
https://events.linuxfoundation.org/open-source-summit-north-america/feature…
📅 Date: June 26, 2025
📍 Location: Denver, CO, USA
Hosted by KernelCI, ATS is a technical summit focused on the challenges of testing and quality assurance in the Linux ecosystem — especially in upstream kernel development, embedded systems, cloud environments, and CI integration.
This is a great opportunity to share your work on:
* Kernel and userspace test frameworks
* Lab infrastructure and automation
* CI/CD pipelines for Linux
* Fuzzing, performance testing, and debugging tools
* Sharing and standardizing test results across systems
Whether you’re working on kernel testing, running tests on hardware labs, developing QA tools, or building infrastructure that scales across projects, ATS is the place to collaborate and move the ecosystem forward.
Submit your talk by May 18, 2025:
👉 Call for Proposals (CFP): https://sessionize.com/atsna2025
We hope to see you in Denver!
— The KernelCI Team
--
Gustavo Padovan
Collabora Ltd.
This patchset adds KVM selftests for LoongArch system, currently only
some common test cases are supported and pass to run. These test cases
are listed as following:
coalesced_io_test
demand_paging_test
dirty_log_perf_test
dirty_log_test
guest_print_test
hardware_disable_test
kvm_binary_stats_test
kvm_create_max_vcpus
kvm_page_table_test
memslot_modification_stress_test
memslot_perf_test
set_memory_region_test
---
Changes in v11:
1. Fix a typo issue in notes of patch 2, it is kvm_util_arch.h rather than
kvm_util_base.h
Changes in v10:
1. Add PS_64K and remove PS_8K in file include/loongarch/processor.h
2. Fix a typo issue in file lib/loongarch/processor.c
3. Update file MAINTAINERS about LoongArch KVM selftests
Changes in v9:
1. Add vm mode VM_MODE_P47V47_16K, LoongArch VM uses this mode by
default, rather than VM_MODE_P36V47_16K.
2. Refresh some spelling issues in changelog.
Changes in v8:
1. Porting patch based on the latest version.
2. For macro PC_OFFSET_EXREGS, offsetof() method is used for C header file,
still hardcoded definition for assemble language.
Changes in v7:
1. Refine code to add LoongArch support in test case
set_memory_region_test.
Changes in v6:
1. Refresh the patch based on latest kernel 6.8-rc1, add LoongArch
support about testcase set_memory_region_test.
2. Add hardware_disable_test test case.
3. Drop modification about macro DEFAULT_GUEST_TEST_MEM, it is problem
of LoongArch binutils, this issue is raised to LoongArch binutils owners.
Changes in v5:
1. In LoongArch kvm self tests, the DEFAULT_GUEST_TEST_MEM could be
0x130000000, it is different from the default value in memstress.h.
So we Move the definition of DEFAULT_GUEST_TEST_MEM into LoongArch
ucall.h, and add 'ifndef' condition for DEFAULT_GUEST_TEST_MEM
in memstress.h.
Changes in v4:
1. Remove the based-on flag, as the LoongArch KVM patch series
have been accepted by Linux kernel, so this can be applied directly
in kernel.
Changes in v3:
1. Improve implementation of LoongArch VM page walk.
2. Add exception handler for LoongArch.
3. Add dirty_log_test, dirty_log_perf_test, guest_print_test
test cases for LoongArch.
4. Add __ASSEMBLER__ macro to distinguish asm file and c file.
5. Move ucall_arch_do_ucall to the header file and make it as
static inline to avoid function calls.
6. Change the DEFAULT_GUEST_TEST_MEM base addr for LoongArch.
Changes in v2:
1. We should use ".balign 4096" to align the assemble code with 4K in
exception.S instead of "align 12".
2. LoongArch only supports 3 or 4 levels page tables, so we remove the
hanlders for 2-levels page table.
3. Remove the DEFAULT_LOONGARCH_GUEST_STACK_VADDR_MIN and use the common
DEFAULT_GUEST_STACK_VADDR_MIN to allocate stack memory in guest.
4. Reorganize the test cases supported by LoongArch.
5. Fix some code comments.
6. Add kvm_binary_stats_test test case into LoongArch KVM selftests.
---
Bibo Mao (5):
KVM: selftests: Add VM_MODE_P47V47_16K VM mode
KVM: selftests: Add KVM selftests header files for LoongArch
KVM: selftests: Add core KVM selftests support for LoongArch
KVM: selftests: Add ucall test support for LoongArch
KVM: selftests: Add test cases for LoongArch
MAINTAINERS | 2 +
tools/testing/selftests/kvm/Makefile | 2 +-
tools/testing/selftests/kvm/Makefile.kvm | 18 +
.../testing/selftests/kvm/include/kvm_util.h | 6 +
.../kvm/include/loongarch/kvm_util_arch.h | 7 +
.../kvm/include/loongarch/processor.h | 141 ++++++++
.../selftests/kvm/include/loongarch/ucall.h | 20 +
tools/testing/selftests/kvm/lib/kvm_util.c | 3 +
.../selftests/kvm/lib/loongarch/exception.S | 59 +++
.../selftests/kvm/lib/loongarch/processor.c | 342 ++++++++++++++++++
.../selftests/kvm/lib/loongarch/ucall.c | 38 ++
.../selftests/kvm/set_memory_region_test.c | 2 +-
12 files changed, 638 insertions(+), 2 deletions(-)
create mode 100644 tools/testing/selftests/kvm/include/loongarch/kvm_util_arch.h
create mode 100644 tools/testing/selftests/kvm/include/loongarch/processor.h
create mode 100644 tools/testing/selftests/kvm/include/loongarch/ucall.h
create mode 100644 tools/testing/selftests/kvm/lib/loongarch/exception.S
create mode 100644 tools/testing/selftests/kvm/lib/loongarch/processor.c
create mode 100644 tools/testing/selftests/kvm/lib/loongarch/ucall.c
base-commit: 5bc1018675ec28a8a60d83b378d8c3991faa5a27
--
2.39.3
Fixes a grammatical error in the output where the word 'to' was missing.
Signed-off-by: Anish Dabhane <anishdabhane71(a)gmail.com>
---
tools/testing/selftests/cachestat/test_cachestat.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/tools/testing/selftests/cachestat/test_cachestat.c b/tools/testing/selftests/cachestat/test_cachestat.c
index 632ab44737ec..1406bc9f2b87 100644
--- a/tools/testing/selftests/cachestat/test_cachestat.c
+++ b/tools/testing/selftests/cachestat/test_cachestat.c
@@ -76,7 +76,7 @@ bool write_exactly(int fd, size_t filesize)
ssize_t write_len = write(fd, cursor, remained);
if (write_len <= 0) {
- ksft_print_msg("Unable write random data to file.\n");
+ ksft_print_msg("Unable to write random data to file.\n");
ret = false;
goto out_free_data;
}
--
2.49.0
This improves the expressiveness of unprivileged BPF by inserting
speculation barriers instead of rejecting the programs.
The approach was previously presented at LPC'24 [1] and RAID'24 [2].
To mitigate the Spectre v1 (PHT) vulnerability, the kernel rejects
potentially-dangerous unprivileged BPF programs as of
commit 9183671af6db ("bpf: Fix leakage under speculation on mispredicted
branches"). In [2], we have analyzed 364 object files from open source
projects (Linux Samples and Selftests, BCC, Loxilb, Cilium, libbpf
Examples, Parca, and Prevail) and found that this affects 31% to 54% of
programs.
To resolve this in the majority of cases this patchset adds a fall-back
for mitigating Spectre v1 using speculation barriers. The kernel still
optimistically attempts to verify all speculative paths but uses
speculation barriers against v1 when unsafe behavior is detected. This
allows for more programs to be accepted without disabling the BPF
Spectre mitigations (e.g., by setting cpu_mitigations_off()).
For this, it relies on the fact that speculation barriers prevent all
later instructions if the speculation was not correct:
* On x86_64, lfence acts as full speculation barrier, not only as a
load fence [3]:
An LFENCE instruction or a serializing instruction will ensure that
no later instructions execute, even speculatively, until all prior
instructions complete locally. [...] Inserting an LFENCE instruction
after a bounds check prevents later operations from executing before
the bound check completes.
This was experimentally confirmed in [4].
* ARM's SB speculation barrier instruction also affects "any instruction
that appears later in the program order than the barrier" [5].
In [1] we have measured the overhead of this approach relative to having
mitigations off and including the upstream Spectre v4 mitigations. For
event tracing and stack-sampling profilers, we found that mitigations
increase BPF program execution time by 0% to 62%. For the Loxilb network
load balancer, we have measured a 14% slowdown in SCTP performance but
no significant slowdown for TCP. This overhead only applies to programs
that were previously rejected.
I reran the expressiveness-evaluation with v6.14 and made sure the main
results still match those from [1] and [2] (which used v6.5).
Main design decisions are:
* Do not use separate bytecode insns for v1 and v4 barriers. This
simplifies the verifier significantly and has the only downside that
performance on PowerPC is not as high as it could be.
* Allow archs to still disable v1/v4 mitigations separately by setting
bpf_jit_bypass_spec_v1/v4(). This has the benefit that archs can
benefit from improved BPF expressiveness / performance if they are not
vulnerable (e.g., ARM64 for v4 in the kernel).
* Do not remove the empty BPF_NOSPEC implementation for backends for
which it is unknown whether they are vulnerable to Spectre v1.
[1] https://lpc.events/event/18/contributions/1954/ ("Mitigating
Spectre-PHT using Speculation Barriers in Linux eBPF")
[2] https://arxiv.org/pdf/2405.00078 ("VeriFence: Lightweight and
Precise Spectre Defenses for Untrusted Linux Kernel Extensions")
[3] https://www.intel.com/content/www/us/en/developer/articles/technical/softwa…
("Managed Runtime Speculative Execution Side Channel Mitigations")
[4] https://dl.acm.org/doi/pdf/10.1145/3359789.3359837 ("Speculator: a
tool to analyze speculative execution attacks and mitigations" -
Section 4.6 "Stopping Speculative Execution")
[5] https://developer.arm.com/documentation/ddi0597/2020-12/Base-Instructions/S…
("SB - Speculation Barrier - Arm Armv8-A A32/T32 Instruction Set Architecture (2020-12)")
Changes:
* v1 -> v2:
- Drop former commits 9 ("bpf: Return PTR_ERR from push_stack()") and 11
("bpf: Fall back to nospec for spec path verification") as suggested
by Alexei. This series therefore no longer changes push_stack() to
return PTR_ERR.
- Add detailed explanation of how lfence works internally and how it
affects the algorithm.
- Add tests checking that nospec instructions are inserted in expected
locations using __xlated_unpriv as suggested by Eduard (also,
include a fix for __xlated_unpriv)
- Add a test for the mitigations from the description of
commit 9183671af6db ("bpf: Fix leakage under speculation on
mispredicted branches")
- Remove unused variables from do_check[_insn]() as suggested by
Eduard.
- Remove INSN_IDX_MODIFIED to improve readability as suggested by
Eduard. This also causes the nospec_result-check to run (and fail)
for jumping-ops. Add a warning to assert that this check must never
succeed in that case.
- Add details on the safety of patch 10 ("bpf: Allow nospec-protected
var-offset stack access") based on the feedback on v1.
- Rebase to bpf-next-250420
- Link to v1: https://lore.kernel.org/all/20250313172127.1098195-1-luis.gerhorst@fau.de/
* RFC -> v1:
- rebase to bpf-next-250313
- tests: mark expected successes/new errors
- add bpt_jit_bypass_spec_v1/v4() to avoid #ifdef in
bpf_bypass_spec_v1/v4()
- ensure that nospec with v1-support is implemented for archs for
which GCC supports speculation barriers, except for MIPS
- arm64: emit speculation barrier
- powerpc: change nospec to include v1 barrier
- discuss potential security (archs that do not impl. BPF nospec) and
performance (only PowerPC) regressions
- Linkt to RFC: https://lore.kernel.org/bpf/20250224203619.594724-1-luis.gerhorst@fau.de/
Luis Gerhorst (11):
selftests/bpf: Fix caps for __xlated/jited_unpriv
bpf: Move insn if/else into do_check_insn()
bpf: Return -EFAULT on misconfigurations
bpf: Return -EFAULT on internal errors
bpf, arm64, powerpc: Add bpf_jit_bypass_spec_v1/v4()
bpf, arm64, powerpc: Change nospec to include v1 barrier
bpf: Rename sanitize_stack_spill to nospec_result
bpf: Fall back to nospec for Spectre v1
selftests/bpf: Add test for Spectre v1 mitigation
bpf: Allow nospec-protected var-offset stack access
bpf: Fall back to nospec for sanitization-failures
arch/arm64/net/bpf_jit.h | 5 +
arch/arm64/net/bpf_jit_comp.c | 28 +-
arch/powerpc/net/bpf_jit_comp64.c | 79 ++-
include/linux/bpf.h | 11 +-
include/linux/bpf_verifier.h | 3 +-
include/linux/filter.h | 2 +-
kernel/bpf/core.c | 32 +-
kernel/bpf/verifier.c | 648 ++++++++++--------
tools/testing/selftests/bpf/progs/bpf_misc.h | 4 +
.../selftests/bpf/progs/verifier_and.c | 8 +-
.../selftests/bpf/progs/verifier_bounds.c | 66 +-
.../bpf/progs/verifier_bounds_deduction.c | 45 +-
.../selftests/bpf/progs/verifier_map_ptr.c | 20 +-
.../selftests/bpf/progs/verifier_movsx.c | 16 +-
.../selftests/bpf/progs/verifier_unpriv.c | 65 +-
.../bpf/progs/verifier_value_ptr_arith.c | 101 ++-
tools/testing/selftests/bpf/test_loader.c | 14 +-
.../selftests/bpf/verifier/dead_code.c | 3 +-
tools/testing/selftests/bpf/verifier/jmp32.c | 33 +-
tools/testing/selftests/bpf/verifier/jset.c | 10 +-
20 files changed, 765 insertions(+), 428 deletions(-)
base-commit: 8582d9ab3efdebb88e0cd8beed8e0b9de76443e7
--
2.49.0
The map_fixed_noreplace test does two blocks of test starting from a
mapping of 5 pages at the base address, logging a test result for each
initial mapping. These are logged with the same test name, causing test
automation software to see two reports for the same test in a single run.
Tweak the log message for the second one to deduplicate.
Signed-off-by: Mark Brown <broonie(a)kernel.org>
---
tools/testing/selftests/mm/map_fixed_noreplace.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/tools/testing/selftests/mm/map_fixed_noreplace.c b/tools/testing/selftests/mm/map_fixed_noreplace.c
index d53de2486080..1e9980b8993c 100644
--- a/tools/testing/selftests/mm/map_fixed_noreplace.c
+++ b/tools/testing/selftests/mm/map_fixed_noreplace.c
@@ -96,7 +96,7 @@ int main(void)
ksft_exit_fail_msg("Error:1: mmap() succeeded when it shouldn't have\n");
}
ksft_print_msg("mmap() @ 0x%lx-0x%lx p=%p result=%m\n", addr, addr + size, p);
- ksft_test_result_pass("mmap() 5*PAGE_SIZE at base\n");
+ ksft_test_result_pass("Second mmap() 5*PAGE_SIZE at base\n");
/*
* Second mapping contained within first:
---
base-commit: 82f2b0b97b36ee3fcddf0f0780a9a0825d52fec3
change-id: 20250514-selftests-mm-map-fixed-noreplace-dup-25f83f618bb2
Best regards,
--
Mark Brown <broonie(a)kernel.org>
Apparently, this test completes successfully when it completes execution
without either causing a kernel panic or being killed by the kernel.
This new test result message is more descriptive and grammatically
correct.
Signed-off-by: Brigham Campbell <me(a)brighamcampbell.com>
---
tools/testing/selftests/x86/mov_ss_trap.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/tools/testing/selftests/x86/mov_ss_trap.c b/tools/testing/selftests/x86/mov_ss_trap.c
index f22cb6b382f9..d80033c0d7eb 100644
--- a/tools/testing/selftests/x86/mov_ss_trap.c
+++ b/tools/testing/selftests/x86/mov_ss_trap.c
@@ -269,6 +269,6 @@ int main()
);
}
- printf("[OK]\tI aten't dead\n");
+ printf("[OK]\tkernel handled MOV SS without crashing test\n");
return 0;
}
--
2.49.0
Fixes a small typo in a comment; helping to improve clarity in code comment.
Signed-off-by: Sukrut Heroorkar <hsukrut3(a)gmail.com>
---
tools/testing/selftests/dma/dma_map_benchmark.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/tools/testing/selftests/dma/dma_map_benchmark.c b/tools/testing/selftests/dma/dma_map_benchmark.c
index b12f1f9babf8..ed5522061dd6 100644
--- a/tools/testing/selftests/dma/dma_map_benchmark.c
+++ b/tools/testing/selftests/dma/dma_map_benchmark.c
@@ -80,7 +80,7 @@ int main(int argc, char **argv)
exit(1);
}
- /* suppose the mininum DMA zone is 1MB in the world */
+ /* suppose the minimum DMA zone is 1MB in the world */
if (bits < 20 || bits > 64) {
fprintf(stderr, "invalid dma mask bit, must be in 20-64\n");
exit(1);
--
2.43.0
From: Suresh K C <suresh.k.chandrappa(a)gmail.com>
Fixes minor grammar and spelling issues in the MSM crash dump documentation:
- Corrected "uinque" to "unique"
- Changed "Set of registers values" to "Set of register values"
These changes improve clarity and maintain consistency with kernel documentation standards.
Signed-off-by: Suresh K C <suresh.k.chandrappa(a)gmail.com>
---
Documentation/gpu/msm-crash-dump.rst | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/Documentation/gpu/msm-crash-dump.rst b/Documentation/gpu/msm-crash-dump.rst
index 9509cc4224f4..6a7a947044ee 100644
--- a/Documentation/gpu/msm-crash-dump.rst
+++ b/Documentation/gpu/msm-crash-dump.rst
@@ -70,7 +70,7 @@ ringbuffer
bo
List of buffers from the hanging submission if available.
- Each buffer object will have a uinque iova.
+ Each buffer object will have a unique iova.
iova
GPU address of the buffer object.
@@ -83,7 +83,7 @@ bo
Trailing zeros at the end of the buffer will be skipped.
registers
- Set of registers values. Each entry is on its own line enclosed
+ Set of register values. Each entry is on its own line enclosed
by brackets { }.
offset
--
2.43.0
Fix some grammar mistake and improve clarity of output messages
and comments in rtctest.c. Fix format of message in accordance to
other output messages in rtctest.c.
Signed-off-by: Nidhish Chauhan <solemnsquire(a)gmail.com>
---
tools/testing/selftests/rtc/rtctest.c | 18 +++++++++---------
1 file changed, 9 insertions(+), 9 deletions(-)
diff --git a/tools/testing/selftests/rtc/rtctest.c b/tools/testing/selftests/rtc/rtctest.c
index be175c0e6ae3..42456dd90095 100644
--- a/tools/testing/selftests/rtc/rtctest.c
+++ b/tools/testing/selftests/rtc/rtctest.c
@@ -138,10 +138,10 @@ TEST_F_TIMEOUT(rtc, date_read_loop, READ_LOOP_DURATION_SEC + 2) {
rtc_read = rtc_time_to_timestamp(&rtc_tm);
/* Time should not go backwards */
ASSERT_LE(prev_rtc_read, rtc_read);
- /* Time should not increase more then 1s at a time */
+ /* Time should not increase by more than 1s at a time */
ASSERT_GE(prev_rtc_read + 1, rtc_read);
- /* Sleep 11ms to avoid killing / overheating the RTC */
+ /* Sleep for 11ms to avoid overloading or overheating the RTC */
nanosleep_with_retries(READ_LOOP_SLEEP_MS * 1000000);
prev_rtc_read = rtc_read;
@@ -163,7 +163,7 @@ TEST_F_TIMEOUT(rtc, uie_read, NUM_UIE + 2) {
rc = ioctl(self->fd, RTC_UIE_ON, 0);
if (rc == -1) {
ASSERT_EQ(EINVAL, errno);
- TH_LOG("skip update IRQs not supported.");
+ TH_LOG("Skipping test since update IRQs are not supported.");
return;
}
@@ -192,7 +192,7 @@ TEST_F(rtc, uie_select) {
rc = ioctl(self->fd, RTC_UIE_ON, 0);
if (rc == -1) {
ASSERT_EQ(EINVAL, errno);
- TH_LOG("skip update IRQs not supported.");
+ TH_LOG("Skipping test since update IRQs are not supported.");
return;
}
@@ -252,7 +252,7 @@ TEST_F(rtc, alarm_alm_set) {
*/
ASSERT_EQ(RTC_ALARM_UNKNOWN, alarm_state);
ASSERT_EQ(EINVAL, errno);
- TH_LOG("skip alarms are not supported.");
+ TH_LOG("Skipping test since alarms are not supported.");
return;
}
@@ -324,7 +324,7 @@ TEST_F(rtc, alarm_wkalm_set) {
*/
ASSERT_EQ(RTC_ALARM_UNKNOWN, alarm_state);
ASSERT_EQ(EINVAL, errno);
- TH_LOG("skip alarms are not supported.");
+ TH_LOG("Skipping test since alarms are not supported.");
return;
}
@@ -384,7 +384,7 @@ TEST_F_TIMEOUT(rtc, alarm_alm_set_minute, 65) {
*/
ASSERT_EQ(RTC_ALARM_UNKNOWN, alarm_state);
ASSERT_EQ(EINVAL, errno);
- TH_LOG("skip alarms are not supported.");
+ TH_LOG("Skipping test since alarms are not supported.");
return;
}
@@ -454,7 +454,7 @@ TEST_F_TIMEOUT(rtc, alarm_wkalm_set_minute, 65) {
*/
ASSERT_EQ(RTC_ALARM_UNKNOWN, alarm_state);
ASSERT_EQ(EINVAL, errno);
- TH_LOG("skip alarms are not supported.");
+ TH_LOG("Skipping test since alarms are not supported.");
return;
}
@@ -494,7 +494,7 @@ int main(int argc, char **argv)
case 1:
break;
default:
- fprintf(stderr, "usage: %s [rtcdev]\n", argv[0]);
+ fprintf(stderr, "Usage: %s [rtcdev]\n", argv[0]);
return 1;
}
--
2.49.0
Corrected multiple spelling and grammar issues in the under powerpc PMU
selftests:
- Corrected 'aplicable' -> 'applicable'.
- Corrected 'contraint' -> 'constraint'.
- Corrected 'Incase' -> 'In Case'.
These changes improve documentation quality and maintain code readability.
Signed-off-by: Akshaykumar Gunari <akshaygunari(a)gmail.com>
---
.../pmu/event_code_tests/group_constraint_cache_test.c | 4 ++--
.../pmu/event_code_tests/group_constraint_l2l3_sel_test.c | 6 +++---
.../group_constraint_radix_scope_qual_test.c | 4 ++--
.../event_code_tests/group_constraint_thresh_cmp_test.c | 8 ++++----
.../event_code_tests/group_constraint_thresh_ctl_test.c | 4 ++--
.../event_code_tests/group_constraint_thresh_sel_test.c | 4 ++--
.../pmu/event_code_tests/group_constraint_unit_test.c | 8 ++++----
tools/testing/selftests/powerpc/pmu/sampling_tests/misc.c | 4 ++--
.../powerpc/pmu/sampling_tests/mmcra_bhrb_cond_test.c | 2 +-
.../sampling_tests/mmcra_bhrb_disable_no_branch_test.c | 2 +-
.../powerpc/pmu/sampling_tests/mmcra_bhrb_disable_test.c | 2 +-
.../powerpc/pmu/sampling_tests/mmcra_bhrb_ind_call_test.c | 2 +-
12 files changed, 25 insertions(+), 25 deletions(-)
diff --git a/tools/testing/selftests/powerpc/pmu/event_code_tests/group_constraint_cache_test.c b/tools/testing/selftests/powerpc/pmu/event_code_tests/group_constraint_cache_test.c
index f4be05aa3a3d..855dca7334da 100644
--- a/tools/testing/selftests/powerpc/pmu/event_code_tests/group_constraint_cache_test.c
+++ b/tools/testing/selftests/powerpc/pmu/event_code_tests/group_constraint_cache_test.c
@@ -31,7 +31,7 @@ static int group_constraint_cache(void)
/* Check for platform support for the test */
SKIP_IF(platform_check_for_tests());
- /* Init the events for the group contraint check for l1 cache select bits */
+ /* Init the events for the group constraint check for l1 cache select bits */
event_init(&leader, EventCode_1);
FAIL_IF(event_open(&leader));
@@ -42,7 +42,7 @@ static int group_constraint_cache(void)
event_close(&event);
- /* Init the event for the group contraint l1 cache select test */
+ /* Init the event for the group constraint l1 cache select test */
event_init(&event, EventCode_3);
/* Expected to succeed as sibling event request same l1 cache select bits as leader */
diff --git a/tools/testing/selftests/powerpc/pmu/event_code_tests/group_constraint_l2l3_sel_test.c b/tools/testing/selftests/powerpc/pmu/event_code_tests/group_constraint_l2l3_sel_test.c
index e3c7a0c071e2..9de991d7a767 100644
--- a/tools/testing/selftests/powerpc/pmu/event_code_tests/group_constraint_l2l3_sel_test.c
+++ b/tools/testing/selftests/powerpc/pmu/event_code_tests/group_constraint_l2l3_sel_test.c
@@ -30,12 +30,12 @@ static int group_constraint_l2l3_sel(void)
/*
* Check for platform support for the test.
- * This test is only aplicable on ISA v3.1
+ * This test is only applicable on ISA v3.1
*/
SKIP_IF(platform_check_for_tests());
SKIP_IF(!have_hwcap2(PPC_FEATURE2_ARCH_3_1));
- /* Init the events for the group contraint check for l2l3_sel bits */
+ /* Init the events for the group constraint check for l2l3_sel bits */
event_init(&leader, EventCode_1);
FAIL_IF(event_open(&leader));
@@ -46,7 +46,7 @@ static int group_constraint_l2l3_sel(void)
event_close(&event);
- /* Init the event for the group contraint l2l3_sel test */
+ /* Init the event for the group constraint l2l3_sel test */
event_init(&event, EventCode_3);
/* Expected to succeed as sibling event request same l2l3_sel bits as leader */
diff --git a/tools/testing/selftests/powerpc/pmu/event_code_tests/group_constraint_radix_scope_qual_test.c b/tools/testing/selftests/powerpc/pmu/event_code_tests/group_constraint_radix_scope_qual_test.c
index 9233175787cc..e354129cd7fb 100644
--- a/tools/testing/selftests/powerpc/pmu/event_code_tests/group_constraint_radix_scope_qual_test.c
+++ b/tools/testing/selftests/powerpc/pmu/event_code_tests/group_constraint_radix_scope_qual_test.c
@@ -26,12 +26,12 @@ static int group_constraint_radix_scope_qual(void)
/*
* Check for platform support for the test.
- * This test is aplicable on ISA v3.1 only.
+ * This test is applicable on ISA v3.1 only.
*/
SKIP_IF(platform_check_for_tests());
SKIP_IF(!have_hwcap2(PPC_FEATURE2_ARCH_3_1));
- /* Init the events for the group contraint check for radix_scope_qual bits */
+ /* Init the events for the group constraint check for radix_scope_qual bits */
event_init(&leader, EventCode_1);
FAIL_IF(event_open(&leader));
diff --git a/tools/testing/selftests/powerpc/pmu/event_code_tests/group_constraint_thresh_cmp_test.c b/tools/testing/selftests/powerpc/pmu/event_code_tests/group_constraint_thresh_cmp_test.c
index 4b69e7214c0b..433fa9982c8c 100644
--- a/tools/testing/selftests/powerpc/pmu/event_code_tests/group_constraint_thresh_cmp_test.c
+++ b/tools/testing/selftests/powerpc/pmu/event_code_tests/group_constraint_thresh_cmp_test.c
@@ -37,7 +37,7 @@ static int group_constraint_thresh_cmp(void)
SKIP_IF(platform_check_for_tests());
if (have_hwcap2(PPC_FEATURE2_ARCH_3_1)) {
- /* Init the events for the group contraint check for thresh_cmp bits */
+ /* Init the events for the group constraint check for thresh_cmp bits */
event_init(&leader, p10_EventCode_1);
/* Add the thresh_cmp value for leader in config1 */
@@ -54,7 +54,7 @@ static int group_constraint_thresh_cmp(void)
event_close(&event);
- /* Init the event for the group contraint thresh compare test */
+ /* Init the event for the group constraint thresh compare test */
event_init(&event, p10_EventCode_2);
/* Add the same thresh_cmp value for leader and sibling event in config1 */
@@ -66,7 +66,7 @@ static int group_constraint_thresh_cmp(void)
event_close(&leader);
event_close(&event);
} else {
- /* Init the events for the group contraint check for thresh_cmp bits */
+ /* Init the events for the group constraint check for thresh_cmp bits */
event_init(&leader, p9_EventCode_1);
FAIL_IF(event_open(&leader));
@@ -77,7 +77,7 @@ static int group_constraint_thresh_cmp(void)
event_close(&event);
- /* Init the event for the group contraint thresh compare test */
+ /* Init the event for the group constraint thresh compare test */
event_init(&event, p9_EventCode_3);
/* Expected to succeed as sibling and leader event request same thresh_cmp bits */
diff --git a/tools/testing/selftests/powerpc/pmu/event_code_tests/group_constraint_thresh_ctl_test.c b/tools/testing/selftests/powerpc/pmu/event_code_tests/group_constraint_thresh_ctl_test.c
index e0852ebc1671..7e7206eefcfd 100644
--- a/tools/testing/selftests/powerpc/pmu/event_code_tests/group_constraint_thresh_ctl_test.c
+++ b/tools/testing/selftests/powerpc/pmu/event_code_tests/group_constraint_thresh_ctl_test.c
@@ -35,7 +35,7 @@ static int group_constraint_thresh_ctl(void)
/* Check for platform support for the test */
SKIP_IF(platform_check_for_tests());
- /* Init the events for the group contraint thresh control test */
+ /* Init the events for the group constraint thresh control test */
event_init(&leader, EventCode_1);
FAIL_IF(event_open(&leader));
@@ -46,7 +46,7 @@ static int group_constraint_thresh_ctl(void)
event_close(&event);
- /* Init the event for the group contraint thresh control test */
+ /* Init the event for the group constraint thresh control test */
event_init(&event, EventCode_3);
/* Expected to succeed as sibling and leader event request same thresh_ctl bits */
diff --git a/tools/testing/selftests/powerpc/pmu/event_code_tests/group_constraint_thresh_sel_test.c b/tools/testing/selftests/powerpc/pmu/event_code_tests/group_constraint_thresh_sel_test.c
index 50a8cd843ce7..1b1336c1ddb1 100644
--- a/tools/testing/selftests/powerpc/pmu/event_code_tests/group_constraint_thresh_sel_test.c
+++ b/tools/testing/selftests/powerpc/pmu/event_code_tests/group_constraint_thresh_sel_test.c
@@ -34,7 +34,7 @@ static int group_constraint_thresh_sel(void)
/* Check for platform support for the test */
SKIP_IF(platform_check_for_tests());
- /* Init the events for the group contraint thresh select test */
+ /* Init the events for the group constraint thresh select test */
event_init(&leader, EventCode_1);
FAIL_IF(event_open(&leader));
@@ -45,7 +45,7 @@ static int group_constraint_thresh_sel(void)
event_close(&event);
- /* Init the event for the group contraint thresh select test */
+ /* Init the event for the group constraint thresh select test */
event_init(&event, EventCode_3);
/* Expected to succeed as sibling and leader event request same thresh_sel bits */
diff --git a/tools/testing/selftests/powerpc/pmu/event_code_tests/group_constraint_unit_test.c b/tools/testing/selftests/powerpc/pmu/event_code_tests/group_constraint_unit_test.c
index a2c18923dcec..86684331f2bc 100644
--- a/tools/testing/selftests/powerpc/pmu/event_code_tests/group_constraint_unit_test.c
+++ b/tools/testing/selftests/powerpc/pmu/event_code_tests/group_constraint_unit_test.c
@@ -21,7 +21,7 @@
* Testcase for group constraint check of unit and pmc bits which is
* used to program corresponding unit and pmc field in Monitor Mode
* Control Register 1 (MMCR1)
- * One of the event in the group should use PMC 4 incase units field
+ * One of the event in the group should use PMC 4 in case units field
* value is within 6 to 9 otherwise event_open for the group will fail.
*/
static int group_constraint_unit(void)
@@ -37,21 +37,21 @@ static int group_constraint_unit(void)
SKIP_IF(platform_check_for_tests());
SKIP_IF(have_hwcap2(PPC_FEATURE2_ARCH_3_1));
- /* Init the events for the group contraint check for unit bits */
+ /* Init the events for the group constraint check for unit bits */
e = &events[0];
event_init(e, EventCode_1);
/* Expected to fail as PMC 4 is not used with unit field value 6 to 9 */
FAIL_IF(!event_open(&events[0]));
- /* Init the events for the group contraint check for unit bits */
+ /* Init the events for the group constraint check for unit bits */
e = &events[1];
event_init(e, EventCode_2);
/* Expected to pass as PMC 4 is used with unit field value 6 to 9 */
FAIL_IF(event_open(&events[1]));
- /* Init the event for the group contraint unit test */
+ /* Init the event for the group constraint unit test */
e = &events[2];
event_init(e, EventCode_3);
diff --git a/tools/testing/selftests/powerpc/pmu/sampling_tests/misc.c b/tools/testing/selftests/powerpc/pmu/sampling_tests/misc.c
index 8a538b6182a1..7ec35566790a 100644
--- a/tools/testing/selftests/powerpc/pmu/sampling_tests/misc.c
+++ b/tools/testing/selftests/powerpc/pmu/sampling_tests/misc.c
@@ -466,7 +466,7 @@ int get_thresh_cmp_val(struct event event)
return value;
/*
- * Incase of P10, thresh_cmp value is not part of raw event code
+ * In case of P10, thresh_cmp value is not part of raw event code
* and provided via attr.config1 parameter. To program threshold in MMCRA,
* take a 18 bit number N and shift right 2 places and increment
* the exponent E by 1 until the upper 10 bits of N are zero.
@@ -498,7 +498,7 @@ int get_thresh_cmp_val(struct event event)
* by comparing base_platform value from auxv and real
* PVR value.
* auxv_base_platform() func gives information of "base platform"
- * corresponding to PVR value. Incase, if the distro doesn't
+ * corresponding to PVR value. In case, if the distro doesn't
* support platform PVR (missing cputable support), base platform
* in auxv will have a default value other than the real PVR's.
* In this case, ISAv3 PMU (generic compat PMU) will be registered
diff --git a/tools/testing/selftests/powerpc/pmu/sampling_tests/mmcra_bhrb_cond_test.c b/tools/testing/selftests/powerpc/pmu/sampling_tests/mmcra_bhrb_cond_test.c
index 809de8d58b3b..696f4285af3e 100644
--- a/tools/testing/selftests/powerpc/pmu/sampling_tests/mmcra_bhrb_cond_test.c
+++ b/tools/testing/selftests/powerpc/pmu/sampling_tests/mmcra_bhrb_cond_test.c
@@ -29,7 +29,7 @@ static int mmcra_bhrb_cond_test(void)
/*
* Check for platform support for the test.
- * This test is only aplicable on ISA v3.1
+ * This test is only applicable on ISA v3.1
*/
SKIP_IF(check_pvr_for_sampling_tests());
SKIP_IF(!have_hwcap2(PPC_FEATURE2_ARCH_3_1));
diff --git a/tools/testing/selftests/powerpc/pmu/sampling_tests/mmcra_bhrb_disable_no_branch_test.c b/tools/testing/selftests/powerpc/pmu/sampling_tests/mmcra_bhrb_disable_no_branch_test.c
index fa0dc15f9123..6ed293d18a34 100644
--- a/tools/testing/selftests/powerpc/pmu/sampling_tests/mmcra_bhrb_disable_no_branch_test.c
+++ b/tools/testing/selftests/powerpc/pmu/sampling_tests/mmcra_bhrb_disable_no_branch_test.c
@@ -26,7 +26,7 @@ static int mmcra_bhrb_disable_no_branch_test(void)
/*
* Check for platform support for the test.
- * This test is only aplicable on ISA v3.1
+ * This test is only applicable on ISA v3.1
*/
SKIP_IF(check_pvr_for_sampling_tests());
SKIP_IF(!have_hwcap2(PPC_FEATURE2_ARCH_3_1));
diff --git a/tools/testing/selftests/powerpc/pmu/sampling_tests/mmcra_bhrb_disable_test.c b/tools/testing/selftests/powerpc/pmu/sampling_tests/mmcra_bhrb_disable_test.c
index bc3161ab003d..858e37705275 100644
--- a/tools/testing/selftests/powerpc/pmu/sampling_tests/mmcra_bhrb_disable_test.c
+++ b/tools/testing/selftests/powerpc/pmu/sampling_tests/mmcra_bhrb_disable_test.c
@@ -26,7 +26,7 @@ static int mmcra_bhrb_disable_test(void)
/*
* Check for platform support for the test.
- * This test is only aplicable on ISA v3.1
+ * This test is only applicable on ISA v3.1
*/
SKIP_IF(check_pvr_for_sampling_tests());
SKIP_IF(!have_hwcap2(PPC_FEATURE2_ARCH_3_1));
diff --git a/tools/testing/selftests/powerpc/pmu/sampling_tests/mmcra_bhrb_ind_call_test.c b/tools/testing/selftests/powerpc/pmu/sampling_tests/mmcra_bhrb_ind_call_test.c
index fd6c9f12212c..6ae3edfedd51 100644
--- a/tools/testing/selftests/powerpc/pmu/sampling_tests/mmcra_bhrb_ind_call_test.c
+++ b/tools/testing/selftests/powerpc/pmu/sampling_tests/mmcra_bhrb_ind_call_test.c
@@ -29,7 +29,7 @@ static int mmcra_bhrb_ind_call_test(void)
/*
* Check for platform support for the test.
- * This test is only aplicable on ISA v3.1
+ * This test is only applicable on ISA v3.1
*/
SKIP_IF(check_pvr_for_sampling_tests());
SKIP_IF(!have_hwcap2(PPC_FEATURE2_ARCH_3_1));
--
2.43.0
Convert multi-line comments to preferred style with beginning and ending
almost-blank lines.
Signed-off-by: George Anthony Vernon <contact(a)gvernon.com>
---
tools/testing/selftests/mm/compaction_test.c | 22 +++++++++++++-------
1 file changed, 14 insertions(+), 8 deletions(-)
diff --git a/tools/testing/selftests/mm/compaction_test.c b/tools/testing/selftests/mm/compaction_test.c
index 2c3a0eb6b22d..b217ec4a434b 100644
--- a/tools/testing/selftests/mm/compaction_test.c
+++ b/tools/testing/selftests/mm/compaction_test.c
@@ -1,6 +1,5 @@
// SPDX-License-Identifier: GPL-2.0
/*
- *
* A test for the patch "Allow compaction of unevictable pages".
* With this patch we should be able to allocate at least 1/4
* of RAM in huge pages. Without the patch much less is
@@ -94,8 +93,10 @@ int check_compaction(unsigned long mem_free, unsigned long hugepage_size,
snprintf(init_nr_hugepages, sizeof(init_nr_hugepages),
"%lu", initial_nr_hugepages);
- /* We want to test with 80% of available memory. Else, OOM killer comes
- in to play */
+ /*
+ * We want to test with 80% of available memory. Else, OOM killer comes
+ * in to play
+ */
mem_free = mem_free * 0.8;
fd = open("/proc/sys/vm/nr_hugepages", O_RDWR | O_NONBLOCK);
@@ -106,8 +107,10 @@ int check_compaction(unsigned long mem_free, unsigned long hugepage_size,
goto out;
}
- /* Request a large number of huge pages. The Kernel will allocate
- as much as it can */
+ /*
+ * Request a large number of huge pages. The Kernel will allocate
+ * as much as it can
+ */
if (write(fd, "100000", (6*sizeof(char))) != (6*sizeof(char))) {
ksft_print_msg("Failed to write 100000 to /proc/sys/vm/nr_hugepages: %s\n",
strerror(errno));
@@ -122,8 +125,10 @@ int check_compaction(unsigned long mem_free, unsigned long hugepage_size,
goto close_fd;
}
- /* We should have been able to request at least 1/3 rd of the memory in
- huge pages */
+ /*
+ * We should have been able to request at least 1/3 rd of the memory in
+ * huge pages
+ */
nr_hugepages_ul = strtoul(nr_hugepages, NULL, 10);
if (!nr_hugepages_ul) {
ksft_print_msg("ERROR: No memory is available as huge pages\n");
@@ -243,7 +248,8 @@ int main(int argc, char **argv)
entry->next = list;
list = entry;
- /* Write something (in this case the address of the map) to
+ /*
+ * Write something (in this case the address of the map) to
* ensure that KSM can't merge the mapped pages
*/
for (i = 0; i < MAP_SIZE; i += page_size)
--
2.49.0
Hello,
RFC v2 addresses comments in RFC v1 [1]. This series is also rebased
on kvm/next (v6.15-rc4).
Here's the series stitched together for your convenience:
https://github.com/googleprodkernel/linux-cc/tree/kvm-gmem-link-migrate-rfc…
Changes from RFC v1:
+ Adds patches to make guest mem use guest mem inodes instead of
anonymous inodes.
+ Changed the name of factored out gmem allocating function to
kvm_gmem_alloc_view().
+ Changed the flag name vm_move_enc_ctxt_supported to
use_vm_enc_ctxt_op.
+ Various small changes to make patchset compatible with latest version
of kvm/next.
As a refresher, split file/inode model was proposed in guest_mem v11,
where memslot bindings belong to the file and pages belong to the inode.
This model lends itself well to having different VMs use separate files
pointing to the same inode.
The split file/inode model has also been used by the other following
recent patch series:
+ mmap support for guest_memfd: [2]
+ NUMA mempolicy support for guest_memfd: [3]
+ HugeTLB support for guest_memfd: [4]
This RFC proposes an ioctl, KVM_LINK_GUEST_MEMFD, that takes a VM and
a gmem fd, and returns another gmem fd referencing a different file
and associated with VM. This RFC also includes an update to
KVM_CAP_VM_MOVE_ENC_CONTEXT_FROM to migrate memory context
(slot->arch.lpage_info and kvm->mem_attr_array) from source to
destination vm, intra-host.
Intended usage of the two ioctls:
1. Source VM’s fd is passed to destination VM via unix sockets.
2. Destination VM uses new ioctl KVM_LINK_GUEST_MEMFD to link source
VM’s fd to a new fd.
3. Destination VM will pass new fds to KVM_SET_USER_MEMORY_REGION,
which will bind the new file, pointing to the same inode that the
source VM’s file points to, to memslots.
4. Use KVM_CAP_VM_MOVE_ENC_CONTEXT_FROM to move kvm->mem_attr_array
and slot->arch.lpage_info to the destination VM.
5. Run the destination VM as per normal.
Some other approaches considered were:
+ Using the linkat() syscall, but that requires a mount/directory for
a source fd to be linked to
+ Using the dup() syscall, but that only duplicates the fd, and both
fds point to the same file
[1] https://lore.kernel.org/all/cover.1691446946.git.ackerleytng@google.com/T/
[2] https://lore.kernel.org/all/20250328153133.3504118-2-tabba@google.com/
[3] https://lore.kernel.org/all/20250408112402.181574-6-shivankg@amd.com/
[4] https://lore.kernel.org/all/c1ee659c212b5a8b0e7a7f4d1763699176dd3a62.174726…
---
Ackerley Tng (12):
KVM: guest_memfd: Make guest mem use guest mem inodes instead of
anonymous inodes
KVM: guest_mem: Refactor out kvm_gmem_alloc_view()
KVM: guest_mem: Add ioctl KVM_LINK_GUEST_MEMFD
KVM: selftests: Add tests for KVM_LINK_GUEST_MEMFD ioctl
KVM: selftests: Test transferring private memory to another VM
KVM: x86: Refactor sev's flag migration_in_progress to kvm struct
KVM: x86: Refactor common code out of sev.c
KVM: x86: Refactor common migration preparation code out of
sev_vm_move_enc_context_from
KVM: x86: Let moving encryption context be configurable
KVM: x86: Handle moving of memory context for intra-host migration
KVM: selftests: Generalize migration functions from
sev_migrate_tests.c
KVM: selftests: Add tests for migration of private mem
David Hildenbrand (1):
fs: Refactor to provide function that allocates a secure anonymous
inode
arch/x86/include/asm/kvm_host.h | 3 +-
arch/x86/kvm/svm/sev.c | 82 +------
arch/x86/kvm/svm/svm.h | 3 +-
arch/x86/kvm/x86.c | 218 ++++++++++++++++-
arch/x86/kvm/x86.h | 6 +
fs/anon_inodes.c | 23 +-
include/linux/fs.h | 13 +-
include/linux/kvm_host.h | 18 ++
include/uapi/linux/kvm.h | 8 +
include/uapi/linux/magic.h | 1 +
mm/secretmem.c | 9 +-
tools/testing/selftests/kvm/Makefile.kvm | 1 +
.../testing/selftests/kvm/guest_memfd_test.c | 43 ++++
.../testing/selftests/kvm/include/kvm_util.h | 31 +++
.../kvm/x86/private_mem_migrate_tests.c | 93 ++++++++
.../selftests/kvm/x86/sev_migrate_tests.c | 48 ++--
virt/kvm/guest_memfd.c | 225 +++++++++++++++---
virt/kvm/kvm_main.c | 17 +-
virt/kvm/kvm_mm.h | 14 +-
19 files changed, 697 insertions(+), 159 deletions(-)
create mode 100644 tools/testing/selftests/kvm/x86/private_mem_migrate_tests.c
--
2.49.0.1101.gccaa498523-goog
Cover three recent cases:
1. missing ops locking for the lowers during netdev_sync_lower_features
2. missing locking for dev_set_promiscuity (plus netdev_ops_assert_locked
with a comment on why/when it's needed)
3. rcu lock during team_change_rx_flags
Verified that each one triggers when the respective fix is reverted.
Not sure about the placement, but since it all relies on teaming,
added to the teaming directory.
One ugly bit is that I add NETIF_F_LRO to netdevsim; there is no way
to trigger netdev_sync_lower_features without it.
Signed-off-by: Stanislav Fomichev <stfomichev(a)gmail.com>
---
drivers/net/netdevsim/netdev.c | 2 +
net/core/dev.c | 10 ++-
.../selftests/drivers/net/team/Makefile | 2 +-
.../testing/selftests/drivers/net/team/config | 1 +
.../selftests/drivers/net/team/propagation.sh | 79 +++++++++++++++++++
5 files changed, 92 insertions(+), 2 deletions(-)
create mode 100755 tools/testing/selftests/drivers/net/team/propagation.sh
diff --git a/drivers/net/netdevsim/netdev.c b/drivers/net/netdevsim/netdev.c
index 0e0321a7ddd7..3bd1f8cffee8 100644
--- a/drivers/net/netdevsim/netdev.c
+++ b/drivers/net/netdevsim/netdev.c
@@ -879,11 +879,13 @@ static void nsim_setup(struct net_device *dev)
NETIF_F_SG |
NETIF_F_FRAGLIST |
NETIF_F_HW_CSUM |
+ NETIF_F_LRO |
NETIF_F_TSO;
dev->hw_features |= NETIF_F_HW_TC |
NETIF_F_SG |
NETIF_F_FRAGLIST |
NETIF_F_HW_CSUM |
+ NETIF_F_LRO |
NETIF_F_TSO;
dev->max_mtu = ETH_MAX_MTU;
dev->xdp_features = NETDEV_XDP_ACT_HW_OFFLOAD;
diff --git a/net/core/dev.c b/net/core/dev.c
index 0d891634c692..4debd4b8e0f5 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -9188,8 +9188,16 @@ static int __dev_set_promiscuity(struct net_device *dev, int inc, bool notify)
dev_change_rx_flags(dev, IFF_PROMISC);
}
- if (notify)
+ if (notify) {
+ /* The ops lock is only required to ensure consistent locking
+ * for `NETDEV_CHANGE` notifiers. This function is sometimes
+ * called without the lock, even for devices that are ops
+ * locked, such as in `dev_uc_sync_multiple` when using
+ * bonding or teaming.
+ */
+ netdev_ops_assert_locked(dev);
__dev_notify_flags(dev, old_flags, IFF_PROMISC, 0, NULL);
+ }
return 0;
}
diff --git a/tools/testing/selftests/drivers/net/team/Makefile b/tools/testing/selftests/drivers/net/team/Makefile
index 2d5a76d99181..eaf6938f100e 100644
--- a/tools/testing/selftests/drivers/net/team/Makefile
+++ b/tools/testing/selftests/drivers/net/team/Makefile
@@ -1,7 +1,7 @@
# SPDX-License-Identifier: GPL-2.0
# Makefile for net selftests
-TEST_PROGS := dev_addr_lists.sh
+TEST_PROGS := dev_addr_lists.sh propagation.sh
TEST_INCLUDES := \
../bonding/lag_lib.sh \
diff --git a/tools/testing/selftests/drivers/net/team/config b/tools/testing/selftests/drivers/net/team/config
index b5e3a3aad4bf..636b3525b679 100644
--- a/tools/testing/selftests/drivers/net/team/config
+++ b/tools/testing/selftests/drivers/net/team/config
@@ -1,5 +1,6 @@
CONFIG_DUMMY=y
CONFIG_IPV6=y
CONFIG_MACVLAN=y
+CONFIG_NETDEVSIM=m
CONFIG_NET_TEAM=y
CONFIG_NET_TEAM_MODE_LOADBALANCE=y
diff --git a/tools/testing/selftests/drivers/net/team/propagation.sh b/tools/testing/selftests/drivers/net/team/propagation.sh
new file mode 100755
index 000000000000..849a5f2cb3a7
--- /dev/null
+++ b/tools/testing/selftests/drivers/net/team/propagation.sh
@@ -0,0 +1,79 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+
+set -e
+
+NSIM_LRO_ID=$((256 + RANDOM % 256))
+NSIM_LRO_SYS=/sys/bus/netdevsim/devices/netdevsim$NSIM_LRO_ID
+
+NSIM_DEV_SYS_NEW=/sys/bus/netdevsim/new_device
+NSIM_DEV_SYS_DEL=/sys/bus/netdevsim/del_device
+
+cleanup()
+{
+ ip link del dummyteam &>/dev/null
+ ip link del team0 &>/dev/null
+ echo $NSIM_LRO_ID > $NSIM_DEV_SYS_DEL
+}
+
+# Trigger LRO propagation to the lower.
+# https://lore.kernel.org/netdev/aBvOpkIoxcr9PfDg@mini-arch/
+team_lro()
+{
+ # using netdevsim because it supports NETIF_F_LRO
+ NSIM_LRO_NAME=$(find $NSIM_LRO_SYS/net -maxdepth 1 -type d ! \
+ -path $NSIM_LRO_SYS/net -exec basename {} \;)
+
+ ip link add name team0 type team
+ ip link set $NSIM_LRO_NAME down
+ ip link set dev $NSIM_LRO_NAME master team0
+ ip link set team0 up
+ ethtool -K team0 large-receive-offload off
+
+ ip link del team0
+}
+
+# Trigger promisc propagation to the lower during IFLA_MASTER.
+# https://lore.kernel.org/netdev/20250506032328.3003050-1-sdf@fomichev.me/
+team_promisc()
+{
+ ip link add name dummyteam type dummy
+ ip link add name team0 type team
+ ip link set dummyteam down
+ ip link set team0 promisc on
+ ip link set dev dummyteam master team0
+ ip link set team0 up
+
+ ip link del team0
+ ip link del dummyteam
+}
+
+# Trigger promisc propagation to the lower via netif_change_flags (aka
+# ndo_change_rx_flags).
+# https://lore.kernel.org/netdev/20250514220319.3505158-1-stfomichev@gmail.co…
+team_change_flags()
+{
+ ip link add name dummyteam type dummy
+ ip link add name team0 type team
+ ip link set dummyteam down
+ ip link set dev dummyteam master team0
+ ip link set team0 up
+ ip link set team0 promisc on
+
+ # Make sure we can add more L2 addresses without any issues.
+ ip link add link team0 address 00:00:00:00:00:01 team0.1 type macvlan
+ ip link set team0.1 up
+
+ ip link del team0.1
+ ip link del team0
+ ip link del dummyteam
+}
+
+trap cleanup EXIT
+modprobe netdevsim || :
+echo $NSIM_LRO_ID > $NSIM_DEV_SYS_NEW
+udevadm settle
+team_lro
+team_promisc
+team_change_flags
+modprobe -r netdevsim || :
--
2.49.0
When running 'make' in tools/testing/selftests/arm64/ without explicitly
setting the OUTPUT variable, the build system will creates test
directories (e.g., /bti) in the root filesystem due to OUTPUT defaulting
to an empty string. This causes unintended pollution of the root directory.
This patch adds proper handling for the OUTPUT variable: Sets OUTPUT
to the current directory (.) if not specified
Signed-off-by: tanze <tanze(a)kylinos.cn>
---
tools/testing/selftests/arm64/Makefile | 2 ++
1 file changed, 2 insertions(+)
diff --git a/tools/testing/selftests/arm64/Makefile b/tools/testing/selftests/arm64/Makefile
index 22029e60eff3..c4c72ee2ef55 100644
--- a/tools/testing/selftests/arm64/Makefile
+++ b/tools/testing/selftests/arm64/Makefile
@@ -21,6 +21,8 @@ CFLAGS += $(KHDR_INCLUDES)
CFLAGS += -I$(top_srcdir)/tools/include
+OUTPUT ?= $(CURDIR)
+
export CFLAGS
export top_srcdir
--
2.25.1
Two guard_regions tests on userfaultfd fail when userfaultfd is not
present. Skip them instead.
hugevm test reads kernel config to get page table level information and
fails when neither /proc/config.gz nor /boot/config-* is present. Skip
it instead.
Zi Yan (2):
selftests/mm: skip guard_regions.uffd tests when uffd is not present.
selftests/mm: skip hugevm test if kernel config file is not present.
tools/testing/selftests/mm/guard-regions.c | 17 ++++++++++--
.../selftests/mm/va_high_addr_switch.sh | 26 +++++++------------
2 files changed, 24 insertions(+), 19 deletions(-)
--
2.47.2
The ID_AA64PFR1_EL1.MTE_frac field is currently hidden from KVM.
However, when ID_AA64PFR1_EL1.MTE==2, ID_AA64PFR1_EL1.MTE_frac==0
indicates that MTE_ASYNC is supported. On a host with
ID_AA64PFR1_EL1.MTE==2 but without MTE_ASYNC support a guest with the
MTE capability enabled will incorrectly see MTE_ASYNC advertised as
supported. This series fixes that.
This was found by inspection and the current behaviour is not known to
break anything. Linux doesn't check MTE_frac, and wrongly, assumes
MTE async faults can be generated whenever MTE is supported. This is
a separate problem and not addressed here.
I am looking for feedback on whether this change is valuable or
otherwise.
Changes since v1:
Only pass MTE_Frac hw value to the guest when it is the exact failure case.
Changed base commit to v6.15-rc5 but still applies on v6.16-rc2 as well.
Ben Horgan (3):
arm64/sysreg: Expose MTE_frac so that it is visible to KVM
KVM: arm64: Make MTE_frac masking conditional on MTE capability
KVM: selftests: Confirm exposing MTE_frac does not break migration
arch/arm64/kernel/cpufeature.c | 1 +
arch/arm64/kvm/sys_regs.c | 28 ++++++-
.../testing/selftests/kvm/arm64/set_id_regs.c | 77 ++++++++++++++++++-
3 files changed, 103 insertions(+), 3 deletions(-)
base-commit: 92a09c47464d040866cf2b4cd052bc60555185fb
--
2.43.0
Add small grammar fixes in perf events and Real Time Clock tests'
output messages.
Include braces around a single if statement, when there are multiple
statements in the else branch, to align with the kernel coding style.
Signed-off-by: Hanne-Lotta Mäenpää <hannelotta(a)gmail.com>
---
tools/testing/selftests/perf_events/watermark_signal.c | 7 ++++---
tools/testing/selftests/rtc/rtctest.c | 10 +++++-----
2 files changed, 9 insertions(+), 8 deletions(-)
diff --git a/tools/testing/selftests/perf_events/watermark_signal.c b/tools/testing/selftests/perf_events/watermark_signal.c
index 49dc1e831174..6176afd4950b 100644
--- a/tools/testing/selftests/perf_events/watermark_signal.c
+++ b/tools/testing/selftests/perf_events/watermark_signal.c
@@ -65,8 +65,9 @@ TEST(watermark_signal)
child = fork();
EXPECT_GE(child, 0);
- if (child == 0)
+ if (child == 0) {
do_child();
+ }
else if (child < 0) {
perror("fork()");
goto cleanup;
@@ -75,7 +76,7 @@ TEST(watermark_signal)
if (waitpid(child, &child_status, WSTOPPED) != child ||
!(WIFSTOPPED(child_status) && WSTOPSIG(child_status) == SIGSTOP)) {
fprintf(stderr,
- "failed to sycnhronize with child errno=%d status=%x\n",
+ "failed to synchronize with child errno=%d status=%x\n",
errno,
child_status);
goto cleanup;
@@ -84,7 +85,7 @@ TEST(watermark_signal)
fd = syscall(__NR_perf_event_open, &attr, child, -1, -1,
PERF_FLAG_FD_CLOEXEC);
if (fd < 0) {
- fprintf(stderr, "failed opening event %llx\n", attr.config);
+ fprintf(stderr, "failed to setup performance monitoring %llx\n", attr.config);
goto cleanup;
}
diff --git a/tools/testing/selftests/rtc/rtctest.c b/tools/testing/selftests/rtc/rtctest.c
index be175c0e6ae3..8fd4d5d3b527 100644
--- a/tools/testing/selftests/rtc/rtctest.c
+++ b/tools/testing/selftests/rtc/rtctest.c
@@ -138,10 +138,10 @@ TEST_F_TIMEOUT(rtc, date_read_loop, READ_LOOP_DURATION_SEC + 2) {
rtc_read = rtc_time_to_timestamp(&rtc_tm);
/* Time should not go backwards */
ASSERT_LE(prev_rtc_read, rtc_read);
- /* Time should not increase more then 1s at a time */
+ /* Time should not increase more than 1s per read */
ASSERT_GE(prev_rtc_read + 1, rtc_read);
- /* Sleep 11ms to avoid killing / overheating the RTC */
+ /* Sleep 11ms to avoid overheating the RTC */
nanosleep_with_retries(READ_LOOP_SLEEP_MS * 1000000);
prev_rtc_read = rtc_read;
@@ -236,7 +236,7 @@ TEST_F(rtc, alarm_alm_set) {
if (alarm_state == RTC_ALARM_DISABLED)
SKIP(return, "Skipping test since alarms are not supported.");
if (alarm_state == RTC_ALARM_RES_MINUTE)
- SKIP(return, "Skipping test since alarms has only minute granularity.");
+ SKIP(return, "Skipping test since alarms have only minute granularity.");
rc = ioctl(self->fd, RTC_RD_TIME, &tm);
ASSERT_NE(-1, rc);
@@ -306,7 +306,7 @@ TEST_F(rtc, alarm_wkalm_set) {
if (alarm_state == RTC_ALARM_DISABLED)
SKIP(return, "Skipping test since alarms are not supported.");
if (alarm_state == RTC_ALARM_RES_MINUTE)
- SKIP(return, "Skipping test since alarms has only minute granularity.");
+ SKIP(return, "Skipping test since alarms have only minute granularity.");
rc = ioctl(self->fd, RTC_RD_TIME, &alarm.time);
ASSERT_NE(-1, rc);
@@ -502,7 +502,7 @@ int main(int argc, char **argv)
if (access(rtc_file, R_OK) == 0)
ret = test_harness_run(argc, argv);
else
- ksft_exit_skip("[SKIP]: Cannot access rtc file %s - Exiting\n",
+ ksft_exit_skip("Cannot access RTC file %s - exiting\n",
rtc_file);
return ret;
--
2.39.5
Hi maintainers,
As part of the Kselftest task for the LFX Mentorship Program, I have
reviewed the futex selftest and made minor improvements to the message
clarity in `futex_requeue.c`.
Attached is the patch with the changes. I’ve also uploaded it to the
mentorship platform as instructed.
Please let me know if any changes are needed.
Thanks,
Shivam Sharma
10sharmashivam(a)gmail.com
The thuge-gen test program runs mmap() and shmget() tests for both every
available page size and the default page size, resulting in two tests for
the default size. These tests are distinct since the flags in the default
case do not specify an explicit size, add the flags to the test name that
is logged to deduplicate.
Signed-off-by: Mark Brown <broonie(a)kernel.org>
---
tools/testing/selftests/mm/thuge-gen.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/tools/testing/selftests/mm/thuge-gen.c b/tools/testing/selftests/mm/thuge-gen.c
index cd5174d735be..a41bc1234b37 100644
--- a/tools/testing/selftests/mm/thuge-gen.c
+++ b/tools/testing/selftests/mm/thuge-gen.c
@@ -127,7 +127,7 @@ void test_mmap(unsigned long size, unsigned flags)
show(size);
ksft_test_result(size == getpagesize() || (before - after) == NUM_PAGES,
- "%s mmap %lu\n", __func__, size);
+ "%s mmap %lu %x\n", __func__, size, flags);
if (munmap(map, size * NUM_PAGES))
ksft_exit_fail_msg("%s: unmap %s\n", __func__, strerror(errno));
@@ -165,7 +165,7 @@ void test_shmget(unsigned long size, unsigned flags)
show(size);
ksft_test_result(size == getpagesize() || (before - after) == NUM_PAGES,
- "%s: mmap %lu\n", __func__, size);
+ "%s: mmap %lu %x\n", __func__, size, flags);
if (shmdt(map))
ksft_exit_fail_msg("%s: shmdt: %s\n", __func__, strerror(errno));
}
---
base-commit: 82f2b0b97b36ee3fcddf0f0780a9a0825d52fec3
change-id: 20250514-selfests-mm-thuge-gen-dup-7e1c40716091
Best regards,
--
Mark Brown <broonie(a)kernel.org>
The mlock2-tests test_mlock_lock() test reports two test results with
an identical string, one reporitng if it successfully locked a block of
memory and another reporting if the lock is still present after doing an
unlock (following a similar pattern to other tests in the same program).
This confuses test automation since the test string is used to deduplicate
tests, change the post unlock test to report "Unlocked" instead like the
other tests to fix this.
Signed-off-by: Mark Brown <broonie(a)kernel.org>
---
tools/testing/selftests/mm/mlock2-tests.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/tools/testing/selftests/mm/mlock2-tests.c b/tools/testing/selftests/mm/mlock2-tests.c
index 7f0d50fa361d..3e90ff37e336 100644
--- a/tools/testing/selftests/mm/mlock2-tests.c
+++ b/tools/testing/selftests/mm/mlock2-tests.c
@@ -196,7 +196,7 @@ static void test_mlock_lock(void)
ksft_exit_fail_msg("munlock(): %s\n", strerror(errno));
}
- ksft_test_result(!unlock_lock_check(map), "%s: Locked\n", __func__);
+ ksft_test_result(!unlock_lock_check(map), "%s: Unlocked\n", __func__);
munmap(map, 2 * page_size);
}
---
base-commit: 82f2b0b97b36ee3fcddf0f0780a9a0825d52fec3
change-id: 20250514-selftest-mm-mlock2-dup-277d586bb29d
Best regards,
--
Mark Brown <broonie(a)kernel.org>
When writing a test for fusectl, I referred to this Makefile as a
reference for creating a FUSE daemon in the selftests.
While doing so, I noticed that there is a minor issue in the Makefile.
The fuse_mnt.c file is not actually compiled into fuse_mnt.o,
and the code setting CFLAGS for it never takes effect.
The reason fuse_mnt compiles successfully is because CFLAGS is set
at the very beginning of the file.
Signed-off-by: Chen Linxuan <chenlinxuan(a)uniontech.com>
---
tools/testing/selftests/memfd/Makefile | 4 +---
1 file changed, 1 insertion(+), 3 deletions(-)
diff --git a/tools/testing/selftests/memfd/Makefile b/tools/testing/selftests/memfd/Makefile
index 163b6f68631c4..e9b886c65153d 100644
--- a/tools/testing/selftests/memfd/Makefile
+++ b/tools/testing/selftests/memfd/Makefile
@@ -1,5 +1,4 @@
# SPDX-License-Identifier: GPL-2.0
-CFLAGS += -D_FILE_OFFSET_BITS=64
CFLAGS += $(KHDR_INCLUDES)
TEST_GEN_PROGS := memfd_test
@@ -16,10 +15,9 @@ ifeq ($(VAR_LDLIBS),)
VAR_LDLIBS := -lfuse -pthread
endif
-fuse_mnt.o: CFLAGS += $(VAR_CFLAGS)
-
include ../lib.mk
+$(OUTPUT)/fuse_mnt: CFLAGS += $(VAR_CFLAGS)
$(OUTPUT)/fuse_mnt: LDLIBS += $(VAR_LDLIBS)
$(OUTPUT)/memfd_test: memfd_test.c common.c
--
2.43.0
Until CONFIG_DMABUF_SYSFS_STATS was added [1] it was only possible to
perform per-buffer accounting with debugfs which is not suitable for
production environments. Eventually we discovered the overhead with
per-buffer sysfs file creation/removal was significantly impacting
allocation and free times, and exacerbated kernfs lock contention. [2]
dma_buf_stats_setup() is responsible for 39% of single-page buffer
creation duration, or 74% of single-page dma_buf_export() duration when
stressing dmabuf allocations and frees.
I prototyped a change from per-buffer to per-exporter statistics with a
RCU protected list of exporter allocations that accommodates most (but
not all) of our use-cases and avoids almost all of the sysfs overhead.
While that adds less overhead than per-buffer sysfs, and less even than
the maintenance of the dmabuf debugfs_list, it's still *additional*
overhead on top of the debugfs_list and doesn't give us per-buffer info.
This series uses the existing dmabuf debugfs_list to implement a BPF
dmabuf iterator, which adds no overhead to buffer allocation/free and
provides per-buffer info. The list has been moved outside of
CONFIG_DEBUG_FS scope so that it is always populated. The BPF program
loaded by userspace that extracts per-buffer information gets to define
its own interface which avoids the lack of ABI stability with debugfs.
This will allow us to replace our use of CONFIG_DMABUF_SYSFS_STATS, and
the plan is to remove it from the kernel after the next longterm stable
release.
[1] https://lore.kernel.org/linux-media/20201210044400.1080308-1-hridya@google.…
[2] https://lore.kernel.org/all/20220516171315.2400578-1-tjmercier@google.com
v1: https://lore.kernel.org/all/20250414225227.3642618-1-tjmercier@google.com
v1 -> v2:
Make the DMA buffer list independent of CONFIG_DEBUG_FS per Christian
König
Add CONFIG_DMA_SHARED_BUFFER check to kernel/bpf/Makefile per kernel
test robot
Use BTF_ID_LIST_SINGLE instead of BTF_ID_LIST_GLOBAL_SINGLE per Song Liu
Fixup comment style, mixing code/declarations, and use ASSERT_OK_FD in
selftest per Song Liu
Add BPF_ITER_RESCHED feature to bpf_dmabuf_reg_info per Alexei
Starovoitov
Add open-coded iterator and selftest per Alexei Starovoitov
Add a second test buffer from the system dmabuf heap to selftests
Use the BPF program we'll use in production for selftest per Alexei
Starovoitov
https://r.android.com/c/platform/system/bpfprogs/+/3616123/2/dmabufIter.chttps://r.android.com/c/platform/system/memory/libmeminfo/+/3614259/1/libdm…
v2: https://lore.kernel.org/all/20250504224149.1033867-1-tjmercier@google.com
v2 -> v3:
Rebase onto bpf-next/master
Move get_next_dmabuf() into drivers/dma-buf/dma-buf.c, along with the
new get_first_dmabuf(). This avoids having to expose the dmabuf list
and mutex to the rest of the kernel, and keeps the dmabuf mutex
operations near each other in the same file. (Christian König)
Add Christian's RB to dma-buf: Rename debugfs symbols
Drop RFC: dma-buf: Remove DMA-BUF statistics
v3: https://lore.kernel.org/all/20250507001036.2278781-1-tjmercier@google.com
v3 -> v4:
Fix selftest BPF program comment style (not kdoc) per Alexei Starovoitov
Fix dma-buf.c kdoc comment style per Alexei Starovoitov
Rename get_first_dmabuf / get_next_dmabuf to dma_buf_iter_begin /
dma_buf_iter_next per Christian König
Add Christian's RB to bpf: Add dmabuf iterator
v4: https://lore.kernel.org/all/20250508182025.2961555-1-tjmercier@google.com
v4 -> v5:
Add Christian's Acks to all patches
Add Song Liu's Acks
Move BTF_ID_LIST_SINGLE and DEFINE_BPF_ITER_FUNC closer to usage per
Song Liu
Fix open-coded iterator comment style per Song Liu
Move iterator termination check to its own subtest per Song Liu
Rework selftest buffer creation per Song Liu
Fix spacing in sanitize_string per BPF CI
v5: https://lore.kernel.org/all/20250512174036.266796-1-tjmercier@google.com
v5 -> v6:
Song Liu:
Init test buffer FDs to -1
Zero-init udmabuf_create for future proofing
Bail early for iterator fd/FILE creation failure
Dereference char ptr to check for NUL in sanitize_string()
Move map insertion from create_test_buffers() to test_dmabuf_iter()
Add ACK to selftests/bpf: Add test for open coded dmabuf_iter
T.J. Mercier (5):
dma-buf: Rename debugfs symbols
bpf: Add dmabuf iterator
bpf: Add open coded dmabuf iterator
selftests/bpf: Add test for dmabuf_iter
selftests/bpf: Add test for open coded dmabuf_iter
drivers/dma-buf/dma-buf.c | 98 ++++--
include/linux/dma-buf.h | 4 +-
kernel/bpf/Makefile | 3 +
kernel/bpf/dmabuf_iter.c | 150 +++++++++
kernel/bpf/helpers.c | 5 +
.../testing/selftests/bpf/bpf_experimental.h | 5 +
tools/testing/selftests/bpf/config | 3 +
.../selftests/bpf/prog_tests/dmabuf_iter.c | 285 ++++++++++++++++++
.../testing/selftests/bpf/progs/dmabuf_iter.c | 91 ++++++
9 files changed, 622 insertions(+), 22 deletions(-)
create mode 100644 kernel/bpf/dmabuf_iter.c
create mode 100644 tools/testing/selftests/bpf/prog_tests/dmabuf_iter.c
create mode 100644 tools/testing/selftests/bpf/progs/dmabuf_iter.c
base-commit: 43745d11bfd9683abdf08ad7a5cc403d6a9ffd15
--
2.49.0.1045.g170613ef41-goog
From: Chia-Yu Chang <chia-yu.chang(a)nokia-bell-labs.com>
Hello,
Please find the v6:
v5 (09-May-2025)
- Add #3 to utilize exisintg holes of tcp_sock_write_txrx group for later patches (#4, #9, #10) with new u8 members (Paolo Abeni <pabeni(a)redhat.com>)
- Add pahole outcomes before and after commit in #4, #5, #6, #9, #10, #15 (Paolo Abeni <pabeni(a)redhat.com>)
- Define new helper function tcp_send_ack_reflect_ect() for sending ACK with reflected ECT in #5 (Paolo Abeni <pabeni(a)redhat.com>)
- Add comments for function tcp_ecn_rcv_synack() in #5 (Paolo Abeni <pabeni(a)redhat.com>)
- Add enum/define to be used by sysctl_tcp_ecn in #5, sysctl_tcp_ecn_option in #9, and sysctl_tcp_ecn_option_beacon in #10 (Paolo Abeni <pabeni(a)redhat.com>)
- Move accecn_fail_mode and saw_accecn_opt in #5 and #11 to use exisintg holes of tcp_sock (Paolo Abeni <pabeni(a)redhat.com>)
- Change data type of new members of tcp_request_sock and move them to the end of struct in #5 and #11 (Paolo Abeni <pabeni(a)redhat.com>)
- Move new members of tcp_info to the end of struct in #6 (Paolo Abeni <pabeni(a)redhat.com>)
- Merge previous #7 into #9 (Paolo Abeni <pabeni(a)redhat.com>)
- Mask ecnfield with INET_ECN_MASK to remove WARN_ONCE in #9 (Paolo Abeni <pabeni(a)redhat.com>)
- Reduce the indentation levels for reabability in #9 and #10 (Paolo Abeni <pabeni(a)redhat.com>)
- Move delivered_ecn_bytes to the RX group in #9, accecn_opt_tstamp to the TX group in #10, pkts_acked_ewma to the RX group in #15 (Paolo Abeni <pabeni(a)redhat.com>)
- Add changes in Documentation/networking/net_cachelines/tcp_sock.rst for new tcp_sock members in #3, #5, #6, #9, #10, #15
v5 (22-Apr-2025)
- Further fix for 32-bit ARM alignment in tcp.c (Simon Horman <horms(a)kernel.org>)
v4 (18-Apr-2025)
- Fix 32-bit ARM assertion for alignment requirement (Simon Horman <horms(a)kernel.org>)
v3 (14-Apr-2025)
- Fix patch apply issue in v2 (Jakub Kicinski <kuba(a)kernel.org>)
v2 (18-Mar-2025)
- Add one missing patch from the previous AccECN protocol preparation patch series to this patch series.
The full patch series can be found in
https://github.com/L4STeam/linux-net-next/commits/upstream_l4steam/
The Accurate ECN draft can be found in
https://datatracker.ietf.org/doc/html/draft-ietf-tcpm-accurate-ecn-28
Best regards,
Chia-Yu
Chia-Yu Chang (2):
tcp: reorganize tcp_sock_write_txrx group for variables later
tcp: accecn: AccECN option failure handling
Ilpo Järvinen (13):
tcp: reorganize SYN ECN code
tcp: fast path functions later
tcp: AccECN core
tcp: accecn: AccECN negotiation
tcp: accecn: add AccECN rx byte counters
tcp: accecn: AccECN needs to know delivered bytes
tcp: sack option handling improvements
tcp: accecn: AccECN option
tcp: accecn: AccECN option send control
tcp: accecn: AccECN option ceb/cep heuristic
tcp: accecn: AccECN ACE field multi-wrap heuristic
tcp: accecn: try to fit AccECN option with SACK
tcp: try to avoid safer when ACKs are thinned
.../networking/net_cachelines/tcp_sock.rst | 14 +
include/linux/tcp.h | 34 +-
include/net/netns/ipv4.h | 2 +
include/net/tcp.h | 221 ++++++-
include/uapi/linux/tcp.h | 7 +
net/ipv4/syncookies.c | 3 +
net/ipv4/sysctl_net_ipv4.c | 19 +
net/ipv4/tcp.c | 30 +-
net/ipv4/tcp_input.c | 608 +++++++++++++++++-
net/ipv4/tcp_ipv4.c | 7 +-
net/ipv4/tcp_minisocks.c | 92 ++-
net/ipv4/tcp_output.c | 297 ++++++++-
net/ipv6/syncookies.c | 1 +
net/ipv6/tcp_ipv6.c | 1 +
14 files changed, 1237 insertions(+), 99 deletions(-)
--
2.34.1
Hello,
This patchset is our exploration of how to support 1G pages in guest_memfd, and
how the pages will be used in Confidential VMs.
The patchset covers:
+ How to get 1G pages
+ Allowing mmap() of guest_memfd to userspace so that both private and shared
memory can use the same physical pages
+ Splitting and reconstructing pages to support conversions and mmap()
+ How the VM, userspace and guest_memfd interact to support conversions
+ Selftests to test all the above
+ Selftests also demonstrate the conversion flow between VM, userspace and
guest_memfd.
Why 1G pages in guest memfd?
Bring guest_memfd to performance and memory savings parity with VMs that are
backed by HugeTLBfs.
+ Performance is improved with 1G pages by more TLB hits and faster page walks
on TLB misses.
+ Memory savings from 1G pages comes from HugeTLB Vmemmap Optimization (HVO).
Options for 1G page support:
1. HugeTLB
2. Contiguous Memory Allocator (CMA)
3. Other suggestions are welcome!
Comparison between options:
1. HugeTLB
+ Refactor HugeTLB to separate allocator from the rest of HugeTLB
+ Pro: Graceful transition for VMs backed with HugeTLB to guest_memfd
+ Near term: Allows co-tenancy of HugeTLB and guest_memfd backed VMs
+ Pro: Can provide iterative steps toward new future allocator
+ Unexplored: Managing userspace-visible changes
+ e.g. HugeTLB's free_hugepages will decrease if HugeTLB is used,
but not when future allocator is used
2. CMA
+ Port some HugeTLB features to be applied on CMA
+ Pro: Clean slate
What would refactoring HugeTLB involve?
(Some refactoring was done in this RFC, more can be done.)
1. Broadly involves separating the HugeTLB allocator from the rest of HugeTLB
+ Brings more modularity to HugeTLB
+ No functionality change intended
+ Likely step towards HugeTLB's integration into core-mm
2. guest_memfd will use just the allocator component of HugeTLB, not including
the complex parts of HugeTLB like
+ Userspace reservations (resv_map)
+ Shared PMD mappings
+ Special page walkers
What features would need to be ported to CMA?
+ Improved allocation guarantees
+ Per NUMA node pool of huge pages
+ Subpools per guest_memfd
+ Memory savings
+ Something like HugeTLB Vmemmap Optimization
+ Configuration/reporting features
+ Configuration of number of pages available (and per NUMA node) at and
after host boot
+ Reporting of memory usage/availability statistics at runtime
HugeTLB was picked as the source of 1G pages for this RFC because it allows a
graceful transition, and retains memory savings from HVO.
To illustrate this, if a host machine uses HugeTLBfs to back VMs, and a
confidential VM were to be scheduled on that host, some HugeTLBfs pages would
have to be given up and returned to CMA for guest_memfd pages to be rebuilt from
that memory. This requires memory to be reserved for HVO to be removed and
reapplied on the new guest_memfd memory. This not only slows down memory
allocation but also trims the benefits of HVO. Memory would have to be reserved
on the host to facilitate these transitions.
Improving how guest_memfd uses the allocator in a future revision of this RFC:
To provide an easier transition away from HugeTLB, guest_memfd's use of HugeTLB
should be limited to these allocator functions:
+ reserve(node, page_size, num_pages) => opaque handle
+ Used when a guest_memfd inode is created to reserve memory from backend
allocator
+ allocate(handle, mempolicy, page_size) => folio
+ To allocate a folio from guest_memfd's reservation
+ split(handle, folio, target_page_size) => void
+ To take a huge folio, and split it to smaller folios, restore to filemap
+ reconstruct(handle, first_folio, nr_pages) => void
+ To take a folio, and reconstruct a huge folio out of nr_pages from the
first_folio
+ free(handle, folio) => void
+ To return folio to guest_memfd's reservation
+ error(handle, folio) => void
+ To handle memory errors
+ unreserve(handle) => void
+ To return guest_memfd's reservation to allocator backend
Userspace should only provide a page size when creating a guest_memfd and should
not have to specify HugeTLB.
Overview of patches:
+ Patches 01-12
+ Many small changes to HugeTLB, mostly to separate HugeTLBfs concepts from
HugeTLB, and to expose HugeTLB functions.
+ Patches 13-16
+ Letting guest_memfd use HugeTLB
+ Creation of each guest_memfd reserves pages from HugeTLB's global hstate
and puts it into the guest_memfd inode's subpool
+ Each folio allocation takes a page from the guest_memfd inode's subpool
+ Patches 17-21
+ Selftests for new HugeTLB features in guest_memfd
+ Patches 22-24
+ More small changes on the HugeTLB side to expose functions needed by
guest_memfd
+ Patch 25:
+ Uses the newly available functions from patches 22-24 to split HugeTLB
pages. In this patch, HugeTLB folios are always split to 4K before any
usage, private or shared.
+ Patches 26-28
+ Allow mmap() in guest_memfd and faulting in shared pages
+ Patch 29
+ Enables conversion between private/shared pages
+ Patch 30
+ Required to zero folios after conversions to avoid leaking initialized
kernel memory
+ Patch 31-38
+ Add selftests to test mapping pages to userspace, guest/host memory
sharing and update conversions tests
+ Patch 33 illustrates the conversion flow between VM/userspace/guest_memfd
+ Patch 39
+ Dynamically split and reconstruct HugeTLB pages instead of always
splitting before use. All earlier selftests are expected to still pass.
TODOs:
+ Add logic to wait for safe_refcount [1]
+ Look into lazy splitting/reconstruction of pages
+ Currently, when the KVM_SET_MEMORY_ATTRIBUTES is invoked, not only is the
mem_attr_array and faultability updated, the pages in the requested range
are also split/reconstructed as necessary. We want to look into delaying
splitting/reconstruction to fault time.
+ Solve race between folios being faulted in and being truncated
+ When running private_mem_conversions_test with more than 1 vCPU, a folio
getting truncated may get faulted in by another process, causing elevated
mapcounts when the folio is freed (VM_BUG_ON_FOLIO).
+ Add intermediate splits (1G should first split to 2M and not split directly to
4K)
+ Use guest's lock instead of hugetlb_lock
+ Use multi-index xarray/replace xarray with some other data struct for
faultability flag
+ Refactor HugeTLB better, present generic allocator interface
Please let us know your thoughts on:
+ HugeTLB as the choice of transitional allocator backend
+ Refactoring HugeTLB to provide generic allocator interface
+ Shared/private conversion flow
+ Requiring user to request kernel to unmap pages from userspace using
madvise(MADV_DONTNEED)
+ Failing conversion on elevated mapcounts/pincounts/refcounts
+ Process of splitting/reconstructing page
+ Anything else!
[1] https://lore.kernel.org/all/20240829-guest-memfd-lib-v2-0-b9afc1ff3656@quic…
Ackerley Tng (37):
mm: hugetlb: Simplify logic in dequeue_hugetlb_folio_vma()
mm: hugetlb: Refactor vma_has_reserves() to should_use_hstate_resv()
mm: hugetlb: Remove unnecessary check for avoid_reserve
mm: mempolicy: Refactor out policy_node_nodemask()
mm: hugetlb: Refactor alloc_buddy_hugetlb_folio_with_mpol() to
interpret mempolicy instead of vma
mm: hugetlb: Refactor dequeue_hugetlb_folio_vma() to use mpol
mm: hugetlb: Refactor out hugetlb_alloc_folio
mm: truncate: Expose preparation steps for truncate_inode_pages_final
mm: hugetlb: Expose hugetlb_subpool_{get,put}_pages()
mm: hugetlb: Add option to create new subpool without using surplus
mm: hugetlb: Expose hugetlb_acct_memory()
mm: hugetlb: Move and expose hugetlb_zero_partial_page()
KVM: guest_memfd: Make guest mem use guest mem inodes instead of
anonymous inodes
KVM: guest_memfd: hugetlb: initialization and cleanup
KVM: guest_memfd: hugetlb: allocate and truncate from hugetlb
KVM: guest_memfd: Add page alignment check for hugetlb guest_memfd
KVM: selftests: Add basic selftests for hugetlb-backed guest_memfd
KVM: selftests: Support various types of backing sources for private
memory
KVM: selftests: Update test for various private memory backing source
types
KVM: selftests: Add private_mem_conversions_test.sh
KVM: selftests: Test that guest_memfd usage is reported via hugetlb
mm: hugetlb: Expose vmemmap optimization functions
mm: hugetlb: Expose HugeTLB functions for promoting/demoting pages
mm: hugetlb: Add functions to add/move/remove from hugetlb lists
KVM: guest_memfd: Track faultability within a struct kvm_gmem_private
KVM: guest_memfd: Allow mmapping guest_memfd files
KVM: guest_memfd: Use vm_type to determine default faultability
KVM: Handle conversions in the SET_MEMORY_ATTRIBUTES ioctl
KVM: guest_memfd: Handle folio preparation for guest_memfd mmap
KVM: selftests: Allow vm_set_memory_attributes to be used without
asserting return value of 0
KVM: selftests: Test using guest_memfd memory from userspace
KVM: selftests: Test guest_memfd memory sharing between guest and host
KVM: selftests: Add notes in private_mem_kvm_exits_test for mmap-able
guest_memfd
KVM: selftests: Test that pinned pages block KVM from setting memory
attributes to PRIVATE
KVM: selftests: Refactor vm_mem_add to be more flexible
KVM: selftests: Add helper to perform madvise by memslots
KVM: selftests: Update private_mem_conversions_test for mmap()able
guest_memfd
Vishal Annapurve (2):
KVM: guest_memfd: Split HugeTLB pages for guest_memfd use
KVM: guest_memfd: Dynamically split/reconstruct HugeTLB page
fs/hugetlbfs/inode.c | 35 +-
include/linux/hugetlb.h | 54 +-
include/linux/kvm_host.h | 1 +
include/linux/mempolicy.h | 2 +
include/linux/mm.h | 1 +
include/uapi/linux/kvm.h | 26 +
include/uapi/linux/magic.h | 1 +
mm/hugetlb.c | 346 ++--
mm/hugetlb_vmemmap.h | 11 -
mm/mempolicy.c | 36 +-
mm/truncate.c | 26 +-
tools/include/linux/kernel.h | 4 +-
tools/testing/selftests/kvm/Makefile | 3 +
.../kvm/guest_memfd_hugetlb_reporting_test.c | 222 +++
.../selftests/kvm/guest_memfd_pin_test.c | 104 ++
.../selftests/kvm/guest_memfd_sharing_test.c | 160 ++
.../testing/selftests/kvm/guest_memfd_test.c | 238 ++-
.../testing/selftests/kvm/include/kvm_util.h | 45 +-
.../testing/selftests/kvm/include/test_util.h | 18 +
tools/testing/selftests/kvm/lib/kvm_util.c | 443 +++--
tools/testing/selftests/kvm/lib/test_util.c | 99 ++
.../kvm/x86_64/private_mem_conversions_test.c | 158 +-
.../x86_64/private_mem_conversions_test.sh | 91 +
.../kvm/x86_64/private_mem_kvm_exits_test.c | 11 +-
virt/kvm/guest_memfd.c | 1563 ++++++++++++++++-
virt/kvm/kvm_main.c | 17 +
virt/kvm/kvm_mm.h | 16 +
27 files changed, 3288 insertions(+), 443 deletions(-)
create mode 100644 tools/testing/selftests/kvm/guest_memfd_hugetlb_reporting_test.c
create mode 100644 tools/testing/selftests/kvm/guest_memfd_pin_test.c
create mode 100644 tools/testing/selftests/kvm/guest_memfd_sharing_test.c
create mode 100755 tools/testing/selftests/kvm/x86_64/private_mem_conversions_test.sh
--
2.46.0.598.g6f2099f65c-goog
v14: https://lore.kernel.org/netdev/20250429032645.363766-1-almasrymina@google.c…
===
Picked up acks from Paolo, and addressed feedback from Paolo and Jakub.
Changelog:
- Fix issue in patch 4 where sockc_valid == false but err is
overwritten.
- Addressed nits.
v13: https://lore.kernel.org/netdev/20250425204743.617260-1-almasrymina@google.c…
===
Changelog:
- Fix unneeded error label pointed out by Christoph, and addressed
nitpick.
v12: https://lore.kernel.org/netdev/20250423031117.907681-1-almasrymina@google.c…
====
No changes in v12, just restored the selftests patch I accidentally dropped in
v11
v11: https://lore.kernel.org/netdev/20250423031117.907681-1-almasrymina@google.c…
====
Addressed a couple of nits and collected Acked-by from Harshitha
(thanks!)
v10: https://lore.kernel.org/netdev/20250417231540.2780723-1-almasrymina@google.…
====
Addressed comments following conversations with Pavel, Stan, and
Harshitha. Thank you guys for the reviews again. Overall minor changes:
Changelog:
- Check for !niov->pp in io_zcrx_recv_frag, just in case we end up with
a TX niov in that path (Pavel).
- Fix locking case in !netif_device_present (Jakub/Stan).
v9: https://lore.kernel.org/netdev/20250415224756.152002-1-almasrymina@google.c…
===
Changelog:
- Use priv->bindings list instead of sock_bindings_list. This was missed
during the rebase as the bindings have been updated to use
priv->bindings recently (thanks Stan!)
v8: https://lore.kernel.org/netdev/20250308214045.1160445-1-almasrymina@google.…
===
Only address minor comments on V7
Changelog:
- Use netdev locking instead of rtnl_locking to match rx path.
- Now that iouring zcrx is in net-next, use NET_IOV_IOURING instead of
NET_IOV_UNSPECIFIED.
- Post send binding to net_devmem_dmabuf_bindings after it's been fully
initialized (Stan).
v7: https://lore.kernel.org/netdev/20250227041209.2031104-1-almasrymina@google.…
===
Changelog:
- Check the dmabuf net_iov binding belongs to the device the TX is going
out on. (Jakub)
- Provide detailed inspection of callsites of
__skb_frag_ref/skb_page_unref in patch 2's changelog (Jakub)
v6: https://lore.kernel.org/netdev/20250222191517.743530-1-almasrymina@google.c…
===
v6 has no major changes. Addressed a few issues from Paolo and David,
and collected Acks from Stan. Thank you everyone for the review!
Changes:
- retain behavior to process MSG_FASTOPEN even if the provided cmsg is
invalid (Paolo).
- Rework the freeing of tx_vec slightly (it now has its own err label).
(Paolo).
- Squash the commit that makes dmabuf unbinding scheduled work into the
same one which implements the TX path so we don't run into future
errors on bisecting (Paolo).
- Fix/add comments to explain how dmabuf binding refcounting works
(David).
v5: https://lore.kernel.org/netdev/20250220020914.895431-1-almasrymina@google.c…
===
v5 has no major changes; it clears up the relatively minor issues
pointed out to in v4, and rebases the series on top of net-next to
resolve the conflict with a patch that raced to the tree. It also
collects the review tags from v4.
Changes:
- Rebase to net-next
- Fix issues in selftest (Stan).
- Address comments in the devmem and netmem driver docs (Stan and Bagas)
- Fix zerocopy_fill_skb_from_devmem return error code (Stan).
v4: https://lore.kernel.org/netdev/20250203223916.1064540-1-almasrymina@google.…
===
v4 mainly addresses the critical driver support issue surfaced in v3 by
Paolo and Stan. Drivers aiming to support netmem_tx should make sure not
to pass the netmem dma-addrs to the dma-mapping APIs, as these dma-addrs
may come from dma-bufs.
Additionally other feedback from v3 is addressed.
Major changes:
- Add helpers to handle netmem dma-addrs. Add GVE support for
netmem_tx.
- Fix binding->tx_vec not being freed on error paths during the
tx binding.
- Add a minimal devmem_tx test to devmem.py.
- Clean up everything obsolete from the cover letter (Paolo).
v3: https://patchwork.kernel.org/project/netdevbpf/list/?series=929401&state=*
===
Address minor comments from RFCv2 and fix a few build warnings and
ynl-regen issues. No major changes.
RFC v2: https://patchwork.kernel.org/project/netdevbpf/list/?series=920056&state=*
=======
RFC v2 addresses much of the feedback from RFC v1. I plan on sending
something close to this as net-next reopens, sending it slightly early
to get feedback if any.
Major changes:
--------------
- much improved UAPI as suggested by Stan. We now interpret the iov_base
of the passed in iov from userspace as the offset into the dmabuf to
send from. This removes the need to set iov.iov_base = NULL which may
be confusing to users, and enables us to send multiple iovs in the
same sendmsg() call. ncdevmem and the docs show a sample use of that.
- Removed the duplicate dmabuf iov_iter in binding->iov_iter. I think
this is good improvment as it was confusing to keep track of
2 iterators for the same sendmsg, and mistracking both iterators
caused a couple of bugs reported in the last iteration that are now
resolved with this streamlining.
- Improved test coverage in ncdevmem. Now multiple sendmsg() are tested,
and sending multiple iovs in the same sendmsg() is tested.
- Fixed issue where dmabuf unmapping was happening in invalid context
(Stan).
====================================================================
The TX path had been dropped from the Device Memory TCP patch series
post RFCv1 [1], to make that series slightly easier to review. This
series rebases the implementation of the TX path on top of the
net_iov/netmem framework agreed upon and merged. The motivation for
the feature is thoroughly described in the docs & cover letter of the
original proposal, so I don't repeat the lengthy descriptions here, but
they are available in [1].
Full outline on usage of the TX path is detailed in the documentation
included with this series.
Test example is available via the kselftest included in the series as well.
The series is relatively small, as the TX path for this feature largely
piggybacks on the existing MSG_ZEROCOPY implementation.
Patch Overview:
---------------
1. Documentation & tests to give high level overview of the feature
being added.
1. Add netmem refcounting needed for the TX path.
2. Devmem TX netlink API.
3. Devmem TX net stack implementation.
4. Make dma-buf unbinding scheduled work to handle TX cases where it gets
freed from contexts where we can't sleep.
5. Add devmem TX documentation.
6. Add scaffolding enabling driver support for netmem_tx. Add helpers, driver
feature flag, and docs to enable drivers to declare netmem_tx support.
7. Guard netmem_tx against being enabled against drivers that don't
support it.
8. Add devmem_tx selftests. Add TX path to ncdevmem and add a test to
devmem.py.
Testing:
--------
Testing is very similar to devmem TCP RX path. The ncdevmem test used
for the RX path is now augemented with client functionality to test TX
path.
* Test Setup:
Kernel: net-next with this RFC and memory provider API cherry-picked
locally.
Hardware: Google Cloud A3 VMs.
NIC: GVE with header split & RSS & flow steering support.
Performance results are not included with this version, unfortunately.
I'm having issues running the dma-buf exporter driver against the
upstream kernel on my test setup. The issues are specific to that
dma-buf exporter and do not affect this patch series. I plan to follow
up this series with perf fixes if the tests point to issues once they're
up and running.
Special thanks to Stan who took a stab at rebasing the TX implementation
on top of the netmem/net_iov framework merged. Parts of his proposal [2]
that are reused as-is are forked off into their own patches to give full
credit.
[1] https://lore.kernel.org/netdev/20240909054318.1809580-1-almasrymina@google.…
[2] https://lore.kernel.org/netdev/20240913150913.1280238-2-sdf@fomichev.me/T/#…
Cc: sdf(a)fomichev.me
Cc: asml.silence(a)gmail.com
Cc: dw(a)davidwei.uk
Cc: Jamal Hadi Salim <jhs(a)mojatatu.com>
Cc: Victor Nogueira <victor(a)mojatatu.com>
Cc: Pedro Tammela <pctammela(a)mojatatu.com>
Cc: Samiullah Khawaja <skhawaja(a)google.com>
Cc: Kuniyuki Iwashima <kuniyu(a)amazon.com>
Mina Almasry (8):
netmem: add niov->type attribute to distinguish different net_iov
types
net: add get_netmem/put_netmem support
net: devmem: Implement TX path
net: add devmem TCP TX documentation
net: enable driver support for netmem TX
gve: add netmem TX support to GVE DQO-RDA mode
net: check for driver support in netmem TX
selftests: ncdevmem: Implement devmem TCP TX
Stanislav Fomichev (1):
net: devmem: TCP tx netlink api
Documentation/netlink/specs/netdev.yaml | 12 +
Documentation/networking/devmem.rst | 150 ++++++++-
.../networking/net_cachelines/net_device.rst | 1 +
Documentation/networking/netdev-features.rst | 5 +
Documentation/networking/netmem.rst | 23 +-
drivers/net/ethernet/google/gve/gve_main.c | 3 +
drivers/net/ethernet/google/gve/gve_tx_dqo.c | 8 +-
include/linux/netdevice.h | 2 +
include/linux/skbuff.h | 17 +-
include/linux/skbuff_ref.h | 4 +-
include/net/netmem.h | 34 +-
include/net/sock.h | 1 +
include/uapi/linux/netdev.h | 1 +
io_uring/zcrx.c | 3 +-
net/core/datagram.c | 48 ++-
net/core/dev.c | 34 +-
net/core/devmem.c | 131 ++++++--
net/core/devmem.h | 83 ++++-
net/core/netdev-genl-gen.c | 13 +
net/core/netdev-genl-gen.h | 1 +
net/core/netdev-genl.c | 80 ++++-
net/core/skbuff.c | 48 ++-
net/core/sock.c | 5 +
net/ipv4/ip_output.c | 3 +-
net/ipv4/tcp.c | 48 ++-
net/ipv6/ip6_output.c | 3 +-
net/vmw_vsock/virtio_transport_common.c | 5 +-
tools/include/uapi/linux/netdev.h | 1 +
.../selftests/drivers/net/hw/devmem.py | 26 +-
.../selftests/drivers/net/hw/ncdevmem.c | 300 +++++++++++++++++-
30 files changed, 1007 insertions(+), 86 deletions(-)
base-commit: 3e52667a9c328b3d1a1ddbbb6b8fbf63a217bda3
--
2.49.0.987.g0cc8ee98dc-goog
The kunit test that checks the longests symbol length [1], has triggered
warnings in some pilelines when symbol prefixes are used [2]. The test
is adjunsted to depend on !PREFIX_SYMBOLS and !CFI_CLANG as sujested in [3]
[1] https://lore.kernel.org/rust-for-linux/CABVgOSm=5Q0fM6neBhxSbOUHBgNzmwf2V22…
[2] https://lore.kernel.org/all/20250328112156.2614513-1-arnd@kernel.org/T/#u
[3] https://lore.kernel.org/linux-kselftest/20250427200916.GA1661412@ax162/T/#t
Signed-off-by: Sergio González Collado <sergio.collado(a)gmail.com>
---
V1 -> V2: added dependency on !CFI_CLANG as suggested in [3], removed
CONFIG_ prefix
---
lib/Kconfig.debug | 2 +-
lib/tests/longest_symbol_kunit.c | 3 +--
2 files changed, 2 insertions(+), 3 deletions(-)
diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index f9051ab610d5..5b33673d82da 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -2885,7 +2885,7 @@ config FORTIFY_KUNIT_TEST
config LONGEST_SYM_KUNIT_TEST
tristate "Test the longest symbol possible" if !KUNIT_ALL_TESTS
- depends on KUNIT && KPROBES
+ depends on KUNIT && KPROBES && !PREFIX_SYMBOLS && !CFI_CLANG
default KUNIT_ALL_TESTS
help
Tests the longest symbol possible
diff --git a/lib/tests/longest_symbol_kunit.c b/lib/tests/longest_symbol_kunit.c
index e3c28ff1807f..b183fb92d1b2 100644
--- a/lib/tests/longest_symbol_kunit.c
+++ b/lib/tests/longest_symbol_kunit.c
@@ -3,8 +3,7 @@
* Test the longest symbol length. Execute with:
* ./tools/testing/kunit/kunit.py run longest-symbol
* --arch=x86_64 --kconfig_add CONFIG_KPROBES=y --kconfig_add CONFIG_MODULES=y
- * --kconfig_add CONFIG_RETPOLINE=n --kconfig_add CONFIG_CFI_CLANG=n
- * --kconfig_add CONFIG_MITIGATION_RETPOLINE=n
+ * --kconfig_add CONFIG_CPU_MITIGATIONS=n
*/
#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
base-commit: ebd297a2affadb6f6f4d2e5d975c1eda18ac762d
--
2.39.2
Until CONFIG_DMABUF_SYSFS_STATS was added [1] it was only possible to
perform per-buffer accounting with debugfs which is not suitable for
production environments. Eventually we discovered the overhead with
per-buffer sysfs file creation/removal was significantly impacting
allocation and free times, and exacerbated kernfs lock contention. [2]
dma_buf_stats_setup() is responsible for 39% of single-page buffer
creation duration, or 74% of single-page dma_buf_export() duration when
stressing dmabuf allocations and frees.
I prototyped a change from per-buffer to per-exporter statistics with a
RCU protected list of exporter allocations that accommodates most (but
not all) of our use-cases and avoids almost all of the sysfs overhead.
While that adds less overhead than per-buffer sysfs, and less even than
the maintenance of the dmabuf debugfs_list, it's still *additional*
overhead on top of the debugfs_list and doesn't give us per-buffer info.
This series uses the existing dmabuf debugfs_list to implement a BPF
dmabuf iterator, which adds no overhead to buffer allocation/free and
provides per-buffer info. The list has been moved outside of
CONFIG_DEBUG_FS scope so that it is always populated. The BPF program
loaded by userspace that extracts per-buffer information gets to define
its own interface which avoids the lack of ABI stability with debugfs.
This will allow us to replace our use of CONFIG_DMABUF_SYSFS_STATS, and
the plan is to remove it from the kernel after the next longterm stable
release.
[1] https://lore.kernel.org/linux-media/20201210044400.1080308-1-hridya@google.…
[2] https://lore.kernel.org/all/20220516171315.2400578-1-tjmercier@google.com
v1: https://lore.kernel.org/all/20250414225227.3642618-1-tjmercier@google.com
v1 -> v2:
Make the DMA buffer list independent of CONFIG_DEBUG_FS per Christian
König
Add CONFIG_DMA_SHARED_BUFFER check to kernel/bpf/Makefile per kernel
test robot
Use BTF_ID_LIST_SINGLE instead of BTF_ID_LIST_GLOBAL_SINGLE per Song Liu
Fixup comment style, mixing code/declarations, and use ASSERT_OK_FD in
selftest per Song Liu
Add BPF_ITER_RESCHED feature to bpf_dmabuf_reg_info per Alexei
Starovoitov
Add open-coded iterator and selftest per Alexei Starovoitov
Add a second test buffer from the system dmabuf heap to selftests
Use the BPF program we'll use in production for selftest per Alexei
Starovoitov
https://r.android.com/c/platform/system/bpfprogs/+/3616123/2/dmabufIter.chttps://r.android.com/c/platform/system/memory/libmeminfo/+/3614259/1/libdm…
v2: https://lore.kernel.org/all/20250504224149.1033867-1-tjmercier@google.com
v2 -> v3:
Rebase onto bpf-next/master
Move get_next_dmabuf() into drivers/dma-buf/dma-buf.c, along with the
new get_first_dmabuf(). This avoids having to expose the dmabuf list
and mutex to the rest of the kernel, and keeps the dmabuf mutex
operations near each other in the same file. (Christian König)
Add Christian's RB to dma-buf: Rename debugfs symbols
Drop RFC: dma-buf: Remove DMA-BUF statistics
v3: https://lore.kernel.org/all/20250507001036.2278781-1-tjmercier@google.com
v3 -> v4:
Fix selftest BPF program comment style (not kdoc) per Alexei Starovoitov
Fix dma-buf.c kdoc comment style per Alexei Starovoitov
Rename get_first_dmabuf / get_next_dmabuf to dma_buf_iter_begin /
dma_buf_iter_next per Christian König
Add Christian's RB to bpf: Add dmabuf iterator
v4: https://lore.kernel.org/all/20250508182025.2961555-1-tjmercier@google.com
v4 -> v5:
Add Christian's Acks to all patches
Add Song Liu's Acks
Move BTF_ID_LIST_SINGLE and DEFINE_BPF_ITER_FUNC closer to usage per
Song Liu
Fix open-coded iterator comment style per Song Liu
Move iterator termination check to its own subtest per Song Liu
Rework selftest buffer creation per Song Liu
Fix spacing in sanitize_string per BPF CI
T.J. Mercier (5):
dma-buf: Rename debugfs symbols
bpf: Add dmabuf iterator
bpf: Add open coded dmabuf iterator
selftests/bpf: Add test for dmabuf_iter
selftests/bpf: Add test for open coded dmabuf_iter
drivers/dma-buf/dma-buf.c | 98 +++++--
include/linux/dma-buf.h | 4 +-
kernel/bpf/Makefile | 3 +
kernel/bpf/dmabuf_iter.c | 150 ++++++++++
kernel/bpf/helpers.c | 5 +
.../testing/selftests/bpf/bpf_experimental.h | 5 +
tools/testing/selftests/bpf/config | 3 +
.../selftests/bpf/prog_tests/dmabuf_iter.c | 276 ++++++++++++++++++
.../testing/selftests/bpf/progs/dmabuf_iter.c | 91 ++++++
9 files changed, 613 insertions(+), 22 deletions(-)
create mode 100644 kernel/bpf/dmabuf_iter.c
create mode 100644 tools/testing/selftests/bpf/prog_tests/dmabuf_iter.c
create mode 100644 tools/testing/selftests/bpf/progs/dmabuf_iter.c
base-commit: 43745d11bfd9683abdf08ad7a5cc403d6a9ffd15
--
2.49.0.1045.g170613ef41-goog
The SBI Firmware Feature extension allows the S-mode to request some
specific features (either hardware or software) to be enabled. This
series uses this extension to request misaligned access exception
delegation to S-mode in order to let the kernel handle it. It also adds
support for the KVM FWFT SBI extension based on the misaligned access
handling infrastructure.
FWFT SBI extension is part of the SBI V3.0 specifications [1]. It can be
tested using the qemu provided at [2] which contains the series from
[3]. Upstream kvm-unit-tests can be used inside kvm to tests the correct
delegation of misaligned exceptions. Upstream OpenSBI can be used.
Note: Since SBI V3.0 is not yet ratified, FWFT extension API is split
between interface only and implementation, allowing to pick only the
interface which do not have hard dependencies on SBI.
The tests can be run using the kselftest from series [4].
$ qemu-system-riscv64 \
-cpu rv64,trap-misaligned-access=true,v=true \
-M virt \
-m 1024M \
-bios fw_dynamic.bin \
-kernel Image
...
# ./misaligned
TAP version 13
1..23
# Starting 23 tests from 1 test cases.
# RUN global.gp_load_lh ...
# OK global.gp_load_lh
ok 1 global.gp_load_lh
# RUN global.gp_load_lhu ...
# OK global.gp_load_lhu
ok 2 global.gp_load_lhu
# RUN global.gp_load_lw ...
# OK global.gp_load_lw
ok 3 global.gp_load_lw
# RUN global.gp_load_lwu ...
# OK global.gp_load_lwu
ok 4 global.gp_load_lwu
# RUN global.gp_load_ld ...
# OK global.gp_load_ld
ok 5 global.gp_load_ld
# RUN global.gp_load_c_lw ...
# OK global.gp_load_c_lw
ok 6 global.gp_load_c_lw
# RUN global.gp_load_c_ld ...
# OK global.gp_load_c_ld
ok 7 global.gp_load_c_ld
# RUN global.gp_load_c_ldsp ...
# OK global.gp_load_c_ldsp
ok 8 global.gp_load_c_ldsp
# RUN global.gp_load_sh ...
# OK global.gp_load_sh
ok 9 global.gp_load_sh
# RUN global.gp_load_sw ...
# OK global.gp_load_sw
ok 10 global.gp_load_sw
# RUN global.gp_load_sd ...
# OK global.gp_load_sd
ok 11 global.gp_load_sd
# RUN global.gp_load_c_sw ...
# OK global.gp_load_c_sw
ok 12 global.gp_load_c_sw
# RUN global.gp_load_c_sd ...
# OK global.gp_load_c_sd
ok 13 global.gp_load_c_sd
# RUN global.gp_load_c_sdsp ...
# OK global.gp_load_c_sdsp
ok 14 global.gp_load_c_sdsp
# RUN global.fpu_load_flw ...
# OK global.fpu_load_flw
ok 15 global.fpu_load_flw
# RUN global.fpu_load_fld ...
# OK global.fpu_load_fld
ok 16 global.fpu_load_fld
# RUN global.fpu_load_c_fld ...
# OK global.fpu_load_c_fld
ok 17 global.fpu_load_c_fld
# RUN global.fpu_load_c_fldsp ...
# OK global.fpu_load_c_fldsp
ok 18 global.fpu_load_c_fldsp
# RUN global.fpu_store_fsw ...
# OK global.fpu_store_fsw
ok 19 global.fpu_store_fsw
# RUN global.fpu_store_fsd ...
# OK global.fpu_store_fsd
ok 20 global.fpu_store_fsd
# RUN global.fpu_store_c_fsd ...
# OK global.fpu_store_c_fsd
ok 21 global.fpu_store_c_fsd
# RUN global.fpu_store_c_fsdsp ...
# OK global.fpu_store_c_fsdsp
ok 22 global.fpu_store_c_fsdsp
# RUN global.gen_sigbus ...
[12797.988647] misaligned[618]: unhandled signal 7 code 0x1 at 0x0000000000014dc0 in misaligned[4dc0,10000+76000]
[12797.988990] CPU: 0 UID: 0 PID: 618 Comm: misaligned Not tainted 6.13.0-rc6-00008-g4ec4468967c9-dirty #51
[12797.989169] Hardware name: riscv-virtio,qemu (DT)
[12797.989264] epc : 0000000000014dc0 ra : 0000000000014d00 sp : 00007fffe165d100
[12797.989407] gp : 000000000008f6e8 tp : 0000000000095760 t0 : 0000000000000008
[12797.989544] t1 : 00000000000965d8 t2 : 000000000008e830 s0 : 00007fffe165d160
[12797.989692] s1 : 000000000000001a a0 : 0000000000000000 a1 : 0000000000000002
[12797.989831] a2 : 0000000000000000 a3 : 0000000000000000 a4 : ffffffffdeadbeef
[12797.989964] a5 : 000000000008ef61 a6 : 626769735f6e0000 a7 : fffffffffffff000
[12797.990094] s2 : 0000000000000001 s3 : 00007fffe165d838 s4 : 00007fffe165d848
[12797.990238] s5 : 000000000000001a s6 : 0000000000010442 s7 : 0000000000010200
[12797.990391] s8 : 000000000000003a s9 : 0000000000094508 s10: 0000000000000000
[12797.990526] s11: 0000555567460668 t3 : 00007fffe165d070 t4 : 00000000000965d0
[12797.990656] t5 : fefefefefefefeff t6 : 0000000000000073
[12797.990756] status: 0000000200004020 badaddr: 000000000008ef61 cause: 0000000000000006
[12797.990911] Code: 8793 8791 3423 fcf4 3783 fc84 c737 dead 0713 eef7 (c398) 0001
# OK global.gen_sigbus
ok 23 global.gen_sigbus
# PASSED: 23 / 23 tests passed.
# Totals: pass:23 fail:0 xfail:0 xpass:0 skip:0 error:0
With kvm-tools:
# lkvm run -k sbi.flat -m 128
Info: # lkvm run -k sbi.flat -m 128 -c 1 --name guest-97
Info: Removed ghost socket file "/root/.lkvm//guest-97.sock".
##########################################################################
# kvm-unit-tests
##########################################################################
... [test messages elided]
PASS: sbi: fwft: FWFT extension probing no error
PASS: sbi: fwft: get/set reserved feature 0x6 error == SBI_ERR_DENIED
PASS: sbi: fwft: get/set reserved feature 0x3fffffff error == SBI_ERR_DENIED
PASS: sbi: fwft: get/set reserved feature 0x80000000 error == SBI_ERR_DENIED
PASS: sbi: fwft: get/set reserved feature 0xbfffffff error == SBI_ERR_DENIED
PASS: sbi: fwft: misaligned_deleg: Get misaligned deleg feature no error
PASS: sbi: fwft: misaligned_deleg: Set misaligned deleg feature invalid value error
PASS: sbi: fwft: misaligned_deleg: Set misaligned deleg feature invalid value error
PASS: sbi: fwft: misaligned_deleg: Set misaligned deleg feature value no error
PASS: sbi: fwft: misaligned_deleg: Set misaligned deleg feature value 0
PASS: sbi: fwft: misaligned_deleg: Set misaligned deleg feature value no error
PASS: sbi: fwft: misaligned_deleg: Set misaligned deleg feature value 1
PASS: sbi: fwft: misaligned_deleg: Verify misaligned load exception trap in supervisor
SUMMARY: 50 tests, 2 unexpected failures, 12 skipped
This series is available at [5].
Link: https://github.com/riscv-non-isa/riscv-sbi-doc/releases/download/vv3.0-rc2/… [1]
Link: https://github.com/rivosinc/qemu/tree/dev/cleger/misaligned [2]
Link: https://lore.kernel.org/all/20241211211933.198792-3-fkonrad@amd.com/T/ [3]
Link: https://lore.kernel.org/linux-riscv/20250414123543.1615478-1-cleger@rivosin… [4]
Link: https://github.com/rivosinc/linux/tree/dev/cleger/fwft [5]
---
V6:
- Rename FWFT interface to remove "_local"
- Fix test for MEDELEG values in KVM FWFT support
- Add __init for unaligned_access_init()
- Rebased on master
V5:
- Return ERANGE as mapping for SBI_ERR_BAD_RANGE
- Removed unused sbi_fwft_get()
- Fix kernel for sbi_fwft_local_set_cpumask()
- Fix indentation for sbi_fwft_local_set()
- Remove spurious space in kvm_sbi_fwft_ops.
- Rebased on origin/master
- Remove fixes commits and sent them as a separate series [4]
V4:
- Check SBI version 3.0 instead of 2.0 for FWFT presence
- Use long for kvm_sbi_fwft operation return value
- Init KVM sbi extension even if default_disabled
- Remove revert_on_fail parameter for sbi_fwft_feature_set().
- Fix comments for sbi_fwft_set/get()
- Only handle local features (there are no globals yet in the spec)
- Add new SBI errors to sbi_err_map_linux_errno()
V3:
- Added comment about kvm sbi fwft supported/set/get callback
requirements
- Move struct kvm_sbi_fwft_feature in kvm_sbi_fwft.c
- Add a FWFT interface
V2:
- Added Kselftest for misaligned testing
- Added get_user() usage instead of __get_user()
- Reenable interrupt when possible in misaligned access handling
- Document that riscv supports unaligned-traps
- Fix KVM extension state when an init function is present
- Rework SBI misaligned accesses trap delegation code
- Added support for CPU hotplugging
- Added KVM SBI reset callback
- Added reset for KVM SBI FWFT lock
- Return SBI_ERR_DENIED_LOCKED when LOCK flag is set
Clément Léger (14):
riscv: sbi: add Firmware Feature (FWFT) SBI extensions definitions
riscv: sbi: remove useless parenthesis
riscv: sbi: add new SBI error mappings
riscv: sbi: add FWFT extension interface
riscv: sbi: add SBI FWFT extension calls
riscv: misaligned: request misaligned exception from SBI
riscv: misaligned: use on_each_cpu() for scalar misaligned access
probing
riscv: misaligned: use correct CONFIG_ ifdef for
misaligned_access_speed
riscv: misaligned: move emulated access uniformity check in a function
riscv: misaligned: add a function to check misalign trap delegability
RISC-V: KVM: add SBI extension init()/deinit() functions
RISC-V: KVM: add SBI extension reset callback
RISC-V: KVM: add support for FWFT SBI extension
RISC-V: KVM: add support for SBI_FWFT_MISALIGNED_DELEG
arch/riscv/include/asm/cpufeature.h | 8 +-
arch/riscv/include/asm/kvm_host.h | 5 +-
arch/riscv/include/asm/kvm_vcpu_sbi.h | 12 +
arch/riscv/include/asm/kvm_vcpu_sbi_fwft.h | 29 +++
arch/riscv/include/asm/sbi.h | 60 +++++
arch/riscv/include/uapi/asm/kvm.h | 1 +
arch/riscv/kernel/sbi.c | 81 ++++++-
arch/riscv/kernel/traps_misaligned.c | 110 ++++++++-
arch/riscv/kernel/unaligned_access_speed.c | 8 +-
arch/riscv/kvm/Makefile | 1 +
arch/riscv/kvm/vcpu.c | 7 +-
arch/riscv/kvm/vcpu_sbi.c | 54 +++++
arch/riscv/kvm/vcpu_sbi_fwft.c | 252 +++++++++++++++++++++
arch/riscv/kvm/vcpu_sbi_sta.c | 3 +-
14 files changed, 613 insertions(+), 18 deletions(-)
create mode 100644 arch/riscv/include/asm/kvm_vcpu_sbi_fwft.h
create mode 100644 arch/riscv/kvm/vcpu_sbi_fwft.c
--
2.49.0
hi,
We can use get_func_[arg|arg_cnt] helpers in fentry/fexit/fmod_ret programs
currently[1]. But they can't be used in raw_tp/tp_btf programs.
Adding support to use get_func_[arg|arg_cnt] helpers in raw_tp/tp_btf
programs.
Adding BPF_PROG_TEST_RUN for tp_btf.
Add selftests to check them.
Thanks,
KaFai
[1] https://lore.kernel.org/bpf/20211208193245.172141-1-jolsa@kernel.org/
---
KaFai Wan (4):
bpf: Allow get_func_[arg|arg_cnt] helpers in raw tracepoint programs
bpf: Enable BPF_PROG_TEST_RUN for tp_btf
selftests/bpf: Add raw_tp_test_run for tp_btf
selftests/bpf: Add tests for get_func_[arg|arg_cnt] helpers in raw
tracepoint programs
kernel/trace/bpf_trace.c | 17 +++++--
net/bpf/test_run.c | 16 +++----
.../bpf/prog_tests/raw_tp_get_func_args.c | 48 +++++++++++++++++++
.../bpf/prog_tests/raw_tp_test_run.c | 18 ++++++-
.../bpf/progs/test_raw_tp_get_func_args.c | 47 ++++++++++++++++++
.../bpf/progs/test_raw_tp_test_run.c | 16 +++++--
6 files changed, 146 insertions(+), 16 deletions(-)
create mode 100644 tools/testing/selftests/bpf/prog_tests/raw_tp_get_func_args.c
create mode 100644 tools/testing/selftests/bpf/progs/test_raw_tp_get_func_args.c
--
2.43.0
This series introduces a new VIOMMU infrastructure and related ioctls.
IOMMUFD has been using the HWPT infrastructure for all cases, including a
nested IO page table support. Yet, there're limitations for an HWPT-based
structure to support some advanced HW-accelerated features, such as CMDQV
on NVIDIA Grace, and HW-accelerated vIOMMU on AMD. Even for a multi-IOMMU
environment, it is not straightforward for nested HWPTs to share the same
parent HWPT (stage-2 IO pagetable), with the HWPT infrastructure alone.
The new VIOMMU object is an additional layer, between the nested HWPT and
its parent HWPT, to give to both the IOMMUFD core and an IOMMU driver an
additional structure to support HW-accelerated feature:
----------------------------
---------------- | | paging_hwpt0 |
| hwpt_nested0 |--->| viommu0 ------------------
---------------- | | HW-accel feats |
----------------------------
On a multi-IOMMU system, the VIOMMU object can be instanced to the number
of vIOMMUs in a guest VM, while holding the same parent HWPT to share the
stage-2 IO pagetable. Each VIOMMU then just need to only allocate its own
VMID to attach the shared stage-2 IO pagetable to the physical IOMMU:
----------------------------
---------------- | | paging_hwpt0 |
| hwpt_nested0 |--->| viommu0 ------------------
---------------- | | VMID0 |
----------------------------
----------------------------
---------------- | | paging_hwpt0 |
| hwpt_nested1 |--->| viommu1 ------------------
---------------- | | VMID1 |
----------------------------
As an initial part-1, add ioctls to support a VIOMMU-based invalidation:
IOMMUFD_CMD_VIOMMU_ALLOC to allocate a VIOMMU object
IOMMUFD_CMD_VIOMMU_SET/UNSET_VDEV_ID to set/clear device's virtual ID
(Resue IOMMUFD_CMD_HWPT_INVALIDATE for a VIOMMU object to flush cache
by a given driver data)
Worth noting that the VDEV_ID is for a per-VIOMMU device list for drivers
to look up the device's physical instance from its virtual ID in a VM. It
is essential for a VIOMMU-based invalidation where the request contains a
device's virtual ID for its device cache flush, e.g. ATC invalidation.
As for the implementation of the series, add an IOMMU_VIOMMU_TYPE_DEFAULT
type for a core-allocated-core-managed VIOMMU object, allowing drivers to
simply hook a default viommu ops for viommu-based invalidation alone. And
provide some viommu helpers to drivers for VDEV_ID translation and parent
domain lookup. Add VIOMMU invalidation support to ARM SMMUv3 driver for a
real world use case. This adds supports of arm-smmuv-v3's CMDQ_OP_ATC_INV
and CMDQ_OP_CFGI_CD/ALL commands, supplementing HWPT-based invalidations.
In the future, drivers will also be able to choose a driver-managed type
to hold its own structure by adding a new type to enum iommu_viommu_type.
More VIOMMU-based structures and ioctls will be introduced in part-2/3 to
support a driver-managed VIOMMU, e.g. VQUEUE object for a HW accelerated
queue, VIRQ (or VEVENT) object for IRQ injections. Although we repurposed
the VIOMMU object from an earlier RFC discussion, for a referece:
https://lore.kernel.org/all/cover.1712978212.git.nicolinc@nvidia.com/
This series is on Github:
https://github.com/nicolinc/iommufd/commits/iommufd_viommu_p1-v2
Paring QEMU branch for testing:
https://github.com/nicolinc/qemu/commits/wip/for_iommufd_viommu_p1-v2
Changelog
v2
* Limited vdev_id to one per idev
* Added a rw_sem to protect the vdev_id list
* Reworked driver-level APIs with proper lockings
* Added a new viommu_api file for IOMMUFD_DRIVER config
* Dropped useless iommu_dev point from the viommu structure
* Added missing index numnbers to new types in the uAPI header
* Dropped IOMMU_VIOMMU_INVALIDATE uAPI; Instead, reuse the HWPT one
* Reworked mock_viommu_cache_invalidate() using the new iommu helper
* Reordered details of set/unset_vdev_id handlers for proper lockings
* Added arm_smmu_cache_invalidate_user patch from Jason's nesting series
v1
https://lore.kernel.org/all/cover.1723061377.git.nicolinc@nvidia.com/
Thanks!
Nicolin
Jason Gunthorpe (3):
iommu: Add iommu_copy_struct_from_full_user_array helper
iommu/arm-smmu-v3: Allow ATS for IOMMU_DOMAIN_NESTED
iommu/arm-smmu-v3: Update comments about ATS and bypass
Nicolin Chen (16):
iommufd: Reorder struct forward declarations
iommufd/viommu: Add IOMMUFD_OBJ_VIOMMU and IOMMU_VIOMMU_ALLOC ioctl
iommu: Pass in a viommu pointer to domain_alloc_user op
iommufd: Allow pt_id to carry viommu_id for IOMMU_HWPT_ALLOC
iommufd/selftest: Add IOMMU_VIOMMU_ALLOC test coverage
iommufd/viommu: Add IOMMU_VIOMMU_SET/UNSET_VDEV_ID ioctl
iommufd/selftest: Add IOMMU_VIOMMU_SET/UNSET_VDEV_ID test coverage
iommufd/viommu: Add cache_invalidate for IOMMU_VIOMMU_TYPE_DEFAULT
iommufd: Allow hwpt_id to carry viommu_id for IOMMU_HWPT_INVALIDATE
iommufd/viommu: Add vdev_id helpers for IOMMU drivers
iommufd/selftest: Add mock_viommu_invalidate_user op
iommufd/selftest: Add IOMMU_TEST_OP_DEV_CHECK_CACHE test command
iommufd/selftest: Add VIOMMU coverage for IOMMU_HWPT_INVALIDATE ioctl
iommufd/viommu: Add iommufd_viommu_to_parent_domain helper
iommu/arm-smmu-v3: Add arm_smmu_cache_invalidate_user
iommu/arm-smmu-v3: Add arm_smmu_viommu_cache_invalidate
drivers/iommu/amd/iommu.c | 1 +
drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 218 ++++++++++++++-
drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h | 3 +
drivers/iommu/intel/iommu.c | 1 +
drivers/iommu/iommufd/Makefile | 5 +-
drivers/iommu/iommufd/device.c | 12 +
drivers/iommu/iommufd/hw_pagetable.c | 59 +++-
drivers/iommu/iommufd/iommufd_private.h | 37 +++
drivers/iommu/iommufd/iommufd_test.h | 30 ++
drivers/iommu/iommufd/main.c | 12 +
drivers/iommu/iommufd/selftest.c | 101 ++++++-
drivers/iommu/iommufd/viommu.c | 196 +++++++++++++
drivers/iommu/iommufd/viommu_api.c | 53 ++++
include/linux/iommu.h | 56 +++-
include/linux/iommufd.h | 51 +++-
include/uapi/linux/iommufd.h | 117 +++++++-
tools/testing/selftests/iommu/iommufd.c | 259 +++++++++++++++++-
tools/testing/selftests/iommu/iommufd_utils.h | 126 +++++++++
18 files changed, 1299 insertions(+), 38 deletions(-)
create mode 100644 drivers/iommu/iommufd/viommu.c
create mode 100644 drivers/iommu/iommufd/viommu_api.c
--
2.43.0
This series fixes misaligned access handling when in non interruptible
context by reenabling interrupts when possible. A previous commit
changed raw_copy_from_user() with copy_from_user() which enables page
faulting and thus can sleep. While correct, a warning is now triggered
due to being called in an invalid context (sleeping in
non-interruptible). This series fixes that problem by factorizing
misaligned load/store entry in a single function than reenables
interrupt if the interrupted context had interrupts enabled.
In order for misaligned handling problems to be caught sooner, add a
kselftest for all the currently supported instructions .
Note: these commits were actually part of another larger series for
misaligned request delegation but was split since it isn't directly
required.
V2:
- Use an array of struct to simplify misaligned load/store selection
- Revert use of irqentry_enter()/exit() to irqentry_nmi_enter() for
kernel space.
Clément Léger (5):
riscv: misaligned: factorize trap handling
riscv: misaligned: enable IRQs while handling misaligned accesses
riscv: misaligned: use get_user() instead of __get_user()
Documentation/sysctl: add riscv to unaligned-trap supported archs
selftests: riscv: add misaligned access testing
Documentation/admin-guide/sysctl/kernel.rst | 4 +-
arch/riscv/kernel/traps.c | 64 +++--
arch/riscv/kernel/traps_misaligned.c | 2 +-
.../selftests/riscv/misaligned/.gitignore | 1 +
.../selftests/riscv/misaligned/Makefile | 12 +
.../selftests/riscv/misaligned/common.S | 33 +++
.../testing/selftests/riscv/misaligned/fpu.S | 180 +++++++++++++
tools/testing/selftests/riscv/misaligned/gp.S | 103 +++++++
.../selftests/riscv/misaligned/misaligned.c | 254 ++++++++++++++++++
9 files changed, 623 insertions(+), 30 deletions(-)
create mode 100644 tools/testing/selftests/riscv/misaligned/.gitignore
create mode 100644 tools/testing/selftests/riscv/misaligned/Makefile
create mode 100644 tools/testing/selftests/riscv/misaligned/common.S
create mode 100644 tools/testing/selftests/riscv/misaligned/fpu.S
create mode 100644 tools/testing/selftests/riscv/misaligned/gp.S
create mode 100644 tools/testing/selftests/riscv/misaligned/misaligned.c
--
2.49.0
┌────────────┐ ┌───────────────────────────────────┐ ┌────────────────┐
│ │ │ │ │ │
│ │ │ PCI Endpoint │ │ PCI Host │
│ │ │ │ │ │
│ │◄──┤ 1.platform_msi_domain_alloc_irqs()│ │ │
│ │ │ │ │ │
│ MSI ├──►│ 2.write_msi_msg() ├──►├─BAR<n> │
│ Controller │ │ update doorbell register address│ │ │
│ │ │ for BAR │ │ │
│ │ │ │ │ 3. Write BAR<n>│
│ │◄──┼───────────────────────────────────┼───┤ │
│ │ │ │ │ │
│ ├──►│ 4.Irq Handle │ │ │
│ │ │ │ │ │
│ │ │ │ │ │
└────────────┘ └───────────────────────────────────┘ └────────────────┘
This patches based on old https://lore.kernel.org/imx/20221124055036.1630573-1-Frank.Li@nxp.com/
Original patch only target to vntb driver. But actually it is common
method.
This patches add new API to pci-epf-core, so any EP driver can use it.
Previous v2 discussion here.
https://lore.kernel.org/imx/20230911220920.1817033-1-Frank.Li@nxp.com/
Changes in v18:
- pci-ep.yaml: sort property order, fix maxvalue to 0x7ffff for msi-map-mask and
iommu-map-mask
- Link to v17: https://lore.kernel.org/r/20250407-ep-msi-v17-0-633ab45a31d0@nxp.com
Changes in v17:
- move document part to pci-ep.yaml
- Link to v16: https://lore.kernel.org/r/20250404-ep-msi-v16-0-d4919d68c0d0@nxp.com
Changes in v16:
- remove arm64: dts: imx95-19x19-evk: Add PCIe1 endpoint function overlay file
because there are better patches, which under review.
- Add document for pcie-ep msi-map usage
- other change to see each patch's change log
About IMMUTABLE (No change for this part, tglx provide feedback)
> - This IMMUTABLE thing serves no purpose, because you don't randomly
> plug this end-point block on any MSI controller. They come as part
> of an SoC.
"Yes and no. The problem is that the EP implementation is meant to be a
generic library and while GIC-ITS guarantees immutability of the
address/data pair after setup, there are architectures (x86, loongson,
riscv) where the base MSI controller does not and immutability is only
achieved when interrupt remapping is enabled. The latter can be disabled
at boot-time and then the EP implementation becomes a lottery across
affinity changes.
That was my concern about this library implementation and that's why I
asked for a mechanism to ensure that the underlying irqdomain provides a
immutable address/data pair.
So it does not matter for GIC-ITS, but in the larger picture it matters.
Thanks,
tglx
"
So it does not matter for GIC-ITS, but in the larger picture it matters.
- Link to v15: https://lore.kernel.org/r/20250211-ep-msi-v15-0-bcacc1f2b1a9@nxp.com
Changes in v15:
- rebase to v6.14-rc1
- fix build issue find by kernel test robot
- Link to v14: https://lore.kernel.org/r/20250207-ep-msi-v14-0-9671b136f2b8@nxp.com
Changes in v14:
Marc Zyngier raised concerns about adding DOMAIN_BUS_DEVICE_PCI_EP_MSI. As
a result, the approach has been reverted to the v9 method. However, there
are several improvements:
MSI now supports msi-map in addition to msi-parent.
- The struct device: id is used as the endpoint function (EPF) device
identity to map to the stream ID (sideband information).
- The EPC device tree source (DTS) utilizes msi-map to provide such
information.
- The EPF device's of_node is set to the EPC controller’s node. This
approach is commonly used for multi-function device (MFD) platform child
devices, allowing them to inherit properties from the MFD device’s DTS,
such as reset-cells and gpio-cells. This method is well-suited for the
current case, as the EPF is inherently created/binded to the EPC and
should inherit the EPC’s DTS node properties.
Additionally:
Since the basic IMX95 LUT support has already been merged into the
mainline, a DTS and driver increment patch is added to complete the
solution. The patch is rebased onto the latest linux-next tree and
aligned with the new pcitest framework.
- Link to v13: https://lore.kernel.org/r/20241218-ep-msi-v13-0-646e2192dc24@nxp.com
Changes in v13:
- Change to use DOMAIN_BUS_PCI_DEVICE_EP_MSI
- Change request id as func | vfunc << 3
- Remove IRQ_DOMAIN_MSI_IMMUTABLE
Thomas Gleixner:
I hope capture all your points in review comments. If missed, let me know.
- Link to v12: https://lore.kernel.org/r/20241211-ep-msi-v12-0-33d4532fa520@nxp.com
Changes in v12:
- Change to use IRQ_DOMAIN_MSI_IMMUTABLE and add help function
irq_domain_msi_is_immuatble().
- split PCI: endpoint: pci-ep-msi: Add MSI address/data pair mutable check to 3 patches
- Link to v11: https://lore.kernel.org/r/20241209-ep-msi-v11-0-7434fa8397bd@nxp.com
Changes in v11:
- Change to use MSI_FLAG_MSG_IMMUTABLE
- Link to v10: https://lore.kernel.org/r/20241204-ep-msi-v10-0-87c378dbcd6d@nxp.com
Changes in v10:
Thomas Gleixner:
There are big change in pci-ep-msi.c. I am sure if go on the
corrent path. The key improvement is remove only 1 function devices's
limitation.
I use new patch for imutable check, which relative additional
feature compared to base enablement patch.
- Remove patch Add msi_remove_device_irq_domain() in platform_device_msi_free_irqs_all()
- Add new patch irqchip/gic-v3-its: Avoid overwriting msi_prepare callback if provided by msi_domain_info
- Remove only support 1 endpoint function limiation.
- Create one MSI domain for each endpoint function devices.
- Use "msi-map" in pci ep controler node, instead of of msi-parent. first
argument is
(func_no << 8 | vfunc_no)
- Link to v9: https://lore.kernel.org/r/20241203-ep-msi-v9-0-a60dbc3f15dd@nxp.com
Changes in v9
- Add patch platform-msi: Add msi_remove_device_irq_domain() in platform_device_msi_free_irqs_all()
- Remove patch PCI: endpoint: Add pci_epc_get_fn() API for customizable filtering
- Remove API pci_epf_align_inbound_addr_lo_hi
- Move doorbell_alloc in to doorbell_enable function.
- Link to v8: https://lore.kernel.org/r/20241116-ep-msi-v8-0-6f1f68ffd1bb@nxp.com
Changes in v8:
- update helper function name to pci_epf_align_inbound_addr()
- Link to v7: https://lore.kernel.org/r/20241114-ep-msi-v7-0-d4ac7aafbd2c@nxp.com
Changes in v7:
- Add helper function pci_epf_align_addr();
- Link to v6: https://lore.kernel.org/r/20241112-ep-msi-v6-0-45f9722e3c2a@nxp.com
Changes in v6:
- change doorbell_addr to doorbell_offset
- use round_down()
- add Niklas's test by tag
- rebase to pci/endpoint
- Link to v5: https://lore.kernel.org/r/20241108-ep-msi-v5-0-a14951c0d007@nxp.com
Changes in v5:
- Move request_irq to epf test function driver for more flexiable user case
- Add fixed size bar handler
- Some minor improvememtn to see each patches's changelog.
- Link to v4: https://lore.kernel.org/r/20241031-ep-msi-v4-0-717da2d99b28@nxp.com
Changes in v4:
- Remove patch genirq/msi: Add cleanup guard define for msi_lock_descs()/msi_unlock_descs()
- Use new method to avoid compatible problem.
Add new command DOORBELL_ENABLE and DOORBELL_DISABLE.
pcitest -B send DOORBELL_ENABLE first, EP test function driver try to
remap one of BAR_N (except test register bar) to ITS MSI MMIO space. Old
driver don't support new command, so failure return, not side effect.
After test, DOORBELL_DISABLE command send out to recover original map, so
pcitest bar test can pass as normal.
- Other detail change see each patches's change log
- Link to v3: https://lore.kernel.org/r/20241015-ep-msi-v3-0-cedc89a16c1a@nxp.com
Change from v2 to v3
- Fixed manivannan's comments
- Move common part to pci-ep-msi.c and pci-ep-msi.h
- rebase to 6.12-rc1
- use RevID to distingiush old version
mkdir /sys/kernel/config/pci_ep/functions/pci_epf_test/func1
echo 16 > /sys/kernel/config/pci_ep/functions/pci_epf_test/func1/msi_interrupts
echo 0x080c > /sys/kernel/config/pci_ep/functions/pci_epf_test/func1/deviceid
echo 0x1957 > /sys/kernel/config/pci_ep/functions/pci_epf_test/func1/vendorid
echo 1 > /sys/kernel/config/pci_ep/functions/pci_epf_test/func1/revid
^^^^^^ to enable platform msi support.
ln -s /sys/kernel/config/pci_ep/functions/pci_epf_test/func1 /sys/kernel/config/pci_ep/controllers/4c380000.pcie-ep
- use new device ID, which identify support doorbell to avoid broken
compatility.
Enable doorbell support only for PCI_DEVICE_ID_IMX8_DB, while other devices
keep the same behavior as before.
EP side RC with old driver RC with new driver
PCI_DEVICE_ID_IMX8_DB no probe doorbell enabled
Other device ID doorbell disabled* doorbell disabled*
* Behavior remains unchanged.
Change from v1 to v2
- Add missed patch for endpont/pci-epf-test.c
- Move alloc and free to epc driver from epf.
- Provide general help function for EPC driver to alloc platform msi irq.
- Fixed manivannan's comments.
Signed-off-by: Frank Li <Frank.Li(a)nxp.com>
---
Frank Li (15):
platform-msi: Add msi_remove_device_irq_domain() in platform_device_msi_free_irqs_all()
irqdomain: Add IRQ_DOMAIN_FLAG_MSI_IMMUTABLE and irq_domain_is_msi_immutable()
irqchip/gic-v3-its: Set IRQ_DOMAIN_FLAG_MSI_IMMUTABLE for ITS
dt-bindings: PCI: pci-ep: Add support for iommu-map and msi-map
irqchip/gic-v3-its: Add support for device tree msi-map and msi-mask
PCI: endpoint: Set ID and of_node for function driver
PCI: endpoint: Add RC-to-EP doorbell support using platform MSI controller
PCI: endpoint: pci-ep-msi: Add MSI address/data pair mutable check
PCI: endpoint: Add pci_epf_align_inbound_addr() helper for address alignment
PCI: endpoint: pci-epf-test: Add doorbell test support
misc: pci_endpoint_test: Add doorbell test case
selftests: pci_endpoint: Add doorbell test case
pci: imx6: Add helper function imx_pcie_add_lut_by_rid()
pci: imx6: Add LUT setting for MSI/IOMMU in Endpoint mode
arm64: dts: imx95: Add msi-map for pci-ep device
Documentation/devicetree/bindings/pci/pci-ep.yaml | 68 ++++++++++
arch/arm64/boot/dts/freescale/imx95.dtsi | 1 +
drivers/base/platform-msi.c | 1 +
drivers/irqchip/irq-gic-v3-its-msi-parent.c | 8 ++
drivers/irqchip/irq-gic-v3-its.c | 2 +-
drivers/misc/pci_endpoint_test.c | 82 ++++++++++++
drivers/pci/controller/dwc/pci-imx6.c | 25 ++--
drivers/pci/endpoint/Makefile | 1 +
drivers/pci/endpoint/functions/pci-epf-test.c | 142 +++++++++++++++++++++
drivers/pci/endpoint/pci-ep-msi.c | 90 +++++++++++++
drivers/pci/endpoint/pci-epf-core.c | 48 +++++++
include/linux/irqdomain.h | 7 +
include/linux/pci-ep-msi.h | 28 ++++
include/linux/pci-epf.h | 21 +++
include/uapi/linux/pcitest.h | 1 +
.../selftests/pci_endpoint/pci_endpoint_test.c | 28 ++++
16 files changed, 544 insertions(+), 9 deletions(-)
---
base-commit: a4949bd40778aa9beac77c89e4c6a1da52875c8b
change-id: 20241010-ep-msi-8b4cab33b1be
Best regards,
---
Frank Li <Frank.Li(a)nxp.com>
This patch set aims to allow ublk server threads to better balance load
amongst themselves by decoupling server threads from ublk queues/hctxs,
so that multiple threads can service I/Os from a single hctx.
The first patch is the functional change in the driver which switches
from per-queue daemons to per-io daemons and allows for ublk servers to
balance load that is imbalanced among queues. The second patch fixes a
bug in tag allocation (in the sbitmap layer) that was observed while
developing a test for this feature. The next five patches add support in
the selftests ublk server (kublk) for this feature, and add a test which
shows the new feature working as intended. The last patch documents the
new feature.
Signed-off-by: Uday Shankar <ushankar(a)purestorage.com>
---
Changes in v6:
- Add a feature flag for this feature, called UBLK_F_RR_TAGS (Ming Lei)
- Add test for this feature (Ming Lei)
- Add documentation for this feature (Ming Lei)
- Link to v5: https://lore.kernel.org/r/20250416-ublk_task_per_io-v5-0-9261ad7bff20@pures…
Changes in v5:
- Set io->task before ublk_mark_io_ready (Caleb Sander Mateos)
- Set io->task atomically, read it atomically when needed
- Return 0 on success from command-specific helpers in
__ublk_ch_uring_cmd (Caleb Sander Mateos)
- Rename ublk_handle_need_get_data to ublk_get_data (Caleb Sander
Mateos)
- Link to v4: https://lore.kernel.org/r/20250415-ublk_task_per_io-v4-0-54210b91a46f@pures…
Changes in v4:
- Drop "ublk: properly serialize all FETCH_REQs" since Ming is taking it
in another set
- Prevent data races by marking data structures which should be
read-only in the I/O path as const (Ming Lei)
- Link to v3: https://lore.kernel.org/r/20250410-ublk_task_per_io-v3-0-b811e8f4554a@pures…
Changes in v3:
- Check for UBLK_IO_FLAG_ACTIVE on I/O again after taking lock to ensure
that two concurrent FETCH_REQs on the same I/O can't succeed (Caleb
Sander Mateos)
- Link to v2: https://lore.kernel.org/r/20250408-ublk_task_per_io-v2-0-b97877e6fd50@pures…
Changes in v2:
- Remove changes split into other patches
- To ease error handling/synchronization, associate each I/O (instead of
each queue) to the last task that issues a FETCH_REQ against it. Only
that task is allowed to operate on the I/O.
- Link to v1: https://lore.kernel.org/r/20241002224437.3088981-1-ushankar@purestorage.com
---
Uday Shankar (8):
ublk: have a per-io daemon instead of a per-queue daemon
sbitmap: fix off-by-one when wrapping hint
selftests: ublk: kublk: plumb q_id in io_uring user_data
selftests: ublk: kublk: tie sqe allocation to io instead of queue
selftests: ublk: kublk: lift queue initialization out of thread
selftests: ublk: kublk: move per-thread data out of ublk_queue
selftests: ublk: kublk: decouple ublk_queues from ublk server threads
Documentation: ublk: document UBLK_F_RR_TAGS
Documentation/block/ublk.rst | 83 +++++-
drivers/block/ublk_drv.c | 82 ++---
include/uapi/linux/ublk_cmd.h | 8 +
lib/sbitmap.c | 4 +-
tools/testing/selftests/ublk/Makefile | 1 +
tools/testing/selftests/ublk/fault_inject.c | 4 +-
tools/testing/selftests/ublk/file_backed.c | 20 +-
tools/testing/selftests/ublk/kublk.c | 329 ++++++++++++++-------
tools/testing/selftests/ublk/kublk.h | 73 +++--
tools/testing/selftests/ublk/null.c | 12 +-
tools/testing/selftests/ublk/stripe.c | 17 +-
tools/testing/selftests/ublk/test_generic_08.sh | 61 ++++
.../selftests/ublk/trace/count_ios_per_tid.bt | 9 +
13 files changed, 488 insertions(+), 215 deletions(-)
---
base-commit: 037af793557ed192b2c10cf2379ac97abacedf55
change-id: 20250408-ublk_task_per_io-c693cf608d7a
Best regards,
--
Uday Shankar <ushankar(a)purestorage.com>
Some small fixes for arch_timer_edge_cases that I stumbled upon
while debugging failures for this selftest on ampere-one.
Sebastian Ott (3):
KVM: arm64: selftests: fix help text for arch_timer_edge_cases
KVM: arm64: selftests: fix thread migration in arch_timer_edge_cases
KVM: arm64: selftests: arch_timer_edge_cases - workaround for
AC03_CPU_14
.../selftests/kvm/arm64/arch_timer_edge_cases.c | 12 +++++++-----
1 file changed, 7 insertions(+), 5 deletions(-)
base-commit: 9c69f88849045499e8ad114e5e13dbb3c85f4443
--
2.49.0
Until CONFIG_DMABUF_SYSFS_STATS was added [1] it was only possible to
perform per-buffer accounting with debugfs which is not suitable for
production environments. Eventually we discovered the overhead with
per-buffer sysfs file creation/removal was significantly impacting
allocation and free times, and exacerbated kernfs lock contention. [2]
dma_buf_stats_setup() is responsible for 39% of single-page buffer
creation duration, or 74% of single-page dma_buf_export() duration when
stressing dmabuf allocations and frees.
I prototyped a change from per-buffer to per-exporter statistics with a
RCU protected list of exporter allocations that accommodates most (but
not all) of our use-cases and avoids almost all of the sysfs overhead.
While that adds less overhead than per-buffer sysfs, and less even than
the maintenance of the dmabuf debugfs_list, it's still *additional*
overhead on top of the debugfs_list and doesn't give us per-buffer info.
This series uses the existing dmabuf debugfs_list to implement a BPF
dmabuf iterator, which adds no overhead to buffer allocation/free and
provides per-buffer info. The list has been moved outside of
CONFIG_DEBUG_FS scope so that it is always populated. The BPF program
loaded by userspace that extracts per-buffer information gets to define
its own interface which avoids the lack of ABI stability with debugfs.
This will allow us to replace our use of CONFIG_DMABUF_SYSFS_STATS, and
the plan is to remove it from the kernel after the next longterm stable
release.
[1] https://lore.kernel.org/linux-media/20201210044400.1080308-1-hridya@google.…
[2] https://lore.kernel.org/all/20220516171315.2400578-1-tjmercier@google.com
v1: https://lore.kernel.org/all/20250414225227.3642618-1-tjmercier@google.com
v1 -> v2:
Make the DMA buffer list independent of CONFIG_DEBUG_FS per Christian König
Add CONFIG_DMA_SHARED_BUFFER check to kernel/bpf/Makefile per kernel test robot
Use BTF_ID_LIST_SINGLE instead of BTF_ID_LIST_GLOBAL_SINGLE per Song Liu
Fixup comment style, mixing code/declarations, and use ASSERT_OK_FD in selftest per Song Liu
Add BPF_ITER_RESCHED feature to bpf_dmabuf_reg_info per Alexei Starovoitov
Add open-coded iterator and selftest per Alexei Starovoitov
Add a second test buffer from the system dmabuf heap to selftests
Use the BPF program we'll use in production for selftest per Alexei Starovoitov
https://r.android.com/c/platform/system/bpfprogs/+/3616123/2/dmabufIter.chttps://r.android.com/c/platform/system/memory/libmeminfo/+/3614259/1/libdm…
v2: https://lore.kernel.org/all/20250504224149.1033867-1-tjmercier@google.com
v2 -> v3:
Rebase onto bpf-next/master
Move get_next_dmabuf() into drivers/dma-buf/dma-buf.c, along with the
new get_first_dmabuf(). This avoids having to expose the dmabuf list
and mutex to the rest of the kernel, and keeps the dmabuf mutex
operations near each other in the same file. (Christian König)
Add Christian's RB to dma-buf: Rename debugfs symbols
Drop RFC: dma-buf: Remove DMA-BUF statistics
v3: https://lore.kernel.org/all/20250507001036.2278781-1-tjmercier@google.com
v3 -> v4:
Fix selftest BPF program comment style (not kdoc) per Alexei Starovoitov
Fix dma-buf.c kdoc comment style per Alexei Starovoitov
Rename get_first_dmabuf / get_next_dmabuf to dma_buf_iter_begin /
dma_buf_iter_next per Christian König
Add Christian's RB to bpf: Add dmabuf iterator
T.J. Mercier (5):
dma-buf: Rename debugfs symbols
bpf: Add dmabuf iterator
bpf: Add open coded dmabuf iterator
selftests/bpf: Add test for dmabuf_iter
selftests/bpf: Add test for open coded dmabuf_iter
drivers/dma-buf/dma-buf.c | 98 +++++--
include/linux/dma-buf.h | 4 +-
kernel/bpf/Makefile | 3 +
kernel/bpf/dmabuf_iter.c | 149 ++++++++++
kernel/bpf/helpers.c | 5 +
.../testing/selftests/bpf/bpf_experimental.h | 5 +
tools/testing/selftests/bpf/config | 3 +
.../selftests/bpf/prog_tests/dmabuf_iter.c | 258 ++++++++++++++++++
.../testing/selftests/bpf/progs/dmabuf_iter.c | 91 ++++++
9 files changed, 594 insertions(+), 22 deletions(-)
create mode 100644 kernel/bpf/dmabuf_iter.c
create mode 100644 tools/testing/selftests/bpf/prog_tests/dmabuf_iter.c
create mode 100644 tools/testing/selftests/bpf/progs/dmabuf_iter.c
base-commit: 43745d11bfd9683abdf08ad7a5cc403d6a9ffd15
--
2.49.0.1015.ga840276032-goog
From: Chia-Yu Chang <chia-yu.chang(a)nokia-bell-labs.com>
Hello,
Please find DUALPI2 iproute2 patch v8.
v8 (09-May-25)
- Update pkt_sched.h with the one in nex-next
- Fix one typo of comment in pkt_sched.h (ALOK TIWARI <alok.a.tiwari(a)oracle.com>)
- Update manual content in man/man8/tc-dualpi2.8 (ALOK TIWARI <alok.a.tiwari(a)oracle.com>)
- Update tc/q_dualpi2.c to fix missing blank lines and add missing case (ALOK TIWARI <alok.a.tiwari(a)oracle.com>)
v7 (05-May-25)
- Align pkt_sched.h with the v14 version of net-next due to spec modificiaotn in tc.yaml
- Reorganize dualpi2_print_opt() to match the order in tc.yaml
- Remove credit-queue in PRINT_JSON
v6 (26-Apr-25)
- Update JSON file output due to spec modifiocation in tc.yaml of net-next
v5 (25-Mar-25)
- Use matches() to replace current strcmp() (Stephen Hemminger <stephen(a)networkplumber.org>)
- Use general parse_percent() for handling scaled percentage values (Stephen Hemminger <stephen(a)networkplumber.org>)
- Add print function for JSON of dualpi2 stats (Stephen Hemminger <stephen(a)networkplumber.org>)
v4 (16-Mar-25)
- Add min_qlen_step to dualpi2 attribute as the minimum queue length in number of packets in the L-queue to start step amrking.
v3 (21-Feb-25)
- Add memlimit to dualpi2 attribute, and add memory_used, max_memory_used, memory_limit in dualpi2 stats (Dave Taht <dave.taht(a)gmail.com>)
- Update manual to align latest implementation and clarify the queue naming and default unit
- Use common "get_scaled_alpha_beta" and clean print_opt for Dualpi2
v2 (23-Oct-24)
- Rename get_float in dualpi2 to get_float_min_max in utils.c
- Move get_float from iplink_can.c in utils.c (Stephen Hemminger <stephen(a)networkplumber.org>)
- Add print function for JSON of dualpi2 (Stephen Hemminger <stephen(a)networkplumber.org>)
For more details of DualPI2, plesae refer IETF RFC9332
(https://datatracker.ietf.org/doc/html/rfc9332).
Best Regards,
Chia-Yu
Chia-Yu Chang (1):
tc: add dualpi2 scheduler module
bash-completion/tc | 11 +-
include/uapi/linux/pkt_sched.h | 68 +++++
include/utils.h | 2 +
ip/iplink_can.c | 14 -
lib/utils.c | 30 ++
man/man8/tc-dualpi2.8 | 249 +++++++++++++++
tc/Makefile | 1 +
tc/q_dualpi2.c | 534 +++++++++++++++++++++++++++++++++
8 files changed, 894 insertions(+), 15 deletions(-)
create mode 100644 man/man8/tc-dualpi2.8
create mode 100644 tc/q_dualpi2.c
--
2.34.1
There are KUnit tests for the bitfield packing library but these depend
on CONFIG_PACKING which is not enabled by anything in either the
existing KUnit all_tests.config or the base UML config it runs on as
standard. Enable that in all_tests.config to improve coverage.
Signed-off-by: Mark Brown <broonie(a)kernel.org>
---
tools/testing/kunit/configs/all_tests.config | 2 ++
1 file changed, 2 insertions(+)
diff --git a/tools/testing/kunit/configs/all_tests.config b/tools/testing/kunit/configs/all_tests.config
index cdd9782f9646..b0223b7aebde 100644
--- a/tools/testing/kunit/configs/all_tests.config
+++ b/tools/testing/kunit/configs/all_tests.config
@@ -51,3 +51,5 @@ CONFIG_SOUND=y
CONFIG_SND=y
CONFIG_SND_SOC=y
CONFIG_SND_SOC_TOPOLOGY_BUILD=y
+
+CONFIG_PACKING=y
---
base-commit: 8ffd015db85fea3e15a77027fda6c02ced4d2444
change-id: 20250422-kunit-enable-missing-tests-9189ee930cae
Best regards,
--
Mark Brown <broonie(a)kernel.org>
TRACE_syscall.ptrace.negative_ENOSYS and TRACE_syscall.seccomp.negative_ENOSYS
on arm32 are being reported as failures instead of skipping.
The teardown_trace_fixture function sets the test to KSFT_FAIL in case of a
non 0 return value from the tracer process.
Due to _metadata now being shared between the forked processes the tracer is
returning the KSFT_SKIP value set by the tracee which is non 0.
Remove the setting of the _metadata.exit_code in teardown_trace_fixture.
Fixes: 24cf65a62266 ("selftests/harness: Share _metadata between forked processes")
Signed-off-by: Terry Tritton <terry.tritton(a)linaro.org>
---
tools/testing/selftests/seccomp/seccomp_bpf.c | 6 ------
1 file changed, 6 deletions(-)
diff --git a/tools/testing/selftests/seccomp/seccomp_bpf.c b/tools/testing/selftests/seccomp/seccomp_bpf.c
index b2f76a52215a..c43a6f8f8cd5 100644
--- a/tools/testing/selftests/seccomp/seccomp_bpf.c
+++ b/tools/testing/selftests/seccomp/seccomp_bpf.c
@@ -1629,14 +1629,8 @@ void teardown_trace_fixture(struct __test_metadata *_metadata,
{
if (tracer) {
int status;
- /*
- * Extract the exit code from the other process and
- * adopt it for ourselves in case its asserts failed.
- */
ASSERT_EQ(0, kill(tracer, SIGUSR1));
ASSERT_EQ(tracer, waitpid(tracer, &status, 0));
- if (WEXITSTATUS(status))
- _metadata->exit_code = KSFT_FAIL;
}
}
--
2.39.5
From: Ashish Kalra <ashish.kalra(a)amd.com>
During platform init, SNP initialization may fail for several reasons,
such as firmware command failures and incompatible versions. However,
the KVM capability may continue to advertise support for it.
The platform may have SNP enabled but if SNP_INIT fails then SNP is
not supported by KVM.
During KVM module initialization query the SNP platform status to obtain
the SNP initialization state and use it as an additional condition to
determine support for SEV-SNP.
Co-developed-by: Sean Christopherson <seanjc(a)google.com>
Signed-off-by: Sean Christopherson <seanjc(a)google.com>
Co-developed-by: Pratik R. Sampat <prsampat(a)amd.com>
Signed-off-by: Pratik R. Sampat <prsampat(a)amd.com>
Signed-off-by: Ashish Kalra <ashish.kalra(a)amd.com>
---
arch/x86/kvm/svm/sev.c | 43 +++++++++++++++++++++++++++++++++---------
1 file changed, 34 insertions(+), 9 deletions(-)
diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
index ada53f04158c..a6abdb26f877 100644
--- a/arch/x86/kvm/svm/sev.c
+++ b/arch/x86/kvm/svm/sev.c
@@ -2934,6 +2934,32 @@ void __init sev_set_cpu_caps(void)
}
}
+static bool sev_is_snp_initialized(void)
+{
+ struct sev_user_data_snp_status *status;
+ struct sev_data_snp_addr buf;
+ bool initialized = false;
+ void *data;
+ int error;
+
+ data = snp_alloc_firmware_page(GFP_KERNEL_ACCOUNT | __GFP_ZERO);
+ if (!data)
+ return initialized;
+
+ buf.address = __psp_pa(data);
+ if (sev_do_cmd(SEV_CMD_SNP_PLATFORM_STATUS, &buf, &error))
+ goto out;
+
+ status = (struct sev_user_data_snp_status *)data;
+ if (status->state)
+ initialized = true;
+
+out:
+ snp_free_firmware_page(data);
+
+ return initialized;
+}
+
void __init sev_hardware_setup(void)
{
unsigned int eax, ebx, ecx, edx, sev_asid_count, sev_es_asid_count;
@@ -3038,6 +3064,14 @@ void __init sev_hardware_setup(void)
sev_snp_supported = sev_snp_enabled && cc_platform_has(CC_ATTR_HOST_SEV_SNP);
out:
+ if (sev_enabled) {
+ init_args.probe = true;
+ if (sev_platform_init(&init_args))
+ sev_supported = sev_es_supported = sev_snp_supported = false;
+ else
+ sev_snp_supported &= sev_is_snp_initialized();
+ }
+
if (boot_cpu_has(X86_FEATURE_SEV))
pr_info("SEV %s (ASIDs %u - %u)\n",
sev_supported ? min_sev_asid <= max_sev_asid ? "enabled" :
@@ -3064,15 +3098,6 @@ void __init sev_hardware_setup(void)
sev_supported_vmsa_features = 0;
if (sev_es_debug_swap_enabled)
sev_supported_vmsa_features |= SVM_SEV_FEAT_DEBUG_SWAP;
-
- if (!sev_enabled)
- return;
-
- /*
- * Do both SNP and SEV initialization at KVM module load.
- */
- init_args.probe = true;
- sev_platform_init(&init_args);
}
void sev_hardware_unsetup(void)
--
2.34.1
Poll program is a helper to ftracetest, thus make it a
generic file and remove it from being run as a test.
Currently when executing tests using
$ make run_tests
CC poll
TAP version 13
1..2
# timeout set to 0
# selftests: ftrace: poll
# Error: Polling file is not specified
not ok 1 selftests: ftrace: poll # exit=255
Fix this by using TEST_GEN_FILES to build the 'poll' binary as a helper
rather than as a test.
Fixes: 80c3e28528ff ("selftests/tracing: Add hist poll() support test")
Signed-off-by: Ayush Jain <Ayush.jain3(a)amd.com>
---
tools/testing/selftests/ftrace/Makefile | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/tools/testing/selftests/ftrace/Makefile b/tools/testing/selftests/ftrace/Makefile
index 49d96bb16355..7c12263f8260 100644
--- a/tools/testing/selftests/ftrace/Makefile
+++ b/tools/testing/selftests/ftrace/Makefile
@@ -6,6 +6,6 @@ TEST_PROGS := ftracetest-ktap
TEST_FILES := test.d settings
EXTRA_CLEAN := $(OUTPUT)/logs/*
-TEST_GEN_PROGS = poll
+TEST_GEN_FILES := poll
include ../lib.mk
--
2.34.1
The vIOMMU object is designed to represent a slice of an IOMMU HW for its
virtualization features shared with or passed to user space (a VM mostly)
in a way of HW acceleration. This extended the HWPT-based design for more
advanced virtualization feature.
A vQUEUE introduced by this series as a part of the vIOMMU infrastructure
represents a HW accelerated queue/buffer for VM to use exclusively, e.g.
- NVIDIA's Virtual Command Queue
- AMD vIOMMU's Command Buffer, Event Log Buffer, and PPR Log Buffer
each of which is an IOMMU HW feature to directly access the virtual queue
in the guest address space, to avoid VM Exits to improve the performance.
As an initial use case, it adds support for guest-owned HW virtual queues
that VMM can allocate per request from a guest OS writing the VM register.
Introduce IOMMUFD_OBJ_VQUEUE and its allocator IOMMUFD_CMD_VQUEUE_ALLOC,
allowing VMM to forward the IOMMU-specific queue info, such as queue base
address, size, and etc.
Meanwhile, a guest-owned virtual queue needs the kernel (a virtual queue
driver) to control the queue by reading/writing its consumer and producer
indexes, which means the virtual queue HW allows the guest kernel to get
a direct R/W access to those registers. Introduce an mmap infrastructure
to the iommufd core so as to support pass through a piece of MMIO region
from the host physical address space to the guest physical address space.
The VMA info (vm_pgoff/size) used by an mmap must be pre-allocated during
the IOMMUFD_CMD_VQUEUE_ALLOC and returned to the user space as an output
driver-data carried via the IOMMUFD_CMD_VQUEUE_ALLOC. So, this requires a
driver-specific user data support in the vIOMMU allocation flow.
As a real-world use case, this series implements a vQUEUE support to the
tegra241-cmdqv driver for VCMDQs on NVIDIA Grace CPU. In another word, it
is also the Tegra CMDQV series Part-2 (user-space support), reworked from
Previous RFCv1:
https://lore.kernel.org/all/cover.1712978212.git.nicolinc@nvidia.com/
This enables the HW accelerated feature for NVIDIA Grace CPU. Compared to
the standard SMMUv3 operating in the nested translation mode trapping CMDQ
for TLBI and ATC_INV commands, this gives a huge performance improvement:
70% to 90% reductions of invalidation time were measured by various DMA
unmap tests running in a guest OS.
// Unmap latencies from "dma_map_benchmark -g @granule -t @threads",
// by toggling "/sys/kernel/debug/iommu/tegra241_cmdqv/bypass_vcmdq"
@granule | @threads | bypass_vcmdq=1 | bypass_vcmdq=0
4KB 1 35.7 us 5.3 us
16KB 1 41.8 us 6.8 us
64KB 1 68.9 us 9.9 us
128KB 1 109.0 us 12.6 us
256KB 1 187.1 us 18.0 us
4KB 2 96.9 us 6.8 us
16KB 2 97.8 us 7.5 us
64KB 2 151.5 us 10.7 us
128KB 2 257.8 us 12.7 us
256KB 2 443.0 us 17.9 us
This is on Github:
https://github.com/nicolinc/iommufd/commits/iommufd_vqueue-v3
Paring QEMU branch for testing:
https://github.com/nicolinc/qemu/commits/wip/for_iommufd_vqueue-v3
Changelog
v3
* Add Reviewed-by from Baolu, Pranjal, and Alok
* Revise kdocs, uAPI docs, and commit logs
* Rename "vCMDQ" back to "vQUEUE" for AMD cases
* [tegra] Add tegra241_vcmdq_hw_flush_timeout()
* [tegra] Rename vsmmu_alloc to alloc_vintf_user
* [tegra] Use writel for SID replacement registers
* [tegra] Move mmap removal call to vsmmu_destroy op
* [tegra] Fix revert in tegra241_vintf_alloc_lvcmdq_user()
* [iommufd] Replace "& ~PAGE_MASK" with PAGE_ALIGNED()
* [iommufd] Add an object-type "owner" to immap structure
* [iommufd] Drop the ictx input in the new for-driver APIs
* [iommufd] Add iommufd_vma_ops to keep track of mmap lifecycle
* [iommufd] Add viommu-based iommufd_viommu_alloc/destroy_mmap helpers
* [iommufd] Rename iommufd_ctx_alloc/free_mmap to
_iommufd_alloc/destroy_mmap
v2
https://lore.kernel.org/all/cover.1745646960.git.nicolinc@nvidia.com/
* Add Reviewed-by from Jason
* [smmu] Fix vsmmu initial value
* [smmu] Support impl for hw_info
* [tegra] Rename "slot" to "vsid"
* [tegra] Update kdocs and commit logs
* [tegra] Map/unmap LVCMDQ dynamically
* [tegra] Refcount the previous LVCMDQ
* [tegra] Return -EEXIST if LVCMDQ exists
* [tegra] Simplify VINTF cleanup routine
* [tegra] Use vmid and s2_domain in vsmmu
* [tegra] Rename "mmap_pgoff" to "immap_id"
* [tegra] Add more addr and length validation
* [iommufd] Add more narrative to mmap's kdoc
* [iommufd] Add iommufd_struct_depend/undepend()
* [iommufd] Rename vcmdq_free op to vcmdq_destroy
* [iommufd] Fix bug in iommu_copy_struct_to_user()
* [iommufd] Drop is_io from iommufd_ctx_alloc_mmap()
* [iommufd] Test the queue memory for its contiguity
* [iommufd] Return -ENXIO if address or length fails
* [iommufd] Do not change @min_last in mock_viommu_alloc()
* [iommufd] Generalize TEGRA241_VCMDQ data in core structure
* [iommufd] Add selftest coverage for IOMMUFD_CMD_VCMDQ_ALLOC
* [iommufd] Add iopt_pin_pages() to prevent queue memory from unmapping
v1
https://lore.kernel.org/all/cover.1744353300.git.nicolinc@nvidia.com/
Thanks
Nicolin
Nicolin Chen (23):
iommufd/viommu: Add driver-allocated vDEVICE support
iommu: Pass in a driver-level user data structure to viommu_alloc op
iommufd/viommu: Allow driver-specific user data for a vIOMMU object
iommu: Add iommu_copy_struct_to_user helper
iommufd/driver: Let iommufd_viommu_alloc helper save ictx to
viommu->ictx
iommufd/driver: Add iommufd_struct_destroy to revert
iommufd_viommu_alloc
iommufd/selftest: Support user_data in mock_viommu_alloc
iommufd/selftest: Add covearge for viommu data
iommufd: Abstract iopt_pin_pages and iopt_unpin_pages helpers
iommufd/viommu: Introduce IOMMUFD_OBJ_VQUEUE and its related struct
iommufd/viommu: Add IOMMUFD_CMD_VQUEUE_ALLOC ioctl
iommufd/driver: Add iommufd_vqueue_depend/undepend() helpers
iommufd/selftest: Add coverage for IOMMUFD_CMD_VQUEUE_ALLOC
iommufd: Add mmap interface
iommufd/selftest: Add coverage for the new mmap interface
Documentation: userspace-api: iommufd: Update vQUEUE
iommu/arm-smmu-v3-iommufd: Add vsmmu_alloc impl op
iommu/arm-smmu-v3-iommufd: Support implementation-defined hw_info
iommu/tegra241-cmdqv: Use request_threaded_irq
iommu/tegra241-cmdqv: Simplify deinit flow in
tegra241_cmdqv_remove_vintf()
iommu/tegra241-cmdqv: Do not statically map LVCMDQs
iommu/tegra241-cmdqv: Add user-space use support
iommu/tegra241-cmdqv: Add IOMMU_VEVENTQ_TYPE_TEGRA241_CMDQV support
drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h | 25 +-
drivers/iommu/iommufd/io_pagetable.h | 8 +
drivers/iommu/iommufd/iommufd_private.h | 28 +-
drivers/iommu/iommufd/iommufd_test.h | 20 +
include/linux/iommu.h | 43 +-
include/linux/iommufd.h | 184 ++++++-
include/uapi/linux/iommufd.h | 117 ++++-
tools/testing/selftests/iommu/iommufd_utils.h | 52 +-
.../arm/arm-smmu-v3/arm-smmu-v3-iommufd.c | 42 +-
.../iommu/arm/arm-smmu-v3/tegra241-cmdqv.c | 481 +++++++++++++++++-
drivers/iommu/iommufd/device.c | 117 +----
drivers/iommu/iommufd/driver.c | 88 ++++
drivers/iommu/iommufd/io_pagetable.c | 95 ++++
drivers/iommu/iommufd/main.c | 84 ++-
drivers/iommu/iommufd/selftest.c | 126 ++++-
drivers/iommu/iommufd/viommu.c | 116 ++++-
tools/testing/selftests/iommu/iommufd.c | 96 +++-
.../selftests/iommu/iommufd_fail_nth.c | 11 +-
Documentation/userspace-api/iommufd.rst | 15 +
19 files changed, 1555 insertions(+), 193 deletions(-)
--
2.43.0
I'd like to cut down the memory usage of parsing vmlinux BTF in ebpf-go.
With some upcoming changes the library is sitting at 5MiB for a parse.
Most of that memory is simply copying the BTF blob into user space.
By allowing vmlinux BTF to be mmapped read-only into user space I can
cut memory usage by about 75%.
Signed-off-by: Lorenz Bauer <lmb(a)isovalent.com>
---
Changes in v3:
- Remove slightly confusing calculation of trailing (Alexei)
- Use vm_insert_page (Alexei)
- Simplified libbpf code
- Link to v2: https://lore.kernel.org/r/20250502-vmlinux-mmap-v2-0-95c271434519@isovalent…
Changes in v2:
- Use btf__new in selftest
- Avoid vm_iomap_memory in btf_vmlinux_mmap
- Add VM_DONTDUMP
- Add support to libbpf
- Link to v1: https://lore.kernel.org/r/20250501-vmlinux-mmap-v1-0-aa2724572598@isovalent…
---
Lorenz Bauer (3):
btf: allow mmap of vmlinux btf
selftests: bpf: add a test for mmapable vmlinux BTF
libbpf: Use mmap to parse vmlinux BTF from sysfs
include/asm-generic/vmlinux.lds.h | 3 +-
kernel/bpf/sysfs_btf.c | 37 ++++++++++
tools/lib/bpf/btf.c | 83 +++++++++++++++++++---
tools/testing/selftests/bpf/prog_tests/btf_sysfs.c | 83 ++++++++++++++++++++++
4 files changed, 194 insertions(+), 12 deletions(-)
---
base-commit: 38d976c32d85ef12dcd2b8a231196f7049548477
change-id: 20250501-vmlinux-mmap-2ec5563c3ef1
Best regards,
--
Lorenz Bauer <lmb(a)isovalent.com>
Revert fbbf93556f0c ("selftests: nic_performance: Add selftest for performance of NIC driver")
Revert c087dc54394b ("selftests: nic_link_layer: Add selftest case for speed and duplex states")
Revert 6116075e18f7 ("selftests: nic_link_layer: Add link layer selftest for NIC driver")
These tests don't clean up after themselves, don't use the disruptive
annotations, don't get included in make install etc. etc. The tests
were added before we have any "HW" runner, so the issues were missed.
Our CI doesn't have any way of excluding broken tests, remove these
for now to stop the random pollution of results due to broken env.
We can always add them back once / if fixed.
Signed-off-by: Jakub Kicinski <kuba(a)kernel.org>
---
CC: shuah(a)kernel.org
CC: willemb(a)google.com
CC: sdf(a)fomichev.me
CC: mohan.prasad(a)microchip.com
CC: dw(a)davidwei.uk
CC: petrm(a)nvidia.com
CC: linux-kselftest(a)vger.kernel.org
---
.../testing/selftests/drivers/net/hw/Makefile | 2 -
.../drivers/net/hw/lib/py/__init__.py | 1 -
.../drivers/net/hw/lib/py/linkconfig.py | 222 ------------------
.../drivers/net/hw/nic_link_layer.py | 113 ---------
.../drivers/net/hw/nic_performance.py | 137 -----------
.../selftests/drivers/net/lib/py/load.py | 20 +-
6 files changed, 1 insertion(+), 494 deletions(-)
delete mode 100644 tools/testing/selftests/drivers/net/hw/lib/py/linkconfig.py
delete mode 100644 tools/testing/selftests/drivers/net/hw/nic_link_layer.py
delete mode 100644 tools/testing/selftests/drivers/net/hw/nic_performance.py
diff --git a/tools/testing/selftests/drivers/net/hw/Makefile b/tools/testing/selftests/drivers/net/hw/Makefile
index 5447785c286e..df2c047ffa90 100644
--- a/tools/testing/selftests/drivers/net/hw/Makefile
+++ b/tools/testing/selftests/drivers/net/hw/Makefile
@@ -15,8 +15,6 @@ TEST_PROGS = \
iou-zcrx.py \
irq.py \
loopback.sh \
- nic_link_layer.py \
- nic_performance.py \
pp_alloc_fail.py \
rss_ctx.py \
rss_input_xfrm.py \
diff --git a/tools/testing/selftests/drivers/net/hw/lib/py/__init__.py b/tools/testing/selftests/drivers/net/hw/lib/py/__init__.py
index 399789a9676a..b582885786f5 100644
--- a/tools/testing/selftests/drivers/net/hw/lib/py/__init__.py
+++ b/tools/testing/selftests/drivers/net/hw/lib/py/__init__.py
@@ -9,7 +9,6 @@ KSFT_DIR = (Path(__file__).parent / "../../../../..").resolve()
sys.path.append(KSFT_DIR.as_posix())
from net.lib.py import *
from drivers.net.lib.py import *
- from .linkconfig import LinkConfig
except ModuleNotFoundError as e:
ksft_pr("Failed importing `net` library from kernel sources")
ksft_pr(str(e))
diff --git a/tools/testing/selftests/drivers/net/hw/lib/py/linkconfig.py b/tools/testing/selftests/drivers/net/hw/lib/py/linkconfig.py
deleted file mode 100644
index 79fde603cbbc..000000000000
--- a/tools/testing/selftests/drivers/net/hw/lib/py/linkconfig.py
+++ /dev/null
@@ -1,222 +0,0 @@
-# SPDX-License-Identifier: GPL-2.0
-
-from lib.py import cmd, ethtool, ip
-from lib.py import ksft_pr, ksft_eq, KsftSkipEx
-from typing import Optional
-import re
-import time
-import json
-
-#The LinkConfig class is implemented to handle the link layer configurations.
-#Required minimum ethtool version is 6.10
-
-class LinkConfig:
- """Class for handling the link layer configurations"""
- def __init__(self, cfg: object) -> None:
- self.cfg = cfg
- self.partner_netif = self.get_partner_netif_name()
-
- """Get the initial link configuration of local interface"""
- self.common_link_modes = self.get_common_link_modes()
-
- def get_partner_netif_name(self) -> Optional[str]:
- partner_netif = None
- try:
- if not self.verify_link_up():
- return None
- """Get partner interface name"""
- partner_json_output = ip("addr show", json=True, host=self.cfg.remote)
- for interface in partner_json_output:
- for addr in interface.get('addr_info', []):
- if addr.get('local') == self.cfg.remote_addr:
- partner_netif = interface['ifname']
- ksft_pr(f"Partner Interface name: {partner_netif}")
- if partner_netif is None:
- ksft_pr("Unable to get the partner interface name")
- except Exception as e:
- print(f"Unexpected error occurred while getting partner interface name: {e}")
- self.partner_netif = partner_netif
- return partner_netif
-
- def verify_link_up(self) -> bool:
- """Verify whether the local interface link is up"""
- with open(f"/sys/class/net/{self.cfg.ifname}/operstate", "r") as fp:
- link_state = fp.read().strip()
-
- if link_state == "down":
- ksft_pr(f"Link state of interface {self.cfg.ifname} is DOWN")
- return False
- else:
- return True
-
- def reset_interface(self, local: bool = True, remote: bool = True) -> bool:
- ksft_pr("Resetting interfaces in local and remote")
- if remote:
- if self.verify_link_up():
- if self.partner_netif is not None:
- ifname = self.partner_netif
- link_up_cmd = f"ip link set up {ifname}"
- link_down_cmd = f"ip link set down {ifname}"
- reset_cmd = f"{link_down_cmd} && sleep 5 && {link_up_cmd}"
- try:
- cmd(reset_cmd, host=self.cfg.remote)
- except Exception as e:
- ksft_pr(f"Unexpected error occurred while resetting remote: {e}")
- else:
- ksft_pr("Partner interface not available")
- if local:
- ifname = self.cfg.ifname
- link_up_cmd = f"ip link set up {ifname}"
- link_down_cmd = f"ip link set down {ifname}"
- reset_cmd = f"{link_down_cmd} && sleep 5 && {link_up_cmd}"
- try:
- cmd(reset_cmd)
- except Exception as e:
- ksft_pr(f"Unexpected error occurred while resetting local: {e}")
- time.sleep(10)
- if self.verify_link_up() and self.get_ethtool_field("link-detected"):
- ksft_pr("Local and remote interfaces reset to original state")
- return True
- else:
- ksft_pr("Error occurred after resetting interfaces. Link is DOWN.")
- return False
-
- def set_speed_and_duplex(self, speed: str, duplex: str, autoneg: bool = True) -> bool:
- """Set the speed and duplex state for the interface"""
- autoneg_state = "on" if autoneg is True else "off"
- process = None
- try:
- process = ethtool(f"--change {self.cfg.ifname} speed {speed} duplex {duplex} autoneg {autoneg_state}")
- except Exception as e:
- ksft_pr(f"Unexpected error occurred while setting speed/duplex: {e}")
- if process is None or process.ret != 0:
- return False
- else:
- ksft_pr(f"Speed: {speed} Mbps, Duplex: {duplex} set for Interface: {self.cfg.ifname}")
- return True
-
- def verify_speed_and_duplex(self, expected_speed: str, expected_duplex: str) -> bool:
- if not self.verify_link_up():
- return False
- """Verifying the speed and duplex state for the interface"""
- with open(f"/sys/class/net/{self.cfg.ifname}/speed", "r") as fp:
- actual_speed = fp.read().strip()
- with open(f"/sys/class/net/{self.cfg.ifname}/duplex", "r") as fp:
- actual_duplex = fp.read().strip()
-
- ksft_eq(actual_speed, expected_speed)
- ksft_eq(actual_duplex, expected_duplex)
- return True
-
- def set_autonegotiation_state(self, state: str, remote: bool = False) -> bool:
- common_link_modes = self.common_link_modes
- speeds, duplex_modes = self.get_speed_duplex_values(self.common_link_modes)
- speed = speeds[0]
- duplex = duplex_modes[0]
- if not speed or not duplex:
- ksft_pr("No speed or duplex modes found")
- return False
-
- speed_duplex_cmd = f"speed {speed} duplex {duplex}" if state == "off" else ""
- if remote:
- if not self.verify_link_up():
- return False
- """Set the autonegotiation state for the partner"""
- command = f"-s {self.partner_netif} {speed_duplex_cmd} autoneg {state}"
- partner_autoneg_change = None
- """Set autonegotiation state for interface in remote pc"""
- try:
- partner_autoneg_change = ethtool(command, host=self.cfg.remote)
- except Exception as e:
- ksft_pr(f"Unexpected error occurred while changing auto-neg in remote: {e}")
- if partner_autoneg_change is None or partner_autoneg_change.ret != 0:
- ksft_pr(f"Not able to set autoneg parameter for interface {self.partner_netif}.")
- return False
- ksft_pr(f"Autoneg set as {state} for {self.partner_netif}")
- else:
- """Set the autonegotiation state for the interface"""
- try:
- process = ethtool(f"-s {self.cfg.ifname} {speed_duplex_cmd} autoneg {state}")
- if process.ret != 0:
- ksft_pr(f"Not able to set autoneg parameter for interface {self.cfg.ifname}")
- return False
- except Exception as e:
- ksft_pr(f"Unexpected error occurred while changing auto-neg in local: {e}")
- return False
- ksft_pr(f"Autoneg set as {state} for {self.cfg.ifname}")
- return True
-
- def check_autoneg_supported(self, remote: bool = False) -> bool:
- if not remote:
- local_autoneg = self.get_ethtool_field("supports-auto-negotiation")
- if local_autoneg is None:
- ksft_pr(f"Unable to fetch auto-negotiation status for interface {self.cfg.ifname}")
- """Return autoneg status of the local interface"""
- return local_autoneg
- else:
- if not self.verify_link_up():
- raise KsftSkipEx("Link is DOWN")
- """Check remote auto-negotiation support status"""
- partner_autoneg = False
- if self.partner_netif is not None:
- partner_autoneg = self.get_ethtool_field("supports-auto-negotiation", remote=True)
- if partner_autoneg is None:
- ksft_pr(f"Unable to fetch auto-negotiation status for interface {self.partner_netif}")
- return partner_autoneg
-
- def get_common_link_modes(self) -> set[str]:
- common_link_modes = []
- """Populate common link modes"""
- link_modes = self.get_ethtool_field("supported-link-modes")
- partner_link_modes = self.get_ethtool_field("link-partner-advertised-link-modes")
- if link_modes is None:
- raise KsftSkipEx(f"Link modes not available for {self.cfg.ifname}")
- if partner_link_modes is None:
- raise KsftSkipEx(f"Partner link modes not available for {self.cfg.ifname}")
- common_link_modes = set(link_modes) and set(partner_link_modes)
- return common_link_modes
-
- def get_speed_duplex_values(self, link_modes: list[str]) -> tuple[list[str], list[str]]:
- speed = []
- duplex = []
- """Check the link modes"""
- for data in link_modes:
- parts = data.split('/')
- speed_value = re.match(r'\d+', parts[0])
- if speed_value:
- speed.append(speed_value.group())
- else:
- ksft_pr(f"No speed value found for interface {self.ifname}")
- return None, None
- duplex.append(parts[1].lower())
- return speed, duplex
-
- def get_ethtool_field(self, field: str, remote: bool = False) -> Optional[str]:
- process = None
- if not remote:
- """Get the ethtool field value for the local interface"""
- try:
- process = ethtool(self.cfg.ifname, json=True)
- except Exception as e:
- ksft_pr("Required minimum ethtool version is 6.10")
- ksft_pr(f"Unexpected error occurred while getting ethtool field in local: {e}")
- return None
- else:
- if not self.verify_link_up():
- return None
- """Get the ethtool field value for the remote interface"""
- self.cfg.require_cmd("ethtool", remote=True)
- if self.partner_netif is None:
- ksft_pr(f"Partner interface name is unavailable.")
- return None
- try:
- process = ethtool(self.partner_netif, json=True, host=self.cfg.remote)
- except Exception as e:
- ksft_pr("Required minimum ethtool version is 6.10")
- ksft_pr(f"Unexpected error occurred while getting ethtool field in remote: {e}")
- return None
- json_data = process[0]
- """Check if the field exist in the json data"""
- if field not in json_data:
- raise KsftSkipEx(f'Field {field} does not exist in the output of interface {json_data["ifname"]}')
- return json_data[field]
diff --git a/tools/testing/selftests/drivers/net/hw/nic_link_layer.py b/tools/testing/selftests/drivers/net/hw/nic_link_layer.py
deleted file mode 100644
index efd921180532..000000000000
--- a/tools/testing/selftests/drivers/net/hw/nic_link_layer.py
+++ /dev/null
@@ -1,113 +0,0 @@
-#!/usr/bin/env python3
-# SPDX-License-Identifier: GPL-2.0
-
-#Introduction:
-#This file has basic link layer tests for generic NIC drivers.
-#The test comprises of auto-negotiation, speed and duplex checks.
-#
-#Setup:
-#Connect the DUT PC with NIC card to partner pc back via ethernet medium of your choice(RJ45, T1)
-#
-# DUT PC Partner PC
-#┌───────────────────────┐ ┌──────────────────────────┐
-#│ │ │ │
-#│ │ │ │
-#│ ┌───────────┐ │ │
-#│ │DUT NIC │ Eth │ │
-#│ │Interface ─┼─────────────────────────┼─ any eth Interface │
-#│ └───────────┘ │ │
-#│ │ │ │
-#│ │ │ │
-#└───────────────────────┘ └──────────────────────────┘
-#
-#Configurations:
-#Required minimum ethtool version is 6.10 (supports json)
-#Default values:
-#time_delay = 8 #time taken to wait for transitions to happen, in seconds.
-
-import time
-import argparse
-from lib.py import ksft_run, ksft_exit, ksft_pr, ksft_eq
-from lib.py import KsftFailEx, KsftSkipEx
-from lib.py import NetDrvEpEnv
-from lib.py import LinkConfig
-
-def _pre_test_checks(cfg: object, link_config: LinkConfig) -> None:
- if link_config.partner_netif is None:
- KsftSkipEx("Partner interface is not available")
- if not link_config.check_autoneg_supported() or not link_config.check_autoneg_supported(remote=True):
- KsftSkipEx(f"Auto-negotiation not supported for interface {cfg.ifname} or {link_config.partner_netif}")
- if not link_config.verify_link_up():
- raise KsftSkipEx(f"Link state of interface {cfg.ifname} is DOWN")
-
-def verify_autonegotiation(cfg: object, expected_state: str, link_config: LinkConfig) -> None:
- if not link_config.verify_link_up():
- raise KsftSkipEx(f"Link state of interface {cfg.ifname} is DOWN")
- """Verifying the autonegotiation state in partner"""
- partner_autoneg_output = link_config.get_ethtool_field("auto-negotiation", remote=True)
- if partner_autoneg_output is None:
- KsftSkipEx(f"Auto-negotiation state not available for interface {link_config.partner_netif}")
- partner_autoneg_state = "on" if partner_autoneg_output is True else "off"
-
- ksft_eq(partner_autoneg_state, expected_state)
-
- """Verifying the autonegotiation state of local"""
- autoneg_output = link_config.get_ethtool_field("auto-negotiation")
- if autoneg_output is None:
- KsftSkipEx(f"Auto-negotiation state not available for interface {cfg.ifname}")
- actual_state = "on" if autoneg_output is True else "off"
-
- ksft_eq(actual_state, expected_state)
-
- """Verifying the link establishment"""
- link_available = link_config.get_ethtool_field("link-detected")
- if link_available is None:
- KsftSkipEx(f"Link status not available for interface {cfg.ifname}")
- if link_available != True:
- raise KsftSkipEx("Link not established at interface {cfg.ifname} after changing auto-negotiation")
-
-def test_autonegotiation(cfg: object, link_config: LinkConfig, time_delay: int) -> None:
- _pre_test_checks(cfg, link_config)
- for state in ["off", "on"]:
- if not link_config.set_autonegotiation_state(state, remote=True):
- raise KsftSkipEx(f"Unable to set auto-negotiation state for interface {link_config.partner_netif}")
- if not link_config.set_autonegotiation_state(state):
- raise KsftSkipEx(f"Unable to set auto-negotiation state for interface {cfg.ifname}")
- time.sleep(time_delay)
- verify_autonegotiation(cfg, state, link_config)
-
-def test_network_speed(cfg: object, link_config: LinkConfig, time_delay: int) -> None:
- _pre_test_checks(cfg, link_config)
- common_link_modes = link_config.common_link_modes
- if not common_link_modes:
- KsftSkipEx("No common link modes exist")
- speeds, duplex_modes = link_config.get_speed_duplex_values(common_link_modes)
-
- if speeds and duplex_modes and len(speeds) == len(duplex_modes):
- for idx in range(len(speeds)):
- speed = speeds[idx]
- duplex = duplex_modes[idx]
- if not link_config.set_speed_and_duplex(speed, duplex):
- raise KsftFailEx(f"Unable to set speed and duplex parameters for {cfg.ifname}")
- time.sleep(time_delay)
- if not link_config.verify_speed_and_duplex(speed, duplex):
- raise KsftSkipEx(f"Error occurred while verifying speed and duplex states for interface {cfg.ifname}")
- else:
- if not speeds or not duplex_modes:
- KsftSkipEx(f"No supported speeds or duplex modes found for interface {cfg.ifname}")
- else:
- KsftSkipEx("Mismatch in the number of speeds and duplex modes")
-
-def main() -> None:
- parser = argparse.ArgumentParser(description="Run basic link layer tests for NIC driver")
- parser.add_argument('--time-delay', type=int, default=8, help='Time taken to wait for transitions to happen(in seconds). Default is 8 seconds.')
- args = parser.parse_args()
- time_delay = args.time_delay
- with NetDrvEpEnv(__file__, nsim_test=False) as cfg:
- link_config = LinkConfig(cfg)
- ksft_run(globs=globals(), case_pfx={"test_"}, args=(cfg, link_config, time_delay,))
- link_config.reset_interface()
- ksft_exit()
-
-if __name__ == "__main__":
- main()
diff --git a/tools/testing/selftests/drivers/net/hw/nic_performance.py b/tools/testing/selftests/drivers/net/hw/nic_performance.py
deleted file mode 100644
index 201403b76ea3..000000000000
--- a/tools/testing/selftests/drivers/net/hw/nic_performance.py
+++ /dev/null
@@ -1,137 +0,0 @@
-#!/usr/bin/env python3
-# SPDX-License-Identifier: GPL-2.0
-
-#Introduction:
-#This file has basic performance test for generic NIC drivers.
-#The test comprises of throughput check for TCP and UDP streams.
-#
-#Setup:
-#Connect the DUT PC with NIC card to partner pc back via ethernet medium of your choice(RJ45, T1)
-#
-# DUT PC Partner PC
-#┌───────────────────────┐ ┌──────────────────────────┐
-#│ │ │ │
-#│ │ │ │
-#│ ┌───────────┐ │ │
-#│ │DUT NIC │ Eth │ │
-#│ │Interface ─┼─────────────────────────┼─ any eth Interface │
-#│ └───────────┘ │ │
-#│ │ │ │
-#│ │ │ │
-#└───────────────────────┘ └──────────────────────────┘
-#
-#Configurations:
-#To prevent interruptions, Add ethtool, ip to the sudoers list in remote PC and get the ssh key from remote.
-#Required minimum ethtool version is 6.10
-#Change the below configuration based on your hw needs.
-# """Default values"""
-#time_delay = 8 #time taken to wait for transitions to happen, in seconds.
-#test_duration = 10 #performance test duration for the throughput check, in seconds.
-#send_throughput_threshold = 80 #percentage of send throughput required to pass the check
-#receive_throughput_threshold = 50 #percentage of receive throughput required to pass the check
-
-import time
-import json
-import argparse
-from lib.py import ksft_run, ksft_exit, ksft_pr, ksft_true
-from lib.py import KsftFailEx, KsftSkipEx, GenerateTraffic
-from lib.py import NetDrvEpEnv, bkg, wait_port_listen
-from lib.py import cmd
-from lib.py import LinkConfig
-
-class TestConfig:
- def __init__(self, time_delay: int, test_duration: int, send_throughput_threshold: int, receive_throughput_threshold: int) -> None:
- self.time_delay = time_delay
- self.test_duration = test_duration
- self.send_throughput_threshold = send_throughput_threshold
- self.receive_throughput_threshold = receive_throughput_threshold
-
-def _pre_test_checks(cfg: object, link_config: LinkConfig) -> None:
- if not link_config.verify_link_up():
- KsftSkipEx(f"Link state of interface {cfg.ifname} is DOWN")
- common_link_modes = link_config.common_link_modes
- if common_link_modes is None:
- KsftSkipEx("No common link modes found")
- if link_config.partner_netif == None:
- KsftSkipEx("Partner interface is not available")
- if link_config.check_autoneg_supported():
- KsftSkipEx("Auto-negotiation not supported by local")
- if link_config.check_autoneg_supported(remote=True):
- KsftSkipEx("Auto-negotiation not supported by remote")
- cfg.require_cmd("iperf3", remote=True)
-
-def check_throughput(cfg: object, link_config: LinkConfig, test_config: TestConfig, protocol: str, traffic: GenerateTraffic) -> None:
- common_link_modes = link_config.common_link_modes
- speeds, duplex_modes = link_config.get_speed_duplex_values(common_link_modes)
- """Test duration in seconds"""
- duration = test_config.test_duration
-
- ksft_pr(f"{protocol} test")
- test_type = "-u" if protocol == "UDP" else ""
-
- send_throughput = []
- receive_throughput = []
- for idx in range(0, len(speeds)):
- if link_config.set_speed_and_duplex(speeds[idx], duplex_modes[idx]) == False:
- raise KsftFailEx(f"Not able to set speed and duplex parameters for {cfg.ifname}")
- time.sleep(test_config.time_delay)
- if not link_config.verify_link_up():
- raise KsftSkipEx(f"Link state of interface {cfg.ifname} is DOWN")
-
- send_command=f"{test_type} -b 0 -t {duration} --json"
- receive_command=f"{test_type} -b 0 -t {duration} --reverse --json"
-
- send_result = traffic.run_remote_test(cfg, command=send_command)
- if send_result.ret != 0:
- raise KsftSkipEx("Error occurred during data transmit: {send_result.stdout}")
-
- send_output = send_result.stdout
- send_data = json.loads(send_output)
-
- """Convert throughput to Mbps"""
- send_throughput.append(round(send_data['end']['sum_sent']['bits_per_second'] / 1e6, 2))
- ksft_pr(f"{protocol}: Send throughput: {send_throughput[idx]} Mbps")
-
- receive_result = traffic.run_remote_test(cfg, command=receive_command)
- if receive_result.ret != 0:
- raise KsftSkipEx("Error occurred during data receive: {receive_result.stdout}")
-
- receive_output = receive_result.stdout
- receive_data = json.loads(receive_output)
-
- """Convert throughput to Mbps"""
- receive_throughput.append(round(receive_data['end']['sum_received']['bits_per_second'] / 1e6, 2))
- ksft_pr(f"{protocol}: Receive throughput: {receive_throughput[idx]} Mbps")
-
- """Check whether throughput is not below the threshold (default values set at start)"""
- for idx in range(0, len(speeds)):
- send_threshold = float(speeds[idx]) * float(test_config.send_throughput_threshold / 100)
- receive_threshold = float(speeds[idx]) * float(test_config.receive_throughput_threshold / 100)
- ksft_true(send_throughput[idx] >= send_threshold, f"{protocol}: Send throughput is below threshold for {speeds[idx]} Mbps in {duplex_modes[idx]} duplex")
- ksft_true(receive_throughput[idx] >= receive_threshold, f"{protocol}: Receive throughput is below threshold for {speeds[idx]} Mbps in {duplex_modes[idx]} duplex")
-
-def test_tcp_throughput(cfg: object, link_config: LinkConfig, test_config: TestConfig, traffic: GenerateTraffic) -> None:
- _pre_test_checks(cfg, link_config)
- check_throughput(cfg, link_config, test_config, 'TCP', traffic)
-
-def test_udp_throughput(cfg: object, link_config: LinkConfig, test_config: TestConfig, traffic: GenerateTraffic) -> None:
- _pre_test_checks(cfg, link_config)
- check_throughput(cfg, link_config, test_config, 'UDP', traffic)
-
-def main() -> None:
- parser = argparse.ArgumentParser(description="Run basic performance test for NIC driver")
- parser.add_argument('--time-delay', type=int, default=8, help='Time taken to wait for transitions to happen(in seconds). Default is 8 seconds.')
- parser.add_argument('--test-duration', type=int, default=10, help='Performance test duration for the throughput check, in seconds. Default is 10 seconds.')
- parser.add_argument('--stt', type=int, default=80, help='Send throughput Threshold: Percentage of send throughput upon actual throughput required to pass the throughput check (in percentage). Default is 80.')
- parser.add_argument('--rtt', type=int, default=50, help='Receive throughput Threshold: Percentage of receive throughput upon actual throughput required to pass the throughput check (in percentage). Default is 50.')
- args=parser.parse_args()
- test_config = TestConfig(args.time_delay, args.test_duration, args.stt, args.rtt)
- with NetDrvEpEnv(__file__, nsim_test=False) as cfg:
- traffic = GenerateTraffic(cfg)
- link_config = LinkConfig(cfg)
- ksft_run(globs=globals(), case_pfx={"test_"}, args=(cfg, link_config, test_config, traffic, ))
- link_config.reset_interface()
- ksft_exit()
-
-if __name__ == "__main__":
- main()
diff --git a/tools/testing/selftests/drivers/net/lib/py/load.py b/tools/testing/selftests/drivers/net/lib/py/load.py
index da5af2c680fa..d9c10613ae67 100644
--- a/tools/testing/selftests/drivers/net/lib/py/load.py
+++ b/tools/testing/selftests/drivers/net/lib/py/load.py
@@ -2,7 +2,7 @@
import time
-from lib.py import ksft_pr, cmd, ip, rand_port, wait_port_listen, bkg
+from lib.py import ksft_pr, cmd, ip, rand_port, wait_port_listen
class GenerateTraffic:
def __init__(self, env, port=None):
@@ -23,24 +23,6 @@ from lib.py import ksft_pr, cmd, ip, rand_port, wait_port_listen, bkg
self.stop(verbose=True)
raise Exception("iperf3 traffic did not ramp up")
- def run_remote_test(self, env: object, port=None, command=None):
- if port is None:
- port = rand_port()
- try:
- server_cmd = f"iperf3 -s 1 -p {port} --one-off"
- with bkg(server_cmd, host=env.remote):
- #iperf3 opens TCP connection as default in server
- #-u to be specified in client command for UDP
- wait_port_listen(port, host=env.remote)
- except Exception as e:
- raise Exception(f"Unexpected error occurred while running server command: {e}")
- try:
- client_cmd = f"iperf3 -c {env.remote_addr} -p {port} {command}"
- proc = cmd(client_cmd)
- return proc
- except Exception as e:
- raise Exception(f"Unexpected error occurred while running client command: {e}")
-
def _wait_pkts(self, pkt_cnt=None, pps=None):
"""
Wait until we've seen pkt_cnt or until traffic ramps up to pps.
--
2.49.0
Greetings,
We're looking to start doing test development for portions of kernel code "standalone" mocked out and would like to do it in userspace. Are there any existing patch sets we could review or help extend to define this concept? We have checked out David Gow's LPC talk [1] from last year that did point out a few patch series that hinted at userspace kunit.
Regards,
Matt
[1] https://lpc.events/event/18/contributions/1790/attachments/1400/3007/LPC202…
The vIOMMU object is designed to represent a slice of an IOMMU HW for its
virtualization features shared with or passed to user space (a VM mostly)
in a way of HW acceleration. This extended the HWPT-based design for more
advanced virtualization feature.
A vCMDQ introduced by this series as a part of the vIOMMU infrastructure
represents a HW supported queue/buffer for VM to use exclusively, e.g.
- NVIDIA's virtual command queue
- AMD vIOMMU's command buffer
either of which is an IOMMU HW feature to directly load and execute cache
invalidation commands issued by a guest kernel, to shoot down TLB entries
that HW cached for guest-owned stage-1 page table entries. This is a big
improvement since there is no VM Exit during an invalidation, compared to
the traditional invalidation pathway by trapping a guest-own invalidation
queue and forwarding those commands/requests to the host kernel that will
eventually fill a HW-owned queue to execute those commands.
Thus, a vCMDQ object, as an initial use case, is all about a guest-owned
HW command queue that VMM can allocate/configure depending on the request
from a guest kernel. Introduce a new IOMMUFD_OBJ_VCMDQ and its allocator
IOMMUFD_CMD_VCMDQ_ALLOC allowing VMM to forward the IOMMU-specific queue
info, such as queue base address, size, and etc.
Meanwhile, a guest-owned command queue needs the kernel (a command queue
driver) to control the queue by reading/writing its consumer and producer
indexes, which means the command queue HW allows the guest kernel to get
a direct R/W access to those registers. Introduce an mmap infrastructure
to the iommufd core so as to support pass through a piece of MMIO region
from the host physical address space to the guest physical address space.
The VMA info (vm_pgoff/size) used by an mmap must be pre-allocated during
the IOMMUFD_CMD_VCMDQ_ALLOC and given those info to the user space as an
output driver-data by the IOMMUFD_CMD_VCMDQ_ALLOC. So, this requires a
driver-specific user data support by a vIOMMU object.
As a real-world use case, this series implements a vCMDQ support to the
tegra241-cmdqv driver for the vCMDQ on NVIDIA Grace CPU. In another word,
this is also the Tegra CMDQV series Part-2 (user-space support), reworked
from Previous RFCv1:
https://lore.kernel.org/all/cover.1712978212.git.nicolinc@nvidia.com/
This enables the HW accelerated feature for NVIDIA Grace CPU. Compared to
the standard SMMUv3 operating in the nested translation mode trapping CMDQ
for TLBI and ATC_INV commands, this gives a huge performance improvement:
70% to 90% reductions of invalidation time were measured by various DMA
unmap tests running in a guest OS.
This is on Github:
https://github.com/nicolinc/iommufd/commits/iommufd_vcmdq-v2
Paring QEMU branch for testing:
https://github.com/nicolinc/qemu/commits/wip/for_iommufd_vcmdq-v2
Changelog
v2
* Add Reviewed-by from Jason
* [smmu] Fix vsmmu initial value
* [smmu] Support impl for hw_info
* [tegra] Rename "slot" to "vsid"
* [tegra] Update kdocs and commit logs
* [tegra] Map/unmap LVCMDQ dynamically
* [tegra] Refcount the previous LVCMDQ
* [tegra] Return -EEXIST if LVCMDQ exists
* [tegra] Simplify VINTF cleanup routine
* [tegra] Use vmid and s2_domain in vsmmu
* [tegra] Rename "mmap_pgoff" to "immap_id"
* [tegra] Add more addr and length validation
* [iommufd] Add more narrative to mmap's kdoc
* [iommufd] Add iommufd_struct_depend/undepend()
* [iommufd] Rename vcmdq_free op to vcmdq_destroy
* [iommufd] Fix bug in iommu_copy_struct_to_user()
* [iommufd] Drop is_io from iommufd_ctx_alloc_mmap()
* [iommufd] Test the queue memory for its contiguity
* [iommufd] Return -ENXIO if address or length fails
* [iommufd] Do not change @min_last in mock_viommu_alloc()
* [iommufd] Generalize TEGRA241_VCMDQ data in core structure
* [iommufd] Add selftest coverage for IOMMUFD_CMD_VCMDQ_ALLOC
* [iommufd] Add iopt_pin_pages() to prevent queue memory from unmapping
v1
https://lore.kernel.org/all/cover.1744353300.git.nicolinc@nvidia.com/
Thanks
Nicolin
Nicolin Chen (22):
iommufd/viommu: Add driver-allocated vDEVICE support
iommu: Pass in a driver-level user data structure to viommu_alloc op
iommufd/viommu: Allow driver-specific user data for a vIOMMU object
iommu: Add iommu_copy_struct_to_user helper
iommufd: Add iommufd_struct_destroy to revert iommufd_viommu_alloc
iommufd/selftest: Support user_data in mock_viommu_alloc
iommufd/selftest: Add covearge for viommu data
iommufd: Abstract iopt_pin_pages and iopt_unpin_pages helpers
iommufd/viommu: Introduce IOMMUFD_OBJ_VCMDQ and its related struct
iommufd/viommmu: Add IOMMUFD_CMD_VCMDQ_ALLOC ioctl
iommufd: Add for-driver helpers iommufd_vcmdq_depend/undepend()
iommufd/selftest: Add coverage for IOMMUFD_CMD_VCMDQ_ALLOC
iommufd: Add mmap interface
iommufd/selftest: Add coverage for the new mmap interface
Documentation: userspace-api: iommufd: Update vCMDQ
iommu/arm-smmu-v3-iommufd: Add vsmmu_alloc impl op
iommu/arm-smmu-v3-iommufd: Support implementation-defined hw_info
iommu/tegra241-cmdqv: Use request_threaded_irq
iommu/tegra241-cmdqv: Simplify deinit flow in
tegra241_cmdqv_remove_vintf()
iommu/tegra241-cmdqv: Do not statically map LVCMDQs
iommu/tegra241-cmdqv: Add user-space use support
iommu/tegra241-cmdqv: Add IOMMU_VEVENTQ_TYPE_TEGRA241_CMDQV support
drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h | 25 +-
drivers/iommu/iommufd/io_pagetable.h | 8 +
drivers/iommu/iommufd/iommufd_private.h | 25 +-
drivers/iommu/iommufd/iommufd_test.h | 20 +
include/linux/iommu.h | 43 +-
include/linux/iommufd.h | 146 ++++++
include/uapi/linux/iommufd.h | 113 ++++-
tools/testing/selftests/iommu/iommufd_utils.h | 51 +-
.../arm/arm-smmu-v3/arm-smmu-v3-iommufd.c | 42 +-
.../iommu/arm/arm-smmu-v3/tegra241-cmdqv.c | 451 +++++++++++++++++-
drivers/iommu/iommufd/device.c | 117 +----
drivers/iommu/iommufd/driver.c | 81 ++++
drivers/iommu/iommufd/io_pagetable.c | 95 ++++
drivers/iommu/iommufd/main.c | 58 ++-
drivers/iommu/iommufd/selftest.c | 123 ++++-
drivers/iommu/iommufd/viommu.c | 111 ++++-
tools/testing/selftests/iommu/iommufd.c | 93 +++-
.../selftests/iommu/iommufd_fail_nth.c | 11 +-
Documentation/userspace-api/iommufd.rst | 14 +
19 files changed, 1436 insertions(+), 191 deletions(-)
--
2.43.0
If you wish to utilise a pidfd interface to refer to the current process or
thread it is rather cumbersome, requiring something like:
int pidfd = pidfd_open(getpid(), 0 or PIDFD_THREAD);
...
close(pidfd);
Or the equivalent call opening /proc/self. It is more convenient to use a
sentinel value to indicate to an interface that accepts a pidfd that we
simply wish to refer to the current process thread.
This series introduces sentinels for this purposes which can be passed as
the pidfd in this instance rather than having to establish a dummy fd for
this purpose.
It is useful to refer to both the current thread from the userland's
perspective for which we use PIDFD_SELF, and the current process from the
userland's perspective, for which we use PIDFD_SELF_PROCESS.
There is unfortunately some confusion between the kernel and userland as to
what constitutes a process - a thread from the userland perspective is a
process in userland, and a userland process is a thread group (more
specifically the thread group leader from the kernel perspective). We
therefore alias things thusly:
* PIDFD_SELF_THREAD aliased by PIDFD_SELF - use PIDTYPE_PID.
* PIDFD_SELF_THREAD_GROUP alised by PIDFD_SELF_PROCESS - use PIDTYPE_TGID.
In all of the kernel code we refer to PIDFD_SELF_THREAD and
PIDFD_SELF_THREAD_GROUP. However we expect users to use PIDFD_SELF and
PIDFD_SELF_PROCESS.
This matters for cases where, for instance, a user unshare()'s FDs or does
thread-specific signal handling and where the user would be hugely confused
if the FDs referenced or signal processed referred to the thread group
leader rather than the individual thread.
We ensure that pidfd_send_signal() and pidfd_getfd() work correctly, and
assert as much in selftests. All other interfaces except setns() will work
implicitly with this new interface, however it doesn't make sense to test
waitid(P_PIDFD, ...) as waiting on ourselves is a blocking operation.
In the case of setns() we explicitly disallow use of PIDFD_SELF* as it
doesn't make sense to obtain the namespaces of our own process, and it
would require work to implement this functionality there that would be of
no use.
We also do not provide the ability to utilise PIDFD_SELF* in ordinary fd
operations such as open() or poll(), as this would require extensive work
and be of no real use.
v3:
* Do not fput() an invalid fd as reported by kernel test bot.
* Fix unintended churn from moving variable declaration.
v2:
* Fix tests as reported by Shuah.
* Correct RFC version lore link.
https://lore.kernel.org/linux-mm/cover.1728643714.git.lorenzo.stoakes@oracl…
Non-RFC v1:
* Removed RFC tag - there seems to be general consensus that this change is
a good idea, but perhaps some debate to be had on implementation. It
seems sensible then to move forward with the RFC flag removed.
* Introduced PIDFD_SELF_THREAD, PIDFD_SELF_THREAD_GROUP and their aliases
PIDFD_SELF and PIDFD_SELF_PROCESS respectively.
* Updated testing accordingly.
https://lore.kernel.org/linux-mm/cover.1728578231.git.lorenzo.stoakes@oracl…
RFC version:
https://lore.kernel.org/linux-mm/cover.1727644404.git.lorenzo.stoakes@oracl…
Lorenzo Stoakes (3):
pidfd: extend pidfd_get_pid() and de-duplicate pid lookup
pidfd: add PIDFD_SELF_* sentinels to refer to own thread/process
selftests: pidfd: add tests for PIDFD_SELF_*
include/linux/pid.h | 43 +++++-
include/uapi/linux/pidfd.h | 15 ++
kernel/exit.c | 3 +-
kernel/nsproxy.c | 1 +
kernel/pid.c | 73 ++++++---
kernel/signal.c | 26 +---
tools/testing/selftests/pidfd/pidfd.h | 8 +
.../selftests/pidfd/pidfd_getfd_test.c | 141 ++++++++++++++++++
.../selftests/pidfd/pidfd_setns_test.c | 11 ++
tools/testing/selftests/pidfd/pidfd_test.c | 76 ++++++++--
10 files changed, 342 insertions(+), 55 deletions(-)
--
2.46.2
__netdev_update_features() expects the netdevice to be ops-locked, but
it gets called recursively on the lower level netdevices to sync their
features, and nothing locks those.
This commit fixes that, with the assumption that it shouldn't be possible
for both higher-level and lover-level netdevices to require the instance
lock, because that would lead to lock dependency warnings.
Without this, playing with higher level (e.g. vxlan) netdevices on top
of netdevices with instance locking enabled can run into issues:
WARNING: CPU: 59 PID: 206496 at ./include/net/netdev_lock.h:17 netif_napi_add_weight_locked+0x753/0xa60
[...]
Call Trace:
<TASK>
mlx5e_open_channel+0xc09/0x3740 [mlx5_core]
mlx5e_open_channels+0x1f0/0x770 [mlx5_core]
mlx5e_safe_switch_params+0x1b5/0x2e0 [mlx5_core]
set_feature_lro+0x1c2/0x330 [mlx5_core]
mlx5e_handle_feature+0xc8/0x140 [mlx5_core]
mlx5e_set_features+0x233/0x2e0 [mlx5_core]
__netdev_update_features+0x5be/0x1670
__netdev_update_features+0x71f/0x1670
dev_ethtool+0x21c5/0x4aa0
dev_ioctl+0x438/0xae0
sock_ioctl+0x2ba/0x690
__x64_sys_ioctl+0xa78/0x1700
do_syscall_64+0x6d/0x140
entry_SYSCALL_64_after_hwframe+0x4b/0x53
</TASK>
Fixes: 7e4d784f5810 ("net: hold netdev instance lock during rtnetlink operations")
Signed-off-by: Cosmin Ratiu <cratiu(a)nvidia.com>
---
net/core/dev.c | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)
diff --git a/net/core/dev.c b/net/core/dev.c
index 1be7cb73a602..77472364225c 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -10620,8 +10620,11 @@ int __netdev_update_features(struct net_device *dev)
/* some features must be disabled on lower devices when disabled
* on an upper device (think: bonding master or bridge)
*/
- netdev_for_each_lower_dev(dev, lower, iter)
+ netdev_for_each_lower_dev(dev, lower, iter) {
+ netdev_lock_ops(lower);
netdev_sync_lower_features(dev, lower, features);
+ netdev_unlock_ops(lower);
+ }
if (!err) {
netdev_features_t diff = features ^ dev->features;
--
2.45.0
Until CONFIG_DMABUF_SYSFS_STATS was added [1] it was only possible to
perform per-buffer accounting with debugfs which is not suitable for
production environments. Eventually we discovered the overhead with
per-buffer sysfs file creation/removal was significantly impacting
allocation and free times, and exacerbated kernfs lock contention. [2]
dma_buf_stats_setup() is responsible for 39% of single-page buffer
creation duration, or 74% of single-page dma_buf_export() duration when
stressing dmabuf allocations and frees.
I prototyped a change from per-buffer to per-exporter statistics with a
RCU protected list of exporter allocations that accommodates most (but
not all) of our use-cases and avoids almost all of the sysfs overhead.
While that adds less overhead than per-buffer sysfs, and less even than
the maintenance of the dmabuf debugfs_list, it's still *additional*
overhead on top of the debugfs_list and doesn't give us per-buffer info.
This series uses the existing dmabuf debugfs_list to implement a BPF
dmabuf iterator, which adds no overhead to buffer allocation/free and
provides per-buffer info. The list has been moved outside of
CONFIG_DEBUG_FS scope so that it is always populated. The BPF program
loaded by userspace that extracts per-buffer information gets to define
its own interface which avoids the lack of ABI stability with debugfs.
This will allow us to replace our use of CONFIG_DMABUF_SYSFS_STATS, and
the plan is to remove it from the kernel after the next longterm stable
release.
[1] https://lore.kernel.org/linux-media/20201210044400.1080308-1-hridya@google.…
[2] https://lore.kernel.org/all/20220516171315.2400578-1-tjmercier@google.com
v1: https://lore.kernel.org/all/20250414225227.3642618-1-tjmercier@google.com
v1 -> v2:
Make the DMA buffer list independent of CONFIG_DEBUG_FS per Christian König
Add CONFIG_DMA_SHARED_BUFFER check to kernel/bpf/Makefile per kernel test robot
Use BTF_ID_LIST_SINGLE instead of BTF_ID_LIST_GLOBAL_SINGLE per Song Liu
Fixup comment style, mixing code/declarations, and use ASSERT_OK_FD in selftest per Song Liu
Add BPF_ITER_RESCHED feature to bpf_dmabuf_reg_info per Alexei Starovoitov
Add open-coded iterator and selftest per Alexei Starovoitov
Add a second test buffer from the system dmabuf heap to selftests
Use the BPF program we'll use in production for selftest per Alexei Starovoitov
https://r.android.com/c/platform/system/bpfprogs/+/3616123/2/dmabufIter.chttps://r.android.com/c/platform/system/memory/libmeminfo/+/3614259/1/libdm…
v2: https://lore.kernel.org/all/20250504224149.1033867-1-tjmercier@google.com
v2 -> v3:
Rebase onto bpf-next/master
Move get_next_dmabuf() into drivers/dma-buf/dma-buf.c, along with the
new get_first_dmabuf(). This avoids having to expose the dmabuf list
and mutex to the rest of the kernel, and keeps the dmabuf mutex
operations near each other in the same file. (Christian König)
Add Christian's RB to dma-buf: Rename debugfs symbols
Drop RFC: dma-buf: Remove DMA-BUF statistics
T.J. Mercier (5):
dma-buf: Rename debugfs symbols
bpf: Add dmabuf iterator
bpf: Add open coded dmabuf iterator
selftests/bpf: Add test for dmabuf_iter
selftests/bpf: Add test for open coded dmabuf_iter
drivers/dma-buf/dma-buf.c | 94 +++++--
include/linux/dma-buf.h | 5 +-
kernel/bpf/Makefile | 3 +
kernel/bpf/dmabuf_iter.c | 149 ++++++++++
kernel/bpf/helpers.c | 5 +
.../testing/selftests/bpf/bpf_experimental.h | 5 +
tools/testing/selftests/bpf/config | 3 +
.../selftests/bpf/prog_tests/dmabuf_iter.c | 258 ++++++++++++++++++
.../testing/selftests/bpf/progs/dmabuf_iter.c | 91 ++++++
9 files changed, 591 insertions(+), 22 deletions(-)
create mode 100644 kernel/bpf/dmabuf_iter.c
create mode 100644 tools/testing/selftests/bpf/prog_tests/dmabuf_iter.c
create mode 100644 tools/testing/selftests/bpf/progs/dmabuf_iter.c
base-commit: 43745d11bfd9683abdf08ad7a5cc403d6a9ffd15
--
2.49.0.1045.g170613ef41-goog
KVM's guest-memfd memory backend currently lacks support for NUMA policy
enforcement, causing guest memory allocations to be distributed arbitrarily
across host NUMA nodes regardless of the policy specified by the VMM. This
occurs because conventional userspace NUMA control mechanisms like mbind()
are ineffective with guest-memfd, as the memory isn't directly mapped to
userspace when allocations occur.
This patch-series adds NUMA-aware memory placement for guest_memfd backed
KVM guests. Based on community feedback, the approach has evolved as
follows:
- v1,v2: Extended the KVM_CREATE_GUEST_MEMFD IOCTL to pass mempolicy.
- v3: Introduced fbind() syscall for VMM memory-placement configuration.
- v4-v6: Current approach using shared_policy support and vm_ops (based on
suggestions from David[1] and guest_memfd biweekly upstream
calls[2][4]).
- v7: Use inodes to store NUMA policy instead of file[5].
== Implementation ==
This series implements proper NUMA policy support for guest-memfd by:
1. Adding mempolicy-aware allocation APIs to the filemap layer.
2. Add custom inodes (via a dedicated slab-allocated inode cache,
kvm_gmem_inode_info) to store NUMA policy and metadata for guest memory.
3. Implementing get/set_policy vm_ops in guest_memfd to support shared policy.
With these changes, VMMs can now control guest memory placement by
specifying:
- Policy modes: default, bind, interleave, or preferred
- Host NUMA nodes: List of target nodes for memory allocation
Policies only affect future allocations and do not migrate existing memory.
This matches mbind(2)'s default behavior which affects only new allocations
unless overridden with MPOL_MF_MOVE/MPOL_MF_MOVE_ALL flags (Not supported
for guest_memfd as it is unmovable).
This series builds on the existing guest-memfd support in KVM and provides
a clean integration path for NUMA-aware memory management in confidential
computing environments. The work is primarily focused on supporting SEV-SNP
requirements, though the benefits extend to any VMM using the guest-memfd
backend that needs control over guest memory placement.
== Example usage with QEMU (requires patched QEMU from [3]) ==
Snippet of the QEMU changes[3] needed to support this feature:
/* Create and map guest-memfd region */
new_block->guest_memfd = kvm_create_guest_memfd(
new_block->max_length, 0, errp);
...
void *ptr_memfd = mmap(NULL, new_block->max_length,
PROT_READ | PROT_WRITE, MAP_SHARED,
new_block->guest_memfd, 0);
...
/* Apply NUMA policy */
int ret = mbind(ptr_memfd, new_block->max_length,
backend->policy, backend->host_nodes,
maxnode+1, 0);
...
QEMU Command to run SEV-SNP guest with interleaved memory across
nodes 0 and 1 of the host:
$ qemu-system-x86_64 \
-enable-kvm \
...
-machine memory-encryption=sev0,vmport=off \
-object sev-snp-guest,id=sev0,cbitpos=51,reduced-phys-bits=1 \
-numa node,nodeid=0,memdev=ram0,cpus=0-15 \
-object memory-backend-memfd,id=ram0,host-nodes=0-1,policy=interleave,size=1024M,share=true,prealloc=false
== Experiment and Analysis ==
SEV-SNP enabled host, AMD Zen 3, 2 socket 2 NUMA node system
NUMA for Policy Guest Node 0: policy=interleave, host-node=0-1
Test: Allocate and touch 50GB inside guest on node=0.
* Generic Kernel (without NUMA supported guest-memfd):
Node 0 Node 1 Total
Before running Test:
MemUsed 9981.60 3312.00 13293.60
After running Test:
MemUsed 61451.72 3201.62 64653.34
Arbitrary allocations: all ~50GB allocated on node 0.
* With NUMA supported guest-memfd:
Node 0 Node 1 Total
Before running Test:
MemUsed 5003.88 3963.07 8966.94
After running Test:
MemUsed 30607.55 29670.00 60277.55
Balanced memory distribution: Equal increase (~25GB) on both nodes.
== Conclusion ==
Adding the NUMA-aware memory management to guest_memfd will make a lot of
sense. Improving performance of memory-intensive and locality-sensitive
workloads with fine-grained control over guest memory allocations, as
pointed out in the analysis.
Please review and provide feedback!
Thanks,
Shivank
[1] https://lore.kernel.org/all/6fbef654-36e2-4be5-906e-2a648a845278@redhat.com
[2] https://lore.kernel.org/all/6f2bfac2-d9e7-4e4a-9298-7accded16b4f@redhat.com
[3] https://github.com/shivankgarg98/qemu/tree/guest_memfd_mbind_NUMA
[4] https://lore.kernel.org/all/2b77e055-98ac-43a1-a7ad-9f9065d7f38f@amd.com
[5] https://lore.kernel.org/all/diqzbjumm167.fsf@ackerleytng-ctop.c.googlers.com
== Earlier postings and changelogs ==
v7 (current):
- Add fixes suggested by Vlastimil and Ackerley.
- Store NUMA policy in custom inode struct instead of file.
v6:
- https://lore.kernel.org/all/20250226082549.6034-1-shivankg@amd.com
- Rebase to linux mainline
- Drop RFC tag
- Add selftests to ensure NUMA support for guest_memfd works correctly.
v5:
- https://lore.kernel.org/all/20250219101559.414878-1-shivankg@amd.com
- Fix documentation and style issues.
- Use EXPORT_SYMBOL_GPL
- Split preparatory change in separate patch
v4:
- https://lore.kernel.org/all/20250210063227.41125-1-shivankg@amd.com
- Dropped fbind() approach in favor of shared policy support.
v3:
- https://lore.kernel.org/all/20241105164549.154700-1-shivankg@amd.com
- Introduce fbind() syscall and drop the IOCTL-based approach.
v2:
- https://lore.kernel.org/all/20240919094438.10987-1-shivankg@amd.com
- Add fixes suggested by Matthew Wilcox.
v1:
- https://lore.kernel.org/all/20240916165743.201087-1-shivankg@amd.com
- Proposed IOCTL based approach to pass NUMA mempolicy.
Ackerley Tng (1):
KVM: guest_memfd: Make guest mem use guest mem inodes instead of
anonymous inodes
Shivank Garg (6):
mm/mempolicy: Export memory policy symbols
security: Export security_inode_init_security_anon for KVM guest_memfd
KVM: Add kvm_gmem_exit() cleanup function
KVM: guest_memfd: Add slab-allocated inode cache
KVM: guest_memfd: Enforce NUMA mempolicy using shared policy
KVM: guest_memfd: selftests: Add tests for mmap and NUMA policy
support
Shivansh Dhiman (1):
mm/filemap: Add mempolicy support to the filemap layer
include/linux/pagemap.h | 41 +++
include/uapi/linux/magic.h | 1 +
mm/filemap.c | 27 +-
mm/mempolicy.c | 6 +
security/security.c | 1 +
.../testing/selftests/kvm/guest_memfd_test.c | 86 +++++-
virt/kvm/guest_memfd.c | 261 ++++++++++++++++--
virt/kvm/kvm_main.c | 2 +
virt/kvm/kvm_mm.h | 6 +
9 files changed, 402 insertions(+), 29 deletions(-)
--
2.34.1
Some distributions enable rp_filter globally by default, which can interfere
with various test cases. To address this, many tests explicitly disable
rp_filter within their scripts.
To avoid duplication and ensure consistent behavior across tests, this patch
moves the rp_filter configuration into setup_ns, applied immediately after a
new namespace is created. This change ensures that all namespace-based tests
inherit the appropriate rp_filter settings, simplifying individual test
scripts and improving maintainability.
BTW, the patch 4/6 for srv6 is a bit large. Please tell me if you think
I need to break this one.
Hangbin Liu (6):
selftests: net: disable rp_filter after namespace initialization
selftests: net: remove redundant rp_filter configuration
selftests: net: use setup_ns for bareudp testing
selftests: net: use setup_ns for SRv6 tests and remove rp_filter
configuration
selftests: netfilter: remove rp_filter configuration
selftests: mptcp: remove rp_filter configuration
tools/testing/selftests/net/bareudp.sh | 49 ++---------
tools/testing/selftests/net/fib_rule_tests.sh | 3 -
tools/testing/selftests/net/fib_tests.sh | 3 -
tools/testing/selftests/net/icmp_redirect.sh | 2 -
tools/testing/selftests/net/lib.sh | 2 +
.../testing/selftests/net/mptcp/mptcp_lib.sh | 2 -
.../selftests/net/netfilter/br_netfilter.sh | 3 -
.../selftests/net/netfilter/bridge_brouter.sh | 2 -
.../selftests/net/netfilter/conntrack_vrf.sh | 3 -
tools/testing/selftests/net/netfilter/ipvs.sh | 6 --
.../selftests/net/netfilter/nft_fib.sh | 2 -
.../selftests/net/netfilter/nft_nat_zones.sh | 2 -
.../testing/selftests/net/netfilter/rpath.sh | 18 ++--
.../selftests/net/srv6_end_dt46_l3vpn_test.sh | 5 --
.../selftests/net/srv6_end_dt4_l3vpn_test.sh | 5 --
.../net/srv6_end_next_csid_l3vpn_test.sh | 77 ++++-------------
.../net/srv6_end_x_next_csid_l3vpn_test.sh | 83 +++++--------------
.../net/srv6_hencap_red_l3vpn_test.sh | 74 ++++-------------
.../net/srv6_hl2encap_red_l2vpn_test.sh | 77 ++++-------------
19 files changed, 87 insertions(+), 331 deletions(-)
--
2.46.0
When make finds the source tree unclean, it prints a message to run
"make ARCH=x86_64 mrproper" message using the ARCH from the command
line. The ARCH specified in the command line could be different from
the ARCH of the existing build in the source tree.
This could cause problems in regular kernel build and kunit workflows.
Regular workflow:
- Build x86_64 kernel
$ make ARCH=x86_64
- Try building another arch kernel out of tree with O=
$ make ARCH=um O=/linux/build
- kbuild detects source tree is unclean
***
*** The source tree is not clean, please run 'make ARCH=um mrproper'
*** in /linux/linux_srcdir
***
- Clean source tree as suggested by kbuild
$ make ARCH=um mrproper
- Source clean appears to be clean, but it leaves behind generated header
files under arch/x86
arch/x86/realmode/rm/pasyms.h
A subsequent x86_64e build fails with
"undefined symbol sev_es_trampoline_start referenced ..."
kunit workflow runs into this issue:
- Build x86_64 kernel
- Run kunit tests: it tries to build for user specified ARCH or uml
as default:
$ ./tools/testing/kunit/kunit.py run
- kbuild detects unclean source tree
***
*** The source tree is not clean, please run 'make ARCH=um mrproper'
*** in /linux/linux_6.15
***
- Clean source tree as suggested by kbuild
$ make ARCH=um mrproper
- Source clean appears to be clean, but it leaves behind generated header
files under arch/x86
The problem shows when user tries to run tests on ARCH=x86_64:
$ ./tools/testing/kunit/kunit.py run ARCH=x86_64
"undefined symbol sev_es_trampoline_start referenced ..."
Build trips on arch/x86/realmode/rm/pasyms.h left behind by a prior
x86_64 build.
Problems related to partially cleaned source tree are hard to debug.
Change Makefile to unclean source logic to use ARCH from compile.h
UTS_MACHINE string. With this change kbuild prints:
$ ./tools/testing/kunit/kunit.py run
***
*** The source tree is not clean, please run 'make ARCH=x86_64 mrproper'
*** in /linux/linux_6.15
***
Signed-off-by: Shuah Khan <skhan(a)linuxfoundation.org>
---
Makefile | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/Makefile b/Makefile
index 5aa9ee52a765..7ee29136b4da 100644
--- a/Makefile
+++ b/Makefile
@@ -674,7 +674,7 @@ ifeq ($(KBUILD_EXTMOD),)
-d $(srctree)/include/config -o \
-d $(srctree)/arch/$(SRCARCH)/include/generated ]; then \
echo >&2 "***"; \
- echo >&2 "*** The source tree is not clean, please run 'make$(if $(findstring command line, $(origin ARCH)), ARCH=$(ARCH)) mrproper'"; \
+ echo >&2 "*** The source tree is not clean, please run 'make ARCH=$(shell grep UTS_MACHINE $(srctree)/include/generated/compile.h | cut -d '"' -f 2) mrproper'"; \
echo >&2 "*** in $(abs_srctree)";\
echo >&2 "***"; \
false; \
--
2.47.2
__qdisc_destroy() calls into various qdiscs .destroy() op, which in turn
can call .ndo_setup_tc(), which requires the netdev instance lock.
This commit extends the critical section in
unregister_netdevice_many_notify() to cover dev_shutdown() (and
dev_tcx_uninstall() as a side-effect) and acquires the netdev instance
lock in __dev_change_net_namespace() for the other dev_shutdown() call.
This should now guarantee that for all qdisc ops, the netdev instance
lock is held during .ndo_setup_tc().
Fixes: a0527ee2df3f ("net: hold netdev instance lock during qdisc ndo_setup_tc")
Signed-off-by: Cosmin Ratiu <cratiu(a)nvidia.com>
---
net/core/dev.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/net/core/dev.c b/net/core/dev.c
index 1be7cb73a602..92e004c354ea 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -11966,9 +11966,9 @@ void unregister_netdevice_many_notify(struct list_head *head,
struct sk_buff *skb = NULL;
/* Shutdown queueing discipline. */
+ netdev_lock_ops(dev);
dev_shutdown(dev);
dev_tcx_uninstall(dev);
- netdev_lock_ops(dev);
dev_xdp_uninstall(dev);
dev_memory_provider_uninstall(dev);
netdev_unlock_ops(dev);
@@ -12161,7 +12161,9 @@ int __dev_change_net_namespace(struct net_device *dev, struct net *net,
synchronize_net();
/* Shutdown queueing discipline. */
+ netdev_lock_ops(dev);
dev_shutdown(dev);
+ netdev_unlock_ops(dev);
/* Notify protocols, that we are about to destroy
* this device. They should clean all the things.
--
2.45.0
This patch series extends the sev_init2 and the sev_smoke test to
exercise the SEV-SNP VM launch workflow.
Primarily, it introduces the architectural defines, its support in the
SEV library and extends the tests to interact with the SEV-SNP ioctl()
wrappers.
Patch 1 - Do not advertise SNP on initialization failure
Patch 2 - SNP test for KVM_SEV_INIT2
Patch 3 - Add vmgexit helper
Patch 4 - Add SMT control interface helper
Patch 5 - Replace assert() with TEST_ASSERT_EQ()
Patch 6 - Introduce SEV+ VM type check
Patch 7 - SNP iotcl() plumbing for the SEV library
Patch 8 - Force set GUEST_MEMFD for SNP
Patch 9 - Cleanups of smoke test - Decouple policy from type
Patch 10 - SNP smoke test
The series is based on
git.kernel.org/pub/scm/virt/kvm/kvm.git next
v7..v8:
* Dropped exporting the SNP initialized API from ccp to KVM. Instead
call SNP_PLATFORM_STATUS within KVM to query the initialization. (Tom)
While it may be cheaper to query sev->snp_initialized from ccp, making
the SNP platform call within KVM does away with any dependencies.
v6..v7:
https://lore.kernel.org/kvm/20250221210200.244405-7-prsampat@amd.com/
Based on comments from Sean -
* Replaced FW check with sev->snp_initialized
* Dropped the patch which removes SEV+ KVM advertisement if INIT fails.
This should be now be resolved by the combination of the patches [1,2]
from Ashish.
* Change vmgexit to an inline function
* Export SMT control parsing interface to kvm_util
Note: hyperv_cpuid KST only compile tested
* Replace assert() with TEST_ASSERT_EQ() within SEV library
* Define KVM_SEV_PAGE_TYPE_INVALID for SEV call of encrypt_region()
* Parameterize encrypt_region() to include privatize_region()
* Deduplication of sev test calls between SEV,SEV-ES and SNP
* Removed FW version tests for SNP
* Included testing of SNP_POLICY_DBG
* Dropped most tags from patches that have been changed or indirectly
affected
[1] https://lore.kernel.org/all/d6d08c6b-9602-4f3d-92c2-8db6d50a1b92@amd.com
[2] https://lore.kernel.org/all/f78ddb64087df27e7bcb1ae0ab53f55aa0804fab.173922…
v5..v6:
https://lore.kernel.org/kvm/ab433246-e97c-495b-ab67-b0cb1721fb99@amd.com/
* Rename is_sev_platform_init to sev_fw_initialized (Nikunj)
* Rename KVM CPU feature X86_FEATURE_SNP to X86_FEATURE_SEV_SNP (Nikunj)
* Collected Tags from Nikunj, Pankaj, Srikanth.
v4..v5:
https://lore.kernel.org/kvm/8e7d8172-879e-4a28-8438-343b1c386ec9@amd.com/
* Introduced a check to disable advertising support for SEV, SEV-ES
and SNP when platform initialization fails (Nikunj)
* Remove the redundant SNP check within is_sev_vm() (Nikunj)
* Cleanup of the encrypt_region flow for better readability (Nikunj)
* Refactor paths to use the canonical $(ARCH) to rebase for kvm/next
v3..v4:
https://lore.kernel.org/kvm/20241114234104.128532-1-pratikrajesh.sampat@amd…
* Remove SNP FW API version check in the test and ensure the KVM
capability advertises the presence of the feature. Retain the minimum
version definitions to exercise these API versions in the smoke test
* Retained only the SNP smoke test and SNP_INIT2 test
* The SNP architectural defined merged with SNP_INIT2 test patch
* SNP shutdown merged with SNP smoke test patch
* Add SEV VM type check to abstract comparisons and reduce clutter
* Define a SNP default policy which sets bits based on the presence of
SMT
* Decouple privatization and encryption for it to be SNP agnostic
* Assert for only positive tests using vm_ioctl()
* Dropped tested-by tags
In summary - based on comments from Sean, I have primarily reduced the
scope of this patch series to focus on breaking down the SNP smoke test
patch (v3 - patch2) to first introduce SEV-SNP support and use this
interface to extend the sev_init2 and the sev_smoke test.
The rest of the v3 patchset that introduces ioctl, pre fault, fallocate
and negative tests, will be re-worked and re-introduced subsequently in
future patch series post addressing the issues discussed.
v2..v3:
https://lore.kernel.org/kvm/20240905124107.6954-1-pratikrajesh.sampat@amd.c…
* Remove the assignments for the prefault and fallocate test type
enums.
* Fix error message for sev launch measure and finish.
* Collect tested-by tags [Peter, Srikanth]
Pratik R. Sampat (10):
KVM: SEV: Disable SEV-SNP support on initialization failure
KVM: selftests: SEV-SNP test for KVM_SEV_INIT2
KVM: selftests: Add vmgexit helper
KVM: selftests: Add SMT control state helper
KVM: selftests: Replace assert() with TEST_ASSERT_EQ()
KVM: selftests: Introduce SEV VM type check
KVM: selftests: Add library support for interacting with SNP
KVM: selftests: Force GUEST_MEMFD flag for SNP VM type
KVM: selftests: Abstractions for SEV to decouple policy from type
KVM: selftests: Add a basic SEV-SNP smoke test
arch/x86/include/uapi/asm/kvm.h | 1 +
arch/x86/kvm/svm/sev.c | 30 +++++-
tools/arch/x86/include/uapi/asm/kvm.h | 1 +
.../testing/selftests/kvm/include/kvm_util.h | 35 +++++++
.../selftests/kvm/include/x86/processor.h | 1 +
tools/testing/selftests/kvm/include/x86/sev.h | 42 ++++++++-
tools/testing/selftests/kvm/lib/kvm_util.c | 7 +-
.../testing/selftests/kvm/lib/x86/processor.c | 4 +-
tools/testing/selftests/kvm/lib/x86/sev.c | 93 +++++++++++++++++--
.../testing/selftests/kvm/x86/hyperv_cpuid.c | 19 ----
.../selftests/kvm/x86/sev_init2_tests.c | 13 +++
.../selftests/kvm/x86/sev_smoke_test.c | 75 +++++++++------
12 files changed, 261 insertions(+), 60 deletions(-)
--
2.43.0
Hello,
This patchset builds upon the code at
https://lore.kernel.org/lkml/20230718234512.1690985-1-seanjc@google.com/T/.
This code is available at
https://github.com/googleprodkernel/linux-cc/tree/kvm-gmem-link-migrate-rfc….
In guest_mem v11, a split file/inode model was proposed, where memslot
bindings belong to the file and pages belong to the inode. This model
lends itself well to having different VMs use separate files pointing
to the same inode.
This RFC proposes an ioctl, KVM_LINK_GUEST_MEMFD, that takes a VM and
a gmem fd, and returns another gmem fd referencing a different file
and associated with VM. This RFC also includes an update to
KVM_CAP_VM_MOVE_ENC_CONTEXT_FROM to migrate memory context
(slot->arch.lpage_info and kvm->mem_attr_array) from source to
destination vm, intra-host.
Intended usage of the two ioctls:
1. Source VM’s fd is passed to destination VM via unix sockets
2. Destination VM uses new ioctl KVM_LINK_GUEST_MEMFD to link source
VM’s fd to a new fd.
3. Destination VM will pass new fds to KVM_SET_USER_MEMORY_REGION,
which will bind the new file, pointing to the same inode that the
source VM’s file points to, to memslots
4. Use KVM_CAP_VM_MOVE_ENC_CONTEXT_FROM to move kvm->mem_attr_array
and slot->arch.lpage_info to the destination VM.
5. Run the destination VM as per normal
Some other approaches considered were:
+ Using the linkat() syscall, but that requires a mount/directory for
a source fd to be linked to
+ Using the dup() syscall, but that only duplicates the fd, and both
fds point to the same file
---
Ackerley Tng (11):
KVM: guest_mem: Refactor out kvm_gmem_alloc_file()
KVM: guest_mem: Add ioctl KVM_LINK_GUEST_MEMFD
KVM: selftests: Add tests for KVM_LINK_GUEST_MEMFD ioctl
KVM: selftests: Test transferring private memory to another VM
KVM: x86: Refactor sev's flag migration_in_progress to kvm struct
KVM: x86: Refactor common code out of sev.c
KVM: x86: Refactor common migration preparation code out of
sev_vm_move_enc_context_from
KVM: x86: Let moving encryption context be configurable
KVM: x86: Handle moving of memory context for intra-host migration
KVM: selftests: Generalize migration functions from
sev_migrate_tests.c
KVM: selftests: Add tests for migration of private mem
arch/x86/include/asm/kvm_host.h | 4 +-
arch/x86/kvm/svm/sev.c | 85 ++-----
arch/x86/kvm/svm/svm.h | 3 +-
arch/x86/kvm/x86.c | 221 +++++++++++++++++-
arch/x86/kvm/x86.h | 6 +
include/linux/kvm_host.h | 18 ++
include/uapi/linux/kvm.h | 8 +
tools/testing/selftests/kvm/Makefile | 1 +
.../testing/selftests/kvm/guest_memfd_test.c | 42 ++++
.../selftests/kvm/include/kvm_util_base.h | 31 +++
.../kvm/x86_64/private_mem_migrate_tests.c | 93 ++++++++
.../selftests/kvm/x86_64/sev_migrate_tests.c | 48 ++--
virt/kvm/guest_mem.c | 151 ++++++++++--
virt/kvm/kvm_main.c | 10 +
virt/kvm/kvm_mm.h | 7 +
15 files changed, 596 insertions(+), 132 deletions(-)
create mode 100644 tools/testing/selftests/kvm/x86_64/private_mem_migrate_tests.c
--
2.41.0.640.ga95def55d0-goog
'realpath' is not always available, fallback to 'readlink -f' if is not
available. They seem to work equally well in this context.
Signed-off-by: Yosry Ahmed <yosry.ahmed(a)linux.dev>
---
tools/testing/selftests/run_kselftest.sh | 9 ++++++++-
1 file changed, 8 insertions(+), 1 deletion(-)
diff --git a/tools/testing/selftests/run_kselftest.sh b/tools/testing/selftests/run_kselftest.sh
index 50e03eefe7ac7..0443beacf3621 100755
--- a/tools/testing/selftests/run_kselftest.sh
+++ b/tools/testing/selftests/run_kselftest.sh
@@ -3,7 +3,14 @@
#
# Run installed kselftest tests.
#
-BASE_DIR=$(realpath $(dirname $0))
+
+# Fallback to readlink if realpath is not available
+if which realpath > /dev/null; then
+ BASE_DIR=$(realpath $(dirname $0))
+else
+ BASE_DIR=$(readlink -f $(dirname $0))
+fi
+
cd $BASE_DIR
TESTS="$BASE_DIR"/kselftest-list.txt
if [ ! -r "$TESTS" ] ; then
--
2.49.0.rc1.451.g8f38331e32-goog
ksft runner sends 2 SIGTERMs in a row if a test runs out of time.
Handle this in a similar way we handle SIGINT - cleanup and stop
running further tests.
Because we get 2 signals we need a bit of logic to ignore
the subsequent one, they come immediately one after the other
(due to commit 9616cb34b08e ("kselftest/runner.sh: Propagate SIGTERM
to runner child")).
This change makes sure we run cleanup (scheduled defer()s)
and also print a stack trace on SIGTERM, which doesn't happen
by default. Tests occasionally hang in NIPA and it's impossible
to tell what they are waiting from or doing.
Signed-off-by: Jakub Kicinski <kuba(a)kernel.org>
---
v3:
- remove unnecessary isinstance()
v2: https://lore.kernel.org/20250429170804.2649622-1-kuba@kernel.org
- remove declaration at the global scope
v1: https://lore.kernel.org/20250425151757.1652517-1-kuba@kernel.org
CC: ecree.xilinx(a)gmail.com
CC: petrm(a)nvidia.com
CC: willemb(a)google.com
CC: sdf(a)fomichev.me
CC: linux-kselftest(a)vger.kernel.org
---
tools/testing/selftests/net/lib/py/ksft.py | 24 +++++++++++++++++++++-
1 file changed, 23 insertions(+), 1 deletion(-)
diff --git a/tools/testing/selftests/net/lib/py/ksft.py b/tools/testing/selftests/net/lib/py/ksft.py
index 3cfad0fd4570..61287c203b6e 100644
--- a/tools/testing/selftests/net/lib/py/ksft.py
+++ b/tools/testing/selftests/net/lib/py/ksft.py
@@ -3,6 +3,7 @@
import builtins
import functools
import inspect
+import signal
import sys
import time
import traceback
@@ -26,6 +27,10 @@ KSFT_DISRUPTIVE = True
pass
+class KsftTerminate(KeyboardInterrupt):
+ pass
+
+
def ksft_pr(*objs, **kwargs):
print("#", *objs, **kwargs)
@@ -193,6 +198,17 @@ KSFT_DISRUPTIVE = True
return env
+def _ksft_intr(signum, frame):
+ # ksft runner.sh sends 2 SIGTERMs in a row on a timeout
+ # if we don't ignore the second one it will stop us from handling cleanup
+ global term_cnt
+ term_cnt += 1
+ if term_cnt == 1:
+ raise KsftTerminate()
+ else:
+ ksft_pr(f"Ignoring SIGTERM (cnt: {term_cnt}), already exiting...")
+
+
def ksft_run(cases=None, globs=None, case_pfx=None, args=()):
cases = cases or []
@@ -205,6 +221,10 @@ KSFT_DISRUPTIVE = True
cases.append(value)
break
+ global term_cnt
+ term_cnt = 0
+ prev_sigterm = signal.signal(signal.SIGTERM, _ksft_intr)
+
totals = {"pass": 0, "fail": 0, "skip": 0, "xfail": 0}
print("TAP version 13")
@@ -233,7 +253,7 @@ KSFT_DISRUPTIVE = True
for line in tb.strip().split('\n'):
ksft_pr("Exception|", line)
if stop:
- ksft_pr("Stopping tests due to KeyboardInterrupt.")
+ ksft_pr(f"Stopping tests due to {type(e).__name__}.")
KSFT_RESULT = False
cnt_key = 'fail'
@@ -248,6 +268,8 @@ KSFT_DISRUPTIVE = True
if stop:
break
+ signal.signal(signal.SIGTERM, prev_sigterm)
+
print(
f"# Totals: pass:{totals['pass']} fail:{totals['fail']} xfail:{totals['xfail']} xpass:0 skip:{totals['skip']} error:0"
)
--
2.49.0
v13: https://lore.kernel.org/netdev/20250425204743.617260-1-almasrymina@google.c…
===
Changelog:
- Fix unneeded error label pointed out by Christoph, and addressed
nitpick.
v12: https://lore.kernel.org/netdev/20250423031117.907681-1-almasrymina@google.c…
====
No changes in v12, just restored the selftests patch I accidentally dropped in
v11
v11: https://lore.kernel.org/netdev/20250423031117.907681-1-almasrymina@google.c…
====
Addressed a couple of nits and collected Acked-by from Harshitha
(thanks!)
v10: https://lore.kernel.org/netdev/20250417231540.2780723-1-almasrymina@google.…
====
Addressed comments following conversations with Pavel, Stan, and
Harshitha. Thank you guys for the reviews again. Overall minor changes:
Changelog:
- Check for !niov->pp in io_zcrx_recv_frag, just in case we end up with
a TX niov in that path (Pavel).
- Fix locking case in !netif_device_present (Jakub/Stan).
v9: https://lore.kernel.org/netdev/20250415224756.152002-1-almasrymina@google.c…
===
Changelog:
- Use priv->bindings list instead of sock_bindings_list. This was missed
during the rebase as the bindings have been updated to use
priv->bindings recently (thanks Stan!)
v8: https://lore.kernel.org/netdev/20250308214045.1160445-1-almasrymina@google.…
===
Only address minor comments on V7
Changelog:
- Use netdev locking instead of rtnl_locking to match rx path.
- Now that iouring zcrx is in net-next, use NET_IOV_IOURING instead of
NET_IOV_UNSPECIFIED.
- Post send binding to net_devmem_dmabuf_bindings after it's been fully
initialized (Stan).
v7: https://lore.kernel.org/netdev/20250227041209.2031104-1-almasrymina@google.…
===
Changelog:
- Check the dmabuf net_iov binding belongs to the device the TX is going
out on. (Jakub)
- Provide detailed inspection of callsites of
__skb_frag_ref/skb_page_unref in patch 2's changelog (Jakub)
v6: https://lore.kernel.org/netdev/20250222191517.743530-1-almasrymina@google.c…
===
v6 has no major changes. Addressed a few issues from Paolo and David,
and collected Acks from Stan. Thank you everyone for the review!
Changes:
- retain behavior to process MSG_FASTOPEN even if the provided cmsg is
invalid (Paolo).
- Rework the freeing of tx_vec slightly (it now has its own err label).
(Paolo).
- Squash the commit that makes dmabuf unbinding scheduled work into the
same one which implements the TX path so we don't run into future
errors on bisecting (Paolo).
- Fix/add comments to explain how dmabuf binding refcounting works
(David).
v5: https://lore.kernel.org/netdev/20250220020914.895431-1-almasrymina@google.c…
===
v5 has no major changes; it clears up the relatively minor issues
pointed out to in v4, and rebases the series on top of net-next to
resolve the conflict with a patch that raced to the tree. It also
collects the review tags from v4.
Changes:
- Rebase to net-next
- Fix issues in selftest (Stan).
- Address comments in the devmem and netmem driver docs (Stan and Bagas)
- Fix zerocopy_fill_skb_from_devmem return error code (Stan).
v4: https://lore.kernel.org/netdev/20250203223916.1064540-1-almasrymina@google.…
===
v4 mainly addresses the critical driver support issue surfaced in v3 by
Paolo and Stan. Drivers aiming to support netmem_tx should make sure not
to pass the netmem dma-addrs to the dma-mapping APIs, as these dma-addrs
may come from dma-bufs.
Additionally other feedback from v3 is addressed.
Major changes:
- Add helpers to handle netmem dma-addrs. Add GVE support for
netmem_tx.
- Fix binding->tx_vec not being freed on error paths during the
tx binding.
- Add a minimal devmem_tx test to devmem.py.
- Clean up everything obsolete from the cover letter (Paolo).
v3: https://patchwork.kernel.org/project/netdevbpf/list/?series=929401&state=*
===
Address minor comments from RFCv2 and fix a few build warnings and
ynl-regen issues. No major changes.
RFC v2: https://patchwork.kernel.org/project/netdevbpf/list/?series=920056&state=*
=======
RFC v2 addresses much of the feedback from RFC v1. I plan on sending
something close to this as net-next reopens, sending it slightly early
to get feedback if any.
Major changes:
--------------
- much improved UAPI as suggested by Stan. We now interpret the iov_base
of the passed in iov from userspace as the offset into the dmabuf to
send from. This removes the need to set iov.iov_base = NULL which may
be confusing to users, and enables us to send multiple iovs in the
same sendmsg() call. ncdevmem and the docs show a sample use of that.
- Removed the duplicate dmabuf iov_iter in binding->iov_iter. I think
this is good improvment as it was confusing to keep track of
2 iterators for the same sendmsg, and mistracking both iterators
caused a couple of bugs reported in the last iteration that are now
resolved with this streamlining.
- Improved test coverage in ncdevmem. Now multiple sendmsg() are tested,
and sending multiple iovs in the same sendmsg() is tested.
- Fixed issue where dmabuf unmapping was happening in invalid context
(Stan).
====================================================================
The TX path had been dropped from the Device Memory TCP patch series
post RFCv1 [1], to make that series slightly easier to review. This
series rebases the implementation of the TX path on top of the
net_iov/netmem framework agreed upon and merged. The motivation for
the feature is thoroughly described in the docs & cover letter of the
original proposal, so I don't repeat the lengthy descriptions here, but
they are available in [1].
Full outline on usage of the TX path is detailed in the documentation
included with this series.
Test example is available via the kselftest included in the series as well.
The series is relatively small, as the TX path for this feature largely
piggybacks on the existing MSG_ZEROCOPY implementation.
Patch Overview:
---------------
1. Documentation & tests to give high level overview of the feature
being added.
1. Add netmem refcounting needed for the TX path.
2. Devmem TX netlink API.
3. Devmem TX net stack implementation.
4. Make dma-buf unbinding scheduled work to handle TX cases where it gets
freed from contexts where we can't sleep.
5. Add devmem TX documentation.
6. Add scaffolding enabling driver support for netmem_tx. Add helpers, driver
feature flag, and docs to enable drivers to declare netmem_tx support.
7. Guard netmem_tx against being enabled against drivers that don't
support it.
8. Add devmem_tx selftests. Add TX path to ncdevmem and add a test to
devmem.py.
Testing:
--------
Testing is very similar to devmem TCP RX path. The ncdevmem test used
for the RX path is now augemented with client functionality to test TX
path.
* Test Setup:
Kernel: net-next with this RFC and memory provider API cherry-picked
locally.
Hardware: Google Cloud A3 VMs.
NIC: GVE with header split & RSS & flow steering support.
Performance results are not included with this version, unfortunately.
I'm having issues running the dma-buf exporter driver against the
upstream kernel on my test setup. The issues are specific to that
dma-buf exporter and do not affect this patch series. I plan to follow
up this series with perf fixes if the tests point to issues once they're
up and running.
Special thanks to Stan who took a stab at rebasing the TX implementation
on top of the netmem/net_iov framework merged. Parts of his proposal [2]
that are reused as-is are forked off into their own patches to give full
credit.
[1] https://lore.kernel.org/netdev/20240909054318.1809580-1-almasrymina@google.…
[2] https://lore.kernel.org/netdev/20240913150913.1280238-2-sdf@fomichev.me/T/#…
Cc: sdf(a)fomichev.me
Cc: asml.silence(a)gmail.com
Cc: dw(a)davidwei.uk
Cc: Jamal Hadi Salim <jhs(a)mojatatu.com>
Cc: Victor Nogueira <victor(a)mojatatu.com>
Cc: Pedro Tammela <pctammela(a)mojatatu.com>
Cc: Samiullah Khawaja <skhawaja(a)google.com>
Cc: Kuniyuki Iwashima <kuniyu(a)amazon.com>
Mina Almasry (8):
netmem: add niov->type attribute to distinguish different net_iov
types
net: add get_netmem/put_netmem support
net: devmem: Implement TX path
net: add devmem TCP TX documentation
net: enable driver support for netmem TX
gve: add netmem TX support to GVE DQO-RDA mode
net: check for driver support in netmem TX
selftests: ncdevmem: Implement devmem TCP TX
Stanislav Fomichev (1):
net: devmem: TCP tx netlink api
Documentation/netlink/specs/netdev.yaml | 12 +
Documentation/networking/devmem.rst | 150 ++++++++-
.../networking/net_cachelines/net_device.rst | 1 +
Documentation/networking/netdev-features.rst | 5 +
Documentation/networking/netmem.rst | 23 +-
drivers/net/ethernet/google/gve/gve_main.c | 3 +
drivers/net/ethernet/google/gve/gve_tx_dqo.c | 8 +-
include/linux/netdevice.h | 2 +
include/linux/skbuff.h | 17 +-
include/linux/skbuff_ref.h | 4 +-
include/net/netmem.h | 34 +-
include/net/sock.h | 1 +
include/uapi/linux/netdev.h | 1 +
io_uring/zcrx.c | 3 +-
net/core/datagram.c | 48 ++-
net/core/dev.c | 34 +-
net/core/devmem.c | 131 ++++++--
net/core/devmem.h | 83 ++++-
net/core/netdev-genl-gen.c | 13 +
net/core/netdev-genl-gen.h | 1 +
net/core/netdev-genl.c | 80 ++++-
net/core/skbuff.c | 48 ++-
net/core/sock.c | 6 +
net/ipv4/ip_output.c | 3 +-
net/ipv4/tcp.c | 50 ++-
net/ipv6/ip6_output.c | 3 +-
net/vmw_vsock/virtio_transport_common.c | 5 +-
tools/include/uapi/linux/netdev.h | 1 +
.../selftests/drivers/net/hw/devmem.py | 26 +-
.../selftests/drivers/net/hw/ncdevmem.c | 300 +++++++++++++++++-
30 files changed, 1008 insertions(+), 88 deletions(-)
base-commit: 0d15a26b247d25cd012134bf8825128fedb15cc9
--
2.49.0.901.g37484f566f-goog
Fix `ping.py` test failure on an ipv6 system, and appropriately handle the
cases where either one of the two address families (ipv4, ipv6) is not
present.
Mohsin Bashir (3):
selftests: drv: net: fix test failure on ipv6 sys
selftests: drv: net: avoid skipping tests
selftests: drv: net: add version indicator
tools/testing/selftests/drivers/net/ping.py | 45 ++++++++++++---------
1 file changed, 27 insertions(+), 18 deletions(-)
--
2.47.1
Clean up two build warnings:
[1]
iou-zcrx.c: In function ‘process_recvzc’:
iou-zcrx.c:263:37: warning: too many arguments for format [-Wformat-extra-args]
263 | error(1, 0, "payload mismatch at ", i);
| ^~~~~~~~~~~~~~~~~~~~~~
[2] Use "%zd" for ssize_t type as better
iou-zcrx.c: In function ‘run_client’:
iou-zcrx.c:357:47: warning: format ‘%d’ expects argument of type ‘int’, but argument 4 has type ‘ssize_t’ {aka ‘long int’} [-Wformat=]
357 | error(1, 0, "send(): %d", sent);
| ~^ ~~~~
| | |
| int ssize_t {aka long int}
| %ld
Signed-off-by: Haiyue Wang <haiyuewa(a)163.com>
---
v2:
- Dont't wrap the build warning message to make scripts/checkpatch.pl happy,
keep it as for readability.
- Change the format for ssize_t from "%ld" to "%zd" as Simon suggested.
- Change the target to net-next tree.
v1: https://lore.kernel.org/netdev/20250502042240.17371-1-haiyuewa@163.com/
---
tools/testing/selftests/drivers/net/hw/iou-zcrx.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/tools/testing/selftests/drivers/net/hw/iou-zcrx.c b/tools/testing/selftests/drivers/net/hw/iou-zcrx.c
index 8aa426014c87..62456df947bc 100644
--- a/tools/testing/selftests/drivers/net/hw/iou-zcrx.c
+++ b/tools/testing/selftests/drivers/net/hw/iou-zcrx.c
@@ -260,7 +260,7 @@ static void process_recvzc(struct io_uring *ring, struct io_uring_cqe *cqe)
for (i = 0; i < n; i++) {
if (*(data + i) != payload[(received + i)])
- error(1, 0, "payload mismatch at ", i);
+ error(1, 0, "payload mismatch at %d", i);
}
received += n;
@@ -354,7 +354,7 @@ static void run_client(void)
chunk = min_t(ssize_t, cfg_payload_len, to_send);
res = send(fd, src, chunk, 0);
if (res < 0)
- error(1, 0, "send(): %d", sent);
+ error(1, 0, "send(): %zd", sent);
sent += res;
to_send -= res;
}
--
2.49.0
Until CONFIG_DMABUF_SYSFS_STATS was added [1] it was only possible to
perform per-buffer accounting with debugfs which is not suitable for
production environments. Eventually we discovered the overhead with
per-buffer sysfs file creation/removal was significantly impacting
allocation and free times, and exacerbated kernfs lock contention. [2]
dma_buf_stats_setup() is responsible for 39% of single-page buffer
creation duration, or 74% of single-page dma_buf_export() duration when
stressing dmabuf allocations and frees.
I prototyped a change from per-buffer to per-exporter statistics with a
RCU protected list of exporter allocations that accommodates most (but
not all) of our use-cases and avoids almost all of the sysfs overhead.
While that adds less overhead than per-buffer sysfs, and less even than
the maintenance of the dmabuf debugfs_list, it's still *additional*
overhead on top of the debugfs_list and doesn't give us per-buffer info.
This series uses the existing dmabuf debugfs_list to implement a BPF
dmabuf iterator, which adds no overhead to buffer allocation/free and
provides per-buffer info. The list has been moved outside of
CONFIG_DEBUG_FS scope so that it is always populated. The BPF program
loaded by userspace that extracts per-buffer information gets to define
its own interface which avoids the lack of ABI stability with debugfs.
As this is a replacement for our use of CONFIG_DMABUF_SYSFS_STATS, the
last patch is a RFC for removing it from the kernel. Please see my
suggestion there regarding the timeline for that.
[1] https://lore.kernel.org/linux-media/20201210044400.1080308-1-hridya@google.…
[2] https://lore.kernel.org/all/20220516171315.2400578-1-tjmercier@google.com
v1: https://lore.kernel.org/all/20250414225227.3642618-1-tjmercier@google.com
v1 -> v2:
Make the DMA buffer list independent of CONFIG_DEBUG_FS per Christian König
Add CONFIG_DMA_SHARED_BUFFER check to kernel/bpf/Makefile per kernel test robot
Use BTF_ID_LIST_SINGLE instead of BTF_ID_LIST_GLOBAL_SINGLE per Song Liu
Fixup comment style, mixing code/declarations, and use ASSERT_OK_FD in selftest per Song Liu
Add BPF_ITER_RESCHED feature to bpf_dmabuf_reg_info per Alexei Starovoitov
Add open-coded iterator and selftest per Alexei Starovoitov
Add a second test buffer from the system dmabuf heap to selftests
Use the BPF program we'll use in production for selftest per Alexei Starovoitov
https://r.android.com/c/platform/system/bpfprogs/+/3616123/2/dmabufIter.chttps://r.android.com/c/platform/system/memory/libmeminfo/+/3614259/1/libdm…
T.J. Mercier (6):
dma-buf: Rename and expose debugfs symbols
bpf: Add dmabuf iterator
bpf: Add open coded dmabuf iterator
selftests/bpf: Add test for dmabuf_iter
selftests/bpf: Add test for open coded dmabuf_iter
RFC: dma-buf: Remove DMA-BUF statistics
.../ABI/testing/sysfs-kernel-dmabuf-buffers | 24 --
Documentation/driver-api/dma-buf.rst | 5 -
drivers/dma-buf/Kconfig | 15 -
drivers/dma-buf/Makefile | 1 -
drivers/dma-buf/dma-buf-sysfs-stats.c | 202 --------------
drivers/dma-buf/dma-buf-sysfs-stats.h | 35 ---
drivers/dma-buf/dma-buf.c | 58 +---
include/linux/dma-buf.h | 6 +-
kernel/bpf/Makefile | 3 +
kernel/bpf/dmabuf_iter.c | 177 ++++++++++++
kernel/bpf/helpers.c | 5 +
.../testing/selftests/bpf/bpf_experimental.h | 5 +
tools/testing/selftests/bpf/config | 3 +
.../selftests/bpf/prog_tests/dmabuf_iter.c | 258 ++++++++++++++++++
.../testing/selftests/bpf/progs/dmabuf_iter.c | 91 ++++++
15 files changed, 561 insertions(+), 327 deletions(-)
delete mode 100644 Documentation/ABI/testing/sysfs-kernel-dmabuf-buffers
delete mode 100644 drivers/dma-buf/dma-buf-sysfs-stats.c
delete mode 100644 drivers/dma-buf/dma-buf-sysfs-stats.h
create mode 100644 kernel/bpf/dmabuf_iter.c
create mode 100644 tools/testing/selftests/bpf/prog_tests/dmabuf_iter.c
create mode 100644 tools/testing/selftests/bpf/progs/dmabuf_iter.c
base-commit: 0af2f6be1b4281385b618cb86ad946eded089ac8
--
2.49.0.906.g1f30a19c02-goog
From: Brendan Jackman <jackmanb(a)google.com>
[ Upstream commit 08fafac4c9f289a9d9a22d838921e4b3eb22c664 ]
As noted in [0], SeaBIOS (QEMU default) makes a mess of the terminal,
qboot does not.
It turns out this is actually useful with kunit.py, since the user is
exposed to this issue if they set --raw_output=all.
qboot is also faster than SeaBIOS, but it's is marginal for this
usecase.
[0] https://lore.kernel.org/all/CA+i-1C0wYb-gZ8Mwh3WSVpbk-LF-Uo+njVbASJPe1WXDUR…
Both SeaBIOS and qboot are x86-specific.
Link: https://lore.kernel.org/r/20250124-kunit-qboot-v1-1-815e4d4c6f7c@google.com
Signed-off-by: Brendan Jackman <jackmanb(a)google.com>
Reviewed-by: David Gow <davidgow(a)google.com>
Signed-off-by: Shuah Khan <skhan(a)linuxfoundation.org>
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
---
tools/testing/kunit/qemu_configs/x86_64.py | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/tools/testing/kunit/qemu_configs/x86_64.py b/tools/testing/kunit/qemu_configs/x86_64.py
index dc79490768630..4a6bf4e048f5b 100644
--- a/tools/testing/kunit/qemu_configs/x86_64.py
+++ b/tools/testing/kunit/qemu_configs/x86_64.py
@@ -7,4 +7,6 @@ CONFIG_SERIAL_8250_CONSOLE=y''',
qemu_arch='x86_64',
kernel_path='arch/x86/boot/bzImage',
kernel_command_line='console=ttyS0',
- extra_qemu_params=[])
+ # qboot is faster than SeaBIOS and doesn't mess up
+ # the terminal.
+ extra_qemu_params=['-bios', 'qboot.rom'])
--
2.39.5
From: Petr Machata <petrm(a)nvidia.com>
[ Upstream commit d9e9f6d7b7d0c520bb87f19d2cbc57aeeb2091d5 ]
Attempts to replace an MDB group membership of the host itself are
currently bounced:
# ip link add name br up type bridge vlan_filtering 1
# bridge mdb replace dev br port br grp 239.0.0.1 vid 2
# bridge mdb replace dev br port br grp 239.0.0.1 vid 2
Error: bridge: Group is already joined by host.
A similar operation done on a member port would succeed. Ignore the check
for replacement of host group memberships as well.
The bit of code that this enables is br_multicast_host_join(), which, for
already-joined groups only refreshes the MC group expiration timer, which
is desirable; and a userspace notification, also desirable.
Change a selftest that exercises this code path from expecting a rejection
to expecting a pass. The rest of MDB selftests pass without modification.
Signed-off-by: Petr Machata <petrm(a)nvidia.com>
Reviewed-by: Ido Schimmel <idosch(a)nvidia.com>
Acked-by: Nikolay Aleksandrov <razor(a)blackwall.org>
Link: https://patch.msgid.link/e5c5188b9787ae806609e7ca3aa2a0a501b9b5c4.173868564…
Signed-off-by: Jakub Kicinski <kuba(a)kernel.org>
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
---
net/bridge/br_mdb.c | 2 +-
tools/testing/selftests/net/forwarding/bridge_mdb.sh | 2 +-
2 files changed, 2 insertions(+), 2 deletions(-)
diff --git a/net/bridge/br_mdb.c b/net/bridge/br_mdb.c
index 7305f5f8215ca..96bea0c8408fe 100644
--- a/net/bridge/br_mdb.c
+++ b/net/bridge/br_mdb.c
@@ -1030,7 +1030,7 @@ static int br_mdb_add_group(const struct br_mdb_config *cfg,
/* host join */
if (!port) {
- if (mp->host_joined) {
+ if (mp->host_joined && !(cfg->nlflags & NLM_F_REPLACE)) {
NL_SET_ERR_MSG_MOD(extack, "Group is already joined by host");
return -EEXIST;
}
diff --git a/tools/testing/selftests/net/forwarding/bridge_mdb.sh b/tools/testing/selftests/net/forwarding/bridge_mdb.sh
index a3678dfe5848a..c151374ddf040 100755
--- a/tools/testing/selftests/net/forwarding/bridge_mdb.sh
+++ b/tools/testing/selftests/net/forwarding/bridge_mdb.sh
@@ -149,7 +149,7 @@ cfg_test_host_common()
check_err $? "Failed to add $name host entry"
bridge mdb replace dev br0 port br0 grp $grp $state vid 10 &> /dev/null
- check_fail $? "Managed to replace $name host entry"
+ check_err $? "Failed to replace $name host entry"
bridge mdb del dev br0 port br0 grp $grp $state vid 10
bridge mdb get dev br0 grp $grp vid 10 &> /dev/null
--
2.39.5
Fix some more minor issues in ublk selftests.
The first patch is from
https://lore.kernel.org/linux-block/20250423-ublk_selftests-v1-0-7d060e260e…
with a modification requested by Jens. The others are new.
Signed-off-by: Uday Shankar <ushankar(a)purestorage.com>
---
Changes in v2:
- Use a test-specific WERROR flag instead of reusing CONFIG_WERROR from
the kernel build for deciding whether or not to use -Werror for the
kublk build. The default behavior is to use -Werror (Ming Lei)
- Link to v1: https://lore.kernel.org/r/20250428-ublk_selftests-v1-0-5795f7b00cda@puresto…
---
Uday Shankar (3):
selftests: ublk: kublk: build with -Werror iff WERROR!=0
selftests: ublk: make test_generic_06 silent on success
selftests: ublk: kublk: fix include path
tools/testing/selftests/ublk/Makefile | 6 +++++-
tools/testing/selftests/ublk/kublk.h | 1 -
tools/testing/selftests/ublk/test_generic_06.sh | 2 +-
3 files changed, 6 insertions(+), 3 deletions(-)
---
base-commit: 53ec1abce79c986dc59e59d0c60d00088bcdf32a
change-id: 20250428-ublk_selftests-983240d3a325
Best regards,
--
Uday Shankar <ushankar(a)purestorage.com>
From: Brendan Jackman <jackmanb(a)google.com>
[ Upstream commit 08fafac4c9f289a9d9a22d838921e4b3eb22c664 ]
As noted in [0], SeaBIOS (QEMU default) makes a mess of the terminal,
qboot does not.
It turns out this is actually useful with kunit.py, since the user is
exposed to this issue if they set --raw_output=all.
qboot is also faster than SeaBIOS, but it's is marginal for this
usecase.
[0] https://lore.kernel.org/all/CA+i-1C0wYb-gZ8Mwh3WSVpbk-LF-Uo+njVbASJPe1WXDUR…
Both SeaBIOS and qboot are x86-specific.
Link: https://lore.kernel.org/r/20250124-kunit-qboot-v1-1-815e4d4c6f7c@google.com
Signed-off-by: Brendan Jackman <jackmanb(a)google.com>
Reviewed-by: David Gow <davidgow(a)google.com>
Signed-off-by: Shuah Khan <skhan(a)linuxfoundation.org>
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
---
tools/testing/kunit/qemu_configs/x86_64.py | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/tools/testing/kunit/qemu_configs/x86_64.py b/tools/testing/kunit/qemu_configs/x86_64.py
index dc79490768630..4a6bf4e048f5b 100644
--- a/tools/testing/kunit/qemu_configs/x86_64.py
+++ b/tools/testing/kunit/qemu_configs/x86_64.py
@@ -7,4 +7,6 @@ CONFIG_SERIAL_8250_CONSOLE=y''',
qemu_arch='x86_64',
kernel_path='arch/x86/boot/bzImage',
kernel_command_line='console=ttyS0',
- extra_qemu_params=[])
+ # qboot is faster than SeaBIOS and doesn't mess up
+ # the terminal.
+ extra_qemu_params=['-bios', 'qboot.rom'])
--
2.39.5
Fix some more minor issues in ublk selftests.
The first patch is from
https://lore.kernel.org/linux-block/20250423-ublk_selftests-v1-0-7d060e260e…
with a modification requested by Jens. The others are new.
Signed-off-by: Uday Shankar <ushankar(a)purestorage.com>
---
Uday Shankar (3):
selftests: ublk: kublk: build with -Werror iff CONFIG_WERROR=y
selftests: ublk: make test_generic_06 silent on success
selftests: ublk: kublk: fix include path
tools/testing/selftests/ublk/Makefile | 4 +++-
tools/testing/selftests/ublk/kublk.h | 1 -
tools/testing/selftests/ublk/test_generic_06.sh | 2 +-
3 files changed, 4 insertions(+), 3 deletions(-)
---
base-commit: 53ec1abce79c986dc59e59d0c60d00088bcdf32a
change-id: 20250428-ublk_selftests-983240d3a325
Best regards,
--
Uday Shankar <ushankar(a)purestorage.com>
From: Petr Machata <petrm(a)nvidia.com>
[ Upstream commit d9e9f6d7b7d0c520bb87f19d2cbc57aeeb2091d5 ]
Attempts to replace an MDB group membership of the host itself are
currently bounced:
# ip link add name br up type bridge vlan_filtering 1
# bridge mdb replace dev br port br grp 239.0.0.1 vid 2
# bridge mdb replace dev br port br grp 239.0.0.1 vid 2
Error: bridge: Group is already joined by host.
A similar operation done on a member port would succeed. Ignore the check
for replacement of host group memberships as well.
The bit of code that this enables is br_multicast_host_join(), which, for
already-joined groups only refreshes the MC group expiration timer, which
is desirable; and a userspace notification, also desirable.
Change a selftest that exercises this code path from expecting a rejection
to expecting a pass. The rest of MDB selftests pass without modification.
Signed-off-by: Petr Machata <petrm(a)nvidia.com>
Reviewed-by: Ido Schimmel <idosch(a)nvidia.com>
Acked-by: Nikolay Aleksandrov <razor(a)blackwall.org>
Link: https://patch.msgid.link/e5c5188b9787ae806609e7ca3aa2a0a501b9b5c4.173868564…
Signed-off-by: Jakub Kicinski <kuba(a)kernel.org>
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
---
net/bridge/br_mdb.c | 2 +-
tools/testing/selftests/net/forwarding/bridge_mdb.sh | 2 +-
2 files changed, 2 insertions(+), 2 deletions(-)
diff --git a/net/bridge/br_mdb.c b/net/bridge/br_mdb.c
index 1a52a0bca086d..7e1ad229e1330 100644
--- a/net/bridge/br_mdb.c
+++ b/net/bridge/br_mdb.c
@@ -1040,7 +1040,7 @@ static int br_mdb_add_group(const struct br_mdb_config *cfg,
/* host join */
if (!port) {
- if (mp->host_joined) {
+ if (mp->host_joined && !(cfg->nlflags & NLM_F_REPLACE)) {
NL_SET_ERR_MSG_MOD(extack, "Group is already joined by host");
return -EEXIST;
}
diff --git a/tools/testing/selftests/net/forwarding/bridge_mdb.sh b/tools/testing/selftests/net/forwarding/bridge_mdb.sh
index d9d587454d207..8c1597ebc2d38 100755
--- a/tools/testing/selftests/net/forwarding/bridge_mdb.sh
+++ b/tools/testing/selftests/net/forwarding/bridge_mdb.sh
@@ -149,7 +149,7 @@ cfg_test_host_common()
check_err $? "Failed to add $name host entry"
bridge mdb replace dev br0 port br0 grp $grp $state vid 10 &> /dev/null
- check_fail $? "Managed to replace $name host entry"
+ check_err $? "Failed to replace $name host entry"
bridge mdb del dev br0 port br0 grp $grp $state vid 10
bridge mdb get dev br0 grp $grp vid 10 &> /dev/null
--
2.39.5
From: Brendan Jackman <jackmanb(a)google.com>
[ Upstream commit 08fafac4c9f289a9d9a22d838921e4b3eb22c664 ]
As noted in [0], SeaBIOS (QEMU default) makes a mess of the terminal,
qboot does not.
It turns out this is actually useful with kunit.py, since the user is
exposed to this issue if they set --raw_output=all.
qboot is also faster than SeaBIOS, but it's is marginal for this
usecase.
[0] https://lore.kernel.org/all/CA+i-1C0wYb-gZ8Mwh3WSVpbk-LF-Uo+njVbASJPe1WXDUR…
Both SeaBIOS and qboot are x86-specific.
Link: https://lore.kernel.org/r/20250124-kunit-qboot-v1-1-815e4d4c6f7c@google.com
Signed-off-by: Brendan Jackman <jackmanb(a)google.com>
Reviewed-by: David Gow <davidgow(a)google.com>
Signed-off-by: Shuah Khan <skhan(a)linuxfoundation.org>
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
---
tools/testing/kunit/qemu_configs/x86_64.py | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/tools/testing/kunit/qemu_configs/x86_64.py b/tools/testing/kunit/qemu_configs/x86_64.py
index dc79490768630..4a6bf4e048f5b 100644
--- a/tools/testing/kunit/qemu_configs/x86_64.py
+++ b/tools/testing/kunit/qemu_configs/x86_64.py
@@ -7,4 +7,6 @@ CONFIG_SERIAL_8250_CONSOLE=y''',
qemu_arch='x86_64',
kernel_path='arch/x86/boot/bzImage',
kernel_command_line='console=ttyS0',
- extra_qemu_params=[])
+ # qboot is faster than SeaBIOS and doesn't mess up
+ # the terminal.
+ extra_qemu_params=['-bios', 'qboot.rom'])
--
2.39.5
From: Yi Liu <yi.l.liu(a)intel.com>
[ Upstream commit 1062d81086156e42878d701b816d2f368b53a77c ]
Allocating a domain with a fault ID indicates that the domain is faultable.
However, there is a gap for the nested parent domain to support PRI. Some
hardware lacks the capability to distinguish whether PRI occurs at stage 1
or stage 2. This limitation may require software-based page table walking
to resolve. Since no in-tree IOMMU driver currently supports this
functionality, it is disallowed. For more details, refer to the related
discussion at [1].
[1] https://lore.kernel.org/linux-iommu/bd1655c6-8b2f-4cfa-adb1-badc00d01811@in…
Link: https://patch.msgid.link/r/20250226104012.82079-1-yi.l.liu@intel.com
Suggested-by: Lu Baolu <baolu.lu(a)linux.intel.com>
Signed-off-by: Yi Liu <yi.l.liu(a)intel.com>
Reviewed-by: Kevin Tian <kevin.tian(a)intel.com>
Reviewed-by: Lu Baolu <baolu.lu(a)linux.intel.com>
Signed-off-by: Jason Gunthorpe <jgg(a)nvidia.com>
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
---
drivers/iommu/iommufd/hw_pagetable.c | 3 +++
tools/testing/selftests/iommu/iommufd.c | 4 ++++
2 files changed, 7 insertions(+)
diff --git a/drivers/iommu/iommufd/hw_pagetable.c b/drivers/iommu/iommufd/hw_pagetable.c
index d06bf6e6c19fd..2454627a8b61b 100644
--- a/drivers/iommu/iommufd/hw_pagetable.c
+++ b/drivers/iommu/iommufd/hw_pagetable.c
@@ -122,6 +122,9 @@ iommufd_hwpt_paging_alloc(struct iommufd_ctx *ictx, struct iommufd_ioas *ioas,
if ((flags & IOMMU_HWPT_ALLOC_DIRTY_TRACKING) &&
!device_iommu_capable(idev->dev, IOMMU_CAP_DIRTY_TRACKING))
return ERR_PTR(-EOPNOTSUPP);
+ if ((flags & IOMMU_HWPT_FAULT_ID_VALID) &&
+ (flags & IOMMU_HWPT_ALLOC_NEST_PARENT))
+ return ERR_PTR(-EOPNOTSUPP);
hwpt_paging = __iommufd_object_alloc(
ictx, hwpt_paging, IOMMUFD_OBJ_HWPT_PAGING, common.obj);
diff --git a/tools/testing/selftests/iommu/iommufd.c b/tools/testing/selftests/iommu/iommufd.c
index 4927b9add5add..06f252733660a 100644
--- a/tools/testing/selftests/iommu/iommufd.c
+++ b/tools/testing/selftests/iommu/iommufd.c
@@ -289,6 +289,10 @@ TEST_F(iommufd_ioas, alloc_hwpt_nested)
&test_hwpt_id);
test_err_hwpt_alloc(EINVAL, self->device_id, self->device_id, 0,
&test_hwpt_id);
+ test_err_hwpt_alloc(EOPNOTSUPP, self->device_id, self->ioas_id,
+ IOMMU_HWPT_ALLOC_NEST_PARENT |
+ IOMMU_HWPT_FAULT_ID_VALID,
+ &test_hwpt_id);
test_cmd_hwpt_alloc(self->device_id, self->ioas_id,
IOMMU_HWPT_ALLOC_NEST_PARENT,
--
2.39.5
From: Petr Machata <petrm(a)nvidia.com>
[ Upstream commit d9e9f6d7b7d0c520bb87f19d2cbc57aeeb2091d5 ]
Attempts to replace an MDB group membership of the host itself are
currently bounced:
# ip link add name br up type bridge vlan_filtering 1
# bridge mdb replace dev br port br grp 239.0.0.1 vid 2
# bridge mdb replace dev br port br grp 239.0.0.1 vid 2
Error: bridge: Group is already joined by host.
A similar operation done on a member port would succeed. Ignore the check
for replacement of host group memberships as well.
The bit of code that this enables is br_multicast_host_join(), which, for
already-joined groups only refreshes the MC group expiration timer, which
is desirable; and a userspace notification, also desirable.
Change a selftest that exercises this code path from expecting a rejection
to expecting a pass. The rest of MDB selftests pass without modification.
Signed-off-by: Petr Machata <petrm(a)nvidia.com>
Reviewed-by: Ido Schimmel <idosch(a)nvidia.com>
Acked-by: Nikolay Aleksandrov <razor(a)blackwall.org>
Link: https://patch.msgid.link/e5c5188b9787ae806609e7ca3aa2a0a501b9b5c4.173868564…
Signed-off-by: Jakub Kicinski <kuba(a)kernel.org>
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
---
net/bridge/br_mdb.c | 2 +-
tools/testing/selftests/net/forwarding/bridge_mdb.sh | 2 +-
2 files changed, 2 insertions(+), 2 deletions(-)
diff --git a/net/bridge/br_mdb.c b/net/bridge/br_mdb.c
index 1a52a0bca086d..7e1ad229e1330 100644
--- a/net/bridge/br_mdb.c
+++ b/net/bridge/br_mdb.c
@@ -1040,7 +1040,7 @@ static int br_mdb_add_group(const struct br_mdb_config *cfg,
/* host join */
if (!port) {
- if (mp->host_joined) {
+ if (mp->host_joined && !(cfg->nlflags & NLM_F_REPLACE)) {
NL_SET_ERR_MSG_MOD(extack, "Group is already joined by host");
return -EEXIST;
}
diff --git a/tools/testing/selftests/net/forwarding/bridge_mdb.sh b/tools/testing/selftests/net/forwarding/bridge_mdb.sh
index d9d587454d207..8c1597ebc2d38 100755
--- a/tools/testing/selftests/net/forwarding/bridge_mdb.sh
+++ b/tools/testing/selftests/net/forwarding/bridge_mdb.sh
@@ -149,7 +149,7 @@ cfg_test_host_common()
check_err $? "Failed to add $name host entry"
bridge mdb replace dev br0 port br0 grp $grp $state vid 10 &> /dev/null
- check_fail $? "Managed to replace $name host entry"
+ check_err $? "Failed to replace $name host entry"
bridge mdb del dev br0 port br0 grp $grp $state vid 10
bridge mdb get dev br0 grp $grp vid 10 &> /dev/null
--
2.39.5
From: Brendan Jackman <jackmanb(a)google.com>
[ Upstream commit 08fafac4c9f289a9d9a22d838921e4b3eb22c664 ]
As noted in [0], SeaBIOS (QEMU default) makes a mess of the terminal,
qboot does not.
It turns out this is actually useful with kunit.py, since the user is
exposed to this issue if they set --raw_output=all.
qboot is also faster than SeaBIOS, but it's is marginal for this
usecase.
[0] https://lore.kernel.org/all/CA+i-1C0wYb-gZ8Mwh3WSVpbk-LF-Uo+njVbASJPe1WXDUR…
Both SeaBIOS and qboot are x86-specific.
Link: https://lore.kernel.org/r/20250124-kunit-qboot-v1-1-815e4d4c6f7c@google.com
Signed-off-by: Brendan Jackman <jackmanb(a)google.com>
Reviewed-by: David Gow <davidgow(a)google.com>
Signed-off-by: Shuah Khan <skhan(a)linuxfoundation.org>
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
---
tools/testing/kunit/qemu_configs/x86_64.py | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/tools/testing/kunit/qemu_configs/x86_64.py b/tools/testing/kunit/qemu_configs/x86_64.py
index dc79490768630..4a6bf4e048f5b 100644
--- a/tools/testing/kunit/qemu_configs/x86_64.py
+++ b/tools/testing/kunit/qemu_configs/x86_64.py
@@ -7,4 +7,6 @@ CONFIG_SERIAL_8250_CONSOLE=y''',
qemu_arch='x86_64',
kernel_path='arch/x86/boot/bzImage',
kernel_command_line='console=ttyS0',
- extra_qemu_params=[])
+ # qboot is faster than SeaBIOS and doesn't mess up
+ # the terminal.
+ extra_qemu_params=['-bios', 'qboot.rom'])
--
2.39.5
From: Yi Liu <yi.l.liu(a)intel.com>
[ Upstream commit 1062d81086156e42878d701b816d2f368b53a77c ]
Allocating a domain with a fault ID indicates that the domain is faultable.
However, there is a gap for the nested parent domain to support PRI. Some
hardware lacks the capability to distinguish whether PRI occurs at stage 1
or stage 2. This limitation may require software-based page table walking
to resolve. Since no in-tree IOMMU driver currently supports this
functionality, it is disallowed. For more details, refer to the related
discussion at [1].
[1] https://lore.kernel.org/linux-iommu/bd1655c6-8b2f-4cfa-adb1-badc00d01811@in…
Link: https://patch.msgid.link/r/20250226104012.82079-1-yi.l.liu@intel.com
Suggested-by: Lu Baolu <baolu.lu(a)linux.intel.com>
Signed-off-by: Yi Liu <yi.l.liu(a)intel.com>
Reviewed-by: Kevin Tian <kevin.tian(a)intel.com>
Reviewed-by: Lu Baolu <baolu.lu(a)linux.intel.com>
Signed-off-by: Jason Gunthorpe <jgg(a)nvidia.com>
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
---
drivers/iommu/iommufd/hw_pagetable.c | 3 +++
tools/testing/selftests/iommu/iommufd.c | 4 ++++
2 files changed, 7 insertions(+)
diff --git a/drivers/iommu/iommufd/hw_pagetable.c b/drivers/iommu/iommufd/hw_pagetable.c
index 598be26a14e28..9b5b0b8522299 100644
--- a/drivers/iommu/iommufd/hw_pagetable.c
+++ b/drivers/iommu/iommufd/hw_pagetable.c
@@ -126,6 +126,9 @@ iommufd_hwpt_paging_alloc(struct iommufd_ctx *ictx, struct iommufd_ioas *ioas,
if ((flags & IOMMU_HWPT_ALLOC_DIRTY_TRACKING) &&
!device_iommu_capable(idev->dev, IOMMU_CAP_DIRTY_TRACKING))
return ERR_PTR(-EOPNOTSUPP);
+ if ((flags & IOMMU_HWPT_FAULT_ID_VALID) &&
+ (flags & IOMMU_HWPT_ALLOC_NEST_PARENT))
+ return ERR_PTR(-EOPNOTSUPP);
hwpt_paging = __iommufd_object_alloc(
ictx, hwpt_paging, IOMMUFD_OBJ_HWPT_PAGING, common.obj);
diff --git a/tools/testing/selftests/iommu/iommufd.c b/tools/testing/selftests/iommu/iommufd.c
index a1b2b657999dc..618c03bb6509b 100644
--- a/tools/testing/selftests/iommu/iommufd.c
+++ b/tools/testing/selftests/iommu/iommufd.c
@@ -439,6 +439,10 @@ TEST_F(iommufd_ioas, alloc_hwpt_nested)
&test_hwpt_id);
test_err_hwpt_alloc(EINVAL, self->device_id, self->device_id, 0,
&test_hwpt_id);
+ test_err_hwpt_alloc(EOPNOTSUPP, self->device_id, self->ioas_id,
+ IOMMU_HWPT_ALLOC_NEST_PARENT |
+ IOMMU_HWPT_FAULT_ID_VALID,
+ &test_hwpt_id);
test_cmd_hwpt_alloc(self->device_id, self->ioas_id,
IOMMU_HWPT_ALLOC_NEST_PARENT,
--
2.39.5
From: Niklas Cassel <cassel(a)kernel.org>
[ Upstream commit af1451b6738ec7cf91f2914f53845424959ec4ee ]
Currently BARs that have been disabled by the endpoint controller driver
will result in a test FAIL.
Returning FAIL for a BAR that is disabled seems overly pessimistic.
There are EPC that disables one or more BARs intentionally.
One reason for this is that there are certain EPCs that are hardwired to
expose internal PCIe controller registers over a certain BAR, so the EPC
driver disables such a BAR, such that the host will not overwrite random
registers during testing.
Such a BAR will be disabled by the EPC driver's init function, and the
BAR will be marked as BAR_RESERVED, such that it will be unavailable to
endpoint function drivers.
Let's return FAIL only for BARs that are actually enabled and failed the
test, and let's return skip for BARs that are not even enabled.
Signed-off-by: Niklas Cassel <cassel(a)kernel.org>
Reviewed-by: Manivannan Sadhasivam <manivannan.sadhasivam(a)linaro.org>
Link: https://lore.kernel.org/r/20250123120147.3603409-4-cassel@kernel.org
Signed-off-by: Manivannan Sadhasivam <manivannan.sadhasivam(a)linaro.org>
Signed-off-by: Krzysztof Wilczyński <kwilczynski(a)kernel.org>
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
---
tools/testing/selftests/pci_endpoint/pci_endpoint_test.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/tools/testing/selftests/pci_endpoint/pci_endpoint_test.c b/tools/testing/selftests/pci_endpoint/pci_endpoint_test.c
index c267b822c1081..576c590b277b1 100644
--- a/tools/testing/selftests/pci_endpoint/pci_endpoint_test.c
+++ b/tools/testing/selftests/pci_endpoint/pci_endpoint_test.c
@@ -65,6 +65,8 @@ TEST_F(pci_ep_bar, BAR_TEST)
int ret;
pci_ep_ioctl(PCITEST_BAR, variant->barno);
+ if (ret == -ENODATA)
+ SKIP(return, "BAR is disabled");
EXPECT_FALSE(ret) TH_LOG("Test failed for BAR%d", variant->barno);
}
--
2.39.5
From: Rae Moar <rmoar(a)google.com>
[ Upstream commit 1d4c06d51963f89f67c7b75d5c0c34e9d1bb2ae6 ]
A bug was identified where the KTAP below caused an infinite loop:
TAP version 13
ok 4 test_case
1..4
The infinite loop was caused by the parser not parsing a test plan
if following a test result line.
Fix this bug by parsing test plan line to avoid the infinite loop.
Link: https://lore.kernel.org/r/20250313192714.1380005-1-rmoar@google.com
Signed-off-by: Rae Moar <rmoar(a)google.com>
Reviewed-by: David Gow <davidgow(a)google.com>
Signed-off-by: Shuah Khan <shuah(a)kernel.org>
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
---
tools/testing/kunit/kunit_parser.py | 9 ++++-----
1 file changed, 4 insertions(+), 5 deletions(-)
diff --git a/tools/testing/kunit/kunit_parser.py b/tools/testing/kunit/kunit_parser.py
index 29fc27e8949bd..da53a709773a2 100644
--- a/tools/testing/kunit/kunit_parser.py
+++ b/tools/testing/kunit/kunit_parser.py
@@ -759,7 +759,7 @@ def parse_test(lines: LineStream, expected_num: int, log: List[str], is_subtest:
# If parsing the main/top-level test, parse KTAP version line and
# test plan
test.name = "main"
- ktap_line = parse_ktap_header(lines, test, printer)
+ parse_ktap_header(lines, test, printer)
test.log.extend(parse_diagnostic(lines))
parse_test_plan(lines, test)
parent_test = True
@@ -768,13 +768,12 @@ def parse_test(lines: LineStream, expected_num: int, log: List[str], is_subtest:
# the KTAP version line and/or subtest header line
ktap_line = parse_ktap_header(lines, test, printer)
subtest_line = parse_test_header(lines, test)
+ test.log.extend(parse_diagnostic(lines))
+ parse_test_plan(lines, test)
parent_test = (ktap_line or subtest_line)
if parent_test:
- # If KTAP version line and/or subtest header is found, attempt
- # to parse test plan and print test header
- test.log.extend(parse_diagnostic(lines))
- parse_test_plan(lines, test)
print_test_header(test, printer)
+
expected_count = test.expected_count
subtests = []
test_num = 1
--
2.39.5
After a long delay I'm posting next iteration of lockless /proc/pid/maps
reading patchset. Differences from v2 [1]:
- Add a set of tests concurrently modifying address space and checking for
correct reading results;
- Use new mmap_lock_speculate_xxx APIs for concurrent change detection and
retries;
- Add lockless PROCMAP_QUERY execution support;
The new tests are designed to check for any unexpected data tearing while
performing some common address space modifications (vma split, resize and
remap). Even before these changes, reading /proc/pid/maps might have
inconsistent data because the file is read page-by-page with mmap_lock
being dropped between the pages. Such tearing is expected and userspace
is supposed to deal with that possibility. An example of user-visible
inconsistency can be that the same vma is printed twice: once before
it was modified and then after the modifications. For example if vma was
extended, it might be found and reported twice. Whan is not expected is
to see a gap where there should have been a vma both before and after
modification. This patchset increases the chances of such tearing,
therefore it's event more important now to test for unexpected
inconsistencies.
Thanks to Paul McKenney who developed a benchmark to test performance
of concurrent reads and updates, we also have data on performance
benefits:
The test has a pair of processes scanning /proc/PID/maps, and another
process unmapping and remapping 4K pages from a 128MB range of anonymous
memory. At the end of each 10-second run, the latency of each mmap()
or munmap() operation is measured, and for each run the maximum and mean
latency is printed. (Yes, the map/unmap process is started first, its
PID is passed to the scanners, and then the map/unmap process waits until
both scanners are running before starting its timed test. The scanners
keep scanning until the specified /proc/PID/maps file disappears.)
In summary, with stock mm, 78% of the runs had maximum latencies in
excess of 0.5 milliseconds, and with more then half of the runs' latencies
exceeding a full millisecond. In contrast, 98% of the runs with Suren's
patch series applied had maximum latencies of less than 0.5 milliseconds.
From a median-performance viewpoint, Suren's series also looks good,
with stock mm weighing in at 13 microseconds and Suren's series at 10
microseconds, better than a 20% improvement.
[1] https://lore.kernel.org/all/20240123231014.3801041-1-surenb@google.com/
Suren Baghdasaryan (8):
selftests/proc: add /proc/pid/maps tearing from vma split test
selftests/proc: extend /proc/pid/maps tearing test to include vma
resizing
selftests/proc: extend /proc/pid/maps tearing test to include vma
remapping
selftests/proc: test PROCMAP_QUERY ioctl while vma is concurrently
modified
selftests/proc: add verbose more for tests to facilitate debugging
mm: make vm_area_struct anon_name field RCU-safe
mm/maps: read proc/pid/maps under RCU
mm/maps: execute PROCMAP_QUERY ioctl under RCU
fs/proc/internal.h | 6 +
fs/proc/task_mmu.c | 233 +++++-
include/linux/mm_inline.h | 28 +-
include/linux/mm_types.h | 3 +-
mm/madvise.c | 30 +-
tools/testing/selftests/proc/proc-pid-vm.c | 793 ++++++++++++++++++++-
6 files changed, 1061 insertions(+), 32 deletions(-)
base-commit: 79f35c4125a9a3fd98efeed4cce1cd7ce5311a44
--
2.49.0.805.g082f7c87e0-goog
I'd like to cut down the memory usage of parsing vmlinux BTF in ebpf-go.
With some upcoming changes the library is sitting at 5MiB for a parse.
Most of that memory is simply copying the BTF blob into user space.
By allowing vmlinux BTF to be mmapped read-only into user space I can
cut memory usage by about 75%.
Signed-off-by: Lorenz Bauer <lmb(a)isovalent.com>
---
Changes in v2:
- Use btf__new in selftest
- Avoid vm_iomap_memory in btf_vmlinux_mmap
- Add VM_DONTDUMP
- Add support to libbpf
- Link to v1: https://lore.kernel.org/r/20250501-vmlinux-mmap-v1-0-aa2724572598@isovalent…
---
Lorenz Bauer (3):
btf: allow mmap of vmlinux btf
selftests: bpf: add a test for mmapable vmlinux BTF
libbpf: Use mmap to parse vmlinux BTF from sysfs
include/asm-generic/vmlinux.lds.h | 3 +-
kernel/bpf/sysfs_btf.c | 36 +++++++++-
tools/lib/bpf/btf.c | 82 +++++++++++++++++++---
tools/testing/selftests/bpf/prog_tests/btf_sysfs.c | 82 ++++++++++++++++++++++
4 files changed, 189 insertions(+), 14 deletions(-)
---
base-commit: 38d976c32d85ef12dcd2b8a231196f7049548477
change-id: 20250501-vmlinux-mmap-2ec5563c3ef1
Best regards,
--
Lorenz Bauer <lmb(a)isovalent.com>
Context
=======
We've observed within Red Hat that isolated, NOHZ_FULL CPUs running a
pure-userspace application get regularly interrupted by IPIs sent from
housekeeping CPUs. Those IPIs are caused by activity on the housekeeping CPUs
leading to various on_each_cpu() calls, e.g.:
64359.052209596 NetworkManager 0 1405 smp_call_function_many_cond (cpu=0, func=do_kernel_range_flush)
smp_call_function_many_cond+0x1
smp_call_function+0x39
on_each_cpu+0x2a
flush_tlb_kernel_range+0x7b
__purge_vmap_area_lazy+0x70
_vm_unmap_aliases.part.42+0xdf
change_page_attr_set_clr+0x16a
set_memory_ro+0x26
bpf_int_jit_compile+0x2f9
bpf_prog_select_runtime+0xc6
bpf_prepare_filter+0x523
sk_attach_filter+0x13
sock_setsockopt+0x92c
__sys_setsockopt+0x16a
__x64_sys_setsockopt+0x20
do_syscall_64+0x87
entry_SYSCALL_64_after_hwframe+0x65
The heart of this series is the thought that while we cannot remove NOHZ_FULL
CPUs from the list of CPUs targeted by these IPIs, they may not have to execute
the callbacks immediately. Anything that only affects kernelspace can wait
until the next user->kernel transition, providing it can be executed "early
enough" in the entry code.
The original implementation is from Peter [1]. Nicolas then added kernel TLB
invalidation deferral to that [2], and I picked it up from there.
Deferral approach
=================
Storing each and every callback, like a secondary call_single_queue turned out
to be a no-go: the whole point of deferral is to keep NOHZ_FULL CPUs in
userspace for as long as possible - no signal of any form would be sent when
deferring an IPI. This means that any form of queuing for deferred callbacks
would end up as a convoluted memory leak.
Deferred IPIs must thus be coalesced, which this series achieves by assigning
IPIs a "type" and having a mapping of IPI type to callback, leveraged upon
kernel entry.
What about IPIs whose callback take a parameter, you may ask?
Peter suggested during OSPM23 [3] that since on_each_cpu() targets
housekeeping CPUs *and* isolated CPUs, isolated CPUs can access either global or
housekeeping-CPU-local state to "reconstruct" the data that would have been sent
via the IPI.
This series does not affect any IPI callback that requires an argument, but the
approach would remain the same (one coalescable callback executed on kernel
entry).
Kernel entry vs execution of the deferred operation
===================================================
This is what I've referred to as the "Danger Zone" during my LPC24 talk [4].
There is a non-zero length of code that is executed upon kernel entry before the
deferred operation can be itself executed (before we start getting into
context_tracking.c proper), i.e.:
idtentry_func_foo() <--- we're in the kernel
irqentry_enter()
enter_from_user_mode()
__ct_user_exit()
ct_kernel_enter_state()
ct_work_flush() <--- deferred operation is executed here
This means one must take extra care to what can happen in the early entry code,
and that <bad things> cannot happen. For instance, we really don't want to hit
instructions that have been modified by a remote text_poke() while we're on our
way to execute a deferred sync_core(). Patches doing the actual deferral have
more detail on this.
Where are we at with this whole thing?
======================================
Dave has been incredibly helpful wrt figuring out what would and wouldn't
(mostly that) be safe to do for deferring kernel range TLB flush IPIs, see [5].
Long story short, there are ugly things I can still do to (safely) defer the TLB
flush IPIs, but it's going to be a long session of pulling my own hair out, and
I got plenty so I won't be done for a while.
In the meantime, I think everything leading up to deferring text poke IPIs is
sane-ish and could get in. I'm not the biggest fan of adding an API with a
single user, but hey, I've been working on this for "a little while" now and
I'll still need to get the other IPIs sorted out.
TL;DR: Text patching IPI deferral LGTM so here it is for now, I'm still working
on the TLB flush thing.
Patches
=======
o Patches 1-2 are standalone objtool cleanups.
o Patches 3-4 add an RCU testing feature.
o Patches 5-6 add infrastructure for annotating static keys and static calls
that may be used in noinstr code (courtesy of Josh).
o Patches 7-20 use said annotations on relevant keys / calls.
o Patch 21 enforces proper usage of said annotations (courtesy of Josh).
o Patches 22-23 deal with detecting NOINSTR text in modules
o Patches 24-25 add the actual IPI deferral faff
Patches are also available at:
https://gitlab.com/vschneid/linux.git -b redhat/isolirq/defer/v5
Testing
=======
Xeon E5-2699 system with SMToff, NOHZ_FULL, isolated CPUs.
RHEL10 userspace.
Workload is using rteval (kernel compilation + hackbench) on housekeeping CPUs
and a dummy stay-in-userspace loop on the isolated CPUs. The main invocation is:
$ trace-cmd record -e "csd_queue_cpu" -f "cpu & CPUS{$ISOL_CPUS}" \
-e "ipi_send_cpumask" -f "cpumask & CPUS{$ISOL_CPUS}" \
-e "ipi_send_cpu" -f "cpu & CPUS{$ISOL_CPUS}" \
rteval --onlyload --loads-cpulist=$HK_CPUS \
--hackbench-runlowmem=True --duration=$DURATION
This only records IPIs sent to isolated CPUs, so any event there is interference
(with a bit of fuzz at the start/end of the workload when spawning the
processes). All tests were done with a duration of 3 hours.
v6.14
# This is the actual IPI count
$ trace-cmd report | grep callback | awk '{ print $(NF) }' | sort | uniq -c | sort -nr
93 callback=generic_smp_call_function_single_interrupt+0x0
22 callback=nohz_full_kick_func+0x0
# These are the different CSD's that caused IPIs
$ trace-cmd report | grep csd_queue | awk '{ print $(NF-1) }' | sort | uniq -c | sort -nr
1456 func=do_flush_tlb_all
78 func=do_sync_core
33 func=nohz_full_kick_func
26 func=do_kernel_range_flush
v6.14 + patches
# This is the actual IPI count
$ trace-cmd report | grep callback | awk '{ print $(NF) }' | sort | uniq -c | sort -nr
86 callback=generic_smp_call_function_single_interrupt+0x0
41 callback=nohz_full_kick_func+0x0
# These are the different CSD's that caused IPIs
$ trace-cmd report | grep csd_queue | awk '{ print $(NF-1) }' | sort | uniq -c | sort -nr
1378 func=do_flush_tlb_all
33 func=nohz_full_kick_func
So the TLB flush is still there driving most of the IPIs, but at least the
instruction patching IPIs are gone. With kernel TLB flushes deferred, there are
no IPIs sent to isolated CPUs in that 3hr window, but as stated above that still
needs some more work.
Also note that tlb_remove_table_smp_sync() showed up during testing of v3, and
has gone as mysteriously as it showed up. Yair had a series adressing this [6]
which per these results would be worth revisiting.
Acknowledgements
================
Special thanks to:
o Clark Williams for listening to my ramblings about this and throwing ideas my way
o Josh Poimboeuf for all his help with everything objtool-related
o All of the folks who attended various (too many?) talks about this and
provided precious feedback.
o The mm folks for pointing out what I can and can't do with TLB flushes
Links
=====
[1]: https://lore.kernel.org/all/20210929151723.162004989@infradead.org/
[2]: https://github.com/vianpl/linux.git -b ct-work-defer-wip
[3]: https://youtu.be/0vjE6fjoVVE
[4]: https://lpc.events/event/18/contributions/1889/
[5]: http://lore.kernel.org/r/eef09bdc-7546-462b-9ac0-661a44d2ceae@intel.com
[6]: https://lore.kernel.org/lkml/20230620144618.125703-1-ypodemsk@redhat.com/
Revisions
=========
v4 -> v5
++++++++
o Rebased onto v6.15-rc3
o Collected Reviewed-by
o Annotated a few more static keys
o Added proper checking of noinstr sections that are in loadable code such as
KVM early entry (Sean Christopherson)
o Switched to checking for CT_RCU_WATCHING instead of CT_STATE_KERNEL or
CT_STATE_IDLE, which means deferral is now behaving sanely for IRQ/NMI
entry from idle (thanks to Frederic!)
o Ditched the vmap TLB flush deferral (for now)
RFCv3 -> v4
+++++++++++
o Rebased onto v6.13-rc6
o New objtool patches from Josh
o More .noinstr static key/call patches
o Static calls now handled as well (again thanks to Josh)
o Fixed clearing the work bits on kernel exit
o Messed with IRQ hitting an idle CPU vs context tracking
o Various comment and naming cleanups
o Made RCU_DYNTICKS_TORTURE depend on !COMPILE_TEST (PeterZ)
o Fixed the CT_STATE_KERNEL check when setting a deferred work (Frederic)
o Cleaned up the __flush_tlb_all() mess thanks to PeterZ
RFCv2 -> RFCv3
++++++++++++++
o Rebased onto v6.12-rc6
o Added objtool documentation for the new warning (Josh)
o Added low-size RCU watching counter to TREE04 torture scenario (Paul)
o Added FORCEFUL jump label and static key types
o Added noinstr-compliant helpers for tlb flush deferral
RFCv1 -> RFCv2
++++++++++++++
o Rebased onto v6.5-rc1
o Updated the trace filter patches (Steven)
o Fixed __ro_after_init keys used in modules (Peter)
o Dropped the extra context_tracking atomic, squashed the new bits in the
existing .state field (Peter, Frederic)
o Added an RCU_EXPERT config for the RCU dynticks counter size, and added an
rcutorture case for a low-size counter (Paul)
o Fixed flush_tlb_kernel_range_deferrable() definition
Josh Poimboeuf (3):
jump_label: Add annotations for validating noinstr usage
static_call: Add read-only-after-init static calls
objtool: Add noinstr validation for static branches/calls
Valentin Schneider (22):
objtool: Make validate_call() recognize indirect calls to pv_ops[]
objtool: Flesh out warning related to pv_ops[] calls
rcu: Add a small-width RCU watching counter debug option
rcutorture: Make TREE04 use CONFIG_RCU_DYNTICKS_TORTURE
x86/paravirt: Mark pv_sched_clock static call as __ro_after_init
x86/idle: Mark x86_idle static call as __ro_after_init
x86/paravirt: Mark pv_steal_clock static call as __ro_after_init
riscv/paravirt: Mark pv_steal_clock static call as __ro_after_init
loongarch/paravirt: Mark pv_steal_clock static call as __ro_after_init
arm64/paravirt: Mark pv_steal_clock static call as __ro_after_init
arm/paravirt: Mark pv_steal_clock static call as __ro_after_init
perf/x86/amd: Mark perf_lopwr_cb static call as __ro_after_init
sched/clock: Mark sched_clock_running key as __ro_after_init
KVM: VMX: Mark __kvm_is_using_evmcs static key as __ro_after_init
x86/speculation/mds: Mark mds_idle_clear key as allowed in .noinstr
sched/clock, x86: Mark __sched_clock_stable key as allowed in .noinstr
KVM: VMX: Mark vmx_l1d_should flush and vmx_l1d_flush_cond keys as
allowed in .noinstr
stackleack: Mark stack_erasing_bypass key as allowed in .noinstr
module: Remove outdated comment about text_size
module: Add MOD_NOINSTR_TEXT mem_type
context-tracking: Introduce work deferral infrastructure
context_tracking,x86: Defer kernel text patching IPIs
arch/Kconfig | 9 ++
arch/arm/kernel/paravirt.c | 2 +-
arch/arm64/kernel/paravirt.c | 2 +-
arch/loongarch/kernel/paravirt.c | 2 +-
arch/riscv/kernel/paravirt.c | 2 +-
arch/x86/Kconfig | 1 +
arch/x86/events/amd/brs.c | 2 +-
arch/x86/include/asm/context_tracking_work.h | 18 +++
arch/x86/include/asm/text-patching.h | 1 +
arch/x86/kernel/alternative.c | 39 ++++++-
arch/x86/kernel/cpu/bugs.c | 2 +-
arch/x86/kernel/kprobes/core.c | 4 +-
arch/x86/kernel/kprobes/opt.c | 4 +-
arch/x86/kernel/module.c | 2 +-
arch/x86/kernel/paravirt.c | 4 +-
arch/x86/kernel/process.c | 2 +-
arch/x86/kvm/vmx/vmx.c | 11 +-
arch/x86/kvm/vmx/vmx_onhyperv.c | 2 +-
include/asm-generic/sections.h | 15 +++
include/linux/context_tracking.h | 21 ++++
include/linux/context_tracking_state.h | 54 +++++++--
include/linux/context_tracking_work.h | 26 +++++
include/linux/jump_label.h | 30 ++++-
include/linux/module.h | 6 +-
include/linux/objtool.h | 7 ++
include/linux/static_call.h | 19 ++++
kernel/context_tracking.c | 69 +++++++++++-
kernel/kprobes.c | 8 +-
kernel/module/main.c | 85 ++++++++++----
kernel/rcu/Kconfig.debug | 15 +++
kernel/sched/clock.c | 7 +-
kernel/stackleak.c | 6 +-
kernel/time/Kconfig | 5 +
tools/objtool/Documentation/objtool.txt | 34 ++++++
tools/objtool/check.c | 106 +++++++++++++++---
tools/objtool/include/objtool/check.h | 1 +
tools/objtool/include/objtool/elf.h | 1 +
tools/objtool/include/objtool/special.h | 1 +
tools/objtool/special.c | 15 ++-
.../selftests/rcutorture/configs/rcu/TREE04 | 1 +
40 files changed, 557 insertions(+), 84 deletions(-)
create mode 100644 arch/x86/include/asm/context_tracking_work.h
create mode 100644 include/linux/context_tracking_work.h
--
2.49.0
From: Chia-Yu Chang <chia-yu.chang(a)nokia-bell-labs.com>
Hello,
Please find DUALPI2 iproute2 patch v7.
v7 (05-May-25)
- Align pkt_sched.h with the v14 version of net-next due to spec modificiaotn in tc.yaml
- Reorganize dualpi2_print_opt() to match the order in tc.yaml
- Remove credit-queue in PRINT_JSON
v6 (26-Apr-25)
- Update JSON file output due to spec modifiocation in tc.yaml of net-next
v5 (25-Mar-25)
- Use matches() to replace current strcmp() (Stephen Hemminger <stephen(a)networkplumber.org>)
- Use general parse_percent() for handling scaled percentage values (Stephen Hemminger <stephen(a)networkplumber.org>)
- Add print function for JSON of dualpi2 stats (Stephen Hemminger <stephen(a)networkplumber.org>)
v4 (16-Mar-25)
- Add min_qlen_step to dualpi2 attribute as the minimum queue length in number of packets in the L-queue to start step amrking.
v3 (21-Feb-25)
- Add memlimit to dualpi2 attribute, and add memory_used, max_memory_used, memory_limit in dualpi2 stats (Dave Taht <dave.taht(a)gmail.com>)
- Update manual to align latest implementation and clarify the queue naming and default unit
- Use common "get_scaled_alpha_beta" and clean print_opt for Dualpi2
v2 (23-Oct-24)
- Rename get_float in dualpi2 to get_float_min_max in utils.c
- Move get_float from iplink_can.c in utils.c (Stephen Hemminger <stephen(a)networkplumber.org>)
- Add print function for JSON of dualpi2 (Stephen Hemminger <stephen(a)networkplumber.org>)
For more details of DualPI2, plesae refer IETF RFC9332
(https://datatracker.ietf.org/doc/html/rfc9332).
Best Regards,
Chia-Yu
Chia-Yu Chang (1):
tc: add dualpi2 scheduler module
bash-completion/tc | 11 +-
include/uapi/linux/pkt_sched.h | 67 +++++
include/utils.h | 2 +
ip/iplink_can.c | 14 -
lib/utils.c | 30 ++
man/man8/tc-dualpi2.8 | 249 ++++++++++++++++
tc/Makefile | 1 +
tc/q_dualpi2.c | 528 +++++++++++++++++++++++++++++++++
8 files changed, 887 insertions(+), 15 deletions(-)
create mode 100644 man/man8/tc-dualpi2.8
create mode 100644 tc/q_dualpi2.c
--
2.34.1
Fixes and cleanups for various issues in the vDSO selftests.
Signed-off-by: Thomas Weißschuh <thomas.weissschuh(a)linutronix.de>
---
Changes in v2:
- Refer to -Wstrict-prototypes over -Wold-style-prototypes
- Pick up Acks
- Enable fixed warnings in Makefile
- Link to v1: https://lore.kernel.org/r/20250502-selftests-vdso-fixes-v1-0-fb5d640a4f78@l…
---
Thomas Weißschuh (8):
selftests: vDSO: chacha: Correctly skip test if necessary
selftests: vDSO: clock_getres: Drop unused include of err.h
selftests: vDSO: vdso_test_getrandom: Drop unused include of linux/compiler.h
selftests: vDSO: vdso_test_getrandom: Drop some dead code
selftests: vDSO: vdso_config: Avoid -Wunused-variables
selftests: vDSO: enable -Wall
selftests: vDSO: vdso_test_correctness: Fix -Wstrict-prototypes
selftests: vDSO: vdso_test_getrandom: Always print TAP header
tools/testing/selftests/vDSO/Makefile | 2 +-
tools/testing/selftests/vDSO/vdso_config.h | 2 ++
tools/testing/selftests/vDSO/vdso_test_chacha.c | 3 ++-
tools/testing/selftests/vDSO/vdso_test_clock_getres.c | 1 -
tools/testing/selftests/vDSO/vdso_test_correctness.c | 2 +-
tools/testing/selftests/vDSO/vdso_test_getrandom.c | 18 +++++-------------
6 files changed, 11 insertions(+), 17 deletions(-)
---
base-commit: 0af2f6be1b4281385b618cb86ad946eded089ac8
change-id: 20250423-selftests-vdso-fixes-d2ce74142359
Best regards,
--
Thomas Weißschuh <thomas.weissschuh(a)linutronix.de>
Nolibc is useful for selftests as the test programs can be very small,
and compiled with just a kernel crosscompiler, without userspace support.
Currently nolibc is only usable with kselftest.h, not the more
convenient to use kselftest_harness.h
This series provides this compatibility by adding new features to nolibc
and removing the usage of problematic features from the harness.
The first half of the series are changes to the harness, the second one
are for nolibc. Both parts are very independent and should go through
different trees.
The last patch is not meant to be applied and serves as test that
everything works together correctly.
Signed-off-by: Thomas Weißschuh <thomas.weissschuh(a)linutronix.de>
---
Changes in v3:
- Send patches to correct kselftest harness maintainers
- Move harness selftest to dedicated directory
- Add harness selftest to MAINTAINERS
- Integrate harness selftest cleanup with the selftest framework
- Consistently use "kselftest harness" in commit messages
- Properly propagate kselftest harness failure
- Link to v2: https://lore.kernel.org/r/20250407-nolibc-kselftest-harness-v2-0-f8812f76e9…
Changes in v2:
- Rebase unto v6.15-rc1
- Rename internal nolibc symbols
- Handle edge case of waitpid(INT_MIN) == ESRCH
- Fix arm configurations for final testing patch
- Clean up global getopt.h variable declarations
- Add Acks from Willy
- Link to v1: https://lore.kernel.org/r/20250304-nolibc-kselftest-harness-v1-0-adca7cd231…
---
Thomas Weißschuh (32):
selftests: harness: Add kselftest harness selftest
selftests: harness: Use C89 comment style
selftests: harness: Ignore unused variant argument warning
selftests: harness: Mark functions without prototypes static
selftests: harness: Remove inline qualifier for wrappers
selftests: harness: Remove dependency on libatomic
selftests: harness: Implement test timeouts through pidfd
selftests: harness: Don't set setup_completed for fixtureless tests
selftests: harness: Always provide "self" and "variant"
selftests: harness: Move teardown conditional into test metadata
selftests: harness: Add teardown callback to test metadata
selftests: harness: Stop using setjmp()/longjmp()
selftests: harness: Guard includes on nolibc
tools/nolibc: handle intmax_t/uintmax_t in printf
tools/nolibc: use intmax definitions from compiler
tools/nolibc: use pselect6_time64 if available
tools/nolibc: use ppoll_time64 if available
tools/nolibc: add tolower() and toupper()
tools/nolibc: add _exit()
tools/nolibc: add setpgrp()
tools/nolibc: implement waitpid() in terms of waitid()
Revert "selftests/nolibc: use waitid() over waitpid()"
tools/nolibc: add dprintf() and vdprintf()
tools/nolibc: add getopt()
tools/nolibc: allow different write callbacks in printf
tools/nolibc: allow limiting of printf destination size
tools/nolibc: add snprintf() and friends
selftests/nolibc: use snprintf() for printf tests
selftests/nolibc: rename vfprintf test suite
selftests/nolibc: add test for snprintf() truncation
tools/nolibc: implement width padding in printf()
HACK: selftests/nolibc: demonstrate usage of the kselftest harness
MAINTAINERS | 1 +
tools/include/nolibc/Makefile | 1 +
tools/include/nolibc/getopt.h | 101 ++
tools/include/nolibc/nolibc.h | 1 +
tools/include/nolibc/stdint.h | 4 +-
tools/include/nolibc/stdio.h | 127 +-
tools/include/nolibc/string.h | 17 +
tools/include/nolibc/sys.h | 105 +-
tools/testing/selftests/Makefile | 1 +
tools/testing/selftests/kselftest_harness.h | 181 +-
.../testing/selftests/kselftest_harness/.gitignore | 2 +
tools/testing/selftests/kselftest_harness/Makefile | 7 +
.../selftests/kselftest_harness/harness-selftest.c | 129 ++
.../kselftest_harness/harness-selftest.expected | 62 +
.../kselftest_harness/harness-selftest.sh | 13 +
tools/testing/selftests/nolibc/Makefile | 13 +-
tools/testing/selftests/nolibc/harness-selftest.c | 1 +
tools/testing/selftests/nolibc/nolibc-test.c | 1729 +-------------------
tools/testing/selftests/nolibc/run-tests.sh | 2 +-
19 files changed, 637 insertions(+), 1860 deletions(-)
---
base-commit: 0af2f6be1b4281385b618cb86ad946eded089ac8
change-id: 20250130-nolibc-kselftest-harness-8b2c8cac43bf
Best regards,
--
Thomas Weißschuh <thomas.weissschuh(a)linutronix.de>
Hi Thadeu,
CC kunit
On Mon, 5 May 2025 at 14:13, Thadeu Lima de Souza Cascardo
<cascardo(a)igalia.com> wrote:
> On Mon, May 05, 2025 at 09:21:15AM +0200, Geert Uytterhoeven wrote:
> > On Wed, 30 Apr 2025 at 18:53, Thadeu Lima de Souza Cascardo
> > <cascardo(a)igalia.com> wrote:
> > > Since it uses __init symbols, it cannot be a module. Builds with
> > > CONFIG_TEST_MISC_MINOR=m will fail with:
> > >
> > > ERROR: modpost: "init_mknod" [drivers/misc/misc_minor_kunit.ko] undefined!
> > > ERROR: modpost: "init_unlink" [drivers/misc/misc_minor_kunit.ko] undefined!
> > >
> > > Reported-by: Stephen Rothwell <sfr(a)canb.auug.org.au>
> > > Closes: https://lore.kernel.org/linux-next/20250429155404.2b6fe5b1@canb.auug.org.au/
> > > Reported-by: kernel test robot <lkp(a)intel.com>
> > > Closes: https://lore.kernel.org/oe-kbuild-all/202504160338.BjUL3Owb-lkp@intel.com/
> > > Fixes: 45f0de4f8dc3 ("char: misc: add test cases")
> > > Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo(a)igalia.com>
> >
> > Thanks for your patch, which is now commit 20acf4dd46e4c090 ("char:
> > misc: make miscdevice unit test built-in only") in char-misc-next.
> >
> > > --- a/lib/Kconfig.debug
> > > +++ b/lib/Kconfig.debug
> > > @@ -2512,7 +2512,7 @@ config TEST_IDA
> > > tristate "Perform selftest on IDA functions"
> > >
> > > config TEST_MISC_MINOR
> > > - tristate "miscdevice KUnit test" if !KUNIT_ALL_TESTS
> > > + bool "miscdevice KUnit test" if !KUNIT_ALL_TESTS
> > > depends on KUNIT
> > > default KUNIT_ALL_TESTS
> >
> > This means "default y" if KUNIT_ALL_TESTS=m, which is IMHO not
> > what we want.
>
> The precedent for other kunit config options that are bool is that they use
> "default KUNIT_ALL_TESTS".
Seems like you are right. Looks like none of the boolean ones can
be enabled on m68k, which is where I run most of the tests, so I never
noticed before :-(
> It makes sense that if you choose to build all tests, you would not skip
> the ones that cannot be built as a module.
You can still enable the test manually if you want.
But I think it should not be enabled by default when all other tests
that can be modular are built as modules. Unlike for modular tests,
enabling builtin tests by default does impact the base kernel.
> > Perhaps
> >
> > default KUNIT_ALL_TESTS=y
> >
> > ?
Gr{oetje,eeting}s,
Geert
--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert(a)linux-m68k.org
In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
-- Linus Torvalds
Cong reported a warning when running ./test_sockmp:
https://lore.kernel.org/bpf/aAmIi0vlycHtbXeb@pop-os.localdomain/T/#t
------------[ cut here ]------------
WARNING: CPU: 1 PID: 40 at net/ipv4/af_inet.c inet_sock_destruct+0x173/0x1d5
Tainted: [W]=WARN
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.15.0-1 04/01/2014
Workqueue: events sk_psock_destroy
RIP: 0010:inet_sock_destruct+0x173/0x1d5
RSP: 0018:ffff8880085cfc18 EFLAGS: 00010202
RAX: 1ffff11003dbfc00 RBX: ffff88801edfe3e8 RCX: ffffffff822f5af4
RDX: 0000000000000007 RSI: dffffc0000000000 RDI: ffff88801edfe16c
RBP: ffff88801edfe184 R08: ffffed1003dbfc31 R09: 0000000000000000
R10: ffffffff822f5ab7 R11: ffff88801edfe187 R12: ffff88801edfdec0
R13: ffff888020376ac0 R14: ffff888020376ac0 R15: ffff888020376a60
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000556365155830 CR3: 000000001d6aa000 CR4: 0000000000350ef0
Call Trace:
<TASK>
__sk_destruct+0x46/0x222
sk_psock_destroy+0x22f/0x242
process_one_work+0x504/0x8a8
? process_one_work+0x39d/0x8a8
? __pfx_process_one_work+0x10/0x10
? worker_thread+0x44/0x2ae
? __list_add_valid_or_report+0x83/0xea
? srso_return_thunk+0x5/0x5f
? __list_add+0x45/0x52
process_scheduled_works+0x73/0x82
worker_thread+0x1ce/0x2ae
When we specify apply_bytes, we divide the msg into multiple segments,
each with a length of 'send', and every time we send this part of the data
using tcp_bpf_sendmsg_redir(), we use sk_msg_return_zero() to uncharge the
memory of the specified 'send' size.
However, if the first segment of data fails to send, for example, the
peer's buffer is full, we need to release all of the msg. When releasing
the msg, we haven't uncharged the memory of the subsequent segments.
This modification does not make significant logical changes, but only
fills in the missing uncharge places.
This issue has existed all along, until it was exposed after we added the
apply test in test_sockmap:
commit 3448ad23b34e ("selftests/bpf: Add apply_bytes test to test_txmsg_redir_wait_sndmem in test_sockmap")
Jiayuan Chen (2):
ktls, sockmap: Fix missing uncharge operation
selftests/bpf: Add test to cover sockmap with ktls
net/tls/tls_sw.c | 7 ++
.../selftests/bpf/prog_tests/sockmap_ktls.c | 76 +++++++++++++++++++
.../selftests/bpf/progs/test_sockmap_ktls.c | 10 +++
3 files changed, 93 insertions(+)
--
2.47.1
ksft runner sends 2 SIGTERMs in a row if a test runs out of time.
Handle this in a similar way we handle SIGINT - cleanup and stop
running further tests.
Because we get 2 signals we need a bit of logic to ignore
the subsequent one, they come immediately one after the other
(due to commit 9616cb34b08e ("kselftest/runner.sh: Propagate SIGTERM
to runner child")).
This change makes sure we run cleanup (scheduled defer()s)
and also print a stack trace on SIGTERM, which doesn't happen
by default. Tests occasionally hang in NIPA and it's impossible
to tell what they are waiting from or doing.
Signed-off-by: Jakub Kicinski <kuba(a)kernel.org>
---
v2:
- remove declaration at the global scope
v1: https://lore.kernel.org/20250425151757.1652517-1-kuba@kernel.org
CC: petrm(a)nvidia.com
CC: willemb(a)google.com
CC: sdf(a)fomichev.me
CC: linux-kselftest(a)vger.kernel.org
---
tools/testing/selftests/net/lib/py/ksft.py | 25 +++++++++++++++++++++-
1 file changed, 24 insertions(+), 1 deletion(-)
diff --git a/tools/testing/selftests/net/lib/py/ksft.py b/tools/testing/selftests/net/lib/py/ksft.py
index 3cfad0fd4570..1b815768bf8a 100644
--- a/tools/testing/selftests/net/lib/py/ksft.py
+++ b/tools/testing/selftests/net/lib/py/ksft.py
@@ -3,6 +3,7 @@
import builtins
import functools
import inspect
+import signal
import sys
import time
import traceback
@@ -26,6 +27,10 @@ KSFT_DISRUPTIVE = True
pass
+class KsftTerminate(KeyboardInterrupt):
+ pass
+
+
def ksft_pr(*objs, **kwargs):
print("#", *objs, **kwargs)
@@ -193,6 +198,17 @@ KSFT_DISRUPTIVE = True
return env
+def _ksft_intr(signum, frame):
+ # ksft runner.sh sends 2 SIGTERMs in a row on a timeout
+ # if we don't ignore the second one it will stop us from handling cleanup
+ global term_cnt
+ term_cnt += 1
+ if term_cnt == 1:
+ raise KsftTerminate()
+ else:
+ ksft_pr(f"Ignoring SIGTERM (cnt: {term_cnt}), already exiting...")
+
+
def ksft_run(cases=None, globs=None, case_pfx=None, args=()):
cases = cases or []
@@ -205,6 +221,10 @@ KSFT_DISRUPTIVE = True
cases.append(value)
break
+ global term_cnt
+ term_cnt = 0
+ prev_sigterm = signal.signal(signal.SIGTERM, _ksft_intr)
+
totals = {"pass": 0, "fail": 0, "skip": 0, "xfail": 0}
print("TAP version 13")
@@ -229,11 +249,12 @@ KSFT_DISRUPTIVE = True
cnt_key = 'xfail'
except BaseException as e:
stop |= isinstance(e, KeyboardInterrupt)
+ stop |= isinstance(e, KsftTerminate)
tb = traceback.format_exc()
for line in tb.strip().split('\n'):
ksft_pr("Exception|", line)
if stop:
- ksft_pr("Stopping tests due to KeyboardInterrupt.")
+ ksft_pr(f"Stopping tests due to {type(e).__name__}.")
KSFT_RESULT = False
cnt_key = 'fail'
@@ -248,6 +269,8 @@ KSFT_DISRUPTIVE = True
if stop:
break
+ signal.signal(signal.SIGTERM, prev_sigterm)
+
print(
f"# Totals: pass:{totals['pass']} fail:{totals['fail']} xfail:{totals['xfail']} xpass:0 skip:{totals['skip']} error:0"
)
--
2.49.0