The word 'accross' is wrong, so fix it.
Signed-off-by: Zhu Jun <zhujun2(a)cmss.chinamobile.com>
---
tools/testing/selftests/net/psock_tpacket.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/tools/testing/selftests/net/psock_tpacket.c b/tools/testing/selftests/net/psock_tpacket.c
index 404a2ce75..221270cee 100644
--- a/tools/testing/selftests/net/psock_tpacket.c
+++ b/tools/testing/selftests/net/psock_tpacket.c
@@ -12,7 +12,7 @@
*
* Datapath:
* Open a pair of packet sockets and send resp. receive an a priori known
- * packet pattern accross the sockets and check if it was received resp.
+ * packet pattern across the sockets and check if it was received resp.
* sent correctly. Fanout in combination with RX_RING is currently not
* tested here.
*
--
2.17.1
Danielle Ratson writes:
Currently, the sharedbuffer test fails sometimes because it is reading a
maximum occupancy that is larger than expected on some different cases.
This is happening because the test assumes that the packet it is sending
is the only packet being passed to the device.
In addition, some duplications on one hand, and redundant test cases on
the other hand, were found in the test.
Add egress filters on h1 and h2 that will guarantee that the packets in
the buffer are sent in the test, and remove the redundant test cases.
Danielle Ratson (3):
selftests: mlxsw: sharedbuffer: Remove h1 ingress test case
selftests: mlxsw: sharedbuffer: Remove duplicate test cases
selftests: mlxsw: sharedbuffer: Ensure no extra packets are counted
.../drivers/net/mlxsw/sharedbuffer.sh | 55 ++++++++++++++-----
1 file changed, 40 insertions(+), 15 deletions(-)
--
2.47.0
Hi all,
This patch series continues the work to migrate the script tests into
prog_tests.
test_xdp_meta.sh uses the BPF programs defined in progs/test_xdp_meta.c
to do a simple XDP/TC functional test that checks the metadata
allocation performed by the bpf_xdp_adjust_meta() helper.
This is already partly covered by two tests under prog_tests/:
- xdp_context_test_run.c uses bpf_prog_test_run_opts() to verify the
validity of the xdp_md context after a call to bpf_xdp_adjust_meta()
- xdp_metadata.c ensures that these meta-data can be exchanged through
an AF_XDP socket.
However test_xdp_meta.sh also verifies that the meta-data initialized
in the struct xdp_md is forwarded to the struct __sk_buff used by BPF
programs at 'TC level'. To cover this, I add a test case in
xdp_context_test_run.c that uses the same BPF programs from
progs/test_xdp_meta.c.
---
Bastien Curutchet (2):
selftests/bpf: test_xdp_meta: Rename BPF sections
selftests/bpf: Migrate test_xdp_meta.sh into xdp_context_test_run.c
tools/testing/selftests/bpf/Makefile | 1 -
.../bpf/prog_tests/xdp_context_test_run.c | 86 ++++++++++++++++++++++
tools/testing/selftests/bpf/progs/test_xdp_meta.c | 4 +-
tools/testing/selftests/bpf/test_xdp_meta.sh | 58 ---------------
4 files changed, 88 insertions(+), 61 deletions(-)
---
base-commit: 6849a3de3507a490fb0788c9bafbb2f29a904f05
change-id: 20241203-xdp_meta-868307cd0e03
Best regards,
--
Bastien Curutchet (eBPF Foundation) <bastien.curutchet(a)bootlin.com>
When compiling the pointer masking tests with -Wall this warning
is present:
pointer_masking.c: In function ‘test_tagged_addr_abi_sysctl’:
pointer_masking.c:203:9: warning: ignoring return value of ‘pwrite’
declared with attribute ‘warn_unused_result’ [-Wunused-result]
203 | pwrite(fd, &value, 1, 0); |
^~~~~~~~~~~~~~~~~~~~~~~~ pointer_masking.c:208:9: warning:
ignoring return value of ‘pwrite’ declared with attribute
‘warn_unused_result’ [-Wunused-result]
208 | pwrite(fd, &value, 1, 0);
I came across this on riscv64-linux-gnu-gcc (Ubuntu
11.4.0-1ubuntu1~22.04).
Fix this by checking that the number of bytes written equal the expected
number of bytes written.
Fixes: 7470b5afd150 ("riscv: selftests: Add a pointer masking test")
Signed-off-by: Charlie Jenkins <charlie(a)rivosinc.com>
---
Changes in v4:
- Skip sysctl_enabled test if first pwrite failed
- Link to v3: https://lore.kernel.org/r/20241205-fix_warnings_pointer_masking_tests-v3-1-…
Changes in v3:
- Fix sysctl enabled test case (Drew/Alex)
- Move pwrite err condition into goto (Drew)
- Link to v2: https://lore.kernel.org/r/20241204-fix_warnings_pointer_masking_tests-v2-1-…
Changes in v2:
- I had ret != 2 for testing, I changed it to be ret != 1.
- Link to v1: https://lore.kernel.org/r/20241204-fix_warnings_pointer_masking_tests-v1-1-…
---
tools/testing/selftests/riscv/abi/pointer_masking.c | 20 ++++++++++++++++++--
1 file changed, 18 insertions(+), 2 deletions(-)
diff --git a/tools/testing/selftests/riscv/abi/pointer_masking.c b/tools/testing/selftests/riscv/abi/pointer_masking.c
index dee41b7ee3e3..759445d5f265 100644
--- a/tools/testing/selftests/riscv/abi/pointer_masking.c
+++ b/tools/testing/selftests/riscv/abi/pointer_masking.c
@@ -189,6 +189,8 @@ static void test_tagged_addr_abi_sysctl(void)
{
char value;
int fd;
+ int ret;
+ char *err_pwrite_msg = "failed to write to /proc/sys/abi/tagged_addr_disabled\n";
ksft_print_msg("Testing tagged address ABI sysctl\n");
@@ -200,18 +202,32 @@ static void test_tagged_addr_abi_sysctl(void)
}
value = '1';
- pwrite(fd, &value, 1, 0);
+ ret = pwrite(fd, &value, 1, 0);
+ if (ret != 1) {
+ ksft_test_result_skip(err_pwrite_msg);
+ goto err_pwrite;
+ }
+
ksft_test_result(set_tagged_addr_ctrl(min_pmlen, true) == -EINVAL,
"sysctl disabled\n");
value = '0';
- pwrite(fd, &value, 1, 0);
+ ret = pwrite(fd, &value, 1, 0);
+ if (ret != 1)
+ goto err_pwrite;
+
ksft_test_result(set_tagged_addr_ctrl(min_pmlen, true) == 0,
"sysctl enabled\n");
set_tagged_addr_ctrl(0, false);
close(fd);
+
+ return;
+
+err_pwrite:
+ close(fd);
+ ksft_test_result_fail(err_pwrite_msg);
}
static void test_tagged_addr_abi_pmlen(int pmlen)
---
base-commit: 40384c840ea1944d7c5a392e8975ed088ecf0b37
change-id: 20241204-fix_warnings_pointer_masking_tests-3860e4f35429
--
- Charlie
Add ip_link_set_addr(), ip_link_set_up(), ip_addr_add() and ip_route_add()
to the suite of helpers that automatically schedule a corresponding
cleanup.
When setting a new MAC, one needs to remember the old address first. Move
mac_get() from forwarding/ to that end.
Signed-off-by: Petr Machata <petrm(a)nvidia.com>
Reviewed-by: Ido Schimmel <idosch(a)nvidia.com>
---
Notes:
CC: Shuah Khan <shuah(a)kernel.org>
CC: Benjamin Poirier <bpoirier(a)nvidia.com>
CC: Hangbin Liu <liuhangbin(a)gmail.com>
CC: Vladimir Oltean <vladimir.oltean(a)nxp.com>
CC: linux-kselftest(a)vger.kernel.org
tools/testing/selftests/net/forwarding/lib.sh | 7 ----
tools/testing/selftests/net/lib.sh | 39 +++++++++++++++++++
2 files changed, 39 insertions(+), 7 deletions(-)
diff --git a/tools/testing/selftests/net/forwarding/lib.sh b/tools/testing/selftests/net/forwarding/lib.sh
index 7337f398f9cc..1fd40bada694 100644
--- a/tools/testing/selftests/net/forwarding/lib.sh
+++ b/tools/testing/selftests/net/forwarding/lib.sh
@@ -932,13 +932,6 @@ packets_rate()
echo $(((t1 - t0) / interval))
}
-mac_get()
-{
- local if_name=$1
-
- ip -j link show dev $if_name | jq -r '.[]["address"]'
-}
-
ether_addr_to_u64()
{
local addr="$1"
diff --git a/tools/testing/selftests/net/lib.sh b/tools/testing/selftests/net/lib.sh
index 5ea6537acd2b..2cd5c743b2d9 100644
--- a/tools/testing/selftests/net/lib.sh
+++ b/tools/testing/selftests/net/lib.sh
@@ -435,6 +435,13 @@ xfail_on_veth()
fi
}
+mac_get()
+{
+ local if_name=$1
+
+ ip -j link show dev $if_name | jq -r '.[]["address"]'
+}
+
kill_process()
{
local pid=$1; shift
@@ -459,3 +466,35 @@ ip_link_set_master()
ip link set dev "$member" master "$master"
defer ip link set dev "$member" nomaster
}
+
+ip_link_set_addr()
+{
+ local name=$1; shift
+ local addr=$1; shift
+
+ local old_addr=$(mac_get "$name")
+ ip link set dev "$name" address "$addr"
+ defer ip link set dev "$name" address "$old_addr"
+}
+
+ip_link_set_up()
+{
+ local name=$1; shift
+
+ ip link set dev "$name" up
+ defer ip link set dev "$name" down
+}
+
+ip_addr_add()
+{
+ local name=$1; shift
+
+ ip addr add dev "$name" "$@"
+ defer ip addr del dev "$name" "$@"
+}
+
+ip_route_add()
+{
+ ip route add "$@"
+ defer ip route del "$@"
+}
--
2.47.0
This series is a follow-up to v1[1], aimed at making the watchdog selftest
more suitable for CI environments. Currently, in non-interactive setups,
the watchdog kselftest can only run with oneshot parameters, preventing the
testing of the WDIOC_KEEPALIVE ioctl since the ping loop is only
interrupted by SIGINT.
The first patch adds a new -c option to limit the number of watchdog pings,
allowing the test to be optionally finite. The second patch updates the
test output to conform to KTAP.
The default behavior remains unchanged: without the -c option, the
keep_alive() loop continues indefinitely until interrupted by SIGINT.
[1] https://lore.kernel.org/all/20240506111359.224579-1-laura.nao@collabora.com/
Changes in v2:
- The keep_alive() loop remains infinite by default
- Introduced keep_alive_res variable to track the WDIOC_KEEPALIVE ioctl return code for user reporting
Laura Nao (2):
selftests/watchdog: add -c option to limit the ping loop
selftests/watchdog: convert the test output to KTAP format
.../selftests/watchdog/watchdog-test.c | 169 +++++++++++-------
1 file changed, 103 insertions(+), 66 deletions(-)
--
2.30.2
Currently, user needs to manually enable transmit hardware timestamp
feature of certain Ethernet drivers, e.g. stmmac and igc drivers, through
following command after running the xdp_hw_metadata app.
sudo hwstamp_ctl -i eth0 -t 1
To simplify the step test of xdp_hw_metadata, set tx_type to HWTSTAMP_TX_ON
to enable hardware timestamping for all outgoing packets, so that user no
longer need to execute hwstamp_ctl command.
Signed-off-by: Song Yoong Siang <yoong.siang.song(a)intel.com>
Acked-by: Stanislav Fomichev <sdf(a)fomichev.me>
---
v1: https://patchwork.kernel.org/project/netdevbpf/patch/20241204115715.3148412…
v1->v2 changelog:
- Add detail in commit msg on why HWTSTAMP_TX_ON is needed (Stanislav).
- Separate the patch into two, current one submit to bpf-next,
another one submit to bpf.
---
tools/testing/selftests/bpf/xdp_hw_metadata.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/tools/testing/selftests/bpf/xdp_hw_metadata.c b/tools/testing/selftests/bpf/xdp_hw_metadata.c
index 06266aad2f99..96c65500f4b4 100644
--- a/tools/testing/selftests/bpf/xdp_hw_metadata.c
+++ b/tools/testing/selftests/bpf/xdp_hw_metadata.c
@@ -551,6 +551,7 @@ static void hwtstamp_enable(const char *ifname)
{
struct hwtstamp_config cfg = {
.rx_filter = HWTSTAMP_FILTER_ALL,
+ .tx_type = HWTSTAMP_TX_ON,
};
hwtstamp_ioctl(SIOCGHWTSTAMP, ifname, &saved_hwtstamp_cfg);
--
2.34.1
When compiling the pointer masking tests with -Wall this warning
is present:
pointer_masking.c: In function ‘test_tagged_addr_abi_sysctl’:
pointer_masking.c:203:9: warning: ignoring return value of ‘pwrite’
declared with attribute ‘warn_unused_result’ [-Wunused-result]
203 | pwrite(fd, &value, 1, 0); |
^~~~~~~~~~~~~~~~~~~~~~~~ pointer_masking.c:208:9: warning:
ignoring return value of ‘pwrite’ declared with attribute
‘warn_unused_result’ [-Wunused-result]
208 | pwrite(fd, &value, 1, 0);
I came across this on riscv64-linux-gnu-gcc (Ubuntu
11.4.0-1ubuntu1~22.04).
Fix this by checking that the number of bytes written equal the expected
number of bytes written.
Fixes: 7470b5afd150 ("riscv: selftests: Add a pointer masking test")
Signed-off-by: Charlie Jenkins <charlie(a)rivosinc.com>
---
Changes in v2:
- I had ret != 2 for testing, I changed it to be ret != 1.
- Link to v1: https://lore.kernel.org/r/20241204-fix_warnings_pointer_masking_tests-v1-1-…
---
tools/testing/selftests/riscv/abi/pointer_masking.c | 19 +++++++++++++++----
1 file changed, 15 insertions(+), 4 deletions(-)
diff --git a/tools/testing/selftests/riscv/abi/pointer_masking.c b/tools/testing/selftests/riscv/abi/pointer_masking.c
index dee41b7ee3e3..229d85ccff50 100644
--- a/tools/testing/selftests/riscv/abi/pointer_masking.c
+++ b/tools/testing/selftests/riscv/abi/pointer_masking.c
@@ -189,6 +189,7 @@ static void test_tagged_addr_abi_sysctl(void)
{
char value;
int fd;
+ int ret;
ksft_print_msg("Testing tagged address ABI sysctl\n");
@@ -200,14 +201,24 @@ static void test_tagged_addr_abi_sysctl(void)
}
value = '1';
- pwrite(fd, &value, 1, 0);
+ ret = pwrite(fd, &value, 1, 0);
+ if (ret != 1) {
+ ksft_test_result_fail("Write to /proc/sys/abi/tagged_addr_disabled failed.\n");
+ return;
+ }
+
ksft_test_result(set_tagged_addr_ctrl(min_pmlen, true) == -EINVAL,
"sysctl disabled\n");
value = '0';
- pwrite(fd, &value, 1, 0);
- ksft_test_result(set_tagged_addr_ctrl(min_pmlen, true) == 0,
- "sysctl enabled\n");
+ ret = pwrite(fd, &value, 1, 0);
+ if (ret != 1) {
+ ksft_test_result_fail("Write to /proc/sys/abi/tagged_addr_disabled failed.\n");
+ return;
+ }
+
+ ksft_test_result(set_tagged_addr_ctrl(min_pmlen, true) == -EINVAL,
+ "sysctl disabled\n");
set_tagged_addr_ctrl(0, false);
---
base-commit: 40384c840ea1944d7c5a392e8975ed088ecf0b37
change-id: 20241204-fix_warnings_pointer_masking_tests-3860e4f35429
--
- Charlie
When compiling the pointer masking tests with -Wall this warning
is present:
pointer_masking.c: In function ‘test_tagged_addr_abi_sysctl’:
pointer_masking.c:203:9: warning: ignoring return value of ‘pwrite’
declared with attribute ‘warn_unused_result’ [-Wunused-result]
203 | pwrite(fd, &value, 1, 0); |
^~~~~~~~~~~~~~~~~~~~~~~~ pointer_masking.c:208:9: warning:
ignoring return value of ‘pwrite’ declared with attribute
‘warn_unused_result’ [-Wunused-result]
208 | pwrite(fd, &value, 1, 0);
I came across this on riscv64-linux-gnu-gcc (Ubuntu
11.4.0-1ubuntu1~22.04).
Fix this by checking that the number of bytes written equal the expected
number of bytes written.
Fixes: 7470b5afd150 ("riscv: selftests: Add a pointer masking test")
Signed-off-by: Charlie Jenkins <charlie(a)rivosinc.com>
---
Changes in v3:
- Fix sysctl enabled test case (Drew/Alex)
- Move pwrite err condition into goto (Drew)
- Link to v2: https://lore.kernel.org/r/20241204-fix_warnings_pointer_masking_tests-v2-1-…
Changes in v2:
- I had ret != 2 for testing, I changed it to be ret != 1.
- Link to v1: https://lore.kernel.org/r/20241204-fix_warnings_pointer_masking_tests-v1-1-…
---
tools/testing/selftests/riscv/abi/pointer_masking.c | 17 +++++++++++++++--
1 file changed, 15 insertions(+), 2 deletions(-)
diff --git a/tools/testing/selftests/riscv/abi/pointer_masking.c b/tools/testing/selftests/riscv/abi/pointer_masking.c
index dee41b7ee3e3..2367b24a2b4e 100644
--- a/tools/testing/selftests/riscv/abi/pointer_masking.c
+++ b/tools/testing/selftests/riscv/abi/pointer_masking.c
@@ -189,6 +189,7 @@ static void test_tagged_addr_abi_sysctl(void)
{
char value;
int fd;
+ int ret;
ksft_print_msg("Testing tagged address ABI sysctl\n");
@@ -200,18 +201,30 @@ static void test_tagged_addr_abi_sysctl(void)
}
value = '1';
- pwrite(fd, &value, 1, 0);
+ ret = pwrite(fd, &value, 1, 0);
+ if (ret != 1)
+ goto err_pwrite;
+
ksft_test_result(set_tagged_addr_ctrl(min_pmlen, true) == -EINVAL,
"sysctl disabled\n");
value = '0';
- pwrite(fd, &value, 1, 0);
+ ret = pwrite(fd, &value, 1, 0);
+ if (ret != 1)
+ goto err_pwrite;
+
ksft_test_result(set_tagged_addr_ctrl(min_pmlen, true) == 0,
"sysctl enabled\n");
set_tagged_addr_ctrl(0, false);
close(fd);
+
+ return;
+
+err_pwrite:
+ close(fd);
+ ksft_test_result_fail("failed to write to /proc/sys/abi/tagged_addr_disabled\n");
}
static void test_tagged_addr_abi_pmlen(int pmlen)
---
base-commit: 40384c840ea1944d7c5a392e8975ed088ecf0b37
change-id: 20241204-fix_warnings_pointer_masking_tests-3860e4f35429
--
- Charlie
The sysctl tests for vm.memfd_noexec rely on the kernel to support PID
namespaces (i.e. the kernel is built with CONFIG_PID_NS=y). If the
kernel the test runs on does not support PID namespaces, the first
sysctl test will fail when attempting to spawn a new thread in a new
PID namespace, abort the test, preventing the remaining tests from
being run.
This is not desirable, as not all kernels need PID namespaces, but can
still use the other features provided by memfd. Therefore, only run the
sysctl tests if the kernel supports PID namespaces. Otherwise, skip
those tests and emit an informative message to let the user know why
the sysctl tests are not being run.
Fixes: 11f75a01448f ("selftests/memfd: add tests for MFD_NOEXEC_SEAL MFD_EXEC")
Cc: stable(a)vger.kernel.org # v6.6+
Cc: Jeff Xu <jeffxu(a)google.com>
Cc: Suren Baghdasaryan <surenb(a)google.com>
Cc: Kalesh Singh <kaleshsingh(a)google.com>
Signed-off-by: Isaac J. Manjarres <isaacmanjarres(a)google.com>
---
tools/testing/selftests/memfd/memfd_test.c | 14 ++++++++++++--
1 file changed, 12 insertions(+), 2 deletions(-)
diff --git a/tools/testing/selftests/memfd/memfd_test.c b/tools/testing/selftests/memfd/memfd_test.c
index 95af2d78fd31..0a0b55516028 100644
--- a/tools/testing/selftests/memfd/memfd_test.c
+++ b/tools/testing/selftests/memfd/memfd_test.c
@@ -9,6 +9,7 @@
#include <fcntl.h>
#include <linux/memfd.h>
#include <sched.h>
+#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <signal.h>
@@ -1557,6 +1558,11 @@ static void test_share_fork(char *banner, char *b_suffix)
close(fd);
}
+static bool pid_ns_supported(void)
+{
+ return access("/proc/self/ns/pid", F_OK) == 0;
+}
+
int main(int argc, char **argv)
{
pid_t pid;
@@ -1591,8 +1597,12 @@ int main(int argc, char **argv)
test_seal_grow();
test_seal_resize();
- test_sysctl_simple();
- test_sysctl_nested();
+ if (pid_ns_supported()) {
+ test_sysctl_simple();
+ test_sysctl_nested();
+ } else {
+ printf("PID namespaces are not supported; skipping sysctl tests\n");
+ }
test_share_dup("SHARE-DUP", "");
test_share_mmap("SHARE-MMAP", "");
--
2.47.0.338.g60cca15819-goog
When we fork anonymous pages, apply a guard page then remove it, the
previous CoW mapping is cleared.
This might not be obvious to an outside observer without taking some time
to think about how the overall process functions, so document that this is
the case through a test, which also usefully asserts that the behaviour is
as we expect.
This is grouped with other, more important, fork tests that ensure that
guard pages are correctly propagated on fork.
Fix a typo in a nearby comment at the same time.
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes(a)oracle.com>
---
tools/testing/selftests/mm/guard-pages.c | 73 +++++++++++++++++++++++-
1 file changed, 72 insertions(+), 1 deletion(-)
diff --git a/tools/testing/selftests/mm/guard-pages.c b/tools/testing/selftests/mm/guard-pages.c
index 7cdf815d0d63..d8f8dee9ebbd 100644
--- a/tools/testing/selftests/mm/guard-pages.c
+++ b/tools/testing/selftests/mm/guard-pages.c
@@ -990,7 +990,7 @@ TEST_F(guard_pages, fork)
MAP_ANON | MAP_PRIVATE, -1, 0);
ASSERT_NE(ptr, MAP_FAILED);
- /* Establish guard apges in the first 5 pages. */
+ /* Establish guard pages in the first 5 pages. */
ASSERT_EQ(madvise(ptr, 5 * page_size, MADV_GUARD_INSTALL), 0);
pid = fork();
@@ -1029,6 +1029,77 @@ TEST_F(guard_pages, fork)
ASSERT_EQ(munmap(ptr, 10 * page_size), 0);
}
+/*
+ * Assert expected behaviour after we fork populated ranges of anonymous memory
+ * and then guard and unguard the range.
+ */
+TEST_F(guard_pages, fork_cow)
+{
+ const unsigned long page_size = self->page_size;
+ char *ptr;
+ pid_t pid;
+ int i;
+
+ /* Map 10 pages. */
+ ptr = mmap(NULL, 10 * page_size, PROT_READ | PROT_WRITE,
+ MAP_ANON | MAP_PRIVATE, -1, 0);
+ ASSERT_NE(ptr, MAP_FAILED);
+
+ /* Populate range. */
+ for (i = 0; i < 10 * page_size; i++) {
+ char chr = 'a' + (i % 26);
+
+ ptr[i] = chr;
+ }
+
+ pid = fork();
+ ASSERT_NE(pid, -1);
+ if (!pid) {
+ /* This is the child process now. */
+
+ /* Ensure the range is as expected. */
+ for (i = 0; i < 10 * page_size; i++) {
+ char expected = 'a' + (i % 26);
+ char actual = ptr[i];
+
+ ASSERT_EQ(actual, expected);
+ }
+
+ /* Establish guard pages across the whole range. */
+ ASSERT_EQ(madvise(ptr, 10 * page_size, MADV_GUARD_INSTALL), 0);
+ /* Remove it. */
+ ASSERT_EQ(madvise(ptr, 10 * page_size, MADV_GUARD_REMOVE), 0);
+
+ /*
+ * By removing the guard pages, the page tables will be
+ * cleared. Assert that we are looking at the zero page now.
+ */
+ for (i = 0; i < 10 * page_size; i++) {
+ char actual = ptr[i];
+
+ ASSERT_EQ(actual, '\0');
+ }
+
+ exit(0);
+ }
+
+ /* Parent process. */
+
+ /* Parent simply waits on child. */
+ waitpid(pid, NULL, 0);
+
+ /* Ensure the range is unchanged in parent anon range. */
+ for (i = 0; i < 10 * page_size; i++) {
+ char expected = 'a' + (i % 26);
+ char actual = ptr[i];
+
+ ASSERT_EQ(actual, expected);
+ }
+
+ /* Cleanup. */
+ ASSERT_EQ(munmap(ptr, 10 * page_size), 0);
+}
+
/*
* Assert that forking a process with VMAs that do have VM_WIPEONFORK set
* behave as expected.
--
2.47.1
In 'NOFENTRY_ARGS' test case for syntax check, any offset X of
`vfs_read+X` except function entry offset (0) fits the criterion,
even if that offset is not at instruction boundary, as the parser
comes before probing. But with "ENDBR64" instruction on x86, offset
4 is treated as function entry. So, X can't be 4 as well. Thus, 8
was used as offset for the test case. On 64-bit powerpc though, any
offset <= 16 can be considered function entry depending on build
configuration (see arch_kprobe_on_func_entry() for implementation
details). So, use `vfs_read+20` to accommodate that scenario too.
Suggested-by: Masami Hiramatsu <mhiramat(a)kernel.org>
Signed-off-by: Hari Bathini <hbathini(a)linux.ibm.com>
---
Changes in v2:
* Use 20 as offset for all arches.
.../selftests/ftrace/test.d/kprobe/kprobe_syntax_errors.tc | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/tools/testing/selftests/ftrace/test.d/kprobe/kprobe_syntax_errors.tc b/tools/testing/selftests/ftrace/test.d/kprobe/kprobe_syntax_errors.tc
index a16c6a6f6055..8f1c58f0c239 100644
--- a/tools/testing/selftests/ftrace/test.d/kprobe/kprobe_syntax_errors.tc
+++ b/tools/testing/selftests/ftrace/test.d/kprobe/kprobe_syntax_errors.tc
@@ -111,7 +111,7 @@ check_error 'p vfs_read $arg* ^$arg*' # DOUBLE_ARGS
if !grep -q 'kernel return probes support:' README; then
check_error 'r vfs_read ^$arg*' # NOFENTRY_ARGS
fi
-check_error 'p vfs_read+8 ^$arg*' # NOFENTRY_ARGS
+check_error 'p vfs_read+20 ^$arg*' # NOFENTRY_ARGS
check_error 'p vfs_read ^hoge' # NO_BTFARG
check_error 'p kfree ^$arg10' # NO_BTFARG (exceed the number of parameters)
check_error 'r kfree ^$retval' # NO_RETVAL
--
2.47.0
`MFD_NOEXEC_SEAL` should remove the executable bits and set `F_SEAL_EXEC`
to prevent further modifications to the executable bits as per the comment
in the uapi header file:
not executable and sealed to prevent changing to executable
However, commit 105ff5339f498a ("mm/memfd: add MFD_NOEXEC_SEAL and MFD_EXEC")
that introduced this feature made it so that `MFD_NOEXEC_SEAL` unsets
`F_SEAL_SEAL`, essentially acting as a superset of `MFD_ALLOW_SEALING`.
Nothing implies that it should be so, and indeed up until the second version
of the of the patchset[0] that introduced `MFD_EXEC` and `MFD_NOEXEC_SEAL`,
`F_SEAL_SEAL` was not removed, however, it was changed in the third revision
of the patchset[1] without a clear explanation.
This behaviour is surprising for application developers, there is no
documentation that would reveal that `MFD_NOEXEC_SEAL` has the additional
effect of `MFD_ALLOW_SEALING`. Additionally, combined with `vm.memfd_noexec=2`
it has the effect of making all memfds initially sealable.
So do not remove `F_SEAL_SEAL` when `MFD_NOEXEC_SEAL` is requested,
thereby returning to the pre-Linux 6.3 behaviour of only allowing
sealing when `MFD_ALLOW_SEALING` is specified.
Now, this is technically a uapi break. However, the damage is expected
to be minimal. To trigger user visible change, a program has to do the
following steps:
- create memfd:
- with `MFD_NOEXEC_SEAL`,
- without `MFD_ALLOW_SEALING`;
- try to add seals / check the seals.
But that seems unlikely to happen intentionally since this change
essentially reverts the kernel's behaviour to that of Linux <6.3,
so if a program worked correctly on those older kernels, it will
likely work correctly after this change.
I have used Debian Code Search and GitHub to try to find potential
breakages, and I could only find a single one. dbus-broker's
memfd_create() wrapper is aware of this implicit `MFD_ALLOW_SEALING`
behaviour, and tries to work around it[2]. This workaround will
break. Luckily, this only affects the test suite, it does not affect
the normal operations of dbus-broker. There is a PR with a fix[3].
I also carried out a smoke test by building a kernel with this change
and booting an Arch Linux system into GNOME and Plasma sessions.
There was also a previous attempt to address this peculiarity by
introducing a new flag[4].
[0]: https://lore.kernel.org/lkml/20220805222126.142525-3-jeffxu@google.com/
[1]: https://lore.kernel.org/lkml/20221202013404.163143-3-jeffxu@google.com/
[2]: https://github.com/bus1/dbus-broker/blob/9eb0b7e5826fc76cad7b025bc46f267d4a…
[3]: https://github.com/bus1/dbus-broker/pull/366
[4]: https://lore.kernel.org/lkml/20230714114753.170814-1-david@readahead.eu/
Cc: stable(a)vger.kernel.org
Signed-off-by: Barnabás Pőcze <pobrn(a)protonmail.com>
---
* v3: https://lore.kernel.org/linux-mm/20240611231409.3899809-1-jeffxu@chromium.o…
* v2: https://lore.kernel.org/linux-mm/20240524033933.135049-1-jeffxu@google.com/
* v1: https://lore.kernel.org/linux-mm/20240513191544.94754-1-pobrn@protonmail.co…
This fourth version returns to removing the inconsistency as opposed to documenting
its existence, with the same code change as v1 but with a somewhat extended commit
message. This is sent because I believe it is worth at least a try; it can be easily
reverted if bigger application breakages are discovered than initially imagined.
---
mm/memfd.c | 9 ++++-----
tools/testing/selftests/memfd/memfd_test.c | 2 +-
2 files changed, 5 insertions(+), 6 deletions(-)
diff --git a/mm/memfd.c b/mm/memfd.c
index 7d8d3ab3fa37..8b7f6afee21d 100644
--- a/mm/memfd.c
+++ b/mm/memfd.c
@@ -356,12 +356,11 @@ SYSCALL_DEFINE2(memfd_create,
inode->i_mode &= ~0111;
file_seals = memfd_file_seals_ptr(file);
- if (file_seals) {
- *file_seals &= ~F_SEAL_SEAL;
+ if (file_seals)
*file_seals |= F_SEAL_EXEC;
- }
- } else if (flags & MFD_ALLOW_SEALING) {
- /* MFD_EXEC and MFD_ALLOW_SEALING are set */
+ }
+
+ if (flags & MFD_ALLOW_SEALING) {
file_seals = memfd_file_seals_ptr(file);
if (file_seals)
*file_seals &= ~F_SEAL_SEAL;
diff --git a/tools/testing/selftests/memfd/memfd_test.c b/tools/testing/selftests/memfd/memfd_test.c
index 95af2d78fd31..7b78329f65b6 100644
--- a/tools/testing/selftests/memfd/memfd_test.c
+++ b/tools/testing/selftests/memfd/memfd_test.c
@@ -1151,7 +1151,7 @@ static void test_noexec_seal(void)
mfd_def_size,
MFD_CLOEXEC | MFD_NOEXEC_SEAL);
mfd_assert_mode(fd, 0666);
- mfd_assert_has_seals(fd, F_SEAL_EXEC);
+ mfd_assert_has_seals(fd, F_SEAL_SEAL | F_SEAL_EXEC);
mfd_fail_chmod(fd, 0777);
close(fd);
}
--
2.45.2
Currently, sendmmsg is implemented in udpgso_bench_tx.c,
but it is not called by any test script.
This patch adds a test for sendmmsg in udpgso_bench.sh.
This allows for basic API testing and benchmarking
comparisons with GSO.
Signed-off-by: Kenjiro Nakayama <nakayamakenjiro(a)gmail.com>
---
tools/testing/selftests/net/udpgso_bench.sh | 3 +++
1 file changed, 3 insertions(+)
diff --git a/tools/testing/selftests/net/udpgso_bench.sh b/tools/testing/selftests/net/udpgso_bench.sh
index 640bc43452fa..88fa1d53ba2b 100755
--- a/tools/testing/selftests/net/udpgso_bench.sh
+++ b/tools/testing/selftests/net/udpgso_bench.sh
@@ -92,6 +92,9 @@ run_udp() {
echo "udp"
run_in_netns ${args}
+ echo "udp sendmmsg"
+ run_in_netns ${args} -m
+
echo "udp gso"
run_in_netns ${args} -S 0
--
2.39.3 (Apple Git-146)
If fopen succeeds, the fscanf function is called to read the data.
Regardless of whether fscanf is successful, you need to run
fclose(proc) to prevent memory leaks.
Signed-off-by: liujing <liujing(a)cmss.chinamobile.com>
diff --git a/tools/testing/selftests/timens/procfs.c b/tools/testing/selftests/timens/procfs.c
index 1833ca97eb24..e47844a73c31 100644
--- a/tools/testing/selftests/timens/procfs.c
+++ b/tools/testing/selftests/timens/procfs.c
@@ -79,9 +79,11 @@ static int read_proc_uptime(struct timespec *uptime)
if (fscanf(proc, "%lu.%02lu", &up_sec, &up_nsec) != 2) {
if (errno) {
pr_perror("fscanf");
+ fclose(proc);
return -errno;
}
pr_err("failed to parse /proc/uptime");
+ fclose(proc);
return -1;
}
fclose(proc);
--
2.27.0
After using the fopen function successfully, you need to call
fclose to close the file before finally returning.
Signed-off-by: liujing <liujing(a)cmss.chinamobile.com>
diff --git a/tools/testing/selftests/prctl/set-process-name.c b/tools/testing/selftests/prctl/set-process-name.c
index 562f707ba771..7be7afff0cd2 100644
--- a/tools/testing/selftests/prctl/set-process-name.c
+++ b/tools/testing/selftests/prctl/set-process-name.c
@@ -66,9 +66,12 @@ int check_name(void)
return -EIO;
fscanf(fptr, "%s", output);
- if (ferror(fptr))
+ if (ferror(fptr)) {
+ fclose(fptr);
return -EIO;
+ }
+ fclose(fptr);
int res = prctl(PR_GET_NAME, name, NULL, NULL, NULL);
if (res < 0)
--
2.27.0
Commit f803bcf9208a ("selftests/bpf: Prevent client connect before
server bind in test_tc_tunnel.sh") added code that waits for the
netcat server to start before the netcat client attempts to connect to
it. However, not all calls to 'server_listen' were guarded.
This patch adds the existing 'wait_for_port' guard after the remaining
call to 'server_listen'.
Fixes: f803bcf9208a ("selftests/bpf: Prevent client connect before server bind in test_tc_tunnel.sh")
Signed-off-by: Marco Leogrande <leogrande(a)google.com>
---
tools/testing/selftests/bpf/test_tc_tunnel.sh | 1 +
1 file changed, 1 insertion(+)
diff --git a/tools/testing/selftests/bpf/test_tc_tunnel.sh b/tools/testing/selftests/bpf/test_tc_tunnel.sh
index 7989ec6084545..cb55a908bb0d7 100755
--- a/tools/testing/selftests/bpf/test_tc_tunnel.sh
+++ b/tools/testing/selftests/bpf/test_tc_tunnel.sh
@@ -305,6 +305,7 @@ else
client_connect
verify_data
server_listen
+ wait_for_port ${port} ${netcat_opt}
fi
# serverside, use BPF for decap
--
2.47.0.338.g60cca15819-goog
The current implementation of netconsole sends all log messages in
parallel, which can lead to an intermixed and interleaved output on the
receiving side. This makes it challenging to demultiplex the messages
and attribute them to their originating CPUs.
As a result, users and developers often struggle to effectively analyze
and debug the parallel log output received through netconsole.
Example of a message got from produciton hosts:
------------[ cut here ]------------
------------[ cut here ]------------
refcount_t: saturated; leaking memory.
WARNING: CPU: 2 PID: 1613668 at lib/refcount.c:22 refcount_warn_saturate+0x5e/0xe0
refcount_t: addition on 0; use-after-free.
WARNING: CPU: 26 PID: 4139916 at lib/refcount.c:25 refcount_warn_saturate+0x7d/0xe0
Modules linked in: bpf_preload(E) vhost_net(E) tun(E) vhost(E)
This series of patches introduces a new feature to the netconsole
subsystem that allows the automatic population of the CPU number in the
userdata field for each log message. This enhancement provides several
benefits:
* Improved demultiplexing of parallel log output: When multiple CPUs are
sending messages concurrently, the added CPU number in the userdata
makes it easier to differentiate and attribute the messages to their
originating CPUs.
* Better visibility into message sources: The CPU number information
gives users and developers more insight into which specific CPU a
particular log message came from, which can be valuable for debugging
and analysis.
The changes in this series are as follows:
Patch "Ensure dynamic_netconsole_mutex is held during userdata update"
Add a lockdep assert to make sure dynamic_netconsole_mutex is held when
calling update_userdata().
Patch "netconsole: Add option to auto-populate CPU number in userdata"
Adds a new option to enable automatic CPU number population in the
netconsole userdata Provides a new "populate_cpu_nr" sysfs attribute to
control this feature
Patch "netconsole: selftest: test CPU number auto-population"
Expands the existing netconsole selftest to verify the CPU number
auto-population functionality Ensures the received netconsole messages
contain the expected "cpu=" entry in the userdata
Patch "netconsole: docs: Add documentation for CPU number auto-population"
Updates the netconsole documentation to explain the new CPU number
auto-population feature Provides instructions on how to enable and use
the feature
I believe these changes will be a valuable addition to the netconsole
subsystem, enhancing its usefulness for kernel developers and users.
Signed-off-by: Breno Leitao <leitao(a)debian.org>
---
Breno Leitao (4):
netconsole: Ensure dynamic_netconsole_mutex is held during userdata update
netconsole: Add option to auto-populate CPU number in userdata
netconsole: docs: Add documentation for CPU number auto-population
netconsole: selftest: Validate CPU number auto-population in userdata
Documentation/networking/netconsole.rst | 44 +++++++++++++++
drivers/net/netconsole.c | 63 ++++++++++++++++++++++
.../testing/selftests/drivers/net/netcons_basic.sh | 18 +++++++
3 files changed, 125 insertions(+)
---
base-commit: a58f00ed24b849d449f7134fd5d86f07090fe2f5
change-id: 20241108-netcon_cpu-ce3917e88f4b
Best regards,
--
Breno Leitao <leitao(a)debian.org>
Changes v6:
- Rebase onto latest kselftests-next.
- Looking at the two patches with a fresh eye decided to make a split
along the lines of:
- Patch 1/2 contains all of the code that relates to SNC mode
detection and checking that detection's reliability.
- Patch 2/2 contains checking kernel support for SNC and
modifying the messages at the end of affected tests.
Changes v5:
- Tests are skipped if snc_unreliable was set.
- Moved resctrlfs.c changes from patch 2/2 to 1/2.
- Removed CAT changes since it's not impacted by SNC in the selftest.
- Updated various comments.
- Fixed a bunch of minor issues pointed out in the review.
Changes v4:
- Printing SNC warnings at the start of every test.
- Printing SNC warnings at the end of every relevant test.
- Remove global snc_mode variable, consolidate snc detection functions
into one.
- Correct minor mistakes.
Changes v3:
- Reworked patch 2.
- Changed minor things in patch 1 like function name and made
corrections to the patch message.
Changes v2:
- Removed patches 2 and 3 since now this part will be supported by the
kernel.
Sub-Numa Clustering (SNC) allows splitting CPU cores, caches and memory
into multiple NUMA nodes. When enabled, NUMA-aware applications can
achieve better performance on bigger server platforms.
SNC support in the kernel was merged into x86/cache [1]. With SNC enabled
and kernel support in place all the tests will function normally (aside
from effective cache size). There might be a problem when SNC is enabled
but the system is still using an older kernel version without SNC
support. Currently the only message displayed in that situation is a
guess that SNC might be enabled and is causing issues. That message also
is displayed whenever the test fails on an Intel platform.
Add a mechanism to discover kernel support for SNC which will add more
meaning and certainty to the error message.
Add runtime SNC mode detection and verify how reliable that information
is.
Series was tested on Ice Lake server platforms with SNC disabled, SNC-2
and SNC-4. The tests were also ran with and without kernel support for
SNC.
Series applies cleanly on kselftest/next.
[1] https://lore.kernel.org/all/20240628215619.76401-1-tony.luck@intel.com/
Previous versions of this series:
[v1] https://lore.kernel.org/all/cover.1709721159.git.maciej.wieczor-retman@inte…
[v2] https://lore.kernel.org/all/cover.1715769576.git.maciej.wieczor-retman@inte…
[v3] https://lore.kernel.org/all/cover.1719842207.git.maciej.wieczor-retman@inte…
[v4] https://lore.kernel.org/all/cover.1720774981.git.maciej.wieczor-retman@inte…
[v5] https://lore.kernel.org/all/cover.1730206468.git.maciej.wieczor-retman@inte…
Maciej Wieczor-Retman (2):
selftests/resctrl: Adjust effective L3 cache size with SNC enabled
selftests/resctrl: Discover SNC kernel support and adjust messages
tools/testing/selftests/resctrl/cmt_test.c | 4 +-
tools/testing/selftests/resctrl/mba_test.c | 2 +
tools/testing/selftests/resctrl/mbm_test.c | 4 +-
tools/testing/selftests/resctrl/resctrl.h | 5 +
.../testing/selftests/resctrl/resctrl_tests.c | 9 +-
tools/testing/selftests/resctrl/resctrlfs.c | 129 ++++++++++++++++++
6 files changed, 148 insertions(+), 5 deletions(-)
--
2.47.1
From: Yicong Yang <yangyicong(a)hisilicon.com>
Armv8.7 introduces single-copy atomic 64-byte loads and stores
instructions and its variants named under FEAT_{LS64, LS64_V,
LS64_ACCDATA}. Add support for Armv8.7 FEAT_{LS64, LS64_V, LS64_ACCDATA}:
- Add identifying and enabling in the cpufeature list
- Expose the support of these features to userspace through HWCAP3
and cpuinfo
- Add related hwcap test
- Handle the trap of unsupported memory (normal/uncacheable) access in a VM
A real scenario for this feature is that the userspace driver can make use of
this to implement direct WQE (workqueue entry) - a mechanism to fill WQE
directly into the hardware.
This patchset also depends on Marc's patchset[1] for enabling related
features in a VM, HCRX trap handling, etc.
[1] https://lore.kernel.org/linux-arm-kernel/20240815125959.2097734-1-maz@kerne…
Tested with updated hwcap test:
On host:
root@localhost:/# dmesg | grep "All CPU(s) started"
[ 1.600263] CPU: All CPU(s) started at EL2
root@localhost:/# ./hwcap
[snip...]
# LS64 present
ok 169 cpuinfo_match_LS64
ok 170 sigill_LS64
ok 171 # SKIP sigbus_LS64
# LS64_V present
ok 172 cpuinfo_match_LS64_V
ok 173 sigill_LS64_V
ok 174 # SKIP sigbus_LS64_V
# LS64_ACCDATA present
ok 175 cpuinfo_match_LS64_ACCDATA
ok 176 sigill_LS64_ACCDATA
ok 177 # SKIP sigbus_LS64_ACCDATA
# Totals: pass:92 fail:0 xfail:0 xpass:0 skip:85 error:0
On guest:
root@localhost:/# dmesg | grep "All CPU(s) started"
[ 1.375272] CPU: All CPU(s) started at EL1
root@localhost:/# ./hwcap
[snip...]
# LS64 present
ok 169 cpuinfo_match_LS64
ok 170 sigill_LS64
ok 171 # SKIP sigbus_LS64
# LS64_V present
ok 172 cpuinfo_match_LS64_V
ok 173 sigill_LS64_V
ok 174 # SKIP sigbus_LS64_V
# LS64_ACCDATA present
ok 175 cpuinfo_match_LS64_ACCDATA
ok 176 sigill_LS64_ACCDATA
ok 177 # SKIP sigbus_LS64_ACCDATA
# Totals: pass:92 fail:0 xfail:0 xpass:0 skip:85 error:0
Yicong Yang (5):
arm64: Provide basic EL2 setup for FEAT_{LS64, LS64_V, LS64_ACCDATA}
usage at EL0/1
arm64: Add support for FEAT_{LS64, LS64_V, LS64_ACCDATA}
kselftest/arm64: Add HWCAP test for FEAT_{LS64, LS64_V, LS64_ACCDATA}
arm64: Add ESR.DFSC definition of unsupported exclusive or atomic
access
KVM: arm64: Handle DABT caused by LS64* instructions on unsupported
memory
Documentation/arch/arm64/booting.rst | 28 +++++
Documentation/arch/arm64/elf_hwcaps.rst | 9 ++
arch/arm64/include/asm/el2_setup.h | 26 ++++-
arch/arm64/include/asm/esr.h | 8 ++
arch/arm64/include/asm/hwcap.h | 3 +
arch/arm64/include/uapi/asm/hwcap.h | 3 +
arch/arm64/kernel/cpufeature.c | 70 +++++++++++-
arch/arm64/kernel/cpuinfo.c | 3 +
arch/arm64/kvm/mmu.c | 14 +++
arch/arm64/tools/cpucaps | 3 +
tools/testing/selftests/arm64/abi/hwcap.c | 127 ++++++++++++++++++++++
11 files changed, 291 insertions(+), 3 deletions(-)
--
2.24.0
This version 5 patch series replace direct error handling methods with ksft
macros, which provide better reporting.Currently, when the tmpfs test runs,
it does not display any output if it passes,and if it fails
(particularly when not run as root),it simply exits without any warning or
message.
This series of patch adds:
1. Add 'ksft_print_header()' and 'ksft_set_plan()'
to structure test outputs more effectively.
2. Error if not run as root.
3. Replace direct error handling with 'ksft_test_result_*',
'ksft_exit_fail_msg' macros for better reporting.
v4->v5:
- Remove unnecessary pass messages.
- Remove unnecessary use of KSFT_SKIP.
- Add appropriate use of ksft_exit_fail_msg.
v4 1/2: https://lore.kernel.org/all/20241105202639.1977356-2-cvam0000@gmail.com/
v4 2/2: https://lore.kernel.org/all/20241105202639.1977356-3-cvam0000@gmail.com/
v3->v4:
- Start a patchset
- Split patch into smaller patches to make it easy to review.
Patch1 Replace 'ksft_test_result_skip' with 'KSFT_SKIP' during root run check.
Patch2 Replace 'ksft_test_result_fail' with 'KSFT_SKIP' where fail does not make sense,
or failure could be due to not unsupported APIs with appropriate warnings.
v3: https://lore.kernel.org/all/20241028185756.111832-1-cvam0000@gmail.com/
v2->v3:
- Remove extra ksft_set_plan()
- Remove function for unshare()
- Fix the comment style
v2: https://lore.kernel.org/all/20241026191621.2860376-1-cvam0000@gmail.com/
v1->v2:
- Make the commit message more clear.
v1: https://lore.kernel.org/all/20241024200228.1075840-1-cvam0000@gmail.com/T/#u
thanks
Shivam
Shivam Chaudhary (2):
selftests: tmpfs: Add Test-fail if not run as root
selftests: tmpfs: Add kselftest support to tmpfs
.../selftests/tmpfs/bug-link-o-tmpfile.c | 60 ++++++++++++-------
1 file changed, 37 insertions(+), 23 deletions(-)
--
2.45.2
If the selftest is not running as root, it should fail and
give an appropriate warning to the user. This patch adds
ksft_exit_fail_msg() if the test is not running as root.
Logs:
Before change:
TAP version 13
1..1
ok 1 # SKIP This test needs root to run!
After change:
TAP version 13
1..1
Bail out! Error : Need to run as root# Planned tests != run tests (1 != 0)
Totals: pass:0 fail:0 xfail:0 xpass:0 skip:0 error:0
Signed-off-by: Shivam Chaudhary <cvam0000(a)gmail.com>
---
tools/testing/selftests/acct/acct_syscall.c | 3 +--
1 file changed, 1 insertion(+), 2 deletions(-)
diff --git a/tools/testing/selftests/acct/acct_syscall.c b/tools/testing/selftests/acct/acct_syscall.c
index e44e8fe1f4a3..7c65deef54e3 100644
--- a/tools/testing/selftests/acct/acct_syscall.c
+++ b/tools/testing/selftests/acct/acct_syscall.c
@@ -24,8 +24,7 @@ int main(void)
// Check if test is run a root
if (geteuid()) {
- ksft_test_result_skip("This test needs root to run!\n");
- return 1;
+ ksft_exit_fail_msg("Error : Need to run as root");
}
// Create file to log closed processes
--
2.45.2
The kernel has recently added support for shadow stacks, currently
x86 only using their CET feature but both arm64 and RISC-V have
equivalent features (GCS and Zicfiss respectively), I am actively
working on GCS[1]. With shadow stacks the hardware maintains an
additional stack containing only the return addresses for branch
instructions which is not generally writeable by userspace and ensures
that any returns are to the recorded addresses. This provides some
protection against ROP attacks and making it easier to collect call
stacks. These shadow stacks are allocated in the address space of the
userspace process.
Our API for shadow stacks does not currently offer userspace any
flexiblity for managing the allocation of shadow stacks for newly
created threads, instead the kernel allocates a new shadow stack with
the same size as the normal stack whenever a thread is created with the
feature enabled. The stacks allocated in this way are freed by the
kernel when the thread exits or shadow stacks are disabled for the
thread. This lack of flexibility and control isn't ideal, in the vast
majority of cases the shadow stack will be over allocated and the
implicit allocation and deallocation is not consistent with other
interfaces. As far as I can tell the interface is done in this manner
mainly because the shadow stack patches were in development since before
clone3() was implemented.
Since clone3() is readily extensible let's add support for specifying a
shadow stack when creating a new thread or process, keeping the current
implicit allocation behaviour if one is not specified either with
clone3() or through the use of clone(). The user must provide a shadow
stack pointer, this must point to memory mapped for use as a shadow
stackby map_shadow_stack() with an architecture specified shadow stack
token at the top of the stack.
Please note that the x86 portions of this code are build tested only, I
don't appear to have a system that can run CET available to me.
[1] https://lore.kernel.org/linux-arm-kernel/20241001-arm64-gcs-v13-0-222b78d87…
Signed-off-by: Mark Brown <broonie(a)kernel.org>
---
Changes in v13:
- Rebase onto v6.13-rc1.
- Link to v12: https://lore.kernel.org/r/20241031-clone3-shadow-stack-v12-0-7183eb8bee17@k…
Changes in v12:
- Add the regular prctl() to the userspace API document since arm64
support is queued in -next.
- Link to v11: https://lore.kernel.org/r/20241005-clone3-shadow-stack-v11-0-2a6a2bd6d651@k…
Changes in v11:
- Rebase onto arm64 for-next/gcs, which is based on v6.12-rc1, and
integrate arm64 support.
- Rework the interface to specify a shadow stack pointer rather than a
base and size like we do for the regular stack.
- Link to v10: https://lore.kernel.org/r/20240821-clone3-shadow-stack-v10-0-06e8797b9445@k…
Changes in v10:
- Integrate fixes & improvements for the x86 implementation from Rick
Edgecombe.
- Require that the shadow stack be VM_WRITE.
- Require that the shadow stack base and size be sizeof(void *) aligned.
- Clean up trailing newline.
- Link to v9: https://lore.kernel.org/r/20240819-clone3-shadow-stack-v9-0-962d74f99464@ke…
Changes in v9:
- Pull token validation earlier and report problems with an error return
to parent rather than signal delivery to the child.
- Verify that the top of the supplied shadow stack is VM_SHADOW_STACK.
- Rework token validation to only do the page mapping once.
- Drop no longer needed support for testing for signals in selftest.
- Fix typo in comments.
- Link to v8: https://lore.kernel.org/r/20240808-clone3-shadow-stack-v8-0-0acf37caf14c@ke…
Changes in v8:
- Fix token verification with user specified shadow stack.
- Don't track user managed shadow stacks for child processes.
- Link to v7: https://lore.kernel.org/r/20240731-clone3-shadow-stack-v7-0-a9532eebfb1d@ke…
Changes in v7:
- Rebase onto v6.11-rc1.
- Typo fixes.
- Link to v6: https://lore.kernel.org/r/20240623-clone3-shadow-stack-v6-0-9ee7783b1fb9@ke…
Changes in v6:
- Rebase onto v6.10-rc3.
- Ensure we don't try to free the parent shadow stack in error paths of
x86 arch code.
- Spelling fixes in userspace API document.
- Additional cleanups and improvements to the clone3() tests to support
the shadow stack tests.
- Link to v5: https://lore.kernel.org/r/20240203-clone3-shadow-stack-v5-0-322c69598e4b@ke…
Changes in v5:
- Rebase onto v6.8-rc2.
- Rework ABI to have the user allocate the shadow stack memory with
map_shadow_stack() and a token.
- Force inlining of the x86 shadow stack enablement.
- Move shadow stack enablement out into a shared header for reuse by
other tests.
- Link to v4: https://lore.kernel.org/r/20231128-clone3-shadow-stack-v4-0-8b28ffe4f676@ke…
Changes in v4:
- Formatting changes.
- Use a define for minimum shadow stack size and move some basic
validation to fork.c.
- Link to v3: https://lore.kernel.org/r/20231120-clone3-shadow-stack-v3-0-a7b8ed3e2acc@ke…
Changes in v3:
- Rebase onto v6.7-rc2.
- Remove stale shadow_stack in internal kargs.
- If a shadow stack is specified unconditionally use it regardless of
CLONE_ parameters.
- Force enable shadow stacks in the selftest.
- Update changelogs for RISC-V feature rename.
- Link to v2: https://lore.kernel.org/r/20231114-clone3-shadow-stack-v2-0-b613f8681155@ke…
Changes in v2:
- Rebase onto v6.7-rc1.
- Remove ability to provide preallocated shadow stack, just specify the
desired size.
- Link to v1: https://lore.kernel.org/r/20231023-clone3-shadow-stack-v1-0-d867d0b5d4d0@ke…
---
Mark Brown (8):
arm64/gcs: Return a success value from gcs_alloc_thread_stack()
Documentation: userspace-api: Add shadow stack API documentation
selftests: Provide helper header for shadow stack testing
fork: Add shadow stack support to clone3()
selftests/clone3: Remove redundant flushes of output streams
selftests/clone3: Factor more of main loop into test_clone3()
selftests/clone3: Allow tests to flag if -E2BIG is a valid error code
selftests/clone3: Test shadow stack support
Documentation/userspace-api/index.rst | 1 +
Documentation/userspace-api/shadow_stack.rst | 42 ++++
arch/arm64/include/asm/gcs.h | 8 +-
arch/arm64/kernel/process.c | 8 +-
arch/arm64/mm/gcs.c | 62 +++++-
arch/x86/include/asm/shstk.h | 11 +-
arch/x86/kernel/process.c | 2 +-
arch/x86/kernel/shstk.c | 57 +++++-
include/asm-generic/cacheflush.h | 11 ++
include/linux/sched/task.h | 17 ++
include/uapi/linux/sched.h | 10 +-
kernel/fork.c | 96 +++++++--
tools/testing/selftests/clone3/clone3.c | 226 ++++++++++++++++++----
tools/testing/selftests/clone3/clone3_selftests.h | 65 ++++++-
tools/testing/selftests/ksft_shstk.h | 98 ++++++++++
15 files changed, 633 insertions(+), 81 deletions(-)
---
base-commit: 40384c840ea1944d7c5a392e8975ed088ecf0b37
change-id: 20231019-clone3-shadow-stack-15d40d2bf536
Best regards,
--
Mark Brown <broonie(a)kernel.org>
There are a few patches in vIRQ series that rework the fault.c file. So,
we should fix this before that bigger series touches the same code.
And add missing coverage for IOMMU_FAULT_QUEUE_ALLOC in iommufd_fail_nth.
Thanks!
Nicolin
Nicolin Chen (2):
iommufd/fault: Fix out_fput in iommufd_fault_alloc()
iommufd/selftest: Cover IOMMU_FAULT_QUEUE_ALLOC in iommufd_fail_nth
drivers/iommu/iommufd/fault.c | 2 --
tools/testing/selftests/iommu/iommufd_fail_nth.c | 14 ++++++++++++++
2 files changed, 14 insertions(+), 2 deletions(-)
--
2.43.0
Currently, sendmmsg is implemented in udpgso_bench_tx.c,
but it is not called by any test script.
This patch adds a test for sendmmsg in udpgso_bench.sh.
This allows for basic API testing and benchmarking
comparisons with GSO.
---
tools/testing/selftests/net/udpgso_bench.sh | 3 +++
1 file changed, 3 insertions(+)
diff --git a/tools/testing/selftests/net/udpgso_bench.sh b/tools/testing/selftests/net/udpgso_bench.sh
index 640bc43452fa..88fa1d53ba2b 100755
--- a/tools/testing/selftests/net/udpgso_bench.sh
+++ b/tools/testing/selftests/net/udpgso_bench.sh
@@ -92,6 +92,9 @@ run_udp() {
echo "udp"
run_in_netns ${args}
+ echo "udp sendmmsg"
+ run_in_netns ${args} -m
+
echo "udp gso"
run_in_netns ${args} -S 0
--
2.39.3 (Apple Git-146)
As discussed in the v1 [1], with guest_memfd moving from KVM to mm, it
is more practical to have a non-KVM-specific API to populate guest
memory in a generic way. The series proposes using the write syscall
for this purpose instead of a KVM ioctl as in the v1. The approach also
has an advantage that the guest_memfd handle can be sent to another
process that would be responsible for population. I also included a
suggestion from Mike Day for excluding the code from compilation if AMD
SEV is configured.
There is a potential for refactoring of the kvm_gmem_populate to extract
common parts with the write. I did not do that in this series yet to
keep it clear what the write would do and get feedback on whether
write's behaviour is sensible.
Nikita
[1]: https://lore.kernel.org/kvm/20241024095429.54052-1-kalyazin@amazon.com/T/
Nikita Kalyazin (2):
KVM: guest_memfd: add generic population via write
KVM: selftests: update guest_memfd write tests
.../testing/selftests/kvm/guest_memfd_test.c | 85 +++++++++++++++++--
virt/kvm/guest_memfd.c | 79 +++++++++++++++++
2 files changed, 158 insertions(+), 6 deletions(-)
base-commit: 1508bae37044ebffd7c7e09915f041936f338123
--
2.40.1
Add ip_link_set_addr(), ip_link_set_up(), ip_addr_add() and ip_route_add()
to the suite of helpers that automatically schedule a corresponding
cleanup.
When setting a new MAC, one needs to remember the old address first. Move
mac_get() from forwarding/ to that end.
Signed-off-by: Petr Machata <petrm(a)nvidia.com>
Reviewed-by: Ido Schimmel <idosch(a)nvidia.com>
---
Notes:
CC: Shuah Khan <shuah(a)kernel.org>
CC: Benjamin Poirier <bpoirier(a)nvidia.com>
CC: Hangbin Liu <liuhangbin(a)gmail.com>
CC: Vladimir Oltean <vladimir.oltean(a)nxp.com>
CC: linux-kselftest(a)vger.kernel.org
tools/testing/selftests/net/forwarding/lib.sh | 7 ----
tools/testing/selftests/net/lib.sh | 39 +++++++++++++++++++
2 files changed, 39 insertions(+), 7 deletions(-)
diff --git a/tools/testing/selftests/net/forwarding/lib.sh b/tools/testing/selftests/net/forwarding/lib.sh
index 7337f398f9cc..1fd40bada694 100644
--- a/tools/testing/selftests/net/forwarding/lib.sh
+++ b/tools/testing/selftests/net/forwarding/lib.sh
@@ -932,13 +932,6 @@ packets_rate()
echo $(((t1 - t0) / interval))
}
-mac_get()
-{
- local if_name=$1
-
- ip -j link show dev $if_name | jq -r '.[]["address"]'
-}
-
ether_addr_to_u64()
{
local addr="$1"
diff --git a/tools/testing/selftests/net/lib.sh b/tools/testing/selftests/net/lib.sh
index 5ea6537acd2b..2cd5c743b2d9 100644
--- a/tools/testing/selftests/net/lib.sh
+++ b/tools/testing/selftests/net/lib.sh
@@ -435,6 +435,13 @@ xfail_on_veth()
fi
}
+mac_get()
+{
+ local if_name=$1
+
+ ip -j link show dev $if_name | jq -r '.[]["address"]'
+}
+
kill_process()
{
local pid=$1; shift
@@ -459,3 +466,35 @@ ip_link_set_master()
ip link set dev "$member" master "$master"
defer ip link set dev "$member" nomaster
}
+
+ip_link_set_addr()
+{
+ local name=$1; shift
+ local addr=$1; shift
+
+ local old_addr=$(mac_get "$name")
+ ip link set dev "$name" address "$addr"
+ defer ip link set dev "$name" address "$old_addr"
+}
+
+ip_link_set_up()
+{
+ local name=$1; shift
+
+ ip link set dev "$name" up
+ defer ip link set dev "$name" down
+}
+
+ip_addr_add()
+{
+ local name=$1; shift
+
+ ip addr add dev "$name" "$@"
+ defer ip addr del dev "$name" "$@"
+}
+
+ip_route_add()
+{
+ ip route add "$@"
+ defer ip route del "$@"
+}
--
2.47.0
If the coreutils variant of ping is used instead of the busybox one, the
ping_do() command is broken. This comes by the fact that for coreutils
ping, the ping IP needs to be the very last elements.
To handle this, reorder the ping args and make $dip last element.
The use of coreutils ping might be useful for case where busybox is not
compiled with float interval support and ping command doesn't support
0.1 interval. (in such case a dedicated ping utility is installed
instead)
Cc: stable(a)vger.kernel.org
Fixes: 73bae6736b6b ("selftests: forwarding: Add initial testing framework")
Signed-off-by: Christian Marangi <ansuelsmth(a)gmail.com>
---
tools/testing/selftests/net/forwarding/lib.sh | 8 ++++----
1 file changed, 4 insertions(+), 4 deletions(-)
diff --git a/tools/testing/selftests/net/forwarding/lib.sh b/tools/testing/selftests/net/forwarding/lib.sh
index c992e385159c..2060f95d5c62 100644
--- a/tools/testing/selftests/net/forwarding/lib.sh
+++ b/tools/testing/selftests/net/forwarding/lib.sh
@@ -1473,8 +1473,8 @@ ping_do()
vrf_name=$(master_name_get $if_name)
ip vrf exec $vrf_name \
- $PING $args $dip -c $PING_COUNT -i 0.1 \
- -w $PING_TIMEOUT &> /dev/null
+ $PING $args -c $PING_COUNT -i 0.1 \
+ -w $PING_TIMEOUT $dip &> /dev/null
}
ping_test()
@@ -1504,8 +1504,8 @@ ping6_do()
vrf_name=$(master_name_get $if_name)
ip vrf exec $vrf_name \
- $PING6 $args $dip -c $PING_COUNT -i 0.1 \
- -w $PING_TIMEOUT &> /dev/null
+ $PING6 $args -c $PING_COUNT -i 0.1 \
+ -w $PING_TIMEOUT $dip &> /dev/null
}
ping6_test()
--
2.45.2
Create a test for the robust list mechanism. Test the following uAPI
operations:
- Creating a robust mutex where the lock waiter is wake by the kernel
when the lock owner died
- Setting a robust list to the current task
- Getting a robust list from the current task
- Getting a robust list from another task
- Using the list_op_pending field from robust_list_head struct to test
robustness when the lock owner dies before completing the locking
- Setting a invalid size for syscall argument `len`
Signed-off-by: André Almeida <andrealmeid(a)igalia.com>
---
Changes from v2:
- Dropped kselftest_harness include, using just futex test API
- This is the expected output:
TAP version 13
1..6
ok 1 test_robustness
ok 2 test_set_robust_list_invalid_size
ok 3 test_get_robust_list_self
ok 4 test_get_robust_list_child
ok 5 test_set_list_op_pending
ok 6 test_robust_list_multiple_elements
# Totals: pass:6 fail:0 xfail:0 xpass:0 skip:0 error:0
Changes from v1:
- Change futex type from int to _Atomic(unsigned int)
- Use old futex(FUTEX_WAIT) instead of the new sys_futex_wait()
---
.../selftests/futex/functional/.gitignore | 1 +
.../selftests/futex/functional/Makefile | 3 +-
.../selftests/futex/functional/robust_list.c | 512 ++++++++++++++++++
3 files changed, 515 insertions(+), 1 deletion(-)
create mode 100644 tools/testing/selftests/futex/functional/robust_list.c
diff --git a/tools/testing/selftests/futex/functional/.gitignore b/tools/testing/selftests/futex/functional/.gitignore
index fbcbdb6963b3..4726e1be7497 100644
--- a/tools/testing/selftests/futex/functional/.gitignore
+++ b/tools/testing/selftests/futex/functional/.gitignore
@@ -9,3 +9,4 @@ futex_wait_wouldblock
futex_wait
futex_requeue
futex_waitv
+robust_list
diff --git a/tools/testing/selftests/futex/functional/Makefile b/tools/testing/selftests/futex/functional/Makefile
index f79f9bac7918..b8635a1ac7f6 100644
--- a/tools/testing/selftests/futex/functional/Makefile
+++ b/tools/testing/selftests/futex/functional/Makefile
@@ -17,7 +17,8 @@ TEST_GEN_PROGS := \
futex_wait_private_mapped_file \
futex_wait \
futex_requeue \
- futex_waitv
+ futex_waitv \
+ robust_list
TEST_PROGS := run.sh
diff --git a/tools/testing/selftests/futex/functional/robust_list.c b/tools/testing/selftests/futex/functional/robust_list.c
new file mode 100644
index 000000000000..7bfda2d39821
--- /dev/null
+++ b/tools/testing/selftests/futex/functional/robust_list.c
@@ -0,0 +1,512 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Copyright (C) 2024 Igalia S.L.
+ *
+ * Robust list test by André Almeida <andrealmeid(a)igalia.com>
+ *
+ * The robust list uAPI allows userspace to create "robust" locks, in the sense
+ * that if the lock holder thread dies, the remaining threads that are waiting
+ * for the lock won't block forever, waiting for a lock that will never be
+ * released.
+ *
+ * This is achieve by userspace setting a list where a thread can enter all the
+ * locks (futexes) that it is holding. The robust list is a linked list, and
+ * userspace register the start of the list with the syscall set_robust_list().
+ * If such thread eventually dies, the kernel will walk this list, waking up one
+ * thread waiting for each futex and marking the futex word with the flag
+ * FUTEX_OWNER_DIED.
+ *
+ * See also
+ * man set_robust_list
+ * Documententation/locking/robust-futex-ABI.rst
+ * Documententation/locking/robust-futexes.rst
+ */
+
+#define _GNU_SOURCE
+
+#include "futextest.h"
+#include "logging.h"
+
+#include <errno.h>
+#include <pthread.h>
+#include <signal.h>
+#include <stdatomic.h>
+#include <stdbool.h>
+#include <stddef.h>
+#include <sys/mman.h>
+#include <sys/wait.h>
+
+#define STACK_SIZE (1024 * 1024)
+
+#define FUTEX_TIMEOUT 3
+
+static pthread_barrier_t barrier, barrier2;
+
+int set_robust_list(struct robust_list_head *head, size_t len)
+{
+ return syscall(SYS_set_robust_list, head, len);
+}
+
+int get_robust_list(int pid, struct robust_list_head **head, size_t *len_ptr)
+{
+ return syscall(SYS_get_robust_list, pid, head, len_ptr);
+}
+
+/*
+ * Basic lock struct, contains just the futex word and the robust list element
+ * Real implementations have also a *prev to easily walk in the list
+ */
+struct lock_struct {
+ _Atomic(unsigned int) futex;
+ struct robust_list list;
+};
+
+/*
+ * Helper function to spawn a child thread. Returns -1 on error, pid on success
+ */
+static int create_child(int (*fn)(void *arg), void *arg)
+{
+ char *stack;
+ pid_t pid;
+
+ stack = mmap(NULL, STACK_SIZE, PROT_READ | PROT_WRITE,
+ MAP_PRIVATE | MAP_ANONYMOUS | MAP_STACK, -1, 0);
+ if (stack == MAP_FAILED)
+ return -1;
+
+ stack += STACK_SIZE;
+
+ pid = clone(fn, stack, CLONE_VM | SIGCHLD, arg);
+
+ if (pid == -1)
+ return -1;
+
+ return pid;
+}
+
+/*
+ * Helper function to prepare and register a robust list
+ */
+static int set_list(struct robust_list_head *head)
+{
+ int ret;
+
+ ret = set_robust_list(head, sizeof(struct robust_list_head));
+ if (ret)
+ return ret;
+
+ head->futex_offset = (size_t) offsetof(struct lock_struct, futex) -
+ (size_t) offsetof(struct lock_struct, list);
+ head->list.next = &head->list;
+ head->list_op_pending = NULL;
+
+ return 0;
+}
+
+/*
+ * A basic (and incomplete) mutex lock function with robustness
+ */
+static int mutex_lock(struct lock_struct *lock, struct robust_list_head *head, bool error_inject)
+{
+ _Atomic(unsigned int) *futex = &lock->futex;
+ int zero = 0, ret = -1;
+ pid_t tid = gettid();
+
+ /*
+ * Set list_op_pending before starting the lock, so the kernel can catch
+ * the case where the thread died during the lock operation
+ */
+ head->list_op_pending = &lock->list;
+
+ if (atomic_compare_exchange_strong(futex, &zero, tid)) {
+ /*
+ * We took the lock, insert it in the robust list
+ */
+ struct robust_list *list = &head->list;
+
+ /* Error injection to test list_op_pending */
+ if (error_inject)
+ return 0;
+
+ while (list->next != &head->list)
+ list = list->next;
+
+ list->next = &lock->list;
+ lock->list.next = &head->list;
+
+ ret = 0;
+ } else {
+ /*
+ * We didn't take the lock, wait until the owner wakes (or dies)
+ */
+ struct timespec to;
+
+ clock_gettime(CLOCK_MONOTONIC, &to);
+ to.tv_sec = to.tv_sec + FUTEX_TIMEOUT;
+
+ tid = atomic_load(futex);
+ /* Kernel ignores futexes without the waiters flag */
+ tid |= FUTEX_WAITERS;
+ atomic_store(futex, tid);
+
+ ret = futex_wait((futex_t *) futex, tid, &to, 0);
+
+ /*
+ * A real mutex_lock() implementation would loop here to finally
+ * take the lock. We don't care about that, so we stop here.
+ */
+ }
+
+ head->list_op_pending = NULL;
+
+ return ret;
+}
+
+/*
+ * This child thread will succeed taking the lock, and then will exit holding it
+ */
+static int child_fn_lock(void *arg)
+{
+ struct lock_struct *lock = (struct lock_struct *) arg;
+ struct robust_list_head head;
+ int ret;
+
+ ret = set_list(&head);
+ if (ret)
+ ksft_test_result_fail("set_robust_list error\n");
+
+ ret = mutex_lock(lock, &head, false);
+ if (ret)
+ ksft_test_result_fail("mutex_lock error\n");
+
+ pthread_barrier_wait(&barrier);
+
+ /*
+ * There's a race here: the parent thread needs to be inside
+ * futex_wait() before the child thread dies, otherwise it will miss the
+ * wakeup from handle_futex_death() that this child will emit. We wait a
+ * little bit just to make sure that this happens.
+ */
+ sleep(1);
+
+ return 0;
+}
+
+/*
+ * Spawns a child thread that will set a robust list, take the lock, register it
+ * in the robust list and die. The parent thread will wait on this futex, and
+ * should be waken up when the child exits.
+ */
+static void test_robustness()
+{
+ struct lock_struct lock = { .futex = 0 };
+ struct robust_list_head head;
+ _Atomic(unsigned int) *futex = &lock.futex;
+ int ret;
+
+ ret = set_list(&head);
+ ASSERT_EQ(ret, 0);
+
+ /*
+ * Lets use a barrier to ensure that the child thread takes the lock
+ * before the parent
+ */
+ ret = pthread_barrier_init(&barrier, NULL, 2);
+ ASSERT_EQ(ret, 0);
+
+ ret = create_child(&child_fn_lock, &lock);
+ ASSERT_NE(ret, -1);
+
+ pthread_barrier_wait(&barrier);
+ ret = mutex_lock(&lock, &head, false);
+
+ /*
+ * futex_wait() should return 0 and the futex word should be marked with
+ * FUTEX_OWNER_DIED
+ */
+ ASSERT_EQ(ret, 0);
+ if (ret != 0)
+ printf("futex wait returned %d", errno);
+
+ ASSERT_TRUE(*futex | FUTEX_OWNER_DIED);
+
+ pthread_barrier_destroy(&barrier);
+
+ ksft_test_result_pass("%s\n", __func__);
+}
+
+/*
+ * The only valid value for len is sizeof(*head)
+ */
+static void test_set_robust_list_invalid_size()
+{
+ struct robust_list_head head;
+ size_t head_size = sizeof(struct robust_list_head);
+ int ret;
+
+ ret = set_robust_list(&head, head_size);
+ ASSERT_EQ(ret, 0);
+
+ ret = set_robust_list(&head, head_size * 2);
+ ASSERT_EQ(ret, -1);
+ ASSERT_EQ(errno, EINVAL);
+
+ ret = set_robust_list(&head, head_size - 1);
+ ASSERT_EQ(ret, -1);
+ ASSERT_EQ(errno, EINVAL);
+
+ ret = set_robust_list(&head, 0);
+ ASSERT_EQ(ret, -1);
+ ASSERT_EQ(errno, EINVAL);
+
+ ksft_test_result_pass("%s\n", __func__);
+}
+
+/*
+ * Test get_robust_list with pid = 0, getting the list of the running thread
+ */
+static void test_get_robust_list_self()
+{
+ struct robust_list_head head, head2, *get_head;
+ size_t head_size = sizeof(struct robust_list_head), len_ptr;
+ int ret;
+
+ ret = set_robust_list(&head, head_size);
+ ASSERT_EQ(ret, 0);
+
+ ret = get_robust_list(0, &get_head, &len_ptr);
+ ASSERT_EQ(ret, 0);
+ ASSERT_EQ(get_head, &head);
+ ASSERT_EQ(head_size, len_ptr);
+
+ ret = set_robust_list(&head2, head_size);
+ ASSERT_EQ(ret, 0);
+
+ ret = get_robust_list(0, &get_head, &len_ptr);
+ ASSERT_EQ(ret, 0);
+ ASSERT_EQ(get_head, &head2);
+ ASSERT_EQ(head_size, len_ptr);
+
+ ksft_test_result_pass("%s\n", __func__);
+}
+
+static int child_list(void *arg)
+{
+ struct robust_list_head *head = (struct robust_list_head *) arg;
+ int ret;
+
+ ret = set_robust_list(head, sizeof(struct robust_list_head));
+ if (ret)
+ ksft_test_result_fail("set_robust_list error\n");
+
+ pthread_barrier_wait(&barrier);
+ pthread_barrier_wait(&barrier2);
+
+ return 0;
+}
+
+/*
+ * Test get_robust_list from another thread. We use two barriers here to ensure
+ * that:
+ * 1) the child thread set the list before we try to get it from the
+ * parent
+ * 2) the child thread still alive when we try to get the list from it
+ */
+static void test_get_robust_list_child()
+{
+ pid_t tid;
+ int ret;
+ struct robust_list_head head, *get_head;
+ size_t len_ptr;
+
+ ret = pthread_barrier_init(&barrier, NULL, 2);
+ ret = pthread_barrier_init(&barrier2, NULL, 2);
+ ASSERT_EQ(ret, 0);
+
+ tid = create_child(&child_list, &head);
+ ASSERT_NE(tid, -1);
+
+ pthread_barrier_wait(&barrier);
+
+ ret = get_robust_list(tid, &get_head, &len_ptr);
+ ASSERT_EQ(ret, 0);
+ ASSERT_EQ(&head, get_head);
+
+ pthread_barrier_wait(&barrier2);
+
+ pthread_barrier_destroy(&barrier);
+ pthread_barrier_destroy(&barrier2);
+
+ ksft_test_result_pass("%s\n", __func__);
+}
+
+static int child_fn_lock_with_error(void *arg)
+{
+ struct lock_struct *lock = (struct lock_struct *) arg;
+ struct robust_list_head head;
+ int ret;
+
+ ret = set_list(&head);
+ if (ret)
+ ksft_test_result_fail("set_robust_list error\n");
+
+ ret = mutex_lock(lock, &head, true);
+ if (ret)
+ ksft_test_result_fail("mutex_lock error\n");
+
+ pthread_barrier_wait(&barrier);
+
+ sleep(1);
+
+ return 0;
+}
+
+/*
+ * Same as robustness test, but inject an error where the mutex_lock() exits
+ * earlier, just after setting list_op_pending and taking the lock, to test the
+ * list_op_pending mechanism
+ */
+static void test_set_list_op_pending()
+{
+ struct lock_struct lock = { .futex = 0 };
+ struct robust_list_head head;
+ _Atomic(unsigned int) *futex = &lock.futex;
+ int ret;
+
+ ret = set_list(&head);
+ ASSERT_EQ(ret, 0);
+
+ ret = pthread_barrier_init(&barrier, NULL, 2);
+ ASSERT_EQ(ret, 0);
+
+ ret = create_child(&child_fn_lock_with_error, &lock);
+ ASSERT_NE(ret, -1);
+
+ pthread_barrier_wait(&barrier);
+ ret = mutex_lock(&lock, &head, false);
+
+ ASSERT_EQ(ret, 0);
+ if (ret != 0)
+ printf("futex wait returned %d", errno);
+
+ ASSERT_TRUE(*futex | FUTEX_OWNER_DIED);
+
+ pthread_barrier_destroy(&barrier);
+
+ ksft_test_result_pass("%s\n", __func__);
+}
+
+#define CHILD_NR 10
+
+static int child_lock_holder(void *arg)
+{
+ struct lock_struct *locks = (struct lock_struct *) arg;
+ struct robust_list_head head;
+ int i;
+
+ set_list(&head);
+
+ for (i = 0; i < CHILD_NR; i++) {
+ locks[i].futex = 0;
+ mutex_lock(&locks[i], &head, false);
+ }
+
+ pthread_barrier_wait(&barrier);
+ pthread_barrier_wait(&barrier2);
+
+ sleep(1);
+ return 0;
+}
+
+static int child_wait_lock(void *arg)
+{
+ struct lock_struct *lock = (struct lock_struct *) arg;
+ struct robust_list_head head;
+ int ret;
+
+ pthread_barrier_wait(&barrier2);
+ ret = mutex_lock(lock, &head, false);
+
+ if (ret)
+ ksft_test_result_fail("mutex_lock error\n");
+
+ if (!(lock->futex | FUTEX_OWNER_DIED))
+ ksft_test_result_fail("futex not marked with FUTEX_OWNER_DIED\n");
+
+ return 0;
+}
+
+/*
+ * Test a robust list of more than one element. All the waiters should wake when
+ * the holder dies
+ */
+static void test_robust_list_multiple_elements()
+{
+ struct lock_struct locks[CHILD_NR];
+ int i, ret;
+
+ ret = pthread_barrier_init(&barrier, NULL, 2);
+ ASSERT_EQ(ret, 0);
+ ret = pthread_barrier_init(&barrier2, NULL, CHILD_NR + 1);
+ ASSERT_EQ(ret, 0);
+
+ create_child(&child_lock_holder, &locks);
+
+ /* Wait until the locker thread takes the look */
+ pthread_barrier_wait(&barrier);
+
+ for (i = 0; i < CHILD_NR; i++)
+ create_child(&child_wait_lock, &locks[i]);
+
+ /* Wait for all children to return */
+ while (wait(NULL) > 0);
+
+ pthread_barrier_destroy(&barrier);
+ pthread_barrier_destroy(&barrier2);
+
+ ksft_test_result_pass("%s\n", __func__);
+}
+
+void usage(char *prog)
+{
+ printf("Usage: %s\n", prog);
+ printf(" -c Use color\n");
+ printf(" -h Display this help message\n");
+ printf(" -v L Verbosity level: %d=QUIET %d=CRITICAL %d=INFO\n",
+ VQUIET, VCRITICAL, VINFO);
+}
+
+int main(int argc, char *argv[])
+{
+ int c;
+
+ while ((c = getopt(argc, argv, "cht:v:")) != -1) {
+ switch (c) {
+ case 'c':
+ log_color(1);
+ break;
+ case 'h':
+ usage(basename(argv[0]));
+ exit(0);
+ case 'v':
+ log_verbosity(atoi(optarg));
+ break;
+ default:
+ usage(basename(argv[0]));
+ exit(1);
+ }
+ }
+
+ ksft_print_header();
+ ksft_set_plan(6);
+
+ test_robustness();
+ test_set_robust_list_invalid_size();
+ test_get_robust_list_self();
+ test_get_robust_list_child();
+ test_set_list_op_pending();
+ test_robust_list_multiple_elements();
+
+ ksft_print_cnts();
+ return 0;
+}
--
2.47.1
serial_test_flow_dissector_namespace manipulates both the root net
namespace and a dedicated non-root net namespace. If for some reason a
program attach on root namespace succeeds while it was expected to
fail, the unexpected program will remain attached to the root namespace,
possibly affecting other runs or even other tests in the same run.
Fix undesired test failure side effect by explicitly detaching programs
on failing tests expecting attach to fail. As a side effect of this
change, do not test errno value if the tested operation do not fail.
Fixes: 284ed00a59dd ("selftests/bpf: migrate flow_dissector namespace exclusivity test")
Signed-off-by: Alexis Lothoré (eBPF Foundation) <alexis.lothore(a)bootlin.com>
---
This small fix addresses an issue discovered while trying to add a new
test in my recently merged work on flow_dissector migration. This new
test is still only present in bpf-next, hence this fix does not target
the bpf tree but the bpf-next tree.
---
tools/testing/selftests/bpf/prog_tests/flow_dissector.c | 16 +++++++++++-----
1 file changed, 11 insertions(+), 5 deletions(-)
diff --git a/tools/testing/selftests/bpf/prog_tests/flow_dissector.c b/tools/testing/selftests/bpf/prog_tests/flow_dissector.c
index 8e6e483fead3f71f21e2223c707c6d4fb548a61e..08bae13248c4a8ab0bfa356a34b2738964d97f4c 100644
--- a/tools/testing/selftests/bpf/prog_tests/flow_dissector.c
+++ b/tools/testing/selftests/bpf/prog_tests/flow_dissector.c
@@ -525,11 +525,14 @@ void serial_test_flow_dissector_namespace(void)
ns = open_netns(TEST_NS);
if (!ASSERT_OK_PTR(ns, "enter non-root net namespace"))
goto out_clean_ns;
-
err = bpf_prog_attach(prog_fd, 0, BPF_FLOW_DISSECTOR, 0);
+ if (!ASSERT_ERR(err,
+ "refuse new flow dissector in non-root net namespace"))
+ bpf_prog_detach2(prog_fd, 0, BPF_FLOW_DISSECTOR);
+ else
+ ASSERT_EQ(errno, EEXIST,
+ "refused because of already attached prog");
close_netns(ns);
- ASSERT_ERR(err, "refuse new flow dissector in non-root net namespace");
- ASSERT_EQ(errno, EEXIST, "refused because of already attached prog");
/* If no flow dissector is attached to the root namespace, we must
* be able to attach one to a non-root net namespace
@@ -545,8 +548,11 @@ void serial_test_flow_dissector_namespace(void)
* a flow dissector to root namespace must fail
*/
err = bpf_prog_attach(prog_fd, 0, BPF_FLOW_DISSECTOR, 0);
- ASSERT_ERR(err, "refuse new flow dissector on root namespace");
- ASSERT_EQ(errno, EEXIST, "refused because of already attached prog");
+ if (!ASSERT_ERR(err, "refuse new flow dissector on root namespace"))
+ bpf_prog_detach2(prog_fd, 0, BPF_FLOW_DISSECTOR);
+ else
+ ASSERT_EQ(errno, EEXIST,
+ "refused because of already attached prog");
ns = open_netns(TEST_NS);
bpf_prog_detach2(prog_fd, 0, BPF_FLOW_DISSECTOR);
---
base-commit: 04e7b00083a120d60511443d900a5cc10dbed263
change-id: 20241128-small_flow_test_fix-0c53624a3c4c
Best regards,
--
Alexis Lothoré, Bootlin
Embedded Linux and Kernel engineering
https://bootlin.com
From: Edward Cree <ecree.xilinx(a)gmail.com>
The original semantics of ntuple filters with FLOW_RSS were not
fully understood by all drivers, some ignoring the ring_cookie from
the flow rule. Require this support to be explicitly declared by
the driver for filters relying on it to be inserted, and add self-
test coverage for this functionality.
Also teach ethtool_check_max_channel() about this.
Edward Cree (5):
net: ethtool: only allow set_rxnfc with rss + ring_cookie if driver
opts in
net: ethtool: account for RSS+RXNFC add semantics when checking
channel count
selftest: include dst-ip in ethtool ntuple rules
selftest: validate RSS+ntuple filters with nonzero ring_cookie
selftest: extend test_rss_context_queue_reconfigure for action
addition
drivers/net/ethernet/sfc/ef100_ethtool.c | 1 +
drivers/net/ethernet/sfc/ethtool.c | 1 +
include/linux/ethtool.h | 4 +
net/ethtool/common.c | 42 +++++++---
net/ethtool/ioctl.c | 5 ++
.../selftests/drivers/net/hw/rss_ctx.py | 79 +++++++++++++++++--
6 files changed, 113 insertions(+), 19 deletions(-)
From: Mark Brown <broonie(a)kernel.org>
[ Upstream commit 3e360ef0c0a1fb6ce9a302e40b8057c41ba8a9d2 ]
When building for streaming SVE the irritator for SVE skips updates of both
P0 and FFR. While FFR is skipped since it might not be present there is no
reason to skip corrupting P0 so switch to an instruction valid in streaming
mode and move the ifdef.
Signed-off-by: Mark Brown <broonie(a)kernel.org>
Link: https://lore.kernel.org/r/20241107-arm64-fp-stress-irritator-v2-3-c4b9622e3…
Signed-off-by: Catalin Marinas <catalin.marinas(a)arm.com>
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
---
tools/testing/selftests/arm64/fp/sve-test.S | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/tools/testing/selftests/arm64/fp/sve-test.S b/tools/testing/selftests/arm64/fp/sve-test.S
index 07f14e279a904..7784c684a865b 100644
--- a/tools/testing/selftests/arm64/fp/sve-test.S
+++ b/tools/testing/selftests/arm64/fp/sve-test.S
@@ -459,7 +459,8 @@ function irritator_handler
movi v9.16b, #2
movi v31.8b, #3
// And P0
- rdffr p0.b
+ ptrue p0.d
+#ifndef SSVE
// And FFR
wrffr p15.b
--
2.43.0
This patchset fixes a bug where bnxt driver was failing to report that
an ntuple rule is redirecting to an RSS context. First commit is the
fix, then second commit extends selftests to detect if other/new drivers
are compliant with ntuple/rss_ctx API.
=== Changelog ===
Changes from v2:
* Rebase to net instead of net-next
* Make regex work with ethtool output changes
Changes from v1:
* Add selftest in patch 2
Daniel Xu (2):
bnxt_en: ethtool: Supply ntuple rss context action
selftests: drv-net: rss_ctx: Add test for ntuple rule
drivers/net/ethernet/broadcom/bnxt/bnxt_ethtool.c | 8 ++++++--
tools/testing/selftests/drivers/net/hw/rss_ctx.py | 12 +++++++++++-
2 files changed, 17 insertions(+), 3 deletions(-)
--
2.46.0
Hello and Welcome,
We appreciate your understanding during this time, and we regret the slow response to your previous note. We appreciate your interest and are pleased to assist you with the information you need.
This email contains an attached screenshot that presents essential details concerning your request. Please take a moment to access it and familiarize yourself with the details for a full understanding of the information.
If you need any information or have any questions, do not hesitate to get in touch. We are always ready to assist you and eager to provide all required help.
Cheers,
Ivory Davis
Catalyst Solutions, LLC
+1 (212) 143-51-81
Change the spelling from metadate -> metadata
Signed-off-by: Gautam Somani <gautamsomani(a)gmail.com>
---
tools/testing/selftests/x86/lam.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/tools/testing/selftests/x86/lam.c b/tools/testing/selftests/x86/lam.c
index 0ea4f6813930..4d4a76532dc9 100644
--- a/tools/testing/selftests/x86/lam.c
+++ b/tools/testing/selftests/x86/lam.c
@@ -237,7 +237,7 @@ static uint64_t set_metadata(uint64_t src, unsigned long lam)
* both pointers should point to the same address.
*
* @return:
- * 0: value on the pointer with metadate and value on original are same
+ * 0: value on the pointer with metadata and value on original are same
* 1: not same.
*/
static int handle_lam_test(void *src, unsigned int lam)
--
2.43.0
The kernel has recently added support for shadow stacks, currently
x86 only using their CET feature but both arm64 and RISC-V have
equivalent features (GCS and Zicfiss respectively), I am actively
working on GCS[1]. With shadow stacks the hardware maintains an
additional stack containing only the return addresses for branch
instructions which is not generally writeable by userspace and ensures
that any returns are to the recorded addresses. This provides some
protection against ROP attacks and making it easier to collect call
stacks. These shadow stacks are allocated in the address space of the
userspace process.
Our API for shadow stacks does not currently offer userspace any
flexiblity for managing the allocation of shadow stacks for newly
created threads, instead the kernel allocates a new shadow stack with
the same size as the normal stack whenever a thread is created with the
feature enabled. The stacks allocated in this way are freed by the
kernel when the thread exits or shadow stacks are disabled for the
thread. This lack of flexibility and control isn't ideal, in the vast
majority of cases the shadow stack will be over allocated and the
implicit allocation and deallocation is not consistent with other
interfaces. As far as I can tell the interface is done in this manner
mainly because the shadow stack patches were in development since before
clone3() was implemented.
Since clone3() is readily extensible let's add support for specifying a
shadow stack when creating a new thread or process, keeping the current
implicit allocation behaviour if one is not specified either with
clone3() or through the use of clone(). The user must provide a shadow
stack pointer, this must point to memory mapped for use as a shadow
stackby map_shadow_stack() with an architecture specified shadow stack
token at the top of the stack.
Please note that the x86 portions of this code are build tested only, I
don't appear to have a system that can run CET available to me.
[1] https://lore.kernel.org/linux-arm-kernel/20241001-arm64-gcs-v13-0-222b78d87…
Signed-off-by: Mark Brown <broonie(a)kernel.org>
---
Changes in v12:
- Add the regular prctl() to the userspace API document since arm64
support is queued in -next.
- Link to v11: https://lore.kernel.org/r/20241005-clone3-shadow-stack-v11-0-2a6a2bd6d651@k…
Changes in v11:
- Rebase onto arm64 for-next/gcs, which is based on v6.12-rc1, and
integrate arm64 support.
- Rework the interface to specify a shadow stack pointer rather than a
base and size like we do for the regular stack.
- Link to v10: https://lore.kernel.org/r/20240821-clone3-shadow-stack-v10-0-06e8797b9445@k…
Changes in v10:
- Integrate fixes & improvements for the x86 implementation from Rick
Edgecombe.
- Require that the shadow stack be VM_WRITE.
- Require that the shadow stack base and size be sizeof(void *) aligned.
- Clean up trailing newline.
- Link to v9: https://lore.kernel.org/r/20240819-clone3-shadow-stack-v9-0-962d74f99464@ke…
Changes in v9:
- Pull token validation earlier and report problems with an error return
to parent rather than signal delivery to the child.
- Verify that the top of the supplied shadow stack is VM_SHADOW_STACK.
- Rework token validation to only do the page mapping once.
- Drop no longer needed support for testing for signals in selftest.
- Fix typo in comments.
- Link to v8: https://lore.kernel.org/r/20240808-clone3-shadow-stack-v8-0-0acf37caf14c@ke…
Changes in v8:
- Fix token verification with user specified shadow stack.
- Don't track user managed shadow stacks for child processes.
- Link to v7: https://lore.kernel.org/r/20240731-clone3-shadow-stack-v7-0-a9532eebfb1d@ke…
Changes in v7:
- Rebase onto v6.11-rc1.
- Typo fixes.
- Link to v6: https://lore.kernel.org/r/20240623-clone3-shadow-stack-v6-0-9ee7783b1fb9@ke…
Changes in v6:
- Rebase onto v6.10-rc3.
- Ensure we don't try to free the parent shadow stack in error paths of
x86 arch code.
- Spelling fixes in userspace API document.
- Additional cleanups and improvements to the clone3() tests to support
the shadow stack tests.
- Link to v5: https://lore.kernel.org/r/20240203-clone3-shadow-stack-v5-0-322c69598e4b@ke…
Changes in v5:
- Rebase onto v6.8-rc2.
- Rework ABI to have the user allocate the shadow stack memory with
map_shadow_stack() and a token.
- Force inlining of the x86 shadow stack enablement.
- Move shadow stack enablement out into a shared header for reuse by
other tests.
- Link to v4: https://lore.kernel.org/r/20231128-clone3-shadow-stack-v4-0-8b28ffe4f676@ke…
Changes in v4:
- Formatting changes.
- Use a define for minimum shadow stack size and move some basic
validation to fork.c.
- Link to v3: https://lore.kernel.org/r/20231120-clone3-shadow-stack-v3-0-a7b8ed3e2acc@ke…
Changes in v3:
- Rebase onto v6.7-rc2.
- Remove stale shadow_stack in internal kargs.
- If a shadow stack is specified unconditionally use it regardless of
CLONE_ parameters.
- Force enable shadow stacks in the selftest.
- Update changelogs for RISC-V feature rename.
- Link to v2: https://lore.kernel.org/r/20231114-clone3-shadow-stack-v2-0-b613f8681155@ke…
Changes in v2:
- Rebase onto v6.7-rc1.
- Remove ability to provide preallocated shadow stack, just specify the
desired size.
- Link to v1: https://lore.kernel.org/r/20231023-clone3-shadow-stack-v1-0-d867d0b5d4d0@ke…
---
Mark Brown (8):
arm64/gcs: Return a success value from gcs_alloc_thread_stack()
Documentation: userspace-api: Add shadow stack API documentation
selftests: Provide helper header for shadow stack testing
fork: Add shadow stack support to clone3()
selftests/clone3: Remove redundant flushes of output streams
selftests/clone3: Factor more of main loop into test_clone3()
selftests/clone3: Allow tests to flag if -E2BIG is a valid error code
selftests/clone3: Test shadow stack support
Documentation/userspace-api/index.rst | 1 +
Documentation/userspace-api/shadow_stack.rst | 42 ++++
arch/arm64/include/asm/gcs.h | 8 +-
arch/arm64/kernel/process.c | 8 +-
arch/arm64/mm/gcs.c | 62 +++++-
arch/x86/include/asm/shstk.h | 11 +-
arch/x86/kernel/process.c | 2 +-
arch/x86/kernel/shstk.c | 57 +++++-
include/asm-generic/cacheflush.h | 11 ++
include/linux/sched/task.h | 17 ++
include/uapi/linux/sched.h | 10 +-
kernel/fork.c | 96 +++++++--
tools/testing/selftests/clone3/clone3.c | 226 ++++++++++++++++++----
tools/testing/selftests/clone3/clone3_selftests.h | 65 ++++++-
tools/testing/selftests/ksft_shstk.h | 98 ++++++++++
15 files changed, 633 insertions(+), 81 deletions(-)
---
base-commit: d17cd7b7cc92d37ee8b2df8f975fc859a261f4dc
change-id: 20231019-clone3-shadow-stack-15d40d2bf536
Best regards,
--
Mark Brown <broonie(a)kernel.org>
bpftool now embeds the kfuncs definitions directly in the generated
vmlinux.h
This is great, but because the selftests dir might be compiled with
HID_BPF disabled, we have no guarantees to be able to compile the
sources with the generated kfuncs.
If we have the kfuncs, because we have the `__not_used` hack, the newly
defined kfuncs do not match the ones from vmlinux.h and things go wrong.
Prevent vmlinux.h to define its kfuncs and also add the missing `__weak`
symbols for our custom kfuncs definitions
Signed-off-by: Benjamin Tissoires <bentiss(a)kernel.org>
---
This was noticed while bumping the CI to fedora 41 which has an update
of bpftool.
I'll probably take this in for-6.13/upstream-fixes tomorrow if no bots
comes back at me.
Cheers,
Benjamin
---
tools/testing/selftests/hid/progs/hid_bpf_helpers.h | 19 +++++++++++--------
1 file changed, 11 insertions(+), 8 deletions(-)
diff --git a/tools/testing/selftests/hid/progs/hid_bpf_helpers.h b/tools/testing/selftests/hid/progs/hid_bpf_helpers.h
index e5db897586bbfe010d8799f6f52fc5c418344e6b..531228b849daebcf40d994abb8bf35e760b3cc4e 100644
--- a/tools/testing/selftests/hid/progs/hid_bpf_helpers.h
+++ b/tools/testing/selftests/hid/progs/hid_bpf_helpers.h
@@ -22,6 +22,9 @@
#define HID_REQ_SET_IDLE HID_REQ_SET_IDLE___not_used
#define HID_REQ_SET_PROTOCOL HID_REQ_SET_PROTOCOL___not_used
+/* do not define kfunc through vmlinux.h as this messes up our custom hack */
+#define BPF_NO_KFUNC_PROTOTYPES
+
#include "vmlinux.h"
#undef hid_bpf_ctx
@@ -91,31 +94,31 @@ struct hid_bpf_ops {
/* following are kfuncs exported by HID for HID-BPF */
extern __u8 *hid_bpf_get_data(struct hid_bpf_ctx *ctx,
unsigned int offset,
- const size_t __sz) __ksym;
-extern struct hid_bpf_ctx *hid_bpf_allocate_context(unsigned int hid_id) __ksym;
-extern void hid_bpf_release_context(struct hid_bpf_ctx *ctx) __ksym;
+ const size_t __sz) __weak __ksym;
+extern struct hid_bpf_ctx *hid_bpf_allocate_context(unsigned int hid_id) __weak __ksym;
+extern void hid_bpf_release_context(struct hid_bpf_ctx *ctx) __weak __ksym;
extern int hid_bpf_hw_request(struct hid_bpf_ctx *ctx,
__u8 *data,
size_t buf__sz,
enum hid_report_type type,
- enum hid_class_request reqtype) __ksym;
+ enum hid_class_request reqtype) __weak __ksym;
extern int hid_bpf_hw_output_report(struct hid_bpf_ctx *ctx,
- __u8 *buf, size_t buf__sz) __ksym;
+ __u8 *buf, size_t buf__sz) __weak __ksym;
extern int hid_bpf_input_report(struct hid_bpf_ctx *ctx,
enum hid_report_type type,
__u8 *data,
- size_t buf__sz) __ksym;
+ size_t buf__sz) __weak __ksym;
extern int hid_bpf_try_input_report(struct hid_bpf_ctx *ctx,
enum hid_report_type type,
__u8 *data,
- size_t buf__sz) __ksym;
+ size_t buf__sz) __weak __ksym;
/* bpf_wq implementation */
extern int bpf_wq_init(struct bpf_wq *wq, void *p__map, unsigned int flags) __weak __ksym;
extern int bpf_wq_start(struct bpf_wq *wq, unsigned int flags) __weak __ksym;
extern int bpf_wq_set_callback_impl(struct bpf_wq *wq,
int (callback_fn)(void *map, int *key, void *wq),
- unsigned int flags__k, void *aux__ign) __ksym;
+ unsigned int flags__k, void *aux__ign) __weak __ksym;
#define bpf_wq_set_callback(timer, cb, flags) \
bpf_wq_set_callback_impl(timer, cb, flags, NULL)
---
base-commit: 919464deeca24e5bf13b6c8efd0b1d25cc43866f
change-id: 20241128-fix-new-bpftool-31a9f43aafbb
Best regards,
--
Benjamin Tissoires <bentiss(a)kernel.org>
The parameter passed to load_mod() is stored in $MOD, but never used.
Obviously it was intended to be used instead of the hardcoded
"test_kallsyms_b" module name.
Fixes: 84b4a51fce4ccc66 ("selftests: add new kallsyms selftests")
Signed-off-by: Geert Uytterhoeven <geert(a)linux-m68k.org>
---
tools/testing/selftests/module/find_symbol.sh | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/tools/testing/selftests/module/find_symbol.sh b/tools/testing/selftests/module/find_symbol.sh
index 140364d3c49fc912..2c56805c9b6e6ea0 100755
--- a/tools/testing/selftests/module/find_symbol.sh
+++ b/tools/testing/selftests/module/find_symbol.sh
@@ -44,10 +44,10 @@ load_mod()
local ARCH="$(uname -m)"
case "${ARCH}" in
x86_64)
- perf stat $STATS $MODPROBE test_kallsyms_b
+ perf stat $STATS $MODPROBE $MOD
;;
*)
- time $MODPROBE test_kallsyms_b
+ time $MODPROBE $MOD
exit 1
;;
esac
--
2.34.1
The string logged when a test passes or fails is used by the selftest
framework to identify which test is being reported. The hugetlb_dio test
not only uses the same strings for every test that is run but it also uses
different strings for test passes and failures which means that test
automation is unable to follow what the test is doing at all.
Pull the existing duplicated logging of the number of free huge pages
before and after the test out of the conditional and replace that and the
logging of the result with a single ksft_print_result() which incorporates
the parameters passed into the test into the output.
Fixes: fae1980347bf ("selftests: hugetlb_dio: fixup check for initial conditions to skip in the start")
Signed-off-by: Mark Brown <broonie(a)kernel.org>
---
tools/testing/selftests/mm/hugetlb_dio.c | 14 +++++---------
1 file changed, 5 insertions(+), 9 deletions(-)
diff --git a/tools/testing/selftests/mm/hugetlb_dio.c b/tools/testing/selftests/mm/hugetlb_dio.c
index 432d5af15e66b7d6cac0273fb244d6696d7c9ddc..db63abe5ee5e85ff7795d3ea176c3ac47184bf4f 100644
--- a/tools/testing/selftests/mm/hugetlb_dio.c
+++ b/tools/testing/selftests/mm/hugetlb_dio.c
@@ -76,19 +76,15 @@ void run_dio_using_hugetlb(unsigned int start_off, unsigned int end_off)
/* Get the free huge pages after unmap*/
free_hpage_a = get_free_hugepages();
+ ksft_print_msg("No. Free pages before allocation : %d\n", free_hpage_b);
+ ksft_print_msg("No. Free pages after munmap : %d\n", free_hpage_a);
+
/*
* If the no. of free hugepages before allocation and after unmap does
* not match - that means there could still be a page which is pinned.
*/
- if (free_hpage_a != free_hpage_b) {
- ksft_print_msg("No. Free pages before allocation : %d\n", free_hpage_b);
- ksft_print_msg("No. Free pages after munmap : %d\n", free_hpage_a);
- ksft_test_result_fail(": Huge pages not freed!\n");
- } else {
- ksft_print_msg("No. Free pages before allocation : %d\n", free_hpage_b);
- ksft_print_msg("No. Free pages after munmap : %d\n", free_hpage_a);
- ksft_test_result_pass(": Huge pages freed successfully !\n");
- }
+ ksft_test_result(free_hpage_a == free_hpage_b,
+ "free huge pages from %u-%u\n", start_off, end_off);
}
int main(void)
---
base-commit: 6f3d2b5299b0a8bcb8a9405a8d3fceb24f79c4f0
change-id: 20241127-kselftest-mm-hugetlb-dio-names-1ebccbe8183d
Best regards,
--
Mark Brown <broonie(a)kernel.org>
The test.py should not be run separately. It should be run via run.sh,
which will do some sanity checks first. Move the test.py from TEST_PROGS
to TEST_FILES.
Reported-by: Maximilian Heyne <mheyne(a)amazon.de>
Closes: https://lore.kernel.org/netdev/20241122150129.GB18887@dev-dsk-mheyne-1b-556…
Fixes: 3ade6ce1255e ("selftests: rds: add testing infrastructure")
Signed-off-by: Hangbin Liu <liuhangbin(a)gmail.com>
---
tools/testing/selftests/net/rds/Makefile | 5 ++---
1 file changed, 2 insertions(+), 3 deletions(-)
diff --git a/tools/testing/selftests/net/rds/Makefile b/tools/testing/selftests/net/rds/Makefile
index 1803c39dbacb..612a7219990e 100644
--- a/tools/testing/selftests/net/rds/Makefile
+++ b/tools/testing/selftests/net/rds/Makefile
@@ -3,10 +3,9 @@
all:
@echo mk_build_dir="$(shell pwd)" > include.sh
-TEST_PROGS := run.sh \
- test.py
+TEST_PROGS := run.sh
-TEST_FILES := include.sh
+TEST_FILES := include.sh test.py
EXTRA_CLEAN := /tmp/rds_logs include.sh
--
2.46.0
Recent change in how get_user() handles pointers [1] has a specific case
for LAM. It assigns a different bitmask that's later used to check
whether a pointer comes from userland in get_user().
While currently commented out (until LASS [2] is merged into the kernel)
it's worth making changes to the LAM selftest ahead of time.
Modify cpu_has_la57() so it provides current paging level information
instead of the cpuid one.
Add test case to LAM that utilizes a ioctl (FIOASYNC) syscall which uses
get_user() in its implementation. Execute the syscall with differently
tagged pointers to verify that valid user pointers are passing through
and invalid kernel/non-canonical pointers are not.
Also to avoid unhelpful test failures add a check in main() to skip
running tests if LAM was not compiled into the kernel.
Code was tested on a Sierra Forest Xeon machine that's LAM capable. The
test was ran without issues with both the LAM lines from [1] untouched
and commented out. The test was also ran without issues with LAM_SUP
both enabled and disabled.
4/5 level pagetables code paths were also successfully tested in Simics
on a 5-level capable machine.
[1] https://lore.kernel.org/all/20241024013214.129639-1-torvalds@linux-foundati…
[2] https://lore.kernel.org/all/20241028160917.1380714-1-alexander.shishkin@lin…
Maciej Wieczor-Retman (3):
selftests/lam: Move cpu_has_la57() to use cpuinfo flag
selftests/lam: Skip test if LAM is disabled
selftests/lam: Test get_user() LAM pointer handling
tools/testing/selftests/x86/lam.c | 122 ++++++++++++++++++++++++++++--
1 file changed, 117 insertions(+), 5 deletions(-)
--
2.47.1
When running selftests I encountered the following error message with
some damon tests:
# Traceback (most recent call last):
# File "[...]/damon/./damos_quota.py", line 7, in <module>
# import _damon_sysfs
# ModuleNotFoundError: No module named '_damon_sysfs'
Fix this by adding the _damon_sysfs.py file to TEST_FILES so that it
will be available when running the respective damon selftests.
Fixes: 306abb63a8ca ("selftests/damon: implement a python module for test-purpose DAMON sysfs controls")
Signed-off-by: Maximilian Heyne <mheyne(a)amazon.de>
---
tools/testing/selftests/damon/Makefile | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/tools/testing/selftests/damon/Makefile b/tools/testing/selftests/damon/Makefile
index 5b2a6a5dd1af7..812f656260fba 100644
--- a/tools/testing/selftests/damon/Makefile
+++ b/tools/testing/selftests/damon/Makefile
@@ -6,7 +6,7 @@ TEST_GEN_FILES += debugfs_target_ids_read_before_terminate_race
TEST_GEN_FILES += debugfs_target_ids_pid_leak
TEST_GEN_FILES += access_memory access_memory_even
-TEST_FILES = _chk_dependency.sh _debugfs_common.sh
+TEST_FILES = _chk_dependency.sh _debugfs_common.sh _damon_sysfs.py
# functionality tests
TEST_PROGS = debugfs_attrs.sh debugfs_schemes.sh debugfs_target_ids.sh
--
2.40.1
Amazon Web Services Development Center Germany GmbH
Krausenstr. 38
10117 Berlin
Geschaeftsfuehrung: Christian Schlaeger, Jonathan Weiss
Eingetragen am Amtsgericht Charlottenburg unter HRB 257764 B
Sitz: Berlin
Ust-ID: DE 365 538 597
__ktap_test() may be called without the optional third argument which is
an issue for scripts using `set -u` to detect uninitialized variables
and potential bugs.
Fix this optional "directive" argument by either using the third
argument or an empty string.
This was discovered while developing tests for script control execution:
https://lore.kernel.org/r/20241112191858.162021-7-mic@digikod.net
Cc: Kees Cook <kees(a)kernel.org>
Cc: Nícolas F. R. A. Prado <nfraprado(a)collabora.com>
Cc: Shuah Khan <skhan(a)linuxfoundation.org>
Signed-off-by: Mickaël Salaün <mic(a)digikod.net>
---
tools/testing/selftests/kselftest/ktap_helpers.sh | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/tools/testing/selftests/kselftest/ktap_helpers.sh b/tools/testing/selftests/kselftest/ktap_helpers.sh
index 79a125eb24c2..14e7f3ec3f84 100644
--- a/tools/testing/selftests/kselftest/ktap_helpers.sh
+++ b/tools/testing/selftests/kselftest/ktap_helpers.sh
@@ -40,7 +40,7 @@ ktap_skip_all() {
__ktap_test() {
result="$1"
description="$2"
- directive="$3" # optional
+ directive="${3:-}" # optional
local directive_str=
[ ! -z "$directive" ] && directive_str="# $directive"
--
2.47.1
From: Tycho Andersen <tandersen(a)netflix.com>
Zbigniew mentioned at Linux Plumber's that systemd is interested in
switching to execveat() for service execution, but can't, because the
contents of /proc/pid/comm are the file descriptor which was used,
instead of the path to the binary. This makes the output of tools like
top and ps useless, especially in a world where most fds are opened
CLOEXEC so the number is truly meaningless.
Change exec path to fix up /proc/pid/comm in the case where we have
allocated one of these synthetic paths in bprm_init(). This way the actual
exec machinery is unchanged, but cosmetically the comm looks reasonable to
admins investigating things.
Signed-off-by: Tycho Andersen <tandersen(a)netflix.com>
Suggested-by: Zbigniew Jędrzejewski-Szmek <zbyszek(a)in.waw.pl>
CC: Aleksa Sarai <cyphar(a)cyphar.com>
Link: https://github.com/uapi-group/kernel-features#set-comm-field-before-exec
---
v2: * drop the flag, everyone :)
* change the rendered value to f_path.dentry->d_name.name instead of
argv[0], Eric
v3: * fix up subject line, Eric
v4: * switch to no flag, always rewrite approach, with some cleanup
suggested by Kees
---
fs/exec.c | 36 +++++++++++++++++++++++++++++++++++-
include/linux/binfmts.h | 1 +
2 files changed, 36 insertions(+), 1 deletion(-)
diff --git a/fs/exec.c b/fs/exec.c
index 6c53920795c2..3b559f598c74 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1347,7 +1347,16 @@ int begin_new_exec(struct linux_binprm * bprm)
set_dumpable(current->mm, SUID_DUMP_USER);
perf_event_exec();
- __set_task_comm(me, kbasename(bprm->filename), true);
+
+ /*
+ * If argv0 was set, alloc_bprm() made up a path that will
+ * probably not be useful to admins running ps or similar.
+ * Let's fix it up to be something reasonable.
+ */
+ if (bprm->argv0)
+ __set_task_comm(me, kbasename(bprm->argv0), true);
+ else
+ __set_task_comm(me, kbasename(bprm->filename), true);
/* An exec changes our domain. We are no longer part of the thread
group */
@@ -1497,9 +1506,28 @@ static void free_bprm(struct linux_binprm *bprm)
if (bprm->interp != bprm->filename)
kfree(bprm->interp);
kfree(bprm->fdpath);
+ kfree(bprm->argv0);
kfree(bprm);
}
+static int bprm_add_fixup_comm(struct linux_binprm *bprm,
+ struct user_arg_ptr argv)
+{
+ const char __user *p = get_user_arg_ptr(argv, 0);
+
+ /*
+ * If p == NULL, let's just fall back to fdpath.
+ */
+ if (!p)
+ return 0;
+
+ bprm->argv0 = strndup_user(p, MAX_ARG_STRLEN);
+ if (bprm->argv0)
+ return 0;
+
+ return -EFAULT;
+}
+
static struct linux_binprm *alloc_bprm(int fd, struct filename *filename, int flags)
{
struct linux_binprm *bprm;
@@ -1906,6 +1934,12 @@ static int do_execveat_common(int fd, struct filename *filename,
goto out_ret;
}
+ if (unlikely(bprm->fdpath)) {
+ retval = bprm_add_fixup_comm(bprm, argv);
+ if (retval != 0)
+ goto out_free;
+ }
+
retval = count(argv, MAX_ARG_STRINGS);
if (retval == 0)
pr_warn_once("process '%s' launched '%s' with NULL argv: empty string added\n",
diff --git a/include/linux/binfmts.h b/include/linux/binfmts.h
index e6c00e860951..bab5121a746b 100644
--- a/include/linux/binfmts.h
+++ b/include/linux/binfmts.h
@@ -55,6 +55,7 @@ struct linux_binprm {
of the time same as filename, but could be
different for binfmt_{misc,script} */
const char *fdpath; /* generated filename for execveat */
+ const char *argv0; /* argv0 from execveat */
unsigned interp_flags;
int execfd; /* File descriptor of the executable */
unsigned long loader, exec;
base-commit: c1e939a21eb111a6d6067b38e8e04b8809b64c4e
--
2.34.1
Hi Kuan-Wei,
On Sun, Oct 20, 2024 at 6:02 AM Kuan-Wei Chiu <visitorckw(a)gmail.com> wrote:
> All current min heap API functions are marked with '__always_inline'.
> However, as the number of users increases, inlining these functions
> everywhere leads to a increase in kernel size.
>
> In performance-critical paths, such as when perf events are enabled and
> min heap functions are called on every context switch, it is important
> to retain the inline versions for optimal performance. To balance this,
> the original inline functions are kept, and additional non-inline
> versions of the functions have been added in lib/min_heap.c.
>
> Link: https://lore.kernel.org/20240522161048.8d8bbc7b153b4ecd92c50666@linux-found…
> Suggested-by: Andrew Morton <akpm(a)linux-foundation.org>
> Signed-off-by: Kuan-Wei Chiu <visitorckw(a)gmail.com>
Thanks for your patch, which is now commit 92a8b224b833e82d
("lib/min_heap: introduce non-inline versions of min heap API
functions") upstream.
> --- a/include/linux/min_heap.h
> +++ b/include/linux/min_heap.h
> @@ -50,33 +50,33 @@ void __min_heap_init(min_heap_char *heap, void *data, int size)
> heap->data = heap->preallocated;
> }
>
> -#define min_heap_init(_heap, _data, _size) \
> - __min_heap_init((min_heap_char *)_heap, _data, _size)
> +#define min_heap_init_inline(_heap, _data, _size) \
> + __min_heap_init_inline((min_heap_char *)_heap, _data, _size)
Casting macro parameters without any further checks prevents the
compiler from detecting silly mistakes. Would it be possible to
add safety-nets here and below, using e.g. container_of() or typeof()
checks?
> --- a/lib/Kconfig
> +++ b/lib/Kconfig
> @@ -777,3 +777,6 @@ config POLYNOMIAL
>
> config FIRMWARE_TABLE
> bool
> +
> +config MIN_HEAP
> + bool
Perhaps tristate? See also below.
> --- a/lib/Kconfig.debug
> +++ b/lib/Kconfig.debug
> @@ -2279,6 +2279,7 @@ config TEST_LIST_SORT
> config TEST_MIN_HEAP
> tristate "Min heap test"
> depends on DEBUG_KERNEL || m
> + select MIN_HEAP
Ideally, tests should not select functionality, to prevent increasing the
attack vector by merely enabling (modular) tests.
In this particular case, just using "depends on MIN_HEAP" is not an
option, as MIN_HEAP is not user-visible, and thus cannot be enabled
by the user on its own. However, making MIN_HEAP tristate could be
a first step for the modular case.
The builtin case is harder to fix, as e.g.
depends on MIN_HEAP || COMPILE_TEST
select MIN_HEAP if COMPILE_TEST
would still trigger a recursive dependency error.
Alternatively, the test could just keep on using the inline variants,
unless CONFIG_MIN_HEAP=y? Or event test both for the latter?
> help
> Enable this to turn on min heap function tests. This test is
> executed only once during system boot (so affects only boot time),
Gr{oetje,eeting}s,
Geert
--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert(a)linux-m68k.org
In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
-- Linus Torvalds
This patchset adds basic kunit test coverage for initramfs unpacking
and cleans up some minor buffer handling issues / inefficiencies.
Changes since v2 (patch 2 only):
- fix !CONFIG_INITRAMFS_PRESERVE_MTIME kunit test checks
- add test MODULE_DESCRIPTION(), as suggested by Jeff Johnson
- add some missing headers, reported by kernel test robot
Changes since v1 (RFC):
- rebase atop v6.12-rc6 and filename field overrun fix from
https://lore.kernel.org/r/20241030035509.20194-2-ddiss@suse.de
- add unit test coverage (new patches 1 and 2)
- add patch: fix hardlink hash leak without TRAILER
- rework patch: avoid static buffer for error message
+ drop unnecessary message propagation
- drop patch: cpio_buf reuse for built-in and bootloader initramfs
+ no good justification for the change
Feedback appreciated.
David Disseldorp (9):
init: add initramfs_internal.h
initramfs_test: kunit tests for initramfs unpacking
vsprintf: add simple_strntoul
initramfs: avoid memcpy for hex header fields
initramfs: remove extra symlink path buffer
initramfs: merge header_buf and name_buf
initramfs: reuse name_len for dir mtime tracking
initramfs: fix hardlink hash leak without TRAILER
initramfs: avoid static buffer for error message
include/linux/kstrtox.h | 1 +
init/.kunitconfig | 3 +
init/Kconfig | 7 +
init/Makefile | 1 +
init/initramfs.c | 73 +++++----
init/initramfs_internal.h | 8 +
init/initramfs_test.c | 403 ++++++++++++++++++++++++++++++++++++++++++++++
lib/vsprintf.c | 7 +
8 files changed, 471 insertions(+), 32 deletions(-)
create mode 100644 init/.kunitconfig
create mode 100644 init/initramfs_internal.h
create mode 100644 init/initramfs_test.c
CONFIG_PREEMPT is a preemtion model the so called "Low-Latency Desktop".
A different preemption model is PREEMPT_RT the so called "Real-Time".
Both implement preemption in kernel and set CONFIG_PREEMPTION.
There is also the so called "LAZY PREEMPT" which the "Scheduler
controlled preemption model". Here we have also preemption in the kernel
the rules are slightly different.
Therefore the testsuite should not check for CONFIG_PREEMPT (as one
model) but for CONFIG_PREEMPTION to figure out if preemption in the
kernel is possible.
Signed-off-by: Sebastian Andrzej Siewior <bigeasy(a)linutronix.de>
---
tools/testing/selftests/bpf/map_tests/task_storage_map.c | 4 ++--
tools/testing/selftests/bpf/prog_tests/task_local_storage.c | 2 +-
.../testing/selftests/bpf/progs/read_bpf_task_storage_busy.c | 4 ++--
tools/testing/selftests/bpf/progs/task_storage_nodeadlock.c | 4 ++--
4 files changed, 7 insertions(+), 7 deletions(-)
diff --git a/tools/testing/selftests/bpf/map_tests/task_storage_map.c b/tools/testing/selftests/bpf/map_tests/task_storage_map.c
index 7d050364efca1..5fd7c923544c3 100644
--- a/tools/testing/selftests/bpf/map_tests/task_storage_map.c
+++ b/tools/testing/selftests/bpf/map_tests/task_storage_map.c
@@ -77,8 +77,8 @@ void test_task_storage_map_stress_lookup(void)
CHECK(err, "open_and_load", "error %d\n", err);
/* Only for a fully preemptible kernel */
- if (!skel->kconfig->CONFIG_PREEMPT) {
- printf("%s SKIP (no CONFIG_PREEMPT)\n", __func__);
+ if (!skel->kconfig->CONFIG_PREEMPTION) {
+ printf("%s SKIP (no CONFIG_PREEMPTION)\n", __func__);
read_bpf_task_storage_busy__destroy(skel);
skips++;
return;
diff --git a/tools/testing/selftests/bpf/prog_tests/task_local_storage.c b/tools/testing/selftests/bpf/prog_tests/task_local_storage.c
index c33c05161a9ea..acaeebf83f3ee 100644
--- a/tools/testing/selftests/bpf/prog_tests/task_local_storage.c
+++ b/tools/testing/selftests/bpf/prog_tests/task_local_storage.c
@@ -189,7 +189,7 @@ static void test_nodeadlock(void)
/* Unnecessary recursion and deadlock detection are reproducible
* in the preemptible kernel.
*/
- if (!skel->kconfig->CONFIG_PREEMPT) {
+ if (!skel->kconfig->CONFIG_PREEMPTION) {
test__skip();
goto done;
}
diff --git a/tools/testing/selftests/bpf/progs/read_bpf_task_storage_busy.c b/tools/testing/selftests/bpf/progs/read_bpf_task_storage_busy.c
index 76556e0b42b24..69da05bb6c63e 100644
--- a/tools/testing/selftests/bpf/progs/read_bpf_task_storage_busy.c
+++ b/tools/testing/selftests/bpf/progs/read_bpf_task_storage_busy.c
@@ -4,7 +4,7 @@
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>
-extern bool CONFIG_PREEMPT __kconfig __weak;
+extern bool CONFIG_PREEMPTION __kconfig __weak;
extern const int bpf_task_storage_busy __ksym;
char _license[] SEC("license") = "GPL";
@@ -24,7 +24,7 @@ int BPF_PROG(read_bpf_task_storage_busy)
{
int *value;
- if (!CONFIG_PREEMPT)
+ if (!CONFIG_PREEMPTION)
return 0;
if (bpf_get_current_pid_tgid() >> 32 != pid)
diff --git a/tools/testing/selftests/bpf/progs/task_storage_nodeadlock.c b/tools/testing/selftests/bpf/progs/task_storage_nodeadlock.c
index ea2dbb80f7b3e..986829aaf73a6 100644
--- a/tools/testing/selftests/bpf/progs/task_storage_nodeadlock.c
+++ b/tools/testing/selftests/bpf/progs/task_storage_nodeadlock.c
@@ -10,7 +10,7 @@ char _license[] SEC("license") = "GPL";
#define EBUSY 16
#endif
-extern bool CONFIG_PREEMPT __kconfig __weak;
+extern bool CONFIG_PREEMPTION __kconfig __weak;
int nr_get_errs = 0;
int nr_del_errs = 0;
@@ -29,7 +29,7 @@ int BPF_PROG(socket_post_create, struct socket *sock, int family, int type,
int ret, zero = 0;
int *value;
- if (!CONFIG_PREEMPT)
+ if (!CONFIG_PREEMPTION)
return 0;
task = bpf_get_current_task_btf();
--
2.45.2
Hello,
this is the revision 3 of test_flow_dissector_migration.sh into
test_progs. This revision addresses comments from Stanislas, especially
about proper reuse of pseudo-header checksuming in new network helpers.
There are 2 "main" parts in test_flow_dissector.sh:
- a set of tests checking flow_dissector programs attachment to either
root namespace or non-root namespace
- dissection test
The first set is integrated in flow_dissector.c, which already contains
some existing tests for flow_dissector programs. This series uses the
opportunity to update a bit this file (use new assert, re-split tests,
etc)
The second part is migrated into a new file under test_progs,
flow_dissector_classification.c. It uses the same eBPF programs as
flow_dissector.c, but the difference is rather about how those program
are executed:
- flow_dissector.c manually runs programs with BPF_PROG_RUN
- flow_dissector_classification.c sends real packets to be dissected, and
so it also executes kernel code related to eBPF flow dissector (eg:
__skb_flow_bpf_to_target)
---
Changes in v3:
- Keep new helpers name in sync with kernel ones
- Document some existing network helpers
- Properly reuse pseudo-header csum helper in transport layer csum
helper
- Drop duplicate assert
- Use const for test structure in the migrated test
- Simplify shutdown callchain for basic test
- collect Acked-by
- Link to v2: https://lore.kernel.org/r/20241114-flow_dissector-v2-0-ee4a3be3de65@bootlin…
Changes in v2:
- allow tests to run in parallel
- move some generic helpers to network_helpers.h
- define proper function for ASSERT_MEMEQ
- fetch acked-by tags
- Link to v1: https://lore.kernel.org/r/20241113-flow_dissector-v1-0-27c4df0592dc@bootlin…
---
Alexis Lothoré (eBPF Foundation) (14):
selftests/bpf: add a macro to compare raw memory
selftests/bpf: use ASSERT_MEMEQ to compare bpf flow keys
selftests/bpf: replace CHECK calls with ASSERT macros in flow_dissector test
selftests/bpf: re-split main function into dedicated tests
selftests/bpf: expose all subtests from flow_dissector
selftests/bpf: add gre packets testing to flow_dissector
selftests/bpf: migrate flow_dissector namespace exclusivity test
selftests/bpf: Enable generic tc actions in selftests config
selftests/bpf: move ip checksum helper to network helpers
selftests/bpf: document pseudo-header checksum helpers
selftests/bpf: use the same udp and tcp headers in tests under test_progs
selftests/bpf: add network helpers to generate udp checksums
selftests/bpf: migrate bpf flow dissectors tests to test_progs
selftests/bpf: remove test_flow_dissector.sh
tools/testing/selftests/bpf/.gitignore | 1 -
tools/testing/selftests/bpf/Makefile | 3 +-
tools/testing/selftests/bpf/config | 1 +
tools/testing/selftests/bpf/network_helpers.c | 2 +-
tools/testing/selftests/bpf/network_helpers.h | 96 +++
.../selftests/bpf/prog_tests/flow_dissector.c | 323 +++++++--
.../bpf/prog_tests/flow_dissector_classification.c | 792 +++++++++++++++++++++
.../testing/selftests/bpf/prog_tests/sockopt_sk.c | 2 +-
.../testing/selftests/bpf/prog_tests/xdp_bonding.c | 2 +-
.../selftests/bpf/prog_tests/xdp_do_redirect.c | 2 +-
.../selftests/bpf/prog_tests/xdp_flowtable.c | 2 +-
.../selftests/bpf/prog_tests/xdp_metadata.c | 21 +-
.../selftests/bpf/progs/test_cls_redirect.c | 2 +-
.../selftests/bpf/progs/test_cls_redirect.h | 2 +-
.../selftests/bpf/progs/test_cls_redirect_dynptr.c | 2 +-
tools/testing/selftests/bpf/test_flow_dissector.c | 780 --------------------
tools/testing/selftests/bpf/test_flow_dissector.sh | 178 -----
tools/testing/selftests/bpf/test_progs.c | 15 +
tools/testing/selftests/bpf/test_progs.h | 15 +
tools/testing/selftests/bpf/xdp_hw_metadata.c | 2 +-
20 files changed, 1176 insertions(+), 1067 deletions(-)
---
base-commit: 8e403f7465a7c73e3a8ef62bba8cd75ef525e4b1
change-id: 20241019-flow_dissector-3eb0c07fc163
Best regards,
--
Alexis Lothoré, Bootlin
Embedded Linux and Kernel engineering
https://bootlin.com
Currently the temporary address is not removed when mngtmpaddr is deleted
or becomes unmanaged. The patch set fixed this issue and add a related
test.
v2:
1) delete the tempaddrs directly instead of remove them in addrconf_verify_rtnl(Sam Edwards)
2) Update the test case by checking the address including, add Sam in SOB (Sam Edwards)
Hangbin Liu (2):
net/ipv6: delete temporary address if mngtmpaddr is removed or
unmanaged
selftests/rtnetlink.sh: add mngtempaddr test
net/ipv6/addrconf.c | 41 +++++++---
tools/testing/selftests/net/rtnetlink.sh | 95 ++++++++++++++++++++++++
2 files changed, 124 insertions(+), 12 deletions(-)
--
2.46.0
Two small fixes for vsock: poll() missing a queue check, and close() not
invoking sockmap cleanup.
Signed-off-by: Michal Luczaj <mhal(a)rbox.co>
---
Michal Luczaj (4):
bpf, vsock: Fix poll() missing a queue
selftest/bpf: Add test for af_vsock poll()
bpf, vsock: Invoke proto::close on close()
selftest/bpf: Add test for vsock removal from sockmap on close()
net/vmw_vsock/af_vsock.c | 70 ++++++++++++--------
.../selftests/bpf/prog_tests/sockmap_basic.c | 77 ++++++++++++++++++++++
2 files changed, 120 insertions(+), 27 deletions(-)
---
base-commit: 6c4139b0f19b7397286897caee379f8321e78272
change-id: 20241118-vsock-bpf-poll-close-64f432e682ec
Best regards,
--
Michal Luczaj <mhal(a)rbox.co>
Compiled binary files should be added to .gitignore
'git status' complains:
Untracked files:
(use "git add <file>..." to include in what will be committed)
mm/hugetlb_dio
mm/pkey_sighandler_tests_32
mm/pkey_sighandler_tests_64
Cc: Donet Tom <donettom(a)linux.ibm.com>
Cc: Andrew Morton <akpm(a)linux-foundation.org>
Cc: Shuah Khan <shuah(a)kernel.org>
Signed-off-by: Li Zhijian <lizhijian(a)fujitsu.com>
---
Cc: linux-mm(a)kvack.org
---
Hello,
Cover letter is here.
This patch set aims to make 'git status' clear after 'make' and 'make
run_tests' for kselftests.
---
V3:
nothing change, just resend it
(This .gitignore have not sorted, so I appended new files to the end)
V2:
split as seperate patch from a small one [0]
[0] https://lore.kernel.org/linux-kselftest/20241015010817.453539-1-lizhijian@f…
---
tools/testing/selftests/mm/.gitignore | 2 ++
1 file changed, 2 insertions(+)
diff --git a/tools/testing/selftests/mm/.gitignore b/tools/testing/selftests/mm/.gitignore
index da030b43e43b..2ac11b7fcb26 100644
--- a/tools/testing/selftests/mm/.gitignore
+++ b/tools/testing/selftests/mm/.gitignore
@@ -51,3 +51,5 @@ hugetlb_madv_vs_map
mseal_test
seal_elf
droppable
+hugetlb_dio
+pkey_sighandler_tests*
--
2.44.0
Currently, when testing a certain target in selftests, executing the
command 'make TARGETS=XX -C tools/testing/selftests' succeeds for non-BPF,
but a similar command fails for BPF:
'''
make TARGETS=bpf -C tools/testing/selftests
make: Entering directory '/linux-kselftest/tools/testing/selftests'
make: *** [Makefile:197: all] Error 1
make: Leaving directory '/linux-kselftest/tools/testing/selftests'
'''
The reason is that the previous commit:
commit 7a6eb7c34a78 ("selftests: Skip BPF seftests by default")
led to the default filtering of bpf in TARGETS which make TARGETS empty.
That commit also mentioned that building BPF tests requires external
commands to run. This caused target like 'bpf' or 'sched_ext' defined
in SKIP_TARGETS to need an additional specification of SKIP_TARGETS as
empty to avoid skipping it, for example:
'''
make TARGETS=bpf SKIP_TARGETS="" -C tools/testing/selftests
'''
If special steps are required to execute certain test, it is extremely
unfair. We need a fairer way to treat different test targets.
This commit provider a way: If a user has specified a single TARGETS,
it indicates an expectation to run the specified target, and thus the
object should not be skipped.
Another way is to change TARGETS to DEFAULT_TARGETS in the Makefile and
then check if the user specified TARGETS and decide whether filter or not,
though this approach requires too many modifications.
Signed-off-by: Jiayuan Chen <mrpre(a)163.com>
---
tools/testing/selftests/Makefile | 7 +++++--
1 file changed, 5 insertions(+), 2 deletions(-)
diff --git a/tools/testing/selftests/Makefile b/tools/testing/selftests/Makefile
index 363d031a16f7..d76c1781ec09 100644
--- a/tools/testing/selftests/Makefile
+++ b/tools/testing/selftests/Makefile
@@ -116,7 +116,7 @@ TARGETS += vDSO
TARGETS += mm
TARGETS += x86
TARGETS += zram
-#Please keep the TARGETS list alphabetically sorted
+# Please keep the TARGETS list alphabetically sorted
# Run "make quicktest=1 run_tests" or
# "make quicktest=1 kselftest" from top level Makefile
@@ -132,12 +132,15 @@ endif
# User can optionally provide a TARGETS skiplist. By default we skip
# targets using BPF since it has cutting edge build time dependencies
-# which require more effort to install.
+# If user provide custom TARGETS, we just ignore SKIP_TARGETS so that
+# user can easy to test single target which defined in SKIP_TARGETS
SKIP_TARGETS ?= bpf sched_ext
ifneq ($(SKIP_TARGETS),)
+ifneq ($(words $(TARGETS)), 1)
TMP := $(filter-out $(SKIP_TARGETS), $(TARGETS))
override TARGETS := $(TMP)
endif
+endif
# User can set FORCE_TARGETS to 1 to require all targets to be successfully
# built; make will fail if any of the targets cannot be built. If
base-commit: 67b6d342fb6d5abfbeb71e0f23141b9b96cf7bb1
--
2.43.5
From: Reinette Chatre <reinette.chatre(a)intel.com>
[ Upstream commit 46058430fc5d39c114f7e1b9c6ff14c9f41bd531 ]
resctrl selftests discover system properties via a variety of sysfs files.
The MBM and MBA tests need to discover the event and umask with which to
configure the performance event used to measure read memory bandwidth.
This is done by parsing the contents of
/sys/bus/event_source/devices/uncore_imc_<imc instance>/events/cas_count_read
Similarly, the resctrl selftests discover the cache size via
/sys/bus/cpu/devices/cpu<id>/cache/index<index>/size.
Take care to do bounds checking when using fscanf() to read the
contents of files into a string buffer because by default fscanf() assumes
arbitrarily long strings. If the file contains more bytes than the array
can accommodate then an overflow will occur.
Provide a maximum field width to the conversion specifier to protect
against array overflow. The maximum is one less than the array size because
string input stores a terminating null byte that is not covered by the
maximum field width.
Signed-off-by: Reinette Chatre <reinette.chatre(a)intel.com>
Reviewed-by: Ilpo Järvinen <ilpo.jarvinen(a)linux.intel.com>
Signed-off-by: Shuah Khan <skhan(a)linuxfoundation.org>
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
---
tools/testing/selftests/resctrl/resctrl_val.c | 4 ++--
tools/testing/selftests/resctrl/resctrlfs.c | 2 +-
2 files changed, 3 insertions(+), 3 deletions(-)
diff --git a/tools/testing/selftests/resctrl/resctrl_val.c b/tools/testing/selftests/resctrl/resctrl_val.c
index 45439e726e79c..438dd86f2704c 100644
--- a/tools/testing/selftests/resctrl/resctrl_val.c
+++ b/tools/testing/selftests/resctrl/resctrl_val.c
@@ -179,7 +179,7 @@ static int read_from_imc_dir(char *imc_dir, int count)
return -1;
}
- if (fscanf(fp, "%s", cas_count_cfg) <= 0) {
+ if (fscanf(fp, "%1023s", cas_count_cfg) <= 0) {
ksft_perror("Could not get iMC cas count read");
fclose(fp);
@@ -197,7 +197,7 @@ static int read_from_imc_dir(char *imc_dir, int count)
return -1;
}
- if (fscanf(fp, "%s", cas_count_cfg) <= 0) {
+ if (fscanf(fp, "%1023s", cas_count_cfg) <= 0) {
ksft_perror("Could not get iMC cas count write");
fclose(fp);
diff --git a/tools/testing/selftests/resctrl/resctrlfs.c b/tools/testing/selftests/resctrl/resctrlfs.c
index 71ad2b335b83f..fe3241799841b 100644
--- a/tools/testing/selftests/resctrl/resctrlfs.c
+++ b/tools/testing/selftests/resctrl/resctrlfs.c
@@ -160,7 +160,7 @@ int get_cache_size(int cpu_no, char *cache_type, unsigned long *cache_size)
return -1;
}
- if (fscanf(fp, "%s", cache_str) <= 0) {
+ if (fscanf(fp, "%63s", cache_str) <= 0) {
ksft_perror("Could not get cache_size");
fclose(fp);
--
2.43.0
From: Reinette Chatre <reinette.chatre(a)intel.com>
[ Upstream commit 46058430fc5d39c114f7e1b9c6ff14c9f41bd531 ]
resctrl selftests discover system properties via a variety of sysfs files.
The MBM and MBA tests need to discover the event and umask with which to
configure the performance event used to measure read memory bandwidth.
This is done by parsing the contents of
/sys/bus/event_source/devices/uncore_imc_<imc instance>/events/cas_count_read
Similarly, the resctrl selftests discover the cache size via
/sys/bus/cpu/devices/cpu<id>/cache/index<index>/size.
Take care to do bounds checking when using fscanf() to read the
contents of files into a string buffer because by default fscanf() assumes
arbitrarily long strings. If the file contains more bytes than the array
can accommodate then an overflow will occur.
Provide a maximum field width to the conversion specifier to protect
against array overflow. The maximum is one less than the array size because
string input stores a terminating null byte that is not covered by the
maximum field width.
Signed-off-by: Reinette Chatre <reinette.chatre(a)intel.com>
Reviewed-by: Ilpo Järvinen <ilpo.jarvinen(a)linux.intel.com>
Signed-off-by: Shuah Khan <skhan(a)linuxfoundation.org>
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
---
tools/testing/selftests/resctrl/resctrl_val.c | 4 ++--
tools/testing/selftests/resctrl/resctrlfs.c | 2 +-
2 files changed, 3 insertions(+), 3 deletions(-)
diff --git a/tools/testing/selftests/resctrl/resctrl_val.c b/tools/testing/selftests/resctrl/resctrl_val.c
index 8c275f6b4dd77..1bba85e4c0675 100644
--- a/tools/testing/selftests/resctrl/resctrl_val.c
+++ b/tools/testing/selftests/resctrl/resctrl_val.c
@@ -160,7 +160,7 @@ static int read_from_imc_dir(char *imc_dir, int count)
return -1;
}
- if (fscanf(fp, "%s", cas_count_cfg) <= 0) {
+ if (fscanf(fp, "%1023s", cas_count_cfg) <= 0) {
ksft_perror("Could not get iMC cas count read");
fclose(fp);
@@ -178,7 +178,7 @@ static int read_from_imc_dir(char *imc_dir, int count)
return -1;
}
- if (fscanf(fp, "%s", cas_count_cfg) <= 0) {
+ if (fscanf(fp, "%1023s", cas_count_cfg) <= 0) {
ksft_perror("Could not get iMC cas count write");
fclose(fp);
diff --git a/tools/testing/selftests/resctrl/resctrlfs.c b/tools/testing/selftests/resctrl/resctrlfs.c
index 250c320349a78..a53cd1cb6e0c6 100644
--- a/tools/testing/selftests/resctrl/resctrlfs.c
+++ b/tools/testing/selftests/resctrl/resctrlfs.c
@@ -182,7 +182,7 @@ int get_cache_size(int cpu_no, const char *cache_type, unsigned long *cache_size
return -1;
}
- if (fscanf(fp, "%s", cache_str) <= 0) {
+ if (fscanf(fp, "%63s", cache_str) <= 0) {
ksft_perror("Could not get cache_size");
fclose(fp);
--
2.43.0
From: Reinette Chatre <reinette.chatre(a)intel.com>
[ Upstream commit 46058430fc5d39c114f7e1b9c6ff14c9f41bd531 ]
resctrl selftests discover system properties via a variety of sysfs files.
The MBM and MBA tests need to discover the event and umask with which to
configure the performance event used to measure read memory bandwidth.
This is done by parsing the contents of
/sys/bus/event_source/devices/uncore_imc_<imc instance>/events/cas_count_read
Similarly, the resctrl selftests discover the cache size via
/sys/bus/cpu/devices/cpu<id>/cache/index<index>/size.
Take care to do bounds checking when using fscanf() to read the
contents of files into a string buffer because by default fscanf() assumes
arbitrarily long strings. If the file contains more bytes than the array
can accommodate then an overflow will occur.
Provide a maximum field width to the conversion specifier to protect
against array overflow. The maximum is one less than the array size because
string input stores a terminating null byte that is not covered by the
maximum field width.
Signed-off-by: Reinette Chatre <reinette.chatre(a)intel.com>
Reviewed-by: Ilpo Järvinen <ilpo.jarvinen(a)linux.intel.com>
Signed-off-by: Shuah Khan <skhan(a)linuxfoundation.org>
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
---
tools/testing/selftests/resctrl/resctrl_val.c | 4 ++--
tools/testing/selftests/resctrl/resctrlfs.c | 2 +-
2 files changed, 3 insertions(+), 3 deletions(-)
diff --git a/tools/testing/selftests/resctrl/resctrl_val.c b/tools/testing/selftests/resctrl/resctrl_val.c
index 8c275f6b4dd77..1bba85e4c0675 100644
--- a/tools/testing/selftests/resctrl/resctrl_val.c
+++ b/tools/testing/selftests/resctrl/resctrl_val.c
@@ -160,7 +160,7 @@ static int read_from_imc_dir(char *imc_dir, int count)
return -1;
}
- if (fscanf(fp, "%s", cas_count_cfg) <= 0) {
+ if (fscanf(fp, "%1023s", cas_count_cfg) <= 0) {
ksft_perror("Could not get iMC cas count read");
fclose(fp);
@@ -178,7 +178,7 @@ static int read_from_imc_dir(char *imc_dir, int count)
return -1;
}
- if (fscanf(fp, "%s", cas_count_cfg) <= 0) {
+ if (fscanf(fp, "%1023s", cas_count_cfg) <= 0) {
ksft_perror("Could not get iMC cas count write");
fclose(fp);
diff --git a/tools/testing/selftests/resctrl/resctrlfs.c b/tools/testing/selftests/resctrl/resctrlfs.c
index 250c320349a78..a53cd1cb6e0c6 100644
--- a/tools/testing/selftests/resctrl/resctrlfs.c
+++ b/tools/testing/selftests/resctrl/resctrlfs.c
@@ -182,7 +182,7 @@ int get_cache_size(int cpu_no, const char *cache_type, unsigned long *cache_size
return -1;
}
- if (fscanf(fp, "%s", cache_str) <= 0) {
+ if (fscanf(fp, "%63s", cache_str) <= 0) {
ksft_perror("Could not get cache_size");
fclose(fp);
--
2.43.0
From: Mark Brown <broonie(a)kernel.org>
[ Upstream commit 27141b690547da5650a420f26ec369ba142a9ebb ]
The PAC exec_sign_all() test spawns some child processes, creating pipes
to be stdin and stdout for the child. It cleans up most of the file
descriptors that are created as part of this but neglects to clean up the
parent end of the child stdin and stdout. Add the missing close() calls.
Signed-off-by: Mark Brown <broonie(a)kernel.org>
Link: https://lore.kernel.org/r/20241111-arm64-pac-test-collisions-v1-1-171875f37…
Signed-off-by: Catalin Marinas <catalin.marinas(a)arm.com>
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
---
tools/testing/selftests/arm64/pauth/pac.c | 3 +++
1 file changed, 3 insertions(+)
diff --git a/tools/testing/selftests/arm64/pauth/pac.c b/tools/testing/selftests/arm64/pauth/pac.c
index b743daa772f55..5a07b3958fbf2 100644
--- a/tools/testing/selftests/arm64/pauth/pac.c
+++ b/tools/testing/selftests/arm64/pauth/pac.c
@@ -182,6 +182,9 @@ int exec_sign_all(struct signatures *signed_vals, size_t val)
return -1;
}
+ close(new_stdin[1]);
+ close(new_stdout[0]);
+
return 0;
}
--
2.43.0
From: Mark Brown <broonie(a)kernel.org>
[ Upstream commit 27141b690547da5650a420f26ec369ba142a9ebb ]
The PAC exec_sign_all() test spawns some child processes, creating pipes
to be stdin and stdout for the child. It cleans up most of the file
descriptors that are created as part of this but neglects to clean up the
parent end of the child stdin and stdout. Add the missing close() calls.
Signed-off-by: Mark Brown <broonie(a)kernel.org>
Link: https://lore.kernel.org/r/20241111-arm64-pac-test-collisions-v1-1-171875f37…
Signed-off-by: Catalin Marinas <catalin.marinas(a)arm.com>
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
---
tools/testing/selftests/arm64/pauth/pac.c | 3 +++
1 file changed, 3 insertions(+)
diff --git a/tools/testing/selftests/arm64/pauth/pac.c b/tools/testing/selftests/arm64/pauth/pac.c
index b743daa772f55..5a07b3958fbf2 100644
--- a/tools/testing/selftests/arm64/pauth/pac.c
+++ b/tools/testing/selftests/arm64/pauth/pac.c
@@ -182,6 +182,9 @@ int exec_sign_all(struct signatures *signed_vals, size_t val)
return -1;
}
+ close(new_stdin[1]);
+ close(new_stdout[0]);
+
return 0;
}
--
2.43.0
From: Mark Brown <broonie(a)kernel.org>
[ Upstream commit 3e360ef0c0a1fb6ce9a302e40b8057c41ba8a9d2 ]
When building for streaming SVE the irritator for SVE skips updates of both
P0 and FFR. While FFR is skipped since it might not be present there is no
reason to skip corrupting P0 so switch to an instruction valid in streaming
mode and move the ifdef.
Signed-off-by: Mark Brown <broonie(a)kernel.org>
Link: https://lore.kernel.org/r/20241107-arm64-fp-stress-irritator-v2-3-c4b9622e3…
Signed-off-by: Catalin Marinas <catalin.marinas(a)arm.com>
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
---
tools/testing/selftests/arm64/fp/sve-test.S | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/tools/testing/selftests/arm64/fp/sve-test.S b/tools/testing/selftests/arm64/fp/sve-test.S
index e3e08d9c7020e..f631d17c899d2 100644
--- a/tools/testing/selftests/arm64/fp/sve-test.S
+++ b/tools/testing/selftests/arm64/fp/sve-test.S
@@ -459,7 +459,8 @@ function irritator_handler
movi v9.16b, #2
movi v31.8b, #3
// And P0
- rdffr p0.b
+ ptrue p0.d
+#ifndef SSVE
// And FFR
wrffr p15.b
--
2.43.0
From: Mark Brown <broonie(a)kernel.org>
[ Upstream commit 27141b690547da5650a420f26ec369ba142a9ebb ]
The PAC exec_sign_all() test spawns some child processes, creating pipes
to be stdin and stdout for the child. It cleans up most of the file
descriptors that are created as part of this but neglects to clean up the
parent end of the child stdin and stdout. Add the missing close() calls.
Signed-off-by: Mark Brown <broonie(a)kernel.org>
Link: https://lore.kernel.org/r/20241111-arm64-pac-test-collisions-v1-1-171875f37…
Signed-off-by: Catalin Marinas <catalin.marinas(a)arm.com>
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
---
tools/testing/selftests/arm64/pauth/pac.c | 3 +++
1 file changed, 3 insertions(+)
diff --git a/tools/testing/selftests/arm64/pauth/pac.c b/tools/testing/selftests/arm64/pauth/pac.c
index b743daa772f55..5a07b3958fbf2 100644
--- a/tools/testing/selftests/arm64/pauth/pac.c
+++ b/tools/testing/selftests/arm64/pauth/pac.c
@@ -182,6 +182,9 @@ int exec_sign_all(struct signatures *signed_vals, size_t val)
return -1;
}
+ close(new_stdin[1]);
+ close(new_stdout[0]);
+
return 0;
}
--
2.43.0
From: Mark Brown <broonie(a)kernel.org>
[ Upstream commit 3e360ef0c0a1fb6ce9a302e40b8057c41ba8a9d2 ]
When building for streaming SVE the irritator for SVE skips updates of both
P0 and FFR. While FFR is skipped since it might not be present there is no
reason to skip corrupting P0 so switch to an instruction valid in streaming
mode and move the ifdef.
Signed-off-by: Mark Brown <broonie(a)kernel.org>
Link: https://lore.kernel.org/r/20241107-arm64-fp-stress-irritator-v2-3-c4b9622e3…
Signed-off-by: Catalin Marinas <catalin.marinas(a)arm.com>
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
---
tools/testing/selftests/arm64/fp/sve-test.S | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/tools/testing/selftests/arm64/fp/sve-test.S b/tools/testing/selftests/arm64/fp/sve-test.S
index 2a18cb4c528c5..b9648daeabbb1 100644
--- a/tools/testing/selftests/arm64/fp/sve-test.S
+++ b/tools/testing/selftests/arm64/fp/sve-test.S
@@ -304,9 +304,9 @@ function irritator_handler
movi v0.8b, #1
movi v9.16b, #2
movi v31.8b, #3
-#ifndef SSVE
// And P0
- rdffr p0.b
+ ptrue p0.d
+#ifndef SSVE
// And FFR
wrffr p15.b
#endif
--
2.43.0
From: Mark Brown <broonie(a)kernel.org>
[ Upstream commit 27141b690547da5650a420f26ec369ba142a9ebb ]
The PAC exec_sign_all() test spawns some child processes, creating pipes
to be stdin and stdout for the child. It cleans up most of the file
descriptors that are created as part of this but neglects to clean up the
parent end of the child stdin and stdout. Add the missing close() calls.
Signed-off-by: Mark Brown <broonie(a)kernel.org>
Link: https://lore.kernel.org/r/20241111-arm64-pac-test-collisions-v1-1-171875f37…
Signed-off-by: Catalin Marinas <catalin.marinas(a)arm.com>
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
---
tools/testing/selftests/arm64/pauth/pac.c | 3 +++
1 file changed, 3 insertions(+)
diff --git a/tools/testing/selftests/arm64/pauth/pac.c b/tools/testing/selftests/arm64/pauth/pac.c
index b743daa772f55..5a07b3958fbf2 100644
--- a/tools/testing/selftests/arm64/pauth/pac.c
+++ b/tools/testing/selftests/arm64/pauth/pac.c
@@ -182,6 +182,9 @@ int exec_sign_all(struct signatures *signed_vals, size_t val)
return -1;
}
+ close(new_stdin[1]);
+ close(new_stdout[0]);
+
return 0;
}
--
2.43.0
From: Mark Brown <broonie(a)kernel.org>
[ Upstream commit 3e360ef0c0a1fb6ce9a302e40b8057c41ba8a9d2 ]
When building for streaming SVE the irritator for SVE skips updates of both
P0 and FFR. While FFR is skipped since it might not be present there is no
reason to skip corrupting P0 so switch to an instruction valid in streaming
mode and move the ifdef.
Signed-off-by: Mark Brown <broonie(a)kernel.org>
Link: https://lore.kernel.org/r/20241107-arm64-fp-stress-irritator-v2-3-c4b9622e3…
Signed-off-by: Catalin Marinas <catalin.marinas(a)arm.com>
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
---
tools/testing/selftests/arm64/fp/sve-test.S | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/tools/testing/selftests/arm64/fp/sve-test.S b/tools/testing/selftests/arm64/fp/sve-test.S
index 4328895dfc876..ff60360a97f80 100644
--- a/tools/testing/selftests/arm64/fp/sve-test.S
+++ b/tools/testing/selftests/arm64/fp/sve-test.S
@@ -304,9 +304,9 @@ function irritator_handler
movi v0.8b, #1
movi v9.16b, #2
movi v31.8b, #3
-#ifndef SSVE
// And P0
- rdffr p0.b
+ ptrue p0.d
+#ifndef SSVE
// And FFR
wrffr p15.b
#endif
--
2.43.0
From: Mark Brown <broonie(a)kernel.org>
[ Upstream commit dca93d29845dfed60910ba13dbfb6ae6a0e19f6d ]
Currently if we encounter an error between fork() and exec() of a child
process we log the error to stderr. This means that the errors don't get
annotated with the child information which makes diagnostics harder and
means that if we miss the exit signal from the child we can deadlock
waiting for output from the child. Improve robustness and output quality
by logging to stdout instead.
Signed-off-by: Mark Brown <broonie(a)kernel.org>
Link: https://lore.kernel.org/r/20241023-arm64-fp-stress-exec-fail-v1-1-ee3c62932…
Signed-off-by: Catalin Marinas <catalin.marinas(a)arm.com>
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
---
tools/testing/selftests/arm64/fp/fp-stress.c | 15 +++++++--------
1 file changed, 7 insertions(+), 8 deletions(-)
diff --git a/tools/testing/selftests/arm64/fp/fp-stress.c b/tools/testing/selftests/arm64/fp/fp-stress.c
index dd31647b00a22..cf9d7b2e4630c 100644
--- a/tools/testing/selftests/arm64/fp/fp-stress.c
+++ b/tools/testing/selftests/arm64/fp/fp-stress.c
@@ -79,7 +79,7 @@ static void child_start(struct child_data *child, const char *program)
*/
ret = dup2(pipefd[1], 1);
if (ret == -1) {
- fprintf(stderr, "dup2() %d\n", errno);
+ printf("dup2() %d\n", errno);
exit(EXIT_FAILURE);
}
@@ -89,7 +89,7 @@ static void child_start(struct child_data *child, const char *program)
*/
ret = dup2(startup_pipe[0], 3);
if (ret == -1) {
- fprintf(stderr, "dup2() %d\n", errno);
+ printf("dup2() %d\n", errno);
exit(EXIT_FAILURE);
}
@@ -107,16 +107,15 @@ static void child_start(struct child_data *child, const char *program)
*/
ret = read(3, &i, sizeof(i));
if (ret < 0)
- fprintf(stderr, "read(startp pipe) failed: %s (%d)\n",
- strerror(errno), errno);
+ printf("read(startp pipe) failed: %s (%d)\n",
+ strerror(errno), errno);
if (ret > 0)
- fprintf(stderr, "%d bytes of data on startup pipe\n",
- ret);
+ printf("%d bytes of data on startup pipe\n", ret);
close(3);
ret = execl(program, program, NULL);
- fprintf(stderr, "execl(%s) failed: %d (%s)\n",
- program, errno, strerror(errno));
+ printf("execl(%s) failed: %d (%s)\n",
+ program, errno, strerror(errno));
exit(EXIT_FAILURE);
} else {
--
2.43.0
From: Mark Brown <broonie(a)kernel.org>
[ Upstream commit 27141b690547da5650a420f26ec369ba142a9ebb ]
The PAC exec_sign_all() test spawns some child processes, creating pipes
to be stdin and stdout for the child. It cleans up most of the file
descriptors that are created as part of this but neglects to clean up the
parent end of the child stdin and stdout. Add the missing close() calls.
Signed-off-by: Mark Brown <broonie(a)kernel.org>
Link: https://lore.kernel.org/r/20241111-arm64-pac-test-collisions-v1-1-171875f37…
Signed-off-by: Catalin Marinas <catalin.marinas(a)arm.com>
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
---
tools/testing/selftests/arm64/pauth/pac.c | 3 +++
1 file changed, 3 insertions(+)
diff --git a/tools/testing/selftests/arm64/pauth/pac.c b/tools/testing/selftests/arm64/pauth/pac.c
index b743daa772f55..5a07b3958fbf2 100644
--- a/tools/testing/selftests/arm64/pauth/pac.c
+++ b/tools/testing/selftests/arm64/pauth/pac.c
@@ -182,6 +182,9 @@ int exec_sign_all(struct signatures *signed_vals, size_t val)
return -1;
}
+ close(new_stdin[1]);
+ close(new_stdout[0]);
+
return 0;
}
--
2.43.0
From: Mark Brown <broonie(a)kernel.org>
[ Upstream commit 3e360ef0c0a1fb6ce9a302e40b8057c41ba8a9d2 ]
When building for streaming SVE the irritator for SVE skips updates of both
P0 and FFR. While FFR is skipped since it might not be present there is no
reason to skip corrupting P0 so switch to an instruction valid in streaming
mode and move the ifdef.
Signed-off-by: Mark Brown <broonie(a)kernel.org>
Link: https://lore.kernel.org/r/20241107-arm64-fp-stress-irritator-v2-3-c4b9622e3…
Signed-off-by: Catalin Marinas <catalin.marinas(a)arm.com>
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
---
tools/testing/selftests/arm64/fp/sve-test.S | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/tools/testing/selftests/arm64/fp/sve-test.S b/tools/testing/selftests/arm64/fp/sve-test.S
index fff60e2a25add..4fcb492aee1fb 100644
--- a/tools/testing/selftests/arm64/fp/sve-test.S
+++ b/tools/testing/selftests/arm64/fp/sve-test.S
@@ -304,9 +304,9 @@ function irritator_handler
movi v0.8b, #1
movi v9.16b, #2
movi v31.8b, #3
-#ifndef SSVE
// And P0
- rdffr p0.b
+ ptrue p0.d
+#ifndef SSVE
// And FFR
wrffr p15.b
#endif
--
2.43.0
From: Mark Brown <broonie(a)kernel.org>
[ Upstream commit dca93d29845dfed60910ba13dbfb6ae6a0e19f6d ]
Currently if we encounter an error between fork() and exec() of a child
process we log the error to stderr. This means that the errors don't get
annotated with the child information which makes diagnostics harder and
means that if we miss the exit signal from the child we can deadlock
waiting for output from the child. Improve robustness and output quality
by logging to stdout instead.
Signed-off-by: Mark Brown <broonie(a)kernel.org>
Link: https://lore.kernel.org/r/20241023-arm64-fp-stress-exec-fail-v1-1-ee3c62932…
Signed-off-by: Catalin Marinas <catalin.marinas(a)arm.com>
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
---
tools/testing/selftests/arm64/fp/fp-stress.c | 15 +++++++--------
1 file changed, 7 insertions(+), 8 deletions(-)
diff --git a/tools/testing/selftests/arm64/fp/fp-stress.c b/tools/testing/selftests/arm64/fp/fp-stress.c
index faac24bdefeb9..80f22789504d6 100644
--- a/tools/testing/selftests/arm64/fp/fp-stress.c
+++ b/tools/testing/selftests/arm64/fp/fp-stress.c
@@ -79,7 +79,7 @@ static void child_start(struct child_data *child, const char *program)
*/
ret = dup2(pipefd[1], 1);
if (ret == -1) {
- fprintf(stderr, "dup2() %d\n", errno);
+ printf("dup2() %d\n", errno);
exit(EXIT_FAILURE);
}
@@ -89,7 +89,7 @@ static void child_start(struct child_data *child, const char *program)
*/
ret = dup2(startup_pipe[0], 3);
if (ret == -1) {
- fprintf(stderr, "dup2() %d\n", errno);
+ printf("dup2() %d\n", errno);
exit(EXIT_FAILURE);
}
@@ -107,16 +107,15 @@ static void child_start(struct child_data *child, const char *program)
*/
ret = read(3, &i, sizeof(i));
if (ret < 0)
- fprintf(stderr, "read(startp pipe) failed: %s (%d)\n",
- strerror(errno), errno);
+ printf("read(startp pipe) failed: %s (%d)\n",
+ strerror(errno), errno);
if (ret > 0)
- fprintf(stderr, "%d bytes of data on startup pipe\n",
- ret);
+ printf("%d bytes of data on startup pipe\n", ret);
close(3);
ret = execl(program, program, NULL);
- fprintf(stderr, "execl(%s) failed: %d (%s)\n",
- program, errno, strerror(errno));
+ printf("execl(%s) failed: %d (%s)\n",
+ program, errno, strerror(errno));
exit(EXIT_FAILURE);
} else {
--
2.43.0
From: Mark Brown <broonie(a)kernel.org>
[ Upstream commit 27141b690547da5650a420f26ec369ba142a9ebb ]
The PAC exec_sign_all() test spawns some child processes, creating pipes
to be stdin and stdout for the child. It cleans up most of the file
descriptors that are created as part of this but neglects to clean up the
parent end of the child stdin and stdout. Add the missing close() calls.
Signed-off-by: Mark Brown <broonie(a)kernel.org>
Link: https://lore.kernel.org/r/20241111-arm64-pac-test-collisions-v1-1-171875f37…
Signed-off-by: Catalin Marinas <catalin.marinas(a)arm.com>
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
---
tools/testing/selftests/arm64/pauth/pac.c | 3 +++
1 file changed, 3 insertions(+)
diff --git a/tools/testing/selftests/arm64/pauth/pac.c b/tools/testing/selftests/arm64/pauth/pac.c
index b743daa772f55..5a07b3958fbf2 100644
--- a/tools/testing/selftests/arm64/pauth/pac.c
+++ b/tools/testing/selftests/arm64/pauth/pac.c
@@ -182,6 +182,9 @@ int exec_sign_all(struct signatures *signed_vals, size_t val)
return -1;
}
+ close(new_stdin[1]);
+ close(new_stdout[0]);
+
return 0;
}
--
2.43.0
From: Mark Brown <broonie(a)kernel.org>
[ Upstream commit 3e360ef0c0a1fb6ce9a302e40b8057c41ba8a9d2 ]
When building for streaming SVE the irritator for SVE skips updates of both
P0 and FFR. While FFR is skipped since it might not be present there is no
reason to skip corrupting P0 so switch to an instruction valid in streaming
mode and move the ifdef.
Signed-off-by: Mark Brown <broonie(a)kernel.org>
Link: https://lore.kernel.org/r/20241107-arm64-fp-stress-irritator-v2-3-c4b9622e3…
Signed-off-by: Catalin Marinas <catalin.marinas(a)arm.com>
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
---
tools/testing/selftests/arm64/fp/sve-test.S | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/tools/testing/selftests/arm64/fp/sve-test.S b/tools/testing/selftests/arm64/fp/sve-test.S
index fff60e2a25add..4fcb492aee1fb 100644
--- a/tools/testing/selftests/arm64/fp/sve-test.S
+++ b/tools/testing/selftests/arm64/fp/sve-test.S
@@ -304,9 +304,9 @@ function irritator_handler
movi v0.8b, #1
movi v9.16b, #2
movi v31.8b, #3
-#ifndef SSVE
// And P0
- rdffr p0.b
+ ptrue p0.d
+#ifndef SSVE
// And FFR
wrffr p15.b
#endif
--
2.43.0
From: Mark Brown <broonie(a)kernel.org>
[ Upstream commit dca93d29845dfed60910ba13dbfb6ae6a0e19f6d ]
Currently if we encounter an error between fork() and exec() of a child
process we log the error to stderr. This means that the errors don't get
annotated with the child information which makes diagnostics harder and
means that if we miss the exit signal from the child we can deadlock
waiting for output from the child. Improve robustness and output quality
by logging to stdout instead.
Signed-off-by: Mark Brown <broonie(a)kernel.org>
Link: https://lore.kernel.org/r/20241023-arm64-fp-stress-exec-fail-v1-1-ee3c62932…
Signed-off-by: Catalin Marinas <catalin.marinas(a)arm.com>
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
---
tools/testing/selftests/arm64/fp/fp-stress.c | 15 +++++++--------
1 file changed, 7 insertions(+), 8 deletions(-)
diff --git a/tools/testing/selftests/arm64/fp/fp-stress.c b/tools/testing/selftests/arm64/fp/fp-stress.c
index faac24bdefeb9..80f22789504d6 100644
--- a/tools/testing/selftests/arm64/fp/fp-stress.c
+++ b/tools/testing/selftests/arm64/fp/fp-stress.c
@@ -79,7 +79,7 @@ static void child_start(struct child_data *child, const char *program)
*/
ret = dup2(pipefd[1], 1);
if (ret == -1) {
- fprintf(stderr, "dup2() %d\n", errno);
+ printf("dup2() %d\n", errno);
exit(EXIT_FAILURE);
}
@@ -89,7 +89,7 @@ static void child_start(struct child_data *child, const char *program)
*/
ret = dup2(startup_pipe[0], 3);
if (ret == -1) {
- fprintf(stderr, "dup2() %d\n", errno);
+ printf("dup2() %d\n", errno);
exit(EXIT_FAILURE);
}
@@ -107,16 +107,15 @@ static void child_start(struct child_data *child, const char *program)
*/
ret = read(3, &i, sizeof(i));
if (ret < 0)
- fprintf(stderr, "read(startp pipe) failed: %s (%d)\n",
- strerror(errno), errno);
+ printf("read(startp pipe) failed: %s (%d)\n",
+ strerror(errno), errno);
if (ret > 0)
- fprintf(stderr, "%d bytes of data on startup pipe\n",
- ret);
+ printf("%d bytes of data on startup pipe\n", ret);
close(3);
ret = execl(program, program, NULL);
- fprintf(stderr, "execl(%s) failed: %d (%s)\n",
- program, errno, strerror(errno));
+ printf("execl(%s) failed: %d (%s)\n",
+ program, errno, strerror(errno));
exit(EXIT_FAILURE);
} else {
--
2.43.0
Compiled binary files should be added to .gitignore
'git status' complains:
Untracked files:
(use "git add <file>..." to include in what will be committed)
alsa/global-timer
alsa/utimer-test
Cc: Mark Brown <broonie(a)kernel.org>
Cc: Jaroslav Kysela <perex(a)perex.cz>
Cc: Takashi Iwai <tiwai(a)suse.com>
Cc: Shuah Khan <shuah(a)kernel.org>
Signed-off-by: Li Zhijian <lizhijian(a)fujitsu.com>
---
Cc: linux-sound(a)vger.kernel.org
---
Hello,
Cover letter is here.
This patch set aims to make 'git status' clear after 'make' and 'make
run_tests' for kselftests.
---
V3:
sorted the ignore files
V2:
split as a separate patch from a small one [0]
[0] https://lore.kernel.org/linux-kselftest/20241015010817.453539-1-lizhijian@f…
Signed-off-by: Li Zhijian <lizhijian(a)fujitsu.com>
---
tools/testing/selftests/alsa/.gitignore | 2 ++
1 file changed, 2 insertions(+)
diff --git a/tools/testing/selftests/alsa/.gitignore b/tools/testing/selftests/alsa/.gitignore
index 12dc3fcd3456..3dd8e1176b89 100644
--- a/tools/testing/selftests/alsa/.gitignore
+++ b/tools/testing/selftests/alsa/.gitignore
@@ -1,3 +1,5 @@
+global-timer
mixer-test
pcm-test
test-pcmtest-driver
+utimer-test
--
2.44.0
The include.sh file is generated for inclusion and should not be executable.
Otherwise, it will be added to kselftest-list.txt. Additionally, add the
executable bit for test.py at the same time to ensure proper functionality.
Fixes: 3ade6ce1255e ("selftests: rds: add testing infrastructure")
Signed-off-by: Hangbin Liu <liuhangbin(a)gmail.com>
---
tools/testing/selftests/net/rds/Makefile | 3 ++-
tools/testing/selftests/net/rds/test.py | 0
2 files changed, 2 insertions(+), 1 deletion(-)
mode change 100644 => 100755 tools/testing/selftests/net/rds/test.py
diff --git a/tools/testing/selftests/net/rds/Makefile b/tools/testing/selftests/net/rds/Makefile
index da9714bc7aad..cf30307a829b 100644
--- a/tools/testing/selftests/net/rds/Makefile
+++ b/tools/testing/selftests/net/rds/Makefile
@@ -4,9 +4,10 @@ all:
@echo mk_build_dir="$(shell pwd)" > include.sh
TEST_PROGS := run.sh \
- include.sh \
test.py
+TEST_FILES := include.sh
+
EXTRA_CLEAN := /tmp/rds_logs
include ../../lib.mk
diff --git a/tools/testing/selftests/net/rds/test.py b/tools/testing/selftests/net/rds/test.py
old mode 100644
new mode 100755
--
2.39.3 (Apple Git-146)
Hi Linus,
Please pull the following kunit update for Linux 6.13-rc1.
This pull request is fixed up with the right fix for the UAF bug and
includes other fixes.
linux_kselftest-kunit-6.13-rc1-fixed
kunit update for Linux 6.13-rc1
-- fixes user-after-free (UAF) bug in kunit_init_suite()
-- adds option to kunit tool to print just the summary of test results
-- adds option to kunit tool to print just the failed test results
-- fixes kunit_zalloc_skb() to use user passed in gfp value instead of
hardcoding GFP_KERNEL
-- fixes kunit_zalloc_skb() kernel doc to include allocation flags variable
-- updates KUnit email address for Brendan Higgins
-- adds LoongArch config to qemu_configs
-- changes tool to allow overriding the shutdown mode from qemu config
-- enables shutdown in loongarch qemu_config
-- fixes potential null dereference in kunit_device_driver_test()
-- fixes debugfs to use IS_ERR() for alloc_string_stream() error check
diff is attached.
Tests passed on my kunit repo & linux-next:
- Build make allmodconfig
./tools/testing/kunit/kunit.py run
./tools/testing/kunit/kunit.py run --alltests
./tools/testing/kunit/kunit.py run --arch x86_64
./tools/testing/kunit/kunit.py run --alltests --arch x86_64
thanks,
-- Shuah
----------------------------------------------------------------
The following changes since commit 2d5404caa8c7bb5c4e0435f94b28834ae5456623:
Linux 6.12-rc7 (2024-11-10 14:19:35 -0800)
are available in the Git repository at:
git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest tags/linux_kselftest-kunit-6.13-rc1-fixed
for you to fetch changes up to 62adcae479fe5bc04fa3b6c3f93bd340441f8b25:
kunit: qemu_configs: loongarch: Enable shutdown (2024-11-19 15:26:30 -0700)
----------------------------------------------------------------
linux_kselftest-kunit-6.13-rc1-fixed
kunit update for Linux 6.13-rc1
-- fixes user-after-free (UAF) bug in kunit_init_suite()
-- adds option to kunit tool to print just the summary of test results
-- adds option to kunit tool to print just the failed test results
-- fixes kunit_zalloc_skb() to use user passed in gfp value instead of
hardcoding GFP_KERNEL
-- fixes kunit_zalloc_skb() kernel doc to include allocation flags variable
-- updates KUnit email address for Brendan Higgins
-- adds LoongArch config to qemu_configs
-- changes tool to allow overriding the shutdown mode from qemu config
-- enables shutdown in loongarch qemu_config
-- fixes potential null dereference in kunit_device_driver_test()
-- fixes debugfs to use IS_ERR() for alloc_string_stream() error check
----------------------------------------------------------------
Brendan Higgins (1):
MAINTAINERS: Update KUnit email address for Brendan Higgins
Dan Carpenter (2):
kunit: skb: use "gfp" variable instead of hardcoding GFP_KERNEL
kunit: skb: add gfp to kernel doc for kunit_zalloc_skb()
David Gow (1):
kunit: tool: Only print the summary
Jinjie Ruan (1):
kunit: string-stream: Fix a UAF bug in kunit_init_suite()
Kuan-Wei Chiu (1):
kunit: debugfs: Use IS_ERR() for alloc_string_stream() error check
Rae Moar (1):
kunit: tool: print failed tests only
Thomas Weißschuh (3):
kunit: qemu_configs: Add LoongArch config
kunit: tool: Allow overriding the shutdown mode from qemu config
kunit: qemu_configs: loongarch: Enable shutdown
Zichen Xie (1):
kunit: Fix potential null dereference in kunit_device_driver_test()
MAINTAINERS | 2 +-
include/kunit/skbuff.h | 5 +-
lib/kunit/debugfs.c | 9 +-
lib/kunit/kunit-test.c | 2 +
tools/testing/kunit/kunit.py | 28 +++++-
tools/testing/kunit/kunit_kernel.py | 4 +-
tools/testing/kunit/kunit_parser.py | 134 ++++++++++++++++----------
tools/testing/kunit/kunit_printer.py | 14 ++-
tools/testing/kunit/kunit_tool_test.py | 55 +++++------
tools/testing/kunit/qemu_configs/loongarch.py | 21 ++++
10 files changed, 183 insertions(+), 91 deletions(-)
create mode 100644 tools/testing/kunit/qemu_configs/loongarch.py
----------------------------------------------------------------
The testing effort is increasing throughout the community.
The tests are generally merged into the subsystem trees,
and are of relatively narrow interest. The patch volume on
linux-kselftest(a)vger.kernel.org makes it hard to follow
the changes to the framework, and discuss proposals.
Create a new ML focused on the framework of kselftests,
which will hopefully be similarly low volume to the workflows@
mailing list.
From the responses to v1 I gather that the preference is to
keep the existing list for all (or create an alias to it);
the framework-only section should be the one to have a new
list created.
Signed-off-by: Jakub Kicinski <kuba(a)kernel.org>
---
Sorry for the delay, the responses to v1 weren't super positive
but I keep thinking this would be very useful :) Or at the very
least I find workflows@ very useful and informative as a maintainer.
Posting as an RFC because we need to create the new ML.
v1: https://lore.kernel.org/20241003142328.622730-1-kuba@kernel.org
CC: shuah(a)kernel.org
CC: Tim.Bird(a)sony.com
CC: linux-kselftest(a)vger.kernel.org
CC: workflows(a)vger.kernel.org
---
MAINTAINERS | 15 ++++++++++++++-
1 file changed, 14 insertions(+), 1 deletion(-)
diff --git a/MAINTAINERS b/MAINTAINERS
index f245573722ef..345a0657ba1d 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -12395,7 +12395,7 @@ S: Supported
F: Documentation/admin-guide/reporting-regressions.rst
F: Documentation/process/handling-regressions.rst
-KERNEL SELFTEST FRAMEWORK
+KERNEL SELFTEST
M: Shuah Khan <shuah(a)kernel.org>
M: Shuah Khan <skhan(a)linuxfoundation.org>
L: linux-kselftest(a)vger.kernel.org
@@ -12405,6 +12405,19 @@ T: git git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest.git
F: Documentation/dev-tools/kselftest*
F: tools/testing/selftests/
+KERNEL SELFTEST FRAMEWORK
+M: Shuah Khan <shuah(a)kernel.org>
+M: Shuah Khan <skhan(a)linuxfoundation.org>
+L: linux-kselftest-core(a)vger.kernel.org
+S: Maintained
+F: Documentation/dev-tools/kselftest*
+F: tools/testing/selftests/kselftest/
+F: tools/testing/selftests/lib/
+F: tools/testing/selftests/lib.mk
+F: tools/testing/selftests/Makefile
+F: tools/testing/selftests/*.sh
+F: tools/testing/selftests/*.h
+
KERNEL SMB3 SERVER (KSMBD)
M: Namjae Jeon <linkinjeon(a)kernel.org>
M: Steve French <sfrench(a)samba.org>
--
2.47.0
This patch series includes some netns-related improvements and fixes for
RTNL and ip_tunnel, to make link creation more intuitive:
- Creating link in another net namespace doesn't conflict with link names
in current one.
- Refector rtnetlink link creation. Create link in target namespace
directly. Pass both source and link netns to drivers via newlink()
callback.
So that
# ip link add netns ns1 link-netns ns2 tun0 type gre ...
will create tun0 in ns1, rather than create it in ns2 and move to ns1.
And don't conflict with another interface named "tun0" in current netns.
---
v4:
- Pack newlink() parameters to a single struct.
- Use ynl async_msg_queue.empty() in selftest.
v3:
link: https://lore.kernel.org/all/20241113125715.150201-1-shaw.leon@gmail.com/
- Drop "netns_atomic" flag and module parameter. Add netns parameter to
newlink() instead, and convert drivers accordingly.
- Move python NetNSEnter helper to net selftest lib.
v2:
link: https://lore.kernel.org/all/20241107133004.7469-1-shaw.leon@gmail.com/
- Check NLM_F_EXCL to ensure only link creation is affected.
- Add self tests for link name/ifindex conflict and notifications
in different netns.
- Changes in dummy driver and ynl in order to add the test case.
v1:
link: https://lore.kernel.org/all/20241023023146.372653-1-shaw.leon@gmail.com/
Xiao Liang (5):
net: ip_tunnel: Build flow in underlay net namespace
rtnetlink: Lookup device in target netns when creating link
rtnetlink: Decouple net namespaces in rtnl_newlink_create()
selftests: net: Add python context manager for netns entering
selftests: net: Add two test cases for link netns
drivers/infiniband/ulp/ipoib/ipoib_netlink.c | 11 ++++--
drivers/net/amt.c | 13 ++++---
drivers/net/bareudp.c | 11 ++++--
drivers/net/bonding/bond_netlink.c | 8 ++--
drivers/net/can/dev/netlink.c | 4 +-
drivers/net/can/vxcan.c | 11 ++++--
.../ethernet/qualcomm/rmnet/rmnet_config.c | 11 ++++--
drivers/net/geneve.c | 11 ++++--
drivers/net/gtp.c | 9 +++--
drivers/net/ipvlan/ipvlan.h | 4 +-
drivers/net/ipvlan/ipvlan_main.c | 11 ++++--
drivers/net/ipvlan/ipvtap.c | 7 ++--
drivers/net/macsec.c | 11 ++++--
drivers/net/macvlan.c | 8 ++--
drivers/net/macvtap.c | 8 ++--
drivers/net/netkit.c | 11 ++++--
drivers/net/pfcp.c | 8 ++--
drivers/net/ppp/ppp_generic.c | 10 +++--
drivers/net/team/team_core.c | 7 ++--
drivers/net/veth.c | 11 ++++--
drivers/net/vrf.c | 7 ++--
drivers/net/vxlan/vxlan_core.c | 11 ++++--
drivers/net/wireguard/device.c | 8 ++--
drivers/net/wireless/virtual/virt_wifi.c | 10 +++--
drivers/net/wwan/wwan_core.c | 15 +++++--
include/net/ip_tunnels.h | 5 ++-
include/net/rtnetlink.h | 34 +++++++++++++---
net/8021q/vlan_netlink.c | 11 ++++--
net/batman-adv/soft-interface.c | 8 ++--
net/bridge/br_netlink.c | 8 ++--
net/caif/chnl_net.c | 6 +--
net/core/rtnetlink.c | 29 +++++++++-----
net/hsr/hsr_netlink.c | 14 ++++---
net/ieee802154/6lowpan/core.c | 9 +++--
net/ipv4/ip_gre.c | 27 ++++++++-----
net/ipv4/ip_tunnel.c | 16 ++++----
net/ipv4/ip_vti.c | 10 +++--
net/ipv4/ipip.c | 10 +++--
net/ipv6/ip6_gre.c | 28 +++++++------
net/ipv6/ip6_tunnel.c | 16 ++++----
net/ipv6/ip6_vti.c | 15 ++++---
net/ipv6/sit.c | 16 ++++----
net/xfrm/xfrm_interface_core.c | 14 +++----
tools/testing/selftests/net/Makefile | 1 +
.../testing/selftests/net/lib/py/__init__.py | 2 +-
tools/testing/selftests/net/lib/py/netns.py | 18 +++++++++
tools/testing/selftests/net/netns-name.sh | 10 +++++
tools/testing/selftests/net/netns_atomic.py | 39 +++++++++++++++++++
48 files changed, 377 insertions(+), 205 deletions(-)
create mode 100755 tools/testing/selftests/net/netns_atomic.py
--
2.47.0
Hi,
At Linux Plumbers, a few dozen of us gathered together to discuss how
to expose what tests subsystem maintainers would like to run for every
patch submitted or when CI runs tests. We agreed on a mock up of a
yaml template to start gathering info. The yaml file could be
temporarily stored on kernelci.org until a more permanent home could
be found. Attached is a template to start the conversation.
Longer story.
The current problem is CI systems are not unanimous about what tests
they run on submitted patches or git branches. This makes it
difficult to figure out why a test failed or how to reproduce.
Further, it isn't always clear what tests a normal contributor should
run before posting patches.
It has been long communicated that the tests LTP, xfstest and/or
kselftests should be the tests to run. However, not all maintainers
use those tests for their subsystems. I am hoping to either capture
those tests or find ways to convince them to add their tests to the
preferred locations.
The goal is for a given subsystem (defined in MAINTAINERS), define a
set of tests that should be run for any contributions to that
subsystem. The hope is the collective CI results can be triaged
collectively (because they are related) and even have the numerous
flakes waived collectively (same reason) improving the ability to
find and debug new test failures. Because the tests and process are
known, having a human help debug any failures becomes easier.
The plan is to put together a minimal yaml template that gets us going
(even if it is not optimized yet) and aim for about a dozen or so
subsystems. At that point we should have enough feedback to promote
this more seriously and talk optimizations.
Feedback encouraged.
Cheers,
Don
---
# List of tests by subsystem
#
# Tests should adhere to KTAP definitions for results
#
# Description of section entries
#
# maintainer: test maintainer - name <email>
# list: mailing list for discussion
# version: stable version of the test
# dependency: necessary distro package for testing
# test:
# path: internal git path or url to fetch from
# cmd: command to run; ability to run locally
# param: additional param necessary to run test
# hardware: hardware necessary for validation
#
# Subsystems (alphabetical)
KUNIT TEST:
maintainer:
- name: name1
email: email1
- name: name2
email: email2
list:
version:
dependency:
- dep1
- dep2
test:
- path: tools/testing/kunit
cmd:
param:
- path:
cmd:
param:
hardware: none
Hi Linus,
Please pull the following kselftest fixes update for Linux 6.13-rc1.
kselftest update for Linux 6.13-rc1
-- timers test - removes duplicates defines
-- timers test - fixes to improve error reporting
-- rtc test - adds check rtc alarm status to alarm test
-- resctrl test - adds array overrun checks during iMC config parsing code
-- resctrl test - adds array overflow checks when reading strings
-- resctrl test - fixes and reorganizing code
Looks like I forgot to mention signal test changes in my tag
message. :(
diff is attached.
Tests passed on my kselftest next branch:
- Individual test runs of signal, timers, rtc, and resctrl
thanks,
-- Shuah
----------------------------------------------------------------
The following changes since commit 8e929cb546ee42c9a61d24fae60605e9e3192354:
Linux 6.12-rc3 (2024-10-13 14:33:32 -0700)
are available in the Git repository at:
git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest tags/linux_kselftest-next-6.13-rc1
for you to fetch changes up to a44c26d7fa74a5f4d2795a5c55a2d6ec1ebf1e38:
selftests/resctrl: Replace magic constants used as array size (2024-11-04 17:02:03 -0700)
----------------------------------------------------------------
linux_kselftest-next-6.13-rc1
kselftest update for Linux 6.13-rc1
-- timers test - removes duplicates defines
-- timers test - fixes to improve error reporting
-- rtc test - adds check rtc alarm status to alarm test
-- resctrl test - adds array overrun checks during iMC config parsing code
-- resctrl test - adds array overflow checks when reading strings
-- resctrl test - fixes and reorganizing code
----------------------------------------------------------------
Chen Ni (1):
selftests: timers: Remove unneeded semicolon
Dev Jain (2):
selftests: Rename sigaltstack to generic signal
selftests: Add a test mangling with uc_sigmask
Gianfranco Trad (1):
selftests: timers: improve timer_create failure message
Joseph Jang (1):
selftest: rtc: Add to check rtc alarm status for alarm related test
Nícolas F. R. A. Prado (1):
docs: dev-tools: Add documentation for the device focused kselftests
Reinette Chatre (15):
selftests/resctrl: Make functions only used in same file static
selftests/resctrl: Print accurate buffer size as part of MBM results
selftests/resctrl: Fix memory overflow due to unhandled wraparound
selftests/resctrl: Protect against array overrun during iMC config parsing
selftests/resctrl: Protect against array overflow when reading strings
selftests/resctrl: Make wraparound handling obvious
selftests/resctrl: Remove "once" parameter required to be false
selftests/resctrl: Only support measured read operation
selftests/resctrl: Remove unused measurement code
selftests/resctrl: Make benchmark parameter passing robust
selftests/resctrl: Ensure measurements skip initialization of default benchmark
selftests/resctrl: Use cache size to determine "fill_buf" buffer size
selftests/resctrl: Do not compare performance counters and resctrl at low bandwidth
selftests/resctrl: Keep results from first test run
selftests/resctrl: Replace magic constants used as array size
Shuah Khan (2):
selftests: timers: Remove local NSEC_PER_SEC and USEC_PER_SEC defines
selftests:timers: remove local CLOCKID defines
Documentation/dev-tools/kselftest.rst | 9 +
Documentation/dev-tools/testing-devices.rst | 47 +++
tools/testing/selftests/Makefile | 2 +-
tools/testing/selftests/resctrl/cmt_test.c | 37 +-
tools/testing/selftests/resctrl/fill_buf.c | 45 +--
tools/testing/selftests/resctrl/mba_test.c | 54 ++-
tools/testing/selftests/resctrl/mbm_test.c | 37 +-
tools/testing/selftests/resctrl/resctrl.h | 79 +++-
tools/testing/selftests/resctrl/resctrl_tests.c | 95 ++++-
tools/testing/selftests/resctrl/resctrl_val.c | 447 ++++++---------------
tools/testing/selftests/resctrl/resctrlfs.c | 19 +-
tools/testing/selftests/rtc/Makefile | 2 +-
tools/testing/selftests/rtc/rtctest.c | 64 +++
.../selftests/{sigaltstack => signal}/.gitignore | 1 +
.../selftests/{sigaltstack => signal}/Makefile | 3 +-
.../current_stack_pointer.h | 0
tools/testing/selftests/signal/mangle_uc_sigmask.c | 184 +++++++++
.../selftests/{sigaltstack => signal}/sas.c | 0
tools/testing/selftests/timers/Makefile | 2 +-
tools/testing/selftests/timers/adjtick.c | 6 +-
.../testing/selftests/timers/alarmtimer-suspend.c | 22 +-
.../testing/selftests/timers/inconsistency-check.c | 21 +-
tools/testing/selftests/timers/leap-a-day.c | 2 +-
tools/testing/selftests/timers/mqueue-lat.c | 2 +-
tools/testing/selftests/timers/nanosleep.c | 21 +-
tools/testing/selftests/timers/nsleep-lat.c | 22 +-
tools/testing/selftests/timers/posix_timers.c | 15 +-
tools/testing/selftests/timers/raw_skew.c | 4 +-
tools/testing/selftests/timers/set-2038.c | 3 +-
tools/testing/selftests/timers/set-timer-lat.c | 21 +-
tools/testing/selftests/timers/valid-adjtimex.c | 4 +-
31 files changed, 701 insertions(+), 569 deletions(-)
create mode 100644 Documentation/dev-tools/testing-devices.rst
rename tools/testing/selftests/{sigaltstack => signal}/.gitignore (70%)
rename tools/testing/selftests/{sigaltstack => signal}/Makefile (56%)
rename tools/testing/selftests/{sigaltstack => signal}/current_stack_pointer.h (100%)
create mode 100644 tools/testing/selftests/signal/mangle_uc_sigmask.c
rename tools/testing/selftests/{sigaltstack => signal}/sas.c (100%)
----------------------------------------------------------------
Page Detective is a new kernel debugging tool that provides detailed
information about the usage and mapping of physical memory pages.
It is often known that a particular page is corrupted, but it is hard to
extract more information about such a page from live system. Examples
are:
- Checksum failure during live migration
- Filesystem journal failure
- dump_page warnings on the console log
- Unexcpected segfaults
Page Detective helps to extract more information from the kernel, so it
can be used by developers to root cause the associated problem.
It operates through the Linux debugfs interface, with two files: "virt"
and "phys".
The "virt" file takes a virtual address and PID and outputs information
about the corresponding page.
The "phys" file takes a physical address and outputs information about
that page.
The output is presented via kernel log messages (can be accessed with
dmesg), and includes information such as the page's reference count,
mapping, flags, and memory cgroup. It also shows whether the page is
mapped in the kernel page table, and if so, how many times.
Pasha Tatashin (6):
mm: Make get_vma_name() function public
pagewalk: Add a page table walker for init_mm page table
mm: Add a dump_page variant that accept log level argument
misc/page_detective: Introduce Page Detective
misc/page_detective: enable loadable module
selftests/page_detective: Introduce self tests for Page Detective
Documentation/misc-devices/index.rst | 1 +
Documentation/misc-devices/page_detective.rst | 78 ++
MAINTAINERS | 8 +
drivers/misc/Kconfig | 11 +
drivers/misc/Makefile | 1 +
drivers/misc/page_detective.c | 808 ++++++++++++++++++
fs/inode.c | 18 +-
fs/kernfs/dir.c | 1 +
fs/proc/task_mmu.c | 61 --
include/linux/fs.h | 5 +-
include/linux/mmdebug.h | 1 +
include/linux/pagewalk.h | 2 +
kernel/pid.c | 1 +
mm/debug.c | 53 +-
mm/memcontrol.c | 1 +
mm/oom_kill.c | 1 +
mm/pagewalk.c | 32 +
mm/vma.c | 60 ++
tools/testing/selftests/Makefile | 1 +
.../selftests/page_detective/.gitignore | 1 +
.../testing/selftests/page_detective/Makefile | 7 +
tools/testing/selftests/page_detective/config | 4 +
.../page_detective/page_detective_test.c | 727 ++++++++++++++++
23 files changed, 1787 insertions(+), 96 deletions(-)
create mode 100644 Documentation/misc-devices/page_detective.rst
create mode 100644 drivers/misc/page_detective.c
create mode 100644 tools/testing/selftests/page_detective/.gitignore
create mode 100644 tools/testing/selftests/page_detective/Makefile
create mode 100644 tools/testing/selftests/page_detective/config
create mode 100644 tools/testing/selftests/page_detective/page_detective_test.c
--
2.47.0.338.g60cca15819-goog
Hi all,
Another rather vague question from me - does anyone know any reasons
why we couldn't support KUNIT_EXPECT in atomic contexts?
For Address Space Isolation we have some tests that want to disable
preemption. I guess this is not a totally insane thing to want to do -
am I missing anything there?
You can see some examples here:
https://github.com/googleprodkernel/linux-kvm/blob/asi-lpc-24/arch/x86/mm/a…
(Actually, looks like in that code we just started doing KUNIT_EXPECT
in preempt-disabled regions anyway, and I never suffered any
consequences until recently. Probably we just never had those
assertions fail under DEBUG_ATOMIC_SLEEP).
From a very quick look, it seems like the only reason you can't
KUNIT_EXPECT from atomic right now is the GFP_KERNEL allocations.
So... what if we just made them GFP_ATOMIC? In general I think that's
a pretty ropey approach, but maybe fine in the context of unit tests?
Or, we could have variants of the assertions, or a test attribute, to
just use GFP_ATOMIC in the cases where it's needed.
Is there anything else missing that would need to be done?
Alternatively, we could expose the whole context concern to the user
in such a way that KUNIT_ASSERT can be used too, something like:
struct kunit_defer defer = KUNIT_INIT_DEFER(test);
preempt_disable();
KUNIT_EXPECT_DEFERRED_TRUE(&defer, ...);
KUNIT_ASSERT_DEFERRED_EQ(&defer, ...);
preempt_enable();
// Prints failures from above, aborts if the ASSERT failed:
kunit_defer_finish(&defer);
I hope that wouldn't be necessary though, it seems like a lot of API surface.
Currently the mount_setattr_test fails on machines with a 64K PAGE_SIZE,
with errors such as:
# RUN mount_setattr_idmapped.invalid_fd_negative ...
mkfs.ext4: No space left on device while writing out and closing file system
# mount_setattr_test.c:1055:invalid_fd_negative:Expected system("mkfs.ext4 -q /mnt/C/ext4.img") (256) == 0 (0)
# invalid_fd_negative: Test terminated by assertion
# FAIL mount_setattr_idmapped.invalid_fd_negative
not ok 12 mount_setattr_idmapped.invalid_fd_negative
The code creates a 100,000 byte tmpfs:
ASSERT_EQ(mount("testing", "/mnt", "tmpfs", MS_NOATIME | MS_NODEV,
"size=100000,mode=700"), 0);
And then a little later creates a 2MB ext4 filesystem in that tmpfs:
ASSERT_EQ(ftruncate(img_fd, 1024 * 2048), 0);
ASSERT_EQ(system("mkfs.ext4 -q /mnt/C/ext4.img"), 0);
At first glance it seems like that should never work, after all 2MB is
larger than 100,000 bytes. However the filesystem image doesn't actually
occupy 2MB on "disk" (actually RAM, due to tmpfs). On 4K kernels the
ext4.img uses ~84KB of actual space (according to du), which just fits.
However on 64K PAGE_SIZE kernels the ext4.img takes at least 256KB,
which is too large to fit in the tmpfs, hence the errors.
It seems fraught to rely on the ext4.img taking less space on disk than
the allocated size, so instead create the tmpfs with a size of 2MB. With
that all 21 tests pass on 64K PAGE_SIZE kernels.
Fixes: 01eadc8dd96d ("tests: add mount_setattr() selftests")
Signed-off-by: Michael Ellerman <mpe(a)ellerman.id.au>
---
tools/testing/selftests/mount_setattr/mount_setattr_test.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/tools/testing/selftests/mount_setattr/mount_setattr_test.c b/tools/testing/selftests/mount_setattr/mount_setattr_test.c
index 68801e1a9ec2..70f65eb320a7 100644
--- a/tools/testing/selftests/mount_setattr/mount_setattr_test.c
+++ b/tools/testing/selftests/mount_setattr/mount_setattr_test.c
@@ -1026,7 +1026,7 @@ FIXTURE_SETUP(mount_setattr_idmapped)
"size=100000,mode=700"), 0);
ASSERT_EQ(mount("testing", "/mnt", "tmpfs", MS_NOATIME | MS_NODEV,
- "size=100000,mode=700"), 0);
+ "size=2m,mode=700"), 0);
ASSERT_EQ(mkdir("/mnt/A", 0777), 0);
--
2.47.0
Hi Linus,
Please pull the following kunit update for Linux 6.13-rc1.
kunit update for Linux 6.13-rc1
-- fixes user-after-free (UAF) bug in kunit_init_suite()
-- adds option to kunit tool to print just the summary of test results
-- adds option to kunit tool to print just the failed test results
-- fixes kunit_zalloc_skb() to use user passed in gfp value instead of
hardcoding GFP_KERNEL
-- fixes kunit_zalloc_skb() kernel doc to include allocation flags variable
diff is attached.
Tests passed on my kunit repo:
- Build make allmodconfig
./tools/testing/kunit/kunit.py run
./tools/testing/kunit/kunit.py run --alltests
./tools/testing/kunit/kunit.py run --arch x86_64
./tools/testing/kunit/kunit.py run --alltests --arch x86_64
thanks,
-- Shuah
----------------------------------------------------------------
The following changes since commit 2d5404caa8c7bb5c4e0435f94b28834ae5456623:
Linux 6.12-rc7 (2024-11-10 14:19:35 -0800)
are available in the Git repository at:
git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest tags/linux_kselftest-kunit-6.13-rc1
for you to fetch changes up to 67b6d342fb6d5abfbeb71e0f23141b9b96cf7bb1:
kunit: tool: print failed tests only (2024-11-14 09:38:19 -0700)
----------------------------------------------------------------
linux_kselftest-kunit-6.13-rc1
kunit update for Linux 6.13-rc1
-- fixes user-after-free (UAF) bug in kunit_init_suite()
-- adds option to kunit tool to print just the summary of test results
-- adds option to kunit tool to print just the failed test results
-- fixes kunit_zalloc_skb() to use user passed in gfp value instead of
hardcoding GFP_KERNEL
-- fixes kunit_zalloc_skb() kernel doc to include allocation flags variable
----------------------------------------------------------------
Dan Carpenter (2):
kunit: skb: use "gfp" variable instead of hardcoding GFP_KERNEL
kunit: skb: add gfp to kernel doc for kunit_zalloc_skb()
David Gow (1):
kunit: tool: Only print the summary
Jinjie Ruan (1):
kunit: string-stream: Fix a UAF bug in kunit_init_suite()
Rae Moar (1):
kunit: tool: print failed tests only
include/kunit/skbuff.h | 5 +-
lib/kunit/string-stream.c | 1 +
tools/testing/kunit/kunit.py | 28 ++++++-
tools/testing/kunit/kunit_parser.py | 134 +++++++++++++++++++++------------
tools/testing/kunit/kunit_printer.py | 14 +++-
tools/testing/kunit/kunit_tool_test.py | 55 +++++++-------
6 files changed, 151 insertions(+), 86 deletions(-)
----------------------------------------------------------------
page_frag test module is an out of tree module, but built
using KDIR as the main kernel tree, the mm test suite is
just getting skipped if newly added page_frag test module
fails to compile due to kernel not yet compiled.
Fix the above problem by ensuring both kernel is built first
and a newer kernel which has page_frag_cache.h is used.
CC: Andrew Morton <akpm(a)linux-foundation.org>
CC: Alexander Duyck <alexanderduyck(a)fb.com>
CC: Linux-MM <linux-mm(a)kvack.org>
Fixes: 7fef0dec415c ("mm: page_frag: add a test module for page_frag")
Fixes: 65941f10caf2 ("mm: move the page fragment allocator from page_alloc into its own file")
Reported-by: Mark Brown <broonie(a)kernel.org>
Signed-off-by: Yunsheng Lin <linyunsheng(a)huawei.com>
Tested-by: Mark Brown <broonie(a)kernel.org>
---
V2: Repost by adding net-next ML.
Note, page_frag test module is only in the net-next tree for now,
so target the net-next tree.
---
tools/testing/selftests/mm/Makefile | 18 ++++++++++++++++++
tools/testing/selftests/mm/page_frag/Makefile | 2 +-
2 files changed, 19 insertions(+), 1 deletion(-)
diff --git a/tools/testing/selftests/mm/Makefile b/tools/testing/selftests/mm/Makefile
index acec529baaca..04e04733fc8a 100644
--- a/tools/testing/selftests/mm/Makefile
+++ b/tools/testing/selftests/mm/Makefile
@@ -36,7 +36,16 @@ MAKEFLAGS += --no-builtin-rules
CFLAGS = -Wall -I $(top_srcdir) $(EXTRA_CFLAGS) $(KHDR_INCLUDES) $(TOOLS_INCLUDES)
LDLIBS = -lrt -lpthread -lm
+KDIR ?= /lib/modules/$(shell uname -r)/build
+ifneq (,$(wildcard $(KDIR)/Module.symvers))
+ifneq (,$(wildcard $(KDIR)/include/linux/page_frag_cache.h))
TEST_GEN_MODS_DIR := page_frag
+else
+PAGE_FRAG_WARNING = "missing page_frag_cache.h, please use a newer kernel"
+endif
+else
+PAGE_FRAG_WARNING = "missing Module.symvers, please have the kernel built first"
+endif
TEST_GEN_FILES = cow
TEST_GEN_FILES += compaction_test
@@ -214,3 +223,12 @@ warn_missing_liburing:
echo "Warning: missing liburing support. Some tests will be skipped." ; \
echo
endif
+
+ifneq ($(PAGE_FRAG_WARNING),)
+all: warn_missing_page_frag
+
+warn_missing_page_frag:
+ @echo ; \
+ echo "Warning: $(PAGE_FRAG_WARNING). page_frag test will be skipped." ; \
+ echo
+endif
diff --git a/tools/testing/selftests/mm/page_frag/Makefile b/tools/testing/selftests/mm/page_frag/Makefile
index 58dda74d50a3..8c8bb39ffa28 100644
--- a/tools/testing/selftests/mm/page_frag/Makefile
+++ b/tools/testing/selftests/mm/page_frag/Makefile
@@ -1,5 +1,5 @@
PAGE_FRAG_TEST_DIR := $(realpath $(dir $(abspath $(lastword $(MAKEFILE_LIST)))))
-KDIR ?= $(abspath $(PAGE_FRAG_TEST_DIR)/../../../../..)
+KDIR ?= /lib/modules/$(shell uname -r)/build
ifeq ($(V),1)
Q =
--
2.33.0
The series of patches are for doing basic tests
of NIC driver. Test comprises checks for auto-negotiation,
speed, duplex state and throughput between local NIC and
partner. Tools such as ethtool, iperf3 are used.
Signed-off-by: Mohan Prasad J <mohan.prasad(a)microchip.com>
---
Changes in v4:
- Type hints are added for arguments and return types.
- Test values can be modified through command line.
- Minor code improvements.
- GenerateTraffic class extended for throughput test.
- Protocol testcases separated.
Changes in v3:
Link to v3: https://patchwork.kernel.org/project/netdevbpf/cover/20241016215014.401476-…
- LinkConfig class is included in the hw library. This contains
generic APIs for doing link layer operations.
- Auto-negotiation checks involve changing the auto-neg state
both in local and partner NIC.
- Link layer test and performance test are separated to
different selftest files.
- Resetting of NIC driver done after test completion.
Changes in v2:
Link to v2: https://patchwork.kernel.org/project/netdevbpf/cover/20240917023525.2571082…
- Changed the hardcoded implementation of speed, duplex states,
throughput to generic values, in order to support all type
of NIC drivers.
- Test executes based on the supported link modes between local
NIC driver and partner.
- Instead of lan743x directory, selftest file is now relocated
to /selftests/drivers/net/hw.
Link to v1: https://patchwork.kernel.org/project/netdevbpf/cover/20240903221549.1215842…
---
Mohan Prasad J (3):
selftests: nic_link_layer: Add link layer selftest for NIC driver
selftests: nic_link_layer: Add selftest case for speed and duplex
states
selftests: nic_performance: Add selftest for performance of NIC driver
.../testing/selftests/drivers/net/hw/Makefile | 6 +-
.../drivers/net/hw/lib/py/__init__.py | 1 +
.../drivers/net/hw/lib/py/linkconfig.py | 222 ++++++++++++++++++
.../drivers/net/hw/nic_link_layer.py | 113 +++++++++
.../drivers/net/hw/nic_performance.py | 137 +++++++++++
.../selftests/drivers/net/lib/py/load.py | 20 +-
6 files changed, 496 insertions(+), 3 deletions(-)
create mode 100644 tools/testing/selftests/drivers/net/hw/lib/py/linkconfig.py
create mode 100644 tools/testing/selftests/drivers/net/hw/nic_link_layer.py
create mode 100644 tools/testing/selftests/drivers/net/hw/nic_performance.py
--
2.43.0
Hello,
this new series aims to migrate test_flow_dissector.sh into test_progs.
There are 2 "main" parts in test_flow_dissector.sh:
- a set of tests checking flow_dissector programs attachment to either
root namespace or non-root namespace
- dissection test
The first set is integrated in flow_dissector.c, which already contains
some existing tests for flow_dissector programs. This series uses the
opportunity to update a bit this file (use new assert, re-split tests,
etc)
The second part is migrated into a new file under test_progs,
flow_dissector_classification.c. It uses the same eBPF programs as
flow_dissector.c, but the difference is rather about how those program
are executed:
- flow_dissector.c manually runs programs with BPF_PROG_RUN
- flow_dissector_classification.c sends real packets to be dissected, and
so it also executes kernel code related to eBPF flow dissector (eg:
__skb_flow_bpf_to_target)
---
Changes in v2:
- allow tests to run in parallel
- move some generic helpers to network_helpers.h
- define proper function for ASSERT_MEMEQ
- fetch acked-by tags
- Link to v1: https://lore.kernel.org/r/20241113-flow_dissector-v1-0-27c4df0592dc@bootlin…
---
Alexis Lothoré (eBPF Foundation) (13):
selftests/bpf: add a macro to compare raw memory
selftests/bpf: use ASSERT_MEMEQ to compare bpf flow keys
selftests/bpf: replace CHECK calls with ASSERT macros in flow_dissector test
selftests/bpf: re-split main function into dedicated tests
selftests/bpf: expose all subtests from flow_dissector
selftests/bpf: add gre packets testing to flow_dissector
selftests/bpf: migrate flow_dissector namespace exclusivity test
selftests/bpf: Enable generic tc actions in selftests config
selftests/bpf: move ip checksum helper to network helpers
selftests/bpf: rename pseudo headers checksum computation
selftests/bpf: add network helpers to generate udp checksums
selftests/bpf: migrate bpf flow dissectors tests to test_progs
selftests/bpf: remove test_flow_dissector.sh
tools/testing/selftests/bpf/.gitignore | 1 -
tools/testing/selftests/bpf/Makefile | 3 +-
tools/testing/selftests/bpf/config | 1 +
tools/testing/selftests/bpf/network_helpers.h | 64 +-
.../selftests/bpf/prog_tests/flow_dissector.c | 323 +++++++--
.../bpf/prog_tests/flow_dissector_classification.c | 807 +++++++++++++++++++++
.../selftests/bpf/prog_tests/xdp_metadata.c | 24 +-
tools/testing/selftests/bpf/test_flow_dissector.c | 780 --------------------
tools/testing/selftests/bpf/test_flow_dissector.sh | 178 -----
tools/testing/selftests/bpf/test_progs.c | 15 +
tools/testing/selftests/bpf/test_progs.h | 15 +
tools/testing/selftests/bpf/xdp_hw_metadata.c | 12 +-
12 files changed, 1153 insertions(+), 1070 deletions(-)
---
base-commit: 9e71d50d3befb93a6394b0979f8ebd0dc9bd8d0f
change-id: 20241019-flow_dissector-3eb0c07fc163
Best regards,
--
Alexis Lothoré, Bootlin
Embedded Linux and Kernel engineering
https://bootlin.com
Currently the temporary address is not removed when mngtmpaddr is deleted
or becomes unmanaged. The patch set fixed this issue and add a related
test.
Hangbin Liu (2):
net/ipv6: delete temporary address if mngtmpaddr is removed or
un-mngtmpaddr
selftests/rtnetlink.sh: add mngtempaddr test
net/ipv6/addrconf.c | 5 ++
tools/testing/selftests/net/rtnetlink.sh | 89 ++++++++++++++++++++++++
2 files changed, 94 insertions(+)
--
2.46.0
1. fix recursive lock when ebpf prog return SK_PASS.
2. add selftest to reproduce recursive lock.
Note that the test code can reproduce the 'dead-lock' and if just
the selftest merged without first patch, the test case will
definitely fail, because the issue of deadlock is inevitable.
---
v2->v4: fix line length reported by patchwork and remove unused code.
(max_line_length is set to 80 in patchwork but default is 100 in kernel tree)
v1->v2: 1.inspired by martin.lau to add selftest to reproduce the issue.
2. follow the community rules for patch.
v1: https://lore.kernel.org/bpf/55fc6114-7e64-4b65-86d2-92cfd1e9e92f@linux.dev/…
---
Jiayuan Chen (2):
bpf: fix recursive lock when verdict program return SK_PASS
selftests/bpf: Add some tests with sockmap SK_PASS
net/core/skmsg.c | 4 +-
.../selftests/bpf/prog_tests/sockmap_basic.c | 54 +++++++++++++++++++
2 files changed, 56 insertions(+), 2 deletions(-)
--
2.43.5
sizeof when applied to a pointer typed expression gives the size of
the pointer.
tools/testing/selftests/bpf/progs/test_tunnel_kern.c:678:41-47: ERROR: application of sizeof to pointer
The proper fix in this particular case is to code sizeof(*gopt)
instead of sizeof(gopt).
This issue was detected with the help of Coccinelle.
Fixes: 5ddafcc377f9 ("selftests/bpf: Fix a few tests for GCC related warnings.")
Signed-off-by: guanjing <guanjing(a)cmss.chinamobile.com>
---
tools/testing/selftests/bpf/progs/test_tunnel_kern.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/tools/testing/selftests/bpf/progs/test_tunnel_kern.c b/tools/testing/selftests/bpf/progs/test_tunnel_kern.c
index 32127f1cd687..3a437cdc5c15 100644
--- a/tools/testing/selftests/bpf/progs/test_tunnel_kern.c
+++ b/tools/testing/selftests/bpf/progs/test_tunnel_kern.c
@@ -675,7 +675,7 @@ int ip6geneve_set_tunnel(struct __sk_buff *skb)
gopt->length = 2; /* 4-byte multiple */
*(int *) &gopt->opt_data = bpf_htonl(0xfeedbeef);
- ret = bpf_skb_set_tunnel_opt(skb, gopt, sizeof(gopt));
+ ret = bpf_skb_set_tunnel_opt(skb, gopt, sizeof(*gopt));
if (ret < 0) {
log_err(ret);
return TC_ACT_SHOT;
--
2.33.0
page_frag test module is an out of tree module, but built
using KDIR as the main kernel tree, the mm test suite is
just getting skipped if newly added page_frag test module
fails to compile due to kernel not yet compiled.
Fix the above problem by ensuring both kernel is built first
and a newer kernel which has page_frag_cache.h is used.
CC: Andrew Morton <akpm(a)linux-foundation.org>
CC: Alexander Duyck <alexanderduyck(a)fb.com>
CC: Linux-MM <linux-mm(a)kvack.org>
Fixes: 7fef0dec415c ("mm: page_frag: add a test module for page_frag")
Fixes: 65941f10caf2 ("mm: move the page fragment allocator from page_alloc into its own file")
Reported-by: Mark Brown <broonie(a)kernel.org>
Signed-off-by: Yunsheng Lin <yunshenglin0825(a)gmail.com>
---
Mote, page_frag test module is only in the net-next tree for now,
so target the net-next tree.
---
tools/testing/selftests/mm/Makefile | 18 ++++++++++++++++++
tools/testing/selftests/mm/page_frag/Makefile | 2 +-
2 files changed, 19 insertions(+), 1 deletion(-)
diff --git a/tools/testing/selftests/mm/Makefile b/tools/testing/selftests/mm/Makefile
index acec529baaca..04e04733fc8a 100644
--- a/tools/testing/selftests/mm/Makefile
+++ b/tools/testing/selftests/mm/Makefile
@@ -36,7 +36,16 @@ MAKEFLAGS += --no-builtin-rules
CFLAGS = -Wall -I $(top_srcdir) $(EXTRA_CFLAGS) $(KHDR_INCLUDES) $(TOOLS_INCLUDES)
LDLIBS = -lrt -lpthread -lm
+KDIR ?= /lib/modules/$(shell uname -r)/build
+ifneq (,$(wildcard $(KDIR)/Module.symvers))
+ifneq (,$(wildcard $(KDIR)/include/linux/page_frag_cache.h))
TEST_GEN_MODS_DIR := page_frag
+else
+PAGE_FRAG_WARNING = "missing page_frag_cache.h, please use a newer kernel"
+endif
+else
+PAGE_FRAG_WARNING = "missing Module.symvers, please have the kernel built first"
+endif
TEST_GEN_FILES = cow
TEST_GEN_FILES += compaction_test
@@ -214,3 +223,12 @@ warn_missing_liburing:
echo "Warning: missing liburing support. Some tests will be skipped." ; \
echo
endif
+
+ifneq ($(PAGE_FRAG_WARNING),)
+all: warn_missing_page_frag
+
+warn_missing_page_frag:
+ @echo ; \
+ echo "Warning: $(PAGE_FRAG_WARNING). page_frag test will be skipped." ; \
+ echo
+endif
diff --git a/tools/testing/selftests/mm/page_frag/Makefile b/tools/testing/selftests/mm/page_frag/Makefile
index 58dda74d50a3..8c8bb39ffa28 100644
--- a/tools/testing/selftests/mm/page_frag/Makefile
+++ b/tools/testing/selftests/mm/page_frag/Makefile
@@ -1,5 +1,5 @@
PAGE_FRAG_TEST_DIR := $(realpath $(dir $(abspath $(lastword $(MAKEFILE_LIST)))))
-KDIR ?= $(abspath $(PAGE_FRAG_TEST_DIR)/../../../../..)
+KDIR ?= /lib/modules/$(shell uname -r)/build
ifeq ($(V),1)
Q =
--
2.34.1
Add ip_link_set_addr(), ip_link_set_up(), ip_addr_add() and ip_route_add()
to the suite of helpers that automatically schedule a corresponding
cleanup.
When setting a new MAC, one needs to remember the old address first. Move
mac_get() from forwarding/ to that end.
Signed-off-by: Petr Machata <petrm(a)nvidia.com>
Reviewed-by: Ido Schimmel <idosch(a)nvidia.com>
---
Notes:
CC: Shuah Khan <shuah(a)kernel.org>
CC: Benjamin Poirier <bpoirier(a)nvidia.com>
CC: Hangbin Liu <liuhangbin(a)gmail.com>
CC: Vladimir Oltean <vladimir.oltean(a)nxp.com>
CC: linux-kselftest(a)vger.kernel.org
tools/testing/selftests/net/forwarding/lib.sh | 7 ----
tools/testing/selftests/net/lib.sh | 39 +++++++++++++++++++
2 files changed, 39 insertions(+), 7 deletions(-)
diff --git a/tools/testing/selftests/net/forwarding/lib.sh b/tools/testing/selftests/net/forwarding/lib.sh
index 7337f398f9cc..1fd40bada694 100644
--- a/tools/testing/selftests/net/forwarding/lib.sh
+++ b/tools/testing/selftests/net/forwarding/lib.sh
@@ -932,13 +932,6 @@ packets_rate()
echo $(((t1 - t0) / interval))
}
-mac_get()
-{
- local if_name=$1
-
- ip -j link show dev $if_name | jq -r '.[]["address"]'
-}
-
ether_addr_to_u64()
{
local addr="$1"
diff --git a/tools/testing/selftests/net/lib.sh b/tools/testing/selftests/net/lib.sh
index 5ea6537acd2b..2cd5c743b2d9 100644
--- a/tools/testing/selftests/net/lib.sh
+++ b/tools/testing/selftests/net/lib.sh
@@ -435,6 +435,13 @@ xfail_on_veth()
fi
}
+mac_get()
+{
+ local if_name=$1
+
+ ip -j link show dev $if_name | jq -r '.[]["address"]'
+}
+
kill_process()
{
local pid=$1; shift
@@ -459,3 +466,35 @@ ip_link_set_master()
ip link set dev "$member" master "$master"
defer ip link set dev "$member" nomaster
}
+
+ip_link_set_addr()
+{
+ local name=$1; shift
+ local addr=$1; shift
+
+ local old_addr=$(mac_get "$name")
+ ip link set dev "$name" address "$addr"
+ defer ip link set dev "$name" address "$old_addr"
+}
+
+ip_link_set_up()
+{
+ local name=$1; shift
+
+ ip link set dev "$name" up
+ defer ip link set dev "$name" down
+}
+
+ip_addr_add()
+{
+ local name=$1; shift
+
+ ip addr add dev "$name" "$@"
+ defer ip addr del dev "$name" "$@"
+}
+
+ip_route_add()
+{
+ ip route add "$@"
+ defer ip route del "$@"
+}
--
2.47.0
After reviewing the code, it was found that the macro GUEST_CODE_PIO_PORT
is never referenced in the code. Just remove it.
Signed-off-by: Ba Jing <bajing(a)cmss.chinamobile.com>
---
tools/testing/selftests/kvm/hardware_disable_test.c | 1 -
1 file changed, 1 deletion(-)
diff --git a/tools/testing/selftests/kvm/hardware_disable_test.c b/tools/testing/selftests/kvm/hardware_disable_test.c
index bce73bcb973c..94bd6ed24cf3 100644
--- a/tools/testing/selftests/kvm/hardware_disable_test.c
+++ b/tools/testing/selftests/kvm/hardware_disable_test.c
@@ -20,7 +20,6 @@
#define SLEEPING_THREAD_NUM (1 << 4)
#define FORK_NUM (1ULL << 9)
#define DELAY_US_MAX 2000
-#define GUEST_CODE_PIO_PORT 4
sem_t *sem;
--
2.33.0
This series is a follow-up to Joey's Permission Overlay Extension (POE)
series [1] that recently landed on mainline. The goal is to improve the
way we handle the register that governs which pkeys/POIndex are
accessible (POR_EL0) during signal delivery. As things stand, we may
unexpectedly fail to write the signal frame on the stack because POR_EL0
is not reset before the uaccess operations. See patch 3 for more details
and the main changes this series brings.
A similar series landed recently for x86/MPK [2]; the present series
aims at aligning arm64 with x86. Worth noting: once the signal frame is
written, POR_EL0 is still set to POR_EL0_INIT, granting access to pkey 0
only. This means that a program that sets up an alternate signal stack
with a non-zero pkey will need some assembly trampoline to set POR_EL0
before invoking the real signal handler, as discussed here [3]. This is
not ideal, but it makes experimentation with pkeys in signal handlers
possible while waiting for a potential interface to control the pkey
state when delivering a signal. See Pierre's reply [4] for more
information about use-cases and a potential interface.
The x86 series also added kselftests to ensure that no spurious SIGSEGV
occurs during signal delivery regardless of which pkey is accessible at
the point where the signal is delivered. This series adapts those
kselftests to allow running them on arm64 (patch 4-5).
Finally patch 2 is a clean-up following feedback on Joey's series [5].
I have tested this series on arm64 and x86_64 (booting and running the
protection_keys and pkey_sighandler_tests mm kselftests).
v1..v2:
* In setup_rt_frame(), ensured that POR_EL0 is reset to its original
value if we fail to deliver the signal (addresses Catalin's concern [6]).
* Renamed *unpriv_access* to *user_access* in patch 3 (suggestion from
Dave).
* Made what patch 1-2 do explicit in the commit message body (suggestion
from Dave).
- Kevin
[1] https://lore.kernel.org/linux-arm-kernel/20240822151113.1479789-1-joey.goul…
[2] https://lore.kernel.org/lkml/20240802061318.2140081-1-aruna.ramakrishna@ora…
[3] https://lore.kernel.org/lkml/CABi2SkWxNkP2O7ipkP67WKz0-LV33e5brReevTTtba6oK…
[4] https://lore.kernel.org/linux-arm-kernel/87plns8owh.fsf@arm.com/
[5] https://lore.kernel.org/linux-arm-kernel/20241015114116.GA19334@willie-the-…
[6] https://lore.kernel.org/linux-arm-kernel/Zw6D2waVyIwYE7wd@arm.com/
Cc: akpm(a)linux-foundation.org
Cc: anshuman.khandual(a)arm.com
Cc: aruna.ramakrishna(a)oracle.com
Cc: broonie(a)kernel.org
Cc: catalin.marinas(a)arm.com
Cc: dave.hansen(a)linux.intel.com
Cc: dave.martin(a)arm.com
Cc: jeffxu(a)chromium.org
Cc: joey.gouly(a)arm.com
Cc: pierre.langlois(a)arm.com
Cc: shuah(a)kernel.org
Cc: sroettger(a)google.com
Cc: will(a)kernel.org
Cc: linux-kselftest(a)vger.kernel.org
Cc: x86(a)kernel.org
Kevin Brodsky (5):
arm64: signal: Remove unused macro
arm64: signal: Remove unnecessary check when saving POE state
arm64: signal: Improve POR_EL0 handling to avoid uaccess failures
selftests/mm: Use generic pkey register manipulation
selftests/mm: Enable pkey_sighandler_tests on arm64
arch/arm64/kernel/signal.c | 95 +++++++++++++---
tools/testing/selftests/mm/Makefile | 8 +-
tools/testing/selftests/mm/pkey-arm64.h | 1 +
tools/testing/selftests/mm/pkey-x86.h | 2 +
.../selftests/mm/pkey_sighandler_tests.c | 101 +++++++++++++-----
5 files changed, 162 insertions(+), 45 deletions(-)
--
2.43.0
Here is a series from Geliang, adding mptcp_subflow bpf_iter support.
We are working on extending MPTCP with BPF, e.g. to control the path
manager -- in charge of the creation, deletion, and announcements of
subflows (paths) -- and the packet scheduler -- in charge of selecting
which available path the next data will be sent to. These extensions
need to iterate over the list of subflows attached to an MPTCP
connection, and do some specific actions via some new kfunc that will be
added later on.
This preparation work is split in different patches:
- Patch 1: register some "basic" MPTCP kfunc.
- Patch 2: add mptcp_subflow bpf_iter support. Note that previous
versions of this single patch have already been shared to the
BPF mailing list. The changelog has been kept with a comment,
but the version number has been reset to avoid confusions.
- Patch 3: add kfunc to make sure the msk is valid
- Patch 4: add more MPTCP endpoints in the selftests, in order to create
more than 2 subflows.
- Patch 5: add a very simple test validating mptcp_subflow bpf_iter
support. This test could be written without the new bpf_iter,
but it is there only to make sure this specific feature works
as expected.
Signed-off-by: Matthieu Baerts (NGI0) <matttbe(a)kernel.org>
---
Geliang Tang (5):
bpf: Register mptcp common kfunc set
bpf: Add mptcp_subflow bpf_iter
bpf: Acquire and release mptcp socket
selftests/bpf: More endpoints for endpoint_init
selftests/bpf: Add mptcp_subflow bpf_iter subtest
net/mptcp/bpf.c | 104 ++++++++++++++++-
tools/testing/selftests/bpf/bpf_experimental.h | 8 ++
tools/testing/selftests/bpf/prog_tests/mptcp.c | 129 ++++++++++++++++++++-
tools/testing/selftests/bpf/progs/mptcp_bpf.h | 10 ++
.../testing/selftests/bpf/progs/mptcp_bpf_iters.c | 64 ++++++++++
5 files changed, 308 insertions(+), 7 deletions(-)
---
base-commit: 141b4d6a8049cecdc8124f87e044b83a9e80730d
change-id: 20241108-bpf-next-net-mptcp-bpf_iter-subflows-027f6d87770e
Best regards,
--
Matthieu Baerts (NGI0) <matttbe(a)kernel.org>
This series introduces a new vIOMMU infrastructure and related ioctls.
IOMMUFD has been using the HWPT infrastructure for all cases, including a
nested IO page table support. Yet, there're limitations for an HWPT-based
structure to support some advanced HW-accelerated features, such as CMDQV
on NVIDIA Grace, and HW-accelerated vIOMMU on AMD. Even for a multi-IOMMU
environment, it is not straightforward for nested HWPTs to share the same
parent HWPT (stage-2 IO pagetable), with the HWPT infrastructure alone: a
parent HWPT typically hold one stage-2 IO pagetable and tag it with only
one ID in the cache entries. When sharing one large stage-2 IO pagetable
across physical IOMMU instances, that one ID may not always be available
across all the IOMMU instances. In other word, it's ideal for SW to have
a different container for the stage-2 IO pagetable so it can hold another
ID that's available. And this container will be able to hold some advanced
feature too.
For this "different container", add vIOMMU, an additional layer to hold
extra virtualization information:
_______________________________________________________________________
| iommufd (with vIOMMU) |
| _____________ |
| | | |
| |----------------| vIOMMU | |
| | ______ | | _____________ ________ |
| | | | | | | | | | |
| | | IOAS |<---|(HWPT_PAGING)|<---| HWPT_NESTED |<--| DEVICE | |
| | |______| |_____________| |_____________| |________| |
| | | | | | |
|______|________|______________|__________________|_______________|_____|
| | | | |
______v_____ | ______v_____ ______v_____ ___v__
| struct | | PFN | (paging) | | (nested) | |struct|
|iommu_device| |------>|iommu_domain|<----|iommu_domain|<----|device|
|____________| storage|____________| |____________| |______|
The vIOMMU object should be seen as a slice of a physical IOMMU instance
that is passed to or shared with a VM. That can be some HW/SW resources:
- Security namespace for guest owned ID, e.g. guest-controlled cache tags
- Non-device-affiliated event reporting, e.g. invalidation queue errors
- Access to a sharable nesting parent pagetable across physical IOMMUs
- Virtualization of various platforms IDs, e.g. RIDs and others
- Delivery of paravirtualized invalidation
- Direct assigned invalidation queues
- Direct assigned interrupts
On a multi-IOMMU system, the vIOMMU object must be instanced to the number
of the physical IOMMUs that have a slice passed to (via device) a guest VM,
while being able to hold the shareable parent HWPT. Each vIOMMU then just
needs to allocate its own individual ID to tag its own cache:
----------------------------
---------------- | | paging_hwpt0 |
| hwpt_nested0 |--->| viommu0 ------------------
---------------- | | IDx |
----------------------------
----------------------------
---------------- | | paging_hwpt0 |
| hwpt_nested1 |--->| viommu1 ------------------
---------------- | | IDy |
----------------------------
As an initial part-1, add IOMMUFD_CMD_VIOMMU_ALLOC ioctl for an allocation
only.
More vIOMMU-based structs and ioctls will be introduced in the follow-up
series to support vDEVICE, vIRQ (vEVENT) and vQUEUE objects. Although we
repurposed the vIOMMU object from an earlier RFC, just for a referece:
https://lore.kernel.org/all/cover.1712978212.git.nicolinc@nvidia.com/
This series is on Github:
https://github.com/nicolinc/iommufd/commits/iommufd_viommu_p1-v7
(QEMU branch for testing will be provided in Jason's nesting series)
Changelog
v7
* Added "Reviewed-by" from Jason
* Dropped "select IOMMUFD_DRIVER_CORE" in Kconfig
* Decoupled IOMMUFD_DRIVER from IOMMUFD_DRIVER_CORE
* Moved vIOMMU negative tests out of FIXTURE_SETUP to a TEST_F
* Dropped the "flags" check in iommufd_viommu_alloc_hwpt_nested
* Added the kdoc for "flags" in iommufd_viommu_alloc_hwpt_nested
v6
https://lore.kernel.org/all/cover.1730313237.git.nicolinc@nvidia.com/
* Improved comment lines
* Added a TEST_F for IO page fault
* Fixed indentations in iommufd.rst
* Revised kdoc of the viommu_alloc op
* Added "Reviewed-by" from Kevin and Jason
* Passed in "flags" to ->alloc_domain_nested
* Renamed "free" op to "destroy" in viommu_ops
* Skipped SMMUv3 driver changes (to post in a separate series)
* Fixed "flags" validation in iommufd_viommu_alloc_hwpt_nested
* Added CONFIG_IOMMUFD_DRIVER_CORE for sharing between iommufd
core and IOMMU dirvers
* Replaced iommufd_verify_unfinalized_object with xa_cmpxchg in
iommufd_object_finalize/abort functions
v5
https://lore.kernel.org/all/cover.1729897352.git.nicolinc@nvidia.com/
* Added "Reviewed-by" from Kevin
* Reworked iommufd_viommu_alloc helper
* Revised the uAPI kdoc for vIOMMU object
* Revised comments for pluggable iommu_dev
* Added a couple of cleanup patches for selftest
* Renamed domain_alloc_nested op to alloc_domain_nested
* Updated a few commit messages to reflect the latest series
* Renamed iommufd_hwpt_nested_alloc_for_viommu to
iommufd_viommu_alloc_hwpt_nested, and added flag validation
v4
https://lore.kernel.org/all/cover.1729553811.git.nicolinc@nvidia.com/
* Added "Reviewed-by" from Jason
* Dropped IOMMU_VIOMMU_TYPE_DEFAULT support
* Dropped iommufd_object_alloc_elm renamings
* Renamed iommufd's viommu_api.c to driver.c
* Reworked iommufd_viommu_alloc helper
* Added a separate iommufd_hwpt_nested_alloc_for_viommu function for
hwpt_nested allocations on a vIOMMU, and added comparison between
viommu->iommu_dev->ops and dev_iommu_ops(idev->dev)
* Replaced s2_parent with vsmmu in arm_smmu_nested_domain
* Replaced domain_alloc_user in iommu_ops with domain_alloc_nested in
viommu_ops
* Replaced wait_queue_head_t with a completion, to delay the unplug of
mock_iommu_dev
* Corrected documentation graph that was missing struct iommu_device
* Added an iommufd_verify_unfinalized_object helper to verify driver-
allocated vIOMMU/vDEVICE objects
* Added missing test cases for TEST_LENGTH and fail_nth
v3
https://lore.kernel.org/all/cover.1728491453.git.nicolinc@nvidia.com/
* Rebased on top of Jason's nesting v3 series
https://lore.kernel.org/all/0-v3-e2e16cd7467f+2a6a1-smmuv3_nesting_jgg@nvid…
* Split the series into smaller parts
* Added Jason's Reviewed-by
* Added back viommu->iommu_dev
* Added support for driver-allocated vIOMMU v.s. core-allocated
* Dropped arm_smmu_cache_invalidate_user
* Added an iommufd_test_wait_for_users() in selftest
* Reworked test code to make viommu an individual FIXTURE
* Added missing TEST_LENGTH case for the new ioctl command
v2
https://lore.kernel.org/all/cover.1724776335.git.nicolinc@nvidia.com/
* Limited vdev_id to one per idev
* Added a rw_sem to protect the vdev_id list
* Reworked driver-level APIs with proper lockings
* Added a new viommu_api file for IOMMUFD_DRIVER config
* Dropped useless iommu_dev point from the viommu structure
* Added missing index numnbers to new types in the uAPI header
* Dropped IOMMU_VIOMMU_INVALIDATE uAPI; Instead, reuse the HWPT one
* Reworked mock_viommu_cache_invalidate() using the new iommu helper
* Reordered details of set/unset_vdev_id handlers for proper lockings
v1
https://lore.kernel.org/all/cover.1723061377.git.nicolinc@nvidia.com/
Thanks!
Nicolin
Nicolin Chen (13):
iommufd: Move struct iommufd_object to public iommufd header
iommufd: Move _iommufd_object_alloc helper to a sharable file
iommufd: Introduce IOMMUFD_OBJ_VIOMMU and its related struct
iommufd: Verify object in iommufd_object_finalize/abort()
iommufd/viommu: Add IOMMU_VIOMMU_ALLOC ioctl
iommufd: Add alloc_domain_nested op to iommufd_viommu_ops
iommufd: Allow pt_id to carry viommu_id for IOMMU_HWPT_ALLOC
iommufd/selftest: Add container_of helpers
iommufd/selftest: Prepare for mock_viommu_alloc_domain_nested()
iommufd/selftest: Add refcount to mock_iommu_device
iommufd/selftest: Add IOMMU_VIOMMU_TYPE_SELFTEST
iommufd/selftest: Add IOMMU_VIOMMU_ALLOC test coverage
Documentation: userspace-api: iommufd: Update vIOMMU
drivers/iommu/iommufd/Kconfig | 4 +
drivers/iommu/iommufd/Makefile | 4 +-
drivers/iommu/iommufd/iommufd_private.h | 33 +--
drivers/iommu/iommufd/iommufd_test.h | 2 +
include/linux/iommu.h | 14 +
include/linux/iommufd.h | 86 ++++++
include/uapi/linux/iommufd.h | 54 +++-
tools/testing/selftests/iommu/iommufd_utils.h | 28 ++
drivers/iommu/iommufd/driver.c | 40 +++
drivers/iommu/iommufd/hw_pagetable.c | 73 ++++-
drivers/iommu/iommufd/main.c | 54 ++--
drivers/iommu/iommufd/selftest.c | 266 +++++++++++++-----
drivers/iommu/iommufd/viommu.c | 81 ++++++
tools/testing/selftests/iommu/iommufd.c | 137 +++++++++
.../selftests/iommu/iommufd_fail_nth.c | 11 +
Documentation/userspace-api/iommufd.rst | 69 ++++-
16 files changed, 804 insertions(+), 152 deletions(-)
create mode 100644 drivers/iommu/iommufd/driver.c
create mode 100644 drivers/iommu/iommufd/viommu.c
base-commit: 0bcceb1f51c77f6b98a7aab00847ed340bf36e35
--
2.43.0
Some distros may not load nf_conntrack by default, which will cause
subsequent nf_conntrack settings to fail. Let's load this module if it's
not loaded by default.
Fixes: e7096c131e51 ("net: WireGuard secure network tunnel")
Signed-off-by: Hangbin Liu <liuhangbin(a)gmail.com>
---
v2: load the mode directly in case nf_conntrack is build in (Simon Horman)
---
tools/testing/selftests/wireguard/netns.sh | 1 +
1 file changed, 1 insertion(+)
diff --git a/tools/testing/selftests/wireguard/netns.sh b/tools/testing/selftests/wireguard/netns.sh
index 405ff262ca93..fa4dd7eb5918 100755
--- a/tools/testing/selftests/wireguard/netns.sh
+++ b/tools/testing/selftests/wireguard/netns.sh
@@ -66,6 +66,7 @@ cleanup() {
orig_message_cost="$(< /proc/sys/net/core/message_cost)"
trap cleanup EXIT
printf 0 > /proc/sys/net/core/message_cost
+modprobe nf_conntrack
ip netns del $netns0 2>/dev/null || true
ip netns del $netns1 2>/dev/null || true
--
2.39.5 (Apple Git-154)
The count_stcx_fail test runs for close to or just over 2 minutes, which
means it sometimes times out.
That's overkill for a test that just demonstrates some PMU counters
are working. Drop the 64 billion instruction case, to lower the runtime
to ~30s.
Signed-off-by: Michael Ellerman <mpe(a)ellerman.id.au>
---
tools/testing/selftests/powerpc/pmu/count_stcx_fail.c | 3 ---
1 file changed, 3 deletions(-)
diff --git a/tools/testing/selftests/powerpc/pmu/count_stcx_fail.c b/tools/testing/selftests/powerpc/pmu/count_stcx_fail.c
index 2070a1e2b3a5..d8dd9a9c6c1b 100644
--- a/tools/testing/selftests/powerpc/pmu/count_stcx_fail.c
+++ b/tools/testing/selftests/powerpc/pmu/count_stcx_fail.c
@@ -144,9 +144,6 @@ static int test_body(void)
/* Run for 16Bi instructions */
FAIL_IF(do_count_loop(events, 16000000000, overhead, true));
- /* Run for 64Bi instructions */
- FAIL_IF(do_count_loop(events, 64000000000, overhead, true));
-
event_close(&events[0]);
event_close(&events[1]);
--
2.47.0
Hello,
I am a private investment consultant representing the interest of a multinational conglomerate that wishes to place funds into a trust management portfolio.
Please indicate your interest for additional information.
Regards,
Van Collin.
This patch series includes some netns-related improvements and fixes for
RTNL and ip_tunnel, to make link creation more intuitive:
- Creating link in another net namespace doesn't conflict with link names
in current one.
- Refector rtnetlink link creation. Create link in target namespace
directly. Pass both source and link netns to drivers via newlink()
callback.
So that
# ip link add netns ns1 link-netns ns2 tun0 type gre ...
will create tun0 in ns1, rather than create it in ns2 and move to ns1.
And don't conflict with another interface named "tun0" in current netns.
Patch 1 from Donald is included just as a dependency.
---
v3:
- Drop "netns_atomic" flag and module parameter. Add netns parameter to
newlink() instead, and convert drivers accordingly.
- Move python NetNSEnter helper to net selftest lib.
v2:
link: https://lore.kernel.org/all/20241107133004.7469-1-shaw.leon@gmail.com/
- Check NLM_F_EXCL to ensure only link creation is affected.
- Add self tests for link name/ifindex conflict and notifications
in different netns.
- Changes in dummy driver and ynl in order to add the test case.
v1:
link: https://lore.kernel.org/all/20241023023146.372653-1-shaw.leon@gmail.com/
Donald Hunter (1):
Revert "tools/net/ynl: improve async notification handling"
Xiao Liang (5):
net: ip_tunnel: Build flow in underlay net namespace
rtnetlink: Lookup device in target netns when creating link
rtnetlink: Decouple net namespaces in rtnl_newlink_create()
selftests: net: Add python context manager for netns entering
selftests: net: Add two test cases for link netns
drivers/infiniband/ulp/ipoib/ipoib_netlink.c | 6 ++-
drivers/net/amt.c | 6 +--
drivers/net/bareudp.c | 4 +-
drivers/net/bonding/bond_netlink.c | 3 +-
drivers/net/can/dev/netlink.c | 2 +-
drivers/net/can/vxcan.c | 4 +-
.../ethernet/qualcomm/rmnet/rmnet_config.c | 5 +-
drivers/net/geneve.c | 4 +-
drivers/net/gtp.c | 4 +-
drivers/net/ipvlan/ipvlan.h | 2 +-
drivers/net/ipvlan/ipvlan_main.c | 5 +-
drivers/net/ipvlan/ipvtap.c | 4 +-
drivers/net/macsec.c | 5 +-
drivers/net/macvlan.c | 5 +-
drivers/net/macvtap.c | 5 +-
drivers/net/netkit.c | 4 +-
drivers/net/pfcp.c | 4 +-
drivers/net/ppp/ppp_generic.c | 4 +-
drivers/net/team/team_core.c | 2 +-
drivers/net/veth.c | 4 +-
drivers/net/vrf.c | 2 +-
drivers/net/vxlan/vxlan_core.c | 4 +-
drivers/net/wireguard/device.c | 4 +-
drivers/net/wireless/virtual/virt_wifi.c | 5 +-
drivers/net/wwan/wwan_core.c | 6 ++-
include/net/ip_tunnels.h | 5 +-
include/net/rtnetlink.h | 22 ++++++++-
net/8021q/vlan_netlink.c | 5 +-
net/batman-adv/soft-interface.c | 5 +-
net/bridge/br_netlink.c | 2 +-
net/caif/chnl_net.c | 2 +-
net/core/rtnetlink.c | 25 ++++++----
net/hsr/hsr_netlink.c | 8 +--
net/ieee802154/6lowpan/core.c | 5 +-
net/ipv4/ip_gre.c | 13 +++--
net/ipv4/ip_tunnel.c | 16 +++---
net/ipv4/ip_vti.c | 5 +-
net/ipv4/ipip.c | 5 +-
net/ipv6/ip6_gre.c | 17 ++++---
net/ipv6/ip6_tunnel.c | 11 ++---
net/ipv6/ip6_vti.c | 11 ++---
net/ipv6/sit.c | 11 ++---
net/xfrm/xfrm_interface_core.c | 13 +++--
tools/net/ynl/cli.py | 10 ++--
tools/net/ynl/lib/ynl.py | 49 ++++++++-----------
tools/testing/selftests/net/Makefile | 1 +
.../testing/selftests/net/lib/py/__init__.py | 2 +-
tools/testing/selftests/net/lib/py/netns.py | 18 +++++++
tools/testing/selftests/net/netns-name.sh | 10 ++++
tools/testing/selftests/net/netns_atomic.py | 38 ++++++++++++++
50 files changed, 255 insertions(+), 157 deletions(-)
create mode 100755 tools/testing/selftests/net/netns_atomic.py
--
2.47.0
The alloc_string_stream() function only returns ERR_PTR(-ENOMEM) on
failure and never returns NULL. Therefore, switching the error check in
the caller from IS_ERR_OR_NULL to IS_ERR improves clarity, indicating
that this function will return an error pointer (not NULL) when an
error occurs. This change avoids any ambiguity regarding the function's
return behavior.
Link: https://lore.kernel.org/lkml/Zy9deU5VK3YR+r9N@visitorckw-System-Product-Name
Signed-off-by: Kuan-Wei Chiu <visitorckw(a)gmail.com>
---
lib/kunit/debugfs.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/lib/kunit/debugfs.c b/lib/kunit/debugfs.c
index d548750a325a..6273fa9652df 100644
--- a/lib/kunit/debugfs.c
+++ b/lib/kunit/debugfs.c
@@ -181,7 +181,7 @@ void kunit_debugfs_create_suite(struct kunit_suite *suite)
* successfully.
*/
stream = alloc_string_stream(GFP_KERNEL);
- if (IS_ERR_OR_NULL(stream))
+ if (IS_ERR(stream))
return;
string_stream_set_append_newlines(stream, true);
@@ -189,7 +189,7 @@ void kunit_debugfs_create_suite(struct kunit_suite *suite)
kunit_suite_for_each_test_case(suite, test_case) {
stream = alloc_string_stream(GFP_KERNEL);
- if (IS_ERR_OR_NULL(stream))
+ if (IS_ERR(stream))
goto err;
string_stream_set_append_newlines(stream, true);
--
2.34.1
Update Brendan's email address for the KUnit entry.
Signed-off-by: Brendan Higgins <brendanhiggins(a)google.com>
---
I am leaving Google and am going through and cleaning up my @google.com
address in the relevant places. This patch updates my email address for
the KUnit entry. I am removing myself from the ASPEED I2C entry in a
separate patch.
Do note that Friday, November 15 2024 is my last day at Google after
which I will lose access to this email account so any future updates or
comments after Friday will come from my @linux.dev account.
---
MAINTAINERS | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/MAINTAINERS b/MAINTAINERS
index b878ddc99f94e..d077a3deeb386 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -12405,7 +12405,7 @@ F: fs/smb/common/
F: fs/smb/server/
KERNEL UNIT TESTING FRAMEWORK (KUnit)
-M: Brendan Higgins <brendanhiggins(a)google.com>
+M: Brendan Higgins <brendan.higgins(a)linux.dev>
M: David Gow <davidgow(a)google.com>
R: Rae Moar <rmoar(a)google.com>
L: linux-kselftest(a)vger.kernel.org
base-commit: cfaaa7d010d1fc58f9717fcc8591201e741d2d49
--
2.47.0.338.g60cca15819-goog
Unmapping virtual machine guest memory from the host kernel's direct map
is a successful mitigation against Spectre-style transient execution
issues: If the kernel page tables do not contain entries pointing to
guest memory, then any attempted speculative read through the direct map
will necessarily be blocked by the MMU before any observable
microarchitectural side-effects happen. This means that Spectre-gadgets
and similar cannot be used to target virtual machine memory. Roughly 60%
of speculative execution issues fall into this category [1, Table 1].
This patch series extends guest_memfd with the ability to remove its
memory from the host kernel's direct map, to be able to attain the above
protection for KVM guests running inside guest_memfd.
=== Changes to v2 ===
- Handle direct map removal for physically contiguous pages in arch code
(Mike R.)
- Track the direct map state in guest_memfd itself instead of at the
folio level, to prepare for huge pages support (Sean C.)
- Allow configuring direct map state of not-yet faulted in memory
(Vishal A.)
- Pay attention to alignment in ftrace structs (Steven R.)
Most significantly, I've reduced the patch series to focus only on
direct map removal for guest_memfd for now, leaving the whole "how to do
non-CoCo VMs in guest_memfd" for later. If this separation is
acceptable, then I think I can drop the RFC tag in the next revision
(I've mainly kept it here because I'm not entirely sure what to do with
patches 3 and 4).
=== Implementation ===
This patch series introduces a new flag to the KVM_CREATE_GUEST_MEMFD
that causes guest_memfd to remove its pages from the host kernel's
direct map immediately after population/preparation. It also adds
infrastructure for tracking the direct map state of all gmem folios
inside the guest_memfd inode. Storing this information in the inode has
the advantage that the code is ready for future hugepages extensions,
where only removing/reinserting direct map entries for sub-ranges of a
huge folio is a valid usecase, and it allows pre-configuring the direct
map state of not-yet faulted in parts of memory (for example, when the
VMM is receiving a RX virtio buffer from the guest).
=== Summary ===
Patch 1 (from Mike Rapoport) adds arch APIs for manipulating the direct
map for ranges of physically contiguous pages, which are used by
guest_memfd in follow up patches. Patch 2 adds the
KVM_GMEM_NO_DIRECT_MAP flag and the logic for configuring direct map
state of freshly prepared folios. Patches 3 and 4 mainly serve an
illustrative purpose, to show how the framework from patch 2 can be
extended with routines for runtime direct map manipulation. Patches 5
and 6 deal with documentation and self-tests respectively.
[1]: https://download.vusec.net/papers/quarantine_raid23.pdf
[RFC v1]: https://lore.kernel.org/kvm/20240709132041.3625501-1-roypat@amazon.co.uk/
[RFC v2]: https://lore.kernel.org/kvm/20240910163038.1298452-1-roypat@amazon.co.uk/
Mike Rapoport (Microsoft) (1):
arch: introduce set_direct_map_valid_noflush()
Patrick Roy (5):
kvm: gmem: add flag to remove memory from kernel direct map
kvm: gmem: implement direct map manipulation routines
kvm: gmem: add trace point for direct map state changes
kvm: document KVM_GMEM_NO_DIRECT_MAP flag
kvm: selftests: run gmem tests with KVM_GMEM_NO_DIRECT_MAP set
Documentation/virt/kvm/api.rst | 14 +
arch/arm64/include/asm/set_memory.h | 1 +
arch/arm64/mm/pageattr.c | 10 +
arch/loongarch/include/asm/set_memory.h | 1 +
arch/loongarch/mm/pageattr.c | 21 ++
arch/riscv/include/asm/set_memory.h | 1 +
arch/riscv/mm/pageattr.c | 15 +
arch/s390/include/asm/set_memory.h | 1 +
arch/s390/mm/pageattr.c | 11 +
arch/x86/include/asm/set_memory.h | 1 +
arch/x86/mm/pat/set_memory.c | 8 +
include/linux/set_memory.h | 6 +
include/trace/events/kvm.h | 22 ++
include/uapi/linux/kvm.h | 2 +
.../testing/selftests/kvm/guest_memfd_test.c | 2 +-
.../kvm/x86_64/private_mem_conversions_test.c | 7 +-
virt/kvm/guest_memfd.c | 280 +++++++++++++++++-
17 files changed, 384 insertions(+), 19 deletions(-)
base-commit: 5cb1659f412041e4780f2e8ee49b2e03728a2ba6
--
2.47.0
Hello LKFT maintainers, CI operators,
First, I would like to say thank you to the people behind the LKFT
project for validating stable kernels (and more), and including some
Network selftests in their tests suites.
A lot of improvements around the networking kselftests have been done
this year. At the last Netconf [1], we discussed how these tests were
validated on stable kernels from CIs like the LKFT one, and we have some
suggestions to improve the situation.
KSelftests from the same version
--------------------------------
According to the doc [2], kselftests should support all previous kernel
versions. The LKFT CI is then using the kselftests from the last stable
release to validate all stable versions. Even if there are good reasons
to do that, we would like to ask for an opt-out for this policy for the
networking tests: this is hard to maintain with the increased
complexity, hard to validate on all stable kernels before applying
patches, and hard to put in place in some situations. As a result, many
tests are failing on older kernels, and it looks like it is a lot of
work to support older kernels, and to maintain this.
Many networking tests are validating the internal behaviour that is not
exposed to the userspace. A typical example: some tests look at the raw
packets being exchanged during a test, and this behaviour can change
without modifying how the userspace is interacting with the kernel. The
kernel could expose capabilities, but that's not something that seems
natural to put in place for internal behaviours that are not exposed to
end users. Maybe workarounds could be used, e.g. looking at kernel
symbols, etc. Nut that doesn't always work, increase the complexity, and
often "false positive" issue will be noticed only after a patch hits
stable, and will cause a bunch of tests to be ignored.
Regarding fixes, ideally they will come with a new or modified test that
can also be backported. So the coverage can continue to grow in stable
versions too.
Do you think that from the kernel v6.12 (or before?), the LKFT CI could
run the networking kselftests from the version that is being validated,
and not from a newer one? So validating the selftests from v6.12.1 on a
v6.12.1, and not the ones from a future v6.16.y on a v6.12.42.
Skipped tests
-------------
It looks like many tests are skipped:
- Some have been in a skip file [3] for a while: maybe they can be removed?
- Some are skipped because of missing tools: maybe they can be added?
e.g. iputils, tshark, ipv6toolkit, etc.
- Some tests are in 'net', but in subdirectories, and hence not tested,
e.g. forwarding, packetdrill, netfilter, tcp_ao. Could they be tested too?
How can we change this to increase the code coverage using existing tests?
KVM
---
It looks like different VMs are being used to execute the different
tests. Do these VMs benefit from any accelerations like KVM? If not,
some tests might fail because the environment is too slow.
The KSFT_MACHINE_SLOW=yes env var can be set to increase some
tolerances, timeout or to skip some parts, but that might not be enough
for some tests.
Notifications
-------------
In case of new regressions, who is being notified? Are the people from
the MAINTAINERS file, and linked to the corresponding selftests being
notified or do they need to do the monitoring on their side?
Looking forward to improving the networking selftests results when
validating stable kernels!
[1] https://netdev.bots.linux.dev/netconf/2024/
[2] https://docs.kernel.org/dev-tools/kselftest.html
[3]
https://github.com/Linaro/test-definitions/blob/master/automated/linux/ksel…
Cheers,
Matt
--
Sponsored by the NGI0 Core fund.
Basics and overview
===================
Software with larger attack surfaces (e.g. network facing apps like databases,
browsers or apps relying on browser runtimes) suffer from memory corruption
issues which can be utilized by attackers to bend control flow of the program
to eventually gain control (by making their payload executable). Attackers are
able to perform such attacks by leveraging call-sites which rely on indirect
calls or return sites which rely on obtaining return address from stack memory.
To mitigate such attacks, risc-v extension zicfilp enforces that all indirect
calls must land on a landing pad instruction `lpad` else cpu will raise software
check exception (a new cpu exception cause code on riscv).
Similarly for return flow, risc-v extension zicfiss extends architecture with
- `sspush` instruction to push return address on a shadow stack
- `sspopchk` instruction to pop return address from shadow stack
and compare with input operand (i.e. return address on stack)
- `sspopchk` to raise software check exception if comparision above
was a mismatch
- Protection mechanism using which shadow stack is not writeable via
regular store instructions
More information an details can be found at extensions github repo [1].
Equivalent to landing pad (zicfilp) on x86 is `ENDBRANCH` instruction in Intel
CET [3] and branch target identification (BTI) [4] on arm.
Similarly x86's Intel CET has shadow stack [5] and arm64 has guarded control
stack (GCS) [6] which are very similar to risc-v's zicfiss shadow stack.
x86 already supports shadow stack for user mode and arm64 support for GCS in
usermode [7] is in -next.
Kernel awareness for user control flow integrity
================================================
This series picks up Samuel Holland's envcfg changes [2] as well. So if those are
being applied independently, they should be removed from this series.
Enabling:
In order to maintain compatibility and not break anything in user mode, kernel
doesn't enable control flow integrity cpu extensions on binary by default.
Instead exposes a prctl interface to enable, disable and lock the shadow stack
or landing pad feature for a task. This allows userspace (loader) to enumerate
if all objects in its address space are compiled with shadow stack and landing
pad support and accordingly enable the feature. Additionally if a subsequent
`dlopen` happens on a library, user mode can take a decision again to disable
the feature (if incoming library is not compiled with support) OR terminate the
task (if user mode policy is strict to have all objects in address space to be
compiled with control flow integirty cpu feature). prctl to enable shadow stack
results in allocating shadow stack from virtual memory and activating for user
address space. x86 and arm64 are also following same direction due to similar
reason(s).
clone/fork:
On clone and fork, cfi state for task is inherited by child. Shadow stack is
part of virtual memory and is a writeable memory from kernel perspective
(writeable via a restricted set of instructions aka shadow stack instructions)
Thus kernel changes ensure that this memory is converted into read-only when
fork/clone happens and COWed when fault is taken due to sspush, sspopchk or
ssamoswap. In case `CLONE_VM` is specified and shadow stack is to be enabled,
kernel will automatically allocate a shadow stack for that clone call.
map_shadow_stack:
x86 introduced `map_shadow_stack` system call to allow user space to explicitly
map shadow stack memory in its address space. It is useful to allocate shadow
for different contexts managed by a single thread (green threads or contexts)
risc-v implements this system call as well.
signal management:
If shadow stack is enabled for a task, kernel performs an asynchronous control
flow diversion to deliver the signal and eventually expects userspace to issue
sigreturn so that original execution can be resumed. Even though resume context
is prepared by kernel, it is in user space memory and is subject to memory
corruption and corruption bugs can be utilized by attacker in this race window
to perform arbitrary sigreturn and eventually bypass cfi mechanism.
Another issue is how to ensure that cfi related state on sigcontext area is not
trampled by legacy apps or apps compiled with old kernel headers.
In order to mitigate control-flow hijacting, kernel prepares a token and place
it on shadow stack before signal delivery and places address of token in
sigcontext structure. During sigreturn, kernel obtains address of token from
sigcontext struture, reads token from shadow stack and validates it and only
then allow sigreturn to succeed. Compatiblity issue is solved by adopting
dynamic sigcontext management introduced for vector extension. This series
re-factor the code little bit to allow future sigcontext management easy (as
proposed by Andy Chiu from SiFive)
config and compilation:
Introduce a new risc-v config option `CONFIG_RISCV_USER_CFI`. Selecting this
config option picks the kernel support for user control flow integrity. This
optin is presented only if toolchain has shadow stack and landing pad support.
And is on purpose guarded by toolchain support. Reason being that eventually
vDSO also needs to be compiled in with shadow stack and landing pad support.
vDSO compile patches are not included as of now because landing pad labeling
scheme is yet to settle for usermode runtime.
To get more information on kernel interactions with respect to
zicfilp and zicfiss, patch series adds documentation for
`zicfilp` and `zicfiss` in following:
Documentation/arch/riscv/zicfiss.rst
Documentation/arch/riscv/zicfilp.rst
How to test this series
=======================
Toolchain
---------
$ git clone git@github.com:sifive/riscv-gnu-toolchain.git -b cfi-dev
$ riscv-gnu-toolchain/configure --prefix=<path-to-where-to-build> --with-arch=rv64gc_zicfilp_zicfiss --enable-linux --disable-gdb --with-extra-multilib-test="rv64gc_zicfilp_zicfiss-lp64d:-static"
$ make -j$(nproc)
Qemu
----
$ git clone git@github.com:deepak0414/qemu.git -b zicfilp_zicfiss_ratified_master_july11
$ cd qemu
$ mkdir build
$ cd build
$ ../configure --target-list=riscv64-softmmu
$ make -j$(nproc)
Opensbi
-------
$ git clone git@github.com:deepak0414/opensbi.git -b v6_cfi_spec_split_opensbi
$ make CROSS_COMPILE=<your riscv toolchain> -j$(nproc) PLATFORM=generic
Linux
-----
Running defconfig is fine. CFI is enabled by default if the toolchain
supports it.
$ make ARCH=riscv CROSS_COMPILE=<path-to-cfi-riscv-gnu-toolchain>/build/bin/riscv64-unknown-linux-gnu- -j$(nproc) defconfig
$ make ARCH=riscv CROSS_COMPILE=<path-to-cfi-riscv-gnu-toolchain>/build/bin/riscv64-unknown-linux-gnu- -j$(nproc)
Branch where user cfi enabling patches are maintained
https://github.com/deepak0414/linux-riscv-cfi/tree/vdso_user_cfi_v6.12-rc1
In case you're building your own rootfs using toolchain, please make sure you
pick following patch to ensure that vDSO compiled with lpad and shadow stack.
"arch/riscv: compile vdso with landing pad"
Running
-------
Modify your qemu command to have:
-bios <path-to-cfi-opensbi>/build/platform/generic/firmware/fw_dynamic.bin
-cpu rv64,zicfilp=true,zicfiss=true,zimop=true,zcmop=true
vDSO related Opens (in the flux)
=================================
I am listing these opens for laying out plan and what to expect in future
patch sets. And of course for the sake of discussion.
Shadow stack and landing pad enabling in vDSO
----------------------------------------------
vDSO must have shadow stack and landing pad support compiled in for task
to have shadow stack and landing pad support. This patch series doesn't
enable that (yet). Enabling shadow stack support in vDSO should be
straight forward (intend to do that in next versions of patch set). Enabling
landing pad support in vDSO requires some collaboration with toolchain folks
to follow a single label scheme for all object binaries. This is necessary to
ensure that all indirect call-sites are setting correct label and target landing
pads are decorated with same label scheme.
How many vDSOs
---------------
Shadow stack instructions are carved out of zimop (may be operations) and if CPU
doesn't implement zimop, they're illegal instructions. Kernel could be running on
a CPU which may or may not implement zimop. And thus kernel will have to carry 2
different vDSOs and expose the appropriate one depending on whether CPU implements
zimop or not.
References
==========
[1] - https://github.com/riscv/riscv-cfi
[2] - https://lore.kernel.org/all/20240814081126.956287-1-samuel.holland@sifive.c…
[3] - https://lwn.net/Articles/889475/
[4] - https://developer.arm.com/documentation/109576/0100/Branch-Target-Identific…
[5] - https://www.intel.com/content/dam/develop/external/us/en/documents/catc17-i…
[6] - https://lwn.net/Articles/940403/
[7] - https://lore.kernel.org/all/20241001-arm64-gcs-v13-0-222b78d87eee@kernel.or…
---
changelog
---------
v8:
- rebased on palmer/for-next
- dropped samuel holland's `envcfg` context switch patches.
they are in parlmer/for-next
v7:
- Removed "riscv/Kconfig: enable HAVE_EXIT_THREAD for riscv"
Instead using `deactivate_mm` flow to clean up.
see here for more context
https://lore.kernel.org/all/20230908203655.543765-1-rick.p.edgecombe@intel.…
- Changed the header include in `kselftest`. Hopefully this fixes compile
issue faced by Zong Li at SiFive.
- Cleaned up an orphaned change to `mm/mmap.c` in below patch
"riscv/mm : ensure PROT_WRITE leads to VM_READ | VM_WRITE"
- Lock interfaces for shadow stack and indirect branch tracking expect arg == 0
Any future evolution of this interface should accordingly define how arg should
be setup.
- `mm/map.c` has an instance of using `VM_SHADOW_STACK`. Fixed it to use helper
`is_shadow_stack_vma`.
- Link to v6: https://lore.kernel.org/r/20241008-v5_user_cfi_series-v6-0-60d9fe073f37@riv…
v6:
- Picked up Samuel Holland's changes as is with `envcfg` placed in
`thread` instead of `thread_info`
- fixed unaligned newline escapes in kselftest
- cleaned up messages in kselftest and included test output in commit message
- fixed a bug in clone path reported by Zong Li
- fixed a build issue if CONFIG_RISCV_ISA_V is not selected
(this was introduced due to re-factoring signal context
management code)
v5:
- rebased on v6.12-rc1
- Fixed schema related issues in device tree file
- Fixed some of the documentation related issues in zicfilp/ss.rst
(style issues and added index)
- added `SHADOW_STACK_SET_MARKER` so that implementation can define base
of shadow stack.
- Fixed warnings on definitions added in usercfi.h when
CONFIG_RISCV_USER_CFI is not selected.
- Adopted context header based signal handling as proposed by Andy Chiu
- Added support for enabling kernel mode access to shadow stack using
FWFT
(https://github.com/riscv-non-isa/riscv-sbi-doc/blob/master/src/ext-firmware…)
- Link to v5: https://lore.kernel.org/r/20241001-v5_user_cfi_series-v1-0-3ba65b6e550f@riv…
(Note: I had an issue in my workflow due to which version number wasn't
picked up correctly while sending out patches)
v4:
- rebased on 6.11-rc6
- envcfg: Converged with Samuel Holland's patches for envcfg management on per-
thread basis.
- vma_is_shadow_stack is renamed to is_vma_shadow_stack
- picked up Mark Brown's `ARCH_HAS_USER_SHADOW_STACK` patch
- signal context: using extended context management to maintain compatibility.
- fixed `-Wmissing-prototypes` compiler warnings for prctl functions
- Documentation fixes and amending typos.
- Link to v4: https://lore.kernel.org/all/20240912231650.3740732-1-debug@rivosinc.com/
v3:
- envcfg
logic to pick up base envcfg had a bug where `ENVCFG_CBZE` could have been
picked on per task basis, even though CPU didn't implement it. Fixed in
this series.
- dt-bindings
As suggested, split into separate commit. fixed the messaging that spec is
in public review
- arch_is_shadow_stack change
arch_is_shadow_stack changed to vma_is_shadow_stack
- hwprobe
zicfiss / zicfilp if present will get enumerated in hwprobe
- selftests
As suggested, added object and binary filenames to .gitignore
Selftest binary anyways need to be compiled with cfi enabled compiler which
will make sure that landing pad and shadow stack are enabled. Thus removed
separate enable/disable tests. Cleaned up tests a bit.
- Link to v3: https://lore.kernel.org/lkml/20240403234054.2020347-1-debug@rivosinc.com/
v2:
- Using config `CONFIG_RISCV_USER_CFI`, kernel support for riscv control flow
integrity for user mode programs can be compiled in the kernel.
- Enabling of control flow integrity for user programs is left to user runtime
- This patch series introduces arch agnostic `prctls` to enable shadow stack
and indirect branch tracking. And implements them on riscv.
---
Changes in v8:
- EDITME: describe what is new in this series revision.
- EDITME: use bulletpoints and terse descriptions.
- Link to v7: https://lore.kernel.org/r/20241029-v5_user_cfi_series-v7-0-2727ce9936cb@riv…
---
Andy Chiu (1):
riscv: signal: abstract header saving for setup_sigcontext
Clément Léger (1):
riscv: Add Firmware Feature SBI extensions definitions
Deepak Gupta (25):
mm: helper `is_shadow_stack_vma` to check shadow stack vma
dt-bindings: riscv: zicfilp and zicfiss in dt-bindings (extensions.yaml)
riscv: zicfiss / zicfilp enumeration
riscv: zicfiss / zicfilp extension csr and bit definitions
riscv: usercfi state for task and save/restore of CSR_SSP on trap entry/exit
riscv/mm : ensure PROT_WRITE leads to VM_READ | VM_WRITE
riscv mm: manufacture shadow stack pte
riscv mmu: teach pte_mkwrite to manufacture shadow stack PTEs
riscv mmu: write protect and shadow stack
riscv/mm: Implement map_shadow_stack() syscall
riscv/shstk: If needed allocate a new shadow stack on clone
prctl: arch-agnostic prctl for indirect branch tracking
riscv: Implements arch agnostic shadow stack prctls
riscv: Implements arch agnostic indirect branch tracking prctls
riscv/traps: Introduce software check exception
riscv/signal: save and restore of shadow stack for signal
riscv/kernel: update __show_regs to print shadow stack register
riscv/ptrace: riscv cfi status and state via ptrace and in core files
riscv/hwprobe: zicfilp / zicfiss enumeration in hwprobe
riscv: enable kernel access to shadow stack memory via FWFT sbi call
riscv: kernel command line option to opt out of user cfi
riscv: create a config for shadow stack and landing pad instr support
riscv: Documentation for landing pad / indirect branch tracking
riscv: Documentation for shadow stack on riscv
kselftest/riscv: kselftest for user mode cfi
Mark Brown (2):
mm: Introduce ARCH_HAS_USER_SHADOW_STACK
prctl: arch-agnostic prctl for shadow stack
Documentation/arch/riscv/index.rst | 2 +
Documentation/arch/riscv/zicfilp.rst | 115 +++++
Documentation/arch/riscv/zicfiss.rst | 176 +++++++
.../devicetree/bindings/riscv/extensions.yaml | 14 +
arch/riscv/Kconfig | 20 +
arch/riscv/include/asm/asm-prototypes.h | 1 +
arch/riscv/include/asm/cpufeature.h | 13 +
arch/riscv/include/asm/csr.h | 16 +
arch/riscv/include/asm/entry-common.h | 2 +
arch/riscv/include/asm/hwcap.h | 2 +
arch/riscv/include/asm/mman.h | 24 +
arch/riscv/include/asm/mmu_context.h | 7 +
arch/riscv/include/asm/pgtable.h | 30 +-
arch/riscv/include/asm/processor.h | 2 +
arch/riscv/include/asm/sbi.h | 27 ++
arch/riscv/include/asm/thread_info.h | 3 +
arch/riscv/include/asm/usercfi.h | 89 ++++
arch/riscv/include/asm/vector.h | 3 +
arch/riscv/include/uapi/asm/hwprobe.h | 2 +
arch/riscv/include/uapi/asm/ptrace.h | 22 +
arch/riscv/include/uapi/asm/sigcontext.h | 1 +
arch/riscv/kernel/Makefile | 2 +
arch/riscv/kernel/asm-offsets.c | 8 +
arch/riscv/kernel/cpufeature.c | 2 +
arch/riscv/kernel/entry.S | 31 +-
arch/riscv/kernel/head.S | 12 +
arch/riscv/kernel/process.c | 26 +-
arch/riscv/kernel/ptrace.c | 83 ++++
arch/riscv/kernel/signal.c | 140 +++++-
arch/riscv/kernel/sys_hwprobe.c | 2 +
arch/riscv/kernel/sys_riscv.c | 10 +
arch/riscv/kernel/traps.c | 42 ++
arch/riscv/kernel/usercfi.c | 526 +++++++++++++++++++++
arch/riscv/mm/init.c | 2 +-
arch/riscv/mm/pgtable.c | 17 +
arch/x86/Kconfig | 1 +
fs/proc/task_mmu.c | 2 +-
include/linux/cpu.h | 4 +
include/linux/mm.h | 5 +-
include/uapi/asm-generic/mman.h | 4 +
include/uapi/linux/elf.h | 1 +
include/uapi/linux/prctl.h | 48 ++
kernel/sys.c | 60 +++
mm/Kconfig | 6 +
mm/gup.c | 2 +-
mm/mmap.c | 2 +-
mm/vma.h | 10 +-
tools/testing/selftests/riscv/Makefile | 2 +-
tools/testing/selftests/riscv/cfi/.gitignore | 3 +
tools/testing/selftests/riscv/cfi/Makefile | 10 +
tools/testing/selftests/riscv/cfi/cfi_rv_test.h | 84 ++++
tools/testing/selftests/riscv/cfi/riscv_cfi_test.c | 78 +++
tools/testing/selftests/riscv/cfi/shadowstack.c | 373 +++++++++++++++
tools/testing/selftests/riscv/cfi/shadowstack.h | 37 ++
54 files changed, 2171 insertions(+), 35 deletions(-)
---
base-commit: 64f7b77f0bd9271861ed9e410e9856b6b0b21c48
change-id: 20240930-v5_user_cfi_series-3dc332f8f5b2
--
- debug
For logging to be useful, something has to set RET and retmsg by calling
ret_set_ksft_status(). There is a suite of functions to that end in
forwarding/lib: check_err, check_fail et.al. Move them to net/lib.sh so
that every net test can use them.
Existing lib.sh users might be using these same names for their functions.
However lib.sh is always sourced near the top of the file (checked), and
whatever new definitions will simply override the ones provided by lib.sh.
Signed-off-by: Petr Machata <petrm(a)nvidia.com>
Reviewed-by: Amit Cohen <amcohen(a)nvidia.com>
Acked-by: Shuah Khan <skhan(a)linuxfoundation.org>
---
Notes:
CC: Shuah Khan <shuah(a)kernel.org>
CC: Benjamin Poirier <bpoirier(a)nvidia.com>
CC: Hangbin Liu <liuhangbin(a)gmail.com>
CC: linux-kselftest(a)vger.kernel.org
CC: Jiri Pirko <jiri(a)resnulli.us>
tools/testing/selftests/net/forwarding/lib.sh | 73 -------------------
tools/testing/selftests/net/lib.sh | 73 +++++++++++++++++++
2 files changed, 73 insertions(+), 73 deletions(-)
diff --git a/tools/testing/selftests/net/forwarding/lib.sh b/tools/testing/selftests/net/forwarding/lib.sh
index d28dbf27c1f0..8625e3c99f55 100644
--- a/tools/testing/selftests/net/forwarding/lib.sh
+++ b/tools/testing/selftests/net/forwarding/lib.sh
@@ -445,79 +445,6 @@ done
##############################################################################
# Helpers
-# Whether FAILs should be interpreted as XFAILs. Internal.
-FAIL_TO_XFAIL=
-
-check_err()
-{
- local err=$1
- local msg=$2
-
- if ((err)); then
- if [[ $FAIL_TO_XFAIL = yes ]]; then
- ret_set_ksft_status $ksft_xfail "$msg"
- else
- ret_set_ksft_status $ksft_fail "$msg"
- fi
- fi
-}
-
-check_fail()
-{
- local err=$1
- local msg=$2
-
- check_err $((!err)) "$msg"
-}
-
-check_err_fail()
-{
- local should_fail=$1; shift
- local err=$1; shift
- local what=$1; shift
-
- if ((should_fail)); then
- check_fail $err "$what succeeded, but should have failed"
- else
- check_err $err "$what failed"
- fi
-}
-
-xfail()
-{
- FAIL_TO_XFAIL=yes "$@"
-}
-
-xfail_on_slow()
-{
- if [[ $KSFT_MACHINE_SLOW = yes ]]; then
- FAIL_TO_XFAIL=yes "$@"
- else
- "$@"
- fi
-}
-
-omit_on_slow()
-{
- if [[ $KSFT_MACHINE_SLOW != yes ]]; then
- "$@"
- fi
-}
-
-xfail_on_veth()
-{
- local dev=$1; shift
- local kind
-
- kind=$(ip -j -d link show dev $dev |
- jq -r '.[].linkinfo.info_kind')
- if [[ $kind = veth ]]; then
- FAIL_TO_XFAIL=yes "$@"
- else
- "$@"
- fi
-}
-
not()
{
"$@"
diff --git a/tools/testing/selftests/net/lib.sh b/tools/testing/selftests/net/lib.sh
index 4f52b8e48a3a..6bcf5d13879d 100644
--- a/tools/testing/selftests/net/lib.sh
+++ b/tools/testing/selftests/net/lib.sh
@@ -361,3 +361,76 @@ tests_run()
$current_test
done
}
+
+# Whether FAILs should be interpreted as XFAILs. Internal.
+FAIL_TO_XFAIL=
+
+check_err()
+{
+ local err=$1
+ local msg=$2
+
+ if ((err)); then
+ if [[ $FAIL_TO_XFAIL = yes ]]; then
+ ret_set_ksft_status $ksft_xfail "$msg"
+ else
+ ret_set_ksft_status $ksft_fail "$msg"
+ fi
+ fi
+}
+
+check_fail()
+{
+ local err=$1
+ local msg=$2
+
+ check_err $((!err)) "$msg"
+}
+
+check_err_fail()
+{
+ local should_fail=$1; shift
+ local err=$1; shift
+ local what=$1; shift
+
+ if ((should_fail)); then
+ check_fail $err "$what succeeded, but should have failed"
+ else
+ check_err $err "$what failed"
+ fi
+}
+
+xfail()
+{
+ FAIL_TO_XFAIL=yes "$@"
+}
+
+xfail_on_slow()
+{
+ if [[ $KSFT_MACHINE_SLOW = yes ]]; then
+ FAIL_TO_XFAIL=yes "$@"
+ else
+ "$@"
+ fi
+}
+
+omit_on_slow()
+{
+ if [[ $KSFT_MACHINE_SLOW != yes ]]; then
+ "$@"
+ fi
+}
+
+xfail_on_veth()
+{
+ local dev=$1; shift
+ local kind
+
+ kind=$(ip -j -d link show dev $dev |
+ jq -r '.[].linkinfo.info_kind')
+ if [[ $kind = veth ]]; then
+ FAIL_TO_XFAIL=yes "$@"
+ else
+ "$@"
+ fi
+}
--
2.45.0
It would be good to use the same mechanism for scheduling and dispatching
general net tests as the many forwarding tests already use. To that end,
move the logging helpers to net/lib.sh so that every net test can use them.
Existing lib.sh users might be using the name themselves. However lib.sh is
always sourced near the top of the file (checked), and whatever new
definition will simply override the one provided by lib.sh.
Signed-off-by: Petr Machata <petrm(a)nvidia.com>
Reviewed-by: Amit Cohen <amcohen(a)nvidia.com>
Acked-by: Shuah Khan <skhan(a)linuxfoundation.org>
---
Notes:
CC: Shuah Khan <shuah(a)kernel.org>
CC: Benjamin Poirier <bpoirier(a)nvidia.com>
CC: Hangbin Liu <liuhangbin(a)gmail.com>
CC: linux-kselftest(a)vger.kernel.org
CC: Jiri Pirko <jiri(a)resnulli.us>
tools/testing/selftests/net/forwarding/lib.sh | 10 ----------
tools/testing/selftests/net/lib.sh | 10 ++++++++++
2 files changed, 10 insertions(+), 10 deletions(-)
diff --git a/tools/testing/selftests/net/forwarding/lib.sh b/tools/testing/selftests/net/forwarding/lib.sh
index 41dd14c42c48..d28dbf27c1f0 100644
--- a/tools/testing/selftests/net/forwarding/lib.sh
+++ b/tools/testing/selftests/net/forwarding/lib.sh
@@ -1285,16 +1285,6 @@ matchall_sink_create()
action drop
}
-tests_run()
-{
- local current_test
-
- for current_test in ${TESTS:-$ALL_TESTS}; do
- in_defer_scope \
- $current_test
- done
-}
-
cleanup()
{
pre_cleanup
diff --git a/tools/testing/selftests/net/lib.sh b/tools/testing/selftests/net/lib.sh
index 691318b1ec55..4f52b8e48a3a 100644
--- a/tools/testing/selftests/net/lib.sh
+++ b/tools/testing/selftests/net/lib.sh
@@ -351,3 +351,13 @@ log_info()
echo "INFO: $msg"
}
+
+tests_run()
+{
+ local current_test
+
+ for current_test in ${TESTS:-$ALL_TESTS}; do
+ in_defer_scope \
+ $current_test
+ done
+}
--
2.45.0
Many net selftests invent their own logging helpers. These really should be
in a library sourced by these tests. Currently forwarding/lib.sh has a
suite of perfectly fine logging helpers, but sourcing a forwarding/ library
from a higher-level directory smells of layering violation. In this patch,
move the logging helpers to net/lib.sh so that every net test can use them.
Together with the logging helpers, it's also necessary to move
pause_on_fail(), and EXIT_STATUS and RET.
Existing lib.sh users might be using these same names for their functions
or variables. However lib.sh is always sourced near the top of the
file (checked), and whatever new definitions will simply override the ones
provided by lib.sh.
Signed-off-by: Petr Machata <petrm(a)nvidia.com>
Reviewed-by: Amit Cohen <amcohen(a)nvidia.com>
Acked-by: Shuah Khan <skhan(a)linuxfoundation.org>
---
Notes:
CC: Shuah Khan <shuah(a)kernel.org>
CC: Benjamin Poirier <bpoirier(a)nvidia.com>
CC: Hangbin Liu <liuhangbin(a)gmail.com>
CC: linux-kselftest(a)vger.kernel.org
CC: Jiri Pirko <jiri(a)resnulli.us>
tools/testing/selftests/net/forwarding/lib.sh | 113 -----------------
tools/testing/selftests/net/lib.sh | 115 ++++++++++++++++++
2 files changed, 115 insertions(+), 113 deletions(-)
diff --git a/tools/testing/selftests/net/forwarding/lib.sh b/tools/testing/selftests/net/forwarding/lib.sh
index 89c25f72b10c..41dd14c42c48 100644
--- a/tools/testing/selftests/net/forwarding/lib.sh
+++ b/tools/testing/selftests/net/forwarding/lib.sh
@@ -48,7 +48,6 @@ declare -A NETIFS=(
: "${WAIT_TIME:=5}"
# Whether to pause on, respectively, after a failure and before cleanup.
-: "${PAUSE_ON_FAIL:=no}"
: "${PAUSE_ON_CLEANUP:=no}"
# Whether to create virtual interfaces, and what netdevice type they should be.
@@ -446,22 +445,6 @@ done
##############################################################################
# Helpers
-# Exit status to return at the end. Set in case one of the tests fails.
-EXIT_STATUS=0
-# Per-test return value. Clear at the beginning of each test.
-RET=0
-
-ret_set_ksft_status()
-{
- local ksft_status=$1; shift
- local msg=$1; shift
-
- RET=$(ksft_status_merge $RET $ksft_status)
- if (( $? )); then
- retmsg=$msg
- fi
-}
-
# Whether FAILs should be interpreted as XFAILs. Internal.
FAIL_TO_XFAIL=
@@ -535,102 +518,6 @@ xfail_on_veth()
fi
}
-log_test_result()
-{
- local test_name=$1; shift
- local opt_str=$1; shift
- local result=$1; shift
- local retmsg=$1; shift
-
- printf "TEST: %-60s [%s]\n" "$test_name $opt_str" "$result"
- if [[ $retmsg ]]; then
- printf "\t%s\n" "$retmsg"
- fi
-}
-
-pause_on_fail()
-{
- if [[ $PAUSE_ON_FAIL == yes ]]; then
- echo "Hit enter to continue, 'q' to quit"
- read a
- [[ $a == q ]] && exit 1
- fi
-}
-
-handle_test_result_pass()
-{
- local test_name=$1; shift
- local opt_str=$1; shift
-
- log_test_result "$test_name" "$opt_str" " OK "
-}
-
-handle_test_result_fail()
-{
- local test_name=$1; shift
- local opt_str=$1; shift
-
- log_test_result "$test_name" "$opt_str" FAIL "$retmsg"
- pause_on_fail
-}
-
-handle_test_result_xfail()
-{
- local test_name=$1; shift
- local opt_str=$1; shift
-
- log_test_result "$test_name" "$opt_str" XFAIL "$retmsg"
- pause_on_fail
-}
-
-handle_test_result_skip()
-{
- local test_name=$1; shift
- local opt_str=$1; shift
-
- log_test_result "$test_name" "$opt_str" SKIP "$retmsg"
-}
-
-log_test()
-{
- local test_name=$1
- local opt_str=$2
-
- if [[ $# -eq 2 ]]; then
- opt_str="($opt_str)"
- fi
-
- if ((RET == ksft_pass)); then
- handle_test_result_pass "$test_name" "$opt_str"
- elif ((RET == ksft_xfail)); then
- handle_test_result_xfail "$test_name" "$opt_str"
- elif ((RET == ksft_skip)); then
- handle_test_result_skip "$test_name" "$opt_str"
- else
- handle_test_result_fail "$test_name" "$opt_str"
- fi
-
- EXIT_STATUS=$(ksft_exit_status_merge $EXIT_STATUS $RET)
- return $RET
-}
-
-log_test_skip()
-{
- RET=$ksft_skip retmsg= log_test "$@"
-}
-
-log_test_xfail()
-{
- RET=$ksft_xfail retmsg= log_test "$@"
-}
-
-log_info()
-{
- local msg=$1
-
- echo "INFO: $msg"
-}
-
not()
{
"$@"
diff --git a/tools/testing/selftests/net/lib.sh b/tools/testing/selftests/net/lib.sh
index c8991cc6bf28..691318b1ec55 100644
--- a/tools/testing/selftests/net/lib.sh
+++ b/tools/testing/selftests/net/lib.sh
@@ -9,6 +9,9 @@ source "$net_dir/lib/sh/defer.sh"
: "${WAIT_TIMEOUT:=20}"
+# Whether to pause on after a failure.
+: "${PAUSE_ON_FAIL:=no}"
+
BUSYWAIT_TIMEOUT=$((WAIT_TIMEOUT * 1000)) # ms
# Kselftest framework constants.
@@ -20,6 +23,11 @@ ksft_skip=4
# namespace list created by setup_ns
NS_LIST=()
+# Exit status to return at the end. Set in case one of the tests fails.
+EXIT_STATUS=0
+# Per-test return value. Clear at the beginning of each test.
+RET=0
+
##############################################################################
# Helpers
@@ -236,3 +244,110 @@ tc_rule_handle_stats_get()
| jq ".[] | select(.options.handle == $handle) | \
.options.actions[0].stats$selector"
}
+
+ret_set_ksft_status()
+{
+ local ksft_status=$1; shift
+ local msg=$1; shift
+
+ RET=$(ksft_status_merge $RET $ksft_status)
+ if (( $? )); then
+ retmsg=$msg
+ fi
+}
+
+log_test_result()
+{
+ local test_name=$1; shift
+ local opt_str=$1; shift
+ local result=$1; shift
+ local retmsg=$1; shift
+
+ printf "TEST: %-60s [%s]\n" "$test_name $opt_str" "$result"
+ if [[ $retmsg ]]; then
+ printf "\t%s\n" "$retmsg"
+ fi
+}
+
+pause_on_fail()
+{
+ if [[ $PAUSE_ON_FAIL == yes ]]; then
+ echo "Hit enter to continue, 'q' to quit"
+ read a
+ [[ $a == q ]] && exit 1
+ fi
+}
+
+handle_test_result_pass()
+{
+ local test_name=$1; shift
+ local opt_str=$1; shift
+
+ log_test_result "$test_name" "$opt_str" " OK "
+}
+
+handle_test_result_fail()
+{
+ local test_name=$1; shift
+ local opt_str=$1; shift
+
+ log_test_result "$test_name" "$opt_str" FAIL "$retmsg"
+ pause_on_fail
+}
+
+handle_test_result_xfail()
+{
+ local test_name=$1; shift
+ local opt_str=$1; shift
+
+ log_test_result "$test_name" "$opt_str" XFAIL "$retmsg"
+ pause_on_fail
+}
+
+handle_test_result_skip()
+{
+ local test_name=$1; shift
+ local opt_str=$1; shift
+
+ log_test_result "$test_name" "$opt_str" SKIP "$retmsg"
+}
+
+log_test()
+{
+ local test_name=$1
+ local opt_str=$2
+
+ if [[ $# -eq 2 ]]; then
+ opt_str="($opt_str)"
+ fi
+
+ if ((RET == ksft_pass)); then
+ handle_test_result_pass "$test_name" "$opt_str"
+ elif ((RET == ksft_xfail)); then
+ handle_test_result_xfail "$test_name" "$opt_str"
+ elif ((RET == ksft_skip)); then
+ handle_test_result_skip "$test_name" "$opt_str"
+ else
+ handle_test_result_fail "$test_name" "$opt_str"
+ fi
+
+ EXIT_STATUS=$(ksft_exit_status_merge $EXIT_STATUS $RET)
+ return $RET
+}
+
+log_test_skip()
+{
+ RET=$ksft_skip retmsg= log_test "$@"
+}
+
+log_test_xfail()
+{
+ RET=$ksft_xfail retmsg= log_test "$@"
+}
+
+log_info()
+{
+ local msg=$1
+
+ echo "INFO: $msg"
+}
--
2.45.0
Hello,
this new series aims to migrate test_flow_dissector.sh into test_progs.
There are 2 "main" parts in test_flow_dissector.sh:
- a set of tests checking flow_dissector programs attachment to either
root namespace or non-root namespace
- dissection test
The first set is integrated in flow_dissector.c, which already contains
some existing tests for flow_dissector programs. This series uses the
opportunity to update a bit this file (use new assert, re-split tests,
etc)
The second part is migrated into a new file under test_progs,
flow_dissector_classification.c. It uses the same eBPF programs as
flow_dissector.c, but the difference is rather about how those program
are executed:
- flow_dissector.c manually runs programs with BPF_PROG_RUN
- flow_dissector_classification.c sends real packets to be dissected, and
so it also executes kernel code related to eBPF flow dissector (eg:
__skb_flow_bpf_to_target)
---
Alexis Lothoré (eBPF Foundation) (10):
selftests/bpf: add a macro to compare raw memory
selftests/bpf: use ASSERT_MEMEQ to compare bpf flow keys
selftests/bpf: replace CHECK calls with ASSERT macros in flow_dissector test
selftests/bpf: re-split main function into dedicated tests
selftests/bpf: expose all subtests from flow_dissector
selftests/bpf: add gre packets testing to flow_dissector
selftests/bpf: migrate flow_dissector namespace exclusivity test
selftests/bpf: Enable generic tc actions in selftests config
selftests/bpf: migrate bpf flow dissectors tests to test_progs
selftests/bpf: remove test_flow_dissector.sh
tools/testing/selftests/bpf/.gitignore | 1 -
tools/testing/selftests/bpf/Makefile | 3 +-
tools/testing/selftests/bpf/config | 1 +
.../selftests/bpf/prog_tests/flow_dissector.c | 307 ++++++--
.../bpf/prog_tests/flow_dissector_classification.c | 851 +++++++++++++++++++++
tools/testing/selftests/bpf/test_flow_dissector.c | 780 -------------------
tools/testing/selftests/bpf/test_flow_dissector.sh | 178 -----
tools/testing/selftests/bpf/test_progs.h | 25 +
8 files changed, 1107 insertions(+), 1039 deletions(-)
---
base-commit: 16e1d1c377aa4c56223701a31f3bfa88505e7e4f
change-id: 20241019-flow_dissector-3eb0c07fc163
Best regards,
--
Alexis Lothoré, Bootlin
Embedded Linux and Kernel engineering
https://bootlin.com
On Wed, Nov 13, 2024, Paolo Bonzini wrote:
> Il mar 12 nov 2024, 21:44 Doug Covelli <doug.covelli(a)broadcom.com> ha
> scritto:
>
> > > Split irqchip should be the best tradeoff. Without it, moves from cr8
> > > stay in the kernel, but moves to cr8 always go to userspace with a
> > > KVM_EXIT_SET_TPR exit. You also won't be able to use Intel
> > > flexpriority (in-processor accelerated TPR) because KVM does not know
> > > which bits are set in IRR. So it will be *really* every move to cr8
> > > that goes to userspace.
> >
> > Sorry to hijack this thread but is there a technical reason not to allow
> > CR8 based accesses to the TPR (not MMIO accesses) when the in-kernel local
> > APIC is not in use?
>
> No worries, you're not hijacking :) The only reason is that it would be
> more code for a seldom used feature and anyway with worse performance. (To
> be clear, CR8 based accesses are allowed, but stores cause an exit in order
> to check the new TPR against IRR. That's because KVM's API does not have an
> equivalent of the TPR threshold as you point out below).
>
> > Also I could not find these documented anywhere but with MSFT's APIC our
> > monitor relies on extensions for trapping certain events such as INIT/SIPI
> > plus LINT0 and SVR writes:
> >
> > UINT64 X64ApicInitSipiExitTrap : 1; //
> > WHvRunVpExitReasonX64ApicInitSipiTrap
> > UINT64 X64ApicWriteLint0ExitTrap : 1; //
> > WHvRunVpExitReasonX64ApicWriteTrap
> > UINT64 X64ApicWriteLint1ExitTrap : 1; //
> > WHvRunVpExitReasonX64ApicWriteTrap
> > UINT64 X64ApicWriteSvrExitTrap : 1; //
> > WHvRunVpExitReasonX64ApicWriteTrap
> >
>
> There's no need for this in KVM's in-kernel APIC model. INIT and SIPI are
> handled in the hypervisor and you can get the current state of APs via
> KVM_GET_MPSTATE. LINT0 and LINT1 are injected with KVM_INTERRUPT and
> KVM_NMI respectively, and they obey IF/PPR and NMI blocking respectively,
> plus the interrupt shadow; so there's no need for userspace to know when
> LINT0/LINT1 themselves change. The spurious interrupt vector register is
> also handled completely in kernel.
>
> > I did not see any similar functionality for KVM. Does anything like that
> > exist? In any case we would be happy to add support for handling CR8
> > accesses w/o exiting w/o the in-kernel APIC along with some sort of a way
> > to configure the TPR threshold if folks are not opposed to that.
>
> As far I know everybody who's using KVM (whether proprietary or open
> source) has had no need for that, so I don't think it's a good idea to make
> the API more complex.
+1
> Performance of Windows guests is going to be bad anyway with userspace APIC.
Heh, on modern hardware, performance of any guest is going to suck with a userspace
APIC, compared to what is possible with an in-kernel APIC.
More importantly, I really, really don't want to encourage non-trivial usage of
a local APIC in userspace. KVM's support for a userspace local APIC is very
poorly tested these days. I have zero desire to spend any amount of time reviewing
and fixing issues that are unique to emulating the local APIC in userspace. And
long term, I would love to force an in-kernel local APIC, though I don't know if
that's entirely feasible.
This patchset fixes a bug where bnxt driver was failing to report that
an ntuple rule is redirecting to an RSS context. First commit is the
fix, then second commit extends selftests to detect if other/new drivers
are compliant with ntuple/rss_ctx API.
Daniel Xu (2):
bnxt_en: ethtool: Supply ntuple rss context action
selftests: drv-net: rss_ctx: Add test for ntuple rule
drivers/net/ethernet/broadcom/bnxt/bnxt_ethtool.c | 8 ++++++--
tools/testing/selftests/drivers/net/hw/rss_ctx.py | 13 ++++++++++++-
2 files changed, 18 insertions(+), 3 deletions(-)
--
2.46.0
To be able to switch VMware products running on Linux to KVM some minor
changes are required to let KVM run/resume unmodified VMware guests.
First allow enabling of the VMware backdoor via an api. Currently the
setting of the VMware backdoor is limited to kernel boot parameters,
which forces all VM's running on a host to either run with or without
the VMware backdoor. Add a simple cap to allow enabling of the VMware
backdoor on a per VM basis. The default for that setting remains the
kvm.enable_vmware_backdoor boot parameter (which is false by default)
and can be changed on a per-vm basis via the KVM_CAP_X86_VMWARE_BACKDOOR
cap.
Second add a cap to forward hypercalls to userspace. I know that in
general that's frowned upon but VMwre guests send quite a few hypercalls
from userspace and it would be both impractical and largelly impossible
to handle all in the kernel. The change is trivial and I'd be maintaining
this code so I hope it's not a big deal.
The third commit just adds a self-test for the "forward VMware hypercalls
to userspace" functionality.
Cc: Doug Covelli <doug.covelli(a)broadcom.com>
Cc: Paolo Bonzini <pbonzini(a)redhat.com>
Cc: Jonathan Corbet <corbet(a)lwn.net>
Cc: Sean Christopherson <seanjc(a)google.com>
Cc: Thomas Gleixner <tglx(a)linutronix.de>
Cc: Ingo Molnar <mingo(a)redhat.com>
Cc: Borislav Petkov <bp(a)alien8.de>
Cc: Dave Hansen <dave.hansen(a)linux.intel.com>
Cc: x86(a)kernel.org
Cc: "H. Peter Anvin" <hpa(a)zytor.com>
Cc: Shuah Khan <shuah(a)kernel.org>
Cc: Namhyung Kim <namhyung(a)kernel.org>
Cc: Arnaldo Carvalho de Melo <acme(a)redhat.com>
Cc: Isaku Yamahata <isaku.yamahata(a)intel.com>
Cc: Joel Stanley <joel(a)jms.id.au>
Cc: Zack Rusin <zack.rusin(a)broadcom.com>
Cc: kvm(a)vger.kernel.org
Cc: linux-doc(a)vger.kernel.org
Cc: linux-kernel(a)vger.kernel.org
Cc: linux-kselftest(a)vger.kernel.org
Zack Rusin (3):
KVM: x86: Allow enabling of the vmware backdoor via a cap
KVM: x86: Add support for VMware guest specific hypercalls
KVM: selftests: x86: Add a test for KVM_CAP_X86_VMWARE_HYPERCALL
Documentation/virt/kvm/api.rst | 56 ++++++++-
arch/x86/include/asm/kvm_host.h | 2 +
arch/x86/kvm/emulate.c | 5 +-
arch/x86/kvm/svm/svm.c | 6 +-
arch/x86/kvm/vmx/vmx.c | 4 +-
arch/x86/kvm/x86.c | 47 ++++++++
arch/x86/kvm/x86.h | 7 +-
include/uapi/linux/kvm.h | 2 +
tools/include/uapi/linux/kvm.h | 2 +
tools/testing/selftests/kvm/Makefile | 1 +
.../kvm/x86_64/vmware_hypercall_test.c | 108 ++++++++++++++++++
11 files changed, 227 insertions(+), 13 deletions(-)
create mode 100644 tools/testing/selftests/kvm/x86_64/vmware_hypercall_test.c
--
2.43.0
This patch series represents the final release of KTAP version 2.
There have been having open discussions on version 2 for just over 2
years. This patch series marks the end of KTAP version 2 development
and beginning of the KTAP version 3 development.
The largest component of KTAP version 2 release is the addition of test
metadata to the specification. KTAP metadata could include any test
information that is pertinent for user interaction before or after the
running of the test. For example, the test file path or the test speed.
Example of KTAP Metadata:
KTAP version 2
#:ktap_test: main
#:ktap_arch: uml
1..1
KTAP version 2
#:ktap_test: suite_1
#:ktap_subsystem: example
#:ktap_test_file: lib/test.c
1..2
ok 1 test_1
#:ktap_test: test_2
#:ktap_speed: very_slow
# test_2 has begun
#:custom_is_flaky: true
ok 2 test_2
# suite_1 has passed
ok 1 suite_1
The release also includes some formatting fixes and changes to update
the specification to version 2.
Frank Rowand (2):
ktap_v2: change version to 2-rc in KTAP specification
ktap_v2: change "version 1" to "version 2" in examples
Rae Moar (3):
ktap_v2: add test metadata
ktap_v2: formatting fixes to ktap spec
ktap_v2: change version to 2 in KTAP specification
Documentation/dev-tools/ktap.rst | 276 +++++++++++++++++++++++++++++--
1 file changed, 260 insertions(+), 16 deletions(-)
base-commit: 2d5404caa8c7bb5c4e0435f94b28834ae5456623
--
2.47.0.277.g8800431eea-goog
Since commit 49f59573e9e0 ("selftests/mm: Enable pkey_sighandler_tests
on arm64"), pkey_sighandler_tests.c (which includes pkey-arm64.h via
pkey-helpers.h) ends up compiled for arm64. Since it doesn't use
aarch64_write_signal_pkey(), the compiler warns:
In file included from pkey-helpers.h:106,
from pkey_sighandler_tests.c:31:
pkey-arm64.h:130:13: warning: ‘aarch64_write_signal_pkey’ defined but not used [-Wunused-function]
130 | static void aarch64_write_signal_pkey(ucontext_t *uctxt, u64 pkey)
| ^~~~~~~~~~~~~~~~~~~~~~~~~
Make the aarch64_write_signal_pkey() a 'static inline void' function to
avoid the compiler warning.
Fixes: f5b5ea51f78f ("selftests: mm: make protection_keys test work on arm64")
Cc: Shuah Khan <skhan(a)linuxfoundation.org>
Cc: Joey Gouly <joey.gouly(a)arm.com>
Cc: Kevin Brodsky <kevin.brodsky(a)arm.com>
Signed-off-by: Catalin Marinas <catalin.marinas(a)arm.com>
---
I'll add this on top of the arm64 for-next/pkey-signal branch together with
Kevin's other patches.
tools/testing/selftests/mm/pkey-arm64.h | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/tools/testing/selftests/mm/pkey-arm64.h b/tools/testing/selftests/mm/pkey-arm64.h
index d57fbeace38f..d9d2100eafc0 100644
--- a/tools/testing/selftests/mm/pkey-arm64.h
+++ b/tools/testing/selftests/mm/pkey-arm64.h
@@ -127,7 +127,7 @@ static inline u64 get_pkey_bits(u64 reg, int pkey)
return 0;
}
-static void aarch64_write_signal_pkey(ucontext_t *uctxt, u64 pkey)
+static inline void aarch64_write_signal_pkey(ucontext_t *uctxt, u64 pkey)
{
struct _aarch64_ctx *ctx = GET_UC_RESV_HEAD(uctxt);
struct poe_context *poe_ctx =
It looks like people started ignoring the compiler warnings (or even
errors) when building the arm64-specific kselftests. The first three
patches are printf() arguments adjustment. The last one adds
".arch_extension sme", otherwise they fail to build (with my toolchain
versions at least).
Note for future kselftest contributors - try to keep them warning-free.
Catalin Marinas (4):
kselftest/arm64: Fix printf() compiler warnings in the arm64 fp tests
kselftest/arm64: Fix printf() warning in the arm64 MTE prctl() test
kselftest/arm64: Fix printf() compiler warnings in the arm64
syscall-abi.c tests
kselftest/arm64: Fix compilation of SME instructions in the arm64 fp
tests
tools/testing/selftests/arm64/abi/syscall-abi.c | 8 ++++----
tools/testing/selftests/arm64/fp/sve-ptrace.c | 16 +++++++++-------
tools/testing/selftests/arm64/fp/za-ptrace.c | 8 +++++---
tools/testing/selftests/arm64/fp/za-test.S | 1 +
tools/testing/selftests/arm64/fp/zt-ptrace.c | 8 +++++---
tools/testing/selftests/arm64/fp/zt-test.S | 1 +
tools/testing/selftests/arm64/mte/check_prctl.c | 2 +-
7 files changed, 26 insertions(+), 18 deletions(-)
Currently we test signal delivery to the programs run by fp-stress but
our signal handlers simply count the number of signals seen and don't do
anything with the floating point state. The original fpsimd-test and
sve-test programs had signal handlers called irritators which modify the
live register state, verifying that we restore the signal context on
return, but a combination of misleading comments and code resulted in
them never being used and the equivalent handlers in the other tests
being stubbed or omitted.
Clarify the code, implement effective irritator handlers for the test
programs that can have them and then switch the signals generated by the
fp-stress program over to use the irritators, ensuring that we validate
that we restore the saved signal context properly.
Signed-off-by: Mark Brown <broonie(a)kernel.org>
---
Changes in v2:
- Change instruction and commit log for corruption of predicat
registers.
- Link to v1: https://lore.kernel.org/r/20241023-arm64-fp-stress-irritator-v1-0-a51af298d…
---
Mark Brown (6):
kselftest/arm64: Correct misleading comments on fp-stress irritators
kselftest/arm64: Remove unused ADRs from irritator handlers
kselftest/arm64: Corrupt P0 in the irritator when testing SSVE
kselftest/arm64: Implement irritators for ZA and ZT
kselftest/arm64: Provide a SIGUSR1 handler in the kernel mode FP stress test
kselftest/arm64: Test signal handler state modification in fp-stress
tools/testing/selftests/arm64/fp/fp-stress.c | 2 +-
tools/testing/selftests/arm64/fp/fpsimd-test.S | 4 +---
tools/testing/selftests/arm64/fp/kernel-test.c | 4 ++++
tools/testing/selftests/arm64/fp/sve-test.S | 8 +++-----
tools/testing/selftests/arm64/fp/za-test.S | 13 ++++---------
tools/testing/selftests/arm64/fp/zt-test.S | 13 ++++---------
6 files changed, 17 insertions(+), 27 deletions(-)
---
base-commit: 8e929cb546ee42c9a61d24fae60605e9e3192354
change-id: 20241023-arm64-fp-stress-irritator-5415fe92adef
Best regards,
--
Mark Brown <broonie(a)kernel.org>
This series contains a bit of a grab bag of improvements to the floating
point tests, mainly fp-ptrace. Globally over all the tests we start
using defines from the generated sysregs (following the example of the
KVM selftests) for SVCR, stop being quite so wasteful with registers
when calling into the assembler code then expand the coverage of both ZA
writes and FPMR (which was not there since fp-ptrace and the 2023 dpISA
extensions were on the list at the same time).
Signed-off-by: Mark Brown <broonie(a)kernel.org>
---
Changes in v2:
- Drop attempt to reference sysreg-defs.h due to build issues, just
import the defines directly into fp-ptrace. We can revisit later.
- Link to v1: https://lore.kernel.org/r/20241107-arm64-fp-ptrace-fpmr-v1-0-3e5e0b6e3be9@k…
---
Mark Brown (3):
kselftets/arm64: Use flag bits for features in fp-ptrace assembler code
kselftest/arm64: Expand the set of ZA writes fp-ptrace does
kselftest/arm64: Add FPMR coverage to fp-ptrace
tools/testing/selftests/arm64/fp/fp-ptrace-asm.S | 41 ++++--
tools/testing/selftests/arm64/fp/fp-ptrace.c | 161 +++++++++++++++++++++--
tools/testing/selftests/arm64/fp/fp-ptrace.h | 12 ++
tools/testing/selftests/arm64/fp/sme-inst.h | 2 +
4 files changed, 188 insertions(+), 28 deletions(-)
---
base-commit: 8e929cb546ee42c9a61d24fae60605e9e3192354
change-id: 20241105-arm64-fp-ptrace-fpmr-ff061facd3da
Best regards,
--
Mark Brown <broonie(a)kernel.org>
While some assemblers (including the LLVM assembler I mostly use) will
happily accept SMSTART as an instruction by default others, specifically
gas, require that any architecture extensions be explicitly enabled.
The assembler SME test programs use manually encoded helpers for the new
instructions but no SMSTART helper is defined, only SM and ZA specific
variants. Unfortunately the irritators that were just added use plain
SMSTART so on stricter assemblers these fail to build:
za-test.S:160: Error: selected processor does not support `smstart'
Switch to using SMSTART ZA via the manually encoded smstart_za macro we
already have defined.
Fixes: d65f27d240bb ("kselftest/arm64: Implement irritators for ZA and ZT")
Signed-off-by: Mark Brown <broonie(a)kernel.org>
---
tools/testing/selftests/arm64/fp/za-test.S | 2 +-
tools/testing/selftests/arm64/fp/zt-test.S | 2 +-
2 files changed, 2 insertions(+), 2 deletions(-)
diff --git a/tools/testing/selftests/arm64/fp/za-test.S b/tools/testing/selftests/arm64/fp/za-test.S
index 95fdc1c1f228221bc812087a528e4b7c99767bba..9c33e13e9dc4a6f084649fe7d0fb838d9171e3aa 100644
--- a/tools/testing/selftests/arm64/fp/za-test.S
+++ b/tools/testing/selftests/arm64/fp/za-test.S
@@ -157,7 +157,7 @@ function irritator_handler
// This will reset ZA to all bits 0
smstop
- smstart
+ smstart_za
ret
endfunction
diff --git a/tools/testing/selftests/arm64/fp/zt-test.S b/tools/testing/selftests/arm64/fp/zt-test.S
index a90712802801efb97dc6bf8027fb9ceac8f0a895..38080f3c328042af6b3e2d7c3300162ea6efa4ea 100644
--- a/tools/testing/selftests/arm64/fp/zt-test.S
+++ b/tools/testing/selftests/arm64/fp/zt-test.S
@@ -126,7 +126,7 @@ function irritator_handler
// This will reset ZT to all bits 0
smstop
- smstart
+ smstart_za
ret
endfunction
---
base-commit: 95ad089d464da2a4cd4511fb077f25994104c8f1
change-id: 20241108-arm64-selftest-asm-error-d78570e50b3b
Best regards,
--
Mark Brown <broonie(a)kernel.org>
Currently we don't build the PAC selftests when building with LLVM=1 since
we attempt to test for PAC support in the toolchain before we've set up the
build system to point at LLVM in lib.mk, which has to be one of the last
things in the Makefile.
Since all versions of LLVM supported for use with the kernel have PAC
support we can just sidestep the issue by just assuming PAC is there when
doing a LLVM=1 build.
Signed-off-by: Mark Brown <broonie(a)kernel.org>
---
tools/testing/selftests/arm64/pauth/Makefile | 6 ++++++
1 file changed, 6 insertions(+)
diff --git a/tools/testing/selftests/arm64/pauth/Makefile b/tools/testing/selftests/arm64/pauth/Makefile
index 72e290b0b10c1ea5bf1b84232f70844a601b8129..b5a1c80e0ead6932d2441a192b9758a049e3b3f8 100644
--- a/tools/testing/selftests/arm64/pauth/Makefile
+++ b/tools/testing/selftests/arm64/pauth/Makefile
@@ -7,8 +7,14 @@ CC := $(CROSS_COMPILE)gcc
endif
CFLAGS += -mbranch-protection=pac-ret
+
+# All supported LLVMs have PAC, test for GCC
+ifeq ($(LLVM),1)
+pauth_cc_support := 1
+else
# check if the compiler supports ARMv8.3 and branch protection with PAuth
pauth_cc_support := $(shell if ($(CC) $(CFLAGS) -march=armv8.3-a -E -x c /dev/null -o /dev/null 2>&1) then echo "1"; fi)
+endif
ifeq ($(pauth_cc_support),1)
TEST_GEN_PROGS := pac
---
base-commit: 8e929cb546ee42c9a61d24fae60605e9e3192354
change-id: 20241111-arm64-selftest-pac-clang-cb80de079169
Best regards,
--
Mark Brown <broonie(a)kernel.org>
The PAC tests currently generate a very small number of false failures
since the limited size of PAC keys, especially with fewer bits allocated
for PAC due to larger VA, means collisions in generated PACs are
possible even if the PAC keys are different. The test tries to work around
this by testing repeatedly but doesn't iterate often enough to be
reliable.
We also fix a leak of file descriptors in the exec test which doesn't
matter now but can become an issue when the number of iterations is
increased, we can bump into limits on allocated file descriptors.
Signed-off-by: Mark Brown <broonie(a)kernel.org>
---
Mark Brown (2):
kselftest/arm64: Don't leak pipe fds in pac.exec_sign_all()
kselftest/arm64: Try harder to generate different keys during PAC tests
tools/testing/selftests/arm64/pauth/pac.c | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)
---
base-commit: 8e929cb546ee42c9a61d24fae60605e9e3192354
change-id: 20241111-arm64-pac-test-collisions-5613f5dfe479
Best regards,
--
Mark Brown <broonie(a)kernel.org>
This series contains a bit of a grab bag of improvements to the floating
point tests, mainly fp-ptrace. Globally over all the tests we start
using defines from the generated sysregs (following the example of the
KVM selftests) for SVCR, stop being quite so wasteful with registers
when calling into the assembler code then expand the coverage of both ZA
writes and FPMR (which was not there since fp-ptrace and the 2023 dpISA
extensions were on the list at the same time).
Signed-off-by: Mark Brown <broonie(a)kernel.org>
---
Mark Brown (4):
kselftets/arm64: Use flag bits for features in fp-ptrace assembler code
kselftest/arm64: Use a define for SVCR
kselftest/arm64: Expand the set of ZA writes fp-ptrace does
kselftest/arm64: Add FPMR coverage to fp-ptrace
tools/testing/selftests/arm64/fp/Makefile | 20 ++-
tools/testing/selftests/arm64/fp/fp-ptrace-asm.S | 52 +++++---
tools/testing/selftests/arm64/fp/fp-ptrace.c | 155 +++++++++++++++++++++--
tools/testing/selftests/arm64/fp/fp-ptrace.h | 16 ++-
tools/testing/selftests/arm64/fp/sve-test.S | 5 +-
tools/testing/selftests/arm64/fp/za-fork-asm.S | 3 +-
tools/testing/selftests/arm64/fp/za-test.S | 6 +-
tools/testing/selftests/arm64/fp/zt-test.S | 5 +-
8 files changed, 213 insertions(+), 49 deletions(-)
---
base-commit: 8e929cb546ee42c9a61d24fae60605e9e3192354
change-id: 20241105-arm64-fp-ptrace-fpmr-ff061facd3da
Best regards,
--
Mark Brown <broonie(a)kernel.org>
Kuan-Wei Chiu pointed out that the kernel doc for kunit_zalloc_skb()
needs to include the @gfp information. Add it.
Reported-by: Kuan-Wei Chiu <visitorckw(a)gmail.com>
Closes: https://lore.kernel.org/all/Zy+VIXDPuU613fFd@visitorckw-System-Product-Name/
Signed-off-by: Dan Carpenter <dan.carpenter(a)linaro.org>
---
include/kunit/skbuff.h | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/include/kunit/skbuff.h b/include/kunit/skbuff.h
index 345e1e8f0312..07784694357c 100644
--- a/include/kunit/skbuff.h
+++ b/include/kunit/skbuff.h
@@ -20,8 +20,9 @@ static void kunit_action_kfree_skb(void *p)
* kunit_zalloc_skb() - Allocate and initialize a resource managed skb.
* @test: The test case to which the skb belongs
* @len: size to allocate
+ * @gfp: allocation flags
*
- * Allocate a new struct sk_buff with GFP_KERNEL, zero fill the give length
+ * Allocate a new struct sk_buff with gfp flags, zero fill the given length
* and add it as a resource to the kunit test for automatic cleanup.
*
* Returns: newly allocated SKB, or %NULL on error
--
2.45.2
This patch series includes some netns-related improvements and fixes for
RTNL and ip_tunnel, to make link creation more intuitive:
- Creating link in another net namespace doesn't conflict with link names
in current one.
- Add a flag in rtnl_ops, to avoid netns change when link-netns is present
if possible.
- When creating ip tunnel (e.g. GRE) in another netns, use current as
link-netns if not specified explicitly.
So that
# modprobe ip_gre netns_atomic=1
# ip link add netns ns1 link-netns ns2 tun0 type gre ...
will create tun0 in ns1, rather than create it in ns2 and move to ns1.
And don't conflict with another interface named "tun0" in current netns.
---
v2:
- Check NLM_F_EXCL to ensure only link creation is affected.
- Add self tests for link name/ifindex conflict and notifications
in different netns.
- Changes in dummy driver and ynl in order to add the test case.
v1:
link: https://lore.kernel.org/all/20241023023146.372653-1-shaw.leon@gmail.com/
Xiao Liang (8):
rtnetlink: Lookup device in target netns when creating link
rtnetlink: Add netns_atomic flag in rtnl_link_ops
net: ip_tunnel: Build flow in underlay net namespace
net: ip_tunnel: Add source netns support for newlink
net: ip_gre: Add netns_atomic module parameter
net: dummy: Set netns_atomic in rtnl ops for testing
tools/net/ynl: Add retry limit for async notification
selftests: net: Add two test cases for link netns
drivers/net/dummy.c | 1 +
include/net/ip_tunnels.h | 3 ++
include/net/rtnetlink.h | 3 ++
net/core/rtnetlink.c | 17 +++++--
net/ipv4/ip_gre.c | 15 +++++-
net/ipv4/ip_tunnel.c | 27 +++++++----
tools/net/ynl/lib/ynl.py | 7 ++-
tools/testing/selftests/net/Makefile | 1 +
tools/testing/selftests/net/netns-name.sh | 10 ++++
tools/testing/selftests/net/netns_atomic.py | 54 +++++++++++++++++++++
10 files changed, 121 insertions(+), 17 deletions(-)
create mode 100755 tools/testing/selftests/net/netns_atomic.py
--
2.47.0
Check number of paths by fib_info_num_path(),
and update_or_create_fnhe() for every path.
Problem is that pmtu is cached only for the oif
that has received icmp message "need to frag",
other oifs will still try to use "default" iface mtu.
An example topology showing the problem:
| host1
+---------+
| dummy0 | 10.179.20.18/32 mtu9000
+---------+
+-----------+----------------+
+---------+ +---------+
| ens17f0 | 10.179.2.141/31 | ens17f1 | 10.179.2.13/31
+---------+ +---------+
| (all here have mtu 9000) |
+------+ +------+
| ro1 | 10.179.2.140/31 | ro2 | 10.179.2.12/31
+------+ +------+
| |
---------+------------+-------------------+------
|
+-----+
| ro3 | 10.10.10.10 mtu1500
+-----+
|
========================================
some networks
========================================
|
+-----+
| eth0| 10.10.30.30 mtu9000
+-----+
| host2
host1 have enabled multipath and
sysctl net.ipv4.fib_multipath_hash_policy = 1:
default proto static src 10.179.20.18
nexthop via 10.179.2.12 dev ens17f1 weight 1
nexthop via 10.179.2.140 dev ens17f0 weight 1
When host1 tries to do pmtud from 10.179.20.18/32 to host2,
host1 receives at ens17f1 iface an icmp packet from ro3 that ro3 mtu=1500.
And host1 caches it in nexthop exceptions cache.
Problem is that it is cached only for the iface that has received icmp,
and there is no way that ro3 will send icmp msg to host1 via another path.
Host1 now have this routes to host2:
ip r g 10.10.30.30 sport 30000 dport 443
10.10.30.30 via 10.179.2.12 dev ens17f1 src 10.179.20.18 uid 0
cache expires 521sec mtu 1500
ip r g 10.10.30.30 sport 30033 dport 443
10.10.30.30 via 10.179.2.140 dev ens17f0 src 10.179.20.18 uid 0
cache
So when host1 tries again to reach host2 with mtu>1500,
if packet flow is lucky enough to be hashed with oif=ens17f1 its ok,
if oif=ens17f0 it blackholes and still gets icmp msgs from ro3 to ens17f1,
until lucky day when ro3 will send it through another flow to ens17f0.
Signed-off-by: Vladimir Vdovin <deliran(a)verdict.gg>
---
V10:
selfttests in pmtu.sh:
- remove unused fail var
- Flip error messages
V9:
selftests in pmtu.sh:
- remove useless return
- fix mtu var override
V8:
selftests in pmtu.sh:
- Change var names from "dummy" to "host"
- Fix errors caused by incorrect iface arguments pass
- Add src addr to setup_multipath_new
- Change multipath* func order
- Change route_get_dst_exception() && route_get_dst_pmtu_from_exception()
and arguments pass where they are used
as Ido suggested in https://lore.kernel.org/all/ZykH_fdcMBdFgXix@shredder/
V7:
selftest in pmtu.sh:
- add setup_multipath() with old and new nh tests
- add global "dummy_v4" addr variables
- add documentation
- remove dummy netdev usage in mp nh test
- remove useless sysctl opts in mp nh test
V6:
- make commit message cleaner
V5:
- make self test cleaner
V4:
- fix selftest, do route lookup before checking cached exceptions
V3:
- add selftest
- fix compile error
V2:
- fix fib_info_num_path parameter pass
---
net/ipv4/route.c | 13 ++++
tools/testing/selftests/net/pmtu.sh | 112 +++++++++++++++++++++++-----
2 files changed, 108 insertions(+), 17 deletions(-)
diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index 723ac9181558..652f603d29fe 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -1027,6 +1027,19 @@ static void __ip_rt_update_pmtu(struct rtable *rt, struct flowi4 *fl4, u32 mtu)
struct fib_nh_common *nhc;
fib_select_path(net, &res, fl4, NULL);
+#ifdef CONFIG_IP_ROUTE_MULTIPATH
+ if (fib_info_num_path(res.fi) > 1) {
+ int nhsel;
+
+ for (nhsel = 0; nhsel < fib_info_num_path(res.fi); nhsel++) {
+ nhc = fib_info_nhc(res.fi, nhsel);
+ update_or_create_fnhe(nhc, fl4->daddr, 0, mtu, lock,
+ jiffies + net->ipv4.ip_rt_mtu_expires);
+ }
+ rcu_read_unlock();
+ return;
+ }
+#endif /* CONFIG_IP_ROUTE_MULTIPATH */
nhc = FIB_RES_NHC(res);
update_or_create_fnhe(nhc, fl4->daddr, 0, mtu, lock,
jiffies + net->ipv4.ip_rt_mtu_expires);
diff --git a/tools/testing/selftests/net/pmtu.sh b/tools/testing/selftests/net/pmtu.sh
index 569bce8b6383..7ca2fbd971ae 100755
--- a/tools/testing/selftests/net/pmtu.sh
+++ b/tools/testing/selftests/net/pmtu.sh
@@ -197,6 +197,12 @@
#
# - pmtu_ipv6_route_change
# Same as above but with IPv6
+#
+# - pmtu_ipv4_mp_exceptions
+# Use the same topology as in pmtu_ipv4, but add routeable addresses
+# on host A and B on lo reachable via both routers. Host A and B
+# addresses have multipath routes to each other, b_r1 mtu = 1500.
+# Check that PMTU exceptions are created for both paths.
source lib.sh
source net_helper.sh
@@ -266,7 +272,8 @@ tests="
list_flush_ipv4_exception ipv4: list and flush cached exceptions 1
list_flush_ipv6_exception ipv6: list and flush cached exceptions 1
pmtu_ipv4_route_change ipv4: PMTU exception w/route replace 1
- pmtu_ipv6_route_change ipv6: PMTU exception w/route replace 1"
+ pmtu_ipv6_route_change ipv6: PMTU exception w/route replace 1
+ pmtu_ipv4_mp_exceptions ipv4: PMTU multipath nh exceptions 1"
# Addressing and routing for tests with routers: four network segments, with
# index SEGMENT between 1 and 4, a common prefix (PREFIX4 or PREFIX6) and an
@@ -343,6 +350,9 @@ tunnel6_a_addr="fd00:2::a"
tunnel6_b_addr="fd00:2::b"
tunnel6_mask="64"
+host4_a_addr="192.168.99.99"
+host4_b_addr="192.168.88.88"
+
dummy6_0_prefix="fc00:1000::"
dummy6_1_prefix="fc00:1001::"
dummy6_mask="64"
@@ -984,6 +994,52 @@ setup_ovs_bridge() {
run_cmd ip route add ${prefix6}:${b_r1}::1 via ${prefix6}:${a_r1}::2
}
+setup_multipath_new() {
+ # Set up host A with multipath routes to host B host4_b_addr
+ run_cmd ${ns_a} ip addr add ${host4_a_addr} dev lo
+ run_cmd ${ns_a} ip nexthop add id 401 via ${prefix4}.${a_r1}.2 dev veth_A-R1
+ run_cmd ${ns_a} ip nexthop add id 402 via ${prefix4}.${a_r2}.2 dev veth_A-R2
+ run_cmd ${ns_a} ip nexthop add id 403 group 401/402
+ run_cmd ${ns_a} ip route add ${host4_b_addr} src ${host4_a_addr} nhid 403
+
+ # Set up host B with multipath routes to host A host4_a_addr
+ run_cmd ${ns_b} ip addr add ${host4_b_addr} dev lo
+ run_cmd ${ns_b} ip nexthop add id 401 via ${prefix4}.${b_r1}.2 dev veth_B-R1
+ run_cmd ${ns_b} ip nexthop add id 402 via ${prefix4}.${b_r2}.2 dev veth_B-R2
+ run_cmd ${ns_b} ip nexthop add id 403 group 401/402
+ run_cmd ${ns_b} ip route add ${host4_a_addr} src ${host4_b_addr} nhid 403
+}
+
+setup_multipath_old() {
+ # Set up host A with multipath routes to host B host4_b_addr
+ run_cmd ${ns_a} ip addr add ${host4_a_addr} dev lo
+ run_cmd ${ns_a} ip route add ${host4_b_addr} \
+ src ${host4_a_addr} \
+ nexthop via ${prefix4}.${a_r1}.2 weight 1 \
+ nexthop via ${prefix4}.${a_r2}.2 weight 1
+
+ # Set up host B with multipath routes to host A host4_a_addr
+ run_cmd ${ns_b} ip addr add ${host4_b_addr} dev lo
+ run_cmd ${ns_b} ip route add ${host4_a_addr} \
+ src ${host4_b_addr} \
+ nexthop via ${prefix4}.${b_r1}.2 weight 1 \
+ nexthop via ${prefix4}.${b_r2}.2 weight 1
+}
+
+setup_multipath() {
+ if [ "$USE_NH" = "yes" ]; then
+ setup_multipath_new
+ else
+ setup_multipath_old
+ fi
+
+ # Set up routers with routes to dummies
+ run_cmd ${ns_r1} ip route add ${host4_a_addr} via ${prefix4}.${a_r1}.1
+ run_cmd ${ns_r2} ip route add ${host4_a_addr} via ${prefix4}.${a_r2}.1
+ run_cmd ${ns_r1} ip route add ${host4_b_addr} via ${prefix4}.${b_r1}.1
+ run_cmd ${ns_r2} ip route add ${host4_b_addr} via ${prefix4}.${b_r2}.1
+}
+
setup() {
[ "$(id -u)" -ne 0 ] && echo " need to run as root" && return $ksft_skip
@@ -1076,23 +1132,15 @@ link_get_mtu() {
}
route_get_dst_exception() {
- ns_cmd="${1}"
- dst="${2}"
- dsfield="${3}"
+ ns_cmd="${1}"; shift
- if [ -z "${dsfield}" ]; then
- dsfield=0
- fi
-
- ${ns_cmd} ip route get "${dst}" dsfield "${dsfield}"
+ ${ns_cmd} ip route get "$@"
}
route_get_dst_pmtu_from_exception() {
- ns_cmd="${1}"
- dst="${2}"
- dsfield="${3}"
+ ns_cmd="${1}"; shift
- mtu_parse "$(route_get_dst_exception "${ns_cmd}" "${dst}" "${dsfield}")"
+ mtu_parse "$(route_get_dst_exception "${ns_cmd}" "$@")"
}
check_pmtu_value() {
@@ -1235,10 +1283,10 @@ test_pmtu_ipv4_dscp_icmp_exception() {
run_cmd "${ns_a}" ping -q -M want -Q "${dsfield}" -c 1 -w 1 -s "${len}" "${dst2}"
# Check that exceptions have been created with the correct PMTU
- pmtu_1="$(route_get_dst_pmtu_from_exception "${ns_a}" "${dst1}" "${policy_mark}")"
+ pmtu_1="$(route_get_dst_pmtu_from_exception "${ns_a}" "${dst1}" dsfield "${policy_mark}")"
check_pmtu_value "1400" "${pmtu_1}" "exceeding MTU" || return 1
- pmtu_2="$(route_get_dst_pmtu_from_exception "${ns_a}" "${dst2}" "${policy_mark}")"
+ pmtu_2="$(route_get_dst_pmtu_from_exception "${ns_a}" "${dst2}" dsfield "${policy_mark}")"
check_pmtu_value "1500" "${pmtu_2}" "exceeding MTU" || return 1
}
@@ -1285,9 +1333,9 @@ test_pmtu_ipv4_dscp_udp_exception() {
UDP:"${dst2}":50000,tos="${dsfield}"
# Check that exceptions have been created with the correct PMTU
- pmtu_1="$(route_get_dst_pmtu_from_exception "${ns_a}" "${dst1}" "${policy_mark}")"
+ pmtu_1="$(route_get_dst_pmtu_from_exception "${ns_a}" "${dst1}" dsfield "${policy_mark}")"
check_pmtu_value "1400" "${pmtu_1}" "exceeding MTU" || return 1
- pmtu_2="$(route_get_dst_pmtu_from_exception "${ns_a}" "${dst2}" "${policy_mark}")"
+ pmtu_2="$(route_get_dst_pmtu_from_exception "${ns_a}" "${dst2}" dsfield "${policy_mark}")"
check_pmtu_value "1500" "${pmtu_2}" "exceeding MTU" || return 1
}
@@ -2329,6 +2377,36 @@ test_pmtu_ipv6_route_change() {
test_pmtu_ipvX_route_change 6
}
+test_pmtu_ipv4_mp_exceptions() {
+ setup namespaces routing multipath || return $ksft_skip
+
+ trace "${ns_a}" veth_A-R1 "${ns_r1}" veth_R1-A \
+ "${ns_r1}" veth_R1-B "${ns_b}" veth_B-R1 \
+ "${ns_a}" veth_A-R2 "${ns_r2}" veth_R2-A \
+ "${ns_r2}" veth_R2-B "${ns_b}" veth_B-R2
+
+ # Set up initial MTU values
+ mtu "${ns_a}" veth_A-R1 2000
+ mtu "${ns_r1}" veth_R1-A 2000
+ mtu "${ns_r1}" veth_R1-B 1500
+ mtu "${ns_b}" veth_B-R1 1500
+
+ mtu "${ns_a}" veth_A-R2 2000
+ mtu "${ns_r2}" veth_R2-A 2000
+ mtu "${ns_r2}" veth_R2-B 1500
+ mtu "${ns_b}" veth_B-R2 1500
+
+ # Ping and expect two nexthop exceptions for two routes
+ run_cmd ${ns_a} ping -q -M want -i 0.1 -c 1 -s 1800 "${host4_b_addr}"
+
+ # Check that exceptions have been created with the correct PMTU
+ pmtu_a_R1="$(route_get_dst_pmtu_from_exception "${ns_a}" "${host4_b_addr}" oif veth_A-R1)"
+ pmtu_a_R2="$(route_get_dst_pmtu_from_exception "${ns_a}" "${host4_b_addr}" oif veth_A-R2)"
+
+ check_pmtu_value "1500" "${pmtu_a_R1}" "exceeding MTU (veth_A-R1)" || return 1
+ check_pmtu_value "1500" "${pmtu_a_R2}" "exceeding MTU (veth_A-R2)" || return 1
+}
+
usage() {
echo
echo "$0 [OPTIONS] [TEST]..."
base-commit: 66600fac7a984dea4ae095411f644770b2561ede
--
2.43.5
The netconsole selftest relies on the availability of the netdevsim module.
To ensure the test can run correctly, we need to check if the netdevsim
module is either loaded or built-in before proceeding.
Update the netconsole selftest to check for the existence of
the /sys/bus/netdevsim/new_device file before running the test. If the
file is not found, the test is skipped with an explanation that the
CONFIG_NETDEVSIM kernel config option may not be enabled.
Signed-off-by: Breno Leitao <leitao(a)debian.org>
---
tools/testing/selftests/drivers/net/netcons_basic.sh | 7 ++++++-
1 file changed, 6 insertions(+), 1 deletion(-)
diff --git a/tools/testing/selftests/drivers/net/netcons_basic.sh b/tools/testing/selftests/drivers/net/netcons_basic.sh
index 182eb1a97e59f3b4c9eea0e5b9e64a7fff656e2b..b175f4d966e5056ddb62e335f212c03e55f50fb0 100755
--- a/tools/testing/selftests/drivers/net/netcons_basic.sh
+++ b/tools/testing/selftests/drivers/net/netcons_basic.sh
@@ -39,6 +39,7 @@ NAMESPACE=""
# IDs for netdevsim
NSIM_DEV_1_ID=$((256 + RANDOM % 256))
NSIM_DEV_2_ID=$((512 + RANDOM % 256))
+NSIM_DEV_SYS_NEW="/sys/bus/netdevsim/new_device"
# Used to create and delete namespaces
source "${SCRIPTDIR}"/../../net/lib.sh
@@ -46,7 +47,6 @@ source "${SCRIPTDIR}"/../../net/net_helper.sh
# Create netdevsim interfaces
create_ifaces() {
- local NSIM_DEV_SYS_NEW=/sys/bus/netdevsim/new_device
echo "$NSIM_DEV_2_ID" > "$NSIM_DEV_SYS_NEW"
echo "$NSIM_DEV_1_ID" > "$NSIM_DEV_SYS_NEW"
@@ -212,6 +212,11 @@ function check_for_dependencies() {
exit "${ksft_skip}"
fi
+ if [ ! -f "${NSIM_DEV_SYS_NEW}" ]; then
+ echo "SKIP: file ${NSIM_DEV_SYS_NEW} does not exist. Check if CONFIG_NETDEVSIM is enabled" >&2
+ exit "${ksft_skip}"
+ fi
+
if [ ! -d "${NETCONS_CONFIGFS}" ]; then
echo "SKIP: directory ${NETCONS_CONFIGFS} does not exist. Check if NETCONSOLE_DYNAMIC is enabled" >&2
exit "${ksft_skip}"
---
base-commit: 4861333b42178fa3d8fd1bb4e2cfb2fedc968dba
change-id: 20241108-netcon_selftest_deps-83f26456f095
Best regards,
--
Breno Leitao <leitao(a)debian.org>
Greetings:
Welcome to v9, see changelog below.
This revision addresses feedback Willem gave on the selftests. No
functional or code changes to the implementation were made and
performance tests were not re-run.
This series introduces a new mechanism, IRQ suspension, which allows
network applications using epoll to mask IRQs during periods of high
traffic while also reducing tail latency (compared to existing
mechanisms, see below) during periods of low traffic. In doing so, this
balances CPU consumption with network processing efficiency.
Martin Karsten (CC'd) and I have been collaborating on this series for
several months and have appreciated the feedback from the community on
our RFC [1]. We've updated the cover letter and kernel documentation in
an attempt to more clearly explain how this mechanism works, how
applications can use it, and how it compares to existing mechanisms in
the kernel.
I briefly mentioned this idea at netdev conf 2024 (for those who were
there) and Martin described this idea in an earlier paper presented at
Sigmetrics 2024 [2].
~ The short explanation (TL;DR)
We propose adding a new napi config parameter: irq_suspend_timeout to
help balance CPU usage and network processing efficiency when using IRQ
deferral and napi busy poll.
If this parameter is set to a non-zero value *and* a user application
has enabled preferred busy poll on a busy poll context (via the
EPIOCSPARAMS ioctl introduced in commit 18e2bf0edf4d ("eventpoll: Add
epoll ioctl for epoll_params")), then application calls to epoll_wait
for that context will cause device IRQs and softirq processing to be
suspended as long as epoll_wait successfully retrieves data from the
NAPI. Each time data is retrieved, the irq_suspend_timeout is deferred.
If/when network traffic subsides and epoll_wait returns no data, IRQ
suspension is immediately reverted back to the existing
napi_defer_hard_irqs and gro_flush_timeout mechanism which was
introduced in commit 6f8b12d661d0 ("net: napi: add hard irqs deferral
feature")).
The irq_suspend_timeout serves as a safety mechanism. If userland takes
a long time processing data, irq_suspend_timeout will fire and restart
normal NAPI processing.
For a more in depth explanation, please continue reading.
~ Comparison with existing mechanisms
Interrupt mitigation can be accomplished in napi software, by setting
napi_defer_hard_irqs and gro_flush_timeout, or via interrupt coalescing
in the NIC. This can be quite efficient, but in both cases, a fixed
timeout (or packet count) needs to be configured. However, a fixed
timeout cannot effectively support both low- and high-load situations:
At low load, an application typically processes a few requests and then
waits to receive more input data. In this scenario, a large timeout will
cause unnecessary latency.
At high load, an application typically processes many requests before
being ready to receive more input data. In this case, a small timeout
will likely fire prematurely and trigger irq/softirq processing, which
interferes with the application's execution. This causes overhead, most
likely due to cache contention.
While NICs attempt to provide adaptive interrupt coalescing schemes,
these cannot properly take into account application-level processing.
An alternative packet delivery mechanism is busy-polling, which results
in perfect alignment of application processing and network polling. It
delivers optimal performance (throughput and latency), but results in
100% cpu utilization and is thus inefficient for below-capacity
workloads.
We propose to add a new packet delivery mode that properly alternates
between busy polling and interrupt-based delivery depending on busy and
idle periods of the application. During a busy period, the system
operates in busy-polling mode, which avoids interference. During an idle
period, the system falls back to interrupt deferral, but with a small
timeout to avoid excessive latencies. This delivery mode can also be
viewed as an extension of basic interrupt deferral, but alternating
between a small and a very large timeout.
This delivery mode is efficient, because it avoids softirq execution
interfering with application processing during busy periods. It can be
used with blocking epoll_wait to conserve cpu cycles during idle
periods. The effect of alternating between busy and idle periods is that
performance (throughput and latency) is very close to full busy polling,
while cpu utilization is lower and very close to interrupt mitigation.
~ Usage details
IRQ suspension is introduced via a per-NAPI configuration parameter that
controls the maximum time that IRQs can be suspended.
Here's how it is intended to work:
- The user application (or system administrator) uses the netdev-genl
netlink interface to set the pre-existing napi_defer_hard_irqs and
gro_flush_timeout NAPI config parameters to enable IRQ deferral.
- The user application (or system administrator) sets the proposed
irq_suspend_timeout parameter via the netdev-genl netlink interface
to a larger value than gro_flush_timeout to enable IRQ suspension.
- The user application issues the existing epoll ioctl to set the
prefer_busy_poll flag on the epoll context.
- The user application then calls epoll_wait to busy poll for network
events, as it normally would.
- If epoll_wait returns events to userland, IRQs are suspended for the
duration of irq_suspend_timeout.
- If epoll_wait finds no events and the thread is about to go to
sleep, IRQ handling using napi_defer_hard_irqs and gro_flush_timeout
is resumed.
As long as epoll_wait is retrieving events, IRQs (and softirq
processing) for the NAPI being polled remain disabled. When network
traffic reduces, eventually a busy poll loop in the kernel will retrieve
no data. When this occurs, regular IRQ deferral using gro_flush_timeout
for the polled NAPI is re-enabled.
Unless IRQ suspension is continued by subsequent calls to epoll_wait, it
automatically times out after the irq_suspend_timeout timer expires.
Regular deferral is also immediately re-enabled when the epoll context
is destroyed.
~ Usage scenario
The target scenario for IRQ suspension as packet delivery mode is a
system that runs a dominant application with substantial network I/O.
The target application can be configured to receive input data up to a
certain batch size (via epoll_wait maxevents parameter) and this batch
size determines the worst-case latency that application requests might
experience. Because packet delivery is suspended during the target
application's processing, the batch size also determines the worst-case
latency of concurrent applications using the same RX queue(s).
gro_flush_timeout should be set as small as possible, but large enough to
make sure that a single request is likely not being interfered with.
irq_suspend_timeout is largely a safety mechanism against misbehaving
applications. It should be set large enough to cover the processing of an
entire application batch, i.e., the factor between gro_flush_timeout and
irq_suspend_timeout should roughly correspond to the maximum batch size
that the target application would process in one go.
~ Important call out in the implementation
- Enabling per epoll-context preferred busy poll will now effectively
lead to a nonblocking iteration through napi_busy_loop, even when
busy_poll_usecs is 0. See patch 4.
~ Benchmark configs & descriptions
The changes were benchmarked with memcached [3] using the benchmarking
tool mutilate [4].
To facilitate benchmarking, a small patch [5] was applied to memcached
1.6.29 to allow setting per-epoll context preferred busy poll and other
settings via environment variables. Another small patch [6] was applied
to libevent to enable full busy-polling.
Multiple scenarios were benchmarked as described below and the scripts
used for producing these results can be found on github [7] (note: all
scenarios use NAPI-based traffic splitting via SO_INCOMING_ID by passing
-N to memcached):
- base:
- no other options enabled
- deferX:
- set defer_hard_irqs to 100
- set gro_flush_timeout to X,000
- napibusy:
- set defer_hard_irqs to 100
- set gro_flush_timeout to 200,000
- enable busy poll via the existing ioctl (busy_poll_usecs = 64,
busy_poll_budget = 64, prefer_busy_poll = true)
- fullbusy:
- set defer_hard_irqs to 100
- set gro_flush_timeout to 5,000,000
- enable busy poll via the existing ioctl (busy_poll_usecs = 1000,
busy_poll_budget = 64, prefer_busy_poll = true)
- change memcached's nonblocking epoll_wait invocation (via
libevent) to using a 1 ms timeout
- suspend0:
- set defer_hard_irqs to 0
- set gro_flush_timeout to 0
- set irq_suspend_timeout to 20,000,000
- enable busy poll via the existing ioctl (busy_poll_usecs = 0,
busy_poll_budget = 64, prefer_busy_poll = true)
- suspendX:
- set defer_hard_irqs to 100
- set gro_flush_timeout to X,000
- set irq_suspend_timeout to 20,000,000
- enable busy poll via the existing ioctl (busy_poll_usecs = 0,
busy_poll_budget = 64, prefer_busy_poll = true)
~ Benchmark results
Tested on:
Single socket AMD EPYC 7662 64-Core Processor
Hyperthreading disabled
4 NUMA Zones (NPS=4)
16 CPUs per NUMA zone (64 cores total)
2 x Dual port 100gbps Mellanox Technologies ConnectX-5 Ex EN NIC
The test machine is configured such that a single interface has 8 RX
queues. The queues' IRQs and memcached are pinned to CPUs that are
NUMA-local to the interface which is under test. The NIC's interrupt
coalescing configuration is left at boot-time defaults.
Results:
Results are shown below. The mechanism added by this series is
represented by the 'suspend' cases. Data presented shows a summary over
nearly 10 runs of each test case [8] using the scripts on github [7].
For latency, the median is shown. For throughput and CPU utilization,
the average is shown.
The results also include cycles-per-query (cpq) and
instruction-per-query (ipq) metrics, following the methodology proposed
in [2], to augment the CPU utilization numbers, which could be skewed
due to frequency scaling. We find that this does not appear to be the
case as CPU utilization and low-level metrics show similar trends.
These results were captured using the scripts on github [7] to
illustrate how this approach compares with other pre-existing
mechanisms. This data is not to be interpreted as scientific data
captured in a fully isolated lab setting, but instead as best effort,
illustrative information comparing and contrasting tradeoffs.
The absolute QPS results shift between submissions, but the
relative differences are equivalent. As patches are rebased,
several factors likely influence overall performance.
Compare:
- Throughput (MAX) and latencies of base vs suspend.
- CPU usage of napibusy and fullbusy during lower load (200K, 400K for
example) vs suspend.
- Latency of the defer variants vs suspend as timeout and load
increases.
- suspend0, which sets defer_hard_irqs and gro_flush_timeout to 0, has
nearly the same performance as the base case (this is FAQ item #1).
The overall takeaway is that the suspend variants provide a superior
combination of high throughput, low latency, and low cpu utilization
compared to all other variants. Each of the suspend variants works very
well, but some fine-tuning between latency and cpu utilization is still
possible by tuning the small timeout (gro_flush_timeout).
Note: we've reorganized the results to make comparison among testcases
with the same load easier.
testcase load qps avglat 95%lat 99%lat cpu cpq ipq
base 200K 199946 112 239 416 26 12973 11343
defer10 200K 199971 54 124 142 29 19412 17460
defer20 200K 199986 60 130 153 26 15644 14095
defer50 200K 200025 79 144 182 23 12122 11632
defer200 200K 199999 164 254 309 19 8923 9635
fullbusy 200K 199998 46 118 133 100 43658 23133
napibusy 200K 199983 100 237 277 56 24840 24716
suspend0 200K 200020 105 249 432 30 14264 11796
suspend10 200K 199950 53 123 141 32 19518 16903
suspend20 200K 200037 58 126 151 30 16426 14736
suspend50 200K 199961 73 136 177 26 13310 12633
suspend200 200K 199998 149 251 306 21 9566 10203
testcase load qps avglat 95%lat 99%lat cpu cpq ipq
base 400K 400014 139 269 707 41 9476 9343
defer10 400K 400016 59 133 166 53 13991 12989
defer20 400K 399952 67 140 172 47 12063 11644
defer50 400K 400007 87 162 198 39 9384 9880
defer200 400K 399979 181 274 330 31 7089 8430
fullbusy 400K 399987 50 123 156 100 21827 16037
napibusy 400K 400014 76 222 272 83 18185 16529
suspend0 400K 400015 127 350 776 47 10699 9603
suspend10 400K 400023 57 129 164 54 13758 13178
suspend20 400K 400043 62 135 169 49 12071 11826
suspend50 400K 400071 76 149 186 42 10011 10301
suspend200 400K 399961 154 269 327 34 7827 8774
testcase load qps avglat 95%lat 99%lat cpu cpq ipq
base 600K 599951 149 266 574 61 9265 8876
defer10 600K 600006 71 147 203 76 11866 10936
defer20 600K 600123 76 152 203 66 10430 10342
defer50 600K 600162 95 172 217 54 8526 9142
defer200 600K 599942 200 301 357 46 6977 8212
fullbusy 600K 599990 55 127 177 100 14551 13983
napibusy 600K 600035 63 160 250 96 13937 14140
suspend0 600K 599903 127 320 732 68 10166 8963
suspend10 600K 599908 63 137 192 69 10902 11100
suspend20 600K 599961 66 141 194 65 9976 10370
suspend50 600K 599973 80 159 204 57 8678 9381
suspend200 600K 600010 157 277 346 48 7133 8381
testcase load qps avglat 95%lat 99%lat cpu cpq ipq
base 800K 800039 181 300 536 87 9585 8304
defer10 800K 800038 181 530 939 96 10564 8970
defer20 800K 800029 112 225 329 90 10056 8935
defer50 800K 799999 120 208 296 82 9234 8562
defer200 800K 800066 227 338 401 63 7117 8129
fullbusy 800K 800040 61 134 190 100 10913 12608
napibusy 800K 799944 64 141 214 99 10828 12588
suspend0 800K 799911 126 248 509 85 9346 8498
suspend10 800K 800006 69 143 200 83 9410 9845
suspend20 800K 800120 74 150 207 78 8786 9454
suspend50 800K 799989 87 168 224 71 7946 8833
suspend200 800K 799987 160 292 357 62 6923 8229
testcase load qps avglat 95%lat 99%lat cpu cpq ipq
base 1000K 906879 4079 5751 6216 98 9496 7904
defer10 1000K 860849 3643 6274 6730 99 10040 8676
defer20 1000K 896063 3298 5840 6349 98 9620 8237
defer50 1000K 919782 2962 5513 5807 97 9284 7951
defer200 1000K 970941 3059 5348 5984 95 8593 7959
fullbusy 1000K 999950 70 150 207 100 8732 10777
napibusy 1000K 999996 78 154 223 100 8722 10656
suspend0 1000K 949706 2666 5770 6660 99 9071 8046
suspend10 1000K 1000024 80 160 220 92 8137 9035
suspend20 1000K 1000059 83 165 226 89 7850 8804
suspend50 1000K 999955 95 180 240 84 7411 8459
suspend200 1000K 999914 163 299 366 77 6833 8078
testcase load qps avglat 95%lat 99%lat cpu cpq ipq
base MAX 1037654 4184 5453 5810 100 8411 7938
defer10 MAX 905607 4840 6151 6380 100 9639 8431
defer20 MAX 986463 4455 5594 5796 100 8848 8110
defer50 MAX 1077030 4000 5073 5299 100 8104 7920
defer200 MAX 1040728 4152 5385 5765 100 8379 7849
fullbusy MAX 1247536 3518 3935 3984 100 6998 7930
napibusy MAX 1136310 3799 7756 9964 100 7670 7877
suspend0 MAX 1057509 4132 5724 6185 100 8253 7918
suspend10 MAX 1215147 3580 3957 4041 100 7185 7944
suspend20 MAX 1216469 3576 3953 3988 100 7175 7950
suspend50 MAX 1215871 3577 3961 4075 100 7181 7949
suspend200 MAX 1216882 3556 3951 3988 100 7175 7955
~ FAQ
- Why is a new parameter needed? Does irq_suspend_timeout override
gro_flush_timeout?
Using the suspend mechanism causes the system to alternate between
polling mode and irq-driven packet delivery. During busy periods,
irq_suspend_timeout overrides gro_flush_timeout and keeps the system
busy polling, but when epoll finds no events, the setting of
gro_flush_timeout and napi_defer_hard_irqs determine the next step.
There are essentially three possible loops for network processing and
packet delivery:
1) hardirq -> softirq -> napi poll; basic interrupt delivery
2) timer -> softirq -> napi poll; deferred irq processing
3) epoll -> busy-poll -> napi poll; busy looping
Loop 2 can take control from Loop 1, if gro_flush_timeout and
napi_defer_hard_irqs are set.
If gro_flush_timeout and napi_defer_hard_irqs are set, Loops 2 and
3 "wrestle" with each other for control. During busy periods,
irq_suspend_timeout is used as timer in Loop 2, which essentially
tilts this in favour of Loop 3.
If gro_flush_timeout and napi_defer_hard_irqs are not set, Loop 3
cannot take control from Loop 1.
Therefore, setting gro_flush_timeout and napi_defer_hard_irqs is the
recommended usage, because otherwise setting irq_suspend_timeout
might not have any discernible effect.
This is shown in the results above: compare suspend0 with the base
case. Note that the lack of napi_defer_hard_irqs and
gro_flush_timeout produce similar results for both, which encourages
the use of napi_defer_hard_irqs and gro_flush_timeout in addition to
irq_suspend_timeout.
- Can the new timeout value be threaded through the new epoll ioctl ?
It is possible, but presents challenges for userspace. User
applications must ensure that the file descriptors added to epoll
contexts have the same NAPI ID to support busy polling.
An epoll context is not permanently tied to any particular NAPI ID.
So, a user application could decide to clear the file descriptors
from the context and add a new set of file descriptors with a
different NAPI ID to the context. Busy polling would work as
expected, but the meaning of the suspend timeout becomes ambiguous
because IRQs are not inherently associated with epoll contexts, but
rather with the NAPI. The user program would need to reissue the
ioctl to set the irq_suspend_timeout, but the napi_defer_hard_irqs
and gro_flush_timeout settings would come from the NAPI's
napi_config (which are set either by sysfs or by netlink). Such an
interface seems awkard to use from a user perspective.
Further, IRQs are related to NAPIs, which is why they are stored in
the napi_config space. Putting the irq_suspend_timeout in
the epoll context while other IRQ deferral mechanisms remain in the
NAPI's napi_config space seems like an odd design choice.
We've opted to keep all of the IRQ deferral parameters together and
place the irq_suspend_timeout in napi_config. This has nice benefits
for userspace: if a user app were to remove all file descriptors
from an epoll context and add new file descriptors with a new NAPI ID,
the correct suspend timeout for that NAPI ID would be used automatically
without the user application needing to do anything (like re-issuing an
ioctl, for example). All IRQ deferral related parameters are in one
place and can all be set the same way: with netlink.
- Can irq suspend be built by combining NIC coalescing and
gro_flush_timeout ?
No. The problem is that the long timeout must engage if and only if
prefer-busy is active.
When using NIC coalescing for the short timeout (without
napi_defer_hard_irqs/gro_flush_timeout), an interrupt after an idle
period will trigger softirq, which will run napi polling. At this
point, prefer-busy is not active, so NIC interrupts would be
re-enabled. Then it is not possible for the longer timeout to
interject to switch control back to polling. In other words, only by
using the software timer for the short timeout, it is possible to
extend the timeout without having to reprogram the NIC timer or
reach down directly and disable interrupts.
Using gro_flush_timeout for the long timeout also has problems, for
the same underlying reason. In the current napi implementation,
gro_flush_timeout is not tied to prefer-busy. We'd either have to
change that and in the process modify the existing deferral
mechanism, or introduce a state variable to determine whether
gro_flush_timeout is used as long timeout for irq suspend or whether
it is used for its default purpose. In an earlier version, we did
try something similar to the latter and made it work, but it ends up
being a lot more convoluted than our current proposal.
- Isn't it already possible to combine busy looping with irq deferral?
Yes, in fact enabling irq deferral via napi_defer_hard_irqs and
gro_flush_timeout is a precondition for prefer_busy_poll to have an
effect. If the application also uses a tight busy loop with
essentially nonblocking epoll_wait (accomplished with a very short
timeout parameter), this is the fullbusy case shown in the results.
An application using blocking epoll_wait is shown as the napibusy
case in the results. It's a hybrid approach that provides limited
latency benefits compared to the base case and plain irq deferral,
but not as good as fullbusy or suspend.
~ Special thanks
Several people were involved in earlier stages of the development of this
mechanism whom we'd like to thank:
- Peter Cai (CC'd), for the initial kernel patch and his contributions
to the paper.
- Mohammadamin Shafie (CC'd), for testing various versions of the kernel
patch and providing helpful feedback.
Thanks,
Martin and Joe
[1]: https://lore.kernel.org/netdev/20240812125717.413108-1-jdamato@fastly.com/
[2]: https://doi.org/10.1145/3626780
[3]: https://github.com/memcached/memcached/blob/master/doc/napi_ids.txt
[4]: https://github.com/leverich/mutilate
[5]: https://raw.githubusercontent.com/martinkarsten/irqsuspend/main/patches/mem…
[6]: https://raw.githubusercontent.com/martinkarsten/irqsuspend/main/patches/lib…
[7]: https://github.com/martinkarsten/irqsuspend
[8]: https://github.com/martinkarsten/irqsuspend/tree/main/results
v9:
- Addresses Willem's feedback on the selftests in patch 5 by fixing
the SPDX-License-Identifier, moving constants into variables in the
test script, reducing code duplication, shortening long lines, and
renaming variables to be more reader friendly. In the C test file,
added a comment explaining the if def blob and changed a few types
for strtoul.
v8: https://lore.kernel.org/netdev/20241108045337.292905-1-jdamato@fastly.com/
- Update patch 2 to drop the exports, as requested by Jakub.
v7: https://lore.kernel.org/netdev/20241108023912.98416-1-jdamato@fastly.com/
- Jakub noted that patch 2 adds unnecessary complexity by checking the
suspend timeout in the NAPI loop. This makes the code more
complicated and difficult to reason about. He's right; we've dropped
patch 2 which simplifies this series.
- Updated the cover letter with a full re-run of all test cases.
- Updated FAQ #2.
v6: https://lore.kernel.org/netdev/20241104215542.215919-1-jdamato@fastly.com/
- Updated the cover letter with a full re-run of all test cases,
including a new case suspend0, as requested by Sridhar previously.
- Updated the kernel documentation in patch 7 as suggested by Bagas
Sanjaya, which improved the htmldoc output.
v5: https://lore.kernel.org/netdev/20241103052421.518856-1-jdamato@fastly.com/
- Adjusted patch 5 to only suspend IRQs when ep_send_events returns a
positive return value. This issue was pointed out by Hillf Danton.
- Updated the commit message of patch 6 which still mentioned netcat,
despite the code being updated in v4 to replace it with socat and fixed
misspelling of netdevsim.
- Fixed a minor typo in patch 7 and removed an unnecessary paragraph.
- Added Sridhar Samudrala's Reviewed-by to patch 1-5 and 7.
v4: https://lore.kernel.org/netdev/20241102005214.32443-1-jdamato@fastly.com/
- Added a new FAQ item to cover letter.
- Updated patch 6 to use socat instead of nc in busy_poll_test.sh and
updated busy_poller.c to use netlink directly to configure napi
params.
- Updated the kernel documentation in patch 7 to include more details.
- Dropped Stanislav's Acked-by and Bagas' Reviewed-by from patch 7
since the documentation was updated.
v3: https://lore.kernel.org/netdev/20241101004846.32532-1-jdamato@fastly.com/
- Added Stanislav Fomichev's Acked-by to every patch except the newly
added selftest.
- Added Bagas Sanjaya's Reviewed-by to the documentation patch.
- Fixed the commit message of patch 2 to remove a reference to the now
non-existent sysfs setting.
- Added a self test which tests both "regular" busy poll and busy poll
with suspend enabled. This was added as patch 6 as requested by
Paolo. netdevsim was chosen instead of veth due to netdevsim's
pre-existing support for netdev-genl. See the commit message of
patch 6 for more details.
v2: https://lore.kernel.org/bpf/20241021015311.95468-1-jdamato@fastly.com/
- Cover letter updated, including a re-run of test data.
- Patch 1 rewritten to use netdev-genl instead of sysfs.
- Patch 3 updated with a comment added to napi_resume_irqs.
- Patch 4 rebased to apply now that commit b9ca079dd6b0 ("eventpoll:
Annotate data-race of busy_poll_usecs") has been picked up from VFS.
- Patch 6 updated the kernel documentation.
rfc -> v1:
- Cover letter updated to include more details.
- Patch 1 updated to remove the documentation added. This was moved to
patch 6 with the rest of the docs (see below).
- Patch 5 updated to fix an error uncovered by the kernel build robot.
See patch 5's changelog for more details.
- Patch 6 added which updates kernel documentation.
Joe Damato (2):
selftests: net: Add busy_poll_test
docs: networking: Describe irq suspension
Martin Karsten (4):
net: Add napi_struct parameter irq_suspend_timeout
net: Add control functions for irq suspension
eventpoll: Trigger napi_busy_loop, if prefer_busy_poll is set
eventpoll: Control irq suspension for prefer_busy_poll
Documentation/netlink/specs/netdev.yaml | 7 +
Documentation/networking/napi.rst | 170 ++++++++-
fs/eventpoll.c | 36 +-
include/linux/netdevice.h | 2 +
include/net/busy_poll.h | 3 +
include/uapi/linux/netdev.h | 1 +
net/core/dev.c | 39 ++
net/core/dev.h | 25 ++
net/core/netdev-genl-gen.c | 5 +-
net/core/netdev-genl.c | 12 +
tools/include/uapi/linux/netdev.h | 1 +
tools/testing/selftests/net/.gitignore | 1 +
tools/testing/selftests/net/Makefile | 3 +-
tools/testing/selftests/net/busy_poll_test.sh | 165 +++++++++
tools/testing/selftests/net/busy_poller.c | 346 ++++++++++++++++++
15 files changed, 809 insertions(+), 7 deletions(-)
create mode 100755 tools/testing/selftests/net/busy_poll_test.sh
create mode 100644 tools/testing/selftests/net/busy_poller.c
base-commit: dc7c381bb8649e3701ed64f6c3e55316675904d7
--
2.25.1
The goal of the series is to simplify and make it possible to use
ncdevmem in an automated way from the ksft python wrapper.
ncdevmem is slowly mutated into a state where it uses stdout
to print the payload and the python wrapper is added to
make sure the arrived payload matches the expected one.
v8:
- move error() calls into enable_reuseaddr() (Joe)
- bail out when number of queues is 1 (Joe)
- use socat instead of nc (Joe)
- fix warning about string truncation buf[256]->buf[40] (Jakub)
v7:
- fix validation (Mina)
- add support for working with non ::ffff-prefixed addresses (Mina)
v6:
- fix compilation issue in 'Unify error handling' patch (Jakub)
v5:
- properly handle errors from inet_pton() and socket() (Paolo)
- remove unneeded import from python selftest (Paolo)
v4:
- keep usage example with validation (Mina)
- fix compilation issue in one patch (s/start_queues/start_queue/)
v3:
- keep and refine the comment about ncdevmem invocation (Mina)
- add the comment about not enforcing exit status for ntuple reset (Mina)
- make configure_headersplit more robust (Mina)
- use num_queues/2 in selftest and let the users override it (Mina)
- remove memory_provider.memcpy_to_device (Mina)
- keep ksft as is (don't use -v validate flags): we are gonna
need a --debug-disable flag to make it less chatty; otherwise
it times out when sending too much data; so leaving it as
a separate follow up
v2:
- don't remove validation (Mina)
- keep 5-tuple flow steering but use it only when -c is provided (Mina)
- remove separate flag for probing (Mina)
- move ncdevmem under drivers/net/hw, not drivers/net (Jakub)
Cc: Mina Almasry <almasrymina(a)google.com>
Stanislav Fomichev (12):
selftests: ncdevmem: Redirect all non-payload output to stderr
selftests: ncdevmem: Separate out dmabuf provider
selftests: ncdevmem: Unify error handling
selftests: ncdevmem: Make client_ip optional
selftests: ncdevmem: Remove default arguments
selftests: ncdevmem: Switch to AF_INET6
selftests: ncdevmem: Properly reset flow steering
selftests: ncdevmem: Use YNL to enable TCP header split
selftests: ncdevmem: Remove hard-coded queue numbers
selftests: ncdevmem: Run selftest when none of the -s or -c has been
provided
selftests: ncdevmem: Move ncdevmem under drivers/net/hw
selftests: ncdevmem: Add automated test
.../selftests/drivers/net/hw/.gitignore | 1 +
.../testing/selftests/drivers/net/hw/Makefile | 9 +
.../selftests/drivers/net/hw/devmem.py | 45 +
.../selftests/drivers/net/hw/ncdevmem.c | 789 ++++++++++++++++++
tools/testing/selftests/net/.gitignore | 1 -
tools/testing/selftests/net/Makefile | 8 -
tools/testing/selftests/net/ncdevmem.c | 570 -------------
7 files changed, 844 insertions(+), 579 deletions(-)
create mode 100644 tools/testing/selftests/drivers/net/hw/.gitignore
create mode 100755 tools/testing/selftests/drivers/net/hw/devmem.py
create mode 100644 tools/testing/selftests/drivers/net/hw/ncdevmem.c
delete mode 100644 tools/testing/selftests/net/ncdevmem.c
--
2.47.0
This series adds VLAN support to HSR framework.
This series also adds VLAN support to HSR mode of ICSSG Ethernet driver.
Changes from v2 to v3:
*) Modified hsr_ndo_vlan_rx_add_vid() to handle arbitrary HSR_PT_SLAVE_A,
HSR_PT_SLAVE_B order and skip INTERLINK port in patch 2/4 as suggested by
Paolo Abeni <pabeni(a)redhat.com>
*) Removed handling of HSR_PT_MASTER in hsr_ndo_vlan_rx_kill_vid() as MASTER
and INTERLINK port will be ignored anyway in the default switch case as
suggested by Paolo Abeni <pabeni(a)redhat.com>
*) Modified the selftest in patch 4/4 to use vlan by default. The test will
check the exposed feature `vlan-challenged` and if vlan is not supported, skip
the vlan test as suggested by Paolo Abeni <pabeni(a)redhat.com>. Test logs can be
found at [1]
Changes from v1 to v2:
*) Added patch 4/4 to add test script related to VLAN in HSR as asked by
Lukasz Majewski <lukma(a)denx.de>
[1] https://gist.githubusercontent.com/danish-ti/d309f92c640134ccc4f2c0c442de5b…
v1 https://lore.kernel.org/all/20241004074715.791191-1-danishanwar@ti.com/
v2 https://lore.kernel.org/all/20241024103056.3201071-1-danishanwar@ti.com/
MD Danish Anwar (1):
selftests: hsr: Add test for VLAN
Murali Karicheri (1):
net: hsr: Add VLAN CTAG filter support
Ravi Gunasekaran (1):
net: ti: icssg-prueth: Add VLAN support for HSR mode
WingMan Kwok (1):
net: hsr: Add VLAN support
drivers/net/ethernet/ti/icssg/icssg_prueth.c | 45 ++++++++-
net/hsr/hsr_device.c | 85 +++++++++++++++--
net/hsr/hsr_forward.c | 19 +++-
tools/testing/selftests/net/hsr/config | 1 +
tools/testing/selftests/net/hsr/hsr_ping.sh | 98 ++++++++++++++++++++
5 files changed, 236 insertions(+), 12 deletions(-)
base-commit: a84e8c05f58305dfa808bc5465c5175c29d7c9b6
--
2.34.1
Soft lockups have been observed on a cluster of Linux-based edge routers
located in a highly dynamic environment. Using the `bird` service, these
routers continuously update BGP-advertised routes due to frequently
changing nexthop destinations, while also managing significant IPv6
traffic. The lockups occur during the traversal of the multipath
circular linked-list in the `fib6_select_path` function, particularly
while iterating through the siblings in the list. The issue typically
arises when the nodes of the linked list are unexpectedly deleted
concurrently on a different core—indicated by their 'next' and
'previous' elements pointing back to the node itself and their reference
count dropping to zero. This results in an infinite loop, leading to a
soft lockup that triggers a system panic via the watchdog timer.
Apply RCU primitives in the problematic code sections to resolve the
issue. Where necessary, update the references to fib6_siblings to
annotate or use the RCU APIs.
Include a test script that reproduces the issue. The script
periodically updates the routing table while generating a heavy load
of outgoing IPv6 traffic through multiple iperf3 clients. It
consistently induces infinite soft lockups within a couple of minutes.
Kernel log:
0 [ffffbd13003e8d30] machine_kexec at ffffffff8ceaf3eb
1 [ffffbd13003e8d90] __crash_kexec at ffffffff8d0120e3
2 [ffffbd13003e8e58] panic at ffffffff8cef65d4
3 [ffffbd13003e8ed8] watchdog_timer_fn at ffffffff8d05cb03
4 [ffffbd13003e8f08] __hrtimer_run_queues at ffffffff8cfec62f
5 [ffffbd13003e8f70] hrtimer_interrupt at ffffffff8cfed756
6 [ffffbd13003e8fd0] __sysvec_apic_timer_interrupt at ffffffff8cea01af
7 [ffffbd13003e8ff0] sysvec_apic_timer_interrupt at ffffffff8df1b83d
-- <IRQ stack> --
8 [ffffbd13003d3708] asm_sysvec_apic_timer_interrupt at ffffffff8e000ecb
[exception RIP: fib6_select_path+299]
RIP: ffffffff8ddafe7b RSP: ffffbd13003d37b8 RFLAGS: 00000287
RAX: ffff975850b43600 RBX: ffff975850b40200 RCX: 0000000000000000
RDX: 000000003fffffff RSI: 0000000051d383e4 RDI: ffff975850b43618
RBP: ffffbd13003d3800 R8: 0000000000000000 R9: ffff975850b40200
R10: 0000000000000000 R11: 0000000000000000 R12: ffffbd13003d3830
R13: ffff975850b436a8 R14: ffff975850b43600 R15: 0000000000000007
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
9 [ffffbd13003d3808] ip6_pol_route at ffffffff8ddb030c
10 [ffffbd13003d3888] ip6_pol_route_input at ffffffff8ddb068c
11 [ffffbd13003d3898] fib6_rule_lookup at ffffffff8ddf02b5
12 [ffffbd13003d3928] ip6_route_input at ffffffff8ddb0f47
13 [ffffbd13003d3a18] ip6_rcv_finish_core.constprop.0 at ffffffff8dd950d0
14 [ffffbd13003d3a30] ip6_list_rcv_finish.constprop.0 at ffffffff8dd96274
15 [ffffbd13003d3a98] ip6_sublist_rcv at ffffffff8dd96474
16 [ffffbd13003d3af8] ipv6_list_rcv at ffffffff8dd96615
17 [ffffbd13003d3b60] __netif_receive_skb_list_core at ffffffff8dc16fec
18 [ffffbd13003d3be0] netif_receive_skb_list_internal at ffffffff8dc176b3
19 [ffffbd13003d3c50] napi_gro_receive at ffffffff8dc565b9
20 [ffffbd13003d3c80] ice_receive_skb at ffffffffc087e4f5 [ice]
21 [ffffbd13003d3c90] ice_clean_rx_irq at ffffffffc0881b80 [ice]
22 [ffffbd13003d3d20] ice_napi_poll at ffffffffc088232f [ice]
23 [ffffbd13003d3d80] __napi_poll at ffffffff8dc18000
24 [ffffbd13003d3db8] net_rx_action at ffffffff8dc18581
25 [ffffbd13003d3e40] __do_softirq at ffffffff8df352e9
26 [ffffbd13003d3eb0] run_ksoftirqd at ffffffff8ceffe47
27 [ffffbd13003d3ec0] smpboot_thread_fn at ffffffff8cf36a30
28 [ffffbd13003d3ee8] kthread at ffffffff8cf2b39f
29 [ffffbd13003d3f28] ret_from_fork at ffffffff8ce5fa64
30 [ffffbd13003d3f50] ret_from_fork_asm at ffffffff8ce03cbb
Fixes: 66f5d6ce53e6 ("ipv6: replace rwlock with rcu and spinlock in fib6_table")
Reported-by: Adrian Oliver <kernel(a)aoliver.ca>
Signed-off-by: Omid Ehtemam-Haghighi <omid.ehtemamhaghighi(a)menlosecurity.com>
Cc: David S. Miller <davem(a)davemloft.net>
Cc: David Ahern <dsahern(a)gmail.com>
Cc: Eric Dumazet <edumazet(a)google.com>
Cc: Jakub Kicinski <kuba(a)kernel.org>
Cc: Paolo Abeni <pabeni(a)redhat.com>
Cc: Shuah Khan <shuah(a)kernel.org>
Cc: Ido Schimmel <idosch(a)idosch.org>
Cc: Kuniyuki Iwashima <kuniyu(a)amazon.com>
Cc: Simon Horman <horms(a)kernel.org>
Cc: Omid Ehtemam-Haghighi <oeh.kernel(a)gmail.com>
Cc: netdev(a)vger.kernel.org
Cc: linux-kselftest(a)vger.kernel.org
Cc: linux-kernel(a)vger.kernel.org
---
v6 -> v7:
* Rebased on top of 'net-next'
v5 -> v6:
* Adjust the comment line lengths in the test script to a maximum of
80 characters
* Change memory allocation in inet6_rt_notify from gfp_any() to GFP_ATOMIC for
atomic allocation in non-blocking contexts, as suggested by Ido Schimmel
* NOTE: I have executed the test script on both bare-metal servers and
virtualized environments such as QEMU and vng. In the case of bare-metal, it
consistently triggers a soft lockup in under a minute on unpatched kernels.
For the virtualized environments, an unpatched kernel compiled with the
Ubuntu 24.04 configuration also triggers a soft lockup, though it takes
longer; however, it did not trigger a soft lockup on kernels compiled with
configurations provided in:
https://github.com/linux-netdev/nipa/wiki/How-to-run-netdev-selftests-CI-st…
leading to potential false negatives in the test results.
I am curious if this test can be executed on a bare-metal machine within a
CI system, if such a setup exists, rather than in a virtualized environment.
If that’s not possible, how can I apply a different kernel configuration,
such as the one used in Ubuntu 24.04, for this test? Please advise.
v4 -> v5:
* Addressed review comments from Paolo Abeni.
* Added additional clarifying comments in the test script.
* Minor cleanup performed in the test script.
v3 -> v4:
* Added RCU primitives to rt6_fill_node(). I found that this function is typically
called either with a table lock held or within rcu_read_lock/rcu_read_unlock
pairs, except in the following call chain, where the protection is unclear:
rt_fill_node()
fib6_info_hw_flags_set()
mlxsw_sp_fib6_offload_failed_flag_set()
mlxsw_sp_router_fib6_event_work()
The last function is initialized as a work item in mlxsw_sp_router_fib_event()
and scheduled for deferred execution. I am unsure if the execution context of
this work item is protected by any table lock or rcu_read_lock/rcu_read_unlock
pair, so I have added the protection. Please let me know if this is redundant.
* Other review comments addressed
v2 -> v3:
* Removed redundant rcu_read_lock()/rcu_read_unlock() pairs
* Revised the test script based on Ido Schimmel's feedback
* Updated the test script to ensure compatibility with the latest iperf3 version
* Fixed new warnings generated with 'C=2' in the previous version
* Other review comments addressed
v1 -> v2:
* list_del_rcu() is applied exclusively to legacy multipath code
* All occurrences of fib6_siblings have been modified to utilize RCU
APIs for annotation and usage.
* Additionally, a test script for reproducing the reported
issue is included
---
net/ipv6/ip6_fib.c | 8 +-
net/ipv6/route.c | 45 ++-
tools/testing/selftests/net/Makefile | 1 +
.../net/ipv6_route_update_soft_lockup.sh | 262 ++++++++++++++++++
4 files changed, 297 insertions(+), 19 deletions(-)
create mode 100755 tools/testing/selftests/net/ipv6_route_update_soft_lockup.sh
diff --git a/net/ipv6/ip6_fib.c b/net/ipv6/ip6_fib.c
index 6383263bfd04..c134ba202c4c 100644
--- a/net/ipv6/ip6_fib.c
+++ b/net/ipv6/ip6_fib.c
@@ -1183,8 +1183,8 @@ static int fib6_add_rt2node(struct fib6_node *fn, struct fib6_info *rt,
while (sibling) {
if (sibling->fib6_metric == rt->fib6_metric &&
rt6_qualify_for_ecmp(sibling)) {
- list_add_tail(&rt->fib6_siblings,
- &sibling->fib6_siblings);
+ list_add_tail_rcu(&rt->fib6_siblings,
+ &sibling->fib6_siblings);
break;
}
sibling = rcu_dereference_protected(sibling->fib6_next,
@@ -1245,7 +1245,7 @@ static int fib6_add_rt2node(struct fib6_node *fn, struct fib6_info *rt,
fib6_siblings)
sibling->fib6_nsiblings--;
rt->fib6_nsiblings = 0;
- list_del_init(&rt->fib6_siblings);
+ list_del_rcu(&rt->fib6_siblings);
rt6_multipath_rebalance(next_sibling);
return err;
}
@@ -1963,7 +1963,7 @@ static void fib6_del_route(struct fib6_table *table, struct fib6_node *fn,
&rt->fib6_siblings, fib6_siblings)
sibling->fib6_nsiblings--;
rt->fib6_nsiblings = 0;
- list_del_init(&rt->fib6_siblings);
+ list_del_rcu(&rt->fib6_siblings);
rt6_multipath_rebalance(next_sibling);
}
diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index d7ce5cf2017a..e23e653de158 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -413,8 +413,8 @@ void fib6_select_path(const struct net *net, struct fib6_result *res,
struct flowi6 *fl6, int oif, bool have_oif_match,
const struct sk_buff *skb, int strict)
{
- struct fib6_info *sibling, *next_sibling;
struct fib6_info *match = res->f6i;
+ struct fib6_info *sibling;
if (!match->nh && (!match->fib6_nsiblings || have_oif_match))
goto out;
@@ -440,8 +440,8 @@ void fib6_select_path(const struct net *net, struct fib6_result *res,
if (fl6->mp_hash <= atomic_read(&match->fib6_nh->fib_nh_upper_bound))
goto out;
- list_for_each_entry_safe(sibling, next_sibling, &match->fib6_siblings,
- fib6_siblings) {
+ list_for_each_entry_rcu(sibling, &match->fib6_siblings,
+ fib6_siblings) {
const struct fib6_nh *nh = sibling->fib6_nh;
int nh_upper_bound;
@@ -5195,14 +5195,18 @@ static void ip6_route_mpath_notify(struct fib6_info *rt,
* nexthop. Since sibling routes are always added at the end of
* the list, find the first sibling of the last route appended
*/
+ rcu_read_lock();
+
if ((nlflags & NLM_F_APPEND) && rt_last && rt_last->fib6_nsiblings) {
- rt = list_first_entry(&rt_last->fib6_siblings,
- struct fib6_info,
- fib6_siblings);
+ rt = list_first_or_null_rcu(&rt_last->fib6_siblings,
+ struct fib6_info,
+ fib6_siblings);
}
if (rt)
inet6_rt_notify(RTM_NEWROUTE, rt, info, nlflags);
+
+ rcu_read_unlock();
}
static bool ip6_route_mpath_should_notify(const struct fib6_info *rt)
@@ -5547,17 +5551,21 @@ static size_t rt6_nlmsg_size(struct fib6_info *f6i)
nexthop_for_each_fib6_nh(f6i->nh, rt6_nh_nlmsg_size,
&nexthop_len);
} else {
- struct fib6_info *sibling, *next_sibling;
struct fib6_nh *nh = f6i->fib6_nh;
+ struct fib6_info *sibling;
nexthop_len = 0;
if (f6i->fib6_nsiblings) {
rt6_nh_nlmsg_size(nh, &nexthop_len);
- list_for_each_entry_safe(sibling, next_sibling,
- &f6i->fib6_siblings, fib6_siblings) {
+ rcu_read_lock();
+
+ list_for_each_entry_rcu(sibling, &f6i->fib6_siblings,
+ fib6_siblings) {
rt6_nh_nlmsg_size(sibling->fib6_nh, &nexthop_len);
}
+
+ rcu_read_unlock();
}
nexthop_len += lwtunnel_get_encap_size(nh->fib_nh_lws);
}
@@ -5721,7 +5729,7 @@ static int rt6_fill_node(struct net *net, struct sk_buff *skb,
lwtunnel_fill_encap(skb, dst->lwtstate, RTA_ENCAP, RTA_ENCAP_TYPE) < 0)
goto nla_put_failure;
} else if (rt->fib6_nsiblings) {
- struct fib6_info *sibling, *next_sibling;
+ struct fib6_info *sibling;
struct nlattr *mp;
mp = nla_nest_start_noflag(skb, RTA_MULTIPATH);
@@ -5733,14 +5741,21 @@ static int rt6_fill_node(struct net *net, struct sk_buff *skb,
0) < 0)
goto nla_put_failure;
- list_for_each_entry_safe(sibling, next_sibling,
- &rt->fib6_siblings, fib6_siblings) {
+ rcu_read_lock();
+
+ list_for_each_entry_rcu(sibling, &rt->fib6_siblings,
+ fib6_siblings) {
if (fib_add_nexthop(skb, &sibling->fib6_nh->nh_common,
sibling->fib6_nh->fib_nh_weight,
- AF_INET6, 0) < 0)
+ AF_INET6, 0) < 0) {
+ rcu_read_unlock();
+
goto nla_put_failure;
+ }
}
+ rcu_read_unlock();
+
nla_nest_end(skb, mp);
} else if (rt->nh) {
if (nla_put_u32(skb, RTA_NH_ID, rt->nh->id))
@@ -6177,7 +6192,7 @@ void inet6_rt_notify(int event, struct fib6_info *rt, struct nl_info *info,
err = -ENOBUFS;
seq = info->nlh ? info->nlh->nlmsg_seq : 0;
- skb = nlmsg_new(rt6_nlmsg_size(rt), gfp_any());
+ skb = nlmsg_new(rt6_nlmsg_size(rt), GFP_ATOMIC);
if (!skb)
goto errout;
@@ -6190,7 +6205,7 @@ void inet6_rt_notify(int event, struct fib6_info *rt, struct nl_info *info,
goto errout;
}
rtnl_notify(skb, net, info->portid, RTNLGRP_IPV6_ROUTE,
- info->nlh, gfp_any());
+ info->nlh, GFP_ATOMIC);
return;
errout:
rtnl_set_sk_err(net, RTNLGRP_IPV6_ROUTE, err);
diff --git a/tools/testing/selftests/net/Makefile b/tools/testing/selftests/net/Makefile
index 26a4883a65c9..8c4db5199a42 100644
--- a/tools/testing/selftests/net/Makefile
+++ b/tools/testing/selftests/net/Makefile
@@ -96,6 +96,7 @@ TEST_PROGS += fdb_flush.sh
TEST_PROGS += fq_band_pktlimit.sh
TEST_PROGS += vlan_hw_filter.sh
TEST_PROGS += bpf_offload.py
+TEST_PROGS += ipv6_route_update_soft_lockup.sh
# YNL files, must be before "include ..lib.mk"
YNL_GEN_FILES := ncdevmem
diff --git a/tools/testing/selftests/net/ipv6_route_update_soft_lockup.sh b/tools/testing/selftests/net/ipv6_route_update_soft_lockup.sh
new file mode 100755
index 000000000000..a6b2b1f9c641
--- /dev/null
+++ b/tools/testing/selftests/net/ipv6_route_update_soft_lockup.sh
@@ -0,0 +1,262 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+#
+# Testing for potential kernel soft lockup during IPv6 routing table
+# refresh under heavy outgoing IPv6 traffic. If a kernel soft lockup
+# occurs, a kernel panic will be triggered to prevent associated issues.
+#
+#
+# Test Environment Layout
+#
+# ┌----------------┐ ┌----------------┐
+# | SOURCE_NS | | SINK_NS |
+# | NAMESPACE | | NAMESPACE |
+# |(iperf3 clients)| |(iperf3 servers)|
+# | | | |
+# | | | |
+# | ┌-----------| nexthops |---------┐ |
+# | |veth_source|<--------------------------------------->|veth_sink|<┐ |
+# | └-----------|2001:0DB8:1::0:1/96 2001:0DB8:1::1:1/96 |---------┘ | |
+# | | ^ 2001:0DB8:1::1:2/96 | | |
+# | | . . | fwd | |
+# | ┌---------┐ | . . | | |
+# | | IPv6 | | . . | V |
+# | | routing | | . 2001:0DB8:1::1:80/96| ┌-----┐ |
+# | | table | | . | | lo | |
+# | | nexthop | | . └--------┴-----┴-┘
+# | | update | | ............................> 2001:0DB8:2::1:1/128
+# | └-------- ┘ |
+# └----------------┘
+#
+# The test script sets up two network namespaces, source_ns and sink_ns,
+# connected via a veth link. Within source_ns, it continuously updates the
+# IPv6 routing table by flushing and inserting IPV6_NEXTHOP_ADDR_COUNT nexthop
+# IPs destined for SINK_LOOPBACK_IP_ADDR in sink_ns. This refresh occurs at a
+# rate of 1/ROUTING_TABLE_REFRESH_PERIOD per second for TEST_DURATION seconds.
+#
+# Simultaneously, multiple iperf3 clients within source_ns generate heavy
+# outgoing IPv6 traffic. Each client is assigned a unique port number starting
+# at 5000 and incrementing sequentially. Each client targets a unique iperf3
+# server running in sink_ns, connected to the SINK_LOOPBACK_IFACE interface
+# using the same port number.
+#
+# The number of iperf3 servers and clients is set to half of the total
+# available cores on each machine.
+#
+# NOTE: We have tested this script on machines with various CPU specifications,
+# ranging from lower to higher performance as listed below. The test script
+# effectively triggered a kernel soft lockup on machines running an unpatched
+# kernel in under a minute:
+#
+# - 1x Intel Xeon E-2278G 8-Core Processor @ 3.40GHz
+# - 1x Intel Xeon E-2378G Processor 8-Core @ 2.80GHz
+# - 1x AMD EPYC 7401P 24-Core Processor @ 2.00GHz
+# - 1x AMD EPYC 7402P 24-Core Processor @ 2.80GHz
+# - 2x Intel Xeon Gold 5120 14-Core Processor @ 2.20GHz
+# - 1x Ampere Altra Q80-30 80-Core Processor @ 3.00GHz
+# - 2x Intel Xeon Gold 5120 14-Core Processor @ 2.20GHz
+# - 2x Intel Xeon Silver 4214 24-Core Processor @ 2.20GHz
+# - 1x AMD EPYC 7502P 32-Core @ 2.50GHz
+# - 1x Intel Xeon Gold 6314U 32-Core Processor @ 2.30GHz
+# - 2x Intel Xeon Gold 6338 32-Core Processor @ 2.00GHz
+#
+# On less performant machines, you may need to increase the TEST_DURATION
+# parameter to enhance the likelihood of encountering a race condition leading
+# to a kernel soft lockup and avoid a false negative result.
+#
+# NOTE: The test may not produce the expected result in virtualized
+# environments (e.g., qemu) due to differences in timing and CPU handling,
+# which can affect the conditions needed to trigger a soft lockup.
+
+source lib.sh
+source net_helper.sh
+
+TEST_DURATION=300
+ROUTING_TABLE_REFRESH_PERIOD=0.01
+
+IPERF3_BITRATE="300m"
+
+
+IPV6_NEXTHOP_ADDR_COUNT="128"
+IPV6_NEXTHOP_ADDR_MASK="96"
+IPV6_NEXTHOP_PREFIX="2001:0DB8:1"
+
+
+SOURCE_TEST_IFACE="veth_source"
+SOURCE_TEST_IP_ADDR="2001:0DB8:1::0:1/96"
+
+SINK_TEST_IFACE="veth_sink"
+# ${SINK_TEST_IFACE} is populated with the following range of IPv6 addresses:
+# 2001:0DB8:1::1:1 to 2001:0DB8:1::1:${IPV6_NEXTHOP_ADDR_COUNT}
+SINK_LOOPBACK_IFACE="lo"
+SINK_LOOPBACK_IP_MASK="128"
+SINK_LOOPBACK_IP_ADDR="2001:0DB8:2::1:1"
+
+nexthop_ip_list=""
+termination_signal=""
+kernel_softlokup_panic_prev_val=""
+
+terminate_ns_processes_by_pattern() {
+ local ns=$1
+ local pattern=$2
+
+ for pid in $(ip netns pids ${ns}); do
+ [ -e /proc/$pid/cmdline ] && grep -qe "${pattern}" /proc/$pid/cmdline && kill -9 $pid
+ done
+}
+
+cleanup() {
+ echo "info: cleaning up namespaces and terminating all processes within them..."
+
+
+ # Terminate iperf3 instances running in the source_ns. To avoid race
+ # conditions, first iterate over the PIDs and terminate those
+ # associated with the bash shells running the
+ # `while true; do iperf3 -c ...; done` loops. In a second iteration,
+ # terminate the individual `iperf3 -c ...` instances.
+ terminate_ns_processes_by_pattern ${source_ns} while
+ terminate_ns_processes_by_pattern ${source_ns} iperf3
+
+ # Repeat the same process for sink_ns
+ terminate_ns_processes_by_pattern ${sink_ns} while
+ terminate_ns_processes_by_pattern ${sink_ns} iperf3
+
+ # Check if any iperf3 instances are still running. This could happen
+ # if a core has entered an infinite loop and the timeout for detecting
+ # the soft lockup has not expired, but either the test interval has
+ # already elapsed or the test was terminated manually (e.g., with ^C)
+ for pid in $(ip netns pids ${source_ns}); do
+ if [ -e /proc/$pid/cmdline ] && grep -qe 'iperf3' /proc/$pid/cmdline; then
+ echo "FAIL: unable to terminate some iperf3 instances. Soft lockup is underway. A kernel panic is on the way!"
+ exit ${ksft_fail}
+ fi
+ done
+
+ if [ "$termination_signal" == "SIGINT" ]; then
+ echo "SKIP: Termination due to ^C (SIGINT)"
+ elif [ "$termination_signal" == "SIGALRM" ]; then
+ echo "PASS: No kernel soft lockup occurred during this ${TEST_DURATION} second test"
+ fi
+
+ cleanup_ns ${source_ns} ${sink_ns}
+
+ sysctl -qw kernel.softlockup_panic=${kernel_softlokup_panic_prev_val}
+}
+
+setup_prepare() {
+ setup_ns source_ns sink_ns
+
+ ip -n ${source_ns} link add name ${SOURCE_TEST_IFACE} type veth peer name ${SINK_TEST_IFACE} netns ${sink_ns}
+
+ # Setting up the Source namespace
+ ip -n ${source_ns} addr add ${SOURCE_TEST_IP_ADDR} dev ${SOURCE_TEST_IFACE}
+ ip -n ${source_ns} link set dev ${SOURCE_TEST_IFACE} qlen 10000
+ ip -n ${source_ns} link set dev ${SOURCE_TEST_IFACE} up
+ ip netns exec ${source_ns} sysctl -qw net.ipv6.fib_multipath_hash_policy=1
+
+ # Setting up the Sink namespace
+ ip -n ${sink_ns} addr add ${SINK_LOOPBACK_IP_ADDR}/${SINK_LOOPBACK_IP_MASK} dev ${SINK_LOOPBACK_IFACE}
+ ip -n ${sink_ns} link set dev ${SINK_LOOPBACK_IFACE} up
+ ip netns exec ${sink_ns} sysctl -qw net.ipv6.conf.${SINK_LOOPBACK_IFACE}.forwarding=1
+
+ ip -n ${sink_ns} link set ${SINK_TEST_IFACE} up
+ ip netns exec ${sink_ns} sysctl -qw net.ipv6.conf.${SINK_TEST_IFACE}.forwarding=1
+
+
+ # Populate nexthop IPv6 addresses on the test interface in the sink_ns
+ echo "info: populating ${IPV6_NEXTHOP_ADDR_COUNT} IPv6 addresses on the ${SINK_TEST_IFACE} interface ..."
+ for IP in $(seq 1 ${IPV6_NEXTHOP_ADDR_COUNT}); do
+ ip -n ${sink_ns} addr add ${IPV6_NEXTHOP_PREFIX}::$(printf "1:%x" "${IP}")/${IPV6_NEXTHOP_ADDR_MASK} dev ${SINK_TEST_IFACE};
+ done
+
+ # Preparing list of nexthops
+ for IP in $(seq 1 ${IPV6_NEXTHOP_ADDR_COUNT}); do
+ nexthop_ip_list=$nexthop_ip_list" nexthop via ${IPV6_NEXTHOP_PREFIX}::$(printf "1:%x" $IP) dev ${SOURCE_TEST_IFACE} weight 1"
+ done
+}
+
+
+test_soft_lockup_during_routing_table_refresh() {
+ # Start num_of_iperf_servers iperf3 servers in the sink_ns namespace,
+ # each listening on ports starting at 5001 and incrementing
+ # sequentially. Since iperf3 instances may terminate unexpectedly, a
+ # while loop is used to automatically restart them in such cases.
+ echo "info: starting ${num_of_iperf_servers} iperf3 servers in the sink_ns namespace ..."
+ for i in $(seq 1 ${num_of_iperf_servers}); do
+ cmd="iperf3 --bind ${SINK_LOOPBACK_IP_ADDR} -s -p $(printf '5%03d' ${i}) --rcv-timeout 200 &>/dev/null"
+ ip netns exec ${sink_ns} bash -c "while true; do ${cmd}; done &" &>/dev/null
+ done
+
+ # Wait for the iperf3 servers to be ready
+ for i in $(seq ${num_of_iperf_servers}); do
+ port=$(printf '5%03d' ${i});
+ wait_local_port_listen ${sink_ns} ${port} tcp
+ done
+
+ # Continuously refresh the routing table in the background within
+ # the source_ns namespace
+ ip netns exec ${source_ns} bash -c "
+ while \$(ip netns list | grep -q ${source_ns}); do
+ ip -6 route add ${SINK_LOOPBACK_IP_ADDR}/${SINK_LOOPBACK_IP_MASK} ${nexthop_ip_list};
+ sleep ${ROUTING_TABLE_REFRESH_PERIOD};
+ ip -6 route delete ${SINK_LOOPBACK_IP_ADDR}/${SINK_LOOPBACK_IP_MASK};
+ done &"
+
+ # Start num_of_iperf_servers iperf3 clients in the source_ns namespace,
+ # each sending TCP traffic on sequential ports starting at 5001.
+ # Since iperf3 instances may terminate unexpectedly (e.g., if the route
+ # to the server is deleted in the background during a route refresh), a
+ # while loop is used to automatically restart them in such cases.
+ echo "info: starting ${num_of_iperf_servers} iperf3 clients in the source_ns namespace ..."
+ for i in $(seq 1 ${num_of_iperf_servers}); do
+ cmd="iperf3 -c ${SINK_LOOPBACK_IP_ADDR} -p $(printf '5%03d' ${i}) --length 64 --bitrate ${IPERF3_BITRATE} -t 0 --connect-timeout 150 &>/dev/null"
+ ip netns exec ${source_ns} bash -c "while true; do ${cmd}; done &" &>/dev/null
+ done
+
+ echo "info: IPv6 routing table is being updated at the rate of $(echo "1/${ROUTING_TABLE_REFRESH_PERIOD}" | bc)/s for ${TEST_DURATION} seconds ..."
+ echo "info: A kernel soft lockup, if detected, results in a kernel panic!"
+
+ wait
+}
+
+# Make sure 'iperf3' is installed, skip the test otherwise
+if [ ! -x "$(command -v "iperf3")" ]; then
+ echo "SKIP: 'iperf3' is not installed. Skipping the test."
+ exit ${ksft_skip}
+fi
+
+# Determine the number of cores on the machine
+num_of_iperf_servers=$(( $(nproc)/2 ))
+
+# Check if we are running on a multi-core machine, skip the test otherwise
+if [ "${num_of_iperf_servers}" -eq 0 ]; then
+ echo "SKIP: This test is not valid on a single core machine!"
+ exit ${ksft_skip}
+fi
+
+# Since the kernel soft lockup we're testing causes at least one core to enter
+# an infinite loop, destabilizing the host and likely affecting subsequent
+# tests, we trigger a kernel panic instead of reporting a failure and
+# continuing
+kernel_softlokup_panic_prev_val=$(sysctl -n kernel.softlockup_panic)
+sysctl -qw kernel.softlockup_panic=1
+
+handle_sigint() {
+ termination_signal="SIGINT"
+ cleanup
+ exit ${ksft_skip}
+}
+
+handle_sigalrm() {
+ termination_signal="SIGALRM"
+ cleanup
+ exit ${ksft_pass}
+}
+
+trap handle_sigint SIGINT
+trap handle_sigalrm SIGALRM
+
+(sleep ${TEST_DURATION} && kill -s SIGALRM $$)&
+
+setup_prepare
+test_soft_lockup_during_routing_table_refresh
--
2.43.0
When macsec is offloaded to a NIC, we can take advantage of some of
its features, mainly TSO and checksumming. This increases performance
significantly. Some features cannot be inherited, because they require
additional ops that aren't provided by the macsec netdevice.
We also need to inherit TSO limits from the lower device, like
VLAN/macvlan devices do.
This series also moves the existing macsec offload selftest to the
netdevsim selftests before adding tests for the new features. To allow
this new selftest to work, netdevsim's hw_features are expanded.
Sabrina Dubroca (8):
netdevsim: add more hw_features
selftests: netdevsim: add a test checking ethtool features
macsec: add some of the lower device's features when offloading
macsec: clean up local variables in macsec_notify
macsec: inherit lower device's TSO limits when offloading
selftests: move macsec offload tests from net/rtnetlink to
drivers/net/netdvesim
selftests: netdevsim: add test toggling macsec offload
selftests: netdevsim: add ethtool features to macsec offload tests
drivers/net/macsec.c | 64 +++++++---
drivers/net/netdevsim/netdev.c | 6 +-
.../selftests/drivers/net/netdevsim/Makefile | 2 +
.../selftests/drivers/net/netdevsim/config | 1 +
.../drivers/net/netdevsim/ethtool-features.sh | 31 +++++
.../drivers/net/netdevsim/macsec-offload.sh | 117 ++++++++++++++++++
tools/testing/selftests/net/rtnetlink.sh | 68 ----------
7 files changed, 200 insertions(+), 89 deletions(-)
create mode 100644 tools/testing/selftests/drivers/net/netdevsim/ethtool-features.sh
create mode 100755 tools/testing/selftests/drivers/net/netdevsim/macsec-offload.sh
--
2.47.0
For logging to be useful, something has to set RET and retmsg by calling
ret_set_ksft_status(). There is a suite of functions to that end in
forwarding/lib: check_err, check_fail et.al. Move them to net/lib.sh so
that every net test can use them.
Existing lib.sh users might be using these same names for their functions.
However lib.sh is always sourced near the top of the file (checked), and
whatever new definitions will simply override the ones provided by lib.sh.
Signed-off-by: Petr Machata <petrm(a)nvidia.com>
Reviewed-by: Amit Cohen <amcohen(a)nvidia.com>
Acked-by: Shuah Khan <skhan(a)linuxfoundation.org>
---
Notes:
CC: Shuah Khan <shuah(a)kernel.org>
CC: Benjamin Poirier <bpoirier(a)nvidia.com>
CC: Hangbin Liu <liuhangbin(a)gmail.com>
CC: linux-kselftest(a)vger.kernel.org
CC: Jiri Pirko <jiri(a)resnulli.us>
tools/testing/selftests/net/forwarding/lib.sh | 73 -------------------
tools/testing/selftests/net/lib.sh | 73 +++++++++++++++++++
2 files changed, 73 insertions(+), 73 deletions(-)
diff --git a/tools/testing/selftests/net/forwarding/lib.sh b/tools/testing/selftests/net/forwarding/lib.sh
index d28dbf27c1f0..8625e3c99f55 100644
--- a/tools/testing/selftests/net/forwarding/lib.sh
+++ b/tools/testing/selftests/net/forwarding/lib.sh
@@ -445,79 +445,6 @@ done
##############################################################################
# Helpers
-# Whether FAILs should be interpreted as XFAILs. Internal.
-FAIL_TO_XFAIL=
-
-check_err()
-{
- local err=$1
- local msg=$2
-
- if ((err)); then
- if [[ $FAIL_TO_XFAIL = yes ]]; then
- ret_set_ksft_status $ksft_xfail "$msg"
- else
- ret_set_ksft_status $ksft_fail "$msg"
- fi
- fi
-}
-
-check_fail()
-{
- local err=$1
- local msg=$2
-
- check_err $((!err)) "$msg"
-}
-
-check_err_fail()
-{
- local should_fail=$1; shift
- local err=$1; shift
- local what=$1; shift
-
- if ((should_fail)); then
- check_fail $err "$what succeeded, but should have failed"
- else
- check_err $err "$what failed"
- fi
-}
-
-xfail()
-{
- FAIL_TO_XFAIL=yes "$@"
-}
-
-xfail_on_slow()
-{
- if [[ $KSFT_MACHINE_SLOW = yes ]]; then
- FAIL_TO_XFAIL=yes "$@"
- else
- "$@"
- fi
-}
-
-omit_on_slow()
-{
- if [[ $KSFT_MACHINE_SLOW != yes ]]; then
- "$@"
- fi
-}
-
-xfail_on_veth()
-{
- local dev=$1; shift
- local kind
-
- kind=$(ip -j -d link show dev $dev |
- jq -r '.[].linkinfo.info_kind')
- if [[ $kind = veth ]]; then
- FAIL_TO_XFAIL=yes "$@"
- else
- "$@"
- fi
-}
-
not()
{
"$@"
diff --git a/tools/testing/selftests/net/lib.sh b/tools/testing/selftests/net/lib.sh
index 4f52b8e48a3a..6bcf5d13879d 100644
--- a/tools/testing/selftests/net/lib.sh
+++ b/tools/testing/selftests/net/lib.sh
@@ -361,3 +361,76 @@ tests_run()
$current_test
done
}
+
+# Whether FAILs should be interpreted as XFAILs. Internal.
+FAIL_TO_XFAIL=
+
+check_err()
+{
+ local err=$1
+ local msg=$2
+
+ if ((err)); then
+ if [[ $FAIL_TO_XFAIL = yes ]]; then
+ ret_set_ksft_status $ksft_xfail "$msg"
+ else
+ ret_set_ksft_status $ksft_fail "$msg"
+ fi
+ fi
+}
+
+check_fail()
+{
+ local err=$1
+ local msg=$2
+
+ check_err $((!err)) "$msg"
+}
+
+check_err_fail()
+{
+ local should_fail=$1; shift
+ local err=$1; shift
+ local what=$1; shift
+
+ if ((should_fail)); then
+ check_fail $err "$what succeeded, but should have failed"
+ else
+ check_err $err "$what failed"
+ fi
+}
+
+xfail()
+{
+ FAIL_TO_XFAIL=yes "$@"
+}
+
+xfail_on_slow()
+{
+ if [[ $KSFT_MACHINE_SLOW = yes ]]; then
+ FAIL_TO_XFAIL=yes "$@"
+ else
+ "$@"
+ fi
+}
+
+omit_on_slow()
+{
+ if [[ $KSFT_MACHINE_SLOW != yes ]]; then
+ "$@"
+ fi
+}
+
+xfail_on_veth()
+{
+ local dev=$1; shift
+ local kind
+
+ kind=$(ip -j -d link show dev $dev |
+ jq -r '.[].linkinfo.info_kind')
+ if [[ $kind = veth ]]; then
+ FAIL_TO_XFAIL=yes "$@"
+ else
+ "$@"
+ fi
+}
--
2.45.0
It would be good to use the same mechanism for scheduling and dispatching
general net tests as the many forwarding tests already use. To that end,
move the logging helpers to net/lib.sh so that every net test can use them.
Existing lib.sh users might be using the name themselves. However lib.sh is
always sourced near the top of the file (checked), and whatever new
definition will simply override the one provided by lib.sh.
Signed-off-by: Petr Machata <petrm(a)nvidia.com>
Reviewed-by: Amit Cohen <amcohen(a)nvidia.com>
Acked-by: Shuah Khan <skhan(a)linuxfoundation.org>
---
Notes:
CC: Shuah Khan <shuah(a)kernel.org>
CC: Benjamin Poirier <bpoirier(a)nvidia.com>
CC: Hangbin Liu <liuhangbin(a)gmail.com>
CC: linux-kselftest(a)vger.kernel.org
CC: Jiri Pirko <jiri(a)resnulli.us>
tools/testing/selftests/net/forwarding/lib.sh | 10 ----------
tools/testing/selftests/net/lib.sh | 10 ++++++++++
2 files changed, 10 insertions(+), 10 deletions(-)
diff --git a/tools/testing/selftests/net/forwarding/lib.sh b/tools/testing/selftests/net/forwarding/lib.sh
index 41dd14c42c48..d28dbf27c1f0 100644
--- a/tools/testing/selftests/net/forwarding/lib.sh
+++ b/tools/testing/selftests/net/forwarding/lib.sh
@@ -1285,16 +1285,6 @@ matchall_sink_create()
action drop
}
-tests_run()
-{
- local current_test
-
- for current_test in ${TESTS:-$ALL_TESTS}; do
- in_defer_scope \
- $current_test
- done
-}
-
cleanup()
{
pre_cleanup
diff --git a/tools/testing/selftests/net/lib.sh b/tools/testing/selftests/net/lib.sh
index 691318b1ec55..4f52b8e48a3a 100644
--- a/tools/testing/selftests/net/lib.sh
+++ b/tools/testing/selftests/net/lib.sh
@@ -351,3 +351,13 @@ log_info()
echo "INFO: $msg"
}
+
+tests_run()
+{
+ local current_test
+
+ for current_test in ${TESTS:-$ALL_TESTS}; do
+ in_defer_scope \
+ $current_test
+ done
+}
--
2.45.0
Many net selftests invent their own logging helpers. These really should be
in a library sourced by these tests. Currently forwarding/lib.sh has a
suite of perfectly fine logging helpers, but sourcing a forwarding/ library
from a higher-level directory smells of layering violation. In this patch,
move the logging helpers to net/lib.sh so that every net test can use them.
Together with the logging helpers, it's also necessary to move
pause_on_fail(), and EXIT_STATUS and RET.
Existing lib.sh users might be using these same names for their functions
or variables. However lib.sh is always sourced near the top of the
file (checked), and whatever new definitions will simply override the ones
provided by lib.sh.
Signed-off-by: Petr Machata <petrm(a)nvidia.com>
Reviewed-by: Amit Cohen <amcohen(a)nvidia.com>
Acked-by: Shuah Khan <skhan(a)linuxfoundation.org>
---
Notes:
CC: Shuah Khan <shuah(a)kernel.org>
CC: Benjamin Poirier <bpoirier(a)nvidia.com>
CC: Hangbin Liu <liuhangbin(a)gmail.com>
CC: linux-kselftest(a)vger.kernel.org
CC: Jiri Pirko <jiri(a)resnulli.us>
tools/testing/selftests/net/forwarding/lib.sh | 113 -----------------
tools/testing/selftests/net/lib.sh | 115 ++++++++++++++++++
2 files changed, 115 insertions(+), 113 deletions(-)
diff --git a/tools/testing/selftests/net/forwarding/lib.sh b/tools/testing/selftests/net/forwarding/lib.sh
index 89c25f72b10c..41dd14c42c48 100644
--- a/tools/testing/selftests/net/forwarding/lib.sh
+++ b/tools/testing/selftests/net/forwarding/lib.sh
@@ -48,7 +48,6 @@ declare -A NETIFS=(
: "${WAIT_TIME:=5}"
# Whether to pause on, respectively, after a failure and before cleanup.
-: "${PAUSE_ON_FAIL:=no}"
: "${PAUSE_ON_CLEANUP:=no}"
# Whether to create virtual interfaces, and what netdevice type they should be.
@@ -446,22 +445,6 @@ done
##############################################################################
# Helpers
-# Exit status to return at the end. Set in case one of the tests fails.
-EXIT_STATUS=0
-# Per-test return value. Clear at the beginning of each test.
-RET=0
-
-ret_set_ksft_status()
-{
- local ksft_status=$1; shift
- local msg=$1; shift
-
- RET=$(ksft_status_merge $RET $ksft_status)
- if (( $? )); then
- retmsg=$msg
- fi
-}
-
# Whether FAILs should be interpreted as XFAILs. Internal.
FAIL_TO_XFAIL=
@@ -535,102 +518,6 @@ xfail_on_veth()
fi
}
-log_test_result()
-{
- local test_name=$1; shift
- local opt_str=$1; shift
- local result=$1; shift
- local retmsg=$1; shift
-
- printf "TEST: %-60s [%s]\n" "$test_name $opt_str" "$result"
- if [[ $retmsg ]]; then
- printf "\t%s\n" "$retmsg"
- fi
-}
-
-pause_on_fail()
-{
- if [[ $PAUSE_ON_FAIL == yes ]]; then
- echo "Hit enter to continue, 'q' to quit"
- read a
- [[ $a == q ]] && exit 1
- fi
-}
-
-handle_test_result_pass()
-{
- local test_name=$1; shift
- local opt_str=$1; shift
-
- log_test_result "$test_name" "$opt_str" " OK "
-}
-
-handle_test_result_fail()
-{
- local test_name=$1; shift
- local opt_str=$1; shift
-
- log_test_result "$test_name" "$opt_str" FAIL "$retmsg"
- pause_on_fail
-}
-
-handle_test_result_xfail()
-{
- local test_name=$1; shift
- local opt_str=$1; shift
-
- log_test_result "$test_name" "$opt_str" XFAIL "$retmsg"
- pause_on_fail
-}
-
-handle_test_result_skip()
-{
- local test_name=$1; shift
- local opt_str=$1; shift
-
- log_test_result "$test_name" "$opt_str" SKIP "$retmsg"
-}
-
-log_test()
-{
- local test_name=$1
- local opt_str=$2
-
- if [[ $# -eq 2 ]]; then
- opt_str="($opt_str)"
- fi
-
- if ((RET == ksft_pass)); then
- handle_test_result_pass "$test_name" "$opt_str"
- elif ((RET == ksft_xfail)); then
- handle_test_result_xfail "$test_name" "$opt_str"
- elif ((RET == ksft_skip)); then
- handle_test_result_skip "$test_name" "$opt_str"
- else
- handle_test_result_fail "$test_name" "$opt_str"
- fi
-
- EXIT_STATUS=$(ksft_exit_status_merge $EXIT_STATUS $RET)
- return $RET
-}
-
-log_test_skip()
-{
- RET=$ksft_skip retmsg= log_test "$@"
-}
-
-log_test_xfail()
-{
- RET=$ksft_xfail retmsg= log_test "$@"
-}
-
-log_info()
-{
- local msg=$1
-
- echo "INFO: $msg"
-}
-
not()
{
"$@"
diff --git a/tools/testing/selftests/net/lib.sh b/tools/testing/selftests/net/lib.sh
index c8991cc6bf28..691318b1ec55 100644
--- a/tools/testing/selftests/net/lib.sh
+++ b/tools/testing/selftests/net/lib.sh
@@ -9,6 +9,9 @@ source "$net_dir/lib/sh/defer.sh"
: "${WAIT_TIMEOUT:=20}"
+# Whether to pause on after a failure.
+: "${PAUSE_ON_FAIL:=no}"
+
BUSYWAIT_TIMEOUT=$((WAIT_TIMEOUT * 1000)) # ms
# Kselftest framework constants.
@@ -20,6 +23,11 @@ ksft_skip=4
# namespace list created by setup_ns
NS_LIST=()
+# Exit status to return at the end. Set in case one of the tests fails.
+EXIT_STATUS=0
+# Per-test return value. Clear at the beginning of each test.
+RET=0
+
##############################################################################
# Helpers
@@ -236,3 +244,110 @@ tc_rule_handle_stats_get()
| jq ".[] | select(.options.handle == $handle) | \
.options.actions[0].stats$selector"
}
+
+ret_set_ksft_status()
+{
+ local ksft_status=$1; shift
+ local msg=$1; shift
+
+ RET=$(ksft_status_merge $RET $ksft_status)
+ if (( $? )); then
+ retmsg=$msg
+ fi
+}
+
+log_test_result()
+{
+ local test_name=$1; shift
+ local opt_str=$1; shift
+ local result=$1; shift
+ local retmsg=$1; shift
+
+ printf "TEST: %-60s [%s]\n" "$test_name $opt_str" "$result"
+ if [[ $retmsg ]]; then
+ printf "\t%s\n" "$retmsg"
+ fi
+}
+
+pause_on_fail()
+{
+ if [[ $PAUSE_ON_FAIL == yes ]]; then
+ echo "Hit enter to continue, 'q' to quit"
+ read a
+ [[ $a == q ]] && exit 1
+ fi
+}
+
+handle_test_result_pass()
+{
+ local test_name=$1; shift
+ local opt_str=$1; shift
+
+ log_test_result "$test_name" "$opt_str" " OK "
+}
+
+handle_test_result_fail()
+{
+ local test_name=$1; shift
+ local opt_str=$1; shift
+
+ log_test_result "$test_name" "$opt_str" FAIL "$retmsg"
+ pause_on_fail
+}
+
+handle_test_result_xfail()
+{
+ local test_name=$1; shift
+ local opt_str=$1; shift
+
+ log_test_result "$test_name" "$opt_str" XFAIL "$retmsg"
+ pause_on_fail
+}
+
+handle_test_result_skip()
+{
+ local test_name=$1; shift
+ local opt_str=$1; shift
+
+ log_test_result "$test_name" "$opt_str" SKIP "$retmsg"
+}
+
+log_test()
+{
+ local test_name=$1
+ local opt_str=$2
+
+ if [[ $# -eq 2 ]]; then
+ opt_str="($opt_str)"
+ fi
+
+ if ((RET == ksft_pass)); then
+ handle_test_result_pass "$test_name" "$opt_str"
+ elif ((RET == ksft_xfail)); then
+ handle_test_result_xfail "$test_name" "$opt_str"
+ elif ((RET == ksft_skip)); then
+ handle_test_result_skip "$test_name" "$opt_str"
+ else
+ handle_test_result_fail "$test_name" "$opt_str"
+ fi
+
+ EXIT_STATUS=$(ksft_exit_status_merge $EXIT_STATUS $RET)
+ return $RET
+}
+
+log_test_skip()
+{
+ RET=$ksft_skip retmsg= log_test "$@"
+}
+
+log_test_xfail()
+{
+ RET=$ksft_xfail retmsg= log_test "$@"
+}
+
+log_info()
+{
+ local msg=$1
+
+ echo "INFO: $msg"
+}
--
2.45.0
This patch series adds a some not yet picked selftests to the kvm s390x
selftest suite.
The additional test cases are covering:
* Assert KVM_EXIT_S390_UCONTROL exit on not mapped memory access
* Assert functionality of storage keys in ucontrol VM
* Assert that memory region operations are rejected for ucontrol VMs
Running the test cases requires sys_admin capabilities to start the
ucontrol VM.
This can be achieved by running as root or with a command like:
sudo setpriv --reuid nobody --inh-caps -all,+sys_admin \
--ambient-caps -all,+sys_admin --bounding-set -all,+sys_admin \
./ucontrol_test
---
The patches in this series have been part of the previous patch series.
The test cases added here do depend on the fixture added in the earlier
patches.
From v5 PATCH 7-9 the segment and page table generation has been removed
and DAT
has been disabled. Since DAT is not necessary to validate the KVM code.
https://lore.kernel.org/kvm/20240807154512.316936-1-schlameuss@linux.ibm.co…
v7:
- skip uc_skey test when execution is in vsie
v6:
- add instruction intercept handling for skey specific instructions
(iske, sske, rrbe) in addition to kss intercept to work properly on
all machines
- reorder local variables
- fixup some method comments
- add a patch correcting the IP.b value length a debug message
v5:
- rebased to current upstream master
- corrected assertion on 0x00 to 0
- reworded fixup commit so that it can be merged on top of current
upstream
v4:
- fix whitespaces in pointer function arguments (thanks Claudio)
- fix whitespaces in comments (thanks Janosch)
v3:
- fix skey assertion (thanks Claudio)
- introduce a wrapper around UCAS map and unmap ioctls to improve
readability (Claudio)
- add an displacement to accessed memory to assert translation
intercepts actually point to segments to the uc_map_unmap test
- add an misaligned failing mapping try to the uc_map_unmap test
v2:
- Reenable KSS intercept and handle it within skey test.
- Modify the checked register between storing (sske) and reading (iske)
it within the test program to make sure the.
- Add an additional state assertion in the end of uc_skey
- Fix some typos and white spaces.
v1:
- Remove segment and page table generation and disable DAT. This is not
necessary to validate the KVM code.
Christoph Schlameuss (5):
selftests: kvm: s390: Add uc_map_unmap VM test case
selftests: kvm: s390: Add uc_skey VM test case
selftests: kvm: s390: Verify reject memory region operations for
ucontrol VMs
selftests: kvm: s390: Fix whitespace confusion in ucontrol test
selftests: kvm: s390: correct IP.b length in uc_handle_sieic debug
output
.../selftests/kvm/include/s390x/processor.h | 6 +
.../selftests/kvm/s390x/ucontrol_test.c | 321 +++++++++++++++++-
2 files changed, 319 insertions(+), 8 deletions(-)
--
2.47.0
The test should be skipped if initial conditions aren't fulfilled in
the start instead of failing and outputting non-compliant TAP logs. This
kind of failure pollutes the results. The initial conditions are:
- The test should only execute if /tmp file can be allocated.
- The test should only execute if huge pages are free.
Signed-off-by: Muhammad Usama Anjum <usama.anjum(a)collabora.com>
---
Before:
TAP version 13
1..4
Bail out! Error opening file
: Read-only file system (30)
# Planned tests != run tests (4 != 0)
# Totals: pass:0 fail:0 xfail:0 xpass:0 skip:0 error:0
After:
TAP version 13
1..0 # SKIP Unable to allocate file: Read-only file system
---
tools/testing/selftests/mm/hugetlb_dio.c | 19 ++++++++++++-------
1 file changed, 12 insertions(+), 7 deletions(-)
diff --git a/tools/testing/selftests/mm/hugetlb_dio.c b/tools/testing/selftests/mm/hugetlb_dio.c
index f9ac20c657ec6..60001c142ce99 100644
--- a/tools/testing/selftests/mm/hugetlb_dio.c
+++ b/tools/testing/selftests/mm/hugetlb_dio.c
@@ -44,13 +44,6 @@ void run_dio_using_hugetlb(unsigned int start_off, unsigned int end_off)
if (fd < 0)
ksft_exit_fail_perror("Error opening file\n");
- /* Get the free huge pages before allocation */
- free_hpage_b = get_free_hugepages();
- if (free_hpage_b == 0) {
- close(fd);
- ksft_exit_skip("No free hugepage, exiting!\n");
- }
-
/* Allocate a hugetlb page */
orig_buffer = mmap(NULL, h_pagesize, mmap_prot, mmap_flags, -1, 0);
if (orig_buffer == MAP_FAILED) {
@@ -94,8 +87,20 @@ void run_dio_using_hugetlb(unsigned int start_off, unsigned int end_off)
int main(void)
{
size_t pagesize = 0;
+ int fd;
ksft_print_header();
+
+ /* Open the file to DIO */
+ fd = open("/tmp", O_TMPFILE | O_RDWR | O_DIRECT, 0664);
+ if (fd < 0)
+ ksft_exit_skip("Unable to allocate file: %s\n", strerror(errno));
+ close(fd);
+
+ /* Check if huge pages are free */
+ if (!get_free_hugepages())
+ ksft_exit_skip("No free hugepage, exiting\n");
+
ksft_set_plan(4);
/* Get base page size */
--
2.39.5
1. fix recursive lock when ebpf prog return SK_PASS.
2. add selftest to reproduce recursive lock.
Note that if just the selftest merged without first
patch, the test case will definitely fail, because the
issue of deadlock is inevitable.
---
v2->v3: fix line length reported by patchwork.
(max_line_length is set to 80 in patchwork but default is 100 in kernel tree)
v1->v2: 1.inspired by martin.lau to add selftest to reproduce the issue.
2. follow the community rules for patch.
v1: https://lore.kernel.org/bpf/55fc6114-7e64-4b65-86d2-92cfd1e9e92f@linux.dev/…
---
Jiayuan Chen (2):
bpf: fix recursive lock when verdict program return SK_PASS
selftests/bpf: Add some tests with sockmap SK_PASS
net/core/skmsg.c | 4 +-
.../selftests/bpf/prog_tests/sockmap_basic.c | 54 +++++++++++++++++++
.../bpf/progs/test_sockmap_pass_prog.c | 2 +-
3 files changed, 57 insertions(+), 3 deletions(-)
--
2.43.5
This series was originally written by José Expósito, and has been
modified and updated by Matt Gilbride and myself. The original version
can be found here:
https://github.com/Rust-for-Linux/linux/pull/950
Add support for writing KUnit tests in Rust. While Rust doctests are
already converted to KUnit tests and run, they're really better suited
for examples, rather than as first-class unit tests.
This series implements a series of direct Rust bindings for KUnit tests,
as well as a new macro which allows KUnit tests to be written using a
close variant of normal Rust unit test syntax. The only change required
is replacing '#[cfg(test)]' with '#[kunit_tests(kunit_test_suite_name)]'
An example test would look like:
#[kunit_tests(rust_kernel_hid_driver)]
mod tests {
use super::*;
use crate::{c_str, driver, hid, prelude::*};
use core::ptr;
struct SimpleTestDriver;
impl Driver for SimpleTestDriver {
type Data = ();
}
#[test]
fn rust_test_hid_driver_adapter() {
let mut hid = bindings::hid_driver::default();
let name = c_str!("SimpleTestDriver");
static MODULE: ThisModule = unsafe { ThisModule::from_ptr(ptr::null_mut()) };
let res = unsafe {
<hid::Adapter<SimpleTestDriver> as driver::DriverOps>::register(&mut hid, name, &MODULE)
};
assert_eq!(res, Err(ENODEV)); // The mock returns -19
}
}
Please give this a go, and make sure I haven't broken it! There's almost
certainly a lot of improvements which can be made -- and there's a fair
case to be made for replacing some of this with generated C code which
can use the C macros -- but this is hopefully an adequate implementation
for now, and the interface can (with luck) remain the same even if the
implementation changes.
A few small notable missing features:
- Attributes (like the speed of a test) are hardcoded to the default
value.
- Similarly, the module name attribute is hardcoded to NULL. In C, we
use the KBUILD_MODNAME macro, but I couldn't find a way to use this
from Rust which wasn't more ugly than just disabling it.
- Assertions are not automatically rewritten to use KUnit assertions.
---
Changes since v3:
https://lore.kernel.org/linux-kselftest/20241030045719.3085147-2-davidgow@g…
- The kunit_unsafe_test_suite!() macro now panic!s if the suite name is
too long, triggering a compile error. (Thanks, Alice!)
- The #[kunit_tests()] macro now preserves span information, so
errors can be better reported. (Thanks, Boqun!)
- The example tests have been updated to no longer use assert_eq!() with
a constant bool argument (which triggered a clippy warning now we
have the span info).
Changes since v2:
https://lore.kernel.org/linux-kselftest/20241029092422.2884505-1-davidgow@g…
- Include missing rust/macros/kunit.rs file from v2. (Thanks Boqun!)
- The kunit_unsafe_test_suite!() macro will truncate the name of the
suite if it is too long. (Thanks Alice!)
- The proc macro now emits an error if the suite name is too long.
- We no longer needlessly use UnsafeCell<> in
kunit_unsafe_test_suite!(). (Thanks Alice!)
Changes since v1:
https://lore.kernel.org/lkml/20230720-rustbind-v1-0-c80db349e3b5@google.com…
- Rebase on top of the latest rust-next (commit 718c4069896c)
- Make kunit_case a const fn, rather than a macro (Thanks Boqun)
- As a result, the null terminator is now created with
kernel::kunit::kunit_case_null()
- Use the C kunit_get_current_test() function to implement
in_kunit_test(), rather than re-implementing it (less efficiently)
ourselves.
Changes since the GitHub PR:
- Rebased on top of kselftest/kunit
- Add const_mut_refs feature
This may conflict with https://lore.kernel.org/lkml/20230503090708.2524310-6-nmi@metaspace.dk/
- Add rust/macros/kunit.rs to the KUnit MAINTAINERS entry
---
José Expósito (3):
rust: kunit: add KUnit case and suite macros
rust: macros: add macro to easily run KUnit tests
rust: kunit: allow to know if we are in a test
MAINTAINERS | 1 +
rust/kernel/kunit.rs | 194 +++++++++++++++++++++++++++++++++++++++++++
rust/kernel/lib.rs | 1 +
rust/macros/kunit.rs | 168 +++++++++++++++++++++++++++++++++++++
rust/macros/lib.rs | 29 +++++++
5 files changed, 393 insertions(+)
create mode 100644 rust/macros/kunit.rs
--
2.47.0.199.ga7371fff76-goog
Greetings:
Welcome to v8, see changelog below.
This revision is a quick follow-up to v7 and only drops the unneeded
exports in patch 2. No other changes.
This series introduces a new mechanism, IRQ suspension, which allows
network applications using epoll to mask IRQs during periods of high
traffic while also reducing tail latency (compared to existing
mechanisms, see below) during periods of low traffic. In doing so, this
balances CPU consumption with network processing efficiency.
Martin Karsten (CC'd) and I have been collaborating on this series for
several months and have appreciated the feedback from the community on
our RFC [1]. We've updated the cover letter and kernel documentation in
an attempt to more clearly explain how this mechanism works, how
applications can use it, and how it compares to existing mechanisms in
the kernel.
I briefly mentioned this idea at netdev conf 2024 (for those who were
there) and Martin described this idea in an earlier paper presented at
Sigmetrics 2024 [2].
~ The short explanation (TL;DR)
We propose adding a new napi config parameter: irq_suspend_timeout to
help balance CPU usage and network processing efficiency when using IRQ
deferral and napi busy poll.
If this parameter is set to a non-zero value *and* a user application
has enabled preferred busy poll on a busy poll context (via the
EPIOCSPARAMS ioctl introduced in commit 18e2bf0edf4d ("eventpoll: Add
epoll ioctl for epoll_params")), then application calls to epoll_wait
for that context will cause device IRQs and softirq processing to be
suspended as long as epoll_wait successfully retrieves data from the
NAPI. Each time data is retrieved, the irq_suspend_timeout is deferred.
If/when network traffic subsides and epoll_wait returns no data, IRQ
suspension is immediately reverted back to the existing
napi_defer_hard_irqs and gro_flush_timeout mechanism which was
introduced in commit 6f8b12d661d0 ("net: napi: add hard irqs deferral
feature")).
The irq_suspend_timeout serves as a safety mechanism. If userland takes
a long time processing data, irq_suspend_timeout will fire and restart
normal NAPI processing.
For a more in depth explanation, please continue reading.
~ Comparison with existing mechanisms
Interrupt mitigation can be accomplished in napi software, by setting
napi_defer_hard_irqs and gro_flush_timeout, or via interrupt coalescing
in the NIC. This can be quite efficient, but in both cases, a fixed
timeout (or packet count) needs to be configured. However, a fixed
timeout cannot effectively support both low- and high-load situations:
At low load, an application typically processes a few requests and then
waits to receive more input data. In this scenario, a large timeout will
cause unnecessary latency.
At high load, an application typically processes many requests before
being ready to receive more input data. In this case, a small timeout
will likely fire prematurely and trigger irq/softirq processing, which
interferes with the application's execution. This causes overhead, most
likely due to cache contention.
While NICs attempt to provide adaptive interrupt coalescing schemes,
these cannot properly take into account application-level processing.
An alternative packet delivery mechanism is busy-polling, which results
in perfect alignment of application processing and network polling. It
delivers optimal performance (throughput and latency), but results in
100% cpu utilization and is thus inefficient for below-capacity
workloads.
We propose to add a new packet delivery mode that properly alternates
between busy polling and interrupt-based delivery depending on busy and
idle periods of the application. During a busy period, the system
operates in busy-polling mode, which avoids interference. During an idle
period, the system falls back to interrupt deferral, but with a small
timeout to avoid excessive latencies. This delivery mode can also be
viewed as an extension of basic interrupt deferral, but alternating
between a small and a very large timeout.
This delivery mode is efficient, because it avoids softirq execution
interfering with application processing during busy periods. It can be
used with blocking epoll_wait to conserve cpu cycles during idle
periods. The effect of alternating between busy and idle periods is that
performance (throughput and latency) is very close to full busy polling,
while cpu utilization is lower and very close to interrupt mitigation.
~ Usage details
IRQ suspension is introduced via a per-NAPI configuration parameter that
controls the maximum time that IRQs can be suspended.
Here's how it is intended to work:
- The user application (or system administrator) uses the netdev-genl
netlink interface to set the pre-existing napi_defer_hard_irqs and
gro_flush_timeout NAPI config parameters to enable IRQ deferral.
- The user application (or system administrator) sets the proposed
irq_suspend_timeout parameter via the netdev-genl netlink interface
to a larger value than gro_flush_timeout to enable IRQ suspension.
- The user application issues the existing epoll ioctl to set the
prefer_busy_poll flag on the epoll context.
- The user application then calls epoll_wait to busy poll for network
events, as it normally would.
- If epoll_wait returns events to userland, IRQs are suspended for the
duration of irq_suspend_timeout.
- If epoll_wait finds no events and the thread is about to go to
sleep, IRQ handling using napi_defer_hard_irqs and gro_flush_timeout
is resumed.
As long as epoll_wait is retrieving events, IRQs (and softirq
processing) for the NAPI being polled remain disabled. When network
traffic reduces, eventually a busy poll loop in the kernel will retrieve
no data. When this occurs, regular IRQ deferral using gro_flush_timeout
for the polled NAPI is re-enabled.
Unless IRQ suspension is continued by subsequent calls to epoll_wait, it
automatically times out after the irq_suspend_timeout timer expires.
Regular deferral is also immediately re-enabled when the epoll context
is destroyed.
~ Usage scenario
The target scenario for IRQ suspension as packet delivery mode is a
system that runs a dominant application with substantial network I/O.
The target application can be configured to receive input data up to a
certain batch size (via epoll_wait maxevents parameter) and this batch
size determines the worst-case latency that application requests might
experience. Because packet delivery is suspended during the target
application's processing, the batch size also determines the worst-case
latency of concurrent applications using the same RX queue(s).
gro_flush_timeout should be set as small as possible, but large enough to
make sure that a single request is likely not being interfered with.
irq_suspend_timeout is largely a safety mechanism against misbehaving
applications. It should be set large enough to cover the processing of an
entire application batch, i.e., the factor between gro_flush_timeout and
irq_suspend_timeout should roughly correspond to the maximum batch size
that the target application would process in one go.
~ Important call out in the implementation
- Enabling per epoll-context preferred busy poll will now effectively
lead to a nonblocking iteration through napi_busy_loop, even when
busy_poll_usecs is 0. See patch 4.
~ Benchmark configs & descriptions
The changes were benchmarked with memcached [3] using the benchmarking
tool mutilate [4].
To facilitate benchmarking, a small patch [5] was applied to memcached
1.6.29 to allow setting per-epoll context preferred busy poll and other
settings via environment variables. Another small patch [6] was applied
to libevent to enable full busy-polling.
Multiple scenarios were benchmarked as described below and the scripts
used for producing these results can be found on github [7] (note: all
scenarios use NAPI-based traffic splitting via SO_INCOMING_ID by passing
-N to memcached):
- base:
- no other options enabled
- deferX:
- set defer_hard_irqs to 100
- set gro_flush_timeout to X,000
- napibusy:
- set defer_hard_irqs to 100
- set gro_flush_timeout to 200,000
- enable busy poll via the existing ioctl (busy_poll_usecs = 64,
busy_poll_budget = 64, prefer_busy_poll = true)
- fullbusy:
- set defer_hard_irqs to 100
- set gro_flush_timeout to 5,000,000
- enable busy poll via the existing ioctl (busy_poll_usecs = 1000,
busy_poll_budget = 64, prefer_busy_poll = true)
- change memcached's nonblocking epoll_wait invocation (via
libevent) to using a 1 ms timeout
- suspend0:
- set defer_hard_irqs to 0
- set gro_flush_timeout to 0
- set irq_suspend_timeout to 20,000,000
- enable busy poll via the existing ioctl (busy_poll_usecs = 0,
busy_poll_budget = 64, prefer_busy_poll = true)
- suspendX:
- set defer_hard_irqs to 100
- set gro_flush_timeout to X,000
- set irq_suspend_timeout to 20,000,000
- enable busy poll via the existing ioctl (busy_poll_usecs = 0,
busy_poll_budget = 64, prefer_busy_poll = true)
~ Benchmark results
Tested on:
Single socket AMD EPYC 7662 64-Core Processor
Hyperthreading disabled
4 NUMA Zones (NPS=4)
16 CPUs per NUMA zone (64 cores total)
2 x Dual port 100gbps Mellanox Technologies ConnectX-5 Ex EN NIC
The test machine is configured such that a single interface has 8 RX
queues. The queues' IRQs and memcached are pinned to CPUs that are
NUMA-local to the interface which is under test. The NIC's interrupt
coalescing configuration is left at boot-time defaults.
Results:
Results are shown below. The mechanism added by this series is
represented by the 'suspend' cases. Data presented shows a summary over
nearly 10 runs of each test case [8] using the scripts on github [7].
For latency, the median is shown. For throughput and CPU utilization,
the average is shown.
The results also include cycles-per-query (cpq) and
instruction-per-query (ipq) metrics, following the methodology proposed
in [2], to augment the CPU utilization numbers, which could be skewed
due to frequency scaling. We find that this does not appear to be the
case as CPU utilization and low-level metrics show similar trends.
These results were captured using the scripts on github [7] to
illustrate how this approach compares with other pre-existing
mechanisms. This data is not to be interpreted as scientific data
captured in a fully isolated lab setting, but instead as best effort,
illustrative information comparing and contrasting tradeoffs.
The absolute QPS results shift between submissions, but the
relative differences are equivalent. As patches are rebased,
several factors likely influence overall performance.
Compare:
- Throughput (MAX) and latencies of base vs suspend.
- CPU usage of napibusy and fullbusy during lower load (200K, 400K for
example) vs suspend.
- Latency of the defer variants vs suspend as timeout and load
increases.
- suspend0, which sets defer_hard_irqs and gro_flush_timeout to 0, has
nearly the same performance as the base case (this is FAQ item #1).
The overall takeaway is that the suspend variants provide a superior
combination of high throughput, low latency, and low cpu utilization
compared to all other variants. Each of the suspend variants works very
well, but some fine-tuning between latency and cpu utilization is still
possible by tuning the small timeout (gro_flush_timeout).
Note: we've reorganized the results to make comparison among testcases
with the same load easier.
testcase load qps avglat 95%lat 99%lat cpu cpq ipq
base 200K 199946 112 239 416 26 12973 11343
defer10 200K 199971 54 124 142 29 19412 17460
defer20 200K 199986 60 130 153 26 15644 14095
defer50 200K 200025 79 144 182 23 12122 11632
defer200 200K 199999 164 254 309 19 8923 9635
fullbusy 200K 199998 46 118 133 100 43658 23133
napibusy 200K 199983 100 237 277 56 24840 24716
suspend0 200K 200020 105 249 432 30 14264 11796
suspend10 200K 199950 53 123 141 32 19518 16903
suspend20 200K 200037 58 126 151 30 16426 14736
suspend50 200K 199961 73 136 177 26 13310 12633
suspend200 200K 199998 149 251 306 21 9566 10203
testcase load qps avglat 95%lat 99%lat cpu cpq ipq
base 400K 400014 139 269 707 41 9476 9343
defer10 400K 400016 59 133 166 53 13991 12989
defer20 400K 399952 67 140 172 47 12063 11644
defer50 400K 400007 87 162 198 39 9384 9880
defer200 400K 399979 181 274 330 31 7089 8430
fullbusy 400K 399987 50 123 156 100 21827 16037
napibusy 400K 400014 76 222 272 83 18185 16529
suspend0 400K 400015 127 350 776 47 10699 9603
suspend10 400K 400023 57 129 164 54 13758 13178
suspend20 400K 400043 62 135 169 49 12071 11826
suspend50 400K 400071 76 149 186 42 10011 10301
suspend200 400K 399961 154 269 327 34 7827 8774
testcase load qps avglat 95%lat 99%lat cpu cpq ipq
base 600K 599951 149 266 574 61 9265 8876
defer10 600K 600006 71 147 203 76 11866 10936
defer20 600K 600123 76 152 203 66 10430 10342
defer50 600K 600162 95 172 217 54 8526 9142
defer200 600K 599942 200 301 357 46 6977 8212
fullbusy 600K 599990 55 127 177 100 14551 13983
napibusy 600K 600035 63 160 250 96 13937 14140
suspend0 600K 599903 127 320 732 68 10166 8963
suspend10 600K 599908 63 137 192 69 10902 11100
suspend20 600K 599961 66 141 194 65 9976 10370
suspend50 600K 599973 80 159 204 57 8678 9381
suspend200 600K 600010 157 277 346 48 7133 8381
testcase load qps avglat 95%lat 99%lat cpu cpq ipq
base 800K 800039 181 300 536 87 9585 8304
defer10 800K 800038 181 530 939 96 10564 8970
defer20 800K 800029 112 225 329 90 10056 8935
defer50 800K 799999 120 208 296 82 9234 8562
defer200 800K 800066 227 338 401 63 7117 8129
fullbusy 800K 800040 61 134 190 100 10913 12608
napibusy 800K 799944 64 141 214 99 10828 12588
suspend0 800K 799911 126 248 509 85 9346 8498
suspend10 800K 800006 69 143 200 83 9410 9845
suspend20 800K 800120 74 150 207 78 8786 9454
suspend50 800K 799989 87 168 224 71 7946 8833
suspend200 800K 799987 160 292 357 62 6923 8229
testcase load qps avglat 95%lat 99%lat cpu cpq ipq
base 1000K 906879 4079 5751 6216 98 9496 7904
defer10 1000K 860849 3643 6274 6730 99 10040 8676
defer20 1000K 896063 3298 5840 6349 98 9620 8237
defer50 1000K 919782 2962 5513 5807 97 9284 7951
defer200 1000K 970941 3059 5348 5984 95 8593 7959
fullbusy 1000K 999950 70 150 207 100 8732 10777
napibusy 1000K 999996 78 154 223 100 8722 10656
suspend0 1000K 949706 2666 5770 6660 99 9071 8046
suspend10 1000K 1000024 80 160 220 92 8137 9035
suspend20 1000K 1000059 83 165 226 89 7850 8804
suspend50 1000K 999955 95 180 240 84 7411 8459
suspend200 1000K 999914 163 299 366 77 6833 8078
testcase load qps avglat 95%lat 99%lat cpu cpq ipq
base MAX 1037654 4184 5453 5810 100 8411 7938
defer10 MAX 905607 4840 6151 6380 100 9639 8431
defer20 MAX 986463 4455 5594 5796 100 8848 8110
defer50 MAX 1077030 4000 5073 5299 100 8104 7920
defer200 MAX 1040728 4152 5385 5765 100 8379 7849
fullbusy MAX 1247536 3518 3935 3984 100 6998 7930
napibusy MAX 1136310 3799 7756 9964 100 7670 7877
suspend0 MAX 1057509 4132 5724 6185 100 8253 7918
suspend10 MAX 1215147 3580 3957 4041 100 7185 7944
suspend20 MAX 1216469 3576 3953 3988 100 7175 7950
suspend50 MAX 1215871 3577 3961 4075 100 7181 7949
suspend200 MAX 1216882 3556 3951 3988 100 7175 7955
~ FAQ
- Why is a new parameter needed? Does irq_suspend_timeout override
gro_flush_timeout?
Using the suspend mechanism causes the system to alternate between
polling mode and irq-driven packet delivery. During busy periods,
irq_suspend_timeout overrides gro_flush_timeout and keeps the system
busy polling, but when epoll finds no events, the setting of
gro_flush_timeout and napi_defer_hard_irqs determine the next step.
There are essentially three possible loops for network processing and
packet delivery:
1) hardirq -> softirq -> napi poll; basic interrupt delivery
2) timer -> softirq -> napi poll; deferred irq processing
3) epoll -> busy-poll -> napi poll; busy looping
Loop 2 can take control from Loop 1, if gro_flush_timeout and
napi_defer_hard_irqs are set.
If gro_flush_timeout and napi_defer_hard_irqs are set, Loops 2 and
3 "wrestle" with each other for control. During busy periods,
irq_suspend_timeout is used as timer in Loop 2, which essentially
tilts this in favour of Loop 3.
If gro_flush_timeout and napi_defer_hard_irqs are not set, Loop 3
cannot take control from Loop 1.
Therefore, setting gro_flush_timeout and napi_defer_hard_irqs is the
recommended usage, because otherwise setting irq_suspend_timeout
might not have any discernible effect.
This is shown in the results above: compare suspend0 with the base
case. Note that the lack of napi_defer_hard_irqs and
gro_flush_timeout produce similar results for both, which encourages
the use of napi_defer_hard_irqs and gro_flush_timeout in addition to
irq_suspend_timeout.
- Can the new timeout value be threaded through the new epoll ioctl ?
It is possible, but presents challenges for userspace. User
applications must ensure that the file descriptors added to epoll
contexts have the same NAPI ID to support busy polling.
An epoll context is not permanently tied to any particular NAPI ID.
So, a user application could decide to clear the file descriptors
from the context and add a new set of file descriptors with a
different NAPI ID to the context. Busy polling would work as
expected, but the meaning of the suspend timeout becomes ambiguous
because IRQs are not inherently associated with epoll contexts, but
rather with the NAPI. The user program would need to reissue the
ioctl to set the irq_suspend_timeout, but the napi_defer_hard_irqs
and gro_flush_timeout settings would come from the NAPI's
napi_config (which are set either by sysfs or by netlink). Such an
interface seems awkard to use from a user perspective.
Further, IRQs are related to NAPIs, which is why they are stored in
the napi_config space. Putting the irq_suspend_timeout in
the epoll context while other IRQ deferral mechanisms remain in the
NAPI's napi_config space seems like an odd design choice.
We've opted to keep all of the IRQ deferral parameters together and
place the irq_suspend_timeout in napi_config. This has nice benefits
for userspace: if a user app were to remove all file descriptors
from an epoll context and add new file descriptors with a new NAPI ID,
the correct suspend timeout for that NAPI ID would be used automatically
without the user application needing to do anything (like re-issuing an
ioctl, for example). All IRQ deferral related parameters are in one
place and can all be set the same way: with netlink.
- Can irq suspend be built by combining NIC coalescing and
gro_flush_timeout ?
No. The problem is that the long timeout must engage if and only if
prefer-busy is active.
When using NIC coalescing for the short timeout (without
napi_defer_hard_irqs/gro_flush_timeout), an interrupt after an idle
period will trigger softirq, which will run napi polling. At this
point, prefer-busy is not active, so NIC interrupts would be
re-enabled. Then it is not possible for the longer timeout to
interject to switch control back to polling. In other words, only by
using the software timer for the short timeout, it is possible to
extend the timeout without having to reprogram the NIC timer or
reach down directly and disable interrupts.
Using gro_flush_timeout for the long timeout also has problems, for
the same underlying reason. In the current napi implementation,
gro_flush_timeout is not tied to prefer-busy. We'd either have to
change that and in the process modify the existing deferral
mechanism, or introduce a state variable to determine whether
gro_flush_timeout is used as long timeout for irq suspend or whether
it is used for its default purpose. In an earlier version, we did
try something similar to the latter and made it work, but it ends up
being a lot more convoluted than our current proposal.
- Isn't it already possible to combine busy looping with irq deferral?
Yes, in fact enabling irq deferral via napi_defer_hard_irqs and
gro_flush_timeout is a precondition for prefer_busy_poll to have an
effect. If the application also uses a tight busy loop with
essentially nonblocking epoll_wait (accomplished with a very short
timeout parameter), this is the fullbusy case shown in the results.
An application using blocking epoll_wait is shown as the napibusy
case in the results. It's a hybrid approach that provides limited
latency benefits compared to the base case and plain irq deferral,
but not as good as fullbusy or suspend.
~ Special thanks
Several people were involved in earlier stages of the development of this
mechanism whom we'd like to thank:
- Peter Cai (CC'd), for the initial kernel patch and his contributions
to the paper.
- Mohammadamin Shafie (CC'd), for testing various versions of the kernel
patch and providing helpful feedback.
Thanks,
Martin and Joe
[1]: https://lore.kernel.org/netdev/20240812125717.413108-1-jdamato@fastly.com/
[2]: https://doi.org/10.1145/3626780
[3]: https://github.com/memcached/memcached/blob/master/doc/napi_ids.txt
[4]: https://github.com/leverich/mutilate
[5]: https://raw.githubusercontent.com/martinkarsten/irqsuspend/main/patches/mem…
[6]: https://raw.githubusercontent.com/martinkarsten/irqsuspend/main/patches/lib…
[7]: https://github.com/martinkarsten/irqsuspend
[8]: https://github.com/martinkarsten/irqsuspend/tree/main/results
v8:
- Update patch 2 to drop the exports, as requested by Jakub.
v7: https://lore.kernel.org/netdev/20241108023912.98416-1-jdamato@fastly.com/
- Jakub noted that patch 2 adds unnecessary complexity by checking the
suspend timeout in the NAPI loop. This makes the code more
complicated and difficult to reason about. He's right; we've dropped
patch 2 which simplifies this series.
- Updated the cover letter with a full re-run of all test cases.
- Updated FAQ #2.
v6: https://lore.kernel.org/netdev/20241104215542.215919-1-jdamato@fastly.com/
- Updated the cover letter with a full re-run of all test cases,
including a new case suspend0, as requested by Sridhar previously.
- Updated the kernel documentation in patch 7 as suggested by Bagas
Sanjaya, which improved the htmldoc output.
v5: https://lore.kernel.org/netdev/20241103052421.518856-1-jdamato@fastly.com/
- Adjusted patch 5 to only suspend IRQs when ep_send_events returns a
positive return value. This issue was pointed out by Hillf Danton.
- Updated the commit message of patch 6 which still mentioned netcat,
despite the code being updated in v4 to replace it with socat and fixed
misspelling of netdevsim.
- Fixed a minor typo in patch 7 and removed an unnecessary paragraph.
- Added Sridhar Samudrala's Reviewed-by to patch 1-5 and 7.
v4: https://lore.kernel.org/netdev/20241102005214.32443-1-jdamato@fastly.com/
- Added a new FAQ item to cover letter.
- Updated patch 6 to use socat instead of nc in busy_poll_test.sh and
updated busy_poller.c to use netlink directly to configure napi
params.
- Updated the kernel documentation in patch 7 to include more details.
- Dropped Stanislav's Acked-by and Bagas' Reviewed-by from patch 7
since the documentation was updated.
v3: https://lore.kernel.org/netdev/20241101004846.32532-1-jdamato@fastly.com/
- Added Stanislav Fomichev's Acked-by to every patch except the newly
added selftest.
- Added Bagas Sanjaya's Reviewed-by to the documentation patch.
- Fixed the commit message of patch 2 to remove a reference to the now
non-existent sysfs setting.
- Added a self test which tests both "regular" busy poll and busy poll
with suspend enabled. This was added as patch 6 as requested by
Paolo. netdevsim was chosen instead of veth due to netdevsim's
pre-existing support for netdev-genl. See the commit message of
patch 6 for more details.
v2: https://lore.kernel.org/bpf/20241021015311.95468-1-jdamato@fastly.com/
- Cover letter updated, including a re-run of test data.
- Patch 1 rewritten to use netdev-genl instead of sysfs.
- Patch 3 updated with a comment added to napi_resume_irqs.
- Patch 4 rebased to apply now that commit b9ca079dd6b0 ("eventpoll:
Annotate data-race of busy_poll_usecs") has been picked up from VFS.
- Patch 6 updated the kernel documentation.
rfc -> v1:
- Cover letter updated to include more details.
- Patch 1 updated to remove the documentation added. This was moved to
patch 6 with the rest of the docs (see below).
- Patch 5 updated to fix an error uncovered by the kernel build robot.
See patch 5's changelog for more details.
- Patch 6 added which updates kernel documentation.
Joe Damato (2):
selftests: net: Add busy_poll_test
docs: networking: Describe irq suspension
Martin Karsten (4):
net: Add napi_struct parameter irq_suspend_timeout
net: Add control functions for irq suspension
eventpoll: Trigger napi_busy_loop, if prefer_busy_poll is set
eventpoll: Control irq suspension for prefer_busy_poll
Documentation/netlink/specs/netdev.yaml | 7 +
Documentation/networking/napi.rst | 170 ++++++++-
fs/eventpoll.c | 36 +-
include/linux/netdevice.h | 2 +
include/net/busy_poll.h | 3 +
include/uapi/linux/netdev.h | 1 +
net/core/dev.c | 39 +++
net/core/dev.h | 25 ++
net/core/netdev-genl-gen.c | 5 +-
net/core/netdev-genl.c | 12 +
tools/include/uapi/linux/netdev.h | 1 +
tools/testing/selftests/net/.gitignore | 1 +
tools/testing/selftests/net/Makefile | 3 +-
tools/testing/selftests/net/busy_poll_test.sh | 164 +++++++++
tools/testing/selftests/net/busy_poller.c | 328 ++++++++++++++++++
15 files changed, 790 insertions(+), 7 deletions(-)
create mode 100755 tools/testing/selftests/net/busy_poll_test.sh
create mode 100644 tools/testing/selftests/net/busy_poller.c
base-commit: dc7c381bb8649e3701ed64f6c3e55316675904d7
--
2.25.1
Commit 6e182dc9f268 ("selftests/mm: Use generic pkey register
manipulation") makes use of PKEY_UNRESTRICTED in
pkey_sighandler_tests. The macro has been proposed for addition to
uapi headers [1], but the patch hasn't landed yet.
Define PKEY_UNRESTRICTED in pkey-helpers.h for the time being to fix
the build.
[1] https://lore.kernel.org/all/20241028090715.509527-2-yury.khrustalev@arm.com/
Fixes: 6e182dc9f268 ("selftests/mm: Use generic pkey register
manipulation")
Reported-by: Aishwarya TCV <aishwarya.tcv(a)arm.com>
Signed-off-by: Kevin Brodsky <kevin.brodsky(a)arm.com>
---
Based on arm64 for-next/pkey-signal (49f59573e9e0).
---
tools/testing/selftests/mm/pkey-helpers.h | 4 ++++
1 file changed, 4 insertions(+)
diff --git a/tools/testing/selftests/mm/pkey-helpers.h b/tools/testing/selftests/mm/pkey-helpers.h
index 9ab6a3ee153b..319f5b6b7132 100644
--- a/tools/testing/selftests/mm/pkey-helpers.h
+++ b/tools/testing/selftests/mm/pkey-helpers.h
@@ -112,6 +112,10 @@ void record_pkey_malloc(void *ptr, long size, int prot);
#define PKEY_MASK (PKEY_DISABLE_ACCESS | PKEY_DISABLE_WRITE)
#endif
+#ifndef PKEY_UNRESTRICTED
+#define PKEY_UNRESTRICTED 0x0
+#endif
+
#ifndef set_pkey_bits
static inline u64 set_pkey_bits(u64 reg, int pkey, u64 flags)
{
--
2.43.0
Hello colleagues,
Following up on Tim Bird's presentation "Adding benchmarks results
support to KTAP/kselftest", I would like to share some thoughts on
kernel benchmarking and kernel performance evaluation. Tim suggested
sharing these comments with the wider kselftest community for
discussion.
The topic of performance evaluation is obviously extremely complex, so
I’ve organised my comments into several paragraphs, each of which
focuses on a specific aspect. This should make it easier to follow and
understand the key points, such as metrics, reference values, results
data lake, interpretation of contradictory results, system profiles,
analysis and methodology.
# Metrics
A few remarks on benchmark metrics which were called “values” in the
original presentation:
- Metrics must be accompanied by standardised units. This
standardisation ensures consistency across different tests and
environments, simplifying accurate comparisons and analysis.
- Each metric should be clearly labelled with its nature or kind
(throughput, speed, latency, etc). This classification is essential
for proper interpretation of the results and prevents
misunderstandings that could lead to incorrect conclusions.
- Presentation contains "May also include allowable variance", but
variance must be included into the analysis as we deal with
statistical calculations and multiple randomised values.
- I would like to note that other statistical parameters are also
worth including into comparison, like confidence levels, sample size
and so on.
# Reference Values
The concept of "reference values" introduced in the slides could be
significantly enhanced by implementing a collaborative, transparent
system for data collection and validation. This system could operate
as follows:
- Data Collection: Any user could submit benchmark results to a
centralised and public repository. This would allow for a diverse
range of hardware configurations and use cases to be represented.
- Vendor Validation: Hardware vendors would have the opportunity to
review submitted results pertaining to their products. They could then
mark certain results as "Vendor Approved," indicating that the results
align with their own testing and expectations.
- Community Review: The broader community of users and experts could
also review and vote on submitted results. Results that receive
substantial positive feedback could be marked as "Community Approved,"
providing an additional layer of validation.
- Automated Validation: Reference values must be checked, validated
and supported by multiple sources. This can be done only in an
automatic way as those processes are time consuming and require
extreme attention to details.
- Transparency: All submitted results would need to be accompanied by
detailed information about the testing environment, hardware
specifications, and methodology used. This would ensure
reproducibility and allow others to understand the context of each
result.
- Trust Building: The combination of vendor and community approval
would help establish trust in the reference values. It would mitigate
concerns about marketing bias and provide a more reliable basis for
performance comparisons.
- Accessibility: The system would be publicly accessible, allowing
anyone to reference and utilise this data in their own testing and
analysis.
Implementation of such a system would require careful consideration of
governance and funding. A community-driven, non-profit organisation
sponsored by multiple stakeholders could be an appropriate model. This
structure would help maintain neutrality and avoid potential conflicts
of interest.
While the specifics of building and managing such a system would need
further exploration, this approach could significantly improve the
reliability and usefulness of reference values in benchmark testing.
It would foster a more collaborative and transparent environment for
performance evaluation in the Linux ecosystem as well as attract
interested vendors to submit and review results.
I’m not very informed about the current state of the community in this
field, but I’m sure you know better how exactly this can be done.
# Results Data Lake
Along with reference values it’s important to collect results on a
regular basis as the kernel evolves so results must follow this
evolution as well. To do this cloud-based data lake is needed (a
self-hosted system will be too expensive from my point of view).
This data lake should be able to collect and process incoming data as
well as to serve reference values for users. Data processing flow
should be quite standard: Collection -> Parsing + Enhancement ->
Storage -> Analysis -> Serving.
Tim proposed to use file names for reference files, I would like to
note that such approach could fail pretty fast if system will collect
more and more data and there will rise a need to have more granular
and detailed features to identify reference results and this can lead
to very long filenames, which will be hard to use. I propose to use
UUID4-based identification, which provides very low chances for
collision. Those IDs will be keys in the database with all information
required for clear identification of relevant results and
corresponding details. Moreover this approach can be easily extended
on the database side if more data is needed.
Yes, UUID4 is not human-readable, but do we need such an option if we
have tools, which can provide a better interface?
For example, this could be something like:
---
request: results-cli search -b "Test Suite D" -v "v1.2.3" -o "Ubuntu
22.04" -t "baseline" -m "response_time>100"
response:
[
{
"id": "550e8400-e29b-41d4-a716-446655440005",
"benchmark": "Test Suite A",
"version": "v1.2.3",
"target_os": "Ubuntu 22.04",
"metrics": {
"cpu_usage": 70.5,
"memory_usage": 2048,
"response_time": 120
},
"tags": ["baseline", "v1.0"],
"created_at": "2024-10-25T10:00:00Z"
},
...
]
---
or
request: results-cli search "<Domain-Specific-Language-Query>"
response: [ {}, {}, {}...]
---
or
request: results-cli get 550e8400-e29b-41d4-a716-446655440005
response:
{
"id": "550e8400-e29b-41d4-a716-446655440005",
"benchmark": "Test Suite A",
"version": "v1.2.3",
"target_os": "Ubuntu 22.04",
"metrics": {
"cpu_usage": 70.5,
"memory_usage": 2048,
"response_time": 120
},
"tags": ["baseline", "v1.0"],
"created_at": "2024-10-25T10:00:00Z"
}
---
or
request: curl -X POST http://api.example.com/references/search \ -d '{
"query": "benchmark = \"Test Suite A\" AND (version >= \"v1.2\" OR tag
IN [\"baseline\", \"regression\"]) AND cpu_usage > 60" }'
...
---
Another point of use DB-based approach is the following: in case when
a user works with particular hardware and/or would like to use a
reference he/she does not need a full database with all collected
reference values, but only a small slice of it. This slice can be
downloaded from public repo or accessed via API.
# Large results dataset
If we collect a large benchmarks dataset in one place accompanied with
detailed information about target systems from which this dataset was
collected, then it will allow us to calculate precise baselines across
different compositions of parameters, making performance deviations
easier to detect. Long-term trend analysis can identify small changes
and correlate them with updates, revealing performance drift.
Another use of such a database - predictive modelling, which can
provide forecasts of expected results and setting dynamic performance
thresholds, enabling early issue detection. Anomaly detection becomes
more effective with context, distinguishing unusual deviations from
normal behaviour.
# Interpretation of contradictory results
It’s not clear how to deal with contradictory results to make a
decision on regression presence. For example, we have a set of 10
tests, which test more or less the same, for example disk performance.
It’s unclear what to do when one subset of tests show degradation and
another subset shows neutral status or improvements. Is there a
regression?
I suppose that the availability of historical data can help to deal
with such situations as historical data can show behaviour of
particular tests and allow to assign weights in decision-making
algorithms, but it’s just my guess.
# System Profiles
Tim's idea to reduce results to - “pass / fail” and my experience with
various people trying to interpret benchmarking results led me to
think of “profiles” - a set of parameters and metrics collected from a
reference system while execution of a particular configuration of a
particular benchmark.
Profiles can be used for A/B comparison with pass/fail outcomes or
match/not match, and this approach does not hide/miss the details and
allows to capture multiple characteristics of the experiment, like
presence of outliers/errors or skewed distribution form. Interested
persons (like kernel developers or performance engineers, for example)
can dig deeper to find a reason for such mismatch and those who are
interested just in high-level results - pass/fail should be enough.
Here is how I imaging a structure of a profile:
---
profile_a:
system packages:
- pkg_1
- pkg_2
# Additional packages...
settings:
- cmdline
# Additional settings...
indicators:
cpu: null
ram: null
loadavg: null
# Additional indicators...
benchmark:
settings:
param_1: null
param_2: null
param_x: null
metrics:
metric_1: null
metric_2: null
metric_x: null
---
- System Packages, System Settings: Usually we do not pay much
attention to this, but I think it’s worth highlighting that base OS is
an important factor, as there are distribution-specific modifications
present in the filesystem. Most commonly developers and researchers
use Ubuntu (as the most popular distro) or Debian (as a cleaner and
lightweight version of Ubuntu), but distributions apply their own
patches to the kernel and system libraries, which may impact
performance. Another kind of base OS - cloud OS images which can be
modified by cloud providers to add internal packages & services which
could potentially affect performance as well. While comparing we must
take into account this aspect to compare apples-to-apples.
- System Indicators: These are periodic statistics like CPU
utilisation, RAM consumption, and other params collected before
benchmarking, while benchmarking and after benchmarking.
- Benchmark Settings: Benchmarking systems have multiple parameters,
so it’s important to capture them and use them in analysis.
- Benchmark Metrics: That’s obviously - benchmark results. It’s not a
rare case when a benchmark test provides more than a single number.
# Analysis
Proposed rules-based analysis will work only for highly determined
environments and systems, where rules can describe all the aspects.
Rule-based systems are easier to understand and implement than other
types, but for a small set of rules. However, we deal with the live
system and it constantly evolves, so rules will deprecate extremely
fast. It's the same story as with rule-based recommended systems in
early years of machine learning.
If you want to follow a rules-based approach, it's probably worth
taking a look at https://www.clipsrules.net as this will allow to
decouple results from analysis and avoid reinventing the analysis
engine.
Declaration of those rules will be error-prone due to the nature of
their origin - they must be declared and maintained by humans. IMHO a
human-less approach and use modern ML methods instead would be more
beneficial in the long run.
# Methodology
Used methodology - another aspect which is not directly related to
Tim's slides, but it's an important topic for results processing and
interpretation, probably an idea of automated results interpretation
can force use of one or another methodology.
# Next steps
I would be glad to participate in further discussions and share
experience to improve kernel performance testing automation, analysis
and interpretation of results. If there is interest, I'm open to
collaborating on implementing some of these ideas.
--
Best regards,
Konstantin Belov
Greetings:
Welcome to v7, see changelog below.
This revision drops patch 2 from the previous version, which was
unnecessary as Jakub pointed out. The tests were fully re-run after
dropping the patch to confirm the performance numbers hold; they do -
see below. FAQ #2 has been updated to be more verbose, as well.
This series introduces a new mechanism, IRQ suspension, which allows
network applications using epoll to mask IRQs during periods of high
traffic while also reducing tail latency (compared to existing
mechanisms, see below) during periods of low traffic. In doing so, this
balances CPU consumption with network processing efficiency.
Martin Karsten (CC'd) and I have been collaborating on this series for
several months and have appreciated the feedback from the community on
our RFC [1]. We've updated the cover letter and kernel documentation in
an attempt to more clearly explain how this mechanism works, how
applications can use it, and how it compares to existing mechanisms in
the kernel.
I briefly mentioned this idea at netdev conf 2024 (for those who were
there) and Martin described this idea in an earlier paper presented at
Sigmetrics 2024 [2].
~ The short explanation (TL;DR)
We propose adding a new napi config parameter: irq_suspend_timeout to
help balance CPU usage and network processing efficiency when using IRQ
deferral and napi busy poll.
If this parameter is set to a non-zero value *and* a user application
has enabled preferred busy poll on a busy poll context (via the
EPIOCSPARAMS ioctl introduced in commit 18e2bf0edf4d ("eventpoll: Add
epoll ioctl for epoll_params")), then application calls to epoll_wait
for that context will cause device IRQs and softirq processing to be
suspended as long as epoll_wait successfully retrieves data from the
NAPI. Each time data is retrieved, the irq_suspend_timeout is deferred.
If/when network traffic subsides and epoll_wait returns no data, IRQ
suspension is immediately reverted back to the existing
napi_defer_hard_irqs and gro_flush_timeout mechanism which was
introduced in commit 6f8b12d661d0 ("net: napi: add hard irqs deferral
feature")).
The irq_suspend_timeout serves as a safety mechanism. If userland takes
a long time processing data, irq_suspend_timeout will fire and restart
normal NAPI processing.
For a more in depth explanation, please continue reading.
~ Comparison with existing mechanisms
Interrupt mitigation can be accomplished in napi software, by setting
napi_defer_hard_irqs and gro_flush_timeout, or via interrupt coalescing
in the NIC. This can be quite efficient, but in both cases, a fixed
timeout (or packet count) needs to be configured. However, a fixed
timeout cannot effectively support both low- and high-load situations:
At low load, an application typically processes a few requests and then
waits to receive more input data. In this scenario, a large timeout will
cause unnecessary latency.
At high load, an application typically processes many requests before
being ready to receive more input data. In this case, a small timeout
will likely fire prematurely and trigger irq/softirq processing, which
interferes with the application's execution. This causes overhead, most
likely due to cache contention.
While NICs attempt to provide adaptive interrupt coalescing schemes,
these cannot properly take into account application-level processing.
An alternative packet delivery mechanism is busy-polling, which results
in perfect alignment of application processing and network polling. It
delivers optimal performance (throughput and latency), but results in
100% cpu utilization and is thus inefficient for below-capacity
workloads.
We propose to add a new packet delivery mode that properly alternates
between busy polling and interrupt-based delivery depending on busy and
idle periods of the application. During a busy period, the system
operates in busy-polling mode, which avoids interference. During an idle
period, the system falls back to interrupt deferral, but with a small
timeout to avoid excessive latencies. This delivery mode can also be
viewed as an extension of basic interrupt deferral, but alternating
between a small and a very large timeout.
This delivery mode is efficient, because it avoids softirq execution
interfering with application processing during busy periods. It can be
used with blocking epoll_wait to conserve cpu cycles during idle
periods. The effect of alternating between busy and idle periods is that
performance (throughput and latency) is very close to full busy polling,
while cpu utilization is lower and very close to interrupt mitigation.
~ Usage details
IRQ suspension is introduced via a per-NAPI configuration parameter that
controls the maximum time that IRQs can be suspended.
Here's how it is intended to work:
- The user application (or system administrator) uses the netdev-genl
netlink interface to set the pre-existing napi_defer_hard_irqs and
gro_flush_timeout NAPI config parameters to enable IRQ deferral.
- The user application (or system administrator) sets the proposed
irq_suspend_timeout parameter via the netdev-genl netlink interface
to a larger value than gro_flush_timeout to enable IRQ suspension.
- The user application issues the existing epoll ioctl to set the
prefer_busy_poll flag on the epoll context.
- The user application then calls epoll_wait to busy poll for network
events, as it normally would.
- If epoll_wait returns events to userland, IRQs are suspended for the
duration of irq_suspend_timeout.
- If epoll_wait finds no events and the thread is about to go to
sleep, IRQ handling using napi_defer_hard_irqs and gro_flush_timeout
is resumed.
As long as epoll_wait is retrieving events, IRQs (and softirq
processing) for the NAPI being polled remain disabled. When network
traffic reduces, eventually a busy poll loop in the kernel will retrieve
no data. When this occurs, regular IRQ deferral using gro_flush_timeout
for the polled NAPI is re-enabled.
Unless IRQ suspension is continued by subsequent calls to epoll_wait, it
automatically times out after the irq_suspend_timeout timer expires.
Regular deferral is also immediately re-enabled when the epoll context
is destroyed.
~ Usage scenario
The target scenario for IRQ suspension as packet delivery mode is a
system that runs a dominant application with substantial network I/O.
The target application can be configured to receive input data up to a
certain batch size (via epoll_wait maxevents parameter) and this batch
size determines the worst-case latency that application requests might
experience. Because packet delivery is suspended during the target
application's processing, the batch size also determines the worst-case
latency of concurrent applications using the same RX queue(s).
gro_flush_timeout should be set as small as possible, but large enough to
make sure that a single request is likely not being interfered with.
irq_suspend_timeout is largely a safety mechanism against misbehaving
applications. It should be set large enough to cover the processing of an
entire application batch, i.e., the factor between gro_flush_timeout and
irq_suspend_timeout should roughly correspond to the maximum batch size
that the target application would process in one go.
~ Important call out in the implementation
- Enabling per epoll-context preferred busy poll will now effectively
lead to a nonblocking iteration through napi_busy_loop, even when
busy_poll_usecs is 0. See patch 4.
~ Benchmark configs & descriptions
The changes were benchmarked with memcached [3] using the benchmarking
tool mutilate [4].
To facilitate benchmarking, a small patch [5] was applied to memcached
1.6.29 to allow setting per-epoll context preferred busy poll and other
settings via environment variables. Another small patch [6] was applied
to libevent to enable full busy-polling.
Multiple scenarios were benchmarked as described below and the scripts
used for producing these results can be found on github [7] (note: all
scenarios use NAPI-based traffic splitting via SO_INCOMING_ID by passing
-N to memcached):
- base:
- no other options enabled
- deferX:
- set defer_hard_irqs to 100
- set gro_flush_timeout to X,000
- napibusy:
- set defer_hard_irqs to 100
- set gro_flush_timeout to 200,000
- enable busy poll via the existing ioctl (busy_poll_usecs = 64,
busy_poll_budget = 64, prefer_busy_poll = true)
- fullbusy:
- set defer_hard_irqs to 100
- set gro_flush_timeout to 5,000,000
- enable busy poll via the existing ioctl (busy_poll_usecs = 1000,
busy_poll_budget = 64, prefer_busy_poll = true)
- change memcached's nonblocking epoll_wait invocation (via
libevent) to using a 1 ms timeout
- suspend0:
- set defer_hard_irqs to 0
- set gro_flush_timeout to 0
- set irq_suspend_timeout to 20,000,000
- enable busy poll via the existing ioctl (busy_poll_usecs = 0,
busy_poll_budget = 64, prefer_busy_poll = true)
- suspendX:
- set defer_hard_irqs to 100
- set gro_flush_timeout to X,000
- set irq_suspend_timeout to 20,000,000
- enable busy poll via the existing ioctl (busy_poll_usecs = 0,
busy_poll_budget = 64, prefer_busy_poll = true)
~ Benchmark results
Tested on:
Single socket AMD EPYC 7662 64-Core Processor
Hyperthreading disabled
4 NUMA Zones (NPS=4)
16 CPUs per NUMA zone (64 cores total)
2 x Dual port 100gbps Mellanox Technologies ConnectX-5 Ex EN NIC
The test machine is configured such that a single interface has 8 RX
queues. The queues' IRQs and memcached are pinned to CPUs that are
NUMA-local to the interface which is under test. The NIC's interrupt
coalescing configuration is left at boot-time defaults.
Results:
Results are shown below. The mechanism added by this series is
represented by the 'suspend' cases. Data presented shows a summary over
nearly 10 runs of each test case [8] using the scripts on github [7].
For latency, the median is shown. For throughput and CPU utilization,
the average is shown.
The results also include cycles-per-query (cpq) and
instruction-per-query (ipq) metrics, following the methodology proposed
in [2], to augment the CPU utilization numbers, which could be skewed
due to frequency scaling. We find that this does not appear to be the
case as CPU utilization and low-level metrics show similar trends.
These results were captured using the scripts on github [7] to
illustrate how this approach compares with other pre-existing
mechanisms. This data is not to be interpreted as scientific data
captured in a fully isolated lab setting, but instead as best effort,
illustrative information comparing and contrasting tradeoffs.
The absolute QPS results shift between submissions, but the
relative differences are equivalent. As patches are rebased,
several factors likely influence overall performance.
Compare:
- Throughput (MAX) and latencies of base vs suspend.
- CPU usage of napibusy and fullbusy during lower load (200K, 400K for
example) vs suspend.
- Latency of the defer variants vs suspend as timeout and load
increases.
- suspend0, which sets defer_hard_irqs and gro_flush_timeout to 0, has
nearly the same performance as the base case (this is FAQ item #1).
The overall takeaway is that the suspend variants provide a superior
combination of high throughput, low latency, and low cpu utilization
compared to all other variants. Each of the suspend variants works very
well, but some fine-tuning between latency and cpu utilization is still
possible by tuning the small timeout (gro_flush_timeout).
Note: we've reorganized the results to make comparison among testcases
with the same load easier.
testcase load qps avglat 95%lat 99%lat cpu cpq ipq
base 200K 199946 112 239 416 26 12973 11343
defer10 200K 199971 54 124 142 29 19412 17460
defer20 200K 199986 60 130 153 26 15644 14095
defer50 200K 200025 79 144 182 23 12122 11632
defer200 200K 199999 164 254 309 19 8923 9635
fullbusy 200K 199998 46 118 133 100 43658 23133
napibusy 200K 199983 100 237 277 56 24840 24716
suspend0 200K 200020 105 249 432 30 14264 11796
suspend10 200K 199950 53 123 141 32 19518 16903
suspend20 200K 200037 58 126 151 30 16426 14736
suspend50 200K 199961 73 136 177 26 13310 12633
suspend200 200K 199998 149 251 306 21 9566 10203
testcase load qps avglat 95%lat 99%lat cpu cpq ipq
base 400K 400014 139 269 707 41 9476 9343
defer10 400K 400016 59 133 166 53 13991 12989
defer20 400K 399952 67 140 172 47 12063 11644
defer50 400K 400007 87 162 198 39 9384 9880
defer200 400K 399979 181 274 330 31 7089 8430
fullbusy 400K 399987 50 123 156 100 21827 16037
napibusy 400K 400014 76 222 272 83 18185 16529
suspend0 400K 400015 127 350 776 47 10699 9603
suspend10 400K 400023 57 129 164 54 13758 13178
suspend20 400K 400043 62 135 169 49 12071 11826
suspend50 400K 400071 76 149 186 42 10011 10301
suspend200 400K 399961 154 269 327 34 7827 8774
testcase load qps avglat 95%lat 99%lat cpu cpq ipq
base 600K 599951 149 266 574 61 9265 8876
defer10 600K 600006 71 147 203 76 11866 10936
defer20 600K 600123 76 152 203 66 10430 10342
defer50 600K 600162 95 172 217 54 8526 9142
defer200 600K 599942 200 301 357 46 6977 8212
fullbusy 600K 599990 55 127 177 100 14551 13983
napibusy 600K 600035 63 160 250 96 13937 14140
suspend0 600K 599903 127 320 732 68 10166 8963
suspend10 600K 599908 63 137 192 69 10902 11100
suspend20 600K 599961 66 141 194 65 9976 10370
suspend50 600K 599973 80 159 204 57 8678 9381
suspend200 600K 600010 157 277 346 48 7133 8381
testcase load qps avglat 95%lat 99%lat cpu cpq ipq
base 800K 800039 181 300 536 87 9585 8304
defer10 800K 800038 181 530 939 96 10564 8970
defer20 800K 800029 112 225 329 90 10056 8935
defer50 800K 799999 120 208 296 82 9234 8562
defer200 800K 800066 227 338 401 63 7117 8129
fullbusy 800K 800040 61 134 190 100 10913 12608
napibusy 800K 799944 64 141 214 99 10828 12588
suspend0 800K 799911 126 248 509 85 9346 8498
suspend10 800K 800006 69 143 200 83 9410 9845
suspend20 800K 800120 74 150 207 78 8786 9454
suspend50 800K 799989 87 168 224 71 7946 8833
suspend200 800K 799987 160 292 357 62 6923 8229
testcase load qps avglat 95%lat 99%lat cpu cpq ipq
base 1000K 906879 4079 5751 6216 98 9496 7904
defer10 1000K 860849 3643 6274 6730 99 10040 8676
defer20 1000K 896063 3298 5840 6349 98 9620 8237
defer50 1000K 919782 2962 5513 5807 97 9284 7951
defer200 1000K 970941 3059 5348 5984 95 8593 7959
fullbusy 1000K 999950 70 150 207 100 8732 10777
napibusy 1000K 999996 78 154 223 100 8722 10656
suspend0 1000K 949706 2666 5770 6660 99 9071 8046
suspend10 1000K 1000024 80 160 220 92 8137 9035
suspend20 1000K 1000059 83 165 226 89 7850 8804
suspend50 1000K 999955 95 180 240 84 7411 8459
suspend200 1000K 999914 163 299 366 77 6833 8078
testcase load qps avglat 95%lat 99%lat cpu cpq ipq
base MAX 1037654 4184 5453 5810 100 8411 7938
defer10 MAX 905607 4840 6151 6380 100 9639 8431
defer20 MAX 986463 4455 5594 5796 100 8848 8110
defer50 MAX 1077030 4000 5073 5299 100 8104 7920
defer200 MAX 1040728 4152 5385 5765 100 8379 7849
fullbusy MAX 1247536 3518 3935 3984 100 6998 7930
napibusy MAX 1136310 3799 7756 9964 100 7670 7877
suspend0 MAX 1057509 4132 5724 6185 100 8253 7918
suspend10 MAX 1215147 3580 3957 4041 100 7185 7944
suspend20 MAX 1216469 3576 3953 3988 100 7175 7950
suspend50 MAX 1215871 3577 3961 4075 100 7181 7949
suspend200 MAX 1216882 3556 3951 3988 100 7175 7955
~ FAQ
- Why is a new parameter needed? Does irq_suspend_timeout override
gro_flush_timeout?
Using the suspend mechanism causes the system to alternate between
polling mode and irq-driven packet delivery. During busy periods,
irq_suspend_timeout overrides gro_flush_timeout and keeps the system
busy polling, but when epoll finds no events, the setting of
gro_flush_timeout and napi_defer_hard_irqs determine the next step.
There are essentially three possible loops for network processing and
packet delivery:
1) hardirq -> softirq -> napi poll; basic interrupt delivery
2) timer -> softirq -> napi poll; deferred irq processing
3) epoll -> busy-poll -> napi poll; busy looping
Loop 2 can take control from Loop 1, if gro_flush_timeout and
napi_defer_hard_irqs are set.
If gro_flush_timeout and napi_defer_hard_irqs are set, Loops 2 and
3 "wrestle" with each other for control. During busy periods,
irq_suspend_timeout is used as timer in Loop 2, which essentially
tilts this in favour of Loop 3.
If gro_flush_timeout and napi_defer_hard_irqs are not set, Loop 3
cannot take control from Loop 1.
Therefore, setting gro_flush_timeout and napi_defer_hard_irqs is the
recommended usage, because otherwise setting irq_suspend_timeout
might not have any discernible effect.
This is shown in the results above: compare suspend0 with the base
case. Note that the lack of napi_defer_hard_irqs and
gro_flush_timeout produce similar results for both, which encourages
the use of napi_defer_hard_irqs and gro_flush_timeout in addition to
irq_suspend_timeout.
- Can the new timeout value be threaded through the new epoll ioctl ?
It is possible, but presents challenges for userspace. User
applications must ensure that the file descriptors added to epoll
contexts have the same NAPI ID to support busy polling.
An epoll context is not permanently tied to any particular NAPI ID.
So, a user application could decide to clear the file descriptors
from the context and add a new set of file descriptors with a
different NAPI ID to the context. Busy polling would work as
expected, but the meaning of the suspend timeout becomes ambiguous
because IRQs are not inherently associated with epoll contexts, but
rather with the NAPI. The user program would need to reissue the
ioctl to set the irq_suspend_timeout, but the napi_defer_hard_irqs
and gro_flush_timeout settings would come from the NAPI's
napi_config (which are set either by sysfs or by netlink). Such an
interface seems awkard to use from a user perspective.
Further, IRQs are related to NAPIs, which is why they are stored in
the napi_config space. Putting the irq_suspend_timeout in
the epoll context while other IRQ deferral mechanisms remain in the
NAPI's napi_config space seems like an odd design choice.
We've opted to keep all of the IRQ deferral parameters together and
place the irq_suspend_timeout in napi_config. This has nice benefits
for userspace: if a user app were to remove all file descriptors
from an epoll context and add new file descriptors with a new NAPI ID,
the correct suspend timeout for that NAPI ID would be used automatically
without the user application needing to do anything (like re-issuing an
ioctl, for example). All IRQ deferral related parameters are in one
place and can all be set the same way: with netlink.
- Can irq suspend be built by combining NIC coalescing and
gro_flush_timeout ?
No. The problem is that the long timeout must engage if and only if
prefer-busy is active.
When using NIC coalescing for the short timeout (without
napi_defer_hard_irqs/gro_flush_timeout), an interrupt after an idle
period will trigger softirq, which will run napi polling. At this
point, prefer-busy is not active, so NIC interrupts would be
re-enabled. Then it is not possible for the longer timeout to
interject to switch control back to polling. In other words, only by
using the software timer for the short timeout, it is possible to
extend the timeout without having to reprogram the NIC timer or
reach down directly and disable interrupts.
Using gro_flush_timeout for the long timeout also has problems, for
the same underlying reason. In the current napi implementation,
gro_flush_timeout is not tied to prefer-busy. We'd either have to
change that and in the process modify the existing deferral
mechanism, or introduce a state variable to determine whether
gro_flush_timeout is used as long timeout for irq suspend or whether
it is used for its default purpose. In an earlier version, we did
try something similar to the latter and made it work, but it ends up
being a lot more convoluted than our current proposal.
- Isn't it already possible to combine busy looping with irq deferral?
Yes, in fact enabling irq deferral via napi_defer_hard_irqs and
gro_flush_timeout is a precondition for prefer_busy_poll to have an
effect. If the application also uses a tight busy loop with
essentially nonblocking epoll_wait (accomplished with a very short
timeout parameter), this is the fullbusy case shown in the results.
An application using blocking epoll_wait is shown as the napibusy
case in the results. It's a hybrid approach that provides limited
latency benefits compared to the base case and plain irq deferral,
but not as good as fullbusy or suspend.
~ Special thanks
Several people were involved in earlier stages of the development of this
mechanism whom we'd like to thank:
- Peter Cai (CC'd), for the initial kernel patch and his contributions
to the paper.
- Mohammadamin Shafie (CC'd), for testing various versions of the kernel
patch and providing helpful feedback.
Thanks,
Martin and Joe
[1]: https://lore.kernel.org/netdev/20240812125717.413108-1-jdamato@fastly.com/
[2]: https://doi.org/10.1145/3626780
[3]: https://github.com/memcached/memcached/blob/master/doc/napi_ids.txt
[4]: https://github.com/leverich/mutilate
[5]: https://raw.githubusercontent.com/martinkarsten/irqsuspend/main/patches/mem…
[6]: https://raw.githubusercontent.com/martinkarsten/irqsuspend/main/patches/lib…
[7]: https://github.com/martinkarsten/irqsuspend
[8]: https://github.com/martinkarsten/irqsuspend/tree/main/results
v7:
- Jakub noted that patch 2 adds unnecessary complexity by checking the
suspend timeout in the NAPI loop. This makes the code more
complicated and difficult to reason about. He's right; we've dropped
patch 2 which simplifies this series.
- Updated the cover letter with a full re-run of all test cases.
- Updated FAQ #2.
v6: https://lore.kernel.org/netdev/20241104215542.215919-1-jdamato@fastly.com/
- Updated the cover letter with a full re-run of all test cases,
including a new case suspend0, as requested by Sridhar previously.
- Updated the kernel documentation in patch 7 as suggested by Bagas
Sanjaya, which improved the htmldoc output.
v5: https://lore.kernel.org/netdev/20241103052421.518856-1-jdamato@fastly.com/
- Adjusted patch 5 to only suspend IRQs when ep_send_events returns a
positive return value. This issue was pointed out by Hillf Danton.
- Updated the commit message of patch 6 which still mentioned netcat,
despite the code being updated in v4 to replace it with socat and fixed
misspelling of netdevsim.
- Fixed a minor typo in patch 7 and removed an unnecessary paragraph.
- Added Sridhar Samudrala's Reviewed-by to patch 1-5 and 7.
v4: https://lore.kernel.org/netdev/20241102005214.32443-1-jdamato@fastly.com/
- Added a new FAQ item to cover letter.
- Updated patch 6 to use socat instead of nc in busy_poll_test.sh and
updated busy_poller.c to use netlink directly to configure napi
params.
- Updated the kernel documentation in patch 7 to include more details.
- Dropped Stanislav's Acked-by and Bagas' Reviewed-by from patch 7
since the documentation was updated.
v3: https://lore.kernel.org/netdev/20241101004846.32532-1-jdamato@fastly.com/
- Added Stanislav Fomichev's Acked-by to every patch except the newly
added selftest.
- Added Bagas Sanjaya's Reviewed-by to the documentation patch.
- Fixed the commit message of patch 2 to remove a reference to the now
non-existent sysfs setting.
- Added a self test which tests both "regular" busy poll and busy poll
with suspend enabled. This was added as patch 6 as requested by
Paolo. netdevsim was chosen instead of veth due to netdevsim's
pre-existing support for netdev-genl. See the commit message of
patch 6 for more details.
v2: https://lore.kernel.org/bpf/20241021015311.95468-1-jdamato@fastly.com/
- Cover letter updated, including a re-run of test data.
- Patch 1 rewritten to use netdev-genl instead of sysfs.
- Patch 3 updated with a comment added to napi_resume_irqs.
- Patch 4 rebased to apply now that commit b9ca079dd6b0 ("eventpoll:
Annotate data-race of busy_poll_usecs") has been picked up from VFS.
- Patch 6 updated the kernel documentation.
rfc -> v1:
- Cover letter updated to include more details.
- Patch 1 updated to remove the documentation added. This was moved to
patch 6 with the rest of the docs (see below).
- Patch 5 updated to fix an error uncovered by the kernel build robot.
See patch 5's changelog for more details.
- Patch 6 added which updates kernel documentation.
Joe Damato (2):
selftests: net: Add busy_poll_test
docs: networking: Describe irq suspension
Martin Karsten (4):
net: Add napi_struct parameter irq_suspend_timeout
net: Add control functions for irq suspension
eventpoll: Trigger napi_busy_loop, if prefer_busy_poll is set
eventpoll: Control irq suspension for prefer_busy_poll
Documentation/netlink/specs/netdev.yaml | 7 +
Documentation/networking/napi.rst | 170 ++++++++-
fs/eventpoll.c | 36 +-
include/linux/netdevice.h | 2 +
include/net/busy_poll.h | 3 +
include/uapi/linux/netdev.h | 1 +
net/core/dev.c | 41 +++
net/core/dev.h | 25 ++
net/core/netdev-genl-gen.c | 5 +-
net/core/netdev-genl.c | 12 +
tools/include/uapi/linux/netdev.h | 1 +
tools/testing/selftests/net/.gitignore | 1 +
tools/testing/selftests/net/Makefile | 3 +-
tools/testing/selftests/net/busy_poll_test.sh | 164 +++++++++
tools/testing/selftests/net/busy_poller.c | 328 ++++++++++++++++++
15 files changed, 792 insertions(+), 7 deletions(-)
create mode 100755 tools/testing/selftests/net/busy_poll_test.sh
create mode 100644 tools/testing/selftests/net/busy_poller.c
base-commit: dc7c381bb8649e3701ed64f6c3e55316675904d7
--
2.25.1