Build break was reported in the powerpc mailing list for next-20250218 with below errors
make[1]: Nothing to be done for 'all'.
BUILD_TARGET=/root/venkat/linux-next/tools/testing/selftests/powerpc/mm; mkdir -p $BUILD_TARGET; make OUTPUT=$BUILD_TARGET -k -C mm all
CC pkey_exec_prot
In file included from pkey_exec_prot.c:18:
/root/venkat/linux-next/tools/testing/selftests/powerpc/include/pkeys.h: In function ‘pkeys_unsupported’:
/root/venkat/linux-next/tools/testing/selftests/powerpc/include/pkeys.h:96:34: error: ‘PKEY_UNRESTRICTED’ undeclared (first use in this function)
96 | pkey = sys_pkey_alloc(0, PKEY_UNRESTRICTED);
| ^~~~~~~~~~~~~~~~~
https://lore.kernel.org/all/20250113170619.484698-2-yury.khrustalev@arm.com/ patchset
has been queued to arm64/for-next/pkey_unrestricted which is causing a build break
in the selftest/powerpc builds.
Commit 6d61527d931ba ("mm/pkey: Add PKEY_UNRESTRICTED macro") added a macro
PKEY_UNRESTRICTED to handle implicit literal value of 0x0 (which is "unrestricted").
Add the same to selftest/powerpc/pkeys.h to fix the reported build break.
Reported-by: Venkat Rao Bagalkote <venkat88(a)linux.ibm.com>
Closes: https://lore.kernel.org/lkml/3267ea6e-5a1a-4752-96ef-8351c912d386@linux.ibm…
Tested-by: Venkat Rao Bagalkote <venkat88(a)linux.ibm.com>
Signed-off-by: Madhavan Srinivasan <maddy(a)linux.ibm.com>
---
Catalin, can you take this fix via arm64/for-next/pkey_unrestricted?
Patch applies clean on top of arm64/for-next/pkey_unrestricted
tools/testing/selftests/powerpc/include/pkeys.h | 3 +++
1 file changed, 3 insertions(+)
diff --git a/tools/testing/selftests/powerpc/include/pkeys.h b/tools/testing/selftests/powerpc/include/pkeys.h
index c6d4063dd4f6..d6deb6ffa1b9 100644
--- a/tools/testing/selftests/powerpc/include/pkeys.h
+++ b/tools/testing/selftests/powerpc/include/pkeys.h
@@ -24,6 +24,9 @@
#undef PKEY_DISABLE_EXECUTE
#define PKEY_DISABLE_EXECUTE 0x4
+#undef PKEY_UNRESTRICTED
+#define PKEY_UNRESTRICTED 0x0
+
/* Older versions of libc do not define this */
#ifndef SEGV_PKUERR
#define SEGV_PKUERR 4
--
2.48.1
Hello Jens and guys,
This patchset fixes several issues(1, 2, 4) and consolidate & improve
the tests in the following ways:
- support shellcheck and fixes all warning
- misc cleanup
- improve cleanup code path(module load/unload, cleanup temp files)
- help to reuse the same test source code and scripts for other
projects(liburing[1], blktest, ...)
- add two stress tests for covering IO workloads vs. removing device &
killing ublk server, given buffer lifetime is one big thing for ublk-zc
[1] https://github.com/ming1/liburing/commits/ublk-zc
- just need one line change for overriding skip_code, libring uses 77 and
kselftests takes 4
Ming Lei (11):
selftests: ublk: make ublk_stop_io_daemon() more reliable
selftests: ublk: fix build failure
selftests: ublk: add --foreground command line
selftests: ublk: fix parsing '-a' argument
selftests: ublk: support shellcheck and fix all warning
selftests: ublk: don't pass ${dev_id} to _cleanup_test()
selftests: ublk: move zero copy feature check into _add_ublk_dev()
selftests: ublk: load/unload ublk_drv when preparing & cleaning up
tests
selftests: ublk: add one stress test for covering IO vs. removing
device
selftests: ublk: add stress test for covering IO vs. killing ublk
server
selftests: ublk: improve test usability
tools/testing/selftests/ublk/Makefile | 6 +
tools/testing/selftests/ublk/kublk.c | 43 +++--
tools/testing/selftests/ublk/kublk.h | 2 +
tools/testing/selftests/ublk/test_common.sh | 167 ++++++++++++++----
tools/testing/selftests/ublk/test_loop_01.sh | 13 +-
tools/testing/selftests/ublk/test_loop_02.sh | 14 +-
tools/testing/selftests/ublk/test_loop_03.sh | 16 +-
tools/testing/selftests/ublk/test_loop_04.sh | 14 +-
tools/testing/selftests/ublk/test_null_01.sh | 9 +-
.../testing/selftests/ublk/test_stress_01.sh | 47 +++++
.../testing/selftests/ublk/test_stress_02.sh | 47 +++++
11 files changed, 300 insertions(+), 78 deletions(-)
create mode 100755 tools/testing/selftests/ublk/test_stress_01.sh
create mode 100755 tools/testing/selftests/ublk/test_stress_02.sh
--
2.47.0
I never had much luck running mm selftests so I spent a few hours
digging into why.
Looks like most of the reason is missing SKIP checks, so this series is
just adding a bunch of those that I found. I did not do anything like
all of them, just the ones I spotted in gup_longterm, gup_test, mmap,
userfaultfd and memfd_secret.
It's a bit unfortunate to have to skip those tests when ftruncate()
fails, but I don't have time to dig deep enough into it to actually make
them pass. I have observed the issue on 9pfs and heard rumours that NFS
has a similar problem.
I'm now able to run these test groups successfully:
- mmap
- gup_test
- compaction
- migration
- page_frag
- userfaultfd
- mlock
I've never gone past "Waiting for hugetlb memory to get depleted", in
the hugetlb tests. I don't know if they are stuck or if they would
eventually work if I was patient enough (testing on a 1G machine). I
have not investigated further.
I had some issues with mlock tests failing due to -ENOSRCH from
mlock2(), I can no longer reproduce that though, things work OK now.
Of the remaining tests there may be others that work fine, but there's
no convenient way to survey the whole output of run_vmtests.sh so I'm
just going test by test here.
In my spare moments I am slowly chipping away at a setup to run these
tests continuously in a reasonably hermetic QEMU environment via
virtme-ng:
https://github.com/bjackman/linux/blob/5fad4b9c592290f38e0f8bc73c9abb9c99d8…
Hopefully that will eventually offer a way to provide a "canned"
environment where the tests are known to work, which can be fairly
easily reproduced by any developer.
Signed-off-by: Brendan Jackman <jackmanb(a)google.com>
---
Changes in v4:
- NOT ADDRESSED: still using errno==ENOENT as a hacky way to detect
buggy filesystems:
https://lore.kernel.org/all/CA+i-1C3srkh44tN8dMQ5aD-jhoksUkdEpa+mMfdDtDrPAU…
- Added some incomplete cleanups for the mlock tests.
- Fixed divide-by-zero error when running uffd-stress on <32cpu systems.
- Fixed misnamed nr_threads variable (now nr_parallel).
- Fixed reporting io_uring errors (retval instead of errno).
- Link to v3: https://lore.kernel.org/r/20250228-mm-selftests-v3-0-958e3b6f0203@google.com
Changes in v3:
- Added fix for userfaultfd tests.
- Dropped attempts to use sudo.
- Fixed garbage printf in uffd-stress.
(Added EXTRA_CFLAGS=-Werror FORCE_TARGETS=1 to my scripts to prevent
such errors happening again).
- Fixed missing newlines in ksft_test_result_skip() calls.
- Link to v2: https://lore.kernel.org/r/20250221-mm-selftests-v2-0-28c4d66383c5@google.com
Changes in v2 (Thanks to Dev for the reviews):
- Improve and cleanup some error messages
- Add some extra SKIPs
- Fix misnaming of nr_cpus variable in uffd tests
- Link to v1: https://lore.kernel.org/r/20250220-mm-selftests-v1-0-9bbf57d64463@google.com
---
Brendan Jackman (12):
selftests/mm: Report errno when things fail in gup_longterm
selftests/mm: Skip uffd-stress if userfaultfd not available
selftests/mm: Skip uffd-wp-mremap if userfaultfd not available
selftests/mm/uffd: Rename nr_cpus -> nr_parallel
selftests/mm: Print some details when uffd-stress gets bad params
selftests/mm: Don't fail uffd-stress if too many CPUs
selftests/mm: Skip map_populate on weird filesystems
selftests/mm: Skip gup_longterm tests on weird filesystems
selftests/mm: Drop unnecessary sudo usage
selftests/mm: Ensure uffd-wp-mremap gets pages of each size
selftests/mm: Skip mlock tests if nobody user can't read it
selftests/mm/mlock: Print error on failure
tools/testing/selftests/mm/gup_longterm.c | 45 +++++++++++++++++---------
tools/testing/selftests/mm/map_populate.c | 7 ++++
tools/testing/selftests/mm/mlock-random-test.c | 4 +--
tools/testing/selftests/mm/mlock2.h | 8 ++++-
tools/testing/selftests/mm/run_vmtests.sh | 27 ++++++++++++++--
tools/testing/selftests/mm/uffd-common.c | 8 ++---
tools/testing/selftests/mm/uffd-common.h | 2 +-
tools/testing/selftests/mm/uffd-stress.c | 42 +++++++++++++++---------
tools/testing/selftests/mm/uffd-unit-tests.c | 2 +-
tools/testing/selftests/mm/uffd-wp-mremap.c | 5 ++-
10 files changed, 105 insertions(+), 45 deletions(-)
---
base-commit: dcb38e6757f1b7944af9347ce6b54263d3666478
change-id: 20250220-mm-selftests-2d7d0542face
Best regards,
--
Brendan Jackman <jackmanb(a)google.com>
The mac address on backup slave should be convert from Solicited-Node
Multicast address, not from bonding unicast target address.
v4: no change, just repost.
v3: also fix the mac setting for slave_set_ns_maddr. (Jay)
Add function description for slave_set_ns_maddr/slave_set_ns_maddrs (Jay)
v2: fix patch 01's subject
Hangbin Liu (2):
bonding: fix incorrect MAC address setting to receive NS messages
selftests: bonding: fix incorrect mac address
drivers/net/bonding/bond_options.c | 55 ++++++++++++++++---
.../drivers/net/bonding/bond_options.sh | 4 +-
2 files changed, 49 insertions(+), 10 deletions(-)
--
2.46.0
Hello,
While trying to implement an eBPF gatekeeper program, we ran into an
issue whereas the LSM hooks are missing some relevant data.
Certain subcommands passed to the bpf() syscall can be invoked from
either the kernel or userspace. Additionally, some fields in the
bpf_attr struct contain pointers, and depending on where the
subcommand was invoked, they could point to either user or kernel
memory. One example of this is the bpf_prog_load subcommand and its
fd_array. This data is made available and used by the verifier but not
made available to the LSM subsystem. This patchset simply exposes that
information to applicable LSM hooks.
Change list:
- v6 -> v7
- use gettid/pid in lieu of getpid/tgid in test condition
- v5 -> v6
- fix regression caused by is_kernel renaming
- simplify test logic
- v4 -> v5
- merge v4 selftest breakout patch back into a single patch
- change "is_kernel" to "kernel"
- add selftest using new kernel flag
- v3 -> v4
- split out selftest changes into a separate patch
- v2 -> v3
- reorder params so that the new boolean flag is the last param
- fixup function signatures in bpf selftests
- v1 -> v2
- Pass a boolean flag in lieu of bpfptr_t
Revisions:
- v6
https://lore.kernel.org/bpf/20250308013314.719150-1-bboscaccy@linux.microso…
- v5
https://lore.kernel.org/bpf/20250307213651.3065714-1-bboscaccy@linux.micros…
- v4
https://lore.kernel.org/bpf/20250304203123.3935371-1-bboscaccy@linux.micros…
- v3
https://lore.kernel.org/bpf/20250303222416.3909228-1-bboscaccy@linux.micros…
- v2
https://lore.kernel.org/bpf/20250228165322.3121535-1-bboscaccy@linux.micros…
- v1
https://lore.kernel.org/bpf/20250226003055.1654837-1-bboscaccy@linux.micros…
Blaise Boscaccy (2):
security: Propagate caller information in bpf hooks
selftests/bpf: Add a kernel flag test for LSM bpf hook
include/linux/lsm_hook_defs.h | 6 +--
include/linux/security.h | 12 +++---
kernel/bpf/syscall.c | 10 ++---
security/security.c | 15 ++++---
security/selinux/hooks.c | 6 +--
.../selftests/bpf/prog_tests/kernel_flag.c | 43 +++++++++++++++++++
.../selftests/bpf/progs/rcu_read_lock.c | 3 +-
.../bpf/progs/test_cgroup1_hierarchy.c | 4 +-
.../selftests/bpf/progs/test_kernel_flag.c | 28 ++++++++++++
.../bpf/progs/test_kfunc_dynptr_param.c | 6 +--
.../selftests/bpf/progs/test_lookup_key.c | 2 +-
.../selftests/bpf/progs/test_ptr_untrusted.c | 2 +-
.../bpf/progs/test_task_under_cgroup.c | 2 +-
.../bpf/progs/test_verify_pkcs7_sig.c | 2 +-
14 files changed, 108 insertions(+), 33 deletions(-)
create mode 100644 tools/testing/selftests/bpf/prog_tests/kernel_flag.c
create mode 100644 tools/testing/selftests/bpf/progs/test_kernel_flag.c
--
2.48.1
A task in the kernel (task_mm_cid_work) runs somewhat periodically to
compact the mm_cid for each process. Add a test to validate that it runs
correctly and timely.
The test spawns 1 thread pinned to each CPU, then each thread, including
the main one, runs in short bursts for some time. During this period, the
mm_cids should be spanning all numbers between 0 and nproc.
At the end of this phase, a thread with high enough mm_cid (>= nproc/2)
is selected to be the new leader, all other threads terminate.
After some time, the only running thread should see 0 as mm_cid, if that
doesn't happen, the compaction mechanism didn't work and the test fails.
The test never fails if only 1 core is available, in which case, we
cannot test anything as the only available mm_cid is 0.
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers(a)efficios.com>
Signed-off-by: Gabriele Monaco <gmonaco(a)redhat.com>
---
tools/testing/selftests/rseq/.gitignore | 1 +
tools/testing/selftests/rseq/Makefile | 2 +-
.../selftests/rseq/mm_cid_compaction_test.c | 200 ++++++++++++++++++
3 files changed, 202 insertions(+), 1 deletion(-)
create mode 100644 tools/testing/selftests/rseq/mm_cid_compaction_test.c
diff --git a/tools/testing/selftests/rseq/.gitignore b/tools/testing/selftests/rseq/.gitignore
index 16496de5f6ce4..2c89f97e4f737 100644
--- a/tools/testing/selftests/rseq/.gitignore
+++ b/tools/testing/selftests/rseq/.gitignore
@@ -3,6 +3,7 @@ basic_percpu_ops_test
basic_percpu_ops_mm_cid_test
basic_test
basic_rseq_op_test
+mm_cid_compaction_test
param_test
param_test_benchmark
param_test_compare_twice
diff --git a/tools/testing/selftests/rseq/Makefile b/tools/testing/selftests/rseq/Makefile
index 5a3432fceb586..ce1b38f46a355 100644
--- a/tools/testing/selftests/rseq/Makefile
+++ b/tools/testing/selftests/rseq/Makefile
@@ -16,7 +16,7 @@ OVERRIDE_TARGETS = 1
TEST_GEN_PROGS = basic_test basic_percpu_ops_test basic_percpu_ops_mm_cid_test param_test \
param_test_benchmark param_test_compare_twice param_test_mm_cid \
- param_test_mm_cid_benchmark param_test_mm_cid_compare_twice
+ param_test_mm_cid_benchmark param_test_mm_cid_compare_twice mm_cid_compaction_test
TEST_GEN_PROGS_EXTENDED = librseq.so
diff --git a/tools/testing/selftests/rseq/mm_cid_compaction_test.c b/tools/testing/selftests/rseq/mm_cid_compaction_test.c
new file mode 100644
index 0000000000000..7ddde3b657dd6
--- /dev/null
+++ b/tools/testing/selftests/rseq/mm_cid_compaction_test.c
@@ -0,0 +1,200 @@
+// SPDX-License-Identifier: LGPL-2.1
+#define _GNU_SOURCE
+#include <assert.h>
+#include <pthread.h>
+#include <sched.h>
+#include <stdint.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <stddef.h>
+
+#include "../kselftest.h"
+#include "rseq.h"
+
+#define VERBOSE 0
+#define printf_verbose(fmt, ...) \
+ do { \
+ if (VERBOSE) \
+ printf(fmt, ##__VA_ARGS__); \
+ } while (0)
+
+/* 0.5 s */
+#define RUNNER_PERIOD 500000
+/* Number of runs before we terminate or get the token */
+#define THREAD_RUNS 5
+
+/*
+ * Number of times we check that the mm_cid were compacted.
+ * Checks are repeated every RUNNER_PERIOD.
+ */
+#define MM_CID_COMPACT_TIMEOUT 10
+
+struct thread_args {
+ int cpu;
+ int num_cpus;
+ pthread_mutex_t *token;
+ pthread_barrier_t *barrier;
+ pthread_t *tinfo;
+ struct thread_args *args_head;
+};
+
+static void __noreturn *thread_runner(void *arg)
+{
+ struct thread_args *args = arg;
+ int i, ret, curr_mm_cid;
+ cpu_set_t cpumask;
+
+ CPU_ZERO(&cpumask);
+ CPU_SET(args->cpu, &cpumask);
+ ret = pthread_setaffinity_np(pthread_self(), sizeof(cpumask), &cpumask);
+ if (ret) {
+ errno = ret;
+ perror("Error: failed to set affinity");
+ abort();
+ }
+ pthread_barrier_wait(args->barrier);
+
+ for (i = 0; i < THREAD_RUNS; i++)
+ usleep(RUNNER_PERIOD);
+ curr_mm_cid = rseq_current_mm_cid();
+ /*
+ * We select one thread with high enough mm_cid to be the new leader.
+ * All other threads (including the main thread) will terminate.
+ * After some time, the mm_cid of the only remaining thread should
+ * converge to 0, if not, the test fails.
+ */
+ if (curr_mm_cid >= args->num_cpus / 2 &&
+ !pthread_mutex_trylock(args->token)) {
+ printf_verbose(
+ "cpu%d has mm_cid=%d and will be the new leader.\n",
+ sched_getcpu(), curr_mm_cid);
+ for (i = 0; i < args->num_cpus; i++) {
+ if (args->tinfo[i] == pthread_self())
+ continue;
+ ret = pthread_join(args->tinfo[i], NULL);
+ if (ret) {
+ errno = ret;
+ perror("Error: failed to join thread");
+ abort();
+ }
+ }
+ pthread_barrier_destroy(args->barrier);
+ free(args->tinfo);
+ free(args->token);
+ free(args->barrier);
+ free(args->args_head);
+
+ for (i = 0; i < MM_CID_COMPACT_TIMEOUT; i++) {
+ curr_mm_cid = rseq_current_mm_cid();
+ printf_verbose("run %d: mm_cid=%d on cpu%d.\n", i,
+ curr_mm_cid, sched_getcpu());
+ if (curr_mm_cid == 0)
+ exit(EXIT_SUCCESS);
+ usleep(RUNNER_PERIOD);
+ }
+ exit(EXIT_FAILURE);
+ }
+ printf_verbose("cpu%d has mm_cid=%d and is going to terminate.\n",
+ sched_getcpu(), curr_mm_cid);
+ pthread_exit(NULL);
+}
+
+int test_mm_cid_compaction(void)
+{
+ cpu_set_t affinity;
+ int i, j, ret = 0, num_threads;
+ pthread_t *tinfo;
+ pthread_mutex_t *token;
+ pthread_barrier_t *barrier;
+ struct thread_args *args;
+
+ sched_getaffinity(0, sizeof(affinity), &affinity);
+ num_threads = CPU_COUNT(&affinity);
+ tinfo = calloc(num_threads, sizeof(*tinfo));
+ if (!tinfo) {
+ perror("Error: failed to allocate tinfo");
+ return -1;
+ }
+ args = calloc(num_threads, sizeof(*args));
+ if (!args) {
+ perror("Error: failed to allocate args");
+ ret = -1;
+ goto out_free_tinfo;
+ }
+ token = malloc(sizeof(*token));
+ if (!token) {
+ perror("Error: failed to allocate token");
+ ret = -1;
+ goto out_free_args;
+ }
+ barrier = malloc(sizeof(*barrier));
+ if (!barrier) {
+ perror("Error: failed to allocate barrier");
+ ret = -1;
+ goto out_free_token;
+ }
+ if (num_threads == 1) {
+ fprintf(stderr, "Cannot test on a single cpu. "
+ "Skipping mm_cid_compaction test.\n");
+ /* only skipping the test, this is not a failure */
+ goto out_free_barrier;
+ }
+ pthread_mutex_init(token, NULL);
+ ret = pthread_barrier_init(barrier, NULL, num_threads);
+ if (ret) {
+ errno = ret;
+ perror("Error: failed to initialise barrier");
+ goto out_free_barrier;
+ }
+ for (i = 0, j = 0; i < CPU_SETSIZE && j < num_threads; i++) {
+ if (!CPU_ISSET(i, &affinity))
+ continue;
+ args[j].num_cpus = num_threads;
+ args[j].tinfo = tinfo;
+ args[j].token = token;
+ args[j].barrier = barrier;
+ args[j].cpu = i;
+ args[j].args_head = args;
+ if (!j) {
+ /* The first thread is the main one */
+ tinfo[0] = pthread_self();
+ ++j;
+ continue;
+ }
+ ret = pthread_create(&tinfo[j], NULL, thread_runner, &args[j]);
+ if (ret) {
+ errno = ret;
+ perror("Error: failed to create thread");
+ abort();
+ }
+ ++j;
+ }
+ printf_verbose("Started %d threads.\n", num_threads);
+
+ /* Also main thread will terminate if it is not selected as leader */
+ thread_runner(&args[0]);
+
+ /* only reached in case of errors */
+out_free_barrier:
+ free(barrier);
+out_free_token:
+ free(token);
+out_free_args:
+ free(args);
+out_free_tinfo:
+ free(tinfo);
+
+ return ret;
+}
+
+int main(int argc, char **argv)
+{
+ if (!rseq_mm_cid_available()) {
+ fprintf(stderr, "Error: rseq_mm_cid unavailable\n");
+ return -1;
+ }
+ if (test_mm_cid_compaction())
+ return -1;
+ return 0;
+}
--
2.48.1
Replace comma between expressions with semicolons.
Using a ',' in place of a ';' can have unintended side effects.
Although that is not the case here, it is seems best to use ';'
unless ',' is intended.
Found by inspection.
No functional change intended.
Compile tested only.
Signed-off-by: Chen Ni <nichen(a)iscas.ac.cn>
---
tools/testing/selftests/bpf/prog_tests/fd_array.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/tools/testing/selftests/bpf/prog_tests/fd_array.c b/tools/testing/selftests/bpf/prog_tests/fd_array.c
index a1d52e73fb16..9add890c2d37 100644
--- a/tools/testing/selftests/bpf/prog_tests/fd_array.c
+++ b/tools/testing/selftests/bpf/prog_tests/fd_array.c
@@ -83,8 +83,8 @@ static inline int bpf_prog_get_map_ids(int prog_fd, __u32 *nr_map_ids, __u32 *ma
int err;
memset(&info, 0, len);
- info.nr_map_ids = *nr_map_ids,
- info.map_ids = ptr_to_u64(map_ids),
+ info.nr_map_ids = *nr_map_ids;
+ info.map_ids = ptr_to_u64(map_ids);
err = bpf_prog_get_info_by_fd(prog_fd, &info, &len);
if (!ASSERT_OK(err, "bpf_prog_get_info_by_fd"))
--
2.25.1
Notable changes since v20:
* removed newline at the end of message in NL_SET_ERR_MSG_FMT_MOD
* dropped udp_init() and related build_protos() and directly use
.encap_destroy [instead of overriding sk->close()]
* defered peer_del() call to worker in case of transport errors, as
we may be in non-sleepable context
* used kfree() instead of kfree_rcu() when releasing socket as we
just invoked synchronize_rcu()
* fix comment in ovpn_tcp_parse()
* invoked peer->tcp.sk_cb.prot->release_cb instead of explicitly
calling tcp_release_cb. This way we don't need to export it again
* moved call to peer_put() after call to peer->tcp.sk_cb.prot->close
* moved switch(mode) inside ovpn_peers_free()
* simplified skip logic in ovpn_peers_free()
* fixed check to avoid passing a negative delay to keepalive's
schedule_work()
Please note that some patches were already reviewed/tested by a few
people. These patches have retained the tags as they have hardly been
touched.
The latest code can also be found at:
https://github.com/OpenVPN/ovpn-net-next
Thanks a lot!
Best Regards,
Antonio Quartulli
OpenVPN Inc.
To: netdev(a)vger.kernel.org
To: Eric Dumazet <edumazet(a)google.com>
To: Jakub Kicinski <kuba(a)kernel.org>
To: Paolo Abeni <pabeni(a)redhat.com>
To: Donald Hunter <donald.hunter(a)gmail.com>
To: Antonio Quartulli <antonio(a)openvpn.net>
To: Shuah Khan <shuah(a)kernel.org>
To: sd(a)queasysnail.net
To: ryazanov.s.a(a)gmail.com
To: Andrew Lunn <andrew+netdev(a)lunn.ch>
Cc: Simon Horman <horms(a)kernel.org>
Cc: linux-kernel(a)vger.kernel.org
Cc: linux-kselftest(a)vger.kernel.org
Cc: Xiao Liang <shaw.leon(a)gmail.com>
Signed-off-by: Antonio Quartulli <antonio(a)openvpn.net>
---
Antonio Quartulli (24):
net: introduce OpenVPN Data Channel Offload (ovpn)
ovpn: add basic netlink support
ovpn: add basic interface creation/destruction/management routines
ovpn: keep carrier always on for MP interfaces
ovpn: introduce the ovpn_peer object
ovpn: introduce the ovpn_socket object
ovpn: implement basic TX path (UDP)
ovpn: implement basic RX path (UDP)
ovpn: implement packet processing
ovpn: store tunnel and transport statistics
ovpn: implement TCP transport
skb: implement skb_send_sock_locked_with_flags()
ovpn: add support for MSG_NOSIGNAL in tcp_sendmsg
ovpn: implement multi-peer support
ovpn: implement peer lookup logic
ovpn: implement keepalive mechanism
ovpn: add support for updating local UDP endpoint
ovpn: add support for peer floating
ovpn: implement peer add/get/dump/delete via netlink
ovpn: implement key add/get/del/swap via netlink
ovpn: kill key and notify userspace in case of IV exhaustion
ovpn: notify userspace when a peer is deleted
ovpn: add basic ethtool support
testing/selftests: add test tool and scripts for ovpn module
Documentation/netlink/specs/ovpn.yaml | 371 +++
Documentation/netlink/specs/rt_link.yaml | 16 +
MAINTAINERS | 11 +
drivers/net/Kconfig | 15 +
drivers/net/Makefile | 1 +
drivers/net/ovpn/Makefile | 22 +
drivers/net/ovpn/bind.c | 55 +
drivers/net/ovpn/bind.h | 101 +
drivers/net/ovpn/crypto.c | 211 ++
drivers/net/ovpn/crypto.h | 145 ++
drivers/net/ovpn/crypto_aead.c | 408 ++++
drivers/net/ovpn/crypto_aead.h | 33 +
drivers/net/ovpn/io.c | 462 ++++
drivers/net/ovpn/io.h | 34 +
drivers/net/ovpn/main.c | 339 +++
drivers/net/ovpn/main.h | 14 +
drivers/net/ovpn/netlink-gen.c | 213 ++
drivers/net/ovpn/netlink-gen.h | 41 +
drivers/net/ovpn/netlink.c | 1249 ++++++++++
drivers/net/ovpn/netlink.h | 18 +
drivers/net/ovpn/ovpnpriv.h | 57 +
drivers/net/ovpn/peer.c | 1352 +++++++++++
drivers/net/ovpn/peer.h | 163 ++
drivers/net/ovpn/pktid.c | 129 ++
drivers/net/ovpn/pktid.h | 87 +
drivers/net/ovpn/proto.h | 118 +
drivers/net/ovpn/skb.h | 61 +
drivers/net/ovpn/socket.c | 244 ++
drivers/net/ovpn/socket.h | 49 +
drivers/net/ovpn/stats.c | 21 +
drivers/net/ovpn/stats.h | 47 +
drivers/net/ovpn/tcp.c | 592 +++++
drivers/net/ovpn/tcp.h | 36 +
drivers/net/ovpn/udp.c | 442 ++++
drivers/net/ovpn/udp.h | 25 +
include/linux/skbuff.h | 2 +
include/uapi/linux/if_link.h | 15 +
include/uapi/linux/ovpn.h | 110 +
include/uapi/linux/udp.h | 1 +
net/core/skbuff.c | 18 +-
net/ipv6/af_inet6.c | 1 +
net/ipv6/udp.c | 1 +
tools/testing/selftests/Makefile | 1 +
tools/testing/selftests/net/ovpn/.gitignore | 2 +
tools/testing/selftests/net/ovpn/Makefile | 31 +
tools/testing/selftests/net/ovpn/common.sh | 92 +
tools/testing/selftests/net/ovpn/config | 10 +
tools/testing/selftests/net/ovpn/data64.key | 5 +
tools/testing/selftests/net/ovpn/ovpn-cli.c | 2395 ++++++++++++++++++++
tools/testing/selftests/net/ovpn/tcp_peers.txt | 5 +
.../testing/selftests/net/ovpn/test-chachapoly.sh | 9 +
.../selftests/net/ovpn/test-close-socket-tcp.sh | 9 +
.../selftests/net/ovpn/test-close-socket.sh | 45 +
tools/testing/selftests/net/ovpn/test-float.sh | 9 +
tools/testing/selftests/net/ovpn/test-tcp.sh | 9 +
tools/testing/selftests/net/ovpn/test.sh | 113 +
tools/testing/selftests/net/ovpn/udp_peers.txt | 5 +
57 files changed, 10065 insertions(+), 5 deletions(-)
---
base-commit: fb05579a176f7bccc8d279665fc0e46dfed43dfb
change-id: 20250304-b4-ovpn-tmp-153379e78603
Best regards,
--
Antonio Quartulli <antonio(a)openvpn.net>
┌────────────┐ ┌───────────────────────────────────┐ ┌────────────────┐
│ │ │ │ │ │
│ │ │ PCI Endpoint │ │ PCI Host │
│ │ │ │ │ │
│ │◄──┤ 1.platform_msi_domain_alloc_irqs()│ │ │
│ │ │ │ │ │
│ MSI ├──►│ 2.write_msi_msg() ├──►├─BAR<n> │
│ Controller │ │ update doorbell register address│ │ │
│ │ │ for BAR │ │ │
│ │ │ │ │ 3. Write BAR<n>│
│ │◄──┼───────────────────────────────────┼───┤ │
│ │ │ │ │ │
│ ├──►│ 4.Irq Handle │ │ │
│ │ │ │ │ │
│ │ │ │ │ │
└────────────┘ └───────────────────────────────────┘ └────────────────┘
This patches based on old https://lore.kernel.org/imx/20221124055036.1630573-1-Frank.Li@nxp.com/
Original patch only target to vntb driver. But actually it is common
method.
This patches add new API to pci-epf-core, so any EP driver can use it.
Previous v2 discussion here.
https://lore.kernel.org/imx/20230911220920.1817033-1-Frank.Li@nxp.com/
Changes in v15:
- rebase to v6.14-rc1
- fix build issue find by kernel test robot
- Link to v14: https://lore.kernel.org/r/20250207-ep-msi-v14-0-9671b136f2b8@nxp.com
Changes in v14:
Marc Zyngier raised concerns about adding DOMAIN_BUS_DEVICE_PCI_EP_MSI. As
a result, the approach has been reverted to the v9 method. However, there
are several improvements:
MSI now supports msi-map in addition to msi-parent.
- The struct device: id is used as the endpoint function (EPF) device
identity to map to the stream ID (sideband information).
- The EPC device tree source (DTS) utilizes msi-map to provide such
information.
- The EPF device's of_node is set to the EPC controller’s node. This
approach is commonly used for multi-function device (MFD) platform child
devices, allowing them to inherit properties from the MFD device’s DTS,
such as reset-cells and gpio-cells. This method is well-suited for the
current case, as the EPF is inherently created/binded to the EPC and
should inherit the EPC’s DTS node properties.
Additionally:
Since the basic IMX95 LUT support has already been merged into the
mainline, a DTS and driver increment patch is added to complete the
solution. The patch is rebased onto the latest linux-next tree and
aligned with the new pcitest framework.
- Link to v13: https://lore.kernel.org/r/20241218-ep-msi-v13-0-646e2192dc24@nxp.com
Changes in v13:
- Change to use DOMAIN_BUS_PCI_DEVICE_EP_MSI
- Change request id as func | vfunc << 3
- Remove IRQ_DOMAIN_MSI_IMMUTABLE
Thomas Gleixner:
I hope capture all your points in review comments. If missed, let me know.
- Link to v12: https://lore.kernel.org/r/20241211-ep-msi-v12-0-33d4532fa520@nxp.com
Changes in v12:
- Change to use IRQ_DOMAIN_MSI_IMMUTABLE and add help function
irq_domain_msi_is_immuatble().
- split PCI: endpoint: pci-ep-msi: Add MSI address/data pair mutable check to 3 patches
- Link to v11: https://lore.kernel.org/r/20241209-ep-msi-v11-0-7434fa8397bd@nxp.com
Changes in v11:
- Change to use MSI_FLAG_MSG_IMMUTABLE
- Link to v10: https://lore.kernel.org/r/20241204-ep-msi-v10-0-87c378dbcd6d@nxp.com
Changes in v10:
Thomas Gleixner:
There are big change in pci-ep-msi.c. I am sure if go on the
corrent path. The key improvement is remove only 1 function devices's
limitation.
I use new patch for imutable check, which relative additional
feature compared to base enablement patch.
- Remove patch Add msi_remove_device_irq_domain() in platform_device_msi_free_irqs_all()
- Add new patch irqchip/gic-v3-its: Avoid overwriting msi_prepare callback if provided by msi_domain_info
- Remove only support 1 endpoint function limiation.
- Create one MSI domain for each endpoint function devices.
- Use "msi-map" in pci ep controler node, instead of of msi-parent. first
argument is
(func_no << 8 | vfunc_no)
- Link to v9: https://lore.kernel.org/r/20241203-ep-msi-v9-0-a60dbc3f15dd@nxp.com
Changes in v9
- Add patch platform-msi: Add msi_remove_device_irq_domain() in platform_device_msi_free_irqs_all()
- Remove patch PCI: endpoint: Add pci_epc_get_fn() API for customizable filtering
- Remove API pci_epf_align_inbound_addr_lo_hi
- Move doorbell_alloc in to doorbell_enable function.
- Link to v8: https://lore.kernel.org/r/20241116-ep-msi-v8-0-6f1f68ffd1bb@nxp.com
Changes in v8:
- update helper function name to pci_epf_align_inbound_addr()
- Link to v7: https://lore.kernel.org/r/20241114-ep-msi-v7-0-d4ac7aafbd2c@nxp.com
Changes in v7:
- Add helper function pci_epf_align_addr();
- Link to v6: https://lore.kernel.org/r/20241112-ep-msi-v6-0-45f9722e3c2a@nxp.com
Changes in v6:
- change doorbell_addr to doorbell_offset
- use round_down()
- add Niklas's test by tag
- rebase to pci/endpoint
- Link to v5: https://lore.kernel.org/r/20241108-ep-msi-v5-0-a14951c0d007@nxp.com
Changes in v5:
- Move request_irq to epf test function driver for more flexiable user case
- Add fixed size bar handler
- Some minor improvememtn to see each patches's changelog.
- Link to v4: https://lore.kernel.org/r/20241031-ep-msi-v4-0-717da2d99b28@nxp.com
Changes in v4:
- Remove patch genirq/msi: Add cleanup guard define for msi_lock_descs()/msi_unlock_descs()
- Use new method to avoid compatible problem.
Add new command DOORBELL_ENABLE and DOORBELL_DISABLE.
pcitest -B send DOORBELL_ENABLE first, EP test function driver try to
remap one of BAR_N (except test register bar) to ITS MSI MMIO space. Old
driver don't support new command, so failure return, not side effect.
After test, DOORBELL_DISABLE command send out to recover original map, so
pcitest bar test can pass as normal.
- Other detail change see each patches's change log
- Link to v3: https://lore.kernel.org/r/20241015-ep-msi-v3-0-cedc89a16c1a@nxp.com
Change from v2 to v3
- Fixed manivannan's comments
- Move common part to pci-ep-msi.c and pci-ep-msi.h
- rebase to 6.12-rc1
- use RevID to distingiush old version
mkdir /sys/kernel/config/pci_ep/functions/pci_epf_test/func1
echo 16 > /sys/kernel/config/pci_ep/functions/pci_epf_test/func1/msi_interrupts
echo 0x080c > /sys/kernel/config/pci_ep/functions/pci_epf_test/func1/deviceid
echo 0x1957 > /sys/kernel/config/pci_ep/functions/pci_epf_test/func1/vendorid
echo 1 > /sys/kernel/config/pci_ep/functions/pci_epf_test/func1/revid
^^^^^^ to enable platform msi support.
ln -s /sys/kernel/config/pci_ep/functions/pci_epf_test/func1 /sys/kernel/config/pci_ep/controllers/4c380000.pcie-ep
- use new device ID, which identify support doorbell to avoid broken
compatility.
Enable doorbell support only for PCI_DEVICE_ID_IMX8_DB, while other devices
keep the same behavior as before.
EP side RC with old driver RC with new driver
PCI_DEVICE_ID_IMX8_DB no probe doorbell enabled
Other device ID doorbell disabled* doorbell disabled*
* Behavior remains unchanged.
Change from v1 to v2
- Add missed patch for endpont/pci-epf-test.c
- Move alloc and free to epc driver from epf.
- Provide general help function for EPC driver to alloc platform msi irq.
- Fixed manivannan's comments.
Signed-off-by: Frank Li <Frank.Li(a)nxp.com>
---
Frank Li (15):
platform-msi: Add msi_remove_device_irq_domain() in platform_device_msi_free_irqs_all()
irqdomain: Add IRQ_DOMAIN_FLAG_MSI_IMMUTABLE and irq_domain_is_msi_immutable()
irqchip/gic-v3-its: Set IRQ_DOMAIN_FLAG_MSI_IMMUTABLE for ITS
irqchip/gic-v3-its: Add support for device tree msi-map and msi-mask
PCI: endpoint: Set ID and of_node for function driver
PCI: endpoint: Add RC-to-EP doorbell support using platform MSI controller
PCI: endpoint: pci-ep-msi: Add MSI address/data pair mutable check
PCI: endpoint: Add pci_epf_align_inbound_addr() helper for address alignment
PCI: endpoint: pci-epf-test: Add doorbell test support
misc: pci_endpoint_test: Add doorbell test case
selftests: pci_endpoint: Add doorbell test case
pci: imx6: Add helper function imx_pcie_add_lut_by_rid()
pci: imx6: Add LUT setting for MSI/IOMMU in Endpoint mode
arm64: dts: imx95: Add msi-map for pci-ep device
arm64: dts: imx95-19x19-evk: Add PCIe1 endpoint function overlay file
arch/arm64/boot/dts/freescale/Makefile | 3 +
.../dts/freescale/imx95-19x19-evk-pcie1-ep.dtso | 21 ++++
arch/arm64/boot/dts/freescale/imx95.dtsi | 1 +
drivers/base/platform-msi.c | 1 +
drivers/irqchip/irq-gic-v3-its-msi-parent.c | 8 ++
drivers/irqchip/irq-gic-v3-its.c | 2 +-
drivers/misc/pci_endpoint_test.c | 81 +++++++++++++
drivers/pci/controller/dwc/pci-imx6.c | 25 ++--
drivers/pci/endpoint/Makefile | 1 +
drivers/pci/endpoint/functions/pci-epf-test.c | 132 +++++++++++++++++++++
drivers/pci/endpoint/pci-ep-msi.c | 90 ++++++++++++++
drivers/pci/endpoint/pci-epf-core.c | 48 ++++++++
include/linux/irqdomain.h | 7 ++
include/linux/pci-ep-msi.h | 28 +++++
include/linux/pci-epf.h | 21 ++++
include/uapi/linux/pcitest.h | 1 +
.../selftests/pci_endpoint/pci_endpoint_test.c | 25 ++++
17 files changed, 486 insertions(+), 9 deletions(-)
---
base-commit: 2014c95afecee3e76ca4a56956a936e23283f05b
change-id: 20241010-ep-msi-8b4cab33b1be
Best regards,
---
Frank Li <Frank.Li(a)nxp.com>
The first fixes setting incorrect skb->truesize.
When xdp-mb prog returns XDP_PASS, skb is allocated and initialized.
Currently, The truesize is calculated as BNXT_RX_PAGE_SIZE *
sinfo->nr_frags, but sinfo->nr_frags is flushed by napi_build_skb().
So, it stores sinfo before calling napi_build_skb() and then use it
for calculate truesize.
The second fixes kernel panic in the bnxt_queue_mem_alloc().
The bnxt_queue_mem_alloc() accesses rx ring descriptor.
rx ring descriptors are allocated when the interface is up and it's
freed when the interface is down.
So, if bnxt_queue_mem_alloc() is called when the interface is down,
kernel panic occurs.
This patch makes the bnxt_queue_mem_alloc() return -ENETDOWN if rx ring
descriptors are not allocated.
The third patch fixes kernel panic in the bnxt_queue_{start | stop}().
When a queue is restarted bnxt_queue_{start | stop}() are called.
These functions set MRU to 0 to stop packet flow and then to set up the
remaining things.
MRU variable is a member of vnic_info[] the first vnic_info is for
default and the second is for ntuple.
The first vnic_info is always allocated when interface is up, but the
second is allocated only when ntuple is enabled.
(ethtool -K eth0 ntuple <on | off>).
Currently, the bnxt_queue_{start | stop}() access
vnic_info[BNXT_VNIC_NTUPLE] regardless of whether ntuple is enabled or
not.
So kernel panic occurs.
This patch make the bnxt_queue_{start | stop}() use bp->nr_vnics instead
of BNXT_VNIC_NTUPLE.
The fourth patch fixes a warning due to checksum state.
The bnxt_rx_pkt() checks whether skb->ip_summed is not CHECKSUM_NONE
before updating ip_summed. if ip_summed is not CHECKSUM_NONE, it WARNS
about it. However, the bnxt_xdp_build_skb() is called in XDP-MB-PASS
path and it updates ip_summed earlier than bnxt_rx_pkt().
So, in the XDP-MB-PASS path, the bnxt_rx_pkt() always warns about
checksum.
Updating ip_summed at the bnxt_xdp_build_skb() is unnecessary and
duplicate, so it is removed.
The fifth patch fixes a kernel panic in the
bnxt_get_queue_stats{rx | tx}().
The bnxt_get_queue_stats{rx | tx}() callback functions are called when
a queue is resetting.
These internally access rx and tx rings without null check, but rings
are allocated and initialized when the interface is up.
So, these functions are called when the interface is down, it
occurs a kernel panic.
The sixth patch fixes memory leak in queue reset logic.
When a queue is resetting, tpa_info is allocated for the new queue and
tpa_info for an old queue is not used anymore.
So it should be freed, but not.
The seventh patch makes net_devmem_unbind_dmabuf() ignore -ENETDOWN.
When devmem socket is closed, net_devmem_unbind_dmabuf() is called to
unbind/release resources.
If interface is down, the driver returns -ENETDOWN.
The -ENETDOWN return value is not an actual error, because the interface
will release resources when the interface is down.
So, net_devmem_unbind_dmabuf() needs to ignore -ENETDOWN.
The last patch adds XDP testcases to
tools/testing/selftests/drivers/net/ping.py.
v3:
- Copy nr_frags instead of full copy. (1/8)
- Add Review tags from Somnath. (3/8)
- Add new patch for fixing kernel panic in the
bnxt_get_queue_stats{rx | tx}(). (5/8)
- Add new patch for fixing memory leak in queue reset. (6/8)
v2:
- Do not use num_frags in the bnxt_xdp_build_skb(). (1/6)
- Add Review tags from Somnath and Jakub. (2/6)
- Add new patch for fixing checksum warning. (4/6)
- Add new patch for fixing warning in net_devmem_unbind_dmabuf(). (5/6)
- Add new XDP testcases to ping.py (6/6)
Taehee Yoo (8):
eth: bnxt: fix truesize for mb-xdp-pass case
eth: bnxt: return fail if interface is down in bnxt_queue_mem_alloc()
eth: bnxt: do not use BNXT_VNIC_NTUPLE unconditionally in queue
restart logic
eth: bnxt: do not update checksum in bnxt_xdp_build_skb()
eth: bnxt: fix kernel panic in the bnxt_get_queue_stats{rx | tx}
eth: bnxt: fix memory leak in queue reset
net: devmem: do not WARN conditionally after netdev_rx_queue_restart()
selftests: drv-net: add xdp cases for ping.py
drivers/net/ethernet/broadcom/bnxt/bnxt.c | 25 ++-
drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.c | 13 +-
drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.h | 3 +-
net/core/devmem.c | 4 +-
tools/testing/selftests/drivers/net/ping.py | 200 ++++++++++++++++--
.../testing/selftests/net/lib/xdp_dummy.bpf.c | 6 +
6 files changed, 220 insertions(+), 31 deletions(-)
--
2.34.1
Hello,
While trying to implement an eBPF gatekeeper program, we ran into an
issue whereas the LSM hooks are missing some relevant data.
Certain subcommands passed to the bpf() syscall can be invoked from
either the kernel or userspace. Additionally, some fields in the
bpf_attr struct contain pointers, and depending on where the
subcommand was invoked, they could point to either user or kernel
memory. One example of this is the bpf_prog_load subcommand and its
fd_array. This data is made available and used by the verifier but not
made available to the LSM subsystem. This patchset simply exposes that
information to applicable LSM hooks.
Change list:
- v5 -> v6
- fix regression caused by is_kernel renaming
- simplify test logic
- v4 -> v5
- merge v4 selftest breakout patch back into a single patch
- change "is_kernel" to "kernel"
- add selftest using new kernel flag
- v3 -> v4
- split out selftest changes into a separate patch
- v2 -> v3
- reorder params so that the new boolean flag is the last param
- fixup function signatures in bpf selftests
- v1 -> v2
- Pass a boolean flag in lieu of bpfptr_t
Revisions:
- v5
https://lore.kernel.org/bpf/20250307213651.3065714-1-bboscaccy@linux.micros…
- v4
https://lore.kernel.org/bpf/20250304203123.3935371-1-bboscaccy@linux.micros…
- v3
https://lore.kernel.org/bpf/20250303222416.3909228-1-bboscaccy@linux.micros…
- v2
https://lore.kernel.org/bpf/20250228165322.3121535-1-bboscaccy@linux.micros…
- v1
https://lore.kernel.org/bpf/20250226003055.1654837-1-bboscaccy@linux.micros…
Blaise Boscaccy (2):
security: Propagate caller information in bpf hooks
selftests/bpf: Add a kernel flag test for LSM bpf hook
include/linux/lsm_hook_defs.h | 6 +--
include/linux/security.h | 12 +++---
kernel/bpf/syscall.c | 10 ++---
security/security.c | 15 ++++---
security/selinux/hooks.c | 6 +--
.../selftests/bpf/prog_tests/kernel_flag.c | 43 +++++++++++++++++++
.../selftests/bpf/progs/rcu_read_lock.c | 3 +-
.../bpf/progs/test_cgroup1_hierarchy.c | 4 +-
.../selftests/bpf/progs/test_kernel_flag.c | 28 ++++++++++++
.../bpf/progs/test_kfunc_dynptr_param.c | 6 +--
.../selftests/bpf/progs/test_lookup_key.c | 2 +-
.../selftests/bpf/progs/test_ptr_untrusted.c | 2 +-
.../bpf/progs/test_task_under_cgroup.c | 2 +-
.../bpf/progs/test_verify_pkcs7_sig.c | 2 +-
14 files changed, 108 insertions(+), 33 deletions(-)
create mode 100644 tools/testing/selftests/bpf/prog_tests/kernel_flag.c
create mode 100644 tools/testing/selftests/bpf/progs/test_kernel_flag.c
--
2.48.1
When an after-split folio is large and needs to be dropped due to EOF,
folio_put_refs(folio, folio_nr_pages(folio)) should be used to drop
all page cache refs. Otherwise, the folio will not be freed, causing
memory leak.
This leak would happen on a filesystem with blocksize > page_size and
a truncate is performed, where the blocksize makes folios split to
>0 order ones, causing truncated folios not being freed.
Fixes: c010d47f107f ("mm: thp: split huge page to any lower order pages")
Reported-by: Hugh Dickins <hughd(a)google.com>
Closes: https://lore.kernel.org/all/fcbadb7f-dd3e-21df-f9a7-2853b53183c4@google.com/
Cc: stable(a)vger.kernel.org
Signed-off-by: Zi Yan <ziy(a)nvidia.com>
---
mm/huge_memory.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 3d3ebdc002d5..373781b21e5c 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -3304,7 +3304,7 @@ static void __split_huge_page(struct page *page, struct list_head *list,
folio_account_cleaned(tail,
inode_to_wb(folio->mapping->host));
__filemap_remove_folio(tail, NULL);
- folio_put(tail);
+ folio_put_refs(tail, folio_nr_pages(tail));
} else if (!folio_test_anon(folio)) {
__xa_store(&folio->mapping->i_pages, tail->index,
tail, 0);
--
2.47.2
Hi all,
This patchset adds a new buddy allocator like (or non-uniform) large folio
split from a order-n folio to order-m with m < n. It reduces
1. the total number of after-split folios from 2^(n-m) to n-m+1;
2. the amount of memory needed for multi-index xarray split from 2^(n/6-m/6) to
n/6-m/6, assuming XA_CHUNK_SHIFT=6;
3. keep more large folios after a split from all order-m folios to
order-(n-1) to order-m folios.
For example, to split an order-9 to order-0, folio split generates 10
(or 11 for anonymous memory) folios instead of 512, allocates 1 xa_node
instead of 8, and leaves 1 order-8, 1 order-7, ..., 1 order-1 and 2 order-0
folios (or 4 order-0 for anonymous memory) instead of 512 order-0 folios.
Instead of duplicating existing split_huge_page*() code, __folio_split()
is introduced as the shared backend code for both
split_huge_page_to_list_to_order() and folio_split(). __folio_split()
can support both uniform split and buddy allocator like (or non-uniform) split.
All existing split_huge_page*() users can be gradually converted to use
folio_split() if possible. In this patchset, I converted
truncate_inode_partial_folio() to use folio_split().
xfstests quick group passed for both tmpfs and xfs.
It is on top of mm-everything-2025-02-26-03-56 with V8 reverted. It is ready to
be merged.
Changelog
===
From V8[11]:
1. Removed gfp parameter from xas_try_split() and GFP_NOWAIT is used all
the time. (per Baolin Wang)
2. Used __xas_init_node_for_split() instead of
__xas_alloc_node_for_split() and moved node allocation out. It fixed
a bug when xa_node is pre-allocated by xas_nomem() before
xas_try_split() is called without being initialized for split.
From V7[9]:
1. Fixed a wrong function name in lib/test_xarray.c.
2. Made __split_folio_to_order() never fail, since the old order check
is already done in __folio_split(). (per David Hildenbrand)
3. Fixed an issue reported by syzbot[10] by not dropping the original
folio during truncate.
4. Fixed a WARNING when READ_ONLY_THP_FOR_FS is enabled. (Thank David
Hildenbrand for reporting the issue)
5. Used two separate struct page* parameters, split_at and lock_at, to
specify at which subpage the non-uniform split happens and which subpage
to keep locked after the split, respectively. It improves code
readability.
From V6[8]:
1. Added an xarray function xas_try_split() to support iterative folio split,
removing the need of using xas_split_alloc() and xas_split(). The
function guarantees that at most one xa_node is allocated for each
call.
2. Added concrete numbers of after-split folios and xa_node savings to
cover letter, commit log. (per Andrew)
From V5[7]:
1. Split shmem to any lower order patches are in mm tree, so dropped
from this series.
2. Rename split_folio_at() to try_folio_split() to clarify that
non-uniform split will not be used if it is not supported.
From V4[6]:
1. Enabled shmem support in both uniform and buddy allocator like split
and added selftests for it.
2. Added functions to check if uniform split and buddy allocator like
split are supported for the given folio and order.
3. Made truncate fall back to uniform split if buddy allocator split is
not supported (CONFIG_READ_ONLY_THP_FOR_FS and FS without large folio).
4. Added the missing folio_clear_has_hwpoisoned() to
__split_unmapped_folio().
From V3[5]:
1. Used xas_split_alloc(GFP_NOWAIT) instead of xas_nomem(), since extra
operations inside xas_split_alloc() are needed for correctness.
2. Enabled folio_split() for shmem and no issue was found with xfstests
quick test group.
3. Split both ends of a truncate range in truncate_inode_partial_folio()
to avoid wasting memory in shmem truncate (per David Hildenbrand).
4. Removed page_in_folio_offset() since page_folio() does the same
thing.
5. Finished truncate related tests from xfstests quick test group on XFS and
tmpfs without issues.
6. Disabled buddy allocator like split on CONFIG_READ_ONLY_THP_FOR_FS
and FS without large folio. This check was missed in the prior
versions.
From V2[3]:
1. Incorporated all the feedback from Kirill[4].
2. Used GFP_NOWAIT for xas_nomem().
3. Tested the code path when xas_nomem() fails.
4. Added selftests for folio_split().
5. Fixed no THP config build error.
From V1[2]:
1. Split the original patch 1 into multiple ones for easy review (per
Kirill).
2. Added xas_destroy() to avoid memory leak.
3. Fixed nr_dropped not used error (per kernel test robot).
4. Added proper error handling when xas_nomem() fails to allocate memory
for xas_split() during buddy allocator like split.
From RFC[1]:
1. Merged backend code of split_huge_page_to_list_to_order() and
folio_split(). The same code is used for both uniform split and buddy
allocator like split.
2. Use xas_nomem() instead of xas_split_alloc() for folio_split().
3. folio_split() now leaves the first after-split folio unlocked,
instead of the one containing the given page, since
the caller of truncate_inode_partial_folio() locks and unlocks the
first folio.
4. Extended split_huge_page debugfs to use folio_split().
5. Added truncate_inode_partial_folio() as first user of folio_split().
Design
===
folio_split() splits a large folio in the same way as buddy allocator
splits a large free page for allocation. The purpose is to minimize the
number of folios after the split. For example, if user wants to free the
3rd subpage in a order-9 folio, folio_split() will split the order-9 folio
as:
O-0, O-0, O-0, O-0, O-2, O-3, O-4, O-5, O-6, O-7, O-8 if it is anon
O-1, O-0, O-0, O-2, O-3, O-4, O-5, O-6, O-7, O-9 if it is pagecache
Since anon folio does not support order-1 yet.
The split process is similar to existing approach:
1. Unmap all page mappings (split PMD mappings if exist);
2. Split meta data like memcg, page owner, page alloc tag;
3. Copy meta data in struct folio to sub pages, but instead of spliting
the whole folio into multiple smaller ones with the same order in a
shot, this approach splits the folio iteratively. Taking the example
above, this approach first splits the original order-9 into two order-8,
then splits left part of order-8 to two order-7 and so on;
4. Post-process split folios, like write mapping->i_pages for pagecache,
adjust folio refcounts, add split folios to corresponding list;
5. Remap split folios
6. Unlock split folios.
__split_unmapped_folio() and __split_folio_to_order() replace
__split_huge_page() and __split_huge_page_tail() respectively.
__split_unmapped_folio() uses different approaches to perform
uniform split and buddy allocator like split:
1. uniform split: one single call to __split_folio_to_order() is used to
uniformly split the given folio. All resulting folios are put back to
the list after split. The folio containing the given page is left to
caller to unlock and others are unlocked.
2. buddy allocator like (or non-uniform) split: (old_order - new_order) calls
to __split_folio_to_order() are used to split the given folio at order N to
order N-1. After each call, the target folio is changed to the one
containing the page, which is given as a folio_split() parameter.
After each call, folios not containing the page are put back to the list.
The folio containing the page is put back to the list when its order
is new_order. All folios are unlocked except the first folio, which
is left to caller to unlock.
Patch Overview
===
1. Patch 1 added a new xarray function xas_try_split() to perform
iterative xarray split.
2. Patch 2 added __split_unmapped_folio() and __split_folio_to_order() to
prepare for moving to new backend split code.
3. Patch 3 moved common code in split_huge_page_to_list_to_order() to
__folio_split().
4. Patch 4 added new folio_split() and made
split_huge_page_to_list_to_order() share the new
__split_unmapped_folio() with folio_split().
5. Patch 5 removed no longer used __split_huge_page() and
__split_huge_page_tail().
6. Patch 6 added a new in_folio_offset to split_huge_page debugfs for
folio_split() test.
7. Patch 7 used try_folio_split() for truncate operation.
8. Patch 8 added folio_split() tests.
Any comments and/or suggestions are welcome. Thanks.
[1] https://lore.kernel.org/linux-mm/20241008223748.555845-1-ziy@nvidia.com/
[2] https://lore.kernel.org/linux-mm/20241028180932.1319265-1-ziy@nvidia.com/
[3] https://lore.kernel.org/linux-mm/20241101150357.1752726-1-ziy@nvidia.com/
[4] https://lore.kernel.org/linux-mm/e6ppwz5t4p4kvir6eqzoto4y5fmdjdxdyvxvtw43nc…
[5] https://lore.kernel.org/linux-mm/20241205001839.2582020-1-ziy@nvidia.com/
[6] https://lore.kernel.org/linux-mm/20250106165513.104899-1-ziy@nvidia.com/
[7] https://lore.kernel.org/linux-mm/20250116211042.741543-1-ziy@nvidia.com/
[8] https://lore.kernel.org/linux-mm/20250205031417.1771278-1-ziy@nvidia.com/
[9] https://lore.kernel.org/linux-mm/20250211155034.268962-1-ziy@nvidia.com/
[10] https://lore.kernel.org/all/67af65cb.050a0220.21dd3.004a.GAE@google.com/
[11] https://lore.kernel.org/linux-mm/20250218235012.1542225-1-ziy@nvidia.com/
Zi Yan (8):
xarray: add xas_try_split() to split a multi-index entry
mm/huge_memory: add two new (not yet used) functions for folio_split()
mm/huge_memory: move folio split common code to __folio_split()
mm/huge_memory: add buddy allocator like (non-uniform) folio_split()
mm/huge_memory: remove the old, unused __split_huge_page()
mm/huge_memory: add folio_split() to debugfs testing interface
mm/truncate: use buddy allocator like folio split for truncate
operation
selftests/mm: add tests for folio_split(), buddy allocator like split
Documentation/core-api/xarray.rst | 14 +-
include/linux/huge_mm.h | 36 +
include/linux/xarray.h | 6 +
lib/test_xarray.c | 52 ++
lib/xarray.c | 131 ++-
mm/huge_memory.c | 755 ++++++++++++------
mm/truncate.c | 31 +-
tools/testing/radix-tree/Makefile | 1 +
.../selftests/mm/split_huge_page_test.c | 34 +-
9 files changed, 783 insertions(+), 277 deletions(-)
--
2.47.2
1. Issue
Syzkaller reported this issue [1].
2. Reproduce
We can reproduce this issue by using the test_sockmap_with_close_on_write()
test I provided in selftest, also you need to apply the following patch to
ensure 100% reproducibility (sleep after checking sock):
'''
static void sk_psock_verdict_data_ready(struct sock *sk)
{
.......
if (unlikely(!sock))
return;
+ if (!strcmp("test_progs", current->comm)) {
+ printk("sleep 2s to wait socket freed\n");
+ mdelay(2000);
+ printk("sleep end\n");
+ }
ops = READ_ONCE(sock->ops);
if (!ops || !ops->read_skb)
return;
}
'''
Then running './test_progs -v sockmap_basic', and if the kernel has KASAN
enabled [2], you will see the following warning:
'''
BUG: KASAN: slab-use-after-free in sk_psock_verdict_data_ready+0x29b/0x2d0
Read of size 8 at addr ffff88813a777020 by task test_progs/47055
Tainted: [O]=OOT_MODULE
Call Trace:
<TASK>
dump_stack_lvl+0x53/0x70
print_address_description.constprop.0+0x30/0x420
? sk_psock_verdict_data_ready+0x29b/0x2d0
print_report+0xb7/0x270
? sk_psock_verdict_data_ready+0x29b/0x2d0
? kasan_addr_to_slab+0xd/0xa0
? sk_psock_verdict_data_ready+0x29b/0x2d0
kasan_report+0xca/0x100
? sk_psock_verdict_data_ready+0x29b/0x2d0
sk_psock_verdict_data_ready+0x29b/0x2d0
unix_stream_sendmsg+0x4a6/0xa40
? __pfx_unix_stream_sendmsg+0x10/0x10
? fdget+0x2c1/0x3a0
__sys_sendto+0x39c/0x410
'''
3. Reason
'''
CPU0 CPU1
unix_stream_sendmsg(sk):
other = unix_peer(sk)
other->sk_data_ready(other):
socket *sock = sk->sk_socket
if (unlikely(!sock))
return;
close(other):
...
other->close()
free(socket)
READ_ONCE(sock->ops)
^
use 'sock' after free
'''
For TCP, UDP, or other protocols, we have already performed
rcu_read_lock() when the network stack receives packets in ip_input.c:
'''
ip_local_deliver_finish():
rcu_read_lock()
ip_protocol_deliver_rcu()
xxx_rcv
rcu_read_unlock()
'''
However, for Unix sockets, sk_data_ready is called directly from the
process context without rcu_read_lock() protection.
4. Solution
Based on the fact that the 'struct socket' is released using call_rcu(),
We add rcu_read_{un}lock() at the entrance and exit of our sk_data_ready.
It will not increase performance overhead, at least for TCP and UDP, they
are already in a relatively large critical section.
Of course, we can also add a custom callback for Unix sockets and call
rcu_read_lock() before calling _verdict_data_ready like this:
'''
if (sk_is_unix(sk))
sk->sk_data_ready = sk_psock_verdict_data_ready_rcu;
else
sk->sk_data_ready = sk_psock_verdict_data_ready;
sk_psock_verdict_data_ready_rcu():
rcu_read_lock()
sk_psock_verdict_data_ready()
rcu_read_unlock()
'''
However, this will cause too many branches, and it's not suitable to
distinguish network protocols in skmsg.c.
[1] https://syzkaller.appspot.com/bug?extid=dd90a702f518e0eac072
[2] https://syzkaller.appspot.com/text?tag=KernelConfig&x=1362a5aee630ff34
---
v1 -> v2:
1. Add Fixes tag.
2. Extend selftest of edge case for TCP/UDP sockets.
3. Add Reviewed-by and Acked-by tag.
https://lore.kernel.org/bpf/20250226132242.52663-1-jiayuan.chen@linux.dev/T…
Jiayuan Chen (3):
bpf, sockmap: avoid using sk_socket after free
selftests/bpf: Add socketpair to create_pair to support unix socket
selftests/bpf: Add edge case tests for sockmap
net/core/skmsg.c | 18 ++++--
.../selftests/bpf/prog_tests/socket_helpers.h | 13 +++-
.../selftests/bpf/prog_tests/sockmap_basic.c | 59 +++++++++++++++++++
3 files changed, 84 insertions(+), 6 deletions(-)
--
2.47.1
This series expands the XDP TX metadata framework to allow user
applications to pass per packet 64-bit launch time directly to the kernel
driver, requesting launch time hardware offload support. The XDP TX
metadata framework will not perform any clock conversion or packet
reordering.
Please note that the role of Tx metadata is just to pass the launch time,
not to enable the offload feature. Users will need to enable the launch
time hardware offload feature of the device by using the respective
command, such as the tc-etf command.
Although some devices use the tc-etf command to enable their launch time
hardware offload feature, xsk packets will not go through the etf qdisc.
Therefore, in my opinion, the launch time should always be based on the PTP
Hardware Clock (PHC). Thus, i did not include a clock ID to indicate the
clock source.
To simplify the test steps, I modified the xdp_hw_metadata bpf self-test
tool in such a way that it will set the launch time based on the offset
provided by the user and the value of the Receive Hardware Timestamp, which
is against the PHC. This will eliminate the need to discipline System Clock
with the PHC and then use clock_gettime() to get the time.
Please note that AF_XDP lacks a feedback mechanism to inform the
application if the requested launch time is invalid. So, users are expected
to familiar with the horizon of the launch time of the device they use and
not request a launch time that is beyond the horizon. Otherwise, the driver
might interpret the launch time incorrectly and react wrongly. For stmmac
and igc, where modulo computation is used, a launch time larger than the
horizon will cause the device to transmit the packet earlier that the
requested launch time.
Although there is no feedback mechanism for the launch time request
for now, user still can check whether the requested launch time is
working or not, by requesting the Transmit Completion Hardware Timestamp.
v12:
- Fix the comment in include/uapi/linux/if_xdp.h to allign with what is
generated by ./tools/net/ynl/ynl-regen.sh to avoid dirty tree error in
the netdev/ynl checks.
v11: https://lore.kernel.org/netdev/20250216074302.956937-1-yoong.siang.song@int…
- regenerate netdev_xsk_flags based on latest netdev.yaml (Jakub)
v10: https://lore.kernel.org/netdev/20250207021943.814768-1-yoong.siang.song@int…
- use net_err_ratelimited(), instead of net_ratelimit() (Maciej)
- accumulate the amount of used descs in local variable and update the
igc_metadata_request::used_desc once (Maciej)
- Ensure reverse christmas tree rule (Maciej)
V9: https://lore.kernel.org/netdev/20250206060408.808325-1-yoong.siang.song@int…
- Remove the igc_desc_unused() checking (Maciej)
- Ensure that skb allocation and DMA mapping work before proceeding to
fill in igc_tx_buffer info, context desc, and data desc (Maciej)
- Rate limit the error messages (Maciej)
- Update the comment to indicate that the 2 descriptors needed by the
empty frame are already taken into consideration (Maciej)
- Handle the case where the insertion of an empty frame fails and
explain the reason behind (Maciej)
- put self SOB tag as last tag (Maciej)
V8: https://lore.kernel.org/netdev/20250205024116.798862-1-yoong.siang.song@int…
- check the number of used descriptor in xsk_tx_metadata_request()
by using used_desc of struct igc_metadata_request, and then decreases
the budget with it (Maciej)
- submit another bug fix patch to set the buffer type for empty frame (Maciej):
https://lore.kernel.org/netdev/20250205023603.798819-1-yoong.siang.song@int…
V7: https://lore.kernel.org/netdev/20250204004907.789330-1-yoong.siang.song@int…
- split the refactoring code of igc empty packet insertion into a separate
commit (Faizal)
- add explanation on why the value "4" is used as igc transmit budget
(Faizal)
- perform a stress test by sending 1000 packets with 10ms interval and
launch time set to 500us in the future (Faizal & Yong Liang)
V6: https://lore.kernel.org/netdev/20250116155350.555374-1-yoong.siang.song@int…
- fix selftest build errors by using asprintf() and realloc(), instead of
managing the buffer sizes manually (Daniel, Stanislav)
V5: https://lore.kernel.org/netdev/20250114152718.120588-1-yoong.siang.song@int…
- change netdev feature name from tx-launch-time to tx-launch-time-fifo
to explicitly state the FIFO behaviour (Stanislav)
- improve the looping of xdp_hw_metadata app to wait for packet tx
completion to be more readable by using clock_gettime() (Stanislav)
- add launch time setup steps into xdp_hw_metadata app (Stanislav)
V4: https://lore.kernel.org/netdev/20250106135506.9687-1-yoong.siang.song@intel…
- added XDP launch time support to the igc driver (Jesper & Florian)
- added per-driver launch time limitation on xsk-tx-metadata.rst (Jesper)
- added explanation on FIFO behavior on xsk-tx-metadata.rst (Jakub)
- added step to enable launch time in the commit message (Jesper & Willem)
- explicitly documented the type of launch_time and which clock source
it is against (Willem)
V3: https://lore.kernel.org/netdev/20231203165129.1740512-1-yoong.siang.song@in…
- renamed to use launch time (Jesper & Willem)
- changed the default launch time in xdp_hw_metadata apps from 1s to 0.1s
because some NICs do not support such a large future time.
V2: https://lore.kernel.org/netdev/20231201062421.1074768-1-yoong.siang.song@in…
- renamed to use Earliest TxTime First (Willem)
- renamed to use txtime (Willem)
V1: https://lore.kernel.org/netdev/20231130162028.852006-1-yoong.siang.song@int…
Song Yoong Siang (5):
xsk: Add launch time hardware offload support to XDP Tx metadata
selftests/bpf: Add launch time request to xdp_hw_metadata
net: stmmac: Add launch time support to XDP ZC
igc: Refactor empty frame insertion for launch time support
igc: Add launch time support to XDP ZC
Documentation/netlink/specs/netdev.yaml | 4 +
Documentation/networking/xsk-tx-metadata.rst | 62 +++++++
drivers/net/ethernet/intel/igc/igc.h | 1 +
drivers/net/ethernet/intel/igc/igc_main.c | 143 +++++++++++----
drivers/net/ethernet/stmicro/stmmac/stmmac.h | 2 +
.../net/ethernet/stmicro/stmmac/stmmac_main.c | 13 ++
include/net/xdp_sock.h | 10 ++
include/net/xdp_sock_drv.h | 1 +
include/uapi/linux/if_xdp.h | 10 ++
include/uapi/linux/netdev.h | 3 +
net/core/netdev-genl.c | 2 +
net/xdp/xsk.c | 3 +
tools/include/uapi/linux/if_xdp.h | 10 ++
tools/include/uapi/linux/netdev.h | 3 +
tools/testing/selftests/bpf/xdp_hw_metadata.c | 168 +++++++++++++++++-
15 files changed, 396 insertions(+), 39 deletions(-)
--
2.34.1
Hi all,
This patch series continues the work to migrate the script tests into
prog_tests.
test_lwt_seg6local.sh tests some bpf_lwt_* helpers. It contains only one
test that uses a network topology quite different than the ones that
can be found in others prog_tests/lwt_*.c files so I add a new
prog_tests/lwt_seg6local.c file.
While working on the migration I noticed that some routes present in the
script weren't needed so PATCH 1 deletes them and then PATCH 2 migrates
the test into the test_progs framework.
Signed-off-by: Bastien Curutchet (eBPF Foundation) <bastien.curutchet(a)bootlin.com>
---
Bastien Curutchet (eBPF Foundation) (2):
selftests/bpf: lwt_seg6local: Remove unused routes
selftests/bpf: lwt_seg6local: Move test to test_progs
tools/testing/selftests/bpf/Makefile | 1 -
.../selftests/bpf/prog_tests/lwt_seg6local.c | 176 +++++++++++++++++++++
tools/testing/selftests/bpf/test_lwt_seg6local.sh | 156 ------------------
3 files changed, 176 insertions(+), 157 deletions(-)
---
base-commit: 86eb3a47230a41c6ccf5cdae8ee0a7e7292aa29d
change-id: 20250214-seg6local-64bcde44b66e
Best regards,
--
Bastien Curutchet (eBPF Foundation) <bastien.curutchet(a)bootlin.com>
A bug was identified where the KTAP below caused an infinite loop:
TAP version 13
ok 4 test_case
1..4
The infinite loop was caused by the parser not parsing a test plan
if following a test result line.
Fix this bug to correctly parse test plan line.
Signed-off-by: Rae Moar <rmoar(a)google.com>
---
Changes since v1:
- Remove error reported when test plan is missing.
tools/testing/kunit/kunit_parser.py | 9 ++++-----
1 file changed, 4 insertions(+), 5 deletions(-)
diff --git a/tools/testing/kunit/kunit_parser.py b/tools/testing/kunit/kunit_parser.py
index 29fc27e8949b..da53a709773a 100644
--- a/tools/testing/kunit/kunit_parser.py
+++ b/tools/testing/kunit/kunit_parser.py
@@ -759,7 +759,7 @@ def parse_test(lines: LineStream, expected_num: int, log: List[str], is_subtest:
# If parsing the main/top-level test, parse KTAP version line and
# test plan
test.name = "main"
- ktap_line = parse_ktap_header(lines, test, printer)
+ parse_ktap_header(lines, test, printer)
test.log.extend(parse_diagnostic(lines))
parse_test_plan(lines, test)
parent_test = True
@@ -768,13 +768,12 @@ def parse_test(lines: LineStream, expected_num: int, log: List[str], is_subtest:
# the KTAP version line and/or subtest header line
ktap_line = parse_ktap_header(lines, test, printer)
subtest_line = parse_test_header(lines, test)
+ test.log.extend(parse_diagnostic(lines))
+ parse_test_plan(lines, test)
parent_test = (ktap_line or subtest_line)
if parent_test:
- # If KTAP version line and/or subtest header is found, attempt
- # to parse test plan and print test header
- test.log.extend(parse_diagnostic(lines))
- parse_test_plan(lines, test)
print_test_header(test, printer)
+
expected_count = test.expected_count
subtests = []
test_num = 1
base-commit: 0619a4868fc1b32b07fb9ed6c69adc5e5cf4e4b2
--
2.49.0.rc0.332.g42c0ae87b1-goog
Hello,
While trying to implement an eBPF gatekeeper program, we ran into an
issue whereas the LSM hooks are missing some relevant data.
Certain subcommands passed to the bpf() syscall can be invoked from
either the kernel or userspace. Additionally, some fields in the
bpf_attr struct contain pointers, and depending on where the
subcommand was invoked, they could point to either user or kernel
memory. One example of this is the bpf_prog_load subcommand and its
fd_array. This data is made available and used by the verifier but not
made available to the LSM subsystem. This patchset simply exposes that
information to applicable LSM hooks.
Change list:
- v4 -> v5
- merge v4 selftest breakout patch back into a single patch
- change "is_kernel" to "kernel"
- add selftest using new kernel flag
- v3 -> v4
- split out selftest changes into a separate patch
- v2 -> v3
- reorder params so that the new boolean flag is the last param
- fixup function signatures in bpf selftests
- v1 -> v2
- Pass a boolean flag in lieu of bpfptr_t
Revisions:
- v4
https://lore.kernel.org/bpf/20250304203123.3935371-1-bboscaccy@linux.micros…
- v3
https://lore.kernel.org/bpf/20250303222416.3909228-1-bboscaccy@linux.micros…
- v2
https://lore.kernel.org/bpf/20250228165322.3121535-1-bboscaccy@linux.micros…
- v1
https://lore.kernel.org/bpf/20250226003055.1654837-1-bboscaccy@linux.micros…
Blaise Boscaccy (2):
security: Propagate caller information in bpf hooks
selftests/bpf: Add a kernel flag test for LSM bpf hook
include/linux/lsm_hook_defs.h | 6 +--
include/linux/security.h | 12 +++---
kernel/bpf/syscall.c | 10 ++---
security/security.c | 15 ++++---
security/selinux/hooks.c | 6 +--
.../selftests/bpf/prog_tests/kernel_flag.c | 43 +++++++++++++++++++
.../selftests/bpf/progs/rcu_read_lock.c | 3 +-
.../bpf/progs/test_cgroup1_hierarchy.c | 4 +-
.../selftests/bpf/progs/test_kernel_flag.c | 32 ++++++++++++++
.../bpf/progs/test_kfunc_dynptr_param.c | 6 +--
.../selftests/bpf/progs/test_lookup_key.c | 2 +-
.../selftests/bpf/progs/test_ptr_untrusted.c | 2 +-
.../bpf/progs/test_task_under_cgroup.c | 2 +-
.../bpf/progs/test_verify_pkcs7_sig.c | 2 +-
14 files changed, 112 insertions(+), 33 deletions(-)
create mode 100644 tools/testing/selftests/bpf/prog_tests/kernel_flag.c
create mode 100644 tools/testing/selftests/bpf/progs/test_kernel_flag.c
--
2.48.1
This is one of just 3 remaining "Test Module" kselftests (the others
being bitmap and scanf), the rest having been converted to KUnit.
I tested this using:
$ tools/testing/kunit/kunit.py run --arch arm64 --make_options LLVM=1 printf
I have also sent out a series converting scanf[0].
Link: https://lore.kernel.org/all/20250204-scanf-kunit-convert-v3-0-386d7c3ee714@… [0]
Signed-off-by: Tamir Duberstein <tamird(a)gmail.com>
---
Changes in v5:
- Update `do_test` `__printf` annotation (Rasmus Villemoes).
- Link to v4: https://lore.kernel.org/r/20250214-printf-kunit-convert-v4-0-c254572f1565@g…
Changes in v4:
- Add patch "implicate test line in failure messages".
- Rebase on linux-next, move scanf_kunit.c into lib/tests/.
- Link to v3: https://lore.kernel.org/r/20250210-printf-kunit-convert-v3-0-ee6ac5500f5e@g…
Changes in v3:
- Remove extraneous trailing newlines from failure messages.
- Replace `pr_warn` with `kunit_warn`.
- Drop arch changes.
- Remove KUnit boilerplate from CONFIG_PRINTF_KUNIT_TEST help text.
- Restore `total_tests` counting.
- Remove tc_fail macro in last patch.
- Link to v2: https://lore.kernel.org/r/20250207-printf-kunit-convert-v2-0-057b23860823@g…
Changes in v2:
- Incorporate code review from prior work[0] by Arpitha Raghunandan.
- Link to v1: https://lore.kernel.org/r/20250204-printf-kunit-convert-v1-0-ecf1b846a4de@g…
Link: https://lore.kernel.org/lkml/20200817043028.76502-1-98.arpi@gmail.com/t/#u [0]
---
Tamir Duberstein (3):
printf: convert self-test to KUnit
printf: break kunit into test cases
printf: implicate test line in failure messages
Documentation/core-api/printk-formats.rst | 4 +-
MAINTAINERS | 2 +-
lib/Kconfig.debug | 12 +-
lib/Makefile | 1 -
lib/tests/Makefile | 1 +
lib/{test_printf.c => tests/printf_kunit.c} | 437 ++++++++++++----------------
tools/testing/selftests/lib/config | 1 -
tools/testing/selftests/lib/printf.sh | 4 -
8 files changed, 200 insertions(+), 262 deletions(-)
---
base-commit: d4b0fd87ff0d4338b259dc79b2b3c6f7e70e8afa
change-id: 20250131-printf-kunit-convert-fd4012aa2ec6
Best regards,
--
Tamir Duberstein <tamird(a)gmail.com>
virtio-net have two usage of hashes: one is RSS and another is hash
reporting. Conventionally the hash calculation was done by the VMM.
However, computing the hash after the queue was chosen defeats the
purpose of RSS.
Another approach is to use eBPF steering program. This approach has
another downside: it cannot report the calculated hash due to the
restrictive nature of eBPF.
Introduce the code to compute hashes to the kernel in order to overcome
thse challenges.
An alternative solution is to extend the eBPF steering program so that it
will be able to report to the userspace, but it is based on context
rewrites, which is in feature freeze. We can adopt kfuncs, but they will
not be UAPIs. We opt to ioctl to align with other relevant UAPIs (KVM
and vhost_net).
The patches for QEMU to use this new feature was submitted as RFC and
is available at:
https://patchew.org/QEMU/20240915-hash-v3-0-79cb08d28647@daynix.com/
This work was presented at LPC 2024:
https://lpc.events/event/18/contributions/1963/
V1 -> V2:
Changed to introduce a new BPF program type.
Signed-off-by: Akihiko Odaki <akihiko.odaki(a)daynix.com>
---
Changes in v8:
- Disabled IPv6 to eliminate noises in tests.
- Added a branch in tap to avoid unnecessary dissection when hash
reporting is disabled.
- Removed unnecessary rtnl_lock().
- Extracted code to handle new ioctls into separate functions to avoid
adding extra NULL checks to the code handling other ioctls.
- Introduced variable named "fd" to __tun_chr_ioctl().
- s/-/=/g in a patch message to avoid confusing Git.
- Link to v7: https://lore.kernel.org/r/20250228-rss-v7-0-844205cbbdd6@daynix.com
Changes in v7:
- Ensured to set hash_report to VIRTIO_NET_HASH_REPORT_NONE for
VHOST_NET_F_VIRTIO_NET_HDR.
- s/4/sizeof(u32)/ in patch "virtio_net: Add functions for hashing".
- Added tap_skb_cb type.
- Rebased.
- Link to v6: https://lore.kernel.org/r/20250109-rss-v6-0-b1c90ad708f6@daynix.com
Changes in v6:
- Extracted changes to fill vnet header holes into another series.
- Squashed patches "skbuff: Introduce SKB_EXT_TUN_VNET_HASH", "tun:
Introduce virtio-net hash reporting feature", and "tun: Introduce
virtio-net RSS" into patch "tun: Introduce virtio-net hash feature".
- Dropped the RFC tag.
- Link to v5: https://lore.kernel.org/r/20241008-rss-v5-0-f3cf68df005d@daynix.com
Changes in v5:
- Fixed a compilation error with CONFIG_TUN_VNET_CROSS_LE.
- Optimized the calculation of the hash value according to:
https://git.dpdk.org/dpdk/commit/?id=3fb1ea032bd6ff8317af5dac9af901f1f324ca…
- Added patch "tun: Unify vnet implementation".
- Dropped patch "tap: Pad virtio header with zero".
- Added patch "selftest: tun: Test vnet ioctls without device".
- Reworked selftests to skip for older kernels.
- Documented the case when the underlying device is deleted and packets
have queue_mapping set by TC.
- Reordered test harness arguments.
- Added code to handle fragmented packets.
- Link to v4: https://lore.kernel.org/r/20240924-rss-v4-0-84e932ec0e6c@daynix.com
Changes in v4:
- Moved tun_vnet_hash_ext to if_tun.h.
- Renamed virtio_net_toeplitz() to virtio_net_toeplitz_calc().
- Replaced htons() with cpu_to_be16().
- Changed virtio_net_hash_rss() to return void.
- Reordered variable declarations in virtio_net_hash_rss().
- Removed virtio_net_hdr_v1_hash_from_skb().
- Updated messages of "tap: Pad virtio header with zero" and
"tun: Pad virtio header with zero".
- Fixed vnet_hash allocation size.
- Ensured to free vnet_hash when destructing tun_struct.
- Link to v3: https://lore.kernel.org/r/20240915-rss-v3-0-c630015db082@daynix.com
Changes in v3:
- Reverted back to add ioctl.
- Split patch "tun: Introduce virtio-net hashing feature" into
"tun: Introduce virtio-net hash reporting feature" and
"tun: Introduce virtio-net RSS".
- Changed to reuse hash values computed for automq instead of performing
RSS hashing when hash reporting is requested but RSS is not.
- Extracted relevant data from struct tun_struct to keep it minimal.
- Added kernel-doc.
- Changed to allow calling TUNGETVNETHASHCAP before TUNSETIFF.
- Initialized num_buffers with 1.
- Added a test case for unclassified packets.
- Fixed error handling in tests.
- Changed tests to verify that the queue index will not overflow.
- Rebased.
- Link to v2: https://lore.kernel.org/r/20231015141644.260646-1-akihiko.odaki@daynix.com
---
Akihiko Odaki (6):
virtio_net: Add functions for hashing
net: flow_dissector: Export flow_keys_dissector_symmetric
tun: Introduce virtio-net hash feature
selftest: tun: Test vnet ioctls without device
selftest: tun: Add tests for virtio-net hashing
vhost/net: Support VIRTIO_NET_F_HASH_REPORT
Documentation/networking/tuntap.rst | 7 +
drivers/net/Kconfig | 1 +
drivers/net/tap.c | 67 +++-
drivers/net/tun.c | 98 +++++-
drivers/net/tun_vnet.h | 159 ++++++++-
drivers/vhost/net.c | 49 +--
include/linux/if_tap.h | 2 +
include/linux/skbuff.h | 3 +
include/linux/virtio_net.h | 188 ++++++++++
include/net/flow_dissector.h | 1 +
include/uapi/linux/if_tun.h | 75 ++++
net/core/flow_dissector.c | 3 +-
net/core/skbuff.c | 4 +
tools/testing/selftests/net/Makefile | 2 +-
tools/testing/selftests/net/tun.c | 656 ++++++++++++++++++++++++++++++++++-
15 files changed, 1254 insertions(+), 61 deletions(-)
---
base-commit: dd83757f6e686a2188997cb58b5975f744bb7786
change-id: 20240403-rss-e737d89efa77
prerequisite-change-id: 20241230-tun-66e10a49b0c7:v6
prerequisite-patch-id: 871dc5f146fb6b0e3ec8612971a8e8190472c0fb
prerequisite-patch-id: 2797ed249d32590321f088373d4055ff3f430a0e
prerequisite-patch-id: ea3370c72d4904e2f0536ec76ba5d26784c0cede
prerequisite-patch-id: 837e4cf5d6b451424f9b1639455e83a260c4440d
prerequisite-patch-id: ea701076f57819e844f5a35efe5cbc5712d3080d
prerequisite-patch-id: 701646fb43ad04cc64dd2bf13c150ccbe6f828ce
prerequisite-patch-id: 53176dae0c003f5b6c114d43f936cf7140d31bb5
prerequisite-change-id: 20250116-buffers-96e14bf023fc:v2
prerequisite-patch-id: 25fd4f99d4236a05a5ef16ab79f3e85ee57e21cc
Best regards,
--
Akihiko Odaki <akihiko.odaki(a)daynix.com>
This series adds a fix for KVM PMU code and improves the pmu selftest
by allowing generating precise number of interrupts. It also provided
another additional option to the overflow test that allows user to
generate custom number of LCOFI interrupts.
Signed-off-by: Atish Patra <atishp(a)rivosinc.com>
---
Changes in v2:
- Initialized the local overflow irq variable to 0 indicate that it's not a
allowed value.
- Moved the introduction of argument option `n` to the last patch.
- Link to v1: https://lore.kernel.org/r/20250226-kvm_pmu_improve-v1-0-74c058c2bf6d@rivosi…
---
Atish Patra (4):
RISC-V: KVM: Disable the kernel perf counter during configure
KVM: riscv: selftests: Do not start the counter in the overflow handler
KVM: riscv: selftests: Change command line option
KVM: riscv: selftests: Allow number of interrupts to be configurable
arch/riscv/kvm/vcpu_pmu.c | 1 +
tools/testing/selftests/kvm/riscv/sbi_pmu_test.c | 81 ++++++++++++++++--------
2 files changed, 57 insertions(+), 25 deletions(-)
---
base-commit: 0ad2507d5d93f39619fc42372c347d6006b64319
change-id: 20250225-kvm_pmu_improve-fffd038b2404
--
Regards,
Atish patra
The first fixes setting incorrect skb->truesize.
When xdp-mb prog returns XDP_PASS, skb is allocated and initialized.
Currently, The truesize is calculated as BNXT_RX_PAGE_SIZE *
sinfo->nr_frags, but sinfo->nr_frags is flushed by napi_build_skb().
So, it stores sinfo before calling napi_build_skb() and then use it
for calculate truesize.
The second fixes kernel panic in the bnxt_queue_mem_alloc().
The bnxt_queue_mem_alloc() accesses rx ring descriptor.
rx ring descriptors are allocated when the interface is up and it's
freed when the interface is down.
So, if bnxt_queue_mem_alloc() is called when the interface is down,
kernel panic occurs.
This patch makes the bnxt_queue_mem_alloc() return -ENETDOWN if rx ring
descriptors are not allocated.
The third patch fixes kernel panic in the bnxt_queue_{start | stop}().
When a queue is restarted bnxt_queue_{start | stop}() are called.
These functions set MRU to 0 to stop packet flow and then to set up the
remaining things.
MRU variable is a member of vnic_info[] the first vnic_info is for
default and the second is for ntuple.
The first vnic_info is always allocated when interface is up, but the
second is allocated only when ntuple is enabled.
(ethtool -K eth0 ntuple <on | off>).
Currently, the bnxt_queue_{start | stop}() access
vnic_info[BNXT_VNIC_NTUPLE] regardless of whether ntuple is enabled or
not.
So kernel panic occurs.
This patch make the bnxt_queue_{start | stop}() use bp->nr_vnics instead
of BNXT_VNIC_NTUPLE.
The fourth patch fixes a warning due to checksum state.
The bnxt_rx_pkt() checks whether skb->ip_summed is not CHECKSUM_NONE
before updating ip_summed. if ip_summed is not CHECKSUM_NONE, it WARNS
about it. However, the bnxt_xdp_build_skb() is called in XDP-MB-PASS
path and it updates ip_summed earlier than bnxt_rx_pkt().
So, in the XDP-MB-PASS path, the bnxt_rx_pkt() always warns about
checksum.
Updating ip_summed at the bnxt_xdp_build_skb() is unnecessary and
duplicate, so it is removed.
The fifth patch makes net_devmem_unbind_dmabuf() ignore -ENETDOWN.
When devmem socket is closed, net_devmem_unbind_dmabuf() is called to
unbind/release resources.
If interface is down, the driver returns -ENETDOWN.
The -ENETDOWN return value is not an actual error, because the interface
will release resources when the interface is down.
So, net_devmem_unbind_dmabuf() needs to ignore -ENETDOWN.
The last patch adds XDP testcases to
tools/testing/selftests/drivers/net/ping.py.
v2:
- Do not use num_frags in the bnxt_xdp_build_skb(). (1/6)
- Add Review tags from Somnath and Jakub. (2/6)
- Add new patch for fixing checksum warning. (4/6)
- Add new patch for fixing warning in net_devmem_unbind_dmabuf(). (5/6)
- Add new XDP testcases to ping.py (6/6)
Taehee Yoo (6):
eth: bnxt: fix truesize for mb-xdp-pass case
eth: bnxt: return fail if interface is down in bnxt_queue_mem_alloc()
eth: bnxt: do not use BNXT_VNIC_NTUPLE unconditionally in queue
restart logic
eth: bnxt: do not update checksum in bnxt_xdp_build_skb()
net: devmem: do not WARN conditionally after netdev_rx_queue_restart()
selftests: drv-net: add xdp cases for ping.py
drivers/net/ethernet/broadcom/bnxt/bnxt.c | 36 ++--
drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.c | 18 +-
drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.h | 6 +-
net/core/devmem.c | 4 +-
tools/testing/selftests/drivers/net/ping.py | 200 ++++++++++++++++--
.../testing/selftests/net/lib/xdp_dummy.bpf.c | 6 +
6 files changed, 221 insertions(+), 49 deletions(-)
--
2.34.1
The first patch fixes the incorrect locks using in bond driver.
The second patch fixes the xfrm offload feature during setup active-backup
mode. The third patch add a ipsec offload testing.
v4: hold xs->lock for bond_ipsec_{del, add}_sa_all (Cosmin Ratiu)
use the defer helpers in lib.sh for selftest (Petr Machata)
v3: move the ipsec deletion to bond_ipsec_free_sa (Cosmin Ratiu)
v2: do not turn carrier on if bond change link failed (Nikolay Aleksandrov)
move the mutex lock to a work queue (Cosmin Ratiu)
Hangbin Liu (3):
bonding: move IPsec deletion to bond_ipsec_free_sa
bonding: fix xfrm offload feature setup on active-backup mode
selftests: bonding: add ipsec offload test
drivers/net/bonding/bond_main.c | 55 +++++--
drivers/net/bonding/bond_netlink.c | 16 +-
include/net/bonding.h | 1 +
.../selftests/drivers/net/bonding/Makefile | 3 +-
.../drivers/net/bonding/bond_ipsec_offload.sh | 154 ++++++++++++++++++
.../selftests/drivers/net/bonding/config | 4 +
6 files changed, 208 insertions(+), 25 deletions(-)
create mode 100755 tools/testing/selftests/drivers/net/bonding/bond_ipsec_offload.sh
--
2.46.0
We hit a following exception on timeout, nmaps is never set:
Test bpftool bound info reporting (own ns)...
Traceback (most recent call last):
File "/home/virtme/testing-1/tools/testing/selftests/net/./bpf_offload.py", line 1128, in <module>
check_dev_info(False, "")
File "/home/virtme/testing-1/tools/testing/selftests/net/./bpf_offload.py", line 583, in check_dev_info
maps = bpftool_map_list_wait(expected=2, ns=ns)
File "/home/virtme/testing-1/tools/testing/selftests/net/./bpf_offload.py", line 215, in bpftool_map_list_wait
raise Exception("Time out waiting for map counts to stabilize want %d, have %d" % (expected, nmaps))
NameError: name 'nmaps' is not defined
Signed-off-by: Jakub Kicinski <kuba(a)kernel.org>
---
CC: shuah(a)kernel.org
CC: linux-kselftest(a)vger.kernel.org
CC: bpf(a)vger.kernel.org
---
tools/testing/selftests/net/bpf_offload.py | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/tools/testing/selftests/net/bpf_offload.py b/tools/testing/selftests/net/bpf_offload.py
index fd0d959914e4..4a9be8c49561 100755
--- a/tools/testing/selftests/net/bpf_offload.py
+++ b/tools/testing/selftests/net/bpf_offload.py
@@ -207,9 +207,11 @@ netns = [] # net namespaces to be removed
raise Exception("Time out waiting for program counts to stabilize want %d, have %d" % (expected, nprogs))
def bpftool_map_list_wait(expected=0, n_retry=20, ns=""):
+ nmaps = None
for i in range(n_retry):
maps = bpftool_map_list(ns=ns)
- if len(maps) == expected:
+ nmaps = len(maps)
+ if nmaps == expected:
return maps
time.sleep(0.05)
raise Exception("Time out waiting for map counts to stabilize want %d, have %d" % (expected, nmaps))
--
2.48.1
v6: https://lore.kernel.org/netdev/20250222191517.743530-1-almasrymina@google.c…
===
v6 has no major changes. Addressed a few issues from Paolo and David,
and collected Acks from Stan. Thank you everyone for the review!
Changes:
- retain behavior to process MSG_FASTOPEN even if the provided cmsg is
invalid (Paolo).
- Rework the freeing of tx_vec slightly (it now has its own err label).
(Paolo).
- Squash the commit that makes dmabuf unbinding scheduled work into the
same one which implements the TX path so we don't run into future
errors on bisecting (Paolo).
- Fix/add comments to explain how dmabuf binding refcounting works
(David).
v5: https://lore.kernel.org/netdev/20250220020914.895431-1-almasrymina@google.c…
===
v5 has no major changes; it clears up the relatively minor issues
pointed out to in v4, and rebases the series on top of net-next to
resolve the conflict with a patch that raced to the tree. It also
collects the review tags from v4.
Changes:
- Rebase to net-next
- Fix issues in selftest (Stan).
- Address comments in the devmem and netmem driver docs (Stan and Bagas)
- Fix zerocopy_fill_skb_from_devmem return error code (Stan).
v4: https://lore.kernel.org/netdev/20250203223916.1064540-1-almasrymina@google.…
===
v4 mainly addresses the critical driver support issue surfaced in v3 by
Paolo and Stan. Drivers aiming to support netmem_tx should make sure not
to pass the netmem dma-addrs to the dma-mapping APIs, as these dma-addrs
may come from dma-bufs.
Additionally other feedback from v3 is addressed.
Major changes:
- Add helpers to handle netmem dma-addrs. Add GVE support for
netmem_tx.
- Fix binding->tx_vec not being freed on error paths during the
tx binding.
- Add a minimal devmem_tx test to devmem.py.
- Clean up everything obsolete from the cover letter (Paolo).
v3: https://patchwork.kernel.org/project/netdevbpf/list/?series=929401&state=*
===
Address minor comments from RFCv2 and fix a few build warnings and
ynl-regen issues. No major changes.
RFC v2: https://patchwork.kernel.org/project/netdevbpf/list/?series=920056&state=*
=======
RFC v2 addresses much of the feedback from RFC v1. I plan on sending
something close to this as net-next reopens, sending it slightly early
to get feedback if any.
Major changes:
--------------
- much improved UAPI as suggested by Stan. We now interpret the iov_base
of the passed in iov from userspace as the offset into the dmabuf to
send from. This removes the need to set iov.iov_base = NULL which may
be confusing to users, and enables us to send multiple iovs in the
same sendmsg() call. ncdevmem and the docs show a sample use of that.
- Removed the duplicate dmabuf iov_iter in binding->iov_iter. I think
this is good improvment as it was confusing to keep track of
2 iterators for the same sendmsg, and mistracking both iterators
caused a couple of bugs reported in the last iteration that are now
resolved with this streamlining.
- Improved test coverage in ncdevmem. Now multiple sendmsg() are tested,
and sending multiple iovs in the same sendmsg() is tested.
- Fixed issue where dmabuf unmapping was happening in invalid context
(Stan).
====================================================================
The TX path had been dropped from the Device Memory TCP patch series
post RFCv1 [1], to make that series slightly easier to review. This
series rebases the implementation of the TX path on top of the
net_iov/netmem framework agreed upon and merged. The motivation for
the feature is thoroughly described in the docs & cover letter of the
original proposal, so I don't repeat the lengthy descriptions here, but
they are available in [1].
Full outline on usage of the TX path is detailed in the documentation
included with this series.
Test example is available via the kselftest included in the series as well.
The series is relatively small, as the TX path for this feature largely
piggybacks on the existing MSG_ZEROCOPY implementation.
Patch Overview:
---------------
1. Documentation & tests to give high level overview of the feature
being added.
1. Add netmem refcounting needed for the TX path.
2. Devmem TX netlink API.
3. Devmem TX net stack implementation.
4. Make dma-buf unbinding scheduled work to handle TX cases where it gets
freed from contexts where we can't sleep.
5. Add devmem TX documentation.
6. Add scaffolding enabling driver support for netmem_tx. Add helpers, driver
feature flag, and docs to enable drivers to declare netmem_tx support.
7. Guard netmem_tx against being enabled against drivers that don't
support it.
8. Add devmem_tx selftests. Add TX path to ncdevmem and add a test to
devmem.py.
Testing:
--------
Testing is very similar to devmem TCP RX path. The ncdevmem test used
for the RX path is now augemented with client functionality to test TX
path.
* Test Setup:
Kernel: net-next with this RFC and memory provider API cherry-picked
locally.
Hardware: Google Cloud A3 VMs.
NIC: GVE with header split & RSS & flow steering support.
Performance results are not included with this version, unfortunately.
I'm having issues running the dma-buf exporter driver against the
upstream kernel on my test setup. The issues are specific to that
dma-buf exporter and do not affect this patch series. I plan to follow
up this series with perf fixes if the tests point to issues once they're
up and running.
Special thanks to Stan who took a stab at rebasing the TX implementation
on top of the netmem/net_iov framework merged. Parts of his proposal [2]
that are reused as-is are forked off into their own patches to give full
credit.
[1] https://lore.kernel.org/netdev/20240909054318.1809580-1-almasrymina@google.…
[2] https://lore.kernel.org/netdev/20240913150913.1280238-2-sdf@fomichev.me/T/#…
Cc: sdf(a)fomichev.me
Cc: asml.silence(a)gmail.com
Cc: dw(a)davidwei.uk
Cc: Jamal Hadi Salim <jhs(a)mojatatu.com>
Cc: Victor Nogueira <victor(a)mojatatu.com>
Cc: Pedro Tammela <pctammela(a)mojatatu.com>
Cc: Samiullah Khawaja <skhawaja(a)google.com>
Mina Almasry (7):
net: add get_netmem/put_netmem support
net: devmem: Implement TX path
net: add devmem TCP TX documentation
net: enable driver support for netmem TX
gve: add netmem TX support to GVE DQO-RDA mode
net: check for driver support in netmem TX
selftests: ncdevmem: Implement devmem TCP TX
Stanislav Fomichev (1):
net: devmem: TCP tx netlink api
Documentation/netlink/specs/netdev.yaml | 12 +
Documentation/networking/devmem.rst | 150 ++++++++-
.../networking/net_cachelines/net_device.rst | 1 +
Documentation/networking/netdev-features.rst | 5 +
Documentation/networking/netmem.rst | 23 +-
drivers/net/ethernet/google/gve/gve_main.c | 4 +
drivers/net/ethernet/google/gve/gve_tx_dqo.c | 8 +-
include/linux/netdevice.h | 2 +
include/linux/skbuff.h | 17 +-
include/linux/skbuff_ref.h | 4 +-
include/net/netmem.h | 23 ++
include/net/sock.h | 1 +
include/uapi/linux/netdev.h | 1 +
net/core/datagram.c | 48 ++-
net/core/dev.c | 3 +
net/core/devmem.c | 115 ++++++-
net/core/devmem.h | 77 ++++-
net/core/netdev-genl-gen.c | 13 +
net/core/netdev-genl-gen.h | 1 +
net/core/netdev-genl.c | 73 ++++-
net/core/skbuff.c | 48 ++-
net/core/sock.c | 6 +
net/ipv4/ip_output.c | 3 +-
net/ipv4/tcp.c | 50 ++-
net/ipv6/ip6_output.c | 3 +-
net/vmw_vsock/virtio_transport_common.c | 5 +-
tools/include/uapi/linux/netdev.h | 1 +
.../selftests/drivers/net/hw/devmem.py | 26 +-
.../selftests/drivers/net/hw/ncdevmem.c | 300 +++++++++++++++++-
29 files changed, 950 insertions(+), 73 deletions(-)
base-commit: 80c4a0015ce249cf0917a04dbb3cc652a6811079
--
2.48.1.658.g4767266eb4-goog
Hi all,
this v5 of the patch series is very similar to v4, but rebased onto the
bpf-next/net branch instead of bpf-next/master.
Because the commit c047e0e0e435 ("selftests/bpf: Optionally open a
dedicated namespace to run test in it") is not yet included in this branch,
I changed the xdp_context_tuntap test to manually create a namespace to run
the test in.
Not so successful pipeline: https://github.com/kernel-patches/bpf/actions/runs/13682405154
The CI pipeline failed because of veristat changes in seemingly unrelated
eBPF programs. I don't think this has to do with the changes from this
patch series, but if it does, please let me know what I may have to do
different to make the CI pass.
---
v5:
- rebase onto bpf-next/net
- resolve rebase conflicts
- change xdp_context_tuntap test to manually create and open a network
namespace using netns_new
v4: https://lore.kernel.org/bpf/20250227142330.1605996-1-marcus.wichelmann@hetz…
- strip unrelated changes from the selftest patches
- extend commit message for "selftests/bpf: refactor xdp_context_functional
test and bpf program"
- the NOARP flag was not effective to prevent other packets from
interfering with the tests, add a filter to the XDP program instead
- run xdp_context_tuntap in a separate namespace to avoid conflicts with
other tests
v3: https://lore.kernel.org/bpf/20250224152909.3911544-1-marcus.wichelmann@hetz…
- change the condition to handle xdp_buffs without metadata support, as
suggested by Willem de Bruijn <willemb(a)google.com>
- add clarifying comment why that condition is needed
- set NOARP flag in selftests to ensure that the kernel does not send
packets on the test interfaces that may interfere with the tests
v2: https://lore.kernel.org/bpf/20250217172308.3291739-1-marcus.wichelmann@hetz…
- submit against bpf-next subtree
- split commits and improved commit messages
- remove redundant metasize check and add clarifying comment instead
- use max() instead of ternary operator
- add selftest for metadata support in the tun driver
v1: https://lore.kernel.org/all/20250130171614.1657224-1-marcus.wichelmann@hetz…
Marcus Wichelmann (6):
net: tun: enable XDP metadata support
net: tun: enable transfer of XDP metadata to skb
selftests/bpf: move open_tuntap to network helpers
selftests/bpf: refactor xdp_context_functional test and bpf program
selftests/bpf: add test for XDP metadata support in tun driver
selftests/bpf: fix file descriptor assertion in open_tuntap helper
drivers/net/tun.c | 28 +++-
tools/testing/selftests/bpf/network_helpers.c | 28 ++++
tools/testing/selftests/bpf/network_helpers.h | 3 +
.../selftests/bpf/prog_tests/lwt_helpers.h | 29 ----
.../bpf/prog_tests/xdp_context_test_run.c | 145 +++++++++++++++++-
.../selftests/bpf/progs/test_xdp_meta.c | 53 +++++--
6 files changed, 230 insertions(+), 56 deletions(-)
--
2.43.0
A bug was identified where the KTAP below caused an infinite loop:
TAP version 13
ok 4 test_case
1..4
The infinite loop was caused by the parser not parsing a test plan
if following a test result line.
Fix bug to correctly parse test plan and add error if test plan is
missing.
Signed-off-by: Rae Moar <rmoar(a)google.com>
---
tools/testing/kunit/kunit_parser.py | 12 +++++++-----
tools/testing/kunit/kunit_tool_test.py | 5 ++---
2 files changed, 9 insertions(+), 8 deletions(-)
diff --git a/tools/testing/kunit/kunit_parser.py b/tools/testing/kunit/kunit_parser.py
index 29fc27e8949b..5dcbc670e1dc 100644
--- a/tools/testing/kunit/kunit_parser.py
+++ b/tools/testing/kunit/kunit_parser.py
@@ -761,20 +761,22 @@ def parse_test(lines: LineStream, expected_num: int, log: List[str], is_subtest:
test.name = "main"
ktap_line = parse_ktap_header(lines, test, printer)
test.log.extend(parse_diagnostic(lines))
- parse_test_plan(lines, test)
+ plan_line = parse_test_plan(lines, test)
parent_test = True
else:
# If not the main test, attempt to parse a test header containing
# the KTAP version line and/or subtest header line
ktap_line = parse_ktap_header(lines, test, printer)
subtest_line = parse_test_header(lines, test)
+ test.log.extend(parse_diagnostic(lines))
+ plan_line = parse_test_plan(lines, test)
parent_test = (ktap_line or subtest_line)
if parent_test:
- # If KTAP version line and/or subtest header is found, attempt
- # to parse test plan and print test header
- test.log.extend(parse_diagnostic(lines))
- parse_test_plan(lines, test)
print_test_header(test, printer)
+
+ if parent_test and not plan_line:
+ test.add_error(printer, 'missing test plan!')
+
expected_count = test.expected_count
subtests = []
test_num = 1
diff --git a/tools/testing/kunit/kunit_tool_test.py b/tools/testing/kunit/kunit_tool_test.py
index 0bcb0cc002f8..e1e142c1a850 100755
--- a/tools/testing/kunit/kunit_tool_test.py
+++ b/tools/testing/kunit/kunit_tool_test.py
@@ -181,8 +181,7 @@ class KUnitParserTest(unittest.TestCase):
result = kunit_parser.parse_run_tests(
kunit_parser.extract_tap_lines(
file.readlines()), stdout)
- # A missing test plan is not an error.
- self.assertEqual(result.counts, kunit_parser.TestCounts(passed=10, errors=0))
+ self.assertEqual(result.counts, kunit_parser.TestCounts(passed=10, errors=2))
self.assertEqual(kunit_parser.TestStatus.SUCCESS, result.status)
def test_no_tests(self):
@@ -203,7 +202,7 @@ class KUnitParserTest(unittest.TestCase):
self.assertEqual(
kunit_parser.TestStatus.NO_TESTS,
result.subtests[0].subtests[0].status)
- self.assertEqual(result.counts, kunit_parser.TestCounts(passed=1, errors=1))
+ self.assertEqual(result.counts, kunit_parser.TestCounts(passed=1, errors=2))
def test_no_kunit_output(self):
base-commit: 0619a4868fc1b32b07fb9ed6c69adc5e5cf4e4b2
--
2.48.1.711.g2feabab25a-goog
virtio-net have two usage of hashes: one is RSS and another is hash
reporting. Conventionally the hash calculation was done by the VMM.
However, computing the hash after the queue was chosen defeats the
purpose of RSS.
Another approach is to use eBPF steering program. This approach has
another downside: it cannot report the calculated hash due to the
restrictive nature of eBPF.
Introduce the code to compute hashes to the kernel in order to overcome
thse challenges.
An alternative solution is to extend the eBPF steering program so that it
will be able to report to the userspace, but it is based on context
rewrites, which is in feature freeze. We can adopt kfuncs, but they will
not be UAPIs. We opt to ioctl to align with other relevant UAPIs (KVM
and vhost_net).
The patches for QEMU to use this new feature was submitted as RFC and
is available at:
https://patchew.org/QEMU/20240915-hash-v3-0-79cb08d28647@daynix.com/
This work was presented at LPC 2024:
https://lpc.events/event/18/contributions/1963/
V1 -> V2:
Changed to introduce a new BPF program type.
Signed-off-by: Akihiko Odaki <akihiko.odaki(a)daynix.com>
---
Changes in v7:
- Ensured to set hash_report to VIRTIO_NET_HASH_REPORT_NONE for
VHOST_NET_F_VIRTIO_NET_HDR.
- s/4/sizeof(u32)/ in patch "virtio_net: Add functions for hashing".
- Added tap_skb_cb type.
- Rebased.
- Link to v6: https://lore.kernel.org/r/20250109-rss-v6-0-b1c90ad708f6@daynix.com
Changes in v6:
- Extracted changes to fill vnet header holes into another series.
- Squashed patches "skbuff: Introduce SKB_EXT_TUN_VNET_HASH", "tun:
Introduce virtio-net hash reporting feature", and "tun: Introduce
virtio-net RSS" into patch "tun: Introduce virtio-net hash feature".
- Dropped the RFC tag.
- Link to v5: https://lore.kernel.org/r/20241008-rss-v5-0-f3cf68df005d@daynix.com
Changes in v5:
- Fixed a compilation error with CONFIG_TUN_VNET_CROSS_LE.
- Optimized the calculation of the hash value according to:
https://git.dpdk.org/dpdk/commit/?id=3fb1ea032bd6ff8317af5dac9af901f1f324ca…
- Added patch "tun: Unify vnet implementation".
- Dropped patch "tap: Pad virtio header with zero".
- Added patch "selftest: tun: Test vnet ioctls without device".
- Reworked selftests to skip for older kernels.
- Documented the case when the underlying device is deleted and packets
have queue_mapping set by TC.
- Reordered test harness arguments.
- Added code to handle fragmented packets.
- Link to v4: https://lore.kernel.org/r/20240924-rss-v4-0-84e932ec0e6c@daynix.com
Changes in v4:
- Moved tun_vnet_hash_ext to if_tun.h.
- Renamed virtio_net_toeplitz() to virtio_net_toeplitz_calc().
- Replaced htons() with cpu_to_be16().
- Changed virtio_net_hash_rss() to return void.
- Reordered variable declarations in virtio_net_hash_rss().
- Removed virtio_net_hdr_v1_hash_from_skb().
- Updated messages of "tap: Pad virtio header with zero" and
"tun: Pad virtio header with zero".
- Fixed vnet_hash allocation size.
- Ensured to free vnet_hash when destructing tun_struct.
- Link to v3: https://lore.kernel.org/r/20240915-rss-v3-0-c630015db082@daynix.com
Changes in v3:
- Reverted back to add ioctl.
- Split patch "tun: Introduce virtio-net hashing feature" into
"tun: Introduce virtio-net hash reporting feature" and
"tun: Introduce virtio-net RSS".
- Changed to reuse hash values computed for automq instead of performing
RSS hashing when hash reporting is requested but RSS is not.
- Extracted relevant data from struct tun_struct to keep it minimal.
- Added kernel-doc.
- Changed to allow calling TUNGETVNETHASHCAP before TUNSETIFF.
- Initialized num_buffers with 1.
- Added a test case for unclassified packets.
- Fixed error handling in tests.
- Changed tests to verify that the queue index will not overflow.
- Rebased.
- Link to v2: https://lore.kernel.org/r/20231015141644.260646-1-akihiko.odaki@daynix.com
---
Akihiko Odaki (6):
virtio_net: Add functions for hashing
net: flow_dissector: Export flow_keys_dissector_symmetric
tun: Introduce virtio-net hash feature
selftest: tun: Test vnet ioctls without device
selftest: tun: Add tests for virtio-net hashing
vhost/net: Support VIRTIO_NET_F_HASH_REPORT
Documentation/networking/tuntap.rst | 7 +
drivers/net/Kconfig | 1 +
drivers/net/tap.c | 62 +++-
drivers/net/tun.c | 89 ++++-
drivers/net/tun_vnet.h | 180 +++++++++-
drivers/vhost/net.c | 49 +--
include/linux/if_tap.h | 2 +
include/linux/skbuff.h | 3 +
include/linux/virtio_net.h | 188 +++++++++++
include/net/flow_dissector.h | 1 +
include/uapi/linux/if_tun.h | 75 +++++
net/core/flow_dissector.c | 3 +-
net/core/skbuff.c | 4 +
tools/testing/selftests/net/Makefile | 2 +-
tools/testing/selftests/net/tun.c | 627 ++++++++++++++++++++++++++++++++++-
15 files changed, 1231 insertions(+), 62 deletions(-)
---
base-commit: dd83757f6e686a2188997cb58b5975f744bb7786
change-id: 20240403-rss-e737d89efa77
prerequisite-change-id: 20241230-tun-66e10a49b0c7:v6
prerequisite-patch-id: 871dc5f146fb6b0e3ec8612971a8e8190472c0fb
prerequisite-patch-id: 2797ed249d32590321f088373d4055ff3f430a0e
prerequisite-patch-id: ea3370c72d4904e2f0536ec76ba5d26784c0cede
prerequisite-patch-id: 837e4cf5d6b451424f9b1639455e83a260c4440d
prerequisite-patch-id: ea701076f57819e844f5a35efe5cbc5712d3080d
prerequisite-patch-id: 701646fb43ad04cc64dd2bf13c150ccbe6f828ce
prerequisite-patch-id: 53176dae0c003f5b6c114d43f936cf7140d31bb5
prerequisite-change-id: 20250116-buffers-96e14bf023fc:v2
prerequisite-patch-id: 25fd4f99d4236a05a5ef16ab79f3e85ee57e21cc
Best regards,
--
Akihiko Odaki <akihiko.odaki(a)daynix.com>
[cc'ing linux-kselftest and kunit-dev]
Hi,
On Wed, 5 Mar 2025 01:47:55 +0800, kernel test robot wrote:
> tree: https://github.com/brauner/linux.git vfs.all
> head: ea47e99a3a234837d5fea0d1a20bb2ad1eaa6dd4
> commit: b6736cfccb582b7c016cba6cd484fbcf30d499af [205/231] initramfs_test: kunit tests for initramfs unpacking
> config: x86_64-buildonly-randconfig-002-20250304 (https://download.01.org/0day-ci/archive/20250305/202503050109.t5Ab93hX-lkp@…)
> compiler: gcc-12 (Debian 12.2.0-14) 12.2.0
> reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20250305/202503050109.t5Ab93hX-lkp@…)
>
> If you fix the issue in a separate patch/commit (i.e. not just a new version of
> the same patch/commit), kindly add following tags
> | Reported-by: kernel test robot <lkp(a)intel.com>
> | Closes: https://lore.kernel.org/oe-kbuild-all/202503050109.t5Ab93hX-lkp@intel.com/
>
> All warnings (new ones prefixed by >>, old ones prefixed by <<):
>
> >> WARNING: modpost: vmlinux: section mismatch in reference: initramfs_test_cases+0x0 (section: .data) -> initramfs_test_extract (section: .init.text)
> >> WARNING: modpost: vmlinux: section mismatch in reference: initramfs_test_cases+0x30 (section: .data) -> initramfs_test_fname_overrun (section: .init.text)
> >> WARNING: modpost: vmlinux: section mismatch in reference: initramfs_test_cases+0x60 (section: .data) -> initramfs_test_data (section: .init.text)
> >> WARNING: modpost: vmlinux: section mismatch in reference: initramfs_test_cases+0x90 (section: .data) -> initramfs_test_csum (section: .init.text)
> >> WARNING: modpost: vmlinux: section mismatch in reference: initramfs_test_cases+0xc0 (section: .data) -> initramfs_test_hardlink (section: .init.text)
> >> WARNING: modpost: vmlinux: section mismatch in reference: initramfs_test_cases+0xf0 (section: .data) -> initramfs_test_many (section: .init.text)
These new warnings are covered in the commit message. The
kunit_test_init_section_suites() registered tests aren't in the .init
section as debugfs entries are retained for results reporting (without
an ability to rerun them).
IIUC, the __kunit_init_test_suites->CONCATENATE(..., _probe) suffix is
intended to suppress the modpost warning - @kunit-dev: any ideas why
this isn't working as intended?
Thanks, David
Vector registers are zero initialized by the kernel. Stop accepting
"all ones" as a clean value.
Note that this was not working as expected given that
value == 0xff
can be assumed to be always false by the compiler as value's range is
[-128, 127]. Both GCC (-Wtype-limits) and clang
(-Wtautological-constant-out-of-range-compare) warn about this.
Signed-off-by: Ignacio Encinas <ignacio(a)iencinas.com>
---
I tried looking why "all ones" was previously deemed a "clean" value but
couldn't find any information. It looks like the kernel always
zero-initializes the vector registers.
If "all ones" is still acceptable for any reason, my intention is to
spin a v2 changing the types of `value` and `prev_value` to unsigned
char.
---
tools/testing/selftests/riscv/vector/v_exec_initval_nolibc.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/tools/testing/selftests/riscv/vector/v_exec_initval_nolibc.c b/tools/testing/selftests/riscv/vector/v_exec_initval_nolibc.c
index 35c0812e32de0c82a54f84bd52c4272507121e35..b712c4d258a6cb045aa96de4a75299714866f5e6 100644
--- a/tools/testing/selftests/riscv/vector/v_exec_initval_nolibc.c
+++ b/tools/testing/selftests/riscv/vector/v_exec_initval_nolibc.c
@@ -6,7 +6,7 @@
* the values. To further ensure consistency, this file is compiled without
* libc and without auto-vectorization.
*
- * To be "clean" all values must be either all ones or all zeroes.
+ * To be "clean" all values must be all zeroes.
*/
#define __stringify_1(x...) #x
@@ -46,7 +46,7 @@ int main(int argc, char **argv)
: "=r" (value)); \
if (first) { \
first = 0; \
- } else if (value != prev_value || !(value == 0x00 || value == 0xff)) { \
+ } else if (value != prev_value || value != 0x00) { \
printf("Register " __stringify(register) \
" values not clean! value: %u\n", value); \
exit(-1); \
---
base-commit: 03d38806a902b36bf364cae8de6f1183c0a35a67
change-id: 20250301-fix-v_exec_initval_nolibc-498d976c372d
Best regards,
--
Ignacio Encinas <ignacio(a)iencinas.com>
Documentation/dev-tools/kselftest.rst says you can use the "TARGETS"
variable on the make command line to run only tests targeted for a
single subsystem:
$ make TARGETS="size timers" kselftest
A natural way to narrow down further to a particular test in a subsystem
is to specify e.g., TEST_GEN_PROGS:
$ make TARGETS=net TEST_PROGS= TEST_GEN_PROGS=tun kselftest
However, this does not work well because the following statement in
tools/testing/selftests/lib.mk gets ignored:
TEST_GEN_PROGS := $(patsubst %,$(OUTPUT)/%,$(TEST_GEN_PROGS))
Add the override directive to make it and similar ones will be effective
even when TEST_GEN_PROGS and similar variables are specified in the
command line.
Signed-off-by: Akihiko Odaki <akihiko.odaki(a)daynix.com>
---
tools/testing/selftests/lib.mk | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/tools/testing/selftests/lib.mk b/tools/testing/selftests/lib.mk
index d6edcfcb5be832ddee4c3d34b5ad221e9295f878..68116e51f97d62376c63f727ba3fd1f616c67562 100644
--- a/tools/testing/selftests/lib.mk
+++ b/tools/testing/selftests/lib.mk
@@ -93,9 +93,9 @@ TOOLS_INCLUDES := -isystem $(top_srcdir)/tools/include/uapi
# TEST_PROGS are for test shell scripts.
# TEST_CUSTOM_PROGS and TEST_PROGS will be run by common run_tests
# and install targets. Common clean doesn't touch them.
-TEST_GEN_PROGS := $(patsubst %,$(OUTPUT)/%,$(TEST_GEN_PROGS))
-TEST_GEN_PROGS_EXTENDED := $(patsubst %,$(OUTPUT)/%,$(TEST_GEN_PROGS_EXTENDED))
-TEST_GEN_FILES := $(patsubst %,$(OUTPUT)/%,$(TEST_GEN_FILES))
+override TEST_GEN_PROGS := $(patsubst %,$(OUTPUT)/%,$(TEST_GEN_PROGS))
+override TEST_GEN_PROGS_EXTENDED := $(patsubst %,$(OUTPUT)/%,$(TEST_GEN_PROGS_EXTENDED))
+override TEST_GEN_FILES := $(patsubst %,$(OUTPUT)/%,$(TEST_GEN_FILES))
all: $(TEST_GEN_PROGS) $(TEST_GEN_PROGS_EXTENDED) $(TEST_GEN_FILES) \
$(if $(TEST_GEN_MODS_DIR),gen_mods_dir)
---
base-commit: dd83757f6e686a2188997cb58b5975f744bb7786
change-id: 20250306-lib-4ac9711c10a2
Best regards,
--
Akihiko Odaki <akihiko.odaki(a)daynix.com>
The mac address on backup slave should be convert from Solicited-Node
Multicast address, not from bonding unicast target address.
v3: also fix the mac setting for slave_set_ns_maddr. (Jay)
Add function description for slave_set_ns_maddr/slave_set_ns_maddrs (Jay)
v2: fix patch 01's subject
Hangbin Liu (2):
bonding: fix incorrect MAC address setting to receive NS messages
selftests: bonding: fix incorrect mac address
drivers/net/bonding/bond_options.c | 55 ++++++++++++++++---
.../drivers/net/bonding/bond_options.sh | 4 +-
2 files changed, 49 insertions(+), 10 deletions(-)
--
2.46.0
Fixes an issue where out-of-tree kselftest builds fail when building
the BPF and bpftools components. The failure occurs because the top-level
Makefile passes a relative srctree path to its sub-Makefiles, which
leads to errors in locating necessary files.
For example, the following error is encountered:
```
$ make V=1 O=$build/ TARGETS=hid kselftest-all
...
make -C ../tools/testing/selftests all
make[4]: Entering directory '/path/to/linux/tools/testing/selftests/hid'
make -C /path/to/linux/tools/testing/selftests/../../../tools/lib/bpf OUTPUT=/path/to/linux/O/kselftest/hid/tools/build/libbpf/ \
EXTRA_CFLAGS='-g -O0' \
DESTDIR=/path/to/linux/O/kselftest/hid/tools prefix= all install_headers
make[5]: Entering directory '/path/to/linux/tools/lib/bpf'
...
make[5]: Entering directory '/path/to/linux/tools/bpf/bpftool'
Makefile:127: ../tools/build/Makefile.feature: No such file or directory
make[5]: *** No rule to make target '../tools/build/Makefile.feature'. Stop.
```
To resolve this, override the srctree in the kselftests's top Makefile
when performing an out-of-tree build. This ensures that all sub-Makefiles
have the correct path to the source tree, preventing directory resolution
errors.
Cc: Andrii Nakryiko <andrii.nakryiko(a)gmail.com>
Signed-off-by: Li Zhijian <lizhijian(a)fujitsu.com>
Tested-by: Quentin Monnet <qmo(a)kernel.org>
---
Cc: Masahiro Yamada <masahiroy(a)kernel.org>
V3:
collected Tested-by and rebased on bpf-next
V2:
- handle srctree in selftests itself rather than the linux' top Makefile # Masahiro Yamada <masahiroy(a)kernel.org>
V1: https://lore.kernel.org/lkml/20241217031052.69744-1-lizhijian@fujitsu.com/
---
tools/testing/selftests/Makefile | 8 ++++++--
1 file changed, 6 insertions(+), 2 deletions(-)
diff --git a/tools/testing/selftests/Makefile b/tools/testing/selftests/Makefile
index 2401e973c359..f04a3b0003f6 100644
--- a/tools/testing/selftests/Makefile
+++ b/tools/testing/selftests/Makefile
@@ -154,15 +154,19 @@ override LDFLAGS =
override MAKEFLAGS =
endif
+top_srcdir ?= ../../..
+
# Append kselftest to KBUILD_OUTPUT and O to avoid cluttering
# KBUILD_OUTPUT with selftest objects and headers installed
# by selftests Makefile or lib.mk.
+# Override the `srctree` variable to ensure it is correctly resolved in
+# sub-Makefiles, such as those within `bpf`, when managing targets like
+# `net` and `hid`.
ifdef building_out_of_srctree
override LDFLAGS =
+override srctree := $(top_srcdir)
endif
-top_srcdir ?= ../../..
-
ifeq ("$(origin O)", "command line")
KBUILD_OUTPUT := $(O)
endif
--
2.44.0
From: Jeff Xu <jeffxu(a)chromium.org>
This is V9 version, addressing comments from V8, without code logic
change.
-------------------------------------------------------------------
As discussed during mseal() upstream process [1], mseal() protects
the VMAs of a given virtual memory range against modifications, such
as the read/write (RW) and no-execute (NX) bits. For complete
descriptions of memory sealing, please see mseal.rst [2].
The mseal() is useful to mitigate memory corruption issues where a
corrupted pointer is passed to a memory management system. For
example, such an attacker primitive can break control-flow integrity
guarantees since read-only memory that is supposed to be trusted can
become writable or .text pages can get remapped.
The system mappings are readonly only, memory sealing can protect
them from ever changing to writable or unmmap/remapped as different
attributes.
System mappings such as vdso, vvar, vvar_vclock,
vectors (arm compat-mode), sigpage (arm compat-mode),
are created by the kernel during program initialization, and could
be sealed after creation.
Unlike the aforementioned mappings, the uprobe mapping is not
established during program startup. However, its lifetime is the same
as the process's lifetime [3]. It could be sealed from creation.
The vsyscall on x86-64 uses a special address (0xffffffffff600000),
which is outside the mm managed range. This means mprotect, munmap, and
mremap won't work on the vsyscall. Since sealing doesn't enhance
the vsyscall's security, it is skipped in this patch. If we ever seal
the vsyscall, it is probably only for decorative purpose, i.e. showing
the 'sl' flag in the /proc/pid/smaps. For this patch, it is ignored.
It is important to note that the CHECKPOINT_RESTORE feature (CRIU) may
alter the system mappings during restore operations. UML(User Mode Linux)
and gVisor, rr are also known to change the vdso/vvar mappings.
Consequently, this feature cannot be universally enabled across all
systems. As such, CONFIG_MSEAL_SYSTEM_MAPPINGS is disabled by default.
To support mseal of system mappings, architectures must define
CONFIG_ARCH_SUPPORTS_MSEAL_SYSTEM_MAPPINGS and update their special
mappings calls to pass mseal flag. Additionally, architectures must
confirm they do not unmap/remap system mappings during the process
lifetime. The existence of this flag for an architecture implies that
it does not require the remapping of thest system mappings during
process lifetime, so sealing these mappings is safe from a kernel
perspective.
This version covers x86-64 and arm64 archiecture as minimum viable feature.
While no specific CPU hardware features are required for enable this
feature on an archiecture, memory sealing requires a 64-bit kernel. Other
architectures can choose whether or not to adopt this feature. Currently,
I'm not aware of any instances in the kernel code that actively
munmap/mremap a system mapping without a request from userspace. The PPC
does call munmap when _install_special_mapping fails for vdso; however,
it's uncertain if this will ever fail for PPC - this needs to be
investigated by PPC in the future [4]. The UML kernel can add this support
when KUnit tests require it [5].
In this version, we've improved the handling of system mapping sealing from
previous versions, instead of modifying the _install_special_mapping
function itself, which would affect all architectures, we now call
_install_special_mapping with a sealing flag only within the specific
architecture that requires it. This targeted approach offers two key
advantages: 1) It limits the code change's impact to the necessary
architectures, and 2) It aligns with the software architecture by keeping
the core memory management within the mm layer, while delegating the
decision of sealing system mappings to the individual architecture, which
is particularly relevant since 32-bit architectures never require sealing.
Prior to this patch series, we explored sealing special mappings from
userspace using glibc's dynamic linker. This approach revealed several
issues:
- The PT_LOAD header may report an incorrect length for vdso, (smaller
than its actual size). The dynamic linker, which relies on PT_LOAD
information to determine mapping size, would then split and partially
seal the vdso mapping. Since each architecture has its own vdso/vvar
code, fixing this in the kernel would require going through each
archiecture. Our initial goal was to enable sealing readonly mappings,
e.g. .text, across all architectures, sealing vdso from kernel since
creation appears to be simpler than sealing vdso at glibc.
- The [vvar] mapping header only contains address information, not length
information. Similar issues might exist for other special mappings.
- Mappings like uprobe are not covered by the dynamic linker,
and there is no effective solution for them.
This feature's security enhancements will benefit ChromeOS, Android,
and other high security systems.
Testing:
This feature was tested on ChromeOS and Android for both x86-64 and ARM64.
- Enable sealing and verify vdso/vvar, sigpage, vector are sealed properly,
i.e. "sl" shown in the smaps for those mappings, and mremap is blocked.
- Passing various automation tests (e.g. pre-checkin) on ChromeOS and
Android to ensure the sealing doesn't affect the functionality of
Chromebook and Android phone.
I also tested the feature on Ubuntu on x86-64:
- With config disabled, vdso/vvar is not sealed,
- with config enabled, vdso/vvar is sealed, and booting up Ubuntu is OK,
normal operations such as browsing the web, open/edit doc are OK.
Link: https://lore.kernel.org/all/20240415163527.626541-1-jeffxu@chromium.org/ [1]
Link: Documentation/userspace-api/mseal.rst [2]
Link: https://lore.kernel.org/all/CABi2SkU9BRUnqf70-nksuMCQ+yyiWjo3fM4XkRkL-NrCZx… [3]
Link: https://lore.kernel.org/all/CABi2SkV6JJwJeviDLsq9N4ONvQ=EFANsiWkgiEOjyT9TQS… [4]
Link: https://lore.kernel.org/all/202502251035.239B85A93@keescook/ [5]
-------------------------------------------
History:
V9:
- Add negative test in selftest (Kees Cook)
- fx typos in text (Kees Cook)
V8:
- Change ARCH_SUPPORTS_MSEAL_X to ARCH_SUPPORTS_MSEAL_X (Liam R. Howlett)
- Update comments in Kconfig and mseal.rst (Lorenzo Stoakes, Liam R. Howlett)
- Change patch header perfix to "mseal sysmap" (Lorenzo Stoakes)
- Remove "vm_flags =" (Kees Cook, Liam R. Howlett, Oleg Nesterov)
- Drop uml architecture (Lorenzo Stoakes, Kees Cook)
- Add a selftest to verify system mappings are sealed (Lorenzo Stoakes)
V7:
https://lore.kernel.org/all/20250224225246.3712295-1-jeffxu@google.com/
- Remove cover letter from the first patch (Liam R. Howlett)
- Change macro name to VM_SEALED_SYSMAP (Liam R. Howlett)
- logging and fclose() in selftest (Liam R. Howlett)
V6:
https://lore.kernel.org/all/20250224174513.3600914-1-jeffxu@google.com/
- mseal.rst: fix a typo (Randy Dunlap)
- security/Kconfig: add rr into note (Liam R. Howlett)
- remove mseal_system_mappings() and use macro instead (Liam R. Howlett)
- mseal.rst: add incompatible userland software (Lorenzo Stoakes)
- remove RFC from title (Kees Cook)
V5
https://lore.kernel.org/all/20250212032155.1276806-1-jeffxu@google.com/
- Remove kernel cmd line (Lorenzo Stoakes)
- Add test info (Lorenzo Stoakes)
- Add threat model info (Lorenzo Stoakes)
- Fix x86 selftest: test_mremap_vdso
- Restrict code change to ARM64/x86-64/UM arch only.
- Add userprocess.h to include seal_system_mapping().
- Remove sealing vsyscall.
- Split the patch.
V4:
https://lore.kernel.org/all/20241125202021.3684919-1-jeffxu@google.com/
- ARCH_HAS_SEAL_SYSTEM_MAPPINGS (Lorenzo Stoakes)
- test info (Lorenzo Stoakes)
- Update mseal.rst (Liam R. Howlett)
- Update test_mremap_vdso.c (Liam R. Howlett)
- Misc. style, comments, doc update (Liam R. Howlett)
V3:
https://lore.kernel.org/all/20241113191602.3541870-1-jeffxu@google.com/
- Revert uprobe to v1 logic (Oleg Nesterov)
- use CONFIG_SEAL_SYSTEM_MAPPINGS instead of _ALWAYS/_NEVER (Kees Cook)
- Move kernel cmd line from fs/exec.c to mm/mseal.c and
misc. (Liam R. Howlett)
V2:
https://lore.kernel.org/all/20241014215022.68530-1-jeffxu@google.com/
- Seal uprobe always (Oleg Nesterov)
- Update comments and description (Randy Dunlap, Liam R.Howlett, Oleg Nesterov)
- Rebase to linux_main
V1:
- https://lore.kernel.org/all/20241004163155.3493183-1-jeffxu@google.com/
--------------------------------------------------
Jeff Xu (7):
mseal sysmap: kernel config and header change
selftests: x86: test_mremap_vdso: skip if vdso is msealed
mseal sysmap: enable x86-64
mseal sysmap: enable arm64
mseal sysmap: uprobe mapping
mseal sysmap: update mseal.rst
selftest: test system mappings are sealed.
Documentation/userspace-api/mseal.rst | 20 +++
arch/arm64/Kconfig | 1 +
arch/arm64/kernel/vdso.c | 12 +-
arch/x86/Kconfig | 1 +
arch/x86/entry/vdso/vma.c | 7 +-
include/linux/mm.h | 10 ++
init/Kconfig | 22 ++++
kernel/events/uprobes.c | 3 +-
security/Kconfig | 21 ++++
tools/testing/selftests/Makefile | 1 +
.../mseal_system_mappings/.gitignore | 2 +
.../selftests/mseal_system_mappings/Makefile | 6 +
.../selftests/mseal_system_mappings/config | 1 +
.../mseal_system_mappings/sysmap_is_sealed.c | 119 ++++++++++++++++++
.../testing/selftests/x86/test_mremap_vdso.c | 43 +++++++
15 files changed, 261 insertions(+), 8 deletions(-)
create mode 100644 tools/testing/selftests/mseal_system_mappings/.gitignore
create mode 100644 tools/testing/selftests/mseal_system_mappings/Makefile
create mode 100644 tools/testing/selftests/mseal_system_mappings/config
create mode 100644 tools/testing/selftests/mseal_system_mappings/sysmap_is_sealed.c
--
2.48.1.711.g2feabab25a-goog