- Linux-kselftest-mirror - lists.linaro.org

[PATCH] selftests: filesystems: fix warn_unused_result build warnings

by Abhinav Jain

Add return value checks for read & write calls in the test_listmount_ns
function. This patch resolves the compilation warnings below:

```
statmount_test_ns.c: In function ‘test_listmount_ns’:

statmount_test_ns.c:322:17: warning: ignoring return value of ‘write’
declared with attribute ‘warn_unused_result’ [-Wunused-result]

statmount_test_ns.c:323:17: warning: ignoring return value of ‘read’
declared with attribute ‘warn_unused_result’ [-Wunused-result]
```

Signed-off-by: Abhinav Jain <jain.abhinav177(a)gmail.com>
---
 .../selftests/filesystems/statmount/statmount_test_ns.c | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/tools/testing/selftests/filesystems/statmount/statmount_test_ns.c b/tools/testing/selftests/filesystems/statmount/statmount_test_ns.c
index e044f5fc57fd..70cb0c8b21cf 100644
--- a/tools/testing/selftests/filesystems/statmount/statmount_test_ns.c
+++ b/tools/testing/selftests/filesystems/statmount/statmount_test_ns.c
@@ -319,8 +319,11 @@ static void test_listmount_ns(void)
 		 * Tell our parent how many mounts we have, and then wait for it
 		 * to tell us we're done.
 		 */
-		write(child_ready_pipe[1], &nr_mounts, sizeof(nr_mounts));
-		read(parent_ready_pipe[0], &cval, sizeof(cval));
+		if (write(child_ready_pipe[1], &nr_mounts, sizeof(nr_mounts)) !=
+					sizeof(nr_mounts))
+			ret = NSID_ERROR;
+		if (read(parent_ready_pipe[0], &cval, sizeof(cval)) != sizeof(cval))
+			ret = NSID_ERROR;
 		exit(NSID_PASS);
 	}
 
--
2.34.1
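The check added by the patch above compares the result of write()/read() against the object size. A minimal sketch of the same idea as standalone helpers follows; `demo_write_exact` and `demo_read_exact` are hypothetical names that do not appear in the patch, and the sketch treats a short transfer as a failure rather than retrying.

```c
#include <stddef.h>
#include <unistd.h>

/*
 * Hypothetical helper (not part of the patch): returns 0 only when the
 * whole object was written, -1 on error or on a short write.
 */
int demo_write_exact(int fd, const void *buf, size_t len)
{
	ssize_t n = write(fd, buf, len);

	/* Check the sign first so the size_t comparison is well defined. */
	return (n >= 0 && (size_t)n == len) ? 0 : -1;
}

/* Same contract for read(): 0 on a full read, -1 otherwise. */
int demo_read_exact(int fd, void *buf, size_t len)
{
	ssize_t n = read(fd, buf, len);

	return (n >= 0 && (size_t)n == len) ? 0 : -1;
}
```

Consuming the return value this way is enough to silence -Wunused-result; what the caller then does with the failure is a separate decision.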

10 months


[PATCH] selftests: net: convert comma to semicolon

by Chen Ni

Replace a comma between expression statements by a semicolon.

Signed-off-by: Chen Ni <nichen(a)iscas.ac.cn>
---
 tools/testing/selftests/net/psock_fanout.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/tools/testing/selftests/net/psock_fanout.c b/tools/testing/selftests/net/psock_fanout.c
index 1a736f700be4..4f31e92ebd96 100644
--- a/tools/testing/selftests/net/psock_fanout.c
+++ b/tools/testing/selftests/net/psock_fanout.c
@@ -165,9 +165,9 @@ static void sock_fanout_set_ebpf(int fd)
 	attr.insns = (unsigned long) prog;
 	attr.insn_cnt = ARRAY_SIZE(prog);
 	attr.license = (unsigned long) "GPL";
-	attr.log_buf = (unsigned long) log_buf,
-	attr.log_size = sizeof(log_buf),
-	attr.log_level = 1,
+	attr.log_buf = (unsigned long) log_buf;
+	attr.log_size = sizeof(log_buf);
+	attr.log_level = 1;
 
 	pfd = syscall(__NR_bpf, BPF_PROG_LOAD, &attr, sizeof(attr));
 	if (pfd < 0) {
--
2.25.1
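In the hunk above the comma-joined assignments run unconditionally, so the change should not alter behaviour. The comma form becomes fragile once a guard is involved, which the following sketch illustrates; both function names are made up for this example.

```c
/* With a comma, both assignments form ONE statement guarded by the if. */
int demo_with_comma(int cond)
{
	int a = 0, b = 0;

	if (cond)
		a = 1,
		b = 2;
	return a + b;
}

/* With a semicolon, only the first assignment is guarded. */
int demo_with_semicolon(int cond)
{
	int a = 0, b = 0;

	if (cond)
		a = 1;
	b = 2;
	return a + b;
}
```

The two functions differ only when `cond` is zero, which is exactly the case a reader skimming the indentation is likely to misjudge.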

10 months


[PATCH v3 0/3] RISC-V: mm: do not treat hint addr on mmap as the upper bound to search

by Yangyu Chen

Previous patch series [1][2] changed the mmap behavior so that the hint
address is treated as the upper bound of the mmap address range. The
motivation of those series was that some user space software may assume
a 48-bit address space and use the higher bits to encode some
information, which may collide with the large virtual addresses mmap
may return.

However, to make sv48 the default, we don't need to change the meaning
of the hint address on mmap into the upper bound of the mmap address
range. This behavior breaks some user space software like Chromium,
which gets an ENOMEM error when the hint address + size is not big
enough, as described in [3].

Other ISAs with larger than 48-bit virtual address space, like x86,
arm64, and powerpc, do not have this special mmap behavior on the hint
address. They all just use a 48-bit / 47-bit virtual address space by
default, and if user space software wants a larger virtual address
space, it only needs to specify a hint address larger than 48-bit /
47-bit.

Thus, this patch series changes mmap to use sv48 by default but does
not treat the hint address as the upper bound of the mmap address
range. After this series, the behavior of mmap will align with the
existing behavior on other ISAs with larger than 48-bit virtual address
space like x86, arm64, and powerpc. User space software will no longer
need to rewrite its code to fit this special mmap behavior that exists
only on RISC-V.

Note: Charlie also created another series [4] to completely remove the
arch_get_mmap_end and arch_get_mmap_base behavior based on the hint
address and size. However, this will cause programs like Go and Java,
which need to store information in the higher bits of the pointer, to
fail on Sv57 machines.

Changes in v3:
- Rebase to newest master
- Change some information in the cover letter after patchset [2]
- Use patch [5] to patch selftests
- Link to v2: https://lore.kernel.org/linux-riscv/tencent_B2D0435BC011135736262764B511994…

Changes in v2:
- Correct arch_get_mmap_end and arch_get_mmap_base
- Add description in documentation about mmap behavior on kernel
  v6.6-6.7.
- Improve commit message and cover letter
- Rebase to newest riscv/for-next branch
- Link to v1: https://lore.kernel.org/linux-riscv/tencent_F3B3B5AB1C9D704763CA423E1A41F8B…

[1] https://lore.kernel.org/linux-riscv/20230809232218.849726-1-charlie@rivosin…
[2] https://lore.kernel.org/linux-riscv/20240130-use_mmap_hint_address-v3-0-8a6…
[3] https://lore.kernel.org/linux-riscv/MEYP282MB2312A08FF95D44014AB78411C68D2@…
[4] https://lore.kernel.org/linux-riscv/20240826-riscv_mmap-v1-0-cd8962afe47f@r…
[5] https://lore.kernel.org/linux-riscv/20240826-riscv_mmap-v1-2-cd8962afe47f@r…

Charlie Jenkins (1):
  riscv: selftests: Remove mmap hint address checks

Yangyu Chen (2):
  RISC-V: mm: not use hint addr as upper bound
  Documentation: riscv: correct sv57 kernel behavior

 Documentation/arch/riscv/vm-layout.rst        | 43 ++++++++----
 arch/riscv/include/asm/processor.h            | 20 ++----
 .../selftests/riscv/mm/mmap_bottomup.c        |  2 -
 .../testing/selftests/riscv/mm/mmap_default.c |  2 -
 tools/testing/selftests/riscv/mm/mmap_test.h  | 67 -------------------
 5 files changed, 36 insertions(+), 98 deletions(-)

--
2.45.2

10 months


[PATCH v3 0/3] selftests: Fix cpuid / vendor checking build issues

by Ilpo Järvinen

This series first generalizes the resctrl selftest non-contiguous CAT
check so that it does not assume a non-AMD vendor implies Intel. Second,
it improves the kselftest common parts and the resctrl selftest such
that the use of __cpuid_count() does not lead to a build failure
(happens at least on ARM).

While ARM does not currently support resctrl features, there is ongoing
work to enable resctrl support for it on the kernel side as well. In
any case, a common header such as kselftest.h should have a proper
fallback in place for what it provides, thus it seems justified to fix
this common-level problem on the common level rather than e.g.
disabling the build of the resctrl selftest for archs lacking resctrl
support.

I've dropped the Reviewed-by and Tested-by tags from the last patch due
to major changes to the makefile logic. So it would be helpful if
Muhammad could retest with this version.

v3:
- Remove "empty" wording
- Also cast input parameters to void
- Initialize ARCH from uname -m if not set (this might allow cleaning
  up some other makefiles but that is left as future work)

v2:
- Removed RFC from the last patch & added Fixes and tags
- Fixed the error message's line splits
- Noted down the reason for void casts in the stub

Ilpo Järvinen (3):
  selftests/resctrl: Generalize non-contiguous CAT check
  selftests/resctrl: Always initialize ecx to avoid build warnings
  kselftest: Provide __cpuid_count() stub on non-x86 archs

 tools/testing/selftests/kselftest.h        |  6 +++++
 tools/testing/selftests/lib.mk             |  6 +++++
 tools/testing/selftests/resctrl/cat_test.c | 28 +++++++++++++---------
 3 files changed, 29 insertions(+), 11 deletions(-)

--
2.39.2
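The cover letter mentions a non-x86 stub whose arguments are cast to void, and initializing ecx before use. The following is only a sketch of that general pattern and is not taken from the series: `demo_cpuid_count` and `demo_max_basic_leaf` are hypothetical names, and the x86 branch assumes the compiler-provided `<cpuid.h>` with its `__cpuid_count()` macro.

```c
#if defined(__i386__) || defined(__x86_64__)
#include <cpuid.h>
#define demo_cpuid_count(leaf, sub, a, b, c, d) \
	__cpuid_count(leaf, sub, a, b, c, d)
#else
/* Fallback: execute nothing, but mark every argument as used. */
#define demo_cpuid_count(leaf, sub, a, b, c, d)	\
	do {						\
		(void)(leaf); (void)(sub);		\
		(void)(a); (void)(b);			\
		(void)(c); (void)(d);			\
	} while (0)
#endif

/* Returns CPUID leaf 0 EAX (highest basic leaf) on x86, 0 elsewhere. */
unsigned int demo_max_basic_leaf(void)
{
	/* Initialized up front so the stub path never reads garbage. */
	unsigned int eax = 0, ebx = 0, ecx = 0, edx = 0;

	demo_cpuid_count(0, 0, eax, ebx, ecx, edx);
	(void)ebx; (void)ecx; (void)edx;
	return eax;
}
```

Because the stub leaves its output arguments untouched, callers have to initialize them, which is presumably why the series pairs the stub with an "always initialize ecx" patch.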

10 months


[PATCH] utimer-test: remove unused variables

by bajing

The variable i is never referenced in the code, just remove it.

Signed-off-by: bajing <bajing(a)cmss.chinamobile.com>
---
 tools/testing/selftests/alsa/utimer-test.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/tools/testing/selftests/alsa/utimer-test.c b/tools/testing/selftests/alsa/utimer-test.c
index 32ee3ce57721..9d2683c83ef3 100644
--- a/tools/testing/selftests/alsa/utimer-test.c
+++ b/tools/testing/selftests/alsa/utimer-test.c
@@ -140,7 +140,6 @@ TEST_F(timer_f, utimer) {
 TEST(wrong_timers_test) {
 	int timer_dev_fd;
 	int utimer_fd;
-	size_t i;
 	struct snd_timer_uinfo wrong_timer = {
 		.resolution = 0,
 		.id = UTIMER_DEFAULT_ID,
--
2.33.0

10 months


[PATCH RFC 0/8] extensible syscalls: CHECK_FIELDS to allow for easier feature detection

by Aleksa Sarai

This is something that I've been thinking about for a while. We had a
discussion at LPC 2020 about this[1] but the proposals suggested there
never materialised.

In short, it is quite difficult for userspace to detect the feature
capability of syscalls at runtime. This is something a lot of programs
want to do, but they are forced to create elaborate scenarios to try to
figure out if a feature is supported without causing damage to the
system. For the vast majority of cases, each individual feature also
needs to be tested individually (because syscall results are
all-or-nothing), so testing even a single syscall's feature set can
easily inflate the startup time of programs.

This patchset implements the fairly minimal design I proposed in this
talk[2] and in some old LKML threads (though I can't find the exact
references ATM). The general flow looks like:

 1. Userspace will indicate to the kernel that a syscall should be a
    no-op by setting the top bit of the extensible struct size argument.

    We will almost certainly never support exabyte sized structs, so the
    top bits are free for us to use as makeshift flag bits. This is
    preferable to using the per-syscall flag field inside the structure
    because seccomp can easily detect the bit in the flag and allow the
    probe or forcefully return -EEXTSYS_NOOP.

 2. The kernel will then fill the provided structure with every valid
    bit pattern that the current kernel understands.

    For flags or other bitflag-like fields, this is the set of valid
    flags or bits. For pointer fields or fields that take an arbitrary
    value, the field has every bit set (0xFF... to fill the field) to
    indicate that any value is valid in the field.

 3. The syscall then returns -EEXTSYS_NOOP which is an errno that will
    only ever be used for this purpose (so userspace can be sure that
    the request succeeded).

    On older kernels, the syscall will return a different error (usually
    -E2BIG or -EFAULT) and userspace can do their old-fashioned checks.

 4. Userspace can then check which flags and fields are supported by
    looking at the fields in the returned structure. Flags are checked
    by doing an AND with the flags field, and field support can be
    checked by comparing to 0.

    In principle you could just AND the entire structure if you wanted
    to do this check generically without caring about the structure
    contents (this is what libraries might consider doing).

    Userspace can even find out the internal kernel structure size by
    passing a PAGE_SIZE buffer and seeing how many bytes are non-zero.

As with copy_struct_from_user(), this is designed to be forward- and
backwards- compatible.

This allows programs to get a one-shot understanding of what features a
syscall supports without having to do any elaborate setups or tricks to
detect support for destructive features. Flags can simply be ANDed to
check if they are in the supported set, and fields can just be checked
to see if they are non-zero.

This patchset is IMHO the simplest way we can add the ability to
introspect the feature set of extensible struct (copy_struct_from_user)
syscalls. It doesn't preclude the chance of a more generic mechanism
being added later.

The intended way of using this interface to get feature information
looks something like the following (imagine that openat2 has gained a
new field and a new flag in the future):

  static bool openat2_no_automount_supported;
  static bool openat2_cwd_fd_supported;

  int check_openat2_support(void)
  {
      int err;
      struct open_how how = {};

      err = openat2(AT_FDCWD, ".", &how, CHECK_FIELDS | sizeof(how));
      assert(err < 0);
      switch (errno) {
      case EFAULT:
      case E2BIG:
          /* Old kernel... */
          check_support_the_old_way();
          break;
      case EEXTSYS_NOOP:
          openat2_no_automount_supported = (how.flags & RESOLVE_NO_AUTOMOUNT);
          openat2_cwd_fd_supported = (how.cwd_fd != 0);
          break;
      }
  }

[1]: https://lwn.net/Articles/830666/
[2]: https://youtu.be/ggD-eb3yPVs

Signed-off-by: Aleksa Sarai <cyphar(a)cyphar.com>
---
Aleksa Sarai (8):
      uaccess: add copy_struct_to_user helper
      sched_getattr: port to copy_struct_to_user
      openat2: explicitly return -E2BIG for (usize > PAGE_SIZE)
      openat2: add CHECK_FIELDS flag to usize argument
      clone3: add CHECK_FIELDS flag to usize argument
      selftests: openat2: add 0xFF poisoned data after misaligned struct
      selftests: openat2: add CHECK_FIELDS selftests
      selftests: clone3: add CHECK_FIELDS selftests

 fs/open.c                                          |  17 ++
 include/linux/uaccess.h                            |  98 +++++++++
 include/uapi/asm-generic/errno.h                   |   3 +
 include/uapi/linux/openat2.h                       |   2 +
 kernel/fork.c                                      |  33 ++-
 kernel/sched/syscalls.c                            |  42 +---
 tools/testing/selftests/clone3/.gitignore          |   1 +
 tools/testing/selftests/clone3/Makefile            |   2 +-
 .../testing/selftests/clone3/clone3_check_fields.c | 229 +++++++++++++++++++++
 tools/testing/selftests/openat2/openat2_test.c     | 126 +++++++++++-
 10 files changed, 504 insertions(+), 49 deletions(-)
---
base-commit: 431c1646e1f86b949fa3685efc50b660a364c2b6
change-id: 20240803-extensible-structs-check_fields-a47e94cef691

Best regards,
--
Aleksa Sarai <cyphar(a)cyphar.com>
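The cover letter notes that a library could "just AND the entire structure" to check a request generically. A minimal user-space sketch of that byte-wise check is given here; `demo_request_fits` is a hypothetical helper, and it assumes the caller already holds a buffer that the kernel filled with the supported bit patterns as described in step 2.

```c
#include <stddef.h>

/*
 * Returns 1 if every bit set in `req` is also set in `supported`
 * (both `size` bytes long), 0 otherwise. Works on any struct layout
 * because it never interprets individual fields.
 */
int demo_request_fits(const void *req, const void *supported, size_t size)
{
	const unsigned char *r = req;
	const unsigned char *s = supported;
	size_t i;

	for (i = 0; i < size; i++) {
		/* Any requested bit outside the supported mask fails. */
		if (r[i] & ~s[i])
			return 0;
	}
	return 1;
}
```

Since arbitrary-value fields are reported as all ones, any value the caller puts there passes the check, while an unknown flag bit or a non-zero unknown field is rejected.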

10 months


[PATCH] selftests/bpf: Fix procmap_query()'s params mismatch and compilation warning

by Yuan Chen

From: Yuan Chen <chenyuan(a)kylinos.cn>

When PROCMAP_QUERY is not defined, a compilation error occurs because
the parameters of the two procmap_query() definitions do not match.
procmap_query() is only called in the file where it is defined, so
modify the parameters so that they match.

We also get a warning when building samples/bpf:
    trace_helpers.c:252:5: warning: no previous prototype for ‘procmap_query’ [-Wmissing-prototypes]
      252 | int procmap_query(int fd, const void *addr, __u32 query_flags, size_t *start, size_t *offset, int *flags)
          |     ^~~~~~~~~~~~~
As this function is only used in this file, mark it as 'static'.

Signed-off-by: Yuan Chen <chenyuan(a)kylinos.cn>
---
 tools/testing/selftests/bpf/trace_helpers.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/tools/testing/selftests/bpf/trace_helpers.c b/tools/testing/selftests/bpf/trace_helpers.c
index 1bfd881c0e07..2d742fdac6b9 100644
--- a/tools/testing/selftests/bpf/trace_helpers.c
+++ b/tools/testing/selftests/bpf/trace_helpers.c
@@ -249,7 +249,7 @@ int kallsyms_find(const char *sym, unsigned long long *addr)
 #ifdef PROCMAP_QUERY
 int env_verbosity __weak = 0;
 
-int procmap_query(int fd, const void *addr, __u32 query_flags, size_t *start, size_t *offset, int *flags)
+static int procmap_query(int fd, const void *addr, __u32 query_flags, size_t *start, size_t *offset, int *flags)
 {
 	char path_buf[PATH_MAX], build_id_buf[20];
 	struct procmap_query q;
@@ -293,7 +293,7 @@ int procmap_query(int fd, const void *addr, __u32 query_flags, size_t *start, si
 	return 0;
 }
 #else
-int procmap_query(int fd, const void *addr, size_t *start, size_t *offset, int *flags)
+static int procmap_query(int fd, const void *addr, __u32 query_flags, size_t *start, size_t *offset, int *flags)
 {
 	return -EOPNOTSUPP;
 }
--
2.46.0
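The fix above boils down to keeping the fallback definition's parameter list identical to the real one and giving both internal linkage. A reduced sketch of that pattern follows; `DEMO_HAVE_QUERY`, `demo_query` and `demo_lookup` are made-up names standing in for PROCMAP_QUERY and the helpers in trace_helpers.c.

```c
#include <errno.h>
#include <stddef.h>

/* DEMO_HAVE_QUERY is a made-up feature macro standing in for PROCMAP_QUERY. */
#ifdef DEMO_HAVE_QUERY
static int demo_query(int fd, const void *addr, unsigned int query_flags,
		      size_t *start)
{
	(void)fd; (void)addr; (void)query_flags;
	*start = 0;
	return 0;
}
#else
/* Same parameter list as above, so the caller compiles either way. */
static int demo_query(int fd, const void *addr, unsigned int query_flags,
		      size_t *start)
{
	(void)fd; (void)addr; (void)query_flags; (void)start;
	return -EOPNOTSUPP;
}
#endif

/* The single caller passes the same arguments in both configurations. */
int demo_lookup(int fd, const void *addr, size_t *start)
{
	return demo_query(fd, addr, 0, start);
}
```

Marking the helper `static` also avoids -Wmissing-prototypes, since no external declaration is expected for a file-local function.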

10 months


[PATCH 0/6] Extend pmu_counters_test to AMD CPUs

by Colton Lewis

(I was positive I had sent this already, but I couldn't find it on the
mailing list to reply to and ask for reviews.)

Extend pmu_counters_test to AMD CPUs.

As the AMD PMU is quite different from Intel with different events and
feature sets, this series introduces a new code path to test it,
specifically focusing on the core counters including the
PerfCtrExtCore and PerfMonV2 features. Northbridge counters and cache
counters exist, but are not as important and can be deferred to a
later series.

The first patch is a bug fix that could be submitted separately.

The series has been tested on both Intel and AMD machines, but I have
not found an AMD machine old enough to lack PerfCtrExtCore. I have
made efforts to ensure that no part of the code has any dependency on
its presence.

I am aware of similar work in this direction done by Jinrong Liang
[1]. He told me he is not working on it currently and I am not
intruding by making my own submission.

[1] https://lore.kernel.org/kvm/20231121115457.76269-1-cloudliang@tencent.com/

Colton Lewis (6):
  KVM: x86: selftests: Fix typos in macro variable use
  KVM: x86: selftests: Define AMD PMU CPUID leaves
  KVM: x86: selftests: Set up AMD VM in pmu_counters_test
  KVM: x86: selftests: Test read/write core counters
  KVM: x86: selftests: Test core events
  KVM: x86: selftests: Test PerfMonV2

 .../selftests/kvm/include/x86_64/processor.h  |   7 +
 .../selftests/kvm/x86_64/pmu_counters_test.c  | 267 ++++++++++++++++--
 2 files changed, 249 insertions(+), 25 deletions(-)

--
2.46.0.76.ge559c4bf1a-goog
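As background for the two features named in this cover letter, the sketch below decodes them from raw CPUID register values. The bit positions reflect my understanding of AMD's CPUID layout (leaf 0x80000001 ECX bit 23 for PerfCtrExtCore, leaf 0x80000022 EAX bit 0 for PerfMonV2, EBX bits 3:0 for the core counter count) and are not taken from the series; all `demo_`/`DEMO_` names are hypothetical.

```c
#include <stdint.h>

#define DEMO_LEAF_EXT_FEATURES	0x80000001u	/* ECX bit 23: PerfCtrExtCore */
#define DEMO_LEAF_EXT_PERFMON	0x80000022u	/* EAX bit 0: PerfMonV2 */

/* Takes ECX as returned by CPUID leaf DEMO_LEAF_EXT_FEATURES. */
int demo_has_perfctr_ext_core(uint32_t ecx)
{
	return (ecx >> 23) & 1;
}

/* Takes EAX as returned by CPUID leaf DEMO_LEAF_EXT_PERFMON. */
int demo_has_perfmon_v2(uint32_t eax)
{
	return eax & 1;
}

/* Takes EBX from the same leaf; low four bits hold the counter count. */
unsigned int demo_num_core_counters(uint32_t ebx)
{
	return ebx & 0xf;
}
```

Keeping the decoding separate from the CPUID instruction itself lets the same helpers run against values a test harness injects into a guest.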

10 months


[PATCH net-next] selftests: add selftest for UDP SO_PEEK_OFF support

by Jason Xing

From: Jason Xing <kernelxing(a)tencent.com>

Add the SO_PEEK_OFF selftest for UDP. In this patch, I mainly do
three things:
1. rename tcp_so_peek_off.c
2. adjust for UDP protocol
3. add selftests into it

Suggested-by: Jon Maloy <jmaloy(a)redhat.com>
Signed-off-by: Jason Xing <kernelxing(a)tencent.com>
---
Link: https://lore.kernel.org/all/9f4dd14d-fbe3-4c61-b04c-f0e6b8096d7b@redhat.com/
---
 tools/testing/selftests/net/Makefile          |  2 +-
 .../{tcp_so_peek_off.c => sk_so_peek_off.c}   | 91 +++++++++++--------
 2 files changed, 56 insertions(+), 37 deletions(-)
 rename tools/testing/selftests/net/{tcp_so_peek_off.c => sk_so_peek_off.c} (58%)

diff --git a/tools/testing/selftests/net/Makefile b/tools/testing/selftests/net/Makefile
index 1179e3261bef..d5029f978aa9 100644
--- a/tools/testing/selftests/net/Makefile
+++ b/tools/testing/selftests/net/Makefile
@@ -80,7 +80,7 @@ TEST_PROGS += io_uring_zerocopy_tx.sh
 TEST_GEN_FILES += bind_bhash
 TEST_GEN_PROGS += sk_bind_sendto_listen
 TEST_GEN_PROGS += sk_connect_zero_addr
-TEST_GEN_PROGS += tcp_so_peek_off
+TEST_GEN_PROGS += sk_so_peek_off
 TEST_PROGS += test_ingress_egress_chaining.sh
 TEST_GEN_PROGS += so_incoming_cpu
 TEST_PROGS += sctp_vrf.sh
diff --git a/tools/testing/selftests/net/tcp_so_peek_off.c b/tools/testing/selftests/net/sk_so_peek_off.c
similarity index 58%
rename from tools/testing/selftests/net/tcp_so_peek_off.c
rename to tools/testing/selftests/net/sk_so_peek_off.c
index df8a39d9d3c3..870a890138c4 100644
--- a/tools/testing/selftests/net/tcp_so_peek_off.c
+++ b/tools/testing/selftests/net/sk_so_peek_off.c
@@ -10,37 +10,41 @@
 #include <arpa/inet.h>
 #include "../kselftest.h"
 
-static char *afstr(int af)
+static char *afstr(int af, int proto)
 {
-	return af == AF_INET ? "TCP/IPv4" : "TCP/IPv6";
+	if (proto == IPPROTO_TCP)
+		return af == AF_INET ? "TCP/IPv4" : "TCP/IPv6";
+	else
+		return af == AF_INET ? "UDP/IPv4" : "UDP/IPv6";
 }
 
-int tcp_peek_offset_probe(sa_family_t af)
+int sk_peek_offset_probe(sa_family_t af, int proto)
 {
+	int type = (proto == IPPROTO_TCP ? SOCK_STREAM : SOCK_DGRAM);
 	int optv = 0;
 	int ret = 0;
 	int s;
 
-	s = socket(af, SOCK_STREAM | SOCK_CLOEXEC, IPPROTO_TCP);
+	s = socket(af, type, proto);
 	if (s < 0) {
 		ksft_perror("Temporary TCP socket creation failed");
 	} else {
 		if (!setsockopt(s, SOL_SOCKET, SO_PEEK_OFF, &optv, sizeof(int)))
 			ret = 1;
 		else
-			printf("%s does not support SO_PEEK_OFF\n", afstr(af));
+			printf("%s does not support SO_PEEK_OFF\n", afstr(af, proto));
 		close(s);
 	}
 	return ret;
 }
 
-static void tcp_peek_offset_set(int s, int offset)
+static void sk_peek_offset_set(int s, int offset)
 {
 	if (setsockopt(s, SOL_SOCKET, SO_PEEK_OFF, &offset, sizeof(offset)))
 		ksft_perror("Failed to set SO_PEEK_OFF value\n");
 }
 
-static int tcp_peek_offset_get(int s)
+static int sk_peek_offset_get(int s)
 {
 	int offset;
 	socklen_t len = sizeof(offset);
@@ -50,8 +54,9 @@ static int tcp_peek_offset_get(int s)
 	return offset;
 }
 
-static int tcp_peek_offset_test(sa_family_t af)
+static int sk_peek_offset_test(sa_family_t af, int proto)
 {
+	int type = (proto == IPPROTO_TCP ? SOCK_STREAM : SOCK_DGRAM);
 	union {
 		struct sockaddr sa;
 		struct sockaddr_in a4;
@@ -62,13 +67,13 @@ static int tcp_peek_offset_test(sa_family_t af)
 	int recv_sock = 0;
 	int offset = 0;
 	ssize_t len;
-	char buf;
+	char buf[2];
 
 	memset(&a, 0, sizeof(a));
 	a.sa.sa_family = af;
 
-	s[0] = socket(af, SOCK_STREAM, IPPROTO_TCP);
-	s[1] = socket(af, SOCK_STREAM | SOCK_NONBLOCK, IPPROTO_TCP);
+	s[0] = recv_sock = socket(af, type, proto);
+	s[1] = socket(af, type, proto);
 
 	if (s[0] < 0 || s[1] < 0) {
 		ksft_perror("Temporary socket creation failed\n");
@@ -82,76 +87,78 @@ static int tcp_peek_offset_test(sa_family_t af)
 		ksft_perror("Temporary socket getsockname() failed\n");
 		goto out;
 	}
-	if (listen(s[0], 0) < 0) {
+	if (proto == IPPROTO_TCP && listen(s[0], 0) < 0) {
 		ksft_perror("Temporary socket listen() failed\n");
 		goto out;
 	}
-	if (connect(s[1], &a.sa, sizeof(a)) >= 0 || errno != EINPROGRESS) {
+	if (connect(s[1], &a.sa, sizeof(a))) {
 		ksft_perror("Temporary socket connect() failed\n");
 		goto out;
 	}
-	recv_sock = accept(s[0], NULL, NULL);
-	if (recv_sock <= 0) {
-		ksft_perror("Temporary socket accept() failed\n");
-		goto out;
+	if (proto == IPPROTO_TCP) {
+		recv_sock = accept(s[0], NULL, NULL);
+		if (recv_sock <= 0) {
+			ksft_perror("Temporary socket accept() failed\n");
+			goto out;
+		}
 	}
 
 	/* Some basic tests of getting/setting offset */
-	offset = tcp_peek_offset_get(recv_sock);
+	offset = sk_peek_offset_get(recv_sock);
 	if (offset != -1) {
 		ksft_perror("Initial value of socket offset not -1\n");
 		goto out;
 	}
-	tcp_peek_offset_set(recv_sock, 0);
-	offset = tcp_peek_offset_get(recv_sock);
+	sk_peek_offset_set(recv_sock, 0);
+	offset = sk_peek_offset_get(recv_sock);
 	if (offset != 0) {
 		ksft_perror("Failed to set socket offset to 0\n");
 		goto out;
 	}
 
 	/* Transfer a message */
-	if (send(s[1], (char *)("ab"), 2, 0) <= 0 || errno != EINPROGRESS) {
+	if (send(s[1], (char *)("ab"), 2, 0) != 2) {
 		ksft_perror("Temporary probe socket send() failed\n");
 		goto out;
 	}
 	/* Read first byte */
-	len = recv(recv_sock, &buf, 1, MSG_PEEK);
-	if (len != 1 || buf != 'a') {
+	len = recv(recv_sock, buf, 1, MSG_PEEK);
+	if (len != 1 || buf[0] != 'a') {
 		ksft_perror("Failed to read first byte of message\n");
 		goto out;
 	}
-	offset = tcp_peek_offset_get(recv_sock);
+	offset = sk_peek_offset_get(recv_sock);
 	if (offset != 1) {
 		ksft_perror("Offset not forwarded correctly at first byte\n");
 		goto out;
 	}
 	/* Try to read beyond last byte */
-	len = recv(recv_sock, &buf, 2, MSG_PEEK);
-	if (len != 1 || buf != 'b') {
+	len = recv(recv_sock, buf, 2, MSG_PEEK);
+	if (len != 1 || buf[0] != 'b') {
 		ksft_perror("Failed to read last byte of message\n");
 		goto out;
 	}
-	offset = tcp_peek_offset_get(recv_sock);
+	offset = sk_peek_offset_get(recv_sock);
 	if (offset != 2) {
 		ksft_perror("Offset not forwarded correctly at last byte\n");
 		goto out;
 	}
 	/* Flush message */
-	len = recv(recv_sock, NULL, 2, MSG_TRUNC);
+	len = recv(recv_sock, buf, 2, MSG_TRUNC);
 	if (len != 2) {
 		ksft_perror("Failed to flush message\n");
 		goto out;
 	}
-	offset = tcp_peek_offset_get(recv_sock);
+	offset = sk_peek_offset_get(recv_sock);
 	if (offset != 0) {
 		ksft_perror("Offset not reverted correctly after flush\n");
 		goto out;
 	}
 
-	printf("%s with MSG_PEEK_OFF works correctly\n", afstr(af));
+	printf("%s with MSG_PEEK_OFF works correctly\n", afstr(af, proto));
 	res = 1;
 out:
-	if (recv_sock >= 0)
+	if (proto == IPPROTO_TCP && recv_sock >= 0)
 		close(recv_sock);
 	if (s[1] >= 0)
 		close(s[1]);
@@ -160,24 +167,36 @@ static int tcp_peek_offset_test(sa_family_t af)
 	return res;
 }
 
-int main(void)
+static int do_test(int proto)
 {
 	int res4, res6;
 
-	res4 = tcp_peek_offset_probe(AF_INET);
-	res6 = tcp_peek_offset_probe(AF_INET6);
+	res4 = sk_peek_offset_probe(AF_INET, proto);
+	res6 = sk_peek_offset_probe(AF_INET6, proto);
 
 	if (!res4 && !res6)
 		return KSFT_SKIP;
 
 	if (res4)
-		res4 = tcp_peek_offset_test(AF_INET);
+		res4 = sk_peek_offset_test(AF_INET, proto);
 
 	if (res6)
-		res6 = tcp_peek_offset_test(AF_INET6);
+		res6 = sk_peek_offset_test(AF_INET6, proto);
 
 	if (!res4 || !res6)
 		return KSFT_FAIL;
 
 	return KSFT_PASS;
 }
+
+int main(void)
+{
+	int restcp, resudp;
+
+	restcp = do_test(IPPROTO_TCP);
+	resudp = do_test(IPPROTO_UDP);
+	if (restcp == KSFT_FAIL || resudp == KSFT_FAIL)
+		return KSFT_FAIL;
+
+	return KSFT_PASS;
+}
--
2.37.3

10 months


[PATCH v4 0/5] Wire up getrandom() vDSO implementation on powerpc

by Christophe Leroy

This series wires up getrandom() vDSO implementation on powerpc.

Tested on PPC32 on real hardware.
Tested on PPC64 (both BE and LE) on QEMU:

Performance on powerpc 885:
	~# ./vdso_test_getrandom bench-single
	   vdso: 25000000 times in 62.938002291 seconds
	   libc: 25000000 times in 535.581916866 seconds
	syscall: 25000000 times in 531.525042806 seconds

Performance on powerpc 8321:
	~# ./vdso_test_getrandom bench-single
	   vdso: 25000000 times in 16.899318858 seconds
	   libc: 25000000 times in 131.050596522 seconds
	syscall: 25000000 times in 129.794790389 seconds

Performance on QEMU pseries:
	~ # ./vdso_test_getrandom bench-single
	   vdso: 25000000 times in 4.977777162 seconds
	   libc: 25000000 times in 75.516749981 seconds
	syscall: 25000000 times in 86.842242014 seconds

Changes in v4:
- Rebased on recent random git tree (963233ff0133) (The new tree includes selftests fixes)
- Read/write counter in native byte order
- Don't use anymore compat macros to write output
- Fixed selftests build failure with patch 4 (without patch 5) on little endian on PPC64
- Implement a __kernel_getrandom() stub returning ENOSYS on ppc64 in patch 4 (without patch 5) to make selftests happy.

Changes in v3:
- Rebased on recent random git tree (0c7e00e22c21)
- Fixed build failures reported by robots around VM_DROPPABLE
- Fixed crash on PPC64 due to clobbered r13 by not using r13 anymore (saving it was not enough for signals).
- Split final patch in two, first for PPC32, second for PPC64
- Moved selftest fixes out of this series

Changes in v2:
- Define VM_DROPPABLE for powerpc/32
- Fixes generic vDSO getrandom headers to enable CONFIG_COMPAT build.
- Fixed size of generation counter
- Fixed selftests to work on non x86 architectures

Christophe Leroy (5):
  mm: Define VM_DROPPABLE for powerpc/32
  powerpc/vdso32: Add crtsavres
  powerpc/vdso: Refactor CFLAGS for CVDSO build
  powerpc/vdso: Wire up getrandom() vDSO implementation on PPC32
  powerpc/vdso: Wire up getrandom() vDSO implementation on PPC64

 arch/powerpc/Kconfig                         |   1 +
 arch/powerpc/include/asm/mman.h              |   2 +-
 arch/powerpc/include/asm/vdso/getrandom.h    |  54 ++++
 arch/powerpc/include/asm/vdso/vsyscall.h     |   6 +
 arch/powerpc/include/asm/vdso_datapage.h     |   2 +
 arch/powerpc/kernel/asm-offsets.c            |   1 +
 arch/powerpc/kernel/vdso/Makefile            |  57 ++--
 arch/powerpc/kernel/vdso/getrandom.S         |  58 ++++
 arch/powerpc/kernel/vdso/gettimeofday.S      |  13 -
 arch/powerpc/kernel/vdso/vdso32.lds.S        |   1 +
 arch/powerpc/kernel/vdso/vdso64.lds.S        |   1 +
 arch/powerpc/kernel/vdso/vgetrandom-chacha.S | 320 +++++++++++++++++++
 arch/powerpc/kernel/vdso/vgetrandom.c        |  14 +
 fs/proc/task_mmu.c                           |   4 +-
 include/linux/mm.h                           |   4 +-
 include/trace/events/mmflags.h               |   4 +-
 tools/testing/selftests/vDSO/Makefile        |   2 +-
 17 files changed, 501 insertions(+), 43 deletions(-)
 create mode 100644 arch/powerpc/include/asm/vdso/getrandom.h
 create mode 100644 arch/powerpc/kernel/vdso/getrandom.S
 create mode 100644 arch/powerpc/kernel/vdso/vgetrandom-chacha.S
 create mode 100644 arch/powerpc/kernel/vdso/vgetrandom.c

--
2.44.0
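To put the benchmark figures in this cover letter into per-call terms, the arithmetic can be written down as two small helpers; the function names are made up for this note. Using the powerpc 885 numbers, 62.938002291 s over 25000000 calls is roughly 2.5 microseconds per vDSO call, and the syscall path at 531.525042806 s is about 8.4 times slower.

```c
/* Average cost of one call in nanoseconds. */
double demo_ns_per_call(double seconds, double iterations)
{
	return seconds * 1e9 / iterations;
}

/* How many times faster the run taking `fast_seconds` is. */
double demo_speedup(double slow_seconds, double fast_seconds)
{
	return slow_seconds / fast_seconds;
}
```

The same helpers can be fed the 8321 and QEMU pseries rows to compare how much the vDSO path gains on each platform.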

10 months


[PATCH net-next v17 11/14] mm: page_frag: add testing for the newly added prepare API

by Yunsheng Lin

Add testing for the newly added prepare API, for both the aligned and
non-aligned variants. The probe API is also tested along with the
prepare API.

CC: Alexander Duyck <alexander.duyck(a)gmail.com>
Signed-off-by: Yunsheng Lin <linyunsheng(a)huawei.com>
---
 .../selftests/mm/page_frag/page_frag_test.c   | 66 +++++++++++++++++--
 tools/testing/selftests/mm/run_vmtests.sh     |  4 ++
 tools/testing/selftests/mm/test_page_frag.sh  | 31 +++++++++
 3 files changed, 96 insertions(+), 5 deletions(-)

diff --git a/tools/testing/selftests/mm/page_frag/page_frag_test.c b/tools/testing/selftests/mm/page_frag/page_frag_test.c
index a4bd543d6950..7cfa896f69cb 100644
--- a/tools/testing/selftests/mm/page_frag/page_frag_test.c
+++ b/tools/testing/selftests/mm/page_frag/page_frag_test.c
@@ -27,6 +27,10 @@ static bool test_align;
 module_param(test_align, bool, 0);
 MODULE_PARM_DESC(test_align, "use align API for testing");
 
+static bool test_prepare;
+module_param(test_prepare, bool, 0);
+MODULE_PARM_DESC(test_prepare, "use prepare API for testing");
+
 static int test_alloc_len = 2048;
 module_param(test_alloc_len, int, 0);
 MODULE_PARM_DESC(test_alloc_len, "alloc len for testing");
@@ -67,6 +71,18 @@ static int page_frag_pop_thread(void *arg)
 	return 0;
 }
 
+static void frag_frag_test_commit(struct page_frag_cache *nc,
+				  struct page_frag *prepare_pfrag,
+				  struct page_frag *probe_pfrag,
+				  unsigned int used_sz)
+{
+	WARN_ON_ONCE(prepare_pfrag->page != probe_pfrag->page ||
+		     prepare_pfrag->offset != probe_pfrag->offset ||
+		     prepare_pfrag->size != probe_pfrag->size);
+
+	page_frag_commit(nc, prepare_pfrag, used_sz);
+}
+
 static int page_frag_push_thread(void *arg)
 {
 	struct ptr_ring *ring = arg;
@@ -80,13 +96,52 @@ static int page_frag_push_thread(void *arg)
 		int ret;
 
 		if (test_align) {
-			va = page_frag_alloc_align(&test_nc, test_alloc_len,
-						   GFP_KERNEL, SMP_CACHE_BYTES);
+			if (test_prepare) {
+				struct page_frag prepare_frag, probe_frag;
+				void *probe_va;
+
+				va = page_frag_alloc_refill_prepare_align(&test_nc,
+									  test_alloc_len,
+									  &prepare_frag,
+									  GFP_KERNEL,
+									  SMP_CACHE_BYTES);
+
+				probe_va = __page_frag_alloc_refill_probe_align(&test_nc,
+										test_alloc_len,
+										&probe_frag,
+										-SMP_CACHE_BYTES);
+				WARN_ON_ONCE(va != probe_va);
+
+				if (likely(va))
+					frag_frag_test_commit(&test_nc, &prepare_frag,
+							      &probe_frag, test_alloc_len);
+			} else {
+				va = page_frag_alloc_align(&test_nc,
+							   test_alloc_len,
+							   GFP_KERNEL,
+							   SMP_CACHE_BYTES);
+			}
 
 			WARN_ONCE((unsigned long)va & (SMP_CACHE_BYTES - 1),
 				  "unaligned va returned\n");
 		} else {
-			va = page_frag_alloc(&test_nc, test_alloc_len, GFP_KERNEL);
+			if (test_prepare) {
+				struct page_frag prepare_frag, probe_frag;
+				void *probe_va;
+
+				va = page_frag_alloc_refill_prepare(&test_nc, test_alloc_len,
+								    &prepare_frag, GFP_KERNEL);
+
+				probe_va = page_frag_alloc_refill_probe(&test_nc, test_alloc_len,
+									&probe_frag);
+
+				WARN_ON_ONCE(va != probe_va);
+				if (likely(va))
+					frag_frag_test_commit(&test_nc, &prepare_frag,
+							      &probe_frag, test_alloc_len);
+			} else {
+				va = page_frag_alloc(&test_nc, test_alloc_len, GFP_KERNEL);
+			}
 		}
 
 		if (!va)
@@ -149,8 +204,9 @@ static int __init page_frag_test_init(void)
 	wait_for_completion(&wait);
 
 	duration = (u64)ktime_us_delta(ktime_get(), start);
-	pr_info("%d of iterations for %s testing took: %lluus\n", nr_test,
-		test_align ? "aligned" : "non-aligned", duration);
+	pr_info("%d of iterations for %s %s API testing took: %lluus\n", nr_test,
+		test_align ? "aligned" : "non-aligned",
+		test_prepare ? "prepare" : "alloc", duration);
 
 	ptr_ring_cleanup(&ptr_ring, NULL);
 	page_frag_cache_drain(&test_nc);
diff --git a/tools/testing/selftests/mm/run_vmtests.sh b/tools/testing/selftests/mm/run_vmtests.sh
index 96fd470b9f51..e4a36231bbea 100755
--- a/tools/testing/selftests/mm/run_vmtests.sh
+++ b/tools/testing/selftests/mm/run_vmtests.sh
@@ -464,6 +464,10 @@ CATEGORY="page_frag" run_test ./test_page_frag.sh aligned
 
 CATEGORY="page_frag" run_test ./test_page_frag.sh nonaligned
 
+CATEGORY="page_frag" run_test ./test_page_frag.sh aligned_prepare
+
+CATEGORY="page_frag" run_test ./test_page_frag.sh nonaligned_prepare
+
 echo "SUMMARY: PASS=${count_pass} SKIP=${count_skip} FAIL=${count_fail}" | tap_prefix
 
 echo "1..${count_total}" | tap_output
diff --git a/tools/testing/selftests/mm/test_page_frag.sh b/tools/testing/selftests/mm/test_page_frag.sh
index d2b0734a90b5..3bc40a895d0d 100755
--- a/tools/testing/selftests/mm/test_page_frag.sh
+++ b/tools/testing/selftests/mm/test_page_frag.sh
@@ -36,6 +36,8 @@ ksft_skip=4
 SMOKE_PARAM="test_push_cpu=$TEST_CPU_0 test_pop_cpu=$TEST_CPU_1"
 NONALIGNED_PARAM="$SMOKE_PARAM test_alloc_len=75 nr_test=$NR_TEST"
 ALIGNED_PARAM="$NONALIGNED_PARAM test_align=1"
+NONALIGNED_PREPARE_PARAM="$NONALIGNED_PARAM test_prepare=1"
+ALIGNED_PREPARE_PARAM="$ALIGNED_PARAM test_prepare=1"
 
 check_test_requirements()
 {
@@ -74,6 +76,24 @@ run_aligned_check()
 	echo "Check the kernel ring buffer to see the summary."
 }
 
+run_nonaligned_prepare_check()
+{
+	echo "Run performance tests to evaluate how fast nonaligned prepare API is."
+
+	insmod $DRIVER $NONALIGNED_PREPARE_PARAM > /dev/null 2>&1
+	echo "Done."
+	echo "Check the kernel ring buffer to see the summary."
+}
+
+run_aligned_prepare_check()
+{
+	echo "Run performance tests to evaluate how fast aligned prepare API is."
+
+	insmod $DRIVER $ALIGNED_PREPARE_PARAM > /dev/null 2>&1
+	echo "Done."
+	echo "Check the kernel ring buffer to see the summary."
+}
+
 run_smoke_check()
 {
 	echo "Run smoke test."
@@ -86,6 +106,7 @@ run_smoke_check()
 usage()
 {
 	echo -n "Usage: $0 [ aligned ] | [ nonaligned ] | | [ smoke ] | "
+	echo "[ aligned_prepare ] | [ nonaligned_prepare ] | "
 	echo "manual parameters"
 	echo
 	echo "Valid tests and parameters:"
@@ -106,6 +127,12 @@ usage()
 	echo "# Performance testing for aligned alloc API"
 	echo "$0 aligned"
 	echo
+	echo "# Performance testing for nonaligned prepare API"
+	echo "$0 nonaligned_prepare"
+	echo
+	echo "# Performance testing for aligned prepare API"
+	echo "$0 aligned_prepare"
+	echo
 	exit 0
 }
 
@@ -159,6 +186,10 @@ function run_test()
 		run_nonaligned_check
 	elif [[ "$1" = "aligned" ]]; then
 		run_aligned_check
+	elif [[ "$1" = "nonaligned_prepare" ]]; then
+		run_nonaligned_prepare_check
+	elif [[ "$1" = "aligned_prepare" ]]; then
+		run_aligned_prepare_check
 	else
 		run_manual_check $@
 	fi
--
2.33.0

10 months


[PATCH net-next v17 04/14] mm: page_frag: avoid caller accessing 'page_frag_cache' directly

by Yunsheng Lin

Use appropriate frag_page API instead of caller accessing 'page_frag_cache' directly. CC: Alexander Duyck <alexander.duyck(a)gmail.com> Signed-off-by: Yunsheng Lin <linyunsheng(a)huawei.com> Reviewed-by: Alexander Duyck <alexanderduyck(a)fb.com> Acked-by: Chuck Lever <chuck.lever(a)oracle.com> --- drivers/vhost/net.c | 2 +- include/linux/page_frag_cache.h | 10 ++++++++++ net/core/skbuff.c | 6 +++--- net/rxrpc/conn_object.c | 4 +--- net/rxrpc/local_object.c | 4 +--- net/sunrpc/svcsock.c | 6 ++---- tools/testing/selftests/mm/page_frag/page_frag_test.c | 2 +- 7 files changed, 19 insertions(+), 15 deletions(-) diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c index f16279351db5..9ad37c012189 100644 --- a/drivers/vhost/net.c +++ b/drivers/vhost/net.c @@ -1325,7 +1325,7 @@ static int vhost_net_open(struct inode *inode, struct file *f) vqs[VHOST_NET_VQ_RX]); f->private_data = n; - n->pf_cache.va = NULL; + page_frag_cache_init(&n->pf_cache); return 0; } diff --git a/include/linux/page_frag_cache.h b/include/linux/page_frag_cache.h index 67ac8626ed9b..0a52f7a179c8 100644 --- a/include/linux/page_frag_cache.h +++ b/include/linux/page_frag_cache.h @@ -7,6 +7,16 @@ #include <linux/mm_types_task.h> #include <linux/types.h> +static inline void page_frag_cache_init(struct page_frag_cache *nc) +{ + nc->va = NULL; +} + +static inline bool page_frag_cache_is_pfmemalloc(struct page_frag_cache *nc) +{ + return !!nc->pfmemalloc; +} + void page_frag_cache_drain(struct page_frag_cache *nc); void __page_frag_cache_drain(struct page *page, unsigned int count); void *__page_frag_alloc_align(struct page_frag_cache *nc, unsigned int fragsz, diff --git a/net/core/skbuff.c b/net/core/skbuff.c index a52638363ea5..a5f8e4e0c649 100644 --- a/net/core/skbuff.c +++ b/net/core/skbuff.c @@ -752,14 +752,14 @@ struct sk_buff *__netdev_alloc_skb(struct net_device *dev, unsigned int len, if (in_hardirq() || irqs_disabled()) { nc = this_cpu_ptr(&netdev_alloc_cache); data = page_frag_alloc(nc, len, 
gfp_mask); - pfmemalloc = nc->pfmemalloc; + pfmemalloc = page_frag_cache_is_pfmemalloc(nc); } else { local_bh_disable(); local_lock_nested_bh(&napi_alloc_cache.bh_lock); nc = this_cpu_ptr(&napi_alloc_cache.page); data = page_frag_alloc(nc, len, gfp_mask); - pfmemalloc = nc->pfmemalloc; + pfmemalloc = page_frag_cache_is_pfmemalloc(nc); local_unlock_nested_bh(&napi_alloc_cache.bh_lock); local_bh_enable(); @@ -849,7 +849,7 @@ struct sk_buff *napi_alloc_skb(struct napi_struct *napi, unsigned int len) len = SKB_HEAD_ALIGN(len); data = page_frag_alloc(&nc->page, len, gfp_mask); - pfmemalloc = nc->page.pfmemalloc; + pfmemalloc = page_frag_cache_is_pfmemalloc(&nc->page); } local_unlock_nested_bh(&napi_alloc_cache.bh_lock); diff --git a/net/rxrpc/conn_object.c b/net/rxrpc/conn_object.c index 1539d315afe7..694c4df7a1a3 100644 --- a/net/rxrpc/conn_object.c +++ b/net/rxrpc/conn_object.c @@ -337,9 +337,7 @@ static void rxrpc_clean_up_connection(struct work_struct *work) */ rxrpc_purge_queue(&conn->rx_queue); - if (conn->tx_data_alloc.va) - __page_frag_cache_drain(virt_to_page(conn->tx_data_alloc.va), - conn->tx_data_alloc.pagecnt_bias); + page_frag_cache_drain(&conn->tx_data_alloc); call_rcu(&conn->rcu, rxrpc_rcu_free_connection); } diff --git a/net/rxrpc/local_object.c b/net/rxrpc/local_object.c index 504453c688d7..a8cffe47cf01 100644 --- a/net/rxrpc/local_object.c +++ b/net/rxrpc/local_object.c @@ -452,9 +452,7 @@ void rxrpc_destroy_local(struct rxrpc_local *local) #endif rxrpc_purge_queue(&local->rx_queue); rxrpc_purge_client_connections(local); - if (local->tx_alloc.va) - __page_frag_cache_drain(virt_to_page(local->tx_alloc.va), - local->tx_alloc.pagecnt_bias); + page_frag_cache_drain(&local->tx_alloc); } /* diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c index 6b3f01beb294..dcfd84cf0694 100644 --- a/net/sunrpc/svcsock.c +++ b/net/sunrpc/svcsock.c @@ -1609,7 +1609,6 @@ static void svc_tcp_sock_detach(struct svc_xprt *xprt) static void svc_sock_free(struct svc_xprt 
*xprt) { struct svc_sock *svsk = container_of(xprt, struct svc_sock, sk_xprt); - struct page_frag_cache *pfc = &svsk->sk_frag_cache; struct socket *sock = svsk->sk_sock; trace_svcsock_free(svsk, sock); @@ -1619,8 +1618,7 @@ static void svc_sock_free(struct svc_xprt *xprt) sockfd_put(sock); else sock_release(sock); - if (pfc->va) - __page_frag_cache_drain(virt_to_head_page(pfc->va), - pfc->pagecnt_bias); + + page_frag_cache_drain(&svsk->sk_frag_cache); kfree(svsk); } diff --git a/tools/testing/selftests/mm/page_frag/page_frag_test.c b/tools/testing/selftests/mm/page_frag/page_frag_test.c index 5395a36e4030..a4bd543d6950 100644 --- a/tools/testing/selftests/mm/page_frag/page_frag_test.c +++ b/tools/testing/selftests/mm/page_frag/page_frag_test.c @@ -117,7 +117,7 @@ static int __init page_frag_test_init(void) u64 duration; int ret; - test_nc.va = NULL; + page_frag_cache_init(&test_nc); atomic_set(&nthreads, 2); init_completion(&wait); -- 2.33.0

10 months


[PATCH 0/2] Improve migration by backing off earlier

by Dev Jain

It was recently observed at [1] that during the folio unmapping stage of migration, when the PTEs are cleared, a racing thread faulting on that folio may increase the refcount of the folio, sleep on the folio lock (the migration path has the lock), and migration ultimately fails when asserting the actual refcount against the expected. Migration is a best effort service; the unmapping and the moving phase are wrapped around loops for retrying. The refcount of the folio is currently being asserted during the move stage; if it fails, we retry. But, if a racing thread changes the refcount, and ends up sleeping on the folio lock (which is mostly the case), there is no way the refcount would be decremented; as a result, this renders the retrying useless. In the first patch, we make the refcount check also during the unmap stage; if it fails, we restore the original state of the PTE, drop the folio lock, let the system make progress, and retry unmapping again. This improves the probability of migration winning the race. Given that migration is a best-effort service, it is wrong to fail the test for just a single failure; hence, fail the test after 100 consecutive failures (where 100 is still a subjective choice). [1] https://lore.kernel.org/all/20240801081657.1386743-1-dev.jain@arm.com/ Dev Jain (2): mm: Retry migration earlier upon refcount mismatch selftests/mm: Do not fail test for a single migration failure mm/migrate.c | 9 +++++++++ tools/testing/selftests/mm/migration.c | 17 +++++++++++------ 2 files changed, 20 insertions(+), 6 deletions(-) -- 2.30.2
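The selftest policy described in this cover letter (tolerate isolated migration failures, and fail only after 100 in a row) could be sketched roughly as follows. This is an illustrative sketch, not the code from the patch: `MAX_RETRIES`, `record_attempt()` and `demo_policy()` are hypothetical names, and a real test would feed in the result of an actual migration call where the sketch passes a plain result code.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical threshold; the cover letter picks 100 and calls it subjective. */
#define MAX_RETRIES 100

/*
 * Track consecutive migration failures. 'rc' is the result of one
 * migration attempt (0 on success, nonzero on failure). Returns true
 * once MAX_RETRIES failures have been seen in a row.
 */
static bool record_attempt(int *failures, int rc)
{
	if (rc == 0) {
		*failures = 0;	/* any success resets the streak */
		return false;
	}
	return ++(*failures) >= MAX_RETRIES;
}

/* Walk through the policy: 99 failures, one success, then 100 failures. */
static void demo_policy(void)
{
	int failures = 0;
	bool give_up = false;
	int i;

	/* 99 failures in a row: still below the threshold. */
	for (i = 0; i < MAX_RETRIES - 1; i++)
		give_up = record_attempt(&failures, -1);
	assert(!give_up);

	/* One success resets the streak. */
	give_up = record_attempt(&failures, 0);
	assert(!give_up && failures == 0);

	/* 100 failures in a row: the test would now be failed. */
	for (i = 0; i < MAX_RETRIES; i++)
		give_up = record_attempt(&failures, -1);
	assert(give_up);
}
```

Counting consecutive rather than total failures matches the "best effort" framing: a migration that occasionally loses a race still passes, while one that never succeeds is still reported.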

10 months


[PATCH] selftests/futex: Create test for robust list

by André Almeida

Create a test for the robust list mechanism. Signed-off-by: André Almeida <andrealmeid(a)igalia.com> --- .../selftests/futex/functional/.gitignore | 1 + .../selftests/futex/functional/Makefile | 3 +- .../selftests/futex/functional/robust_list.c | 450 ++++++++++++++++++ 3 files changed, 453 insertions(+), 1 deletion(-) create mode 100644 tools/testing/selftests/futex/functional/robust_list.c diff --git a/tools/testing/selftests/futex/functional/.gitignore b/tools/testing/selftests/futex/functional/.gitignore index fbcbdb6963b3..4726e1be7497 100644 --- a/tools/testing/selftests/futex/functional/.gitignore +++ b/tools/testing/selftests/futex/functional/.gitignore @@ -9,3 +9,4 @@ futex_wait_wouldblock futex_wait futex_requeue futex_waitv +robust_list diff --git a/tools/testing/selftests/futex/functional/Makefile b/tools/testing/selftests/futex/functional/Makefile index f79f9bac7918..b8635a1ac7f6 100644 --- a/tools/testing/selftests/futex/functional/Makefile +++ b/tools/testing/selftests/futex/functional/Makefile @@ -17,7 +17,8 @@ TEST_GEN_PROGS := \ futex_wait_private_mapped_file \ futex_wait \ futex_requeue \ - futex_waitv + futex_waitv \ + robust_list TEST_PROGS := run.sh diff --git a/tools/testing/selftests/futex/functional/robust_list.c b/tools/testing/selftests/futex/functional/robust_list.c new file mode 100644 index 000000000000..5cc0edaaf028 --- /dev/null +++ b/tools/testing/selftests/futex/functional/robust_list.c @@ -0,0 +1,450 @@ +// SPDX-License-Identifier: GPL-2.0-or-later +/* + * Copyright Igalia, 2024 + * + * Robust list test by André Almeida <andrealmeid(a)igalia.com> + * + * The robust list uAPI allows userspace to create "robust" locks, in the sense + * that if the lock holder thread dies, the remaining threads that are waiting + * for the lock won't block forever, waiting for a lock that will never be + * released. + * + * This is achieved by userspace setting a list where a thread can enter all the + * locks (futexes) that it is holding. 
The robust list is a linked list, and + * userspace registers the start of the list with the syscall set_robust_list(). + * If such a thread eventually dies, the kernel will walk this list, waking up one + * thread waiting for each futex and marking the futex word with the flag + * FUTEX_OWNER_DIED. + * + * See also + * man set_robust_list + * Documentation/locking/robust-futex-ABI.rst + * Documentation/locking/robust-futexes.rst + */ + +#define _GNU_SOURCE + +#include "../../kselftest_harness.h" + +#include "futextest.h" + +#include <pthread.h> +#include <stdatomic.h> +#include <stddef.h> + +#define STACK_SIZE (1024 * 1024) + +#define FUTEX_TIMEOUT 3 + +static pthread_barrier_t barrier, barrier2; + +int set_robust_list(struct robust_list_head *head, size_t len) +{ + return syscall(SYS_set_robust_list, head, len); +} + +int get_robust_list(int pid, struct robust_list_head **head, size_t *len_ptr) +{ + return syscall(SYS_get_robust_list, pid, head, len_ptr); +} + +int futex2_wait(void *futex, int val, struct timespec *timo) +{ + return syscall(SYS_futex_wait, futex, val, ~0U, FUTEX2_SIZE_U32, timo, CLOCK_MONOTONIC); +} + +/* + * Basic lock struct, contains just the futex word and the robust list element + * Real implementations have also a *prev to easily walk in the list + */ +struct lock_struct { + int futex; + struct robust_list list; +}; + +/* + * Helper function to spawn a child thread. 
Returns -1 on error, pid on success + */ +static int create_child(int (*fn)(void *arg), void *arg) +{ + char *stack; + pid_t pid; + + stack = mmap(NULL, STACK_SIZE, PROT_READ | PROT_WRITE, + MAP_PRIVATE | MAP_ANONYMOUS | MAP_STACK, -1, 0); + if (stack == MAP_FAILED) + return -1; + + stack += STACK_SIZE; + + pid = clone(fn, stack, CLONE_VM | SIGCHLD, arg); + + if (pid == -1) + return -1; + + return pid; +} + +/* + * Helper function to prepare and register a robust list + */ +static int set_list(struct robust_list_head *head) +{ + int ret; + + ret = set_robust_list(head, sizeof(struct robust_list_head)); + if (ret) + return ret; + + head->futex_offset = (size_t) offsetof(struct lock_struct, futex) - + (size_t) offsetof(struct lock_struct, list); + head->list.next = &head->list; + head->list_op_pending = NULL; + + return 0; +} + +/* + * A basic (and incomplete) mutex lock function with robustness + */ +static int mutex_lock(struct lock_struct *lock, struct robust_list_head *head, bool error_inject) +{ + int *futex = &lock->futex, zero = 0, ret = -1; + pid_t tid = gettid(); + + /* + * Set list_op_pending before starting the lock, so the kernel can catch + * the case where the thread died during the lock operation + */ + head->list_op_pending = &lock->list; + + if (atomic_compare_exchange_strong(futex, &zero, tid)) { + /* + * We took the lock, insert it in the robust list + */ + struct robust_list *list = &head->list; + + /* Error injection to test list_op_pending */ + if (error_inject) + return 0; + + while (list->next != &head->list) + list = list->next; + + list->next = &lock->list; + lock->list.next = &head->list; + + ret = 0; + } else { + /* + * We didn't take the lock, wait until the owner wakes (or dies) + */ + struct timespec to; + + clock_gettime(CLOCK_MONOTONIC, &to); + to.tv_sec = to.tv_sec + FUTEX_TIMEOUT; + + tid = atomic_load(futex); + /* Kernel ignores futexes without the waiters flag */ + tid |= FUTEX_WAITERS; + atomic_store(futex, tid); + + ret = 
futex2_wait(futex, tid, &to); + + /* + * A real mutex_lock() implementation would loop here to finally + * take the lock. We don't care about that, so we stop here. + */ + } + + head->list_op_pending = NULL; + + return ret; +} + +/* + * This child thread will succeed in taking the lock, and then will exit holding it + */ +static int child_fn_lock(void *arg) +{ + struct lock_struct *lock = (struct lock_struct *) arg; + struct robust_list_head head; + int ret; + + ret = set_list(&head); + if (ret) + ksft_test_result_fail("set_robust_list error\n"); + + ret = mutex_lock(lock, &head, false); + if (ret) + ksft_test_result_fail("mutex_lock error\n"); + + pthread_barrier_wait(&barrier); + + /* + * There's a race here: the parent thread needs to be inside + * futex_wait() before the child thread dies, otherwise it will miss the + * wakeup from handle_futex_death() that this child will emit. We wait a + * little bit just to make sure that this happens. + */ + sleep(1); + + return 0; +} + +/* + * Spawns a child thread that will set a robust list, take the lock, register it + * in the robust list and die. The parent thread will wait on this futex, and + * should be woken up when the child exits. 
+ */ +TEST(robustness) +{ + struct lock_struct lock = { .futex = 0 }; + struct robust_list_head head; + int ret, *futex = &lock.futex; + + ret = set_list(&head); + ASSERT_EQ(ret, 0); + + /* + * Let's use a barrier to ensure that the child thread takes the lock + * before the parent + */ + ret = pthread_barrier_init(&barrier, NULL, 2); + ASSERT_EQ(ret, 0); + + ret = create_child(&child_fn_lock, &lock); + ASSERT_NE(ret, -1); + + pthread_barrier_wait(&barrier); + ret = mutex_lock(&lock, &head, false); + + /* + * futex_wait() should return 0 and the futex word should be marked with + * FUTEX_OWNER_DIED + */ + ASSERT_EQ(ret, 0) TH_LOG("futex wait returned %d", errno); + ASSERT_TRUE(*futex & FUTEX_OWNER_DIED); + + pthread_barrier_destroy(&barrier); +} + +/* + * The only valid value for len is sizeof(*head) + */ +TEST(set_robust_list_invalid_size) +{ + struct robust_list_head head; + size_t head_size = sizeof(struct robust_list_head); + int ret; + + ret = set_robust_list(&head, head_size); + ASSERT_EQ(ret, 0); + + ret = set_robust_list(&head, head_size * 2); + ASSERT_EQ(ret, -1); + ASSERT_EQ(errno, EINVAL); + + ret = set_robust_list(&head, head_size - 1); + ASSERT_EQ(ret, -1); + ASSERT_EQ(errno, EINVAL); + + ret = set_robust_list(&head, 0); + ASSERT_EQ(ret, -1); + ASSERT_EQ(errno, EINVAL); +} + +/* + * Test get_robust_list with pid = 0, getting the list of the running thread + */ +TEST(get_robust_list_self) +{ + struct robust_list_head head, head2, *get_head; + size_t head_size = sizeof(struct robust_list_head), len_ptr; + int ret; + + ret = set_robust_list(&head, head_size); + ASSERT_EQ(ret, 0); + + ret = get_robust_list(0, &get_head, &len_ptr); + ASSERT_EQ(ret, 0); + ASSERT_EQ(get_head, &head); + ASSERT_EQ(head_size, len_ptr); + + ret = set_robust_list(&head2, head_size); + ASSERT_EQ(ret, 0); + + ret = get_robust_list(0, &get_head, &len_ptr); + ASSERT_EQ(ret, 0); + ASSERT_EQ(get_head, &head2); + ASSERT_EQ(head_size, len_ptr); +} + +static int child_list(void *arg) +{ + 
struct robust_list_head *head = (struct robust_list_head *) arg; + int ret; + + ret = set_robust_list(head, sizeof(struct robust_list_head)); + if (ret) + ksft_test_result_fail("set_robust_list error\n"); + + pthread_barrier_wait(&barrier); + pthread_barrier_wait(&barrier2); + + return 0; +} + +/* + * Test get_robust_list from another thread. We use two barriers here to ensure + * that: + * 1) the child thread sets the list before we try to get it from the + * parent + * 2) the child thread is still alive when we try to get the list from it + */ +TEST(get_robust_list_child) +{ + pid_t tid; + int ret; + struct robust_list_head head, *get_head; + size_t len_ptr; + + ret = pthread_barrier_init(&barrier, NULL, 2); + ret = pthread_barrier_init(&barrier2, NULL, 2); + ASSERT_EQ(ret, 0); + + tid = create_child(&child_list, &head); + ASSERT_NE(tid, -1); + + pthread_barrier_wait(&barrier); + + ret = get_robust_list(tid, &get_head, &len_ptr); + ASSERT_EQ(ret, 0); + ASSERT_EQ(&head, get_head); + + pthread_barrier_wait(&barrier2); + + pthread_barrier_destroy(&barrier); + pthread_barrier_destroy(&barrier2); +} + +static int child_fn_lock_with_error(void *arg) +{ + struct lock_struct *lock = (struct lock_struct *) arg; + struct robust_list_head head; + int ret; + + ret = set_list(&head); + if (ret) + ksft_test_result_fail("set_robust_list error\n"); + + ret = mutex_lock(lock, &head, true); + if (ret) + ksft_test_result_fail("mutex_lock error\n"); + + pthread_barrier_wait(&barrier); + + sleep(1); + + return 0; +} + +/* + * Same as robustness test, but inject an error where the mutex_lock() exits + * earlier, just after setting list_op_pending and taking the lock, to test the + * list_op_pending mechanism + */ +TEST(set_list_op_pending) +{ + struct lock_struct lock = { .futex = 0 }; + struct robust_list_head head; + int ret, *futex = &lock.futex; + + ret = set_list(&head); + ASSERT_EQ(ret, 0); + + ret = pthread_barrier_init(&barrier, NULL, 2); + ASSERT_EQ(ret, 0); + + ret = 
create_child(&child_fn_lock_with_error, &lock); + ASSERT_NE(ret, -1); + + pthread_barrier_wait(&barrier); + ret = mutex_lock(&lock, &head, false); + + ASSERT_EQ(ret, 0) TH_LOG("futex wait returned %d", errno); + ASSERT_TRUE(*futex & FUTEX_OWNER_DIED); + + pthread_barrier_destroy(&barrier); +} + +#define CHILD_NR 10 + +static int child_lock_holder(void *arg) +{ + struct lock_struct *locks = (struct lock_struct *) arg; + struct robust_list_head head; + int i; + + set_list(&head); + + for (i = 0; i < CHILD_NR; i++) { + locks[i].futex = 0; + mutex_lock(&locks[i], &head, false); + } + + pthread_barrier_wait(&barrier); + pthread_barrier_wait(&barrier2); + + sleep(1); + return 0; +} + +static int child_wait_lock(void *arg) +{ + struct lock_struct *lock = (struct lock_struct *) arg; + struct robust_list_head head; + int ret; + + pthread_barrier_wait(&barrier2); + ret = mutex_lock(lock, &head, false); + + if (ret) + ksft_test_result_fail("mutex_lock error\n"); + + if (!(lock->futex & FUTEX_OWNER_DIED)) + ksft_test_result_fail("futex not marked with FUTEX_OWNER_DIED\n"); + + return 0; +} + +/* + * Test a robust list of more than one element. All the waiters should wake when + * the holder dies + */ +TEST(robust_list_multiple_elements) +{ + struct lock_struct locks[CHILD_NR]; + int i, ret; + + ret = pthread_barrier_init(&barrier, NULL, 2); + ASSERT_EQ(ret, 0); + ret = pthread_barrier_init(&barrier2, NULL, CHILD_NR + 1); + ASSERT_EQ(ret, 0); + + create_child(&child_lock_holder, &locks); + + /* Wait until the locker thread takes the lock */ + pthread_barrier_wait(&barrier); + + for (i = 0; i < CHILD_NR; i++) + create_child(&child_wait_lock, &locks[i]); + + /* Wait for all children to return */ + while (wait(NULL) > 0); + + pthread_barrier_destroy(&barrier); + pthread_barrier_destroy(&barrier2); +} + +TEST_HARNESS_MAIN -- 2.46.0
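Several of the tests in this patch check whether the futex word carries FUTEX_OWNER_DIED. As a general note on such checks, a flag test needs a bitwise AND with the mask; a bitwise OR with a nonzero constant is nonzero for every input and so can never fail. The sketch below illustrates the difference. The helper names `futex_owner_died()` and `demo_owner_died()` are hypothetical, and the constants are redefined locally (guarded by `#ifndef`) with the values from the `<linux/futex.h>` uAPI header so that the block stands alone.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Values as defined in the Linux uAPI header <linux/futex.h>. */
#ifndef FUTEX_WAITERS
#define FUTEX_WAITERS    0x80000000u
#endif
#ifndef FUTEX_OWNER_DIED
#define FUTEX_OWNER_DIED 0x40000000u
#endif

/* True only when the owner-died bit is set in the futex word. */
static bool futex_owner_died(uint32_t word)
{
	return (word & FUTEX_OWNER_DIED) != 0;
}

static void demo_owner_died(void)
{
	/* Example words: a held lock with waiters, and one flagged owner-died. */
	uint32_t held = 1234u | FUTEX_WAITERS;
	uint32_t dead = FUTEX_WAITERS | FUTEX_OWNER_DIED;

	assert(!futex_owner_died(held));
	assert(futex_owner_died(dead));

	/* An OR with a nonzero constant is nonzero whatever the word holds. */
	assert((held | FUTEX_OWNER_DIED) != 0);
}
```

With the mask test, a futex word that was never marked by the kernel makes the check fail, which is what a robustness test wants to detect.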

10 months


[PATCH] selftests: vDSO: Also test counter in vdso_test_chacha

by Christophe Leroy

The chacha vDSO selftest doesn't check the way the counter is handled by __arch_chacha20_blocks_nostack(). It indirectly checks that the counter is written on exit and read back on new entry, but it doesn't check that the format is correct. This allowed an erroneous implementation on powerpc, where the counter was written and read in the wrong byte order, to go unnoticed. Also, the counter uses two words, but the test starts with a zero counter and uses a small number of blocks, so the upper part of the counter is always 0 at the end and is never checked. Add a verification of the counter's content in addition to the verification of the output. Also add two tests where the counter crosses the u32 upper limit. The first test verifies that the function properly writes back the upper word, the second test verifies that the function properly reads back the upper word. While at it, remove 'nonce', which is not used anymore after the replacement of libsodium by an open coded chacha implementation. Signed-off-by: Christophe Leroy <christophe.leroy(a)csgroup.eu> --- .../testing/selftests/vDSO/vdso_test_chacha.c | 39 ++++++++++++++----- 1 file changed, 30 insertions(+), 9 deletions(-) diff --git a/tools/testing/selftests/vDSO/vdso_test_chacha.c b/tools/testing/selftests/vDSO/vdso_test_chacha.c index 9d18d49a82f8..ed6cf372d9ee 100644 --- a/tools/testing/selftests/vDSO/vdso_test_chacha.c +++ b/tools/testing/selftests/vDSO/vdso_test_chacha.c @@ -17,11 +17,12 @@ static uint32_t rol32(uint32_t word, unsigned int shift) return (word << (shift & 31)) | (word >> ((-shift) & 31)); } -static void reference_chacha20_blocks(uint8_t *dst_bytes, const uint32_t *key, size_t nblocks) +static void reference_chacha20_blocks(uint8_t *dst_bytes, const uint32_t *key, uint32_t *counter, size_t nblocks) { uint32_t s[16] = { 0x61707865U, 0x3320646eU, 0x79622d32U, 0x6b206574U, - key[0], key[1], key[2], key[3], key[4], key[5], key[6], key[7] + key[0], key[1], key[2], key[3], key[4], key[5], key[6], key[7], + counter[0], 
counter[1], }; while (nblocks--) { @@ -52,6 +53,8 @@ static void reference_chacha20_blocks(uint8_t *dst_bytes, const uint32_t *key, s if (!++s[12]) ++s[13]; } + counter[0] = s[12]; + counter[1] = s[13]; } typedef uint8_t u8; @@ -66,8 +69,7 @@ typedef uint64_t u64; int main(int argc, char *argv[]) { enum { TRIALS = 1000, BLOCKS = 128, BLOCK_SIZE = 64 }; - static const uint8_t nonce[8] = { 0 }; - uint32_t counter[2]; + uint32_t counter1[2], counter2[2]; uint32_t key[8]; uint8_t output1[BLOCK_SIZE * BLOCKS], output2[BLOCK_SIZE * BLOCKS]; @@ -84,17 +86,36 @@ int main(int argc, char *argv[]) printf("getrandom() failed!\n"); return KSFT_SKIP; } - reference_chacha20_blocks(output1, key, BLOCKS); + memset(counter1, 0, sizeof(counter1)); + reference_chacha20_blocks(output1, key, counter1, BLOCKS); for (unsigned int split = 0; split < BLOCKS; ++split) { memset(output2, 'X', sizeof(output2)); - memset(counter, 0, sizeof(counter)); + memset(counter2, 0, sizeof(counter2)); if (split) - __arch_chacha20_blocks_nostack(output2, key, counter, split); - __arch_chacha20_blocks_nostack(output2 + split * BLOCK_SIZE, key, counter, BLOCKS - split); - if (memcmp(output1, output2, sizeof(output1))) + __arch_chacha20_blocks_nostack(output2, key, counter2, split); + __arch_chacha20_blocks_nostack(output2 + split * BLOCK_SIZE, key, counter2, BLOCKS - split); + if (memcmp(output1, output2, sizeof(output1)) || + memcmp(counter1, counter2, sizeof(counter1))) return KSFT_FAIL; } } + memset(counter1, 0, sizeof(counter1)); + counter1[0] = (uint32_t)-BLOCKS + 2; + memset(counter2, 0, sizeof(counter2)); + counter2[0] = (uint32_t)-BLOCKS + 2; + + reference_chacha20_blocks(output1, key, counter1, BLOCKS); + __arch_chacha20_blocks_nostack(output2, key, counter2, BLOCKS); + if (memcmp(output1, output2, sizeof(output1)) || + memcmp(counter1, counter2, sizeof(counter1))) + return KSFT_FAIL; + + reference_chacha20_blocks(output1, key, counter1, BLOCKS); + __arch_chacha20_blocks_nostack(output2, key, 
counter2, BLOCKS); + if (memcmp(output1, output2, sizeof(output1)) || + memcmp(counter1, counter2, sizeof(counter1))) + return KSFT_FAIL; + ksft_test_result_pass("chacha: PASS\n"); return KSFT_PASS; } -- 2.44.0
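To see why the two extra tests in this patch start the counter at `(uint32_t)-BLOCKS + 2`, it can help to trace the two-word counter on its own. The sketch below is only an illustration: `advance_counter()` and `demo_counter_carry()` are hypothetical helpers that mirror the `if (!++s[12]) ++s[13];` step of the reference code, with `counter[0]` as the low word and `counter[1]` as the high word; no ChaCha output is computed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Advance a 64-bit block counter stored as two 32-bit words
 * (counter[0] = low word, counter[1] = high word), one block at a time.
 * The high word is bumped only when the low word wraps to zero.
 */
static void advance_counter(uint32_t counter[2], size_t nblocks)
{
	while (nblocks--) {
		if (!++counter[0])
			++counter[1];
	}
}

static void demo_counter_carry(void)
{
	enum { BLOCKS = 128 };
	/* Low word starts 126 blocks short of wrapping: 0xFFFFFF82. */
	uint32_t counter[2] = { (uint32_t)-BLOCKS + 2, 0 };

	/* First run of BLOCKS blocks crosses the u32 limit: carry is written. */
	advance_counter(counter, BLOCKS);
	assert(counter[0] == 2 && counter[1] == 1);

	/* Second run starts with a nonzero high word, which must be read back. */
	advance_counter(counter, BLOCKS);
	assert(counter[0] == 130 && counter[1] == 1);
}
```

The first run exercises the write-back of the upper word and the second run exercises reading it back, which lines up with the two cases the commit message describes.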

10 months


[PATCH] selftests: vDSO: Build vDSO tests with O2 optimisation

by Christophe Leroy

Without -O2, the code generated for the chacha test function is awful. GCC even implements rol32() as a separate 20-instruction function instead of just using the rotlwi instruction. ~# time ./vdso_test_chacha TAP version 13 1..1 ok 1 chacha: PASS real 0m 37.16s user 0m 36.89s sys 0m 0.26s Several other selftest directories add -O2, and the kernel is also always built with optimisation enabled. Do the same for vDSO selftests. With this patch the time is reduced by approximately 15%. ~# time ./vdso_test_chacha TAP version 13 1..1 ok 1 chacha: PASS real 0m 32.09s user 0m 31.86s sys 0m 0.22s Signed-off-by: Christophe Leroy <christophe.leroy(a)csgroup.eu> --- tools/testing/selftests/vDSO/Makefile | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/tools/testing/selftests/vDSO/Makefile b/tools/testing/selftests/vDSO/Makefile index cfb7c281b22c..96f25aa2f84e 100644 --- a/tools/testing/selftests/vDSO/Makefile +++ b/tools/testing/selftests/vDSO/Makefile @@ -13,7 +13,7 @@ TEST_GEN_PROGS += vdso_test_correctness TEST_GEN_PROGS += vdso_test_getrandom TEST_GEN_PROGS += vdso_test_chacha -CFLAGS := -std=gnu99 +CFLAGS := -std=gnu99 -O2 ifeq ($(CONFIG_X86_32),y) LDLIBS += -lgcc_s -- 2.44.0

10 months, 1 week


I am facing Issue with Running Kselftest on ARM64 Architecture

by iamolivasmith＠gmail.com

Hello everyone, I am working on running Kselftest on an ARM64 platform and am facing a few issues that I hope someone here has experience with. I have successfully compiled the tests and can run most of them, but I have a specific problem with the memory management tests. They fail consistently, even though I have confirmed that the kernel configuration should support them. The errors I am seeing are related to page allocation failures, and I have double-checked that there is ample memory available on the system. I have also tried running these tests on a different ARM64 platform with a similar kernel configuration and encountered the same issue. Is this a known problem with Kselftest on ARM64, or is there something unique to my configuration that I am not seeing? Any advice, suggestions, or pointers to relevant documentation would be greatly appreciated. Thank you https://www.igmguru.com/blog/what-is-ampscript-in-salesforce-marketing-cloud

10 months, 1 week


[PATCH v3 0/5] Wire up getrandom() vDSO implementation on powerpc

by Christophe Leroy

This series wires up the getrandom() vDSO implementation on powerpc. Tested on PPC32 on real hardware. Tested on PPC64 (both BE and LE) on QEMU. Performance on powerpc 885: ~# ./vdso_test_getrandom bench-single vdso: 25000000 times in 62.938002291 seconds libc: 25000000 times in 535.581916866 seconds syscall: 25000000 times in 531.525042806 seconds Performance on powerpc 8321: ~# ./vdso_test_getrandom bench-single vdso: 25000000 times in 16.899318858 seconds libc: 25000000 times in 131.050596522 seconds syscall: 25000000 times in 129.794790389 seconds Performance on QEMU pseries: ~ # ./vdso_test_getrandom bench-single vdso: 25000000 times in 4.977777162 seconds libc: 25000000 times in 75.516749981 seconds syscall: 25000000 times in 86.842242014 seconds In order to run selftests, some fixes are needed, see https://lore.kernel.org/linuxppc-dev/6c5da802e72befecfa09046c489aa45d934d61… Those selftest fixes are independent and are not required to apply and use this series. Changes in v3: - Rebased on recent random git tree (0c7e00e22c21) - Fixed build failures reported by robots around VM_DROPPABLE - Fixed crash on PPC64 due to clobbered r13 by not using r13 anymore (saving it was not enough for signals). - Split final patch in two, first for PPC32, second for PPC64 - Moved selftest fixes out of this series Changes in v2: - Define VM_DROPPABLE for powerpc/32 - Fixed generic vDSO getrandom headers to enable CONFIG_COMPAT build. 
- Fixed size of generation counter - Fixed selftests to work on non x86 architectures Christophe Leroy (5): mm: Define VM_DROPPABLE for powerpc/32 powerpc/vdso32: Add crtsavres powerpc/vdso: Refactor CFLAGS for CVDSO build powerpc/vdso: Wire up getrandom() vDSO implementation on PPC32 powerpc/vdso: Wire up getrandom() vDSO implementation on PPC64 arch/powerpc/Kconfig | 1 + arch/powerpc/include/asm/asm-compat.h | 8 + arch/powerpc/include/asm/mman.h | 2 +- arch/powerpc/include/asm/vdso/getrandom.h | 54 ++++ arch/powerpc/include/asm/vdso/vsyscall.h | 6 + arch/powerpc/include/asm/vdso_datapage.h | 2 + arch/powerpc/kernel/asm-offsets.c | 1 + arch/powerpc/kernel/vdso/Makefile | 57 ++-- arch/powerpc/kernel/vdso/getrandom.S | 58 ++++ arch/powerpc/kernel/vdso/gettimeofday.S | 13 - arch/powerpc/kernel/vdso/vdso32.lds.S | 1 + arch/powerpc/kernel/vdso/vdso64.lds.S | 1 + arch/powerpc/kernel/vdso/vgetrandom-chacha.S | 299 +++++++++++++++++++ arch/powerpc/kernel/vdso/vgetrandom.c | 14 + fs/proc/task_mmu.c | 4 +- include/linux/mm.h | 4 +- include/trace/events/mmflags.h | 4 +- tools/arch/powerpc/vdso | 1 + tools/testing/selftests/vDSO/Makefile | 4 + 19 files changed, 492 insertions(+), 42 deletions(-) create mode 100644 arch/powerpc/include/asm/vdso/getrandom.h create mode 100644 arch/powerpc/kernel/vdso/getrandom.S create mode 100644 arch/powerpc/kernel/vdso/vgetrandom-chacha.S create mode 100644 arch/powerpc/kernel/vdso/vgetrandom.c create mode 120000 tools/arch/powerpc/vdso -- 2.44.0

10 months, 1 week


[PATCH net-next] wireguard: allowedips: Add WGALLOWEDIP_F_REMOVE_ME flag

by Jordan Rife

With the current API, the only way to remove an allowed IP is to completely rebuild the allowed IPs set for a peer using WGPEER_F_REPLACE_ALLOWEDIPS. In other words, if my current configuration is such that a peer has allowed IPs 192.168.0.2 and 192.168.0.3 and I want to remove 192.168.0.2, the actual transition looks like this. [192.168.0.2, 192.168.0.3] <-- Initial state [] <-- Step 1: Allowed IPs removed for peer [192.168.0.3] <-- Step 2: Allowed IPs added back for peer This is true even if the allowed IP list is small and the update does not need to be batched into multiple WG_CMD_SET_DEVICE requests, as the removal and subsequent addition of IPs is non-atomic within a single request. Consequently, wg_allowedips_lookup_dst and wg_allowedips_lookup_src may return NULL while reconfiguring a peer, even for packets bound for IPs a user did not intend to remove, leading to unintended interruptions in connectivity. This presents in userspace as failed calls to sendto and sendmsg. In my case, I ran netperf while repeatedly reconfiguring the allowed IPs for a peer with wg. /usr/local/bin/netperf -H 10.102.73.72 -l 10m -t UDP_STREAM -- -R 1 -m 1024 send_data: data send error: No route to host (errno 113) netperf: send_omni: send_data failed: No route to host While this may not be of particular concern for environments where peers and allowed IPs are mostly static, Cilium manages peers and allowed IPs in a dynamic environment where peers (i.e. Kubernetes nodes) and allowed IPs (i.e. Pods running on those nodes) can frequently change. Cilium must continually keep its WireGuard device's configuration in sync with its cluster state, leading to unnecessary churn and packet drops. This patch introduces a new flag called WGALLOWEDIP_F_REMOVE_ME which, in the same way that WGPEER_F_REMOVE_ME allows a user to remove a single peer from a WireGuard device's configuration, allows a user to remove an IP from a peer's set of allowed IPs. This has two benefits. 
First, it allows systems such as Cilium to avoid introducing connectivity blips while reconfiguring a WireGuard device. Second, it allows us to more efficiently keep the device's configuration in sync with the cluster state, as we no longer need to do frequent rebuilds of the allowed IPs list for each peer. Instead, the device's configuration can be incrementally updated. This patch also bumps WG_GENL_VERSION which can be used by clients to detect whether or not their system supports the WGALLOWEDIP_F_REMOVE_ME flag. Signed-off-by: Jordan Rife <jrife(a)google.com> Link: https://github.com/cilium/cilium/issues/33159 --- drivers/net/wireguard/allowedips.c | 103 ++++++++++---- drivers/net/wireguard/allowedips.h | 4 + drivers/net/wireguard/netlink.c | 45 +++++-- drivers/net/wireguard/selftest/allowedips.c | 30 +++++ include/uapi/linux/wireguard.h | 11 +- tools/testing/selftests/wireguard/Makefile | 18 +++ tools/testing/selftests/wireguard/netns.sh | 38 ++++++ tools/testing/selftests/wireguard/remove-ip.c | 126 ++++++++++++++++++ 8 files changed, 333 insertions(+), 42 deletions(-) create mode 100644 tools/testing/selftests/wireguard/Makefile create mode 100644 tools/testing/selftests/wireguard/remove-ip.c diff --git a/drivers/net/wireguard/allowedips.c b/drivers/net/wireguard/allowedips.c index 4b8528206cc8a..47a96a1b8f0ea 100644 --- a/drivers/net/wireguard/allowedips.c +++ b/drivers/net/wireguard/allowedips.c @@ -249,6 +249,56 @@ static int add(struct allowedips_node __rcu **trie, u8 bits, const u8 *key, return 0; } +static void _remove(struct allowedips_node __rcu *node, struct mutex *lock) +{ + struct allowedips_node *child, **parent_bit, *parent; + bool free_parent; + + list_del_init(&node->peer_list); + RCU_INIT_POINTER(node->peer, NULL); + if (node->bit[0] && node->bit[1]) + return; + child = rcu_dereference_protected(node->bit[!rcu_access_pointer(node->bit[0])], + lockdep_is_held(lock)); + if (child) + child->parent_bit_packed = node->parent_bit_packed; + 
parent_bit = (struct allowedips_node **)(node->parent_bit_packed & ~3UL); + *parent_bit = child; + parent = (void *)parent_bit - + offsetof(struct allowedips_node, bit[node->parent_bit_packed & 1]); + free_parent = !rcu_access_pointer(node->bit[0]) && + !rcu_access_pointer(node->bit[1]) && + (node->parent_bit_packed & 3) <= 1 && + !rcu_access_pointer(parent->peer); + if (free_parent) + child = rcu_dereference_protected(parent->bit[!(node->parent_bit_packed & 1)], + lockdep_is_held(lock)); + call_rcu(&node->rcu, node_free_rcu); + if (!free_parent) + return; + if (child) + child->parent_bit_packed = parent->parent_bit_packed; + *(struct allowedips_node **)(parent->parent_bit_packed & ~3UL) = child; + call_rcu(&parent->rcu, node_free_rcu); +} + +static int remove(struct allowedips_node __rcu **trie, u8 bits, const u8 *key, + u8 cidr, struct wg_peer *peer, struct mutex *lock) +{ + struct allowedips_node *node; + + if (unlikely(cidr > bits || !peer)) + return -EINVAL; + if (!rcu_access_pointer(*trie) || + !node_placement(*trie, key, cidr, bits, &node, lock) || + peer != node->peer) + return 0; + + _remove(node, lock); + + return 0; +} + void wg_allowedips_init(struct allowedips *table) { table->root4 = table->root6 = NULL; @@ -300,43 +350,38 @@ int wg_allowedips_insert_v6(struct allowedips *table, const struct in6_addr *ip, return add(&table->root6, 128, key, cidr, peer, lock); } +int wg_allowedips_remove_v4(struct allowedips *table, const struct in_addr *ip, + u8 cidr, struct wg_peer *peer, struct mutex *lock) +{ + /* Aligned so it can be passed to fls */ + u8 key[4] __aligned(__alignof(u32)); + + ++table->seq; + swap_endian(key, (const u8 *)ip, 32); + return remove(&table->root4, 32, key, cidr, peer, lock); +} + +int wg_allowedips_remove_v6(struct allowedips *table, const struct in6_addr *ip, + u8 cidr, struct wg_peer *peer, struct mutex *lock) +{ + /* Aligned so it can be passed to fls64 */ + u8 key[16] __aligned(__alignof(u64)); + + ++table->seq; + swap_endian(key, 
(const u8 *)ip, 128); + return remove(&table->root6, 128, key, cidr, peer, lock); +} + void wg_allowedips_remove_by_peer(struct allowedips *table, struct wg_peer *peer, struct mutex *lock) { - struct allowedips_node *node, *child, **parent_bit, *parent, *tmp; - bool free_parent; + struct allowedips_node *node, *tmp; if (list_empty(&peer->allowedips_list)) return; ++table->seq; list_for_each_entry_safe(node, tmp, &peer->allowedips_list, peer_list) { - list_del_init(&node->peer_list); - RCU_INIT_POINTER(node->peer, NULL); - if (node->bit[0] && node->bit[1]) - continue; - child = rcu_dereference_protected(node->bit[!rcu_access_pointer(node->bit[0])], - lockdep_is_held(lock)); - if (child) - child->parent_bit_packed = node->parent_bit_packed; - parent_bit = (struct allowedips_node **)(node->parent_bit_packed & ~3UL); - *parent_bit = child; - parent = (void *)parent_bit - - offsetof(struct allowedips_node, bit[node->parent_bit_packed & 1]); - free_parent = !rcu_access_pointer(node->bit[0]) && - !rcu_access_pointer(node->bit[1]) && - (node->parent_bit_packed & 3) <= 1 && - !rcu_access_pointer(parent->peer); - if (free_parent) - child = rcu_dereference_protected( - parent->bit[!(node->parent_bit_packed & 1)], - lockdep_is_held(lock)); - call_rcu(&node->rcu, node_free_rcu); - if (!free_parent) - continue; - if (child) - child->parent_bit_packed = parent->parent_bit_packed; - *(struct allowedips_node **)(parent->parent_bit_packed & ~3UL) = child; - call_rcu(&parent->rcu, node_free_rcu); + _remove(node, lock); } } diff --git a/drivers/net/wireguard/allowedips.h b/drivers/net/wireguard/allowedips.h index 2346c797eb4d8..931958cb6e100 100644 --- a/drivers/net/wireguard/allowedips.h +++ b/drivers/net/wireguard/allowedips.h @@ -38,6 +38,10 @@ int wg_allowedips_insert_v4(struct allowedips *table, const struct in_addr *ip, u8 cidr, struct wg_peer *peer, struct mutex *lock); int wg_allowedips_insert_v6(struct allowedips *table, const struct in6_addr *ip, u8 cidr, struct wg_peer 
*peer, struct mutex *lock); +int wg_allowedips_remove_v4(struct allowedips *table, const struct in_addr *ip, + u8 cidr, struct wg_peer *peer, struct mutex *lock); +int wg_allowedips_remove_v6(struct allowedips *table, const struct in6_addr *ip, + u8 cidr, struct wg_peer *peer, struct mutex *lock); void wg_allowedips_remove_by_peer(struct allowedips *table, struct wg_peer *peer, struct mutex *lock); /* The ip input pointer should be __aligned(__alignof(u64))) */ diff --git a/drivers/net/wireguard/netlink.c b/drivers/net/wireguard/netlink.c index f7055180ba4aa..5f2a8553ab43d 100644 --- a/drivers/net/wireguard/netlink.c +++ b/drivers/net/wireguard/netlink.c @@ -46,7 +46,8 @@ static const struct nla_policy peer_policy[WGPEER_A_MAX + 1] = { static const struct nla_policy allowedip_policy[WGALLOWEDIP_A_MAX + 1] = { [WGALLOWEDIP_A_FAMILY] = { .type = NLA_U16 }, [WGALLOWEDIP_A_IPADDR] = NLA_POLICY_MIN_LEN(sizeof(struct in_addr)), - [WGALLOWEDIP_A_CIDR_MASK] = { .type = NLA_U8 } + [WGALLOWEDIP_A_CIDR_MASK] = { .type = NLA_U8 }, + [WGALLOWEDIP_A_FLAGS] = { .type = NLA_U32 } }; static struct wg_device *lookup_interface(struct nlattr **attrs, @@ -329,6 +330,7 @@ static int set_port(struct wg_device *wg, u16 port) static int set_allowedip(struct wg_peer *peer, struct nlattr **attrs) { int ret = -EINVAL; + u32 flags = 0; u16 family; u8 cidr; @@ -337,19 +339,38 @@ static int set_allowedip(struct wg_peer *peer, struct nlattr **attrs) return ret; family = nla_get_u16(attrs[WGALLOWEDIP_A_FAMILY]); cidr = nla_get_u8(attrs[WGALLOWEDIP_A_CIDR_MASK]); + if (attrs[WGALLOWEDIP_A_FLAGS]) + flags = nla_get_u32(attrs[WGALLOWEDIP_A_FLAGS]); if (family == AF_INET && cidr <= 32 && - nla_len(attrs[WGALLOWEDIP_A_IPADDR]) == sizeof(struct in_addr)) - ret = wg_allowedips_insert_v4( - &peer->device->peer_allowedips, - nla_data(attrs[WGALLOWEDIP_A_IPADDR]), cidr, peer, - &peer->device->device_update_lock); - else if (family == AF_INET6 && cidr <= 128 && - nla_len(attrs[WGALLOWEDIP_A_IPADDR]) == 
sizeof(struct in6_addr)) - ret = wg_allowedips_insert_v6( - &peer->device->peer_allowedips, - nla_data(attrs[WGALLOWEDIP_A_IPADDR]), cidr, peer, - &peer->device->device_update_lock); + nla_len(attrs[WGALLOWEDIP_A_IPADDR]) == sizeof(struct in_addr)) { + if (flags & WGALLOWEDIP_F_REMOVE_ME) + ret = wg_allowedips_remove_v4(&peer->device->peer_allowedips, + nla_data(attrs[WGALLOWEDIP_A_IPADDR]), + cidr, + peer, + &peer->device->device_update_lock); + else + ret = wg_allowedips_insert_v4(&peer->device->peer_allowedips, + nla_data(attrs[WGALLOWEDIP_A_IPADDR]), + cidr, + peer, + &peer->device->device_update_lock); + } else if (family == AF_INET6 && cidr <= 128 && + nla_len(attrs[WGALLOWEDIP_A_IPADDR]) == sizeof(struct in6_addr)) { + if (flags & WGALLOWEDIP_F_REMOVE_ME) + ret = wg_allowedips_remove_v6(&peer->device->peer_allowedips, + nla_data(attrs[WGALLOWEDIP_A_IPADDR]), + cidr, + peer, + &peer->device->device_update_lock); + else + ret = wg_allowedips_insert_v6(&peer->device->peer_allowedips, + nla_data(attrs[WGALLOWEDIP_A_IPADDR]), + cidr, + peer, + &peer->device->device_update_lock); + } return ret; } diff --git a/drivers/net/wireguard/selftest/allowedips.c b/drivers/net/wireguard/selftest/allowedips.c index 3d1f64ff2e122..9f6458a889e96 100644 --- a/drivers/net/wireguard/selftest/allowedips.c +++ b/drivers/net/wireguard/selftest/allowedips.c @@ -461,6 +461,10 @@ static __init struct wg_peer *init_peer(void) wg_allowedips_insert_v##version(&t, ip##version(ipa, ipb, ipc, ipd), \ cidr, mem, &mutex) +#define remove(version, mem, ipa, ipb, ipc, ipd, cidr) \ + wg_allowedips_remove_v##version(&t, ip##version(ipa, ipb, ipc, ipd), \ + cidr, mem, &mutex) + #define maybe_fail() do { \ ++i; \ if (!_s) { \ @@ -586,6 +590,32 @@ bool __init wg_allowedips_selftest(void) test_negative(4, a, 192, 0, 0, 0); test_negative(4, a, 255, 0, 0, 0); + insert(4, a, 1, 0, 0, 0, 32); + insert(4, a, 192, 0, 0, 0, 24); + insert(6, a, 0x24446801, 0x40e40800, 0xdeaebeef, 0xdefbeef, 128); + insert(6, 
a, 0x24446800, 0xf0e40800, 0xeeaebeef, 0, 98); + test(4, a, 1, 0, 0, 0); + test(4, a, 192, 0, 0, 1); + test(6, a, 0x24446801, 0x40e40800, 0xdeaebeef, 0xdefbeef); + test(6, a, 0x24446800, 0xf0e40800, 0xeeaebeef, 0x10101010); + /* Must be an exact match to remove */ + remove(4, a, 192, 0, 0, 0, 32); + test(4, a, 192, 0, 0, 1); + remove(4, a, 192, 0, 0, 0, 24); + test_negative(4, a, 192, 0, 0, 1); + remove(4, a, 1, 0, 0, 0, 32); + test_negative(4, a, 1, 0, 0, 0); + /* Must be an exact match to remove */ + remove(6, a, 0x24446801, 0x40e40800, 0xdeaebeef, 0xdefbeef, 96); + test(6, a, 0x24446801, 0x40e40800, 0xdeaebeef, 0xdefbeef); + remove(6, a, 0x24446801, 0x40e40800, 0xdeaebeef, 0xdefbeef, 128); + test_negative(6, a, 0x24446801, 0x40e40800, 0xdeaebeef, 0xdefbeef); + /* Must match the peer to remove */ + remove(6, b, 0x24446800, 0xf0e40800, 0xeeaebeef, 0, 98); + test(6, a, 0x24446800, 0xf0e40800, 0xeeaebeef, 0x10101010); + remove(6, a, 0x24446800, 0xf0e40800, 0xeeaebeef, 0, 98); + test_negative(6, a, 0x24446800, 0xf0e40800, 0xeeaebeef, 0x10101010); + wg_allowedips_free(&t, &mutex); wg_allowedips_init(&t); insert(4, a, 192, 168, 0, 0, 16); diff --git a/include/uapi/linux/wireguard.h b/include/uapi/linux/wireguard.h index ae88be14c9478..e219194cb9f5a 100644 --- a/include/uapi/linux/wireguard.h +++ b/include/uapi/linux/wireguard.h @@ -101,6 +101,10 @@ * WGALLOWEDIP_A_FAMILY: NLA_U16 * WGALLOWEDIP_A_IPADDR: struct in_addr or struct in6_addr * WGALLOWEDIP_A_CIDR_MASK: NLA_U8 + * WGALLOWEDIP_A_FLAGS: NLA_U32, WGALLOWEDIP_F_REMOVE_ME if + * the specified IP should be removed, + * otherwise this IP will be added if + * it is not already present. * 0: NLA_NESTED * ... 
* 0: NLA_NESTED @@ -132,7 +136,7 @@ #define _WG_UAPI_WIREGUARD_H #define WG_GENL_NAME "wireguard" -#define WG_GENL_VERSION 1 +#define WG_GENL_VERSION 2 #define WG_KEY_LEN 32 @@ -184,11 +188,16 @@ enum wgpeer_attribute { }; #define WGPEER_A_MAX (__WGPEER_A_LAST - 1) +enum wgallowedip_flag { + WGALLOWEDIP_F_REMOVE_ME = 1U << 0, + __WGALLOWEDIP_F_ALL = WGALLOWEDIP_F_REMOVE_ME +}; enum wgallowedip_attribute { WGALLOWEDIP_A_UNSPEC, WGALLOWEDIP_A_FAMILY, WGALLOWEDIP_A_IPADDR, WGALLOWEDIP_A_CIDR_MASK, + WGALLOWEDIP_A_FLAGS, __WGALLOWEDIP_A_LAST }; #define WGALLOWEDIP_A_MAX (__WGALLOWEDIP_A_LAST - 1) diff --git a/tools/testing/selftests/wireguard/Makefile b/tools/testing/selftests/wireguard/Makefile new file mode 100644 index 0000000000000..4f4db54f89cb3 --- /dev/null +++ b/tools/testing/selftests/wireguard/Makefile @@ -0,0 +1,18 @@ +# SPDX-License-Identifier: GPL-2.0 +# +# Note: To build this you must install libnl-3 and libnl-genl-3 development +# packages. +remove-ip: + gcc -I/usr/include/libnl3 \ + -I../../../../usr/include \ + remove-ip.c \ + -o remove-ip \ + -lnl-genl-3 \ + -lnl-3 + +.PHONY: all +all: remove-ip + +.PHONY: clean +clean: + rm remove-ip diff --git a/tools/testing/selftests/wireguard/netns.sh b/tools/testing/selftests/wireguard/netns.sh index 405ff262ca93d..70058d6ebbe85 100755 --- a/tools/testing/selftests/wireguard/netns.sh +++ b/tools/testing/selftests/wireguard/netns.sh @@ -28,6 +28,7 @@ exec 3>&1 export LANG=C export WG_HIDE_KEYS=never NPROC=( /sys/devices/system/cpu/cpu+([0-9]) ); NPROC=${#NPROC[@]} +SCRIPT_DIR=$( cd -- "$( dirname -- "${BASH_SOURCE[0]}" )" &> /dev/null && pwd ) netns0="wg-test-$$-0" netns1="wg-test-$$-1" netns2="wg-test-$$-2" @@ -610,6 +611,43 @@ n0 wg set wg0 peer "$pub2" allowed-ips "$allowedips" } < <(n0 wg show wg0 allowed-ips) ip0 link del wg0 +# Test IP removal +allowedips=( ) +for i in {1..197}; do + allowedips+=( 192.168.0.$i ) + allowedips+=( abcd::$i ) +done +saved_ifs="$IFS" +IFS=, +allowedips="${allowedips[*]}" 
+IFS="$saved_ifs" +ip0 link add wg0 type wireguard +n0 wg set wg0 peer "$pub1" allowed-ips "$allowedips" +pub1_hex=$(echo "$pub1" | base64 -d | xxd -p -c 50) +n0 $SCRIPT_DIR/remove-ip wg0 "$pub1_hex" 4 192.168.0.1 +n0 $SCRIPT_DIR/remove-ip wg0 "$pub1_hex" 4 192.168.0.20 +n0 $SCRIPT_DIR/remove-ip wg0 "$pub1_hex" 4 192.168.0.100 +n0 $SCRIPT_DIR/remove-ip wg0 "$pub1_hex" 6 abcd::1 +n0 $SCRIPT_DIR/remove-ip wg0 "$pub1_hex" 6 abcd::20 +n0 $SCRIPT_DIR/remove-ip wg0 "$pub1_hex" 6 abcd::100 +n0 wg show wg0 allowed-ips +{ + read -r pub allowedips + [[ $pub == "$pub1" ]] + i=0 + for ip in $allowedips; do + [[ "$ip" != "192.168.0.1" ]] + [[ "$ip" != "192.168.0.20" ]] + [[ "$ip" != "192.168.0.100" ]] + [[ "$ip" != "abcd::1" ]] + [[ "$ip" != "abcd::20" ]] + [[ "$ip" != "abcd::100" ]] + ((++i)) + done + ((i == 388)) +} < <(n0 wg show wg0 allowed-ips) +ip0 link del wg0 + ! n0 wg show doesnotexist || false ip0 link add wg0 type wireguard diff --git a/tools/testing/selftests/wireguard/remove-ip.c b/tools/testing/selftests/wireguard/remove-ip.c new file mode 100644 index 0000000000000..242f922d99b56 --- /dev/null +++ b/tools/testing/selftests/wireguard/remove-ip.c @@ -0,0 +1,126 @@ +// SPDX-License-Identifier: GPL-2.0 +#include <linux/wireguard.h> +#include <sys/socket.h> +#include <netinet/in.h> +#include <arpa/inet.h> +#include <netlink/socket.h> +#include <netlink/netlink.h> +#include <netlink/genl/ctrl.h> +#include <netlink/genl/genl.h> +#include <netlink/genl/family.h> + +#define CURVE25519_KEY_SIZE 32 + +const char *usage = "Usage: remove-ip INTERFACE_NAME PEER_PUBLIC_KEY_HEX IP_VERSION IP"; + +char h2b(char c) +{ + if ('0' <= c && c <= '9') + return c - '0'; + else if ('a' <= c && c <= 'f') + return 10 + (c - 'a'); + + return -1; +} + +int parse_key(const char *raw, unsigned char key[CURVE25519_KEY_SIZE]) +{ + int ret = 0; + int i; + + for (i = 0; i < CURVE25519_KEY_SIZE; i++) { + char h, l; + + h = h2b(raw[0]); + if (h < 0) + return -1; + + l = h2b(raw[1]); + if (l < 0) + 
return -1; + + key[i] = (h << 4) | l; + raw += 2; + } + + return 0; +} + +int main(int argc, char **argv) +{ + unsigned char addr[sizeof(struct in6_addr)]; + unsigned char pub_key[CURVE25519_KEY_SIZE]; + struct nl_sock *sock; + struct nl_msg *msg; + int addr_len; + int family; + int cidr; + int af; + + if (argc < 5) { + printf("Not enough arguments.\n\n%s\n", usage); + return -1; + } + + if (parse_key(argv[2], pub_key)) { + printf("Could not parse public key\n"); + return -1; + } + + switch (argv[3][0]) { + case '4': + af = AF_INET; + addr_len = sizeof(struct in_addr); + cidr = 32; + break; + case '6': + af = AF_INET6; + addr_len = sizeof(struct in6_addr); + cidr = 128; + break; + default: + printf("Invalid IP version\n"); + return -1; + } + + if (inet_pton(af, argv[4], &addr) <= 0) { + printf("Could not parse IP address\n"); + return -1; + } + + sock = nl_socket_alloc(); + genl_connect(sock); + family = genl_ctrl_resolve(sock, WG_GENL_NAME); + msg = nlmsg_alloc(); + genlmsg_put(msg, NL_AUTO_PID, NL_AUTO_SEQ, family, 0, NLM_F_ECHO, + WG_CMD_SET_DEVICE, WG_GENL_VERSION); + nla_put_string(msg, WGDEVICE_A_IFNAME, argv[1]); + + struct nlattr *peers = nla_nest_start(msg, WGDEVICE_A_PEERS); + struct nlattr *peer0 = nla_nest_start(msg, 0); + + nla_put(msg, WGPEER_A_PUBLIC_KEY, CURVE25519_KEY_SIZE, pub_key); + + struct nlattr *allowed_ips = nla_nest_start(msg, WGPEER_A_ALLOWEDIPS); + struct nlattr *allowed_ip0 = nla_nest_start(msg, 0); + + nla_put_u16(msg, WGALLOWEDIP_A_FAMILY, af); + nla_put(msg, WGALLOWEDIP_A_IPADDR, addr_len, &addr); + nla_put_u8(msg, WGALLOWEDIP_A_CIDR_MASK, cidr); + nla_put_u32(msg, WGALLOWEDIP_A_FLAGS, WGALLOWEDIP_F_REMOVE_ME); + nla_nest_end(msg, allowed_ip0); + nla_nest_end(msg, allowed_ips); + nla_nest_end(msg, peer0); + nla_nest_end(msg, peers); + + int err = nl_send_sync(sock, msg); + + if (err < 0) { + char message[256]; + + nl_perror(err, message); + printf("An error occurred: %d - %s\n", err, message); + } + + return err; +} -- 
2.46.0.469.g59c65b2a67-goog
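The removal semantics exercised by the selftest above (a removal only takes effect on an exact prefix match owned by the same peer, and is otherwise a silent no-op) can be sketched with a toy model. This is only an illustration under stated assumptions: it uses a small flat table rather than WireGuard's trie, handles IPv4 only, and every name in it (`toy_table`, `toy_insert`, `toy_remove`, `toy_lookup`) is hypothetical and not part of the patch.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical toy model: a flat table standing in for the allowedips trie. */
#define TOY_MAX_ENTRIES 16

struct toy_entry {
	uint32_t ip;   /* IPv4 network address, host byte order */
	uint8_t cidr;  /* prefix length, 0..32 */
	int peer;      /* id of the owning peer */
	bool used;
};

struct toy_table {
	struct toy_entry e[TOY_MAX_ENTRIES];
};

/* Network mask for a prefix length; cidr must be in 0..32. */
static uint32_t toy_mask(uint8_t cidr)
{
	return cidr ? ~(uint32_t)0 << (32 - cidr) : 0;
}

/* Add ip/cidr for peer. Returns 0 on success, -1 on bad cidr or full table. */
static int toy_insert(struct toy_table *t, uint32_t ip, uint8_t cidr, int peer)
{
	if (cidr > 32)
		return -1;
	for (size_t i = 0; i < TOY_MAX_ENTRIES; i++) {
		if (!t->e[i].used) {
			t->e[i] = (struct toy_entry){
				.ip = ip & toy_mask(cidr), .cidr = cidr,
				.peer = peer, .used = true,
			};
			return 0;
		}
	}
	return -1;
}

/*
 * Remove ip/cidr only if an entry matches the prefix exactly and belongs
 * to peer. A non-matching request is a no-op that still returns 0.
 */
static int toy_remove(struct toy_table *t, uint32_t ip, uint8_t cidr, int peer)
{
	if (cidr > 32)
		return -1;
	for (size_t i = 0; i < TOY_MAX_ENTRIES; i++) {
		struct toy_entry *n = &t->e[i];

		if (n->used && n->cidr == cidr && n->peer == peer &&
		    n->ip == (ip & toy_mask(cidr)))
			n->used = false;
	}
	return 0;
}

/* Longest-prefix lookup. Returns the peer id, or -1 if nothing covers ip. */
static int toy_lookup(const struct toy_table *t, uint32_t ip)
{
	int best_peer = -1, best_cidr = -1;

	for (size_t i = 0; i < TOY_MAX_ENTRIES; i++) {
		const struct toy_entry *n = &t->e[i];

		if (n->used && (ip & toy_mask(n->cidr)) == n->ip &&
		    n->cidr > best_cidr) {
			best_cidr = n->cidr;
			best_peer = n->peer;
		}
	}
	return best_peer;
}
```

With 192.0.0.0/24 inserted for one peer, a `toy_remove` of 192.0.0.0/32, or of 192.0.0.0/24 on behalf of another peer, leaves lookups for 192.0.0.1 unchanged; only the exact /24 removal by the owner makes the lookup fail. Other entries are never touched, which is the property that avoids the empty intermediate state described in the commit message.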

10 months, 1 week

[PATCH 0/2] Adding SO_PEEK_OFF for TCPv6

by jmaloy＠redhat.com

From: Jon Maloy <jmaloy(a)redhat.com>

Adding SO_PEEK_OFF for TCPv6 and selftest for both TCPv4 and TCPv6.

Jon Maloy (2):
  tcp: add SO_PEEK_OFF socket option for TCPv6
  selftests: add selftest for tcp SO_PEEK_OFF support

 net/ipv6/af_inet6.c                           |   1 +
 tools/testing/selftests/net/Makefile          |   1 +
 tools/testing/selftests/net/tcp_so_peek_off.c | 181 ++++++++++++++++++
 3 files changed, 183 insertions(+)
 create mode 100644 tools/testing/selftests/net/tcp_so_peek_off.c

--
2.45.2
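For readers unfamiliar with the option, the following is a minimal sketch of what SO_PEEK_OFF does: once enabled, each recv() with MSG_PEEK starts where the previous peek ended instead of re-reading from the head of the queue. As an assumption made so the sketch runs on kernels without this series, it uses an AF_UNIX stream socketpair (which has supported the option for a long time) rather than a TCPv6 socket; the function name `peek_off_demo` is hypothetical and this is not the selftest added by the series.

```c
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Returns 1 if two consecutive MSG_PEEK reads walked through the queued
 * data, 0 if SO_PEEK_OFF is unavailable here, -1 on any other failure.
 */
static int peek_off_demo(void)
{
#ifndef SO_PEEK_OFF
	return 0; /* option not defined on this platform */
#else
	int sv[2];
	int off = 0;
	char a[4], b[4];
	int ret = -1;

	if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0)
		return -1;

	/* Enable peek-offset mode, starting at offset 0. */
	if (setsockopt(sv[0], SOL_SOCKET, SO_PEEK_OFF, &off, sizeof(off)) < 0) {
		ret = 0;
		goto out;
	}

	if (send(sv[1], "abcdefgh", 8, 0) != 8)
		goto out;

	/* The second peek continues after the first one; nothing is consumed. */
	if (recv(sv[0], a, sizeof(a), MSG_PEEK) != 4 ||
	    recv(sv[0], b, sizeof(b), MSG_PEEK) != 4)
		goto out;

	ret = (memcmp(a, "abcd", 4) == 0 && memcmp(b, "efgh", 4) == 0) ? 1 : -1;
out:
	close(sv[0]);
	close(sv[1]);
	return ret;
#endif
}
```

Without SO_PEEK_OFF both peeks would return "abcd". On a kernel carrying this series, the same setsockopt() call would presumably be accepted on an AF_INET6 TCP socket as well.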

10 months, 1 week

[PATCH 00/16] mm: Introduce MAP_BELOW_HINT

by Charlie Jenkins

Some applications rely on placing data in the free bits of addresses allocated by mmap. Various architectures (e.g. x86, arm64, powerpc) restrict the address returned by mmap to be less than the maximum address space, unless the hint address is greater than this value. On arm64 this barrier is at 52 bits and on x86 it is at 56 bits. This flag gives applications a way to specify exactly how many bits they want to be left unused by mmap. This eliminates the need for applications to know the page table hierarchy of the system to be able to reason about which addresses mmap will be allowed to return.

---

riscv made this feature of mmap returning addresses less than the hint address the default behavior. This was in contrast to the implementation of x86/arm64 that have a single boundary at the 5-level page table region. However, this restriction proved too great: the reduced address space when using a hint address was too small.

A patch for riscv [1] reverts the behavior that broke userspace. This series serves to make this feature available to all architectures.

I have only tested on riscv and x86. There is a tremendous amount of duplicated code in mmap, so I believe the implementations across architectures should be mostly consistent. I added this feature to all architectures that implement either arch_get_mmap_end()/arch_get_mmap_base() or arch_get_unmapped_area_topdown()/arch_get_unmapped_area(). I also added it to the default behavior for arch_get_mmap_end()/arch_get_mmap_base().

Link: https://lore.kernel.org/lkml/20240826-riscv_mmap-v1-2-cd8962afe47f@rivosinc… [1] To: Arnd Bergmann <arnd(a)arndb.de> To: Paul Walmsley <paul.walmsley(a)sifive.com> To: Palmer Dabbelt <palmer(a)dabbelt.com> To: Albert Ou <aou(a)eecs.berkeley.edu> To: Catalin Marinas <catalin.marinas(a)arm.com> To: Will Deacon <will(a)kernel.org> To: Michael Ellerman <mpe(a)ellerman.id.au> To: Nicholas Piggin <npiggin(a)gmail.com> To: Christophe Leroy <christophe.leroy(a)csgroup.eu> To: Naveen N Rao <naveen(a)kernel.org> To: Muchun Song <muchun.song(a)linux.dev> To: Andrew Morton <akpm(a)linux-foundation.org> To: Liam R. Howlett <Liam.Howlett(a)oracle.com> To: Vlastimil Babka <vbabka(a)suse.cz> To: Lorenzo Stoakes <lorenzo.stoakes(a)oracle.com> To: Thomas Gleixner <tglx(a)linutronix.de> To: Ingo Molnar <mingo(a)redhat.com> To: Borislav Petkov <bp(a)alien8.de> To: Dave Hansen <dave.hansen(a)linux.intel.com> To: x86(a)kernel.org To: H. Peter Anvin <hpa(a)zytor.com> To: Huacai Chen <chenhuacai(a)kernel.org> To: WANG Xuerui <kernel(a)xen0n.name> To: Russell King <linux(a)armlinux.org.uk> To: Thomas Bogendoerfer <tsbogend(a)alpha.franken.de> To: James E.J. Bottomley <James.Bottomley(a)HansenPartnership.com> To: Helge Deller <deller(a)gmx.de> To: Alexander Gordeev <agordeev(a)linux.ibm.com> To: Gerald Schaefer <gerald.schaefer(a)linux.ibm.com> To: Heiko Carstens <hca(a)linux.ibm.com> To: Vasily Gorbik <gor(a)linux.ibm.com> To: Christian Borntraeger <borntraeger(a)linux.ibm.com> To: Sven Schnelle <svens(a)linux.ibm.com> To: Yoshinori Sato <ysato(a)users.sourceforge.jp> To: Rich Felker <dalias(a)libc.org> To: John Paul Adrian Glaubitz <glaubitz(a)physik.fu-berlin.de> To: David S. 
Miller <davem(a)davemloft.net> To: Andreas Larsson <andreas(a)gaisler.com> To: Shuah Khan <shuah(a)kernel.org> To: Alexandre Ghiti <alexghiti(a)rivosinc.com> Cc: linux-arch(a)vger.kernel.org Cc: linux-kernel(a)vger.kernel.org Cc: Palmer Dabbelt <palmer(a)rivosinc.com> Cc: linux-riscv(a)lists.infradead.org Cc: linux-arm-kernel(a)lists.infradead.org Cc: linuxppc-dev(a)lists.ozlabs.org Cc: linux-mm(a)kvack.org Cc: loongarch(a)lists.linux.dev Cc: linux-mips(a)vger.kernel.org Cc: linux-parisc(a)vger.kernel.org Cc: linux-s390(a)vger.kernel.org Cc: linux-sh(a)vger.kernel.org Cc: sparclinux(a)vger.kernel.org Cc: linux-kselftest(a)vger.kernel.org Signed-off-by: Charlie Jenkins <charlie(a)rivosinc.com> --- Charlie Jenkins (16): mm: Add MAP_BELOW_HINT riscv: mm: Do not restrict mmap address based on hint mm: Add flag and len param to arch_get_mmap_base() mm: Add generic MAP_BELOW_HINT riscv: mm: Support MAP_BELOW_HINT arm64: mm: Support MAP_BELOW_HINT powerpc: mm: Support MAP_BELOW_HINT x86: mm: Support MAP_BELOW_HINT loongarch: mm: Support MAP_BELOW_HINT arm: mm: Support MAP_BELOW_HINT mips: mm: Support MAP_BELOW_HINT parisc: mm: Support MAP_BELOW_HINT s390: mm: Support MAP_BELOW_HINT sh: mm: Support MAP_BELOW_HINT sparc: mm: Support MAP_BELOW_HINT selftests/mm: Create MAP_BELOW_HINT test arch/arm/mm/mmap.c | 10 ++++++++ arch/arm64/include/asm/processor.h | 34 ++++++++++++++++++++++---- arch/loongarch/mm/mmap.c | 11 +++++++++ arch/mips/mm/mmap.c | 9 +++++++ arch/parisc/include/uapi/asm/mman.h | 1 + arch/parisc/kernel/sys_parisc.c | 9 +++++++ arch/powerpc/include/asm/task_size_64.h | 36 +++++++++++++++++++++++----- arch/riscv/include/asm/processor.h | 32 ------------------------- arch/s390/mm/mmap.c | 10 ++++++++ arch/sh/mm/mmap.c | 10 ++++++++ arch/sparc/kernel/sys_sparc_64.c | 8 +++++++ arch/x86/kernel/sys_x86_64.c | 25 ++++++++++++++++--- fs/hugetlbfs/inode.c | 2 +- include/linux/sched/mm.h | 34 ++++++++++++++++++++++++-- include/uapi/asm-generic/mman-common.h | 1 + 
mm/mmap.c | 2 +- tools/arch/parisc/include/uapi/asm/mman.h | 1 + tools/include/uapi/asm-generic/mman-common.h | 1 + tools/testing/selftests/mm/Makefile | 1 + tools/testing/selftests/mm/map_below_hint.c | 29 ++++++++++++++++++++++ 20 files changed, 216 insertions(+), 50 deletions(-) --- base-commit: 5be63fc19fcaa4c236b307420483578a56986a37 change-id: 20240827-patches-below_hint_mmap-b13d79ae1c55 -- - Charlie
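The motivation in the cover letter, applications storing data in the unused upper bits of addresses, can be illustrated with a short pointer-tagging sketch. It does not use MAP_BELOW_HINT itself. The 48-bit limit (`ADDR_BITS`) and all helper names are assumptions chosen for illustration; a real application would need mmap to guarantee that its addresses stay below whatever limit it picks, which is the guarantee the proposed flag is meant to provide.

```c
#include <stdint.h>

/* Assumption for illustration: addresses fit in the low 48 bits. */
#define ADDR_BITS 48
#define ADDR_MASK ((UINT64_C(1) << ADDR_BITS) - 1)

/* True if the pointer leaves the upper 16 bits free for a tag. */
static int pointer_is_taggable(const void *p)
{
	return ((uint64_t)(uintptr_t)p >> ADDR_BITS) == 0;
}

/* Pack a 16-bit tag into the unused upper bits of a pointer value. */
static uint64_t tag_pointer(const void *p, uint16_t tag)
{
	return ((uint64_t)(uintptr_t)p & ADDR_MASK) |
	       ((uint64_t)tag << ADDR_BITS);
}

/* Recover the tag stored by tag_pointer(). */
static uint16_t pointer_tag(uint64_t v)
{
	return (uint16_t)(v >> ADDR_BITS);
}

/* Strip the tag and recover the original pointer. */
static void *untag_pointer(uint64_t v)
{
	return (void *)(uintptr_t)(v & ADDR_MASK);
}
```

The scheme silently breaks if an allocation ever lands above the assumed limit, which is why `pointer_is_taggable()` should be checked, or the address range constrained, at allocation time.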

10 months, 1 week

[PATCH v1 1/2] mseal: fix mmap(FIXED) error code.

by jeffxu＠chromium.org

From: Jeff Xu <jeffxu(a)chromium.org>

mmap(MAP_FIXED) should return EPERM when memory is sealed.

Fixes: 4205a39e06da ("mm/munmap: replace can_modify_mm with can_modify_vma")
Signed-off-by: Jeff Xu <jeffxu(a)chromium.org>
---
 mm/mmap.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/mm/mmap.c b/mm/mmap.c
index 80d70ed099cf..0cd0c0ef03c7 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1386,7 +1386,10 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
 	mt_on_stack(mt_detach);
 	mas_init(&mas_detach, &mt_detach, /* addr = */ 0);
 	/* Prepare to unmap any existing mapping in the area */
-	if (vms_gather_munmap_vmas(&vms, &mas_detach))
+	error = vms_gather_munmap_vmas(&vms, &mas_detach);
+	if (error == -EPERM)
+		return -EPERM;
+	if (error)
 		return -ENOMEM;
 
 	vmg.next = vms.next;
--
2.46.0.295.g3b9ea8a38a-goog
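The behavioral change in this hunk is small enough to restate as two pure functions. This is a sketch for comparison only: `gather_error_before` and `gather_error_after` are hypothetical names, and the argument stands in for the return value of vms_gather_munmap_vmas().

```c
#include <errno.h>

/* Before the patch: any failure of the gather step is reported as -ENOMEM. */
static int gather_error_before(int error)
{
	return error ? -ENOMEM : 0;
}

/* After the patch: -EPERM (sealed memory) is passed through unchanged. */
static int gather_error_after(int error)
{
	if (error == -EPERM)
		return -EPERM;
	if (error)
		return -ENOMEM;
	return 0;
}
```

Only the -EPERM case differs between the two, which matches the commit message: a sealed mapping now surfaces as EPERM to the mmap(MAP_FIXED) caller instead of being folded into ENOMEM.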

10 months, 1 week

[PATCH] selftests: splice: Add splice_read.sh and hint

by Rong Tao

From: Rong Tao <rongtao(a)cestc.cn>

Add test scripts and prompts.

Signed-off-by: Rong Tao <rongtao(a)cestc.cn>
---
 tools/testing/selftests/splice/splice_read.c  | 1 +
 tools/testing/selftests/splice/splice_read.sh | 9 +++++++++
 2 files changed, 10 insertions(+)
 create mode 100755 tools/testing/selftests/splice/splice_read.sh

diff --git a/tools/testing/selftests/splice/splice_read.c b/tools/testing/selftests/splice/splice_read.c
index 46dae6a25cfb..194b075f6bc0 100644
--- a/tools/testing/selftests/splice/splice_read.c
+++ b/tools/testing/selftests/splice/splice_read.c
@@ -49,6 +49,7 @@ int main(int argc, char *argv[])
 			 size, SPLICE_F_MOVE);
 	if (spliced < 0) {
 		perror("splice");
+		fprintf(stderr, "May try: %s /etc/os-release | cat\n", argv[0]);
 		return EXIT_FAILURE;
 	}
 
diff --git a/tools/testing/selftests/splice/splice_read.sh b/tools/testing/selftests/splice/splice_read.sh
new file mode 100755
index 000000000000..10fd5d738a2d
--- /dev/null
+++ b/tools/testing/selftests/splice/splice_read.sh
@@ -0,0 +1,9 @@
+#!/bin/sh
+# SPDX-License-Identifier: GPL-2.0
+set -e
+nl=$(./splice_read /etc/os-release | wc -l)
+
+test "$nl" != 0 && exit 0
+
+echo "splice_read broken"
+exit 1
--
2.46.0

10 months, 1 week

[PATCH] KVM: selftests: Add SEV-ES shutdown test

by Peter Gonda

Regression test for ae20eef5 ("KVM: SVM: Update SEV-ES shutdown intercepts with more metadata"). The test confirms that userspace is correctly notified of a guest shutdown, instead of seeing the previous behavior of an EINVAL from KVM_RUN.

Cc: Paolo Bonzini <pbonzini(a)redhat.com>
Cc: Sean Christopherson <seanjc(a)google.com>
Cc: Alper Gun <alpergun(a)google.com>
Cc: Tom Lendacky <thomas.lendacky(a)amd.com>
Cc: Michael Roth <michael.roth(a)amd.com>
Cc: kvm(a)vger.kernel.org
Cc: linux-kselftest(a)vger.kernel.org
Signed-off-by: Peter Gonda <pgonda(a)google.com>
---
 .../selftests/kvm/x86_64/sev_smoke_test.c | 26 +++++++++++++++++++
 1 file changed, 26 insertions(+)

diff --git a/tools/testing/selftests/kvm/x86_64/sev_smoke_test.c b/tools/testing/selftests/kvm/x86_64/sev_smoke_test.c
index 7c70c0da4fb74..04f24d5f09877 100644
--- a/tools/testing/selftests/kvm/x86_64/sev_smoke_test.c
+++ b/tools/testing/selftests/kvm/x86_64/sev_smoke_test.c
@@ -160,6 +160,30 @@ static void test_sev(void *guest_code, uint64_t policy)
 	kvm_vm_free(vm);
 }
 
+static void guest_shutdown_code(void)
+{
+	__asm__ __volatile__("ud2");
+}
+
+static void test_sev_es_shutdown(void)
+{
+	struct kvm_vcpu *vcpu;
+	struct kvm_vm *vm;
+
+	uint32_t type = KVM_X86_SEV_ES_VM;
+
+	vm = vm_sev_create_with_one_vcpu(type, guest_shutdown_code, &vcpu);
+
+	vm_sev_launch(vm, SEV_POLICY_ES, NULL);
+
+	vcpu_run(vcpu);
+	TEST_ASSERT(vcpu->run->exit_reason == KVM_EXIT_SHUTDOWN,
+		    "Wanted SHUTDOWN, got %s",
+		    exit_reason_str(vcpu->run->exit_reason));
+
+	kvm_vm_free(vm);
+}
+
 int main(int argc, char *argv[])
 {
 	TEST_REQUIRE(kvm_cpu_has(X86_FEATURE_SEV));
@@ -171,6 +195,8 @@ int main(int argc, char *argv[])
 	test_sev(guest_sev_es_code, SEV_POLICY_ES | SEV_POLICY_NO_DBG);
 	test_sev(guest_sev_es_code, SEV_POLICY_ES);
 
+	test_sev_es_shutdown();
+
 	if (kvm_has_cap(KVM_CAP_XCRS) &&
 	    (xgetbv(0) & XFEATURE_MASK_X87_AVX) == XFEATURE_MASK_X87_AVX) {
 		test_sync_vmsa(0);
--
2.45.2.803.g4e1b14247a-goog

10 months, 1 week

[PATCH-cgroup 0/2] cgroup/cpuset: Account for boot time isolated CPUs

by Waiman Long

The current cpuset code and the test_cpuset_prs.sh test do not fully account for the possibility of pre-isolated CPUs added by the "isolcpus" boot command line parameter. This patch series modifies them to do the right thing whether or not "isolcpus" is present. The updated test_cpuset_prs.sh was run successfully with and without the "isolcpus" option.

Waiman Long (2):
  cgroup/cpuset: Account for boot time isolated CPUs
  selftest/cgroup: Make test_cpuset_prs.sh deal with pre-isolated CPUs

 kernel/cgroup/cpuset.c                        | 23 +++++++---
 .../selftests/cgroup/test_cpuset_prs.sh       | 44 ++++++++++++++-----
 2 files changed, 51 insertions(+), 16 deletions(-)

--
2.43.5
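Boot-time isolated CPUs are reported by the kernel as a CPU list string such as "1-3,5" (for example in /sys/devices/system/cpu/isolated). The cover letter does not show how the updated script reads that information, so the following is only a sketch of how such a list could be turned into a bitmask by a tool that needs to account for pre-isolated CPUs. The function name `parse_cpulist` and the 64-CPU limit are assumptions made to keep the example short.

```c
#include <stdint.h>
#include <stdlib.h>

/*
 * Parse a kernel-style CPU list such as "1-3,5" into a bitmask where bit n
 * set means CPU n is listed. Only CPUs 0..63 are supported in this sketch.
 * An empty list yields an empty mask. Returns 0 on success, -1 on bad input.
 */
static int parse_cpulist(const char *s, uint64_t *mask)
{
	*mask = 0;
	while (*s && *s != '\n') {
		char *end;
		unsigned long lo, hi;

		/* Parse a single CPU number or the start of a range. */
		lo = hi = strtoul(s, &end, 10);
		if (end == s)
			return -1;
		if (*end == '-') {
			/* Parse the end of a "lo-hi" range. */
			s = end + 1;
			hi = strtoul(s, &end, 10);
			if (end == s)
				return -1;
		}
		if (lo > hi || hi > 63)
			return -1;
		for (unsigned long cpu = lo; cpu <= hi; cpu++)
			*mask |= UINT64_C(1) << cpu;

		/* Entries are comma separated; anything else is malformed. */
		if (*end == ',')
			end++;
		else if (*end && *end != '\n')
			return -1;
		s = end;
	}
	return 0;
}
```

A caller could then treat a CPU as available for a test partition only if its bit is clear in the mask parsed from the boot-time isolated list.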

10 months, 1 week

[PATCH net-next v23 00/13] Device Memory TCP

by Mina Almasry

v23: https://patchwork.kernel.org/project/netdevbpf/list/?series=882978&state=* ==== Fixing relatively minor issues called out in v22. (thanks again!) Mostly code cleanups, extack error messages, and minor reworks. Nothing major really changed, so the exact changes per commit is called in the commit messages. Full devmem TCP changes including the full GVE driver implementation is here: https://github.com/mina/linux/commits/tcpdevmem-v23/ v22: https://patchwork.kernel.org/project/netdevbpf/list/?series=881158&state=* ==== v22 aims to resolve the pending issue pointed to in v21, which is the interaction with xdp. In this series I rebase on top of the minor refactor which refactors propagating xdp configuration to slave devices: https://patchwork.kernel.org/project/netdevbpf/list/?series=881994&state=* I then disable setting xdp on devices using memory providers, and propagating xdp configuration to devices using memory providers. Full devmem TCP changes including the full GVE driver implementation is here: https://github.com/mina/linux/commits/tcpdevmem-v22/ v21: https://patchwork.kernel.org/project/netdevbpf/list/?series=880735&state=* ==== v20 addressed some comments and resolved a test failure, but introduced an unfortunate build error with a config edge case I wasn't testing. v21 simply resolves that error. Major Changes: - Resolve build error with CONFIG_PAGE_POOL=n && CONFIG_NET=y Full devmem TCP changes including the full GVE driver implementation is here: https://github.com/mina/linux/commits/tcpdevmem-v21/ v20: https://patchwork.kernel.org/project/netdevbpf/list/?series=879373&state=* ==== v20 aims to resolve a couple of bug reports against v19, and addresses some review comments around the page_pool_check_memory_provider mechanism. Major changes: - Test edge cases such as header split disabled in selftest. - Change `offset = 0` back to `offset = offset - start` to resolve issue found in RX path by Taehee (thanks!) 
- Address a few comments around page_pool_check_memory_provider() from Pavel & Jakub. - Removed some unnecessary includes across various patches in the series. - Removed unnecessary EXPORT_SYMBOL(page_pool_mem_providers) (Jakub). - Fix regression caused by incorrect dev_get_max_mp_channel check, along with rename (Jakub). Full devmem TCP changes including the full GVE driver implementation is here: https://github.com/mina/linux/commits/tcpdevmem-v20/ v19: https://patchwork.kernel.org/project/netdevbpf/list/?series=876852&state=* ==== v18 got a thorough review (thanks!), and this iteration addresses the feedback. Major changes: - Prevent deactivating mp bound queues. - Prevent installing xdp on mp bound netdevs, or installing mps on xdp installed netdevs. - Fix corner cases in netlink API vis-a-vis missing attributes. - Iron out the unreadable netmem driver support story. To be honest, the conversation with Jakub & Pavel got a bit confusing for me. I've implemented an approach in this set that makes sense to me, and AFAICT, addresses the requirements. It may be good as-is, or it may be a conversation starter/continuer. To be honest IMO there are many ways to skin this cat and I don't see an extremely strong reason to go for one approach over another. Here is one approach you may like. - Don't reset niov dma_addr on allocation & free. - Add some tests to the selftest that catches some of the issues around missing netlink attributes or deactivating mp-bound queues. Full devmem TCP changes including the full GVE driver implementation is here: https://github.com/mina/linux/commits/tcpdevmem-v19/ v18: https://patchwork.kernel.org/project/netdevbpf/list/?series=874848&state=* ==== v17 got minor feedback: (a) to beef up the description on patch 1 and (b) to remove the leading underscores in the header definition. I applied (a). (b) seems to be against current conventions so I did not apply before further discussion. 
Full devmem TCP changes including the full GVE driver implementation is here: https://github.com/mina/linux/commits/tcpdevmem-v17/ v17: https://patchwork.kernel.org/project/netdevbpf/list/?series=869900&state=* ==== v16 also got a very thorough review and some testing (thanks again!). This version addresses all the concerns reported on v15, in terms of feedback and issues reported. Major changes: - Use ASSERT_RTNL. - Moved around some of the page_pool helpers definitions so I can hide some netmem helpers in private files as Jakub suggested. - Don't make every net_iov hold a ref on the binding as Jakub suggested. - Fix issue reported by Taehee where we access queues after they have been freed. Full devmem TCP changes including the full GVE driver implementation is here: https://github.com/mina/linux/commits/tcpdevmem-v17/ v16: https://patchwork.kernel.org/project/netdevbpf/list/?series=866353&state=* ==== v15 got a thorough review and some testing, and this version addresses almost all the feedback. I left out a few minor comments whose authors said they could be addressed later. Major changes: - Addition of dma-buf introspection to page-pool-get and queue-get. - Fixes to selftests suggested by Taehee. - Fixes to documentation suggested by Donald. - A couple of suggestions and fixes to TCP patches by Eric and David. - Fixes to number assignments suggested by Arnd. - Use rtnl_lock()ing to guard against queue reconfiguration while the page_pool initialization is happening (Jakub). - Fixes to a few warnings reproduced by Taehee. - Fixes to dma-buf binding suggested by Taehee and Jakub. - Fixes to netlink UAPI suggested by Jakub. - Applied a number of Reviewed-bys and Acked-bys (including ones I lost from v13+).
Full devmem TCP changes including the full GVE driver implementation is here: https://github.com/mina/linux/commits/tcpdevmem-v16/ One caveat: Taehee reproduced a KASAN warning and reported it here: https://lore.kernel.org/netdev/CAMArcTUdCxOBYGF3vpbq=eBvqZfnc44KBaQTN7H-wqd… I estimate the issue to be minor and easily fixable: https://lore.kernel.org/netdev/CAHS8izNgaqC--GGE2xd85QB=utUnOHmioCsDd1TNxJW… I hope to be able to follow up with a fix to net tree as net-next closes imminently, but if this iteration doesn't make it in, I will repost with a fix squashed after net-next reopens, no problem. v15: https://patchwork.kernel.org/project/netdevbpf/list/?series=865481&state=* ==== No material changes in this version, only a fix to linking against libynl.a from the last version. Per Jakub's instructions I've pulled one of his patches into this series, and now use the new libynl.a correctly, I hope. As usual, the full devmem TCP changes including the full GVE driver implementation is here: https://github.com/mina/linux/commits/tcpdevmem-v15/ v14: https://patchwork.kernel.org/project/netdevbpf/list/?series=865135&archive=… ==== No material changes in this version. Only rebase and re-verification on top of net-next. v13, I think, raced with commit ebad6d0334793 ("net/ipv4: Use nested-BH locking for ipv4_tcp_sk.") being merged to net-next that caused a patchwork failure to apply. This series should apply cleanly on commit c4532232fa2a4 ("selftests: net: remove unneeded IP_GRE config"). I did not wait the customary 24hr as Jakub said it's OK to repost as soon as I build test the rebased version: https://lore.kernel.org/netdev/20240625075926.146d769d@kernel.org/ v13: https://patchwork.kernel.org/project/netdevbpf/list/?series=861406&archive=… ==== Major changes: -------------- This iteration addresses Pavel's review comments, applies his reviewed-by's, and seeks to fix the patchwork build error (sorry!). 
As usual, the full devmem TCP changes including the full GVE driver implementation is here: https://github.com/mina/linux/commits/tcpdevmem-v13/ v12: https://patchwork.kernel.org/project/netdevbpf/list/?series=859747&state=* ==== Major changes: -------------- This iteration only addresses one minor comment from Pavel with regards to the trace printing of netmem, and the patchwork build error introduced in v11 because I missed doing an allmodconfig build, sorry. Other than that v11, AFAICT, received no feedback. There is one discussion about how the specifics of plugging io uring memory through the page pool, but not relevant to content in this particular patchset, AFAICT. As usual, the full devmem TCP changes including the full GVE driver implementation is here: https://github.com/mina/linux/commits/tcpdevmem-v12/ v11: https://patchwork.kernel.org/project/netdevbpf/list/?series=857457&state=* ==== Major Changes: -------------- v11 addresses feedback received in v10. The major change is the removal of the memory provider ops as requested by Christoph. We still accomplish the same thing, but utilizing direct function calls with if statements rather than generic ops. Additionally address sparse warnings, bugs and review comments from folks that reviewed. As usual, the full devmem TCP changes including the full GVE driver implementation is here: https://github.com/mina/linux/commits/tcpdevmem-v11/ Detailed changelog: ------------------- - Fixes in netdev_rx_queue_restart() from Pavel & David. - Remove commit e650e8c3a36f5 ("net: page_pool: create hooks for custom page providers") from the series to address Christoph's feedback and rebased other patches on the series on this change. - Fixed build errors with CONFIG_DMA_SHARED_BUFFER && !CONFIG_GENERIC_ALLOCATOR build. - Fixed sparse warnings pointed out by Paolo. - Drop unnecessary gro_pull_from_frag0 checks. - Added Bagas reviewed-by to docs. 
v10: https://patchwork.kernel.org/project/netdevbpf/list/?series=852422&state=* ==== Major Changes: -------------- v9 was sent right before the merge window closed (sorry!). v10 is almost a re-send of the series now that the merge window re-opened. Only rebased to latest net-next and addressed some minor iterative comments received on v9. As usual, the full devmem TCP changes including the full GVE driver implementation is here: https://github.com/mina/linux/commits/tcpdevmem-v10/ Detailed changelog: ------------------- - Fixed tokens leaking in DONTNEED setsockopt (Nikolay). - Moved net_iov_dma_addr() to devmem.c and made it a devmem specific helpers (David). - Rename hook alloc_pages to alloc_netmems as alloc_pages is now preprocessor macro defined and causes a build error. v9: === Major Changes: -------------- GVE queue API has been merged. Submitting this version as non-RFC after rebasing on top of the merged API, and dropped the out of tree queue API I was carrying on github. Addressed the little feedback v8 has received. Detailed changelog: ------------------ - Added new patch from David Wei to this series for netdev_rx_queue_restart() - Fixed sparse error. - Removed CONFIG_ checks in netmem_is_net_iov() - Flipped skb->readable to skb->unreadable - Minor fixes to selftests & docs. RFC v8: ======= Major Changes: -------------- - Fixed build error generated by patch-by-patch build. - Applied docs suggestions from Randy. RFC v7: ======= Major Changes: -------------- This revision largely rebases on top of net-next and addresses the feedback RFCv6 received from folks, namely Jakub, Yunsheng, Arnd, David, & Pavel. The series remains in RFC because the queue-API ndos defined in this series are not yet implemented. I have a GVE implementation I carry out of tree for my testing. A upstreamable GVE implementation is in the works. Aside from that, in my estimation all the patches are ready for review/merge. Please do take a look. 
As usual the full devmem TCP changes including the full GVE driver implementation is here: https://github.com/mina/linux/commits/tcpdevmem-v7/ Detailed changelog: - Use admin-perm in netlink API. - Addressed feedback from Jakub with regards to netlink API implementation. - Renamed devmem.c functions to something more appropriate for that file. - Improve the performance seen through the page_pool benchmark. - Fix the value definition of all the SO_DEVMEM_* uapi. - Various fixes to documentation. Perf - page-pool benchmark: --------------------------- Improved performance of bench_page_pool_simple.ko tests compared to v6: https://pastebin.com/raw/v5dYRg8L net-next base: 8 cycle fast path. RFC v6: 10 cycle fast path. RFC v7: 9 cycle fast path. RFC v7 with CONFIG_DMA_SHARED_BUFFER disabled: 8 cycle fast path, same as baseline. Perf - Devmem TCP benchmark: --------------------- Perf is about the same regardless of the changes in v7, namely the removal of the static_branch_unlikely to improve the page_pool benchmark performance: 189/200gbps bi-directional throughput with RX devmem TCP and regular TCP TX i.e. ~95% line rate. RFC v6: ======= Major Changes: -------------- This revision largely rebases on top of net-next and addresses the little feedback RFCv5 received. The series remains in RFC because the queue-API ndos defined in this series are not yet implemented. I have a GVE implementation I carry out of tree for my testing. A upstreamable GVE implementation is in the works. Aside from that, in my estimation all the patches are ready for review/merge. Please do take a look. As usual the full devmem TCP changes including the full GVE driver implementation is here: https://github.com/mina/linux/commits/tcpdevmem-v6/ This version also comes with some performance data recorded in the cover letter (see below changelog). Detailed changelog: - Rebased on top of the merged netmem_ref changes. - Converted skb->dmabuf to skb->readable (Pavel). 
Pavel's original suggestion was to remove the skb->dmabuf flag entirely, but when I looked into it closely, I found the issue that if we remove the flag we have to dereference the shinfo(skb) pointer to obtain the first frag to tell whether an skb is readable or not. This can cause a performance regression if it dirties the cache line when the shinfo(skb) was not really needed. Instead, I converted the skb->dmabuf flag into a generic skb->readable flag which can be re-used by io_uring 0-copy RX. - Squashed a few locking optimizations from Eric Dumazet in the RX path and the DEVMEM_DONTNEED setsockopt. - Expanded the tests a bit. Added validation for invalid scenarios and added some more coverage. Perf - page-pool benchmark: --------------------------- bench_page_pool_simple.ko tests with and without these changes: https://pastebin.com/raw/ncHDwAbn AFAIK the number that really matters in the perf tests is the 'tasklet_page_pool01_fast_path Per elem'. This one measures at about 8 cycles without the changes but there is some 1 cycle noise in some results. With the patches this regresses to 9 cycles with the changes but there is 1 cycle noise occasionally running this test repeatedly. Lastly I tried disable the static_branch_unlikely() in netmem_is_net_iov() check. To my surprise disabling the static_branch_unlikely() check reduces the fast path back to 8 cycles, but the 1 cycle noise remains. Perf - Devmem TCP benchmark: --------------------- 189/200gbps bi-directional throughput with RX devmem TCP and regular TCP TX i.e. ~95% line rate. Major changes in RFC v5: ======================== 1. Rebased on top of 'Abstract page from net stack' series and used the new netmem type to refer to LSB set pointers instead of re-using struct page. 2. Downgraded this series back to RFC and called it RFC v5. This is because this series is now dependent on 'Abstract page from net stack'[1] and the queue API. 
Both are removed from the series to reduce the patch # and those bits are fairly independent or pre-requisite work. 3. Reworked the page_pool devmem support to use netmem and for some more unified handling. 4. Reworked the reference counting of net_iov (renamed from page_pool_iov) to use pp_ref_count for refcounting. The full changes including the dependent series and GVE page pool support is here: https://github.com/mina/linux/commits/tcpdevmem-rfcv5/ [1] https://patchwork.kernel.org/project/netdevbpf/list/?series=810774 Major changes in v1: ==================== 1. Implemented MVP queue API ndos to remove the userspace-visible driver reset. 2. Fixed issues in the napi_pp_put_page() devmem frag unref path. 3. Removed RFC tag. Many smaller addressed comments across all the patches (patches have individual change log). Full tree including the rest of the GVE driver changes: https://github.com/mina/linux/commits/tcpdevmem-v1 Changes in RFC v3: ================== 1. Pulled in the memory-provider dependency from Jakub's RFC[1] to make the series reviewable and mergeable. 2. Implemented multi-rx-queue binding which was a todo in v2. 3. Fix to cmsg handling. The sticking point in RFC v2[2] was the device reset required to refill the device rx-queues after the dmabuf bind/unbind. The solution suggested as I understand is a subset of the per-queue management ops Jakub suggested or similar: https://lore.kernel.org/netdev/20230815171638.4c057dcd@kernel.org/ This is not addressed in this revision, because: 1. This point was discussed at netconf & netdev and there is openness to using the current approach of requiring a device reset. 2. Implementing individual queue resetting seems to be difficult for my test bed with GVE. My prototype to test this ran into issues with the rx-queues not coming back up properly if reset individually. 
At the moment I'm unsure if it's a mistake in the POC or a genuine issue in the virtualization stack behind GVE, which currently doesn't test individual rx-queue restart. 3. Our usecases are not bothered by requiring a device reset to refill the buffer queues, and we'd like to support NICs that run into this limitation with resetting individual queues. My thought is that drivers that have trouble with per-queue configs can use the support in this series, while drivers that support new netdev ops to reset individual queues can automatically reset the queue as part of the dma-buf bind/unbind. The same approach with device resets is presented again for consideration with other sticking points addressed. This proposal includes the rx devmem path only proposed for merge. For a snapshot of my entire tree which includes the GVE POC page pool support & device memory support: https://github.com/torvalds/linux/compare/master...mina:linux:tcpdevmem-v3 [1] https://lore.kernel.org/netdev/f8270765-a27b-6ccf-33ea-cda097168d79@redhat.… [2] https://lore.kernel.org/netdev/CAHS8izOVJGJH5WF68OsRWFKJid1_huzzUK+hpKbLcL4… Changes in RFC v2: ================== The sticking point in RFC v1[1] was the dma-buf pages approach we used to deliver the device memory to the TCP stack. RFC v2 is a proof-of-concept that attempts to resolve this by implementing scatterlist support in the networking stack, such that we can import the dma-buf scatterlist directly. This is the approach proposed at a high level here[2]. Detailed changes: 1. Replaced dma-buf pages approach with importing scatterlist into the page pool. 2. Replace the dma-buf pages centric API with a netlink API. 3. Removed the TX path implementation - there is no issue with implementing the TX path with scatterlist approach, but leaving out the TX path makes it easier to review. 4. Functionality is tested with this proposal, but I have not conducted perf testing yet. 
I'm not sure there are regressions, but I removed perf claims from the cover letter until they can be re-confirmed. 5. Added Signed-off-by: contributors to the implementation. 6. Fixed some bugs with the RX path since RFC v1. Any feedback welcome, but specifically the biggest pending questions needing feedback IMO are: 1. Feedback on the scatterlist-based approach in general. 2. Netlink API (Patch 1 & 2). 3. Approach to handle all the drivers that expect to receive pages from the page pool (Patch 6). [1] https://lore.kernel.org/netdev/dfe4bae7-13a0-3c5d-d671-f61b375cb0b4@gmail.c… [2] https://lore.kernel.org/netdev/CAHS8izPm6XRS54LdCDZVd0C75tA1zHSu6jLVO8nzTLX… ================== * TL;DR: Device memory TCP (devmem TCP) is a proposal for transferring data to and/or from device memory efficiently, without bouncing the data to a host memory buffer. * Problem: A large number of data transfers have device memory as the source and/or destination. Accelerators drastically increased the volume of such transfers. Some examples include: - ML accelerators transferring large amounts of training data from storage into GPU/TPU memory. In some cases ML training setup time can be as long as 50% of TPU compute time; improving data transfer throughput & efficiency can help improve GPU/TPU utilization. - Distributed training, where ML accelerators, such as GPUs on different hosts, exchange data among them. - Distributed raw block storage applications transfer large amounts of data with remote SSDs; much of this data does not require host processing. Today, the majority of the Device-to-Device data transfers over the network are implemented as the following low level operations: Device-to-Host copy, Host-to-Host network transfer, and Host-to-Device copy. The implementation is suboptimal, especially for bulk data transfers, and can put significant strains on system resources, such as host memory bandwidth, PCIe bandwidth, etc.
One important reason behind the current state is the kernel’s lack of semantics to express device to network transfers. * Proposal: In this patch series we attempt to optimize this use case by implementing socket APIs that enable the user to: 1. send device memory across the network directly, and 2. receive incoming network packets directly into device memory. Packet _payloads_ go directly from the NIC to device memory for receive and from device memory to NIC for transmit. Packet _headers_ go to/from host memory and are processed by the TCP/IP stack normally. The NIC _must_ support header split to achieve this. Advantages: - Alleviate host memory bandwidth pressure, compared to existing network-transfer + device-copy semantics. - Alleviate PCIe BW pressure, by limiting data transfer to the lowest level of the PCIe tree, compared to the traditional path which sends data through the root complex. * Patch overview: ** Part 1: netlink API Gives user ability to bind dma-buf to an RX queue. ** Part 2: scatterlist support Currently the standard for device memory sharing is DMABUF, which doesn't generate struct pages. On the other hand, the networking stack (skbs, drivers, and page pool) operates on pages. We have 2 options: 1. Generate struct pages for dmabuf device memory, or, 2. Modify the networking stack to process scatterlist. Approach #1 was attempted in RFC v1. RFC v2 implements approach #2. ** Part 3: page pool support We piggyback on the page pool memory providers proposal: https://github.com/kuba-moo/linux/tree/pp-providers It allows the page pool to define a memory provider that provides the page allocation and freeing. It helps abstract most of the device memory TCP changes from the driver. ** Part 4: support for unreadable skb frags Page pool iovs are not accessible by the host; we implement changes throughout the networking stack to correctly handle skbs with unreadable frags. ** Part 5: recvmsg() APIs We define user APIs for the user to send and receive device memory.
Not included with this series is the GVE devmem TCP support, just to simplify the review. Code available here if desired: https://github.com/mina/linux/tree/tcpdevmem This series is built on top of net-next with Jakub's pp-providers changes cherry-picked. * NIC dependencies: 1. (strict) Devmem TCP requires the NIC to support header split, i.e. the capability to split incoming packets into a header + payload and to put each into a separate buffer. Devmem TCP works by using device memory for the packet payload, and host memory for the packet headers. 2. (optional) Devmem TCP works better with flow steering support & RSS support, i.e. the NIC's ability to steer flows into certain rx queues. This allows the sysadmin to enable devmem TCP on a subset of the rx queues, and steer devmem TCP traffic onto these queues and non devmem TCP elsewhere. The NIC I have access to with these properties is the GVE with DQO support running in Google Cloud, but any NIC that supports these features would suffice. I may be able to help reviewers bring up devmem TCP on their NICs. * Testing: The series includes a udmabuf kselftest that shows a simple use case of devmem TCP and validates the entire data path end to end without a dependency on a specific dmabuf provider. ** Test Setup Kernel: net-next with this series and memory provider API cherry-picked locally. Hardware: Google Cloud A3 VMs. NIC: GVE with header split & RSS & flow steering support.
Cc: Pavel Begunkov <asml.silence(a)gmail.com> Cc: David Wei <dw(a)davidwei.uk> Cc: Jason Gunthorpe <jgg(a)ziepe.ca> Cc: Yunsheng Lin <linyunsheng(a)huawei.com> Cc: Shailend Chand <shailend(a)google.com> Cc: Harshitha Ramamurthy <hramamurthy(a)google.com> Cc: Shakeel Butt <shakeel.butt(a)linux.dev> Cc: Jeroen de Borst <jeroendb(a)google.com> Cc: Praveen Kaligineedi <pkaligineedi(a)google.com> Cc: Bagas Sanjaya <bagasdotme(a)gmail.com> Cc: Steven Rostedt <rostedt(a)goodmis.org> Cc: Christoph Hellwig <hch(a)infradead.org> Cc: Nikolay Aleksandrov <razor(a)blackwall.org> Cc: Taehee Yoo <ap420073(a)gmail.com> Cc: Donald Hunter <donald.hunter(a)gmail.com> Mina Almasry (13): netdev: add netdev_rx_queue_restart() net: netdev netlink api to bind dma-buf to a net device netdev: support binding dma-buf to netdevice netdev: netdevice devmem allocator page_pool: devmem support memory-provider: dmabuf devmem memory provider net: support non paged skb frags net: add support for skbs with unreadable frags tcp: RX path for devmem TCP net: add SO_DEVMEM_DONTNEED setsockopt to release RX frags net: add devmem TCP documentation selftests: add ncdevmem, netcat for devmem TCP netdev: add dmabuf introspection Documentation/netlink/specs/netdev.yaml | 61 +++ Documentation/networking/devmem.rst | 269 +++++++++++ Documentation/networking/index.rst | 1 + arch/alpha/include/uapi/asm/socket.h | 6 + arch/mips/include/uapi/asm/socket.h | 6 + arch/parisc/include/uapi/asm/socket.h | 6 + arch/sparc/include/uapi/asm/socket.h | 6 + include/linux/netdevice.h | 2 + include/linux/skbuff.h | 61 ++- include/linux/skbuff_ref.h | 9 +- include/linux/socket.h | 1 + include/net/devmem.h | 136 ++++++ include/net/mp_dmabuf_devmem.h | 44 ++ include/net/netdev_rx_queue.h | 5 + include/net/netmem.h | 163 ++++++- include/net/page_pool/helpers.h | 39 +- include/net/page_pool/types.h | 22 +- include/net/sock.h | 2 + include/net/tcp.h | 5 +- include/trace/events/page_pool.h | 12 +- include/uapi/asm-generic/socket.h | 6 
+ include/uapi/linux/netdev.h | 13 + include/uapi/linux/uio.h | 17 + net/Kconfig | 5 + net/core/Makefile | 2 + net/core/datagram.c | 6 + net/core/dev.c | 24 +- net/core/devmem.c | 388 ++++++++++++++++ net/core/gro.c | 3 +- net/core/netdev-genl-gen.c | 23 + net/core/netdev-genl-gen.h | 6 + net/core/netdev-genl.c | 134 +++++- net/core/netdev_rx_queue.c | 81 ++++ net/core/netmem_priv.h | 31 ++ net/core/page_pool.c | 117 +++-- net/core/page_pool_priv.h | 46 ++ net/core/page_pool_user.c | 31 +- net/core/skbuff.c | 77 +++- net/core/sock.c | 68 +++ net/ethtool/common.c | 8 + net/ipv4/esp4.c | 3 +- net/ipv4/tcp.c | 261 ++++++++++- net/ipv4/tcp_input.c | 13 +- net/ipv4/tcp_ipv4.c | 16 + net/ipv4/tcp_minisocks.c | 2 + net/ipv4/tcp_output.c | 5 +- net/ipv6/esp6.c | 3 +- net/packet/af_packet.c | 4 +- net/xdp/xsk_buff_pool.c | 5 + tools/include/uapi/linux/netdev.h | 13 + tools/testing/selftests/net/.gitignore | 1 + tools/testing/selftests/net/Makefile | 9 + tools/testing/selftests/net/ncdevmem.c | 570 ++++++++++++++++++++++++ 53 files changed, 2723 insertions(+), 124 deletions(-) create mode 100644 Documentation/networking/devmem.rst create mode 100644 include/net/devmem.h create mode 100644 include/net/mp_dmabuf_devmem.h create mode 100644 net/core/devmem.c create mode 100644 net/core/netdev_rx_queue.c create mode 100644 net/core/netmem_priv.h create mode 100644 tools/testing/selftests/net/ncdevmem.c -- 2.46.0.469.g59c65b2a67-goog

10 months, 1 week

[PATCH] kselftest/arm64: Fix build warnings for ptrace

by Dev Jain

A "%s" is missing in ksft_exit_fail_msg(); instead, use the newly introduced ksft_exit_fail_perror(). Signed-off-by: Dev Jain <dev.jain(a)arm.com> --- tools/testing/selftests/arm64/abi/ptrace.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/tools/testing/selftests/arm64/abi/ptrace.c b/tools/testing/selftests/arm64/abi/ptrace.c index e4fa507cbdd0..b51d21f78cf9 100644 --- a/tools/testing/selftests/arm64/abi/ptrace.c +++ b/tools/testing/selftests/arm64/abi/ptrace.c @@ -163,10 +163,10 @@ static void test_hw_debug(pid_t child, int type, const char *type_name) static int do_child(void) { if (ptrace(PTRACE_TRACEME, -1, NULL, NULL)) - ksft_exit_fail_msg("PTRACE_TRACEME", strerror(errno)); + ksft_exit_fail_perror("PTRACE_TRACEME"); if (raise(SIGSTOP)) - ksft_exit_fail_msg("raise(SIGSTOP)", strerror(errno)); + ksft_exit_fail_perror("raise(SIGSTOP)"); return EXIT_SUCCESS; } -- 2.30.2
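The warning fixed above comes from passing strerror(errno) to a printf-style function whose format string has no conversion to consume it. A minimal sketch of the perror-style pattern follows; format_perror() is a hypothetical helper written for illustration and is not part of kselftest.h (the real ksft_exit_fail_perror() additionally exits the test):

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper, for illustration only: build "<msg>: <strerror(err)>"
 * into buf the way a perror-style wrapper would. The format string lives in
 * one place, so callers cannot forget the "%s" that consumes strerror().
 */
static int format_perror(char *buf, size_t len, const char *msg, int err)
{
	return snprintf(buf, len, "%s: %s", msg, strerror(err));
}
```

A caller would pass only the context string and errno, e.g. format_perror(buf, sizeof(buf), "PTRACE_TRACEME", errno), which mirrors how the patch reduces each call site to a single argument.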

10 months, 1 week

[PATCH] kselftest/arm64: Actually test SME vector length changes via sigreturn

by Mark Brown

The test case for SME vector length changes via sigreturn use a bit too much cut'n'paste and only actually changed the SVE vector length in the test itself. Andre's recent factoring out of the initialisation code caused this to be exposed and the test to start failing. Fix the test to actually cover the thing it's supposed to test. Fixes: 4963aeb35a9e ("kselftest/arm64: signal: Add SME signal handling tests") Signed-off-by: Mark Brown <broonie(a)kernel.org> --- .../arm64/signal/testcases/fake_sigreturn_sme_change_vl.c | 14 +++++++------- 1 file changed, 7 insertions(+), 7 deletions(-) diff --git a/tools/testing/selftests/arm64/signal/testcases/fake_sigreturn_sme_change_vl.c b/tools/testing/selftests/arm64/signal/testcases/fake_sigreturn_sme_change_vl.c index cb8c051b5c8f..dfd6a2badf9f 100644 --- a/tools/testing/selftests/arm64/signal/testcases/fake_sigreturn_sme_change_vl.c +++ b/tools/testing/selftests/arm64/signal/testcases/fake_sigreturn_sme_change_vl.c @@ -35,30 +35,30 @@ static int fake_sigreturn_ssve_change_vl(struct tdescr *td, { size_t resv_sz, offset; struct _aarch64_ctx *head = GET_SF_RESV_HEAD(sf); - struct sve_context *sve; + struct za_context *za; /* Get a signal context with a SME ZA frame in it */ if (!get_current_context(td, &sf.uc, sizeof(sf.uc))) return 1; resv_sz = GET_SF_RESV_SIZE(sf); - head = get_header(head, SVE_MAGIC, resv_sz, &offset); + head = get_header(head, ZA_MAGIC, resv_sz, &offset); if (!head) { - fprintf(stderr, "No SVE context\n"); + fprintf(stderr, "No ZA context\n"); return 1; } - if (head->size != sizeof(struct sve_context)) { + if (head->size != sizeof(struct za_context)) { fprintf(stderr, "Register data present, aborting\n"); return 1; } - sve = (struct sve_context *)head; + za = (struct za_context *)head; /* No changes are supported; init left us at minimum VL so go to max */ fprintf(stderr, "Attempting to change VL from %d to %d\n", - sve->vl, vls[0]); - sve->vl = vls[0]; + za->vl, vls[0]); + za->vl = vls[0]; 
fake_sigreturn(&sf, sizeof(sf), 0); --- base-commit: b18bbfc14a38b5234e09c2adcf713e38063a7e6e change-id: 20240829-arm64-sme-signal-vl-change-test-cebe4035856a Best regards, -- Mark Brown <broonie(a)kernel.org>

10 months, 1 week

[PATCH] selftest/vDSO: Fix cross build for the random tests

by Mark Brown

Unlike the check for the standalone x86 test, the check for building the vDSO getrandom and chacha tests looks at the architecture for the host rather than the architecture for the target when deciding if they should be built. Since the chacha test includes some assembler code, this means that cross building with x86 as either the target or host is broken. Use a check for ARCH instead. Fixes: 4920a2590e91 ("selftests/vDSO: add tests for vgetrandom") Signed-off-by: Mark Brown <broonie(a)kernel.org> --- The x86_64 build is still broken for me because nothing installs tools/arch/x86_64/vdso/vgetrandom-chacha.S (I believe it's supposed to be copied from ./arch/x86/entry/vdso/vgetrandom-chacha.S but I don't see how?) but this at least fixes all the other architectures. --- tools/testing/selftests/vDSO/Makefile | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/tools/testing/selftests/vDSO/Makefile b/tools/testing/selftests/vDSO/Makefile index e21e78aae24d..7fb59310718c 100644 --- a/tools/testing/selftests/vDSO/Makefile +++ b/tools/testing/selftests/vDSO/Makefile @@ -10,7 +10,7 @@ ifeq ($(ARCH),$(filter $(ARCH),x86 x86_64)) TEST_GEN_PROGS += vdso_standalone_test_x86 endif TEST_GEN_PROGS += vdso_test_correctness -ifeq ($(uname_M),x86_64) +ifeq ($(ARCH),$(filter $(ARCH),x86_64)) TEST_GEN_PROGS += vdso_test_getrandom TEST_GEN_PROGS += vdso_test_chacha endif --- base-commit: 985bf40edf4343dcb04c33f58b40b4a85c1776d4 change-id: 20240830-vdso-chacha-build-8d3789bf695c Best regards, -- Mark Brown <broonie(a)kernel.org>

10 months, 1 week

[PATCH v2 0/4] Increase mseal test coverage

by jeffxu＠chromium.org

From: Jeff Xu <jeffxu(a)chromium.org> This series increases the test coverage of mseal_test by: - adding checks for vma_size, prot, and error code to the existing tests; - adding more test cases for madvise, munmap, mmap and mremap to cover sealing in different scenarios. The increased test coverage will hopefully help to prevent future regressions. It doesn't change the semantics of any existing mm API, i.e. it will pass on Linux main and the 6.10 branch. Note: in order to pass this test in mm-unstable, mm-unstable must have Liam's fix on mmap [1] [1] https://lore.kernel.org/linux-kselftest/vyllxuh5xbqmaoyl2mselebij5ox7cseekj… History: V2: - remove the mmap fix (Liam R. Howlett will fix it separately) - Add cover letter (Lorenzo Stoakes) - split the testcase for ease of review (Mark Brown) V1: - https://lore.kernel.org/linux-kselftest/20240828225522.684774-1-jeffxu@chro… Jeff Xu (4): selftests/mm: mseal_test, add vma size check selftests/mm: mseal_test add sealed madvise type selftests/mm: mseal_test add more tests for mmap selftests/mm: mseal_test add more tests for mremap tools/testing/selftests/mm/mseal_test.c | 829 ++++++++++++++++++++++-- 1 file changed, 762 insertions(+), 67 deletions(-) -- 2.46.0.469.g59c65b2a67-goog
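One way a test can check a mapping's size and protection after an operation is to read them back from /proc/self/maps. The sketch below only illustrates that idea; vma_size_and_prot() is a hypothetical name and the code is not taken from mseal_test.c, whose actual helpers may differ:

```c
#define _GNU_SOURCE
#include <stdio.h>
#include <sys/mman.h>

/*
 * Illustrative helper (an assumption, not the mseal_test.c code): find the
 * mapping in /proc/self/maps that contains addr, store its PROT_* bits in
 * *prot and return its size in bytes, or 0 if no mapping contains addr.
 */
static unsigned long vma_size_and_prot(const void *addr, int *prot)
{
	unsigned long start, end, target = (unsigned long)addr;
	unsigned long size = 0;
	char line[512], perms[5];
	FILE *maps = fopen("/proc/self/maps", "r");

	if (!maps)
		return 0;

	while (fgets(line, sizeof(line), maps)) {
		/* Each entry starts with "start-end perms", e.g. "7f..-7f.. r--p". */
		if (sscanf(line, "%lx-%lx %4s", &start, &end, perms) != 3)
			continue;
		if (target < start || target >= end)
			continue;
		*prot = (perms[0] == 'r' ? PROT_READ : 0) |
			(perms[1] == 'w' ? PROT_WRITE : 0) |
			(perms[2] == 'x' ? PROT_EXEC : 0);
		size = end - start;
		break;
	}
	fclose(maps);
	return size;
}
```

Note that the kernel may merge adjacent mappings with identical attributes, so the reported size can be larger than the length passed to mmap(); a test that needs an exact size has to account for that.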

10 months, 1 week

[PATCH 0/3] selftests: kvm: s390: Add ucontrol memory selftests

by Christoph Schlameuss

This patch series adds some not yet picked selftests to the kvm s390x selftest suite. The additional test cases are covering: * Assert KVM_EXIT_S390_UCONTROL exit on not mapped memory access * Assert functionality of storage keys in ucontrol VM * Assert that memory region operations are rejected for ucontrol VMs Running the test cases requires sys_admin capabilities to start the ucontrol VM. This can be achieved by running as root or with a command like: sudo setpriv --reuid nobody --inh-caps -all,+sys_admin \ --ambient-caps -all,+sys_admin --bounding-set -all,+sys_admin \ ./ucontrol_test --- The patches in this series have been part of the previous patch series. The test cases added here do depend on the fixture added in the earlier patches. From v5 PATCH 7-9 the segment and page table generation has been removed and DAT has been disabled, since DAT is not necessary to validate the KVM code. Previous series: https://lore.kernel.org/kvm/20240807154512.316936-1-schlameuss@linux.ibm.co… Also see: https://lore.kernel.org/kvm/d97f4dec-31c3-45c0-ac33-90e665eb6e99@linux.ibm.… Christoph Schlameuss (3): selftests: kvm: s390: Add uc_map_unmap VM test case selftests: kvm: s390: Add uc_skey VM test case selftests: kvm: s390: Verify reject memory region operations for ucontrol VMs .../selftests/kvm/s390x/ucontrol_test.c | 218 +++++++++++++++++- 1 file changed, 217 insertions(+), 1 deletion(-) -- 2.46.0

10 months, 1 week

[PATCH 1/5] selftests: vdso: Fix vDSO name for powerpc

by Christophe Leroy

Following error occurs when running vdso_test_correctness on powerpc: ~ # ./vdso_test_correctness [WARN] failed to find vDSO [SKIP] No vDSO, so skipping clock_gettime() tests [SKIP] No vDSO, so skipping clock_gettime64() tests [RUN] Testing getcpu... [OK] CPU 0: syscall: cpu 0, node 0 On powerpc, vDSO is neither called linux-vdso.so.1 nor linux-gate.so.1 but linux-vdso32.so.1 or linux-vdso64.so.1. Also search those two names before giving up. Fixes: c7e5789b24d3 ("kselftest: Move test_vdso to the vDSO test suite") Signed-off-by: Christophe Leroy <christophe.leroy(a)csgroup.eu> --- tools/testing/selftests/vDSO/vdso_test_correctness.c | 6 ++++++ 1 file changed, 6 insertions(+) diff --git a/tools/testing/selftests/vDSO/vdso_test_correctness.c b/tools/testing/selftests/vDSO/vdso_test_correctness.c index e691a3cf1491..cdb697ae8343 100644 --- a/tools/testing/selftests/vDSO/vdso_test_correctness.c +++ b/tools/testing/selftests/vDSO/vdso_test_correctness.c @@ -114,6 +114,12 @@ static void fill_function_pointers() if (!vdso) vdso = dlopen("linux-gate.so.1", RTLD_LAZY | RTLD_LOCAL | RTLD_NOLOAD); + if (!vdso) + vdso = dlopen("linux-vdso32.so.1", + RTLD_LAZY | RTLD_LOCAL | RTLD_NOLOAD); + if (!vdso) + vdso = dlopen("linux-vdso64.so.1", + RTLD_LAZY | RTLD_LOCAL | RTLD_NOLOAD); if (!vdso) { printf("[WARN]\tfailed to find vDSO\n"); return; -- 2.44.0
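The lookup in this patch tries each known vDSO soname in turn. As a sketch of the same idea in table-driven form (find_vdso() and vdso_names are names made up for this illustration; the patch itself keeps the explicit if-chain):

```c
#include <dlfcn.h>
#include <stddef.h>

/* vDSO sonames mentioned in the patch; the last two are the powerpc ones. */
static const char *const vdso_names[] = {
	"linux-vdso.so.1",
	"linux-gate.so.1",
	"linux-vdso32.so.1",
	"linux-vdso64.so.1",
};

/*
 * Return a handle to the already-mapped vDSO, or NULL if none of the
 * candidate names matches. RTLD_NOLOAD makes dlopen() succeed only for
 * objects that are already loaded, so nothing is read from disk.
 */
static void *find_vdso(void)
{
	size_t i;

	for (i = 0; i < sizeof(vdso_names) / sizeof(vdso_names[0]); i++) {
		void *handle = dlopen(vdso_names[i],
				      RTLD_LAZY | RTLD_LOCAL | RTLD_NOLOAD);

		if (handle)
			return handle;
	}
	return NULL;
}
```

With a table, supporting another architecture's vDSO name becomes a one-line addition; on older C libraries the program may need to be linked with -ldl.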

10 months, 1 week


[PATCH net-next v16 11/14] mm: page_frag: add testing for the newly added prepare API

by Yunsheng Lin

Add testing for the newly added prepare API, covering both the aligned and non-aligned variants; the probe API is tested along with the prepare API. CC: Alexander Duyck <alexander.duyck(a)gmail.com> Signed-off-by: Yunsheng Lin <linyunsheng(a)huawei.com> --- .../selftests/mm/page_frag/page_frag_test.c | 66 +++++++++++++++++-- tools/testing/selftests/mm/run_vmtests.sh | 4 ++ tools/testing/selftests/mm/test_page_frag.sh | 31 +++++++++ 3 files changed, 96 insertions(+), 5 deletions(-) diff --git a/tools/testing/selftests/mm/page_frag/page_frag_test.c b/tools/testing/selftests/mm/page_frag/page_frag_test.c index e21a22b1d70b..856eacdd1c90 100644 --- a/tools/testing/selftests/mm/page_frag/page_frag_test.c +++ b/tools/testing/selftests/mm/page_frag/page_frag_test.c @@ -27,6 +27,10 @@ static bool test_align; module_param(test_align, bool, 0); MODULE_PARM_DESC(test_align, "use align API for testing"); +static bool test_prepare; +module_param(test_prepare, bool, 0); +MODULE_PARM_DESC(test_prepare, "use prepare API for testing"); + static int test_alloc_len = 2048; module_param(test_alloc_len, int, 0); MODULE_PARM_DESC(test_alloc_len, "alloc len for testing"); @@ -67,6 +71,18 @@ static int page_frag_pop_thread(void *arg) return 0; } +static void frag_frag_test_commit(struct page_frag_cache *nc, + struct page_frag *prepare_pfrag, + struct page_frag *probe_pfrag, + unsigned int used_sz) +{ + WARN_ON_ONCE(prepare_pfrag->page != probe_pfrag->page || + prepare_pfrag->offset != probe_pfrag->offset || + prepare_pfrag->size != probe_pfrag->size); + + page_frag_commit(nc, prepare_pfrag, used_sz); +} + static int page_frag_push_thread(void *arg) { struct ptr_ring *ring = arg; @@ -80,13 +96,52 @@ static int page_frag_push_thread(void *arg) int ret; if (test_align) { - va = page_frag_alloc_align(&test_nc, test_alloc_len, - GFP_KERNEL, SMP_CACHE_BYTES); + if (test_prepare) { + struct page_frag prepare_frag, probe_frag; + void *probe_va; + + va = page_frag_alloc_refill_prepare_align(&test_nc, +
test_alloc_len, + &prepare_frag, + GFP_KERNEL, + SMP_CACHE_BYTES); + + probe_va = __page_frag_alloc_refill_probe_align(&test_nc, + test_alloc_len, + &probe_frag, + -SMP_CACHE_BYTES); + WARN_ON_ONCE(va != probe_va); + + if (likely(va)) + frag_frag_test_commit(&test_nc, &prepare_frag, + &probe_frag, test_alloc_len); + } else { + va = page_frag_alloc_align(&test_nc, + test_alloc_len, + GFP_KERNEL, + SMP_CACHE_BYTES); + } WARN_ONCE((unsigned long)va & (SMP_CACHE_BYTES - 1), "unaligned va returned\n"); } else { - va = page_frag_alloc(&test_nc, test_alloc_len, GFP_KERNEL); + if (test_prepare) { + struct page_frag prepare_frag, probe_frag; + void *probe_va; + + va = page_frag_alloc_refill_prepare(&test_nc, test_alloc_len, + &prepare_frag, GFP_KERNEL); + + probe_va = page_frag_alloc_refill_probe(&test_nc, test_alloc_len, + &probe_frag); + + WARN_ON_ONCE(va != probe_va); + if (likely(va)) + frag_frag_test_commit(&test_nc, &prepare_frag, + &probe_frag, test_alloc_len); + } else { + va = page_frag_alloc(&test_nc, test_alloc_len, GFP_KERNEL); + } } if (!va) @@ -149,8 +204,9 @@ static int __init page_frag_test_init(void) wait_for_completion(&wait); duration = (u64)ktime_us_delta(ktime_get(), start); - pr_info("%d of iterations for %s testing took: %lluus\n", nr_test, - test_align ? "aligned" : "non-aligned", duration); + pr_info("%d of iterations for %s %s API testing took: %lluus\n", nr_test, + test_align ? "aligned" : "non-aligned", + test_prepare ? 
"prepare" : "alloc", duration); ptr_ring_cleanup(&ptr_ring, NULL); page_frag_cache_drain(&test_nc); diff --git a/tools/testing/selftests/mm/run_vmtests.sh b/tools/testing/selftests/mm/run_vmtests.sh index 96fd470b9f51..e4a36231bbea 100755 --- a/tools/testing/selftests/mm/run_vmtests.sh +++ b/tools/testing/selftests/mm/run_vmtests.sh @@ -464,6 +464,10 @@ CATEGORY="page_frag" run_test ./test_page_frag.sh aligned CATEGORY="page_frag" run_test ./test_page_frag.sh nonaligned +CATEGORY="page_frag" run_test ./test_page_frag.sh aligned_prepare + +CATEGORY="page_frag" run_test ./test_page_frag.sh nonaligned_prepare + echo "SUMMARY: PASS=${count_pass} SKIP=${count_skip} FAIL=${count_fail}" | tap_prefix echo "1..${count_total}" | tap_output diff --git a/tools/testing/selftests/mm/test_page_frag.sh b/tools/testing/selftests/mm/test_page_frag.sh index aad55e0ca2cd..753ec4b6fdc3 100755 --- a/tools/testing/selftests/mm/test_page_frag.sh +++ b/tools/testing/selftests/mm/test_page_frag.sh @@ -32,6 +32,8 @@ ksft_skip=4 # NONALIGNED_PARAM="test_push_cpu=$TEST_CPU_0 test_pop_cpu=$TEST_CPU_1 test_alloc_len=12 nr_test=512000000" ALIGNED_PARAM="$NONALIGNED_PARAM test_align=1" +NONALIGNED_PREPARE_PARAM="$NONALIGNED_PARAM test_prepare=1" +ALIGNED_PREPARE_PARAM="$ALIGNED_PARAM test_prepare=1" check_test_requirements() { @@ -70,6 +72,24 @@ run_aligned_check() echo "Check the kernel ring buffer to see the summary." } +run_nonaligned_prepare_check() +{ + echo "Run performance tests to evaluate how fast nonaligned prepare API is." + + insmod $DRIVER $NONALIGNED_PREPARE_PARAM > /dev/null 2>&1 + echo "Done." + echo "Ccheck the kernel ring buffer to see the summary." +} + +run_aligned_prepare_check() +{ + echo "Run performance tests to evaluate how fast aligned prepare API is." + + insmod $DRIVER $ALIGNED_PREPARE_PARAM > /dev/null 2>&1 + echo "Done." + echo "Check the kernel ring buffer to see the summary." +} + run_smoke_check() { echo "Run smoke test." 
@@ -82,6 +102,7 @@ run_smoke_check() usage() { echo -n "Usage: $0 [ aligned ] | [ nonaligned ] | | [ smoke ] | " + echo "[ aligned_prepare ] | [ nonaligned_prepare ] | " echo "manual parameters" echo echo "Valid tests and parameters:" @@ -102,6 +123,12 @@ usage() echo "# Performance testing for aligned alloc API" echo "$0 aligned" echo + echo "# Performance testing for nonaligned prepare API" + echo "$0 nonaligned_prepare" + echo + echo "# Performance testing for aligned prepare API" + echo "$0 aligned_prepare" + echo exit 0 } @@ -155,6 +182,10 @@ function run_test() run_nonaligned_check elif [[ "$1" = "aligned" ]]; then run_aligned_check + elif [[ "$1" = "nonaligned_prepare" ]]; then + run_nonaligned_prepare_check + elif [[ "$1" = "aligned_prepare" ]]; then + run_aligned_prepare_check else run_manual_check $@ fi -- 2.33.0
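The test above checks one invariant: a prepare call and a probe call issued back to back must describe the same fragment, and only the commit consumes space. The following is a small userspace model of that invariant, offered as a sketch only. Every name in it (`frag_cache`, `frag_view`, `frag_prepare`, `frag_probe`, `frag_commit`) is hypothetical, it uses a fixed 4096-byte buffer, and it does not model page refcounting or cache refill, so it should not be read as the kernel API.

```c
#include <stddef.h>

/* Hypothetical userspace model; none of these names are kernel API. */
struct frag_cache {
	unsigned char buf[4096];
	size_t offset;		/* next free byte in buf */
};

struct frag_view {
	size_t offset;		/* where the fragment would start */
	size_t size;		/* bytes available from that offset */
};

/* Prepare: describe the space a fragment of 'len' bytes would use,
 * without consuming anything yet. */
static void *frag_prepare(struct frag_cache *c, size_t len,
			  struct frag_view *v)
{
	if (len > sizeof(c->buf) - c->offset)
		return NULL;

	v->offset = c->offset;
	v->size = sizeof(c->buf) - c->offset;
	return c->buf + c->offset;
}

/* Probe: a read-only look at the same space. In this model it is
 * identical to prepare because there is no refill step to skip. */
static void *frag_probe(struct frag_cache *c, size_t len,
			struct frag_view *v)
{
	return frag_prepare(c, len, v);
}

/* Commit: consume 'used' bytes of a previously prepared view. */
static void frag_commit(struct frag_cache *c, const struct frag_view *v,
			size_t used)
{
	c->offset = v->offset + used;
}
```

In this model, comparing the two views field by field before committing plays the role of the `WARN_ON_ONCE()` in `frag_frag_test_commit()`.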

10 months, 1 week


[PATCH net-next v16 04/14] mm: page_frag: avoid caller accessing 'page_frag_cache' directly

by Yunsheng Lin

Use the appropriate page_frag API instead of having callers access 'page_frag_cache' directly. CC: Alexander Duyck <alexander.duyck(a)gmail.com> Signed-off-by: Yunsheng Lin <linyunsheng(a)huawei.com> Reviewed-by: Alexander Duyck <alexanderduyck(a)fb.com> Acked-by: Chuck Lever <chuck.lever(a)oracle.com> --- drivers/vhost/net.c | 2 +- include/linux/page_frag_cache.h | 10 ++++++++++ net/core/skbuff.c | 6 +++--- net/rxrpc/conn_object.c | 4 +--- net/rxrpc/local_object.c | 4 +--- net/sunrpc/svcsock.c | 6 ++---- tools/testing/selftests/mm/page_frag/page_frag_test.c | 2 +- 7 files changed, 19 insertions(+), 15 deletions(-) diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c index f16279351db5..9ad37c012189 100644 --- a/drivers/vhost/net.c +++ b/drivers/vhost/net.c @@ -1325,7 +1325,7 @@ static int vhost_net_open(struct inode *inode, struct file *f) vqs[VHOST_NET_VQ_RX]); f->private_data = n; - n->pf_cache.va = NULL; + page_frag_cache_init(&n->pf_cache); return 0; } diff --git a/include/linux/page_frag_cache.h b/include/linux/page_frag_cache.h index 67ac8626ed9b..0a52f7a179c8 100644 --- a/include/linux/page_frag_cache.h +++ b/include/linux/page_frag_cache.h @@ -7,6 +7,16 @@ #include <linux/mm_types_task.h> #include <linux/types.h> +static inline void page_frag_cache_init(struct page_frag_cache *nc) +{ + nc->va = NULL; +} + +static inline bool page_frag_cache_is_pfmemalloc(struct page_frag_cache *nc) +{ + return !!nc->pfmemalloc; +} + void page_frag_cache_drain(struct page_frag_cache *nc); void __page_frag_cache_drain(struct page *page, unsigned int count); void *__page_frag_alloc_align(struct page_frag_cache *nc, unsigned int fragsz, diff --git a/net/core/skbuff.c b/net/core/skbuff.c index a52638363ea5..a5f8e4e0c649 100644 --- a/net/core/skbuff.c +++ b/net/core/skbuff.c @@ -752,14 +752,14 @@ struct sk_buff *__netdev_alloc_skb(struct net_device *dev, unsigned int len, if (in_hardirq() || irqs_disabled()) { nc = this_cpu_ptr(&netdev_alloc_cache); data = page_frag_alloc(nc, len,
gfp_mask); - pfmemalloc = nc->pfmemalloc; + pfmemalloc = page_frag_cache_is_pfmemalloc(nc); } else { local_bh_disable(); local_lock_nested_bh(&napi_alloc_cache.bh_lock); nc = this_cpu_ptr(&napi_alloc_cache.page); data = page_frag_alloc(nc, len, gfp_mask); - pfmemalloc = nc->pfmemalloc; + pfmemalloc = page_frag_cache_is_pfmemalloc(nc); local_unlock_nested_bh(&napi_alloc_cache.bh_lock); local_bh_enable(); @@ -849,7 +849,7 @@ struct sk_buff *napi_alloc_skb(struct napi_struct *napi, unsigned int len) len = SKB_HEAD_ALIGN(len); data = page_frag_alloc(&nc->page, len, gfp_mask); - pfmemalloc = nc->page.pfmemalloc; + pfmemalloc = page_frag_cache_is_pfmemalloc(&nc->page); } local_unlock_nested_bh(&napi_alloc_cache.bh_lock); diff --git a/net/rxrpc/conn_object.c b/net/rxrpc/conn_object.c index 1539d315afe7..694c4df7a1a3 100644 --- a/net/rxrpc/conn_object.c +++ b/net/rxrpc/conn_object.c @@ -337,9 +337,7 @@ static void rxrpc_clean_up_connection(struct work_struct *work) */ rxrpc_purge_queue(&conn->rx_queue); - if (conn->tx_data_alloc.va) - __page_frag_cache_drain(virt_to_page(conn->tx_data_alloc.va), - conn->tx_data_alloc.pagecnt_bias); + page_frag_cache_drain(&conn->tx_data_alloc); call_rcu(&conn->rcu, rxrpc_rcu_free_connection); } diff --git a/net/rxrpc/local_object.c b/net/rxrpc/local_object.c index 504453c688d7..a8cffe47cf01 100644 --- a/net/rxrpc/local_object.c +++ b/net/rxrpc/local_object.c @@ -452,9 +452,7 @@ void rxrpc_destroy_local(struct rxrpc_local *local) #endif rxrpc_purge_queue(&local->rx_queue); rxrpc_purge_client_connections(local); - if (local->tx_alloc.va) - __page_frag_cache_drain(virt_to_page(local->tx_alloc.va), - local->tx_alloc.pagecnt_bias); + page_frag_cache_drain(&local->tx_alloc); } /* diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c index 6b3f01beb294..dcfd84cf0694 100644 --- a/net/sunrpc/svcsock.c +++ b/net/sunrpc/svcsock.c @@ -1609,7 +1609,6 @@ static void svc_tcp_sock_detach(struct svc_xprt *xprt) static void svc_sock_free(struct svc_xprt 
*xprt) { struct svc_sock *svsk = container_of(xprt, struct svc_sock, sk_xprt); - struct page_frag_cache *pfc = &svsk->sk_frag_cache; struct socket *sock = svsk->sk_sock; trace_svcsock_free(svsk, sock); @@ -1619,8 +1618,7 @@ static void svc_sock_free(struct svc_xprt *xprt) sockfd_put(sock); else sock_release(sock); - if (pfc->va) - __page_frag_cache_drain(virt_to_head_page(pfc->va), - pfc->pagecnt_bias); + + page_frag_cache_drain(&svsk->sk_frag_cache); kfree(svsk); } diff --git a/tools/testing/selftests/mm/page_frag/page_frag_test.c b/tools/testing/selftests/mm/page_frag/page_frag_test.c index 72a3861c2de1..e21a22b1d70b 100644 --- a/tools/testing/selftests/mm/page_frag/page_frag_test.c +++ b/tools/testing/selftests/mm/page_frag/page_frag_test.c @@ -117,7 +117,7 @@ static int __init page_frag_test_init(void) u64 duration; int ret; - test_nc.va = NULL; + page_frag_cache_init(&test_nc); atomic_set(&nthreads, 2); init_completion(&wait); -- 2.33.0

10 months, 1 week


[PATCH net-next v16 02/14] mm: move the page fragment allocator from page_alloc into its own file

by Yunsheng Lin

Inspired by [1], move the page fragment allocator from page_alloc into its own c file and header file, as we are about to make more change for it to replace another page_frag implementation in sock.c As this patchset is going to replace 'struct page_frag' with 'struct page_frag_cache' in sched.h, including page_frag_cache.h in sched.h has a compiler error caused by interdependence between mm_types.h and mm.h for asm-offsets.c, see [2]. So avoid the compiler error by moving 'struct page_frag_cache' to mm_types_task.h as suggested by Alexander, see [3]. 1. https://lore.kernel.org/all/20230411160902.4134381-3-dhowells@redhat.com/ 2. https://lore.kernel.org/all/15623dac-9358-4597-b3ee-3694a5956920@gmail.com/ 3. https://lore.kernel.org/all/CAKgT0UdH1yD=LSCXFJ=YM_aiA4OomD-2wXykO42bizaWMt… CC: David Howells <dhowells(a)redhat.com> CC: Alexander Duyck <alexander.duyck(a)gmail.com> Signed-off-by: Yunsheng Lin <linyunsheng(a)huawei.com> Acked-by: Andrew Morton <akpm(a)linux-foundation.org> --- include/linux/gfp.h | 22 --- include/linux/mm_types.h | 18 --- include/linux/mm_types_task.h | 18 +++ include/linux/page_frag_cache.h | 31 ++++ include/linux/skbuff.h | 1 + mm/Makefile | 1 + mm/page_alloc.c | 136 ---------------- mm/page_frag_cache.c | 145 ++++++++++++++++++ .../selftests/mm/page_frag/page_frag_test.c | 2 +- 9 files changed, 197 insertions(+), 177 deletions(-) create mode 100644 include/linux/page_frag_cache.h create mode 100644 mm/page_frag_cache.c diff --git a/include/linux/gfp.h b/include/linux/gfp.h index f53f76e0b17e..01a49be7c98d 100644 --- a/include/linux/gfp.h +++ b/include/linux/gfp.h @@ -371,28 +371,6 @@ __meminit void *alloc_pages_exact_nid_noprof(int nid, size_t size, gfp_t gfp_mas extern void __free_pages(struct page *page, unsigned int order); extern void free_pages(unsigned long addr, unsigned int order); -struct page_frag_cache; -void page_frag_cache_drain(struct page_frag_cache *nc); -extern void __page_frag_cache_drain(struct page *page, unsigned int 
count); -void *__page_frag_alloc_align(struct page_frag_cache *nc, unsigned int fragsz, - gfp_t gfp_mask, unsigned int align_mask); - -static inline void *page_frag_alloc_align(struct page_frag_cache *nc, - unsigned int fragsz, gfp_t gfp_mask, - unsigned int align) -{ - WARN_ON_ONCE(!is_power_of_2(align)); - return __page_frag_alloc_align(nc, fragsz, gfp_mask, -align); -} - -static inline void *page_frag_alloc(struct page_frag_cache *nc, - unsigned int fragsz, gfp_t gfp_mask) -{ - return __page_frag_alloc_align(nc, fragsz, gfp_mask, ~0u); -} - -extern void page_frag_free(void *addr); - #define __free_page(page) __free_pages((page), 0) #define free_page(addr) free_pages((addr), 0) diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h index 485424979254..843d75412105 100644 --- a/include/linux/mm_types.h +++ b/include/linux/mm_types.h @@ -521,9 +521,6 @@ static_assert(sizeof(struct ptdesc) <= sizeof(struct page)); */ #define STRUCT_PAGE_MAX_SHIFT (order_base_2(sizeof(struct page))) -#define PAGE_FRAG_CACHE_MAX_SIZE __ALIGN_MASK(32768, ~PAGE_MASK) -#define PAGE_FRAG_CACHE_MAX_ORDER get_order(PAGE_FRAG_CACHE_MAX_SIZE) - /* * page_private can be used on tail pages. However, PagePrivate is only * checked by the VM on the head page. So page_private on the tail pages @@ -542,21 +539,6 @@ static inline void *folio_get_private(struct folio *folio) return folio->private; } -struct page_frag_cache { - void * va; -#if (PAGE_SIZE < PAGE_FRAG_CACHE_MAX_SIZE) - __u16 offset; - __u16 size; -#else - __u32 offset; -#endif - /* we maintain a pagecount bias, so that we dont dirty cache line - * containing page->_refcount every time we allocate a fragment. 
- */ - unsigned int pagecnt_bias; - bool pfmemalloc; -}; - typedef unsigned long vm_flags_t; /* diff --git a/include/linux/mm_types_task.h b/include/linux/mm_types_task.h index a2f6179b672b..cdc1e3696439 100644 --- a/include/linux/mm_types_task.h +++ b/include/linux/mm_types_task.h @@ -8,6 +8,7 @@ * (These are defined separately to decouple sched.h from mm_types.h as much as possible.) */ +#include <linux/align.h> #include <linux/types.h> #include <asm/page.h> @@ -46,6 +47,23 @@ struct page_frag { #endif }; +#define PAGE_FRAG_CACHE_MAX_SIZE __ALIGN_MASK(32768, ~PAGE_MASK) +#define PAGE_FRAG_CACHE_MAX_ORDER get_order(PAGE_FRAG_CACHE_MAX_SIZE) +struct page_frag_cache { + void *va; +#if (PAGE_SIZE < PAGE_FRAG_CACHE_MAX_SIZE) + __u16 offset; + __u16 size; +#else + __u32 offset; +#endif + /* we maintain a pagecount bias, so that we dont dirty cache line + * containing page->_refcount every time we allocate a fragment. + */ + unsigned int pagecnt_bias; + bool pfmemalloc; +}; + /* Track pages that require TLB flushes */ struct tlbflush_unmap_batch { #ifdef CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH diff --git a/include/linux/page_frag_cache.h b/include/linux/page_frag_cache.h new file mode 100644 index 000000000000..67ac8626ed9b --- /dev/null +++ b/include/linux/page_frag_cache.h @@ -0,0 +1,31 @@ +/* SPDX-License-Identifier: GPL-2.0 */ + +#ifndef _LINUX_PAGE_FRAG_CACHE_H +#define _LINUX_PAGE_FRAG_CACHE_H + +#include <linux/log2.h> +#include <linux/mm_types_task.h> +#include <linux/types.h> + +void page_frag_cache_drain(struct page_frag_cache *nc); +void __page_frag_cache_drain(struct page *page, unsigned int count); +void *__page_frag_alloc_align(struct page_frag_cache *nc, unsigned int fragsz, + gfp_t gfp_mask, unsigned int align_mask); + +static inline void *page_frag_alloc_align(struct page_frag_cache *nc, + unsigned int fragsz, gfp_t gfp_mask, + unsigned int align) +{ + WARN_ON_ONCE(!is_power_of_2(align)); + return __page_frag_alloc_align(nc, fragsz, gfp_mask, -align); 
+} + +static inline void *page_frag_alloc(struct page_frag_cache *nc, + unsigned int fragsz, gfp_t gfp_mask) +{ + return __page_frag_alloc_align(nc, fragsz, gfp_mask, ~0u); +} + +void page_frag_free(void *addr); + +#endif diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h index cf8f6ce06742..7482997c719f 100644 --- a/include/linux/skbuff.h +++ b/include/linux/skbuff.h @@ -31,6 +31,7 @@ #include <linux/in6.h> #include <linux/if_packet.h> #include <linux/llist.h> +#include <linux/page_frag_cache.h> #include <net/flow.h> #if IS_ENABLED(CONFIG_NF_CONNTRACK) #include <linux/netfilter/nf_conntrack_common.h> diff --git a/mm/Makefile b/mm/Makefile index d2915f8c9dc0..e9d342fa8058 100644 --- a/mm/Makefile +++ b/mm/Makefile @@ -65,6 +65,7 @@ page-alloc-$(CONFIG_SHUFFLE_PAGE_ALLOCATOR) += shuffle.o memory-hotplug-$(CONFIG_MEMORY_HOTPLUG) += memory_hotplug.o obj-y += page-alloc.o +obj-y += page_frag_cache.o obj-y += init-mm.o obj-y += memblock.o obj-y += $(memory-hotplug-y) diff --git a/mm/page_alloc.c b/mm/page_alloc.c index c565de8f48e9..d0e88aa6eb0d 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -4798,142 +4798,6 @@ void free_pages(unsigned long addr, unsigned int order) EXPORT_SYMBOL(free_pages); -/* - * Page Fragment: - * An arbitrary-length arbitrary-offset area of memory which resides - * within a 0 or higher order page. Multiple fragments within that page - * are individually refcounted, in the page's reference counter. - * - * The page_frag functions below provide a simple allocation framework for - * page fragments. This is used by the network stack and network device - * drivers to provide a backing region of memory for use as either an - * sk_buff->head, or to be used in the "frags" portion of skb_shared_info. 
- */ -static struct page *__page_frag_cache_refill(struct page_frag_cache *nc, - gfp_t gfp_mask) -{ - struct page *page = NULL; - gfp_t gfp = gfp_mask; - -#if (PAGE_SIZE < PAGE_FRAG_CACHE_MAX_SIZE) - gfp_mask = (gfp_mask & ~__GFP_DIRECT_RECLAIM) | __GFP_COMP | - __GFP_NOWARN | __GFP_NORETRY | __GFP_NOMEMALLOC; - page = alloc_pages_node(NUMA_NO_NODE, gfp_mask, - PAGE_FRAG_CACHE_MAX_ORDER); - nc->size = page ? PAGE_FRAG_CACHE_MAX_SIZE : PAGE_SIZE; -#endif - if (unlikely(!page)) - page = alloc_pages_node(NUMA_NO_NODE, gfp, 0); - - nc->va = page ? page_address(page) : NULL; - - return page; -} - -void page_frag_cache_drain(struct page_frag_cache *nc) -{ - if (!nc->va) - return; - - __page_frag_cache_drain(virt_to_head_page(nc->va), nc->pagecnt_bias); - nc->va = NULL; -} -EXPORT_SYMBOL(page_frag_cache_drain); - -void __page_frag_cache_drain(struct page *page, unsigned int count) -{ - VM_BUG_ON_PAGE(page_ref_count(page) == 0, page); - - if (page_ref_sub_and_test(page, count)) - free_unref_page(page, compound_order(page)); -} -EXPORT_SYMBOL(__page_frag_cache_drain); - -void *__page_frag_alloc_align(struct page_frag_cache *nc, - unsigned int fragsz, gfp_t gfp_mask, - unsigned int align_mask) -{ - unsigned int size = PAGE_SIZE; - struct page *page; - int offset; - - if (unlikely(!nc->va)) { -refill: - page = __page_frag_cache_refill(nc, gfp_mask); - if (!page) - return NULL; - -#if (PAGE_SIZE < PAGE_FRAG_CACHE_MAX_SIZE) - /* if size can vary use size else just use PAGE_SIZE */ - size = nc->size; -#endif - /* Even if we own the page, we do not use atomic_set(). - * This would break get_page_unless_zero() users. 
- */ - page_ref_add(page, PAGE_FRAG_CACHE_MAX_SIZE); - - /* reset page count bias and offset to start of new frag */ - nc->pfmemalloc = page_is_pfmemalloc(page); - nc->pagecnt_bias = PAGE_FRAG_CACHE_MAX_SIZE + 1; - nc->offset = size; - } - - offset = nc->offset - fragsz; - if (unlikely(offset < 0)) { - page = virt_to_page(nc->va); - - if (!page_ref_sub_and_test(page, nc->pagecnt_bias)) - goto refill; - - if (unlikely(nc->pfmemalloc)) { - free_unref_page(page, compound_order(page)); - goto refill; - } - -#if (PAGE_SIZE < PAGE_FRAG_CACHE_MAX_SIZE) - /* if size can vary use size else just use PAGE_SIZE */ - size = nc->size; -#endif - /* OK, page count is 0, we can safely set it */ - set_page_count(page, PAGE_FRAG_CACHE_MAX_SIZE + 1); - - /* reset page count bias and offset to start of new frag */ - nc->pagecnt_bias = PAGE_FRAG_CACHE_MAX_SIZE + 1; - offset = size - fragsz; - if (unlikely(offset < 0)) { - /* - * The caller is trying to allocate a fragment - * with fragsz > PAGE_SIZE but the cache isn't big - * enough to satisfy the request, this may - * happen in low memory conditions. - * We don't release the cache page because - * it could make memory pressure worse - * so we simply return NULL here. - */ - return NULL; - } - } - - nc->pagecnt_bias--; - offset &= align_mask; - nc->offset = offset; - - return nc->va + offset; -} -EXPORT_SYMBOL(__page_frag_alloc_align); - -/* - * Frees a page fragment allocated out of either a compound or order 0 page. 
- */ -void page_frag_free(void *addr) -{ - struct page *page = virt_to_head_page(addr); - - if (unlikely(put_page_testzero(page))) - free_unref_page(page, compound_order(page)); -} -EXPORT_SYMBOL(page_frag_free); - static void *make_alloc_exact(unsigned long addr, unsigned int order, size_t size) { diff --git a/mm/page_frag_cache.c b/mm/page_frag_cache.c new file mode 100644 index 000000000000..609a485cd02a --- /dev/null +++ b/mm/page_frag_cache.c @@ -0,0 +1,145 @@ +// SPDX-License-Identifier: GPL-2.0-only +/* Page fragment allocator + * + * Page Fragment: + * An arbitrary-length arbitrary-offset area of memory which resides within a + * 0 or higher order page. Multiple fragments within that page are + * individually refcounted, in the page's reference counter. + * + * The page_frag functions provide a simple allocation framework for page + * fragments. This is used by the network stack and network device drivers to + * provide a backing region of memory for use as either an sk_buff->head, or to + * be used in the "frags" portion of skb_shared_info. + */ + +#include <linux/export.h> +#include <linux/gfp_types.h> +#include <linux/init.h> +#include <linux/mm.h> +#include <linux/page_frag_cache.h> +#include "internal.h" + +static struct page *__page_frag_cache_refill(struct page_frag_cache *nc, + gfp_t gfp_mask) +{ + struct page *page = NULL; + gfp_t gfp = gfp_mask; + +#if (PAGE_SIZE < PAGE_FRAG_CACHE_MAX_SIZE) + gfp_mask = (gfp_mask & ~__GFP_DIRECT_RECLAIM) | __GFP_COMP | + __GFP_NOWARN | __GFP_NORETRY | __GFP_NOMEMALLOC; + page = alloc_pages_node(NUMA_NO_NODE, gfp_mask, + PAGE_FRAG_CACHE_MAX_ORDER); + nc->size = page ? PAGE_FRAG_CACHE_MAX_SIZE : PAGE_SIZE; +#endif + if (unlikely(!page)) + page = alloc_pages_node(NUMA_NO_NODE, gfp, 0); + + nc->va = page ? 
page_address(page) : NULL; + + return page; +} + +void page_frag_cache_drain(struct page_frag_cache *nc) +{ + if (!nc->va) + return; + + __page_frag_cache_drain(virt_to_head_page(nc->va), nc->pagecnt_bias); + nc->va = NULL; +} +EXPORT_SYMBOL(page_frag_cache_drain); + +void __page_frag_cache_drain(struct page *page, unsigned int count) +{ + VM_BUG_ON_PAGE(page_ref_count(page) == 0, page); + + if (page_ref_sub_and_test(page, count)) + free_unref_page(page, compound_order(page)); +} +EXPORT_SYMBOL(__page_frag_cache_drain); + +void *__page_frag_alloc_align(struct page_frag_cache *nc, + unsigned int fragsz, gfp_t gfp_mask, + unsigned int align_mask) +{ + unsigned int size = PAGE_SIZE; + struct page *page; + int offset; + + if (unlikely(!nc->va)) { +refill: + page = __page_frag_cache_refill(nc, gfp_mask); + if (!page) + return NULL; + +#if (PAGE_SIZE < PAGE_FRAG_CACHE_MAX_SIZE) + /* if size can vary use size else just use PAGE_SIZE */ + size = nc->size; +#endif + /* Even if we own the page, we do not use atomic_set(). + * This would break get_page_unless_zero() users. 
+ */ + page_ref_add(page, PAGE_FRAG_CACHE_MAX_SIZE); + + /* reset page count bias and offset to start of new frag */ + nc->pfmemalloc = page_is_pfmemalloc(page); + nc->pagecnt_bias = PAGE_FRAG_CACHE_MAX_SIZE + 1; + nc->offset = size; + } + + offset = nc->offset - fragsz; + if (unlikely(offset < 0)) { + page = virt_to_page(nc->va); + + if (!page_ref_sub_and_test(page, nc->pagecnt_bias)) + goto refill; + + if (unlikely(nc->pfmemalloc)) { + free_unref_page(page, compound_order(page)); + goto refill; + } + +#if (PAGE_SIZE < PAGE_FRAG_CACHE_MAX_SIZE) + /* if size can vary use size else just use PAGE_SIZE */ + size = nc->size; +#endif + /* OK, page count is 0, we can safely set it */ + set_page_count(page, PAGE_FRAG_CACHE_MAX_SIZE + 1); + + /* reset page count bias and offset to start of new frag */ + nc->pagecnt_bias = PAGE_FRAG_CACHE_MAX_SIZE + 1; + offset = size - fragsz; + if (unlikely(offset < 0)) { + /* + * The caller is trying to allocate a fragment + * with fragsz > PAGE_SIZE but the cache isn't big + * enough to satisfy the request, this may + * happen in low memory conditions. + * We don't release the cache page because + * it could make memory pressure worse + * so we simply return NULL here. + */ + return NULL; + } + } + + nc->pagecnt_bias--; + offset &= align_mask; + nc->offset = offset; + + return nc->va + offset; +} +EXPORT_SYMBOL(__page_frag_alloc_align); + +/* + * Frees a page fragment allocated out of either a compound or order 0 page. 
+ */ +void page_frag_free(void *addr) +{ + struct page *page = virt_to_head_page(addr); + + if (unlikely(put_page_testzero(page))) + free_unref_page(page, compound_order(page)); +} +EXPORT_SYMBOL(page_frag_free); diff --git a/tools/testing/selftests/mm/page_frag/page_frag_test.c b/tools/testing/selftests/mm/page_frag/page_frag_test.c index 1c9070423420..72a3861c2de1 100644 --- a/tools/testing/selftests/mm/page_frag/page_frag_test.c +++ b/tools/testing/selftests/mm/page_frag/page_frag_test.c @@ -6,12 +6,12 @@ * Copyright (C) 2024 Yunsheng Lin <linyunsheng(a)huawei.com> */ -#include <linux/mm.h> #include <linux/module.h> #include <linux/cpumask.h> #include <linux/completion.h> #include <linux/ptr_ring.h> #include <linux/kthread.h> +#include <linux/page_frag_cache.h> static struct ptr_ring ptr_ring; static int nr_objs = 512; -- 2.33.0
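The offset handling in `__page_frag_alloc_align()` is easy to miss in the flattened diff: the cache offset starts at the end of the page and counts down, and alignment is applied by masking the new offset with `-align` (or with `~0u` when no alignment is requested). The sketch below isolates just that arithmetic in userspace. The helper names `mask_for_align` and `frag_next_offset` are made up for illustration, and refill, refcounting and the pfmemalloc path are deliberately left out.

```c
/* Power-of-two 'align' to mask, as page_frag_alloc_align() does with
 * -align: for align == 64 this yields ~63u. */
static unsigned int mask_for_align(unsigned int align)
{
	return -align;
}

/*
 * Hypothetical helper mirroring only the offset arithmetic:
 * carve 'fragsz' bytes off the top of the remaining space, then round
 * the start down to the requested alignment. Returns -1 where the
 * kernel code would go on to refill or reuse the page.
 */
static int frag_next_offset(unsigned int cur_offset, unsigned int fragsz,
			    unsigned int align_mask)
{
	int offset = (int)cur_offset - (int)fragsz;

	if (offset < 0)
		return -1;

	return (int)((unsigned int)offset & align_mask);
}
```

Worked through for a fresh 4096-byte page and a 100-byte fragment: 4096 - 100 = 3996, and 3996 & ~63 = 3968, so a 64-byte aligned request starts at offset 3968, while an unaligned one (mask `~0u`) starts at 3996.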

10 months, 1 week


[PATCH net-next v16 01/14] mm: page_frag: add a test module for page_frag

by Yunsheng Lin

The test allocates fragments from a page_frag_cache instance and pushes them into a ptr_ring instance from a kthread bound to a specified CPU, while a second kthread bound to a specified CPU pops the fragments from the ptr_ring and frees them. CC: Alexander Duyck <alexander.duyck(a)gmail.com> Signed-off-by: Yunsheng Lin <linyunsheng(a)huawei.com> --- tools/testing/selftests/mm/Makefile | 2 + tools/testing/selftests/mm/page_frag/Makefile | 18 ++ .../selftests/mm/page_frag/page_frag_test.c | 170 ++++++++++++++++++ tools/testing/selftests/mm/run_vmtests.sh | 8 + tools/testing/selftests/mm/test_page_frag.sh | 167 +++++++++++++++++ 5 files changed, 365 insertions(+) create mode 100644 tools/testing/selftests/mm/page_frag/Makefile create mode 100644 tools/testing/selftests/mm/page_frag/page_frag_test.c create mode 100755 tools/testing/selftests/mm/test_page_frag.sh diff --git a/tools/testing/selftests/mm/Makefile b/tools/testing/selftests/mm/Makefile index cfad627e8d94..ed196901b9ca 100644 --- a/tools/testing/selftests/mm/Makefile +++ b/tools/testing/selftests/mm/Makefile @@ -36,6 +36,8 @@ MAKEFLAGS += --no-builtin-rules CFLAGS = -Wall -I $(top_srcdir) $(EXTRA_CFLAGS) $(KHDR_INCLUDES) $(TOOLS_INCLUDES) LDLIBS = -lrt -lpthread -lm +TEST_GEN_MODS_DIR := page_frag + TEST_GEN_FILES = cow TEST_GEN_FILES += compaction_test TEST_GEN_FILES += gup_longterm diff --git a/tools/testing/selftests/mm/page_frag/Makefile b/tools/testing/selftests/mm/page_frag/Makefile new file mode 100644 index 000000000000..58dda74d50a3 --- /dev/null +++ b/tools/testing/selftests/mm/page_frag/Makefile @@ -0,0 +1,18 @@ +PAGE_FRAG_TEST_DIR := $(realpath $(dir $(abspath $(lastword $(MAKEFILE_LIST))))) +KDIR ?= $(abspath $(PAGE_FRAG_TEST_DIR)/../../../../..)
+ +ifeq ($(V),1) +Q = +else +Q = @ +endif + +MODULES = page_frag_test.ko + +obj-m += page_frag_test.o + +all: + +$(Q)make -C $(KDIR) M=$(PAGE_FRAG_TEST_DIR) modules + +clean: + +$(Q)make -C $(KDIR) M=$(PAGE_FRAG_TEST_DIR) clean diff --git a/tools/testing/selftests/mm/page_frag/page_frag_test.c b/tools/testing/selftests/mm/page_frag/page_frag_test.c new file mode 100644 index 000000000000..1c9070423420 --- /dev/null +++ b/tools/testing/selftests/mm/page_frag/page_frag_test.c @@ -0,0 +1,170 @@ +// SPDX-License-Identifier: GPL-2.0 + +/* + * Test module for page_frag cache + * + * Copyright (C) 2024 Yunsheng Lin <linyunsheng(a)huawei.com> + */ + +#include <linux/mm.h> +#include <linux/module.h> +#include <linux/cpumask.h> +#include <linux/completion.h> +#include <linux/ptr_ring.h> +#include <linux/kthread.h> + +static struct ptr_ring ptr_ring; +static int nr_objs = 512; +static atomic_t nthreads; +static struct completion wait; +static struct page_frag_cache test_nc; + +static int nr_test = 5120000; +module_param(nr_test, int, 0); +MODULE_PARM_DESC(nr_test, "number of iterations to test"); + +static bool test_align; +module_param(test_align, bool, 0); +MODULE_PARM_DESC(test_align, "use align API for testing"); + +static int test_alloc_len = 2048; +module_param(test_alloc_len, int, 0); +MODULE_PARM_DESC(test_alloc_len, "alloc len for testing"); + +static int test_push_cpu; +module_param(test_push_cpu, int, 0); +MODULE_PARM_DESC(test_push_cpu, "test cpu for pushing fragment"); + +static int test_pop_cpu; +module_param(test_pop_cpu, int, 0); +MODULE_PARM_DESC(test_pop_cpu, "test cpu for popping fragment"); + +static int page_frag_pop_thread(void *arg) +{ + struct ptr_ring *ring = arg; + int nr = nr_test; + + pr_info("page_frag pop test thread begins on cpu %d\n", + smp_processor_id()); + + while (nr > 0) { + void *obj = __ptr_ring_consume(ring); + + if (obj) { + nr--; + page_frag_free(obj); + } else { + cond_resched(); + } + } + + if (atomic_dec_and_test(&nthreads)) + 
complete(&wait); + + pr_info("page_frag pop test thread exits on cpu %d\n", + smp_processor_id()); + + return 0; +} + +static int page_frag_push_thread(void *arg) +{ + struct ptr_ring *ring = arg; + int nr = nr_test; + + pr_info("page_frag push test thread begins on cpu %d\n", + smp_processor_id()); + + while (nr > 0) { + void *va; + int ret; + + if (test_align) { + va = page_frag_alloc_align(&test_nc, test_alloc_len, + GFP_KERNEL, SMP_CACHE_BYTES); + + WARN_ONCE((unsigned long)va & (SMP_CACHE_BYTES - 1), + "unaligned va returned\n"); + } else { + va = page_frag_alloc(&test_nc, test_alloc_len, GFP_KERNEL); + } + + if (!va) + continue; + + ret = __ptr_ring_produce(ring, va); + if (ret) { + page_frag_free(va); + cond_resched(); + } else { + nr--; + } + } + + pr_info("page_frag push test thread exits on cpu %d\n", + smp_processor_id()); + + if (atomic_dec_and_test(&nthreads)) + complete(&wait); + + return 0; +} + +static int __init page_frag_test_init(void) +{ + struct task_struct *tsk_push, *tsk_pop; + ktime_t start; + u64 duration; + int ret; + + test_nc.va = NULL; + atomic_set(&nthreads, 2); + init_completion(&wait); + + if (test_alloc_len > PAGE_SIZE || test_alloc_len <= 0 || + !cpu_active(test_push_cpu) || !cpu_active(test_pop_cpu)) + return -EINVAL; + + ret = ptr_ring_init(&ptr_ring, nr_objs, GFP_KERNEL); + if (ret) + return ret; + + tsk_push = kthread_create_on_cpu(page_frag_push_thread, &ptr_ring, + test_push_cpu, "page_frag_push"); + if (IS_ERR(tsk_push)) + return PTR_ERR(tsk_push); + + tsk_pop = kthread_create_on_cpu(page_frag_pop_thread, &ptr_ring, + test_pop_cpu, "page_frag_pop"); + if (IS_ERR(tsk_pop)) { + kthread_stop(tsk_push); + return PTR_ERR(tsk_pop); + } + + start = ktime_get(); + wake_up_process(tsk_push); + wake_up_process(tsk_pop); + + pr_info("waiting for test to complete\n"); + wait_for_completion(&wait); + + duration = (u64)ktime_us_delta(ktime_get(), start); + pr_info("%d of iterations for %s testing took: %lluus\n", nr_test, + test_align ? 
"aligned" : "non-aligned", duration); + + ptr_ring_cleanup(&ptr_ring, NULL); + page_frag_cache_drain(&test_nc); + + return -EAGAIN; +} + +static void __exit page_frag_test_exit(void) +{ +} + +module_init(page_frag_test_init); +module_exit(page_frag_test_exit); + +MODULE_LICENSE("GPL"); +MODULE_AUTHOR("Yunsheng Lin <linyunsheng(a)huawei.com>"); +MODULE_DESCRIPTION("Test module for page_frag"); diff --git a/tools/testing/selftests/mm/run_vmtests.sh b/tools/testing/selftests/mm/run_vmtests.sh index 36045edb10de..96fd470b9f51 100755 --- a/tools/testing/selftests/mm/run_vmtests.sh +++ b/tools/testing/selftests/mm/run_vmtests.sh @@ -75,6 +75,8 @@ separated by spaces: read-only VMAs - mdwe test prctl(PR_SET_MDWE, ...) +- page_frag + test handling of page fragment allocation and freeing example: ./run_vmtests.sh -t "hmm mmap ksm" EOF @@ -456,6 +458,12 @@ CATEGORY="mkdirty" run_test ./mkdirty CATEGORY="mdwe" run_test ./mdwe_test +CATEGORY="page_frag" run_test ./test_page_frag.sh smoke + +CATEGORY="page_frag" run_test ./test_page_frag.sh aligned + +CATEGORY="page_frag" run_test ./test_page_frag.sh nonaligned + echo "SUMMARY: PASS=${count_pass} SKIP=${count_skip} FAIL=${count_fail}" | tap_prefix echo "1..${count_total}" | tap_output diff --git a/tools/testing/selftests/mm/test_page_frag.sh b/tools/testing/selftests/mm/test_page_frag.sh new file mode 100755 index 000000000000..aad55e0ca2cd --- /dev/null +++ b/tools/testing/selftests/mm/test_page_frag.sh @@ -0,0 +1,167 @@ +#!/bin/bash +# SPDX-License-Identifier: GPL-2.0 +# +# Copyright (C) 2024 Yunsheng Lin <linyunsheng(a)huawei.com> +# Copyright (C) 2018 Uladzislau Rezki (Sony) <urezki(a)gmail.com> +# +# This is a test script for the kernel test driver to test the +# correctness and performance of page_frag's implementation. +# Therefore it is just a kernel module loader. 
You can specify +# and pass different parameters in order to: +# a) analyse performance of page fragment allocations; +# b) stressing and stability check of page_frag subsystem. + +DRIVER="./page_frag/page_frag_test.ko" +NUM_CPUS=`grep -c ^processor /proc/cpuinfo` +TEST_CPU_0=0 +if [ $NUM_CPUS -gt 1 ]; then + TEST_CPU_1=1 +else + TEST_CPU_1=0 +fi + +# 1 if fails +exitcode=1 + +# Kselftest framework requirement - SKIP code is 4. +ksft_skip=4 + +# +# Static templates for testing of page_frag APIs. +# Also it is possible to pass any supported parameters manually. +# +NONALIGNED_PARAM="test_push_cpu=$TEST_CPU_0 test_pop_cpu=$TEST_CPU_1 test_alloc_len=12 nr_test=512000000" +ALIGNED_PARAM="$NONALIGNED_PARAM test_align=1" + +check_test_requirements() +{ + uid=$(id -u) + if [ $uid -ne 0 ]; then + echo "$0: Must be run as root" + exit $ksft_skip + fi + + if ! which insmod > /dev/null 2>&1; then + echo "$0: You need insmod installed" + exit $ksft_skip + fi + + if [ ! -f $DRIVER ]; then + echo "$0: You need to compile page_frag_test module" + exit $ksft_skip + fi +} + +run_nonaligned_check() +{ + echo "Run performance tests to evaluate how fast nonaligned alloc API is." + + insmod $DRIVER $NONALIGNED_PARAM > /dev/null 2>&1 + echo "Done." + echo "Ccheck the kernel ring buffer to see the summary." +} + +run_aligned_check() +{ + echo "Run performance tests to evaluate how fast aligned alloc API is." + + insmod $DRIVER $ALIGNED_PARAM > /dev/null 2>&1 + echo "Done." + echo "Check the kernel ring buffer to see the summary." +} + +run_smoke_check() +{ + echo "Run smoke test." + + insmod $DRIVER > /dev/null 2>&1 + echo "Done." + echo "Check the kernel ring buffer to see the summary." 
+} + +usage() +{ + echo -n "Usage: $0 [ aligned ] | [ nonaligned ] | | [ smoke ] | " + echo "manual parameters" + echo + echo "Valid tests and parameters:" + echo + modinfo $DRIVER + echo + echo "Example usage:" + echo + echo "# Shows help message" + echo "$0" + echo + echo "# Smoke testing" + echo "$0 smoke" + echo + echo "# Performance testing for nonaligned alloc API" + echo "$0 nonaligned" + echo + echo "# Performance testing for aligned alloc API" + echo "$0 aligned" + echo + exit 0 +} + +function validate_passed_args() +{ + VALID_ARGS=`modinfo $DRIVER | awk '/parm:/ {print $2}' | sed 's/:.*//'` + + # + # Something has been passed, check it. + # + for passed_arg in $@; do + key=${passed_arg//=*/} + valid=0 + + for valid_arg in $VALID_ARGS; do + if [[ $key = $valid_arg ]]; then + valid=1 + break + fi + done + + if [[ $valid -ne 1 ]]; then + echo "Error: key is not correct: ${key}" + exit $exitcode + fi + done +} + +function run_manual_check() +{ + # + # Validate passed parameters. If there is wrong one, + # the script exists and does not execute further. + # + validate_passed_args $@ + + echo "Run the test with following parameters: $@" + insmod $DRIVER $@ > /dev/null 2>&1 + echo "Done." + echo "Check the kernel ring buffer to see the summary." +} + +function run_test() +{ + if [ $# -eq 0 ]; then + usage + else + if [[ "$1" = "smoke" ]]; then + run_smoke_check + elif [[ "$1" = "nonaligned" ]]; then + run_nonaligned_check + elif [[ "$1" = "aligned" ]]; then + run_aligned_check + else + run_manual_check $@ + fi + fi +} + +check_test_requirements +run_test $@ + +exit 0 -- 2.33.0
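The aligned path of the test module above warns when the returned address has any low bits set, via `(unsigned long)va & (SMP_CACHE_BYTES - 1)`. A minimal user-space sketch of the same power-of-two mask check follows. `is_aligned` and `CACHE_LINE` (assumed to be 64 here) are hypothetical names chosen for illustration; they are not kernel API and are not part of the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed cache-line size for illustration; the module uses SMP_CACHE_BYTES. */
#define CACHE_LINE 64u

/*
 * Return true when addr is a multiple of align.
 * align must be a non-zero power of two, so (align - 1) is a mask
 * covering exactly the low bits that must be zero.
 */
static bool is_aligned(uintptr_t addr, size_t align)
{
	assert(align != 0 && (align & (align - 1)) == 0);
	return (addr & (align - 1)) == 0;
}
```

For example, `is_aligned(128, CACHE_LINE)` holds while `is_aligned(130, CACHE_LINE)` does not. The mask trick only works for power-of-two alignments, which is why the sketch asserts that precondition.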

10 months, 1 week


[PATCH v2 00/17] Wire up getrandom() vDSO implementation on powerpc

by Christophe Leroy

This series wires up getrandom() vDSO implementation on powerpc.

Tested on PPC32.

Performance on powerpc 885 (using kernel selftest):
	~# ./vdso_test_getrandom bench-single
	   vdso: 2500000 times in 7.897495392 seconds
	   libc: 2500000 times in 56.091632232 seconds
	syscall: 2500000 times in 55.704851989 seconds

Performance on powerpc 8321 (using kernel selftest):
	~# ./vdso_test_getrandom bench-single
	   vdso: 2500000 times in 2.017183250 seconds
	   libc: 2500000 times in 13.088533630 seconds
	syscall: 2500000 times in 12.952458068 seconds

Only build tested on PPC64. There is a problem with the vdso_test_getrandom selftest: it doesn't find the vDSO symbol __kernel_getrandom. The same problem exists with vdso_test_gettimeofday, so it is not related to getrandom.

One strange thing to be clarified is the format of the key passed to __arch_chacha20_blocks_nostack(). In struct vgetrandom_state it is declared as a table of u32, but in reality it seems to be flat storage that needs to be loaded in reversed byte order, so it should be defined either as a table of bytes or as a table of __le32, but not as a table of u32. This has no impact, though, and can be clarified later and fixed in a follow-up patch.

Changes in v2:
- Define VM_DROPPABLE for powerpc/32
- Fixes generic vDSO getrandom headers to enable CONFIG_COMPAT build.
- Fixed size of generation counter - Fixed selftests to work on non x86 architectures Christophe Leroy (17): asm-generic/unaligned.h: Extract common header for vDSO vdso: Clean header inclusion in getrandom vdso: Add __arch_get_k_vdso_rng_data() vdso: Add missing c-getrandom-y in Makefile vdso: Avoid call to memset() by getrandom vdso: Change getrandom's generation to unsigned long mm: Define VM_DROPPABLE for powerpc/32 powerpc: Add little endian variants of LWZX_BE and STWX_BE powerpc/vdso32: Add crtsavres powerpc/vdso: Refactor CFLAGS for CVDSO build powerpc/vdso: Wire up getrandom() vDSO implementation selftests: vdso: Fix powerpc64 vdso_config selftests: vdso: Don't hard-code location of vDSO sources selftests: vdso: Make test_vdso_getrandom look for the right vDSO function selftests: vdso: Fix build of test_vdso_chacha selftests: vdso: Make VDSO function call more generic selftests: vdso: Add support for vdso_test_random for powerpc arch/powerpc/Kconfig | 1 + arch/powerpc/include/asm/asm-compat.h | 8 + arch/powerpc/include/asm/mman.h | 2 +- arch/powerpc/include/asm/vdso/getrandom.h | 67 ++++ arch/powerpc/include/asm/vdso/vsyscall.h | 6 + arch/powerpc/include/asm/vdso_datapage.h | 2 + arch/powerpc/kernel/asm-offsets.c | 1 + arch/powerpc/kernel/vdso/Makefile | 45 ++- arch/powerpc/kernel/vdso/getrandom.S | 58 ++++ arch/powerpc/kernel/vdso/gettimeofday.S | 13 - arch/powerpc/kernel/vdso/vdso32.lds.S | 1 + arch/powerpc/kernel/vdso/vdso64.lds.S | 1 + arch/powerpc/kernel/vdso/vgetrandom-chacha.S | 297 ++++++++++++++++++ arch/powerpc/kernel/vdso/vgetrandom.c | 14 + arch/x86/entry/vdso/vma.c | 3 + arch/x86/include/asm/pvclock.h | 1 + arch/x86/include/asm/vdso/vsyscall.h | 10 +- drivers/char/random.c | 5 +- fs/proc/task_mmu.c | 4 +- include/asm-generic/unaligned.h | 11 +- include/linux/mm.h | 4 +- include/trace/events/mmflags.h | 4 +- include/vdso/datapage.h | 2 +- include/vdso/getrandom.h | 2 +- include/vdso/helpers.h | 1 + include/vdso/unaligned.h | 15 + 
lib/vdso/Makefile | 1 + lib/vdso/getrandom.c | 30 +- tools/arch/powerpc/vdso | 1 + tools/arch/x86/vdso | 1 + tools/include/linux/linkage.h | 4 + tools/testing/selftests/vDSO/Makefile | 12 +- tools/testing/selftests/vDSO/vdso_call.h | 52 +++ tools/testing/selftests/vDSO/vdso_config.h | 14 +- .../selftests/vDSO/vdso_test_getrandom.c | 11 +- 35 files changed, 628 insertions(+), 76 deletions(-) create mode 100644 arch/powerpc/include/asm/vdso/getrandom.h create mode 100644 arch/powerpc/kernel/vdso/getrandom.S create mode 100644 arch/powerpc/kernel/vdso/vgetrandom-chacha.S create mode 100644 arch/powerpc/kernel/vdso/vgetrandom.c create mode 100644 include/vdso/unaligned.h create mode 120000 tools/arch/powerpc/vdso create mode 120000 tools/arch/x86/vdso create mode 100644 tools/testing/selftests/vDSO/vdso_call.h -- 2.44.0
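The cover letter's remark about the key format comes down to the difference between reading host-order words and reading fixed-order bytes. The sketch below shows what treating the key as a table of bytes (or of __le32) means in practice: each 32-bit word is assembled from four bytes in little-endian order, regardless of host endianness. `load_le32` is a hypothetical helper written for this illustration; it is not taken from the series or from the kernel.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: assemble a 32-bit word from four bytes stored in
 * little-endian order. The result is the same on big- and little-endian
 * hosts, unlike a plain u32 load from the same memory.
 */
static uint32_t load_le32(const uint8_t *p)
{
	assert(p != NULL);
	return (uint32_t)p[0] |
	       ((uint32_t)p[1] << 8) |
	       ((uint32_t)p[2] << 16) |
	       ((uint32_t)p[3] << 24);
}
```

On a big-endian machine such as most powerpc configurations, a native u32 load of the bytes `78 56 34 12` yields 0x78563412, whereas `load_le32` yields 0x12345678, which is why the declared type matters.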

10 months, 1 week


[PATCH] selftests/mm: Relax test to fail after 100 migration failures

by Dev Jain

It was recently observed at [1] that during the folio unmapping stage of migration, when the PTEs are cleared, a racing thread faulting on that folio may increase the refcount of the folio and sleep on the folio lock (the migration path holds the lock); migration then ultimately fails when asserting the actual refcount against the expected one. As a result, the migration selftest fails on shared-anon mappings.

This underlines the fact that migration is a best-effort service, so it is wrong to fail the test for just a single failure; hence, fail the test only after 100 consecutive failures (where 100 is still a subjective choice). Note that this has no effect on the execution time of the test, since that is controlled by a timeout.

[1] https://lore.kernel.org/all/20240801081657.1386743-1-dev.jain@arm.com/

Signed-off-by: Dev Jain <dev.jain(a)arm.com>
Suggested-by: David Hildenbrand <david(a)redhat.com>
Reviewed-by: Ryan Roberts <ryan.roberts(a)arm.com>
Tested-by: Ryan Roberts <ryan.roberts(a)arm.com>
---
The above patch was part of the following:
https://lore.kernel.org/all/20240809103129.365029-1-dev.jain@arm.com/
I decided to send it separately since it should be applied nevertheless.
tools/testing/selftests/mm/migration.c | 17 +++++++++++------ 1 file changed, 11 insertions(+), 6 deletions(-) diff --git a/tools/testing/selftests/mm/migration.c b/tools/testing/selftests/mm/migration.c index 6908569ef406..64bcbb7151cf 100644 --- a/tools/testing/selftests/mm/migration.c +++ b/tools/testing/selftests/mm/migration.c @@ -15,10 +15,10 @@ #include <signal.h> #include <time.h> -#define TWOMEG (2<<20) -#define RUNTIME (20) - -#define ALIGN(x, a) (((x) + (a - 1)) & (~((a) - 1))) +#define TWOMEG (2<<20) +#define RUNTIME (20) +#define MAX_RETRIES 100 +#define ALIGN(x, a) (((x) + (a - 1)) & (~((a) - 1))) FIXTURE(migration) { @@ -65,6 +65,7 @@ int migrate(uint64_t *ptr, int n1, int n2) int ret, tmp; int status = 0; struct timespec ts1, ts2; + int failures = 0; if (clock_gettime(CLOCK_MONOTONIC, &ts1)) return -1; @@ -79,13 +80,17 @@ int migrate(uint64_t *ptr, int n1, int n2) ret = move_pages(0, 1, (void **) &ptr, &n2, &status, MPOL_MF_MOVE_ALL); if (ret) { - if (ret > 0) + if (ret > 0) { + /* Migration is best effort; try again */ + if (++failures < MAX_RETRIES) + continue; printf("Didn't migrate %d pages\n", ret); + } else perror("Couldn't migrate pages"); return -2; } - + failures = 0; tmp = n2; n2 = n1; n1 = tmp; -- 2.30.2
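The retry rule in the patch above (tolerate transient failures, give up only after MAX_RETRIES failures in a row, reset the streak on success) can be sketched in isolation as follows. `run_best_effort`, `attempt_fn` and the two example callbacks are hypothetical names invented for this illustration. The selftest itself loops until a timeout rather than for a fixed number of rounds, so this is only a simplified model of the counting logic.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_RETRIES 100

/* Hypothetical callback type: returns 0 on success, non-zero on failure. */
typedef int (*attempt_fn)(void *ctx);

/*
 * Call attempt() up to `rounds` times. A success resets the failure
 * streak; only MAX_RETRIES failures in a row abort the loop.
 * Returns 0 if all rounds ran, -1 if the streak limit was hit.
 */
static int run_best_effort(attempt_fn attempt, void *ctx, int rounds)
{
	int failures = 0;

	assert(attempt != NULL);
	for (int i = 0; i < rounds; i++) {
		if (attempt(ctx)) {
			/* Best effort: retry unless the streak is too long. */
			if (++failures >= MAX_RETRIES)
				return -1;
			continue;
		}
		failures = 0;
	}
	return 0;
}

/* Example callback: fails on every call. */
static int always_fail(void *ctx)
{
	(void)ctx;
	return 1;
}

/* Example callback: alternates success and failure using a counter in ctx. */
static int fail_every_other(void *ctx)
{
	int *n = ctx;

	return (*n)++ % 2;
}
```

With `fail_every_other`, the streak never exceeds one, so the loop runs to completion even though half of all attempts fail. With `always_fail`, the loop aborts on the 100th attempt. Counting consecutive rather than total failures is what makes the check tolerant of occasional races.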

10 months, 1 week


[net-next, v3 0/2] Adding SO_PEEK_OFF for TCPv6

by jmaloy＠redhat.com

From: Jon Maloy <jmaloy(a)redhat.com> Adding SO_PEEK_OFF for TCPv6 and selftest for both TCPv4 and TCPv6 Jon Maloy (2): tcp: add SO_PEEK_OFF socket option tor TCPv6 selftests: add selftest for tcp SO_PEEK_OFF support net/ipv6/af_inet6.c | 1 + tools/testing/selftests/net/Makefile | 1 + tools/testing/selftests/net/tcp_so_peek_off.c | 183 ++++++++++++++++++ 3 files changed, 185 insertions(+) create mode 100644 tools/testing/selftests/net/tcp_so_peek_off.c -- 2.45.2

10 months, 1 week


[PATCH 0/2] Exposing nice CPU usage to userspace

by Joshua＠web.codeaurora.org

From: Joshua Hahn <joshua.hahn6(a)gmail.com>

Niced CPU usage is a metric reported in host-level /proc/stat, but is not reported in cgroup-level statistics in cpu.stat. However, when a host contains multiple tasks across different workloads, it becomes difficult to gauge how much CPU time is being spent on niced processes based on /proc/stat alone, since host-level metrics do not provide this cgroup-level granularity.

Exposing this metric will allow load balancers to correctly probe the niced CPU metric for each workload, and make more informed decisions when directing higher priority tasks.

Joshua Hahn (2):
  Tracking cgroup-level niced CPU time
  Selftests for niced CPU statistics

 include/linux/cgroup-defs.h               |  1 +
 kernel/cgroup/rstat.c                     | 16 ++++-
 tools/testing/selftests/cgroup/test_cpu.c | 72 +++++++++++++++++++++++
 3 files changed, 86 insertions(+), 3 deletions(-)

-- 
2.43.5
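For context, the host-level figure mentioned in the cover letter is the second numeric field of the aggregate `cpu` line in /proc/stat. A minimal sketch of extracting it from such a line is shown below. `parse_nice_ticks` is a hypothetical helper written for this illustration. The cgroup-level key added by the series is not named in the cover letter, so it is deliberately not shown.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/*
 * Parse the aggregate "cpu" line of /proc/stat and store the nice field
 * (time spent in niced user-mode tasks, in USER_HZ ticks) in *nice.
 * Field order on that line is: user, nice, system, idle, ...
 * Returns 0 on success, -1 if the line is not the aggregate cpu line.
 */
static int parse_nice_ticks(const char *line, unsigned long long *nice)
{
	unsigned long long user;

	assert(line != NULL && nice != NULL);

	/* Reject per-CPU lines such as "cpu0 ..." by requiring "cpu ". */
	if (strncmp(line, "cpu ", 4) != 0)
		return -1;
	if (sscanf(line + 4, "%llu %llu", &user, nice) != 2)
		return -1;
	return 0;
}
```

A caller would read the first line of /proc/stat with `fgets()` and pass it to this function. Because the value is host-wide, it cannot be attributed to a particular workload, which is the gap the series describes.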

10 months, 1 week


[PATCH v2 0/6] kunit: Add macros to help write more complex tests

by Michal Wajdeczko

v1: https://groups.google.com/g/kunit-dev/c/f4LIMLyofj8 v2: make it more complex and attempt to be thread safe s/FIXED_STUB/GLOBAL_STUB (David, Lucas) make it little more thread safe (Rae, David) wait until stub call finishes before test end (David) wait until stub call finishes before changing stub (David) allow stub deactivation (Rae) prefer kunit log (David) add simple selftest (Michal) also introduce ONLY_IF_KUNIT macro (Michal) Sample output from the tests: $ tools/testing/kunit/kunit.py run *example*.*global* \ --kunitconfig lib/kunit/.kunitconfig --raw_output KTAP version 1 1..1 # example: initializing suite KTAP version 1 # Subtest: example # module: kunit_example_test 1..1 # example_global_stub_test: initializing # example_global_stub_test: add_two: redirecting to subtract_one # example_global_stub_test: add_two: redirecting to subtract_one # example_global_stub_test: cleaning up ok 1 example_global_stub_test # example: exiting suite ok 1 example $ tools/testing/kunit/kunit.py run *global*.*global* \ --kunitconfig lib/kunit/.kunitconfig --raw_output KTAP version 1 1..1 KTAP version 1 # Subtest: kunit_global_stub # module: kunit_test 1..4 # kunit_global_stub_test_activate: real_void_func: redirecting to replacement_void_func # kunit_global_stub_test_activate: real_func: redirecting to replacement_func # kunit_global_stub_test_activate: real_func: redirecting to replacement_func # kunit_global_stub_test_activate: real_func: redirecting to other_replacement_func # kunit_global_stub_test_activate: real_func: redirecting to other_replacement_func # kunit_global_stub_test_activate: real_func: redirecting to super_replacement_func # kunit_global_stub_test_activate: real_func: redirecting to super_replacement_func ok 1 kunit_global_stub_test_activate ok 2 kunit_global_stub_test_deactivate # kunit_global_stub_test_slow_deactivate: real_func: redirecting to slow_replacement_func # kunit_global_stub_test_slow_deactivate: real_func: redirecting to 
slow_replacement_func # kunit_global_stub_test_slow_deactivate: waiting for slow_replacement_func # kunit_global_stub_test_slow_deactivate.speed: slow ok 3 kunit_global_stub_test_slow_deactivate # kunit_global_stub_test_slow_replace: real_func: redirecting to slow_replacement_func # kunit_global_stub_test_slow_replace: real_func: redirecting to slow_replacement_func # kunit_global_stub_test_slow_replace: waiting for slow_replacement_func # kunit_global_stub_test_slow_replace: real_func: redirecting to other_replacement_func # kunit_global_stub_test_slow_replace.speed: slow ok 4 kunit_global_stub_test_slow_replace # kunit_global_stub: pass:4 fail:0 skip:0 total:4 # Totals: pass:4 fail:0 skip:0 total:4 ok 1 kunit_global_stub Cc: Rae Moar <rmoar(a)google.com> Cc: David Gow <davidgow(a)google.com> Cc: Lucas De Marchi <lucas.demarchi(a)intel.com> Michal Wajdeczko (6): kunit: Introduce kunit_is_running() kunit: Add macro to conditionally expose declarations to tests kunit: Add macro to conditionally expose expressions to tests kunit: Allow function redirection outside of the KUnit thread kunit: Add example with alternate function redirection method kunit: Add some selftests for global stub redirection macros include/kunit/static_stub.h | 158 ++++++++++++++++++++ include/kunit/test-bug.h | 12 +- include/kunit/visibility.h | 16 +++ lib/kunit/kunit-example-test.c | 67 +++++++++ lib/kunit/kunit-test.c | 254 ++++++++++++++++++++++++++++++++- lib/kunit/static_stub.c | 49 +++++++ 6 files changed, 553 insertions(+), 3 deletions(-) -- 2.43.0
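The sample output above shows `add_two` being redirected to `subtract_one` while a stub is active. The underlying idea of function redirection can be modelled in plain user-space C as below. This is only a conceptual sketch: the stub slot, `set_add_two_stub` and `stub_demo` are hypothetical names, none of the macros introduced by the series are used, and the sketch ignores the thread-safety and wait-for-completion handling that v2 of the series adds.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stub slot: NULL means "run the real implementation". */
static int (*add_two_stub)(int);

/* Real function, with a redirection check at its entry point. */
static int add_two(int x)
{
	if (add_two_stub)
		return add_two_stub(x);
	return x + 2;
}

/* Replacement used by the test while the stub is active. */
static int subtract_one(int x)
{
	return x - 1;
}

/* Activate (fn != NULL) or deactivate (fn == NULL) the redirection. */
static void set_add_two_stub(int (*fn)(int))
{
	add_two_stub = fn;
}

/* Walk through activation and deactivation of the stub. */
static void stub_demo(void)
{
	assert(add_two(1) == 3);	/* real implementation */
	set_add_two_stub(subtract_one);
	assert(add_two(1) == 0);	/* redirected */
	set_add_two_stub(NULL);
	assert(add_two(1) == 3);	/* real implementation again */
}
```

A plain global pointer like this is racy if another thread calls `add_two` while the stub is being changed, which is presumably the kind of problem the "wait until stub call finishes" items in the changelog address.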

10 months, 1 week


[PATCH nf-next v3 1/2] netfilter: Make IP_NF_IPTABLES_LEGACY selectable

by Breno Leitao

This option makes IP_NF_IPTABLES_LEGACY user selectable, giving users the option to configure iptables without enabling any other config. Suggested-by: Florian Westphal <fw(a)strlen.de> Signed-off-by: Breno Leitao <leitao(a)debian.org> --- net/ipv4/netfilter/Kconfig | 19 +++++++++++-------- tools/testing/selftests/net/config | 8 ++++++++ 2 files changed, 19 insertions(+), 8 deletions(-) diff --git a/net/ipv4/netfilter/Kconfig b/net/ipv4/netfilter/Kconfig index 1b991b889506..a06c1903183f 100644 --- a/net/ipv4/netfilter/Kconfig +++ b/net/ipv4/netfilter/Kconfig @@ -12,7 +12,12 @@ config NF_DEFRAG_IPV4 # old sockopt interface and eval loop config IP_NF_IPTABLES_LEGACY - tristate + tristate "Legacy IP tables support" + default n + select NETFILTER_XTABLES + help + iptables is a general, extensible packet identification legacy framework. + This is not needed if you are using iptables over nftables (iptables-nft). config NF_SOCKET_IPV4 tristate "IPv4 socket lookup support" @@ -177,7 +182,7 @@ config IP_NF_MATCH_TTL config IP_NF_FILTER tristate "Packet filtering" default m if NETFILTER_ADVANCED=n - select IP_NF_IPTABLES_LEGACY + depends on IP_NF_IPTABLES_LEGACY help Packet filtering defines a table `filter', which has a series of rules for simple packet filtering at local input, forwarding and @@ -217,7 +222,7 @@ config IP_NF_NAT default m if NETFILTER_ADVANCED=n select NF_NAT select NETFILTER_XT_NAT - select IP_NF_IPTABLES_LEGACY + depends on IP_NF_IPTABLES_LEGACY help This enables the `nat' table in iptables. This allows masquerading, port forwarding and other forms of full Network Address Port @@ -258,7 +263,7 @@ endif # IP_NF_NAT config IP_NF_MANGLE tristate "Packet mangling" default m if NETFILTER_ADVANCED=n - select IP_NF_IPTABLES_LEGACY + depends on IP_NF_IPTABLES_LEGACY help This option adds a `mangle' table to iptables: see the man page for iptables(8). 
This table is used for various packet alterations @@ -293,7 +298,7 @@ config IP_NF_TARGET_TTL # raw + specific targets config IP_NF_RAW tristate 'raw table support (required for NOTRACK/TRACE)' - select IP_NF_IPTABLES_LEGACY + depends on IP_NF_IPTABLES_LEGACY help This option adds a `raw' table to iptables. This table is the very first in the netfilter framework and hooks in at the PREROUTING @@ -305,9 +310,7 @@ config IP_NF_RAW # security table for MAC policy config IP_NF_SECURITY tristate "Security table" - depends on SECURITY - depends on NETFILTER_ADVANCED - select IP_NF_IPTABLES_LEGACY + depends on SECURITY && NETFILTER_ADVANCED && IP_NF_IPTABLES_LEGACY help This option adds a `security' table to iptables, for use with Mandatory Access Control (MAC) policy. diff --git a/tools/testing/selftests/net/config b/tools/testing/selftests/net/config index 5b9baf708950..90e997cfa12e 100644 --- a/tools/testing/selftests/net/config +++ b/tools/testing/selftests/net/config @@ -28,6 +28,7 @@ CONFIG_NET_FOU=y CONFIG_NET_FOU_IP_TUNNELS=y CONFIG_NETFILTER=y CONFIG_NETFILTER_ADVANCED=y +CONFIG_NETFILTER_XT_TARGET_HL=m CONFIG_NF_CONNTRACK=m CONFIG_IPV6_MROUTE=y CONFIG_IPV6_SIT=y @@ -35,6 +36,11 @@ CONFIG_IP_DCCP=m CONFIG_NF_NAT=m CONFIG_IP6_NF_IPTABLES=m CONFIG_IP_NF_IPTABLES=m +CONFIG_IP_NF_IPTABLES_LEGACY=m +CONFIG_IP_NF_FILTER=m +CONFIG_IP_NF_TARGET_REJECT=m +CONFIG_IP_NF_TARGET_MASQUERADE=m +CONFIG_IP_NF_MANGLE=m CONFIG_IP6_NF_NAT=m CONFIG_IP6_NF_RAW=m CONFIG_IP_NF_NAT=m @@ -54,6 +60,7 @@ CONFIG_MPTCP=y CONFIG_NF_TABLES=m CONFIG_NF_TABLES_IPV6=y CONFIG_NF_TABLES_IPV4=y +CONFIG_NF_REJECT_IPV4=y CONFIG_NFT_NAT=m CONFIG_NETFILTER_XT_MATCH_LENGTH=m CONFIG_NET_ACT_CSUM=m @@ -106,4 +113,5 @@ CONFIG_CRYPTO_ARIA=y CONFIG_XFRM_INTERFACE=m CONFIG_XFRM_USER=m CONFIG_IP_NF_MATCH_RPFILTER=m +CONFIG_IP_NF_TARGET_MASQUERADE=m CONFIG_IP6_NF_MATCH_RPFILTER=m -- 2.43.5

10 months, 1 week


[PATCH bpf-next V1] enable virtFS(9p virtio) for sharing directory on VM to optimize debugging

by Lin Yikai

[Problem]
Sometimes only an x86_64 server is available for compiling BPF with a target ARCH of arm64, so the only way to debug BPF is to cross-compile and run under qemu. Unfortunately, debugging live on the VM is very inconvenient when test_progs fails. For example:
1. We cannot directly replace an old test object; we have to quit the VM and restart it, which consumes valuable time.
2. We also want to share other tools or binaries for execution on the running VM, which the VM setup does not support.

[Optimization]
I notice that CONFIG_9P_FS is enabled in "config.vm", so virtFS (9p virtio) is available on the VM. To make use of it, I add a new init file to the qemu image, which only exists when the '-v' option is given.

    root@(none):/# cat /etc/rcS.d/S20-testDebug
    #!/bin/sh
    set -x
    rm -rf /mnt/shared
    mkdir -p /mnt/shared
    /bin/mount -t 9p -o trans=virtio,version=9p2000.L host0 /mnt/shared

[Usage]
Append the option '-v' to enable it. For instance:

    LDLIBS=-static ./vmtest.sh -v -s -- ./test_progs -t d_path

This shares the VM's "/mnt/shared" directory with the host's *${OUTPUT_DIR}/${MOUNT_DIR}/shared*.
On host: $ mv ./test_progs ~/workplace/bpf/arm64/.bpf_selftests/mnt/shared/ On VM(you can directly move it into /root/bpf): root@(none):/# ls /mnt/shared/ test_progs Signed-off-by: Lin Yikai <yikai.lin(a)vivo.com> --- tools/testing/selftests/bpf/vmtest.sh | 75 ++++++++++++++++++++++++++- 1 file changed, 73 insertions(+), 2 deletions(-) diff --git a/tools/testing/selftests/bpf/vmtest.sh b/tools/testing/selftests/bpf/vmtest.sh index c7461ed496ab..82afadde50da 100755 --- a/tools/testing/selftests/bpf/vmtest.sh +++ b/tools/testing/selftests/bpf/vmtest.sh @@ -70,10 +70,15 @@ LOG_FILE_BASE="$(date +"bpf_selftests.%Y-%m-%d_%H-%M-%S")" LOG_FILE="${LOG_FILE_BASE}.log" EXIT_STATUS_FILE="${LOG_FILE_BASE}.exit_status" +DEBUG_CMD_INIT="" +DEBUG_FILE_INIT="S20-testDebug" +QEMU_FLAG_VIRTFS="" + + usage() { cat <<EOF -Usage: $0 [-i] [-s] [-d <output_dir>] -- [<command>] +Usage: $0 [-i] [-s] [-v] [-d <output_dir>] -- [<command>] <command> is the command you would normally run when you are in tools/testing/selftests/bpf. e.g: @@ -101,6 +106,8 @@ Options: -s) Instead of powering off the VM, start an interactive shell. If <command> is specified, the shell runs after the command finishes executing + -v) enable virtFS (9p virtio) for sharing directory + of "/mnt/shared" on the VM EOF } @@ -275,6 +282,7 @@ EOF -serial mon:stdio \ "${QEMU_FLAGS[@]}" \ -enable-kvm \ + ${QEMU_FLAG_VIRTFS} \ -m 4G \ -drive file="${rootfs_img}",format=raw,index=1,media=disk,if=virtio,cache=none \ -kernel "${kernel_bzimage}" \ @@ -354,6 +362,60 @@ catch() exit ${exit_code} } +update_debug_init() +{ + #You can do something else just for debuging on qemu. + #The init script will be reset every time before vm running on host, + #and be executed on qemu before test_progs. 
+ local init_script_dir="${OUTPUT_DIR}/${MOUNT_DIR}/etc/rcS.d" + local init_script_file="${init_script_dir}/${DEBUG_FILE_INIT}" + + mount_image + if [[ "${DEBUG_CMD_INIT}" == "" ]]; then + sudo rm -rf ${init_script_file} + unmount_image + return + fi + + if [[ ! -d "${init_script_dir}" ]]; then + cat <<EOF +Could not find ${init_script_dir} in the mounted image. +This likely indicates a bad or not default rootfs image, +You need to change debug init manually +according to the actual situation of the rootfs image. +EOF + unmount_image + exit 1 + fi + + sudo bash -c "cat > ${init_script_file}" <<EOF +#!/bin/sh +set -x +${DEBUG_CMD_INIT} +EOF + sudo chmod 755 "${init_script_file}" + unmount_image +} + +#Establish shared dir access by 9p virtfs +#between "/mnt/shared" on qemu with *${OUTPUT_DIR}/${MOUNT_DIR}/shared* on local host. +debug_by_virtfs_shared() +{ + local qemu_shared_dir="/mnt/shared" + local host_shared_dir="${OUTPUT_DIR}/${MOUNT_DIR}/shared" + + #append virtfs shared flag for qemu + local flag="-virtfs local,mount_tag=host0,security_model=passthrough,id=host0,path=${host_shared_dir}" + mkdir -p "${host_shared_dir}" + QEMU_FLAG_VIRTFS="${QEMU_FLAG_VIRTFS} ${flag}" + + #append mount cmd into init + DEBUG_CMD_INIT="${DEBUG_CMD_INIT}\ +rm -rf ${qemu_shared_dir} +mkdir -p ${qemu_shared_dir} +/bin/mount -t 9p -o trans=virtio,version=9p2000.L host0 ${qemu_shared_dir}" +} + main() { local script_dir="$(cd -P -- "$(dirname -- "${BASH_SOURCE[0]}")" && pwd -P)" @@ -365,8 +427,9 @@ main() local update_image="no" local exit_command="poweroff -f" local debug_shell="no" + local enable_virtfs_shared="no" - while getopts ':hskid:j:' opt; do + while getopts ':vhskid:j:' opt; do case ${opt} in i) update_image="yes" @@ -382,6 +445,9 @@ main() debug_shell="yes" exit_command="bash" ;; + v) + enable_virtfs_shared="yes" + ;; h) usage exit 0 @@ -449,6 +515,11 @@ main() create_vm_image fi + if [[ "${enable_virtfs_shared}" == "yes" ]]; then + debug_by_virtfs_shared + fi + 
update_debug_init + update_selftests "${kernel_checkout}" "${make_command}" update_init_script "${command}" "${exit_command}" run_vm "${kernel_bzimage}" -- 2.34.1

10 months, 1 week


[PATCH] selftests/arm64: Fix build warnings for abi

by Dev Jain

A "%s" is missing in ksft_exit_fail_msg(); instead, use the newly introduced ksft_exit_fail_perror(). Also, uint64_t corresponds to unsigned 64-bit integer, so use %lx instead of %llx. Signed-off-by: Dev Jain <dev.jain(a)arm.com> --- The changes in ptrace.c were earlier a part of the following: https://lore.kernel.org/all/20240625122408.1439097-6-dev.jain@arm.com/ which were reviewed by Mark. tools/testing/selftests/arm64/abi/ptrace.c | 4 ++-- tools/testing/selftests/arm64/abi/syscall-abi.c | 8 ++++---- 2 files changed, 6 insertions(+), 6 deletions(-) diff --git a/tools/testing/selftests/arm64/abi/ptrace.c b/tools/testing/selftests/arm64/abi/ptrace.c index e4fa507cbdd0..b51d21f78cf9 100644 --- a/tools/testing/selftests/arm64/abi/ptrace.c +++ b/tools/testing/selftests/arm64/abi/ptrace.c @@ -163,10 +163,10 @@ static void test_hw_debug(pid_t child, int type, const char *type_name) static int do_child(void) { if (ptrace(PTRACE_TRACEME, -1, NULL, NULL)) - ksft_exit_fail_msg("PTRACE_TRACEME", strerror(errno)); + ksft_exit_fail_perror("PTRACE_TRACEME"); if (raise(SIGSTOP)) - ksft_exit_fail_msg("raise(SIGSTOP)", strerror(errno)); + ksft_exit_fail_perror("raise(SIGSTOP)"); return EXIT_SUCCESS; } diff --git a/tools/testing/selftests/arm64/abi/syscall-abi.c b/tools/testing/selftests/arm64/abi/syscall-abi.c index d704511a0955..5ec9a18ec802 100644 --- a/tools/testing/selftests/arm64/abi/syscall-abi.c +++ b/tools/testing/selftests/arm64/abi/syscall-abi.c @@ -81,7 +81,7 @@ static int check_gpr(struct syscall_cfg *cfg, int sve_vl, int sme_vl, uint64_t s */ for (i = 9; i < ARRAY_SIZE(gpr_in); i++) { if (gpr_in[i] != gpr_out[i]) { - ksft_print_msg("%s SVE VL %d mismatch in GPR %d: %llx != %llx\n", + ksft_print_msg("%s SVE VL %d mismatch in GPR %d: %lx != %lx\n", cfg->name, sve_vl, i, gpr_in[i], gpr_out[i]); errors++; @@ -112,7 +112,7 @@ static int check_fpr(struct syscall_cfg *cfg, int sve_vl, int sme_vl, if (!sve_vl && !(svcr & SVCR_SM_MASK)) { for (i = 0; i < ARRAY_SIZE(fpr_in); 
i++) { if (fpr_in[i] != fpr_out[i]) { - ksft_print_msg("%s Q%d/%d mismatch %llx != %llx\n", + ksft_print_msg("%s Q%d/%d mismatch %lx != %lx\n", cfg->name, i / 2, i % 2, fpr_in[i], fpr_out[i]); @@ -294,13 +294,13 @@ static int check_svcr(struct syscall_cfg *cfg, int sve_vl, int sme_vl, int errors = 0; if (svcr_out & SVCR_SM_MASK) { - ksft_print_msg("%s Still in SM, SVCR %llx\n", + ksft_print_msg("%s Still in SM, SVCR %lx\n", cfg->name, svcr_out); errors++; } if ((svcr_in & SVCR_ZA_MASK) != (svcr_out & SVCR_ZA_MASK)) { - ksft_print_msg("%s PSTATE.ZA changed, SVCR %llx != %llx\n", + ksft_print_msg("%s PSTATE.ZA changed, SVCR %lx != %lx\n", cfg->name, svcr_in, svcr_out); errors++; } -- 2.30.2
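The patch above switches the format strings to `%lx`, which matches `uint64_t` where that type is `unsigned long`, as on arm64 user space. As a side note, and not something the patch does, the `<inttypes.h>` macro `PRIx64` expresses the same intent without hard-coding the length modifier. A small sketch follows; `format_reg` is a hypothetical helper name.

```c
#include <assert.h>
#include <inttypes.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Format a 64-bit value as lowercase hex into buf.
 * PRIx64 expands to the conversion specifier that matches uint64_t on
 * the target, so the same source compiles without -Wformat warnings
 * whether uint64_t is unsigned long or unsigned long long.
 * Returns the number of characters snprintf() would have written.
 */
static int format_reg(char *buf, size_t len, uint64_t val)
{
	assert(buf != NULL);
	return snprintf(buf, len, "%" PRIx64, val);
}
```

For an arm64-only selftest, `%lx` is sufficient, and the macro form is mainly of interest for code shared across architectures with different `uint64_t` definitions.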

10 months, 1 week


[PATCH net v2 00/15] mptcp: more fixes for the in-kernel PM

by Matthieu Baerts (NGI0)

Here is a new batch of fixes for the MPTCP in-kernel path-manager: Patch 1 ensures the address ID is set to 0 when the path-manager sends an ADD_ADDR for the address of the initial subflow. The same fix is applied when a new subflow is created re-using this special address. A fix for v6.0. Patch 2 is similar, but for the case where an endpoint is removed: if this endpoint was used for the initial address, it is important to send a RM_ADDR with this ID set to 0, and look for existing subflows with the ID set to 0. A fix for v6.0 as well. Patch 3 validates the two previous patches. Patch 4 makes the PM selecting an "active" path to send an address notification in an ACK, instead of taking the first path in the list. A fix for v5.11. Patch 5 fixes skipping the establishment of a new subflow if a previous subflow using the same pair of addresses is being closed. A fix for v5.13. Patch 6 resets the ID linked to the initial subflow when the linked endpoint is re-added, possibly with a different ID. A fix for v6.0. Patch 7 validates the three previous patches. Patch 8 is a small fix for the MPTCP Join selftest, when being used with older subflows not supporting all MIB counters. A fix for a commit introduced in v6.4, but backported up to v5.10. Patch 9 avoids the PM to try to close the initial subflow multiple times, and increment counters while nothing happened. A fix for v5.10. Patch 10 stops incrementing local_addr_used and add_addr_accepted counters when dealing with the address ID 0, because these counters are not taking into account the initial subflow, and are then not decremented when the linked addresses are removed. A fix for v6.0. Patch 11 validates the previous patch. Patch 12 avoids the PM to send multiple SUB_CLOSED events for the initial subflow. A fix for v5.12. Patch 13 validates the previous patch. 
Patch 14 stops treating the ADD_ADDR 0 as a new address, and accepts it in order to re-create the initial subflow if it has been closed, even if the limit for *new* addresses -- not taking into account the address of the initial subflow -- has been reached. A fix for v5.10. Patch 15 validates the previous patch. Signed-off-by: Matthieu Baerts (NGI0) <matttbe(a)kernel.org> --- Changes in v2: - Patches 11,15/15: allow the connection to run for longer, should fix the issue seen on the Netdev CI, with a debug kconfig. - Link to v1: https://lore.kernel.org/r/20240826-net-mptcp-more-pm-fix-v1-0-8cd6c87d1d6d@… --- Matthieu Baerts (NGI0) (15): mptcp: pm: reuse ID 0 after delete and re-add mptcp: pm: fix RM_ADDR ID for the initial subflow selftests: mptcp: join: check removing ID 0 endpoint mptcp: pm: send ACK on an active subflow mptcp: pm: skip connecting to already established sf mptcp: pm: reset MPC endp ID when re-added selftests: mptcp: join: check re-adding init endp with != id selftests: mptcp: join: no extra msg if no counter mptcp: pm: do not remove already closed subflows mptcp: pm: fix ID 0 endp usage after multiple re-creations selftests: mptcp: join: check re-re-adding ID 0 endp mptcp: avoid duplicated SUB_CLOSED events selftests: mptcp: join: validate event numbers mptcp: pm: ADD_ADDR 0 is not a new address selftests: mptcp: join: check re-re-adding ID 0 signal net/mptcp/pm.c | 4 +- net/mptcp/pm_netlink.c | 87 ++++++++++---- net/mptcp/protocol.c | 6 + net/mptcp/protocol.h | 5 +- tools/testing/selftests/net/mptcp/mptcp_join.sh | 153 ++++++++++++++++++++---- tools/testing/selftests/net/mptcp/mptcp_lib.sh | 4 + 6 files changed, 209 insertions(+), 50 deletions(-) --- base-commit: 3a0504d54b3b57f0d7bf3d9184a00c9f8887f6d7 change-id: 20240826-net-mptcp-more-pm-fix-ffa61a36f817 Best regards, -- Matthieu Baerts (NGI0) <matttbe(a)kernel.org>

10 months, 1 week


[PATCH v5 0/4] HID: hidraw: HIDIOCREVOKE introduction

by bentiss＠kernel.org

This is the v5 of the HIDIOCREVOKE patches. After a small discussion with Peter, we decided to: - drop the BPF hooks that are problematic (Linus doesn't want "ALLOW_ERROR_INJECTION" to be used as "normal" fmodret bpf hooks) - punt those BPF hooks to later, once we get the API right - I'll be the one sending that new version, given that it's easier for me ATM For testing the patch, and for convenience, I added a new selftest program that can test this new ioctl. This will also allow us to integrate the (future) BPF hooks and show how this should be used. Signed-off-by: Benjamin Tissoires <bentiss(a)kernel.org> --- Changes in v5: - check for ENODEV when required in selftests - create new common header for the HID tests that can be reused in other HID selftests - Link to v4: https://lore.kernel.org/r/20240827-hidraw-revoke-v4-0-88c6795bf867@kernel.o… Link to v3: https://lore.kernel.org/all/20240812052753.GA478917@quokka/ --- Benjamin Tissoires (3): selftests/hid: extract the utility part of hid_bpf.c into its own header selftests/hid: Add initial hidraw tests skeleton selftests/hid: Add HIDIOCREVOKE tests Peter Hutterer (1): HID: hidraw: add HIDIOCREVOKE ioctl drivers/hid/hidraw.c | 39 ++- include/linux/hidraw.h | 1 + include/uapi/linux/hidraw.h | 1 + tools/testing/selftests/hid/.gitignore | 1 + tools/testing/selftests/hid/Makefile | 2 +- tools/testing/selftests/hid/hid_bpf.c | 437 +------------------------------ tools/testing/selftests/hid/hid_common.h | 436 ++++++++++++++++++++++++++++++ tools/testing/selftests/hid/hidraw.c | 237 +++++++++++++++++ 8 files changed, 714 insertions(+), 440 deletions(-) --- base-commit: 6e4436539ae182dc86d57d13849862bcafaa4709 change-id: 20240826-hidraw-revoke-0a02ebb21743 Best regards, -- Benjamin Tissoires <bentiss(a)kernel.org>

10 months, 1 week


[PATCH bpf-next v3 0/8] libbpf, selftests/bpf: Support cross-endian usage

by Tony Ambardar

Hello all, This patch series targets a long-standing BPF usability issue - the lack of general cross-compilation support - by enabling cross-endian usage of libbpf and bpftool, as well as supporting cross-endian build targets for selftests/bpf. Benefits include improved BPF development and testing for embedded systems based on e.g. big-endian MIPS, more build options e.g. for s390x systems, and better accessibility to the very latest test tools e.g. 'test_progs'. Initial development and testing used mips64, since this arch makes switching the build byte-order trivial and is thus very handy for A/B testing. However, it lacks some key features (bpf2bpf call, kfuncs, etc) making for poor selftests/bpf coverage. Final testing takes the kernel and selftests/bpf cross-built from x86_64 to s390x, and runs the result under QEMU/s390x. That same configuration could also be used on kernel-patches/bpf CI for regression testing endian support or perhaps load-sharing s390x builds across x86_64 systems. This thread includes some background regarding testing on QEMU/s390x and the generally favourable results: https://lore.kernel.org/bpf/ZsEcsaa3juxxQBUf@kodidev-ubuntu/ Feedback and suggestions are welcome! 
Best regards, Tony Changelog: --------- v2 -> v3: (feedback from Andrii) - improve some log and commit message formatting - restructure BTF.ext endianness safety checks and byte-swapping - use BTF.ext info record definitions for swapping, require BTF v1 - follow BTF API implementation more closely for BTF.ext - explicitly reject loading non-native endianness program into kernel - simplify linker output byte-order setting - drop redundant safety checks during linking - simplify endianness macro and improve blob setup code for light skel - no unexpected test failures after cross-compiling x86_64 -> s390x v1 -> v2: - fixed a light skeleton bug causing test_progs 'map_ptr' failure - simplified some BTF.ext related endianness logic - remove an 'inline' usage related to CI checkpatch failure - improve some formatting noted by checkpatch warnings - unexpected 'test_progs' failures drop 3 -> 2 (x86_64 to s390x cross) Tony Ambardar (8): libbpf: Improve log message formatting libbpf: Fix header comment typos for BTF.ext libbpf: Fix output .symtab byte-order during linking libbpf: Support BTF.ext loading and output in either endianness libbpf: Support opening bpf objects of either endianness libbpf: Support linking bpf objects of either endianness libbpf: Support creating light skeleton of either endianness selftests/bpf: Support cross-endian building tools/lib/bpf/bpf_gen_internal.h | 1 + tools/lib/bpf/btf.c | 230 ++++++++++++++++++++++++--- tools/lib/bpf/btf.h | 3 + tools/lib/bpf/btf_dump.c | 2 +- tools/lib/bpf/btf_relocate.c | 2 +- tools/lib/bpf/gen_loader.c | 185 ++++++++++++++++----- tools/lib/bpf/libbpf.c | 39 +++-- tools/lib/bpf/libbpf.map | 2 + tools/lib/bpf/libbpf_internal.h | 17 +- tools/lib/bpf/linker.c | 92 +++++++++-- tools/lib/bpf/relo_core.c | 2 +- tools/lib/bpf/skel_internal.h | 3 +- tools/testing/selftests/bpf/Makefile | 7 +- 13 files changed, 488 insertions(+), 97 deletions(-) -- 2.34.1
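To illustrate what "cross-endian usage" involves at the lowest level, here is a small sketch of the two basic steps: detecting whether an ELF object's byte order matches the host, and byte-swapping a 32-bit field when it does not. This is not libbpf code. The function names host_elf_data(), elf_is_native_endian() and read_u32() are hypothetical, and the sketch assumes glibc's <byteswap.h> and <elf.h> are available.

```c
#include <byteswap.h>
#include <elf.h>
#include <stdbool.h>
#include <stdint.h>

/* ELF data encoding (ELFDATA2LSB or ELFDATA2MSB) of the running host. */
unsigned char host_elf_data(void)
{
	const uint16_t probe = 1;

	/* On a little-endian host the low-order byte is stored first. */
	return *(const unsigned char *)&probe ? ELFDATA2LSB : ELFDATA2MSB;
}

/* True if the ELF identification bytes declare the host's byte order. */
bool elf_is_native_endian(const unsigned char *e_ident)
{
	return e_ident[EI_DATA] == host_elf_data();
}

/* Interpret a 32-bit field read from the object, swapping if needed. */
uint32_t read_u32(uint32_t raw, bool native)
{
	return native ? raw : bswap_32(raw);
}
```

A loader following the changelog's "explicitly reject loading non-native endianness program into kernel" rule could use a check like elf_is_native_endian() to refuse the load, while tools that only inspect or link objects would swap fields instead.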

10 months, 1 week


[PATCH net-next v22 00/13] Device Memory TCP

by Mina Almasry

v22: https://patchwork.kernel.org/project/netdevbpf/list/?series=881158&state=* ==== v22 aims to resolve the pending issue pointed to in v21, which is the interaction with xdp. In this series I rebase on top of the minor refactor which refactors propagating xdp configuration to slave devices: https://patchwork.kernel.org/project/netdevbpf/list/?series=881994&state=* I then disable setting xdp on devices using memory providers, and propagating xdp configuration to devices using memory providers. Full devmem TCP changes including the full GVE driver implementation is here: https://github.com/mina/linux/commits/tcpdevmem-v22/ v21: https://patchwork.kernel.org/project/netdevbpf/list/?series=880735&state=* ==== v20 addressed some comments and resolved a test failure, but introduced an unfortunate build error with a config edge case I wasn't testing. v21 simply resolves that error. Major Changes: - Resolve build error with CONFIG_PAGE_POOL=n && CONFIG_NET=y Full devmem TCP changes including the full GVE driver implementation is here: https://github.com/mina/linux/commits/tcpdevmem-v21/ v20: https://patchwork.kernel.org/project/netdevbpf/list/?series=879373&state=* ==== v20 aims to resolve a couple of bug reports against v19, and addresses some review comments around the page_pool_check_memory_provider mechanism. Major changes: - Test edge cases such as header split disabled in selftest. - Change `offset = 0` back to `offset = offset - start` to resolve issue found in RX path by Taehee (thanks!) - Address a few comments around page_pool_check_memory_provider() from Pavel & Jakub. - Removed some unnecessary includes across various patches in the series. - Removed unnecessary EXPORT_SYMBOL(page_pool_mem_providers) (Jakub). - Fix regression caused by incorrect dev_get_max_mp_channel check, along with rename (Jakub). 
Full devmem TCP changes including the full GVE driver implementation is here: https://github.com/mina/linux/commits/tcpdevmem-v20/ v19: https://patchwork.kernel.org/project/netdevbpf/list/?series=876852&state=* ==== v18 got a thorough review (thanks!), and this iteration addresses the feedback. Major changes: - Prevent deactivating mp bound queues. - Prevent installing xdp on mp bound netdevs, or installing mps on xdp installed netdevs. - Fix corner cases in netlink API vis-a-vis missing attributes. - Iron out the unreadable netmem driver support story. To be honest, the conversation with Jakub & Pavel got a bit confusing for me. I've implemented an approach in this set that makes sense to me, and AFAICT, addresses the requirements. It may be good as-is, or it may be a conversation starter/continuer. To be honest IMO there are many ways to skin this cat and I don't see an extremely strong reason to go for one approach over another. Here is one approach you may like. - Don't reset niov dma_addr on allocation & free. - Add some tests to the selftest that catches some of the issues around missing netlink attributes or deactivating mp-bound queues. Full devmem TCP changes including the full GVE driver implementation is here: https://github.com/mina/linux/commits/tcpdevmem-v19/ v18: https://patchwork.kernel.org/project/netdevbpf/list/?series=874848&state=* ==== v17 got minor feedback: (a) to beef up the description on patch 1 and (b) to remove the leading underscores in the header definition. I applied (a). (b) seems to be against current conventions so I did not apply before further discussion. Full devmem TCP changes including the full GVE driver implementation is here: https://github.com/mina/linux/commits/tcpdevmem-v17/ v17: https://patchwork.kernel.org/project/netdevbpf/list/?series=869900&state=* ==== v16 also got a very thorough review and some testing (thanks again!). 
This version addresses all the concerns reported on v15, in terms of feedback and issues reported. Major changes: - Use ASSERT_RTNL. - Moved around some of the page_pool helpers definitions so I can hide some netmem helpers in private files as Jakub suggested. - Don't make every net_iov hold a ref on the binding as Jakub suggested. - Fix issue reported by Taehee where we access queues after they have been freed. Full devmem TCP changes including the full GVE driver implementation is here: https://github.com/mina/linux/commits/tcpdevmem-v17/ v16: https://patchwork.kernel.org/project/netdevbpf/list/?series=866353&state=* ==== v15 got a thorough review and some testing, and this version addresses almost all the feedback. Some more minor comments where the authors said it could be done later, I left out. Major changes: - Addition of dma-buf introspection to page-pool-get and queue-get. - Fixes to selftests suggested by Taehee. - Fixes to documentation suggested by Donald. - A couple of suggestions and fixes to TCP patches by Eric and David. - Fixes to number assignments suggested by Arnd. - Use rtnl_lock()ing to guard against queue reconfiguration while the page_pool initialization is happening. (Jakub). - Fixes to a few warnings reproduced by Taehee. - Fixes to dma-buf binding suggested by Taehee and Jakub. - Fixes to netlink UAPI suggested by Jakub - Applied a number of Reviewed-bys and Acked-bys (including ones I lost from v13+). 
Full devmem TCP changes including the full GVE driver implementation is here: https://github.com/mina/linux/commits/tcpdevmem-v16/ One caveat: Taehee reproduced a KASAN warning and reported it here: https://lore.kernel.org/netdev/CAMArcTUdCxOBYGF3vpbq=eBvqZfnc44KBaQTN7H-wqd… I estimate the issue to be minor and easily fixable: https://lore.kernel.org/netdev/CAHS8izNgaqC--GGE2xd85QB=utUnOHmioCsDd1TNxJW… I hope to be able to follow up with a fix to net tree as net-next closes imminently, but if this iteration doesn't make it in, I will repost with a fix squashed after net-next reopens, no problem. v15: https://patchwork.kernel.org/project/netdevbpf/list/?series=865481&state=* ==== No material changes in this version, only a fix to linking against libynl.a from the last version. Per Jakub's instructions I've pulled one of his patches into this series, and now use the new libynl.a correctly, I hope. As usual, the full devmem TCP changes including the full GVE driver implementation is here: https://github.com/mina/linux/commits/tcpdevmem-v15/ v14: https://patchwork.kernel.org/project/netdevbpf/list/?series=865135&archive=… ==== No material changes in this version. Only rebase and re-verification on top of net-next. v13, I think, raced with commit ebad6d0334793 ("net/ipv4: Use nested-BH locking for ipv4_tcp_sk.") being merged to net-next that caused a patchwork failure to apply. This series should apply cleanly on commit c4532232fa2a4 ("selftests: net: remove unneeded IP_GRE config"). I did not wait the customary 24hr as Jakub said it's OK to repost as soon as I build test the rebased version: https://lore.kernel.org/netdev/20240625075926.146d769d@kernel.org/ v13: https://patchwork.kernel.org/project/netdevbpf/list/?series=861406&archive=… ==== Major changes: -------------- This iteration addresses Pavel's review comments, applies his reviewed-by's, and seeks to fix the patchwork build error (sorry!). 
As usual, the full devmem TCP changes including the full GVE driver implementation is here: https://github.com/mina/linux/commits/tcpdevmem-v13/ v12: https://patchwork.kernel.org/project/netdevbpf/list/?series=859747&state=* ==== Major changes: -------------- This iteration only addresses one minor comment from Pavel with regards to the trace printing of netmem, and the patchwork build error introduced in v11 because I missed doing an allmodconfig build, sorry. Other than that v11, AFAICT, received no feedback. There is one discussion about how the specifics of plugging io uring memory through the page pool, but not relevant to content in this particular patchset, AFAICT. As usual, the full devmem TCP changes including the full GVE driver implementation is here: https://github.com/mina/linux/commits/tcpdevmem-v12/ v11: https://patchwork.kernel.org/project/netdevbpf/list/?series=857457&state=* ==== Major Changes: -------------- v11 addresses feedback received in v10. The major change is the removal of the memory provider ops as requested by Christoph. We still accomplish the same thing, but utilizing direct function calls with if statements rather than generic ops. Additionally address sparse warnings, bugs and review comments from folks that reviewed. As usual, the full devmem TCP changes including the full GVE driver implementation is here: https://github.com/mina/linux/commits/tcpdevmem-v11/ Detailed changelog: ------------------- - Fixes in netdev_rx_queue_restart() from Pavel & David. - Remove commit e650e8c3a36f5 ("net: page_pool: create hooks for custom page providers") from the series to address Christoph's feedback and rebased other patches on the series on this change. - Fixed build errors with CONFIG_DMA_SHARED_BUFFER && !CONFIG_GENERIC_ALLOCATOR build. - Fixed sparse warnings pointed out by Paolo. - Drop unnecessary gro_pull_from_frag0 checks. - Added Bagas reviewed-by to docs. 
v10: https://patchwork.kernel.org/project/netdevbpf/list/?series=852422&state=* ==== Major Changes: -------------- v9 was sent right before the merge window closed (sorry!). v10 is almost a re-send of the series now that the merge window re-opened. Only rebased to latest net-next and addressed some minor iterative comments received on v9. As usual, the full devmem TCP changes including the full GVE driver implementation is here: https://github.com/mina/linux/commits/tcpdevmem-v10/ Detailed changelog: ------------------- - Fixed tokens leaking in DONTNEED setsockopt (Nikolay). - Moved net_iov_dma_addr() to devmem.c and made it a devmem specific helper (David). - Rename hook alloc_pages to alloc_netmems as alloc_pages is now defined as a preprocessor macro and causes a build error. v9: === Major Changes: -------------- GVE queue API has been merged. Submitting this version as non-RFC after rebasing on top of the merged API, and dropped the out of tree queue API I was carrying on github. Addressed the little feedback v8 has received. Detailed changelog: ------------------ - Added new patch from David Wei to this series for netdev_rx_queue_restart() - Fixed sparse error. - Removed CONFIG_ checks in netmem_is_net_iov() - Flipped skb->readable to skb->unreadable - Minor fixes to selftests & docs. RFC v8: ======= Major Changes: -------------- - Fixed build error generated by patch-by-patch build. - Applied docs suggestions from Randy. RFC v7: ======= Major Changes: -------------- This revision largely rebases on top of net-next and addresses the feedback RFCv6 received from folks, namely Jakub, Yunsheng, Arnd, David, & Pavel. The series remains in RFC because the queue-API ndos defined in this series are not yet implemented. I have a GVE implementation I carry out of tree for my testing. An upstreamable GVE implementation is in the works. Aside from that, in my estimation all the patches are ready for review/merge. Please do take a look. 
As usual the full devmem TCP changes including the full GVE driver implementation is here: https://github.com/mina/linux/commits/tcpdevmem-v7/ Detailed changelog: - Use admin-perm in netlink API. - Addressed feedback from Jakub with regards to netlink API implementation. - Renamed devmem.c functions to something more appropriate for that file. - Improve the performance seen through the page_pool benchmark. - Fix the value definition of all the SO_DEVMEM_* uapi. - Various fixes to documentation. Perf - page-pool benchmark: --------------------------- Improved performance of bench_page_pool_simple.ko tests compared to v6: https://pastebin.com/raw/v5dYRg8L net-next base: 8 cycle fast path. RFC v6: 10 cycle fast path. RFC v7: 9 cycle fast path. RFC v7 with CONFIG_DMA_SHARED_BUFFER disabled: 8 cycle fast path, same as baseline. Perf - Devmem TCP benchmark: --------------------- Perf is about the same regardless of the changes in v7, namely the removal of the static_branch_unlikely to improve the page_pool benchmark performance: 189/200gbps bi-directional throughput with RX devmem TCP and regular TCP TX i.e. ~95% line rate. RFC v6: ======= Major Changes: -------------- This revision largely rebases on top of net-next and addresses the little feedback RFCv5 received. The series remains in RFC because the queue-API ndos defined in this series are not yet implemented. I have a GVE implementation I carry out of tree for my testing. An upstreamable GVE implementation is in the works. Aside from that, in my estimation all the patches are ready for review/merge. Please do take a look. As usual the full devmem TCP changes including the full GVE driver implementation is here: https://github.com/mina/linux/commits/tcpdevmem-v6/ This version also comes with some performance data recorded in the cover letter (see below changelog). Detailed changelog: - Rebased on top of the merged netmem_ref changes. - Converted skb->dmabuf to skb->readable (Pavel). 
Pavel's original suggestion was to remove the skb->dmabuf flag entirely, but when I looked into it closely, I found the issue that if we remove the flag we have to dereference the shinfo(skb) pointer to obtain the first frag to tell whether an skb is readable or not. This can cause a performance regression if it dirties the cache line when the shinfo(skb) was not really needed. Instead, I converted the skb->dmabuf flag into a generic skb->readable flag which can be re-used by io_uring 0-copy RX. - Squashed a few locking optimizations from Eric Dumazet in the RX path and the DEVMEM_DONTNEED setsockopt. - Expanded the tests a bit. Added validation for invalid scenarios and added some more coverage. Perf - page-pool benchmark: --------------------------- bench_page_pool_simple.ko tests with and without these changes: https://pastebin.com/raw/ncHDwAbn AFAIK the number that really matters in the perf tests is the 'tasklet_page_pool01_fast_path Per elem'. This one measures at about 8 cycles without the changes but there is some 1 cycle noise in some results. With the patches this regresses to 9 cycles with the changes but there is 1 cycle noise occasionally running this test repeatedly. Lastly I tried disable the static_branch_unlikely() in netmem_is_net_iov() check. To my surprise disabling the static_branch_unlikely() check reduces the fast path back to 8 cycles, but the 1 cycle noise remains. Perf - Devmem TCP benchmark: --------------------- 189/200gbps bi-directional throughput with RX devmem TCP and regular TCP TX i.e. ~95% line rate. Major changes in RFC v5: ======================== 1. Rebased on top of 'Abstract page from net stack' series and used the new netmem type to refer to LSB set pointers instead of re-using struct page. 2. Downgraded this series back to RFC and called it RFC v5. This is because this series is now dependent on 'Abstract page from net stack'[1] and the queue API. 
Both are removed from the series to reduce the patch # and those bits are fairly independent or pre-requisite work. 3. Reworked the page_pool devmem support to use netmem and for some more unified handling. 4. Reworked the reference counting of net_iov (renamed from page_pool_iov) to use pp_ref_count for refcounting. The full changes including the dependent series and GVE page pool support is here: https://github.com/mina/linux/commits/tcpdevmem-rfcv5/ [1] https://patchwork.kernel.org/project/netdevbpf/list/?series=810774 Major changes in v1: ==================== 1. Implemented MVP queue API ndos to remove the userspace-visible driver reset. 2. Fixed issues in the napi_pp_put_page() devmem frag unref path. 3. Removed RFC tag. Many smaller addressed comments across all the patches (patches have individual change log). Full tree including the rest of the GVE driver changes: https://github.com/mina/linux/commits/tcpdevmem-v1 Changes in RFC v3: ================== 1. Pulled in the memory-provider dependency from Jakub's RFC[1] to make the series reviewable and mergeable. 2. Implemented multi-rx-queue binding which was a todo in v2. 3. Fix to cmsg handling. The sticking point in RFC v2[2] was the device reset required to refill the device rx-queues after the dmabuf bind/unbind. The solution suggested as I understand is a subset of the per-queue management ops Jakub suggested or similar: https://lore.kernel.org/netdev/20230815171638.4c057dcd@kernel.org/ This is not addressed in this revision, because: 1. This point was discussed at netconf & netdev and there is openness to using the current approach of requiring a device reset. 2. Implementing individual queue resetting seems to be difficult for my test bed with GVE. My prototype to test this ran into issues with the rx-queues not coming back up properly if reset individually. 
At the moment I'm unsure if it's a mistake in the POC or a genuine issue in the virtualization stack behind GVE, which currently doesn't test individual rx-queue restart. 3. Our usecases are not bothered by requiring a device reset to refill the buffer queues, and we'd like to support NICs that run into this limitation with resetting individual queues. My thought is that drivers that have trouble with per-queue configs can use the support in this series, while drivers that support new netdev ops to reset individual queues can automatically reset the queue as part of the dma-buf bind/unbind. The same approach with device resets is presented again for consideration with other sticking points addressed. This proposal includes the rx devmem path only proposed for merge. For a snapshot of my entire tree which includes the GVE POC page pool support & device memory support: https://github.com/torvalds/linux/compare/master...mina:linux:tcpdevmem-v3 [1] https://lore.kernel.org/netdev/f8270765-a27b-6ccf-33ea-cda097168d79@redhat.… [2] https://lore.kernel.org/netdev/CAHS8izOVJGJH5WF68OsRWFKJid1_huzzUK+hpKbLcL4… Changes in RFC v2: ================== The sticking point in RFC v1[1] was the dma-buf pages approach we used to deliver the device memory to the TCP stack. RFC v2 is a proof-of-concept that attempts to resolve this by implementing scatterlist support in the networking stack, such that we can import the dma-buf scatterlist directly. This is the approach proposed at a high level here[2]. Detailed changes: 1. Replaced dma-buf pages approach with importing scatterlist into the page pool. 2. Replace the dma-buf pages centric API with a netlink API. 3. Removed the TX path implementation - there is no issue with implementing the TX path with scatterlist approach, but leaving out the TX path makes it easier to review. 4. Functionality is tested with this proposal, but I have not conducted perf testing yet. 
I'm not sure there are regressions, but I removed perf claims from the cover letter until they can be re-confirmed. 5. Added Signed-off-by: contributors to the implementation. 6. Fixed some bugs with the RX path since RFC v1. Any feedback welcome, but specifically the biggest pending questions needing feedback IMO are: 1. Feedback on the scatterlist-based approach in general. 2. Netlink API (Patch 1 & 2). 3. Approach to handle all the drivers that expect to receive pages from the page pool (Patch 6). [1] https://lore.kernel.org/netdev/dfe4bae7-13a0-3c5d-d671-f61b375cb0b4@gmail.c… [2] https://lore.kernel.org/netdev/CAHS8izPm6XRS54LdCDZVd0C75tA1zHSu6jLVO8nzTLX… ================== * TL;DR: Device memory TCP (devmem TCP) is a proposal for transferring data to and/or from device memory efficiently, without bouncing the data to a host memory buffer. * Problem: A large number of data transfers have device memory as the source and/or destination. Accelerators drastically increased the volume of such transfers. Some examples include: - ML accelerators transferring large amounts of training data from storage into GPU/TPU memory. In some cases ML training setup time can be as long as 50% of TPU compute time; improving data transfer throughput & efficiency can help improve GPU/TPU utilization. - Distributed training, where ML accelerators, such as GPUs on different hosts, exchange data among them. - Distributed raw block storage applications transfer large amounts of data with remote SSDs; much of this data does not require host processing. Today, the majority of the Device-to-Device data transfers over the network are implemented as the following low level operations: Device-to-Host copy, Host-to-Host network transfer, and Host-to-Device copy. The implementation is suboptimal, especially for bulk data transfers, and can put significant strains on system resources, such as host memory bandwidth, PCIe bandwidth, etc. 
One important reason behind the current state is the kernel’s lack of semantics to express device to network transfers. * Proposal: In this patch series we attempt to optimize this use case by implementing socket APIs that enable the user to: 1. send device memory across the network directly, and 2. receive incoming network packets directly into device memory. Packet _payloads_ go directly from the NIC to device memory for receive and from device memory to NIC for transmit. Packet _headers_ go to/from host memory and are processed by the TCP/IP stack normally. The NIC _must_ support header split to achieve this. Advantages: - Alleviate host memory bandwidth pressure, compared to existing network-transfer + device-copy semantics. - Alleviate PCIe BW pressure, by limiting data transfer to the lowest level of the PCIe tree, compared to traditional path which sends data through the root complex. * Patch overview: ** Part 1: netlink API Gives user ability to bind dma-buf to an RX queue. ** Part 2: scatterlist support Currently the standard for device memory sharing is DMABUF, which doesn't generate struct pages. On the other hand, the networking stack (skbs, drivers, and page pool) operates on pages. We have 2 options: 1. Generate struct pages for dmabuf device memory, or, 2. Modify the networking stack to process scatterlist. Approach #1 was attempted in RFC v1. RFC v2 implements approach #2. ** part 3: page pool support We piggy back on the page pool memory providers proposal: https://github.com/kuba-moo/linux/tree/pp-providers It allows the page pool to define a memory provider that provides the page allocation and freeing. It helps abstract most of the device memory TCP changes from the driver. ** part 4: support for unreadable skb frags Page pool iovs are not accessible by the host; we implement changes throughout the networking stack to correctly handle skbs with unreadable frags. ** Part 5: recvmsg() APIs We define user APIs for the user to send and receive device memory. 
Not included with this series is the GVE devmem TCP support, just to simplify the review. Code available here if desired: https://github.com/mina/linux/tree/tcpdevmem This series is built on top of net-next with Jakub's pp-providers changes cherry-picked. * NIC dependencies: 1. (strict) Devmem TCP require the NIC to support header split, i.e. the capability to split incoming packets into a header + payload and to put each into a separate buffer. Devmem TCP works by using device memory for the packet payload, and host memory for the packet headers. 2. (optional) Devmem TCP works better with flow steering support & RSS support, i.e. the NIC's ability to steer flows into certain rx queues. This allows the sysadmin to enable devmem TCP on a subset of the rx queues, and steer devmem TCP traffic onto these queues and non devmem TCP elsewhere. The NIC I have access to with these properties is the GVE with DQO support running in Google Cloud, but any NIC that supports these features would suffice. I may be able to help reviewers bring up devmem TCP on their NICs. * Testing: The series includes a udmabuf kselftest that show a simple use case of devmem TCP and validates the entire data path end to end without a dependency on a specific dmabuf provider. ** Test Setup Kernel: net-next with this series and memory provider API cherry-picked locally. Hardware: Google Cloud A3 VMs. NIC: GVE with header split & RSS & flow steering support. 
Cc: Pavel Begunkov <asml.silence(a)gmail.com> Cc: David Wei <dw(a)davidwei.uk> Cc: Jason Gunthorpe <jgg(a)ziepe.ca> Cc: Yunsheng Lin <linyunsheng(a)huawei.com> Cc: Shailend Chand <shailend(a)google.com> Cc: Harshitha Ramamurthy <hramamurthy(a)google.com> Cc: Shakeel Butt <shakeel.butt(a)linux.dev> Cc: Jeroen de Borst <jeroendb(a)google.com> Cc: Praveen Kaligineedi <pkaligineedi(a)google.com> Cc: Bagas Sanjaya <bagasdotme(a)gmail.com> Cc: Steven Rostedt <rostedt(a)goodmis.org> Cc: Christoph Hellwig <hch(a)infradead.org> Cc: Nikolay Aleksandrov <razor(a)blackwall.org> Cc: Taehee Yoo <ap420073(a)gmail.com> Cc: Donald Hunter <donald.hunter(a)gmail.com> Mina Almasry (13): netdev: add netdev_rx_queue_restart() net: netdev netlink api to bind dma-buf to a net device netdev: support binding dma-buf to netdevice netdev: netdevice devmem allocator page_pool: devmem support memory-provider: dmabuf devmem memory provider net: support non paged skb frags net: add support for skbs with unreadable frags tcp: RX path for devmem TCP net: add SO_DEVMEM_DONTNEED setsockopt to release RX frags net: add devmem TCP documentation selftests: add ncdevmem, netcat for devmem TCP netdev: add dmabuf introspection Documentation/netlink/specs/netdev.yaml | 61 +++ Documentation/networking/devmem.rst | 269 +++++++++++ Documentation/networking/index.rst | 1 + arch/alpha/include/uapi/asm/socket.h | 6 + arch/mips/include/uapi/asm/socket.h | 6 + arch/parisc/include/uapi/asm/socket.h | 6 + arch/sparc/include/uapi/asm/socket.h | 6 + include/linux/netdevice.h | 2 + include/linux/skbuff.h | 61 ++- include/linux/skbuff_ref.h | 9 +- include/linux/socket.h | 1 + include/net/devmem.h | 133 ++++++ include/net/mp_dmabuf_devmem.h | 44 ++ include/net/netdev_rx_queue.h | 5 + include/net/netmem.h | 169 ++++++- include/net/page_pool/helpers.h | 39 +- include/net/page_pool/types.h | 22 +- include/net/sock.h | 2 + include/net/tcp.h | 5 +- include/trace/events/page_pool.h | 12 +- include/uapi/asm-generic/socket.h | 6 
+ include/uapi/linux/netdev.h | 13 + include/uapi/linux/uio.h | 17 + net/core/Makefile | 3 +- net/core/datagram.c | 6 + net/core/dev.c | 24 +- net/core/devmem.c | 382 ++++++++++++++++ net/core/gro.c | 3 +- net/core/netdev-genl-gen.c | 23 + net/core/netdev-genl-gen.h | 6 + net/core/netdev-genl.c | 118 +++++ net/core/netdev_rx_queue.c | 81 ++++ net/core/netmem_priv.h | 31 ++ net/core/page_pool.c | 117 +++-- net/core/page_pool_priv.h | 46 ++ net/core/page_pool_user.c | 29 ++ net/core/skbuff.c | 77 +++- net/core/sock.c | 68 +++ net/ethtool/common.c | 8 + net/ipv4/esp4.c | 3 +- net/ipv4/tcp.c | 261 ++++++++++- net/ipv4/tcp_input.c | 13 +- net/ipv4/tcp_ipv4.c | 16 + net/ipv4/tcp_minisocks.c | 2 + net/ipv4/tcp_output.c | 5 +- net/ipv6/esp6.c | 3 +- net/packet/af_packet.c | 4 +- net/xdp/xsk_buff_pool.c | 5 + tools/include/uapi/linux/netdev.h | 13 + tools/testing/selftests/net/.gitignore | 1 + tools/testing/selftests/net/Makefile | 9 + tools/testing/selftests/net/ncdevmem.c | 570 ++++++++++++++++++++++++ 52 files changed, 2701 insertions(+), 121 deletions(-) create mode 100644 Documentation/networking/devmem.rst create mode 100644 include/net/devmem.h create mode 100644 include/net/mp_dmabuf_devmem.h create mode 100644 net/core/devmem.c create mode 100644 net/core/netdev_rx_queue.c create mode 100644 net/core/netmem_priv.h create mode 100644 tools/testing/selftests/net/ncdevmem.c -- 2.46.0.295.g3b9ea8a38a-goog
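The header split requirement described in the cover letter above can be pictured with a toy model. The sketch below is purely illustrative and is not part of the series: split_frame() is a hypothetical name, both destination buffers are ordinary host memory, and the copy is done by the CPU. In devmem TCP the split is performed by the NIC, and the payload buffer would be device memory that the host cannot read.

```c
#include <stddef.h>
#include <string.h>

/*
 * Toy model of header split: the first hdr_len bytes of an incoming
 * frame go to a "host" buffer, the remainder to a "device" buffer.
 * Returns the payload length, or -1 if the inputs do not fit.
 */
long split_frame(const unsigned char *frame, size_t frame_len, size_t hdr_len,
		 unsigned char *hdr_buf, size_t hdr_cap,
		 unsigned char *payload_buf, size_t payload_cap)
{
	size_t payload_len;

	if (hdr_len > frame_len || hdr_len > hdr_cap)
		return -1;

	payload_len = frame_len - hdr_len;
	if (payload_len > payload_cap)
		return -1;

	/* Headers stay where the TCP/IP stack can parse them. */
	memcpy(hdr_buf, frame, hdr_len);
	/* Payload lands in the buffer standing in for device memory. */
	memcpy(payload_buf, frame + hdr_len, payload_len);
	return (long)payload_len;
}
```

The point of the model is only the separation of destinations: the stack needs to read headers, while payload bytes never have to pass through host memory.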

10 months, 1 week


[PATCH v11 00/39] arm64/gcs: Provide support for GCS in userspace

by Mark Brown

The arm64 Guarded Control Stack (GCS) feature provides support for hardware protected stacks of return addresses, intended to provide hardening against return oriented programming (ROP) attacks and to make it easier to gather call stacks for applications such as profiling. When GCS is active a secondary stack called the Guarded Control Stack is maintained, protected with a memory attribute which means that it can only be written with specific GCS operations. The current GCS pointer can not be directly written to by userspace. When a BL is executed the value stored in LR is also pushed onto the GCS, and when a RET is executed the top of the GCS is popped and compared to LR with a fault being raised if the values do not match. GCS operations may only be performed on GCS pages, a data abort is generated if they are not. The combination of hardware enforcement and lack of extra instructions in the function entry and exit paths should result in something which has less overhead and is more difficult to attack than a purely software implementation like clang's shadow stacks. This series implements support for use of GCS by userspace, along with support for use of GCS within KVM guests. It does not enable use of GCS by either EL1 or EL2, this will be implemented separately. Executables are started without GCS and must use a prctl() to enable it, it is expected that this will be done very early in application execution by the dynamic linker or other startup code. For dynamic linking this will be done by checking that everything in the executable is marked as GCS compatible. x86 has an equivalent feature called shadow stacks, this series depends on the x86 patches for generic memory management support for the new guarded/shadow stack page type and shares APIs as much as possible. 
As there has been extensive discussion with the wider community around the ABI for shadow stacks I have as far as practical kept implementation decisions close to those for x86, anticipating that review would lead to similar conclusions in the absence of strong reasoning for divergence.

The main divergence I am conscious of is that x86 allows shadow stack to be enabled and disabled repeatedly, freeing the shadow stack for the thread whenever disabled, while this implementation keeps the GCS allocated after disable but refuses to reenable it. This is to avoid races with things actively walking the GCS during a disable, we do anticipate that some systems will wish to disable GCS at runtime but are not aware of any demand for subsequently reenabling it.

x86 uses an arch_prctl() to manage enable and disable, since only x86 and S/390 use arch_prctl() a generic prctl() was proposed[1] as part of a patch set for the equivalent RISC-V Zicfiss feature which I initially adopted fairly directly but following review feedback has been revised quite a bit.

We currently maintain the x86 pattern of implicitly allocating a shadow stack for threads started with shadow stack enabled, there has been some discussion of removing this support and requiring the use of clone3() with explicit allocation of shadow stacks instead. I have no strong feelings either way, implicit allocation is not really consistent with anything else we do and creates the potential for errors around thread exit but on the other hand it is existing ABI on x86 and minimises the changes needed in userspace code.

glibc and bionic changes using this ABI have been implemented and tested. Headless Android systems have been validated and Ross Burton has used this code to bring up a Yocto system with GCS enabled as standard, a test implementation of V8 support has also been done.

There is an open issue with support for CRIU, on x86 this required the ability to set the GCS mode via ptrace.
This series supports configuring mode bits other than enable/disable via ptrace but it needs to be confirmed if this is sufficient.

It is likely that we could relax some of the barriers added here with some more targeted placements, this is left for further study.

There is an in process series adding clone3() support for shadow stacks:

https://lore.kernel.org/r/20240819-clone3-shadow-stack-v9-0-962d74f99464@ke…

Previous versions of this series depended on that, this dependency has been removed in order to make merging easier.

[1] https://lore.kernel.org/lkml/20230213045351.3945824-1-debug@rivosinc.com/

Signed-off-by: Mark Brown <broonie(a)kernel.org>
---
Changes in v11:
- Remove the dependency on the addition of clone3() support for shadow stacks, rebasing onto v6.11-rc3.
- Make ID_AA64PFR1_EL1.GCS writeable in KVM.
- Hide GCS registers when GCS is not enabled for KVM guests.
- Require HCRX_EL2.GCSEn if booting at EL1.
- Require that GCSCR_EL1 and GCSCRE0_EL1 be initialised regardless of if we boot at EL2 or EL1.
- Remove some stray use of bit 63 in signal cap tokens.
- Warn if we see a GCS with VM_SHARED.
- Remove redundant check for VM_WRITE in fault handling.
- Cleanups and clarifications in the ABI document.
- Clean up and improve documentation of some sync placement.
- Only set the EL0 GCS mode if it's actually changed.
- Various minor fixes and tweaks.
- Link to v10: https://lore.kernel.org/r/20240801-arm64-gcs-v10-0-699e2bd2190b@kernel.org

Changes in v10:
- Fix issues with THP.
- Tighten up requirements for initialising GCSCR*.
- Only generate GCS signal frames for threads using GCS.
- Only context switch EL1 GCS registers if S1PIE is enabled.
- Move context switch of GCSCRE0_EL1 to EL0 context switch.
- Make GCS registers unconditionally visible to userspace.
- Use FHU infrastructure.
- Don't change writability of ID_AA64PFR1_EL1 for KVM.
- Remove unused arguments from alloc_gcs().
- Typo fixes.
- Link to v9: https://lore.kernel.org/r/20240625-arm64-gcs-v9-0-0f634469b8f0@kernel.org

Changes in v9:
- Rebase onto v6.10-rc3.
- Restructure and clarify memory management fault handling.
- Fix up basic-gcs for the latest clone3() changes.
- Convert to newly merged KVM ID register based feature configuration.
- Fixes for NV traps.
- Link to v8: https://lore.kernel.org/r/20240203-arm64-gcs-v8-0-c9fec77673ef@kernel.org

Changes in v8:
- Invalidate signal cap token on stack when consuming.
- Typo and other trivial fixes.
- Don't try to use process_vm_write() on GCS, it intentionally does not work.
- Fix leak of thread GCSs.
- Rebase onto latest clone3() series.
- Link to v7: https://lore.kernel.org/r/20231122-arm64-gcs-v7-0-201c483bd775@kernel.org

Changes in v7:
- Rebase onto v6.7-rc2 via the clone3() patch series.
- Change the token used to cap the stack during signal handling to be compatible with GCSPOPM.
- Fix flags for new page types.
- Fold in support for clone3().
- Replace copy_to_user_gcs() with put_user_gcs().
- Link to v6: https://lore.kernel.org/r/20231009-arm64-gcs-v6-0-78e55deaa4dd@kernel.org

Changes in v6:
- Rebase onto v6.6-rc3.
- Add some more gcsb_dsync() barriers following spec clarifications.
- Due to ongoing discussion around clone()/clone3() I've not updated anything there, the behaviour is the same as on previous versions.
- Link to v5: https://lore.kernel.org/r/20230822-arm64-gcs-v5-0-9ef181dd6324@kernel.org

Changes in v5:
- Don't map any permissions for user GCSs, we always use EL0 accessors or use a separate mapping of the page.
- Reduce the standard size of the GCS to RLIMIT_STACK/2.
- Enforce a PAGE_SIZE alignment requirement on map_shadow_stack().
- Clarifications and fixes to documentation.
- More tests.
- Link to v4: https://lore.kernel.org/r/20230807-arm64-gcs-v4-0-68cfa37f9069@kernel.org

Changes in v4:
- Implement flags for map_shadow_stack() allowing the cap and end of stack marker to be enabled independently or not at all.
- Relax size and alignment requirements for map_shadow_stack().
- Add more blurb explaining the advantages of hardware enforcement.
- Link to v3: https://lore.kernel.org/r/20230731-arm64-gcs-v3-0-cddf9f980d98@kernel.org

Changes in v3:
- Rebase onto v6.5-rc4.
- Add a GCS barrier on context switch.
- Add a GCS stress test.
- Link to v2: https://lore.kernel.org/r/20230724-arm64-gcs-v2-0-dc2c1d44c2eb@kernel.org

Changes in v2:
- Rebase onto v6.5-rc3.
- Rework prctl() interface to allow each bit to be locked independently.
- map_shadow_stack() now places the cap token based on the size requested by the caller not the actual space allocated.
- Mode changes other than enable via ptrace are now supported.
- Expand test coverage.
- Various smaller fixes and adjustments.
- Link to v1: https://lore.kernel.org/r/20230716-arm64-gcs-v1-0-bf567f93bba6@kernel.org

---
Mark Brown (39): mm: Introduce ARCH_HAS_USER_SHADOW_STACK arm64/mm: Restructure arch_validate_flags() for extensibility prctl: arch-agnostic prctl for shadow stack mman: Add map_shadow_stack() flags arm64: Document boot requirements for Guarded Control Stacks arm64/gcs: Document the ABI for Guarded Control Stacks arm64/sysreg: Add definitions for architected GCS caps arm64/gcs: Add manual encodings of GCS instructions arm64/gcs: Provide put_user_gcs() arm64/gcs: Provide basic EL2 setup to allow GCS usage at EL0 and EL1 arm64/cpufeature: Runtime detection of Guarded Control Stack (GCS) arm64/mm: Allocate PIE slots for EL0 guarded control stack mm: Define VM_SHADOW_STACK for arm64 when we support GCS arm64/mm: Map pages for guarded control stack KVM: arm64: Manage GCS access and registers for guests arm64/idreg: Add overrride for GCS arm64/hwcap: Add hwcap for GCS arm64/traps: Handle GCS exceptions arm64/mm: Handle GCS data aborts arm64/gcs: Context switch GCS state for EL0 arm64/gcs: Ensure that new threads have a GCS arm64/gcs: Implement shadow stack prctl() interface arm64/mm: Implement map_shadow_stack()
arm64/signal: Set up and restore the GCS context for signal handlers arm64/signal: Expose GCS state in signal frames arm64/ptrace: Expose GCS via ptrace and core files arm64: Add Kconfig for Guarded Control Stack (GCS) kselftest/arm64: Verify the GCS hwcap kselftest/arm64: Add GCS as a detected feature in the signal tests kselftest/arm64: Add framework support for GCS to signal handling tests kselftest/arm64: Allow signals tests to specify an expected si_code kselftest/arm64: Always run signals tests with GCS enabled kselftest/arm64: Add very basic GCS test program kselftest/arm64: Add a GCS test program built with the system libc kselftest/arm64: Add test coverage for GCS mode locking kselftest/arm64: Add GCS signal tests kselftest/arm64: Add a GCS stress test kselftest/arm64: Enable GCS for the FP stress tests KVM: selftests: arm64: Add GCS registers to get-reg-list Documentation/admin-guide/kernel-parameters.txt | 3 + Documentation/arch/arm64/booting.rst | 32 + Documentation/arch/arm64/elf_hwcaps.rst | 2 + Documentation/arch/arm64/gcs.rst | 230 +++++++ Documentation/arch/arm64/index.rst | 1 + Documentation/filesystems/proc.rst | 2 +- arch/arm64/Kconfig | 20 + arch/arm64/include/asm/cpufeature.h | 6 + arch/arm64/include/asm/el2_setup.h | 29 + arch/arm64/include/asm/esr.h | 28 +- arch/arm64/include/asm/exception.h | 2 + arch/arm64/include/asm/gcs.h | 107 +++ arch/arm64/include/asm/hwcap.h | 1 + arch/arm64/include/asm/kvm_host.h | 12 + arch/arm64/include/asm/mman.h | 23 +- arch/arm64/include/asm/pgtable-prot.h | 14 +- arch/arm64/include/asm/processor.h | 7 + arch/arm64/include/asm/sysreg.h | 20 + arch/arm64/include/asm/uaccess.h | 40 ++ arch/arm64/include/asm/vncr_mapping.h | 2 + arch/arm64/include/uapi/asm/hwcap.h | 1 + arch/arm64/include/uapi/asm/ptrace.h | 8 + arch/arm64/include/uapi/asm/sigcontext.h | 9 + arch/arm64/kernel/cpufeature.c | 12 + arch/arm64/kernel/cpuinfo.c | 1 + arch/arm64/kernel/entry-common.c | 23 + arch/arm64/kernel/pi/idreg-override.c | 2 + 
arch/arm64/kernel/process.c | 88 +++ arch/arm64/kernel/ptrace.c | 54 ++ arch/arm64/kernel/signal.c | 225 ++++++- arch/arm64/kernel/traps.c | 11 + arch/arm64/kvm/hyp/include/hyp/sysreg-sr.h | 49 +- arch/arm64/kvm/sys_regs.c | 27 +- arch/arm64/mm/Makefile | 1 + arch/arm64/mm/fault.c | 40 ++ arch/arm64/mm/gcs.c | 252 +++++++ arch/arm64/mm/mmap.c | 10 +- arch/arm64/tools/cpucaps | 1 + arch/x86/Kconfig | 1 + arch/x86/include/uapi/asm/mman.h | 3 - fs/proc/task_mmu.c | 2 +- include/linux/mm.h | 18 +- include/uapi/asm-generic/mman.h | 4 + include/uapi/linux/elf.h | 1 + include/uapi/linux/prctl.h | 22 + kernel/sys.c | 30 + mm/Kconfig | 6 + tools/testing/selftests/arm64/Makefile | 2 +- tools/testing/selftests/arm64/abi/hwcap.c | 19 + tools/testing/selftests/arm64/fp/assembler.h | 15 + tools/testing/selftests/arm64/fp/fpsimd-test.S | 2 + tools/testing/selftests/arm64/fp/sve-test.S | 2 + tools/testing/selftests/arm64/fp/za-test.S | 2 + tools/testing/selftests/arm64/fp/zt-test.S | 2 + tools/testing/selftests/arm64/gcs/.gitignore | 5 + tools/testing/selftests/arm64/gcs/Makefile | 24 + tools/testing/selftests/arm64/gcs/asm-offsets.h | 0 tools/testing/selftests/arm64/gcs/basic-gcs.c | 357 ++++++++++ tools/testing/selftests/arm64/gcs/gcs-locking.c | 200 ++++++ .../selftests/arm64/gcs/gcs-stress-thread.S | 311 +++++++++ tools/testing/selftests/arm64/gcs/gcs-stress.c | 530 +++++++++++++++ tools/testing/selftests/arm64/gcs/gcs-util.h | 100 +++ tools/testing/selftests/arm64/gcs/libc-gcs.c | 728 +++++++++++++++++++++ tools/testing/selftests/arm64/signal/.gitignore | 1 + .../testing/selftests/arm64/signal/test_signals.c | 17 +- .../testing/selftests/arm64/signal/test_signals.h | 6 + .../selftests/arm64/signal/test_signals_utils.c | 32 +- .../selftests/arm64/signal/test_signals_utils.h | 39 ++ .../arm64/signal/testcases/gcs_exception_fault.c | 62 ++ .../selftests/arm64/signal/testcases/gcs_frame.c | 88 +++ .../arm64/signal/testcases/gcs_write_fault.c | 67 ++ 
.../selftests/arm64/signal/testcases/testcases.c | 7 + .../selftests/arm64/signal/testcases/testcases.h | 1 + tools/testing/selftests/kvm/aarch64/get-reg-list.c | 28 + 74 files changed, 4086 insertions(+), 43 deletions(-) --- base-commit: 7c626ce4bae1ac14f60076d00eafe71af30450ba change-id: 20230303-arm64-gcs-e311ab0d8729 Best regards, -- Mark Brown <broonie(a)kernel.org>

10 months, 1 week


[PATCH] MAINTAINERS: Add selftests/x86 entry

by Muhammad Usama Anjum

There are no maintainers specified for tools/testing/selftests/x86. Shuah has mentioned [1] that the patches should go through x86 tree or in special cases directly to Shuah's tree after getting ack-ed from x86 maintainers. Different people have been confused when sending patches as correct maintainers aren't found by get_maintainer.pl script. Fix this by adding entry to MAINTAINERS file.

[1] https://lore.kernel.org/all/90dc0dfc-4c67-4ea1-b705-0585d6e2ec47@linuxfound…

Signed-off-by: Muhammad Usama Anjum <usama.anjum(a)collabora.com>
---
 MAINTAINERS | 1 +
 1 file changed, 1 insertion(+)

diff --git a/MAINTAINERS b/MAINTAINERS
index 523d84b2d6139..f3a17e5d954a3 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -24378,6 +24378,7 @@ T:	git git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git x86/core
 F:	Documentation/arch/x86/
 F:	Documentation/devicetree/bindings/x86/
 F:	arch/x86/
+F:	tools/testing/selftests/x86

 X86 ENTRY CODE
 M:	Andy Lutomirski <luto(a)kernel.org>
--
2.39.2

10 months, 1 week

