This series implements feature detection of hardware virtualization on
Linux and macOS; the latter being my primary use case.
This yields approximately a 6x improvement using HVF on M3 Pro.
Signed-off-by: Tamir Duberstein <tamird(a)gmail.com>
---
Changes in v2:
- Use QEMU accelerator fallback (Alyssa Ross, Thomas Weißschuh).
- Link to v1: https://lore.kernel.org/r/20241025-kunit-qemu-accel-macos-v1-0-2f30c26192d4…
---
Tamir Duberstein (2):
kunit: add fallback for os.sched_getaffinity
kunit: enable hardware acceleration when available
tools/testing/kunit/kunit.py | 11 ++++++++++-
tools/testing/kunit/kunit_kernel.py | 3 +++
tools/testing/kunit/qemu_configs/arm64.py | 2 +-
3 files changed, 14 insertions(+), 2 deletions(-)
---
base-commit: 81983758430957d9a5cb3333fe324fd70cf63e7e
change-id: 20241025-kunit-qemu-accel-macos-2840e4c2def5
Best regards,
--
Tamir Duberstein <tamird(a)gmail.com>
This patch series extends the sev_init2 and the sev_smoke test to
exercise the SEV-SNP VM launch workflow.
Primarily, it introduces the architectural defines, its support in the SEV
library and extends the tests to interact with the SEV-SNP ioctl()
wrappers.
Patch 1 - Do not advertize SNP on incompatible firmware
Patch 2 - SNP test for KVM_SEV_INIT2
Patch 3 - Add VMGEXIT helper
Patch 4 - Introduce SEV+ VM type check
Patch 5 - SNP iotcl() plumbing for the SEV library
Patch 6 - Force set GUEST_MEMFD for SNP
Patch 7 - Cleanups of smoke test - Decouple policy from type
Patch 8 - SNP smoke test
v4:
1. Remove SNP FW API version check in the test and ensure the KVM
capability advertizes the presence of the feature. Retain the minimum
version definitions to exercise these API versions in the smoke test.
2. Retained only the SNP smoke test and SNP_INIT2 test
3. The SNP architectural defined merged with SNP_INIT2 test patch
4. SNP shutdown merged with SNP smoke test patch
5. Add SEV VM type check to abstract comparisons and reduce clutter
6. Define a SNP default policy which sets bits based on the presence of
SMT
7. Decouple privatization and encryption for it to be SNP agnostic
8. Assert for only positive tests using vm_ioctl()
9. Dropped tested-by tags
In summary - based on comments from Sean, I have primarily reduced the
scope of this patch series to focus on breaking down the SNP smoke test
patch (v3 - patch2) to first introduce SEV-SNP support and use this
interface to extend the sev_init2 and the sev_smoke test.
The rest of the v3 patchset that introduces ioctl, pre fault, fallocate
and negative tests, will be re-worked and re-introduced subsequently in
future patch series post addressing the issues discussed.
v3:
https://lore.kernel.org/kvm/20240905124107.6954-1-pratikrajesh.sampat@amd.c…
1. Remove the assignments for the prefault and fallocate test type
enums.
2. Fix error message for sev launch measure and finish.
3. Collect tested-by tags [Peter, Srikanth]
Any feedback/review is highly appreciated!
Pratik R. Sampat (8):
KVM: SEV: Disable SEV-SNP on FW validation failure
KVM: selftests: SEV-SNP test for KVM_SEV_INIT2
KVM: selftests: Add VMGEXIT helper
KVM: selftests: Introduce SEV VM type check
KVM: selftests: Add library support for interacting with SNP
KVM: selftests: Force GUEST_MEMFD flag for SNP VM type
KVM: selftests: Abstractions for SEV to decouple policy from type
KVM: selftests: Add a basic SEV-SNP smoke test
arch/x86/kvm/svm/sev.c | 4 +-
drivers/crypto/ccp/sev-dev.c | 6 ++
include/linux/psp-sev.h | 3 +
.../selftests/kvm/include/x86_64/processor.h | 1 +
.../selftests/kvm/include/x86_64/sev.h | 55 ++++++++++-
tools/testing/selftests/kvm/lib/kvm_util.c | 7 +-
.../selftests/kvm/lib/x86_64/processor.c | 4 +-
tools/testing/selftests/kvm/lib/x86_64/sev.c | 98 ++++++++++++++++++-
.../selftests/kvm/x86_64/sev_init2_tests.c | 13 +++
.../selftests/kvm/x86_64/sev_smoke_test.c | 96 ++++++++++++++----
10 files changed, 258 insertions(+), 29 deletions(-)
--
2.43.0
The new option controls tests run on boot or module load. With the new
debugfs "run" dentry allowing to run tests on demand, an ability to disable
automatic tests run becomes a useful option in case of intrusive tests.
The option is set to true by default to preserve the existent behavior. It
can be overridden by either the corresponding module option or by the
corresponding config build option.
Signed-off-by: Stanislav Kinsburskii <skinsburskii(a)linux.microsoft.com>
---
include/kunit/test.h | 4 +++-
lib/kunit/Kconfig | 12 ++++++++++++
lib/kunit/debugfs.c | 2 +-
lib/kunit/executor.c | 18 +++++++++++++++++-
lib/kunit/test.c | 6 ++++--
5 files changed, 37 insertions(+), 5 deletions(-)
diff --git a/include/kunit/test.h b/include/kunit/test.h
index 34b71e42fb10..58dbab60f853 100644
--- a/include/kunit/test.h
+++ b/include/kunit/test.h
@@ -312,6 +312,7 @@ static inline void kunit_set_failure(struct kunit *test)
}
bool kunit_enabled(void);
+bool kunit_autorun(void);
const char *kunit_action(void);
const char *kunit_filter_glob(void);
char *kunit_filter(void);
@@ -334,7 +335,8 @@ kunit_filter_suites(const struct kunit_suite_set *suite_set,
int *err);
void kunit_free_suite_set(struct kunit_suite_set suite_set);
-int __kunit_test_suites_init(struct kunit_suite * const * const suites, int num_suites);
+int __kunit_test_suites_init(struct kunit_suite * const * const suites, int num_suites,
+ bool run_tests);
void __kunit_test_suites_exit(struct kunit_suite **suites, int num_suites);
diff --git a/lib/kunit/Kconfig b/lib/kunit/Kconfig
index 34d7242d526d..a97897edd964 100644
--- a/lib/kunit/Kconfig
+++ b/lib/kunit/Kconfig
@@ -81,4 +81,16 @@ config KUNIT_DEFAULT_ENABLED
In most cases this should be left as Y. Only if additional opt-in
behavior is needed should this be set to N.
+config KUNIT_AUTORUN_ENABLED
+ bool "Default value of kunit.autorun"
+ default y
+ help
+ Sets the default value of kunit.autorun. If set to N then KUnit
+ tests will not run after initialization unless kunit.autorun=1 is
+ passed to the kernel command line. The test can still be run manually
+ via debugfs interface.
+
+ In most cases this should be left as Y. Only if additional opt-in
+ behavior is needed should this be set to N.
+
endif # KUNIT
diff --git a/lib/kunit/debugfs.c b/lib/kunit/debugfs.c
index d548750a325a..9df064f40d98 100644
--- a/lib/kunit/debugfs.c
+++ b/lib/kunit/debugfs.c
@@ -145,7 +145,7 @@ static ssize_t debugfs_run(struct file *file,
struct inode *f_inode = file->f_inode;
struct kunit_suite *suite = (struct kunit_suite *) f_inode->i_private;
- __kunit_test_suites_init(&suite, 1);
+ __kunit_test_suites_init(&suite, 1, true);
return count;
}
diff --git a/lib/kunit/executor.c b/lib/kunit/executor.c
index 34b7b6833df3..340723571b0f 100644
--- a/lib/kunit/executor.c
+++ b/lib/kunit/executor.c
@@ -29,6 +29,22 @@ const char *kunit_action(void)
return action_param;
}
+/*
+ * Run KUnit tests after initialization
+ */
+#ifdef CONFIG_KUNIT_AUTORUN_ENABLED
+static bool autorun_param = true;
+#else
+static bool autorun_param;
+#endif
+module_param_named(autorun, autorun_param, bool, 0);
+MODULE_PARM_DESC(autorun, "Run KUnit tests after initialization");
+
+bool kunit_autorun(void)
+{
+ return autorun_param;
+}
+
static char *filter_glob_param;
static char *filter_param;
static char *filter_action_param;
@@ -266,7 +282,7 @@ void kunit_exec_run_tests(struct kunit_suite_set *suite_set, bool builtin)
pr_info("1..%zu\n", num_suites);
}
- __kunit_test_suites_init(suite_set->start, num_suites);
+ __kunit_test_suites_init(suite_set->start, num_suites, kunit_autorun());
}
void kunit_exec_list_tests(struct kunit_suite_set *suite_set, bool include_attr)
diff --git a/lib/kunit/test.c b/lib/kunit/test.c
index 089c832e3cdb..146d1b48a096 100644
--- a/lib/kunit/test.c
+++ b/lib/kunit/test.c
@@ -708,7 +708,8 @@ bool kunit_enabled(void)
return enable_param;
}
-int __kunit_test_suites_init(struct kunit_suite * const * const suites, int num_suites)
+int __kunit_test_suites_init(struct kunit_suite * const * const suites, int num_suites,
+ bool run_tests)
{
unsigned int i;
@@ -731,7 +732,8 @@ int __kunit_test_suites_init(struct kunit_suite * const * const suites, int num_
for (i = 0; i < num_suites; i++) {
kunit_init_suite(suites[i]);
- kunit_run_tests(suites[i]);
+ if (run_tests)
+ kunit_run_tests(suites[i]);
}
static_branch_dec(&kunit_running);
PTRACE_SET_SYSCALL_INFO is a generic ptrace API that complements
PTRACE_GET_SYSCALL_INFO by letting the ptracer modify details of
system calls the tracee is blocked in.
This API allows ptracers to obtain and modify system call details
in a straightforward and architecture-agnostic way.
Current implementation supports changing only those bits of system call
information that are used by strace, namely, syscall number, syscall
arguments, and syscall return value.
Support of changing additional details returned by PTRACE_GET_SYSCALL_INFO,
such as instruction pointer and stack pointer, could be added later
if needed, by using struct ptrace_syscall_info.flags to specify
the additional details that should be set. Currently, flags and reserved
fields of struct ptrace_syscall_info must be initialized with zeroes;
arch, instruction_pointer, and stack_pointer fields are ignored.
PTRACE_SET_SYSCALL_INFO currently supports only PTRACE_SYSCALL_INFO_ENTRY,
PTRACE_SYSCALL_INFO_EXIT, and PTRACE_SYSCALL_INFO_SECCOMP operations.
Other operations could be added later if needed.
Ideally, PTRACE_SET_SYSCALL_INFO should have been introduced along with
PTRACE_GET_SYSCALL_INFO, but it didn't happen. The last straw that
convinced me to implement PTRACE_SET_SYSCALL_INFO was apparent failure
to provide an API of changing the first system call argument on riscv
architecture [1].
ptrace(2) man page:
long ptrace(enum __ptrace_request request, pid_t pid, void *addr, void *data);
...
PTRACE_SET_SYSCALL_INFO
Modify information about the system call that caused the stop.
The "data" argument is a pointer to struct ptrace_syscall_info
that specifies the system call information to be set.
The "addr" argument should be set to sizeof(struct ptrace_syscall_info)).
[1] https://lore.kernel.org/all/59505464-c84a-403d-972f-d4b2055eeaac@gmail.com/
---
Notes:
v2:
* Add patch to fix syscall_set_return_value() on powerpc
* Add patch to fix mips_get_syscall_arg() on mips
* Merge two patches adding syscall_set_arguments() implementations
from different sources into a single patch
* Add syscall_set_return_value() implementation on hexagon
* Add syscall_set_return_value() invocation to syscall_set_nr()
on arm and arm64.
* Fix syscall_set_nr() and mips_set_syscall_arg() on mips
* Add a comment to syscall_set_nr() on arc, powerpc, s390, sh,
and sparc
* Remove redundant ptrace_syscall_info.op assignments in
ptrace_get_syscall_info_*
* Minor style tweaks in ptrace_get_syscall_info_op()
* Remove syscall_set_return_value() invocation from
ptrace_set_syscall_info_entry()
* Skip syscall_set_arguments() invocation in case of syscall number -1
in ptrace_set_syscall_info_entry()
* Split ptrace_syscall_info.reserved into ptrace_syscall_info.reserved
and ptrace_syscall_info.flags
* Use __kernel_ulong_t instead of unsigned long in set_syscall_info test
Dmitry V. Levin (7):
powerpc: properly negate error in syscall_set_return_value()
mips: fix mips_get_syscall_arg() for O32 and N32
syscall.h: add syscall_set_arguments() and syscall_set_return_value()
syscall.h: introduce syscall_set_nr()
ptrace_get_syscall_info: factor out ptrace_get_syscall_info_op
ptrace: introduce PTRACE_SET_SYSCALL_INFO request
selftests/ptrace: add a test case for PTRACE_SET_SYSCALL_INFO
arch/arc/include/asm/syscall.h | 25 +
arch/arm/include/asm/syscall.h | 37 ++
arch/arm64/include/asm/syscall.h | 29 ++
arch/csky/include/asm/syscall.h | 13 +
arch/hexagon/include/asm/syscall.h | 21 +
arch/loongarch/include/asm/syscall.h | 15 +
arch/m68k/include/asm/syscall.h | 7 +
arch/microblaze/include/asm/syscall.h | 7 +
arch/mips/include/asm/syscall.h | 72 ++-
arch/nios2/include/asm/syscall.h | 16 +
arch/openrisc/include/asm/syscall.h | 13 +
arch/parisc/include/asm/syscall.h | 19 +
arch/powerpc/include/asm/syscall.h | 26 +-
arch/riscv/include/asm/syscall.h | 16 +
arch/s390/include/asm/syscall.h | 24 +
arch/sh/include/asm/syscall_32.h | 24 +
arch/sparc/include/asm/syscall.h | 22 +
arch/um/include/asm/syscall-generic.h | 19 +
arch/x86/include/asm/syscall.h | 43 ++
arch/xtensa/include/asm/syscall.h | 18 +
include/asm-generic/syscall.h | 30 ++
include/linux/ptrace.h | 3 +
include/uapi/linux/ptrace.h | 4 +-
kernel/ptrace.c | 153 +++++-
tools/testing/selftests/ptrace/Makefile | 2 +-
.../selftests/ptrace/set_syscall_info.c | 441 ++++++++++++++++++
26 files changed, 1052 insertions(+), 47 deletions(-)
create mode 100644 tools/testing/selftests/ptrace/set_syscall_info.c
--
ldv
Enabling cbo.clean and cbo.flush in user mode makes it more
convenient to manage the cache state and achieve better performance.
Reviewed-by: Andrew Jones <ajones(a)ventanamicro.com>
Signed-off-by: Yunhui Cui <cuiyunhui(a)bytedance.com>
---
arch/riscv/kernel/cpufeature.c | 8 ++++++++
1 file changed, 8 insertions(+)
diff --git a/arch/riscv/kernel/cpufeature.c b/arch/riscv/kernel/cpufeature.c
index c0916ed318c2..60d180b98f52 100644
--- a/arch/riscv/kernel/cpufeature.c
+++ b/arch/riscv/kernel/cpufeature.c
@@ -30,6 +30,7 @@
#define NUM_ALPHA_EXTS ('z' - 'a' + 1)
static bool any_cpu_has_zicboz;
+static bool any_cpu_has_zicbom;
unsigned long elf_hwcap __read_mostly;
@@ -87,6 +88,8 @@ static int riscv_ext_zicbom_validate(const struct riscv_isa_ext_data *data,
pr_err("Zicbom disabled as cbom-block-size present, but is not a power-of-2\n");
return -EINVAL;
}
+
+ any_cpu_has_zicbom = true;
return 0;
}
@@ -944,6 +947,11 @@ void __init riscv_user_isa_enable(void)
current->thread.envcfg |= ENVCFG_CBZE;
else if (any_cpu_has_zicboz)
pr_warn("Zicboz disabled as it is unavailable on some harts\n");
+
+ if (riscv_has_extension_unlikely(RISCV_ISA_EXT_ZICBOM))
+ current->thread.envcfg |= ENVCFG_CBCFE;
+ else if (any_cpu_has_zicbom)
+ pr_warn("Zicbom disabled as it is unavailable on some harts\n");
}
#ifdef CONFIG_RISCV_ALTERNATIVE
--
2.39.2
Running "make kselftest TARGETS=net/forwarding" results in several
occurrences of the same error:
./lib.sh: line 787: teamd: command not found
Since many tests depends on teamd, this fix stops the tests if the
teamd command is not installed.
Signed-off-by: Alessandro Zanni <alessandro.zanni87(a)gmail.com>
---
tools/testing/selftests/net/forwarding/lib.sh | 1 +
1 file changed, 1 insertion(+)
diff --git a/tools/testing/selftests/net/forwarding/lib.sh b/tools/testing/selftests/net/forwarding/lib.sh
index 7337f398f9cc..a6a74a4be4bf 100644
--- a/tools/testing/selftests/net/forwarding/lib.sh
+++ b/tools/testing/selftests/net/forwarding/lib.sh
@@ -784,6 +784,7 @@ team_destroy()
{
local if_name=$1; shift
+ require_command $TEAMD
$TEAMD -t $if_name -k
}
--
2.43.0
The selftest started failing since commit e93d2521b27f
("x86/vdso: Split virtual clock pages into dedicated mapping")
was merged. While debugging I stumbled upon some memory usage
optimizations.
With these test now runs on a VM with only 60MiB of memory.
Signed-off-by: Thomas Weißschuh <thomas.weissschuh(a)linutronix.de>
---
Changes in v2:
- Drop /dev/null usage
- Avoid overcommit restrictions by dropping PROT_WRITE
- Avoid high memory usage due to PTEs
- Link to v1: https://lore.kernel.org/r/20250107-virtual_address_range-tests-v1-0-3834a2f…
---
Thomas Weißschuh (3):
selftests/mm: virtual_address_range: mmap() without PROT_WRITE
selftests/mm: virtual_address_range: Unmap chunks after validation
selftests/mm: virtual_address_range: Avoid reading VVAR mappings
tools/testing/selftests/mm/config | 1 +
tools/testing/selftests/mm/virtual_address_range.c | 34 +++++++++++++++++++---
2 files changed, 31 insertions(+), 4 deletions(-)
---
base-commit: 32af4d2269d20fe2f8d32aaa456cad8e40abd365
change-id: 20250107-virtual_address_range-tests-95843766fa97
Best regards,
--
Thomas Weißschuh <thomas.weissschuh(a)linutronix.de>
Notable changes since v16:
* fixed usage of netdev tracker by removing dev_tracker member from
ovpn_priv and adding it to ovpn_peer and ovpn_socket as those are the
objects really holding a ref to the netdev
* switched ovpn_get_dev_from_attrs() to GFP_ATOMIC to prevent sleep under
rcu_read_lock
* allocated netdevice_tracker in ovpn_nl_pre_doit() [stored in
user_ptr[1]] to keep track of the netdev reference held during netlink
handler calls
* moved whole socket detaching routine to worker. This way the code is
allowed to sleep and in turn it can be executed under lock_sock. This
lock allows us to happily coordinate concurrent attach/detach calls.
(note: lock is acquired everytime the refcnt for the socket is
decremented, because this guarantees us that setting the refcnt to 0
and detaching the socket will happen atomically)
* dropped kref_put_sock()/refcount handler as it's not required anymore,
thanks to the point above
* re-arranged ovpn_socket_new() in order to simplify error path by first
allocating the new ovpn_sock and then attaching
Please note that some patches were already reviewed/tested by a few
people. iThese patches have retained the tags as they have hardly been
touched.
The latest code can also be found at:
https://github.com/OpenVPN/linux-kernel-ovpn
Thanks a lot!
Best Regards,
Antonio Quartulli
OpenVPN Inc.
---
Antonio Quartulli (25):
net: introduce OpenVPN Data Channel Offload (ovpn)
ovpn: add basic netlink support
ovpn: add basic interface creation/destruction/management routines
ovpn: keep carrier always on for MP interfaces
ovpn: introduce the ovpn_peer object
ovpn: introduce the ovpn_socket object
ovpn: implement basic TX path (UDP)
ovpn: implement basic RX path (UDP)
ovpn: implement packet processing
ovpn: store tunnel and transport statistics
ipv6: export inet6_stream_ops via EXPORT_SYMBOL_GPL
ovpn: implement TCP transport
skb: implement skb_send_sock_locked_with_flags()
ovpn: add support for MSG_NOSIGNAL in tcp_sendmsg
ovpn: implement multi-peer support
ovpn: implement peer lookup logic
ovpn: implement keepalive mechanism
ovpn: add support for updating local UDP endpoint
ovpn: add support for peer floating
ovpn: implement peer add/get/dump/delete via netlink
ovpn: implement key add/get/del/swap via netlink
ovpn: kill key and notify userspace in case of IV exhaustion
ovpn: notify userspace when a peer is deleted
ovpn: add basic ethtool support
testing/selftests: add test tool and scripts for ovpn module
Documentation/netlink/specs/ovpn.yaml | 372 +++
Documentation/netlink/specs/rt_link.yaml | 16 +
MAINTAINERS | 11 +
drivers/net/Kconfig | 15 +
drivers/net/Makefile | 1 +
drivers/net/ovpn/Makefile | 22 +
drivers/net/ovpn/bind.c | 55 +
drivers/net/ovpn/bind.h | 101 +
drivers/net/ovpn/crypto.c | 211 ++
drivers/net/ovpn/crypto.h | 145 ++
drivers/net/ovpn/crypto_aead.c | 382 ++++
drivers/net/ovpn/crypto_aead.h | 33 +
drivers/net/ovpn/io.c | 446 ++++
drivers/net/ovpn/io.h | 34 +
drivers/net/ovpn/main.c | 350 +++
drivers/net/ovpn/main.h | 14 +
drivers/net/ovpn/netlink-gen.c | 213 ++
drivers/net/ovpn/netlink-gen.h | 41 +
drivers/net/ovpn/netlink.c | 1183 ++++++++++
drivers/net/ovpn/netlink.h | 18 +
drivers/net/ovpn/ovpnstruct.h | 54 +
drivers/net/ovpn/peer.c | 1269 +++++++++++
drivers/net/ovpn/peer.h | 164 ++
drivers/net/ovpn/pktid.c | 129 ++
drivers/net/ovpn/pktid.h | 87 +
drivers/net/ovpn/proto.h | 118 +
drivers/net/ovpn/skb.h | 60 +
drivers/net/ovpn/socket.c | 204 ++
drivers/net/ovpn/socket.h | 49 +
drivers/net/ovpn/stats.c | 21 +
drivers/net/ovpn/stats.h | 47 +
drivers/net/ovpn/tcp.c | 565 +++++
drivers/net/ovpn/tcp.h | 33 +
drivers/net/ovpn/udp.c | 421 ++++
drivers/net/ovpn/udp.h | 22 +
include/linux/skbuff.h | 2 +
include/uapi/linux/if_link.h | 15 +
include/uapi/linux/ovpn.h | 111 +
include/uapi/linux/udp.h | 1 +
net/core/skbuff.c | 18 +-
net/ipv6/af_inet6.c | 1 +
tools/testing/selftests/Makefile | 1 +
tools/testing/selftests/net/ovpn/.gitignore | 2 +
tools/testing/selftests/net/ovpn/Makefile | 17 +
tools/testing/selftests/net/ovpn/config | 10 +
tools/testing/selftests/net/ovpn/data64.key | 5 +
tools/testing/selftests/net/ovpn/ovpn-cli.c | 2366 ++++++++++++++++++++
tools/testing/selftests/net/ovpn/tcp_peers.txt | 5 +
.../testing/selftests/net/ovpn/test-chachapoly.sh | 9 +
tools/testing/selftests/net/ovpn/test-float.sh | 9 +
tools/testing/selftests/net/ovpn/test-tcp.sh | 9 +
tools/testing/selftests/net/ovpn/test.sh | 185 ++
tools/testing/selftests/net/ovpn/udp_peers.txt | 5 +
53 files changed, 9672 insertions(+), 5 deletions(-)
---
base-commit: 7b24f164cf005b9649138ef6de94aaac49c9f3d1
change-id: 20241002-b4-ovpn-eeee35c694a2
Best regards,
--
Antonio Quartulli <antonio(a)openvpn.net>
Hi all,
This patch series continues the work to migrate the *.sh tests into
prog_tests.
test_xdp_redirect.sh tests the XDP redirections done through
bpf_redirect().
These XDP redirections are already tested by prog_tests/xdp_do_redirect.c
but IMO it doesn't cover the exact same code path because
xdp_do_redirect.c uses bpf_prog_test_run_opts() to trigger redirections
of 'fake packets' while test_xdp_redirect.sh redirects packets coming
from the network. Also, the test_xdp_redirect.sh script tests the
redirections with both SKB and DRV modes while xdp_do_redirect.c only
tests the DRV mode.
The patch series adds two new test cases in prog_tests/xdp_do_redirect.c
to replace the test_xdp_redirect.sh script.
Signed-off-by: Bastien Curutchet (eBPF Foundation) <bastien.curutchet(a)bootlin.com>
---
Changes in v2:
- Use directly skel->progs instead of 'bpf_object__find_program_by_name()'
- Use 'ip -n NSX' in SYS calls instead of opening NSX with open_netns()
- Use #define for static indexes of veth1 and veth2
- Delete the useless second ping
- Set nstoken to NULL after close_netns()
- Merge the two added tests into one with 3 subtests (one for each flag:
0, DRV, SKB)
- Link to v1: https://lore.kernel.org/r/20250103-xdp_redirect-v1-0-e93099f59069@bootlin.c…
---
Bastien Curutchet (eBPF Foundation) (3):
selftests/bpf: test_xdp_redirect: Rename BPF sections
selftests/bpf: Migrate test_xdp_redirect.sh to xdp_do_redirect.c
selftests/bpf: Migrate test_xdp_redirect.c to test_xdp_do_redirect.c
tools/testing/selftests/bpf/Makefile | 1 -
.../selftests/bpf/prog_tests/xdp_do_redirect.c | 164 +++++++++++++++++++++
.../selftests/bpf/progs/test_xdp_do_redirect.c | 12 ++
.../selftests/bpf/progs/test_xdp_redirect.c | 26 ----
tools/testing/selftests/bpf/test_xdp_redirect.sh | 79 ----------
5 files changed, 176 insertions(+), 106 deletions(-)
---
base-commit: b27feb5365c6a1bf7e71ba5c795717ee0eec298d
change-id: 20241219-xdp_redirect-2b8ec79dc24e
Best regards,
--
Bastien Curutchet (eBPF Foundation) <bastien.curutchet(a)bootlin.com>
`-l2 -v` is a useful combination of flags to dump the entire
verification log. This is helpful when making changes to the verifier,
as you can see what it thinks program one instruction at a time.
This was more or less a hidden feature before. Document it so others can
discover it.
Signed-off-by: Daniel Xu <dxu(a)dxuuu.xyz>
---
tools/testing/selftests/bpf/veristat.c | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)
diff --git a/tools/testing/selftests/bpf/veristat.c b/tools/testing/selftests/bpf/veristat.c
index 974c808f9321..7d0a9cb753e3 100644
--- a/tools/testing/selftests/bpf/veristat.c
+++ b/tools/testing/selftests/bpf/veristat.c
@@ -216,7 +216,8 @@ const char argp_program_doc[] =
"\n"
"USAGE: veristat <obj-file> [<obj-file>...]\n"
" OR: veristat -C <baseline.csv> <comparison.csv>\n"
-" OR: veristat -R <results.csv>\n";
+" OR: veristat -R <results.csv>\n"
+" OR: veristat -v -l2 <to_analyze.bpf.o>\n";
enum {
OPT_LOG_FIXED = 1000,
@@ -228,7 +229,7 @@ static const struct argp_option opts[] = {
{ "version", 'V', NULL, 0, "Print version" },
{ "verbose", 'v', NULL, 0, "Verbose mode" },
{ "debug", 'd', NULL, 0, "Debug mode (turns on libbpf debug logging)" },
- { "log-level", 'l', "LEVEL", 0, "Verifier log level (default 0 for normal mode, 1 for verbose mode)" },
+ { "log-level", 'l', "LEVEL", 0, "Verifier log level (default 0 for normal mode, 1 for verbose mode, 2 for full verification log)" },
{ "log-fixed", OPT_LOG_FIXED, NULL, 0, "Disable verifier log rotation" },
{ "log-size", OPT_LOG_SIZE, "BYTES", 0, "Customize verifier log size (default to 16MB)" },
{ "top-n", 'n', "N", 0, "Emit only up to first N results." },
--
2.47.1
Android uses the ashmem driver [1] for creating shared memory regions
between processes. The ashmem driver exposes an ioctl command for
processes to restrict the permissions an ashmem buffer can be mapped
with.
Buffers are created with the ability to be mapped as readable, writable,
and executable. Processes remove the ability to map some ashmem buffers
as executable to ensure that those buffers cannot be used to inject
malicious code for another process to run. Other buffers retain their
ability to be mapped as executable, as these buffers can be used for
just-in-time (JIT) compilation. So there is a need to be able to remove
the ability to map a buffer as executable on a per-buffer basis.
Android is currently trying to migrate towards replacing its ashmem
driver usage with memfd. Part of the transition involved introducing a
library that serves to abstract away how shared memory regions are
allocated (i.e. ashmem vs memfd). This allows clients to use a single
interface for restricting how a buffer can be mapped without having to
worry about how it is handled for ashmem (through the ioctl
command mentioned earlier) or memfd (through file seals).
While memfd has support for preventing buffers from being mapped as
writable beyond a certain point in time (thanks to
F_SEAL_FUTURE_WRITE), it does not have a similar interface to prevent
buffers from being mapped as executable beyond a certain point.
However, that could be implemented as a file seal (F_SEAL_FUTURE_EXEC)
which works similarly to F_SEAL_FUTURE_WRITE.
F_SEAL_FUTURE_WRITE was chosen as a template for how this new seal
should behave, instead of F_SEAL_WRITE, for the following reasons:
1. Having the new seal behave like F_SEAL_FUTURE_WRITE matches the
behavior that was present with ashmem. This aids in seamlessly
transitioning clients away from ashmem to memfd.
2. Making the new seal behave like F_SEAL_WRITE would mean that no
mappings that could become executable in the future (i.e. via
mprotect()) can exist when the seal is applied. However, there are
known cases (e.g. CursorWindow [2]) where restrictions are applied
on how a buffer can be mapped after a mapping has already been made.
That mapping may have VM_MAYEXEC set, which would not allow the seal
to be applied successfully.
Therefore, the F_SEAL_FUTURE_EXEC seal was designed to have the same
semantics as F_SEAL_FUTURE_WRITE.
Note: this series depends on Lorenzo's work [3], [4], [5] from Andrew
Morton's mm-unstable branch [6], which reworks memfd's file seal checks,
allowing for newer file seals to be implemented in a cleaner fashion.
Changes from v1 ==> v2:
- Changed the return code to be -EPERM instead of -EACCES when
attempting to map an exec sealed file with PROT_EXEC to align
to mmap()'s man page. Thank you Kalesh Singh for spotting this!
- Rebased on top of Lorenzo's work to cleanup memfd file seal checks in
mmap() ([3], [4], and [5]). Thank you for this Lorenzo!
- Changed to deny PROT_EXEC mappings only if the mapping is shared,
instead of for both shared and private mappings, after discussing
this with Lorenzo.
Opens:
- Lorenzo brought up that this patch may negatively impact the usage of
MFD_NOEXEC_SCOPE_NOEXEC_ENFORCED [7]. However, it is not clear to me
why that is the case. At the moment, my intent is for the executable
permissions of the file to be disjoint from the ability to create
executable mappings.
Links:
[1] https://cs.android.com/android/kernel/superproject/+/common-android-mainlin…
[2] https://developer.android.com/reference/android/database/CursorWindow
[3] https://lore.kernel.org/all/cover.1732804776.git.lorenzo.stoakes@oracle.com/
[4] https://lkml.kernel.org/r/20241206212846.210835-1-lorenzo.stoakes@oracle.com
[5] https://lkml.kernel.org/r/7dee6c5d-480b-4c24-b98e-6fa47dbd8a23@lucifer.local
[6] https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git/tree/?h=mm-unst…
[7] https://lore.kernel.org/all/3a53b154-1e46-45fb-a559-65afa7a8a788@lucifer.lo…
Links to previous versions:
v1: https://lore.kernel.org/all/20241206010930.3871336-1-isaacmanjarres@google.…
Isaac J. Manjarres (2):
mm/memfd: Add support for F_SEAL_FUTURE_EXEC to memfd
selftests/memfd: Add tests for F_SEAL_FUTURE_EXEC
include/uapi/linux/fcntl.h | 1 +
mm/memfd.c | 39 ++++++++++-
tools/testing/selftests/memfd/memfd_test.c | 79 ++++++++++++++++++++++
3 files changed, 118 insertions(+), 1 deletion(-)
--
2.47.1.613.gc27f4b7a9f-goog
After reviewing the code, it was found that these macros are never
referenced in the code. Just remove them.
Signed-off-by: Ba Jing <bajing(a)cmss.chinamobile.com>
---
tools/testing/selftests/landlock/ptrace_test.c | 2 --
1 file changed, 2 deletions(-)
diff --git a/tools/testing/selftests/landlock/ptrace_test.c b/tools/testing/selftests/landlock/ptrace_test.c
index a19db4d0b3bd..8f31b673ff2d 100644
--- a/tools/testing/selftests/landlock/ptrace_test.c
+++ b/tools/testing/selftests/landlock/ptrace_test.c
@@ -22,8 +22,6 @@
/* Copied from security/yama/yama_lsm.c */
#define YAMA_SCOPE_DISABLED 0
#define YAMA_SCOPE_RELATIONAL 1
-#define YAMA_SCOPE_CAPABILITY 2
-#define YAMA_SCOPE_NO_ATTACH 3
static void create_domain(struct __test_metadata *const _metadata)
{
--
2.33.0
This series depends on: "[PATCH v2 0/3] tun: Unify vnet implementation
and fill full vnet header"
https://lore.kernel.org/r/20250109-tun-v2-0-388d7d5a287a@daynix.com
virtio-net have two usage of hashes: one is RSS and another is hash
reporting. Conventionally the hash calculation was done by the VMM.
However, computing the hash after the queue was chosen defeats the
purpose of RSS.
Another approach is to use eBPF steering program. This approach has
another downside: it cannot report the calculated hash due to the
restrictive nature of eBPF.
Introduce the code to compute hashes to the kernel in order to overcome
thse challenges.
An alternative solution is to extend the eBPF steering program so that it
will be able to report to the userspace, but it is based on context
rewrites, which is in feature freeze. We can adopt kfuncs, but they will
not be UAPIs. We opt to ioctl to align with other relevant UAPIs (KVM
and vhost_net).
The patches for QEMU to use this new feature was submitted as RFC and
is available at:
https://patchew.org/QEMU/20240915-hash-v3-0-79cb08d28647@daynix.com/
This work was presented at LPC 2024:
https://lpc.events/event/18/contributions/1963/
V1 -> V2:
Changed to introduce a new BPF program type.
Signed-off-by: Akihiko Odaki <akihiko.odaki(a)daynix.com>
---
Changes in v6:
- Extracted changes to fill vnet header holes into another series.
- Squashed patches "skbuff: Introduce SKB_EXT_TUN_VNET_HASH", "tun:
Introduce virtio-net hash reporting feature", and "tun: Introduce
virtio-net RSS" into patch "tun: Introduce virtio-net hash feature".
- Dropped the RFC tag.
- Link to v5: https://lore.kernel.org/r/20241008-rss-v5-0-f3cf68df005d@daynix.com
Changes in v5:
- Fixed a compilation error with CONFIG_TUN_VNET_CROSS_LE.
- Optimized the calculation of the hash value according to:
https://git.dpdk.org/dpdk/commit/?id=3fb1ea032bd6ff8317af5dac9af901f1f324ca…
- Added patch "tun: Unify vnet implementation".
- Dropped patch "tap: Pad virtio header with zero".
- Added patch "selftest: tun: Test vnet ioctls without device".
- Reworked selftests to skip for older kernels.
- Documented the case when the underlying device is deleted and packets
have queue_mapping set by TC.
- Reordered test harness arguments.
- Added code to handle fragmented packets.
- Link to v4: https://lore.kernel.org/r/20240924-rss-v4-0-84e932ec0e6c@daynix.com
Changes in v4:
- Moved tun_vnet_hash_ext to if_tun.h.
- Renamed virtio_net_toeplitz() to virtio_net_toeplitz_calc().
- Replaced htons() with cpu_to_be16().
- Changed virtio_net_hash_rss() to return void.
- Reordered variable declarations in virtio_net_hash_rss().
- Removed virtio_net_hdr_v1_hash_from_skb().
- Updated messages of "tap: Pad virtio header with zero" and
"tun: Pad virtio header with zero".
- Fixed vnet_hash allocation size.
- Ensured to free vnet_hash when destructing tun_struct.
- Link to v3: https://lore.kernel.org/r/20240915-rss-v3-0-c630015db082@daynix.com
Changes in v3:
- Reverted back to add ioctl.
- Split patch "tun: Introduce virtio-net hashing feature" into
"tun: Introduce virtio-net hash reporting feature" and
"tun: Introduce virtio-net RSS".
- Changed to reuse hash values computed for automq instead of performing
RSS hashing when hash reporting is requested but RSS is not.
- Extracted relevant data from struct tun_struct to keep it minimal.
- Added kernel-doc.
- Changed to allow calling TUNGETVNETHASHCAP before TUNSETIFF.
- Initialized num_buffers with 1.
- Added a test case for unclassified packets.
- Fixed error handling in tests.
- Changed tests to verify that the queue index will not overflow.
- Rebased.
- Link to v2: https://lore.kernel.org/r/20231015141644.260646-1-akihiko.odaki@daynix.com
---
Akihiko Odaki (6):
virtio_net: Add functions for hashing
net: flow_dissector: Export flow_keys_dissector_symmetric
tun: Introduce virtio-net hash feature
selftest: tun: Test vnet ioctls without device
selftest: tun: Add tests for virtio-net hashing
vhost/net: Support VIRTIO_NET_F_HASH_REPORT
Documentation/networking/tuntap.rst | 7 +
drivers/net/Kconfig | 1 +
drivers/net/tap.c | 50 ++-
drivers/net/tun.c | 93 ++++--
drivers/net/tun_vnet.c | 167 +++++++++-
drivers/net/tun_vnet.h | 33 +-
drivers/vhost/net.c | 16 +-
include/linux/if_tap.h | 2 +
include/linux/skbuff.h | 3 +
include/linux/virtio_net.h | 188 +++++++++++
include/net/flow_dissector.h | 1 +
include/uapi/linux/if_tun.h | 75 +++++
net/core/flow_dissector.c | 3 +-
net/core/skbuff.c | 4 +
tools/testing/selftests/net/Makefile | 2 +-
tools/testing/selftests/net/tun.c | 630 ++++++++++++++++++++++++++++++++++-
16 files changed, 1224 insertions(+), 51 deletions(-)
---
base-commit: 9b2ffa6148b1e4468d08f7e0e7e371c43cac9ffe
change-id: 20240403-rss-e737d89efa77
prerequisite-change-id: 20241230-tun-66e10a49b0c7:v2
prerequisite-patch-id: 057e888c371f2ce750064b7c40c2cc6abbdf6819
prerequisite-patch-id: 22d53dd3443a2c72496bffb90f19d429972550a3
prerequisite-patch-id: 1520f0c1f7b11559d0898bea556f745f6b8914ac
Best regards,
--
Akihiko Odaki <akihiko.odaki(a)daynix.com>
PTRACE_SET_SYSCALL_INFO is a generic ptrace API that complements
PTRACE_GET_SYSCALL_INFO by letting the ptracer modify details of
system calls the tracee is blocked in.
This API allows ptracers to obtain and modify system call details
in a straightforward and architecture-agnostic way.
Current implementation supports changing only those bits of system call
information that are used by strace, namely, syscall number, syscall
arguments, and syscall return value.
Support of changing additional details returned by PTRACE_GET_SYSCALL_INFO,
such as instruction pointer and stack pointer, could be added later
if needed, by re-using struct ptrace_syscall_info.reserved to specify
the additional details that should be set. Currently, the reserved
field of struct ptrace_syscall_info must be initialized with zeroes;
arch, instruction_pointer, and stack_pointer fields are ignored.
PTRACE_SET_SYSCALL_INFO currently supports only PTRACE_SYSCALL_INFO_ENTRY,
PTRACE_SYSCALL_INFO_EXIT, and PTRACE_SYSCALL_INFO_SECCOMP operations.
Other operations could be added later if needed.
Ideally, PTRACE_SET_SYSCALL_INFO should have been introduced along with
PTRACE_GET_SYSCALL_INFO, but it didn't happen. The last straw that
convinced me to implement PTRACE_SET_SYSCALL_INFO was apparent failure
to provide an API of changing the first system call argument on riscv
architecture [1].
ptrace(2) man page:
long ptrace(enum __ptrace_request request, pid_t pid, void *addr, void *data);
...
PTRACE_SET_SYSCALL_INFO
Modify information about the system call that caused the stop.
The "data" argument is a pointer to struct ptrace_syscall_info
that specifies the system call information to be set.
The "addr" argument should be set to sizeof(struct ptrace_syscall_info)).
[1] https://lore.kernel.org/all/59505464-c84a-403d-972f-d4b2055eeaac@gmail.com/
Dmitry V. Levin (6):
Revert "arch: remove unused function syscall_set_arguments()"
syscall.h: add syscall_set_arguments() on remaining
HAVE_ARCH_TRACEHOOK arches
syscall.h: introduce syscall_set_nr()
ptrace_get_syscall_info: factor out ptrace_get_syscall_info_op
ptrace: introduce PTRACE_SET_SYSCALL_INFO request
selftests/ptrace: add a test case for PTRACE_SET_SYSCALL_INFO
arch/arc/include/asm/syscall.h | 20 +
arch/arm/include/asm/syscall.h | 25 +
arch/arm64/include/asm/syscall.h | 20 +
arch/csky/include/asm/syscall.h | 13 +
arch/hexagon/include/asm/syscall.h | 14 +
arch/loongarch/include/asm/syscall.h | 15 +
arch/m68k/include/asm/syscall.h | 7 +
arch/microblaze/include/asm/syscall.h | 7 +
arch/mips/include/asm/syscall.h | 53 +++
arch/nios2/include/asm/syscall.h | 16 +
arch/openrisc/include/asm/syscall.h | 13 +
arch/parisc/include/asm/syscall.h | 19 +
arch/powerpc/include/asm/syscall.h | 15 +
arch/riscv/include/asm/syscall.h | 16 +
arch/s390/include/asm/syscall.h | 19 +
arch/sh/include/asm/syscall_32.h | 19 +
arch/sparc/include/asm/syscall.h | 17 +
arch/um/include/asm/syscall-generic.h | 19 +
arch/x86/include/asm/syscall.h | 43 ++
arch/xtensa/include/asm/syscall.h | 18 +
include/asm-generic/syscall.h | 30 ++
include/linux/ptrace.h | 3 +
include/uapi/linux/ptrace.h | 3 +-
kernel/ptrace.c | 154 ++++++-
tools/testing/selftests/ptrace/Makefile | 2 +-
.../selftests/ptrace/set_syscall_info.c | 436 ++++++++++++++++++
26 files changed, 994 insertions(+), 22 deletions(-)
create mode 100644 tools/testing/selftests/ptrace/set_syscall_info.c
--
ldv
Implement comprehensive testing for netconsole userdata entry handling,
demonstrating correct behavior when creating maximum entries and
preventing unauthorized overflow.
Refactor existing test infrastructure to support modular, reusable
helper functions that validate strict entry limit enforcement.
Also, add a warning if update_userdata() sees more than
MAX_USERDATA_ITEMS entries. This shouldn't happen and it is a bug that
shouldn't be silently ignored.
Signed-off-by: Breno Leitao <leitao(a)debian.org>
---
Changes in v3:
- Added the new shell helpers files in the TEST_INCLUDES (Jakub)
- Link to v2: https://lore.kernel.org/r/20250103-netcons_overflow_test-v2-0-a49f9be64c21@…
Changes in v2:
- Add the new script (netcons_overflow.sh) in
tools/testing/selftests/drivers/net/Makefile as suggested by Simon
Horman
- Link to v1: https://lore.kernel.org/r/20241204-netcons_overflow_test-v1-0-a85a8d0ace21@…
---
Breno Leitao (4):
netconsole: Warn if MAX_USERDATA_ITEMS limit is exceeded
netconsole: selftest: Split the helpers from the selftest
netconsole: selftest: Delete all userdata keys
netconsole: selftest: verify userdata entry limit
MAINTAINERS | 3 +-
drivers/net/netconsole.c | 2 +-
tools/testing/selftests/drivers/net/Makefile | 2 +
.../selftests/drivers/net/lib/sh/lib_netcons.sh | 225 +++++++++++++++++++++
.../testing/selftests/drivers/net/netcons_basic.sh | 218 +-------------------
.../selftests/drivers/net/netcons_overflow.sh | 67 ++++++
6 files changed, 298 insertions(+), 219 deletions(-)
---
base-commit: 7bf1659bad4e9413cdba132ef9cbd0caa9cabcc4
change-id: 20241204-netcons_overflow_test-eaf735d1f743
Best regards,
--
Breno Leitao <leitao(a)debian.org>
This patch allows progs to elide a null check on statically known map
lookup keys. In other words, if the verifier can statically prove that
the lookup will be in-bounds, allow the prog to drop the null check.
This is useful for two reasons:
1. Large numbers of nullness checks (especially when they cannot fail)
unnecessarily pushes prog towards BPF_COMPLEXITY_LIMIT_JMP_SEQ.
2. It forms a tighter contract between programmer and verifier.
For (1), bpftrace is starting to make heavier use of percpu scratch
maps. As a result, for user scripts with large number of unrolled loops,
we are starting to hit jump complexity verification errors. These
percpu lookups cannot fail anyways, as we only use static key values.
Eliding nullness probably results in less work for verifier as well.
For (2), percpu scratch maps are often used as a larger stack, as the
currrent stack is limited to 512 bytes. In these situations, it is
desirable for the programmer to express: "this lookup should never fail,
and if it does, it means I messed up the code". By omitting the null
check, the programmer can "ask" the verifier to double check the logic.
=== Changelog ===
Changes in v6:
* Use is_spilled_scalar_reg() helper and remove unnecessary comment
* Add back deleted selftest with different helper to dirty dst buffer
* Check size of spill is exactly key_size and update selftests
* Read slot_type from correct offset into the spi
* Rewrite selftests in C where possible
* Mark constant map keys as precise
Changes in v5:
* Dropped all acks
* Use s64 instead of long for const_map_key
* Ensure stack slot contains spilled reg before accessing spilled_ptr
* Ensure spilled reg is a scalar before accessing tnum const value
* Fix verifier selftest for 32-bit write to write at 8 byte alignment
to ensure spill is tracked
* Introduce more precise tracking of helper stack accesses
* Do constant map key extraction as part of helper argument processing
and then remove duplicated stack checks
* Use ret_flag instead of regs[BPF_REG_0].type
* Handle STACK_ZERO
* Fix bug in bpf_load_hdr_opt() arg annotation
Changes in v4:
* Only allow for CAP_BPF
* Add test for stack growing upwards
* Improve comment about stack growing upwards
Changes in v3:
* Check if stack is (erroneously) growing upwards
* Mention in commit message why existing tests needed change
Changes in v2:
* Added a check for when R2 is not a ptr to stack
* Added a check for when stack is uninitialized (no stack slot yet)
* Updated existing tests to account for null elision
* Added test case for when R2 can be both const and non-const
Daniel Xu (5):
bpf: verifier: Add missing newline on verbose() call
bpf: tcp: Mark bpf_load_hdr_opt() arg2 as read-write
bpf: verifier: Refactor helper access type tracking
bpf: verifier: Support eliding map lookup nullness
bpf: selftests: verifier: Add nullness elision tests
kernel/bpf/verifier.c | 139 +++++++++++----
net/core/filter.c | 2 +-
.../testing/selftests/bpf/progs/dynptr_fail.c | 6 +-
tools/testing/selftests/bpf/progs/iters.c | 14 +-
.../selftests/bpf/progs/map_kptr_fail.c | 2 +-
.../selftests/bpf/progs/test_global_func10.c | 2 +-
.../selftests/bpf/progs/uninit_stack.c | 5 +-
.../bpf/progs/verifier_array_access.c | 168 ++++++++++++++++++
.../bpf/progs/verifier_basic_stack.c | 2 +-
.../selftests/bpf/progs/verifier_const_or.c | 4 +-
.../progs/verifier_helper_access_var_len.c | 12 +-
.../selftests/bpf/progs/verifier_int_ptr.c | 2 +-
.../selftests/bpf/progs/verifier_map_in_map.c | 2 +-
.../selftests/bpf/progs/verifier_mtu.c | 2 +-
.../selftests/bpf/progs/verifier_raw_stack.c | 4 +-
.../selftests/bpf/progs/verifier_unpriv.c | 2 +-
.../selftests/bpf/progs/verifier_var_off.c | 8 +-
tools/testing/selftests/bpf/verifier/calls.c | 2 +-
.../testing/selftests/bpf/verifier/map_kptr.c | 2 +-
19 files changed, 311 insertions(+), 69 deletions(-)
--
2.47.1
Reverse the order in which
the PML log is read to align more closely to the hardware. It should
not affect regular users of the dirty logging but it fixes a unit test
specific assumption in the dirty_log_test dirty-ring mode.
Best regards,
Maxim Levitsky
Maxim Levitsky (2):
KVM: VMX: refactor PML terminology
KVM: VMX: read the PML log in the same order as it was written
arch/x86/kvm/vmx/main.c | 2 +-
arch/x86/kvm/vmx/nested.c | 2 +-
arch/x86/kvm/vmx/vmx.c | 32 ++++++++++++++++++++------------
arch/x86/kvm/vmx/vmx.h | 5 ++++-
4 files changed, 26 insertions(+), 15 deletions(-)
--
2.26.3
Extend the XDP Tx metadata framework so that user can requests launch time
hardware offload, where the Ethernet device will schedule the packet for
transmission at a pre-determined time called launch time. The value of
launch time is communicated from user space to Ethernet driver via
launch_time field of struct xsk_tx_metadata.
Suggested-by: Stanislav Fomichev <sdf(a)google.com>
Signed-off-by: Song Yoong Siang <yoong.siang.song(a)intel.com>
---
Documentation/netlink/specs/netdev.yaml | 4 ++
Documentation/networking/xsk-tx-metadata.rst | 64 ++++++++++++++++++++
include/net/xdp_sock.h | 10 +++
include/net/xdp_sock_drv.h | 1 +
include/uapi/linux/if_xdp.h | 10 +++
include/uapi/linux/netdev.h | 3 +
net/core/netdev-genl.c | 2 +
net/xdp/xsk.c | 3 +
tools/include/uapi/linux/if_xdp.h | 10 +++
tools/include/uapi/linux/netdev.h | 3 +
10 files changed, 110 insertions(+)
diff --git a/Documentation/netlink/specs/netdev.yaml b/Documentation/netlink/specs/netdev.yaml
index cbb544bd6c84..e59c8a14f7d1 100644
--- a/Documentation/netlink/specs/netdev.yaml
+++ b/Documentation/netlink/specs/netdev.yaml
@@ -70,6 +70,10 @@ definitions:
name: tx-checksum
doc:
L3 checksum HW offload is supported by the driver.
+ -
+ name: tx-launch-time
+ doc:
+ Launch time HW offload is supported by the driver.
-
name: queue-type
type: enum
diff --git a/Documentation/networking/xsk-tx-metadata.rst b/Documentation/networking/xsk-tx-metadata.rst
index e76b0cfc32f7..3cec089747ce 100644
--- a/Documentation/networking/xsk-tx-metadata.rst
+++ b/Documentation/networking/xsk-tx-metadata.rst
@@ -50,6 +50,10 @@ The flags field enables the particular offload:
checksum. ``csum_start`` specifies byte offset of where the checksumming
should start and ``csum_offset`` specifies byte offset where the
device should store the computed checksum.
+- ``XDP_TXMD_FLAGS_LAUNCH_TIME``: requests the device to schedule the
+ packet for transmission at a pre-determined time called launch time. The
+ value of launch time is indicated by ``launch_time`` field of
+ ``union xsk_tx_metadata``.
Besides the flags above, in order to trigger the offloads, the first
packet's ``struct xdp_desc`` descriptor should set ``XDP_TX_METADATA``
@@ -65,6 +69,65 @@ In this case, when running in ``XDK_COPY`` mode, the TX checksum
is calculated on the CPU. Do not enable this option in production because
it will negatively affect performance.
+Launch Time
+===========
+
+The value of the requested launch time should be based on the device's PTP
+Hardware Clock (PHC) to ensure accuracy. AF_XDP takes a different data path
+compared to the ETF queuing discipline, which organizes packets and delays
+their transmission. Instead, AF_XDP immediately hands off the packets to
+the device driver without rearranging their order or holding them prior to
+transmission. In scenarios where the launch time offload feature is
+disabled, the device driver is expected to disregard the launch time
+request. For correct interpretation and meaningful operation, the launch
+time should never be set to a value larger than the farthest programmable
+time in the future (the horizon). Different devices have different hardware
+limitations on the launch time offload feature.
+
+stmmac driver
+-------------
+
+For stmmac, TSO and launch time (TBS) features are mutually exclusive for
+each individual Tx Queue. By default, the driver configures Tx Queue 0 to
+support TSO and the rest of the Tx Queues to support TBS. The launch time
+hardware offload feature can be enabled or disabled by using the tc-etf
+command to call the driver's ndo_setup_tc() callback.
+
+The value of the launch time that is programmed in the Enhanced Normal
+Transmit Descriptors is a 32-bit value, where the most significant 8 bits
+represent the time in seconds and the remaining 24 bits represent the time
+in 256 ns increments. The programmed launch time is compared against the
+PTP time (bits[39:8]) and rolls over after 256 seconds. Therefore, the
+horizon of the launch time for dwmac4 and dwxlgmac2 is 128 seconds in the
+future.
+
+The stmmac driver maintains FIFO behavior and does not perform packet
+reordering. This means that a packet with a launch time request will block
+other packets in the same Tx Queue until it is transmitted.
+
+igc driver
+----------
+
+For igc, all four Tx Queues support the launch time feature. The launch
+time hardware offload feature can be enabled or disabled by using the
+tc-etf command to call the driver's ndo_setup_tc() callback. When entering
+TSN mode, the igc driver will reset the device and create a default Qbv
+schedule with a 1-second cycle time, with all Tx Queues open at all times.
+
+The value of the launch time that is programmed in the Advanced Transmit
+Context Descriptor is a relative offset to the starting time of the Qbv
+transmission window of the queue. The Frst flag of the descriptor can be
+set to schedule the packet for the next Qbv cycle. Therefore, the horizon
+of the launch time for i225 and i226 is the ending time of the next cycle
+of the Qbv transmission window of the queue. For example, when the Qbv
+cycle time is set to 1 second, the horizon of the launch time ranges
+from 1 second to 2 seconds, depending on where the Qbv cycle is currently
+running.
+
+The igc driver maintains FIFO behavior and does not perform packet
+reordering. This means that a packet with a launch time request will block
+other packets in the same Tx Queue until it is transmitted.
+
Querying Device Capabilities
============================
@@ -74,6 +137,7 @@ Refer to ``xsk-flags`` features bitmask in
- ``tx-timestamp``: device supports ``XDP_TXMD_FLAGS_TIMESTAMP``
- ``tx-checksum``: device supports ``XDP_TXMD_FLAGS_CHECKSUM``
+- ``tx-launch-time``: device supports ``XDP_TXMD_FLAGS_LAUNCH_TIME``
See ``tools/net/ynl/samples/netdev.c`` on how to query this information.
diff --git a/include/net/xdp_sock.h b/include/net/xdp_sock.h
index bfe625b55d55..a58ae7589d12 100644
--- a/include/net/xdp_sock.h
+++ b/include/net/xdp_sock.h
@@ -110,11 +110,16 @@ struct xdp_sock {
* indicates position where checksumming should start.
* csum_offset indicates position where checksum should be stored.
*
+ * void (*tmo_request_launch_time)(u64 launch_time, void *priv)
+ * Called when AF_XDP frame requested launch time HW offload support.
+ * launch_time indicates the PTP time at which the device can schedule the
+ * packet for transmission.
*/
struct xsk_tx_metadata_ops {
void (*tmo_request_timestamp)(void *priv);
u64 (*tmo_fill_timestamp)(void *priv);
void (*tmo_request_checksum)(u16 csum_start, u16 csum_offset, void *priv);
+ void (*tmo_request_launch_time)(u64 launch_time, void *priv);
};
#ifdef CONFIG_XDP_SOCKETS
@@ -162,6 +167,11 @@ static inline void xsk_tx_metadata_request(const struct xsk_tx_metadata *meta,
if (!meta)
return;
+ if (ops->tmo_request_launch_time)
+ if (meta->flags & XDP_TXMD_FLAGS_LAUNCH_TIME)
+ ops->tmo_request_launch_time(meta->request.launch_time,
+ priv);
+
if (ops->tmo_request_timestamp)
if (meta->flags & XDP_TXMD_FLAGS_TIMESTAMP)
ops->tmo_request_timestamp(priv);
diff --git a/include/net/xdp_sock_drv.h b/include/net/xdp_sock_drv.h
index 40085afd9160..78af371bc002 100644
--- a/include/net/xdp_sock_drv.h
+++ b/include/net/xdp_sock_drv.h
@@ -198,6 +198,7 @@ static inline void *xsk_buff_raw_get_data(struct xsk_buff_pool *pool, u64 addr)
#define XDP_TXMD_FLAGS_VALID ( \
XDP_TXMD_FLAGS_TIMESTAMP | \
XDP_TXMD_FLAGS_CHECKSUM | \
+ XDP_TXMD_FLAGS_LAUNCH_TIME | \
0)
static inline bool xsk_buff_valid_tx_metadata(struct xsk_tx_metadata *meta)
diff --git a/include/uapi/linux/if_xdp.h b/include/uapi/linux/if_xdp.h
index 42ec5ddaab8d..42869770776e 100644
--- a/include/uapi/linux/if_xdp.h
+++ b/include/uapi/linux/if_xdp.h
@@ -127,6 +127,12 @@ struct xdp_options {
*/
#define XDP_TXMD_FLAGS_CHECKSUM (1 << 1)
+/* Request launch time hardware offload. The device will schedule the packet for
+ * transmission at a pre-determined time called launch time. The value of
+ * launch time is communicated via launch_time field of struct xsk_tx_metadata.
+ */
+#define XDP_TXMD_FLAGS_LAUNCH_TIME (1 << 2)
+
/* AF_XDP offloads request. 'request' union member is consumed by the driver
* when the packet is being transmitted. 'completion' union member is
* filled by the driver when the transmit completion arrives.
@@ -142,6 +148,10 @@ struct xsk_tx_metadata {
__u16 csum_start;
/* Offset from csum_start where checksum should be stored. */
__u16 csum_offset;
+
+ /* XDP_TXMD_FLAGS_LAUNCH_TIME */
+ /* Launch time in nanosecond against the PTP HW Clock */
+ __u64 launch_time;
} request;
struct {
diff --git a/include/uapi/linux/netdev.h b/include/uapi/linux/netdev.h
index e4be227d3ad6..5ab85f4af009 100644
--- a/include/uapi/linux/netdev.h
+++ b/include/uapi/linux/netdev.h
@@ -59,10 +59,13 @@ enum netdev_xdp_rx_metadata {
* by the driver.
* @NETDEV_XSK_FLAGS_TX_CHECKSUM: L3 checksum HW offload is supported by the
* driver.
+ * @NETDEV_XSK_FLAGS_LAUNCH_TIME: Launch Time HW offload is supported by the
+ * driver.
*/
enum netdev_xsk_flags {
NETDEV_XSK_FLAGS_TX_TIMESTAMP = 1,
NETDEV_XSK_FLAGS_TX_CHECKSUM = 2,
+ NETDEV_XSK_FLAGS_LAUNCH_TIME = 4,
};
enum netdev_queue_type {
diff --git a/net/core/netdev-genl.c b/net/core/netdev-genl.c
index 9527dd46e4dc..e2515cf9190f 100644
--- a/net/core/netdev-genl.c
+++ b/net/core/netdev-genl.c
@@ -52,6 +52,8 @@ XDP_METADATA_KFUNC_xxx
xsk_features |= NETDEV_XSK_FLAGS_TX_TIMESTAMP;
if (netdev->xsk_tx_metadata_ops->tmo_request_checksum)
xsk_features |= NETDEV_XSK_FLAGS_TX_CHECKSUM;
+ if (netdev->xsk_tx_metadata_ops->tmo_request_launch_time)
+ xsk_features |= NETDEV_XSK_FLAGS_LAUNCH_TIME;
}
if (nla_put_u32(rsp, NETDEV_A_DEV_IFINDEX, netdev->ifindex) ||
diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index 3fa70286c846..8feaa0e86f07 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -743,6 +743,9 @@ static struct sk_buff *xsk_build_skb(struct xdp_sock *xs,
goto free_err;
}
}
+
+ if (meta->flags & XDP_TXMD_FLAGS_LAUNCH_TIME)
+ skb->skb_mstamp_ns = meta->request.launch_time;
}
}
diff --git a/tools/include/uapi/linux/if_xdp.h b/tools/include/uapi/linux/if_xdp.h
index 2f082b01ff22..67719f8966c2 100644
--- a/tools/include/uapi/linux/if_xdp.h
+++ b/tools/include/uapi/linux/if_xdp.h
@@ -127,6 +127,12 @@ struct xdp_options {
*/
#define XDP_TXMD_FLAGS_CHECKSUM (1 << 1)
+/* Request launch time hardware offload. The device will schedule the packet for
+ * transmission at a pre-determined time called launch time. The value of
+ * launch time is communicated via launch_time field of struct xsk_tx_metadata.
+ */
+#define XDP_TXMD_FLAGS_LAUNCH_TIME (1 << 2)
+
/* AF_XDP offloads request. 'request' union member is consumed by the driver
* when the packet is being transmitted. 'completion' union member is
* filled by the driver when the transmit completion arrives.
@@ -142,6 +148,10 @@ struct xsk_tx_metadata {
__u16 csum_start;
/* Offset from csum_start where checksum should be stored. */
__u16 csum_offset;
+
+ /* XDP_TXMD_FLAGS_LAUNCH_TIME */
+ /* Launch time in nanosecond against the PTP HW Clock */
+ __u64 launch_time;
} request;
struct {
diff --git a/tools/include/uapi/linux/netdev.h b/tools/include/uapi/linux/netdev.h
index e4be227d3ad6..5ab85f4af009 100644
--- a/tools/include/uapi/linux/netdev.h
+++ b/tools/include/uapi/linux/netdev.h
@@ -59,10 +59,13 @@ enum netdev_xdp_rx_metadata {
* by the driver.
* @NETDEV_XSK_FLAGS_TX_CHECKSUM: L3 checksum HW offload is supported by the
* driver.
+ * @NETDEV_XSK_FLAGS_LAUNCH_TIME: Launch Time HW offload is supported by the
+ * driver.
*/
enum netdev_xsk_flags {
NETDEV_XSK_FLAGS_TX_TIMESTAMP = 1,
NETDEV_XSK_FLAGS_TX_CHECKSUM = 2,
+ NETDEV_XSK_FLAGS_LAUNCH_TIME = 4,
};
enum netdev_queue_type {
--
2.34.1
The selftest started failing since commit e93d2521b27f
("x86/vdso: Split virtual clock pages into dedicated mapping")
was merged. While debugging I stumbled upon another bug and potential
cleanup.
Signed-off-by: Thomas Weißschuh <thomas.weissschuh(a)linutronix.de>
---
Thomas Weißschuh (3):
selftests/mm: virtual_address_range: Fix error when CommitLimit < 1GiB
selftests/mm: virtual_address_range: Avoid reading VVAR mappings
selftests/mm: virtual_address_range: Dump to /dev/null
tools/testing/selftests/mm/virtual_address_range.c | 21 +++++++++++++++------
1 file changed, 15 insertions(+), 6 deletions(-)
---
base-commit: fbfd64d25c7af3b8695201ebc85efe90be28c5a3
change-id: 20250107-virtual_address_range-tests-95843766fa97
Best regards,
--
Thomas Weißschuh <thomas.weissschuh(a)linutronix.de>
Notable changes since v15:
* added IPV6 hack in Kconfig
* switched doc '|' operator to '>-' in yaml netlink spec
* added ovpn-mode doc to rt_link.yaml
* implemented rtnl_link_ops.fill_info
* removed ovpn_socket_detach() function because UDP and TCP detachment
is now happening in different moments
* reworked ovpn_socket lifetime:
** introduced ovpn_socket_release() that depending on transport proto
will take the right step towards releasing the socket (check large
comment on top of function for greater details)
** extended comments on various ovpn_socket* functions to ensure socket
lifecycle is clear
** implemented kref_put_lock() to allow UDP sockets to be detached while
holding socket lock
** acquired socket lock in ovpn_socket_new() to avoid race with detach
(point above)
** socket is now released upon peer removal (not upon peer free!)
* added convenient define OVPN_AAD_SIZE
* renamed AUTH_TAG_SIZE to OVPN_AUTH_TAG_SIZE
* s/dev_core_stats_rx_dropped_inc/dev_core_stats_tx_dropped_inc where
needed
* fixed some typos
* moved tcp_close() call outside of rcu_read_lock area
* moved ovpn_socket creation from ovpn_nl_peer_modify() to
ovpn_nl_peer_new_doit() to make smatch happy (ovpn_socket_new() may
have been called under spinlock, but it may sleep)
* added support for MSG_NOSIGNAL flag in TCP calls (required extending
the skb API)
* improved TCP proto/ops customization (required exporting
inet6_stream_ops)
* changed kselftest tool (ovpn-cli.c) to pass MSG_NOSIGNAL to TCP
send/recv calls.
The ovpn_socket lifecycle changes above address the race conditions
previously reported by Sabrina.
Hopefully all though nuts have been cracked at this point.
Please note that some patches were already reviewed by Andre Lunn,
Donald Hunter and Shuah Khan. They have retained the Reviewed-by tag
since no major code modification has happened since the review.
The latest code can also be found at:
https://github.com/OpenVPN/linux-kernel-ovpn
Thanks a lot!
Best Regards,
Antonio Quartulli
OpenVPN Inc.
---
Antonio Quartulli (26):
net: introduce OpenVPN Data Channel Offload (ovpn)
ovpn: add basic netlink support
ovpn: add basic interface creation/destruction/management routines
ovpn: keep carrier always on for MP interfaces
ovpn: introduce the ovpn_peer object
kref/refcount: implement kref_put_sock()
ovpn: introduce the ovpn_socket object
ovpn: implement basic TX path (UDP)
ovpn: implement basic RX path (UDP)
ovpn: implement packet processing
ovpn: store tunnel and transport statistics
ipv6: export inet6_stream_ops via EXPORT_SYMBOL_GPL
ovpn: implement TCP transport
skb: implement skb_send_sock_locked_with_flags()
ovpn: add support for MSG_NOSIGNAL in tcp_sendmsg
ovpn: implement multi-peer support
ovpn: implement peer lookup logic
ovpn: implement keepalive mechanism
ovpn: add support for updating local UDP endpoint
ovpn: add support for peer floating
ovpn: implement peer add/get/dump/delete via netlink
ovpn: implement key add/get/del/swap via netlink
ovpn: kill key and notify userspace in case of IV exhaustion
ovpn: notify userspace when a peer is deleted
ovpn: add basic ethtool support
testing/selftests: add test tool and scripts for ovpn module
Documentation/netlink/specs/ovpn.yaml | 372 +++
Documentation/netlink/specs/rt_link.yaml | 16 +
MAINTAINERS | 11 +
drivers/net/Kconfig | 15 +
drivers/net/Makefile | 1 +
drivers/net/ovpn/Makefile | 22 +
drivers/net/ovpn/bind.c | 55 +
drivers/net/ovpn/bind.h | 101 +
drivers/net/ovpn/crypto.c | 211 ++
drivers/net/ovpn/crypto.h | 145 ++
drivers/net/ovpn/crypto_aead.c | 382 ++++
drivers/net/ovpn/crypto_aead.h | 33 +
drivers/net/ovpn/io.c | 446 ++++
drivers/net/ovpn/io.h | 34 +
drivers/net/ovpn/main.c | 350 +++
drivers/net/ovpn/main.h | 14 +
drivers/net/ovpn/netlink-gen.c | 213 ++
drivers/net/ovpn/netlink-gen.h | 41 +
drivers/net/ovpn/netlink.c | 1178 ++++++++++
drivers/net/ovpn/netlink.h | 18 +
drivers/net/ovpn/ovpnstruct.h | 57 +
drivers/net/ovpn/peer.c | 1256 +++++++++++
drivers/net/ovpn/peer.h | 159 ++
drivers/net/ovpn/pktid.c | 129 ++
drivers/net/ovpn/pktid.h | 87 +
drivers/net/ovpn/proto.h | 118 +
drivers/net/ovpn/skb.h | 60 +
drivers/net/ovpn/socket.c | 237 ++
drivers/net/ovpn/socket.h | 45 +
drivers/net/ovpn/stats.c | 21 +
drivers/net/ovpn/stats.h | 47 +
drivers/net/ovpn/tcp.c | 567 +++++
drivers/net/ovpn/tcp.h | 33 +
drivers/net/ovpn/udp.c | 392 ++++
drivers/net/ovpn/udp.h | 23 +
include/linux/kref.h | 11 +
include/linux/refcount.h | 3 +
include/linux/skbuff.h | 2 +
include/uapi/linux/if_link.h | 15 +
include/uapi/linux/ovpn.h | 111 +
include/uapi/linux/udp.h | 1 +
lib/refcount.c | 32 +
net/core/skbuff.c | 18 +-
net/ipv6/af_inet6.c | 1 +
tools/testing/selftests/Makefile | 1 +
tools/testing/selftests/net/ovpn/.gitignore | 2 +
tools/testing/selftests/net/ovpn/Makefile | 17 +
tools/testing/selftests/net/ovpn/config | 10 +
tools/testing/selftests/net/ovpn/data64.key | 5 +
tools/testing/selftests/net/ovpn/ovpn-cli.c | 2366 ++++++++++++++++++++
tools/testing/selftests/net/ovpn/tcp_peers.txt | 5 +
.../testing/selftests/net/ovpn/test-chachapoly.sh | 9 +
tools/testing/selftests/net/ovpn/test-float.sh | 9 +
tools/testing/selftests/net/ovpn/test-tcp.sh | 9 +
tools/testing/selftests/net/ovpn/test.sh | 182 ++
tools/testing/selftests/net/ovpn/udp_peers.txt | 5 +
56 files changed, 9698 insertions(+), 5 deletions(-)
---
base-commit: 4b252f2dab2ebb654eebbb2aee980ab8373b2295
change-id: 20241002-b4-ovpn-eeee35c694a2
Best regards,
--
Antonio Quartulli <antonio(a)openvpn.net>
On 08.01.25 07:09, Dev Jain wrote:
>
> On 07/01/25 8:44 pm, Thomas Weißschuh wrote:
>> During the execution of validate_complete_va_space() a lot of memory is
>> on the VM subsystem. When running on a low memory subsystem an OOM may
>> be triggered, when writing to the dump file as the filesystem may also
>> require memory.
>>
>> On my test system with 1100MiB physical memory:
>>
>> Tasks state (memory values in pages):
>> [ pid ] uid tgid total_vm rss rss_anon rss_file rss_shmem pgtables_bytes swapents oom_score_adj name
>> [ 57] 0 57 34359215953 695 256 0 439 1064390656 0 0 virtual_address
>>
>> Out of memory: Killed process 57 (virtual_address) total-vm:137436863812kB, anon-rss:1024kB, file-rss:0kB, shmem-rss:1756kB, UID:0 pgtables:1039444kB oom_score_adj:0
>> <snip>
>> fault_in_iov_iter_readable+0x4a/0xd0
>> generic_perform_write+0x9c/0x280
>> shmem_file_write_iter+0x86/0x90
>> vfs_write+0x29c/0x480
>> ksys_write+0x6c/0xe0
>> do_syscall_64+0x9e/0x1a0
>> entry_SYSCALL_64_after_hwframe+0x77/0x7f
>>
>> Write the dumped data into /dev/null instead which does not require
>> additional memory during write(), making the code simpler as a
>> side-effect.
>>
>> Signed-off-by: Thomas Weißschuh<thomas.weissschuh(a)linutronix.de>
>> ---
>> tools/testing/selftests/mm/virtual_address_range.c | 6 ++----
>> 1 file changed, 2 insertions(+), 4 deletions(-)
>>
>> diff --git a/tools/testing/selftests/mm/virtual_address_range.c b/tools/testing/selftests/mm/virtual_address_range.c
>> index 484f82c7b7c871f82a7d9ec6d6c649f2ab1eb0cd..4042fd878acd702d23da2c3293292de33bd48143 100644
>> --- a/tools/testing/selftests/mm/virtual_address_range.c
>> +++ b/tools/testing/selftests/mm/virtual_address_range.c
>> @@ -103,10 +103,9 @@ static int validate_complete_va_space(void)
>> FILE *file;
>> int fd;
>>
>> - fd = open("va_dump", O_CREAT | O_WRONLY, 0600);
>> - unlink("va_dump");
>> + fd = open("/dev/null", O_WRONLY);
>> if (fd < 0) {
>> - ksft_test_result_skip("cannot create or open dump file\n");
>> + ksft_test_result_skip("cannot create or open /dev/null\n");
>> ksft_finished();
>> }
>> >> @@ -152,7 +151,6 @@ static int validate_complete_va_space(void)
>> while (start_addr + hop < end_addr) {
>> if (write(fd, (void *)(start_addr + hop), 1) != 1)
>> return 1;
>> - lseek(fd, 0, SEEK_SET);
>>
>> hop += MAP_CHUNK_SIZE;
>> }
>>
>
> The reason I had not used /dev/null was that write() was succeeding to /dev/null
> even from an address not in my VA space. I was puzzled about this behaviour of
> /dev/null and I chose to ignore it and just use a real file.
>
> To test this behaviour, run the following program:
>
> #include <stdio.h>
> #include <stdlib.h>
> #include <unistd.h>
> #include <fcntl.h>
> #include <sys/mman.h>
> intmain()
> {
> intfd;
> fd = open("va_dump", O_CREAT| O_WRONLY, 0600);
> unlink("va_dump");
> // fd = open("/dev/null", O_WRONLY);
> intret = munmap((void*)(1UL<< 30), 100);
> if(!ret)
> printf("munmap succeeded\n");
> intres = write(fd, (void*)(1UL<< 30), 1);
> if(res == 1)
> printf("write succeeded\n");
> return0;
> }
> The write will fail as expected, but if you comment out the va_dump
> lines and use /dev/null, the write will succeed.
What exactly do we want to achieve with the write? Verify that the
output of /proc/self/map is reasonable and we can actually resolve a
fault / map a page?
Why not access the memory directly+signal handler or using
/proc/self/mem, so you can avoid the temp file completely?
--
Cheers,
David / dhildenb
* Resending because I accidentally forgot to include Lorenzo in the
"to" list.
Android uses the ashmem driver [1] for creating shared memory regions
between processes. The ashmem driver exposes an ioctl command for
processes to restrict the permissions an ashmem buffer can be mapped
with.
Buffers are created with the ability to be mapped as readable, writable,
and executable. Processes remove the ability to map some ashmem buffers
as executable to ensure that those buffers cannot be used to inject
malicious code for another process to run. Other buffers retain their
ability to be mapped as executable, as these buffers can be used for
just-in-time (JIT) compilation. So there is a need to be able to remove
the ability to map a buffer as executable on a per-buffer basis.
Android is currently trying to migrate towards replacing its ashmem
driver usage with memfd. Part of the transition involved introducing a
library that serves to abstract away how shared memory regions are
allocated (i.e. ashmem vs memfd). This allows clients to use a single
interface for restricting how a buffer can be mapped without having to
worry about how it is handled for ashmem (through the ioctl
command mentioned earlier) or memfd (through file seals).
While memfd has support for preventing buffers from being mapped as
writable beyond a certain point in time (thanks to
F_SEAL_FUTURE_WRITE), it does not have a similar interface to prevent
buffers from being mapped as executable beyond a certain point.
However, that could be implemented as a file seal (F_SEAL_FUTURE_EXEC)
which works similarly to F_SEAL_FUTURE_WRITE.
F_SEAL_FUTURE_WRITE was chosen as a template for how this new seal
should behave, instead of F_SEAL_WRITE, for the following reasons:
1. Having the new seal behave like F_SEAL_FUTURE_WRITE matches the
behavior that was present with ashmem. This aids in seamlessly
transitioning clients away from ashmem to memfd.
2. Making the new seal behave like F_SEAL_WRITE would mean that no
mappings that could become executable in the future (i.e. via
mprotect()) can exist when the seal is applied. However, there are
known cases (e.g. CursorWindow [2]) where restrictions are applied
on how a buffer can be mapped after a mapping has already been made.
That mapping may have VM_MAYEXEC set, which would not allow the seal
to be applied successfully.
Therefore, the F_SEAL_FUTURE_EXEC seal was designed to have the same
semantics as F_SEAL_FUTURE_WRITE.
Note: this series depends on Lorenzo's work [3], [4], [5] from Andrew
Morton's mm-unstable branch [6], which reworks memfd's file seal checks,
allowing for newer file seals to be implemented in a cleaner fashion.
Changes from v1 ==> v2:
- Changed the return code to be -EPERM instead of -EACCES when
attempting to map an exec sealed file with PROT_EXEC to align
to mmap()'s man page. Thank you Kalesh Singh for spotting this!
- Rebased on top of Lorenzo's work to cleanup memfd file seal checks in
mmap() ([3], [4], and [5]). Thank you for this Lorenzo!
- Changed to deny PROT_EXEC mappings only if the mapping is shared,
instead of for both shared and private mappings, after discussing
this with Lorenzo.
Opens:
- Lorenzo brought up that this patch may negatively impact the usage of
MFD_NOEXEC_SCOPE_NOEXEC_ENFORCED [7]. However, it is not clear to me
why that is the case. At the moment, my intent is for the executable
permissions of the file to be disjoint from the ability to create
executable mappings.
Links:
[1] https://cs.android.com/android/kernel/superproject/+/common-android-mainlin…
[2] https://developer.android.com/reference/android/database/CursorWindow
[3] https://lore.kernel.org/all/cover.1732804776.git.lorenzo.stoakes@oracle.com/
[4] https://lkml.kernel.org/r/20241206212846.210835-1-lorenzo.stoakes@oracle.com
[5] https://lkml.kernel.org/r/7dee6c5d-480b-4c24-b98e-6fa47dbd8a23@lucifer.local
[6] https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git/tree/?h=mm-unst…
[7] https://lore.kernel.org/all/3a53b154-1e46-45fb-a559-65afa7a8a788@lucifer.lo…
Links to previous versions:
v1: https://lore.kernel.org/all/20241206010930.3871336-1-isaacmanjarres@google.…
Isaac J. Manjarres (2):
mm/memfd: Add support for F_SEAL_FUTURE_EXEC to memfd
selftests/memfd: Add tests for F_SEAL_FUTURE_EXEC
include/uapi/linux/fcntl.h | 1 +
mm/memfd.c | 39 ++++++++++-
tools/testing/selftests/memfd/memfd_test.c | 79 ++++++++++++++++++++++
3 files changed, 118 insertions(+), 1 deletion(-)
--
2.47.1.613.gc27f4b7a9f-goog
When compiling the pointer masking tests with -Wall this warning
is present:
pointer_masking.c: In function ‘test_tagged_addr_abi_sysctl’:
pointer_masking.c:203:9: warning: ignoring return value of ‘pwrite’
declared with attribute ‘warn_unused_result’ [-Wunused-result]
203 | pwrite(fd, &value, 1, 0); |
^~~~~~~~~~~~~~~~~~~~~~~~ pointer_masking.c:208:9: warning:
ignoring return value of ‘pwrite’ declared with attribute
‘warn_unused_result’ [-Wunused-result]
208 | pwrite(fd, &value, 1, 0);
I came across this on riscv64-linux-gnu-gcc (Ubuntu
11.4.0-1ubuntu1~22.04).
Fix this by checking that the number of bytes written equal the expected
number of bytes written.
Fixes: 7470b5afd150 ("riscv: selftests: Add a pointer masking test")
Signed-off-by: Charlie Jenkins <charlie(a)rivosinc.com>
Reviewed-by: Andrew Jones <ajones(a)ventanamicro.com>
---
Changes in v6:
- Add back ksft_test_result() (Samuel)
- Link to v5: https://lore.kernel.org/r/20241206-fix_warnings_pointer_masking_tests-v5-1-…
Changes in v5:
- No longer skip second pwrite if first one fails
- Use wrapper function instead of goto (Drew)
- Link to v4: https://lore.kernel.org/r/20241205-fix_warnings_pointer_masking_tests-v4-1-…
Changes in v4:
- Skip sysctl_enabled test if first pwrite failed
- Link to v3: https://lore.kernel.org/r/20241205-fix_warnings_pointer_masking_tests-v3-1-…
Changes in v3:
- Fix sysctl enabled test case (Drew/Alex)
- Move pwrite err condition into goto (Drew)
- Link to v2: https://lore.kernel.org/r/20241204-fix_warnings_pointer_masking_tests-v2-1-…
Changes in v2:
- I had ret != 2 for testing, I changed it to be ret != 1.
- Link to v1: https://lore.kernel.org/r/20241204-fix_warnings_pointer_masking_tests-v1-1-…
---
.../testing/selftests/riscv/abi/pointer_masking.c | 28 +++++++++++++++++-----
1 file changed, 22 insertions(+), 6 deletions(-)
diff --git a/tools/testing/selftests/riscv/abi/pointer_masking.c b/tools/testing/selftests/riscv/abi/pointer_masking.c
index dee41b7ee3e323150d55523c8acbf3ec38857b87..059d2e87eb1f737caf44f692b239bf3e49c233b4 100644
--- a/tools/testing/selftests/riscv/abi/pointer_masking.c
+++ b/tools/testing/selftests/riscv/abi/pointer_masking.c
@@ -185,8 +185,20 @@ static void test_fork_exec(void)
}
}
+static bool pwrite_wrapper(int fd, void *buf, size_t count, const char *msg)
+{
+ int ret = pwrite(fd, buf, count, 0);
+
+ if (ret != count) {
+ ksft_perror(msg);
+ return false;
+ }
+ return true;
+}
+
static void test_tagged_addr_abi_sysctl(void)
{
+ char *err_pwrite_msg = "failed to write to /proc/sys/abi/tagged_addr_disabled\n";
char value;
int fd;
@@ -200,14 +212,18 @@ static void test_tagged_addr_abi_sysctl(void)
}
value = '1';
- pwrite(fd, &value, 1, 0);
- ksft_test_result(set_tagged_addr_ctrl(min_pmlen, true) == -EINVAL,
- "sysctl disabled\n");
+ if (!pwrite_wrapper(fd, &value, 1, "write '1'"))
+ ksft_test_result_fail(err_pwrite_msg);
+ else
+ ksft_test_result(set_tagged_addr_ctrl(min_pmlen, true) == -EINVAL,
+ "sysctl disabled\n");
value = '0';
- pwrite(fd, &value, 1, 0);
- ksft_test_result(set_tagged_addr_ctrl(min_pmlen, true) == 0,
- "sysctl enabled\n");
+ if (!pwrite_wrapper(fd, &value, 1, "write '0'"))
+ ksft_test_result_fail(err_pwrite_msg);
+ else
+ ksft_test_result(set_tagged_addr_ctrl(min_pmlen, true) == 0,
+ "sysctl enabled\n");
set_tagged_addr_ctrl(0, false);
---
base-commit: 40384c840ea1944d7c5a392e8975ed088ecf0b37
change-id: 20241204-fix_warnings_pointer_masking_tests-3860e4f35429
--
- Charlie
This patch series includes some netns-related improvements and fixes for
rtnetlink, to make link creation more intuitive:
1) Creating link in another net namespace doesn't conflict with link
names in current one.
2) Refector rtnetlink link creation. Create link in target namespace
directly.
So that
# ip link add netns ns1 link-netns ns2 tun0 type gre ...
will create tun0 in ns1, rather than create it in ns2 and move to ns1.
And don't conflict with another interface named "tun0" in current netns.
Patch 01 serves for 1) to avoids link name conflict in different netns.
To achieve 2), there're mainly 3 steps:
- Patch 02 packs newlink() parameters into a struct, including
the original "src_net" along with more netns context. No semantic
changes are introduced.
- Patch 03 ~ 07 converts device drivers to use the explicit netns
extracted from params.
- Patch 08 ~ 09 removes the old netns parameter, and converts
rtnetlink to create device in target netns directly.
Patch 10 ~ 11 adds some tests for link name and link netns.
BTW please note there're some issues found in current code:
- In amt_newlink() drivers/net/amt.c:
amt->net = net;
...
amt->stream_dev = dev_get_by_index(net, ...
Uses net, but amt_lookup_upper_dev() only searches in dev_net.
So the AMT device may not be properly deleted if it's in a different
netns from lower dev.
- In gtp_newlink() in drivers/net/gtp.c:
gtp->net = src_net;
...
gn = net_generic(dev_net(dev), gtp_net_id);
list_add_rcu(>p->list, &gn->gtp_dev_list);
Uses src_net, but priv is linked to list in dev_net. So it may not be
properly deleted on removal of link netns.
- In pfcp_newlink() in drivers/net/pfcp.c:
pfcp->net = net;
...
pn = net_generic(dev_net(dev), pfcp_net_id);
list_add_rcu(&pfcp->list, &pn->pfcp_dev_list);
Same as above.
- In lowpan_newlink() in net/ieee802154/6lowpan/core.c:
wdev = dev_get_by_index(dev_net(ldev), nla_get_u32(tb[IFLA_LINK]));
Looks for IFLA_LINK in dev_net, but in theory the ifindex is defined
in link netns.
---
v7:
- Add selftest kconfig.
- Remove a duplicated test of ip6gre.
v6:
link: https://lore.kernel.org/all/20241218130909.2173-1-shaw.leon@gmail.com/
- Split prototype, driver and rtnetlink changes.
- Add more tests for link netns.
- Fix IPv6 tunnel net overwriten in ndo_init().
- Reorder variable declarations.
- Exclude a ip_tunnel-specific patch.
v5:
link: https://lore.kernel.org/all/20241209140151.231257-1-shaw.leon@gmail.com/
- Fix function doc in batman-adv.
- Include peer_net in rtnl newlink parameters.
v4:
link: https://lore.kernel.org/all/20241118143244.1773-1-shaw.leon@gmail.com/
- Pack newlink() parameters to a single struct.
- Use ynl async_msg_queue.empty() in selftest.
v3:
link: https://lore.kernel.org/all/20241113125715.150201-1-shaw.leon@gmail.com/
- Drop "netns_atomic" flag and module parameter. Add netns parameter to
newlink() instead, and convert drivers accordingly.
- Move python NetNSEnter helper to net selftest lib.
v2:
link: https://lore.kernel.org/all/20241107133004.7469-1-shaw.leon@gmail.com/
- Check NLM_F_EXCL to ensure only link creation is affected.
- Add self tests for link name/ifindex conflict and notifications
in different netns.
- Changes in dummy driver and ynl in order to add the test case.
v1:
link: https://lore.kernel.org/all/20241023023146.372653-1-shaw.leon@gmail.com/
Xiao Liang (11):
rtnetlink: Lookup device in target netns when creating link
rtnetlink: Pack newlink() params into struct
net: Use link netns in newlink() of rtnl_link_ops
ieee802154: 6lowpan: Use link netns in newlink() of rtnl_link_ops
net: ip_tunnel: Use link netns in newlink() of rtnl_link_ops
net: ipv6: Use link netns in newlink() of rtnl_link_ops
net: xfrm: Use link netns in newlink() of rtnl_link_ops
rtnetlink: Remove "net" from newlink params
rtnetlink: Create link directly in target net namespace
selftests: net: Add python context manager for netns entering
selftests: net: Add test cases for link and peer netns
drivers/infiniband/ulp/ipoib/ipoib_netlink.c | 11 +-
drivers/net/amt.c | 16 +-
drivers/net/bareudp.c | 11 +-
drivers/net/bonding/bond_netlink.c | 8 +-
drivers/net/can/dev/netlink.c | 4 +-
drivers/net/can/vxcan.c | 9 +-
.../ethernet/qualcomm/rmnet/rmnet_config.c | 11 +-
drivers/net/geneve.c | 11 +-
drivers/net/gtp.c | 9 +-
drivers/net/ipvlan/ipvlan.h | 4 +-
drivers/net/ipvlan/ipvlan_main.c | 15 +-
drivers/net/ipvlan/ipvtap.c | 10 +-
drivers/net/macsec.c | 15 +-
drivers/net/macvlan.c | 8 +-
drivers/net/macvtap.c | 11 +-
drivers/net/netkit.c | 9 +-
drivers/net/pfcp.c | 11 +-
drivers/net/ppp/ppp_generic.c | 10 +-
drivers/net/team/team_core.c | 7 +-
drivers/net/veth.c | 9 +-
drivers/net/vrf.c | 11 +-
drivers/net/vxlan/vxlan_core.c | 11 +-
drivers/net/wireguard/device.c | 11 +-
drivers/net/wireless/virtual/virt_wifi.c | 14 +-
drivers/net/wwan/wwan_core.c | 25 +++-
include/net/ip_tunnels.h | 5 +-
include/net/rtnetlink.h | 44 +++++-
net/8021q/vlan_netlink.c | 15 +-
net/batman-adv/soft-interface.c | 16 +-
net/bridge/br_netlink.c | 12 +-
net/caif/chnl_net.c | 6 +-
net/core/rtnetlink.c | 35 +++--
net/hsr/hsr_netlink.c | 14 +-
net/ieee802154/6lowpan/core.c | 9 +-
net/ipv4/ip_gre.c | 27 ++--
net/ipv4/ip_tunnel.c | 10 +-
net/ipv4/ip_vti.c | 10 +-
net/ipv4/ipip.c | 14 +-
net/ipv6/ip6_gre.c | 42 ++++--
net/ipv6/ip6_tunnel.c | 20 ++-
net/ipv6/ip6_vti.c | 16 +-
net/ipv6/sit.c | 18 ++-
net/xfrm/xfrm_interface_core.c | 15 +-
tools/testing/selftests/net/Makefile | 1 +
tools/testing/selftests/net/config | 5 +
.../testing/selftests/net/lib/py/__init__.py | 2 +-
tools/testing/selftests/net/lib/py/netns.py | 18 +++
tools/testing/selftests/net/link_netns.py | 141 ++++++++++++++++++
tools/testing/selftests/net/netns-name.sh | 10 ++
49 files changed, 550 insertions(+), 226 deletions(-)
create mode 100755 tools/testing/selftests/net/link_netns.py
--
2.47.1
When I implemented virtio's hash-related features to tun/tap [1],
I found tun/tap does not fill the entire region reserved for the virtio
header, leaving some uninitialized hole in the middle of the buffer
after read()/recvmesg().
This series fills the uninitialized hole. More concretely, the
num_buffers field will be initialized with 1, and the other fields will
be inialized with 0. Setting the num_buffers field to 1 is mandated by
virtio 1.0 [2].
The change to virtio header is preceded by another change that refactors
tun and tap to unify their virtio-related code.
[1]: https://lore.kernel.org/r/20241008-rss-v5-0-f3cf68df005d@daynix.com
[2]: https://lore.kernel.org/r/20241227084256-mutt-send-email-mst@kernel.org/
Signed-off-by: Akihiko Odaki <akihiko.odaki(a)daynix.com>
---
Akihiko Odaki (3):
tun: Unify vnet implementation
tun: Pad virtio header with zero
tun: Set num_buffers for virtio 1.0
MAINTAINERS | 1 +
drivers/net/Kconfig | 5 ++
drivers/net/Makefile | 1 +
drivers/net/tap.c | 174 ++++++----------------------------------
drivers/net/tun.c | 212 ++++++++-----------------------------------------
drivers/net/tun_vnet.c | 191 ++++++++++++++++++++++++++++++++++++++++++++
drivers/net/tun_vnet.h | 24 ++++++
7 files changed, 281 insertions(+), 327 deletions(-)
---
base-commit: a32e14f8aef69b42826cf0998b068a43d486a9e9
change-id: 20241230-tun-66e10a49b0c7
Best regards,
--
Akihiko Odaki <akihiko.odaki(a)daynix.com>
Implement comprehensive testing for netconsole userdata entry handling,
demonstrating correct behavior when creating maximum entries and
preventing unauthorized overflow.
Refactor existing test infrastructure to support modular, reusable
helper functions that validate strict entry limit enforcement.
Also, add a warning if update_userdata() sees more than
MAX_USERDATA_ITEMS entries. This shouldn't happen and it is a bug that
shouldn't be silently ignored.
Signed-off-by: Breno Leitao <leitao(a)debian.org>
---
Changes in v2:
- Add the new script (netcons_overflow.sh) in
tools/testing/selftests/drivers/net/Makefile as suggested by Simon
Horman
- Link to v1: https://lore.kernel.org/r/20241204-netcons_overflow_test-v1-0-a85a8d0ace21@…
---
Breno Leitao (4):
netconsole: Warn if MAX_USERDATA_ITEMS limit is exceeded
netconsole: selftest: Split the helpers from the selftest
netconsole: selftest: Delete all userdata keys
netconsole: selftest: verify userdata entry limit
MAINTAINERS | 3 +-
drivers/net/netconsole.c | 2 +-
tools/testing/selftests/drivers/net/Makefile | 1 +
.../selftests/drivers/net/lib/sh/lib_netcons.sh | 225 +++++++++++++++++++++
.../testing/selftests/drivers/net/netcons_basic.sh | 218 +-------------------
.../selftests/drivers/net/netcons_overflow.sh | 67 ++++++
6 files changed, 297 insertions(+), 219 deletions(-)
---
base-commit: 94c16fd4df9089931f674fb9aaec41ea20b0fd7a
change-id: 20241204-netcons_overflow_test-eaf735d1f743
Best regards,
--
Breno Leitao <leitao(a)debian.org>
After commit b1f202060afe ("mm: remap unused subpages to shared zeropage
when splitting isolated thp"), cow test cases involving swapping out
THPs via madvise(MADV_PAGEOUT) started to be skipped due to the
subsequent check via pagemap determining that the memory was not
actually swapped out. Logs similar to this were emitted:
...
# [RUN] Basic COW after fork() ... with swapped-out, PTE-mapped THP (16 kB)
ok 2 # SKIP MADV_PAGEOUT did not work, is swap enabled?
# [RUN] Basic COW after fork() ... with single PTE of swapped-out THP (16 kB)
ok 3 # SKIP MADV_PAGEOUT did not work, is swap enabled?
# [RUN] Basic COW after fork() ... with swapped-out, PTE-mapped THP (32 kB)
ok 4 # SKIP MADV_PAGEOUT did not work, is swap enabled?
...
The commit in question introduces the behaviour of scanning THPs and if
their content is predominantly zero, it splits them and replaces the
pages which are wholly zero with the zero page. These cow test cases
were getting caught up in this.
So let's avoid that by filling the contents of all allocated memory with
a non-zero value. With this in place, the tests are passing again.
Signed-off-by: Ryan Roberts <ryan.roberts(a)arm.com>
---
Applies on top of mm-unstable (f349e79bfbf3)
Thanks,
Ryan
tools/testing/selftests/mm/cow.c | 8 ++++----
1 file changed, 4 insertions(+), 4 deletions(-)
diff --git a/tools/testing/selftests/mm/cow.c b/tools/testing/selftests/mm/cow.c
index 32c6ccc2a6be..1238e1c5aae1 100644
--- a/tools/testing/selftests/mm/cow.c
+++ b/tools/testing/selftests/mm/cow.c
@@ -758,7 +758,7 @@ static void do_run_with_base_page(test_fn fn, bool swapout)
}
/* Populate a base page. */
- memset(mem, 0, pagesize);
+ memset(mem, 1, pagesize);
if (swapout) {
madvise(mem, pagesize, MADV_PAGEOUT);
@@ -824,12 +824,12 @@ static void do_run_with_thp(test_fn fn, enum thp_run thp_run, size_t thpsize)
* Try to populate a THP. Touch the first sub-page and test if
* we get the last sub-page populated automatically.
*/
- mem[0] = 0;
+ mem[0] = 1;
if (!pagemap_is_populated(pagemap_fd, mem + thpsize - pagesize)) {
ksft_test_result_skip("Did not get a THP populated\n");
goto munmap;
}
- memset(mem, 0, thpsize);
+ memset(mem, 1, thpsize);
size = thpsize;
switch (thp_run) {
@@ -1012,7 +1012,7 @@ static void run_with_hugetlb(test_fn fn, const char *desc, size_t hugetlbsize)
}
/* Populate an huge page. */
- memset(mem, 0, hugetlbsize);
+ memset(mem, 1, hugetlbsize);
/*
* We need a total of two hugetlb pages to handle COW/unsharing
--
2.43.0
On Wed, Jan 08, 2025 at 11:39:40AM +0530, Dev Jain wrote:
>
> On 07/01/25 8:44 pm, Thomas Weißschuh wrote:
> > During the execution of validate_complete_va_space() a lot of memory is
> > on the VM subsystem. When running on a low memory subsystem an OOM may
> > be triggered, when writing to the dump file as the filesystem may also
> > require memory.
> >
> > On my test system with 1100MiB physical memory:
> >
> > Tasks state (memory values in pages):
> > [ pid ] uid tgid total_vm rss rss_anon rss_file rss_shmem pgtables_bytes swapents oom_score_adj name
> > [ 57] 0 57 34359215953 695 256 0 439 1064390656 0 0 virtual_address
> >
> > Out of memory: Killed process 57 (virtual_address) total-vm:137436863812kB, anon-rss:1024kB, file-rss:0kB, shmem-rss:1756kB, UID:0 pgtables:1039444kB oom_score_adj:0
> > <snip>
> > fault_in_iov_iter_readable+0x4a/0xd0
> > generic_perform_write+0x9c/0x280
> > shmem_file_write_iter+0x86/0x90
> > vfs_write+0x29c/0x480
> > ksys_write+0x6c/0xe0
> > do_syscall_64+0x9e/0x1a0
> > entry_SYSCALL_64_after_hwframe+0x77/0x7f
> >
> > Write the dumped data into /dev/null instead which does not require
> > additional memory during write(), making the code simpler as a
> > side-effect.
> >
> > Signed-off-by: Thomas Weißschuh<thomas.weissschuh(a)linutronix.de>
> > ---
> > tools/testing/selftests/mm/virtual_address_range.c | 6 ++----
> > 1 file changed, 2 insertions(+), 4 deletions(-)
> >
> > diff --git a/tools/testing/selftests/mm/virtual_address_range.c b/tools/testing/selftests/mm/virtual_address_range.c
> > index 484f82c7b7c871f82a7d9ec6d6c649f2ab1eb0cd..4042fd878acd702d23da2c3293292de33bd48143 100644
> > --- a/tools/testing/selftests/mm/virtual_address_range.c
> > +++ b/tools/testing/selftests/mm/virtual_address_range.c
> > @@ -103,10 +103,9 @@ static int validate_complete_va_space(void)
> > FILE *file;
> > int fd;
> > - fd = open("va_dump", O_CREAT | O_WRONLY, 0600);
> > - unlink("va_dump");
> > + fd = open("/dev/null", O_WRONLY);
> > if (fd < 0) {
> > - ksft_test_result_skip("cannot create or open dump file\n");
> > + ksft_test_result_skip("cannot create or open /dev/null\n");
> > ksft_finished();
> > }
> > @@ -152,7 +151,6 @@ static int validate_complete_va_space(void)
> > while (start_addr + hop < end_addr) {
> > if (write(fd, (void *)(start_addr + hop), 1) != 1)
> > return 1;
> > - lseek(fd, 0, SEEK_SET);
> > hop += MAP_CHUNK_SIZE;
> > }
> >
>
> The reason I had not used /dev/null was that write() was succeeding to /dev/null
> even from an address not in my VA space. I was puzzled about this behaviour of
> /dev/null and I chose to ignore it and just use a real file.
That makes sense and I can reproduce your example.
Switching to another dummy file which reads the written data like
/dev/random also leads to OOM, so wouldn't help either.
Thanks for the explanation.
@Andrew, could you drop this patch?
> To test this behaviour, run the following program:
[..]
PS: Your mail contained HTML and did not make it to the list archives.
(And the text variant of the example program was corrupted)
This patch series implements a new char misc driver, /dev/ntsync, which is used
to implement Windows NT synchronization primitives.
NT synchronization primitives are unique in that the wait functions both are
vectored, operate on multiple types of object with different behaviour (mutex,
semaphore, event), and affect the state of the objects they wait on. This model
is not compatible with existing kernel synchronization objects or interfaces,
and therefore the ntsync driver implements its own wait queues and locking.
This patch series is rebased against the "char-misc-next" branch of
gregkh/char-misc.git.
== Background ==
The Wine project emulates the Windows API in user space. One particular part of
that API, namely the NT synchronization primitives, have historically been
implemented via RPC to a dedicated "kernel" process. However, more recent
applications use these APIs more strenuously, and the overhead of RPC has become
a bottleneck.
The NT synchronization APIs are too complex to implement on top of existing
primitives without sacrificing correctness. Certain operations, such as
NtPulseEvent() or the "wait-for-all" mode of NtWaitForMultipleObjects(), require
direct control over the underlying wait queue, and implementing a wait queue
sufficiently robust for Wine in user space is not possible. This proposed
driver, therefore, implements the problematic interfaces directly in the Linux
kernel.
This driver was presented at Linux Plumbers Conference 2023. For those further
interested in the history of synchronization in Wine and past attempts to solve
this problem in user space, a recording of the presentation can be viewed here:
https://www.youtube.com/watch?v=NjU4nyWyhU8
== Performance ==
The performance measurements described below are copied from earlier versions of
the patch set. While some of the code has changed, I do not currently anticipate
that it has changed drastically enough to affect those measurements.
The gain in performance varies wildly depending on the application in question
and the user's hardware. For some games NT synchronization is not a bottleneck
and no change can be observed, but for others frame rate improvements of 50 to
150 percent are not atypical. The following table lists frame rate measurements
from a variety of games on a variety of hardware, taken by users Dmitry
Skvortsov, FuzzyQuils, OnMars, and myself:
Game Upstream ntsync improvement
===========================================================================
Anger Foot 69 99 43%
Call of Juarez 99.8 224.1 125%
Dirt 3 110.6 860.7 678%
Forza Horizon 5 108 160 48%
Lara Croft: Temple of Osiris 141 326 131%
Metro 2033 164.4 199.2 21%
Resident Evil 2 26 77 196%
The Crew 26 51 96%
Tiny Tina's Wonderlands 130 360 177%
Total War Saga: Troy 109 146 34%
===========================================================================
== Patches ==
The intended semantics of the patches are broadly intended to match those of the
corresponding Windows functions. For those not already familiar with the Windows
functions (or their undocumented behaviour), patch 27/28 provides a detailed
specification, and individual patches also include a brief description of the
API they are implementing.
The patches making use of this driver in Wine can be retrieved or browsed here:
https://repo.or.cz/wine/zf.git/shortlog/refs/heads/ntsync7
== Previous versions ==
Changes from v6:
* rename NTSYNC_IOC_SEM_POST to NTSYNC_IOC_SEM_RELEASE (matching the NT
terminology instead of POSIX),
* change object creation ioctls to return the fds directly in the return value
instead of through the args struct, which simplifies the API a bit.
* Link to v6: https://lore.kernel.org/lkml/20241209185904.507350-1-zfigura@codeweavers.co…
* Link to v5: https://lore.kernel.org/lkml/20240519202454.1192826-1-zfigura@codeweavers.c…
* Link to v4: https://lore.kernel.org/lkml/20240416010837.333694-1-zfigura@codeweavers.co…
* Link to v3: https://lore.kernel.org/lkml/20240329000621.148791-1-zfigura@codeweavers.co…
* Link to v2: https://lore.kernel.org/lkml/20240219223833.95710-1-zfigura@codeweavers.com/
* Link to v1: https://lore.kernel.org/lkml/20240214233645.9273-1-zfigura@codeweavers.com/
* Link to RFC v2: https://lore.kernel.org/lkml/20240131021356.10322-1-zfigura@codeweavers.com/
* Link to RFC v1: https://lore.kernel.org/lkml/20240124004028.16826-1-zfigura@codeweavers.com/
Elizabeth Figura (30):
ntsync: Return the fd from NTSYNC_IOC_CREATE_SEM.
ntsync: Rename NTSYNC_IOC_SEM_POST to NTSYNC_IOC_SEM_RELEASE.
ntsync: Introduce NTSYNC_IOC_WAIT_ANY.
ntsync: Introduce NTSYNC_IOC_WAIT_ALL.
ntsync: Introduce NTSYNC_IOC_CREATE_MUTEX.
ntsync: Introduce NTSYNC_IOC_MUTEX_UNLOCK.
ntsync: Introduce NTSYNC_IOC_MUTEX_KILL.
ntsync: Introduce NTSYNC_IOC_CREATE_EVENT.
ntsync: Introduce NTSYNC_IOC_EVENT_SET.
ntsync: Introduce NTSYNC_IOC_EVENT_RESET.
ntsync: Introduce NTSYNC_IOC_EVENT_PULSE.
ntsync: Introduce NTSYNC_IOC_SEM_READ.
ntsync: Introduce NTSYNC_IOC_MUTEX_READ.
ntsync: Introduce NTSYNC_IOC_EVENT_READ.
ntsync: Introduce alertable waits.
selftests: ntsync: Add some tests for semaphore state.
selftests: ntsync: Add some tests for mutex state.
selftests: ntsync: Add some tests for NTSYNC_IOC_WAIT_ANY.
selftests: ntsync: Add some tests for NTSYNC_IOC_WAIT_ALL.
selftests: ntsync: Add some tests for wakeup signaling with
WINESYNC_IOC_WAIT_ANY.
selftests: ntsync: Add some tests for wakeup signaling with
WINESYNC_IOC_WAIT_ALL.
selftests: ntsync: Add some tests for manual-reset event state.
selftests: ntsync: Add some tests for auto-reset event state.
selftests: ntsync: Add some tests for wakeup signaling with events.
selftests: ntsync: Add tests for alertable waits.
selftests: ntsync: Add some tests for wakeup signaling via alerts.
selftests: ntsync: Add a stress test for contended waits.
maintainers: Add an entry for ntsync.
docs: ntsync: Add documentation for the ntsync uAPI.
ntsync: No longer depend on BROKEN.
Documentation/userspace-api/index.rst | 1 +
Documentation/userspace-api/ntsync.rst | 385 +++++
MAINTAINERS | 9 +
drivers/misc/Kconfig | 1 -
drivers/misc/ntsync.c | 992 +++++++++++-
include/uapi/linux/ntsync.h | 42 +-
tools/testing/selftests/Makefile | 1 +
.../selftests/drivers/ntsync/.gitignore | 1 +
.../testing/selftests/drivers/ntsync/Makefile | 7 +
tools/testing/selftests/drivers/ntsync/config | 1 +
.../testing/selftests/drivers/ntsync/ntsync.c | 1343 +++++++++++++++++
11 files changed, 2767 insertions(+), 16 deletions(-)
create mode 100644 Documentation/userspace-api/ntsync.rst
create mode 100644 tools/testing/selftests/drivers/ntsync/.gitignore
create mode 100644 tools/testing/selftests/drivers/ntsync/Makefile
create mode 100644 tools/testing/selftests/drivers/ntsync/config
create mode 100644 tools/testing/selftests/drivers/ntsync/ntsync.c
base-commit: cdd30ebb1b9f36159d66f088b61aee264e649d7a
--
2.45.2
Hi all,
This patch series continues the work to migrate the *.sh tests into
prog_tests.
test_xdp_redirect.sh tests the XDP redirections done through
bpf_redirect().
These XDP redirections are already tested by prog_tests/xdp_do_redirect.c
but IMO it doesn't cover the exact same code path because
xdp_do_redirect.c uses bpf_prog_test_run_opts() to trigger redirections
of 'fake packets' while test_xdp_redirect.sh redirects packets coming
from the network. Also, the test_xdp_redirect.sh script tests the
redirections with both SKB and DRV modes while xdp_do_redirect.c only
tests the DRV mode.
The patch series adds two new test cases in prog_tests/xdp_do_redirect.c
to replace the test_xdp_redirect.sh script.
Signed-off-by: Bastien Curutchet (eBPF Foundation) <bastien.curutchet(a)bootlin.com>
---
Bastien Curutchet (eBPF Foundation) (3):
selftests/bpf: test_xdp_redirect: Rename BPF sections
selftests/bpf: Migrate test_xdp_redirect.sh to xdp_do_redirect.c
selftests/bpf: Migrate test_xdp_redirect.c to test_xdp_do_redirect.c
tools/testing/selftests/bpf/Makefile | 1 -
.../selftests/bpf/prog_tests/xdp_do_redirect.c | 192 +++++++++++++++++++++
.../selftests/bpf/progs/test_xdp_do_redirect.c | 12 ++
.../selftests/bpf/progs/test_xdp_redirect.c | 26 ---
tools/testing/selftests/bpf/test_xdp_redirect.sh | 79 ---------
5 files changed, 204 insertions(+), 106 deletions(-)
---
base-commit: da86bde1e6d1b887efc46af5ee1f9bbccd27233e
change-id: 20241219-xdp_redirect-2b8ec79dc24e
Best regards,
--
Bastien Curutchet (eBPF Foundation) <bastien.curutchet(a)bootlin.com>
The 2024 architecture release includes a number of data processing
extensions, mostly SVE and SME additions with a few others. These are
all very straightforward extensions which add instructions but no
architectural state so only need hwcaps and exposing of the ID registers
to KVM guests and userspace.
Signed-off-by: Mark Brown <broonie(a)kernel.org>
---
Changes in v4:
- Fix encodings for ID_AA64ISAR3_EL1.
- Link to v3: https://lore.kernel.org/r/20241203-arm64-2024-dpisa-v3-0-a6c78b1aa297@kerne…
Changes in v3:
- Commit log update for the hwcap test.
- Link to v2: https://lore.kernel.org/r/20241030-arm64-2024-dpisa-v2-0-b6601a15d2a5@kerne…
Changes in v2:
- Filter KVM guest visible bitfields in ID_AA64ISAR3_EL1 to only those
we make writeable.
- Link to v1: https://lore.kernel.org/r/20241028-arm64-2024-dpisa-v1-0-a38d08b008a8@kerne…
---
Mark Brown (9):
arm64/sysreg: Update ID_AA64PFR2_EL1 to DDI0601 2024-09
arm64/sysreg: Update ID_AA64ISAR3_EL1 to DDI0601 2024-09
arm64/sysreg: Update ID_AA64FPFR0_EL1 to DDI0601 2024-09
arm64/sysreg: Update ID_AA64ZFR0_EL1 to DDI0601 2024-09
arm64/sysreg: Update ID_AA64SMFR0_EL1 to DDI0601 2024-09
arm64/sysreg: Update ID_AA64ISAR2_EL1 to DDI0601 2024-09
arm64/hwcap: Describe 2024 dpISA extensions to userspace
KVM: arm64: Allow control of dpISA extensions in ID_AA64ISAR3_EL1
kselftest/arm64: Add 2024 dpISA extensions to hwcap test
Documentation/arch/arm64/elf_hwcaps.rst | 51 ++++++
arch/arm64/include/asm/hwcap.h | 17 ++
arch/arm64/include/uapi/asm/hwcap.h | 17 ++
arch/arm64/kernel/cpufeature.c | 35 ++++
arch/arm64/kernel/cpuinfo.c | 17 ++
arch/arm64/kvm/sys_regs.c | 6 +-
arch/arm64/tools/sysreg | 87 +++++++++-
tools/testing/selftests/arm64/abi/hwcap.c | 273 +++++++++++++++++++++++++++++-
8 files changed, 493 insertions(+), 10 deletions(-)
---
base-commit: 40384c840ea1944d7c5a392e8975ed088ecf0b37
change-id: 20241008-arm64-2024-dpisa-8091074a7f48
Best regards,
--
Mark Brown <broonie(a)kernel.org>
This series brings various cleanups and fixes for the mm (mostly
pkeys) kselftests. The original goal was to make the pkeys tests work
out of the box and without build warning - it turned out to be more
involved than expected.
The most important change is enabling -O2 when building all mm
kselftests (patch 5). This is actually needed for the pkeys tests to run
successfully (see gcc command line at the top of protection_keys.c and
pkey_sighandler_tests.c), and seems to have no negative impact on the
other tests. It certainly can't hurt performance!
The following patches address a few obvious issues in the pkeys tests
(unused code, bad scope for functions/variables, etc.) and finally make
a couple of small improvements.
There is one ugliness that this series does not fix: some functions in
pkey-<arch>.h call functions that are actually defined in
protection_keys.c. For instance, expect_fault_on_read_execonly_key() in
pkey-x86.h calls expected_pkey_fault(). This means that other test
programs that use pkey-helpers.h (namely pkey_sighandler_tests) would
fail to link if they called such functions defined in pkey-<arch>.h.
Fixing this would require a more comprehensive reorganisation of the
pkey-* headers, which doesn't seem worth it (patch 9 adds a comment to
pkey-helpers.h to clarify the situation).
Some more details on the patches:
- Patch 1 is an unrelated fix that was revealed by inspecting a warning.
It seems fairly harmless though, so I thought I'd just post it as part
of this series.
- Patch 2-5 fix various warnings that come up by building the mm tests
at -O2 and finally enable -O2.
- Patch 6-12 are various cleanups for the pkeys tests. Patch 11 in
particular enables is_pkeys_supported() to be called from outside
protection_keys.c (patch 13 relies on this).
- Patch 13-14 are small improvements to pkey_sighandler_tests.c.
Many thanks to Ryan Roberts for checking that the mm tests still run
fine on arm64 with those patches applied. I've also checked that the
pkeys tests run fine on arm64 and x86.
- Kevin
---
Cc: akpm(a)linux-foundation.org
Cc: aruna.ramakrishna(a)oracle.com
Cc: catalin.marinas(a)arm.com
Cc: dave.hansen(a)linux.intel.com
Cc: joey.gouly(a)arm.com
Cc: keith.lucas(a)oracle.com
Cc: ryan.roberts(a)arm.com
Cc: shuah(a)kernel.org
Cc: linux-arm-kernel(a)lists.infradead.org
Cc: linux-kselftest(a)vger.kernel.org
Cc: x86(a)kernel.org
---
Kevin Brodsky (14):
selftests/mm: Fix condition in uffd_move_test_common()
selftests/mm: Fix -Wmaybe-uninitialized warnings
selftests/mm: Fix strncpy() length
selftests/mm: Fix -Warray-bounds warnings in pkey_sighandler_tests
selftests/mm: Build with -O2
selftests/mm: Remove unused pkey helpers
selftests/mm: Define types using typedef in pkey-helpers.h
selftests/mm: Ensure pkey-*.h define inline functions only
selftests/mm: Remove empty pkey helper definition
selftests/mm: Ensure non-global pkey symbols are marked static
selftests/mm: Use sys_pkey helpers consistently
selftests/mm: Rename pkey register macro
selftests/mm: Skip pkey_sighandler_tests if support is missing
selftests/mm: Remove X permission from sigaltstack mapping
tools/testing/selftests/mm/Makefile | 6 +-
tools/testing/selftests/mm/ksm_tests.c | 2 +-
tools/testing/selftests/mm/mremap_test.c | 2 +-
tools/testing/selftests/mm/pkey-arm64.h | 6 +-
tools/testing/selftests/mm/pkey-helpers.h | 61 ++---
tools/testing/selftests/mm/pkey-powerpc.h | 4 +-
tools/testing/selftests/mm/pkey-x86.h | 6 +-
.../selftests/mm/pkey_sighandler_tests.c | 32 +--
tools/testing/selftests/mm/pkey_util.c | 40 ++++
tools/testing/selftests/mm/protection_keys.c | 212 +++++++-----------
tools/testing/selftests/mm/soft-dirty.c | 2 +-
tools/testing/selftests/mm/uffd-unit-tests.c | 4 +-
.../testing/selftests/mm/write_to_hugetlbfs.c | 2 +-
13 files changed, 163 insertions(+), 216 deletions(-)
create mode 100644 tools/testing/selftests/mm/pkey_util.c
--
2.47.0
The recently introduced guard-pages mm selftest uses the
process_madvise() syscall, a wrapper for which was added to glibc v2.36.
For those of us stuck with older distributions this causes a compile
error when compiling the mm selftests. For example Ubuntu 22.04 uses
glibc 2.35, which does not have the wrapper.
To workaround the issue, let's introduce our own static
process_madvise() wrapper that uses glibc's syscall() helper.
While we are at it, add the guard-page test suite to run_vmtests.sh so
that it can be automatically run by CI systems.
Signed-off-by: Ryan Roberts <ryan.roberts(a)arm.com>
---
Applies on top of mm-unstable (f349e79bfbf3)
Thanks,
Ryan
tools/testing/selftests/mm/guard-pages.c | 10 ++++++++--
tools/testing/selftests/mm/run_vmtests.sh | 5 +++++
2 files changed, 13 insertions(+), 2 deletions(-)
diff --git a/tools/testing/selftests/mm/guard-pages.c b/tools/testing/selftests/mm/guard-pages.c
index d8f8dee9ebbd..ece37212a8a2 100644
--- a/tools/testing/selftests/mm/guard-pages.c
+++ b/tools/testing/selftests/mm/guard-pages.c
@@ -55,6 +55,12 @@ static int pidfd_open(pid_t pid, unsigned int flags)
return syscall(SYS_pidfd_open, pid, flags);
}
+static ssize_t sys_process_madvise(int pidfd, const struct iovec *iovec,
+ size_t n, int advice, unsigned int flags)
+{
+ return syscall(__NR_process_madvise, pidfd, iovec, n, advice, flags);
+}
+
/*
* Enable our signal catcher and try to read/write the specified buffer. The
* return value indicates whether the read/write succeeds without a fatal
@@ -419,7 +425,7 @@ TEST_F(guard_pages, process_madvise)
ASSERT_EQ(munmap(&ptr_region[99 * page_size], page_size), 0);
/* Now guard in one step. */
- count = process_madvise(pidfd, vec, 6, MADV_GUARD_INSTALL, 0);
+ count = sys_process_madvise(pidfd, vec, 6, MADV_GUARD_INSTALL, 0);
/* OK we don't have permission to do this, skip. */
if (count == -1 && errno == EPERM)
@@ -440,7 +446,7 @@ TEST_F(guard_pages, process_madvise)
ASSERT_FALSE(try_read_write_buf(&ptr3[19 * page_size]));
/* Now do the same with unguard... */
- count = process_madvise(pidfd, vec, 6, MADV_GUARD_REMOVE, 0);
+ count = sys_process_madvise(pidfd, vec, 6, MADV_GUARD_REMOVE, 0);
/* ...and everything should now succeed. */
diff --git a/tools/testing/selftests/mm/run_vmtests.sh b/tools/testing/selftests/mm/run_vmtests.sh
index 2fc290d9430c..00c3f07ea100 100755
--- a/tools/testing/selftests/mm/run_vmtests.sh
+++ b/tools/testing/selftests/mm/run_vmtests.sh
@@ -45,6 +45,8 @@ separated by spaces:
vmalloc smoke tests
- hmm
hmm smoke tests
+- madv_guard
+ test madvise(2) MADV_GUARD_INSTALL and MADV_GUARD_REMOVE options
- madv_populate
test memadvise(2) MADV_POPULATE_{READ,WRITE} options
- memfd_secret
@@ -375,6 +377,9 @@ CATEGORY="mremap" run_test ./mremap_dontunmap
CATEGORY="hmm" run_test bash ./test_hmm.sh smoke
+# MADV_GUARD_INSTALL and MADV_GUARD_REMOVE tests
+CATEGORY="madv_guard" run_test ./guard-pages
+
# MADV_POPULATE_READ and MADV_POPULATE_WRITE tests
CATEGORY="madv_populate" run_test ./madv_populate
--
2.43.0
Currently, kselftests does not have a generalised mechanism to skip compilation
and run tests when required kernel configuration options are disabled.
This patch series adresses this issue by checking whether all required configs
from selftest/<test>/config are enabled in the current kernel
Siddharth Menon (2):
selftests: Introduce script to validate required configs
selftests/lib.mk: Introduce check to validate required configs
tools/testing/selftests/lib.mk | 11 ++-
tools/testing/selftests/mktest.pl | 138 ++++++++++++++++++++++++++++++
2 files changed, 147 insertions(+), 2 deletions(-)
mode change 100644 => 100755 tools/testing/selftests/lib.mk
create mode 100755 tools/testing/selftests/mktest.pl
--
2.39.5