When running Kselftests with the current selftests/net/config
the following problem can be seen with the net:xfrm_policy.sh
selftest:
# selftests: net: xfrm_policy.sh
[ 41.076721] IPv6: ADDRCONF(NETDEV_CHANGE): veth0: link becomes ready
[ 41.094787] IPv6: ADDRCONF(NETDEV_CHANGE): veth0: link becomes ready
[ 41.107635] IPv6: ADDRCONF(NETDEV_CHANGE): veth0: link becomes ready
# modprobe: FATAL: Module ip_tables not found in directory /lib/modules/6.1.36
# iptables v1.8.7 (legacy): can't initialize iptables table `filter': Table does not exist (do you need to insmod?)
# Perhaps iptables or your kernel needs to be upgraded.
# modprobe: FATAL: Module ip_tables not found in directory /lib/modules/6.1.36
# iptables v1.8.7 (legacy): can't initialize iptables table `filter': Table does not exist (do you need to insmod?)
# Perhaps iptables or your kernel needs to be upgraded.
# SKIP: Could not insert iptables rule
ok 1 selftests: net: xfrm_policy.sh # SKIP
This is because IPsec "policy" match support is not enabled in
the kernel.
This patch adds CONFIG_NETFILTER_XT_MATCH_POLICY as a module
to the selftests/net/config file, so that `make
kselftest-merge` can take this into consideration.
Signed-off-by: Daniel Díaz <daniel.diaz(a)linaro.org>
---
tools/testing/selftests/net/config | 1 +
1 file changed, 1 insertion(+)
diff --git a/tools/testing/selftests/net/config b/tools/testing/selftests/net/config
index d1d421ec10a3..cd3cc52c59b4 100644
--- a/tools/testing/selftests/net/config
+++ b/tools/testing/selftests/net/config
@@ -50,3 +50,4 @@ CONFIG_CRYPTO_SM4_GENERIC=y
CONFIG_AMT=m
CONFIG_VXLAN=m
CONFIG_IP_SCTP=m
+CONFIG_NETFILTER_XT_MATCH_POLICY=m
--
2.34.1
From: Björn Töpel <bjorn(a)rivosinc.com>
When you're cross-building kselftest, in this case RISC-V:
| make ARCH=riscv CROSS_COMPILE=riscv64-linux-gnu- O=/tmp/kselftest \
| HOSTCC=gcc FORMAT= SKIP_TARGETS="arm64 ia64 powerpc sparc64 x86 \
| sgx" -C tools/testing/selftests gen_tar
the components (paths) that fail to build are skipped. In this case,
openat2 failed due to missing library support, and proc due to an
x86-64 only test.
This tiny series addresses the problems above.
Björn
Björn Töpel (2):
selftests/openat2: Run-time check for -fsanitize=undefined
selftests/proc: Do not build x86-64 tests on non-x86-64 builds
tools/testing/selftests/openat2/Makefile | 9 ++++++++-
tools/testing/selftests/proc/Makefile | 4 ++++
2 files changed, 12 insertions(+), 1 deletion(-)
base-commit: 3a8a670eeeaa40d87bd38a587438952741980c18
--
2.39.2
Hi, all
Thanks very much for your review suggestions of the v1 series [1], we
just sent out the generic part1 [2], and here is the part2 of the whole
v2 revision.
Changes from v1 -> v2:
* Don't emulate the return values in the new syscalls path; instead, fix
up or support the new syscalls on the side of the related test cases (1-3)
selftests/nolibc: remove gettimeofday_bad1/2 completely
selftests/nolibc: support two errnos with EXPECT_SYSER2()
selftests/nolibc: waitpid_min: add waitid syscall support
(Review suggestions from Willy and Thomas)
* Fix up new failure of the stat_timestamps test case (4, new)
tools/nolibc: add missing nanoseconds support for __NR_statx
(Fixes for the commit a89c937d781a ("tools/nolibc: support nanoseconds in stat()"))
* Add new waitstatus macros as a standalone patch for the waitid support (5)
tools/nolibc: add more wait status related types
(Split and Cleanup for the waitid syscall based sys_wait4)
* Pure 64bit lseek and time64 select/poll/gettimeofday support (6-11)
tools/nolibc: add pure 64bit off_t, time_t and blkcnt_t
tools/nolibc: sys_lseek: add pure 64bit lseek
tools/nolibc: add pure 64bit time structs
tools/nolibc: sys_select: add pure 64bit select
tools/nolibc: sys_poll: add pure 64bit poll
tools/nolibc: sys_gettimeofday: add pure 64bit gettimeofday
(Review suggestions from Arnd, Thomas and Willy, time32 variants have
been removed completely and some fixups)
* waitid syscall support cleanup (12)
tools/nolibc: sys_wait4: add waitid syscall support
(Sync with the waitstatus macros update and Removal of emulated code)
* rv32 nolibc-test support, commit message update (13)
selftests/nolibc: riscv: customize makefile for rv32
(Review suggestions from Thomas, explain more about the change logic in commit message)
Best regards,
Zhangjin
---
[1]: https://lore.kernel.org/linux-riscv/20230529113143.GB2762@1wt.eu/T/#t
[2]: https://lore.kernel.org/linux-riscv/cover.1685362482.git.falcon@tinylab.org/
Zhangjin Wu (13):
selftests/nolibc: remove gettimeofday_bad1/2 completely
selftests/nolibc: support two errnos with EXPECT_SYSER2()
selftests/nolibc: waitpid_min: add waitid syscall support
tools/nolibc: add missing nanoseconds support for __NR_statx
tools/nolibc: add more wait status related types
tools/nolibc: add pure 64bit off_t, time_t and blkcnt_t
tools/nolibc: sys_lseek: add pure 64bit lseek
tools/nolibc: add pure 64bit time structs
tools/nolibc: sys_select: add pure 64bit select
tools/nolibc: sys_poll: add pure 64bit poll
tools/nolibc: sys_gettimeofday: add pure 64bit gettimeofday
tools/nolibc: sys_wait4: add waitid syscall support
selftests/nolibc: riscv: customize makefile for rv32
tools/include/nolibc/arch-aarch64.h | 3 -
tools/include/nolibc/arch-loongarch.h | 3 -
tools/include/nolibc/arch-riscv.h | 3 -
tools/include/nolibc/std.h | 28 ++--
tools/include/nolibc/sys.h | 134 +++++++++++++++----
tools/include/nolibc/types.h | 58 +++++++-
tools/testing/selftests/nolibc/Makefile | 11 +-
tools/testing/selftests/nolibc/nolibc-test.c | 20 +--
8 files changed, 202 insertions(+), 58 deletions(-)
--
2.25.1
This extension allows F_UNLCK to be used in a query, which currently
returns EINVAL. It can then be used to query the locks held on a
particular fd, something that is not currently possible. The basic idea
is that on F_OFD_GETLK, F_UNLCK "conflicts" with (i.e. queries) locks of
any type on the same fd and ignores any locks on other fds.
Use-cases:
1. A CRIU-like scenario, where you want to read the locking info from an
fd for later reconstruction. This can now be done by setting
l_start and l_len to 0 to cover the entire file range and issuing
F_OFD_GETLK. In a loop, advance l_start past each returned lock range
to eventually collect all locked ranges.
2. Implementing a lock checking/enforcing policy.
Say you want to implement an "auditor" module in your program
that checks that I/O is done only after the proper locking has been
applied to a file region. In this case you need to know whether the
particular region is locked on that fd, and if so, with what type
of lock. Doing that currently (without this extension) lets you
check only for write locks, and for that you need to
probe the lock on your fd and then open the same file via another fd and
probe there. That way you can identify a write lock on a particular
fd, but the trick is non-atomic and complex. Finding a
read lock on a particular fd is impossible.
This extension allows such queries without any extra effort.
3. Implementing the mandatory locking policy.
Suppose you want to make a policy where the write lock inhibits any
unlocked readers and writers. Currently you need to check if the
write lock is present on some other fd, and if it is not there - allow
the I/O operation. But because the write lock can appear at any moment,
you need to do that under some global lock, which can be released only
when the I/O operation is finished.
With the proposed extension you can instead just check the write lock
on your own fd first, and if it is there - allow the I/O operation on
that fd without using any global lock. Only if there is no write lock
on this fd do you need to take the global lock and check for a write
lock on other fds.
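Use-case 1 above could look roughly like the sketch below. It is only an
illustration of the described loop, not code from the series: the helper
name dump_ofd_locks() is made up, the fallback F_OFD_* values are the
generic Linux ones for libc headers that do not expose them, and on a
kernel without the extension the first fcntl() call is expected to fail
with EINVAL.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Fallbacks in case the libc headers do not expose the OFD commands. */
#ifndef F_OFD_GETLK
#define F_OFD_GETLK 36
#define F_OFD_SETLK 37
#endif

/*
 * Hypothetical helper: print every OFD lock held through `fd` using the
 * proposed F_UNLCK query. Returns the number of locks found, or -1 on
 * error (errno is EINVAL on kernels without the extension).
 */
static int dump_ofd_locks(int fd)
{
	off_t start = 0;
	int count = 0;

	for (;;) {
		struct flock fl = {
			.l_type = F_UNLCK,	/* proposed: match any lock on this fd */
			.l_whence = SEEK_SET,
			.l_start = start,
			.l_len = 0,		/* 0 means "up to end of file" */
			.l_pid = 0,		/* must be 0 for OFD commands */
		};

		if (fcntl(fd, F_OFD_GETLK, &fl) == -1)
			return -1;
		if (fl.l_type == F_UNLCK)
			break;			/* nothing locked at or after `start` */

		printf("%s lock: start=%lld len=%lld\n",
		       fl.l_type == F_WRLCK ? "write" : "read",
		       (long long)fl.l_start, (long long)fl.l_len);
		count++;

		if (fl.l_len == 0)
			break;			/* this lock already extends to EOF */
		start = fl.l_start + fl.l_len;	/* advance past the returned range */
	}
	return count;
}
```

A caller would check for -1 with errno == EINVAL to detect a kernel
that does not implement the extension and fall back to whatever it did
before.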
The second patch adds a test-case for OFD locks.
It tests both the generic things and the proposed extension.
The third patch is a proposed man page update for fcntl(2)
(not for the linux source tree)
Changes in v3:
- Move selftest to selftests/filelock
Changes in v2:
- Dropped the l_pid extension patch and updated test-case accordingly.
Stas Sergeev (2):
fs/locks: F_UNLCK extension for F_OFD_GETLK
selftests: add OFD lock tests
fs/locks.c | 23 +++-
tools/testing/selftests/filelock/Makefile | 5 +
tools/testing/selftests/filelock/ofdlocks.c | 132 ++++++++++++++++++++
3 files changed, 157 insertions(+), 3 deletions(-)
create mode 100644 tools/testing/selftests/filelock/Makefile
create mode 100644 tools/testing/selftests/filelock/ofdlocks.c
CC: Jeff Layton <jlayton(a)kernel.org>
CC: Chuck Lever <chuck.lever(a)oracle.com>
CC: Alexander Viro <viro(a)zeniv.linux.org.uk>
CC: Christian Brauner <brauner(a)kernel.org>
CC: linux-fsdevel(a)vger.kernel.org
CC: linux-kernel(a)vger.kernel.org
CC: Shuah Khan <shuah(a)kernel.org>
CC: linux-kselftest(a)vger.kernel.org
CC: linux-api(a)vger.kernel.org
--
2.39.2
Willy, Thomas
This is v3 of the series that allows nolibc-test to run with a minimal
kernel config; see v2 [1].
It applies further suggestions from Thomas and is based on our previous v5
sysret helper series [2] and Thomas' chmod_net removal patchset [3].
Now, a test report on arm/vexpress-a9 without procfs, shmem, tmpfs, net
and memfd_create looks like:
LOG: testing report for arm/vexpress-a9:
14 chmod_self [SKIPPED]
16 chown_self [SKIPPED]
40 link_cross [SKIPPED]
0 -fstackprotector not supported [SKIPPED]
139 test(s) passed, 4 skipped, 0 failed.
See all results in /labs/linux-lab/logging/nolibc/arm-vexpress-a9-nolibc-test.log
LOG: testing summary:
arch/board | result
------------|------------
arm/vexpress-a9 | 139 test(s) passed, 4 skipped, 0 failed. See all results in /labs/linux-lab/logging/nolibc/arm-vexpress-a9-nolibc-test.log
Changes from v2 --> v3:
* Added Reviewed-by from Thomas for the whole series, Many Thanks
* selftests/nolibc: stat_fault: silence NULL argument warning with glibc
selftests/nolibc: gettid: restore for glibc and musl
selftests/nolibc: add _LARGEFILE64_SOURCE for musl
selftests/nolibc: fix up int_fast16/32_t test cases for musl
selftests/nolibc: fix up kernel parameters support
selftests/nolibc: stat_timestamps: remove procfs dependency
selftests/nolibc: link_cross: use /proc/self/cmdline
tools/nolibc: add rmdir() support
selftests/nolibc: add a new rmdir() test case
selftests/nolibc: fix up failures when CONFIG_PROC_FS=n
selftests/nolibc: vfprintf: remove MEMFD_CREATE dependency
No code changes except some commit message cleanups.
* selftests/nolibc: prepare /tmp for tmpfs or ramfs
As suggested by Thomas, simply calling mkdir() and mount() to
prepare /tmp can save a stat() call.
* selftests/nolibc: chroot_exe: remove procfs dependency
As suggested by Thomas, remove the 'weird' get_tmpfile() and use
'/init' for !procfs, as we did for stat_timestamps.
For the worst-case scenario, when '/init' is not there, add ENOENT to
the error check list.
Now it is a one-line code change.
* selftests/nolibc: add chmod_tmpdir test
Without get_tmpfile(), directly mkdir() a temp directory for the
chmod_tmpdir test; it functions as a substitute for the removed
chmod_net.
Now it is a one-line code change.
Best regards,
Zhangjin
---
[1]: https://lore.kernel.org/lkml/cover.1688078604.git.falcon@tinylab.org/
Zhangjin Wu (14):
selftests/nolibc: stat_fault: silence NULL argument warning with glibc
selftests/nolibc: gettid: restore for glibc and musl
selftests/nolibc: add _LARGEFILE64_SOURCE for musl
selftests/nolibc: fix up int_fast16/32_t test cases for musl
selftests/nolibc: fix up kernel parameters support
selftests/nolibc: stat_timestamps: remove procfs dependency
selftests/nolibc: chroot_exe: remove procfs dependency
selftests/nolibc: link_cross: use /proc/self/cmdline
tools/nolibc: add rmdir() support
selftests/nolibc: add a new rmdir() test case
selftests/nolibc: fix up failures when CONFIG_PROC_FS=n
selftests/nolibc: prepare /tmp for tmpfs or ramfs
selftests/nolibc: add chmod_tmpdir test
selftests/nolibc: vfprintf: remove MEMFD_CREATE dependency
tools/include/nolibc/sys.h | 22 ++++++
tools/testing/selftests/nolibc/nolibc-test.c | 83 +++++++++++++++-----
2 files changed, 87 insertions(+), 18 deletions(-)
--
2.25.1
This is the initial KUnit integration for running Rust documentation
tests within the kernel.
Thank you to the KUnit team for all the input and feedback on this
over the months, as well as the Intel LKP 0-Day team!
This may be merged through either the KUnit or the Rust trees. If
the KUnit team wants to merge it, then that would be great.
Please see the message in the main commit for the details.
Miguel Ojeda (6):
rust: init: make doctests compilable/testable
rust: str: make doctests compilable/testable
rust: sync: make doctests compilable/testable
rust: types: make doctests compilable/testable
rust: support running Rust documentation tests as KUnit ones
MAINTAINERS: add Rust KUnit files to the KUnit entry
MAINTAINERS | 2 +
lib/Kconfig.debug | 13 +++
rust/.gitignore | 2 +
rust/Makefile | 29 ++++++
rust/bindings/bindings_helper.h | 1 +
rust/helpers.c | 7 ++
rust/kernel/init.rs | 25 +++--
rust/kernel/kunit.rs | 156 ++++++++++++++++++++++++++++
rust/kernel/lib.rs | 2 +
rust/kernel/str.rs | 4 +-
rust/kernel/sync/arc.rs | 9 +-
rust/kernel/sync/lock/mutex.rs | 1 +
rust/kernel/sync/lock/spinlock.rs | 1 +
rust/kernel/types.rs | 6 +-
scripts/.gitignore | 2 +
scripts/Makefile | 4 +
scripts/rustdoc_test_builder.rs | 73 ++++++++++++++
scripts/rustdoc_test_gen.rs | 162 ++++++++++++++++++++++++++++++
18 files changed, 484 insertions(+), 15 deletions(-)
create mode 100644 rust/kernel/kunit.rs
create mode 100644 scripts/rustdoc_test_builder.rs
create mode 100644 scripts/rustdoc_test_gen.rs
base-commit: d2e3115d717197cb2bc020dd1f06b06538474ac3
--
2.41.0
TCP SYN/ACK packets of connections from processes/sockets outside a
cgroup on the same host are not received by the cgroup's installed
cgroup_skb filters.
There were two BPF cgroup_skb programs attached to a cgroup named
"my_cgroup".
SEC("cgroup_skb/ingress")
int ingress(struct __sk_buff *skb)
{
/* .... process skb ... */
return 1;
}
SEC("cgroup_skb/egress")
int egress(struct __sk_buff *skb)
{
/* .... process skb ... */
return 1;
}
We discovered that when running the command "nc -6 -l 8000" in
"my_cgroup" and connecting to it from outside of "my_cgroup" with the
command "nc -6 localhost 8000", the egress filter did not detect the
SYN/ACK packet. However, we did observe the SYN/ACK packet at the
ingress when connecting from a socket in "my_cgroup" to a socket
outside of it.
We came across BPF_CGROUP_RUN_PROG_INET_EGRESS(). This macro is
responsible for calling BPF programs that are attached to the egress
hook of a cgroup and it skips programs if the sending socket is not the
owner of the skb. Specifically, in our situation, the SYN/ACK
skb is owned by a struct request_sock instance, but the sending
socket is the listener socket we use to receive incoming
connections. The request_sock is created to manage an incoming
connection.
It has been determined that checking the owner of a skb against
the sending socket is not required. Removing this check will allow the
filters to receive SYN/ACK packets.
A new selftest has also been added to ensure that cgroup_skb filters
receive all signaling packets, including SYN, SYN/ACK, ACK, FIN, and
FIN/ACK.
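The ownership problem described above can be illustrated with a small
userspace model. This is a toy, not kernel code: struct toy_sock,
struct toy_skb and the helper names are invented for the illustration,
and the "new" rule only mirrors the v4 change as summarized in this
cover letter (compare full sockets rather than the literal skb owner).

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-ins for kernel objects; not the real definitions. */
struct toy_sock {
	bool full;			/* false for a request_sock-like mini socket */
	struct toy_sock *listener;	/* for a mini socket: the listener behind it */
};

struct toy_skb {
	struct toy_sock *sk;		/* owner of the packet */
};

/* Map a mini socket to the full socket behind it. */
static struct toy_sock *to_full_sk(struct toy_sock *sk)
{
	return (sk && !sk->full) ? sk->listener : sk;
}

/* Old rule: the sending socket must be the literal owner of the skb. */
static bool egress_runs_old(struct toy_sock *sender, const struct toy_skb *skb)
{
	return sender && sender == skb->sk;
}

/* New rule: compare the full sockets behind the sender and the owner. */
static bool egress_runs_new(struct toy_sock *sender, const struct toy_skb *skb)
{
	struct toy_sock *full = to_full_sk(sender);

	return full && full->full && full == to_full_sk(skb->sk);
}
```

With a SYN/ACK owned by a mini socket whose listener is the sender, the
old rule skips the egress programs while the new rule runs them; an
unrelated sender is still rejected by both.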
Changes from v3:
- Check SKB ownership against full socket instead of just removing the
check.
- Address the issue raised by Yonghong.
- Put more details down in the commit message.
Changes from v2:
- Remove redundant blank lines.
Changes from v1:
- Check the number of observed packets instead of just sleeping.
- Use ASSERT_XXX() instead of CHECK().
[v1] https://lore.kernel.org/all/20230612191641.441774-1-kuifeng@meta.com/
[v2] https://lore.kernel.org/all/20230617052756.640916-2-kuifeng@meta.com/
[v3] https://lore.kernel.org/all/20230620171409.166001-1-kuifeng@meta.com/
Kui-Feng Lee (2):
net: bpf: Check SKB ownership against full socket.
selftests/bpf: Verify that the cgroup_skb filters receive expected
packets.
include/linux/bpf-cgroup.h | 4 +-
tools/testing/selftests/bpf/cgroup_helpers.c | 12 +
tools/testing/selftests/bpf/cgroup_helpers.h | 1 +
tools/testing/selftests/bpf/cgroup_tcp_skb.h | 35 ++
.../selftests/bpf/prog_tests/cgroup_tcp_skb.c | 402 ++++++++++++++++++
.../selftests/bpf/progs/cgroup_tcp_skb.c | 382 +++++++++++++++++
6 files changed, 834 insertions(+), 2 deletions(-)
create mode 100644 tools/testing/selftests/bpf/cgroup_tcp_skb.h
create mode 100644 tools/testing/selftests/bpf/prog_tests/cgroup_tcp_skb.c
create mode 100644 tools/testing/selftests/bpf/progs/cgroup_tcp_skb.c
--
2.34.1
Willy, Thomas
This is v2 of the series that allows nolibc-test to run with a minimal
kernel config; see v1 [1].
It mainly applies the suggestions from Thomas. It is based on our
previous v5 sysret helper series [2] and Thomas' chmod_net removal
patchset [3].
Now, a test report on arm/vexpress-a9 without procfs, shmem, tmpfs, net
and memfd_create looks like:
LOG: testing report for arm/vexpress-a9:
14 chmod_net [SKIPPED]
15 chmod_self [SKIPPED]
17 chown_self [SKIPPED]
41 link_cross [SKIPPED]
0 -fstackprotector not supported [SKIPPED]
139 test(s) passed, 5 skipped, 0 failed.
See all results in /labs/linux-lab/logging/nolibc/arm-vexpress-a9-nolibc-test.log
LOG: testing summary:
arch/board | result
------------|------------
arm/vexpress-a9 | 139 test(s) passed, 5 skipped, 0 failed. See all results in /labs/linux-lab/logging/nolibc/arm-vexpress-a9-nolibc-test.log
Changes from v1 --> v2:
* selftests/nolibc: stat_fault: silence NULL argument warning with glibc
selftests/nolibc: gettid: restore for glibc and musl
selftests/nolibc: add _LARGEFILE64_SOURCE for musl
selftests/nolibc: add a new rmdir() test case
selftests/nolibc: fix up failures when CONFIG_PROC_FS=n
Same as v1; only a few commit message changes.
* selftests/nolibc: fix up int_fast16/32_t test cases for musl
Applied the method suggested by Thomas: two new macros,
SINT_MAX_OF_TYPE(type) and SINT_MIN_OF_TYPE(type), are added.
* selftests/nolibc: fix up kernel parameters support
After discussion with Thomas and more testing, both argv[1] and the
NOLIBC_TEST environment variable are now verified, to support
kernel parameters such as:
NOLIBC_TEST=syscall
noapic NOLIBC_TEST=syscall
noapic
* selftests/nolibc: stat_timestamps: remove procfs dependency
Add '/init' and '/' for !procfs, don't skip it.
* selftests/nolibc: link_cross: use /proc/self/cmdline
Use /proc/self/cmdline instead of /proc/self/net, the ramfs based
/tmp/file doesn't work as expected (not really crossdev).
* tools/nolibc: add rmdir() support
Now, rebased on __sysret() from sysret helper patchset [2].
* selftests/nolibc: prepare /tmp for tmpfs or ramfs
Removed the hugetlbfs preparation part; it is not really required.
Don't remove /tmp; keep it so that ramfs can be used as tmpfs.
* selftests/nolibc: add common get_tmpfile()
selftests/nolibc: rename chroot_exe to chroot_tmpfile
Some cleanups.
* selftests/nolibc: add chmod_tmpfile test
To avoid conflict with Thomas' chmod_net removal patch [3], a new
chmod_tmpfile is added (in v1, there is a rename patch from
chmod_net to chmod_good)
Still to avoid conflict, these two are removed in this series:
- selftests/nolibc: rename proc variable to has_proc
- selftests/nolibc: rename euid0 variable to is_root
* selftests/nolibc: vfprintf: remove MEMFD_CREATE dependency
Many checks are removed; only the direct tmpfs access version is
kept.
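For the int_fast16/32_t item above, one way to derive signed limits
from the type itself (so that the expected values follow whatever width
the libc picks) is sketched below. The macro names come from the change
log; the bodies are an assumption about how they might be written, and
fast_limits_match() is a made-up helper for the illustration.

```c
#include <limits.h>
#include <stdint.h>

/*
 * Largest value of a signed integer type, built as 2^(N-2) - 1 + 2^(N-2)
 * so that no intermediate result overflows.
 */
#define SINT_MAX_OF_TYPE(type) \
	(((type)1 << (sizeof(type) * 8 - 2)) - (type)1 + \
	 ((type)1 << (sizeof(type) * 8 - 2)))

/* Smallest value, assuming two's complement representation. */
#define SINT_MIN_OF_TYPE(type) \
	(-SINT_MAX_OF_TYPE(type) - 1)

/* Hypothetical check: do the computed limits agree with <stdint.h>? */
static int fast_limits_match(void)
{
	return SINT_MAX_OF_TYPE(int_fast16_t) == INT_FAST16_MAX &&
	       SINT_MIN_OF_TYPE(int_fast16_t) == INT_FAST16_MIN &&
	       SINT_MAX_OF_TYPE(int_fast32_t) == INT_FAST32_MAX &&
	       SINT_MIN_OF_TYPE(int_fast32_t) == INT_FAST32_MIN;
}
```

Because the width comes from sizeof(type), the same expectation holds
whether the libc defines the fast types as 32-bit or 64-bit.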
Best regards,
Zhangjin
---
[1]: https://lore.kernel.org/lkml/cover.1687344643.git.falcon@tinylab.org/
[2]: https://lore.kernel.org/lkml/cover.1687976753.git.falcon@tinylab.org/
[3]: https://lore.kernel.org/lkml/20230624-proc-net-setattr-v1-0-73176812adee@we…
Zhangjin Wu (15):
selftests/nolibc: stat_fault: silence NULL argument warning with glibc
selftests/nolibc: gettid: restore for glibc and musl
selftests/nolibc: add _LARGEFILE64_SOURCE for musl
selftests/nolibc: fix up int_fast16/32_t test cases for musl
selftests/nolibc: fix up kernel parameters support
selftests/nolibc: stat_timestamps: remove procfs dependency
selftests/nolibc: link_cross: use /proc/self/cmdline
tools/nolibc: add rmdir() support
selftests/nolibc: add a new rmdir() test case
selftests/nolibc: fix up failures when CONFIG_PROC_FS=n
selftests/nolibc: prepare /tmp for tmpfs or ramfs
selftests/nolibc: add common get_tmpfile()
selftests/nolibc: rename chroot_exe to chroot_tmpfile
selftests/nolibc: add chmod_tmpfile test
selftests/nolibc: vfprintf: remove MEMFD_CREATE dependency
tools/include/nolibc/sys.h | 22 ++++
tools/testing/selftests/nolibc/nolibc-test.c | 102 +++++++++++++++----
2 files changed, 106 insertions(+), 18 deletions(-)
--
2.25.1
Hi,
This patch series introduces two tests to further enhance and
verify the functionality of the KVM subsystem. These tests focus
on MSR_IA32_DS_AREA and MSR_IA32_PERF_CAPABILITIES.
The first patch adds tests to verify the correct behavior when
trying to set MSR_IA32_DS_AREA with a non-canonical address. It
checks that KVM correctly faults on these non-canonical addresses,
ensuring the accuracy and stability of the KVM subsystem.
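As a reminder of what such an address looks like, here is a minimal
sketch of a canonical-address check, assuming the addresses in question
are what x86-64 calls non-canonical (upper bits not a sign extension of
the top implemented virtual address bit). The function name and the
vaddr_bits parameter are illustrative; this is not the KVM helper.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * An address is canonical for `vaddr_bits` implemented virtual address
 * bits (e.g. 48 or 57) when bits 63..(vaddr_bits - 1) are all zeros or
 * all ones. `vaddr_bits` must be in the range 1..64.
 */
static bool is_canonical(uint64_t addr, unsigned int vaddr_bits)
{
	uint64_t upper = addr >> (vaddr_bits - 1);
	uint64_t all_ones = UINT64_MAX >> (vaddr_bits - 1);

	return upper == 0 || upper == all_ones;
}
```

For 48-bit virtual addresses, 0x0000800000000000 is the lowest
non-canonical value, so it is a natural candidate for a negative test.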
The second patch includes a comprehensive PEBS test that checks all
possible combinations of PEBS-related bits in MSR_IA32_PERF_CAPABILITIES.
This helps to ensure the accuracy of PEBS functionality.
Feedback and suggestions are welcomed and appreciated.
Sincerely,
Jinrong Liang
Jinrong Liang (2):
KVM: selftests: Test consistency of setting MSR_IA32_DS_AREA
KVM: selftests: Add PEBS test for MSR_IA32_PERF_CAPABILITIES
.../selftests/kvm/x86_64/vmx_pmu_caps_test.c | 171 ++++++++++++++++++
1 file changed, 171 insertions(+)
base-commit: 31b4fc3bc64aadd660c5bfa5178c86a7ba61e0f7
--
2.31.1
From: Jeff Xu <jeffxu(a)google.com>
When sysctl vm.memfd_noexec is 2 (MEMFD_NOEXEC_SCOPE_NOEXEC_ENFORCED),
memfd_create(.., MFD_EXEC) should fail.
This complies with how MEMFD_NOEXEC_SCOPE_NOEXEC_ENFORCED is
defined - "memfd_create() without MFD_NOEXEC_SEAL will be rejected"
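A userspace probe for this behavior might look like the sketch below.
It is an illustration only: memfd_exec_is_rejected() is a made-up
helper, the fallback flag values are copied from the uapi header for
older toolchains, and EACCES is assumed to be the errno used when the
sysctl rejects the request.

```c
#include <errno.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Fallback definitions for toolchains with older headers. */
#ifndef MFD_CLOEXEC
#define MFD_CLOEXEC 0x0001U
#endif
#ifndef MFD_NOEXEC_SEAL
#define MFD_NOEXEC_SEAL 0x0008U
#endif
#ifndef MFD_EXEC
#define MFD_EXEC 0x0010U
#endif

/* Thin wrapper, since the libc may not declare memfd_create(). */
static int sys_memfd_create(const char *name, unsigned int flags)
{
	return (int)syscall(__NR_memfd_create, name, flags);
}

/*
 * Returns 1 if an executable memfd is refused (assumed EACCES),
 * 0 if it is allowed, and -1 for any other outcome, e.g. EINVAL on
 * a kernel that does not know MFD_EXEC.
 */
static int memfd_exec_is_rejected(void)
{
	int fd = sys_memfd_create("probe", MFD_CLOEXEC | MFD_EXEC);

	if (fd >= 0) {
		close(fd);
		return 0;
	}
	return errno == EACCES ? 1 : -1;
}
```

With vm.memfd_noexec set to 2 and the fix applied, the probe is
expected to return 1; with the default value of 0 it should return 0.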
Thanks to Dominique Martinet <asmadeus(a)codewreck.org> who reported the bug.
See [1] for context.
[1] https://lore.kernel.org/linux-mm/CABi2SkXUX_QqTQ10Yx9bBUGpN1wByOi_=gZU6WEy5…
Jeff Xu (2):
mm/memfd: sysctl: fix MEMFD_NOEXEC_SCOPE_NOEXEC_ENFORCED
selftests/memfd: sysctl: fix MEMFD_NOEXEC_SCOPE_NOEXEC_ENFORCED
mm/memfd.c | 48 +++++++++++-----------
tools/testing/selftests/memfd/memfd_test.c | 5 +++
2 files changed, 30 insertions(+), 23 deletions(-)
--
2.41.0.255.g8b1d071c50-goog
From: Jeff Xu <jeffxu(a)google.com>
Add documentation for sysctl vm.memfd_noexec
Link:https://lore.kernel.org/linux-mm/CABi2SkXUX_QqTQ10Yx9bBUGpN1wByOi_=gZU…
Reported-by: Dominique Martinet <asmadeus(a)codewreck.org>
Signed-off-by: Jeff Xu <jeffxu(a)google.com>
---
Documentation/admin-guide/sysctl/vm.rst | 30 +++++++++++++++++++++++++
1 file changed, 30 insertions(+)
diff --git a/Documentation/admin-guide/sysctl/vm.rst b/Documentation/admin-guide/sysctl/vm.rst
index 45ba1f4dc004..621588041a9e 100644
--- a/Documentation/admin-guide/sysctl/vm.rst
+++ b/Documentation/admin-guide/sysctl/vm.rst
@@ -424,6 +424,36 @@ e.g., up to one or two maps per allocation.
The default value is 65530.
+memfd_noexec:
+=============
+This pid namespaced sysctl controls memfd_create().
+
+The new MFD_NOEXEC_SEAL and MFD_EXEC flags of memfd_create() allows
+application to set executable bit at creation time.
+
+When MFD_NOEXEC_SEAL is set, memfd is created without executable bit
+(mode:0666), and sealed with F_SEAL_EXEC, so it can't be chmod to
+be executable (mode: 0777) after creation.
+
+when MFD_EXEC flag is set, memfd is created with executable bit
+(mode:0777), this is the same as the old behavior of memfd_create.
+
+The new pid namespaced sysctl vm.memfd_noexec has 3 values:
+0: memfd_create() without MFD_EXEC nor MFD_NOEXEC_SEAL acts like
+ MFD_EXEC was set.
+1: memfd_create() without MFD_EXEC nor MFD_NOEXEC_SEAL acts like
+ MFD_NOEXEC_SEAL was set.
+2: memfd_create() without MFD_NOEXEC_SEAL will be rejected.
+
+The default value is 0.
+
+Once set, it can't be downgraded at runtime, i.e. 2=>1, 1=>0
+are denied.
+
+This is pid namespaced sysctl, child processes inherit the parent
+process's memfd_noexec at the time of fork. Changes to the parent
+process after fork are not automatically propagated to the child
+process.
memory_failure_early_kill:
==========================
--
2.41.0.255.g8b1d071c50-goog
Hi,
This patch series aims to improve the PMU event filter settings with a cleaner
and more organized structure and adds several test cases related to PMU event
filters.
The first patch of this series introduces a custom "__kvm_pmu_event_filter"
structure that simplifies the event filter setup and improves overall code
readability and maintainability.
The second patch adds test cases to check that unsupported input values in the
PMU event filters are rejected, covering unsupported "action" values,
unsupported "flags" values, and unsupported "nevents" values, as well as the
setting of non-existent fixed counters in the fixed bitmap.
The third patch includes tests for the PMU event filter's behavior when applied
to fixed performance counters, ensuring the correct operation in cases where no
fixed counters exist (e.g., Intel guest PMU version=1 or AMD guest).
Finally, the fourth patch adds a test to verify that setting both generic and
fixed performance event filters does not impact the consistency of the fixed
performance filter behavior.
These changes help to ensure that KVM's PMU event filter functions as expected
in all supported use cases. These patches have been tested and verified to
function properly.
Any feedback or suggestions are greatly appreciated.
Please note that the following patches should be applied before this patch series:
https://lore.kernel.org/kvm/20230530134248.23998-2-cloudliang@tencent.com
https://lore.kernel.org/kvm/20230530134248.23998-3-cloudliang@tencent.com
This will ensure that macro definitions such as X86_INTEL_MAX_FIXED_CTR_NUM,
INTEL_PMC_IDX_FIXED, etc. can be used.
Sincerely,
Jinrong Liang
Changes log:
v3:
- Rebased to 31b4fc3bc64a(tag: kvm-x86-next-2023.06.02).
- Dropped the patch "KVM: selftests: Replace int with uint32_t for nevents". (Sean)
- Dropped the patch "KVM: selftests: Test pmu event filter with incompatible
kvm_pmu_event_filter". (Sean)
- Introduce __kvm_pmu_event_filter to replace the original method of creating
PMU event filters. (Sean)
- Use the macro definition of kvm_cpu_property to find the number of supported
fixed counters instead of calculating it via the vcpu's cpuid. (Sean)
- Remove the wrappers that are single line passthroughs. (Sean)
- Optimize function names and variable names. (Sean)
- Optimize comments to make them more rigorous. (Sean)
v2:
- Wrap the code from the documentation in a block of code. (Bagas Sanjaya)
v1:
https://lore.kernel.org/kvm/20230414110056.19665-1-cloudliang@tencent.com
Jinrong Liang (4):
KVM: selftests: Introduce __kvm_pmu_event_filter to improved event
filter settings
KVM: selftests: Test unavailable event filters are rejected
KVM: selftests: Check if event filter meets expectations on fixed
counters
KVM: selftests: Test gp event filters don't affect fixed event filters
.../kvm/x86_64/pmu_event_filter_test.c | 341 +++++++++++++-----
1 file changed, 246 insertions(+), 95 deletions(-)
base-commit: 31b4fc3bc64aadd660c5bfa5178c86a7ba61e0f7
prerequisite-patch-id: 909d42f185f596d6e5c5b48b33231c89fa5236e4
prerequisite-patch-id: ba0dd0f97d8db0fb6cdf2c7f1e3a60c206fc9784
--
2.31.1
Hi, Willy
This patchset mainly speeds up the nolibc test by allowing it to run
with a minimal kernel config.
As nolibc supports more and more architectures, the 'run' test with
DEFCONFIG may take several hours, which is unfriendly to both
development testing and release testing. Smaller kernel configs are
therefore needed, and as a first step nolibc-test has to work with fewer
kernel config options; that is the goal of this patchset.
This patchset mainly removes the dependencies on procfs, tmpfs, net and
memfd_create; many failures have been fixed up.
When CONFIG_TMPFS and CONFIG_SHMEM are disabled, the kernel provides a
ramfs-based tmpfs (mm/shmem.c); it is used here to fix up some failures
and to skip fewer tests.
Besides, it also adds musl support, improves glibc support and fixes up
a kernel cmdline passing use case.
This is based on the dev.2023.06.14a branch of linux-rcu [1], all of the
supported architectures are tested (with local minimal configs, [5]
pasted the one for i386) without failures:
arch/board | result
------------|------------
arm/vexpress-a9 | 138 test(s) passed, 1 skipped, 0 failed. See all results in /labs/linux-lab/logging/nolibc/arm-vexpress-a9-nolibc-test.log
aarch64/virt | 138 test(s) passed, 1 skipped, 0 failed. See all results in /labs/linux-lab/logging/nolibc/aarch64-virt-nolibc-test.log
ppc/g3beige | not supported
i386/pc | 136 test(s) passed, 3 skipped, 0 failed. See all results in /labs/linux-lab/logging/nolibc/i386-pc-nolibc-test.log
x86_64/pc | 138 test(s) passed, 1 skipped, 0 failed. See all results in /labs/linux-lab/logging/nolibc/x86_64-pc-nolibc-test.log
mipsel/malta | 138 test(s) passed, 1 skipped, 0 failed. See all results in /labs/linux-lab/logging/nolibc/mipsel-malta-nolibc-test.log
loongarch64/virt | 138 test(s) passed, 1 skipped, 0 failed. See all results in /labs/linux-lab/logging/nolibc/loongarch64-virt-nolibc-test.log
riscv64/virt | 136 test(s) passed, 3 skipped, 0 failed. See all results in /labs/linux-lab/logging/nolibc/riscv64-virt-nolibc-test.log
riscv32/virt | no test log found
s390x/s390-ccw-virtio | 138 test(s) passed, 1 skipped, 0 failed. See all results in /labs/linux-lab/logging/nolibc/s390x-s390-ccw-virtio-nolibc-test.log
Notes:
* The skipped ones are -fstackprotector, chmod_self and chown_self
The -fstackprotector skip is due to the gcc version.
The chmod_self and chown_self skips are due to procfs not being enabled.
* ppc/g3beige support has been added locally but is not part of this
patchset; it will be sent as a new patchset, since it depends on the v2
test report patchset [3] and the v5 rv32 support, and requires changes
to the Makefile
* riscv32/virt support is still in review, see v5 rv32 support [4]
This patchset doesn't depend on any of my other nolibc patch series,
but the new rmdir() routine added here may need to switch to
__sysret() from our v4 syscall helper series [2] once that
series is merged; for now, the old method is used so that it compiles
without any dependency.
The patches are explained below:
* selftests/nolibc: stat_fault: silence NULL argument warning with glibc
selftests/nolibc: gettid: restore for glibc and musl
selftests/nolibc: add _LARGEFILE64_SOURCE for musl
The above 3 patches add musl compile support and improve glibc support.
nolibc-test can now be built and run with musl libc, but there
are some failures/skips due to musl's own issues/requirements:
$ sudo ./nolibc-test | grep -E 'FAIL|SKIP'
8 sbrk = 1 ENOMEM [FAIL]
9 brk = -1 ENOMEM [FAIL]
46 limit_int_fast16_min = -2147483648 [FAIL]
47 limit_int_fast16_max = 2147483647 [FAIL]
49 limit_int_fast32_min = -2147483648 [FAIL]
50 limit_int_fast32_max = 2147483647 [FAIL]
0 -fstackprotector not supported [SKIPPED]
musl disabled sbrk and brk because of conflicts with its malloc, and its
fast int types are defined as 32-bit, which differs from nolibc
and glibc. musl keeps sbrk(0) so that the current brk can be read; we
added a test for this in the v4 __sysret() helper series [2].
* selftests/nolibc: fix up kernel parameters support
The kernel cmdline can pass two types of parameters to init: those
without '=' are passed as init arguments, and those with '=' are
passed as init environment variables.
nolibc-test prefers arguments to environment variables, which does not
work when users add parameters like these to the kernel cmdline:
noapic NOLIBC_TEST=syscall
So this patch checks the setting from the arguments first and, if it
is not valid, tries the environment variable instead.
* selftests/nolibc: stat_timestamps: remove procfs dependency
Use '/' instead of /proc/self. We could add a 'has_proc' condition
for this test case, but it is not worth skipping the whole
stat_timestamps test for a single subtest bound to /proc/self.
Suggestions from Thomas are welcome.
* tools/nolibc: add rmdir() support
selftests/nolibc: add a new rmdir() test case
The rmdir() routine and its test case are added for the upcoming requirement.
Note: if the __sysret() patchset [2] is applied before this one, this patch
should be rebased on it and use the __sysret() helper.
* selftests/nolibc: fix up failures when there is no procfs
Call rmdir() to remove /proc completely and rework the /proc check.
Before this change, the existence of /proc did not mean that procfs was
really mounted.
* selftests/nolibc: rename proc variable to has_proc
selftests/nolibc: rename euid0 variable to is_root
align with the has_gettid, has_xxx variables.
* selftests/nolibc: prepare tmpfs and hugetlbfs
selftests/nolibc: rename chmod_net to chmod_good
selftests/nolibc: link_cross: support tmpfs
selftests/nolibc: rename chroot_exe to chroot_file
Use a file from /tmp instead of a file from /proc when there is no procfs.
This avoids skipping the chmod_net, link_cross and chroot_exe tests.
* selftests/nolibc: vfprintf: silence memfd_create() warning
selftests/nolibc: vfprintf: skip if neither tmpfs nor hugetlbfs
selftests/nolibc: vfprintf: support tmpfs and hugetlbfs
memfd_create() from kernel >= v6.2 always warns when the MFD_NOEXEC_SEAL
flag is missing. The first patch silences the warning by passing that
flag; for older kernels, it uses flag 0 as before.
Since memfd_create() depends on TMPFS or HUGETLBFS, the second patch
skips the whole vfprintf test instead of simply failing when
memfd_create() does not work.
The third patch additionally tries the ramfs-based tmpfs even when
memfd_create() does not work.
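The argument-then-environment lookup described for the kernel parameters fix can be sketched roughly as follows. This is only an illustration and not the code from the patch: the names known_tests, is_known_test() and select_test() are hypothetical, and the real nolibc-test parses richer selections than a bare test name.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical table of valid test names; the real nolibc-test has its own. */
static const char *const known_tests[] = { "syscall", "stdlib", NULL };

/* Return 1 if name matches one of the known test names, 0 otherwise. */
static int is_known_test(const char *name)
{
	size_t i;

	if (!name)
		return 0;
	for (i = 0; known_tests[i]; i++)
		if (strcmp(name, known_tests[i]) == 0)
			return 1;
	return 0;
}

/*
 * Pick the test selection: a valid argv[1] wins, otherwise fall back to
 * the NOLIBC_TEST environment variable. NULL means nothing was selected.
 */
static const char *select_test(int argc, char **argv)
{
	if (argc > 1 && is_known_test(argv[1]))
		return argv[1];
	return getenv("NOLIBC_TEST");
}
```

With the cmdline "noapic NOLIBC_TEST=syscall", init receives "noapic" as argv[1], which is not a valid test name, so the sketch falls through to the environment and returns "syscall".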
Lastly, a brief word about the configs: I have prepared minimal configs
for all of the nolibc-supported architectures, but I'm not sure where we
should put them. What about tools/testing/selftests/nolibc/configs?
Thanks!
Best regards,
Zhangjin
---
[1]: https://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu.git/
[2]: https://lore.kernel.org/linux-riscv/cover.1687187451.git.falcon@tinylab.org/
[3]: https://lore.kernel.org/lkml/cover.1687156559.git.falcon@tinylab.org/
[4]: https://lore.kernel.org/linux-riscv/cover.1687176996.git.falcon@tinylab.org/
[5]: https://pastebin.com/5jq0Vxbz
Zhangjin Wu (17):
selftests/nolibc: stat_fault: silence NULL argument warning with glibc
selftests/nolibc: gettid: restore for glibc and musl
selftests/nolibc: add _LARGEFILE64_SOURCE for musl
selftests/nolibc: fix up kernel parameters support
selftests/nolibc: stat_timestamps: remove procfs dependency
tools/nolibc: add rmdir() support
selftests/nolibc: add a new rmdir() test case
selftests/nolibc: fix up failures when there is no procfs
selftests/nolibc: rename proc variable to has_proc
selftests/nolibc: rename euid0 variable to is_root
selftests/nolibc: prepare tmpfs and hugetlbfs
selftests/nolibc: rename chmod_net to chmod_good
selftests/nolibc: link_cross: support tmpfs
selftests/nolibc: rename chroot_exe to chroot_file
selftests/nolibc: vfprintf: silence memfd_create() warning
selftests/nolibc: vfprintf: skip if neither tmpfs nor hugetlbfs
selftests/nolibc: vfprintf: support tmpfs and hugetlbfs
tools/include/nolibc/sys.h | 28 ++++
tools/testing/selftests/nolibc/nolibc-test.c | 132 +++++++++++++++----
2 files changed, 138 insertions(+), 22 deletions(-)
--
2.25.1
From: Jeff Xu <jeffxu(a)google.com>
Since Linux introduced the memfd feature, memfds have always had their
execute bit set, and the memfd_create() syscall doesn't allow setting
it differently.
However, in a secure-by-default system such as ChromeOS (where all
executables should come from the rootfs, which is protected by Verified
boot), this executable nature of memfd opens a door for NoExec bypass
and enables a “confused deputy attack”. E.g., in VRP bug [1], the cros_vm
process created a memfd to share its content with an external process;
however, the memfd was overwritten and used for executing arbitrary code
and root escalation. [2] lists more VRP bugs of this kind.
On the other hand, executable memfds have legitimate uses: runc uses memfd's
seal and executable features to copy the contents of a binary and then
execute it. For such systems, we need a solution to differentiate runc's
use of executable memfds from an attacker's [3].
To address the above, this set of patches adds the following:
1> Let memfd_create() set the X bit at creation time.
2> Let a memfd be sealed against modification of the X bit.
3> A new pid namespace sysctl, vm.memfd_noexec, to control the behavior of
the X bit. For example, if a container has vm.memfd_noexec=2, then
memfd_create() without MFD_NOEXEC_SEAL will be rejected.
4> A new security hook in memfd_create(). This makes it possible for a new
LSM to reject or allow executable memfds based on its security policy.
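From the caller's side, the flag handling above might look like the sketch below. It assumes that a kernel without MFD_NOEXEC_SEAL rejects the unknown flag with EINVAL, so the call is retried without it; memfd_create_noexec() is a hypothetical helper name, and the raw syscall is used so the sketch does not depend on the libc wrapper.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <errno.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef MFD_NOEXEC_SEAL
#define MFD_NOEXEC_SEAL 0x0008U	/* value from include/uapi/linux/memfd.h */
#endif

/*
 * Create a memfd that is non-executable and sealed against becoming
 * executable where the kernel supports it. Assumption: older kernels
 * return EINVAL for the unknown flag, so retry with the caller's flags.
 */
static int memfd_create_noexec(const char *name, unsigned int flags)
{
	int fd = (int)syscall(SYS_memfd_create, name, flags | MFD_NOEXEC_SEAL);

	if (fd < 0 && errno == EINVAL)
		fd = (int)syscall(SYS_memfd_create, name, flags);
	return fd;
}
```

A caller that really needs an executable memfd would pass MFD_EXEC explicitly instead of using a helper like this one.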
Change history:
v8:
- Update ref bug in cover letter.
- Add Reviewed-by field.
- Remove security hook (security_memfd_create) patch, which will have
its own patch set in future.
v7:
- patch 2/6: remove #ifdef and MAX_PATH (memfd_test.c).
- patch 3/6: check capability (CAP_SYS_ADMIN) from userns instead of
global ns (pid_sysctl.h). Add a tab (pid_namespace.h).
- patch 5/6: remove #ifdef (memfd_test.c)
- patch 6/6: remove unneeded security_move_mount(security.c).
v6:https://lore.kernel.org/lkml/20221206150233.1963717-1-jeffxu@google.com/
- Address comment and move "#ifdef CONFIG_" from .c file to pid_sysctl.h
v5:https://lore.kernel.org/lkml/20221206152358.1966099-1-jeffxu@google.com/
- Pass vm.memfd_noexec from current ns to child ns.
- Fix build issue detected by kernel test robot.
- Add missing security.c
v3:https://lore.kernel.org/lkml/20221202013404.163143-1-jeffxu@google.com/
- Address API design comments in v2.
- Let memfd_create() to set X bit at creation time.
- A new pid namespace sysctl: vm.memfd_noexec to control behavior of X bit.
- A new security hook in memfd_create().
v2:https://lore.kernel.org/lkml/20220805222126.142525-1-jeffxu@google.com/
- address comments in V1.
- add sysctl (vm.mfd_noexec) to set the default file permissions of
memfd_create to be non-executable.
v1:https://lwn.net/Articles/890096/
[1] https://crbug.com/1305267
[2] https://bugs.chromium.org/p/chromium/issues/list?q=type%3Dbug-security%20me…
[3] https://lwn.net/Articles/781013/
Daniel Verkamp (2):
mm/memfd: add F_SEAL_EXEC
selftests/memfd: add tests for F_SEAL_EXEC
Jeff Xu (3):
mm/memfd: add MFD_NOEXEC_SEAL and MFD_EXEC
mm/memfd: Add write seals when apply SEAL_EXEC to executable memfd
selftests/memfd: add tests for MFD_NOEXEC_SEAL MFD_EXEC
include/linux/pid_namespace.h | 19 ++
include/uapi/linux/fcntl.h | 1 +
include/uapi/linux/memfd.h | 4 +
kernel/pid_namespace.c | 5 +
kernel/pid_sysctl.h | 59 ++++
mm/memfd.c | 56 +++-
mm/shmem.c | 6 +
tools/testing/selftests/memfd/fuse_test.c | 1 +
tools/testing/selftests/memfd/memfd_test.c | 341 ++++++++++++++++++++-
9 files changed, 489 insertions(+), 3 deletions(-)
create mode 100644 kernel/pid_sysctl.h
base-commit: eb7081409f94a9a8608593d0fb63a1aa3d6f95d8
--
2.39.0.rc1.256.g54fd8350bd-goog
From: sunliming <sunliming(a)kylinos.cn>
[ Upstream commit ba470eebc2f6c2f704872955a715b9555328e7d0 ]
User processes register name_args for events. If an event is registered
with the same name but different args, the trace output of the second
event is printed as if it were the first event. This is incorrect.
Return EADDRINUSE to the user process if an event with the same name but
different args has already been registered.
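The rule can be modelled in user space with a small registry, shown here purely as an illustration. The kernel compares parsed fields (user_fields_match() in the diff below), whereas this sketch compares the raw args strings; struct event, events[] and register_event() are hypothetical names, and -ENOSPC for a full table is an arbitrary choice.

```c
#include <errno.h>
#include <string.h>

#define MAX_EVENTS 8

struct event {
	const char *name;
	const char *args;
};

static struct event events[MAX_EVENTS];
static int nr_events;

/* Return the index of the (possibly pre-existing) event, or a negative errno. */
static int register_event(const char *name, const char *args)
{
	int i;

	for (i = 0; i < nr_events; i++) {
		if (strcmp(events[i].name, name) != 0)
			continue;
		/* Same name: only an identical definition may share the event. */
		if (strcmp(events[i].args, args) == 0)
			return i;
		return -EADDRINUSE;
	}
	if (nr_events == MAX_EVENTS)
		return -ENOSPC;
	events[nr_events].name = name;
	events[nr_events].args = args;
	return nr_events++;
}
```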
Link: https://lore.kernel.org/linux-trace-kernel/20230529032100.286534-1-sunlimin…
Signed-off-by: sunliming <sunliming(a)kylinos.cn>
Reviewed-by: Masami Hiramatsu (Google) <mhiramat(a)kernel.org>
Acked-by: Beau Belgrave <beaub(a)linux.microsoft.com>
Signed-off-by: Steven Rostedt (Google) <rostedt(a)goodmis.org>
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
---
kernel/trace/trace_events_user.c | 36 +++++++++++++++----
.../selftests/user_events/ftrace_test.c | 6 ++++
2 files changed, 36 insertions(+), 6 deletions(-)
diff --git a/kernel/trace/trace_events_user.c b/kernel/trace/trace_events_user.c
index 625cab4b9d945..774d146c2c2ca 100644
--- a/kernel/trace/trace_events_user.c
+++ b/kernel/trace/trace_events_user.c
@@ -1274,6 +1274,8 @@ static int user_event_parse(struct user_event_group *group, char *name,
int index;
u32 key;
struct user_event *user;
+ int argc = 0;
+ char **argv;
/* Prevent dyn_event from racing */
mutex_lock(&event_mutex);
@@ -1281,13 +1283,35 @@ static int user_event_parse(struct user_event_group *group, char *name,
mutex_unlock(&event_mutex);
if (user) {
- *newuser = user;
- /*
- * Name is allocated by caller, free it since it already exists.
- * Caller only worries about failure cases for freeing.
- */
- kfree(name);
+ if (args) {
+ argv = argv_split(GFP_KERNEL, args, &argc);
+ if (!argv) {
+ ret = -ENOMEM;
+ goto error;
+ }
+
+ ret = user_fields_match(user, argc, (const char **)argv);
+ argv_free(argv);
+
+ } else
+ ret = list_empty(&user->fields);
+
+ if (ret) {
+ *newuser = user;
+ /*
+ * Name is allocated by caller, free it since it already exists.
+ * Caller only worries about failure cases for freeing.
+ */
+ kfree(name);
+ } else {
+ ret = -EADDRINUSE;
+ goto error;
+ }
+
return 0;
+error:
+ refcount_dec(&user->refcnt);
+ return ret;
}
index = find_first_zero_bit(group->page_bitmap, MAX_EVENTS);
diff --git a/tools/testing/selftests/user_events/ftrace_test.c b/tools/testing/selftests/user_events/ftrace_test.c
index 1bc26e6476fc3..df0e776c2cc1b 100644
--- a/tools/testing/selftests/user_events/ftrace_test.c
+++ b/tools/testing/selftests/user_events/ftrace_test.c
@@ -209,6 +209,12 @@ TEST_F(user, register_events) {
ASSERT_EQ(0, reg.write_index);
ASSERT_NE(0, reg.status_bit);
+ /* Multiple registers to same name but different args should fail */
+ reg.enable_bit = 29;
+ reg.name_args = (__u64)"__test_event u32 field1;";
+ ASSERT_EQ(-1, ioctl(self->data_fd, DIAG_IOCSREG, &reg));
+ ASSERT_EQ(EADDRINUSE, errno);
+
/* Ensure disabled */
self->enable_fd = open(enable_file, O_RDWR);
ASSERT_NE(-1, self->enable_fd);
--
2.39.2
=== Context ===
In the context of a middlebox, fragmented packets are tricky to handle.
The full 5-tuple of a packet is often only available in the first
fragment which makes enforcing consistent policy difficult. There are
really only two stateless options, neither of which are very nice:
1. Enforce policy on first fragment and accept all subsequent fragments.
This works but may let in certain attacks or allow data exfiltration.
2. Enforce policy on first fragment and drop all subsequent fragments.
This does not really work b/c some protocols may rely on
fragmentation. For example, DNS may rely on oversized UDP packets for
large responses.
So stateful tracking is the only sane option. RFC 8900 [0] calls this
out as well in section 6.3:
Middleboxes [...] should process IP fragments in a manner that is
consistent with [RFC0791] and [RFC8200]. In many cases, middleboxes
must maintain state in order to achieve this goal.
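To make the 5-tuple problem concrete, here is a minimal sketch of how a stateless filter could classify an IPv4 packet from its flags/fragment-offset field. The macro and function names are made up for this example, and frag_off is assumed to have already been converted to host byte order.

```c
#include <stdbool.h>
#include <stdint.h>

#define IPV4_MF      0x2000u	/* "more fragments" flag */
#define IPV4_OFFMASK 0x1fffu	/* fragment offset, in 8-byte units */

/* True for any fragment: first, middle or last. */
static bool ipv4_is_fragment(uint16_t frag_off)
{
	return (frag_off & (IPV4_MF | IPV4_OFFMASK)) != 0;
}

/*
 * Only a packet with fragment offset zero starts with the L4 header, so
 * only there can the ports of the 5-tuple be read.
 */
static bool ipv4_has_l4_header(uint16_t frag_off)
{
	return (frag_off & IPV4_OFFMASK) == 0;
}
```

Every fragment for which ipv4_has_l4_header() is false has to be judged without ports, which is where the two stateless options above come from.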
=== BPF related bits ===
Policy has traditionally been enforced from XDP/TC hooks. Both hooks
run before kernel reassembly facilities. However, with the new
BPF_PROG_TYPE_NETFILTER, we can rather easily hook into existing
netfilter reassembly infra.
The basic idea is we bump a refcnt on the netfilter defrag module and
then run the bpf prog after the defrag module runs. This allows bpf
progs to transparently see full, reassembled packets. The nice thing
about this is that progs don't have to carry around logic to detect
fragments.
=== Patchset details ===
There was an earlier attempt at providing defrag via kfuncs [1]. The
feedback was that we could end up doing too much stuff in prog execution
context (like sending ICMP error replies). However, I think there is
still some outstanding discussion w.r.t. performance when it comes to
netfilter vs the previous approach. I'll schedule some time during
office hours for this.
Patches 1 & 2 are stolen from Florian. Hopefully he doesn't mind. There
were some outstanding comments on the v2 [2] but it doesn't look like a
v3 was ever submitted. I've addressed the comments and put them in this
patchset cuz I needed them.
Finally, the new selftest seems to be a little flaky. I'm not quite
sure why the server will occasionally fail to `recvfrom()`. I'm fairly
sure it's a timing related issue with creating veths. I'll keep
debugging but I didn't want that to hold up discussion on this patchset.
[0]: https://datatracker.ietf.org/doc/html/rfc8900
[1]: https://lore.kernel.org/bpf/cover.1677526810.git.dxu@dxuuu.xyz/
[2]: https://lore.kernel.org/bpf/20230525110100.8212-1-fw@strlen.de/
Daniel Xu (7):
tools: libbpf: add netfilter link attach helper
selftests/bpf: Add bpf_program__attach_netfilter helper test
netfilter: defrag: Add glue hooks for enabling/disabling defrag
netfilter: bpf: Support BPF_F_NETFILTER_IP_DEFRAG in netfilter link
bpf: selftests: Support not connecting client socket
bpf: selftests: Support custom type and proto for client sockets
bpf: selftests: Add defrag selftests
include/linux/netfilter.h | 12 +
include/uapi/linux/bpf.h | 5 +
net/ipv4/netfilter/nf_defrag_ipv4.c | 8 +
net/ipv6/netfilter/nf_defrag_ipv6_hooks.c | 10 +
net/netfilter/core.c | 6 +
net/netfilter/nf_bpf_link.c | 108 ++++++-
tools/include/uapi/linux/bpf.h | 5 +
tools/lib/bpf/bpf.c | 8 +
tools/lib/bpf/bpf.h | 6 +
tools/lib/bpf/libbpf.c | 47 +++
tools/lib/bpf/libbpf.h | 15 +
tools/lib/bpf/libbpf.map | 1 +
tools/testing/selftests/bpf/Makefile | 4 +-
.../selftests/bpf/generate_udp_fragments.py | 90 ++++++
.../selftests/bpf/ip_check_defrag_frags.h | 57 ++++
tools/testing/selftests/bpf/network_helpers.c | 26 +-
tools/testing/selftests/bpf/network_helpers.h | 3 +
.../bpf/prog_tests/ip_check_defrag.c | 282 ++++++++++++++++++
.../bpf/prog_tests/netfilter_basic.c | 78 +++++
.../selftests/bpf/progs/ip_check_defrag.c | 104 +++++++
.../bpf/progs/test_netfilter_link_attach.c | 14 +
21 files changed, 868 insertions(+), 21 deletions(-)
create mode 100755 tools/testing/selftests/bpf/generate_udp_fragments.py
create mode 100644 tools/testing/selftests/bpf/ip_check_defrag_frags.h
create mode 100644 tools/testing/selftests/bpf/prog_tests/ip_check_defrag.c
create mode 100644 tools/testing/selftests/bpf/prog_tests/netfilter_basic.c
create mode 100644 tools/testing/selftests/bpf/progs/ip_check_defrag.c
create mode 100644 tools/testing/selftests/bpf/progs/test_netfilter_link_attach.c
--
2.40.1
Good morning,
I have reviewed your offer and am pleased to say that it catches the eye and invites further discussion.
I thought I might be able to contribute to your growth and help this offer reach a wider audience. I do search engine positioning for websites, which brings them great web traffic.
Could we talk sometime soon?
Best regards,
Adam Charachuta
Nested translation is a hardware feature that is supported by many modern
IOMMU hardware implementations. It uses two stages of address translation
(stage-1, stage-2) to get to the physical address. The stage-1 translation
table is owned by userspace (e.g. by a guest OS), while the stage-2 table is
owned by the kernel. Changes to the stage-1 translation table should be
followed by an IOTLB invalidation.
Take Intel VT-d as an example: the stage-1 translation table is the I/O page
table. As the diagram below shows, the guest I/O page table pointer in GPA
(guest physical address) is passed to the host and used to perform the stage-1
address translation. Along with it, modifications to present mappings in the
guest I/O page table should be followed by an IOTLB invalidation.
        .-------------.  .---------------------------.
        |   vIOMMU    |  | Guest I/O page table      |
        |             |  '---------------------------'
        .----------------/
        | PASID Entry |--- PASID cache flush --+
        '-------------'                        |
        |             |                        V
        |             |           I/O page table pointer in GPA
        '-------------'
    Guest
    ------| Shadow |---------------------------|--------
          v        v                           v
    Host
        .-------------.  .------------------------.
        |   pIOMMU    |  |  FS for GIOVA->GPA     |
        |             |  '------------------------'
        .----------------/  |
        | PASID Entry |     V (Nested xlate)
        '----------------\.----------------------------------.
        |             |   | SS for GPA->HPA, unmanaged domain|
        |             |   '----------------------------------'
        '-------------'
    Where:
     - FS = First stage page tables
     - SS = Second stage page tables
              <Intel VT-d Nested translation>
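As a toy model of the diagram, the two stages can be viewed as two lookups applied back to back: GIOVA to GPA through the guest-owned stage-1 table, then GPA to HPA through the kernel-owned stage-2 table. The flat arrays and all names below are assumptions made for illustration; real hardware walks multi-level page tables and also translates the stage-1 table pointers themselves through stage-2.

```c
#include <stddef.h>
#include <stdint.h>

#define PAGE_SHIFT	12
#define PAGE_OFFSET(a)	((a) & ((UINT64_C(1) << PAGE_SHIFT) - 1))
#define XLATE_FAULT	UINT64_MAX

struct map_entry {
	uint64_t in_pfn;	/* input page frame number */
	uint64_t out_pfn;	/* output page frame number */
};

struct stage {
	const struct map_entry *entries;
	size_t nr;
};

/* Translate one address through a single stage, or return XLATE_FAULT. */
static uint64_t stage_translate(const struct stage *s, uint64_t addr)
{
	size_t i;

	for (i = 0; i < s->nr; i++)
		if (s->entries[i].in_pfn == addr >> PAGE_SHIFT)
			return (s->entries[i].out_pfn << PAGE_SHIFT) |
			       PAGE_OFFSET(addr);
	return XLATE_FAULT;
}

/* GIOVA -> GPA via stage-1 (guest-owned), then GPA -> HPA via stage-2. */
static uint64_t nested_translate(const struct stage *s1,
				 const struct stage *s2, uint64_t giova)
{
	uint64_t gpa = stage_translate(s1, giova);

	if (gpa == XLATE_FAULT)
		return XLATE_FAULT;
	return stage_translate(s2, gpa);
}
```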
In IOMMUFD, all the translation tables are tracked by hw_pagetable (hwpt),
and each has an iommu_domain allocated from the iommu driver. So in this series
hw_pagetable and iommu_domain mean the same thing unless noted otherwise.
IOMMUFD already supports allocating a hw_pagetable that is linked with
an IOAS. However, nesting requires IOMMUFD to allow allocating a hw_pagetable
with driver-specific parameters, plus an interface to sync the stage-1 IOTLB,
as the user owns the stage-1 translation table.
This series is based on the iommu hw info reporting series [1]. It first
introduces a new iommu op for allocating domains with user data and an op
for syncing the stage-1 IOTLB, and then extends the IOMMUFD internal
infrastructure to accept user_data and a parent hwpt, relaying the data to the
iommu core to allocate an iommu_domain. After that, it extends the ioctl
IOMMU_HWPT_ALLOC to accept user data and a stage-2 hwpt ID to allocate a hwpt.
Along with it, the ioctl IOMMU_HWPT_INVALIDATE is added to invalidate the
stage-1 IOTLB. This is needed for user-managed hwpts. A selftest is added as
well to cover the new ioctls.
Complete code can be found in [2]; the QEMU code can be found in [3].
Finally, this is team work together with Nicolin Chen and Lu Baolu. Thanks to
them for the help. ^_^. I look forward to your feedback.
base-commit: cf905391237ded2331388e75adb5afbabeddc852
[1] https://lore.kernel.org/linux-iommu/20230511143024.19542-1-yi.l.liu@intel.c…
[2] https://github.com/yiliu1765/iommufd/tree/iommufd_nesting
[3] https://github.com/yiliu1765/qemu/tree/wip/iommufd_rfcv4.mig.reset.v4_var3%…
Change log:
v2:
- Add union iommu_domain_user_data to include all user data structures to avoid
passing void * in kernel APIs.
- Add iommu op to return user data length for user domain allocation
- Rename struct iommu_hwpt_alloc::data_type to be hwpt_type
- Store the invalidation data length in iommu_domain_ops::cache_invalidate_user_data_len
- Convert cache_invalidate_user op to be int instead of void
- Remove @data_type in struct iommu_hwpt_invalidate
- Remove out_hwpt_type_bitmap in struct iommu_hw_info hence drop patch 08 of v1
v1: https://lore.kernel.org/linux-iommu/20230309080910.607396-1-yi.l.liu@intel.…
Thanks,
Yi Liu
Lu Baolu (2):
iommu: Add new iommu op to create domains owned by userspace
iommu: Add nested domain support
Nicolin Chen (5):
iommufd/hw_pagetable: Do not populate user-managed hw_pagetables
iommufd/selftest: Add domain_alloc_user() support in iommu mock
iommufd/selftest: Add coverage for IOMMU_HWPT_ALLOC with user data
iommufd/selftest: Add IOMMU_TEST_OP_MD_CHECK_IOTLB test op
iommufd/selftest: Add coverage for IOMMU_HWPT_INVALIDATE ioctl
Yi Liu (4):
iommufd/hw_pagetable: Use domain_alloc_user op for domain allocation
iommufd: Pass parent hwpt and user_data to
iommufd_hw_pagetable_alloc()
iommufd: IOMMU_HWPT_ALLOC allocation with user data
iommufd: Add IOMMU_HWPT_INVALIDATE
drivers/iommu/iommufd/device.c | 2 +-
drivers/iommu/iommufd/hw_pagetable.c | 191 +++++++++++++++++-
drivers/iommu/iommufd/iommufd_private.h | 16 +-
drivers/iommu/iommufd/iommufd_test.h | 30 +++
drivers/iommu/iommufd/main.c | 5 +-
drivers/iommu/iommufd/selftest.c | 119 ++++++++++-
include/linux/iommu.h | 36 ++++
include/uapi/linux/iommufd.h | 58 +++++-
tools/testing/selftests/iommu/iommufd.c | 126 +++++++++++-
tools/testing/selftests/iommu/iommufd_utils.h | 70 +++++++
10 files changed, 629 insertions(+), 24 deletions(-)
--
2.34.1
Make sv39 the default address space for mmap as some applications
currently depend on this assumption. The RISC-V specification enforces
that bits outside of the virtual address range are not used, so
restricting the size of the default address space as such should be
temporary. A hint address passed to mmap will cause the largest address
space that fits entirely into the hint to be used. If the hint is less
than or equal to 1<<38, a 39-bit address will be used. After an address
space is completely full, the next smallest address space will be used.
Documentation is also added to the RISC-V virtual memory section to explain
these changes.
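One way to read the hint rule is sketched below. It assumes the user half of each mode is 1<<38 (sv39), 1<<47 (sv48) and 1<<56 (sv57) bytes, and it takes "fits entirely into the hint" literally; the helper name and the handling of hints between those sizes are this sketch's interpretation rather than something stated in the series, and the check for what the hardware actually supports is left out.

```c
#include <stdint.h>

/* Assumed sizes of the user half of each address space. */
#define USER_SV39 (UINT64_C(1) << 38)
#define USER_SV48 (UINT64_C(1) << 47)
#define USER_SV57 (UINT64_C(1) << 56)

/*
 * Pick the largest address space whose user half fits entirely below the
 * hint, defaulting to sv39 (this also covers hint == 0, i.e. no hint).
 */
static int va_bits_for_hint(uint64_t hint)
{
	if (hint >= USER_SV57)
		return 57;
	if (hint >= USER_SV48)
		return 48;
	return 39;
}
```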
Charlie Jenkins (2):
RISC-V: mm: Restrict address space for sv39,sv48,sv57
RISC-V: mm: Update documentation and include test
Documentation/riscv/vm-layout.rst | 20 ++++++++
arch/riscv/include/asm/elf.h | 2 +-
arch/riscv/include/asm/pgtable.h | 21 ++++++--
arch/riscv/include/asm/processor.h | 41 +++++++++++++---
tools/testing/selftests/riscv/Makefile | 2 +-
tools/testing/selftests/riscv/mm/Makefile | 22 +++++++++
.../selftests/riscv/mm/testcases/mmap.c | 49 +++++++++++++++++++
7 files changed, 144 insertions(+), 13 deletions(-)
create mode 100644 tools/testing/selftests/riscv/mm/Makefile
create mode 100644 tools/testing/selftests/riscv/mm/testcases/mmap.c
base-commit: eef509789cecdce895020682192d32e8bac790e8
--
2.34.1
Hi folks,
This series implements the functionality of delivering IO page faults to
user space through the IOMMUFD framework. The use case is nested
translation, where modern IOMMU hardware supports two-stage translation
tables. The second-stage translation table is managed by the host VMM
while the first-stage translation table is owned by the user space.
Hence, any IO page fault that occurs on the first-stage page table
should be delivered to the user space and handled there. The user space
should send the page fault handling result back to the device, top-down,
through the IOMMUFD response uAPI.
User space indicates its capability of handling IO page faults by setting
a user HWPT allocation flag IOMMU_HWPT_ALLOC_FLAGS_IOPF_CAPABLE. IOMMUFD
will then set up its infrastructure for page fault delivery. Together
with the iopf-capable flag, user space should also provide an eventfd
on which it will listen for any bottom-up page fault messages.
On a successful return of the allocation of iopf-capable HWPT, a fault
fd will be returned. User space can open and read fault messages from it
once the eventfd is signaled.
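The eventfd side of this flow uses only standard Linux calls and could look like the sketch below. Reading and decoding the fault messages from the fault fd is deliberately left out, since that format is defined by this series; wait_for_event() is a hypothetical helper name.

```c
#include <poll.h>
#include <stdint.h>
#include <sys/eventfd.h>
#include <unistd.h>

/*
 * Wait up to timeout_ms for efd to be signalled. Returns the eventfd
 * counter value (which read() also resets), 0 on timeout, -1 on error.
 */
static int64_t wait_for_event(int efd, int timeout_ms)
{
	struct pollfd pfd = { .fd = efd, .events = POLLIN };
	uint64_t count;
	int ret = poll(&pfd, 1, timeout_ms);

	if (ret < 0)
		return -1;
	if (ret == 0)
		return 0;
	if (read(efd, &count, sizeof(count)) != (ssize_t)sizeof(count))
		return -1;
	return (int64_t)count;
}
```

After a positive return, user space would go on to read the pending fault messages from the fault fd.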
Besides the overall design, I'd like to hear comments on the following
design points:
- The IOMMUFD fault message format. It is very similar to that in
uapi/linux/iommu which has been discussed before and partially used by
the IOMMU SVA implementation. I'd like to get more comments on the
format when it comes to IOMMUFD.
- The timeout value for the pending page fault messages. Ideally we
should determine the timeout value from the device configuration, but
I failed to find any statement in the PCI specification (version 6.x).
A default of 100 milliseconds is used in the implementation, which
leaves room for growing the code to support a per-device setting.
This series is for review comments only. I used the IOMMUFD selftest
to verify the hwpt allocation, attach/detach and replace. But I haven't
had a chance to run it on real hardware yet. I will do more testing in
the subsequent versions once I am confident that I am heading in the
right direction.
This series is based on the latest implementation of the nested
translation under discussion. The whole series and related patches are
available on GitHub:
https://github.com/LuBaolu/intel-iommu/commits/iommufd-io-pgfault-delivery-…
Best regards,
baolu
Lu Baolu (17):
iommu: Move iommu fault data to linux/iommu.h
iommu: Support asynchronous I/O page fault response
iommu: Add helper to set iopf handler for domain
iommu: Pass device parameter to iopf handler
iommu: Split IO page fault handling from SVA
iommu: Add iommu page fault cookie helpers
iommufd: Add iommu page fault data
iommufd: IO page fault delivery initialization and release
iommufd: Add iommufd hwpt iopf handler
iommufd: Add IOMMU_HWPT_ALLOC_FLAGS_USER_PASID_TABLE for hwpt_alloc
iommufd: Deliver fault messages to user space
iommufd: Add io page fault response support
iommufd: Add a timer for each iommufd fault data
iommufd: Drain all pending faults when destroying hwpt
iommufd: Allow new hwpt_alloc flags
iommufd/selftest: Add IOPF feature for mock devices
iommufd/selftest: Cover iopf-capable nested hwpt
include/linux/iommu.h | 175 +++++++++-
drivers/iommu/{iommu-sva.h => io-pgfault.h} | 25 +-
drivers/iommu/iommu-priv.h | 3 +
drivers/iommu/iommufd/iommufd_private.h | 32 ++
include/uapi/linux/iommu.h | 161 ---------
include/uapi/linux/iommufd.h | 73 +++-
tools/testing/selftests/iommu/iommufd_utils.h | 20 +-
.../iommu/arm/arm-smmu-v3/arm-smmu-v3-sva.c | 2 +-
drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 2 +-
drivers/iommu/intel/iommu.c | 2 +-
drivers/iommu/intel/svm.c | 2 +-
drivers/iommu/io-pgfault.c | 7 +-
drivers/iommu/iommu-sva.c | 4 +-
drivers/iommu/iommu.c | 50 ++-
drivers/iommu/iommufd/device.c | 64 +++-
drivers/iommu/iommufd/hw_pagetable.c | 318 +++++++++++++++++-
drivers/iommu/iommufd/main.c | 3 +
drivers/iommu/iommufd/selftest.c | 71 ++++
tools/testing/selftests/iommu/iommufd.c | 17 +-
MAINTAINERS | 1 -
drivers/iommu/Kconfig | 4 +
drivers/iommu/Makefile | 3 +-
drivers/iommu/intel/Kconfig | 1 +
23 files changed, 837 insertions(+), 203 deletions(-)
rename drivers/iommu/{iommu-sva.h => io-pgfault.h} (71%)
delete mode 100644 include/uapi/linux/iommu.h
--
2.34.1
When we collect a signal context with one of the SME modes enabled we will
have enabled that mode behind the compiler and libc's back so they may
issue some instructions not valid in streaming mode, causing spurious
failures.
For the code prior to issuing the BRK to trigger signal handling we need to
stay in streaming mode if we were already there since that's a part of the
signal context the caller is trying to collect. Unfortunately this code
includes a memset() which is likely to be heavily optimised and is likely
to use FP instructions incompatible with streaming mode. We can avoid this
happening by open coding the memset(), inserting a volatile assembly
statement to avoid the compiler recognising what's being done and doing
something in optimisation. This code is not performance critical so the
inefficiency should not be an issue.
After collecting the context we can simply exit streaming mode, avoiding
these issues. Use a full SMSTOP for safety to prevent any issues appearing
with ZA.
Reported-by: Will Deacon <will(a)kernel.org>
Signed-off-by: Mark Brown <broonie(a)kernel.org>
---
.../selftests/arm64/signal/test_signals_utils.h | 28 +++++++++++++++++++++-
1 file changed, 27 insertions(+), 1 deletion(-)
diff --git a/tools/testing/selftests/arm64/signal/test_signals_utils.h b/tools/testing/selftests/arm64/signal/test_signals_utils.h
index 222093f51b67..db28409fd44b 100644
--- a/tools/testing/selftests/arm64/signal/test_signals_utils.h
+++ b/tools/testing/selftests/arm64/signal/test_signals_utils.h
@@ -60,13 +60,28 @@ static __always_inline bool get_current_context(struct tdescr *td,
size_t dest_sz)
{
static volatile bool seen_already;
+ int i;
+ char *uc = (char *)dest_uc;
assert(td && dest_uc);
/* it's a genuine invocation..reinit */
seen_already = 0;
td->live_uc_valid = 0;
td->live_sz = dest_sz;
- memset(dest_uc, 0x00, td->live_sz);
+
+ /*
+ * This is a memset() but we don't want the compiler to
+ * optimise it into either instructions or a library call
+ * which might be incompatible with streaming mode.
+ */
+ for (i = 0; i < td->live_sz; i++) {
+ asm volatile("nop"
+ : "+m" (*dest_uc)
+ :
+ : "memory");
+ uc[i] = 0;
+ }
+
td->live_uc = dest_uc;
/*
* Grab ucontext_t triggering a SIGTRAP.
@@ -103,6 +118,17 @@ static __always_inline bool get_current_context(struct tdescr *td,
:
: "memory");
+ /*
+ * If we were grabbing a streaming mode context then we may
+ * have entered streaming mode behind the system's back and
+ * libc or compiler generated code might decide to do
+ * something invalid in streaming mode, or potentially even
+ * the state of ZA. Issue a SMSTOP to exit both now we have
+ * grabbed the state.
+ */
+ if (td->feats_supported & FEAT_SME)
+ asm volatile("msr S0_3_C4_C6_3, xzr");
+
/*
* If we get here with seen_already==1 it implies the td->live_uc
* context has been used to get back here....this probably means
---
base-commit: 6995e2de6891c724bfeb2db33d7b87775f913ad1
change-id: 20230628-arm64-signal-memcpy-fix-7de3b3c8fa10
Best regards,
--
Mark Brown <broonie(a)kernel.org>
Hi Mark,
While debugging the SME issue reported in CI, I noticed that the
streaming SVE tests are failing on the fastmodel because of an
unexpected SIGILL. For example:
will:arm64/signal$ ./ssve_za_regs
# Streaming SVE registers :: Check that we get the right Streaming SVE registers reported
Registered handlers for all signals.
Detected MINSTKSIGSZ:4720
Required Features: [ SME ] supported
Incompatible Features: [] absent
Testcase initialized.
Testing VL 64
-- RX UNEXPECTED SIGNAL: 4
==>> completed. FAIL(0)
The signal is injected because we get an SME trap due to an fpsimd, sve
or sve2 instruction being used in streaming mode (ESR is 0x76000001).
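For reference, the ESR value quoted above can be decoded by hand as in this sketch. It relies on the exception class living in bits [31:26] of ESR_ELx, on 0x1d being the SME exception class, and on the low three ISS bits giving the trap reason, with 1 meaning an instruction that is illegal in streaming mode; the macro and function names are local to this example.

```c
#include <stdint.h>

#define ESR_EC_SHIFT		26
#define ESR_EC_MASK		0x3fu
#define ESR_EC_SME		0x1du	/* SME access trap */
#define ESR_SME_SMTC_MASK	0x7u	/* trap reason, ISS bits [2:0] */
#define ESR_SME_SMTC_ILL	0x1u	/* FPSIMD/SVE/SVE2 insn in streaming mode */

/* Extract the exception class from an ESR_ELx value. */
static unsigned int esr_ec(uint32_t esr)
{
	return (esr >> ESR_EC_SHIFT) & ESR_EC_MASK;
}

/* Extract the SME trap reason; only meaningful when EC is ESR_EC_SME. */
static unsigned int esr_sme_smtc(uint32_t esr)
{
	return esr & ESR_SME_SMTC_MASK;
}
```

For 0x76000001 this gives an exception class of 0x1d and a trap reason of 1.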
I did a bit of digging and it looks like this is my libc using a vector
DUP instruction in memset:
#0 __memset_generic () at ../sysdeps/aarch64/memset.S:37
#1 0x0000aaaaaaaa1170 in get_current_context (dest_sz=131072,
dest_uc=0xaaaaaeab6ba0 <context>, td=0xaaaaaaab50f0 <tde>)
at ./test_signals_utils.h:69
#2 do_one_sme_vl (si=<optimized out>, uc=<optimized out>, vl=64,
td=0xaaaaaaab50f0 <tde>) at testcases/ssve_za_regs.c:90
#3 sme_regs (td=0xaaaaaaab50f0 <tde>, si=<optimized out>, uc=<optimized out>)
at testcases/ssve_za_regs.c:145
#4 0x0000aaaaaaaa0ed0 in main (argc=<optimized out>, argv=<optimized out>)
at test_signals.c:21
Dump of assembler code for function __memset_generic:
=> 0x0000fffff7edfb00 <+0>: dup v0.16b, w1
The easy option would be to require FA64 for these tests, but I guess it
would be better to exit streaming mode.
Please can you have a look?
Thanks,
Will
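The idea behind the patch above, clearing the buffer byte by byte so that
no libc memset() (and therefore no vector DUP) is executed, can be sketched
in portable C roughly as follows. This is only a sketch and not the code
from the patch: the helper name zero_bytes_no_libcall() is hypothetical,
and the use of a volatile pointer is an assumption about how to keep the
compiler from turning the loop back into a memset() call.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: zero a buffer one byte at a time. Each store
 * goes through a volatile pointer, so the compiler has to emit it as
 * a separate byte access and should not replace the loop with a
 * memset() library call.
 */
void zero_bytes_no_libcall(void *buf, size_t len)
{
	volatile uint8_t *p = buf;
	size_t i;

	assert(buf != NULL || len == 0);

	for (i = 0; i < len; i++)
		p[i] = 0;
}
```

A caller would use it where memset(dest, 0, size) was used before, for
example zero_bytes_no_libcall(dest_uc, dest_sz).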
Awk is already called for /sys/block/zram#/mm_stat parsing, so use it
to also perform the floating point capacity vs consumption ratio
calculations. The test output is unchanged.
This allows bc to be dropped as a dependency for the zram selftests.
Signed-off-by: David Disseldorp <ddiss(a)suse.de>
---
tools/testing/selftests/zram/zram01.sh | 18 ++++++++----------
1 file changed, 8 insertions(+), 10 deletions(-)
diff --git a/tools/testing/selftests/zram/zram01.sh b/tools/testing/selftests/zram/zram01.sh
index 8f4affe34f3e4..df1b1d4158989 100755
--- a/tools/testing/selftests/zram/zram01.sh
+++ b/tools/testing/selftests/zram/zram01.sh
@@ -33,7 +33,7 @@ zram_algs="lzo"
zram_fill_fs()
{
- for i in $(seq $dev_start $dev_end); do
+ for ((i = $dev_start; i <= $dev_end && !ERR_CODE; i++)); do
echo "fill zram$i..."
local b=0
while [ true ]; do
@@ -44,15 +44,13 @@ zram_fill_fs()
done
echo "zram$i can be filled with '$b' KB"
- local mem_used_total=`awk '{print $3}' "/sys/block/zram$i/mm_stat"`
- local v=$((100 * 1024 * $b / $mem_used_total))
- if [ "$v" -lt 100 ]; then
- echo "FAIL compression ratio: 0.$v:1"
- ERR_CODE=-1
- return
- fi
-
- echo "zram compression ratio: $(echo "scale=2; $v / 100 " | bc):1: OK"
+ awk -v b="$b" '{ v = (100 * 1024 * b / $3) } END {
+ if (v < 100) {
+ printf "FAIL compression ratio: 0.%u:1\n", v
+ exit 1
+ }
+ printf "zram compression ratio: %.2f:1: OK\n", v / 100
+ }' "/sys/block/zram$i/mm_stat" || ERR_CODE=-1
done
}
--
2.35.3
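For readers who prefer C, the arithmetic that the awk script in the zram
patch above performs can be sketched as below. This is an illustration
only: zram_ratio_check() is a hypothetical helper, not part of the
selftest, and it simply mirrors the awk expression and the two printf
formats shown in the patch.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Hypothetical helper mirroring the awk arithmetic: filled_kb is the
 * amount written to the device in KB, mem_used_total is the third
 * field of /sys/block/zramN/mm_stat in bytes. Returns 0 and writes
 * the OK message on success, or -1 and the FAIL message when the
 * ratio is below 1:1.
 */
int zram_ratio_check(unsigned long filled_kb, unsigned long mem_used_total,
		     char *msg, size_t msg_len)
{
	double v;

	assert(mem_used_total != 0);

	/* Ratio scaled by 100, computed in floating point as awk does. */
	v = 100.0 * 1024.0 * (double)filled_kb / (double)mem_used_total;
	if (v < 100.0) {
		snprintf(msg, msg_len, "FAIL compression ratio: 0.%u:1",
			 (unsigned int)v);
		return -1;
	}

	snprintf(msg, msg_len, "zram compression ratio: %.2f:1: OK", v / 100.0);
	return 0;
}
```

With 2048 KB written and 1048576 bytes used, v is 200 and the OK message
reports a ratio of 2.00:1.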
KVM_GET_REG_LIST dumps all register IDs that are available to
KVM_GET/SET_ONE_REG, which is very useful for identifying platform
regression issues during VM migration.
Patches 1-7 restructure the aarch64 get-reg-list test so that part of
the code becomes a common test framework that can be shared with riscv.
Patch 8 moves the reject_set check logic to a function so that different
errnos can be checked for different registers.
Patch 9 changes the test to do the get/set operations only on the
present-blessed list.
Patch 10 enables the KVM_GET_REG_LIST API in riscv.
Patches 11-12 add the corresponding kselftest for checking possible
register regressions.
The get-reg-list kvm selftest was ported from aarch64 and tested with
Linux 6.4-rc6 on a QEMU riscv64 virt machine.
---
Changed since v3:
* Rebase to Linux 6.4-rc6
* Address Andrew's suggestions and comments:
* Move reject_set check logic to a function
* Only do get/set tests on present blessed list
* Only enable ISA extension for the specified config
* For disable-not-allowed registers, move them to the filter-reg-list
Andrew Jones (7):
KVM: arm64: selftests: Replace str_with_index with strdup_printf
KVM: arm64: selftests: Drop SVE cap check in print_reg
KVM: arm64: selftests: Remove print_reg's dependency on vcpu_config
KVM: arm64: selftests: Rename vcpu_config and add to kvm_util.h
KVM: arm64: selftests: Delete core_reg_fixup
KVM: arm64: selftests: Split get-reg-list test code
KVM: arm64: selftests: Finish generalizing get-reg-list
Haibo Xu (5):
KVM: arm64: selftests: Move reject_set check logic to a function
KVM: selftests: Only do get/set tests on present blessed list
KVM: riscv: Add KVM_GET_REG_LIST API support
KVM: riscv: selftests: Add finalize_vcpu check in run_test
KVM: riscv: selftests: Add get-reg-list test
Documentation/virt/kvm/api.rst | 2 +-
arch/riscv/kvm/vcpu.c | 375 +++++++++
tools/testing/selftests/kvm/Makefile | 11 +-
.../selftests/kvm/aarch64/get-reg-list.c | 538 ++-----------
tools/testing/selftests/kvm/get-reg-list.c | 439 ++++++++++
.../selftests/kvm/include/kvm_util_base.h | 16 +
.../selftests/kvm/include/riscv/processor.h | 3 +
.../testing/selftests/kvm/include/test_util.h | 2 +
tools/testing/selftests/kvm/lib/test_util.c | 15 +
.../selftests/kvm/riscv/get-reg-list.c | 752 ++++++++++++++++++
10 files changed, 1658 insertions(+), 495 deletions(-)
create mode 100644 tools/testing/selftests/kvm/get-reg-list.c
create mode 100644 tools/testing/selftests/kvm/riscv/get-reg-list.c
--
2.34.1
This patch introduces two tests for the EVIOCSABS ioctl. The first one
checks that the ioctl fails when the EV_ABS bit is not set, and the
second one checks that the normal workflow for this ioctl succeeds.
Signed-off-by: Dana Elfassy <dangel101(a)gmail.com>
---
This patch depends on '[v3] selftests/input: Introduce basic tests for evdev ioctls' [1] sent to the ML.
[1] https://patchwork.kernel.org/project/linux-input/patch/20230607153214.15933…
tools/testing/selftests/input/evioc-test.c | 23 ++++++++++++++++++++++
1 file changed, 23 insertions(+)
diff --git a/tools/testing/selftests/input/evioc-test.c b/tools/testing/selftests/input/evioc-test.c
index 4c0c8ebed378..7afd537f0b24 100644
--- a/tools/testing/selftests/input/evioc-test.c
+++ b/tools/testing/selftests/input/evioc-test.c
@@ -279,4 +279,27 @@ TEST(eviocgrep_get_repeat_settings)
selftest_uinput_destroy(uidev);
}
+TEST(eviocsabs_set_abs_value_limits)
+{
+ struct selftest_uinput *uidev;
+ struct input_absinfo absinfo;
+ int rc;
+
+ // fail test on dev->absinfo
+ rc = selftest_uinput_create_device(&uidev, -1);
+ ASSERT_EQ(0, rc);
+ ASSERT_NE(NULL, uidev);
+ rc = ioctl(uidev->evdev_fd, EVIOCSABS(0), &absinfo);
+ ASSERT_EQ(-1, rc);
+ selftest_uinput_destroy(uidev);
+
+ // ioctl normal flow
+ rc = selftest_uinput_create_device(&uidev, EV_ABS, -1);
+ ASSERT_EQ(0, rc);
+ ASSERT_NE(NULL, uidev);
+ rc = ioctl(uidev->evdev_fd, EVIOCSABS(0), &absinfo);
+ ASSERT_EQ(0, rc);
+ selftest_uinput_destroy(uidev);
+}
+
TEST_HARNESS_MAIN
--
2.41.0
*Changes in v21*
- Abort walk instead of returning error if WP is to be performed on
partial hugetlb
*Changes in v20*
- Correct PAGE_IS_FILE and add PAGE_IS_PFNZERO
*Changes in v19*
- Minor changes and interface updates
*Changes in v18*
- Rebase on top of next-20230613
- Minor updates
*Changes in v17*
- Rebase on top of next-20230606
- Minor improvements in PAGEMAP_SCAN IOCTL patch
*Changes in v16*
- Fix a corner case
- Add exclusive PM_SCAN_OP_WP back
*Changes in v15*
- Build fix (Add missed build fix in RESEND)
*Changes in v14*
- Fix build error caused by #ifdef added at last minute in some configs
*Changes in v13*
- Rebase on top of next-20230414
- Give-up on using uffd_wp_range() and write new helpers, flush tlb only
once
*Changes in v12*
- Update and other memory types to UFFD_FEATURE_WP_ASYNC
- Rebase on top of next-20230406
- Review updates
*Changes in v11*
- Rebase on top of next-20230307
- Base patches on UFFD_FEATURE_WP_UNPOPULATED
- Do a lot of cosmetic changes and review updates
- Remove ENGAGE_WP + !GET operation as it can be performed with
UFFDIO_WRITEPROTECT
*Changes in v10*
- Add specific condition to return error if hugetlb is used with wp
async
- Move changes in tools/include/uapi/linux/fs.h to separate patch
- Add documentation
*Changes in v9:*
- Correct fault resolution for userfaultfd wp async
- Fix build warnings and errors which were happening on some configs
- Simplify pagemap ioctl's code
*Changes in v8:*
- Update uffd async wp implementation
- Improve PAGEMAP_IOCTL implementation
*Changes in v7:*
- Add uffd wp async
- Update the IOCTL to use uffd under the hood instead of soft-dirty
flags
*Motivation*
The real motivation for adding PAGEMAP_SCAN IOCTL is to emulate Windows
GetWriteWatch() syscall [1]. GetWriteWatch() retrieves the addresses of
the pages that are written to in a region of virtual memory.
This syscall is used in Windows applications and games etc. It is
currently emulated in userspace in a pretty slow manner. Our purpose is to
enhance the kernel so that it can be emulated efficiently.
Currently some out of tree hack patches are being used to efficiently
emulate it in some kernels. We intend to replace those with these patches,
so that the whole of gaming on Linux can benefit from this. It
means there would be tons of users of this code.
CRIU use case [2] was mentioned by Andrei and Danylo:
> Use cases for migrating sparse VMAs are binaries sanitized with ASAN,
> MSAN or TSAN [3]. All of these sanitizers produce sparse mappings of
> shadow memory [4]. Being able to migrate such binaries allows to highly
> reduce the amount of work needed to identify and fix post-migration
> crashes, which happen constantly.
Andrei defines the following uses of this code:
* it is more granular and allows us to track changed pages more
effectively. The current interface can clear dirty bits for the entire
process only. In addition, reading info about pages is a separate
operation. It means we must freeze the process to read information
about all its pages, reset dirty bits, only then we can start dumping
pages. The information about pages becomes more and more outdated,
while we are processing pages. The new interface solves both these
downsides. First, it allows us to read pte bits and clear the
soft-dirty bit atomically. It means that CRIU will not need to freeze
processes to pre-dump their memory. Second, it clears soft-dirty bits
for a specified region of memory. It means CRIU will have actual info
about pages to the moment of dumping them.
* The new interface has to be much faster because basic page filtering
is happening in the kernel. With the old interface, we have to read
pagemap for each page.
*Implementation Evolution (Short Summary)*
From the definition of GetWriteWatch(), we feel like kernel's soft-dirty
feature can be used under the hood with some additions like:
* reset soft-dirty flag for only a specific region of memory instead of
clearing the flag for the entire process
* get and clear soft-dirty flag for a specific region atomically
So we decided to use an ioctl on the pagemap file to read and/or reset the
soft-dirty flag. But with the soft-dirty flag we sometimes get extra pages
which weren't even written. They had become soft-dirty because of VMA
merging and the VM_SOFTDIRTY flag. This breaks the definition of
GetWriteWatch(). We were able to bypass this shortcoming by ignoring
VM_SOFTDIRTY until David reported that mprotect etc. messes up the
soft-dirty flag while ignoring VM_SOFTDIRTY [5]. This wasn't happening
until [6] got introduced. We discussed whether we could revert these
patches, but we could not reach any conclusion. So at this point, I made a
couple of attempts to solve this whole VM_SOFTDIRTY issue by correcting the
soft-dirty implementation:
* [7] Correct the bug fixed wrongly back in 2014. It had the potential to
cause regressions. We left it behind.
* [8] Keep a list of the soft-dirty parts of a VMA across splits and
merges. The reply I got was not to increase the size of the VMA by 8 bytes.
At this point, we left soft-dirty behind, considering it too delicate, and
userfaultfd [9] seemed like the only way forward. From there onward, we
have been basing soft-dirty emulation on the userfaultfd wp feature, where
the kernel resolves the faults itself when the WP_ASYNC feature is used. It
was straightforward to add the WP_ASYNC feature to userfaultfd. Now only
those pages which have really been written to are reported as dirty or
written-to. (PS: another userfaultfd feature, WP_UNPOPULATED, is also
required to avoid pre-faulting memory before write-protecting [9].)
All the different masks were added at the request of the CRIU devs to make
the interface more generic and better.
[1] https://learn.microsoft.com/en-us/windows/win32/api/memoryapi/nf-memoryapi-…
[2] https://lore.kernel.org/all/20221014134802.1361436-1-mdanylo@google.com
[3] https://github.com/google/sanitizers
[4] https://github.com/google/sanitizers/wiki/AddressSanitizerAlgorithm#64-bit
[5] https://lore.kernel.org/all/bfcae708-db21-04b4-0bbe-712badd03071@redhat.com
[6] https://lore.kernel.org/all/20220725142048.30450-1-peterx@redhat.com/
[7] https://lore.kernel.org/all/20221122115007.2787017-1-usama.anjum@collabora.…
[8] https://lore.kernel.org/all/20221220162606.1595355-1-usama.anjum@collabora.…
[9] https://lore.kernel.org/all/20230306213925.617814-1-peterx@redhat.com
[10] https://lore.kernel.org/all/20230125144529.1630917-1-mdanylo@google.com
*Original Cover letter from v8*
Hello,
Note:
Soft-dirty pages and pages which have been written-to are synonyms. As
the kernel already has a soft-dirty feature, which we have given up on
using, we use the written-to terminology when UFFD async WP is used under
the hood.
This IOCTL, PAGEMAP_SCAN on pagemap file can be used to get and/or clear
the info about page table entries. The following operations are
supported in this ioctl:
- Get the information if the pages have been written-to (PAGE_IS_WRITTEN),
file mapped (PAGE_IS_FILE), present (PAGE_IS_PRESENT) or swapped
(PAGE_IS_SWAPPED).
- Write-protect the pages (PAGEMAP_WP_ENGAGE) to start finding which
pages have been written-to.
- Find pages which have been written-to and write protect the pages
(atomic PAGE_IS_WRITTEN + PAGEMAP_WP_ENGAGE)
It is possible to find and clear soft-dirty pages entirely in userspace.
But it isn't efficient:
- The mprotect and SIGSEGV handler for bookkeeping
- The userfaultfd wp (synchronous) with the handler for bookkeeping
Some benchmarks can be seen here [1]. This series adds features that
weren't present earlier:
- There is no atomic way in the kernel to get the soft-dirty/written-to
status and clear it.
- The pages which have been written-to cannot be found accurately.
(The kernel's soft-dirty PTE bit + soft-dirty VMA bit show more
soft-dirty pages than there actually are.)
Historically, soft-dirty PTE bit tracking has been used in the CRIU
project. The procfs interface is enough for finding the soft-dirty bit
status and clearing the soft-dirty bit of all the pages of a process.
We have the use case where we need to track the soft-dirty PTE bit for
only specific pages on-demand. We need this mechanism for tracking and
clearing a region of memory while the process is running to emulate the
GetWriteWatch() syscall of Windows.
*(Moved to using UFFD instead of soft-dirty feature to find pages which
have been written-to from v7 patch series)*:
Stop using the soft-dirty flags for finding which pages have been
written to. It is too delicate and wrong, as it shows more soft-dirty
pages than the actual soft-dirty pages. There is no interest in
correcting it [2][3], as this is how the feature was written years ago
and its behaviour shouldn't be changed now. Peter Xu has suggested
using the async version of the UFFD WP [4] as it is based inherently
on the PTEs.
So in this patch series, I've added a new mode to the UFFD which is an
asynchronous version of the write protect. When this variant of the
UFFD WP is used, the page faults are resolved automatically by the
kernel. The pages which have been written-to can be found by reading
pagemap file (!PM_UFFD_WP). This feature can be used successfully to
find which pages have been written to from the time the pages were
write protected. This works just like the soft-dirty flag without
showing any extra pages which aren't soft-dirty in reality.
Information on whether a page is file mapped, present or swapped is
required for the CRIU project [5][6]. The required mask, any mask,
excluded mask and return mask were also added for the CRIU project [5].
The IOCTL returns the addresses of the pages which match the specific
masks. The page addresses are returned in struct page_region in a compact
form. The max_pages is needed to support a use case where the user only
wants to get a specific number of pages, so there is no need to find all
the pages of interest in the range when max_pages is specified. The IOCTL
returns when the maximum number of pages has been found. The max_pages is
optional. If max_pages is specified, it must be equal to or greater than
the vec_size. This restriction is needed to handle the worst case, where
each page_region only contains info of one page and cannot be compacted.
This is needed to emulate the Windows GetWriteWatch() syscall.
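As an illustration of the mask matching and region compaction described
above, here is a small userspace model in C. It is a sketch under stated
assumptions and not the kernel implementation: all names prefixed with
ex_ or EX_ are hypothetical, the bit values are made up, and the exact
meaning of the required/any/excluded masks is my reading of the text (all
required bits set, at least one any-bit set when that mask is non-zero,
and no excluded bit set).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical category bits; the real values are defined by the series. */
#define EX_PAGE_IS_WRITTEN	(1u << 0)
#define EX_PAGE_IS_FILE		(1u << 1)
#define EX_PAGE_IS_PRESENT	(1u << 2)
#define EX_PAGE_IS_SWAPPED	(1u << 3)

/* Hypothetical compact output entry: a run of pages with equal flags. */
struct ex_region {
	size_t first;		/* index of the first page in the run */
	size_t count;		/* number of pages in the run */
	uint32_t flags;		/* flags shared by every page in the run */
};

bool ex_page_matches(uint32_t flags, uint32_t required, uint32_t anyof,
		     uint32_t excluded)
{
	if ((flags & required) != required)
		return false;
	if (anyof && !(flags & anyof))
		return false;
	return !(flags & excluded);
}

/*
 * Walk the per-page flags, merge adjacent matching pages with equal
 * flags into one region, and stop once max_pages pages were reported
 * (0 means no limit) or the output vector is full. Returns the number
 * of regions written to vec.
 */
size_t ex_scan(const uint32_t *flags, size_t npages, uint32_t required,
	       uint32_t anyof, uint32_t excluded, struct ex_region *vec,
	       size_t vec_len, size_t max_pages)
{
	size_t used = 0, found = 0, i;

	assert(vec != NULL && vec_len > 0);

	for (i = 0; i < npages; i++) {
		if (max_pages && found == max_pages)
			break;
		if (!ex_page_matches(flags[i], required, anyof, excluded))
			continue;
		if (used && vec[used - 1].first + vec[used - 1].count == i &&
		    vec[used - 1].flags == flags[i]) {
			vec[used - 1].count++;
		} else {
			if (used == vec_len)
				break;
			vec[used].first = i;
			vec[used].count = 1;
			vec[used].flags = flags[i];
			used++;
		}
		found++;
	}
	return used;
}
```

In the worst case every matching page ends up in its own region, which is
the situation the restriction between max_pages and vec_size is meant to
cover.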
The patch series includes a detailed selftest which can be used as an
example for the uffd async wp test and PAGEMAP_IOCTL. It shows the
interface usage as well.
[1] https://lore.kernel.org/lkml/54d4c322-cd6e-eefd-b161-2af2b56aae24@collabora…
[2] https://lore.kernel.org/all/20221220162606.1595355-1-usama.anjum@collabora.…
[3] https://lore.kernel.org/all/20221122115007.2787017-1-usama.anjum@collabora.…
[4] https://lore.kernel.org/all/Y6Hc2d+7eTKs7AiH@x1n
[5] https://lore.kernel.org/all/YyiDg79flhWoMDZB@gmail.com/
[6] https://lore.kernel.org/all/20221014134802.1361436-1-mdanylo@google.com/
Regards,
Muhammad Usama Anjum
Muhammad Usama Anjum (4):
fs/proc/task_mmu: Implement IOCTL to get and optionally clear info
about PTEs
tools headers UAPI: Update linux/fs.h with the kernel sources
mm/pagemap: add documentation of PAGEMAP_SCAN IOCTL
selftests: mm: add pagemap ioctl tests
Peter Xu (1):
userfaultfd: UFFD_FEATURE_WP_ASYNC
Documentation/admin-guide/mm/pagemap.rst | 58 +
Documentation/admin-guide/mm/userfaultfd.rst | 35 +
fs/proc/task_mmu.c | 560 +++++++
fs/userfaultfd.c | 26 +-
include/linux/hugetlb.h | 1 +
include/linux/userfaultfd_k.h | 21 +-
include/uapi/linux/fs.h | 54 +
include/uapi/linux/userfaultfd.h | 9 +-
mm/hugetlb.c | 34 +-
mm/memory.c | 27 +-
tools/include/uapi/linux/fs.h | 54 +
tools/testing/selftests/mm/.gitignore | 2 +
tools/testing/selftests/mm/Makefile | 3 +-
tools/testing/selftests/mm/config | 1 +
tools/testing/selftests/mm/pagemap_ioctl.c | 1464 ++++++++++++++++++
tools/testing/selftests/mm/run_vmtests.sh | 4 +
16 files changed, 2329 insertions(+), 24 deletions(-)
create mode 100644 tools/testing/selftests/mm/pagemap_ioctl.c
mode change 100644 => 100755 tools/testing/selftests/mm/run_vmtests.sh
--
2.39.2
Hi Linus,
Please pull the following Kselftest update for Linux 6.5-rc1.
This kselftest update for Linux 6.5-rc1 consists of:
- change to allow runners to override the timeout
This change is made to avoid future increases of long
timeouts
- several other spelling fixes and cleanups
- a new subtest to video_device_test
- enhancements to test coverage in clone3 test
- other fixes to ftrace and cpufreq tests
diff is attached.
thanks,
-- Shuah
----------------------------------------------------------------
The following changes since commit 858fd168a95c5b9669aac8db6c14a9aeab446375:
Linux 6.4-rc6 (2023-06-11 14:35:30 -0700)
are available in the Git repository at:
git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest tags/linux-kselftest-next-6.5-rc1
for you to fetch changes up to 8cd0d8633e2de4e6dd9ddae7980432e726220fdb:
selftests/ftace: Fix KTAP output ordering (2023-06-12 16:40:22 -0600)
----------------------------------------------------------------
linux-kselftest-next-6.5-rc1
This kselftest update for Linux 6.5-rc1 consists of:
- change to allow runners to override the timeout
This change is made to avoid future increases of long
timeouts
- several other spelling fixes and cleanups
- a new subtest to video_device_test
- enhancements to test coverage in clone3 test
- other fixes to ftrace and cpufreq tests
----------------------------------------------------------------
Akanksha J N (1):
selftests/ftrace: Add new test case which checks for optimized probes
Colin Ian King (2):
selftests: prctl: Fix spelling mistake "anonynous" -> "anonymous"
kselftest: vDSO: Fix accumulation of uninitialized ret when CLOCK_REALTIME is undefined
Ivan Orlov (1):
selftests: media_tests: Add new subtest to video_device_test
Luis Chamberlain (1):
selftests: allow runners to override the timeout
Mark Brown (2):
selftests/cpufreq: Don't enable generic lock debugging options
selftests/ftace: Fix KTAP output ordering
Rishabh Bhatnagar (1):
kselftests: Sort the collections list to avoid duplicate tests
Tobias Klauser (1):
selftests/clone3: test clone3 with exit signal in flags
Ziqi Zhao (1):
selftest: pidfd: Omit long and repeating outputs
Documentation/dev-tools/kselftest.rst | 22 ++++
tools/testing/selftests/clone3/clone3.c | 5 +-
tools/testing/selftests/cpufreq/config | 8 --
tools/testing/selftests/ftrace/ftracetest | 2 +-
.../ftrace/test.d/kprobe/kprobe_opt_types.tc | 34 +++++++
tools/testing/selftests/kselftest/runner.sh | 11 +-
.../selftests/media_tests/video_device_test.c | 111 +++++++++++++++------
tools/testing/selftests/pidfd/pidfd.h | 1 -
tools/testing/selftests/pidfd/pidfd_fdinfo_test.c | 1 +
tools/testing/selftests/pidfd/pidfd_test.c | 3 +-
.../selftests/prctl/set-anon-vma-name-test.c | 2 +-
tools/testing/selftests/run_kselftest.sh | 7 +-
.../selftests/vDSO/vdso_test_clock_getres.c | 4 +-
13 files changed, 166 insertions(+), 45 deletions(-)
create mode 100644 tools/testing/selftests/ftrace/test.d/kprobe/kprobe_opt_types.tc
----------------------------------------------------------------
Make sv39 the default address space for mmap as some applications
currently depend on this assumption. The RISC-V specification enforces
that bits outside of the virtual address range are not used, so
restricting the size of the default address space as such should be
temporary. A hint address passed to mmap will cause the largest address
space that fits entirely into the hint to be used. If the hint is less
than or equal to 1<<38, a 39-bit address will be used. After an address
space is completely full, the next smallest address space will be used.
Documentation is also added to the RISC-V virtual memory section to explain
these changes.
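The hint rule described above can be modelled in a few lines of C. This is
a sketch of my reading of the description, not the kernel code: the
function and macro names are hypothetical, the boundary handling (a hint
equal to the size of an address space selects that address space) is an
assumption, and the fallback to a smaller address space once one is full
is not modelled.

```c
#include <assert.h>
#include <stdint.h>

/* Size of the user half of each address space (hypothetical names). */
#define EX_VA_USER_SV39	(UINT64_C(1) << 38)
#define EX_VA_USER_SV48	(UINT64_C(1) << 47)
#define EX_VA_USER_SV57	(UINT64_C(1) << 56)

/*
 * Return the number of virtual address bits mmap would use for a
 * given hint. max_bits is the largest mode the hardware supports
 * (39, 48 or 57). Without a hint, or with a small one, sv39 is used.
 */
unsigned int ex_mmap_va_bits(uint64_t hint, unsigned int max_bits)
{
	assert(max_bits == 39 || max_bits == 48 || max_bits == 57);

	if (hint >= EX_VA_USER_SV57 && max_bits >= 57)
		return 57;
	if (hint >= EX_VA_USER_SV48 && max_bits >= 48)
		return 48;
	return 39;
}
```

For example, a hint of 1<<40 still yields a 39-bit address space, because
the 48-bit user address space does not fit entirely below that hint.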
Charlie Jenkins (2):
RISC-V: mm: Restrict address space for sv39,sv48,sv57
RISC-V: mm: Update documentation and include test
Documentation/riscv/vm-layout.rst | 20 ++++++++
arch/riscv/include/asm/elf.h | 2 +-
arch/riscv/include/asm/pgtable.h | 21 ++++++--
arch/riscv/include/asm/processor.h | 41 +++++++++++++---
tools/testing/selftests/riscv/Makefile | 2 +-
tools/testing/selftests/riscv/mm/Makefile | 22 +++++++++
.../selftests/riscv/mm/testcases/mmap.c | 49 +++++++++++++++++++
7 files changed, 144 insertions(+), 13 deletions(-)
create mode 100644 tools/testing/selftests/riscv/mm/Makefile
create mode 100644 tools/testing/selftests/riscv/mm/testcases/mmap.c
base-commit: eef509789cecdce895020682192d32e8bac790e8
--
2.34.1
Hello!
Here is v4 of the mremap start address optimization / fix for exec warning. It
took me a while to write a test that catches the issue Linus and I discussed in
the last version, and I verified that the kernel crashes without the check. See
below.
The main change in this series is:
Care is taken with moves that happen purely within a VMA, in other words
this check in call_align_down():
if (vma->vm_start != addr_masked)
return false;
As an example of why this is needed:
Consider the following range which is 2MB aligned and is
a part of a larger 10MB range which is not shown. Each
character is 256KB below making the source and destination
2MB each. The lower case letters are moved (s to d) and the
upper case letters are not moved.
|DDDDddddSSSSssss|
If we align down 'ssss' to start from the 'SSSS', we will end up destroying
SSSS. The above if statement prevents that and I verified it.
I also added a test for this in the last patch.
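The effect of the check can be modelled in userspace as below. This is a
minimal sketch and not the kernel function: ex_can_align_down() and
EX_PMD_SIZE are hypothetical names, the PMD size is fixed at 2MB as in the
example, and only the single condition quoted above is modelled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define EX_PMD_SIZE	(UINT64_C(2) << 20)	/* 2MB, as in the example */

/*
 * Hypothetical model of the check: aligning addr down to a PMD
 * boundary is only allowed when the aligned address is the start of
 * the VMA. Otherwise the range between the aligned address and addr
 * belongs to the same VMA (the 'SSSS' part in the example) and would
 * be moved along with it.
 */
bool ex_can_align_down(uint64_t vma_start, uint64_t addr)
{
	uint64_t addr_masked = addr & ~(EX_PMD_SIZE - 1);

	assert(vma_start <= addr);

	return vma_start == addr_masked;
}
```

In the example, 'ssss' starts 1MB into a 2MB aligned block that lies in the
middle of the 10MB mapping, so the aligned address is not the VMA start and
the helper returns false.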
History of patches
==================
v3->v4:
1. Make sure to check address to align is beginning of VMA
2. Add test to check this (test fails with a kernel crash if we don't do this).
v2->v3:
1. Masked address was stored in int, fixed it to unsigned long to avoid truncation.
2. We now handle moves happening purely within a VMA, a new test is added to handle this.
3. More code comments.
v1->v2:
1. Trigger the optimization for mremaps smaller than a PMD. I tested by tracing
that it works correctly.
2. Fix issue with bogus return value found by Linus if we broke out of the
above loop for the first PMD itself.
v1: Initial RFC.
Description of patches
======================
These patches optimize the start addresses in move_page_tables() and test the
changes. It addresses a warning [1] that occurs due to a downward, overlapping
move on a mutually-aligned offset within a PMD during exec. By initiating the
copy process at the PMD level when such alignment is present, we can prevent
this warning and speed up the copying process at the same time. Linus Torvalds
suggested this idea.
Please check the individual patches for more details.
thanks,
- Joel
[1] https://lore.kernel.org/all/ZB2GTBD%2FLWTrkOiO@dhcp22.suse.cz/
Joel Fernandes (Google) (7):
mm/mremap: Optimize the start addresses in move_page_tables()
mm/mremap: Allow moves within the same VMA for stack
selftests: mm: Fix failure case when new remap region was not found
selftests: mm: Add a test for mutually aligned moves > PMD size
selftests: mm: Add a test for remapping to area immediately after
existing mapping
selftests: mm: Add a test for remapping within a range
selftests: mm: Add a test for moving from an offset from start of
mapping
fs/exec.c | 2 +-
include/linux/mm.h | 2 +-
mm/mremap.c | 63 ++++-
tools/testing/selftests/mm/mremap_test.c | 301 +++++++++++++++++++----
4 files changed, 319 insertions(+), 49 deletions(-)
--
2.41.0.rc2.161.g9c6817b8e7-goog
Hi Linus,
Please pull the following KUnit next update for Linux 6.5-rc1.
This KUnit update for Linux 6.5-rc1 consists of:
- kunit_add_action() API to defer a call until test exit.
- Update document to add kunit_add_action() usage notes.
- Changes to always run cleanup from a test kthread.
- Documentation updates to clarify cleanup usage
- assertions should not be used in cleanup
- Documentation update to clearly indicate that exit
functions should run even if init fails
- Several fixes and enhancements to existing tests.
diff is attached.
thanks,
-- Shuah
----------------------------------------------------------------
The following changes since commit ac9a78681b921877518763ba0e89202254349d1b:
Linux 6.4-rc1 (2023-05-07 13:34:35 -0700)
are available in the Git repository at:
git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest tags/linux-kselftest-kunit-6.5-rc1
for you to fetch changes up to 2e66833579ed759d7b7da1a8f07eb727ec6e80db:
MAINTAINERS: Add source tree entry for kunit (2023-06-15 09:16:01 -0600)
----------------------------------------------------------------
linux-kselftest-kunit-6.5-rc1
This KUnit update for Linux 6.5-rc1 consists of:
- kunit_add_action() API to defer a call until test exit.
- Update document to add kunit_add_action() usage notes.
- Changes to always run cleanup from a test kthread.
- Documentation updates to clarify cleanup usage
- assertions should not be used in cleanup
- Documentation update to clearly indicate that exit
functions should run even if init fails
- Several fixes and enhancements to existing tests.
----------------------------------------------------------------
Daniel Latypov (1):
kunit: tool: undo type subscripts for subprocess.Popen
David Gow (11):
kunit: Always run cleanup from a test kthread
Documentation: kunit: Note that assertions should not be used in cleanup
Documentation: kunit: Warn that exit functions run even if init fails
kunit: example: Provide example exit functions
kunit: Add kunit_add_action() to defer a call until test exit
kunit: executor_test: Use kunit_add_action()
kunit: kmalloc_array: Use kunit_add_action()
Documentation: kunit: Add usage notes for kunit_add_action()
kunit: Fix obsolete name in documentation headers (func->action)
kunit: Move kunit_abort() call out of kunit_do_failed_assertion()
Documentation: kunit: Rename references to kunit_abort()
Geert Uytterhoeven (1):
Documentation: kunit: Modular tests should not depend on KUNIT=y
Michal Wajdeczko (3):
kunit/test: Add example test showing parameterized testing
kunit: Fix reporting of the skipped parameterized tests
kunit: Update kunit_print_ok_not_ok function
SeongJae Park (1):
MAINTAINERS: Add source tree entry for kunit
Takashi Sakamoto (1):
Documentation: Kunit: add MODULE_LICENSE to sample code
Documentation/dev-tools/kunit/architecture.rst | 4 +-
Documentation/dev-tools/kunit/start.rst | 7 +-
Documentation/dev-tools/kunit/usage.rst | 69 ++++++++++-
MAINTAINERS | 2 +
include/kunit/resource.h | 92 +++++++++++++++
include/kunit/test.h | 34 ++++--
lib/kunit/executor_test.c | 11 +-
lib/kunit/kunit-example-test.c | 56 +++++++++
lib/kunit/kunit-test.c | 88 +++++++++++++-
lib/kunit/resource.c | 99 ++++++++++++++++
lib/kunit/test.c | 157 ++++++++++++++-----------
tools/testing/kunit/kunit_kernel.py | 6 +-
tools/testing/kunit/mypy.ini | 6 +
tools/testing/kunit/run_checks.py | 2 +-
14 files changed, 538 insertions(+), 95 deletions(-)
create mode 100644 tools/testing/kunit/mypy.ini
----------------------------------------------------------------
Hi Shuah,
This series contains updates to the rseq selftests.
* A typo in the Makefile prevents the basic_percpu_ops_mm_cid_test from
using the mm_cid field.
* Fix load-acquire/store-release macros which were buggy on arm64.
(this depends on commit "Implement rseq_unqual_scalar_typeof").
* The change "Use rseq_unqual_scalar_typeof in macros" is not a fix
per se, but improves the generated assembly.
Can you please pick these up in the selftests tree?
Thanks,
Mathieu
Mathieu Desnoyers (4):
selftests/rseq: Fix CID_ID typo in Makefile
selftests/rseq: Implement rseq_unqual_scalar_typeof
selftests/rseq: Fix arm64 buggy load-acquire/store-release macros
selftests/rseq: Use rseq_unqual_scalar_typeof in macros
tools/testing/selftests/rseq/Makefile | 2 +-
tools/testing/selftests/rseq/compiler.h | 26 ++++++++++
tools/testing/selftests/rseq/rseq-arm.h | 4 +-
tools/testing/selftests/rseq/rseq-arm64.h | 58 ++++++++++++-----------
tools/testing/selftests/rseq/rseq-mips.h | 4 +-
tools/testing/selftests/rseq/rseq-ppc.h | 4 +-
tools/testing/selftests/rseq/rseq-riscv.h | 6 +--
tools/testing/selftests/rseq/rseq-s390.h | 4 +-
tools/testing/selftests/rseq/rseq-x86.h | 4 +-
9 files changed, 70 insertions(+), 42 deletions(-)
--
2.25.1
We want to replace iptables TPROXY with a BPF program at TC ingress.
To make this work in all cases we need to assign a SO_REUSEPORT socket
to an skb, which is currently prohibited. This series adds support for
such sockets to bpf_sk_assign.
I did some refactoring to cut down on the amount of duplicate code. The
key to this is to use INDIRECT_CALL in the reuseport helpers. To show
that this approach is not just beneficial to TC sk_assign I removed
duplicate code for bpf_sk_lookup as well.
Changes from v1:
- Correct commit abbrev length (Kuniyuki)
- Reduce duplication (Kuniyuki)
- Add checks on sk_state (Martin)
- Split exporting inet[6]_lookup_reuseport into separate patch (Eric)
Joint work with Daniel Borkmann.
Signed-off-by: Lorenz Bauer <lmb(a)isovalent.com>
---
Changes in v3:
- Fix warning re udp_ehashfn and udp6_ehashfn (Simon)
- Return higher scoring connected UDP reuseport sockets (Kuniyuki)
- Fix ipv6 module builds
- Link to v2: https://lore.kernel.org/r/20230613-so-reuseport-v2-0-b7c69a342613@isovalent…
---
Daniel Borkmann (1):
selftests/bpf: Test that SO_REUSEPORT can be used with sk_assign helper
Lorenz Bauer (6):
udp: re-score reuseport groups when connected sockets are present
net: export inet_lookup_reuseport and inet6_lookup_reuseport
net: document inet[6]_lookup_reuseport sk_state requirements
net: remove duplicate reuseport_lookup functions
net: remove duplicate sk_lookup helpers
bpf, net: Support SO_REUSEPORT sockets with bpf_sk_assign
include/net/inet6_hashtables.h | 84 ++++++++-
include/net/inet_hashtables.h | 77 +++++++-
include/net/sock.h | 7 +-
include/net/udp.h | 8 +
include/uapi/linux/bpf.h | 3 -
net/core/filter.c | 2 -
net/ipv4/inet_hashtables.c | 70 +++++---
net/ipv4/udp.c | 88 ++++-----
net/ipv6/inet6_hashtables.c | 73 +++++---
net/ipv6/udp.c | 98 ++++------
tools/include/uapi/linux/bpf.h | 3 -
tools/testing/selftests/bpf/network_helpers.c | 3 +
.../selftests/bpf/prog_tests/assign_reuse.c | 197 +++++++++++++++++++++
.../selftests/bpf/progs/test_assign_reuse.c | 142 +++++++++++++++
14 files changed, 676 insertions(+), 179 deletions(-)
---
base-commit: 970308a7b544fa1c7ee98a2721faba3765be8dd8
change-id: 20230613-so-reuseport-e92c526173ee
Best regards,
--
Lorenz Bauer <lmb(a)isovalent.com>
v3:
- [v2] https://lore.kernel.org/lkml/20230531163405.2200292-1-longman@redhat.com/
- Change the new control file from root-only "cpuset.cpus.reserve" to
non-root "cpuset.cpus.exclusive" which lists the set of exclusive
CPUs distributed down the hierarchy.
- Add a patch to restrict boot-time isolated CPUs to isolated
partitions only.
- Update the test_cpuset_prs.sh test script and documentation
accordingly.
v2:
- [v1] https://lore.kernel.org/lkml/20230412153758.3088111-1-longman@redhat.com/
- Dropped the special "isolcpus" partition in v1
- Add the root only "cpuset.cpus.reserve" control file for reserving
CPUs used for remote isolated partitions.
- Update the test_cpuset_prs.sh test script and documentation
accordingly.
This patch series introduces a new cpuset control file
"cpuset.cpus.exclusive" which must be a subset of "cpuset.cpus"
and the parent's "cpuset.cpus.exclusive". This control file lists
the exclusive CPUs to be distributed down the hierarchy. Any one
of the exclusive CPUs can only be distributed to at most one child
cpuset. Unlike "cpuset.cpus", invalid input to "cpuset.cpus.exclusive"
will be rejected with an error. This new control file has no effect on
the behavior of the cpuset until it turns into a partition root. At that
point, its effective CPUs will be set to its exclusive CPUs unless some
of them are offline.
This patch series also introduces a new category of cpuset partition
called remote partitions. The existing partition category where the
partition roots have to be clustered around the root cgroup in a
hierarchical way is now referred to as local partitions.
A remote partition can be formed far from the root cgroup
with no partition root parent. Local partitions can be created
without touching "cpuset.cpus.exclusive", as it is set automatically
when a cpuset becomes a local partition root. Creating a remote
partition, however, requires properly set "cpuset.cpus.exclusive"
values down the hierarchy.
Both scheduling and isolated partitions can be formed in a remote
partition. A local partition can be created under a remote partition.
A remote partition, however, cannot be formed under a local partition
for now.
Modern container orchestration tools like Kubernetes use the cgroup
hierarchy to manage different containers, and they rely on other
middleware like systemd to help manage it. If a container needs to
use isolated CPUs, it is hard to get those with local partitions,
as that requires the administrative parent cgroup to be a partition
root too, which tools like systemd may not be ready to manage.
With this patch series, we allow the creation of a remote partition
far from the root. The container management tool can manage the
"cpuset.cpus.exclusive" file without impacting the other cpuset
files that are managed by other middleware. Of course, invalid
"cpuset.cpus.exclusive" values will be rejected and changes to
"cpuset.cpus" can affect the value of "cpuset.cpus.exclusive" due to
the requirement that it has to be a subset of the former control file.
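The subset requirement described above can be illustrated with a small
shell sketch. It is not taken from the patch series: expand_cpulist and
cpulist_is_subset are hypothetical helper names, and the sketch only
understands plain CPU numbers and "a-b" ranges, not every form the
kernel's cpulist parser may accept. It merely models the rule that
"cpuset.cpus.exclusive" has to stay within "cpuset.cpus"; the kernel
performs the real check.

```shell
#!/bin/sh
# Hypothetical helpers, for illustration only (not part of the series).

# Expand a cpulist such as "0-2,5" into one CPU number per line.
# Only plain numbers and "a-b" ranges are handled.
expand_cpulist() {
    echo "$1" | tr ',' '\n' | while IFS=- read -r lo hi; do
        [ -n "$lo" ] || continue      # skip empty entries
        seq "$lo" "${hi:-$lo}"        # a single CPU is the range lo-lo
    done
}

# Return 0 if every CPU in cpulist $1 also appears in cpulist $2.
# An empty $1 counts as a subset of anything.
cpulist_is_subset() {
    for cpu in $(expand_cpulist "$1"); do
        expand_cpulist "$2" | grep -qx "$cpu" || return 1
    done
    return 0
}
```

For example, cpulist_is_subset "2-3" "0-3" succeeds, while
cpulist_is_subset "2-5" "0-3" fails, which corresponds to a
"cpuset.cpus.exclusive" value that is not contained in "cpuset.cpus".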
Waiman Long (9):
cgroup/cpuset: Inherit parent's load balance state in v2
cgroup/cpuset: Extract out CS_CPU_EXCLUSIVE & CS_SCHED_LOAD_BALANCE
handling
cgroup/cpuset: Improve temporary cpumasks handling
cgroup/cpuset: Allow suppression of sched domain rebuild in
update_cpumasks_hier()
cgroup/cpuset: Add cpuset.cpus.exclusive for v2
cgroup/cpuset: Introduce remote partition
cgroup/cpuset: Check partition conflict with housekeeping setup
cgroup/cpuset: Documentation update for partition
cgroup/cpuset: Extend test_cpuset_prs.sh to test remote partition
Documentation/admin-guide/cgroup-v2.rst | 100 +-
kernel/cgroup/cpuset.c | 1352 ++++++++++++-----
.../selftests/cgroup/test_cpuset_prs.sh | 398 +++--
3 files changed, 1297 insertions(+), 553 deletions(-)
--
2.31.1
Currently, the write operation returns the write count regardless of
whether events are enabled or disabled. Fix this by returning -EBADF
when events are disabled.
v3 -> v4:
- Change the return value from zero to -EBADF
v2 -> v3:
- Change the return value from -ENOENT to zero
v1 -> v2:
- Change the return value from -EFAULT to -ENOENT
sunliming (3):
tracing/user_events: Fix incorrect return value for writing operation
when events are disabled
selftests/user_events: Enable the event before write_fault test in
ftrace self-test
selftests/user_events: Add test cases when event is disabled
kernel/trace/trace_events_user.c | 3 ++-
tools/testing/selftests/user_events/ftrace_test.c | 8 ++++++++
2 files changed, 10 insertions(+), 1 deletion(-)
--
2.25.1
On systems where netdevsim is built-in or loaded before the test
starts, kci_test_ipsec_offload doesn't remove the netdevsim device it
created during the test.
Fixes: e05b2d141fef ("netdevsim: move netdev creation/destruction to dev probe")
Signed-off-by: Sabrina Dubroca <sd(a)queasysnail.net>
---
tools/testing/selftests/net/rtnetlink.sh | 1 +
1 file changed, 1 insertion(+)
diff --git a/tools/testing/selftests/net/rtnetlink.sh b/tools/testing/selftests/net/rtnetlink.sh
index 383ac6fc037d..ba286d680fd9 100755
--- a/tools/testing/selftests/net/rtnetlink.sh
+++ b/tools/testing/selftests/net/rtnetlink.sh
@@ -860,6 +860,7 @@ EOF
fi
# clean up any leftovers
+ echo 0 > /sys/bus/netdevsim/del_device
$probed && rmmod netdevsim
if [ $ret -ne 0 ]; then
--
2.40.1
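The leak fixed above comes from creating a netdevsim device and relying
only on rmmod for cleanup, which does not happen when the module is
built-in or was already loaded. A minimal sketch of pairing creation
with explicit deletion follows. NSIM_BUS, nsim_create, nsim_delete and
with_nsim are hypothetical names introduced for illustration; only the
del_device write appears in the patch, and the new_device write is
assumed to be how the device with ID 0 gets created.

```shell
#!/bin/sh
# Hypothetical sketch, not the code in rtnetlink.sh.
# NSIM_BUS can be pointed at a scratch directory for dry runs.
NSIM_BUS="${NSIM_BUS:-/sys/bus/netdevsim}"

# Create the netdevsim device with the given ID (assumed interface).
nsim_create() { echo "$1" > "$NSIM_BUS/new_device"; }

# Delete the netdevsim device with the given ID, as the patch does for ID 0.
nsim_delete() { echo "$1" > "$NSIM_BUS/del_device"; }

# Run a command with a netdevsim device present and always delete the
# device afterwards, whether or not the module gets unloaded later.
with_nsim() {
    id="$1"; shift
    nsim_create "$id" || return 1
    rc=0
    "$@" || rc=$?
    nsim_delete "$id"
    return "$rc"
}
```

A test body would then be wrapped as "with_nsim 0 some_test_function",
so the device is removed even when the test fails and regardless of
whether netdevsim can be unloaded.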