This RFC is a follow-up of the discussions taken here:
https://lore.kernel.org/linux-doc/20230704132812.02ba97ba@maurocar-mobl2/T/…
It adds a new extension that allows documenting tests using the same tool we're
using for DRM unit tests at IGT GPU tools: https://gitlab.freedesktop.org/drm/igt-gpu-tools.
While kernel-doc has provided documentation for in-lined functions/struct comments,
it was not meant to document tests.
Tests need to be grouped by the test functions. It should also be possible to produce
other outputs from the documentation, to integrate it with test suites. For instance,
Internally at Intel, we use the comments to generate DOT files hierarchically grouped
per feature categories.
This is just an initial RFC to start discussions around the solution. Before being merged
upstream, we need to define what tags will be used to identify test markups and add
a simple change at kernel-doc to let it ignore such markups.
On this series, we have:
- patch 1:
adding test_list.py as present at the IGT tree, after a patch series to make it
more generic: https://patchwork.freedesktop.org/series/120622/
- patch 2:
adds an example about how tests could be documented. This is a really simple
example, just to test the feature, specially designed to make easy to build just
the test documentation from a single DRM kunit file.
After discussions, my plan is to send a new version addressing the issues, and add
some documentation for DRM and/or i915 kunit tests.
Mauro Carvalho Chehab (2):
docs: add support for documenting kUnit and kSelftests
drm: add documentation for drm_buddy_test kUnit test
Documentation/conf.py | 2 +-
Documentation/index.rst | 2 +-
Documentation/sphinx/test_kdoc.py | 108 ++
Documentation/sphinx/test_list.py | 1288 ++++++++++++++++++++++++
Documentation/tests/index.rst | 6 +
Documentation/tests/kunit.rst | 5 +
drivers/gpu/drm/tests/drm_buddy_test.c | 12 +
7 files changed, 1421 insertions(+), 2 deletions(-)
create mode 100644 Documentation/sphinx/test_kdoc.py
create mode 100644 Documentation/sphinx/test_list.py
create mode 100644 Documentation/tests/index.rst
create mode 100644 Documentation/tests/kunit.rst
--
2.40.1
This series decreases the pcm-test duration in order to avoid timeouts
by first moving the audio stream duration to a variable and subsequently
decreasing it.
Nícolas F. R. A. Prado (2):
kselftest/alsa: pcm-test: Move stream duration and margin to variables
kselftest/alsa: pcm-test: Decrease stream duration from 4 to 2 seconds
tools/testing/selftests/alsa/pcm-test.c | 8 +++++---
1 file changed, 5 insertions(+), 3 deletions(-)
--
2.41.0
Since apparently enabling all the KUnit tests shouldn't enable any new
subsystems it is hard to enable the regmap KUnit tests in normal KUnit
testing scenarios that don't enable any drivers. Add a Kconfig option
to help with this and include it in the KUnit all tests config.
Signed-off-by: Mark Brown <broonie(a)kernel.org>
---
drivers/base/regmap/Kconfig | 12 +++++++++++-
tools/testing/kunit/configs/all_tests.config | 2 ++
2 files changed, 13 insertions(+), 1 deletion(-)
diff --git a/drivers/base/regmap/Kconfig b/drivers/base/regmap/Kconfig
index 0db2021f7477..b1affac70d5d 100644
--- a/drivers/base/regmap/Kconfig
+++ b/drivers/base/regmap/Kconfig
@@ -4,7 +4,7 @@
# subsystems should select the appropriate symbols.
config REGMAP
- bool "Register Map support" if KUNIT_ALL_TESTS
+ bool
default y if (REGMAP_I2C || REGMAP_SPI || REGMAP_SPMI || REGMAP_W1 || REGMAP_AC97 || REGMAP_MMIO || REGMAP_IRQ || REGMAP_SOUNDWIRE || REGMAP_SOUNDWIRE_MBQ || REGMAP_SCCB || REGMAP_I3C || REGMAP_SPI_AVMM || REGMAP_MDIO || REGMAP_FSI)
select IRQ_DOMAIN if REGMAP_IRQ
select MDIO_BUS if REGMAP_MDIO
@@ -23,6 +23,16 @@ config REGMAP_KUNIT
default KUNIT_ALL_TESTS
select REGMAP_RAM
+config REGMAP_BUILD
+ bool "Enable regmap build"
+ depends on KUNIT
+ select REGMAP
+ help
+ This option exists purely to allow the regmap KUnit tests to
+ be enabled without having to enable some driver that uses
+ regmap due to unfortunate issues with how KUnit tests are
+ normally enabled.
+
config REGMAP_AC97
tristate
diff --git a/tools/testing/kunit/configs/all_tests.config b/tools/testing/kunit/configs/all_tests.config
index 0393940c706a..873f3e06ccad 100644
--- a/tools/testing/kunit/configs/all_tests.config
+++ b/tools/testing/kunit/configs/all_tests.config
@@ -33,5 +33,7 @@ CONFIG_DAMON_PADDR=y
CONFIG_DEBUG_FS=y
CONFIG_DAMON_DBGFS=y
+CONFIG_REGMAP_BUILD=y
+
CONFIG_SECURITY=y
CONFIG_SECURITY_APPARMOR=y
---
base-commit: 06c2afb862f9da8dc5efa4b6076a0e48c3fbaaa5
change-id: 20230701-regmap-kunit-enable-a08718e77dd4
Best regards,
--
Mark Brown <broonie(a)kernel.org>
Delete a duplicate assignment from this function implementation.
The note means ppm is average of the two actual freq samples.
But ppm have a duplicate assignment.
Signed-off-by: Minjie Du <duminjie(a)vivo.com>
---
tools/testing/selftests/timers/raw_skew.c | 3 +--
1 file changed, 1 insertion(+), 2 deletions(-)
diff --git a/tools/testing/selftests/timers/raw_skew.c b/tools/testing/selftests/timers/raw_skew.c
index 5beceeed0..6eba203f9 100644
--- a/tools/testing/selftests/timers/raw_skew.c
+++ b/tools/testing/selftests/timers/raw_skew.c
@@ -129,8 +129,7 @@ int main(int argc, char **argv)
printf("%lld.%i(est)", eppm/1000, abs((int)(eppm%1000)));
/* Avg the two actual freq samples adjtimex gave us */
- ppm = (tx1.freq + tx2.freq) * 1000 / 2;
- ppm = (long long)tx1.freq * 1000;
+ ppm = (long long)(tx1.freq + tx2.freq) * 1000 / 2;
ppm = shift_right(ppm, 16);
printf(" %lld.%i(act)", ppm/1000, abs((int)(ppm%1000)));
--
2.39.0
From: Jeff Xu <jeffxu(a)google.com>
Add documentation for sysctl vm.memfd_noexec
Link:https://lore.kernel.org/linux-mm/CABi2SkXUX_QqTQ10Yx9bBUGpN1wByOi_=gZU…
Reported-by: Dominique Martinet <asmadeus(a)codewreck.org>
Signed-off-by: Jeff Xu <jeffxu(a)google.com>
---
Documentation/admin-guide/sysctl/vm.rst | 29 +++++++++++++++++++++++++
1 file changed, 29 insertions(+)
diff --git a/Documentation/admin-guide/sysctl/vm.rst b/Documentation/admin-guide/sysctl/vm.rst
index 45ba1f4dc004..71923c3d7044 100644
--- a/Documentation/admin-guide/sysctl/vm.rst
+++ b/Documentation/admin-guide/sysctl/vm.rst
@@ -424,6 +424,35 @@ e.g., up to one or two maps per allocation.
The default value is 65530.
+memfd_noexec:
+=============
+This pid namespaced sysctl controls memfd_create().
+
+The new MFD_NOEXEC_SEAL and MFD_EXEC flags of memfd_create() allows
+application to set executable bit at creation time.
+
+When MFD_NOEXEC_SEAL is set, memfd is created without executable bit
+(mode:0666), and sealed with F_SEAL_EXEC, so it can't be chmod to
+be executable (mode: 0777) after creation.
+
+when MFD_EXEC flag is set, memfd is created with executable bit
+(mode:0777), this is the same as the old behavior of memfd_create.
+
+The new pid namespaced sysctl vm.memfd_noexec has 3 values:
+0: memfd_create() without MFD_EXEC nor MFD_NOEXEC_SEAL acts like
+ MFD_EXEC was set.
+1: memfd_create() without MFD_EXEC nor MFD_NOEXEC_SEAL acts like
+ MFD_NOEXEC_SEAL was set.
+2: memfd_create() without MFD_NOEXEC_SEAL will be rejected.
+
+The default value is 0.
+
+Once set, it can't be downgraded at runtime, i.e. 2=>1, 1=>0
+are denied.
+
+This is pid namespaced sysctl, child processes inherit the parent
+process's pid at the time of fork. Changes to the parent process
+after fork are not automatically propagated to the child process.
memory_failure_early_kill:
==========================
--
2.41.0.255.g8b1d071c50-goog
/proc/$PID/net currently allows the setting of file attributes,
in contrast to other /proc/$PID/ files and directories.
This would break the nolibc testsuite so the first patch in the series
removes the offending testcase.
The "fix" for nolibc-test is intentionally kept trivial as the series
will most likely go through the filesystem tree and if conflicts arise,
it is obvious on how to resolve them.
Technically this can lead to breakage of nolibc-test if an old
nolibc-test is used with a newer kernel containing the fix.
Note:
Except for /proc itself this is the only "struct inode_operations" in
fs/proc/ that is missing an implementation of setattr().
Signed-off-by: Thomas Weißschuh <linux(a)weissschuh.net>
---
Thomas Weißschuh (2):
selftests/nolibc: drop test chmod_net
proc: use generic setattr() for /proc/$PID/net
fs/proc/proc_net.c | 1 +
tools/testing/selftests/nolibc/nolibc-test.c | 1 -
2 files changed, 1 insertion(+), 1 deletion(-)
---
base-commit: a92b7d26c743b9dc06d520f863d624e94978a1d9
change-id: 20230624-proc-net-setattr-8f0a6b8eb2f5
Best regards,
--
Thomas Weißschuh <linux(a)weissschuh.net>
In the case where a sysfs file cannot be opened the error return path
fcloses file pointer fpl, however, fpl has already been closed in the
previous stanza. Fix the double fclose by removing it.
Fixes: 10b98a4db11a ("selftests: ALSA: Add test for the 'pcmtest' driver")
Signed-off-by: Colin Ian King <colin.i.king(a)gmail.com>
---
tools/testing/selftests/alsa/test-pcmtest-driver.c | 4 +---
1 file changed, 1 insertion(+), 3 deletions(-)
diff --git a/tools/testing/selftests/alsa/test-pcmtest-driver.c b/tools/testing/selftests/alsa/test-pcmtest-driver.c
index 71931b240a83..357adc722cba 100644
--- a/tools/testing/selftests/alsa/test-pcmtest-driver.c
+++ b/tools/testing/selftests/alsa/test-pcmtest-driver.c
@@ -47,10 +47,8 @@ static int read_patterns(void)
sprintf(pf, "/sys/kernel/debug/pcmtest/fill_pattern%d", i);
fp = fopen(pf, "r");
- if (!fp) {
- fclose(fpl);
+ if (!fp)
return -1;
- }
fread(patterns[i].buf, 1, patterns[i].len, fp);
fclose(fp);
}
--
2.39.2
The checksum_32 code was originally written to only handle 2-byte
aligned buffers, but was later extended to support arbitrary alignment.
However, the non-PPro variant doesn't apply the carry before jumping to
the 2- or 4-byte aligned versions, which clear CF.
This causes the new checksum_kunit test to fail, as it runs with a large
number of different possible alignments and both with and without
carries.
For example:
./tools/testing/kunit/kunit.py run --arch i386 --kconfig_add CONFIG_M486=y checksum
Gives:
KTAP version 1
# Subtest: checksum
1..3
ok 1 test_csum_fixed_random_inputs
# test_csum_all_carry_inputs: ASSERTION FAILED at lib/checksum_kunit.c:267
Expected result == expec, but
result == 65281 (0xff01)
expec == 65280 (0xff00)
not ok 2 test_csum_all_carry_inputs
# test_csum_no_carry_inputs: ASSERTION FAILED at lib/checksum_kunit.c:314
Expected result == expec, but
result == 65535 (0xffff)
expec == 65534 (0xfffe)
not ok 3 test_csum_no_carry_inputs
With this patch, it passes.
KTAP version 1
# Subtest: checksum
1..3
ok 1 test_csum_fixed_random_inputs
ok 2 test_csum_all_carry_inputs
ok 3 test_csum_no_carry_inputs
I also tested it on a real 486DX2, with the same results.
Signed-off-by: David Gow <davidgow(a)google.com>
---
This is a follow-up to the UML patch to use the common 32-bit x86
checksum implementations:
https://lore.kernel.org/linux-um/20230704083022.692368-2-davidgow@google.co…
---
arch/x86/lib/checksum_32.S | 1 +
1 file changed, 1 insertion(+)
diff --git a/arch/x86/lib/checksum_32.S b/arch/x86/lib/checksum_32.S
index 23318c338db0..128287cea42d 100644
--- a/arch/x86/lib/checksum_32.S
+++ b/arch/x86/lib/checksum_32.S
@@ -62,6 +62,7 @@ SYM_FUNC_START(csum_partial)
jl 8f
movzbl (%esi), %ebx
adcl %ebx, %eax
+ adcl $0, %eax
roll $8, %eax
inc %esi
testl $2, %esi
--
2.41.0.255.g8b1d071c50-goog
While it probably doesn't make a huge difference given the current KUnit
coverage we will get the best coverage of arm64 architecture features if
we specify -cpu=max rather than picking a specific CPU, this will include
all architecture features that qemu supports including many which have not
yet made it into physical implementations.
Due to performance issues emulating the architected pointer authentication
algorithm it is recommended to use the implementation defined algorithm
that qemu has instead, this should make no meaningful difference to the
coverage and will run the tests faster.
Signed-off-by: Mark Brown <broonie(a)kernel.org>
---
tools/testing/kunit/qemu_configs/arm64.py | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/tools/testing/kunit/qemu_configs/arm64.py b/tools/testing/kunit/qemu_configs/arm64.py
index 67d04064f785..d3ff27024755 100644
--- a/tools/testing/kunit/qemu_configs/arm64.py
+++ b/tools/testing/kunit/qemu_configs/arm64.py
@@ -9,4 +9,4 @@ CONFIG_SERIAL_AMBA_PL011_CONSOLE=y''',
qemu_arch='aarch64',
kernel_path='arch/arm64/boot/Image.gz',
kernel_command_line='console=ttyAMA0',
- extra_qemu_params=['-machine', 'virt', '-cpu', 'cortex-a57'])
+ extra_qemu_params=['-machine', 'virt', '-cpu', 'max,pauth-impdef=on'])
---
base-commit: 06c2afb862f9da8dc5efa4b6076a0e48c3fbaaa5
change-id: 20230702-kunit-arm64-cpu-max-7e3aa5f02fb2
Best regards,
--
Mark Brown <broonie(a)kernel.org>
=== Context ===
In the context of a middlebox, fragmented packets are tricky to handle.
The full 5-tuple of a packet is often only available in the first
fragment which makes enforcing consistent policy difficult. There are
really only two stateless options, neither of which are very nice:
1. Enforce policy on first fragment and accept all subsequent fragments.
This works but may let in certain attacks or allow data exfiltration.
2. Enforce policy on first fragment and drop all subsequent fragments.
This does not really work b/c some protocols may rely on
fragmentation. For example, DNS may rely on oversized UDP packets for
large responses.
So stateful tracking is the only sane option. RFC 8900 [0] calls this
out as well in section 6.3:
Middleboxes [...] should process IP fragments in a manner that is
consistent with [RFC0791] and [RFC8200]. In many cases, middleboxes
must maintain state in order to achieve this goal.
=== BPF related bits ===
Policy has traditionally been enforced from XDP/TC hooks. Both hooks
run before kernel reassembly facilities. However, with the new
BPF_PROG_TYPE_NETFILTER, we can rather easily hook into existing
netfilter reassembly infra.
The basic idea is we bump a refcnt on the netfilter defrag module and
then run the bpf prog after the defrag module runs. This allows bpf
progs to transparently see full, reassembled packets. The nice thing
about this is that progs don't have to carry around logic to detect
fragments.
=== Changelog ===
Changes from v3:
* Correctly initialize `addrlen` stack var for recvmsg()
Changes from v2:
* module_put() if ->enable() fails
* Fix CI build errors
Changes from v1:
* Drop bpf_program__attach_netfilter() patches
* static -> static const where appropriate
* Fix callback assignment order during registration
* Only request_module() if callbacks are missing
* Fix retval when modprobe fails in userspace
* Fix v6 defrag module name (nf_defrag_ipv6_hooks -> nf_defrag_ipv6)
* Simplify priority checking code
* Add warning if module doesn't assign callbacks in the future
* Take refcnt on module while defrag link is active
[0]: https://datatracker.ietf.org/doc/html/rfc8900
Daniel Xu (6):
netfilter: defrag: Add glue hooks for enabling/disabling defrag
netfilter: bpf: Support BPF_F_NETFILTER_IP_DEFRAG in netfilter link
netfilter: bpf: Prevent defrag module unload while link active
bpf: selftests: Support not connecting client socket
bpf: selftests: Support custom type and proto for client sockets
bpf: selftests: Add defrag selftests
include/linux/netfilter.h | 15 +
include/uapi/linux/bpf.h | 5 +
net/ipv4/netfilter/nf_defrag_ipv4.c | 17 +-
net/ipv6/netfilter/nf_defrag_ipv6_hooks.c | 11 +
net/netfilter/core.c | 6 +
net/netfilter/nf_bpf_link.c | 150 +++++++++-
tools/include/uapi/linux/bpf.h | 5 +
tools/testing/selftests/bpf/Makefile | 4 +-
.../selftests/bpf/generate_udp_fragments.py | 90 ++++++
.../selftests/bpf/ip_check_defrag_frags.h | 57 ++++
tools/testing/selftests/bpf/network_helpers.c | 26 +-
tools/testing/selftests/bpf/network_helpers.h | 3 +
.../bpf/prog_tests/ip_check_defrag.c | 283 ++++++++++++++++++
.../selftests/bpf/progs/ip_check_defrag.c | 104 +++++++
14 files changed, 754 insertions(+), 22 deletions(-)
create mode 100755 tools/testing/selftests/bpf/generate_udp_fragments.py
create mode 100644 tools/testing/selftests/bpf/ip_check_defrag_frags.h
create mode 100644 tools/testing/selftests/bpf/prog_tests/ip_check_defrag.c
create mode 100644 tools/testing/selftests/bpf/progs/ip_check_defrag.c
--
2.41.0
The #endif is the wrong side of a } causing a build failure when
__NR_userfaultfd is not defined. Fix this by moving the #end to
enclose the }
Fixes: 9eac40fc0cc7 ("selftests/mm: mkdirty: test behavior of (pte|pmd)_mkdirty on VMAs without write permissions")
Signed-off-by: Colin Ian King <colin.i.king(a)gmail.com>
---
tools/testing/selftests/mm/mkdirty.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/tools/testing/selftests/mm/mkdirty.c b/tools/testing/selftests/mm/mkdirty.c
index 6d71d972997b..301abb99e027 100644
--- a/tools/testing/selftests/mm/mkdirty.c
+++ b/tools/testing/selftests/mm/mkdirty.c
@@ -321,8 +321,8 @@ static void test_uffdio_copy(void)
munmap:
munmap(dst, pagesize);
free(src);
-#endif /* __NR_userfaultfd */
}
+#endif /* __NR_userfaultfd */
int main(void)
{
--
2.39.2
Awk is already called for /sys/block/zram#/mm_stat parsing, so use it
to also perform the floating point capacity vs consumption ratio
calculations. The test output is unchanged.
This allows bc to be dropped as a dependency for the zram selftests.
The documented free dependency can also be removed following
d18da7ec37195 ("selftests/zram01.sh: Fix compression ratio calculation")
Signed-off-by: David Disseldorp <ddiss(a)suse.de>
---
tools/testing/selftests/zram/README | 2 --
tools/testing/selftests/zram/zram01.sh | 18 ++++++++----------
2 files changed, 8 insertions(+), 12 deletions(-)
v2: drop unused dependencies from selftests/zram/README
diff --git a/tools/testing/selftests/zram/README b/tools/testing/selftests/zram/README
index 110b34834a6fa..510ca5a1087f5 100644
--- a/tools/testing/selftests/zram/README
+++ b/tools/testing/selftests/zram/README
@@ -27,9 +27,7 @@ zram01.sh: creates general purpose ram disks with ext4 filesystems
zram02.sh: creates block device for swap
Commands required for testing:
- - bc
- dd
- - free
- awk
- mkswap
- swapon
diff --git a/tools/testing/selftests/zram/zram01.sh b/tools/testing/selftests/zram/zram01.sh
index 8f4affe34f3e4..df1b1d4158989 100755
--- a/tools/testing/selftests/zram/zram01.sh
+++ b/tools/testing/selftests/zram/zram01.sh
@@ -33,7 +33,7 @@ zram_algs="lzo"
zram_fill_fs()
{
- for i in $(seq $dev_start $dev_end); do
+ for ((i = $dev_start; i <= $dev_end && !ERR_CODE; i++)); do
echo "fill zram$i..."
local b=0
while [ true ]; do
@@ -44,15 +44,13 @@ zram_fill_fs()
done
echo "zram$i can be filled with '$b' KB"
- local mem_used_total=`awk '{print $3}' "/sys/block/zram$i/mm_stat"`
- local v=$((100 * 1024 * $b / $mem_used_total))
- if [ "$v" -lt 100 ]; then
- echo "FAIL compression ratio: 0.$v:1"
- ERR_CODE=-1
- return
- fi
-
- echo "zram compression ratio: $(echo "scale=2; $v / 100 " | bc):1: OK"
+ awk -v b="$b" '{ v = (100 * 1024 * b / $3) } END {
+ if (v < 100) {
+ printf "FAIL compression ratio: 0.%u:1\n", v
+ exit 1
+ }
+ printf "zram compression ratio: %.2f:1: OK\n", v / 100
+ }' "/sys/block/zram$i/mm_stat" || ERR_CODE=-1
done
}
--
2.35.3
Dzień dobry,
zapoznałem się z Państwa ofertą i z przyjemnością przyznaję, że przyciąga uwagę i zachęca do dalszych rozmów.
Pomyślałem, że może mógłbym mieć swój wkład w Państwa rozwój i pomóc dotrzeć z tą ofertą do większego grona odbiorców. Pozycjonuję strony www, dzięki czemu generują świetny ruch w sieci.
Możemy porozmawiać w najbliższym czasie?
Pozdrawiam
Adam Charachuta
*Changes in v24*:
- Rebase on top of next-20230710
- Place WP markers in case of hole as well
*Changes in v23*:
- Set vec_buf_index in loop only when vec_buf_index is set
- Return -EFAULT instead of -EINVAL if vec is NULL
- Correctly return the walk ending address to the page granularity
*Changes in v22*:
- Interface change:
- Replace [start start + len) with [start, end)
- Return the ending address of the address walk in start
*Changes in v21*:
- Abort walk instead of returning error if WP is to be performed on
partial hugetlb
*Changes in v20*
- Correct PAGE_IS_FILE and add PAGE_IS_PFNZERO
*Changes in v19*
- Minor changes and interface updates
*Changes in v18*
- Rebase on top of next-20230613
- Minor updates
*Changes in v17*
- Rebase on top of next-20230606
- Minor improvements in PAGEMAP_SCAN IOCTL patch
*Changes in v16*
- Fix a corner case
- Add exclusive PM_SCAN_OP_WP back
*Changes in v15*
- Build fix (Add missed build fix in RESEND)
*Changes in v14*
- Fix build error caused by #ifdef added at last minute in some configs
*Changes in v13*
- Rebase on top of next-20230414
- Give-up on using uffd_wp_range() and write new helpers, flush tlb only
once
*Changes in v12*
- Update and other memory types to UFFD_FEATURE_WP_ASYNC
- Rebaase on top of next-20230406
- Review updates
*Changes in v11*
- Rebase on top of next-20230307
- Base patches on UFFD_FEATURE_WP_UNPOPULATED
- Do a lot of cosmetic changes and review updates
- Remove ENGAGE_WP + !GET operation as it can be performed with
UFFDIO_WRITEPROTECT
*Changes in v10*
- Add specific condition to return error if hugetlb is used with wp
async
- Move changes in tools/include/uapi/linux/fs.h to separate patch
- Add documentation
*Changes in v9:*
- Correct fault resolution for userfaultfd wp async
- Fix build warnings and errors which were happening on some configs
- Simplify pagemap ioctl's code
*Changes in v8:*
- Update uffd async wp implementation
- Improve PAGEMAP_IOCTL implementation
*Changes in v7:*
- Add uffd wp async
- Update the IOCTL to use uffd under the hood instead of soft-dirty
flags
*Motivation*
The real motivation for adding PAGEMAP_SCAN IOCTL is to emulate Windows
GetWriteWatch() syscall [1]. The GetWriteWatch{} retrieves the addresses of
the pages that are written to in a region of virtual memory.
This syscall is used in Windows applications and games etc. This syscall is
being emulated in pretty slow manner in userspace. Our purpose is to
enhance the kernel such that we translate it efficiently in a better way.
Currently some out of tree hack patches are being used to efficiently
emulate it in some kernels. We intend to replace those with these patches.
So the whole gaming on Linux can effectively get benefit from this. It
means there would be tons of users of this code.
CRIU use case [2] was mentioned by Andrei and Danylo:
> Use cases for migrating sparse VMAs are binaries sanitized with ASAN,
> MSAN or TSAN [3]. All of these sanitizers produce sparse mappings of
> shadow memory [4]. Being able to migrate such binaries allows to highly
> reduce the amount of work needed to identify and fix post-migration
> crashes, which happen constantly.
Andrei's defines the following uses of this code:
* it is more granular and allows us to track changed pages more
effectively. The current interface can clear dirty bits for the entire
process only. In addition, reading info about pages is a separate
operation. It means we must freeze the process to read information
about all its pages, reset dirty bits, only then we can start dumping
pages. The information about pages becomes more and more outdated,
while we are processing pages. The new interface solves both these
downsides. First, it allows us to read pte bits and clear the
soft-dirty bit atomically. It means that CRIU will not need to freeze
processes to pre-dump their memory. Second, it clears soft-dirty bits
for a specified region of memory. It means CRIU will have actual info
about pages to the moment of dumping them.
* The new interface has to be much faster because basic page filtering
is happening in the kernel. With the old interface, we have to read
pagemap for each page.
*Implementation Evolution (Short Summary)*
From the definition of GetWriteWatch(), we feel like kernel's soft-dirty
feature can be used under the hood with some additions like:
* reset soft-dirty flag for only a specific region of memory instead of
clearing the flag for the entire process
* get and clear soft-dirty flag for a specific region atomically
So we decided to use ioctl on pagemap file to read or/and reset soft-dirty
flag. But using soft-dirty flag, sometimes we get extra pages which weren't
even written. They had become soft-dirty because of VMA merging and
VM_SOFTDIRTY flag. This breaks the definition of GetWriteWatch(). We were
able to by-pass this short coming by ignoring VM_SOFTDIRTY until David
reported that mprotect etc messes up the soft-dirty flag while ignoring
VM_SOFTDIRTY [5]. This wasn't happening until [6] got introduced. We
discussed if we can revert these patches. But we could not reach to any
conclusion. So at this point, I made couple of tries to solve this whole
VM_SOFTDIRTY issue by correcting the soft-dirty implementation:
* [7] Correct the bug fixed wrongly back in 2014. It had potential to cause
regression. We left it behind.
* [8] Keep a list of soft-dirty part of a VMA across splits and merges. I
got the reply don't increase the size of the VMA by 8 bytes.
At this point, we left soft-dirty considering it is too much delicate and
userfaultfd [9] seemed like the only way forward. From there onward, we
have been basing soft-dirty emulation on userfaultfd wp feature where
kernel resolves the faults itself when WP_ASYNC feature is used. It was
straight forward to add WP_ASYNC feature in userfautlfd. Now we get only
those pages dirty or written-to which are really written in reality. (PS
There is another WP_UNPOPULATED userfautfd feature is required which is
needed to avoid pre-faulting memory before write-protecting [9].)
All the different masks were added on the request of CRIU devs to create
interface more generic and better.
[1] https://learn.microsoft.com/en-us/windows/win32/api/memoryapi/nf-memoryapi-…
[2] https://lore.kernel.org/all/20221014134802.1361436-1-mdanylo@google.com
[3] https://github.com/google/sanitizers
[4] https://github.com/google/sanitizers/wiki/AddressSanitizerAlgorithm#64-bit
[5] https://lore.kernel.org/all/bfcae708-db21-04b4-0bbe-712badd03071@redhat.com
[6] https://lore.kernel.org/all/20220725142048.30450-1-peterx@redhat.com/
[7] https://lore.kernel.org/all/20221122115007.2787017-1-usama.anjum@collabora.…
[8] https://lore.kernel.org/all/20221220162606.1595355-1-usama.anjum@collabora.…
[9] https://lore.kernel.org/all/20230306213925.617814-1-peterx@redhat.com
[10] https://lore.kernel.org/all/20230125144529.1630917-1-mdanylo@google.com
* Original Cover letter from v8*
Hello,
Note:
Soft-dirty pages and pages which have been written-to are synonyms. As
kernel already has soft-dirty feature inside which we have given up to
use, we are using written-to terminology while using UFFD async WP under
the hood.
This IOCTL, PAGEMAP_SCAN on pagemap file can be used to get and/or clear
the info about page table entries. The following operations are
supported in this ioctl:
- Get the information if the pages have been written-to (PAGE_IS_WRITTEN),
file mapped (PAGE_IS_FILE), present (PAGE_IS_PRESENT) or swapped
(PAGE_IS_SWAPPED).
- Write-protect the pages (PAGEMAP_WP_ENGAGE) to start finding which
pages have been written-to.
- Find pages which have been written-to and write protect the pages
(atomic PAGE_IS_WRITTEN + PAGEMAP_WP_ENGAGE)
It is possible to find and clear soft-dirty pages entirely in userspace.
But it isn't efficient:
- The mprotect and SIGSEGV handler for bookkeeping
- The userfaultfd wp (synchronous) with the handler for bookkeeping
Some benchmarks can be seen here[1]. This series adds features that weren't
present earlier:
- There is no atomic get soft-dirty/Written-to status and clear present in
the kernel.
- The pages which have been written-to can not be found in accurate way.
(Kernel's soft-dirty PTE bit + sof_dirty VMA bit shows more soft-dirty
pages than there actually are.)
Historically, soft-dirty PTE bit tracking has been used in the CRIU
project. The procfs interface is enough for finding the soft-dirty bit
status and clearing the soft-dirty bit of all the pages of a process.
We have the use case where we need to track the soft-dirty PTE bit for
only specific pages on-demand. We need this tracking and clear mechanism
of a region of memory while the process is running to emulate the
getWriteWatch() syscall of Windows.
*(Moved to using UFFD instead of soft-dirtyi feature to find pages which
have been written-to from v7 patch series)*:
Stop using the soft-dirty flags for finding which pages have been
written to. It is too delicate and wrong as it shows more soft-dirty
pages than the actual soft-dirty pages. There is no interest in
correcting it [2][3] as this is how the feature was written years ago.
It shouldn't be updated to changed behaviour. Peter Xu has suggested
using the async version of the UFFD WP [4] as it is based inherently
on the PTEs.
So in this patch series, I've added a new mode to the UFFD which is
asynchronous version of the write protect. When this variant of the
UFFD WP is used, the page faults are resolved automatically by the
kernel. The pages which have been written-to can be found by reading
pagemap file (!PM_UFFD_WP). This feature can be used successfully to
find which pages have been written to from the time the pages were
write protected. This works just like the soft-dirty flag without
showing any extra pages which aren't soft-dirty in reality.
The information related to pages if the page is file mapped, present and
swapped is required for the CRIU project [5][6]. The addition of the
required mask, any mask, excluded mask and return masks are also required
for the CRIU project [5].
The IOCTL returns the addresses of the pages which match the specific
masks. The page addresses are returned in struct page_region in a compact
form. The max_pages is needed to support a use case where user only wants
to get a specific number of pages. So there is no need to find all the
pages of interest in the range when max_pages is specified. The IOCTL
returns when the maximum number of the pages are found. The max_pages is
optional. If max_pages is specified, it must be equal or greater than the
vec_size. This restriction is needed to handle worse case when one
page_region only contains info of one page and it cannot be compacted.
This is needed to emulate the Windows getWriteWatch() syscall.
The patch series include the detailed selftest which can be used as an
example for the uffd async wp test and PAGEMAP_IOCTL. It shows the
interface usages as well.
[1] https://lore.kernel.org/lkml/54d4c322-cd6e-eefd-b161-2af2b56aae24@collabora…
[2] https://lore.kernel.org/all/20221220162606.1595355-1-usama.anjum@collabora.…
[3] https://lore.kernel.org/all/20221122115007.2787017-1-usama.anjum@collabora.…
[4] https://lore.kernel.org/all/Y6Hc2d+7eTKs7AiH@x1n
[5] https://lore.kernel.org/all/YyiDg79flhWoMDZB@gmail.com/
[6] https://lore.kernel.org/all/20221014134802.1361436-1-mdanylo@google.com/
Regards,
Muhammad Usama Anjum
Muhammad Usama Anjum (4):
fs/proc/task_mmu: Implement IOCTL to get and optionally clear info
about PTEs
tools headers UAPI: Update linux/fs.h with the kernel sources
mm/pagemap: add documentation of PAGEMAP_SCAN IOCTL
selftests: mm: add pagemap ioctl tests
Peter Xu (1):
userfaultfd: UFFD_FEATURE_WP_ASYNC
Documentation/admin-guide/mm/pagemap.rst | 58 +
Documentation/admin-guide/mm/userfaultfd.rst | 35 +
fs/proc/task_mmu.c | 583 +++++++
fs/userfaultfd.c | 26 +-
include/linux/hugetlb.h | 1 +
include/linux/userfaultfd_k.h | 21 +-
include/uapi/linux/fs.h | 55 +
include/uapi/linux/userfaultfd.h | 9 +-
mm/hugetlb.c | 34 +-
mm/memory.c | 27 +-
tools/include/uapi/linux/fs.h | 55 +
tools/testing/selftests/mm/.gitignore | 2 +
tools/testing/selftests/mm/Makefile | 3 +-
tools/testing/selftests/mm/config | 1 +
tools/testing/selftests/mm/pagemap_ioctl.c | 1464 ++++++++++++++++++
tools/testing/selftests/mm/run_vmtests.sh | 4 +
16 files changed, 2354 insertions(+), 24 deletions(-)
create mode 100644 tools/testing/selftests/mm/pagemap_ioctl.c
mode change 100644 => 100755 tools/testing/selftests/mm/run_vmtests.sh
--
2.39.2
A few cleanups to the existing test logic.
Signed-off-by: Thomas Weißschuh <linux(a)weissschuh.net>
---
Thomas Weißschuh (4):
selftests/nolibc: make evaluation of test conditions
selftests/nolibc: simplify status printing
selftests/nolibc: simplify status argument
selftests/nolibc: avoid gaps in test numbers
tools/testing/selftests/nolibc/nolibc-test.c | 201 +++++++++++----------------
1 file changed, 85 insertions(+), 116 deletions(-)
---
base-commit: 078cda365b3f47f61047a08230925a1478e9a1c8
change-id: 20230711-nolibc-sizeof-long-gaps-0f28cba7ee4d
Best regards,
--
Thomas Weißschuh <linux(a)weissschuh.net>
We want to replace iptables TPROXY with a BPF program at TC ingress.
To make this work in all cases we need to assign a SO_REUSEPORT socket
to an skb, which is currently prohibited. This series adds support for
such sockets to bpf_sk_assing.
I did some refactoring to cut down on the amount of duplicate code. The
key to this is to use INDIRECT_CALL in the reuseport helpers. To show
that this approach is not just beneficial to TC sk_assign I removed
duplicate code for bpf_sk_lookup as well.
Joint work with Daniel Borkmann.
Signed-off-by: Lorenz Bauer <lmb(a)isovalent.com>
---
Changes in v5:
- Drop reuse_sk == sk check in inet[6]_steal_stock (Kuniyuki)
- Link to v4: https://lore.kernel.org/r/20230613-so-reuseport-v4-0-4ece76708bba@isovalent…
Changes in v4:
- WARN_ON_ONCE if reuseport socket is refcounted (Kuniyuki)
- Use inet[6]_ehashfn_t to shorten function declarations (Kuniyuki)
- Shuffle documentation patch around (Kuniyuki)
- Update commit message to explain why IPv6 needs EXPORT_SYMBOL
- Link to v3: https://lore.kernel.org/r/20230613-so-reuseport-v3-0-907b4cbb7b99@isovalent…
Changes in v3:
- Fix warning re udp_ehashfn and udp6_ehashfn (Simon)
- Return higher scoring connected UDP reuseport sockets (Kuniyuki)
- Fix ipv6 module builds
- Link to v2: https://lore.kernel.org/r/20230613-so-reuseport-v2-0-b7c69a342613@isovalent…
Changes in v2:
- Correct commit abbrev length (Kuniyuki)
- Reduce duplication (Kuniyuki)
- Add checks on sk_state (Martin)
- Split exporting inet[6]_lookup_reuseport into separate patch (Eric)
---
Daniel Borkmann (1):
selftests/bpf: Test that SO_REUSEPORT can be used with sk_assign helper
Lorenz Bauer (6):
udp: re-score reuseport groups when connected sockets are present
net: export inet_lookup_reuseport and inet6_lookup_reuseport
net: remove duplicate reuseport_lookup functions
net: document inet[6]_lookup_reuseport sk_state requirements
net: remove duplicate sk_lookup helpers
bpf, net: Support SO_REUSEPORT sockets with bpf_sk_assign
include/net/inet6_hashtables.h | 81 ++++++++-
include/net/inet_hashtables.h | 74 +++++++-
include/net/sock.h | 7 +-
include/uapi/linux/bpf.h | 3 -
net/core/filter.c | 2 -
net/ipv4/inet_hashtables.c | 68 ++++---
net/ipv4/udp.c | 88 ++++-----
net/ipv6/inet6_hashtables.c | 71 +++++---
net/ipv6/udp.c | 98 ++++------
tools/include/uapi/linux/bpf.h | 3 -
tools/testing/selftests/bpf/network_helpers.c | 3 +
.../selftests/bpf/prog_tests/assign_reuse.c | 197 +++++++++++++++++++++
.../selftests/bpf/progs/test_assign_reuse.c | 142 +++++++++++++++
13 files changed, 658 insertions(+), 179 deletions(-)
---
base-commit: c20f9cef725bc6b19efe372696e8000fb5af0d46
change-id: 20230613-so-reuseport-e92c526173ee
Best regards,
--
Lorenz Bauer <lmb(a)isovalent.com>
The build failure reported in [1] occurred because commit 9fc96c7c19df
("selftests: error out if kernel header files are not yet built") added
a new "kernel_header_files" dependency to "all", and that triggered
another, pre-existing problem. Specifically, the arm64 selftests
override the emit_tests target, and that override improperly declares
itself to depend upon the "all" target.
This is a problem because the "emit_tests" target in lib.mk was not
intended to be overridden. emit_tests is a very simple, sequential build
target that was originally invoked from the "install" target, which in
turn, depends upon "all".
That approach worked for years. But with 9fc96c7c19df in place,
emit_tests failed, because it does not set up all of the elaborate
things that "install" does. And that caused the new
"kernel_header_files" target (which depends upon $(KBUILD_OUTPUT) being
correct) to fail.
Some detail: The "all" target is .PHONY. Therefore, each target that
depends on "all" will cause it to be invoked again, and because
dependencies are managed quite loosely in the selftests Makefiles, many
things will run, even "all" is invoked several times in immediate
succession. So this is not a "real" failure, as far as build steps go:
everything gets built, but "all" reports a problem when invoked a second
time from a bad environment.
To fix this, simply remove the unnecessary "all" dependency from the
overridden emit_tests target. The dependency is still effectively
honored, because again, invocation is via "install", which also depends
upon "all".
An alternative approach would be to harden the emit_tests target so that
it can depend upon "all", but that's a lot more complicated and hard to
get right, and doesn't seem worth it, especially given that emit_tests
should probably not be overridden at all.
[1] https://lore.kernel.org/20230710-kselftest-fix-arm64-v1-1-48e872844f25@kern…
Fixes: 9fc96c7c19df ("selftests: error out if kernel header files are not yet built")
Reported-by: Mark Brown <broonie(a)kernel.org>
Signed-off-by: John Hubbard <jhubbard(a)nvidia.com>
---
tools/testing/selftests/arm64/Makefile | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/tools/testing/selftests/arm64/Makefile b/tools/testing/selftests/arm64/Makefile
index 9460cbe81bcc..ace8b67fb22d 100644
--- a/tools/testing/selftests/arm64/Makefile
+++ b/tools/testing/selftests/arm64/Makefile
@@ -42,7 +42,7 @@ run_tests: all
done
# Avoid any output on non arm64 on emit_tests
-emit_tests: all
+emit_tests:
@for DIR in $(ARM64_SUBTARGETS); do \
BUILD_TARGET=$(OUTPUT)/$$DIR; \
make OUTPUT=$$BUILD_TARGET -C $$DIR $@; \
base-commit: d5fe758c21f4770763ae4c05580be239be18947d
--
2.41.0
v4:
- [v3] https://lore.kernel.org/lkml/20230627005529.1564984-1-longman@redhat.com/
- Fix compilation problem reported by kernel test robot.
v3:
- [v2] https://lore.kernel.org/lkml/20230531163405.2200292-1-longman@redhat.com/
- Change the new control file from root-only "cpuset.cpus.reserve" to
non-root "cpuset.cpus.exclusive" which lists the set of exclusive
CPUs distributed down the hierarchy.
- Add a patch to restrict boot-time isolated CPUs to isolated
partitions only.
- Update the test_cpuset_prs.sh test script and documentation
accordingly.
This patch series introduces a new cpuset control file
"cpuset.cpus.exclusive" which must be a subset of "cpuset.cpus"
and the parent's "cpuset.cpus.exclusive". This control file lists
the exclusive CPUs to be distributed down the hierarchy. Any one
of the exclusive CPUs can only be distributed to at most one child
cpuset. Unlike "cpuset.cpus", invalid input to "cpuset.cpus.exclusive"
will be rejected with an error. This new control file has no effect on
the behavior of the cpuset until it turns into a partition root. At that
point, its effective CPUs will be set to its exclusive CPUs unless some
of them are offline.
This patch series also introduces a new category of cpuset partition
called remote partitions. The existing partition category where the
partition roots have to be clustered around the root cgroup in a
hierarchical way is now referred to as local partitions.
A remote partition can be formed far from the root cgroup
with no partition root parent. While local partitions can be
created without touching "cpuset.cpus.exclusive" as it can be set
automatically if a cpuset becomes a local partition root. Properly set
"cpuset.cpus.exclusive" values down the hierarchy are required to create
a remote partition.
Both scheduling and isolated partitions can be formed in a remote
partition. A local partition can be created under a remote partition.
A remote partition, however, cannot be formed under a local partition
for now.
Modern container orchestration tools like Kubernetes use the cgroup
hierarchy to manage different containers. And it is relying on other
middleware like systemd to help managing it. If a container needs to
use isolated CPUs, it is hard to get those with the local partitions
as it will require the administrative parent cgroup to be a partition
root too which tool like systemd may not be ready to manage.
With this patch series, we allow the creation of remote partition
far from the root. The container management tool can manage the
"cpuset.cpus.exclusive" file without impacting the other cpuset
files that are managed by other middlewares. Of course, invalid
"cpuset.cpus.exclusive" values will be rejected and changes to
"cpuset.cpus" can affect the value of "cpuset.cpus.exclusive" due to
the requirement that it has to be a subset of the former control file.
Waiman Long (9):
cgroup/cpuset: Inherit parent's load balance state in v2
cgroup/cpuset: Extract out CS_CPU_EXCLUSIVE & CS_SCHED_LOAD_BALANCE
handling
cgroup/cpuset: Improve temporary cpumasks handling
cgroup/cpuset: Allow suppression of sched domain rebuild in
update_cpumasks_hier()
cgroup/cpuset: Add cpuset.cpus.exclusive for v2
cgroup/cpuset: Introduce remote partition
cgroup/cpuset: Check partition conflict with housekeeping setup
cgroup/cpuset: Documentation update for partition
cgroup/cpuset: Extend test_cpuset_prs.sh to test remote partition
Documentation/admin-guide/cgroup-v2.rst | 100 +-
kernel/cgroup/cpuset.c | 1347 ++++++++++++-----
.../selftests/cgroup/test_cpuset_prs.sh | 398 +++--
3 files changed, 1291 insertions(+), 554 deletions(-)
--
2.31.1
We want to replace iptables TPROXY with a BPF program at TC ingress.
To make this work in all cases we need to assign a SO_REUSEPORT socket
to an skb, which is currently prohibited. This series adds support for
such sockets to bpf_sk_assing.
I did some refactoring to cut down on the amount of duplicate code. The
key to this is to use INDIRECT_CALL in the reuseport helpers. To show
that this approach is not just beneficial to TC sk_assign I removed
duplicate code for bpf_sk_lookup as well.
Joint work with Daniel Borkmann.
Signed-off-by: Lorenz Bauer <lmb(a)isovalent.com>
---
Changes in v4:
- WARN_ON_ONCE if reuseport socket is refcounted (Kuniyuki)
- Use inet[6]_ehashfn_t to shorten function declarations (Kuniyuki)
- Shuffle documentation patch around (Kuniyuki)
- Update commit message to explain why IPv6 needs EXPORT_SYMBOL
- Link to v3: https://lore.kernel.org/r/20230613-so-reuseport-v3-0-907b4cbb7b99@isovalent…
Changes in v3:
- Fix warning re udp_ehashfn and udp6_ehashfn (Simon)
- Return higher scoring connected UDP reuseport sockets (Kuniyuki)
- Fix ipv6 module builds
- Link to v2: https://lore.kernel.org/r/20230613-so-reuseport-v2-0-b7c69a342613@isovalent…
Changes in v2:
- Correct commit abbrev length (Kuniyuki)
- Reduce duplication (Kuniyuki)
- Add checks on sk_state (Martin)
- Split exporting inet[6]_lookup_reuseport into separate patch (Eric)
---
Daniel Borkmann (1):
selftests/bpf: Test that SO_REUSEPORT can be used with sk_assign helper
Lorenz Bauer (6):
udp: re-score reuseport groups when connected sockets are present
net: export inet_lookup_reuseport and inet6_lookup_reuseport
net: remove duplicate reuseport_lookup functions
net: document inet[6]_lookup_reuseport sk_state requirements
net: remove duplicate sk_lookup helpers
bpf, net: Support SO_REUSEPORT sockets with bpf_sk_assign
include/net/inet6_hashtables.h | 81 ++++++++-
include/net/inet_hashtables.h | 74 +++++++-
include/net/sock.h | 7 +-
include/uapi/linux/bpf.h | 3 -
net/core/filter.c | 2 -
net/ipv4/inet_hashtables.c | 67 ++++---
net/ipv4/udp.c | 88 ++++-----
net/ipv6/inet6_hashtables.c | 70 +++++---
net/ipv6/udp.c | 98 ++++------
tools/include/uapi/linux/bpf.h | 3 -
tools/testing/selftests/bpf/network_helpers.c | 3 +
.../selftests/bpf/prog_tests/assign_reuse.c | 197 +++++++++++++++++++++
.../selftests/bpf/progs/test_assign_reuse.c | 142 +++++++++++++++
13 files changed, 656 insertions(+), 179 deletions(-)
---
base-commit: 970308a7b544fa1c7ee98a2721faba3765be8dd8
change-id: 20230613-so-reuseport-e92c526173ee
Best regards,
--
Lorenz Bauer <lmb(a)isovalent.com>
=== Context ===
In the context of a middlebox, fragmented packets are tricky to handle.
The full 5-tuple of a packet is often only available in the first
fragment which makes enforcing consistent policy difficult. There are
really only two stateless options, neither of which are very nice:
1. Enforce policy on first fragment and accept all subsequent fragments.
This works but may let in certain attacks or allow data exfiltration.
2. Enforce policy on first fragment and drop all subsequent fragments.
This does not really work b/c some protocols may rely on
fragmentation. For example, DNS may rely on oversized UDP packets for
large responses.
So stateful tracking is the only sane option. RFC 8900 [0] calls this
out as well in section 6.3:
Middleboxes [...] should process IP fragments in a manner that is
consistent with [RFC0791] and [RFC8200]. In many cases, middleboxes
must maintain state in order to achieve this goal.
=== BPF related bits ===
Policy has traditionally been enforced from XDP/TC hooks. Both hooks
run before kernel reassembly facilities. However, with the new
BPF_PROG_TYPE_NETFILTER, we can rather easily hook into existing
netfilter reassembly infra.
The basic idea is we bump a refcnt on the netfilter defrag module and
then run the bpf prog after the defrag module runs. This allows bpf
progs to transparently see full, reassembled packets. The nice thing
about this is that progs don't have to carry around logic to detect
fragments.
=== Changelog ===
Changes from v2:
* module_put() if ->enable() fails
* Fix CI build errors
Changes from v1:
* Drop bpf_program__attach_netfilter() patches
* static -> static const where appropriate
* Fix callback assignment order during registration
* Only request_module() if callbacks are missing
* Fix retval when modprobe fails in userspace
* Fix v6 defrag module name (nf_defrag_ipv6_hooks -> nf_defrag_ipv6)
* Simplify priority checking code
* Add warning if module doesn't assign callbacks in the future
* Take refcnt on module while defrag link is active
[0]: https://datatracker.ietf.org/doc/html/rfc8900
Daniel Xu (6):
netfilter: defrag: Add glue hooks for enabling/disabling defrag
netfilter: bpf: Support BPF_F_NETFILTER_IP_DEFRAG in netfilter link
netfilter: bpf: Prevent defrag module unload while link active
bpf: selftests: Support not connecting client socket
bpf: selftests: Support custom type and proto for client sockets
bpf: selftests: Add defrag selftests
include/linux/netfilter.h | 15 +
include/uapi/linux/bpf.h | 5 +
net/ipv4/netfilter/nf_defrag_ipv4.c | 17 +-
net/ipv6/netfilter/nf_defrag_ipv6_hooks.c | 11 +
net/netfilter/core.c | 6 +
net/netfilter/nf_bpf_link.c | 150 +++++++++-
tools/include/uapi/linux/bpf.h | 5 +
tools/testing/selftests/bpf/Makefile | 4 +-
.../selftests/bpf/generate_udp_fragments.py | 90 ++++++
.../selftests/bpf/ip_check_defrag_frags.h | 57 ++++
tools/testing/selftests/bpf/network_helpers.c | 26 +-
tools/testing/selftests/bpf/network_helpers.h | 3 +
.../bpf/prog_tests/ip_check_defrag.c | 282 ++++++++++++++++++
.../selftests/bpf/progs/ip_check_defrag.c | 104 +++++++
14 files changed, 753 insertions(+), 22 deletions(-)
create mode 100755 tools/testing/selftests/bpf/generate_udp_fragments.py
create mode 100644 tools/testing/selftests/bpf/ip_check_defrag_frags.h
create mode 100644 tools/testing/selftests/bpf/prog_tests/ip_check_defrag.c
create mode 100644 tools/testing/selftests/bpf/progs/ip_check_defrag.c
--
2.41.0
On Mon, 10 Jul 2023 15:07:30 -0400
Steven Rostedt <rostedt(a)goodmis.org> wrote:
> On Mon, 10 Jul 2023 15:06:06 -0400
> Steven Rostedt <rostedt(a)goodmis.org> wrote:
>
> > > Something was broken in your mail (I guess cc list) and couldn’t reach to lkml or
> > > ignored by lkml. I just wanted to track the auto test results from linux-kselftest.
> >
> > Yeah, claws-mail has an issue with some emails with quotes in it (sometimes
> > drops the second quote). Sad part is, it happens after I hit send, and it
> > is not part of the email. I'll send this reply now, but I bet it's going to happen again.
> >
> > Let's see :-/ I checked the To and Cc's and they all have the proper
> > quotes. Let's see what ends up in my "Sent" folder.
>
> This time it worked!
>
But this reply did not :-p
It was fine before I sent, but the email in my Sent folder shows:
Cc: "mhiramat(a)kernel.org" <mhiramat(a)kernel.org>, "shuah(a)kernel.org" <shuah(a)kernel.org>, "linux-kernel(a)vger.kernel.org" <linux-kernel(a)vger.kernel.org>, "linux-trace-kernel(a)vger.kernel.org\" <linux-trace-kernel(a)vger.kernel.org>, "linux-kselftest(a)vger.kernel.org" <linux-kselftest(a)vger.kernel.org>, Ching-lin Yu <chinglinyu(a)google.com>, Nadav Amit <namit(a)vmware.com>, "srivatsa(a)csail.mit.edu" <srivatsa(a)csail.mit.edu>, Alexey Makhalov <amakhalov(a)vmware.com>, Vasavi Sirnapalli <vsirnapalli(a)vmware.com>, Tapas Kundu <tkundu(a)vmware.com>, "er.ajay.kaher(a)gmail.com" <er.ajay.kaher(a)gmail.com>
Claw's injected a backslash into: "linux-trace-kernel(a)vger.kernel.org\" <linux-trace-kernel(a)vger.kernel.org>
I have my own build of claws-mail, let me update it and perhaps this will
go away.
-- Steve
This is the basic functionality for iommufd to support
iommufd_device_replace() and IOMMU_HWPT_ALLOC for physical devices.
iommufd_device_replace() allows changing the HWPT associated with the
device to a new IOAS or HWPT. Replace does this in way that failure leaves
things unchanged, and utilizes the iommu iommu_group_replace_domain() API
to allow the iommu driver to perform an optional non-disruptive change.
IOMMU_HWPT_ALLOC allows HWPTs to be explicitly allocated by the user and
used by attach or replace. At this point it isn't very useful since the
HWPT is the same as the automatically managed HWPT from the IOAS. However
a following series will allow userspace to customize the created HWPT.
The implementation is complicated because we have to introduce some
per-iommu_group memory in iommufd and redo how we think about multi-device
groups to be more explicit. This solves all the locking problems in the
prior attempts.
This series is infrastructure work for the following series which:
- Add replace for attach
- Expose replace through VFIO APIs
- Implement driver parameters for HWPT creation (nesting)
Once review of this is complete I will keep it on a side branch and
accumulate the following series when they are ready so we can have a
stable base and make more incremental progress. When we have all the parts
together to get a full implementation it can go to Linus.
This is on github: https://github.com/jgunthorpe/linux/commits/iommufd_hwpt
v7:
- Rebase to v6.4-rc2, update to new signature of iommufd_get_ioas()
v6: https://lore.kernel.org/r/0-v6-fdb604df649a+369-iommufd_alloc_jgg@nvidia.com
- Go back to the v4 locking arragnment with now both the attach/detach
igroup->locks inside the functions, Kevin says he needs this for a
followup series. This still fixes the syzkaller bug
- Fix two more error unwind locking bugs where
iommufd_object_abort_and_destroy(hwpt) would deadlock or be mislocked.
Make sure fail_nth will catch these mistakes
- Add a patch allowing objects to have different abort than destroy
function, it allows hwpt abort to require the caller to continue
to hold the lock and enforces this with lockdep.
v5: https://lore.kernel.org/r/0-v5-6716da355392+c5-iommufd_alloc_jgg@nvidia.com
- Go back to the v3 version of the code, keep the comment changes from
v4. Syzkaller says the group lock change in v4 didn't work.
- Adjust the fail_nth test to cover the path syzkaller found. We need to
have an ioas with a mapped page installed to inject a failure during
domain attachment.
v4: https://lore.kernel.org/r/0-v4-9cd79ad52ee8+13f5-iommufd_alloc_jgg@nvidia.c…
- Refine comments and commit messages
- Move the group lock into iommufd_hw_pagetable_attach()
- Fix error unwind in iommufd_device_do_replace()
v3: https://lore.kernel.org/r/0-v3-61d41fd9e13e+1f5-iommufd_alloc_jgg@nvidia.com
- Refine comments and commit messages
- Adjust the flow in iommufd_device_auto_get_domain() so pt_id is only
set on success
- Reject replace on non-attached devices
- Add missing __reserved check for IOMMU_HWPT_ALLOC
v2: https://lore.kernel.org/r/0-v2-51b9896e7862+8a8c-iommufd_alloc_jgg@nvidia.c…
- Use WARN_ON for the igroup->group test and move that logic to a
function iommufd_group_try_get()
- Change igroup->devices to igroup->device list
Replace will need to iterate over all attached idevs
- Rename to iommufd_group_setup_msi()
- New patch to export iommu_get_resv_regions()
- New patch to use per-device reserved regions instead of per-group
regions
- Split out the reorganizing of iommufd_device_change_pt() from the
replace patch
- Replace uses the per-dev reserved regions
- Use stdev_id in a few more places in the selftest
- Fix error handling in IOMMU_HWPT_ALLOC
- Clarify comments
- Rebase on v6.3-rc1
v1: https://lore.kernel.org/all/0-v1-7612f88c19f5+2f21-iommufd_alloc_jgg@nvidia…
Jason Gunthorpe (17):
iommufd: Move isolated msi enforcement to iommufd_device_bind()
iommufd: Add iommufd_group
iommufd: Replace the hwpt->devices list with iommufd_group
iommu: Export iommu_get_resv_regions()
iommufd: Keep track of each device's reserved regions instead of
groups
iommufd: Use the iommufd_group to avoid duplicate MSI setup
iommufd: Make sw_msi_start a group global
iommufd: Move putting a hwpt to a helper function
iommufd: Add enforced_cache_coherency to iommufd_hw_pagetable_alloc()
iommufd: Allow a hwpt to be aborted after allocation
iommufd: Fix locking around hwpt allocation
iommufd: Reorganize iommufd_device_attach into
iommufd_device_change_pt
iommufd: Add iommufd_device_replace()
iommufd: Make destroy_rwsem use a lock class per object type
iommufd: Add IOMMU_HWPT_ALLOC
iommufd/selftest: Return the real idev id from selftest mock_domain
iommufd/selftest: Add a selftest for IOMMU_HWPT_ALLOC
Nicolin Chen (2):
iommu: Introduce a new iommu_group_replace_domain() API
iommufd/selftest: Test iommufd_device_replace()
drivers/iommu/iommu-priv.h | 10 +
drivers/iommu/iommu.c | 41 +-
drivers/iommu/iommufd/device.c | 553 +++++++++++++-----
drivers/iommu/iommufd/hw_pagetable.c | 112 +++-
drivers/iommu/iommufd/io_pagetable.c | 32 +-
drivers/iommu/iommufd/iommufd_private.h | 52 +-
drivers/iommu/iommufd/iommufd_test.h | 6 +
drivers/iommu/iommufd/main.c | 24 +-
drivers/iommu/iommufd/selftest.c | 40 ++
include/linux/iommufd.h | 1 +
include/uapi/linux/iommufd.h | 26 +
tools/testing/selftests/iommu/iommufd.c | 67 ++-
.../selftests/iommu/iommufd_fail_nth.c | 67 ++-
tools/testing/selftests/iommu/iommufd_utils.h | 63 +-
14 files changed, 868 insertions(+), 226 deletions(-)
create mode 100644 drivers/iommu/iommu-priv.h
base-commit: f1fcbaa18b28dec10281551dfe6ed3a3ed80e3d6
--
2.40.1
Hi Liam,
On Thu, May 18, 2023 at 9:37 PM Liam R. Howlett <Liam.Howlett(a)oracle.com> wrote:
> Now that the functions have changed the limits, update the testing of
> the maple tree to test these new settings.
>
> Signed-off-by: Liam R. Howlett <Liam.Howlett(a)oracle.com>
Thanks for your patch, which is now commit eb2e817f38cafbf7
("maple_tree: update testing code for mas_{next,prev,walk}") in
> --- a/lib/test_maple_tree.c
> +++ b/lib/test_maple_tree.c
> @@ -2011,7 +2011,7 @@ static noinline void __init next_prev_test(struct maple_tree *mt)
>
> val = mas_next(&mas, ULONG_MAX);
> MT_BUG_ON(mt, val != NULL);
> - MT_BUG_ON(mt, mas.index != ULONG_MAX);
> + MT_BUG_ON(mt, mas.index != 0x7d6);
On m68k (ARAnyM):
TEST STARTING
BUG at next_prev_test:2014 (1)
Pass: 3749128 Run:3749129
And after that it seems to hang[*].
After adding a debug print (thus shifting all line numbers by +1):
next_prev_test:mas.index = 0x138e
BUG at next_prev_test:2015 (1)
0x138e = 5006, while the expected value is 0x7d6 = 2006.
I guess converting this test to the KUnit framework would make it a
bit easier to investigate failures...
[*] Left the debug one running, and I got a few more:
BUG at check_empty_area_window:2656 (1)
Pass: 3754275 Run:3754277
BUG at check_empty_area_window:2657 (1)
Pass: 3754275 Run:3754278
BUG at check_empty_area_window:2658 (1)
Pass: 3754275 Run:3754279
BUG at check_empty_area_window:2662 (1)
Pass: 3754275 Run:3754280
BUG at check_empty_area_window:2663 (1)
Pass: 3754275 Run:3754281
maple_tree: 3804518 of 3804524 tests passed
So the full test took more than 20 minutes...
> MT_BUG_ON(mt, mas.last != ULONG_MAX);
>
> val = mas_prev(&mas, 0);
Gr{oetje,eeting}s,
Geert
--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert(a)linux-m68k.org
In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
-- Linus Torvalds
Hi, Willy
This v4 mainly uses the argv0 suggested by you, at the same time, a new
run-libc-test target is added for glibc and musl, and the RB_ flags are
added for nolibc to allow compile nolibc-test.c without <linux/reboot.h>
for glibc, musl and nolibc (mainly for musl-gcc, without -I
/path/to/sysroot).
This patchset is based on the 20230705-nolibc-series2 branch of nolibc
repo [2], it must be applied after our v6 __sysret series [3] (argv0
exported there) and Thomas' chmod_net removal patchset [4] (the new
chmod_argv0 is added at the same line of chmod_net, will conflict).
This patchset assumes the chmod_net removal patchset will be applied at
first, if not, the chmod_argv0 added alphabetically will not be applied.
Since our new chmod_argv0 is exactly added to replace chmod_net, so,
Willy, is it ok for you to at least apply the chmod_net removal patch
[5] before this patchset?
selftests/nolibc: drop test chmod_net
This patchset is tested together with the v6 __sysret series [3]:
arch/board | result
------------|------------
arm/vexpress-a9 | 142 test(s) passed, 1 skipped, 0 failed.
arm/virt | 142 test(s) passed, 1 skipped, 0 failed.
aarch64/virt | 142 test(s) passed, 1 skipped, 0 failed.
ppc/g3beige | not supported
ppc/ppce500 | not supported
i386/pc | 142 test(s) passed, 1 skipped, 0 failed.
x86_64/pc | 142 test(s) passed, 1 skipped, 0 failed.
mipsel/malta | 142 test(s) passed, 1 skipped, 0 failed.
loongarch64/virt | 142 test(s) passed, 1 skipped, 0 failed.
riscv64/virt | 142 test(s) passed, 1 skipped, 0 failed.
riscv32/virt | 0 test(s) passed, 0 skipped, 0 failed.
s390x/s390-ccw-virtio | 142 test(s) passed, 1 skipped, 0 failed.
If use tinyconfig + basic console options (means disable all of the
other options, include procfs, shmem, tmpfs, net and memfd_create, to
save test time, only randomly choose 4 archs):
...
LOG: testing report for loongarch64/virt:
15 chmod_self [SKIPPED]
16 chown_self [SKIPPED]
40 link_cross [SKIPPED]
0 -fstackprotector not supported [SKIPPED]
139 test(s) passed, 4 skipped, 0 failed.
See all results in /labs/linux-lab/logging/nolibc/loongarch64-virt-nolibc-test.log
LOG: testing summary:
arch/board | result
------------|------------
arm/vexpress-a9 | 139 test(s) passed, 4 skipped, 0 failed.
x86_64/pc | 139 test(s) passed, 4 skipped, 0 failed.
mipsel/malta | 139 test(s) passed, 4 skipped, 0 failed.
loongarch64/virt | 139 test(s) passed, 4 skipped, 0 failed.
Changes from v3 --> v4:
* selftests/nolibc: stat_fault: silence NULL argument warning with glibc
selftests/nolibc: gettid: restore for glibc and musl
selftests/nolibc: add _LARGEFILE64_SOURCE for musl
selftests/nolibc: fix up int_fast16/32_t test cases for musl
selftests/nolibc: fix up kernel parameters support
selftests/nolibc: link_cross: use /proc/self/cmdline
tools/nolibc: add rmdir() support
selftests/nolibc: add a new rmdir() test case
selftests/nolibc: fix up failures when CONFIG_PROC_FS=n
selftests/nolibc: prepare /tmp for tmpfs or ramfs
selftests/nolibc: vfprintf: remove MEMFD_CREATE dependency
No change.
* selftests/nolibc: add run-libc-test target
New run and report for glibc or musl. for musl, we can simply issue:
$ make run-libc-test CC=/path/to/musl-install/bin/musl-gcc
* tools/nolibc: types.h: add RB_ flags for reboot()
selftests/nolibc: prefer <sys/reboot.h> to <linux/reboot.h>
Required by musl to compile nolibc-test.c without -I/path/to/sysroot
* selftests/nolibc: chdir_root: restore current path after test
restore current path to prevent breakage of using relative path
* selftests/nolibc: stat_timestamps: remove procfs dependency
selftests/nolibc: chroot_exe: remove procfs dependency
selftests/nolibc: add chmod_argv0 test
use argv0 instead of '/init' as before.
Best regards,
Zhangjin
---
[1]: https://lore.kernel.org/lkml/cover.1688134399.git.falcon@tinylab.org/
[2]: https://git.kernel.org/pub/scm/linux/kernel/git/wtarreau/nolibc.git
[3]: https://lore.kernel.org/lkml/cover.1688739492.git.falcon@tinylab.org/
[4]: https://lore.kernel.org/lkml/20230624-proc-net-setattr-v1-0-73176812adee@we…
[5]: https://lore.kernel.org/lkml/20230624-proc-net-setattr-v1-1-73176812adee@we…
Zhangjin Wu (18):
selftests/nolibc: add run-libc-test target
selftests/nolibc: stat_fault: silence NULL argument warning with glibc
selftests/nolibc: gettid: restore for glibc and musl
selftests/nolibc: add _LARGEFILE64_SOURCE for musl
selftests/nolibc: fix up int_fast16/32_t test cases for musl
tools/nolibc: types.h: add RB_ flags for reboot()
selftests/nolibc: prefer <sys/reboot.h> to <linux/reboot.h>
selftests/nolibc: fix up kernel parameters support
selftests/nolibc: link_cross: use /proc/self/cmdline
tools/nolibc: add rmdir() support
selftests/nolibc: add a new rmdir() test case
selftests/nolibc: fix up failures when CONFIG_PROC_FS=n
selftests/nolibc: prepare /tmp for tmpfs or ramfs
selftests/nolibc: vfprintf: remove MEMFD_CREATE dependency
selftests/nolibc: chdir_root: restore current path after test
selftests/nolibc: stat_timestamps: remove procfs dependency
selftests/nolibc: chroot_exe: remove procfs dependency
selftests/nolibc: add chmod_argv0 test
tools/include/nolibc/sys.h | 23 ++++-
tools/include/nolibc/types.h | 12 ++-
tools/testing/selftests/nolibc/Makefile | 4 +
tools/testing/selftests/nolibc/nolibc-test.c | 88 +++++++++++++++-----
4 files changed, 104 insertions(+), 23 deletions(-)
--
2.25.1
According to commit 01d6c48a828b ("Documentation: kselftest:
"make headers" is a prerequisite"), running the kselftests requires
to run "make headers" first.
Do that in "vmtest.sh" as well to fix the HID CI.
Signed-off-by: Benjamin Tissoires <bentiss(a)kernel.org>
---
Looks like the new master branch (v6.5-rc1) broke my CI.
And given that `make headers` is now a requisite to run the kselftests,
also include that command in vmtests.sh.
Broken CI job: https://gitlab.freedesktop.org/bentiss/hid/-/jobs/44704436
Fixed CI job: https://gitlab.freedesktop.org/bentiss/hid/-/jobs/45151040
---
tools/testing/selftests/hid/vmtest.sh | 1 +
1 file changed, 1 insertion(+)
diff --git a/tools/testing/selftests/hid/vmtest.sh b/tools/testing/selftests/hid/vmtest.sh
index 681b906b4853..4da48bf6b328 100755
--- a/tools/testing/selftests/hid/vmtest.sh
+++ b/tools/testing/selftests/hid/vmtest.sh
@@ -79,6 +79,7 @@ recompile_kernel()
cd "${kernel_checkout}"
${make_command} olddefconfig
+ ${make_command} headers
${make_command}
}
---
base-commit: 0e382fa72bbf0610be40af9af9b03b0cd149df82
change-id: 20230709-fix-selftests-c8b0bdff1d20
Best regards,
--
Benjamin Tissoires <bentiss(a)kernel.org>
Hi, Willy
As you suggested, the 'status: [success|warning|failure]' info is added
to the summary line, with additional newlines around this line to
extrude the status info. at the same time, the total tests is printed,
the passed, skipped and failed values are aligned with '%03d'.
This patchset is based on 20230705-nolibc-series2 of nolibc repo[1].
The test result looks like:
...
138 test(s): 135 passed, 002 skipped, 001 failed => status: failure
See all results in /labs/linux-lab/src/linux-stable/tools/testing/selftests/nolibc/run.out
Or:
...
137 test(s): 134 passed, 003 skipped, 000 failed => status: warning
See all results in /labs/linux-lab/src/linux-stable/tools/testing/selftests/nolibc/run.out
Best regards,
Zhangjin
---
[1]: https://git.kernel.org/pub/scm/linux/kernel/git/wtarreau/nolibc.git
Zhangjin Wu (5):
selftests/nolibc: report: print a summarized test status
selftests/nolibc: report: print total tests
selftests/nolibc: report: align passed, skipped and failed
selftests/nolibc: report: extrude the test status line
selftests/nolibc: report: add newline before test failures
tools/testing/selftests/nolibc/Makefile | 7 ++++---
1 file changed, 4 insertions(+), 3 deletions(-)
--
2.25.1
On Mon, 10 Jul 2023 02:17:01 +0000
Nadav Amit <namit(a)vmware.com> wrote:
> > On Jul 9, 2023, at 6:54 PM, Steven Rostedt <rostedt(a)goodmis.org> wrote:
> >
> > + union {
> > + struct rcu_head rcu;
> > + struct llist_node llist; /* For freeing after RCU */
> > + };
>
> The memory savings from using a union might not be worth the potential impact
> of type confusion and bugs.
It's also documentation. The two are related, as one is the hand off to
the other. It's not a random union, and I'd like to leave it that way.
-- Steve
Since commit 53fcfafa8c5c ("tools/nolibc/unistd: add syscall()") nolibc
has support for syscall(2).
Use it to get rid of some ifdef-ery.
Signed-off-by: Thomas Weißschuh <linux(a)weissschuh.net>
---
tools/testing/selftests/nolibc/nolibc-test.c | 6 +-----
1 file changed, 1 insertion(+), 5 deletions(-)
diff --git a/tools/testing/selftests/nolibc/nolibc-test.c b/tools/testing/selftests/nolibc/nolibc-test.c
index 486334981e60..c02d89953679 100644
--- a/tools/testing/selftests/nolibc/nolibc-test.c
+++ b/tools/testing/selftests/nolibc/nolibc-test.c
@@ -1051,11 +1051,7 @@ int main(int argc, char **argv, char **envp)
* exit with status code 2N+1 when N is written to 0x501. We
* hard-code the syscall here as it's arch-dependent.
*/
-#if defined(_NOLIBC_SYS_H)
- else if (my_syscall3(__NR_ioperm, 0x501, 1, 1) == 0)
-#else
- else if (ioperm(0x501, 1, 1) == 0)
-#endif
+ else if (syscall(__NR_ioperm, 0x501, 1, 1) == 0)
__asm__ volatile ("outb %%al, %%dx" :: "d"(0x501), "a"(0));
/* if it does nothing, fall back to the regular panic */
#endif
---
base-commit: a901a3568fd26ca9c4a82d8bc5ed5b3ed844d451
change-id: 20230703-nolibc-ioperm-88d87ae6d5e9
Best regards,
--
Thomas Weißschuh <linux(a)weissschuh.net>
Make sv48 the default address space for mmap as some applications
currently depend on this assumption. Also enable users to select
desired address space using a non-zero hint address to mmap. Previous
kernel changes caused Java and other applications to be broken on sv57
which this patch fixes.
Documentation is also added to the RISC-V virtual memory section to explain
these changes.
-Charlie
---
v4:
- Split testcases/document patch into test cases, in-code documentation, and
formal documentation patches
- Modified the mmap_base macro to be more legible and better represent memory
layout
- Fixed documentation to better reflect the implmentation
- Renamed DEFAULT_VA_BITS to MMAP_VA_BITS
- Added additional test case for rlimit changes
---
Charlie Jenkins (4):
RISC-V: mm: Restrict address space for sv39,sv48,sv57
RISC-V: mm: Add tests for RISC-V mm
RISC-V: mm: Update pgtable comment documentation
RISC-V: mm: Document mmap changes
Documentation/riscv/vm-layout.rst | 22 +++
arch/riscv/include/asm/elf.h | 2 +-
arch/riscv/include/asm/pgtable.h | 21 ++-
arch/riscv/include/asm/processor.h | 43 +++++-
tools/testing/selftests/riscv/Makefile | 2 +-
tools/testing/selftests/riscv/mm/.gitignore | 1 +
tools/testing/selftests/riscv/mm/Makefile | 21 +++
.../selftests/riscv/mm/testcases/mmap.c | 133 ++++++++++++++++++
8 files changed, 232 insertions(+), 13 deletions(-)
create mode 100644 tools/testing/selftests/riscv/mm/.gitignore
create mode 100644 tools/testing/selftests/riscv/mm/Makefile
create mode 100644 tools/testing/selftests/riscv/mm/testcases/mmap.c
--
2.41.0
This series adds a new userfaultfd feature, UFFDIO_POISON. See commit 4
for a detailed description of the feature.
The series is based on Linus master (partial 6.5 merge window), and
structured like this:
- Patches 1-3 are preparation / refactoring
- Patches 4-6 implement and advertise the new feature
- Patches 7-8 implement a unit test for the new feature
Changelog:
v2 -> v3:
- Rebase onto current Linus master.
- Don't overwrite existing PTE markers for non-hugetlb UFFDIO_POISON.
Before, non-hugetlb would override them, but hugetlb would not. I don't
think there's a use case where we *want* to override a UFFD_WP marker
for example, so take the more conservative behavior for all kinds of
memory.
- [Peter] Drop hugetlb mfill atomic refactoring, since it isn't needed
for this series (we don't touch that code directly anyway).
- [Peter] Switch to re-using PTE_MARKER_SWAPIN_ERROR instead of defining
new PTE_MARKER_UFFD_POISON.
- [Peter] Extract start / len range overflow check into existing
validate_range helper; this fixes the style issue of unnecessary braces
in the UFFDIO_POISON implementation, because this code is just deleted.
- [Peter] Extract file size check out into a new helper.
- [Peter] Defer actually "enabling" the new feature until the last commit
in the series; combine this with adding the documentation. As a
consequence, move the selftest commits after this one.
- [Randy] Fix typo in documentation.
v1 -> v2:
- [Peter] Return VM_FAULT_HWPOISON not VM_FAULT_SIGBUS, to yield the
correct behavior for KVM (guest MCE).
- [Peter] Rename UFFDIO_SIGBUS to UFFDIO_POISON.
- [Peter] Implement hugetlbfs support for UFFDIO_POISON.
Axel Rasmussen (8):
mm: make PTE_MARKER_SWAPIN_ERROR more general
mm: userfaultfd: check for start + len overflow in validate_range
mm: userfaultfd: extract file size check out into a helper
mm: userfaultfd: add new UFFDIO_POISON ioctl
mm: userfaultfd: support UFFDIO_POISON for hugetlbfs
mm: userfaultfd: document and enable new UFFDIO_POISON feature
selftests/mm: refactor uffd_poll_thread to allow custom fault handlers
selftests/mm: add uffd unit test for UFFDIO_POISON
Documentation/admin-guide/mm/userfaultfd.rst | 15 +++
fs/userfaultfd.c | 73 ++++++++++--
include/linux/mm_inline.h | 19 +++
include/linux/swapops.h | 10 +-
include/linux/userfaultfd_k.h | 4 +
include/uapi/linux/userfaultfd.h | 25 +++-
mm/hugetlb.c | 51 ++++++--
mm/madvise.c | 2 +-
mm/memory.c | 15 ++-
mm/mprotect.c | 4 +-
mm/shmem.c | 4 +-
mm/swapfile.c | 2 +-
mm/userfaultfd.c | 83 ++++++++++---
tools/testing/selftests/mm/uffd-common.c | 5 +-
tools/testing/selftests/mm/uffd-common.h | 3 +
tools/testing/selftests/mm/uffd-stress.c | 12 +-
tools/testing/selftests/mm/uffd-unit-tests.c | 117 +++++++++++++++++++
17 files changed, 377 insertions(+), 67 deletions(-)
--
2.41.0.255.g8b1d071c50-goog