While it probably doesn't make a huge difference given the current KUnit
coverage, we will get the best coverage of arm64 architecture features if
we specify -cpu max rather than picking a specific CPU. This will include
all architecture features that qemu supports, including many which have not
yet made it into physical implementations.
Due to performance issues emulating the architected pointer authentication
algorithm, it is recommended to use the implementation defined algorithm
that qemu has instead. This should make no meaningful difference to the
coverage and will run the tests faster.
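To make the effect on the QEMU invocation concrete, here is a minimal Python sketch. The helper build_qemu_cmdline() and the way it joins the fields are assumptions made for illustration and are not the actual kunit_tool code; only the parameter values are taken from the arm64.py config touched by this patch.

```python
from typing import List


def build_qemu_cmdline(qemu_arch: str, kernel_path: str,
                       kernel_command_line: str,
                       extra_qemu_params: List[str]) -> List[str]:
    """Hypothetical helper: join the config fields into an argv list."""
    return [
        f"qemu-system-{qemu_arch}",
        "-kernel", kernel_path,
        "-append", kernel_command_line,
    ] + extra_qemu_params


# Values as they appear in the patched arm64.py.
cmd = build_qemu_cmdline(
    qemu_arch="aarch64",
    kernel_path="arch/arm64/boot/Image.gz",
    kernel_command_line="console=ttyAMA0",
    extra_qemu_params=["-machine", "virt", "-cpu", "max,pauth-impdef=on"],
)
print(" ".join(cmd))
```

Note that the CPU model and its property travel as a single argument after -cpu, which is why the config uses 'max,pauth-impdef=on' as one list element.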
Signed-off-by: Mark Brown <broonie(a)kernel.org>
---
tools/testing/kunit/qemu_configs/arm64.py | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/tools/testing/kunit/qemu_configs/arm64.py b/tools/testing/kunit/qemu_configs/arm64.py
index 67d04064f785..d3ff27024755 100644
--- a/tools/testing/kunit/qemu_configs/arm64.py
+++ b/tools/testing/kunit/qemu_configs/arm64.py
@@ -9,4 +9,4 @@ CONFIG_SERIAL_AMBA_PL011_CONSOLE=y''',
qemu_arch='aarch64',
kernel_path='arch/arm64/boot/Image.gz',
kernel_command_line='console=ttyAMA0',
- extra_qemu_params=['-machine', 'virt', '-cpu', 'cortex-a57'])
+ extra_qemu_params=['-machine', 'virt', '-cpu', 'max,pauth-impdef=on'])
---
base-commit: 06c2afb862f9da8dc5efa4b6076a0e48c3fbaaa5
change-id: 20230702-kunit-arm64-cpu-max-7e3aa5f02fb2
Best regards,
--
Mark Brown <broonie(a)kernel.org>
=== Context ===
In the context of a middlebox, fragmented packets are tricky to handle.
The full 5-tuple of a packet is often only available in the first
fragment which makes enforcing consistent policy difficult. There are
really only two stateless options, neither of which are very nice:
1. Enforce policy on first fragment and accept all subsequent fragments.
This works but may let in certain attacks or allow data exfiltration.
2. Enforce policy on first fragment and drop all subsequent fragments.
This does not really work b/c some protocols may rely on
fragmentation. For example, DNS may rely on oversized UDP packets for
large responses.
So stateful tracking is the only sane option. RFC 8900 [0] calls this
out as well in section 6.3:
Middleboxes [...] should process IP fragments in a manner that is
consistent with [RFC0791] and [RFC8200]. In many cases, middleboxes
must maintain state in order to achieve this goal.
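As a rough illustration of why a stateless hook struggles here, the following Python sketch classifies an IPv4 header as a whole packet, a first fragment or a later fragment. Only the first two carry the L4 header, so only they expose the full 5-tuple. The function names and the make_header() test helper are hypothetical; the flag and offset layout is the standard IPv4 one.

```python
import struct

IP_MF = 0x2000       # "more fragments" flag
IP_OFFMASK = 0x1FFF  # fragment offset, in units of 8 bytes


def classify_ipv4(header: bytes) -> str:
    """Return 'whole', 'first' or 'later' for a 20+ byte IPv4 header."""
    (flags_off,) = struct.unpack_from("!H", header, 6)
    more = bool(flags_off & IP_MF)
    offset = (flags_off & IP_OFFMASK) * 8
    if offset == 0 and not more:
        return "whole"
    # A non-zero offset means the L4 header (ports) is not in this packet.
    return "first" if offset == 0 else "later"


def make_header(flags_off: int) -> bytes:
    """Hypothetical test helper: minimal IPv4/UDP header without options."""
    return struct.pack("!BBHHHBBH4s4s", 0x45, 0, 20, 1, flags_off, 64, 17, 0,
                       bytes([10, 0, 0, 1]), bytes([10, 0, 0, 2]))
```

A policy that only sees 'later' fragments has addresses and protocol but no ports, which is the gap the two stateless options above try (and fail) to paper over.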
=== BPF related bits ===
Policy has traditionally been enforced from XDP/TC hooks. Both hooks
run before kernel reassembly facilities. However, with the new
BPF_PROG_TYPE_NETFILTER, we can rather easily hook into existing
netfilter reassembly infra.
The basic idea is we bump a refcnt on the netfilter defrag module and
then run the bpf prog after the defrag module runs. This allows bpf
progs to transparently see full, reassembled packets. The nice thing
about this is that progs don't have to carry around logic to detect
fragments.
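The benefit of running after defrag can be pictured with a toy reassembly model in Python. This is only a sketch of the idea that the prog sees one complete payload instead of pieces; the reassemble() function and its input format are assumptions for illustration, and the kernel's nf_defrag code handles timeouts, overlaps and limits that are ignored here.

```python
from typing import Dict, Optional, Tuple


def reassemble(fragments: Dict[int, Tuple[bytes, bool]]) -> Optional[bytes]:
    """fragments maps byte offset -> (payload, more_fragments).

    Return the full payload, or None while pieces are still missing.
    """
    data = b""
    saw_last = False
    for offset in sorted(fragments):
        payload, more = fragments[offset]
        if offset != len(data):  # hole or overlap: not complete yet
            return None
        data += payload
        saw_last = not more
    return data if saw_last else None
```

Until reassemble() returns a payload, a policy hook placed after it simply does not run, so the prog needs no fragment-detection logic of its own.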
=== Changelog ===
Changes from v3:
* Correctly initialize `addrlen` stack var for recvmsg()
Changes from v2:
* module_put() if ->enable() fails
* Fix CI build errors
Changes from v1:
* Drop bpf_program__attach_netfilter() patches
* static -> static const where appropriate
* Fix callback assignment order during registration
* Only request_module() if callbacks are missing
* Fix retval when modprobe fails in userspace
* Fix v6 defrag module name (nf_defrag_ipv6_hooks -> nf_defrag_ipv6)
* Simplify priority checking code
* Add warning if module doesn't assign callbacks in the future
* Take refcnt on module while defrag link is active
[0]: https://datatracker.ietf.org/doc/html/rfc8900
Daniel Xu (6):
netfilter: defrag: Add glue hooks for enabling/disabling defrag
netfilter: bpf: Support BPF_F_NETFILTER_IP_DEFRAG in netfilter link
netfilter: bpf: Prevent defrag module unload while link active
bpf: selftests: Support not connecting client socket
bpf: selftests: Support custom type and proto for client sockets
bpf: selftests: Add defrag selftests
include/linux/netfilter.h | 15 +
include/uapi/linux/bpf.h | 5 +
net/ipv4/netfilter/nf_defrag_ipv4.c | 17 +-
net/ipv6/netfilter/nf_defrag_ipv6_hooks.c | 11 +
net/netfilter/core.c | 6 +
net/netfilter/nf_bpf_link.c | 150 +++++++++-
tools/include/uapi/linux/bpf.h | 5 +
tools/testing/selftests/bpf/Makefile | 4 +-
.../selftests/bpf/generate_udp_fragments.py | 90 ++++++
.../selftests/bpf/ip_check_defrag_frags.h | 57 ++++
tools/testing/selftests/bpf/network_helpers.c | 26 +-
tools/testing/selftests/bpf/network_helpers.h | 3 +
.../bpf/prog_tests/ip_check_defrag.c | 283 ++++++++++++++++++
.../selftests/bpf/progs/ip_check_defrag.c | 104 +++++++
14 files changed, 754 insertions(+), 22 deletions(-)
create mode 100755 tools/testing/selftests/bpf/generate_udp_fragments.py
create mode 100644 tools/testing/selftests/bpf/ip_check_defrag_frags.h
create mode 100644 tools/testing/selftests/bpf/prog_tests/ip_check_defrag.c
create mode 100644 tools/testing/selftests/bpf/progs/ip_check_defrag.c
--
2.41.0
The #endif is on the wrong side of a }, causing a build failure when
__NR_userfaultfd is not defined. Fix this by moving the #endif to
enclose the }.
Fixes: 9eac40fc0cc7 ("selftests/mm: mkdirty: test behavior of (pte|pmd)_mkdirty on VMAs without write permissions")
Signed-off-by: Colin Ian King <colin.i.king(a)gmail.com>
---
tools/testing/selftests/mm/mkdirty.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/tools/testing/selftests/mm/mkdirty.c b/tools/testing/selftests/mm/mkdirty.c
index 6d71d972997b..301abb99e027 100644
--- a/tools/testing/selftests/mm/mkdirty.c
+++ b/tools/testing/selftests/mm/mkdirty.c
@@ -321,8 +321,8 @@ static void test_uffdio_copy(void)
munmap:
munmap(dst, pagesize);
free(src);
-#endif /* __NR_userfaultfd */
}
+#endif /* __NR_userfaultfd */
int main(void)
{
--
2.39.2
Awk is already called for /sys/block/zram#/mm_stat parsing, so use it
to also perform the floating point capacity vs consumption ratio
calculations. The test output is unchanged.
This allows bc to be dropped as a dependency for the zram selftests.
The documented free dependency can also be removed following
d18da7ec37195 ("selftests/zram01.sh: Fix compression ratio calculation")
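For readers who want to check the arithmetic, here is a small Python sketch mirroring the awk calculation in the diff below. The function name check_ratio() and its return shape are hypothetical; the formula, the use of the third mm_stat field and the message formats follow the patch.

```python
from typing import Tuple


def check_ratio(filled_kb: int, mm_stat_line: str) -> Tuple[bool, str]:
    """Mirror the awk logic: v is 100x the ratio of data written
    (filled_kb KB) to memory consumed (third mm_stat field, in bytes)."""
    mem_used_total = int(mm_stat_line.split()[2])
    v = 100 * 1024 * filled_kb / mem_used_total
    if v < 100:
        # The awk script formats v with %u; int() truncates in the same way.
        return False, "FAIL compression ratio: 0.%d:1" % int(v)
    return True, "zram compression ratio: %.2f:1: OK" % (v / 100)
```

For example, 25600 KB written against 10485760 bytes consumed gives v = 250, reported as a ratio of 2.50:1.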
Signed-off-by: David Disseldorp <ddiss(a)suse.de>
---
tools/testing/selftests/zram/README | 2 --
tools/testing/selftests/zram/zram01.sh | 18 ++++++++----------
2 files changed, 8 insertions(+), 12 deletions(-)
v2: drop unused dependencies from selftests/zram/README
diff --git a/tools/testing/selftests/zram/README b/tools/testing/selftests/zram/README
index 110b34834a6fa..510ca5a1087f5 100644
--- a/tools/testing/selftests/zram/README
+++ b/tools/testing/selftests/zram/README
@@ -27,9 +27,7 @@ zram01.sh: creates general purpose ram disks with ext4 filesystems
zram02.sh: creates block device for swap
Commands required for testing:
- - bc
- dd
- - free
- awk
- mkswap
- swapon
diff --git a/tools/testing/selftests/zram/zram01.sh b/tools/testing/selftests/zram/zram01.sh
index 8f4affe34f3e4..df1b1d4158989 100755
--- a/tools/testing/selftests/zram/zram01.sh
+++ b/tools/testing/selftests/zram/zram01.sh
@@ -33,7 +33,7 @@ zram_algs="lzo"
zram_fill_fs()
{
- for i in $(seq $dev_start $dev_end); do
+ for ((i = $dev_start; i <= $dev_end && !ERR_CODE; i++)); do
echo "fill zram$i..."
local b=0
while [ true ]; do
@@ -44,15 +44,13 @@ zram_fill_fs()
done
echo "zram$i can be filled with '$b' KB"
- local mem_used_total=`awk '{print $3}' "/sys/block/zram$i/mm_stat"`
- local v=$((100 * 1024 * $b / $mem_used_total))
- if [ "$v" -lt 100 ]; then
- echo "FAIL compression ratio: 0.$v:1"
- ERR_CODE=-1
- return
- fi
-
- echo "zram compression ratio: $(echo "scale=2; $v / 100 " | bc):1: OK"
+ awk -v b="$b" '{ v = (100 * 1024 * b / $3) } END {
+ if (v < 100) {
+ printf "FAIL compression ratio: 0.%u:1\n", v
+ exit 1
+ }
+ printf "zram compression ratio: %.2f:1: OK\n", v / 100
+ }' "/sys/block/zram$i/mm_stat" || ERR_CODE=-1
done
}
--
2.35.3
Good morning,
I have reviewed your offer and am pleased to say that it attracts attention and encourages further discussion.
I thought I might be able to contribute to your growth and help this offer reach a wider audience. I do search engine optimization for websites, which helps them generate excellent web traffic.
Could we talk sometime soon?
Best regards
Adam Charachuta
*Changes in v24*:
- Rebase on top of next-20230710
- Place WP markers in case of hole as well
*Changes in v23*:
- Set vec_buf_index in loop only when vec_buf_index is set
- Return -EFAULT instead of -EINVAL if vec is NULL
- Correctly return the walk ending address to the page granularity
*Changes in v22*:
- Interface change:
- Replace [start start + len) with [start, end)
- Return the ending address of the address walk in start
*Changes in v21*:
- Abort walk instead of returning error if WP is to be performed on
partial hugetlb
*Changes in v20*
- Correct PAGE_IS_FILE and add PAGE_IS_PFNZERO
*Changes in v19*
- Minor changes and interface updates
*Changes in v18*
- Rebase on top of next-20230613
- Minor updates
*Changes in v17*
- Rebase on top of next-20230606
- Minor improvements in PAGEMAP_SCAN IOCTL patch
*Changes in v16*
- Fix a corner case
- Add exclusive PM_SCAN_OP_WP back
*Changes in v15*
- Build fix (Add missed build fix in RESEND)
*Changes in v14*
- Fix build error caused by #ifdef added at last minute in some configs
*Changes in v13*
- Rebase on top of next-20230414
- Give-up on using uffd_wp_range() and write new helpers, flush tlb only
once
*Changes in v12*
- Update and other memory types to UFFD_FEATURE_WP_ASYNC
- Rebase on top of next-20230406
- Review updates
*Changes in v11*
- Rebase on top of next-20230307
- Base patches on UFFD_FEATURE_WP_UNPOPULATED
- Do a lot of cosmetic changes and review updates
- Remove ENGAGE_WP + !GET operation as it can be performed with
UFFDIO_WRITEPROTECT
*Changes in v10*
- Add specific condition to return error if hugetlb is used with wp
async
- Move changes in tools/include/uapi/linux/fs.h to separate patch
- Add documentation
*Changes in v9:*
- Correct fault resolution for userfaultfd wp async
- Fix build warnings and errors which were happening on some configs
- Simplify pagemap ioctl's code
*Changes in v8:*
- Update uffd async wp implementation
- Improve PAGEMAP_IOCTL implementation
*Changes in v7:*
- Add uffd wp async
- Update the IOCTL to use uffd under the hood instead of soft-dirty
flags
*Motivation*
The real motivation for adding the PAGEMAP_SCAN IOCTL is to emulate the Windows
GetWriteWatch() syscall [1]. GetWriteWatch() retrieves the addresses of
the pages that are written to in a region of virtual memory.
This syscall is used in Windows applications, games, etc. It is currently
emulated in a rather slow manner in userspace. Our purpose is to
enhance the kernel so that it can be emulated efficiently.
Currently some out-of-tree hack patches are being used to efficiently
emulate it in some kernels. We intend to replace those with these patches,
so that gaming on Linux as a whole can benefit from this. It
means there would be tons of users of this code.
CRIU use case [2] was mentioned by Andrei and Danylo:
> Use cases for migrating sparse VMAs are binaries sanitized with ASAN,
> MSAN or TSAN [3]. All of these sanitizers produce sparse mappings of
> shadow memory [4]. Being able to migrate such binaries allows to highly
> reduce the amount of work needed to identify and fix post-migration
> crashes, which happen constantly.
Andrei defines the following uses of this code:
* it is more granular and allows us to track changed pages more
effectively. The current interface can clear dirty bits for the entire
process only. In addition, reading info about pages is a separate
operation. It means we must freeze the process to read information
about all its pages, reset dirty bits, only then we can start dumping
pages. The information about pages becomes more and more outdated,
while we are processing pages. The new interface solves both these
downsides. First, it allows us to read pte bits and clear the
soft-dirty bit atomically. It means that CRIU will not need to freeze
processes to pre-dump their memory. Second, it clears soft-dirty bits
for a specified region of memory. It means CRIU will have actual info
about pages to the moment of dumping them.
* The new interface has to be much faster because basic page filtering
is happening in the kernel. With the old interface, we have to read
pagemap for each page.
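For context on the "old interface" mentioned above, the following Python sketch decodes a single 64-bit /proc/pid/pagemap entry and computes where the entry for a virtual address lives in the file. The bit positions follow the kernel's pagemap documentation as far as I know it; the function names are hypothetical, and the sketch works on plain integers so it does not need access to /proc.

```python
from typing import Dict

# Flag bits of a pagemap entry (see Documentation/admin-guide/mm/pagemap.rst).
PM_SOFT_DIRTY = 1 << 55
PM_UFFD_WP = 1 << 57
PM_FILE = 1 << 61
PM_SWAP = 1 << 62
PM_PRESENT = 1 << 63


def pagemap_offset(vaddr: int, page_size: int = 4096) -> int:
    """Byte offset of the 8-byte entry describing vaddr's page."""
    return (vaddr // page_size) * 8


def decode_pagemap_entry(entry: int) -> Dict[str, bool]:
    """Split one 64-bit pagemap entry into its boolean flags."""
    return {
        "present": bool(entry & PM_PRESENT),
        "swapped": bool(entry & PM_SWAP),
        "file": bool(entry & PM_FILE),
        "uffd_wp": bool(entry & PM_UFFD_WP),
        "soft_dirty": bool(entry & PM_SOFT_DIRTY),
    }
```

With this interface userspace reads and decodes one entry per page and filters afterwards, which is the per-page cost the second bullet refers to.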
*Implementation Evolution (Short Summary)*
From the definition of GetWriteWatch(), we feel like kernel's soft-dirty
feature can be used under the hood with some additions like:
* reset soft-dirty flag for only a specific region of memory instead of
clearing the flag for the entire process
* get and clear soft-dirty flag for a specific region atomically
So we decided to use an ioctl on the pagemap file to read and/or reset the soft-dirty
flag. But using the soft-dirty flag, we sometimes get extra pages which weren't
even written. They had become soft-dirty because of VMA merging and the
VM_SOFTDIRTY flag. This breaks the definition of GetWriteWatch(). We were
able to bypass this shortcoming by ignoring VM_SOFTDIRTY until David
reported that mprotect etc. messes up the soft-dirty flag while ignoring
VM_SOFTDIRTY [5]. This wasn't happening until [6] got introduced. We
discussed whether we could revert these patches, but we could not reach any
conclusion. So at this point, I made a couple of attempts to solve this whole
VM_SOFTDIRTY issue by correcting the soft-dirty implementation:
* [7] Correct the bug fixed wrongly back in 2014. It had the potential to cause
a regression. We left it behind.
* [8] Keep a list of the soft-dirty parts of a VMA across splits and merges. The
reply I got was not to increase the size of the VMA by 8 bytes.
At this point, we gave up on soft-dirty, considering it too delicate, and
userfaultfd [9] seemed like the only way forward. From there onward, we
have been basing soft-dirty emulation on the userfaultfd wp feature, where the
kernel resolves the faults itself when the WP_ASYNC feature is used. It was
straightforward to add the WP_ASYNC feature to userfaultfd. Now only
those pages which have really been written are reported as dirty or written-to. (PS:
another userfaultfd feature, WP_UNPOPULATED, is required
to avoid pre-faulting memory before write-protecting [9].)
All the different masks were added at the request of the CRIU devs to make the
interface more generic and better.
[1] https://learn.microsoft.com/en-us/windows/win32/api/memoryapi/nf-memoryapi-…
[2] https://lore.kernel.org/all/20221014134802.1361436-1-mdanylo@google.com
[3] https://github.com/google/sanitizers
[4] https://github.com/google/sanitizers/wiki/AddressSanitizerAlgorithm#64-bit
[5] https://lore.kernel.org/all/bfcae708-db21-04b4-0bbe-712badd03071@redhat.com
[6] https://lore.kernel.org/all/20220725142048.30450-1-peterx@redhat.com/
[7] https://lore.kernel.org/all/20221122115007.2787017-1-usama.anjum@collabora.…
[8] https://lore.kernel.org/all/20221220162606.1595355-1-usama.anjum@collabora.…
[9] https://lore.kernel.org/all/20230306213925.617814-1-peterx@redhat.com
[10] https://lore.kernel.org/all/20230125144529.1630917-1-mdanylo@google.com
*Original Cover letter from v8*
Hello,
Note:
Soft-dirty pages and pages which have been written-to are synonyms. As the
kernel already has a soft-dirty feature which we have given up on
using, we use the written-to terminology when using UFFD async WP under
the hood.
This IOCTL, PAGEMAP_SCAN on pagemap file can be used to get and/or clear
the info about page table entries. The following operations are
supported in this ioctl:
- Get the information if the pages have been written-to (PAGE_IS_WRITTEN),
file mapped (PAGE_IS_FILE), present (PAGE_IS_PRESENT) or swapped
(PAGE_IS_SWAPPED).
- Write-protect the pages (PAGEMAP_WP_ENGAGE) to start finding which
pages have been written-to.
- Find pages which have been written-to and write protect the pages
(atomic PAGE_IS_WRITTEN + PAGEMAP_WP_ENGAGE)
It is possible to find and clear soft-dirty pages entirely in userspace.
But it isn't efficient:
- The mprotect and SIGSEGV handler for bookkeeping
- The userfaultfd wp (synchronous) with the handler for bookkeeping
Some benchmarks can be seen here[1]. This series adds features that weren't
present earlier:
- There is no atomic get-and-clear of the soft-dirty/written-to status present in
the kernel.
- The pages which have been written-to cannot be found in an accurate way.
(Kernel's soft-dirty PTE bit + soft-dirty VMA bit shows more soft-dirty
pages than there actually are.)
Historically, soft-dirty PTE bit tracking has been used in the CRIU
project. The procfs interface is enough for finding the soft-dirty bit
status and clearing the soft-dirty bit of all the pages of a process.
We have the use case where we need to track the soft-dirty PTE bit for
only specific pages on-demand. We need this tracking and clear mechanism
for a region of memory while the process is running to emulate the
GetWriteWatch() syscall of Windows.
*(Moved to using UFFD instead of soft-dirty feature to find pages which
have been written-to from v7 patch series)*:
Stop using the soft-dirty flags for finding which pages have been
written to. It is too delicate and wrong as it shows more soft-dirty
pages than the actual soft-dirty pages. There is no interest in
correcting it [2][3] as this is how the feature was written years ago,
and its behaviour shouldn't be changed now. Peter Xu has suggested
using the async version of the UFFD WP [4] as it is based inherently
on the PTEs.
So in this patch series, I've added a new mode to the UFFD which is
asynchronous version of the write protect. When this variant of the
UFFD WP is used, the page faults are resolved automatically by the
kernel. The pages which have been written-to can be found by reading
pagemap file (!PM_UFFD_WP). This feature can be used successfully to
find which pages have been written to from the time the pages were
write protected. This works just like the soft-dirty flag without
showing any extra pages which aren't soft-dirty in reality.
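The tracking scheme described above can be pictured with a toy Python model. This is purely an illustration of the semantics (written-to means the write-protect bit is no longer set, and get-and-reset happens in one step); the class and method names are hypothetical and nothing here reflects the kernel implementation.

```python
from typing import List


class WriteWatchModel:
    """Toy model of async write-protect tracking; not kernel code."""

    def __init__(self, num_pages: int) -> None:
        self.wp = [False] * num_pages  # per-page write-protect bit

    def write_protect(self, start: int, end: int) -> None:
        """Arm tracking for pages in [start, end)."""
        for page in range(start, end):
            self.wp[page] = True

    def write(self, page: int) -> None:
        """A write fault is resolved automatically: the bit is dropped."""
        self.wp[page] = False

    def get_and_reset(self, start: int, end: int) -> List[int]:
        """Report pages written since they were protected, then re-arm."""
        written = [p for p in range(start, end) if not self.wp[p]]
        self.write_protect(start, end)
        return written
```

In this model a page that was never write-protected also reads as written, which matches the "!PM_UFFD_WP" reading described in the text.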
The information about whether a page is file mapped, present or
swapped is required for the CRIU project [5][6]. The addition of the
required mask, any mask, excluded mask and return mask is also required
for the CRIU project [5].
The IOCTL returns the addresses of the pages which match the specified
masks. The page addresses are returned in struct page_region in a compact
form. The max_pages is needed to support a use case where the user only wants
to get a specific number of pages, so there is no need to find all the
pages of interest in the range when max_pages is specified. The IOCTL
returns when the maximum number of pages has been found. The max_pages is
optional. If max_pages is specified, it must be equal to or greater than the
vec_size. This restriction is needed to handle the worst case, when each
page_region only contains info for one page and it cannot be compacted.
This is needed to emulate the Windows GetWriteWatch() syscall.
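The "compact form" can be illustrated with a short Python sketch that merges consecutive pages with identical flags into regions and stops once max_pages pages have been collected. The tuple layout (start, end, flags), the half-open [start, end) convention and the function name are assumptions for illustration; the real struct page_region is defined by the series' UAPI header.

```python
from typing import List, Tuple


def compact_pages(pages: List[Tuple[int, int]], page_size: int = 4096,
                  max_pages: int = 0) -> List[Tuple[int, int, int]]:
    """pages is a sorted list of (address, flags).

    Return (start, end, flags) regions; max_pages == 0 means no limit.
    """
    regions: List[Tuple[int, int, int]] = []
    count = 0
    for addr, flags in pages:
        if max_pages and count >= max_pages:
            break
        if regions and regions[-1][1] == addr and regions[-1][2] == flags:
            # Adjacent page with the same flags: extend the last region.
            start, _, _ = regions[-1]
            regions[-1] = (start, addr + page_size, flags)
        else:
            regions.append((addr, addr + page_size, flags))
        count += 1
    return regions
```

When no two matching pages are adjacent, every page needs its own region, which is the worst case the vec_size restriction above is about.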
The patch series includes a detailed selftest which can be used as an
example for the uffd async wp test and PAGEMAP_IOCTL. It shows the
interface usage as well.
[1] https://lore.kernel.org/lkml/54d4c322-cd6e-eefd-b161-2af2b56aae24@collabora…
[2] https://lore.kernel.org/all/20221220162606.1595355-1-usama.anjum@collabora.…
[3] https://lore.kernel.org/all/20221122115007.2787017-1-usama.anjum@collabora.…
[4] https://lore.kernel.org/all/Y6Hc2d+7eTKs7AiH@x1n
[5] https://lore.kernel.org/all/YyiDg79flhWoMDZB@gmail.com/
[6] https://lore.kernel.org/all/20221014134802.1361436-1-mdanylo@google.com/
Regards,
Muhammad Usama Anjum
Muhammad Usama Anjum (4):
fs/proc/task_mmu: Implement IOCTL to get and optionally clear info
about PTEs
tools headers UAPI: Update linux/fs.h with the kernel sources
mm/pagemap: add documentation of PAGEMAP_SCAN IOCTL
selftests: mm: add pagemap ioctl tests
Peter Xu (1):
userfaultfd: UFFD_FEATURE_WP_ASYNC
Documentation/admin-guide/mm/pagemap.rst | 58 +
Documentation/admin-guide/mm/userfaultfd.rst | 35 +
fs/proc/task_mmu.c | 583 +++++++
fs/userfaultfd.c | 26 +-
include/linux/hugetlb.h | 1 +
include/linux/userfaultfd_k.h | 21 +-
include/uapi/linux/fs.h | 55 +
include/uapi/linux/userfaultfd.h | 9 +-
mm/hugetlb.c | 34 +-
mm/memory.c | 27 +-
tools/include/uapi/linux/fs.h | 55 +
tools/testing/selftests/mm/.gitignore | 2 +
tools/testing/selftests/mm/Makefile | 3 +-
tools/testing/selftests/mm/config | 1 +
tools/testing/selftests/mm/pagemap_ioctl.c | 1464 ++++++++++++++++++
tools/testing/selftests/mm/run_vmtests.sh | 4 +
16 files changed, 2354 insertions(+), 24 deletions(-)
create mode 100644 tools/testing/selftests/mm/pagemap_ioctl.c
mode change 100644 => 100755 tools/testing/selftests/mm/run_vmtests.sh
--
2.39.2
A few cleanups to the existing test logic.
Signed-off-by: Thomas Weißschuh <linux(a)weissschuh.net>
---
Thomas Weißschuh (4):
selftests/nolibc: make evaluation of test conditions
selftests/nolibc: simplify status printing
selftests/nolibc: simplify status argument
selftests/nolibc: avoid gaps in test numbers
tools/testing/selftests/nolibc/nolibc-test.c | 201 +++++++++++----------------
1 file changed, 85 insertions(+), 116 deletions(-)
---
base-commit: 078cda365b3f47f61047a08230925a1478e9a1c8
change-id: 20230711-nolibc-sizeof-long-gaps-0f28cba7ee4d
Best regards,
--
Thomas Weißschuh <linux(a)weissschuh.net>
We want to replace iptables TPROXY with a BPF program at TC ingress.
To make this work in all cases we need to assign a SO_REUSEPORT socket
to an skb, which is currently prohibited. This series adds support for
such sockets to bpf_sk_assign.
I did some refactoring to cut down on the amount of duplicate code. The
key to this is to use INDIRECT_CALL in the reuseport helpers. To show
that this approach is not just beneficial to TC sk_assign I removed
duplicate code for bpf_sk_lookup as well.
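As background on why a reuseport socket needs special handling, here is a toy Python model of picking one socket from an SO_REUSEPORT group by flow hash. reciprocal_scale() mirrors the kernel helper of the same name as I understand it; select_from_group() and the string socket names are hypothetical, and the real reuseport helpers also deal with attached BPF programs and connected sockets, which are omitted here.

```python
from typing import List


def reciprocal_scale(val: int, n: int) -> int:
    """Map a 32-bit value onto [0, n) without a division."""
    return ((val & 0xFFFFFFFF) * n) >> 32


def select_from_group(group: List[str], flow_hash: int) -> str:
    """Pick the group member that should receive this flow."""
    return group[reciprocal_scale(flow_hash, len(group))]


group = ["sock0", "sock1", "sock2", "sock3"]
print(select_from_group(group, 0x80000000))
```

The point of the sketch is that the socket a lookup finds and the socket that finally receives the packet can differ within a group, so assigning such a socket to an skb has to go through this selection step.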
Joint work with Daniel Borkmann.
Signed-off-by: Lorenz Bauer <lmb(a)isovalent.com>
---
Changes in v5:
- Drop reuse_sk == sk check in inet[6]_steal_sock (Kuniyuki)
- Link to v4: https://lore.kernel.org/r/20230613-so-reuseport-v4-0-4ece76708bba@isovalent…
Changes in v4:
- WARN_ON_ONCE if reuseport socket is refcounted (Kuniyuki)
- Use inet[6]_ehashfn_t to shorten function declarations (Kuniyuki)
- Shuffle documentation patch around (Kuniyuki)
- Update commit message to explain why IPv6 needs EXPORT_SYMBOL
- Link to v3: https://lore.kernel.org/r/20230613-so-reuseport-v3-0-907b4cbb7b99@isovalent…
Changes in v3:
- Fix warning re udp_ehashfn and udp6_ehashfn (Simon)
- Return higher scoring connected UDP reuseport sockets (Kuniyuki)
- Fix ipv6 module builds
- Link to v2: https://lore.kernel.org/r/20230613-so-reuseport-v2-0-b7c69a342613@isovalent…
Changes in v2:
- Correct commit abbrev length (Kuniyuki)
- Reduce duplication (Kuniyuki)
- Add checks on sk_state (Martin)
- Split exporting inet[6]_lookup_reuseport into separate patch (Eric)
---
Daniel Borkmann (1):
selftests/bpf: Test that SO_REUSEPORT can be used with sk_assign helper
Lorenz Bauer (6):
udp: re-score reuseport groups when connected sockets are present
net: export inet_lookup_reuseport and inet6_lookup_reuseport
net: remove duplicate reuseport_lookup functions
net: document inet[6]_lookup_reuseport sk_state requirements
net: remove duplicate sk_lookup helpers
bpf, net: Support SO_REUSEPORT sockets with bpf_sk_assign
include/net/inet6_hashtables.h | 81 ++++++++-
include/net/inet_hashtables.h | 74 +++++++-
include/net/sock.h | 7 +-
include/uapi/linux/bpf.h | 3 -
net/core/filter.c | 2 -
net/ipv4/inet_hashtables.c | 68 ++++---
net/ipv4/udp.c | 88 ++++-----
net/ipv6/inet6_hashtables.c | 71 +++++---
net/ipv6/udp.c | 98 ++++------
tools/include/uapi/linux/bpf.h | 3 -
tools/testing/selftests/bpf/network_helpers.c | 3 +
.../selftests/bpf/prog_tests/assign_reuse.c | 197 +++++++++++++++++++++
.../selftests/bpf/progs/test_assign_reuse.c | 142 +++++++++++++++
13 files changed, 658 insertions(+), 179 deletions(-)
---
base-commit: c20f9cef725bc6b19efe372696e8000fb5af0d46
change-id: 20230613-so-reuseport-e92c526173ee
Best regards,
--
Lorenz Bauer <lmb(a)isovalent.com>
The build failure reported in [1] occurred because commit 9fc96c7c19df
("selftests: error out if kernel header files are not yet built") added
a new "kernel_header_files" dependency to "all", and that triggered
another, pre-existing problem. Specifically, the arm64 selftests
override the emit_tests target, and that override improperly declares
itself to depend upon the "all" target.
This is a problem because the "emit_tests" target in lib.mk was not
intended to be overridden. emit_tests is a very simple, sequential build
target that was originally invoked from the "install" target, which in
turn, depends upon "all".
That approach worked for years. But with 9fc96c7c19df in place,
emit_tests failed, because it does not set up all of the elaborate
things that "install" does. And that caused the new
"kernel_header_files" target (which depends upon $(KBUILD_OUTPUT) being
correct) to fail.
Some detail: The "all" target is .PHONY. Therefore, each target that
depends on "all" will cause it to be invoked again, and because
dependencies are managed quite loosely in the selftests Makefiles, many
things will run, even when "all" is invoked several times in immediate
succession. So this is not a "real" failure, as far as build steps go:
everything gets built, but "all" reports a problem when invoked a second
time from a bad environment.
To fix this, simply remove the unnecessary "all" dependency from the
overridden emit_tests target. The dependency is still effectively
honored, because again, invocation is via "install", which also depends
upon "all".
An alternative approach would be to harden the emit_tests target so that
it can depend upon "all", but that's a lot more complicated and hard to
get right, and doesn't seem worth it, especially given that emit_tests
should probably not be overridden at all.
[1] https://lore.kernel.org/20230710-kselftest-fix-arm64-v1-1-48e872844f25@kern…
Fixes: 9fc96c7c19df ("selftests: error out if kernel header files are not yet built")
Reported-by: Mark Brown <broonie(a)kernel.org>
Signed-off-by: John Hubbard <jhubbard(a)nvidia.com>
---
tools/testing/selftests/arm64/Makefile | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/tools/testing/selftests/arm64/Makefile b/tools/testing/selftests/arm64/Makefile
index 9460cbe81bcc..ace8b67fb22d 100644
--- a/tools/testing/selftests/arm64/Makefile
+++ b/tools/testing/selftests/arm64/Makefile
@@ -42,7 +42,7 @@ run_tests: all
done
# Avoid any output on non arm64 on emit_tests
-emit_tests: all
+emit_tests:
@for DIR in $(ARM64_SUBTARGETS); do \
BUILD_TARGET=$(OUTPUT)/$$DIR; \
make OUTPUT=$$BUILD_TARGET -C $$DIR $@; \
base-commit: d5fe758c21f4770763ae4c05580be239be18947d
--
2.41.0
v4:
- [v3] https://lore.kernel.org/lkml/20230627005529.1564984-1-longman@redhat.com/
- Fix compilation problem reported by kernel test robot.
v3:
- [v2] https://lore.kernel.org/lkml/20230531163405.2200292-1-longman@redhat.com/
- Change the new control file from root-only "cpuset.cpus.reserve" to
non-root "cpuset.cpus.exclusive" which lists the set of exclusive
CPUs distributed down the hierarchy.
- Add a patch to restrict boot-time isolated CPUs to isolated
partitions only.
- Update the test_cpuset_prs.sh test script and documentation
accordingly.
This patch series introduces a new cpuset control file
"cpuset.cpus.exclusive" which must be a subset of "cpuset.cpus"
and the parent's "cpuset.cpus.exclusive". This control file lists
the exclusive CPUs to be distributed down the hierarchy. Any one
of the exclusive CPUs can only be distributed to at most one child
cpuset. Unlike "cpuset.cpus", invalid input to "cpuset.cpus.exclusive"
will be rejected with an error. This new control file has no effect on
the behavior of the cpuset until it turns into a partition root. At that
point, its effective CPUs will be set to its exclusive CPUs unless some
of them are offline.
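The constraints on "cpuset.cpus.exclusive" described above can be written down as a small Python check. This is a sketch of the stated rules only (subset of "cpuset.cpus", subset of the parent's exclusive set, no CPU shared with a sibling); the function names are hypothetical and the kernel applies further conditions not modelled here.

```python
from typing import Iterable, Set


def parse_cpulist(text: str) -> Set[int]:
    """Parse a CPU list such as '0-3,8' into a set of CPU numbers."""
    cpus: Set[int] = set()
    for part in text.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return cpus


def exclusive_is_valid(exclusive: Set[int], cpus: Set[int],
                       parent_exclusive: Set[int],
                       sibling_exclusives: Iterable[Set[int]]) -> bool:
    """Check the three rules from the cover letter for one child cpuset."""
    if not exclusive <= cpus:
        return False
    if not exclusive <= parent_exclusive:
        return False
    # Each exclusive CPU may go to at most one child.
    return all(exclusive.isdisjoint(s) for s in sibling_exclusives)
```

An input failing any of these checks corresponds to the case where the write to "cpuset.cpus.exclusive" is rejected with an error.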
This patch series also introduces a new category of cpuset partition
called remote partitions. The existing partition category where the
partition roots have to be clustered around the root cgroup in a
hierarchical way is now referred to as local partitions.
A remote partition can be formed far from the root cgroup
with no partition root parent. Local partitions can be
created without touching "cpuset.cpus.exclusive", as it can be set
automatically when a cpuset becomes a local partition root. Properly set
"cpuset.cpus.exclusive" values down the hierarchy are, however, required
to create a remote partition.
Both scheduling and isolated partitions can be formed in a remote
partition. A local partition can be created under a remote partition.
A remote partition, however, cannot be formed under a local partition
for now.
Modern container orchestration tools like Kubernetes use the cgroup
hierarchy to manage different containers, and they rely on other
middleware like systemd to help manage it. If a container needs to
use isolated CPUs, it is hard to get those with local partitions,
as that requires the administrative parent cgroup to be a partition
root too, which tools like systemd may not be ready to manage.
With this patch series, we allow the creation of remote partitions
far from the root. The container management tool can manage the
"cpuset.cpus.exclusive" file without impacting the other cpuset
files that are managed by other middleware. Of course, invalid
"cpuset.cpus.exclusive" values will be rejected and changes to
"cpuset.cpus" can affect the value of "cpuset.cpus.exclusive" due to
the requirement that it has to be a subset of the former control file.
Waiman Long (9):
cgroup/cpuset: Inherit parent's load balance state in v2
cgroup/cpuset: Extract out CS_CPU_EXCLUSIVE & CS_SCHED_LOAD_BALANCE
handling
cgroup/cpuset: Improve temporary cpumasks handling
cgroup/cpuset: Allow suppression of sched domain rebuild in
update_cpumasks_hier()
cgroup/cpuset: Add cpuset.cpus.exclusive for v2
cgroup/cpuset: Introduce remote partition
cgroup/cpuset: Check partition conflict with housekeeping setup
cgroup/cpuset: Documentation update for partition
cgroup/cpuset: Extend test_cpuset_prs.sh to test remote partition
Documentation/admin-guide/cgroup-v2.rst | 100 +-
kernel/cgroup/cpuset.c | 1347 ++++++++++++-----
.../selftests/cgroup/test_cpuset_prs.sh | 398 +++--
3 files changed, 1291 insertions(+), 554 deletions(-)
--
2.31.1
We want to replace iptables TPROXY with a BPF program at TC ingress.
To make this work in all cases we need to assign a SO_REUSEPORT socket
to an skb, which is currently prohibited. This series adds support for
such sockets to bpf_sk_assign.
I did some refactoring to cut down on the amount of duplicate code. The
key to this is to use INDIRECT_CALL in the reuseport helpers. To show
that this approach is not just beneficial to TC sk_assign I removed
duplicate code for bpf_sk_lookup as well.
Joint work with Daniel Borkmann.
Signed-off-by: Lorenz Bauer <lmb(a)isovalent.com>
---
Changes in v4:
- WARN_ON_ONCE if reuseport socket is refcounted (Kuniyuki)
- Use inet[6]_ehashfn_t to shorten function declarations (Kuniyuki)
- Shuffle documentation patch around (Kuniyuki)
- Update commit message to explain why IPv6 needs EXPORT_SYMBOL
- Link to v3: https://lore.kernel.org/r/20230613-so-reuseport-v3-0-907b4cbb7b99@isovalent…
Changes in v3:
- Fix warning re udp_ehashfn and udp6_ehashfn (Simon)
- Return higher scoring connected UDP reuseport sockets (Kuniyuki)
- Fix ipv6 module builds
- Link to v2: https://lore.kernel.org/r/20230613-so-reuseport-v2-0-b7c69a342613@isovalent…
Changes in v2:
- Correct commit abbrev length (Kuniyuki)
- Reduce duplication (Kuniyuki)
- Add checks on sk_state (Martin)
- Split exporting inet[6]_lookup_reuseport into separate patch (Eric)
---
Daniel Borkmann (1):
selftests/bpf: Test that SO_REUSEPORT can be used with sk_assign helper
Lorenz Bauer (6):
udp: re-score reuseport groups when connected sockets are present
net: export inet_lookup_reuseport and inet6_lookup_reuseport
net: remove duplicate reuseport_lookup functions
net: document inet[6]_lookup_reuseport sk_state requirements
net: remove duplicate sk_lookup helpers
bpf, net: Support SO_REUSEPORT sockets with bpf_sk_assign
include/net/inet6_hashtables.h | 81 ++++++++-
include/net/inet_hashtables.h | 74 +++++++-
include/net/sock.h | 7 +-
include/uapi/linux/bpf.h | 3 -
net/core/filter.c | 2 -
net/ipv4/inet_hashtables.c | 67 ++++---
net/ipv4/udp.c | 88 ++++-----
net/ipv6/inet6_hashtables.c | 70 +++++---
net/ipv6/udp.c | 98 ++++------
tools/include/uapi/linux/bpf.h | 3 -
tools/testing/selftests/bpf/network_helpers.c | 3 +
.../selftests/bpf/prog_tests/assign_reuse.c | 197 +++++++++++++++++++++
.../selftests/bpf/progs/test_assign_reuse.c | 142 +++++++++++++++
13 files changed, 656 insertions(+), 179 deletions(-)
---
base-commit: 970308a7b544fa1c7ee98a2721faba3765be8dd8
change-id: 20230613-so-reuseport-e92c526173ee
Best regards,
--
Lorenz Bauer <lmb(a)isovalent.com>
=== Context ===
In the context of a middlebox, fragmented packets are tricky to handle.
The full 5-tuple of a packet is often only available in the first
fragment which makes enforcing consistent policy difficult. There are
really only two stateless options, neither of which are very nice:
1. Enforce policy on first fragment and accept all subsequent fragments.
This works but may let in certain attacks or allow data exfiltration.
2. Enforce policy on first fragment and drop all subsequent fragments.
This does not really work b/c some protocols may rely on
fragmentation. For example, DNS may rely on oversized UDP packets for
large responses.
So stateful tracking is the only sane option. RFC 8900 [0] calls this
out as well in section 6.3:
Middleboxes [...] should process IP fragments in a manner that is
consistent with [RFC0791] and [RFC8200]. In many cases, middleboxes
must maintain state in order to achieve this goal.
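To make the stateful option concrete, here is a minimal sketch that remembers the verdict of the first fragment and applies it to later fragments of the same datagram. It is an illustration only: the Fragment and FragmentPolicy names, the port-based policy and the lack of timeouts or out-of-order handling are all assumptions, and it is not what this series implements.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class Fragment:
    src: str
    dst: str
    proto: int
    ip_id: int                   # IP identification field shared by all fragments
    offset: int                  # fragment offset; 0 marks the first fragment
    sport: Optional[int] = None  # L4 ports are only visible in the first fragment
    dport: Optional[int] = None


FlowKey = Tuple[str, str, int, int]


class FragmentPolicy:
    """Toy policy: allow datagrams whose destination port is allow-listed."""

    def __init__(self, allowed_dports):
        self.allowed_dports = set(allowed_dports)
        self.verdicts: Dict[FlowKey, bool] = {}

    def check(self, frag: Fragment) -> bool:
        key = (frag.src, frag.dst, frag.proto, frag.ip_id)
        if frag.offset == 0:
            # First fragment: the full 5-tuple is available, so decide here
            # and remember the decision for the rest of the datagram.
            verdict = frag.dport in self.allowed_dports
            self.verdicts[key] = verdict
            return verdict
        # Later fragments carry no ports: reuse the stored verdict, and drop
        # when the first fragment has not been seen.
        return self.verdicts.get(key, False)
```

A real middlebox would also have to expire entries and cope with fragments that arrive before the first one, which is part of why reusing an existing reassembly facility is attractive.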
=== BPF related bits ===
Policy has traditionally been enforced from XDP/TC hooks. Both hooks
run before kernel reassembly facilities. However, with the new
BPF_PROG_TYPE_NETFILTER, we can rather easily hook into existing
netfilter reassembly infra.
The basic idea is we bump a refcnt on the netfilter defrag module and
then run the bpf prog after the defrag module runs. This allows bpf
progs to transparently see full, reassembled packets. The nice thing
about this is that progs don't have to carry around logic to detect
fragments.
=== Changelog ===
Changes from v2:
* module_put() if ->enable() fails
* Fix CI build errors
Changes from v1:
* Drop bpf_program__attach_netfilter() patches
* static -> static const where appropriate
* Fix callback assignment order during registration
* Only request_module() if callbacks are missing
* Fix retval when modprobe fails in userspace
* Fix v6 defrag module name (nf_defrag_ipv6_hooks -> nf_defrag_ipv6)
* Simplify priority checking code
* Add warning if module doesn't assign callbacks in the future
* Take refcnt on module while defrag link is active
[0]: https://datatracker.ietf.org/doc/html/rfc8900
Daniel Xu (6):
netfilter: defrag: Add glue hooks for enabling/disabling defrag
netfilter: bpf: Support BPF_F_NETFILTER_IP_DEFRAG in netfilter link
netfilter: bpf: Prevent defrag module unload while link active
bpf: selftests: Support not connecting client socket
bpf: selftests: Support custom type and proto for client sockets
bpf: selftests: Add defrag selftests
include/linux/netfilter.h | 15 +
include/uapi/linux/bpf.h | 5 +
net/ipv4/netfilter/nf_defrag_ipv4.c | 17 +-
net/ipv6/netfilter/nf_defrag_ipv6_hooks.c | 11 +
net/netfilter/core.c | 6 +
net/netfilter/nf_bpf_link.c | 150 +++++++++-
tools/include/uapi/linux/bpf.h | 5 +
tools/testing/selftests/bpf/Makefile | 4 +-
.../selftests/bpf/generate_udp_fragments.py | 90 ++++++
.../selftests/bpf/ip_check_defrag_frags.h | 57 ++++
tools/testing/selftests/bpf/network_helpers.c | 26 +-
tools/testing/selftests/bpf/network_helpers.h | 3 +
.../bpf/prog_tests/ip_check_defrag.c | 282 ++++++++++++++++++
.../selftests/bpf/progs/ip_check_defrag.c | 104 +++++++
14 files changed, 753 insertions(+), 22 deletions(-)
create mode 100755 tools/testing/selftests/bpf/generate_udp_fragments.py
create mode 100644 tools/testing/selftests/bpf/ip_check_defrag_frags.h
create mode 100644 tools/testing/selftests/bpf/prog_tests/ip_check_defrag.c
create mode 100644 tools/testing/selftests/bpf/progs/ip_check_defrag.c
--
2.41.0
On Mon, 10 Jul 2023 15:07:30 -0400
Steven Rostedt <rostedt(a)goodmis.org> wrote:
> On Mon, 10 Jul 2023 15:06:06 -0400
> Steven Rostedt <rostedt(a)goodmis.org> wrote:
>
> > > Something was broken in your mail (I guess cc list) and couldn’t reach to lkml or
> > > ignored by lkml. I just wanted to track the auto test results from linux-kselftest.
> >
> > Yeah, claws-mail has an issue with some emails with quotes in it (sometimes
> > drops the second quote). Sad part is, it happens after I hit send, and it
> > is not part of the email. I'll send this reply now, but I bet it's going to happen again.
> >
> > Let's see :-/ I checked the To and Cc's and they all have the proper
> > quotes. Let's see what ends up in my "Sent" folder.
>
> This time it worked!
>
But this reply did not :-p
It was fine before I sent, but the email in my Sent folder shows:
Cc: "mhiramat(a)kernel.org" <mhiramat(a)kernel.org>, "shuah(a)kernel.org" <shuah(a)kernel.org>, "linux-kernel(a)vger.kernel.org" <linux-kernel(a)vger.kernel.org>, "linux-trace-kernel(a)vger.kernel.org\" <linux-trace-kernel(a)vger.kernel.org>, "linux-kselftest(a)vger.kernel.org" <linux-kselftest(a)vger.kernel.org>, Ching-lin Yu <chinglinyu(a)google.com>, Nadav Amit <namit(a)vmware.com>, "srivatsa(a)csail.mit.edu" <srivatsa(a)csail.mit.edu>, Alexey Makhalov <amakhalov(a)vmware.com>, Vasavi Sirnapalli <vsirnapalli(a)vmware.com>, Tapas Kundu <tkundu(a)vmware.com>, "er.ajay.kaher(a)gmail.com" <er.ajay.kaher(a)gmail.com>
Claws injected a backslash into: "linux-trace-kernel(a)vger.kernel.org\" <linux-trace-kernel(a)vger.kernel.org>
I have my own build of claws-mail, let me update it and perhaps this will
go away.
-- Steve
This is the basic functionality for iommufd to support
iommufd_device_replace() and IOMMU_HWPT_ALLOC for physical devices.
iommufd_device_replace() allows changing the HWPT associated with the
device to a new IOAS or HWPT. Replace does this in a way that failure leaves
things unchanged, and utilizes the iommu iommu_group_replace_domain() API
to allow the iommu driver to perform an optional non-disruptive change.
IOMMU_HWPT_ALLOC allows HWPTs to be explicitly allocated by the user and
used by attach or replace. At this point it isn't very useful since the
HWPT is the same as the automatically managed HWPT from the IOAS. However
a following series will allow userspace to customize the created HWPT.
The implementation is complicated because we have to introduce some
per-iommu_group memory in iommufd and redo how we think about multi-device
groups to be more explicit. This solves all the locking problems in the
prior attempts.
This series is infrastructure work for the following series which:
- Add replace for attach
- Expose replace through VFIO APIs
- Implement driver parameters for HWPT creation (nesting)
Once review of this is complete I will keep it on a side branch and
accumulate the following series when they are ready so we can have a
stable base and make more incremental progress. When we have all the parts
together to get a full implementation it can go to Linus.
This is on github: https://github.com/jgunthorpe/linux/commits/iommufd_hwpt
v7:
- Rebase to v6.4-rc2, update to new signature of iommufd_get_ioas()
v6: https://lore.kernel.org/r/0-v6-fdb604df649a+369-iommufd_alloc_jgg@nvidia.com
- Go back to the v4 locking arrangement with now both the attach/detach
igroup->locks inside the functions, Kevin says he needs this for a
followup series. This still fixes the syzkaller bug
- Fix two more error unwind locking bugs where
iommufd_object_abort_and_destroy(hwpt) would deadlock or be mislocked.
Make sure fail_nth will catch these mistakes
- Add a patch allowing objects to have different abort than destroy
function, it allows hwpt abort to require the caller to continue
to hold the lock and enforces this with lockdep.
v5: https://lore.kernel.org/r/0-v5-6716da355392+c5-iommufd_alloc_jgg@nvidia.com
- Go back to the v3 version of the code, keep the comment changes from
v4. Syzkaller says the group lock change in v4 didn't work.
- Adjust the fail_nth test to cover the path syzkaller found. We need to
have an ioas with a mapped page installed to inject a failure during
domain attachment.
v4: https://lore.kernel.org/r/0-v4-9cd79ad52ee8+13f5-iommufd_alloc_jgg@nvidia.c…
- Refine comments and commit messages
- Move the group lock into iommufd_hw_pagetable_attach()
- Fix error unwind in iommufd_device_do_replace()
v3: https://lore.kernel.org/r/0-v3-61d41fd9e13e+1f5-iommufd_alloc_jgg@nvidia.com
- Refine comments and commit messages
- Adjust the flow in iommufd_device_auto_get_domain() so pt_id is only
set on success
- Reject replace on non-attached devices
- Add missing __reserved check for IOMMU_HWPT_ALLOC
v2: https://lore.kernel.org/r/0-v2-51b9896e7862+8a8c-iommufd_alloc_jgg@nvidia.c…
- Use WARN_ON for the igroup->group test and move that logic to a
function iommufd_group_try_get()
- Change igroup->devices to igroup->device list
Replace will need to iterate over all attached idevs
- Rename to iommufd_group_setup_msi()
- New patch to export iommu_get_resv_regions()
- New patch to use per-device reserved regions instead of per-group
regions
- Split out the reorganizing of iommufd_device_change_pt() from the
replace patch
- Replace uses the per-dev reserved regions
- Use stdev_id in a few more places in the selftest
- Fix error handling in IOMMU_HWPT_ALLOC
- Clarify comments
- Rebase on v6.3-rc1
v1: https://lore.kernel.org/all/0-v1-7612f88c19f5+2f21-iommufd_alloc_jgg@nvidia…
Jason Gunthorpe (17):
iommufd: Move isolated msi enforcement to iommufd_device_bind()
iommufd: Add iommufd_group
iommufd: Replace the hwpt->devices list with iommufd_group
iommu: Export iommu_get_resv_regions()
iommufd: Keep track of each device's reserved regions instead of
groups
iommufd: Use the iommufd_group to avoid duplicate MSI setup
iommufd: Make sw_msi_start a group global
iommufd: Move putting a hwpt to a helper function
iommufd: Add enforced_cache_coherency to iommufd_hw_pagetable_alloc()
iommufd: Allow a hwpt to be aborted after allocation
iommufd: Fix locking around hwpt allocation
iommufd: Reorganize iommufd_device_attach into
iommufd_device_change_pt
iommufd: Add iommufd_device_replace()
iommufd: Make destroy_rwsem use a lock class per object type
iommufd: Add IOMMU_HWPT_ALLOC
iommufd/selftest: Return the real idev id from selftest mock_domain
iommufd/selftest: Add a selftest for IOMMU_HWPT_ALLOC
Nicolin Chen (2):
iommu: Introduce a new iommu_group_replace_domain() API
iommufd/selftest: Test iommufd_device_replace()
drivers/iommu/iommu-priv.h | 10 +
drivers/iommu/iommu.c | 41 +-
drivers/iommu/iommufd/device.c | 553 +++++++++++++-----
drivers/iommu/iommufd/hw_pagetable.c | 112 +++-
drivers/iommu/iommufd/io_pagetable.c | 32 +-
drivers/iommu/iommufd/iommufd_private.h | 52 +-
drivers/iommu/iommufd/iommufd_test.h | 6 +
drivers/iommu/iommufd/main.c | 24 +-
drivers/iommu/iommufd/selftest.c | 40 ++
include/linux/iommufd.h | 1 +
include/uapi/linux/iommufd.h | 26 +
tools/testing/selftests/iommu/iommufd.c | 67 ++-
.../selftests/iommu/iommufd_fail_nth.c | 67 ++-
tools/testing/selftests/iommu/iommufd_utils.h | 63 +-
14 files changed, 868 insertions(+), 226 deletions(-)
create mode 100644 drivers/iommu/iommu-priv.h
base-commit: f1fcbaa18b28dec10281551dfe6ed3a3ed80e3d6
--
2.40.1
Hi Liam,
On Thu, May 18, 2023 at 9:37 PM Liam R. Howlett <Liam.Howlett(a)oracle.com> wrote:
> Now that the functions have changed the limits, update the testing of
> the maple tree to test these new settings.
>
> Signed-off-by: Liam R. Howlett <Liam.Howlett(a)oracle.com>
Thanks for your patch, which is now commit eb2e817f38cafbf7
("maple_tree: update testing code for mas_{next,prev,walk}") in
> --- a/lib/test_maple_tree.c
> +++ b/lib/test_maple_tree.c
> @@ -2011,7 +2011,7 @@ static noinline void __init next_prev_test(struct maple_tree *mt)
>
> val = mas_next(&mas, ULONG_MAX);
> MT_BUG_ON(mt, val != NULL);
> - MT_BUG_ON(mt, mas.index != ULONG_MAX);
> + MT_BUG_ON(mt, mas.index != 0x7d6);
On m68k (ARAnyM):
TEST STARTING
BUG at next_prev_test:2014 (1)
Pass: 3749128 Run:3749129
And after that it seems to hang[*].
After adding a debug print (thus shifting all line numbers by +1):
next_prev_test:mas.index = 0x138e
BUG at next_prev_test:2015 (1)
0x138e = 5006, while the expected value is 0x7d6 = 2006.
I guess converting this test to the KUnit framework would make it a
bit easier to investigate failures...
[*] Left the debug one running, and I got a few more:
BUG at check_empty_area_window:2656 (1)
Pass: 3754275 Run:3754277
BUG at check_empty_area_window:2657 (1)
Pass: 3754275 Run:3754278
BUG at check_empty_area_window:2658 (1)
Pass: 3754275 Run:3754279
BUG at check_empty_area_window:2662 (1)
Pass: 3754275 Run:3754280
BUG at check_empty_area_window:2663 (1)
Pass: 3754275 Run:3754281
maple_tree: 3804518 of 3804524 tests passed
So the full test took more than 20 minutes...
> MT_BUG_ON(mt, mas.last != ULONG_MAX);
>
> val = mas_prev(&mas, 0);
Gr{oetje,eeting}s,
Geert
--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert(a)linux-m68k.org
In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
-- Linus Torvalds
Hi, Willy
This v4 mainly uses argv0 as you suggested. In addition, a new
run-libc-test target is added for glibc and musl, and the RB_ flags are
added to nolibc so that nolibc-test.c can be compiled without
<linux/reboot.h> for glibc, musl and nolibc (mainly for musl-gcc, without
-I /path/to/sysroot).
This patchset is based on the 20230705-nolibc-series2 branch of the nolibc
repo [2]. It must be applied after our v6 __sysret series [3] (argv0 is
exported there) and Thomas' chmod_net removal patchset [4] (the new
chmod_argv0 is added on the same line as chmod_net, so it will conflict).
This patchset assumes the chmod_net removal patchset is applied first;
if it is not, the alphabetically placed chmod_argv0 will not apply.
Since our new chmod_argv0 is added precisely to replace chmod_net,
Willy, is it OK for you to apply at least the chmod_net removal patch
[5] before this patchset?
selftests/nolibc: drop test chmod_net
This patchset is tested together with the v6 __sysret series [3]:
           arch/board | result
----------------------|------------
      arm/vexpress-a9 | 142 test(s) passed, 1 skipped, 0 failed.
             arm/virt | 142 test(s) passed, 1 skipped, 0 failed.
         aarch64/virt | 142 test(s) passed, 1 skipped, 0 failed.
          ppc/g3beige | not supported
          ppc/ppce500 | not supported
              i386/pc | 142 test(s) passed, 1 skipped, 0 failed.
            x86_64/pc | 142 test(s) passed, 1 skipped, 0 failed.
         mipsel/malta | 142 test(s) passed, 1 skipped, 0 failed.
     loongarch64/virt | 142 test(s) passed, 1 skipped, 0 failed.
         riscv64/virt | 142 test(s) passed, 1 skipped, 0 failed.
         riscv32/virt | 0 test(s) passed, 0 skipped, 0 failed.
s390x/s390-ccw-virtio | 142 test(s) passed, 1 skipped, 0 failed.
With tinyconfig + basic console options (i.e. with all of the other
options disabled, including procfs, shmem, tmpfs, net and memfd_create;
to save test time, only 4 randomly chosen archs were run):
...
LOG: testing report for loongarch64/virt:
15 chmod_self [SKIPPED]
16 chown_self [SKIPPED]
40 link_cross [SKIPPED]
0 -fstackprotector not supported [SKIPPED]
139 test(s) passed, 4 skipped, 0 failed.
See all results in /labs/linux-lab/logging/nolibc/loongarch64-virt-nolibc-test.log
LOG: testing summary:
      arch/board | result
-----------------|------------
 arm/vexpress-a9 | 139 test(s) passed, 4 skipped, 0 failed.
       x86_64/pc | 139 test(s) passed, 4 skipped, 0 failed.
    mipsel/malta | 139 test(s) passed, 4 skipped, 0 failed.
loongarch64/virt | 139 test(s) passed, 4 skipped, 0 failed.
Changes from v3 --> v4:
* selftests/nolibc: stat_fault: silence NULL argument warning with glibc
selftests/nolibc: gettid: restore for glibc and musl
selftests/nolibc: add _LARGEFILE64_SOURCE for musl
selftests/nolibc: fix up int_fast16/32_t test cases for musl
selftests/nolibc: fix up kernel parameters support
selftests/nolibc: link_cross: use /proc/self/cmdline
tools/nolibc: add rmdir() support
selftests/nolibc: add a new rmdir() test case
selftests/nolibc: fix up failures when CONFIG_PROC_FS=n
selftests/nolibc: prepare /tmp for tmpfs or ramfs
selftests/nolibc: vfprintf: remove MEMFD_CREATE dependency
No change.
* selftests/nolibc: add run-libc-test target
New run and report for glibc or musl. For musl, we can simply issue:
$ make run-libc-test CC=/path/to/musl-install/bin/musl-gcc
* tools/nolibc: types.h: add RB_ flags for reboot()
selftests/nolibc: prefer <sys/reboot.h> to <linux/reboot.h>
Required by musl to compile nolibc-test.c without -I/path/to/sysroot
* selftests/nolibc: chdir_root: restore current path after test
Restore the current path so that later uses of relative paths do not break.
* selftests/nolibc: stat_timestamps: remove procfs dependency
selftests/nolibc: chroot_exe: remove procfs dependency
selftests/nolibc: add chmod_argv0 test
use argv0 instead of '/init' as before.
Best regards,
Zhangjin
---
[1]: https://lore.kernel.org/lkml/cover.1688134399.git.falcon@tinylab.org/
[2]: https://git.kernel.org/pub/scm/linux/kernel/git/wtarreau/nolibc.git
[3]: https://lore.kernel.org/lkml/cover.1688739492.git.falcon@tinylab.org/
[4]: https://lore.kernel.org/lkml/20230624-proc-net-setattr-v1-0-73176812adee@we…
[5]: https://lore.kernel.org/lkml/20230624-proc-net-setattr-v1-1-73176812adee@we…
Zhangjin Wu (18):
selftests/nolibc: add run-libc-test target
selftests/nolibc: stat_fault: silence NULL argument warning with glibc
selftests/nolibc: gettid: restore for glibc and musl
selftests/nolibc: add _LARGEFILE64_SOURCE for musl
selftests/nolibc: fix up int_fast16/32_t test cases for musl
tools/nolibc: types.h: add RB_ flags for reboot()
selftests/nolibc: prefer <sys/reboot.h> to <linux/reboot.h>
selftests/nolibc: fix up kernel parameters support
selftests/nolibc: link_cross: use /proc/self/cmdline
tools/nolibc: add rmdir() support
selftests/nolibc: add a new rmdir() test case
selftests/nolibc: fix up failures when CONFIG_PROC_FS=n
selftests/nolibc: prepare /tmp for tmpfs or ramfs
selftests/nolibc: vfprintf: remove MEMFD_CREATE dependency
selftests/nolibc: chdir_root: restore current path after test
selftests/nolibc: stat_timestamps: remove procfs dependency
selftests/nolibc: chroot_exe: remove procfs dependency
selftests/nolibc: add chmod_argv0 test
tools/include/nolibc/sys.h | 23 ++++-
tools/include/nolibc/types.h | 12 ++-
tools/testing/selftests/nolibc/Makefile | 4 +
tools/testing/selftests/nolibc/nolibc-test.c | 88 +++++++++++++++-----
4 files changed, 104 insertions(+), 23 deletions(-)
--
2.25.1
According to commit 01d6c48a828b ("Documentation: kselftest:
"make headers" is a prerequisite"), "make headers" must be run
before running the kselftests.
Do that in "vmtest.sh" as well to fix the HID CI.
Signed-off-by: Benjamin Tissoires <bentiss(a)kernel.org>
---
Looks like the new master branch (v6.5-rc1) broke my CI.
And given that `make headers` is now a prerequisite for running the
kselftests, also include that command in vmtest.sh.
Broken CI job: https://gitlab.freedesktop.org/bentiss/hid/-/jobs/44704436
Fixed CI job: https://gitlab.freedesktop.org/bentiss/hid/-/jobs/45151040
---
tools/testing/selftests/hid/vmtest.sh | 1 +
1 file changed, 1 insertion(+)
diff --git a/tools/testing/selftests/hid/vmtest.sh b/tools/testing/selftests/hid/vmtest.sh
index 681b906b4853..4da48bf6b328 100755
--- a/tools/testing/selftests/hid/vmtest.sh
+++ b/tools/testing/selftests/hid/vmtest.sh
@@ -79,6 +79,7 @@ recompile_kernel()
cd "${kernel_checkout}"
${make_command} olddefconfig
+ ${make_command} headers
${make_command}
}
---
base-commit: 0e382fa72bbf0610be40af9af9b03b0cd149df82
change-id: 20230709-fix-selftests-c8b0bdff1d20
Best regards,
--
Benjamin Tissoires <bentiss(a)kernel.org>
Hi, Willy
As you suggested, the 'status: [success|warning|failure]' info is added
to the summary line, with additional newlines around this line to make
the status info stand out. At the same time, the total number of tests is
printed, and the passed, skipped and failed values are aligned with '%03d'.
This patchset is based on 20230705-nolibc-series2 of nolibc repo[1].
The test result looks like:
...
138 test(s): 135 passed, 002 skipped, 001 failed => status: failure
See all results in /labs/linux-lab/src/linux-stable/tools/testing/selftests/nolibc/run.out
Or:
...
137 test(s): 134 passed, 003 skipped, 000 failed => status: warning
See all results in /labs/linux-lab/src/linux-stable/tools/testing/selftests/nolibc/run.out
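The format of the summary line can be modelled in a few lines. This is only a sketch of the output shown above, not the Makefile code from the series: the summarize() name is hypothetical, and the rule mapping counts to a status (any failure gives "failure", otherwise any skip gives "warning", otherwise "success") is inferred from the two examples.

```python
def summarize(passed: int, skipped: int, failed: int) -> str:
    """Build a summary line in the style of the nolibc test report."""
    total = passed + skipped + failed
    if failed:
        status = "failure"
    elif skipped:
        status = "warning"
    else:
        status = "success"
    # The total is printed unpadded; the other counters use the '%03d' style.
    return (f"{total} test(s): {passed:03d} passed, {skipped:03d} skipped, "
            f"{failed:03d} failed => status: {status}")
```

Calling summarize(135, 2, 1) reproduces the first example line, and summarize(134, 3, 0) the second.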
Best regards,
Zhangjin
---
[1]: https://git.kernel.org/pub/scm/linux/kernel/git/wtarreau/nolibc.git
Zhangjin Wu (5):
selftests/nolibc: report: print a summarized test status
selftests/nolibc: report: print total tests
selftests/nolibc: report: align passed, skipped and failed
selftests/nolibc: report: extrude the test status line
selftests/nolibc: report: add newline before test failures
tools/testing/selftests/nolibc/Makefile | 7 ++++---
1 file changed, 4 insertions(+), 3 deletions(-)
--
2.25.1
On Mon, 10 Jul 2023 02:17:01 +0000
Nadav Amit <namit(a)vmware.com> wrote:
> > On Jul 9, 2023, at 6:54 PM, Steven Rostedt <rostedt(a)goodmis.org> wrote:
> >
> > + union {
> > + struct rcu_head rcu;
> > + struct llist_node llist; /* For freeing after RCU */
> > + };
>
> The memory savings from using a union might not be worth the potential impact
> of type confusion and bugs.
It's also documentation. The two are related, as one is the hand off to
the other. It's not a random union, and I'd like to leave it that way.
-- Steve
Since commit 53fcfafa8c5c ("tools/nolibc/unistd: add syscall()") nolibc
has support for syscall(2).
Use it to get rid of some ifdef-ery.
Signed-off-by: Thomas Weißschuh <linux(a)weissschuh.net>
---
tools/testing/selftests/nolibc/nolibc-test.c | 6 +-----
1 file changed, 1 insertion(+), 5 deletions(-)
diff --git a/tools/testing/selftests/nolibc/nolibc-test.c b/tools/testing/selftests/nolibc/nolibc-test.c
index 486334981e60..c02d89953679 100644
--- a/tools/testing/selftests/nolibc/nolibc-test.c
+++ b/tools/testing/selftests/nolibc/nolibc-test.c
@@ -1051,11 +1051,7 @@ int main(int argc, char **argv, char **envp)
* exit with status code 2N+1 when N is written to 0x501. We
* hard-code the syscall here as it's arch-dependent.
*/
-#if defined(_NOLIBC_SYS_H)
- else if (my_syscall3(__NR_ioperm, 0x501, 1, 1) == 0)
-#else
- else if (ioperm(0x501, 1, 1) == 0)
-#endif
+ else if (syscall(__NR_ioperm, 0x501, 1, 1) == 0)
__asm__ volatile ("outb %%al, %%dx" :: "d"(0x501), "a"(0));
/* if it does nothing, fall back to the regular panic */
#endif
---
base-commit: a901a3568fd26ca9c4a82d8bc5ed5b3ed844d451
change-id: 20230703-nolibc-ioperm-88d87ae6d5e9
Best regards,
--
Thomas Weißschuh <linux(a)weissschuh.net>
Make sv48 the default address space for mmap as some applications
currently depend on this assumption. Also enable users to select
desired address space using a non-zero hint address to mmap. Previous
kernel changes caused Java and other applications to be broken on sv57
which this patch fixes.
Documentation is also added to the RISC-V virtual memory section to explain
these changes.
-Charlie
---
v4:
- Split testcases/document patch into test cases, in-code documentation, and
formal documentation patches
- Modified the mmap_base macro to be more legible and better represent memory
layout
- Fixed documentation to better reflect the implementation
- Renamed DEFAULT_VA_BITS to MMAP_VA_BITS
- Added additional test case for rlimit changes
---
Charlie Jenkins (4):
RISC-V: mm: Restrict address space for sv39,sv48,sv57
RISC-V: mm: Add tests for RISC-V mm
RISC-V: mm: Update pgtable comment documentation
RISC-V: mm: Document mmap changes
Documentation/riscv/vm-layout.rst | 22 +++
arch/riscv/include/asm/elf.h | 2 +-
arch/riscv/include/asm/pgtable.h | 21 ++-
arch/riscv/include/asm/processor.h | 43 +++++-
tools/testing/selftests/riscv/Makefile | 2 +-
tools/testing/selftests/riscv/mm/.gitignore | 1 +
tools/testing/selftests/riscv/mm/Makefile | 21 +++
.../selftests/riscv/mm/testcases/mmap.c | 133 ++++++++++++++++++
8 files changed, 232 insertions(+), 13 deletions(-)
create mode 100644 tools/testing/selftests/riscv/mm/.gitignore
create mode 100644 tools/testing/selftests/riscv/mm/Makefile
create mode 100644 tools/testing/selftests/riscv/mm/testcases/mmap.c
--
2.41.0
This series adds a new userfaultfd feature, UFFDIO_POISON. See commit 4
for a detailed description of the feature.
The series is based on Linus master (partial 6.5 merge window), and
structured like this:
- Patches 1-3 are preparation / refactoring
- Patches 4-6 implement and advertise the new feature
- Patches 7-8 implement a unit test for the new feature
Changelog:
v2 -> v3:
- Rebase onto current Linus master.
- Don't overwrite existing PTE markers for non-hugetlb UFFDIO_POISON.
Before, non-hugetlb would override them, but hugetlb would not. I don't
think there's a use case where we *want* to override a UFFD_WP marker
for example, so take the more conservative behavior for all kinds of
memory.
- [Peter] Drop hugetlb mfill atomic refactoring, since it isn't needed
for this series (we don't touch that code directly anyway).
- [Peter] Switch to re-using PTE_MARKER_SWAPIN_ERROR instead of defining
new PTE_MARKER_UFFD_POISON.
- [Peter] Extract start / len range overflow check into existing
validate_range helper; this fixes the style issue of unnecessary braces
in the UFFDIO_POISON implementation, because this code is just deleted.
- [Peter] Extract file size check out into a new helper.
- [Peter] Defer actually "enabling" the new feature until the last commit
in the series; combine this with adding the documentation. As a
consequence, move the selftest commits after this one.
- [Randy] Fix typo in documentation.
v1 -> v2:
- [Peter] Return VM_FAULT_HWPOISON not VM_FAULT_SIGBUS, to yield the
correct behavior for KVM (guest MCE).
- [Peter] Rename UFFDIO_SIGBUS to UFFDIO_POISON.
- [Peter] Implement hugetlbfs support for UFFDIO_POISON.
Axel Rasmussen (8):
mm: make PTE_MARKER_SWAPIN_ERROR more general
mm: userfaultfd: check for start + len overflow in validate_range
mm: userfaultfd: extract file size check out into a helper
mm: userfaultfd: add new UFFDIO_POISON ioctl
mm: userfaultfd: support UFFDIO_POISON for hugetlbfs
mm: userfaultfd: document and enable new UFFDIO_POISON feature
selftests/mm: refactor uffd_poll_thread to allow custom fault handlers
selftests/mm: add uffd unit test for UFFDIO_POISON
Documentation/admin-guide/mm/userfaultfd.rst | 15 +++
fs/userfaultfd.c | 73 ++++++++++--
include/linux/mm_inline.h | 19 +++
include/linux/swapops.h | 10 +-
include/linux/userfaultfd_k.h | 4 +
include/uapi/linux/userfaultfd.h | 25 +++-
mm/hugetlb.c | 51 ++++++--
mm/madvise.c | 2 +-
mm/memory.c | 15 ++-
mm/mprotect.c | 4 +-
mm/shmem.c | 4 +-
mm/swapfile.c | 2 +-
mm/userfaultfd.c | 83 ++++++++++---
tools/testing/selftests/mm/uffd-common.c | 5 +-
tools/testing/selftests/mm/uffd-common.h | 3 +
tools/testing/selftests/mm/uffd-stress.c | 12 +-
tools/testing/selftests/mm/uffd-unit-tests.c | 117 +++++++++++++++++++
17 files changed, 377 insertions(+), 67 deletions(-)
--
2.41.0.255.g8b1d071c50-goog
When a statement ends at a line break, terminate it with ';' rather than
',', which is more in line with the coding habits of most engineers.
Signed-off-by: Lu Hongfei <luhongfei(a)vivo.com>
---
Compared to the previous version, the modifications made are:
1. Modified the subject to make it clearer and more accurate
2. Also fixed the same typo in tcp_hdr_options.c
tools/testing/selftests/bpf/benchs/bench_ringbufs.c | 2 +-
tools/testing/selftests/bpf/prog_tests/tcp_hdr_options.c | 2 +-
2 files changed, 2 insertions(+), 2 deletions(-)
diff --git a/tools/testing/selftests/bpf/benchs/bench_ringbufs.c b/tools/testing/selftests/bpf/benchs/bench_ringbufs.c
index 3ca14ad36607..e1ee979e6acc 100644
--- a/tools/testing/selftests/bpf/benchs/bench_ringbufs.c
+++ b/tools/testing/selftests/bpf/benchs/bench_ringbufs.c
@@ -399,7 +399,7 @@ static void perfbuf_libbpf_setup(void)
ctx->skel = perfbuf_setup_skeleton();
memset(&attr, 0, sizeof(attr));
- attr.config = PERF_COUNT_SW_BPF_OUTPUT,
+ attr.config = PERF_COUNT_SW_BPF_OUTPUT;
attr.type = PERF_TYPE_SOFTWARE;
attr.sample_type = PERF_SAMPLE_RAW;
/* notify only every Nth sample */
diff --git a/tools/testing/selftests/bpf/prog_tests/tcp_hdr_options.c b/tools/testing/selftests/bpf/prog_tests/tcp_hdr_options.c
index 13bcaeb028b8..56685fc03c7e 100644
--- a/tools/testing/selftests/bpf/prog_tests/tcp_hdr_options.c
+++ b/tools/testing/selftests/bpf/prog_tests/tcp_hdr_options.c
@@ -347,7 +347,7 @@ static void syncookie_estab(void)
exp_active_estab_in.max_delack_ms = 22;
exp_passive_hdr_stg.syncookie = true;
- exp_active_hdr_stg.resend_syn = true,
+ exp_active_hdr_stg.resend_syn = true;
prepare_out();
--
2.39.0
When a statement ends at a line break, terminate it with ';' rather than
',', which is more in line with the coding habits of most engineers.
Signed-off-by: Lu Hongfei <luhongfei(a)vivo.com>
---
tools/testing/selftests/bpf/benchs/bench_ringbufs.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/tools/testing/selftests/bpf/benchs/bench_ringbufs.c b/tools/testing/selftests/bpf/benchs/bench_ringbufs.c
index 3ca14ad36607..e1ee979e6acc 100644
--- a/tools/testing/selftests/bpf/benchs/bench_ringbufs.c
+++ b/tools/testing/selftests/bpf/benchs/bench_ringbufs.c
@@ -399,7 +399,7 @@ static void perfbuf_libbpf_setup(void)
ctx->skel = perfbuf_setup_skeleton();
memset(&attr, 0, sizeof(attr));
- attr.config = PERF_COUNT_SW_BPF_OUTPUT,
+ attr.config = PERF_COUNT_SW_BPF_OUTPUT;
attr.type = PERF_TYPE_SOFTWARE;
attr.sample_type = PERF_SAMPLE_RAW;
/* notify only every Nth sample */
--
2.39.0
From: Roberto Sassu <roberto.sassu(a)huawei.com>
Define a new TLV-based format for keys and signatures, aiming to store and
use in the kernel the crypto material from other unsupported formats
(e.g. PGP).
TLV fields have been defined to fill the corresponding kernel structures
public_key, public_key_signature and key_preparsed_payload.
Keys:
                     struct public_key {            struct key_preparsed_payload {
KEY_PUB       -->      void *key;
                       u32 keylen;            -->     prep->payload.data[asym_crypto]
KEY_ALGO      -->      const char *pkey_algo;
KEY_KID0
KEY_KID1                                      -->     prep->payload.data[asym_key_ids]
KEY_KID2
KEY_DESC                                      -->     prep->description

Signatures:
                     struct public_key_signature {
SIG_S         -->      u8 *s;
                       u32 s_size;
SIG_KEY_ALGO  -->      const char *pkey_algo;
SIG_HASH_ALGO -->      const char *hash_algo;
                       u32 digest_size;
SIG_ENC       -->      const char *encoding;
SIG_KID0
SIG_KID1      -->      struct asymmetric_key_id *auth_ids[3];
SIG_KID2
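For readers unfamiliar with TLV encodings, the sketch below shows the general type-length-value idea with the key fields named above. It does not describe the wire format of this patch set: the numeric tag values and the header layout (2-byte type, 4-byte length, big-endian) are assumptions made only for illustration, as are the tlv_encode() and tlv_decode() names.

```python
import struct
from typing import Iterable, Iterator, Tuple

# Hypothetical numeric tags; the field names come from the cover letter,
# the values do not.
KEY_PUB, KEY_ALGO, KEY_KID0, KEY_KID1, KEY_KID2, KEY_DESC = range(6)

# Assumed header layout: 2-byte type followed by 4-byte length, big-endian.
_HDR = struct.Struct(">HI")


def tlv_encode(fields: Iterable[Tuple[int, bytes]]) -> bytes:
    """Serialize (type, value) pairs as consecutive TLV records."""
    out = bytearray()
    for ftype, value in fields:
        out += _HDR.pack(ftype, len(value))
        out += value
    return bytes(out)


def tlv_decode(blob: bytes) -> Iterator[Tuple[int, bytes]]:
    """Yield (type, value) pairs, rejecting truncated input."""
    pos = 0
    while pos < len(blob):
        if len(blob) - pos < _HDR.size:
            raise ValueError("truncated TLV header")
        ftype, length = _HDR.unpack_from(blob, pos)
        pos += _HDR.size
        if length > len(blob) - pos:
            raise ValueError("TLV value runs past end of buffer")
        yield ftype, blob[pos:pos + length]
        pos += length
```

A parser of this shape only needs bounds checks on each record, which is one reason a TLV layout is easier to parse defensively than a format with nested, self-describing structures.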
For keys, since the format conversion has to be done in user space, user
space is assumed to be trusted in this proposal. Without this assumption,
a malicious conversion tool could make a user load into the kernel a
different key than the one expected.
That should not be a particular problem for keys that are embedded in the
kernel image and loaded at boot, since the conversion happens in a trusted
environment such as the building infrastructure of the Linux distribution
vendor.
In the other cases, such as enrolling a key through the Machine Owner Key
(MOK) mechanism, the user is responsible for ensuring that the crypto
material carried in the original format remains the same after the conversion.
For signatures, assuming the strength of the crypto algorithms, altering
the crypto material is simply a Denial-of-Service (DoS), as data can be
validated only with the right signature.
This patch set also offers the following contributions:
- An API similar to the PKCS#7 one, to verify the authenticity of system
data through user asymmetric keys and signatures
- A mechanism to store a keyring blob in the kernel image and to extract
and load the keys at system boot
- eBPF binding, so that data authenticity verification with user asymmetric
keys and signatures can be carried out also with eBPF programs
- A new command for gnupg (in user space), to convert keys and signatures
from PGP to the new kernel format
The primary use case for this patch set is to verify the authenticity of
RPM package headers with the PGP keys of the Linux distribution. Once their
authenticity is verified, file digests can be extracted from those RPM
headers and used as reference values for IMA Appraisal.
Compared to the previous patch set, the main difference is not relying on
User Mode Drivers (UMDs) for the conversion from the original format to the
kernel format, due to the concern that full isolation of the UMD process
cannot be achieved against a fully privileged system user (root).
The discussion is still ongoing here:
https://lore.kernel.org/linux-integrity/eb31920bd00e2c921b0aa6ebed8745cb013…
This however does not prevent the goal mentioned above, verifying the
authenticity of RPM headers, from being achieved. The fact that Linux
distribution vendors do the conversion in their infrastructure is a good
enough guarantee.
A very quick way to test the patch set is to execute:
# gpg --conv-kernel /etc/pki/rpm-gpg/RPM-GPG-KEY-fedora-rawhide-primary | keyctl padd asymmetric "" @u
# keyctl show @u
Keyring
762357580 --alswrv 0 65534 keyring: _uid.0
567216072 --als--v 0 0 \_ asymmetric: PGP: 18b8e74c
Patches 1-2 preliminarily export some definitions to user space so that
conversion tools can specify the right public key algorithms and signature
encodings (digest algorithms are already exported).
Patches 3-5 introduce the user asymmetric keys and signatures.
Patch 6 introduces a system API for verifying the authenticity of system
data through user asymmetric keys and signatures.
Patches 7-8 introduce a mechanism to store a keyring blob with user
asymmetric keys in the kernel image, and load them at system boot.
Patches 9-10 introduce the eBPF binding and corresponding test (which can
be enabled only after the gnupg patches are upstreamed).
Patches 1-2 [GNUPG] introduce the new gpg command --conv-kernel to convert
PGP keys and signatures to the new kernel format.
Changelog
v1:
- Remove useless check in validate_key() (suggested by Yonghong)
- Don't rely on User Mode Drivers for the conversion from the original
format to the kernel format
- Use the more extensible TLV format, instead of a fixed structure
Roberto Sassu (10):
crypto: Export public key algorithm information
crypto: Export signature encoding information
KEYS: asymmetric: Introduce a parser for user asymmetric keys and sigs
KEYS: asymmetric: Introduce the user asymmetric key parser
KEYS: asymmetric: Introduce the user asymmetric key signature parser
verification: Add verify_uasym_signature() and
verify_uasym_sig_message()
KEYS: asymmetric: Preload user asymmetric keys from a keyring blob
KEYS: Introduce load_uasym_keyring()
bpf: Introduce bpf_verify_uasym_signature() kfunc
selftests/bpf: Prepare a test for user asymmetric key signatures
MAINTAINERS | 1 +
certs/Kconfig | 11 +
certs/Makefile | 7 +
certs/system_certificates.S | 18 +
certs/system_keyring.c | 166 +++++-
crypto/Kconfig | 6 +
crypto/Makefile | 2 +
crypto/asymmetric_keys/Kconfig | 14 +
crypto/asymmetric_keys/Makefile | 10 +
crypto/asymmetric_keys/asymmetric_type.c | 3 +-
crypto/asymmetric_keys/uasym_key_parser.c | 229 ++++++++
crypto/asymmetric_keys/uasym_key_preload.c | 99 ++++
crypto/asymmetric_keys/uasym_parser.c | 201 +++++++
crypto/asymmetric_keys/uasym_parser.h | 43 ++
crypto/asymmetric_keys/uasym_sig_parser.c | 491 ++++++++++++++++++
crypto/pub_key_info.c | 20 +
crypto/sig_enc_info.c | 16 +
include/crypto/pub_key_info.h | 15 +
include/crypto/sig_enc_info.h | 15 +
include/crypto/uasym_keys_sigs.h | 82 +++
include/keys/asymmetric-type.h | 1 +
include/linux/verification.h | 50 ++
include/uapi/linux/pub_key_info.h | 22 +
include/uapi/linux/sig_enc_info.h | 18 +
include/uapi/linux/uasym_parser.h | 107 ++++
kernel/trace/bpf_trace.c | 68 ++-
...y_pkcs7_sig.c => verify_pkcs7_uasym_sig.c} | 159 +++++-
...s7_sig.c => test_verify_pkcs7_uasym_sig.c} | 18 +-
.../testing/selftests/bpf/verify_sig_setup.sh | 82 ++-
29 files changed, 1924 insertions(+), 50 deletions(-)
create mode 100644 crypto/asymmetric_keys/uasym_key_parser.c
create mode 100644 crypto/asymmetric_keys/uasym_key_preload.c
create mode 100644 crypto/asymmetric_keys/uasym_parser.c
create mode 100644 crypto/asymmetric_keys/uasym_parser.h
create mode 100644 crypto/asymmetric_keys/uasym_sig_parser.c
create mode 100644 crypto/pub_key_info.c
create mode 100644 crypto/sig_enc_info.c
create mode 100644 include/crypto/pub_key_info.h
create mode 100644 include/crypto/sig_enc_info.h
create mode 100644 include/crypto/uasym_keys_sigs.h
create mode 100644 include/uapi/linux/pub_key_info.h
create mode 100644 include/uapi/linux/sig_enc_info.h
create mode 100644 include/uapi/linux/uasym_parser.h
rename tools/testing/selftests/bpf/prog_tests/{verify_pkcs7_sig.c => verify_pkcs7_uasym_sig.c} (69%)
rename tools/testing/selftests/bpf/progs/{test_verify_pkcs7_sig.c => test_verify_pkcs7_uasym_sig.c} (82%)
--
2.34.1
Make sv48 the default address space for mmap as some applications
currently depend on this assumption. Also enable users to select
desired address space using a non-zero hint address to mmap. Previous
kernel changes caused Java and other applications to be broken on sv57
which this patch fixes.
Documentation is also added to the RISC-V virtual memory section to explain
these changes.
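From user space, the hint mechanism described above might be exercised as in the sketch below. The helper name is made up for the example, and which hint values select which address space is defined by the patches and the updated vm-layout.rst, not by this code. Without MAP_FIXED the hint is only advisory, so the call also works on kernels and architectures that lack this behaviour.

```c
#define _DEFAULT_SOURCE
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

/*
 * Map 'len' bytes of anonymous memory, passing 'hint' as the preferred
 * address (0 means "no preference"). Without MAP_FIXED the kernel may
 * place the mapping elsewhere. Returns MAP_FAILED on error.
 */
static void *map_with_hint(uintptr_t hint, size_t len)
{
	return mmap((void *)hint, len, PROT_READ | PROT_WRITE,
		    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
}
```

An application that relies on addresses fitting in 48 bits would keep passing a zero hint, while one that wants the larger sv57 range would pass a correspondingly high, non-zero hint.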
Charlie Jenkins (2):
RISC-V: mm: Restrict address space for sv39,sv48,sv57
RISC-V: mm: Update documentation and include test
Documentation/riscv/vm-layout.rst | 22 +++++++++
arch/riscv/include/asm/elf.h | 2 +-
arch/riscv/include/asm/pgtable.h | 21 ++++++--
arch/riscv/include/asm/processor.h | 34 ++++++++++---
tools/testing/selftests/riscv/Makefile | 2 +-
tools/testing/selftests/riscv/mm/.gitignore | 1 +
tools/testing/selftests/riscv/mm/Makefile | 21 ++++++++
.../selftests/riscv/mm/testcases/mmap.c | 49 +++++++++++++++++++
8 files changed, 139 insertions(+), 13 deletions(-)
create mode 100644 tools/testing/selftests/riscv/mm/.gitignore
create mode 100644 tools/testing/selftests/riscv/mm/Makefile
create mode 100644 tools/testing/selftests/riscv/mm/testcases/mmap.c
--
2.41.0
=== Context ===
In the context of a middlebox, fragmented packets are tricky to handle.
The full 5-tuple of a packet is often only available in the first
fragment which makes enforcing consistent policy difficult. There are
really only two stateless options, neither of which is very nice:
1. Enforce policy on first fragment and accept all subsequent fragments.
This works but may let in certain attacks or allow data exfiltration.
2. Enforce policy on first fragment and drop all subsequent fragments.
This does not really work because some protocols may rely on
fragmentation. For example, DNS may rely on oversized UDP packets for
large responses.
So stateful tracking is the only sane option. RFC 8900 [0] calls this
out as well in section 6.3:
Middleboxes [...] should process IP fragments in a manner that is
consistent with [RFC0791] and [RFC8200]. In many cases, middleboxes
must maintain state in order to achieve this goal.
=== BPF related bits ===
Policy has traditionally been enforced from XDP/TC hooks. Both hooks
run before kernel reassembly facilities. However, with the new
BPF_PROG_TYPE_NETFILTER, we can rather easily hook into existing
netfilter reassembly infra.
The basic idea is we bump a refcnt on the netfilter defrag module and
then run the bpf prog after the defrag module runs. This allows bpf
progs to transparently see full, reassembled packets. The nice thing
about this is that progs don't have to carry around logic to detect
fragments.
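For comparison, this is roughly the kind of fragment-detection logic a program at an XDP/TC hook would otherwise have to carry for IPv4. It is a user-space sketch with made-up helper names, not code from this series. It tests the More Fragments flag and the 13-bit fragment offset in the IPv4 header's flags/offset field (RFC 791).

```c
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_IP_MF     0x2000u /* More Fragments flag */
#define SKETCH_IP_OFFSET 0x1fffu /* fragment offset, in 8-byte units */

/*
 * 'frag_off' is the IPv4 flags/fragment-offset field, already converted
 * to host byte order. A packet is a fragment if more fragments follow
 * or if it does not start at offset 0.
 */
static bool ipv4_is_fragment(uint16_t frag_off)
{
	return (frag_off & (SKETCH_IP_MF | SKETCH_IP_OFFSET)) != 0;
}

/* Only the first fragment (offset 0) carries the L4 header with the ports. */
static bool ipv4_is_first_fragment(uint16_t frag_off)
{
	return (frag_off & SKETCH_IP_MF) != 0 &&
	       (frag_off & SKETCH_IP_OFFSET) == 0;
}
```

With defrag running ahead of the BPF program, neither check is needed: every packet the program sees is whole.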
=== Changelog ===
Changes from v1:
* Drop bpf_program__attach_netfilter() patches
* static -> static const where appropriate
* Fix callback assignment order during registration
* Only request_module() if callbacks are missing
* Fix retval when modprobe fails in userspace
* Fix v6 defrag module name (nf_defrag_ipv6_hooks -> nf_defrag_ipv6)
* Simplify priority checking code
* Add warning if module doesn't assign callbacks in the future
* Take refcnt on module while defrag link is active
[0]: https://datatracker.ietf.org/doc/html/rfc8900
Daniel Xu (6):
netfilter: defrag: Add glue hooks for enabling/disabling defrag
netfilter: bpf: Support BPF_F_NETFILTER_IP_DEFRAG in netfilter link
netfilter: bpf: Prevent defrag module unload while link active
bpf: selftests: Support not connecting client socket
bpf: selftests: Support custom type and proto for client sockets
bpf: selftests: Add defrag selftests
include/linux/netfilter.h | 15 +
include/uapi/linux/bpf.h | 5 +
net/ipv4/netfilter/nf_defrag_ipv4.c | 17 +-
net/ipv6/netfilter/nf_defrag_ipv6_hooks.c | 11 +
net/netfilter/core.c | 6 +
net/netfilter/nf_bpf_link.c | 149 ++++++++-
tools/include/uapi/linux/bpf.h | 5 +
tools/testing/selftests/bpf/Makefile | 4 +-
.../selftests/bpf/generate_udp_fragments.py | 90 ++++++
.../selftests/bpf/ip_check_defrag_frags.h | 57 ++++
tools/testing/selftests/bpf/network_helpers.c | 26 +-
tools/testing/selftests/bpf/network_helpers.h | 3 +
.../bpf/prog_tests/ip_check_defrag.c | 282 ++++++++++++++++++
.../selftests/bpf/progs/ip_check_defrag.c | 104 +++++++
14 files changed, 752 insertions(+), 22 deletions(-)
create mode 100755 tools/testing/selftests/bpf/generate_udp_fragments.py
create mode 100644 tools/testing/selftests/bpf/ip_check_defrag_frags.h
create mode 100644 tools/testing/selftests/bpf/prog_tests/ip_check_defrag.c
create mode 100644 tools/testing/selftests/bpf/progs/ip_check_defrag.c
--
2.41.0
From: Björn Töpel <bjorn(a)rivosinc.com>
BPF tests that load /proc/kallsyms, e.g. bpf_cookie, will perform a
buffer overrun if the number of syms on the system is larger than
MAX_SYMS.
Bump the MAX_SYMS to 400000, and add a runtime check that bails out if
the maximum is reached.
Signed-off-by: Björn Töpel <bjorn(a)rivosinc.com>
---
tools/testing/selftests/bpf/trace_helpers.c | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)
diff --git a/tools/testing/selftests/bpf/trace_helpers.c b/tools/testing/selftests/bpf/trace_helpers.c
index 9b070cdf44ac..f83d9f65c65b 100644
--- a/tools/testing/selftests/bpf/trace_helpers.c
+++ b/tools/testing/selftests/bpf/trace_helpers.c
@@ -18,7 +18,7 @@
#define TRACEFS_PIPE "/sys/kernel/tracing/trace_pipe"
#define DEBUGFS_PIPE "/sys/kernel/debug/tracing/trace_pipe"
-#define MAX_SYMS 300000
+#define MAX_SYMS 400000
static struct ksym syms[MAX_SYMS];
static int sym_cnt;
@@ -46,6 +46,9 @@ int load_kallsyms_refresh(void)
break;
if (!addr)
continue;
+ if (i >= MAX_SYMS)
+ return -EFBIG;
+
syms[i].addr = (long) addr;
syms[i].name = strdup(func);
i++;
base-commit: fd283ab196a867f8f65f36913e0fadd031fcb823
--
2.39.2
*Changes in v23*:
- Set vec_buf_index in loop only when vec_buf_index is set
- Return -EFAULT instead of -EINVAL if vec is NULL
- Correctly return the walk ending address to the page granularity
*Changes in v22*:
- Interface change:
- Replace [start, start + len) with [start, end)
- Return the ending address of the address walk in start
*Changes in v21*:
- Abort walk instead of returning error if WP is to be performed on
partial hugetlb
*Changes in v20*
- Correct PAGE_IS_FILE and add PAGE_IS_PFNZERO
*Changes in v19*
- Minor changes and interface updates
*Changes in v18*
- Rebase on top of next-20230613
- Minor updates
*Changes in v17*
- Rebase on top of next-20230606
- Minor improvements in PAGEMAP_SCAN IOCTL patch
*Changes in v16*
- Fix a corner case
- Add exclusive PM_SCAN_OP_WP back
*Changes in v15*
- Build fix (Add missed build fix in RESEND)
*Changes in v14*
- Fix build error caused by #ifdef added at last minute in some configs
*Changes in v13*
- Rebase on top of next-20230414
- Give-up on using uffd_wp_range() and write new helpers, flush tlb only
once
*Changes in v12*
- Update and other memory types to UFFD_FEATURE_WP_ASYNC
- Rebase on top of next-20230406
- Review updates
*Changes in v11*
- Rebase on top of next-20230307
- Base patches on UFFD_FEATURE_WP_UNPOPULATED
- Do a lot of cosmetic changes and review updates
- Remove ENGAGE_WP + !GET operation as it can be performed with
UFFDIO_WRITEPROTECT
*Changes in v10*
- Add specific condition to return error if hugetlb is used with wp
async
- Move changes in tools/include/uapi/linux/fs.h to separate patch
- Add documentation
*Changes in v9:*
- Correct fault resolution for userfaultfd wp async
- Fix build warnings and errors which were happening on some configs
- Simplify pagemap ioctl's code
*Changes in v8:*
- Update uffd async wp implementation
- Improve PAGEMAP_IOCTL implementation
*Changes in v7:*
- Add uffd wp async
- Update the IOCTL to use uffd under the hood instead of soft-dirty
flags
*Motivation*
The real motivation for adding PAGEMAP_SCAN IOCTL is to emulate Windows
GetWriteWatch() syscall [1]. GetWriteWatch() retrieves the addresses of
the pages that are written to in a region of virtual memory.
This syscall is used in Windows applications and games. It is currently
emulated in a pretty slow manner in userspace. Our purpose is to enhance
the kernel so that it can be emulated efficiently. Currently some
out-of-tree hack patches are being used to emulate it efficiently in some
kernels. We intend to replace those with these patches, so that gaming on
Linux as a whole can benefit. It means there would be tons of users of
this code.
CRIU use case [2] was mentioned by Andrei and Danylo:
> Use cases for migrating sparse VMAs are binaries sanitized with ASAN,
> MSAN or TSAN [3]. All of these sanitizers produce sparse mappings of
> shadow memory [4]. Being able to migrate such binaries allows to highly
> reduce the amount of work needed to identify and fix post-migration
> crashes, which happen constantly.
Andrei defines the following uses of this code:
* it is more granular and allows us to track changed pages more
effectively. The current interface can clear dirty bits for the entire
process only. In addition, reading info about pages is a separate
operation. It means we must freeze the process to read information
about all its pages, reset dirty bits, only then we can start dumping
pages. The information about pages becomes more and more outdated,
while we are processing pages. The new interface solves both these
downsides. First, it allows us to read pte bits and clear the
soft-dirty bit atomically. It means that CRIU will not need to freeze
processes to pre-dump their memory. Second, it clears soft-dirty bits
for a specified region of memory. It means CRIU will have actual info
about pages to the moment of dumping them.
* The new interface has to be much faster because basic page filtering
is happening in the kernel. With the old interface, we have to read
pagemap for each page.
*Implementation Evolution (Short Summary)*
From the definition of GetWriteWatch(), we feel like kernel's soft-dirty
feature can be used under the hood with some additions like:
* reset soft-dirty flag for only a specific region of memory instead of
clearing the flag for the entire process
* get and clear soft-dirty flag for a specific region atomically
So we decided to use ioctl on pagemap file to read and/or reset soft-dirty
flag. But using soft-dirty flag, sometimes we get extra pages which weren't
even written. They had become soft-dirty because of VMA merging and
VM_SOFTDIRTY flag. This breaks the definition of GetWriteWatch(). We were
able to bypass this shortcoming by ignoring VM_SOFTDIRTY until David
reported that mprotect etc. messes up the soft-dirty flag while ignoring
VM_SOFTDIRTY [5]. This wasn't happening until [6] got introduced. We
discussed whether we could revert these patches, but could not reach any
conclusion. So at this point, I made a couple of attempts to solve this
whole VM_SOFTDIRTY issue by correcting the soft-dirty implementation:
* [7] Correct the bug fixed wrongly back in 2014. It had the potential to
cause regressions. We left it behind.
* [8] Keep a list of soft-dirty parts of a VMA across splits and merges.
The reply was not to increase the size of the VMA by 8 bytes.
At this point, we left soft-dirty, considering it too delicate, and
userfaultfd [9] seemed like the only way forward. From there onward, we
have been basing soft-dirty emulation on the userfaultfd wp feature, where
the kernel resolves the faults itself when the WP_ASYNC feature is used. It
was straightforward to add the WP_ASYNC feature to userfaultfd. Now only
those pages which have really been written are reported as dirty or
written-to. (PS: another userfaultfd feature, WP_UNPOPULATED, is also
required to avoid pre-faulting memory before write-protecting [9].)
All the different masks were added at the request of CRIU devs to make the
interface more generic.
[1] https://learn.microsoft.com/en-us/windows/win32/api/memoryapi/nf-memoryapi-…
[2] https://lore.kernel.org/all/20221014134802.1361436-1-mdanylo@google.com
[3] https://github.com/google/sanitizers
[4] https://github.com/google/sanitizers/wiki/AddressSanitizerAlgorithm#64-bit
[5] https://lore.kernel.org/all/bfcae708-db21-04b4-0bbe-712badd03071@redhat.com
[6] https://lore.kernel.org/all/20220725142048.30450-1-peterx@redhat.com/
[7] https://lore.kernel.org/all/20221122115007.2787017-1-usama.anjum@collabora.…
[8] https://lore.kernel.org/all/20221220162606.1595355-1-usama.anjum@collabora.…
[9] https://lore.kernel.org/all/20230306213925.617814-1-peterx@redhat.com
[10] https://lore.kernel.org/all/20230125144529.1630917-1-mdanylo@google.com
*Original cover letter from v8*
Hello,
Note:
Soft-dirty pages and pages which have been written-to are synonyms. As
the kernel already has a soft-dirty feature, which we have given up on
using, we use the written-to terminology while using UFFD async WP under
the hood.
This IOCTL, PAGEMAP_SCAN on pagemap file can be used to get and/or clear
the info about page table entries. The following operations are
supported in this ioctl:
- Get the information if the pages have been written-to (PAGE_IS_WRITTEN),
file mapped (PAGE_IS_FILE), present (PAGE_IS_PRESENT) or swapped
(PAGE_IS_SWAPPED).
- Write-protect the pages (PAGEMAP_WP_ENGAGE) to start finding which
pages have been written-to.
- Find pages which have been written-to and write protect the pages
(atomic PAGE_IS_WRITTEN + PAGEMAP_WP_ENGAGE)
It is possible to find and clear soft-dirty pages entirely in userspace.
But it isn't efficient:
- The mprotect and SIGSEGV handler for bookkeeping
- The userfaultfd wp (synchronous) with the handler for bookkeeping
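The first of these userspace approaches can be sketched as below, to show where the cost comes from: every first write to a page takes a SIGSEGV, a handler run and an mprotect() call. All names are made up for the example. Note that mprotect() is not on POSIX's list of async-signal-safe functions; calling it from the handler is a common Linux idiom but remains an assumption of this sketch.

```c
#define _DEFAULT_SOURCE
#include <signal.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define WT_MAX_PAGES 64

static uint8_t *wt_base;	/* start of the tracked region */
static size_t wt_npages;	/* number of tracked pages */
static size_t wt_pagesz;	/* system page size */
static volatile sig_atomic_t wt_dirty[WT_MAX_PAGES]; /* per-page written flag */

/* On a write fault inside the region: record the page, then unprotect it. */
static void wt_on_segv(int sig, siginfo_t *si, void *uctx)
{
	uintptr_t addr = (uintptr_t)si->si_addr;
	uintptr_t base = (uintptr_t)wt_base;
	size_t idx;

	(void)sig;
	(void)uctx;
	if (addr < base || addr >= base + wt_npages * wt_pagesz)
		_exit(1);	/* a genuine crash, not one of our faults */
	idx = (addr - base) / wt_pagesz;
	wt_dirty[idx] = 1;
	mprotect(wt_base + idx * wt_pagesz, wt_pagesz, PROT_READ | PROT_WRITE);
}

/* Allocate 'npages' pages and write-protect them. Returns NULL on failure. */
static uint8_t *wt_start(size_t npages)
{
	struct sigaction sa;
	void *p;
	size_t i;

	if (npages == 0 || npages > WT_MAX_PAGES)
		return NULL;
	wt_pagesz = (size_t)sysconf(_SC_PAGESIZE);
	p = mmap(NULL, npages * wt_pagesz, PROT_READ | PROT_WRITE,
		 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED)
		return NULL;
	wt_base = p;
	wt_npages = npages;
	for (i = 0; i < npages; i++)
		wt_dirty[i] = 0;

	memset(&sa, 0, sizeof(sa));
	sa.sa_sigaction = wt_on_segv;
	sa.sa_flags = SA_SIGINFO;
	sigemptyset(&sa.sa_mask);
	if (sigaction(SIGSEGV, &sa, NULL) != 0)
		return NULL;
	/* From here on, the first write to each page faults into the handler. */
	if (mprotect(wt_base, npages * wt_pagesz, PROT_READ) != 0)
		return NULL;
	return wt_base;
}
```

After wt_start(), the caller writes to the region as usual and later reads wt_dirty[] to learn which pages were touched. Resetting the tracking requires another mprotect() over the region, and that get-and-reset step is not atomic with respect to concurrent writers.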
Some benchmarks can be seen here[1]. This series adds features that weren't
present earlier:
- There is no atomic get soft-dirty/written-to status and clear present in
the kernel.
- The pages which have been written-to cannot be found accurately.
(Kernel's soft-dirty PTE bit + soft-dirty VMA bit shows more soft-dirty
pages than there actually are.)
Historically, soft-dirty PTE bit tracking has been used in the CRIU
project. The procfs interface is enough for finding the soft-dirty bit
status and clearing the soft-dirty bit of all the pages of a process.
We have the use case where we need to track the soft-dirty PTE bit for
only specific pages on-demand. We need this tracking and clear mechanism
of a region of memory while the process is running to emulate the
GetWriteWatch() syscall of Windows.
*(Moved to using UFFD instead of soft-dirty feature to find pages which
have been written-to from v7 patch series)*:
Stop using the soft-dirty flags for finding which pages have been
written to. It is too delicate and wrong as it shows more soft-dirty
pages than the actual soft-dirty pages. There is no interest in
correcting it [2][3] as this is how the feature was written years ago.
Its behaviour shouldn't be changed now. Peter Xu has suggested
using the async version of the UFFD WP [4] as it is based inherently
on the PTEs.
So in this patch series, I've added a new mode to the UFFD which is an
asynchronous version of the write protect. When this variant of the
UFFD WP is used, the page faults are resolved automatically by the
kernel. The pages which have been written-to can be found by reading
pagemap file (!PM_UFFD_WP). This feature can be used successfully to
find which pages have been written to from the time the pages were
write protected. This works just like the soft-dirty flag without
showing any extra pages which aren't soft-dirty in reality.
The information about whether a page is file mapped, present or
swapped is required for the CRIU project [5][6]. The required mask,
any mask, excluded mask and return mask are also required
for the CRIU project [5].
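One plausible reading of how the four masks combine is sketched below. The flag values and function names are hypothetical, and the matching rule (all required bits set, at least one any-of bit set when that mask is non-empty, no excluded bit set) is an assumption; the authoritative semantics are in the series' uapi header and documentation.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bit values; the real PAGE_IS_* flags come from the uapi header. */
#define SK_PAGE_IS_WRITTEN (1u << 0)
#define SK_PAGE_IS_FILE    (1u << 1)
#define SK_PAGE_IS_PRESENT (1u << 2)
#define SK_PAGE_IS_SWAPPED (1u << 3)

/* Decide whether a page with the given flags is of interest. */
static bool sk_page_matches(uint32_t flags, uint32_t required,
			    uint32_t anyof, uint32_t excluded)
{
	if ((flags & required) != required)
		return false;	/* a required bit is missing */
	if (anyof && !(flags & anyof))
		return false;	/* none of the any-of bits is set */
	if (flags & excluded)
		return false;	/* an excluded bit is set */
	return true;
}

/* Only the bits selected by the return mask are reported back. */
static uint32_t sk_page_report(uint32_t flags, uint32_t return_mask)
{
	return flags & return_mask;
}
```

For example, asking for written pages that are not file mapped would use required = SK_PAGE_IS_WRITTEN and excluded = SK_PAGE_IS_FILE.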
The IOCTL returns the addresses of the pages which match the specific
masks. The page addresses are returned in struct page_region in a compact
form. The max_pages is needed to support a use case where user only wants
to get a specific number of pages. So there is no need to find all the
pages of interest in the range when max_pages is specified. The IOCTL
returns when the maximum number of pages has been found. The max_pages is
optional. If max_pages is specified, it must be equal to or greater than
the vec_size. This restriction is needed to handle the worst case, when
each page_region only contains info of one page and cannot be compacted.
This is needed to emulate the Windows GetWriteWatch() syscall.
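The compaction and the max_pages cut-off can be illustrated with a small user-space model. The structure below is a hypothetical stand-in for the series' struct page_region, and the function is not the kernel implementation: it only shows consecutive pages with identical flags being merged into one region, and the walk stopping once max_pages matching pages have been reported or the output vector is full.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for struct page_region. */
struct sk_region {
	uint64_t start;		/* address of the first page */
	uint64_t npages;	/* number of consecutive pages */
	uint32_t flags;		/* flags shared by all pages in the region */
};

/*
 * Compact per-page flags into regions. Pages whose flags are 0 are treated
 * as not matching and skipped. Stops after 'max_pages' matching pages
 * (0 means no limit) or when 'vec' has no room for a new region.
 * Returns the number of regions written to 'vec'.
 */
static size_t sk_compact(uint64_t base, uint64_t page_size,
			 const uint32_t *page_flags, size_t npages,
			 struct sk_region *vec, size_t vec_len,
			 size_t max_pages)
{
	size_t n = 0, found = 0, i;

	for (i = 0; i < npages; i++) {
		uint64_t addr = base + i * page_size;

		if (!page_flags[i])
			continue;
		if (max_pages && found == max_pages)
			break;
		if (n && vec[n - 1].flags == page_flags[i] &&
		    vec[n - 1].start + vec[n - 1].npages * page_size == addr) {
			vec[n - 1].npages++;	/* extend the previous region */
		} else {
			if (n == vec_len)
				break;		/* no room for a new region */
			vec[n].start = addr;
			vec[n].npages = 1;
			vec[n].flags = page_flags[i];
			n++;
		}
		found++;
	}
	return n;
}
```

In the worst case no two matching pages are adjacent or share flags, so each region holds exactly one page; this is the situation the restriction relating max_pages to vec_size is meant to cover.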
The patch series includes a detailed selftest which can be used as an
example for the uffd async wp test and PAGEMAP_IOCTL. It shows the
interface usages as well.
[1] https://lore.kernel.org/lkml/54d4c322-cd6e-eefd-b161-2af2b56aae24@collabora…
[2] https://lore.kernel.org/all/20221220162606.1595355-1-usama.anjum@collabora.…
[3] https://lore.kernel.org/all/20221122115007.2787017-1-usama.anjum@collabora.…
[4] https://lore.kernel.org/all/Y6Hc2d+7eTKs7AiH@x1n
[5] https://lore.kernel.org/all/YyiDg79flhWoMDZB@gmail.com/
[6] https://lore.kernel.org/all/20221014134802.1361436-1-mdanylo@google.com/
Regards,
Muhammad Usama Anjum
Muhammad Usama Anjum (4):
fs/proc/task_mmu: Implement IOCTL to get and optionally clear info
about PTEs
tools headers UAPI: Update linux/fs.h with the kernel sources
mm/pagemap: add documentation of PAGEMAP_SCAN IOCTL
selftests: mm: add pagemap ioctl tests
Peter Xu (1):
userfaultfd: UFFD_FEATURE_WP_ASYNC
Documentation/admin-guide/mm/pagemap.rst | 58 +
Documentation/admin-guide/mm/userfaultfd.rst | 35 +
fs/proc/task_mmu.c | 577 +++++++
fs/userfaultfd.c | 26 +-
include/linux/hugetlb.h | 1 +
include/linux/userfaultfd_k.h | 21 +-
include/uapi/linux/fs.h | 55 +
include/uapi/linux/userfaultfd.h | 9 +-
mm/hugetlb.c | 34 +-
mm/memory.c | 27 +-
tools/include/uapi/linux/fs.h | 55 +
tools/testing/selftests/mm/.gitignore | 2 +
tools/testing/selftests/mm/Makefile | 3 +-
tools/testing/selftests/mm/config | 1 +
tools/testing/selftests/mm/pagemap_ioctl.c | 1464 ++++++++++++++++++
tools/testing/selftests/mm/run_vmtests.sh | 4 +
16 files changed, 2348 insertions(+), 24 deletions(-)
create mode 100644 tools/testing/selftests/mm/pagemap_ioctl.c
mode change 100644 => 100755 tools/testing/selftests/mm/run_vmtests.sh
--
2.39.2
Changes in v22:
- Interface change:
- Replace [start, start + len) with [start, end)
- Return the ending address of the address walk in start
Changes in v21:
- Abort walk instead of returning error if WP is to be performed on
partial hugetlb
*Changes in v20*
- Correct PAGE_IS_FILE and add PAGE_IS_PFNZERO
*Changes in v19*
- Minor changes and interface updates
*Changes in v18*
- Rebase on top of next-20230613
- Minor updates
*Changes in v17*
- Rebase on top of next-20230606
- Minor improvements in PAGEMAP_SCAN IOCTL patch
*Changes in v16*
- Fix a corner case
- Add exclusive PM_SCAN_OP_WP back
*Changes in v15*
- Build fix (Add missed build fix in RESEND)
*Changes in v14*
- Fix build error caused by #ifdef added at last minute in some configs
*Changes in v13*
- Rebase on top of next-20230414
- Give-up on using uffd_wp_range() and write new helpers, flush tlb only
once
*Changes in v12*
- Update and other memory types to UFFD_FEATURE_WP_ASYNC
- Rebase on top of next-20230406
- Review updates
*Changes in v11*
- Rebase on top of next-20230307
- Base patches on UFFD_FEATURE_WP_UNPOPULATED
- Do a lot of cosmetic changes and review updates
- Remove ENGAGE_WP + !GET operation as it can be performed with
UFFDIO_WRITEPROTECT
*Changes in v10*
- Add specific condition to return error if hugetlb is used with wp
async
- Move changes in tools/include/uapi/linux/fs.h to separate patch
- Add documentation
*Changes in v9:*
- Correct fault resolution for userfaultfd wp async
- Fix build warnings and errors which were happening on some configs
- Simplify pagemap ioctl's code
*Changes in v8:*
- Update uffd async wp implementation
- Improve PAGEMAP_IOCTL implementation
*Changes in v7:*
- Add uffd wp async
- Update the IOCTL to use uffd under the hood instead of soft-dirty
flags
*Motivation*
The real motivation for adding PAGEMAP_SCAN IOCTL is to emulate Windows
GetWriteWatch() syscall [1]. GetWriteWatch() retrieves the addresses of
the pages that are written to in a region of virtual memory.
This syscall is used in Windows applications and games. It is currently
emulated in a pretty slow manner in userspace. Our purpose is to enhance
the kernel so that it can be emulated efficiently. Currently some
out-of-tree hack patches are being used to emulate it efficiently in some
kernels. We intend to replace those with these patches, so that gaming on
Linux as a whole can benefit. It means there would be tons of users of
this code.
CRIU use case [2] was mentioned by Andrei and Danylo:
> Use cases for migrating sparse VMAs are binaries sanitized with ASAN,
> MSAN or TSAN [3]. All of these sanitizers produce sparse mappings of
> shadow memory [4]. Being able to migrate such binaries allows to highly
> reduce the amount of work needed to identify and fix post-migration
> crashes, which happen constantly.
Andrei defines the following uses of this code:
* it is more granular and allows us to track changed pages more
effectively. The current interface can clear dirty bits for the entire
process only. In addition, reading info about pages is a separate
operation. It means we must freeze the process to read information
about all its pages, reset dirty bits, only then we can start dumping
pages. The information about pages becomes more and more outdated,
while we are processing pages. The new interface solves both these
downsides. First, it allows us to read pte bits and clear the
soft-dirty bit atomically. It means that CRIU will not need to freeze
processes to pre-dump their memory. Second, it clears soft-dirty bits
for a specified region of memory. It means CRIU will have actual info
about pages to the moment of dumping them.
* The new interface has to be much faster because basic page filtering
is happening in the kernel. With the old interface, we have to read
pagemap for each page.
*Implementation Evolution (Short Summary)*
From the definition of GetWriteWatch(), we feel like kernel's soft-dirty
feature can be used under the hood with some additions like:
* reset soft-dirty flag for only a specific region of memory instead of
clearing the flag for the entire process
* get and clear soft-dirty flag for a specific region atomically
So we decided to use ioctl on pagemap file to read and/or reset soft-dirty
flag. But using soft-dirty flag, sometimes we get extra pages which weren't
even written. They had become soft-dirty because of VMA merging and
VM_SOFTDIRTY flag. This breaks the definition of GetWriteWatch(). We were
able to bypass this shortcoming by ignoring VM_SOFTDIRTY until David
reported that mprotect etc. messes up the soft-dirty flag while ignoring
VM_SOFTDIRTY [5]. This wasn't happening until [6] got introduced. We
discussed whether we could revert these patches, but could not reach any
conclusion. So at this point, I made a couple of attempts to solve this
whole VM_SOFTDIRTY issue by correcting the soft-dirty implementation:
* [7] Correct the bug fixed wrongly back in 2014. It had the potential to
cause regressions. We left it behind.
* [8] Keep a list of soft-dirty parts of a VMA across splits and merges.
The reply was not to increase the size of the VMA by 8 bytes.
At this point, we left soft-dirty, considering it too delicate, and
userfaultfd [9] seemed like the only way forward. From there onward, we
have been basing soft-dirty emulation on the userfaultfd wp feature, where
the kernel resolves the faults itself when the WP_ASYNC feature is used. It
was straightforward to add the WP_ASYNC feature to userfaultfd. Now only
those pages which have really been written are reported as dirty or
written-to. (PS: another userfaultfd feature, WP_UNPOPULATED, is also
required to avoid pre-faulting memory before write-protecting [9].)
All the different masks were added at the request of CRIU devs to make the
interface more generic.
[1] https://learn.microsoft.com/en-us/windows/win32/api/memoryapi/nf-memoryapi-…
[2] https://lore.kernel.org/all/20221014134802.1361436-1-mdanylo@google.com
[3] https://github.com/google/sanitizers
[4] https://github.com/google/sanitizers/wiki/AddressSanitizerAlgorithm#64-bit
[5] https://lore.kernel.org/all/bfcae708-db21-04b4-0bbe-712badd03071@redhat.com
[6] https://lore.kernel.org/all/20220725142048.30450-1-peterx@redhat.com/
[7] https://lore.kernel.org/all/20221122115007.2787017-1-usama.anjum@collabora.…
[8] https://lore.kernel.org/all/20221220162606.1595355-1-usama.anjum@collabora.…
[9] https://lore.kernel.org/all/20230306213925.617814-1-peterx@redhat.com
[10] https://lore.kernel.org/all/20230125144529.1630917-1-mdanylo@google.com
*Original cover letter from v8*
Hello,
Note:
Soft-dirty pages and pages which have been written-to are synonyms. As
the kernel already has a soft-dirty feature, which we have given up on
using, we use the written-to terminology while using UFFD async WP under
the hood.
This IOCTL, PAGEMAP_SCAN on pagemap file can be used to get and/or clear
the info about page table entries. The following operations are
supported in this ioctl:
- Get the information if the pages have been written-to (PAGE_IS_WRITTEN),
file mapped (PAGE_IS_FILE), present (PAGE_IS_PRESENT) or swapped
(PAGE_IS_SWAPPED).
- Write-protect the pages (PAGEMAP_WP_ENGAGE) to start finding which
pages have been written-to.
- Find pages which have been written-to and write protect the pages
(atomic PAGE_IS_WRITTEN + PAGEMAP_WP_ENGAGE)
It is possible to find and clear soft-dirty pages entirely in userspace.
But it isn't efficient:
- The mprotect and SIGSEGV handler for bookkeeping
- The userfaultfd wp (synchronous) with the handler for bookkeeping
Some benchmarks can be seen here[1]. This series adds features that weren't
present earlier:
- There is no atomic get soft-dirty/written-to status and clear present in
the kernel.
- The pages which have been written-to cannot be found accurately.
(Kernel's soft-dirty PTE bit + soft-dirty VMA bit shows more soft-dirty
pages than there actually are.)
Historically, soft-dirty PTE bit tracking has been used in the CRIU
project. The procfs interface is enough for finding the soft-dirty bit
status and clearing the soft-dirty bit of all the pages of a process.
We have the use case where we need to track the soft-dirty PTE bit for
only specific pages on-demand. We need this tracking and clear mechanism
of a region of memory while the process is running to emulate the
GetWriteWatch() syscall of Windows.
*(Moved to using UFFD instead of soft-dirty feature to find pages which
have been written-to from v7 patch series)*:
Stop using the soft-dirty flags for finding which pages have been
written to. It is too delicate and wrong as it shows more soft-dirty
pages than the actual soft-dirty pages. There is no interest in
correcting it [2][3] as this is how the feature was written years ago.
Its behaviour shouldn't be changed now. Peter Xu has suggested
using the async version of the UFFD WP [4] as it is based inherently
on the PTEs.
So in this patch series, I've added a new mode to the UFFD which is an
asynchronous version of the write protect. When this variant of the
UFFD WP is used, the page faults are resolved automatically by the
kernel. The pages which have been written to can be found by reading the
pagemap file (!PM_UFFD_WP). This feature can be used successfully to
find which pages have been written to from the time the pages were
write protected. This works just like the soft-dirty flag without
showing any extra pages which aren't soft-dirty in reality.
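As a rough illustration of the !PM_UFFD_WP check, the sketch below decodes
64-bit pagemap entries in userspace. The bit positions (57 uffd-wp, 61
file/shared-anon, 62 swapped, 63 present) are the ones listed in
Documentation/admin-guide/mm/pagemap.rst; the helper names are made up for
this example. The "written" result is only meaningful for a range that was
registered with UFFD async WP and write protected beforehand.

```python
import struct

# Bit positions as documented in Documentation/admin-guide/mm/pagemap.rst.
PM_UFFD_WP = 1 << 57   # pte is uffd-wp write-protected
PM_FILE = 1 << 61      # page is file-mapped or shared-anon
PM_SWAP = 1 << 62      # page is swapped
PM_PRESENT = 1 << 63   # page is present
PAGEMAP_ENTRY_SIZE = 8


def decode_entry(raw):
    """Decode one 8-byte pagemap entry (native byte order)."""
    (entry,) = struct.unpack("=Q", raw)
    return {
        "present": bool(entry & PM_PRESENT),
        "swapped": bool(entry & PM_SWAP),
        "file": bool(entry & PM_FILE),
        # Assumption: the range was write protected via UFFD async WP, so a
        # cleared uffd-wp bit means the page was written to since then.
        "written": not (entry & PM_UFFD_WP),
    }


def read_entries(pagemap_path, start_addr, npages, page_size=4096):
    """Hypothetical helper: read npages entries starting at start_addr."""
    with open(pagemap_path, "rb") as f:
        f.seek((start_addr // page_size) * PAGEMAP_ENTRY_SIZE)
        data = f.read(npages * PAGEMAP_ENTRY_SIZE)
    return [decode_entry(data[i:i + PAGEMAP_ENTRY_SIZE])
            for i in range(0, len(data), PAGEMAP_ENTRY_SIZE)]
```

Reading the pagemap file one entry per page like this is the slow path that
PAGEMAP_SCAN is meant to replace for large ranges.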
Information on whether a page is file mapped, present or swapped is
required for the CRIU project [5][6]. The required mask, any mask,
excluded mask and return mask are also required for the CRIU project [5].
The IOCTL returns the addresses of the pages which match the specific
masks. The page addresses are returned in struct page_region in a compact
form. max_pages is needed to support the use case where the user only wants
to get a specific number of pages, so there is no need to find all the
pages of interest in the range when max_pages is specified. The IOCTL
returns when the maximum number of pages has been found. max_pages is
optional. If max_pages is specified, it must be equal to or greater than
vec_size. This restriction is needed to handle the worst case, when one
page_region only contains info of one page and cannot be compacted.
This is needed to emulate the Windows getWriteWatch() syscall.
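The compaction of page addresses into regions can be modelled in userspace
roughly as follows. This is only a sketch of the idea, not the kernel
implementation: the PageRegion field names, the fixed 4 KiB page size and
the compact_pages() helper are assumptions made for the example.

```python
from dataclasses import dataclass

PAGE_SIZE = 4096  # assumed page size for this sketch


@dataclass
class PageRegion:
    start: int   # address of the first page in the region
    npages: int  # number of consecutive pages
    flags: int   # flags shared by every page in the region


def compact_pages(pages, vec_size, max_pages=0):
    """Merge (addr, flags) pairs, sorted by address, into at most vec_size
    regions. Stops early once max_pages pages were reported (0 = no limit)."""
    regions = []
    found = 0
    for addr, flags in pages:
        if max_pages and found == max_pages:
            break
        last = regions[-1] if regions else None
        if (last is not None and last.flags == flags
                and last.start + last.npages * PAGE_SIZE == addr):
            last.npages += 1            # contiguous, same flags: extend
        else:
            if len(regions) == vec_size:
                break                   # output vector is full
            regions.append(PageRegion(addr, 1, flags))
        found += 1
    return regions
```

In the worst case no two matching pages are adjacent, so every region
describes exactly one page and nothing is compacted.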
The patch series includes a detailed selftest which can be used as an
example for the uffd async wp test and PAGEMAP_IOCTL. It shows the
interface usage as well.
[1] https://lore.kernel.org/lkml/54d4c322-cd6e-eefd-b161-2af2b56aae24@collabora…
[2] https://lore.kernel.org/all/20221220162606.1595355-1-usama.anjum@collabora.…
[3] https://lore.kernel.org/all/20221122115007.2787017-1-usama.anjum@collabora.…
[4] https://lore.kernel.org/all/Y6Hc2d+7eTKs7AiH@x1n
[5] https://lore.kernel.org/all/YyiDg79flhWoMDZB@gmail.com/
[6] https://lore.kernel.org/all/20221014134802.1361436-1-mdanylo@google.com/
Regards,
Muhammad Usama Anjum
Muhammad Usama Anjum (4):
fs/proc/task_mmu: Implement IOCTL to get and optionally clear info
about PTEs
tools headers UAPI: Update linux/fs.h with the kernel sources
mm/pagemap: add documentation of PAGEMAP_SCAN IOCTL
selftests: mm: add pagemap ioctl tests
Peter Xu (1):
userfaultfd: UFFD_FEATURE_WP_ASYNC
Documentation/admin-guide/mm/pagemap.rst | 58 +
Documentation/admin-guide/mm/userfaultfd.rst | 35 +
fs/proc/task_mmu.c | 565 +++++++
fs/userfaultfd.c | 26 +-
include/linux/hugetlb.h | 1 +
include/linux/userfaultfd_k.h | 21 +-
include/uapi/linux/fs.h | 55 +
include/uapi/linux/userfaultfd.h | 9 +-
mm/hugetlb.c | 34 +-
mm/memory.c | 27 +-
tools/include/uapi/linux/fs.h | 55 +
tools/testing/selftests/mm/.gitignore | 2 +
tools/testing/selftests/mm/Makefile | 3 +-
tools/testing/selftests/mm/config | 1 +
tools/testing/selftests/mm/pagemap_ioctl.c | 1464 ++++++++++++++++++
tools/testing/selftests/mm/run_vmtests.sh | 4 +
16 files changed, 2336 insertions(+), 24 deletions(-)
create mode 100644 tools/testing/selftests/mm/pagemap_ioctl.c
mode change 100644 => 100755 tools/testing/selftests/mm/run_vmtests.sh
--
2.39.2
The basic idea here is to "simulate" memory poisoning for VMs. A VM
running on some host might encounter a memory error, after which some
page(s) are poisoned (i.e., future accesses SIGBUS). They expect that
once poisoned, pages can never become "un-poisoned". So, when we live
migrate the VM, we need to preserve the poisoned status of these pages.
When live migrating, we try to get the guest running on its new host as
quickly as possible. So, we start it running before all memory has been
copied, and before we're certain which pages should be poisoned or not.
So the basic way to use this new feature is:
- On the new host, the guest's memory is registered with userfaultfd, in
either MISSING or MINOR mode (doesn't really matter for this purpose).
- On any first access, we get a userfaultfd event. At this point we can
communicate with the old host to find out if the page was poisoned.
- If so, we can respond with a UFFDIO_POISON - this places a swap marker
so any future accesses will SIGBUS. Because the pte is now "present",
future accesses won't generate more userfaultfd events, they'll just
SIGBUS directly.
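The steps above can be modelled with a small userspace simulation. It is
purely illustrative: GuestMemory, SimulatedBusError and the string PTE
states are invented for this sketch, and no real userfaultfd calls are
made. It only shows that a poisoned page raises one userfaultfd event and
then faults with SIGBUS on every access.

```python
class SimulatedBusError(Exception):
    """Stands in for the SIGBUS a real access would receive."""


class GuestMemory:
    def __init__(self, poisoned_on_old_host):
        self.poisoned_on_old_host = set(poisoned_on_old_host)
        self.pte = {}      # page -> "mapped" | "poison"; absent = none PTE
        self.events = []   # pages that raised a simulated userfaultfd event

    def handle_fault(self, page):
        """Userspace handler: ask the old host, then resolve the fault."""
        self.events.append(page)
        if page in self.poisoned_on_old_host:
            self.pte[page] = "poison"   # models UFFDIO_POISON
        else:
            self.pte[page] = "mapped"   # models e.g. UFFDIO_COPY/CONTINUE

    def access(self, page):
        if page not in self.pte:
            # First access: nothing is installed yet, so we get an event.
            self.handle_fault(page)
        if self.pte[page] == "poison":
            # The marker is in place: no further events, just SIGBUS.
            raise SimulatedBusError(page)
        return "ok"
```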
UFFDIO_POISON does not handle unmapping previously-present PTEs. This
isn't needed, because during live migration we want to intercept
all accesses with userfaultfd (not just writes, so WP mode isn't useful
for this). So whether minor or missing mode is being used (or both), the
PTE won't be present in any case, so handling that case isn't needed.
Why return VM_FAULT_HWPOISON instead of VM_FAULT_SIGBUS when one of
these markers is encountered? For "normal" userspace programs there
isn't a big difference, both yield a SIGBUS. The difference for KVM is
key though: VM_FAULT_HWPOISON will result in an MCE being injected into
the guest (which is the behavior we want). With VM_FAULT_SIGBUS, the
hypervisor would need to catch the SIGBUS and deal with the MCE
injection itself.
Signed-off-by: Axel Rasmussen <axelrasmussen(a)google.com>
---
fs/userfaultfd.c | 63 ++++++++++++++++++++++++++++++++
include/linux/swapops.h | 3 +-
include/linux/userfaultfd_k.h | 4 ++
include/uapi/linux/userfaultfd.h | 25 +++++++++++--
mm/memory.c | 4 ++
mm/userfaultfd.c | 62 ++++++++++++++++++++++++++++++-
6 files changed, 156 insertions(+), 5 deletions(-)
diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index 7cecd49e078b..c26a883399c9 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -1965,6 +1965,66 @@ static int userfaultfd_continue(struct userfaultfd_ctx *ctx, unsigned long arg)
return ret;
}
+static inline int userfaultfd_poison(struct userfaultfd_ctx *ctx, unsigned long arg)
+{
+ __s64 ret;
+ struct uffdio_poison uffdio_poison;
+ struct uffdio_poison __user *user_uffdio_poison;
+ struct userfaultfd_wake_range range;
+
+ user_uffdio_poison = (struct uffdio_poison __user *)arg;
+
+ ret = -EAGAIN;
+ if (atomic_read(&ctx->mmap_changing))
+ goto out;
+
+ ret = -EFAULT;
+ if (copy_from_user(&uffdio_poison, user_uffdio_poison,
+ /* don't copy the output fields */
+ sizeof(uffdio_poison) - (sizeof(__s64))))
+ goto out;
+
+ ret = validate_range(ctx->mm, uffdio_poison.range.start,
+ uffdio_poison.range.len);
+ if (ret)
+ goto out;
+
+ ret = -EINVAL;
+ /* double check for wraparound just in case. */
+ if (uffdio_poison.range.start + uffdio_poison.range.len <=
+ uffdio_poison.range.start) {
+ goto out;
+ }
+ if (uffdio_poison.mode & ~UFFDIO_POISON_MODE_DONTWAKE)
+ goto out;
+
+ if (mmget_not_zero(ctx->mm)) {
+ ret = mfill_atomic_poison(ctx->mm, uffdio_poison.range.start,
+ uffdio_poison.range.len,
+ &ctx->mmap_changing, 0);
+ mmput(ctx->mm);
+ } else {
+ return -ESRCH;
+ }
+
+ if (unlikely(put_user(ret, &user_uffdio_poison->updated)))
+ return -EFAULT;
+ if (ret < 0)
+ goto out;
+
+ /* len == 0 would wake all */
+ BUG_ON(!ret);
+ range.len = ret;
+ if (!(uffdio_poison.mode & UFFDIO_POISON_MODE_DONTWAKE)) {
+ range.start = uffdio_poison.range.start;
+ wake_userfault(ctx, &range);
+ }
+ ret = range.len == uffdio_poison.range.len ? 0 : -EAGAIN;
+
+out:
+ return ret;
+}
+
static inline unsigned int uffd_ctx_features(__u64 user_features)
{
/*
@@ -2066,6 +2126,9 @@ static long userfaultfd_ioctl(struct file *file, unsigned cmd,
case UFFDIO_CONTINUE:
ret = userfaultfd_continue(ctx, arg);
break;
+ case UFFDIO_POISON:
+ ret = userfaultfd_poison(ctx, arg);
+ break;
}
return ret;
}
diff --git a/include/linux/swapops.h b/include/linux/swapops.h
index 4c932cb45e0b..8259fee32421 100644
--- a/include/linux/swapops.h
+++ b/include/linux/swapops.h
@@ -394,7 +394,8 @@ typedef unsigned long pte_marker;
#define PTE_MARKER_UFFD_WP BIT(0)
#define PTE_MARKER_SWAPIN_ERROR BIT(1)
-#define PTE_MARKER_MASK (BIT(2) - 1)
+#define PTE_MARKER_UFFD_POISON BIT(2)
+#define PTE_MARKER_MASK (BIT(3) - 1)
static inline swp_entry_t make_pte_marker_entry(pte_marker marker)
{
diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
index ac7b0c96d351..ac8c6854097c 100644
--- a/include/linux/userfaultfd_k.h
+++ b/include/linux/userfaultfd_k.h
@@ -46,6 +46,7 @@ enum mfill_atomic_mode {
MFILL_ATOMIC_COPY,
MFILL_ATOMIC_ZEROPAGE,
MFILL_ATOMIC_CONTINUE,
+ MFILL_ATOMIC_POISON,
NR_MFILL_ATOMIC_MODES,
};
@@ -83,6 +84,9 @@ extern ssize_t mfill_atomic_zeropage(struct mm_struct *dst_mm,
extern ssize_t mfill_atomic_continue(struct mm_struct *dst_mm, unsigned long dst_start,
unsigned long len, atomic_t *mmap_changing,
uffd_flags_t flags);
+extern ssize_t mfill_atomic_poison(struct mm_struct *dst_mm, unsigned long start,
+ unsigned long len, atomic_t *mmap_changing,
+ uffd_flags_t flags);
extern int mwriteprotect_range(struct mm_struct *dst_mm,
unsigned long start, unsigned long len,
bool enable_wp, atomic_t *mmap_changing);
diff --git a/include/uapi/linux/userfaultfd.h b/include/uapi/linux/userfaultfd.h
index 66dd4cd277bd..62151706c5a3 100644
--- a/include/uapi/linux/userfaultfd.h
+++ b/include/uapi/linux/userfaultfd.h
@@ -39,7 +39,8 @@
UFFD_FEATURE_MINOR_SHMEM | \
UFFD_FEATURE_EXACT_ADDRESS | \
UFFD_FEATURE_WP_HUGETLBFS_SHMEM | \
- UFFD_FEATURE_WP_UNPOPULATED)
+ UFFD_FEATURE_WP_UNPOPULATED | \
+ UFFD_FEATURE_POISON)
#define UFFD_API_IOCTLS \
((__u64)1 << _UFFDIO_REGISTER | \
(__u64)1 << _UFFDIO_UNREGISTER | \
@@ -49,12 +50,14 @@
(__u64)1 << _UFFDIO_COPY | \
(__u64)1 << _UFFDIO_ZEROPAGE | \
(__u64)1 << _UFFDIO_WRITEPROTECT | \
- (__u64)1 << _UFFDIO_CONTINUE)
+ (__u64)1 << _UFFDIO_CONTINUE | \
+ (__u64)1 << _UFFDIO_POISON)
#define UFFD_API_RANGE_IOCTLS_BASIC \
((__u64)1 << _UFFDIO_WAKE | \
(__u64)1 << _UFFDIO_COPY | \
+ (__u64)1 << _UFFDIO_WRITEPROTECT | \
(__u64)1 << _UFFDIO_CONTINUE | \
- (__u64)1 << _UFFDIO_WRITEPROTECT)
+ (__u64)1 << _UFFDIO_POISON)
/*
* Valid ioctl command number range with this API is from 0x00 to
@@ -71,6 +74,7 @@
#define _UFFDIO_ZEROPAGE (0x04)
#define _UFFDIO_WRITEPROTECT (0x06)
#define _UFFDIO_CONTINUE (0x07)
+#define _UFFDIO_POISON (0x08)
#define _UFFDIO_API (0x3F)
/* userfaultfd ioctl ids */
@@ -91,6 +95,8 @@
struct uffdio_writeprotect)
#define UFFDIO_CONTINUE _IOWR(UFFDIO, _UFFDIO_CONTINUE, \
struct uffdio_continue)
+#define UFFDIO_POISON _IOWR(UFFDIO, _UFFDIO_POISON, \
+ struct uffdio_poison)
/* read() structure */
struct uffd_msg {
@@ -225,6 +231,7 @@ struct uffdio_api {
#define UFFD_FEATURE_EXACT_ADDRESS (1<<11)
#define UFFD_FEATURE_WP_HUGETLBFS_SHMEM (1<<12)
#define UFFD_FEATURE_WP_UNPOPULATED (1<<13)
+#define UFFD_FEATURE_POISON (1<<14)
__u64 features;
__u64 ioctls;
@@ -321,6 +328,18 @@ struct uffdio_continue {
__s64 mapped;
};
+struct uffdio_poison {
+ struct uffdio_range range;
+#define UFFDIO_POISON_MODE_DONTWAKE ((__u64)1<<0)
+ __u64 mode;
+
+ /*
+ * Fields below here are written by the ioctl and must be at the end:
+ * the copy_from_user will not read past here.
+ */
+ __s64 updated;
+};
+
/*
* Flags for the userfaultfd(2) system call itself.
*/
diff --git a/mm/memory.c b/mm/memory.c
index d8a9a770b1f1..7fbda39e060d 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3692,6 +3692,10 @@ static vm_fault_t handle_pte_marker(struct vm_fault *vmf)
if (WARN_ON_ONCE(!marker))
return VM_FAULT_SIGBUS;
+ /* Poison emulation explicitly requested for this PTE. */
+ if (marker & PTE_MARKER_UFFD_POISON)
+ return VM_FAULT_HWPOISON;
+
/* Higher priority than uffd-wp when data corrupted */
if (marker & PTE_MARKER_SWAPIN_ERROR)
return VM_FAULT_SIGBUS;
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index a2bf37ee276d..87b62ca1e09e 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -286,6 +286,51 @@ static int mfill_atomic_pte_continue(pmd_t *dst_pmd,
goto out;
}
+/* Handles UFFDIO_POISON for all non-hugetlb VMAs. */
+static int mfill_atomic_pte_poison(pmd_t *dst_pmd,
+ struct vm_area_struct *dst_vma,
+ unsigned long dst_addr,
+ uffd_flags_t flags)
+{
+ int ret;
+ struct mm_struct *dst_mm = dst_vma->vm_mm;
+ pte_t _dst_pte, *dst_pte;
+ spinlock_t *ptl;
+
+ _dst_pte = make_pte_marker(PTE_MARKER_UFFD_POISON);
+ dst_pte = pte_offset_map_lock(dst_mm, dst_pmd, dst_addr, &ptl);
+
+ if (vma_is_shmem(dst_vma)) {
+ struct inode *inode;
+ pgoff_t offset, max_off;
+
+ /* serialize against truncate with the page table lock */
+ inode = dst_vma->vm_file->f_inode;
+ offset = linear_page_index(dst_vma, dst_addr);
+ max_off = DIV_ROUND_UP(i_size_read(inode), PAGE_SIZE);
+ ret = -EFAULT;
+ if (unlikely(offset >= max_off))
+ goto out_unlock;
+ }
+
+ ret = -EEXIST;
+ /*
+ * For now, we don't handle unmapping pages, so only support filling in
+ * none PTEs, or replacing PTE markers.
+ */
+ if (!pte_none_mostly(*dst_pte))
+ goto out_unlock;
+
+ set_pte_at(dst_mm, dst_addr, dst_pte, _dst_pte);
+
+ /* No need to invalidate - it was non-present before */
+ update_mmu_cache(dst_vma, dst_addr, dst_pte);
+ ret = 0;
+out_unlock:
+ pte_unmap_unlock(dst_pte, ptl);
+ return ret;
+}
+
static pmd_t *mm_alloc_pmd(struct mm_struct *mm, unsigned long address)
{
pgd_t *pgd;
@@ -336,8 +381,12 @@ static __always_inline ssize_t mfill_atomic_hugetlb(
* supported by hugetlb. A PMD_SIZE huge pages may exist as used
* by THP. Since we can not reliably insert a zero page, this
* feature is not supported.
+ *
+ * PTE marker handling for hugetlb is a bit special, so for now
+ * UFFDIO_POISON is not supported.
*/
- if (uffd_flags_mode_is(flags, MFILL_ATOMIC_ZEROPAGE)) {
+ if (uffd_flags_mode_is(flags, MFILL_ATOMIC_ZEROPAGE) ||
+ uffd_flags_mode_is(flags, MFILL_ATOMIC_POISON)) {
mmap_read_unlock(dst_mm);
return -EINVAL;
}
@@ -481,6 +530,9 @@ static __always_inline ssize_t mfill_atomic_pte(pmd_t *dst_pmd,
if (uffd_flags_mode_is(flags, MFILL_ATOMIC_CONTINUE)) {
return mfill_atomic_pte_continue(dst_pmd, dst_vma,
dst_addr, flags);
+ } else if (uffd_flags_mode_is(flags, MFILL_ATOMIC_POISON)) {
+ return mfill_atomic_pte_poison(dst_pmd, dst_vma,
+ dst_addr, flags);
}
/*
@@ -702,6 +754,14 @@ ssize_t mfill_atomic_continue(struct mm_struct *dst_mm, unsigned long start,
uffd_flags_set_mode(flags, MFILL_ATOMIC_CONTINUE));
}
+ssize_t mfill_atomic_poison(struct mm_struct *dst_mm, unsigned long start,
+ unsigned long len, atomic_t *mmap_changing,
+ uffd_flags_t flags)
+{
+ return mfill_atomic(dst_mm, start, 0, len, mmap_changing,
+ uffd_flags_set_mode(flags, MFILL_ATOMIC_POISON));
+}
+
long uffd_wp_range(struct vm_area_struct *dst_vma,
unsigned long start, unsigned long len, bool enable_wp)
{
--
2.41.0.255.g8b1d071c50-goog
From: Björn Töpel <bjorn(a)rivosinc.com>
This series has two minor fixes, found when cross-compiling for the
RISC-V architecture.
Some RISC-V systems do not define HAVE_EFFICIENT_UNALIGNED_ACCESS,
which made some of tests bail out. Fix the failing tests by adding
F_NEEDS_EFFICIENT_UNALIGNED_ACCESS.
...and some RISC-V systems *do* define
HAVE_EFFICIENT_UNALIGNED_ACCESS. In this case the autoconf.h was not
correctly picked up by the build system.
Cheers,
Björn
Björn Töpel (2):
selftests/bpf: Add F_NEEDS_EFFICIENT_UNALIGNED_ACCESS to some tests
selftests/bpf: Honor $(O) when figuring out paths
tools/testing/selftests/bpf/Makefile | 4 ++++
tools/testing/selftests/bpf/verifier/atomic_cmpxchg.c | 1 +
tools/testing/selftests/bpf/verifier/ctx_skb.c | 2 ++
tools/testing/selftests/bpf/verifier/jmp32.c | 8 ++++++++
tools/testing/selftests/bpf/verifier/map_kptr.c | 2 ++
tools/testing/selftests/bpf/verifier/precise.c | 2 +-
6 files changed, 18 insertions(+), 1 deletion(-)
base-commit: a94098d490e17d652770f2309fcb9b46bc4cf864
--
2.39.2
In the use_missing_map function, value is assigned twice, and there is no
connection between the two assignments. This patch removes the first,
unused lookup and returns the result of the second one directly.
Signed-off-by: Wang Ming <machel(a)vivo.com>
---
tools/testing/selftests/bpf/progs/test_log_fixup.c | 8 ++------
1 file changed, 2 insertions(+), 6 deletions(-)
diff --git a/tools/testing/selftests/bpf/progs/test_log_fixup.c b/tools/testing/selftests/bpf/progs/test_log_fixup.c
index 1bd48feaaa42..1c49b2f9be6c 100644
--- a/tools/testing/selftests/bpf/progs/test_log_fixup.c
+++ b/tools/testing/selftests/bpf/progs/test_log_fixup.c
@@ -52,13 +52,9 @@ struct {
SEC("?raw_tp/sys_enter")
int use_missing_map(const void *ctx)
{
- int zero = 0, *value;
+ int zero = 0;
- value = bpf_map_lookup_elem(&existing_map, &zero);
-
- value = bpf_map_lookup_elem(&missing_map, &zero);
-
- return value != NULL;
+ return bpf_map_lookup_elem(&missing_map, &zero) != NULL;
}
extern int bpf_nonexistent_kfunc(void) __ksym __weak;
--
2.25.1
From: Björn Töpel <bjorn(a)rivosinc.com>
Timeouts in kselftest are done using the "timeout" command with the
"--foreground" option. Without the "foreground" option, it is not
possible for a user to cancel the runner using SIGINT, because the
signal is not propagated to timeout which is running in a different
process group. The "foreground" option places the timeout in the same
process group as its parent, but only sends the SIGTERM (on timeout)
signal to the forked process. Unfortunately, this does not play nice
with all kselftests, e.g. "net:fcnal-test.sh", where the child
processes will linger because timeout does not send SIGTERM to the
group.
Some users have noted these hangs [1].
Fix this by nesting the timeout with an additional timeout without the
foreground option.
Link: https://lore.kernel.org/all/7650b2eb-0aee-a2b0-2e64-c9bc63210f67@alu.unizg.… # [1]
Fixes: 651e0d881461 ("kselftest/runner: allow to properly deliver signals to tests")
Signed-off-by: Björn Töpel <bjorn(a)rivosinc.com>
---
tools/testing/selftests/kselftest/runner.sh | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/tools/testing/selftests/kselftest/runner.sh b/tools/testing/selftests/kselftest/runner.sh
index 1c952d1401d4..70e0a465e30d 100644
--- a/tools/testing/selftests/kselftest/runner.sh
+++ b/tools/testing/selftests/kselftest/runner.sh
@@ -36,7 +36,8 @@ tap_timeout()
{
# Make sure tests will time out if utility is available.
if [ -x /usr/bin/timeout ] ; then
- /usr/bin/timeout --foreground "$kselftest_timeout" $1
+ /usr/bin/timeout --foreground "$kselftest_timeout" \
+ /usr/bin/timeout "$kselftest_timeout" $1
else
$1
fi
base-commit: d528014517f2b0531862c02865b9d4c908019dc4
--
2.39.2
Here is a first batch of fixes for v6.5 and older.
The fixes are not linked to each other.
Patch 1 ensures subflows are unhashed before cleaning the backlog to
avoid races. This fixes another recent fix from v6.4.
Patch 2 does not rely on implicit state check in mptcp_listen() to avoid
races when receiving an MP_FASTCLOSE. A regression from v5.17.
The rest fix issues in the selftests.
Patch 3 makes sure errors when setting up the environment are no longer
ignored. For v5.17+.
Patch 4 uses 'iptables-legacy' if available to be able to run on older
kernels. A fix for v5.13 and newer.
Patch 5 catches errors when issues are detected with packet marks. Also
for v5.13+.
Patch 6 uses the correct variable instead of an undefined one. Even if
there was no visible impact, it can help to find regressions later. An
issue visible in v5.19+.
Patch 7 makes sure errors with some sub-tests are reported to have the
selftest marked as failed as expected. Also for v5.19+.
Patch 8 adds a kernel config that is required to execute MPTCP
selftests. It is valid for v5.9+.
Patch 9 fixes issues when validating the userspace path-manager with
32-bit arch, an issue affecting v5.19+.
Signed-off-by: Matthieu Baerts <matthieu.baerts(a)tessares.net>
---
Matthieu Baerts (7):
selftests: mptcp: connect: fail if nft supposed to work
selftests: mptcp: sockopt: use 'iptables-legacy' if available
selftests: mptcp: sockopt: return error if wrong mark
selftests: mptcp: userspace_pm: use correct server port
selftests: mptcp: userspace_pm: report errors with 'remove' tests
selftests: mptcp: depend on SYN_COOKIES
selftests: mptcp: pm_nl_ctl: fix 32-bit support
Paolo Abeni (2):
mptcp: ensure subflow is unhashed before cleaning the backlog
mptcp: do not rely on implicit state check in mptcp_listen()
net/mptcp/protocol.c | 7 +++++-
tools/testing/selftests/net/mptcp/config | 1 +
tools/testing/selftests/net/mptcp/mptcp_connect.sh | 3 +++
tools/testing/selftests/net/mptcp/mptcp_sockopt.sh | 29 ++++++++++++----------
tools/testing/selftests/net/mptcp/pm_nl_ctl.c | 10 ++++----
tools/testing/selftests/net/mptcp/userspace_pm.sh | 4 ++-
6 files changed, 34 insertions(+), 20 deletions(-)
---
base-commit: 14bb236b29922c4f57d8c05bfdbcb82677f917c9
change-id: 20230704-upstream-net-20230704-misc-fixes-6-5-rc1-c52608649559
Best regards,
--
Matthieu Baerts <matthieu.baerts(a)tessares.net>
From: Jeff Xu <jeffxu(a)google.com>
When sysctl vm.memfd_noexec is 2 (MEMFD_NOEXEC_SCOPE_NOEXEC_ENFORCED),
memfd_create(.., MFD_EXEC) should fail.
This complies with how MEMFD_NOEXEC_SCOPE_NOEXEC_ENFORCED is
defined - "memfd_create() without MFD_NOEXEC_SEAL will be rejected"
Thanks to Dominique Martinet <asmadeus(a)codewreck.org> who reported the bug.
See [1] for context.
[1] https://lore.kernel.org/linux-mm/CABi2SkXUX_QqTQ10Yx9bBUGpN1wByOi_=gZU6WEy5…
History:
V2: fix build error when CONFIG_SYSCTL is not defined.
V1: initial version
https://lore.kernel.org/linux-mm/20230630031721.623955-3-jeffxu@google.com/…
Jeff Xu (2):
mm/memfd: sysctl: fix MEMFD_NOEXEC_SCOPE_NOEXEC_ENFORCED
selftests/memfd: sysctl: fix MEMFD_NOEXEC_SCOPE_NOEXEC_ENFORCED
mm/memfd.c | 57 +++++++++++++---------
tools/testing/selftests/memfd/memfd_test.c | 5 ++
2 files changed, 38 insertions(+), 24 deletions(-)
--
2.41.0.255.g8b1d071c50-goog
BPF applications, e.g., a TCP congestion control, might benefit from
precise packet timestamps. These timestamps are already available in
__sk_buff and bpf_sock_ops, but could not be requested: A BPF program
was not allowed to set SO_TIMESTAMPING* on a socket. This change enables
BPF programs to actively request the generation of timestamps from a
stream socket.
To reuse the setget_sockopt BPF prog test for
bpf_{get,set}sockopt(SO_TIMESTAMPING_NEW), also implement the missing
getsockopt(SO_TIMESTAMPING_NEW) in the network stack.
I reckon the way I added getsockopt(SO_TIMESTAMPING_NEW) causes an API
change: For existing users that set SO_TIMESTAMPING_NEW but queried
SO_TIMESTAMPING_OLD afterwards, it would now look as if no timestamping
flags have been set. Is this an acceptable change? If not, I’m happy to
change getsockopt() to only be strict about the newly-implemented
getsockopt(SO_TIMESTAMPING_NEW), or not distinguish between
SO_TIMESTAMPING_NEW and SO_TIMESTAMPING_OLD at all.
Jörn-Thorben Hinz (2):
net: Implement missing getsockopt(SO_TIMESTAMPING_NEW)
bpf: Allow setting SO_TIMESTAMPING* with bpf_setsockopt()
include/uapi/linux/bpf.h | 3 ++-
net/core/filter.c | 2 ++
net/core/sock.c | 9 +++++++--
tools/include/uapi/linux/bpf.h | 3 ++-
tools/testing/selftests/bpf/progs/bpf_tracing_net.h | 2 ++
tools/testing/selftests/bpf/progs/setget_sockopt.c | 4 ++++
6 files changed, 19 insertions(+), 4 deletions(-)
--
2.39.2
Hi Jon, Shuah & others,
I'd like to discuss test documentation with you.
I had some preliminary discussions with people interested in improving
tests during EOSS last week in Prague, as we're working to improve media
test coverage as well. During such discussions, I talked with developers
from several companies that have been collaborating with and/or using
Kernel CI. I also talked with Nikolai from Red Hat, who gave a presentation
about Kernel CI, which pointed out that one of the areas to be improved
there is documentation.
So, it seems it is worth having some discussions about how to improve
Kernel test documentation.
While kernel_doc does a pretty decent job documenting functions and data
structures, for tests, the most important things to be documented are:
a. what the tests do;
b. what functionalities they are testing.
This is a lot more important than documenting functions, and the data
structures used in tests are typically the ones that are part of the
driver's kAPI or uAPI, so they should be documented somewhere else.
Usually, (b) is not so simple, as, at least for complex hardware,
the tested features are grouped in a hierarchical way, like:
1. hardware
1.1 DMA engine
1.2 output ports
...
2. firmware
2.1 firmware load
2.2 firmware DMA actions
...
3. kernel features
3.1 memory allocation
3.2 mmap
3.3 bind/unbind
...
CI engines running the test sets usually want to produce a report that
provides pass rates for the tested features and functionalities that
are available in the drivers and their respective hardware and firmware.
I've been doing some work on the tool we use to test DRM code [1] in order
to have decent documentation of the tests we have hosted there, focusing
mostly on tests for the i915 and Xe Intel drivers, also covering
documentation for DRM core tests, while providing support for other vendors
to also improve their test documentation for IGT (IGT GPU tools and tests).
The documentation tool I developed is generic enough to be used for other
test sets and I believe it could be useful as well to document Kselftest
and KUnit.
The core of the tool (in test_list.py) is a Python class, with some callers
(igt_doc.py, xls_to_doc.py, doc_to_xls.py), and it is extensible enough to
also have other callers to integrate with external tools. We are
developing one internally to integrate with our internal Grafana reports,
to report the pass rate per documented feature in a hierarchical way.
Something similar to:
1. hardware pass rate: 98% (98 tests passed of 100)
1.1 DMA engine pass rate: 80% (8 tests passed of 10)
1.2 output ports pass rate: 100% (10 tests passed of 10)
...
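A report like the one above could be aggregated along these lines. This is
a minimal sketch, not the code of the internal tool: pass_rates(),
format_report() and the tuple-based feature paths are assumptions made for
the example.

```python
from collections import defaultdict


def pass_rates(results):
    """results: iterable of (feature_path, passed), where feature_path is a
    tuple such as ("hardware", "DMA engine"). Each result is counted at
    every level of its path."""
    counts = defaultdict(lambda: [0, 0])  # path prefix -> [passed, total]
    for path, passed in results:
        for depth in range(1, len(path) + 1):
            entry = counts[path[:depth]]
            entry[0] += int(passed)
            entry[1] += 1
    return {path: (ok, total, 100.0 * ok / total)
            for path, (ok, total) in counts.items()}


def format_report(rates):
    """Render the rates as an indented, hierarchical text report."""
    lines = []
    for path in sorted(rates):
        ok, total, pct = rates[path]
        indent = "  " * (len(path) - 1)
        lines.append(f"{indent}{path[-1]} pass rate: {pct:.0f}% "
                     f"({ok} tests passed of {total})")
    return "\n".join(lines)
```

Counting each result at every prefix of its path is what makes the parent
rate the aggregate of its children.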
It is based on the concept that test documentation should be placed as
close as possible to the actual code implementing the test sets. It was
also developed in a way that the documentation grouping is flexible.
The code was written from scratch in Python and was implemented
inside a class that can also be re-used to do other nice things,
like importing/exporting test documentation to spreadsheets and
integration with other tools (like Grafana).
The actual documentation tags look like this:
/**
* TEST: Check if new IGT test documentation logic functionality is working
* Category: Software build block
* Sub-category: documentation
* Functionality: test documentation
* Issue: none
* Description: Complete description of this test
*
* SUBTEST: foo
* Description: do foo things
* description continuing on another line
*
* SUBTEST: bar
* Description: do bar things
* description continuing on another line
* Functionality: bar test doc
*/
And it has support for wildcards.
There, "TEST" is associated with the contents of the file, while "SUBTEST"
refers to each specific subtest inside it. The valid fields are imported
from JSON config files, and can be arranged in a hierarchical way, in
order to produce hierarchical documentation. Fields defined at the
"TEST" level are imported on "SUBTEST", but can be overridden.
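The TEST/SUBTEST inheritance can be sketched like this. It is a simplified
model and not the parser from test_list.py: parse_test_doc() is a
hypothetical helper, the set of valid fields is passed in directly rather
than read from a JSON config, and wildcards are not handled.

```python
import re

LINE_RE = re.compile(r"^\s*\*\s?(.*)$")  # a " * text" comment line


def parse_test_doc(comment, valid_fields):
    """Return (test_fields, subtests); each subtest inherits the TEST-level
    fields unless it overrides them."""
    test, subtests = {}, {}
    current, last_key = None, None
    for raw in comment.splitlines():
        m = LINE_RE.match(raw)
        if not m:
            continue                      # e.g. the opening "/**"
        text = m.group(1).strip()
        if text in ("", "/"):             # blank line or the closing "*/"
            last_key = None
            continue
        key, sep, value = text.partition(":")
        key, value = key.strip(), value.strip()
        if sep and key == "TEST":
            current, last_key = test, "TEST"
            test["TEST"] = value
        elif sep and key == "SUBTEST":
            current, last_key = subtests.setdefault(value, {}), None
        elif sep and key in valid_fields and current is not None:
            current[key] = value
            last_key = key
        elif last_key is not None and current is not None:
            current[last_key] += " " + text   # continuation line
    inherited = {k: v for k, v in test.items() if k != "TEST"}
    return test, {name: {**inherited, **fields}
                  for name, fields in subtests.items()}
```

Merging the TEST-level dictionary first and the subtest's own fields second
is what lets a subtest override an inherited value such as Functionality.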
The JSON config file looks like this:
https://gitlab.freedesktop.org/drm/igt-gpu-tools/-/blob/158feaa20fa2b9424ee…
The output is in ReST, which can be generated in hierarchical or per-file
way. The hierarchical output looks like this:
$ ./scripts/igt_doc.py --config tests/xe/xe_test_config.json --file fubar_tests.c
===============================
Implemented Tests for Xe Driver
===============================
Category: Software build block
==============================
Sub-category: documentation
---------------------------
Functionality: bar test doc
^^^^^^^^^^^^^^^^^^^^^^^^^^^
``igt@fubar_tests@bar``
:Description: do bar things description continuing on another line
:Issue: none
Functionality: test documentation
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
``igt@fubar_tests@foo``
:Description: do foo things description continuing on another line
:Issue: none
(if --file is not used, it will use all C files specified at the
configuration)
The tool already skips tags like the ones used by kernel-doc[1], so one
could have both function documentation and per-test documentation on
the same file, if needed.
While this tool was conceived to be part of IGT, it doesn't have anything
specific to it [2], and I do believe it would be a great contribution to
the Kernel to have this tool upstreamed, and integrated as a Sphinx
extension.
If we decide to go ahead with adding it, I can work on a patchset to apply
it to the Kernel, modifying the scripts to better fit the Kernel's
needs and start with some documentation examples for i915,
DRM core and upcoming Xe KUnit tests.
Comments?
Regards,
Mauro
[1] It should be trivial to patch kernel-doc for it to skip TEST and
SUBTEST tags if we decide to integrate it to the kernel.
[2] except that tests there are named after IGT, as
igt@<test>@<subtest>@<dynamic_subtest>, but a change to a
Kernel-specific namespace would be trivial
Hi Noah,
On Thu, May 25, 2023 at 8:04 PM tip-bot2 for Noah Goldstein
<tip-bot2(a)linutronix.de> wrote:
> The following commit has been merged into the x86/misc branch of tip:
>
> Commit-ID: 688eb8191b475db5acfd48634600b04fd3dda9ad
> Gitweb: https://git.kernel.org/tip/688eb8191b475db5acfd48634600b04fd3dda9ad
> Author: Noah Goldstein <goldstein.w.n(a)gmail.com>
> AuthorDate: Wed, 10 May 2023 20:10:02 -05:00
> Committer: Dave Hansen <dave.hansen(a)linux.intel.com>
> CommitterDate: Thu, 25 May 2023 10:55:18 -07:00
>
> x86/csum: Improve performance of `csum_partial`
>
> 1) Add special case for len == 40 as that is the hottest value. The
> nets a ~8-9% latency improvement and a ~30% throughput improvement
> in the len == 40 case.
>
> 2) Use multiple accumulators in the 64-byte loop. This dramatically
> improves ILP and results in up to a 40% latency/throughput
> improvement (better for more iterations).
>
> Results from benchmarking on Icelake. Times measured with rdtsc()
> len lat_new lat_old r tput_new tput_old r
> 8 3.58 3.47 1.032 3.58 3.51 1.021
> 16 4.14 4.02 1.028 3.96 3.78 1.046
> 24 4.99 5.03 0.992 4.23 4.03 1.050
> 32 5.09 5.08 1.001 4.68 4.47 1.048
> 40 5.57 6.08 0.916 3.05 4.43 0.690
> 48 6.65 6.63 1.003 4.97 4.69 1.059
> 56 7.74 7.72 1.003 5.22 4.95 1.055
> 64 6.65 7.22 0.921 6.38 6.42 0.994
> 96 9.43 9.96 0.946 7.46 7.54 0.990
> 128 9.39 12.15 0.773 8.90 8.79 1.012
> 200 12.65 18.08 0.699 11.63 11.60 1.002
> 272 15.82 23.37 0.677 14.43 14.35 1.005
> 440 24.12 36.43 0.662 21.57 22.69 0.951
> 952 46.20 74.01 0.624 42.98 53.12 0.809
> 1024 47.12 78.24 0.602 46.36 58.83 0.788
> 1552 72.01 117.30 0.614 71.92 96.78 0.743
> 2048 93.07 153.25 0.607 93.28 137.20 0.680
> 2600 114.73 194.30 0.590 114.28 179.32 0.637
> 3608 156.34 268.41 0.582 154.97 254.02 0.610
> 4096 175.01 304.03 0.576 175.89 292.08 0.602
>
> There is no such thing as a free lunch, however, and the special case
> for len == 40 does add overhead to the len != 40 cases. This seems to
> amount to be ~5% throughput and slightly less in terms of latency.
>
> Testing:
> Part of this change is a new kunit test. The tests check all
> alignment X length pairs in [0, 64) X [0, 512).
> There are three cases.
> 1) Precomputed random inputs/seed. The expected results were
> generated using the generic implementation (which is assumed to be
> non-buggy).
> 2) An input of all 1s. The goal of this test is to catch any case
> where a carry is missing.
> 3) An input that never carries. The goal of this test is to catch
> any case of incorrectly carrying.
>
> More exhaustive tests that test all alignment X length pairs in
> [0, 8192) X [0, 8192] on random data are also available here:
> https://github.com/goldsteinn/csum-reproduction
>
> The repository also has the code for reproducing the above benchmark
> numbers.
>
> Signed-off-by: Noah Goldstein <goldstein.w.n(a)gmail.com>
> Signed-off-by: Dave Hansen <dave.hansen(a)linux.intel.com>
Thanks for your patch, which is now commit 688eb8191b475db5 ("x86/csum:
Improve performance of `csum_partial`") in linus/master stable/master
> Link: https://lore.kernel.org/all/20230511011002.935690-1-goldstein.w.n%40gmail.c…
This does not seem to be a message sent to a public mailing list
archived at lore (yet).
On m68k (ARAnyM):
KTAP version 1
# Subtest: checksum
1..3
# test_csum_fixed_random_inputs: ASSERTION FAILED at
lib/checksum_kunit.c:243
Expected result == expec, but
result == 54991 (0xd6cf)
expec == 33316 (0x8224)
not ok 1 test_csum_fixed_random_inputs
# test_csum_all_carry_inputs: ASSERTION FAILED at lib/checksum_kunit.c:267
Expected result == expec, but
result == 255 (0xff)
expec == 65280 (0xff00)
Endianness issue in the test?
not ok 2 test_csum_all_carry_inputs
# test_csum_no_carry_inputs: ASSERTION FAILED at lib/checksum_kunit.c:306
Expected result == expec, but
result == 64515 (0xfc03)
expec == 0 (0x0)
not ok 3 test_csum_no_carry_inputs
# checksum: pass:0 fail:3 skip:0 total:3
# Totals: pass:0 fail:3 skip:0 total:3
not ok 1 checksum
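To illustrate the endianness question: the 255 vs 65280 pair in the second failure is what a byte-swapped 16-bit result looks like. The following is a minimal Python sketch of an RFC 1071 style fold, not the kunit test or the kernel's csum_partial(); the names csum16 and swap16 are made up for illustration. It only shows that summing the same bytes as little-endian or big-endian words gives results that are byte swaps of each other; it does not prove that this is the cause of the failures here.

```python
def csum16(data: bytes, byteorder: str) -> int:
    """Fold data into a 16-bit ones' complement sum (RFC 1071 style)."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a trailing zero byte
    total = 0
    for i in range(0, len(data), 2):
        # Interpret each byte pair as a 16-bit word in the given byte order.
        total += int.from_bytes(data[i:i + 2], byteorder)
    while total >> 16:
        # End-around carry: add the overflow back into the low 16 bits.
        total = (total & 0xFFFF) + (total >> 16)
    return total


def swap16(value: int) -> int:
    """Swap the two bytes of a 16-bit value."""
    return ((value & 0xFF) << 8) | (value >> 8)


if __name__ == "__main__":
    word = b"\xff\x00"
    print(csum16(word, "big"))     # 65280 (0xff00)
    print(csum16(word, "little"))  # 255 (0xff)
```

If the expected values in a test were generated on a little-endian machine and compared without a byte-order conversion, a big-endian target would see this kind of mismatch.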
Gr{oetje,eeting}s,
Geert
--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert(a)linux-m68k.org
In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
-- Linus Torvalds
KVM_GET_REG_LIST dumps all register IDs that are available to
KVM_GET/SET_ONE_REG, which is very useful for identifying platform
regressions during VM migration.
Patches 1-7 restructure the aarch64 get-reg-list test so that some of
its code becomes a common test framework that can be shared with riscv.
Patch 8 moves the reject_set check logic into a function so that
different errnos can be checked for different registers.
Patch 9 moves finalize_vcpu back to run_test so that riscv can implement
its specific operation.
Patch 10 changes the test to do the get/set operations only on the
present-blessed list.
Patch 11 adds the skip_set facility so that riscv can skip the set
operation on some registers.
Patch 12 enables the KVM_GET_REG_LIST API in riscv.
Patch 13 adds the corresponding kselftest for checking possible
register regressions.
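As a rough illustration of the comparison such a test performs, here is a small Python sketch. It is not the selftest code: compare_reg_lists and the register IDs are hypothetical, and it assumes that "present-blessed" means the registers that are both reported by KVM_GET_REG_LIST and on the blessed list.

```python
from typing import Iterable, List, Tuple


def compare_reg_lists(
    present: Iterable[int], blessed: Iterable[int]
) -> Tuple[List[int], List[int], List[int]]:
    """Split register IDs into (new, missing, present_and_blessed)."""
    present = list(present)
    blessed = list(blessed)
    present_set = set(present)
    blessed_set = set(blessed)
    # Reported by the kernel but not on the blessed list.
    new = [reg for reg in present if reg not in blessed_set]
    # On the blessed list but no longer reported: a possible regression.
    missing = [reg for reg in blessed if reg not in present_set]
    # Only these would be exercised with get/set under the assumption above.
    both = [reg for reg in present if reg in blessed_set]
    return new, missing, both


if __name__ == "__main__":
    # Hypothetical register IDs, for illustration only.
    new, missing, both = compare_reg_lists([0x10, 0x11, 0x13], [0x10, 0x11, 0x12])
    print("new:", [hex(r) for r in new])
    print("missing:", [hex(r) for r in missing])
    print("get/set on:", [hex(r) for r in both])
```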
The get-reg-list kvm selftest was ported from aarch64 and tested with
Linux v6.4 on a Qemu riscv64 virt machine.
---
Changed since v4:
* Rebase to v6.4
* Address Andrew's suggestions and comments:
Added skip_set concept
Updated errno check logic
Modified finalize_vcpu as weak function
Andrew Jones (7):
KVM: arm64: selftests: Replace str_with_index with strdup_printf
KVM: arm64: selftests: Drop SVE cap check in print_reg
KVM: arm64: selftests: Remove print_reg's dependency on vcpu_config
KVM: arm64: selftests: Rename vcpu_config and add to kvm_util.h
KVM: arm64: selftests: Delete core_reg_fixup
KVM: arm64: selftests: Split get-reg-list test code
KVM: arm64: selftests: Finish generalizing get-reg-list
Haibo Xu (6):
KVM: arm64: selftests: Move reject_set check logic to a function
KVM: arm64: selftests: Move finalize_vcpu back to run_test
KVM: selftests: Only do get/set tests on present blessed list
KVM: selftests: Add skip_set facility to get_reg_list test
KVM: riscv: Add KVM_GET_REG_LIST API support
KVM: riscv: selftests: Add get-reg-list test
Documentation/virt/kvm/api.rst | 2 +-
arch/riscv/kvm/vcpu.c | 375 +++++++++
tools/testing/selftests/kvm/Makefile | 11 +-
.../selftests/kvm/aarch64/get-reg-list.c | 544 ++----------
tools/testing/selftests/kvm/get-reg-list.c | 395 +++++++++
.../selftests/kvm/include/kvm_util_base.h | 21 +
.../selftests/kvm/include/riscv/processor.h | 3 +
.../testing/selftests/kvm/include/test_util.h | 2 +
tools/testing/selftests/kvm/lib/test_util.c | 15 +
.../selftests/kvm/riscv/get-reg-list.c | 780 ++++++++++++++++++
10 files changed, 1655 insertions(+), 493 deletions(-)
create mode 100644 tools/testing/selftests/kvm/get-reg-list.c
create mode 100644 tools/testing/selftests/kvm/riscv/get-reg-list.c
--
2.34.1
Writing `subprocess.Popen[str]` requires python 3.9+.
kunit.py has an assertion that the python version is 3.7+, so we should
try to stay backwards compatible.
This conflicts a bit with commit 1da2e6220e11 ("kunit: tool: fix
pre-existing `mypy --strict` errors and update run_checks.py"), since
mypy complains like so
> kunit_kernel.py:95: error: Missing type parameters for generic type "Popen" [type-arg]
Note: `mypy --strict --python-version 3.7` does not work.
We could annotate each file with comments like
`# mypy: disable-error-code="type-arg"`
but then we might still get nudged to break back-compat in other files.
This patch adds a `mypy.ini` file since it seems like the only way to
disable specific error codes for all our files.
Note: run_checks.py doesn't need to specify `--config_file mypy.ini`,
but I think being explicit is better, particularly since most kernel
devs won't be familiar with how mypy works.
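The compatibility point above can be checked at runtime. This is a minimal sketch, not code from kunit_kernel.py: the helper names start and popen_is_subscriptable are made up for illustration. It shows that the bare `subprocess.Popen` annotation works on any supported interpreter, while evaluating `subprocess.Popen[str]` fails before Python 3.9.

```python
import subprocess
import sys
from typing import List


def start(params: List[str]) -> subprocess.Popen:
    """Start a child process; the unsubscripted annotation works on 3.7+."""
    return subprocess.Popen(params, stdout=subprocess.PIPE, text=True)


def popen_is_subscriptable() -> bool:
    """Report whether subprocess.Popen[str] can be evaluated at runtime."""
    try:
        subprocess.Popen[str]
    except TypeError:
        # Before Python 3.9: 'type' object is not subscriptable.
        return False
    return True


if __name__ == "__main__":
    proc = start([sys.executable, "-c", "print('ok')"])
    out, _ = proc.communicate()
    print(out.strip(), popen_is_subscriptable())
```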
Fixes: 695e26030858 ("kunit: tool: add subscripts for type annotations where appropriate")
Reported-by: SeongJae Park <sj(a)kernel.org>
Link: https://lore.kernel.org/linux-kselftest/20230501171520.138753-1-sj@kernel.o…
Signed-off-by: Daniel Latypov <dlatypov(a)google.com>
---
tools/testing/kunit/kunit_kernel.py | 6 +++---
tools/testing/kunit/mypy.ini | 6 ++++++
tools/testing/kunit/run_checks.py | 2 +-
3 files changed, 10 insertions(+), 4 deletions(-)
create mode 100644 tools/testing/kunit/mypy.ini
diff --git a/tools/testing/kunit/kunit_kernel.py b/tools/testing/kunit/kunit_kernel.py
index f01f94106129..7f648802caf6 100644
--- a/tools/testing/kunit/kunit_kernel.py
+++ b/tools/testing/kunit/kunit_kernel.py
@@ -92,7 +92,7 @@ class LinuxSourceTreeOperations:
if stderr: # likely only due to build warnings
print(stderr.decode())
- def start(self, params: List[str], build_dir: str) -> subprocess.Popen[str]:
+ def start(self, params: List[str], build_dir: str) -> subprocess.Popen:
raise RuntimeError('not implemented!')
@@ -113,7 +113,7 @@ class LinuxSourceTreeOperationsQemu(LinuxSourceTreeOperations):
kconfig.merge_in_entries(base_kunitconfig)
return kconfig
- def start(self, params: List[str], build_dir: str) -> subprocess.Popen[str]:
+ def start(self, params: List[str], build_dir: str) -> subprocess.Popen:
kernel_path = os.path.join(build_dir, self._kernel_path)
qemu_command = ['qemu-system-' + self._qemu_arch,
'-nodefaults',
@@ -142,7 +142,7 @@ class LinuxSourceTreeOperationsUml(LinuxSourceTreeOperations):
kconfig.merge_in_entries(base_kunitconfig)
return kconfig
- def start(self, params: List[str], build_dir: str) -> subprocess.Popen[str]:
+ def start(self, params: List[str], build_dir: str) -> subprocess.Popen:
"""Runs the Linux UML binary. Must be named 'linux'."""
linux_bin = os.path.join(build_dir, 'linux')
params.extend(['mem=1G', 'console=tty', 'kunit_shutdown=halt'])
diff --git a/tools/testing/kunit/mypy.ini b/tools/testing/kunit/mypy.ini
new file mode 100644
index 000000000000..ddd288309efa
--- /dev/null
+++ b/tools/testing/kunit/mypy.ini
@@ -0,0 +1,6 @@
+[mypy]
+strict = True
+
+# E.g. we can't write subprocess.Popen[str] until Python 3.9+.
+# But kunit.py tries to support Python 3.7+, so let's disable it.
+disable_error_code = type-arg
diff --git a/tools/testing/kunit/run_checks.py b/tools/testing/kunit/run_checks.py
index 8208c3b3135e..c6d494ea3373 100755
--- a/tools/testing/kunit/run_checks.py
+++ b/tools/testing/kunit/run_checks.py
@@ -23,7 +23,7 @@ commands: Dict[str, Sequence[str]] = {
'kunit_tool_test.py': ['./kunit_tool_test.py'],
'kunit smoke test': ['./kunit.py', 'run', '--kunitconfig=lib/kunit', '--build_dir=kunit_run_checks'],
'pytype': ['/bin/sh', '-c', 'pytype *.py'],
- 'mypy': ['mypy', '--strict', '--exclude', '_test.py$', '--exclude', 'qemu_configs/', '.'],
+ 'mypy': ['mypy', '--config-file', 'mypy.ini', '--exclude', '_test.py$', '--exclude', 'qemu_configs/', '.'],
}
# The user might not have mypy or pytype installed, skip them if so.
base-commit: a42077b787680cbc365a96446b30f32399fa3f6f
--
2.40.1.495.gc816e09b53d-goog
The events tracing infrastructure contains a lot of files and
directories (internally, inodes and dentries), and ends up consuming
several MB of memory. There can be multiple instances of events
tracing, each of which requires more memory.
Instead of creating inodes/dentries up front, eventfs can keep only the
metadata and skip their creation. eventfs then creates the
inodes/dentries on demand, only for the files/directories that are
actually needed. eventfs also deletes the inodes/dentries once they are
no longer required, but preserves the metadata.
Tracing events took ~9MB; with this approach they take ~4.5MB
for ~10K files/dirs.
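The idea can be modelled in a few lines. This is a toy Python sketch of lazy instantiation, not the eventfs implementation: the LazyDir class and its methods are hypothetical, and a plain Python object stands in for an inode/dentry pair.

```python
class LazyDir:
    """Keep cheap metadata for every entry; build heavy objects on demand."""

    def __init__(self, names):
        # Metadata is always kept, one small record per entry.
        self._meta = {name: {"name": name} for name in names}
        # Heavy objects (stand-ins for inodes/dentries) exist only while used.
        self._live = {}

    def lookup(self, name):
        if name not in self._meta:
            raise FileNotFoundError(name)
        if name not in self._live:
            self._live[name] = object()  # created on first access
        return self._live[name]

    def release(self, name):
        # Drop the heavy object but keep the metadata for later lookups.
        self._live.pop(name, None)

    def live_count(self):
        return len(self._live)


if __name__ == "__main__":
    events = LazyDir(["enable", "filter", "format"])
    events.lookup("enable")
    print(events.live_count())  # 1
    events.release("enable")
    print(events.live_count())  # 0
```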
v2:
Patch 01: new patch:'Require all trace events to have a TRACE_SYSTEM'
Patch 02: moved from v1 1/9
Patch 03: moved from v1 2/9
As suggested by Zheng Yejian, introduced eventfs_prepare_ef()
helper function to add files or directories to eventfs
fix WARNING reported by kernel test robot in v1 8/9
Patch 04: moved from v1 3/9
used eventfs_prepare_ef() to add files
fix WARNING reported by kernel test robot in v1 8/9
Patch 05: moved from v1 4/9
fix compiling warning reported by kernel test robot in v1 4/9
Patch 06: moved from v1 5/9
Patch 07: moved from v1 6/9
Patch 08: moved from v1 7/9
Patch 09: moved from v1 8/9
rebased because of v3 01/10
Patch 10: moved from v1 9/9
v1:
Patch 1: add header file
Patch 2: resolved kernel test robot issues
protecting eventfs lists using nested eventfs_rwsem
Patch 3: protecting eventfs lists using nested eventfs_rwsem
Patch 4: improve events cleanup code to fix crashes
Patch 5: resolved kernel test robot issues
removed d_instantiate_anon() calls
Patch 6: resolved kernel test robot issues
fix kprobe test in eventfs_root_lookup()
protecting eventfs lists using nested eventfs_rwsem
Patch 7: remove header file
Patch 8: pass eventfs_rwsem as argument to eventfs functions
called eventfs_remove_events_dir() instead of tracefs_remove()
from event_trace_del_tracer()
Patch 9: new patch to fix kprobe test case
fs/tracefs/Makefile | 1 +
fs/tracefs/event_inode.c | 757 ++++++++++++++++++
fs/tracefs/inode.c | 124 ++-
fs/tracefs/internal.h | 25 +
include/linux/trace_events.h | 1 +
include/linux/tracefs.h | 49 ++
kernel/trace/trace.h | 3 +-
kernel/trace/trace_events.c | 78 +-
.../ftrace/test.d/kprobe/kprobe_args_char.tc | 4 +-
.../test.d/kprobe/kprobe_args_string.tc | 4 +-
10 files changed, 994 insertions(+), 52 deletions(-)
create mode 100644 fs/tracefs/event_inode.c
create mode 100644 fs/tracefs/internal.h
--
2.40.0
Hi, Willy
Here is the v2 of our old patchset about test report [1].
The trailing '\r' fixup has been merged, so this only resends the
remaining parts, with an additional patch to restore printing of the
failed tests.
This patchset is rebased on the dev.2023.06.14a branch of linux-rcu [2].
Tests have passed for 'x86 run':
138 test(s) passed, 0 skipped, 0 failed.
See all results in /labs/linux-lab/src/linux-stable/tools/testing/selftests/nolibc/run.out
Also did 'run-user' for x86, mips and arm64.
Changes from v1 -> v2:
1. selftests/nolibc: add a standalone test report macro
As Willy pointed out, the old method with an additional test-report
target does not work with 'make -j'.
A new macro is added to share the same report logic among the
run-user, run and rerun targets, the path to test log file is
2. selftests/nolibc: always print the path to test log file
Always print the path to the test log file, but move it to a new line to
avoid annoying people when the tests pass without any failures.
3. selftests/nolibc: restore the failed tests print
Restore printing of the failed tests to avoid manually opening
the test log file when there really are failures.
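For readers unfamiliar with the report, the logic described in the three items above can be sketched in Python. This is not the Makefile macro from the patches; summarize() is a hypothetical helper, and it assumes that each result line in the log ends with [OK], [SKIPPED] or [FAIL].

```python
import re
from typing import List, Tuple

RESULT_RE = re.compile(r"\[(OK|SKIPPED|FAIL)\]\s*$")


def summarize(log_text: str, log_path: str) -> Tuple[str, List[str]]:
    """Return (report, failed_lines) for a nolibc-test style log."""
    counts = {"OK": 0, "SKIPPED": 0, "FAIL": 0}
    failed_lines = []
    for line in log_text.splitlines():
        match = RESULT_RE.search(line)
        if not match:
            continue  # not a result line
        counts[match.group(1)] += 1
        if match.group(1) == "FAIL":
            failed_lines.append(line.strip())
    report = "{} test(s) passed, {} skipped, {} failed.\n".format(
        counts["OK"], counts["SKIPPED"], counts["FAIL"]
    )
    # The path goes on its own line, as described in item 2.
    report += "See all results in {}".format(log_path)
    return report, failed_lines


if __name__ == "__main__":
    sample = "0 getpid = 1 [OK]\n14 chmod_self [SKIPPED]\n16 chown_self [FAIL]\n"
    report, failed = summarize(sample, "run.out")
    print("\n".join(failed))
    print(report)
```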
Best regards,
Zhangjin
---
[1]: https://lore.kernel.org/lkml/cover.1685936428.git.falcon@tinylab.org/
[2]: https://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu.git/
Zhangjin Wu (3):
selftests/nolibc: add a standalone test report macro
selftests/nolibc: always print the path to test log file
selftests/nolibc: restore the failed tests print
tools/testing/selftests/nolibc/Makefile | 19 +++++++------------
1 file changed, 7 insertions(+), 12 deletions(-)
--
2.25.1
Hi, Willy
This is the revision of the v4 part2 of support for rv32 [1]. It
further splits the generic KARCH code out of the old rv32 compile patch
and also adds the kernel specific KARCH and nolibc specific NARCH to
tools/include/nolibc/Makefile.
This is rebased on the dev.2023.06.14a branch of the linux-rcu repo [2]
and tested with the basic run-user and run tests.
Changes from v4 -> v5:
* selftests/nolibc: allow customize kernel specific ARCH variable
The KARCH customization support is split out of the old rv32 compile
patch, and the one passed to tools/include/nolibc/Makefile is removed.
* tools/nolibc: add kernel and nolibc specific ARCH variables
Pass original ARCH to tools/include/nolibc/Makefile, add KARCH and
NARCH for kernel and nolibc respectively.
* selftests/nolibc: riscv: customize makefile for rv32
Now, it is rv32 specific, no generic code.
Best regards,
Zhangjin
---
[1]: https://lore.kernel.org/linux-riscv/cover.1686128703.git.falcon@tinylab.org/
[2]: https://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu.git/
Zhangjin Wu (5):
tools/nolibc: fix up #error compile failures with -ENOSYS
tools/nolibc: fix up undeclared syscall macros with #ifdef and -ENOSYS
selftests/nolibc: allow customize kernel specific ARCH variable
tools/nolibc: add kernel and nolibc specific ARCH variables
selftests/nolibc: riscv: customize makefile for rv32
tools/include/nolibc/Makefile | 18 +++++++++---
tools/include/nolibc/sys.h | 38 ++++++++++++++++---------
tools/testing/selftests/nolibc/Makefile | 18 ++++++++++--
3 files changed, 55 insertions(+), 19 deletions(-)
--
2.25.1
Hi,
This patchset further improves porting of nolibc to new architectures;
it is based on our previous v5 sysret helper series [1].
It mainly shrinks the assembly _start by moving most of its operations
to a C function, _start_c(). It also removes the old sys_stat() support
in favor of sys_statx() and therefore removes all of the arch specific
sys_stat_struct definitions.
Tested 'run' on all of the supported architectures:
arch/board | result
------------|------------
arm/vexpress-a9 | 141 test(s) passed, 1 skipped, 0 failed. See all results in /labs/linux-lab/logging/nolibc/arm-vexpress-a9-nolibc-test.log
arm/virt | 141 test(s) passed, 1 skipped, 0 failed. See all results in /labs/linux-lab/logging/nolibc/arm-virt-nolibc-test.log
aarch64/virt | 141 test(s) passed, 1 skipped, 0 failed. See all results in /labs/linux-lab/logging/nolibc/aarch64-virt-nolibc-test.log
ppc/g3beige | not supported
ppc/ppce500 | not supported
i386/pc | 141 test(s) passed, 1 skipped, 0 failed. See all results in /labs/linux-lab/logging/nolibc/i386-pc-nolibc-test.log
x86_64/pc | 141 test(s) passed, 1 skipped, 0 failed. See all results in /labs/linux-lab/logging/nolibc/x86_64-pc-nolibc-test.log
mipsel/malta | 141 test(s) passed, 1 skipped, 0 failed. See all results in /labs/linux-lab/logging/nolibc/mipsel-malta-nolibc-test.log
loongarch64/virt | 141 test(s) passed, 1 skipped, 0 failed. See all results in /labs/linux-lab/logging/nolibc/loongarch64-virt-nolibc-test.log
riscv64/virt | 141 test(s) passed, 1 skipped, 0 failed. See all results in /labs/linux-lab/logging/nolibc/riscv64-virt-nolibc-test.log
riscv32/virt | 119 test(s) passed, 1 skipped, 22 failed. See all results in /labs/linux-lab/logging/nolibc/riscv32-virt-nolibc-test.log
s390x/s390-ccw-virtio | 141 test(s) passed, 1 skipped, 0 failed. See all results in /labs/linux-lab/logging/nolibc/s390x-s390-ccw-virtio-nolibc-test.log
Notes:
- ppc support is ready locally, will be sent out later.
- full riscv32/virt support is ready locally, will be sent out later.
Changes:
* tools/nolibc: remove old arch specific stat support
Just like the __NR_statx we used in nolibc-test.c, let's keep only
sys_statx() and use it to implement the stat() function.
Remove the old sys_stat() and sys_stat_struct completely.
* tools/nolibc: add new crt.h with _start_c
A new C version of _start_c() is added to only require a 'sp' argument
and find the others (argc, argv, envp/environ, auxv) for us in C.
* tools/nolibc: include crt.h before arch.h
Include crt.h before arch.h so that _start() can call the newly
added _start_c() in arch-<ARCH>.h.
* tools/nolibc: arm: shrink _start with _start_c
tools/nolibc: aarch64: shrink _start with _start_c
tools/nolibc: i386: shrink _start with _start_c
tools/nolibc: x86_64: shrink _start with _start_c
tools/nolibc: mips: shrink _start with _start_c
tools/nolibc: loongarch: shrink _start with _start_c
tools/nolibc: riscv: shrink _start with _start_c
tools/nolibc: s390: shrink _start with _start_c
Move most of the operations from the assembly _start() to the C
_start_c(); the assembly _start now only needs to do minimal
operations.
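To make the "only require a 'sp' argument" idea concrete, here is a small Python model of how the other values can be located from the initial stack. It is a sketch and not the crt.h code: the stack is modelled as a Python list following the usual Linux initial process stack layout (argc, argv entries, NULL, envp entries, NULL, auxv pairs, AT_NULL), pointers are replaced by the strings they would point to, NULL is modelled as None, and start_c is an illustrative name.

```python
AT_NULL = 0     # terminates the auxiliary vector
AT_PAGESZ = 6   # auxv type used in the demo below


def start_c(stack):
    """Recover (argc, argv, envp, auxv) from a model of the initial stack."""
    argc = stack[0]
    argv = stack[1:1 + argc]
    index = 1 + argc
    assert stack[index] is None  # NULL terminator after argv
    index += 1
    envp = []
    while stack[index] is not None:
        envp.append(stack[index])
        index += 1
    index += 1  # skip the NULL terminator after envp
    auxv = {}
    while stack[index] != AT_NULL:
        # auxv is a sequence of (type, value) pairs.
        auxv[stack[index]] = stack[index + 1]
        index += 2
    return argc, argv, envp, auxv


if __name__ == "__main__":
    stack = [2, "init", "-v", None, "HOME=/", None, AT_PAGESZ, 4096, AT_NULL, 0]
    print(start_c(stack))
```

In the real code the same walk is done with pointer arithmetic on the stack pointer passed in from the assembly stub.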
With this patchset, porting nolibc to a new architecture becomes easier;
the powerpc port will be added later.
Best regards,
Zhangjin
---
[1]: https://lore.kernel.org/lkml/cover.1687957589.git.falcon@tinylab.org/
Zhangjin Wu (11):
tools/nolibc: remove old arch specific stat support
tools/nolibc: add new crt.h with _start_c
tools/nolibc: include crt.h before arch.h
tools/nolibc: arm: shrink _start with _start_c
tools/nolibc: aarch64: shrink _start with _start_c
tools/nolibc: i386: shrink _start with _start_c
tools/nolibc: x86_64: shrink _start with _start_c
tools/nolibc: mips: shrink _start with _start_c
tools/nolibc: loongarch: shrink _start with _start_c
tools/nolibc: riscv: shrink _start with _start_c
tools/nolibc: s390: shrink _start with _start_c
tools/include/nolibc/Makefile | 1 +
tools/include/nolibc/arch-aarch64.h | 53 ++----------------
tools/include/nolibc/arch-arm.h | 79 ++-------------------------
tools/include/nolibc/arch-i386.h | 58 +++-----------------
tools/include/nolibc/arch-loongarch.h | 42 ++------------
tools/include/nolibc/arch-mips.h | 73 +++----------------------
tools/include/nolibc/arch-riscv.h | 65 ++--------------------
tools/include/nolibc/arch-s390.h | 60 ++------------------
tools/include/nolibc/arch-x86_64.h | 54 ++----------------
tools/include/nolibc/crt.h | 57 +++++++++++++++++++
tools/include/nolibc/nolibc.h | 1 +
tools/include/nolibc/signal.h | 1 +
tools/include/nolibc/stdio.h | 1 +
tools/include/nolibc/stdlib.h | 1 +
tools/include/nolibc/sys.h | 64 ++++------------------
tools/include/nolibc/time.h | 1 +
tools/include/nolibc/types.h | 4 +-
tools/include/nolibc/unistd.h | 1 +
18 files changed, 122 insertions(+), 494 deletions(-)
create mode 100644 tools/include/nolibc/crt.h
--
2.25.1
The kernel cmdline option panic_on_warn expects an integer; it is not a
plain option as documented. A number of users in the tree have figured
this out already and use panic_on_warn=1 for their purpose.
Adjust a comment which otherwise may mislead people in the future.
Fixes: 9e3961a097 ("kernel: add panic_on_warn")
Signed-off-by: Olaf Hering <olaf(a)aepfle.de>
---
Documentation/admin-guide/kernel-parameters.txt | 2 +-
tools/testing/selftests/rcutorture/bin/kvm.sh | 2 +-
2 files changed, 2 insertions(+), 2 deletions(-)
diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 9e5bab29685f..15196f84df49 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -4049,7 +4049,7 @@
extra details on the taint flags that users can pick
to compose the bitmask to assign to panic_on_taint.
- panic_on_warn panic() instead of WARN(). Useful to cause kdump
+ panic_on_warn=1 panic() instead of WARN(). Useful to cause kdump
on a WARN().
parkbd.port= [HW] Parallel port number the keyboard adapter is
diff --git a/tools/testing/selftests/rcutorture/bin/kvm.sh b/tools/testing/selftests/rcutorture/bin/kvm.sh
index 62f3b0f56e4d..d3cdc2d33d4b 100755
--- a/tools/testing/selftests/rcutorture/bin/kvm.sh
+++ b/tools/testing/selftests/rcutorture/bin/kvm.sh
@@ -655,4 +655,4 @@ fi
# Control buffer size: --bootargs trace_buf_size=3k
# Get trace-buffer dumps on all oopses: --bootargs ftrace_dump_on_oops
# Ditto, but dump only the oopsing CPU: --bootargs ftrace_dump_on_oops=orig_cpu
-# Heavy-handed way to also dump on warnings: --bootargs panic_on_warn
+# Heavy-handed way to also dump on warnings: --bootargs panic_on_warn=1
Hi, Thomas, David, Willy
Thanks very much for your kind review.
This is the revision of v3 "tools/nolibc: add a new syscall helper" [1];
it mainly applies the suggestion from David in this reply [2] and
rebases everything on the dev.2023.06.14a branch of linux-rcu [3].
The old __sysret() doesn't support syscalls with a pointer return
value; this revision now supports such syscalls. The remaining mmap()
syscall is converted to use this new __sysret(), with additional test cases.
Changes from v3 -> v4:
* tools/nolibc: sys.h: add a syscall return helper
tools/nolibc: unistd.h: apply __sysret() helper
tools/nolibc: sys.h: apply __sysret() helper
The original v3 series, no code change, except the Reviewed-by lines
from Thomas.
* tools/nolibc: unistd.h: reorder the syscall macros
Reorder the syscall macros by order of use and align most of them.
* tools/nolibc: add missing my_syscall6() for mips
required by mmap() syscall, this is the last missing my_syscall6().
* tools/nolibc: __sysret: support syscalls who return a pointer
Apply suggestion from David.
Let __sysret() also support syscalls with a pointer return value: the
return value is converted to unsigned long and the comparison with < 0 is
converted to a range check against [(unsigned long)-MAX_ERRNO, (unsigned long)-1].
This also allows returning a huge value (not a pointer) with the highest bit set.
This one can be merged into the first one if necessary.
* tools/nolibc: clean up mmap() support
Apply new __sysret(), clean up #ifdef and some macros.
* selftests/nolibc: add EXPECT_PTREQ, EXPECT_PTRNE and EXPECT_PTRER
selftests/nolibc: add sbrk_0 to test current brk getting
selftests/nolibc: add mmap and munmap test cases
Add some mmap & munmap test cases and the corresponding helpers, to
verify one of the new helpers, a sbrk_0 test case is also added.
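The range check described for __sysret() can be illustrated with a short Python model. This is a sketch, not the nolibc macro: it assumes a 64-bit unsigned long and MAX_ERRNO == 4095, and the function name sysret and the (value, errno) return convention are chosen for illustration only.

```python
MAX_ERRNO = 4095                 # assumed value of the largest error code
WORD_BITS = 64                   # assumed width of unsigned long
MASK = (1 << WORD_BITS) - 1


def sysret(raw):
    """Model of the check: return (value, errno) for a raw syscall result.

    An error is any result in [(unsigned long)-MAX_ERRNO, (unsigned long)-1].
    Everything else, including values with the top bit set, is a valid result.
    """
    raw &= MASK                          # view the result as unsigned long
    if raw >= (-MAX_ERRNO) & MASK:
        return -1, (-raw) & MASK         # error: recover the positive errno
    return raw, 0                        # success: pass the value through


if __name__ == "__main__":
    print(sysret(-12 & MASK))                 # (-1, 12)
    print(sysret(0xFFFF800000000000))         # (18446603336221196288, 0)
```

A plain "< 0" test on a signed long would wrongly treat the second value as an error, which is the case the pointer-aware check avoids.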
Best regards,
Zhangjin
---
[1]: https://lore.kernel.org/linux-riscv/87e7a391-b97b-4001-b12a-76d20790563e@t-…
[2]: https://lore.kernel.org/linux-riscv/94dd5170929f454fbc0a10a2eb3b108d@AcuMS.…
[3]: https://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu.git/
Zhangjin Wu (10):
tools/nolibc: sys.h: add a syscall return helper
tools/nolibc: unistd.h: apply __sysret() helper
tools/nolibc: sys.h: apply __sysret() helper
tools/nolibc: unistd.h: reorder the syscall macros
tools/nolibc: add missing my_syscall6() for mips
tools/nolibc: __sysret: support syscalls who return a pointer
tools/nolibc: clean up mmap() support
selftests/nolibc: add EXPECT_PTREQ, EXPECT_PTRNE and EXPECT_PTRER
selftests/nolibc: add sbrk_0 to test current brk getting
selftests/nolibc: add mmap and munmap test cases
tools/include/nolibc/arch-mips.h | 26 ++
tools/include/nolibc/nolibc.h | 9 +-
tools/include/nolibc/sys.h | 391 +++----------------
tools/include/nolibc/types.h | 11 +
tools/include/nolibc/unistd.h | 13 +-
tools/testing/selftests/nolibc/nolibc-test.c | 90 +++++
6 files changed, 191 insertions(+), 349 deletions(-)
--
2.25.1
When running Kselftests with the current selftests/net/config
the following problem can be seen with the net:xfrm_policy.sh
selftest:
# selftests: net: xfrm_policy.sh
[ 41.076721] IPv6: ADDRCONF(NETDEV_CHANGE): veth0: link becomes ready
[ 41.094787] IPv6: ADDRCONF(NETDEV_CHANGE): veth0: link becomes ready
[ 41.107635] IPv6: ADDRCONF(NETDEV_CHANGE): veth0: link becomes ready
# modprobe: FATAL: Module ip_tables not found in directory /lib/modules/6.1.36
# iptables v1.8.7 (legacy): can't initialize iptables table `filter': Table does not exist (do you need to insmod?)
# Perhaps iptables or your kernel needs to be upgraded.
# modprobe: FATAL: Module ip_tables not found in directory /lib/modules/6.1.36
# iptables v1.8.7 (legacy): can't initialize iptables table `filter': Table does not exist (do you need to insmod?)
# Perhaps iptables or your kernel needs to be upgraded.
# SKIP: Could not insert iptables rule
ok 1 selftests: net: xfrm_policy.sh # SKIP
This is because IPsec "policy" match support is not available
to the kernel.
This patch adds CONFIG_NETFILTER_XT_MATCH_POLICY as a module
to the selftests/net/config file, so that `make
kselftest-merge` can take this into consideration.
Signed-off-by: Daniel Díaz <daniel.diaz(a)linaro.org>
---
tools/testing/selftests/net/config | 1 +
1 file changed, 1 insertion(+)
diff --git a/tools/testing/selftests/net/config b/tools/testing/selftests/net/config
index d1d421ec10a3..cd3cc52c59b4 100644
--- a/tools/testing/selftests/net/config
+++ b/tools/testing/selftests/net/config
@@ -50,3 +50,4 @@ CONFIG_CRYPTO_SM4_GENERIC=y
CONFIG_AMT=m
CONFIG_VXLAN=m
CONFIG_IP_SCTP=m
+CONFIG_NETFILTER_XT_MATCH_POLICY=m
--
2.34.1
From: Björn Töpel <bjorn(a)rivosinc.com>
When you're cross-building kselftest, in this case RISC-V:
| make ARCH=riscv CROSS_COMPILE=riscv64-linux-gnu- O=/tmp/kselftest \
| HOSTCC=gcc FORMAT= SKIP_TARGETS="arm64 ia64 powerpc sparc64 x86 \
| sgx" -C tools/testing/selftests gen_tar
the components (paths) that fail to build are skipped. In this case,
openat2 failed due to missing library support, and proc due to an
x86-64 only test.
This tiny series addresses the problems above.
Björn
Björn Töpel (2):
selftests/openat2: Run-time check for -fsanitize=undefined
selftests/proc: Do not build x86-64 tests on non-x86-64 builds
tools/testing/selftests/openat2/Makefile | 9 ++++++++-
tools/testing/selftests/proc/Makefile | 4 ++++
2 files changed, 12 insertions(+), 1 deletion(-)
base-commit: 3a8a670eeeaa40d87bd38a587438952741980c18
--
2.39.2
Hi, all
Thanks very much for your review suggestions of the v1 series [1], we
just sent out the generic part1 [2], and here is the part2 of the whole
v2 revision.
Changes from v1 -> v2:
* Don't emulate the return values in the new syscalls path, fix up or
support the new syscalls in the side of the related test cases (1-3)
selftests/nolibc: remove gettimeofday_bad1/2 completely
selftests/nolibc: support two errnos with EXPECT_SYSER2()
selftests/nolibc: waitpid_min: add waitid syscall support
(Review suggestions from Willy and Thomas)
* Fix up new failure of the stat_timestamps test case (4, new)
tools/nolibc: add missing nanoseconds support for __NR_statx
(Fixes for the commit a89c937d781a ("tools/nolibc: support nanoseconds in stat()"))
* Add new waitstatus macros as a standalone patch for the waitid support (5)
tools/nolibc: add more wait status related types
(Split and Cleanup for the waitid syscall based sys_wait4)
* Pure 64bit lseek and time64 select/poll/gettimeofday support (6-11)
tools/nolibc: add pure 64bit off_t, time_t and blkcnt_t
tools/nolibc: sys_lseek: add pure 64bit lseek
tools/nolibc: add pure 64bit time structs
tools/nolibc: sys_select: add pure 64bit select
tools/nolibc: sys_poll: add pure 64bit poll
tools/nolibc: sys_gettimeofday: add pure 64bit gettimeofday
(Review suggestions from Arnd, Thomas and Willy, time32 variants have
been removed completely and some fixups)
* waitid syscall support cleanup (12)
tools/nolibc: sys_wait4: add waitid syscall support
(Sync with the waitstatus macros update and Removal of emulated code)
* rv32 nolibc-test support, commit message update (13)
selftests/nolibc: riscv: customize makefile for rv32
(Review suggestions from Thomas, explain more about the change logic in commit message)
Best regards,
Zhangjin
---
[1]: https://lore.kernel.org/linux-riscv/20230529113143.GB2762@1wt.eu/T/#t
[2]: https://lore.kernel.org/linux-riscv/cover.1685362482.git.falcon@tinylab.org/
Zhangjin Wu (13):
selftests/nolibc: remove gettimeofday_bad1/2 completely
selftests/nolibc: support two errnos with EXPECT_SYSER2()
selftests/nolibc: waitpid_min: add waitid syscall support
tools/nolibc: add missing nanoseconds support for __NR_statx
tools/nolibc: add more wait status related types
tools/nolibc: add pure 64bit off_t, time_t and blkcnt_t
tools/nolibc: sys_lseek: add pure 64bit lseek
tools/nolibc: add pure 64bit time structs
tools/nolibc: sys_select: add pure 64bit select
tools/nolibc: sys_poll: add pure 64bit poll
tools/nolibc: sys_gettimeofday: add pure 64bit gettimeofday
tools/nolibc: sys_wait4: add waitid syscall support
selftests/nolibc: riscv: customize makefile for rv32
tools/include/nolibc/arch-aarch64.h | 3 -
tools/include/nolibc/arch-loongarch.h | 3 -
tools/include/nolibc/arch-riscv.h | 3 -
tools/include/nolibc/std.h | 28 ++--
tools/include/nolibc/sys.h | 134 +++++++++++++++----
tools/include/nolibc/types.h | 58 +++++++-
tools/testing/selftests/nolibc/Makefile | 11 +-
tools/testing/selftests/nolibc/nolibc-test.c | 20 +--
8 files changed, 202 insertions(+), 58 deletions(-)
--
2.25.1
This extension allows F_UNLCK to be used on query, which currently
returns EINVAL. Instead it can be used to query the locks on a
particular fd - something that is not currently possible. The basic idea
is that on F_OFD_GETLK, F_UNLCK would "conflict" with (or query) any type
of lock on the same fd, and ignore any locks on other fds.
Use-cases:
1. CRIU-like scenario where you want to read the locking info from an
fd for later reconstruction. This can now be done by setting
l_start and l_len to 0 to cover the entire file range, and doing
F_OFD_GETLK. In a loop you need to advance l_start past the returned
lock ranges, to eventually collect all locked ranges.
2. Implementing the lock checking/enforcing policy.
Say you want to implement an "auditor" module in your program,
that checks that the I/O is done only after the proper locking is
applied on a file region. In this case you need to know if the
particular region is locked on that fd, and if so - with what type
of the lock. If you did that currently (without this extension),
you could only check for write locks, and for that you would need to
probe the lock on your fd and then open the same file via another fd and
probe there. That way you can identify the write lock on a particular
fd, but such a trick is non-atomic and complex. As for finding out the
read lock on a particular fd - impossible.
This extension allows such queries without any extra effort.
3. Implementing the mandatory locking policy.
Suppose you want to make a policy where the write lock inhibits any
unlocked readers and writers. Currently you need to check if the
write lock is present on some other fd, and if it is not there - allow
the I/O operation. But because the write lock can appear at any moment,
you need to do that under some global lock, which can be released only
when the I/O operation is finished.
With the proposed extension you can instead just check the write lock
on your own fd first, and if it is there - allow the I/O operation on
that fd without using any global lock. Only if there is no write lock
on this fd, then you need to take global lock and check for a write
lock on other fds.
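The collection loop from use-case 1 can be sketched as follows. This is a Python model under stated assumptions, not code from the patches: the real query would be an fcntl(fd, F_OFD_GETLK, ...) call with l_type set to F_UNLCK on a kernel carrying this extension, whereas here it is abstracted as a getlk(start) callback, and make_fake_getlk is a made-up stand-in for the kernel so that the loop can be run anywhere. It also assumes the query returns the lowest-starting lock that overlaps the requested range.

```python
from typing import Callable, List, Optional, Tuple

# (type, l_start, l_len); l_len == 0 means "up to the end of the file".
Lock = Tuple[str, int, int]


def collect_locks(getlk: Callable[[int], Optional[Lock]]) -> List[Lock]:
    """Collect all locks on one fd by repeatedly querying from l_start."""
    locks = []
    start = 0
    while True:
        found = getlk(start)         # query [start, EOF), i.e. l_len = 0
        if found is None:            # models l_type coming back as F_UNLCK
            break
        locks.append(found)
        _, l_start, l_len = found
        if l_len == 0:               # lock extends to EOF, nothing beyond it
            break
        start = l_start + l_len      # advance past the returned range
    return locks


def make_fake_getlk(locks: List[Lock]) -> Callable[[int], Optional[Lock]]:
    """Stand-in for the kernel: first lock overlapping [start, EOF)."""
    def getlk(start: int) -> Optional[Lock]:
        for lock in sorted(locks, key=lambda item: item[1]):
            _, l_start, l_len = lock
            if l_len == 0 or l_start + l_len > start:
                return lock
        return None
    return getlk


if __name__ == "__main__":
    held = [("F_WRLCK", 20, 5), ("F_RDLCK", 0, 10), ("F_RDLCK", 100, 0)]
    print(collect_locks(make_fake_getlk(held)))
```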
The second patch adds a test-case for OFD locks.
It tests both the generic things and the proposed extension.
The third patch is a proposed man page update for fcntl(2)
(not for the linux source tree)
Changes in v3:
- Move selftest to selftests/filelock
Changes in v2:
- Dropped the l_pid extension patch and updated test-case accordingly.
Stas Sergeev (2):
fs/locks: F_UNLCK extension for F_OFD_GETLK
selftests: add OFD lock tests
fs/locks.c | 23 +++-
tools/testing/selftests/filelock/Makefile | 5 +
tools/testing/selftests/filelock/ofdlocks.c | 132 ++++++++++++++++++++
3 files changed, 157 insertions(+), 3 deletions(-)
create mode 100644 tools/testing/selftests/filelock/Makefile
create mode 100644 tools/testing/selftests/filelock/ofdlocks.c
CC: Jeff Layton <jlayton(a)kernel.org>
CC: Chuck Lever <chuck.lever(a)oracle.com>
CC: Alexander Viro <viro(a)zeniv.linux.org.uk>
CC: Christian Brauner <brauner(a)kernel.org>
CC: linux-fsdevel(a)vger.kernel.org
CC: linux-kernel(a)vger.kernel.org
CC: Shuah Khan <shuah(a)kernel.org>
CC: linux-kselftest(a)vger.kernel.org
CC: linux-api(a)vger.kernel.org
--
2.39.2
Willy, Thomas
This is v3 of the series that lets nolibc-test run with a minimal kernel
config, see v2 [1].
It applies further suggestions from Thomas and is based on our previous v5
sysret helper series [2] and Thomas' chmod_net removal patchset [3].
Now, a test report on arm/vexpress-a9 without procfs, shmem, tmpfs, net
and memfd_create looks like:
LOG: testing report for arm/vexpress-a9:
14 chmod_self [SKIPPED]
16 chown_self [SKIPPED]
40 link_cross [SKIPPED]
0 -fstackprotector not supported [SKIPPED]
139 test(s) passed, 4 skipped, 0 failed.
See all results in /labs/linux-lab/logging/nolibc/arm-vexpress-a9-nolibc-test.log
LOG: testing summary:
arch/board | result
------------|------------
arm/vexpress-a9 | 139 test(s) passed, 4 skipped, 0 failed. See all results in /labs/linux-lab/logging/nolibc/arm-vexpress-a9-nolibc-test.log
Changes from v2 --> v3:
* Added Reviewed-by from Thomas for the whole series, Many Thanks
* selftests/nolibc: stat_fault: silence NULL argument warning with glibc
selftests/nolibc: gettid: restore for glibc and musl
selftests/nolibc: add _LARGEFILE64_SOURCE for musl
selftests/nolibc: fix up int_fast16/32_t test cases for musl
selftests/nolibc: fix up kernel parameters support
selftests/nolibc: stat_timestamps: remove procfs dependency
selftests/nolibc: link_cross: use /proc/self/cmdline
tools/nolibc: add rmdir() support
selftests/nolibc: add a new rmdir() test case
selftests/nolibc: fix up failures when CONFIG_PROC_FS=n
selftests/nolibc: vfprintf: remove MEMFD_CREATE dependency
No code changes except some commit message cleanups.
* selftests/nolibc: prepare /tmp for tmpfs or ramfs
As suggested by Thomas, simply calling mkdir() and mount() to
prepare /tmp can save a stat() call.
* selftests/nolibc: chroot_exe: remove procfs dependency
As suggested by Thomas, remove the 'weird' get_tmpfile() and use
'/init' for !procfs as we did for stat_timestamps.
For the worst case, when '/init' is not there, add ENOENT to
the error check list.
Now it is a one-line code change.
* selftests/nolibc: add chmod_tmpdir test
Without get_tmpfile(), directly mkdir() a temp directory for the
chmod_tmpdir test; it functions as a substitute for the removed
chmod_net.
Now it is a one-line code change.
Best regards,
Zhangjin
---
[1]: https://lore.kernel.org/lkml/cover.1688078604.git.falcon@tinylab.org/
Zhangjin Wu (14):
selftests/nolibc: stat_fault: silence NULL argument warning with glibc
selftests/nolibc: gettid: restore for glibc and musl
selftests/nolibc: add _LARGEFILE64_SOURCE for musl
selftests/nolibc: fix up int_fast16/32_t test cases for musl
selftests/nolibc: fix up kernel parameters support
selftests/nolibc: stat_timestamps: remove procfs dependency
selftests/nolibc: chroot_exe: remove procfs dependency
selftests/nolibc: link_cross: use /proc/self/cmdline
tools/nolibc: add rmdir() support
selftests/nolibc: add a new rmdir() test case
selftests/nolibc: fix up failures when CONFIG_PROC_FS=n
selftests/nolibc: prepare /tmp for tmpfs or ramfs
selftests/nolibc: add chmod_tmpdir test
selftests/nolibc: vfprintf: remove MEMFD_CREATE dependency
tools/include/nolibc/sys.h | 22 ++++++
tools/testing/selftests/nolibc/nolibc-test.c | 83 +++++++++++++++-----
2 files changed, 87 insertions(+), 18 deletions(-)
--
2.25.1
This is the initial KUnit integration for running Rust documentation
tests within the kernel.
Thank you to the KUnit team for all the input and feedback on this
over the months, as well as the Intel LKP 0-Day team!
This may be merged through either the KUnit or the Rust trees. If
the KUnit team wants to merge it, then that would be great.
Please see the message in the main commit for the details.
Miguel Ojeda (6):
rust: init: make doctests compilable/testable
rust: str: make doctests compilable/testable
rust: sync: make doctests compilable/testable
rust: types: make doctests compilable/testable
rust: support running Rust documentation tests as KUnit ones
MAINTAINERS: add Rust KUnit files to the KUnit entry
MAINTAINERS | 2 +
lib/Kconfig.debug | 13 +++
rust/.gitignore | 2 +
rust/Makefile | 29 ++++++
rust/bindings/bindings_helper.h | 1 +
rust/helpers.c | 7 ++
rust/kernel/init.rs | 25 +++--
rust/kernel/kunit.rs | 156 ++++++++++++++++++++++++++++
rust/kernel/lib.rs | 2 +
rust/kernel/str.rs | 4 +-
rust/kernel/sync/arc.rs | 9 +-
rust/kernel/sync/lock/mutex.rs | 1 +
rust/kernel/sync/lock/spinlock.rs | 1 +
rust/kernel/types.rs | 6 +-
scripts/.gitignore | 2 +
scripts/Makefile | 4 +
scripts/rustdoc_test_builder.rs | 73 ++++++++++++++
scripts/rustdoc_test_gen.rs | 162 ++++++++++++++++++++++++++++++
18 files changed, 484 insertions(+), 15 deletions(-)
create mode 100644 rust/kernel/kunit.rs
create mode 100644 scripts/rustdoc_test_builder.rs
create mode 100644 scripts/rustdoc_test_gen.rs
base-commit: d2e3115d717197cb2bc020dd1f06b06538474ac3
--
2.41.0
TCP SYN/ACK packets of connections from processes/sockets outside a
cgroup on the same host are not received by the cgroup's installed
cgroup_skb filters.
There were two BPF cgroup_skb programs attached to a cgroup named
"my_cgroup".
SEC("cgroup_skb/ingress")
int ingress(struct __sk_buff *skb)
{
/* .... process skb ... */
return 1;
}
SEC("cgroup_skb/egress")
int egress(struct __sk_buff *skb)
{
/* .... process skb ... */
return 1;
}
We discovered that when running the command "nc -6 -l 8000" in
"my_cgroup" and connecting to it from outside of "my_cgroup" with the
command "nc -6 localhost 8000", the egress filter did not detect the
SYN/ACK packet. However, we did observe the SYN/ACK packet at the
ingress when connecting from a socket in "my_cgroup" to a socket
outside of it.
We came across BPF_CGROUP_RUN_PROG_INET_EGRESS(). This macro is
responsible for calling BPF programs that are attached to the egress
hook of a cgroup and it skips programs if the sending socket is not the
owner of the skb. Specifically, in our situation, the SYN/ACK
skb is owned by a struct request_sock instance, but the sending
socket is the listener socket we use to receive incoming
connections. The request_sock is created to manage an incoming
connection.
It has been determined that checking the owner of a skb against
the sending socket is not required. Removing this check will allow the
filters to receive SYN/ACK packets.
A new self-test has been added to ensure that cgroup_skb filters
receive all signaling packets, including SYN, SYN/ACK, ACK, FIN, and
FIN/ACK.
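To make the ownership problem concrete, here is a small userspace model
of the check. It is only a sketch: the model_* types and function names
are invented for illustration and merely mimic the roles of skb->sk, a
request_sock and the kernel's sk_to_full_sk() helper. It compares the
strict "sender owns the skb" rule with a rule that compares against the
full socket, as the title of the first patch suggests.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for kernel objects. */
struct model_sock {
	bool fullsock;               /* false for a request_sock-like mini socket */
	struct model_sock *listener; /* for a mini socket: the listener behind it */
};

struct model_skb {
	struct model_sock *owner;    /* plays the role of skb->sk */
};

/* Map a mini socket to its full socket, in the spirit of sk_to_full_sk(). */
struct model_sock *model_full_sk(struct model_sock *sk)
{
	if (sk && !sk->fullsock)
		return sk->listener;
	return sk;
}

/* Strict rule: run egress programs only if the sender itself owns the skb. */
bool egress_runs_strict(struct model_sock *sender, const struct model_skb *skb)
{
	return sender && sender == skb->owner;
}

/* Relaxed rule: compare the sender against the full socket of the owner. */
bool egress_runs_full(struct model_sock *sender, const struct model_skb *skb)
{
	return sender && sender == model_full_sk(skb->owner);
}
```

With a SYN/ACK owned by a mini socket and sent through the listener, the
strict rule skips the egress programs while the relaxed rule runs them;
for ordinary data skbs owned by the sender both rules agree.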
Changes from v3:
- Check SKB ownership against full socket instead of just remove the
check.
- Address the issue raised by Yonghong.
- Put more details down in the commit message.
Changes from v2:
- Remove redundant blank lines.
Changes from v1:
- Check the number of observed packets instead of just sleeping.
- Use ASSERT_XXX() instead of CHECK().
[v1] https://lore.kernel.org/all/20230612191641.441774-1-kuifeng@meta.com/
[v2] https://lore.kernel.org/all/20230617052756.640916-2-kuifeng@meta.com/
[v3] https://lore.kernel.org/all/20230620171409.166001-1-kuifeng@meta.com/
Kui-Feng Lee (2):
net: bpf: Check SKB ownership against full socket.
selftests/bpf: Verify that the cgroup_skb filters receive expected
packets.
include/linux/bpf-cgroup.h | 4 +-
tools/testing/selftests/bpf/cgroup_helpers.c | 12 +
tools/testing/selftests/bpf/cgroup_helpers.h | 1 +
tools/testing/selftests/bpf/cgroup_tcp_skb.h | 35 ++
.../selftests/bpf/prog_tests/cgroup_tcp_skb.c | 402 ++++++++++++++++++
.../selftests/bpf/progs/cgroup_tcp_skb.c | 382 +++++++++++++++++
6 files changed, 834 insertions(+), 2 deletions(-)
create mode 100644 tools/testing/selftests/bpf/cgroup_tcp_skb.h
create mode 100644 tools/testing/selftests/bpf/prog_tests/cgroup_tcp_skb.c
create mode 100644 tools/testing/selftests/bpf/progs/cgroup_tcp_skb.c
--
2.34.1
Willy, Thomas
This is v2 of the series that lets nolibc-test run with a minimal kernel
config, see v1 [1].
It mainly applies the suggestions from Thomas. It is based on our
previous v5 sysret helper series [2] and Thomas' chmod_net removal
patchset [3].
Now, a test report on arm/vexpress-a9 without procfs, shmem, tmpfs, net
and memfd_create looks like:
LOG: testing report for arm/vexpress-a9:
14 chmod_net [SKIPPED]
15 chmod_self [SKIPPED]
17 chown_self [SKIPPED]
41 link_cross [SKIPPED]
0 -fstackprotector not supported [SKIPPED]
139 test(s) passed, 5 skipped, 0 failed.
See all results in /labs/linux-lab/logging/nolibc/arm-vexpress-a9-nolibc-test.log
LOG: testing summary:
arch/board | result
------------|------------
arm/vexpress-a9 | 139 test(s) passed, 5 skipped, 0 failed. See all results in /labs/linux-lab/logging/nolibc/arm-vexpress-a9-nolibc-test.log
Changes from v1 --> v2:
* selftests/nolibc: stat_fault: silence NULL argument warning with glibc
selftests/nolibc: gettid: restore for glibc and musl
selftests/nolibc: add _LARGEFILE64_SOURCE for musl
selftests/nolibc: add a new rmdir() test case
selftests/nolibc: fix up failures when CONFIG_PROC_FS=n
The same as v1, with only a few commit message changes.
* selftests/nolibc: fix up int_fast16/32_t test cases for musl
Applied the method suggested by Thomas, two new macros are added to
get SINT_MAX_OF_TYPE(type) and SINT_MIN_OF_TYPE(type).
* selftests/nolibc: fix up kernel parameters support
After discussion with Thomas and more testing, both argv[1] and the
NOLIBC_TEST environment variable are now verified, to support
kernel parameters such as:
NOLIBC_TEST=syscall
noapic NOLIBC_TEST=syscall
noapic
* selftests/nolibc: stat_timestamps: remove procfs dependency
Add '/init' and '/' for !procfs, don't skip it.
* selftests/nolibc: link_cross: use /proc/self/cmdline
Use /proc/self/cmdline instead of /proc/self/net, the ramfs based
/tmp/file doesn't work as expected (not really crossdev).
* tools/nolibc: add rmdir() support
Now, rebased on __sysret() from sysret helper patchset [2].
* selftests/nolibc: prepare /tmp for tmpfs or ramfs
Removed the hugetlbfs prepare part, not really required.
Don't remove /tmp and reserve it to use ramfs as tmpfs.
* selftests/nolibc: add common get_tmpfile()
selftests/nolibc: rename chroot_exe to chroot_tmpfile
Some cleanups.
* selftests/nolibc: add chmod_tmpfile test
To avoid conflict with Thomas' chmod_net removal patch [3], a new
chmod_tmpfile is added (in v1, there is a rename patch from
chmod_net to chmod_good)
Still to avoid conflict, these two are removed in this series:
- selftests/nolibc: rename proc variable to has_proc
- selftests/nolibc: rename euid0 variable to is_root
* selftests/nolibc: vfprintf: remove MEMFD_CREATE dependency
Many checks are removed; only the direct tmpfs access version is
kept.
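For readers wondering what the two macros mentioned for the
int_fast16/32_t fix might look like, here is one possible formulation.
It is a sketch under the assumption of two's complement signed types, not
necessarily the exact definition used in the patch; it derives the limits
from sizeof(type) so the test no longer hard-codes a width that differs
between libcs.

```c
#include <limits.h>
#include <stdint.h>

/*
 * Largest value of a signed integer type, built from two halves so that no
 * intermediate result overflows: (2^(N-2) - 1) + 2^(N-2) == 2^(N-1) - 1.
 */
#define SINT_MAX_OF_TYPE(type) \
	(((type)1 << (sizeof(type) * CHAR_BIT - 2)) - (type)1 + \
	 ((type)1 << (sizeof(type) * CHAR_BIT - 2)))

/* Smallest value, assuming a two's complement representation. */
#define SINT_MIN_OF_TYPE(type) (-SINT_MAX_OF_TYPE(type) - 1)
```

A test case can then compare INT_FAST16_MAX against
SINT_MAX_OF_TYPE(int_fast16_t) regardless of whether the libc defines the
fast types as 32-bit or 64-bit.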
Best regards,
Zhangjin
---
[1]: https://lore.kernel.org/lkml/cover.1687344643.git.falcon@tinylab.org/
[2]: https://lore.kernel.org/lkml/cover.1687976753.git.falcon@tinylab.org/
[3]: https://lore.kernel.org/lkml/20230624-proc-net-setattr-v1-0-73176812adee@we…
Zhangjin Wu (15):
selftests/nolibc: stat_fault: silence NULL argument warning with glibc
selftests/nolibc: gettid: restore for glibc and musl
selftests/nolibc: add _LARGEFILE64_SOURCE for musl
selftests/nolibc: fix up int_fast16/32_t test cases for musl
selftests/nolibc: fix up kernel parameters support
selftests/nolibc: stat_timestamps: remove procfs dependency
selftests/nolibc: link_cross: use /proc/self/cmdline
tools/nolibc: add rmdir() support
selftests/nolibc: add a new rmdir() test case
selftests/nolibc: fix up failures when CONFIG_PROC_FS=n
selftests/nolibc: prepare /tmp for tmpfs or ramfs
selftests/nolibc: add common get_tmpfile()
selftests/nolibc: rename chroot_exe to chroot_tmpfile
selftests/nolibc: add chmod_tmpfile test
selftests/nolibc: vfprintf: remove MEMFD_CREATE dependency
tools/include/nolibc/sys.h | 22 ++++
tools/testing/selftests/nolibc/nolibc-test.c | 102 +++++++++++++++----
2 files changed, 106 insertions(+), 18 deletions(-)
--
2.25.1
Hi,
This patch series introduces two tests to further enhance and
verify the functionality of the KVM subsystem. These tests focus
on MSR_IA32_DS_AREA and MSR_IA32_PERF_CAPABILITIES.
The first patch adds tests to verify the correct behavior when
trying to set MSR_IA32_DS_AREA to a non-canonical address. It
checks that KVM correctly faults on these non-canonical addresses.
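Assuming the addresses in question are what the x86 architecture calls
non-canonical addresses, the property being tested can be sketched as
follows. The function name and the va_bits parameter are illustrative
only; 48 and 57 are the usual virtual address widths for 4-level and
5-level paging.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * An address is canonical for a given virtual-address width when bits
 * [63:va_bits-1] are all equal, i.e. the address is the sign extension
 * of its low va_bits bits. va_bits must be in the range 1..64.
 */
bool is_canonical(uint64_t addr, unsigned int va_bits)
{
	uint64_t upper = addr >> (va_bits - 1);          /* bits [63:va_bits-1] */
	uint64_t all_ones = UINT64_MAX >> (va_bits - 1); /* same field, all set */

	return upper == 0 || upper == all_ones;
}
```

A write of an address for which is_canonical(addr, 48) is false is the
kind of value the test expects KVM to reject on a 4-level paging guest.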
The second patch includes a comprehensive PEBS test that checks all
possible combinations of PEBS-related bits in MSR_IA32_PERF_CAPABILITIES.
This helps to ensure the accuracy of PEBS functionality.
Feedback and suggestions are welcomed and appreciated.
Sincerely,
Jinrong Liang
Jinrong Liang (2):
KVM: selftests: Test consistency of setting MSR_IA32_DS_AREA
KVM: selftests: Add PEBS test for MSR_IA32_PERF_CAPABILITIES
.../selftests/kvm/x86_64/vmx_pmu_caps_test.c | 171 ++++++++++++++++++
1 file changed, 171 insertions(+)
base-commit: 31b4fc3bc64aadd660c5bfa5178c86a7ba61e0f7
--
2.31.1
From: Jeff Xu <jeffxu(a)google.com>
When sysctl vm.memfd_noexec is 2 (MEMFD_NOEXEC_SCOPE_NOEXEC_ENFORCED),
memfd_create(.., MFD_EXEC) should fail.
This complies with how MEMFD_NOEXEC_SCOPE_NOEXEC_ENFORCED is
defined - "memfd_create() without MFD_NOEXEC_SEAL will be rejected"
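The intended behaviour of the three sysctl levels can be written down as
a small decision function. This is a userspace model only, following the
wording of the vm.memfd_noexec description rather than the code in
mm/memfd.c; the MODEL_* names are invented here (their values mirror the
uapi flag values), and the choice of -EACCES and -EINVAL as error codes
is an assumption.

```c
#include <errno.h>

/* Model copies of the flag bits; values mirror include/uapi/linux/memfd.h. */
#define MODEL_MFD_NOEXEC_SEAL 0x0008U
#define MODEL_MFD_EXEC        0x0010U

/*
 * Returns the flags memfd_create() would effectively use for the given
 * vm.memfd_noexec level (0, 1 or 2), or a negative errno on rejection.
 */
int memfd_noexec_decide(int level, unsigned int flags)
{
	const unsigned int both = MODEL_MFD_EXEC | MODEL_MFD_NOEXEC_SEAL;

	if ((flags & both) == both)
		return -EINVAL; /* assumption: the two flags are mutually exclusive */

	if (level >= 2 && !(flags & MODEL_MFD_NOEXEC_SEAL))
		return -EACCES; /* level 2: anything without MFD_NOEXEC_SEAL is rejected */

	if (!(flags & both)) /* no preference given: levels 0 and 1 pick a default */
		flags |= (level == 0) ? MODEL_MFD_EXEC : MODEL_MFD_NOEXEC_SEAL;

	return (int)flags;
}
```

In this model memfd_noexec_decide(2, MODEL_MFD_EXEC) is rejected, which is
the case this fix is about.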
Thanks to Dominique Martinet <asmadeus(a)codewreck.org> who reported the bug.
see [1] for context.
[1] https://lore.kernel.org/linux-mm/CABi2SkXUX_QqTQ10Yx9bBUGpN1wByOi_=gZU6WEy5…
Jeff Xu (2):
mm/memfd: sysctl: fix MEMFD_NOEXEC_SCOPE_NOEXEC_ENFORCED
selftests/memfd: sysctl: fix MEMFD_NOEXEC_SCOPE_NOEXEC_ENFORCED
mm/memfd.c | 48 +++++++++++-----------
tools/testing/selftests/memfd/memfd_test.c | 5 +++
2 files changed, 30 insertions(+), 23 deletions(-)
--
2.41.0.255.g8b1d071c50-goog
From: Jeff Xu <jeffxu(a)google.com>
Add documentation for sysctl vm.memfd_noexec
Link:https://lore.kernel.org/linux-mm/CABi2SkXUX_QqTQ10Yx9bBUGpN1wByOi_=gZU…
Reported-by: Dominique Martinet <asmadeus(a)codewreck.org>
Signed-off-by: Jeff Xu <jeffxu(a)google.com>
---
Documentation/admin-guide/sysctl/vm.rst | 30 +++++++++++++++++++++++++
1 file changed, 30 insertions(+)
diff --git a/Documentation/admin-guide/sysctl/vm.rst b/Documentation/admin-guide/sysctl/vm.rst
index 45ba1f4dc004..621588041a9e 100644
--- a/Documentation/admin-guide/sysctl/vm.rst
+++ b/Documentation/admin-guide/sysctl/vm.rst
@@ -424,6 +424,36 @@ e.g., up to one or two maps per allocation.
The default value is 65530.
+memfd_noexec:
+=============
+This pid namespaced sysctl controls memfd_create().
+
+The new MFD_NOEXEC_SEAL and MFD_EXEC flags of memfd_create() allows
+application to set executable bit at creation time.
+
+When MFD_NOEXEC_SEAL is set, memfd is created without executable bit
+(mode:0666), and sealed with F_SEAL_EXEC, so it can't be chmod to
+be executable (mode: 0777) after creation.
+
+when MFD_EXEC flag is set, memfd is created with executable bit
+(mode:0777), this is the same as the old behavior of memfd_create.
+
+The new pid namespaced sysctl vm.memfd_noexec has 3 values:
+0: memfd_create() without MFD_EXEC nor MFD_NOEXEC_SEAL acts like
+ MFD_EXEC was set.
+1: memfd_create() without MFD_EXEC nor MFD_NOEXEC_SEAL acts like
+ MFD_NOEXEC_SEAL was set.
+2: memfd_create() without MFD_NOEXEC_SEAL will be rejected.
+
+The default value is 0.
+
+Once set, it can't be downgraded at runtime, i.e. 2=>1, 1=>0
+are denied.
+
+This is pid namespaced sysctl, child processes inherit the parent
+process's memfd_noexec at the time of fork. Changes to the parent
+process after fork are not automatically propagated to the child
+process.
memory_failure_early_kill:
==========================
--
2.41.0.255.g8b1d071c50-goog
Hi,
This patch series aims to improve the PMU event filter settings with a cleaner
and more organized structure and adds several test cases related to PMU event
filters.
The first patch of this series introduces a custom "__kvm_pmu_event_filter"
structure that simplifies the event filter setup and improves overall code
readability and maintainability.
The second patch adds test cases to check that unsupported input values in the
PMU event filters are rejected, covering unsupported "action" values,
unsupported "flags" values, and unsupported "nevents" values, as well as the
setting of non-existent fixed counters in the fixed bitmap.
The third patch includes tests for the PMU event filter's behavior when applied
to fixed performance counters, ensuring the correct operation in cases where no
fixed counters exist (e.g., Intel guest PMU version=1 or AMD guest).
Finally, the fourth patch adds a test to verify that setting both generic and
fixed performance event filters does not impact the consistency of the fixed
performance filter behavior.
These changes help to ensure that KVM's PMU event filter functions as expected
in all supported use cases. These patches have been tested and verified to
function properly.
Any feedback or suggestions are greatly appreciated.
Please note that following patches should be applied before this patch series:
https://lore.kernel.org/kvm/20230530134248.23998-2-cloudliang@tencent.com
https://lore.kernel.org/kvm/20230530134248.23998-3-cloudliang@tencent.com
This will ensure that macro definitions such as X86_INTEL_MAX_FIXED_CTR_NUM,
INTEL_PMC_IDX_FIXED, etc. can be used.
Sincerely,
Jinrong Liang
Changes log:
v3:
- Rebased to 31b4fc3bc64a(tag: kvm-x86-next-2023.06.02).
- Dropped the patch "KVM: selftests: Replace int with uint32_t for nevents". (Sean)
- Dropped the patch "KVM: selftests: Test pmu event filter with incompatible
kvm_pmu_event_filter". (Sean)
- Introduce __kvm_pmu_event_filter to replace the original method of creating
PMU event filters. (Sean)
- Use the macro definition of kvm_cpu_property to find the number of supported
fixed counters instead of calculating it via the vcpu's cpuid. (Sean)
- Remove the wrappers that are single line passthroughs. (Sean)
- Optimize function names and variable names. (Sean)
- Optimize comments to make them more rigorous. (Sean)
v2:
- Wrap the code from the documentation in a block of code. (Bagas Sanjaya)
v1:
https://lore.kernel.org/kvm/20230414110056.19665-1-cloudliang@tencent.com
Jinrong Liang (4):
KVM: selftests: Introduce __kvm_pmu_event_filter to improved event
filter settings
KVM: selftests: Test unavailable event filters are rejected
KVM: selftests: Check if event filter meets expectations on fixed
counters
KVM: selftests: Test gp event filters don't affect fixed event filters
.../kvm/x86_64/pmu_event_filter_test.c | 341 +++++++++++++-----
1 file changed, 246 insertions(+), 95 deletions(-)
base-commit: 31b4fc3bc64aadd660c5bfa5178c86a7ba61e0f7
prerequisite-patch-id: 909d42f185f596d6e5c5b48b33231c89fa5236e4
prerequisite-patch-id: ba0dd0f97d8db0fb6cdf2c7f1e3a60c206fc9784
--
2.31.1
Hi, Willy
This patchset mainly speeds up the nolibc test by allowing it to run
with a minimal kernel config.
As nolibc supports more and more architectures, the 'run' test with
DEFCONFIG may take several hours, which is unfriendly to both
development testing and release testing. Smaller kernel configs are
therefore required, and the first step is to let nolibc-test work with
fewer kernel config options; that is the goal of this patchset.
This patchset mainly removes the dependency on procfs, tmpfs, net and
memfd_create; many failures have been fixed up.
When CONFIG_TMPFS and CONFIG_SHMEM are disabled, the kernel provides a
ramfs based tmpfs (mm/shmem.c); it is used to fix up some failures and
to skip fewer tests.
Besides, it also adds musl support, improves glibc support and fixes up
a kernel cmdline passing use case.
This is based on the dev.2023.06.14a branch of linux-rcu [1], all of the
supported architectures are tested (with local minimal configs, [5]
pasted the one for i386) without failures:
arch/board | result
------------|------------
arm/vexpress-a9 | 138 test(s) passed, 1 skipped, 0 failed. See all results in /labs/linux-lab/logging/nolibc/arm-vexpress-a9-nolibc-test.log
aarch64/virt | 138 test(s) passed, 1 skipped, 0 failed. See all results in /labs/linux-lab/logging/nolibc/aarch64-virt-nolibc-test.log
ppc/g3beige | not supported
i386/pc | 136 test(s) passed, 3 skipped, 0 failed. See all results in /labs/linux-lab/logging/nolibc/i386-pc-nolibc-test.log
x86_64/pc | 138 test(s) passed, 1 skipped, 0 failed. See all results in /labs/linux-lab/logging/nolibc/x86_64-pc-nolibc-test.log
mipsel/malta | 138 test(s) passed, 1 skipped, 0 failed. See all results in /labs/linux-lab/logging/nolibc/mipsel-malta-nolibc-test.log
loongarch64/virt | 138 test(s) passed, 1 skipped, 0 failed. See all results in /labs/linux-lab/logging/nolibc/loongarch64-virt-nolibc-test.log
riscv64/virt | 136 test(s) passed, 3 skipped, 0 failed. See all results in /labs/linux-lab/logging/nolibc/riscv64-virt-nolibc-test.log
riscv32/virt | no test log found
s390x/s390-ccw-virtio | 138 test(s) passed, 1 skipped, 0 failed. See all results in /labs/linux-lab/logging/nolibc/s390x-s390-ccw-virtio-nolibc-test.log
Notes:
* The skipped ones are -fstackprotector, chmod_self and chown_self
The -fstackprotector skip is due to the gcc version.
The chmod_self and chown_self skips are due to procfs not being enabled.
* ppc/g3beige support has been added locally but is not part of this
patchset. It will be sent as a new patchset; it depends on the v2 test
report patchset [3] and the v5 rv32 support, and requires changes to
the Makefile.
* riscv32/virt support is still in review, see v5 rv32 support [4]
This patchset doesn't depend on any of my other nolibc patch series.
However, once our v4 syscall helper series [2] is merged, the new
rmdir() routine added here may need to be converted to __sysret(); for
now it uses the old method so that it compiles without any dependency.
Here is an explanation of all of the patches:
* selftests/nolibc: stat_fault: silence NULL argument warning with glibc
selftests/nolibc: gettid: restore for glibc and musl
selftests/nolibc: add _LARGEFILE64_SOURCE for musl
The above 3 patches add musl compile support and improve glibc support.
It is now possible to build and run nolibc-test with musl libc, but there
are some failures/skips due to musl's own issues/requirements:
$ sudo ./nolibc-test | grep -E 'FAIL|SKIP'
8 sbrk = 1 ENOMEM [FAIL]
9 brk = -1 ENOMEM [FAIL]
46 limit_int_fast16_min = -2147483648 [FAIL]
47 limit_int_fast16_max = 2147483647 [FAIL]
49 limit_int_fast32_min = -2147483648 [FAIL]
50 limit_int_fast32_max = 2147483647 [FAIL]
0 -fstackprotector not supported [SKIPPED]
musl disabled sbrk and brk because of conflicts with its malloc, and it
defines the fast int types as 32-bit, which differs from nolibc
and glibc. musl keeps sbrk(0) to allow getting the current brk; we
added a test for this in the v4 __sysret() helper series [2].
* selftests/nolibc: fix up kernel parameters support
The kernel cmdline can pass two types of parameters: those without
'=' are passed as init arguments, and those with '=' are passed as
init environment variables.
Our nolibc-test prefers arguments to environment variables, which does
not work when users add such parameters to the kernel cmdline:
noapic NOLIBC_TEST=syscall
So this patch verifies the setting from the arguments first and, if it
is not valid, tries the environment variable instead.
* selftests/nolibc: stat_timestamps: remove procfs dependency
Use '/' instead of /proc/self. Alternatively we could add a 'has_proc'
condition for this test case, but it is not really necessary to skip the
whole stat_timestamps test just for a subtest bound to /proc/self.
Suggestions from Thomas are welcome.
* tools/nolibc: add rmdir() support
selftests/nolibc: add a new rmdir() test case
rmdir() routine and test case are added for the coming requirement.
Note, if the __sysret() patchset [2] is applied before us, this patch
should be rebased on it and apply the __sysret() helper.
* selftests/nolibc: fix up failures when there is no procfs
Call rmdir() to remove /proc completely in order to rework the checking
of /proc; previously, the existence of /proc did not mean that procfs
was really mounted.
* selftests/nolibc: rename proc variable to has_proc
selftests/nolibc: rename euid0 variable to is_root
align with the has_gettid, has_xxx variables.
* selftests/nolibc: prepare tmpfs and hugetlbfs
selftests/nolibc: rename chmod_net to chmod_good
selftests/nolibc: link_cross: support tmpfs
selftests/nolibc: rename chroot_exe to chroot_file
Use a file from /tmp instead of a file from /proc when there is no
procfs; this avoids skipping the chmod_net, link_cross and chroot_exe
tests.
* selftests/nolibc: vfprintf: silence memfd_create() warning
selftests/nolibc: vfprintf: skip if neither tmpfs nor hugetlbfs
selftests/nolibc: vfprintf: support tmpfs and hugetlbfs
memfd_create() on kernels >= v6.2 always warns about a missing
MFD_NOEXEC_SEAL flag; the first patch silences the warning by passing
that flag, while older kernels still get flags of 0 as before.
Since memfd_create() depends on TMPFS or HUGETLBFS, the second patch
skips the whole vfprintf test instead of simply failing when
memfd_create() does not work.
The third patch further tries the ramfs based tmpfs even when
memfd_create() does not work.
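The argument-versus-environment fallback described for the kernel
parameters fix can be sketched like this. The group names, the validity
rule and both function names are hypothetical and only serve to show the
order of the checks; the real nolibc-test has its own test table and
parser.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical list of test group names. */
static const char *const known_groups[] = { "syscall", "stdlib", "vfprintf", NULL };

/* A spec counts as valid here if it starts with a known group name. */
int spec_is_valid(const char *spec)
{
	size_t i;

	if (!spec)
		return 0;
	for (i = 0; known_groups[i]; i++) {
		size_t n = strlen(known_groups[i]);

		if (strncmp(spec, known_groups[i], n) == 0 &&
		    (spec[n] == '\0' || spec[n] == ':' || spec[n] == ','))
			return 1;
	}
	return 0;
}

/*
 * Prefer argv[1], but only if it really is a test spec; a stray kernel
 * parameter such as "noapic" also ends up in argv[1] of init. Otherwise
 * fall back to the NOLIBC_TEST environment value. NULL means "run all".
 */
const char *pick_test_spec(int argc, char **argv, const char *env_value)
{
	if (argc > 1 && spec_is_valid(argv[1]))
		return argv[1];
	if (spec_is_valid(env_value))
		return env_value;
	return NULL;
}
```

With the cmdline "noapic NOLIBC_TEST=syscall", argv[1] is "noapic" and is
rejected, so the sketch picks "syscall" from the environment instead.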
Finally, a brief word about the configs: I have prepared minimal
configs for all of the nolibc supported architectures but am not sure
where we should put them. What about
tools/testing/selftests/nolibc/configs?
Thanks!
Best regards,
Zhangjin
---
[1]: https://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu.git/
[2]: https://lore.kernel.org/linux-riscv/cover.1687187451.git.falcon@tinylab.org/
[3]: https://lore.kernel.org/lkml/cover.1687156559.git.falcon@tinylab.org/
[4]: https://lore.kernel.org/linux-riscv/cover.1687176996.git.falcon@tinylab.org/
[5]: https://pastebin.com/5jq0Vxbz
Zhangjin Wu (17):
selftests/nolibc: stat_fault: silence NULL argument warning with glibc
selftests/nolibc: gettid: restore for glibc and musl
selftests/nolibc: add _LARGEFILE64_SOURCE for musl
selftests/nolibc: fix up kernel parameters support
selftests/nolibc: stat_timestamps: remove procfs dependency
tools/nolibc: add rmdir() support
selftests/nolibc: add a new rmdir() test case
selftests/nolibc: fix up failures when there is no procfs
selftests/nolibc: rename proc variable to has_proc
selftests/nolibc: rename euid0 variable to is_root
selftests/nolibc: prepare tmpfs and hugetlbfs
selftests/nolibc: rename chmod_net to chmod_good
selftests/nolibc: link_cross: support tmpfs
selftests/nolibc: rename chroot_exe to chroot_file
selftests/nolibc: vfprintf: silence memfd_create() warning
selftests/nolibc: vfprintf: skip if neither tmpfs nor hugetlbfs
selftests/nolibc: vfprintf: support tmpfs and hugetlbfs
tools/include/nolibc/sys.h | 28 ++++
tools/testing/selftests/nolibc/nolibc-test.c | 132 +++++++++++++++----
2 files changed, 138 insertions(+), 22 deletions(-)
--
2.25.1
From: Jeff Xu <jeffxu(a)google.com>
Since Linux introduced the memfd feature, memfds have always had their
execute bit set, and the memfd_create() syscall doesn't allow setting
it differently.
However, in a secure-by-default system such as ChromeOS (where all
executables should come from the rootfs, which is protected by Verified
boot), this executable nature of memfd opens a door for NoExec bypass
and enables a “confused deputy attack”. E.g., in VRP bug [1], the cros_vm
process created a memfd to share content with an external process;
however, the memfd was overwritten and used for executing arbitrary code
and root escalation. [2] lists more VRP bugs of this kind.
On the other hand, executable memfds have legitimate uses: runc uses memfd’s
seal and executable features to copy the contents of the binary and then
execute them. For such systems, we need a solution to differentiate runc's
use of executable memfds from an attacker's [3].
To address the above, this set of patches adds the following:
1> Let memfd_create() set X bit at creation time.
2> Let memfd be sealed against modifying the X bit.
3> A new pid namespace sysctl: vm.memfd_noexec to control the behavior of
X bit. For example, if a container has vm.memfd_noexec=2, then
memfd_create() without MFD_NOEXEC_SEAL will be rejected.
4> A new security hook in memfd_create(). This makes it possible for a new
LSM to reject or allow executable memfds based on its security policy.
Change history:
v8:
- Update ref bug in cover letter.
- Add Reviewed-by field.
- Remove security hook (security_memfd_create) patch, which will have
its own patch set in future.
v7:
- patch 2/6: remove #ifdef and MAX_PATH (memfd_test.c).
- patch 3/6: check capability (CAP_SYS_ADMIN) from userns instead of
global ns (pid_sysctl.h). Add a tab (pid_namespace.h).
- patch 5/6: remove #ifdef (memfd_test.c)
- patch 6/6: remove unneeded security_move_mount(security.c).
v6:https://lore.kernel.org/lkml/20221206150233.1963717-1-jeffxu@google.com/
- Address comment and move "#ifdef CONFIG_" from .c file to pid_sysctl.h
v5:https://lore.kernel.org/lkml/20221206152358.1966099-1-jeffxu@google.com/
- Pass vm.memfd_noexec from current ns to child ns.
- Fix build issue detected by kernel test robot.
- Add missing security.c
v3:https://lore.kernel.org/lkml/20221202013404.163143-1-jeffxu@google.com/
- Address API design comments in v2.
- Let memfd_create() to set X bit at creation time.
- A new pid namespace sysctl: vm.memfd_noexec to control behavior of X bit.
- A new security hook in memfd_create().
v2:https://lore.kernel.org/lkml/20220805222126.142525-1-jeffxu@google.com/
- address comments in V1.
- add sysctl (vm.mfd_noexec) to set the default file permissions of
memfd_create to be non-executable.
v1:https://lwn.net/Articles/890096/
[1] https://crbug.com/1305267
[2] https://bugs.chromium.org/p/chromium/issues/list?q=type%3Dbug-security%20me…
[3] https://lwn.net/Articles/781013/
Daniel Verkamp (2):
mm/memfd: add F_SEAL_EXEC
selftests/memfd: add tests for F_SEAL_EXEC
Jeff Xu (3):
mm/memfd: add MFD_NOEXEC_SEAL and MFD_EXEC
mm/memfd: Add write seals when apply SEAL_EXEC to executable memfd
selftests/memfd: add tests for MFD_NOEXEC_SEAL MFD_EXEC
include/linux/pid_namespace.h | 19 ++
include/uapi/linux/fcntl.h | 1 +
include/uapi/linux/memfd.h | 4 +
kernel/pid_namespace.c | 5 +
kernel/pid_sysctl.h | 59 ++++
mm/memfd.c | 56 +++-
mm/shmem.c | 6 +
tools/testing/selftests/memfd/fuse_test.c | 1 +
tools/testing/selftests/memfd/memfd_test.c | 341 ++++++++++++++++++++-
9 files changed, 489 insertions(+), 3 deletions(-)
create mode 100644 kernel/pid_sysctl.h
base-commit: eb7081409f94a9a8608593d0fb63a1aa3d6f95d8
--
2.39.0.rc1.256.g54fd8350bd-goog
From: sunliming <sunliming(a)kylinos.cn>
[ Upstream commit ba470eebc2f6c2f704872955a715b9555328e7d0 ]
User processes register name_args for events. If two events with the same
name but different args are registered, the trace output of the second
event is printed as if it were the first event. This is incorrect.
Return EADDRINUSE back to the user process if an event with the same name
but different args has already been registered.
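The rule can be illustrated with a toy registry in plain C. This is a
model only: the model_* names are invented, the args are compared as raw
strings rather than as parsed fields, and -ENOSPC for a full table is an
arbitrary choice for the sketch.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

#define MODEL_MAX_EVENTS 8

struct model_event {
	const char *name;
	const char *args; /* "" when the event has no fields */
};

struct model_registry {
	struct model_event events[MODEL_MAX_EVENTS];
	size_t count;
};

/* Returns the event index on success or a negative errno on failure. */
int model_register(struct model_registry *reg, const char *name, const char *args)
{
	size_t i;

	if (!args)
		args = "";

	for (i = 0; i < reg->count; i++) {
		if (strcmp(reg->events[i].name, name) != 0)
			continue;
		/* Same name: reuse the event only if the args match as well. */
		if (strcmp(reg->events[i].args, args) == 0)
			return (int)i;
		return -EADDRINUSE;
	}

	if (reg->count == MODEL_MAX_EVENTS)
		return -ENOSPC; /* arbitrary choice for this model */

	reg->events[reg->count].name = name;
	reg->events[reg->count].args = args;
	return (int)reg->count++;
}
```

Registering the same name twice with identical args returns the existing
index, while a second registration with different args is refused.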
Link: https://lore.kernel.org/linux-trace-kernel/20230529032100.286534-1-sunlimin…
Signed-off-by: sunliming <sunliming(a)kylinos.cn>
Reviewed-by: Masami Hiramatsu (Google) <mhiramat(a)kernel.org>
Acked-by: Beau Belgrave <beaub(a)linux.microsoft.com>
Signed-off-by: Steven Rostedt (Google) <rostedt(a)goodmis.org>
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
---
kernel/trace/trace_events_user.c | 36 +++++++++++++++----
.../selftests/user_events/ftrace_test.c | 6 ++++
2 files changed, 36 insertions(+), 6 deletions(-)
diff --git a/kernel/trace/trace_events_user.c b/kernel/trace/trace_events_user.c
index 625cab4b9d945..774d146c2c2ca 100644
--- a/kernel/trace/trace_events_user.c
+++ b/kernel/trace/trace_events_user.c
@@ -1274,6 +1274,8 @@ static int user_event_parse(struct user_event_group *group, char *name,
int index;
u32 key;
struct user_event *user;
+ int argc = 0;
+ char **argv;
/* Prevent dyn_event from racing */
mutex_lock(&event_mutex);
@@ -1281,13 +1283,35 @@ static int user_event_parse(struct user_event_group *group, char *name,
mutex_unlock(&event_mutex);
if (user) {
- *newuser = user;
- /*
- * Name is allocated by caller, free it since it already exists.
- * Caller only worries about failure cases for freeing.
- */
- kfree(name);
+ if (args) {
+ argv = argv_split(GFP_KERNEL, args, &argc);
+ if (!argv) {
+ ret = -ENOMEM;
+ goto error;
+ }
+
+ ret = user_fields_match(user, argc, (const char **)argv);
+ argv_free(argv);
+
+ } else
+ ret = list_empty(&user->fields);
+
+ if (ret) {
+ *newuser = user;
+ /*
+ * Name is allocated by caller, free it since it already exists.
+ * Caller only worries about failure cases for freeing.
+ */
+ kfree(name);
+ } else {
+ ret = -EADDRINUSE;
+ goto error;
+ }
+
return 0;
+error:
+ refcount_dec(&user->refcnt);
+ return ret;
}
index = find_first_zero_bit(group->page_bitmap, MAX_EVENTS);
diff --git a/tools/testing/selftests/user_events/ftrace_test.c b/tools/testing/selftests/user_events/ftrace_test.c
index 1bc26e6476fc3..df0e776c2cc1b 100644
--- a/tools/testing/selftests/user_events/ftrace_test.c
+++ b/tools/testing/selftests/user_events/ftrace_test.c
@@ -209,6 +209,12 @@ TEST_F(user, register_events) {
ASSERT_EQ(0, reg.write_index);
ASSERT_NE(0, reg.status_bit);
+ /* Multiple registers to same name but different args should fail */
+ reg.enable_bit = 29;
+ reg.name_args = (__u64)"__test_event u32 field1;";
+ ASSERT_EQ(-1, ioctl(self->data_fd, DIAG_IOCSREG, &reg));
+ ASSERT_EQ(EADDRINUSE, errno);
+
/* Ensure disabled */
self->enable_fd = open(enable_file, O_RDWR);
ASSERT_NE(-1, self->enable_fd);
--
2.39.2
=== Context ===
In the context of a middlebox, fragmented packets are tricky to handle.
The full 5-tuple of a packet is often only available in the first
fragment which makes enforcing consistent policy difficult. There are
really only two stateless options, neither of which is very nice:
1. Enforce policy on first fragment and accept all subsequent fragments.
This works but may let in certain attacks or allow data exfiltration.
2. Enforce policy on first fragment and drop all subsequent fragments.
This does not really work b/c some protocols may rely on
fragmentation. For example, DNS may rely on oversized UDP packets for
large responses.
So stateful tracking is the only sane option. RFC 8900 [0] calls this
out as well in section 6.3:
Middleboxes [...] should process IP fragments in a manner that is
consistent with [RFC0791] and [RFC8200]. In many cases, middleboxes
must maintain state in order to achieve this goal.
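To make "stateful tracking" concrete, here is a minimal Python sketch of reassembly keyed on the fields every fragment carries. It is not how the netfilter defrag module is implemented; the Fragment and Reassembler names are made up, offsets are plain byte offsets rather than 8-byte units, and overlaps and timeouts are ignored.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Fragment:
    src: str
    dst: str
    proto: int
    ident: int      # IP identification value shared by all fragments
    offset: int     # byte offset of this fragment's payload (simplified)
    more: bool      # "more fragments" flag
    payload: bytes


class Reassembler:
    """Hypothetical reassembly state: one dict of fragments per datagram."""

    def __init__(self):
        self._pending = defaultdict(dict)  # key -> {offset: Fragment}

    def push(self, frag):
        """Return the full payload once the datagram is complete, else None."""
        key = (frag.src, frag.dst, frag.proto, frag.ident)
        frags = self._pending[key]
        frags[frag.offset] = frag
        expected = 0
        for offset in sorted(frags):
            if offset != expected:
                return None            # there is still a hole
            expected += len(frags[offset].payload)
        if frags[max(frags)].more:
            return None                # final fragment not seen yet
        del self._pending[key]
        return b"".join(frags[o].payload for o in sorted(frags))


r = Reassembler()
tail = Fragment("10.0.0.1", "10.0.0.2", 17, 42, 4, False, b"ef")
head = Fragment("10.0.0.1", "10.0.0.2", 17, 42, 0, True, b"abcd")
early = r.push(tail)   # arrives first, nothing to hand over yet
whole = r.push(head)   # datagram complete
```

Policy would then be applied once, to whole, instead of guessing per fragment.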
=== BPF related bits ===
Policy has traditionally been enforced from XDP/TC hooks. Both hooks
run before kernel reassembly facilities. However, with the new
BPF_PROG_TYPE_NETFILTER, we can rather easily hook into existing
netfilter reassembly infra.
The basic idea is we bump a refcnt on the netfilter defrag module and
then run the bpf prog after the defrag module runs. This allows bpf
progs to transparently see full, reassembled packets. The nice thing
about this is that progs don't have to carry around logic to detect
fragments.
=== Patchset details ===
There was an earlier attempt at providing defrag via kfuncs [1]. The
feedback was that we could end up doing too much stuff in prog execution
context (like sending ICMP error replies). However, I think there is
still some outstanding discussion w.r.t. performance when it comes to
netfilter vs the previous approach. I'll schedule some time during
office hours for this.
Patches 1 & 2 are stolen from Florian. Hopefully he doesn't mind. There
were some outstanding comments on the v2 [2] but it doesn't look like a
v3 was ever submitted. I've addressed the comments and put them in this
patchset because I needed them.
Finally, the new selftest seems to be a little flaky. I'm not quite
sure why the server will occasionally fail to `recvfrom()`. I'm fairly
sure it's a timing-related issue with creating veths. I'll keep
debugging but I didn't want that to hold up discussion on this patchset.
[0]: https://datatracker.ietf.org/doc/html/rfc8900
[1]: https://lore.kernel.org/bpf/cover.1677526810.git.dxu@dxuuu.xyz/
[2]: https://lore.kernel.org/bpf/20230525110100.8212-1-fw@strlen.de/
Daniel Xu (7):
tools: libbpf: add netfilter link attach helper
selftests/bpf: Add bpf_program__attach_netfilter helper test
netfilter: defrag: Add glue hooks for enabling/disabling defrag
netfilter: bpf: Support BPF_F_NETFILTER_IP_DEFRAG in netfilter link
bpf: selftests: Support not connecting client socket
bpf: selftests: Support custom type and proto for client sockets
bpf: selftests: Add defrag selftests
include/linux/netfilter.h | 12 +
include/uapi/linux/bpf.h | 5 +
net/ipv4/netfilter/nf_defrag_ipv4.c | 8 +
net/ipv6/netfilter/nf_defrag_ipv6_hooks.c | 10 +
net/netfilter/core.c | 6 +
net/netfilter/nf_bpf_link.c | 108 ++++++-
tools/include/uapi/linux/bpf.h | 5 +
tools/lib/bpf/bpf.c | 8 +
tools/lib/bpf/bpf.h | 6 +
tools/lib/bpf/libbpf.c | 47 +++
tools/lib/bpf/libbpf.h | 15 +
tools/lib/bpf/libbpf.map | 1 +
tools/testing/selftests/bpf/Makefile | 4 +-
.../selftests/bpf/generate_udp_fragments.py | 90 ++++++
.../selftests/bpf/ip_check_defrag_frags.h | 57 ++++
tools/testing/selftests/bpf/network_helpers.c | 26 +-
tools/testing/selftests/bpf/network_helpers.h | 3 +
.../bpf/prog_tests/ip_check_defrag.c | 282 ++++++++++++++++++
.../bpf/prog_tests/netfilter_basic.c | 78 +++++
.../selftests/bpf/progs/ip_check_defrag.c | 104 +++++++
.../bpf/progs/test_netfilter_link_attach.c | 14 +
21 files changed, 868 insertions(+), 21 deletions(-)
create mode 100755 tools/testing/selftests/bpf/generate_udp_fragments.py
create mode 100644 tools/testing/selftests/bpf/ip_check_defrag_frags.h
create mode 100644 tools/testing/selftests/bpf/prog_tests/ip_check_defrag.c
create mode 100644 tools/testing/selftests/bpf/prog_tests/netfilter_basic.c
create mode 100644 tools/testing/selftests/bpf/progs/ip_check_defrag.c
create mode 100644 tools/testing/selftests/bpf/progs/test_netfilter_link_attach.c
--
2.40.1
Good morning,
I have looked through your offer and am pleased to say that it attracts attention and invites further discussion.
I thought I might be able to contribute to your growth and help this offer reach a wider audience. I do search engine positioning for websites, which brings them excellent web traffic.
Could we talk sometime soon?
Regards
Adam Charachuta
Nested translation is a hardware feature supported by many modern IOMMUs.
It uses two stages of address translation (stage-1 and stage-2) to reach
the physical address. The stage-1 translation table is owned by userspace
(e.g. by a guest OS), while the stage-2 table is owned by the kernel. Changes
to the stage-1 translation table should be followed by an IOTLB invalidation.
Take Intel VT-d as an example: the stage-1 translation table is the I/O page
table. As the diagram below shows, the guest I/O page table pointer in GPA
(guest physical address) is passed to the host and used to perform the stage-1
address translation. Accordingly, modifications to present mappings in the
guest I/O page table should be followed by an IOTLB invalidation.
    .-------------.  .---------------------------.
    |   vIOMMU    |  | Guest I/O page table      |
    |             |  '---------------------------'
    .----------------/
    | PASID Entry |--- PASID cache flush --+
    '-------------'                        |
    |             |                        V
    |             |          I/O page table pointer in GPA
    '-------------'
 Guest
 ------| Shadow |--------------------------|--------
       v        v                          v
 Host
    .-------------.  .------------------------.
    |   pIOMMU    |  |  FS for GIOVA->GPA     |
    |             |  '------------------------'
    .----------------/  |
    | PASID Entry |     V (Nested xlate)
    '----------------\.----------------------------------.
    |             |   | SS for GPA->HPA, unmanaged domain|
    |             |   '----------------------------------'
    '-------------'
 Where:
  - FS = First stage page tables
  - SS = Second stage page tables
 <Intel VT-d Nested translation>
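A toy model may help show why a stage-1 change needs an IOTLB invalidation. The Python sketch below is purely illustrative: the NestedTranslator class, the page-number dictionaries and the cache are assumptions, not the VT-d or IOMMUFD interfaces.

```python
class NestedTranslator:
    """Toy two-stage translation with a cache standing in for the IOTLB."""

    def __init__(self, stage1, stage2):
        self.stage1 = stage1   # GIOVA page -> GPA page, owned by the guest
        self.stage2 = stage2   # GPA page -> HPA page, owned by the host
        self.iotlb = {}        # cached GIOVA page -> HPA page

    def translate(self, giova_page):
        # Walk both stages only on a cache miss, as hardware would.
        if giova_page not in self.iotlb:
            gpa_page = self.stage1[giova_page]
            self.iotlb[giova_page] = self.stage2[gpa_page]
        return self.iotlb[giova_page]

    def invalidate(self, giova_page):
        # Stand-in for the stage-1 IOTLB invalidation requested by the user.
        self.iotlb.pop(giova_page, None)


t = NestedTranslator({0x10: 0x200}, {0x200: 0x9000, 0x201: 0x9100})
before = t.translate(0x10)
t.stage1[0x10] = 0x201          # guest edits a present mapping
stale = t.translate(0x10)       # cache still holds the old result
t.invalidate(0x10)
fresh = t.translate(0x10)
```

Until invalidate() runs, stale still equals before; only afterwards does the new guest mapping take effect.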
In IOMMUFD, all the translation tables are tracked by hw_pagetable (hwpt),
and each has an iommu_domain allocated from the iommu driver. So in this series
hw_pagetable and iommu_domain mean the same thing unless noted otherwise.
IOMMUFD already supports allocating a hw_pagetable that is linked with
an IOAS. However, nesting requires IOMMUFD to allow allocating a hw_pagetable
with driver-specific parameters, plus an interface to sync the stage-1 IOTLB,
as the user owns the stage-1 translation table.
This series is based on the iommu hw info reporting series [1]. It first
introduces a new iommu op for allocating domains with user data and an op
for syncing the stage-1 IOTLB, then extends the IOMMUFD internal infrastructure
to accept user_data and a parent hwpt and relay the data to the iommu core to
allocate an iommu_domain. After that, it extends the IOMMU_HWPT_ALLOC ioctl to
accept user data and a stage-2 hwpt ID when allocating a hwpt. Along with it, the
IOMMU_HWPT_INVALIDATE ioctl is added to invalidate the stage-1 IOTLB. This is
needed for user-managed hwpts. Selftests are added as well to cover the new ioctls.
Complete code can be found in [2]; the QEMU code can be found in [3].
Finally, this is joint work with Nicolin Chen and Lu Baolu. Thanks to
them for the help. ^_^ I look forward to your feedback.
base-commit: cf905391237ded2331388e75adb5afbabeddc852
[1] https://lore.kernel.org/linux-iommu/20230511143024.19542-1-yi.l.liu@intel.c…
[2] https://github.com/yiliu1765/iommufd/tree/iommufd_nesting
[3] https://github.com/yiliu1765/qemu/tree/wip/iommufd_rfcv4.mig.reset.v4_var3%…
Change log:
v2:
- Add union iommu_domain_user_data to include all user data structures to avoid
passing void * in kernel APIs.
- Add iommu op to return user data length for user domain allocation
- Rename struct iommu_hwpt_alloc::data_type to be hwpt_type
- Store the invalidation data length in iommu_domain_ops::cache_invalidate_user_data_len
- Convert cache_invalidate_user op to be int instead of void
- Remove @data_type in struct iommu_hwpt_invalidate
- Remove out_hwpt_type_bitmap in struct iommu_hw_info hence drop patch 08 of v1
v1: https://lore.kernel.org/linux-iommu/20230309080910.607396-1-yi.l.liu@intel.…
Thanks,
Yi Liu
Lu Baolu (2):
iommu: Add new iommu op to create domains owned by userspace
iommu: Add nested domain support
Nicolin Chen (5):
iommufd/hw_pagetable: Do not populate user-managed hw_pagetables
iommufd/selftest: Add domain_alloc_user() support in iommu mock
iommufd/selftest: Add coverage for IOMMU_HWPT_ALLOC with user data
iommufd/selftest: Add IOMMU_TEST_OP_MD_CHECK_IOTLB test op
iommufd/selftest: Add coverage for IOMMU_HWPT_INVALIDATE ioctl
Yi Liu (4):
iommufd/hw_pagetable: Use domain_alloc_user op for domain allocation
iommufd: Pass parent hwpt and user_data to
iommufd_hw_pagetable_alloc()
iommufd: IOMMU_HWPT_ALLOC allocation with user data
iommufd: Add IOMMU_HWPT_INVALIDATE
drivers/iommu/iommufd/device.c | 2 +-
drivers/iommu/iommufd/hw_pagetable.c | 191 +++++++++++++++++-
drivers/iommu/iommufd/iommufd_private.h | 16 +-
drivers/iommu/iommufd/iommufd_test.h | 30 +++
drivers/iommu/iommufd/main.c | 5 +-
drivers/iommu/iommufd/selftest.c | 119 ++++++++++-
include/linux/iommu.h | 36 ++++
include/uapi/linux/iommufd.h | 58 +++++-
tools/testing/selftests/iommu/iommufd.c | 126 +++++++++++-
tools/testing/selftests/iommu/iommufd_utils.h | 70 +++++++
10 files changed, 629 insertions(+), 24 deletions(-)
--
2.34.1
Make sv39 the default address space for mmap as some applications
currently depend on this assumption. The RISC-V specification enforces
that bits outside of the virtual address range are not used, so
restricting the size of the default address space as such should be
temporary. A hint address passed to mmap will cause the largest address
space that fits entirely into the hint to be used. If the hint is less
than or equal to 1<<38, a 39-bit address will be used. After an address
space is completely full, the next smallest address space will be used.
Documentation is also added to the RISC-V virtual memory section to explain
these changes.
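The hint rule can be written down as a small Python sketch. It is an illustration of the description above rather than the code in the patches: the function names are hypothetical, the user half of an svN address space is assumed to span 1 << (N - 1) bytes, and the fallback to a smaller address space once one is full is not modelled.

```python
SV_MODES = (57, 48, 39)   # address space widths, largest first
DEFAULT_BITS = 39         # sv39 is the default when no larger space fits


def user_va_size(bits):
    # Assumption: the user half of an svN address space is 1 << (N - 1) bytes.
    return 1 << (bits - 1)


def select_va_bits(hint, supported=SV_MODES):
    """Pick the largest address space that fits entirely below the hint."""
    for bits in sorted(supported, reverse=True):
        if hint >= user_va_size(bits):
            return bits
    return DEFAULT_BITS


no_hint = select_va_bits(0)
small = select_va_bits(1 << 38)
medium = select_va_bits(1 << 47)
large = select_va_bits(1 << 56)
```

With these assumptions no_hint and small select sv39, medium selects sv48 and large selects sv57.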
Charlie Jenkins (2):
RISC-V: mm: Restrict address space for sv39,sv48,sv57
RISC-V: mm: Update documentation and include test
Documentation/riscv/vm-layout.rst | 20 ++++++++
arch/riscv/include/asm/elf.h | 2 +-
arch/riscv/include/asm/pgtable.h | 21 ++++++--
arch/riscv/include/asm/processor.h | 41 +++++++++++++---
tools/testing/selftests/riscv/Makefile | 2 +-
tools/testing/selftests/riscv/mm/Makefile | 22 +++++++++
.../selftests/riscv/mm/testcases/mmap.c | 49 +++++++++++++++++++
7 files changed, 144 insertions(+), 13 deletions(-)
create mode 100644 tools/testing/selftests/riscv/mm/Makefile
create mode 100644 tools/testing/selftests/riscv/mm/testcases/mmap.c
base-commit: eef509789cecdce895020682192d32e8bac790e8
--
2.34.1
Hi folks,
This series implements the functionality of delivering IO page faults to
user space through the IOMMUFD framework. The use case is nested
translation, where modern IOMMU hardware supports two-stage translation
tables. The second-stage translation table is managed by the host VMM
while the first-stage translation table is owned by user space.
Hence, any IO page fault that occurs on the first-stage page table
should be delivered to user space and handled there. User space
should then return the page fault handling result to the device top-down
through the IOMMUFD response uAPI.
User space indicates its capability of handling IO page faults by setting
a user HWPT allocation flag IOMMU_HWPT_ALLOC_FLAGS_IOPF_CAPABLE. IOMMUFD
will then set up its infrastructure for page fault delivery. Together
with the iopf-capable flag, user space should also provide an eventfd
on which it will listen for any bottom-up page fault messages.
On a successful return of the allocation of iopf-capable HWPT, a fault
fd will be returned. User space can open and read fault messages from it
once the eventfd is signaled.
Besides the overall design, I'd like to hear comments about the
following design points:
- The IOMMUFD fault message format. It is very similar to that in
uapi/linux/iommu which has been discussed before and partially used by
the IOMMU SVA implementation. I'd like to get more comments on the
format when it comes to IOMMUFD.
- The timeout value for the pending page fault messages. Ideally we
should determine the timeout value from the device configuration, but
I failed to find any statement in the PCI specification (version 6.x).
A default of 100 milliseconds is selected in the implementation, but it
leaves room to grow the code for a per-device setting.
This series is for review comments only. I used the IOMMUFD selftest
to verify the hwpt allocation, attach/detach and replace, but I haven't
had a chance to run it on real hardware yet. I will do more testing in
the subsequent versions when I am confident that I am heading in the
right direction.
This series is based on the latest implementation of the nested
translation under discussion. The whole series and related patches are
available on GitHub:
https://github.com/LuBaolu/intel-iommu/commits/iommufd-io-pgfault-delivery-…
Best regards,
baolu
Lu Baolu (17):
iommu: Move iommu fault data to linux/iommu.h
iommu: Support asynchronous I/O page fault response
iommu: Add helper to set iopf handler for domain
iommu: Pass device parameter to iopf handler
iommu: Split IO page fault handling from SVA
iommu: Add iommu page fault cookie helpers
iommufd: Add iommu page fault data
iommufd: IO page fault delivery initialization and release
iommufd: Add iommufd hwpt iopf handler
iommufd: Add IOMMU_HWPT_ALLOC_FLAGS_USER_PASID_TABLE for hwpt_alloc
iommufd: Deliver fault messages to user space
iommufd: Add io page fault response support
iommufd: Add a timer for each iommufd fault data
iommufd: Drain all pending faults when destroying hwpt
iommufd: Allow new hwpt_alloc flags
iommufd/selftest: Add IOPF feature for mock devices
iommufd/selftest: Cover iopf-capable nested hwpt
include/linux/iommu.h | 175 +++++++++-
drivers/iommu/{iommu-sva.h => io-pgfault.h} | 25 +-
drivers/iommu/iommu-priv.h | 3 +
drivers/iommu/iommufd/iommufd_private.h | 32 ++
include/uapi/linux/iommu.h | 161 ---------
include/uapi/linux/iommufd.h | 73 +++-
tools/testing/selftests/iommu/iommufd_utils.h | 20 +-
.../iommu/arm/arm-smmu-v3/arm-smmu-v3-sva.c | 2 +-
drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 2 +-
drivers/iommu/intel/iommu.c | 2 +-
drivers/iommu/intel/svm.c | 2 +-
drivers/iommu/io-pgfault.c | 7 +-
drivers/iommu/iommu-sva.c | 4 +-
drivers/iommu/iommu.c | 50 ++-
drivers/iommu/iommufd/device.c | 64 +++-
drivers/iommu/iommufd/hw_pagetable.c | 318 +++++++++++++++++-
drivers/iommu/iommufd/main.c | 3 +
drivers/iommu/iommufd/selftest.c | 71 ++++
tools/testing/selftests/iommu/iommufd.c | 17 +-
MAINTAINERS | 1 -
drivers/iommu/Kconfig | 4 +
drivers/iommu/Makefile | 3 +-
drivers/iommu/intel/Kconfig | 1 +
23 files changed, 837 insertions(+), 203 deletions(-)
rename drivers/iommu/{iommu-sva.h => io-pgfault.h} (71%)
delete mode 100644 include/uapi/linux/iommu.h
--
2.34.1
When we collect a signal context with one of the SME modes enabled we will
have enabled that mode behind the compiler and libc's back so they may
issue some instructions not valid in streaming mode, causing spurious
failures.
For the code prior to issuing the BRK to trigger signal handling we need to
stay in streaming mode if we were already there since that's a part of the
signal context the caller is trying to collect. Unfortunately this code
includes a memset() which is likely to be heavily optimised and is likely
to use FP instructions incompatible with streaming mode. We can avoid this
happening by open coding the memset(), inserting a volatile assembly
statement to avoid the compiler recognising what's being done and doing
something in optimisation. This code is not performance critical so the
inefficiency should not be an issue.
After collecting the context we can simply exit streaming mode, avoiding
these issues. Use a full SMSTOP for safety to prevent any issues appearing
with ZA.
Reported-by: Will Deacon <will(a)kernel.org>
Signed-off-by: Mark Brown <broonie(a)kernel.org>
---
.../selftests/arm64/signal/test_signals_utils.h | 28 +++++++++++++++++++++-
1 file changed, 27 insertions(+), 1 deletion(-)
diff --git a/tools/testing/selftests/arm64/signal/test_signals_utils.h b/tools/testing/selftests/arm64/signal/test_signals_utils.h
index 222093f51b67..db28409fd44b 100644
--- a/tools/testing/selftests/arm64/signal/test_signals_utils.h
+++ b/tools/testing/selftests/arm64/signal/test_signals_utils.h
@@ -60,13 +60,28 @@ static __always_inline bool get_current_context(struct tdescr *td,
size_t dest_sz)
{
static volatile bool seen_already;
+ int i;
+ char *uc = (char *)dest_uc;
assert(td && dest_uc);
/* it's a genuine invocation..reinit */
seen_already = 0;
td->live_uc_valid = 0;
td->live_sz = dest_sz;
- memset(dest_uc, 0x00, td->live_sz);
+
+ /*
+ * This is a memset() but we don't want the compiler to
+ * optimise it into either instructions or a library call
+ * which might be incompatible with streaming mode.
+ */
+ for (i = 0; i < td->live_sz; i++) {
+ asm volatile("nop"
+ : "+m" (*dest_uc)
+ :
+ : "memory");
+ uc[i] = 0;
+ }
+
td->live_uc = dest_uc;
/*
* Grab ucontext_t triggering a SIGTRAP.
@@ -103,6 +118,17 @@ static __always_inline bool get_current_context(struct tdescr *td,
:
: "memory");
+ /*
+ * If we were grabbing a streaming mode context then we may
+ * have entered streaming mode behind the system's back and
+ * libc or compiler generated code might decide to do
+ * something invalid in streaming mode, or potentially even
+ * the state of ZA. Issue a SMSTOP to exit both now we have
+ * grabbed the state.
+ */
+ if (td->feats_supported & FEAT_SME)
+ asm volatile("msr S0_3_C4_C6_3, xzr");
+
/*
* If we get here with seen_already==1 it implies the td->live_uc
* context has been used to get back here....this probably means
---
base-commit: 6995e2de6891c724bfeb2db33d7b87775f913ad1
change-id: 20230628-arm64-signal-memcpy-fix-7de3b3c8fa10
Best regards,
--
Mark Brown <broonie(a)kernel.org>
Hi Mark,
While debugging the SME issue reported in CI, I noticed that the
streaming SVE tests are failing on the fastmodel because of an
unexpected SIGILL. For example:
will:arm64/signal$ ./ssve_za_regs
# Streaming SVE registers :: Check that we get the right Streaming SVE registers reported
Registered handlers for all signals.
Detected MINSTKSIGSZ:4720
Required Features: [ SME ] supported
Incompatible Features: [] absent
Testcase initialized.
Testing VL 64
-- RX UNEXPECTED SIGNAL: 4
==>> completed. FAIL(0)
The signal is injected because we get an SME trap due to an fpsimd, sve
or sve2 instruction being used in streaming mode (ESR is 0x76000001).
I did a bit of digging and it looks like this is my libc using a vector
DUP instruction in memset:
#0 __memset_generic () at ../sysdeps/aarch64/memset.S:37
#1 0x0000aaaaaaaa1170 in get_current_context (dest_sz=131072,
dest_uc=0xaaaaaeab6ba0 <context>, td=0xaaaaaaab50f0 <tde>)
at ./test_signals_utils.h:69
#2 do_one_sme_vl (si=<optimized out>, uc=<optimized out>, vl=64,
td=0xaaaaaaab50f0 <tde>) at testcases/ssve_za_regs.c:90
#3 sme_regs (td=0xaaaaaaab50f0 <tde>, si=<optimized out>, uc=<optimized out>)
at testcases/ssve_za_regs.c:145
#4 0x0000aaaaaaaa0ed0 in main (argc=<optimized out>, argv=<optimized out>)
at test_signals.c:21
Dump of assembler code for function __memset_generic:
=> 0x0000fffff7edfb00 <+0>: dup v0.16b, w1
The easy option would be to require FA64 for these tests, but I guess it
would be better to exit streaming mode.
Please can you have a look?
Thanks,
Will
Awk is already called for /sys/block/zram#/mm_stat parsing, so use it
to also perform the floating point capacity vs consumption ratio
calculations. The test output is unchanged.
This allows bc to be dropped as a dependency for the zram selftests.
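For reference, the arithmetic the awk script performs can be restated in Python. This is just a sketch for readers checking the numbers: the function name and the sample mm_stat line are made up, and the third mm_stat field is taken as mem_used_total, as in the script.

```python
def ratio_from_mm_stat(filled_kb, mm_stat_line):
    """Mirror the awk logic: return (ok, message) for one zram device."""
    mem_used_total = int(mm_stat_line.split()[2])   # third field, in bytes
    v = 100 * 1024 * filled_kb / mem_used_total
    if v < 100:
        # awk's %u truncates the value, so int() is used here
        return False, "FAIL compression ratio: 0.%d:1" % int(v)
    return True, "zram compression ratio: %.2f:1: OK" % (v / 100)


# Hypothetical sample: 1024 KB written, 262144 bytes used by zram.
ok, message = ratio_from_mm_stat(1024, "1048576 1000 262144 0 262144 0 0 0")
bad, bad_message = ratio_from_mm_stat(1, "4096 4096 2048 0 2048 0 0 0")
```

The first call reports a 4.00:1 ratio; the second falls below 1:1 and takes the FAIL branch.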
Signed-off-by: David Disseldorp <ddiss(a)suse.de>
---
tools/testing/selftests/zram/zram01.sh | 18 ++++++++----------
1 file changed, 8 insertions(+), 10 deletions(-)
diff --git a/tools/testing/selftests/zram/zram01.sh b/tools/testing/selftests/zram/zram01.sh
index 8f4affe34f3e4..df1b1d4158989 100755
--- a/tools/testing/selftests/zram/zram01.sh
+++ b/tools/testing/selftests/zram/zram01.sh
@@ -33,7 +33,7 @@ zram_algs="lzo"
zram_fill_fs()
{
- for i in $(seq $dev_start $dev_end); do
+ for ((i = $dev_start; i <= $dev_end && !ERR_CODE; i++)); do
echo "fill zram$i..."
local b=0
while [ true ]; do
@@ -44,15 +44,13 @@ zram_fill_fs()
done
echo "zram$i can be filled with '$b' KB"
- local mem_used_total=`awk '{print $3}' "/sys/block/zram$i/mm_stat"`
- local v=$((100 * 1024 * $b / $mem_used_total))
- if [ "$v" -lt 100 ]; then
- echo "FAIL compression ratio: 0.$v:1"
- ERR_CODE=-1
- return
- fi
-
- echo "zram compression ratio: $(echo "scale=2; $v / 100 " | bc):1: OK"
+ awk -v b="$b" '{ v = (100 * 1024 * b / $3) } END {
+ if (v < 100) {
+ printf "FAIL compression ratio: 0.%u:1\n", v
+ exit 1
+ }
+ printf "zram compression ratio: %.2f:1: OK\n", v / 100
+ }' "/sys/block/zram$i/mm_stat" || ERR_CODE=-1
done
}
--
2.35.3
KVM_GET_REG_LIST dumps all register IDs that are available to
KVM_GET/SET_ONE_REG, and it is very useful for identifying platform
regressions during VM migration.
Patches 1-7 restructure the get-reg-list test in aarch64 to turn some
of the code into a common test framework that can be shared by riscv.
Patch 8 moves the reject_set check logic to a function so as to check for
different errnos for different registers.
Patch 9 changes the test to do the get/set operation only on the
present-blessed list.
Patch 10 enables the KVM_GET_REG_LIST API in riscv.
Patches 11-12 add the corresponding kselftest for checking possible
register regressions.
The get-reg-list kvm selftest was ported from aarch64 and tested with
Linux 6.4-rc6 on a QEMU riscv64 virt machine.
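The core idea of comparing the reported list against a blessed list can be sketched in a few lines of Python. This is not the selftest code: the function name and the register IDs are placeholders chosen for illustration.

```python
def compare_reg_lists(reported, blessed):
    """Classify register IDs relative to a known-good ("blessed") list."""
    reported_set, blessed_set = set(reported), set(blessed)
    return {
        "new": sorted(reported_set - blessed_set),       # not yet blessed
        "missing": sorted(blessed_set - reported_set),   # possible regression
        "present_blessed": sorted(reported_set & blessed_set),
    }


# Placeholder IDs, not real KVM register encodings.
result = compare_reg_lists(reported=[0x10, 0x11, 0x13],
                           blessed=[0x10, 0x11, 0x12])
```

Only the IDs in present_blessed would be exercised with get/set, which is the restriction patch 9 describes; a non-empty missing list is what points at a regression.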
---
Changed since v3:
* Rebase to Linux 6.4-rc6
* Address Andrew's suggestions and comments:
* Move reject_set check logic to a function
* Only do get/set tests on present blessed list
* Only enable ISA extension for the specified config
* For disable-not-allowed registers, move them to the filter-reg-list
Andrew Jones (7):
KVM: arm64: selftests: Replace str_with_index with strdup_printf
KVM: arm64: selftests: Drop SVE cap check in print_reg
KVM: arm64: selftests: Remove print_reg's dependency on vcpu_config
KVM: arm64: selftests: Rename vcpu_config and add to kvm_util.h
KVM: arm64: selftests: Delete core_reg_fixup
KVM: arm64: selftests: Split get-reg-list test code
KVM: arm64: selftests: Finish generalizing get-reg-list
Haibo Xu (5):
KVM: arm64: selftests: Move reject_set check logic to a function
KVM: selftests: Only do get/set tests on present blessed list
KVM: riscv: Add KVM_GET_REG_LIST API support
KVM: riscv: selftests: Add finalize_vcpu check in run_test
KVM: riscv: selftests: Add get-reg-list test
Documentation/virt/kvm/api.rst | 2 +-
arch/riscv/kvm/vcpu.c | 375 +++++++++
tools/testing/selftests/kvm/Makefile | 11 +-
.../selftests/kvm/aarch64/get-reg-list.c | 538 ++-----------
tools/testing/selftests/kvm/get-reg-list.c | 439 ++++++++++
.../selftests/kvm/include/kvm_util_base.h | 16 +
.../selftests/kvm/include/riscv/processor.h | 3 +
.../testing/selftests/kvm/include/test_util.h | 2 +
tools/testing/selftests/kvm/lib/test_util.c | 15 +
.../selftests/kvm/riscv/get-reg-list.c | 752 ++++++++++++++++++
10 files changed, 1658 insertions(+), 495 deletions(-)
create mode 100644 tools/testing/selftests/kvm/get-reg-list.c
create mode 100644 tools/testing/selftests/kvm/riscv/get-reg-list.c
--
2.34.1
This patch introduces two tests for the EVIOCSABS ioctl. The first one
checks that the ioctl fails when the EV_ABS bit was not set, and the
second one just checks that the normal workflow for this ioctl
succeeds.
Signed-off-by: Dana Elfassy <dangel101(a)gmail.com>
---
This patch depends on '[v3] selftests/input: Introduce basic tests for evdev ioctls' [1] sent to the ML.
[1] https://patchwork.kernel.org/project/linux-input/patch/20230607153214.15933…
tools/testing/selftests/input/evioc-test.c | 23 ++++++++++++++++++++++
1 file changed, 23 insertions(+)
diff --git a/tools/testing/selftests/input/evioc-test.c b/tools/testing/selftests/input/evioc-test.c
index 4c0c8ebed378..7afd537f0b24 100644
--- a/tools/testing/selftests/input/evioc-test.c
+++ b/tools/testing/selftests/input/evioc-test.c
@@ -279,4 +279,27 @@ TEST(eviocgrep_get_repeat_settings)
selftest_uinput_destroy(uidev);
}
+TEST(eviocsabs_set_abs_value_limits)
+{
+ struct selftest_uinput *uidev;
+ struct input_absinfo absinfo;
+ int rc;
+
+ // fail test on dev->absinfo
+ rc = selftest_uinput_create_device(&uidev, -1);
+ ASSERT_EQ(0, rc);
+ ASSERT_NE(NULL, uidev);
+ rc = ioctl(uidev->evdev_fd, EVIOCSABS(0), &absinfo);
+ ASSERT_EQ(-1, rc);
+ selftest_uinput_destroy(uidev);
+
+ // ioctl normal flow
+ rc = selftest_uinput_create_device(&uidev, EV_ABS, -1);
+ ASSERT_EQ(0, rc);
+ ASSERT_NE(NULL, uidev);
+ rc = ioctl(uidev->evdev_fd, EVIOCSABS(0), &absinfo);
+ ASSERT_EQ(0, rc);
+ selftest_uinput_destroy(uidev);
+}
+
TEST_HARNESS_MAIN
--
2.41.0
Changes in v21:
- Abort walk instead of returning error if WP is to be performed on
partial hugetlb
*Changes in v20*
- Correct PAGE_IS_FILE and add PAGE_IS_PFNZERO
*Changes in v19*
- Minor changes and interface updates
*Changes in v18*
- Rebase on top of next-20230613
- Minor updates
*Changes in v17*
- Rebase on top of next-20230606
- Minor improvements in PAGEMAP_SCAN IOCTL patch
*Changes in v16*
- Fix a corner case
- Add exclusive PM_SCAN_OP_WP back
*Changes in v15*
- Build fix (Add missed build fix in RESEND)
*Changes in v14*
- Fix build error caused by #ifdef added at last minute in some configs
*Changes in v13*
- Rebase on top of next-20230414
- Give-up on using uffd_wp_range() and write new helpers, flush tlb only
once
*Changes in v12*
- Update and other memory types to UFFD_FEATURE_WP_ASYNC
- Rebase on top of next-20230406
- Review updates
*Changes in v11*
- Rebase on top of next-20230307
- Base patches on UFFD_FEATURE_WP_UNPOPULATED
- Do a lot of cosmetic changes and review updates
- Remove ENGAGE_WP + !GET operation as it can be performed with
UFFDIO_WRITEPROTECT
*Changes in v10*
- Add specific condition to return error if hugetlb is used with wp
async
- Move changes in tools/include/uapi/linux/fs.h to separate patch
- Add documentation
*Changes in v9:*
- Correct fault resolution for userfaultfd wp async
- Fix build warnings and errors which were happening on some configs
- Simplify pagemap ioctl's code
*Changes in v8:*
- Update uffd async wp implementation
- Improve PAGEMAP_IOCTL implementation
*Changes in v7:*
- Add uffd wp async
- Update the IOCTL to use uffd under the hood instead of soft-dirty
flags
*Motivation*
The real motivation for adding PAGEMAP_SCAN IOCTL is to emulate Windows
GetWriteWatch() syscall [1]. GetWriteWatch() retrieves the addresses of
the pages that are written to in a region of virtual memory.
This syscall is used in Windows applications and games etc. This syscall is
currently emulated in a pretty slow manner in userspace. Our purpose is to
enhance the kernel such that we can emulate it efficiently.
Currently some out of tree hack patches are being used to efficiently
emulate it in some kernels. We intend to replace those with these patches.
So the whole gaming on Linux can effectively get benefit from this. It
means there would be tons of users of this code.
CRIU use case [2] was mentioned by Andrei and Danylo:
> Use cases for migrating sparse VMAs are binaries sanitized with ASAN,
> MSAN or TSAN [3]. All of these sanitizers produce sparse mappings of
> shadow memory [4]. Being able to migrate such binaries allows to highly
> reduce the amount of work needed to identify and fix post-migration
> crashes, which happen constantly.
Andrei defines the following uses of this code:
* it is more granular and allows us to track changed pages more
effectively. The current interface can clear dirty bits for the entire
process only. In addition, reading info about pages is a separate
operation. It means we must freeze the process to read information
about all its pages, reset dirty bits, only then we can start dumping
pages. The information about pages becomes more and more outdated,
while we are processing pages. The new interface solves both these
downsides. First, it allows us to read pte bits and clear the
soft-dirty bit atomically. It means that CRIU will not need to freeze
processes to pre-dump their memory. Second, it clears soft-dirty bits
for a specified region of memory. It means CRIU will have actual info
about pages to the moment of dumping them.
* The new interface has to be much faster because basic page filtering
is happening in the kernel. With the old interface, we have to read
pagemap for each page.
*Implementation Evolution (Short Summary)*
From the definition of GetWriteWatch(), we feel like kernel's soft-dirty
feature can be used under the hood with some additions like:
* reset soft-dirty flag for only a specific region of memory instead of
clearing the flag for the entire process
* get and clear soft-dirty flag for a specific region atomically
So we decided to use ioctl on pagemap file to read and/or reset soft-dirty
flag. But using soft-dirty flag, sometimes we get extra pages which weren't
even written. They had become soft-dirty because of VMA merging and
VM_SOFTDIRTY flag. This breaks the definition of GetWriteWatch(). We were
able to bypass this shortcoming by ignoring VM_SOFTDIRTY until David
reported that mprotect etc. messes up the soft-dirty flag while ignoring
VM_SOFTDIRTY [5]. This wasn't happening until [6] got introduced. We
discussed if we can revert these patches. But we could not reach any
conclusion. So at this point, I made a couple of attempts to solve this whole
VM_SOFTDIRTY issue by correcting the soft-dirty implementation:
* [7] Correct the bug fixed wrongly back in 2014. It had the potential to
cause regressions. We left it behind.
* [8] Keep a list of the soft-dirty parts of a VMA across splits and merges.
The reply I got was not to increase the size of the VMA by 8 bytes.
At this point, we left soft-dirty, considering it too delicate, and
userfaultfd [9] seemed like the only way forward. From there onward, we
have been basing soft-dirty emulation on the userfaultfd wp feature, where
the kernel resolves the faults itself when the WP_ASYNC feature is used. It
was straightforward to add the WP_ASYNC feature to userfaultfd. Now only
those pages which have really been written to are reported as dirty or
written-to. (PS: another userfaultfd feature, WP_UNPOPULATED, is required
to avoid pre-faulting memory before write-protecting [9].)
All the different masks were added at the request of the CRIU devs to make
the interface more generic.
[1] https://learn.microsoft.com/en-us/windows/win32/api/memoryapi/nf-memoryapi-…
[2] https://lore.kernel.org/all/20221014134802.1361436-1-mdanylo@google.com
[3] https://github.com/google/sanitizers
[4] https://github.com/google/sanitizers/wiki/AddressSanitizerAlgorithm#64-bit
[5] https://lore.kernel.org/all/bfcae708-db21-04b4-0bbe-712badd03071@redhat.com
[6] https://lore.kernel.org/all/20220725142048.30450-1-peterx@redhat.com/
[7] https://lore.kernel.org/all/20221122115007.2787017-1-usama.anjum@collabora.…
[8] https://lore.kernel.org/all/20221220162606.1595355-1-usama.anjum@collabora.…
[9] https://lore.kernel.org/all/20230306213925.617814-1-peterx@redhat.com
[10] https://lore.kernel.org/all/20230125144529.1630917-1-mdanylo@google.com
*Original Cover letter from v8*
Hello,
Note:
Soft-dirty pages and pages which have been written-to are synonyms. As the
kernel already has a soft-dirty feature, which we have given up on
using, we use the written-to terminology while using UFFD async WP under
the hood.
This IOCTL, PAGEMAP_SCAN on pagemap file can be used to get and/or clear
the info about page table entries. The following operations are
supported in this ioctl:
- Get the information if the pages have been written-to (PAGE_IS_WRITTEN),
file mapped (PAGE_IS_FILE), present (PAGE_IS_PRESENT) or swapped
(PAGE_IS_SWAPPED).
- Write-protect the pages (PAGEMAP_WP_ENGAGE) to start finding which
pages have been written-to.
- Find pages which have been written-to and write protect the pages
(atomic PAGE_IS_WRITTEN + PAGEMAP_WP_ENGAGE)
It is possible to find and clear soft-dirty pages entirely in userspace.
But it isn't efficient:
- The mprotect and SIGSEGV handler for bookkeeping
- The userfaultfd wp (synchronous) with the handler for bookkeeping
Some benchmarks can be seen here[1]. This series adds features that weren't
present earlier:
- There is no atomic get-and-clear of the soft-dirty/written-to status in
the kernel.
- The pages which have been written-to cannot be found in an accurate way.
(Kernel's soft-dirty PTE bit + soft-dirty VMA bit shows more soft-dirty
pages than there actually are.)
Historically, soft-dirty PTE bit tracking has been used in the CRIU
project. The procfs interface is enough for finding the soft-dirty bit
status and clearing the soft-dirty bit of all the pages of a process.
We have the use case where we need to track the soft-dirty PTE bit for
only specific pages on-demand. We need this tracking and clear mechanism
of a region of memory while the process is running to emulate the
GetWriteWatch() syscall of Windows.
*(Moved to using UFFD instead of soft-dirty feature to find pages which
have been written-to from v7 patch series)*:
Stop using the soft-dirty flags for finding which pages have been
written to. It is too delicate and wrong as it shows more soft-dirty
pages than the actual soft-dirty pages. There is no interest in
correcting it [2][3] as this is how the feature was written years ago.
It shouldn't be updated in a way that changes its behaviour. Peter Xu has
suggested using the async version of the UFFD WP [4] as it is based
inherently on the PTEs.
So in this patch series, I've added a new mode to the UFFD which is an
asynchronous version of the write protect. When this variant of the
UFFD WP is used, the page faults are resolved automatically by the
kernel. The pages which have been written-to can be found by reading
pagemap file (!PM_UFFD_WP). This feature can be used successfully to
find which pages have been written to from the time the pages were
write protected. This works just like the soft-dirty flag without
showing any extra pages which aren't soft-dirty in reality.
The information about whether a page is file mapped, present or
swapped is required for the CRIU project [5][6]. The required mask, any
mask, excluded mask and return mask were also added for the CRIU
project [5].
The IOCTL returns the addresses of the pages which match the specific
masks. The page addresses are returned in struct page_region in a compact
form. The max_pages is needed to support a use case where the user only
wants to get a specific number of pages. So there is no need to find all the
pages of interest in the range when max_pages is specified. The IOCTL
returns when the maximum number of pages has been found. The max_pages is
optional. If max_pages is specified, it must be equal to or greater than the
vec_size. This restriction is needed to handle the worst case, when one
page_region only contains info of one page and it cannot be compacted.
This is needed to emulate the Windows GetWriteWatch() syscall.
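The matching and compaction described above can be pictured with a small
userspace model. This is only a sketch under assumptions, not the code from
the series: the names sk_masks, sk_region and sk_scan and the SK_* bit values
are hypothetical, the mask semantics (all required bits set, at least one
any-of bit set when that mask is non-zero, no excluded bit set) are an
assumed reading of the text, and max_pages == 0 is taken to mean "no limit".

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical category bits; the real values live in the series' uapi header. */
#define SK_WRITTEN (1u << 0)
#define SK_FILE    (1u << 1)
#define SK_PRESENT (1u << 2)
#define SK_SWAPPED (1u << 3)

/* Hypothetical stand-in for the four masks mentioned in the cover letter. */
struct sk_masks {
	uint32_t required; /* every bit here must be set on the page */
	uint32_t anyof;    /* if non-zero, at least one of these bits must be set */
	uint32_t excluded; /* none of these bits may be set */
	uint32_t ret;      /* bits reported back for a matching page */
};

/* Hypothetical compact region: a run of adjacent pages with equal reported bits. */
struct sk_region {
	size_t start;   /* index of the first page in the run */
	size_t len;     /* number of pages in the run */
	uint32_t flags; /* reported bits shared by the whole run */
};

/* Return 1 if a page with the given category bits passes all three filters. */
int sk_page_matches(uint32_t flags, const struct sk_masks *m)
{
	if ((flags & m->required) != m->required)
		return 0;
	if (m->anyof && !(flags & m->anyof))
		return 0;
	if (flags & m->excluded)
		return 0;
	return 1;
}

/*
 * Walk npages entries, merge adjacent matching pages that report the same
 * bits into one region, and stop early once max_pages pages were reported
 * (0 means no limit) or the output vector is full.
 * Returns the number of regions written to vec.
 */
size_t sk_scan(const uint32_t *pages, size_t npages, const struct sk_masks *m,
	       struct sk_region *vec, size_t vec_len, size_t max_pages)
{
	size_t nregions = 0, found = 0;

	for (size_t i = 0; i < npages; i++) {
		if (max_pages && found == max_pages)
			break;
		if (!sk_page_matches(pages[i], m))
			continue;

		uint32_t out = pages[i] & m->ret;
		struct sk_region *last = nregions ? &vec[nregions - 1] : NULL;

		if (last && last->start + last->len == i && last->flags == out) {
			last->len++;            /* extend the current run */
		} else {
			if (nregions == vec_len)
				break;          /* no room for a new region */
			vec[nregions].start = i;
			vec[nregions].len = 1;
			vec[nregions].flags = out;
			nregions++;
		}
		found++;
	}
	return nregions;
}
```

In this model, isolated matching pages that cannot be merged each consume one
region, so the number of regions can equal the number of pages reported. That
is the worst case the text refers to.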
The patch series includes a detailed selftest which can be used as an
example for the uffd async wp test and PAGEMAP_IOCTL. It shows the
interface usage as well.
[1] https://lore.kernel.org/lkml/54d4c322-cd6e-eefd-b161-2af2b56aae24@collabora…
[2] https://lore.kernel.org/all/20221220162606.1595355-1-usama.anjum@collabora.…
[3] https://lore.kernel.org/all/20221122115007.2787017-1-usama.anjum@collabora.…
[4] https://lore.kernel.org/all/Y6Hc2d+7eTKs7AiH@x1n
[5] https://lore.kernel.org/all/YyiDg79flhWoMDZB@gmail.com/
[6] https://lore.kernel.org/all/20221014134802.1361436-1-mdanylo@google.com/
Regards,
Muhammad Usama Anjum
Muhammad Usama Anjum (4):
fs/proc/task_mmu: Implement IOCTL to get and optionally clear info
about PTEs
tools headers UAPI: Update linux/fs.h with the kernel sources
mm/pagemap: add documentation of PAGEMAP_SCAN IOCTL
selftests: mm: add pagemap ioctl tests
Peter Xu (1):
userfaultfd: UFFD_FEATURE_WP_ASYNC
Documentation/admin-guide/mm/pagemap.rst | 58 +
Documentation/admin-guide/mm/userfaultfd.rst | 35 +
fs/proc/task_mmu.c | 560 +++++++
fs/userfaultfd.c | 26 +-
include/linux/hugetlb.h | 1 +
include/linux/userfaultfd_k.h | 21 +-
include/uapi/linux/fs.h | 54 +
include/uapi/linux/userfaultfd.h | 9 +-
mm/hugetlb.c | 34 +-
mm/memory.c | 27 +-
tools/include/uapi/linux/fs.h | 54 +
tools/testing/selftests/mm/.gitignore | 2 +
tools/testing/selftests/mm/Makefile | 3 +-
tools/testing/selftests/mm/config | 1 +
tools/testing/selftests/mm/pagemap_ioctl.c | 1464 ++++++++++++++++++
tools/testing/selftests/mm/run_vmtests.sh | 4 +
16 files changed, 2329 insertions(+), 24 deletions(-)
create mode 100644 tools/testing/selftests/mm/pagemap_ioctl.c
mode change 100644 => 100755 tools/testing/selftests/mm/run_vmtests.sh
--
2.39.2
Hi Linus,
Please pull the following Kselftest update for Linux 6.5-rc1.
This kselftest update for Linux 6.5-rc1 consists of:
- change to allow runners to override the timeout
This change is made to avoid future increases of long
timeouts
- several other spelling fixes and cleanups
- a new subtest to video_device_test
- enhancements to test coverage in clone3 test
- other fixes to ftrace and cpufreq tests
diff is attached.
thanks,
-- Shuah
----------------------------------------------------------------
The following changes since commit 858fd168a95c5b9669aac8db6c14a9aeab446375:
Linux 6.4-rc6 (2023-06-11 14:35:30 -0700)
are available in the Git repository at:
git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest tags/linux-kselftest-next-6.5-rc1
for you to fetch changes up to 8cd0d8633e2de4e6dd9ddae7980432e726220fdb:
selftests/ftace: Fix KTAP output ordering (2023-06-12 16:40:22 -0600)
----------------------------------------------------------------
linux-kselftest-next-6.5-rc1
This kselftest update for Linux 6.5-rc1 consists of:
- change to allow runners to override the timeout
This change is made to avoid future increases of long
timeouts
- several other spelling fixes and cleanups
- a new subtest to video_device_test
- enhancements to test coverage in clone3 test
- other fixes to ftrace and cpufreq tests
----------------------------------------------------------------
Akanksha J N (1):
selftests/ftrace: Add new test case which checks for optimized probes
Colin Ian King (2):
selftests: prctl: Fix spelling mistake "anonynous" -> "anonymous"
kselftest: vDSO: Fix accumulation of uninitialized ret when CLOCK_REALTIME is undefined
Ivan Orlov (1):
selftests: media_tests: Add new subtest to video_device_test
Luis Chamberlain (1):
selftests: allow runners to override the timeout
Mark Brown (2):
selftests/cpufreq: Don't enable generic lock debugging options
selftests/ftace: Fix KTAP output ordering
Rishabh Bhatnagar (1):
kselftests: Sort the collections list to avoid duplicate tests
Tobias Klauser (1):
selftests/clone3: test clone3 with exit signal in flags
Ziqi Zhao (1):
selftest: pidfd: Omit long and repeating outputs
Documentation/dev-tools/kselftest.rst | 22 ++++
tools/testing/selftests/clone3/clone3.c | 5 +-
tools/testing/selftests/cpufreq/config | 8 --
tools/testing/selftests/ftrace/ftracetest | 2 +-
.../ftrace/test.d/kprobe/kprobe_opt_types.tc | 34 +++++++
tools/testing/selftests/kselftest/runner.sh | 11 +-
.../selftests/media_tests/video_device_test.c | 111 +++++++++++++++------
tools/testing/selftests/pidfd/pidfd.h | 1 -
tools/testing/selftests/pidfd/pidfd_fdinfo_test.c | 1 +
tools/testing/selftests/pidfd/pidfd_test.c | 3 +-
.../selftests/prctl/set-anon-vma-name-test.c | 2 +-
tools/testing/selftests/run_kselftest.sh | 7 +-
.../selftests/vDSO/vdso_test_clock_getres.c | 4 +-
13 files changed, 166 insertions(+), 45 deletions(-)
create mode 100644 tools/testing/selftests/ftrace/test.d/kprobe/kprobe_opt_types.tc
----------------------------------------------------------------
Make sv39 the default address space for mmap as some applications
currently depend on this assumption. The RISC-V specification enforces
that bits outside of the virtual address range are not used, so
restricting the size of the default address space as such should be
temporary. A hint address passed to mmap will cause the largest address
space that fits entirely into the hint to be used. If the hint is less
than or equal to 1<<38, a 39-bit address will be used. After an address
space is completely full, the next smallest address space will be used.
Documentation is also added to the RISC-V virtual memory section to explain
these changes.
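The hint rule above can be sketched as a small pure function. This is a
model under assumptions rather than the patch's code: the function name and
the hw_va_bits parameter are hypothetical, the user half of an svN address
space is assumed to span 1 << (N - 1) bytes, and a hint of 0 is assumed to
mean "no hint", which selects the sv39 default. The fallback to the next
smallest address space when one is full is not modeled.

```c
#include <stdint.h>

/*
 * Return the virtual-address width (39, 48 or 57) this sketch would pick
 * for an mmap hint address.
 *
 * hint:       hint address passed to mmap (0 is treated as "no hint")
 * hw_va_bits: widest mode the hardware supports (39, 48 or 57)
 *
 * An address space is chosen only if it fits entirely below the hint,
 * i.e. the hint is at or above the top of its assumed user range.
 */
int sk_va_bits_for_hint(uint64_t hint, int hw_va_bits)
{
	if (hint == 0)
		return 39; /* default address space */
	if (hw_va_bits >= 57 && hint >= (UINT64_C(1) << 56))
		return 57;
	if (hw_va_bits >= 48 && hint >= (UINT64_C(1) << 47))
		return 48;
	return 39; /* includes every hint <= 1 << 38 */
}
```

Under these assumptions an application that wants addresses above the sv39
range has to ask for them explicitly with a high hint, while callers that
pass no hint keep getting 39-bit addresses.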
Charlie Jenkins (2):
RISC-V: mm: Restrict address space for sv39,sv48,sv57
RISC-V: mm: Update documentation and include test
Documentation/riscv/vm-layout.rst | 20 ++++++++
arch/riscv/include/asm/elf.h | 2 +-
arch/riscv/include/asm/pgtable.h | 21 ++++++--
arch/riscv/include/asm/processor.h | 41 +++++++++++++---
tools/testing/selftests/riscv/Makefile | 2 +-
tools/testing/selftests/riscv/mm/Makefile | 22 +++++++++
.../selftests/riscv/mm/testcases/mmap.c | 49 +++++++++++++++++++
7 files changed, 144 insertions(+), 13 deletions(-)
create mode 100644 tools/testing/selftests/riscv/mm/Makefile
create mode 100644 tools/testing/selftests/riscv/mm/testcases/mmap.c
base-commit: eef509789cecdce895020682192d32e8bac790e8
--
2.34.1
Hello!
Here is v4 of the mremap start address optimization / fix for exec warning. It
took me a while to write a test that catches the issue Linus and I discussed in
the last version, and I verified that the kernel crashes without the check. See
below.
The main change in this series is:
Care is taken with moves purely within a VMA, in other words this check
in call_align_down():
if (vma->vm_start != addr_masked)
return false;
As an example of why this is needed:
Consider the following range which is 2MB aligned and is
a part of a larger 10MB range which is not shown. Each
character is 256KB below making the source and destination
2MB each. The lower case letters are moved (s to d) and the
upper case letters are not moved.
|DDDDddddSSSSssss|
If we align down 'ssss' to start from the 'SSSS', we will end up destroying
SSSS. The above if statement prevents that and I verified it.
I also added a test for this in the last patch.
History of patches
==================
v3->v4:
1. Make sure to check address to align is beginning of VMA
2. Add test to check this (test fails with a kernel crash if we don't do this).
v2->v3:
1. Masked address was stored in int, fixed it to unsigned long to avoid truncation.
2. We now handle moves happening purely within a VMA, a new test is added to handle this.
3. More code comments.
v1->v2:
1. Trigger the optimization for mremaps smaller than a PMD. I tested by tracing
that it works correctly.
2. Fix issue with bogus return value found by Linus if we broke out of the
above loop for the first PMD itself.
v1: Initial RFC.
Description of patches
======================
These patches optimize the start addresses in move_page_tables() and test the
changes. It addresses a warning [1] that occurs due to a downward, overlapping
move on a mutually-aligned offset within a PMD during exec. By initiating the
copy process at the PMD level when such alignment is present, we can prevent
this warning and speed up the copying process at the same time. Linus Torvalds
suggested this idea.
Please check the individual patches for more details.
thanks,
- Joel
[1] https://lore.kernel.org/all/ZB2GTBD%2FLWTrkOiO@dhcp22.suse.cz/
Joel Fernandes (Google) (7):
mm/mremap: Optimize the start addresses in move_page_tables()
mm/mremap: Allow moves within the same VMA for stack
selftests: mm: Fix failure case when new remap region was not found
selftests: mm: Add a test for mutually aligned moves > PMD size
selftests: mm: Add a test for remapping to area immediately after
existing mapping
selftests: mm: Add a test for remapping within a range
selftests: mm: Add a test for moving from an offset from start of
mapping
fs/exec.c | 2 +-
include/linux/mm.h | 2 +-
mm/mremap.c | 63 ++++-
tools/testing/selftests/mm/mremap_test.c | 301 +++++++++++++++++++----
4 files changed, 319 insertions(+), 49 deletions(-)
--
2.41.0.rc2.161.g9c6817b8e7-goog
Hi Linus,
Please pull the following KUnit next update for Linux 6.5-rc1.
This KUnit update for Linux 6.5-rc1 consists of:
- kunit_add_action() API to defer a call until test exit.
- Update document to add kunit_add_action() usage notes.
- Changes to always run cleanup from a test kthread.
- Documentation updates to clarify cleanup usage
- assertions should not be used in cleanup
- Documentation update to clearly indicate that exit
functions should run even if init fails
- Several fixes and enhancements to existing tests.
diff is attached.
thanks,
-- Shuah
----------------------------------------------------------------
The following changes since commit ac9a78681b921877518763ba0e89202254349d1b:
Linux 6.4-rc1 (2023-05-07 13:34:35 -0700)
are available in the Git repository at:
git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest tags/linux-kselftest-kunit-6.5-rc1
for you to fetch changes up to 2e66833579ed759d7b7da1a8f07eb727ec6e80db:
MAINTAINERS: Add source tree entry for kunit (2023-06-15 09:16:01 -0600)
----------------------------------------------------------------
linux-kselftest-kunit-6.5-rc1
This KUnit update for Linux 6.5-rc1 consists of:
- kunit_add_action() API to defer a call until test exit.
- Update document to add kunit_add_action() usage notes.
- Changes to always run cleanup from a test kthread.
- Documentation updates to clarify cleanup usage
- assertions should not be used in cleanup
- Documentation update to clearly indicate that exit
functions should run even if init fails
- Several fixes and enhancements to existing tests.
----------------------------------------------------------------
Daniel Latypov (1):
kunit: tool: undo type subscripts for subprocess.Popen
David Gow (11):
kunit: Always run cleanup from a test kthread
Documentation: kunit: Note that assertions should not be used in cleanup
Documentation: kunit: Warn that exit functions run even if init fails
kunit: example: Provide example exit functions
kunit: Add kunit_add_action() to defer a call until test exit
kunit: executor_test: Use kunit_add_action()
kunit: kmalloc_array: Use kunit_add_action()
Documentation: kunit: Add usage notes for kunit_add_action()
kunit: Fix obsolete name in documentation headers (func->action)
kunit: Move kunit_abort() call out of kunit_do_failed_assertion()
Documentation: kunit: Rename references to kunit_abort()
Geert Uytterhoeven (1):
Documentation: kunit: Modular tests should not depend on KUNIT=y
Michal Wajdeczko (3):
kunit/test: Add example test showing parameterized testing
kunit: Fix reporting of the skipped parameterized tests
kunit: Update kunit_print_ok_not_ok function
SeongJae Park (1):
MAINTAINERS: Add source tree entry for kunit
Takashi Sakamoto (1):
Documentation: Kunit: add MODULE_LICENSE to sample code
Documentation/dev-tools/kunit/architecture.rst | 4 +-
Documentation/dev-tools/kunit/start.rst | 7 +-
Documentation/dev-tools/kunit/usage.rst | 69 ++++++++++-
MAINTAINERS | 2 +
include/kunit/resource.h | 92 +++++++++++++++
include/kunit/test.h | 34 ++++--
lib/kunit/executor_test.c | 11 +-
lib/kunit/kunit-example-test.c | 56 +++++++++
lib/kunit/kunit-test.c | 88 +++++++++++++-
lib/kunit/resource.c | 99 ++++++++++++++++
lib/kunit/test.c | 157 ++++++++++++++-----------
tools/testing/kunit/kunit_kernel.py | 6 +-
tools/testing/kunit/mypy.ini | 6 +
tools/testing/kunit/run_checks.py | 2 +-
14 files changed, 538 insertions(+), 95 deletions(-)
create mode 100644 tools/testing/kunit/mypy.ini
----------------------------------------------------------------
Hi Shuah,
This series contains updates to the rseq selftests.
* A typo in the Makefile prevents basic_percpu_ops_mm_cid_test from using
the mm_cid field.
* Fix load-acquire/store-release macros which were buggy on arm64.
(this depends on commit "Implement rseq_unqual_scalar_typeof").
* The change "Use rseq_unqual_scalar_typeof in macros" is not a fix
per se, but improves the generated assembly.
Can you pick these up in the selftests tree, please?
Thanks,
Mathieu
Mathieu Desnoyers (4):
selftests/rseq: Fix CID_ID typo in Makefile
selftests/rseq: Implement rseq_unqual_scalar_typeof
selftests/rseq: Fix arm64 buggy load-acquire/store-release macros
selftests/rseq: Use rseq_unqual_scalar_typeof in macros
tools/testing/selftests/rseq/Makefile | 2 +-
tools/testing/selftests/rseq/compiler.h | 26 ++++++++++
tools/testing/selftests/rseq/rseq-arm.h | 4 +-
tools/testing/selftests/rseq/rseq-arm64.h | 58 ++++++++++++-----------
tools/testing/selftests/rseq/rseq-mips.h | 4 +-
tools/testing/selftests/rseq/rseq-ppc.h | 4 +-
tools/testing/selftests/rseq/rseq-riscv.h | 6 +--
tools/testing/selftests/rseq/rseq-s390.h | 4 +-
tools/testing/selftests/rseq/rseq-x86.h | 4 +-
9 files changed, 70 insertions(+), 42 deletions(-)
--
2.25.1
We want to replace iptables TPROXY with a BPF program at TC ingress.
To make this work in all cases we need to assign a SO_REUSEPORT socket
to an skb, which is currently prohibited. This series adds support for
such sockets to bpf_sk_assign.
I did some refactoring to cut down on the amount of duplicate code. The
key to this is to use INDIRECT_CALL in the reuseport helpers. To show
that this approach is not just beneficial to TC sk_assign I removed
duplicate code for bpf_sk_lookup as well.
Changes from v1:
- Correct commit abbrev length (Kuniyuki)
- Reduce duplication (Kuniyuki)
- Add checks on sk_state (Martin)
- Split exporting inet[6]_lookup_reuseport into separate patch (Eric)
Joint work with Daniel Borkmann.
Signed-off-by: Lorenz Bauer <lmb(a)isovalent.com>
---
Changes in v3:
- Fix warning re udp_ehashfn and udp6_ehashfn (Simon)
- Return higher scoring connected UDP reuseport sockets (Kuniyuki)
- Fix ipv6 module builds
- Link to v2: https://lore.kernel.org/r/20230613-so-reuseport-v2-0-b7c69a342613@isovalent…
---
Daniel Borkmann (1):
selftests/bpf: Test that SO_REUSEPORT can be used with sk_assign helper
Lorenz Bauer (6):
udp: re-score reuseport groups when connected sockets are present
net: export inet_lookup_reuseport and inet6_lookup_reuseport
net: document inet[6]_lookup_reuseport sk_state requirements
net: remove duplicate reuseport_lookup functions
net: remove duplicate sk_lookup helpers
bpf, net: Support SO_REUSEPORT sockets with bpf_sk_assign
include/net/inet6_hashtables.h | 84 ++++++++-
include/net/inet_hashtables.h | 77 +++++++-
include/net/sock.h | 7 +-
include/net/udp.h | 8 +
include/uapi/linux/bpf.h | 3 -
net/core/filter.c | 2 -
net/ipv4/inet_hashtables.c | 70 +++++---
net/ipv4/udp.c | 88 ++++-----
net/ipv6/inet6_hashtables.c | 73 +++++---
net/ipv6/udp.c | 98 ++++------
tools/include/uapi/linux/bpf.h | 3 -
tools/testing/selftests/bpf/network_helpers.c | 3 +
.../selftests/bpf/prog_tests/assign_reuse.c | 197 +++++++++++++++++++++
.../selftests/bpf/progs/test_assign_reuse.c | 142 +++++++++++++++
14 files changed, 676 insertions(+), 179 deletions(-)
---
base-commit: 970308a7b544fa1c7ee98a2721faba3765be8dd8
change-id: 20230613-so-reuseport-e92c526173ee
Best regards,
--
Lorenz Bauer <lmb(a)isovalent.com>
v3:
- [v2] https://lore.kernel.org/lkml/20230531163405.2200292-1-longman@redhat.com/
- Change the new control file from root-only "cpuset.cpus.reserve" to
non-root "cpuset.cpus.exclusive" which lists the set of exclusive
CPUs distributed down the hierarchy.
- Add a patch to restrict boot-time isolated CPUs to isolated
partitions only.
- Update the test_cpuset_prs.sh test script and documentation
accordingly.
v2:
- [v1] https://lore.kernel.org/lkml/20230412153758.3088111-1-longman@redhat.com/
- Dropped the special "isolcpus" partition in v1
- Add the root only "cpuset.cpus.reserve" control file for reserving
CPUs used for remote isolated partitions.
- Update the test_cpuset_prs.sh test script and documentation
accordingly.
This patch series introduces a new cpuset control file
"cpuset.cpus.exclusive" which must be a subset of "cpuset.cpus"
and the parent's "cpuset.cpus.exclusive". This control file lists
the exclusive CPUs to be distributed down the hierarchy. Any one
of the exclusive CPUs can only be distributed to at most one child
cpuset. Unlike "cpuset.cpus", invalid input to "cpuset.cpus.exclusive"
will be rejected with an error. This new control file has no effect on
the behavior of the cpuset until it turns into a partition root. At that
point, its effective CPUs will be set to its exclusive CPUs unless some
of them are offline.
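The constraints on "cpuset.cpus.exclusive" described above can be modeled
with plain bitmasks. This is a sketch only, with hypothetical function names
and CPU sets represented as 64-bit masks (bit n stands for CPU n); the
kernel uses its own cpumask type, and the special handling of the root
cgroup is not modeled.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * A proposed "cpuset.cpus.exclusive" value is accepted in this model only
 * if it is a subset of both the cpuset's own "cpuset.cpus" and the
 * parent's "cpuset.cpus.exclusive".
 */
bool sk_exclusive_valid(uint64_t exclusive, uint64_t cpus,
			uint64_t parent_exclusive)
{
	return (exclusive & ~cpus) == 0 &&
	       (exclusive & ~parent_exclusive) == 0;
}

/*
 * Each exclusive CPU may be handed to at most one child, so the children's
 * exclusive masks must be pairwise disjoint.
 */
bool sk_siblings_disjoint(const uint64_t *child_exclusive, size_t n)
{
	uint64_t seen = 0;

	for (size_t i = 0; i < n; i++) {
		if (seen & child_exclusive[i])
			return false; /* some CPU was already given out */
		seen |= child_exclusive[i];
	}
	return true;
}
```

For example, with "cpuset.cpus" covering CPUs 0-3 and the parent's exclusive
set covering CPUs 2-5, this model accepts an exclusive set of CPUs 2-3 and
rejects CPUs 4-5, which are outside "cpuset.cpus".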
This patch series also introduces a new category of cpuset partition
called remote partitions. The existing partition category where the
partition roots have to be clustered around the root cgroup in a
hierarchical way is now referred to as local partitions.
A remote partition can be formed far from the root cgroup
with no partition root parent. Local partitions can be
created without touching "cpuset.cpus.exclusive", as it can be set
automatically when a cpuset becomes a local partition root. Remote
partitions, however, require properly set "cpuset.cpus.exclusive" values
down the hierarchy.
Both scheduling and isolated partitions can be formed in a remote
partition. A local partition can be created under a remote partition.
A remote partition, however, cannot be formed under a local partition
for now.
Modern container orchestration tools like Kubernetes use the cgroup
hierarchy to manage different containers, and they rely on other
middleware like systemd to help manage it. If a container needs to
use isolated CPUs, it is hard to get those with local partitions,
as that requires the administrative parent cgroup to be a partition
root too, which a tool like systemd may not be ready to manage.
With this patch series, we allow the creation of remote partitions
far from the root. The container management tool can manage the
"cpuset.cpus.exclusive" file without impacting the other cpuset
files that are managed by other middleware. Of course, invalid
"cpuset.cpus.exclusive" values will be rejected and changes to
"cpuset.cpus" can affect the value of "cpuset.cpus.exclusive" due to
the requirement that it has to be a subset of the former control file.
Waiman Long (9):
cgroup/cpuset: Inherit parent's load balance state in v2
cgroup/cpuset: Extract out CS_CPU_EXCLUSIVE & CS_SCHED_LOAD_BALANCE
handling
cgroup/cpuset: Improve temporary cpumasks handling
cgroup/cpuset: Allow suppression of sched domain rebuild in
update_cpumasks_hier()
cgroup/cpuset: Add cpuset.cpus.exclusive for v2
cgroup/cpuset: Introduce remote partition
cgroup/cpuset: Check partition conflict with housekeeping setup
cgroup/cpuset: Documentation update for partition
cgroup/cpuset: Extend test_cpuset_prs.sh to test remote partition
Documentation/admin-guide/cgroup-v2.rst | 100 +-
kernel/cgroup/cpuset.c | 1352 ++++++++++++-----
.../selftests/cgroup/test_cpuset_prs.sh | 398 +++--
3 files changed, 1297 insertions(+), 553 deletions(-)
--
2.31.1
Currently the write operation returns the count of writes regardless of
whether events are enabled or disabled. Fix this by returning -EBADF when
events are disabled.
v3 -> v4:
- Change the return value from zero to -EBADF
v2 -> v3:
- Change the return value from -ENOENT to zero
v1 -> v2:
- Change the return value from -EFAULT to -ENOENT
sunliming (3):
tracing/user_events: Fix incorrect return value for writing operation
when events are disabled
selftests/user_events: Enable the event before write_fault test in
ftrace self-test
selftests/user_events: Add test cases when event is disabled
kernel/trace/trace_events_user.c | 3 ++-
tools/testing/selftests/user_events/ftrace_test.c | 8 ++++++++
2 files changed, 10 insertions(+), 1 deletion(-)
--
2.25.1
On systems where netdevsim is built-in or loaded before the test
starts, kci_test_ipsec_offload doesn't remove the netdevsim device it
created during the test.
Fixes: e05b2d141fef ("netdevsim: move netdev creation/destruction to dev probe")
Signed-off-by: Sabrina Dubroca <sd(a)queasysnail.net>
---
tools/testing/selftests/net/rtnetlink.sh | 1 +
1 file changed, 1 insertion(+)
diff --git a/tools/testing/selftests/net/rtnetlink.sh b/tools/testing/selftests/net/rtnetlink.sh
index 383ac6fc037d..ba286d680fd9 100755
--- a/tools/testing/selftests/net/rtnetlink.sh
+++ b/tools/testing/selftests/net/rtnetlink.sh
@@ -860,6 +860,7 @@ EOF
fi
# clean up any leftovers
+ echo 0 > /sys/bus/netdevsim/del_device
$probed && rmmod netdevsim
if [ $ret -ne 0 ]; then
--
2.40.1
*Changes in v20*
- Correct PAGE_IS_FILE and add PAGE_IS_PFNZERO
*Changes in v19*
- Minor changes and interface updates
*Changes in v18*
- Rebase on top of next-20230613
- Minor updates
*Changes in v17*
- Rebase on top of next-20230606
- Minor improvements in PAGEMAP_SCAN IOCTL patch
*Changes in v16*
- Fix a corner case
- Add exclusive PM_SCAN_OP_WP back
*Changes in v15*
- Build fix (Add missed build fix in RESEND)
*Changes in v14*
- Fix build error caused by #ifdef added at last minute in some configs
*Changes in v13*
- Rebase on top of next-20230414
- Give-up on using uffd_wp_range() and write new helpers, flush tlb only
once
*Changes in v12*
- Update and other memory types to UFFD_FEATURE_WP_ASYNC
- Rebase on top of next-20230406
- Review updates
*Changes in v11*
- Rebase on top of next-20230307
- Base patches on UFFD_FEATURE_WP_UNPOPULATED
- Do a lot of cosmetic changes and review updates
- Remove ENGAGE_WP + !GET operation as it can be performed with
UFFDIO_WRITEPROTECT
*Changes in v10*
- Add specific condition to return error if hugetlb is used with wp
async
- Move changes in tools/include/uapi/linux/fs.h to separate patch
- Add documentation
*Changes in v9:*
- Correct fault resolution for userfaultfd wp async
- Fix build warnings and errors which were happening on some configs
- Simplify pagemap ioctl's code
*Changes in v8:*
- Update uffd async wp implementation
- Improve PAGEMAP_IOCTL implementation
*Changes in v7:*
- Add uffd wp async
- Update the IOCTL to use uffd under the hood instead of soft-dirty
flags
*Motivation*
The real motivation for adding PAGEMAP_SCAN IOCTL is to emulate Windows
GetWriteWatch() syscall [1]. GetWriteWatch() retrieves the addresses of
the pages that are written to in a region of virtual memory.
This syscall is used in Windows applications and games etc. It is currently
being emulated in userspace in a pretty slow manner. Our purpose is to
enhance the kernel so that it can be emulated efficiently.
Currently some out-of-tree hack patches are being used to efficiently
emulate it in some kernels. We intend to replace those with these patches,
so that gaming on Linux as a whole can benefit from this. It
means there would be tons of users of this code.
CRIU use case [2] was mentioned by Andrei and Danylo:
> Use cases for migrating sparse VMAs are binaries sanitized with ASAN,
> MSAN or TSAN [3]. All of these sanitizers produce sparse mappings of
> shadow memory [4]. Being able to migrate such binaries allows to highly
> reduce the amount of work needed to identify and fix post-migration
> crashes, which happen constantly.
Andrei defines the following uses of this code:
* it is more granular and allows us to track changed pages more
effectively. The current interface can clear dirty bits for the entire
process only. In addition, reading info about pages is a separate
operation. It means we must freeze the process to read information
about all its pages, reset dirty bits, only then we can start dumping
pages. The information about pages becomes more and more outdated,
while we are processing pages. The new interface solves both these
downsides. First, it allows us to read pte bits and clear the
soft-dirty bit atomically. It means that CRIU will not need to freeze
processes to pre-dump their memory. Second, it clears soft-dirty bits
for a specified region of memory. It means CRIU will have actual info
about pages to the moment of dumping them.
* The new interface has to be much faster because basic page filtering
is happening in the kernel. With the old interface, we have to read
pagemap for each page.
*Implementation Evolution (Short Summary)*
From the definition of GetWriteWatch(), we felt that the kernel's
soft-dirty feature could be used under the hood with some additions:
* reset soft-dirty flag for only a specific region of memory instead of
clearing the flag for the entire process
* get and clear soft-dirty flag for a specific region atomically
So we decided to use an ioctl on the pagemap file to read and/or reset the
soft-dirty flag. But with the soft-dirty flag, we sometimes get extra pages
which weren't even written. They had become soft-dirty because of VMA
merging and the VM_SOFTDIRTY flag. This breaks the definition of
GetWriteWatch(). We were able to bypass this shortcoming by ignoring
VM_SOFTDIRTY until David reported that mprotect etc. messes up the
soft-dirty flag while ignoring VM_SOFTDIRTY [5]. This wasn't happening
until [6] got introduced. We discussed whether these patches could be
reverted, but we could not reach any conclusion. So at this point, I made
a couple of attempts to solve this whole VM_SOFTDIRTY issue by correcting
the soft-dirty implementation:
* [7] Correct the bug fixed wrongly back in 2014. It had the potential to
cause regressions. We left it behind.
* [8] Keep a list of the soft-dirty parts of a VMA across splits and
merges. The reply was not to increase the size of the VMA by 8 bytes.
At this point, we abandoned soft-dirty, considering it too delicate, and
userfaultfd [9] seemed like the only way forward. From there onward, we
have been basing soft-dirty emulation on the userfaultfd wp feature, where
the kernel resolves the faults itself when the WP_ASYNC feature is used. It
was straightforward to add the WP_ASYNC feature to userfaultfd. Now only
those pages which have actually been written are reported as dirty or
written-to. (PS: another userfaultfd feature, WP_UNPOPULATED, is required
to avoid pre-faulting memory before write-protecting [9].)
All the different masks were added at the request of the CRIU devs to make
the interface more generic and better.
[1] https://learn.microsoft.com/en-us/windows/win32/api/memoryapi/nf-memoryapi-…
[2] https://lore.kernel.org/all/20221014134802.1361436-1-mdanylo@google.com
[3] https://github.com/google/sanitizers
[4] https://github.com/google/sanitizers/wiki/AddressSanitizerAlgorithm#64-bit
[5] https://lore.kernel.org/all/bfcae708-db21-04b4-0bbe-712badd03071@redhat.com
[6] https://lore.kernel.org/all/20220725142048.30450-1-peterx@redhat.com/
[7] https://lore.kernel.org/all/20221122115007.2787017-1-usama.anjum@collabora.…
[8] https://lore.kernel.org/all/20221220162606.1595355-1-usama.anjum@collabora.…
[9] https://lore.kernel.org/all/20230306213925.617814-1-peterx@redhat.com
[10] https://lore.kernel.org/all/20230125144529.1630917-1-mdanylo@google.com
*Original Cover letter from v8*
Hello,
Note:
Soft-dirty pages and pages which have been written-to are synonyms. As the
kernel already has a soft-dirty feature, which we have given up on using,
we use the written-to terminology while using UFFD async WP under the
hood.
This IOCTL, PAGEMAP_SCAN on the pagemap file, can be used to get and/or
clear the info about page table entries. The following operations are
supported in this ioctl:
- Get the information if the pages have been written-to (PAGE_IS_WRITTEN),
file mapped (PAGE_IS_FILE), present (PAGE_IS_PRESENT) or swapped
(PAGE_IS_SWAPPED).
- Write-protect the pages (PAGEMAP_WP_ENGAGE) to start finding which
pages have been written-to.
- Find pages which have been written-to and write protect the pages
(atomic PAGE_IS_WRITTEN + PAGEMAP_WP_ENGAGE)
It is possible to find and clear soft-dirty pages entirely in userspace.
But it isn't efficient:
- The mprotect and SIGSEGV handler for bookkeeping
- The userfaultfd wp (synchronous) with the handler for bookkeeping
Some benchmarks can be seen here[1]. This series adds features that weren't
present earlier:
- There is no atomic get-and-clear of the soft-dirty/written-to status in
the kernel.
- The pages which have been written-to cannot be found accurately.
(Kernel's soft-dirty PTE bit + soft-dirty VMA bit shows more soft-dirty
pages than there actually are.)
Historically, soft-dirty PTE bit tracking has been used in the CRIU
project. The procfs interface is enough for finding the soft-dirty bit
status and clearing the soft-dirty bit of all the pages of a process.
We have the use case where we need to track the soft-dirty PTE bit for
only specific pages on-demand. We need this tracking and clear mechanism
for a region of memory while the process is running to emulate the
GetWriteWatch() syscall of Windows.
*(Moved to using UFFD instead of soft-dirty feature to find pages which
have been written-to from v7 patch series)*:
Stop using the soft-dirty flags for finding which pages have been
written to. The feature is too delicate and inaccurate, as it reports
more soft-dirty pages than have actually been written. There is no
interest in correcting it [2][3], as this is how the feature was written
years ago and its behaviour shouldn't be changed now. Peter Xu has
suggested using the async version of the UFFD WP [4] as it is based
inherently on the PTEs.
So in this patch series, I've added a new mode to the UFFD which is an
asynchronous version of the write protect. When this variant of the
UFFD WP is used, the page faults are resolved automatically by the
kernel. The pages which have been written-to can be found by reading
the pagemap file (!PM_UFFD_WP). This feature can be used successfully to
find which pages have been written to since the pages were write
protected. This works just like the soft-dirty flag without reporting
any extra pages which haven't actually been written.
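As a rough illustration of the "!PM_UFFD_WP" check, the sketch below reads one entry from /proc/self/pagemap and decodes it. The bit positions are the ones listed in Documentation/admin-guide/mm/pagemap.rst; the macro names are defined locally for the example, and the helper names pm_entry_written() and pm_read_entry() are hypothetical. Treating only present or swapped pages as candidates is an assumption of this sketch, and the result is only meaningful for ranges that were write-protected through userfaultfd beforehand.

```c
#include <fcntl.h>
#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

/* Bit positions from Documentation/admin-guide/mm/pagemap.rst. */
#define PM_UFFD_WP	(1ULL << 57)	/* pte is uffd-wp write-protected */
#define PM_SWAP		(1ULL << 62)	/* page is swapped */
#define PM_PRESENT	(1ULL << 63)	/* page is present */

/*
 * Assumption of this sketch: a populated (present or swapped) page counts
 * as written-to once its uffd-wp bit is no longer set.
 */
int pm_entry_written(uint64_t entry)
{
	if (!(entry & (PM_PRESENT | PM_SWAP)))
		return 0;
	return !(entry & PM_UFFD_WP);
}

/* Read the 64-bit pagemap entry covering addr in the calling process. */
int pm_read_entry(const void *addr, uint64_t *entry)
{
	uintptr_t page_size = (uintptr_t)sysconf(_SC_PAGESIZE);
	off_t off = (off_t)((uintptr_t)addr / page_size) * (off_t)sizeof(*entry);
	int fd = open("/proc/self/pagemap", O_RDONLY);
	ssize_t n;

	if (fd < 0)
		return -1;
	n = pread(fd, entry, sizeof(*entry), off);
	close(fd);
	return n == (ssize_t)sizeof(*entry) ? 0 : -1;
}
```

Reading pagemap this way still costs one 8-byte read per page, which is part of what the new IOCTL avoids by filtering inside the kernel.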
The information about whether a page is file mapped, present or swapped
is required for the CRIU project [5][6]. The required mask, any mask,
excluded mask and return mask are also required for the CRIU project [5].
The IOCTL returns the addresses of the pages which match the specified
masks. The page addresses are returned in struct page_region in a compact
form. The max_pages is needed to support a use case where the user only
wants to get a specific number of pages, so there is no need to find all
the pages of interest in the range when max_pages is specified. The IOCTL
returns when the maximum number of pages has been found. The max_pages is
optional. If max_pages is specified, it must be equal to or greater than
the vec_size. This restriction is needed to handle the worst case, when
each page_region only contains info for one page and cannot be compacted.
This is needed to emulate the Windows GetWriteWatch() syscall.
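The mask filtering and region compaction described above can be modelled in plain userspace C. This is a toy model, not the kernel implementation or its UAPI: struct region_model and struct scan_model are stand-ins for struct page_region and the ioctl argument, the numeric flag values are assumptions, and the matching rules (all required bits set, at least one any-of bit set when that mask is non-zero, no excluded bit set) are my reading of the description.

```c
#include <stddef.h>
#include <stdint.h>

/* Flag names from the cover letter; the numeric values are assumptions. */
#define PAGE_IS_WRITTEN	(1u << 0)
#define PAGE_IS_FILE	(1u << 1)
#define PAGE_IS_PRESENT	(1u << 2)
#define PAGE_IS_SWAPPED	(1u << 3)

struct region_model {		/* stand-in for struct page_region */
	size_t first_page;
	size_t npages;
	uint32_t flags;
};

struct scan_model {		/* stand-in for the ioctl argument */
	uint32_t required_mask;
	uint32_t anyof_mask;
	uint32_t excluded_mask;
	uint32_t return_mask;
	size_t max_pages;	/* 0 means no limit */
};

/* Does one page's flag set pass the three selection masks? */
int page_matches(uint32_t flags, const struct scan_model *s)
{
	if ((flags & s->required_mask) != s->required_mask)
		return 0;
	if (s->anyof_mask && !(flags & s->anyof_mask))
		return 0;
	return !(flags & s->excluded_mask);
}

/*
 * Walk per-page flags, merge adjacent matching pages with identical
 * returned flags into one region, and stop at max_pages or vec_len.
 * Returns the number of regions written to vec.
 */
size_t scan_pages(const uint32_t *page_flags, size_t npages,
		  const struct scan_model *s,
		  struct region_model *vec, size_t vec_len)
{
	size_t used = 0, found = 0, i;

	for (i = 0; i < npages; i++) {
		uint32_t out;

		if (s->max_pages && found == s->max_pages)
			break;
		if (!page_matches(page_flags[i], s))
			continue;

		out = page_flags[i] & s->return_mask;
		if (used && vec[used - 1].flags == out &&
		    vec[used - 1].first_page + vec[used - 1].npages == i) {
			vec[used - 1].npages++;	/* extend the last region */
		} else {
			if (used == vec_len)
				break;		/* output vector is full */
			vec[used].first_page = i;
			vec[used].npages = 1;
			vec[used].flags = out;
			used++;
		}
		found++;
	}
	return used;
}
```

In the worst case every matching page differs from its neighbour and each region holds a single page, which is the situation the max_pages versus vec_size restriction is concerned with.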
The patch series includes a detailed selftest which can be used as an
example for the uffd async wp test and PAGEMAP_IOCTL. It shows the
interface usage as well.
[1] https://lore.kernel.org/lkml/54d4c322-cd6e-eefd-b161-2af2b56aae24@collabora…
[2] https://lore.kernel.org/all/20221220162606.1595355-1-usama.anjum@collabora.…
[3] https://lore.kernel.org/all/20221122115007.2787017-1-usama.anjum@collabora.…
[4] https://lore.kernel.org/all/Y6Hc2d+7eTKs7AiH@x1n
[5] https://lore.kernel.org/all/YyiDg79flhWoMDZB@gmail.com/
[6] https://lore.kernel.org/all/20221014134802.1361436-1-mdanylo@google.com/
Regards,
Muhammad Usama Anjum
Muhammad Usama Anjum (4):
fs/proc/task_mmu: Implement IOCTL to get and optionally clear info
about PTEs
tools headers UAPI: Update linux/fs.h with the kernel sources
mm/pagemap: add documentation of PAGEMAP_SCAN IOCTL
selftests: mm: add pagemap ioctl tests
Peter Xu (1):
userfaultfd: UFFD_FEATURE_WP_ASYNC
Documentation/admin-guide/mm/pagemap.rst | 58 +
Documentation/admin-guide/mm/userfaultfd.rst | 35 +
fs/proc/task_mmu.c | 560 +++++++
fs/userfaultfd.c | 26 +-
include/linux/hugetlb.h | 1 +
include/linux/userfaultfd_k.h | 21 +-
include/uapi/linux/fs.h | 54 +
include/uapi/linux/userfaultfd.h | 9 +-
mm/hugetlb.c | 34 +-
mm/memory.c | 27 +-
tools/include/uapi/linux/fs.h | 54 +
tools/testing/selftests/mm/.gitignore | 2 +
tools/testing/selftests/mm/Makefile | 3 +-
tools/testing/selftests/mm/config | 1 +
tools/testing/selftests/mm/pagemap_ioctl.c | 1464 ++++++++++++++++++
tools/testing/selftests/mm/run_vmtests.sh | 4 +
16 files changed, 2329 insertions(+), 24 deletions(-)
create mode 100644 tools/testing/selftests/mm/pagemap_ioctl.c
mode change 100644 => 100755 tools/testing/selftests/mm/run_vmtests.sh
--
2.39.2
Erdem Aktas wrote:
> On Mon, Jun 12, 2023 at 12:03 PM Dan Williams <dan.j.williams(a)intel.com>
> wrote:
>
> > [ add David, Brijesh, and Atish]
> >
> > Kuppuswamy Sathyanarayanan wrote:
> > > In TDX guest, the second stage of the attestation process is Quote
> > > generation. This process is required to convert the locally generated
> > > TDREPORT into a remotely verifiable Quote. It involves sending the
> > > TDREPORT data to a Quoting Enclave (QE) which will verify the
> > > integrity of the TDREPORT and sign it with an attestation key.
> > >
> > > Intel's TDX attestation driver exposes TDX_CMD_GET_QUOTE IOCTL to
> > > allow the user agent to get the TD Quote.
> > >
> > > Add a kernel selftest module to verify the Quote generation feature.
> > >
> > > TD Quote generation involves following steps:
> > >
> > > * Get the TDREPORT data using TDX_CMD_GET_REPORT IOCTL.
> > > * Embed the TDREPORT data in quote buffer and request for quote
> > > generation via TDX_CMD_GET_QUOTE IOCTL request.
> > > * Upon completion of the GetQuote request, check for non zero value
> > > in the status field of Quote header to make sure the generated
> > > quote is valid.
> >
> > What this cover letter does not say is that this is adding another
> > instance of the similar pattern as SNP_GET_REPORT.
> >
> > Linux is best served when multiple vendors trying to do similar
> > operations are brought together behind a common ABI. We see this in the
> > history of wrangling SCSI vendors behind common interfaces.
>
> Compared to the number of SCSI vendors, I think the number of CPU vendors
> for confidential computing seems manageable to me. Is this really a good
> comparison?
Fair enough, and prompted by this I talk a bit more about the
motivations and benefits of a Keys abstraction for attestation here:
https://lore.kernel.org/all/64961c3baf8ce_142af829436@dwillia2-xfh.jf.intel…
> > Now multiple
> > confidential computing vendors trying to develop similar flows with
> > differentiated formats where that differentiation need not leak over the
> > ABI boundary.
> >
>
> <Just my personal opinion below>
> I agree with this statement in the high level but it is also somehow
> surprising for me after all the discussion happened around this topic.
> Honestly, I feel like there are multiple versions of "Intel" working in
> different directions.
This proposal was sent while firmly wearing my Linux community hat. I
agree, the timing here is unfortunate.
> If we want multiple vendors trying to do the similar things behind a common
> ABI, it should start with the spec. Since this comment is coming from
> Intel, I wonder if there is any plan to combine the GHCB and GHCI
> interfaces under common ABI in the future or why it did not even happen in
> the first place.
Per above comment about firmly wearing my Linux hat I am coming at this
purely from the perspective of what do we do now as a community that
continues to see these implementations proliferate and grow more
features. Common specs are great, but I agree with you, it is too late
for that, but I hope that as Linux asserts "this is what it should look
like" it starts to influence future IP innovation, and attestation
service providers, to accommodate the kernel's ABI momentum.
> What I see is that Intel has GETQUOTE TDVMCALL interface in its spec and
> again Intel does not really want to provide support for it in linux. It
> feels really frustrating.
I am aware of how frustrating late feedback can be. I am also encouraged
by some of the conversations and investigations that have already
happened around how Keys fits what these attestation solutions need.
> > My observation of SNP_GET_REPORT and TDX_CMD_GET_REPORT is that they are
> > both passing blobs across the user/kernel and platform/kernel boundary
> > for the purposes of unlocking other resources. To me that is a flow that
> > the Keys subsystem has infrastructure to handle. It has the concept of
> > upcalls and asynchronous population of blobs by handles and mechanisms
> > to protect and cache those communications. Linux / the Keys subsystem
> > could benefit from the enhancements it would need to cover these 2
> > cases. Specifically, the benefit that when ARM and RISC-V arrive with
> > similar communications with platform TSMs (Trusted Security Module) they
> > can build upon the same infrastructure.
> >
> > David, am I reaching with that association? My strawman mapping of
> > TDX_CMD_GET_QUOTE to request_key() is something like:
> >
> > request_key(coco_quote, "description", "<uuencoded tdreport>")
> >
> > Where this is a common key_type for all vendors, but the description and
> > arguments have room for vendor differentiation when doing the upcall to
> > the platform TSM, but userspace never needs to contend with the
> > different vendor formats, that is all handled internally to the kernel.
> >
> I think the problem definition here is not accurate. With AMD SNP, guests
> need to do a hypercall to KVM and KVM needs to issue
> a SNP_GUEST_REQUEST(MSG_REPORT_REQ) to the SP firmware. In TDX, guests
> need to do a TDCALL to TDXMODULE to get the TDREPORT and then it needs to
> get that report delivered to the host userspace to get the TDQUOTE
> generated by the SGX quoting enclave. Also TDQUOTE is designed to work
> async while the SNP_GUEST_REQUESTS are blocking vmcalls.
>
> Those are completely different flows. Are you suggesting that intel should
> also come down to a single call to get the TDQUOTE like AMD SNP?
The Keys subsystem supports async instantiation of key material with
usermode upcalls if necessary. So I do not see a problem supporting
these flows behind a common key type.
> The TDCALL interface asking for the TDREPORT is already there. AMD does not
> need to ask the report and the quote separately.
>
> Here, the problem was that Intel (upstream) did not want to implement
> hypercall for TDQUOTE which would be handled by the user space VMM. The
> alternative implementation (using vsock) does not work for many use cases
> including ours. I do not see how your suggestion addresses the problem that
> this patch was trying to solve.
Perhaps the strawman mockup makes it more clear:
https://lore.kernel.org/all/64961c3baf8ce_142af829436@dwillia2-xfh.jf.intel…
> So while I like the suggested direction, I am not sure how much it is
> possible to come up with a common ABI even with just only for 2 vendors
> (AMD and Intel) without doing spec changes which is a multi year effort
> imho.
I agree, hardware spec changes are out of scope for this effort, but
Keys might require some additional flows to be built up in the kernel
that could be previously handled in userspace. I.e. the "bottom half"
that I reference in the mockup.
This is something we went through with using "encrypted-keys" for
nvdimm. Instead of an ioctl to inject a secret key over the user/kernel
boundary, a key server needs to store a serialized version of the
encrypted key blob and pass that into the kernel.
The restoring of TPIDR2 signal context has been broken since it was
merged; fix this and add a test case covering it. This is a result of
TPIDR2 context management following a different flow to any of the other
state that we provide and the fact that we don't expose TPIDR (which
follows the same pattern) to signals.
Signed-off-by: Mark Brown <broonie(a)kernel.org>
---
Changes in v2:
- Added a feature check for SME to the new test.
- Link to v1: https://lore.kernel.org/r/20230621-arm64-fix-tpidr2-signal-restore-v1-0-b6d…
---
Mark Brown (2):
arm64/signal: Restore TPIDR2 register rather than memory state
kselftest/arm64: Add a test case for TPIDR2 restore
arch/arm64/kernel/signal.c | 2 +-
tools/testing/selftests/arm64/signal/.gitignore | 2 +-
.../arm64/signal/testcases/tpidr2_restore.c | 86 ++++++++++++++++++++++
3 files changed, 88 insertions(+), 2 deletions(-)
---
base-commit: 858fd168a95c5b9669aac8db6c14a9aeab446375
change-id: 20230621-arm64-fix-tpidr2-signal-restore-713d93798f99
Best regards,
--
Mark Brown <broonie(a)kernel.org>
TCP SYN/ACK packets of connections from processes/sockets outside a
cgroup on the same host are not received by the cgroup's installed
cgroup_skb filters.
There were two BPF cgroup_skb programs attached to a cgroup named
"my_cgroup".
    SEC("cgroup_skb/ingress")
    int ingress(struct __sk_buff *skb)
    {
        /* .... process skb ... */
        return 1;
    }

    SEC("cgroup_skb/egress")
    int egress(struct __sk_buff *skb)
    {
        /* .... process skb ... */
        return 1;
    }
We discovered that when running the command "nc -6 -l 8000" in
"my_cgroup" and connecting to it from outside of "my_cgroup" with the
command "nc -6 localhost 8000", the egress filter did not detect the
SYN/ACK packet. However, we did observe the SYN/ACK packet at the
ingress when connecting from a socket in "my_cgroup" to a socket
outside of it.
We came across BPF_CGROUP_RUN_PROG_INET_EGRESS(). This macro is
responsible for calling BPF programs that are attached to the egress
hook of a cgroup and it skips programs if the sending socket is not the
owner of the skb. Specifically, in our situation, the SYN/ACK
skb is owned by a struct request_sock instance, but the sending
socket is the listener socket we use to receive incoming
connections. The request_sock is created to manage an incoming
connection.
It has been determined that checking the owner of a skb against
the sending socket is not required. Removing this check will allow the
filters to receive SYN/ACK packets.
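The effect of the ownership check can be shown with a small userspace model. This is not the kernel macro: struct sock_model, struct skb_model and both helper functions are hypothetical, and they only capture the condition described above (the SYN/ACK skb is owned by the request_sock while the sending socket is the listener).

```c
#include <stddef.h>

/* Hypothetical stand-ins for struct sock and struct sk_buff. */
struct sock_model {
	int in_filtered_cgroup;	/* socket's cgroup has an egress program */
};

struct skb_model {
	const struct sock_model *owner;	/* models skb->sk */
};

/* Old behaviour as described: run only if the sender owns the skb. */
int egress_runs_old(const struct sock_model *sk, const struct skb_model *skb)
{
	return sk && sk == skb->owner && sk->in_filtered_cgroup;
}

/* Proposed behaviour as described: the ownership comparison is dropped. */
int egress_runs_new(const struct sock_model *sk, const struct skb_model *skb)
{
	(void)skb;
	return sk && sk->in_filtered_cgroup;
}
```

With a listener socket in the cgroup and a SYN/ACK skb owned by a separate request socket, egress_runs_old() returns 0 and egress_runs_new() returns 1, which matches the missing and then restored SYN/ACK at the egress hook.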
A new self-test has also been added to ensure that cgroup_skb filters can
receive all signaling packets, including SYN, SYN/ACK, ACK, FIN, and
FIN/ACK.
Changes from v2:
- Remove redundant blank lines.
Changes from v1:
- Check the number of observed packets instead of just sleeping.
- Use ASSERT_XXX() instead of CHECK().
[v1] https://lore.kernel.org/all/20230612191641.441774-1-kuifeng@meta.com/
[v2] https://lore.kernel.org/all/20230617052756.640916-2-kuifeng@meta.com/
Kui-Feng Lee (2):
net: bpf: Always call BPF cgroup filters for egress.
selftests/bpf: Verify that the cgroup_skb filters receive expected
packets.
include/linux/bpf-cgroup.h | 2 +-
tools/testing/selftests/bpf/cgroup_helpers.c | 12 +
tools/testing/selftests/bpf/cgroup_helpers.h | 1 +
tools/testing/selftests/bpf/cgroup_tcp_skb.h | 35 ++
.../selftests/bpf/prog_tests/cgroup_tcp_skb.c | 399 ++++++++++++++++++
.../selftests/bpf/progs/cgroup_tcp_skb.c | 382 +++++++++++++++++
6 files changed, 830 insertions(+), 1 deletion(-)
create mode 100644 tools/testing/selftests/bpf/cgroup_tcp_skb.h
create mode 100644 tools/testing/selftests/bpf/prog_tests/cgroup_tcp_skb.c
create mode 100644 tools/testing/selftests/bpf/progs/cgroup_tcp_skb.c
--
2.34.1
Patch 1-3/9 track and expose some aggregated data counters at the MPTCP
level: the number of retransmissions and the bytes that have been
transferred. The first patch prepares the work by moving where snd_una
is updated for fallback sockets while the last patch adds some tests to
cover the new code.
Patch 4-6/9 introduce a new getsockopt for SOL_MPTCP: MPTCP_FULL_INFO.
This new socket option allows combining info from the MPTCP_INFO,
MPTCP_TCPINFO and MPTCP_SUBFLOW_ADDRS socket options into one. Having all
the info in one call can be needed because the path-manager can close and
re-create subflows between getsockopt() calls, fooling the accounting. The
first patch introduces a unique subflow ID to easily detect when
subflows are being re-created with the same 5-tuple while the last patch
adds some tests to cover the new code.
Please note that patch 5/9 ("mptcp: introduce MPTCP_FULL_INFO getsockopt")
can reveal a bug that has been there for some time, see [1]. A fix has
recently been sent to netdev for the -net tree: "mptcp: ensure listener
is unhashed before updating the sk status", see [2]. There are no
conflicts between the two patches but it might be better to apply this
series after the one for -net and after having merged "net" into
"net-next".
Patch 7/9 is similar to commit 47867f0a7e83 ("selftests: mptcp: join:
skip check if MIB counter not supported") recently applied in the -net
tree but here it adapts the new code that is only in net-next (and it
fixes a merge conflict resolution which didn't have any impact).
Patch 8 and 9/9 are two simple refactorings. One consolidates the
transition to TCP_CLOSE in mptcp_do_fastclose() and avoids duplicated
code. The other one reduces the scope of an argument passed to
the mptcp_pm_alloc_anno_list() function.
Link: https://github.com/multipath-tcp/mptcp_net-next/issues/407 [1]
Link: https://lore.kernel.org/netdev/20230620-upstream-net-20230620-misc-fixes-fo… [2]
Signed-off-by: Matthieu Baerts <matthieu.baerts(a)tessares.net>
---
Geliang Tang (1):
mptcp: pass addr to mptcp_pm_alloc_anno_list
Matthieu Baerts (1):
selftests: mptcp: join: skip check if MIB counter not supported (part 2)
Paolo Abeni (7):
mptcp: move snd_una update earlier for fallback socket
mptcp: track some aggregate data counters
selftests: mptcp: explicitly tests aggregate counters
mptcp: add subflow unique id
mptcp: introduce MPTCP_FULL_INFO getsockopt
selftests: mptcp: add MPTCP_FULL_INFO testcase
mptcp: consolidate transition to TCP_CLOSE in mptcp_do_fastclose()
include/uapi/linux/mptcp.h | 29 +++++
net/mptcp/options.c | 14 +-
net/mptcp/pm_netlink.c | 8 +-
net/mptcp/pm_userspace.c | 2 +-
net/mptcp/protocol.c | 31 +++--
net/mptcp/protocol.h | 11 +-
net/mptcp/sockopt.c | 152 +++++++++++++++++++++-
net/mptcp/subflow.c | 2 +
tools/testing/selftests/net/mptcp/mptcp_join.sh | 33 ++---
tools/testing/selftests/net/mptcp/mptcp_sockopt.c | 120 ++++++++++++++++-
10 files changed, 356 insertions(+), 46 deletions(-)
---
base-commit: 712557f210723101717570844c95ac0913af74d7
change-id: 20230620-upstream-net-next-20230620-mptcp-expose-more-info-and-misc-6b4a3a415ec5
Best regards,
--
Matthieu Baerts <matthieu.baerts(a)tessares.net>
*Changes in v19*
- Minor changes and interface updates
*Changes in v18*
- Rebase on top of next-20230613
- Minor updates
*Changes in v17*
- Rebase on top of next-20230606
- Minor improvements in PAGEMAP_SCAN IOCTL patch
*Changes in v16*
- Fix a corner case
- Add exclusive PM_SCAN_OP_WP back
*Changes in v15*
- Build fix (Add missed build fix in RESEND)
*Changes in v14*
- Fix build error caused by #ifdef added at last minute in some configs
*Changes in v13*
- Rebase on top of next-20230414
- Give up on using uffd_wp_range() and write new helpers, flush tlb only
once
*Changes in v12*
- Update and other memory types to UFFD_FEATURE_WP_ASYNC
- Rebase on top of next-20230406
- Review updates
*Changes in v11*
- Rebase on top of next-20230307
- Base patches on UFFD_FEATURE_WP_UNPOPULATED
- Do a lot of cosmetic changes and review updates
- Remove ENGAGE_WP + !GET operation as it can be performed with
UFFDIO_WRITEPROTECT
*Changes in v10*
- Add specific condition to return error if hugetlb is used with wp
async
- Move changes in tools/include/uapi/linux/fs.h to separate patch
- Add documentation
*Changes in v9:*
- Correct fault resolution for userfaultfd wp async
- Fix build warnings and errors which were happening on some configs
- Simplify pagemap ioctl's code
*Changes in v8:*
- Update uffd async wp implementation
- Improve PAGEMAP_IOCTL implementation
*Changes in v7:*
- Add uffd wp async
- Update the IOCTL to use uffd under the hood instead of soft-dirty
flags
*Motivation*
The real motivation for adding the PAGEMAP_SCAN IOCTL is to emulate the
Windows GetWriteWatch() syscall [1]. GetWriteWatch() retrieves the
addresses of the pages that have been written to in a region of virtual
memory.
This syscall is used in Windows applications, games, etc. It is currently
emulated in userspace in a rather slow manner. Our purpose is to enhance
the kernel so that it can be emulated efficiently. Currently some
out-of-tree hack patches are used to emulate it efficiently in some
kernels. We intend to replace those with these patches, so that gaming on
Linux as a whole can benefit. It means there would be tons of users of
this code.
CRIU use case [2] was mentioned by Andrei and Danylo:
> Use cases for migrating sparse VMAs are binaries sanitized with ASAN,
> MSAN or TSAN [3]. All of these sanitizers produce sparse mappings of
> shadow memory [4]. Being able to migrate such binaries allows to highly
> reduce the amount of work needed to identify and fix post-migration
> crashes, which happen constantly.
Andrei defines the following uses of this code:
* it is more granular and allows us to track changed pages more
effectively. The current interface can clear dirty bits for the entire
process only. In addition, reading info about pages is a separate
operation. It means we must freeze the process to read information
about all its pages, reset dirty bits, only then we can start dumping
pages. The information about pages becomes more and more outdated,
while we are processing pages. The new interface solves both these
downsides. First, it allows us to read pte bits and clear the
soft-dirty bit atomically. It means that CRIU will not need to freeze
processes to pre-dump their memory. Second, it clears soft-dirty bits
for a specified region of memory. It means CRIU will have actual info
about pages to the moment of dumping them.
* The new interface has to be much faster because basic page filtering
is happening in the kernel. With the old interface, we have to read
pagemap for each page.
*Implementation Evolution (Short Summary)*
From the definition of GetWriteWatch(), we felt that the kernel's
soft-dirty feature could be used under the hood with some additions:
* reset soft-dirty flag for only a specific region of memory instead of
clearing the flag for the entire process
* get and clear soft-dirty flag for a specific region atomically
So we decided to use an ioctl on the pagemap file to read and/or reset the
soft-dirty flag. But with the soft-dirty flag, we sometimes get extra pages
which weren't even written. They had become soft-dirty because of VMA
merging and the VM_SOFTDIRTY flag. This breaks the definition of
GetWriteWatch(). We were able to bypass this shortcoming by ignoring
VM_SOFTDIRTY until David reported that mprotect etc. messes up the
soft-dirty flag while ignoring VM_SOFTDIRTY [5]. This wasn't happening
until [6] got introduced. We discussed whether these patches could be
reverted, but we could not reach any conclusion. So at this point, I made
a couple of attempts to solve this whole VM_SOFTDIRTY issue by correcting
the soft-dirty implementation:
* [7] Correct the bug fixed wrongly back in 2014. It had the potential to
cause regressions. We left it behind.
* [8] Keep a list of the soft-dirty parts of a VMA across splits and
merges. The reply was not to increase the size of the VMA by 8 bytes.
At this point, we abandoned soft-dirty, considering it too delicate, and
userfaultfd [9] seemed like the only way forward. From there onward, we
have been basing soft-dirty emulation on the userfaultfd wp feature, where
the kernel resolves the faults itself when the WP_ASYNC feature is used. It
was straightforward to add the WP_ASYNC feature to userfaultfd. Now only
those pages which have actually been written are reported as dirty or
written-to. (PS: another userfaultfd feature, WP_UNPOPULATED, is required
to avoid pre-faulting memory before write-protecting [9].)
All the different masks were added at the request of the CRIU devs to make
the interface more generic and better.
[1] https://learn.microsoft.com/en-us/windows/win32/api/memoryapi/nf-memoryapi-…
[2] https://lore.kernel.org/all/20221014134802.1361436-1-mdanylo@google.com
[3] https://github.com/google/sanitizers
[4] https://github.com/google/sanitizers/wiki/AddressSanitizerAlgorithm#64-bit
[5] https://lore.kernel.org/all/bfcae708-db21-04b4-0bbe-712badd03071@redhat.com
[6] https://lore.kernel.org/all/20220725142048.30450-1-peterx@redhat.com/
[7] https://lore.kernel.org/all/20221122115007.2787017-1-usama.anjum@collabora.…
[8] https://lore.kernel.org/all/20221220162606.1595355-1-usama.anjum@collabora.…
[9] https://lore.kernel.org/all/20230306213925.617814-1-peterx@redhat.com
[10] https://lore.kernel.org/all/20230125144529.1630917-1-mdanylo@google.com
*Original Cover letter from v8*
Hello,
Note:
Soft-dirty pages and pages which have been written-to are synonyms. As the
kernel already has a soft-dirty feature, which we have given up on using,
we use the written-to terminology while using UFFD async WP under the
hood.
This IOCTL, PAGEMAP_SCAN on the pagemap file, can be used to get and/or
clear the info about page table entries. The following operations are
supported in this ioctl:
- Get the information if the pages have been written-to (PAGE_IS_WRITTEN),
file mapped (PAGE_IS_FILE), present (PAGE_IS_PRESENT) or swapped
(PAGE_IS_SWAPPED).
- Write-protect the pages (PAGEMAP_WP_ENGAGE) to start finding which
pages have been written-to.
- Find pages which have been written-to and write protect the pages
(atomic PAGE_IS_WRITTEN + PAGEMAP_WP_ENGAGE)
It is possible to find and clear soft-dirty pages entirely in userspace.
But it isn't efficient:
- The mprotect and SIGSEGV handler for bookkeeping
- The userfaultfd wp (synchronous) with the handler for bookkeeping
Some benchmarks can be seen here[1]. This series adds features that weren't
present earlier:
- There is no atomic get-and-clear of the soft-dirty/written-to status in
the kernel.
- The pages which have been written-to cannot be found accurately.
(Kernel's soft-dirty PTE bit + soft-dirty VMA bit shows more soft-dirty
pages than there actually are.)
Historically, soft-dirty PTE bit tracking has been used in the CRIU
project. The procfs interface is enough for finding the soft-dirty bit
status and clearing the soft-dirty bit of all the pages of a process.
We have the use case where we need to track the soft-dirty PTE bit for
only specific pages on-demand. We need this tracking and clear mechanism
for a region of memory while the process is running to emulate the
GetWriteWatch() syscall of Windows.
*(Moved to using UFFD instead of soft-dirty feature to find pages which
have been written-to from v7 patch series)*:
Stop using the soft-dirty flags for finding which pages have been
written to. The feature is too delicate and inaccurate, as it reports
more soft-dirty pages than have actually been written. There is no
interest in correcting it [2][3], as this is how the feature was written
years ago and its behaviour shouldn't be changed now. Peter Xu has
suggested using the async version of the UFFD WP [4] as it is based
inherently on the PTEs.
So in this patch series, I've added a new mode to the UFFD which is an
asynchronous version of the write protect. When this variant of the
UFFD WP is used, the page faults are resolved automatically by the
kernel. The pages which have been written-to can be found by reading
the pagemap file (!PM_UFFD_WP). This feature can be used successfully to
find which pages have been written to since the pages were write
protected. This works just like the soft-dirty flag without reporting
any extra pages which haven't actually been written.
The information about whether a page is file mapped, present or
swapped is required for the CRIU project [5][6]. The addition of the
required mask, any mask, excluded mask and return mask is also required
for the CRIU project [5].
The IOCTL returns the addresses of the pages which match the specific
masks. The page addresses are returned in struct page_region in a compact
form. max_pages supports the use case where the user only wants
to get a specific number of pages, so there is no need to find all the
pages of interest in the range when max_pages is specified. The IOCTL
returns when the maximum number of pages has been found. max_pages is
optional. If max_pages is specified, it must be equal to or greater than
vec_size. This restriction is needed to handle the worst case, when each
page_region only contains info about one page and cannot be compacted.
This is needed to emulate the Windows GetWriteWatch() syscall.
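To make the "compact form" and the worst case more concrete, the following sketch merges consecutive pages that share the same flags into one region and stops once the output vector is full or max_pages pages have been reported. It is only an illustration under assumptions: struct region_sketch and compact_pages() are hypothetical names, not the series' struct page_region or its kernel code.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the region records returned to userspace. */
struct region_sketch {
	uint64_t start;		/* address of the first page in the run */
	uint64_t npages;	/* number of consecutive pages in the run */
	uint64_t flags;		/* flags shared by every page in the run */
};

/*
 * Walk the flags of npages consecutive pages starting at base. Pages with
 * no flags set are skipped; neighbouring pages with equal flags are merged
 * into one region. Stops when vec is full or when max_pages pages have
 * been reported (max_pages == 0 means no limit).
 * Returns the number of regions written to vec.
 */
size_t compact_pages(const uint64_t *page_flags, size_t npages,
		     uint64_t base, uint64_t page_size,
		     struct region_sketch *vec, size_t vec_len,
		     size_t max_pages)
{
	size_t used = 0, found = 0, i;

	for (i = 0; i < npages; i++) {
		uint64_t addr = base + i * page_size;
		uint64_t flags = page_flags[i];

		if (!flags)
			continue;
		if (max_pages && found == max_pages)
			break;

		if (used && vec[used - 1].flags == flags &&
		    vec[used - 1].start + vec[used - 1].npages * page_size == addr) {
			vec[used - 1].npages++;		/* extend the current run */
		} else {
			if (used == vec_len)
				break;			/* no room for a new region */
			vec[used].start = addr;
			vec[used].npages = 1;
			vec[used].flags = flags;
			used++;
		}
		found++;
	}
	return used;
}
```

With alternating flags on neighbouring pages nothing can be merged, so every reported page consumes one region; that is the worst case mentioned above.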
The patch series includes a detailed selftest which can be used as an
example for the uffd async wp test and PAGEMAP_IOCTL. It shows the
interface usage as well.
[1] https://lore.kernel.org/lkml/54d4c322-cd6e-eefd-b161-2af2b56aae24@collabora…
[2] https://lore.kernel.org/all/20221220162606.1595355-1-usama.anjum@collabora.…
[3] https://lore.kernel.org/all/20221122115007.2787017-1-usama.anjum@collabora.…
[4] https://lore.kernel.org/all/Y6Hc2d+7eTKs7AiH@x1n
[5] https://lore.kernel.org/all/YyiDg79flhWoMDZB@gmail.com/
[6] https://lore.kernel.org/all/20221014134802.1361436-1-mdanylo@google.com/
Regards,
Muhammad Usama Anjum
Muhammad Usama Anjum (4):
fs/proc/task_mmu: Implement IOCTL to get and optionally clear info
about PTEs
tools headers UAPI: Update linux/fs.h with the kernel sources
mm/pagemap: add documentation of PAGEMAP_SCAN IOCTL
selftests: mm: add pagemap ioctl tests
Peter Xu (1):
userfaultfd: UFFD_FEATURE_WP_ASYNC
Documentation/admin-guide/mm/pagemap.rst | 58 +
Documentation/admin-guide/mm/userfaultfd.rst | 35 +
fs/proc/task_mmu.c | 526 +++++++
fs/userfaultfd.c | 26 +-
include/linux/hugetlb.h | 1 +
include/linux/userfaultfd_k.h | 21 +-
include/uapi/linux/fs.h | 53 +
include/uapi/linux/userfaultfd.h | 9 +-
mm/hugetlb.c | 34 +-
mm/memory.c | 27 +-
tools/include/uapi/linux/fs.h | 53 +
tools/testing/selftests/mm/.gitignore | 2 +
tools/testing/selftests/mm/Makefile | 3 +-
tools/testing/selftests/mm/config | 1 +
tools/testing/selftests/mm/pagemap_ioctl.c | 1458 ++++++++++++++++++
tools/testing/selftests/mm/run_vmtests.sh | 4 +
16 files changed, 2287 insertions(+), 24 deletions(-)
create mode 100644 tools/testing/selftests/mm/pagemap_ioctl.c
mode change 100644 => 100755 tools/testing/selftests/mm/run_vmtests.sh
--
2.39.2
This patch introduces a specific test case for the EVIOCGLED ioctl.
The test covers the case where len > maxlen in the
EVIOCGLED(sizeof(all_leds)) ioctl, called with the all_leds buffer.
Signed-off-by: Dana Elfassy <dangel101(a)gmail.com>
---
Changes in v2:
- Changed variable leds from an array to an int
This patch depends on '[v3] selftests/input: Introduce basic tests for evdev ioctls' [1] sent to the ML.
[1] https://patchwork.kernel.org/project/linux-input/patch/20230607153214.15933…
tools/testing/selftests/input/evioc-test.c | 17 +++++++++++++++++
1 file changed, 17 insertions(+)
diff --git a/tools/testing/selftests/input/evioc-test.c b/tools/testing/selftests/input/evioc-test.c
index ad7b93fe39cf..378db2b4dd56 100644
--- a/tools/testing/selftests/input/evioc-test.c
+++ b/tools/testing/selftests/input/evioc-test.c
@@ -234,4 +234,21 @@ TEST(eviocsrep_set_repeat_settings)
selftest_uinput_destroy(uidev);
}
+TEST(eviocgled_get_all_leds)
+{
+ struct selftest_uinput *uidev;
+ int leds = 0;
+ int rc;
+
+ rc = selftest_uinput_create_device(&uidev, -1);
+ ASSERT_EQ(0, rc);
+ ASSERT_NE(NULL, uidev);
+
+ /* ioctl to set the maxlen = 0 */
+ rc = ioctl(uidev->evdev_fd, EVIOCGLED(0), leds);
+ ASSERT_EQ(0, rc);
+
+ selftest_uinput_destroy(uidev);
+}
+
TEST_HARNESS_MAIN
--
2.41.0
This patch introduces a specific test case for the EVIOCGKEY ioctl.
The test covers the case where len > maxlen in the
EVIOCGKEY(sizeof(keystate)) ioctl, called with the keystate buffer.
Signed-off-by: Dana Elfassy <dangel101(a)gmail.com>
---
Changes in v3:
- Edited commit's subject and description
- Renamed variable rep_values to keystate
- Added argument to selftest_uinput_create_device()
- Removed memset
Changes in v2:
- Added following note about the patch's dependency
This patch depends on '[v3] selftests/input: Introduce basic tests for evdev ioctls' [1] sent to the ML.
[1] https://patchwork.kernel.org/project/linux-input/patch/20230607153214.15933…
tools/testing/selftests/input/evioc-test.c | 17 +++++++++++++++++
1 file changed, 17 insertions(+)
diff --git a/tools/testing/selftests/input/evioc-test.c b/tools/testing/selftests/input/evioc-test.c
index ad7b93fe39cf..e0f69459f504 100644
--- a/tools/testing/selftests/input/evioc-test.c
+++ b/tools/testing/selftests/input/evioc-test.c
@@ -234,4 +234,21 @@ TEST(eviocsrep_set_repeat_settings)
selftest_uinput_destroy(uidev);
}
+TEST(eviocgkey_get_global_key_state)
+{
+ struct selftest_uinput *uidev;
+ int keystate = 0;
+ int rc;
+
+ rc = selftest_uinput_create_device(&uidev, -1);
+ ASSERT_EQ(0, rc);
+ ASSERT_NE(NULL, uidev);
+
+ /* ioctl to create the scenario where len > maxlen in bits_to_user() */
+ rc = ioctl(uidev->evdev_fd, EVIOCGKEY(0), keystate);
+ ASSERT_EQ(0, rc);
+
+ selftest_uinput_destroy(uidev);
+}
+
TEST_HARNESS_MAIN
--
2.41.0
This patch introduces a specific test case for the EVIOCGLED ioctl.
The test covers the case where len > maxlen in the
EVIOCGLED(sizeof(all_leds)) ioctl, called with the all_leds buffer.
Signed-off-by: Dana Elfassy <dangel101(a)gmail.com>
---
This patch depends on '[v3] selftests/input: Introduce basic tests for evdev ioctls' [1] sent to the ML.
[1] https://patchwork.kernel.org/project/linux-input/patch/20230607153214.15933…
tools/testing/selftests/input/evioc-test.c | 17 +++++++++++++++++
1 file changed, 17 insertions(+)
diff --git a/tools/testing/selftests/input/evioc-test.c b/tools/testing/selftests/input/evioc-test.c
index ad7b93fe39cf..2bf1b32ae01a 100644
--- a/tools/testing/selftests/input/evioc-test.c
+++ b/tools/testing/selftests/input/evioc-test.c
@@ -234,4 +234,21 @@ TEST(eviocsrep_set_repeat_settings)
selftest_uinput_destroy(uidev);
}
+TEST(eviocgled_get_all_leds)
+{
+ struct selftest_uinput *uidev;
+ int leds[2];
+ int rc;
+
+ rc = selftest_uinput_create_device(&uidev, -1);
+ ASSERT_EQ(0, rc);
+ ASSERT_NE(NULL, uidev);
+
+ /* ioctl to set the maxlen = 0 */
+ rc = ioctl(uidev->evdev_fd, EVIOCGLED(0), leds);
+ ASSERT_EQ(0, rc);
+
+ selftest_uinput_destroy(uidev);
+}
+
TEST_HARNESS_MAIN
--
2.41.0
Restoring of the TPIDR2 signal context has been broken since it was
merged. Fix this and add a test case covering it. The breakage is a result
of TPIDR2 context management following a different flow to any of the other
state that we provide, and of the fact that we don't expose TPIDR (which
follows the same pattern) to signals.
Signed-off-by: Mark Brown <broonie(a)kernel.org>
---
Mark Brown (2):
arm64/signal: Restore TPIDR2 register rather than memory state
kselftest/arm64: Add a test case for TPIDR2 restore
arch/arm64/kernel/signal.c | 2 +-
tools/testing/selftests/arm64/signal/.gitignore | 2 +-
.../arm64/signal/testcases/tpidr2_restore.c | 85 ++++++++++++++++++++++
3 files changed, 87 insertions(+), 2 deletions(-)
---
base-commit: 858fd168a95c5b9669aac8db6c14a9aeab446375
change-id: 20230621-arm64-fix-tpidr2-signal-restore-713d93798f99
Best regards,
--
Mark Brown <broonie(a)kernel.org>
To cover this case, the test sets 'maxlen = 0'. The reasoning is as
follows:
EVIOCGKEY is executed from evdev_do_ioctl(), which is called from
evdev_ioctl_handler().
evdev_ioctl_handler() is called from 2 functions, of which, according to
code coverage, only the first one is in use.
In that path, ‘compat’ is given the value ‘0’ [1].
Thus, the condition [2] is always false.
This means ‘len’ always equals a positive number [3].
‘maxlen’ in evdev_handle_get_val() [4] is defined locally in
evdev_do_ioctl() [5], and is passed in the variable 'size' [6].
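The length clamping described above can be modeled in userspace roughly as follows. This is a sketch of the step as described in this message, not the driver's code; the function names are hypothetical. With maxlen = 0 the computed length is clamped to 0, so nothing would be copied, which is consistent with the test expecting the ioctl to return 0.

```c
#include <limits.h>
#include <stddef.h>

/* Number of longs needed to hold nbits bits. */
size_t bits_to_longs_sketch(size_t nbits)
{
	size_t bits_per_long = sizeof(long) * CHAR_BIT;

	return (nbits + bits_per_long - 1) / bits_per_long;
}

/*
 * Model of the clamping step: the bitmap occupies 'len' bytes, but never
 * more than the 'maxlen' bytes requested by the caller are copied.
 * Returns the number of bytes that would be copied to userspace.
 */
size_t clamped_copy_len_sketch(size_t maxbit, size_t maxlen)
{
	size_t len = bits_to_longs_sketch(maxbit) * sizeof(long);

	if (len > maxlen)	/* the "len > maxlen" branch the test targets */
		len = maxlen;
	return len;
}
```

Any positive bitmap size combined with maxlen = 0 takes the clamping branch, so the exact number of key bits does not matter for reaching it.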
[1] https://elixir.bootlin.com/linux/v6.2/source/drivers/input/evdev.c#L1281
[2] https://elixir.bootlin.com/linux/v6.2/source/drivers/input/evdev.c#L705
[3] https://elixir.bootlin.com/linux/v6.2/source/drivers/input/evdev.c#L707
[4] https://elixir.bootlin.com/linux/v6.2/source/drivers/input/evdev.c#L886
[5] https://elixir.bootlin.com/linux/v6.2/source/drivers/input/evdev.c#L1155
[6] https://elixir.bootlin.com/linux/v6.2/source/drivers/input/evdev.c#L1141
Signed-off-by: Dana Elfassy <dangel101(a)gmail.com>
---
Changes in v2:
- Added following note about the patch's dependency
This patch depends on '[v3] selftests/input: Introduce basic tests for evdev ioctls' [1] sent to the ML.
[1] https://patchwork.kernel.org/project/linux-input/patch/20230607153214.15933…
tools/testing/selftests/input/evioc-test.c | 19 +++++++++++++++++++
1 file changed, 19 insertions(+)
diff --git a/tools/testing/selftests/input/evioc-test.c b/tools/testing/selftests/input/evioc-test.c
index ad7b93fe39cf..b94de2ee5596 100644
--- a/tools/testing/selftests/input/evioc-test.c
+++ b/tools/testing/selftests/input/evioc-test.c
@@ -234,4 +234,23 @@ TEST(eviocsrep_set_repeat_settings)
selftest_uinput_destroy(uidev);
}
+TEST(eviocgkey_get_global_key_state)
+{
+ struct selftest_uinput *uidev;
+ int rep_values[2];
+ int rc;
+
+ memset(rep_values, 0, sizeof(rep_values));
+
+ rc = selftest_uinput_create_device(&uidev);
+ ASSERT_EQ(0, rc);
+ ASSERT_NE(NULL, uidev);
+
+ /* ioctl to create the scenario where len > maxlen in bits_to_user() */
+ rc = ioctl(uidev->evdev_fd, EVIOCGKEY(0), rep_values);
+ ASSERT_EQ(0, rc);
+
+ selftest_uinput_destroy(uidev);
+}
+
TEST_HARNESS_MAIN
--
2.41.0
From: Danielle Ratson <danieller(a)nvidia.com>
When mirroring to a gretap in hardware the device expects to be
programmed with the egress port and all the encapsulating headers. This
requires the driver to resolve the path the packet will take in the
software data path and program the device accordingly.
If the path cannot be resolved (in this case because of an unresolved
neighbor), then mirror installation fails until the path is resolved.
This results in a race that causes the test to sometimes fail.
Fix this by setting the neighbor's state to permanent in a couple of
tests, so that it is always valid.
Fixes: 35c31d5c323f ("selftests: forwarding: Test mirror-to-gretap w/ UL 802.1d")
Fixes: 239e754af854 ("selftests: forwarding: Test mirror-to-gretap w/ UL 802.1q")
Signed-off-by: Danielle Ratson <danieller(a)nvidia.com>
Reviewed-by: Petr Machata <petrm(a)nvidia.com>
Signed-off-by: Petr Machata <petrm(a)nvidia.com>
---
.../testing/selftests/net/forwarding/mirror_gre_bridge_1d.sh | 4 ++++
.../testing/selftests/net/forwarding/mirror_gre_bridge_1q.sh | 4 ++++
2 files changed, 8 insertions(+)
diff --git a/tools/testing/selftests/net/forwarding/mirror_gre_bridge_1d.sh b/tools/testing/selftests/net/forwarding/mirror_gre_bridge_1d.sh
index c5095da7f6bf..aec752a22e9e 100755
--- a/tools/testing/selftests/net/forwarding/mirror_gre_bridge_1d.sh
+++ b/tools/testing/selftests/net/forwarding/mirror_gre_bridge_1d.sh
@@ -93,12 +93,16 @@ cleanup()
test_gretap()
{
+ ip neigh replace 192.0.2.130 lladdr $(mac_get $h3) \
+ nud permanent dev br2
full_test_span_gre_dir gt4 ingress 8 0 "mirror to gretap"
full_test_span_gre_dir gt4 egress 0 8 "mirror to gretap"
}
test_ip6gretap()
{
+ ip neigh replace 2001:db8:2::2 lladdr $(mac_get $h3) \
+ nud permanent dev br2
full_test_span_gre_dir gt6 ingress 8 0 "mirror to ip6gretap"
full_test_span_gre_dir gt6 egress 0 8 "mirror to ip6gretap"
}
diff --git a/tools/testing/selftests/net/forwarding/mirror_gre_bridge_1q.sh b/tools/testing/selftests/net/forwarding/mirror_gre_bridge_1q.sh
index 9ff22f28032d..0cf4c47a46f9 100755
--- a/tools/testing/selftests/net/forwarding/mirror_gre_bridge_1q.sh
+++ b/tools/testing/selftests/net/forwarding/mirror_gre_bridge_1q.sh
@@ -90,12 +90,16 @@ cleanup()
test_gretap()
{
+ ip neigh replace 192.0.2.130 lladdr $(mac_get $h3) \
+ nud permanent dev br1
full_test_span_gre_dir gt4 ingress 8 0 "mirror to gretap"
full_test_span_gre_dir gt4 egress 0 8 "mirror to gretap"
}
test_ip6gretap()
{
+ ip neigh replace 2001:db8:2::2 lladdr $(mac_get $h3) \
+ nud permanent dev br1
full_test_span_gre_dir gt6 ingress 8 0 "mirror to ip6gretap"
full_test_span_gre_dir gt6 egress 0 8 "mirror to ip6gretap"
}
--
2.40.1
When calling socket lookup from L2 (tc, xdp), VRF boundaries aren't
respected. This patchset fixes this by taking the incoming device's
VRF attachment into account when performing the socket lookups from tc/xdp.
The first two patches are coding changes which factor out the tc helper's
logic, which was shared with cg/sk_skb (which operate correctly).
This refactoring is needed in order to avoid affecting the cgroup/sk_skb
flows, as there does not seem to be a strict criterion for discerning which
flow the helper is called from based on the net device or packet
information.
The third patch contains the actual bugfix.
The fourth patch adds bpf tests for these lookup functions.
---
v6: - Remove redundant IS_ENABLED as suggested by Daniel Borkmann
- Declare net_device variable and use it as suggested by Daniel Borkmann
v5: Use reverse xmas tree indentation
v4: - Move dev_sdif() to include/linux/netdevice.h as suggested by Stanislav Fomichev
- Remove SYS and SYS_NOFAIL duplicate definitions
v3: - Rename bpf_l2_sdif() to dev_sdif() as suggested by Stanislav Fomichev
- Added xdp tests as suggested by Daniel Borkmann
- Use start_server() to avoid duplicate code as suggested by Stanislav Fomichev
v2: Fixed uninitialized var in test patch (4).
Gilad Sever (4):
bpf: factor out socket lookup functions for the TC hookpoint.
bpf: Call __bpf_sk_lookup()/__bpf_skc_lookup() directly via TC
hookpoint
bpf: fix bpf socket lookup from tc/xdp to respect socket VRF bindings
selftests/bpf: Add vrf_socket_lookup tests
include/linux/netdevice.h | 9 +
net/core/filter.c | 141 ++++++--
.../bpf/prog_tests/vrf_socket_lookup.c | 312 ++++++++++++++++++
.../selftests/bpf/progs/vrf_socket_lookup.c | 88 +++++
4 files changed, 526 insertions(+), 24 deletions(-)
create mode 100644 tools/testing/selftests/bpf/prog_tests/vrf_socket_lookup.c
create mode 100644 tools/testing/selftests/bpf/progs/vrf_socket_lookup.c
--
2.34.1
The mlxsw driver currently makes the assumption that the user applies
configuration in a bottom-up manner. Thus netdevices need to be added to
the bridge before IP addresses are configured on that bridge or SVI added
on top of it. Enslaving a netdevice to another netdevice that already has
uppers is in fact forbidden by mlxsw for this reason. Despite this safety,
it is rather easy to get into situations where the offloaded configuration
is just plain wrong.
Over the course of the following several patchsets, mlxsw code is going to
be adjusted to diminish the space of wrongly offloaded configurations.
Ideally the offload state will reflect the actual state, regardless of the
sequence of operation used to construct that state.
Several selftests build configurations that will not be offloadable in the
future on some systems. The reason is that what will get offloaded is the
actual configuration, not the configuration steps.
For example, when a port is added to a bridge that has an IP address, that
bridge will get a RIF, which it would not have with the current code. But
on Nvidia Spectrum-1 machines, MAC addresses of all RIFs need to have the
same prefix, which the bridge will violate. The RIF thus couldn't be
created, and the enslavement is therefore canceled, because it would lead
to an unoffloadable configuration. This breaks some selftests.
In this patchset, adjust selftests to avoid the configurations that mlxsw
would be incapable of offloading, while maintaining relevance with regards
to the feature that is being tested. There are generally two cases of
fixes:
- Disabling IPv6 autogen on bridges that do not participate in routing,
either because of the abovementioned requirement to keep the same MAC
prefix on all in-HW router interfaces, or, on 802.1ad bridges, because
in-HW router interfaces are not supported at all.
- Setting the bridge MAC address to what it will become after the first
member port is attached, so that the in-HW router interface is created
with a supported MAC address.
The patchset is then split thus:
- Patches #1-#7 adjust generic selftests
- Patches #8-#16 adjust mlxsw-specific selftests
Petr Machata (16):
selftests: forwarding: q_in_vni: Disable IPv6 autogen on bridges
selftests: forwarding: dual_vxlan_bridge: Disable IPv6 autogen on
bridges
selftests: forwarding: skbedit_priority: Disable IPv6 autogen on a
bridge
selftests: forwarding: pedit_dsfield: Disable IPv6 autogen on a bridge
selftests: forwarding: mirror_gre_*: Disable IPv6 autogen on bridges
selftests: forwarding: mirror_gre_*: Use port MAC for bridge address
selftests: forwarding: router_bridge: Use port MAC for bridge address
selftests: mlxsw: q_in_q_veto: Disable IPv6 autogen on bridges
selftests: mlxsw: extack: Disable IPv6 autogen on bridges
selftests: mlxsw: mirror_gre_scale: Disable IPv6 autogen on a bridge
selftests: mlxsw: qos_dscp_bridge: Disable IPv6 autogen on a bridge
selftests: mlxsw: qos_ets_strict: Disable IPv6 autogen on bridges
selftests: mlxsw: qos_mc_aware: Disable IPv6 autogen on bridges
selftests: mlxsw: spectrum: q_in_vni_veto: Disable IPv6 autogen on a
bridge
selftests: mlxsw: vxlan: Disable IPv6 autogen on bridges
selftests: mlxsw: one_armed_router: Use port MAC for bridge address
.../selftests/drivers/net/mlxsw/extack.sh | 24 ++++++++---
.../drivers/net/mlxsw/mirror_gre_scale.sh | 1 +
.../drivers/net/mlxsw/one_armed_router.sh | 3 +-
.../drivers/net/mlxsw/q_in_q_veto.sh | 8 ++++
.../drivers/net/mlxsw/qos_dscp_bridge.sh | 1 +
.../drivers/net/mlxsw/qos_ets_strict.sh | 8 +++-
.../drivers/net/mlxsw/qos_mc_aware.sh | 2 +
.../net/mlxsw/spectrum/q_in_vni_veto.sh | 1 +
.../selftests/drivers/net/mlxsw/vxlan.sh | 41 ++++++++++++++-----
.../net/forwarding/dual_vxlan_bridge.sh | 1 +
.../net/forwarding/mirror_gre_bound.sh | 1 +
.../net/forwarding/mirror_gre_bridge_1d.sh | 3 +-
.../forwarding/mirror_gre_bridge_1d_vlan.sh | 3 +-
.../forwarding/mirror_gre_bridge_1q_lag.sh | 3 +-
.../net/forwarding/mirror_topo_lib.sh | 1 +
.../selftests/net/forwarding/pedit_dsfield.sh | 4 +-
.../selftests/net/forwarding/q_in_vni.sh | 1 +
.../selftests/net/forwarding/router_bridge.sh | 3 +-
.../net/forwarding/skbedit_priority.sh | 4 +-
19 files changed, 88 insertions(+), 25 deletions(-)
--
2.40.1
If we get an unexpected signal during a signal test, log a bit more data to
aid diagnostics.
Signed-off-by: Mark Brown <broonie(a)kernel.org>
---
tools/testing/selftests/arm64/signal/test_signals_utils.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/tools/testing/selftests/arm64/signal/test_signals_utils.c b/tools/testing/selftests/arm64/signal/test_signals_utils.c
index 40be8443949d..0dc948db3a4a 100644
--- a/tools/testing/selftests/arm64/signal/test_signals_utils.c
+++ b/tools/testing/selftests/arm64/signal/test_signals_utils.c
@@ -249,7 +249,8 @@ static void default_handler(int signum, siginfo_t *si, void *uc)
fprintf(stderr, "-- Timeout !\n");
} else {
fprintf(stderr,
- "-- RX UNEXPECTED SIGNAL: %d\n", signum);
+ "-- RX UNEXPECTED SIGNAL: %d code %d address %p\n",
+ signum, si->si_code, si->si_addr);
}
default_result(current, 1);
}
---
base-commit: 44c026a73be8038f03dbdeef028b642880cf1511
change-id: 20230620-arm64-selftest-log-wrong-signal-cd8c34ae5e4f
Best regards,
--
Mark Brown <broonie(a)kernel.org>
This series adds 2 zswap related selftests that verify known and fixed
issues. A new dedicated test program (test_zswap) is proposed since
the test cases are specific to zswap and the program hosts zswap-specific
helpers.
The first patch adds the (empty) test program, while the other 2 add an
actual test function each.
Domenico Cerasuolo (3):
selftests: cgroup: add test_zswap program
selftests: cgroup: add test_zswap with no kmem bypass test
selftests: cgroup: add zswap-memcg unwanted writeback test
tools/testing/selftests/cgroup/.gitignore | 1 +
tools/testing/selftests/cgroup/Makefile | 2 +
tools/testing/selftests/cgroup/test_zswap.c | 286 ++++++++++++++++++++
3 files changed, 289 insertions(+)
create mode 100644 tools/testing/selftests/cgroup/test_zswap.c
--
2.34.1
We want to replace iptables TPROXY with a BPF program at TC ingress.
To make this work in all cases we need to assign a SO_REUSEPORT socket
to an skb, which is currently prohibited. This series adds support for
such sockets to bpf_sk_assign. See patch 5 for details.
I did some refactoring to cut down on the amount of duplicate code. The
key to this is to use INDIRECT_CALL in the reuseport helpers. To show
that this approach is not just beneficial to TC sk_assign I removed
duplicate code for bpf_sk_lookup as well.
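As a rough userspace illustration of the INDIRECT_CALL idea: a shared helper takes the protocol-specific function as a pointer, and the macro compares that pointer with the expected target so the common case becomes a direct call. The macro below only approximates the kernel's wrappers in include/linux/indirect_call_wrapper.h; all names ending in _sketch are hypothetical and the lookup bodies are placeholders.

```c
/* Signature shared by the protocol-specific lookup functions. */
typedef int (*lookup_fn_sketch)(int key);

/* Placeholder "IPv4" and "IPv6" lookups; real ones would search hash tables. */
int lookup_v4_sketch(int key)
{
	return key + 4;
}

int lookup_v6_sketch(int key)
{
	return key + 6;
}

/*
 * If f is the expected target f1, call f1 directly (which the compiler can
 * inline or emit as a direct call); otherwise fall back to the indirect call.
 */
#define INDIRECT_CALL_1_SKETCH(f, f1, ...) \
	((f) == (f1) ? f1(__VA_ARGS__) : (f)(__VA_ARGS__))

/* One shared helper instead of a near-identical copy per protocol. */
int reuseport_helper_sketch(lookup_fn_sketch f, int key)
{
	return INDIRECT_CALL_1_SKETCH(f, lookup_v4_sketch, key);
}
```

The point of the pattern is that the helper body exists once, while callers that pass the expected function avoid the cost of an indirect branch.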
Changes from v1:
- Correct commit abbrev length (Kuniyuki)
- Reduce duplication (Kuniyuki)
- Add checks on sk_state (Martin)
- Split exporting inet[6]_lookup_reuseport into separate patch (Eric)
Joint work with Daniel Borkmann.
Signed-off-by: Lorenz Bauer <lmb(a)isovalent.com>
---
Daniel Borkmann (1):
selftests/bpf: Test that SO_REUSEPORT can be used with sk_assign helper
Lorenz Bauer (5):
net: export inet_lookup_reuseport and inet6_lookup_reuseport
net: document inet[6]_lookup_reuseport sk_state requirements
net: remove duplicate reuseport_lookup functions
net: remove duplicate sk_lookup helpers
bpf, net: Support SO_REUSEPORT sockets with bpf_sk_assign
include/net/inet6_hashtables.h | 84 ++++++++-
include/net/inet_hashtables.h | 77 +++++++-
include/net/sock.h | 7 +-
include/uapi/linux/bpf.h | 3 -
net/core/filter.c | 2 -
net/ipv4/inet_hashtables.c | 69 +++++---
net/ipv4/udp.c | 73 +++-----
net/ipv6/inet6_hashtables.c | 71 +++++---
net/ipv6/udp.c | 85 +++------
tools/include/uapi/linux/bpf.h | 3 -
tools/testing/selftests/bpf/network_helpers.c | 3 +
.../selftests/bpf/prog_tests/assign_reuse.c | 197 +++++++++++++++++++++
.../selftests/bpf/progs/test_assign_reuse.c | 142 +++++++++++++++
13 files changed, 637 insertions(+), 179 deletions(-)
---
base-commit: 25085b4e9251c77758964a8e8651338972353642
change-id: 20230613-so-reuseport-e92c526173ee
Best regards,
--
Lorenz Bauer <lmb(a)isovalent.com>
*Changes in v18*
- Rebase on top of next-20230613
- Minor updates
*Changes in v17*
- Rebase on top of next-20230606
- Minor improvements in PAGEMAP_SCAN IOCTL patch
*Changes in v16*
- Fix a corner case
- Add exclusive PM_SCAN_OP_WP back
*Changes in v15*
- Build fix (Add missed build fix in RESEND)
*Changes in v14*
- Fix build error caused by #ifdef added at last minute in some configs
*Changes in v13*
- Rebase on top of next-20230414
- Give up on using uffd_wp_range() and write new helpers, flush tlb only
once
*Changes in v12*
- Update and other memory types to UFFD_FEATURE_WP_ASYNC
- Rebase on top of next-20230406
- Review updates
*Changes in v11*
- Rebase on top of next-20230307
- Base patches on UFFD_FEATURE_WP_UNPOPULATED
- Do a lot of cosmetic changes and review updates
- Remove ENGAGE_WP + !GET operation as it can be performed with
UFFDIO_WRITEPROTECT
*Changes in v10*
- Add specific condition to return error if hugetlb is used with wp
async
- Move changes in tools/include/uapi/linux/fs.h to separate patch
- Add documentation
*Changes in v9:*
- Correct fault resolution for userfaultfd wp async
- Fix build warnings and errors which were happening on some configs
- Simplify pagemap ioctl's code
*Changes in v8:*
- Update uffd async wp implementation
- Improve PAGEMAP_IOCTL implementation
*Changes in v7:*
- Add uffd wp async
- Update the IOCTL to use uffd under the hood instead of soft-dirty
flags
*Motivation*
The real motivation for adding the PAGEMAP_SCAN IOCTL is to emulate the
Windows GetWriteWatch() syscall [1]. GetWriteWatch() retrieves the addresses
of the pages that have been written to in a region of virtual memory.
This syscall is used in Windows applications, games etc. It is currently
emulated in userspace in a rather slow manner. Our purpose is to
enhance the kernel so that it can be emulated efficiently.
Currently some out-of-tree hack patches are being used to emulate it
efficiently in some kernels. We intend to replace those with these patches,
so that gaming on Linux as a whole can benefit from this. It
means there would be a large number of users of this code.
CRIU use case [2] was mentioned by Andrei and Danylo:
> Use cases for migrating sparse VMAs are binaries sanitized with ASAN,
> MSAN or TSAN [3]. All of these sanitizers produce sparse mappings of
> shadow memory [4]. Being able to migrate such binaries allows to highly
> reduce the amount of work needed to identify and fix post-migration
> crashes, which happen constantly.
Andrei defines the following uses of this code:
* it is more granular and allows us to track changed pages more
effectively. The current interface can clear dirty bits for the entire
process only. In addition, reading info about pages is a separate
operation. It means we must freeze the process to read information
about all its pages, reset dirty bits, only then we can start dumping
pages. The information about pages becomes more and more outdated,
while we are processing pages. The new interface solves both these
downsides. First, it allows us to read pte bits and clear the
soft-dirty bit atomically. It means that CRIU will not need to freeze
processes to pre-dump their memory. Second, it clears soft-dirty bits
for a specified region of memory. It means CRIU will have actual info
about pages to the moment of dumping them.
* The new interface has to be much faster because basic page filtering
is happening in the kernel. With the old interface, we have to read
pagemap for each page.
*Implementation Evolution (Short Summary)*
Based on the definition of GetWriteWatch(), we felt that the kernel's
soft-dirty feature could be used under the hood with some additions like:
* reset soft-dirty flag for only a specific region of memory instead of
clearing the flag for the entire process
* get and clear soft-dirty flag for a specific region atomically
So we decided to use an ioctl on the pagemap file to read and/or reset the
soft-dirty flag. But when using the soft-dirty flag, we sometimes get extra
pages which weren't even written. They had become soft-dirty because of VMA
merging and the VM_SOFTDIRTY flag. This breaks the definition of
GetWriteWatch(). We were able to bypass this shortcoming by ignoring
VM_SOFTDIRTY until David reported that mprotect etc. messes up the
soft-dirty flag while ignoring VM_SOFTDIRTY [5]. This wasn't happening until
[6] got introduced. We discussed whether we could revert these patches, but
we could not reach any conclusion. So at this point, I made a couple of
attempts to solve this whole VM_SOFTDIRTY issue by correcting the soft-dirty
implementation:
* [7] Correct the bug fixed wrongly back in 2014. It had the potential to
cause regressions. We left it behind.
* [8] Keep a list of the soft-dirty parts of a VMA across splits and merges.
The reply I got was not to increase the size of the VMA by 8 bytes.
At this point, we left soft-dirty behind, considering it too delicate, and
userfaultfd [9] seemed like the only way forward. From there onward, we
have been basing soft-dirty emulation on the userfaultfd wp feature, where
the kernel resolves the faults itself when the WP_ASYNC feature is used. It
was straightforward to add the WP_ASYNC feature to userfaultfd. Now only
those pages which have really been written are reported as dirty or
written-to. (PS: another userfaultfd feature, WP_UNPOPULATED, is required
to avoid pre-faulting memory before write-protecting [9].)
All the different masks were added at the request of the CRIU devs to make
the interface more generic and better.
[1] https://learn.microsoft.com/en-us/windows/win32/api/memoryapi/nf-memoryapi-…
[2] https://lore.kernel.org/all/20221014134802.1361436-1-mdanylo@google.com
[3] https://github.com/google/sanitizers
[4] https://github.com/google/sanitizers/wiki/AddressSanitizerAlgorithm#64-bit
[5] https://lore.kernel.org/all/bfcae708-db21-04b4-0bbe-712badd03071@redhat.com
[6] https://lore.kernel.org/all/20220725142048.30450-1-peterx@redhat.com/
[7] https://lore.kernel.org/all/20221122115007.2787017-1-usama.anjum@collabora.…
[8] https://lore.kernel.org/all/20221220162606.1595355-1-usama.anjum@collabora.…
[9] https://lore.kernel.org/all/20230306213925.617814-1-peterx@redhat.com
[10] https://lore.kernel.org/all/20230125144529.1630917-1-mdanylo@google.com
*Original Cover letter from v8*
Hello,
Note:
Soft-dirty pages and pages which have been written-to are synonyms. As
the kernel already has a soft-dirty feature, which we have given up on
using, we use the written-to terminology while using UFFD async WP under
the hood.
The PAGEMAP_SCAN IOCTL on the pagemap file can be used to get and/or clear
the info about page table entries. The following operations are
supported in this ioctl:
- Get the information if the pages have been written-to (PAGE_IS_WRITTEN),
file mapped (PAGE_IS_FILE), present (PAGE_IS_PRESENT) or swapped
(PAGE_IS_SWAPPED).
- Write-protect the pages (PAGEMAP_WP_ENGAGE) to start finding which
pages have been written-to.
- Find pages which have been written-to and write protect the pages
(atomic PAGE_IS_WRITTEN + PAGEMAP_WP_ENGAGE)
It is possible to find and clear soft-dirty pages entirely in userspace.
But it isn't efficient:
- The mprotect and SIGSEGV handler for bookkeeping
- The userfaultfd wp (synchronous) with the handler for bookkeeping
Some benchmarks can be seen here[1]. This series adds features that weren't
present earlier:
- The kernel has no atomic operation to get the soft-dirty/written-to
status and clear it.
- The pages which have been written-to cannot be found accurately.
(The kernel's soft-dirty PTE bit + soft-dirty VMA bit show more soft-dirty
pages than there actually are.)
Historically, soft-dirty PTE bit tracking has been used in the CRIU
project. The procfs interface is enough for finding the soft-dirty bit
status and clearing the soft-dirty bit of all the pages of a process.
We have the use case where we need to track the soft-dirty PTE bit for
only specific pages on-demand. We need this tracking and clear mechanism
of a region of memory while the process is running to emulate the
GetWriteWatch() syscall of Windows.
*(Moved to using UFFD instead of the soft-dirty feature to find pages which
have been written-to, starting from the v7 patch series)*:
Stop using the soft-dirty flags for finding which pages have been
written to. It is too delicate and wrong as it shows more soft-dirty
pages than the actual soft-dirty pages. There is no interest in
correcting it [2][3] as this is how the feature was written years ago.
Its behaviour shouldn't be changed now. Peter Xu has suggested
using the async version of the UFFD WP [4] as it is based inherently
on the PTEs.
So in this patch series, I've added a new mode to the UFFD which is an
asynchronous version of the write protect. When this variant of the
UFFD WP is used, the page faults are resolved automatically by the
kernel. The pages which have been written-to can be found by reading
pagemap file (!PM_UFFD_WP). This feature can be used successfully to
find which pages have been written to from the time the pages were
write protected. This works just like the soft-dirty flag without
showing any extra pages which aren't soft-dirty in reality.
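As an illustration of the pagemap side, a 64-bit pagemap entry can be decoded as sketched below. The bit positions (63 present, 62 swapped, 61 file-mapped or shared-anonymous, 57 uffd-wp) are taken from Documentation/admin-guide/mm/pagemap.rst; the macro and helper names are hypothetical. The "written" rule is a simplification that assumes the page lies in a range which was registered and write-protected with the async UFFD WP mode.

```c
#include <stdbool.h>
#include <stdint.h>

/* Flag bits of a /proc/<pid>/pagemap entry (names local to this sketch). */
#define PM_SKETCH_UFFD_WP	(UINT64_C(1) << 57)
#define PM_SKETCH_FILE		(UINT64_C(1) << 61)
#define PM_SKETCH_SWAPPED	(UINT64_C(1) << 62)
#define PM_SKETCH_PRESENT	(UINT64_C(1) << 63)

/* True if the page is present in RAM. */
bool pm_entry_present(uint64_t entry)
{
	return (entry & PM_SKETCH_PRESENT) != 0;
}

/* True if the page is swapped out. */
bool pm_entry_swapped(uint64_t entry)
{
	return (entry & PM_SKETCH_SWAPPED) != 0;
}

/* True if the page is file-mapped or shared-anonymous. */
bool pm_entry_file(uint64_t entry)
{
	return (entry & PM_SKETCH_FILE) != 0;
}

/*
 * In a range that was write-protected with async UFFD WP, the kernel drops
 * the uffd-wp bit when it resolves a write fault, so a cleared bit is read
 * here as "written-to since the protection was applied".
 */
bool pm_entry_written(uint64_t entry)
{
	return (entry & PM_SKETCH_UFFD_WP) == 0;
}
```

Each entry is 8 bytes and is read from the pagemap file at offset (virtual address / page size) * 8.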
The information about whether a page is file mapped, present or
swapped is required for the CRIU project [5][6]. The addition of the
required mask, any mask, excluded mask and return mask is also required
for the CRIU project [5].
The IOCTL returns the addresses of the pages which match the specific
masks. The page addresses are returned in struct page_region in a compact
form. The max_pages is needed to support a use case where the user only wants
to get a specific number of pages. So there is no need to find all the
pages of interest in the range when max_pages is specified. The IOCTL
returns when the maximum number of pages has been found. The max_pages is
optional. If max_pages is specified, it must be equal to or greater than the
vec_size. This restriction is needed to handle the worst case, when one
page_region only contains info of one page and it cannot be compacted.
This is needed to emulate the Windows getWriteWatch() syscall.
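The compaction and the max_pages cut-off described above can be pictured with a small userspace model. This is only a sketch: struct region_sketch and compact_pages() are hypothetical stand-ins, not the series' struct page_region or its kernel code, and a max_pages of 0 is assumed here to mean "no limit".

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the series' struct page_region. */
struct region_sketch {
	size_t start;	/* index of the first page in the run */
	size_t len;	/* number of consecutive matching pages */
};

/*
 * Compact per-page match flags into runs of consecutive pages. Stops once
 * vec_len regions are in use or max_pages pages have been reported
 * (max_pages == 0 means no limit). Returns the number of regions written.
 */
size_t compact_pages(const unsigned char *match, size_t npages,
		     struct region_sketch *vec, size_t vec_len,
		     size_t max_pages)
{
	size_t nregions = 0, found = 0;

	assert(match && vec);
	for (size_t i = 0; i < npages; i++) {
		if (!match[i])
			continue;
		if (max_pages && found == max_pages)
			break;
		if (nregions &&
		    vec[nregions - 1].start + vec[nregions - 1].len == i) {
			vec[nregions - 1].len++;	/* extend the current run */
		} else {
			if (nregions == vec_len)
				break;			/* output vector is full */
			vec[nregions].start = i;
			vec[nregions].len = 1;
			nregions++;
		}
		found++;
	}
	return nregions;
}
```

When matching pages alternate with non-matching ones, every region holds exactly one page, which is the worst case the text refers to.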
The patch series includes a detailed selftest which can be used as an
example for the uffd async wp test and PAGEMAP_IOCTL. It shows the
interface usage as well.
[1] https://lore.kernel.org/lkml/54d4c322-cd6e-eefd-b161-2af2b56aae24@collabora…
[2] https://lore.kernel.org/all/20221220162606.1595355-1-usama.anjum@collabora.…
[3] https://lore.kernel.org/all/20221122115007.2787017-1-usama.anjum@collabora.…
[4] https://lore.kernel.org/all/Y6Hc2d+7eTKs7AiH@x1n
[5] https://lore.kernel.org/all/YyiDg79flhWoMDZB@gmail.com/
[6] https://lore.kernel.org/all/20221014134802.1361436-1-mdanylo@google.com/
Regards,
Muhammad Usama Anjum
Muhammad Usama Anjum (4):
fs/proc/task_mmu: Implement IOCTL to get and optionally clear info
about PTEs
tools headers UAPI: Update linux/fs.h with the kernel sources
mm/pagemap: add documentation of PAGEMAP_SCAN IOCTL
selftests: mm: add pagemap ioctl tests
Peter Xu (1):
userfaultfd: UFFD_FEATURE_WP_ASYNC
Documentation/admin-guide/mm/pagemap.rst | 58 +
Documentation/admin-guide/mm/userfaultfd.rst | 35 +
fs/proc/task_mmu.c | 513 ++++++
fs/userfaultfd.c | 26 +-
include/linux/hugetlb.h | 1 +
include/linux/userfaultfd_k.h | 21 +-
include/uapi/linux/fs.h | 53 +
include/uapi/linux/userfaultfd.h | 9 +-
mm/hugetlb.c | 34 +-
mm/memory.c | 27 +-
tools/include/uapi/linux/fs.h | 53 +
tools/testing/selftests/mm/.gitignore | 2 +
tools/testing/selftests/mm/Makefile | 3 +-
tools/testing/selftests/mm/config | 1 +
tools/testing/selftests/mm/pagemap_ioctl.c | 1459 ++++++++++++++++++
tools/testing/selftests/mm/run_vmtests.sh | 4 +
16 files changed, 2275 insertions(+), 24 deletions(-)
create mode 100644 tools/testing/selftests/mm/pagemap_ioctl.c
mode change 100644 => 100755 tools/testing/selftests/mm/run_vmtests.sh
--
2.39.2
This patchset is based on the next branch of shuah/linux-kselftest.git
Tiezhu Yang (2):
selftests/vDSO: Add support for LoongArch
selftests/vDSO: Get version and name for all archs
tools/testing/selftests/vDSO/vdso_config.h | 6 ++++-
tools/testing/selftests/vDSO/vdso_test_getcpu.c | 16 +++++--------
.../selftests/vDSO/vdso_test_gettimeofday.c | 26 ++++++----------------
3 files changed, 18 insertions(+), 30 deletions(-)
--
2.1.0
When executing the following command to test clone3 on LoongArch:
# cd tools/testing/selftests/clone3 && make && ./clone3
we can see the following error info:
# [5719] Trying clone3() with flags 0x80 (size 0)
# Invalid argument - Failed to create new process
# [5719] clone3() with flags says: -22 expected 0
not ok 18 [5719] Result (-22) is different than expected (0)
This is because copy_time_ns() returns -EINVAL if CONFIG_TIME_NS is
not set but the flag CLONE_NEWTIME (0x80) is used to clone a time
namespace.
If the kernel does not support CONFIG_TIME_NS, /proc/self/ns/time
will not exist, and we should then skip the clone3() test with
CLONE_NEWTIME.
With this patch under !CONFIG_TIME_NS:
# cd tools/testing/selftests/clone3 && make && ./clone3
...
# Time namespaces are not supported
ok 18 # SKIP Skipping clone3() with CLONE_NEWTIME
# Totals: pass:17 fail:0 xfail:0 xpass:0 skip:1 error:0
Fixes: 515bddf0ec41 ("selftests/clone3: test clone3 with CLONE_NEWTIME")
Suggested-by: Thomas Gleixner <tglx(a)linutronix.de>
Signed-off-by: Tiezhu Yang <yangtiezhu(a)loongson.cn>
---
v5:
-- Rebase on the next branch of shuah/linux-kselftest.git
to avoid potential merge conflicts due to changes in the link:
https://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest.git/c…
-- Update the commit message and send it as a single patch
Here is the v4 patch:
https://lore.kernel.org/loongarch/1685968410-5412-2-git-send-email-yangtiez…
tools/testing/selftests/clone3/clone3.c | 7 ++++++-
1 file changed, 6 insertions(+), 1 deletion(-)
diff --git a/tools/testing/selftests/clone3/clone3.c b/tools/testing/selftests/clone3/clone3.c
index e60cf4d..1c61e3c 100644
--- a/tools/testing/selftests/clone3/clone3.c
+++ b/tools/testing/selftests/clone3/clone3.c
@@ -196,7 +196,12 @@ int main(int argc, char *argv[])
CLONE3_ARGS_NO_TEST);
/* Do a clone3() in a new time namespace */
- test_clone3(CLONE_NEWTIME, 0, 0, CLONE3_ARGS_NO_TEST);
+ if (access("/proc/self/ns/time", F_OK) == 0) {
+ test_clone3(CLONE_NEWTIME, 0, 0, CLONE3_ARGS_NO_TEST);
+ } else {
+ ksft_print_msg("Time namespaces are not supported\n");
+ ksft_test_result_skip("Skipping clone3() with CLONE_NEWTIME\n");
+ }
/* Do a clone3() with exit signal (SIGCHLD) in flags */
test_clone3(SIGCHLD, 0, -EINVAL, CLONE3_ARGS_NO_TEST);
--
2.1.0
Hello,
This patchset builds upon a soon-to-be-published WIP patchset that Sean
published at https://github.com/sean-jc/linux/tree/x86/kvm_gmem_solo, mentioned
at [1].
The tree can be found at:
https://github.com/googleprodkernel/linux-cc/tree/gmem-hugetlb-rfc-v1
In this patchset, hugetlb support for KVM's guest_mem (aka gmem) is introduced,
allowing VM private memory (for confidential computing) to be backed by hugetlb
pages.
guest_mem provides userspace with a handle, with which userspace can allocate
and deallocate memory for confidential VMs without mapping the memory into
userspace.
Why use hugetlb instead of introducing a new allocator, like gmem does for 4K
and transparent hugepages?
+ hugetlb provides the following useful functionality, which would otherwise
have to be reimplemented:
+ Allocation of hugetlb pages at boot time, including
+ Parsing of kernel boot parameters to configure hugetlb
+ Tracking of usage in hstate
+ gmem will share the same system-wide pool of hugetlb pages, so users
don't have to have separate pools for hugetlb and gmem
+ Page accounting with subpools
+ hugetlb pages are tracked in subpools, which gmem uses to reserve
pages from the global hstate
+ Memory charging
+ hugetlb provides code that charges memory to cgroups
+ Reporting: hugetlb usage and availability are available at /proc/meminfo,
etc
The first 11 patches in this patchset are a series of refactorings to decouple
hugetlb and hugetlbfs.
The central thread binding the refactoring is that some functions (like
inode_resv_map(), inode_subpool(), inode_hstate(), etc) rely on a hugetlbfs
concept, that the resv_map, subpool, hstate, are in a specific field in a
hugetlb inode.
Refactoring to parametrize functions by hstate, subpool, resv_map will allow
hugetlb to be used by gmem and in other places where these data structures
aren't necessarily stored in the same positions in the inode.
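The direction of this refactoring can be sketched in plain C. Everything below is a hypothetical, heavily simplified model (the *_sketch types and reserve_hpages() are not kernel interfaces): the helper takes the hstate and subpool explicitly, and the hugetlbfs-style caller becomes a thin wrapper that looks them up in its inode.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, heavily simplified stand-ins for the kernel structures. */
struct hstate_sketch { unsigned int order; };	/* huge page = base page << order */
struct subpool_sketch { long used_hpages, max_hpages; };

struct inode_sketch {				/* hugetlbfs-style container */
	struct hstate_sketch *hstate;
	struct subpool_sketch *spool;
};

/* Parametrized helper: it only sees the objects it actually needs. */
int reserve_hpages(struct hstate_sketch *h, struct subpool_sketch *spool,
		   size_t bytes, size_t base_page_size)
{
	size_t hpage_size;
	long needed;

	assert(h && spool && base_page_size > 0);
	hpage_size = base_page_size << h->order;
	needed = (long)((bytes + hpage_size - 1) / hpage_size);

	if (spool->used_hpages + needed > spool->max_hpages)
		return -1;
	spool->used_hpages += needed;
	return 0;
}

/* The inode-based entry point stays available as a thin wrapper. */
int reserve_hpages_inode(struct inode_sketch *inode, size_t bytes,
			 size_t base_page_size)
{
	assert(inode);
	return reserve_hpages(inode->hstate, inode->spool, bytes,
			      base_page_size);
}
```

A caller such as gmem, which keeps these objects somewhere other than a hugetlbfs inode, could then call the parametrized helper directly.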
The refactoring proposed here is just the minimum required to get a
proof-of-concept working with gmem. I would like to get opinions on this
approach before doing further refactoring. (See TODOs)
TODOs:
+ hugetlb/hugetlbfs refactoring
+ remove_inode_hugepages() no longer needs to be exposed, it is hugetlbfs
specific and used only in inode.c
+ remove_mapping_hugepages(), remove_inode_single_folio(),
hugetlb_unreserve_pages() shouldn't need to take inode as a parameter
+ Updating inode->i_blocks can be refactored to a separate function and
called from hugetlbfs and gmem
+ alloc_hugetlb_folio_from_subpool() shouldn't need to be parametrized by
vma
+ hugetlb_reserve_pages() should be refactored to be symmetric with
hugetlb_unreserve_pages()
+ It should be parametrized by resv_map
+ alloc_hugetlb_folio_from_subpool() could perhaps use
hugetlb_reserve_pages()?
+ gmem
+ Figure out if resv_map should be used by gmem at all
+ Probably needs more refactoring to decouple resv_map from hugetlb
functions
Questions for the community:
1. In this patchset, every gmem file backed with hugetlb is given a new
subpool. Is that desirable?
+ In hugetlbfs, a subpool always belongs to a mount, and hugetlbfs has one
mount per hugetlb size (2M, 1G, etc)
+ memfd_create(MFD_HUGETLB) effectively returns a full hugetlbfs file, so it
(rightfully) uses the hugetlbfs kernel mounts and their subpools
+ I gave each file a subpool mostly to speed up implementation and still be
able to reserve hugetlb pages from the global hstate based on the gmem
file size.
+ gmem, unlike hugetlbfs, isn't meant to be a full filesystem, so
+ Should there be multiple mounts, one for each hugetlb size?
+ Will the mounts be initialized on boot or on first gmem file creation?
+ Or is one subpool per gmem file fine?
2. Should resv_map be used for gmem at all, since gmem doesn't allow userspace
reservations?
[1] https://lore.kernel.org/lkml/ZEM5Zq8oo+xnApW9@google.com/
---
Ackerley Tng (19):
mm: hugetlb: Expose get_hstate_idx()
mm: hugetlb: Move and expose hugetlbfs_zero_partial_page
mm: hugetlb: Expose remove_inode_hugepages
mm: hugetlb: Decouple hstate, subpool from inode
mm: hugetlb: Allow alloc_hugetlb_folio() to be parametrized by subpool
and hstate
mm: hugetlb: Provide hugetlb_filemap_add_folio()
mm: hugetlb: Refactor vma_*_reservation functions
mm: hugetlb: Refactor restore_reserve_on_error
mm: hugetlb: Use restore_reserve_on_error directly in filesystems
mm: hugetlb: Parametrize alloc_hugetlb_folio_from_subpool() by
resv_map
mm: hugetlb: Parametrize hugetlb functions by resv_map
mm: truncate: Expose preparation steps for truncate_inode_pages_final
KVM: guest_mem: Refactor kvm_gmem fd creation to be in layers
KVM: guest_mem: Refactor cleanup to separate inode and file cleanup
KVM: guest_mem: hugetlb: initialization and cleanup
KVM: guest_mem: hugetlb: allocate and truncate from hugetlb
KVM: selftests: Add basic selftests for hugetlbfs-backed guest_mem
KVM: selftests: Support various types of backing sources for private
memory
KVM: selftests: Update test for various private memory backing source
types
fs/hugetlbfs/inode.c | 102 ++--
include/linux/hugetlb.h | 86 ++-
include/linux/mm.h | 1 +
include/uapi/linux/kvm.h | 25 +
mm/hugetlb.c | 324 +++++++-----
mm/truncate.c | 24 +-
.../testing/selftests/kvm/guest_memfd_test.c | 33 +-
.../testing/selftests/kvm/include/test_util.h | 14 +
tools/testing/selftests/kvm/lib/test_util.c | 74 +++
.../kvm/x86_64/private_mem_conversions_test.c | 38 +-
virt/kvm/guest_mem.c | 488 ++++++++++++++----
11 files changed, 882 insertions(+), 327 deletions(-)
--
2.41.0.rc0.172.g3f132b7071-goog
KVM_GET_REG_LIST will dump all register IDs that are available to
KVM_GET/SET_ONE_REG, and it is very useful for identifying platform
regression issues during VM migration.
Patches 1-7 restructure the aarch64 get-reg-list test so that some
of the code becomes a common test framework that can be shared with riscv.
Patch 8 enables the KVM_GET_REG_LIST API in riscv and patches 9-10 add
the corresponding kselftest for checking possible register regressions.
The get-reg-list kvm selftest was ported from aarch64 and tested with
Linux 6.4-rc5 on a Qemu riscv64 virt machine.
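For readers unfamiliar with the ioctl, a minimal userspace sketch of the usual two-step KVM_GET_REG_LIST call is shown below: a probe with n = 0 is expected to fail with E2BIG while filling in the required count, after which the list is allocated and fetched. The helper name get_reg_list() is illustrative only, and obtaining the vCPU file descriptor is left out.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

/*
 * Fetch the register ID list of a vCPU. Returns a heap-allocated list that
 * the caller must free(), or NULL on error.
 */
struct kvm_reg_list *get_reg_list(int vcpu_fd)
{
	struct kvm_reg_list probe = { .n = 0 };
	struct kvm_reg_list *list;

	assert(vcpu_fd >= 0);

	/* Probe: with n == 0 the call should fail with E2BIG and set n. */
	if (ioctl(vcpu_fd, KVM_GET_REG_LIST, &probe) < 0 && errno != E2BIG)
		return NULL;

	list = calloc(1, sizeof(*list) + probe.n * sizeof(list->reg[0]));
	if (!list)
		return NULL;

	list->n = probe.n;
	if (ioctl(vcpu_fd, KVM_GET_REG_LIST, list) < 0) {
		free(list);
		return NULL;
	}
	return list;
}
```

Each returned ID in list->reg[] can then be passed to KVM_GET_ONE_REG or KVM_SET_ONE_REG.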
---
Changed since v2:
* Rebase to Linux 6.4-rc5
* Filter out ZICBO* config and ISA_EXT registers report if the
extensions were not supported in host
* Enable AIA CSR test
* Move vCPU extension check_supported() to finalize_vcpu() per
Andrew's suggestion
* Switch to use KVM_REG_SIZE_ULONG for most registers' definition
---
Changed since v1:
* rebase to Andrew's changes
* fix coding style
Andrew Jones (7):
KVM: arm64: selftests: Replace str_with_index with strdup_printf
KVM: arm64: selftests: Drop SVE cap check in print_reg
KVM: arm64: selftests: Remove print_reg's dependency on vcpu_config
KVM: arm64: selftests: Rename vcpu_config and add to kvm_util.h
KVM: arm64: selftests: Delete core_reg_fixup
KVM: arm64: selftests: Split get-reg-list test code
KVM: arm64: selftests: Finish generalizing get-reg-list
Haibo Xu (3):
KVM: riscv: Add KVM_GET_REG_LIST API support
KVM: riscv: selftests: Skip some registers set operation
KVM: riscv: selftests: Add get-reg-list test
Documentation/virt/kvm/api.rst | 2 +-
arch/riscv/kvm/vcpu.c | 378 +++++++++++
tools/testing/selftests/kvm/Makefile | 11 +-
.../selftests/kvm/aarch64/get-reg-list.c | 540 ++--------------
tools/testing/selftests/kvm/get-reg-list.c | 421 ++++++++++++
.../selftests/kvm/include/kvm_util_base.h | 16 +
.../selftests/kvm/include/riscv/processor.h | 3 +
.../testing/selftests/kvm/include/test_util.h | 2 +
tools/testing/selftests/kvm/lib/test_util.c | 15 +
.../selftests/kvm/riscv/get-reg-list.c | 611 ++++++++++++++++++
10 files changed, 1499 insertions(+), 500 deletions(-)
create mode 100644 tools/testing/selftests/kvm/get-reg-list.c
create mode 100644 tools/testing/selftests/kvm/riscv/get-reg-list.c
--
2.34.1
When calling socket lookup from L2 (tc, xdp), VRF boundaries aren't
respected. This patchset fixes this by regarding the incoming device's
VRF attachment when performing the socket lookups from tc/xdp.
The first two patches are code changes which factor out the tc helper's
logic which was shared with cg/sk_skb (which operate correctly).
This refactoring is needed in order to avoid affecting the cgroup/sk_skb
flows as there does not seem to be a strict criterion for discerning which
flow the helper is called from based on the net device or packet
information.
The third patch contains the actual bugfix.
The fourth patch adds bpf tests for these lookup functions.
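Why the VRF attachment matters for a lookup can be seen in a simplified userspace model of the bound-device comparison. The function below is a sketch under assumptions: it mirrors the check as I understand it from the regular receive path, where dif is the device the lookup is performed for and sdif ties an enslaved ingress device to its VRF. The exact values that this series passes from tc/xdp are not shown here.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Simplified model of the bound-device check in socket lookup:
 *   bound_dev_if   device the socket is bound to (0 = unbound)
 *   dif            index of the device the lookup is done for
 *   sdif           secondary index linking an enslaved device and its VRF
 *                  (0 when the ingress device is not part of a VRF)
 *   l3mdev_accept  models the tcp/udp_l3mdev_accept sysctls
 */
bool bound_dev_matches(bool l3mdev_accept, int bound_dev_if, int dif, int sdif)
{
	assert(dif > 0 && sdif >= 0 && bound_dev_if >= 0);

	if (!bound_dev_if)
		return !sdif || l3mdev_accept;
	return bound_dev_if == dif || bound_dev_if == sdif;
}
```

In this model, a lookup that always passes sdif = 0 treats every packet as if it arrived outside any VRF: unbound sockets match even for traffic on an enslaved device, and a socket bound to the VRF device is missed.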
---
v5: Use reverse xmas tree indentation
v4: - Move dev_sdif() to include/linux/netdevice.h as suggested by Stanislav Fomichev
- Remove SYS and SYS_NOFAIL duplicate definitions
v3: - Rename bpf_l2_sdif() to dev_sdif() as suggested by Stanislav Fomichev
- Added xdp tests as suggested by Daniel Borkmann
- Use start_server() to avoid duplicate code as suggested by Stanislav Fomichev
v2: Fixed uninitialized var in test patch (4).
Gilad Sever (4):
bpf: factor out socket lookup functions for the TC hookpoint.
bpf: Call __bpf_sk_lookup()/__bpf_skc_lookup() directly via TC
hookpoint
bpf: fix bpf socket lookup from tc/xdp to respect socket VRF bindings
selftests/bpf: Add vrf_socket_lookup tests
include/linux/netdevice.h | 9 +
net/core/filter.c | 123 +++++--
.../bpf/prog_tests/vrf_socket_lookup.c | 312 ++++++++++++++++++
.../selftests/bpf/progs/vrf_socket_lookup.c | 88 +++++
4 files changed, 511 insertions(+), 21 deletions(-)
create mode 100644 tools/testing/selftests/bpf/prog_tests/vrf_socket_lookup.c
create mode 100644 tools/testing/selftests/bpf/progs/vrf_socket_lookup.c
--
2.34.1
PTP_SYS_OFFSET_EXTENDED was added in November 2018 in
361800876f80 ("ptp: add PTP_SYS_OFFSET_EXTENDED ioctl")
and PTP_SYS_OFFSET_PRECISE was added in February 2016 in
719f1aa4a671 ("ptp: Add PTP_SYS_OFFSET_PRECISE for driver crosstimestamping")
The PTP selftest code is lacking support for these two IOCTLS.
This short series of patches adds support for them.
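For context, a minimal sketch of how userspace might call PTP_SYS_OFFSET_EXTENDED is shown below. Each sample returned by the ioctl is a triplet of system time before, PHC time, and system time after. The helper names and the midpoint-based offset estimate are illustrative assumptions and are not taken from testptp.c.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/ptp_clock.h>

static int64_t ts_to_ns(const struct ptp_clock_time *t)
{
	return (int64_t)t->sec * 1000000000LL + t->nsec;
}

/*
 * One extended sample is { system time before, PHC time, system time after }.
 * Estimate (PHC - system) by comparing the PHC reading with the midpoint of
 * the two system readings.
 */
int64_t sample_offset_ns(const struct ptp_clock_time s[3])
{
	int64_t before = ts_to_ns(&s[0]);
	int64_t after = ts_to_ns(&s[2]);

	return ts_to_ns(&s[1]) - (before + (after - before) / 2);
}

/* Take n_samples readings from an open /dev/ptpN descriptor; 0 on success. */
int read_extended(int ptp_fd, unsigned int n_samples,
		  struct ptp_sys_offset_extended *soe)
{
	assert(soe && n_samples > 0 && n_samples <= PTP_MAX_SAMPLES);

	memset(soe, 0, sizeof(*soe));
	soe->n_samples = n_samples;
	return ioctl(ptp_fd, PTP_SYS_OFFSET_EXTENDED, soe) ? -1 : 0;
}
```

After a successful read_extended() call, sample_offset_ns(soe->ts[i]) gives one offset estimate per sample.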
Alex Maftei (2):
selftests/ptp: Add -x option for testing PTP_SYS_OFFSET_EXTENDED
selftests/ptp: Add -X option for testing PTP_SYS_OFFSET_PRECISE
tools/testing/selftests/ptp/testptp.c | 71 ++++++++++++++++++++++++++-
1 file changed, 69 insertions(+), 2 deletions(-)
--
2.28.0
Currently the write operation returns the count of writes regardless of
whether events are enabled or disabled. Fix this by returning -ENOENT when
events are disabled.
v1 -> v2:
- Change the return value from -EFAULT to -ENOENT
sunliming (3):
tracing/user_events: Fix incorrect return value for writing operation
when events are disabled
selftests/user_events: Enable the event before write_fault test in
ftrace self-test
selftests/user_events: Add test cases when event is disabled
kernel/trace/trace_events_user.c | 3 ++-
tools/testing/selftests/user_events/ftrace_test.c | 8 ++++++++
2 files changed, 10 insertions(+), 1 deletion(-)
--
2.25.1
This patch-set implements 2 small extensions to the current F_OFD_GETLK,
allowing it to gather more information than it currently returns.
The first extension allows using F_UNLCK in a query, which currently returns
EINVAL. Instead it can be used to query the locks on a particular fd,
something that is not currently possible. The basic idea is that on
F_OFD_GETLK, F_UNLCK would "conflict" with (or query) any type of
lock on the same fd, and ignore any locks on other fds.
Use-cases:
1. A CRIU-like scenario where you want to read the locking info from an
fd for later reconstruction. This can now be done by setting
l_start and l_len to 0 to cover the entire file range and calling
F_OFD_GETLK. In a loop you need to advance l_start past the returned lock
ranges to eventually collect all locked ranges.
2. Implementing the lock checking/enforcing policy.
Say you want to implement an "auditor" module in your program,
that checks that the I/O is done only after the proper locking is
applied on a file region. In this case you need to know if the
particular region is locked on that fd, and if so - with what type
of lock. If you did that currently (without this extension)
then you could only check for the write locks, and for that you would need to
probe the lock on your fd and then open the same file via another fd and
probe there. That way you can identify the write lock on a particular
fd, but such a trick is non-atomic and complex. As for finding out the
read lock on a particular fd, that is impossible.
This extension allows such queries without any extra effort.
3. Implementing the mandatory locking policy.
Suppose you want to make a policy where the write lock inhibits any
unlocked readers and writers. Currently you need to check if the
write lock is present on some other fd, and if it is not there - allow
the I/O operation. But because the write lock can appear at any moment,
you need to do that under some global lock, which can be released only
when the I/O operation is finished.
With the proposed extension you can instead just check the write lock
on your own fd first, and if it is there - allow the I/O operation on
that fd without using any global lock. Only if there is no write lock
on this fd, then you need to take global lock and check for a write
lock on other fds.
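The loop from use-case 1 could look roughly like the sketch below. It relies on the F_UNLCK query proposed in this series, so on a kernel without the extension the fcntl() call is expected to fail with EINVAL. struct lock_range and collect_ofd_locks() are hypothetical names used only for illustration.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

#ifndef F_OFD_GETLK
#define F_OFD_GETLK 36	/* Linux value, in case the libc headers hide it */
#endif

struct lock_range {
	off_t start;
	off_t len;	/* 0 means "up to the end of the file" */
	short type;	/* F_RDLCK or F_WRLCK */
};

/*
 * Collect the OFD locks held through fd using the proposed F_UNLCK query.
 * Returns the number of ranges stored in out, or -1 on error.
 */
int collect_ofd_locks(int fd, struct lock_range *out, int max)
{
	off_t start = 0;
	int n = 0;

	assert(out && max > 0);
	while (n < max) {
		struct flock fl;

		/* Reset every iteration: l_pid must be 0 for OFD commands. */
		memset(&fl, 0, sizeof(fl));
		fl.l_type = F_UNLCK;	/* "report any lock on this fd" */
		fl.l_whence = SEEK_SET;
		fl.l_start = start;
		fl.l_len = 0;		/* from start to the end of the file */

		if (fcntl(fd, F_OFD_GETLK, &fl) < 0)
			return -1;
		if (fl.l_type == F_UNLCK)
			break;		/* no lock left at or after start */

		out[n].start = fl.l_start;
		out[n].len = fl.l_len;
		out[n].type = fl.l_type;
		n++;

		if (fl.l_len == 0)
			break;		/* this lock already runs to EOF */
		start = fl.l_start + fl.l_len;
	}
	return n;
}
```

Clearing the whole struct on each pass matters if the second patch is applied, since the kernel may then write the owner's pid into l_pid.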
The second patch implements another extension.
Currently F_OFD_GETLK returns -1 in the l_pid member.
This patch removes the code that writes -1 there, so that the proper
pid is returned. I am not sure why it was decided to deliberately hide
the owner's pid. It may be needed in case you want to send some
signal to the offending locker, e.g. SIGKILL.
The third patch adds a test-case for OFD locks.
It tests both the generic things and the proposed extensions.
Stas Sergeev (3):
fs/locks: F_UNLCK extension for F_OFD_GETLK
fd/locks: allow get the lock owner by F_OFD_GETLK
selftests: add OFD lock tests
fs/locks.c | 25 +++-
tools/testing/selftests/locking/Makefile | 2 +
tools/testing/selftests/locking/ofdlocks.c | 135 +++++++++++++++++++++
3 files changed, 157 insertions(+), 5 deletions(-)
create mode 100644 tools/testing/selftests/locking/ofdlocks.c
CC: Jeff Layton <jlayton(a)kernel.org>
CC: Chuck Lever <chuck.lever(a)oracle.com>
CC: Alexander Viro <viro(a)zeniv.linux.org.uk>
CC: Christian Brauner <brauner(a)kernel.org>
CC: linux-fsdevel(a)vger.kernel.org
CC: linux-kernel(a)vger.kernel.org
CC: Shuah Khan <shuah(a)kernel.org>
CC: linux-kselftest(a)vger.kernel.org
--
2.39.2
This adds Intel VT-d nested translation based on the IOMMUFD nesting
infrastructure. As described in the iommufd nesting infrastructure series [1],
the iommu core supports new ops to report iommu hardware information, allocate
domains with user data and sync the stage-1 IOTLB. The data required in
the three paths is vendor-specific, so
1) IOMMU_HW_INFO_TYPE_INTEL_VTD and struct iommu_device_info_vtd are
defined to report iommu hardware information for Intel VT-d.
2) IOMMU_HWPT_DATA_VTD_S1 is defined for the Intel VT-d stage-1 page
table, it will be used in the stage-1 domain allocation and IOTLB
syncing path. struct iommu_hwpt_intel_vtd is defined to pass user_data
for the Intel VT-d stage-1 domain allocation.
struct iommu_hwpt_invalidate_intel_vtd is defined to pass the data for
the Intel VT-d stage-1 IOTLB invalidation.
With the above IOMMUFD extensions, the intel iommu driver implements the three
paths to support nested translation.
The first Intel platform supporting nested translation is Sapphire
Rapids which, unfortunately, has a hardware errata [2] requiring special
treatment. This errata happens when a stage-1 page table page (either
level) is located in a stage-2 read-only region. In that case the IOMMU
hardware may ignore the stage-2 RO permission and still set the A/D bit
in stage-1 page table entries during page table walking.
A flag IOMMU_HW_INFO_VTD_ERRATA_772415_SPR17 is introduced to report
this errata to userspace. With that restriction the user should either
disable nested translation to favor RO stage-2 mappings or ensure no
RO stage-2 mapping to enable nested translation.
The intel-iommu driver is armed with the necessary checks to prevent such a
mix in patch 10 of this series.
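The mutual exclusion between read-only stage-2 mappings and nested translation can be modelled in a few lines. This is a hypothetical sketch of the policy only: struct s2_domain_sketch, s2_map() and s2_nest() are made-up names and do not reflect the driver's actual data structures or error codes.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical model of a stage-2 domain; not the driver's data structure. */
struct s2_domain_sketch {
	bool errata_772415;	/* hardware reports the Sapphire Rapids errata */
	bool has_ro_mappings;	/* at least one read-only stage-2 mapping */
	bool has_nested_users;	/* a stage-1 domain is nested on top */
};

/* Adding a mapping: read-only is refused once nesting is in use. */
int s2_map(struct s2_domain_sketch *d, bool read_only)
{
	assert(d);
	if (read_only && d->errata_772415 && d->has_nested_users)
		return -EINVAL;
	if (read_only)
		d->has_ro_mappings = true;
	return 0;
}

/* Nesting a stage-1 domain: refused if read-only mappings already exist. */
int s2_nest(struct s2_domain_sketch *d)
{
	assert(d);
	if (d->errata_772415 && d->has_ro_mappings)
		return -EINVAL;
	d->has_nested_users = true;
	return 0;
}
```

Whichever of the two is set up first wins, and the other is rejected on affected hardware; on hardware without the errata both are allowed in this model.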
Qemu currently does add RO mappings though. The vfio agent in Qemu
simply maps all valid regions in the GPA address space which certainly
includes RO regions e.g. vbios.
In reality we don't know of any usage relying on DMA reads from the BIOS
region. Hence finding a way to let the user opt out of RO mappings in
Qemu might be an acceptable tradeoff. But how to achieve it cleanly
needs more discussion in the Qemu community. For now we just hacked Qemu
to test.
Complete code can be found in [3], QEMU code can be found in [4].
base-commit: ce9b593b1f74ccd090edc5d2ad397da84baa9946
[1] https://lore.kernel.org/linux-iommu/20230511143844.22693-1-yi.l.liu@intel.c…
[2] https://www.intel.com/content/www/us/en/content-details/772415/content-deta…
[3] https://github.com/yiliu1765/iommufd/tree/iommufd_nesting
[4] https://github.com/yiliu1765/qemu/tree/wip/iommufd_rfcv4.mig.reset.v4_var3%…
Change log:
v3:
- Further split the patches into an order of adding helpers for nested
domain, iotlb flush, nested domain attachment and nested domain allocation
callback, then report the hw_info to userspace.
- Add batch support in cache invalidation from userspace
- Disallow nested translation usage if RO mappings exist in the stage-2 domain
due to the errata on read-only mappings on the Sapphire Rapids platform.
v2: https://lore.kernel.org/linux-iommu/20230309082207.612346-1-yi.l.liu@intel.…
- The iommufd infrastructure is split to be separate series.
v1: https://lore.kernel.org/linux-iommu/20230209043153.14964-1-yi.l.liu@intel.c…
Regards,
Yi Liu
Lu Baolu (5):
iommu/vt-d: Extend dmar_domain to support nested domain
iommu/vt-d: Add helper for nested domain allocation
iommu/vt-d: Add helper to setup pasid nested translation
iommu/vt-d: Add nested domain allocation
iommu/vt-d: Disallow nesting on domains with read-only mappings
Yi Liu (5):
iommufd: Add data structure for Intel VT-d stage-1 domain allocation
iommu/vt-d: Make domain attach helpers to be extern
iommu/vt-d: Set the nested domain to a device
iommu/vt-d: Add iotlb flush for nested domain
iommu/vt-d: Implement hw_info for iommu capability query
drivers/iommu/intel/Makefile | 2 +-
drivers/iommu/intel/iommu.c | 78 ++++++++++++---
drivers/iommu/intel/iommu.h | 55 +++++++++--
drivers/iommu/intel/nested.c | 181 +++++++++++++++++++++++++++++++++++
drivers/iommu/intel/pasid.c | 151 +++++++++++++++++++++++++++++
drivers/iommu/intel/pasid.h | 2 +
drivers/iommu/iommufd/main.c | 6 ++
include/linux/iommu.h | 1 +
include/uapi/linux/iommufd.h | 149 ++++++++++++++++++++++++++++
9 files changed, 603 insertions(+), 22 deletions(-)
create mode 100644 drivers/iommu/intel/nested.c
--
2.34.1
To cover this case, 'maxlen' is set to 0, for the following
reason:
EVIOCGKEY is executed from evdev_do_ioctl(), which is called from
evdev_ioctl_handler().
evdev_ioctl_handler() is called from 2 functions, of which, according to
code coverage, only the first one is in use.
‘compat’ is given the value ‘0’ [1].
Thus, the condition [2] is always false.
This means ‘len’ always equals a positive number [3].
‘maxlen’ in evdev_handle_get_val [4] is defined locally in
evdev_do_ioctl() [5], and is passed in the variable 'size' [6].
[1] https://elixir.bootlin.com/linux/v6.2/source/drivers/input/evdev.c#L1281
[2] https://elixir.bootlin.com/linux/v6.2/source/drivers/input/evdev.c#L705
[3] https://elixir.bootlin.com/linux/v6.2/source/drivers/input/evdev.c#L707
[4] https://elixir.bootlin.com/linux/v6.2/source/drivers/input/evdev.c#L886
[5] https://elixir.bootlin.com/linux/v6.2/source/drivers/input/evdev.c#L1155
[6] https://elixir.bootlin.com/linux/v6.2/source/drivers/input/evdev.c#L1141
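The effect of 'maxlen = 0' can be checked with a small userspace model of the length computation, as I read it from the links above for the non-compat case: the bitmap size is rounded up to whole longs and then clamped to maxlen. The name copy_len() is illustrative only.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

#define BITS_PER_LONG_SKETCH	(sizeof(long) * CHAR_BIT)

/*
 * Model of the copy length for the non-compat case: the bitmap size in
 * bytes, rounded up to whole longs, clamped to the size encoded in the
 * ioctl number (maxlen).
 */
size_t copy_len(size_t maxbit, size_t maxlen)
{
	size_t longs, len;

	assert(maxbit > 0);
	longs = (maxbit + BITS_PER_LONG_SKETCH - 1) / BITS_PER_LONG_SKETCH;
	len = longs * sizeof(long);

	return len > maxlen ? maxlen : len;
}
```

With EVIOCGKEY(0) the encoded size is 0, so the clamp applies and zero bytes are copied, which matches the test expecting the ioctl to return 0.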
Signed-off-by: Dana Elfassy <dangel101(a)gmail.com>
---
tools/testing/selftests/input/evioc-test.c | 19 +++++++++++++++++++
1 file changed, 19 insertions(+)
diff --git a/tools/testing/selftests/input/evioc-test.c b/tools/testing/selftests/input/evioc-test.c
index ad7b93fe39cf..b94de2ee5596 100644
--- a/tools/testing/selftests/input/evioc-test.c
+++ b/tools/testing/selftests/input/evioc-test.c
@@ -234,4 +234,23 @@ TEST(eviocsrep_set_repeat_settings)
selftest_uinput_destroy(uidev);
}
+TEST(eviocgkey_get_global_key_state)
+{
+ struct selftest_uinput *uidev;
+ int rep_values[2];
+ int rc;
+
+ memset(rep_values, 0, sizeof(rep_values));
+
+ rc = selftest_uinput_create_device(&uidev);
+ ASSERT_EQ(0, rc);
+ ASSERT_NE(NULL, uidev);
+
+ /* ioctl to create the scenario where len > maxlen in bits_to_user() */
+ rc = ioctl(uidev->evdev_fd, EVIOCGKEY(0), rep_values);
+ ASSERT_EQ(0, rc);
+
+ selftest_uinput_destroy(uidev);
+}
+
TEST_HARNESS_MAIN
--
2.41.0
From: Mirsad Goran Todorovac <mirsad.todorovac(a)alu.unizg.hr>
[ Upstream commit 4acfe3dfde685a5a9eaec5555351918e2d7266a1 ]
Dan Carpenter spotted a race condition in a couple of situations like
these in the test_firmware driver:
static int test_dev_config_update_u8(const char *buf, size_t size, u8 *cfg)
{
u8 val;
int ret;
ret = kstrtou8(buf, 10, &val);
if (ret)
return ret;
mutex_lock(&test_fw_mutex);
*(u8 *)cfg = val;
mutex_unlock(&test_fw_mutex);
/* Always return full write size even if we didn't consume all */
return size;
}
static ssize_t config_num_requests_store(struct device *dev,
struct device_attribute *attr,
const char *buf, size_t count)
{
int rc;
mutex_lock(&test_fw_mutex);
if (test_fw_config->reqs) {
pr_err("Must call release_all_firmware prior to changing config\n");
rc = -EINVAL;
mutex_unlock(&test_fw_mutex);
goto out;
}
mutex_unlock(&test_fw_mutex);
rc = test_dev_config_update_u8(buf, count,
&test_fw_config->num_requests);
out:
return rc;
}
static ssize_t config_read_fw_idx_store(struct device *dev,
struct device_attribute *attr,
const char *buf, size_t count)
{
return test_dev_config_update_u8(buf, count,
&test_fw_config->read_fw_idx);
}
The function test_dev_config_update_u8() is called from both the locked
and the unlocked context, function config_num_requests_store() and
config_read_fw_idx_store() which can both be called asynchronously as
they are driver's methods, while test_dev_config_update_u8() and siblings
change their argument pointed to by u8 *cfg or similar pointer.
To avoid deadlock on test_fw_mutex, the lock is dropped before calling
test_dev_config_update_u8() and re-acquired within test_dev_config_update_u8()
itself, but alas this creates a race condition.
Having two locks wouldn't assure a race-proof mutual exclusion.
This situation is best avoided by the introduction of a new, unlocked
function __test_dev_config_update_u8() which can be called from the locked
context and reducing test_dev_config_update_u8() to:
static int test_dev_config_update_u8(const char *buf, size_t size, u8 *cfg)
{
int ret;
mutex_lock(&test_fw_mutex);
ret = __test_dev_config_update_u8(buf, size, cfg);
mutex_unlock(&test_fw_mutex);
return ret;
}
doing the locking and calling the unlocked primitive, which enables both
locked and unlocked versions without duplication of code.
A similar approach was applied to all functions called from the locked
and the unlocked context, which safely mitigates both deadlocks and race
conditions in the driver.
The unlocked versions __test_dev_config_update_bool(),
__test_dev_config_update_u8() and __test_dev_config_update_size_t()
were introduced to be called from the locked contexts as a workaround,
without releasing the main driver's lock and thereby causing a race
condition.
The test_dev_config_update_bool(), test_dev_config_update_u8() and
test_dev_config_update_size_t() locked versions of the functions
are being called from driver methods without the unnecessary multiplying
of the locking and unlocking code for each method, and complicating
the code with saving of the return value across lock.
Fixes: 7feebfa487b92 ("test_firmware: add support for request_firmware_into_buf")
Cc: Luis Chamberlain <mcgrof(a)kernel.org>
Cc: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
Cc: Russ Weight <russell.h.weight(a)intel.com>
Cc: Takashi Iwai <tiwai(a)suse.de>
Cc: Tianfei Zhang <tianfei.zhang(a)intel.com>
Cc: Shuah Khan <shuah(a)kernel.org>
Cc: Colin Ian King <colin.i.king(a)gmail.com>
Cc: Randy Dunlap <rdunlap(a)infradead.org>
Cc: linux-kselftest(a)vger.kernel.org
Cc: stable(a)vger.kernel.org # v5.4
Suggested-by: Dan Carpenter <error27(a)gmail.com>
Signed-off-by: Mirsad Goran Todorovac <mirsad.todorovac(a)alu.unizg.hr>
Link: https://lore.kernel.org/r/20230509084746.48259-1-mirsad.todorovac@alu.unizg…
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
---
lib/test_firmware.c | 52 ++++++++++++++++++++++++++++++---------------
1 file changed, 35 insertions(+), 17 deletions(-)
diff --git a/lib/test_firmware.c b/lib/test_firmware.c
index b99cf0a50a698..4884057eb53f0 100644
--- a/lib/test_firmware.c
+++ b/lib/test_firmware.c
@@ -321,16 +321,26 @@ static ssize_t config_test_show_str(char *dst,
return len;
}
-static int test_dev_config_update_bool(const char *buf, size_t size,
+static inline int __test_dev_config_update_bool(const char *buf, size_t size,
bool *cfg)
{
int ret;
- mutex_lock(&test_fw_mutex);
if (kstrtobool(buf, cfg) < 0)
ret = -EINVAL;
else
ret = size;
+
+ return ret;
+}
+
+static int test_dev_config_update_bool(const char *buf, size_t size,
+ bool *cfg)
+{
+ int ret;
+
+ mutex_lock(&test_fw_mutex);
+ ret = __test_dev_config_update_bool(buf, size, cfg);
mutex_unlock(&test_fw_mutex);
return ret;
@@ -341,7 +351,8 @@ static ssize_t test_dev_config_show_bool(char *buf, bool val)
return snprintf(buf, PAGE_SIZE, "%d\n", val);
}
-static int test_dev_config_update_size_t(const char *buf,
+static int __test_dev_config_update_size_t(
+ const char *buf,
size_t size,
size_t *cfg)
{
@@ -352,9 +363,7 @@ static int test_dev_config_update_size_t(const char *buf,
if (ret)
return ret;
- mutex_lock(&test_fw_mutex);
*(size_t *)cfg = new;
- mutex_unlock(&test_fw_mutex);
/* Always return full write size even if we didn't consume all */
return size;
@@ -370,7 +379,7 @@ static ssize_t test_dev_config_show_int(char *buf, int val)
return snprintf(buf, PAGE_SIZE, "%d\n", val);
}
-static int test_dev_config_update_u8(const char *buf, size_t size, u8 *cfg)
+static int __test_dev_config_update_u8(const char *buf, size_t size, u8 *cfg)
{
u8 val;
int ret;
@@ -379,14 +388,23 @@ static int test_dev_config_update_u8(const char *buf, size_t size, u8 *cfg)
if (ret)
return ret;
- mutex_lock(&test_fw_mutex);
*(u8 *)cfg = val;
- mutex_unlock(&test_fw_mutex);
/* Always return full write size even if we didn't consume all */
return size;
}
+static int test_dev_config_update_u8(const char *buf, size_t size, u8 *cfg)
+{
+ int ret;
+
+ mutex_lock(&test_fw_mutex);
+ ret = __test_dev_config_update_u8(buf, size, cfg);
+ mutex_unlock(&test_fw_mutex);
+
+ return ret;
+}
+
static ssize_t test_dev_config_show_u8(char *buf, u8 val)
{
return snprintf(buf, PAGE_SIZE, "%u\n", val);
@@ -413,10 +431,10 @@ static ssize_t config_num_requests_store(struct device *dev,
mutex_unlock(&test_fw_mutex);
goto out;
}
- mutex_unlock(&test_fw_mutex);
- rc = test_dev_config_update_u8(buf, count,
- &test_fw_config->num_requests);
+ rc = __test_dev_config_update_u8(buf, count,
+ &test_fw_config->num_requests);
+ mutex_unlock(&test_fw_mutex);
out:
return rc;
@@ -460,10 +478,10 @@ static ssize_t config_buf_size_store(struct device *dev,
mutex_unlock(&test_fw_mutex);
goto out;
}
- mutex_unlock(&test_fw_mutex);
- rc = test_dev_config_update_size_t(buf, count,
- &test_fw_config->buf_size);
+ rc = __test_dev_config_update_size_t(buf, count,
+ &test_fw_config->buf_size);
+ mutex_unlock(&test_fw_mutex);
out:
return rc;
@@ -490,10 +508,10 @@ static ssize_t config_file_offset_store(struct device *dev,
mutex_unlock(&test_fw_mutex);
goto out;
}
- mutex_unlock(&test_fw_mutex);
- rc = test_dev_config_update_size_t(buf, count,
- &test_fw_config->file_offset);
+ rc = __test_dev_config_update_size_t(buf, count,
+ &test_fw_config->file_offset);
+ mutex_unlock(&test_fw_mutex);
out:
return rc;
--
2.39.2
From: Mirsad Goran Todorovac <mirsad.todorovac(a)alu.unizg.hr>
[ Upstream commit 4acfe3dfde685a5a9eaec5555351918e2d7266a1 ]
Dan Carpenter spotted a race condition in a couple of situations like
these in the test_firmware driver:
static int test_dev_config_update_u8(const char *buf, size_t size, u8 *cfg)
{
u8 val;
int ret;
ret = kstrtou8(buf, 10, &val);
if (ret)
return ret;
mutex_lock(&test_fw_mutex);
*(u8 *)cfg = val;
mutex_unlock(&test_fw_mutex);
/* Always return full write size even if we didn't consume all */
return size;
}
static ssize_t config_num_requests_store(struct device *dev,
struct device_attribute *attr,
const char *buf, size_t count)
{
int rc;
mutex_lock(&test_fw_mutex);
if (test_fw_config->reqs) {
pr_err("Must call release_all_firmware prior to changing config\n");
rc = -EINVAL;
mutex_unlock(&test_fw_mutex);
goto out;
}
mutex_unlock(&test_fw_mutex);
rc = test_dev_config_update_u8(buf, count,
&test_fw_config->num_requests);
out:
return rc;
}
static ssize_t config_read_fw_idx_store(struct device *dev,
struct device_attribute *attr,
const char *buf, size_t count)
{
return test_dev_config_update_u8(buf, count,
&test_fw_config->read_fw_idx);
}
The function test_dev_config_update_u8() is reached from two kinds of
callers: config_num_requests_store() takes test_fw_mutex before deciding
to call it, while config_read_fw_idx_store() calls it without taking the
lock. Both are driver methods and can be called asynchronously, and
test_dev_config_update_u8() and its siblings modify the object pointed to
by u8 *cfg or a similar pointer.
To avoid deadlock on test_fw_mutex, the lock is dropped before calling
test_dev_config_update_u8() and re-acquired within test_dev_config_update_u8()
itself, but this leaves a window between the check and the update, which
creates a race condition.
Having two locks wouldn't assure a race-proof mutual exclusion.
This situation is best avoided by the introduction of a new, unlocked
function __test_dev_config_update_u8() which can be called from the locked
context and reducing test_dev_config_update_u8() to:
static int test_dev_config_update_u8(const char *buf, size_t size, u8 *cfg)
{
int ret;
mutex_lock(&test_fw_mutex);
ret = __test_dev_config_update_u8(buf, size, cfg);
mutex_unlock(&test_fw_mutex);
return ret;
}
doing the locking and calling the unlocked primitive, which enables both
locked and unlocked versions without duplication of code.
A similar approach was applied to all functions called from both the locked
and the unlocked context, which safely mitigates both deadlocks and race
conditions in the driver.
__test_dev_config_update_bool(), __test_dev_config_update_u8() and
__test_dev_config_update_size_t() unlocked versions of the functions
were introduced to be called from the locked contexts as a workaround
without releasing the main driver's lock and thereby causing a race
condition.
The test_dev_config_update_bool(), test_dev_config_update_u8() and
test_dev_config_update_size_t() locked versions of the functions
are being called from driver methods without the unnecessary multiplying
of the locking and unlocking code for each method, and complicating
the code with saving of the return value across lock.
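The locked/unlocked split described above can be sketched in userspace C.
This is only an illustration under stated assumptions, not the driver code:
a pthread mutex stands in for test_fw_mutex, the struct and all function
names are hypothetical, cfg_update_u8_nolock() plays the role of
__test_dev_config_update_u8(), and the busy flag stands in for the
test_fw_config->reqs check.

```c
#include <errno.h>
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical stand-in for test_fw_mutex. */
static pthread_mutex_t cfg_mutex = PTHREAD_MUTEX_INITIALIZER;

/* Hypothetical, much reduced stand-in for the driver configuration. */
struct cfg {
	unsigned char num_requests;
	unsigned char read_idx;
	int busy;	/* stands in for test_fw_config->reqs being set */
};

static struct cfg cfg;

/* Unlocked primitive: the caller must already hold cfg_mutex. */
static long cfg_update_u8_nolock(const char *buf, long size, unsigned char *dst)
{
	char *end;
	unsigned long val;

	errno = 0;
	val = strtoul(buf, &end, 10);
	if (errno || end == buf || val > 255)
		return -EINVAL;

	*dst = (unsigned char)val;
	/* Report the full write size, as the driver helpers do. */
	return size;
}

/* Locked wrapper for callers that do not hold the mutex. */
static long cfg_update_u8(const char *buf, long size, unsigned char *dst)
{
	long ret;

	pthread_mutex_lock(&cfg_mutex);
	ret = cfg_update_u8_nolock(buf, size, dst);
	pthread_mutex_unlock(&cfg_mutex);

	return ret;
}

/* Check and update happen under one lock hold, so no window in between. */
static long num_requests_store(const char *buf, long size)
{
	long ret;

	pthread_mutex_lock(&cfg_mutex);
	if (cfg.busy)
		ret = -EINVAL;
	else
		ret = cfg_update_u8_nolock(buf, size, &cfg.num_requests);
	pthread_mutex_unlock(&cfg_mutex);

	return ret;
}
```

The point of the sketch is that the store method never releases the mutex
between testing the busy condition and writing the value, while simple
callers keep using the locked wrapper.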
Fixes: 7feebfa487b92 ("test_firmware: add support for request_firmware_into_buf")
Cc: Luis Chamberlain <mcgrof(a)kernel.org>
Cc: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
Cc: Russ Weight <russell.h.weight(a)intel.com>
Cc: Takashi Iwai <tiwai(a)suse.de>
Cc: Tianfei Zhang <tianfei.zhang(a)intel.com>
Cc: Shuah Khan <shuah(a)kernel.org>
Cc: Colin Ian King <colin.i.king(a)gmail.com>
Cc: Randy Dunlap <rdunlap(a)infradead.org>
Cc: linux-kselftest(a)vger.kernel.org
Cc: stable(a)vger.kernel.org # v5.4
Suggested-by: Dan Carpenter <error27(a)gmail.com>
Signed-off-by: Mirsad Goran Todorovac <mirsad.todorovac(a)alu.unizg.hr>
Link: https://lore.kernel.org/r/20230509084746.48259-1-mirsad.todorovac@alu.unizg…
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
---
lib/test_firmware.c | 52 ++++++++++++++++++++++++++++++---------------
1 file changed, 35 insertions(+), 17 deletions(-)
diff --git a/lib/test_firmware.c b/lib/test_firmware.c
index 0b4e3de3f1748..4ad01dbe7e729 100644
--- a/lib/test_firmware.c
+++ b/lib/test_firmware.c
@@ -321,16 +321,26 @@ static ssize_t config_test_show_str(char *dst,
return len;
}
-static int test_dev_config_update_bool(const char *buf, size_t size,
+static inline int __test_dev_config_update_bool(const char *buf, size_t size,
bool *cfg)
{
int ret;
- mutex_lock(&test_fw_mutex);
if (kstrtobool(buf, cfg) < 0)
ret = -EINVAL;
else
ret = size;
+
+ return ret;
+}
+
+static int test_dev_config_update_bool(const char *buf, size_t size,
+ bool *cfg)
+{
+ int ret;
+
+ mutex_lock(&test_fw_mutex);
+ ret = __test_dev_config_update_bool(buf, size, cfg);
mutex_unlock(&test_fw_mutex);
return ret;
@@ -341,7 +351,8 @@ static ssize_t test_dev_config_show_bool(char *buf, bool val)
return snprintf(buf, PAGE_SIZE, "%d\n", val);
}
-static int test_dev_config_update_size_t(const char *buf,
+static int __test_dev_config_update_size_t(
+ const char *buf,
size_t size,
size_t *cfg)
{
@@ -352,9 +363,7 @@ static int test_dev_config_update_size_t(const char *buf,
if (ret)
return ret;
- mutex_lock(&test_fw_mutex);
*(size_t *)cfg = new;
- mutex_unlock(&test_fw_mutex);
/* Always return full write size even if we didn't consume all */
return size;
@@ -370,7 +379,7 @@ static ssize_t test_dev_config_show_int(char *buf, int val)
return snprintf(buf, PAGE_SIZE, "%d\n", val);
}
-static int test_dev_config_update_u8(const char *buf, size_t size, u8 *cfg)
+static int __test_dev_config_update_u8(const char *buf, size_t size, u8 *cfg)
{
u8 val;
int ret;
@@ -379,14 +388,23 @@ static int test_dev_config_update_u8(const char *buf, size_t size, u8 *cfg)
if (ret)
return ret;
- mutex_lock(&test_fw_mutex);
*(u8 *)cfg = val;
- mutex_unlock(&test_fw_mutex);
/* Always return full write size even if we didn't consume all */
return size;
}
+static int test_dev_config_update_u8(const char *buf, size_t size, u8 *cfg)
+{
+ int ret;
+
+ mutex_lock(&test_fw_mutex);
+ ret = __test_dev_config_update_u8(buf, size, cfg);
+ mutex_unlock(&test_fw_mutex);
+
+ return ret;
+}
+
static ssize_t test_dev_config_show_u8(char *buf, u8 val)
{
return snprintf(buf, PAGE_SIZE, "%u\n", val);
@@ -413,10 +431,10 @@ static ssize_t config_num_requests_store(struct device *dev,
mutex_unlock(&test_fw_mutex);
goto out;
}
- mutex_unlock(&test_fw_mutex);
- rc = test_dev_config_update_u8(buf, count,
- &test_fw_config->num_requests);
+ rc = __test_dev_config_update_u8(buf, count,
+ &test_fw_config->num_requests);
+ mutex_unlock(&test_fw_mutex);
out:
return rc;
@@ -460,10 +478,10 @@ static ssize_t config_buf_size_store(struct device *dev,
mutex_unlock(&test_fw_mutex);
goto out;
}
- mutex_unlock(&test_fw_mutex);
- rc = test_dev_config_update_size_t(buf, count,
- &test_fw_config->buf_size);
+ rc = __test_dev_config_update_size_t(buf, count,
+ &test_fw_config->buf_size);
+ mutex_unlock(&test_fw_mutex);
out:
return rc;
@@ -490,10 +508,10 @@ static ssize_t config_file_offset_store(struct device *dev,
mutex_unlock(&test_fw_mutex);
goto out;
}
- mutex_unlock(&test_fw_mutex);
- rc = test_dev_config_update_size_t(buf, count,
- &test_fw_config->file_offset);
+ rc = __test_dev_config_update_size_t(buf, count,
+ &test_fw_config->file_offset);
+ mutex_unlock(&test_fw_mutex);
out:
return rc;
--
2.39.2
From: Mirsad Goran Todorovac <mirsad.todorovac(a)alu.unizg.hr>
[ Upstream commit 4acfe3dfde685a5a9eaec5555351918e2d7266a1 ]
Dan Carpenter spotted a race condition in a couple of situations like
these in the test_firmware driver:
static int test_dev_config_update_u8(const char *buf, size_t size, u8 *cfg)
{
u8 val;
int ret;
ret = kstrtou8(buf, 10, &val);
if (ret)
return ret;
mutex_lock(&test_fw_mutex);
*(u8 *)cfg = val;
mutex_unlock(&test_fw_mutex);
/* Always return full write size even if we didn't consume all */
return size;
}
static ssize_t config_num_requests_store(struct device *dev,
struct device_attribute *attr,
const char *buf, size_t count)
{
int rc;
mutex_lock(&test_fw_mutex);
if (test_fw_config->reqs) {
pr_err("Must call release_all_firmware prior to changing config\n");
rc = -EINVAL;
mutex_unlock(&test_fw_mutex);
goto out;
}
mutex_unlock(&test_fw_mutex);
rc = test_dev_config_update_u8(buf, count,
&test_fw_config->num_requests);
out:
return rc;
}
static ssize_t config_read_fw_idx_store(struct device *dev,
struct device_attribute *attr,
const char *buf, size_t count)
{
return test_dev_config_update_u8(buf, count,
&test_fw_config->read_fw_idx);
}
The function test_dev_config_update_u8() is reached from two kinds of
callers: config_num_requests_store() takes test_fw_mutex before deciding
to call it, while config_read_fw_idx_store() calls it without taking the
lock. Both are driver methods and can be called asynchronously, and
test_dev_config_update_u8() and its siblings modify the object pointed to
by u8 *cfg or a similar pointer.
To avoid deadlock on test_fw_mutex, the lock is dropped before calling
test_dev_config_update_u8() and re-acquired within test_dev_config_update_u8()
itself, but this leaves a window between the check and the update, which
creates a race condition.
Having two locks wouldn't assure a race-proof mutual exclusion.
This situation is best avoided by the introduction of a new, unlocked
function __test_dev_config_update_u8() which can be called from the locked
context and reducing test_dev_config_update_u8() to:
static int test_dev_config_update_u8(const char *buf, size_t size, u8 *cfg)
{
int ret;
mutex_lock(&test_fw_mutex);
ret = __test_dev_config_update_u8(buf, size, cfg);
mutex_unlock(&test_fw_mutex);
return ret;
}
doing the locking and calling the unlocked primitive, which enables both
locked and unlocked versions without duplication of code.
A similar approach was applied to all functions called from both the locked
and the unlocked context, which safely mitigates both deadlocks and race
conditions in the driver.
__test_dev_config_update_bool(), __test_dev_config_update_u8() and
__test_dev_config_update_size_t() unlocked versions of the functions
were introduced to be called from the locked contexts as a workaround
without releasing the main driver's lock and thereby causing a race
condition.
The test_dev_config_update_bool(), test_dev_config_update_u8() and
test_dev_config_update_size_t() locked versions of the functions
are being called from driver methods without the unnecessary multiplying
of the locking and unlocking code for each method, and complicating
the code with saving of the return value across lock.
Fixes: 7feebfa487b92 ("test_firmware: add support for request_firmware_into_buf")
Cc: Luis Chamberlain <mcgrof(a)kernel.org>
Cc: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
Cc: Russ Weight <russell.h.weight(a)intel.com>
Cc: Takashi Iwai <tiwai(a)suse.de>
Cc: Tianfei Zhang <tianfei.zhang(a)intel.com>
Cc: Shuah Khan <shuah(a)kernel.org>
Cc: Colin Ian King <colin.i.king(a)gmail.com>
Cc: Randy Dunlap <rdunlap(a)infradead.org>
Cc: linux-kselftest(a)vger.kernel.org
Cc: stable(a)vger.kernel.org # v5.4
Suggested-by: Dan Carpenter <error27(a)gmail.com>
Signed-off-by: Mirsad Goran Todorovac <mirsad.todorovac(a)alu.unizg.hr>
Link: https://lore.kernel.org/r/20230509084746.48259-1-mirsad.todorovac@alu.unizg…
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
---
lib/test_firmware.c | 52 ++++++++++++++++++++++++++++++---------------
1 file changed, 35 insertions(+), 17 deletions(-)
diff --git a/lib/test_firmware.c b/lib/test_firmware.c
index 6ef3e6926da8a..13d3fa6aa972c 100644
--- a/lib/test_firmware.c
+++ b/lib/test_firmware.c
@@ -360,16 +360,26 @@ static ssize_t config_test_show_str(char *dst,
return len;
}
-static int test_dev_config_update_bool(const char *buf, size_t size,
+static inline int __test_dev_config_update_bool(const char *buf, size_t size,
bool *cfg)
{
int ret;
- mutex_lock(&test_fw_mutex);
if (kstrtobool(buf, cfg) < 0)
ret = -EINVAL;
else
ret = size;
+
+ return ret;
+}
+
+static int test_dev_config_update_bool(const char *buf, size_t size,
+ bool *cfg)
+{
+ int ret;
+
+ mutex_lock(&test_fw_mutex);
+ ret = __test_dev_config_update_bool(buf, size, cfg);
mutex_unlock(&test_fw_mutex);
return ret;
@@ -380,7 +390,8 @@ static ssize_t test_dev_config_show_bool(char *buf, bool val)
return snprintf(buf, PAGE_SIZE, "%d\n", val);
}
-static int test_dev_config_update_size_t(const char *buf,
+static int __test_dev_config_update_size_t(
+ const char *buf,
size_t size,
size_t *cfg)
{
@@ -391,9 +402,7 @@ static int test_dev_config_update_size_t(const char *buf,
if (ret)
return ret;
- mutex_lock(&test_fw_mutex);
*(size_t *)cfg = new;
- mutex_unlock(&test_fw_mutex);
/* Always return full write size even if we didn't consume all */
return size;
@@ -409,7 +418,7 @@ static ssize_t test_dev_config_show_int(char *buf, int val)
return snprintf(buf, PAGE_SIZE, "%d\n", val);
}
-static int test_dev_config_update_u8(const char *buf, size_t size, u8 *cfg)
+static int __test_dev_config_update_u8(const char *buf, size_t size, u8 *cfg)
{
u8 val;
int ret;
@@ -418,14 +427,23 @@ static int test_dev_config_update_u8(const char *buf, size_t size, u8 *cfg)
if (ret)
return ret;
- mutex_lock(&test_fw_mutex);
*(u8 *)cfg = val;
- mutex_unlock(&test_fw_mutex);
/* Always return full write size even if we didn't consume all */
return size;
}
+static int test_dev_config_update_u8(const char *buf, size_t size, u8 *cfg)
+{
+ int ret;
+
+ mutex_lock(&test_fw_mutex);
+ ret = __test_dev_config_update_u8(buf, size, cfg);
+ mutex_unlock(&test_fw_mutex);
+
+ return ret;
+}
+
static ssize_t test_dev_config_show_u8(char *buf, u8 val)
{
return snprintf(buf, PAGE_SIZE, "%u\n", val);
@@ -478,10 +496,10 @@ static ssize_t config_num_requests_store(struct device *dev,
mutex_unlock(&test_fw_mutex);
goto out;
}
- mutex_unlock(&test_fw_mutex);
- rc = test_dev_config_update_u8(buf, count,
- &test_fw_config->num_requests);
+ rc = __test_dev_config_update_u8(buf, count,
+ &test_fw_config->num_requests);
+ mutex_unlock(&test_fw_mutex);
out:
return rc;
@@ -525,10 +543,10 @@ static ssize_t config_buf_size_store(struct device *dev,
mutex_unlock(&test_fw_mutex);
goto out;
}
- mutex_unlock(&test_fw_mutex);
- rc = test_dev_config_update_size_t(buf, count,
- &test_fw_config->buf_size);
+ rc = __test_dev_config_update_size_t(buf, count,
+ &test_fw_config->buf_size);
+ mutex_unlock(&test_fw_mutex);
out:
return rc;
@@ -555,10 +573,10 @@ static ssize_t config_file_offset_store(struct device *dev,
mutex_unlock(&test_fw_mutex);
goto out;
}
- mutex_unlock(&test_fw_mutex);
- rc = test_dev_config_update_size_t(buf, count,
- &test_fw_config->file_offset);
+ rc = __test_dev_config_update_size_t(buf, count,
+ &test_fw_config->file_offset);
+ mutex_unlock(&test_fw_mutex);
out:
return rc;
--
2.39.2
This is part of the effort to remove the empty element of the ctl_table
structures (used to calculate size) and replace it with an ARRAY_SIZE call. By
replacing the child element in struct ctl_table with a flags element we make
sure that there are no forward recursions on child nodes and therefore set
ourselves up for just using an ARRAY_SIZE. We also added some self tests to
make sure that we do not break anything.
The patchset is separated into 4 parts: parport fixes, selftests fixes, selftests
additions and replacement of the child element. Everything was tested with the sysctl
selftests and seems "ok".
1. parport fixes: This is related to my previous series and it plugs a sysctl
table leak in the parport driver. @mcgrof: I'm just leaving this here so we
don't have to retest the parport stuff
2. Selftests fixes: Remove the prefixed zeros when passing an awk field to the
awk print command because it was causing $0009 to be interpreted as $0.
Replaced continue with return in sysctl.sh(test_case) so the test actually
gets skipped. The skip decision is now in sysctl.sh(skip_test).
3. Selftest additions: New test to confirm that unregister actually removes
targets. New test to confirm that permanently empty targets are indeed
created and that no other targets can be created "on top".
4. Replaced the child pointer in struct ctl_table with an enum which is used to
differentiate between permanently empty targets and non-empty ones.
V2: Replaced the u8 flag with an enumeration.
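To illustrate the direction of the series, here is a minimal sketch of the
difference between sizing a table by its empty sentinel element and sizing
it with ARRAY_SIZE. The struct, enum and function names are hypothetical
simplifications, not the definitions from include/linux/sysctl.h.

```c
#include <stddef.h>

#define ARRAY_SIZE(a) (sizeof(a) / sizeof((a)[0]))

/* Hypothetical enum replacing a child pointer: it only records the kind. */
enum entry_type {
	ENTRY_DEFAULT,
	ENTRY_PERMANENTLY_EMPTY,
};

/* Hypothetical, heavily reduced stand-in for struct ctl_table. */
struct table_entry {
	const char *procname;
	enum entry_type type;
};

/* Sentinel style: walk the table until the empty element is found. */
static size_t count_with_sentinel(const struct table_entry *t)
{
	size_t n = 0;

	while (t[n].procname)
		n++;

	return n;
}

/* Table that needs a trailing empty element to mark its end. */
static const struct table_entry old_table[] = {
	{ "int_0001", ENTRY_DEFAULT },
	{ "int_0002", ENTRY_DEFAULT },
	{ NULL, ENTRY_DEFAULT },	/* sentinel, costs one extra element */
};

/* Same table without the sentinel; the size is known at compile time. */
static const struct table_entry new_table[] = {
	{ "int_0001", ENTRY_DEFAULT },
	{ "int_0002", ENTRY_DEFAULT },
};

static const size_t new_table_size = ARRAY_SIZE(new_table);
```

With no child tables to recurse into, the registration code only ever sees
flat arrays, which is what makes passing an explicit size practical.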
Comments/feedback greatly appreciated
Best
Joel
Joel Granados (8):
parport: plug a sysctl register leak
test_sysctl: Fix test metadata getters
test_sysctl: Group node sysctl test under one func
test_sysctl: Add an unregister sysctl test
test_sysctl: Add an option to prevent test skip
test_sysclt: Test for registering a mount point
sysctl: Remove debugging dump_stack
sysctl: replace child with an enumeration
drivers/parport/procfs.c | 23 ++---
fs/proc/proc_sysctl.c | 82 ++++------------
include/linux/sysctl.h | 14 ++-
lib/test_sysctl.c | 91 ++++++++++++++++--
tools/testing/selftests/sysctl/sysctl.sh | 115 +++++++++++++++++------
5 files changed, 214 insertions(+), 111 deletions(-)
--
2.30.2
The events tracing infrastructure contains a lot of files and directories
(internally represented as inodes and dentries), which end up consuming
megabytes of memory. There can be multiple instances of events tracing,
each of which requires additional memory.
Instead of creating inodes/dentries up front, eventfs can keep only meta-data
and skip their creation. eventfs then creates inodes/dentries on demand,
only for the files/directories that are actually required.
eventfs also deletes the inodes/dentries once they are no longer required,
but preserves the meta-data.
Tracing events took ~9MB; with this approach it takes ~4.5MB
for ~10K files/dirs.
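The idea of keeping only meta-data and creating the heavy objects on demand
can be sketched as follows. This is a userspace illustration with
hypothetical names, not the eventfs implementation: struct heavy_node stands
in for an inode/dentry pair, and there is no reference counting or locking.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the expensive inode/dentry pair. */
struct heavy_node {
	char name[32];
	char payload[4096];
};

/* Small per-file meta-data that always stays allocated. */
struct meta_entry {
	const char *name;
	struct heavy_node *node;	/* NULL until the file is looked up */
};

/* Create the heavy object only when the entry is actually accessed. */
static struct heavy_node *entry_lookup(struct meta_entry *e)
{
	if (!e->node) {
		e->node = calloc(1, sizeof(*e->node));
		if (!e->node)
			return NULL;
		strncpy(e->node->name, e->name, sizeof(e->node->name) - 1);
	}

	return e->node;
}

/* Drop the heavy object once it is unused; the meta-data is preserved. */
static void entry_release(struct meta_entry *e)
{
	free(e->node);
	e->node = NULL;
}
```

The memory saving comes from the fact that most of the ~10K entries are
never looked up, so only their small meta-data records exist.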
Diff from v1:
Patch 1: add header file
Patch 2: resolved kernel test robot issues
protecting eventfs lists using nested eventfs_rwsem
Patch 3: protecting eventfs lists using nested eventfs_rwsem
Patch 4: improve events cleanup code to fix crashes
Patch 5: resolved kernel test robot issues
removed d_instantiate_anon() calls
Patch 6: resolved kernel test robot issues
fix kprobe test in eventfs_root_lookup()
protecting eventfs lists using nested eventfs_rwsem
Patch 7: remove header file
Patch 8: pass eventfs_rwsem as argument to eventfs functions
called eventfs_remove_events_dir() instead of tracefs_remove()
from event_trace_del_tracer()
Patch 9: new patch to fix kprobe test case
fs/tracefs/Makefile | 1 +
fs/tracefs/event_inode.c | 761 ++++++++++++++++++
fs/tracefs/inode.c | 124 ++-
fs/tracefs/internal.h | 25 +
include/linux/trace_events.h | 1 +
include/linux/tracefs.h | 49 ++
kernel/trace/trace.h | 3 +-
kernel/trace/trace_events.c | 66 +-
.../ftrace/test.d/kprobe/kprobe_args_char.tc | 4 +-
.../test.d/kprobe/kprobe_args_string.tc | 4 +-
10 files changed, 992 insertions(+), 46 deletions(-)
create mode 100644 fs/tracefs/event_inode.c
create mode 100644 fs/tracefs/internal.h
--
2.39.0
Some test cases from net/tls, net/fcnal-test and net/vrf-xfrm-tests
rely on cryptographic functions and use algorithms that are not
FIPS-compliant, so they fail in FIPS mode.
In order to allow these tests to pass in a wider set of kernels,
- for net/tls, skip the test variants that use the ChaCha20-Poly1305
and SM4 algorithms, when FIPS mode is enabled;
- for net/fcnal-test, skip the MD5 tests, when FIPS mode is enabled;
- for net/vrf-xfrm-tests, replace the algorithms that are not
FIPS-compliant with compliant ones.
Changes in v4:
- Remove extra newline.
- Add R-b tag.
Changes in v3:
- Add new commit to allow skipping test directly from test setup.
- No need to initialize static variable to zero.
- Skip tests during test setup only.
- Use the constructor attribute to set fips_enabled before entering
main().
Changes in v2:
- Add R-b tags.
- Put fips_non_compliant into the variants.
- Turn fips_enabled into a static global variable.
- Read /proc/sys/crypto/fips_enabled only once at main().
v1: https://lore.kernel.org/netdev/20230607174302.19542-1-magali.lemes@canonica…
v2: https://lore.kernel.org/netdev/20230609164324.497813-1-magali.lemes@canonic…
v3: https://lore.kernel.org/netdev/20230612125107.73795-1-magali.lemes@canonica…
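The FIPS check described in the changelog (read
/proc/sys/crypto/fips_enabled once, from a constructor, into a static
variable) can be sketched roughly like this. It is a minimal sketch and not
the code from tls.c: the helper name read_fips_flag() and the function name
fips_check() are assumptions, and the constructor attribute is a GCC/Clang
extension.

```c
#include <stdio.h>

/* Static variables are zero-initialized, so no explicit "= 0" is needed. */
static int fips_enabled;

/*
 * Hypothetical helper: returns 1 if the file at path starts with a
 * non-zero integer, 0 otherwise (including when it cannot be opened).
 */
static int read_fips_flag(const char *path)
{
	FILE *f = fopen(path, "r");
	int val = 0;

	if (!f)
		return 0;
	if (fscanf(f, "%d", &val) != 1)
		val = 0;
	fclose(f);

	return val != 0;
}

/* Runs before main(), so every test setup can consult fips_enabled. */
static void __attribute__((constructor)) fips_check(void)
{
	fips_enabled = read_fips_flag("/proc/sys/crypto/fips_enabled");
}
```

A test setup would then skip a variant when fips_enabled is set and the
variant is marked as not FIPS-compliant.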
Magali Lemes (4):
selftests/harness: allow tests to be skipped during setup
selftests: net: tls: check if FIPS mode is enabled
selftests: net: vrf-xfrm-tests: change authentication and encryption
algos
selftests: net: fcnal-test: check if FIPS mode is enabled
tools/testing/selftests/kselftest_harness.h | 6 ++--
tools/testing/selftests/net/fcnal-test.sh | 27 +++++++++++-----
tools/testing/selftests/net/tls.c | 24 +++++++++++++-
tools/testing/selftests/net/vrf-xfrm-tests.sh | 32 +++++++++----------
4 files changed, 61 insertions(+), 28 deletions(-)
--
2.34.1
From: Mirsad Todorovac <mirsad.todorovac(a)alu.unizg.hr>
According to Mirsad the gpio-sim.sh test appears to FAIL in a wrong way
due to missing initialisation of shell variables:
4.2. Bias settings work correctly
cat: /sys/devices/platform/gpio-sim.0/gpiochip18/sim_gpio0/value: No such file or directory
./gpio-sim.sh: line 393: test: =: unary operator expected
bias setting does not work
GPIO gpio-sim test FAIL
After this change the test passed:
4.2. Bias settings work correctly
GPIO gpio-sim test PASS
His testing environment is AlmaLinux 8.7 on a Lenovo desktop box with
the latest Linux kernel based on v6.2:
Linux 6.2.0-mglru-kmlk-andy-09238-gd2980d8d8265 x86_64
Suggested-by: Mirsad Todorovac <mirsad.todorovac(a)alu.unizg.hr>
Signed-off-by: Andy Shevchenko <andriy.shevchenko(a)linux.intel.com>
---
tools/testing/selftests/gpio/gpio-sim.sh | 3 +++
1 file changed, 3 insertions(+)
diff --git a/tools/testing/selftests/gpio/gpio-sim.sh b/tools/testing/selftests/gpio/gpio-sim.sh
index 9f539d454ee4..fa2ce2b9dd5f 100755
--- a/tools/testing/selftests/gpio/gpio-sim.sh
+++ b/tools/testing/selftests/gpio/gpio-sim.sh
@@ -389,6 +389,9 @@ create_chip chip
create_bank chip bank
set_num_lines chip bank 8
enable_chip chip
+DEVNAME=`configfs_dev_name chip`
+CHIPNAME=`configfs_chip_name chip bank`
+SYSFS_PATH="/sys/devices/platform/$DEVNAME/$CHIPNAME/sim_gpio0/value"
$BASE_DIR/gpio-mockup-cdev -b pull-up /dev/`configfs_chip_name chip bank` 0
test `cat $SYSFS_PATH` = "1" || fail "bias setting does not work"
remove_chip chip
--
2.40.0.1.gaa8946217a0b
The default timeout for kselftests is 45 seconds, but pcm-test can take
longer than that to run depending on the number of PCMs present on a
device.
As a data point, running pcm-test on mt8192-asurada-spherion takes about
1m15s.
Set the timeout to 10 minutes, which should give enough slack to run the
test even on devices with many PCMs.
Signed-off-by: Nícolas F. R. A. Prado <nfraprado(a)collabora.com>
---
tools/testing/selftests/alsa/settings | 1 +
1 file changed, 1 insertion(+)
create mode 100644 tools/testing/selftests/alsa/settings
diff --git a/tools/testing/selftests/alsa/settings b/tools/testing/selftests/alsa/settings
new file mode 100644
index 000000000000..a62d2fa1275c
--- /dev/null
+++ b/tools/testing/selftests/alsa/settings
@@ -0,0 +1 @@
+timeout=600
--
2.39.0
Here is a series with some fixes and cleanups to the resctrl selftests and
a rewrite of the CAT test into something that really tests whether CAT
works or not.
v2:
- Rebased on top of next to solve the conflicts
- Added 2 patches related to resctrl FS mount/umount (fix + cleanup)
- Consistently use "alloc" in cache_alloc_size()
- CAT test error handling tweaked
- Remove a spurious newline change from the CAT patch
- Small improvements to changelogs
Ilpo Järvinen (24):
selftests/resctrl: Add resctrl.h into build deps
selftests/resctrl: Check also too low values for CBM bits
selftests/resctrl: Move resctrl FS mount/umount to higher level
selftests/resctrl: Remove mum_resctrlfs
selftests/resctrl: Make span unsigned long everywhere
selftests/resctrl: Express span in bytes
selftests/resctrl: Remove duplicated preparation for span arg
selftests/resctrl: Don't use variable argument list for ->setup()
selftests/resctrl: Remove "malloc_and_init_memory" param from
run_fill_buf()
selftests/resctrl: Split run_fill_buf() to alloc, work, and dealloc
helpers
selftests/resctrl: Remove start_buf local variable from buffer alloc
func
selftests/resctrl: Don't pass test name to fill_buf
selftests/resctrl: Add flush_buffer() to fill_buf
selftests/resctrl: Remove test type checks from cat_val()
selftests/resctrl: Refactor get_cbm_mask()
selftests/resctrl: Create cache_alloc_size() helper
selftests/resctrl: Replace count_bits with count_consecutive_bits()
selftests/resctrl: Exclude shareable bits from schemata in CAT test
selftests/resctrl: Pass the real number of tests to show_cache_info()
selftests/resctrl: Move CAT/CMT test global vars to func they are used
selftests/resctrl: Read in less obvious order to defeat prefetch
optimizations
selftests/resctrl: Split measure_cache_vals() function
selftests/resctrl: Split show_cache_info() to test specific and
generic parts
selftests/resctrl: Rewrite Cache Allocation Technology (CAT) test
tools/testing/selftests/resctrl/Makefile | 2 +-
tools/testing/selftests/resctrl/cache.c | 154 ++++++------
tools/testing/selftests/resctrl/cat_test.c | 235 ++++++++----------
tools/testing/selftests/resctrl/cmt_test.c | 65 +++--
tools/testing/selftests/resctrl/fill_buf.c | 105 ++++----
tools/testing/selftests/resctrl/mba_test.c | 9 +-
tools/testing/selftests/resctrl/mbm_test.c | 17 +-
tools/testing/selftests/resctrl/resctrl.h | 32 +--
.../testing/selftests/resctrl/resctrl_tests.c | 82 ++++--
tools/testing/selftests/resctrl/resctrl_val.c | 9 +-
tools/testing/selftests/resctrl/resctrlfs.c | 187 ++++++++++----
11 files changed, 499 insertions(+), 398 deletions(-)
--
2.30.2
Fix the following coccicheck warning:
tools/testing/selftests/nolibc/nolibc-test.c:646:5-8: Unneeded variable:
"ret". Return "0"
Signed-off-by: Yonggang Wu <wuyonggang001(a)208suo.com>
---
tools/testing/selftests/nolibc/nolibc-test.c | 15 ++++++---------
1 file changed, 6 insertions(+), 9 deletions(-)
diff --git a/tools/testing/selftests/nolibc/nolibc-test.c b/tools/testing/selftests/nolibc/nolibc-test.c
index 486334981e60..2b723354e085 100644
--- a/tools/testing/selftests/nolibc/nolibc-test.c
+++ b/tools/testing/selftests/nolibc/nolibc-test.c
@@ -546,7 +546,6 @@ int run_syscall(int min, int max)
int proc;
int test;
int tmp;
- int ret = 0;
void *p1, *p2;
/* <proc> indicates whether or not /proc is mounted */
@@ -632,18 +631,17 @@ int run_syscall(int min, int max)
CASE_TEST(syscall_noargs); EXPECT_SYSEQ(1, syscall(__NR_getpid), getpid()); break;
CASE_TEST(syscall_args); EXPECT_SYSER(1, syscall(__NR_statx, 0, NULL, 0, 0, NULL), -1, EFAULT); break;
case __LINE__:
- return ret; /* must be last */
+ return 0; /* must be last */
/* note: do not set any defaults so as to permit holes above */
}
}
- return ret;
+ return 0;
}
int run_stdlib(int min, int max)
{
int test;
int tmp;
- int ret = 0;
void *p1, *p2;
for (test = min; test >= 0 && test <= max; test++) {
@@ -726,11 +724,11 @@ int run_stdlib(int min, int max)
# warning "__SIZEOF_LONG__ is undefined"
#endif /* __SIZEOF_LONG__ */
case __LINE__:
- return ret; /* must be last */
+ return 0; /* must be last */
/* note: do not set any defaults so as to permit holes above */
}
}
- return ret;
+ return 0;
}
#define EXPECT_VFPRINTF(c, expected, fmt, ...) \
@@ -790,7 +788,6 @@ static int run_vfprintf(int min, int max)
{
int test;
int tmp;
- int ret = 0;
void *p1, *p2;
for (test = min; test >= 0 && test <= max; test++) {
@@ -810,11 +807,11 @@ static int run_vfprintf(int min, int max)
CASE_TEST(hex); EXPECT_VFPRINTF(1, "f", "%x", 0xf); break;
CASE_TEST(pointer); EXPECT_VFPRINTF(3, "0x1", "%p", (void *) 0x1); break;
case __LINE__:
- return ret; /* must be last */
+ return 0; /* must be last */
/* note: do not set any defaults so as to permit holes above */
}
}
- return ret;
+ return 0;
}
static int smash_stack(void)
Currently the MM selftests attempt to work out the target architecture by
using CROSS_COMPILE or otherwise querying the host machine, storing the
target architecture in a variable called MACHINE rather than the usual ARCH,
though as far as I can tell (including for x86_64) the value is the same as
we would use for architecture.
When cross compiling with LLVM we don't need a CROSS_COMPILE, as LLVM can
support many target architectures in a single build, so this logic does not
work: CROSS_COMPILE is not set and we end up selecting tests for the host
rather than target architecture. Fix this by using the more standard ARCH
to describe the architecture, taking it from the environment if specified.
Signed-off-by: Mark Brown <broonie(a)kernel.org>
---
tools/testing/selftests/mm/Makefile | 13 ++++++++-----
1 file changed, 8 insertions(+), 5 deletions(-)
diff --git a/tools/testing/selftests/mm/Makefile b/tools/testing/selftests/mm/Makefile
index 23af4633f0f4..4f0c50c33ba7 100644
--- a/tools/testing/selftests/mm/Makefile
+++ b/tools/testing/selftests/mm/Makefile
@@ -5,12 +5,15 @@ LOCAL_HDRS += $(selfdir)/mm/local_config.h $(top_srcdir)/mm/gup_test.h
include local_config.mk
+ifeq ($(ARCH),)
+
ifeq ($(CROSS_COMPILE),)
uname_M := $(shell uname -m 2>/dev/null || echo not)
else
uname_M := $(shell echo $(CROSS_COMPILE) | grep -o '^[a-z0-9]\+')
endif
-MACHINE ?= $(shell echo $(uname_M) | sed -e 's/aarch64.*/arm64/' -e 's/ppc64.*/ppc64/')
+ARCH ?= $(shell echo $(uname_M) | sed -e 's/aarch64.*/arm64/' -e 's/ppc64.*/ppc64/')
+endif
# Without this, failed build products remain, with up-to-date timestamps,
# thus tricking Make (and you!) into believing that All Is Well, in subsequent
@@ -65,7 +68,7 @@ TEST_GEN_PROGS += ksm_tests
TEST_GEN_PROGS += ksm_functional_tests
TEST_GEN_PROGS += mdwe_test
-ifeq ($(MACHINE),x86_64)
+ifeq ($(ARCH),x86_64)
CAN_BUILD_I386 := $(shell ./../x86/check_cc.sh "$(CC)" ../x86/trivial_32bit_program.c -m32)
CAN_BUILD_X86_64 := $(shell ./../x86/check_cc.sh "$(CC)" ../x86/trivial_64bit_program.c)
CAN_BUILD_WITH_NOPIE := $(shell ./../x86/check_cc.sh "$(CC)" ../x86/trivial_program.c -no-pie)
@@ -87,13 +90,13 @@ TEST_GEN_PROGS += $(BINARIES_64)
endif
else
-ifneq (,$(findstring $(MACHINE),ppc64))
+ifneq (,$(findstring $(ARCH),ppc64))
TEST_GEN_PROGS += protection_keys
endif
endif
-ifneq (,$(filter $(MACHINE),arm64 ia64 mips64 parisc64 ppc64 riscv64 s390x sparc64 x86_64))
+ifneq (,$(filter $(ARCH),arm64 ia64 mips64 parisc64 ppc64 riscv64 s390x sparc64 x86_64))
TEST_GEN_PROGS += va_high_addr_switch
TEST_GEN_PROGS += virtual_address_range
TEST_GEN_PROGS += write_to_hugetlbfs
@@ -112,7 +115,7 @@ $(TEST_GEN_PROGS): vm_util.c
$(OUTPUT)/uffd-stress: uffd-common.c
$(OUTPUT)/uffd-unit-tests: uffd-common.c
-ifeq ($(MACHINE),x86_64)
+ifeq ($(ARCH),x86_64)
BINARIES_32 := $(patsubst %,$(OUTPUT)/%,$(BINARIES_32))
BINARIES_64 := $(patsubst %,$(OUTPUT)/%,$(BINARIES_64))
---
base-commit: 858fd168a95c5b9669aac8db6c14a9aeab446375
change-id: 20230614-kselftest-mm-llvm-a25a7daffa6f
Best regards,
--
Mark Brown <broonie(a)kernel.org>
Hi,
A very recent 6.4-rc3 kernel, built on AlmaLinux 8.7 on a LENOVO 10TX000VCR
desktop box, fails one test:
[root@host net]# ./fcnal-test.sh
[...]
TEST: ping out, vrf device+address bind - ns-B loopback IPv6 [ OK ]
TEST: ping out, vrf device+address bind - ns-B IPv6 LLA [FAIL]
TEST: ping in - ns-A IPv6 [ OK ]
[...]
Tests passed: 887
Tests failed: 1
[root@host net]#
Please find the config, + dmesg and lshw output here:
https://domac.alu.unizg.hr/~mtodorov/linux/selftests/net-fcnal-test/config-…
https://domac.alu.unizg.hr/~mtodorov/linux/selftests/net-fcnal-test/dmesg.l…
https://domac.alu.unizg.hr/~mtodorov/linux/selftests/net-fcnal-test/lshw.txt
I believe that I have all required configs merged for the selftest/net tests.
Maybe we have a regression?
My knowledge of fcnal-test.sh isn't sufficient to build a smaller reproducer.
Guillaume said in January he could help with the net/fcnal-test.sh, but I was doing
the other things in the meantime. Tempus fugit :-/
Best regards,
Mirsad
--
Mirsad Goran Todorovac
System engineer
Faculty of Graphic Arts | Academy of Fine Arts
University of Zagreb, Republic of Croatia
"What’s this thing suddenly coming towards me very fast? Very very fast.
... I wonder if it will be friends with me?"
Hi,
Static analysis with cppcheck has found an issue in the following commit:
commit 047e6575aec71d75b765c22111820c4776cd1c43
Author: Aneesh Kumar K.V <aneesh.kumar(a)linux.ibm.com>
Date: Tue Sep 24 09:22:53 2019 +0530
powerpc/mm: Fixup tlbie vs mtpidr/mtlpidr ordering issue on POWER9
The issue in tools/testing/selftests/powerpc/mm/tlbie_test.c in
end_verification_log() is as follows:
static inline void end_verification_log(unsigned int tid, unsigned
nr_anamolies)
{
FILE *f = fp[tid];
char logfile[30];
char path[LOGDIR_NAME_SIZE + 30];
char separator[] = "/";
fclose(f);
if (nr_anamolies == 0) {
remove(path);
return;
}
.... etc
in the case where nr_anamolies is zero the remove(path) call is using an
uninitialized path, this potentially could contain uninitialized garbage
on the stack (and if one is unlucky enough it may be a valid filename
that one does not want to be removed).
Not sure what the original intention was, but this code looks incorrect
to me.
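One possible shape for a fix, as a sketch only: build the path before
either branch can use it. The helper names, the "thread-%u.log" file
name and the 256-byte buffer below are hypothetical and are not taken
from tlbie_test.c; only the ordering (initialize path, then remove or
report) is the point.

```c
#include <stdio.h>

/* Hypothetical helper: writes "<logdir>/thread-<tid>.log" into path.
 * Returns 0 on success, -1 if the result would not fit in len bytes. */
static int build_log_path(char *path, size_t len, const char *logdir,
			  unsigned int tid)
{
	int n = snprintf(path, len, "%s/thread-%u.log", logdir, tid);

	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/* Sketch of the corrected flow. Returns 0 if the log was removed,
 * 1 if it was kept, -1 if the path could not be built. */
static int finish_log(FILE *f, const char *logdir, unsigned int tid,
		      unsigned int nr_anomalies)
{
	char path[256];

	if (f)
		fclose(f);
	if (build_log_path(path, sizeof(path), logdir, tid) < 0)
		return -1;

	if (nr_anomalies == 0) {
		remove(path);	/* path is initialized at this point */
		return 0;
	}

	printf("Thread %02u: %u corrupted words, see %s\n",
	       tid, nr_anomalies, path);
	return 1;
}
```

With this ordering the zero-anomaly case removes the log file that was
actually opened for the thread, rather than whatever happens to be on
the stack.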
Colin
Good morning,
I have reviewed your offer and I am pleased to say that it attracts attention and encourages further discussion.
I thought I might be able to contribute to your growth and help this offer reach a wider audience. I do search engine positioning for websites, which helps them generate great web traffic.
Could we talk in the near future?
Regards
Adam Charachuta
Since commit ("selftests: error out if kernel header files are not yet
built") got merged, the kselftest build fails because KBUILD_OUTPUT
isn't set when building out-of-tree and specifying 'O='.
This is the error message that pops up:
make --silent --keep-going --jobs=32 O=/home/anders/.cache/tuxmake/builds/1482/build INSTALL_PATH=/home/anders/.cache/tuxmake/builds/1482/build/kselftest_install ARCH=arm64 CROSS_COMPILE=aarch64-linux-gnu- V=1 CROSS_COMPILE_COMPAT=arm-linux-gnueabihf- kselftest-install
make[3]: Entering directory '/home/anders/src/kernel/next/tools/testing/selftests/alsa'
-e error: missing kernel header files.
Please run this and try again:
cd /home/anders/src/kernel/next/tools/testing/selftests/../../..
make headers
make[3]: Leaving directory '/home/anders/src/kernel/next/tools/testing/selftests/alsa'
make[3]: *** [../lib.mk:77: kernel_header_files] Error 1
Fix the issue by assigning KBUILD_OUTPUT the same way it is done in
kselftest's Makefile: add 'KBUILD_OUTPUT := $(O)' if '$(origin O)' is
'command line'. This sets the BUILD dir to KBUILD_OUTPUT/kselftest when
doing out-of-tree builds, which puts them in their own separate output
directory.
Signed-off-by: Anders Roxell <anders.roxell(a)linaro.org>
---
tools/testing/selftests/lib.mk | 4 ++++
1 file changed, 4 insertions(+)
diff --git a/tools/testing/selftests/lib.mk b/tools/testing/selftests/lib.mk
index b8ea03b9a015..d17854285f2b 100644
--- a/tools/testing/selftests/lib.mk
+++ b/tools/testing/selftests/lib.mk
@@ -44,6 +44,10 @@ endif
selfdir = $(realpath $(dir $(filter %/lib.mk,$(MAKEFILE_LIST))))
top_srcdir = $(selfdir)/../../..
+ifeq ("$(origin O)", "command line")
+ KBUILD_OUTPUT := $(O)
+endif
+
ifneq ($(KBUILD_OUTPUT),)
# Make's built-in functions such as $(abspath ...), $(realpath ...) cannot
# expand a shell special character '~'. We use a somewhat tedious way here.
--
2.39.2
tls:no_pad exits the whole test program when TLS is not available. It
should skip the test like all the others do.
Signed-off-by: Kuba Pawlak <kuba.pawlak(a)canonical.com>
---
tools/testing/selftests/net/tls.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/tools/testing/selftests/net/tls.c b/tools/testing/selftests/net/tls.c
index e699548d4247dd57555a72ec1627566962128f73..ea3ec8463df993d80f0b70c4632b2a1e3c57b424 100644
--- a/tools/testing/selftests/net/tls.c
+++ b/tools/testing/selftests/net/tls.c
@@ -1727,7 +1727,7 @@ TEST(no_pad) {
ulp_sock_pair(_metadata, &fd, &cfd, &notls);
if (notls)
- exit(KSFT_SKIP);
+ SKIP(return, "no TLS support");
ret = setsockopt(fd, SOL_TLS, TLS_TX, &tls12, sizeof(tls12));
EXPECT_EQ(ret, 0);
--
2.37.2
Hi,
Enclosed is a pair of patches for an oops that can occur if an exception is
generated while a bpf subprogram is running. One of the bpf_prog_aux entries
for the subprograms is missing an extable. This can lead to an exception that
would otherwise be handled turning into a NULL pointer bug.
These changes were tested via the verifier and progs selftests and no
regressions were observed.
Changes from v4:
- Ensure that num_exentries is copied to prog->aux from func[0] (Feedback from
Ilya Leoshkevich)
Changes from v3:
- Selftest style fixups (Feedback from Yonghong Song)
- Selftest needs to assert that test bpf program executed (Feedback from
Yonghong Song)
- Selftest should combine open and load using open_and_load (Feedback from
Yonghong Song)
Changes from v2:
- Insert only the main program's kallsyms (Feedback from Yonghong Song and
Alexei Starovoitov)
- Selftest should use ASSERT instead of CHECK (Feedback from Yonghong Song)
- Selftest needs some cleanup (Feedback from Yonghong Song)
- Switch patch order (Feedback from Alexei Starovoitov)
Changes from v1:
- Add a selftest (Feedback From Alexei Starovoitov)
- Move to a 1-line verifier change instead of searching multiple extables
Krister Johansen (2):
bpf: ensure main program has an extable
selftests/bpf: add a test for subprogram extables
kernel/bpf/verifier.c | 7 ++-
.../bpf/prog_tests/subprogs_extable.c | 29 +++++++++++
.../bpf/progs/test_subprogs_extable.c | 51 +++++++++++++++++++
3 files changed, 85 insertions(+), 2 deletions(-)
create mode 100644 tools/testing/selftests/bpf/prog_tests/subprogs_extable.c
create mode 100644 tools/testing/selftests/bpf/progs/test_subprogs_extable.c
--
2.25.1
Hello everyone,
This is an RFC patch series to propose the addition of a test attributes
framework to KUnit.
There has been interest in filtering out "slow" KUnit tests. Most notably,
a new config, CONFIG_MEMCPY_SLOW_KUNIT_TEST, has been added to exclude
particularly slow memcpy tests
(https://lore.kernel.org/all/20230118200653.give.574-kees@kernel.org/).
This proposed attributes framework would be used to save and access test
associated data, including whether a test is slow. These attributes would
be reportable (via KTAP and command line output) and some will be
filterable.
This framework is designed to allow for the addition of other attributes in
the future. These attributes could include whether the test is flaky,
associated test files, etc.
Note that this could intersect with the discussions on how to format
test-associated data in KTAP v2 that I am also involved in
(https://lore.kernel.org/all/20230420205734.1288498-1-rmoar@google.com/).
If the overall idea seems good, I'll make sure to add tests/documentation,
and more patches marking existing tests as slow to the patch series.
Thanks!
Rae
Rae Moar (6):
kunit: Add test attributes API structure
kunit: Add speed attribute
kunit: Add ability to filter attributes
kunit: tool: Add command line interface to filter and report
attributes
kunit: memcpy: Mark tests as slow using test attributes
kunit: time: Mark test as slow using test attributes
include/kunit/attributes.h | 41 ++++
include/kunit/test.h | 62 ++++++
kernel/time/time_test.c | 2 +-
lib/kunit/Makefile | 3 +-
lib/kunit/attributes.c | 280 +++++++++++++++++++++++++
lib/kunit/executor.c | 89 ++++++--
lib/kunit/executor_test.c | 8 +-
lib/kunit/kunit-example-test.c | 9 +
lib/kunit/test.c | 17 +-
lib/memcpy_kunit.c | 8 +-
tools/testing/kunit/kunit.py | 34 ++-
tools/testing/kunit/kunit_kernel.py | 6 +-
tools/testing/kunit/kunit_tool_test.py | 41 ++--
13 files changed, 536 insertions(+), 64 deletions(-)
create mode 100644 include/kunit/attributes.h
create mode 100644 lib/kunit/attributes.c
base-commit: fefdb43943c1a0d87e1b43ae4d03e5f9a1d058f4
--
2.41.0.162.gfafddb0af9-goog
On 6/13/23 1:50 AM, baomingtong001(a)208suo.com wrote:
> Fix the following coccicheck warning:
>
> tools/testing/selftests/bpf/progs/tailcall_bpf2bpf6.c:28:14-17: Unneeded
> variable: "ret".
>
> Return "1".
>
> Signed-off-by: Mingtong Bao <baomingtong001(a)208suo.com>
> ---
> tools/testing/selftests/bpf/progs/tailcall_bpf2bpf6.c | 3 +--
> 1 file changed, 1 insertion(+), 2 deletions(-)
>
> diff --git a/tools/testing/selftests/bpf/progs/tailcall_bpf2bpf6.c
> b/tools/testing/selftests/bpf/progs/tailcall_bpf2bpf6.c
> index 4a9f63bea66c..7f0146682577 100644
> --- a/tools/testing/selftests/bpf/progs/tailcall_bpf2bpf6.c
> +++ b/tools/testing/selftests/bpf/progs/tailcall_bpf2bpf6.c
> @@ -25,10 +25,9 @@ static __noinline
> int subprog_tail(struct __sk_buff *skb)
> {
> /* Don't propagate the constant to the caller */
> - volatile int ret = 1;
>
> bpf_tail_call_static(skb, &jmp_table, 0);
> - return ret;
> + return 1;
Please pay attention to the comment:
/* Don't propagate the constant to the caller */
which clearly says 'constant' is not preferred.
The patch that introduced this code is:
5e0b0a4c52d30 selftests/bpf: Test tail call counting with bpf2bpf
and data on stack
The test intentionally wants to cover this case:
'Specifically when the size
of data allocated on BPF stack is not a multiple on 8.'
Note that with volatile and without volatile, the generated
code will be different and it will result in different
verification path.
cc Jakub for further clarification.
> }
>
> SEC("tc")
On Thu, Jun 08, 2023 at 07:52:54PM +0200, Michal Sekletar wrote:
> On Thu, Jun 8, 2023 at 1:51 PM Greg KH <gregkh(a)linuxfoundation.org> wrote:
>
> > So how are you protecting this from being an information leak like we
> > have had in the past where you could monitor how many characters were
> > being sent to the tty through a proc file? Seems like now you can just
> > monitor any tty node in the system and get the same information, while
> > today you can only do it for the tty devices you have permissions for,
> > right?
>
> Hi Greg,
>
> I am not protecting against it in any way, but proposed changes are only
> about timestamp updates which still happen in at least 8 seconds intervals
> so exact timing of read/writes to tty can't be inferred. Frankly, I may
> have misunderstood something. It would be great if you could mention a bit
> more details about CVE you had in mind.
Ah, I missed that this is in 8 second increments, nevermind then!
thanks,
greg k-h
*Changes in v12*
- Update and other memory types to UFFD_FEATURE_WP_ASYNC
- Rebase on top of next-20230406
- Review updates
*Changes in v11*
- Rebase on top of next-20230307
- Base patches on UFFD_FEATURE_WP_UNPOPULATED
- Do a lot of cosmetic changes and review updates
- Remove ENGAGE_WP + !GET operation as it can be performed with
UFFDIO_WRITEPROTECT
*Changes in v10*
- Add specific condition to return error if hugetlb is used with wp
async
- Move changes in tools/include/uapi/linux/fs.h to separate patch
- Add documentation
*Changes in v9:*
- Correct fault resolution for userfaultfd wp async
- Fix build warnings and errors which were happening on some configs
- Simplify pagemap ioctl's code
*Changes in v8:*
- Update uffd async wp implementation
- Improve PAGEMAP_IOCTL implementation
*Changes in v7:*
- Add uffd wp async
- Update the IOCTL to use uffd under the hood instead of soft-dirty
flags
*Motivation*
The real motivation for adding the PAGEMAP_SCAN IOCTL is to emulate the
Windows GetWriteWatch() syscall [1]. GetWriteWatch() retrieves the
addresses of the pages that have been written to in a region of virtual
memory.
This syscall is used in Windows applications, games etc. It is currently
emulated in userspace in a pretty slow manner. Our purpose is to enhance
the kernel so that it can be emulated efficiently. Currently some
out-of-tree hack patches are being used to emulate it efficiently in some
kernels. We intend to replace those with these patches, so that gaming on
Linux as a whole can benefit. It means there would be tons of users of
this code.
CRIU use case [2] was mentioned by Andrei and Danylo:
> Use cases for migrating sparse VMAs are binaries sanitized with ASAN,
> MSAN or TSAN [3]. All of these sanitizers produce sparse mappings of
> shadow memory [4]. Being able to migrate such binaries allows to highly
> reduce the amount of work needed to identify and fix post-migration
> crashes, which happen constantly.
Andrei defines the following uses of this code:
* it is more granular and allows us to track changed pages more
effectively. The current interface can clear dirty bits for the entire
process only. In addition, reading info about pages is a separate
operation. It means we must freeze the process to read information
about all its pages, reset dirty bits, only then we can start dumping
pages. The information about pages becomes more and more outdated,
while we are processing pages. The new interface solves both these
downsides. First, it allows us to read pte bits and clear the
soft-dirty bit atomically. It means that CRIU will not need to freeze
processes to pre-dump their memory. Second, it clears soft-dirty bits
for a specified region of memory. It means CRIU will have actual info
about pages to the moment of dumping them.
* The new interface has to be much faster because basic page filtering
is happening in the kernel. With the old interface, we have to read
pagemap for each page.
*Implementation Evolution (Short Summary)*
From the definition of GetWriteWatch(), we feel like kernel's soft-dirty
feature can be used under the hood with some additions like:
* reset soft-dirty flag for only a specific region of memory instead of
clearing the flag for the entire process
* get and clear soft-dirty flag for a specific region atomically
So we decided to use an ioctl on the pagemap file to read and/or reset the
soft-dirty flag. But with the soft-dirty flag we sometimes get extra pages
which weren't even written. They had become soft-dirty because of VMA
merging and the VM_SOFTDIRTY flag. This breaks the definition of
GetWriteWatch(). We were able to bypass this shortcoming by ignoring
VM_SOFTDIRTY until David reported that mprotect etc. mess up the
soft-dirty flag when VM_SOFTDIRTY is ignored [5]. This wasn't happening
until [6] got introduced. We discussed whether we could revert these
patches, but we could not reach any conclusion. So at this point, I made
a couple of attempts to solve this whole VM_SOFTDIRTY issue by correcting
the soft-dirty implementation:
* [7] Correct the bug fixed wrongly back in 2014. It had the potential to
cause regressions. We left it behind.
* [8] Keep a list of the soft-dirty parts of a VMA across splits and
merges. The reply I got was not to increase the size of the VMA by 8
bytes.
At this point, we gave up on soft-dirty, considering it too delicate, and
userfaultfd [9] seemed like the only way forward. From there onward, we
have been basing soft-dirty emulation on the userfaultfd wp feature, where
the kernel resolves the faults itself when the WP_ASYNC feature is used. It
was straightforward to add the WP_ASYNC feature to userfaultfd. Now only
those pages which have really been written are reported as dirty or
written-to. (PS: another userfaultfd feature, WP_UNPOPULATED, is required
to avoid pre-faulting memory before write-protecting [9].)
All the different masks were added at the request of the CRIU devs to make
the interface more generic and better.
[1] https://learn.microsoft.com/en-us/windows/win32/api/memoryapi/nf-memoryapi-…
[2] https://lore.kernel.org/all/20221014134802.1361436-1-mdanylo@google.com
[3] https://github.com/google/sanitizers
[4] https://github.com/google/sanitizers/wiki/AddressSanitizerAlgorithm#64-bit
[5] https://lore.kernel.org/all/bfcae708-db21-04b4-0bbe-712badd03071@redhat.com
[6] https://lore.kernel.org/all/20220725142048.30450-1-peterx@redhat.com/
[7] https://lore.kernel.org/all/20221122115007.2787017-1-usama.anjum@collabora.…
[8] https://lore.kernel.org/all/20221220162606.1595355-1-usama.anjum@collabora.…
[9] https://lore.kernel.org/all/20230306213925.617814-1-peterx@redhat.com
[10] https://lore.kernel.org/all/20230125144529.1630917-1-mdanylo@google.com
*Original cover letter from v8*
Hello,
Note:
Soft-dirty pages and pages which have been written-to are synonyms. As
kernel already has soft-dirty feature inside which we have given up to
use, we are using written-to terminology while using UFFD async WP under
the hood.
This IOCTL, PAGEMAP_SCAN on pagemap file can be used to get and/or clear
the info about page table entries. The following operations are
supported in this ioctl:
- Get the information if the pages have been written-to (PAGE_IS_WRITTEN),
file mapped (PAGE_IS_FILE), present (PAGE_IS_PRESENT) or swapped
(PAGE_IS_SWAPPED).
- Write-protect the pages (PAGEMAP_WP_ENGAGE) to start finding which
pages have been written-to.
- Find pages which have been written-to and write protect the pages
(atomic PAGE_IS_WRITTEN + PAGEMAP_WP_ENGAGE)
It is possible to find and clear soft-dirty pages entirely in userspace.
But it isn't efficient:
- The mprotect and SIGSEGV handler for bookkeeping
- The userfaultfd wp (synchronous) with the handler for bookkeeping
Some benchmarks can be seen here[1]. This series adds features that weren't
present earlier:
- There is no atomic "get soft-dirty/written-to status and clear"
operation present in the kernel.
- The pages which have been written-to cannot be found in an accurate way.
(Kernel's soft-dirty PTE bit + soft-dirty VMA bit shows more soft-dirty
pages than there actually are.)
Historically, soft-dirty PTE bit tracking has been used in the CRIU
project. The procfs interface is enough for finding the soft-dirty bit
status and clearing the soft-dirty bit of all the pages of a process.
We have a use case where we need to track the soft-dirty PTE bit for
only specific pages on demand. We need this tracking and clearing
mechanism for a region of memory while the process is running to emulate
the GetWriteWatch() syscall of Windows.
*(Moved to using UFFD instead of the soft-dirty feature to find pages which
have been written-to from v7 patch series)*:
Stop using the soft-dirty flags for finding which pages have been
written to. It is too delicate and wrong as it shows more soft-dirty
pages than the actual soft-dirty pages. There is no interest in
correcting it [2][3] as this is how the feature was written years ago.
It shouldn't be updated to changed behaviour. Peter Xu has suggested
using the async version of the UFFD WP [4] as it is based inherently
on the PTEs.
So in this patch series, I've added a new mode to the UFFD which is
asynchronous version of the write protect. When this variant of the
UFFD WP is used, the page faults are resolved automatically by the
kernel. The pages which have been written-to can be found by reading
pagemap file (!PM_UFFD_WP). This feature can be used successfully to
find which pages have been written to from the time the pages were
write protected. This works just like the soft-dirty flag without
showing any extra pages which aren't soft-dirty in reality.
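The "(!PM_UFFD_WP)" check on the existing pagemap file can be sketched
as follows. This is only an illustration: the helper names are
hypothetical, the bit positions are the ones documented in
Documentation/admin-guide/mm/pagemap.rst, and it assumes the range was
registered with userfaultfd and write-protected in async mode
beforehand.

```c
#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

/* Bit positions from Documentation/admin-guide/mm/pagemap.rst. */
#define PM_SOFT_DIRTY	(1ULL << 55)
#define PM_UFFD_WP	(1ULL << 57)
#define PM_FILE		(1ULL << 61)
#define PM_SWAP		(1ULL << 62)
#define PM_PRESENT	(1ULL << 63)

/* Reads the 64-bit pagemap entry covering vaddr from an open pagemap fd
 * (e.g. /proc/self/pagemap). Returns 0 on success, -1 on failure. */
static int pagemap_read_entry(int fd, uintptr_t vaddr, uint64_t *entry)
{
	long page_size = sysconf(_SC_PAGESIZE);
	off_t off = (off_t)(vaddr / (uintptr_t)page_size) *
		    (off_t)sizeof(*entry);

	if (pread(fd, entry, sizeof(*entry), off) != (ssize_t)sizeof(*entry))
		return -1;
	return 0;
}

/* After write-protecting a range, a page whose uffd-wp bit has been
 * cleared is treated as written-to. */
static int entry_is_written(uint64_t entry)
{
	return !(entry & PM_UFFD_WP);
}
```

Reading one entry per page like this is the per-page cost that the new
IOCTL is meant to avoid by filtering in the kernel.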
The information related to pages if the page is file mapped, present and
swapped is required for the CRIU project [5][6]. The addition of the
required mask, any mask, excluded mask and return masks are also required
for the CRIU project [5].
The IOCTL returns the addresses of the pages which match the specified
masks. The page addresses are returned in struct page_region in a compact
form. max_pages supports the use case where the user only wants to get a
specific number of pages, so there is no need to find all the pages of
interest in the range when max_pages is specified. The IOCTL returns once
the maximum number of pages has been found. max_pages is optional. If
max_pages is specified, it must be greater than or equal to vec_size.
This restriction is needed to handle the worst case, where each
page_region only contains info about one page and cannot be compacted.
This is needed to emulate the Windows GetWriteWatch() syscall.
The patch series includes a detailed selftest which can be used as an
example for the uffd async wp test and PAGEMAP_IOCTL. It shows how the
interface is used as well.
[1] https://lore.kernel.org/lkml/54d4c322-cd6e-eefd-b161-2af2b56aae24@collabora…
[2] https://lore.kernel.org/all/20221220162606.1595355-1-usama.anjum@collabora.…
[3] https://lore.kernel.org/all/20221122115007.2787017-1-usama.anjum@collabora.…
[4] https://lore.kernel.org/all/Y6Hc2d+7eTKs7AiH@x1n
[5] https://lore.kernel.org/all/YyiDg79flhWoMDZB@gmail.com/
[6] https://lore.kernel.org/all/20221014134802.1361436-1-mdanylo@google.com/
Regards,
Muhammad Usama Anjum
Muhammad Usama Anjum (4):
fs/proc/task_mmu: Implement IOCTL to get and optionally clear info
about PTEs
tools headers UAPI: Update linux/fs.h with the kernel sources
mm/pagemap: add documentation of PAGEMAP_SCAN IOCTL
selftests: mm: add pagemap ioctl tests
Peter Xu (1):
userfaultfd: UFFD_FEATURE_WP_ASYNC
Documentation/admin-guide/mm/pagemap.rst | 56 +
Documentation/admin-guide/mm/userfaultfd.rst | 35 +
fs/proc/task_mmu.c | 426 ++++++
fs/userfaultfd.c | 26 +-
include/linux/userfaultfd_k.h | 29 +-
include/uapi/linux/fs.h | 53 +
include/uapi/linux/userfaultfd.h | 9 +-
mm/hugetlb.c | 32 +-
mm/memory.c | 27 +-
tools/include/uapi/linux/fs.h | 53 +
tools/testing/selftests/mm/.gitignore | 1 +
tools/testing/selftests/mm/Makefile | 4 +-
tools/testing/selftests/mm/config | 1 +
tools/testing/selftests/mm/pagemap_ioctl.c | 1301 ++++++++++++++++++
tools/testing/selftests/mm/run_vmtests.sh | 4 +
15 files changed, 2034 insertions(+), 23 deletions(-)
create mode 100644 tools/testing/selftests/mm/pagemap_ioctl.c
mode change 100644 => 100755 tools/testing/selftests/mm/run_vmtests.sh
--
2.39.2
Currently the write operation returns the count of writes whether events
are enabled or disabled. Fix this by returning -EFAULT when events are
disabled.
sunliming (3):
tracing/user_events: Fix incorrect return value for writing operation
when events are disabled
selftests/user_events: Enable the event before write_fault test in
ftrace self-test
selftests/user_events: Add test cases when event is disabled
kernel/trace/trace_events_user.c | 3 ++-
tools/testing/selftests/user_events/ftrace_test.c | 7 +++++++
2 files changed, 9 insertions(+), 1 deletion(-)
--
2.25.1
Some test cases from net/tls, net/fcnal-test and net/vrf-xfrm-tests
that rely on cryptographic functions to work and use algorithms that are
not FIPS-compliant fail in FIPS mode.
In order to allow these tests to pass in a wider set of kernels,
- for net/tls, skip the test variants that use the ChaCha20-Poly1305
and SM4 algorithms, when FIPS mode is enabled;
- for net/fcnal-test, skip the MD5 tests, when FIPS mode is enabled;
- for net/vrf-xfrm-tests, replace the algorithms that are not
FIPS-compliant with compliant ones.
Changes in v3:
- Add new commit to allow skipping test directly from test setup.
- No need to initialize static variable to zero.
- Skip tests during test setup only.
- Use the constructor attribute to set fips_enabled before entering
main().
Changes in v2:
- Add R-b tags.
- Put fips_non_compliant into the variants.
- Turn fips_enabled into a static global variable.
- Read /proc/sys/crypto/fips_enabled only once at main().
v1: https://lore.kernel.org/netdev/20230607174302.19542-1-magali.lemes@canonica…
v2: https://lore.kernel.org/netdev/20230609164324.497813-1-magali.lemes@canonic…
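The "constructor attribute" item from the v3 changelog can be sketched
as below. It is a minimal illustration rather than the code from the
patch: read_flag() is a hypothetical helper, and a missing or
unreadable file is assumed to mean that FIPS mode is off.

```c
#include <stdio.h>

/* Static storage, so it starts at zero without an initializer. */
static int fips_enabled;

/* Hypothetical helper: returns 1 if the file at path holds a non-zero
 * integer, 0 otherwise (including when the file does not exist). */
static int read_flag(const char *path)
{
	FILE *f = fopen(path, "r");
	int val = 0;

	if (!f)
		return 0;
	if (fscanf(f, "%d", &val) != 1)
		val = 0;
	fclose(f);
	return val != 0;
}

/* Runs before main(), so the flag is already set when the test
 * setup code decides whether to skip a variant. */
static void __attribute__((constructor)) fips_check(void)
{
	fips_enabled = read_flag("/proc/sys/crypto/fips_enabled");
}
```

Because the constructor runs once per process, the proc file is read a
single time no matter how many test variants consult fips_enabled.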
Magali Lemes (4):
selftests/harness: allow tests to be skipped during setup
selftests: net: tls: check if FIPS mode is enabled
selftests: net: vrf-xfrm-tests: change authentication and encryption
algos
selftests: net: fcnal-test: check if FIPS mode is enabled
tools/testing/selftests/kselftest_harness.h | 6 ++--
tools/testing/selftests/net/fcnal-test.sh | 27 +++++++++++-----
tools/testing/selftests/net/tls.c | 25 ++++++++++++++-
tools/testing/selftests/net/vrf-xfrm-tests.sh | 32 +++++++++----------
4 files changed, 62 insertions(+), 28 deletions(-)
--
2.34.1
Patches for kunit are managed in the linux-kselftest tree before being
merged into the mainline, but the MAINTAINERS section for kunit doesn't
have an entry for the tree. Add it.
Signed-off-by: SeongJae Park <sj(a)kernel.org>
---
MAINTAINERS | 1 +
1 file changed, 1 insertion(+)
diff --git a/MAINTAINERS b/MAINTAINERS
index ce5f343c1443..8a217438956b 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -11327,6 +11327,7 @@ L: linux-kselftest(a)vger.kernel.org
L: kunit-dev(a)googlegroups.com
S: Maintained
W: https://google.github.io/kunit-docs/third_party/kernel/docs/
+T: git git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest.git
F: Documentation/dev-tools/kunit/
F: include/kunit/
F: lib/kunit/
--
2.25.1
After a few years of increasing test coverage in the MPTCP selftests, we
realised [1] the last version of the selftests is supposed to run on old
kernels without issues.
Supporting older versions is not that easy for this MPTCP case: these
selftests are often validating the internals by checking packets that
are exchanged, when some MIB counters are incremented after some
actions, how connections are getting opened and closed in some cases,
etc. In other words, it is not limited to the socket interface between
the userspace and the kernelspace.
In addition to that, the current MPTCP selftests run a lot of different
sub-tests but the TAP13 protocol used in the selftests doesn't support
sub-tests: one failure in sub-tests implies that the whole selftest is
seen as failed at the end because sub-tests are not tracked. It is then
important to skip sub-tests not supported by old kernels.
To minimise the modifications and reduce the complexity to support old
versions, the idea is to look at external signs and skip the whole
selftest or just some sub-tests before starting them. This cannot be
applied in all cases.
Similar to the second part, this third one focuses on marking different
sub-tests as skipped if some MPTCP features are not supported. This
time, only the "mptcp_join.sh" selftest, the remaining one, is modified.
Several techniques are used here to achieve this task:
- Before starting some tests:
- Check if a file (sysctl knob) is present: that's what patch 12/17 is
doing for the userspace PM feature.
- Check if a required kernel symbol is present in /proc/kallsyms:
patches 9, 10, 14 and 15/17 are using this technique.
- Check if it is possible to setup a particular network environment
requiring Netfilter or TC: if the preparation step fails, the linked
sub-test is marked as skipped. Patch 5/17 is doing that.
- Check if a MIB counter is available: patches 7 and 13/17 do that.
- Check if the kernel version is newer than a specific one: patch 1/17
adds some helpers in mptcp_lib.sh to ease its use. That's not ideal
and it is only used as last resort but as mentioned above, it is
important to skip tests if they are not supported not to have the
whole selftest always being marked as failed on old kernels. Patches
11 and 17/17 are checking the kernel version. An alternative would
be to ignore the results for some sub-tests but that's not ideal
too. Note that SELFTESTS_MPTCP_LIB_NO_KVERSION_CHECK env var can be
set to 1 not to skip these tests if the running kernel doesn't have
a supported version.
- After having launched the tests:
- Adapt the expectations depending on the presence of a kernel symbol
(patch 6/17) or a kernel version (patch 8/17).
- Check if a MIB counter is available and skip the verification if
not. Patch 4/17 is using this technique.
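To make the "kernel version is newer than a specific one" technique
concrete, here is a small sketch of the comparison. The series itself
implements this as shell helpers in mptcp_lib.sh; the C below is only
an illustration of the same idea, and the function names are
hypothetical.

```c
#include <stdio.h>
#include <sys/utsname.h>

/* Returns 1 if release (e.g. "6.4.0-rc3") is at least major.minor,
 * 0 if it is older, and -1 if the string cannot be parsed. */
static int kversion_ge(const char *release, int major, int minor)
{
	int maj, min;

	if (sscanf(release, "%d.%d", &maj, &min) != 2)
		return -1;
	if (maj != major)
		return maj > major;
	return min >= minor;
}

/* Same check against the running kernel, using uname(2). */
static int running_kernel_ge(int major, int minor)
{
	struct utsname u;

	if (uname(&u) != 0)
		return -1;
	return kversion_ge(u.release, major, minor);
}
```

A caller would skip the sub-test when the result is 0 and treat -1 as
"unknown"; how that interacts with the environment variables described
here is left to the test script.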
Before skipping tests, SELFTESTS_MPTCP_LIB_EXPECT_ALL_FEATURES env var
value is checked: if it is set to 1, the test is marked as "failed"
instead of "skipped". MPTCP public CI expects to have all features
supported and it sets this env var to 1 to catch regressions in these
new checks.
Patch 2/17 uses 'iptables-legacy' if available because it might be
needed when using an older kernel not supporting iptables-nft.
Patch 3/17 adds some helpers used in the other patches mentioned to
easily mark sub-tests as skipped.
Patch 16/17 makes the MPTCP Join "listener" tests uniform with the rest
of the file: the code was imported from userspace_pm.sh without adopting
the "code style", tool usage and message printing of the MPTCP Join
selftest.
Link: https://lore.kernel.org/stable/CA+G9fYtDGpgT4dckXD-y-N92nqUxuvue_7AtDdBcHrb… [1]
Link: https://github.com/multipath-tcp/mptcp_net-next/issues/368
Signed-off-by: Matthieu Baerts <matthieu.baerts(a)tessares.net>
---
Note that it is supposed to be the last series on this subject for -net.
Also, this will conflict with commit 0639fa230a21 ("selftests: mptcp:
add explicit check for new mibs") that is currently in net-next but not
in -net. Here is the resolution. It is a bit long but you will see, it
is simple: take the version from -net with get_counter() and for the
last one, move the new call to chk_rm_tx_nr() inside the 'if' statement:
------------------- 8< -------------------
diff --cc tools/testing/selftests/net/mptcp/mptcp_join.sh
index 0ae8cafde439,85474e029784..bd47cdc2bd15
--- a/tools/testing/selftests/net/mptcp/mptcp_join.sh
+++ b/tools/testing/selftests/net/mptcp/mptcp_join.sh
@@@ -1360,27 -1265,23 +1355,25 @@@ chk_fclose_nr(
fi
printf "%-${nr_blank}s %s" " " "ctx"
- count=$(ip netns exec $ns_tx nstat -as | grep MPTcpExtMPFastcloseTx | awk '{print $2}')
- [ -z "$count" ] && count=0
- [ "$count" != "$fclose_tx" ] && extra_msg="$extra_msg,tx=$count"
- if [ "$count" != "$fclose_tx" ]; then
+ count=$(get_counter ${ns_tx} "MPTcpExtMPFastcloseTx")
+ if [ -z "$count" ]; then
+ echo -n "[skip]"
+ elif [ "$count" != "$fclose_tx" ]; then
+ extra_msg="$extra_msg,tx=$count"
echo "[fail] got $count MP_FASTCLOSE[s] TX expected $fclose_tx"
fail_test
- dump_stats=1
else
echo -n "[ ok ]"
fi
echo -n " - fclzrx"
- count=$(ip netns exec $ns_rx nstat -as | grep MPTcpExtMPFastcloseRx | awk '{print $2}')
- [ -z "$count" ] && count=0
- [ "$count" != "$fclose_rx" ] && extra_msg="$extra_msg,rx=$count"
- if [ "$count" != "$fclose_rx" ]; then
+ count=$(get_counter ${ns_rx} "MPTcpExtMPFastcloseRx")
+ if [ -z "$count" ]; then
+ echo -n "[skip]"
+ elif [ "$count" != "$fclose_rx" ]; then
+ extra_msg="$extra_msg,rx=$count"
echo "[fail] got $count MP_FASTCLOSE[s] RX expected $fclose_rx"
fail_test
- dump_stats=1
else
echo -n "[ ok ]"
fi
@@@ -1408,25 -1306,21 +1398,23 @@@ chk_rst_nr(
fi
printf "%-${nr_blank}s %s" " " "rtx"
- count=$(ip netns exec $ns_tx nstat -as | grep MPTcpExtMPRstTx | awk '{print $2}')
- [ -z "$count" ] && count=0
- if [ $count -lt $rst_tx ]; then
+ count=$(get_counter ${ns_tx} "MPTcpExtMPRstTx")
+ if [ -z "$count" ]; then
+ echo -n "[skip]"
+ elif [ $count -lt $rst_tx ]; then
echo "[fail] got $count MP_RST[s] TX expected $rst_tx"
fail_test
- dump_stats=1
else
echo -n "[ ok ]"
fi
echo -n " - rstrx "
- count=$(ip netns exec $ns_rx nstat -as | grep MPTcpExtMPRstRx | awk '{print $2}')
- [ -z "$count" ] && count=0
- if [ "$count" -lt "$rst_rx" ]; then
+ count=$(get_counter ${ns_rx} "MPTcpExtMPRstRx")
+ if [ -z "$count" ]; then
+ echo -n "[skip]"
+ elif [ "$count" -lt "$rst_rx" ]; then
echo "[fail] got $count MP_RST[s] RX expected $rst_rx"
fail_test
- dump_stats=1
else
echo -n "[ ok ]"
fi
@@@ -1441,28 -1333,23 +1427,25 @@@ chk_infi_nr(
local infi_tx=$1
local infi_rx=$2
local count
- local dump_stats
printf "%-${nr_blank}s %s" " " "itx"
- count=$(ip netns exec $ns2 nstat -as | grep InfiniteMapTx | awk '{print $2}')
- [ -z "$count" ] && count=0
- if [ "$count" != "$infi_tx" ]; then
+ count=$(get_counter ${ns2} "MPTcpExtInfiniteMapTx")
+ if [ -z "$count" ]; then
+ echo -n "[skip]"
+ elif [ "$count" != "$infi_tx" ]; then
echo "[fail] got $count infinite map[s] TX expected $infi_tx"
fail_test
- dump_stats=1
else
echo -n "[ ok ]"
fi
echo -n " - infirx"
- count=$(ip netns exec $ns1 nstat -as | grep InfiniteMapRx | awk '{print $2}')
- [ -z "$count" ] && count=0
- if [ "$count" != "$infi_rx" ]; then
+ count=$(get_counter ${ns1} "MPTcpExtInfiniteMapRx")
+ if [ -z "$count" ]; then
+ echo "[skip]"
+ elif [ "$count" != "$infi_rx" ]; then
echo "[fail] got $count infinite map[s] RX expected $infi_rx"
fail_test
- dump_stats=1
else
echo "[ ok ]"
fi
@@@ -1491,13 -1375,11 +1471,12 @@@ chk_join_nr(
fi
printf "%03u %-36s %s" "${TEST_COUNT}" "${title}" "syn"
- count=$(ip netns exec $ns1 nstat -as | grep MPTcpExtMPJoinSynRx | awk '{print $2}')
- [ -z "$count" ] && count=0
- if [ "$count" != "$syn_nr" ]; then
+ count=$(get_counter ${ns1} "MPTcpExtMPJoinSynRx")
+ if [ -z "$count" ]; then
+ echo -n "[skip]"
+ elif [ "$count" != "$syn_nr" ]; then
echo "[fail] got $count JOIN[s] syn expected $syn_nr"
fail_test
- dump_stats=1
else
echo -n "[ ok ]"
fi
@@@ -1523,13 -1403,11 +1501,12 @@@
fi
echo -n " - ack"
- count=$(ip netns exec $ns1 nstat -as | grep MPTcpExtMPJoinAckRx | awk '{print $2}')
- [ -z "$count" ] && count=0
- if [ "$count" != "$ack_nr" ]; then
+ count=$(get_counter ${ns1} "MPTcpExtMPJoinAckRx")
+ if [ -z "$count" ]; then
+ echo "[skip]"
+ elif [ "$count" != "$ack_nr" ]; then
echo "[fail] got $count JOIN[s] ack expected $ack_nr"
fail_test
- dump_stats=1
else
echo "[ ok ]"
fi
@@@ -1599,40 -1475,35 +1574,37 @@@ chk_add_nr(
timeout=$(ip netns exec $ns1 sysctl -n net.mptcp.add_addr_timeout)
printf "%-${nr_blank}s %s" " " "add"
- count=$(ip netns exec $ns2 nstat -as MPTcpExtAddAddr | grep MPTcpExtAddAddr | awk '{print $2}')
- [ -z "$count" ] && count=0
-
+ count=$(get_counter ${ns2} "MPTcpExtAddAddr")
+ if [ -z "$count" ]; then
+ echo -n "[skip]"
# if the test configured a short timeout tolerate greater then expected
# add addrs options, due to retransmissions
- if [ "$count" != "$add_nr" ] && { [ "$timeout" -gt 1 ] || [ "$count" -lt "$add_nr" ]; }; then
+ elif [ "$count" != "$add_nr" ] && { [ "$timeout" -gt 1 ] || [ "$count" -lt "$add_nr" ]; }; then
echo "[fail] got $count ADD_ADDR[s] expected $add_nr"
fail_test
- dump_stats=1
else
echo -n "[ ok ]"
fi
echo -n " - echo "
- count=$(ip netns exec $ns1 nstat -as MPTcpExtEchoAdd | grep MPTcpExtEchoAdd | awk '{print $2}')
- [ -z "$count" ] && count=0
- if [ "$count" != "$echo_nr" ]; then
+ count=$(get_counter ${ns1} "MPTcpExtEchoAdd")
+ if [ -z "$count" ]; then
+ echo -n "[skip]"
+ elif [ "$count" != "$echo_nr" ]; then
echo "[fail] got $count ADD_ADDR echo[s] expected $echo_nr"
fail_test
- dump_stats=1
else
echo -n "[ ok ]"
fi
if [ $port_nr -gt 0 ]; then
echo -n " - pt "
- count=$(ip netns exec $ns2 nstat -as | grep MPTcpExtPortAdd | awk '{print $2}')
- [ -z "$count" ] && count=0
- if [ "$count" != "$port_nr" ]; then
+ count=$(get_counter ${ns2} "MPTcpExtPortAdd")
+ if [ -z "$count" ]; then
+ echo "[skip]"
+ elif [ "$count" != "$port_nr" ]; then
echo "[fail] got $count ADD_ADDR[s] with a port-number expected $port_nr"
fail_test
- dump_stats=1
else
echo "[ ok ]"
fi
@@@ -1737,13 -1633,11 +1734,12 @@@ chk_rm_nr(
fi
printf "%-${nr_blank}s %s" " " "rm "
- count=$(ip netns exec $addr_ns nstat -as MPTcpExtRmAddr | grep MPTcpExtRmAddr | awk '{print $2}')
- [ -z "$count" ] && count=0
- if [ "$count" != "$rm_addr_nr" ]; then
+ count=$(get_counter ${addr_ns} "MPTcpExtRmAddr")
+ if [ -z "$count" ]; then
+ echo -n "[skip]"
+ elif [ "$count" != "$rm_addr_nr" ]; then
echo "[fail] got $count RM_ADDR[s] expected $rm_addr_nr"
fail_test
- dump_stats=1
else
echo -n "[ ok ]"
fi
@@@ -1767,12 -1661,12 +1763,10 @@@
else
echo "[fail] got $count RM_SUBFLOW[s] expected in range [$rm_subflow_nr:$((rm_subflow_nr*2))]"
fail_test
- dump_stats=1
fi
- return
- fi
- if [ "$count" != "$rm_subflow_nr" ]; then
+ elif [ "$count" != "$rm_subflow_nr" ]; then
echo "[fail] got $count RM_SUBFLOW[s] expected $rm_subflow_nr"
fail_test
- dump_stats=1
else
echo -n "[ ok ]"
fi
@@@ -1787,28 -1696,23 +1796,25 @@@ chk_prio_nr(
local mp_prio_nr_tx=$1
local mp_prio_nr_rx=$2
local count
- local dump_stats
printf "%-${nr_blank}s %s" " " "ptx"
- count=$(ip netns exec $ns1 nstat -as | grep MPTcpExtMPPrioTx | awk '{print $2}')
- [ -z "$count" ] && count=0
- if [ "$count" != "$mp_prio_nr_tx" ]; then
+ count=$(get_counter ${ns1} "MPTcpExtMPPrioTx")
+ if [ -z "$count" ]; then
+ echo -n "[skip]"
+ elif [ "$count" != "$mp_prio_nr_tx" ]; then
echo "[fail] got $count MP_PRIO[s] TX expected $mp_prio_nr_tx"
fail_test
- dump_stats=1
else
echo -n "[ ok ]"
fi
echo -n " - prx "
- count=$(ip netns exec $ns1 nstat -as | grep MPTcpExtMPPrioRx | awk '{print $2}')
- [ -z "$count" ] && count=0
- if [ "$count" != "$mp_prio_nr_rx" ]; then
+ count=$(get_counter ${ns1} "MPTcpExtMPPrioRx")
+ if [ -z "$count" ]; then
+ echo "[skip]"
+ elif [ "$count" != "$mp_prio_nr_rx" ]; then
echo "[fail] got $count MP_PRIO[s] RX expected $mp_prio_nr_rx"
fail_test
- dump_stats=1
else
echo "[ ok ]"
fi
@@@ -2394,12 -2290,8 +2399,13 @@@ remove_tests(
pm_nl_add_endpoint $ns2 10.0.4.2 flags subflow
run_tests $ns1 $ns2 10.0.1.1 0 -8 -8 slow
chk_join_nr 3 3 3
- chk_rm_tx_nr 0
- chk_rm_nr 0 3 simult
+
+ if mptcp_lib_kversion_ge 5.18; then
++ chk_rm_tx_nr 0
+ chk_rm_nr 0 3 simult
+ else
+ chk_rm_nr 3 3
+ fi
fi
# addresses flush
------------------- 8< -------------------
The resolved conflicts are also visible there:
https://github.com/multipath-tcp/mptcp_net-next/blob/t/DO-NOT-MERGE-git-mar…
---
Matthieu Baerts (17):
selftests: mptcp: lib: skip if not below kernel version
selftests: mptcp: join: use 'iptables-legacy' if available
selftests: mptcp: join: helpers to skip tests
selftests: mptcp: join: skip check if MIB counter not supported
selftests: mptcp: join: skip test if iptables/tc cmds fail
selftests: mptcp: join: support local endpoint being tracked or not
selftests: mptcp: join: skip Fastclose tests if not supported
selftests: mptcp: join: support RM_ADDR for used endpoints or not
selftests: mptcp: join: skip implicit tests if not supported
selftests: mptcp: join: skip backup if set flag on ID not supported
selftests: mptcp: join: skip fullmesh flag tests if not supported
selftests: mptcp: join: skip userspace PM tests if not supported
selftests: mptcp: join: skip fail tests if not supported
selftests: mptcp: join: skip MPC backups tests if not supported
selftests: mptcp: join: skip PM listener tests if not supported
selftests: mptcp: join: uniform listener tests
selftests: mptcp: join: skip mixed tests if not supported
tools/testing/selftests/net/mptcp/mptcp_join.sh | 513 +++++++++++++++---------
tools/testing/selftests/net/mptcp/mptcp_lib.sh | 26 ++
2 files changed, 354 insertions(+), 185 deletions(-)
---
base-commit: 1b8975f30abffc4f74f1ba049f9042e7d8f646cc
change-id: 20230609-upstream-net-20230610-mptcp-selftests-support-old-kernels-part-3-37aa5185e955
Best regards,
--
Matthieu Baerts <matthieu.baerts(a)tessares.net>
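
The conflict resolution above repeats one pattern: read a MIB counter, print "[skip]" when the counter is not reported at all, and only compare it with the expected value otherwise. A minimal sketch of that pattern follows. It is not the get_counter helper from mptcp_join.sh, whose body is not shown in this message; get_counter_from and check_counter are hypothetical names, and the counter text is read from stdin instead of from "ip netns exec ... nstat".

```shell
#!/bin/sh
# Hypothetical stand-in for get_counter: read nstat-style text on stdin and
# print the value of the counter named $1, or nothing if it is not listed.
get_counter_from() {
	awk -v name="$1" '$1 == name { print $2 }'
}

# Hypothetical check: $1 = counter name, $2 = expected value, stdin = counters.
check_counter() {
	count=$(get_counter_from "$1")
	if [ -z "$count" ]; then
		# Counter not reported (e.g. older kernel): skip instead of assuming 0.
		echo "[skip]"
	elif [ "$count" != "$2" ]; then
		echo "[fail] got $count $1 expected $2"
	else
		echo "[ ok ]"
	fi
}

# Illustrative sample data, not real test output.
sample='MPTcpExtMPJoinSynRx             2                  0.0'
printf '%s\n' "$sample" | check_counter MPTcpExtMPJoinSynRx 2
printf '%s\n' "$sample" | check_counter MPTcpExtMPRstTx 0
```

The first call finds the counter and matches it; the second does not find the counter and is skipped. Treating a missing counter as 0, as the removed lines did, would instead report a pass or a failure for a feature the running kernel may not expose.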
The KTAP parser I used to test the KTAP output for ftracetest was overly
robust and did not notice that the test number and pass/fail result were
reversed. Fix this.
Fixes: dbcf76390eb9 ("selftests/ftrace: Improve integration with kselftest runner")
Signed-off-by: Mark Brown <broonie(a)kernel.org>
---
tools/testing/selftests/ftrace/ftracetest | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/tools/testing/selftests/ftrace/ftracetest b/tools/testing/selftests/ftrace/ftracetest
index 2506621e75df..cb5f18c06593 100755
--- a/tools/testing/selftests/ftrace/ftracetest
+++ b/tools/testing/selftests/ftrace/ftracetest
@@ -301,7 +301,7 @@ ktaptest() { # result comment
comment="# $comment"
fi
- echo $CASENO $result $INSTANCE$CASENAME $comment
+ echo $result $CASENO $INSTANCE$CASENAME $comment
}
eval_result() { # sigval
---
base-commit: dbcf76390eb9a65d5d0c37b0cd57335218564e37
change-id: 20230609-ftrace-ktap-order-d5b64a74dc79
Best regards,
--
Mark Brown <broonie(a)kernel.org>
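
For reference, a KTAP result line puts the result first and the test number second, with an optional "# comment" at the end. The sketch below shows that ordering with a simplified, hypothetical ktap_line function; it is not the ktaptest() function from ftracetest, and the test names are made up.

```shell
#!/bin/sh
# Hypothetical helper: $1 = result ("ok" or "not ok"), $2 = test number,
# $3 = test name, $4 = optional comment.
ktap_line() {
	line="$1 $2 $3"
	if [ -n "${4:-}" ]; then
		# Comments follow the name, introduced by "#".
		line="$line # $4"
	fi
	echo "$line"
}

ktap_line "ok" 1 "basic_test"
ktap_line "not ok" 2 "probe_test" "SKIP"
```

A strict parser reads the leading "ok" or "not ok" to classify the line, so printing the number first, as the old code did, leaves it with nothing it can classify.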
Building and running the 'damon' subsuite of kselftest shows the
following issue:
selftests: damon: debugfs_attrs.sh
/sys/kernel/debug/damon not found
Adding a config file to the selftests/damon/ directory that enables the
DAMON config fragments makes the tests pass.
Fixes: b348eb7abd09 ("mm/damon: add user space selftests")
Reported-by: Naresh Kamboju <naresh.kamboju(a)linaro.org>
Signed-off-by: Anders Roxell <anders.roxell(a)linaro.org>
---
tools/testing/selftests/damon/config | 7 +++++++
1 file changed, 7 insertions(+)
create mode 100644 tools/testing/selftests/damon/config
diff --git a/tools/testing/selftests/damon/config b/tools/testing/selftests/damon/config
new file mode 100644
index 000000000000..0daf38974eb0
--- /dev/null
+++ b/tools/testing/selftests/damon/config
@@ -0,0 +1,7 @@
+CONFIG_DAMON=y
+CONFIG_DAMON_SYSFS=y
+CONFIG_DAMON_DBGFS=y
+CONFIG_DAMON_PADDR=y
+CONFIG_DAMON_VADDR=y
+CONFIG_DAMON_RECLAIM=y
+CONFIG_DAMON_LRU_SORT=y
--
2.39.2
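
One way to see whether a built kernel already satisfies a selftest config fragment like the one above is to compare the fragment against the kernel's .config. The following is a minimal sketch, assuming both files use plain "CONFIG_X=y" lines; missing_options is a hypothetical helper, not part of kselftest.

```shell
#!/bin/sh
# Hypothetical helper: print every CONFIG_ line from fragment $1 that does
# not appear verbatim in kernel config file $2.
missing_options() {
	while IFS= read -r opt; do
		case "$opt" in
		CONFIG_*)
			# -x: match the whole line, -F: treat the option as a fixed string.
			grep -qxF "$opt" "$2" || echo "$opt"
			;;
		esac
	done < "$1"
}

# Example usage from the top of a kernel tree (paths are assumptions):
#   missing_options tools/testing/selftests/damon/config .config
```

An empty result means every option in the fragment is already set as requested. A missing CONFIG_DAMON_DBGFS=y line would be consistent with the "/sys/kernel/debug/damon not found" message quoted above.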
Hi,
Commit cb2c7d1a1776 ("landlock: Support filesystem access-control")
introduced a new ARCH_EPHEMERAL_INODES configuration, only enabled for
User-Mode Linux. The reason was that UML's hostfs managed inodes in an
ephemeral way: from the kernel point of view, the same inode struct
could be created several times while being used by user space because
the kernel didn't hold references to inodes. Because Landlock (and
probably other subsystems) ties properties (i.e. access rights) to inode
objects, it wasn't possible to create rules that match inodes and then
allow specific accesses.
This patch series fixes the way UML manages inodes according to the
underlying filesystem. They are now properly handled as for other
filesystems, which makes it possible to support Landlock (and probably
other features).
Changes since v1:
https://lore.kernel.org/r/20230309165455.175131-1-mic@digikod.net
- Remove Cc stable@ (suggested by Richard).
- Add Acked-by: Richard Weinberger to the first patch.
- Split the test patch into two patches: one for the common
pseudo-filesystems, and another patch dedicated to hostfs.
- Remove CONFIG_SECURITY_PATH because it is useless for merge_config.sh
- Move CONFIG_HOSTFS to a new config.um file.
- Fix commit message spelling and test warnings.
- Improve prepare_layout_opt() with remove_path() call to avoid
cascading errors when some tested filesystems are not supported.
- Remove cgroup-v1 tests because this filesystem cannot really be
mounted several times.
- Add test coverage with and without kernel debug code, according to
GCC 12 and GCC 13.
Regards,
Mickaël Salaün (6):
hostfs: Fix ephemeral inodes
selftests/landlock: Don't create useless file layouts
selftests/landlock: Add supports_filesystem() helper
selftests/landlock: Make mounts configurable
selftests/landlock: Add tests for pseudo filesystems
selftests/landlock: Add hostfs tests
arch/Kconfig | 7 -
arch/um/Kconfig | 1 -
fs/hostfs/hostfs.h | 1 +
fs/hostfs/hostfs_kern.c | 213 ++++++------
fs/hostfs/hostfs_user.c | 1 +
security/landlock/Kconfig | 2 +-
tools/testing/selftests/landlock/config | 9 +-
tools/testing/selftests/landlock/config.um | 1 +
tools/testing/selftests/landlock/fs_test.c | 387 +++++++++++++++++++--
9 files changed, 478 insertions(+), 144 deletions(-)
create mode 100644 tools/testing/selftests/landlock/config.um
base-commit: 858fd168a95c5b9669aac8db6c14a9aeab446375
--
2.41.0
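
The changelog above mentions avoiding cascading errors when a tested filesystem is not supported. The series does this in C with a supports_filesystem() helper whose implementation is not shown in this cover letter. As a rough shell analogue only, assuming the check can be based on a listing in the format of /proc/filesystems, the idea can be sketched as follows; fs_listed is a hypothetical name.

```shell
#!/bin/sh
# Hypothetical helper: succeed if filesystem type $1 is the last field of a
# line in listing file $2 (defaults to /proc/filesystems). Lines look like
# "nodev<TAB>sysfs" or "<TAB>ext4".
fs_listed() {
	awk -v name="$1" '$NF == name { found = 1 } END { exit !found }' \
		"${2:-/proc/filesystems}"
}

# Example usage: only run the hostfs checks when hostfs is available.
if fs_listed hostfs 2>/dev/null; then
	echo "hostfs supported, running tests"
else
	echo "hostfs not supported, skipping"
fi
```

Skipping up front keeps one unsupported filesystem from turning into a chain of unrelated mount and cleanup failures.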
Hi,
Commit cb2c7d1a1776 ("landlock: Support filesystem access-control")
introduced a new ARCH_EPHEMERAL_INODES configuration, only enabled for
User-Mode Linux. The reason was that UML's hostfs managed inodes in an
ephemeral way: from the kernel point of view, the same inode struct
could be created several times while being used by user space because
the kernel didn't hold references to inodes. Because Landlock (and
probably other subsystems) ties properties (i.e. access rights) to inode
objects, it wasn't possible to create rules that match inodes and then
allow specific accesses.
This patch series fixes the way UML manages inodes according to the
underlying filesystem. They are now properly handled as for other
filesystems, which makes it possible to support Landlock (and probably
other features).
Backporting these patches requires some selftest harness patches
backports too.
Regards,
Mickaël Salaün (5):
hostfs: Fix ephemeral inodes
selftests/landlock: Don't create useless file layouts
selftests/landlock: Add supports_filesystem() helper
selftests/landlock: Make mounts configurable
selftests/landlock: Add tests for pseudo filesystems
arch/Kconfig | 7 -
arch/um/Kconfig | 1 -
fs/hostfs/hostfs.h | 1 +
fs/hostfs/hostfs_kern.c | 213 ++++++------
fs/hostfs/hostfs_user.c | 1 +
security/landlock/Kconfig | 2 +-
tools/testing/selftests/landlock/config | 8 +-
tools/testing/selftests/landlock/fs_test.c | 381 +++++++++++++++++++--
8 files changed, 472 insertions(+), 142 deletions(-)
base-commit: fe15c26ee26efa11741a7b632e9f23b01aca4cc6
--
2.39.2
It is wrong to include unprocessed user header files directly. They are
processed into "<source_tree>/usr/include" by running "make headers", and
the kselftest makefiles include them in selftests automatically with the
help of the KHDR_INCLUDES variable. These headers should always be built
before building kselftests.
Fixes: 07115fcc15b4 ("selftests/mm: add new selftests for KSM")
Signed-off-by: Muhammad Usama Anjum <usama.anjum(a)collabora.com>
---
tools/testing/selftests/mm/Makefile | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/tools/testing/selftests/mm/Makefile b/tools/testing/selftests/mm/Makefile
index 95acb099315e..e6cd60ca9e48 100644
--- a/tools/testing/selftests/mm/Makefile
+++ b/tools/testing/selftests/mm/Makefile
@@ -29,7 +29,7 @@ MACHINE ?= $(shell echo $(uname_M) | sed -e 's/aarch64.*/arm64/' -e 's/ppc64.*/p
# LDLIBS.
MAKEFLAGS += --no-builtin-rules
-CFLAGS = -Wall -I $(top_srcdir) -I $(top_srcdir)/tools/include/uapi $(EXTRA_CFLAGS) $(KHDR_INCLUDES)
+CFLAGS = -Wall -I $(top_srcdir) $(EXTRA_CFLAGS) $(KHDR_INCLUDES)
LDLIBS = -lrt -lpthread
TEST_GEN_PROGS = cow
--
2.39.2
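
The ordering described in this commit message can be checked before a build. The sketch below is an illustration rather than part of the patch: uapi_headers_ready is a hypothetical helper, and it assumes that a populated "usr/include/linux" directory means "make headers" has been run in that tree.

```shell
#!/bin/sh
# Hypothetical helper: succeed if kernel tree $1 already contains processed
# UAPI headers under usr/include.
uapi_headers_ready() {
	[ -d "$1/usr/include/linux" ]
}

# Example usage from the top of a kernel source tree:
#   uapi_headers_ready . || make headers
#   make -C tools/testing/selftests/mm
```

With the headers in place, the selftests build against the processed copies through KHDR_INCLUDES, so the extra "-I $(top_srcdir)/tools/include/uapi" removed by this patch is not needed.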