- Linux-kselftest-mirror - lists.linaro.org

[PATCH v5 0/7] use per-vma locks for /proc/pid/maps reads and PROCMAP_QUERY

by Suren Baghdasaryan

Reading /proc/pid/maps requires read-locking mmap_lock, which prevents any other task from concurrently modifying the address space. This guarantees coherent reporting of virtual address ranges, however it can block important updates from happening. Oftentimes /proc/pid/maps readers are low priority monitoring tasks, and them blocking high priority tasks results in priority inversion.

Locking the entire address space is required to present a fully coherent picture of the address space, however even the current implementation does not strictly guarantee that, because it outputs vmas in page-size chunks and drops mmap_lock in between each chunk. Address space modifications are possible while mmap_lock is dropped, and userspace reading the content is expected to deal with possible concurrent address space modifications. Considering these relaxed rules, holding mmap_lock is not strictly needed as long as we can guarantee that a concurrently modified vma is reported either in its original form or after it was modified.

This patchset switches from holding mmap_lock while reading /proc/pid/maps to taking per-vma locks as we walk the vma tree. This reduces the contention with tasks modifying the address space because they would have to contend for the same vma as opposed to the entire address space. The same is done for the PROCMAP_QUERY ioctl, which locks only the vma that fell into the requested range instead of the entire address space. The previous version of this patchset [1] tried to perform /proc/pid/maps reading under RCU, however its implementation is quite complex and the results are worse than the new version because it still relied on mmap_lock speculation, which retries if any part of the address space gets modified. The new implementation is both simpler and results in less contention. Note that a similar approach would not work for /proc/pid/smaps reading as it also walks the page table and that's not RCU-safe.

Paul McKenney designed a test [2] to measure mmap/munmap latencies while concurrently reading /proc/pid/maps. The test has a pair of processes scanning /proc/PID/maps, and another process unmapping and remapping 4K pages from a 128MB range of anonymous memory. At the end of each 10 second run, the latency of each mmap() or munmap() operation is measured, and for each run the maximum and mean latency is printed. The map/unmap process is started first, its PID is passed to the scanners, and then the map/unmap process waits until both scanners are running before starting its timed test. The scanners keep scanning until the specified /proc/PID/maps file disappears.

This test registered close to 10x improvement in update latencies:

Before the change:
./run-proc-vs-map.sh --nsamples 100 --rawdata -- --busyduration 2
    0.011     0.008     0.455
    0.011     0.008     0.472
    0.011     0.008     0.535
    0.011     0.009     0.545
    ...
    0.011     0.014     2.875
    0.011     0.014     2.913
    0.011     0.014     3.007
    0.011     0.015     3.018

After the change:
./run-proc-vs-map.sh --nsamples 100 --rawdata -- --busyduration 2
    0.006     0.005     0.036
    0.006     0.005     0.039
    0.006     0.005     0.039
    0.006     0.005     0.039
    ...
    0.006     0.006     0.403
    0.006     0.006     0.474
    0.006     0.006     0.479
    0.006     0.006     0.498

The patchset also adds a number of tests to check for /proc/pid/maps data coherency. They are designed to detect any unexpected data tearing while performing some common address space modifications (vma split, resize and remap). Even before these changes, reading /proc/pid/maps might return inconsistent data because the file is read page-by-page with mmap_lock being dropped between the pages. An example of user-visible inconsistency can be that the same vma is printed twice: once before it was modified and then after the modifications. For example, if a vma was extended, it might be found and reported twice. What is not expected is to see a gap where there should have been a vma both before and after modification.
This patchset increases the chances of such tearing, therefore it's even more important now to test for unexpected inconsistencies.

In [3] Lorenzo identified the following possible vma merging/splitting scenarios:

Merges with changes to existing vmas:
1. Merge both - mapping a vma over another one and between two vmas which can be merged after this replacement;
2. Merge left full - mapping a vma at the end of an existing one and completely over its right neighbor;
3. Merge left partial - mapping a vma at the end of an existing one and partially over its right neighbor;
4. Merge right full - mapping a vma before the start of an existing one and completely over its left neighbor;
5. Merge right partial - mapping a vma before the start of an existing one and partially over its left neighbor;

Merges without changes to existing vmas:
6. Merge both - mapping a vma into a gap between two vmas which can be merged after the insertion;
7. Merge left - mapping a vma at the end of an existing one;
8. Merge right - mapping a vma before the start of an existing one;

Splits:
9. Split with new vma at the lower address;
10. Split with new vma at the higher address;

If such merges or splits happen concurrently with the /proc/maps reading we might report a vma twice, once before the modification and once after it is modified:

Case 1 might report overwritten and previous vma along with the final merged vma;
Case 2 might report previous and the final merged vma;
Case 3 might cause us to retry once we detect the temporary gap caused by shrinking of the right neighbor;
Case 4 might report overwritten and the final merged vma;
Case 5 might cause us to retry once we detect the temporary gap caused by shrinking of the left neighbor;
Case 6 might report previous vma and the gap along with the final merged vma;
Case 7 might report previous and the final merged vma;
Case 8 might report the original gap and the final merged vma covering the gap;
Case 9 might cause us to retry once we detect the temporary gap caused by shrinking of the original vma at the vma start;
Case 10 might cause us to retry once we detect the temporary gap caused by shrinking of the original vma at the vma end;

In all these cases the retry mechanism prevents us from reporting possible temporary gaps.

Changes from v4 [4]:
- refactored trylock_vma() and other locking parts into mmap_lock.c, per Lorenzo
- renamed {lock|unlock}_content() into {lock|unlock}_vma_range(), per Lorenzo
- added clarifying comments for sentinels, per Lorenzo
- introduced is_sentinel_pos() helper function
- fixed position reset logic when last_addr is a sentinel, per Lorenzo
- added Acked-by to the last patch, per Andrii Nakryiko

[1] https://lore.kernel.org/all/20250418174959.1431962-1-surenb@google.com/
[2] https://github.com/paulmckrcu/proc-mmap_sem-test
[3] https://lore.kernel.org/all/e1863f40-39ab-4e5b-984a-c48765ffde1c@lucifer.lo…
[4] https://lore.kernel.org/all/20250604231151.799834-1-surenb@google.com/

Suren Baghdasaryan (7):
  selftests/proc: add /proc/pid/maps tearing from vma split test
  selftests/proc: extend /proc/pid/maps tearing test to include vma resizing
  selftests/proc: extend /proc/pid/maps tearing test to include vma remapping
  selftests/proc: test PROCMAP_QUERY ioctl while vma is concurrently modified
  selftests/proc: add verbose more for tests to facilitate debugging
  mm/maps: read proc/pid/maps under per-vma lock
  mm/maps: execute PROCMAP_QUERY ioctl under per-vma locks

 fs/proc/internal.h                         |   5 +
 fs/proc/task_mmu.c                         | 179 ++++-
 include/linux/mmap_lock.h                  |  11 +
 mm/mmap_lock.c                             |  88 +++
 tools/testing/selftests/proc/proc-pid-vm.c | 793 ++++++++++++++++++++-
 5 files changed, 1053 insertions(+), 23 deletions(-)

base-commit: 0b2a863368fb0cf674b40925c55dc8898c5a33af
-- 
2.50.0.714.g196bf9f422-goog
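
The "same vma printed twice" inconsistency described in the cover letter is something a userspace reader can look for itself. Below is a minimal sketch, not code from the patchset: `count_overlapping_entries()` is a hypothetical helper that assumes the usual "start-end perms ..." line layout of /proc/pid/maps and counts entries that begin below the end of the previous entry.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Hypothetical helper, not part of the patchset. Scans maps-style text
 * ("start-end perms ..." per line, addresses in hex) and counts entries
 * that start below the end of the previous entry.
 */
static int count_overlapping_entries(FILE *f)
{
	char line[1024];
	uint64_t start, end, prev_end = 0;
	int overlaps = 0;

	assert(f != NULL);
	while (fgets(line, sizeof(line), f)) {
		/* Skip anything that does not begin with "start-end". */
		if (sscanf(line, "%" SCNx64 "-%" SCNx64, &start, &end) != 2)
			continue;
		if (start < prev_end)
			overlaps++;
		prev_end = end;
	}
	return overlaps;
}
```

A caller would pass a stream opened on /proc/PID/maps. A non-zero result suggests the address space changed while it was being read, which is the case readers are expected to tolerate.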

2 months, 2 weeks


Re: [PATCH 1/6] mm/selftests: Fix virtual_address_range test issues.

by Donet Tom

On Tue, Jun 24, 2025 at 11:45:09AM +0530, Dev Jain wrote:
> > On 23/06/25 11:02 pm, Donet Tom wrote:
> > On Mon, Jun 23, 2025 at 10:23:02AM +0530, Dev Jain wrote:
> > > On 21/06/25 11:25 pm, Donet Tom wrote:
> > > > On Fri, Jun 20, 2025 at 08:15:25PM +0530, Dev Jain wrote:
> > > > > On 19/06/25 1:53 pm, Donet Tom wrote:
> > > > > > On Wed, Jun 18, 2025 at 08:13:54PM +0530, Dev Jain wrote:
> > > > > > > On 18/06/25 8:05 pm, Lorenzo Stoakes wrote:
> > > > > > > > On Wed, Jun 18, 2025 at 07:47:18PM +0530, Dev Jain wrote:
> > > > > > > > > On 18/06/25 7:37 pm, Lorenzo Stoakes wrote:
> > > > > > > > > > On Wed, Jun 18, 2025 at 07:28:16PM +0530, Dev Jain wrote:
> > > > > > > > > > > On 18/06/25 5:27 pm, Lorenzo Stoakes wrote:
> > > > > > > > > > > > On Wed, Jun 18, 2025 at 05:15:50PM +0530, Dev Jain wrote:
> > > > > > > > > > > > Are you accounting for sys.max_map_count? If not, then you'll be hitting that
> > > > > > > > > > > > first.
> > > > > > > > > > > run_vmtests.sh will run the test in overcommit mode so that won't be an issue.
> > > > > > > > > > Umm, what? You mean overcommit all mode, and that has no bearing on the max
> > > > > > > > > > mapping count check.
> > > > > > > > > >
> > > > > > > > > > In do_mmap():
> > > > > > > > > >
> > > > > > > > > > 	/* Too many mappings? */
> > > > > > > > > > 	if (mm->map_count > sysctl_max_map_count)
> > > > > > > > > > 		return -ENOMEM;
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > As well as numerous other checks in mm/vma.c.
> > > > > > > > > Ah sorry, didn't look at the code properly just assumed that overcommit_always meant overriding
> > > > > > > > > this.
> > > > > > > > No problem! It's hard to be aware of everything in mm :)
> > > > > > > >
> > > > > > > > > > I'm not sure why an overcommit toggle is even necessary when you could use
> > > > > > > > > > MAP_NORESERVE or simply map PROT_NONE to avoid the OVERCOMMIT_GUESS limits?
> > > > > > > > > >
> > > > > > > > > > I'm pretty confused as to what this test is really achieving honestly. This
> > > > > > > > > > isn't a useful way of asserting mmap() behaviour as far as I can tell.
> > > > > > > > > Well, seems like a useful way to me at least : ) Not sure if you are in the mood
> > > > > > > > > to discuss that but if you'd like me to explain from start to end what the test
> > > > > > > > > is doing, I can do that : )
> > > > > > > > >
> > > > > > > > I just don't have time right now, I guess I'll have to come back to it
> > > > > > > > later... it's not the end of the world for it to be iffy in my view as long as
> > > > > > > > it passes, but it might just not be of great value.
> > > > > > > >
> > > > > > > > Philosophically I'd rather we didn't assert internal implementation details like
> > > > > > > > where we place mappings in userland memory. At no point do we promise to not
> > > > > > > > leave larger gaps if we feel like it :)
> > > > > > > You have a fair point. Anyhow a debate for another day.
> > > > > > >
> > > > > > > > I'm guessing, reading more, the _real_ test here is some mathematical assertion
> > > > > > > > about layout from HIGH_ADDR_SHIFT -> end of address space when using hints.
> > > > > > > >
> > > > > > > > But again I'm not sure that achieves much and again also is asserting internal
> > > > > > > > implementation details.
> > > > > > > >
> > > > > > > > Correct behaviour of this kind of thing probably better belongs to tests in the
> > > > > > > > userland VMA testing I'd say.
> > > > > > > >
> > > > > > > > Sorry I don't mean to do down work you've done before, just giving an honest
> > > > > > > > technical appraisal!
> > > > > > > Nah, it will be rather hilarious to see it all go down the drain xD
> > > > > > >
> > > > > > > > Anyway don't let this block work to fix the test if it's failing. We can revisit
> > > > > > > > this later.
> > > > > > > Sure. @Aboorva and Donet, I still believe that the correct approach is to elide
> > > > > > > the gap check at the crossing boundary. What do you think?
> > > > > > >
> > > > > > One problem I am seeing with this approach is that, since the hint address
> > > > > > is generated randomly, the VMAs are also being created at randomly based on
> > > > > > the hint address.So, for the VMAs created at high addresses, we cannot guarantee
> > > > > > that the gaps between them will be aligned to MAP_CHUNK_SIZE.
> > > > > >
> > > > > > High address VMAs
> > > > > > -----------------
> > > > > > 1000000000000-1000040000000 r--p 00000000 00:00 0
> > > > > > 2000000000000-2000040000000 r--p 00000000 00:00 0
> > > > > > 4000000000000-4000040000000 r--p 00000000 00:00 0
> > > > > > 8000000000000-8000040000000 r--p 00000000 00:00 0
> > > > > > e80009d260000-fffff9d260000 r--p 00000000 00:00 0
> > > > > >
> > > > > > I have a different approach to solve this issue.
> > > > > It is really weird that such a large amount of VA space
> > > > > is left between the two VMAs yet mmap is failing.
> > > > >
> > > > >
> > > > >
> > > > > Can you please do the following:
> > > > > set /proc/sys/vm/max_map_count to the highest value possible.
> > > > > If running without run_vmtests.sh, set /proc/sys/vm/overcommit_memory to 1.
> > > > > In validate_complete_va_space:
> > > > >
> > > > > 	if (start_addr >= HIGH_ADDR_MARK && found == false) {
> > > > > 		found = true;
> > > > > 		continue;
> > > > > 	}
> > > > Thanks Dev for the suggestion. I set max_map_count and set overcommit
> > > > memory to 1, added this code change as well, and then tried. Still, the
> > > > test is failing
> > > >
> > > > > where found is initialized to false. This will skip the check
> > > > > for the boundary.
> > > > >
> > > > > After this can you tell whether the test is still failing.
> > > > >
> > > > > Also can you give me the complete output of proc/pid/maps
> > > > > after putting a sleep at the end of the test.
> > > > >
> > > > on powerpc support DEFAULT_MAP_WINDOW is 128TB and with
> > > > total address space size is 4PB With hint it can map upto
> > > > 4PB. Since the hint addres is random in this test random hing VMAs
> > > > are getting created. IIUC this is expected only.
> > > >
> > > >
> > > > 10000000-10010000 r-xp 00000000 fd:05 134226638 /home/donet/linux/tools/testing/selftests/mm/virtual_address_range
> > > > 10010000-10020000 r--p 00000000 fd:05 134226638 /home/donet/linux/tools/testing/selftests/mm/virtual_address_range
> > > > 10020000-10030000 rw-p 00010000 fd:05 134226638 /home/donet/linux/tools/testing/selftests/mm/virtual_address_range
> > > > 30000000-10030000000 r--p 00000000 00:00 0 [anon:virtual_address_range]
> > > > 10030770000-100307a0000 rw-p 00000000 00:00 0 [heap]
> > > > 1004f000000-7fff8f000000 r--p 00000000 00:00 0 [anon:virtual_address_range]
> > > > 7fff8faf0000-7fff8fe00000 rw-p 00000000 00:00 0
> > > > 7fff8fe00000-7fff90030000 r-xp 00000000 fd:00 792355 /usr/lib64/libc.so.6
> > > > 7fff90030000-7fff90040000 r--p 00230000 fd:00 792355 /usr/lib64/libc.so.6
> > > > 7fff90040000-7fff90050000 rw-p 00240000 fd:00 792355 /usr/lib64/libc.so.6
> > > > 7fff90050000-7fff90130000 r-xp 00000000 fd:00 792358 /usr/lib64/libm.so.6
> > > > 7fff90130000-7fff90140000 r--p 000d0000 fd:00 792358 /usr/lib64/libm.so.6
> > > > 7fff90140000-7fff90150000 rw-p 000e0000 fd:00 792358 /usr/lib64/libm.so.6
> > > > 7fff90160000-7fff901a0000 r--p 00000000 00:00 0 [vvar]
> > > > 7fff901a0000-7fff901b0000 r-xp 00000000 00:00 0 [vdso]
> > > > 7fff901b0000-7fff90200000 r-xp 00000000 fd:00 792351 /usr/lib64/ld64.so.2
> > > > 7fff90200000-7fff90210000 r--p 00040000 fd:00 792351 /usr/lib64/ld64.so.2
> > > > 7fff90210000-7fff90220000 rw-p 00050000 fd:00 792351 /usr/lib64/ld64.so.2
> > > > 7fffc9770000-7fffc9880000 rw-p 00000000 00:00 0 [stack]
> > > > 1000000000000-1000040000000 r--p 00000000 00:00 0 [anon:virtual_address_range]
> > > > 2000000000000-2000040000000 r--p 00000000 00:00 0 [anon:virtual_address_range]
> > > > 4000000000000-4000040000000 r--p 00000000 00:00 0 [anon:virtual_address_range]
> > > > 8000000000000-8000040000000 r--p 00000000 00:00 0 [anon:virtual_address_range]
> > > > eb95410220000-fffff90220000 r--p 00000000 00:00 0 [anon:virtual_address_range]
> > > >
> > > >
> > > >
> > > >
> > > > If I give the hint address serially from 128TB then the address
> > > > space is contigous and gap is also MAP_SIZE, the test is passing.
> > > >
> > > > 10000000-10010000 r-xp 00000000 fd:05 134226638 /home/donet/linux/tools/testing/selftests/mm/virtual_address_range
> > > > 10010000-10020000 r--p 00000000 fd:05 134226638 /home/donet/linux/tools/testing/selftests/mm/virtual_address_range
> > > > 10020000-10030000 rw-p 00010000 fd:05 134226638 /home/donet/linux/tools/testing/selftests/mm/virtual_address_range
> > > > 33000000-10033000000 r--p 00000000 00:00 0 [anon:virtual_address_range]
> > > > 10033380000-100333b0000 rw-p 00000000 00:00 0 [heap]
> > > > 1006f0f0000-10071000000 rw-p 00000000 00:00 0
> > > > 10071000000-7fffb1000000 r--p 00000000 00:00 0 [anon:virtual_address_range]
> > > > 7fffb15d0000-7fffb1800000 r-xp 00000000 fd:00 792355 /usr/lib64/libc.so.6
> > > > 7fffb1800000-7fffb1810000 r--p 00230000 fd:00 792355 /usr/lib64/libc.so.6
> > > > 7fffb1810000-7fffb1820000 rw-p 00240000 fd:00 792355 /usr/lib64/libc.so.6
> > > > 7fffb1820000-7fffb1900000 r-xp 00000000 fd:00 792358 /usr/lib64/libm.so.6
> > > > 7fffb1900000-7fffb1910000 r--p 000d0000 fd:00 792358 /usr/lib64/libm.so.6
> > > > 7fffb1910000-7fffb1920000 rw-p 000e0000 fd:00 792358 /usr/lib64/libm.so.6
> > > > 7fffb1930000-7fffb1970000 r--p 00000000 00:00 0 [vvar]
> > > > 7fffb1970000-7fffb1980000 r-xp 00000000 00:00 0 [vdso]
> > > > 7fffb1980000-7fffb19d0000 r-xp 00000000 fd:00 792351 /usr/lib64/ld64.so.2
> > > > 7fffb19d0000-7fffb19e0000 r--p 00040000 fd:00 792351 /usr/lib64/ld64.so.2
> > > > 7fffb19e0000-7fffb19f0000 rw-p 00050000 fd:00 792351 /usr/lib64/ld64.so.2
> > > > 7fffc5470000-7fffc5580000 rw-p 00000000 00:00 0 [stack]
> > > > 800000000000-2aab000000000 r--p 00000000 00:00 0 [anon:virtual_address_range]
> > > >
> > > >
> > > Thank you for this output. I can't wrap my head around why this behaviour changes
> > > when you generate the hint sequentially. The mmap() syscall is supposed to do the
> > > following (irrespective of high VA space or not) - if the allocation at the hint
> > Yes, it is working as expected. On PowerPC, the DEFAULT_MAP_WINDOW is
> > 128TB, and the system can map up to 4PB.
> >
> > In the test, the first mmap call maps memory up to 128TB without any
> > hint, so the VMAs are created below the 128TB boundary.
> >
> > In the second mmap call, we provide a hint starting from 256TB, and
> > the hint address is generated randomly above 256TB. The mappings are
> > correctly created at these hint addresses. Since the hint addresses
> > are random, the resulting VMAs are also created at random locations.
> >
> > So, what I tried is: mapping from 0 to 128TB without any hint, and
> > then for the second mmap, instead of starting the hint from 256TB, I
> > started from 128TB. Instead of using random hint addresses, I used
> > sequential hint addresses from 128TB up to 512TB. With this change,
> > the VMAs are created in order, and the test passes.
> >
> > 800000000000-2aab000000000 r--p 00000000 00:00 0 128TB to 512TB VMA
> >
> > I think we will see same behaviour on x86 with X86_FEATURE_LA57.
> >
> > I will send the updated patch in V2.
> > Since you say it fails on both radix and hash, it means that the generic
> > code path is failing. I see that on my system, when I run the test with
> > LPA2 config, write() fails with errno set to -ENOMEM. Can you apply
> > the following diff and check whether the test fails still. Doing this
> > fixed it for arm64.
> > diff --git a/tools/testing/selftests/mm/virtual_address_range.c b/tools/testing/selftests/mm/virtual_address_range.c
> > index b380e102b22f..3032902d01f2 100644
> > --- a/tools/testing/selftests/mm/virtual_address_range.c
> > +++ b/tools/testing/selftests/mm/virtual_address_range.c
> > @@ -173,10 +173,6 @@ static int validate_complete_va_space(void)
> >  		 */
> >  		hop = 0;
> >  		while (start_addr + hop < end_addr) {
> > -			if (write(fd, (void *)(start_addr + hop), 1) != 1)
> > -				return 1;
> > -			lseek(fd, 0, SEEK_SET);
> > -
> >  			if (is_marked_vma(vma_name))
> >  				munmap((char *)(start_addr + hop), MAP_CHUNK_SIZE);
>

Even with this change, the test is still failing. In this case, we are allocating physical memory and writing into it, but our issue seems to be with the gap between VMAs, so I believe this might not be directly related.

I will send the next revision, where the test passes and no issues are observed.

Just curious: with LPA2, is the second mmap() call successful? And are the VMAs being created at the hint address as expected?

> >
> > > addr succeeds, then all is well, otherwise, do a top-down search for a large
> > > enough gap. I am not aware of the nuances in powerpc but I really am suspecting
> > > a bug in powerpc mmap code. Can you try to do some tracing - which function
> > > eventually fails to find the empty gap?
> > >
> > > Through my limited code tracing - we should end up in slice_find_area_topdown,
> > > then we ask the generic code to find the gap using vm_unmapped_area. So I
> > > suspect something is happening between this, probably slice_scan_available().
> > >
> > > > > > From 0 to 128TB, we map memory directly without using any hint. For the range above
> > > > > > 256TB up to 512TB, we perform the mapping using hint addresses. In the current test,
> > > > > > we use random hint addresses, but I have modified it to generate hint addresses linearly
> > > > > > starting from 128TB.
> > > > > >
> > > > > > With this change:
> > > > > >
> > > > > > The 0–128TB range is mapped without hints and verified accordingly.
> > > > > >
> > > > > > The 128TB–512TB range is mapped using linear hint addresses and then verified.
> > > > > >
> > > > > > Below are the VMAs obtained with this approach:
> > > > > >
> > > > > > 10000000-10010000 r-xp 00000000 fd:05 135019531
> > > > > > 10010000-10020000 r--p 00000000 fd:05 135019531
> > > > > > 10020000-10030000 rw-p 00010000 fd:05 135019531
> > > > > > 20000000-10020000000 r--p 00000000 00:00 0
> > > > > > 10020800000-10020830000 rw-p 00000000 00:00 0
> > > > > > 1004bcf0000-1004c000000 rw-p 00000000 00:00 0
> > > > > > 1004c000000-7fff8c000000 r--p 00000000 00:00 0
> > > > > > 7fff8c130000-7fff8c360000 r-xp 00000000 fd:00 792355
> > > > > > 7fff8c360000-7fff8c370000 r--p 00230000 fd:00 792355
> > > > > > 7fff8c370000-7fff8c380000 rw-p 00240000 fd:00 792355
> > > > > > 7fff8c380000-7fff8c460000 r-xp 00000000 fd:00 792358
> > > > > > 7fff8c460000-7fff8c470000 r--p 000d0000 fd:00 792358
> > > > > > 7fff8c470000-7fff8c480000 rw-p 000e0000 fd:00 792358
> > > > > > 7fff8c490000-7fff8c4d0000 r--p 00000000 00:00 0
> > > > > > 7fff8c4d0000-7fff8c4e0000 r-xp 00000000 00:00 0
> > > > > > 7fff8c4e0000-7fff8c530000 r-xp 00000000 fd:00 792351
> > > > > > 7fff8c530000-7fff8c540000 r--p 00040000 fd:00 792351
> > > > > > 7fff8c540000-7fff8c550000 rw-p 00050000 fd:00 792351
> > > > > > 7fff8d000000-7fffcd000000 r--p 00000000 00:00 0
> > > > > > 7fffe9c80000-7fffe9d90000 rw-p 00000000 00:00 0
> > > > > > 800000000000-2000000000000 r--p 00000000 00:00 0 -> High Address (128TB to 512TB)
> > > > > >
> > > > > > diff --git a/tools/testing/selftests/mm/virtual_address_range.c b/tools/testing/selftests/mm/virtual_address_range.c
> > > > > > index 4c4c35eac15e..0be008cba4b0 100644
> > > > > > --- a/tools/testing/selftests/mm/virtual_address_range.c
> > > > > > +++ b/tools/testing/selftests/mm/virtual_address_range.c
> > > > > > @@ -56,21 +56,21 @@
> > > > > >  #ifdef __aarch64__
> > > > > >  #define HIGH_ADDR_MARK ADDR_MARK_256TB
> > > > > > -#define HIGH_ADDR_SHIFT 49
> > > > > > +#define HIGH_ADDR_SHIFT 48
> > > > > >  #define NR_CHUNKS_LOW NR_CHUNKS_256TB
> > > > > >  #define NR_CHUNKS_HIGH NR_CHUNKS_3840TB
> > > > > >  #else
> > > > > >  #define HIGH_ADDR_MARK ADDR_MARK_128TB
> > > > > > -#define HIGH_ADDR_SHIFT 48
> > > > > > +#define HIGH_ADDR_SHIFT 47
> > > > > >  #define NR_CHUNKS_LOW NR_CHUNKS_128TB
> > > > > >  #define NR_CHUNKS_HIGH NR_CHUNKS_384TB
> > > > > >  #endif
> > > > > > -static char *hint_addr(void)
> > > > > > +static char *hint_addr(int hint)
> > > > > >  {
> > > > > > -	int bits = HIGH_ADDR_SHIFT + rand() % (63 - HIGH_ADDR_SHIFT);
> > > > > > +	unsigned long addr = ((1UL << HIGH_ADDR_SHIFT) + (hint * MAP_CHUNK_SIZE));
> > > > > > -	return (char *) (1UL << bits);
> > > > > > +	return (char *) (addr);
> > > > > >  }
> > > > > >  static void validate_addr(char *ptr, int high_addr)
> > > > > > @@ -217,7 +217,7 @@ int main(int argc, char *argv[])
> > > > > >  	}
> > > > > >  	for (i = 0; i < NR_CHUNKS_HIGH; i++) {
> > > > > > -		hint = hint_addr();
> > > > > > +		hint = hint_addr(i);
> > > > > >  		hptr[i] = mmap(hint, MAP_CHUNK_SIZE, PROT_READ,
> > > > > >  			       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
> > > > > >
> > > > > >
> > > > > >
> > > > > > Can we fix it this way?
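
The difference between the two hint schemes discussed in this thread comes down to address arithmetic, which can be sketched on its own. This is an illustration rather than the selftest code: `hint_power_of_two()` and `hint_sequential()` are hypothetical names, and the 1GB `MAP_CHUNK_SIZE` is an assumption inferred from the 0x40000000-sized VMAs in the listings above.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: 1GB chunks, as the 0x40000000-sized VMAs above suggest. */
#define MAP_CHUNK_SIZE (1ULL << 30)

/* Random-style hint of the current test: a single set bit at position `bits`. */
static uint64_t hint_power_of_two(unsigned int bits)
{
	assert(bits < 64);
	return 1ULL << bits;
}

/* Sequential hint of the proposed change: base address plus index * chunk size. */
static uint64_t hint_sequential(unsigned int shift, uint64_t index)
{
	assert(shift < 64);
	return (1ULL << shift) + index * MAP_CHUNK_SIZE;
}
```

With a shift of 47, index 0 gives 0x800000000000 (128TB) and index 393216 (384TB worth of 1GB chunks) gives 0x2000000000000 (512TB), matching the single 800000000000-2000000000000 VMA in the listing. Power-of-two hints such as bits 48 and 49 land 256TB apart, which is why the random scheme leaves large gaps.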

2 months, 2 weeks


[PATCH net 03/10] netlink: specs: ethtool: replace underscores with dashes in names

by Jakub Kicinski

We're trying to add a strict regexp for the name format in the spec. Underscores will not be allowed, dashes should be used instead. This makes no difference to C (codegen replaces special chars in names) but gives more uniform naming in Python.

Fixes: 13e59344fb9d ("net: ethtool: add support for symmetric-xor RSS hash")
Fixes: 46fb3ba95b93 ("ethtool: Add an interface for flashing transceiver modules' firmware")
Signed-off-by: Jakub Kicinski <kuba(a)kernel.org>
---
CC: andrew(a)lunn.ch
CC: donald.hunter(a)gmail.com
CC: shuah(a)kernel.org
CC: kory.maincent(a)bootlin.com
CC: sdf(a)fomichev.me
CC: gal(a)nvidia.com
CC: noren(a)nvidia.com
CC: ahmed.zaki(a)intel.com
CC: wojciech.drewek(a)intel.com
CC: petrm(a)nvidia.com
CC: danieller(a)nvidia.com
CC: linux-kselftest(a)vger.kernel.org
---
 Documentation/netlink/specs/ethtool.yaml                 | 6 +++---
 tools/testing/selftests/drivers/net/hw/rss_input_xfrm.py | 2 +-
 2 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/Documentation/netlink/specs/ethtool.yaml b/Documentation/netlink/specs/ethtool.yaml
index 72a076b0e1b5..348c6ad548f5 100644
--- a/Documentation/netlink/specs/ethtool.yaml
+++ b/Documentation/netlink/specs/ethtool.yaml
@@ -48,7 +48,7 @@ c-version-name: ethtool-genl-version
         name: started
         doc: The firmware flashing process has started.
       -
-        name: in_progress
+        name: in-progress
         doc: The firmware flashing process is in progress.
       -
         name: completed
@@ -1422,7 +1422,7 @@ c-version-name: ethtool-genl-version
         name: hkey
         type: binary
       -
-        name: input_xfrm
+        name: input-xfrm
         type: u32
       -
         name: start-context
@@ -2238,7 +2238,7 @@ c-version-name: ethtool-genl-version
             - hfunc
             - indir
             - hkey
-            - input_xfrm
+            - input-xfrm
       dump:
         request:
           attributes:
diff --git a/tools/testing/selftests/drivers/net/hw/rss_input_xfrm.py b/tools/testing/selftests/drivers/net/hw/rss_input_xfrm.py
index f439c434ba36..648ff50bc1c3 100755
--- a/tools/testing/selftests/drivers/net/hw/rss_input_xfrm.py
+++ b/tools/testing/selftests/drivers/net/hw/rss_input_xfrm.py
@@ -38,7 +38,7 @@ from lib.py import rand_port
         raise KsftSkipEx("socket.SO_INCOMING_CPU was added in Python 3.11")
 
     input_xfrm = cfg.ethnl.rss_get(
-        {'header': {'dev-name': cfg.ifname}}).get('input_xfrm')
+        {'header': {'dev-name': cfg.ifname}}).get('input-xfrm')
 
     # Check for symmetric xor/or-xor
     if not input_xfrm or (input_xfrm != 1 and input_xfrm != 2):
-- 
2.49.0
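
The claim that the rename makes no difference to C can be illustrated with a small sketch. Everything here is an assumption for illustration: the commit message does not quote the strict regexp, so `spec_name_is_strict()` simply assumes lowercase letters, digits and dashes, and `spec_name_to_c_ident()` is a hypothetical stand-in for the codegen step that replaces special characters.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Assumed rule (not quoted in the patch): lowercase letters, digits, dashes. */
static int spec_name_is_strict(const char *name)
{
	assert(name != NULL);
	if (*name == '\0')
		return 0;
	for (; *name; name++) {
		unsigned char c = (unsigned char)*name;

		if (!(islower(c) || isdigit(c) || c == '-'))
			return 0;
	}
	return 1;
}

/* Hypothetical codegen step: every non-alphanumeric character becomes '_'. */
static void spec_name_to_c_ident(const char *name, char *out, size_t out_len)
{
	size_t i;

	assert(name != NULL && out != NULL && out_len > 0);
	for (i = 0; name[i] && i + 1 < out_len; i++)
		out[i] = isalnum((unsigned char)name[i]) ? name[i] : '_';
	out[i] = '\0';
}
```

Under these assumptions "input-xfrm" and "input_xfrm" map to the same C identifier, while only the dashed spelling passes the strict check.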

2 months, 2 weeks


[PATCH v2 0/3] tools/nolibc: add support for SuperH

by Thomas Weißschuh

Add support for SuperH/"sh" to nolibc.
Only sh4 is tested for now.

This is only tested on QEMU so far.
Additional testing would be very welcome.

Test instructions:
$ cd tools/testing/selftests/nolibc/
$ make -f Makefile.nolibc ARCH=sh CROSS_COMPILE=sh4-linux- nolibc-test
$ file nolibc-test
nolibc-test: ELF 32-bit LSB executable, Renesas SH, version 1 (SYSV), statically linked, not stripped
$ ./nolibc-test
Running test 'startup'
0 argc = 1 [OK]
...
Total number of errors: 0
Exiting with status 0

Signed-off-by: Thomas Weißschuh <linux(a)weissschuh.net>
---
Changes in v2:
- Rebase onto latest nolibc-next
- Pick up Ack from Willy
- Provide some test instructions
- Link to v1: https://lore.kernel.org/r/20250609-nolibc-sh-v1-0-9dcdb1b66bb5@weissschuh.n…

---
Thomas Weißschuh (3):
      selftests/nolibc: fix EXTRACONFIG variables ordering
      selftests/nolibc: use file driver for QEMU serial
      tools/nolibc: add support for SuperH

 tools/include/nolibc/arch-sh.h                 | 162 +++++++++++++++++++++++++
 tools/include/nolibc/arch.h                    |   2 +
 tools/testing/selftests/nolibc/Makefile.nolibc |  15 ++-
 tools/testing/selftests/nolibc/run-tests.sh    |   3 +-
 4 files changed, 177 insertions(+), 5 deletions(-)
---
base-commit: eb135311083100b6590a7545618cd9760d896a86
change-id: 20250528-nolibc-sh-8b4e3bb8efcb

Best regards,
-- 
Thomas Weißschuh <linux(a)weissschuh.net>

2 months, 2 weeks


[PATCH v3 2/2] selftests/bpf: Add testcases for BPF_ADD and BPF_SUB

by Harishankar Vishwanathan

The previous commit improves the precision in scalar(32)_min_max_add, and scalar(32)_min_max_sub. The improvement in precision occurs in cases when all outcomes overflow or underflow, respectively. This commit adds selftests that exercise those cases. This commit also adds selftests for cases where the output register state bounds for u(32)_min/u(32)_max are conservatively set to unbounded (when there is partial overflow or underflow). Signed-off-by: Harishankar Vishwanathan <harishankar.vishwanathan(a)gmail.com> Co-developed-by: Matan Shachnai <m.shachnai(a)rutgers.edu> Signed-off-by: Matan Shachnai <m.shachnai(a)rutgers.edu> Suggested-by: Eduard Zingerman <eddyz87(a)gmail.com> --- .../selftests/bpf/progs/verifier_bounds.c | 161 ++++++++++++++++++ 1 file changed, 161 insertions(+) diff --git a/tools/testing/selftests/bpf/progs/verifier_bounds.c b/tools/testing/selftests/bpf/progs/verifier_bounds.c index 30e16153fdf1..31986f6c609e 100644 --- a/tools/testing/selftests/bpf/progs/verifier_bounds.c +++ b/tools/testing/selftests/bpf/progs/verifier_bounds.c @@ -1371,4 +1371,165 @@ __naked void mult_sign_ovf(void) __imm(bpf_skb_store_bytes) : __clobber_all); } + +SEC("socket") +__description("64-bit addition, all outcomes overflow") +__success __log_level(2) +__msg("5: (0f) r3 += r3 {{.*}} R3_w=scalar(umin=0x4000000000000000,umax=0xfffffffffffffffe)") +__retval(0) +__naked void add64_full_overflow(void) +{ + asm volatile ( + "r4 = 0;" + "r4 = -r4;" + "r3 = 0xa000000000000000 ll;" + "r3 |= r4;" + "r3 += r3;" + "r0 = 0;" + "exit" + : + : + : __clobber_all); +} + +SEC("socket") +__description("64-bit addition, partial overflow, result in unbounded reg") +__success __log_level(2) +__msg("4: (0f) r3 += r3 {{.*}} R3_w=scalar()") +__retval(0) +__naked void add64_partial_overflow(void) +{ + asm volatile ( + "r4 = 0;" + "r4 = -r4;" + "r3 = 2;" + "r3 |= r4;" + "r3 += r3;" + "r0 = 0;" + "exit" + : + : + : __clobber_all); +} + +SEC("socket") +__description("32-bit addition overflow, 
all outcomes overflow") +__success __log_level(2) +__msg("4: (0c) w3 += w3 {{.*}} R3_w=scalar(smin=umin=umin32=0x40000000,smax=umax=umax32=0xfffffffe,var_off=(0x0; 0xffffffff))") +__retval(0) +__naked void add32_full_overflow(void) +{ + asm volatile ( + "w4 = 0;" + "w4 = -w4;" + "w3 = 0xa0000000;" + "w3 |= w4;" + "w3 += w3;" + "r0 = 0;" + "exit" + : + : + : __clobber_all); +} + +SEC("socket") +__description("32-bit addition, partial overflow, result in unbounded u32 bounds") +__success __log_level(2) +__msg("4: (0c) w3 += w3 {{.*}} R3_w=scalar(smin=0,smax=umax=0xffffffff,var_off=(0x0; 0xffffffff))") +__retval(0) +__naked void add32_partial_overflow(void) +{ + asm volatile ( + "w4 = 0;" + "w4 = -w4;" + "w3 = 2;" + "w3 |= w4;" + "w3 += w3;" + "r0 = 0;" + "exit" + : + : + : __clobber_all); +} + +SEC("socket") +__description("64-bit subtraction, all outcomes underflow") +__success __log_level(2) +__msg("6: (1f) r3 -= r1 {{.*}} R3_w=scalar(umin=1,umax=0x8000000000000000)") +__retval(0) +__naked void sub64_full_overflow(void) +{ + asm volatile ( + "r1 = 0;" + "r1 = -r1;" + "r2 = 0x8000000000000000 ll;" + "r1 |= r2;" + "r3 = 0;" + "r3 -= r1;" + "r0 = 0;" + "exit" + : + : + : __clobber_all); +} + +SEC("socket") +__description("64-bit subtration, partial overflow, result in unbounded reg") +__success __log_level(2) +__msg("3: (1f) r3 -= r2 {{.*}} R3_w=scalar()") +__retval(0) +__naked void sub64_partial_overflow(void) +{ + asm volatile ( + "r3 = 0;" + "r3 = -r3;" + "r2 = 1;" + "r3 -= r2;" + "r0 = 0;" + "exit" + : + : + : __clobber_all); +} + +SEC("socket") +__description("32-bit subtraction overflow, all outcomes underflow") +__success __log_level(2) +__msg("5: (1c) w3 -= w1 {{.*}} R3_w=scalar(smin=umin=umin32=1,smax=umax=umax32=0x80000000,var_off=(0x0; 0xffffffff))") +__retval(0) +__naked void sub32_full_overflow(void) +{ + asm volatile ( + "w1 = 0;" + "w1 = -w1;" + "w2 = 0x80000000;" + "w1 |= w2;" + "w3 = 0;" + "w3 -= w1;" + "r0 = 0;" + "exit" + : + : + : __clobber_all); 
+} + +SEC("socket") +__description("32-bit subtration, partial overflow, result in unbounded u32 bounds") +__success __log_level(2) +__msg("3: (1c) w3 -= w2 {{.*}} R3_w=scalar(smin=0,smax=umax=0xffffffff,var_off=(0x0; 0xffffffff))") +__retval(0) +__naked void sub32_partial_overflow(void) +{ + asm volatile ( + "w3 = 0;" + "w3 = -w3;" + "w2 = 1;" + "w3 -= w2;" + "r0 = 0;" + "exit" + : + : + : __clobber_all); +} + char _license[] SEC("license") = "GPL"; -- 2.45.2
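The bounds expected in the `__msg()` lines above can be reproduced with a small model. This is a minimal Python sketch, not the verifier's code: the function names are hypothetical, and it assumes unsigned bounds are tracked as `(umin, umax)` pairs and that only the "partial" case (one endpoint wraps, the other does not) has to fall back to an unbounded range.

```python
U64_MAX = (1 << 64) - 1


def range_add_u64(dst, src):
    """Model of unsigned 64-bit range addition on (umin, umax) pairs."""
    lo = dst[0] + src[0]
    hi = dst[1] + src[1]
    min_overflow = lo > U64_MAX
    max_overflow = hi > U64_MAX
    # Partial overflow: some sums wrap and some do not, so the wrapped
    # endpoints no longer bracket the result. Give up on the bounds.
    if not min_overflow and max_overflow:
        return (0, U64_MAX)
    # No overflow, or every sum overflows: wrapping both endpoints
    # preserves their order, so the wrapped pair is still a valid range.
    return (lo & U64_MAX, hi & U64_MAX)


def range_sub_u64(dst, src):
    """Model of unsigned 64-bit range subtraction on (umin, umax) pairs."""
    lo = dst[0] - src[1]
    hi = dst[1] - src[0]
    min_underflow = lo < 0
    max_underflow = hi < 0
    # Partial underflow: only the lower endpoint wraps.
    if min_underflow and not max_underflow:
        return (0, U64_MAX)
    return (lo & U64_MAX, hi & U64_MAX)


# add64_full_overflow: r3 = 0xa000000000000000 | unknown, then r3 += r3
r3 = (0xA000000000000000, U64_MAX)
print([hex(v) for v in range_add_u64(r3, r3)])

# sub64_full_overflow: r3 = 0, r1 = unknown | 0x8000000000000000, r3 -= r1
print([hex(v) for v in range_sub_u64((0, 0), (0x8000000000000000, U64_MAX))])
```

The two printed ranges correspond to the `umin`/`umax` values in the "all outcomes overflow" and "all outcomes underflow" tests; feeding in `(2, U64_MAX)` for the addition or `(0, U64_MAX)` minus `(1, 1)` for the subtraction gives the unbounded result the "partial" tests expect.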

2 months, 2 weeks


[PATCH v2 00/23] ARM64 PMU Partitioning

by Colton Lewis

This series creates a new PMU scheme on ARM, a partitioned PMU that allows reserving a subset of counters for more direct guest access, significantly reducing overhead. More details, including performance benchmarks, can be read in the v1 cover letter linked below.

v2:
* Rebased on top of kvm/queue to pick up Sean's patch [1] that reorganizes some of the same headers and would otherwise conflict.
* Changed the semantics of the command line parameters and the ioctl. It was pointed out in the comments last time that it doesn't work to repartition at runtime because the perf subsystem assumes the number of counters it gets will not change after the PMU is probed. Now the PMUv3 command line parameters are the sole thing that divides up guest and host counters and the ioctl just toggles a flag for whether a vcpu should use the partitioned PMU. I've also moved from one to two parameters: partition_pmu=[y/n] and reserved_guest_counters=[0-N]. This makes it possible to unambiguously express configurations like a partitioned PMU with 0 general purpose counters exposed to the guest (which still exposes the cycle counter).
* Moved the partitioning code into the PMUv3 driver itself so KVM code isn't modifying fields that are otherwise internal to the driver.
* Define PMI{CNTR,FILTR} as undef_access since KVM isn't ready to support that counter. It is, however, still handled in the partitioning because the driver recognizes it.
* Take out the dependency on FEAT_FGT since it is not widely available on hardware yet. Instead, define a fast path in switch.h for handling accesses to the registers that would otherwise be untrapped.
* During MDCR_EL2 setup for guests, ensure the computed HPMN value is always below the number of guest counters allocated by the driver at boot and always below the number of counters on the current CPU. This accounts for the possibility of heterogeneous hardware where a guest might be able to use the partitioned PMU on one CPU but not another.
* The KVM PMU event filter API says that counters must not count while the event is filtered. To ensure this, enforce the filter on every vcpu_load into the guest.
* Settable PMCR_EL0.N with a partitioned PMU now works and the vcpu_counter_access selftest changes reflect that.

v1: https://lore.kernel.org/kvm/20250602192702.2125115-1-coltonlewis@google.com/

Colton Lewis (22):
  arm64: cpufeature: Add cpucap for HPMN0
  arm64: Generate sign macro for sysreg Enums
  arm64: cpufeature: Add cpucap for PMICNTR
  arm64: Define PMI{CNTR,FILTR}_EL0 as undef_access
  KVM: arm64: Reorganize PMU functions
  perf: arm_pmuv3: Introduce method to partition the PMU
  perf: arm_pmuv3: Generalize counter bitmasks
  perf: arm_pmuv3: Keep out of guest counter partition
  KVM: arm64: Correct kvm_arm_pmu_get_max_counters()
  KVM: arm64: Set up FGT for Partitioned PMU
  KVM: arm64: Writethrough trapped PMEVTYPER register
  KVM: arm64: Use physical PMSELR for PMXEVTYPER if partitioned
  KVM: arm64: Writethrough trapped PMOVS register
  KVM: arm64: Write fast path PMU register handlers
  KVM: arm64: Setup MDCR_EL2 to handle a partitioned PMU
  KVM: arm64: Account for partitioning in PMCR_EL0 access
  KVM: arm64: Context swap Partitioned PMU guest registers
  KVM: arm64: Enforce PMU event filter at vcpu_load()
  perf: arm_pmuv3: Handle IRQs for Partitioned PMU guest counters
  KVM: arm64: Inject recorded guest interrupts
  KVM: arm64: Add ioctl to partition the PMU when supported
  KVM: arm64: selftests: Add test case for partitioned PMU

Marc Zyngier (1):
  KVM: arm64: Cleanup PMU includes

 Documentation/virt/kvm/api.rst                | 21 +
 arch/arm/include/asm/arm_pmuv3.h              | 34 +
 arch/arm64/include/asm/arm_pmuv3.h            | 61 +-
 arch/arm64/include/asm/kvm_host.h             | 20 +-
 arch/arm64/include/asm/kvm_pmu.h              | 61 ++
 arch/arm64/kernel/cpufeature.c                | 15 +
 arch/arm64/kvm/Makefile                       | 2 +-
 arch/arm64/kvm/arm.c                          | 22 +
 arch/arm64/kvm/debug.c                        | 24 +-
 arch/arm64/kvm/hyp/include/hyp/switch.h       | 233 ++++++
 arch/arm64/kvm/pmu-emul.c                     | 676 +----------------
 arch/arm64/kvm/pmu-part.c                     | 359 +++++++++
 arch/arm64/kvm/pmu.c                          | 687 ++++++++++++++++++
 arch/arm64/kvm/sys_regs.c                     | 66 +-
 arch/arm64/tools/cpucaps                      | 2 +
 arch/arm64/tools/gen-sysreg.awk               | 1 +
 arch/arm64/tools/sysreg                       | 6 +-
 drivers/perf/arm_pmuv3.c                      | 150 +++-
 include/linux/perf/arm_pmu.h                  | 15 +-
 include/linux/perf/arm_pmuv3.h                | 14 +-
 include/uapi/linux/kvm.h                      | 4 +
 tools/include/uapi/linux/kvm.h                | 2 +
 .../selftests/kvm/arm64/vpmu_counter_access.c | 63 +-
 virt/kvm/kvm_main.c                           | 1 +
 24 files changed, 1791 insertions(+), 748 deletions(-)
 create mode 100644 arch/arm64/kvm/pmu-part.c

base-commit: 79150772457f4d45e38b842d786240c36bb1f97f
--
2.50.0.714.g196bf9f422-goog
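The HPMN clamping described in the v2 notes can be illustrated with a toy model. This is a rough Python sketch under stated assumptions, not the series' code: both function names are hypothetical, "always below" is read here as "never exceeding", and the guest is assumed to own the low-numbered general purpose counters `[0, HPMN)` while the host keeps the rest, which is how MDCR_EL2.HPMN splits the counter range architecturally.

```python
def clamp_hpmn(reserved_guest_counters: int, cpu_counters: int) -> int:
    """Guest counter count usable on this CPU (hypothetical helper).

    On heterogeneous hardware a CPU may implement fewer counters than
    were reserved at boot, so take the smaller of the two.
    """
    return min(reserved_guest_counters, cpu_counters)


def partition_masks(hpmn: int, cpu_counters: int) -> tuple[int, int]:
    """Return (guest_mask, host_mask) over general purpose counters."""
    all_counters = (1 << cpu_counters) - 1
    guest_mask = (1 << hpmn) - 1           # counters 0 .. hpmn-1
    host_mask = all_counters & ~guest_mask  # counters hpmn .. N-1
    return guest_mask, host_mask


# reserved_guest_counters=4 on a CPU with 6 general purpose counters
hpmn = clamp_hpmn(4, 6)
guest, host = partition_masks(hpmn, 6)
print(f"hpmn={hpmn} guest={guest:#08b} host={host:#08b}")
```

With `reserved_guest_counters=0` the guest mask is empty, which matches the cover letter's configuration where the guest sees no general purpose counters; the cycle counter is outside this model.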

2 months, 2 weeks


[PATCH] selftests: kvm: Fix spelling of 'occurrences' in sparsebit.c comments

by Rahul Kumar

Corrected two instances of the misspelled word 'occurences' to 'occurrences' in comments explaining node invariants in sparsebit.c. These comments describe core behavior of the data structure and should be clear. Signed-off-by: Rahul Kumar <rk0006818(a)gmail.com> --- tools/testing/selftests/kvm/lib/sparsebit.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/tools/testing/selftests/kvm/lib/sparsebit.c b/tools/testing/selftests/kvm/lib/sparsebit.c index cfed9d26cc71..a99188f87a38 100644 --- a/tools/testing/selftests/kvm/lib/sparsebit.c +++ b/tools/testing/selftests/kvm/lib/sparsebit.c @@ -116,7 +116,7 @@ * * + A node with all mask bits set only occurs when the last bit * described by the previous node is not equal to this nodes - * starting index - 1. All such occurences of this condition are + * starting index - 1. All such occurrences of this condition are * avoided by moving the setting of the nodes mask bits into * the previous nodes num_after setting. * @@ -592,7 +592,7 @@ static struct node *node_split(struct sparsebit *s, sparsebit_idx_t idx) * * + A node with all mask bits set only occurs when the last bit * described by the previous node is not equal to this nodes - * starting index - 1. All such occurences of this condition are + * starting index - 1. All such occurrences of this condition are * avoided by moving the setting of the nodes mask bits into * the previous nodes num_after setting. */ -- 2.43.0 This patch fixes two misspellings of the word 'occurrences' in comments within sparsebit.c used by the KVM selftests. Fixing the spelling improves readability and clarity of the documented behavior. Only comment text has been changed — there are no modifications to the functional logic of the tests. I would appreciate your review and any feedback you may have. Thank you for your time and support. Best regards, Rahul Kumar

2 months, 2 weeks


[PATCH v3 00/13] KVM: Make irqfd registration globally unique

by Sean Christopherson

Non-KVM folks, I am hoping to route this through the KVM tree (6.17 or later), as the non-KVM changes should be glorified nops. Please holler if you object to that idea.

Hyper-V folks in particular, let me know if you want a stable topic branch/tag, e.g. on the off chance you want to make similar changes to the Hyper-V code, and I'll make sure that happens.

As for what this series actually does...

Rework KVM's irqfd registration to require that an eventfd is bound to at most one irqfd throughout the entire system. KVM currently disallows binding an eventfd to multiple irqfds for a single VM, but doesn't reject attempts to bind an eventfd to multiple VMs.

This is obviously an ABI change, but I'm fairly confident that it won't break userspace, because binding an eventfd to multiple irqfds hasn't truly worked since commit e8dbf19508a1 ("kvm/eventfd: Use priority waitqueue to catch events before userspace"). A somewhat undocumented, and perhaps even unintentional, side effect of suppressing eventfd notifications for userspace is that the priority+exclusive behavior also suppresses eventfd notifications for any subsequent waiters, even if they are priority waiters. I.e. only the first VM with an irqfd+eventfd binding will get notifications.

And for IRQ bypass, a.k.a. device posted interrupts, globally unique bindings are a hard requirement (at least on x86; I assume other archs are the same). KVM and the IRQ bypass manager kinda sorta handle this, but in the absolute worst way possible (IMO). Instead of surfacing an error to userspace, KVM silently ignores IRQ bypass registration errors.

The motivation for this series is to harden against userspace goofs. AFAIK, we (Google) have never actually had a bug where userspace tries to assign an eventfd to multiple VMs, but the possibility has come up in more than one bug investigation (our intra-host, a.k.a. copyless, migration scheme transfers eventfds from the old to the new VM when updating the host VMM).
v3:
 - Retain WQ_FLAG_EXCLUSIVE in mshv_eventfd.c, which snuck in between v1 and v2. [Peter]
 - Use EXPORT_SYMBOL_GPL. [Peter]
 - Move WQ_FLAG_EXCLUSIVE out of add_wait_queue_priority() in a prep patch so that the affected subsystems are more explicitly documented (and then immediately drop the flag from drivers/xen/privcmd.c, which amusingly hides that file from the diff stats).

v2:
 - https://lore.kernel.org/all/20250519185514.2678456-1-seanjc@google.com
 - Use guard(spinlock_irqsave). [Prateek]

v1: https://lore.kernel.org/all/20250401204425.904001-1-seanjc@google.com

Sean Christopherson (13):
  KVM: Use a local struct to do the initial vfs_poll() on an irqfd
  KVM: Acquire SCRU lock outside of irqfds.lock during assignment
  KVM: Initialize irqfd waitqueue callback when adding to the queue
  KVM: Add irqfd to KVM's list via the vfs_poll() callback
  KVM: Add irqfd to eventfd's waitqueue while holding irqfds.lock
  sched/wait: Drop WQ_FLAG_EXCLUSIVE from add_wait_queue_priority()
  xen: privcmd: Don't mark eventfd waiter as EXCLUSIVE
  sched/wait: Add a waitqueue helper for fully exclusive priority waiters
  KVM: Disallow binding multiple irqfds to an eventfd with a priority waiter
  KVM: Drop sanity check that per-VM list of irqfds is unique
  KVM: selftests: Assert that eventfd() succeeds in Xen shinfo test
  KVM: selftests: Add utilities to create eventfds and do KVM_IRQFD
  KVM: selftests: Add a KVM_IRQFD test to verify uniqueness requirements

 drivers/hv/mshv_eventfd.c                     | 8 ++
 include/linux/kvm_irqfd.h                     | 1 -
 include/linux/wait.h                          | 2 +
 kernel/sched/wait.c                           | 22 ++-
 tools/testing/selftests/kvm/Makefile.kvm      | 1 +
 tools/testing/selftests/kvm/arm64/vgic_irq.c  | 12 +-
 .../testing/selftests/kvm/include/kvm_util.h  | 40 ++++++
 tools/testing/selftests/kvm/irqfd_test.c      | 130 ++++++++++++++++++
 .../selftests/kvm/x86/xen_shinfo_test.c       | 21 +--
 virt/kvm/eventfd.c                            | 130 +++++++++++++-----
 10 files changed, 302 insertions(+), 65 deletions(-)
 create mode 100644 tools/testing/selftests/kvm/irqfd_test.c
base-commit: 45eb29140e68ffe8e93a5471006858a018480a45 -- 2.49.0.1151.ga128411c76-goog
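The difference between the old per-VM check and the new system-wide rule can be shown with a toy model. This is only an illustrative Python sketch: the class, method, and exception names are hypothetical, and the kernel implements the check through waitqueue semantics rather than a lookup table, so only the externally visible rule is modeled here.

```python
class AlreadyBound(Exception):
    """Raised when an eventfd already backs an irqfd (hypothetical)."""


class GlobalIrqfdRegistry:
    """One table for the whole system, keyed by eventfd only."""

    def __init__(self):
        self._bound = {}  # eventfd -> (vm, gsi)

    def assign(self, vm, eventfd, gsi):
        # Reject the binding no matter which VM holds the existing one.
        if eventfd in self._bound:
            raise AlreadyBound(f"eventfd {eventfd} is bound to {self._bound[eventfd]}")
        self._bound[eventfd] = (vm, gsi)

    def deassign(self, eventfd):
        self._bound.pop(eventfd, None)


registry = GlobalIrqfdRegistry()
registry.assign("vm-a", eventfd=7, gsi=10)
try:
    # A second VM reusing the same eventfd is the case the series rejects.
    registry.assign("vm-b", eventfd=7, gsi=10)
except AlreadyBound as err:
    print("rejected:", err)
```

A per-VM table keyed by `(vm, eventfd)` would have accepted the second call, which is the gap the cover letter describes.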

2 months, 2 weeks


[PATCH net-next] selftests: net: add netpoll basic functionality test

by Breno Leitao

Add a basic selftest for the netpoll polling mechanism, specifically targeting the netpoll poll() side.

The test creates a scenario where network transmission is running at maximum speed, and netpoll needs to poll the NIC. This is achieved by:

1. Configuring a single RX/TX queue to create contention
2. Generating background traffic to saturate the interface
3. Sending netconsole messages to trigger netpoll polling
4. Using dynamic netconsole targets via configfs
5. Deleting and recreating netconsole targets after 5 iterations

The test validates a critical netpoll code path by monitoring traffic flow and ensuring netpoll_poll_dev() is called when the normal TX path is blocked. Perf probing confirms this test successfully triggers netpoll_poll_dev() in typical test runs.

This addresses a gap in netpoll test coverage for a path that is tricky for the network stack.

Signed-off-by: Breno Leitao <leitao(a)debian.org>
---
Changes since RFC:
- Toggle the netconsole interfaces up and down after 5 iterations.
- Moved the traffic check under DEBUG (Willem de Bruijn).
- Bumped the iterations to 20 given it runs faster now.
- Link to the RFC: https://lore.kernel.org/r/20250612-netpoll_test-v1-1-4774fd95933f@debian.org --- tools/testing/selftests/drivers/net/Makefile | 1 + .../testing/selftests/drivers/net/netpoll_basic.py | 231 +++++++++++++++++++++ 2 files changed, 232 insertions(+) diff --git a/tools/testing/selftests/drivers/net/Makefile b/tools/testing/selftests/drivers/net/Makefile index bd309b2d39095..9bd84d6b542e5 100644 --- a/tools/testing/selftests/drivers/net/Makefile +++ b/tools/testing/selftests/drivers/net/Makefile @@ -16,6 +16,7 @@ TEST_PROGS := \ netcons_fragmented_msg.sh \ netcons_overflow.sh \ netcons_sysdata.sh \ + netpoll_basic.py \ ping.py \ queues.py \ stats.py \ diff --git a/tools/testing/selftests/drivers/net/netpoll_basic.py b/tools/testing/selftests/drivers/net/netpoll_basic.py new file mode 100755 index 0000000000000..2a81926169262 --- /dev/null +++ b/tools/testing/selftests/drivers/net/netpoll_basic.py @@ -0,0 +1,231 @@ +#!/usr/bin/env python3 +# SPDX-License-Identifier: GPL-2.0 + +# This test aims to evaluate the netpoll polling mechanism (as in +# netpoll_poll_dev()). It presents a complex scenario where the network +# attempts to send a packet but fails, prompting it to poll the NIC from within +# the netpoll TX side. +# +# This has been a crucial path in netpoll that was previously untested. Jakub +# suggested using a single RX/TX queue, pushing traffic to the NIC, and then +# sending netpoll messages (via netconsole) to trigger the poll. `perf` probing +# of netpoll_poll_dev() showed that this test indeed triggers +# netpoll_poll_dev() once or twice in 10 iterations. 
+ +# Author: Breno Leitao <leitao(a)debian.org> + +import errno +import os +import random +import string +import time + +from lib.py import ( + ethtool, + GenerateTraffic, + ksft_exit, + ksft_pr, + ksft_run, + KsftFailEx, + KsftSkipEx, + NetdevFamily, + NetDrvEpEnv, +) + +NETCONSOLE_CONFIGFS_PATH = "/sys/kernel/config/netconsole" +REMOTE_PORT = 6666 +LOCAL_PORT = 1514 +# Number of netcons messages to send. I usually see netpoll_poll_dev() +# being called at least once in 10 iterations. Having 20 to have some buffers +ITERATIONS = 20 +DEBUG = False + + +def generate_random_netcons_name() -> str: + """Generate a random target name starting with 'netcons'""" + random_suffix = "".join(random.choices(string.ascii_lowercase + string.digits, k=8)) + return f"netcons_{random_suffix}" + + +def get_stats(cfg: NetDrvEpEnv, netdevnl: NetdevFamily) -> dict[str, int]: + """Get the statistics for the interface""" + return netdevnl.qstats_get({"ifindex": cfg.ifindex}, dump=True)[0] + + +def set_single_rx_tx_queue(interface_name: str) -> None: + """Set the number of RX and TX queues to 1 using ethtool""" + try: + # This don't need to be reverted, since interfaces will be deleted after test + ethtool(f"-G {interface_name} rx 1 tx 1") + except Exception as e: + raise KsftSkipEx( + f"Failed to configure RX/TX queues: {e}. Ethtool not available?" 
+ ) + + +def create_netconsole_target( + config_data: dict[str, str], + target_name: str, +) -> None: + """Create a netconsole dynamic target against the interfaces""" + ksft_pr(f"Using netconsole name: {target_name}") + try: + os.makedirs(f"{NETCONSOLE_CONFIGFS_PATH}/{target_name}", exist_ok=True) + ksft_pr(f"Created target directory: {NETCONSOLE_CONFIGFS_PATH}/{target_name}") + except OSError as e: + if e.errno != errno.EEXIST: + raise KsftFailEx(f"Failed to create netconsole target directory: {e}") + + try: + for key, value in config_data.items(): + if DEBUG: + ksft_pr(f"Setting {key} to {value}") + with open( + f"{NETCONSOLE_CONFIGFS_PATH}/{target_name}/{key}", + "w", + encoding="utf-8", + ) as f: + # Always convert to string to write to file + f.write(str(value)) + f.close() + + if DEBUG: + # Read all configuration values for debugging + for debug_key in config_data.keys(): + with open( + f"{NETCONSOLE_CONFIGFS_PATH}/{target_name}/{debug_key}", + "r", + encoding="utf-8", + ) as f: + content = f.read() + ksft_pr( + f"{NETCONSOLE_CONFIGFS_PATH}/{target_name}/{debug_key} {content}" + ) + + except Exception as e: + raise KsftFailEx(f"Failed to configure netconsole target: {e}") + + +def set_netconsole(cfg: NetDrvEpEnv, interface_name: str, target_name: str) -> None: + """Configure netconsole on the interface with the given target name""" + config_data = { + "extended": "1", + "dev_name": interface_name, + "local_port": LOCAL_PORT, + "remote_port": REMOTE_PORT, + "local_ip": cfg.addr_v["4"] if cfg.addr_ipver == "4" else cfg.addr_v["6"], + "remote_ip": ( + cfg.remote_addr_v["4"] if cfg.addr_ipver == "4" else cfg.remote_addr_v["6"] + ), + "remote_mac": "00:00:00:00:00:00", # Not important for this test + "enabled": "1", + } + + create_netconsole_target(config_data, target_name) + ksft_pr(f"Created netconsole target: {target_name} on interface {interface_name}") + + +def delete_netconsole_target(name: str) -> None: + """Delete a netconsole dynamic target""" + 
target_path = f"{NETCONSOLE_CONFIGFS_PATH}/{name}" + try: + if os.path.exists(target_path): + os.rmdir(target_path) + except OSError as e: + raise KsftFailEx(f"Failed to delete netconsole target: {e}") + + +def check_traffic_flowing(cfg: NetDrvEpEnv, netdevnl: NetdevFamily) -> int: + """Check if traffic is flowing on the interface""" + stat1 = get_stats(cfg, netdevnl) + time.sleep(1) + stat2 = get_stats(cfg, netdevnl) + pkts_per_sec = stat2["rx-packets"] - stat1["rx-packets"] + # Just make sure this will not fail even in slow/debug kernels + if pkts_per_sec < 10: + raise KsftFailEx(f"Traffic seems low: {pkts_per_sec}") + if DEBUG: + ksft_pr(f"Traffic per second {pkts_per_sec}") + + return pkts_per_sec + + +def do_netpoll_flush( + cfg: NetDrvEpEnv, netdevnl: NetdevFamily, ifname: str, target_name: str +) -> None: + """Print messages to the console, trying to trigger a netpoll poll""" + + set_netconsole(cfg, ifname, target_name) + for i in range(int(ITERATIONS)): + msg = f"netcons test #{i}." 
+ + if DEBUG: + pkts_per_s = check_traffic_flowing(cfg, netdevnl) + msg += f" ({pkts_per_s} packets/s)" + + with open("/dev/kmsg", "w", encoding="utf-8") as kmsg: + kmsg.write(msg) + + if not i % 5: + # Every 5 iterations, toggle netconsole + delete_netconsole_target(target_name) + set_netconsole(cfg, ifname, target_name) + + +def test_netpoll(cfg: NetDrvEpEnv, netdevnl: NetdevFamily) -> None: + """ + Test netpoll by sending traffic to the interface and then sending + netconsole messages to trigger a poll + """ + + target_name = generate_random_netcons_name() + ifname = cfg.dev["ifname"] + traffic = None + + try: + set_single_rx_tx_queue(ifname) + traffic = GenerateTraffic(cfg) + check_traffic_flowing(cfg, netdevnl) + do_netpoll_flush(cfg, netdevnl, ifname, target_name) + finally: + if traffic: + traffic.stop() + delete_netconsole_target(target_name) + + +def check_dependencies() -> None: + """Check if the dependencies are met""" + if not os.path.exists(NETCONSOLE_CONFIGFS_PATH): + raise KsftSkipEx( + f"Directory {NETCONSOLE_CONFIGFS_PATH} does not exist. CONFIG_NETCONSOLE_DYNAMIC might not be set." + ) + + +def load_netconsole_module() -> None: + """Try to load the netconsole module""" + try: + os.system("modprobe netconsole") + except Exception: + # It is fine if we fail to load the module, it will fail later + # at check_dependencies() + pass + + +def main() -> None: + """Main function to run the test""" + load_netconsole_module() + check_dependencies() + netdevnl = NetdevFamily() + with NetDrvEpEnv(__file__, nsim_test=True) as cfg: + ksft_run( + [test_netpoll], + args=( + cfg, + netdevnl, + ), + ) + ksft_exit() + + +if __name__ == "__main__": + main() --- base-commit: 4f4040ea5d3e4bebebbef9379f88085c8b99221c change-id: 20250612-netpoll_test-a1324d2057c8 Best regards, -- Breno Leitao <leitao(a)debian.org>
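The toggle condition in do_netpoll_flush() is `if not i % 5`, so it may help to see exactly which iterations delete and recreate the target. The snippet below is a standalone illustration, not part of the patch; `toggle_iterations` is a hypothetical helper that reuses the patch's ITERATIONS value.

```python
ITERATIONS = 20  # same value as in the patch


def toggle_iterations(iterations: int = ITERATIONS, period: int = 5) -> list[int]:
    """Return the loop indices at which `not i % period` is true."""
    return [i for i in range(iterations) if not i % period]


print(toggle_iterations())
```

Because `0 % 5` is 0, the first toggle happens on the very first message, right after set_netconsole() created the target, and then again at indices 5, 10, and 15.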

2 months, 2 weeks


[PATCH] mm/selftests: improve UFFD-WP feature detection in KSM test

by Li Wang

The current implementation of test_unmerge_uffd_wp() explicitly sets `uffdio_api.features = UFFD_FEATURE_PAGEFAULT_FLAG_WP` before calling UFFDIO_API. This can cause the ioctl() call to fail with EINVAL on kernels that do not support UFFD-WP, leading the test to fail unnecessarily:

  # ------------------------------
  # running ./ksm_functional_tests
  # ------------------------------
  # TAP version 13
  # 1..9
  # # [RUN] test_unmerge
  # ok 1 Pages were unmerged
  # # [RUN] test_unmerge_zero_pages
  # ok 2 KSM zero pages were unmerged
  # # [RUN] test_unmerge_discarded
  # ok 3 Pages were unmerged
  # # [RUN] test_unmerge_uffd_wp
  # not ok 4 UFFDIO_API failed <-----
  # # [RUN] test_prot_none
  # ok 5 Pages were unmerged
  # # [RUN] test_prctl
  # ok 6 Setting/clearing PR_SET_MEMORY_MERGE works
  # # [RUN] test_prctl_fork
  # # No pages got merged
  # # [RUN] test_prctl_fork_exec
  # ok 7 PR_SET_MEMORY_MERGE value is inherited
  # # [RUN] test_prctl_unmerge
  # ok 8 Pages were unmerged
  # Bail out! 1 out of 8 tests failed
  # # Planned tests != run tests (9 != 8)
  # # Totals: pass:7 fail:1 xfail:0 xpass:0 skip:0 error:0
  # [FAIL]

This patch improves compatibility and error handling by:

1. Changing the feature check to first query supported features (features=0) rather than specifically requesting WP support.

2. Gracefully skipping the test if:
   - UFFDIO_API fails with EINVAL (feature not supported), or
   - UFFD_FEATURE_PAGEFAULT_FLAG_WP is not advertised by the kernel.

3. Providing better diagnostics by distinguishing expected failures (e.g., EINVAL) from unexpected ones and reporting them using strerror().

The updated logic makes the test more robust across different kernel versions and configurations, while preserving existing behavior on systems that do support UFFD-WP.
Signed-off-by: Li Wang <liwang(a)redhat.com> Cc: Aruna Ramakrishna <aruna.ramakrishna(a)oracle.com> Cc: Bagas Sanjaya <bagasdotme(a)gmail.com> Cc: Catalin Marinas <catalin.marinas(a)arm.com> Cc: Dave Hansen <dave.hansen(a)linux.intel.com> Cc: David Hildenbrand <david(a)redhat.com> Cc: Joey Gouly <joey.gouly(a)arm.com> Cc: Johannes Weiner <hannes(a)cmpxchg.org> Cc: Keith Lucas <keith.lucas(a)oracle.com> Cc: Ryan Roberts <ryan.roberts(a)arm.com> Cc: Shuah Khan <shuah(a)kernel.org> --- tools/testing/selftests/mm/ksm_functional_tests.c | 8 ++++++-- 1 file changed, 6 insertions(+), 2 deletions(-) diff --git a/tools/testing/selftests/mm/ksm_functional_tests.c b/tools/testing/selftests/mm/ksm_functional_tests.c index b61803e36d1c..f3db257dc555 100644 --- a/tools/testing/selftests/mm/ksm_functional_tests.c +++ b/tools/testing/selftests/mm/ksm_functional_tests.c @@ -393,9 +393,13 @@ static void test_unmerge_uffd_wp(void) /* See if UFFD-WP is around. */ uffdio_api.api = UFFD_API; - uffdio_api.features = UFFD_FEATURE_PAGEFAULT_FLAG_WP; + uffdio_api.features = 0; if (ioctl(uffd, UFFDIO_API, &uffdio_api) < 0) { - ksft_test_result_fail("UFFDIO_API failed\n"); + if (errno == EINVAL) + ksft_test_result_skip("UFFDIO_API not supported (EINVAL)\n"); + else + ksft_test_result_fail("UFFDIO_API failed: %s\n", strerror(errno)); + goto close_uffd; } if (!(uffdio_api.features & UFFD_FEATURE_PAGEFAULT_FLAG_WP)) { -- 2.49.0
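The skip/fail decision this patch introduces can be summarized as a small decision function. This is a Python sketch of the logic only, not the selftest itself: `classify_uffd_wp` and its return strings are hypothetical, and the ioctl is not actually issued; the caller is assumed to pass the errno from UFFDIO_API (or None on success) and the feature bits the kernel reported.

```python
import errno

# Bit value as defined in <linux/userfaultfd.h>.
UFFD_FEATURE_PAGEFAULT_FLAG_WP = 1 << 0


def classify_uffd_wp(ioctl_errno, features):
    """Map the UFFDIO_API outcome to 'run', 'skip', or 'fail'.

    ioctl_errno: errno from the ioctl, or None if it succeeded.
    features:    feature mask reported back by the kernel.
    """
    if ioctl_errno is not None:
        # EINVAL is the expected "not supported" answer; anything else
        # is an unexpected failure that should be reported.
        return "skip" if ioctl_errno == errno.EINVAL else "fail"
    if not features & UFFD_FEATURE_PAGEFAULT_FLAG_WP:
        return "skip"
    return "run"


print(classify_uffd_wp(errno.EINVAL, 0))
print(classify_uffd_wp(errno.ENOMEM, 0))
print(classify_uffd_wp(None, 0))
print(classify_uffd_wp(None, UFFD_FEATURE_PAGEFAULT_FLAG_WP))
```

Only the last call corresponds to a kernel on which the UFFD-WP part of the KSM test would actually run.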

2 months, 2 weeks

