On Tue, Jun 24, 2025 at 11:45:09AM +0530, Dev Jain wrote:
>
> On 23/06/25 11:02 pm, Donet Tom wrote:
> > On Mon, Jun 23, 2025 at 10:23:02AM +0530, Dev Jain wrote:
> > > On 21/06/25 11:25 pm, Donet Tom wrote:
> > > > On Fri, Jun 20, 2025 at 08:15:25PM +0530, Dev Jain wrote:
> > > > > On 19/06/25 1:53 pm, Donet Tom wrote:
> > > > > > On Wed, Jun 18, 2025 at 08:13:54PM +0530, Dev Jain wrote:
> > > > > > > On 18/06/25 8:05 pm, Lorenzo Stoakes wrote:
> > > > > > > > On Wed, Jun 18, 2025 at 07:47:18PM +0530, Dev Jain wrote:
> > > > > > > > > On 18/06/25 7:37 pm, Lorenzo Stoakes wrote:
> > > > > > > > > > On Wed, Jun 18, 2025 at 07:28:16PM +0530, Dev Jain wrote:
> > > > > > > > > > > On 18/06/25 5:27 pm, Lorenzo Stoakes wrote:
> > > > > > > > > > > > On Wed, Jun 18, 2025 at 05:15:50PM +0530, Dev Jain wrote:
> > > > > > > > > > > > Are you accounting for sys.max_map_count? If not, then you'll be hitting that
> > > > > > > > > > > > first.
> > > > > > > > > > > run_vmtests.sh will run the test in overcommit mode so that won't be an issue.
> > > > > > > > > > Umm, what? You mean overcommit all mode, and that has no bearing on the max
> > > > > > > > > > mapping count check.
> > > > > > > > > >
> > > > > > > > > > In do_mmap():
> > > > > > > > > >
> > > > > > > > > > /* Too many mappings? */
> > > > > > > > > > if (mm->map_count > sysctl_max_map_count)
> > > > > > > > > > return -ENOMEM;
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > As well as numerous other checks in mm/vma.c.
> > > > > > > > > Ah sorry, didn't look at the code properly just assumed that overcommit_always meant overriding
> > > > > > > > > this.
> > > > > > > > No problem! It's hard to be aware of everything in mm :)
> > > > > > > >
> > > > > > > > > > I'm not sure why an overcommit toggle is even necessary when you could use
> > > > > > > > > > MAP_NORESERVE or simply map PROT_NONE to avoid the OVERCOMMIT_GUESS limits?
> > > > > > > > > >
> > > > > > > > > > I'm pretty confused as to what this test is really achieving honestly. This
> > > > > > > > > > isn't a useful way of asserting mmap() behaviour as far as I can tell.
> > > > > > > > > Well, seems like a useful way to me at least : ) Not sure if you are in the mood
> > > > > > > > > to discuss that but if you'd like me to explain from start to end what the test
> > > > > > > > > is doing, I can do that : )
> > > > > > > > >
> > > > > > > > I just don't have time right now, I guess I'll have to come back to it
> > > > > > > > later... it's not the end of the world for it to be iffy in my view as long as
> > > > > > > > it passes, but it might just not be of great value.
> > > > > > > >
> > > > > > > > Philosophically I'd rather we didn't assert internal implementation details like
> > > > > > > > where we place mappings in userland memory. At no point do we promise to not
> > > > > > > > leave larger gaps if we feel like it :)
> > > > > > > You have a fair point. Anyhow a debate for another day.
> > > > > > >
> > > > > > > > I'm guessing, reading more, the _real_ test here is some mathematical assertion
> > > > > > > > about layout from HIGH_ADDR_SHIFT -> end of address space when using hints.
> > > > > > > >
> > > > > > > > But again I'm not sure that achieves much and again also is asserting internal
> > > > > > > > implementation details.
> > > > > > > >
> > > > > > > > Correct behaviour of this kind of thing probably better belongs to tests in the
> > > > > > > > userland VMA testing I'd say.
> > > > > > > >
> > > > > > > > Sorry I don't mean to do down work you've done before, just giving an honest
> > > > > > > > technical appraisal!
> > > > > > > Nah, it will be rather hilarious to see it all go down the drain xD
> > > > > > >
> > > > > > > > Anyway don't let this block work to fix the test if it's failing. We can revisit
> > > > > > > > this later.
> > > > > > > Sure. @Aboorva and Donet, I still believe that the correct approach is to elide
> > > > > > > the gap check at the crossing boundary. What do you think?
> > > > > > >
> > > > > > > One problem I am seeing with this approach is that, since the hint address
> > > > > > > is generated randomly, the VMAs are also created at random locations based on
> > > > > > > the hint address. So, for the VMAs created at high addresses, we cannot guarantee
> > > > > > > that the gaps between them will be aligned to MAP_CHUNK_SIZE.
> > > > > >
> > > > > > High address VMAs
> > > > > > -----------------
> > > > > > 1000000000000-1000040000000 r--p 00000000 00:00 0
> > > > > > 2000000000000-2000040000000 r--p 00000000 00:00 0
> > > > > > 4000000000000-4000040000000 r--p 00000000 00:00 0
> > > > > > 8000000000000-8000040000000 r--p 00000000 00:00 0
> > > > > > e80009d260000-fffff9d260000 r--p 00000000 00:00 0
> > > > > >
> > > > > > I have a different approach to solve this issue.
> > > > > It is really weird that such a large amount of VA space
> > > > > is left between the two VMAs yet mmap is failing.
> > > > >
> > > > >
> > > > >
> > > > > Can you please do the following:
> > > > > set /proc/sys/vm/max_map_count to the highest value possible.
> > > > > If running without run_vmtests.sh, set /proc/sys/vm/overcommit_memory to 1.
> > > > > In validate_complete_va_space:
> > > > >
> > > > > if (start_addr >= HIGH_ADDR_MARK && found == false) {
> > > > > found = true;
> > > > > continue;
> > > > > }
> > > > Thanks Dev for the suggestion. I set max_map_count and set overcommit
> > > > memory to 1, added this code change as well, and then tried. Still, the
> > > > test is failing.
> > > >
> > > > > where found is initialized to false. This will skip the check
> > > > > for the boundary.
> > > > >
> > > > > After this can you tell whether the test is still failing.
> > > > >
> > > > > Also can you give me the complete output of proc/pid/maps
> > > > > after putting a sleep at the end of the test.
> > > > >
> > > > On powerpc, DEFAULT_MAP_WINDOW is 128TB and the total address
> > > > space size is 4PB. With a hint, it can map up to 4PB. Since the
> > > > hint address is random in this test, VMAs are getting created at
> > > > random high addresses. IIUC this is expected.
> > > >
> > > >
> > > > 10000000-10010000 r-xp 00000000 fd:05 134226638 /home/donet/linux/tools/testing/selftests/mm/virtual_address_range
> > > > 10010000-10020000 r--p 00000000 fd:05 134226638 /home/donet/linux/tools/testing/selftests/mm/virtual_address_range
> > > > 10020000-10030000 rw-p 00010000 fd:05 134226638 /home/donet/linux/tools/testing/selftests/mm/virtual_address_range
> > > > 30000000-10030000000 r--p 00000000 00:00 0 [anon:virtual_address_range]
> > > > 10030770000-100307a0000 rw-p 00000000 00:00 0 [heap]
> > > > 1004f000000-7fff8f000000 r--p 00000000 00:00 0 [anon:virtual_address_range]
> > > > 7fff8faf0000-7fff8fe00000 rw-p 00000000 00:00 0
> > > > 7fff8fe00000-7fff90030000 r-xp 00000000 fd:00 792355 /usr/lib64/libc.so.6
> > > > 7fff90030000-7fff90040000 r--p 00230000 fd:00 792355 /usr/lib64/libc.so.6
> > > > 7fff90040000-7fff90050000 rw-p 00240000 fd:00 792355 /usr/lib64/libc.so.6
> > > > 7fff90050000-7fff90130000 r-xp 00000000 fd:00 792358 /usr/lib64/libm.so.6
> > > > 7fff90130000-7fff90140000 r--p 000d0000 fd:00 792358 /usr/lib64/libm.so.6
> > > > 7fff90140000-7fff90150000 rw-p 000e0000 fd:00 792358 /usr/lib64/libm.so.6
> > > > 7fff90160000-7fff901a0000 r--p 00000000 00:00 0 [vvar]
> > > > 7fff901a0000-7fff901b0000 r-xp 00000000 00:00 0 [vdso]
> > > > 7fff901b0000-7fff90200000 r-xp 00000000 fd:00 792351 /usr/lib64/ld64.so.2
> > > > 7fff90200000-7fff90210000 r--p 00040000 fd:00 792351 /usr/lib64/ld64.so.2
> > > > 7fff90210000-7fff90220000 rw-p 00050000 fd:00 792351 /usr/lib64/ld64.so.2
> > > > 7fffc9770000-7fffc9880000 rw-p 00000000 00:00 0 [stack]
> > > > 1000000000000-1000040000000 r--p 00000000 00:00 0 [anon:virtual_address_range]
> > > > 2000000000000-2000040000000 r--p 00000000 00:00 0 [anon:virtual_address_range]
> > > > 4000000000000-4000040000000 r--p 00000000 00:00 0 [anon:virtual_address_range]
> > > > 8000000000000-8000040000000 r--p 00000000 00:00 0 [anon:virtual_address_range]
> > > > eb95410220000-fffff90220000 r--p 00000000 00:00 0 [anon:virtual_address_range]
> > > >
> > > >
> > > >
> > > >
> > > > If I give the hint addresses serially from 128TB, then the address
> > > > space is contiguous, the gap is also MAP_SIZE, and the test passes.
> > > >
> > > > 10000000-10010000 r-xp 00000000 fd:05 134226638 /home/donet/linux/tools/testing/selftests/mm/virtual_address_range
> > > > 10010000-10020000 r--p 00000000 fd:05 134226638 /home/donet/linux/tools/testing/selftests/mm/virtual_address_range
> > > > 10020000-10030000 rw-p 00010000 fd:05 134226638 /home/donet/linux/tools/testing/selftests/mm/virtual_address_range
> > > > 33000000-10033000000 r--p 00000000 00:00 0 [anon:virtual_address_range]
> > > > 10033380000-100333b0000 rw-p 00000000 00:00 0 [heap]
> > > > 1006f0f0000-10071000000 rw-p 00000000 00:00 0
> > > > 10071000000-7fffb1000000 r--p 00000000 00:00 0 [anon:virtual_address_range]
> > > > 7fffb15d0000-7fffb1800000 r-xp 00000000 fd:00 792355 /usr/lib64/libc.so.6
> > > > 7fffb1800000-7fffb1810000 r--p 00230000 fd:00 792355 /usr/lib64/libc.so.6
> > > > 7fffb1810000-7fffb1820000 rw-p 00240000 fd:00 792355 /usr/lib64/libc.so.6
> > > > 7fffb1820000-7fffb1900000 r-xp 00000000 fd:00 792358 /usr/lib64/libm.so.6
> > > > 7fffb1900000-7fffb1910000 r--p 000d0000 fd:00 792358 /usr/lib64/libm.so.6
> > > > 7fffb1910000-7fffb1920000 rw-p 000e0000 fd:00 792358 /usr/lib64/libm.so.6
> > > > 7fffb1930000-7fffb1970000 r--p 00000000 00:00 0 [vvar]
> > > > 7fffb1970000-7fffb1980000 r-xp 00000000 00:00 0 [vdso]
> > > > 7fffb1980000-7fffb19d0000 r-xp 00000000 fd:00 792351 /usr/lib64/ld64.so.2
> > > > 7fffb19d0000-7fffb19e0000 r--p 00040000 fd:00 792351 /usr/lib64/ld64.so.2
> > > > 7fffb19e0000-7fffb19f0000 rw-p 00050000 fd:00 792351 /usr/lib64/ld64.so.2
> > > > 7fffc5470000-7fffc5580000 rw-p 00000000 00:00 0 [stack]
> > > > 800000000000-2aab000000000 r--p 00000000 00:00 0 [anon:virtual_address_range]
> > > >
> > > >
> > > Thank you for this output. I can't wrap my head around why this behaviour changes
> > > when you generate the hint sequentially. The mmap() syscall is supposed to do the
> > > following (irrespective of high VA space or not) - if the allocation at the hint
> > Yes, it is working as expected. On PowerPC, the DEFAULT_MAP_WINDOW is
> > 128TB, and the system can map up to 4PB.
> >
> > In the test, the first mmap call maps memory up to 128TB without any
> > hint, so the VMAs are created below the 128TB boundary.
> >
> > In the second mmap call, we provide a hint starting from 256TB, and
> > the hint address is generated randomly above 256TB. The mappings are
> > correctly created at these hint addresses. Since the hint addresses
> > are random, the resulting VMAs are also created at random locations.
> >
> > So, what I tried is: mapping from 0 to 128TB without any hint, and
> > then for the second mmap, instead of starting the hint from 256TB, I
> > started from 128TB. Instead of using random hint addresses, I used
> > sequential hint addresses from 128TB up to 512TB. With this change,
> > the VMAs are created in order, and the test passes.
> >
> > 800000000000-2aab000000000 r--p 00000000 00:00 0 128TB to 512TB VMA
> >
> > I think we will see same behaviour on x86 with X86_FEATURE_LA57.
> >
> > I will send the updated patch in V2.
>
> Since you say it fails on both radix and hash, it means that the generic
> code path is failing. I see that on my system, when I run the test with
> LPA2 config, write() fails with errno set to -ENOMEM. Can you apply
> the following diff and check whether the test fails still. Doing this
> fixed it for arm64.
>
> diff --git a/tools/testing/selftests/mm/virtual_address_range.c b/tools/testing/selftests/mm/virtual_address_range.c
>
> index b380e102b22f..3032902d01f2 100644
>
> --- a/tools/testing/selftests/mm/virtual_address_range.c
>
> +++ b/tools/testing/selftests/mm/virtual_address_range.c
>
> @@ -173,10 +173,6 @@ static int validate_complete_va_space(void)
>
> */
>
> hop = 0;
>
> while (start_addr + hop < end_addr) {
>
> - if (write(fd, (void *)(start_addr + hop), 1) != 1)
>
> - return 1;
>
> - lseek(fd, 0, SEEK_SET);
>
> -
>
> if (is_marked_vma(vma_name))
>
> munmap((char *)(start_addr + hop), MAP_CHUNK_SIZE);
>
Even with this change, the test is still failing. In this case,
we are allocating physical memory and writing into it, but our
issue seems to be with the gap between VMAs, so I believe this
might not be directly related.
I will send the next revision, where the test passes and no
issues are observed.
Just curious: with LPA2, is the second mmap() call successful?
And are the VMAs being created at the hint address as expected?
> >
> > > addr succeeds, then all is well, otherwise, do a top-down search for a large
> > > enough gap. I am not aware of the nuances in powerpc but I really am suspecting
> > > a bug in powerpc mmap code. Can you try to do some tracing - which function
> > > eventually fails to find the empty gap?
> > >
> > > Through my limited code tracing - we should end up in slice_find_area_topdown,
> > > then we ask the generic code to find the gap using vm_unmapped_area. So I
> > > suspect something is happening between this, probably slice_scan_available().
> > >
> > > > > > From 0 to 128TB, we map memory directly without using any hint. For the range above
> > > > > > 256TB up to 512TB, we perform the mapping using hint addresses. In the current test,
> > > > > > we use random hint addresses, but I have modified it to generate hint addresses linearly
> > > > > > starting from 128TB.
> > > > > >
> > > > > > With this change:
> > > > > >
> > > > > > The 0–128TB range is mapped without hints and verified accordingly.
> > > > > >
> > > > > > The 128TB–512TB range is mapped using linear hint addresses and then verified.
> > > > > >
> > > > > > Below are the VMAs obtained with this approach:
> > > > > >
> > > > > > 10000000-10010000 r-xp 00000000 fd:05 135019531
> > > > > > 10010000-10020000 r--p 00000000 fd:05 135019531
> > > > > > 10020000-10030000 rw-p 00010000 fd:05 135019531
> > > > > > 20000000-10020000000 r--p 00000000 00:00 0
> > > > > > 10020800000-10020830000 rw-p 00000000 00:00 0
> > > > > > 1004bcf0000-1004c000000 rw-p 00000000 00:00 0
> > > > > > 1004c000000-7fff8c000000 r--p 00000000 00:00 0
> > > > > > 7fff8c130000-7fff8c360000 r-xp 00000000 fd:00 792355
> > > > > > 7fff8c360000-7fff8c370000 r--p 00230000 fd:00 792355
> > > > > > 7fff8c370000-7fff8c380000 rw-p 00240000 fd:00 792355
> > > > > > 7fff8c380000-7fff8c460000 r-xp 00000000 fd:00 792358
> > > > > > 7fff8c460000-7fff8c470000 r--p 000d0000 fd:00 792358
> > > > > > 7fff8c470000-7fff8c480000 rw-p 000e0000 fd:00 792358
> > > > > > 7fff8c490000-7fff8c4d0000 r--p 00000000 00:00 0
> > > > > > 7fff8c4d0000-7fff8c4e0000 r-xp 00000000 00:00 0
> > > > > > 7fff8c4e0000-7fff8c530000 r-xp 00000000 fd:00 792351
> > > > > > 7fff8c530000-7fff8c540000 r--p 00040000 fd:00 792351
> > > > > > 7fff8c540000-7fff8c550000 rw-p 00050000 fd:00 792351
> > > > > > 7fff8d000000-7fffcd000000 r--p 00000000 00:00 0
> > > > > > 7fffe9c80000-7fffe9d90000 rw-p 00000000 00:00 0
> > > > > > 800000000000-2000000000000 r--p 00000000 00:00 0 -> High Address (128TB to 512TB)
> > > > > >
> > > > > > diff --git a/tools/testing/selftests/mm/virtual_address_range.c b/tools/testing/selftests/mm/virtual_address_range.c
> > > > > > index 4c4c35eac15e..0be008cba4b0 100644
> > > > > > --- a/tools/testing/selftests/mm/virtual_address_range.c
> > > > > > +++ b/tools/testing/selftests/mm/virtual_address_range.c
> > > > > > @@ -56,21 +56,21 @@
> > > > > > #ifdef __aarch64__
> > > > > > #define HIGH_ADDR_MARK ADDR_MARK_256TB
> > > > > > -#define HIGH_ADDR_SHIFT 49
> > > > > > +#define HIGH_ADDR_SHIFT 48
> > > > > > #define NR_CHUNKS_LOW NR_CHUNKS_256TB
> > > > > > #define NR_CHUNKS_HIGH NR_CHUNKS_3840TB
> > > > > > #else
> > > > > > #define HIGH_ADDR_MARK ADDR_MARK_128TB
> > > > > > -#define HIGH_ADDR_SHIFT 48
> > > > > > +#define HIGH_ADDR_SHIFT 47
> > > > > > #define NR_CHUNKS_LOW NR_CHUNKS_128TB
> > > > > > #define NR_CHUNKS_HIGH NR_CHUNKS_384TB
> > > > > > #endif
> > > > > > -static char *hint_addr(void)
> > > > > > +static char *hint_addr(int hint)
> > > > > > {
> > > > > > - int bits = HIGH_ADDR_SHIFT + rand() % (63 - HIGH_ADDR_SHIFT);
> > > > > > + unsigned long addr = ((1UL << HIGH_ADDR_SHIFT) + (hint * MAP_CHUNK_SIZE));
> > > > > > - return (char *) (1UL << bits);
> > > > > > + return (char *) (addr);
> > > > > > }
> > > > > > static void validate_addr(char *ptr, int high_addr)
> > > > > > @@ -217,7 +217,7 @@ int main(int argc, char *argv[])
> > > > > > }
> > > > > > for (i = 0; i < NR_CHUNKS_HIGH; i++) {
> > > > > > - hint = hint_addr();
> > > > > > + hint = hint_addr(i);
> > > > > > hptr[i] = mmap(hint, MAP_CHUNK_SIZE, PROT_READ,
> > > > > > MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
> > > > > >
> > > > > >
> > > > > >
> > > > > > Can we fix it this way?
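For reference, the difference between the two hint schemes discussed in
this thread can be modeled outside the kernel. The following is a minimal
Python sketch, not the selftest itself: it mirrors the arithmetic of the
old and new hint_addr() from the diff above for the non-arm64 case. It
assumes MAP_CHUNK_SIZE is 1GB (inferred from the 0x40000000-sized VMAs in
the maps output) and that NR_CHUNKS_384TB means 384TB worth of such chunks.
The names random_hint() and sequential_hint() are made up for illustration.

```python
import random

MAP_CHUNK_SIZE = 1 << 30      # assumption: 1GB, inferred from the maps output
OLD_HIGH_ADDR_SHIFT = 48      # non-arm64 value before the diff
NEW_HIGH_ADDR_SHIFT = 47      # non-arm64 value after the diff (128TB)
# Assumption: NR_CHUNKS_384TB is 384TB divided into MAP_CHUNK_SIZE chunks.
NR_CHUNKS_HIGH = (384 << 40) // MAP_CHUNK_SIZE


def random_hint(rng: random.Random) -> int:
    """Old scheme: 1UL << (HIGH_ADDR_SHIFT + rand() % (63 - HIGH_ADDR_SHIFT))."""
    bits = OLD_HIGH_ADDR_SHIFT + rng.randrange(63 - OLD_HIGH_ADDR_SHIFT)
    return 1 << bits


def sequential_hint(i: int) -> int:
    """New scheme: (1UL << HIGH_ADDR_SHIFT) + i * MAP_CHUNK_SIZE."""
    return (1 << NEW_HIGH_ADDR_SHIFT) + i * MAP_CHUNK_SIZE


if __name__ == "__main__":
    rng = random.Random(0)
    # Every random hint is a single power of two, so only a handful of
    # distinct hint addresses can ever be produced.
    print(sorted(hex(h) for h in {random_hint(rng) for _ in range(1000)}))
    # Sequential hints start at 128TB and advance one chunk at a time.
    print(hex(sequential_hint(0)), hex(sequential_hint(NR_CHUNKS_HIGH)))
```

Under these assumptions the old scheme can only produce power-of-two
hints between 2^48 and 2^62, which lines up with the isolated 1GB VMAs at
1000000000000, 2000000000000, 4000000000000 and 8000000000000 in the
output above. The sequential scheme walks from 128TB to 512TB one chunk at
a time, which matches the single 800000000000-2000000000000 VMA.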
Add support for SuperH/"sh" to nolibc.
Only sh4 is tested for now.
This is only tested on QEMU so far.
Additional testing would be very welcome.
Test instructions:
$ cd tools/testing/selftests/nolibc/
$ make -f Makefile.nolibc ARCH=sh CROSS_COMPILE=sh4-linux- nolibc-test
$ file nolibc-test
nolibc-test: ELF 32-bit LSB executable, Renesas SH, version 1 (SYSV), statically linked, not stripped
$ ./nolibc-test
Running test 'startup'
0 argc = 1 [OK]
...
Total number of errors: 0
Exiting with status 0
Signed-off-by: Thomas Weißschuh <linux(a)weissschuh.net>
---
Changes in v2:
- Rebase onto latest nolibc-next
- Pick up Ack from Willy
- Provide some test instructions
- Link to v1: https://lore.kernel.org/r/20250609-nolibc-sh-v1-0-9dcdb1b66bb5@weissschuh.n…
---
Thomas Weißschuh (3):
selftests/nolibc: fix EXTRACONFIG variables ordering
selftests/nolibc: use file driver for QEMU serial
tools/nolibc: add support for SuperH
tools/include/nolibc/arch-sh.h | 162 +++++++++++++++++++++++++
tools/include/nolibc/arch.h | 2 +
tools/testing/selftests/nolibc/Makefile.nolibc | 15 ++-
tools/testing/selftests/nolibc/run-tests.sh | 3 +-
4 files changed, 177 insertions(+), 5 deletions(-)
---
base-commit: eb135311083100b6590a7545618cd9760d896a86
change-id: 20250528-nolibc-sh-8b4e3bb8efcb
Best regards,
--
Thomas Weißschuh <linux(a)weissschuh.net>
This series creates a new PMU scheme on ARM, a partitioned PMU that
allows reserving a subset of counters for more direct guest access,
significantly reducing overhead. More details, including performance
benchmarks, can be read in the v1 cover letter linked below.
v2:
* Rebased on top of kvm/queue to pick up Sean's patch [1] that
reorganizes some of the same headers and would otherwise conflict.
* Changed the semantics of the command line parameters and the
ioctl. It was pointed out in the comments last time that it doesn't
work to repartition at runtime because the perf subsystem assumes
the number of counters it gets will not change after the PMU is
probed. Now the PMUv3 command line parameters are the sole thing
that divides up guest and host counters and the ioctl just toggles a
flag for whether a vcpu should use the partitioned PMU. I've also
moved from one to two parameters: partition_pmu=[y/n] and
reserved_guest_counters=[0-N]. This makes it possible to
unambiguously express configurations like a partitioned PMU with 0
general purpose counters exposed to the guest (which still exposes
the cycle counter).
* Moved the partitioning code into the PMUv3 driver itself so KVM code
isn't modifying fields that are otherwise internal to the driver.
* Define PMI{CNTR,FILTR} as undef_access since KVM isn't ready to
support that counter. It is, however, still handled in the
partitioning because the driver recognizes it.
* Take out the dependency on FEAT_FGT since it is not widely available
on hardware yet. Instead, define a fast path in switch.h for
handling accesses to the registers that would otherwise be
untrapped.
* During MDCR_EL2 setup for guests, ensure the computed HPMN value is
always below the number of guest counters allocated by the driver at
boot and always below the number of counters on the current
CPU. This accounts for the possibility of heterogeneous hardware
where a guest might be able to use the partitioned PMU on one CPU
but not another.
* The KVM PMU event filter API says that counters must not count while
the event is filtered. To ensure this, enforce the filter on every
vcpu_load into the guest.
* Settable PMCR_EL0.N with a partitioned PMU now works and the
vcpu_counter_access selftest changes reflect that.
v1:
https://lore.kernel.org/kvm/20250602192702.2125115-1-coltonlewis@google.com/
Colton Lewis (22):
arm64: cpufeature: Add cpucap for HPMN0
arm64: Generate sign macro for sysreg Enums
arm64: cpufeature: Add cpucap for PMICNTR
arm64: Define PMI{CNTR,FILTR}_EL0 as undef_access
KVM: arm64: Reorganize PMU functions
perf: arm_pmuv3: Introduce method to partition the PMU
perf: arm_pmuv3: Generalize counter bitmasks
perf: arm_pmuv3: Keep out of guest counter partition
KVM: arm64: Correct kvm_arm_pmu_get_max_counters()
KVM: arm64: Set up FGT for Partitioned PMU
KVM: arm64: Writethrough trapped PMEVTYPER register
KVM: arm64: Use physical PMSELR for PMXEVTYPER if partitioned
KVM: arm64: Writethrough trapped PMOVS register
KVM: arm64: Write fast path PMU register handlers
KVM: arm64: Setup MDCR_EL2 to handle a partitioned PMU
KVM: arm64: Account for partitioning in PMCR_EL0 access
KVM: arm64: Context swap Partitioned PMU guest registers
KVM: arm64: Enforce PMU event filter at vcpu_load()
perf: arm_pmuv3: Handle IRQs for Partitioned PMU guest counters
KVM: arm64: Inject recorded guest interrupts
KVM: arm64: Add ioctl to partition the PMU when supported
KVM: arm64: selftests: Add test case for partitioned PMU
Marc Zyngier (1):
KVM: arm64: Cleanup PMU includes
Documentation/virt/kvm/api.rst | 21 +
arch/arm/include/asm/arm_pmuv3.h | 34 +
arch/arm64/include/asm/arm_pmuv3.h | 61 +-
arch/arm64/include/asm/kvm_host.h | 20 +-
arch/arm64/include/asm/kvm_pmu.h | 61 ++
arch/arm64/kernel/cpufeature.c | 15 +
arch/arm64/kvm/Makefile | 2 +-
arch/arm64/kvm/arm.c | 22 +
arch/arm64/kvm/debug.c | 24 +-
arch/arm64/kvm/hyp/include/hyp/switch.h | 233 ++++++
arch/arm64/kvm/pmu-emul.c | 676 +----------------
arch/arm64/kvm/pmu-part.c | 359 +++++++++
arch/arm64/kvm/pmu.c | 687 ++++++++++++++++++
arch/arm64/kvm/sys_regs.c | 66 +-
arch/arm64/tools/cpucaps | 2 +
arch/arm64/tools/gen-sysreg.awk | 1 +
arch/arm64/tools/sysreg | 6 +-
drivers/perf/arm_pmuv3.c | 150 +++-
include/linux/perf/arm_pmu.h | 15 +-
include/linux/perf/arm_pmuv3.h | 14 +-
include/uapi/linux/kvm.h | 4 +
tools/include/uapi/linux/kvm.h | 2 +
.../selftests/kvm/arm64/vpmu_counter_access.c | 63 +-
virt/kvm/kvm_main.c | 1 +
24 files changed, 1791 insertions(+), 748 deletions(-)
create mode 100644 arch/arm64/kvm/pmu-part.c
base-commit: 79150772457f4d45e38b842d786240c36bb1f97f
--
2.50.0.714.g196bf9f422-goog
Corrected two instances of the misspelled word 'occurences' to
'occurrences' in comments explaining node invariants in sparsebit.c.
These comments describe core behavior of the data structure and
should be clear.
Signed-off-by: Rahul Kumar <rk0006818(a)gmail.com>
---
tools/testing/selftests/kvm/lib/sparsebit.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/tools/testing/selftests/kvm/lib/sparsebit.c b/tools/testing/selftests/kvm/lib/sparsebit.c
index cfed9d26cc71..a99188f87a38 100644
--- a/tools/testing/selftests/kvm/lib/sparsebit.c
+++ b/tools/testing/selftests/kvm/lib/sparsebit.c
@@ -116,7 +116,7 @@
*
* + A node with all mask bits set only occurs when the last bit
* described by the previous node is not equal to this nodes
- * starting index - 1. All such occurences of this condition are
+ * starting index - 1. All such occurrences of this condition are
* avoided by moving the setting of the nodes mask bits into
* the previous nodes num_after setting.
*
@@ -592,7 +592,7 @@ static struct node *node_split(struct sparsebit *s, sparsebit_idx_t idx)
*
* + A node with all mask bits set only occurs when the last bit
* described by the previous node is not equal to this nodes
- * starting index - 1. All such occurences of this condition are
+ * starting index - 1. All such occurrences of this condition are
* avoided by moving the setting of the nodes mask bits into
* the previous nodes num_after setting.
*/
--
2.43.0
This patch fixes two misspellings of the word 'occurrences' in comments within sparsebit.c used by the KVM selftests.
Fixing the spelling improves readability and clarity of the documented behavior.
Only comment text has been changed — there are no modifications to the functional logic of the tests.
I would appreciate your review and any feedback you may have.
Thank you for your time and support.
Best regards,
Rahul Kumar
Non-KVM folks,
I am hoping to route this through the KVM tree (6.17 or later), as the non-KVM
changes should be glorified nops. Please holler if you object to that idea.
Hyper-V folks in particular, let me know if you want a stable topic branch/tag,
e.g. on the off chance you want to make similar changes to the Hyper-V code,
and I'll make sure that happens.
As for what this series actually does...
Rework KVM's irqfd registration to require that an eventfd is bound to at
most one irqfd throughout the entire system. KVM currently disallows
binding an eventfd to multiple irqfds for a single VM, but doesn't reject
attempts to bind an eventfd to multiple VMs.
This is obviously an ABI change, but I'm fairly confident that it won't
break userspace, because binding an eventfd to multiple irqfds hasn't
truly worked since commit e8dbf19508a1 ("kvm/eventfd: Use priority waitqueue
to catch events before userspace"). A somewhat undocumented, and perhaps
even unintentional, side effect of suppressing eventfd notifications for
userspace is that the priority+exclusive behavior also suppresses eventfd
notifications for any subsequent waiters, even if they are priority waiters.
I.e. only the first VM with an irqfd+eventfd binding will get notifications.
And for IRQ bypass, a.k.a. device posted interrupts, globally unique
bindings are a hard requirement (at least on x86; I assume other archs are
the same). KVM and the IRQ bypass manager kinda sorta handle this, but in
the absolute worst way possible (IMO). Instead of surfacing an error to
userspace, KVM silently ignores IRQ bypass registration errors.
The motivation for this series is to harden against userspace goofs. AFAIK,
we (Google) have never actually had a bug where userspace tries to assign
an eventfd to multiple VMs, but the possibility has come up in more than one
bug investigation (our intra-host, a.k.a. copyless, migration scheme
transfers eventfds from the old to the new VM when updating the host VMM).
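The uniqueness rule can be illustrated with a small userspace model. This
is only a sketch of the intended semantics, not KVM code: the IrqfdRegistry
class, its method names, and the use of -EBUSY as the rejection value are
assumptions made for illustration.

```python
import errno


class IrqfdRegistry:
    """Toy model: an eventfd may back at most one irqfd system-wide."""

    def __init__(self) -> None:
        # eventfd identifier -> (vm, gsi) of the single irqfd bound to it
        self._bound: dict[int, tuple[str, int]] = {}

    def assign(self, vm: str, eventfd: int, gsi: int) -> int:
        """Bind eventfd to an irqfd; reject any second binding."""
        if eventfd in self._bound:
            # Rejected whether the existing binding belongs to the same
            # VM or to a different one (hypothetical errno choice).
            return -errno.EBUSY
        self._bound[eventfd] = (vm, gsi)
        return 0

    def deassign(self, eventfd: int) -> None:
        """Drop the binding, making the eventfd available again."""
        self._bound.pop(eventfd, None)


if __name__ == "__main__":
    reg = IrqfdRegistry()
    print(reg.assign("vm-a", 7, 24))   # first binding is accepted
    print(reg.assign("vm-b", 7, 24))   # same eventfd, other VM: rejected
```

In this model the second assign() fails regardless of which VM issues it,
which is the behavior the series aims for; the pre-series behavior would
correspond to keying the dictionary on (vm, eventfd) instead.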
v3:
- Retain WQ_FLAG_EXCLUSIVE in mshv_eventfd.c, which snuck in between v1
and v2. [Peter]
- Use EXPORT_SYMBOL_GPL. [Peter]
- Move WQ_FLAG_EXCLUSIVE out of add_wait_queue_priority() in a prep patch
so that the affected subsystems are more explicitly documented (and then
immediately drop the flag from drivers/xen/privcmd.c, which amusingly
hides that file from the diff stats).
v2:
- https://lore.kernel.org/all/20250519185514.2678456-1-seanjc@google.com
- Use guard(spinlock_irqsave). [Prateek]
v1: https://lore.kernel.org/all/20250401204425.904001-1-seanjc@google.com
Sean Christopherson (13):
KVM: Use a local struct to do the initial vfs_poll() on an irqfd
KVM: Acquire SCRU lock outside of irqfds.lock during assignment
KVM: Initialize irqfd waitqueue callback when adding to the queue
KVM: Add irqfd to KVM's list via the vfs_poll() callback
KVM: Add irqfd to eventfd's waitqueue while holding irqfds.lock
sched/wait: Drop WQ_FLAG_EXCLUSIVE from add_wait_queue_priority()
xen: privcmd: Don't mark eventfd waiter as EXCLUSIVE
sched/wait: Add a waitqueue helper for fully exclusive priority
waiters
KVM: Disallow binding multiple irqfds to an eventfd with a priority
waiter
KVM: Drop sanity check that per-VM list of irqfds is unique
KVM: selftests: Assert that eventfd() succeeds in Xen shinfo test
KVM: selftests: Add utilities to create eventfds and do KVM_IRQFD
KVM: selftests: Add a KVM_IRQFD test to verify uniqueness requirements
drivers/hv/mshv_eventfd.c | 8 ++
include/linux/kvm_irqfd.h | 1 -
include/linux/wait.h | 2 +
kernel/sched/wait.c | 22 ++-
tools/testing/selftests/kvm/Makefile.kvm | 1 +
tools/testing/selftests/kvm/arm64/vgic_irq.c | 12 +-
.../testing/selftests/kvm/include/kvm_util.h | 40 ++++++
tools/testing/selftests/kvm/irqfd_test.c | 130 ++++++++++++++++++
.../selftests/kvm/x86/xen_shinfo_test.c | 21 +--
virt/kvm/eventfd.c | 130 +++++++++++++-----
10 files changed, 302 insertions(+), 65 deletions(-)
create mode 100644 tools/testing/selftests/kvm/irqfd_test.c
base-commit: 45eb29140e68ffe8e93a5471006858a018480a45
--
2.49.0.1151.ga128411c76-goog
Add a basic selftest for the netpoll polling mechanism, specifically
targeting the netpoll poll() side.
The test creates a scenario where network transmission is running at
maximum speed, and netpoll needs to poll the NIC. This is achieved by:
1. Configuring a single RX/TX queue to create contention
2. Generating background traffic to saturate the interface
3. Sending netconsole messages to trigger netpoll polling
4. Using dynamic netconsole targets via configfs
5. Deleting and recreating netconsole targets after 5 iterations
The test validates a critical netpoll code path by monitoring traffic
flow and ensuring netpoll_poll_dev() is called when the normal TX path
is blocked. Perf probing confirms this test successfully triggers
netpoll_poll_dev() in typical test runs.
This addresses a gap in netpoll test coverage for a path that is
tricky for the network stack.
Signed-off-by: Breno Leitao <leitao(a)debian.org>
---
Changes since RFC:
- Toggle the netconsole interfaces up and down after 5 iterations.
- Moved the traffic check under DEBUG (Willem de Bruijn).
- Bumped the iterations to 20 given it runs faster now.
- Link to the RFC: https://lore.kernel.org/r/20250612-netpoll_test-v1-1-4774fd95933f@debian.org
---
tools/testing/selftests/drivers/net/Makefile | 1 +
.../testing/selftests/drivers/net/netpoll_basic.py | 231 +++++++++++++++++++++
2 files changed, 232 insertions(+)
diff --git a/tools/testing/selftests/drivers/net/Makefile b/tools/testing/selftests/drivers/net/Makefile
index bd309b2d39095..9bd84d6b542e5 100644
--- a/tools/testing/selftests/drivers/net/Makefile
+++ b/tools/testing/selftests/drivers/net/Makefile
@@ -16,6 +16,7 @@ TEST_PROGS := \
netcons_fragmented_msg.sh \
netcons_overflow.sh \
netcons_sysdata.sh \
+ netpoll_basic.py \
ping.py \
queues.py \
stats.py \
diff --git a/tools/testing/selftests/drivers/net/netpoll_basic.py b/tools/testing/selftests/drivers/net/netpoll_basic.py
new file mode 100755
index 0000000000000..2a81926169262
--- /dev/null
+++ b/tools/testing/selftests/drivers/net/netpoll_basic.py
@@ -0,0 +1,231 @@
+#!/usr/bin/env python3
+# SPDX-License-Identifier: GPL-2.0
+
+# This test aims to evaluate the netpoll polling mechanism (as in
+# netpoll_poll_dev()). It presents a complex scenario where the network
+# attempts to send a packet but fails, prompting it to poll the NIC from within
+# the netpoll TX side.
+#
+# This has been a crucial path in netpoll that was previously untested. Jakub
+# suggested using a single RX/TX queue, pushing traffic to the NIC, and then
+# sending netpoll messages (via netconsole) to trigger the poll. `perf` probing
+# of netpoll_poll_dev() showed that this test indeed triggers
+# netpoll_poll_dev() once or twice in 10 iterations.
+
+# Author: Breno Leitao <leitao(a)debian.org>
+
+import errno
+import os
+import random
+import string
+import time
+
+from lib.py import (
+    ethtool,
+    GenerateTraffic,
+    ksft_exit,
+    ksft_pr,
+    ksft_run,
+    KsftFailEx,
+    KsftSkipEx,
+    NetdevFamily,
+    NetDrvEpEnv,
+)
+
+NETCONSOLE_CONFIGFS_PATH = "/sys/kernel/config/netconsole"
+REMOTE_PORT = 6666
+LOCAL_PORT = 1514
+# Number of netcons messages to send. I usually see netpoll_poll_dev()
+# being called at least once in 10 iterations, so use 20 to leave some margin.
+ITERATIONS = 20
+DEBUG = False
+
+
+def generate_random_netcons_name() -> str:
+ """Generate a random target name starting with 'netcons'"""
+ random_suffix = "".join(random.choices(string.ascii_lowercase + string.digits, k=8))
+ return f"netcons_{random_suffix}"
+
+
+def get_stats(cfg: NetDrvEpEnv, netdevnl: NetdevFamily) -> dict[str, int]:
+ """Get the statistics for the interface"""
+ return netdevnl.qstats_get({"ifindex": cfg.ifindex}, dump=True)[0]
+
+
+def set_single_rx_tx_queue(interface_name: str) -> None:
+ """Set the number of RX and TX queues to 1 using ethtool"""
+ try:
+ # This doesn't need to be reverted, since the interfaces are deleted after the test
+ ethtool(f"-G {interface_name} rx 1 tx 1")
+ except Exception as e:
+ raise KsftSkipEx(
+ f"Failed to configure RX/TX queues: {e}. Ethtool not available?"
+ )
+
+
+def create_netconsole_target(
+ config_data: dict[str, str],
+ target_name: str,
+) -> None:
+ """Create a netconsole dynamic target against the interfaces"""
+ ksft_pr(f"Using netconsole name: {target_name}")
+ try:
+ os.makedirs(f"{NETCONSOLE_CONFIGFS_PATH}/{target_name}", exist_ok=True)
+ ksft_pr(f"Created target directory: {NETCONSOLE_CONFIGFS_PATH}/{target_name}")
+ except OSError as e:
+ if e.errno != errno.EEXIST:
+ raise KsftFailEx(f"Failed to create netconsole target directory: {e}")
+
+ try:
+ for key, value in config_data.items():
+ if DEBUG:
+ ksft_pr(f"Setting {key} to {value}")
+ with open(
+ f"{NETCONSOLE_CONFIGFS_PATH}/{target_name}/{key}",
+ "w",
+ encoding="utf-8",
+ ) as f:
+ # Always convert to string to write to file
+ f.write(str(value))
+ f.close()
+
+ if DEBUG:
+ # Read all configuration values for debugging
+ for debug_key in config_data.keys():
+ with open(
+ f"{NETCONSOLE_CONFIGFS_PATH}/{target_name}/{debug_key}",
+ "r",
+ encoding="utf-8",
+ ) as f:
+ content = f.read()
+ ksft_pr(
+ f"{NETCONSOLE_CONFIGFS_PATH}/{target_name}/{debug_key} {content}"
+ )
+
+ except Exception as e:
+ raise KsftFailEx(f"Failed to configure netconsole target: {e}")
+
+
+def set_netconsole(cfg: NetDrvEpEnv, interface_name: str, target_name: str) -> None:
+ """Configure netconsole on the interface with the given target name"""
+ config_data = {
+ "extended": "1",
+ "dev_name": interface_name,
+ "local_port": LOCAL_PORT,
+ "remote_port": REMOTE_PORT,
+ "local_ip": cfg.addr_v["4"] if cfg.addr_ipver == "4" else cfg.addr_v["6"],
+ "remote_ip": (
+ cfg.remote_addr_v["4"] if cfg.addr_ipver == "4" else cfg.remote_addr_v["6"]
+ ),
+ "remote_mac": "00:00:00:00:00:00", # Not important for this test
+ "enabled": "1",
+ }
+
+ create_netconsole_target(config_data, target_name)
+ ksft_pr(f"Created netconsole target: {target_name} on interface {interface_name}")
+
+
+def delete_netconsole_target(name: str) -> None:
+ """Delete a netconsole dynamic target"""
+ target_path = f"{NETCONSOLE_CONFIGFS_PATH}/{name}"
+ try:
+ if os.path.exists(target_path):
+ os.rmdir(target_path)
+ except OSError as e:
+ raise KsftFailEx(f"Failed to delete netconsole target: {e}")
+
+
+def check_traffic_flowing(cfg: NetDrvEpEnv, netdevnl: NetdevFamily) -> int:
+ """Check if traffic is flowing on the interface"""
+ stat1 = get_stats(cfg, netdevnl)
+ time.sleep(1)
+ stat2 = get_stats(cfg, netdevnl)
+ pkts_per_sec = stat2["rx-packets"] - stat1["rx-packets"]
+ # Just make sure this will not fail even in slow/debug kernels
+ if pkts_per_sec < 10:
+ raise KsftFailEx(f"Traffic seems low: {pkts_per_sec}")
+ if DEBUG:
+ ksft_pr(f"Traffic per second {pkts_per_sec}")
+
+ return pkts_per_sec
+
+
+def do_netpoll_flush(
+ cfg: NetDrvEpEnv, netdevnl: NetdevFamily, ifname: str, target_name: str
+) -> None:
+ """Print messages to the console, trying to trigger a netpoll poll"""
+
+ set_netconsole(cfg, ifname, target_name)
+ for i in range(int(ITERATIONS)):
+ msg = f"netcons test #{i}."
+
+ if DEBUG:
+ pkts_per_s = check_traffic_flowing(cfg, netdevnl)
+ msg += f" ({pkts_per_s} packets/s)"
+
+ with open("/dev/kmsg", "w", encoding="utf-8") as kmsg:
+ kmsg.write(msg)
+
+ if not i % 5:
+ # Every 5 iterations, toggle netconsole
+ delete_netconsole_target(target_name)
+ set_netconsole(cfg, ifname, target_name)
+
+
+def test_netpoll(cfg: NetDrvEpEnv, netdevnl: NetdevFamily) -> None:
+ """
+ Test netpoll by sending traffic to the interface and then sending
+ netconsole messages to trigger a poll
+ """
+
+ target_name = generate_random_netcons_name()
+ ifname = cfg.dev["ifname"]
+ traffic = None
+
+ try:
+ set_single_rx_tx_queue(ifname)
+ traffic = GenerateTraffic(cfg)
+ check_traffic_flowing(cfg, netdevnl)
+ do_netpoll_flush(cfg, netdevnl, ifname, target_name)
+ finally:
+ if traffic:
+ traffic.stop()
+ delete_netconsole_target(target_name)
+
+
+def check_dependencies() -> None:
+ """Check if the dependencies are met"""
+ if not os.path.exists(NETCONSOLE_CONFIGFS_PATH):
+ raise KsftSkipEx(
+ f"Directory {NETCONSOLE_CONFIGFS_PATH} does not exist. CONFIG_NETCONSOLE_DYNAMIC might not be set."
+ )
+
+
+def load_netconsole_module() -> None:
+ """Try to load the netconsole module"""
+ try:
+ os.system("modprobe netconsole")
+ except Exception:
+ # It is fine if loading the module fails; check_dependencies() will
+ # skip the test later
+ pass
+
+
+def main() -> None:
+ """Main function to run the test"""
+ load_netconsole_module()
+ check_dependencies()
+ netdevnl = NetdevFamily()
+ with NetDrvEpEnv(__file__, nsim_test=True) as cfg:
+ ksft_run(
+ [test_netpoll],
+ args=(
+ cfg,
+ netdevnl,
+ ),
+ )
+ ksft_exit()
+
+
+if __name__ == "__main__":
+ main()
---
base-commit: 4f4040ea5d3e4bebebbef9379f88085c8b99221c
change-id: 20250612-netpoll_test-a1324d2057c8
Best regards,
--
Breno Leitao <leitao(a)debian.org>
The current implementation of test_unmerge_uffd_wp() explicitly sets
`uffdio_api.features = UFFD_FEATURE_PAGEFAULT_FLAG_WP` before calling
UFFDIO_API. This can cause the ioctl() call to fail with EINVAL on kernels
that do not support UFFD-WP, leading the test to fail unnecessarily:
# ------------------------------
# running ./ksm_functional_tests
# ------------------------------
# TAP version 13
# 1..9
# # [RUN] test_unmerge
# ok 1 Pages were unmerged
# # [RUN] test_unmerge_zero_pages
# ok 2 KSM zero pages were unmerged
# # [RUN] test_unmerge_discarded
# ok 3 Pages were unmerged
# # [RUN] test_unmerge_uffd_wp
# not ok 4 UFFDIO_API failed <-----
# # [RUN] test_prot_none
# ok 5 Pages were unmerged
# # [RUN] test_prctl
# ok 6 Setting/clearing PR_SET_MEMORY_MERGE works
# # [RUN] test_prctl_fork
# # No pages got merged
# # [RUN] test_prctl_fork_exec
# ok 7 PR_SET_MEMORY_MERGE value is inherited
# # [RUN] test_prctl_unmerge
# ok 8 Pages were unmerged
# Bail out! 1 out of 8 tests failed
# # Planned tests != run tests (9 != 8)
# # Totals: pass:7 fail:1 xfail:0 xpass:0 skip:0 error:0
# [FAIL]
This patch improves compatibility and error handling by:
1. Changing the feature check to first query supported features (features=0)
rather than specifically requesting WP support.
2. Gracefully skipping the test if:
- UFFDIO_API fails with EINVAL (feature not supported), or
- UFFD_FEATURE_PAGEFAULT_FLAG_WP is not advertised by the kernel.
3. Providing better diagnostics by distinguishing expected failures (e.g.,
EINVAL) from unexpected ones and reporting them using strerror().
The updated logic makes the test more robust across different kernel versions
and configurations, while preserving existing behavior on systems that do
support UFFD-WP.
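
For illustration only, the probe-then-check flow described above can be
sketched as a standalone snippet. This is not the selftest code: the
wp_probe enum and the classify_uffd_api() and probe_uffd_wp() helpers are
hypothetical names, and treating a failed userfaultfd() syscall as a skip
is an assumption of the sketch.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <linux/userfaultfd.h>

/* Hypothetical outcome type, not part of the selftest. */
enum wp_probe { WP_AVAILABLE, WP_SKIP, WP_FAIL };

/*
 * Classify the result of a UFFDIO_API call issued with features = 0.
 * ret and err are the ioctl() return value and the errno it left behind;
 * features is the mask the kernel wrote back into struct uffdio_api.
 */
static enum wp_probe classify_uffd_api(int ret, int err, uint64_t features)
{
	if (ret < 0)
		return err == EINVAL ? WP_SKIP : WP_FAIL;
	if (!(features & UFFD_FEATURE_PAGEFAULT_FLAG_WP))
		return WP_SKIP;
	return WP_AVAILABLE;
}

/* Open a userfaultfd and probe it (assumption: open failure means skip). */
static enum wp_probe probe_uffd_wp(void)
{
	struct uffdio_api api;
	enum wp_probe res;
	int uffd, ret, err;

	uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);
	if (uffd < 0)
		return WP_SKIP;

	memset(&api, 0, sizeof(api));
	api.api = UFFD_API;
	api.features = 0;	/* request nothing, only ask what is supported */
	ret = ioctl(uffd, UFFDIO_API, &api);
	err = errno;		/* save errno before close() can change it */
	res = classify_uffd_api(ret, err, api.features);
	close(uffd);
	return res;
}
```

A caller would map WP_SKIP to ksft_test_result_skip() and WP_FAIL to
ksft_test_result_fail() with strerror(errno). Keeping the classification
separate from the ioctl() call makes the three outcomes easy to check
without a UFFD-capable kernel.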
Signed-off-by: Li Wang <liwang(a)redhat.com>
Cc: Aruna Ramakrishna <aruna.ramakrishna(a)oracle.com>
Cc: Bagas Sanjaya <bagasdotme(a)gmail.com>
Cc: Catalin Marinas <catalin.marinas(a)arm.com>
Cc: Dave Hansen <dave.hansen(a)linux.intel.com>
Cc: David Hildenbrand <david(a)redhat.com>
Cc: Joey Gouly <joey.gouly(a)arm.com>
Cc: Johannes Weiner <hannes(a)cmpxchg.org>
Cc: Keith Lucas <keith.lucas(a)oracle.com>
Cc: Ryan Roberts <ryan.roberts(a)arm.com>
Cc: Shuah Khan <shuah(a)kernel.org>
---
tools/testing/selftests/mm/ksm_functional_tests.c | 8 ++++++--
1 file changed, 6 insertions(+), 2 deletions(-)
diff --git a/tools/testing/selftests/mm/ksm_functional_tests.c b/tools/testing/selftests/mm/ksm_functional_tests.c
index b61803e36d1c..f3db257dc555 100644
--- a/tools/testing/selftests/mm/ksm_functional_tests.c
+++ b/tools/testing/selftests/mm/ksm_functional_tests.c
@@ -393,9 +393,13 @@ static void test_unmerge_uffd_wp(void)
/* See if UFFD-WP is around. */
uffdio_api.api = UFFD_API;
- uffdio_api.features = UFFD_FEATURE_PAGEFAULT_FLAG_WP;
+ uffdio_api.features = 0;
if (ioctl(uffd, UFFDIO_API, &uffdio_api) < 0) {
- ksft_test_result_fail("UFFDIO_API failed\n");
+ if (errno == EINVAL)
+ ksft_test_result_skip("UFFDIO_API not supported (EINVAL)\n");
+ else
+ ksft_test_result_fail("UFFDIO_API failed: %s\n", strerror(errno));
+
goto close_uffd;
}
if (!(uffdio_api.features & UFFD_FEATURE_PAGEFAULT_FLAG_WP)) {
--
2.49.0