The upcoming new Idle HLT Intercept feature allows for the HLT
instruction execution by a vCPU to be intercepted by the hypervisor
only if there are no pending V_INTR and V_NMI events for the vCPU.
When the vCPU is expected to service the pending V_INTR and V_NMI
events, the Idle HLT intercept won’t trigger. The feature allows the
hypervisor to determine if the vCPU is actually idle and reduces
wasteful VMEXITs.
The Idle HLT intercept feature is used for enlightened guests who wish
to securely handle the events. When an enlightened guest does a HLT
while an interrupt is pending, hypervisor will not have a way to
figure out whether the guest needs to be re-entered or not. The Idle
HLT intercept feature allows the HLT execution only if there are no
pending V_INTR and V_NMI events.
Presence of the Idle HLT Intercept feature is indicated via CPUID
function Fn8000_000A_EDX[30].
Document for the Idle HLT intercept feature is available at [1].
This series is based on kvm-x86/next (eb723766b103) + [2].
Testing Done:
- Tested the functionality for the Idle HLT intercept feature
using selftest ipi_hlt_test.
- Tested on normal, SEV, SEV-ES, SEV-SNP guest for the Idle HLT intercept
functionality.
- Tested the Idle HLT intercept functionality on nested guest.
v5 -> v6
- Incorporated Neeraj's review comments on selftest.
v4 -> v5
- Incorporated Sean's review comments on nested Idle HLT intercept support.
- Make svm_idle_hlt_test independent of the Idle HLT to run on all hardware.
v3 -> v4
- Drop the patches to add vcpu_get_stat() into a new series [2].
- Added nested Idle HLT intercept support.
v2 -> v3
- Incorporated Andrew's suggestion to structure vcpu_stat_types in
a way that each architecture can share the generic types and also
provide its own.
v1 -> v2
- Did changes in svm_idle_hlt_test based on the review comments from Sean.
- Added an enum based approach to get binary stats in vcpu_get_stat() which
doesn't use string to get stat data based on the comments from Sean.
- Added safe_halt() and cli() helpers based on the comments from Sean.
[1]: AMD64 Architecture Programmer's Manual Pub. 24593, April 2024,
Vol 2, 15.9 Instruction Intercepts (Table 15-7: IDLE_HLT).
https://bugzilla.kernel.org/attachment.cgi?id=306251
[2]: https://lore.kernel.org/kvm/ee027335-f1b9-4637-bc79-27a610c1ab08@amd.com/T/…
---
V5: https://lore.kernel.org/kvm/20250103081828.7060-1-manali.shukla@amd.com/
V4: https://lore.kernel.org/kvm/20241022054810.23369-1-manali.shukla@amd.com/
V3: https://lore.kernel.org/kvm/20240528041926.3989-4-manali.shukla@amd.com/T/
V2: https://lore.kernel.org/kvm/20240501145433.4070-1-manali.shukla@amd.com/
V1: https://lore.kernel.org/kvm/20240307054623.13632-1-manali.shukla@amd.com/
Manali Shukla (3):
x86/cpufeatures: Add CPUID feature bit for Idle HLT intercept
KVM: SVM: Add Idle HLT intercept support
KVM: selftests: Add self IPI HLT test
arch/x86/include/asm/cpufeatures.h | 1 +
arch/x86/include/asm/svm.h | 1 +
arch/x86/include/uapi/asm/svm.h | 2 +
arch/x86/kvm/svm/svm.c | 13 ++-
tools/testing/selftests/kvm/Makefile.kvm | 1 +
.../selftests/kvm/include/x86/processor.h | 1 +
tools/testing/selftests/kvm/ipi_hlt_test.c | 81 +++++++++++++++++++
7 files changed, 97 insertions(+), 3 deletions(-)
create mode 100644 tools/testing/selftests/kvm/ipi_hlt_test.c
base-commit: eb723766b1030a23c38adf2348b7c3d1409d11f0
prerequisite-patch-id: cb345fc0d814a351df2b5788b76eee0eef9de549
prerequisite-patch-id: 71806f400cffe09f47d6231cb072cbdbd540de1b
prerequisite-patch-id: 9ea0412aab7ecd8555fcee3e9609dbfe8456d47b
prerequisite-patch-id: 3504df50cdd33958456f2e56139d76867273525c
prerequisite-patch-id: 674e56729a56cc487cb85be1a64ef561eb7bac8a
prerequisite-patch-id: 48e87354f9d6e6bd121ca32ab73cd0d7f1dce74f
prerequisite-patch-id: b32c21df6522a7396baa41d62bcad9479041d97a
prerequisite-patch-id: 0ff4b504e982db7c1dfa8ec6ac485c92a89f4af8
prerequisite-patch-id: 509018dc2fc1657debc641544e86f5a92d04bc1a
--
2.34.1
I never had much luck running mm selftests so I spent a couple of hours
digging into why.
Looks like most of the reason is missing SKIP checks, so this series is
just adding a bunch of those that I found. I did not do anything like
all of them, just the ones I spotted in gup_longterm, gup_test, mmap,
userfaultfd and memfd_secret.
It's a bit unfortunate to have to skip those tests when ftruncate()
fails, but I don't have time to dig deep enough into it to actually make
them pass - I observed these issues on both 9p and virtiofs. Probably
it requires digging into the filesystem implementation
(An alternative might just be to mount a tmpfs in the test script).
I am also seeing some failures to allocate hugetlb pages in
uffd-mp-mremap that I have not had time to fully understand, you can see
those here:
https://gist.github.com/bjackman/af74c3a6e60975e6ff0d760cba1e05d2#file-user…
Signed-off-by: Brendan Jackman <jackmanb(a)google.com>
---
Changes in v2 (Thanks to Dev for the reviews):
- Improve and cleanup some error messages
- Add some extra SKIPs
- Fix misnaming of nr_cpus variable in uffd tests
- Link to v1: https://lore.kernel.org/r/20250220-mm-selftests-v1-0-9bbf57d64463@google.com
---
Brendan Jackman (9):
selftests/mm: Report errno when things fail in gup_longterm
selftests/mm: Fix assumption that sudo is present
selftests/mm: Skip uffd-stress if userfaultfd not available
selftests/mm: Skip uffd-wp-mremap if userfaultfd not available
selftests/mm/uffd: Rename nr_cpus -> nr_threads
selftests/mm: Print some details when uffd-stress gets bad params
selftests/mm: Don't fail uffd-stress if too many CPUs
selftests/mm: Skip map_populate on weird filesystems
selftests/mm: Skip gup_longerm tests on weird filesystems
tools/testing/selftests/mm/gup_longterm.c | 45 ++++++++++++++++++----------
tools/testing/selftests/mm/map_populate.c | 7 +++++
tools/testing/selftests/mm/run_vmtests.sh | 22 +++++++++++---
tools/testing/selftests/mm/uffd-common.c | 8 ++---
tools/testing/selftests/mm/uffd-common.h | 2 +-
tools/testing/selftests/mm/uffd-stress.c | 42 ++++++++++++++++----------
tools/testing/selftests/mm/uffd-unit-tests.c | 2 +-
tools/testing/selftests/mm/uffd-wp-mremap.c | 5 +++-
8 files changed, 90 insertions(+), 43 deletions(-)
---
base-commit: a3daad8215143340c0870c5489e599fd059037e9
change-id: 20250220-mm-selftests-2d7d0542face
Best regards,
--
Brendan Jackman <jackmanb(a)google.com>
This series adds KVM selftests for Secure AVIC.
The Secure AVIC KVM support patch series is at:
https://lore.kernel.org/kvm/20250228085115.105648-1-Neeraj.Upadhyay@amd.com…
Git tree is available at:
https://github.com/AMDESE/linux-kvm/tree/savic-host-latest
This series depends on SNP Smoke tests patch series by Pratik:
https://lore.kernel.org/lkml/20250123220100.339867-1-prsampat@amd.com/
- Patch 1-6 are taken from Peter Gonda's patch series for GHCB support
for SEV-ES guests. GHCB support for SNP guests is added to these
patches.
https://lore.kernel.org/lkml/Ziln_Spd6KtgVqkr@google.com/T/#m6c0fc7e2b2e35f…
Patches 7-8 are fixes on top of Peter's series.
- Patch 9 fixes IDT vector for #VC exception (29) which has a valid
error code associated with the exception.
- Patch 10 adds #VC exception handling for rdmsr/wrmsr accesses of
SEV-ES guests.
- Patch 11 skips vm_is_gpa_protected() check for APIC MMIO base address
in __virt_pg_map() for VMs with protected memory. This is required
for xapic tests enablement for SEV VMs.
- Patch 12 and 13 are PoC patches to support MMIO #VC handling for SEV-ES
guests. They add x86 instruction decoding support.
- Patch 14 adds #VC handling for MMIO accesses by SEV-ES guests.
- Patch 15 adds movabs instruction decoding for cases where compiler
generates movabs for MMIO reads/writes.
- Patch 16 adds SEV guests testing support in xapic_state_test.
- Patch 17 adds x2apic mode support in xapic_ipi_test.
- Patch 18 adds SEV VMs support in xapic_ipi_test.
- Patch 19 adds a library for Secure AVIC backing page initialization
and enabling Secure AVIC for a SNP guest.
- Patch 20 adds support for SVM_EXIT_AVIC_UNACCELERATED_ACCESS #VC
exception handling for APIC msr reads/writes by Secure AVIC enabled
VM.
- Patch 21 adds support for SVM_EXIT_AVIC_INCOMPLETE_IPI #VC error
code handling for Secure AVIC enabled VM.
- Patch 22 adds args param to kvm_arch_vm_post_create() to pass
vmsa features to KVM_SEV_INIT2 ioctl for SEV VMs.
- Patch 23 adds an api for passing guest APIC page GPA to Hypervisor.
- Patch 24 adds Secure AVIC VM support to xapic_ipi_test test.
- Patch 25 adds a test for verifying APIC regs MMIO/msr accesses
for a Secure AVIC VM before it enables x2apic mode, in x2apic mode
and after enabling Secure AVIC in the Secure AVIC control msr.
- Patch 26 adds a msr access test to verify accelerated/unaccelerated
msr acceses for Secure AVIC enabled VM.
- Patch 27 tests idle hlt for Secure AVIC enabled VM.
- Patch 28 adds IOAPIC tests for Secure AVIC enabled VM.
- Patch 29 adds cross-vCPU IPI testing with various destination
shorthands for Secure AVIC enabled VM.
- Patch 30 adds Hypervisor NMI injection and cross-vCPU ICR based NMI
for Secure AVIC enabled VM.
- Patch 31 adds MSI injection test for Secure AVIC enabled VM.
Neeraj Upadhyay (25):
KVM: selftests: Fix ghcb_entry returned in ghcb_alloc()
KVM: selftests: Make GHCB entry page size aligned
KVM: selftests: Add support for #VC in x86 exception handlers
KVM: selftests: Add MSR VC handling support for SEV-ES VMs
KVM: selftests: Skip vm_is_gpa_protected() call for APIC MMIO base
KVM: selftests: Add instruction decoding support
KVM: selftests: Add instruction decoding support
KVM: selftests: Add MMIO VC exception handling for SEV-ES guests
KVM: selftests: Add instruction decoding for movabs instructions
KVM: selftests: Add SEV guests support in xapic_state_test
KVM: selftests: Add x2apic mode testing in xapic_ipi_test
KVM: selftests: Add SEV VM support in xapic_ipi_test
KVM: selftests: Add Secure AVIC lib
KVM: selftests: Add unaccelerated APIC msrs #VC handling
KVM: selftests: Add IPI handling support for Secure AVIC
KVM: selftests: Add args param to kvm_arch_vm_post_create()
KVM: selftests: Add SAVIC GPA notification GHCB call
KVM: selftests: Add Secure AVIC mode to xapic_ipi_test
KVM: selftests: Add Secure AVIC APIC regs test
KVM: selftests: Add test to verify APIC MSR accesses for SAVIC guest
KVM: selftests: Extend savic test with idle halt testing
KVM: selftests: Add IOAPIC tests for Secure AVIC
KVM: selftests: Add cross-vCPU IPI testing for SAVIC guests
KVM: selftests: Add NMI test for SAVIC guests
KVM: selftests: Add MSI injection test for SAVIC
Peter Gonda (6):
Add GHCB with setters and getters
Add arch specific additional guest pages
Add vm_vaddr_alloc_pages_shared()
Add GHCB allocations and helpers
Add is_sev_enabled() helpers
Add ability for SEV-ES guests to use ucalls via GHCB
tools/arch/x86/include/asm/msr-index.h | 4 +-
tools/testing/selftests/kvm/.gitignore | 3 +-
tools/testing/selftests/kvm/Makefile.kvm | 16 +-
.../testing/selftests/kvm/include/kvm_util.h | 14 +-
.../testing/selftests/kvm/include/x86/apic.h | 57 +
.../selftests/kvm/include/x86/ex_regs.h | 21 +
.../selftests/kvm/include/x86/insn-eval.h | 48 +
.../selftests/kvm/include/x86/processor.h | 18 +-
.../testing/selftests/kvm/include/x86/savic.h | 25 +
tools/testing/selftests/kvm/include/x86/sev.h | 15 +
tools/testing/selftests/kvm/include/x86/svm.h | 109 ++
tools/testing/selftests/kvm/lib/kvm_util.c | 109 +-
.../testing/selftests/kvm/lib/x86/handlers.S | 4 +-
.../testing/selftests/kvm/lib/x86/insn-eval.c | 1726 +++++++++++++++++
.../testing/selftests/kvm/lib/x86/processor.c | 24 +-
tools/testing/selftests/kvm/lib/x86/savic.c | 490 +++++
tools/testing/selftests/kvm/lib/x86/sev.c | 598 +++++-
tools/testing/selftests/kvm/lib/x86/ucall.c | 18 +
tools/testing/selftests/kvm/s390/cmma_test.c | 2 +-
tools/testing/selftests/kvm/x86/savic_test.c | 1549 +++++++++++++++
.../selftests/kvm/x86/sev_smoke_test.c | 40 +-
.../selftests/kvm/x86/xapic_ipi_test.c | 183 +-
.../selftests/kvm/x86/xapic_state_test.c | 117 +-
23 files changed, 5084 insertions(+), 106 deletions(-)
create mode 100644 tools/testing/selftests/kvm/include/x86/ex_regs.h
create mode 100644 tools/testing/selftests/kvm/include/x86/insn-eval.h
create mode 100644 tools/testing/selftests/kvm/include/x86/savic.h
create mode 100644 tools/testing/selftests/kvm/lib/x86/insn-eval.c
create mode 100644 tools/testing/selftests/kvm/lib/x86/savic.c
create mode 100644 tools/testing/selftests/kvm/x86/savic_test.c
base-commit: f7bafceba76e9ab475b413578c1757ee18c3e44b
--
2.34.1
Hi all,
This patch series continues the work to migrate the *.sh tests into
prog_tests framework.
The test_tunnel.sh script has already been partly migrated to
test_progs in prog_tests/test_tunnel.c so I add my work to it.
PATCH 1 & 2 create some helpers to avoid code duplication and ease the
migration in the following patches.
PATCH 3 to 9 migrate the tests of gre, ip6gre, erspan, ip6erspan,
geneve, ip6geneve and ip6tnl tunnels.
PATCH 10 removes test_tunnel.sh
Signed-off-by: Bastien Curutchet (eBPF Foundation) <bastien.curutchet(a)bootlin.com>
---
Bastien Curutchet (eBPF Foundation) (10):
selftests/bpf: test_tunnel: Add generic_attach* helpers
selftests/bpf: test_tunnel: Add ping helpers
selftests/bpf: test_tunnel: Move gre tunnel test to test_progs
selftests/bpf: test_tunnel: Move ip6gre tunnel test to test_progs
selftests/bpf: test_tunnel: Move erspan tunnel tests to test_progs
selftests/bpf: test_tunnel: Move ip6erspan tunnel test to test_progs
selftests/bpf: test_tunnel: Move geneve tunnel test to test_progs
selftests/bpf: test_tunnel: Move ip6geneve tunnel test to test_progs
selftests/bpf: test_tunnel: Move ip6tnl tunnel tests to test_progs
selftests/bpf: test_tunnel: Remove test_tunnel.sh
tools/testing/selftests/bpf/Makefile | 1 -
.../testing/selftests/bpf/prog_tests/test_tunnel.c | 627 +++++++++++++++++---
tools/testing/selftests/bpf/test_tunnel.sh | 645 ---------------------
3 files changed, 532 insertions(+), 741 deletions(-)
---
base-commit: 16566afa71143757b49fc4b2a331639f487d105a
change-id: 20250131-tunnels-59b641ea3f10
Best regards,
--
Bastien Curutchet (eBPF Foundation) <bastien.curutchet(a)bootlin.com>
1. Issue
Syzkaller reported this issue [1].
2. Reproduce
We can reproduce this issue by using the test_sockmap_with_close_on_write()
test I provided in selftest, also you need to apply the following patch to
ensure 100% reproducibility (sleep after checking sock):
'''
static void sk_psock_verdict_data_ready(struct sock *sk)
{
.......
if (unlikely(!sock))
return;
+ if (!strcmp("test_progs", current->comm)) {
+ printk("sleep 2s to wait socket freed\n");
+ mdelay(2000);
+ printk("sleep end\n");
+ }
ops = READ_ONCE(sock->ops);
if (!ops || !ops->read_skb)
return;
}
'''
Then running './test_progs -v sockmap_basic', and if the kernel has KASAN
enabled [2], you will see the following warning:
'''
BUG: KASAN: slab-use-after-free in sk_psock_verdict_data_ready+0x29b/0x2d0
Read of size 8 at addr ffff88813a777020 by task test_progs/47055
Tainted: [O]=OOT_MODULE
Call Trace:
<TASK>
dump_stack_lvl+0x53/0x70
print_address_description.constprop.0+0x30/0x420
? sk_psock_verdict_data_ready+0x29b/0x2d0
print_report+0xb7/0x270
? sk_psock_verdict_data_ready+0x29b/0x2d0
? kasan_addr_to_slab+0xd/0xa0
? sk_psock_verdict_data_ready+0x29b/0x2d0
kasan_report+0xca/0x100
? sk_psock_verdict_data_ready+0x29b/0x2d0
sk_psock_verdict_data_ready+0x29b/0x2d0
unix_stream_sendmsg+0x4a6/0xa40
? __pfx_unix_stream_sendmsg+0x10/0x10
? fdget+0x2c1/0x3a0
__sys_sendto+0x39c/0x410
'''
3. Reason
'''
CPU0 CPU1
unix_stream_sendmsg(sk):
other = unix_peer(sk)
other->sk_data_ready(other):
socket *sock = sk->sk_socket
if (unlikely(!sock))
return;
close(other):
...
other->close()
free(socket)
READ_ONCE(sock->ops)
^
use 'sock' after free
'''
For TCP, UDP, or other protocols, we have already performed
rcu_read_lock() when the network stack receives packets in ip_input.c:
'''
ip_local_deliver_finish():
rcu_read_lock()
ip_protocol_deliver_rcu()
xxx_rcv
rcu_read_unlock()
'''
However, for Unix sockets, sk_data_ready is called directly from the
process context without rcu_read_lock() protection.
4. Solution
Based on the fact that the 'struct socket' is released using call_rcu(),
We add rcu_read_{un}lock() at the entrance and exit of our sk_data_ready.
It will not increase performance overhead, at least for TCP and UDP, they
are already in a relatively large critical section.
Of course, we can also add a custom callback for Unix sockets and call
rcu_read_lock() before calling _verdict_data_ready like this:
'''
if (sk_is_unix(sk))
sk->sk_data_ready = sk_psock_verdict_data_ready_rcu;
else
sk->sk_data_ready = sk_psock_verdict_data_ready;
sk_psock_verdict_data_ready_rcu():
rcu_read_lock()
sk_psock_verdict_data_ready()
rcu_read_unlock()
'''
However, this will cause too many branches, and it's not suitable to
distinguish network protocols in skmsg.c.
[1] https://syzkaller.appspot.com/bug?extid=dd90a702f518e0eac072
[2] https://syzkaller.appspot.com/text?tag=KernelConfig&x=1362a5aee630ff34
---
v1 -> v2:
1. Add Fixes tag.
2. Extend selftest of edge case for TCP/UDP sockets.
3. Add Reviewed-by and Acked-by tag.
https://lore.kernel.org/bpf/20250226132242.52663-1-jiayuan.chen@linux.dev/T…
Jiayuan Chen (3):
bpf, sockmap: avoid using sk_socket after free
selftests/bpf: Add socketpair to create_pair to support unix socket
selftests/bpf: Add edge case tests for sockmap
net/core/skmsg.c | 18 ++++--
.../selftests/bpf/prog_tests/socket_helpers.h | 13 +++-
.../selftests/bpf/prog_tests/sockmap_basic.c | 59 +++++++++++++++++++
3 files changed, 84 insertions(+), 6 deletions(-)
--
2.47.1
1. Issue
Syzkaller reported this issue [1].
2. Reproduce
We can reproduce this issue by using the test_sockmap_with_close_on_write()
test I provided in selftest, also you need to apply the following patch to
ensure 100% reproducibility (sleep after checking sock):
'''
static void sk_psock_verdict_data_ready(struct sock *sk)
{
.......
if (unlikely(!sock))
return;
+ if (!strcmp("test_progs", current->comm)) {
+ printk("sleep 2s to wait socket freed\n");
+ mdelay(2000);
+ printk("sleep end\n");
+ }
ops = READ_ONCE(sock->ops);
if (!ops || !ops->read_skb)
return;
}
'''
Then running './test_progs -v sockmap_basic', and if the kernel has KASAN
enabled [2], you will see the following warning:
'''
BUG: KASAN: slab-use-after-free in sk_psock_verdict_data_ready+0x29b/0x2d0
Read of size 8 at addr ffff88813a777020 by task test_progs/47055
Tainted: [O]=OOT_MODULE
Call Trace:
<TASK>
dump_stack_lvl+0x53/0x70
print_address_description.constprop.0+0x30/0x420
? sk_psock_verdict_data_ready+0x29b/0x2d0
print_report+0xb7/0x270
? sk_psock_verdict_data_ready+0x29b/0x2d0
? kasan_addr_to_slab+0xd/0xa0
? sk_psock_verdict_data_ready+0x29b/0x2d0
kasan_report+0xca/0x100
? sk_psock_verdict_data_ready+0x29b/0x2d0
sk_psock_verdict_data_ready+0x29b/0x2d0
unix_stream_sendmsg+0x4a6/0xa40
? __pfx_unix_stream_sendmsg+0x10/0x10
? fdget+0x2c1/0x3a0
__sys_sendto+0x39c/0x410
'''
3. Reason
'''
CPU0 CPU1
unix_stream_sendmsg(sk):
other = unix_peer(sk)
other->sk_data_ready(other):
socket *sock = sk->sk_socket
if (unlikely(!sock))
return;
close(other):
...
other->close()
free(socket)
READ_ONCE(sock->ops)
^
use 'sock' after free
'''
For TCP, UDP, or other protocols, we have already performed
rcu_read_lock() when the network stack receives packets in ip_input.c:
'''
ip_local_deliver_finish():
rcu_read_lock()
ip_protocol_deliver_rcu()
xxx_rcv
rcu_read_unlock()
'''
However, for Unix sockets, sk_data_ready is called directly from the
process context without rcu_read_lock() protection.
4. Solution
Based on the fact that the 'struct socket' is released using call_rcu(),
We add rcu_read_{un}lock() at the entrance and exit of our sk_data_ready.
It will not increase performance overhead, at least for TCP and UDP, they
are already in a relatively large critical section.
Of course, we can also add a custom callback for Unix sockets and call
rcu_read_lock() before calling _verdict_data_ready like this:
'''
if (sk_is_unix(sk))
sk->sk_data_ready = sk_psock_verdict_data_ready_rcu;
else
sk->sk_data_ready = sk_psock_verdict_data_ready;
sk_psock_verdict_data_ready_rcu():
rcu_read_lock()
sk_psock_verdict_data_ready()
rcu_read_unlock()
'''
However, this will cause too many branches, and it's not suitable to
distinguish network protocols in skmsg.c.
[1] https://syzkaller.appspot.com/bug?extid=dd90a702f518e0eac072
[2] https://syzkaller.appspot.com/text?tag=KernelConfig&x=1362a5aee630ff34
Jiayuan Chen (3):
bpf, sockmap: avoid using sk_socket after free
selftests/bpf: Add socketpair to create_pair to support unix socket
selftests/bpf: Add edge case tests for sockmap
net/core/skmsg.c | 18 ++++--
.../selftests/bpf/prog_tests/socket_helpers.h | 13 ++++-
.../selftests/bpf/prog_tests/sockmap_basic.c | 57 +++++++++++++++++++
3 files changed, 82 insertions(+), 6 deletions(-)
--
2.47.1
The GRO selftests can flake and have some confusing behavior. These
changes make the output and return value of GRO behave as expected, then
deflake the tests.
v2:
- Split into multiple commits.
- Reduced napi_defer_hard_irqs to 1.
- Reduced gro_flush_timeout to 100us.
- Fixed comment that wasn't updated.
v1: https://lore.kernel.org/netdev/20250218164555.1955400-1-krakauer@google.com/
Kevin Krakauer (3):
selftests/net: have `gro.sh -t` return a correct exit code
selftests/net: only print passing message in GRO tests when tests pass
selftests/net: deflake GRO tests
tools/testing/selftests/net/gro.c | 8 +++++---
tools/testing/selftests/net/gro.sh | 7 ++++---
tools/testing/selftests/net/setup_veth.sh | 3 ++-
3 files changed, 11 insertions(+), 7 deletions(-)
--
2.48.1.658.g4767266eb4-goog
There is a spelling mistake in a ksft_test_result_skip message. Fix it.
Signed-off-by: Colin Ian King <colin.i.king(a)gmail.com>
---
tools/testing/selftests/kvm/s390/cpumodel_subfuncs_test.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/tools/testing/selftests/kvm/s390/cpumodel_subfuncs_test.c b/tools/testing/selftests/kvm/s390/cpumodel_subfuncs_test.c
index 27255880dabd..aded795d42be 100644
--- a/tools/testing/selftests/kvm/s390/cpumodel_subfuncs_test.c
+++ b/tools/testing/selftests/kvm/s390/cpumodel_subfuncs_test.c
@@ -291,7 +291,7 @@ int main(int argc, char *argv[])
ksft_test_result_pass("%s\n", testlist[idx].subfunc_name);
free(array);
} else {
- ksft_test_result_skip("%s feature is not avaialable\n",
+ ksft_test_result_skip("%s feature is not available\n",
testlist[idx].subfunc_name);
}
}
--
2.47.2
[ Background ]
On ARM GIC systems and others, the target address of the MSI is translated
by the IOMMU. For GIC, the MSI address page is called "ITS" page. When the
IOMMU is disabled, the MSI address is programmed to the physical location
of the GIC ITS page (e.g. 0x20200000). When the IOMMU is enabled, the ITS
page is behind the IOMMU, so the MSI address is programmed to an allocated
IO virtual address (a.k.a IOVA), e.g. 0xFFFF0000, which must be mapped to
the physical ITS page: IOVA (0xFFFF0000) ===> PA (0x20200000).
When a 2-stage translation is enabled, IOVA will be still used to program
the MSI address, though the mappings will be in two stages:
IOVA (0xFFFF0000) ===> IPA (e.g. 0x80900000) ===> PA (0x20200000)
(IPA stands for Intermediate Physical Address).
If the device that generates MSI is attached to an IOMMU_DOMAIN_DMA, the
IOVA is dynamically allocated from the top of the IOVA space. If attached
to an IOMMU_DOMAIN_UNMANAGED (e.g. a VFIO passthrough device), the IOVA is
fixed to an MSI window reported by the IOMMU driver via IOMMU_RESV_SW_MSI,
which is hardwired to MSI_IOVA_BASE (IOVA==0x8000000) for ARM IOMMUs.
So far, this IOMMU_RESV_SW_MSI works well as kernel is entirely in charge
of the IOMMU translation (1-stage translation), since the IOVA for the ITS
page is fixed and known by kernel. However, with virtual machine enabling
a nested IOMMU translation (2-stage), a guest kernel directly controls the
stage-1 translation with an IOMMU_DOMAIN_DMA, mapping a vITS page (at an
IPA 0x80900000) onto its own IOVA space (e.g. 0xEEEE0000). Then, the host
kernel can't know that guest-level IOVA to program the MSI address.
There have been two approaches to solve this problem:
1. Create an identity mapping in the stage-1. VMM could insert a few RMRs
(Reserved Memory Regions) in guest's IORT. Then the guest kernel would
fetch these RMR entries from the IORT and create an IOMMU_RESV_DIRECT
region per iommu group for a direct mapping. Eventually, the mappings
would look like: IOVA (0x8000000) === IPA (0x8000000) ===> 0x20200000
This requires an IOMMUFD ioctl for kernel and VMM to agree on the IPA.
2. Forward the guest-level MSI IOVA captured by VMM to the host-level GIC
driver, to program the correct MSI IOVA. Forward the VMM-defined vITS
page location (IPA) to the kernel for the stage-2 mapping. Eventually:
IOVA (0xFFFF0000) ===> IPA (0x80900000) ===> PA (0x20200000)
This requires a VFIO ioctl (for IOVA) and an IOMMUFD ioctl (for IPA).
Worth mentioning that when Eric Auger was working on the same topic with
the VFIO iommu uAPI, he had a solution for approach (2) first, and then
switched to approach (1), suggested by Jean-Philippe for the reduction of
complexity.
Approach (1) basically feels like the existing VFIO passthrough that has
a 1-stage mapping for the unmanaged domain, yet only by shifting the MSI
mapping from stage 1 (no-viommu case) to stage 2 (has-viommu case). So,
it could reuse the existing IOMMU_RESV_SW_MSI piece, by sharing the same
idea of "VMM leaving everything to the kernel".
Approach (2) is an ideal solution, yet it requires additional effort for
kernel to be aware of the stage-1 gIOVAs and the stage-2 IPAs for vITS
page(s), which demands VMM to closely cooperate.
* It also brings some complicated use cases to the table where the host
or/and guest system(s) has/have multiple ITS pages.
[ Execution ]
Though these two approaches feel very different on the surface, they can
share some underlying common infrastructure. Currently, only one pair of
sw_msi functions (prepare/compose) are provided by dma-iommu for irqchip
drivers to directly use. There could be different versions of functions
from different domain owners: for existing VFIO passthrough cases and in-
kernel DMA domain cases, reuse the existing dma-iommu's version of sw_msi
functions; for nested translation use cases, there can be another version
of sw_msi functions to handle mapping and msi_msg(s) differently.
As a part-1 series, this refactors the core infrastructure:
- Get rid of the duplication in the "compose" function
- Introduce a function pointer for the previously "prepare" function
- Allow different domain owners to set their own "sw_msi" implementations
- Implement an iommufd_sw_msi function to additionally support non-nested
use cases and also prepare for a nested translation use case using the
approach (1)
[ Future Plan ]
Part-2 will add support of approach (1), i.e. RMR solution:
- Add a pair of IOMMUFD options for a SW_MSI window for kernel and VMM to
agree on (for approach 1)
Part-3 and beyond will continue the effort of supporting approach (2) i.e.
a complete vITS-to-pITS mapping:
- Map the phsical ITS page (potentially via IOMMUFD_CMD_IOAS_MAP_MSI)
- Convey the IOVAs per-irq (potentially via VFIO_IRQ_SET_ACTION_PREPARE)
---
This is a joint effort that includes Jason's rework in irq/iommu/iommufd
base level and my additional patches on top of that for new uAPIs.
This series is on github:
https://github.com/nicolinc/iommufd/commits/iommufd_msi_p1-v2
For testing with nested SMMU (approach 1):
https://github.com/nicolinc/iommufd/commits/wip/iommufd_msi_p2-v2
Pairing QEMU branch for testing (approach 1):
https://github.com/nicolinc/qemu/commits/wip/for_iommufd_msi_p2-v2-rmr
Changelog
v2
* Split the iommufd ioctl for approach (1) out of this part-1
* Rebase on Jason's for-next tree (6.14-rc2) for two iommufd patches
* Update commit logs in two irqchip patches to make narrative clearer
* Keep iommu_dma_compose_msi_msg() in PATCH-1 as a small cleaner step
* Improve with some coding style changes: kdoc and 100-char wrapping
v1
https://lore.kernel.org/kvm/cover.1739005085.git.nicolinc@nvidia.com/
* Rebase on v6.14-rc1 and iommufd_attach_handle-v1 series
https://lore.kernel.org/all/cover.1738645017.git.nicolinc@nvidia.com/
* Correct typos
* Replace set_bit with __set_bit
* Use a common helper to get iommufd_handle
* Add kdoc for iommu_msi_iova/iommu_msi_page_shift
* Rename msi_msg_set_msi_addr() to msi_msg_set_addr()
* Update selftest for a better coverage for the new options
* Change IOMMU_OPTION_SW_MSI_START/SIZE to be per-idev and properly
check against device's reserved region list
RFCv2
https://lore.kernel.org/kvm/cover.1736550979.git.nicolinc@nvidia.com/
* Rebase on v6.13-rc6
* Drop all the irq/pci patches and rework the compose function instead
* Add a new sw_msi op to iommu_domain for a per type implementation and
let iommufd core has its own implementation to support both approaches
* Add RMR-solution (approach 1) support since it is straightforward and
have been used in some out-of-tree projects widely
RFCv1
https://lore.kernel.org/kvm/cover.1731130093.git.nicolinc@nvidia.com/
Thanks!
Nicolin
Jason Gunthorpe (5):
genirq/msi: Store the IOMMU IOVA directly in msi_desc instead of
iommu_cookie
genirq/msi: Refactor iommu_dma_compose_msi_msg()
iommu: Make iommu_dma_prepare_msi() into a generic operation
irqchip: Have CONFIG_IRQ_MSI_IOMMU be selected by irqchips that need
it
iommufd: Implement sw_msi support natively
Nicolin Chen (2):
iommu: Turn fault_data to iommufd private pointer
iommu: Turn iova_cookie to dma-iommu private pointer
drivers/iommu/Kconfig | 1 -
drivers/irqchip/Kconfig | 4 +
kernel/irq/Kconfig | 1 +
drivers/iommu/iommufd/iommufd_private.h | 23 +++-
include/linux/iommu.h | 58 +++++----
include/linux/msi.h | 55 +++++---
drivers/iommu/dma-iommu.c | 63 +++-------
drivers/iommu/iommu.c | 29 +++++
drivers/iommu/iommufd/device.c | 160 ++++++++++++++++++++----
drivers/iommu/iommufd/fault.c | 2 +-
drivers/iommu/iommufd/hw_pagetable.c | 5 +-
drivers/iommu/iommufd/main.c | 9 ++
drivers/irqchip/irq-gic-v2m.c | 5 +-
drivers/irqchip/irq-gic-v3-its.c | 13 +-
drivers/irqchip/irq-gic-v3-mbi.c | 12 +-
drivers/irqchip/irq-ls-scfg-msi.c | 5 +-
16 files changed, 309 insertions(+), 136 deletions(-)
base-commit: dc10ba25d43f433ad5d9e8e6be4f4d2bb3cd9ddb
prerequisite-patch-id: 0000000000000000000000000000000000000000
--
2.43.0