The count_stcx_fail test runs for close to or just over 2 minutes, which
means it sometimes times out.
That's overkill for a test that just demonstrates some PMU counters
are working. Drop the 64 billion instruction case, to lower the runtime
to ~30s.
Signed-off-by: Michael Ellerman <mpe(a)ellerman.id.au>
---
tools/testing/selftests/powerpc/pmu/count_stcx_fail.c | 3 ---
1 file changed, 3 deletions(-)
diff --git a/tools/testing/selftests/powerpc/pmu/count_stcx_fail.c b/tools/testing/selftests/powerpc/pmu/count_stcx_fail.c
index 2070a1e2b3a5..d8dd9a9c6c1b 100644
--- a/tools/testing/selftests/powerpc/pmu/count_stcx_fail.c
+++ b/tools/testing/selftests/powerpc/pmu/count_stcx_fail.c
@@ -144,9 +144,6 @@ static int test_body(void)
/* Run for 16Bi instructions */
FAIL_IF(do_count_loop(events, 16000000000, overhead, true));
- /* Run for 64Bi instructions */
- FAIL_IF(do_count_loop(events, 64000000000, overhead, true));
-
event_close(&events[0]);
event_close(&events[1]);
--
2.47.0
Hello,
I am a private investment consultant representing the interest of a multinational conglomerate that wishes to place funds into a trust management portfolio.
Please indicate your interest for additional information.
Regards,
Van Collin.
This patch series includes some netns-related improvements and fixes for
RTNL and ip_tunnel, to make link creation more intuitive:
- Creating link in another net namespace doesn't conflict with link names
in current one.
- Refector rtnetlink link creation. Create link in target namespace
directly. Pass both source and link netns to drivers via newlink()
callback.
So that
# ip link add netns ns1 link-netns ns2 tun0 type gre ...
will create tun0 in ns1, rather than create it in ns2 and move to ns1.
And don't conflict with another interface named "tun0" in current netns.
Patch 1 from Donald is included just as a dependency.
---
v3:
- Drop "netns_atomic" flag and module parameter. Add netns parameter to
newlink() instead, and convert drivers accordingly.
- Move python NetNSEnter helper to net selftest lib.
v2:
link: https://lore.kernel.org/all/20241107133004.7469-1-shaw.leon@gmail.com/
- Check NLM_F_EXCL to ensure only link creation is affected.
- Add self tests for link name/ifindex conflict and notifications
in different netns.
- Changes in dummy driver and ynl in order to add the test case.
v1:
link: https://lore.kernel.org/all/20241023023146.372653-1-shaw.leon@gmail.com/
Donald Hunter (1):
Revert "tools/net/ynl: improve async notification handling"
Xiao Liang (5):
net: ip_tunnel: Build flow in underlay net namespace
rtnetlink: Lookup device in target netns when creating link
rtnetlink: Decouple net namespaces in rtnl_newlink_create()
selftests: net: Add python context manager for netns entering
selftests: net: Add two test cases for link netns
drivers/infiniband/ulp/ipoib/ipoib_netlink.c | 6 ++-
drivers/net/amt.c | 6 +--
drivers/net/bareudp.c | 4 +-
drivers/net/bonding/bond_netlink.c | 3 +-
drivers/net/can/dev/netlink.c | 2 +-
drivers/net/can/vxcan.c | 4 +-
.../ethernet/qualcomm/rmnet/rmnet_config.c | 5 +-
drivers/net/geneve.c | 4 +-
drivers/net/gtp.c | 4 +-
drivers/net/ipvlan/ipvlan.h | 2 +-
drivers/net/ipvlan/ipvlan_main.c | 5 +-
drivers/net/ipvlan/ipvtap.c | 4 +-
drivers/net/macsec.c | 5 +-
drivers/net/macvlan.c | 5 +-
drivers/net/macvtap.c | 5 +-
drivers/net/netkit.c | 4 +-
drivers/net/pfcp.c | 4 +-
drivers/net/ppp/ppp_generic.c | 4 +-
drivers/net/team/team_core.c | 2 +-
drivers/net/veth.c | 4 +-
drivers/net/vrf.c | 2 +-
drivers/net/vxlan/vxlan_core.c | 4 +-
drivers/net/wireguard/device.c | 4 +-
drivers/net/wireless/virtual/virt_wifi.c | 5 +-
drivers/net/wwan/wwan_core.c | 6 ++-
include/net/ip_tunnels.h | 5 +-
include/net/rtnetlink.h | 22 ++++++++-
net/8021q/vlan_netlink.c | 5 +-
net/batman-adv/soft-interface.c | 5 +-
net/bridge/br_netlink.c | 2 +-
net/caif/chnl_net.c | 2 +-
net/core/rtnetlink.c | 25 ++++++----
net/hsr/hsr_netlink.c | 8 +--
net/ieee802154/6lowpan/core.c | 5 +-
net/ipv4/ip_gre.c | 13 +++--
net/ipv4/ip_tunnel.c | 16 +++---
net/ipv4/ip_vti.c | 5 +-
net/ipv4/ipip.c | 5 +-
net/ipv6/ip6_gre.c | 17 ++++---
net/ipv6/ip6_tunnel.c | 11 ++---
net/ipv6/ip6_vti.c | 11 ++---
net/ipv6/sit.c | 11 ++---
net/xfrm/xfrm_interface_core.c | 13 +++--
tools/net/ynl/cli.py | 10 ++--
tools/net/ynl/lib/ynl.py | 49 ++++++++-----------
tools/testing/selftests/net/Makefile | 1 +
.../testing/selftests/net/lib/py/__init__.py | 2 +-
tools/testing/selftests/net/lib/py/netns.py | 18 +++++++
tools/testing/selftests/net/netns-name.sh | 10 ++++
tools/testing/selftests/net/netns_atomic.py | 38 ++++++++++++++
50 files changed, 255 insertions(+), 157 deletions(-)
create mode 100755 tools/testing/selftests/net/netns_atomic.py
--
2.47.0
The alloc_string_stream() function only returns ERR_PTR(-ENOMEM) on
failure and never returns NULL. Therefore, switching the error check in
the caller from IS_ERR_OR_NULL to IS_ERR improves clarity, indicating
that this function will return an error pointer (not NULL) when an
error occurs. This change avoids any ambiguity regarding the function's
return behavior.
Link: https://lore.kernel.org/lkml/Zy9deU5VK3YR+r9N@visitorckw-System-Product-Name
Signed-off-by: Kuan-Wei Chiu <visitorckw(a)gmail.com>
---
lib/kunit/debugfs.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/lib/kunit/debugfs.c b/lib/kunit/debugfs.c
index d548750a325a..6273fa9652df 100644
--- a/lib/kunit/debugfs.c
+++ b/lib/kunit/debugfs.c
@@ -181,7 +181,7 @@ void kunit_debugfs_create_suite(struct kunit_suite *suite)
* successfully.
*/
stream = alloc_string_stream(GFP_KERNEL);
- if (IS_ERR_OR_NULL(stream))
+ if (IS_ERR(stream))
return;
string_stream_set_append_newlines(stream, true);
@@ -189,7 +189,7 @@ void kunit_debugfs_create_suite(struct kunit_suite *suite)
kunit_suite_for_each_test_case(suite, test_case) {
stream = alloc_string_stream(GFP_KERNEL);
- if (IS_ERR_OR_NULL(stream))
+ if (IS_ERR(stream))
goto err;
string_stream_set_append_newlines(stream, true);
--
2.34.1
Update Brendan's email address for the KUnit entry.
Signed-off-by: Brendan Higgins <brendanhiggins(a)google.com>
---
I am leaving Google and am going through and cleaning up my @google.com
address in the relevant places. This patch updates my email address for
the KUnit entry. I am removing myself from the ASPEED I2C entry in a
separate patch.
Do note that Friday, November 15 2024 is my last day at Google after
which I will lose access to this email account so any future updates or
comments after Friday will come from my @linux.dev account.
---
MAINTAINERS | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/MAINTAINERS b/MAINTAINERS
index b878ddc99f94e..d077a3deeb386 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -12405,7 +12405,7 @@ F: fs/smb/common/
F: fs/smb/server/
KERNEL UNIT TESTING FRAMEWORK (KUnit)
-M: Brendan Higgins <brendanhiggins(a)google.com>
+M: Brendan Higgins <brendan.higgins(a)linux.dev>
M: David Gow <davidgow(a)google.com>
R: Rae Moar <rmoar(a)google.com>
L: linux-kselftest(a)vger.kernel.org
base-commit: cfaaa7d010d1fc58f9717fcc8591201e741d2d49
--
2.47.0.338.g60cca15819-goog
Unmapping virtual machine guest memory from the host kernel's direct map
is a successful mitigation against Spectre-style transient execution
issues: If the kernel page tables do not contain entries pointing to
guest memory, then any attempted speculative read through the direct map
will necessarily be blocked by the MMU before any observable
microarchitectural side-effects happen. This means that Spectre-gadgets
and similar cannot be used to target virtual machine memory. Roughly 60%
of speculative execution issues fall into this category [1, Table 1].
This patch series extends guest_memfd with the ability to remove its
memory from the host kernel's direct map, to be able to attain the above
protection for KVM guests running inside guest_memfd.
=== Changes to v2 ===
- Handle direct map removal for physically contiguous pages in arch code
(Mike R.)
- Track the direct map state in guest_memfd itself instead of at the
folio level, to prepare for huge pages support (Sean C.)
- Allow configuring direct map state of not-yet faulted in memory
(Vishal A.)
- Pay attention to alignment in ftrace structs (Steven R.)
Most significantly, I've reduced the patch series to focus only on
direct map removal for guest_memfd for now, leaving the whole "how to do
non-CoCo VMs in guest_memfd" for later. If this separation is
acceptable, then I think I can drop the RFC tag in the next revision
(I've mainly kept it here because I'm not entirely sure what to do with
patches 3 and 4).
=== Implementation ===
This patch series introduces a new flag to the KVM_CREATE_GUEST_MEMFD
that causes guest_memfd to remove its pages from the host kernel's
direct map immediately after population/preparation. It also adds
infrastructure for tracking the direct map state of all gmem folios
inside the guest_memfd inode. Storing this information in the inode has
the advantage that the code is ready for future hugepages extensions,
where only removing/reinserting direct map entries for sub-ranges of a
huge folio is a valid usecase, and it allows pre-configuring the direct
map state of not-yet faulted in parts of memory (for example, when the
VMM is receiving a RX virtio buffer from the guest).
=== Summary ===
Patch 1 (from Mike Rapoport) adds arch APIs for manipulating the direct
map for ranges of physically contiguous pages, which are used by
guest_memfd in follow up patches. Patch 2 adds the
KVM_GMEM_NO_DIRECT_MAP flag and the logic for configuring direct map
state of freshly prepared folios. Patches 3 and 4 mainly serve an
illustrative purpose, to show how the framework from patch 2 can be
extended with routines for runtime direct map manipulation. Patches 5
and 6 deal with documentation and self-tests respectively.
[1]: https://download.vusec.net/papers/quarantine_raid23.pdf
[RFC v1]: https://lore.kernel.org/kvm/20240709132041.3625501-1-roypat@amazon.co.uk/
[RFC v2]: https://lore.kernel.org/kvm/20240910163038.1298452-1-roypat@amazon.co.uk/
Mike Rapoport (Microsoft) (1):
arch: introduce set_direct_map_valid_noflush()
Patrick Roy (5):
kvm: gmem: add flag to remove memory from kernel direct map
kvm: gmem: implement direct map manipulation routines
kvm: gmem: add trace point for direct map state changes
kvm: document KVM_GMEM_NO_DIRECT_MAP flag
kvm: selftests: run gmem tests with KVM_GMEM_NO_DIRECT_MAP set
Documentation/virt/kvm/api.rst | 14 +
arch/arm64/include/asm/set_memory.h | 1 +
arch/arm64/mm/pageattr.c | 10 +
arch/loongarch/include/asm/set_memory.h | 1 +
arch/loongarch/mm/pageattr.c | 21 ++
arch/riscv/include/asm/set_memory.h | 1 +
arch/riscv/mm/pageattr.c | 15 +
arch/s390/include/asm/set_memory.h | 1 +
arch/s390/mm/pageattr.c | 11 +
arch/x86/include/asm/set_memory.h | 1 +
arch/x86/mm/pat/set_memory.c | 8 +
include/linux/set_memory.h | 6 +
include/trace/events/kvm.h | 22 ++
include/uapi/linux/kvm.h | 2 +
.../testing/selftests/kvm/guest_memfd_test.c | 2 +-
.../kvm/x86_64/private_mem_conversions_test.c | 7 +-
virt/kvm/guest_memfd.c | 280 +++++++++++++++++-
17 files changed, 384 insertions(+), 19 deletions(-)
base-commit: 5cb1659f412041e4780f2e8ee49b2e03728a2ba6
--
2.47.0
Hello LKFT maintainers, CI operators,
First, I would like to say thank you to the people behind the LKFT
project for validating stable kernels (and more), and including some
Network selftests in their tests suites.
A lot of improvements around the networking kselftests have been done
this year. At the last Netconf [1], we discussed how these tests were
validated on stable kernels from CIs like the LKFT one, and we have some
suggestions to improve the situation.
KSelftests from the same version
--------------------------------
According to the doc [2], kselftests should support all previous kernel
versions. The LKFT CI is then using the kselftests from the last stable
release to validate all stable versions. Even if there are good reasons
to do that, we would like to ask for an opt-out for this policy for the
networking tests: this is hard to maintain with the increased
complexity, hard to validate on all stable kernels before applying
patches, and hard to put in place in some situations. As a result, many
tests are failing on older kernels, and it looks like it is a lot of
work to support older kernels, and to maintain this.
Many networking tests are validating the internal behaviour that is not
exposed to the userspace. A typical example: some tests look at the raw
packets being exchanged during a test, and this behaviour can change
without modifying how the userspace is interacting with the kernel. The
kernel could expose capabilities, but that's not something that seems
natural to put in place for internal behaviours that are not exposed to
end users. Maybe workarounds could be used, e.g. looking at kernel
symbols, etc. Nut that doesn't always work, increase the complexity, and
often "false positive" issue will be noticed only after a patch hits
stable, and will cause a bunch of tests to be ignored.
Regarding fixes, ideally they will come with a new or modified test that
can also be backported. So the coverage can continue to grow in stable
versions too.
Do you think that from the kernel v6.12 (or before?), the LKFT CI could
run the networking kselftests from the version that is being validated,
and not from a newer one? So validating the selftests from v6.12.1 on a
v6.12.1, and not the ones from a future v6.16.y on a v6.12.42.
Skipped tests
-------------
It looks like many tests are skipped:
- Some have been in a skip file [3] for a while: maybe they can be removed?
- Some are skipped because of missing tools: maybe they can be added?
e.g. iputils, tshark, ipv6toolkit, etc.
- Some tests are in 'net', but in subdirectories, and hence not tested,
e.g. forwarding, packetdrill, netfilter, tcp_ao. Could they be tested too?
How can we change this to increase the code coverage using existing tests?
KVM
---
It looks like different VMs are being used to execute the different
tests. Do these VMs benefit from any accelerations like KVM? If not,
some tests might fail because the environment is too slow.
The KSFT_MACHINE_SLOW=yes env var can be set to increase some
tolerances, timeout or to skip some parts, but that might not be enough
for some tests.
Notifications
-------------
In case of new regressions, who is being notified? Are the people from
the MAINTAINERS file, and linked to the corresponding selftests being
notified or do they need to do the monitoring on their side?
Looking forward to improving the networking selftests results when
validating stable kernels!
[1] https://netdev.bots.linux.dev/netconf/2024/
[2] https://docs.kernel.org/dev-tools/kselftest.html
[3]
https://github.com/Linaro/test-definitions/blob/master/automated/linux/ksel…
Cheers,
Matt
--
Sponsored by the NGI0 Core fund.
Basics and overview
===================
Software with larger attack surfaces (e.g. network facing apps like databases,
browsers or apps relying on browser runtimes) suffer from memory corruption
issues which can be utilized by attackers to bend control flow of the program
to eventually gain control (by making their payload executable). Attackers are
able to perform such attacks by leveraging call-sites which rely on indirect
calls or return sites which rely on obtaining return address from stack memory.
To mitigate such attacks, risc-v extension zicfilp enforces that all indirect
calls must land on a landing pad instruction `lpad` else cpu will raise software
check exception (a new cpu exception cause code on riscv).
Similarly for return flow, risc-v extension zicfiss extends architecture with
- `sspush` instruction to push return address on a shadow stack
- `sspopchk` instruction to pop return address from shadow stack
and compare with input operand (i.e. return address on stack)
- `sspopchk` to raise software check exception if comparision above
was a mismatch
- Protection mechanism using which shadow stack is not writeable via
regular store instructions
More information an details can be found at extensions github repo [1].
Equivalent to landing pad (zicfilp) on x86 is `ENDBRANCH` instruction in Intel
CET [3] and branch target identification (BTI) [4] on arm.
Similarly x86's Intel CET has shadow stack [5] and arm64 has guarded control
stack (GCS) [6] which are very similar to risc-v's zicfiss shadow stack.
x86 already supports shadow stack for user mode and arm64 support for GCS in
usermode [7] is in -next.
Kernel awareness for user control flow integrity
================================================
This series picks up Samuel Holland's envcfg changes [2] as well. So if those are
being applied independently, they should be removed from this series.
Enabling:
In order to maintain compatibility and not break anything in user mode, kernel
doesn't enable control flow integrity cpu extensions on binary by default.
Instead exposes a prctl interface to enable, disable and lock the shadow stack
or landing pad feature for a task. This allows userspace (loader) to enumerate
if all objects in its address space are compiled with shadow stack and landing
pad support and accordingly enable the feature. Additionally if a subsequent
`dlopen` happens on a library, user mode can take a decision again to disable
the feature (if incoming library is not compiled with support) OR terminate the
task (if user mode policy is strict to have all objects in address space to be
compiled with control flow integirty cpu feature). prctl to enable shadow stack
results in allocating shadow stack from virtual memory and activating for user
address space. x86 and arm64 are also following same direction due to similar
reason(s).
clone/fork:
On clone and fork, cfi state for task is inherited by child. Shadow stack is
part of virtual memory and is a writeable memory from kernel perspective
(writeable via a restricted set of instructions aka shadow stack instructions)
Thus kernel changes ensure that this memory is converted into read-only when
fork/clone happens and COWed when fault is taken due to sspush, sspopchk or
ssamoswap. In case `CLONE_VM` is specified and shadow stack is to be enabled,
kernel will automatically allocate a shadow stack for that clone call.
map_shadow_stack:
x86 introduced `map_shadow_stack` system call to allow user space to explicitly
map shadow stack memory in its address space. It is useful to allocate shadow
for different contexts managed by a single thread (green threads or contexts)
risc-v implements this system call as well.
signal management:
If shadow stack is enabled for a task, kernel performs an asynchronous control
flow diversion to deliver the signal and eventually expects userspace to issue
sigreturn so that original execution can be resumed. Even though resume context
is prepared by kernel, it is in user space memory and is subject to memory
corruption and corruption bugs can be utilized by attacker in this race window
to perform arbitrary sigreturn and eventually bypass cfi mechanism.
Another issue is how to ensure that cfi related state on sigcontext area is not
trampled by legacy apps or apps compiled with old kernel headers.
In order to mitigate control-flow hijacting, kernel prepares a token and place
it on shadow stack before signal delivery and places address of token in
sigcontext structure. During sigreturn, kernel obtains address of token from
sigcontext struture, reads token from shadow stack and validates it and only
then allow sigreturn to succeed. Compatiblity issue is solved by adopting
dynamic sigcontext management introduced for vector extension. This series
re-factor the code little bit to allow future sigcontext management easy (as
proposed by Andy Chiu from SiFive)
config and compilation:
Introduce a new risc-v config option `CONFIG_RISCV_USER_CFI`. Selecting this
config option picks the kernel support for user control flow integrity. This
optin is presented only if toolchain has shadow stack and landing pad support.
And is on purpose guarded by toolchain support. Reason being that eventually
vDSO also needs to be compiled in with shadow stack and landing pad support.
vDSO compile patches are not included as of now because landing pad labeling
scheme is yet to settle for usermode runtime.
To get more information on kernel interactions with respect to
zicfilp and zicfiss, patch series adds documentation for
`zicfilp` and `zicfiss` in following:
Documentation/arch/riscv/zicfiss.rst
Documentation/arch/riscv/zicfilp.rst
How to test this series
=======================
Toolchain
---------
$ git clone git@github.com:sifive/riscv-gnu-toolchain.git -b cfi-dev
$ riscv-gnu-toolchain/configure --prefix=<path-to-where-to-build> --with-arch=rv64gc_zicfilp_zicfiss --enable-linux --disable-gdb --with-extra-multilib-test="rv64gc_zicfilp_zicfiss-lp64d:-static"
$ make -j$(nproc)
Qemu
----
$ git clone git@github.com:deepak0414/qemu.git -b zicfilp_zicfiss_ratified_master_july11
$ cd qemu
$ mkdir build
$ cd build
$ ../configure --target-list=riscv64-softmmu
$ make -j$(nproc)
Opensbi
-------
$ git clone git@github.com:deepak0414/opensbi.git -b v6_cfi_spec_split_opensbi
$ make CROSS_COMPILE=<your riscv toolchain> -j$(nproc) PLATFORM=generic
Linux
-----
Running defconfig is fine. CFI is enabled by default if the toolchain
supports it.
$ make ARCH=riscv CROSS_COMPILE=<path-to-cfi-riscv-gnu-toolchain>/build/bin/riscv64-unknown-linux-gnu- -j$(nproc) defconfig
$ make ARCH=riscv CROSS_COMPILE=<path-to-cfi-riscv-gnu-toolchain>/build/bin/riscv64-unknown-linux-gnu- -j$(nproc)
Branch where user cfi enabling patches are maintained
https://github.com/deepak0414/linux-riscv-cfi/tree/vdso_user_cfi_v6.12-rc1
In case you're building your own rootfs using toolchain, please make sure you
pick following patch to ensure that vDSO compiled with lpad and shadow stack.
"arch/riscv: compile vdso with landing pad"
Running
-------
Modify your qemu command to have:
-bios <path-to-cfi-opensbi>/build/platform/generic/firmware/fw_dynamic.bin
-cpu rv64,zicfilp=true,zicfiss=true,zimop=true,zcmop=true
vDSO related Opens (in the flux)
=================================
I am listing these opens for laying out plan and what to expect in future
patch sets. And of course for the sake of discussion.
Shadow stack and landing pad enabling in vDSO
----------------------------------------------
vDSO must have shadow stack and landing pad support compiled in for task
to have shadow stack and landing pad support. This patch series doesn't
enable that (yet). Enabling shadow stack support in vDSO should be
straight forward (intend to do that in next versions of patch set). Enabling
landing pad support in vDSO requires some collaboration with toolchain folks
to follow a single label scheme for all object binaries. This is necessary to
ensure that all indirect call-sites are setting correct label and target landing
pads are decorated with same label scheme.
How many vDSOs
---------------
Shadow stack instructions are carved out of zimop (may be operations) and if CPU
doesn't implement zimop, they're illegal instructions. Kernel could be running on
a CPU which may or may not implement zimop. And thus kernel will have to carry 2
different vDSOs and expose the appropriate one depending on whether CPU implements
zimop or not.
References
==========
[1] - https://github.com/riscv/riscv-cfi
[2] - https://lore.kernel.org/all/20240814081126.956287-1-samuel.holland@sifive.c…
[3] - https://lwn.net/Articles/889475/
[4] - https://developer.arm.com/documentation/109576/0100/Branch-Target-Identific…
[5] - https://www.intel.com/content/dam/develop/external/us/en/documents/catc17-i…
[6] - https://lwn.net/Articles/940403/
[7] - https://lore.kernel.org/all/20241001-arm64-gcs-v13-0-222b78d87eee@kernel.or…
---
changelog
---------
v8:
- rebased on palmer/for-next
- dropped samuel holland's `envcfg` context switch patches.
they are in parlmer/for-next
v7:
- Removed "riscv/Kconfig: enable HAVE_EXIT_THREAD for riscv"
Instead using `deactivate_mm` flow to clean up.
see here for more context
https://lore.kernel.org/all/20230908203655.543765-1-rick.p.edgecombe@intel.…
- Changed the header include in `kselftest`. Hopefully this fixes compile
issue faced by Zong Li at SiFive.
- Cleaned up an orphaned change to `mm/mmap.c` in below patch
"riscv/mm : ensure PROT_WRITE leads to VM_READ | VM_WRITE"
- Lock interfaces for shadow stack and indirect branch tracking expect arg == 0
Any future evolution of this interface should accordingly define how arg should
be setup.
- `mm/map.c` has an instance of using `VM_SHADOW_STACK`. Fixed it to use helper
`is_shadow_stack_vma`.
- Link to v6: https://lore.kernel.org/r/20241008-v5_user_cfi_series-v6-0-60d9fe073f37@riv…
v6:
- Picked up Samuel Holland's changes as is with `envcfg` placed in
`thread` instead of `thread_info`
- fixed unaligned newline escapes in kselftest
- cleaned up messages in kselftest and included test output in commit message
- fixed a bug in clone path reported by Zong Li
- fixed a build issue if CONFIG_RISCV_ISA_V is not selected
(this was introduced due to re-factoring signal context
management code)
v5:
- rebased on v6.12-rc1
- Fixed schema related issues in device tree file
- Fixed some of the documentation related issues in zicfilp/ss.rst
(style issues and added index)
- added `SHADOW_STACK_SET_MARKER` so that implementation can define base
of shadow stack.
- Fixed warnings on definitions added in usercfi.h when
CONFIG_RISCV_USER_CFI is not selected.
- Adopted context header based signal handling as proposed by Andy Chiu
- Added support for enabling kernel mode access to shadow stack using
FWFT
(https://github.com/riscv-non-isa/riscv-sbi-doc/blob/master/src/ext-firmware…)
- Link to v5: https://lore.kernel.org/r/20241001-v5_user_cfi_series-v1-0-3ba65b6e550f@riv…
(Note: I had an issue in my workflow due to which version number wasn't
picked up correctly while sending out patches)
v4:
- rebased on 6.11-rc6
- envcfg: Converged with Samuel Holland's patches for envcfg management on per-
thread basis.
- vma_is_shadow_stack is renamed to is_vma_shadow_stack
- picked up Mark Brown's `ARCH_HAS_USER_SHADOW_STACK` patch
- signal context: using extended context management to maintain compatibility.
- fixed `-Wmissing-prototypes` compiler warnings for prctl functions
- Documentation fixes and amending typos.
- Link to v4: https://lore.kernel.org/all/20240912231650.3740732-1-debug@rivosinc.com/
v3:
- envcfg
logic to pick up base envcfg had a bug where `ENVCFG_CBZE` could have been
picked on per task basis, even though CPU didn't implement it. Fixed in
this series.
- dt-bindings
As suggested, split into separate commit. fixed the messaging that spec is
in public review
- arch_is_shadow_stack change
arch_is_shadow_stack changed to vma_is_shadow_stack
- hwprobe
zicfiss / zicfilp if present will get enumerated in hwprobe
- selftests
As suggested, added object and binary filenames to .gitignore
Selftest binary anyways need to be compiled with cfi enabled compiler which
will make sure that landing pad and shadow stack are enabled. Thus removed
separate enable/disable tests. Cleaned up tests a bit.
- Link to v3: https://lore.kernel.org/lkml/20240403234054.2020347-1-debug@rivosinc.com/
v2:
- Using config `CONFIG_RISCV_USER_CFI`, kernel support for riscv control flow
integrity for user mode programs can be compiled in the kernel.
- Enabling of control flow integrity for user programs is left to user runtime
- This patch series introduces arch agnostic `prctls` to enable shadow stack
and indirect branch tracking. And implements them on riscv.
---
Changes in v8:
- EDITME: describe what is new in this series revision.
- EDITME: use bulletpoints and terse descriptions.
- Link to v7: https://lore.kernel.org/r/20241029-v5_user_cfi_series-v7-0-2727ce9936cb@riv…
---
Andy Chiu (1):
riscv: signal: abstract header saving for setup_sigcontext
Clément Léger (1):
riscv: Add Firmware Feature SBI extensions definitions
Deepak Gupta (25):
mm: helper `is_shadow_stack_vma` to check shadow stack vma
dt-bindings: riscv: zicfilp and zicfiss in dt-bindings (extensions.yaml)
riscv: zicfiss / zicfilp enumeration
riscv: zicfiss / zicfilp extension csr and bit definitions
riscv: usercfi state for task and save/restore of CSR_SSP on trap entry/exit
riscv/mm : ensure PROT_WRITE leads to VM_READ | VM_WRITE
riscv mm: manufacture shadow stack pte
riscv mmu: teach pte_mkwrite to manufacture shadow stack PTEs
riscv mmu: write protect and shadow stack
riscv/mm: Implement map_shadow_stack() syscall
riscv/shstk: If needed allocate a new shadow stack on clone
prctl: arch-agnostic prctl for indirect branch tracking
riscv: Implements arch agnostic shadow stack prctls
riscv: Implements arch agnostic indirect branch tracking prctls
riscv/traps: Introduce software check exception
riscv/signal: save and restore of shadow stack for signal
riscv/kernel: update __show_regs to print shadow stack register
riscv/ptrace: riscv cfi status and state via ptrace and in core files
riscv/hwprobe: zicfilp / zicfiss enumeration in hwprobe
riscv: enable kernel access to shadow stack memory via FWFT sbi call
riscv: kernel command line option to opt out of user cfi
riscv: create a config for shadow stack and landing pad instr support
riscv: Documentation for landing pad / indirect branch tracking
riscv: Documentation for shadow stack on riscv
kselftest/riscv: kselftest for user mode cfi
Mark Brown (2):
mm: Introduce ARCH_HAS_USER_SHADOW_STACK
prctl: arch-agnostic prctl for shadow stack
Documentation/arch/riscv/index.rst | 2 +
Documentation/arch/riscv/zicfilp.rst | 115 +++++
Documentation/arch/riscv/zicfiss.rst | 176 +++++++
.../devicetree/bindings/riscv/extensions.yaml | 14 +
arch/riscv/Kconfig | 20 +
arch/riscv/include/asm/asm-prototypes.h | 1 +
arch/riscv/include/asm/cpufeature.h | 13 +
arch/riscv/include/asm/csr.h | 16 +
arch/riscv/include/asm/entry-common.h | 2 +
arch/riscv/include/asm/hwcap.h | 2 +
arch/riscv/include/asm/mman.h | 24 +
arch/riscv/include/asm/mmu_context.h | 7 +
arch/riscv/include/asm/pgtable.h | 30 +-
arch/riscv/include/asm/processor.h | 2 +
arch/riscv/include/asm/sbi.h | 27 ++
arch/riscv/include/asm/thread_info.h | 3 +
arch/riscv/include/asm/usercfi.h | 89 ++++
arch/riscv/include/asm/vector.h | 3 +
arch/riscv/include/uapi/asm/hwprobe.h | 2 +
arch/riscv/include/uapi/asm/ptrace.h | 22 +
arch/riscv/include/uapi/asm/sigcontext.h | 1 +
arch/riscv/kernel/Makefile | 2 +
arch/riscv/kernel/asm-offsets.c | 8 +
arch/riscv/kernel/cpufeature.c | 2 +
arch/riscv/kernel/entry.S | 31 +-
arch/riscv/kernel/head.S | 12 +
arch/riscv/kernel/process.c | 26 +-
arch/riscv/kernel/ptrace.c | 83 ++++
arch/riscv/kernel/signal.c | 140 +++++-
arch/riscv/kernel/sys_hwprobe.c | 2 +
arch/riscv/kernel/sys_riscv.c | 10 +
arch/riscv/kernel/traps.c | 42 ++
arch/riscv/kernel/usercfi.c | 526 +++++++++++++++++++++
arch/riscv/mm/init.c | 2 +-
arch/riscv/mm/pgtable.c | 17 +
arch/x86/Kconfig | 1 +
fs/proc/task_mmu.c | 2 +-
include/linux/cpu.h | 4 +
include/linux/mm.h | 5 +-
include/uapi/asm-generic/mman.h | 4 +
include/uapi/linux/elf.h | 1 +
include/uapi/linux/prctl.h | 48 ++
kernel/sys.c | 60 +++
mm/Kconfig | 6 +
mm/gup.c | 2 +-
mm/mmap.c | 2 +-
mm/vma.h | 10 +-
tools/testing/selftests/riscv/Makefile | 2 +-
tools/testing/selftests/riscv/cfi/.gitignore | 3 +
tools/testing/selftests/riscv/cfi/Makefile | 10 +
tools/testing/selftests/riscv/cfi/cfi_rv_test.h | 84 ++++
tools/testing/selftests/riscv/cfi/riscv_cfi_test.c | 78 +++
tools/testing/selftests/riscv/cfi/shadowstack.c | 373 +++++++++++++++
tools/testing/selftests/riscv/cfi/shadowstack.h | 37 ++
54 files changed, 2171 insertions(+), 35 deletions(-)
---
base-commit: 64f7b77f0bd9271861ed9e410e9856b6b0b21c48
change-id: 20240930-v5_user_cfi_series-3dc332f8f5b2
--
- debug
For logging to be useful, something has to set RET and retmsg by calling
ret_set_ksft_status(). There is a suite of functions to that end in
forwarding/lib: check_err, check_fail et.al. Move them to net/lib.sh so
that every net test can use them.
Existing lib.sh users might be using these same names for their functions.
However lib.sh is always sourced near the top of the file (checked), and
whatever new definitions will simply override the ones provided by lib.sh.
Signed-off-by: Petr Machata <petrm(a)nvidia.com>
Reviewed-by: Amit Cohen <amcohen(a)nvidia.com>
Acked-by: Shuah Khan <skhan(a)linuxfoundation.org>
---
Notes:
CC: Shuah Khan <shuah(a)kernel.org>
CC: Benjamin Poirier <bpoirier(a)nvidia.com>
CC: Hangbin Liu <liuhangbin(a)gmail.com>
CC: linux-kselftest(a)vger.kernel.org
CC: Jiri Pirko <jiri(a)resnulli.us>
tools/testing/selftests/net/forwarding/lib.sh | 73 -------------------
tools/testing/selftests/net/lib.sh | 73 +++++++++++++++++++
2 files changed, 73 insertions(+), 73 deletions(-)
diff --git a/tools/testing/selftests/net/forwarding/lib.sh b/tools/testing/selftests/net/forwarding/lib.sh
index d28dbf27c1f0..8625e3c99f55 100644
--- a/tools/testing/selftests/net/forwarding/lib.sh
+++ b/tools/testing/selftests/net/forwarding/lib.sh
@@ -445,79 +445,6 @@ done
##############################################################################
# Helpers
-# Whether FAILs should be interpreted as XFAILs. Internal.
-FAIL_TO_XFAIL=
-
-check_err()
-{
- local err=$1
- local msg=$2
-
- if ((err)); then
- if [[ $FAIL_TO_XFAIL = yes ]]; then
- ret_set_ksft_status $ksft_xfail "$msg"
- else
- ret_set_ksft_status $ksft_fail "$msg"
- fi
- fi
-}
-
-check_fail()
-{
- local err=$1
- local msg=$2
-
- check_err $((!err)) "$msg"
-}
-
-check_err_fail()
-{
- local should_fail=$1; shift
- local err=$1; shift
- local what=$1; shift
-
- if ((should_fail)); then
- check_fail $err "$what succeeded, but should have failed"
- else
- check_err $err "$what failed"
- fi
-}
-
-xfail()
-{
- FAIL_TO_XFAIL=yes "$@"
-}
-
-xfail_on_slow()
-{
- if [[ $KSFT_MACHINE_SLOW = yes ]]; then
- FAIL_TO_XFAIL=yes "$@"
- else
- "$@"
- fi
-}
-
-omit_on_slow()
-{
- if [[ $KSFT_MACHINE_SLOW != yes ]]; then
- "$@"
- fi
-}
-
-xfail_on_veth()
-{
- local dev=$1; shift
- local kind
-
- kind=$(ip -j -d link show dev $dev |
- jq -r '.[].linkinfo.info_kind')
- if [[ $kind = veth ]]; then
- FAIL_TO_XFAIL=yes "$@"
- else
- "$@"
- fi
-}
-
not()
{
"$@"
diff --git a/tools/testing/selftests/net/lib.sh b/tools/testing/selftests/net/lib.sh
index 4f52b8e48a3a..6bcf5d13879d 100644
--- a/tools/testing/selftests/net/lib.sh
+++ b/tools/testing/selftests/net/lib.sh
@@ -361,3 +361,76 @@ tests_run()
$current_test
done
}
+
+# Whether FAILs should be interpreted as XFAILs. Internal.
+FAIL_TO_XFAIL=
+
+check_err()
+{
+ local err=$1
+ local msg=$2
+
+ if ((err)); then
+ if [[ $FAIL_TO_XFAIL = yes ]]; then
+ ret_set_ksft_status $ksft_xfail "$msg"
+ else
+ ret_set_ksft_status $ksft_fail "$msg"
+ fi
+ fi
+}
+
+check_fail()
+{
+ local err=$1
+ local msg=$2
+
+ check_err $((!err)) "$msg"
+}
+
+check_err_fail()
+{
+ local should_fail=$1; shift
+ local err=$1; shift
+ local what=$1; shift
+
+ if ((should_fail)); then
+ check_fail $err "$what succeeded, but should have failed"
+ else
+ check_err $err "$what failed"
+ fi
+}
+
+xfail()
+{
+ FAIL_TO_XFAIL=yes "$@"
+}
+
+xfail_on_slow()
+{
+ if [[ $KSFT_MACHINE_SLOW = yes ]]; then
+ FAIL_TO_XFAIL=yes "$@"
+ else
+ "$@"
+ fi
+}
+
+omit_on_slow()
+{
+ if [[ $KSFT_MACHINE_SLOW != yes ]]; then
+ "$@"
+ fi
+}
+
+xfail_on_veth()
+{
+ local dev=$1; shift
+ local kind
+
+ kind=$(ip -j -d link show dev $dev |
+ jq -r '.[].linkinfo.info_kind')
+ if [[ $kind = veth ]]; then
+ FAIL_TO_XFAIL=yes "$@"
+ else
+ "$@"
+ fi
+}
--
2.45.0
It would be good to use the same mechanism for scheduling and dispatching
general net tests as the many forwarding tests already use. To that end,
move the logging helpers to net/lib.sh so that every net test can use them.
Existing lib.sh users might be using the name themselves. However lib.sh is
always sourced near the top of the file (checked), and whatever new
definition will simply override the one provided by lib.sh.
Signed-off-by: Petr Machata <petrm(a)nvidia.com>
Reviewed-by: Amit Cohen <amcohen(a)nvidia.com>
Acked-by: Shuah Khan <skhan(a)linuxfoundation.org>
---
Notes:
CC: Shuah Khan <shuah(a)kernel.org>
CC: Benjamin Poirier <bpoirier(a)nvidia.com>
CC: Hangbin Liu <liuhangbin(a)gmail.com>
CC: linux-kselftest(a)vger.kernel.org
CC: Jiri Pirko <jiri(a)resnulli.us>
tools/testing/selftests/net/forwarding/lib.sh | 10 ----------
tools/testing/selftests/net/lib.sh | 10 ++++++++++
2 files changed, 10 insertions(+), 10 deletions(-)
diff --git a/tools/testing/selftests/net/forwarding/lib.sh b/tools/testing/selftests/net/forwarding/lib.sh
index 41dd14c42c48..d28dbf27c1f0 100644
--- a/tools/testing/selftests/net/forwarding/lib.sh
+++ b/tools/testing/selftests/net/forwarding/lib.sh
@@ -1285,16 +1285,6 @@ matchall_sink_create()
action drop
}
-tests_run()
-{
- local current_test
-
- for current_test in ${TESTS:-$ALL_TESTS}; do
- in_defer_scope \
- $current_test
- done
-}
-
cleanup()
{
pre_cleanup
diff --git a/tools/testing/selftests/net/lib.sh b/tools/testing/selftests/net/lib.sh
index 691318b1ec55..4f52b8e48a3a 100644
--- a/tools/testing/selftests/net/lib.sh
+++ b/tools/testing/selftests/net/lib.sh
@@ -351,3 +351,13 @@ log_info()
echo "INFO: $msg"
}
+
+tests_run()
+{
+ local current_test
+
+ for current_test in ${TESTS:-$ALL_TESTS}; do
+ in_defer_scope \
+ $current_test
+ done
+}
--
2.45.0
Many net selftests invent their own logging helpers. These really should be
in a library sourced by these tests. Currently forwarding/lib.sh has a
suite of perfectly fine logging helpers, but sourcing a forwarding/ library
from a higher-level directory smells of layering violation. In this patch,
move the logging helpers to net/lib.sh so that every net test can use them.
Together with the logging helpers, it's also necessary to move
pause_on_fail(), and EXIT_STATUS and RET.
Existing lib.sh users might be using these same names for their functions
or variables. However lib.sh is always sourced near the top of the
file (checked), and whatever new definitions will simply override the ones
provided by lib.sh.
Signed-off-by: Petr Machata <petrm(a)nvidia.com>
Reviewed-by: Amit Cohen <amcohen(a)nvidia.com>
Acked-by: Shuah Khan <skhan(a)linuxfoundation.org>
---
Notes:
CC: Shuah Khan <shuah(a)kernel.org>
CC: Benjamin Poirier <bpoirier(a)nvidia.com>
CC: Hangbin Liu <liuhangbin(a)gmail.com>
CC: linux-kselftest(a)vger.kernel.org
CC: Jiri Pirko <jiri(a)resnulli.us>
tools/testing/selftests/net/forwarding/lib.sh | 113 -----------------
tools/testing/selftests/net/lib.sh | 115 ++++++++++++++++++
2 files changed, 115 insertions(+), 113 deletions(-)
diff --git a/tools/testing/selftests/net/forwarding/lib.sh b/tools/testing/selftests/net/forwarding/lib.sh
index 89c25f72b10c..41dd14c42c48 100644
--- a/tools/testing/selftests/net/forwarding/lib.sh
+++ b/tools/testing/selftests/net/forwarding/lib.sh
@@ -48,7 +48,6 @@ declare -A NETIFS=(
: "${WAIT_TIME:=5}"
# Whether to pause on, respectively, after a failure and before cleanup.
-: "${PAUSE_ON_FAIL:=no}"
: "${PAUSE_ON_CLEANUP:=no}"
# Whether to create virtual interfaces, and what netdevice type they should be.
@@ -446,22 +445,6 @@ done
##############################################################################
# Helpers
-# Exit status to return at the end. Set in case one of the tests fails.
-EXIT_STATUS=0
-# Per-test return value. Clear at the beginning of each test.
-RET=0
-
-ret_set_ksft_status()
-{
- local ksft_status=$1; shift
- local msg=$1; shift
-
- RET=$(ksft_status_merge $RET $ksft_status)
- if (( $? )); then
- retmsg=$msg
- fi
-}
-
# Whether FAILs should be interpreted as XFAILs. Internal.
FAIL_TO_XFAIL=
@@ -535,102 +518,6 @@ xfail_on_veth()
fi
}
-log_test_result()
-{
- local test_name=$1; shift
- local opt_str=$1; shift
- local result=$1; shift
- local retmsg=$1; shift
-
- printf "TEST: %-60s [%s]\n" "$test_name $opt_str" "$result"
- if [[ $retmsg ]]; then
- printf "\t%s\n" "$retmsg"
- fi
-}
-
-pause_on_fail()
-{
- if [[ $PAUSE_ON_FAIL == yes ]]; then
- echo "Hit enter to continue, 'q' to quit"
- read a
- [[ $a == q ]] && exit 1
- fi
-}
-
-handle_test_result_pass()
-{
- local test_name=$1; shift
- local opt_str=$1; shift
-
- log_test_result "$test_name" "$opt_str" " OK "
-}
-
-handle_test_result_fail()
-{
- local test_name=$1; shift
- local opt_str=$1; shift
-
- log_test_result "$test_name" "$opt_str" FAIL "$retmsg"
- pause_on_fail
-}
-
-handle_test_result_xfail()
-{
- local test_name=$1; shift
- local opt_str=$1; shift
-
- log_test_result "$test_name" "$opt_str" XFAIL "$retmsg"
- pause_on_fail
-}
-
-handle_test_result_skip()
-{
- local test_name=$1; shift
- local opt_str=$1; shift
-
- log_test_result "$test_name" "$opt_str" SKIP "$retmsg"
-}
-
-log_test()
-{
- local test_name=$1
- local opt_str=$2
-
- if [[ $# -eq 2 ]]; then
- opt_str="($opt_str)"
- fi
-
- if ((RET == ksft_pass)); then
- handle_test_result_pass "$test_name" "$opt_str"
- elif ((RET == ksft_xfail)); then
- handle_test_result_xfail "$test_name" "$opt_str"
- elif ((RET == ksft_skip)); then
- handle_test_result_skip "$test_name" "$opt_str"
- else
- handle_test_result_fail "$test_name" "$opt_str"
- fi
-
- EXIT_STATUS=$(ksft_exit_status_merge $EXIT_STATUS $RET)
- return $RET
-}
-
-log_test_skip()
-{
- RET=$ksft_skip retmsg= log_test "$@"
-}
-
-log_test_xfail()
-{
- RET=$ksft_xfail retmsg= log_test "$@"
-}
-
-log_info()
-{
- local msg=$1
-
- echo "INFO: $msg"
-}
-
not()
{
"$@"
diff --git a/tools/testing/selftests/net/lib.sh b/tools/testing/selftests/net/lib.sh
index c8991cc6bf28..691318b1ec55 100644
--- a/tools/testing/selftests/net/lib.sh
+++ b/tools/testing/selftests/net/lib.sh
@@ -9,6 +9,9 @@ source "$net_dir/lib/sh/defer.sh"
: "${WAIT_TIMEOUT:=20}"
+# Whether to pause on after a failure.
+: "${PAUSE_ON_FAIL:=no}"
+
BUSYWAIT_TIMEOUT=$((WAIT_TIMEOUT * 1000)) # ms
# Kselftest framework constants.
@@ -20,6 +23,11 @@ ksft_skip=4
# namespace list created by setup_ns
NS_LIST=()
+# Exit status to return at the end. Set in case one of the tests fails.
+EXIT_STATUS=0
+# Per-test return value. Clear at the beginning of each test.
+RET=0
+
##############################################################################
# Helpers
@@ -236,3 +244,110 @@ tc_rule_handle_stats_get()
| jq ".[] | select(.options.handle == $handle) | \
.options.actions[0].stats$selector"
}
+
+ret_set_ksft_status()
+{
+ local ksft_status=$1; shift
+ local msg=$1; shift
+
+ RET=$(ksft_status_merge $RET $ksft_status)
+ if (( $? )); then
+ retmsg=$msg
+ fi
+}
+
+log_test_result()
+{
+ local test_name=$1; shift
+ local opt_str=$1; shift
+ local result=$1; shift
+ local retmsg=$1; shift
+
+ printf "TEST: %-60s [%s]\n" "$test_name $opt_str" "$result"
+ if [[ $retmsg ]]; then
+ printf "\t%s\n" "$retmsg"
+ fi
+}
+
+pause_on_fail()
+{
+ if [[ $PAUSE_ON_FAIL == yes ]]; then
+ echo "Hit enter to continue, 'q' to quit"
+ read a
+ [[ $a == q ]] && exit 1
+ fi
+}
+
+handle_test_result_pass()
+{
+ local test_name=$1; shift
+ local opt_str=$1; shift
+
+ log_test_result "$test_name" "$opt_str" " OK "
+}
+
+handle_test_result_fail()
+{
+ local test_name=$1; shift
+ local opt_str=$1; shift
+
+ log_test_result "$test_name" "$opt_str" FAIL "$retmsg"
+ pause_on_fail
+}
+
+handle_test_result_xfail()
+{
+ local test_name=$1; shift
+ local opt_str=$1; shift
+
+ log_test_result "$test_name" "$opt_str" XFAIL "$retmsg"
+ pause_on_fail
+}
+
+handle_test_result_skip()
+{
+ local test_name=$1; shift
+ local opt_str=$1; shift
+
+ log_test_result "$test_name" "$opt_str" SKIP "$retmsg"
+}
+
+log_test()
+{
+ local test_name=$1
+ local opt_str=$2
+
+ if [[ $# -eq 2 ]]; then
+ opt_str="($opt_str)"
+ fi
+
+ if ((RET == ksft_pass)); then
+ handle_test_result_pass "$test_name" "$opt_str"
+ elif ((RET == ksft_xfail)); then
+ handle_test_result_xfail "$test_name" "$opt_str"
+ elif ((RET == ksft_skip)); then
+ handle_test_result_skip "$test_name" "$opt_str"
+ else
+ handle_test_result_fail "$test_name" "$opt_str"
+ fi
+
+ EXIT_STATUS=$(ksft_exit_status_merge $EXIT_STATUS $RET)
+ return $RET
+}
+
+log_test_skip()
+{
+ RET=$ksft_skip retmsg= log_test "$@"
+}
+
+log_test_xfail()
+{
+ RET=$ksft_xfail retmsg= log_test "$@"
+}
+
+log_info()
+{
+ local msg=$1
+
+ echo "INFO: $msg"
+}
--
2.45.0
Hello,
this new series aims to migrate test_flow_dissector.sh into test_progs.
There are 2 "main" parts in test_flow_dissector.sh:
- a set of tests checking flow_dissector programs attachment to either
root namespace or non-root namespace
- dissection test
The first set is integrated in flow_dissector.c, which already contains
some existing tests for flow_dissector programs. This series uses the
opportunity to update a bit this file (use new assert, re-split tests,
etc)
The second part is migrated into a new file under test_progs,
flow_dissector_classification.c. It uses the same eBPF programs as
flow_dissector.c, but the difference is rather about how those program
are executed:
- flow_dissector.c manually runs programs with BPF_PROG_RUN
- flow_dissector_classification.c sends real packets to be dissected, and
so it also executes kernel code related to eBPF flow dissector (eg:
__skb_flow_bpf_to_target)
---
Alexis Lothoré (eBPF Foundation) (10):
selftests/bpf: add a macro to compare raw memory
selftests/bpf: use ASSERT_MEMEQ to compare bpf flow keys
selftests/bpf: replace CHECK calls with ASSERT macros in flow_dissector test
selftests/bpf: re-split main function into dedicated tests
selftests/bpf: expose all subtests from flow_dissector
selftests/bpf: add gre packets testing to flow_dissector
selftests/bpf: migrate flow_dissector namespace exclusivity test
selftests/bpf: Enable generic tc actions in selftests config
selftests/bpf: migrate bpf flow dissectors tests to test_progs
selftests/bpf: remove test_flow_dissector.sh
tools/testing/selftests/bpf/.gitignore | 1 -
tools/testing/selftests/bpf/Makefile | 3 +-
tools/testing/selftests/bpf/config | 1 +
.../selftests/bpf/prog_tests/flow_dissector.c | 307 ++++++--
.../bpf/prog_tests/flow_dissector_classification.c | 851 +++++++++++++++++++++
tools/testing/selftests/bpf/test_flow_dissector.c | 780 -------------------
tools/testing/selftests/bpf/test_flow_dissector.sh | 178 -----
tools/testing/selftests/bpf/test_progs.h | 25 +
8 files changed, 1107 insertions(+), 1039 deletions(-)
---
base-commit: 16e1d1c377aa4c56223701a31f3bfa88505e7e4f
change-id: 20241019-flow_dissector-3eb0c07fc163
Best regards,
--
Alexis Lothoré, Bootlin
Embedded Linux and Kernel engineering
https://bootlin.com
On Wed, Nov 13, 2024, Paolo Bonzini wrote:
> Il mar 12 nov 2024, 21:44 Doug Covelli <doug.covelli(a)broadcom.com> ha
> scritto:
>
> > > Split irqchip should be the best tradeoff. Without it, moves from cr8
> > > stay in the kernel, but moves to cr8 always go to userspace with a
> > > KVM_EXIT_SET_TPR exit. You also won't be able to use Intel
> > > flexpriority (in-processor accelerated TPR) because KVM does not know
> > > which bits are set in IRR. So it will be *really* every move to cr8
> > > that goes to userspace.
> >
> > Sorry to hijack this thread but is there a technical reason not to allow
> > CR8 based accesses to the TPR (not MMIO accesses) when the in-kernel local
> > APIC is not in use?
>
> No worries, you're not hijacking :) The only reason is that it would be
> more code for a seldom used feature and anyway with worse performance. (To
> be clear, CR8 based accesses are allowed, but stores cause an exit in order
> to check the new TPR against IRR. That's because KVM's API does not have an
> equivalent of the TPR threshold as you point out below).
>
> > Also I could not find these documented anywhere but with MSFT's APIC our
> > monitor relies on extensions for trapping certain events such as INIT/SIPI
> > plus LINT0 and SVR writes:
> >
> > UINT64 X64ApicInitSipiExitTrap : 1; //
> > WHvRunVpExitReasonX64ApicInitSipiTrap
> > UINT64 X64ApicWriteLint0ExitTrap : 1; //
> > WHvRunVpExitReasonX64ApicWriteTrap
> > UINT64 X64ApicWriteLint1ExitTrap : 1; //
> > WHvRunVpExitReasonX64ApicWriteTrap
> > UINT64 X64ApicWriteSvrExitTrap : 1; //
> > WHvRunVpExitReasonX64ApicWriteTrap
> >
>
> There's no need for this in KVM's in-kernel APIC model. INIT and SIPI are
> handled in the hypervisor and you can get the current state of APs via
> KVM_GET_MPSTATE. LINT0 and LINT1 are injected with KVM_INTERRUPT and
> KVM_NMI respectively, and they obey IF/PPR and NMI blocking respectively,
> plus the interrupt shadow; so there's no need for userspace to know when
> LINT0/LINT1 themselves change. The spurious interrupt vector register is
> also handled completely in kernel.
>
> > I did not see any similar functionality for KVM. Does anything like that
> > exist? In any case we would be happy to add support for handling CR8
> > accesses w/o exiting w/o the in-kernel APIC along with some sort of a way
> > to configure the TPR threshold if folks are not opposed to that.
>
> As far I know everybody who's using KVM (whether proprietary or open
> source) has had no need for that, so I don't think it's a good idea to make
> the API more complex.
+1
> Performance of Windows guests is going to be bad anyway with userspace APIC.
Heh, on modern hardware, performance of any guest is going to suck with a userspace
APIC, compared to what is possible with an in-kernel APIC.
More importantly, I really, really don't want to encourage non-trivial usage of
a local APIC in userspace. KVM's support for a userspace local APIC is very
poorly tested these days. I have zero desire to spend any amount of time reviewing
and fixing issues that are unique to emulating the local APIC in userspace. And
long term, I would love to force an in-kernel local APIC, though I don't know if
that's entirely feasible.
This patchset fixes a bug where bnxt driver was failing to report that
an ntuple rule is redirecting to an RSS context. First commit is the
fix, then second commit extends selftests to detect if other/new drivers
are compliant with ntuple/rss_ctx API.
Daniel Xu (2):
bnxt_en: ethtool: Supply ntuple rss context action
selftests: drv-net: rss_ctx: Add test for ntuple rule
drivers/net/ethernet/broadcom/bnxt/bnxt_ethtool.c | 8 ++++++--
tools/testing/selftests/drivers/net/hw/rss_ctx.py | 13 ++++++++++++-
2 files changed, 18 insertions(+), 3 deletions(-)
--
2.46.0
To be able to switch VMware products running on Linux to KVM some minor
changes are required to let KVM run/resume unmodified VMware guests.
First allow enabling of the VMware backdoor via an api. Currently the
setting of the VMware backdoor is limited to kernel boot parameters,
which forces all VM's running on a host to either run with or without
the VMware backdoor. Add a simple cap to allow enabling of the VMware
backdoor on a per VM basis. The default for that setting remains the
kvm.enable_vmware_backdoor boot parameter (which is false by default)
and can be changed on a per-vm basis via the KVM_CAP_X86_VMWARE_BACKDOOR
cap.
Second add a cap to forward hypercalls to userspace. I know that in
general that's frowned upon but VMwre guests send quite a few hypercalls
from userspace and it would be both impractical and largelly impossible
to handle all in the kernel. The change is trivial and I'd be maintaining
this code so I hope it's not a big deal.
The third commit just adds a self-test for the "forward VMware hypercalls
to userspace" functionality.
Cc: Doug Covelli <doug.covelli(a)broadcom.com>
Cc: Paolo Bonzini <pbonzini(a)redhat.com>
Cc: Jonathan Corbet <corbet(a)lwn.net>
Cc: Sean Christopherson <seanjc(a)google.com>
Cc: Thomas Gleixner <tglx(a)linutronix.de>
Cc: Ingo Molnar <mingo(a)redhat.com>
Cc: Borislav Petkov <bp(a)alien8.de>
Cc: Dave Hansen <dave.hansen(a)linux.intel.com>
Cc: x86(a)kernel.org
Cc: "H. Peter Anvin" <hpa(a)zytor.com>
Cc: Shuah Khan <shuah(a)kernel.org>
Cc: Namhyung Kim <namhyung(a)kernel.org>
Cc: Arnaldo Carvalho de Melo <acme(a)redhat.com>
Cc: Isaku Yamahata <isaku.yamahata(a)intel.com>
Cc: Joel Stanley <joel(a)jms.id.au>
Cc: Zack Rusin <zack.rusin(a)broadcom.com>
Cc: kvm(a)vger.kernel.org
Cc: linux-doc(a)vger.kernel.org
Cc: linux-kernel(a)vger.kernel.org
Cc: linux-kselftest(a)vger.kernel.org
Zack Rusin (3):
KVM: x86: Allow enabling of the vmware backdoor via a cap
KVM: x86: Add support for VMware guest specific hypercalls
KVM: selftests: x86: Add a test for KVM_CAP_X86_VMWARE_HYPERCALL
Documentation/virt/kvm/api.rst | 56 ++++++++-
arch/x86/include/asm/kvm_host.h | 2 +
arch/x86/kvm/emulate.c | 5 +-
arch/x86/kvm/svm/svm.c | 6 +-
arch/x86/kvm/vmx/vmx.c | 4 +-
arch/x86/kvm/x86.c | 47 ++++++++
arch/x86/kvm/x86.h | 7 +-
include/uapi/linux/kvm.h | 2 +
tools/include/uapi/linux/kvm.h | 2 +
tools/testing/selftests/kvm/Makefile | 1 +
.../kvm/x86_64/vmware_hypercall_test.c | 108 ++++++++++++++++++
11 files changed, 227 insertions(+), 13 deletions(-)
create mode 100644 tools/testing/selftests/kvm/x86_64/vmware_hypercall_test.c
--
2.43.0
This patch series represents the final release of KTAP version 2.
There have been having open discussions on version 2 for just over 2
years. This patch series marks the end of KTAP version 2 development
and beginning of the KTAP version 3 development.
The largest component of KTAP version 2 release is the addition of test
metadata to the specification. KTAP metadata could include any test
information that is pertinent for user interaction before or after the
running of the test. For example, the test file path or the test speed.
Example of KTAP Metadata:
KTAP version 2
#:ktap_test: main
#:ktap_arch: uml
1..1
KTAP version 2
#:ktap_test: suite_1
#:ktap_subsystem: example
#:ktap_test_file: lib/test.c
1..2
ok 1 test_1
#:ktap_test: test_2
#:ktap_speed: very_slow
# test_2 has begun
#:custom_is_flaky: true
ok 2 test_2
# suite_1 has passed
ok 1 suite_1
The release also includes some formatting fixes and changes to update
the specification to version 2.
Frank Rowand (2):
ktap_v2: change version to 2-rc in KTAP specification
ktap_v2: change "version 1" to "version 2" in examples
Rae Moar (3):
ktap_v2: add test metadata
ktap_v2: formatting fixes to ktap spec
ktap_v2: change version to 2 in KTAP specification
Documentation/dev-tools/ktap.rst | 276 +++++++++++++++++++++++++++++--
1 file changed, 260 insertions(+), 16 deletions(-)
base-commit: 2d5404caa8c7bb5c4e0435f94b28834ae5456623
--
2.47.0.277.g8800431eea-goog
Since commit 49f59573e9e0 ("selftests/mm: Enable pkey_sighandler_tests
on arm64"), pkey_sighandler_tests.c (which includes pkey-arm64.h via
pkey-helpers.h) ends up compiled for arm64. Since it doesn't use
aarch64_write_signal_pkey(), the compiler warns:
In file included from pkey-helpers.h:106,
from pkey_sighandler_tests.c:31:
pkey-arm64.h:130:13: warning: ‘aarch64_write_signal_pkey’ defined but not used [-Wunused-function]
130 | static void aarch64_write_signal_pkey(ucontext_t *uctxt, u64 pkey)
| ^~~~~~~~~~~~~~~~~~~~~~~~~
Make the aarch64_write_signal_pkey() a 'static inline void' function to
avoid the compiler warning.
Fixes: f5b5ea51f78f ("selftests: mm: make protection_keys test work on arm64")
Cc: Shuah Khan <skhan(a)linuxfoundation.org>
Cc: Joey Gouly <joey.gouly(a)arm.com>
Cc: Kevin Brodsky <kevin.brodsky(a)arm.com>
Signed-off-by: Catalin Marinas <catalin.marinas(a)arm.com>
---
I'll add this on top of the arm64 for-next/pkey-signal branch together with
Kevin's other patches.
tools/testing/selftests/mm/pkey-arm64.h | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/tools/testing/selftests/mm/pkey-arm64.h b/tools/testing/selftests/mm/pkey-arm64.h
index d57fbeace38f..d9d2100eafc0 100644
--- a/tools/testing/selftests/mm/pkey-arm64.h
+++ b/tools/testing/selftests/mm/pkey-arm64.h
@@ -127,7 +127,7 @@ static inline u64 get_pkey_bits(u64 reg, int pkey)
return 0;
}
-static void aarch64_write_signal_pkey(ucontext_t *uctxt, u64 pkey)
+static inline void aarch64_write_signal_pkey(ucontext_t *uctxt, u64 pkey)
{
struct _aarch64_ctx *ctx = GET_UC_RESV_HEAD(uctxt);
struct poe_context *poe_ctx =
It looks like people started ignoring the compiler warnings (or even
errors) when building the arm64-specific kselftests. The first three
patches are printf() arguments adjustment. The last one adds
".arch_extension sme", otherwise they fail to build (with my toolchain
versions at least).
Note for future kselftest contributors - try to keep them warning-free.
Catalin Marinas (4):
kselftest/arm64: Fix printf() compiler warnings in the arm64 fp tests
kselftest/arm64: Fix printf() warning in the arm64 MTE prctl() test
kselftest/arm64: Fix printf() compiler warnings in the arm64
syscall-abi.c tests
kselftest/arm64: Fix compilation of SME instructions in the arm64 fp
tests
tools/testing/selftests/arm64/abi/syscall-abi.c | 8 ++++----
tools/testing/selftests/arm64/fp/sve-ptrace.c | 16 +++++++++-------
tools/testing/selftests/arm64/fp/za-ptrace.c | 8 +++++---
tools/testing/selftests/arm64/fp/za-test.S | 1 +
tools/testing/selftests/arm64/fp/zt-ptrace.c | 8 +++++---
tools/testing/selftests/arm64/fp/zt-test.S | 1 +
tools/testing/selftests/arm64/mte/check_prctl.c | 2 +-
7 files changed, 26 insertions(+), 18 deletions(-)
Currently we test signal delivery to the programs run by fp-stress but
our signal handlers simply count the number of signals seen and don't do
anything with the floating point state. The original fpsimd-test and
sve-test programs had signal handlers called irritators which modify the
live register state, verifying that we restore the signal context on
return, but a combination of misleading comments and code resulted in
them never being used and the equivalent handlers in the other tests
being stubbed or omitted.
Clarify the code, implement effective irritator handlers for the test
programs that can have them and then switch the signals generated by the
fp-stress program over to use the irritators, ensuring that we validate
that we restore the saved signal context properly.
Signed-off-by: Mark Brown <broonie(a)kernel.org>
---
Changes in v2:
- Change instruction and commit log for corruption of predicat
registers.
- Link to v1: https://lore.kernel.org/r/20241023-arm64-fp-stress-irritator-v1-0-a51af298d…
---
Mark Brown (6):
kselftest/arm64: Correct misleading comments on fp-stress irritators
kselftest/arm64: Remove unused ADRs from irritator handlers
kselftest/arm64: Corrupt P0 in the irritator when testing SSVE
kselftest/arm64: Implement irritators for ZA and ZT
kselftest/arm64: Provide a SIGUSR1 handler in the kernel mode FP stress test
kselftest/arm64: Test signal handler state modification in fp-stress
tools/testing/selftests/arm64/fp/fp-stress.c | 2 +-
tools/testing/selftests/arm64/fp/fpsimd-test.S | 4 +---
tools/testing/selftests/arm64/fp/kernel-test.c | 4 ++++
tools/testing/selftests/arm64/fp/sve-test.S | 8 +++-----
tools/testing/selftests/arm64/fp/za-test.S | 13 ++++---------
tools/testing/selftests/arm64/fp/zt-test.S | 13 ++++---------
6 files changed, 17 insertions(+), 27 deletions(-)
---
base-commit: 8e929cb546ee42c9a61d24fae60605e9e3192354
change-id: 20241023-arm64-fp-stress-irritator-5415fe92adef
Best regards,
--
Mark Brown <broonie(a)kernel.org>
This series contains a bit of a grab bag of improvements to the floating
point tests, mainly fp-ptrace. Globally over all the tests we start
using defines from the generated sysregs (following the example of the
KVM selftests) for SVCR, stop being quite so wasteful with registers
when calling into the assembler code then expand the coverage of both ZA
writes and FPMR (which was not there since fp-ptrace and the 2023 dpISA
extensions were on the list at the same time).
Signed-off-by: Mark Brown <broonie(a)kernel.org>
---
Changes in v2:
- Drop attempt to reference sysreg-defs.h due to build issues, just
import the defines directly into fp-ptrace. We can revisit later.
- Link to v1: https://lore.kernel.org/r/20241107-arm64-fp-ptrace-fpmr-v1-0-3e5e0b6e3be9@k…
---
Mark Brown (3):
kselftets/arm64: Use flag bits for features in fp-ptrace assembler code
kselftest/arm64: Expand the set of ZA writes fp-ptrace does
kselftest/arm64: Add FPMR coverage to fp-ptrace
tools/testing/selftests/arm64/fp/fp-ptrace-asm.S | 41 ++++--
tools/testing/selftests/arm64/fp/fp-ptrace.c | 161 +++++++++++++++++++++--
tools/testing/selftests/arm64/fp/fp-ptrace.h | 12 ++
tools/testing/selftests/arm64/fp/sme-inst.h | 2 +
4 files changed, 188 insertions(+), 28 deletions(-)
---
base-commit: 8e929cb546ee42c9a61d24fae60605e9e3192354
change-id: 20241105-arm64-fp-ptrace-fpmr-ff061facd3da
Best regards,
--
Mark Brown <broonie(a)kernel.org>
While some assemblers (including the LLVM assembler I mostly use) will
happily accept SMSTART as an instruction by default others, specifically
gas, require that any architecture extensions be explicitly enabled.
The assembler SME test programs use manually encoded helpers for the new
instructions but no SMSTART helper is defined, only SM and ZA specific
variants. Unfortunately the irritators that were just added use plain
SMSTART so on stricter assemblers these fail to build:
za-test.S:160: Error: selected processor does not support `smstart'
Switch to using SMSTART ZA via the manually encoded smstart_za macro we
already have defined.
Fixes: d65f27d240bb ("kselftest/arm64: Implement irritators for ZA and ZT")
Signed-off-by: Mark Brown <broonie(a)kernel.org>
---
tools/testing/selftests/arm64/fp/za-test.S | 2 +-
tools/testing/selftests/arm64/fp/zt-test.S | 2 +-
2 files changed, 2 insertions(+), 2 deletions(-)
diff --git a/tools/testing/selftests/arm64/fp/za-test.S b/tools/testing/selftests/arm64/fp/za-test.S
index 95fdc1c1f228221bc812087a528e4b7c99767bba..9c33e13e9dc4a6f084649fe7d0fb838d9171e3aa 100644
--- a/tools/testing/selftests/arm64/fp/za-test.S
+++ b/tools/testing/selftests/arm64/fp/za-test.S
@@ -157,7 +157,7 @@ function irritator_handler
// This will reset ZA to all bits 0
smstop
- smstart
+ smstart_za
ret
endfunction
diff --git a/tools/testing/selftests/arm64/fp/zt-test.S b/tools/testing/selftests/arm64/fp/zt-test.S
index a90712802801efb97dc6bf8027fb9ceac8f0a895..38080f3c328042af6b3e2d7c3300162ea6efa4ea 100644
--- a/tools/testing/selftests/arm64/fp/zt-test.S
+++ b/tools/testing/selftests/arm64/fp/zt-test.S
@@ -126,7 +126,7 @@ function irritator_handler
// This will reset ZT to all bits 0
smstop
- smstart
+ smstart_za
ret
endfunction
---
base-commit: 95ad089d464da2a4cd4511fb077f25994104c8f1
change-id: 20241108-arm64-selftest-asm-error-d78570e50b3b
Best regards,
--
Mark Brown <broonie(a)kernel.org>
Currently we don't build the PAC selftests when building with LLVM=1 since
we attempt to test for PAC support in the toolchain before we've set up the
build system to point at LLVM in lib.mk, which has to be one of the last
things in the Makefile.
Since all versions of LLVM supported for use with the kernel have PAC
support we can just sidestep the issue by just assuming PAC is there when
doing a LLVM=1 build.
Signed-off-by: Mark Brown <broonie(a)kernel.org>
---
tools/testing/selftests/arm64/pauth/Makefile | 6 ++++++
1 file changed, 6 insertions(+)
diff --git a/tools/testing/selftests/arm64/pauth/Makefile b/tools/testing/selftests/arm64/pauth/Makefile
index 72e290b0b10c1ea5bf1b84232f70844a601b8129..b5a1c80e0ead6932d2441a192b9758a049e3b3f8 100644
--- a/tools/testing/selftests/arm64/pauth/Makefile
+++ b/tools/testing/selftests/arm64/pauth/Makefile
@@ -7,8 +7,14 @@ CC := $(CROSS_COMPILE)gcc
endif
CFLAGS += -mbranch-protection=pac-ret
+
+# All supported LLVMs have PAC, test for GCC
+ifeq ($(LLVM),1)
+pauth_cc_support := 1
+else
# check if the compiler supports ARMv8.3 and branch protection with PAuth
pauth_cc_support := $(shell if ($(CC) $(CFLAGS) -march=armv8.3-a -E -x c /dev/null -o /dev/null 2>&1) then echo "1"; fi)
+endif
ifeq ($(pauth_cc_support),1)
TEST_GEN_PROGS := pac
---
base-commit: 8e929cb546ee42c9a61d24fae60605e9e3192354
change-id: 20241111-arm64-selftest-pac-clang-cb80de079169
Best regards,
--
Mark Brown <broonie(a)kernel.org>
The PAC tests currently generate a very small number of false failures
since the limited size of PAC keys, especially with fewer bits allocated
for PAC due to larger VA, means collisions in generated PACs are
possible even if the PAC keys are different. The test tries to work around
this by testing repeatedly but doesn't iterate often enough to be
reliable.
We also fix a leak of file descriptors in the exec test which doesn't
matter now but can become an issue when the number of iterations is
increased, we can bump into limits on allocated file descriptors.
Signed-off-by: Mark Brown <broonie(a)kernel.org>
---
Mark Brown (2):
kselftest/arm64: Don't leak pipe fds in pac.exec_sign_all()
kselftest/arm64: Try harder to generate different keys during PAC tests
tools/testing/selftests/arm64/pauth/pac.c | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)
---
base-commit: 8e929cb546ee42c9a61d24fae60605e9e3192354
change-id: 20241111-arm64-pac-test-collisions-5613f5dfe479
Best regards,
--
Mark Brown <broonie(a)kernel.org>
This series contains a bit of a grab bag of improvements to the floating
point tests, mainly fp-ptrace. Globally over all the tests we start
using defines from the generated sysregs (following the example of the
KVM selftests) for SVCR, stop being quite so wasteful with registers
when calling into the assembler code then expand the coverage of both ZA
writes and FPMR (which was not there since fp-ptrace and the 2023 dpISA
extensions were on the list at the same time).
Signed-off-by: Mark Brown <broonie(a)kernel.org>
---
Mark Brown (4):
kselftets/arm64: Use flag bits for features in fp-ptrace assembler code
kselftest/arm64: Use a define for SVCR
kselftest/arm64: Expand the set of ZA writes fp-ptrace does
kselftest/arm64: Add FPMR coverage to fp-ptrace
tools/testing/selftests/arm64/fp/Makefile | 20 ++-
tools/testing/selftests/arm64/fp/fp-ptrace-asm.S | 52 +++++---
tools/testing/selftests/arm64/fp/fp-ptrace.c | 155 +++++++++++++++++++++--
tools/testing/selftests/arm64/fp/fp-ptrace.h | 16 ++-
tools/testing/selftests/arm64/fp/sve-test.S | 5 +-
tools/testing/selftests/arm64/fp/za-fork-asm.S | 3 +-
tools/testing/selftests/arm64/fp/za-test.S | 6 +-
tools/testing/selftests/arm64/fp/zt-test.S | 5 +-
8 files changed, 213 insertions(+), 49 deletions(-)
---
base-commit: 8e929cb546ee42c9a61d24fae60605e9e3192354
change-id: 20241105-arm64-fp-ptrace-fpmr-ff061facd3da
Best regards,
--
Mark Brown <broonie(a)kernel.org>
Kuan-Wei Chiu pointed out that the kernel doc for kunit_zalloc_skb()
needs to include the @gfp information. Add it.
Reported-by: Kuan-Wei Chiu <visitorckw(a)gmail.com>
Closes: https://lore.kernel.org/all/Zy+VIXDPuU613fFd@visitorckw-System-Product-Name/
Signed-off-by: Dan Carpenter <dan.carpenter(a)linaro.org>
---
include/kunit/skbuff.h | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/include/kunit/skbuff.h b/include/kunit/skbuff.h
index 345e1e8f0312..07784694357c 100644
--- a/include/kunit/skbuff.h
+++ b/include/kunit/skbuff.h
@@ -20,8 +20,9 @@ static void kunit_action_kfree_skb(void *p)
* kunit_zalloc_skb() - Allocate and initialize a resource managed skb.
* @test: The test case to which the skb belongs
* @len: size to allocate
+ * @gfp: allocation flags
*
- * Allocate a new struct sk_buff with GFP_KERNEL, zero fill the give length
+ * Allocate a new struct sk_buff with gfp flags, zero fill the given length
* and add it as a resource to the kunit test for automatic cleanup.
*
* Returns: newly allocated SKB, or %NULL on error
--
2.45.2
This patch series includes some netns-related improvements and fixes for
RTNL and ip_tunnel, to make link creation more intuitive:
- Creating link in another net namespace doesn't conflict with link names
in current one.
- Add a flag in rtnl_ops, to avoid netns change when link-netns is present
if possible.
- When creating ip tunnel (e.g. GRE) in another netns, use current as
link-netns if not specified explicitly.
So that
# modprobe ip_gre netns_atomic=1
# ip link add netns ns1 link-netns ns2 tun0 type gre ...
will create tun0 in ns1, rather than create it in ns2 and move to ns1.
And don't conflict with another interface named "tun0" in current netns.
---
v2:
- Check NLM_F_EXCL to ensure only link creation is affected.
- Add self tests for link name/ifindex conflict and notifications
in different netns.
- Changes in dummy driver and ynl in order to add the test case.
v1:
link: https://lore.kernel.org/all/20241023023146.372653-1-shaw.leon@gmail.com/
Xiao Liang (8):
rtnetlink: Lookup device in target netns when creating link
rtnetlink: Add netns_atomic flag in rtnl_link_ops
net: ip_tunnel: Build flow in underlay net namespace
net: ip_tunnel: Add source netns support for newlink
net: ip_gre: Add netns_atomic module parameter
net: dummy: Set netns_atomic in rtnl ops for testing
tools/net/ynl: Add retry limit for async notification
selftests: net: Add two test cases for link netns
drivers/net/dummy.c | 1 +
include/net/ip_tunnels.h | 3 ++
include/net/rtnetlink.h | 3 ++
net/core/rtnetlink.c | 17 +++++--
net/ipv4/ip_gre.c | 15 +++++-
net/ipv4/ip_tunnel.c | 27 +++++++----
tools/net/ynl/lib/ynl.py | 7 ++-
tools/testing/selftests/net/Makefile | 1 +
tools/testing/selftests/net/netns-name.sh | 10 ++++
tools/testing/selftests/net/netns_atomic.py | 54 +++++++++++++++++++++
10 files changed, 121 insertions(+), 17 deletions(-)
create mode 100755 tools/testing/selftests/net/netns_atomic.py
--
2.47.0
Check number of paths by fib_info_num_path(),
and update_or_create_fnhe() for every path.
Problem is that pmtu is cached only for the oif
that has received icmp message "need to frag",
other oifs will still try to use "default" iface mtu.
An example topology showing the problem:
| host1
+---------+
| dummy0 | 10.179.20.18/32 mtu9000
+---------+
+-----------+----------------+
+---------+ +---------+
| ens17f0 | 10.179.2.141/31 | ens17f1 | 10.179.2.13/31
+---------+ +---------+
| (all here have mtu 9000) |
+------+ +------+
| ro1 | 10.179.2.140/31 | ro2 | 10.179.2.12/31
+------+ +------+
| |
---------+------------+-------------------+------
|
+-----+
| ro3 | 10.10.10.10 mtu1500
+-----+
|
========================================
some networks
========================================
|
+-----+
| eth0| 10.10.30.30 mtu9000
+-----+
| host2
host1 have enabled multipath and
sysctl net.ipv4.fib_multipath_hash_policy = 1:
default proto static src 10.179.20.18
nexthop via 10.179.2.12 dev ens17f1 weight 1
nexthop via 10.179.2.140 dev ens17f0 weight 1
When host1 tries to do pmtud from 10.179.20.18/32 to host2,
host1 receives at ens17f1 iface an icmp packet from ro3 that ro3 mtu=1500.
And host1 caches it in nexthop exceptions cache.
Problem is that it is cached only for the iface that has received icmp,
and there is no way that ro3 will send icmp msg to host1 via another path.
Host1 now have this routes to host2:
ip r g 10.10.30.30 sport 30000 dport 443
10.10.30.30 via 10.179.2.12 dev ens17f1 src 10.179.20.18 uid 0
cache expires 521sec mtu 1500
ip r g 10.10.30.30 sport 30033 dport 443
10.10.30.30 via 10.179.2.140 dev ens17f0 src 10.179.20.18 uid 0
cache
So when host1 tries again to reach host2 with mtu>1500,
if packet flow is lucky enough to be hashed with oif=ens17f1 its ok,
if oif=ens17f0 it blackholes and still gets icmp msgs from ro3 to ens17f1,
until lucky day when ro3 will send it through another flow to ens17f0.
Signed-off-by: Vladimir Vdovin <deliran(a)verdict.gg>
---
V10:
selfttests in pmtu.sh:
- remove unused fail var
- Flip error messages
V9:
selftests in pmtu.sh:
- remove useless return
- fix mtu var override
V8:
selftests in pmtu.sh:
- Change var names from "dummy" to "host"
- Fix errors caused by incorrect iface arguments pass
- Add src addr to setup_multipath_new
- Change multipath* func order
- Change route_get_dst_exception() && route_get_dst_pmtu_from_exception()
and arguments pass where they are used
as Ido suggested in https://lore.kernel.org/all/ZykH_fdcMBdFgXix@shredder/
V7:
selftest in pmtu.sh:
- add setup_multipath() with old and new nh tests
- add global "dummy_v4" addr variables
- add documentation
- remove dummy netdev usage in mp nh test
- remove useless sysctl opts in mp nh test
V6:
- make commit message cleaner
V5:
- make self test cleaner
V4:
- fix selftest, do route lookup before checking cached exceptions
V3:
- add selftest
- fix compile error
V2:
- fix fib_info_num_path parameter pass
---
net/ipv4/route.c | 13 ++++
tools/testing/selftests/net/pmtu.sh | 112 +++++++++++++++++++++++-----
2 files changed, 108 insertions(+), 17 deletions(-)
diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index 723ac9181558..652f603d29fe 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -1027,6 +1027,19 @@ static void __ip_rt_update_pmtu(struct rtable *rt, struct flowi4 *fl4, u32 mtu)
struct fib_nh_common *nhc;
fib_select_path(net, &res, fl4, NULL);
+#ifdef CONFIG_IP_ROUTE_MULTIPATH
+ if (fib_info_num_path(res.fi) > 1) {
+ int nhsel;
+
+ for (nhsel = 0; nhsel < fib_info_num_path(res.fi); nhsel++) {
+ nhc = fib_info_nhc(res.fi, nhsel);
+ update_or_create_fnhe(nhc, fl4->daddr, 0, mtu, lock,
+ jiffies + net->ipv4.ip_rt_mtu_expires);
+ }
+ rcu_read_unlock();
+ return;
+ }
+#endif /* CONFIG_IP_ROUTE_MULTIPATH */
nhc = FIB_RES_NHC(res);
update_or_create_fnhe(nhc, fl4->daddr, 0, mtu, lock,
jiffies + net->ipv4.ip_rt_mtu_expires);
diff --git a/tools/testing/selftests/net/pmtu.sh b/tools/testing/selftests/net/pmtu.sh
index 569bce8b6383..7ca2fbd971ae 100755
--- a/tools/testing/selftests/net/pmtu.sh
+++ b/tools/testing/selftests/net/pmtu.sh
@@ -197,6 +197,12 @@
#
# - pmtu_ipv6_route_change
# Same as above but with IPv6
+#
+# - pmtu_ipv4_mp_exceptions
+# Use the same topology as in pmtu_ipv4, but add routeable addresses
+# on host A and B on lo reachable via both routers. Host A and B
+# addresses have multipath routes to each other, b_r1 mtu = 1500.
+# Check that PMTU exceptions are created for both paths.
source lib.sh
source net_helper.sh
@@ -266,7 +272,8 @@ tests="
list_flush_ipv4_exception ipv4: list and flush cached exceptions 1
list_flush_ipv6_exception ipv6: list and flush cached exceptions 1
pmtu_ipv4_route_change ipv4: PMTU exception w/route replace 1
- pmtu_ipv6_route_change ipv6: PMTU exception w/route replace 1"
+ pmtu_ipv6_route_change ipv6: PMTU exception w/route replace 1
+ pmtu_ipv4_mp_exceptions ipv4: PMTU multipath nh exceptions 1"
# Addressing and routing for tests with routers: four network segments, with
# index SEGMENT between 1 and 4, a common prefix (PREFIX4 or PREFIX6) and an
@@ -343,6 +350,9 @@ tunnel6_a_addr="fd00:2::a"
tunnel6_b_addr="fd00:2::b"
tunnel6_mask="64"
+host4_a_addr="192.168.99.99"
+host4_b_addr="192.168.88.88"
+
dummy6_0_prefix="fc00:1000::"
dummy6_1_prefix="fc00:1001::"
dummy6_mask="64"
@@ -984,6 +994,52 @@ setup_ovs_bridge() {
run_cmd ip route add ${prefix6}:${b_r1}::1 via ${prefix6}:${a_r1}::2
}
+setup_multipath_new() {
+ # Set up host A with multipath routes to host B host4_b_addr
+ run_cmd ${ns_a} ip addr add ${host4_a_addr} dev lo
+ run_cmd ${ns_a} ip nexthop add id 401 via ${prefix4}.${a_r1}.2 dev veth_A-R1
+ run_cmd ${ns_a} ip nexthop add id 402 via ${prefix4}.${a_r2}.2 dev veth_A-R2
+ run_cmd ${ns_a} ip nexthop add id 403 group 401/402
+ run_cmd ${ns_a} ip route add ${host4_b_addr} src ${host4_a_addr} nhid 403
+
+ # Set up host B with multipath routes to host A host4_a_addr
+ run_cmd ${ns_b} ip addr add ${host4_b_addr} dev lo
+ run_cmd ${ns_b} ip nexthop add id 401 via ${prefix4}.${b_r1}.2 dev veth_B-R1
+ run_cmd ${ns_b} ip nexthop add id 402 via ${prefix4}.${b_r2}.2 dev veth_B-R2
+ run_cmd ${ns_b} ip nexthop add id 403 group 401/402
+ run_cmd ${ns_b} ip route add ${host4_a_addr} src ${host4_b_addr} nhid 403
+}
+
+setup_multipath_old() {
+ # Set up host A with multipath routes to host B host4_b_addr
+ run_cmd ${ns_a} ip addr add ${host4_a_addr} dev lo
+ run_cmd ${ns_a} ip route add ${host4_b_addr} \
+ src ${host4_a_addr} \
+ nexthop via ${prefix4}.${a_r1}.2 weight 1 \
+ nexthop via ${prefix4}.${a_r2}.2 weight 1
+
+ # Set up host B with multipath routes to host A host4_a_addr
+ run_cmd ${ns_b} ip addr add ${host4_b_addr} dev lo
+ run_cmd ${ns_b} ip route add ${host4_a_addr} \
+ src ${host4_b_addr} \
+ nexthop via ${prefix4}.${b_r1}.2 weight 1 \
+ nexthop via ${prefix4}.${b_r2}.2 weight 1
+}
+
+setup_multipath() {
+ if [ "$USE_NH" = "yes" ]; then
+ setup_multipath_new
+ else
+ setup_multipath_old
+ fi
+
+ # Set up routers with routes to dummies
+ run_cmd ${ns_r1} ip route add ${host4_a_addr} via ${prefix4}.${a_r1}.1
+ run_cmd ${ns_r2} ip route add ${host4_a_addr} via ${prefix4}.${a_r2}.1
+ run_cmd ${ns_r1} ip route add ${host4_b_addr} via ${prefix4}.${b_r1}.1
+ run_cmd ${ns_r2} ip route add ${host4_b_addr} via ${prefix4}.${b_r2}.1
+}
+
setup() {
[ "$(id -u)" -ne 0 ] && echo " need to run as root" && return $ksft_skip
@@ -1076,23 +1132,15 @@ link_get_mtu() {
}
route_get_dst_exception() {
- ns_cmd="${1}"
- dst="${2}"
- dsfield="${3}"
+ ns_cmd="${1}"; shift
- if [ -z "${dsfield}" ]; then
- dsfield=0
- fi
-
- ${ns_cmd} ip route get "${dst}" dsfield "${dsfield}"
+ ${ns_cmd} ip route get "$@"
}
route_get_dst_pmtu_from_exception() {
- ns_cmd="${1}"
- dst="${2}"
- dsfield="${3}"
+ ns_cmd="${1}"; shift
- mtu_parse "$(route_get_dst_exception "${ns_cmd}" "${dst}" "${dsfield}")"
+ mtu_parse "$(route_get_dst_exception "${ns_cmd}" "$@")"
}
check_pmtu_value() {
@@ -1235,10 +1283,10 @@ test_pmtu_ipv4_dscp_icmp_exception() {
run_cmd "${ns_a}" ping -q -M want -Q "${dsfield}" -c 1 -w 1 -s "${len}" "${dst2}"
# Check that exceptions have been created with the correct PMTU
- pmtu_1="$(route_get_dst_pmtu_from_exception "${ns_a}" "${dst1}" "${policy_mark}")"
+ pmtu_1="$(route_get_dst_pmtu_from_exception "${ns_a}" "${dst1}" dsfield "${policy_mark}")"
check_pmtu_value "1400" "${pmtu_1}" "exceeding MTU" || return 1
- pmtu_2="$(route_get_dst_pmtu_from_exception "${ns_a}" "${dst2}" "${policy_mark}")"
+ pmtu_2="$(route_get_dst_pmtu_from_exception "${ns_a}" "${dst2}" dsfield "${policy_mark}")"
check_pmtu_value "1500" "${pmtu_2}" "exceeding MTU" || return 1
}
@@ -1285,9 +1333,9 @@ test_pmtu_ipv4_dscp_udp_exception() {
UDP:"${dst2}":50000,tos="${dsfield}"
# Check that exceptions have been created with the correct PMTU
- pmtu_1="$(route_get_dst_pmtu_from_exception "${ns_a}" "${dst1}" "${policy_mark}")"
+ pmtu_1="$(route_get_dst_pmtu_from_exception "${ns_a}" "${dst1}" dsfield "${policy_mark}")"
check_pmtu_value "1400" "${pmtu_1}" "exceeding MTU" || return 1
- pmtu_2="$(route_get_dst_pmtu_from_exception "${ns_a}" "${dst2}" "${policy_mark}")"
+ pmtu_2="$(route_get_dst_pmtu_from_exception "${ns_a}" "${dst2}" dsfield "${policy_mark}")"
check_pmtu_value "1500" "${pmtu_2}" "exceeding MTU" || return 1
}
@@ -2329,6 +2377,36 @@ test_pmtu_ipv6_route_change() {
test_pmtu_ipvX_route_change 6
}
+test_pmtu_ipv4_mp_exceptions() {
+ setup namespaces routing multipath || return $ksft_skip
+
+ trace "${ns_a}" veth_A-R1 "${ns_r1}" veth_R1-A \
+ "${ns_r1}" veth_R1-B "${ns_b}" veth_B-R1 \
+ "${ns_a}" veth_A-R2 "${ns_r2}" veth_R2-A \
+ "${ns_r2}" veth_R2-B "${ns_b}" veth_B-R2
+
+ # Set up initial MTU values
+ mtu "${ns_a}" veth_A-R1 2000
+ mtu "${ns_r1}" veth_R1-A 2000
+ mtu "${ns_r1}" veth_R1-B 1500
+ mtu "${ns_b}" veth_B-R1 1500
+
+ mtu "${ns_a}" veth_A-R2 2000
+ mtu "${ns_r2}" veth_R2-A 2000
+ mtu "${ns_r2}" veth_R2-B 1500
+ mtu "${ns_b}" veth_B-R2 1500
+
+ # Ping and expect two nexthop exceptions for two routes
+ run_cmd ${ns_a} ping -q -M want -i 0.1 -c 1 -s 1800 "${host4_b_addr}"
+
+ # Check that exceptions have been created with the correct PMTU
+ pmtu_a_R1="$(route_get_dst_pmtu_from_exception "${ns_a}" "${host4_b_addr}" oif veth_A-R1)"
+ pmtu_a_R2="$(route_get_dst_pmtu_from_exception "${ns_a}" "${host4_b_addr}" oif veth_A-R2)"
+
+ check_pmtu_value "1500" "${pmtu_a_R1}" "exceeding MTU (veth_A-R1)" || return 1
+ check_pmtu_value "1500" "${pmtu_a_R2}" "exceeding MTU (veth_A-R2)" || return 1
+}
+
usage() {
echo
echo "$0 [OPTIONS] [TEST]..."
base-commit: 66600fac7a984dea4ae095411f644770b2561ede
--
2.43.5
The netconsole selftest relies on the availability of the netdevsim module.
To ensure the test can run correctly, we need to check if the netdevsim
module is either loaded or built-in before proceeding.
Update the netconsole selftest to check for the existence of
the /sys/bus/netdevsim/new_device file before running the test. If the
file is not found, the test is skipped with an explanation that the
CONFIG_NETDEVSIM kernel config option may not be enabled.
Signed-off-by: Breno Leitao <leitao(a)debian.org>
---
tools/testing/selftests/drivers/net/netcons_basic.sh | 7 ++++++-
1 file changed, 6 insertions(+), 1 deletion(-)
diff --git a/tools/testing/selftests/drivers/net/netcons_basic.sh b/tools/testing/selftests/drivers/net/netcons_basic.sh
index 182eb1a97e59f3b4c9eea0e5b9e64a7fff656e2b..b175f4d966e5056ddb62e335f212c03e55f50fb0 100755
--- a/tools/testing/selftests/drivers/net/netcons_basic.sh
+++ b/tools/testing/selftests/drivers/net/netcons_basic.sh
@@ -39,6 +39,7 @@ NAMESPACE=""
# IDs for netdevsim
NSIM_DEV_1_ID=$((256 + RANDOM % 256))
NSIM_DEV_2_ID=$((512 + RANDOM % 256))
+NSIM_DEV_SYS_NEW="/sys/bus/netdevsim/new_device"
# Used to create and delete namespaces
source "${SCRIPTDIR}"/../../net/lib.sh
@@ -46,7 +47,6 @@ source "${SCRIPTDIR}"/../../net/net_helper.sh
# Create netdevsim interfaces
create_ifaces() {
- local NSIM_DEV_SYS_NEW=/sys/bus/netdevsim/new_device
echo "$NSIM_DEV_2_ID" > "$NSIM_DEV_SYS_NEW"
echo "$NSIM_DEV_1_ID" > "$NSIM_DEV_SYS_NEW"
@@ -212,6 +212,11 @@ function check_for_dependencies() {
exit "${ksft_skip}"
fi
+ if [ ! -f "${NSIM_DEV_SYS_NEW}" ]; then
+ echo "SKIP: file ${NSIM_DEV_SYS_NEW} does not exist. Check if CONFIG_NETDEVSIM is enabled" >&2
+ exit "${ksft_skip}"
+ fi
+
if [ ! -d "${NETCONS_CONFIGFS}" ]; then
echo "SKIP: directory ${NETCONS_CONFIGFS} does not exist. Check if NETCONSOLE_DYNAMIC is enabled" >&2
exit "${ksft_skip}"
---
base-commit: 4861333b42178fa3d8fd1bb4e2cfb2fedc968dba
change-id: 20241108-netcon_selftest_deps-83f26456f095
Best regards,
--
Breno Leitao <leitao(a)debian.org>
Greetings:
Welcome to v9, see changelog below.
This revision addresses feedback Willem gave on the selftests. No
functional or code changes to the implementation were made and
performance tests were not re-run.
This series introduces a new mechanism, IRQ suspension, which allows
network applications using epoll to mask IRQs during periods of high
traffic while also reducing tail latency (compared to existing
mechanisms, see below) during periods of low traffic. In doing so, this
balances CPU consumption with network processing efficiency.
Martin Karsten (CC'd) and I have been collaborating on this series for
several months and have appreciated the feedback from the community on
our RFC [1]. We've updated the cover letter and kernel documentation in
an attempt to more clearly explain how this mechanism works, how
applications can use it, and how it compares to existing mechanisms in
the kernel.
I briefly mentioned this idea at netdev conf 2024 (for those who were
there) and Martin described this idea in an earlier paper presented at
Sigmetrics 2024 [2].
~ The short explanation (TL;DR)
We propose adding a new napi config parameter: irq_suspend_timeout to
help balance CPU usage and network processing efficiency when using IRQ
deferral and napi busy poll.
If this parameter is set to a non-zero value *and* a user application
has enabled preferred busy poll on a busy poll context (via the
EPIOCSPARAMS ioctl introduced in commit 18e2bf0edf4d ("eventpoll: Add
epoll ioctl for epoll_params")), then application calls to epoll_wait
for that context will cause device IRQs and softirq processing to be
suspended as long as epoll_wait successfully retrieves data from the
NAPI. Each time data is retrieved, the irq_suspend_timeout is deferred.
If/when network traffic subsides and epoll_wait returns no data, IRQ
suspension is immediately reverted back to the existing
napi_defer_hard_irqs and gro_flush_timeout mechanism which was
introduced in commit 6f8b12d661d0 ("net: napi: add hard irqs deferral
feature")).
The irq_suspend_timeout serves as a safety mechanism. If userland takes
a long time processing data, irq_suspend_timeout will fire and restart
normal NAPI processing.
For a more in depth explanation, please continue reading.
~ Comparison with existing mechanisms
Interrupt mitigation can be accomplished in napi software, by setting
napi_defer_hard_irqs and gro_flush_timeout, or via interrupt coalescing
in the NIC. This can be quite efficient, but in both cases, a fixed
timeout (or packet count) needs to be configured. However, a fixed
timeout cannot effectively support both low- and high-load situations:
At low load, an application typically processes a few requests and then
waits to receive more input data. In this scenario, a large timeout will
cause unnecessary latency.
At high load, an application typically processes many requests before
being ready to receive more input data. In this case, a small timeout
will likely fire prematurely and trigger irq/softirq processing, which
interferes with the application's execution. This causes overhead, most
likely due to cache contention.
While NICs attempt to provide adaptive interrupt coalescing schemes,
these cannot properly take into account application-level processing.
An alternative packet delivery mechanism is busy-polling, which results
in perfect alignment of application processing and network polling. It
delivers optimal performance (throughput and latency), but results in
100% cpu utilization and is thus inefficient for below-capacity
workloads.
We propose to add a new packet delivery mode that properly alternates
between busy polling and interrupt-based delivery depending on busy and
idle periods of the application. During a busy period, the system
operates in busy-polling mode, which avoids interference. During an idle
period, the system falls back to interrupt deferral, but with a small
timeout to avoid excessive latencies. This delivery mode can also be
viewed as an extension of basic interrupt deferral, but alternating
between a small and a very large timeout.
This delivery mode is efficient, because it avoids softirq execution
interfering with application processing during busy periods. It can be
used with blocking epoll_wait to conserve cpu cycles during idle
periods. The effect of alternating between busy and idle periods is that
performance (throughput and latency) is very close to full busy polling,
while cpu utilization is lower and very close to interrupt mitigation.
~ Usage details
IRQ suspension is introduced via a per-NAPI configuration parameter that
controls the maximum time that IRQs can be suspended.
Here's how it is intended to work:
- The user application (or system administrator) uses the netdev-genl
netlink interface to set the pre-existing napi_defer_hard_irqs and
gro_flush_timeout NAPI config parameters to enable IRQ deferral.
- The user application (or system administrator) sets the proposed
irq_suspend_timeout parameter via the netdev-genl netlink interface
to a larger value than gro_flush_timeout to enable IRQ suspension.
- The user application issues the existing epoll ioctl to set the
prefer_busy_poll flag on the epoll context.
- The user application then calls epoll_wait to busy poll for network
events, as it normally would.
- If epoll_wait returns events to userland, IRQs are suspended for the
duration of irq_suspend_timeout.
- If epoll_wait finds no events and the thread is about to go to
sleep, IRQ handling using napi_defer_hard_irqs and gro_flush_timeout
is resumed.
As long as epoll_wait is retrieving events, IRQs (and softirq
processing) for the NAPI being polled remain disabled. When network
traffic reduces, eventually a busy poll loop in the kernel will retrieve
no data. When this occurs, regular IRQ deferral using gro_flush_timeout
for the polled NAPI is re-enabled.
Unless IRQ suspension is continued by subsequent calls to epoll_wait, it
automatically times out after the irq_suspend_timeout timer expires.
Regular deferral is also immediately re-enabled when the epoll context
is destroyed.
~ Usage scenario
The target scenario for IRQ suspension as packet delivery mode is a
system that runs a dominant application with substantial network I/O.
The target application can be configured to receive input data up to a
certain batch size (via epoll_wait maxevents parameter) and this batch
size determines the worst-case latency that application requests might
experience. Because packet delivery is suspended during the target
application's processing, the batch size also determines the worst-case
latency of concurrent applications using the same RX queue(s).
gro_flush_timeout should be set as small as possible, but large enough to
make sure that a single request is likely not being interfered with.
irq_suspend_timeout is largely a safety mechanism against misbehaving
applications. It should be set large enough to cover the processing of an
entire application batch, i.e., the factor between gro_flush_timeout and
irq_suspend_timeout should roughly correspond to the maximum batch size
that the target application would process in one go.
~ Important call out in the implementation
- Enabling per epoll-context preferred busy poll will now effectively
lead to a nonblocking iteration through napi_busy_loop, even when
busy_poll_usecs is 0. See patch 4.
~ Benchmark configs & descriptions
The changes were benchmarked with memcached [3] using the benchmarking
tool mutilate [4].
To facilitate benchmarking, a small patch [5] was applied to memcached
1.6.29 to allow setting per-epoll context preferred busy poll and other
settings via environment variables. Another small patch [6] was applied
to libevent to enable full busy-polling.
Multiple scenarios were benchmarked as described below and the scripts
used for producing these results can be found on github [7] (note: all
scenarios use NAPI-based traffic splitting via SO_INCOMING_ID by passing
-N to memcached):
- base:
- no other options enabled
- deferX:
- set defer_hard_irqs to 100
- set gro_flush_timeout to X,000
- napibusy:
- set defer_hard_irqs to 100
- set gro_flush_timeout to 200,000
- enable busy poll via the existing ioctl (busy_poll_usecs = 64,
busy_poll_budget = 64, prefer_busy_poll = true)
- fullbusy:
- set defer_hard_irqs to 100
- set gro_flush_timeout to 5,000,000
- enable busy poll via the existing ioctl (busy_poll_usecs = 1000,
busy_poll_budget = 64, prefer_busy_poll = true)
- change memcached's nonblocking epoll_wait invocation (via
libevent) to using a 1 ms timeout
- suspend0:
- set defer_hard_irqs to 0
- set gro_flush_timeout to 0
- set irq_suspend_timeout to 20,000,000
- enable busy poll via the existing ioctl (busy_poll_usecs = 0,
busy_poll_budget = 64, prefer_busy_poll = true)
- suspendX:
- set defer_hard_irqs to 100
- set gro_flush_timeout to X,000
- set irq_suspend_timeout to 20,000,000
- enable busy poll via the existing ioctl (busy_poll_usecs = 0,
busy_poll_budget = 64, prefer_busy_poll = true)
~ Benchmark results
Tested on:
Single socket AMD EPYC 7662 64-Core Processor
Hyperthreading disabled
4 NUMA Zones (NPS=4)
16 CPUs per NUMA zone (64 cores total)
2 x Dual port 100gbps Mellanox Technologies ConnectX-5 Ex EN NIC
The test machine is configured such that a single interface has 8 RX
queues. The queues' IRQs and memcached are pinned to CPUs that are
NUMA-local to the interface which is under test. The NIC's interrupt
coalescing configuration is left at boot-time defaults.
Results:
Results are shown below. The mechanism added by this series is
represented by the 'suspend' cases. Data presented shows a summary over
nearly 10 runs of each test case [8] using the scripts on github [7].
For latency, the median is shown. For throughput and CPU utilization,
the average is shown.
The results also include cycles-per-query (cpq) and
instruction-per-query (ipq) metrics, following the methodology proposed
in [2], to augment the CPU utilization numbers, which could be skewed
due to frequency scaling. We find that this does not appear to be the
case as CPU utilization and low-level metrics show similar trends.
These results were captured using the scripts on github [7] to
illustrate how this approach compares with other pre-existing
mechanisms. This data is not to be interpreted as scientific data
captured in a fully isolated lab setting, but instead as best effort,
illustrative information comparing and contrasting tradeoffs.
The absolute QPS results shift between submissions, but the
relative differences are equivalent. As patches are rebased,
several factors likely influence overall performance.
Compare:
- Throughput (MAX) and latencies of base vs suspend.
- CPU usage of napibusy and fullbusy during lower load (200K, 400K for
example) vs suspend.
- Latency of the defer variants vs suspend as timeout and load
increases.
- suspend0, which sets defer_hard_irqs and gro_flush_timeout to 0, has
nearly the same performance as the base case (this is FAQ item #1).
The overall takeaway is that the suspend variants provide a superior
combination of high throughput, low latency, and low cpu utilization
compared to all other variants. Each of the suspend variants works very
well, but some fine-tuning between latency and cpu utilization is still
possible by tuning the small timeout (gro_flush_timeout).
Note: we've reorganized the results to make comparison among testcases
with the same load easier.
testcase load qps avglat 95%lat 99%lat cpu cpq ipq
base 200K 199946 112 239 416 26 12973 11343
defer10 200K 199971 54 124 142 29 19412 17460
defer20 200K 199986 60 130 153 26 15644 14095
defer50 200K 200025 79 144 182 23 12122 11632
defer200 200K 199999 164 254 309 19 8923 9635
fullbusy 200K 199998 46 118 133 100 43658 23133
napibusy 200K 199983 100 237 277 56 24840 24716
suspend0 200K 200020 105 249 432 30 14264 11796
suspend10 200K 199950 53 123 141 32 19518 16903
suspend20 200K 200037 58 126 151 30 16426 14736
suspend50 200K 199961 73 136 177 26 13310 12633
suspend200 200K 199998 149 251 306 21 9566 10203
testcase load qps avglat 95%lat 99%lat cpu cpq ipq
base 400K 400014 139 269 707 41 9476 9343
defer10 400K 400016 59 133 166 53 13991 12989
defer20 400K 399952 67 140 172 47 12063 11644
defer50 400K 400007 87 162 198 39 9384 9880
defer200 400K 399979 181 274 330 31 7089 8430
fullbusy 400K 399987 50 123 156 100 21827 16037
napibusy 400K 400014 76 222 272 83 18185 16529
suspend0 400K 400015 127 350 776 47 10699 9603
suspend10 400K 400023 57 129 164 54 13758 13178
suspend20 400K 400043 62 135 169 49 12071 11826
suspend50 400K 400071 76 149 186 42 10011 10301
suspend200 400K 399961 154 269 327 34 7827 8774
testcase load qps avglat 95%lat 99%lat cpu cpq ipq
base 600K 599951 149 266 574 61 9265 8876
defer10 600K 600006 71 147 203 76 11866 10936
defer20 600K 600123 76 152 203 66 10430 10342
defer50 600K 600162 95 172 217 54 8526 9142
defer200 600K 599942 200 301 357 46 6977 8212
fullbusy 600K 599990 55 127 177 100 14551 13983
napibusy 600K 600035 63 160 250 96 13937 14140
suspend0 600K 599903 127 320 732 68 10166 8963
suspend10 600K 599908 63 137 192 69 10902 11100
suspend20 600K 599961 66 141 194 65 9976 10370
suspend50 600K 599973 80 159 204 57 8678 9381
suspend200 600K 600010 157 277 346 48 7133 8381
testcase load qps avglat 95%lat 99%lat cpu cpq ipq
base 800K 800039 181 300 536 87 9585 8304
defer10 800K 800038 181 530 939 96 10564 8970
defer20 800K 800029 112 225 329 90 10056 8935
defer50 800K 799999 120 208 296 82 9234 8562
defer200 800K 800066 227 338 401 63 7117 8129
fullbusy 800K 800040 61 134 190 100 10913 12608
napibusy 800K 799944 64 141 214 99 10828 12588
suspend0 800K 799911 126 248 509 85 9346 8498
suspend10 800K 800006 69 143 200 83 9410 9845
suspend20 800K 800120 74 150 207 78 8786 9454
suspend50 800K 799989 87 168 224 71 7946 8833
suspend200 800K 799987 160 292 357 62 6923 8229
testcase load qps avglat 95%lat 99%lat cpu cpq ipq
base 1000K 906879 4079 5751 6216 98 9496 7904
defer10 1000K 860849 3643 6274 6730 99 10040 8676
defer20 1000K 896063 3298 5840 6349 98 9620 8237
defer50 1000K 919782 2962 5513 5807 97 9284 7951
defer200 1000K 970941 3059 5348 5984 95 8593 7959
fullbusy 1000K 999950 70 150 207 100 8732 10777
napibusy 1000K 999996 78 154 223 100 8722 10656
suspend0 1000K 949706 2666 5770 6660 99 9071 8046
suspend10 1000K 1000024 80 160 220 92 8137 9035
suspend20 1000K 1000059 83 165 226 89 7850 8804
suspend50 1000K 999955 95 180 240 84 7411 8459
suspend200 1000K 999914 163 299 366 77 6833 8078
testcase load qps avglat 95%lat 99%lat cpu cpq ipq
base MAX 1037654 4184 5453 5810 100 8411 7938
defer10 MAX 905607 4840 6151 6380 100 9639 8431
defer20 MAX 986463 4455 5594 5796 100 8848 8110
defer50 MAX 1077030 4000 5073 5299 100 8104 7920
defer200 MAX 1040728 4152 5385 5765 100 8379 7849
fullbusy MAX 1247536 3518 3935 3984 100 6998 7930
napibusy MAX 1136310 3799 7756 9964 100 7670 7877
suspend0 MAX 1057509 4132 5724 6185 100 8253 7918
suspend10 MAX 1215147 3580 3957 4041 100 7185 7944
suspend20 MAX 1216469 3576 3953 3988 100 7175 7950
suspend50 MAX 1215871 3577 3961 4075 100 7181 7949
suspend200 MAX 1216882 3556 3951 3988 100 7175 7955
~ FAQ
- Why is a new parameter needed? Does irq_suspend_timeout override
gro_flush_timeout?
Using the suspend mechanism causes the system to alternate between
polling mode and irq-driven packet delivery. During busy periods,
irq_suspend_timeout overrides gro_flush_timeout and keeps the system
busy polling, but when epoll finds no events, the setting of
gro_flush_timeout and napi_defer_hard_irqs determine the next step.
There are essentially three possible loops for network processing and
packet delivery:
1) hardirq -> softirq -> napi poll; basic interrupt delivery
2) timer -> softirq -> napi poll; deferred irq processing
3) epoll -> busy-poll -> napi poll; busy looping
Loop 2 can take control from Loop 1, if gro_flush_timeout and
napi_defer_hard_irqs are set.
If gro_flush_timeout and napi_defer_hard_irqs are set, Loops 2 and
3 "wrestle" with each other for control. During busy periods,
irq_suspend_timeout is used as timer in Loop 2, which essentially
tilts this in favour of Loop 3.
If gro_flush_timeout and napi_defer_hard_irqs are not set, Loop 3
cannot take control from Loop 1.
Therefore, setting gro_flush_timeout and napi_defer_hard_irqs is the
recommended usage, because otherwise setting irq_suspend_timeout
might not have any discernible effect.
This is shown in the results above: compare suspend0 with the base
case. Note that the lack of napi_defer_hard_irqs and
gro_flush_timeout produce similar results for both, which encourages
the use of napi_defer_hard_irqs and gro_flush_timeout in addition to
irq_suspend_timeout.
- Can the new timeout value be threaded through the new epoll ioctl ?
It is possible, but presents challenges for userspace. User
applications must ensure that the file descriptors added to epoll
contexts have the same NAPI ID to support busy polling.
An epoll context is not permanently tied to any particular NAPI ID.
So, a user application could decide to clear the file descriptors
from the context and add a new set of file descriptors with a
different NAPI ID to the context. Busy polling would work as
expected, but the meaning of the suspend timeout becomes ambiguous
because IRQs are not inherently associated with epoll contexts, but
rather with the NAPI. The user program would need to reissue the
ioctl to set the irq_suspend_timeout, but the napi_defer_hard_irqs
and gro_flush_timeout settings would come from the NAPI's
napi_config (which are set either by sysfs or by netlink). Such an
interface seems awkard to use from a user perspective.
Further, IRQs are related to NAPIs, which is why they are stored in
the napi_config space. Putting the irq_suspend_timeout in
the epoll context while other IRQ deferral mechanisms remain in the
NAPI's napi_config space seems like an odd design choice.
We've opted to keep all of the IRQ deferral parameters together and
place the irq_suspend_timeout in napi_config. This has nice benefits
for userspace: if a user app were to remove all file descriptors
from an epoll context and add new file descriptors with a new NAPI ID,
the correct suspend timeout for that NAPI ID would be used automatically
without the user application needing to do anything (like re-issuing an
ioctl, for example). All IRQ deferral related parameters are in one
place and can all be set the same way: with netlink.
- Can irq suspend be built by combining NIC coalescing and
gro_flush_timeout ?
No. The problem is that the long timeout must engage if and only if
prefer-busy is active.
When using NIC coalescing for the short timeout (without
napi_defer_hard_irqs/gro_flush_timeout), an interrupt after an idle
period will trigger softirq, which will run napi polling. At this
point, prefer-busy is not active, so NIC interrupts would be
re-enabled. Then it is not possible for the longer timeout to
interject to switch control back to polling. In other words, only by
using the software timer for the short timeout, it is possible to
extend the timeout without having to reprogram the NIC timer or
reach down directly and disable interrupts.
Using gro_flush_timeout for the long timeout also has problems, for
the same underlying reason. In the current napi implementation,
gro_flush_timeout is not tied to prefer-busy. We'd either have to
change that and in the process modify the existing deferral
mechanism, or introduce a state variable to determine whether
gro_flush_timeout is used as long timeout for irq suspend or whether
it is used for its default purpose. In an earlier version, we did
try something similar to the latter and made it work, but it ends up
being a lot more convoluted than our current proposal.
- Isn't it already possible to combine busy looping with irq deferral?
Yes, in fact enabling irq deferral via napi_defer_hard_irqs and
gro_flush_timeout is a precondition for prefer_busy_poll to have an
effect. If the application also uses a tight busy loop with
essentially nonblocking epoll_wait (accomplished with a very short
timeout parameter), this is the fullbusy case shown in the results.
An application using blocking epoll_wait is shown as the napibusy
case in the results. It's a hybrid approach that provides limited
latency benefits compared to the base case and plain irq deferral,
but not as good as fullbusy or suspend.
~ Special thanks
Several people were involved in earlier stages of the development of this
mechanism whom we'd like to thank:
- Peter Cai (CC'd), for the initial kernel patch and his contributions
to the paper.
- Mohammadamin Shafie (CC'd), for testing various versions of the kernel
patch and providing helpful feedback.
Thanks,
Martin and Joe
[1]: https://lore.kernel.org/netdev/20240812125717.413108-1-jdamato@fastly.com/
[2]: https://doi.org/10.1145/3626780
[3]: https://github.com/memcached/memcached/blob/master/doc/napi_ids.txt
[4]: https://github.com/leverich/mutilate
[5]: https://raw.githubusercontent.com/martinkarsten/irqsuspend/main/patches/mem…
[6]: https://raw.githubusercontent.com/martinkarsten/irqsuspend/main/patches/lib…
[7]: https://github.com/martinkarsten/irqsuspend
[8]: https://github.com/martinkarsten/irqsuspend/tree/main/results
v9:
- Addresses Willem's feedback on the selftests in patch 5 by fixing
the SPDX-License-Identifier, moving constants into variables in the
test script, reducing code duplication, shortening long lines, and
renaming variables to be more reader friendly. In the C test file,
added a comment explaining the if def blob and changed a few types
for strtoul.
v8: https://lore.kernel.org/netdev/20241108045337.292905-1-jdamato@fastly.com/
- Update patch 2 to drop the exports, as requested by Jakub.
v7: https://lore.kernel.org/netdev/20241108023912.98416-1-jdamato@fastly.com/
- Jakub noted that patch 2 adds unnecessary complexity by checking the
suspend timeout in the NAPI loop. This makes the code more
complicated and difficult to reason about. He's right; we've dropped
patch 2 which simplifies this series.
- Updated the cover letter with a full re-run of all test cases.
- Updated FAQ #2.
v6: https://lore.kernel.org/netdev/20241104215542.215919-1-jdamato@fastly.com/
- Updated the cover letter with a full re-run of all test cases,
including a new case suspend0, as requested by Sridhar previously.
- Updated the kernel documentation in patch 7 as suggested by Bagas
Sanjaya, which improved the htmldoc output.
v5: https://lore.kernel.org/netdev/20241103052421.518856-1-jdamato@fastly.com/
- Adjusted patch 5 to only suspend IRQs when ep_send_events returns a
positive return value. This issue was pointed out by Hillf Danton.
- Updated the commit message of patch 6 which still mentioned netcat,
despite the code being updated in v4 to replace it with socat and fixed
misspelling of netdevsim.
- Fixed a minor typo in patch 7 and removed an unnecessary paragraph.
- Added Sridhar Samudrala's Reviewed-by to patch 1-5 and 7.
v4: https://lore.kernel.org/netdev/20241102005214.32443-1-jdamato@fastly.com/
- Added a new FAQ item to cover letter.
- Updated patch 6 to use socat instead of nc in busy_poll_test.sh and
updated busy_poller.c to use netlink directly to configure napi
params.
- Updated the kernel documentation in patch 7 to include more details.
- Dropped Stanislav's Acked-by and Bagas' Reviewed-by from patch 7
since the documentation was updated.
v3: https://lore.kernel.org/netdev/20241101004846.32532-1-jdamato@fastly.com/
- Added Stanislav Fomichev's Acked-by to every patch except the newly
added selftest.
- Added Bagas Sanjaya's Reviewed-by to the documentation patch.
- Fixed the commit message of patch 2 to remove a reference to the now
non-existent sysfs setting.
- Added a self test which tests both "regular" busy poll and busy poll
with suspend enabled. This was added as patch 6 as requested by
Paolo. netdevsim was chosen instead of veth due to netdevsim's
pre-existing support for netdev-genl. See the commit message of
patch 6 for more details.
v2: https://lore.kernel.org/bpf/20241021015311.95468-1-jdamato@fastly.com/
- Cover letter updated, including a re-run of test data.
- Patch 1 rewritten to use netdev-genl instead of sysfs.
- Patch 3 updated with a comment added to napi_resume_irqs.
- Patch 4 rebased to apply now that commit b9ca079dd6b0 ("eventpoll:
Annotate data-race of busy_poll_usecs") has been picked up from VFS.
- Patch 6 updated the kernel documentation.
rfc -> v1:
- Cover letter updated to include more details.
- Patch 1 updated to remove the documentation added. This was moved to
patch 6 with the rest of the docs (see below).
- Patch 5 updated to fix an error uncovered by the kernel build robot.
See patch 5's changelog for more details.
- Patch 6 added which updates kernel documentation.
Joe Damato (2):
selftests: net: Add busy_poll_test
docs: networking: Describe irq suspension
Martin Karsten (4):
net: Add napi_struct parameter irq_suspend_timeout
net: Add control functions for irq suspension
eventpoll: Trigger napi_busy_loop, if prefer_busy_poll is set
eventpoll: Control irq suspension for prefer_busy_poll
Documentation/netlink/specs/netdev.yaml | 7 +
Documentation/networking/napi.rst | 170 ++++++++-
fs/eventpoll.c | 36 +-
include/linux/netdevice.h | 2 +
include/net/busy_poll.h | 3 +
include/uapi/linux/netdev.h | 1 +
net/core/dev.c | 39 ++
net/core/dev.h | 25 ++
net/core/netdev-genl-gen.c | 5 +-
net/core/netdev-genl.c | 12 +
tools/include/uapi/linux/netdev.h | 1 +
tools/testing/selftests/net/.gitignore | 1 +
tools/testing/selftests/net/Makefile | 3 +-
tools/testing/selftests/net/busy_poll_test.sh | 165 +++++++++
tools/testing/selftests/net/busy_poller.c | 346 ++++++++++++++++++
15 files changed, 809 insertions(+), 7 deletions(-)
create mode 100755 tools/testing/selftests/net/busy_poll_test.sh
create mode 100644 tools/testing/selftests/net/busy_poller.c
base-commit: dc7c381bb8649e3701ed64f6c3e55316675904d7
--
2.25.1
The goal of the series is to simplify and make it possible to use
ncdevmem in an automated way from the ksft python wrapper.
ncdevmem is slowly mutated into a state where it uses stdout
to print the payload and the python wrapper is added to
make sure the arrived payload matches the expected one.
v8:
- move error() calls into enable_reuseaddr() (Joe)
- bail out when number of queues is 1 (Joe)
- use socat instead of nc (Joe)
- fix warning about string truncation buf[256]->buf[40] (Jakub)
v7:
- fix validation (Mina)
- add support for working with non ::ffff-prefixed addresses (Mina)
v6:
- fix compilation issue in 'Unify error handling' patch (Jakub)
v5:
- properly handle errors from inet_pton() and socket() (Paolo)
- remove unneeded import from python selftest (Paolo)
v4:
- keep usage example with validation (Mina)
- fix compilation issue in one patch (s/start_queues/start_queue/)
v3:
- keep and refine the comment about ncdevmem invocation (Mina)
- add the comment about not enforcing exit status for ntuple reset (Mina)
- make configure_headersplit more robust (Mina)
- use num_queues/2 in selftest and let the users override it (Mina)
- remove memory_provider.memcpy_to_device (Mina)
- keep ksft as is (don't use -v validate flags): we are gonna
need a --debug-disable flag to make it less chatty; otherwise
it times out when sending too much data; so leaving it as
a separate follow up
v2:
- don't remove validation (Mina)
- keep 5-tuple flow steering but use it only when -c is provided (Mina)
- remove separate flag for probing (Mina)
- move ncdevmem under drivers/net/hw, not drivers/net (Jakub)
Cc: Mina Almasry <almasrymina(a)google.com>
Stanislav Fomichev (12):
selftests: ncdevmem: Redirect all non-payload output to stderr
selftests: ncdevmem: Separate out dmabuf provider
selftests: ncdevmem: Unify error handling
selftests: ncdevmem: Make client_ip optional
selftests: ncdevmem: Remove default arguments
selftests: ncdevmem: Switch to AF_INET6
selftests: ncdevmem: Properly reset flow steering
selftests: ncdevmem: Use YNL to enable TCP header split
selftests: ncdevmem: Remove hard-coded queue numbers
selftests: ncdevmem: Run selftest when none of the -s or -c has been
provided
selftests: ncdevmem: Move ncdevmem under drivers/net/hw
selftests: ncdevmem: Add automated test
.../selftests/drivers/net/hw/.gitignore | 1 +
.../testing/selftests/drivers/net/hw/Makefile | 9 +
.../selftests/drivers/net/hw/devmem.py | 45 +
.../selftests/drivers/net/hw/ncdevmem.c | 789 ++++++++++++++++++
tools/testing/selftests/net/.gitignore | 1 -
tools/testing/selftests/net/Makefile | 8 -
tools/testing/selftests/net/ncdevmem.c | 570 -------------
7 files changed, 844 insertions(+), 579 deletions(-)
create mode 100644 tools/testing/selftests/drivers/net/hw/.gitignore
create mode 100755 tools/testing/selftests/drivers/net/hw/devmem.py
create mode 100644 tools/testing/selftests/drivers/net/hw/ncdevmem.c
delete mode 100644 tools/testing/selftests/net/ncdevmem.c
--
2.47.0
This series adds VLAN support to HSR framework.
This series also adds VLAN support to HSR mode of ICSSG Ethernet driver.
Changes from v2 to v3:
*) Modified hsr_ndo_vlan_rx_add_vid() to handle arbitrary HSR_PT_SLAVE_A,
HSR_PT_SLAVE_B order and skip INTERLINK port in patch 2/4 as suggested by
Paolo Abeni <pabeni(a)redhat.com>
*) Removed handling of HSR_PT_MASTER in hsr_ndo_vlan_rx_kill_vid() as MASTER
and INTERLINK port will be ignored anyway in the default switch case as
suggested by Paolo Abeni <pabeni(a)redhat.com>
*) Modified the selftest in patch 4/4 to use vlan by default. The test will
check the exposed feature `vlan-challenged` and if vlan is not supported, skip
the vlan test as suggested by Paolo Abeni <pabeni(a)redhat.com>. Test logs can be
found at [1]
Changes from v1 to v2:
*) Added patch 4/4 to add test script related to VLAN in HSR as asked by
Lukasz Majewski <lukma(a)denx.de>
[1] https://gist.githubusercontent.com/danish-ti/d309f92c640134ccc4f2c0c442de5b…
v1 https://lore.kernel.org/all/20241004074715.791191-1-danishanwar@ti.com/
v2 https://lore.kernel.org/all/20241024103056.3201071-1-danishanwar@ti.com/
MD Danish Anwar (1):
selftests: hsr: Add test for VLAN
Murali Karicheri (1):
net: hsr: Add VLAN CTAG filter support
Ravi Gunasekaran (1):
net: ti: icssg-prueth: Add VLAN support for HSR mode
WingMan Kwok (1):
net: hsr: Add VLAN support
drivers/net/ethernet/ti/icssg/icssg_prueth.c | 45 ++++++++-
net/hsr/hsr_device.c | 85 +++++++++++++++--
net/hsr/hsr_forward.c | 19 +++-
tools/testing/selftests/net/hsr/config | 1 +
tools/testing/selftests/net/hsr/hsr_ping.sh | 98 ++++++++++++++++++++
5 files changed, 236 insertions(+), 12 deletions(-)
base-commit: a84e8c05f58305dfa808bc5465c5175c29d7c9b6
--
2.34.1
Soft lockups have been observed on a cluster of Linux-based edge routers
located in a highly dynamic environment. Using the `bird` service, these
routers continuously update BGP-advertised routes due to frequently
changing nexthop destinations, while also managing significant IPv6
traffic. The lockups occur during the traversal of the multipath
circular linked-list in the `fib6_select_path` function, particularly
while iterating through the siblings in the list. The issue typically
arises when the nodes of the linked list are unexpectedly deleted
concurrently on a different core—indicated by their 'next' and
'previous' elements pointing back to the node itself and their reference
count dropping to zero. This results in an infinite loop, leading to a
soft lockup that triggers a system panic via the watchdog timer.
Apply RCU primitives in the problematic code sections to resolve the
issue. Where necessary, update the references to fib6_siblings to
annotate or use the RCU APIs.
Include a test script that reproduces the issue. The script
periodically updates the routing table while generating a heavy load
of outgoing IPv6 traffic through multiple iperf3 clients. It
consistently induces infinite soft lockups within a couple of minutes.
Kernel log:
0 [ffffbd13003e8d30] machine_kexec at ffffffff8ceaf3eb
1 [ffffbd13003e8d90] __crash_kexec at ffffffff8d0120e3
2 [ffffbd13003e8e58] panic at ffffffff8cef65d4
3 [ffffbd13003e8ed8] watchdog_timer_fn at ffffffff8d05cb03
4 [ffffbd13003e8f08] __hrtimer_run_queues at ffffffff8cfec62f
5 [ffffbd13003e8f70] hrtimer_interrupt at ffffffff8cfed756
6 [ffffbd13003e8fd0] __sysvec_apic_timer_interrupt at ffffffff8cea01af
7 [ffffbd13003e8ff0] sysvec_apic_timer_interrupt at ffffffff8df1b83d
-- <IRQ stack> --
8 [ffffbd13003d3708] asm_sysvec_apic_timer_interrupt at ffffffff8e000ecb
[exception RIP: fib6_select_path+299]
RIP: ffffffff8ddafe7b RSP: ffffbd13003d37b8 RFLAGS: 00000287
RAX: ffff975850b43600 RBX: ffff975850b40200 RCX: 0000000000000000
RDX: 000000003fffffff RSI: 0000000051d383e4 RDI: ffff975850b43618
RBP: ffffbd13003d3800 R8: 0000000000000000 R9: ffff975850b40200
R10: 0000000000000000 R11: 0000000000000000 R12: ffffbd13003d3830
R13: ffff975850b436a8 R14: ffff975850b43600 R15: 0000000000000007
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
9 [ffffbd13003d3808] ip6_pol_route at ffffffff8ddb030c
10 [ffffbd13003d3888] ip6_pol_route_input at ffffffff8ddb068c
11 [ffffbd13003d3898] fib6_rule_lookup at ffffffff8ddf02b5
12 [ffffbd13003d3928] ip6_route_input at ffffffff8ddb0f47
13 [ffffbd13003d3a18] ip6_rcv_finish_core.constprop.0 at ffffffff8dd950d0
14 [ffffbd13003d3a30] ip6_list_rcv_finish.constprop.0 at ffffffff8dd96274
15 [ffffbd13003d3a98] ip6_sublist_rcv at ffffffff8dd96474
16 [ffffbd13003d3af8] ipv6_list_rcv at ffffffff8dd96615
17 [ffffbd13003d3b60] __netif_receive_skb_list_core at ffffffff8dc16fec
18 [ffffbd13003d3be0] netif_receive_skb_list_internal at ffffffff8dc176b3
19 [ffffbd13003d3c50] napi_gro_receive at ffffffff8dc565b9
20 [ffffbd13003d3c80] ice_receive_skb at ffffffffc087e4f5 [ice]
21 [ffffbd13003d3c90] ice_clean_rx_irq at ffffffffc0881b80 [ice]
22 [ffffbd13003d3d20] ice_napi_poll at ffffffffc088232f [ice]
23 [ffffbd13003d3d80] __napi_poll at ffffffff8dc18000
24 [ffffbd13003d3db8] net_rx_action at ffffffff8dc18581
25 [ffffbd13003d3e40] __do_softirq at ffffffff8df352e9
26 [ffffbd13003d3eb0] run_ksoftirqd at ffffffff8ceffe47
27 [ffffbd13003d3ec0] smpboot_thread_fn at ffffffff8cf36a30
28 [ffffbd13003d3ee8] kthread at ffffffff8cf2b39f
29 [ffffbd13003d3f28] ret_from_fork at ffffffff8ce5fa64
30 [ffffbd13003d3f50] ret_from_fork_asm at ffffffff8ce03cbb
Fixes: 66f5d6ce53e6 ("ipv6: replace rwlock with rcu and spinlock in fib6_table")
Reported-by: Adrian Oliver <kernel(a)aoliver.ca>
Signed-off-by: Omid Ehtemam-Haghighi <omid.ehtemamhaghighi(a)menlosecurity.com>
Cc: David S. Miller <davem(a)davemloft.net>
Cc: David Ahern <dsahern(a)gmail.com>
Cc: Eric Dumazet <edumazet(a)google.com>
Cc: Jakub Kicinski <kuba(a)kernel.org>
Cc: Paolo Abeni <pabeni(a)redhat.com>
Cc: Shuah Khan <shuah(a)kernel.org>
Cc: Ido Schimmel <idosch(a)idosch.org>
Cc: Kuniyuki Iwashima <kuniyu(a)amazon.com>
Cc: Simon Horman <horms(a)kernel.org>
Cc: Omid Ehtemam-Haghighi <oeh.kernel(a)gmail.com>
Cc: netdev(a)vger.kernel.org
Cc: linux-kselftest(a)vger.kernel.org
Cc: linux-kernel(a)vger.kernel.org
---
v6 -> v7:
* Rebased on top of 'net-next'
v5 -> v6:
* Adjust the comment line lengths in the test script to a maximum of
80 characters
* Change memory allocation in inet6_rt_notify from gfp_any() to GFP_ATOMIC for
atomic allocation in non-blocking contexts, as suggested by Ido Schimmel
* NOTE: I have executed the test script on both bare-metal servers and
virtualized environments such as QEMU and vng. In the case of bare-metal, it
consistently triggers a soft lockup in under a minute on unpatched kernels.
For the virtualized environments, an unpatched kernel compiled with the
Ubuntu 24.04 configuration also triggers a soft lockup, though it takes
longer; however, it did not trigger a soft lockup on kernels compiled with
configurations provided in:
https://github.com/linux-netdev/nipa/wiki/How-to-run-netdev-selftests-CI-st…
leading to potential false negatives in the test results.
I am curious if this test can be executed on a bare-metal machine within a
CI system, if such a setup exists, rather than in a virtualized environment.
If that’s not possible, how can I apply a different kernel configuration,
such as the one used in Ubuntu 24.04, for this test? Please advise.
v4 -> v5:
* Addressed review comments from Paolo Abeni.
* Added additional clarifying comments in the test script.
* Minor cleanup performed in the test script.
v3 -> v4:
* Added RCU primitives to rt6_fill_node(). I found that this function is typically
called either with a table lock held or within rcu_read_lock/rcu_read_unlock
pairs, except in the following call chain, where the protection is unclear:
rt_fill_node()
fib6_info_hw_flags_set()
mlxsw_sp_fib6_offload_failed_flag_set()
mlxsw_sp_router_fib6_event_work()
The last function is initialized as a work item in mlxsw_sp_router_fib_event()
and scheduled for deferred execution. I am unsure if the execution context of
this work item is protected by any table lock or rcu_read_lock/rcu_read_unlock
pair, so I have added the protection. Please let me know if this is redundant.
* Other review comments addressed
v2 -> v3:
* Removed redundant rcu_read_lock()/rcu_read_unlock() pairs
* Revised the test script based on Ido Schimmel's feedback
* Updated the test script to ensure compatibility with the latest iperf3 version
* Fixed new warnings generated with 'C=2' in the previous version
* Other review comments addressed
v1 -> v2:
* list_del_rcu() is applied exclusively to legacy multipath code
* All occurrences of fib6_siblings have been modified to utilize RCU
APIs for annotation and usage.
* Additionally, a test script for reproducing the reported
issue is included
---
net/ipv6/ip6_fib.c | 8 +-
net/ipv6/route.c | 45 ++-
tools/testing/selftests/net/Makefile | 1 +
.../net/ipv6_route_update_soft_lockup.sh | 262 ++++++++++++++++++
4 files changed, 297 insertions(+), 19 deletions(-)
create mode 100755 tools/testing/selftests/net/ipv6_route_update_soft_lockup.sh
diff --git a/net/ipv6/ip6_fib.c b/net/ipv6/ip6_fib.c
index 6383263bfd04..c134ba202c4c 100644
--- a/net/ipv6/ip6_fib.c
+++ b/net/ipv6/ip6_fib.c
@@ -1183,8 +1183,8 @@ static int fib6_add_rt2node(struct fib6_node *fn, struct fib6_info *rt,
while (sibling) {
if (sibling->fib6_metric == rt->fib6_metric &&
rt6_qualify_for_ecmp(sibling)) {
- list_add_tail(&rt->fib6_siblings,
- &sibling->fib6_siblings);
+ list_add_tail_rcu(&rt->fib6_siblings,
+ &sibling->fib6_siblings);
break;
}
sibling = rcu_dereference_protected(sibling->fib6_next,
@@ -1245,7 +1245,7 @@ static int fib6_add_rt2node(struct fib6_node *fn, struct fib6_info *rt,
fib6_siblings)
sibling->fib6_nsiblings--;
rt->fib6_nsiblings = 0;
- list_del_init(&rt->fib6_siblings);
+ list_del_rcu(&rt->fib6_siblings);
rt6_multipath_rebalance(next_sibling);
return err;
}
@@ -1963,7 +1963,7 @@ static void fib6_del_route(struct fib6_table *table, struct fib6_node *fn,
&rt->fib6_siblings, fib6_siblings)
sibling->fib6_nsiblings--;
rt->fib6_nsiblings = 0;
- list_del_init(&rt->fib6_siblings);
+ list_del_rcu(&rt->fib6_siblings);
rt6_multipath_rebalance(next_sibling);
}
diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index d7ce5cf2017a..e23e653de158 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -413,8 +413,8 @@ void fib6_select_path(const struct net *net, struct fib6_result *res,
struct flowi6 *fl6, int oif, bool have_oif_match,
const struct sk_buff *skb, int strict)
{
- struct fib6_info *sibling, *next_sibling;
struct fib6_info *match = res->f6i;
+ struct fib6_info *sibling;
if (!match->nh && (!match->fib6_nsiblings || have_oif_match))
goto out;
@@ -440,8 +440,8 @@ void fib6_select_path(const struct net *net, struct fib6_result *res,
if (fl6->mp_hash <= atomic_read(&match->fib6_nh->fib_nh_upper_bound))
goto out;
- list_for_each_entry_safe(sibling, next_sibling, &match->fib6_siblings,
- fib6_siblings) {
+ list_for_each_entry_rcu(sibling, &match->fib6_siblings,
+ fib6_siblings) {
const struct fib6_nh *nh = sibling->fib6_nh;
int nh_upper_bound;
@@ -5195,14 +5195,18 @@ static void ip6_route_mpath_notify(struct fib6_info *rt,
* nexthop. Since sibling routes are always added at the end of
* the list, find the first sibling of the last route appended
*/
+ rcu_read_lock();
+
if ((nlflags & NLM_F_APPEND) && rt_last && rt_last->fib6_nsiblings) {
- rt = list_first_entry(&rt_last->fib6_siblings,
- struct fib6_info,
- fib6_siblings);
+ rt = list_first_or_null_rcu(&rt_last->fib6_siblings,
+ struct fib6_info,
+ fib6_siblings);
}
if (rt)
inet6_rt_notify(RTM_NEWROUTE, rt, info, nlflags);
+
+ rcu_read_unlock();
}
static bool ip6_route_mpath_should_notify(const struct fib6_info *rt)
@@ -5547,17 +5551,21 @@ static size_t rt6_nlmsg_size(struct fib6_info *f6i)
nexthop_for_each_fib6_nh(f6i->nh, rt6_nh_nlmsg_size,
&nexthop_len);
} else {
- struct fib6_info *sibling, *next_sibling;
struct fib6_nh *nh = f6i->fib6_nh;
+ struct fib6_info *sibling;
nexthop_len = 0;
if (f6i->fib6_nsiblings) {
rt6_nh_nlmsg_size(nh, &nexthop_len);
- list_for_each_entry_safe(sibling, next_sibling,
- &f6i->fib6_siblings, fib6_siblings) {
+ rcu_read_lock();
+
+ list_for_each_entry_rcu(sibling, &f6i->fib6_siblings,
+ fib6_siblings) {
rt6_nh_nlmsg_size(sibling->fib6_nh, &nexthop_len);
}
+
+ rcu_read_unlock();
}
nexthop_len += lwtunnel_get_encap_size(nh->fib_nh_lws);
}
@@ -5721,7 +5729,7 @@ static int rt6_fill_node(struct net *net, struct sk_buff *skb,
lwtunnel_fill_encap(skb, dst->lwtstate, RTA_ENCAP, RTA_ENCAP_TYPE) < 0)
goto nla_put_failure;
} else if (rt->fib6_nsiblings) {
- struct fib6_info *sibling, *next_sibling;
+ struct fib6_info *sibling;
struct nlattr *mp;
mp = nla_nest_start_noflag(skb, RTA_MULTIPATH);
@@ -5733,14 +5741,21 @@ static int rt6_fill_node(struct net *net, struct sk_buff *skb,
0) < 0)
goto nla_put_failure;
- list_for_each_entry_safe(sibling, next_sibling,
- &rt->fib6_siblings, fib6_siblings) {
+ rcu_read_lock();
+
+ list_for_each_entry_rcu(sibling, &rt->fib6_siblings,
+ fib6_siblings) {
if (fib_add_nexthop(skb, &sibling->fib6_nh->nh_common,
sibling->fib6_nh->fib_nh_weight,
- AF_INET6, 0) < 0)
+ AF_INET6, 0) < 0) {
+ rcu_read_unlock();
+
goto nla_put_failure;
+ }
}
+ rcu_read_unlock();
+
nla_nest_end(skb, mp);
} else if (rt->nh) {
if (nla_put_u32(skb, RTA_NH_ID, rt->nh->id))
@@ -6177,7 +6192,7 @@ void inet6_rt_notify(int event, struct fib6_info *rt, struct nl_info *info,
err = -ENOBUFS;
seq = info->nlh ? info->nlh->nlmsg_seq : 0;
- skb = nlmsg_new(rt6_nlmsg_size(rt), gfp_any());
+ skb = nlmsg_new(rt6_nlmsg_size(rt), GFP_ATOMIC);
if (!skb)
goto errout;
@@ -6190,7 +6205,7 @@ void inet6_rt_notify(int event, struct fib6_info *rt, struct nl_info *info,
goto errout;
}
rtnl_notify(skb, net, info->portid, RTNLGRP_IPV6_ROUTE,
- info->nlh, gfp_any());
+ info->nlh, GFP_ATOMIC);
return;
errout:
rtnl_set_sk_err(net, RTNLGRP_IPV6_ROUTE, err);
diff --git a/tools/testing/selftests/net/Makefile b/tools/testing/selftests/net/Makefile
index 26a4883a65c9..8c4db5199a42 100644
--- a/tools/testing/selftests/net/Makefile
+++ b/tools/testing/selftests/net/Makefile
@@ -96,6 +96,7 @@ TEST_PROGS += fdb_flush.sh
TEST_PROGS += fq_band_pktlimit.sh
TEST_PROGS += vlan_hw_filter.sh
TEST_PROGS += bpf_offload.py
+TEST_PROGS += ipv6_route_update_soft_lockup.sh
# YNL files, must be before "include ..lib.mk"
YNL_GEN_FILES := ncdevmem
diff --git a/tools/testing/selftests/net/ipv6_route_update_soft_lockup.sh b/tools/testing/selftests/net/ipv6_route_update_soft_lockup.sh
new file mode 100755
index 000000000000..a6b2b1f9c641
--- /dev/null
+++ b/tools/testing/selftests/net/ipv6_route_update_soft_lockup.sh
@@ -0,0 +1,262 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+#
+# Testing for potential kernel soft lockup during IPv6 routing table
+# refresh under heavy outgoing IPv6 traffic. If a kernel soft lockup
+# occurs, a kernel panic will be triggered to prevent associated issues.
+#
+#
+# Test Environment Layout
+#
+# ┌----------------┐ ┌----------------┐
+# | SOURCE_NS | | SINK_NS |
+# | NAMESPACE | | NAMESPACE |
+# |(iperf3 clients)| |(iperf3 servers)|
+# | | | |
+# | | | |
+# | ┌-----------| nexthops |---------┐ |
+# | |veth_source|<--------------------------------------->|veth_sink|<┐ |
+# | └-----------|2001:0DB8:1::0:1/96 2001:0DB8:1::1:1/96 |---------┘ | |
+# | | ^ 2001:0DB8:1::1:2/96 | | |
+# | | . . | fwd | |
+# | ┌---------┐ | . . | | |
+# | | IPv6 | | . . | V |
+# | | routing | | . 2001:0DB8:1::1:80/96| ┌-----┐ |
+# | | table | | . | | lo | |
+# | | nexthop | | . └--------┴-----┴-┘
+# | | update | | ............................> 2001:0DB8:2::1:1/128
+# | └-------- ┘ |
+# └----------------┘
+#
+# The test script sets up two network namespaces, source_ns and sink_ns,
+# connected via a veth link. Within source_ns, it continuously updates the
+# IPv6 routing table by flushing and inserting IPV6_NEXTHOP_ADDR_COUNT nexthop
+# IPs destined for SINK_LOOPBACK_IP_ADDR in sink_ns. This refresh occurs at a
+# rate of 1/ROUTING_TABLE_REFRESH_PERIOD per second for TEST_DURATION seconds.
+#
+# Simultaneously, multiple iperf3 clients within source_ns generate heavy
+# outgoing IPv6 traffic. Each client is assigned a unique port number starting
+# at 5000 and incrementing sequentially. Each client targets a unique iperf3
+# server running in sink_ns, connected to the SINK_LOOPBACK_IFACE interface
+# using the same port number.
+#
+# The number of iperf3 servers and clients is set to half of the total
+# available cores on each machine.
+#
+# NOTE: We have tested this script on machines with various CPU specifications,
+# ranging from lower to higher performance as listed below. The test script
+# effectively triggered a kernel soft lockup on machines running an unpatched
+# kernel in under a minute:
+#
+# - 1x Intel Xeon E-2278G 8-Core Processor @ 3.40GHz
+# - 1x Intel Xeon E-2378G Processor 8-Core @ 2.80GHz
+# - 1x AMD EPYC 7401P 24-Core Processor @ 2.00GHz
+# - 1x AMD EPYC 7402P 24-Core Processor @ 2.80GHz
+# - 2x Intel Xeon Gold 5120 14-Core Processor @ 2.20GHz
+# - 1x Ampere Altra Q80-30 80-Core Processor @ 3.00GHz
+# - 2x Intel Xeon Gold 5120 14-Core Processor @ 2.20GHz
+# - 2x Intel Xeon Silver 4214 24-Core Processor @ 2.20GHz
+# - 1x AMD EPYC 7502P 32-Core @ 2.50GHz
+# - 1x Intel Xeon Gold 6314U 32-Core Processor @ 2.30GHz
+# - 2x Intel Xeon Gold 6338 32-Core Processor @ 2.00GHz
+#
+# On less performant machines, you may need to increase the TEST_DURATION
+# parameter to enhance the likelihood of encountering a race condition leading
+# to a kernel soft lockup and avoid a false negative result.
+#
+# NOTE: The test may not produce the expected result in virtualized
+# environments (e.g., qemu) due to differences in timing and CPU handling,
+# which can affect the conditions needed to trigger a soft lockup.
+
+source lib.sh
+source net_helper.sh
+
+TEST_DURATION=300
+ROUTING_TABLE_REFRESH_PERIOD=0.01
+
+IPERF3_BITRATE="300m"
+
+
+IPV6_NEXTHOP_ADDR_COUNT="128"
+IPV6_NEXTHOP_ADDR_MASK="96"
+IPV6_NEXTHOP_PREFIX="2001:0DB8:1"
+
+
+SOURCE_TEST_IFACE="veth_source"
+SOURCE_TEST_IP_ADDR="2001:0DB8:1::0:1/96"
+
+SINK_TEST_IFACE="veth_sink"
+# ${SINK_TEST_IFACE} is populated with the following range of IPv6 addresses:
+# 2001:0DB8:1::1:1 to 2001:0DB8:1::1:${IPV6_NEXTHOP_ADDR_COUNT}
+SINK_LOOPBACK_IFACE="lo"
+SINK_LOOPBACK_IP_MASK="128"
+SINK_LOOPBACK_IP_ADDR="2001:0DB8:2::1:1"
+
+nexthop_ip_list=""
+termination_signal=""
+kernel_softlokup_panic_prev_val=""
+
+terminate_ns_processes_by_pattern() {
+ local ns=$1
+ local pattern=$2
+
+ for pid in $(ip netns pids ${ns}); do
+ [ -e /proc/$pid/cmdline ] && grep -qe "${pattern}" /proc/$pid/cmdline && kill -9 $pid
+ done
+}
+
+cleanup() {
+ echo "info: cleaning up namespaces and terminating all processes within them..."
+
+
+ # Terminate iperf3 instances running in the source_ns. To avoid race
+ # conditions, first iterate over the PIDs and terminate those
+ # associated with the bash shells running the
+ # `while true; do iperf3 -c ...; done` loops. In a second iteration,
+ # terminate the individual `iperf3 -c ...` instances.
+ terminate_ns_processes_by_pattern ${source_ns} while
+ terminate_ns_processes_by_pattern ${source_ns} iperf3
+
+ # Repeat the same process for sink_ns
+ terminate_ns_processes_by_pattern ${sink_ns} while
+ terminate_ns_processes_by_pattern ${sink_ns} iperf3
+
+ # Check if any iperf3 instances are still running. This could happen
+ # if a core has entered an infinite loop and the timeout for detecting
+ # the soft lockup has not expired, but either the test interval has
+ # already elapsed or the test was terminated manually (e.g., with ^C)
+ for pid in $(ip netns pids ${source_ns}); do
+ if [ -e /proc/$pid/cmdline ] && grep -qe 'iperf3' /proc/$pid/cmdline; then
+ echo "FAIL: unable to terminate some iperf3 instances. Soft lockup is underway. A kernel panic is on the way!"
+ exit ${ksft_fail}
+ fi
+ done
+
+ if [ "$termination_signal" == "SIGINT" ]; then
+ echo "SKIP: Termination due to ^C (SIGINT)"
+ elif [ "$termination_signal" == "SIGALRM" ]; then
+ echo "PASS: No kernel soft lockup occurred during this ${TEST_DURATION} second test"
+ fi
+
+ cleanup_ns ${source_ns} ${sink_ns}
+
+ sysctl -qw kernel.softlockup_panic=${kernel_softlokup_panic_prev_val}
+}
+
+setup_prepare() {
+ setup_ns source_ns sink_ns
+
+ ip -n ${source_ns} link add name ${SOURCE_TEST_IFACE} type veth peer name ${SINK_TEST_IFACE} netns ${sink_ns}
+
+ # Setting up the Source namespace
+ ip -n ${source_ns} addr add ${SOURCE_TEST_IP_ADDR} dev ${SOURCE_TEST_IFACE}
+ ip -n ${source_ns} link set dev ${SOURCE_TEST_IFACE} qlen 10000
+ ip -n ${source_ns} link set dev ${SOURCE_TEST_IFACE} up
+ ip netns exec ${source_ns} sysctl -qw net.ipv6.fib_multipath_hash_policy=1
+
+ # Setting up the Sink namespace
+ ip -n ${sink_ns} addr add ${SINK_LOOPBACK_IP_ADDR}/${SINK_LOOPBACK_IP_MASK} dev ${SINK_LOOPBACK_IFACE}
+ ip -n ${sink_ns} link set dev ${SINK_LOOPBACK_IFACE} up
+ ip netns exec ${sink_ns} sysctl -qw net.ipv6.conf.${SINK_LOOPBACK_IFACE}.forwarding=1
+
+ ip -n ${sink_ns} link set ${SINK_TEST_IFACE} up
+ ip netns exec ${sink_ns} sysctl -qw net.ipv6.conf.${SINK_TEST_IFACE}.forwarding=1
+
+
+ # Populate nexthop IPv6 addresses on the test interface in the sink_ns
+ echo "info: populating ${IPV6_NEXTHOP_ADDR_COUNT} IPv6 addresses on the ${SINK_TEST_IFACE} interface ..."
+ for IP in $(seq 1 ${IPV6_NEXTHOP_ADDR_COUNT}); do
+ ip -n ${sink_ns} addr add ${IPV6_NEXTHOP_PREFIX}::$(printf "1:%x" "${IP}")/${IPV6_NEXTHOP_ADDR_MASK} dev ${SINK_TEST_IFACE};
+ done
+
+ # Preparing list of nexthops
+ for IP in $(seq 1 ${IPV6_NEXTHOP_ADDR_COUNT}); do
+ nexthop_ip_list=$nexthop_ip_list" nexthop via ${IPV6_NEXTHOP_PREFIX}::$(printf "1:%x" $IP) dev ${SOURCE_TEST_IFACE} weight 1"
+ done
+}
+
+
+test_soft_lockup_during_routing_table_refresh() {
+ # Start num_of_iperf_servers iperf3 servers in the sink_ns namespace,
+ # each listening on ports starting at 5001 and incrementing
+ # sequentially. Since iperf3 instances may terminate unexpectedly, a
+ # while loop is used to automatically restart them in such cases.
+ echo "info: starting ${num_of_iperf_servers} iperf3 servers in the sink_ns namespace ..."
+ for i in $(seq 1 ${num_of_iperf_servers}); do
+ cmd="iperf3 --bind ${SINK_LOOPBACK_IP_ADDR} -s -p $(printf '5%03d' ${i}) --rcv-timeout 200 &>/dev/null"
+ ip netns exec ${sink_ns} bash -c "while true; do ${cmd}; done &" &>/dev/null
+ done
+
+ # Wait for the iperf3 servers to be ready
+ for i in $(seq ${num_of_iperf_servers}); do
+ port=$(printf '5%03d' ${i});
+ wait_local_port_listen ${sink_ns} ${port} tcp
+ done
+
+ # Continuously refresh the routing table in the background within
+ # the source_ns namespace
+ ip netns exec ${source_ns} bash -c "
+ while \$(ip netns list | grep -q ${source_ns}); do
+ ip -6 route add ${SINK_LOOPBACK_IP_ADDR}/${SINK_LOOPBACK_IP_MASK} ${nexthop_ip_list};
+ sleep ${ROUTING_TABLE_REFRESH_PERIOD};
+ ip -6 route delete ${SINK_LOOPBACK_IP_ADDR}/${SINK_LOOPBACK_IP_MASK};
+ done &"
+
+ # Start num_of_iperf_servers iperf3 clients in the source_ns namespace,
+ # each sending TCP traffic on sequential ports starting at 5001.
+ # Since iperf3 instances may terminate unexpectedly (e.g., if the route
+ # to the server is deleted in the background during a route refresh), a
+ # while loop is used to automatically restart them in such cases.
+ echo "info: starting ${num_of_iperf_servers} iperf3 clients in the source_ns namespace ..."
+ for i in $(seq 1 ${num_of_iperf_servers}); do
+ cmd="iperf3 -c ${SINK_LOOPBACK_IP_ADDR} -p $(printf '5%03d' ${i}) --length 64 --bitrate ${IPERF3_BITRATE} -t 0 --connect-timeout 150 &>/dev/null"
+ ip netns exec ${source_ns} bash -c "while true; do ${cmd}; done &" &>/dev/null
+ done
+
+ echo "info: IPv6 routing table is being updated at the rate of $(echo "1/${ROUTING_TABLE_REFRESH_PERIOD}" | bc)/s for ${TEST_DURATION} seconds ..."
+ echo "info: A kernel soft lockup, if detected, results in a kernel panic!"
+
+ wait
+}
+
+# Make sure 'iperf3' is installed, skip the test otherwise
+if [ ! -x "$(command -v "iperf3")" ]; then
+ echo "SKIP: 'iperf3' is not installed. Skipping the test."
+ exit ${ksft_skip}
+fi
+
+# Determine the number of cores on the machine
+num_of_iperf_servers=$(( $(nproc)/2 ))
+
+# Check if we are running on a multi-core machine, skip the test otherwise
+if [ "${num_of_iperf_servers}" -eq 0 ]; then
+ echo "SKIP: This test is not valid on a single core machine!"
+ exit ${ksft_skip}
+fi
+
+# Since the kernel soft lockup we're testing causes at least one core to enter
+# an infinite loop, destabilizing the host and likely affecting subsequent
+# tests, we trigger a kernel panic instead of reporting a failure and
+# continuing
+kernel_softlokup_panic_prev_val=$(sysctl -n kernel.softlockup_panic)
+sysctl -qw kernel.softlockup_panic=1
+
+handle_sigint() {
+ termination_signal="SIGINT"
+ cleanup
+ exit ${ksft_skip}
+}
+
+handle_sigalrm() {
+ termination_signal="SIGALRM"
+ cleanup
+ exit ${ksft_pass}
+}
+
+trap handle_sigint SIGINT
+trap handle_sigalrm SIGALRM
+
+(sleep ${TEST_DURATION} && kill -s SIGALRM $$)&
+
+setup_prepare
+test_soft_lockup_during_routing_table_refresh
--
2.43.0
When macsec is offloaded to a NIC, we can take advantage of some of
its features, mainly TSO and checksumming. This increases performance
significantly. Some features cannot be inherited, because they require
additional ops that aren't provided by the macsec netdevice.
We also need to inherit TSO limits from the lower device, like
VLAN/macvlan devices do.
This series also moves the existing macsec offload selftest to the
netdevsim selftests before adding tests for the new features. To allow
this new selftest to work, netdevsim's hw_features are expanded.
Sabrina Dubroca (8):
netdevsim: add more hw_features
selftests: netdevsim: add a test checking ethtool features
macsec: add some of the lower device's features when offloading
macsec: clean up local variables in macsec_notify
macsec: inherit lower device's TSO limits when offloading
selftests: move macsec offload tests from net/rtnetlink to
drivers/net/netdvesim
selftests: netdevsim: add test toggling macsec offload
selftests: netdevsim: add ethtool features to macsec offload tests
drivers/net/macsec.c | 64 +++++++---
drivers/net/netdevsim/netdev.c | 6 +-
.../selftests/drivers/net/netdevsim/Makefile | 2 +
.../selftests/drivers/net/netdevsim/config | 1 +
.../drivers/net/netdevsim/ethtool-features.sh | 31 +++++
.../drivers/net/netdevsim/macsec-offload.sh | 117 ++++++++++++++++++
tools/testing/selftests/net/rtnetlink.sh | 68 ----------
7 files changed, 200 insertions(+), 89 deletions(-)
create mode 100644 tools/testing/selftests/drivers/net/netdevsim/ethtool-features.sh
create mode 100755 tools/testing/selftests/drivers/net/netdevsim/macsec-offload.sh
--
2.47.0
For logging to be useful, something has to set RET and retmsg by calling
ret_set_ksft_status(). There is a suite of functions to that end in
forwarding/lib: check_err, check_fail et.al. Move them to net/lib.sh so
that every net test can use them.
Existing lib.sh users might be using these same names for their functions.
However lib.sh is always sourced near the top of the file (checked), and
whatever new definitions will simply override the ones provided by lib.sh.
Signed-off-by: Petr Machata <petrm(a)nvidia.com>
Reviewed-by: Amit Cohen <amcohen(a)nvidia.com>
Acked-by: Shuah Khan <skhan(a)linuxfoundation.org>
---
Notes:
CC: Shuah Khan <shuah(a)kernel.org>
CC: Benjamin Poirier <bpoirier(a)nvidia.com>
CC: Hangbin Liu <liuhangbin(a)gmail.com>
CC: linux-kselftest(a)vger.kernel.org
CC: Jiri Pirko <jiri(a)resnulli.us>
tools/testing/selftests/net/forwarding/lib.sh | 73 -------------------
tools/testing/selftests/net/lib.sh | 73 +++++++++++++++++++
2 files changed, 73 insertions(+), 73 deletions(-)
diff --git a/tools/testing/selftests/net/forwarding/lib.sh b/tools/testing/selftests/net/forwarding/lib.sh
index d28dbf27c1f0..8625e3c99f55 100644
--- a/tools/testing/selftests/net/forwarding/lib.sh
+++ b/tools/testing/selftests/net/forwarding/lib.sh
@@ -445,79 +445,6 @@ done
##############################################################################
# Helpers
-# Whether FAILs should be interpreted as XFAILs. Internal.
-FAIL_TO_XFAIL=
-
-check_err()
-{
- local err=$1
- local msg=$2
-
- if ((err)); then
- if [[ $FAIL_TO_XFAIL = yes ]]; then
- ret_set_ksft_status $ksft_xfail "$msg"
- else
- ret_set_ksft_status $ksft_fail "$msg"
- fi
- fi
-}
-
-check_fail()
-{
- local err=$1
- local msg=$2
-
- check_err $((!err)) "$msg"
-}
-
-check_err_fail()
-{
- local should_fail=$1; shift
- local err=$1; shift
- local what=$1; shift
-
- if ((should_fail)); then
- check_fail $err "$what succeeded, but should have failed"
- else
- check_err $err "$what failed"
- fi
-}
-
-xfail()
-{
- FAIL_TO_XFAIL=yes "$@"
-}
-
-xfail_on_slow()
-{
- if [[ $KSFT_MACHINE_SLOW = yes ]]; then
- FAIL_TO_XFAIL=yes "$@"
- else
- "$@"
- fi
-}
-
-omit_on_slow()
-{
- if [[ $KSFT_MACHINE_SLOW != yes ]]; then
- "$@"
- fi
-}
-
-xfail_on_veth()
-{
- local dev=$1; shift
- local kind
-
- kind=$(ip -j -d link show dev $dev |
- jq -r '.[].linkinfo.info_kind')
- if [[ $kind = veth ]]; then
- FAIL_TO_XFAIL=yes "$@"
- else
- "$@"
- fi
-}
-
not()
{
"$@"
diff --git a/tools/testing/selftests/net/lib.sh b/tools/testing/selftests/net/lib.sh
index 4f52b8e48a3a..6bcf5d13879d 100644
--- a/tools/testing/selftests/net/lib.sh
+++ b/tools/testing/selftests/net/lib.sh
@@ -361,3 +361,76 @@ tests_run()
$current_test
done
}
+
+# Whether FAILs should be interpreted as XFAILs. Internal.
+FAIL_TO_XFAIL=
+
+check_err()
+{
+ local err=$1
+ local msg=$2
+
+ if ((err)); then
+ if [[ $FAIL_TO_XFAIL = yes ]]; then
+ ret_set_ksft_status $ksft_xfail "$msg"
+ else
+ ret_set_ksft_status $ksft_fail "$msg"
+ fi
+ fi
+}
+
+check_fail()
+{
+ local err=$1
+ local msg=$2
+
+ check_err $((!err)) "$msg"
+}
+
+check_err_fail()
+{
+ local should_fail=$1; shift
+ local err=$1; shift
+ local what=$1; shift
+
+ if ((should_fail)); then
+ check_fail $err "$what succeeded, but should have failed"
+ else
+ check_err $err "$what failed"
+ fi
+}
+
+xfail()
+{
+ FAIL_TO_XFAIL=yes "$@"
+}
+
+xfail_on_slow()
+{
+ if [[ $KSFT_MACHINE_SLOW = yes ]]; then
+ FAIL_TO_XFAIL=yes "$@"
+ else
+ "$@"
+ fi
+}
+
+omit_on_slow()
+{
+ if [[ $KSFT_MACHINE_SLOW != yes ]]; then
+ "$@"
+ fi
+}
+
+xfail_on_veth()
+{
+ local dev=$1; shift
+ local kind
+
+ kind=$(ip -j -d link show dev $dev |
+ jq -r '.[].linkinfo.info_kind')
+ if [[ $kind = veth ]]; then
+ FAIL_TO_XFAIL=yes "$@"
+ else
+ "$@"
+ fi
+}
--
2.45.0
It would be good to use the same mechanism for scheduling and dispatching
general net tests as the many forwarding tests already use. To that end,
move the logging helpers to net/lib.sh so that every net test can use them.
Existing lib.sh users might be using the name themselves. However lib.sh is
always sourced near the top of the file (checked), and whatever new
definition will simply override the one provided by lib.sh.
Signed-off-by: Petr Machata <petrm(a)nvidia.com>
Reviewed-by: Amit Cohen <amcohen(a)nvidia.com>
Acked-by: Shuah Khan <skhan(a)linuxfoundation.org>
---
Notes:
CC: Shuah Khan <shuah(a)kernel.org>
CC: Benjamin Poirier <bpoirier(a)nvidia.com>
CC: Hangbin Liu <liuhangbin(a)gmail.com>
CC: linux-kselftest(a)vger.kernel.org
CC: Jiri Pirko <jiri(a)resnulli.us>
tools/testing/selftests/net/forwarding/lib.sh | 10 ----------
tools/testing/selftests/net/lib.sh | 10 ++++++++++
2 files changed, 10 insertions(+), 10 deletions(-)
diff --git a/tools/testing/selftests/net/forwarding/lib.sh b/tools/testing/selftests/net/forwarding/lib.sh
index 41dd14c42c48..d28dbf27c1f0 100644
--- a/tools/testing/selftests/net/forwarding/lib.sh
+++ b/tools/testing/selftests/net/forwarding/lib.sh
@@ -1285,16 +1285,6 @@ matchall_sink_create()
action drop
}
-tests_run()
-{
- local current_test
-
- for current_test in ${TESTS:-$ALL_TESTS}; do
- in_defer_scope \
- $current_test
- done
-}
-
cleanup()
{
pre_cleanup
diff --git a/tools/testing/selftests/net/lib.sh b/tools/testing/selftests/net/lib.sh
index 691318b1ec55..4f52b8e48a3a 100644
--- a/tools/testing/selftests/net/lib.sh
+++ b/tools/testing/selftests/net/lib.sh
@@ -351,3 +351,13 @@ log_info()
echo "INFO: $msg"
}
+
+tests_run()
+{
+ local current_test
+
+ for current_test in ${TESTS:-$ALL_TESTS}; do
+ in_defer_scope \
+ $current_test
+ done
+}
--
2.45.0
Many net selftests invent their own logging helpers. These really should be
in a library sourced by these tests. Currently forwarding/lib.sh has a
suite of perfectly fine logging helpers, but sourcing a forwarding/ library
from a higher-level directory smells of layering violation. In this patch,
move the logging helpers to net/lib.sh so that every net test can use them.
Together with the logging helpers, it's also necessary to move
pause_on_fail(), and EXIT_STATUS and RET.
Existing lib.sh users might be using these same names for their functions
or variables. However lib.sh is always sourced near the top of the
file (checked), and whatever new definitions will simply override the ones
provided by lib.sh.
Signed-off-by: Petr Machata <petrm(a)nvidia.com>
Reviewed-by: Amit Cohen <amcohen(a)nvidia.com>
Acked-by: Shuah Khan <skhan(a)linuxfoundation.org>
---
Notes:
CC: Shuah Khan <shuah(a)kernel.org>
CC: Benjamin Poirier <bpoirier(a)nvidia.com>
CC: Hangbin Liu <liuhangbin(a)gmail.com>
CC: linux-kselftest(a)vger.kernel.org
CC: Jiri Pirko <jiri(a)resnulli.us>
tools/testing/selftests/net/forwarding/lib.sh | 113 -----------------
tools/testing/selftests/net/lib.sh | 115 ++++++++++++++++++
2 files changed, 115 insertions(+), 113 deletions(-)
diff --git a/tools/testing/selftests/net/forwarding/lib.sh b/tools/testing/selftests/net/forwarding/lib.sh
index 89c25f72b10c..41dd14c42c48 100644
--- a/tools/testing/selftests/net/forwarding/lib.sh
+++ b/tools/testing/selftests/net/forwarding/lib.sh
@@ -48,7 +48,6 @@ declare -A NETIFS=(
: "${WAIT_TIME:=5}"
# Whether to pause on, respectively, after a failure and before cleanup.
-: "${PAUSE_ON_FAIL:=no}"
: "${PAUSE_ON_CLEANUP:=no}"
# Whether to create virtual interfaces, and what netdevice type they should be.
@@ -446,22 +445,6 @@ done
##############################################################################
# Helpers
-# Exit status to return at the end. Set in case one of the tests fails.
-EXIT_STATUS=0
-# Per-test return value. Clear at the beginning of each test.
-RET=0
-
-ret_set_ksft_status()
-{
- local ksft_status=$1; shift
- local msg=$1; shift
-
- RET=$(ksft_status_merge $RET $ksft_status)
- if (( $? )); then
- retmsg=$msg
- fi
-}
-
# Whether FAILs should be interpreted as XFAILs. Internal.
FAIL_TO_XFAIL=
@@ -535,102 +518,6 @@ xfail_on_veth()
fi
}
-log_test_result()
-{
- local test_name=$1; shift
- local opt_str=$1; shift
- local result=$1; shift
- local retmsg=$1; shift
-
- printf "TEST: %-60s [%s]\n" "$test_name $opt_str" "$result"
- if [[ $retmsg ]]; then
- printf "\t%s\n" "$retmsg"
- fi
-}
-
-pause_on_fail()
-{
- if [[ $PAUSE_ON_FAIL == yes ]]; then
- echo "Hit enter to continue, 'q' to quit"
- read a
- [[ $a == q ]] && exit 1
- fi
-}
-
-handle_test_result_pass()
-{
- local test_name=$1; shift
- local opt_str=$1; shift
-
- log_test_result "$test_name" "$opt_str" " OK "
-}
-
-handle_test_result_fail()
-{
- local test_name=$1; shift
- local opt_str=$1; shift
-
- log_test_result "$test_name" "$opt_str" FAIL "$retmsg"
- pause_on_fail
-}
-
-handle_test_result_xfail()
-{
- local test_name=$1; shift
- local opt_str=$1; shift
-
- log_test_result "$test_name" "$opt_str" XFAIL "$retmsg"
- pause_on_fail
-}
-
-handle_test_result_skip()
-{
- local test_name=$1; shift
- local opt_str=$1; shift
-
- log_test_result "$test_name" "$opt_str" SKIP "$retmsg"
-}
-
-log_test()
-{
- local test_name=$1
- local opt_str=$2
-
- if [[ $# -eq 2 ]]; then
- opt_str="($opt_str)"
- fi
-
- if ((RET == ksft_pass)); then
- handle_test_result_pass "$test_name" "$opt_str"
- elif ((RET == ksft_xfail)); then
- handle_test_result_xfail "$test_name" "$opt_str"
- elif ((RET == ksft_skip)); then
- handle_test_result_skip "$test_name" "$opt_str"
- else
- handle_test_result_fail "$test_name" "$opt_str"
- fi
-
- EXIT_STATUS=$(ksft_exit_status_merge $EXIT_STATUS $RET)
- return $RET
-}
-
-log_test_skip()
-{
- RET=$ksft_skip retmsg= log_test "$@"
-}
-
-log_test_xfail()
-{
- RET=$ksft_xfail retmsg= log_test "$@"
-}
-
-log_info()
-{
- local msg=$1
-
- echo "INFO: $msg"
-}
-
not()
{
"$@"
diff --git a/tools/testing/selftests/net/lib.sh b/tools/testing/selftests/net/lib.sh
index c8991cc6bf28..691318b1ec55 100644
--- a/tools/testing/selftests/net/lib.sh
+++ b/tools/testing/selftests/net/lib.sh
@@ -9,6 +9,9 @@ source "$net_dir/lib/sh/defer.sh"
: "${WAIT_TIMEOUT:=20}"
+# Whether to pause on after a failure.
+: "${PAUSE_ON_FAIL:=no}"
+
BUSYWAIT_TIMEOUT=$((WAIT_TIMEOUT * 1000)) # ms
# Kselftest framework constants.
@@ -20,6 +23,11 @@ ksft_skip=4
# namespace list created by setup_ns
NS_LIST=()
+# Exit status to return at the end. Set in case one of the tests fails.
+EXIT_STATUS=0
+# Per-test return value. Clear at the beginning of each test.
+RET=0
+
##############################################################################
# Helpers
@@ -236,3 +244,110 @@ tc_rule_handle_stats_get()
| jq ".[] | select(.options.handle == $handle) | \
.options.actions[0].stats$selector"
}
+
+ret_set_ksft_status()
+{
+ local ksft_status=$1; shift
+ local msg=$1; shift
+
+ RET=$(ksft_status_merge $RET $ksft_status)
+ if (( $? )); then
+ retmsg=$msg
+ fi
+}
+
+log_test_result()
+{
+ local test_name=$1; shift
+ local opt_str=$1; shift
+ local result=$1; shift
+ local retmsg=$1; shift
+
+ printf "TEST: %-60s [%s]\n" "$test_name $opt_str" "$result"
+ if [[ $retmsg ]]; then
+ printf "\t%s\n" "$retmsg"
+ fi
+}
+
+pause_on_fail()
+{
+ if [[ $PAUSE_ON_FAIL == yes ]]; then
+ echo "Hit enter to continue, 'q' to quit"
+ read a
+ [[ $a == q ]] && exit 1
+ fi
+}
+
+handle_test_result_pass()
+{
+ local test_name=$1; shift
+ local opt_str=$1; shift
+
+ log_test_result "$test_name" "$opt_str" " OK "
+}
+
+handle_test_result_fail()
+{
+ local test_name=$1; shift
+ local opt_str=$1; shift
+
+ log_test_result "$test_name" "$opt_str" FAIL "$retmsg"
+ pause_on_fail
+}
+
+handle_test_result_xfail()
+{
+ local test_name=$1; shift
+ local opt_str=$1; shift
+
+ log_test_result "$test_name" "$opt_str" XFAIL "$retmsg"
+ pause_on_fail
+}
+
+handle_test_result_skip()
+{
+ local test_name=$1; shift
+ local opt_str=$1; shift
+
+ log_test_result "$test_name" "$opt_str" SKIP "$retmsg"
+}
+
+log_test()
+{
+ local test_name=$1
+ local opt_str=$2
+
+ if [[ $# -eq 2 ]]; then
+ opt_str="($opt_str)"
+ fi
+
+ if ((RET == ksft_pass)); then
+ handle_test_result_pass "$test_name" "$opt_str"
+ elif ((RET == ksft_xfail)); then
+ handle_test_result_xfail "$test_name" "$opt_str"
+ elif ((RET == ksft_skip)); then
+ handle_test_result_skip "$test_name" "$opt_str"
+ else
+ handle_test_result_fail "$test_name" "$opt_str"
+ fi
+
+ EXIT_STATUS=$(ksft_exit_status_merge $EXIT_STATUS $RET)
+ return $RET
+}
+
+log_test_skip()
+{
+ RET=$ksft_skip retmsg= log_test "$@"
+}
+
+log_test_xfail()
+{
+ RET=$ksft_xfail retmsg= log_test "$@"
+}
+
+log_info()
+{
+ local msg=$1
+
+ echo "INFO: $msg"
+}
--
2.45.0
This patch series adds a some not yet picked selftests to the kvm s390x
selftest suite.
The additional test cases are covering:
* Assert KVM_EXIT_S390_UCONTROL exit on not mapped memory access
* Assert functionality of storage keys in ucontrol VM
* Assert that memory region operations are rejected for ucontrol VMs
Running the test cases requires sys_admin capabilities to start the
ucontrol VM.
This can be achieved by running as root or with a command like:
sudo setpriv --reuid nobody --inh-caps -all,+sys_admin \
--ambient-caps -all,+sys_admin --bounding-set -all,+sys_admin \
./ucontrol_test
---
The patches in this series have been part of the previous patch series.
The test cases added here do depend on the fixture added in the earlier
patches.
From v5 PATCH 7-9 the segment and page table generation has been removed
and DAT
has been disabled. Since DAT is not necessary to validate the KVM code.
https://lore.kernel.org/kvm/20240807154512.316936-1-schlameuss@linux.ibm.co…
v7:
- skip uc_skey test when execution is in vsie
v6:
- add instruction intercept handling for skey specific instructions
(iske, sske, rrbe) in addition to kss intercept to work properly on
all machines
- reorder local variables
- fixup some method comments
- add a patch correcting the IP.b value length a debug message
v5:
- rebased to current upstream master
- corrected assertion on 0x00 to 0
- reworded fixup commit so that it can be merged on top of current
upstream
v4:
- fix whitespaces in pointer function arguments (thanks Claudio)
- fix whitespaces in comments (thanks Janosch)
v3:
- fix skey assertion (thanks Claudio)
- introduce a wrapper around UCAS map and unmap ioctls to improve
readability (Claudio)
- add an displacement to accessed memory to assert translation
intercepts actually point to segments to the uc_map_unmap test
- add an misaligned failing mapping try to the uc_map_unmap test
v2:
- Reenable KSS intercept and handle it within skey test.
- Modify the checked register between storing (sske) and reading (iske)
it within the test program to make sure the.
- Add an additional state assertion in the end of uc_skey
- Fix some typos and white spaces.
v1:
- Remove segment and page table generation and disable DAT. This is not
necessary to validate the KVM code.
Christoph Schlameuss (5):
selftests: kvm: s390: Add uc_map_unmap VM test case
selftests: kvm: s390: Add uc_skey VM test case
selftests: kvm: s390: Verify reject memory region operations for
ucontrol VMs
selftests: kvm: s390: Fix whitespace confusion in ucontrol test
selftests: kvm: s390: correct IP.b length in uc_handle_sieic debug
output
.../selftests/kvm/include/s390x/processor.h | 6 +
.../selftests/kvm/s390x/ucontrol_test.c | 321 +++++++++++++++++-
2 files changed, 319 insertions(+), 8 deletions(-)
--
2.47.0
The test should be skipped if initial conditions aren't fulfilled in
the start instead of failing and outputting non-compliant TAP logs. This
kind of failure pollutes the results. The initial conditions are:
- The test should only execute if /tmp file can be allocated.
- The test should only execute if huge pages are free.
Signed-off-by: Muhammad Usama Anjum <usama.anjum(a)collabora.com>
---
Before:
TAP version 13
1..4
Bail out! Error opening file
: Read-only file system (30)
# Planned tests != run tests (4 != 0)
# Totals: pass:0 fail:0 xfail:0 xpass:0 skip:0 error:0
After:
TAP version 13
1..0 # SKIP Unable to allocate file: Read-only file system
---
tools/testing/selftests/mm/hugetlb_dio.c | 19 ++++++++++++-------
1 file changed, 12 insertions(+), 7 deletions(-)
diff --git a/tools/testing/selftests/mm/hugetlb_dio.c b/tools/testing/selftests/mm/hugetlb_dio.c
index f9ac20c657ec6..60001c142ce99 100644
--- a/tools/testing/selftests/mm/hugetlb_dio.c
+++ b/tools/testing/selftests/mm/hugetlb_dio.c
@@ -44,13 +44,6 @@ void run_dio_using_hugetlb(unsigned int start_off, unsigned int end_off)
if (fd < 0)
ksft_exit_fail_perror("Error opening file\n");
- /* Get the free huge pages before allocation */
- free_hpage_b = get_free_hugepages();
- if (free_hpage_b == 0) {
- close(fd);
- ksft_exit_skip("No free hugepage, exiting!\n");
- }
-
/* Allocate a hugetlb page */
orig_buffer = mmap(NULL, h_pagesize, mmap_prot, mmap_flags, -1, 0);
if (orig_buffer == MAP_FAILED) {
@@ -94,8 +87,20 @@ void run_dio_using_hugetlb(unsigned int start_off, unsigned int end_off)
int main(void)
{
size_t pagesize = 0;
+ int fd;
ksft_print_header();
+
+ /* Open the file to DIO */
+ fd = open("/tmp", O_TMPFILE | O_RDWR | O_DIRECT, 0664);
+ if (fd < 0)
+ ksft_exit_skip("Unable to allocate file: %s\n", strerror(errno));
+ close(fd);
+
+ /* Check if huge pages are free */
+ if (!get_free_hugepages())
+ ksft_exit_skip("No free hugepage, exiting\n");
+
ksft_set_plan(4);
/* Get base page size */
--
2.39.5
1. fix recursive lock when ebpf prog return SK_PASS.
2. add selftest to reproduce recursive lock.
Note that if just the selftest merged without first
patch, the test case will definitely fail, because the
issue of deadlock is inevitable.
---
v2->v3: fix line length reported by patchwork.
(max_line_length is set to 80 in patchwork but default is 100 in kernel tree)
v1->v2: 1.inspired by martin.lau to add selftest to reproduce the issue.
2. follow the community rules for patch.
v1: https://lore.kernel.org/bpf/55fc6114-7e64-4b65-86d2-92cfd1e9e92f@linux.dev/…
---
Jiayuan Chen (2):
bpf: fix recursive lock when verdict program return SK_PASS
selftests/bpf: Add some tests with sockmap SK_PASS
net/core/skmsg.c | 4 +-
.../selftests/bpf/prog_tests/sockmap_basic.c | 54 +++++++++++++++++++
.../bpf/progs/test_sockmap_pass_prog.c | 2 +-
3 files changed, 57 insertions(+), 3 deletions(-)
--
2.43.5
This series was originally written by José Expósito, and has been
modified and updated by Matt Gilbride and myself. The original version
can be found here:
https://github.com/Rust-for-Linux/linux/pull/950
Add support for writing KUnit tests in Rust. While Rust doctests are
already converted to KUnit tests and run, they're really better suited
for examples, rather than as first-class unit tests.
This series implements a series of direct Rust bindings for KUnit tests,
as well as a new macro which allows KUnit tests to be written using a
close variant of normal Rust unit test syntax. The only change required
is replacing '#[cfg(test)]' with '#[kunit_tests(kunit_test_suite_name)]'
An example test would look like:
#[kunit_tests(rust_kernel_hid_driver)]
mod tests {
use super::*;
use crate::{c_str, driver, hid, prelude::*};
use core::ptr;
struct SimpleTestDriver;
impl Driver for SimpleTestDriver {
type Data = ();
}
#[test]
fn rust_test_hid_driver_adapter() {
let mut hid = bindings::hid_driver::default();
let name = c_str!("SimpleTestDriver");
static MODULE: ThisModule = unsafe { ThisModule::from_ptr(ptr::null_mut()) };
let res = unsafe {
<hid::Adapter<SimpleTestDriver> as driver::DriverOps>::register(&mut hid, name, &MODULE)
};
assert_eq!(res, Err(ENODEV)); // The mock returns -19
}
}
Please give this a go, and make sure I haven't broken it! There's almost
certainly a lot of improvements which can be made -- and there's a fair
case to be made for replacing some of this with generated C code which
can use the C macros -- but this is hopefully an adequate implementation
for now, and the interface can (with luck) remain the same even if the
implementation changes.
A few small notable missing features:
- Attributes (like the speed of a test) are hardcoded to the default
value.
- Similarly, the module name attribute is hardcoded to NULL. In C, we
use the KBUILD_MODNAME macro, but I couldn't find a way to use this
from Rust which wasn't more ugly than just disabling it.
- Assertions are not automatically rewritten to use KUnit assertions.
---
Changes since v3:
https://lore.kernel.org/linux-kselftest/20241030045719.3085147-2-davidgow@g…
- The kunit_unsafe_test_suite!() macro now panic!s if the suite name is
too long, triggering a compile error. (Thanks, Alice!)
- The #[kunit_tests()] macro now preserves span information, so
errors can be better reported. (Thanks, Boqun!)
- The example tests have been updated to no longer use assert_eq!() with
a constant bool argument (which triggered a clippy warning now we
have the span info).
Changes since v2:
https://lore.kernel.org/linux-kselftest/20241029092422.2884505-1-davidgow@g…
- Include missing rust/macros/kunit.rs file from v2. (Thanks Boqun!)
- The kunit_unsafe_test_suite!() macro will truncate the name of the
suite if it is too long. (Thanks Alice!)
- The proc macro now emits an error if the suite name is too long.
- We no longer needlessly use UnsafeCell<> in
kunit_unsafe_test_suite!(). (Thanks Alice!)
Changes since v1:
https://lore.kernel.org/lkml/20230720-rustbind-v1-0-c80db349e3b5@google.com…
- Rebase on top of the latest rust-next (commit 718c4069896c)
- Make kunit_case a const fn, rather than a macro (Thanks Boqun)
- As a result, the null terminator is now created with
kernel::kunit::kunit_case_null()
- Use the C kunit_get_current_test() function to implement
in_kunit_test(), rather than re-implementing it (less efficiently)
ourselves.
Changes since the GitHub PR:
- Rebased on top of kselftest/kunit
- Add const_mut_refs feature
This may conflict with https://lore.kernel.org/lkml/20230503090708.2524310-6-nmi@metaspace.dk/
- Add rust/macros/kunit.rs to the KUnit MAINTAINERS entry
---
José Expósito (3):
rust: kunit: add KUnit case and suite macros
rust: macros: add macro to easily run KUnit tests
rust: kunit: allow to know if we are in a test
MAINTAINERS | 1 +
rust/kernel/kunit.rs | 194 +++++++++++++++++++++++++++++++++++++++++++
rust/kernel/lib.rs | 1 +
rust/macros/kunit.rs | 168 +++++++++++++++++++++++++++++++++++++
rust/macros/lib.rs | 29 +++++++
5 files changed, 393 insertions(+)
create mode 100644 rust/macros/kunit.rs
--
2.47.0.199.ga7371fff76-goog
Greetings:
Welcome to v8, see changelog below.
This revision is a quick follow-up to v7 and only drops the unneeded
exports in patch 2. No other changes.
This series introduces a new mechanism, IRQ suspension, which allows
network applications using epoll to mask IRQs during periods of high
traffic while also reducing tail latency (compared to existing
mechanisms, see below) during periods of low traffic. In doing so, this
balances CPU consumption with network processing efficiency.
Martin Karsten (CC'd) and I have been collaborating on this series for
several months and have appreciated the feedback from the community on
our RFC [1]. We've updated the cover letter and kernel documentation in
an attempt to more clearly explain how this mechanism works, how
applications can use it, and how it compares to existing mechanisms in
the kernel.
I briefly mentioned this idea at netdev conf 2024 (for those who were
there) and Martin described this idea in an earlier paper presented at
Sigmetrics 2024 [2].
~ The short explanation (TL;DR)
We propose adding a new napi config parameter: irq_suspend_timeout to
help balance CPU usage and network processing efficiency when using IRQ
deferral and napi busy poll.
If this parameter is set to a non-zero value *and* a user application
has enabled preferred busy poll on a busy poll context (via the
EPIOCSPARAMS ioctl introduced in commit 18e2bf0edf4d ("eventpoll: Add
epoll ioctl for epoll_params")), then application calls to epoll_wait
for that context will cause device IRQs and softirq processing to be
suspended as long as epoll_wait successfully retrieves data from the
NAPI. Each time data is retrieved, the irq_suspend_timeout is deferred.
If/when network traffic subsides and epoll_wait returns no data, IRQ
suspension is immediately reverted back to the existing
napi_defer_hard_irqs and gro_flush_timeout mechanism which was
introduced in commit 6f8b12d661d0 ("net: napi: add hard irqs deferral
feature")).
The irq_suspend_timeout serves as a safety mechanism. If userland takes
a long time processing data, irq_suspend_timeout will fire and restart
normal NAPI processing.
For a more in depth explanation, please continue reading.
~ Comparison with existing mechanisms
Interrupt mitigation can be accomplished in napi software, by setting
napi_defer_hard_irqs and gro_flush_timeout, or via interrupt coalescing
in the NIC. This can be quite efficient, but in both cases, a fixed
timeout (or packet count) needs to be configured. However, a fixed
timeout cannot effectively support both low- and high-load situations:
At low load, an application typically processes a few requests and then
waits to receive more input data. In this scenario, a large timeout will
cause unnecessary latency.
At high load, an application typically processes many requests before
being ready to receive more input data. In this case, a small timeout
will likely fire prematurely and trigger irq/softirq processing, which
interferes with the application's execution. This causes overhead, most
likely due to cache contention.
While NICs attempt to provide adaptive interrupt coalescing schemes,
these cannot properly take into account application-level processing.
An alternative packet delivery mechanism is busy-polling, which results
in perfect alignment of application processing and network polling. It
delivers optimal performance (throughput and latency), but results in
100% cpu utilization and is thus inefficient for below-capacity
workloads.
We propose to add a new packet delivery mode that properly alternates
between busy polling and interrupt-based delivery depending on busy and
idle periods of the application. During a busy period, the system
operates in busy-polling mode, which avoids interference. During an idle
period, the system falls back to interrupt deferral, but with a small
timeout to avoid excessive latencies. This delivery mode can also be
viewed as an extension of basic interrupt deferral, but alternating
between a small and a very large timeout.
This delivery mode is efficient, because it avoids softirq execution
interfering with application processing during busy periods. It can be
used with blocking epoll_wait to conserve cpu cycles during idle
periods. The effect of alternating between busy and idle periods is that
performance (throughput and latency) is very close to full busy polling,
while cpu utilization is lower and very close to interrupt mitigation.
~ Usage details
IRQ suspension is introduced via a per-NAPI configuration parameter that
controls the maximum time that IRQs can be suspended.
Here's how it is intended to work:
- The user application (or system administrator) uses the netdev-genl
netlink interface to set the pre-existing napi_defer_hard_irqs and
gro_flush_timeout NAPI config parameters to enable IRQ deferral.
- The user application (or system administrator) sets the proposed
irq_suspend_timeout parameter via the netdev-genl netlink interface
to a larger value than gro_flush_timeout to enable IRQ suspension.
- The user application issues the existing epoll ioctl to set the
prefer_busy_poll flag on the epoll context.
- The user application then calls epoll_wait to busy poll for network
events, as it normally would.
- If epoll_wait returns events to userland, IRQs are suspended for the
duration of irq_suspend_timeout.
- If epoll_wait finds no events and the thread is about to go to
sleep, IRQ handling using napi_defer_hard_irqs and gro_flush_timeout
is resumed.
As long as epoll_wait is retrieving events, IRQs (and softirq
processing) for the NAPI being polled remain disabled. When network
traffic reduces, eventually a busy poll loop in the kernel will retrieve
no data. When this occurs, regular IRQ deferral using gro_flush_timeout
for the polled NAPI is re-enabled.
Unless IRQ suspension is continued by subsequent calls to epoll_wait, it
automatically times out after the irq_suspend_timeout timer expires.
Regular deferral is also immediately re-enabled when the epoll context
is destroyed.
~ Usage scenario
The target scenario for IRQ suspension as packet delivery mode is a
system that runs a dominant application with substantial network I/O.
The target application can be configured to receive input data up to a
certain batch size (via epoll_wait maxevents parameter) and this batch
size determines the worst-case latency that application requests might
experience. Because packet delivery is suspended during the target
application's processing, the batch size also determines the worst-case
latency of concurrent applications using the same RX queue(s).
gro_flush_timeout should be set as small as possible, but large enough to
make sure that a single request is likely not being interfered with.
irq_suspend_timeout is largely a safety mechanism against misbehaving
applications. It should be set large enough to cover the processing of an
entire application batch, i.e., the factor between gro_flush_timeout and
irq_suspend_timeout should roughly correspond to the maximum batch size
that the target application would process in one go.
~ Important call out in the implementation
- Enabling per epoll-context preferred busy poll will now effectively
lead to a nonblocking iteration through napi_busy_loop, even when
busy_poll_usecs is 0. See patch 4.
~ Benchmark configs & descriptions
The changes were benchmarked with memcached [3] using the benchmarking
tool mutilate [4].
To facilitate benchmarking, a small patch [5] was applied to memcached
1.6.29 to allow setting per-epoll context preferred busy poll and other
settings via environment variables. Another small patch [6] was applied
to libevent to enable full busy-polling.
Multiple scenarios were benchmarked as described below and the scripts
used for producing these results can be found on github [7] (note: all
scenarios use NAPI-based traffic splitting via SO_INCOMING_ID by passing
-N to memcached):
- base:
- no other options enabled
- deferX:
- set defer_hard_irqs to 100
- set gro_flush_timeout to X,000
- napibusy:
- set defer_hard_irqs to 100
- set gro_flush_timeout to 200,000
- enable busy poll via the existing ioctl (busy_poll_usecs = 64,
busy_poll_budget = 64, prefer_busy_poll = true)
- fullbusy:
- set defer_hard_irqs to 100
- set gro_flush_timeout to 5,000,000
- enable busy poll via the existing ioctl (busy_poll_usecs = 1000,
busy_poll_budget = 64, prefer_busy_poll = true)
- change memcached's nonblocking epoll_wait invocation (via
libevent) to using a 1 ms timeout
- suspend0:
- set defer_hard_irqs to 0
- set gro_flush_timeout to 0
- set irq_suspend_timeout to 20,000,000
- enable busy poll via the existing ioctl (busy_poll_usecs = 0,
busy_poll_budget = 64, prefer_busy_poll = true)
- suspendX:
- set defer_hard_irqs to 100
- set gro_flush_timeout to X,000
- set irq_suspend_timeout to 20,000,000
- enable busy poll via the existing ioctl (busy_poll_usecs = 0,
busy_poll_budget = 64, prefer_busy_poll = true)
~ Benchmark results
Tested on:
Single socket AMD EPYC 7662 64-Core Processor
Hyperthreading disabled
4 NUMA Zones (NPS=4)
16 CPUs per NUMA zone (64 cores total)
2 x Dual port 100gbps Mellanox Technologies ConnectX-5 Ex EN NIC
The test machine is configured such that a single interface has 8 RX
queues. The queues' IRQs and memcached are pinned to CPUs that are
NUMA-local to the interface which is under test. The NIC's interrupt
coalescing configuration is left at boot-time defaults.
Results:
Results are shown below. The mechanism added by this series is
represented by the 'suspend' cases. Data presented shows a summary over
nearly 10 runs of each test case [8] using the scripts on github [7].
For latency, the median is shown. For throughput and CPU utilization,
the average is shown.
The results also include cycles-per-query (cpq) and
instruction-per-query (ipq) metrics, following the methodology proposed
in [2], to augment the CPU utilization numbers, which could be skewed
due to frequency scaling. We find that this does not appear to be the
case as CPU utilization and low-level metrics show similar trends.
These results were captured using the scripts on github [7] to
illustrate how this approach compares with other pre-existing
mechanisms. This data is not to be interpreted as scientific data
captured in a fully isolated lab setting, but instead as best effort,
illustrative information comparing and contrasting tradeoffs.
The absolute QPS results shift between submissions, but the
relative differences are equivalent. As patches are rebased,
several factors likely influence overall performance.
Compare:
- Throughput (MAX) and latencies of base vs suspend.
- CPU usage of napibusy and fullbusy during lower load (200K, 400K for
example) vs suspend.
- Latency of the defer variants vs suspend as timeout and load
increases.
- suspend0, which sets defer_hard_irqs and gro_flush_timeout to 0, has
nearly the same performance as the base case (this is FAQ item #1).
The overall takeaway is that the suspend variants provide a superior
combination of high throughput, low latency, and low cpu utilization
compared to all other variants. Each of the suspend variants works very
well, but some fine-tuning between latency and cpu utilization is still
possible by tuning the small timeout (gro_flush_timeout).
Note: we've reorganized the results to make comparison among testcases
with the same load easier.
testcase load qps avglat 95%lat 99%lat cpu cpq ipq
base 200K 199946 112 239 416 26 12973 11343
defer10 200K 199971 54 124 142 29 19412 17460
defer20 200K 199986 60 130 153 26 15644 14095
defer50 200K 200025 79 144 182 23 12122 11632
defer200 200K 199999 164 254 309 19 8923 9635
fullbusy 200K 199998 46 118 133 100 43658 23133
napibusy 200K 199983 100 237 277 56 24840 24716
suspend0 200K 200020 105 249 432 30 14264 11796
suspend10 200K 199950 53 123 141 32 19518 16903
suspend20 200K 200037 58 126 151 30 16426 14736
suspend50 200K 199961 73 136 177 26 13310 12633
suspend200 200K 199998 149 251 306 21 9566 10203
testcase load qps avglat 95%lat 99%lat cpu cpq ipq
base 400K 400014 139 269 707 41 9476 9343
defer10 400K 400016 59 133 166 53 13991 12989
defer20 400K 399952 67 140 172 47 12063 11644
defer50 400K 400007 87 162 198 39 9384 9880
defer200 400K 399979 181 274 330 31 7089 8430
fullbusy 400K 399987 50 123 156 100 21827 16037
napibusy 400K 400014 76 222 272 83 18185 16529
suspend0 400K 400015 127 350 776 47 10699 9603
suspend10 400K 400023 57 129 164 54 13758 13178
suspend20 400K 400043 62 135 169 49 12071 11826
suspend50 400K 400071 76 149 186 42 10011 10301
suspend200 400K 399961 154 269 327 34 7827 8774
testcase load qps avglat 95%lat 99%lat cpu cpq ipq
base 600K 599951 149 266 574 61 9265 8876
defer10 600K 600006 71 147 203 76 11866 10936
defer20 600K 600123 76 152 203 66 10430 10342
defer50 600K 600162 95 172 217 54 8526 9142
defer200 600K 599942 200 301 357 46 6977 8212
fullbusy 600K 599990 55 127 177 100 14551 13983
napibusy 600K 600035 63 160 250 96 13937 14140
suspend0 600K 599903 127 320 732 68 10166 8963
suspend10 600K 599908 63 137 192 69 10902 11100
suspend20 600K 599961 66 141 194 65 9976 10370
suspend50 600K 599973 80 159 204 57 8678 9381
suspend200 600K 600010 157 277 346 48 7133 8381
testcase load qps avglat 95%lat 99%lat cpu cpq ipq
base 800K 800039 181 300 536 87 9585 8304
defer10 800K 800038 181 530 939 96 10564 8970
defer20 800K 800029 112 225 329 90 10056 8935
defer50 800K 799999 120 208 296 82 9234 8562
defer200 800K 800066 227 338 401 63 7117 8129
fullbusy 800K 800040 61 134 190 100 10913 12608
napibusy 800K 799944 64 141 214 99 10828 12588
suspend0 800K 799911 126 248 509 85 9346 8498
suspend10 800K 800006 69 143 200 83 9410 9845
suspend20 800K 800120 74 150 207 78 8786 9454
suspend50 800K 799989 87 168 224 71 7946 8833
suspend200 800K 799987 160 292 357 62 6923 8229
testcase load qps avglat 95%lat 99%lat cpu cpq ipq
base 1000K 906879 4079 5751 6216 98 9496 7904
defer10 1000K 860849 3643 6274 6730 99 10040 8676
defer20 1000K 896063 3298 5840 6349 98 9620 8237
defer50 1000K 919782 2962 5513 5807 97 9284 7951
defer200 1000K 970941 3059 5348 5984 95 8593 7959
fullbusy 1000K 999950 70 150 207 100 8732 10777
napibusy 1000K 999996 78 154 223 100 8722 10656
suspend0 1000K 949706 2666 5770 6660 99 9071 8046
suspend10 1000K 1000024 80 160 220 92 8137 9035
suspend20 1000K 1000059 83 165 226 89 7850 8804
suspend50 1000K 999955 95 180 240 84 7411 8459
suspend200 1000K 999914 163 299 366 77 6833 8078
testcase load qps avglat 95%lat 99%lat cpu cpq ipq
base MAX 1037654 4184 5453 5810 100 8411 7938
defer10 MAX 905607 4840 6151 6380 100 9639 8431
defer20 MAX 986463 4455 5594 5796 100 8848 8110
defer50 MAX 1077030 4000 5073 5299 100 8104 7920
defer200 MAX 1040728 4152 5385 5765 100 8379 7849
fullbusy MAX 1247536 3518 3935 3984 100 6998 7930
napibusy MAX 1136310 3799 7756 9964 100 7670 7877
suspend0 MAX 1057509 4132 5724 6185 100 8253 7918
suspend10 MAX 1215147 3580 3957 4041 100 7185 7944
suspend20 MAX 1216469 3576 3953 3988 100 7175 7950
suspend50 MAX 1215871 3577 3961 4075 100 7181 7949
suspend200 MAX 1216882 3556 3951 3988 100 7175 7955
~ FAQ
- Why is a new parameter needed? Does irq_suspend_timeout override
gro_flush_timeout?
Using the suspend mechanism causes the system to alternate between
polling mode and irq-driven packet delivery. During busy periods,
irq_suspend_timeout overrides gro_flush_timeout and keeps the system
busy polling, but when epoll finds no events, the setting of
gro_flush_timeout and napi_defer_hard_irqs determine the next step.
There are essentially three possible loops for network processing and
packet delivery:
1) hardirq -> softirq -> napi poll; basic interrupt delivery
2) timer -> softirq -> napi poll; deferred irq processing
3) epoll -> busy-poll -> napi poll; busy looping
Loop 2 can take control from Loop 1, if gro_flush_timeout and
napi_defer_hard_irqs are set.
If gro_flush_timeout and napi_defer_hard_irqs are set, Loops 2 and
3 "wrestle" with each other for control. During busy periods,
irq_suspend_timeout is used as timer in Loop 2, which essentially
tilts this in favour of Loop 3.
If gro_flush_timeout and napi_defer_hard_irqs are not set, Loop 3
cannot take control from Loop 1.
Therefore, setting gro_flush_timeout and napi_defer_hard_irqs is the
recommended usage, because otherwise setting irq_suspend_timeout
might not have any discernible effect.
This is shown in the results above: compare suspend0 with the base
case. Note that the lack of napi_defer_hard_irqs and
gro_flush_timeout produce similar results for both, which encourages
the use of napi_defer_hard_irqs and gro_flush_timeout in addition to
irq_suspend_timeout.
- Can the new timeout value be threaded through the new epoll ioctl ?
It is possible, but presents challenges for userspace. User
applications must ensure that the file descriptors added to epoll
contexts have the same NAPI ID to support busy polling.
An epoll context is not permanently tied to any particular NAPI ID.
So, a user application could decide to clear the file descriptors
from the context and add a new set of file descriptors with a
different NAPI ID to the context. Busy polling would work as
expected, but the meaning of the suspend timeout becomes ambiguous
because IRQs are not inherently associated with epoll contexts, but
rather with the NAPI. The user program would need to reissue the
ioctl to set the irq_suspend_timeout, but the napi_defer_hard_irqs
and gro_flush_timeout settings would come from the NAPI's
napi_config (which are set either by sysfs or by netlink). Such an
interface seems awkard to use from a user perspective.
Further, IRQs are related to NAPIs, which is why they are stored in
the napi_config space. Putting the irq_suspend_timeout in
the epoll context while other IRQ deferral mechanisms remain in the
NAPI's napi_config space seems like an odd design choice.
We've opted to keep all of the IRQ deferral parameters together and
place the irq_suspend_timeout in napi_config. This has nice benefits
for userspace: if a user app were to remove all file descriptors
from an epoll context and add new file descriptors with a new NAPI ID,
the correct suspend timeout for that NAPI ID would be used automatically
without the user application needing to do anything (like re-issuing an
ioctl, for example). All IRQ deferral related parameters are in one
place and can all be set the same way: with netlink.
- Can irq suspend be built by combining NIC coalescing and
gro_flush_timeout ?
No. The problem is that the long timeout must engage if and only if
prefer-busy is active.
When using NIC coalescing for the short timeout (without
napi_defer_hard_irqs/gro_flush_timeout), an interrupt after an idle
period will trigger softirq, which will run napi polling. At this
point, prefer-busy is not active, so NIC interrupts would be
re-enabled. Then it is not possible for the longer timeout to
interject to switch control back to polling. In other words, only by
using the software timer for the short timeout, it is possible to
extend the timeout without having to reprogram the NIC timer or
reach down directly and disable interrupts.
Using gro_flush_timeout for the long timeout also has problems, for
the same underlying reason. In the current napi implementation,
gro_flush_timeout is not tied to prefer-busy. We'd either have to
change that and in the process modify the existing deferral
mechanism, or introduce a state variable to determine whether
gro_flush_timeout is used as long timeout for irq suspend or whether
it is used for its default purpose. In an earlier version, we did
try something similar to the latter and made it work, but it ends up
being a lot more convoluted than our current proposal.
- Isn't it already possible to combine busy looping with irq deferral?
Yes, in fact enabling irq deferral via napi_defer_hard_irqs and
gro_flush_timeout is a precondition for prefer_busy_poll to have an
effect. If the application also uses a tight busy loop with
essentially nonblocking epoll_wait (accomplished with a very short
timeout parameter), this is the fullbusy case shown in the results.
An application using blocking epoll_wait is shown as the napibusy
case in the results. It's a hybrid approach that provides limited
latency benefits compared to the base case and plain irq deferral,
but not as good as fullbusy or suspend.
~ Special thanks
Several people were involved in earlier stages of the development of this
mechanism whom we'd like to thank:
- Peter Cai (CC'd), for the initial kernel patch and his contributions
to the paper.
- Mohammadamin Shafie (CC'd), for testing various versions of the kernel
patch and providing helpful feedback.
Thanks,
Martin and Joe
[1]: https://lore.kernel.org/netdev/20240812125717.413108-1-jdamato@fastly.com/
[2]: https://doi.org/10.1145/3626780
[3]: https://github.com/memcached/memcached/blob/master/doc/napi_ids.txt
[4]: https://github.com/leverich/mutilate
[5]: https://raw.githubusercontent.com/martinkarsten/irqsuspend/main/patches/mem…
[6]: https://raw.githubusercontent.com/martinkarsten/irqsuspend/main/patches/lib…
[7]: https://github.com/martinkarsten/irqsuspend
[8]: https://github.com/martinkarsten/irqsuspend/tree/main/results
v8:
- Update patch 2 to drop the exports, as requested by Jakub.
v7: https://lore.kernel.org/netdev/20241108023912.98416-1-jdamato@fastly.com/
- Jakub noted that patch 2 adds unnecessary complexity by checking the
suspend timeout in the NAPI loop. This makes the code more
complicated and difficult to reason about. He's right; we've dropped
patch 2 which simplifies this series.
- Updated the cover letter with a full re-run of all test cases.
- Updated FAQ #2.
v6: https://lore.kernel.org/netdev/20241104215542.215919-1-jdamato@fastly.com/
- Updated the cover letter with a full re-run of all test cases,
including a new case suspend0, as requested by Sridhar previously.
- Updated the kernel documentation in patch 7 as suggested by Bagas
Sanjaya, which improved the htmldoc output.
v5: https://lore.kernel.org/netdev/20241103052421.518856-1-jdamato@fastly.com/
- Adjusted patch 5 to only suspend IRQs when ep_send_events returns a
positive return value. This issue was pointed out by Hillf Danton.
- Updated the commit message of patch 6 which still mentioned netcat,
despite the code being updated in v4 to replace it with socat and fixed
misspelling of netdevsim.
- Fixed a minor typo in patch 7 and removed an unnecessary paragraph.
- Added Sridhar Samudrala's Reviewed-by to patch 1-5 and 7.
v4: https://lore.kernel.org/netdev/20241102005214.32443-1-jdamato@fastly.com/
- Added a new FAQ item to cover letter.
- Updated patch 6 to use socat instead of nc in busy_poll_test.sh and
updated busy_poller.c to use netlink directly to configure napi
params.
- Updated the kernel documentation in patch 7 to include more details.
- Dropped Stanislav's Acked-by and Bagas' Reviewed-by from patch 7
since the documentation was updated.
v3: https://lore.kernel.org/netdev/20241101004846.32532-1-jdamato@fastly.com/
- Added Stanislav Fomichev's Acked-by to every patch except the newly
added selftest.
- Added Bagas Sanjaya's Reviewed-by to the documentation patch.
- Fixed the commit message of patch 2 to remove a reference to the now
non-existent sysfs setting.
- Added a self test which tests both "regular" busy poll and busy poll
with suspend enabled. This was added as patch 6 as requested by
Paolo. netdevsim was chosen instead of veth due to netdevsim's
pre-existing support for netdev-genl. See the commit message of
patch 6 for more details.
v2: https://lore.kernel.org/bpf/20241021015311.95468-1-jdamato@fastly.com/
- Cover letter updated, including a re-run of test data.
- Patch 1 rewritten to use netdev-genl instead of sysfs.
- Patch 3 updated with a comment added to napi_resume_irqs.
- Patch 4 rebased to apply now that commit b9ca079dd6b0 ("eventpoll:
Annotate data-race of busy_poll_usecs") has been picked up from VFS.
- Patch 6 updated the kernel documentation.
rfc -> v1:
- Cover letter updated to include more details.
- Patch 1 updated to remove the documentation added. This was moved to
patch 6 with the rest of the docs (see below).
- Patch 5 updated to fix an error uncovered by the kernel build robot.
See patch 5's changelog for more details.
- Patch 6 added which updates kernel documentation.
Joe Damato (2):
selftests: net: Add busy_poll_test
docs: networking: Describe irq suspension
Martin Karsten (4):
net: Add napi_struct parameter irq_suspend_timeout
net: Add control functions for irq suspension
eventpoll: Trigger napi_busy_loop, if prefer_busy_poll is set
eventpoll: Control irq suspension for prefer_busy_poll
Documentation/netlink/specs/netdev.yaml | 7 +
Documentation/networking/napi.rst | 170 ++++++++-
fs/eventpoll.c | 36 +-
include/linux/netdevice.h | 2 +
include/net/busy_poll.h | 3 +
include/uapi/linux/netdev.h | 1 +
net/core/dev.c | 39 +++
net/core/dev.h | 25 ++
net/core/netdev-genl-gen.c | 5 +-
net/core/netdev-genl.c | 12 +
tools/include/uapi/linux/netdev.h | 1 +
tools/testing/selftests/net/.gitignore | 1 +
tools/testing/selftests/net/Makefile | 3 +-
tools/testing/selftests/net/busy_poll_test.sh | 164 +++++++++
tools/testing/selftests/net/busy_poller.c | 328 ++++++++++++++++++
15 files changed, 790 insertions(+), 7 deletions(-)
create mode 100755 tools/testing/selftests/net/busy_poll_test.sh
create mode 100644 tools/testing/selftests/net/busy_poller.c
base-commit: dc7c381bb8649e3701ed64f6c3e55316675904d7
--
2.25.1
Commit 6e182dc9f268 ("selftests/mm: Use generic pkey register
manipulation") makes use of PKEY_UNRESTRICTED in
pkey_sighandler_tests. The macro has been proposed for addition to
uapi headers [1], but the patch hasn't landed yet.
Define PKEY_UNRESTRICTED in pkey-helpers.h for the time being to fix
the build.
[1] https://lore.kernel.org/all/20241028090715.509527-2-yury.khrustalev@arm.com/
Fixes: 6e182dc9f268 ("selftests/mm: Use generic pkey register
manipulation")
Reported-by: Aishwarya TCV <aishwarya.tcv(a)arm.com>
Signed-off-by: Kevin Brodsky <kevin.brodsky(a)arm.com>
---
Based on arm64 for-next/pkey-signal (49f59573e9e0).
---
tools/testing/selftests/mm/pkey-helpers.h | 4 ++++
1 file changed, 4 insertions(+)
diff --git a/tools/testing/selftests/mm/pkey-helpers.h b/tools/testing/selftests/mm/pkey-helpers.h
index 9ab6a3ee153b..319f5b6b7132 100644
--- a/tools/testing/selftests/mm/pkey-helpers.h
+++ b/tools/testing/selftests/mm/pkey-helpers.h
@@ -112,6 +112,10 @@ void record_pkey_malloc(void *ptr, long size, int prot);
#define PKEY_MASK (PKEY_DISABLE_ACCESS | PKEY_DISABLE_WRITE)
#endif
+#ifndef PKEY_UNRESTRICTED
+#define PKEY_UNRESTRICTED 0x0
+#endif
+
#ifndef set_pkey_bits
static inline u64 set_pkey_bits(u64 reg, int pkey, u64 flags)
{
--
2.43.0
Hello colleagues,
Following up on Tim Bird's presentation "Adding benchmarks results
support to KTAP/kselftest", I would like to share some thoughts on
kernel benchmarking and kernel performance evaluation. Tim suggested
sharing these comments with the wider kselftest community for
discussion.
The topic of performance evaluation is obviously extremely complex, so
I’ve organised my comments into several paragraphs, each of which
focuses on a specific aspect. This should make it easier to follow and
understand the key points, such as metrics, reference values, results
data lake, interpretation of contradictory results, system profiles,
analysis and methodology.
# Metrics
A few remarks on benchmark metrics which were called “values” in the
original presentation:
- Metrics must be accompanied by standardised units. This
standardisation ensures consistency across different tests and
environments, simplifying accurate comparisons and analysis.
- Each metric should be clearly labelled with its nature or kind
(throughput, speed, latency, etc). This classification is essential
for proper interpretation of the results and prevents
misunderstandings that could lead to incorrect conclusions.
- Presentation contains "May also include allowable variance", but
variance must be included into the analysis as we deal with
statistical calculations and multiple randomised values.
- I would like to note that other statistical parameters are also
worth including into comparison, like confidence levels, sample size
and so on.
# Reference Values
The concept of "reference values" introduced in the slides could be
significantly enhanced by implementing a collaborative, transparent
system for data collection and validation. This system could operate
as follows:
- Data Collection: Any user could submit benchmark results to a
centralised and public repository. This would allow for a diverse
range of hardware configurations and use cases to be represented.
- Vendor Validation: Hardware vendors would have the opportunity to
review submitted results pertaining to their products. They could then
mark certain results as "Vendor Approved," indicating that the results
align with their own testing and expectations.
- Community Review: The broader community of users and experts could
also review and vote on submitted results. Results that receive
substantial positive feedback could be marked as "Community Approved,"
providing an additional layer of validation.
- Automated Validation: Reference values must be checked, validated
and supported by multiple sources. This can be done only in an
automatic way as those processes are time consuming and require
extreme attention to details.
- Transparency: All submitted results would need to be accompanied by
detailed information about the testing environment, hardware
specifications, and methodology used. This would ensure
reproducibility and allow others to understand the context of each
result.
- Trust Building: The combination of vendor and community approval
would help establish trust in the reference values. It would mitigate
concerns about marketing bias and provide a more reliable basis for
performance comparisons.
- Accessibility: The system would be publicly accessible, allowing
anyone to reference and utilise this data in their own testing and
analysis.
Implementation of such a system would require careful consideration of
governance and funding. A community-driven, non-profit organisation
sponsored by multiple stakeholders could be an appropriate model. This
structure would help maintain neutrality and avoid potential conflicts
of interest.
While the specifics of building and managing such a system would need
further exploration, this approach could significantly improve the
reliability and usefulness of reference values in benchmark testing.
It would foster a more collaborative and transparent environment for
performance evaluation in the Linux ecosystem as well as attract
interested vendors to submit and review results.
I’m not very informed about the current state of the community in this
field, but I’m sure you know better how exactly this can be done.
# Results Data Lake
Along with reference values it’s important to collect results on a
regular basis as the kernel evolves so results must follow this
evolution as well. To do this cloud-based data lake is needed (a
self-hosted system will be too expensive from my point of view).
This data lake should be able to collect and process incoming data as
well as to serve reference values for users. Data processing flow
should be quite standard: Collection -> Parsing + Enhancement ->
Storage -> Analysis -> Serving.
Tim proposed to use file names for reference files, I would like to
note that such approach could fail pretty fast if system will collect
more and more data and there will rise a need to have more granular
and detailed features to identify reference results and this can lead
to very long filenames, which will be hard to use. I propose to use
UUID4-based identification, which provides very low chances for
collision. Those IDs will be keys in the database with all information
required for clear identification of relevant results and
corresponding details. Moreover this approach can be easily extended
on the database side if more data is needed.
Yes, UUID4 is not human-readable, but do we need such an option if we
have tools, which can provide a better interface?
For example, this could be something like:
---
request: results-cli search -b "Test Suite D" -v "v1.2.3" -o "Ubuntu
22.04" -t "baseline" -m "response_time>100"
response:
[
{
"id": "550e8400-e29b-41d4-a716-446655440005",
"benchmark": "Test Suite A",
"version": "v1.2.3",
"target_os": "Ubuntu 22.04",
"metrics": {
"cpu_usage": 70.5,
"memory_usage": 2048,
"response_time": 120
},
"tags": ["baseline", "v1.0"],
"created_at": "2024-10-25T10:00:00Z"
},
...
]
---
or
request: results-cli search "<Domain-Specific-Language-Query>"
response: [ {}, {}, {}...]
---
or
request: results-cli get 550e8400-e29b-41d4-a716-446655440005
response:
{
"id": "550e8400-e29b-41d4-a716-446655440005",
"benchmark": "Test Suite A",
"version": "v1.2.3",
"target_os": "Ubuntu 22.04",
"metrics": {
"cpu_usage": 70.5,
"memory_usage": 2048,
"response_time": 120
},
"tags": ["baseline", "v1.0"],
"created_at": "2024-10-25T10:00:00Z"
}
---
or
request: curl -X POST http://api.example.com/references/search \ -d '{
"query": "benchmark = \"Test Suite A\" AND (version >= \"v1.2\" OR tag
IN [\"baseline\", \"regression\"]) AND cpu_usage > 60" }'
...
---
Another point of use DB-based approach is the following: in case when
a user works with particular hardware and/or would like to use a
reference he/she does not need a full database with all collected
reference values, but only a small slice of it. This slice can be
downloaded from public repo or accessed via API.
# Large results dataset
If we collect a large benchmarks dataset in one place accompanied with
detailed information about target systems from which this dataset was
collected, then it will allow us to calculate precise baselines across
different compositions of parameters, making performance deviations
easier to detect. Long-term trend analysis can identify small changes
and correlate them with updates, revealing performance drift.
Another use of such a database - predictive modelling, which can
provide forecasts of expected results and setting dynamic performance
thresholds, enabling early issue detection. Anomaly detection becomes
more effective with context, distinguishing unusual deviations from
normal behaviour.
# Interpretation of contradictory results
It’s not clear how to deal with contradictory results to make a
decision on regression presence. For example, we have a set of 10
tests, which test more or less the same, for example disk performance.
It’s unclear what to do when one subset of tests show degradation and
another subset shows neutral status or improvements. Is there a
regression?
I suppose that the availability of historical data can help to deal
with such situations as historical data can show behaviour of
particular tests and allow to assign weights in decision-making
algorithms, but it’s just my guess.
# System Profiles
Tim's idea to reduce results to - “pass / fail” and my experience with
various people trying to interpret benchmarking results led me to
think of “profiles” - a set of parameters and metrics collected from a
reference system while execution of a particular configuration of a
particular benchmark.
Profiles can be used for A/B comparison with pass/fail outcomes or
match/not match, and this approach does not hide/miss the details and
allows to capture multiple characteristics of the experiment, like
presence of outliers/errors or skewed distribution form. Interested
persons (like kernel developers or performance engineers, for example)
can dig deeper to find a reason for such mismatch and those who are
interested just in high-level results - pass/fail should be enough.
Here is how I imaging a structure of a profile:
---
profile_a:
system packages:
- pkg_1
- pkg_2
# Additional packages...
settings:
- cmdline
# Additional settings...
indicators:
cpu: null
ram: null
loadavg: null
# Additional indicators...
benchmark:
settings:
param_1: null
param_2: null
param_x: null
metrics:
metric_1: null
metric_2: null
metric_x: null
---
- System Packages, System Settings: Usually we do not pay much
attention to this, but I think it’s worth highlighting that base OS is
an important factor, as there are distribution-specific modifications
present in the filesystem. Most commonly developers and researchers
use Ubuntu (as the most popular distro) or Debian (as a cleaner and
lightweight version of Ubuntu), but distributions apply their own
patches to the kernel and system libraries, which may impact
performance. Another kind of base OS - cloud OS images which can be
modified by cloud providers to add internal packages & services which
could potentially affect performance as well. While comparing we must
take into account this aspect to compare apples-to-apples.
- System Indicators: These are periodic statistics like CPU
utilisation, RAM consumption, and other params collected before
benchmarking, while benchmarking and after benchmarking.
- Benchmark Settings: Benchmarking systems have multiple parameters,
so it’s important to capture them and use them in analysis.
- Benchmark Metrics: That’s obviously - benchmark results. It’s not a
rare case when a benchmark test provides more than a single number.
# Analysis
Proposed rules-based analysis will work only for highly determined
environments and systems, where rules can describe all the aspects.
Rule-based systems are easier to understand and implement than other
types, but for a small set of rules. However, we deal with the live
system and it constantly evolves, so rules will deprecate extremely
fast. It's the same story as with rule-based recommended systems in
early years of machine learning.
If you want to follow a rules-based approach, it's probably worth
taking a look at https://www.clipsrules.net as this will allow to
decouple results from analysis and avoid reinventing the analysis
engine.
Declaration of those rules will be error-prone due to the nature of
their origin - they must be declared and maintained by humans. IMHO a
human-less approach and use modern ML methods instead would be more
beneficial in the long run.
# Methodology
Used methodology - another aspect which is not directly related to
Tim's slides, but it's an important topic for results processing and
interpretation, probably an idea of automated results interpretation
can force use of one or another methodology.
# Next steps
I would be glad to participate in further discussions and share
experience to improve kernel performance testing automation, analysis
and interpretation of results. If there is interest, I'm open to
collaborating on implementing some of these ideas.
--
Best regards,
Konstantin Belov
Greetings:
Welcome to v7, see changelog below.
This revision drops patch 2 from the previous version, which was
unnecessary as Jakub pointed out. The tests were fully re-run after
dropping the patch to confirm the performance numbers hold; they do -
see below. FAQ #2 has been updated to be more verbose, as well.
This series introduces a new mechanism, IRQ suspension, which allows
network applications using epoll to mask IRQs during periods of high
traffic while also reducing tail latency (compared to existing
mechanisms, see below) during periods of low traffic. In doing so, this
balances CPU consumption with network processing efficiency.
Martin Karsten (CC'd) and I have been collaborating on this series for
several months and have appreciated the feedback from the community on
our RFC [1]. We've updated the cover letter and kernel documentation in
an attempt to more clearly explain how this mechanism works, how
applications can use it, and how it compares to existing mechanisms in
the kernel.
I briefly mentioned this idea at netdev conf 2024 (for those who were
there) and Martin described this idea in an earlier paper presented at
Sigmetrics 2024 [2].
~ The short explanation (TL;DR)
We propose adding a new napi config parameter: irq_suspend_timeout to
help balance CPU usage and network processing efficiency when using IRQ
deferral and napi busy poll.
If this parameter is set to a non-zero value *and* a user application
has enabled preferred busy poll on a busy poll context (via the
EPIOCSPARAMS ioctl introduced in commit 18e2bf0edf4d ("eventpoll: Add
epoll ioctl for epoll_params")), then application calls to epoll_wait
for that context will cause device IRQs and softirq processing to be
suspended as long as epoll_wait successfully retrieves data from the
NAPI. Each time data is retrieved, the irq_suspend_timeout is deferred.
If/when network traffic subsides and epoll_wait returns no data, IRQ
suspension is immediately reverted back to the existing
napi_defer_hard_irqs and gro_flush_timeout mechanism which was
introduced in commit 6f8b12d661d0 ("net: napi: add hard irqs deferral
feature")).
The irq_suspend_timeout serves as a safety mechanism. If userland takes
a long time processing data, irq_suspend_timeout will fire and restart
normal NAPI processing.
For a more in depth explanation, please continue reading.
~ Comparison with existing mechanisms
Interrupt mitigation can be accomplished in napi software, by setting
napi_defer_hard_irqs and gro_flush_timeout, or via interrupt coalescing
in the NIC. This can be quite efficient, but in both cases, a fixed
timeout (or packet count) needs to be configured. However, a fixed
timeout cannot effectively support both low- and high-load situations:
At low load, an application typically processes a few requests and then
waits to receive more input data. In this scenario, a large timeout will
cause unnecessary latency.
At high load, an application typically processes many requests before
being ready to receive more input data. In this case, a small timeout
will likely fire prematurely and trigger irq/softirq processing, which
interferes with the application's execution. This causes overhead, most
likely due to cache contention.
While NICs attempt to provide adaptive interrupt coalescing schemes,
these cannot properly take into account application-level processing.
An alternative packet delivery mechanism is busy-polling, which results
in perfect alignment of application processing and network polling. It
delivers optimal performance (throughput and latency), but results in
100% cpu utilization and is thus inefficient for below-capacity
workloads.
We propose to add a new packet delivery mode that properly alternates
between busy polling and interrupt-based delivery depending on busy and
idle periods of the application. During a busy period, the system
operates in busy-polling mode, which avoids interference. During an idle
period, the system falls back to interrupt deferral, but with a small
timeout to avoid excessive latencies. This delivery mode can also be
viewed as an extension of basic interrupt deferral, but alternating
between a small and a very large timeout.
This delivery mode is efficient, because it avoids softirq execution
interfering with application processing during busy periods. It can be
used with blocking epoll_wait to conserve cpu cycles during idle
periods. The effect of alternating between busy and idle periods is that
performance (throughput and latency) is very close to full busy polling,
while cpu utilization is lower and very close to interrupt mitigation.
~ Usage details
IRQ suspension is introduced via a per-NAPI configuration parameter that
controls the maximum time that IRQs can be suspended.
Here's how it is intended to work:
- The user application (or system administrator) uses the netdev-genl
netlink interface to set the pre-existing napi_defer_hard_irqs and
gro_flush_timeout NAPI config parameters to enable IRQ deferral.
- The user application (or system administrator) sets the proposed
irq_suspend_timeout parameter via the netdev-genl netlink interface
to a larger value than gro_flush_timeout to enable IRQ suspension.
- The user application issues the existing epoll ioctl to set the
prefer_busy_poll flag on the epoll context.
- The user application then calls epoll_wait to busy poll for network
events, as it normally would.
- If epoll_wait returns events to userland, IRQs are suspended for the
duration of irq_suspend_timeout.
- If epoll_wait finds no events and the thread is about to go to
sleep, IRQ handling using napi_defer_hard_irqs and gro_flush_timeout
is resumed.
As long as epoll_wait is retrieving events, IRQs (and softirq
processing) for the NAPI being polled remain disabled. When network
traffic reduces, eventually a busy poll loop in the kernel will retrieve
no data. When this occurs, regular IRQ deferral using gro_flush_timeout
for the polled NAPI is re-enabled.
Unless IRQ suspension is continued by subsequent calls to epoll_wait, it
automatically times out after the irq_suspend_timeout timer expires.
Regular deferral is also immediately re-enabled when the epoll context
is destroyed.
~ Usage scenario
The target scenario for IRQ suspension as packet delivery mode is a
system that runs a dominant application with substantial network I/O.
The target application can be configured to receive input data up to a
certain batch size (via epoll_wait maxevents parameter) and this batch
size determines the worst-case latency that application requests might
experience. Because packet delivery is suspended during the target
application's processing, the batch size also determines the worst-case
latency of concurrent applications using the same RX queue(s).
gro_flush_timeout should be set as small as possible, but large enough to
make sure that a single request is likely not being interfered with.
irq_suspend_timeout is largely a safety mechanism against misbehaving
applications. It should be set large enough to cover the processing of an
entire application batch, i.e., the factor between gro_flush_timeout and
irq_suspend_timeout should roughly correspond to the maximum batch size
that the target application would process in one go.
~ Important call out in the implementation
- Enabling per epoll-context preferred busy poll will now effectively
lead to a nonblocking iteration through napi_busy_loop, even when
busy_poll_usecs is 0. See patch 4.
~ Benchmark configs & descriptions
The changes were benchmarked with memcached [3] using the benchmarking
tool mutilate [4].
To facilitate benchmarking, a small patch [5] was applied to memcached
1.6.29 to allow setting per-epoll context preferred busy poll and other
settings via environment variables. Another small patch [6] was applied
to libevent to enable full busy-polling.
Multiple scenarios were benchmarked as described below and the scripts
used for producing these results can be found on github [7] (note: all
scenarios use NAPI-based traffic splitting via SO_INCOMING_ID by passing
-N to memcached):
- base:
- no other options enabled
- deferX:
- set defer_hard_irqs to 100
- set gro_flush_timeout to X,000
- napibusy:
- set defer_hard_irqs to 100
- set gro_flush_timeout to 200,000
- enable busy poll via the existing ioctl (busy_poll_usecs = 64,
busy_poll_budget = 64, prefer_busy_poll = true)
- fullbusy:
- set defer_hard_irqs to 100
- set gro_flush_timeout to 5,000,000
- enable busy poll via the existing ioctl (busy_poll_usecs = 1000,
busy_poll_budget = 64, prefer_busy_poll = true)
- change memcached's nonblocking epoll_wait invocation (via
libevent) to using a 1 ms timeout
- suspend0:
- set defer_hard_irqs to 0
- set gro_flush_timeout to 0
- set irq_suspend_timeout to 20,000,000
- enable busy poll via the existing ioctl (busy_poll_usecs = 0,
busy_poll_budget = 64, prefer_busy_poll = true)
- suspendX:
- set defer_hard_irqs to 100
- set gro_flush_timeout to X,000
- set irq_suspend_timeout to 20,000,000
- enable busy poll via the existing ioctl (busy_poll_usecs = 0,
busy_poll_budget = 64, prefer_busy_poll = true)
~ Benchmark results
Tested on:
Single socket AMD EPYC 7662 64-Core Processor
Hyperthreading disabled
4 NUMA Zones (NPS=4)
16 CPUs per NUMA zone (64 cores total)
2 x Dual port 100gbps Mellanox Technologies ConnectX-5 Ex EN NIC
The test machine is configured such that a single interface has 8 RX
queues. The queues' IRQs and memcached are pinned to CPUs that are
NUMA-local to the interface which is under test. The NIC's interrupt
coalescing configuration is left at boot-time defaults.
Results:
Results are shown below. The mechanism added by this series is
represented by the 'suspend' cases. Data presented shows a summary over
nearly 10 runs of each test case [8] using the scripts on github [7].
For latency, the median is shown. For throughput and CPU utilization,
the average is shown.
The results also include cycles-per-query (cpq) and
instruction-per-query (ipq) metrics, following the methodology proposed
in [2], to augment the CPU utilization numbers, which could be skewed
due to frequency scaling. We find that this does not appear to be the
case as CPU utilization and low-level metrics show similar trends.
These results were captured using the scripts on github [7] to
illustrate how this approach compares with other pre-existing
mechanisms. This data is not to be interpreted as scientific data
captured in a fully isolated lab setting, but instead as best effort,
illustrative information comparing and contrasting tradeoffs.
The absolute QPS results shift between submissions, but the
relative differences are equivalent. As patches are rebased,
several factors likely influence overall performance.
Compare:
- Throughput (MAX) and latencies of base vs suspend.
- CPU usage of napibusy and fullbusy during lower load (200K, 400K for
example) vs suspend.
- Latency of the defer variants vs suspend as timeout and load
increases.
- suspend0, which sets defer_hard_irqs and gro_flush_timeout to 0, has
nearly the same performance as the base case (this is FAQ item #1).
The overall takeaway is that the suspend variants provide a superior
combination of high throughput, low latency, and low cpu utilization
compared to all other variants. Each of the suspend variants works very
well, but some fine-tuning between latency and cpu utilization is still
possible by tuning the small timeout (gro_flush_timeout).
Note: we've reorganized the results to make comparison among testcases
with the same load easier.
testcase load qps avglat 95%lat 99%lat cpu cpq ipq
base 200K 199946 112 239 416 26 12973 11343
defer10 200K 199971 54 124 142 29 19412 17460
defer20 200K 199986 60 130 153 26 15644 14095
defer50 200K 200025 79 144 182 23 12122 11632
defer200 200K 199999 164 254 309 19 8923 9635
fullbusy 200K 199998 46 118 133 100 43658 23133
napibusy 200K 199983 100 237 277 56 24840 24716
suspend0 200K 200020 105 249 432 30 14264 11796
suspend10 200K 199950 53 123 141 32 19518 16903
suspend20 200K 200037 58 126 151 30 16426 14736
suspend50 200K 199961 73 136 177 26 13310 12633
suspend200 200K 199998 149 251 306 21 9566 10203
testcase load qps avglat 95%lat 99%lat cpu cpq ipq
base 400K 400014 139 269 707 41 9476 9343
defer10 400K 400016 59 133 166 53 13991 12989
defer20 400K 399952 67 140 172 47 12063 11644
defer50 400K 400007 87 162 198 39 9384 9880
defer200 400K 399979 181 274 330 31 7089 8430
fullbusy 400K 399987 50 123 156 100 21827 16037
napibusy 400K 400014 76 222 272 83 18185 16529
suspend0 400K 400015 127 350 776 47 10699 9603
suspend10 400K 400023 57 129 164 54 13758 13178
suspend20 400K 400043 62 135 169 49 12071 11826
suspend50 400K 400071 76 149 186 42 10011 10301
suspend200 400K 399961 154 269 327 34 7827 8774
testcase load qps avglat 95%lat 99%lat cpu cpq ipq
base 600K 599951 149 266 574 61 9265 8876
defer10 600K 600006 71 147 203 76 11866 10936
defer20 600K 600123 76 152 203 66 10430 10342
defer50 600K 600162 95 172 217 54 8526 9142
defer200 600K 599942 200 301 357 46 6977 8212
fullbusy 600K 599990 55 127 177 100 14551 13983
napibusy 600K 600035 63 160 250 96 13937 14140
suspend0 600K 599903 127 320 732 68 10166 8963
suspend10 600K 599908 63 137 192 69 10902 11100
suspend20 600K 599961 66 141 194 65 9976 10370
suspend50 600K 599973 80 159 204 57 8678 9381
suspend200 600K 600010 157 277 346 48 7133 8381
testcase load qps avglat 95%lat 99%lat cpu cpq ipq
base 800K 800039 181 300 536 87 9585 8304
defer10 800K 800038 181 530 939 96 10564 8970
defer20 800K 800029 112 225 329 90 10056 8935
defer50 800K 799999 120 208 296 82 9234 8562
defer200 800K 800066 227 338 401 63 7117 8129
fullbusy 800K 800040 61 134 190 100 10913 12608
napibusy 800K 799944 64 141 214 99 10828 12588
suspend0 800K 799911 126 248 509 85 9346 8498
suspend10 800K 800006 69 143 200 83 9410 9845
suspend20 800K 800120 74 150 207 78 8786 9454
suspend50 800K 799989 87 168 224 71 7946 8833
suspend200 800K 799987 160 292 357 62 6923 8229
testcase load qps avglat 95%lat 99%lat cpu cpq ipq
base 1000K 906879 4079 5751 6216 98 9496 7904
defer10 1000K 860849 3643 6274 6730 99 10040 8676
defer20 1000K 896063 3298 5840 6349 98 9620 8237
defer50 1000K 919782 2962 5513 5807 97 9284 7951
defer200 1000K 970941 3059 5348 5984 95 8593 7959
fullbusy 1000K 999950 70 150 207 100 8732 10777
napibusy 1000K 999996 78 154 223 100 8722 10656
suspend0 1000K 949706 2666 5770 6660 99 9071 8046
suspend10 1000K 1000024 80 160 220 92 8137 9035
suspend20 1000K 1000059 83 165 226 89 7850 8804
suspend50 1000K 999955 95 180 240 84 7411 8459
suspend200 1000K 999914 163 299 366 77 6833 8078
testcase load qps avglat 95%lat 99%lat cpu cpq ipq
base MAX 1037654 4184 5453 5810 100 8411 7938
defer10 MAX 905607 4840 6151 6380 100 9639 8431
defer20 MAX 986463 4455 5594 5796 100 8848 8110
defer50 MAX 1077030 4000 5073 5299 100 8104 7920
defer200 MAX 1040728 4152 5385 5765 100 8379 7849
fullbusy MAX 1247536 3518 3935 3984 100 6998 7930
napibusy MAX 1136310 3799 7756 9964 100 7670 7877
suspend0 MAX 1057509 4132 5724 6185 100 8253 7918
suspend10 MAX 1215147 3580 3957 4041 100 7185 7944
suspend20 MAX 1216469 3576 3953 3988 100 7175 7950
suspend50 MAX 1215871 3577 3961 4075 100 7181 7949
suspend200 MAX 1216882 3556 3951 3988 100 7175 7955
~ FAQ
- Why is a new parameter needed? Does irq_suspend_timeout override
gro_flush_timeout?
Using the suspend mechanism causes the system to alternate between
polling mode and irq-driven packet delivery. During busy periods,
irq_suspend_timeout overrides gro_flush_timeout and keeps the system
busy polling, but when epoll finds no events, the setting of
gro_flush_timeout and napi_defer_hard_irqs determine the next step.
There are essentially three possible loops for network processing and
packet delivery:
1) hardirq -> softirq -> napi poll; basic interrupt delivery
2) timer -> softirq -> napi poll; deferred irq processing
3) epoll -> busy-poll -> napi poll; busy looping
Loop 2 can take control from Loop 1, if gro_flush_timeout and
napi_defer_hard_irqs are set.
If gro_flush_timeout and napi_defer_hard_irqs are set, Loops 2 and
3 "wrestle" with each other for control. During busy periods,
irq_suspend_timeout is used as timer in Loop 2, which essentially
tilts this in favour of Loop 3.
If gro_flush_timeout and napi_defer_hard_irqs are not set, Loop 3
cannot take control from Loop 1.
Therefore, setting gro_flush_timeout and napi_defer_hard_irqs is the
recommended usage, because otherwise setting irq_suspend_timeout
might not have any discernible effect.
This is shown in the results above: compare suspend0 with the base
case. Note that the lack of napi_defer_hard_irqs and
gro_flush_timeout produce similar results for both, which encourages
the use of napi_defer_hard_irqs and gro_flush_timeout in addition to
irq_suspend_timeout.
- Can the new timeout value be threaded through the new epoll ioctl ?
It is possible, but presents challenges for userspace. User
applications must ensure that the file descriptors added to epoll
contexts have the same NAPI ID to support busy polling.
An epoll context is not permanently tied to any particular NAPI ID.
So, a user application could decide to clear the file descriptors
from the context and add a new set of file descriptors with a
different NAPI ID to the context. Busy polling would work as
expected, but the meaning of the suspend timeout becomes ambiguous
because IRQs are not inherently associated with epoll contexts, but
rather with the NAPI. The user program would need to reissue the
ioctl to set the irq_suspend_timeout, but the napi_defer_hard_irqs
and gro_flush_timeout settings would come from the NAPI's
napi_config (which are set either by sysfs or by netlink). Such an
interface seems awkard to use from a user perspective.
Further, IRQs are related to NAPIs, which is why they are stored in
the napi_config space. Putting the irq_suspend_timeout in
the epoll context while other IRQ deferral mechanisms remain in the
NAPI's napi_config space seems like an odd design choice.
We've opted to keep all of the IRQ deferral parameters together and
place the irq_suspend_timeout in napi_config. This has nice benefits
for userspace: if a user app were to remove all file descriptors
from an epoll context and add new file descriptors with a new NAPI ID,
the correct suspend timeout for that NAPI ID would be used automatically
without the user application needing to do anything (like re-issuing an
ioctl, for example). All IRQ deferral related parameters are in one
place and can all be set the same way: with netlink.
- Can irq suspend be built by combining NIC coalescing and
gro_flush_timeout ?
No. The problem is that the long timeout must engage if and only if
prefer-busy is active.
When using NIC coalescing for the short timeout (without
napi_defer_hard_irqs/gro_flush_timeout), an interrupt after an idle
period will trigger softirq, which will run napi polling. At this
point, prefer-busy is not active, so NIC interrupts would be
re-enabled. Then it is not possible for the longer timeout to
interject to switch control back to polling. In other words, only by
using the software timer for the short timeout, it is possible to
extend the timeout without having to reprogram the NIC timer or
reach down directly and disable interrupts.
Using gro_flush_timeout for the long timeout also has problems, for
the same underlying reason. In the current napi implementation,
gro_flush_timeout is not tied to prefer-busy. We'd either have to
change that and in the process modify the existing deferral
mechanism, or introduce a state variable to determine whether
gro_flush_timeout is used as long timeout for irq suspend or whether
it is used for its default purpose. In an earlier version, we did
try something similar to the latter and made it work, but it ends up
being a lot more convoluted than our current proposal.
- Isn't it already possible to combine busy looping with irq deferral?
Yes, in fact enabling irq deferral via napi_defer_hard_irqs and
gro_flush_timeout is a precondition for prefer_busy_poll to have an
effect. If the application also uses a tight busy loop with
essentially nonblocking epoll_wait (accomplished with a very short
timeout parameter), this is the fullbusy case shown in the results.
An application using blocking epoll_wait is shown as the napibusy
case in the results. It's a hybrid approach that provides limited
latency benefits compared to the base case and plain irq deferral,
but not as good as fullbusy or suspend.
~ Special thanks
Several people were involved in earlier stages of the development of this
mechanism whom we'd like to thank:
- Peter Cai (CC'd), for the initial kernel patch and his contributions
to the paper.
- Mohammadamin Shafie (CC'd), for testing various versions of the kernel
patch and providing helpful feedback.
Thanks,
Martin and Joe
[1]: https://lore.kernel.org/netdev/20240812125717.413108-1-jdamato@fastly.com/
[2]: https://doi.org/10.1145/3626780
[3]: https://github.com/memcached/memcached/blob/master/doc/napi_ids.txt
[4]: https://github.com/leverich/mutilate
[5]: https://raw.githubusercontent.com/martinkarsten/irqsuspend/main/patches/mem…
[6]: https://raw.githubusercontent.com/martinkarsten/irqsuspend/main/patches/lib…
[7]: https://github.com/martinkarsten/irqsuspend
[8]: https://github.com/martinkarsten/irqsuspend/tree/main/results
v7:
- Jakub noted that patch 2 adds unnecessary complexity by checking the
suspend timeout in the NAPI loop. This makes the code more
complicated and difficult to reason about. He's right; we've dropped
patch 2 which simplifies this series.
- Updated the cover letter with a full re-run of all test cases.
- Updated FAQ #2.
v6: https://lore.kernel.org/netdev/20241104215542.215919-1-jdamato@fastly.com/
- Updated the cover letter with a full re-run of all test cases,
including a new case suspend0, as requested by Sridhar previously.
- Updated the kernel documentation in patch 7 as suggested by Bagas
Sanjaya, which improved the htmldoc output.
v5: https://lore.kernel.org/netdev/20241103052421.518856-1-jdamato@fastly.com/
- Adjusted patch 5 to only suspend IRQs when ep_send_events returns a
positive return value. This issue was pointed out by Hillf Danton.
- Updated the commit message of patch 6 which still mentioned netcat,
despite the code being updated in v4 to replace it with socat and fixed
misspelling of netdevsim.
- Fixed a minor typo in patch 7 and removed an unnecessary paragraph.
- Added Sridhar Samudrala's Reviewed-by to patch 1-5 and 7.
v4: https://lore.kernel.org/netdev/20241102005214.32443-1-jdamato@fastly.com/
- Added a new FAQ item to cover letter.
- Updated patch 6 to use socat instead of nc in busy_poll_test.sh and
updated busy_poller.c to use netlink directly to configure napi
params.
- Updated the kernel documentation in patch 7 to include more details.
- Dropped Stanislav's Acked-by and Bagas' Reviewed-by from patch 7
since the documentation was updated.
v3: https://lore.kernel.org/netdev/20241101004846.32532-1-jdamato@fastly.com/
- Added Stanislav Fomichev's Acked-by to every patch except the newly
added selftest.
- Added Bagas Sanjaya's Reviewed-by to the documentation patch.
- Fixed the commit message of patch 2 to remove a reference to the now
non-existent sysfs setting.
- Added a self test which tests both "regular" busy poll and busy poll
with suspend enabled. This was added as patch 6 as requested by
Paolo. netdevsim was chosen instead of veth due to netdevsim's
pre-existing support for netdev-genl. See the commit message of
patch 6 for more details.
v2: https://lore.kernel.org/bpf/20241021015311.95468-1-jdamato@fastly.com/
- Cover letter updated, including a re-run of test data.
- Patch 1 rewritten to use netdev-genl instead of sysfs.
- Patch 3 updated with a comment added to napi_resume_irqs.
- Patch 4 rebased to apply now that commit b9ca079dd6b0 ("eventpoll:
Annotate data-race of busy_poll_usecs") has been picked up from VFS.
- Patch 6 updated the kernel documentation.
rfc -> v1:
- Cover letter updated to include more details.
- Patch 1 updated to remove the documentation added. This was moved to
patch 6 with the rest of the docs (see below).
- Patch 5 updated to fix an error uncovered by the kernel build robot.
See patch 5's changelog for more details.
- Patch 6 added which updates kernel documentation.
Joe Damato (2):
selftests: net: Add busy_poll_test
docs: networking: Describe irq suspension
Martin Karsten (4):
net: Add napi_struct parameter irq_suspend_timeout
net: Add control functions for irq suspension
eventpoll: Trigger napi_busy_loop, if prefer_busy_poll is set
eventpoll: Control irq suspension for prefer_busy_poll
Documentation/netlink/specs/netdev.yaml | 7 +
Documentation/networking/napi.rst | 170 ++++++++-
fs/eventpoll.c | 36 +-
include/linux/netdevice.h | 2 +
include/net/busy_poll.h | 3 +
include/uapi/linux/netdev.h | 1 +
net/core/dev.c | 41 +++
net/core/dev.h | 25 ++
net/core/netdev-genl-gen.c | 5 +-
net/core/netdev-genl.c | 12 +
tools/include/uapi/linux/netdev.h | 1 +
tools/testing/selftests/net/.gitignore | 1 +
tools/testing/selftests/net/Makefile | 3 +-
tools/testing/selftests/net/busy_poll_test.sh | 164 +++++++++
tools/testing/selftests/net/busy_poller.c | 328 ++++++++++++++++++
15 files changed, 792 insertions(+), 7 deletions(-)
create mode 100755 tools/testing/selftests/net/busy_poll_test.sh
create mode 100644 tools/testing/selftests/net/busy_poller.c
base-commit: dc7c381bb8649e3701ed64f6c3e55316675904d7
--
2.25.1
A few unrelated devmem TCP fixes bundled in a series for some
convenience (if that's ok).
Patch 1-2: fix naming and provide page_pool_alloc_netmem for fragged
netmem.
Patch 3-4: fix issues with dma-buf dma addresses being potentially
passed to dma_sync_for_* helpers.
Patch 5-6: fix syzbot SO_DEVMEM_DONTNEED issue and add test for this
case.
Mina Almasry (6):
net: page_pool: rename page_pool_alloc_netmem to *_netmems
net: page_pool: create page_pool_alloc_netmem
page_pool: disable sync for cpu for dmabuf memory provider
netmem: add netmem_prefetch
net: fix SO_DEVMEM_DONTNEED looping too long
ncdevmem: add test for too many token_count
Samiullah Khawaja (1):
page_pool: Set `dma_sync` to false for devmem memory provider
include/net/netmem.h | 7 ++++
include/net/page_pool/helpers.h | 50 ++++++++++++++++++--------
include/net/page_pool/types.h | 2 +-
net/core/devmem.c | 9 +++--
net/core/page_pool.c | 11 +++---
net/core/sock.c | 46 ++++++++++++++----------
tools/testing/selftests/net/ncdevmem.c | 11 ++++++
7 files changed, 93 insertions(+), 43 deletions(-)
--
2.47.0.163.g1226f6d8fa-goog
Following the previous vIOMMU series, this adds another vDEVICE structure,
representing the association from an iommufd_device to an iommufd_viommu.
This gives the whole architecture a new "v" layer:
_______________________________________________________________________
| iommufd (with vIOMMU/vDEVICE) |
| _____________ _____________ |
| | | | | |
| |----------------| vIOMMU |<---| vDEVICE |<------| |
| | | | |_____________| | |
| | ______ | | _____________ ___|____ |
| | | | | | | | | | |
| | | IOAS |<---|(HWPT_PAGING)|<---| HWPT_NESTED |<--| DEVICE | |
| | |______| |_____________| |_____________| |________| |
|______|________|______________|__________________|_______________|_____|
| | | | |
______v_____ | ______v_____ ______v_____ ___v__
| struct | | PFN | (paging) | | (nested) | |struct|
|iommu_device| |------>|iommu_domain|<----|iommu_domain|<----|device|
|____________| storage|____________| |____________| |______|
This vDEVICE object is used to collect and store all vIOMMU-related device
information/attributes in a VM. As an initial series for vDEVICE, add only
the virt_id to the vDEVICE, which is a vIOMMU specific device ID in a VM:
e.g. vSID of ARM SMMUv3, vDeviceID of AMD IOMMU, and vRID of Intel VT-d to
a Context Table. This virt_id helps IOMMU drivers to link the vID to a pID
of the device against the physical IOMMU instance. This is essential for a
vIOMMU-based invalidation, where the request contains a device's vID for a
device cache flush, e.g. ATC invalidation.
Therefore, with this vDEVICE object, support a vIOMMU-based invalidation,
by reusing IOMMUFD_CMD_HWPT_INVALIDATE for a vIOMMU object to flush cache
with a given driver data.
As for the implementation of the series, add driver support in ARM SMMUv3
for a real world use case.
This series is on Github:
https://github.com/nicolinc/iommufd/commits/iommufd_viommu_p2-v6
(QEMU branch for testing will be provided in Jason's nesting series)
Changelog
v6
* Fixed kdoc in the uAPI header
* Fixed indentations in iommufd.rst
* Replaced vdev->idev with vdev->dev
* Added "Reviewed-by" from Kevin and Jason
* Updated kdoc of struct iommu_vdevice_alloc
* Fixed lockdep function call in iommufd_viommu_find_dev
* Added missing iommu_dev validation between viommu and idev
* Skipped SMMUv3 driver changes (to post in a separate series)
* Replaced !cache_invalidate_user in WARN_ON of the allocation path
with cache_invalidate_user validation in iommufd_hwpt_invalidate
v5
https://lore.kernel.org/all/cover.1729897278.git.nicolinc@nvidia.com/
* Dropped driver-allocated vDEVICE support
* Changed vdev_to_dev helper to iommufd_viommu_find_dev
v4
https://lore.kernel.org/all/cover.1729555967.git.nicolinc@nvidia.com/
* Added missing brackets in switch-case
* Fixed the unreleased idev refcount issue
* Reworked the iommufd_vdevice_alloc allocator
* Dropped support for IOMMU_VIOMMU_TYPE_DEFAULT
* Added missing TEST_LENGTH and fail_nth coverages
* Added a verification to the driver-allocated vDEVICE object
* Added an iommufd_vdevice_abort for a missing mutex protection
* Added a u64 structure arm_vsmmu_invalidation_cmd for user command
conversion
v3
https://lore.kernel.org/all/cover.1728491532.git.nicolinc@nvidia.com/
* Added Jason's Reviewed-by
* Split this invalidation part out of the part-1 series
* Repurposed VDEV_ID ioctl to a wider vDEVICE structure and ioctl
* Reduced viommu_api functions by allowing drivers to access viommu
and vdevice structure directly
* Dropped vdevs_rwsem by using xa_lock instead
* Dropped arm_smmu_cache_invalidate_user
v2
https://lore.kernel.org/all/cover.1724776335.git.nicolinc@nvidia.com/
* Limited vdev_id to one per idev
* Added a rw_sem to protect the vdev_id list
* Reworked driver-level APIs with proper lockings
* Added a new viommu_api file for IOMMUFD_DRIVER config
* Dropped useless iommu_dev point from the viommu structure
* Added missing index numnbers to new types in the uAPI header
* Dropped IOMMU_VIOMMU_INVALIDATE uAPI; Instead, reuse the HWPT one
* Reworked mock_viommu_cache_invalidate() using the new iommu helper
* Reordered details of set/unset_vdev_id handlers for proper lockings
v1
https://lore.kernel.org/all/cover.1723061377.git.nicolinc@nvidia.com/
Thanks!
Nicolin
Jason Gunthorpe (1):
iommu: Add iommu_copy_struct_from_full_user_array helper
Nicolin Chen (9):
iommufd/viommu: Add IOMMUFD_OBJ_VDEVICE and IOMMU_VDEVICE_ALLOC ioctl
iommufd/selftest: Add IOMMU_VDEVICE_ALLOC test coverage
iommu/viommu: Add cache_invalidate to iommufd_viommu_ops
iommufd: Allow hwpt_id to carry viommu_id for IOMMU_HWPT_INVALIDATE
iommufd/viommu: Add iommufd_viommu_find_dev helper
iommufd/selftest: Add mock_viommu_cache_invalidate
iommufd/selftest: Add IOMMU_TEST_OP_DEV_CHECK_CACHE test command
iommufd/selftest: Add vIOMMU coverage for IOMMU_HWPT_INVALIDATE ioctl
Documentation: userspace-api: iommufd: Update vDEVICE
drivers/iommu/iommufd/iommufd_private.h | 18 ++
drivers/iommu/iommufd/iommufd_test.h | 30 +++
include/linux/iommu.h | 48 ++++-
include/linux/iommufd.h | 22 ++
include/uapi/linux/iommufd.h | 31 ++-
tools/testing/selftests/iommu/iommufd_utils.h | 83 +++++++
drivers/iommu/iommufd/driver.c | 13 ++
drivers/iommu/iommufd/hw_pagetable.c | 40 +++-
drivers/iommu/iommufd/main.c | 6 +
drivers/iommu/iommufd/selftest.c | 98 ++++++++-
drivers/iommu/iommufd/viommu.c | 76 +++++++
tools/testing/selftests/iommu/iommufd.c | 204 +++++++++++++++++-
.../selftests/iommu/iommufd_fail_nth.c | 4 +
Documentation/userspace-api/iommufd.rst | 41 +++-
14 files changed, 688 insertions(+), 26 deletions(-)
--
2.43.0
This series introduces a new vIOMMU infrastructure and related ioctls.
IOMMUFD has been using the HWPT infrastructure for all cases, including a
nested IO page table support. Yet, there're limitations for an HWPT-based
structure to support some advanced HW-accelerated features, such as CMDQV
on NVIDIA Grace, and HW-accelerated vIOMMU on AMD. Even for a multi-IOMMU
environment, it is not straightforward for nested HWPTs to share the same
parent HWPT (stage-2 IO pagetable), with the HWPT infrastructure alone: a
parent HWPT typically hold one stage-2 IO pagetable and tag it with only
one ID in the cache entries. When sharing one large stage-2 IO pagetable
across physical IOMMU instances, that one ID may not always be available
across all the IOMMU instances. In other word, it's ideal for SW to have
a different container for the stage-2 IO pagetable so it can hold another
ID that's available. And this container will be able to hold some advanced
feature too.
For this "different container", add vIOMMU, an additional layer to hold
extra virtualization information:
_______________________________________________________________________
| iommufd (with vIOMMU) |
| _____________ |
| | | |
| |----------------| vIOMMU | |
| | ______ | | _____________ ________ |
| | | | | | | | | | |
| | | IOAS |<---|(HWPT_PAGING)|<---| HWPT_NESTED |<--| DEVICE | |
| | |______| |_____________| |_____________| |________| |
| | | | | | |
|______|________|______________|__________________|_______________|_____|
| | | | |
______v_____ | ______v_____ ______v_____ ___v__
| struct | | PFN | (paging) | | (nested) | |struct|
|iommu_device| |------>|iommu_domain|<----|iommu_domain|<----|device|
|____________| storage|____________| |____________| |______|
The vIOMMU object should be seen as a slice of a physical IOMMU instance
that is passed to or shared with a VM. That can be some HW/SW resources:
- Security namespace for guest owned ID, e.g. guest-controlled cache tags
- Non-device-affiliated event reporting, e.g. invalidation queue errors
- Access to a sharable nesting parent pagetable across physical IOMMUs
- Virtualization of various platforms IDs, e.g. RIDs and others
- Delivery of paravirtualized invalidation
- Direct assigned invalidation queues
- Direct assigned interrupts
On a multi-IOMMU system, the vIOMMU object must be instanced to the number
of the physical IOMMUs that have a slice passed to (via device) a guest VM,
while being able to hold the shareable parent HWPT. Each vIOMMU then just
needs to allocate its own individual ID to tag its own cache:
----------------------------
---------------- | | paging_hwpt0 |
| hwpt_nested0 |--->| viommu0 ------------------
---------------- | | IDx |
----------------------------
----------------------------
---------------- | | paging_hwpt0 |
| hwpt_nested1 |--->| viommu1 ------------------
---------------- | | IDy |
----------------------------
As an initial part-1, add IOMMUFD_CMD_VIOMMU_ALLOC ioctl for an allocation
only.
More vIOMMU-based structs and ioctls will be introduced in the follow-up
series to support vDEVICE, vIRQ (vEVENT) and vQUEUE objects. Although we
repurposed the vIOMMU object from an earlier RFC, just for a referece:
https://lore.kernel.org/all/cover.1712978212.git.nicolinc@nvidia.com/
This series is on Github:
https://github.com/nicolinc/iommufd/commits/iommufd_viommu_p1-v6
(QEMU branch for testing will be provided in Jason's nesting series)
Changelog
v6
* Improved comment lines
* Added a TEST_F for IO page fault
* Fixed indentations in iommufd.rst
* Revised kdoc of the viommu_alloc op
* Added "Reviewed-by" from Kevin and Jason
* Passed in "flags" to ->alloc_domain_nested
* Renamed "free" op to "destroy" in viommu_ops
* Skipped SMMUv3 driver changes (to post in a separate series)
* Fixed "flags" validation in iommufd_viommu_alloc_hwpt_nested
* Added CONFIG_IOMMUFD_DRIVER_CORE for sharing between iommufd
core and IOMMU dirvers
* Replaced iommufd_verify_unfinalized_object with xa_cmpxchg in
iommufd_object_finalize/abort functions
v5
https://lore.kernel.org/all/cover.1729897352.git.nicolinc@nvidia.com/
* Added "Reviewed-by" from Kevin
* Reworked iommufd_viommu_alloc helper
* Revised the uAPI kdoc for vIOMMU object
* Revised comments for pluggable iommu_dev
* Added a couple of cleanup patches for selftest
* Renamed domain_alloc_nested op to alloc_domain_nested
* Updated a few commit messages to reflect the latest series
* Renamed iommufd_hwpt_nested_alloc_for_viommu to
iommufd_viommu_alloc_hwpt_nested, and added flag validation
v4
https://lore.kernel.org/all/cover.1729553811.git.nicolinc@nvidia.com/
* Added "Reviewed-by" from Jason
* Dropped IOMMU_VIOMMU_TYPE_DEFAULT support
* Dropped iommufd_object_alloc_elm renamings
* Renamed iommufd's viommu_api.c to driver.c
* Reworked iommufd_viommu_alloc helper
* Added a separate iommufd_hwpt_nested_alloc_for_viommu function for
hwpt_nested allocations on a vIOMMU, and added comparison between
viommu->iommu_dev->ops and dev_iommu_ops(idev->dev)
* Replaced s2_parent with vsmmu in arm_smmu_nested_domain
* Replaced domain_alloc_user in iommu_ops with domain_alloc_nested in
viommu_ops
* Replaced wait_queue_head_t with a completion, to delay the unplug of
mock_iommu_dev
* Corrected documentation graph that was missing struct iommu_device
* Added an iommufd_verify_unfinalized_object helper to verify driver-
allocated vIOMMU/vDEVICE objects
* Added missing test cases for TEST_LENGTH and fail_nth
v3
https://lore.kernel.org/all/cover.1728491453.git.nicolinc@nvidia.com/
* Rebased on top of Jason's nesting v3 series
https://lore.kernel.org/all/0-v3-e2e16cd7467f+2a6a1-smmuv3_nesting_jgg@nvid…
* Split the series into smaller parts
* Added Jason's Reviewed-by
* Added back viommu->iommu_dev
* Added support for driver-allocated vIOMMU v.s. core-allocated
* Dropped arm_smmu_cache_invalidate_user
* Added an iommufd_test_wait_for_users() in selftest
* Reworked test code to make viommu an individual FIXTURE
* Added missing TEST_LENGTH case for the new ioctl command
v2
https://lore.kernel.org/all/cover.1724776335.git.nicolinc@nvidia.com/
* Limited vdev_id to one per idev
* Added a rw_sem to protect the vdev_id list
* Reworked driver-level APIs with proper lockings
* Added a new viommu_api file for IOMMUFD_DRIVER config
* Dropped useless iommu_dev point from the viommu structure
* Added missing index numnbers to new types in the uAPI header
* Dropped IOMMU_VIOMMU_INVALIDATE uAPI; Instead, reuse the HWPT one
* Reworked mock_viommu_cache_invalidate() using the new iommu helper
* Reordered details of set/unset_vdev_id handlers for proper lockings
v1
https://lore.kernel.org/all/cover.1723061377.git.nicolinc@nvidia.com/
Thanks!
Nicolin
Nicolin Chen (13):
iommufd: Move struct iommufd_object to public iommufd header
iommufd: Move _iommufd_object_alloc helper to a sharable file
iommufd: Introduce IOMMUFD_OBJ_VIOMMU and its related struct
iommufd: Verify object in iommufd_object_finalize/abort()
iommufd/viommu: Add IOMMU_VIOMMU_ALLOC ioctl
iommufd: Add alloc_domain_nested op to iommufd_viommu_ops
iommufd: Allow pt_id to carry viommu_id for IOMMU_HWPT_ALLOC
iommufd/selftest: Add container_of helpers
iommufd/selftest: Prepare for mock_viommu_alloc_domain_nested()
iommufd/selftest: Add refcount to mock_iommu_device
iommufd/selftest: Add IOMMU_VIOMMU_TYPE_SELFTEST
iommufd/selftest: Add IOMMU_VIOMMU_ALLOC test coverage
Documentation: userspace-api: iommufd: Update vIOMMU
drivers/iommu/iommufd/Kconfig | 5 +
drivers/iommu/iommufd/Makefile | 8 +-
drivers/iommu/iommufd/iommufd_private.h | 33 +--
drivers/iommu/iommufd/iommufd_test.h | 2 +
include/linux/iommu.h | 14 +
include/linux/iommufd.h | 83 ++++++
include/uapi/linux/iommufd.h | 54 +++-
tools/testing/selftests/iommu/iommufd_utils.h | 28 ++
drivers/iommu/iommufd/driver.c | 40 +++
drivers/iommu/iommufd/hw_pagetable.c | 72 ++++-
drivers/iommu/iommufd/main.c | 54 ++--
drivers/iommu/iommufd/selftest.c | 266 +++++++++++++-----
drivers/iommu/iommufd/viommu.c | 81 ++++++
tools/testing/selftests/iommu/iommufd.c | 128 +++++++++
.../selftests/iommu/iommufd_fail_nth.c | 11 +
Documentation/userspace-api/iommufd.rst | 69 ++++-
16 files changed, 795 insertions(+), 153 deletions(-)
create mode 100644 drivers/iommu/iommufd/driver.c
create mode 100644 drivers/iommu/iommufd/viommu.c
--
2.43.0
One of the things that fp-stress does to stress the floating point
context switching is send signals to the test threads it spawns.
Currently we do this once per second but as suggested by Mark Rutland if
we increase this we can improve the chances of triggering any issues
with context switching the signal handling code. Do a quick change to
increase the rate of signal delivery, trying to avoid excessive impact
on emulated platforms, and a further change to mitigate the impact that
this creates during startup.
Signed-off-by: Mark Brown <broonie(a)kernel.org>
---
Changes in v2:
- Minor clarifications in commit message and log output.
- Link to v1: https://lore.kernel.org/r/20241029-arm64-fp-stress-interval-v1-0-db540abf6d…
---
Mark Brown (2):
kselftest/arm64: Increase frequency of signal delivery in fp-stress
kselftest/arm64: Poll less often while waiting for fp-stress children
tools/testing/selftests/arm64/fp/fp-stress.c | 28 +++++++++++++++++-----------
1 file changed, 17 insertions(+), 11 deletions(-)
---
base-commit: 8e929cb546ee42c9a61d24fae60605e9e3192354
change-id: 20241028-arm64-fp-stress-interval-8f5e62c06e12
Best regards,
--
Mark Brown <broonie(a)kernel.org>
Check number of paths by fib_info_num_path(),
and update_or_create_fnhe() for every path.
Problem is that pmtu is cached only for the oif
that has received icmp message "need to frag",
other oifs will still try to use "default" iface mtu.
An example topology showing the problem:
| host1
+---------+
| dummy0 | 10.179.20.18/32 mtu9000
+---------+
+-----------+----------------+
+---------+ +---------+
| ens17f0 | 10.179.2.141/31 | ens17f1 | 10.179.2.13/31
+---------+ +---------+
| (all here have mtu 9000) |
+------+ +------+
| ro1 | 10.179.2.140/31 | ro2 | 10.179.2.12/31
+------+ +------+
| |
---------+------------+-------------------+------
|
+-----+
| ro3 | 10.10.10.10 mtu1500
+-----+
|
========================================
some networks
========================================
|
+-----+
| eth0| 10.10.30.30 mtu9000
+-----+
| host2
host1 have enabled multipath and
sysctl net.ipv4.fib_multipath_hash_policy = 1:
default proto static src 10.179.20.18
nexthop via 10.179.2.12 dev ens17f1 weight 1
nexthop via 10.179.2.140 dev ens17f0 weight 1
When host1 tries to do pmtud from 10.179.20.18/32 to host2,
host1 receives at ens17f1 iface an icmp packet from ro3 that ro3 mtu=1500.
And host1 caches it in nexthop exceptions cache.
Problem is that it is cached only for the iface that has received icmp,
and there is no way that ro3 will send icmp msg to host1 via another path.
Host1 now have this routes to host2:
ip r g 10.10.30.30 sport 30000 dport 443
10.10.30.30 via 10.179.2.12 dev ens17f1 src 10.179.20.18 uid 0
cache expires 521sec mtu 1500
ip r g 10.10.30.30 sport 30033 dport 443
10.10.30.30 via 10.179.2.140 dev ens17f0 src 10.179.20.18 uid 0
cache
So when host1 tries again to reach host2 with mtu>1500,
if packet flow is lucky enough to be hashed with oif=ens17f1 its ok,
if oif=ens17f0 it blackholes and still gets icmp msgs from ro3 to ens17f1,
until lucky day when ro3 will send it through another flow to ens17f0.
Signed-off-by: Vladimir Vdovin <deliran(a)verdict.gg>
---
V9:
selftests in pmtu.sh:
- remove useless return
- fix mtu var override
V8:
selftests in pmtu.sh:
- Change var names from "dummy" to "host"
- Fix errors caused by incorrect iface arguments pass
- Add src addr to setup_multipath_new
- Change multipath* func order
- Change route_get_dst_exception() && route_get_dst_pmtu_from_exception()
and arguments pass where they are used
as Ido suggested in https://lore.kernel.org/all/ZykH_fdcMBdFgXix@shredder/
V7:
selftest in pmtu.sh:
- add setup_multipath() with old and new nh tests
- add global "dummy_v4" addr variables
- add documentation
- remove dummy netdev usage in mp nh test
- remove useless sysctl opts in mp nh test
V6:
- make commit message cleaner
V5:
- make self test cleaner
V4:
- fix selftest, do route lookup before checking cached exceptions
V3:
- add selftest
- fix compile error
V2:
- fix fib_info_num_path parameter pass
---
net/ipv4/route.c | 13 ++++
tools/testing/selftests/net/pmtu.sh | 117 ++++++++++++++++++++++++----
2 files changed, 113 insertions(+), 17 deletions(-)
diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index 723ac9181558..652f603d29fe 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -1027,6 +1027,19 @@ static void __ip_rt_update_pmtu(struct rtable *rt, struct flowi4 *fl4, u32 mtu)
struct fib_nh_common *nhc;
fib_select_path(net, &res, fl4, NULL);
+#ifdef CONFIG_IP_ROUTE_MULTIPATH
+ if (fib_info_num_path(res.fi) > 1) {
+ int nhsel;
+
+ for (nhsel = 0; nhsel < fib_info_num_path(res.fi); nhsel++) {
+ nhc = fib_info_nhc(res.fi, nhsel);
+ update_or_create_fnhe(nhc, fl4->daddr, 0, mtu, lock,
+ jiffies + net->ipv4.ip_rt_mtu_expires);
+ }
+ rcu_read_unlock();
+ return;
+ }
+#endif /* CONFIG_IP_ROUTE_MULTIPATH */
nhc = FIB_RES_NHC(res);
update_or_create_fnhe(nhc, fl4->daddr, 0, mtu, lock,
jiffies + net->ipv4.ip_rt_mtu_expires);
diff --git a/tools/testing/selftests/net/pmtu.sh b/tools/testing/selftests/net/pmtu.sh
index 569bce8b6383..611ae7862f46 100755
--- a/tools/testing/selftests/net/pmtu.sh
+++ b/tools/testing/selftests/net/pmtu.sh
@@ -197,6 +197,12 @@
#
# - pmtu_ipv6_route_change
# Same as above but with IPv6
+#
+# - pmtu_ipv4_mp_exceptions
+# Use the same topology as in pmtu_ipv4, but add routeable addresses
+# on host A and B on lo reachable via both routers. Host A and B
+# addresses have multipath routes to each other, b_r1 mtu = 1500.
+# Check that PMTU exceptions are created for both paths.
source lib.sh
source net_helper.sh
@@ -266,7 +272,8 @@ tests="
list_flush_ipv4_exception ipv4: list and flush cached exceptions 1
list_flush_ipv6_exception ipv6: list and flush cached exceptions 1
pmtu_ipv4_route_change ipv4: PMTU exception w/route replace 1
- pmtu_ipv6_route_change ipv6: PMTU exception w/route replace 1"
+ pmtu_ipv6_route_change ipv6: PMTU exception w/route replace 1
+ pmtu_ipv4_mp_exceptions ipv4: PMTU multipath nh exceptions 1"
# Addressing and routing for tests with routers: four network segments, with
# index SEGMENT between 1 and 4, a common prefix (PREFIX4 or PREFIX6) and an
@@ -343,6 +350,9 @@ tunnel6_a_addr="fd00:2::a"
tunnel6_b_addr="fd00:2::b"
tunnel6_mask="64"
+host4_a_addr="192.168.99.99"
+host4_b_addr="192.168.88.88"
+
dummy6_0_prefix="fc00:1000::"
dummy6_1_prefix="fc00:1001::"
dummy6_mask="64"
@@ -984,6 +994,52 @@ setup_ovs_bridge() {
run_cmd ip route add ${prefix6}:${b_r1}::1 via ${prefix6}:${a_r1}::2
}
+setup_multipath_new() {
+ # Set up host A with multipath routes to host B host4_b_addr
+ run_cmd ${ns_a} ip addr add ${host4_a_addr} dev lo
+ run_cmd ${ns_a} ip nexthop add id 401 via ${prefix4}.${a_r1}.2 dev veth_A-R1
+ run_cmd ${ns_a} ip nexthop add id 402 via ${prefix4}.${a_r2}.2 dev veth_A-R2
+ run_cmd ${ns_a} ip nexthop add id 403 group 401/402
+ run_cmd ${ns_a} ip route add ${host4_b_addr} src ${host4_a_addr} nhid 403
+
+ # Set up host B with multipath routes to host A host4_a_addr
+ run_cmd ${ns_b} ip addr add ${host4_b_addr} dev lo
+ run_cmd ${ns_b} ip nexthop add id 401 via ${prefix4}.${b_r1}.2 dev veth_B-R1
+ run_cmd ${ns_b} ip nexthop add id 402 via ${prefix4}.${b_r2}.2 dev veth_B-R2
+ run_cmd ${ns_b} ip nexthop add id 403 group 401/402
+ run_cmd ${ns_b} ip route add ${host4_a_addr} src ${host4_b_addr} nhid 403
+}
+
+setup_multipath_old() {
+ # Set up host A with multipath routes to host B host4_b_addr
+ run_cmd ${ns_a} ip addr add ${host4_a_addr} dev lo
+ run_cmd ${ns_a} ip route add ${host4_b_addr} \
+ src ${host4_a_addr} \
+ nexthop via ${prefix4}.${a_r1}.2 weight 1 \
+ nexthop via ${prefix4}.${a_r2}.2 weight 1
+
+ # Set up host B with multipath routes to host A host4_a_addr
+ run_cmd ${ns_b} ip addr add ${host4_b_addr} dev lo
+ run_cmd ${ns_b} ip route add ${host4_a_addr} \
+ src ${host4_b_addr} \
+ nexthop via ${prefix4}.${b_r1}.2 weight 1 \
+ nexthop via ${prefix4}.${b_r2}.2 weight 1
+}
+
+setup_multipath() {
+ if [ "$USE_NH" = "yes" ]; then
+ setup_multipath_new
+ else
+ setup_multipath_old
+ fi
+
+ # Set up routers with routes to dummies
+ run_cmd ${ns_r1} ip route add ${host4_a_addr} via ${prefix4}.${a_r1}.1
+ run_cmd ${ns_r2} ip route add ${host4_a_addr} via ${prefix4}.${a_r2}.1
+ run_cmd ${ns_r1} ip route add ${host4_b_addr} via ${prefix4}.${b_r1}.1
+ run_cmd ${ns_r2} ip route add ${host4_b_addr} via ${prefix4}.${b_r2}.1
+}
+
setup() {
[ "$(id -u)" -ne 0 ] && echo " need to run as root" && return $ksft_skip
@@ -1076,23 +1132,15 @@ link_get_mtu() {
}
route_get_dst_exception() {
- ns_cmd="${1}"
- dst="${2}"
- dsfield="${3}"
-
- if [ -z "${dsfield}" ]; then
- dsfield=0
- fi
+ ns_cmd="${1}"; shift
- ${ns_cmd} ip route get "${dst}" dsfield "${dsfield}"
+ ${ns_cmd} ip route get "$@"
}
route_get_dst_pmtu_from_exception() {
- ns_cmd="${1}"
- dst="${2}"
- dsfield="${3}"
+ ns_cmd="${1}"; shift
- mtu_parse "$(route_get_dst_exception "${ns_cmd}" "${dst}" "${dsfield}")"
+ mtu_parse "$(route_get_dst_exception "${ns_cmd}" "$@")"
}
check_pmtu_value() {
@@ -1235,10 +1283,10 @@ test_pmtu_ipv4_dscp_icmp_exception() {
run_cmd "${ns_a}" ping -q -M want -Q "${dsfield}" -c 1 -w 1 -s "${len}" "${dst2}"
# Check that exceptions have been created with the correct PMTU
- pmtu_1="$(route_get_dst_pmtu_from_exception "${ns_a}" "${dst1}" "${policy_mark}")"
+ pmtu_1="$(route_get_dst_pmtu_from_exception "${ns_a}" "${dst1}" dsfield "${policy_mark}")"
check_pmtu_value "1400" "${pmtu_1}" "exceeding MTU" || return 1
- pmtu_2="$(route_get_dst_pmtu_from_exception "${ns_a}" "${dst2}" "${policy_mark}")"
+ pmtu_2="$(route_get_dst_pmtu_from_exception "${ns_a}" "${dst2}" dsfield "${policy_mark}")"
check_pmtu_value "1500" "${pmtu_2}" "exceeding MTU" || return 1
}
@@ -1285,9 +1333,9 @@ test_pmtu_ipv4_dscp_udp_exception() {
UDP:"${dst2}":50000,tos="${dsfield}"
# Check that exceptions have been created with the correct PMTU
- pmtu_1="$(route_get_dst_pmtu_from_exception "${ns_a}" "${dst1}" "${policy_mark}")"
+ pmtu_1="$(route_get_dst_pmtu_from_exception "${ns_a}" "${dst1}" dsfield "${policy_mark}")"
check_pmtu_value "1400" "${pmtu_1}" "exceeding MTU" || return 1
- pmtu_2="$(route_get_dst_pmtu_from_exception "${ns_a}" "${dst2}" "${policy_mark}")"
+ pmtu_2="$(route_get_dst_pmtu_from_exception "${ns_a}" "${dst2}" dsfield "${policy_mark}")"
check_pmtu_value "1500" "${pmtu_2}" "exceeding MTU" || return 1
}
@@ -2329,6 +2377,41 @@ test_pmtu_ipv6_route_change() {
test_pmtu_ipvX_route_change 6
}
+test_pmtu_ipv4_mp_exceptions() {
+ setup namespaces routing multipath || return $ksft_skip
+
+ trace "${ns_a}" veth_A-R1 "${ns_r1}" veth_R1-A \
+ "${ns_r1}" veth_R1-B "${ns_b}" veth_B-R1 \
+ "${ns_a}" veth_A-R2 "${ns_r2}" veth_R2-A \
+ "${ns_r2}" veth_R2-B "${ns_b}" veth_B-R2
+
+ # Set up initial MTU values
+ mtu "${ns_a}" veth_A-R1 2000
+ mtu "${ns_r1}" veth_R1-A 2000
+ mtu "${ns_r1}" veth_R1-B 1500
+ mtu "${ns_b}" veth_B-R1 1500
+
+ mtu "${ns_a}" veth_A-R2 2000
+ mtu "${ns_r2}" veth_R2-A 2000
+ mtu "${ns_r2}" veth_R2-B 1500
+ mtu "${ns_b}" veth_B-R2 1500
+
+ fail=0
+
+ # Ping and expect two nexthop exceptions for two routes in nh group
+ run_cmd ${ns_a} ping -q -M want -i 0.1 -c 1 -s 1800 "${host4_b_addr}"
+
+ # Do route lookup before checking cached exceptions.
+ # The following commands are needed for dst entries to be cached
+ # in both paths exceptions and therefore dumped to user space
+ # Check that exceptions have been created with the correct PMTU
+ pmtu_a_R1="$(route_get_dst_pmtu_from_exception "${ns_a}" "${host4_b_addr}" oif veth_A-R1)"
+ pmtu_a_R2="$(route_get_dst_pmtu_from_exception "${ns_a}" "${host4_b_addr}" oif veth_A-R2)"
+
+ check_pmtu_value "1500" "${pmtu_a_R1}" "exceeding MTU (veth_A-R2)" || return 1
+ check_pmtu_value "1500" "${pmtu_a_R2}" "exceeding MTU (veth_A-R1)" || return 1
+}
+
usage() {
echo
echo "$0 [OPTIONS] [TEST]..."
base-commit: 66600fac7a984dea4ae095411f644770b2561ede
--
2.43.5
Check number of paths by fib_info_num_path(),
and update_or_create_fnhe() for every path.
Problem is that pmtu is cached only for the oif
that has received icmp message "need to frag",
other oifs will still try to use "default" iface mtu.
An example topology showing the problem:
| host1
+---------+
| dummy0 | 10.179.20.18/32 mtu9000
+---------+
+-----------+----------------+
+---------+ +---------+
| ens17f0 | 10.179.2.141/31 | ens17f1 | 10.179.2.13/31
+---------+ +---------+
| (all here have mtu 9000) |
+------+ +------+
| ro1 | 10.179.2.140/31 | ro2 | 10.179.2.12/31
+------+ +------+
| |
---------+------------+-------------------+------
|
+-----+
| ro3 | 10.10.10.10 mtu1500
+-----+
|
========================================
some networks
========================================
|
+-----+
| eth0| 10.10.30.30 mtu9000
+-----+
| host2
host1 have enabled multipath and
sysctl net.ipv4.fib_multipath_hash_policy = 1:
default proto static src 10.179.20.18
nexthop via 10.179.2.12 dev ens17f1 weight 1
nexthop via 10.179.2.140 dev ens17f0 weight 1
When host1 tries to do pmtud from 10.179.20.18/32 to host2,
host1 receives at ens17f1 iface an icmp packet from ro3 that ro3 mtu=1500.
And host1 caches it in nexthop exceptions cache.
Problem is that it is cached only for the iface that has received icmp,
and there is no way that ro3 will send icmp msg to host1 via another path.
Host1 now have this routes to host2:
ip r g 10.10.30.30 sport 30000 dport 443
10.10.30.30 via 10.179.2.12 dev ens17f1 src 10.179.20.18 uid 0
cache expires 521sec mtu 1500
ip r g 10.10.30.30 sport 30033 dport 443
10.10.30.30 via 10.179.2.140 dev ens17f0 src 10.179.20.18 uid 0
cache
So when host1 tries again to reach host2 with mtu>1500,
if packet flow is lucky enough to be hashed with oif=ens17f1 its ok,
if oif=ens17f0 it blackholes and still gets icmp msgs from ro3 to ens17f1,
until lucky day when ro3 will send it through another flow to ens17f0.
Signed-off-by: Vladimir Vdovin <deliran(a)verdict.gg>
---
V8:
selftests in pmtu.sh:
- Change var names from "dummy" to "host"
- Fix errors caused by incorrect iface arguments pass
- Add src addr to setup_multipath_new
- Change multipath* func order
- Change route_get_dst_exception() && route_get_dst_pmtu_from_exception()
and arguments pass where they are used
as Ido suggested in https://lore.kernel.org/all/ZykH_fdcMBdFgXix@shredder/
V7:
selftest in pmtu.sh:
- add setup_multipath() with old and new nh tests
- add global "dummy_v4" addr variables
- add documentation
- remove dummy netdev usage in mp nh test
- remove useless sysctl opts in mp nh test
V6:
- make commit message cleaner
V5:
- make self test cleaner
V4:
- fix selftest, do route lookup before checking cached exceptions
V3:
- add selftest
- fix compile error
V2:
- fix fib_info_num_path parameter pass
---
net/ipv4/route.c | 13 +++
tools/testing/selftests/net/pmtu.sh | 119 ++++++++++++++++++++++++----
2 files changed, 115 insertions(+), 17 deletions(-)
diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index 723ac9181558..652f603d29fe 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -1027,6 +1027,19 @@ static void __ip_rt_update_pmtu(struct rtable *rt, struct flowi4 *fl4, u32 mtu)
struct fib_nh_common *nhc;
fib_select_path(net, &res, fl4, NULL);
+#ifdef CONFIG_IP_ROUTE_MULTIPATH
+ if (fib_info_num_path(res.fi) > 1) {
+ int nhsel;
+
+ for (nhsel = 0; nhsel < fib_info_num_path(res.fi); nhsel++) {
+ nhc = fib_info_nhc(res.fi, nhsel);
+ update_or_create_fnhe(nhc, fl4->daddr, 0, mtu, lock,
+ jiffies + net->ipv4.ip_rt_mtu_expires);
+ }
+ rcu_read_unlock();
+ return;
+ }
+#endif /* CONFIG_IP_ROUTE_MULTIPATH */
nhc = FIB_RES_NHC(res);
update_or_create_fnhe(nhc, fl4->daddr, 0, mtu, lock,
jiffies + net->ipv4.ip_rt_mtu_expires);
diff --git a/tools/testing/selftests/net/pmtu.sh b/tools/testing/selftests/net/pmtu.sh
index 569bce8b6383..2b79a16d3369 100755
--- a/tools/testing/selftests/net/pmtu.sh
+++ b/tools/testing/selftests/net/pmtu.sh
@@ -197,6 +197,12 @@
#
# - pmtu_ipv6_route_change
# Same as above but with IPv6
+#
+# - pmtu_ipv4_mp_exceptions
+# Use the same topology as in pmtu_ipv4, but add routeable addresses
+# on host A and B on lo reachable via both routers. Host A and B
+# addresses have multipath routes to each other, b_r1 mtu = 1500.
+# Check that PMTU exceptions are created for both paths.
source lib.sh
source net_helper.sh
@@ -266,7 +272,8 @@ tests="
list_flush_ipv4_exception ipv4: list and flush cached exceptions 1
list_flush_ipv6_exception ipv6: list and flush cached exceptions 1
pmtu_ipv4_route_change ipv4: PMTU exception w/route replace 1
- pmtu_ipv6_route_change ipv6: PMTU exception w/route replace 1"
+ pmtu_ipv6_route_change ipv6: PMTU exception w/route replace 1
+ pmtu_ipv4_mp_exceptions ipv4: PMTU multipath nh exceptions 1"
# Addressing and routing for tests with routers: four network segments, with
# index SEGMENT between 1 and 4, a common prefix (PREFIX4 or PREFIX6) and an
@@ -343,6 +350,9 @@ tunnel6_a_addr="fd00:2::a"
tunnel6_b_addr="fd00:2::b"
tunnel6_mask="64"
+host4_a_addr="192.168.99.99"
+host4_b_addr="192.168.88.88"
+
dummy6_0_prefix="fc00:1000::"
dummy6_1_prefix="fc00:1001::"
dummy6_mask="64"
@@ -984,6 +994,52 @@ setup_ovs_bridge() {
run_cmd ip route add ${prefix6}:${b_r1}::1 via ${prefix6}:${a_r1}::2
}
+setup_multipath_new() {
+ # Set up host A with multipath routes to host B host4_b_addr
+ run_cmd ${ns_a} ip addr add ${host4_a_addr} dev lo
+ run_cmd ${ns_a} ip nexthop add id 401 via ${prefix4}.${a_r1}.2 dev veth_A-R1
+ run_cmd ${ns_a} ip nexthop add id 402 via ${prefix4}.${a_r2}.2 dev veth_A-R2
+ run_cmd ${ns_a} ip nexthop add id 403 group 401/402
+ run_cmd ${ns_a} ip route add ${host4_b_addr} src ${host4_a_addr} nhid 403
+
+ # Set up host B with multipath routes to host A host4_a_addr
+ run_cmd ${ns_b} ip addr add ${host4_b_addr} dev lo
+ run_cmd ${ns_b} ip nexthop add id 401 via ${prefix4}.${b_r1}.2 dev veth_B-R1
+ run_cmd ${ns_b} ip nexthop add id 402 via ${prefix4}.${b_r2}.2 dev veth_B-R2
+ run_cmd ${ns_b} ip nexthop add id 403 group 401/402
+ run_cmd ${ns_b} ip route add ${host4_a_addr} src ${host4_b_addr} nhid 403
+}
+
+setup_multipath_old() {
+ # Set up host A with multipath routes to host B host4_b_addr
+ run_cmd ${ns_a} ip addr add ${host4_a_addr} dev lo
+ run_cmd ${ns_a} ip route add ${host4_b_addr} \
+ src ${host4_a_addr} \
+ nexthop via ${prefix4}.${a_r1}.2 weight 1 \
+ nexthop via ${prefix4}.${a_r2}.2 weight 1
+
+ # Set up host B with multipath routes to host A host4_a_addr
+ run_cmd ${ns_b} ip addr add ${host4_b_addr} dev lo
+ run_cmd ${ns_b} ip route add ${host4_a_addr} \
+ src ${host4_b_addr} \
+ nexthop via ${prefix4}.${b_r1}.2 weight 1 \
+ nexthop via ${prefix4}.${b_r2}.2 weight 1
+}
+
+setup_multipath() {
+ if [ "$USE_NH" = "yes" ]; then
+ setup_multipath_new
+ else
+ setup_multipath_old
+ fi
+
+ # Set up routers with routes to dummies
+ run_cmd ${ns_r1} ip route add ${host4_a_addr} via ${prefix4}.${a_r1}.1
+ run_cmd ${ns_r2} ip route add ${host4_a_addr} via ${prefix4}.${a_r2}.1
+ run_cmd ${ns_r1} ip route add ${host4_b_addr} via ${prefix4}.${b_r1}.1
+ run_cmd ${ns_r2} ip route add ${host4_b_addr} via ${prefix4}.${b_r2}.1
+}
+
setup() {
[ "$(id -u)" -ne 0 ] && echo " need to run as root" && return $ksft_skip
@@ -1076,23 +1132,15 @@ link_get_mtu() {
}
route_get_dst_exception() {
- ns_cmd="${1}"
- dst="${2}"
- dsfield="${3}"
-
- if [ -z "${dsfield}" ]; then
- dsfield=0
- fi
+ ns_cmd="${1}"; shift
- ${ns_cmd} ip route get "${dst}" dsfield "${dsfield}"
+ ${ns_cmd} ip route get "$@"
}
route_get_dst_pmtu_from_exception() {
- ns_cmd="${1}"
- dst="${2}"
- dsfield="${3}"
+ ns_cmd="${1}"; shift
- mtu_parse "$(route_get_dst_exception "${ns_cmd}" "${dst}" "${dsfield}")"
+ mtu_parse "$(route_get_dst_exception "${ns_cmd}" "$@")"
}
check_pmtu_value() {
@@ -1235,10 +1283,10 @@ test_pmtu_ipv4_dscp_icmp_exception() {
run_cmd "${ns_a}" ping -q -M want -Q "${dsfield}" -c 1 -w 1 -s "${len}" "${dst2}"
# Check that exceptions have been created with the correct PMTU
- pmtu_1="$(route_get_dst_pmtu_from_exception "${ns_a}" "${dst1}" "${policy_mark}")"
+ pmtu_1="$(route_get_dst_pmtu_from_exception "${ns_a}" "${dst1}" dsfield "${policy_mark}")"
check_pmtu_value "1400" "${pmtu_1}" "exceeding MTU" || return 1
- pmtu_2="$(route_get_dst_pmtu_from_exception "${ns_a}" "${dst2}" "${policy_mark}")"
+ pmtu_2="$(route_get_dst_pmtu_from_exception "${ns_a}" "${dst2}" dsfield "${policy_mark}")"
check_pmtu_value "1500" "${pmtu_2}" "exceeding MTU" || return 1
}
@@ -1285,9 +1333,9 @@ test_pmtu_ipv4_dscp_udp_exception() {
UDP:"${dst2}":50000,tos="${dsfield}"
# Check that exceptions have been created with the correct PMTU
- pmtu_1="$(route_get_dst_pmtu_from_exception "${ns_a}" "${dst1}" "${policy_mark}")"
+ pmtu_1="$(route_get_dst_pmtu_from_exception "${ns_a}" "${dst1}" dsfield "${policy_mark}")"
check_pmtu_value "1400" "${pmtu_1}" "exceeding MTU" || return 1
- pmtu_2="$(route_get_dst_pmtu_from_exception "${ns_a}" "${dst2}" "${policy_mark}")"
+ pmtu_2="$(route_get_dst_pmtu_from_exception "${ns_a}" "${dst2}" dsfield "${policy_mark}")"
check_pmtu_value "1500" "${pmtu_2}" "exceeding MTU" || return 1
}
@@ -2329,6 +2377,43 @@ test_pmtu_ipv6_route_change() {
test_pmtu_ipvX_route_change 6
}
+test_pmtu_ipv4_mp_exceptions() {
+ setup namespaces routing multipath || return $ksft_skip
+
+ trace "${ns_a}" veth_A-R1 "${ns_r1}" veth_R1-A \
+ "${ns_r1}" veth_R1-B "${ns_b}" veth_B-R1 \
+ "${ns_a}" veth_A-R2 "${ns_r2}" veth_R2-A \
+ "${ns_r2}" veth_R2-B "${ns_b}" veth_B-R2
+
+ # Set up initial MTU values
+ mtu "${ns_a}" veth_A-R1 2000
+ mtu "${ns_r1}" veth_R1-A 2000
+ mtu "${ns_r1}" veth_R1-B 1500
+ mtu "${ns_b}" veth_B-R1 1500
+
+ mtu "${ns_a}" veth_A-R2 2000
+ mtu "${ns_r2}" veth_R2-A 2000
+ mtu "${ns_r2}" veth_R2-B 1500
+ mtu "${ns_b}" veth_B-R2 1500
+
+ fail=0
+
+ # Ping and expect two nexthop exceptions for two routes in nh group
+ run_cmd ${ns_a} ping -q -M want -i 0.1 -c 1 -s 1800 "${host4_b_addr}"
+
+ # Do route lookup before checking cached exceptions.
+ # The following commands are needed for dst entries to be cached
+ # in both paths exceptions and therefore dumped to user space
+ # Check that exceptions have been created with the correct PMTU
+ pmtu="$(route_get_dst_pmtu_from_exception "${ns_a}" "${host4_b_addr}" oif veth_A-R1)"
+ pmtu="$(route_get_dst_pmtu_from_exception "${ns_a}" "${host4_b_addr}" oif veth_A-R2)"
+
+ check_pmtu_value "1500" "${pmtu}" "exceeding MTU (veth_A-R2)" || return 1
+ check_pmtu_value "1500" "${pmtu}" "exceeding MTU (veth_A-R1)" || return 1
+
+ return ${fail}
+}
+
usage() {
echo
echo "$0 [OPTIONS] [TEST]..."
base-commit: 66600fac7a984dea4ae095411f644770b2561ede
--
2.43.5
$ARCH is not always enough to know whether getrandom vDSO is supported
or not. For instance on x86 we want it for x86_64 but not i386.
On the other hand, we already have detailed architecture selection in
vdso_config.h, the only difference is that it cannot be used for
Makefile. But most selftests are built regardless of whether a
functionality is supported or not. The return value KSFT_SKIP is there
for than: it tells the test is skipped because it is not supported.
Make the implementation more flexible by setting a VDSO_GETRANDOM
macro in vdso_config.h. That macro contains the path to the file that
defines __arch_chacha20_blocks_nostack(). It avoids the symbolic
link to vdso directory and will allow architectures to have several
implementations of __arch_chacha20_blocks_nostack() if needed.
Then restore the original behaviour which was dedicated to
vdso_standalone_test_x86 and build getrandom and chacha tests all
the time just like other vDSO selftests and return SKIP when the
functionality to be tested is not implemented.
This has the advantage of doing architecture specific selection at
only one place.
Also change vdso_test_getrandom to return SKIP instead of FAIL when
vDSO function is not found, just like vdso_test_getcpu or
vdso_test_gettimeofday.
Signed-off-by: Christophe Leroy <christophe.leroy(a)csgroup.eu>
---
Based on latest random tree (0dfed8092247)
tools/arch/x86/vdso | 1 -
tools/testing/selftests/vDSO/Makefile | 10 ++++------
tools/testing/selftests/vDSO/vdso_config.h | 3 +++
tools/testing/selftests/vDSO/vdso_test_chacha-asm.S | 7 +++++++
tools/testing/selftests/vDSO/vdso_test_chacha.c | 11 +++++++++++
tools/testing/selftests/vDSO/vdso_test_getrandom.c | 2 +-
6 files changed, 26 insertions(+), 8 deletions(-)
delete mode 120000 tools/arch/x86/vdso
create mode 100644 tools/testing/selftests/vDSO/vdso_test_chacha-asm.S
diff --git a/tools/arch/x86/vdso b/tools/arch/x86/vdso
deleted file mode 120000
index 7eb962fd3454..000000000000
--- a/tools/arch/x86/vdso
+++ /dev/null
@@ -1 +0,0 @@
-../../../arch/x86/entry/vdso/
\ No newline at end of file
diff --git a/tools/testing/selftests/vDSO/Makefile b/tools/testing/selftests/vDSO/Makefile
index 5ead6b1f0478..cfb7c281b22c 100644
--- a/tools/testing/selftests/vDSO/Makefile
+++ b/tools/testing/selftests/vDSO/Makefile
@@ -1,6 +1,6 @@
# SPDX-License-Identifier: GPL-2.0
-ARCH ?= $(shell uname -m | sed -e s/i.86/x86/)
-SRCARCH := $(subst x86_64,x86,$(ARCH))
+uname_M := $(shell uname -m 2>/dev/null || echo not)
+ARCH ?= $(shell echo $(uname_M) | sed -e s/i.86/x86/ -e s/x86_64/x86/)
TEST_GEN_PROGS := vdso_test_gettimeofday
TEST_GEN_PROGS += vdso_test_getcpu
@@ -10,10 +10,8 @@ ifeq ($(ARCH),$(filter $(ARCH),x86 x86_64))
TEST_GEN_PROGS += vdso_standalone_test_x86
endif
TEST_GEN_PROGS += vdso_test_correctness
-ifeq ($(ARCH),$(filter $(ARCH),x86_64))
TEST_GEN_PROGS += vdso_test_getrandom
TEST_GEN_PROGS += vdso_test_chacha
-endif
CFLAGS := -std=gnu99
@@ -38,8 +36,8 @@ $(OUTPUT)/vdso_test_getrandom: CFLAGS += -isystem $(top_srcdir)/tools/include \
$(KHDR_INCLUDES) \
-isystem $(top_srcdir)/include/uapi
-$(OUTPUT)/vdso_test_chacha: $(top_srcdir)/tools/arch/$(SRCARCH)/vdso/vgetrandom-chacha.S
+$(OUTPUT)/vdso_test_chacha: vdso_test_chacha-asm.S
$(OUTPUT)/vdso_test_chacha: CFLAGS += -idirafter $(top_srcdir)/tools/include \
- -idirafter $(top_srcdir)/arch/$(SRCARCH)/include \
+ -idirafter $(top_srcdir)/arch/$(ARCH)/include \
-idirafter $(top_srcdir)/include \
-D__ASSEMBLY__ -Wa,--noexecstack
diff --git a/tools/testing/selftests/vDSO/vdso_config.h b/tools/testing/selftests/vDSO/vdso_config.h
index 740ce8c98d2e..693920471160 100644
--- a/tools/testing/selftests/vDSO/vdso_config.h
+++ b/tools/testing/selftests/vDSO/vdso_config.h
@@ -47,6 +47,7 @@
#elif defined(__x86_64__)
#define VDSO_VERSION 0
#define VDSO_NAMES 1
+#define VDSO_GETRANDOM "../../../../arch/x86/entry/vdso/vgetrandom-chacha.S"
#elif defined(__riscv__) || defined(__riscv)
#define VDSO_VERSION 5
#define VDSO_NAMES 1
@@ -58,6 +59,7 @@
#define VDSO_NAMES 1
#endif
+#ifndef __ASSEMBLY__
static const char *versions[7] = {
"LINUX_2.6",
"LINUX_2.6.15",
@@ -88,5 +90,6 @@ static const char *names[2][7] = {
"__vdso_getrandom",
},
};
+#endif
#endif /* __VDSO_CONFIG_H__ */
diff --git a/tools/testing/selftests/vDSO/vdso_test_chacha-asm.S b/tools/testing/selftests/vDSO/vdso_test_chacha-asm.S
new file mode 100644
index 000000000000..8e704165f6f2
--- /dev/null
+++ b/tools/testing/selftests/vDSO/vdso_test_chacha-asm.S
@@ -0,0 +1,7 @@
+#include "vdso_config.h"
+
+#ifdef VDSO_GETRANDOM
+
+#include VDSO_GETRANDOM
+
+#endif
diff --git a/tools/testing/selftests/vDSO/vdso_test_chacha.c b/tools/testing/selftests/vDSO/vdso_test_chacha.c
index 3a5a08d857cf..9d18d49a82f8 100644
--- a/tools/testing/selftests/vDSO/vdso_test_chacha.c
+++ b/tools/testing/selftests/vDSO/vdso_test_chacha.c
@@ -8,6 +8,8 @@
#include <string.h>
#include <stdint.h>
#include <stdbool.h>
+#include <linux/kconfig.h>
+#include "vdso_config.h"
#include "../kselftest.h"
static uint32_t rol32(uint32_t word, unsigned int shift)
@@ -57,6 +59,10 @@ typedef uint32_t u32;
typedef uint64_t u64;
#include <vdso/getrandom.h>
+#ifdef VDSO_GETRANDOM
+#define HAVE_VDSO_GETRANDOM 1
+#endif
+
int main(int argc, char *argv[])
{
enum { TRIALS = 1000, BLOCKS = 128, BLOCK_SIZE = 64 };
@@ -68,6 +74,11 @@ int main(int argc, char *argv[])
ksft_print_header();
ksft_set_plan(1);
+ if (!__is_defined(HAVE_VDSO_GETRANDOM)) {
+ printf("__arch_chacha20_blocks_nostack() not implemented\n");
+ return KSFT_SKIP;
+ }
+
for (unsigned int trial = 0; trial < TRIALS; ++trial) {
if (getrandom(key, sizeof(key), 0) != sizeof(key)) {
printf("getrandom() failed!\n");
diff --git a/tools/testing/selftests/vDSO/vdso_test_getrandom.c b/tools/testing/selftests/vDSO/vdso_test_getrandom.c
index 8866b65a4605..47ee94b32617 100644
--- a/tools/testing/selftests/vDSO/vdso_test_getrandom.c
+++ b/tools/testing/selftests/vDSO/vdso_test_getrandom.c
@@ -115,7 +115,7 @@ static void vgetrandom_init(void)
vgrnd.fn = (__typeof__(vgrnd.fn))vdso_sym(version, name);
if (!vgrnd.fn) {
printf("%s is missing!\n", name);
- exit(KSFT_FAIL);
+ exit(KSFT_SKIP);
}
ret = VDSO_CALL(vgrnd.fn, 5, NULL, 0, 0, &vgrnd.params, ~0UL);
if (ret == -ENOSYS) {
--
2.44.0
This addresses the infamous unregister_netdevice splat in net selftests;
the actual fix is carried by the first patch, while the 2nd one
addresses a related problem in the relevant test that was patially
hiding the problem.
Targeting net-next as the issue is quite old and I feel a little lost
in the fib info/nh jungle.
---
v1 -> v2:
- drop unintended whitespace change in patch 1/2
Paolo Abeni (2):
ipv6: release nexthop on device removal
selftests: net: really check for bg process completion
net/ipv6/route.c | 6 +++---
tools/testing/selftests/net/pmtu.sh | 2 +-
2 files changed, 4 insertions(+), 4 deletions(-)
--
2.45.2
This version 4 patch series replace direct error handling methods with ksft
macros, which provide better reporting.Currently, when the tmpfs test runs,
it does not display any output if it passes,and if it fails
(particularly when not run as root),it simply exits without any warning or message.
This series of patch adds:
1. Add 'ksft_print_header()' and 'ksft_set_plan()'
to structure test outputs more effectively.
2. Skip if not run as root.
3. Replace direct error handling with 'ksft_test_result_*',
'ksft_print_msg' and KSFT_SKIP macros for better reporting.
v3->v4:
- Start a patchset
- Split patch into smaller pathes to make it easy to review.
Patch1 Replace 'ksft_test_result_skip' with 'KSFT_SKIP' during root run check.
Patch2 Replace 'ksft_test_result_fail' with 'KSFT_SKIP' where fail does not make sense,
or failure could be due to not unsupported APIs with appropriate warnings.
v3: https://lore.kernel.org/all/20241028185756.111832-1-cvam0000@gmail.com/
v2->v3:
- Remove extra ksft_set_plan()
- Remove function for unshare()
- Fix the comment style
v2: https://lore.kernel.org/all/20241026191621.2860376-1-cvam0000@gmail.com/
v1->v2:
- Make the commit message more clear.
v1: https://lore.kernel.org/all/20241024200228.1075840-1-cvam0000@gmail.com/T/#u
thanks
Shivam
Shivam Chaudhary (2):
selftests:tmpfs: Add Skip test if not run as root
selftests:tmpfs: Add kselftest support to tmpfs
.../selftests/tmpfs/bug-link-o-tmpfile.c | 79 +++++++++++++++----
1 file changed, 62 insertions(+), 17 deletions(-)
--
2.45.2
Check number of paths by fib_info_num_path(),
and update_or_create_fnhe() for every path.
Problem is that pmtu is cached only for the oif
that has received icmp message "need to frag",
other oifs will still try to use "default" iface mtu.
An example topology showing the problem:
| host1
+---------+
| dummy0 | 10.179.20.18/32 mtu9000
+---------+
+-----------+----------------+
+---------+ +---------+
| ens17f0 | 10.179.2.141/31 | ens17f1 | 10.179.2.13/31
+---------+ +---------+
| (all here have mtu 9000) |
+------+ +------+
| ro1 | 10.179.2.140/31 | ro2 | 10.179.2.12/31
+------+ +------+
| |
---------+------------+-------------------+------
|
+-----+
| ro3 | 10.10.10.10 mtu1500
+-----+
|
========================================
some networks
========================================
|
+-----+
| eth0| 10.10.30.30 mtu9000
+-----+
| host2
host1 have enabled multipath and
sysctl net.ipv4.fib_multipath_hash_policy = 1:
default proto static src 10.179.20.18
nexthop via 10.179.2.12 dev ens17f1 weight 1
nexthop via 10.179.2.140 dev ens17f0 weight 1
When host1 tries to do pmtud from 10.179.20.18/32 to host2,
host1 receives at ens17f1 iface an icmp packet from ro3 that ro3 mtu=1500.
And host1 caches it in nexthop exceptions cache.
Problem is that it is cached only for the iface that has received icmp,
and there is no way that ro3 will send icmp msg to host1 via another path.
Host1 now have this routes to host2:
ip r g 10.10.30.30 sport 30000 dport 443
10.10.30.30 via 10.179.2.12 dev ens17f1 src 10.179.20.18 uid 0
cache expires 521sec mtu 1500
ip r g 10.10.30.30 sport 30033 dport 443
10.10.30.30 via 10.179.2.140 dev ens17f0 src 10.179.20.18 uid 0
cache
So when host1 tries again to reach host2 with mtu>1500,
if packet flow is lucky enough to be hashed with oif=ens17f1 its ok,
if oif=ens17f0 it blackholes and still gets icmp msgs from ro3 to ens17f1,
until lucky day when ro3 will send it through another flow to ens17f0.
Signed-off-by: Vladimir Vdovin <deliran(a)verdict.gg>
---
V7:
selftest in pmtu.sh:
- add setup_multipath() with old and new nh tests
- add global "dummy_v4" addr variables
- add documentation
- remove dummy netdev usage in mp nh test
- remove useless sysctl opts in mp nh test
V6:
- make commit message cleaner
V5:
- make self test cleaner
V4:
- fix selftest, do route lookup before checking cached exceptions
V3:
- add selftest
- fix compile error
V2:
- fix fib_info_num_path parameter pass
---
net/ipv4/route.c | 13 ++++
tools/testing/selftests/net/pmtu.sh | 95 ++++++++++++++++++++++++++++-
2 files changed, 107 insertions(+), 1 deletion(-)
diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index 723ac9181558..652f603d29fe 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -1027,6 +1027,19 @@ static void __ip_rt_update_pmtu(struct rtable *rt, struct flowi4 *fl4, u32 mtu)
struct fib_nh_common *nhc;
fib_select_path(net, &res, fl4, NULL);
+#ifdef CONFIG_IP_ROUTE_MULTIPATH
+ if (fib_info_num_path(res.fi) > 1) {
+ int nhsel;
+
+ for (nhsel = 0; nhsel < fib_info_num_path(res.fi); nhsel++) {
+ nhc = fib_info_nhc(res.fi, nhsel);
+ update_or_create_fnhe(nhc, fl4->daddr, 0, mtu, lock,
+ jiffies + net->ipv4.ip_rt_mtu_expires);
+ }
+ rcu_read_unlock();
+ return;
+ }
+#endif /* CONFIG_IP_ROUTE_MULTIPATH */
nhc = FIB_RES_NHC(res);
update_or_create_fnhe(nhc, fl4->daddr, 0, mtu, lock,
jiffies + net->ipv4.ip_rt_mtu_expires);
diff --git a/tools/testing/selftests/net/pmtu.sh b/tools/testing/selftests/net/pmtu.sh
index 569bce8b6383..f24c84184c61 100755
--- a/tools/testing/selftests/net/pmtu.sh
+++ b/tools/testing/selftests/net/pmtu.sh
@@ -197,6 +197,12 @@
#
# - pmtu_ipv6_route_change
# Same as above but with IPv6
+#
+# - pmtu_ipv4_mp_exceptions
+# Use the same topology as in pmtu_ipv4, but add routeable "dummy"
+# addresses on host A and B on lo0 reachable via both routers.
+# Host A and B "dummy" addresses have multipath routes to each other.
+# Check that PMTU exceptions are created for both paths.
source lib.sh
source net_helper.sh
@@ -266,7 +272,8 @@ tests="
list_flush_ipv4_exception ipv4: list and flush cached exceptions 1
list_flush_ipv6_exception ipv6: list and flush cached exceptions 1
pmtu_ipv4_route_change ipv4: PMTU exception w/route replace 1
- pmtu_ipv6_route_change ipv6: PMTU exception w/route replace 1"
+ pmtu_ipv6_route_change ipv6: PMTU exception w/route replace 1
+ pmtu_ipv4_mp_exceptions ipv4: PMTU multipath nh exceptions 1"
# Addressing and routing for tests with routers: four network segments, with
# index SEGMENT between 1 and 4, a common prefix (PREFIX4 or PREFIX6) and an
@@ -343,6 +350,9 @@ tunnel6_a_addr="fd00:2::a"
tunnel6_b_addr="fd00:2::b"
tunnel6_mask="64"
+dummy4_a_addr="192.168.99.99"
+dummy4_b_addr="192.168.88.88"
+
dummy6_0_prefix="fc00:1000::"
dummy6_1_prefix="fc00:1001::"
dummy6_mask="64"
@@ -984,6 +994,50 @@ setup_ovs_bridge() {
run_cmd ip route add ${prefix6}:${b_r1}::1 via ${prefix6}:${a_r1}::2
}
+setup_multipath() {
+ if [ "$USE_NH" = "yes" ]; then
+ setup_multipath_new
+ else
+ setup_multipath_old
+ fi
+
+ # Set up routers with routes to dummies
+ run_cmd ${ns_r1} ip route add ${dummy4_a_addr} via ${prefix4}.${a_r1}.1
+ run_cmd ${ns_r2} ip route add ${dummy4_a_addr} via ${prefix4}.${a_r2}.1
+ run_cmd ${ns_r1} ip route add ${dummy4_b_addr} via ${prefix4}.${b_r1}.1
+ run_cmd ${ns_r2} ip route add ${dummy4_b_addr} via ${prefix4}.${b_r2}.1
+}
+
+setup_multipath_new() {
+ # Set up host A with multipath routes to host B dummy4_b_addr
+ run_cmd ${ns_a} ip addr add ${dummy4_a_addr} dev lo0
+ run_cmd ${ns_a} ip nexthop add id 201 via ${prefix4}.${a_r1}.2 dev veth_A-R1
+ run_cmd ${ns_a} ip nexthop add id 202 via ${prefix4}.${a_r2}.2 dev veth_A-R2
+ run_cmd ${ns_a} ip nexthop add id 203 group 201/202
+ run_cmd ${ns_a} ip route add ${dummy4_b_addr} nhid 203
+
+ # Set up host B with multipath routes to host A dummy4_a_addr
+ run_cmd ${ns_b} ip addr add ${dummy4_b_addr} dev lo0
+ run_cmd ${ns_b} ip nexthop add id 201 via ${prefix4}.${b_r1}.2 dev veth_A-R1
+ run_cmd ${ns_b} ip nexthop add id 202 via ${prefix4}.${b_r2}.2 dev veth_A-R2
+ run_cmd ${ns_b} ip nexthop add id 203 group 201/202
+ run_cmd ${ns_b} ip route add ${dummy4_a_addr} nhid 203
+}
+
+setup_multipath_old() {
+ # Set up host A with multipath routes to host B dummy4_b_addr
+ run_cmd ${ns_a} ip addr add ${dummy4_a_addr} dev lo0
+ run_cmd ${ns_a} ip route add ${dummy4_b_addr} \
+ nexthop via ${prefix4}.${a_r1}.2 weight 1 \
+ nexthop via ${prefix4}.${a_r2}.2 weight 1
+
+ # Set up host B with multipath routes to host A dummy4_a_addr
+ run_cmd ${ns_b} ip addr add ${dummy4_b_addr} dev lo0
+ run_cmd ${ns_a} ip route add ${dummy4_a_addr} \
+ nexthop via ${prefix4}.${a_b1}.2 weight 1 \
+ nexthop via ${prefix4}.${a_b2}.2 weight 1
+}
+
setup() {
[ "$(id -u)" -ne 0 ] && echo " need to run as root" && return $ksft_skip
@@ -2329,6 +2383,45 @@ test_pmtu_ipv6_route_change() {
test_pmtu_ipvX_route_change 6
}
+test_pmtu_ipv4_mp_exceptions() {
+ setup namespaces routing multipath || return $ksft_skip
+
+ trace "${ns_a}" veth_A-R1 "${ns_r1}" veth_R1-A \
+ "${ns_r1}" veth_R1-B "${ns_b}" veth_B-R1 \
+ "${ns_a}" veth_A-R2 "${ns_r2}" veth_R2-A \
+ "${ns_r2}" veth_R2-B "${ns_b}" veth_B-R2
+
+ # Set up initial MTU values
+ mtu "${ns_a}" veth_A-R1 2000
+ mtu "${ns_r1}" veth_R1-A 2000
+ mtu "${ns_r1}" veth_R1-B 1400
+ mtu "${ns_b}" veth_B-R1 1400
+
+ mtu "${ns_a}" veth_A-R2 2000
+ mtu "${ns_r2}" veth_R2-A 2000
+ mtu "${ns_r2}" veth_R2-B 1500
+ mtu "${ns_b}" veth_B-R2 1500
+
+ fail=0
+
+ # Ping and expect two nexthop exceptions for two routes in nh group
+ run_cmd ${ns_a} ping -q -M want -i 0.1 -c 1 -s 1800 "${dummy4_b_addr}"
+
+ # Do route lookup before checking cached exceptions.
+ # The following commands are needed for dst entries to be cached
+ # in both paths exceptions and therefore dumped to user space
+ run_cmd ${ns_a} ip route get ${dummy4_b_addr} oif veth_A-R1
+ run_cmd ${ns_a} ip route get ${dummy4_b_addr} oif veth_A-R2
+
+ # Check cached exceptions
+ if [ "$(${ns_a} ip -oneline route list cache | grep mtu | wc -l)" -ne 2 ]; then
+ err " there are not enough cached exceptions"
+ fail=1
+ fi
+
+ return ${fail}
+}
+
usage() {
echo
echo "$0 [OPTIONS] [TEST]..."
base-commit: 66600fac7a984dea4ae095411f644770b2561ede
--
2.43.5
Currently we test signal delivery to the programs run by fp-stress but
our signal handlers simply count the number of signals seen and don't do
anything with the floating point state. The original fpsimd-test and
sve-test programs had signal handlers called irritators which modify the
live register state, verifying that we restore the signal context on
return, but a combination of misleading comments and code resulted in
them never being used and the equivalent handlers in the other tests
being stubbed or omitted.
Clarify the code, implement effective irritator handlers for the test
programs that can have them and then switch the signals generated by the
fp-stress program over to use the irritators, ensuring that we validate
that we restore the saved signal context properly.
Signed-off-by: Mark Brown <broonie(a)kernel.org>
---
Mark Brown (6):
kselftest/arm64: Correct misleading comments on fp-stress irritators
kselftest/arm64: Remove unused ADRs from irritator handlers
kselftest/arm64: Corrupt P15 in the irritator when testing SSVE
kselftest/arm64: Implement irritators for ZA and ZT
kselftest/arm64: Provide a SIGUSR1 handler in the kernel mode FP stress test
kselftest/arm64: Test signal handler state modification in fp-stress
tools/testing/selftests/arm64/fp/fp-stress.c | 2 +-
tools/testing/selftests/arm64/fp/fpsimd-test.S | 4 +---
tools/testing/selftests/arm64/fp/kernel-test.c | 4 ++++
tools/testing/selftests/arm64/fp/sve-test.S | 6 ++----
tools/testing/selftests/arm64/fp/za-test.S | 13 ++++---------
tools/testing/selftests/arm64/fp/zt-test.S | 13 ++++---------
6 files changed, 16 insertions(+), 26 deletions(-)
---
base-commit: 8e929cb546ee42c9a61d24fae60605e9e3192354
change-id: 20241023-arm64-fp-stress-irritator-5415fe92adef
Best regards,
--
Mark Brown <broonie(a)kernel.org>
[adding kunit lists to cc, see lore link below for context]
https://lore.kernel.org/linux-fsdevel/20241104141750.16119-3-ddiss@suse.de/
On Wed, 6 Nov 2024 07:04:21 +0800, kernel test robot wrote:
...
> All warnings (new ones prefixed by >>, old ones prefixed by <<):
>
> >> WARNING: modpost: vmlinux: section mismatch in reference: initramfs_test_cases+0x0 (section: .data) -> initramfs_test_extract (section: .init.text)
> >> WARNING: modpost: vmlinux: section mismatch in reference: initramfs_test_cases+0x30 (section: .data) -> initramfs_test_fname_overrun (section: .init.text)
> >> WARNING: modpost: vmlinux: section mismatch in reference: initramfs_test_cases+0x60 (section: .data) -> initramfs_test_data (section: .init.text)
> >> WARNING: modpost: vmlinux: section mismatch in reference: initramfs_test_cases+0x90 (section: .data) -> initramfs_test_csum (section: .init.text)
> >> WARNING: modpost: vmlinux: section mismatch in reference: initramfs_test_cases+0xc0 (section: .data) -> initramfs_test_hardlink (section: .init.text)
> >> WARNING: modpost: vmlinux: section mismatch in reference: initramfs_test_cases+0xf0 (section: .data) -> initramfs_test_many (section: .init.text)
If I understand correctly, the kunit_case initramfs_test_cases[]
members can't be flagged __initdata, as they need to be present
post-init for results queries via debugfs.
The remaining -Wimplicit-function-declaration reports are all valid.
I'll fix them in v3.
Thanks, David
Recently, the net/lib.sh file has been modified to include defer.sh from
net/lib/sh/ directory. The Makefile from net/lib has been modified
accordingly, but not the ones from the sub-targets using net/lib.sh.
Because of that, the new file is not installed as expected when
installing the Forwarding, MPTCP, and Netfilter targets, e.g.
# make -C tools/testing/selftests TARGETS=net/mptcp install \
INSTALL_PATH=/tmp/kself
# cd /tmp/kself/
# ./run_kselftest.sh -c net/mptcp
TAP version 13
1..7
# timeout set to 1800
# selftests: net/mptcp: mptcp_connect.sh
# ./../lib.sh: line 5: /tmp/kself/net/lib/sh/defer.sh: No such file
or directory
# (...)
This can be fixed simply by adding all the .sh files from net/lib/sh
directory to the TEST_INCLUDES variable in the different Makefile's.
Fixes: a6e263f125cd ("selftests: net: lib: Introduce deferred commands")
Signed-off-by: Matthieu Baerts (NGI0) <matttbe(a)kernel.org>
---
tools/testing/selftests/net/forwarding/Makefile | 3 ++-
tools/testing/selftests/net/mptcp/Makefile | 2 +-
tools/testing/selftests/net/netfilter/Makefile | 3 ++-
3 files changed, 5 insertions(+), 3 deletions(-)
diff --git a/tools/testing/selftests/net/forwarding/Makefile b/tools/testing/selftests/net/forwarding/Makefile
index 224346426ef2220470bf2dbef66198eae9331f56..7d885cff8d79bc6882d2341d0a1a59891fc1cb2e 100644
--- a/tools/testing/selftests/net/forwarding/Makefile
+++ b/tools/testing/selftests/net/forwarding/Makefile
@@ -126,6 +126,7 @@ TEST_FILES := devlink_lib.sh \
tc_common.sh
TEST_INCLUDES := \
- ../lib.sh
+ ../lib.sh \
+ $(wildcard ../lib/sh/*.sh)
include ../../lib.mk
diff --git a/tools/testing/selftests/net/mptcp/Makefile b/tools/testing/selftests/net/mptcp/Makefile
index 5d796622e73099d0a0331c1bc8a41f09e1d36b4d..8e3fc05a539797c5e0feab72be69db623ef3fa96 100644
--- a/tools/testing/selftests/net/mptcp/Makefile
+++ b/tools/testing/selftests/net/mptcp/Makefile
@@ -11,7 +11,7 @@ TEST_GEN_FILES = mptcp_connect pm_nl_ctl mptcp_sockopt mptcp_inq
TEST_FILES := mptcp_lib.sh settings
-TEST_INCLUDES := ../lib.sh ../net_helper.sh
+TEST_INCLUDES := ../lib.sh $(wildcard ../lib/sh/*.sh) ../net_helper.sh
EXTRA_CLEAN := *.pcap
diff --git a/tools/testing/selftests/net/netfilter/Makefile b/tools/testing/selftests/net/netfilter/Makefile
index 542f7886a0bc2ac016f41d8b70357a8e0c1d271b..9d009f74cfc2da66428b1e23d5884e2c3bc4a85d 100644
--- a/tools/testing/selftests/net/netfilter/Makefile
+++ b/tools/testing/selftests/net/netfilter/Makefile
@@ -55,4 +55,5 @@ TEST_FILES := lib.sh
TEST_FILES += packetdrill
TEST_INCLUDES := \
- ../lib.sh
+ ../lib.sh \
+ $(wildcard ../lib/sh/*.sh)
---
base-commit: ecf99864ea6b1843773589a935bb026951bf12dd
change-id: 20241104-net-next-selftests-lib-sh-deps-cc359ca5602f
Best regards,
--
Matthieu Baerts (NGI0) <matttbe(a)kernel.org>
A few unrelated devmem TCP fixes bundled in a series for some
convenience (if that's ok).
Patch 1-2: fix naming and provide page_pool_alloc_netmem for fragged
netmem.
Patch 3-4: fix issues with dma-buf dma addresses being potentially
passed to dma_sync_for_* helpers.
Patch 5-6: fix syzbot SO_DEVMEM_DONTNEED issue and add test for this
case.
Mina Almasry (6):
net: page_pool: rename page_pool_alloc_netmem to *_netmems
net: page_pool: create page_pool_alloc_netmem
page_pool: disable sync for cpu for dmabuf memory provider
netmem: add netmem_prefetch
net: fix SO_DEVMEM_DONTNEED looping too long
ncdevmem: add test for too many token_count
Samiullah Khawaja (1):
page_pool: Set `dma_sync` to false for devmem memory provider
include/net/netmem.h | 7 ++++
include/net/page_pool/helpers.h | 50 ++++++++++++++++++--------
include/net/page_pool/types.h | 2 +-
net/core/devmem.c | 9 +++--
net/core/page_pool.c | 11 +++---
net/core/sock.c | 46 ++++++++++++++----------
tools/testing/selftests/net/ncdevmem.c | 11 ++++++
7 files changed, 93 insertions(+), 43 deletions(-)
--
2.47.0.163.g1226f6d8fa-goog
Changes v5:
- Tests are skipped if snc_unreliable was set.
- Moved resctrlfs.c changes from patch 2/2 to 1/2.
- Removed CAT changes since it's not impacted by SNC in the selftest.
- Updated various comments.
- Fixed a bunch of minor issues pointed out in the review.
Changes v4:
- Printing SNC warnings at the start of every test.
- Printing SNC warnings at the end of every relevant test.
- Remove global snc_mode variable, consolidate snc detection functions
into one.
- Correct minor mistakes.
Changes v3:
- Reworked patch 2.
- Changed minor things in patch 1 like function name and made
corrections to the patch message.
Changes v2:
- Removed patches 2 and 3 since now this part will be supported by the
kernel.
Sub-Numa Clustering (SNC) allows splitting CPU cores, caches and memory
into multiple NUMA nodes. When enabled, NUMA-aware applications can
achieve better performance on bigger server platforms.
SNC support in the kernel was merged into x86/cache [1]. With SNC enabled
and kernel support in place all the tests will function normally (aside
from effective cache size). There might be a problem when SNC is enabled
but the system is still using an older kernel version without SNC
support. Currently the only message displayed in that situation is a
guess that SNC might be enabled and is causing issues. That message also
is displayed whenever the test fails on an Intel platform.
Add a mechanism to discover kernel support for SNC which will add more
meaning and certainty to the error message.
Add runtime SNC mode detection and verify how reliable that information
is.
Series was tested on Ice Lake server platforms with SNC disabled, SNC-2
and SNC-4. The tests were also ran with and without kernel support for
SNC.
Series applies cleanly on kselftest/next.
[1] https://lore.kernel.org/all/20240628215619.76401-1-tony.luck@intel.com/
Previous versions of this series:
[v1] https://lore.kernel.org/all/cover.1709721159.git.maciej.wieczor-retman@inte…
[v2] https://lore.kernel.org/all/cover.1715769576.git.maciej.wieczor-retman@inte…
[v3] https://lore.kernel.org/all/cover.1719842207.git.maciej.wieczor-retman@inte…
[v4] https://lore.kernel.org/all/cover.1720774981.git.maciej.wieczor-retman@inte…
Maciej Wieczor-Retman (2):
selftests/resctrl: Adjust effective L3 cache size with SNC enabled
selftests/resctrl: Adjust SNC support messages
tools/testing/selftests/resctrl/cmt_test.c | 8 +-
tools/testing/selftests/resctrl/mba_test.c | 8 +-
tools/testing/selftests/resctrl/mbm_test.c | 10 +-
tools/testing/selftests/resctrl/resctrl.h | 7 +
.../testing/selftests/resctrl/resctrl_tests.c | 8 +-
tools/testing/selftests/resctrl/resctrlfs.c | 132 ++++++++++++++++++
6 files changed, 166 insertions(+), 7 deletions(-)
--
2.46.2
Greetings:
Welcome to v6, see changelog below. This revision includes only
documentation changes: patch 7 has been updated with Bagas' suggestions
and the htmldoc looks better as a result. In addition, this cover letter
has been updated with a full re-run of the test data. We've included a
new test case highlighting a case Sridhar asked about in our v3. See
below.
This series introduces a new mechanism, IRQ suspension, which allows
network applications using epoll to mask IRQs during periods of high
traffic while also reducing tail latency (compared to existing
mechanisms, see below) during periods of low traffic. In doing so, this
balances CPU consumption with network processing efficiency.
Martin Karsten (CC'd) and I have been collaborating on this series for
several months and have appreciated the feedback from the community on
our RFC [1]. We've updated the cover letter and kernel documentation in
an attempt to more clearly explain how this mechanism works, how
applications can use it, and how it compares to existing mechanisms in
the kernel. We've added an additional test case, 'fullbusy', achieved by
modifying libevent for comparison. See below for a detailed description,
link to the patch, and test results.
I briefly mentioned this idea at netdev conf 2024 (for those who were
there) and Martin described this idea in an earlier paper presented at
Sigmetrics 2024 [2].
~ The short explanation (TL;DR)
We propose adding a new napi config parameter: irq_suspend_timeout to
help balance CPU usage and network processing efficiency when using IRQ
deferral and napi busy poll.
If this parameter is set to a non-zero value *and* a user application
has enabled preferred busy poll on a busy poll context (via the
EPIOCSPARAMS ioctl introduced in commit 18e2bf0edf4d ("eventpoll: Add
epoll ioctl for epoll_params")), then application calls to epoll_wait
for that context will cause device IRQs and softirq processing to be
suspended as long as epoll_wait successfully retrieves data from the
NAPI. Each time data is retrieved, the irq_suspend_timeout is deferred.
If/when network traffic subsides and epoll_wait returns no data, IRQ
suspension is immediately reverted back to the existing
napi_defer_hard_irqs and gro_flush_timeout mechanism which was
introduced in commit 6f8b12d661d0 ("net: napi: add hard irqs deferral
feature")).
The irq_suspend_timeout serves as a safety mechanism. If userland takes
a long time processing data, irq_suspend_timeout will fire and restart
normal NAPI processing.
For a more in depth explanation, please continue reading.
~ Comparison with existing mechanisms
Interrupt mitigation can be accomplished in napi software, by setting
napi_defer_hard_irqs and gro_flush_timeout, or via interrupt coalescing
in the NIC. This can be quite efficient, but in both cases, a fixed
timeout (or packet count) needs to be configured. However, a fixed
timeout cannot effectively support both low- and high-load situations:
At low load, an application typically processes a few requests and then
waits to receive more input data. In this scenario, a large timeout will
cause unnecessary latency.
At high load, an application typically processes many requests before
being ready to receive more input data. In this case, a small timeout
will likely fire prematurely and trigger irq/softirq processing, which
interferes with the application's execution. This causes overhead, most
likely due to cache contention.
While NICs attempt to provide adaptive interrupt coalescing schemes,
these cannot properly take into account application-level processing.
An alternative packet delivery mechanism is busy-polling, which results
in perfect alignment of application processing and network polling. It
delivers optimal performance (throughput and latency), but results in
100% cpu utilization and is thus inefficient for below-capacity
workloads.
We propose to add a new packet delivery mode that properly alternates
between busy polling and interrupt-based delivery depending on busy and
idle periods of the application. During a busy period, the system
operates in busy-polling mode, which avoids interference. During an idle
period, the system falls back to interrupt deferral, but with a small
timeout to avoid excessive latencies. This delivery mode can also be
viewed as an extension of basic interrupt deferral, but alternating
between a small and a very large timeout.
This delivery mode is efficient, because it avoids softirq execution
interfering with application processing during busy periods. It can be
used with blocking epoll_wait to conserve cpu cycles during idle
periods. The effect of alternating between busy and idle periods is that
performance (throughput and latency) is very close to full busy polling,
while cpu utilization is lower and very close to interrupt mitigation.
~ Usage details
IRQ suspension is introduced via a per-NAPI configuration parameter that
controls the maximum time that IRQs can be suspended.
Here's how it is intended to work:
- The user application (or system administrator) uses the netdev-genl
netlink interface to set the pre-existing napi_defer_hard_irqs and
gro_flush_timeout NAPI config parameters to enable IRQ deferral.
- The user application (or system administrator) sets the proposed
irq_suspend_timeout parameter via the netdev-genl netlink interface
to a larger value than gro_flush_timeout to enable IRQ suspension.
- The user application issues the existing epoll ioctl to set the
prefer_busy_poll flag on the epoll context.
- The user application then calls epoll_wait to busy poll for network
events, as it normally would.
- If epoll_wait returns events to userland, IRQs are suspended for the
duration of irq_suspend_timeout.
- If epoll_wait finds no events and the thread is about to go to
sleep, IRQ handling using napi_defer_hard_irqs and gro_flush_timeout
is resumed.
As long as epoll_wait is retrieving events, IRQs (and softirq
processing) for the NAPI being polled remain disabled. When network
traffic reduces, eventually a busy poll loop in the kernel will retrieve
no data. When this occurs, regular IRQ deferral using gro_flush_timeout
for the polled NAPI is re-enabled.
Unless IRQ suspension is continued by subsequent calls to epoll_wait, it
automatically times out after the irq_suspend_timeout timer expires.
Regular deferral is also immediately re-enabled when the epoll context
is destroyed.
~ Usage scenario
The target scenario for IRQ suspension as packet delivery mode is a
system that runs a dominant application with substantial network I/O.
The target application can be configured to receive input data up to a
certain batch size (via epoll_wait maxevents parameter) and this batch
size determines the worst-case latency that application requests might
experience. Because packet delivery is suspended during the target
application's processing, the batch size also determines the worst-case
latency of concurrent applications using the same RX queue(s).
gro_flush_timeout should be set as small as possible, but large enough to
make sure that a single request is likely not being interfered with.
irq_suspend_timeout is largely a safety mechanism against misbehaving
applications. It should be set large enough to cover the processing of an
entire application batch, i.e., the factor between gro_flush_timeout and
irq_suspend_timeout should roughly correspond to the maximum batch size
that the target application would process in one go.
~ Design rationale
The implementation of the IRQ suspension mechanism very nicely dovetails
with the existing mechanism for IRQ deferral when preferred busy poll is
enabled (introduced in commit 7fd3253a7de6 ("net: Introduce preferred
busy-polling"), see that commit message for more details).
While it would be possible to inject the suspend timeout via
the existing epoll ioctl, it is more natural to avoid this path for one
main reason:
An epoll context is linked to NAPI IDs as file descriptors are added;
this means any epoll context might suddenly be associated with a
different net_device if the application were to replace all existing
fds with fds from a different device. In this case, the scope of the
suspend timeout becomes unclear and many edge cases for both the user
application and the kernel are introduced
Only a single iteration through napi busy polling is needed for this
mechanism to work effectively. Since an important objective for this
mechanism is preserving cpu cycles, exactly one iteration of the napi
busy loop is invoked when busy_poll_usecs is set to 0.
~ Important call out in the implementation
- Enabling per epoll-context preferred busy poll will now effectively
lead to a nonblocking iteration through napi_busy_loop, even when
busy_poll_usecs is 0. See patch 4.
~ Benchmark configs & descriptions
The changes were benchmarked with memcached [3] using the benchmarking
tool mutilate [4].
To facilitate benchmarking, a small patch [5] was applied to memcached
1.6.29 to allow setting per-epoll context preferred busy poll and other
settings via environment variables. Another small patch [6] was applied
to libevent to enable full busy-polling.
Multiple scenarios were benchmarked as described below and the scripts
used for producing these results can be found on github [7] (note: all
scenarios use NAPI-based traffic splitting via SO_INCOMING_ID by passing
-N to memcached):
- base:
- no other options enabled
- deferX:
- set defer_hard_irqs to 100
- set gro_flush_timeout to X,000
- napibusy:
- set defer_hard_irqs to 100
- set gro_flush_timeout to 200,000
- enable busy poll via the existing ioctl (busy_poll_usecs = 64,
busy_poll_budget = 64, prefer_busy_poll = true)
- fullbusy:
- set defer_hard_irqs to 100
- set gro_flush_timeout to 5,000,000
- enable busy poll via the existing ioctl (busy_poll_usecs = 1000,
busy_poll_budget = 64, prefer_busy_poll = true)
- change memcached's nonblocking epoll_wait invocation (via
libevent) to using a 1 ms timeout
- suspend0:
- set defer_hard_irqs to 0
- set gro_flush_timeout to 0
- set irq_suspend_timeout to 20,000,000
- enable busy poll via the existing ioctl (busy_poll_usecs = 0,
busy_poll_budget = 64, prefer_busy_poll = true)
- suspendX:
- set defer_hard_irqs to 100
- set gro_flush_timeout to X,000
- set irq_suspend_timeout to 20,000,000
- enable busy poll via the existing ioctl (busy_poll_usecs = 0,
busy_poll_budget = 64, prefer_busy_poll = true)
~ Benchmark results
Tested on:
Single socket AMD EPYC 7662 64-Core Processor
Hyperthreading disabled
4 NUMA Zones (NPS=4)
16 CPUs per NUMA zone (64 cores total)
2 x Dual port 100gbps Mellanox Technologies ConnectX-5 Ex EN NIC
The test machine is configured such that a single interface has 8 RX
queues. The queues' IRQs and memcached are pinned to CPUs that are
NUMA-local to the interface which is under test. The NIC's interrupt
coalescing configuration is left at boot-time defaults.
Results:
Results are shown below. The mechanism added by this series is
represented by the 'suspend' cases. Data presented shows a summary over
at least 15 runs of each test case [8] using the scripts on github [7].
For latency, the median is shown. For throughput and CPU utilization,
the average is shown.
The results also include cycles-per-query (cpq) and
instruction-per-query (ipq) metrics, following the methodology proposed
in [2], to augment the CPU utilization numbers, which could be skewed
due to frequency scaling. We find that this does not appear to be the
case as CPU utilization and low-level metrics show similar trends.
These results were captured using the scripts on github [7] to
illustrate how this approach compares with other pre-existing
mechanisms. This data is not to be interpreted as scientific data
captured in a fully isolated lab setting, but instead as best effort,
illustrative information comparing and contrasting tradeoffs.
The absolute QPS results are higher than our previous submission, but
the relative differences between variants are equivalent. Because the
patches have been rebased on 6.12, several factors have likely
influenced the overall performance. Most importantly, we had to switch
to a new set of basic kernel options, which has likely altered the
baseline performance. Because the overall comparison of variants still
holds, we have not attempted to recreate the exact set of kernel options
from the previous submission.
Compare:
- Throughput (MAX) and latencies of base vs suspend.
- CPU usage of napibusy and fullbusy during lower load (200K, 400K for
example) vs suspend.
- Latency of the defer variants vs suspend as timeout and load
increases.
- suspend0, which sets defer_hard_irqs and gro_flush_timeout to 0, has
nearly the same performance as the base case (this is FAQ item #1).
The overall takeaway is that the suspend variants provide a superior
combination of high throughput, low latency, and low cpu utilization
compared to all other variants. Each of the suspend variants works very
well, but some fine-tuning between latency and cpu utilization is still
possible by tuning the small timeout (gro_flush_timeout).
Note: we've reorganized the results to make comparison among testcases
with the same load easier.
testcase load qps avglat 95%lat 99%lat cpu cpq ipq
base 200K 199954 112 237 415 26 13040 11336
defer10 200K 200002 54 123 142 28 19033 16508
defer20 200K 199985 60 130 153 26 15737 14247
defer50 200K 199968 78 142 181 23 12113 11609
defer200 200K 199997 163 252 304 18 8449 9155
fullbusy 200K 199993 46 117 132 100 43959 23320
napibusy 200K 200006 100 237 275 56 25016 24866
suspend0 200K 200012 105 249 432 29 14369 11844
suspend10 200K 200004 53 123 141 32 19432 16752
suspend20 200K 200014 58 126 151 30 16356 14670
suspend50 200K 200018 73 134 176 26 13245 12416
suspend200 200K 200027 149 250 302 20 9508 9781
testcase load qps avglat 95%lat 99%lat cpu cpq ipq
base 400K 399984 139 268 715 40 9437 9299
defer10 400K 400002 59 133 165 53 14089 12908
defer20 400K 400016 66 140 171 47 12085 11682
defer50 400K 400037 87 161 198 39 9528 9879
defer200 400K 399954 181 273 329 32 7326 8438
fullbusy 400K 399951 50 123 155 100 21990 16097
napibusy 400K 399997 76 221 271 83 18260 16511
suspend0 400K 399991 125 337 768 48 11051 9629
suspend10 400K 399990 57 129 161 54 13629 12841
suspend20 400K 399922 61 135 167 49 12055 11715
suspend50 400K 400024 75 148 186 42 10049 10243
suspend200 400K 399936 154 267 325 34 7770 8677
testcase load qps avglat 95%lat 99%lat cpu cpq ipq
base 600K 600064 148 265 576 61 9276 8757
defer10 600K 599985 71 147 204 76 12048 10863
defer20 600K 600024 75 151 199 66 10572 10328
defer50 600K 600054 94 172 217 55 8584 9144
defer200 600K 600030 200 299 355 45 6874 8189
fullbusy 600K 599956 55 127 176 100 14650 13968
napibusy 600K 599956 64 163 252 96 14022 14153
suspend0 600K 600029 126 306 724 70 10393 8977
suspend10 600K 599997 63 137 194 70 10991 11005
suspend20 600K 600012 67 141 194 65 10108 10359
suspend50 600K 600045 80 157 203 57 8747 9320
suspend200 600K 599940 158 277 344 48 7221 8354
testcase load qps avglat 95%lat 99%lat cpu cpq ipq
base 800K 800025 179 298 555 86 9572 8297
defer10 800K 799275 224 633 1271 96 10679 8904
defer20 800K 800041 114 226 328 90 10122 8917
defer50 800K 799936 118 207 288 77 8820 8607
defer200 800K 799994 228 341 403 65 7424 8130
fullbusy 800K 799964 62 136 192 100 10992 12518
napibusy 800K 799971 65 142 216 99 10911 12529
suspend0 800K 799965 126 250 533 86 9489 8496
suspend10 800K 799995 69 145 201 83 9475 9764
suspend20 800K 799931 74 151 209 79 8976 9336
suspend50 800K 799946 87 168 224 71 7993 8794
suspend200 800K 799993 160 292 357 62 6967 8184
testcase load qps avglat 95%lat 99%lat cpu cpq ipq
base 1000K 915792 3498 5740 6239 97 9388 7930
defer10 1000K 876285 3896 6095 6418 99 9960 8542
defer20 1000K 914909 3107 5771 6283 97 9407 8284
defer50 1000K 928426 2977 5591 5931 97 9214 8012
defer200 1000K 959989 3097 5306 5929 96 8816 7908
fullbusy 1000K 1000102 74 155 213 100 8796 10559
napibusy 1000K 1000006 74 154 216 100 8787 10654
suspend0 1000K 960757 2223 5715 7029 98 8964 7993
suspend10 1000K 999926 80 162 222 92 8246 8922
suspend20 1000K 1000095 85 166 226 89 7966 8719
suspend50 1000K 1000067 96 180 238 84 7476 8419
suspend200 1000K 999968 163 298 363 76 6798 8061
testcase load qps avglat 95%lat 99%lat cpu cpq ipq
base MAX 1054805 4152 5298 5743 100 8332 7890
defer10 MAX 937098 4598 6010 6347 100 9378 8407
defer20 MAX 988905 4389 5637 5990 100 8886 8106
defer50 MAX 1067194 3960 5216 5544 100 8235 7911
defer200 MAX 1054967 4084 5496 5821 100 8323 7871
fullbusy MAX 1248006 3472 3918 3979 100 7050 7919
napibusy MAX 1128384 3742 7958 10753 100 7776 7872
suspend0 MAX 1034456 4242 5668 6042 100 8497 7912
suspend10 MAX 1229229 3513 3926 3986 100 7156 7937
suspend20 MAX 1226845 3514 3939 3985 100 7171 7937
suspend50 MAX 1230757 3513 3935 3983 100 7140 7935
suspend200 MAX 1230424 3503 3934 3984 100 7142 7927
~ FAQ
- Why is a new parameter needed? Does irq_suspend_timeout override
gro_flush_timeout?
Using the suspend mechanism causes the system to alternate between
polling mode and irq-driven packet delivery. During busy periods,
irq_suspend_timeout overrides gro_flush_timeout and keeps the system
busy polling, but when epoll finds no events, the setting of
gro_flush_timeout and napi_defer_hard_irqs determine the next step.
There are essentially three possible loops for network processing and
packet delivery:
1) hardirq -> softirq -> napi poll; basic interrupt delivery
2) timer -> softirq -> napi poll; deferred irq processing
3) epoll -> busy-poll -> napi poll; busy looping
Loop 2 can take control from Loop 1, if gro_flush_timeout and
napi_defer_hard_irqs are set.
If gro_flush_timeout and napi_defer_hard_irqs are set, Loops 2 and
3 "wrestle" with each other for control. During busy periods,
irq_suspend_timeout is used as timer in Loop 2, which essentially
tilts this in favour of Loop 3.
If gro_flush_timeout and napi_defer_hard_irqs are not set, Loop 3
cannot take control from Loop 1.
Therefore, setting gro_flush_timeout and napi_defer_hard_irqs is the
recommended usage, because otherwise setting irq_suspend_timeout
might not have any discernible effect.
This is shown in the results above: compare suspend0 with the base
case. Note that the lack of napi_defer_hard_irqs and
gro_flush_timeout produce similar results for both, which encourages
the use of napi_defer_hard_irqs and gro_flush_timeout in addition to
irq_suspend_timeout.
- Can the new timeout value be threaded through the new epoll ioctl ?
Only with difficulty. The epoll ioctl sets options on an epoll
context and the NAPI ID associated with an epoll context can change
based on what file descriptors a user app adds to the epoll context.
This would introduce complexity in the API from the user perspective
and also complexity in the kernel.
- Can irq suspend be built by combining NIC coalescing and
gro_flush_timeout ?
No. The problem is that the long timeout must engage if and only if
prefer-busy is active.
When using NIC coalescing for the short timeout (without
napi_defer_hard_irqs/gro_flush_timeout), an interrupt after an idle
period will trigger softirq, which will run napi polling. At this
point, prefer-busy is not active, so NIC interrupts would be
re-enabled. Then it is not possible for the longer timeout to
interject to switch control back to polling. In other words, only by
using the software timer for the short timeout, it is possible to
extend the timeout without having to reprogram the NIC timer or
reach down directly and disable interrupts.
Using gro_flush_timeout for the long timeout also has problems, for
the same underlying reason. In the current napi implementation,
gro_flush_timeout is not tied to prefer-busy. We'd either have to
change that and in the process modify the existing deferral
mechanism, or introduce a state variable to determine whether
gro_flush_timeout is used as long timeout for irq suspend or whether
it is used for its default purpose. In an earlier version, we did
try something similar to the latter and made it work, but it ends up
being a lot more convoluted than our current proposal.
- Isn't it already possible to combine busy looping with irq deferral?
Yes, in fact enabling irq deferral via napi_defer_hard_irqs and
gro_flush_timeout is a precondition for prefer_busy_poll to have an
effect. If the application also uses a tight busy loop with
essentially nonblocking epoll_wait (accomplished with a very short
timeout parameter), this is the fullbusy case shown in the results.
An application using blocking epoll_wait is shown as the napibusy
case in the results. It's a hybrid approach that provides limited
latency benefits compared to the base case and plain irq deferral,
but not as good as fullbusy or suspend.
~ Special thanks
Several people were involved in earlier stages of the development of this
mechanism whom we'd like to thank:
- Peter Cai (CC'd), for the initial kernel patch and his contributions
to the paper.
- Mohammadamin Shafie (CC'd), for testing various versions of the kernel
patch and providing helpful feedback.
Thanks,
Martin and Joe
[1]: https://lore.kernel.org/netdev/20240812125717.413108-1-jdamato@fastly.com/
[2]: https://doi.org/10.1145/3626780
[3]: https://github.com/memcached/memcached/blob/master/doc/napi_ids.txt
[4]: https://github.com/leverich/mutilate
[5]: https://raw.githubusercontent.com/martinkarsten/irqsuspend/main/patches/mem…
[6]: https://raw.githubusercontent.com/martinkarsten/irqsuspend/main/patches/lib…
[7]: https://github.com/martinkarsten/irqsuspend
[8]: https://github.com/martinkarsten/irqsuspend/tree/main/results
v6:
- Updated the cover letter with a full re-run of all test cases,
including a new case suspend0, as requested by Sridhar previously.
- Updated the kernel documentation in patch 7 as suggested by Bagas
Sanjaya, which improved the htmldoc output.
v5: https://lore.kernel.org/netdev/20241103052421.518856-1-jdamato@fastly.com/
- Adjusted patch 5 to only suspend IRQs when ep_send_events returns a
positive return value. This issue was pointed out by Hillf Danton.
- Updated the commit message of patch 6 which still mentioned netcat,
despite the code being updated in v4 to replace it with socat and fixed
misspelling of netdevsim.
- Fixed a minor typo in patch 7 and removed an unnecessary paragraph.
- Added Sridhar Samudrala's Reviewed-by to patch 1-5 and 7.
v4: https://lore.kernel.org/netdev/20241102005214.32443-1-jdamato@fastly.com/
- Added a new FAQ item to cover letter.
- Updated patch 6 to use socat instead of nc in busy_poll_test.sh and
updated busy_poller.c to use netlink directly to configure napi
params.
- Updated the kernel documentation in patch 7 to include more details.
- Dropped Stanislav's Acked-by and Bagas' Reviewed-by from patch 7
since the documentation was updated.
v3: https://lore.kernel.org/netdev/20241101004846.32532-1-jdamato@fastly.com/
- Added Stanislav Fomichev's Acked-by to every patch except the newly
added selftest.
- Added Bagas Sanjaya's Reviewed-by to the documentation patch.
- Fixed the commit message of patch 2 to remove a reference to the now
non-existent sysfs setting.
- Added a self test which tests both "regular" busy poll and busy poll
with suspend enabled. This was added as patch 6 as requested by
Paolo. netdevsim was chosen instead of veth due to netdevsim's
pre-existing support for netdev-genl. See the commit message of
patch 6 for more details.
v2: https://lore.kernel.org/bpf/20241021015311.95468-1-jdamato@fastly.com/
- Cover letter updated, including a re-run of test data.
- Patch 1 rewritten to use netdev-genl instead of sysfs.
- Patch 3 updated with a comment added to napi_resume_irqs.
- Patch 4 rebased to apply now that commit b9ca079dd6b0 ("eventpoll:
Annotate data-race of busy_poll_usecs") has been picked up from VFS.
- Patch 6 updated the kernel documentation.
rfc -> v1:
- Cover letter updated to include more details.
- Patch 1 updated to remove the documentation added. This was moved to
patch 6 with the rest of the docs (see below).
- Patch 5 updated to fix an error uncovered by the kernel build robot.
See patch 5's changelog for more details.
- Patch 6 added which updates kernel documentation.
Joe Damato (2):
selftests: net: Add busy_poll_test
docs: networking: Describe irq suspension
Martin Karsten (5):
net: Add napi_struct parameter irq_suspend_timeout
net: Suspend softirq when prefer_busy_poll is set
net: Add control functions for irq suspension
eventpoll: Trigger napi_busy_loop, if prefer_busy_poll is set
eventpoll: Control irq suspension for prefer_busy_poll
Documentation/netlink/specs/netdev.yaml | 7 +
Documentation/networking/napi.rst | 170 ++++++++-
fs/eventpoll.c | 36 +-
include/linux/netdevice.h | 2 +
include/net/busy_poll.h | 3 +
include/uapi/linux/netdev.h | 1 +
net/core/dev.c | 58 +++-
net/core/dev.h | 25 ++
net/core/netdev-genl-gen.c | 5 +-
net/core/netdev-genl.c | 12 +
tools/include/uapi/linux/netdev.h | 1 +
tools/testing/selftests/net/.gitignore | 1 +
tools/testing/selftests/net/Makefile | 3 +-
tools/testing/selftests/net/busy_poll_test.sh | 164 +++++++++
tools/testing/selftests/net/busy_poller.c | 328 ++++++++++++++++++
15 files changed, 805 insertions(+), 11 deletions(-)
create mode 100755 tools/testing/selftests/net/busy_poll_test.sh
create mode 100644 tools/testing/selftests/net/busy_poller.c
base-commit: dbb9a7ef347828870df3e5e6ddf19469a3277fc9
--
2.25.1
The goal of the series is to simplify and make it possible to use
ncdevmem in an automated way from the ksft python wrapper.
ncdevmem is slowly mutated into a state where it uses stdout
to print the payload and the python wrapper is added to
make sure the arrived payload matches the expected one.
v7:
- fix validation (Mina)
- add support for working with non ::ffff-prefixed addresses (Mina)
v6:
- fix compilation issue in 'Unify error handling' patch (Jakub)
v5:
- properly handle errors from inet_pton() and socket() (Paolo)
- remove unneeded import from python selftest (Paolo)
v4:
- keep usage example with validation (Mina)
- fix compilation issue in one patch (s/start_queues/start_queue/)
v3:
- keep and refine the comment about ncdevmem invocation (Mina)
- add the comment about not enforcing exit status for ntuple reset (Mina)
- make configure_headersplit more robust (Mina)
- use num_queues/2 in selftest and let the users override it (Mina)
- remove memory_provider.memcpy_to_device (Mina)
- keep ksft as is (don't use -v validate flags): we are gonna
need a --debug-disable flag to make it less chatty; otherwise
it times out when sending too much data; so leaving it as
a separate follow up
v2:
- don't remove validation (Mina)
- keep 5-tuple flow steering but use it only when -c is provided (Mina)
- remove separate flag for probing (Mina)
- move ncdevmem under drivers/net/hw, not drivers/net (Jakub)
Cc: Mina Almasry <almasrymina(a)google.com>
Stanislav Fomichev (12):
selftests: ncdevmem: Redirect all non-payload output to stderr
selftests: ncdevmem: Separate out dmabuf provider
selftests: ncdevmem: Unify error handling
selftests: ncdevmem: Make client_ip optional
selftests: ncdevmem: Remove default arguments
selftests: ncdevmem: Switch to AF_INET6
selftests: ncdevmem: Properly reset flow steering
selftests: ncdevmem: Use YNL to enable TCP header split
selftests: ncdevmem: Remove hard-coded queue numbers
selftests: ncdevmem: Run selftest when none of the -s or -c has been
provided
selftests: ncdevmem: Move ncdevmem under drivers/net/hw
selftests: ncdevmem: Add automated test
.../selftests/drivers/net/hw/.gitignore | 1 +
.../testing/selftests/drivers/net/hw/Makefile | 9 +
.../selftests/drivers/net/hw/devmem.py | 45 +
.../selftests/drivers/net/hw/ncdevmem.c | 786 ++++++++++++++++++
tools/testing/selftests/net/.gitignore | 1 -
tools/testing/selftests/net/Makefile | 8 -
tools/testing/selftests/net/ncdevmem.c | 570 -------------
7 files changed, 841 insertions(+), 579 deletions(-)
create mode 100644 tools/testing/selftests/drivers/net/hw/.gitignore
create mode 100755 tools/testing/selftests/drivers/net/hw/devmem.py
create mode 100644 tools/testing/selftests/drivers/net/hw/ncdevmem.c
delete mode 100644 tools/testing/selftests/net/ncdevmem.c
--
2.47.0
Misbehaving guests can cause bus locks to degrade the performance of
a system. Non-WB (write-back) and misaligned locked RMW
(read-modify-write) instructions are referred to as "bus locks" and
require system wide synchronization among all processors to guarantee
the atomicity. The bus locks can impose notable performance penalties
for all processors within the system.
Support for the Bus Lock Threshold is indicated by CPUID
Fn8000_000A_EDX[29] BusLockThreshold=1, the VMCB provides a Bus Lock
Threshold enable bit and an unsigned 16-bit Bus Lock Threshold count.
VMCB intercept bit
VMCB Offset Bits Function
14h 5 Intercept bus lock operations
Bus lock threshold count
VMCB Offset Bits Function
120h 15:0 Bus lock counter
During VMRUN, the bus lock threshold count is fetched and stored in an
internal count register. Prior to executing a bus lock within the
guest, the processor verifies the count in the bus lock register. If
the count is greater than zero, the processor executes the bus lock,
reducing the count. However, if the count is zero, the bus lock
operation is not performed, and instead, a Bus Lock Threshold #VMEXIT
is triggered to transfer control to the Virtual Machine Monitor (VMM).
A Bus Lock Threshold #VMEXIT is reported to the VMM with VMEXIT code
0xA5h, VMEXIT_BUSLOCK. EXITINFO1 and EXITINFO2 are set to 0 on
a VMEXIT_BUSLOCK. On a #VMEXIT, the processor writes the current
value of the Bus Lock Threshold Counter to the VMCB.
More details about the Bus Lock Threshold feature can be found in AMD
APM [1].
v2 -> v3
- Drop parch to add virt tag in /proc/cpuinfo.
- Incorporated Tom's review comments.
v1 -> v2
- Incorporated misc review comments from Sean.
- Removed bus_lock_counter module parameter.
- Set the value of bus_lock_counter to zero by default and reload the value by 1
in bus lock exit handler.
- Add documentation for the behavioral difference for KVM_EXIT_BUS_LOCK.
- Improved selftest for buslock to work on SVM and VMX.
- Rewrite the commit messages.
Patches are prepared on kvm-next/next (efbc6bd090f4).
Testing done:
- Added a selftest for the Bus Lock Threshold functionality.
- The bus lock threshold selftest has been tested on both Intel and AMD platforms.
- Tested the Bus Lock Threshold functionality on SEV, SEV-ES, SEV-SNP guests.
- Tested the Bus Lock Threshold functionality on nested guests.
v1: https://lore.kernel.org/kvm/20240709175145.9986-4-manali.shukla@amd.com/T/
v2: https://lore.kernel.org/kvm/20241001063413.687787-4-manali.shukla@amd.com/T/
[1]: AMD64 Architecture Programmer's Manual Pub. 24593, April 2024,
Vol 2, 15.14.5 Bus Lock Threshold.
https://bugzilla.kernel.org/attachment.cgi?id=306250
Manali Shukla (2):
x86/cpufeatures: Add CPUID feature bit for the Bus Lock Threshold
KVM: X86: Add documentation about behavioral difference for
KVM_EXIT_BUS_LOCK
Nikunj A Dadhania (2):
KVM: SVM: Enable Bus lock threshold exit
KVM: selftests: Add bus lock exit test
Documentation/virt/kvm/api.rst | 4 +
arch/x86/include/asm/cpufeatures.h | 1 +
arch/x86/include/asm/svm.h | 5 +-
arch/x86/include/uapi/asm/svm.h | 2 +
arch/x86/kvm/svm/nested.c | 10 ++
arch/x86/kvm/svm/svm.c | 27 ++++
tools/testing/selftests/kvm/Makefile | 1 +
.../selftests/kvm/x86_64/kvm_buslock_test.c | 130 ++++++++++++++++++
8 files changed, 179 insertions(+), 1 deletion(-)
create mode 100644 tools/testing/selftests/kvm/x86_64/kvm_buslock_test.c
base-commit: efbc6bd090f48ccf64f7a8dd5daea775821d57ec
--
2.34.1
Hello,
KernelCI is hosting a bi-weekly call on Thursday to discuss improvements
to existing upstream tests, the development of new tests to increase
kernel testing coverage, and the enablement of these tests in KernelCI.
In recent months, we at Collabora have focused on various kernel areas,
assessing the tests already available upstream and contributing patches
to make them easily runnable in CIs.
Below is a list of the tests we've been working on and their latest
status updates, as discussed in the last meeting held on 2024-10-31:
*Boot time test*
- Sent v2:
https://lore.kernel.org/all/20241018101439.20849-1-laura.nao@collabora.com/
- Changes in v2 driven by feedback received in the LPC session
*ACPI probe kselftest*
- RFCv2:
https://lore.kernel.org/all/20240308144933.337107-1-laura.nao@collabora.com/
- Related upstream discussion on probe status tracking through printks:
https://lore.kernel.org/lkml/dc7af563-530d-4e1f-bcbb-90bcfc2fe11a@stanley.m…
Please reply to this thread if you'd like to join the call or discuss
any of the topics further. We look forward to collaborating with the
community to improve upstream tests and expand coverage to more areas of
interest within the kernel.
Best regards,
Laura Nao
This series adds VLAN support to HSR framework.
This series also adds VLAN support to HSR mode of ICSSG Ethernet driver.
Changes from v1 to v2:
*) Added patch 4/4 to add test script related to VLAN in HSR as asked by
Lukasz Majewski <lukma(a)denx.de>
v1 https://lore.kernel.org/all/20241004074715.791191-1-danishanwar@ti.com/
MD Danish Anwar (1):
selftests: hsr: Add test for VLAN
Murali Karicheri (1):
net: hsr: Add VLAN CTAG filter support
Ravi Gunasekaran (1):
net: ti: icssg-prueth: Add VLAN support for HSR mode
WingMan Kwok (1):
net: hsr: Add VLAN support
drivers/net/ethernet/ti/icssg/icssg_prueth.c | 45 +++++++++++-
net/hsr/hsr_device.c | 76 ++++++++++++++++++--
net/hsr/hsr_forward.c | 19 +++--
tools/testing/selftests/net/hsr/config | 1 +
tools/testing/selftests/net/hsr/hsr_ping.sh | 63 +++++++++++++++-
5 files changed, 191 insertions(+), 13 deletions(-)
base-commit: 1bf70e6c3a5346966c25e0a1ff492945b25d3f80
--
2.34.1
Soft lockups have been observed on a cluster of Linux-based edge routers
located in a highly dynamic environment. Using the `bird` service, these
routers continuously update BGP-advertised routes due to frequently
changing nexthop destinations, while also managing significant IPv6
traffic. The lockups occur during the traversal of the multipath
circular linked-list in the `fib6_select_path` function, particularly
while iterating through the siblings in the list. The issue typically
arises when the nodes of the linked list are unexpectedly deleted
concurrently on a different core—indicated by their 'next' and
'previous' elements pointing back to the node itself and their reference
count dropping to zero. This results in an infinite loop, leading to a
soft lockup that triggers a system panic via the watchdog timer.
Apply RCU primitives in the problematic code sections to resolve the
issue. Where necessary, update the references to fib6_siblings to
annotate or use the RCU APIs.
Include a test script that reproduces the issue. The script
periodically updates the routing table while generating a heavy load
of outgoing IPv6 traffic through multiple iperf3 clients. It
consistently induces infinite soft lockups within a couple of minutes.
Kernel log:
0 [ffffbd13003e8d30] machine_kexec at ffffffff8ceaf3eb
1 [ffffbd13003e8d90] __crash_kexec at ffffffff8d0120e3
2 [ffffbd13003e8e58] panic at ffffffff8cef65d4
3 [ffffbd13003e8ed8] watchdog_timer_fn at ffffffff8d05cb03
4 [ffffbd13003e8f08] __hrtimer_run_queues at ffffffff8cfec62f
5 [ffffbd13003e8f70] hrtimer_interrupt at ffffffff8cfed756
6 [ffffbd13003e8fd0] __sysvec_apic_timer_interrupt at ffffffff8cea01af
7 [ffffbd13003e8ff0] sysvec_apic_timer_interrupt at ffffffff8df1b83d
-- <IRQ stack> --
8 [ffffbd13003d3708] asm_sysvec_apic_timer_interrupt at ffffffff8e000ecb
[exception RIP: fib6_select_path+299]
RIP: ffffffff8ddafe7b RSP: ffffbd13003d37b8 RFLAGS: 00000287
RAX: ffff975850b43600 RBX: ffff975850b40200 RCX: 0000000000000000
RDX: 000000003fffffff RSI: 0000000051d383e4 RDI: ffff975850b43618
RBP: ffffbd13003d3800 R8: 0000000000000000 R9: ffff975850b40200
R10: 0000000000000000 R11: 0000000000000000 R12: ffffbd13003d3830
R13: ffff975850b436a8 R14: ffff975850b43600 R15: 0000000000000007
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
9 [ffffbd13003d3808] ip6_pol_route at ffffffff8ddb030c
10 [ffffbd13003d3888] ip6_pol_route_input at ffffffff8ddb068c
11 [ffffbd13003d3898] fib6_rule_lookup at ffffffff8ddf02b5
12 [ffffbd13003d3928] ip6_route_input at ffffffff8ddb0f47
13 [ffffbd13003d3a18] ip6_rcv_finish_core.constprop.0 at ffffffff8dd950d0
14 [ffffbd13003d3a30] ip6_list_rcv_finish.constprop.0 at ffffffff8dd96274
15 [ffffbd13003d3a98] ip6_sublist_rcv at ffffffff8dd96474
16 [ffffbd13003d3af8] ipv6_list_rcv at ffffffff8dd96615
17 [ffffbd13003d3b60] __netif_receive_skb_list_core at ffffffff8dc16fec
18 [ffffbd13003d3be0] netif_receive_skb_list_internal at ffffffff8dc176b3
19 [ffffbd13003d3c50] napi_gro_receive at ffffffff8dc565b9
20 [ffffbd13003d3c80] ice_receive_skb at ffffffffc087e4f5 [ice]
21 [ffffbd13003d3c90] ice_clean_rx_irq at ffffffffc0881b80 [ice]
22 [ffffbd13003d3d20] ice_napi_poll at ffffffffc088232f [ice]
23 [ffffbd13003d3d80] __napi_poll at ffffffff8dc18000
24 [ffffbd13003d3db8] net_rx_action at ffffffff8dc18581
25 [ffffbd13003d3e40] __do_softirq at ffffffff8df352e9
26 [ffffbd13003d3eb0] run_ksoftirqd at ffffffff8ceffe47
27 [ffffbd13003d3ec0] smpboot_thread_fn at ffffffff8cf36a30
28 [ffffbd13003d3ee8] kthread at ffffffff8cf2b39f
29 [ffffbd13003d3f28] ret_from_fork at ffffffff8ce5fa64
30 [ffffbd13003d3f50] ret_from_fork_asm at ffffffff8ce03cbb
Fixes: 66f5d6ce53e6 ("ipv6: replace rwlock with rcu and spinlock in fib6_table")
Reported-by: Adrian Oliver <kernel(a)aoliver.ca>
Signed-off-by: Omid Ehtemam-Haghighi <omid.ehtemamhaghighi(a)menlosecurity.com>
Cc: David S. Miller <davem(a)davemloft.net>
Cc: David Ahern <dsahern(a)gmail.com>
Cc: Eric Dumazet <edumazet(a)google.com>
Cc: Jakub Kicinski <kuba(a)kernel.org>
Cc: Paolo Abeni <pabeni(a)redhat.com>
Cc: Shuah Khan <shuah(a)kernel.org>
Cc: Ido Schimmel <idosch(a)idosch.org>
Cc: Kuniyuki Iwashima <kuniyu(a)amazon.com>
Cc: Simon Horman <horms(a)kernel.org>
Cc: netdev(a)vger.kernel.org
Cc: linux-kselftest(a)vger.kernel.org
Cc: linux-kernel(a)vger.kernel.org
---
v5 -> v6:
* Adjust the comment line lengths in the test script to a maximum of
80 characters
* Change memory allocation in inet6_rt_notify from gfp_any() to GFP_ATOMIC for
atomic allocation in non-blocking contexts, as suggested by Ido Schimmel
* NOTE: I have executed the test script on both bare-metal servers and
virtualized environments such as QEMU and vng. In the case of bare-metal, it
consistently triggers a soft lockup in under a minute on unpatched kernels.
For the virtualized environments, an unpatched kernel compiled with the
Ubuntu 24.04 configuration also triggers a soft lockup, though it takes
longer; however, it did not trigger a soft lockup on kernels compiled with
configurations provided in:
https://github.com/linux-netdev/nipa/wiki/How-to-run-netdev-selftests-CI-st…
leading to potential false negatives in the test results.
I am curious if this test can be executed on a bare-metal machine within a
CI system, if such a setup exists, rather than in a virtualized environment.
If that’s not possible, how can I apply a different kernel configuration,
such as the one used in Ubuntu 24.04, for this test? Please advise.
v4 -> v5:
* Addressed review comments from Paolo Abeni.
* Added additional clarifying comments in the test script.
* Minor cleanup performed in the test script.
v3 -> v4:
* Added RCU primitives to rt6_fill_node(). I found that this function is typically
called either with a table lock held or within rcu_read_lock/rcu_read_unlock
pairs, except in the following call chain, where the protection is unclear:
rt_fill_node()
fib6_info_hw_flags_set()
mlxsw_sp_fib6_offload_failed_flag_set()
mlxsw_sp_router_fib6_event_work()
The last function is initialized as a work item in mlxsw_sp_router_fib_event()
and scheduled for deferred execution. I am unsure if the execution context of
this work item is protected by any table lock or rcu_read_lock/rcu_read_unlock
pair, so I have added the protection. Please let me know if this is redundant.
* Other review comments addressed
v2 -> v3:
* Removed redundant rcu_read_lock()/rcu_read_unlock() pairs
* Revised the test script based on Ido Schimmel's feedback
* Updated the test script to ensure compatibility with the latest iperf3 version
* Fixed new warnings generated with 'C=2' in the previous version
* Other review comments addressed
v1 -> v2:
* list_del_rcu() is applied exclusively to legacy multipath code
* All occurrences of fib6_siblings have been modified to utilize RCU
APIs for annotation and usage.
* Additionally, a test script for reproducing the reported
issue is included
---
net/ipv6/ip6_fib.c | 8 +-
net/ipv6/route.c | 45 ++-
tools/testing/selftests/net/Makefile | 1 +
.../net/ipv6_route_update_soft_lockup.sh | 262 ++++++++++++++++++
4 files changed, 297 insertions(+), 19 deletions(-)
create mode 100755 tools/testing/selftests/net/ipv6_route_update_soft_lockup.sh
diff --git a/net/ipv6/ip6_fib.c b/net/ipv6/ip6_fib.c
index eb111d20615c..9a1c59275a10 100644
--- a/net/ipv6/ip6_fib.c
+++ b/net/ipv6/ip6_fib.c
@@ -1190,8 +1190,8 @@ static int fib6_add_rt2node(struct fib6_node *fn, struct fib6_info *rt,
while (sibling) {
if (sibling->fib6_metric == rt->fib6_metric &&
rt6_qualify_for_ecmp(sibling)) {
- list_add_tail(&rt->fib6_siblings,
- &sibling->fib6_siblings);
+ list_add_tail_rcu(&rt->fib6_siblings,
+ &sibling->fib6_siblings);
break;
}
sibling = rcu_dereference_protected(sibling->fib6_next,
@@ -1252,7 +1252,7 @@ static int fib6_add_rt2node(struct fib6_node *fn, struct fib6_info *rt,
fib6_siblings)
sibling->fib6_nsiblings--;
rt->fib6_nsiblings = 0;
- list_del_init(&rt->fib6_siblings);
+ list_del_rcu(&rt->fib6_siblings);
rt6_multipath_rebalance(next_sibling);
return err;
}
@@ -1970,7 +1970,7 @@ static void fib6_del_route(struct fib6_table *table, struct fib6_node *fn,
&rt->fib6_siblings, fib6_siblings)
sibling->fib6_nsiblings--;
rt->fib6_nsiblings = 0;
- list_del_init(&rt->fib6_siblings);
+ list_del_rcu(&rt->fib6_siblings);
rt6_multipath_rebalance(next_sibling);
}
diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index b4251915585f..b9b986bda943 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -413,8 +413,8 @@ void fib6_select_path(const struct net *net, struct fib6_result *res,
struct flowi6 *fl6, int oif, bool have_oif_match,
const struct sk_buff *skb, int strict)
{
- struct fib6_info *sibling, *next_sibling;
struct fib6_info *match = res->f6i;
+ struct fib6_info *sibling;
if (!match->nh && (!match->fib6_nsiblings || have_oif_match))
goto out;
@@ -440,8 +440,8 @@ void fib6_select_path(const struct net *net, struct fib6_result *res,
if (fl6->mp_hash <= atomic_read(&match->fib6_nh->fib_nh_upper_bound))
goto out;
- list_for_each_entry_safe(sibling, next_sibling, &match->fib6_siblings,
- fib6_siblings) {
+ list_for_each_entry_rcu(sibling, &match->fib6_siblings,
+ fib6_siblings) {
const struct fib6_nh *nh = sibling->fib6_nh;
int nh_upper_bound;
@@ -5195,14 +5195,18 @@ static void ip6_route_mpath_notify(struct fib6_info *rt,
* nexthop. Since sibling routes are always added at the end of
* the list, find the first sibling of the last route appended
*/
+ rcu_read_lock();
+
if ((nlflags & NLM_F_APPEND) && rt_last && rt_last->fib6_nsiblings) {
- rt = list_first_entry(&rt_last->fib6_siblings,
- struct fib6_info,
- fib6_siblings);
+ rt = list_first_or_null_rcu(&rt_last->fib6_siblings,
+ struct fib6_info,
+ fib6_siblings);
}
if (rt)
inet6_rt_notify(RTM_NEWROUTE, rt, info, nlflags);
+
+ rcu_read_unlock();
}
static bool ip6_route_mpath_should_notify(const struct fib6_info *rt)
@@ -5547,17 +5551,21 @@ static size_t rt6_nlmsg_size(struct fib6_info *f6i)
nexthop_for_each_fib6_nh(f6i->nh, rt6_nh_nlmsg_size,
&nexthop_len);
} else {
- struct fib6_info *sibling, *next_sibling;
struct fib6_nh *nh = f6i->fib6_nh;
+ struct fib6_info *sibling;
nexthop_len = 0;
if (f6i->fib6_nsiblings) {
rt6_nh_nlmsg_size(nh, &nexthop_len);
- list_for_each_entry_safe(sibling, next_sibling,
- &f6i->fib6_siblings, fib6_siblings) {
+ rcu_read_lock();
+
+ list_for_each_entry_rcu(sibling, &f6i->fib6_siblings,
+ fib6_siblings) {
rt6_nh_nlmsg_size(sibling->fib6_nh, &nexthop_len);
}
+
+ rcu_read_unlock();
}
nexthop_len += lwtunnel_get_encap_size(nh->fib_nh_lws);
}
@@ -5721,7 +5729,7 @@ static int rt6_fill_node(struct net *net, struct sk_buff *skb,
lwtunnel_fill_encap(skb, dst->lwtstate, RTA_ENCAP, RTA_ENCAP_TYPE) < 0)
goto nla_put_failure;
} else if (rt->fib6_nsiblings) {
- struct fib6_info *sibling, *next_sibling;
+ struct fib6_info *sibling;
struct nlattr *mp;
mp = nla_nest_start_noflag(skb, RTA_MULTIPATH);
@@ -5733,14 +5741,21 @@ static int rt6_fill_node(struct net *net, struct sk_buff *skb,
0) < 0)
goto nla_put_failure;
- list_for_each_entry_safe(sibling, next_sibling,
- &rt->fib6_siblings, fib6_siblings) {
+ rcu_read_lock();
+
+ list_for_each_entry_rcu(sibling, &rt->fib6_siblings,
+ fib6_siblings) {
if (fib_add_nexthop(skb, &sibling->fib6_nh->nh_common,
sibling->fib6_nh->fib_nh_weight,
- AF_INET6, 0) < 0)
+ AF_INET6, 0) < 0) {
+ rcu_read_unlock();
+
goto nla_put_failure;
+ }
}
+ rcu_read_unlock();
+
nla_nest_end(skb, mp);
} else if (rt->nh) {
if (nla_put_u32(skb, RTA_NH_ID, rt->nh->id))
@@ -6177,7 +6192,7 @@ void inet6_rt_notify(int event, struct fib6_info *rt, struct nl_info *info,
err = -ENOBUFS;
seq = info->nlh ? info->nlh->nlmsg_seq : 0;
- skb = nlmsg_new(rt6_nlmsg_size(rt), gfp_any());
+ skb = nlmsg_new(rt6_nlmsg_size(rt), GFP_ATOMIC);
if (!skb)
goto errout;
@@ -6190,7 +6205,7 @@ void inet6_rt_notify(int event, struct fib6_info *rt, struct nl_info *info,
goto errout;
}
rtnl_notify(skb, net, info->portid, RTNLGRP_IPV6_ROUTE,
- info->nlh, gfp_any());
+ info->nlh, GFP_ATOMIC);
return;
errout:
rtnl_set_sk_err(net, RTNLGRP_IPV6_ROUTE, err);
diff --git a/tools/testing/selftests/net/Makefile b/tools/testing/selftests/net/Makefile
index 649f1fe0dc46..88cbac27aa10 100644
--- a/tools/testing/selftests/net/Makefile
+++ b/tools/testing/selftests/net/Makefile
@@ -96,6 +96,7 @@ TEST_PROGS += fdb_flush.sh
TEST_PROGS += fq_band_pktlimit.sh
TEST_PROGS += vlan_hw_filter.sh
TEST_PROGS += bpf_offload.py
+TEST_PROGS += ipv6_route_update_soft_lockup.sh
# YNL files, must be before "include ..lib.mk"
EXTRA_CLEAN += $(OUTPUT)/libynl.a
diff --git a/tools/testing/selftests/net/ipv6_route_update_soft_lockup.sh b/tools/testing/selftests/net/ipv6_route_update_soft_lockup.sh
new file mode 100755
index 000000000000..c61a7a8352cf
--- /dev/null
+++ b/tools/testing/selftests/net/ipv6_route_update_soft_lockup.sh
@@ -0,0 +1,262 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+#
+# Testing for potential kernel soft lockup during IPv6 routing table
+# refresh under heavy outgoing IPv6 traffic. If a kernel soft lockup
+# occurs, a kernel panic will be triggered to prevent associated issues.
+#
+#
+# Test Environment Layout
+#
+# ┌----------------┐ ┌----------------┐
+# | SOURCE_NS | | SINK_NS |
+# | NAMESPACE | | NAMESPACE |
+# |(iperf3 clients)| |(iperf3 servers)|
+# | | | |
+# | | | |
+# | ┌-----------| nexthops |---------┐ |
+# | |veth_source|<--------------------------------------->|veth_sink|<┐ |
+# | └-----------|2001:0DB8:1::0:1/96 2001:0DB8:1::1:1/96 |---------┘ | |
+# | | ^ 2001:0DB8:1::1:2/96 | | |
+# | | . . | fwd | |
+# | ┌---------┐ | . . | | |
+# | | IPv6 | | . . | V |
+# | | routing | | . 2001:0DB8:1::1:80/96| ┌-----┐ |
+# | | table | | . | | lo | |
+# | | nexthop | | . └--------┴-----┴-┘
+# | | update | | ............................> 2001:0DB8:2::1:1/128
+# | └-------- ┘ |
+# └----------------┘
+#
+# The test script sets up two network namespaces, source_ns and sink_ns,
+# connected via a veth link. Within source_ns, it continuously updates the
+# IPv6 routing table by flushing and inserting IPV6_NEXTHOP_ADDR_COUNT nexthop
+# IPs destined for SINK_LOOPBACK_IP_ADDR in sink_ns. This refresh occurs at a
+# rate of 1/ROUTING_TABLE_REFRESH_PERIOD per second for TEST_DURATION seconds.
+#
+# Simultaneously, multiple iperf3 clients within source_ns generate heavy
+# outgoing IPv6 traffic. Each client is assigned a unique port number starting
+# at 5000 and incrementing sequentially. Each client targets a unique iperf3
+# server running in sink_ns, connected to the SINK_LOOPBACK_IFACE interface
+# using the same port number.
+#
+# The number of iperf3 servers and clients is set to half of the total
+# available cores on each machine.
+#
+# NOTE: We have tested this script on machines with various CPU specifications,
+# ranging from lower to higher performance as listed below. The test script
+# effectively triggered a kernel soft lockup on machines running an unpatched
+# kernel in under a minute:
+#
+# - 1x Intel Xeon E-2278G 8-Core Processor @ 3.40GHz
+# - 1x Intel Xeon E-2378G Processor 8-Core @ 2.80GHz
+# - 1x AMD EPYC 7401P 24-Core Processor @ 2.00GHz
+# - 1x AMD EPYC 7402P 24-Core Processor @ 2.80GHz
+# - 2x Intel Xeon Gold 5120 14-Core Processor @ 2.20GHz
+# - 1x Ampere Altra Q80-30 80-Core Processor @ 3.00GHz
+# - 2x Intel Xeon Gold 5120 14-Core Processor @ 2.20GHz
+# - 2x Intel Xeon Silver 4214 24-Core Processor @ 2.20GHz
+# - 1x AMD EPYC 7502P 32-Core @ 2.50GHz
+# - 1x Intel Xeon Gold 6314U 32-Core Processor @ 2.30GHz
+# - 2x Intel Xeon Gold 6338 32-Core Processor @ 2.00GHz
+#
+# On less performant machines, you may need to increase the TEST_DURATION
+# parameter to enhance the likelihood of encountering a race condition leading
+# to a kernel soft lockup and avoid a false negative result.
+#
+# NOTE: The test may not produce the expected result in virtualized
+# environments (e.g., qemu) due to differences in timing and CPU handling,
+# which can affect the conditions needed to trigger a soft lockup.
+
+source lib.sh
+source net_helper.sh
+
+TEST_DURATION=300
+ROUTING_TABLE_REFRESH_PERIOD=0.01
+
+IPERF3_BITRATE="300m"
+
+
+IPV6_NEXTHOP_ADDR_COUNT="128"
+IPV6_NEXTHOP_ADDR_MASK="96"
+IPV6_NEXTHOP_PREFIX="2001:0DB8:1"
+
+
+SOURCE_TEST_IFACE="veth_source"
+SOURCE_TEST_IP_ADDR="2001:0DB8:1::0:1/96"
+
+SINK_TEST_IFACE="veth_sink"
+# ${SINK_TEST_IFACE} is populated with the following range of IPv6 addresses:
+# 2001:0DB8:1::1:1 to 2001:0DB8:1::1:${IPV6_NEXTHOP_ADDR_COUNT}
+SINK_LOOPBACK_IFACE="lo"
+SINK_LOOPBACK_IP_MASK="128"
+SINK_LOOPBACK_IP_ADDR="2001:0DB8:2::1:1"
+
+nexthop_ip_list=""
+termination_signal=""
+kernel_softlokup_panic_prev_val=""
+
+terminate_ns_processes_by_pattern() {
+ local ns=$1
+ local pattern=$2
+
+ for pid in $(ip netns pids ${ns}); do
+ [ -e /proc/$pid/cmdline ] && grep -qe "${pattern}" /proc/$pid/cmdline && kill -9 $pid
+ done
+}
+
+cleanup() {
+ echo "info: cleaning up namespaces and terminating all processes within them..."
+
+
+ # Terminate iperf3 instances running in the source_ns. To avoid race
+ # conditions, first iterate over the PIDs and terminate those
+ # associated with the bash shells running the
+ # `while true; do iperf3 -c ...; done` loops. In a second iteration,
+ # terminate the individual `iperf3 -c ...` instances.
+ terminate_ns_processes_by_pattern ${source_ns} while
+ terminate_ns_processes_by_pattern ${source_ns} iperf3
+
+ # Repeat the same process for sink_ns
+ terminate_ns_processes_by_pattern ${sink_ns} while
+ terminate_ns_processes_by_pattern ${sink_ns} iperf3
+
+ # Check if any iperf3 instances are still running. This could happen
+ # if a core has entered an infinite loop and the timeout for detecting
+ # the soft lockup has not expired, but either the test interval has
+ # already elapsed or the test was terminated manually (e.g., with ^C)
+ for pid in $(ip netns pids ${source_ns}); do
+ if [ -e /proc/$pid/cmdline ] && grep -qe 'iperf3' /proc/$pid/cmdline; then
+ echo "FAIL: unable to terminate some iperf3 instances. Soft lockup is underway. A kernel panic is on the way!"
+ exit ${ksft_fail}
+ fi
+ done
+
+ if [ "$termination_signal" == "SIGINT" ]; then
+ echo "SKIP: Termination due to ^C (SIGINT)"
+ elif [ "$termination_signal" == "SIGALRM" ]; then
+ echo "PASS: No kernel soft lockup occurred during this ${TEST_DURATION} second test"
+ fi
+
+ cleanup_ns ${source_ns} ${sink_ns}
+
+ sysctl -qw kernel.softlockup_panic=${kernel_softlokup_panic_prev_val}
+}
+
+setup_prepare() {
+ setup_ns source_ns sink_ns
+
+ ip -n ${source_ns} link add name ${SOURCE_TEST_IFACE} type veth peer name ${SINK_TEST_IFACE} netns ${sink_ns}
+
+ # Setting up the Source namespace
+ ip -n ${source_ns} addr add ${SOURCE_TEST_IP_ADDR} dev ${SOURCE_TEST_IFACE}
+ ip -n ${source_ns} link set dev ${SOURCE_TEST_IFACE} qlen 10000
+ ip -n ${source_ns} link set dev ${SOURCE_TEST_IFACE} up
+ ip netns exec ${source_ns} sysctl -qw net.ipv6.fib_multipath_hash_policy=1
+
+ # Setting up the Sink namespace
+ ip -n ${sink_ns} addr add ${SINK_LOOPBACK_IP_ADDR}/${SINK_LOOPBACK_IP_MASK} dev ${SINK_LOOPBACK_IFACE}
+ ip -n ${sink_ns} link set dev ${SINK_LOOPBACK_IFACE} up
+ ip netns exec ${sink_ns} sysctl -qw net.ipv6.conf.${SINK_LOOPBACK_IFACE}.forwarding=1
+
+ ip -n ${sink_ns} link set ${SINK_TEST_IFACE} up
+ ip netns exec ${sink_ns} sysctl -qw net.ipv6.conf.${SINK_TEST_IFACE}.forwarding=1
+
+
+ # Populate nexthop IPv6 addresses on the test interface in the sink_ns
+ echo "info: populating ${IPV6_NEXTHOP_ADDR_COUNT} IPv6 addresses on the ${SINK_TEST_IFACE} interface ..."
+ for IP in $(seq 1 ${IPV6_NEXTHOP_ADDR_COUNT}); do
+ ip -n ${sink_ns} addr add ${IPV6_NEXTHOP_PREFIX}::$(printf "1:%x" "${IP}")/${IPV6_NEXTHOP_ADDR_MASK} dev ${SINK_TEST_IFACE};
+ done
+
+ # Preparing list of nexthops
+ for IP in $(seq 1 ${IPV6_NEXTHOP_ADDR_COUNT}); do
+ nexthop_ip_list=$nexthop_ip_list" nexthop via ${IPV6_NEXTHOP_PREFIX}::$(printf "1:%x" $IP) dev ${SOURCE_TEST_IFACE} weight 1"
+ done
+}
+
+
+test_soft_lockup_during_routing_table_refresh() {
+ # Start num_of_iperf_servers iperf3 servers in the sink_ns namespace,
+ # each listening on ports starting at 5001 and incrementing
+ # sequentially. Since iperf3 instances may terminate unexpectedly, a
+ # while loop is used to automatically restart them in such cases.
+ echo "info: starting ${num_of_iperf_servers} iperf3 servers in the sink_ns namespace ..."
+ for i in $(seq 1 ${num_of_iperf_servers}); do
+ cmd="iperf3 --bind ${SINK_LOOPBACK_IP_ADDR} -s -p $(printf '5%03d' ${i}) --rcv-timeout 200 &>/dev/null"
+ ip netns exec ${sink_ns} bash -c "while true; do ${cmd}; done &" &>/dev/null
+ done
+
+ # Wait for the iperf3 servers to be ready
+ for i in $(seq ${num_of_iperf_servers}); do
+ port=$(printf '5%03d' ${i});
+ wait_local_port_listen ${sink_ns} ${port} tcp
+ done
+
+ # Continuously refresh the routing table in the background within
+ # the source_ns namespace
+ ip netns exec ${source_ns} bash -c "
+ while \$(ip netns list | grep -q ${source_ns}); do
+ ip -6 route add ${SINK_LOOPBACK_IP_ADDR}/${SINK_LOOPBACK_IP_MASK} ${nexthop_ip_list};
+ sleep ${ROUTING_TABLE_REFRESH_PERIOD};
+ ip -6 route delete ${SINK_LOOPBACK_IP_ADDR}/${SINK_LOOPBACK_IP_MASK};
+ done &"
+
+ # Start num_of_iperf_servers iperf3 clients in the source_ns namespace,
+ # each sending TCP traffic on sequential ports starting at 5001.
+ # Since iperf3 instances may terminate unexpectedly (e.g., if the route
+ # to the server is deleted in the background during a route refresh), a
+ # while loop is used to automatically restart them in such cases.
+ echo "info: starting ${num_of_iperf_servers} iperf3 clients in the source_ns namespace ..."
+ for i in $(seq 1 ${num_of_iperf_servers}); do
+ cmd="iperf3 -c ${SINK_LOOPBACK_IP_ADDR} -p $(printf '5%03d' ${i}) --length 64 --bitrate ${IPERF3_BITRATE} -t 0 --connect-timeout 150 &>/dev/null"
+ ip netns exec ${source_ns} bash -c "while true; do ${cmd}; done &" &>/dev/null
+ done
+
+ echo "info: IPv6 routing table is being updated at the rate of $(echo "1/${ROUTING_TABLE_REFRESH_PERIOD}" | bc)/s for ${TEST_DURATION} seconds ..."
+ echo "info: A kernel soft lockup, if detected, results in a kernel panic!"
+
+ wait
+}
+
+# Make sure 'iperf3' is installed, skip the test otherwise
+if [ ! -x "$(command -v "iperf3")" ]; then
+ echo "SKIP: 'iperf3' is not installed. Skipping the test."
+ exit ${ksft_skip}
+fi
+
+# Determine the number of cores on the machine
+num_of_iperf_servers=$(( $(nproc)/2 ))
+
+# Check if we are running on a multi-core machine, skip the test otherwise
+if [ "${num_of_iperf_servers}" -eq 0 ]; then
+ echo "SKIP: This test is not valid on a single core machine!"
+ exit ${ksft_skip}
+fi
+
+# Since the kernel soft lockup we're testing causes at least one core to enter
+# an infinite loop, destabilizing the host and likely affecting subsequent
+# tests, we trigger a kernel panic instead of reporting a failure and
+# continuing
+kernel_softlokup_panic_prev_val=$(sysctl -n kernel.softlockup_panic)
+sysctl -qw kernel.softlockup_panic=1
+
+handle_sigint() {
+ termination_signal="SIGINT"
+ cleanup
+ exit ${ksft_skip}
+}
+
+handle_sigalrm() {
+ termination_signal="SIGALRM"
+ cleanup
+ exit ${ksft_pass}
+}
+
+trap handle_sigint SIGINT
+trap handle_sigalrm SIGALRM
+
+(sleep ${TEST_DURATION} && kill -s SIGALRM $$)&
+
+setup_prepare
+test_soft_lockup_during_routing_table_refresh
--
2.43.0
From: zhouyuhang <zhouyuhang(a)kylinos.cn>
The libcap commit aca076443591 ("Make cap_t operations thread safe.")
added a __u8 mutex at the beginning of the struct _cap_struct, it changes
the offset of the members in the structure that breaks the assumption
made in the "struct libcap" definition in clone3_cap_checkpoint_restore.c.
This causes the call to cap_set_proc here to fail with error code EPERM,
and the output is as follows:
# RUN global.clone3_cap_checkpoint_restore ...
# clone3() syscall supported
# clone3_cap_checkpoint_restore.c:151:clone3_cap_checkpoint_restore:Child has PID 130508
cap_set_proc: Operation not permitted
# clone3_cap_checkpoint_restore.c:160:clone3_cap_checkpoint_restore:Expected set_capability() (-1) == 0 (0)
# clone3_cap_checkpoint_restore.c:161:clone3_cap_checkpoint_restore:Could not set CAP_CHECKPOINT_RESTORE
# clone3_cap_checkpoint_restore: Test terminated by assertion
# FAIL global.clone3_cap_checkpoint_restore
Changing to using capget and capset syscall directly here can fix this error,
just like what the commit 663af70aabb7 ("bpf: selftests: Add helpers to directly
use the capget and capset syscall") does.
The output is as follows:
# RUN global.clone3_cap_checkpoint_restore ...
# clone3() syscall supported
# clone3_cap_checkpoint_restore.c:160:clone3_cap_checkpoint_restore:Child has PID 23708
# clone3_cap_checkpoint_restore.c:91:clone3_cap_checkpoint_restore:[23707] Trying clone3() with CLONE_SET_TID to 23708
# clone3_cap_checkpoint_restore.c:58:clone3_cap_checkpoint_restore:Operation not permitted - Failed to create new process
# clone3_cap_checkpoint_restore.c:93:clone3_cap_checkpoint_restore:[23707] clone3() with CLONE_SET_TID 23708 says:-1
# clone3_cap_checkpoint_restore.c:91:clone3_cap_checkpoint_restore:[23707] Trying clone3() with CLONE_SET_TID to 23708
# clone3_cap_checkpoint_restore.c:73:clone3_cap_checkpoint_restore:I am the parent (23707). My child's pid is 23708
# clone3_cap_checkpoint_restore.c:66:clone3_cap_checkpoint_restore:I am the child, my PID is 23708 (expected 23708)
# clone3_cap_checkpoint_restore.c:93:clone3_cap_checkpoint_restore:[23707] clone3() with CLONE_SET_TID 23708 says:0
# OK global.clone3_cap_checkpoint_restore
ok 1 global.clone3_cap_checkpoint_restore
# PASSED: 1 / 1 tests passed.
Signed-off-by: zhouyuhang <zhouyuhang(a)kylinos.cn>
---
v4:
* Add some comments and modify the output and return value when set_capability fails
v3:
* Remove locally declared system calls and retained the - lcap in the Makefile.
v2:
* Move locally declared system calls to header file.
v1:
* Directly using capget and capset and declare them locally.
---
.../clone3/clone3_cap_checkpoint_restore.c | 77 +++++++++++--------
1 file changed, 43 insertions(+), 34 deletions(-)
diff --git a/tools/testing/selftests/clone3/clone3_cap_checkpoint_restore.c b/tools/testing/selftests/clone3/clone3_cap_checkpoint_restore.c
index 3c196fa86c99..076f9d4cce60 100644
--- a/tools/testing/selftests/clone3/clone3_cap_checkpoint_restore.c
+++ b/tools/testing/selftests/clone3/clone3_cap_checkpoint_restore.c
@@ -27,6 +27,13 @@
#include "../kselftest_harness.h"
#include "clone3_selftests.h"
+/*
+ * Prevent not being defined in the header file
+ */
+#ifndef CAP_CHECKPOINT_RESTORE
+#define CAP_CHECKPOINT_RESTORE 40
+#endif
+
static void child_exit(int ret)
{
fflush(stdout);
@@ -87,47 +94,49 @@ static int test_clone3_set_tid(struct __test_metadata *_metadata,
return ret;
}
-struct libcap {
- struct __user_cap_header_struct hdr;
+static int set_capability(struct __test_metadata *_metadata)
+{
struct __user_cap_data_struct data[2];
-};
-static int set_capability(void)
-{
- cap_value_t cap_values[] = { CAP_SETUID, CAP_SETGID };
- struct libcap *cap;
- int ret = -1;
- cap_t caps;
-
- caps = cap_get_proc();
- if (!caps) {
- perror("cap_get_proc");
- return -1;
- }
+ /*
+ * Only _LINUX_CAPABILITY_VERSION_3 can be used here.
+ * _LINUX_CAPABILITY_VERSION_1 represents use 32-bit capabilities,
+ * using it will cause CAP_CHECKPOINT_RESTORE to not be set.
+ * _LINUX_CAPABILITY_VERSION_2 has already been deprecated.
+ */
+ struct __user_cap_header_struct hdr = {
+ .version = _LINUX_CAPABILITY_VERSION_3,
+ };
- /* Drop all capabilities */
- if (cap_clear(caps)) {
- perror("cap_clear");
- goto out;
+ /*
+ * CAP_CHECKPOINT_RESTORE is greater than 31, so we need two u32.
+ * cap0 is the lower 32bit and cap1 is the higher 32bit, they will
+ * be combined into a u64 in mk_kernel_cap.
+ */
+ __u32 cap0 = 1 << CAP_SETUID | 1 << CAP_SETGID;
+ __u32 cap1 = 1 << (CAP_CHECKPOINT_RESTORE - 32);
+ int ret;
+
+ ret = capget(&hdr, data);
+ if (ret) {
+ TH_LOG("%s - Failed to get capability", strerror(errno));
+ return -errno;
}
- cap_set_flag(caps, CAP_EFFECTIVE, 2, cap_values, CAP_SET);
- cap_set_flag(caps, CAP_PERMITTED, 2, cap_values, CAP_SET);
+ /* Drop all capabilities */
+ memset(&data, 0, sizeof(data));
- cap = (struct libcap *) caps;
+ data[0].effective |= cap0;
+ data[0].permitted |= cap0;
- /* 40 -> CAP_CHECKPOINT_RESTORE */
- cap->data[1].effective |= 1 << (40 - 32);
- cap->data[1].permitted |= 1 << (40 - 32);
+ data[1].effective |= cap1;
+ data[1].permitted |= cap1;
- if (cap_set_proc(caps)) {
- perror("cap_set_proc");
- goto out;
+ ret = capset(&hdr, data);
+ if (ret) {
+ TH_LOG("%s - Failed to set capability", strerror(errno));
+ return -errno;
}
- ret = 0;
-out:
- if (cap_free(caps))
- perror("cap_free");
return ret;
}
@@ -157,7 +166,7 @@ TEST(clone3_cap_checkpoint_restore)
/* After the child has finished, its PID should be free. */
set_tid[0] = pid;
- ASSERT_EQ(set_capability(), 0)
+ ASSERT_EQ(set_capability(_metadata), 0)
TH_LOG("Could not set CAP_CHECKPOINT_RESTORE");
ASSERT_EQ(prctl(PR_SET_KEEPCAPS, 1, 0, 0, 0), 0);
@@ -169,7 +178,7 @@ TEST(clone3_cap_checkpoint_restore)
set_tid[0] = pid;
/* This would fail without CAP_CHECKPOINT_RESTORE */
ASSERT_EQ(test_clone3_set_tid(_metadata, set_tid, 1), -EPERM);
- ASSERT_EQ(set_capability(), 0)
+ ASSERT_EQ(set_capability(_metadata), 0)
TH_LOG("Could not set CAP_CHECKPOINT_RESTORE");
/* This should work as we have CAP_CHECKPOINT_RESTORE as non-root */
ASSERT_EQ(test_clone3_set_tid(_metadata, set_tid, 1), 0);
--
2.27.0
When memslot_perf_test is run nested, first iteration of test_memslot_rw_loop
testcase, sometimes takes more than 2 seconds due to build of shadow page tables.
Following iterations are fast.
To be on the safe side, bump the timeout to 10 seconds.
Signed-off-by: Maxim Levitsky <mlevitsk(a)redhat.com>
---
tools/testing/selftests/kvm/memslot_perf_test.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/tools/testing/selftests/kvm/memslot_perf_test.c b/tools/testing/selftests/kvm/memslot_perf_test.c
index 893366982f77b..e9189cbed4bb6 100644
--- a/tools/testing/selftests/kvm/memslot_perf_test.c
+++ b/tools/testing/selftests/kvm/memslot_perf_test.c
@@ -415,7 +415,7 @@ static bool _guest_should_exit(void)
*/
static noinline void host_perform_sync(struct sync_area *sync)
{
- alarm(2);
+ alarm(10);
atomic_store_explicit(&sync->sync_flag, true, memory_order_release);
while (atomic_load_explicit(&sync->sync_flag, memory_order_acquire))
--
2.26.3
The loop in test_create_guest_memfd_invalid that is supposed to test
that nothing is accepted as a valid flag to KVM_CREATE_GUEST_MEMFD was
initializing `flag` as 0 instead of BIT(0). This caused the loop to
immediately exit instead of iterating over BIT(0), BIT(1), ... .
Fixes: 8a89efd43423 ("KVM: selftests: Add basic selftest for guest_memfd()")
Signed-off-by: Patrick Roy <roypat(a)amazon.co.uk>
---
tools/testing/selftests/kvm/guest_memfd_test.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/tools/testing/selftests/kvm/guest_memfd_test.c b/tools/testing/selftests/kvm/guest_memfd_test.c
index ba0c8e9960358..ce687f8d248fc 100644
--- a/tools/testing/selftests/kvm/guest_memfd_test.c
+++ b/tools/testing/selftests/kvm/guest_memfd_test.c
@@ -134,7 +134,7 @@ static void test_create_guest_memfd_invalid(struct kvm_vm *vm)
size);
}
- for (flag = 0; flag; flag <<= 1) {
+ for (flag = BIT(0); flag; flag <<= 1) {
fd = __vm_create_guest_memfd(vm, page_size, flag);
TEST_ASSERT(fd == -1 && errno == EINVAL,
"guest_memfd() with flag '0x%lx' should fail with EINVAL",
--
2.47.0
This series primarily introduces SEV-SNP test for the kernel selftest
framework. It tests boot, ioctl, pre fault, and fallocate in various
combinations to exercise both positive and negative launch flow paths.
Patch 1 - Adds a wrapper for the ioctl calls that decouple ioctl and
asserts, which enables the use of negative test cases. No functional
change intended.
Patch 2 - Extend the sev smoke tests to use the SNP specific ioctl
calls and sets up memory to boot a SNP guest VM
Patch 3 - Adds SNP to shutdown testing
Patch 4, 5 - Tests the ioctl path for SEV, SEV-ES and SNP
Patch 6 - Adds support for SNP in KVM_SEV_INIT2 tests
Patch 7,8,9 - Enable Prefault tests for SEV, SEV-ES and SNP
The patchset is rebased on top of kvm-x86/next branch.
v3:
1. Remove the assignments for the prefault and fallocate test type
enums.
2. Fix error message for sev launch measure and finish.
3. Collect tested-by tags [Peter, Srikanth]
v2:
https://lore.kernel.org/kvm/20240816192310.117456-1-pratikrajesh.sampat@amd…
1. Add SMT parsing check to populate SNP policy flags
2. Extend Peter Gonda's shutdown test to include SNP
3. Introduce new tests for prefault which include exercising prefault,
fallocate, hole-punch in various combinations.
4. Decouple ioctl patch reworked to introduce private variants of the
the functions that call into the ioctl. Also reordered the patch for
it to arrive first so that new APIs are not written right after
their introduction.
5. General cleanups - adding comments, avoiding local booleans, better
error message. Suggestions incorporated from Peter, Tom, and Sean.
RFC:
https://lore.kernel.org/kvm/20240710220540.188239-1-pratikrajesh.sampat@amd…
Any feedback/review is highly appreciated!
Michael Roth (2):
KVM: selftests: Add interface to manually flag protected/encrypted
ranges
KVM: selftests: Add a CoCo-specific test for KVM_PRE_FAULT_MEMORY
Pratik R. Sampat (7):
KVM: selftests: Decouple SEV ioctls from asserts
KVM: selftests: Add a basic SNP smoke test
KVM: selftests: Add SNP to shutdown testing
KVM: selftests: SEV IOCTL test
KVM: selftests: SNP IOCTL test
KVM: selftests: SEV-SNP test for KVM_SEV_INIT2
KVM: selftests: Interleave fallocate for KVM_PRE_FAULT_MEMORY
tools/testing/selftests/kvm/Makefile | 1 +
.../testing/selftests/kvm/include/kvm_util.h | 13 +
.../selftests/kvm/include/x86_64/processor.h | 1 +
.../selftests/kvm/include/x86_64/sev.h | 76 +++-
tools/testing/selftests/kvm/lib/kvm_util.c | 53 ++-
.../selftests/kvm/lib/x86_64/processor.c | 6 +-
tools/testing/selftests/kvm/lib/x86_64/sev.c | 190 +++++++-
.../kvm/x86_64/coco_pre_fault_memory_test.c | 421 ++++++++++++++++++
.../selftests/kvm/x86_64/sev_init2_tests.c | 13 +
.../selftests/kvm/x86_64/sev_smoke_test.c | 297 +++++++++++-
10 files changed, 1023 insertions(+), 48 deletions(-)
create mode 100644 tools/testing/selftests/kvm/x86_64/coco_pre_fault_memory_test.c
--
2.34.1
Changes since V3:
- V3: https://lore.kernel.org/all/cover.1729218182.git.reinette.chatre@intel.com/
- Rebased on HEAD 2a027d6bb660 of kselftest/next.
- Fix empty string parsing issues pointed out by Ilpo.
- Add Reviewed-by tags.
- Please see individual patches for detailed changes.
Changes since V2:
- V2: https://lore.kernel.org/all/cover.1726164080.git.reinette.chatre@intel.com/
- Add fix to protect against buffer overflow when parsing text from sysfs files.
- Add cleanup patch to address use of magic constants as pointed out by
Ilpo.
- Add Reviewed-by tags where received, except for "selftests/resctrl: Use cache
size to determine "fill_buf" buffer size" that changed too much since
receiving the Reviewed-by tag.
- Please see individual patches for detailed changes.
Changes since V1:
- V1: https://lore.kernel.org/cover.1724970211.git.reinette.chatre@intel.com/
- V2 contains the same general solutions to stated problem as V1 but these
are now preceded by more fixes (patches 1 to 5) and improved robustness
(patches 6 to 9) to existing tests before the series gets back
to solving the original problem with more confidence in patches 10 to 13.
- The posibility of making "memflush = false" for CMT test was discussed
during V1. Modifying this setting does not have a significant impact on the
observed results that are already well within acceptable range and this
version thus keeps original default. If performance was a goal it may
be possible to do further experimentation where "memflush = false" could
eliminate the need for the sleep(1) within the test wrapper, but
improving the performance is not a goal of this work.
- (New) Support what seems to be unintended ability for user space to provide
parameters to "fill_buf" by making the parsing robust and only support
changing parameters that are supported to be changed. Drop support for
"write" operation since it has never been measured.
- (New) Improve wraparound handling. (Ilpo)
- (New) A couple of new fixes addressing issues discovered during development.
- (Change from V1) To support fill_buf parameters provided by user space as
well as test specific fill_buf parameters struct fill_buf_param is no longer
just a member of struct resctrl_val_param, instead there could be at most
two instances of struct fill_buf_param, the immutable parameters provided
by user space and the parameters used by individual tests. (Ilpo)
- Please see individual patches for detailed changes.
V1 cover:
The resctrl selftests for Memory Bandwidth Allocation (MBA) and Memory
Bandwidth Monitoring (MBM) are failing on some (for example [1]) Emerald
Rapids systems. The test failures result from the following two
properties of these systems:
1) Emerald Rapids systems can have up to 320MB L3 cache. The resctrl
MBA and MBM selftests measure memory traffic for which a hardcoded
250MB buffer has been sufficient so far. On platforms with L3 cache
larger than the buffer, the buffer fits in the L3 cache and thus
no/very little memory traffic is generated during the "memory
bandwidth" tests.
2) Some platform features, for example RAS features or memory
performance features that generate memory traffic may drive accesses
that are counted differently by performance counters and MBM
respectively, for instance generating "overhead" traffic which is not
counted against any specific RMID. Until now these counting
differences have always been "in the noise". On Emerald Rapids
systems the maximum MBA throttling (10% memory bandwidth)
throttles memory bandwidth to where memory accesses by these other
platform features push the memory bandwidth difference between
memory controller performance counters and resctrl (MBM) beyond the
tests' hardcoded tolerance.
Make the tests more robust against platform variations:
1) Let the buffer used by memory bandwidth tests be guided by the size
of the L3 cache.
2) Larger buffers require longer initialization time before the buffer can
be used to measurement. Rework the tests to ensure that buffer
initialization is complete before measurements start.
3) Do not compare performance counters and MBM measurements at low
bandwidth. The value of "low" is hardcoded to 750MiB based on
measurements on Emerald Rapids, Sapphire Rapids, and Ice Lake
systems. This limit is not applicable to AMD systems since it
only applies to the MBA and MBM tests that are isolated to Intel.
[1]
https://ark.intel.com/content/www/us/en/ark/products/237261/intel-xeon-plat…
Reinette Chatre (15):
selftests/resctrl: Make functions only used in same file static
selftests/resctrl: Print accurate buffer size as part of MBM results
selftests/resctrl: Fix memory overflow due to unhandled wraparound
selftests/resctrl: Protect against array overrun during iMC config
parsing
selftests/resctrl: Protect against array overflow when reading strings
selftests/resctrl: Make wraparound handling obvious
selftests/resctrl: Remove "once" parameter required to be false
selftests/resctrl: Only support measured read operation
selftests/resctrl: Remove unused measurement code
selftests/resctrl: Make benchmark parameter passing robust
selftests/resctrl: Ensure measurements skip initialization of default
benchmark
selftests/resctrl: Use cache size to determine "fill_buf" buffer size
selftests/resctrl: Do not compare performance counters and resctrl at
low bandwidth
selftests/resctrl: Keep results from first test run
selftests/resctrl: Replace magic constants used as array size
tools/testing/selftests/resctrl/cmt_test.c | 37 +-
tools/testing/selftests/resctrl/fill_buf.c | 45 +-
tools/testing/selftests/resctrl/mba_test.c | 54 ++-
tools/testing/selftests/resctrl/mbm_test.c | 37 +-
tools/testing/selftests/resctrl/resctrl.h | 79 +++-
.../testing/selftests/resctrl/resctrl_tests.c | 95 +++-
tools/testing/selftests/resctrl/resctrl_val.c | 447 +++++-------------
tools/testing/selftests/resctrl/resctrlfs.c | 19 +-
8 files changed, 354 insertions(+), 459 deletions(-)
base-commit: 2a027d6bb66002c8e50e974676f932b33c5fce10
--
2.47.0
This series is a follow-up to Joey's Permission Overlay Extension (POE)
series [1] that recently landed on mainline. The goal is to improve the
way we handle the register that governs which pkeys/POIndex are
accessible (POR_EL0) during signal delivery. As things stand, we may
unexpectedly fail to write the signal frame on the stack because POR_EL0
is not reset before the uaccess operations. See patch 1 for more details
and the main changes this series brings.
A similar series landed recently for x86/MPK [2]; the present series
aims at aligning arm64 with x86. Worth noting: once the signal frame is
written, POR_EL0 is still set to POR_EL0_INIT, granting access to pkey 0
only. This means that a program that sets up an alternate signal stack
with a non-zero pkey will need some assembly trampoline to set POR_EL0
before invoking the real signal handler, as discussed here [3]. This is
not ideal, but it makes experimentation with pkeys in signal handlers
possible while waiting for a potential interface to control the pkey
state when delivering a signal. See Pierre's reply [4] for more
information about use-cases and a potential interface.
The x86 series also added kselftests to ensure that no spurious SIGSEGV
occurs during signal delivery regardless of which pkey is accessible at
the point where the signal is delivered. This series adapts those
kselftests to allow running them on arm64 (patch 4-5). There is a
dependency on Yury's PKEY_UNRESTRICTED patch [7] for patch 4
specifically.
Finally patch 2 is a clean-up following feedback on Joey's series [5].
I have tested this series on arm64 and x86_64 (booting and running the
protection_keys and pkey_sighandler_tests mm kselftests).
- Kevin
---
v2..v3:
* Reordered patches (patch 1 is now the main patch).
* Patch 1: compute por_enable_all with an explicit loop, based on
arch_max_pkey() (suggestion from Dave M).
* Patch 4: improved naming, replaced global pkey reg value with inline
helper, made use of Yury's PKEY_UNRESTRICTED macro [7] (suggestions
from Dave H).
v2: https://lore.kernel.org/linux-arm-kernel/20241023150511.3923558-1-kevin.bro…
v1..v2:
* In setup_rt_frame(), ensured that POR_EL0 is reset to its original
value if we fail to deliver the signal (addresses Catalin's concern [6]).
* Renamed *unpriv_access* to *user_access* in patch 3 (suggestion from
Dave).
* Made what patch 1-2 do explicit in the commit message body (suggestion
from Dave).
v1: https://lore.kernel.org/linux-arm-kernel/20241017133909.3837547-1-kevin.bro…
[1] https://lore.kernel.org/linux-arm-kernel/20240822151113.1479789-1-joey.goul…
[2] https://lore.kernel.org/lkml/20240802061318.2140081-1-aruna.ramakrishna@ora…
[3] https://lore.kernel.org/lkml/CABi2SkWxNkP2O7ipkP67WKz0-LV33e5brReevTTtba6oK…
[4] https://lore.kernel.org/linux-arm-kernel/87plns8owh.fsf@arm.com/
[5] https://lore.kernel.org/linux-arm-kernel/20241015114116.GA19334@willie-the-…
[6] https://lore.kernel.org/linux-arm-kernel/Zw6D2waVyIwYE7wd@arm.com/
[7] https://lore.kernel.org/all/20241028090715.509527-2-yury.khrustalev@arm.com/
Cc: akpm(a)linux-foundation.org
Cc: anshuman.khandual(a)arm.com
Cc: aruna.ramakrishna(a)oracle.com
Cc: broonie(a)kernel.org
Cc: catalin.marinas(a)arm.com
Cc: dave.hansen(a)linux.intel.com
Cc: Dave.Martin(a)arm.com
Cc: jeffxu(a)chromium.org
Cc: joey.gouly(a)arm.com
Cc: keith.lucas(a)oracle.com
Cc: pierre.langlois(a)arm.com
Cc: shuah(a)kernel.org
Cc: sroettger(a)google.com
Cc: tglx(a)linutronix.de
Cc: will(a)kernel.org
Cc: yury.khrustalev(a)arm.com
Cc: linux-kselftest(a)vger.kernel.org
Cc: x86(a)kernel.org
Kevin Brodsky (5):
arm64: signal: Improve POR_EL0 handling to avoid uaccess failures
arm64: signal: Remove unnecessary check when saving POE state
arm64: signal: Remove unused macro
selftests/mm: Use generic pkey register manipulation
selftests/mm: Enable pkey_sighandler_tests on arm64
arch/arm64/kernel/signal.c | 95 ++++++++++++---
tools/testing/selftests/mm/Makefile | 8 +-
tools/testing/selftests/mm/pkey-arm64.h | 1 +
tools/testing/selftests/mm/pkey-x86.h | 2 +
.../selftests/mm/pkey_sighandler_tests.c | 115 ++++++++++++++----
5 files changed, 176 insertions(+), 45 deletions(-)
--
2.43.0
The PSCI v1.3 spec (https://developer.arm.com/documentation/den0022)
adds support for a SYSTEM_OFF2 function enabling a HIBERNATE_OFF state
which is analogous to ACPI S4. This will allow hosting environments to
determine that a guest is hibernated rather than just powered off, and
ensure that they preserve the virtual environment appropriately to
allow the guest to resume safely (or bump the hardware_signature in the
FACS to trigger a clean reboot instead).
This updates KVM to support advertising PSCI v1.3, and unconditionally
enables the SYSTEM_OFF2 support when PSCI v1.3 is enabled.
For the guest side, add a new SYS_OFF_MODE_POWER_OFF handler with higher
priority than the EFI one, but which *only* triggers when there's a
hibernation in progress. There are other ways to do this (see the commit
message for more details) but this seemed like the simplest.
Version 2 of the patch series splits out the psci.h definitions into a
separate commit (a dependency for both the guest and KVM side), and adds
definitions for the other new functions added in v1.3. It also moves the
pKVM psci-relay support to a separate commit; although in arch/arm64/kvm
that's actually about the *guest* side of SYSTEM_OFF2 (i.e. using it
from the host kernel, relayed through nVHE).
Version 3 dropped the KVM_CAP which allowed userspace to explicitly opt
in to the new feature like with SYSTEM_SUSPEND, and makes it depend only
on PSCI v1.3 being exposed to the guest.
Version 4 is no longer RFC, as the PSCI v1.3 spec is finally published.
Minor fixes from the last round of review, and an added KVM self test.
Version 5 drops some of the changes which didn't make it to the final
v1.3 spec, and cleans up a couple of places which still referred to it
as 'alpha' or 'beta'. It also temporarily drops the guest-side patch to
invoke SYSTEM_OFF2 for hibernation, pending confirmation that the final
PSCI v1.3 spec just has a typo where it changed to saying that 0x1
should be passed to mean HIBERNATE_OFF, even though it's advertised as
bit 0. That can be sent under separate cover, and perhaps should have
been anyway. The change in question doesn't matter for any of the KVM
patches, because we just treat SYSTEM_OFF2 like the existing
SYSTEM_RESET2, setting a flag to indicate that it was a SYSTEM_OFF2
call, but not actually caring about the argument; that's for userspace
to worry about.
Version 6 is updated to revision F.b of the specification, allowing both
0x0 and 0x1 to indicate HIBERNATE_OFF in the SYSTEM_OFF2 call, as well
as disallowing anything but zero in the second argument. The test is
expanded to cover those variants, and to require PSCI v1.3 instead of
skipping on older kernels. The final guest-side patch in the series is
reinstated, using zero as the argument for compatibility with existing
hypervisors in the field as permitted by revision F.b. allows for zero
as well as 0x1 for HIBERNATE_OFF to accommodate the change in the final
version of the spec, and expands the test to cover both forms as well as
checking that a non-zero second argument is correctly rejected.
David Woodhouse (6):
firmware/psci: Add definitions for PSCI v1.3 specification
KVM: arm64: Add PSCI v1.3 SYSTEM_OFF2 function for hibernation
KVM: arm64: Add support for PSCI v1.2 and v1.3
KVM: selftests: Add test for PSCI SYSTEM_OFF2
KVM: arm64: nvhe: Pass through PSCI v1.3 SYSTEM_OFF2 call
arm64: Use SYSTEM_OFF2 PSCI call to power off for hibernate
Documentation/virt/kvm/api.rst | 11 +++
arch/arm64/include/uapi/asm/kvm.h | 6 ++
arch/arm64/kvm/hyp/nvhe/psci-relay.c | 2 +
arch/arm64/kvm/hypercalls.c | 2 +
arch/arm64/kvm/psci.c | 50 ++++++++++++-
drivers/firmware/psci/psci.c | 42 +++++++++++
include/kvm/arm_psci.h | 4 +-
include/uapi/linux/psci.h | 5 ++
kernel/power/hibernate.c | 5 +-
tools/testing/selftests/kvm/aarch64/psci_test.c | 93 +++++++++++++++++++++++++
10 files changed, 217 insertions(+), 3 deletions(-)
From: Li Zhijian <lizhijian(a)fujitsu.com>
[ Upstream commit dc1308bee1ed03b4d698d77c8bd670d399dcd04d ]
When running watchdog-test with 'make run_tests', the watchdog-test will
be terminated by a timeout signal(SIGTERM) due to the test timemout.
And then, a system reboot would happen due to watchdog not stop. see
the dmesg as below:
```
[ 1367.185172] watchdog: watchdog0: watchdog did not stop!
```
Fix it by registering more signals(including SIGTERM) in watchdog-test,
where its signal handler will stop the watchdog.
After that
# timeout 1 ./watchdog-test
Watchdog Ticking Away!
.
Stopping watchdog ticks...
Link: https://lore.kernel.org/all/20241029031324.482800-1-lizhijian@fujitsu.com/
Signed-off-by: Li Zhijian <lizhijian(a)fujitsu.com>
Reviewed-by: Shuah Khan <skhan(a)linuxfoundation.org>
Signed-off-by: Shuah Khan <skhan(a)linuxfoundation.org>
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
---
tools/testing/selftests/watchdog/watchdog-test.c | 6 ++++++
1 file changed, 6 insertions(+)
diff --git a/tools/testing/selftests/watchdog/watchdog-test.c b/tools/testing/selftests/watchdog/watchdog-test.c
index f1c6e025cbe54..561bcc253fb3e 100644
--- a/tools/testing/selftests/watchdog/watchdog-test.c
+++ b/tools/testing/selftests/watchdog/watchdog-test.c
@@ -152,7 +152,13 @@ int main(int argc, char *argv[])
printf("Watchdog Ticking Away!\n");
+ /*
+ * Register the signals
+ */
signal(SIGINT, term);
+ signal(SIGTERM, term);
+ signal(SIGKILL, term);
+ signal(SIGQUIT, term);
while (1) {
keep_alive();
--
2.43.0
From: Li Zhijian <lizhijian(a)fujitsu.com>
[ Upstream commit dc1308bee1ed03b4d698d77c8bd670d399dcd04d ]
When running watchdog-test with 'make run_tests', the watchdog-test will
be terminated by a timeout signal(SIGTERM) due to the test timemout.
And then, a system reboot would happen due to watchdog not stop. see
the dmesg as below:
```
[ 1367.185172] watchdog: watchdog0: watchdog did not stop!
```
Fix it by registering more signals(including SIGTERM) in watchdog-test,
where its signal handler will stop the watchdog.
After that
# timeout 1 ./watchdog-test
Watchdog Ticking Away!
.
Stopping watchdog ticks...
Link: https://lore.kernel.org/all/20241029031324.482800-1-lizhijian@fujitsu.com/
Signed-off-by: Li Zhijian <lizhijian(a)fujitsu.com>
Reviewed-by: Shuah Khan <skhan(a)linuxfoundation.org>
Signed-off-by: Shuah Khan <skhan(a)linuxfoundation.org>
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
---
tools/testing/selftests/watchdog/watchdog-test.c | 6 ++++++
1 file changed, 6 insertions(+)
diff --git a/tools/testing/selftests/watchdog/watchdog-test.c b/tools/testing/selftests/watchdog/watchdog-test.c
index f45e510500c0d..09773695d219f 100644
--- a/tools/testing/selftests/watchdog/watchdog-test.c
+++ b/tools/testing/selftests/watchdog/watchdog-test.c
@@ -242,7 +242,13 @@ int main(int argc, char *argv[])
printf("Watchdog Ticking Away!\n");
+ /*
+ * Register the signals
+ */
signal(SIGINT, term);
+ signal(SIGTERM, term);
+ signal(SIGKILL, term);
+ signal(SIGQUIT, term);
while (1) {
keep_alive();
--
2.43.0
From: Li Zhijian <lizhijian(a)fujitsu.com>
[ Upstream commit dc1308bee1ed03b4d698d77c8bd670d399dcd04d ]
When running watchdog-test with 'make run_tests', the watchdog-test will
be terminated by a timeout signal(SIGTERM) due to the test timemout.
And then, a system reboot would happen due to watchdog not stop. see
the dmesg as below:
```
[ 1367.185172] watchdog: watchdog0: watchdog did not stop!
```
Fix it by registering more signals(including SIGTERM) in watchdog-test,
where its signal handler will stop the watchdog.
After that
# timeout 1 ./watchdog-test
Watchdog Ticking Away!
.
Stopping watchdog ticks...
Link: https://lore.kernel.org/all/20241029031324.482800-1-lizhijian@fujitsu.com/
Signed-off-by: Li Zhijian <lizhijian(a)fujitsu.com>
Reviewed-by: Shuah Khan <skhan(a)linuxfoundation.org>
Signed-off-by: Shuah Khan <skhan(a)linuxfoundation.org>
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
---
tools/testing/selftests/watchdog/watchdog-test.c | 6 ++++++
1 file changed, 6 insertions(+)
diff --git a/tools/testing/selftests/watchdog/watchdog-test.c b/tools/testing/selftests/watchdog/watchdog-test.c
index f45e510500c0d..09773695d219f 100644
--- a/tools/testing/selftests/watchdog/watchdog-test.c
+++ b/tools/testing/selftests/watchdog/watchdog-test.c
@@ -242,7 +242,13 @@ int main(int argc, char *argv[])
printf("Watchdog Ticking Away!\n");
+ /*
+ * Register the signals
+ */
signal(SIGINT, term);
+ signal(SIGTERM, term);
+ signal(SIGKILL, term);
+ signal(SIGQUIT, term);
while (1) {
keep_alive();
--
2.43.0
From: Li Zhijian <lizhijian(a)fujitsu.com>
[ Upstream commit dc1308bee1ed03b4d698d77c8bd670d399dcd04d ]
When running watchdog-test with 'make run_tests', the watchdog-test will
be terminated by a timeout signal(SIGTERM) due to the test timemout.
And then, a system reboot would happen due to watchdog not stop. see
the dmesg as below:
```
[ 1367.185172] watchdog: watchdog0: watchdog did not stop!
```
Fix it by registering more signals(including SIGTERM) in watchdog-test,
where its signal handler will stop the watchdog.
After that
# timeout 1 ./watchdog-test
Watchdog Ticking Away!
.
Stopping watchdog ticks...
Link: https://lore.kernel.org/all/20241029031324.482800-1-lizhijian@fujitsu.com/
Signed-off-by: Li Zhijian <lizhijian(a)fujitsu.com>
Reviewed-by: Shuah Khan <skhan(a)linuxfoundation.org>
Signed-off-by: Shuah Khan <skhan(a)linuxfoundation.org>
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
---
tools/testing/selftests/watchdog/watchdog-test.c | 6 ++++++
1 file changed, 6 insertions(+)
diff --git a/tools/testing/selftests/watchdog/watchdog-test.c b/tools/testing/selftests/watchdog/watchdog-test.c
index f45e510500c0d..09773695d219f 100644
--- a/tools/testing/selftests/watchdog/watchdog-test.c
+++ b/tools/testing/selftests/watchdog/watchdog-test.c
@@ -242,7 +242,13 @@ int main(int argc, char *argv[])
printf("Watchdog Ticking Away!\n");
+ /*
+ * Register the signals
+ */
signal(SIGINT, term);
+ signal(SIGTERM, term);
+ signal(SIGKILL, term);
+ signal(SIGQUIT, term);
while (1) {
keep_alive();
--
2.43.0
From: Li Zhijian <lizhijian(a)fujitsu.com>
[ Upstream commit dc1308bee1ed03b4d698d77c8bd670d399dcd04d ]
When running watchdog-test with 'make run_tests', the watchdog-test will
be terminated by a timeout signal(SIGTERM) due to the test timemout.
And then, a system reboot would happen due to watchdog not stop. see
the dmesg as below:
```
[ 1367.185172] watchdog: watchdog0: watchdog did not stop!
```
Fix it by registering more signals(including SIGTERM) in watchdog-test,
where its signal handler will stop the watchdog.
After that
# timeout 1 ./watchdog-test
Watchdog Ticking Away!
.
Stopping watchdog ticks...
Link: https://lore.kernel.org/all/20241029031324.482800-1-lizhijian@fujitsu.com/
Signed-off-by: Li Zhijian <lizhijian(a)fujitsu.com>
Reviewed-by: Shuah Khan <skhan(a)linuxfoundation.org>
Signed-off-by: Shuah Khan <skhan(a)linuxfoundation.org>
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
---
tools/testing/selftests/watchdog/watchdog-test.c | 6 ++++++
1 file changed, 6 insertions(+)
diff --git a/tools/testing/selftests/watchdog/watchdog-test.c b/tools/testing/selftests/watchdog/watchdog-test.c
index f45e510500c0d..09773695d219f 100644
--- a/tools/testing/selftests/watchdog/watchdog-test.c
+++ b/tools/testing/selftests/watchdog/watchdog-test.c
@@ -242,7 +242,13 @@ int main(int argc, char *argv[])
printf("Watchdog Ticking Away!\n");
+ /*
+ * Register the signals
+ */
signal(SIGINT, term);
+ signal(SIGTERM, term);
+ signal(SIGKILL, term);
+ signal(SIGQUIT, term);
while (1) {
keep_alive();
--
2.43.0
From: Li Zhijian <lizhijian(a)fujitsu.com>
[ Upstream commit dc1308bee1ed03b4d698d77c8bd670d399dcd04d ]
When running watchdog-test with 'make run_tests', the watchdog-test will
be terminated by a timeout signal(SIGTERM) due to the test timemout.
And then, a system reboot would happen due to watchdog not stop. see
the dmesg as below:
```
[ 1367.185172] watchdog: watchdog0: watchdog did not stop!
```
Fix it by registering more signals(including SIGTERM) in watchdog-test,
where its signal handler will stop the watchdog.
After that
# timeout 1 ./watchdog-test
Watchdog Ticking Away!
.
Stopping watchdog ticks...
Link: https://lore.kernel.org/all/20241029031324.482800-1-lizhijian@fujitsu.com/
Signed-off-by: Li Zhijian <lizhijian(a)fujitsu.com>
Reviewed-by: Shuah Khan <skhan(a)linuxfoundation.org>
Signed-off-by: Shuah Khan <skhan(a)linuxfoundation.org>
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
---
tools/testing/selftests/watchdog/watchdog-test.c | 6 ++++++
1 file changed, 6 insertions(+)
diff --git a/tools/testing/selftests/watchdog/watchdog-test.c b/tools/testing/selftests/watchdog/watchdog-test.c
index bc71cbca0dde7..a1f506ba55786 100644
--- a/tools/testing/selftests/watchdog/watchdog-test.c
+++ b/tools/testing/selftests/watchdog/watchdog-test.c
@@ -334,7 +334,13 @@ int main(int argc, char *argv[])
printf("Watchdog Ticking Away!\n");
+ /*
+ * Register the signals
+ */
signal(SIGINT, term);
+ signal(SIGTERM, term);
+ signal(SIGKILL, term);
+ signal(SIGQUIT, term);
while (1) {
keep_alive();
--
2.43.0
From: Li Zhijian <lizhijian(a)fujitsu.com>
[ Upstream commit dc1308bee1ed03b4d698d77c8bd670d399dcd04d ]
When running watchdog-test with 'make run_tests', the watchdog-test will
be terminated by a timeout signal(SIGTERM) due to the test timemout.
And then, a system reboot would happen due to watchdog not stop. see
the dmesg as below:
```
[ 1367.185172] watchdog: watchdog0: watchdog did not stop!
```
Fix it by registering more signals(including SIGTERM) in watchdog-test,
where its signal handler will stop the watchdog.
After that
# timeout 1 ./watchdog-test
Watchdog Ticking Away!
.
Stopping watchdog ticks...
Link: https://lore.kernel.org/all/20241029031324.482800-1-lizhijian@fujitsu.com/
Signed-off-by: Li Zhijian <lizhijian(a)fujitsu.com>
Reviewed-by: Shuah Khan <skhan(a)linuxfoundation.org>
Signed-off-by: Shuah Khan <skhan(a)linuxfoundation.org>
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
---
tools/testing/selftests/watchdog/watchdog-test.c | 6 ++++++
1 file changed, 6 insertions(+)
diff --git a/tools/testing/selftests/watchdog/watchdog-test.c b/tools/testing/selftests/watchdog/watchdog-test.c
index bc71cbca0dde7..a1f506ba55786 100644
--- a/tools/testing/selftests/watchdog/watchdog-test.c
+++ b/tools/testing/selftests/watchdog/watchdog-test.c
@@ -334,7 +334,13 @@ int main(int argc, char *argv[])
printf("Watchdog Ticking Away!\n");
+ /*
+ * Register the signals
+ */
signal(SIGINT, term);
+ signal(SIGTERM, term);
+ signal(SIGKILL, term);
+ signal(SIGQUIT, term);
while (1) {
keep_alive();
--
2.43.0
Check number of paths by fib_info_num_path(),
and update_or_create_fnhe() for every path.
Problem is that pmtu is cached only for the oif
that has received icmp message "need to frag",
other oifs will still try to use "default" iface mtu.
An example topology showing the problem:
| host1
+---------+
| dummy0 | 10.179.20.18/32 mtu9000
+---------+
+-----------+----------------+
+---------+ +---------+
| ens17f0 | 10.179.2.141/31 | ens17f1 | 10.179.2.13/31
+---------+ +---------+
| (all here have mtu 9000) |
+------+ +------+
| ro1 | 10.179.2.140/31 | ro2 | 10.179.2.12/31
+------+ +------+
| |
---------+------------+-------------------+------
|
+-----+
| ro3 | 10.10.10.10 mtu1500
+-----+
|
========================================
some networks
========================================
|
+-----+
| eth0| 10.10.30.30 mtu9000
+-----+
| host2
host1 have enabled multipath and
sysctl net.ipv4.fib_multipath_hash_policy = 1:
default proto static src 10.179.20.18
nexthop via 10.179.2.12 dev ens17f1 weight 1
nexthop via 10.179.2.140 dev ens17f0 weight 1
When host1 tries to do pmtud from 10.179.20.18/32 to host2,
host1 receives at ens17f1 iface an icmp packet from ro3 that ro3 mtu=1500.
And host1 caches it in nexthop exceptions cache.
Problem is that it is cached only for the iface that has received icmp,
and there is no way that ro3 will send icmp msg to host1 via another path.
Host1 now have this routes to host2:
ip r g 10.10.30.30 sport 30000 dport 443
10.10.30.30 via 10.179.2.12 dev ens17f1 src 10.179.20.18 uid 0
cache expires 521sec mtu 1500
ip r g 10.10.30.30 sport 30033 dport 443
10.10.30.30 via 10.179.2.140 dev ens17f0 src 10.179.20.18 uid 0
cache
So when host1 tries again to reach host2 with mtu>1500,
if packet flow is lucky enough to be hashed with oif=ens17f1 its ok,
if oif=ens17f0 it blackholes and still gets icmp msgs from ro3 to ens17f1,
until lucky day when ro3 will send it through another flow to ens17f0.
Signed-off-by: Vladimir Vdovin <deliran(a)verdict.gg>
---
V6:
- make commit message cleaner
V5:
- make self test cleaner
V4:
- fix selftest, do route lookup before checking cached exceptions
V3:
- added selftest
- fixed compile error
V2:
- fix fib_info_num_path parameter pass
---
net/ipv4/route.c | 13 +++++
tools/testing/selftests/net/pmtu.sh | 78 ++++++++++++++++++++++++++++-
2 files changed, 90 insertions(+), 1 deletion(-)
diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index 723ac9181558..41162b5cc4cb 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -1027,6 +1027,19 @@ static void __ip_rt_update_pmtu(struct rtable *rt, struct flowi4 *fl4, u32 mtu)
struct fib_nh_common *nhc;
fib_select_path(net, &res, fl4, NULL);
+#ifdef CONFIG_IP_ROUTE_MULTIPATH
+ if (fib_info_num_path(res.fi) > 1) {
+ int nhsel;
+
+ for (nhsel = 0; nhsel < fib_info_num_path(res.fi); nhsel++) {
+ nhc = fib_info_nhc(res.fi, nhsel);
+ update_or_create_fnhe(nhc, fl4->daddr, 0, mtu, lock,
+ jiffies + net->ipv4.ip_rt_mtu_expires);
+ }
+ rcu_read_unlock();
+ return;
+ }
+#endif /* CONFIG_IP_ROUTE_MULTIPATH */
nhc = FIB_RES_NHC(res);
update_or_create_fnhe(nhc, fl4->daddr, 0, mtu, lock,
jiffies + net->ipv4.ip_rt_mtu_expires);
diff --git a/tools/testing/selftests/net/pmtu.sh b/tools/testing/selftests/net/pmtu.sh
index 569bce8b6383..a0159340fe84 100755
--- a/tools/testing/selftests/net/pmtu.sh
+++ b/tools/testing/selftests/net/pmtu.sh
@@ -266,7 +266,8 @@ tests="
list_flush_ipv4_exception ipv4: list and flush cached exceptions 1
list_flush_ipv6_exception ipv6: list and flush cached exceptions 1
pmtu_ipv4_route_change ipv4: PMTU exception w/route replace 1
- pmtu_ipv6_route_change ipv6: PMTU exception w/route replace 1"
+ pmtu_ipv6_route_change ipv6: PMTU exception w/route replace 1
+ pmtu_ipv4_mp_exceptions ipv4: PMTU multipath nh exceptions 0"
# Addressing and routing for tests with routers: four network segments, with
# index SEGMENT between 1 and 4, a common prefix (PREFIX4 or PREFIX6) and an
@@ -2329,6 +2330,81 @@ test_pmtu_ipv6_route_change() {
test_pmtu_ipvX_route_change 6
}
+test_pmtu_ipv4_mp_exceptions() {
+ setup namespaces routing || return $ksft_skip
+
+ ip nexthop ls >/dev/null 2>&1
+ if [ $? -ne 0 ]; then
+ echo "Nexthop objects not supported; skipping tests"
+ exit $ksft_skip
+ fi
+
+ trace "${ns_a}" veth_A-R1 "${ns_r1}" veth_R1-A \
+ "${ns_r1}" veth_R1-B "${ns_b}" veth_B-R1 \
+ "${ns_a}" veth_A-R2 "${ns_r2}" veth_R2-A \
+ "${ns_r2}" veth_R2-B "${ns_b}" veth_B-R2
+
+ dummy0_a="192.168.99.99"
+ dummy0_b="192.168.88.88"
+
+ # Set up initial MTU values
+ mtu "${ns_a}" veth_A-R1 2000
+ mtu "${ns_r1}" veth_R1-A 2000
+ mtu "${ns_r1}" veth_R1-B 1500
+ mtu "${ns_b}" veth_B-R1 1500
+
+ mtu "${ns_a}" veth_A-R2 2000
+ mtu "${ns_r2}" veth_R2-A 2000
+ mtu "${ns_r2}" veth_R2-B 1500
+ mtu "${ns_b}" veth_B-R2 1500
+
+ fail=0
+
+ #Set up host A with multipath routes to host B dummy0_b
+ run_cmd ${ns_a} sysctl -q net.ipv4.fib_multipath_hash_policy=1
+ run_cmd ${ns_a} sysctl -q net.ipv4.ip_forward=1
+ run_cmd ${ns_a} ip link add dummy0 mtu 2000 type dummy
+ run_cmd ${ns_a} ip link set dummy0 up
+ run_cmd ${ns_a} ip addr add ${dummy0_a} dev dummy0
+ run_cmd ${ns_a} ip nexthop add id 201 via ${prefix4}.${a_r1}.2 dev veth_A-R1
+ run_cmd ${ns_a} ip nexthop add id 202 via ${prefix4}.${a_r2}.2 dev veth_A-R2
+ run_cmd ${ns_a} ip nexthop add id 203 group 201/202
+ run_cmd ${ns_a} ip route add ${dummy0_b} nhid 203
+
+ #Set up host B with multipath routes to host A dummy0_a
+ run_cmd ${ns_b} sysctl -q net.ipv4.fib_multipath_hash_policy=1
+ run_cmd ${ns_b} sysctl -q net.ipv4.ip_forward=1
+ run_cmd ${ns_b} ip link add dummy0 mtu 2000 type dummy
+ run_cmd ${ns_b} ip link set dummy0 up
+ run_cmd ${ns_b} ip addr add ${dummy0_b} dev dummy0
+ run_cmd ${ns_b} ip nexthop add id 201 via ${prefix4}.${b_r1}.2 dev veth_A-R1
+ run_cmd ${ns_b} ip nexthop add id 202 via ${prefix4}.${b_r2}.2 dev veth_A-R2
+ run_cmd ${ns_b} ip nexthop add id 203 group 201/202
+ run_cmd ${ns_b} ip route add ${dummy0_a} nhid 203
+
+ #Set up routers with routes to dummies
+ run_cmd ${ns_r1} ip route add ${dummy0_a} via ${prefix4}.${a_r1}.1
+ run_cmd ${ns_r2} ip route add ${dummy0_a} via ${prefix4}.${a_r2}.1
+ run_cmd ${ns_r1} ip route add ${dummy0_b} via ${prefix4}.${b_r1}.1
+ run_cmd ${ns_r2} ip route add ${dummy0_b} via ${prefix4}.${b_r2}.1
+
+
+ #Ping and expect two nexthop exceptions for two routes in nh group
+ run_cmd ${ns_a} ping -q -M want -i 0.1 -c 2 -s 1800 "${dummy0_b}"
+
+ #Do route lookup before checking cached exceptions
+ run_cmd ${ns_a} ip route get ${dummy0_b} oif veth_A-R1
+ run_cmd ${ns_a} ip route get ${dummy0_b} oif veth_A-R2
+
+ #Check cached exceptions
+ if [ "$(${ns_a} ip -oneline route list cache| grep mtu | wc -l)" -ne 2 ]; then
+ err " there are not enough cached exceptions"
+ fail=1
+ fi
+
+ return ${fail}
+}
+
usage() {
echo
echo "$0 [OPTIONS] [TEST]..."
base-commit: 66600fac7a984dea4ae095411f644770b2561ede
--
2.43.5
- Add skip test if not run as root, with an
appropriate Warning.
- Add 'ksft_print_header()' and 'ksft_set_plan()'
to structure test outputs more effectively.
Test logs :
Before change:
- Without root
error: unshare, errno 1
- With root
No, output
After change:
- Without root
TAP version 13
1..1
- With root
TAP version 13
1..1
Signed-off-by: Shivam Chaudhary <cvam0000(a)gmail.com>
---
Notes:
Changes in v4 1/2:
- Start a patchset
- Split patch into smaller pathes to make it easy to review.
link to v3: https://lore.kernel.org/all/20241028185756.111832-1-cvam0000@gmail.com/
Changes in v3:
- Remove extra ksft_set_plan()
- Remove function for unshare()
- Fix the comment style
link to v2: https://lore.kernel.org/all/20241026191621.2860376-1-cvam0000@gmail.com/
Changes in v2:
- Make the commit message more clear.
link to v1: https://lore.kernel.org/all/20241024200228.1075840-1-cvam0000@gmail.com/T/#u
tools/testing/selftests/tmpfs/bug-link-o-tmpfile.c | 13 +++++++++++++
1 file changed, 13 insertions(+)
diff --git a/tools/testing/selftests/tmpfs/bug-link-o-tmpfile.c b/tools/testing/selftests/tmpfs/bug-link-o-tmpfile.c
index b5c3ddb90942..cdab1e8c0392 100644
--- a/tools/testing/selftests/tmpfs/bug-link-o-tmpfile.c
+++ b/tools/testing/selftests/tmpfs/bug-link-o-tmpfile.c
@@ -23,10 +23,23 @@
#include <sys/mount.h>
#include <unistd.h>
+#include "../kselftest.h"
+
int main(void)
{
int fd;
+ /* Setting up kselftest framework */
+ ksft_print_header();
+ ksft_set_plan(1);
+
+ /* Check if test is run as root */
+ if (geteuid()) {
+ ksft_print_msg("Skip : Need to run as root");
+ exit(KSFT_SKIP);
+
+ }
+
if (unshare(CLONE_NEWNS) == -1) {
if (errno == ENOSYS || errno == EPERM) {
fprintf(stderr, "error: unshare, errno %d\n", errno);
--
2.45.2
Greetings:
Welcome to v5, see changelog below. Note that our performance tests were
not re-run for this revision as we updated a commit message, fixed
typos, removed a short unnecessary paragraph from the documentation and
added a very minor functional change to return without suspended IRQs in
certain error cases.
This series introduces a new mechanism, IRQ suspension, which allows
network applications using epoll to mask IRQs during periods of high
traffic while also reducing tail latency (compared to existing
mechanisms, see below) during periods of low traffic. In doing so, this
balances CPU consumption with network processing efficiency.
Martin Karsten (CC'd) and I have been collaborating on this series for
several months and have appreciated the feedback from the community on
our RFC [1]. We've updated the cover letter and kernel documentation in
an attempt to more clearly explain how this mechanism works, how
applications can use it, and how it compares to existing mechanisms in
the kernel. We've added an additional test case, 'fullbusy', achieved by
modifying libevent for comparison. See below for a detailed description,
link to the patch, and test results.
I briefly mentioned this idea at netdev conf 2024 (for those who were
there) and Martin described this idea in an earlier paper presented at
Sigmetrics 2024 [2].
~ The short explanation (TL;DR)
We propose adding a new napi config parameter: irq_suspend_timeout to
help balance CPU usage and network processing efficiency when using IRQ
deferral and napi busy poll.
If this parameter is set to a non-zero value *and* a user application
has enabled preferred busy poll on a busy poll context (via the
EPIOCSPARAMS ioctl introduced in commit 18e2bf0edf4d ("eventpoll: Add
epoll ioctl for epoll_params")), then application calls to epoll_wait
for that context will cause device IRQs and softirq processing to be
suspended as long as epoll_wait successfully retrieves data from the
NAPI. Each time data is retrieved, the irq_suspend_timeout is deferred.
If/when network traffic subsides and epoll_wait returns no data, IRQ
suspension is immediately reverted back to the existing
napi_defer_hard_irqs and gro_flush_timeout mechanism which was
introduced in commit 6f8b12d661d0 ("net: napi: add hard irqs deferral
feature")).
The irq_suspend_timeout serves as a safety mechanism. If userland takes
a long time processing data, irq_suspend_timeout will fire and restart
normal NAPI processing.
For a more in depth explanation, please continue reading.
~ Comparison with existing mechanisms
Interrupt mitigation can be accomplished in napi software, by setting
napi_defer_hard_irqs and gro_flush_timeout, or via interrupt coalescing
in the NIC. This can be quite efficient, but in both cases, a fixed
timeout (or packet count) needs to be configured. However, a fixed
timeout cannot effectively support both low- and high-load situations:
At low load, an application typically processes a few requests and then
waits to receive more input data. In this scenario, a large timeout will
cause unnecessary latency.
At high load, an application typically processes many requests before
being ready to receive more input data. In this case, a small timeout
will likely fire prematurely and trigger irq/softirq processing, which
interferes with the application's execution. This causes overhead, most
likely due to cache contention.
While NICs attempt to provide adaptive interrupt coalescing schemes,
these cannot properly take into account application-level processing.
An alternative packet delivery mechanism is busy-polling, which results
in perfect alignment of application processing and network polling. It
delivers optimal performance (throughput and latency), but results in
100% cpu utilization and is thus inefficient for below-capacity
workloads.
We propose to add a new packet delivery mode that properly alternates
between busy polling and interrupt-based delivery depending on busy and
idle periods of the application. During a busy period, the system
operates in busy-polling mode, which avoids interference. During an idle
period, the system falls back to interrupt deferral, but with a small
timeout to avoid excessive latencies. This delivery mode can also be
viewed as an extension of basic interrupt deferral, but alternating
between a small and a very large timeout.
This delivery mode is efficient, because it avoids softirq execution
interfering with application processing during busy periods. It can be
used with blocking epoll_wait to conserve cpu cycles during idle
periods. The effect of alternating between busy and idle periods is that
performance (throughput and latency) is very close to full busy polling,
while cpu utilization is lower and very close to interrupt mitigation.
~ Usage details
IRQ suspension is introduced via a per-NAPI configuration parameter that
controls the maximum time that IRQs can be suspended.
Here's how it is intended to work:
- The user application (or system administrator) uses the netdev-genl
netlink interface to set the pre-existing napi_defer_hard_irqs and
gro_flush_timeout NAPI config parameters to enable IRQ deferral.
- The user application (or system administrator) sets the proposed
irq_suspend_timeout parameter via the netdev-genl netlink interface
to a larger value than gro_flush_timeout to enable IRQ suspension.
- The user application issues the existing epoll ioctl to set the
prefer_busy_poll flag on the epoll context.
- The user application then calls epoll_wait to busy poll for network
events, as it normally would.
- If epoll_wait returns events to userland, IRQs are suspended for the
duration of irq_suspend_timeout.
- If epoll_wait finds no events and the thread is about to go to
sleep, IRQ handling using napi_defer_hard_irqs and gro_flush_timeout
is resumed.
As long as epoll_wait is retrieving events, IRQs (and softirq
processing) for the NAPI being polled remain disabled. When network
traffic reduces, eventually a busy poll loop in the kernel will retrieve
no data. When this occurs, regular IRQ deferral using gro_flush_timeout
for the polled NAPI is re-enabled.
Unless IRQ suspension is continued by subsequent calls to epoll_wait, it
automatically times out after the irq_suspend_timeout timer expires.
Regular deferral is also immediately re-enabled when the epoll context
is destroyed.
~ Usage scenario
The target scenario for IRQ suspension as packet delivery mode is a
system that runs a dominant application with substantial network I/O.
The target application can be configured to receive input data up to a
certain batch size (via epoll_wait maxevents parameter) and this batch
size determines the worst-case latency that application requests might
experience. Because packet delivery is suspended during the target
application's processing, the batch size also determines the worst-case
latency of concurrent applications using the same RX queue(s).
gro_flush_timeout should be set as small as possible, but large enough to
make sure that a single request is likely not being interfered with.
irq_suspend_timeout is largely a safety mechanism against misbehaving
applications. It should be set large enough to cover the processing of an
entire application batch, i.e., the factor between gro_flush_timeout and
irq_suspend_timeout should roughly correspond to the maximum batch size
that the target application would process in one go.
~ Design rationale
The implementation of the IRQ suspension mechanism very nicely dovetails
with the existing mechanism for IRQ deferral when preferred busy poll is
enabled (introduced in commit 7fd3253a7de6 ("net: Introduce preferred
busy-polling"), see that commit message for more details).
While it would be possible to inject the suspend timeout via
the existing epoll ioctl, it is more natural to avoid this path for one
main reason:
An epoll context is linked to NAPI IDs as file descriptors are added;
this means any epoll context might suddenly be associated with a
different net_device if the application were to replace all existing
fds with fds from a different device. In this case, the scope of the
suspend timeout becomes unclear and many edge cases for both the user
application and the kernel are introduced
Only a single iteration through napi busy polling is needed for this
mechanism to work effectively. Since an important objective for this
mechanism is preserving cpu cycles, exactly one iteration of the napi
busy loop is invoked when busy_poll_usecs is set to 0.
~ Important call out in the implementation
- Enabling per epoll-context preferred busy poll will now effectively
lead to a nonblocking iteration through napi_busy_loop, even when
busy_poll_usecs is 0. See patch 4.
~ Benchmark configs & descriptions
The changes were benchmarked with memcached [3] using the benchmarking
tool mutilate [4].
To facilitate benchmarking, a small patch [5] was applied to memcached
1.6.29 to allow setting per-epoll context preferred busy poll and other
settings via environment variables. Another small patch [6] was applied
to libevent to enable full busy-polling.
Multiple scenarios were benchmarked as described below and the scripts
used for producing these results can be found on github [7] (note: all
scenarios use NAPI-based traffic splitting via SO_INCOMING_ID by passing
-N to memcached):
- base:
- no other options enabled
- deferX:
- set defer_hard_irqs to 100
- set gro_flush_timeout to X,000
- napibusy:
- set defer_hard_irqs to 100
- set gro_flush_timeout to 200,000
- enable busy poll via the existing ioctl (busy_poll_usecs = 64,
busy_poll_budget = 64, prefer_busy_poll = true)
- fullbusy:
- set defer_hard_irqs to 100
- set gro_flush_timeout to 5,000,000
- enable busy poll via the existing ioctl (busy_poll_usecs = 1000,
busy_poll_budget = 64, prefer_busy_poll = true)
- change memcached's nonblocking epoll_wait invocation (via
libevent) to using a 1 ms timeout
- suspendX:
- set defer_hard_irqs to 100
- set gro_flush_timeout to X,000
- set irq_suspend_timeout to 20,000,000
- enable busy poll via the existing ioctl (busy_poll_usecs = 0,
busy_poll_budget = 64, prefer_busy_poll = true)
~ Benchmark results
Tested on:
Single socket AMD EPYC 7662 64-Core Processor
Hyperthreading disabled
4 NUMA Zones (NPS=4)
16 CPUs per NUMA zone (64 cores total)
2 x Dual port 100gbps Mellanox Technologies ConnectX-5 Ex EN NIC
The test machine is configured such that a single interface has 8 RX
queues. The queues' IRQs and memcached are pinned to CPUs that are
NUMA-local to the interface which is under test. The NIC's interrupt
coalescing configuration is left at boot-time defaults.
Results:
Results are shown below. The mechanism added by this series is
represented by the 'suspend' cases. Data presented shows a summary over
at least 10 runs of each test case [8] using the scripts on github [7].
For latency, the median is shown. For throughput and CPU utilization,
the average is shown.
The results also include cycles-per-query (cpq) and
instruction-per-query (ipq) metrics, following the methodology proposed
in [2], to augment the CPU utilization numbers, which could be skewed
due to frequency scaling. We find that this does not appear to be the
case as CPU utilization and low-level metrics show similar trends.
These results were captured using the scripts on github [7] to
illustrate how this approach compares with other pre-existing
mechanisms. This data is not to be interpreted as scientific data
captured in a fully isolated lab setting, but instead as best effort,
illustrative information comparing and contrasting tradeoffs.
The absolute QPS results are higher than our previous submission, but
the relative differences between variants are equivalent. Because the
patches have been rebased on 6.12, several factors have likely
influenced the overall performance. Most importantly, we had to switch
to a new set of basic kernel options, which has likely altered the
baseline performance. Because the overall comparison of variants still
holds, we have not attempted to recreate the exact set of kernel options
from the previous submission.
Compare:
- Throughput (MAX) and latencies of base vs suspend.
- CPU usage of napibusy and fullbusy during lower load (200K, 400K for
example) vs suspend.
- Latency of the defer variants vs suspend as timeout and load
increases.
The overall takeaway is that the suspend variants provide a superior
combination of high throughput, low latency, and low cpu utilization
compared to all other variants. Each of the suspend variants works very
well, but some fine-tuning between latency and cpu utilization is still
possible by tuning the small timeout (gro_flush_timeout).
Note: we've reorganized the results to make comparison among testcases
with the same load easier.
testcase load qps avglat 95%lat 99%lat cpu cpq ipq
base 200K 200024 127 254 458 25 12748 11289
defer10 200K 199991 64 128 166 27 18763 16574
defer20 200K 199986 72 135 178 25 15405 14173
defer50 200K 200025 91 149 198 23 12275 12203
defer200 200K 199996 182 266 326 18 8595 9183
fullbusy 200K 200040 58 123 167 100 43641 23145
napibusy 200K 200009 115 244 299 56 24797 24693
suspend10 200K 200005 63 128 167 32 19559 17240
suspend20 200K 199952 69 132 170 29 16324 14838
suspend50 200K 200019 84 144 189 26 13106 12516
suspend200 200K 199978 168 264 326 20 9331 9643
testcase load qps avglat 95%lat 99%lat cpu cpq ipq
base 400K 400017 157 292 762 39 9287 9325
defer10 400K 400033 71 141 204 53 13950 12943
defer20 400K 399935 79 150 212 47 12027 11673
defer50 400K 399888 101 171 231 39 9556 9921
defer200 400K 399993 200 287 358 32 7428 8576
fullbusy 400K 400018 63 132 203 100 21827 16062
napibusy 400K 399970 89 230 292 83 18156 16508
suspend10 400K 400061 69 139 202 54 13576 13057
suspend20 400K 399988 73 144 206 49 11930 11773
suspend50 400K 399975 88 161 218 42 9996 10270
suspend200 400K 399954 172 276 353 34 7847 8713
testcase load qps avglat 95%lat 99%lat cpu cpq ipq
base 600K 600031 166 289 631 61 9188 8787
defer10 600K 599967 85 167 262 75 11833 10947
defer20 600K 599888 89 165 243 66 10513 10362
defer50 600K 600072 109 185 253 55 8664 9190
defer200 600K 599951 222 315 393 45 6892 8213
fullbusy 600K 600041 69 145 227 100 14549 13936
napibusy 600K 599980 79 188 280 96 13927 14155
suspend10 600K 600028 78 159 267 69 10877 11032
suspend20 600K 600026 81 159 254 64 9922 10320
suspend50 600K 600007 96 178 258 57 8681 9331
suspend200 600K 599964 177 295 369 47 7115 8366
testcase load qps avglat 95%lat 99%lat cpu cpq ipq
base 800K 800034 198 329 698 84 9366 8338
defer10 800K 799718 243 642 1457 95 10532 9007
defer20 800K 800009 132 245 399 89 9956 8979
defer50 800K 800024 136 228 378 80 9002 8598
defer200 800K 799965 255 362 473 66 7481 8147
fullbusy 800K 799927 78 157 253 100 10915 12533
napibusy 800K 799870 81 173 273 99 10826 12532
suspend10 800K 799991 84 167 269 83 9380 9802
suspend20 800K 799979 90 172 290 78 8765 9404
suspend50 800K 800031 106 191 307 71 7945 8805
suspend200 800K 799905 182 307 411 62 6985 8242
testcase load qps avglat 95%lat 99%lat cpu cpq ipq
base 1000K 919543 3805 6390 14229 98 9324 7978
defer10 1000K 850751 4574 7382 15370 99 10218 8470
defer20 1000K 890296 4736 6862 14858 99 9708 8277
defer50 1000K 932694 3463 6180 13251 97 9148 8053
defer200 1000K 951311 3524 6052 13599 96 8875 7845
fullbusy 1000K 1000011 90 181 278 100 8731 10686
napibusy 1000K 1000050 93 184 280 100 8721 10547
suspend10 1000K 999962 101 193 306 92 8138 8980
suspend20 1000K 1000030 103 191 324 88 7844 8763
suspend50 1000K 1000001 114 202 320 83 7396 8431
suspend200 1000K 999965 185 314 428 76 6733 8072
testcase load qps avglat 95%lat 99%lat cpu cpq ipq
base MAX 1005592 4651 6594 14979 100 8679 7918
defer10 MAX 928204 5106 7286 15199 100 9398 8380
defer20 MAX 984663 4774 6518 14920 100 8861 8063
defer50 MAX 1044099 4431 6368 14652 100 8350 7948
defer200 MAX 1040451 4423 6610 14674 100 8380 7931
fullbusy MAX 1236608 3715 3987 12805 100 7051 7936
napibusy MAX 1077516 4345 10155 15957 100 8080 7842
suspend10 MAX 1218344 3760 3990 12585 100 7150 7935
suspend20 MAX 1220056 3752 4053 12602 100 7150 7961
suspend50 MAX 1213666 3791 4103 12919 100 7183 7959
suspend200 MAX 1217411 3768 3988 12863 100 7161 7954
~ FAQ
- Why is a new parameter needed? Does irq_suspend_timeout override
gro_flush_timeout?
Using the suspend mechanism causes the system to alternate between
polling mode and irq-driven packet delivery. During busy periods,
irq_suspend_timeout overrides gro_flush_timeout and keeps the system
busy polling, but when epoll finds no events, the setting of
gro_flush_timeout and napi_defer_hard_irqs determine the next step.
There are essentially three possible loops for network processing and
packet delivery:
1) hardirq -> softirq -> napi poll; basic interrupt delivery
2) timer -> softirq -> napi poll; deferred irq processing
3) epoll -> busy-poll -> napi poll; busy looping
Loop 2) can take control from Loop 1), if gro_flush_timeout and
napi_defer_hard_irqs are set.
If gro_flush_timeout and napi_defer_hard_irqs are set, Loops 2) and
3) "wrestle" with each other for control. During busy periods,
irq_suspend_timeout is used as timer in Loop 2), which essentially
tilts this in favour of Loop 3).
If gro_flush_timeout and napi_defer_hard_irqs are not set, Loop 3)
cannot take control from Loop 1).
Therefore, setting gro_flush_timeout and napi_defer_hard_irqs is the
recommended usage, because otherwise setting irq_suspend_timeout
might not have any discernible effect.
We ran experiments with these parameters set to zero and the results
are as expected and essentially the same as the base case.
- Can the new timeout value be threaded through the new epoll ioctl ?
Only with difficulty. The epoll ioctl sets options on an epoll
context and the NAPI ID associated with an epoll context can change
based on what file descriptors a user app adds to the epoll context.
This would introduce complexity in the API from the user perspective
and also complexity in the kernel.
- Can irq suspend be built by combining NIC coalescing and
gro_flush_timeout ?
No. The problem is that the long timeout must engage if and only if
prefer-busy is active.
When using NIC coalescing for the short timeout (without
napi_defer_hard_irqs/gro_flush_timeout), an interrupt after an idle
period will trigger softirq, which will run napi polling. At this
point, prefer-busy is not active, so NIC interrupts would be
re-enabled. Then it is not possible for the longer timeout to
interject to switch control back to polling. In other words, only by
using the software timer for the short timeout, it is possible to
extend the timeout without having to reprogram the NIC timer or
reach down directly and disable interrupts.
Using gro_flush_timeout for the long timeout also has problems, for
the same underlying reason. In the current napi implementation,
gro_flush_timeout is not tied to prefer-busy. We'd either have to
change that and in the process modify the existing deferral
mechanism, or introduce a state variable to determine whether
gro_flush_timeout is used as long timeout for irq suspend or whether
it is used for its default purpose. In an earlier version, we did
try something similar to the latter and made it work, but it ends up
being a lot more convoluted than our current proposal.
- Isn't it already possible to combine busy looping with irq deferral?
Yes, in fact enabling irq deferral via napi_defer_hard_irqs and
gro_flush_timeout is a precondition for prefer_busy_poll to have an
effect. If the application also uses a tight busy loop with
essentially nonblocking epoll_wait (accomplished with a very short
timeout parameter), this is the fullbusy case shown in the results.
An application using blocking epoll_wait is shown as the napibusy
case in the results. It's a hybrid approach that provides limited
latency benefits compared to the base case and plain irq deferral,
but not as good as fullbusy or suspend.
~ Special thanks
Several people were involved in earlier stages of the development of this
mechanism whom we'd like to thank:
- Peter Cai (CC'd), for the initial kernel patch and his contributions
to the paper.
- Mohammadamin Shafie (CC'd), for testing various versions of the kernel
patch and providing helpful feedback.
Thanks,
Martin and Joe
[1]: https://lore.kernel.org/netdev/20240812125717.413108-1-jdamato@fastly.com/
[2]: https://doi.org/10.1145/3626780
[3]: https://github.com/memcached/memcached/blob/master/doc/napi_ids.txt
[4]: https://github.com/leverich/mutilate
[5]: https://raw.githubusercontent.com/martinkarsten/irqsuspend/main/patches/mem…
[6]: https://raw.githubusercontent.com/martinkarsten/irqsuspend/main/patches/lib…
[7]: https://github.com/martinkarsten/irqsuspend
[8]: https://github.com/martinkarsten/irqsuspend/tree/main/results
v5:
- Adjusted patch 5 to only suspend IRQs when ep_send_events returns a
positive return value. This issue was pointed out by Hillf Danton.
- Updated the commit message of patch 6 which still mentioned netcat,
despite the code being updated in v4 to replace it with socat and fixed
misspelling of netdevsim.
- Fixed a minor typo in patch 7 and removed an unnecessary paragraph.
- Added Sridhar Samudrala's Reviewed-by to patch 1-5 and 7.
v4: https://lore.kernel.org/netdev/20241102005214.32443-1-jdamato@fastly.com/
- Added a new FAQ item to cover letter.
- Updated patch 6 to use socat instead of nc in busy_poll_test.sh and
updated busy_poller.c to use netlink directly to configure napi
params.
- Updated the kernel documentation in patch 7 to include more details.
- Dropped Stanislav's Acked-by and Bagas' Reviewed-by from patch 7
since the documentation was updated.
v3: https://lore.kernel.org/netdev/20241101004846.32532-1-jdamato@fastly.com/
- Added Stanislav Fomichev's Acked-by to every patch except the newly
added selftest.
- Added Bagas Sanjaya's Reviewed-by to the documentation patch.
- Fixed the commit message of patch 2 to remove a reference to the now
non-existent sysfs setting.
- Added a self test which tests both "regular" busy poll and busy poll
with suspend enabled. This was added as patch 6 as requested by
Paolo. netdevsim was chosen instead of veth due to netdevsim's
pre-existing support for netdev-genl. See the commit message of
patch 6 for more details.
v2: https://lore.kernel.org/bpf/20241021015311.95468-1-jdamato@fastly.com/
- Cover letter updated, including a re-run of test data.
- Patch 1 rewritten to use netdev-genl instead of sysfs.
- Patch 3 updated with a comment added to napi_resume_irqs.
- Patch 4 rebased to apply now that commit b9ca079dd6b0 ("eventpoll:
Annotate data-race of busy_poll_usecs") has been picked up from VFS.
- Patch 6 updated the kernel documentation.
rfc -> v1:
- Cover letter updated to include more details.
- Patch 1 updated to remove the documentation added. This was moved to
patch 6 with the rest of the docs (see below).
- Patch 5 updated to fix an error uncovered by the kernel build robot.
See patch 5's changelog for more details.
- Patch 6 added which updates kernel documentation.
Joe Damato (2):
selftests: net: Add busy_poll_test
docs: networking: Describe irq suspension
Martin Karsten (5):
net: Add napi_struct parameter irq_suspend_timeout
net: Suspend softirq when prefer_busy_poll is set
net: Add control functions for irq suspension
eventpoll: Trigger napi_busy_loop, if prefer_busy_poll is set
eventpoll: Control irq suspension for prefer_busy_poll
Documentation/netlink/specs/netdev.yaml | 7 +
Documentation/networking/napi.rst | 172 ++++++++-
fs/eventpoll.c | 36 +-
include/linux/netdevice.h | 2 +
include/net/busy_poll.h | 3 +
include/uapi/linux/netdev.h | 1 +
net/core/dev.c | 58 +++-
net/core/dev.h | 25 ++
net/core/netdev-genl-gen.c | 5 +-
net/core/netdev-genl.c | 12 +
tools/include/uapi/linux/netdev.h | 1 +
tools/testing/selftests/net/.gitignore | 1 +
tools/testing/selftests/net/Makefile | 3 +-
tools/testing/selftests/net/busy_poll_test.sh | 164 +++++++++
tools/testing/selftests/net/busy_poller.c | 328 ++++++++++++++++++
15 files changed, 807 insertions(+), 11 deletions(-)
create mode 100755 tools/testing/selftests/net/busy_poll_test.sh
create mode 100644 tools/testing/selftests/net/busy_poller.c
base-commit: dbb9a7ef347828870df3e5e6ddf19469a3277fc9
--
2.25.1
Check number of paths by fib_info_num_path(),
and update_or_create_fnhe() for every path.
Problem is that pmtu is cached only for the oif
that has received icmp message "need to frag",
other oifs will still try to use "default" iface mtu.
V5:
- make self test cleaner
V4:
- fix selftest, do route lookup before checking cached exceptions
V3:
- added selftest
- fixed compile error
V2:
- fix fib_info_num_path parameter pass
An example topology showing the problem:
| host1
+---------+
| dummy0 | 10.179.20.18/32 mtu9000
+---------+
+-----------+----------------+
+---------+ +---------+
| ens17f0 | 10.179.2.141/31 | ens17f1 | 10.179.2.13/31
+---------+ +---------+
| (all here have mtu 9000) |
+------+ +------+
| ro1 | 10.179.2.140/31 | ro2 | 10.179.2.12/31
+------+ +------+
| |
---------+------------+-------------------+------
|
+-----+
| ro3 | 10.10.10.10 mtu1500
+-----+
|
========================================
some networks
========================================
|
+-----+
| eth0| 10.10.30.30 mtu9000
+-----+
| host2
host1 have enabled multipath and
sysctl net.ipv4.fib_multipath_hash_policy = 1:
default proto static src 10.179.20.18
nexthop via 10.179.2.12 dev ens17f1 weight 1
nexthop via 10.179.2.140 dev ens17f0 weight 1
When host1 tries to do pmtud from 10.179.20.18/32 to host2,
host1 receives at ens17f1 iface an icmp packet from ro3 that ro3 mtu=1500.
And host1 caches it in nexthop exceptions cache.
Problem is that it is cached only for the iface that has received icmp,
and there is no way that ro3 will send icmp msg to host1 via another path.
Host1 now have this routes to host2:
ip r g 10.10.30.30 sport 30000 dport 443
10.10.30.30 via 10.179.2.12 dev ens17f1 src 10.179.20.18 uid 0
cache expires 521sec mtu 1500
ip r g 10.10.30.30 sport 30033 dport 443
10.10.30.30 via 10.179.2.140 dev ens17f0 src 10.179.20.18 uid 0
cache
So when host1 tries again to reach host2 with mtu>1500,
if packet flow is lucky enough to be hashed with oif=ens17f1 its ok,
if oif=ens17f0 it blackholes and still gets icmp msgs from ro3 to ens17f1,
until lucky day when ro3 will send it through another flow to ens17f0.
Signed-off-by: Vladimir Vdovin <deliran(a)verdict.gg>
---
net/ipv4/route.c | 13 +++++
tools/testing/selftests/net/pmtu.sh | 78 ++++++++++++++++++++++++++++-
2 files changed, 90 insertions(+), 1 deletion(-)
diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index 723ac9181558..41162b5cc4cb 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -1027,6 +1027,19 @@ static void __ip_rt_update_pmtu(struct rtable *rt, struct flowi4 *fl4, u32 mtu)
struct fib_nh_common *nhc;
fib_select_path(net, &res, fl4, NULL);
+#ifdef CONFIG_IP_ROUTE_MULTIPATH
+ if (fib_info_num_path(res.fi) > 1) {
+ int nhsel;
+
+ for (nhsel = 0; nhsel < fib_info_num_path(res.fi); nhsel++) {
+ nhc = fib_info_nhc(res.fi, nhsel);
+ update_or_create_fnhe(nhc, fl4->daddr, 0, mtu, lock,
+ jiffies + net->ipv4.ip_rt_mtu_expires);
+ }
+ rcu_read_unlock();
+ return;
+ }
+#endif /* CONFIG_IP_ROUTE_MULTIPATH */
nhc = FIB_RES_NHC(res);
update_or_create_fnhe(nhc, fl4->daddr, 0, mtu, lock,
jiffies + net->ipv4.ip_rt_mtu_expires);
diff --git a/tools/testing/selftests/net/pmtu.sh b/tools/testing/selftests/net/pmtu.sh
index 569bce8b6383..a0159340fe84 100755
--- a/tools/testing/selftests/net/pmtu.sh
+++ b/tools/testing/selftests/net/pmtu.sh
@@ -266,7 +266,8 @@ tests="
list_flush_ipv4_exception ipv4: list and flush cached exceptions 1
list_flush_ipv6_exception ipv6: list and flush cached exceptions 1
pmtu_ipv4_route_change ipv4: PMTU exception w/route replace 1
- pmtu_ipv6_route_change ipv6: PMTU exception w/route replace 1"
+ pmtu_ipv6_route_change ipv6: PMTU exception w/route replace 1
+ pmtu_ipv4_mp_exceptions ipv4: PMTU multipath nh exceptions 0"
# Addressing and routing for tests with routers: four network segments, with
# index SEGMENT between 1 and 4, a common prefix (PREFIX4 or PREFIX6) and an
@@ -2329,6 +2330,81 @@ test_pmtu_ipv6_route_change() {
test_pmtu_ipvX_route_change 6
}
+test_pmtu_ipv4_mp_exceptions() {
+ setup namespaces routing || return $ksft_skip
+
+ ip nexthop ls >/dev/null 2>&1
+ if [ $? -ne 0 ]; then
+ echo "Nexthop objects not supported; skipping tests"
+ exit $ksft_skip
+ fi
+
+ trace "${ns_a}" veth_A-R1 "${ns_r1}" veth_R1-A \
+ "${ns_r1}" veth_R1-B "${ns_b}" veth_B-R1 \
+ "${ns_a}" veth_A-R2 "${ns_r2}" veth_R2-A \
+ "${ns_r2}" veth_R2-B "${ns_b}" veth_B-R2
+
+ dummy0_a="192.168.99.99"
+ dummy0_b="192.168.88.88"
+
+ # Set up initial MTU values
+ mtu "${ns_a}" veth_A-R1 2000
+ mtu "${ns_r1}" veth_R1-A 2000
+ mtu "${ns_r1}" veth_R1-B 1500
+ mtu "${ns_b}" veth_B-R1 1500
+
+ mtu "${ns_a}" veth_A-R2 2000
+ mtu "${ns_r2}" veth_R2-A 2000
+ mtu "${ns_r2}" veth_R2-B 1500
+ mtu "${ns_b}" veth_B-R2 1500
+
+ fail=0
+
+ #Set up host A with multipath routes to host B dummy0_b
+ run_cmd ${ns_a} sysctl -q net.ipv4.fib_multipath_hash_policy=1
+ run_cmd ${ns_a} sysctl -q net.ipv4.ip_forward=1
+ run_cmd ${ns_a} ip link add dummy0 mtu 2000 type dummy
+ run_cmd ${ns_a} ip link set dummy0 up
+ run_cmd ${ns_a} ip addr add ${dummy0_a} dev dummy0
+ run_cmd ${ns_a} ip nexthop add id 201 via ${prefix4}.${a_r1}.2 dev veth_A-R1
+ run_cmd ${ns_a} ip nexthop add id 202 via ${prefix4}.${a_r2}.2 dev veth_A-R2
+ run_cmd ${ns_a} ip nexthop add id 203 group 201/202
+ run_cmd ${ns_a} ip route add ${dummy0_b} nhid 203
+
+ #Set up host B with multipath routes to host A dummy0_a
+ run_cmd ${ns_b} sysctl -q net.ipv4.fib_multipath_hash_policy=1
+ run_cmd ${ns_b} sysctl -q net.ipv4.ip_forward=1
+ run_cmd ${ns_b} ip link add dummy0 mtu 2000 type dummy
+ run_cmd ${ns_b} ip link set dummy0 up
+ run_cmd ${ns_b} ip addr add ${dummy0_b} dev dummy0
+ run_cmd ${ns_b} ip nexthop add id 201 via ${prefix4}.${b_r1}.2 dev veth_A-R1
+ run_cmd ${ns_b} ip nexthop add id 202 via ${prefix4}.${b_r2}.2 dev veth_A-R2
+ run_cmd ${ns_b} ip nexthop add id 203 group 201/202
+ run_cmd ${ns_b} ip route add ${dummy0_a} nhid 203
+
+ #Set up routers with routes to dummies
+ run_cmd ${ns_r1} ip route add ${dummy0_a} via ${prefix4}.${a_r1}.1
+ run_cmd ${ns_r2} ip route add ${dummy0_a} via ${prefix4}.${a_r2}.1
+ run_cmd ${ns_r1} ip route add ${dummy0_b} via ${prefix4}.${b_r1}.1
+ run_cmd ${ns_r2} ip route add ${dummy0_b} via ${prefix4}.${b_r2}.1
+
+
+ #Ping and expect two nexthop exceptions for two routes in nh group
+ run_cmd ${ns_a} ping -q -M want -i 0.1 -c 2 -s 1800 "${dummy0_b}"
+
+ #Do route lookup before checking cached exceptions
+ run_cmd ${ns_a} ip route get ${dummy0_b} oif veth_A-R1
+ run_cmd ${ns_a} ip route get ${dummy0_b} oif veth_A-R2
+
+ #Check cached exceptions
+ if [ "$(${ns_a} ip -oneline route list cache| grep mtu | wc -l)" -ne 2 ]; then
+ err " there are not enough cached exceptions"
+ fail=1
+ fi
+
+ return ${fail}
+}
+
usage() {
echo
echo "$0 [OPTIONS] [TEST]..."
base-commit: 66600fac7a984dea4ae095411f644770b2561ede
--
2.43.5
This series implements feature detection of hardware virtualization on
Linux and macOS; the latter being my primary use case.
This yields approximately a 6x improvement using HVF on M3 Pro.
Signed-off-by: Tamir Duberstein <tamird(a)gmail.com>
---
Tamir Duberstein (2):
kunit: add fallback for os.sched_getaffinity
kunit: enable hardware acceleration when available
tools/testing/kunit/kunit.py | 11 ++++++++++-
tools/testing/kunit/kunit_kernel.py | 26 +++++++++++++++++++++++++-
tools/testing/kunit/qemu_configs/arm64.py | 2 +-
3 files changed, 36 insertions(+), 3 deletions(-)
---
base-commit: ae90f6a6170d7a7a1aa4fddf664fbd093e3023bc
change-id: 20241025-kunit-qemu-accel-macos-2840e4c2def5
Best regards,
--
Tamir Duberstein <tamird(a)gmail.com>
Hi Linus,
Please pull the following kselftest fixes update for Linux 6.12-rc6.
Kselftest fixes for Linux 6.12-rc6
- fix syntax error in frequency calculation arithmetic expression in
intel_pstate run.sh
- add missing cpupower dependency check intel_pstate run.sh
- fix idmap_mount_tree_invalid test failure due to incorrect argument
- fix watchdog-test run leaving the watchdog timer enabled causing
system reboot. With this fix, the test disables the watchdog timer
when it gets terminated with SIGTERM, SIGKILL, and SIGQUIT in addition
to SIGINT
diff is attached.
thanks,
-- Shuah
----------------------------------------------------------------
The following changes since commit fe05c40ca9c18cfdb003f639a30fc78a7ab49519:
selftest: hid: add the missing tests directory (2024-10-16 15:55:14 -0600)
are available in the Git repository at:
git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest tags/linux_kselftest-fixes-6.12-rc6
for you to fetch changes up to dc1308bee1ed03b4d698d77c8bd670d399dcd04d:
selftests/watchdog-test: Fix system accidentally reset after watchdog-test (2024-10-28 21:34:43 -0600)
----------------------------------------------------------------
linux_kselftest-fixes-6.12-rc6
Kselftest fixes for Linux 6.12-rc6
- fix syntax error in frequency calculation arithmetic expression in
intel_pstate run.sh
- add missing cpupower dependency check intel_pstate run.sh
- fix idmap_mount_tree_invalid test failure due to incorrect argument
- fix watchdog-test run leaving the watchdog timer enabled causing
system reboot. With this fix, the test disables the watchdog timer
when it gets terminated with SIGTERM, SIGKILL, and SIGQUIT in addition
to SIGINT
----------------------------------------------------------------
Alessandro Zanni (2):
selftests/intel_pstate: fix operand expected error
selftests/intel_pstate: check if cpupower is installed
Li Zhijian (1):
selftests/watchdog-test: Fix system accidentally reset after watchdog-test
zhouyuhang (1):
selftests/mount_setattr: fix idmap_mount_tree_invalid failed to run
tools/testing/selftests/intel_pstate/run.sh | 9 +++++++--
tools/testing/selftests/mount_setattr/mount_setattr_test.c | 9 +++++++++
tools/testing/selftests/watchdog/watchdog-test.c | 6 ++++++
3 files changed, 22 insertions(+), 2 deletions(-)
---------------------------------------------------------------
Greetings:
Welcome to v4, see changelog below. Note that our performance tests were
not re-run for this revision as we updated the selftest, FAQ in the
cover letter, and kernel documentation. No functional/code changes.
This series introduces a new mechanism, IRQ suspension, which allows
network applications using epoll to mask IRQs during periods of high
traffic while also reducing tail latency (compared to existing
mechanisms, see below) during periods of low traffic. In doing so, this
balances CPU consumption with network processing efficiency.
Martin Karsten (CC'd) and I have been collaborating on this series for
several months and have appreciated the feedback from the community on
our RFC [1]. We've updated the cover letter and kernel documentation in
an attempt to more clearly explain how this mechanism works, how
applications can use it, and how it compares to existing mechanisms in
the kernel. We've added an additional test case, 'fullbusy', achieved by
modifying libevent for comparison. See below for a detailed description,
link to the patch, and test results.
I briefly mentioned this idea at netdev conf 2024 (for those who were
there) and Martin described this idea in an earlier paper presented at
Sigmetrics 2024 [2].
~ The short explanation (TL;DR)
We propose adding a new napi config parameter: irq_suspend_timeout to
help balance CPU usage and network processing efficiency when using IRQ
deferral and napi busy poll.
If this parameter is set to a non-zero value *and* a user application
has enabled preferred busy poll on a busy poll context (via the
EPIOCSPARAMS ioctl introduced in commit 18e2bf0edf4d ("eventpoll: Add
epoll ioctl for epoll_params")), then application calls to epoll_wait
for that context will cause device IRQs and softirq processing to be
suspended as long as epoll_wait successfully retrieves data from the
NAPI. Each time data is retrieved, the irq_suspend_timeout is deferred.
If/when network traffic subsides and epoll_wait returns no data, IRQ
suspension is immediately reverted back to the existing
napi_defer_hard_irqs and gro_flush_timeout mechanism which was
introduced in commit 6f8b12d661d0 ("net: napi: add hard irqs deferral
feature")).
The irq_suspend_timeout serves as a safety mechanism. If userland takes
a long time processing data, irq_suspend_timeout will fire and restart
normal NAPI processing.
For a more in depth explanation, please continue reading.
~ Comparison with existing mechanisms
Interrupt mitigation can be accomplished in napi software, by setting
napi_defer_hard_irqs and gro_flush_timeout, or via interrupt coalescing
in the NIC. This can be quite efficient, but in both cases, a fixed
timeout (or packet count) needs to be configured. However, a fixed
timeout cannot effectively support both low- and high-load situations:
At low load, an application typically processes a few requests and then
waits to receive more input data. In this scenario, a large timeout will
cause unnecessary latency.
At high load, an application typically processes many requests before
being ready to receive more input data. In this case, a small timeout
will likely fire prematurely and trigger irq/softirq processing, which
interferes with the application's execution. This causes overhead, most
likely due to cache contention.
While NICs attempt to provide adaptive interrupt coalescing schemes,
these cannot properly take into account application-level processing.
An alternative packet delivery mechanism is busy-polling, which results
in perfect alignment of application processing and network polling. It
delivers optimal performance (throughput and latency), but results in
100% cpu utilization and is thus inefficient for below-capacity
workloads.
We propose to add a new packet delivery mode that properly alternates
between busy polling and interrupt-based delivery depending on busy and
idle periods of the application. During a busy period, the system
operates in busy-polling mode, which avoids interference. During an idle
period, the system falls back to interrupt deferral, but with a small
timeout to avoid excessive latencies. This delivery mode can also be
viewed as an extension of basic interrupt deferral, but alternating
between a small and a very large timeout.
This delivery mode is efficient, because it avoids softirq execution
interfering with application processing during busy periods. It can be
used with blocking epoll_wait to conserve cpu cycles during idle
periods. The effect of alternating between busy and idle periods is that
performance (throughput and latency) is very close to full busy polling,
while cpu utilization is lower and very close to interrupt mitigation.
~ Usage details
IRQ suspension is introduced via a per-NAPI configuration parameter that
controls the maximum time that IRQs can be suspended.
Here's how it is intended to work:
- The user application (or system administrator) uses the netdev-genl
netlink interface to set the pre-existing napi_defer_hard_irqs and
gro_flush_timeout NAPI config parameters to enable IRQ deferral.
- The user application (or system administrator) sets the proposed
irq_suspend_timeout parameter via the netdev-genl netlink interface
to a larger value than gro_flush_timeout to enable IRQ suspension.
- The user application issues the existing epoll ioctl to set the
prefer_busy_poll flag on the epoll context.
- The user application then calls epoll_wait to busy poll for network
events, as it normally would.
- If epoll_wait returns events to userland, IRQs are suspended for the
duration of irq_suspend_timeout.
- If epoll_wait finds no events and the thread is about to go to
sleep, IRQ handling using napi_defer_hard_irqs and gro_flush_timeout
is resumed.
As long as epoll_wait is retrieving events, IRQs (and softirq
processing) for the NAPI being polled remain disabled. When network
traffic reduces, eventually a busy poll loop in the kernel will retrieve
no data. When this occurs, regular IRQ deferral using gro_flush_timeout
for the polled NAPI is re-enabled.
Unless IRQ suspension is continued by subsequent calls to epoll_wait, it
automatically times out after the irq_suspend_timeout timer expires.
Regular deferral is also immediately re-enabled when the epoll context
is destroyed.
~ Usage scenario
The target scenario for IRQ suspension as packet delivery mode is a
system that runs a dominant application with substantial network I/O.
The target application can be configured to receive input data up to a
certain batch size (via epoll_wait maxevents parameter) and this batch
size determines the worst-case latency that application requests might
experience. Because packet delivery is suspended during the target
application's processing, the batch size also determines the worst-case
latency of concurrent applications using the same RX queue(s).
gro_flush_timeout should be set as small as possible, but large enough to
make sure that a single request is likely not being interfered with.
irq_suspend_timeout is largely a safety mechanism against misbehaving
applications. It should be set large enough to cover the processing of an
entire application batch, i.e., the factor between gro_flush_timeout and
irq_suspend_timeout should roughly correspond to the maximum batch size
that the target application would process in one go.
~ Design rationale
The implementation of the IRQ suspension mechanism very nicely dovetails
with the existing mechanism for IRQ deferral when preferred busy poll is
enabled (introduced in commit 7fd3253a7de6 ("net: Introduce preferred
busy-polling"), see that commit message for more details).
While it would be possible to inject the suspend timeout via
the existing epoll ioctl, it is more natural to avoid this path for one
main reason:
An epoll context is linked to NAPI IDs as file descriptors are added;
this means any epoll context might suddenly be associated with a
different net_device if the application were to replace all existing
fds with fds from a different device. In this case, the scope of the
suspend timeout becomes unclear and many edge cases for both the user
application and the kernel are introduced
Only a single iteration through napi busy polling is needed for this
mechanism to work effectively. Since an important objective for this
mechanism is preserving cpu cycles, exactly one iteration of the napi
busy loop is invoked when busy_poll_usecs is set to 0.
~ Important call outs in the implementation
- Enabling per epoll-context preferred busy poll will now effectively
lead to a nonblocking iteration through napi_busy_loop, even when
busy_poll_usecs is 0. See patch 4.
- Patches apply cleanly on commit 160a810b2a85 ("net: vxlan: update
the document for vxlan_snoop()").
~ Benchmark configs & descriptions
The changes were benchmarked with memcached [3] using the benchmarking
tool mutilate [4].
To facilitate benchmarking, a small patch [5] was applied to memcached
1.6.29 to allow setting per-epoll context preferred busy poll and other
settings via environment variables. Another small patch [6] was applied
to libevent to enable full busy-polling.
Multiple scenarios were benchmarked as described below and the scripts
used for producing these results can be found on github [7] (note: all
scenarios use NAPI-based traffic splitting via SO_INCOMING_ID by passing
-N to memcached):
- base:
- no other options enabled
- deferX:
- set defer_hard_irqs to 100
- set gro_flush_timeout to X,000
- napibusy:
- set defer_hard_irqs to 100
- set gro_flush_timeout to 200,000
- enable busy poll via the existing ioctl (busy_poll_usecs = 64,
busy_poll_budget = 64, prefer_busy_poll = true)
- fullbusy:
- set defer_hard_irqs to 100
- set gro_flush_timeout to 5,000,000
- enable busy poll via the existing ioctl (busy_poll_usecs = 1000,
busy_poll_budget = 64, prefer_busy_poll = true)
- change memcached's nonblocking epoll_wait invocation (via
libevent) to using a 1 ms timeout
- suspendX:
- set defer_hard_irqs to 100
- set gro_flush_timeout to X,000
- set irq_suspend_timeout to 20,000,000
- enable busy poll via the existing ioctl (busy_poll_usecs = 0,
busy_poll_budget = 64, prefer_busy_poll = true)
~ Benchmark results
Tested on:
Single socket AMD EPYC 7662 64-Core Processor
Hyperthreading disabled
4 NUMA Zones (NPS=4)
16 CPUs per NUMA zone (64 cores total)
2 x Dual port 100gbps Mellanox Technologies ConnectX-5 Ex EN NIC
The test machine is configured such that a single interface has 8 RX
queues. The queues' IRQs and memcached are pinned to CPUs that are
NUMA-local to the interface which is under test. The NIC's interrupt
coalescing configuration is left at boot-time defaults.
Results:
Results are shown below. The mechanism added by this series is
represented by the 'suspend' cases. Data presented shows a summary over
at least 10 runs of each test case [8] using the scripts on github [7].
For latency, the median is shown. For throughput and CPU utilization,
the average is shown.
The results also include cycles-per-query (cpq) and
instruction-per-query (ipq) metrics, following the methodology proposed
in [2], to augment the CPU utilization numbers, which could be skewed
due to frequency scaling. We find that this does not appear to be the
case as CPU utilization and low-level metrics show similar trends.
These results were captured using the scripts on github [7] to
illustrate how this approach compares with other pre-existing
mechanisms. This data is not to be interpreted as scientific data
captured in a fully isolated lab setting, but instead as best effort,
illustrative information comparing and contrasting tradeoffs.
The absolute QPS results are higher than our previous submission, but
the relative differences between variants are equivalent. Because the
patches have been rebased on 6.12, several factors have likely
influenced the overall performance. Most importantly, we had to switch
to a new set of basic kernel options, which has likely altered the
baseline performance. Because the overall comparison of variants still
holds, we have not attempted to recreate the exact set of kernel options
from the previous submission.
Compare:
- Throughput (MAX) and latencies of base vs suspend.
- CPU usage of napibusy and fullbusy during lower load (200K, 400K for
example) vs suspend.
- Latency of the defer variants vs suspend as timeout and load
increases.
The overall takeaway is that the suspend variants provide a superior
combination of high throughput, low latency, and low cpu utilization
compared to all other variants. Each of the suspend variants works very
well, but some fine-tuning between latency and cpu utilization is still
possible by tuning the small timeout (gro_flush_timeout).
Note: we've reorganized the results to make comparison among testcases
with the same load easier.
testcase load qps avglat 95%lat 99%lat cpu cpq ipq
base 200K 200024 127 254 458 25 12748 11289
defer10 200K 199991 64 128 166 27 18763 16574
defer20 200K 199986 72 135 178 25 15405 14173
defer50 200K 200025 91 149 198 23 12275 12203
defer200 200K 199996 182 266 326 18 8595 9183
fullbusy 200K 200040 58 123 167 100 43641 23145
napibusy 200K 200009 115 244 299 56 24797 24693
suspend10 200K 200005 63 128 167 32 19559 17240
suspend20 200K 199952 69 132 170 29 16324 14838
suspend50 200K 200019 84 144 189 26 13106 12516
suspend200 200K 199978 168 264 326 20 9331 9643
testcase load qps avglat 95%lat 99%lat cpu cpq ipq
base 400K 400017 157 292 762 39 9287 9325
defer10 400K 400033 71 141 204 53 13950 12943
defer20 400K 399935 79 150 212 47 12027 11673
defer50 400K 399888 101 171 231 39 9556 9921
defer200 400K 399993 200 287 358 32 7428 8576
fullbusy 400K 400018 63 132 203 100 21827 16062
napibusy 400K 399970 89 230 292 83 18156 16508
suspend10 400K 400061 69 139 202 54 13576 13057
suspend20 400K 399988 73 144 206 49 11930 11773
suspend50 400K 399975 88 161 218 42 9996 10270
suspend200 400K 399954 172 276 353 34 7847 8713
testcase load qps avglat 95%lat 99%lat cpu cpq ipq
base 600K 600031 166 289 631 61 9188 8787
defer10 600K 599967 85 167 262 75 11833 10947
defer20 600K 599888 89 165 243 66 10513 10362
defer50 600K 600072 109 185 253 55 8664 9190
defer200 600K 599951 222 315 393 45 6892 8213
fullbusy 600K 600041 69 145 227 100 14549 13936
napibusy 600K 599980 79 188 280 96 13927 14155
suspend10 600K 600028 78 159 267 69 10877 11032
suspend20 600K 600026 81 159 254 64 9922 10320
suspend50 600K 600007 96 178 258 57 8681 9331
suspend200 600K 599964 177 295 369 47 7115 8366
testcase load qps avglat 95%lat 99%lat cpu cpq ipq
base 800K 800034 198 329 698 84 9366 8338
defer10 800K 799718 243 642 1457 95 10532 9007
defer20 800K 800009 132 245 399 89 9956 8979
defer50 800K 800024 136 228 378 80 9002 8598
defer200 800K 799965 255 362 473 66 7481 8147
fullbusy 800K 799927 78 157 253 100 10915 12533
napibusy 800K 799870 81 173 273 99 10826 12532
suspend10 800K 799991 84 167 269 83 9380 9802
suspend20 800K 799979 90 172 290 78 8765 9404
suspend50 800K 800031 106 191 307 71 7945 8805
suspend200 800K 799905 182 307 411 62 6985 8242
testcase load qps avglat 95%lat 99%lat cpu cpq ipq
base 1000K 919543 3805 6390 14229 98 9324 7978
defer10 1000K 850751 4574 7382 15370 99 10218 8470
defer20 1000K 890296 4736 6862 14858 99 9708 8277
defer50 1000K 932694 3463 6180 13251 97 9148 8053
defer200 1000K 951311 3524 6052 13599 96 8875 7845
fullbusy 1000K 1000011 90 181 278 100 8731 10686
napibusy 1000K 1000050 93 184 280 100 8721 10547
suspend10 1000K 999962 101 193 306 92 8138 8980
suspend20 1000K 1000030 103 191 324 88 7844 8763
suspend50 1000K 1000001 114 202 320 83 7396 8431
suspend200 1000K 999965 185 314 428 76 6733 8072
testcase load qps avglat 95%lat 99%lat cpu cpq ipq
base MAX 1005592 4651 6594 14979 100 8679 7918
defer10 MAX 928204 5106 7286 15199 100 9398 8380
defer20 MAX 984663 4774 6518 14920 100 8861 8063
defer50 MAX 1044099 4431 6368 14652 100 8350 7948
defer200 MAX 1040451 4423 6610 14674 100 8380 7931
fullbusy MAX 1236608 3715 3987 12805 100 7051 7936
napibusy MAX 1077516 4345 10155 15957 100 8080 7842
suspend10 MAX 1218344 3760 3990 12585 100 7150 7935
suspend20 MAX 1220056 3752 4053 12602 100 7150 7961
suspend50 MAX 1213666 3791 4103 12919 100 7183 7959
suspend200 MAX 1217411 3768 3988 12863 100 7161 7954
~ FAQ
- Why is a new parameter needed? Does irq_suspend_timeout override
gro_flush_timeout?
Using the suspend mechanism causes the system to alternate between
polling mode and irq-driven packet delivery. During busy periods,
irq_suspend_timeout overrides gro_flush_timeout and keeps the system
busy polling, but when epoll finds no events, the setting of
gro_flush_timeout and napi_defer_hard_irqs determine the next step.
There are essentially three possible loops for network processing and
packet delivery:
1) hardirq -> softirq -> napi poll; basic interrupt delivery
2) timer -> softirq -> napi poll; deferred irq processing
3) epoll -> busy-poll -> napi poll; busy looping
Loop 2) can take control from Loop 1), if gro_flush_timeout and
napi_defer_hard_irqs are set.
If gro_flush_timeout and napi_defer_hard_irqs are set, Loops 2) and
3) "wrestle" with each other for control. During busy periods,
irq_suspend_timeout is used as timer in Loop 2), which essentially
tilts this in favour of Loop 3).
If gro_flush_timeout and napi_defer_hard_irqs are not set, Loop 3)
cannot take control from Loop 1).
Therefore, setting gro_flush_timeout and napi_defer_hard_irqs is the
recommended usage, because otherwise setting irq_suspend_timeout
might not have any discernible effect.
We ran experiments with these parameters set to zero and the results
are as expected and essentially the same as the base case.
- Can the new timeout value be threaded through the new epoll ioctl ?
Only with difficulty. The epoll ioctl sets options on an epoll
context and the NAPI ID associated with an epoll context can change
based on what file descriptors a user app adds to the epoll context.
This would introduce complexity in the API from the user perspective
and also complexity in the kernel.
- Can irq suspend be built by combining NIC coalescing and
gro_flush_timeout ?
No. The problem is that the long timeout must engage if and only if
prefer-busy is active.
When using NIC coalescing for the short timeout (without
napi_defer_hard_irqs/gro_flush_timeout), an interrupt after an idle
period will trigger softirq, which will run napi polling. At this
point, prefer-busy is not active, so NIC interrupts would be
re-enabled. Then it is not possible for the longer timeout to
interject to switch control back to polling. In other words, only by
using the software timer for the short timeout, it is possible to
extend the timeout without having to reprogram the NIC timer or
reach down directly and disable interrupts.
Using gro_flush_timeout for the long timeout also has problems, for
the same underlying reason. In the current napi implementation,
gro_flush_timeout is not tied to prefer-busy. We'd either have to
change that and in the process modify the existing deferral
mechanism, or introduce a state variable to determine whether
gro_flush_timeout is used as long timeout for irq suspend or whether
it is used for its default purpose. In an earlier version, we did
try something similar to the latter and made it work, but it ends up
being a lot more convoluted than our current proposal.
- Isn't it already possible to combine busy looping with irq deferral?
Yes, in fact enabling irq deferral via napi_defer_hard_irqs and
gro_flush_timeout is a precondition for prefer_busy_poll to have an
effect. If the application also uses a tight busy loop with
essentially nonblocking epoll_wait (accomplished with a very short
timeout parameter), this is the fullbusy case shown in the results.
An application using blocking epoll_wait is shown as the napibusy
case in the result. It's a hybrid approach that provides limited
latency benefits compared to the base case and plain irq deferral,
but not as good as fullbusy or suspend.
~ Special thanks
Several people were involved in earlier stages of the development of this
mechanism whom we'd like to thank:
- Peter Cai (CC'd), for the initial kernel patch and his contributions
to the paper.
- Mohammadamin Shafie (CC'd), for testing various versions of the kernel
patch and providing helpful feedback.
Thanks,
Martin and Joe
[1]: https://lore.kernel.org/netdev/20240812125717.413108-1-jdamato@fastly.com/
[2]: https://doi.org/10.1145/3626780
[3]: https://github.com/memcached/memcached/blob/master/doc/napi_ids.txt
[4]: https://github.com/leverich/mutilate
[5]: https://raw.githubusercontent.com/martinkarsten/irqsuspend/main/patches/mem…
[6]: https://raw.githubusercontent.com/martinkarsten/irqsuspend/main/patches/lib…
[7]: https://github.com/martinkarsten/irqsuspend
[8]: https://github.com/martinkarsten/irqsuspend/tree/main/results
v4:
- Added a new FAQ item to cover letter.
- Updated patch 6 to use socat instead of nc in busy_poll_test.sh and
updated busy_poller.c to use netlink directly to configure napi
params.
- Updated the kernel documentation in patch 7 to include more details.
- Dropped Stanislav's Acked-by and Bagas' Reviewed-by from patch 7
since the documentation was updated.
v3:
- Added Stanislav Fomichev's Acked-by to every patch except the newly
added selftest.
- Added Bagas Sanjaya's Reviewed-by to the documentation patch.
- Fixed the commit message of patch 2 to remove a reference to the now
non-existent sysfs setting.
- Added a self test which tests both "regular" busy poll and busy poll
with suspend enabled. This was added as patch 6 as requested by
Paolo. netdevsim was chosen instead of veth due to netdevsim's
pre-existing support for netdev-genl. See the commit message of
patch 6 for more details.
v2: https://lore.kernel.org/bpf/20241021015311.95468-1-jdamato@fastly.com/
- Cover letter updated, including a re-run of test data.
- Patch 1 rewritten to use netdev-genl instead of sysfs.
- Patch 3 updated with a comment added to napi_resume_irqs.
- Patch 4 rebased to apply now that commit b9ca079dd6b0 ("eventpoll:
Annotate data-race of busy_poll_usecs") has been picked up from VFS.
- Patch 6 updated the kernel documentation.
rfc -> v1:
- Cover letter updated to include more details.
- Patch 1 updated to remove the documentation added. This was moved to
patch 6 with the rest of the docs (see below).
- Patch 5 updated to fix an error uncovered by the kernel build robot.
See patch 5's changelog for more details.
- Patch 6 added which updates kernel documentation.
Joe Damato (2):
selftests: net: Add busy_poll_test
docs: networking: Describe irq suspension
Martin Karsten (5):
net: Add napi_struct parameter irq_suspend_timeout
net: Suspend softirq when prefer_busy_poll is set
net: Add control functions for irq suspension
eventpoll: Trigger napi_busy_loop, if prefer_busy_poll is set
eventpoll: Control irq suspension for prefer_busy_poll
Documentation/netlink/specs/netdev.yaml | 7 +
Documentation/networking/napi.rst | 176 +++++++++-
fs/eventpoll.c | 35 +-
include/linux/netdevice.h | 2 +
include/net/busy_poll.h | 3 +
include/uapi/linux/netdev.h | 1 +
net/core/dev.c | 58 +++-
net/core/dev.h | 25 ++
net/core/netdev-genl-gen.c | 5 +-
net/core/netdev-genl.c | 12 +
tools/include/uapi/linux/netdev.h | 1 +
tools/testing/selftests/net/.gitignore | 1 +
tools/testing/selftests/net/Makefile | 3 +-
tools/testing/selftests/net/busy_poll_test.sh | 164 +++++++++
tools/testing/selftests/net/busy_poller.c | 328 ++++++++++++++++++
15 files changed, 810 insertions(+), 11 deletions(-)
create mode 100755 tools/testing/selftests/net/busy_poll_test.sh
create mode 100644 tools/testing/selftests/net/busy_poller.c
base-commit: dbb9a7ef347828870df3e5e6ddf19469a3277fc9
--
2.25.1
Greetings:
Welcome to v3, see changelog below. Note that our performance tests were
not re-run for this revision as we only updated a commit message and
added a selftest.
This series introduces a new mechanism, IRQ suspension, which allows
network applications using epoll to mask IRQs during periods of high
traffic while also reducing tail latency (compared to existing
mechanisms, see below) during periods of low traffic. In doing so, this
balances CPU consumption with network processing efficiency.
Martin Karsten (CC'd) and I have been collaborating on this series for
several months and have appreciated the feedback from the community on
our RFC [1]. We've updated the cover letter and kernel documentation in
an attempt to more clearly explain how this mechanism works, how
applications can use it, and how it compares to existing mechanisms in
the kernel. We've added an additional test case, 'fullbusy', achieved by
modifying libevent for comparison. See below for a detailed description,
link to the patch, and test results.
I briefly mentioned this idea at netdev conf 2024 (for those who were
there) and Martin described this idea in an earlier paper presented at
Sigmetrics 2024 [2].
~ The short explanation (TL;DR)
We propose adding a new napi config parameter: irq_suspend_timeout to
help balance CPU usage and network processing efficiency when using IRQ
deferral and napi busy poll.
If this parameter is set to a non-zero value *and* a user application
has enabled preferred busy poll on a busy poll context (via the
EPIOCSPARAMS ioctl introduced in commit 18e2bf0edf4d ("eventpoll: Add
epoll ioctl for epoll_params")), then application calls to epoll_wait
for that context will cause device IRQs and softirq processing to be
suspended as long as epoll_wait successfully retrieves data from the
NAPI. Each time data is retrieved, the irq_suspend_timeout is deferred.
If/when network traffic subsides and epoll_wait returns no data, IRQ
suspension is immediately reverted back to the existing
napi_defer_hard_irqs and gro_flush_timeout mechanism which was
introduced in commit 6f8b12d661d0 ("net: napi: add hard irqs deferral
feature")).
The irq_suspend_timeout serves as a safety mechanism. If userland takes
a long time processing data, irq_suspend_timeout will fire and restart
normal NAPI processing.
For a more in depth explanation, please continue reading.
~ Comparison with existing mechanisms
Interrupt mitigation can be accomplished in napi software, by setting
napi_defer_hard_irqs and gro_flush_timeout, or via interrupt coalescing
in the NIC. This can be quite efficient, but in both cases, a fixed
timeout (or packet count) needs to be configured. However, a fixed
timeout cannot effectively support both low- and high-load situations:
At low load, an application typically processes a few requests and then
waits to receive more input data. In this scenario, a large timeout will
cause unnecessary latency.
At high load, an application typically processes many requests before
being ready to receive more input data. In this case, a small timeout
will likely fire prematurely and trigger irq/softirq processing, which
interferes with the application's execution. This causes overhead, most
likely due to cache contention.
While NICs attempt to provide adaptive interrupt coalescing schemes,
these cannot properly take into account application-level processing.
An alternative packet delivery mechanism is busy-polling, which results
in perfect alignment of application processing and network polling. It
delivers optimal performance (throughput and latency), but results in
100% cpu utilization and is thus inefficient for below-capacity
workloads.
We propose to add a new packet delivery mode that properly alternates
between busy polling and interrupt-based delivery depending on busy and
idle periods of the application. During a busy period, the system
operates in busy-polling mode, which avoids interference. During an idle
period, the system falls back to interrupt deferral, but with a small
timeout to avoid excessive latencies. This delivery mode can also be
viewed as an extension of basic interrupt deferral, but alternating
between a small and a very large timeout.
This delivery mode is efficient, because it avoids softirq execution
interfering with application processing during busy periods. It can be
used with blocking epoll_wait to conserve cpu cycles during idle
periods. The effect of alternating between busy and idle periods is that
performance (throughput and latency) is very close to full busy polling,
while cpu utilization is lower and very close to interrupt mitigation.
~ Usage details
IRQ suspension is introduced via a per-NAPI configuration parameter that
controls the maximum time that IRQs can be suspended.
Here's how it is intended to work:
- The user application (or system administrator) uses the netdev-genl
netlink interface to set the pre-existing napi_defer_hard_irqs and
gro_flush_timeout NAPI config parameters to enable IRQ deferral.
- The user application (or system administrator) sets the proposed
irq_suspend_timeout parameter via the netdev-genl netlink interface
to a larger value than gro_flush_timeout to enable IRQ suspension.
- The user application issues the existing epoll ioctl to set the
prefer_busy_poll flag on the epoll context.
- The user application then calls epoll_wait to busy poll for network
events, as it normally would.
- If epoll_wait returns events to userland, IRQs are suspended for the
duration of irq_suspend_timeout.
- If epoll_wait finds no events and the thread is about to go to
sleep, IRQ handling using napi_defer_hard_irqs and gro_flush_timeout
is resumed.
As long as epoll_wait is retrieving events, IRQs (and softirq
processing) for the NAPI being polled remain disabled. When network
traffic reduces, eventually a busy poll loop in the kernel will retrieve
no data. When this occurs, regular IRQ deferral using gro_flush_timeout
for the polled NAPI is re-enabled.
Unless IRQ suspension is continued by subsequent calls to epoll_wait, it
automatically times out after the irq_suspend_timeout timer expires.
Regular deferral is also immediately re-enabled when the epoll context
is destroyed.
~ Usage scenario
The target scenario for IRQ suspension as packet delivery mode is a
system that runs a dominant application with substantial network I/O.
The target application can be configured to receive input data up to a
certain batch size (via epoll_wait maxevents parameter) and this batch
size determines the worst-case latency that application requests might
experience. Because packet delivery is suspended during the target
application's processing, the batch size also determines the worst-case
latency of concurrent applications using the same RX queue(s).
gro_flush_timeout should be set as small as possible, but large enough to
make sure that a single request is likely not being interfered with.
irq_suspend_timeout is largely a safety mechanism against misbehaving
applications. It should be set large enough to cover the processing of an
entire application batch, i.e., the factor between gro_flush_timeout and
irq_suspend_timeout should roughly correspond to the maximum batch size
that the target application would process in one go.
~ Design rationale
The implementation of the IRQ suspension mechanism very nicely dovetails
with the existing mechanism for IRQ deferral when preferred busy poll is
enabled (introduced in commit 7fd3253a7de6 ("net: Introduce preferred
busy-polling"), see that commit message for more details).
While it would be possible to inject the suspend timeout via
the existing epoll ioctl, it is more natural to avoid this path for one
main reason:
An epoll context is linked to NAPI IDs as file descriptors are added;
this means any epoll context might suddenly be associated with a
different net_device if the application were to replace all existing
fds with fds from a different device. In this case, the scope of the
suspend timeout becomes unclear and many edge cases for both the user
application and the kernel are introduced
Only a single iteration through napi busy polling is needed for this
mechanism to work effectively. Since an important objective for this
mechanism is preserving cpu cycles, exactly one iteration of the napi
busy loop is invoked when busy_poll_usecs is set to 0.
~ Important call outs in the implementation
- Enabling per epoll-context preferred busy poll will now effectively
lead to a nonblocking iteration through napi_busy_loop, even when
busy_poll_usecs is 0. See patch 4.
- Patches apply cleanly on commit 160a810b2a85 ("net: vxlan: update
the document for vxlan_snoop()").
~ Benchmark configs & descriptions
The changes were benchmarked with memcached [3] using the benchmarking
tool mutilate [4].
To facilitate benchmarking, a small patch [5] was applied to memcached
1.6.29 to allow setting per-epoll context preferred busy poll and other
settings via environment variables. Another small patch [6] was applied
to libevent to enable full busy-polling.
Multiple scenarios were benchmarked as described below and the scripts
used for producing these results can be found on github [7] (note: all
scenarios use NAPI-based traffic splitting via SO_INCOMING_ID by passing
-N to memcached):
- base:
- no other options enabled
- deferX:
- set defer_hard_irqs to 100
- set gro_flush_timeout to X,000
- napibusy:
- set defer_hard_irqs to 100
- set gro_flush_timeout to 200,000
- enable busy poll via the existing ioctl (busy_poll_usecs = 64,
busy_poll_budget = 64, prefer_busy_poll = true)
- fullbusy:
- set defer_hard_irqs to 100
- set gro_flush_timeout to 5,000,000
- enable busy poll via the existing ioctl (busy_poll_usecs = 1000,
busy_poll_budget = 64, prefer_busy_poll = true)
- change memcached's nonblocking epoll_wait invocation (via
libevent) to using a 1 ms timeout
- suspendX:
- set defer_hard_irqs to 100
- set gro_flush_timeout to X,000
- set irq_suspend_timeout to 20,000,000
- enable busy poll via the existing ioctl (busy_poll_usecs = 0,
busy_poll_budget = 64, prefer_busy_poll = true)
~ Benchmark results
Tested on:
Single socket AMD EPYC 7662 64-Core Processor
Hyperthreading disabled
4 NUMA Zones (NPS=4)
16 CPUs per NUMA zone (64 cores total)
2 x Dual port 100gbps Mellanox Technologies ConnectX-5 Ex EN NIC
The test machine is configured such that a single interface has 8 RX
queues. The queues' IRQs and memcached are pinned to CPUs that are
NUMA-local to the interface which is under test. The NIC's interrupt
coalescing configuration is left at boot-time defaults.
Results:
Results are shown below. The mechanism added by this series is
represented by the 'suspend' cases. Data presented shows a summary over
at least 10 runs of each test case [8] using the scripts on github [7].
For latency, the median is shown. For throughput and CPU utilization,
the average is shown.
The results also include cycles-per-query (cpq) and
instruction-per-query (ipq) metrics, following the methodology proposed
in [2], to augment the CPU utilization numbers, which could be skewed
due to frequency scaling. We find that this does not appear to be the
case as CPU utilization and low-level metrics show similar trends.
These results were captured using the scripts on github [7] to
illustrate how this approach compares with other pre-existing
mechanisms. This data is not to be interpreted as scientific data
captured in a fully isolated lab setting, but instead as best effort,
illustrative information comparing and contrasting tradeoffs.
The absolute QPS results are higher than our previous submission, but
the relative differences between variants are equivalent. Because the
patches have been rebased on 6.12, several factors have likely
influenced the overall performance. Most importantly, we had to switch
to a new set of basic kernel options, which has likely altered the
baseline performance. Because the overall comparison of variants still
holds, we have not attempted to recreate the exact set of kernel options
from the previous submission.
Compare:
- Throughput (MAX) and latencies of base vs suspend.
- CPU usage of napibusy and fullbusy during lower load (200K, 400K for
example) vs suspend.
- Latency of the defer variants vs suspend as timeout and load
increases.
The overall takeaway is that the suspend variants provide a superior
combination of high throughput, low latency, and low cpu utilization
compared to all other variants. Each of the suspend variants works very
well, but some fine-tuning between latency and cpu utilization is still
possible by tuning the small timeout (gro_flush_timeout).
Note: we've reorganized the results to make comparison among testcases
with the same load easier.
testcase load qps avglat 95%lat 99%lat cpu cpq ipq
base 200K 200024 127 254 458 25 12748 11289
defer10 200K 199991 64 128 166 27 18763 16574
defer20 200K 199986 72 135 178 25 15405 14173
defer50 200K 200025 91 149 198 23 12275 12203
defer200 200K 199996 182 266 326 18 8595 9183
fullbusy 200K 200040 58 123 167 100 43641 23145
napibusy 200K 200009 115 244 299 56 24797 24693
suspend10 200K 200005 63 128 167 32 19559 17240
suspend20 200K 199952 69 132 170 29 16324 14838
suspend50 200K 200019 84 144 189 26 13106 12516
suspend200 200K 199978 168 264 326 20 9331 9643
testcase load qps avglat 95%lat 99%lat cpu cpq ipq
base 400K 400017 157 292 762 39 9287 9325
defer10 400K 400033 71 141 204 53 13950 12943
defer20 400K 399935 79 150 212 47 12027 11673
defer50 400K 399888 101 171 231 39 9556 9921
defer200 400K 399993 200 287 358 32 7428 8576
fullbusy 400K 400018 63 132 203 100 21827 16062
napibusy 400K 399970 89 230 292 83 18156 16508
suspend10 400K 400061 69 139 202 54 13576 13057
suspend20 400K 399988 73 144 206 49 11930 11773
suspend50 400K 399975 88 161 218 42 9996 10270
suspend200 400K 399954 172 276 353 34 7847 8713
testcase load qps avglat 95%lat 99%lat cpu cpq ipq
base 600K 600031 166 289 631 61 9188 8787
defer10 600K 599967 85 167 262 75 11833 10947
defer20 600K 599888 89 165 243 66 10513 10362
defer50 600K 600072 109 185 253 55 8664 9190
defer200 600K 599951 222 315 393 45 6892 8213
fullbusy 600K 600041 69 145 227 100 14549 13936
napibusy 600K 599980 79 188 280 96 13927 14155
suspend10 600K 600028 78 159 267 69 10877 11032
suspend20 600K 600026 81 159 254 64 9922 10320
suspend50 600K 600007 96 178 258 57 8681 9331
suspend200 600K 599964 177 295 369 47 7115 8366
testcase load qps avglat 95%lat 99%lat cpu cpq ipq
base 800K 800034 198 329 698 84 9366 8338
defer10 800K 799718 243 642 1457 95 10532 9007
defer20 800K 800009 132 245 399 89 9956 8979
defer50 800K 800024 136 228 378 80 9002 8598
defer200 800K 799965 255 362 473 66 7481 8147
fullbusy 800K 799927 78 157 253 100 10915 12533
napibusy 800K 799870 81 173 273 99 10826 12532
suspend10 800K 799991 84 167 269 83 9380 9802
suspend20 800K 799979 90 172 290 78 8765 9404
suspend50 800K 800031 106 191 307 71 7945 8805
suspend200 800K 799905 182 307 411 62 6985 8242
testcase load qps avglat 95%lat 99%lat cpu cpq ipq
base 1000K 919543 3805 6390 14229 98 9324 7978
defer10 1000K 850751 4574 7382 15370 99 10218 8470
defer20 1000K 890296 4736 6862 14858 99 9708 8277
defer50 1000K 932694 3463 6180 13251 97 9148 8053
defer200 1000K 951311 3524 6052 13599 96 8875 7845
fullbusy 1000K 1000011 90 181 278 100 8731 10686
napibusy 1000K 1000050 93 184 280 100 8721 10547
suspend10 1000K 999962 101 193 306 92 8138 8980
suspend20 1000K 1000030 103 191 324 88 7844 8763
suspend50 1000K 1000001 114 202 320 83 7396 8431
suspend200 1000K 999965 185 314 428 76 6733 8072
testcase load qps avglat 95%lat 99%lat cpu cpq ipq
base MAX 1005592 4651 6594 14979 100 8679 7918
defer10 MAX 928204 5106 7286 15199 100 9398 8380
defer20 MAX 984663 4774 6518 14920 100 8861 8063
defer50 MAX 1044099 4431 6368 14652 100 8350 7948
defer200 MAX 1040451 4423 6610 14674 100 8380 7931
fullbusy MAX 1236608 3715 3987 12805 100 7051 7936
napibusy MAX 1077516 4345 10155 15957 100 8080 7842
suspend10 MAX 1218344 3760 3990 12585 100 7150 7935
suspend20 MAX 1220056 3752 4053 12602 100 7150 7961
suspend50 MAX 1213666 3791 4103 12919 100 7183 7959
suspend200 MAX 1217411 3768 3988 12863 100 7161 7954
~ FAQ
- Can the new timeout value be threaded through the new epoll ioctl ?
Only with difficulty. The epoll ioctl sets options on an epoll
context and the NAPI ID associated with an epoll context can change
based on what file descriptors a user app adds to the epoll context.
This would introduce complexity in the API from the user perspective
and also complexity in the kernel.
- Can irq suspend be built by combining NIC coalescing and
gro_flush_timeout ?
No. The problem is that the long timeout must engage if and only if
prefer-busy is active.
When using NIC coalescing for the short timeout (without
napi_defer_hard_irqs/gro_flush_timeout), an interrupt after an idle
period will trigger softirq, which will run napi polling. At this
point, prefer-busy is not active, so NIC interrupts would be
re-enabled. Then it is not possible for the longer timeout to
interject to switch control back to polling. In other words, only by
using the software timer for the short timeout, it is possible to
extend the timeout without having to reprogram the NIC timer or
reach down directly and disable interrupts.
Using gro_flush_timeout for the long timeout also has problems, for
the same underlying reason. In the current napi implementation,
gro_flush_timeout is not tied to prefer-busy. We'd either have to
change that and in the process modify the existing deferral
mechanism, or introduce a state variable to determine whether
gro_flush_timeout is used as long timeout for irq suspend or whether
it is used for its default purpose. In an earlier version, we did
try something similar to the latter and made it work, but it ends up
being a lot more convoluted than our current proposal.
- Isn't it already possible to combine busy looping with irq deferral?
Yes, in fact enabling irq deferral via napi_defer_hard_irqs and
gro_flush_timeout is a precondition for prefer_busy_poll to have an
effect. If the application also uses a tight busy loop with
essentially nonblocking epoll_wait (accomplished with a very short
timeout parameter), this is the fullbusy case shown in the results.
An application using blocking epoll_wait is shown as the napibusy
case in the result. It's a hybrid approach that provides limited
latency benefits compared to the base case and plain irq deferral,
but not as good as fullbusy or suspend.
~ Special thanks
Several people were involved in earlier stages of the development of this
mechanism whom we'd like to thank:
- Peter Cai (CC'd), for the initial kernel patch and his contributions
to the paper.
- Mohammadamin Shafie (CC'd), for testing various versions of the kernel
patch and providing helpful feedback.
Thanks,
Martin and Joe
[1]: https://lore.kernel.org/netdev/20240812125717.413108-1-jdamato@fastly.com/
[2]: https://doi.org/10.1145/3626780
[3]: https://github.com/memcached/memcached/blob/master/doc/napi_ids.txt
[4]: https://github.com/leverich/mutilate
[5]: https://raw.githubusercontent.com/martinkarsten/irqsuspend/main/patches/mem…
[6]: https://raw.githubusercontent.com/martinkarsten/irqsuspend/main/patches/lib…
[7]: https://github.com/martinkarsten/irqsuspend
[8]: https://github.com/martinkarsten/irqsuspend/tree/main/results
v3:
- Added Stanislav Fomichev's Acked-by to every patch except the newly
added selftest.
- Added Bagas Sanjaya's Reviewed-by to the documentation patch.
- Fixed the commit message of patch 2 to remove a reference to the now
non-existent sysfs setting.
- Added a self test which tests both "regular" busy poll and busy poll
with suspend enabled. This was added as patch 6 as requested by
Paolo. netdevsim was chosen instead of veth due to netdevsim's
pre-existing support for netdev-genl. See the commit message of
patch 6 for more details.
v2: https://lore.kernel.org/bpf/20241021015311.95468-1-jdamato@fastly.com/
- Cover letter updated, including a re-run of test data.
- Patch 1 rewritten to use netdev-genl instead of sysfs.
- Patch 3 updated with a comment added to napi_resume_irqs.
- Patch 4 rebased to apply now that commit b9ca079dd6b0 ("eventpoll:
Annotate data-race of busy_poll_usecs") has been picked up from VFS.
- Patch 6 updated the kernel documentation.
rfc -> v1:
- Cover letter updated to include more details.
- Patch 1 updated to remove the documentation added. This was moved to
patch 6 with the rest of the docs (see below).
- Patch 5 updated to fix an error uncovered by the kernel build robot.
See patch 5's changelog for more details.
- Patch 6 added which updates kernel documentation.
Joe Damato (2):
selftests: net: Add busy_poll_test
docs: networking: Describe irq suspension
Martin Karsten (5):
net: Add napi_struct parameter irq_suspend_timeout
net: Suspend softirq when prefer_busy_poll is set
net: Add control functions for irq suspension
eventpoll: Trigger napi_busy_loop, if prefer_busy_poll is set
eventpoll: Control irq suspension for prefer_busy_poll
Documentation/netlink/specs/netdev.yaml | 7 +
Documentation/networking/napi.rst | 132 +++++++++-
fs/eventpoll.c | 35 ++-
include/linux/netdevice.h | 2 +
include/net/busy_poll.h | 3 +
include/uapi/linux/netdev.h | 1 +
net/core/dev.c | 58 ++++-
net/core/dev.h | 25 ++
net/core/netdev-genl-gen.c | 5 +-
net/core/netdev-genl.c | 12 +
tools/include/uapi/linux/netdev.h | 1 +
tools/testing/selftests/net/.gitignore | 1 +
tools/testing/selftests/net/Makefile | 2 +
tools/testing/selftests/net/busy_poll_test.sh | 164 ++++++++++++
tools/testing/selftests/net/busy_poller.c | 245 ++++++++++++++++++
15 files changed, 683 insertions(+), 10 deletions(-)
create mode 100755 tools/testing/selftests/net/busy_poll_test.sh
create mode 100644 tools/testing/selftests/net/busy_poller.c
base-commit: 9e114ec8084020e10e1cb7b43dbbf6e69940866b
--
2.25.1
This series was motivated by the regression fixed by 166bf8af9122
("pinctrl: mediatek: common-v2: Fix broken bias-disable for
PULL_PU_PD_RSEL_TYPE"). A bug was introduced in the pinctrl_paris driver
which prevented certain pins from having their bias configured.
Running this test on the mt8195-tomato platform with the test plan
included below[1] shows the test passing with the fix applied, but failing
without the fix:
With fix:
$ ./gpio-setget-config.py
TAP version 13
# Using test plan file: ./google,tomato.yaml
1..3
ok 1 pinctrl_paris.34.pull-up
ok 2 pinctrl_paris.34.pull-down
ok 3 pinctrl_paris.34.disabled
# Totals: pass:3 fail:0 xfail:0 xpass:0 skip:0 error:0
Without fix:
$ ./gpio-setget-config.py
TAP version 13
# Using test plan file: ./google,tomato.yaml
1..3
# Bias doesn't match: Expected pull-up, read pull-down.
not ok 1 pinctrl_paris.34.pull-up
ok 2 pinctrl_paris.34.pull-down
# Bias doesn't match: Expected disabled, read pull-down.
not ok 3 pinctrl_paris.34.disabled
# Totals: pass:1 fail:2 xfail:0 xpass:0 skip:0 error:0
In order to achieve this, the first three patches expose bias
configuration through the GPIO API in the MediaTek pinctrl drivers,
notably, pinctrl_paris, patch 4 extends the gpio-mockup-cdev utility for
use by patch 5, and patch 5 introduces a new GPIO kselftest that takes a
test plan in YAML, which can be tailored per-platform to specify the
configurations to test, and sets and gets back each pin configuration to
verify that they match and thus that the driver is behaving as expected.
Since the GPIO uAPI only allows setting the pin configuration, getting
it back is done through pinconf-pins in the pinctrl debugfs folder.
The test currently only verifies bias but it would be easy to extend to
verify other pin configurations.
The test plan YAML file can be customized for each use-case and is
platform-dependant. For that reason, only an example is included in
patch 3 and the user is supposed to provide their test plan. That said,
the aim is to collect test plans for ease of use at [2].
[1] This is the test plan used for mt8195-tomato:
- label: "pinctrl_paris"
tests:
# Pin 34 has type MTK_PULL_PU_PD_RSEL_TYPE and is unused.
# Setting bias to MTK_PULL_PU_PD_RSEL_TYPE pins was fixed by
# 166bf8af9122 ("pinctrl: mediatek: common-v2: Fix broken bias-disable for PULL_PU_PD_RSEL_TYPE")
- pin: 34
bias: "pull-up"
- pin: 34
bias: "pull-down"
- pin: 34
bias: "disabled"
[2] https://github.com/kernelci/platform-test-parameters
Signed-off-by: Nícolas F. R. A. Prado <nfraprado(a)collabora.com>
---
Changes in v2:
- Added patches 2 and 3 enabling the extra GPIO pin configurations on
the other mediatek drivers: pinctrl-moore and pinctrl-mtk-common
- Tweaked function name in patch 1:
mtk_pinconf_set -> mtk_paris_pin_config_set,
to make it clear it is not a pinconf_ops
- Adjusted commit message to make it clear the current support is
limited to pins supported by the EINT controller
- Link to v1: https://lore.kernel.org/r/20240909-kselftest-gpio-set-get-config-v1-0-16a06…
---
Nícolas F. R. A. Prado (5):
pinctrl: mediatek: paris: Expose more configurations to GPIO set_config
pinctrl: mediatek: moore: Expose more configurations to GPIO set_config
pinctrl: mediatek: common: Expose more configurations to GPIO set_config
selftest: gpio: Add wait flag to gpio-mockup-cdev
selftest: gpio: Add a new set-get config test
drivers/pinctrl/mediatek/pinctrl-moore.c | 283 +++++++++++----------
drivers/pinctrl/mediatek/pinctrl-mtk-common.c | 48 ++--
drivers/pinctrl/mediatek/pinctrl-paris.c | 26 +-
tools/testing/selftests/gpio/Makefile | 2 +-
tools/testing/selftests/gpio/gpio-mockup-cdev.c | 14 +-
.../gpio-set-get-config-example-test-plan.yaml | 15 ++
.../testing/selftests/gpio/gpio-set-get-config.py | 183 +++++++++++++
7 files changed, 395 insertions(+), 176 deletions(-)
---
base-commit: a39230ecf6b3057f5897bc4744a790070cfbe7a8
change-id: 20240906-kselftest-gpio-set-get-config-6e5bb670c1a5
Best regards,
--
Nícolas F. R. A. Prado <nfraprado(a)collabora.com>
Check number of paths by fib_info_num_path(),
and update_or_create_fnhe() for every path.
Problem is that pmtu is cached only for the oif
that has received icmp message "need to frag",
other oifs will still try to use "default" iface mtu.
V4:
- fix selftest, do route lookup before checking cached exceptions
V3:
- added selftest
- fixed compile error
V2:
- fix fib_info_num_path parameter pass
An example topology showing the problem:
| host1
+---------+
| dummy0 | 10.179.20.18/32 mtu9000
+---------+
+-----------+----------------+
+---------+ +---------+
| ens17f0 | 10.179.2.141/31 | ens17f1 | 10.179.2.13/31
+---------+ +---------+
| (all here have mtu 9000) |
+------+ +------+
| ro1 | 10.179.2.140/31 | ro2 | 10.179.2.12/31
+------+ +------+
| |
---------+------------+-------------------+------
|
+-----+
| ro3 | 10.10.10.10 mtu1500
+-----+
|
========================================
some networks
========================================
|
+-----+
| eth0| 10.10.30.30 mtu9000
+-----+
| host2
host1 have enabled multipath and
sysctl net.ipv4.fib_multipath_hash_policy = 1:
default proto static src 10.179.20.18
nexthop via 10.179.2.12 dev ens17f1 weight 1
nexthop via 10.179.2.140 dev ens17f0 weight 1
When host1 tries to do pmtud from 10.179.20.18/32 to host2,
host1 receives at ens17f1 iface an icmp packet from ro3 that ro3 mtu=1500.
And host1 caches it in nexthop exceptions cache.
Problem is that it is cached only for the iface that has received icmp,
and there is no way that ro3 will send icmp msg to host1 via another path.
Host1 now have this routes to host2:
ip r g 10.10.30.30 sport 30000 dport 443
10.10.30.30 via 10.179.2.12 dev ens17f1 src 10.179.20.18 uid 0
cache expires 521sec mtu 1500
ip r g 10.10.30.30 sport 30033 dport 443
10.10.30.30 via 10.179.2.140 dev ens17f0 src 10.179.20.18 uid 0
cache
So when host1 tries again to reach host2 with mtu>1500,
if packet flow is lucky enough to be hashed with oif=ens17f1 its ok,
if oif=ens17f0 it blackholes and still gets icmp msgs from ro3 to ens17f1,
until lucky day when ro3 will send it through another flow to ens17f0.
Signed-off-by: Vladimir Vdovin <deliran(a)verdict.gg>
---
net/ipv4/route.c | 13 +++++
tools/testing/selftests/net/pmtu.sh | 79 ++++++++++++++++++++++++++++-
2 files changed, 91 insertions(+), 1 deletion(-)
diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index 723ac9181558..41162b5cc4cb 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -1027,6 +1027,19 @@ static void __ip_rt_update_pmtu(struct rtable *rt, struct flowi4 *fl4, u32 mtu)
struct fib_nh_common *nhc;
fib_select_path(net, &res, fl4, NULL);
+#ifdef CONFIG_IP_ROUTE_MULTIPATH
+ if (fib_info_num_path(res.fi) > 1) {
+ int nhsel;
+
+ for (nhsel = 0; nhsel < fib_info_num_path(res.fi); nhsel++) {
+ nhc = fib_info_nhc(res.fi, nhsel);
+ update_or_create_fnhe(nhc, fl4->daddr, 0, mtu, lock,
+ jiffies + net->ipv4.ip_rt_mtu_expires);
+ }
+ rcu_read_unlock();
+ return;
+ }
+#endif /* CONFIG_IP_ROUTE_MULTIPATH */
nhc = FIB_RES_NHC(res);
update_or_create_fnhe(nhc, fl4->daddr, 0, mtu, lock,
jiffies + net->ipv4.ip_rt_mtu_expires);
diff --git a/tools/testing/selftests/net/pmtu.sh b/tools/testing/selftests/net/pmtu.sh
index 569bce8b6383..f7ced4c436fb 100755
--- a/tools/testing/selftests/net/pmtu.sh
+++ b/tools/testing/selftests/net/pmtu.sh
@@ -266,7 +266,8 @@ tests="
list_flush_ipv4_exception ipv4: list and flush cached exceptions 1
list_flush_ipv6_exception ipv6: list and flush cached exceptions 1
pmtu_ipv4_route_change ipv4: PMTU exception w/route replace 1
- pmtu_ipv6_route_change ipv6: PMTU exception w/route replace 1"
+ pmtu_ipv6_route_change ipv6: PMTU exception w/route replace 1
+ pmtu_ipv4_mp_exceptions ipv4: PMTU multipath nh exceptions 0"
# Addressing and routing for tests with routers: four network segments, with
# index SEGMENT between 1 and 4, a common prefix (PREFIX4 or PREFIX6) and an
@@ -2329,6 +2330,82 @@ test_pmtu_ipv6_route_change() {
test_pmtu_ipvX_route_change 6
}
+test_pmtu_ipv4_mp_exceptions() {
+ setup namespaces routing || return $ksft_skip
+
+ ip nexthop ls >/dev/null 2>&1
+ if [ $? -ne 0 ]; then
+ echo "Nexthop objects not supported; skipping tests"
+ exit $ksft_skip
+ fi
+
+ trace "${ns_a}" veth_A-R1 "${ns_r1}" veth_R1-A \
+ "${ns_r1}" veth_R1-B "${ns_b}" veth_B-R1 \
+ "${ns_a}" veth_A-R2 "${ns_r2}" veth_R2-A \
+ "${ns_r2}" veth_R2-B "${ns_b}" veth_B-R2
+
+ dummy0_a="192.168.99.99"
+ dummy0_b="192.168.88.88"
+
+ # Set up initial MTU values
+ mtu "${ns_a}" veth_A-R1 2000
+ mtu "${ns_r1}" veth_R1-A 2000
+ mtu "${ns_r1}" veth_R1-B 1500
+ mtu "${ns_b}" veth_B-R1 1500
+
+ mtu "${ns_a}" veth_A-R2 2000
+ mtu "${ns_r2}" veth_R2-A 2000
+ mtu "${ns_r2}" veth_R2-B 1500
+ mtu "${ns_b}" veth_B-R2 1500
+
+ fail=0
+
+ #Set up host A with multipath routes to host B dummy0_b
+ run_cmd ${ns_a} sysctl -q net.ipv4.fib_multipath_hash_policy=1
+ run_cmd ${ns_a} sysctl -q net.ipv4.ip_forward=1
+ run_cmd ${ns_a} ip link add dummy0 mtu 2000 type dummy
+ run_cmd ${ns_a} ip link set dummy0 up
+ run_cmd ${ns_a} ip addr add ${dummy0_a} dev dummy0
+ run_cmd ${ns_a} ip nexthop add id 201 via ${prefix4}.${a_r1}.2 dev veth_A-R1
+ run_cmd ${ns_a} ip nexthop add id 202 via ${prefix4}.${a_r2}.2 dev veth_A-R2
+ run_cmd ${ns_a} ip nexthop add id 203 group 201/202
+ run_cmd ${ns_a} ip route add ${dummy0_b} nhid 203
+
+ #Set up host B with multipath routes to host A dummy0_a
+ run_cmd ${ns_b} sysctl -q net.ipv4.fib_multipath_hash_policy=1
+ run_cmd ${ns_b} sysctl -q net.ipv4.ip_forward=1
+ run_cmd ${ns_b} ip link add dummy0 mtu 2000 type dummy
+ run_cmd ${ns_b} ip link set dummy0 up
+ run_cmd ${ns_b} ip addr add ${dummy0_b} dev dummy0
+ run_cmd ${ns_b} ip nexthop add id 201 via ${prefix4}.${b_r1}.2 dev veth_A-R1
+ run_cmd ${ns_b} ip nexthop add id 202 via ${prefix4}.${b_r2}.2 dev veth_A-R2
+ run_cmd ${ns_b} ip nexthop add id 203 group 201/202
+ run_cmd ${ns_b} ip route add ${dummy0_a} nhid 203
+
+ #Set up routers with routes to dummies
+ run_cmd ${ns_r1} ip route add ${dummy0_a} via ${prefix4}.${a_r1}.1
+ run_cmd ${ns_r2} ip route add ${dummy0_a} via ${prefix4}.${a_r2}.1
+ run_cmd ${ns_r1} ip route add ${dummy0_b} via ${prefix4}.${b_r1}.1
+ run_cmd ${ns_r2} ip route add ${dummy0_b} via ${prefix4}.${b_r2}.1
+
+
+ #Ping and expect two nexthop exceptions for two routes in nh group
+ run_cmd ${ns_a} ping -q -M want -i 0.1 -c 2 -s 1800 "${dummy0_b}"
+
+ #Do route lookup before checking cached exceptions
+ run_cmd ${ns_a} ip route get ${dummy0_b} oif veth_A-R1
+ run_cmd ${ns_a} ip route get ${dummy0_b} oif veth_A-R2
+
+ #Check cached exceptions
+ echo "$(ip -oneline route list cache)"
+ if [ "$(${ns_a} ip -oneline route list cache| grep mtu | wc -l)" -ne 2 ]; then
+ err " there are not enough cached exceptions"
+ fail=1
+ fi
+
+ return ${fail}
+}
+
usage() {
echo
echo "$0 [OPTIONS] [TEST]..."
base-commit: 66600fac7a984dea4ae095411f644770b2561ede
--
2.43.5
This series was originally written by José Expósito, and has been
modified and updated by Matt Gilbride and myself. The original version
can be found here:
https://github.com/Rust-for-Linux/linux/pull/950
Add support for writing KUnit tests in Rust. While Rust doctests are
already converted to KUnit tests and run, they're really better suited
for examples, rather than as first-class unit tests.
This series implements a series of direct Rust bindings for KUnit tests,
as well as a new macro which allows KUnit tests to be written using a
close variant of normal Rust unit test syntax. The only change required
is replacing '#[cfg(test)]' with '#[kunit_tests(kunit_test_suite_name)]'
An example test would look like:
#[kunit_tests(rust_kernel_hid_driver)]
mod tests {
use super::*;
use crate::{c_str, driver, hid, prelude::*};
use core::ptr;
struct SimpleTestDriver;
impl Driver for SimpleTestDriver {
type Data = ();
}
#[test]
fn rust_test_hid_driver_adapter() {
let mut hid = bindings::hid_driver::default();
let name = c_str!("SimpleTestDriver");
static MODULE: ThisModule = unsafe { ThisModule::from_ptr(ptr::null_mut()) };
let res = unsafe {
<hid::Adapter<SimpleTestDriver> as driver::DriverOps>::register(&mut hid, name, &MODULE)
};
assert_eq!(res, Err(ENODEV)); // The mock returns -19
}
}
Please give this a go, and make sure I haven't broken it! There's almost
certainly a lot of improvements which can be made -- and there's a fair
case to be made for replacing some of this with generated C code which
can use the C macros -- but this is hopefully an adequate implementation
for now, and the interface can (with luck) remain the same even if the
implementation changes.
A few small notable missing features:
- Attributes (like the speed of a test) are hardcoded to the default
value.
- Similarly, the module name attribute is hardcoded to NULL. In C, we
use the KBUILD_MODNAME macro, but I couldn't find a way to use this
from Rust which wasn't more ugly than just disabling it.
- Assertions are not automatically rewritten to use KUnit assertions.
---
Changes since v2:
https://lore.kernel.org/linux-kselftest/20241029092422.2884505-1-davidgow@g…
- Include missing rust/macros/kunit.rs file from v2. (Thanks Boqun!)
- The kunit_unsafe_test_suite!() macro will truncate the name of the
suite if it is too long. (Thanks Alice!)
- The proc macro now emits an error if the suite name is too long.
- We no longer needlessly use UnsafeCell<> in
kunit_unsafe_test_suite!(). (Thanks Alice!)
Changes since v1:
https://lore.kernel.org/lkml/20230720-rustbind-v1-0-c80db349e3b5@google.com…
- Rebase on top of the latest rust-next (commit 718c4069896c)
- Make kunit_case a const fn, rather than a macro (Thanks Boqun)
- As a result, the null terminator is now created with
kernel::kunit::kunit_case_null()
- Use the C kunit_get_current_test() function to implement
in_kunit_test(), rather than re-implementing it (less efficiently)
ourselves.
Changes since the GitHub PR:
- Rebased on top of kselftest/kunit
- Add const_mut_refs feature
This may conflict with https://lore.kernel.org/lkml/20230503090708.2524310-6-nmi@metaspace.dk/
- Add rust/macros/kunit.rs to the KUnit MAINTAINERS entry
---
José Expósito (3):
rust: kunit: add KUnit case and suite macros
rust: macros: add macro to easily run KUnit tests
rust: kunit: allow to know if we are in a test
MAINTAINERS | 1 +
rust/kernel/kunit.rs | 191 +++++++++++++++++++++++++++++++++++++++++++
rust/kernel/lib.rs | 1 +
rust/macros/kunit.rs | 153 ++++++++++++++++++++++++++++++++++
rust/macros/lib.rs | 29 +++++++
5 files changed, 375 insertions(+)
create mode 100644 rust/macros/kunit.rs
--
2.47.0.163.g1226f6d8fa-goog
```
readonly STATS="$(mktemp -p /tmp ns-XXXXXX)"
readonly BASE=`basename $STATS`
```
It could be a mistake to write to $BASE rather than $STATS, where $STATS
is used to save the NSTAT_HISTORY and it will be cleaned up before exit.
Although since we've been creating the wrong file this whole time and
everything worked, it's fine to remove these 2 lines completely
Cc: "David S. Miller" <davem(a)davemloft.net>
Cc: Eric Dumazet <edumazet(a)google.com>
Cc: Jakub Kicinski <kuba(a)kernel.org>
Cc: Paolo Abeni <pabeni(a)redhat.com>
Signed-off-by: Li Zhijian <lizhijian(a)fujitsu.com>
---
Cc: netdev(a)vger.kernel.org
---
V3:
Remove these 2 lines rather than fixing the filename
---
Hello,
Cover letter is here.
This patch set aims to make 'git status' clear after 'make' and 'make
run_tests' for kselftests.
---
V2: nothing change
Signed-off-by: Li Zhijian <lizhijian(a)fujitsu.com>
---
tools/testing/selftests/net/veth.sh | 2 --
1 file changed, 2 deletions(-)
diff --git a/tools/testing/selftests/net/veth.sh b/tools/testing/selftests/net/veth.sh
index 4f1edbafb946..6bb7dfaa30b6 100755
--- a/tools/testing/selftests/net/veth.sh
+++ b/tools/testing/selftests/net/veth.sh
@@ -46,8 +46,6 @@ create_ns() {
ip -n $BASE$ns addr add dev veth$ns $BM_NET_V4$ns/24
ip -n $BASE$ns addr add dev veth$ns $BM_NET_V6$ns/64 nodad
done
- echo "#kernel" > $BASE
- chmod go-rw $BASE
}
__chk_flag() {
--
2.44.0
This is the 8th version of the ovpn patchset.
Thanks Sergey for arguing regarding splitting PEER_SET into SET and NEW.
I decided to follow this suggestion as it makes the API and its return
value easier to work with.
Thanks Donald for the suggestions regarding the NL API - they have all
been implemented (unless I forgot some, but hopefully I did not).
Notable changes from v7:
* Netlink API adjustments:
** renamed NL API from OP_OBJ to OBJ_OP (i.e. from SET_PEER to PEER_SET)
** split PEER_SET from PEER_NEW for better clarity in case of error
** renamed NL API from NEW/DEL_IFACE to DEV_NEW/DEL
** converted all underscores to dashes in YML NL spec
** split sockaddr_remote attr into ipv4/6, port and v6_scope_id attrs
** split local_ip attr into local_ipv4 and local_ipv6 attrs
** turned keyconf into a root attribute (it was nested in peer before)
** made key_swap use a keyconf object rather than a peer for consistency
with key mgmt API
** created specific op for peer_del notification (peer_del_ntf)
** created specific op for key_swap notification (key_swap_ntf)
** allow user to update VPN IPv4/6 (peer is now rehashable)
** converted port attrs from u32 to u16 for better consistency with
userspace code
* added rtnl_ops .dellink implementation
* removed patch 2 as it's not needed anymore thanks to the point
above
* moved rtnl_ops .kind initialization to first patch
* updated MAINTAINERS file with Github tree and selftest folder
* wrapped long lines in selftest scripts
BONUS: used b4 for the first time to prepare the patchset and send it
Please note that patches previously reviewed by Andrew Lunn have
retained the Reviewed-by tag as they have been simply rebased without
any modification.
The latest code can also be found at:
https://github.com/OpenVPN/linux-kernel-ovpn
Thanks a lot!
Best Regards,
Antonio Quartulli
OpenVPN Inc.
---
Antonio Quartulli (24):
netlink: add NLA_POLICY_MAX_LEN macro
net: introduce OpenVPN Data Channel Offload (ovpn)
ovpn: add basic netlink support
ovpn: add basic interface creation/destruction/management routines
ovpn: implement interface creation/destruction via netlink
ovpn: keep carrier always on
ovpn: introduce the ovpn_peer object
ovpn: introduce the ovpn_socket object
ovpn: implement basic TX path (UDP)
ovpn: implement basic RX path (UDP)
ovpn: implement packet processing
ovpn: store tunnel and transport statistics
ovpn: implement TCP transport
ovpn: implement multi-peer support
ovpn: implement peer lookup logic
ovpn: implement keepalive mechanism
ovpn: add support for updating local UDP endpoint
ovpn: add support for peer floating
ovpn: implement peer add/dump/delete via netlink
ovpn: implement key add/del/swap via netlink
ovpn: kill key and notify userspace in case of IV exhaustion
ovpn: notify userspace when a peer is deleted
ovpn: add basic ethtool support
testing/selftest: add test tool and scripts for ovpn module
Documentation/netlink/specs/ovpn.yaml | 387 +++++
MAINTAINERS | 11 +
drivers/net/Kconfig | 15 +
drivers/net/Makefile | 1 +
drivers/net/ovpn/Makefile | 22 +
drivers/net/ovpn/bind.c | 54 +
drivers/net/ovpn/bind.h | 117 ++
drivers/net/ovpn/crypto.c | 172 ++
drivers/net/ovpn/crypto.h | 138 ++
drivers/net/ovpn/crypto_aead.c | 356 ++++
drivers/net/ovpn/crypto_aead.h | 31 +
drivers/net/ovpn/io.c | 459 ++++++
drivers/net/ovpn/io.h | 25 +
drivers/net/ovpn/main.c | 363 ++++
drivers/net/ovpn/main.h | 29 +
drivers/net/ovpn/netlink-gen.c | 224 +++
drivers/net/ovpn/netlink-gen.h | 42 +
drivers/net/ovpn/netlink.c | 1099 +++++++++++++
drivers/net/ovpn/netlink.h | 18 +
drivers/net/ovpn/ovpnstruct.h | 60 +
drivers/net/ovpn/packet.h | 40 +
drivers/net/ovpn/peer.c | 1207 ++++++++++++++
drivers/net/ovpn/peer.h | 172 ++
drivers/net/ovpn/pktid.c | 130 ++
drivers/net/ovpn/pktid.h | 87 +
drivers/net/ovpn/proto.h | 104 ++
drivers/net/ovpn/skb.h | 61 +
drivers/net/ovpn/socket.c | 165 ++
drivers/net/ovpn/socket.h | 53 +
drivers/net/ovpn/stats.c | 21 +
drivers/net/ovpn/stats.h | 47 +
drivers/net/ovpn/tcp.c | 506 ++++++
drivers/net/ovpn/tcp.h | 43 +
drivers/net/ovpn/udp.c | 406 +++++
drivers/net/ovpn/udp.h | 26 +
include/net/netlink.h | 1 +
include/uapi/linux/ovpn.h | 116 ++
include/uapi/linux/udp.h | 1 +
tools/net/ynl/ynl-gen-c.py | 2 +
tools/testing/selftests/Makefile | 1 +
tools/testing/selftests/net/ovpn/.gitignore | 2 +
tools/testing/selftests/net/ovpn/Makefile | 18 +
tools/testing/selftests/net/ovpn/config | 8 +
tools/testing/selftests/net/ovpn/data-test-tcp.sh | 9 +
tools/testing/selftests/net/ovpn/data-test.sh | 153 ++
tools/testing/selftests/net/ovpn/data64.key | 5 +
tools/testing/selftests/net/ovpn/float-test.sh | 118 ++
tools/testing/selftests/net/ovpn/ovpn-cli.c | 1822 +++++++++++++++++++++
tools/testing/selftests/net/ovpn/tcp_peers.txt | 5 +
tools/testing/selftests/net/ovpn/udp_peers.txt | 5 +
50 files changed, 8957 insertions(+)
---
base-commit: 44badc908f2c85711cb18e45e13119c10ad3a05f
change-id: 20241002-b4-ovpn-eeee35c694a2
Best regards,
--
Antonio Quartulli <antonio(a)openvpn.net>
The goal of the series is to simplify and make it possible to use
ncdevmem in an automated way from the ksft python wrapper.
ncdevmem is slowly mutated into a state where it uses stdout
to print the payload and the python wrapper is added to
make sure the arrived payload matches the expected one.
v6:
- fix compilation issue in 'Unify error handling' patch (Jakub)
v5:
- properly handle errors from inet_pton() and socket() (Paolo)
- remove unneeded import from python selftest (Paolo)
v4:
- keep usage example with validation (Mina)
- fix compilation issue in one patch (s/start_queues/start_queue/)
v3:
- keep and refine the comment about ncdevmem invocation (Mina)
- add the comment about not enforcing exit status for ntuple reset (Mina)
- make configure_headersplit more robust (Mina)
- use num_queues/2 in selftest and let the users override it (Mina)
- remove memory_provider.memcpy_to_device (Mina)
- keep ksft as is (don't use -v validate flags): we are gonna
need a --debug-disable flag to make it less chatty; otherwise
it times out when sending too much data; so leaving it as
a separate follow up
v2:
- don't remove validation (Mina)
- keep 5-tuple flow steering but use it only when -c is provided (Mina)
- remove separate flag for probing (Mina)
- move ncdevmem under drivers/net/hw, not drivers/net (Jakub)
Cc: Mina Almasry <almasrymina(a)google.com>
Stanislav Fomichev (12):
selftests: ncdevmem: Redirect all non-payload output to stderr
selftests: ncdevmem: Separate out dmabuf provider
selftests: ncdevmem: Unify error handling
selftests: ncdevmem: Make client_ip optional
selftests: ncdevmem: Remove default arguments
selftests: ncdevmem: Switch to AF_INET6
selftests: ncdevmem: Properly reset flow steering
selftests: ncdevmem: Use YNL to enable TCP header split
selftests: ncdevmem: Remove hard-coded queue numbers
selftests: ncdevmem: Run selftest when none of the -s or -c has been
provided
selftests: ncdevmem: Move ncdevmem under drivers/net/hw
selftests: ncdevmem: Add automated test
.../selftests/drivers/net/hw/.gitignore | 1 +
.../testing/selftests/drivers/net/hw/Makefile | 9 +
.../selftests/drivers/net/hw/devmem.py | 45 +
.../selftests/drivers/net/hw/ncdevmem.c | 773 ++++++++++++++++++
tools/testing/selftests/net/.gitignore | 1 -
tools/testing/selftests/net/Makefile | 8 -
tools/testing/selftests/net/ncdevmem.c | 570 -------------
7 files changed, 828 insertions(+), 579 deletions(-)
create mode 100644 tools/testing/selftests/drivers/net/hw/.gitignore
create mode 100755 tools/testing/selftests/drivers/net/hw/devmem.py
create mode 100644 tools/testing/selftests/drivers/net/hw/ncdevmem.c
delete mode 100644 tools/testing/selftests/net/ncdevmem.c
--
2.47.0
The 2024 architecture release includes a number of data processing
extensions, mostly SVE and SME additions with a few others. These are
all very straightforward extensions which add instructions but no
architectural state so only need hwcaps and exposing of the ID registers
to KVM guests and userspace.
Signed-off-by: Mark Brown <broonie(a)kernel.org>
---
Changes in v2:
- Filter KVM guest visible bitfields in ID_AA64ISAR3_EL1 to only those
we make writeable.
- Link to v1: https://lore.kernel.org/r/20241028-arm64-2024-dpisa-v1-0-a38d08b008a8@kerne…
---
Mark Brown (9):
arm64/sysreg: Update ID_AA64PFR2_EL1 to DDI0601 2024-09
arm64/sysreg: Update ID_AA64ISAR3_EL1 to DDI0601 2024-09
arm64/sysreg: Update ID_AA64FPFR0_EL1 to DDI0601 2024-09
arm64/sysreg: Update ID_AA64ZFR0_EL1 to DDI0601 2024-09
arm64/sysreg: Update ID_AA64SMFR0_EL1 to DDI0601 2024-09
arm64/sysreg: Update ID_AA64ISAR2_EL1 to DDI0601 2024-09
arm64/hwcap: Describe 2024 dpISA extensions to userspace
KVM: arm64: Allow control of dpISA extensions in ID_AA64ISAR3_EL1
kselftest/arm64: Add 2024 dpISA extensions to hwcap test
Documentation/arch/arm64/elf_hwcaps.rst | 51 ++++++
arch/arm64/include/asm/hwcap.h | 17 ++
arch/arm64/include/uapi/asm/hwcap.h | 17 ++
arch/arm64/kernel/cpufeature.c | 35 ++++
arch/arm64/kernel/cpuinfo.c | 17 ++
arch/arm64/kvm/sys_regs.c | 6 +-
arch/arm64/tools/sysreg | 87 +++++++++-
tools/testing/selftests/arm64/abi/hwcap.c | 273 +++++++++++++++++++++++++++++-
8 files changed, 493 insertions(+), 10 deletions(-)
---
base-commit: 8e929cb546ee42c9a61d24fae60605e9e3192354
change-id: 20241008-arm64-2024-dpisa-8091074a7f48
Best regards,
--
Mark Brown <broonie(a)kernel.org>
Currently, we are only using the linear search method to find the type
id by the name, which has a time complexity of O(n). This change involves
sorting the names of btf types in ascending order and using binary search,
which has a time complexity of O(log(n)). This idea was inspired by the
following patch:
60443c88f3a8 ("kallsyms: Improve the performance of kallsyms_lookup_name()").
At present, this improvement is only for searching in vmlinux's and module's BTFs.
Another change is the search direction, where we search the BTF first and
then its base, the type id of the first matched btf_type will be returned.
Here is a time-consuming result that finding 87590 type ids by their names in
vmlinux's BTF.
Before: 158426 ms
After: 114 ms
The average lookup performance has improved more than 1000x in the above scenario.
v4:
- Divide the patch into two parts: kernel and libbpf
- Use Eduard's code to sort btf_types in the btf__dedup function
- Correct some btf testcases due to modifications of the order of btf_types.
v3:
- Link: https://lore.kernel.org/all/20240608140835.965949-1-dolinux.peng@gmail.com/
- Sort btf_types during build process other than during boot, to reduce the
overhead of memory and boot time.
v2:
- Link: https://lore.kernel.org/all/20230909091646.420163-1-pengdonglin@sangfor.com…
Donglin Peng (3):
libbpf: Sort btf_types in ascending order by name
bpf: Using binary search to improve the performance of
btf_find_by_name_kind
libbpf: Using binary search to improve the performance of
btf__find_by_name_kind
include/linux/btf.h | 1 +
kernel/bpf/btf.c | 157 +++++++++-
tools/lib/bpf/btf.c | 274 +++++++++++++---
tools/testing/selftests/bpf/prog_tests/btf.c | 296 +++++++++---------
.../bpf/prog_tests/btf_dedup_split.c | 64 ++--
5 files changed, 555 insertions(+), 237 deletions(-)
--
2.34.1
This adds the pasid attach/detach uAPIs for userspace to attach/detach
a PASID of a device to/from a given ioas/hwpt. Only vfio-pci driver is
enabled in this series. After this series, PASID-capable devices bound
with vfio-pci can report PASID capability to userspace and VM to enable
PASID usages like Shared Virtual Addressing (SVA).
Based on the discussion about reporting the vPASID to VM [1], it's agreed
that we will let the userspace VMM to synthesize the vPASID capability.
The VMM needs to figure out a hole to put the vPASID cap. This includes
the hidden bits handling for some devices. While, it's up to the userspace,
it's not the focus of this series.
This series first adds the helpers for pasid attach in vfio core and then
adds the device cdev ioctls for pasid attach/detach. In the end of this
series, the IOMMU_GET_HW_INFO ioctl is extended to report the PCI PASID
capability to the userspace. Userspace should check this before using any
PASID related uAPIs provided by VFIO, which is the agreement in [2]. This
series depends on the iommufd pasid attach/detach series [3].
The completed code can be found at [4], tested with a hacky Qemu branch [5].
[1] https://lore.kernel.org/kvm/BN9PR11MB5276318969A212AD0649C7BE8CBE2@BN9PR11M…
[2] https://lore.kernel.org/kvm/4f2daf50-a5ad-4599-ab59-bcfc008688d8@intel.com/
[3] https://lore.kernel.org/linux-iommu/20240912131255.13305-1-yi.l.liu@intel.c…
[4] https://github.com/yiliu1765/iommufd/tree/iommufd_pasid
[5] https://github.com/yiliu1765/qemu/tree/wip/zhenzhong/iommufd_nesting_rfcv2-…
Change log:
v3:
- Misc enhancement on patch 01 of v2 (Alex, Jason)
- Add Jason's r-b to patch 03 of v2
- Drop the logic that report PASID via VFIO_DEVICE_FEATURE ioctl
- Extend IOMMU_GET_HW_INFO to report PASID support (Kevin, Jason, Alex)
v2: https://lore.kernel.org/kvm/20240412082121.33382-1-yi.l.liu@intel.com/
- Use IDA to track if PASID is attached or not in VFIO. (Jason)
- Fix the issue of calling pasid_at[de]tach_ioas callback unconditionally (Alex)
- Fix the wrong data copy in vfio_df_ioctl_pasid_detach_pt() (Zhenzhong)
- Minor tweaks in comments (Kevin)
v1: https://lore.kernel.org/kvm/20231127063909.129153-1-yi.l.liu@intel.com/
- Report PASID capability via VFIO_DEVICE_FEATURE (Alex)
rfc: https://lore.kernel.org/linux-iommu/20230926093121.18676-1-yi.l.liu@intel.c…
Regards,
Yi Liu
Yi Liu (4):
ida: Add ida_find_first_range()
vfio-iommufd: Support pasid [at|de]tach for physical VFIO devices
vfio: Add VFIO_DEVICE_PASID_[AT|DE]TACH_IOMMUFD_PT
iommufd: Extend IOMMU_GET_HW_INFO to report PASID capability
drivers/iommu/iommufd/device.c | 27 +++++++++++++-
drivers/pci/ats.c | 32 ++++++++++++++++
drivers/vfio/device_cdev.c | 51 ++++++++++++++++++++++++++
drivers/vfio/iommufd.c | 50 +++++++++++++++++++++++++
drivers/vfio/pci/vfio_pci.c | 2 +
drivers/vfio/vfio.h | 4 ++
drivers/vfio/vfio_main.c | 8 ++++
include/linux/idr.h | 11 ++++++
include/linux/pci-ats.h | 3 ++
include/linux/vfio.h | 11 ++++++
include/uapi/linux/iommufd.h | 14 ++++++-
include/uapi/linux/vfio.h | 55 ++++++++++++++++++++++++++++
lib/idr.c | 67 ++++++++++++++++++++++++++++++++++
13 files changed, 333 insertions(+), 2 deletions(-)
--
2.34.1
Recently we committed a fix to allow processes to receive notifications for
non-zero exits via the process connector module. Commit is a4c9a56e6a2c.
However, for threads, when it does a pthread_exit(&exit_status) call, the
kernel is not aware of the exit status with which pthread_exit is called.
It is sent by child thread to the parent process, if it is waiting in
pthread_join(). Hence, for a thread exiting abnormally, kernel cannot
send notifications to any listening processes.
The exception to this is if the thread is sent a signal which it has not
handled, and dies along with it's process as a result; for eg. SIGSEGV or
SIGKILL. In this case, kernel is aware of the non-zero exit and sends a
notification for it.
For our use case, we cannot have parent wait in pthread_join, one of the
main reasons for this being that we do not want to track normal
pthread_exit(), which could be a very large number. We only want to be
notified of any abnormal exits. Hence, threads are created with
pthread_attr_t set to PTHREAD_CREATE_DETACHED.
To fix this problem, we add a new type PROC_CN_MCAST_NOTIFY to proc connector
API, which allows a thread to send it's exit status to kernel either when
it needs to call pthread_exit() with non-zero value to indicate some
error or from signal handler before pthread_exit().
v5->v6 changes:
- As suggested by Simon Horman, fixed style issues (some old) by running
./scripts/checkpatch.pl --strict --max-line-length=80
- Removed inline functions as suggested by Simon Horman
- Added "depends on" in Kconfig.debug as suggested by Stanislav Fomichev
- Removed warning while compilation with kernel space headers as
suggested by Simon Horman
- Removed "comm" field, will send separate patch for this.
- Added kunit configs in tools/testing/kunit/configs/all_tests.config
with it's dependencies.
v4->v5 changes:
- Handled comment by Stanislav Fomichev to fix a print format error.
- Made thread.c completely automated by starting proc_filter program
from within threads.c.
- Changed name CONFIG_CN_HASH_KUNIT_TEST to CN_HASH_KUNIT_TEST in
Kconfig.debug and changed display text.
v3->v4 changes:
- Reduce size of exit.log by removing unnecessary text.
v2->v3 changes:
- Handled comment by Liam Howlett to set hdev to NULL and add comment on
it.
- Handled comment by Liam Howlett to combine functions for deleting+get
and deleting into one in cn_hash.c
- Handled comment by Liam Howlett to remove extern in the functions
defined in cn_hash_test.h
- Some nits by Liam Howlett fixed.
- Handled comment by Liam Howlett to make threads test automated.
proc_filter.c creates exit.log, which is read by thread.c and checks
the values reported.
- Added "comm" field to struct proc_event, to copy the task's name to
the packet to allow further filtering by packets.
v1->v2 changes:
- Handled comment by Peter Zijlstra to remove locking for PF_EXIT_NOTIFY
task->flags.
- Added error handling in thread.c
v->v1 changes:
- Handled comment by Simon Horman to remove unused err in cn_proc.c
- Handled comment by Simon Horman to make adata and key_display static
in cn_hash_test.c
Anjali Kulkarni (3):
connector/cn_proc: Add hash table for threads
connector/cn_proc: Kunit tests for threads hash table
connector/cn_proc: Selftest for threads
drivers/connector/Makefile | 2 +-
drivers/connector/cn_hash.c | 216 ++++++++++++++++
drivers/connector/cn_proc.c | 81 ++++--
drivers/connector/connector.c | 88 ++++++-
include/linux/connector.h | 62 ++++-
include/linux/sched.h | 2 +-
include/uapi/linux/cn_proc.h | 4 +-
lib/Kconfig.debug | 19 ++
lib/Makefile | 1 +
lib/cn_hash_test.c | 169 +++++++++++++
lib/cn_hash_test.h | 10 +
tools/testing/kunit/configs/all_tests.config | 5 +
tools/testing/selftests/connector/Makefile | 25 ++
.../testing/selftests/connector/proc_filter.c | 63 +++--
tools/testing/selftests/connector/thread.c | 238 ++++++++++++++++++
.../selftests/connector/thread_filter.c | 96 +++++++
16 files changed, 1027 insertions(+), 54 deletions(-)
create mode 100644 drivers/connector/cn_hash.c
create mode 100644 lib/cn_hash_test.c
create mode 100644 lib/cn_hash_test.h
create mode 100644 tools/testing/selftests/connector/thread.c
create mode 100644 tools/testing/selftests/connector/thread_filter.c
--
2.46.0
The kernel has recently added support for shadow stacks, currently
x86 only using their CET feature but both arm64 and RISC-V have
equivalent features (GCS and Zicfiss respectively), I am actively
working on GCS[1]. With shadow stacks the hardware maintains an
additional stack containing only the return addresses for branch
instructions which is not generally writeable by userspace and ensures
that any returns are to the recorded addresses. This provides some
protection against ROP attacks and making it easier to collect call
stacks. These shadow stacks are allocated in the address space of the
userspace process.
Our API for shadow stacks does not currently offer userspace any
flexiblity for managing the allocation of shadow stacks for newly
created threads, instead the kernel allocates a new shadow stack with
the same size as the normal stack whenever a thread is created with the
feature enabled. The stacks allocated in this way are freed by the
kernel when the thread exits or shadow stacks are disabled for the
thread. This lack of flexibility and control isn't ideal, in the vast
majority of cases the shadow stack will be over allocated and the
implicit allocation and deallocation is not consistent with other
interfaces. As far as I can tell the interface is done in this manner
mainly because the shadow stack patches were in development since before
clone3() was implemented.
Since clone3() is readily extensible let's add support for specifying a
shadow stack when creating a new thread or process, keeping the current
implicit allocation behaviour if one is not specified either with
clone3() or through the use of clone(). The user must provide a shadow
stack pointer, this must point to memory mapped for use as a shadow
stackby map_shadow_stack() with an architecture specified shadow stack
token at the top of the stack.
Please note that the x86 portions of this code are build tested only, I
don't appear to have a system that can run CET available to me.
[1] https://lore.kernel.org/linux-arm-kernel/20241001-arm64-gcs-v13-0-222b78d87…
Signed-off-by: Mark Brown <broonie(a)kernel.org>
---
Changes in v11:
- Rebase onto arm64 for-next/gcs, which is based on v6.12-rc1, and
integrate arm64 support.
- Rework the interface to specify a shadow stack pointer rather than a
base and size like we do for the regular stack.
- Link to v10: https://lore.kernel.org/r/20240821-clone3-shadow-stack-v10-0-06e8797b9445@k…
Changes in v10:
- Integrate fixes & improvements for the x86 implementation from Rick
Edgecombe.
- Require that the shadow stack be VM_WRITE.
- Require that the shadow stack base and size be sizeof(void *) aligned.
- Clean up trailing newline.
- Link to v9: https://lore.kernel.org/r/20240819-clone3-shadow-stack-v9-0-962d74f99464@ke…
Changes in v9:
- Pull token validation earlier and report problems with an error return
to parent rather than signal delivery to the child.
- Verify that the top of the supplied shadow stack is VM_SHADOW_STACK.
- Rework token validation to only do the page mapping once.
- Drop no longer needed support for testing for signals in selftest.
- Fix typo in comments.
- Link to v8: https://lore.kernel.org/r/20240808-clone3-shadow-stack-v8-0-0acf37caf14c@ke…
Changes in v8:
- Fix token verification with user specified shadow stack.
- Don't track user managed shadow stacks for child processes.
- Link to v7: https://lore.kernel.org/r/20240731-clone3-shadow-stack-v7-0-a9532eebfb1d@ke…
Changes in v7:
- Rebase onto v6.11-rc1.
- Typo fixes.
- Link to v6: https://lore.kernel.org/r/20240623-clone3-shadow-stack-v6-0-9ee7783b1fb9@ke…
Changes in v6:
- Rebase onto v6.10-rc3.
- Ensure we don't try to free the parent shadow stack in error paths of
x86 arch code.
- Spelling fixes in userspace API document.
- Additional cleanups and improvements to the clone3() tests to support
the shadow stack tests.
- Link to v5: https://lore.kernel.org/r/20240203-clone3-shadow-stack-v5-0-322c69598e4b@ke…
Changes in v5:
- Rebase onto v6.8-rc2.
- Rework ABI to have the user allocate the shadow stack memory with
map_shadow_stack() and a token.
- Force inlining of the x86 shadow stack enablement.
- Move shadow stack enablement out into a shared header for reuse by
other tests.
- Link to v4: https://lore.kernel.org/r/20231128-clone3-shadow-stack-v4-0-8b28ffe4f676@ke…
Changes in v4:
- Formatting changes.
- Use a define for minimum shadow stack size and move some basic
validation to fork.c.
- Link to v3: https://lore.kernel.org/r/20231120-clone3-shadow-stack-v3-0-a7b8ed3e2acc@ke…
Changes in v3:
- Rebase onto v6.7-rc2.
- Remove stale shadow_stack in internal kargs.
- If a shadow stack is specified unconditionally use it regardless of
CLONE_ parameters.
- Force enable shadow stacks in the selftest.
- Update changelogs for RISC-V feature rename.
- Link to v2: https://lore.kernel.org/r/20231114-clone3-shadow-stack-v2-0-b613f8681155@ke…
Changes in v2:
- Rebase onto v6.7-rc1.
- Remove ability to provide preallocated shadow stack, just specify the
desired size.
- Link to v1: https://lore.kernel.org/r/20231023-clone3-shadow-stack-v1-0-d867d0b5d4d0@ke…
---
Mark Brown (8):
arm64/gcs: Return a success value from gcs_alloc_thread_stack()
Documentation: userspace-api: Add shadow stack API documentation
selftests: Provide helper header for shadow stack testing
fork: Add shadow stack support to clone3()
selftests/clone3: Remove redundant flushes of output streams
selftests/clone3: Factor more of main loop into test_clone3()
selftests/clone3: Allow tests to flag if -E2BIG is a valid error code
selftests/clone3: Test shadow stack support
Documentation/userspace-api/index.rst | 1 +
Documentation/userspace-api/shadow_stack.rst | 41 ++++
arch/arm64/include/asm/gcs.h | 8 +-
arch/arm64/kernel/process.c | 8 +-
arch/arm64/mm/gcs.c | 62 +++++-
arch/x86/include/asm/shstk.h | 11 +-
arch/x86/kernel/process.c | 2 +-
arch/x86/kernel/shstk.c | 57 +++++-
include/asm-generic/cacheflush.h | 11 ++
include/linux/sched/task.h | 17 ++
include/uapi/linux/sched.h | 10 +-
kernel/fork.c | 96 +++++++--
tools/testing/selftests/clone3/clone3.c | 226 ++++++++++++++++++----
tools/testing/selftests/clone3/clone3_selftests.h | 65 ++++++-
tools/testing/selftests/ksft_shstk.h | 98 ++++++++++
15 files changed, 632 insertions(+), 81 deletions(-)
---
base-commit: d17cd7b7cc92d37ee8b2df8f975fc859a261f4dc
change-id: 20231019-clone3-shadow-stack-15d40d2bf536
Best regards,
--
Mark Brown <broonie(a)kernel.org>
Hi all,
This series is based on v6.12-rc4. It fixes an issue with the tracefs
gid mount option. Adds test case to prevent future breakages and updates
the tracefs readme to document the expected behavior of this option.
Thanks,
Kalesh
Kalesh Singh (3):
tracing: Document tracefs gid mount option
tracing/selftests: Add tracefs mount options test
tracing: Fix tracefs gid mount option
fs/tracefs/inode.c | 12 ++-
kernel/trace/trace.c | 4 +
.../ftrace/test.d/00basic/mount_options.tc | 101 ++++++++++++++++++
.../ftrace/test.d/00basic/test_ownership.tc | 16 +--
.../testing/selftests/ftrace/test.d/functions | 25 +++++
5 files changed, 142 insertions(+), 16 deletions(-)
create mode 100644 tools/testing/selftests/ftrace/test.d/00basic/mount_options.tc
base-commit: 42f7652d3eb527d03665b09edac47f85fb600924
--
2.47.0.163.g1226f6d8fa-goog
Hi all,
This is v2 of the series fixing the tracefs mount options.
Changes in v2:
- Update the commit descriptions, per Eric and Steve
- Fix ordering of the patches, per Steve
v1 can be found at:
- https://lore.kernel.org/r/20241028214550.2099923-1-kaleshsingh@google.com/
Thanks,
Kalesh
Kalesh Singh (3):
tracing: Fix tracefs mount options
tracing: Document tracefs gid mount option
tracing/selftests: Add tracefs mount options test
fs/tracefs/inode.c | 12 ++-
kernel/trace/trace.c | 4 +
.../ftrace/test.d/00basic/mount_options.tc | 101 ++++++++++++++++++
.../ftrace/test.d/00basic/test_ownership.tc | 16 +--
.../testing/selftests/ftrace/test.d/functions | 25 +++++
5 files changed, 142 insertions(+), 16 deletions(-)
create mode 100644 tools/testing/selftests/ftrace/test.d/00basic/mount_options.tc
base-commit: 42f7652d3eb527d03665b09edac47f85fb600924
--
2.47.0.163.g1226f6d8fa-goog
This series introduces a new vIOMMU infrastructure and related ioctls.
IOMMUFD has been using the HWPT infrastructure for all cases, including a
nested IO page table support. Yet, there're limitations for an HWPT-based
structure to support some advanced HW-accelerated features, such as CMDQV
on NVIDIA Grace, and HW-accelerated vIOMMU on AMD. Even for a multi-IOMMU
environment, it is not straightforward for nested HWPTs to share the same
parent HWPT (stage-2 IO pagetable), with the HWPT infrastructure alone: a
parent HWPT typically hold one stage-2 IO pagetable and tag it with only
one ID in the cache entries. When sharing one large stage-2 IO pagetable
across physical IOMMU instances, that one ID may not always be available
across all the IOMMU instances. In other word, it's ideal for SW to have
a different container for the stage-2 IO pagetable so it can hold another
ID that's available. And this container will be able to hold some advanced
feature too.
For this "different container", add vIOMMU, an additional layer to hold
extra virtualization information:
_______________________________________________________________________
| iommufd (with vIOMMU) |
| _____________ |
| | | |
| |----------------| vIOMMU | |
| | ______ | | _____________ ________ |
| | | | | | | | | | |
| | | IOAS |<---|(HWPT_PAGING)|<---| HWPT_NESTED |<--| DEVICE | |
| | |______| |_____________| |_____________| |________| |
| | | | | | |
|______|________|______________|__________________|_______________|_____|
| | | | |
______v_____ | ______v_____ ______v_____ ___v__
| struct | | PFN | (paging) | | (nested) | |struct|
|iommu_device| |------>|iommu_domain|<----|iommu_domain|<----|device|
|____________| storage|____________| |____________| |______|
The vIOMMU object should be seen as a slice of a physical IOMMU instance
that is passed to or shared with a VM. That can be some HW/SW resources:
- Security namespace for guest owned ID, e.g. guest-controlled cache tags
- Access to a sharable nesting parent pagetable across physical IOMMUs
- Non-affiliated event reporting (e.g. an invalidation queue error)
- Virtualization of various platforms IDs, e.g. RIDs and others
- Delivery of paravirtualized invalidation
- Direct assigned invalidation queues
- Direct assigned interrupts
On a multi-IOMMU system, the vIOMMU object must be instanced to the number
of the physical IOMMUs that have a slice passed to (via device) a guest VM,
while being able to hold the shareable parent HWPT. Each vIOMMU then just
needs to allocate its own individual ID to tag its own cache:
----------------------------
---------------- | | paging_hwpt0 |
| hwpt_nested0 |--->| viommu0 ------------------
---------------- | | IDx |
----------------------------
----------------------------
---------------- | | paging_hwpt0 |
| hwpt_nested1 |--->| viommu1 ------------------
---------------- | | IDy |
----------------------------
As an initial part-1, add IOMMUFD_CMD_VIOMMU_ALLOC ioctl for an allocation
only. And implement it in arm-smmu-v3 driver as a real world use case.
More vIOMMU-based structs and ioctls will be introduced in the follow-up
series to support vDEVICE, vIRQ (vEVENT) and vQUEUE objects. Although we
repurposed the vIOMMU object from an earlier RFC, just for a referece:
https://lore.kernel.org/all/cover.1712978212.git.nicolinc@nvidia.com/
This series is on Github:
https://github.com/nicolinc/iommufd/commits/iommufd_viommu_p1-v5
(paring QEMU branch for testing will be provided with the part2 series)
Changelog
v5
* Added "Reviewed-by" from Kevin
* Reworked iommufd_viommu_alloc helper
* Revised the uAPI kdoc for vIOMMU object
* Revised comments for pluggable iommu_dev
* Added a couple of cleanup patches for selftest
* Renamed domain_alloc_nested op to alloc_domain_nested
* Updated a few commit messages to reflect the latest series
* Renamed iommufd_hwpt_nested_alloc_for_viommu to
iommufd_viommu_alloc_hwpt_nested, and added flag validation
v4
https://lore.kernel.org/all/cover.1729553811.git.nicolinc@nvidia.com/
* Added "Reviewed-by" from Jason
* Dropped IOMMU_VIOMMU_TYPE_DEFAULT support
* Dropped iommufd_object_alloc_elm renamings
* Renamed iommufd's viommu_api.c to driver.c
* Reworked iommufd_viommu_alloc helper
* Added a separate iommufd_hwpt_nested_alloc_for_viommu function for
hwpt_nested allocations on a vIOMMU, and added comparison between
viommu->iommu_dev->ops and dev_iommu_ops(idev->dev)
* Replaced s2_parent with vsmmu in arm_smmu_nested_domain
* Replaced domain_alloc_user in iommu_ops with domain_alloc_nested in
viommu_ops
* Replaced wait_queue_head_t with a completion, to delay the unplug of
mock_iommu_dev
* Corrected documentation graph that was missing struct iommu_device
* Added an iommufd_verify_unfinalized_object helper to verify driver-
allocated vIOMMU/vDEVICE objects
* Added missing test cases for TEST_LENGTH and fail_nth
v3
https://lore.kernel.org/all/cover.1728491453.git.nicolinc@nvidia.com/
* Rebased on top of Jason's nesting v3 series
https://lore.kernel.org/all/0-v3-e2e16cd7467f+2a6a1-smmuv3_nesting_jgg@nvid…
* Split the series into smaller parts
* Added Jason's Reviewed-by
* Added back viommu->iommu_dev
* Added support for driver-allocated vIOMMU v.s. core-allocated
* Dropped arm_smmu_cache_invalidate_user
* Added an iommufd_test_wait_for_users() in selftest
* Reworked test code to make viommu an individual FIXTURE
* Added missing TEST_LENGTH case for the new ioctl command
v2
https://lore.kernel.org/all/cover.1724776335.git.nicolinc@nvidia.com/
* Limited vdev_id to one per idev
* Added a rw_sem to protect the vdev_id list
* Reworked driver-level APIs with proper lockings
* Added a new viommu_api file for IOMMUFD_DRIVER config
* Dropped useless iommu_dev point from the viommu structure
* Added missing index numnbers to new types in the uAPI header
* Dropped IOMMU_VIOMMU_INVALIDATE uAPI; Instead, reuse the HWPT one
* Reworked mock_viommu_cache_invalidate() using the new iommu helper
* Reordered details of set/unset_vdev_id handlers for proper lockings
v1
https://lore.kernel.org/all/cover.1723061377.git.nicolinc@nvidia.com/
Thanks!
Nicolin
Nicolin Chen (13):
iommufd: Move struct iommufd_object to public iommufd header
iommufd: Introduce IOMMUFD_OBJ_VIOMMU and its related struct
iommufd: Add iommufd_verify_unfinalized_object
iommufd/viommu: Add IOMMU_VIOMMU_ALLOC ioctl
iommufd: Add alloc_domain_nested op to iommufd_viommu_ops
iommufd: Allow pt_id to carry viommu_id for IOMMU_HWPT_ALLOC
iommufd/selftest: Add container_of helpers
iommufd/selftest: Prepare for mock_viommu_alloc_domain_nested()
iommufd/selftest: Add refcount to mock_iommu_device
iommufd/selftest: Add IOMMU_VIOMMU_TYPE_SELFTEST
iommufd/selftest: Add IOMMU_VIOMMU_ALLOC test coverage
Documentation: userspace-api: iommufd: Update vIOMMU
iommu/arm-smmu-v3: Add IOMMU_VIOMMU_TYPE_ARM_SMMUV3 support
drivers/iommu/iommufd/Makefile | 5 +-
drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h | 26 +-
drivers/iommu/iommufd/iommufd_private.h | 36 +--
drivers/iommu/iommufd/iommufd_test.h | 2 +
include/linux/iommu.h | 14 +
include/linux/iommufd.h | 86 ++++++
include/uapi/linux/iommufd.h | 56 +++-
tools/testing/selftests/iommu/iommufd_utils.h | 28 ++
.../arm/arm-smmu-v3/arm-smmu-v3-iommufd.c | 80 ++++--
drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 9 +-
drivers/iommu/iommufd/driver.c | 38 +++
drivers/iommu/iommufd/hw_pagetable.c | 71 ++++-
drivers/iommu/iommufd/main.c | 58 ++--
drivers/iommu/iommufd/selftest.c | 256 ++++++++++++------
drivers/iommu/iommufd/viommu.c | 89 ++++++
tools/testing/selftests/iommu/iommufd.c | 87 ++++++
.../selftests/iommu/iommufd_fail_nth.c | 11 +
Documentation/userspace-api/iommufd.rst | 69 ++++-
18 files changed, 827 insertions(+), 194 deletions(-)
create mode 100644 drivers/iommu/iommufd/driver.c
create mode 100644 drivers/iommu/iommufd/viommu.c
--
2.43.0
There is a spelling mistake in a ksft_test_result_skip message. Fix it.
Signed-off-by: Colin Ian King <colin.i.king(a)gmail.com>
---
tools/testing/selftests/kvm/s390x/cpumodel_subfuncs_test.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/tools/testing/selftests/kvm/s390x/cpumodel_subfuncs_test.c b/tools/testing/selftests/kvm/s390x/cpumodel_subfuncs_test.c
index 222ba1cc3cac..34e0ceceaa74 100644
--- a/tools/testing/selftests/kvm/s390x/cpumodel_subfuncs_test.c
+++ b/tools/testing/selftests/kvm/s390x/cpumodel_subfuncs_test.c
@@ -276,7 +276,7 @@ int main(int argc, char *argv[])
ksft_test_result_pass("%s\n", testlist[idx].subfunc_name);
free(array);
} else {
- ksft_test_result_skip("%s feature is not avaialable\n",
+ ksft_test_result_skip("%s feature is not available\n",
testlist[idx].subfunc_name);
}
}
--
2.39.5
This series was originally written by José Expósito, and has been
modified and updated by Matt Gilbride and myself. The original version
can be found here:
https://github.com/Rust-for-Linux/linux/pull/950
Add support for writing KUnit tests in Rust. While Rust doctests are
already converted to KUnit tests and run, they're really better suited
for examples, rather than as first-class unit tests.
This series implements a series of direct Rust bindings for KUnit tests,
as well as a new macro which allows KUnit tests to be written using a
close variant of normal Rust unit test syntax. The only change required
is replacing '#[cfg(test)]' with '#[kunit_tests(kunit_test_suite_name)]'
An example test would look like:
#[kunit_tests(rust_kernel_hid_driver)]
mod tests {
use super::*;
use crate::{c_str, driver, hid, prelude::*};
use core::ptr;
struct SimpleTestDriver;
impl Driver for SimpleTestDriver {
type Data = ();
}
#[test]
fn rust_test_hid_driver_adapter() {
let mut hid = bindings::hid_driver::default();
let name = c_str!("SimpleTestDriver");
static MODULE: ThisModule = unsafe { ThisModule::from_ptr(ptr::null_mut()) };
let res = unsafe {
<hid::Adapter<SimpleTestDriver> as driver::DriverOps>::register(&mut hid, name, &MODULE)
};
assert_eq!(res, Err(ENODEV)); // The mock returns -19
}
}
Please give this a go, and make sure I haven't broken it! There's almost
certainly a lot of improvements which can be made -- and there's a fair
case to be made for replacing some of this with generated C code which
can use the C macros -- but this is hopefully an adequate implementation
for now, and the interface can (with luck) remain the same even if the
implementation changes.
A few small notable missing features:
- Attributes (like the speed of a test) are hardcoded to the default
value.
- Similarly, the module name attribute is hardcoded to NULL. In C, we
use the KBUILD_MODNAME macro, but I couldn't find a way to use this
from Rust which wasn't more ugly than just disabling it.
- Assertions are not automatically rewritten to use KUnit assertions.
---
Changes since v1:
https://lore.kernel.org/lkml/20230720-rustbind-v1-0-c80db349e3b5@google.com…
- Rebase on top of the latest rust-next (commit 718c4069896c)
- Make kunit_case a const fn, rather than a macro (Thanks Boqun)
- As a result, the null terminator is now created with
kernel::kunit::kunit_case_null()
- Use the C kunit_get_current_test() function to implement
in_kunit_test(), rather than re-implementing it (less efficiently)
ourselves.
Changes since the GitHub PR:
- Rebased on top of kselftest/kunit
- Add const_mut_refs feature
This may conflict with https://lore.kernel.org/lkml/20230503090708.2524310-6-nmi@metaspace.dk/
- Add rust/macros/kunit.rs to the KUnit MAINTAINERS entry
---
José Expósito (3):
rust: kunit: add KUnit case and suite macros
rust: macros: add macro to easily run KUnit tests
rust: kunit: allow to know if we are in a test
MAINTAINERS | 1 +
rust/kernel/kunit.rs | 191 +++++++++++++++++++++++++++++++++++++++++++
rust/kernel/lib.rs | 1 +
rust/macros/lib.rs | 29 +++++++
4 files changed, 222 insertions(+)
--
2.47.0.163.g1226f6d8fa-goog
If running hid testcases with command "./run_kselftest.sh -c hid",
the following tests will take longer than the kselftest framework
timeout (now 200 seconds) to run and thus got terminated with TIMEOUT
error:
hid-multitouch.sh - took about 6min41s
hid-tablet.sh - took about 6min30s
Increase the timeout setting to 10 minutes to allow them have a chance
to finish.
Cc: stable(a)vger.kernel.org
Signed-off-by: Yun Lu <luyun(a)kylinos.cn>
---
tools/testing/selftests/hid/settings | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/tools/testing/selftests/hid/settings b/tools/testing/selftests/hid/settings
index b3cbfc521b10..dff0d947f9c2 100644
--- a/tools/testing/selftests/hid/settings
+++ b/tools/testing/selftests/hid/settings
@@ -1,3 +1,3 @@
# HID tests can be long, so give a little bit more time
# to them
-timeout=200
+timeout=600
--
2.27.0
Following the previous vIOMMU series, this adds another vDEVICE structure,
representing the association from an iommufd_device to an iommufd_viommu.
This gives the whole architecture a new "v" layer:
_______________________________________________________________________
| iommufd (with vIOMMU/vDEVICE) |
| _____________ _____________ |
| | | | | |
| |----------------| vIOMMU |<---| vDEVICE |<------| |
| | | | |_____________| | |
| | ______ | | _____________ ___|____ |
| | | | | | | | | | |
| | | IOAS |<---|(HWPT_PAGING)|<---| HWPT_NESTED |<--| DEVICE | |
| | |______| |_____________| |_____________| |________| |
|______|________|______________|__________________|_______________|_____|
| | | | |
______v_____ | ______v_____ ______v_____ ___v__
| struct | | PFN | (paging) | | (nested) | |struct|
|iommu_device| |------>|iommu_domain|<----|iommu_domain|<----|device|
|____________| storage|____________| |____________| |______|
This vDEVICE object is used to collect and store all vIOMMU-related device
information/attributes in a VM. As an initial series for vDEVICE, add only
the virt_id to the vDEVICE, which is a vIOMMU specific device ID in a VM:
e.g. vSID of ARM SMMUv3, vDeviceID of AMD IOMMU, and vID of Intel VT-d to
a Context Table. This virt_id helps IOMMU drivers to link the vID to a pID
of the device against the physical IOMMU instance. This is essential for a
vIOMMU-based invalidation, where the request contains a device's vID for a
device cache flush, e.g. ATC invalidation.
Therefore, with this vDEVICE object, support a vIOMMU-based invalidation,
by reusing IOMMUFD_CMD_HWPT_INVALIDATE for a vIOMMU object to flush cache
with a given driver data.
As for the implementation of the series, add driver support in ARM SMMUv3
for a real world use case.
This series is on Github:
https://github.com/nicolinc/iommufd/commits/iommufd_viommu_p2-v5
For testing, try this "with-rmr" branch:
https://github.com/nicolinc/iommufd/commits/iommufd_viommu_p2-v5-with-rmr
Paring QEMU branch for testing:
https://github.com/nicolinc/qemu/commits/wip/for_iommufd_viommu_p2-v5
Changelog
v5
* Dropped driver-allocated vDEVICE support
* Changed vdev_to_dev helper to iommufd_viommu_find_dev
v4
https://lore.kernel.org/all/cover.1729555967.git.nicolinc@nvidia.com/
* Added missing brackets in switch-case
* Fixed the unreleased idev refcount issue
* Reworked the iommufd_vdevice_alloc allocator
* Dropped support for IOMMU_VIOMMU_TYPE_DEFAULT
* Added missing TEST_LENGTH and fail_nth coverages
* Added a verification to the driver-allocated vDEVICE object
* Added an iommufd_vdevice_abort for a missing mutex protection
* Added a u64 structure arm_vsmmu_invalidation_cmd for user command
conversion
v3
https://lore.kernel.org/all/cover.1728491532.git.nicolinc@nvidia.com/
* Added Jason's Reviewed-by
* Split this invalidation part out of the part-1 series
* Repurposed VDEV_ID ioctl to a wider vDEVICE structure and ioctl
* Reduced viommu_api functions by allowing drivers to access viommu
and vdevice structure directly
* Dropped vdevs_rwsem by using xa_lock instead
* Dropped arm_smmu_cache_invalidate_user
v2
https://lore.kernel.org/all/cover.1724776335.git.nicolinc@nvidia.com/
* Limited vdev_id to one per idev
* Added a rw_sem to protect the vdev_id list
* Reworked driver-level APIs with proper lockings
* Added a new viommu_api file for IOMMUFD_DRIVER config
* Dropped useless iommu_dev point from the viommu structure
* Added missing index numnbers to new types in the uAPI header
* Dropped IOMMU_VIOMMU_INVALIDATE uAPI; Instead, reuse the HWPT one
* Reworked mock_viommu_cache_invalidate() using the new iommu helper
* Reordered details of set/unset_vdev_id handlers for proper lockings
v1
https://lore.kernel.org/all/cover.1723061377.git.nicolinc@nvidia.com/
Thanks!
Nicolin
Jason Gunthorpe (2):
iommu: Add iommu_copy_struct_from_full_user_array helper
iommu/arm-smmu-v3: Allow ATS for IOMMU_DOMAIN_NESTED
Nicolin Chen (11):
iommufd/viommu: Add IOMMUFD_OBJ_VDEVICE and IOMMU_VDEVICE_ALLOC ioctl
iommufd/selftest: Add IOMMU_VDEVICE_ALLOC test coverage
iommu/viommu: Add cache_invalidate to iommufd_viommu_ops
iommufd/hw_pagetable: Enforce invalidation op on vIOMMU-based
hwpt_nested
iommufd: Allow hwpt_id to carry viommu_id for IOMMU_HWPT_INVALIDATE
iommufd/viommu: Add iommufd_viommu_find_dev helper
iommufd/selftest: Add mock_viommu_cache_invalidate
iommufd/selftest: Add IOMMU_TEST_OP_DEV_CHECK_CACHE test command
iommufd/selftest: Add vIOMMU coverage for IOMMU_HWPT_INVALIDATE ioctl
Documentation: userspace-api: iommufd: Update vDEVICE
iommu/arm-smmu-v3: Add arm_vsmmu_cache_invalidate
drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h | 9 +-
drivers/iommu/iommufd/iommufd_private.h | 20 ++
drivers/iommu/iommufd/iommufd_test.h | 30 +++
include/linux/iommu.h | 48 ++++-
include/linux/iommufd.h | 21 ++
include/uapi/linux/iommufd.h | 61 +++++-
tools/testing/selftests/iommu/iommufd_utils.h | 83 +++++++
.../arm/arm-smmu-v3/arm-smmu-v3-iommufd.c | 161 +++++++++++++-
drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 32 ++-
drivers/iommu/iommufd/device.c | 11 +
drivers/iommu/iommufd/driver.c | 13 ++
drivers/iommu/iommufd/hw_pagetable.c | 36 +++-
drivers/iommu/iommufd/main.c | 7 +
drivers/iommu/iommufd/selftest.c | 98 ++++++++-
drivers/iommu/iommufd/viommu.c | 101 +++++++++
tools/testing/selftests/iommu/iommufd.c | 204 +++++++++++++++++-
.../selftests/iommu/iommufd_fail_nth.c | 4 +
Documentation/userspace-api/iommufd.rst | 41 +++-
18 files changed, 942 insertions(+), 38 deletions(-)
--
2.43.0
Compiled binary files should be added to .gitignore
'git status' complains:
Untracked files:
(use "git add <file>..." to include in what will be committed)
net/netfilter/conntrack_reverse_clash
Cc: Pablo Neira Ayuso <pablo(a)netfilter.org>
Cc: Jozsef Kadlecsik <kadlec(a)netfilter.org>
Cc: "David S. Miller" <davem(a)davemloft.net>
Cc: Eric Dumazet <edumazet(a)google.com>
Cc: Jakub Kicinski <kuba(a)kernel.org>
Cc: Paolo Abeni <pabeni(a)redhat.com>
Cc: Shuah Khan <shuah(a)kernel.org>
Signed-off-by: Li Zhijian <lizhijian(a)fujitsu.com>
---
Cc: netfilter-devel(a)vger.kernel.org
Cc: coreteam(a)netfilter.org
Cc: netdev(a)vger.kernel.org
---
Hello,
Cover letter is here.
This patch set aims to make 'git status' clear after 'make' and 'make
run_tests' for kselftests.
---
V3:
sort the files
V2:
split as a separate patch from a small one [0]
[0] https://lore.kernel.org/linux-kselftest/20241015010817.453539-1-lizhijian@f…
Signed-off-by: Li Zhijian <lizhijian(a)fujitsu.com>
---
tools/testing/selftests/net/netfilter/.gitignore | 1 +
1 file changed, 1 insertion(+)
diff --git a/tools/testing/selftests/net/netfilter/.gitignore b/tools/testing/selftests/net/netfilter/.gitignore
index 0a64d6d0e29a..64c4f8d9aa6c 100644
--- a/tools/testing/selftests/net/netfilter/.gitignore
+++ b/tools/testing/selftests/net/netfilter/.gitignore
@@ -2,5 +2,6 @@
audit_logread
connect_close
conntrack_dump_flush
+conntrack_reverse_clash
sctp_collision
nf_queue
--
2.44.0
Compiled binary files should be added to .gitignore
'git status' complains:
Untracked files:
(use "git add <file>..." to include in what will be committed)
alsa/global-timer
alsa/utimer-test
Cc: Mark Brown <broonie(a)kernel.org>
Cc: Jaroslav Kysela <perex(a)perex.cz>
Cc: Takashi Iwai <tiwai(a)suse.com>
Cc: Shuah Khan <shuah(a)kernel.org>
Signed-off-by: Li Zhijian <lizhijian(a)fujitsu.com>
---
Cc: linux-sound(a)vger.kernel.org
---
Hello,
Cover letter is here.
This patch set aims to make 'git status' clear after 'make' and 'make
run_tests' for kselftests.
---
V2:
split as a separate patch from a small one [0]
[0] https://lore.kernel.org/linux-kselftest/20241015010817.453539-1-lizhijian@f…
---
tools/testing/selftests/alsa/.gitignore | 2 ++
1 file changed, 2 insertions(+)
diff --git a/tools/testing/selftests/alsa/.gitignore b/tools/testing/selftests/alsa/.gitignore
index 12dc3fcd3456..1407fd24a97b 100644
--- a/tools/testing/selftests/alsa/.gitignore
+++ b/tools/testing/selftests/alsa/.gitignore
@@ -1,3 +1,5 @@
mixer-test
pcm-test
test-pcmtest-driver
+global-timer
+utimer-test
--
2.44.0
The goal of the series is to simplify and make it possible to use
ncdevmem in an automated way from the ksft python wrapper.
ncdevmem is slowly mutated into a state where it uses stdout
to print the payload and the python wrapper is added to
make sure the arrived payload matches the expected one.
v5:
- properly handle errors from inet_pton() and socket() (Paolo)
- remove unneeded import from python selftest (Paolo)
v4:
- keep usage example with validation (Mina)
- fix compilation issue in one patch (s/start_queues/start_queue/)
v3:
- keep and refine the comment about ncdevmem invocation (Mina)
- add the comment about not enforcing exit status for ntuple reset (Mina)
- make configure_headersplit more robust (Mina)
- use num_queues/2 in selftest and let the users override it (Mina)
- remove memory_provider.memcpy_to_device (Mina)
- keep ksft as is (don't use -v validate flags): we are gonna
need a --debug-disable flag to make it less chatty; otherwise
it times out when sending too much data; so leaving it as
a separate follow up
v2:
- don't remove validation (Mina)
- keep 5-tuple flow steering but use it only when -c is provided (Mina)
- remove separate flag for probing (Mina)
- move ncdevmem under drivers/net/hw, not drivers/net (Jakub)
Cc: Mina Almasry <almasrymina(a)google.com>
Stanislav Fomichev (12):
selftests: ncdevmem: Redirect all non-payload output to stderr
selftests: ncdevmem: Separate out dmabuf provider
selftests: ncdevmem: Unify error handling
selftests: ncdevmem: Make client_ip optional
selftests: ncdevmem: Remove default arguments
selftests: ncdevmem: Switch to AF_INET6
selftests: ncdevmem: Properly reset flow steering
selftests: ncdevmem: Use YNL to enable TCP header split
selftests: ncdevmem: Remove hard-coded queue numbers
selftests: ncdevmem: Run selftest when none of the -s or -c has been
provided
selftests: ncdevmem: Move ncdevmem under drivers/net/hw
selftests: ncdevmem: Add automated test
.../selftests/drivers/net/hw/.gitignore | 1 +
.../testing/selftests/drivers/net/hw/Makefile | 9 +
.../selftests/drivers/net/hw/devmem.py | 45 +
.../selftests/drivers/net/hw/ncdevmem.c | 773 ++++++++++++++++++
tools/testing/selftests/net/.gitignore | 1 -
tools/testing/selftests/net/Makefile | 8 -
tools/testing/selftests/net/ncdevmem.c | 570 -------------
7 files changed, 828 insertions(+), 579 deletions(-)
create mode 100644 tools/testing/selftests/drivers/net/hw/.gitignore
create mode 100755 tools/testing/selftests/drivers/net/hw/devmem.py
create mode 100644 tools/testing/selftests/drivers/net/hw/ncdevmem.c
delete mode 100644 tools/testing/selftests/net/ncdevmem.c
--
2.47.0
Basics and overview
===================
Software with larger attack surfaces (e.g. network facing apps like databases,
browsers or apps relying on browser runtimes) suffer from memory corruption
issues which can be utilized by attackers to bend control flow of the program
to eventually gain control (by making their payload executable). Attackers are
able to perform such attacks by leveraging call-sites which rely on indirect
calls or return sites which rely on obtaining return address from stack memory.
To mitigate such attacks, risc-v extension zicfilp enforces that all indirect
calls must land on a landing pad instruction `lpad` else cpu will raise software
check exception (a new cpu exception cause code on riscv).
Similarly for return flow, risc-v extension zicfiss extends architecture with
- `sspush` instruction to push return address on a shadow stack
- `sspopchk` instruction to pop return address from shadow stack
and compare with input operand (i.e. return address on stack)
- `sspopchk` to raise software check exception if comparision above
was a mismatch
- Protection mechanism using which shadow stack is not writeable via
regular store instructions
More information an details can be found at extensions github repo [1].
Equivalent to landing pad (zicfilp) on x86 is `ENDBRANCH` instruction in Intel
CET [3] and branch target identification (BTI) [4] on arm.
Similarly x86's Intel CET has shadow stack [5] and arm64 has guarded control
stack (GCS) [6] which are very similar to risc-v's zicfiss shadow stack.
x86 already supports shadow stack for user mode and arm64 support for GCS in
usermode [7] is ongoing.
Kernel awareness for user control flow integrity
================================================
This series picks up Samuel Holland's envcfg changes [2] as well. So if those are
being applied independently, they should be removed from this series.
Enabling:
In order to maintain compatibility and not break anything in user mode, kernel
doesn't enable control flow integrity cpu extensions on binary by default.
Instead exposes a prctl interface to enable, disable and lock the shadow stack
or landing pad feature for a task. This allows userspace (loader) to enumerate
if all objects in its address space are compiled with shadow stack and landing
pad support and accordingly enable the feature. Additionally if a subsequent
`dlopen` happens on a library, user mode can take a decision again to disable
the feature (if incoming library is not compiled with support) OR terminate the
task (if user mode policy is strict to have all objects in address space to be
compiled with control flow integirty cpu feature). prctl to enable shadow stack
results in allocating shadow stack from virtual memory and activating for user
address space. x86 and arm64 are also following same direction due to similar
reason(s).
clone/fork:
On clone and fork, cfi state for task is inherited by child. Shadow stack is
part of virtual memory and is a writeable memory from kernel perspective
(writeable via a restricted set of instructions aka shadow stack instructions)
Thus kernel changes ensure that this memory is converted into read-only when
fork/clone happens and COWed when fault is taken due to sspush, sspopchk or
ssamoswap. In case `CLONE_VM` is specified and shadow stack is to be enabled,
kernel will automatically allocate a shadow stack for that clone call.
map_shadow_stack:
x86 introduced `map_shadow_stack` system call to allow user space to explicitly
map shadow stack memory in its address space. It is useful to allocate shadow
for different contexts managed by a single thread (green threads or contexts)
risc-v implements this system call as well.
signal management:
If shadow stack is enabled for a task, kernel performs an asynchronous control
flow diversion to deliver the signal and eventually expects userspace to issue
sigreturn so that original execution can be resumed. Even though resume context
is prepared by kernel, it is in user space memory and is subject to memory
corruption and corruption bugs can be utilized by attacker in this race window
to perform arbitrary sigreturn and eventually bypass cfi mechanism.
Another issue is how to ensure that cfi related state on sigcontext area is not
trampled by legacy apps or apps compiled with old kernel headers.
In order to mitigate control-flow hijacting, kernel prepares a token and place
it on shadow stack before signal delivery and places address of token in
sigcontext structure. During sigreturn, kernel obtains address of token from
sigcontext struture, reads token from shadow stack and validates it and only
then allow sigreturn to succeed. Compatiblity issue is solved by adopting
dynamic sigcontext management introduced for vector extension. This series
re-factor the code little bit to allow future sigcontext management easy (as
proposed by Andy Chiu from SiFive)
config and compilation:
Introduce a new risc-v config option `CONFIG_RISCV_USER_CFI`. Selecting this
config option picks the kernel support for user control flow integrity. This
optin is presented only if toolchain has shadow stack and landing pad support.
And is on purpose guarded by toolchain support. Reason being that eventually
vDSO also needs to be compiled in with shadow stack and landing pad support.
vDSO compile patches are not included as of now because landing pad labeling
scheme is yet to settle for usermode runtime.
To get more information on kernel interactions with respect to
zicfilp and zicfiss, patch series adds documentation for
`zicfilp` and `zicfiss` in following:
Documentation/arch/riscv/zicfiss.rst
Documentation/arch/riscv/zicfilp.rst
How to test this series
=======================
Toolchain
---------
$ git clone git@github.com:sifive/riscv-gnu-toolchain.git -b cfi-dev
$ riscv-gnu-toolchain/configure --prefix=<path-to-where-to-build> --with-arch=rv64gc_zicfilp_zicfiss --enable-linux --disable-gdb --with-extra-multilib-test="rv64gc_zicfilp_zicfiss-lp64d:-static"
$ make -j$(nproc)
Qemu
----
$ git clone git@github.com:deepak0414/qemu.git -b zicfilp_zicfiss_ratified_master_july11
$ cd qemu
$ mkdir build
$ cd build
$ ../configure --target-list=riscv64-softmmu
$ make -j$(nproc)
Opensbi
-------
$ git clone git@github.com:deepak0414/opensbi.git -b v6_cfi_spec_split_opensbi
$ make CROSS_COMPILE=<your riscv toolchain> -j$(nproc) PLATFORM=generic
Linux
-----
Running defconfig is fine. CFI is enabled by default if the toolchain
supports it.
$ make ARCH=riscv CROSS_COMPILE=<path-to-cfi-riscv-gnu-toolchain>/build/bin/riscv64-unknown-linux-gnu- -j$(nproc) defconfig
$ make ARCH=riscv CROSS_COMPILE=<path-to-cfi-riscv-gnu-toolchain>/build/bin/riscv64-unknown-linux-gnu- -j$(nproc)
Branch where user cfi enabling patches are maintained
https://github.com/deepak0414/linux-riscv-cfi/tree/vdso_user_cfi_v6.12-rc1
In case you're building your own rootfs using toolchain, please make sure you
pick following patch to ensure that vDSO compiled with lpad and shadow stack.
"arch/riscv: compile vdso with landing pad"
Running
-------
Modify your qemu command to have:
-bios <path-to-cfi-opensbi>/build/platform/generic/firmware/fw_dynamic.bin
-cpu rv64,zicfilp=true,zicfiss=true,zimop=true,zcmop=true
vDSO related Opens (in the flux)
=================================
I am listing these opens for laying out plan and what to expect in future
patch sets. And of course for the sake of discussion.
Shadow stack and landing pad enabling in vDSO
----------------------------------------------
vDSO must have shadow stack and landing pad support compiled in for task
to have shadow stack and landing pad support. This patch series doesn't
enable that (yet). Enabling shadow stack support in vDSO should be
straight forward (intend to do that in next versions of patch set). Enabling
landing pad support in vDSO requires some collaboration with toolchain folks
to follow a single label scheme for all object binaries. This is necessary to
ensure that all indirect call-sites are setting correct label and target landing
pads are decorated with same label scheme.
How many vDSOs
---------------
Shadow stack instructions are carved out of zimop (may be operations) and if CPU
doesn't implement zimop, they're illegal instructions. Kernel could be running on
a CPU which may or may not implement zimop. And thus kernel will have to carry 2
different vDSOs and expose the appropriate one depending on whether CPU implements
zimop or not.
References
==========
[1] - https://github.com/riscv/riscv-cfi
[2] - https://lore.kernel.org/all/20240814081126.956287-1-samuel.holland@sifive.c…
[3] - https://lwn.net/Articles/889475/
[4] - https://developer.arm.com/documentation/109576/0100/Branch-Target-Identific…
[5] - https://www.intel.com/content/dam/develop/external/us/en/documents/catc17-i…
[6] - https://lwn.net/Articles/940403/
[7] - https://lore.kernel.org/all/20241001-arm64-gcs-v13-0-222b78d87eee@kernel.or…
To: Thomas Gleixner <tglx(a)linutronix.de>
To: Ingo Molnar <mingo(a)redhat.com>
To: Borislav Petkov <bp(a)alien8.de>
To: Dave Hansen <dave.hansen(a)linux.intel.com>
To: x86(a)kernel.org
To: H. Peter Anvin <hpa(a)zytor.com>
To: Andrew Morton <akpm(a)linux-foundation.org>
To: Liam R. Howlett <Liam.Howlett(a)oracle.com>
To: Vlastimil Babka <vbabka(a)suse.cz>
To: Lorenzo Stoakes <lorenzo.stoakes(a)oracle.com>
To: Paul Walmsley <paul.walmsley(a)sifive.com>
To: Palmer Dabbelt <palmer(a)dabbelt.com>
To: Albert Ou <aou(a)eecs.berkeley.edu>
To: Conor Dooley <conor(a)kernel.org>
To: Rob Herring <robh(a)kernel.org>
To: Krzysztof Kozlowski <krzk+dt(a)kernel.org>
To: Arnd Bergmann <arnd(a)arndb.de>
To: Christian Brauner <brauner(a)kernel.org>
To: Peter Zijlstra <peterz(a)infradead.org>
To: Oleg Nesterov <oleg(a)redhat.com>
To: Eric Biederman <ebiederm(a)xmission.com>
To: Kees Cook <kees(a)kernel.org>
To: Jonathan Corbet <corbet(a)lwn.net>
To: Shuah Khan <shuah(a)kernel.org>
Cc: linux-kernel(a)vger.kernel.org
Cc: linux-fsdevel(a)vger.kernel.org
Cc: linux-mm(a)kvack.org
Cc: linux-riscv(a)lists.infradead.org
Cc: devicetree(a)vger.kernel.org
Cc: linux-arch(a)vger.kernel.org
Cc: linux-doc(a)vger.kernel.org
Cc: linux-kselftest(a)vger.kernel.org
Cc: alistair.francis(a)wdc.com
Cc: richard.henderson(a)linaro.org
Cc: jim.shu(a)sifive.com
Cc: andybnac(a)gmail.com
Cc: kito.cheng(a)sifive.com
Cc: charlie(a)rivosinc.com
Cc: atishp(a)rivosinc.com
Cc: evan(a)rivosinc.com
Cc: cleger(a)rivosinc.com
Cc: alexghiti(a)rivosinc.com
Cc: samitolvanen(a)google.com
Cc: broonie(a)kernel.org
Cc: rick.p.edgecombe(a)intel.com
Signed-off-by: Deepak Gupta <debug(a)rivosinc.com>
---
changelog
---------
v6:
- Picked up Samuel Holland's changes as is with `envcfg` placed in
`thread` instead of `thread_info`
- fixed unaligned newline escapes in kselftest
- cleaned up messages in kselftest and included test output in commit message
- fixed a bug in clone path reported by Zong Li
- fixed a build issue if CONFIG_RISCV_ISA_V is not selected
(this was introduced due to re-factoring signal context
management code)
v5:
- rebased on v6.12-rc1
- Fixed schema related issues in device tree file
- Fixed some of the documentation related issues in zicfilp/ss.rst
(style issues and added index)
- added `SHADOW_STACK_SET_MARKER` so that implementation can define base
of shadow stack.
- Fixed warnings on definitions added in usercfi.h when
CONFIG_RISCV_USER_CFI is not selected.
- Adopted context header based signal handling as proposed by Andy Chiu
- Added support for enabling kernel mode access to shadow stack using
FWFT
(https://github.com/riscv-non-isa/riscv-sbi-doc/blob/master/src/ext-firmware…)
- Link to v5: https://lore.kernel.org/r/20241001-v5_user_cfi_series-v1-0-3ba65b6e550f@riv…
(Note: I had an issue in my workflow due to which version number wasn't
picked up correctly while sending out patches)
v4:
- rebased on 6.11-rc6
- envcfg: Converged with Samuel Holland's patches for envcfg management on per-
thread basis.
- vma_is_shadow_stack is renamed to is_vma_shadow_stack
- picked up Mark Brown's `ARCH_HAS_USER_SHADOW_STACK` patch
- signal context: using extended context management to maintain compatibility.
- fixed `-Wmissing-prototypes` compiler warnings for prctl functions
- Documentation fixes and amending typos.
- Link to v4: https://lore.kernel.org/all/20240912231650.3740732-1-debug@rivosinc.com/
v3:
- envcfg
logic to pick up base envcfg had a bug where `ENVCFG_CBZE` could have been
picked on per task basis, even though CPU didn't implement it. Fixed in
this series.
- dt-bindings
As suggested, split into separate commit. fixed the messaging that spec is
in public review
- arch_is_shadow_stack change
arch_is_shadow_stack changed to vma_is_shadow_stack
- hwprobe
zicfiss / zicfilp if present will get enumerated in hwprobe
- selftests
As suggested, added object and binary filenames to .gitignore
Selftest binary anyways need to be compiled with cfi enabled compiler which
will make sure that landing pad and shadow stack are enabled. Thus removed
separate enable/disable tests. Cleaned up tests a bit.
- Link to v3: https://lore.kernel.org/lkml/20240403234054.2020347-1-debug@rivosinc.com/
v2:
- Using config `CONFIG_RISCV_USER_CFI`, kernel support for riscv control flow
integrity for user mode programs can be compiled in the kernel.
- Enabling of control flow integrity for user programs is left to user runtime
- This patch series introduces arch agnostic `prctls` to enable shadow stack
and indirect branch tracking. And implements them on riscv.
---
Andy Chiu (1):
riscv: signal: abstract header saving for setup_sigcontext
Clément Léger (1):
riscv: Add Firmware Feature SBI extensions definitions
Deepak Gupta (26):
mm: helper `is_shadow_stack_vma` to check shadow stack vma
riscv/Kconfig: enable HAVE_EXIT_THREAD for riscv
dt-bindings: riscv: zicfilp and zicfiss in dt-bindings (extensions.yaml)
riscv: zicfiss / zicfilp enumeration
riscv: zicfiss / zicfilp extension csr and bit definitions
riscv: usercfi state for task and save/restore of CSR_SSP on trap entry/exit
riscv/mm : ensure PROT_WRITE leads to VM_READ | VM_WRITE
riscv mm: manufacture shadow stack pte
riscv mmu: teach pte_mkwrite to manufacture shadow stack PTEs
riscv mmu: write protect and shadow stack
riscv/mm: Implement map_shadow_stack() syscall
riscv/shstk: If needed allocate a new shadow stack on clone
prctl: arch-agnostic prctl for indirect branch tracking
riscv: Implements arch agnostic shadow stack prctls
riscv: Implements arch agnostic indirect branch tracking prctls
riscv/traps: Introduce software check exception
riscv/signal: save and restore of shadow stack for signal
riscv/kernel: update __show_regs to print shadow stack register
riscv/ptrace: riscv cfi status and state via ptrace and in core files
riscv/hwprobe: zicfilp / zicfiss enumeration in hwprobe
riscv: enable kernel access to shadow stack memory via FWFT sbi call
riscv: kernel command line option to opt out of user cfi
riscv: create a config for shadow stack and landing pad instr support
riscv: Documentation for landing pad / indirect branch tracking
riscv: Documentation for shadow stack on riscv
kselftest/riscv: kselftest for user mode cfi
Mark Brown (2):
mm: Introduce ARCH_HAS_USER_SHADOW_STACK
prctl: arch-agnostic prctl for shadow stack
Samuel Holland (3):
riscv: Enable cbo.zero only when all harts support Zicboz
riscv: Add support for per-thread envcfg CSR values
riscv: Call riscv_user_isa_enable() only on the boot hart
Documentation/arch/riscv/index.rst | 2 +
Documentation/arch/riscv/zicfilp.rst | 115 +++++
Documentation/arch/riscv/zicfiss.rst | 176 +++++++
.../devicetree/bindings/riscv/extensions.yaml | 14 +
arch/riscv/Kconfig | 21 +
arch/riscv/include/asm/asm-prototypes.h | 1 +
arch/riscv/include/asm/cpufeature.h | 15 +-
arch/riscv/include/asm/csr.h | 16 +
arch/riscv/include/asm/entry-common.h | 2 +
arch/riscv/include/asm/hwcap.h | 2 +
arch/riscv/include/asm/mman.h | 24 +
arch/riscv/include/asm/pgtable.h | 30 +-
arch/riscv/include/asm/processor.h | 3 +
arch/riscv/include/asm/sbi.h | 27 ++
arch/riscv/include/asm/switch_to.h | 8 +
arch/riscv/include/asm/thread_info.h | 3 +
arch/riscv/include/asm/usercfi.h | 89 ++++
arch/riscv/include/asm/vector.h | 3 +
arch/riscv/include/uapi/asm/hwprobe.h | 2 +
arch/riscv/include/uapi/asm/ptrace.h | 22 +
arch/riscv/include/uapi/asm/sigcontext.h | 1 +
arch/riscv/kernel/Makefile | 2 +
arch/riscv/kernel/asm-offsets.c | 8 +
arch/riscv/kernel/cpufeature.c | 13 +-
arch/riscv/kernel/entry.S | 31 +-
arch/riscv/kernel/head.S | 12 +
arch/riscv/kernel/process.c | 31 +-
arch/riscv/kernel/ptrace.c | 83 ++++
arch/riscv/kernel/signal.c | 140 +++++-
arch/riscv/kernel/smpboot.c | 2 -
arch/riscv/kernel/suspend.c | 4 +-
arch/riscv/kernel/sys_hwprobe.c | 2 +
arch/riscv/kernel/sys_riscv.c | 10 +
arch/riscv/kernel/traps.c | 42 ++
arch/riscv/kernel/usercfi.c | 526 +++++++++++++++++++++
arch/riscv/mm/init.c | 2 +-
arch/riscv/mm/pgtable.c | 17 +
arch/x86/Kconfig | 1 +
fs/proc/task_mmu.c | 2 +-
include/linux/cpu.h | 4 +
include/linux/mm.h | 5 +-
include/uapi/asm-generic/mman.h | 4 +
include/uapi/linux/elf.h | 1 +
include/uapi/linux/prctl.h | 48 ++
kernel/sys.c | 60 +++
mm/Kconfig | 6 +
mm/gup.c | 2 +-
mm/mmap.c | 1 +
mm/vma.h | 10 +-
tools/testing/selftests/riscv/Makefile | 2 +-
tools/testing/selftests/riscv/cfi/.gitignore | 3 +
tools/testing/selftests/riscv/cfi/Makefile | 10 +
tools/testing/selftests/riscv/cfi/cfi_rv_test.h | 84 ++++
tools/testing/selftests/riscv/cfi/riscv_cfi_test.c | 78 +++
tools/testing/selftests/riscv/cfi/shadowstack.c | 373 +++++++++++++++
tools/testing/selftests/riscv/cfi/shadowstack.h | 37 ++
56 files changed, 2190 insertions(+), 42 deletions(-)
---
base-commit: 7d9923ee3960bdbfaa7f3a4e0ac2364e770c46ff
change-id: 20240930-v5_user_cfi_series-3dc332f8f5b2
--
- debug
The 2024 architecture release includes a number of data processing
extensions, mostly SVE and SME additions with a few others. These are
all very straightforward extensions which add instructions but no
architectural state so only need hwcaps and exposing of the ID registers
to KVM guests and userspace.
Signed-off-by: Mark Brown <broonie(a)kernel.org>
---
Mark Brown (9):
arm64/sysreg: Update ID_AA64PFR2_EL1 to DDI0601 2024-09
arm64/sysreg: Update ID_AA64ISAR3_EL1 to DDI0601 2024-09
arm64/sysreg: Update ID_AA64FPFR0_EL1 to DDI0601 2024-09
arm64/sysreg: Update ID_AA64ZFR0_EL1 to DDI0601 2024-09
arm64/sysreg: Update ID_AA64SMFR0_EL1 to DDI0601 2024-09
arm64/sysreg: Update ID_AA64ISAR2_EL1 to DDI0601 2024-09
arm64/hwcap: Describe 2024 dpISA extensions to userspace
KVM: arm64: Allow control of dpISA extensions in ID_AA64ISAR3_EL1
kselftest/arm64: Add 2024 dpISA extensions to hwcap test
Documentation/arch/arm64/elf_hwcaps.rst | 51 ++++++
arch/arm64/include/asm/hwcap.h | 17 ++
arch/arm64/include/uapi/asm/hwcap.h | 17 ++
arch/arm64/kernel/cpufeature.c | 35 ++++
arch/arm64/kernel/cpuinfo.c | 17 ++
arch/arm64/kvm/sys_regs.c | 3 +-
arch/arm64/tools/sysreg | 87 +++++++++-
tools/testing/selftests/arm64/abi/hwcap.c | 273 +++++++++++++++++++++++++++++-
8 files changed, 490 insertions(+), 10 deletions(-)
---
base-commit: 8e929cb546ee42c9a61d24fae60605e9e3192354
change-id: 20241008-arm64-2024-dpisa-8091074a7f48
Best regards,
--
Mark Brown <broonie(a)kernel.org>
One of the things that fp-stress does to stress the floating point
context switching is send signals to the test threads it spawns.
Currently we do this once per second but as suggested by Mark Rutland if
we increase this we can improve the chances of triggering any issues
with context switching the signal handling code. Do a quick change to
increase the rate of signal delivery, trying to avoid excessive impact
on emulated platforms, and a further change to mitigate the impact that
this creates during startup.
Signed-off-by: Mark Brown <broonie(a)kernel.org>
---
Mark Brown (2):
kselftest/arm64: Increase frequency of signal delivery in fp-stress
kselftest/arm64: Lower poll interval while waiting for fp-stress children
tools/testing/selftests/arm64/fp/fp-stress.c | 26 ++++++++++++++++----------
1 file changed, 16 insertions(+), 10 deletions(-)
---
base-commit: 8e929cb546ee42c9a61d24fae60605e9e3192354
change-id: 20241028-arm64-fp-stress-interval-8f5e62c06e12
Best regards,
--
Mark Brown <broonie(a)kernel.org>
Userland library functions such as allocators and threading implementations
often require regions of memory to act as 'guard pages' - mappings which,
when accessed, result in a fatal signal being sent to the accessing
process.
The current means by which these are implemented is via a PROT_NONE mmap()
mapping, which provides the required semantics however incur an overhead of
a VMA for each such region.
With a great many processes and threads, this can rapidly add up and incur
a significant memory penalty. It also has the added problem of preventing
merges that might otherwise be permitted.
This series takes a different approach - an idea suggested by Vlasimil
Babka (and before him David Hildenbrand and Jann Horn - perhaps more - the
provenance becomes a little tricky to ascertain after this - please forgive
any omissions!) - rather than locating the guard pages at the VMA layer,
instead placing them in page tables mapping the required ranges.
Early testing of the prototype version of this code suggests a 5 times
speed up in memory mapping invocations (in conjunction with use of
process_madvise()) and a 13% reduction in VMAs on an entirely idle android
system and unoptimised code.
We expect with optimisation and a loaded system with a larger number of
guard pages this could significantly increase, but in any case these
numbers are encouraging.
This way, rather than having separate VMAs specifying which parts of a
range are guard pages, instead we have a VMA spanning the entire range of
memory a user is permitted to access and including ranges which are to be
'guarded'.
After mapping this, a user can specify which parts of the range should
result in a fatal signal when accessed.
By restricting the ability to specify guard pages to memory mapped by
existing VMAs, we can rely on the mappings being torn down when the
mappings are ultimately unmapped and everything works simply as if the
memory were not faulted in, from the point of view of the containing VMAs.
This mechanism in effect poisons memory ranges similar to hardware memory
poisoning, only it is an entirely software-controlled form of poisoning.
The mechanism is implemented via madvise() behaviour - MADV_GUARD_INSTALL
which installs page table-level guard page markers - and
MADV_GUARD_REMOVE - which clears them.
Guard markers can be installed across multiple VMAs and any existing
mappings will be cleared, that is zapped, before installing the guard page
markers in the page tables.
There is no concept of 'nested' guard markers, multiple attempts to install
guard markers in a range will, after the first attempt, have no effect.
Importantly, removing guard markers over a range that contains both guard
markers and ordinary backed memory has no effect on anything but the guard
markers (including leaving huge pages un-split), so a user can safely
remove guard markers over a range of memory leaving the rest intact.
The actual mechanism by which the page table entries are specified makes
use of existing logic - PTE markers, which are used for the userfaultfd
UFFDIO_POISON mechanism.
Unfortunately PTE_MARKER_POISONED is not suited for the guard page
mechanism as it results in VM_FAULT_HWPOISON semantics in the fault
handler, so we add our own specific PTE_MARKER_GUARD and adapt existing
logic to handle it.
We also extend the generic page walk mechanism to allow for installation of
PTEs (carefully restricted to memory management logic only to prevent
unwanted abuse).
We ensure that zapping performed by MADV_DONTNEED and MADV_FREE do not
remove guard markers, nor does forking (except when VM_WIPEONFORK is
specified for a VMA which implies a total removal of memory
characteristics).
It's important to note that the guard page implementation is emphatically
NOT a security feature, so a user can remove the markers if they wish. We
simply implement it in such a way as to provide the least surprising
behaviour.
An extensive set of self-tests are provided which ensure behaviour is as
expected and additionally self-documents expected behaviour of guard
ranges.
Suggested-by: Vlastimil Babka <vbabka(a)suse.cz>
Suggested-by: Jann Horn <jannh(a)google.com>
Suggested-by: David Hildenbrand <david(a)redhat.com>
v4
* Use restart_syscall() to implement -ERESTARTNOINTR to ensure correctly
handled by kernel - tested this code path and confirmed it works
correctly. Thanks to Vlastimil for pointing this issue out!
* Updated the vector_madvise() handler to not unnecessarily invoke
cond_resched() as suggested by Vlastimil.
* Updated guard page tests to add a test for a vector operation which
overwrites existing mappings. Tested this against the -ERESTARTNOINTR
case and confirmed working.
* Improved page walk logic further, refactoring handling logic as suggested
by Vlastimil.
* Moved MAX_MADVISE_GUARD_RETRIES to mm/madvise.c as suggested by Vlastimil.
v3
* Cleaned up mm/pagewalk.c logic a bit to make things clearer, as suggested
by Vlastiml.
* Explicitly avoid splitting THP on PTE installation, as suggested by
Vlastimil. Note this has no impact on the guard pages logic, which has
page table entry handlers at PUD, PMD and PTE level.
* Added WARN_ON_ONCE() to mm/hugetlb.c path where we don't expect a guard
marker, as suggested by Vlastimil.
* Reverted change to is_poisoned_swp_entry() to exclude guard pages which
has the effect of MADV_FREE _not_ clearing guard pages. After discussion
with Vlastimil, it became apparent that the ability to 'cancel' the
freeing operation by writing to the mapping after having issued an
MADV_FREE would mean that we would risk unexpected behaviour should the
guard pages be removed, so we now do not remove markers here at all.
* Added comment to PTE_MARKER_GUARD to highlight that memory tagged with
the marker behaves as if it were a region mapped PROT_NONE, as
highlighted by David.
* Rename poison -> install, unpoison -> remove (i.e. MADV_GUARD_INSTALL /
MADV_GUARD_REMOVE over MADV_GUARD_POISON / MADV_GUARD_REMOVE) at the
request of David and John who both find the poison analogy
confusing/overloaded.
* After a lot of discussion, replace the looping behaviour should page
faults race with guard page installation with a modest reattempt followed
by returning -ERESTARTNOINTR to have the operation abort and re-enter,
relieving lock contention and avoiding the possibility of allowing a
malicious sandboxed process to impact the mmap lock or stall the overall
process more than necessary, as suggested by Jann and Vlastimil having
raised the issue.
* Adjusted the page table walker so a populated huge PUD or PMD is
correctly treated as being populated, necessitating a zap. In v2 we
incorrectly skipped over these, which would cause the logic to wrongly
proceed as if nothing were populated and the install succeeded.
Instead, explicitly check to see if a huge page - if so, do not split but
rather abort the operation and let zap take care of things.
* Updated the guard remove logic to not unnecessarily split huge pages
either.
* Added a debug check to assert that the number of installed PTEs matches
expectation, accounting for any existing guard pages.
* Adapted vector_madvise() used by the process_madvise() system call to
handle -ERESTARTNOINTR correctly.
https://lore.kernel.org/all/cover.1729699916.git.lorenzo.stoakes@oracle.com/
v2
* The macros in kselftest_harness.h seem to be broken - __EXPECT() is
terminated by '} while (0); OPTIONAL_HANDLER(_assert)' meaning it is not
safe in single line if / else or for /which blocks, however working
around this results in checkpatch producing invalid warnings, as reported
by Shuah.
* Fixing these macros is out of scope for this series, so compromise and
instead rewrite test blocks so as to use multiple lines by separating out
a decl in most cases. This has the side effect of, for the most part,
making things more readable.
* Heavily document the use of the volatile keyword - we can't avoid
checkpatch complaining about this, so we explain it, as reported by
Shuah.
* Updated commit message to highlight that we skip tests we lack
permissions for, as reported by Shuah.
* Replaced a perror() with ksft_exit_fail_perror(), as reported by Shuah.
* Added user friendly messages to cases where tests are skipped due to lack
of permissions, as reported by Shuah.
* Update the tool header to include the new MADV_GUARD_POISON/UNPOISON
defines and directly include asm-generic/mman.h to get the
platform-neutral versions to ensure we import them.
* Finally fixed Vlastimil's email address in Suggested-by tags from suze to
suse, as reported by Vlastimil.
* Added linux-api to cc list, as reported by Vlastimil.
https://lore.kernel.org/all/cover.1729440856.git.lorenzo.stoakes@oracle.com/
v1
* Un-RFC'd as appears no major objections to approach but rather debate on
implementation.
* Fixed issue with arches which need mmu_context.h and
tlbfush.h. header imports in pagewalker logic to be able to use
update_mmu_cache() as reported by the kernel test bot.
* Added comments in page walker logic to clarify who can use
ops->install_pte and why as well as adding a check_ops_valid() helper
function, as suggested by Christoph.
* Pass false in full parameter in pte_clear_not_present_full() as suggested
by Jann.
* Stopped erroneously requiring a write lock for the poison operation as
suggested by Jann and Suren.
* Moved anon_vma_prepare() to the start of madvise_guard_poison() to be
consistent with how this is used elsewhere in the kernel as suggested by
Jann.
* Avoid returning -EAGAIN if we are raced on page faults, just keep looping
and duck out if a fatal signal is pending or a conditional reschedule is
needed, as suggested by Jann.
* Avoid needlessly splitting huge PUDs and PMDs by specifying
ACTION_CONTINUE, as suggested by Jann.
https://lore.kernel.org/all/cover.1729196871.git.lorenzo.stoakes@oracle.com/
RFC
https://lore.kernel.org/all/cover.1727440966.git.lorenzo.stoakes@oracle.com/
Lorenzo Stoakes (5):
mm: pagewalk: add the ability to install PTEs
mm: add PTE_MARKER_GUARD PTE marker
mm: madvise: implement lightweight guard page mechanism
tools: testing: update tools UAPI header for mman-common.h
selftests/mm: add self tests for guard page feature
arch/alpha/include/uapi/asm/mman.h | 3 +
arch/mips/include/uapi/asm/mman.h | 3 +
arch/parisc/include/uapi/asm/mman.h | 3 +
arch/xtensa/include/uapi/asm/mman.h | 3 +
include/linux/mm_inline.h | 2 +-
include/linux/pagewalk.h | 18 +-
include/linux/swapops.h | 24 +-
include/uapi/asm-generic/mman-common.h | 3 +
mm/hugetlb.c | 4 +
mm/internal.h | 6 +
mm/madvise.c | 239 ++++
mm/memory.c | 18 +-
mm/mprotect.c | 6 +-
mm/mseal.c | 1 +
mm/pagewalk.c | 246 +++-
tools/include/uapi/asm-generic/mman-common.h | 3 +
tools/testing/selftests/mm/.gitignore | 1 +
tools/testing/selftests/mm/Makefile | 1 +
tools/testing/selftests/mm/guard-pages.c | 1243 ++++++++++++++++++
19 files changed, 1751 insertions(+), 76 deletions(-)
create mode 100644 tools/testing/selftests/mm/guard-pages.c
--
2.47.0
Use a less populated IP range to run the tests, as suggested by Petr in
Link: https://lore.kernel.org/netdev/87ikvukv3s.fsf@nvidia.com/.
Suggested-by: Petr Machata <petrm(a)nvidia.com>
Signed-off-by: Breno Leitao <leitao(a)debian.org>
---
tools/testing/selftests/drivers/net/netcons_basic.sh | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/tools/testing/selftests/drivers/net/netcons_basic.sh b/tools/testing/selftests/drivers/net/netcons_basic.sh
index 06021b2059b7..4ad1e216c6b0 100755
--- a/tools/testing/selftests/drivers/net/netcons_basic.sh
+++ b/tools/testing/selftests/drivers/net/netcons_basic.sh
@@ -20,9 +20,9 @@ SCRIPTDIR=$(dirname "$(readlink -e "${BASH_SOURCE[0]}")")
# Simple script to test dynamic targets in netconsole
SRCIF="" # to be populated later
-SRCIP=192.168.1.1
+SRCIP=192.168.2.1
DSTIF="" # to be populated later
-DSTIP=192.168.1.2
+DSTIP=192.168.2.2
PORT="6666"
MSG="netconsole selftest"
--
2.43.5
Extend netcons_basic selftest to verify the userdata functionality by:
1. Creating a test key in the userdata configfs directory
2. Writing a known value to the key
3. Validating the key-value pair appears in the captured network output
This ensures the userdata feature is properly tested during selftests.
Signed-off-by: Breno Leitao <leitao(a)debian.org>
---
.../selftests/drivers/net/netcons_basic.sh | 29 +++++++++++++++++++
1 file changed, 29 insertions(+)
diff --git a/tools/testing/selftests/drivers/net/netcons_basic.sh b/tools/testing/selftests/drivers/net/netcons_basic.sh
index 8d28e5189e91..182eb1a97e59 100755
--- a/tools/testing/selftests/drivers/net/netcons_basic.sh
+++ b/tools/testing/selftests/drivers/net/netcons_basic.sh
@@ -26,10 +26,13 @@ DSTIP=192.0.2.2
PORT="6666"
MSG="netconsole selftest"
+USERDATA_KEY="key"
+USERDATA_VALUE="value"
TARGET=$(mktemp -u netcons_XXXXX)
DEFAULT_PRINTK_VALUES=$(cat /proc/sys/kernel/printk)
NETCONS_CONFIGFS="/sys/kernel/config/netconsole"
NETCONS_PATH="${NETCONS_CONFIGFS}"/"${TARGET}"
+KEY_PATH="${NETCONS_PATH}/userdata/${USERDATA_KEY}"
# NAMESPACE will be populated by setup_ns with a random value
NAMESPACE=""
@@ -122,6 +125,8 @@ function cleanup() {
# delete netconsole dynamic reconfiguration
echo 0 > "${NETCONS_PATH}"/enabled
+ # Remove key
+ rmdir "${KEY_PATH}"
# Remove the configfs entry
rmdir "${NETCONS_PATH}"
@@ -136,6 +141,18 @@ function cleanup() {
echo "${DEFAULT_PRINTK_VALUES}" > /proc/sys/kernel/printk
}
+function set_user_data() {
+ if [[ ! -d "${NETCONS_PATH}""/userdata" ]]
+ then
+ echo "Userdata path not available in ${NETCONS_PATH}/userdata"
+ exit "${ksft_skip}"
+ fi
+
+ mkdir -p "${KEY_PATH}"
+ VALUE_PATH="${KEY_PATH}""/value"
+ echo "${USERDATA_VALUE}" > "${VALUE_PATH}"
+}
+
function listen_port_and_save_to() {
local OUTPUT=${1}
# Just wait for 2 seconds
@@ -146,6 +163,10 @@ function listen_port_and_save_to() {
function validate_result() {
local TMPFILENAME="$1"
+ # TMPFILENAME will contain something like:
+ # 6.11.1-0_fbk0_rc13_509_g30d75cea12f7,13,1822,115075213798,-;netconsole selftest: netcons_gtJHM
+ # key=value
+
# Check if the file exists
if [ ! -f "$TMPFILENAME" ]; then
echo "FAIL: File was not generated." >&2
@@ -158,6 +179,12 @@ function validate_result() {
exit "${ksft_fail}"
fi
+ if ! grep -q "${USERDATA_KEY}=${USERDATA_VALUE}" "${TMPFILENAME}"; then
+ echo "FAIL: ${USERDATA_KEY}=${USERDATA_VALUE} not found in ${TMPFILENAME}" >&2
+ cat "${TMPFILENAME}" >&2
+ exit "${ksft_fail}"
+ fi
+
# Delete the file once it is validated, otherwise keep it
# for debugging purposes
rm "${TMPFILENAME}"
@@ -220,6 +247,8 @@ trap cleanup EXIT
set_network
# Create a dynamic target for netconsole
create_dynamic_target
+# Set userdata "key" with the "value" value
+set_user_data
# Listed for netconsole port inside the namespace and destination interface
listen_port_and_save_to "${OUTPUT_FILE}" &
# Wait for socat to start and listen to the port.
--
2.43.5
Use a less populated IP range to run the tests, as suggested by Petr in
Link: https://lore.kernel.org/netdev/87ikvukv3s.fsf@nvidia.com/.
Suggested-by: Petr Machata <petrm(a)nvidia.com>
Signed-off-by: Breno Leitao <leitao(a)debian.org>
---
tools/testing/selftests/drivers/net/netcons_basic.sh | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/tools/testing/selftests/drivers/net/netcons_basic.sh b/tools/testing/selftests/drivers/net/netcons_basic.sh
index 06021b2059b7..8d28e5189e91 100755
--- a/tools/testing/selftests/drivers/net/netcons_basic.sh
+++ b/tools/testing/selftests/drivers/net/netcons_basic.sh
@@ -20,9 +20,9 @@ SCRIPTDIR=$(dirname "$(readlink -e "${BASH_SOURCE[0]}")")
# Simple script to test dynamic targets in netconsole
SRCIF="" # to be populated later
-SRCIP=192.168.1.1
+SRCIP=192.0.2.1
DSTIF="" # to be populated later
-DSTIP=192.168.1.2
+DSTIP=192.0.2.2
PORT="6666"
MSG="netconsole selftest"
--
2.43.5
This patch series is motivated by the following observation:
Raise a signal, jump to signal handler. The ucontext_t structure dumped
by kernel to userspace has a uc_sigmask field having the mask of blocked
signals. If you run a fresh minimalistic program doing this, this field
is empty, even if you block some signals while registering the handler
with sigaction().
Here is what the man-pages have to say:
sigaction(2): "sa_mask specifies a mask of signals which should be blocked
(i.e., added to the signal mask of the thread in which the signal handler
is invoked) during execution of the signal handler. In addition, the
signal which triggered the handler will be blocked, unless the SA_NODEFER
flag is used."
signal(7): Under "Execution of signal handlers", (1.3) implies:
"The thread's current signal mask is accessible via the ucontext_t
object that is pointed to by the third argument of the signal handler."
But, (1.4) states:
"Any signals specified in act->sa_mask when registering the handler with
sigprocmask(2) are added to the thread's signal mask. The signal being
delivered is also added to the signal mask, unless SA_NODEFER was
specified when registering the handler. These signals are thus blocked
while the handler executes."
There clearly is no distinction being made in the man pages between
"Thread's signal mask" and ucontext_t; this logically should imply
that a signal blocked by populating struct sigaction should be visible
in ucontext_t.
Here is what the kernel code does (for Aarch64):
do_signal() -> handle_signal() -> sigmask_to_save(), which returns
¤t->blocked, is passed to setup_rt_frame() -> setup_sigframe() ->
__copy_to_user(). Hence, ¤t->blocked is copied to ucontext_t
exposed to userspace. Returning back to handle_signal(),
signal_setup_done() -> signal_delivered() -> sigorsets() and
set_current_blocked() are responsible for using information from
struct ksignal ksig, which was populated through the sigaction()
system call in kernel/signal.c:
copy_from_user(&new_sa.sa, act, sizeof(new_sa.sa)),
to update ¤t->blocked; hence, the set of blocked signals for the
current thread is updated AFTER the kernel dumps ucontext_t to
userspace.
Assuming that the above is indeed the intended behaviour, because it
semantically makes sense, since the signals blocked using sigaction()
remain blocked only till the execution of the handler, and not in the
context present before jumping to the handler (but nothing can be
confirmed from the man-pages), the series introduces a test for
mangling with uc_sigmask. I will send a separate series to fix the
man-pages.
The proposed selftest has been tested out on Aarch32, Aarch64 and x86_64.
v5->v6:
- Drop renaming of sas.c
- Include the explanation from the cover letter in the changelog
for the second patch
v4->v5:
- Remove a redundant print statement
v3->v4:
- Allocate sigsets as automatic variables to avoid malloc()
v2->v3:
- ucontext describes current state -> ucontext describes interrupted context
- Add a comment for blockage of USR2 even after return from handler
- Describe blockage of signals in a better way
v1->v2:
- Replace all occurrences of SIGPIPE with SIGSEGV
- Fixed a mismatch between code comment and ksft log
- Add a testcase: Raise the same signal again; it must not be queued
- Remove unneeded <assert.h>, <unistd.h>
- Give a detailed test description in the comments; also describe the
exact meaning of delivered and blocked
- Handle errors for all libc functions/syscalls
- Mention tests in Makefile and .gitignore in alphabetical order
v1:
- https://lore.kernel.org/all/20240607122319.768640-1-dev.jain@arm.com/
Dev Jain (2):
selftests: Rename sigaltstack to generic signal
selftests: Add a test mangling with uc_sigmask
tools/testing/selftests/Makefile | 2 +-
.../{sigaltstack => signal}/.gitignore | 1 +
.../{sigaltstack => signal}/Makefile | 3 +-
.../current_stack_pointer.h | 0
.../selftests/signal/mangle_uc_sigmask.c | 184 ++++++++++++++++++
.../selftests/{sigaltstack => signal}/sas.c | 0
6 files changed, 188 insertions(+), 2 deletions(-)
rename tools/testing/selftests/{sigaltstack => signal}/.gitignore (70%)
rename tools/testing/selftests/{sigaltstack => signal}/Makefile (56%)
rename tools/testing/selftests/{sigaltstack => signal}/current_stack_pointer.h (100%)
create mode 100644 tools/testing/selftests/signal/mangle_uc_sigmask.c
rename tools/testing/selftests/{sigaltstack => signal}/sas.c (100%)
--
2.30.2
When running watchdog-test with 'make run_tests', the watchdog-test will
be terminated by a timeout signal(SIGTERM) due to the test timemout.
And then, a system reboot would happen due to watchdog not stop. see
the dmesg as below:
```
[ 1367.185172] watchdog: watchdog0: watchdog did not stop!
```
Fix it by registering more signals(including SIGTERM) in watchdog-test,
where its signal handler will stop the watchdog.
After that
# timeout 1 ./watchdog-test
Watchdog Ticking Away!
.
Stopping watchdog ticks...
Signed-off-by: Li Zhijian <lizhijian(a)fujitsu.com>
---
tools/testing/selftests/watchdog/watchdog-test.c | 6 ++++++
1 file changed, 6 insertions(+)
diff --git a/tools/testing/selftests/watchdog/watchdog-test.c b/tools/testing/selftests/watchdog/watchdog-test.c
index bc71cbca0dde..a1f506ba5578 100644
--- a/tools/testing/selftests/watchdog/watchdog-test.c
+++ b/tools/testing/selftests/watchdog/watchdog-test.c
@@ -334,7 +334,13 @@ int main(int argc, char *argv[])
printf("Watchdog Ticking Away!\n");
+ /*
+ * Register the signals
+ */
signal(SIGINT, term);
+ signal(SIGTERM, term);
+ signal(SIGKILL, term);
+ signal(SIGQUIT, term);
while (1) {
keep_alive();
--
2.44.0
Currently, watchdog-test keep running until it gets a SIGINT. However,
when watchdog-test is executed from the kselftests framework, where it
launches test via timeout which will send SIGTERM in time up. This could
lead to
1. watchdog haven't stop, a watchdog reset is triggered to reboot the OS
in silent.
2. kselftests gets an timeout exit code, and judge watchdog-test as
'not ok'
This patch is prepare to fix above 2 issues
Signed-off-by: Li Zhijian <lizhijian(a)fujitsu.com>
---
Hey,
Cover letter is here.
It's notice that a OS reboot was triggerred after ran the watchdog-test
in kselftests framwork 'make run_tests', that's because watchdog-test
didn't stop feeding the watchdog after enable it.
In addition, current watchdog-test didn't adapt to the kselftests
framework which launchs the test with /usr/bin/timeout and no timeout
is expected.
---
tools/testing/selftests/watchdog/watchdog-test.c | 13 +++++++++----
1 file changed, 9 insertions(+), 4 deletions(-)
diff --git a/tools/testing/selftests/watchdog/watchdog-test.c b/tools/testing/selftests/watchdog/watchdog-test.c
index bc71cbca0dde..2f8fd2670897 100644
--- a/tools/testing/selftests/watchdog/watchdog-test.c
+++ b/tools/testing/selftests/watchdog/watchdog-test.c
@@ -27,7 +27,7 @@
int fd;
const char v = 'V';
-static const char sopts[] = "bdehp:st:Tn:NLf:i";
+static const char sopts[] = "bdehp:st:Tn:NLf:c:i";
static const struct option lopts[] = {
{"bootstatus", no_argument, NULL, 'b'},
{"disable", no_argument, NULL, 'd'},
@@ -42,6 +42,7 @@ static const struct option lopts[] = {
{"gettimeleft", no_argument, NULL, 'L'},
{"file", required_argument, NULL, 'f'},
{"info", no_argument, NULL, 'i'},
+ {"count", required_argument, NULL, 'c'},
{NULL, no_argument, NULL, 0x0}
};
@@ -95,6 +96,7 @@ static void usage(char *progname)
printf(" -n, --pretimeout=T\tSet the pretimeout to T seconds\n");
printf(" -N, --getpretimeout\tGet the pretimeout\n");
printf(" -L, --gettimeleft\tGet the time left until timer expires\n");
+ printf(" -c, --count\tStop after feeding the watchdog count times\n");
printf("\n");
printf("Parameters are parsed left-to-right in real-time.\n");
printf("Example: %s -d -t 10 -p 5 -e\n", progname);
@@ -174,7 +176,7 @@ int main(int argc, char *argv[])
unsigned int ping_rate = DEFAULT_PING_RATE;
int ret;
int c;
- int oneshot = 0;
+ int oneshot = 0, stop = 1, count = 0;
char *file = "/dev/watchdog";
struct watchdog_info info;
int temperature;
@@ -307,6 +309,9 @@ int main(int argc, char *argv[])
else
printf("WDIOC_GETTIMELEFT error '%s'\n", strerror(errno));
break;
+ case 'c':
+ stop = 0;
+ count = strtoul(optarg, NULL, 0);
case 'f':
/* Handled above */
break;
@@ -336,8 +341,8 @@ int main(int argc, char *argv[])
signal(SIGINT, term);
- while (1) {
- keep_alive();
+ while (stop || count--) {
+ exit_code = keep_alive();
sleep(ping_rate);
}
end:
--
2.44.0
From: zhouyuhang <zhouyuhang(a)kylinos.cn>
Test case idmap_mount_tree_invalid failed to run on the newer kernel
with the following output:
# RUN mount_setattr_idmapped.idmap_mount_tree_invalid ...
# mount_setattr_test.c:1428:idmap_mount_tree_invalid:Expected sys_mount_setattr(open_tree_fd, "", AT_EMPTY_PATH, &attr, sizeof(attr)) (0) ! = 0 (0)
# idmap_mount_tree_invalid: Test terminated by assertion
This is because tmpfs is mounted at "/mnt/A", and tmpfs already
contains the flag FS_ALLOW_IDMAP after the commit 7a80e5b8c6fa ("shmem:
support idmapped mounts for tmpfs"). So calling sys_mount_setattr here
returns 0 instead of -EINVAL as expected.
Ramfs does not support idmap mounts, so we can use it here to test invalid mounts,
which allows the test case to pass with the following output:
# Starting 1 tests from 1 test cases.
# RUN mount_setattr_idmapped.idmap_mount_tree_invalid ...
# OK mount_setattr_idmapped.idmap_mount_tree_invalid
ok 1 mount_setattr_idmapped.idmap_mount_tree_invalid
# PASSED: 1 / 1 tests passed.
Signed-off-by: zhouyuhang <zhouyuhang(a)kylinos.cn>
---
.../testing/selftests/mount_setattr/mount_setattr_test.c | 9 +++++++++
1 file changed, 9 insertions(+)
diff --git a/tools/testing/selftests/mount_setattr/mount_setattr_test.c b/tools/testing/selftests/mount_setattr/mount_setattr_test.c
index c6a8c732b802..68801e1a9ec2 100644
--- a/tools/testing/selftests/mount_setattr/mount_setattr_test.c
+++ b/tools/testing/selftests/mount_setattr/mount_setattr_test.c
@@ -1414,6 +1414,13 @@ TEST_F(mount_setattr_idmapped, idmap_mount_tree_invalid)
ASSERT_EQ(expected_uid_gid(-EBADF, "/tmp/B/b", 0, 0, 0), 0);
ASSERT_EQ(expected_uid_gid(-EBADF, "/tmp/B/BB/b", 0, 0, 0), 0);
+ ASSERT_EQ(mount("testing", "/mnt/A", "ramfs", MS_NOATIME | MS_NODEV,
+ "size=100000,mode=700"), 0);
+
+ ASSERT_EQ(mkdir("/mnt/A/AA", 0777), 0);
+
+ ASSERT_EQ(mount("/tmp", "/mnt/A/AA", NULL, MS_BIND | MS_REC, NULL), 0);
+
open_tree_fd = sys_open_tree(-EBADF, "/mnt/A",
AT_RECURSIVE |
AT_EMPTY_PATH |
@@ -1433,6 +1440,8 @@ TEST_F(mount_setattr_idmapped, idmap_mount_tree_invalid)
ASSERT_EQ(expected_uid_gid(-EBADF, "/tmp/B/BB/b", 0, 0, 0), 0);
ASSERT_EQ(expected_uid_gid(open_tree_fd, "B/b", 0, 0, 0), 0);
ASSERT_EQ(expected_uid_gid(open_tree_fd, "B/BB/b", 0, 0, 0), 0);
+
+ (void)umount2("/mnt/A", MNT_DETACH);
}
TEST_F(mount_setattr, mount_attr_nosymfollow)
--
2.27.0
Good day,
My name is Evan from Ukraine.I have tried to reach you but find it difficult due to internet scarcity here in this town.
I am contacting you because I want to come over to your country,I have some huge funds I plan to invest in your country because I want to relocate my Late Father business due to ongoing war in my country Ukraine and visit your country to set up new investment business you may advice profitable in your area.
I also have some millions of dollars belonging to my late father that i want to invest and 995kg of 24+carat 99.9% purity gold I want to ship to your country and establish gold jewelry manufacturing company.
I will offer you good monetary rewards for your help,due to how russian drone bombards this town frequently,internet network is disrupted because of attacks on network installations,please for fast discussion, you can add me on whatsapp and let's talk faster as i want to leave this war zone as soon as possible,this is my whatsapp number below.
+380-672575220
Email:evanbohdan@gmail.com
I wait for your earliest response.
Regards,
Evan
Whatsapp/Tel:+380-672575220
Email:evanbohdan@gmail.com
From: Jason Xing <kernelxing(a)tencent.com>
As we can see from the title, when I compiled the selftests/bpf, I
saw the error:
implicit declaration of function ‘gettid’ ; did you mean ‘getgid’? [-Werror=implicit-function-declaration]
skel->bss->tid = gettid();
^~~~~~
getgid
Adding a define to fix it (referring to
tools/perf/tests/shell/coresight/thread_loop/thread_loop.c file.
Signed-off-by: Jason Xing <kernelxing(a)tencent.com>
---
tools/testing/selftests/bpf/prog_tests/bpf_iter.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/tools/testing/selftests/bpf/prog_tests/bpf_iter.c b/tools/testing/selftests/bpf/prog_tests/bpf_iter.c
index f0a3a9c18e9e..a105759f3dcf 100644
--- a/tools/testing/selftests/bpf/prog_tests/bpf_iter.c
+++ b/tools/testing/selftests/bpf/prog_tests/bpf_iter.c
@@ -34,6 +34,8 @@
#include "bpf_iter_ksym.skel.h"
#include "bpf_iter_sockmap.skel.h"
+#define gettid() syscall(SYS_gettid)
+
static void test_btf_id_or_null(void)
{
struct bpf_iter_test_kern3 *skel;
--
2.37.3
Fixup small broken window panes in DAMON selftests and kunit tests.
First four patches clean up DAMON debugfs interface selftests output, by
fixing segmentation fault of a test program (patch 1), removing
unnecessary debugging messages (patch 2), and hiding error messages from
expected failures (patches 3 and 4).
Following two patches fix copy-paste mistakes in DAMON Kconfig help
message that copied from debugfs kunit test (patch 5) and a comment on
the debugfs kunit test code (patch 6).
Signed-off-by: SeongJae Park <sj(a)kernel.org>
Andrew Paniakin (1):
selftests/damon/huge_count_read_write: provide sufficiently large
buffer for DEPRECATED file read
SeongJae Park (5):
selftests/damon/huge_count_read_write: remove unnecessary debugging
message
selftests/damon/_debugfs_common: hide expected error message from
test_write_result()
selftests/damon/debugfs_duplicate_context_creation: hide errors from
expected file write failures
mm/damon/Kconfig: update DBGFS_KUNIT prompt copy for SYSFS_KUNIT
mm/damon/tests/dbgfs-kunit: fix the header double inclusion guarding
ifdef comment
mm/damon/Kconfig | 2 +-
mm/damon/tests/dbgfs-kunit.h | 2 +-
tools/testing/selftests/damon/_debugfs_common.sh | 7 ++++++-
.../selftests/damon/debugfs_duplicate_context_creation.sh | 2 +-
tools/testing/selftests/damon/huge_count_read_write.c | 4 +---
5 files changed, 10 insertions(+), 7 deletions(-)
base-commit: 13583c750117b4e10cdaf5578dcc7723b305ce4e
--
2.39.5
Two small fixes related to the MPTCP packets scheduler:
- Patch 1: add missing rcu_read_(un)lock(). A fix for >= 6.6.
- Patch 2: remove unneeded lock when listing packets schedulers. A fix
for >= 6.10.
And some modifications in the MPTCP selftests:
- Patch 3: a small addition to the MPTCP selftests to cover more code.
Signed-off-by: Matthieu Baerts (NGI0) <matttbe(a)kernel.org>
---
Matthieu Baerts (NGI0) (3):
mptcp: init: protect sched with rcu_read_lock
mptcp: remove unneeded lock when listing scheds
selftests: mptcp: list sysctl data
net/mptcp/protocol.c | 2 ++
net/mptcp/sched.c | 2 --
tools/testing/selftests/net/mptcp/mptcp_connect.sh | 9 +++++++++
3 files changed, 11 insertions(+), 2 deletions(-)
---
base-commit: 3b05b9c36ddd01338e1352588f2ec1ea23f97d43
change-id: 20241021-net-mptcp-sched-lock-10dfc75d1e00
Best regards,
--
Matthieu Baerts (NGI0) <matttbe(a)kernel.org>
Userland library functions such as allocators and threading implementations
often require regions of memory to act as 'guard pages' - mappings which,
when accessed, result in a fatal signal being sent to the accessing
process.
The current means by which these are implemented is via a PROT_NONE mmap()
mapping, which provides the required semantics however incur an overhead of
a VMA for each such region.
With a great many processes and threads, this can rapidly add up and incur
a significant memory penalty. It also has the added problem of preventing
merges that might otherwise be permitted.
This series takes a different approach - an idea suggested by Vlasimil
Babka (and before him David Hildenbrand and Jann Horn - perhaps more - the
provenance becomes a little tricky to ascertain after this - please forgive
any omissions!) - rather than locating the guard pages at the VMA layer,
instead placing them in page tables mapping the required ranges.
Early testing of the prototype version of this code suggests a 5 times
speed up in memory mapping invocations (in conjunction with use of
process_madvise()) and a 13% reduction in VMAs on an entirely idle android
system and unoptimised code.
We expect with optimisation and a loaded system with a larger number of
guard pages this could significantly increase, but in any case these
numbers are encouraging.
This way, rather than having separate VMAs specifying which parts of a
range are guard pages, instead we have a VMA spanning the entire range of
memory a user is permitted to access and including ranges which are to be
'guarded'.
After mapping this, a user can specify which parts of the range should
result in a fatal signal when accessed.
By restricting the ability to specify guard pages to memory mapped by
existing VMAs, we can rely on the mappings being torn down when the
mappings are ultimately unmapped and everything works simply as if the
memory were not faulted in, from the point of view of the containing VMAs.
This mechanism in effect poisons memory ranges similar to hardware memory
poisoning, only it is an entirely software-controlled form of poisoning.
The mechanism is implemented via madvise() behaviour - MADV_GUARD_INSTALL
which installs page table-level guard page markers - and
MADV_GUARD_REMOVE - which clears them.
Guard markers can be installed across multiple VMAs and any existing
mappings will be cleared, that is zapped, before installing the guard page
markers in the page tables.
There is no concept of 'nested' guard markers, multiple attempts to install
guard markers in a range will, after the first attempt, have no effect.
Importantly, removing guard markers over a range that contains both guard
markers and ordinary backed memory has no effect on anything but the guard
markers (including leaving huge pages un-split), so a user can safely
remove guard markers over a range of memory leaving the rest intact.
The actual mechanism by which the page table entries are specified makes
use of existing logic - PTE markers, which are used for the userfaultfd
UFFDIO_POISON mechanism.
Unfortunately PTE_MARKER_POISONED is not suited for the guard page
mechanism as it results in VM_FAULT_HWPOISON semantics in the fault
handler, so we add our own specific PTE_MARKER_GUARD and adapt existing
logic to handle it.
We also extend the generic page walk mechanism to allow for installation of
PTEs (carefully restricted to memory management logic only to prevent
unwanted abuse).
We ensure that zapping performed by MADV_DONTNEED and MADV_FREE do not
remove guard markers, nor does forking (except when VM_WIPEONFORK is
specified for a VMA which implies a total removal of memory
characteristics).
It's important to note that the guard page implementation is emphatically
NOT a security feature, so a user can remove the markers if they wish. We
simply implement it in such a way as to provide the least surprising
behaviour.
An extensive set of self-tests are provided which ensure behaviour is as
expected and additionally self-documents expected behaviour of guard
ranges.
Suggested-by: Vlastimil Babka <vbabka(a)suse.cz>
Suggested-by: Jann Horn <jannh(a)google.com>
Suggested-by: David Hildenbrand <david(a)redhat.com>
v3
* Cleaned up mm/pagewalk.c logic a bit to make things clearer, as suggested
by Vlastiml.
* Explicitly avoid splitting THP on PTE installation, as suggested by
Vlastimil. Note this has no impact on the guard pages logic, which has
page table entry handlers at PUD, PMD and PTE level.
* Added WARN_ON_ONCE() to mm/hugetlb.c path where we don't expect a guard
marker, as suggested by Vlastimil.
* Reverted change to is_poisoned_swp_entry() to exclude guard pages which
has the effect of MADV_FREE _not_ clearing guard pages. After discussion
with Vlastimil, it became apparent that the ability to 'cancel' the
freeing operation by writing to the mapping after having issued an
MADV_FREE would mean that we would risk unexpected behaviour should the
guard pages be removed, so we now do not remove markers here at all.
* Added comment to PTE_MARKER_GUARD to highlight that memory tagged with
the marker behaves as if it were a region mapped PROT_NONE, as
highlighted by David.
* Rename poison -> install, unpoison -> remove (i.e. MADV_GUARD_INSTALL /
MADV_GUARD_REMOVE over MADV_GUARD_POISON / MADV_GUARD_REMOVE) at the
request of David and John who both find the poison analogy
confusing/overloaded.
* After a lot of discussion, replace the looping behaviour should page
faults race with guard page installation with a modest reattempt followed
by returning -ERESTARTNOINTR to have the operation abort and re-enter,
relieving lock contention and avoiding the possibility of allowing a
malicious sandboxed process to impact the mmap lock or stall the overall
process more than necessary, as suggested by Jann and Vlastimil having
raised the issue.
* Adjusted the page table walker so a populated huge PUD or PMD is
correctly treated as being populated, necessitating a zap. In v2 we
incorrectly skipped over these, which would cause the logic to wrongly
proceed as if nothing were populated and the install succeeded.
Instead, explicitly check to see if a huge page - if so, do not split but
rather abort the operation and let zap take care of things.
* Updated the guard remove logic to not unnecessarily split huge pages
either.
* Added a debug check to assert that the number of installed PTEs matches
expectation, accounting for any existing guard pages.
* Adapted vector_madvise() used by the process_madvise() system call to
handle -ERESTARTNOINTR correctly.
v2
* The macros in kselftest_harness.h seem to be broken - __EXPECT() is
terminated by '} while (0); OPTIONAL_HANDLER(_assert)' meaning it is not
safe in single line if / else or for /which blocks, however working
around this results in checkpatch producing invalid warnings, as reported
by Shuah.
* Fixing these macros is out of scope for this series, so compromise and
instead rewrite test blocks so as to use multiple lines by separating out
a decl in most cases. This has the side effect of, for the most part,
making things more readable.
* Heavily document the use of the volatile keyword - we can't avoid
checkpatch complaining about this, so we explain it, as reported by
Shuah.
* Updated commit message to highlight that we skip tests we lack
permissions for, as reported by Shuah.
* Replaced a perror() with ksft_exit_fail_perror(), as reported by Shuah.
* Added user friendly messages to cases where tests are skipped due to lack
of permissions, as reported by Shuah.
* Update the tool header to include the new MADV_GUARD_POISON/UNPOISON
defines and directly include asm-generic/mman.h to get the
platform-neutral versions to ensure we import them.
* Finally fixed Vlastimil's email address in Suggested-by tags from suze to
suse, as reported by Vlastimil.
* Added linux-api to cc list, as reported by Vlastimil.
https://lore.kernel.org/all/cover.1729440856.git.lorenzo.stoakes@oracle.com/
v1
* Un-RFC'd as appears no major objections to approach but rather debate on
implementation.
* Fixed issue with arches which need mmu_context.h and
tlbfush.h. header imports in pagewalker logic to be able to use
update_mmu_cache() as reported by the kernel test bot.
* Added comments in page walker logic to clarify who can use
ops->install_pte and why as well as adding a check_ops_valid() helper
function, as suggested by Christoph.
* Pass false in full parameter in pte_clear_not_present_full() as suggested
by Jann.
* Stopped erroneously requiring a write lock for the poison operation as
suggested by Jann and Suren.
* Moved anon_vma_prepare() to the start of madvise_guard_poison() to be
consistent with how this is used elsewhere in the kernel as suggested by
Jann.
* Avoid returning -EAGAIN if we are raced on page faults, just keep looping
and duck out if a fatal signal is pending or a conditional reschedule is
needed, as suggested by Jann.
* Avoid needlessly splitting huge PUDs and PMDs by specifying
ACTION_CONTINUE, as suggested by Jann.
https://lore.kernel.org/all/cover.1729196871.git.lorenzo.stoakes@oracle.com/
RFC
https://lore.kernel.org/all/cover.1727440966.git.lorenzo.stoakes@oracle.com/
Lorenzo Stoakes (5):
mm: pagewalk: add the ability to install PTEs
mm: add PTE_MARKER_GUARD PTE marker
mm: madvise: implement lightweight guard page mechanism
tools: testing: update tools UAPI header for mman-common.h
selftests/mm: add self tests for guard page feature
arch/alpha/include/uapi/asm/mman.h | 3 +
arch/mips/include/uapi/asm/mman.h | 3 +
arch/parisc/include/uapi/asm/mman.h | 3 +
arch/xtensa/include/uapi/asm/mman.h | 3 +
include/linux/mm_inline.h | 2 +-
include/linux/pagewalk.h | 18 +-
include/linux/swapops.h | 24 +-
include/uapi/asm-generic/mman-common.h | 3 +
mm/hugetlb.c | 4 +
mm/internal.h | 12 +
mm/madvise.c | 225 ++++
mm/memory.c | 18 +-
mm/mprotect.c | 6 +-
mm/mseal.c | 1 +
mm/pagewalk.c | 227 +++-
tools/include/uapi/asm-generic/mman-common.h | 3 +
tools/testing/selftests/mm/.gitignore | 1 +
tools/testing/selftests/mm/Makefile | 1 +
tools/testing/selftests/mm/guard-pages.c | 1239 ++++++++++++++++++
19 files changed, 1720 insertions(+), 76 deletions(-)
create mode 100644 tools/testing/selftests/mm/guard-pages.c
--
2.47.0
The following kselftest arm64 and FVP failed with Linux next-20241025 on
- Qemu-arm64
- FVP
running Linux next-20241025 kernel.
First seen on next-20241025
Good: next-20241024
BAD: next-20241025
kselftest-arm64, FVP
* arm64_check_buffer_fill
* arm64_check_mmap_options
* arm64_check_child_memory
Anyone have seen these failures ?
Reported-by: Linux Kernel Functional Testing <lkft(a)linaro.org>
Test log:
----------
# selftests: arm64: check_buffer_fill
# 1..20
# ok 1 Check buffer correctness by byte with sync err mode and mmap memory
# ok 2 Check buffer correctness by byte with async err mode and mmap memory
# ok 3 Check buffer correctness by byte with sync err mode and
mmap/mprotect memory
# ok 4 Check buffer correctness by byte with async err mode and
mmap/mprotect memory
# ok 5 Check buffer write underflow by byte with sync mode and mmap memory
# ok 6 Check buffer write underflow by byte with async mode and mmap memory
# ok 7 Check buffer write underflow by byte with tag check fault
ignore and mmap memory
# ok 8 Check buffer write underflow by byte with sync mode and mmap memory
# ok 9 Check buffer write underflow by byte with async mode and mmap memory
# ok 10 Check buffer write underflow by byte with tag check fault
ignore and mmap memory
# ok 11 Check buffer write overflow by byte with sync mode and mmap memory
# ok 12 Check buffer write overflow by byte with async mode and mmap memory
# ok 13 Check buffer write overflow by byte with tag fault ignore mode
and mmap memory
# ok 14 Check buffer write correctness by block with sync mode and mmap memory
# ok 15 Check buffer write correctness by block with async mode and mmap memory
# ok 16 Check buffer write correctness by block with tag fault ignore
and mmap memory
# # FAIL: mmap allocation
# # FAIL: memory allocation
# not ok 17 Check initial tags with private mapping, sync error mode
and mmap memory
# ok 18 Check initial tags with private mapping, sync error mode and
mmap/mprotect memory
# # FAIL: mmap allocation
# # FAIL: memory allocation
# not ok 19 Check initial tags with shared mapping, sync error mode
and mmap memory
# ok 20 Check initial tags with shared mapping, sync error mode and
mmap/mprotect memory
# # Totals: pass:18 fail:2 xfail:0 xpass:0 skip:0 error:0
not ok 21 selftests: arm64: check_buffer_fill # exit=1
# timeout set to 45
# selftests: arm64: check_child_memory
# 1..12
# ok 1 Check child anonymous memory with private mapping, precise mode
and mmap memory
# ok 2 Check child anonymous memory with shared mapping, precise mode
and mmap memory
# ok 3 Check child anonymous memory with private mapping, imprecise
mode and mmap memory
# ok 4 Check child anonymous memory with shared mapping, imprecise
mode and mmap memory
# ok 5 Check child anonymous memory with private mapping, precise mode
and mmap/mprotect memory
# ok 6 Check child anonymous memory with shared mapping, precise mode
and mmap/mprotect memory
# # FAIL: mmap allocation
# # FAIL: memory allocation
# not ok 7 Check child file memory with private mapping, precise mode
and mmap memory
# # FAIL: mmap allocation
# # FAIL: memory allocation
# not ok 8 Check child file memory with shared mapping, precise mode
and mmap memory
# ok 9 Check child file memory with private mapping, imprecise mode
and mmap memory
# ok 10 Check child file memory with shared mapping, imprecise mode
and mmap memory
# ok 11 Check child file memory with private mapping, precise mode and
mmap/mprotect memory
# ok 12 Check child file memory with shared mapping, precise mode and
mmap/mprotect memory
# # Totals: pass:10 fail:2 xfail:0 xpass:0 skip:0 error:0
not ok 22 selftests: arm64: check_child_memory # exit=1
boot Log links,
--------
- https://qa-reports.linaro.org/lkft/linux-next-master/build/next-20241028/te…
- https://qa-reports.linaro.org/lkft/linux-next-master/build/next-20241028/te…
Test results history:
----------
- https://qa-reports.linaro.org/lkft/linux-next-master/build/next-20241028/te…
- https://qa-reports.linaro.org/lkft/linux-next-master/build/next-20241028/te…
metadata:
----
git describe: next-20241028
git repo: https://gitlab.com/Linaro/lkft/mirrors/next/linux-next
git sha: dec9255a128e19c5fcc3bdb18175d78094cc624d
kernel config:
https://storage.tuxsuite.com/public/linaro/lkft/builds/2o3tMqzOtHXYQjlvfR5t…
build url: https://storage.tuxsuite.com/public/linaro/lkft/builds/2o3tMqzOtHXYQjlvfR5t…
toolchain: gcc-13
Steps to reproduce:
---------
- https://tuxapi.tuxsuite.com/v1/groups/linaro/projects/lkft/tests/2o3tON5kNi…
- https://tuxapi.tuxsuite.com/v1/groups/linaro/projects/lkft/tests/2o3tON5kNi…
--
Linaro LKFT
https://lkft.linaro.org
As the part-2 of the VIOMMU infrastructure, this series introduces a VIRQ
object after repurposing the existing FAULT object, which provides a nice
notification pathway to the user space already. So, the first thing to do
is reworking the FAULT object.
Mimicing the HWPT structures, add a common EVENT structure to support its
derivatives: EVENT_IOPF (the prior FAULT object) and EVENT_VIRQ (new one).
IOMMUFD_CMD_VIRQ_ALLOC is introduced to allocate EVENT_VIRQ for a VIOMMU.
One VIOMMU can have multiple VIRQs in different types but can not support
multiple VIRQs with the same types.
Drivers might need the VIOMMU's vdev_id list or the exact vdev_id link of
the passthrough device's to forward IRQs/events via the VIOMMU framework.
Thus, extend the set/unset_vdev_id ioctls down to the driver using VIOMMU
ops. This allows drivers to take the control of a vdev_id's lifecycle.
The forwarding part is fairly simple but might need to replace a physical
device ID with a virtual device ID. So, there comes with some helpers for
drivers to use.
As usual, this series comes with the selftest coverage for this new VIRQ,
and with a real world use case in the ARM SMMUv3 driver.
This must be based on the VIOMMU Part-1 series. It's on Github:
https://github.com/nicolinc/iommufd/commits/iommufd_virq-v1
Paring QEMU branch for testing:
https://github.com/nicolinc/qemu/commits/wip/for_iommufd_virq-v1
Thanks!
Nicolin
Nicolin Chen (10):
iommufd: Rename IOMMUFD_OBJ_FAULT to IOMMUFD_OBJ_EVENT_IOPF
iommufd: Rename fault.c to event.c
iommufd: Add IOMMUFD_OBJ_EVENT_VIRQ and IOMMUFD_CMD_VIRQ_ALLOC
iommufd/viommu: Allow drivers to control vdev_id lifecycle
iommufd/viommu: Add iommufd_vdev_id_to_dev helper
iommufd/viommu: Add iommufd_viommu_report_irq helper
iommufd/selftest: Implement mock_viommu_set/unset_vdev_id
iommufd/selftest: Add IOMMU_TEST_OP_TRIGGER_VIRQ for VIRQ coverage
iommufd/selftest: Add EVENT_VIRQ test coverage
iommu/arm-smmu-v3: Report virtual IRQ for device in user space
drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 109 +++-
drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h | 2 +
drivers/iommu/iommufd/Makefile | 2 +-
drivers/iommu/iommufd/device.c | 2 +
drivers/iommu/iommufd/event.c | 613 ++++++++++++++++++
drivers/iommu/iommufd/fault.c | 443 -------------
drivers/iommu/iommufd/hw_pagetable.c | 12 +-
drivers/iommu/iommufd/iommufd_private.h | 147 ++++-
drivers/iommu/iommufd/iommufd_test.h | 10 +
drivers/iommu/iommufd/main.c | 13 +-
drivers/iommu/iommufd/selftest.c | 66 ++
drivers/iommu/iommufd/viommu.c | 25 +-
drivers/iommu/iommufd/viommu_api.c | 54 ++
include/linux/iommufd.h | 28 +
include/uapi/linux/iommufd.h | 46 ++
tools/testing/selftests/iommu/iommufd.c | 11 +
tools/testing/selftests/iommu/iommufd_utils.h | 64 ++
17 files changed, 1130 insertions(+), 517 deletions(-)
create mode 100644 drivers/iommu/iommufd/event.c
delete mode 100644 drivers/iommu/iommufd/fault.c
--
2.43.0
This series introduces a new vIOMMU infrastructure and related ioctls.
IOMMUFD has been using the HWPT infrastructure for all cases, including a
nested IO page table support. Yet, there're limitations for an HWPT-based
structure to support some advanced HW-accelerated features, such as CMDQV
on NVIDIA Grace, and HW-accelerated vIOMMU on AMD. Even for a multi-IOMMU
environment, it is not straightforward for nested HWPTs to share the same
parent HWPT (stage-2 IO pagetable), with the HWPT infrastructure alone: a
parent HWPT typically hold one stage-2 IO pagetable and tag it with only
one ID in the cache entries. When sharing one large stage-2 IO pagetable
across physical IOMMU instances, that one ID may not always be available
across all the IOMMU instances. In other word, it's ideal for SW to have
a different container for the stage-2 IO pagetable so it can hold another
ID that's available.
For this "different container", add vIOMMU, an additional layer to hold
extra virtualization information:
_______________________________________________________________________
| iommufd (with vIOMMU) |
| |
| [5] |
| _____________ |
| | | |
| |----------------| vIOMMU | |
| | | | |
| | | | |
| | [1] | | [4] [2] |
| | ______ | | _____________ ________ |
| | | | | [3] | | | | | |
| | | IOAS |<---|(HWPT_PAGING)|<---| HWPT_NESTED |<--| DEVICE | |
| | |______| |_____________| |_____________| |________| |
| | | | | | |
|______|________|______________|__________________|_______________|_____|
| | | | |
______v_____ | ______v_____ ______v_____ ___v__
| struct | | PFN | (paging) | | (nested) | |struct|
|iommu_device| |------>|iommu_domain|<----|iommu_domain|<----|device|
|____________| storage|____________| |____________| |______|
The vIOMMU object should be seen as a slice of a physical IOMMU instance
that is passed to or shared with a VM. That can be some HW/SW resources:
- Security namespace for guest owned ID, e.g. guest-controlled cache tags
- Access to a sharable nesting parent pagetable across physical IOMMUs
- Virtualization of various platforms IDs, e.g. RIDs and others
- Delivery of paravirtualized invalidation
- Direct assigned invalidation queues
- Direct assigned interrupts
- Non-affiliated event reporting
On a multi-IOMMU system, the vIOMMU object must be instanced to the number
of the physical IOMMUs that are passed to (via devices) a guest VM, while
being able to hold the shareable parent HWPT. Each vIOMMU then just needs
to allocate its own individual ID to tag its own cache:
----------------------------
---------------- | | paging_hwpt0 |
| hwpt_nested0 |--->| viommu0 ------------------
---------------- | | IDx |
----------------------------
----------------------------
---------------- | | paging_hwpt0 |
| hwpt_nested1 |--->| viommu1 ------------------
---------------- | | IDy |
----------------------------
As an initial part-1, add IOMMUFD_CMD_VIOMMU_ALLOC ioctl for an allocation
only. And implement it in arm-smmu-v3 driver as a real world use case.
More vIOMMU-based structs and ioctls will be introduced in the follow-up
series to support vDEVICE, vIRQ (vEVENT) and vQUEUE objects. Although we
repurposed the vIOMMU object from an earlier RFC, just for a referece:
https://lore.kernel.org/all/cover.1712978212.git.nicolinc@nvidia.com/
This series is on Github:
https://github.com/nicolinc/iommufd/commits/iommufd_viommu_p1-v4
(paring QEMU branch for testing will be provided with the part2 series)
Changelog
v4
* Added "Reviewed-by" from Jason
* Dropped IOMMU_VIOMMU_TYPE_DEFAULT support
* Dropped iommufd_object_alloc_elm renamings
* Renamed iommufd's viommu_api.c to driver.c
* Reworked iommufd_viommu_alloc helper
* Added a separate iommufd_hwpt_nested_alloc_for_viommu function for
hwpt_nested allocations on a vIOMMU, and added comparison between
viommu->iommu_dev->ops and dev_iommu_ops(idev->dev)
* Replaced s2_parent with vsmmu in arm_smmu_nested_domain
* Replaced domain_alloc_user in iommu_ops with domain_alloc_nested in
viommu_ops
* Replaced wait_queue_head_t with a completion, to delay the unplug of
mock_iommu_dev
* Corrected documentation graph that was missing struct iommu_device
* Added an iommufd_verify_unfinalized_object helper to verify driver-
allocated vIOMMU/vDEVICE objects
* Added missing test cases for TEST_LENGTH and fail_nth
v3
https://lore.kernel.org/all/cover.1728491453.git.nicolinc@nvidia.com/
* Rebased on top of Jason's nesting v3 series
https://lore.kernel.org/all/0-v3-e2e16cd7467f+2a6a1-smmuv3_nesting_jgg@nvid…
* Split the series into smaller parts
* Added Jason's Reviewed-by
* Added back viommu->iommu_dev
* Added support for driver-allocated vIOMMU v.s. core-allocated
* Dropped arm_smmu_cache_invalidate_user
* Added an iommufd_test_wait_for_users() in selftest
* Reworked test code to make viommu an individual FIXTURE
* Added missing TEST_LENGTH case for the new ioctl command
v2
https://lore.kernel.org/all/cover.1724776335.git.nicolinc@nvidia.com/
* Limited vdev_id to one per idev
* Added a rw_sem to protect the vdev_id list
* Reworked driver-level APIs with proper lockings
* Added a new viommu_api file for IOMMUFD_DRIVER config
* Dropped useless iommu_dev point from the viommu structure
* Added missing index numnbers to new types in the uAPI header
* Dropped IOMMU_VIOMMU_INVALIDATE uAPI; Instead, reuse the HWPT one
* Reworked mock_viommu_cache_invalidate() using the new iommu helper
* Reordered details of set/unset_vdev_id handlers for proper lockings
v1
https://lore.kernel.org/all/cover.1723061377.git.nicolinc@nvidia.com/
Thanks!
Nicolin
Nicolin Chen (11):
iommufd: Move struct iommufd_object to public iommufd header
iommufd: Introduce IOMMUFD_OBJ_VIOMMU and its related struct
iommufd: Add iommufd_verify_unfinalized_object
iommufd/viommu: Add IOMMU_VIOMMU_ALLOC ioctl
iommufd: Add domain_alloc_nested op to iommufd_viommu_ops
iommufd: Allow pt_id to carry viommu_id for IOMMU_HWPT_ALLOC
iommufd/selftest: Add refcount to mock_iommu_device
iommufd/selftest: Add IOMMU_VIOMMU_TYPE_SELFTEST
iommufd/selftest: Add IOMMU_VIOMMU_ALLOC test coverage
Documentation: userspace-api: iommufd: Update vIOMMU
iommu/arm-smmu-v3: Add IOMMU_VIOMMU_TYPE_ARM_SMMUV3 support
drivers/iommu/iommufd/Makefile | 5 +-
drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h | 26 +++---
drivers/iommu/iommufd/iommufd_private.h | 36 ++------
drivers/iommu/iommufd/iommufd_test.h | 2 +
include/linux/iommu.h | 14 +++
include/linux/iommufd.h | 89 +++++++++++++++++++
include/uapi/linux/iommufd.h | 56 ++++++++++--
tools/testing/selftests/iommu/iommufd_utils.h | 28 ++++++
.../arm/arm-smmu-v3/arm-smmu-v3-iommufd.c | 79 ++++++++++------
drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 9 +-
drivers/iommu/iommufd/driver.c | 38 ++++++++
drivers/iommu/iommufd/hw_pagetable.c | 69 +++++++++++++-
drivers/iommu/iommufd/main.c | 58 ++++++------
drivers/iommu/iommufd/selftest.c | 73 +++++++++++++--
drivers/iommu/iommufd/viommu.c | 85 ++++++++++++++++++
tools/testing/selftests/iommu/iommufd.c | 78 ++++++++++++++++
.../selftests/iommu/iommufd_fail_nth.c | 11 +++
Documentation/userspace-api/iommufd.rst | 69 +++++++++++++-
18 files changed, 701 insertions(+), 124 deletions(-)
create mode 100644 drivers/iommu/iommufd/driver.c
create mode 100644 drivers/iommu/iommufd/viommu.c
--
2.43.0
From: zhouyuhang <zhouyuhang(a)kylinos.cn>
Test case idmap_mount_tree_invalid failed to run on the newer kernel
with the following output:
# RUN mount_setattr_idmapped.idmap_mount_tree_invalid ...
# mount_setattr_test.c:1428:idmap_mount_tree_invalid:Expected sys_mount_setattr(open_tree_fd, "", AT_EMPTY_PATH, &attr, sizeof(attr)) (0) ! = 0 (0)
# idmap_mount_tree_invalid: Test terminated by assertion
This is because tmpfs is mounted at "/mnt/A", and tmpfs already
contains the flag FS_ALLOW_IDMAP after the commit 7a80e5b8c6fa ("shmem:
support idmapped mounts for tmpfs"). So calling sys_mount_setattr here
returns 0 instead of -EINVAL as expected.
Ramfs is mounted at "/mnt/B" and does not support idmap mounts.
So we can use "/mnt/B" instead of "/mnt/A" to make the test run
successfully with the following output:
# Starting 1 tests from 1 test cases.
# RUN mount_setattr_idmapped.idmap_mount_tree_invalid ...
# OK mount_setattr_idmapped.idmap_mount_tree_invalid
ok 1 mount_setattr_idmapped.idmap_mount_tree_invalid
# PASSED: 1 / 1 tests passed.
Signed-off-by: zhouyuhang <zhouyuhang(a)kylinos.cn>
---
tools/testing/selftests/mount_setattr/mount_setattr_test.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/tools/testing/selftests/mount_setattr/mount_setattr_test.c b/tools/testing/selftests/mount_setattr/mount_setattr_test.c
index c6a8c732b802..54552c19bc24 100644
--- a/tools/testing/selftests/mount_setattr/mount_setattr_test.c
+++ b/tools/testing/selftests/mount_setattr/mount_setattr_test.c
@@ -1414,7 +1414,7 @@ TEST_F(mount_setattr_idmapped, idmap_mount_tree_invalid)
ASSERT_EQ(expected_uid_gid(-EBADF, "/tmp/B/b", 0, 0, 0), 0);
ASSERT_EQ(expected_uid_gid(-EBADF, "/tmp/B/BB/b", 0, 0, 0), 0);
- open_tree_fd = sys_open_tree(-EBADF, "/mnt/A",
+ open_tree_fd = sys_open_tree(-EBADF, "/mnt/B",
AT_RECURSIVE |
AT_EMPTY_PATH |
AT_NO_AUTOMOUNT |
--
2.27.0
If you wish to utilise a pidfd interface to refer to the current process or
thread it is rather cumbersome, requiring something like:
int pidfd = pidfd_open(getpid(), 0 or PIDFD_THREAD);
...
close(pidfd);
Or the equivalent call opening /proc/self. It is more convenient to use a
sentinel value to indicate to an interface that accepts a pidfd that we
simply wish to refer to the current process thread.
This series introduces sentinels for this purposes which can be passed as
the pidfd in this instance rather than having to establish a dummy fd for
this purpose.
It is useful to refer to both the current thread from the userland's
perspective for which we use PIDFD_SELF, and the current process from the
userland's perspective, for which we use PIDFD_SELF_PROCESS.
There is unfortunately some confusion between the kernel and userland as to
what constitutes a process - a thread from the userland perspective is a
process in userland, and a userland process is a thread group (more
specifically the thread group leader from the kernel perspective). We
therefore alias things thusly:
* PIDFD_SELF_THREAD aliased by PIDFD_SELF - use PIDTYPE_PID.
* PIDFD_SELF_THREAD_GROUP alised by PIDFD_SELF_PROCESS - use PIDTYPE_TGID.
In all of the kernel code we refer to PIDFD_SELF_THREAD and
PIDFD_SELF_THREAD_GROUP. However we expect users to use PIDFD_SELF and
PIDFD_SELF_PROCESS.
This matters for cases where, for instance, a user unshare()'s FDs or does
thread-specific signal handling and where the user would be hugely confused
if the FDs referenced or signal processed referred to the thread group
leader rather than the individual thread.
We ensure that pidfd_send_signal() and pidfd_getfd() work correctly, and
assert as much in selftests. All other interfaces except setns() will work
implicitly with this new interface, however it doesn't make sense to test
waitid(P_PIDFD, ...) as waiting on ourselves is a blocking operation.
In the case of setns() we explicitly disallow use of PIDFD_SELF* as it
doesn't make sense to obtain the namespaces of our own process, and it
would require work to implement this functionality there that would be of
no use.
We also do not provide the ability to utilise PIDFD_SELF* in ordinary fd
operations such as open() or poll(), as this would require extensive work
and be of no real use.
v5:
* Fixup self test dependencies on pidfd/pidfd.h.
v4:
* Avoid returning an fd in the __pidfd_get_pid() function as pointed out by
Christian, instead simply always pin the pid and maintain fd scope in the
helper alone.
* Add wrapper header file in tools/include/linux to allow for import of
UAPI pidfd.h header without encountering the collision between system
fcntl.h and linux/fcntl.h as discussed with Shuah and John.
* Fixup tests to import the UAPI pidfd.h header working around conflicts
between system fcntl.h and linux/fcntl.h which the UAPI pidfd.h imports,
as reported by Shuah.
* Use an int for pidfd_is_self_sentinel() to avoid any dependency on
stdbool.h in userland.
https://lore.kernel.org/linux-mm/cover.1729198898.git.lorenzo.stoakes@oracl…
v3:
* Do not fput() an invalid fd as reported by kernel test bot.
* Fix unintended churn from moving variable declaration.
https://lore.kernel.org/linux-mm/cover.1729073310.git.lorenzo.stoakes@oracl…
v2:
* Fix tests as reported by Shuah.
* Correct RFC version lore link.
https://lore.kernel.org/linux-mm/cover.1728643714.git.lorenzo.stoakes@oracl…
Non-RFC v1:
* Removed RFC tag - there seems to be general consensus that this change is
a good idea, but perhaps some debate to be had on implementation. It
seems sensible then to move forward with the RFC flag removed.
* Introduced PIDFD_SELF_THREAD, PIDFD_SELF_THREAD_GROUP and their aliases
PIDFD_SELF and PIDFD_SELF_PROCESS respectively.
* Updated testing accordingly.
https://lore.kernel.org/linux-mm/cover.1728578231.git.lorenzo.stoakes@oracl…
RFC version:
https://lore.kernel.org/linux-mm/cover.1727644404.git.lorenzo.stoakes@oracl…
Lorenzo Stoakes (5):
pidfd: extend pidfd_get_pid() and de-duplicate pid lookup
pidfd: add PIDFD_SELF_* sentinels to refer to own thread/process
tools: testing: separate out wait_for_pid() into helper header
selftests: pidfd: add pidfd.h UAPI wrapper
selftests: pidfd: add tests for PIDFD_SELF_*
include/linux/pid.h | 34 ++++-
include/uapi/linux/pidfd.h | 15 ++
kernel/exit.c | 3 +-
kernel/nsproxy.c | 1 +
kernel/pid.c | 65 +++++---
kernel/signal.c | 29 +---
tools/include/linux/pidfd.h | 14 ++
tools/testing/selftests/cgroup/test_kill.c | 2 +-
.../pid_namespace/regression_enomem.c | 2 +-
tools/testing/selftests/pidfd/Makefile | 3 +-
tools/testing/selftests/pidfd/pidfd.h | 28 +---
.../selftests/pidfd/pidfd_getfd_test.c | 141 ++++++++++++++++++
tools/testing/selftests/pidfd/pidfd_helpers.h | 39 +++++
.../selftests/pidfd/pidfd_setns_test.c | 11 ++
tools/testing/selftests/pidfd/pidfd_test.c | 76 ++++++++--
15 files changed, 375 insertions(+), 88 deletions(-)
create mode 100644 tools/include/linux/pidfd.h
create mode 100644 tools/testing/selftests/pidfd/pidfd_helpers.h
--
2.47.0
Use a less populated IP range to run the tests, as suggested by Petr in
Link: https://lore.kernel.org/netdev/87ikvukv3s.fsf@nvidia.com/.
Suggested-by: Petr Machata <petrm(a)nvidia.com>
Signed-off-by: Breno Leitao <leitao(a)debian.org>
---
tools/testing/selftests/drivers/net/netcons_basic.sh | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/tools/testing/selftests/drivers/net/netcons_basic.sh b/tools/testing/selftests/drivers/net/netcons_basic.sh
index 06021b2059b7..4ad1e216c6b0 100755
--- a/tools/testing/selftests/drivers/net/netcons_basic.sh
+++ b/tools/testing/selftests/drivers/net/netcons_basic.sh
@@ -20,9 +20,9 @@ SCRIPTDIR=$(dirname "$(readlink -e "${BASH_SOURCE[0]}")")
# Simple script to test dynamic targets in netconsole
SRCIF="" # to be populated later
-SRCIP=192.168.1.1
+SRCIP=192.168.2.1
DSTIF="" # to be populated later
-DSTIP=192.168.1.2
+DSTIP=192.168.2.2
PORT="6666"
MSG="netconsole selftest"
--
2.43.5
Following the previous vIOMMU series, this adds another vDEVICE structure,
representing the association from an iommufd_device to an iommufd_viommu.
This gives the whole architecture a new "v" layer:
_______________________________________________________________________
| iommufd (with vIOMMU/vDEVICE) |
| _____________ _____________ |
| | | | | |
| |----------------| vIOMMU |<---| vDEVICE |<------| |
| | | | |_____________| | |
| | ______ | | _____________ ___|____ |
| | | | | | | | | | |
| | | IOAS |<---|(HWPT_PAGING)|<---| HWPT_NESTED |<--| DEVICE | |
| | |______| |_____________| |_____________| |________| |
|______|________|______________|__________________|_______________|_____|
| | | | |
______v_____ | ______v_____ ______v_____ ___v__
| struct | | PFN | (paging) | | (nested) | |struct|
|iommu_device| |------>|iommu_domain|<----|iommu_domain|<----|device|
|____________| storage|____________| |____________| |______|
This vDEVICE object is used to collect and store all vIOMMU-related device
information/attributes in a VM. As an initial series for vDEVICE, add only
the virt_id to the vDEVICE, which is a vIOMMU specific device ID in a VM:
e.g. vSID of ARM SMMUv3, vDeviceID of AMD IOMMU, and vID of Intel VT-d to
a Context Table. This virt_id helps IOMMU drivers to link the vID to a pID
of the device against the physical IOMMU instance. This is essential for a
vIOMMU-based invalidation, where the request contains a device's vID for a
device cache flush, e.g. ATC invalidation.
Therefore, with this vDEVICE object, support a vIOMMU-based invalidation,
by reusing IOMMUFD_CMD_HWPT_INVALIDATE for a vIOMMU object to flush cache
with a given driver data.
As for the implementation of the series, add driver support in ARM SMMUv3
for a real world use case.
This series is on Github:
https://github.com/nicolinc/iommufd/commits/iommufd_viommu_p2-v4
For testing, try this "with-rmr" branch:
https://github.com/nicolinc/iommufd/commits/iommufd_viommu_p2-v4-with-rmr
Paring QEMU branch for testing:
https://github.com/nicolinc/qemu/commits/wip/for_iommufd_viommu_p2-v4
Changelog
v4
* Added missing brackets in switch-case
* Fixed the unreleased idev refcount issue
* Reworked the iommufd_vdevice_alloc allocator
* Dropped support for IOMMU_VIOMMU_TYPE_DEFAULT
* Added missing TEST_LENGTH and fail_nth coverages
* Added a verification to the driver-allocated vDEVICE object
* Added an iommufd_vdevice_abort for a missing mutex protection
* Added a u64 structure arm_vsmmu_invalidation_cmd for user command
conversion
v3
https://lore.kernel.org/all/cover.1728491532.git.nicolinc@nvidia.com/
* Added Jason's Reviewed-by
* Split this invalidation part out of the part-1 series
* Repurposed VDEV_ID ioctl to a wider vDEVICE structure and ioctl
* Reduced viommu_api functions by allowing drivers to access viommu
and vdevice structure directly
* Dropped vdevs_rwsem by using xa_lock instead
* Dropped arm_smmu_cache_invalidate_user
v2
https://lore.kernel.org/all/cover.1724776335.git.nicolinc@nvidia.com/
* Limited vdev_id to one per idev
* Added a rw_sem to protect the vdev_id list
* Reworked driver-level APIs with proper lockings
* Added a new viommu_api file for IOMMUFD_DRIVER config
* Dropped useless iommu_dev point from the viommu structure
* Added missing index numnbers to new types in the uAPI header
* Dropped IOMMU_VIOMMU_INVALIDATE uAPI; Instead, reuse the HWPT one
* Reworked mock_viommu_cache_invalidate() using the new iommu helper
* Reordered details of set/unset_vdev_id handlers for proper lockings
v1
https://lore.kernel.org/all/cover.1723061377.git.nicolinc@nvidia.com/
Thanks!
Nicolin
Jason Gunthorpe (2):
iommu: Add iommu_copy_struct_from_full_user_array helper
iommu/arm-smmu-v3: Allow ATS for IOMMU_DOMAIN_NESTED
Nicolin Chen (12):
iommufd/viommu: Introduce IOMMUFD_OBJ_VDEVICE and its related struct
iommufd/viommu: Add IOMMU_VDEVICE_ALLOC ioctl
iommufd/selftest: Add IOMMU_VDEVICE_ALLOC test coverage
iommu/viommu: Add cache_invalidate to iommufd_viommu_ops
iommufd/hw_pagetable: Enforce cache invalidation op on vIOMMU-based
hwpt_nested
iommufd: Allow hwpt_id to carry viommu_id for IOMMU_HWPT_INVALIDATE
iommufd/viommu: Add vdev_to_dev helper
iommufd/selftest: Add mock_viommu_cache_invalidate
iommufd/selftest: Add IOMMU_TEST_OP_DEV_CHECK_CACHE test command
iommufd/selftest: Add vIOMMU coverage for IOMMU_HWPT_INVALIDATE ioctl
Documentation: userspace-api: iommufd: Update vDEVICE
iommu/arm-smmu-v3: Add arm_vsmmu_cache_invalidate
drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h | 9 +-
drivers/iommu/iommufd/iommufd_private.h | 12 ++
drivers/iommu/iommufd/iommufd_test.h | 30 +++
include/linux/iommu.h | 49 ++++-
include/linux/iommufd.h | 50 +++++
include/uapi/linux/iommufd.h | 61 +++++-
tools/testing/selftests/iommu/iommufd_utils.h | 83 +++++++
.../arm/arm-smmu-v3/arm-smmu-v3-iommufd.c | 162 +++++++++++++-
drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 32 ++-
drivers/iommu/iommufd/device.c | 11 +
drivers/iommu/iommufd/driver.c | 7 +
drivers/iommu/iommufd/hw_pagetable.c | 36 +++-
drivers/iommu/iommufd/main.c | 7 +
drivers/iommu/iommufd/selftest.c | 115 +++++++++-
drivers/iommu/iommufd/viommu.c | 108 ++++++++++
tools/testing/selftests/iommu/iommufd.c | 204 +++++++++++++++++-
.../selftests/iommu/iommufd_fail_nth.c | 4 +
Documentation/userspace-api/iommufd.rst | 41 +++-
18 files changed, 983 insertions(+), 38 deletions(-)
--
2.43.0
If you wish to utilise a pidfd interface to refer to the current process or
thread it is rather cumbersome, requiring something like:
int pidfd = pidfd_open(getpid(), 0 or PIDFD_THREAD);
...
close(pidfd);
Or the equivalent call opening /proc/self. It is more convenient to use a
sentinel value to indicate to an interface that accepts a pidfd that we
simply wish to refer to the current process thread.
This series introduces sentinels for this purposes which can be passed as
the pidfd in this instance rather than having to establish a dummy fd for
this purpose.
It is useful to refer to both the current thread from the userland's
perspective for which we use PIDFD_SELF, and the current process from the
userland's perspective, for which we use PIDFD_SELF_PROCESS.
There is unfortunately some confusion between the kernel and userland as to
what constitutes a process - a thread from the userland perspective is a
process in userland, and a userland process is a thread group (more
specifically the thread group leader from the kernel perspective). We
therefore alias things thusly:
* PIDFD_SELF_THREAD aliased by PIDFD_SELF - use PIDTYPE_PID.
* PIDFD_SELF_THREAD_GROUP alised by PIDFD_SELF_PROCESS - use PIDTYPE_TGID.
In all of the kernel code we refer to PIDFD_SELF_THREAD and
PIDFD_SELF_THREAD_GROUP. However we expect users to use PIDFD_SELF and
PIDFD_SELF_PROCESS.
This matters for cases where, for instance, a user unshare()'s FDs or does
thread-specific signal handling and where the user would be hugely confused
if the FDs referenced or signal processed referred to the thread group
leader rather than the individual thread.
We ensure that pidfd_send_signal() and pidfd_getfd() work correctly, and
assert as much in selftests. All other interfaces except setns() will work
implicitly with this new interface, however it doesn't make sense to test
waitid(P_PIDFD, ...) as waiting on ourselves is a blocking operation.
In the case of setns() we explicitly disallow use of PIDFD_SELF* as it
doesn't make sense to obtain the namespaces of our own process, and it
would require work to implement this functionality there that would be of
no use.
We also do not provide the ability to utilise PIDFD_SELF* in ordinary fd
operations such as open() or poll(), as this would require extensive work
and be of no real use.
v4:
* Avoid returning an fd in the __pidfd_get_pid() function as pointed out by
Christian, instead simply always pin the pid and maintain fd scope in the
helper alone.
* Add wrapper header file in tools/include/linux to allow for import of
UAPI pidfd.h header without encountering the collision between system
fcntl.h and linux/fcntl.h as discussed with Shuah and John.
* Fixup tests to import the UAPI pidfd.h header working around conflicts
between system fcntl.h and linux/fcntl.h which the UAPI pidfd.h imports,
as reported by Shuah.
* Use an int for pidfd_is_self_sentinel() to avoid any dependency on
stdbool.h in userland.
v3:
* Do not fput() an invalid fd as reported by kernel test bot.
* Fix unintended churn from moving variable declaration.
https://lore.kernel.org/linux-mm/cover.1729073310.git.lorenzo.stoakes@oracl…
v2:
* Fix tests as reported by Shuah.
* Correct RFC version lore link.
https://lore.kernel.org/linux-mm/cover.1728643714.git.lorenzo.stoakes@oracl…
Non-RFC v1:
* Removed RFC tag - there seems to be general consensus that this change is
a good idea, but perhaps some debate to be had on implementation. It
seems sensible then to move forward with the RFC flag removed.
* Introduced PIDFD_SELF_THREAD, PIDFD_SELF_THREAD_GROUP and their aliases
PIDFD_SELF and PIDFD_SELF_PROCESS respectively.
* Updated testing accordingly.
https://lore.kernel.org/linux-mm/cover.1728578231.git.lorenzo.stoakes@oracl…
RFC version:
https://lore.kernel.org/linux-mm/cover.1727644404.git.lorenzo.stoakes@oracl…
Lorenzo Stoakes (4):
pidfd: extend pidfd_get_pid() and de-duplicate pid lookup
pidfd: add PIDFD_SELF_* sentinels to refer to own thread/process
selftests: pidfd: add pidfd.h UAPI wrapper
selftests: pidfd: add tests for PIDFD_SELF_*
include/linux/pid.h | 34 ++++-
include/uapi/linux/pidfd.h | 15 ++
kernel/exit.c | 3 +-
kernel/nsproxy.c | 1 +
kernel/pid.c | 65 +++++---
kernel/signal.c | 29 +---
tools/include/linux/pidfd.h | 14 ++
tools/testing/selftests/pidfd/Makefile | 3 +-
tools/testing/selftests/pidfd/pidfd.h | 2 +
.../selftests/pidfd/pidfd_getfd_test.c | 141 ++++++++++++++++++
.../selftests/pidfd/pidfd_setns_test.c | 11 ++
tools/testing/selftests/pidfd/pidfd_test.c | 76 ++++++++--
12 files changed, 333 insertions(+), 61 deletions(-)
create mode 100644 tools/include/linux/pidfd.h
--
2.46.2
RISC-V defines three extensions for pointer masking[1]:
- Smmpm: configured in M-mode, affects M-mode
- Smnpm: configured in M-mode, affects the next lower mode (S or U-mode)
- Ssnpm: configured in S-mode, affects the next lower mode (VS, VU, or U-mode)
This series adds support for configuring Smnpm or Ssnpm (depending on
which privilege mode the kernel is running in) to allow pointer masking
in userspace (VU or U-mode), extending the PR_SET_TAGGED_ADDR_CTRL API
from arm64. Unlike arm64 TBI, userspace pointer masking is not enabled
by default on RISC-V. Additionally, the tag width (referred to as PMLEN)
is variable, so userspace needs to ask the kernel for a specific tag
width, which is interpreted as a lower bound on the number of tag bits.
This series also adds support for a tagged address ABI similar to arm64
and x86. Since accesses from the kernel to user memory use the kernel's
pointer masking configuration, not the user's, the kernel must untag
user pointers in software before dereferencing them. And since the tag
width is variable, as with LAM on x86, it must be kept the same across
all threads in a process so untagged_addr_remote() can work.
[1]: https://github.com/riscv/riscv-j-extension/raw/d70011dde6c2/zjpm-spec.pdf
---
This series depends on the per-thread envcfg series in riscv/for-next.
This series can be tested in QEMU by applying a patch set[2].
KASAN_SW_TAGS using pointer masking is an independent patch series[3].
[2]: https://lore.kernel.org/qemu-devel/20240511101053.1875596-1-me@deliversmonk…
[3]: https://lore.kernel.org/linux-riscv/20240814085618.968833-1-samuel.holland@…
Changes in v5:
- Update pointer masking spec version to 1.0 and state to ratified
- Document how PR_[SG]ET_TAGGED_ADDR_CTRL are used on RISC-V
- Document that the RISC-V tagged address ABI is the same as AArch64
- Rename "pm" selftests directory to "abi" to be more generic
- Fix -Wparentheses warnings
- Fix order of operations when writing via the tagged pointer
- Update pointer masking spec version to 1.0 in hwprobe documentation
Changes in v4:
- Switch IS_ENABLED back to #ifdef to fix riscv32 build
- Combine __untagged_addr() and __untagged_addr_remote()
Changes in v3:
- Note in the commit message that the ISA extension spec is frozen
- Rebase on riscv/for-next (ISA extension list conflicts)
- Remove RISCV_ISA_EXT_SxPM, which was not used anywhere
- Use shifts instead of large numbers in ENVCFG_PMM* macro definitions
- Rename CONFIG_RISCV_ISA_POINTER_MASKING to CONFIG_RISCV_ISA_SUPM,
since it only controls the userspace part of pointer masking
- Use IS_ENABLED instead of #ifdef when possible
- Use an enum for the supported PMLEN values
- Simplify the logic in set_tagged_addr_ctrl()
- Use IS_ENABLED instead of #ifdef when possible
- Implement mm_untag_mask()
- Remove pmlen from struct thread_info (now only in mm_context_t)
Changes in v2:
- Drop patch 4 ("riscv: Define is_compat_thread()"), as an equivalent
patch was already applied
- Move patch 5 ("riscv: Split per-CPU and per-thread envcfg bits") to a
different series[3]
- Update pointer masking specification version reference
- Provide macros for the extension affecting the kernel and userspace
- Use the correct name for the hstatus.HUPMM field
- Rebase on riscv/linux.git for-next
- Add and use the envcfg_update_bits() helper function
- Inline flush_tagged_addr_state()
- Implement untagged_addr_remote()
- Restrict PMLEN changes once a process is multithreaded
- Rename "tags" directory to "pm" to avoid .gitignore rules
- Add .gitignore file to ignore the compiled selftest binary
- Write to a pipe to force dereferencing the user pointer
- Handle SIGSEGV in the child process to reduce dmesg noise
- Export Supm via hwprobe
- Export Smnpm and Ssnpm to KVM guests
Samuel Holland (10):
dt-bindings: riscv: Add pointer masking ISA extensions
riscv: Add ISA extension parsing for pointer masking
riscv: Add CSR definitions for pointer masking
riscv: Add support for userspace pointer masking
riscv: Add support for the tagged address ABI
riscv: Allow ptrace control of the tagged address ABI
riscv: selftests: Add a pointer masking test
riscv: hwprobe: Export the Supm ISA extension
RISC-V: KVM: Allow Smnpm and Ssnpm extensions for guests
KVM: riscv: selftests: Add Smnpm and Ssnpm to get-reg-list test
Documentation/arch/riscv/hwprobe.rst | 3 +
Documentation/arch/riscv/uabi.rst | 16 +
.../devicetree/bindings/riscv/extensions.yaml | 18 +
arch/riscv/Kconfig | 11 +
arch/riscv/include/asm/csr.h | 16 +
arch/riscv/include/asm/hwcap.h | 5 +
arch/riscv/include/asm/mmu.h | 7 +
arch/riscv/include/asm/mmu_context.h | 13 +
arch/riscv/include/asm/processor.h | 8 +
arch/riscv/include/asm/switch_to.h | 11 +
arch/riscv/include/asm/uaccess.h | 43 ++-
arch/riscv/include/uapi/asm/hwprobe.h | 1 +
arch/riscv/include/uapi/asm/kvm.h | 2 +
arch/riscv/kernel/cpufeature.c | 3 +
arch/riscv/kernel/process.c | 154 ++++++++
arch/riscv/kernel/ptrace.c | 42 +++
arch/riscv/kernel/sys_hwprobe.c | 3 +
arch/riscv/kvm/vcpu_onereg.c | 4 +
include/uapi/linux/elf.h | 1 +
include/uapi/linux/prctl.h | 5 +-
.../selftests/kvm/riscv/get-reg-list.c | 8 +
tools/testing/selftests/riscv/Makefile | 2 +-
tools/testing/selftests/riscv/abi/.gitignore | 1 +
tools/testing/selftests/riscv/abi/Makefile | 10 +
.../selftests/riscv/abi/pointer_masking.c | 332 ++++++++++++++++++
25 files changed, 712 insertions(+), 7 deletions(-)
create mode 100644 tools/testing/selftests/riscv/abi/.gitignore
create mode 100644 tools/testing/selftests/riscv/abi/Makefile
create mode 100644 tools/testing/selftests/riscv/abi/pointer_masking.c
--
2.45.1
This series was motivated by the regression fixed by 166bf8af9122
("pinctrl: mediatek: common-v2: Fix broken bias-disable for
PULL_PU_PD_RSEL_TYPE"). A bug was introduced in the pinctrl_paris driver
which prevented certain pins from having their bias configured.
Running this test on the mt8195-tomato platform with the test plan
included below[1] shows the test passing with the fix applied, but failing
without the fix:
With fix:
$ ./gpio-setget-config.py
TAP version 13
# Using test plan file: ./google,tomato.yaml
1..3
ok 1 pinctrl_paris.34.pull-up
ok 2 pinctrl_paris.34.pull-down
ok 3 pinctrl_paris.34.disabled
# Totals: pass:3 fail:0 xfail:0 xpass:0 skip:0 error:0
Without fix:
$ ./gpio-setget-config.py
TAP version 13
# Using test plan file: ./google,tomato.yaml
1..3
# Bias doesn't match: Expected pull-up, read pull-down.
not ok 1 pinctrl_paris.34.pull-up
ok 2 pinctrl_paris.34.pull-down
# Bias doesn't match: Expected disabled, read pull-down.
not ok 3 pinctrl_paris.34.disabled
# Totals: pass:1 fail:2 xfail:0 xpass:0 skip:0 error:0
In order to achieve this, the first patch exposes bias configuration
through the GPIO API in the pinctrl_paris driver, patch 2 extends the
gpio-mockup-cdev utility for use by patch 3, and patch 3 introduces a
new GPIO kselftest that takes a test plan in YAML, which can be tailored
per-platform to specify the configurations to test, and sets and gets
back each pin configuration to verify that they match and thus that the
driver is behaving as expected.
Since the GPIO uAPI only allows setting the pin configuration, getting
it back is done through pinconf-pins in the pinctrl debugfs folder.
The test currently only verifies bias but it would be easy to extend to
verify other pin configurations.
The test plan YAML file can be customized for each use-case and is
platform-dependant. For that reason, only an example is included in
patch 3 and the user is supposed to provide their test plan. That said,
the aim is to collect test plans for ease of use at [2].
[1] This is the test plan used for mt8195-tomato:
- label: "pinctrl_paris"
tests:
# Pin 34 has type MTK_PULL_PU_PD_RSEL_TYPE and is unused.
# Setting bias to MTK_PULL_PU_PD_RSEL_TYPE pins was fixed by
# 166bf8af9122 ("pinctrl: mediatek: common-v2: Fix broken bias-disable for PULL_PU_PD_RSEL_TYPE")
- pin: 34
bias: "pull-up"
- pin: 34
bias: "pull-down"
- pin: 34
bias: "disabled"
[2] https://github.com/kernelci/platform-test-parameters
Signed-off-by: Nícolas F. R. A. Prado <nfraprado(a)collabora.com>
---
Nícolas F. R. A. Prado (3):
pinctrl: mediatek: paris: Expose more configurations to GPIO set_config
selftest: gpio: Add wait flag to gpio-mockup-cdev
selftest: gpio: Add a new set-get config test
drivers/pinctrl/mediatek/pinctrl-paris.c | 20 +--
tools/testing/selftests/gpio/Makefile | 2 +-
tools/testing/selftests/gpio/gpio-mockup-cdev.c | 14 +-
.../gpio-set-get-config-example-test-plan.yaml | 15 ++
.../testing/selftests/gpio/gpio-set-get-config.py | 183 +++++++++++++++++++++
5 files changed, 220 insertions(+), 14 deletions(-)
---
base-commit: 6a7917c89f219f09b1d88d09f376000914a52763
change-id: 20240906-kselftest-gpio-set-get-config-6e5bb670c1a5
Best regards,
--
Nícolas F. R. A. Prado <nfraprado(a)collabora.com>
For logging to be useful, something has to set RET and retmsg by calling
ret_set_ksft_status(). There is a suite of functions to that end in
forwarding/lib: check_err, check_fail et.al. Move them to net/lib.sh so
that every net test can use them.
Existing lib.sh users might be using these same names for their functions.
However lib.sh is always sourced near the top of the file (checked), and
whatever new definitions will simply override the ones provided by lib.sh.
Signed-off-by: Petr Machata <petrm(a)nvidia.com>
Reviewed-by: Amit Cohen <amcohen(a)nvidia.com>
Acked-by: Shuah Khan <skhan(a)linuxfoundation.org>
---
Notes:
CC: Shuah Khan <shuah(a)kernel.org>
CC: Benjamin Poirier <bpoirier(a)nvidia.com>
CC: Hangbin Liu <liuhangbin(a)gmail.com>
CC: linux-kselftest(a)vger.kernel.org
CC: Jiri Pirko <jiri(a)resnulli.us>
---
tools/testing/selftests/net/forwarding/lib.sh | 73 -------------------
tools/testing/selftests/net/lib.sh | 73 +++++++++++++++++++
2 files changed, 73 insertions(+), 73 deletions(-)
diff --git a/tools/testing/selftests/net/forwarding/lib.sh b/tools/testing/selftests/net/forwarding/lib.sh
index d28dbf27c1f0..8625e3c99f55 100644
--- a/tools/testing/selftests/net/forwarding/lib.sh
+++ b/tools/testing/selftests/net/forwarding/lib.sh
@@ -445,79 +445,6 @@ done
##############################################################################
# Helpers
-# Whether FAILs should be interpreted as XFAILs. Internal.
-FAIL_TO_XFAIL=
-
-check_err()
-{
- local err=$1
- local msg=$2
-
- if ((err)); then
- if [[ $FAIL_TO_XFAIL = yes ]]; then
- ret_set_ksft_status $ksft_xfail "$msg"
- else
- ret_set_ksft_status $ksft_fail "$msg"
- fi
- fi
-}
-
-check_fail()
-{
- local err=$1
- local msg=$2
-
- check_err $((!err)) "$msg"
-}
-
-check_err_fail()
-{
- local should_fail=$1; shift
- local err=$1; shift
- local what=$1; shift
-
- if ((should_fail)); then
- check_fail $err "$what succeeded, but should have failed"
- else
- check_err $err "$what failed"
- fi
-}
-
-xfail()
-{
- FAIL_TO_XFAIL=yes "$@"
-}
-
-xfail_on_slow()
-{
- if [[ $KSFT_MACHINE_SLOW = yes ]]; then
- FAIL_TO_XFAIL=yes "$@"
- else
- "$@"
- fi
-}
-
-omit_on_slow()
-{
- if [[ $KSFT_MACHINE_SLOW != yes ]]; then
- "$@"
- fi
-}
-
-xfail_on_veth()
-{
- local dev=$1; shift
- local kind
-
- kind=$(ip -j -d link show dev $dev |
- jq -r '.[].linkinfo.info_kind')
- if [[ $kind = veth ]]; then
- FAIL_TO_XFAIL=yes "$@"
- else
- "$@"
- fi
-}
-
not()
{
"$@"
diff --git a/tools/testing/selftests/net/lib.sh b/tools/testing/selftests/net/lib.sh
index 4f52b8e48a3a..6bcf5d13879d 100644
--- a/tools/testing/selftests/net/lib.sh
+++ b/tools/testing/selftests/net/lib.sh
@@ -361,3 +361,76 @@ tests_run()
$current_test
done
}
+
+# Whether FAILs should be interpreted as XFAILs. Internal.
+FAIL_TO_XFAIL=
+
+check_err()
+{
+ local err=$1
+ local msg=$2
+
+ if ((err)); then
+ if [[ $FAIL_TO_XFAIL = yes ]]; then
+ ret_set_ksft_status $ksft_xfail "$msg"
+ else
+ ret_set_ksft_status $ksft_fail "$msg"
+ fi
+ fi
+}
+
+check_fail()
+{
+ local err=$1
+ local msg=$2
+
+ check_err $((!err)) "$msg"
+}
+
+check_err_fail()
+{
+ local should_fail=$1; shift
+ local err=$1; shift
+ local what=$1; shift
+
+ if ((should_fail)); then
+ check_fail $err "$what succeeded, but should have failed"
+ else
+ check_err $err "$what failed"
+ fi
+}
+
+xfail()
+{
+ FAIL_TO_XFAIL=yes "$@"
+}
+
+xfail_on_slow()
+{
+ if [[ $KSFT_MACHINE_SLOW = yes ]]; then
+ FAIL_TO_XFAIL=yes "$@"
+ else
+ "$@"
+ fi
+}
+
+omit_on_slow()
+{
+ if [[ $KSFT_MACHINE_SLOW != yes ]]; then
+ "$@"
+ fi
+}
+
+xfail_on_veth()
+{
+ local dev=$1; shift
+ local kind
+
+ kind=$(ip -j -d link show dev $dev |
+ jq -r '.[].linkinfo.info_kind')
+ if [[ $kind = veth ]]; then
+ FAIL_TO_XFAIL=yes "$@"
+ else
+ "$@"
+ fi
+}
--
2.45.0
It would be good to use the same mechanism for scheduling and dispatching
general net tests as the many forwarding tests already use. To that end,
move the logging helpers to net/lib.sh so that every net test can use them.
Existing lib.sh users might be using the name themselves. However lib.sh is
always sourced near the top of the file (checked), and whatever new
definition will simply override the one provided by lib.sh.
Signed-off-by: Petr Machata <petrm(a)nvidia.com>
Reviewed-by: Amit Cohen <amcohen(a)nvidia.com>
Acked-by: Shuah Khan <skhan(a)linuxfoundation.org>
---
Notes:
CC: Shuah Khan <shuah(a)kernel.org>
CC: Benjamin Poirier <bpoirier(a)nvidia.com>
CC: Hangbin Liu <liuhangbin(a)gmail.com>
CC: linux-kselftest(a)vger.kernel.org
CC: Jiri Pirko <jiri(a)resnulli.us>
---
tools/testing/selftests/net/forwarding/lib.sh | 10 ----------
tools/testing/selftests/net/lib.sh | 10 ++++++++++
2 files changed, 10 insertions(+), 10 deletions(-)
diff --git a/tools/testing/selftests/net/forwarding/lib.sh b/tools/testing/selftests/net/forwarding/lib.sh
index 41dd14c42c48..d28dbf27c1f0 100644
--- a/tools/testing/selftests/net/forwarding/lib.sh
+++ b/tools/testing/selftests/net/forwarding/lib.sh
@@ -1285,16 +1285,6 @@ matchall_sink_create()
action drop
}
-tests_run()
-{
- local current_test
-
- for current_test in ${TESTS:-$ALL_TESTS}; do
- in_defer_scope \
- $current_test
- done
-}
-
cleanup()
{
pre_cleanup
diff --git a/tools/testing/selftests/net/lib.sh b/tools/testing/selftests/net/lib.sh
index 691318b1ec55..4f52b8e48a3a 100644
--- a/tools/testing/selftests/net/lib.sh
+++ b/tools/testing/selftests/net/lib.sh
@@ -351,3 +351,13 @@ log_info()
echo "INFO: $msg"
}
+
+tests_run()
+{
+ local current_test
+
+ for current_test in ${TESTS:-$ALL_TESTS}; do
+ in_defer_scope \
+ $current_test
+ done
+}
--
2.45.0
Many net selftests invent their own logging helpers. These really should be
in a library sourced by these tests. Currently forwarding/lib.sh has a
suite of perfectly fine logging helpers, but sourcing a forwarding/ library
from a higher-level directory smells of layering violation. In this patch,
move the logging helpers to net/lib.sh so that every net test can use them.
Together with the logging helpers, it's also necessary to move
pause_on_fail(), and EXIT_STATUS and RET.
Existing lib.sh users might be using these same names for their functions
or variables. However lib.sh is always sourced near the top of the
file (checked), and whatever new definitions will simply override the ones
provided by lib.sh.
Signed-off-by: Petr Machata <petrm(a)nvidia.com>
Reviewed-by: Amit Cohen <amcohen(a)nvidia.com>
Acked-by: Shuah Khan <skhan(a)linuxfoundation.org>
---
Notes:
CC: Shuah Khan <shuah(a)kernel.org>
CC: Benjamin Poirier <bpoirier(a)nvidia.com>
CC: Hangbin Liu <liuhangbin(a)gmail.com>
CC: linux-kselftest(a)vger.kernel.org
CC: Jiri Pirko <jiri(a)resnulli.us>
---
tools/testing/selftests/net/forwarding/lib.sh | 113 -----------------
tools/testing/selftests/net/lib.sh | 115 ++++++++++++++++++
2 files changed, 115 insertions(+), 113 deletions(-)
diff --git a/tools/testing/selftests/net/forwarding/lib.sh b/tools/testing/selftests/net/forwarding/lib.sh
index 89c25f72b10c..41dd14c42c48 100644
--- a/tools/testing/selftests/net/forwarding/lib.sh
+++ b/tools/testing/selftests/net/forwarding/lib.sh
@@ -48,7 +48,6 @@ declare -A NETIFS=(
: "${WAIT_TIME:=5}"
# Whether to pause on, respectively, after a failure and before cleanup.
-: "${PAUSE_ON_FAIL:=no}"
: "${PAUSE_ON_CLEANUP:=no}"
# Whether to create virtual interfaces, and what netdevice type they should be.
@@ -446,22 +445,6 @@ done
##############################################################################
# Helpers
-# Exit status to return at the end. Set in case one of the tests fails.
-EXIT_STATUS=0
-# Per-test return value. Clear at the beginning of each test.
-RET=0
-
-ret_set_ksft_status()
-{
- local ksft_status=$1; shift
- local msg=$1; shift
-
- RET=$(ksft_status_merge $RET $ksft_status)
- if (( $? )); then
- retmsg=$msg
- fi
-}
-
# Whether FAILs should be interpreted as XFAILs. Internal.
FAIL_TO_XFAIL=
@@ -535,102 +518,6 @@ xfail_on_veth()
fi
}
-log_test_result()
-{
- local test_name=$1; shift
- local opt_str=$1; shift
- local result=$1; shift
- local retmsg=$1; shift
-
- printf "TEST: %-60s [%s]\n" "$test_name $opt_str" "$result"
- if [[ $retmsg ]]; then
- printf "\t%s\n" "$retmsg"
- fi
-}
-
-pause_on_fail()
-{
- if [[ $PAUSE_ON_FAIL == yes ]]; then
- echo "Hit enter to continue, 'q' to quit"
- read a
- [[ $a == q ]] && exit 1
- fi
-}
-
-handle_test_result_pass()
-{
- local test_name=$1; shift
- local opt_str=$1; shift
-
- log_test_result "$test_name" "$opt_str" " OK "
-}
-
-handle_test_result_fail()
-{
- local test_name=$1; shift
- local opt_str=$1; shift
-
- log_test_result "$test_name" "$opt_str" FAIL "$retmsg"
- pause_on_fail
-}
-
-handle_test_result_xfail()
-{
- local test_name=$1; shift
- local opt_str=$1; shift
-
- log_test_result "$test_name" "$opt_str" XFAIL "$retmsg"
- pause_on_fail
-}
-
-handle_test_result_skip()
-{
- local test_name=$1; shift
- local opt_str=$1; shift
-
- log_test_result "$test_name" "$opt_str" SKIP "$retmsg"
-}
-
-log_test()
-{
- local test_name=$1
- local opt_str=$2
-
- if [[ $# -eq 2 ]]; then
- opt_str="($opt_str)"
- fi
-
- if ((RET == ksft_pass)); then
- handle_test_result_pass "$test_name" "$opt_str"
- elif ((RET == ksft_xfail)); then
- handle_test_result_xfail "$test_name" "$opt_str"
- elif ((RET == ksft_skip)); then
- handle_test_result_skip "$test_name" "$opt_str"
- else
- handle_test_result_fail "$test_name" "$opt_str"
- fi
-
- EXIT_STATUS=$(ksft_exit_status_merge $EXIT_STATUS $RET)
- return $RET
-}
-
-log_test_skip()
-{
- RET=$ksft_skip retmsg= log_test "$@"
-}
-
-log_test_xfail()
-{
- RET=$ksft_xfail retmsg= log_test "$@"
-}
-
-log_info()
-{
- local msg=$1
-
- echo "INFO: $msg"
-}
-
not()
{
"$@"
diff --git a/tools/testing/selftests/net/lib.sh b/tools/testing/selftests/net/lib.sh
index c8991cc6bf28..691318b1ec55 100644
--- a/tools/testing/selftests/net/lib.sh
+++ b/tools/testing/selftests/net/lib.sh
@@ -9,6 +9,9 @@ source "$net_dir/lib/sh/defer.sh"
: "${WAIT_TIMEOUT:=20}"
+# Whether to pause on after a failure.
+: "${PAUSE_ON_FAIL:=no}"
+
BUSYWAIT_TIMEOUT=$((WAIT_TIMEOUT * 1000)) # ms
# Kselftest framework constants.
@@ -20,6 +23,11 @@ ksft_skip=4
# namespace list created by setup_ns
NS_LIST=()
+# Exit status to return at the end. Set in case one of the tests fails.
+EXIT_STATUS=0
+# Per-test return value. Clear at the beginning of each test.
+RET=0
+
##############################################################################
# Helpers
@@ -236,3 +244,110 @@ tc_rule_handle_stats_get()
| jq ".[] | select(.options.handle == $handle) | \
.options.actions[0].stats$selector"
}
+
+ret_set_ksft_status()
+{
+ local ksft_status=$1; shift
+ local msg=$1; shift
+
+ RET=$(ksft_status_merge $RET $ksft_status)
+ if (( $? )); then
+ retmsg=$msg
+ fi
+}
+
+log_test_result()
+{
+ local test_name=$1; shift
+ local opt_str=$1; shift
+ local result=$1; shift
+ local retmsg=$1; shift
+
+ printf "TEST: %-60s [%s]\n" "$test_name $opt_str" "$result"
+ if [[ $retmsg ]]; then
+ printf "\t%s\n" "$retmsg"
+ fi
+}
+
+pause_on_fail()
+{
+ if [[ $PAUSE_ON_FAIL == yes ]]; then
+ echo "Hit enter to continue, 'q' to quit"
+ read a
+ [[ $a == q ]] && exit 1
+ fi
+}
+
+handle_test_result_pass()
+{
+ local test_name=$1; shift
+ local opt_str=$1; shift
+
+ log_test_result "$test_name" "$opt_str" " OK "
+}
+
+handle_test_result_fail()
+{
+ local test_name=$1; shift
+ local opt_str=$1; shift
+
+ log_test_result "$test_name" "$opt_str" FAIL "$retmsg"
+ pause_on_fail
+}
+
+handle_test_result_xfail()
+{
+ local test_name=$1; shift
+ local opt_str=$1; shift
+
+ log_test_result "$test_name" "$opt_str" XFAIL "$retmsg"
+ pause_on_fail
+}
+
+handle_test_result_skip()
+{
+ local test_name=$1; shift
+ local opt_str=$1; shift
+
+ log_test_result "$test_name" "$opt_str" SKIP "$retmsg"
+}
+
+log_test()
+{
+ local test_name=$1
+ local opt_str=$2
+
+ if [[ $# -eq 2 ]]; then
+ opt_str="($opt_str)"
+ fi
+
+ if ((RET == ksft_pass)); then
+ handle_test_result_pass "$test_name" "$opt_str"
+ elif ((RET == ksft_xfail)); then
+ handle_test_result_xfail "$test_name" "$opt_str"
+ elif ((RET == ksft_skip)); then
+ handle_test_result_skip "$test_name" "$opt_str"
+ else
+ handle_test_result_fail "$test_name" "$opt_str"
+ fi
+
+ EXIT_STATUS=$(ksft_exit_status_merge $EXIT_STATUS $RET)
+ return $RET
+}
+
+log_test_skip()
+{
+ RET=$ksft_skip retmsg= log_test "$@"
+}
+
+log_test_xfail()
+{
+ RET=$ksft_xfail retmsg= log_test "$@"
+}
+
+log_info()
+{
+ local msg=$1
+
+ echo "INFO: $msg"
+}
--
2.45.0