A few unrelated devmem TCP fixes bundled in a series for some
convenience (if that's ok).
Patch 1-2: fix naming and provide page_pool_alloc_netmem for fragged
netmem.
Patch 3-4: fix issues with dma-buf dma addresses being potentially
passed to dma_sync_for_* helpers.
Patch 5-6: fix syzbot SO_DEVMEM_DONTNEED issue and add test for this
case.
Mina Almasry (6):
net: page_pool: rename page_pool_alloc_netmem to *_netmems
net: page_pool: create page_pool_alloc_netmem
page_pool: disable sync for cpu for dmabuf memory provider
netmem: add netmem_prefetch
net: fix SO_DEVMEM_DONTNEED looping too long
ncdevmem: add test for too many token_count
Samiullah Khawaja (1):
page_pool: Set `dma_sync` to false for devmem memory provider
include/net/netmem.h | 7 ++++
include/net/page_pool/helpers.h | 50 ++++++++++++++++++--------
include/net/page_pool/types.h | 2 +-
net/core/devmem.c | 9 +++--
net/core/page_pool.c | 11 +++---
net/core/sock.c | 46 ++++++++++++++----------
tools/testing/selftests/net/ncdevmem.c | 11 ++++++
7 files changed, 93 insertions(+), 43 deletions(-)
--
2.47.0.163.g1226f6d8fa-goog
Changes v5:
- Tests are skipped if snc_unreliable was set.
- Moved resctrlfs.c changes from patch 2/2 to 1/2.
- Removed CAT changes since it's not impacted by SNC in the selftest.
- Updated various comments.
- Fixed a bunch of minor issues pointed out in the review.
Changes v4:
- Printing SNC warnings at the start of every test.
- Printing SNC warnings at the end of every relevant test.
- Remove global snc_mode variable, consolidate snc detection functions
into one.
- Correct minor mistakes.
Changes v3:
- Reworked patch 2.
- Changed minor things in patch 1 like function name and made
corrections to the patch message.
Changes v2:
- Removed patches 2 and 3 since now this part will be supported by the
kernel.
Sub-Numa Clustering (SNC) allows splitting CPU cores, caches and memory
into multiple NUMA nodes. When enabled, NUMA-aware applications can
achieve better performance on bigger server platforms.
SNC support in the kernel was merged into x86/cache [1]. With SNC enabled
and kernel support in place all the tests will function normally (aside
from effective cache size). There might be a problem when SNC is enabled
but the system is still using an older kernel version without SNC
support. Currently the only message displayed in that situation is a
guess that SNC might be enabled and is causing issues. That message also
is displayed whenever the test fails on an Intel platform.
Add a mechanism to discover kernel support for SNC which will add more
meaning and certainty to the error message.
Add runtime SNC mode detection and verify how reliable that information
is.
Series was tested on Ice Lake server platforms with SNC disabled, SNC-2
and SNC-4. The tests were also ran with and without kernel support for
SNC.
Series applies cleanly on kselftest/next.
[1] https://lore.kernel.org/all/20240628215619.76401-1-tony.luck@intel.com/
Previous versions of this series:
[v1] https://lore.kernel.org/all/cover.1709721159.git.maciej.wieczor-retman@inte…
[v2] https://lore.kernel.org/all/cover.1715769576.git.maciej.wieczor-retman@inte…
[v3] https://lore.kernel.org/all/cover.1719842207.git.maciej.wieczor-retman@inte…
[v4] https://lore.kernel.org/all/cover.1720774981.git.maciej.wieczor-retman@inte…
Maciej Wieczor-Retman (2):
selftests/resctrl: Adjust effective L3 cache size with SNC enabled
selftests/resctrl: Adjust SNC support messages
tools/testing/selftests/resctrl/cmt_test.c | 8 +-
tools/testing/selftests/resctrl/mba_test.c | 8 +-
tools/testing/selftests/resctrl/mbm_test.c | 10 +-
tools/testing/selftests/resctrl/resctrl.h | 7 +
.../testing/selftests/resctrl/resctrl_tests.c | 8 +-
tools/testing/selftests/resctrl/resctrlfs.c | 132 ++++++++++++++++++
6 files changed, 166 insertions(+), 7 deletions(-)
--
2.46.2
Greetings:
Welcome to v6, see changelog below. This revision includes only
documentation changes: patch 7 has been updated with Bagas' suggestions
and the htmldoc looks better as a result. In addition, this cover letter
has been updated with a full re-run of the test data. We've included a
new test case highlighting a case Sridhar asked about in our v3. See
below.
This series introduces a new mechanism, IRQ suspension, which allows
network applications using epoll to mask IRQs during periods of high
traffic while also reducing tail latency (compared to existing
mechanisms, see below) during periods of low traffic. In doing so, this
balances CPU consumption with network processing efficiency.
Martin Karsten (CC'd) and I have been collaborating on this series for
several months and have appreciated the feedback from the community on
our RFC [1]. We've updated the cover letter and kernel documentation in
an attempt to more clearly explain how this mechanism works, how
applications can use it, and how it compares to existing mechanisms in
the kernel. We've added an additional test case, 'fullbusy', achieved by
modifying libevent for comparison. See below for a detailed description,
link to the patch, and test results.
I briefly mentioned this idea at netdev conf 2024 (for those who were
there) and Martin described this idea in an earlier paper presented at
Sigmetrics 2024 [2].
~ The short explanation (TL;DR)
We propose adding a new napi config parameter: irq_suspend_timeout to
help balance CPU usage and network processing efficiency when using IRQ
deferral and napi busy poll.
If this parameter is set to a non-zero value *and* a user application
has enabled preferred busy poll on a busy poll context (via the
EPIOCSPARAMS ioctl introduced in commit 18e2bf0edf4d ("eventpoll: Add
epoll ioctl for epoll_params")), then application calls to epoll_wait
for that context will cause device IRQs and softirq processing to be
suspended as long as epoll_wait successfully retrieves data from the
NAPI. Each time data is retrieved, the irq_suspend_timeout is deferred.
If/when network traffic subsides and epoll_wait returns no data, IRQ
suspension is immediately reverted back to the existing
napi_defer_hard_irqs and gro_flush_timeout mechanism which was
introduced in commit 6f8b12d661d0 ("net: napi: add hard irqs deferral
feature")).
The irq_suspend_timeout serves as a safety mechanism. If userland takes
a long time processing data, irq_suspend_timeout will fire and restart
normal NAPI processing.
For a more in depth explanation, please continue reading.
~ Comparison with existing mechanisms
Interrupt mitigation can be accomplished in napi software, by setting
napi_defer_hard_irqs and gro_flush_timeout, or via interrupt coalescing
in the NIC. This can be quite efficient, but in both cases, a fixed
timeout (or packet count) needs to be configured. However, a fixed
timeout cannot effectively support both low- and high-load situations:
At low load, an application typically processes a few requests and then
waits to receive more input data. In this scenario, a large timeout will
cause unnecessary latency.
At high load, an application typically processes many requests before
being ready to receive more input data. In this case, a small timeout
will likely fire prematurely and trigger irq/softirq processing, which
interferes with the application's execution. This causes overhead, most
likely due to cache contention.
While NICs attempt to provide adaptive interrupt coalescing schemes,
these cannot properly take into account application-level processing.
An alternative packet delivery mechanism is busy-polling, which results
in perfect alignment of application processing and network polling. It
delivers optimal performance (throughput and latency), but results in
100% cpu utilization and is thus inefficient for below-capacity
workloads.
We propose to add a new packet delivery mode that properly alternates
between busy polling and interrupt-based delivery depending on busy and
idle periods of the application. During a busy period, the system
operates in busy-polling mode, which avoids interference. During an idle
period, the system falls back to interrupt deferral, but with a small
timeout to avoid excessive latencies. This delivery mode can also be
viewed as an extension of basic interrupt deferral, but alternating
between a small and a very large timeout.
This delivery mode is efficient, because it avoids softirq execution
interfering with application processing during busy periods. It can be
used with blocking epoll_wait to conserve cpu cycles during idle
periods. The effect of alternating between busy and idle periods is that
performance (throughput and latency) is very close to full busy polling,
while cpu utilization is lower and very close to interrupt mitigation.
~ Usage details
IRQ suspension is introduced via a per-NAPI configuration parameter that
controls the maximum time that IRQs can be suspended.
Here's how it is intended to work:
- The user application (or system administrator) uses the netdev-genl
netlink interface to set the pre-existing napi_defer_hard_irqs and
gro_flush_timeout NAPI config parameters to enable IRQ deferral.
- The user application (or system administrator) sets the proposed
irq_suspend_timeout parameter via the netdev-genl netlink interface
to a larger value than gro_flush_timeout to enable IRQ suspension.
- The user application issues the existing epoll ioctl to set the
prefer_busy_poll flag on the epoll context.
- The user application then calls epoll_wait to busy poll for network
events, as it normally would.
- If epoll_wait returns events to userland, IRQs are suspended for the
duration of irq_suspend_timeout.
- If epoll_wait finds no events and the thread is about to go to
sleep, IRQ handling using napi_defer_hard_irqs and gro_flush_timeout
is resumed.
As long as epoll_wait is retrieving events, IRQs (and softirq
processing) for the NAPI being polled remain disabled. When network
traffic reduces, eventually a busy poll loop in the kernel will retrieve
no data. When this occurs, regular IRQ deferral using gro_flush_timeout
for the polled NAPI is re-enabled.
Unless IRQ suspension is continued by subsequent calls to epoll_wait, it
automatically times out after the irq_suspend_timeout timer expires.
Regular deferral is also immediately re-enabled when the epoll context
is destroyed.
~ Usage scenario
The target scenario for IRQ suspension as packet delivery mode is a
system that runs a dominant application with substantial network I/O.
The target application can be configured to receive input data up to a
certain batch size (via epoll_wait maxevents parameter) and this batch
size determines the worst-case latency that application requests might
experience. Because packet delivery is suspended during the target
application's processing, the batch size also determines the worst-case
latency of concurrent applications using the same RX queue(s).
gro_flush_timeout should be set as small as possible, but large enough to
make sure that a single request is likely not being interfered with.
irq_suspend_timeout is largely a safety mechanism against misbehaving
applications. It should be set large enough to cover the processing of an
entire application batch, i.e., the factor between gro_flush_timeout and
irq_suspend_timeout should roughly correspond to the maximum batch size
that the target application would process in one go.
~ Design rationale
The implementation of the IRQ suspension mechanism very nicely dovetails
with the existing mechanism for IRQ deferral when preferred busy poll is
enabled (introduced in commit 7fd3253a7de6 ("net: Introduce preferred
busy-polling"), see that commit message for more details).
While it would be possible to inject the suspend timeout via
the existing epoll ioctl, it is more natural to avoid this path for one
main reason:
An epoll context is linked to NAPI IDs as file descriptors are added;
this means any epoll context might suddenly be associated with a
different net_device if the application were to replace all existing
fds with fds from a different device. In this case, the scope of the
suspend timeout becomes unclear and many edge cases for both the user
application and the kernel are introduced
Only a single iteration through napi busy polling is needed for this
mechanism to work effectively. Since an important objective for this
mechanism is preserving cpu cycles, exactly one iteration of the napi
busy loop is invoked when busy_poll_usecs is set to 0.
~ Important call out in the implementation
- Enabling per epoll-context preferred busy poll will now effectively
lead to a nonblocking iteration through napi_busy_loop, even when
busy_poll_usecs is 0. See patch 4.
~ Benchmark configs & descriptions
The changes were benchmarked with memcached [3] using the benchmarking
tool mutilate [4].
To facilitate benchmarking, a small patch [5] was applied to memcached
1.6.29 to allow setting per-epoll context preferred busy poll and other
settings via environment variables. Another small patch [6] was applied
to libevent to enable full busy-polling.
Multiple scenarios were benchmarked as described below and the scripts
used for producing these results can be found on github [7] (note: all
scenarios use NAPI-based traffic splitting via SO_INCOMING_ID by passing
-N to memcached):
- base:
- no other options enabled
- deferX:
- set defer_hard_irqs to 100
- set gro_flush_timeout to X,000
- napibusy:
- set defer_hard_irqs to 100
- set gro_flush_timeout to 200,000
- enable busy poll via the existing ioctl (busy_poll_usecs = 64,
busy_poll_budget = 64, prefer_busy_poll = true)
- fullbusy:
- set defer_hard_irqs to 100
- set gro_flush_timeout to 5,000,000
- enable busy poll via the existing ioctl (busy_poll_usecs = 1000,
busy_poll_budget = 64, prefer_busy_poll = true)
- change memcached's nonblocking epoll_wait invocation (via
libevent) to using a 1 ms timeout
- suspend0:
- set defer_hard_irqs to 0
- set gro_flush_timeout to 0
- set irq_suspend_timeout to 20,000,000
- enable busy poll via the existing ioctl (busy_poll_usecs = 0,
busy_poll_budget = 64, prefer_busy_poll = true)
- suspendX:
- set defer_hard_irqs to 100
- set gro_flush_timeout to X,000
- set irq_suspend_timeout to 20,000,000
- enable busy poll via the existing ioctl (busy_poll_usecs = 0,
busy_poll_budget = 64, prefer_busy_poll = true)
~ Benchmark results
Tested on:
Single socket AMD EPYC 7662 64-Core Processor
Hyperthreading disabled
4 NUMA Zones (NPS=4)
16 CPUs per NUMA zone (64 cores total)
2 x Dual port 100gbps Mellanox Technologies ConnectX-5 Ex EN NIC
The test machine is configured such that a single interface has 8 RX
queues. The queues' IRQs and memcached are pinned to CPUs that are
NUMA-local to the interface which is under test. The NIC's interrupt
coalescing configuration is left at boot-time defaults.
Results:
Results are shown below. The mechanism added by this series is
represented by the 'suspend' cases. Data presented shows a summary over
at least 15 runs of each test case [8] using the scripts on github [7].
For latency, the median is shown. For throughput and CPU utilization,
the average is shown.
The results also include cycles-per-query (cpq) and
instruction-per-query (ipq) metrics, following the methodology proposed
in [2], to augment the CPU utilization numbers, which could be skewed
due to frequency scaling. We find that this does not appear to be the
case as CPU utilization and low-level metrics show similar trends.
These results were captured using the scripts on github [7] to
illustrate how this approach compares with other pre-existing
mechanisms. This data is not to be interpreted as scientific data
captured in a fully isolated lab setting, but instead as best effort,
illustrative information comparing and contrasting tradeoffs.
The absolute QPS results are higher than our previous submission, but
the relative differences between variants are equivalent. Because the
patches have been rebased on 6.12, several factors have likely
influenced the overall performance. Most importantly, we had to switch
to a new set of basic kernel options, which has likely altered the
baseline performance. Because the overall comparison of variants still
holds, we have not attempted to recreate the exact set of kernel options
from the previous submission.
Compare:
- Throughput (MAX) and latencies of base vs suspend.
- CPU usage of napibusy and fullbusy during lower load (200K, 400K for
example) vs suspend.
- Latency of the defer variants vs suspend as timeout and load
increases.
- suspend0, which sets defer_hard_irqs and gro_flush_timeout to 0, has
nearly the same performance as the base case (this is FAQ item #1).
The overall takeaway is that the suspend variants provide a superior
combination of high throughput, low latency, and low cpu utilization
compared to all other variants. Each of the suspend variants works very
well, but some fine-tuning between latency and cpu utilization is still
possible by tuning the small timeout (gro_flush_timeout).
Note: we've reorganized the results to make comparison among testcases
with the same load easier.
testcase load qps avglat 95%lat 99%lat cpu cpq ipq
base 200K 199954 112 237 415 26 13040 11336
defer10 200K 200002 54 123 142 28 19033 16508
defer20 200K 199985 60 130 153 26 15737 14247
defer50 200K 199968 78 142 181 23 12113 11609
defer200 200K 199997 163 252 304 18 8449 9155
fullbusy 200K 199993 46 117 132 100 43959 23320
napibusy 200K 200006 100 237 275 56 25016 24866
suspend0 200K 200012 105 249 432 29 14369 11844
suspend10 200K 200004 53 123 141 32 19432 16752
suspend20 200K 200014 58 126 151 30 16356 14670
suspend50 200K 200018 73 134 176 26 13245 12416
suspend200 200K 200027 149 250 302 20 9508 9781
testcase load qps avglat 95%lat 99%lat cpu cpq ipq
base 400K 399984 139 268 715 40 9437 9299
defer10 400K 400002 59 133 165 53 14089 12908
defer20 400K 400016 66 140 171 47 12085 11682
defer50 400K 400037 87 161 198 39 9528 9879
defer200 400K 399954 181 273 329 32 7326 8438
fullbusy 400K 399951 50 123 155 100 21990 16097
napibusy 400K 399997 76 221 271 83 18260 16511
suspend0 400K 399991 125 337 768 48 11051 9629
suspend10 400K 399990 57 129 161 54 13629 12841
suspend20 400K 399922 61 135 167 49 12055 11715
suspend50 400K 400024 75 148 186 42 10049 10243
suspend200 400K 399936 154 267 325 34 7770 8677
testcase load qps avglat 95%lat 99%lat cpu cpq ipq
base 600K 600064 148 265 576 61 9276 8757
defer10 600K 599985 71 147 204 76 12048 10863
defer20 600K 600024 75 151 199 66 10572 10328
defer50 600K 600054 94 172 217 55 8584 9144
defer200 600K 600030 200 299 355 45 6874 8189
fullbusy 600K 599956 55 127 176 100 14650 13968
napibusy 600K 599956 64 163 252 96 14022 14153
suspend0 600K 600029 126 306 724 70 10393 8977
suspend10 600K 599997 63 137 194 70 10991 11005
suspend20 600K 600012 67 141 194 65 10108 10359
suspend50 600K 600045 80 157 203 57 8747 9320
suspend200 600K 599940 158 277 344 48 7221 8354
testcase load qps avglat 95%lat 99%lat cpu cpq ipq
base 800K 800025 179 298 555 86 9572 8297
defer10 800K 799275 224 633 1271 96 10679 8904
defer20 800K 800041 114 226 328 90 10122 8917
defer50 800K 799936 118 207 288 77 8820 8607
defer200 800K 799994 228 341 403 65 7424 8130
fullbusy 800K 799964 62 136 192 100 10992 12518
napibusy 800K 799971 65 142 216 99 10911 12529
suspend0 800K 799965 126 250 533 86 9489 8496
suspend10 800K 799995 69 145 201 83 9475 9764
suspend20 800K 799931 74 151 209 79 8976 9336
suspend50 800K 799946 87 168 224 71 7993 8794
suspend200 800K 799993 160 292 357 62 6967 8184
testcase load qps avglat 95%lat 99%lat cpu cpq ipq
base 1000K 915792 3498 5740 6239 97 9388 7930
defer10 1000K 876285 3896 6095 6418 99 9960 8542
defer20 1000K 914909 3107 5771 6283 97 9407 8284
defer50 1000K 928426 2977 5591 5931 97 9214 8012
defer200 1000K 959989 3097 5306 5929 96 8816 7908
fullbusy 1000K 1000102 74 155 213 100 8796 10559
napibusy 1000K 1000006 74 154 216 100 8787 10654
suspend0 1000K 960757 2223 5715 7029 98 8964 7993
suspend10 1000K 999926 80 162 222 92 8246 8922
suspend20 1000K 1000095 85 166 226 89 7966 8719
suspend50 1000K 1000067 96 180 238 84 7476 8419
suspend200 1000K 999968 163 298 363 76 6798 8061
testcase load qps avglat 95%lat 99%lat cpu cpq ipq
base MAX 1054805 4152 5298 5743 100 8332 7890
defer10 MAX 937098 4598 6010 6347 100 9378 8407
defer20 MAX 988905 4389 5637 5990 100 8886 8106
defer50 MAX 1067194 3960 5216 5544 100 8235 7911
defer200 MAX 1054967 4084 5496 5821 100 8323 7871
fullbusy MAX 1248006 3472 3918 3979 100 7050 7919
napibusy MAX 1128384 3742 7958 10753 100 7776 7872
suspend0 MAX 1034456 4242 5668 6042 100 8497 7912
suspend10 MAX 1229229 3513 3926 3986 100 7156 7937
suspend20 MAX 1226845 3514 3939 3985 100 7171 7937
suspend50 MAX 1230757 3513 3935 3983 100 7140 7935
suspend200 MAX 1230424 3503 3934 3984 100 7142 7927
~ FAQ
- Why is a new parameter needed? Does irq_suspend_timeout override
gro_flush_timeout?
Using the suspend mechanism causes the system to alternate between
polling mode and irq-driven packet delivery. During busy periods,
irq_suspend_timeout overrides gro_flush_timeout and keeps the system
busy polling, but when epoll finds no events, the setting of
gro_flush_timeout and napi_defer_hard_irqs determine the next step.
There are essentially three possible loops for network processing and
packet delivery:
1) hardirq -> softirq -> napi poll; basic interrupt delivery
2) timer -> softirq -> napi poll; deferred irq processing
3) epoll -> busy-poll -> napi poll; busy looping
Loop 2 can take control from Loop 1, if gro_flush_timeout and
napi_defer_hard_irqs are set.
If gro_flush_timeout and napi_defer_hard_irqs are set, Loops 2 and
3 "wrestle" with each other for control. During busy periods,
irq_suspend_timeout is used as timer in Loop 2, which essentially
tilts this in favour of Loop 3.
If gro_flush_timeout and napi_defer_hard_irqs are not set, Loop 3
cannot take control from Loop 1.
Therefore, setting gro_flush_timeout and napi_defer_hard_irqs is the
recommended usage, because otherwise setting irq_suspend_timeout
might not have any discernible effect.
This is shown in the results above: compare suspend0 with the base
case. Note that the lack of napi_defer_hard_irqs and
gro_flush_timeout produce similar results for both, which encourages
the use of napi_defer_hard_irqs and gro_flush_timeout in addition to
irq_suspend_timeout.
- Can the new timeout value be threaded through the new epoll ioctl ?
Only with difficulty. The epoll ioctl sets options on an epoll
context and the NAPI ID associated with an epoll context can change
based on what file descriptors a user app adds to the epoll context.
This would introduce complexity in the API from the user perspective
and also complexity in the kernel.
- Can irq suspend be built by combining NIC coalescing and
gro_flush_timeout ?
No. The problem is that the long timeout must engage if and only if
prefer-busy is active.
When using NIC coalescing for the short timeout (without
napi_defer_hard_irqs/gro_flush_timeout), an interrupt after an idle
period will trigger softirq, which will run napi polling. At this
point, prefer-busy is not active, so NIC interrupts would be
re-enabled. Then it is not possible for the longer timeout to
interject to switch control back to polling. In other words, only by
using the software timer for the short timeout, it is possible to
extend the timeout without having to reprogram the NIC timer or
reach down directly and disable interrupts.
Using gro_flush_timeout for the long timeout also has problems, for
the same underlying reason. In the current napi implementation,
gro_flush_timeout is not tied to prefer-busy. We'd either have to
change that and in the process modify the existing deferral
mechanism, or introduce a state variable to determine whether
gro_flush_timeout is used as long timeout for irq suspend or whether
it is used for its default purpose. In an earlier version, we did
try something similar to the latter and made it work, but it ends up
being a lot more convoluted than our current proposal.
- Isn't it already possible to combine busy looping with irq deferral?
Yes, in fact enabling irq deferral via napi_defer_hard_irqs and
gro_flush_timeout is a precondition for prefer_busy_poll to have an
effect. If the application also uses a tight busy loop with
essentially nonblocking epoll_wait (accomplished with a very short
timeout parameter), this is the fullbusy case shown in the results.
An application using blocking epoll_wait is shown as the napibusy
case in the results. It's a hybrid approach that provides limited
latency benefits compared to the base case and plain irq deferral,
but not as good as fullbusy or suspend.
~ Special thanks
Several people were involved in earlier stages of the development of this
mechanism whom we'd like to thank:
- Peter Cai (CC'd), for the initial kernel patch and his contributions
to the paper.
- Mohammadamin Shafie (CC'd), for testing various versions of the kernel
patch and providing helpful feedback.
Thanks,
Martin and Joe
[1]: https://lore.kernel.org/netdev/20240812125717.413108-1-jdamato@fastly.com/
[2]: https://doi.org/10.1145/3626780
[3]: https://github.com/memcached/memcached/blob/master/doc/napi_ids.txt
[4]: https://github.com/leverich/mutilate
[5]: https://raw.githubusercontent.com/martinkarsten/irqsuspend/main/patches/mem…
[6]: https://raw.githubusercontent.com/martinkarsten/irqsuspend/main/patches/lib…
[7]: https://github.com/martinkarsten/irqsuspend
[8]: https://github.com/martinkarsten/irqsuspend/tree/main/results
v6:
- Updated the cover letter with a full re-run of all test cases,
including a new case suspend0, as requested by Sridhar previously.
- Updated the kernel documentation in patch 7 as suggested by Bagas
Sanjaya, which improved the htmldoc output.
v5: https://lore.kernel.org/netdev/20241103052421.518856-1-jdamato@fastly.com/
- Adjusted patch 5 to only suspend IRQs when ep_send_events returns a
positive return value. This issue was pointed out by Hillf Danton.
- Updated the commit message of patch 6 which still mentioned netcat,
despite the code being updated in v4 to replace it with socat and fixed
misspelling of netdevsim.
- Fixed a minor typo in patch 7 and removed an unnecessary paragraph.
- Added Sridhar Samudrala's Reviewed-by to patch 1-5 and 7.
v4: https://lore.kernel.org/netdev/20241102005214.32443-1-jdamato@fastly.com/
- Added a new FAQ item to cover letter.
- Updated patch 6 to use socat instead of nc in busy_poll_test.sh and
updated busy_poller.c to use netlink directly to configure napi
params.
- Updated the kernel documentation in patch 7 to include more details.
- Dropped Stanislav's Acked-by and Bagas' Reviewed-by from patch 7
since the documentation was updated.
v3: https://lore.kernel.org/netdev/20241101004846.32532-1-jdamato@fastly.com/
- Added Stanislav Fomichev's Acked-by to every patch except the newly
added selftest.
- Added Bagas Sanjaya's Reviewed-by to the documentation patch.
- Fixed the commit message of patch 2 to remove a reference to the now
non-existent sysfs setting.
- Added a self test which tests both "regular" busy poll and busy poll
with suspend enabled. This was added as patch 6 as requested by
Paolo. netdevsim was chosen instead of veth due to netdevsim's
pre-existing support for netdev-genl. See the commit message of
patch 6 for more details.
v2: https://lore.kernel.org/bpf/20241021015311.95468-1-jdamato@fastly.com/
- Cover letter updated, including a re-run of test data.
- Patch 1 rewritten to use netdev-genl instead of sysfs.
- Patch 3 updated with a comment added to napi_resume_irqs.
- Patch 4 rebased to apply now that commit b9ca079dd6b0 ("eventpoll:
Annotate data-race of busy_poll_usecs") has been picked up from VFS.
- Patch 6 updated the kernel documentation.
rfc -> v1:
- Cover letter updated to include more details.
- Patch 1 updated to remove the documentation added. This was moved to
patch 6 with the rest of the docs (see below).
- Patch 5 updated to fix an error uncovered by the kernel build robot.
See patch 5's changelog for more details.
- Patch 6 added which updates kernel documentation.
Joe Damato (2):
selftests: net: Add busy_poll_test
docs: networking: Describe irq suspension
Martin Karsten (5):
net: Add napi_struct parameter irq_suspend_timeout
net: Suspend softirq when prefer_busy_poll is set
net: Add control functions for irq suspension
eventpoll: Trigger napi_busy_loop, if prefer_busy_poll is set
eventpoll: Control irq suspension for prefer_busy_poll
Documentation/netlink/specs/netdev.yaml | 7 +
Documentation/networking/napi.rst | 170 ++++++++-
fs/eventpoll.c | 36 +-
include/linux/netdevice.h | 2 +
include/net/busy_poll.h | 3 +
include/uapi/linux/netdev.h | 1 +
net/core/dev.c | 58 +++-
net/core/dev.h | 25 ++
net/core/netdev-genl-gen.c | 5 +-
net/core/netdev-genl.c | 12 +
tools/include/uapi/linux/netdev.h | 1 +
tools/testing/selftests/net/.gitignore | 1 +
tools/testing/selftests/net/Makefile | 3 +-
tools/testing/selftests/net/busy_poll_test.sh | 164 +++++++++
tools/testing/selftests/net/busy_poller.c | 328 ++++++++++++++++++
15 files changed, 805 insertions(+), 11 deletions(-)
create mode 100755 tools/testing/selftests/net/busy_poll_test.sh
create mode 100644 tools/testing/selftests/net/busy_poller.c
base-commit: dbb9a7ef347828870df3e5e6ddf19469a3277fc9
--
2.25.1
The goal of the series is to simplify and make it possible to use
ncdevmem in an automated way from the ksft python wrapper.
ncdevmem is slowly mutated into a state where it uses stdout
to print the payload and the python wrapper is added to
make sure the arrived payload matches the expected one.
v7:
- fix validation (Mina)
- add support for working with non ::ffff-prefixed addresses (Mina)
v6:
- fix compilation issue in 'Unify error handling' patch (Jakub)
v5:
- properly handle errors from inet_pton() and socket() (Paolo)
- remove unneeded import from python selftest (Paolo)
v4:
- keep usage example with validation (Mina)
- fix compilation issue in one patch (s/start_queues/start_queue/)
v3:
- keep and refine the comment about ncdevmem invocation (Mina)
- add the comment about not enforcing exit status for ntuple reset (Mina)
- make configure_headersplit more robust (Mina)
- use num_queues/2 in selftest and let the users override it (Mina)
- remove memory_provider.memcpy_to_device (Mina)
- keep ksft as is (don't use -v validate flags): we are gonna
need a --debug-disable flag to make it less chatty; otherwise
it times out when sending too much data; so leaving it as
a separate follow up
v2:
- don't remove validation (Mina)
- keep 5-tuple flow steering but use it only when -c is provided (Mina)
- remove separate flag for probing (Mina)
- move ncdevmem under drivers/net/hw, not drivers/net (Jakub)
Cc: Mina Almasry <almasrymina(a)google.com>
Stanislav Fomichev (12):
selftests: ncdevmem: Redirect all non-payload output to stderr
selftests: ncdevmem: Separate out dmabuf provider
selftests: ncdevmem: Unify error handling
selftests: ncdevmem: Make client_ip optional
selftests: ncdevmem: Remove default arguments
selftests: ncdevmem: Switch to AF_INET6
selftests: ncdevmem: Properly reset flow steering
selftests: ncdevmem: Use YNL to enable TCP header split
selftests: ncdevmem: Remove hard-coded queue numbers
selftests: ncdevmem: Run selftest when none of the -s or -c has been
provided
selftests: ncdevmem: Move ncdevmem under drivers/net/hw
selftests: ncdevmem: Add automated test
.../selftests/drivers/net/hw/.gitignore | 1 +
.../testing/selftests/drivers/net/hw/Makefile | 9 +
.../selftests/drivers/net/hw/devmem.py | 45 +
.../selftests/drivers/net/hw/ncdevmem.c | 786 ++++++++++++++++++
tools/testing/selftests/net/.gitignore | 1 -
tools/testing/selftests/net/Makefile | 8 -
tools/testing/selftests/net/ncdevmem.c | 570 -------------
7 files changed, 841 insertions(+), 579 deletions(-)
create mode 100644 tools/testing/selftests/drivers/net/hw/.gitignore
create mode 100755 tools/testing/selftests/drivers/net/hw/devmem.py
create mode 100644 tools/testing/selftests/drivers/net/hw/ncdevmem.c
delete mode 100644 tools/testing/selftests/net/ncdevmem.c
--
2.47.0
Misbehaving guests can cause bus locks to degrade the performance of
a system. Non-WB (write-back) and misaligned locked RMW
(read-modify-write) instructions are referred to as "bus locks" and
require system wide synchronization among all processors to guarantee
the atomicity. The bus locks can impose notable performance penalties
for all processors within the system.
Support for the Bus Lock Threshold is indicated by CPUID
Fn8000_000A_EDX[29] BusLockThreshold=1, the VMCB provides a Bus Lock
Threshold enable bit and an unsigned 16-bit Bus Lock Threshold count.
VMCB intercept bit
VMCB Offset Bits Function
14h 5 Intercept bus lock operations
Bus lock threshold count
VMCB Offset Bits Function
120h 15:0 Bus lock counter
During VMRUN, the bus lock threshold count is fetched and stored in an
internal count register. Prior to executing a bus lock within the
guest, the processor verifies the count in the bus lock register. If
the count is greater than zero, the processor executes the bus lock,
reducing the count. However, if the count is zero, the bus lock
operation is not performed, and instead, a Bus Lock Threshold #VMEXIT
is triggered to transfer control to the Virtual Machine Monitor (VMM).
A Bus Lock Threshold #VMEXIT is reported to the VMM with VMEXIT code
0xA5h, VMEXIT_BUSLOCK. EXITINFO1 and EXITINFO2 are set to 0 on
a VMEXIT_BUSLOCK. On a #VMEXIT, the processor writes the current
value of the Bus Lock Threshold Counter to the VMCB.
More details about the Bus Lock Threshold feature can be found in AMD
APM [1].
v2 -> v3
- Drop parch to add virt tag in /proc/cpuinfo.
- Incorporated Tom's review comments.
v1 -> v2
- Incorporated misc review comments from Sean.
- Removed bus_lock_counter module parameter.
- Set the value of bus_lock_counter to zero by default and reload the value by 1
in bus lock exit handler.
- Add documentation for the behavioral difference for KVM_EXIT_BUS_LOCK.
- Improved selftest for buslock to work on SVM and VMX.
- Rewrite the commit messages.
Patches are prepared on kvm-next/next (efbc6bd090f4).
Testing done:
- Added a selftest for the Bus Lock Threshold functionality.
- The bus lock threshold selftest has been tested on both Intel and AMD platforms.
- Tested the Bus Lock Threshold functionality on SEV, SEV-ES, SEV-SNP guests.
- Tested the Bus Lock Threshold functionality on nested guests.
v1: https://lore.kernel.org/kvm/20240709175145.9986-4-manali.shukla@amd.com/T/
v2: https://lore.kernel.org/kvm/20241001063413.687787-4-manali.shukla@amd.com/T/
[1]: AMD64 Architecture Programmer's Manual Pub. 24593, April 2024,
Vol 2, 15.14.5 Bus Lock Threshold.
https://bugzilla.kernel.org/attachment.cgi?id=306250
Manali Shukla (2):
x86/cpufeatures: Add CPUID feature bit for the Bus Lock Threshold
KVM: X86: Add documentation about behavioral difference for
KVM_EXIT_BUS_LOCK
Nikunj A Dadhania (2):
KVM: SVM: Enable Bus lock threshold exit
KVM: selftests: Add bus lock exit test
Documentation/virt/kvm/api.rst | 4 +
arch/x86/include/asm/cpufeatures.h | 1 +
arch/x86/include/asm/svm.h | 5 +-
arch/x86/include/uapi/asm/svm.h | 2 +
arch/x86/kvm/svm/nested.c | 10 ++
arch/x86/kvm/svm/svm.c | 27 ++++
tools/testing/selftests/kvm/Makefile | 1 +
.../selftests/kvm/x86_64/kvm_buslock_test.c | 130 ++++++++++++++++++
8 files changed, 179 insertions(+), 1 deletion(-)
create mode 100644 tools/testing/selftests/kvm/x86_64/kvm_buslock_test.c
base-commit: efbc6bd090f48ccf64f7a8dd5daea775821d57ec
--
2.34.1
Hello,
KernelCI is hosting a bi-weekly call on Thursday to discuss improvements
to existing upstream tests, the development of new tests to increase
kernel testing coverage, and the enablement of these tests in KernelCI.
In recent months, we at Collabora have focused on various kernel areas,
assessing the tests already available upstream and contributing patches
to make them easily runnable in CIs.
Below is a list of the tests we've been working on and their latest
status updates, as discussed in the last meeting held on 2024-10-31:
*Boot time test*
- Sent v2:
https://lore.kernel.org/all/20241018101439.20849-1-laura.nao@collabora.com/
- Changes in v2 driven by feedback received in the LPC session
*ACPI probe kselftest*
- RFCv2:
https://lore.kernel.org/all/20240308144933.337107-1-laura.nao@collabora.com/
- Related upstream discussion on probe status tracking through printks:
https://lore.kernel.org/lkml/dc7af563-530d-4e1f-bcbb-90bcfc2fe11a@stanley.m…
Please reply to this thread if you'd like to join the call or discuss
any of the topics further. We look forward to collaborating with the
community to improve upstream tests and expand coverage to more areas of
interest within the kernel.
Best regards,
Laura Nao
This series adds VLAN support to HSR framework.
This series also adds VLAN support to HSR mode of ICSSG Ethernet driver.
Changes from v1 to v2:
*) Added patch 4/4 to add test script related to VLAN in HSR as asked by
Lukasz Majewski <lukma(a)denx.de>
v1 https://lore.kernel.org/all/20241004074715.791191-1-danishanwar@ti.com/
MD Danish Anwar (1):
selftests: hsr: Add test for VLAN
Murali Karicheri (1):
net: hsr: Add VLAN CTAG filter support
Ravi Gunasekaran (1):
net: ti: icssg-prueth: Add VLAN support for HSR mode
WingMan Kwok (1):
net: hsr: Add VLAN support
drivers/net/ethernet/ti/icssg/icssg_prueth.c | 45 +++++++++++-
net/hsr/hsr_device.c | 76 ++++++++++++++++++--
net/hsr/hsr_forward.c | 19 +++--
tools/testing/selftests/net/hsr/config | 1 +
tools/testing/selftests/net/hsr/hsr_ping.sh | 63 +++++++++++++++-
5 files changed, 191 insertions(+), 13 deletions(-)
base-commit: 1bf70e6c3a5346966c25e0a1ff492945b25d3f80
--
2.34.1
Soft lockups have been observed on a cluster of Linux-based edge routers
located in a highly dynamic environment. Using the `bird` service, these
routers continuously update BGP-advertised routes due to frequently
changing nexthop destinations, while also managing significant IPv6
traffic. The lockups occur during the traversal of the multipath
circular linked-list in the `fib6_select_path` function, particularly
while iterating through the siblings in the list. The issue typically
arises when the nodes of the linked list are unexpectedly deleted
concurrently on a different core—indicated by their 'next' and
'previous' elements pointing back to the node itself and their reference
count dropping to zero. This results in an infinite loop, leading to a
soft lockup that triggers a system panic via the watchdog timer.
Apply RCU primitives in the problematic code sections to resolve the
issue. Where necessary, update the references to fib6_siblings to
annotate or use the RCU APIs.
Include a test script that reproduces the issue. The script
periodically updates the routing table while generating a heavy load
of outgoing IPv6 traffic through multiple iperf3 clients. It
consistently induces infinite soft lockups within a couple of minutes.
Kernel log:
0 [ffffbd13003e8d30] machine_kexec at ffffffff8ceaf3eb
1 [ffffbd13003e8d90] __crash_kexec at ffffffff8d0120e3
2 [ffffbd13003e8e58] panic at ffffffff8cef65d4
3 [ffffbd13003e8ed8] watchdog_timer_fn at ffffffff8d05cb03
4 [ffffbd13003e8f08] __hrtimer_run_queues at ffffffff8cfec62f
5 [ffffbd13003e8f70] hrtimer_interrupt at ffffffff8cfed756
6 [ffffbd13003e8fd0] __sysvec_apic_timer_interrupt at ffffffff8cea01af
7 [ffffbd13003e8ff0] sysvec_apic_timer_interrupt at ffffffff8df1b83d
-- <IRQ stack> --
8 [ffffbd13003d3708] asm_sysvec_apic_timer_interrupt at ffffffff8e000ecb
[exception RIP: fib6_select_path+299]
RIP: ffffffff8ddafe7b RSP: ffffbd13003d37b8 RFLAGS: 00000287
RAX: ffff975850b43600 RBX: ffff975850b40200 RCX: 0000000000000000
RDX: 000000003fffffff RSI: 0000000051d383e4 RDI: ffff975850b43618
RBP: ffffbd13003d3800 R8: 0000000000000000 R9: ffff975850b40200
R10: 0000000000000000 R11: 0000000000000000 R12: ffffbd13003d3830
R13: ffff975850b436a8 R14: ffff975850b43600 R15: 0000000000000007
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
9 [ffffbd13003d3808] ip6_pol_route at ffffffff8ddb030c
10 [ffffbd13003d3888] ip6_pol_route_input at ffffffff8ddb068c
11 [ffffbd13003d3898] fib6_rule_lookup at ffffffff8ddf02b5
12 [ffffbd13003d3928] ip6_route_input at ffffffff8ddb0f47
13 [ffffbd13003d3a18] ip6_rcv_finish_core.constprop.0 at ffffffff8dd950d0
14 [ffffbd13003d3a30] ip6_list_rcv_finish.constprop.0 at ffffffff8dd96274
15 [ffffbd13003d3a98] ip6_sublist_rcv at ffffffff8dd96474
16 [ffffbd13003d3af8] ipv6_list_rcv at ffffffff8dd96615
17 [ffffbd13003d3b60] __netif_receive_skb_list_core at ffffffff8dc16fec
18 [ffffbd13003d3be0] netif_receive_skb_list_internal at ffffffff8dc176b3
19 [ffffbd13003d3c50] napi_gro_receive at ffffffff8dc565b9
20 [ffffbd13003d3c80] ice_receive_skb at ffffffffc087e4f5 [ice]
21 [ffffbd13003d3c90] ice_clean_rx_irq at ffffffffc0881b80 [ice]
22 [ffffbd13003d3d20] ice_napi_poll at ffffffffc088232f [ice]
23 [ffffbd13003d3d80] __napi_poll at ffffffff8dc18000
24 [ffffbd13003d3db8] net_rx_action at ffffffff8dc18581
25 [ffffbd13003d3e40] __do_softirq at ffffffff8df352e9
26 [ffffbd13003d3eb0] run_ksoftirqd at ffffffff8ceffe47
27 [ffffbd13003d3ec0] smpboot_thread_fn at ffffffff8cf36a30
28 [ffffbd13003d3ee8] kthread at ffffffff8cf2b39f
29 [ffffbd13003d3f28] ret_from_fork at ffffffff8ce5fa64
30 [ffffbd13003d3f50] ret_from_fork_asm at ffffffff8ce03cbb
Fixes: 66f5d6ce53e6 ("ipv6: replace rwlock with rcu and spinlock in fib6_table")
Reported-by: Adrian Oliver <kernel(a)aoliver.ca>
Signed-off-by: Omid Ehtemam-Haghighi <omid.ehtemamhaghighi(a)menlosecurity.com>
Cc: David S. Miller <davem(a)davemloft.net>
Cc: David Ahern <dsahern(a)gmail.com>
Cc: Eric Dumazet <edumazet(a)google.com>
Cc: Jakub Kicinski <kuba(a)kernel.org>
Cc: Paolo Abeni <pabeni(a)redhat.com>
Cc: Shuah Khan <shuah(a)kernel.org>
Cc: Ido Schimmel <idosch(a)idosch.org>
Cc: Kuniyuki Iwashima <kuniyu(a)amazon.com>
Cc: Simon Horman <horms(a)kernel.org>
Cc: netdev(a)vger.kernel.org
Cc: linux-kselftest(a)vger.kernel.org
Cc: linux-kernel(a)vger.kernel.org
---
v5 -> v6:
* Adjust the comment line lengths in the test script to a maximum of
80 characters
* Change memory allocation in inet6_rt_notify from gfp_any() to GFP_ATOMIC for
atomic allocation in non-blocking contexts, as suggested by Ido Schimmel
* NOTE: I have executed the test script on both bare-metal servers and
virtualized environments such as QEMU and vng. In the case of bare-metal, it
consistently triggers a soft lockup in under a minute on unpatched kernels.
For the virtualized environments, an unpatched kernel compiled with the
Ubuntu 24.04 configuration also triggers a soft lockup, though it takes
longer; however, it did not trigger a soft lockup on kernels compiled with
configurations provided in:
https://github.com/linux-netdev/nipa/wiki/How-to-run-netdev-selftests-CI-st…
leading to potential false negatives in the test results.
I am curious if this test can be executed on a bare-metal machine within a
CI system, if such a setup exists, rather than in a virtualized environment.
If that’s not possible, how can I apply a different kernel configuration,
such as the one used in Ubuntu 24.04, for this test? Please advise.
v4 -> v5:
* Addressed review comments from Paolo Abeni.
* Added additional clarifying comments in the test script.
* Minor cleanup performed in the test script.
v3 -> v4:
* Added RCU primitives to rt6_fill_node(). I found that this function is typically
called either with a table lock held or within rcu_read_lock/rcu_read_unlock
pairs, except in the following call chain, where the protection is unclear:
rt_fill_node()
fib6_info_hw_flags_set()
mlxsw_sp_fib6_offload_failed_flag_set()
mlxsw_sp_router_fib6_event_work()
The last function is initialized as a work item in mlxsw_sp_router_fib_event()
and scheduled for deferred execution. I am unsure if the execution context of
this work item is protected by any table lock or rcu_read_lock/rcu_read_unlock
pair, so I have added the protection. Please let me know if this is redundant.
* Other review comments addressed
v2 -> v3:
* Removed redundant rcu_read_lock()/rcu_read_unlock() pairs
* Revised the test script based on Ido Schimmel's feedback
* Updated the test script to ensure compatibility with the latest iperf3 version
* Fixed new warnings generated with 'C=2' in the previous version
* Other review comments addressed
v1 -> v2:
* list_del_rcu() is applied exclusively to legacy multipath code
* All occurrences of fib6_siblings have been modified to utilize RCU
APIs for annotation and usage.
* Additionally, a test script for reproducing the reported
issue is included
---
net/ipv6/ip6_fib.c | 8 +-
net/ipv6/route.c | 45 ++-
tools/testing/selftests/net/Makefile | 1 +
.../net/ipv6_route_update_soft_lockup.sh | 262 ++++++++++++++++++
4 files changed, 297 insertions(+), 19 deletions(-)
create mode 100755 tools/testing/selftests/net/ipv6_route_update_soft_lockup.sh
diff --git a/net/ipv6/ip6_fib.c b/net/ipv6/ip6_fib.c
index eb111d20615c..9a1c59275a10 100644
--- a/net/ipv6/ip6_fib.c
+++ b/net/ipv6/ip6_fib.c
@@ -1190,8 +1190,8 @@ static int fib6_add_rt2node(struct fib6_node *fn, struct fib6_info *rt,
while (sibling) {
if (sibling->fib6_metric == rt->fib6_metric &&
rt6_qualify_for_ecmp(sibling)) {
- list_add_tail(&rt->fib6_siblings,
- &sibling->fib6_siblings);
+ list_add_tail_rcu(&rt->fib6_siblings,
+ &sibling->fib6_siblings);
break;
}
sibling = rcu_dereference_protected(sibling->fib6_next,
@@ -1252,7 +1252,7 @@ static int fib6_add_rt2node(struct fib6_node *fn, struct fib6_info *rt,
fib6_siblings)
sibling->fib6_nsiblings--;
rt->fib6_nsiblings = 0;
- list_del_init(&rt->fib6_siblings);
+ list_del_rcu(&rt->fib6_siblings);
rt6_multipath_rebalance(next_sibling);
return err;
}
@@ -1970,7 +1970,7 @@ static void fib6_del_route(struct fib6_table *table, struct fib6_node *fn,
&rt->fib6_siblings, fib6_siblings)
sibling->fib6_nsiblings--;
rt->fib6_nsiblings = 0;
- list_del_init(&rt->fib6_siblings);
+ list_del_rcu(&rt->fib6_siblings);
rt6_multipath_rebalance(next_sibling);
}
diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index b4251915585f..b9b986bda943 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -413,8 +413,8 @@ void fib6_select_path(const struct net *net, struct fib6_result *res,
struct flowi6 *fl6, int oif, bool have_oif_match,
const struct sk_buff *skb, int strict)
{
- struct fib6_info *sibling, *next_sibling;
struct fib6_info *match = res->f6i;
+ struct fib6_info *sibling;
if (!match->nh && (!match->fib6_nsiblings || have_oif_match))
goto out;
@@ -440,8 +440,8 @@ void fib6_select_path(const struct net *net, struct fib6_result *res,
if (fl6->mp_hash <= atomic_read(&match->fib6_nh->fib_nh_upper_bound))
goto out;
- list_for_each_entry_safe(sibling, next_sibling, &match->fib6_siblings,
- fib6_siblings) {
+ list_for_each_entry_rcu(sibling, &match->fib6_siblings,
+ fib6_siblings) {
const struct fib6_nh *nh = sibling->fib6_nh;
int nh_upper_bound;
@@ -5195,14 +5195,18 @@ static void ip6_route_mpath_notify(struct fib6_info *rt,
* nexthop. Since sibling routes are always added at the end of
* the list, find the first sibling of the last route appended
*/
+ rcu_read_lock();
+
if ((nlflags & NLM_F_APPEND) && rt_last && rt_last->fib6_nsiblings) {
- rt = list_first_entry(&rt_last->fib6_siblings,
- struct fib6_info,
- fib6_siblings);
+ rt = list_first_or_null_rcu(&rt_last->fib6_siblings,
+ struct fib6_info,
+ fib6_siblings);
}
if (rt)
inet6_rt_notify(RTM_NEWROUTE, rt, info, nlflags);
+
+ rcu_read_unlock();
}
static bool ip6_route_mpath_should_notify(const struct fib6_info *rt)
@@ -5547,17 +5551,21 @@ static size_t rt6_nlmsg_size(struct fib6_info *f6i)
nexthop_for_each_fib6_nh(f6i->nh, rt6_nh_nlmsg_size,
&nexthop_len);
} else {
- struct fib6_info *sibling, *next_sibling;
struct fib6_nh *nh = f6i->fib6_nh;
+ struct fib6_info *sibling;
nexthop_len = 0;
if (f6i->fib6_nsiblings) {
rt6_nh_nlmsg_size(nh, &nexthop_len);
- list_for_each_entry_safe(sibling, next_sibling,
- &f6i->fib6_siblings, fib6_siblings) {
+ rcu_read_lock();
+
+ list_for_each_entry_rcu(sibling, &f6i->fib6_siblings,
+ fib6_siblings) {
rt6_nh_nlmsg_size(sibling->fib6_nh, &nexthop_len);
}
+
+ rcu_read_unlock();
}
nexthop_len += lwtunnel_get_encap_size(nh->fib_nh_lws);
}
@@ -5721,7 +5729,7 @@ static int rt6_fill_node(struct net *net, struct sk_buff *skb,
lwtunnel_fill_encap(skb, dst->lwtstate, RTA_ENCAP, RTA_ENCAP_TYPE) < 0)
goto nla_put_failure;
} else if (rt->fib6_nsiblings) {
- struct fib6_info *sibling, *next_sibling;
+ struct fib6_info *sibling;
struct nlattr *mp;
mp = nla_nest_start_noflag(skb, RTA_MULTIPATH);
@@ -5733,14 +5741,21 @@ static int rt6_fill_node(struct net *net, struct sk_buff *skb,
0) < 0)
goto nla_put_failure;
- list_for_each_entry_safe(sibling, next_sibling,
- &rt->fib6_siblings, fib6_siblings) {
+ rcu_read_lock();
+
+ list_for_each_entry_rcu(sibling, &rt->fib6_siblings,
+ fib6_siblings) {
if (fib_add_nexthop(skb, &sibling->fib6_nh->nh_common,
sibling->fib6_nh->fib_nh_weight,
- AF_INET6, 0) < 0)
+ AF_INET6, 0) < 0) {
+ rcu_read_unlock();
+
goto nla_put_failure;
+ }
}
+ rcu_read_unlock();
+
nla_nest_end(skb, mp);
} else if (rt->nh) {
if (nla_put_u32(skb, RTA_NH_ID, rt->nh->id))
@@ -6177,7 +6192,7 @@ void inet6_rt_notify(int event, struct fib6_info *rt, struct nl_info *info,
err = -ENOBUFS;
seq = info->nlh ? info->nlh->nlmsg_seq : 0;
- skb = nlmsg_new(rt6_nlmsg_size(rt), gfp_any());
+ skb = nlmsg_new(rt6_nlmsg_size(rt), GFP_ATOMIC);
if (!skb)
goto errout;
@@ -6190,7 +6205,7 @@ void inet6_rt_notify(int event, struct fib6_info *rt, struct nl_info *info,
goto errout;
}
rtnl_notify(skb, net, info->portid, RTNLGRP_IPV6_ROUTE,
- info->nlh, gfp_any());
+ info->nlh, GFP_ATOMIC);
return;
errout:
rtnl_set_sk_err(net, RTNLGRP_IPV6_ROUTE, err);
diff --git a/tools/testing/selftests/net/Makefile b/tools/testing/selftests/net/Makefile
index 649f1fe0dc46..88cbac27aa10 100644
--- a/tools/testing/selftests/net/Makefile
+++ b/tools/testing/selftests/net/Makefile
@@ -96,6 +96,7 @@ TEST_PROGS += fdb_flush.sh
TEST_PROGS += fq_band_pktlimit.sh
TEST_PROGS += vlan_hw_filter.sh
TEST_PROGS += bpf_offload.py
+TEST_PROGS += ipv6_route_update_soft_lockup.sh
# YNL files, must be before "include ..lib.mk"
EXTRA_CLEAN += $(OUTPUT)/libynl.a
diff --git a/tools/testing/selftests/net/ipv6_route_update_soft_lockup.sh b/tools/testing/selftests/net/ipv6_route_update_soft_lockup.sh
new file mode 100755
index 000000000000..c61a7a8352cf
--- /dev/null
+++ b/tools/testing/selftests/net/ipv6_route_update_soft_lockup.sh
@@ -0,0 +1,262 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+#
+# Testing for potential kernel soft lockup during IPv6 routing table
+# refresh under heavy outgoing IPv6 traffic. If a kernel soft lockup
+# occurs, a kernel panic will be triggered to prevent associated issues.
+#
+#
+# Test Environment Layout
+#
+# ┌----------------┐ ┌----------------┐
+# | SOURCE_NS | | SINK_NS |
+# | NAMESPACE | | NAMESPACE |
+# |(iperf3 clients)| |(iperf3 servers)|
+# | | | |
+# | | | |
+# | ┌-----------| nexthops |---------┐ |
+# | |veth_source|<--------------------------------------->|veth_sink|<┐ |
+# | └-----------|2001:0DB8:1::0:1/96 2001:0DB8:1::1:1/96 |---------┘ | |
+# | | ^ 2001:0DB8:1::1:2/96 | | |
+# | | . . | fwd | |
+# | ┌---------┐ | . . | | |
+# | | IPv6 | | . . | V |
+# | | routing | | . 2001:0DB8:1::1:80/96| ┌-----┐ |
+# | | table | | . | | lo | |
+# | | nexthop | | . └--------┴-----┴-┘
+# | | update | | ............................> 2001:0DB8:2::1:1/128
+# | └-------- ┘ |
+# └----------------┘
+#
+# The test script sets up two network namespaces, source_ns and sink_ns,
+# connected via a veth link. Within source_ns, it continuously updates the
+# IPv6 routing table by flushing and inserting IPV6_NEXTHOP_ADDR_COUNT nexthop
+# IPs destined for SINK_LOOPBACK_IP_ADDR in sink_ns. This refresh occurs at a
+# rate of 1/ROUTING_TABLE_REFRESH_PERIOD per second for TEST_DURATION seconds.
+#
+# Simultaneously, multiple iperf3 clients within source_ns generate heavy
+# outgoing IPv6 traffic. Each client is assigned a unique port number starting
+# at 5000 and incrementing sequentially. Each client targets a unique iperf3
+# server running in sink_ns, connected to the SINK_LOOPBACK_IFACE interface
+# using the same port number.
+#
+# The number of iperf3 servers and clients is set to half of the total
+# available cores on each machine.
+#
+# NOTE: We have tested this script on machines with various CPU specifications,
+# ranging from lower to higher performance as listed below. The test script
+# effectively triggered a kernel soft lockup on machines running an unpatched
+# kernel in under a minute:
+#
+# - 1x Intel Xeon E-2278G 8-Core Processor @ 3.40GHz
+# - 1x Intel Xeon E-2378G Processor 8-Core @ 2.80GHz
+# - 1x AMD EPYC 7401P 24-Core Processor @ 2.00GHz
+# - 1x AMD EPYC 7402P 24-Core Processor @ 2.80GHz
+# - 2x Intel Xeon Gold 5120 14-Core Processor @ 2.20GHz
+# - 1x Ampere Altra Q80-30 80-Core Processor @ 3.00GHz
+# - 2x Intel Xeon Gold 5120 14-Core Processor @ 2.20GHz
+# - 2x Intel Xeon Silver 4214 24-Core Processor @ 2.20GHz
+# - 1x AMD EPYC 7502P 32-Core @ 2.50GHz
+# - 1x Intel Xeon Gold 6314U 32-Core Processor @ 2.30GHz
+# - 2x Intel Xeon Gold 6338 32-Core Processor @ 2.00GHz
+#
+# On less performant machines, you may need to increase the TEST_DURATION
+# parameter to enhance the likelihood of encountering a race condition leading
+# to a kernel soft lockup and avoid a false negative result.
+#
+# NOTE: The test may not produce the expected result in virtualized
+# environments (e.g., qemu) due to differences in timing and CPU handling,
+# which can affect the conditions needed to trigger a soft lockup.
+
+source lib.sh
+source net_helper.sh
+
+TEST_DURATION=300
+ROUTING_TABLE_REFRESH_PERIOD=0.01
+
+IPERF3_BITRATE="300m"
+
+
+IPV6_NEXTHOP_ADDR_COUNT="128"
+IPV6_NEXTHOP_ADDR_MASK="96"
+IPV6_NEXTHOP_PREFIX="2001:0DB8:1"
+
+
+SOURCE_TEST_IFACE="veth_source"
+SOURCE_TEST_IP_ADDR="2001:0DB8:1::0:1/96"
+
+SINK_TEST_IFACE="veth_sink"
+# ${SINK_TEST_IFACE} is populated with the following range of IPv6 addresses:
+# 2001:0DB8:1::1:1 to 2001:0DB8:1::1:${IPV6_NEXTHOP_ADDR_COUNT}
+SINK_LOOPBACK_IFACE="lo"
+SINK_LOOPBACK_IP_MASK="128"
+SINK_LOOPBACK_IP_ADDR="2001:0DB8:2::1:1"
+
+nexthop_ip_list=""
+termination_signal=""
+kernel_softlokup_panic_prev_val=""
+
+terminate_ns_processes_by_pattern() {
+ local ns=$1
+ local pattern=$2
+
+ for pid in $(ip netns pids ${ns}); do
+ [ -e /proc/$pid/cmdline ] && grep -qe "${pattern}" /proc/$pid/cmdline && kill -9 $pid
+ done
+}
+
+cleanup() {
+ echo "info: cleaning up namespaces and terminating all processes within them..."
+
+
+ # Terminate iperf3 instances running in the source_ns. To avoid race
+ # conditions, first iterate over the PIDs and terminate those
+ # associated with the bash shells running the
+ # `while true; do iperf3 -c ...; done` loops. In a second iteration,
+ # terminate the individual `iperf3 -c ...` instances.
+ terminate_ns_processes_by_pattern ${source_ns} while
+ terminate_ns_processes_by_pattern ${source_ns} iperf3
+
+ # Repeat the same process for sink_ns
+ terminate_ns_processes_by_pattern ${sink_ns} while
+ terminate_ns_processes_by_pattern ${sink_ns} iperf3
+
+ # Check if any iperf3 instances are still running. This could happen
+ # if a core has entered an infinite loop and the timeout for detecting
+ # the soft lockup has not expired, but either the test interval has
+ # already elapsed or the test was terminated manually (e.g., with ^C)
+ for pid in $(ip netns pids ${source_ns}); do
+ if [ -e /proc/$pid/cmdline ] && grep -qe 'iperf3' /proc/$pid/cmdline; then
+ echo "FAIL: unable to terminate some iperf3 instances. Soft lockup is underway. A kernel panic is on the way!"
+ exit ${ksft_fail}
+ fi
+ done
+
+ if [ "$termination_signal" == "SIGINT" ]; then
+ echo "SKIP: Termination due to ^C (SIGINT)"
+ elif [ "$termination_signal" == "SIGALRM" ]; then
+ echo "PASS: No kernel soft lockup occurred during this ${TEST_DURATION} second test"
+ fi
+
+ cleanup_ns ${source_ns} ${sink_ns}
+
+ sysctl -qw kernel.softlockup_panic=${kernel_softlokup_panic_prev_val}
+}
+
+setup_prepare() {
+ setup_ns source_ns sink_ns
+
+ ip -n ${source_ns} link add name ${SOURCE_TEST_IFACE} type veth peer name ${SINK_TEST_IFACE} netns ${sink_ns}
+
+ # Setting up the Source namespace
+ ip -n ${source_ns} addr add ${SOURCE_TEST_IP_ADDR} dev ${SOURCE_TEST_IFACE}
+ ip -n ${source_ns} link set dev ${SOURCE_TEST_IFACE} qlen 10000
+ ip -n ${source_ns} link set dev ${SOURCE_TEST_IFACE} up
+ ip netns exec ${source_ns} sysctl -qw net.ipv6.fib_multipath_hash_policy=1
+
+ # Setting up the Sink namespace
+ ip -n ${sink_ns} addr add ${SINK_LOOPBACK_IP_ADDR}/${SINK_LOOPBACK_IP_MASK} dev ${SINK_LOOPBACK_IFACE}
+ ip -n ${sink_ns} link set dev ${SINK_LOOPBACK_IFACE} up
+ ip netns exec ${sink_ns} sysctl -qw net.ipv6.conf.${SINK_LOOPBACK_IFACE}.forwarding=1
+
+ ip -n ${sink_ns} link set ${SINK_TEST_IFACE} up
+ ip netns exec ${sink_ns} sysctl -qw net.ipv6.conf.${SINK_TEST_IFACE}.forwarding=1
+
+
+ # Populate nexthop IPv6 addresses on the test interface in the sink_ns
+ echo "info: populating ${IPV6_NEXTHOP_ADDR_COUNT} IPv6 addresses on the ${SINK_TEST_IFACE} interface ..."
+ for IP in $(seq 1 ${IPV6_NEXTHOP_ADDR_COUNT}); do
+ ip -n ${sink_ns} addr add ${IPV6_NEXTHOP_PREFIX}::$(printf "1:%x" "${IP}")/${IPV6_NEXTHOP_ADDR_MASK} dev ${SINK_TEST_IFACE};
+ done
+
+ # Preparing list of nexthops
+ for IP in $(seq 1 ${IPV6_NEXTHOP_ADDR_COUNT}); do
+ nexthop_ip_list=$nexthop_ip_list" nexthop via ${IPV6_NEXTHOP_PREFIX}::$(printf "1:%x" $IP) dev ${SOURCE_TEST_IFACE} weight 1"
+ done
+}
+
+
+test_soft_lockup_during_routing_table_refresh() {
+ # Start num_of_iperf_servers iperf3 servers in the sink_ns namespace,
+ # each listening on ports starting at 5001 and incrementing
+ # sequentially. Since iperf3 instances may terminate unexpectedly, a
+ # while loop is used to automatically restart them in such cases.
+ echo "info: starting ${num_of_iperf_servers} iperf3 servers in the sink_ns namespace ..."
+ for i in $(seq 1 ${num_of_iperf_servers}); do
+ cmd="iperf3 --bind ${SINK_LOOPBACK_IP_ADDR} -s -p $(printf '5%03d' ${i}) --rcv-timeout 200 &>/dev/null"
+ ip netns exec ${sink_ns} bash -c "while true; do ${cmd}; done &" &>/dev/null
+ done
+
+ # Wait for the iperf3 servers to be ready
+ for i in $(seq ${num_of_iperf_servers}); do
+ port=$(printf '5%03d' ${i});
+ wait_local_port_listen ${sink_ns} ${port} tcp
+ done
+
+ # Continuously refresh the routing table in the background within
+ # the source_ns namespace
+ ip netns exec ${source_ns} bash -c "
+ while \$(ip netns list | grep -q ${source_ns}); do
+ ip -6 route add ${SINK_LOOPBACK_IP_ADDR}/${SINK_LOOPBACK_IP_MASK} ${nexthop_ip_list};
+ sleep ${ROUTING_TABLE_REFRESH_PERIOD};
+ ip -6 route delete ${SINK_LOOPBACK_IP_ADDR}/${SINK_LOOPBACK_IP_MASK};
+ done &"
+
+ # Start num_of_iperf_servers iperf3 clients in the source_ns namespace,
+ # each sending TCP traffic on sequential ports starting at 5001.
+ # Since iperf3 instances may terminate unexpectedly (e.g., if the route
+ # to the server is deleted in the background during a route refresh), a
+ # while loop is used to automatically restart them in such cases.
+ echo "info: starting ${num_of_iperf_servers} iperf3 clients in the source_ns namespace ..."
+ for i in $(seq 1 ${num_of_iperf_servers}); do
+ cmd="iperf3 -c ${SINK_LOOPBACK_IP_ADDR} -p $(printf '5%03d' ${i}) --length 64 --bitrate ${IPERF3_BITRATE} -t 0 --connect-timeout 150 &>/dev/null"
+ ip netns exec ${source_ns} bash -c "while true; do ${cmd}; done &" &>/dev/null
+ done
+
+ echo "info: IPv6 routing table is being updated at the rate of $(echo "1/${ROUTING_TABLE_REFRESH_PERIOD}" | bc)/s for ${TEST_DURATION} seconds ..."
+ echo "info: A kernel soft lockup, if detected, results in a kernel panic!"
+
+ wait
+}
+
+# Make sure 'iperf3' is installed, skip the test otherwise
+if [ ! -x "$(command -v "iperf3")" ]; then
+ echo "SKIP: 'iperf3' is not installed. Skipping the test."
+ exit ${ksft_skip}
+fi
+
+# Determine the number of cores on the machine
+num_of_iperf_servers=$(( $(nproc)/2 ))
+
+# Check if we are running on a multi-core machine, skip the test otherwise
+if [ "${num_of_iperf_servers}" -eq 0 ]; then
+ echo "SKIP: This test is not valid on a single core machine!"
+ exit ${ksft_skip}
+fi
+
+# Since the kernel soft lockup we're testing causes at least one core to enter
+# an infinite loop, destabilizing the host and likely affecting subsequent
+# tests, we trigger a kernel panic instead of reporting a failure and
+# continuing
+kernel_softlokup_panic_prev_val=$(sysctl -n kernel.softlockup_panic)
+sysctl -qw kernel.softlockup_panic=1
+
+handle_sigint() {
+ termination_signal="SIGINT"
+ cleanup
+ exit ${ksft_skip}
+}
+
+handle_sigalrm() {
+ termination_signal="SIGALRM"
+ cleanup
+ exit ${ksft_pass}
+}
+
+trap handle_sigint SIGINT
+trap handle_sigalrm SIGALRM
+
+(sleep ${TEST_DURATION} && kill -s SIGALRM $$)&
+
+setup_prepare
+test_soft_lockup_during_routing_table_refresh
--
2.43.0
From: zhouyuhang <zhouyuhang(a)kylinos.cn>
The libcap commit aca076443591 ("Make cap_t operations thread safe.")
added a __u8 mutex at the beginning of the struct _cap_struct, it changes
the offset of the members in the structure that breaks the assumption
made in the "struct libcap" definition in clone3_cap_checkpoint_restore.c.
This causes the call to cap_set_proc here to fail with error code EPERM,
and the output is as follows:
# RUN global.clone3_cap_checkpoint_restore ...
# clone3() syscall supported
# clone3_cap_checkpoint_restore.c:151:clone3_cap_checkpoint_restore:Child has PID 130508
cap_set_proc: Operation not permitted
# clone3_cap_checkpoint_restore.c:160:clone3_cap_checkpoint_restore:Expected set_capability() (-1) == 0 (0)
# clone3_cap_checkpoint_restore.c:161:clone3_cap_checkpoint_restore:Could not set CAP_CHECKPOINT_RESTORE
# clone3_cap_checkpoint_restore: Test terminated by assertion
# FAIL global.clone3_cap_checkpoint_restore
Changing to using capget and capset syscall directly here can fix this error,
just like what the commit 663af70aabb7 ("bpf: selftests: Add helpers to directly
use the capget and capset syscall") does.
The output is as follows:
# RUN global.clone3_cap_checkpoint_restore ...
# clone3() syscall supported
# clone3_cap_checkpoint_restore.c:160:clone3_cap_checkpoint_restore:Child has PID 23708
# clone3_cap_checkpoint_restore.c:91:clone3_cap_checkpoint_restore:[23707] Trying clone3() with CLONE_SET_TID to 23708
# clone3_cap_checkpoint_restore.c:58:clone3_cap_checkpoint_restore:Operation not permitted - Failed to create new process
# clone3_cap_checkpoint_restore.c:93:clone3_cap_checkpoint_restore:[23707] clone3() with CLONE_SET_TID 23708 says:-1
# clone3_cap_checkpoint_restore.c:91:clone3_cap_checkpoint_restore:[23707] Trying clone3() with CLONE_SET_TID to 23708
# clone3_cap_checkpoint_restore.c:73:clone3_cap_checkpoint_restore:I am the parent (23707). My child's pid is 23708
# clone3_cap_checkpoint_restore.c:66:clone3_cap_checkpoint_restore:I am the child, my PID is 23708 (expected 23708)
# clone3_cap_checkpoint_restore.c:93:clone3_cap_checkpoint_restore:[23707] clone3() with CLONE_SET_TID 23708 says:0
# OK global.clone3_cap_checkpoint_restore
ok 1 global.clone3_cap_checkpoint_restore
# PASSED: 1 / 1 tests passed.
Signed-off-by: zhouyuhang <zhouyuhang(a)kylinos.cn>
---
v4:
* Add some comments and modify the output and return value when set_capability fails
v3:
* Remove locally declared system calls and retained the - lcap in the Makefile.
v2:
* Move locally declared system calls to header file.
v1:
* Directly using capget and capset and declare them locally.
---
.../clone3/clone3_cap_checkpoint_restore.c | 77 +++++++++++--------
1 file changed, 43 insertions(+), 34 deletions(-)
diff --git a/tools/testing/selftests/clone3/clone3_cap_checkpoint_restore.c b/tools/testing/selftests/clone3/clone3_cap_checkpoint_restore.c
index 3c196fa86c99..076f9d4cce60 100644
--- a/tools/testing/selftests/clone3/clone3_cap_checkpoint_restore.c
+++ b/tools/testing/selftests/clone3/clone3_cap_checkpoint_restore.c
@@ -27,6 +27,13 @@
#include "../kselftest_harness.h"
#include "clone3_selftests.h"
+/*
+ * Prevent not being defined in the header file
+ */
+#ifndef CAP_CHECKPOINT_RESTORE
+#define CAP_CHECKPOINT_RESTORE 40
+#endif
+
static void child_exit(int ret)
{
fflush(stdout);
@@ -87,47 +94,49 @@ static int test_clone3_set_tid(struct __test_metadata *_metadata,
return ret;
}
-struct libcap {
- struct __user_cap_header_struct hdr;
+static int set_capability(struct __test_metadata *_metadata)
+{
struct __user_cap_data_struct data[2];
-};
-static int set_capability(void)
-{
- cap_value_t cap_values[] = { CAP_SETUID, CAP_SETGID };
- struct libcap *cap;
- int ret = -1;
- cap_t caps;
-
- caps = cap_get_proc();
- if (!caps) {
- perror("cap_get_proc");
- return -1;
- }
+ /*
+ * Only _LINUX_CAPABILITY_VERSION_3 can be used here.
+ * _LINUX_CAPABILITY_VERSION_1 represents use 32-bit capabilities,
+ * using it will cause CAP_CHECKPOINT_RESTORE to not be set.
+ * _LINUX_CAPABILITY_VERSION_2 has already been deprecated.
+ */
+ struct __user_cap_header_struct hdr = {
+ .version = _LINUX_CAPABILITY_VERSION_3,
+ };
- /* Drop all capabilities */
- if (cap_clear(caps)) {
- perror("cap_clear");
- goto out;
+ /*
+ * CAP_CHECKPOINT_RESTORE is greater than 31, so we need two u32.
+ * cap0 is the lower 32bit and cap1 is the higher 32bit, they will
+ * be combined into a u64 in mk_kernel_cap.
+ */
+ __u32 cap0 = 1 << CAP_SETUID | 1 << CAP_SETGID;
+ __u32 cap1 = 1 << (CAP_CHECKPOINT_RESTORE - 32);
+ int ret;
+
+ ret = capget(&hdr, data);
+ if (ret) {
+ TH_LOG("%s - Failed to get capability", strerror(errno));
+ return -errno;
}
- cap_set_flag(caps, CAP_EFFECTIVE, 2, cap_values, CAP_SET);
- cap_set_flag(caps, CAP_PERMITTED, 2, cap_values, CAP_SET);
+ /* Drop all capabilities */
+ memset(&data, 0, sizeof(data));
- cap = (struct libcap *) caps;
+ data[0].effective |= cap0;
+ data[0].permitted |= cap0;
- /* 40 -> CAP_CHECKPOINT_RESTORE */
- cap->data[1].effective |= 1 << (40 - 32);
- cap->data[1].permitted |= 1 << (40 - 32);
+ data[1].effective |= cap1;
+ data[1].permitted |= cap1;
- if (cap_set_proc(caps)) {
- perror("cap_set_proc");
- goto out;
+ ret = capset(&hdr, data);
+ if (ret) {
+ TH_LOG("%s - Failed to set capability", strerror(errno));
+ return -errno;
}
- ret = 0;
-out:
- if (cap_free(caps))
- perror("cap_free");
return ret;
}
@@ -157,7 +166,7 @@ TEST(clone3_cap_checkpoint_restore)
/* After the child has finished, its PID should be free. */
set_tid[0] = pid;
- ASSERT_EQ(set_capability(), 0)
+ ASSERT_EQ(set_capability(_metadata), 0)
TH_LOG("Could not set CAP_CHECKPOINT_RESTORE");
ASSERT_EQ(prctl(PR_SET_KEEPCAPS, 1, 0, 0, 0), 0);
@@ -169,7 +178,7 @@ TEST(clone3_cap_checkpoint_restore)
set_tid[0] = pid;
/* This would fail without CAP_CHECKPOINT_RESTORE */
ASSERT_EQ(test_clone3_set_tid(_metadata, set_tid, 1), -EPERM);
- ASSERT_EQ(set_capability(), 0)
+ ASSERT_EQ(set_capability(_metadata), 0)
TH_LOG("Could not set CAP_CHECKPOINT_RESTORE");
/* This should work as we have CAP_CHECKPOINT_RESTORE as non-root */
ASSERT_EQ(test_clone3_set_tid(_metadata, set_tid, 1), 0);
--
2.27.0
When memslot_perf_test is run nested, first iteration of test_memslot_rw_loop
testcase, sometimes takes more than 2 seconds due to build of shadow page tables.
Following iterations are fast.
To be on the safe side, bump the timeout to 10 seconds.
Signed-off-by: Maxim Levitsky <mlevitsk(a)redhat.com>
---
tools/testing/selftests/kvm/memslot_perf_test.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/tools/testing/selftests/kvm/memslot_perf_test.c b/tools/testing/selftests/kvm/memslot_perf_test.c
index 893366982f77b..e9189cbed4bb6 100644
--- a/tools/testing/selftests/kvm/memslot_perf_test.c
+++ b/tools/testing/selftests/kvm/memslot_perf_test.c
@@ -415,7 +415,7 @@ static bool _guest_should_exit(void)
*/
static noinline void host_perform_sync(struct sync_area *sync)
{
- alarm(2);
+ alarm(10);
atomic_store_explicit(&sync->sync_flag, true, memory_order_release);
while (atomic_load_explicit(&sync->sync_flag, memory_order_acquire))
--
2.26.3
The loop in test_create_guest_memfd_invalid that is supposed to test
that nothing is accepted as a valid flag to KVM_CREATE_GUEST_MEMFD was
initializing `flag` as 0 instead of BIT(0). This caused the loop to
immediately exit instead of iterating over BIT(0), BIT(1), ... .
Fixes: 8a89efd43423 ("KVM: selftests: Add basic selftest for guest_memfd()")
Signed-off-by: Patrick Roy <roypat(a)amazon.co.uk>
---
tools/testing/selftests/kvm/guest_memfd_test.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/tools/testing/selftests/kvm/guest_memfd_test.c b/tools/testing/selftests/kvm/guest_memfd_test.c
index ba0c8e9960358..ce687f8d248fc 100644
--- a/tools/testing/selftests/kvm/guest_memfd_test.c
+++ b/tools/testing/selftests/kvm/guest_memfd_test.c
@@ -134,7 +134,7 @@ static void test_create_guest_memfd_invalid(struct kvm_vm *vm)
size);
}
- for (flag = 0; flag; flag <<= 1) {
+ for (flag = BIT(0); flag; flag <<= 1) {
fd = __vm_create_guest_memfd(vm, page_size, flag);
TEST_ASSERT(fd == -1 && errno == EINVAL,
"guest_memfd() with flag '0x%lx' should fail with EINVAL",
--
2.47.0
This series primarily introduces SEV-SNP test for the kernel selftest
framework. It tests boot, ioctl, pre fault, and fallocate in various
combinations to exercise both positive and negative launch flow paths.
Patch 1 - Adds a wrapper for the ioctl calls that decouple ioctl and
asserts, which enables the use of negative test cases. No functional
change intended.
Patch 2 - Extend the sev smoke tests to use the SNP specific ioctl
calls and sets up memory to boot a SNP guest VM
Patch 3 - Adds SNP to shutdown testing
Patch 4, 5 - Tests the ioctl path for SEV, SEV-ES and SNP
Patch 6 - Adds support for SNP in KVM_SEV_INIT2 tests
Patch 7,8,9 - Enable Prefault tests for SEV, SEV-ES and SNP
The patchset is rebased on top of kvm-x86/next branch.
v3:
1. Remove the assignments for the prefault and fallocate test type
enums.
2. Fix error message for sev launch measure and finish.
3. Collect tested-by tags [Peter, Srikanth]
v2:
https://lore.kernel.org/kvm/20240816192310.117456-1-pratikrajesh.sampat@amd…
1. Add SMT parsing check to populate SNP policy flags
2. Extend Peter Gonda's shutdown test to include SNP
3. Introduce new tests for prefault which include exercising prefault,
fallocate, hole-punch in various combinations.
4. Decouple ioctl patch reworked to introduce private variants of the
the functions that call into the ioctl. Also reordered the patch for
it to arrive first so that new APIs are not written right after
their introduction.
5. General cleanups - adding comments, avoiding local booleans, better
error message. Suggestions incorporated from Peter, Tom, and Sean.
RFC:
https://lore.kernel.org/kvm/20240710220540.188239-1-pratikrajesh.sampat@amd…
Any feedback/review is highly appreciated!
Michael Roth (2):
KVM: selftests: Add interface to manually flag protected/encrypted
ranges
KVM: selftests: Add a CoCo-specific test for KVM_PRE_FAULT_MEMORY
Pratik R. Sampat (7):
KVM: selftests: Decouple SEV ioctls from asserts
KVM: selftests: Add a basic SNP smoke test
KVM: selftests: Add SNP to shutdown testing
KVM: selftests: SEV IOCTL test
KVM: selftests: SNP IOCTL test
KVM: selftests: SEV-SNP test for KVM_SEV_INIT2
KVM: selftests: Interleave fallocate for KVM_PRE_FAULT_MEMORY
tools/testing/selftests/kvm/Makefile | 1 +
.../testing/selftests/kvm/include/kvm_util.h | 13 +
.../selftests/kvm/include/x86_64/processor.h | 1 +
.../selftests/kvm/include/x86_64/sev.h | 76 +++-
tools/testing/selftests/kvm/lib/kvm_util.c | 53 ++-
.../selftests/kvm/lib/x86_64/processor.c | 6 +-
tools/testing/selftests/kvm/lib/x86_64/sev.c | 190 +++++++-
.../kvm/x86_64/coco_pre_fault_memory_test.c | 421 ++++++++++++++++++
.../selftests/kvm/x86_64/sev_init2_tests.c | 13 +
.../selftests/kvm/x86_64/sev_smoke_test.c | 297 +++++++++++-
10 files changed, 1023 insertions(+), 48 deletions(-)
create mode 100644 tools/testing/selftests/kvm/x86_64/coco_pre_fault_memory_test.c
--
2.34.1
Changes since V3:
- V3: https://lore.kernel.org/all/cover.1729218182.git.reinette.chatre@intel.com/
- Rebased on HEAD 2a027d6bb660 of kselftest/next.
- Fix empty string parsing issues pointed out by Ilpo.
- Add Reviewed-by tags.
- Please see individual patches for detailed changes.
Changes since V2:
- V2: https://lore.kernel.org/all/cover.1726164080.git.reinette.chatre@intel.com/
- Add fix to protect against buffer overflow when parsing text from sysfs files.
- Add cleanup patch to address use of magic constants as pointed out by
Ilpo.
- Add Reviewed-by tags where received, except for "selftests/resctrl: Use cache
size to determine "fill_buf" buffer size" that changed too much since
receiving the Reviewed-by tag.
- Please see individual patches for detailed changes.
Changes since V1:
- V1: https://lore.kernel.org/cover.1724970211.git.reinette.chatre@intel.com/
- V2 contains the same general solutions to stated problem as V1 but these
are now preceded by more fixes (patches 1 to 5) and improved robustness
(patches 6 to 9) to existing tests before the series gets back
to solving the original problem with more confidence in patches 10 to 13.
- The posibility of making "memflush = false" for CMT test was discussed
during V1. Modifying this setting does not have a significant impact on the
observed results that are already well within acceptable range and this
version thus keeps original default. If performance was a goal it may
be possible to do further experimentation where "memflush = false" could
eliminate the need for the sleep(1) within the test wrapper, but
improving the performance is not a goal of this work.
- (New) Support what seems to be unintended ability for user space to provide
parameters to "fill_buf" by making the parsing robust and only support
changing parameters that are supported to be changed. Drop support for
"write" operation since it has never been measured.
- (New) Improve wraparound handling. (Ilpo)
- (New) A couple of new fixes addressing issues discovered during development.
- (Change from V1) To support fill_buf parameters provided by user space as
well as test specific fill_buf parameters struct fill_buf_param is no longer
just a member of struct resctrl_val_param, instead there could be at most
two instances of struct fill_buf_param, the immutable parameters provided
by user space and the parameters used by individual tests. (Ilpo)
- Please see individual patches for detailed changes.
V1 cover:
The resctrl selftests for Memory Bandwidth Allocation (MBA) and Memory
Bandwidth Monitoring (MBM) are failing on some (for example [1]) Emerald
Rapids systems. The test failures result from the following two
properties of these systems:
1) Emerald Rapids systems can have up to 320MB L3 cache. The resctrl
MBA and MBM selftests measure memory traffic for which a hardcoded
250MB buffer has been sufficient so far. On platforms with L3 cache
larger than the buffer, the buffer fits in the L3 cache and thus
no/very little memory traffic is generated during the "memory
bandwidth" tests.
2) Some platform features, for example RAS features or memory
performance features that generate memory traffic may drive accesses
that are counted differently by performance counters and MBM
respectively, for instance generating "overhead" traffic which is not
counted against any specific RMID. Until now these counting
differences have always been "in the noise". On Emerald Rapids
systems the maximum MBA throttling (10% memory bandwidth)
throttles memory bandwidth to where memory accesses by these other
platform features push the memory bandwidth difference between
memory controller performance counters and resctrl (MBM) beyond the
tests' hardcoded tolerance.
Make the tests more robust against platform variations:
1) Let the buffer used by memory bandwidth tests be guided by the size
of the L3 cache.
2) Larger buffers require longer initialization time before the buffer can
be used to measurement. Rework the tests to ensure that buffer
initialization is complete before measurements start.
3) Do not compare performance counters and MBM measurements at low
bandwidth. The value of "low" is hardcoded to 750MiB based on
measurements on Emerald Rapids, Sapphire Rapids, and Ice Lake
systems. This limit is not applicable to AMD systems since it
only applies to the MBA and MBM tests that are isolated to Intel.
[1]
https://ark.intel.com/content/www/us/en/ark/products/237261/intel-xeon-plat…
Reinette Chatre (15):
selftests/resctrl: Make functions only used in same file static
selftests/resctrl: Print accurate buffer size as part of MBM results
selftests/resctrl: Fix memory overflow due to unhandled wraparound
selftests/resctrl: Protect against array overrun during iMC config
parsing
selftests/resctrl: Protect against array overflow when reading strings
selftests/resctrl: Make wraparound handling obvious
selftests/resctrl: Remove "once" parameter required to be false
selftests/resctrl: Only support measured read operation
selftests/resctrl: Remove unused measurement code
selftests/resctrl: Make benchmark parameter passing robust
selftests/resctrl: Ensure measurements skip initialization of default
benchmark
selftests/resctrl: Use cache size to determine "fill_buf" buffer size
selftests/resctrl: Do not compare performance counters and resctrl at
low bandwidth
selftests/resctrl: Keep results from first test run
selftests/resctrl: Replace magic constants used as array size
tools/testing/selftests/resctrl/cmt_test.c | 37 +-
tools/testing/selftests/resctrl/fill_buf.c | 45 +-
tools/testing/selftests/resctrl/mba_test.c | 54 ++-
tools/testing/selftests/resctrl/mbm_test.c | 37 +-
tools/testing/selftests/resctrl/resctrl.h | 79 +++-
.../testing/selftests/resctrl/resctrl_tests.c | 95 +++-
tools/testing/selftests/resctrl/resctrl_val.c | 447 +++++-------------
tools/testing/selftests/resctrl/resctrlfs.c | 19 +-
8 files changed, 354 insertions(+), 459 deletions(-)
base-commit: 2a027d6bb66002c8e50e974676f932b33c5fce10
--
2.47.0
This series is a follow-up to Joey's Permission Overlay Extension (POE)
series [1] that recently landed on mainline. The goal is to improve the
way we handle the register that governs which pkeys/POIndex are
accessible (POR_EL0) during signal delivery. As things stand, we may
unexpectedly fail to write the signal frame on the stack because POR_EL0
is not reset before the uaccess operations. See patch 1 for more details
and the main changes this series brings.
A similar series landed recently for x86/MPK [2]; the present series
aims at aligning arm64 with x86. Worth noting: once the signal frame is
written, POR_EL0 is still set to POR_EL0_INIT, granting access to pkey 0
only. This means that a program that sets up an alternate signal stack
with a non-zero pkey will need some assembly trampoline to set POR_EL0
before invoking the real signal handler, as discussed here [3]. This is
not ideal, but it makes experimentation with pkeys in signal handlers
possible while waiting for a potential interface to control the pkey
state when delivering a signal. See Pierre's reply [4] for more
information about use-cases and a potential interface.
The x86 series also added kselftests to ensure that no spurious SIGSEGV
occurs during signal delivery regardless of which pkey is accessible at
the point where the signal is delivered. This series adapts those
kselftests to allow running them on arm64 (patch 4-5). There is a
dependency on Yury's PKEY_UNRESTRICTED patch [7] for patch 4
specifically.
Finally patch 2 is a clean-up following feedback on Joey's series [5].
I have tested this series on arm64 and x86_64 (booting and running the
protection_keys and pkey_sighandler_tests mm kselftests).
- Kevin
---
v2..v3:
* Reordered patches (patch 1 is now the main patch).
* Patch 1: compute por_enable_all with an explicit loop, based on
arch_max_pkey() (suggestion from Dave M).
* Patch 4: improved naming, replaced global pkey reg value with inline
helper, made use of Yury's PKEY_UNRESTRICTED macro [7] (suggestions
from Dave H).
v2: https://lore.kernel.org/linux-arm-kernel/20241023150511.3923558-1-kevin.bro…
v1..v2:
* In setup_rt_frame(), ensured that POR_EL0 is reset to its original
value if we fail to deliver the signal (addresses Catalin's concern [6]).
* Renamed *unpriv_access* to *user_access* in patch 3 (suggestion from
Dave).
* Made what patch 1-2 do explicit in the commit message body (suggestion
from Dave).
v1: https://lore.kernel.org/linux-arm-kernel/20241017133909.3837547-1-kevin.bro…
[1] https://lore.kernel.org/linux-arm-kernel/20240822151113.1479789-1-joey.goul…
[2] https://lore.kernel.org/lkml/20240802061318.2140081-1-aruna.ramakrishna@ora…
[3] https://lore.kernel.org/lkml/CABi2SkWxNkP2O7ipkP67WKz0-LV33e5brReevTTtba6oK…
[4] https://lore.kernel.org/linux-arm-kernel/87plns8owh.fsf@arm.com/
[5] https://lore.kernel.org/linux-arm-kernel/20241015114116.GA19334@willie-the-…
[6] https://lore.kernel.org/linux-arm-kernel/Zw6D2waVyIwYE7wd@arm.com/
[7] https://lore.kernel.org/all/20241028090715.509527-2-yury.khrustalev@arm.com/
Cc: akpm(a)linux-foundation.org
Cc: anshuman.khandual(a)arm.com
Cc: aruna.ramakrishna(a)oracle.com
Cc: broonie(a)kernel.org
Cc: catalin.marinas(a)arm.com
Cc: dave.hansen(a)linux.intel.com
Cc: Dave.Martin(a)arm.com
Cc: jeffxu(a)chromium.org
Cc: joey.gouly(a)arm.com
Cc: keith.lucas(a)oracle.com
Cc: pierre.langlois(a)arm.com
Cc: shuah(a)kernel.org
Cc: sroettger(a)google.com
Cc: tglx(a)linutronix.de
Cc: will(a)kernel.org
Cc: yury.khrustalev(a)arm.com
Cc: linux-kselftest(a)vger.kernel.org
Cc: x86(a)kernel.org
Kevin Brodsky (5):
arm64: signal: Improve POR_EL0 handling to avoid uaccess failures
arm64: signal: Remove unnecessary check when saving POE state
arm64: signal: Remove unused macro
selftests/mm: Use generic pkey register manipulation
selftests/mm: Enable pkey_sighandler_tests on arm64
arch/arm64/kernel/signal.c | 95 ++++++++++++---
tools/testing/selftests/mm/Makefile | 8 +-
tools/testing/selftests/mm/pkey-arm64.h | 1 +
tools/testing/selftests/mm/pkey-x86.h | 2 +
.../selftests/mm/pkey_sighandler_tests.c | 115 ++++++++++++++----
5 files changed, 176 insertions(+), 45 deletions(-)
--
2.43.0
The PSCI v1.3 spec (https://developer.arm.com/documentation/den0022)
adds support for a SYSTEM_OFF2 function enabling a HIBERNATE_OFF state
which is analogous to ACPI S4. This will allow hosting environments to
determine that a guest is hibernated rather than just powered off, and
ensure that they preserve the virtual environment appropriately to
allow the guest to resume safely (or bump the hardware_signature in the
FACS to trigger a clean reboot instead).
This updates KVM to support advertising PSCI v1.3, and unconditionally
enables the SYSTEM_OFF2 support when PSCI v1.3 is enabled.
For the guest side, add a new SYS_OFF_MODE_POWER_OFF handler with higher
priority than the EFI one, but which *only* triggers when there's a
hibernation in progress. There are other ways to do this (see the commit
message for more details) but this seemed like the simplest.
Version 2 of the patch series splits out the psci.h definitions into a
separate commit (a dependency for both the guest and KVM side), and adds
definitions for the other new functions added in v1.3. It also moves the
pKVM psci-relay support to a separate commit; although in arch/arm64/kvm
that's actually about the *guest* side of SYSTEM_OFF2 (i.e. using it
from the host kernel, relayed through nVHE).
Version 3 dropped the KVM_CAP which allowed userspace to explicitly opt
in to the new feature like with SYSTEM_SUSPEND, and makes it depend only
on PSCI v1.3 being exposed to the guest.
Version 4 is no longer RFC, as the PSCI v1.3 spec is finally published.
Minor fixes from the last round of review, and an added KVM self test.
Version 5 drops some of the changes which didn't make it to the final
v1.3 spec, and cleans up a couple of places which still referred to it
as 'alpha' or 'beta'. It also temporarily drops the guest-side patch to
invoke SYSTEM_OFF2 for hibernation, pending confirmation that the final
PSCI v1.3 spec just has a typo where it changed to saying that 0x1
should be passed to mean HIBERNATE_OFF, even though it's advertised as
bit 0. That can be sent under separate cover, and perhaps should have
been anyway. The change in question doesn't matter for any of the KVM
patches, because we just treat SYSTEM_OFF2 like the existing
SYSTEM_RESET2, setting a flag to indicate that it was a SYSTEM_OFF2
call, but not actually caring about the argument; that's for userspace
to worry about.
Version 6 is updated to revision F.b of the specification, allowing both
0x0 and 0x1 to indicate HIBERNATE_OFF in the SYSTEM_OFF2 call, as well
as disallowing anything but zero in the second argument. The test is
expanded to cover those variants, and to require PSCI v1.3 instead of
skipping on older kernels. The final guest-side patch in the series is
reinstated, using zero as the argument for compatibility with existing
hypervisors in the field as permitted by revision F.b. allows for zero
as well as 0x1 for HIBERNATE_OFF to accommodate the change in the final
version of the spec, and expands the test to cover both forms as well as
checking that a non-zero second argument is correctly rejected.
David Woodhouse (6):
firmware/psci: Add definitions for PSCI v1.3 specification
KVM: arm64: Add PSCI v1.3 SYSTEM_OFF2 function for hibernation
KVM: arm64: Add support for PSCI v1.2 and v1.3
KVM: selftests: Add test for PSCI SYSTEM_OFF2
KVM: arm64: nvhe: Pass through PSCI v1.3 SYSTEM_OFF2 call
arm64: Use SYSTEM_OFF2 PSCI call to power off for hibernate
Documentation/virt/kvm/api.rst | 11 +++
arch/arm64/include/uapi/asm/kvm.h | 6 ++
arch/arm64/kvm/hyp/nvhe/psci-relay.c | 2 +
arch/arm64/kvm/hypercalls.c | 2 +
arch/arm64/kvm/psci.c | 50 ++++++++++++-
drivers/firmware/psci/psci.c | 42 +++++++++++
include/kvm/arm_psci.h | 4 +-
include/uapi/linux/psci.h | 5 ++
kernel/power/hibernate.c | 5 +-
tools/testing/selftests/kvm/aarch64/psci_test.c | 93 +++++++++++++++++++++++++
10 files changed, 217 insertions(+), 3 deletions(-)
From: Li Zhijian <lizhijian(a)fujitsu.com>
[ Upstream commit dc1308bee1ed03b4d698d77c8bd670d399dcd04d ]
When running watchdog-test with 'make run_tests', the watchdog-test will
be terminated by a timeout signal(SIGTERM) due to the test timemout.
And then, a system reboot would happen due to watchdog not stop. see
the dmesg as below:
```
[ 1367.185172] watchdog: watchdog0: watchdog did not stop!
```
Fix it by registering more signals(including SIGTERM) in watchdog-test,
where its signal handler will stop the watchdog.
After that
# timeout 1 ./watchdog-test
Watchdog Ticking Away!
.
Stopping watchdog ticks...
Link: https://lore.kernel.org/all/20241029031324.482800-1-lizhijian@fujitsu.com/
Signed-off-by: Li Zhijian <lizhijian(a)fujitsu.com>
Reviewed-by: Shuah Khan <skhan(a)linuxfoundation.org>
Signed-off-by: Shuah Khan <skhan(a)linuxfoundation.org>
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
---
tools/testing/selftests/watchdog/watchdog-test.c | 6 ++++++
1 file changed, 6 insertions(+)
diff --git a/tools/testing/selftests/watchdog/watchdog-test.c b/tools/testing/selftests/watchdog/watchdog-test.c
index f1c6e025cbe54..561bcc253fb3e 100644
--- a/tools/testing/selftests/watchdog/watchdog-test.c
+++ b/tools/testing/selftests/watchdog/watchdog-test.c
@@ -152,7 +152,13 @@ int main(int argc, char *argv[])
printf("Watchdog Ticking Away!\n");
+ /*
+ * Register the signals
+ */
signal(SIGINT, term);
+ signal(SIGTERM, term);
+ signal(SIGKILL, term);
+ signal(SIGQUIT, term);
while (1) {
keep_alive();
--
2.43.0
From: Li Zhijian <lizhijian(a)fujitsu.com>
[ Upstream commit dc1308bee1ed03b4d698d77c8bd670d399dcd04d ]
When running watchdog-test with 'make run_tests', the watchdog-test will
be terminated by a timeout signal(SIGTERM) due to the test timemout.
And then, a system reboot would happen due to watchdog not stop. see
the dmesg as below:
```
[ 1367.185172] watchdog: watchdog0: watchdog did not stop!
```
Fix it by registering more signals(including SIGTERM) in watchdog-test,
where its signal handler will stop the watchdog.
After that
# timeout 1 ./watchdog-test
Watchdog Ticking Away!
.
Stopping watchdog ticks...
Link: https://lore.kernel.org/all/20241029031324.482800-1-lizhijian@fujitsu.com/
Signed-off-by: Li Zhijian <lizhijian(a)fujitsu.com>
Reviewed-by: Shuah Khan <skhan(a)linuxfoundation.org>
Signed-off-by: Shuah Khan <skhan(a)linuxfoundation.org>
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
---
tools/testing/selftests/watchdog/watchdog-test.c | 6 ++++++
1 file changed, 6 insertions(+)
diff --git a/tools/testing/selftests/watchdog/watchdog-test.c b/tools/testing/selftests/watchdog/watchdog-test.c
index f45e510500c0d..09773695d219f 100644
--- a/tools/testing/selftests/watchdog/watchdog-test.c
+++ b/tools/testing/selftests/watchdog/watchdog-test.c
@@ -242,7 +242,13 @@ int main(int argc, char *argv[])
printf("Watchdog Ticking Away!\n");
+ /*
+ * Register the signals
+ */
signal(SIGINT, term);
+ signal(SIGTERM, term);
+ signal(SIGKILL, term);
+ signal(SIGQUIT, term);
while (1) {
keep_alive();
--
2.43.0
From: Li Zhijian <lizhijian(a)fujitsu.com>
[ Upstream commit dc1308bee1ed03b4d698d77c8bd670d399dcd04d ]
When running watchdog-test with 'make run_tests', the watchdog-test will
be terminated by a timeout signal(SIGTERM) due to the test timemout.
And then, a system reboot would happen due to watchdog not stop. see
the dmesg as below:
```
[ 1367.185172] watchdog: watchdog0: watchdog did not stop!
```
Fix it by registering more signals(including SIGTERM) in watchdog-test,
where its signal handler will stop the watchdog.
After that
# timeout 1 ./watchdog-test
Watchdog Ticking Away!
.
Stopping watchdog ticks...
Link: https://lore.kernel.org/all/20241029031324.482800-1-lizhijian@fujitsu.com/
Signed-off-by: Li Zhijian <lizhijian(a)fujitsu.com>
Reviewed-by: Shuah Khan <skhan(a)linuxfoundation.org>
Signed-off-by: Shuah Khan <skhan(a)linuxfoundation.org>
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
---
tools/testing/selftests/watchdog/watchdog-test.c | 6 ++++++
1 file changed, 6 insertions(+)
diff --git a/tools/testing/selftests/watchdog/watchdog-test.c b/tools/testing/selftests/watchdog/watchdog-test.c
index f45e510500c0d..09773695d219f 100644
--- a/tools/testing/selftests/watchdog/watchdog-test.c
+++ b/tools/testing/selftests/watchdog/watchdog-test.c
@@ -242,7 +242,13 @@ int main(int argc, char *argv[])
printf("Watchdog Ticking Away!\n");
+ /*
+ * Register the signals
+ */
signal(SIGINT, term);
+ signal(SIGTERM, term);
+ signal(SIGKILL, term);
+ signal(SIGQUIT, term);
while (1) {
keep_alive();
--
2.43.0
From: Li Zhijian <lizhijian(a)fujitsu.com>
[ Upstream commit dc1308bee1ed03b4d698d77c8bd670d399dcd04d ]
When running watchdog-test with 'make run_tests', the watchdog-test will
be terminated by a timeout signal(SIGTERM) due to the test timemout.
And then, a system reboot would happen due to watchdog not stop. see
the dmesg as below:
```
[ 1367.185172] watchdog: watchdog0: watchdog did not stop!
```
Fix it by registering more signals(including SIGTERM) in watchdog-test,
where its signal handler will stop the watchdog.
After that
# timeout 1 ./watchdog-test
Watchdog Ticking Away!
.
Stopping watchdog ticks...
Link: https://lore.kernel.org/all/20241029031324.482800-1-lizhijian@fujitsu.com/
Signed-off-by: Li Zhijian <lizhijian(a)fujitsu.com>
Reviewed-by: Shuah Khan <skhan(a)linuxfoundation.org>
Signed-off-by: Shuah Khan <skhan(a)linuxfoundation.org>
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
---
tools/testing/selftests/watchdog/watchdog-test.c | 6 ++++++
1 file changed, 6 insertions(+)
diff --git a/tools/testing/selftests/watchdog/watchdog-test.c b/tools/testing/selftests/watchdog/watchdog-test.c
index f45e510500c0d..09773695d219f 100644
--- a/tools/testing/selftests/watchdog/watchdog-test.c
+++ b/tools/testing/selftests/watchdog/watchdog-test.c
@@ -242,7 +242,13 @@ int main(int argc, char *argv[])
printf("Watchdog Ticking Away!\n");
+ /*
+ * Register the signals
+ */
signal(SIGINT, term);
+ signal(SIGTERM, term);
+ signal(SIGKILL, term);
+ signal(SIGQUIT, term);
while (1) {
keep_alive();
--
2.43.0
From: Li Zhijian <lizhijian(a)fujitsu.com>
[ Upstream commit dc1308bee1ed03b4d698d77c8bd670d399dcd04d ]
When running watchdog-test with 'make run_tests', the watchdog-test will
be terminated by a timeout signal(SIGTERM) due to the test timemout.
And then, a system reboot would happen due to watchdog not stop. see
the dmesg as below:
```
[ 1367.185172] watchdog: watchdog0: watchdog did not stop!
```
Fix it by registering more signals(including SIGTERM) in watchdog-test,
where its signal handler will stop the watchdog.
After that
# timeout 1 ./watchdog-test
Watchdog Ticking Away!
.
Stopping watchdog ticks...
Link: https://lore.kernel.org/all/20241029031324.482800-1-lizhijian@fujitsu.com/
Signed-off-by: Li Zhijian <lizhijian(a)fujitsu.com>
Reviewed-by: Shuah Khan <skhan(a)linuxfoundation.org>
Signed-off-by: Shuah Khan <skhan(a)linuxfoundation.org>
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
---
tools/testing/selftests/watchdog/watchdog-test.c | 6 ++++++
1 file changed, 6 insertions(+)
diff --git a/tools/testing/selftests/watchdog/watchdog-test.c b/tools/testing/selftests/watchdog/watchdog-test.c
index f45e510500c0d..09773695d219f 100644
--- a/tools/testing/selftests/watchdog/watchdog-test.c
+++ b/tools/testing/selftests/watchdog/watchdog-test.c
@@ -242,7 +242,13 @@ int main(int argc, char *argv[])
printf("Watchdog Ticking Away!\n");
+ /*
+ * Register the signals
+ */
signal(SIGINT, term);
+ signal(SIGTERM, term);
+ signal(SIGKILL, term);
+ signal(SIGQUIT, term);
while (1) {
keep_alive();
--
2.43.0
From: Li Zhijian <lizhijian(a)fujitsu.com>
[ Upstream commit dc1308bee1ed03b4d698d77c8bd670d399dcd04d ]
When running watchdog-test with 'make run_tests', the watchdog-test will
be terminated by a timeout signal(SIGTERM) due to the test timemout.
And then, a system reboot would happen due to watchdog not stop. see
the dmesg as below:
```
[ 1367.185172] watchdog: watchdog0: watchdog did not stop!
```
Fix it by registering more signals(including SIGTERM) in watchdog-test,
where its signal handler will stop the watchdog.
After that
# timeout 1 ./watchdog-test
Watchdog Ticking Away!
.
Stopping watchdog ticks...
Link: https://lore.kernel.org/all/20241029031324.482800-1-lizhijian@fujitsu.com/
Signed-off-by: Li Zhijian <lizhijian(a)fujitsu.com>
Reviewed-by: Shuah Khan <skhan(a)linuxfoundation.org>
Signed-off-by: Shuah Khan <skhan(a)linuxfoundation.org>
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
---
tools/testing/selftests/watchdog/watchdog-test.c | 6 ++++++
1 file changed, 6 insertions(+)
diff --git a/tools/testing/selftests/watchdog/watchdog-test.c b/tools/testing/selftests/watchdog/watchdog-test.c
index bc71cbca0dde7..a1f506ba55786 100644
--- a/tools/testing/selftests/watchdog/watchdog-test.c
+++ b/tools/testing/selftests/watchdog/watchdog-test.c
@@ -334,7 +334,13 @@ int main(int argc, char *argv[])
printf("Watchdog Ticking Away!\n");
+ /*
+ * Register the signals
+ */
signal(SIGINT, term);
+ signal(SIGTERM, term);
+ signal(SIGKILL, term);
+ signal(SIGQUIT, term);
while (1) {
keep_alive();
--
2.43.0
From: Li Zhijian <lizhijian(a)fujitsu.com>
[ Upstream commit dc1308bee1ed03b4d698d77c8bd670d399dcd04d ]
When running watchdog-test with 'make run_tests', the watchdog-test will
be terminated by a timeout signal(SIGTERM) due to the test timemout.
And then, a system reboot would happen due to watchdog not stop. see
the dmesg as below:
```
[ 1367.185172] watchdog: watchdog0: watchdog did not stop!
```
Fix it by registering more signals(including SIGTERM) in watchdog-test,
where its signal handler will stop the watchdog.
After that
# timeout 1 ./watchdog-test
Watchdog Ticking Away!
.
Stopping watchdog ticks...
Link: https://lore.kernel.org/all/20241029031324.482800-1-lizhijian@fujitsu.com/
Signed-off-by: Li Zhijian <lizhijian(a)fujitsu.com>
Reviewed-by: Shuah Khan <skhan(a)linuxfoundation.org>
Signed-off-by: Shuah Khan <skhan(a)linuxfoundation.org>
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
---
tools/testing/selftests/watchdog/watchdog-test.c | 6 ++++++
1 file changed, 6 insertions(+)
diff --git a/tools/testing/selftests/watchdog/watchdog-test.c b/tools/testing/selftests/watchdog/watchdog-test.c
index bc71cbca0dde7..a1f506ba55786 100644
--- a/tools/testing/selftests/watchdog/watchdog-test.c
+++ b/tools/testing/selftests/watchdog/watchdog-test.c
@@ -334,7 +334,13 @@ int main(int argc, char *argv[])
printf("Watchdog Ticking Away!\n");
+ /*
+ * Register the signals
+ */
signal(SIGINT, term);
+ signal(SIGTERM, term);
+ signal(SIGKILL, term);
+ signal(SIGQUIT, term);
while (1) {
keep_alive();
--
2.43.0
Check number of paths by fib_info_num_path(),
and update_or_create_fnhe() for every path.
Problem is that pmtu is cached only for the oif
that has received icmp message "need to frag",
other oifs will still try to use "default" iface mtu.
An example topology showing the problem:
| host1
+---------+
| dummy0 | 10.179.20.18/32 mtu9000
+---------+
+-----------+----------------+
+---------+ +---------+
| ens17f0 | 10.179.2.141/31 | ens17f1 | 10.179.2.13/31
+---------+ +---------+
| (all here have mtu 9000) |
+------+ +------+
| ro1 | 10.179.2.140/31 | ro2 | 10.179.2.12/31
+------+ +------+
| |
---------+------------+-------------------+------
|
+-----+
| ro3 | 10.10.10.10 mtu1500
+-----+
|
========================================
some networks
========================================
|
+-----+
| eth0| 10.10.30.30 mtu9000
+-----+
| host2
host1 have enabled multipath and
sysctl net.ipv4.fib_multipath_hash_policy = 1:
default proto static src 10.179.20.18
nexthop via 10.179.2.12 dev ens17f1 weight 1
nexthop via 10.179.2.140 dev ens17f0 weight 1
When host1 tries to do pmtud from 10.179.20.18/32 to host2,
host1 receives at ens17f1 iface an icmp packet from ro3 that ro3 mtu=1500.
And host1 caches it in nexthop exceptions cache.
Problem is that it is cached only for the iface that has received icmp,
and there is no way that ro3 will send icmp msg to host1 via another path.
Host1 now have this routes to host2:
ip r g 10.10.30.30 sport 30000 dport 443
10.10.30.30 via 10.179.2.12 dev ens17f1 src 10.179.20.18 uid 0
cache expires 521sec mtu 1500
ip r g 10.10.30.30 sport 30033 dport 443
10.10.30.30 via 10.179.2.140 dev ens17f0 src 10.179.20.18 uid 0
cache
So when host1 tries again to reach host2 with mtu>1500,
if packet flow is lucky enough to be hashed with oif=ens17f1 its ok,
if oif=ens17f0 it blackholes and still gets icmp msgs from ro3 to ens17f1,
until lucky day when ro3 will send it through another flow to ens17f0.
Signed-off-by: Vladimir Vdovin <deliran(a)verdict.gg>
---
V6:
- make commit message cleaner
V5:
- make self test cleaner
V4:
- fix selftest, do route lookup before checking cached exceptions
V3:
- added selftest
- fixed compile error
V2:
- fix fib_info_num_path parameter pass
---
net/ipv4/route.c | 13 +++++
tools/testing/selftests/net/pmtu.sh | 78 ++++++++++++++++++++++++++++-
2 files changed, 90 insertions(+), 1 deletion(-)
diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index 723ac9181558..41162b5cc4cb 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -1027,6 +1027,19 @@ static void __ip_rt_update_pmtu(struct rtable *rt, struct flowi4 *fl4, u32 mtu)
struct fib_nh_common *nhc;
fib_select_path(net, &res, fl4, NULL);
+#ifdef CONFIG_IP_ROUTE_MULTIPATH
+ if (fib_info_num_path(res.fi) > 1) {
+ int nhsel;
+
+ for (nhsel = 0; nhsel < fib_info_num_path(res.fi); nhsel++) {
+ nhc = fib_info_nhc(res.fi, nhsel);
+ update_or_create_fnhe(nhc, fl4->daddr, 0, mtu, lock,
+ jiffies + net->ipv4.ip_rt_mtu_expires);
+ }
+ rcu_read_unlock();
+ return;
+ }
+#endif /* CONFIG_IP_ROUTE_MULTIPATH */
nhc = FIB_RES_NHC(res);
update_or_create_fnhe(nhc, fl4->daddr, 0, mtu, lock,
jiffies + net->ipv4.ip_rt_mtu_expires);
diff --git a/tools/testing/selftests/net/pmtu.sh b/tools/testing/selftests/net/pmtu.sh
index 569bce8b6383..a0159340fe84 100755
--- a/tools/testing/selftests/net/pmtu.sh
+++ b/tools/testing/selftests/net/pmtu.sh
@@ -266,7 +266,8 @@ tests="
list_flush_ipv4_exception ipv4: list and flush cached exceptions 1
list_flush_ipv6_exception ipv6: list and flush cached exceptions 1
pmtu_ipv4_route_change ipv4: PMTU exception w/route replace 1
- pmtu_ipv6_route_change ipv6: PMTU exception w/route replace 1"
+ pmtu_ipv6_route_change ipv6: PMTU exception w/route replace 1
+ pmtu_ipv4_mp_exceptions ipv4: PMTU multipath nh exceptions 0"
# Addressing and routing for tests with routers: four network segments, with
# index SEGMENT between 1 and 4, a common prefix (PREFIX4 or PREFIX6) and an
@@ -2329,6 +2330,81 @@ test_pmtu_ipv6_route_change() {
test_pmtu_ipvX_route_change 6
}
+test_pmtu_ipv4_mp_exceptions() {
+ setup namespaces routing || return $ksft_skip
+
+ ip nexthop ls >/dev/null 2>&1
+ if [ $? -ne 0 ]; then
+ echo "Nexthop objects not supported; skipping tests"
+ exit $ksft_skip
+ fi
+
+ trace "${ns_a}" veth_A-R1 "${ns_r1}" veth_R1-A \
+ "${ns_r1}" veth_R1-B "${ns_b}" veth_B-R1 \
+ "${ns_a}" veth_A-R2 "${ns_r2}" veth_R2-A \
+ "${ns_r2}" veth_R2-B "${ns_b}" veth_B-R2
+
+ dummy0_a="192.168.99.99"
+ dummy0_b="192.168.88.88"
+
+ # Set up initial MTU values
+ mtu "${ns_a}" veth_A-R1 2000
+ mtu "${ns_r1}" veth_R1-A 2000
+ mtu "${ns_r1}" veth_R1-B 1500
+ mtu "${ns_b}" veth_B-R1 1500
+
+ mtu "${ns_a}" veth_A-R2 2000
+ mtu "${ns_r2}" veth_R2-A 2000
+ mtu "${ns_r2}" veth_R2-B 1500
+ mtu "${ns_b}" veth_B-R2 1500
+
+ fail=0
+
+ #Set up host A with multipath routes to host B dummy0_b
+ run_cmd ${ns_a} sysctl -q net.ipv4.fib_multipath_hash_policy=1
+ run_cmd ${ns_a} sysctl -q net.ipv4.ip_forward=1
+ run_cmd ${ns_a} ip link add dummy0 mtu 2000 type dummy
+ run_cmd ${ns_a} ip link set dummy0 up
+ run_cmd ${ns_a} ip addr add ${dummy0_a} dev dummy0
+ run_cmd ${ns_a} ip nexthop add id 201 via ${prefix4}.${a_r1}.2 dev veth_A-R1
+ run_cmd ${ns_a} ip nexthop add id 202 via ${prefix4}.${a_r2}.2 dev veth_A-R2
+ run_cmd ${ns_a} ip nexthop add id 203 group 201/202
+ run_cmd ${ns_a} ip route add ${dummy0_b} nhid 203
+
+ #Set up host B with multipath routes to host A dummy0_a
+ run_cmd ${ns_b} sysctl -q net.ipv4.fib_multipath_hash_policy=1
+ run_cmd ${ns_b} sysctl -q net.ipv4.ip_forward=1
+ run_cmd ${ns_b} ip link add dummy0 mtu 2000 type dummy
+ run_cmd ${ns_b} ip link set dummy0 up
+ run_cmd ${ns_b} ip addr add ${dummy0_b} dev dummy0
+ run_cmd ${ns_b} ip nexthop add id 201 via ${prefix4}.${b_r1}.2 dev veth_A-R1
+ run_cmd ${ns_b} ip nexthop add id 202 via ${prefix4}.${b_r2}.2 dev veth_A-R2
+ run_cmd ${ns_b} ip nexthop add id 203 group 201/202
+ run_cmd ${ns_b} ip route add ${dummy0_a} nhid 203
+
+ #Set up routers with routes to dummies
+ run_cmd ${ns_r1} ip route add ${dummy0_a} via ${prefix4}.${a_r1}.1
+ run_cmd ${ns_r2} ip route add ${dummy0_a} via ${prefix4}.${a_r2}.1
+ run_cmd ${ns_r1} ip route add ${dummy0_b} via ${prefix4}.${b_r1}.1
+ run_cmd ${ns_r2} ip route add ${dummy0_b} via ${prefix4}.${b_r2}.1
+
+
+ #Ping and expect two nexthop exceptions for two routes in nh group
+ run_cmd ${ns_a} ping -q -M want -i 0.1 -c 2 -s 1800 "${dummy0_b}"
+
+ #Do route lookup before checking cached exceptions
+ run_cmd ${ns_a} ip route get ${dummy0_b} oif veth_A-R1
+ run_cmd ${ns_a} ip route get ${dummy0_b} oif veth_A-R2
+
+ #Check cached exceptions
+ if [ "$(${ns_a} ip -oneline route list cache| grep mtu | wc -l)" -ne 2 ]; then
+ err " there are not enough cached exceptions"
+ fail=1
+ fi
+
+ return ${fail}
+}
+
usage() {
echo
echo "$0 [OPTIONS] [TEST]..."
base-commit: 66600fac7a984dea4ae095411f644770b2561ede
--
2.43.5
- Add skip test if not run as root, with an
appropriate Warning.
- Add 'ksft_print_header()' and 'ksft_set_plan()'
to structure test outputs more effectively.
Test logs :
Before change:
- Without root
error: unshare, errno 1
- With root
No, output
After change:
- Without root
TAP version 13
1..1
- With root
TAP version 13
1..1
Signed-off-by: Shivam Chaudhary <cvam0000(a)gmail.com>
---
Notes:
Changes in v4 1/2:
- Start a patchset
- Split patch into smaller pathes to make it easy to review.
link to v3: https://lore.kernel.org/all/20241028185756.111832-1-cvam0000@gmail.com/
Changes in v3:
- Remove extra ksft_set_plan()
- Remove function for unshare()
- Fix the comment style
link to v2: https://lore.kernel.org/all/20241026191621.2860376-1-cvam0000@gmail.com/
Changes in v2:
- Make the commit message more clear.
link to v1: https://lore.kernel.org/all/20241024200228.1075840-1-cvam0000@gmail.com/T/#u
tools/testing/selftests/tmpfs/bug-link-o-tmpfile.c | 13 +++++++++++++
1 file changed, 13 insertions(+)
diff --git a/tools/testing/selftests/tmpfs/bug-link-o-tmpfile.c b/tools/testing/selftests/tmpfs/bug-link-o-tmpfile.c
index b5c3ddb90942..cdab1e8c0392 100644
--- a/tools/testing/selftests/tmpfs/bug-link-o-tmpfile.c
+++ b/tools/testing/selftests/tmpfs/bug-link-o-tmpfile.c
@@ -23,10 +23,23 @@
#include <sys/mount.h>
#include <unistd.h>
+#include "../kselftest.h"
+
int main(void)
{
int fd;
+ /* Setting up kselftest framework */
+ ksft_print_header();
+ ksft_set_plan(1);
+
+ /* Check if test is run as root */
+ if (geteuid()) {
+ ksft_print_msg("Skip : Need to run as root");
+ exit(KSFT_SKIP);
+
+ }
+
if (unshare(CLONE_NEWNS) == -1) {
if (errno == ENOSYS || errno == EPERM) {
fprintf(stderr, "error: unshare, errno %d\n", errno);
--
2.45.2
Greetings:
Welcome to v5, see changelog below. Note that our performance tests were
not re-run for this revision as we updated a commit message, fixed
typos, removed a short unnecessary paragraph from the documentation and
added a very minor functional change to return without suspended IRQs in
certain error cases.
This series introduces a new mechanism, IRQ suspension, which allows
network applications using epoll to mask IRQs during periods of high
traffic while also reducing tail latency (compared to existing
mechanisms, see below) during periods of low traffic. In doing so, this
balances CPU consumption with network processing efficiency.
Martin Karsten (CC'd) and I have been collaborating on this series for
several months and have appreciated the feedback from the community on
our RFC [1]. We've updated the cover letter and kernel documentation in
an attempt to more clearly explain how this mechanism works, how
applications can use it, and how it compares to existing mechanisms in
the kernel. We've added an additional test case, 'fullbusy', achieved by
modifying libevent for comparison. See below for a detailed description,
link to the patch, and test results.
I briefly mentioned this idea at netdev conf 2024 (for those who were
there) and Martin described this idea in an earlier paper presented at
Sigmetrics 2024 [2].
~ The short explanation (TL;DR)
We propose adding a new napi config parameter: irq_suspend_timeout to
help balance CPU usage and network processing efficiency when using IRQ
deferral and napi busy poll.
If this parameter is set to a non-zero value *and* a user application
has enabled preferred busy poll on a busy poll context (via the
EPIOCSPARAMS ioctl introduced in commit 18e2bf0edf4d ("eventpoll: Add
epoll ioctl for epoll_params")), then application calls to epoll_wait
for that context will cause device IRQs and softirq processing to be
suspended as long as epoll_wait successfully retrieves data from the
NAPI. Each time data is retrieved, the irq_suspend_timeout is deferred.
If/when network traffic subsides and epoll_wait returns no data, IRQ
suspension is immediately reverted back to the existing
napi_defer_hard_irqs and gro_flush_timeout mechanism which was
introduced in commit 6f8b12d661d0 ("net: napi: add hard irqs deferral
feature")).
The irq_suspend_timeout serves as a safety mechanism. If userland takes
a long time processing data, irq_suspend_timeout will fire and restart
normal NAPI processing.
For a more in depth explanation, please continue reading.
~ Comparison with existing mechanisms
Interrupt mitigation can be accomplished in napi software, by setting
napi_defer_hard_irqs and gro_flush_timeout, or via interrupt coalescing
in the NIC. This can be quite efficient, but in both cases, a fixed
timeout (or packet count) needs to be configured. However, a fixed
timeout cannot effectively support both low- and high-load situations:
At low load, an application typically processes a few requests and then
waits to receive more input data. In this scenario, a large timeout will
cause unnecessary latency.
At high load, an application typically processes many requests before
being ready to receive more input data. In this case, a small timeout
will likely fire prematurely and trigger irq/softirq processing, which
interferes with the application's execution. This causes overhead, most
likely due to cache contention.
While NICs attempt to provide adaptive interrupt coalescing schemes,
these cannot properly take into account application-level processing.
An alternative packet delivery mechanism is busy-polling, which results
in perfect alignment of application processing and network polling. It
delivers optimal performance (throughput and latency), but results in
100% cpu utilization and is thus inefficient for below-capacity
workloads.
We propose to add a new packet delivery mode that properly alternates
between busy polling and interrupt-based delivery depending on busy and
idle periods of the application. During a busy period, the system
operates in busy-polling mode, which avoids interference. During an idle
period, the system falls back to interrupt deferral, but with a small
timeout to avoid excessive latencies. This delivery mode can also be
viewed as an extension of basic interrupt deferral, but alternating
between a small and a very large timeout.
This delivery mode is efficient, because it avoids softirq execution
interfering with application processing during busy periods. It can be
used with blocking epoll_wait to conserve cpu cycles during idle
periods. The effect of alternating between busy and idle periods is that
performance (throughput and latency) is very close to full busy polling,
while cpu utilization is lower and very close to interrupt mitigation.
~ Usage details
IRQ suspension is introduced via a per-NAPI configuration parameter that
controls the maximum time that IRQs can be suspended.
Here's how it is intended to work:
- The user application (or system administrator) uses the netdev-genl
netlink interface to set the pre-existing napi_defer_hard_irqs and
gro_flush_timeout NAPI config parameters to enable IRQ deferral.
- The user application (or system administrator) sets the proposed
irq_suspend_timeout parameter via the netdev-genl netlink interface
to a larger value than gro_flush_timeout to enable IRQ suspension.
- The user application issues the existing epoll ioctl to set the
prefer_busy_poll flag on the epoll context.
- The user application then calls epoll_wait to busy poll for network
events, as it normally would.
- If epoll_wait returns events to userland, IRQs are suspended for the
duration of irq_suspend_timeout.
- If epoll_wait finds no events and the thread is about to go to
sleep, IRQ handling using napi_defer_hard_irqs and gro_flush_timeout
is resumed.
As long as epoll_wait is retrieving events, IRQs (and softirq
processing) for the NAPI being polled remain disabled. When network
traffic reduces, eventually a busy poll loop in the kernel will retrieve
no data. When this occurs, regular IRQ deferral using gro_flush_timeout
for the polled NAPI is re-enabled.
Unless IRQ suspension is continued by subsequent calls to epoll_wait, it
automatically times out after the irq_suspend_timeout timer expires.
Regular deferral is also immediately re-enabled when the epoll context
is destroyed.
~ Usage scenario
The target scenario for IRQ suspension as packet delivery mode is a
system that runs a dominant application with substantial network I/O.
The target application can be configured to receive input data up to a
certain batch size (via epoll_wait maxevents parameter) and this batch
size determines the worst-case latency that application requests might
experience. Because packet delivery is suspended during the target
application's processing, the batch size also determines the worst-case
latency of concurrent applications using the same RX queue(s).
gro_flush_timeout should be set as small as possible, but large enough to
make sure that a single request is likely not being interfered with.
irq_suspend_timeout is largely a safety mechanism against misbehaving
applications. It should be set large enough to cover the processing of an
entire application batch, i.e., the factor between gro_flush_timeout and
irq_suspend_timeout should roughly correspond to the maximum batch size
that the target application would process in one go.
~ Design rationale
The implementation of the IRQ suspension mechanism very nicely dovetails
with the existing mechanism for IRQ deferral when preferred busy poll is
enabled (introduced in commit 7fd3253a7de6 ("net: Introduce preferred
busy-polling"), see that commit message for more details).
While it would be possible to inject the suspend timeout via
the existing epoll ioctl, it is more natural to avoid this path for one
main reason:
An epoll context is linked to NAPI IDs as file descriptors are added;
this means any epoll context might suddenly be associated with a
different net_device if the application were to replace all existing
fds with fds from a different device. In this case, the scope of the
suspend timeout becomes unclear and many edge cases for both the user
application and the kernel are introduced
Only a single iteration through napi busy polling is needed for this
mechanism to work effectively. Since an important objective for this
mechanism is preserving cpu cycles, exactly one iteration of the napi
busy loop is invoked when busy_poll_usecs is set to 0.
~ Important call out in the implementation
- Enabling per epoll-context preferred busy poll will now effectively
lead to a nonblocking iteration through napi_busy_loop, even when
busy_poll_usecs is 0. See patch 4.
~ Benchmark configs & descriptions
The changes were benchmarked with memcached [3] using the benchmarking
tool mutilate [4].
To facilitate benchmarking, a small patch [5] was applied to memcached
1.6.29 to allow setting per-epoll context preferred busy poll and other
settings via environment variables. Another small patch [6] was applied
to libevent to enable full busy-polling.
Multiple scenarios were benchmarked as described below and the scripts
used for producing these results can be found on github [7] (note: all
scenarios use NAPI-based traffic splitting via SO_INCOMING_ID by passing
-N to memcached):
- base:
- no other options enabled
- deferX:
- set defer_hard_irqs to 100
- set gro_flush_timeout to X,000
- napibusy:
- set defer_hard_irqs to 100
- set gro_flush_timeout to 200,000
- enable busy poll via the existing ioctl (busy_poll_usecs = 64,
busy_poll_budget = 64, prefer_busy_poll = true)
- fullbusy:
- set defer_hard_irqs to 100
- set gro_flush_timeout to 5,000,000
- enable busy poll via the existing ioctl (busy_poll_usecs = 1000,
busy_poll_budget = 64, prefer_busy_poll = true)
- change memcached's nonblocking epoll_wait invocation (via
libevent) to using a 1 ms timeout
- suspendX:
- set defer_hard_irqs to 100
- set gro_flush_timeout to X,000
- set irq_suspend_timeout to 20,000,000
- enable busy poll via the existing ioctl (busy_poll_usecs = 0,
busy_poll_budget = 64, prefer_busy_poll = true)
~ Benchmark results
Tested on:
Single socket AMD EPYC 7662 64-Core Processor
Hyperthreading disabled
4 NUMA Zones (NPS=4)
16 CPUs per NUMA zone (64 cores total)
2 x Dual port 100gbps Mellanox Technologies ConnectX-5 Ex EN NIC
The test machine is configured such that a single interface has 8 RX
queues. The queues' IRQs and memcached are pinned to CPUs that are
NUMA-local to the interface which is under test. The NIC's interrupt
coalescing configuration is left at boot-time defaults.
Results:
Results are shown below. The mechanism added by this series is
represented by the 'suspend' cases. Data presented shows a summary over
at least 10 runs of each test case [8] using the scripts on github [7].
For latency, the median is shown. For throughput and CPU utilization,
the average is shown.
The results also include cycles-per-query (cpq) and
instruction-per-query (ipq) metrics, following the methodology proposed
in [2], to augment the CPU utilization numbers, which could be skewed
due to frequency scaling. We find that this does not appear to be the
case as CPU utilization and low-level metrics show similar trends.
These results were captured using the scripts on github [7] to
illustrate how this approach compares with other pre-existing
mechanisms. This data is not to be interpreted as scientific data
captured in a fully isolated lab setting, but instead as best effort,
illustrative information comparing and contrasting tradeoffs.
The absolute QPS results are higher than our previous submission, but
the relative differences between variants are equivalent. Because the
patches have been rebased on 6.12, several factors have likely
influenced the overall performance. Most importantly, we had to switch
to a new set of basic kernel options, which has likely altered the
baseline performance. Because the overall comparison of variants still
holds, we have not attempted to recreate the exact set of kernel options
from the previous submission.
Compare:
- Throughput (MAX) and latencies of base vs suspend.
- CPU usage of napibusy and fullbusy during lower load (200K, 400K for
example) vs suspend.
- Latency of the defer variants vs suspend as timeout and load
increases.
The overall takeaway is that the suspend variants provide a superior
combination of high throughput, low latency, and low cpu utilization
compared to all other variants. Each of the suspend variants works very
well, but some fine-tuning between latency and cpu utilization is still
possible by tuning the small timeout (gro_flush_timeout).
Note: we've reorganized the results to make comparison among testcases
with the same load easier.
testcase load qps avglat 95%lat 99%lat cpu cpq ipq
base 200K 200024 127 254 458 25 12748 11289
defer10 200K 199991 64 128 166 27 18763 16574
defer20 200K 199986 72 135 178 25 15405 14173
defer50 200K 200025 91 149 198 23 12275 12203
defer200 200K 199996 182 266 326 18 8595 9183
fullbusy 200K 200040 58 123 167 100 43641 23145
napibusy 200K 200009 115 244 299 56 24797 24693
suspend10 200K 200005 63 128 167 32 19559 17240
suspend20 200K 199952 69 132 170 29 16324 14838
suspend50 200K 200019 84 144 189 26 13106 12516
suspend200 200K 199978 168 264 326 20 9331 9643
testcase load qps avglat 95%lat 99%lat cpu cpq ipq
base 400K 400017 157 292 762 39 9287 9325
defer10 400K 400033 71 141 204 53 13950 12943
defer20 400K 399935 79 150 212 47 12027 11673
defer50 400K 399888 101 171 231 39 9556 9921
defer200 400K 399993 200 287 358 32 7428 8576
fullbusy 400K 400018 63 132 203 100 21827 16062
napibusy 400K 399970 89 230 292 83 18156 16508
suspend10 400K 400061 69 139 202 54 13576 13057
suspend20 400K 399988 73 144 206 49 11930 11773
suspend50 400K 399975 88 161 218 42 9996 10270
suspend200 400K 399954 172 276 353 34 7847 8713
testcase load qps avglat 95%lat 99%lat cpu cpq ipq
base 600K 600031 166 289 631 61 9188 8787
defer10 600K 599967 85 167 262 75 11833 10947
defer20 600K 599888 89 165 243 66 10513 10362
defer50 600K 600072 109 185 253 55 8664 9190
defer200 600K 599951 222 315 393 45 6892 8213
fullbusy 600K 600041 69 145 227 100 14549 13936
napibusy 600K 599980 79 188 280 96 13927 14155
suspend10 600K 600028 78 159 267 69 10877 11032
suspend20 600K 600026 81 159 254 64 9922 10320
suspend50 600K 600007 96 178 258 57 8681 9331
suspend200 600K 599964 177 295 369 47 7115 8366
testcase load qps avglat 95%lat 99%lat cpu cpq ipq
base 800K 800034 198 329 698 84 9366 8338
defer10 800K 799718 243 642 1457 95 10532 9007
defer20 800K 800009 132 245 399 89 9956 8979
defer50 800K 800024 136 228 378 80 9002 8598
defer200 800K 799965 255 362 473 66 7481 8147
fullbusy 800K 799927 78 157 253 100 10915 12533
napibusy 800K 799870 81 173 273 99 10826 12532
suspend10 800K 799991 84 167 269 83 9380 9802
suspend20 800K 799979 90 172 290 78 8765 9404
suspend50 800K 800031 106 191 307 71 7945 8805
suspend200 800K 799905 182 307 411 62 6985 8242
testcase load qps avglat 95%lat 99%lat cpu cpq ipq
base 1000K 919543 3805 6390 14229 98 9324 7978
defer10 1000K 850751 4574 7382 15370 99 10218 8470
defer20 1000K 890296 4736 6862 14858 99 9708 8277
defer50 1000K 932694 3463 6180 13251 97 9148 8053
defer200 1000K 951311 3524 6052 13599 96 8875 7845
fullbusy 1000K 1000011 90 181 278 100 8731 10686
napibusy 1000K 1000050 93 184 280 100 8721 10547
suspend10 1000K 999962 101 193 306 92 8138 8980
suspend20 1000K 1000030 103 191 324 88 7844 8763
suspend50 1000K 1000001 114 202 320 83 7396 8431
suspend200 1000K 999965 185 314 428 76 6733 8072
testcase load qps avglat 95%lat 99%lat cpu cpq ipq
base MAX 1005592 4651 6594 14979 100 8679 7918
defer10 MAX 928204 5106 7286 15199 100 9398 8380
defer20 MAX 984663 4774 6518 14920 100 8861 8063
defer50 MAX 1044099 4431 6368 14652 100 8350 7948
defer200 MAX 1040451 4423 6610 14674 100 8380 7931
fullbusy MAX 1236608 3715 3987 12805 100 7051 7936
napibusy MAX 1077516 4345 10155 15957 100 8080 7842
suspend10 MAX 1218344 3760 3990 12585 100 7150 7935
suspend20 MAX 1220056 3752 4053 12602 100 7150 7961
suspend50 MAX 1213666 3791 4103 12919 100 7183 7959
suspend200 MAX 1217411 3768 3988 12863 100 7161 7954
~ FAQ
- Why is a new parameter needed? Does irq_suspend_timeout override
gro_flush_timeout?
Using the suspend mechanism causes the system to alternate between
polling mode and irq-driven packet delivery. During busy periods,
irq_suspend_timeout overrides gro_flush_timeout and keeps the system
busy polling, but when epoll finds no events, the setting of
gro_flush_timeout and napi_defer_hard_irqs determine the next step.
There are essentially three possible loops for network processing and
packet delivery:
1) hardirq -> softirq -> napi poll; basic interrupt delivery
2) timer -> softirq -> napi poll; deferred irq processing
3) epoll -> busy-poll -> napi poll; busy looping
Loop 2) can take control from Loop 1), if gro_flush_timeout and
napi_defer_hard_irqs are set.
If gro_flush_timeout and napi_defer_hard_irqs are set, Loops 2) and
3) "wrestle" with each other for control. During busy periods,
irq_suspend_timeout is used as timer in Loop 2), which essentially
tilts this in favour of Loop 3).
If gro_flush_timeout and napi_defer_hard_irqs are not set, Loop 3)
cannot take control from Loop 1).
Therefore, setting gro_flush_timeout and napi_defer_hard_irqs is the
recommended usage, because otherwise setting irq_suspend_timeout
might not have any discernible effect.
We ran experiments with these parameters set to zero and the results
are as expected and essentially the same as the base case.
- Can the new timeout value be threaded through the new epoll ioctl ?
Only with difficulty. The epoll ioctl sets options on an epoll
context and the NAPI ID associated with an epoll context can change
based on what file descriptors a user app adds to the epoll context.
This would introduce complexity in the API from the user perspective
and also complexity in the kernel.
- Can irq suspend be built by combining NIC coalescing and
gro_flush_timeout ?
No. The problem is that the long timeout must engage if and only if
prefer-busy is active.
When using NIC coalescing for the short timeout (without
napi_defer_hard_irqs/gro_flush_timeout), an interrupt after an idle
period will trigger softirq, which will run napi polling. At this
point, prefer-busy is not active, so NIC interrupts would be
re-enabled. Then it is not possible for the longer timeout to
interject to switch control back to polling. In other words, only by
using the software timer for the short timeout, it is possible to
extend the timeout without having to reprogram the NIC timer or
reach down directly and disable interrupts.
Using gro_flush_timeout for the long timeout also has problems, for
the same underlying reason. In the current napi implementation,
gro_flush_timeout is not tied to prefer-busy. We'd either have to
change that and in the process modify the existing deferral
mechanism, or introduce a state variable to determine whether
gro_flush_timeout is used as long timeout for irq suspend or whether
it is used for its default purpose. In an earlier version, we did
try something similar to the latter and made it work, but it ends up
being a lot more convoluted than our current proposal.
- Isn't it already possible to combine busy looping with irq deferral?
Yes, in fact enabling irq deferral via napi_defer_hard_irqs and
gro_flush_timeout is a precondition for prefer_busy_poll to have an
effect. If the application also uses a tight busy loop with
essentially nonblocking epoll_wait (accomplished with a very short
timeout parameter), this is the fullbusy case shown in the results.
An application using blocking epoll_wait is shown as the napibusy
case in the results. It's a hybrid approach that provides limited
latency benefits compared to the base case and plain irq deferral,
but not as good as fullbusy or suspend.
~ Special thanks
Several people were involved in earlier stages of the development of this
mechanism whom we'd like to thank:
- Peter Cai (CC'd), for the initial kernel patch and his contributions
to the paper.
- Mohammadamin Shafie (CC'd), for testing various versions of the kernel
patch and providing helpful feedback.
Thanks,
Martin and Joe
[1]: https://lore.kernel.org/netdev/20240812125717.413108-1-jdamato@fastly.com/
[2]: https://doi.org/10.1145/3626780
[3]: https://github.com/memcached/memcached/blob/master/doc/napi_ids.txt
[4]: https://github.com/leverich/mutilate
[5]: https://raw.githubusercontent.com/martinkarsten/irqsuspend/main/patches/mem…
[6]: https://raw.githubusercontent.com/martinkarsten/irqsuspend/main/patches/lib…
[7]: https://github.com/martinkarsten/irqsuspend
[8]: https://github.com/martinkarsten/irqsuspend/tree/main/results
v5:
- Adjusted patch 5 to only suspend IRQs when ep_send_events returns a
positive return value. This issue was pointed out by Hillf Danton.
- Updated the commit message of patch 6 which still mentioned netcat,
despite the code being updated in v4 to replace it with socat and fixed
misspelling of netdevsim.
- Fixed a minor typo in patch 7 and removed an unnecessary paragraph.
- Added Sridhar Samudrala's Reviewed-by to patch 1-5 and 7.
v4: https://lore.kernel.org/netdev/20241102005214.32443-1-jdamato@fastly.com/
- Added a new FAQ item to cover letter.
- Updated patch 6 to use socat instead of nc in busy_poll_test.sh and
updated busy_poller.c to use netlink directly to configure napi
params.
- Updated the kernel documentation in patch 7 to include more details.
- Dropped Stanislav's Acked-by and Bagas' Reviewed-by from patch 7
since the documentation was updated.
v3: https://lore.kernel.org/netdev/20241101004846.32532-1-jdamato@fastly.com/
- Added Stanislav Fomichev's Acked-by to every patch except the newly
added selftest.
- Added Bagas Sanjaya's Reviewed-by to the documentation patch.
- Fixed the commit message of patch 2 to remove a reference to the now
non-existent sysfs setting.
- Added a self test which tests both "regular" busy poll and busy poll
with suspend enabled. This was added as patch 6 as requested by
Paolo. netdevsim was chosen instead of veth due to netdevsim's
pre-existing support for netdev-genl. See the commit message of
patch 6 for more details.
v2: https://lore.kernel.org/bpf/20241021015311.95468-1-jdamato@fastly.com/
- Cover letter updated, including a re-run of test data.
- Patch 1 rewritten to use netdev-genl instead of sysfs.
- Patch 3 updated with a comment added to napi_resume_irqs.
- Patch 4 rebased to apply now that commit b9ca079dd6b0 ("eventpoll:
Annotate data-race of busy_poll_usecs") has been picked up from VFS.
- Patch 6 updated the kernel documentation.
rfc -> v1:
- Cover letter updated to include more details.
- Patch 1 updated to remove the documentation added. This was moved to
patch 6 with the rest of the docs (see below).
- Patch 5 updated to fix an error uncovered by the kernel build robot.
See patch 5's changelog for more details.
- Patch 6 added which updates kernel documentation.
Joe Damato (2):
selftests: net: Add busy_poll_test
docs: networking: Describe irq suspension
Martin Karsten (5):
net: Add napi_struct parameter irq_suspend_timeout
net: Suspend softirq when prefer_busy_poll is set
net: Add control functions for irq suspension
eventpoll: Trigger napi_busy_loop, if prefer_busy_poll is set
eventpoll: Control irq suspension for prefer_busy_poll
Documentation/netlink/specs/netdev.yaml | 7 +
Documentation/networking/napi.rst | 172 ++++++++-
fs/eventpoll.c | 36 +-
include/linux/netdevice.h | 2 +
include/net/busy_poll.h | 3 +
include/uapi/linux/netdev.h | 1 +
net/core/dev.c | 58 +++-
net/core/dev.h | 25 ++
net/core/netdev-genl-gen.c | 5 +-
net/core/netdev-genl.c | 12 +
tools/include/uapi/linux/netdev.h | 1 +
tools/testing/selftests/net/.gitignore | 1 +
tools/testing/selftests/net/Makefile | 3 +-
tools/testing/selftests/net/busy_poll_test.sh | 164 +++++++++
tools/testing/selftests/net/busy_poller.c | 328 ++++++++++++++++++
15 files changed, 807 insertions(+), 11 deletions(-)
create mode 100755 tools/testing/selftests/net/busy_poll_test.sh
create mode 100644 tools/testing/selftests/net/busy_poller.c
base-commit: dbb9a7ef347828870df3e5e6ddf19469a3277fc9
--
2.25.1
Check number of paths by fib_info_num_path(),
and update_or_create_fnhe() for every path.
Problem is that pmtu is cached only for the oif
that has received icmp message "need to frag",
other oifs will still try to use "default" iface mtu.
V5:
- make self test cleaner
V4:
- fix selftest, do route lookup before checking cached exceptions
V3:
- added selftest
- fixed compile error
V2:
- fix fib_info_num_path parameter pass
An example topology showing the problem:
| host1
+---------+
| dummy0 | 10.179.20.18/32 mtu9000
+---------+
+-----------+----------------+
+---------+ +---------+
| ens17f0 | 10.179.2.141/31 | ens17f1 | 10.179.2.13/31
+---------+ +---------+
| (all here have mtu 9000) |
+------+ +------+
| ro1 | 10.179.2.140/31 | ro2 | 10.179.2.12/31
+------+ +------+
| |
---------+------------+-------------------+------
|
+-----+
| ro3 | 10.10.10.10 mtu1500
+-----+
|
========================================
some networks
========================================
|
+-----+
| eth0| 10.10.30.30 mtu9000
+-----+
| host2
host1 have enabled multipath and
sysctl net.ipv4.fib_multipath_hash_policy = 1:
default proto static src 10.179.20.18
nexthop via 10.179.2.12 dev ens17f1 weight 1
nexthop via 10.179.2.140 dev ens17f0 weight 1
When host1 tries to do pmtud from 10.179.20.18/32 to host2,
host1 receives at ens17f1 iface an icmp packet from ro3 that ro3 mtu=1500.
And host1 caches it in nexthop exceptions cache.
Problem is that it is cached only for the iface that has received icmp,
and there is no way that ro3 will send icmp msg to host1 via another path.
Host1 now have this routes to host2:
ip r g 10.10.30.30 sport 30000 dport 443
10.10.30.30 via 10.179.2.12 dev ens17f1 src 10.179.20.18 uid 0
cache expires 521sec mtu 1500
ip r g 10.10.30.30 sport 30033 dport 443
10.10.30.30 via 10.179.2.140 dev ens17f0 src 10.179.20.18 uid 0
cache
So when host1 tries again to reach host2 with mtu>1500,
if packet flow is lucky enough to be hashed with oif=ens17f1 its ok,
if oif=ens17f0 it blackholes and still gets icmp msgs from ro3 to ens17f1,
until lucky day when ro3 will send it through another flow to ens17f0.
Signed-off-by: Vladimir Vdovin <deliran(a)verdict.gg>
---
net/ipv4/route.c | 13 +++++
tools/testing/selftests/net/pmtu.sh | 78 ++++++++++++++++++++++++++++-
2 files changed, 90 insertions(+), 1 deletion(-)
diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index 723ac9181558..41162b5cc4cb 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -1027,6 +1027,19 @@ static void __ip_rt_update_pmtu(struct rtable *rt, struct flowi4 *fl4, u32 mtu)
struct fib_nh_common *nhc;
fib_select_path(net, &res, fl4, NULL);
+#ifdef CONFIG_IP_ROUTE_MULTIPATH
+ if (fib_info_num_path(res.fi) > 1) {
+ int nhsel;
+
+ for (nhsel = 0; nhsel < fib_info_num_path(res.fi); nhsel++) {
+ nhc = fib_info_nhc(res.fi, nhsel);
+ update_or_create_fnhe(nhc, fl4->daddr, 0, mtu, lock,
+ jiffies + net->ipv4.ip_rt_mtu_expires);
+ }
+ rcu_read_unlock();
+ return;
+ }
+#endif /* CONFIG_IP_ROUTE_MULTIPATH */
nhc = FIB_RES_NHC(res);
update_or_create_fnhe(nhc, fl4->daddr, 0, mtu, lock,
jiffies + net->ipv4.ip_rt_mtu_expires);
diff --git a/tools/testing/selftests/net/pmtu.sh b/tools/testing/selftests/net/pmtu.sh
index 569bce8b6383..a0159340fe84 100755
--- a/tools/testing/selftests/net/pmtu.sh
+++ b/tools/testing/selftests/net/pmtu.sh
@@ -266,7 +266,8 @@ tests="
list_flush_ipv4_exception ipv4: list and flush cached exceptions 1
list_flush_ipv6_exception ipv6: list and flush cached exceptions 1
pmtu_ipv4_route_change ipv4: PMTU exception w/route replace 1
- pmtu_ipv6_route_change ipv6: PMTU exception w/route replace 1"
+ pmtu_ipv6_route_change ipv6: PMTU exception w/route replace 1
+ pmtu_ipv4_mp_exceptions ipv4: PMTU multipath nh exceptions 0"
# Addressing and routing for tests with routers: four network segments, with
# index SEGMENT between 1 and 4, a common prefix (PREFIX4 or PREFIX6) and an
@@ -2329,6 +2330,81 @@ test_pmtu_ipv6_route_change() {
test_pmtu_ipvX_route_change 6
}
+test_pmtu_ipv4_mp_exceptions() {
+ setup namespaces routing || return $ksft_skip
+
+ ip nexthop ls >/dev/null 2>&1
+ if [ $? -ne 0 ]; then
+ echo "Nexthop objects not supported; skipping tests"
+ exit $ksft_skip
+ fi
+
+ trace "${ns_a}" veth_A-R1 "${ns_r1}" veth_R1-A \
+ "${ns_r1}" veth_R1-B "${ns_b}" veth_B-R1 \
+ "${ns_a}" veth_A-R2 "${ns_r2}" veth_R2-A \
+ "${ns_r2}" veth_R2-B "${ns_b}" veth_B-R2
+
+ dummy0_a="192.168.99.99"
+ dummy0_b="192.168.88.88"
+
+ # Set up initial MTU values
+ mtu "${ns_a}" veth_A-R1 2000
+ mtu "${ns_r1}" veth_R1-A 2000
+ mtu "${ns_r1}" veth_R1-B 1500
+ mtu "${ns_b}" veth_B-R1 1500
+
+ mtu "${ns_a}" veth_A-R2 2000
+ mtu "${ns_r2}" veth_R2-A 2000
+ mtu "${ns_r2}" veth_R2-B 1500
+ mtu "${ns_b}" veth_B-R2 1500
+
+ fail=0
+
+ #Set up host A with multipath routes to host B dummy0_b
+ run_cmd ${ns_a} sysctl -q net.ipv4.fib_multipath_hash_policy=1
+ run_cmd ${ns_a} sysctl -q net.ipv4.ip_forward=1
+ run_cmd ${ns_a} ip link add dummy0 mtu 2000 type dummy
+ run_cmd ${ns_a} ip link set dummy0 up
+ run_cmd ${ns_a} ip addr add ${dummy0_a} dev dummy0
+ run_cmd ${ns_a} ip nexthop add id 201 via ${prefix4}.${a_r1}.2 dev veth_A-R1
+ run_cmd ${ns_a} ip nexthop add id 202 via ${prefix4}.${a_r2}.2 dev veth_A-R2
+ run_cmd ${ns_a} ip nexthop add id 203 group 201/202
+ run_cmd ${ns_a} ip route add ${dummy0_b} nhid 203
+
+ #Set up host B with multipath routes to host A dummy0_a
+ run_cmd ${ns_b} sysctl -q net.ipv4.fib_multipath_hash_policy=1
+ run_cmd ${ns_b} sysctl -q net.ipv4.ip_forward=1
+ run_cmd ${ns_b} ip link add dummy0 mtu 2000 type dummy
+ run_cmd ${ns_b} ip link set dummy0 up
+ run_cmd ${ns_b} ip addr add ${dummy0_b} dev dummy0
+ run_cmd ${ns_b} ip nexthop add id 201 via ${prefix4}.${b_r1}.2 dev veth_A-R1
+ run_cmd ${ns_b} ip nexthop add id 202 via ${prefix4}.${b_r2}.2 dev veth_A-R2
+ run_cmd ${ns_b} ip nexthop add id 203 group 201/202
+ run_cmd ${ns_b} ip route add ${dummy0_a} nhid 203
+
+ #Set up routers with routes to dummies
+ run_cmd ${ns_r1} ip route add ${dummy0_a} via ${prefix4}.${a_r1}.1
+ run_cmd ${ns_r2} ip route add ${dummy0_a} via ${prefix4}.${a_r2}.1
+ run_cmd ${ns_r1} ip route add ${dummy0_b} via ${prefix4}.${b_r1}.1
+ run_cmd ${ns_r2} ip route add ${dummy0_b} via ${prefix4}.${b_r2}.1
+
+
+ #Ping and expect two nexthop exceptions for two routes in nh group
+ run_cmd ${ns_a} ping -q -M want -i 0.1 -c 2 -s 1800 "${dummy0_b}"
+
+ #Do route lookup before checking cached exceptions
+ run_cmd ${ns_a} ip route get ${dummy0_b} oif veth_A-R1
+ run_cmd ${ns_a} ip route get ${dummy0_b} oif veth_A-R2
+
+ #Check cached exceptions
+ if [ "$(${ns_a} ip -oneline route list cache| grep mtu | wc -l)" -ne 2 ]; then
+ err " there are not enough cached exceptions"
+ fail=1
+ fi
+
+ return ${fail}
+}
+
usage() {
echo
echo "$0 [OPTIONS] [TEST]..."
base-commit: 66600fac7a984dea4ae095411f644770b2561ede
--
2.43.5
This series implements feature detection of hardware virtualization on
Linux and macOS; the latter being my primary use case.
This yields approximately a 6x improvement using HVF on M3 Pro.
Signed-off-by: Tamir Duberstein <tamird(a)gmail.com>
---
Tamir Duberstein (2):
kunit: add fallback for os.sched_getaffinity
kunit: enable hardware acceleration when available
tools/testing/kunit/kunit.py | 11 ++++++++++-
tools/testing/kunit/kunit_kernel.py | 26 +++++++++++++++++++++++++-
tools/testing/kunit/qemu_configs/arm64.py | 2 +-
3 files changed, 36 insertions(+), 3 deletions(-)
---
base-commit: ae90f6a6170d7a7a1aa4fddf664fbd093e3023bc
change-id: 20241025-kunit-qemu-accel-macos-2840e4c2def5
Best regards,
--
Tamir Duberstein <tamird(a)gmail.com>
Hi Linus,
Please pull the following kselftest fixes update for Linux 6.12-rc6.
Kselftest fixes for Linux 6.12-rc6
- fix syntax error in frequency calculation arithmetic expression in
intel_pstate run.sh
- add missing cpupower dependency check intel_pstate run.sh
- fix idmap_mount_tree_invalid test failure due to incorrect argument
- fix watchdog-test run leaving the watchdog timer enabled causing
system reboot. With this fix, the test disables the watchdog timer
when it gets terminated with SIGTERM, SIGKILL, and SIGQUIT in addition
to SIGINT
diff is attached.
thanks,
-- Shuah
----------------------------------------------------------------
The following changes since commit fe05c40ca9c18cfdb003f639a30fc78a7ab49519:
selftest: hid: add the missing tests directory (2024-10-16 15:55:14 -0600)
are available in the Git repository at:
git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest tags/linux_kselftest-fixes-6.12-rc6
for you to fetch changes up to dc1308bee1ed03b4d698d77c8bd670d399dcd04d:
selftests/watchdog-test: Fix system accidentally reset after watchdog-test (2024-10-28 21:34:43 -0600)
----------------------------------------------------------------
linux_kselftest-fixes-6.12-rc6
Kselftest fixes for Linux 6.12-rc6
- fix syntax error in frequency calculation arithmetic expression in
intel_pstate run.sh
- add missing cpupower dependency check intel_pstate run.sh
- fix idmap_mount_tree_invalid test failure due to incorrect argument
- fix watchdog-test run leaving the watchdog timer enabled causing
system reboot. With this fix, the test disables the watchdog timer
when it gets terminated with SIGTERM, SIGKILL, and SIGQUIT in addition
to SIGINT
----------------------------------------------------------------
Alessandro Zanni (2):
selftests/intel_pstate: fix operand expected error
selftests/intel_pstate: check if cpupower is installed
Li Zhijian (1):
selftests/watchdog-test: Fix system accidentally reset after watchdog-test
zhouyuhang (1):
selftests/mount_setattr: fix idmap_mount_tree_invalid failed to run
tools/testing/selftests/intel_pstate/run.sh | 9 +++++++--
tools/testing/selftests/mount_setattr/mount_setattr_test.c | 9 +++++++++
tools/testing/selftests/watchdog/watchdog-test.c | 6 ++++++
3 files changed, 22 insertions(+), 2 deletions(-)
---------------------------------------------------------------
Greetings:
Welcome to v4, see changelog below. Note that our performance tests were
not re-run for this revision as we updated the selftest, FAQ in the
cover letter, and kernel documentation. No functional/code changes.
This series introduces a new mechanism, IRQ suspension, which allows
network applications using epoll to mask IRQs during periods of high
traffic while also reducing tail latency (compared to existing
mechanisms, see below) during periods of low traffic. In doing so, this
balances CPU consumption with network processing efficiency.
Martin Karsten (CC'd) and I have been collaborating on this series for
several months and have appreciated the feedback from the community on
our RFC [1]. We've updated the cover letter and kernel documentation in
an attempt to more clearly explain how this mechanism works, how
applications can use it, and how it compares to existing mechanisms in
the kernel. We've added an additional test case, 'fullbusy', achieved by
modifying libevent for comparison. See below for a detailed description,
link to the patch, and test results.
I briefly mentioned this idea at netdev conf 2024 (for those who were
there) and Martin described this idea in an earlier paper presented at
Sigmetrics 2024 [2].
~ The short explanation (TL;DR)
We propose adding a new napi config parameter: irq_suspend_timeout to
help balance CPU usage and network processing efficiency when using IRQ
deferral and napi busy poll.
If this parameter is set to a non-zero value *and* a user application
has enabled preferred busy poll on a busy poll context (via the
EPIOCSPARAMS ioctl introduced in commit 18e2bf0edf4d ("eventpoll: Add
epoll ioctl for epoll_params")), then application calls to epoll_wait
for that context will cause device IRQs and softirq processing to be
suspended as long as epoll_wait successfully retrieves data from the
NAPI. Each time data is retrieved, the irq_suspend_timeout is deferred.
If/when network traffic subsides and epoll_wait returns no data, IRQ
suspension is immediately reverted back to the existing
napi_defer_hard_irqs and gro_flush_timeout mechanism which was
introduced in commit 6f8b12d661d0 ("net: napi: add hard irqs deferral
feature")).
The irq_suspend_timeout serves as a safety mechanism. If userland takes
a long time processing data, irq_suspend_timeout will fire and restart
normal NAPI processing.
For a more in depth explanation, please continue reading.
~ Comparison with existing mechanisms
Interrupt mitigation can be accomplished in napi software, by setting
napi_defer_hard_irqs and gro_flush_timeout, or via interrupt coalescing
in the NIC. This can be quite efficient, but in both cases, a fixed
timeout (or packet count) needs to be configured. However, a fixed
timeout cannot effectively support both low- and high-load situations:
At low load, an application typically processes a few requests and then
waits to receive more input data. In this scenario, a large timeout will
cause unnecessary latency.
At high load, an application typically processes many requests before
being ready to receive more input data. In this case, a small timeout
will likely fire prematurely and trigger irq/softirq processing, which
interferes with the application's execution. This causes overhead, most
likely due to cache contention.
While NICs attempt to provide adaptive interrupt coalescing schemes,
these cannot properly take into account application-level processing.
An alternative packet delivery mechanism is busy-polling, which results
in perfect alignment of application processing and network polling. It
delivers optimal performance (throughput and latency), but results in
100% cpu utilization and is thus inefficient for below-capacity
workloads.
We propose to add a new packet delivery mode that properly alternates
between busy polling and interrupt-based delivery depending on busy and
idle periods of the application. During a busy period, the system
operates in busy-polling mode, which avoids interference. During an idle
period, the system falls back to interrupt deferral, but with a small
timeout to avoid excessive latencies. This delivery mode can also be
viewed as an extension of basic interrupt deferral, but alternating
between a small and a very large timeout.
This delivery mode is efficient, because it avoids softirq execution
interfering with application processing during busy periods. It can be
used with blocking epoll_wait to conserve cpu cycles during idle
periods. The effect of alternating between busy and idle periods is that
performance (throughput and latency) is very close to full busy polling,
while cpu utilization is lower and very close to interrupt mitigation.
~ Usage details
IRQ suspension is introduced via a per-NAPI configuration parameter that
controls the maximum time that IRQs can be suspended.
Here's how it is intended to work:
- The user application (or system administrator) uses the netdev-genl
netlink interface to set the pre-existing napi_defer_hard_irqs and
gro_flush_timeout NAPI config parameters to enable IRQ deferral.
- The user application (or system administrator) sets the proposed
irq_suspend_timeout parameter via the netdev-genl netlink interface
to a larger value than gro_flush_timeout to enable IRQ suspension.
- The user application issues the existing epoll ioctl to set the
prefer_busy_poll flag on the epoll context.
- The user application then calls epoll_wait to busy poll for network
events, as it normally would.
- If epoll_wait returns events to userland, IRQs are suspended for the
duration of irq_suspend_timeout.
- If epoll_wait finds no events and the thread is about to go to
sleep, IRQ handling using napi_defer_hard_irqs and gro_flush_timeout
is resumed.
As long as epoll_wait is retrieving events, IRQs (and softirq
processing) for the NAPI being polled remain disabled. When network
traffic reduces, eventually a busy poll loop in the kernel will retrieve
no data. When this occurs, regular IRQ deferral using gro_flush_timeout
for the polled NAPI is re-enabled.
Unless IRQ suspension is continued by subsequent calls to epoll_wait, it
automatically times out after the irq_suspend_timeout timer expires.
Regular deferral is also immediately re-enabled when the epoll context
is destroyed.
~ Usage scenario
The target scenario for IRQ suspension as packet delivery mode is a
system that runs a dominant application with substantial network I/O.
The target application can be configured to receive input data up to a
certain batch size (via epoll_wait maxevents parameter) and this batch
size determines the worst-case latency that application requests might
experience. Because packet delivery is suspended during the target
application's processing, the batch size also determines the worst-case
latency of concurrent applications using the same RX queue(s).
gro_flush_timeout should be set as small as possible, but large enough to
make sure that a single request is likely not being interfered with.
irq_suspend_timeout is largely a safety mechanism against misbehaving
applications. It should be set large enough to cover the processing of an
entire application batch, i.e., the factor between gro_flush_timeout and
irq_suspend_timeout should roughly correspond to the maximum batch size
that the target application would process in one go.
~ Design rationale
The implementation of the IRQ suspension mechanism very nicely dovetails
with the existing mechanism for IRQ deferral when preferred busy poll is
enabled (introduced in commit 7fd3253a7de6 ("net: Introduce preferred
busy-polling"), see that commit message for more details).
While it would be possible to inject the suspend timeout via
the existing epoll ioctl, it is more natural to avoid this path for one
main reason:
An epoll context is linked to NAPI IDs as file descriptors are added;
this means any epoll context might suddenly be associated with a
different net_device if the application were to replace all existing
fds with fds from a different device. In this case, the scope of the
suspend timeout becomes unclear and many edge cases for both the user
application and the kernel are introduced
Only a single iteration through napi busy polling is needed for this
mechanism to work effectively. Since an important objective for this
mechanism is preserving cpu cycles, exactly one iteration of the napi
busy loop is invoked when busy_poll_usecs is set to 0.
~ Important call outs in the implementation
- Enabling per epoll-context preferred busy poll will now effectively
lead to a nonblocking iteration through napi_busy_loop, even when
busy_poll_usecs is 0. See patch 4.
- Patches apply cleanly on commit 160a810b2a85 ("net: vxlan: update
the document for vxlan_snoop()").
~ Benchmark configs & descriptions
The changes were benchmarked with memcached [3] using the benchmarking
tool mutilate [4].
To facilitate benchmarking, a small patch [5] was applied to memcached
1.6.29 to allow setting per-epoll context preferred busy poll and other
settings via environment variables. Another small patch [6] was applied
to libevent to enable full busy-polling.
Multiple scenarios were benchmarked as described below and the scripts
used for producing these results can be found on github [7] (note: all
scenarios use NAPI-based traffic splitting via SO_INCOMING_ID by passing
-N to memcached):
- base:
- no other options enabled
- deferX:
- set defer_hard_irqs to 100
- set gro_flush_timeout to X,000
- napibusy:
- set defer_hard_irqs to 100
- set gro_flush_timeout to 200,000
- enable busy poll via the existing ioctl (busy_poll_usecs = 64,
busy_poll_budget = 64, prefer_busy_poll = true)
- fullbusy:
- set defer_hard_irqs to 100
- set gro_flush_timeout to 5,000,000
- enable busy poll via the existing ioctl (busy_poll_usecs = 1000,
busy_poll_budget = 64, prefer_busy_poll = true)
- change memcached's nonblocking epoll_wait invocation (via
libevent) to using a 1 ms timeout
- suspendX:
- set defer_hard_irqs to 100
- set gro_flush_timeout to X,000
- set irq_suspend_timeout to 20,000,000
- enable busy poll via the existing ioctl (busy_poll_usecs = 0,
busy_poll_budget = 64, prefer_busy_poll = true)
~ Benchmark results
Tested on:
Single socket AMD EPYC 7662 64-Core Processor
Hyperthreading disabled
4 NUMA Zones (NPS=4)
16 CPUs per NUMA zone (64 cores total)
2 x Dual port 100gbps Mellanox Technologies ConnectX-5 Ex EN NIC
The test machine is configured such that a single interface has 8 RX
queues. The queues' IRQs and memcached are pinned to CPUs that are
NUMA-local to the interface which is under test. The NIC's interrupt
coalescing configuration is left at boot-time defaults.
Results:
Results are shown below. The mechanism added by this series is
represented by the 'suspend' cases. Data presented shows a summary over
at least 10 runs of each test case [8] using the scripts on github [7].
For latency, the median is shown. For throughput and CPU utilization,
the average is shown.
The results also include cycles-per-query (cpq) and
instruction-per-query (ipq) metrics, following the methodology proposed
in [2], to augment the CPU utilization numbers, which could be skewed
due to frequency scaling. We find that this does not appear to be the
case as CPU utilization and low-level metrics show similar trends.
These results were captured using the scripts on github [7] to
illustrate how this approach compares with other pre-existing
mechanisms. This data is not to be interpreted as scientific data
captured in a fully isolated lab setting, but instead as best effort,
illustrative information comparing and contrasting tradeoffs.
The absolute QPS results are higher than our previous submission, but
the relative differences between variants are equivalent. Because the
patches have been rebased on 6.12, several factors have likely
influenced the overall performance. Most importantly, we had to switch
to a new set of basic kernel options, which has likely altered the
baseline performance. Because the overall comparison of variants still
holds, we have not attempted to recreate the exact set of kernel options
from the previous submission.
Compare:
- Throughput (MAX) and latencies of base vs suspend.
- CPU usage of napibusy and fullbusy during lower load (200K, 400K for
example) vs suspend.
- Latency of the defer variants vs suspend as timeout and load
increases.
The overall takeaway is that the suspend variants provide a superior
combination of high throughput, low latency, and low cpu utilization
compared to all other variants. Each of the suspend variants works very
well, but some fine-tuning between latency and cpu utilization is still
possible by tuning the small timeout (gro_flush_timeout).
Note: we've reorganized the results to make comparison among testcases
with the same load easier.
testcase load qps avglat 95%lat 99%lat cpu cpq ipq
base 200K 200024 127 254 458 25 12748 11289
defer10 200K 199991 64 128 166 27 18763 16574
defer20 200K 199986 72 135 178 25 15405 14173
defer50 200K 200025 91 149 198 23 12275 12203
defer200 200K 199996 182 266 326 18 8595 9183
fullbusy 200K 200040 58 123 167 100 43641 23145
napibusy 200K 200009 115 244 299 56 24797 24693
suspend10 200K 200005 63 128 167 32 19559 17240
suspend20 200K 199952 69 132 170 29 16324 14838
suspend50 200K 200019 84 144 189 26 13106 12516
suspend200 200K 199978 168 264 326 20 9331 9643
testcase load qps avglat 95%lat 99%lat cpu cpq ipq
base 400K 400017 157 292 762 39 9287 9325
defer10 400K 400033 71 141 204 53 13950 12943
defer20 400K 399935 79 150 212 47 12027 11673
defer50 400K 399888 101 171 231 39 9556 9921
defer200 400K 399993 200 287 358 32 7428 8576
fullbusy 400K 400018 63 132 203 100 21827 16062
napibusy 400K 399970 89 230 292 83 18156 16508
suspend10 400K 400061 69 139 202 54 13576 13057
suspend20 400K 399988 73 144 206 49 11930 11773
suspend50 400K 399975 88 161 218 42 9996 10270
suspend200 400K 399954 172 276 353 34 7847 8713
testcase load qps avglat 95%lat 99%lat cpu cpq ipq
base 600K 600031 166 289 631 61 9188 8787
defer10 600K 599967 85 167 262 75 11833 10947
defer20 600K 599888 89 165 243 66 10513 10362
defer50 600K 600072 109 185 253 55 8664 9190
defer200 600K 599951 222 315 393 45 6892 8213
fullbusy 600K 600041 69 145 227 100 14549 13936
napibusy 600K 599980 79 188 280 96 13927 14155
suspend10 600K 600028 78 159 267 69 10877 11032
suspend20 600K 600026 81 159 254 64 9922 10320
suspend50 600K 600007 96 178 258 57 8681 9331
suspend200 600K 599964 177 295 369 47 7115 8366
testcase load qps avglat 95%lat 99%lat cpu cpq ipq
base 800K 800034 198 329 698 84 9366 8338
defer10 800K 799718 243 642 1457 95 10532 9007
defer20 800K 800009 132 245 399 89 9956 8979
defer50 800K 800024 136 228 378 80 9002 8598
defer200 800K 799965 255 362 473 66 7481 8147
fullbusy 800K 799927 78 157 253 100 10915 12533
napibusy 800K 799870 81 173 273 99 10826 12532
suspend10 800K 799991 84 167 269 83 9380 9802
suspend20 800K 799979 90 172 290 78 8765 9404
suspend50 800K 800031 106 191 307 71 7945 8805
suspend200 800K 799905 182 307 411 62 6985 8242
testcase load qps avglat 95%lat 99%lat cpu cpq ipq
base 1000K 919543 3805 6390 14229 98 9324 7978
defer10 1000K 850751 4574 7382 15370 99 10218 8470
defer20 1000K 890296 4736 6862 14858 99 9708 8277
defer50 1000K 932694 3463 6180 13251 97 9148 8053
defer200 1000K 951311 3524 6052 13599 96 8875 7845
fullbusy 1000K 1000011 90 181 278 100 8731 10686
napibusy 1000K 1000050 93 184 280 100 8721 10547
suspend10 1000K 999962 101 193 306 92 8138 8980
suspend20 1000K 1000030 103 191 324 88 7844 8763
suspend50 1000K 1000001 114 202 320 83 7396 8431
suspend200 1000K 999965 185 314 428 76 6733 8072
testcase load qps avglat 95%lat 99%lat cpu cpq ipq
base MAX 1005592 4651 6594 14979 100 8679 7918
defer10 MAX 928204 5106 7286 15199 100 9398 8380
defer20 MAX 984663 4774 6518 14920 100 8861 8063
defer50 MAX 1044099 4431 6368 14652 100 8350 7948
defer200 MAX 1040451 4423 6610 14674 100 8380 7931
fullbusy MAX 1236608 3715 3987 12805 100 7051 7936
napibusy MAX 1077516 4345 10155 15957 100 8080 7842
suspend10 MAX 1218344 3760 3990 12585 100 7150 7935
suspend20 MAX 1220056 3752 4053 12602 100 7150 7961
suspend50 MAX 1213666 3791 4103 12919 100 7183 7959
suspend200 MAX 1217411 3768 3988 12863 100 7161 7954
~ FAQ
- Why is a new parameter needed? Does irq_suspend_timeout override
gro_flush_timeout?
Using the suspend mechanism causes the system to alternate between
polling mode and irq-driven packet delivery. During busy periods,
irq_suspend_timeout overrides gro_flush_timeout and keeps the system
busy polling, but when epoll finds no events, the setting of
gro_flush_timeout and napi_defer_hard_irqs determine the next step.
There are essentially three possible loops for network processing and
packet delivery:
1) hardirq -> softirq -> napi poll; basic interrupt delivery
2) timer -> softirq -> napi poll; deferred irq processing
3) epoll -> busy-poll -> napi poll; busy looping
Loop 2) can take control from Loop 1), if gro_flush_timeout and
napi_defer_hard_irqs are set.
If gro_flush_timeout and napi_defer_hard_irqs are set, Loops 2) and
3) "wrestle" with each other for control. During busy periods,
irq_suspend_timeout is used as timer in Loop 2), which essentially
tilts this in favour of Loop 3).
If gro_flush_timeout and napi_defer_hard_irqs are not set, Loop 3)
cannot take control from Loop 1).
Therefore, setting gro_flush_timeout and napi_defer_hard_irqs is the
recommended usage, because otherwise setting irq_suspend_timeout
might not have any discernible effect.
We ran experiments with these parameters set to zero and the results
are as expected and essentially the same as the base case.
- Can the new timeout value be threaded through the new epoll ioctl ?
Only with difficulty. The epoll ioctl sets options on an epoll
context and the NAPI ID associated with an epoll context can change
based on what file descriptors a user app adds to the epoll context.
This would introduce complexity in the API from the user perspective
and also complexity in the kernel.
- Can irq suspend be built by combining NIC coalescing and
gro_flush_timeout ?
No. The problem is that the long timeout must engage if and only if
prefer-busy is active.
When using NIC coalescing for the short timeout (without
napi_defer_hard_irqs/gro_flush_timeout), an interrupt after an idle
period will trigger softirq, which will run napi polling. At this
point, prefer-busy is not active, so NIC interrupts would be
re-enabled. Then it is not possible for the longer timeout to
interject to switch control back to polling. In other words, only by
using the software timer for the short timeout, it is possible to
extend the timeout without having to reprogram the NIC timer or
reach down directly and disable interrupts.
Using gro_flush_timeout for the long timeout also has problems, for
the same underlying reason. In the current napi implementation,
gro_flush_timeout is not tied to prefer-busy. We'd either have to
change that and in the process modify the existing deferral
mechanism, or introduce a state variable to determine whether
gro_flush_timeout is used as long timeout for irq suspend or whether
it is used for its default purpose. In an earlier version, we did
try something similar to the latter and made it work, but it ends up
being a lot more convoluted than our current proposal.
- Isn't it already possible to combine busy looping with irq deferral?
Yes, in fact enabling irq deferral via napi_defer_hard_irqs and
gro_flush_timeout is a precondition for prefer_busy_poll to have an
effect. If the application also uses a tight busy loop with
essentially nonblocking epoll_wait (accomplished with a very short
timeout parameter), this is the fullbusy case shown in the results.
An application using blocking epoll_wait is shown as the napibusy
case in the result. It's a hybrid approach that provides limited
latency benefits compared to the base case and plain irq deferral,
but not as good as fullbusy or suspend.
~ Special thanks
Several people were involved in earlier stages of the development of this
mechanism whom we'd like to thank:
- Peter Cai (CC'd), for the initial kernel patch and his contributions
to the paper.
- Mohammadamin Shafie (CC'd), for testing various versions of the kernel
patch and providing helpful feedback.
Thanks,
Martin and Joe
[1]: https://lore.kernel.org/netdev/20240812125717.413108-1-jdamato@fastly.com/
[2]: https://doi.org/10.1145/3626780
[3]: https://github.com/memcached/memcached/blob/master/doc/napi_ids.txt
[4]: https://github.com/leverich/mutilate
[5]: https://raw.githubusercontent.com/martinkarsten/irqsuspend/main/patches/mem…
[6]: https://raw.githubusercontent.com/martinkarsten/irqsuspend/main/patches/lib…
[7]: https://github.com/martinkarsten/irqsuspend
[8]: https://github.com/martinkarsten/irqsuspend/tree/main/results
v4:
- Added a new FAQ item to cover letter.
- Updated patch 6 to use socat instead of nc in busy_poll_test.sh and
updated busy_poller.c to use netlink directly to configure napi
params.
- Updated the kernel documentation in patch 7 to include more details.
- Dropped Stanislav's Acked-by and Bagas' Reviewed-by from patch 7
since the documentation was updated.
v3:
- Added Stanislav Fomichev's Acked-by to every patch except the newly
added selftest.
- Added Bagas Sanjaya's Reviewed-by to the documentation patch.
- Fixed the commit message of patch 2 to remove a reference to the now
non-existent sysfs setting.
- Added a self test which tests both "regular" busy poll and busy poll
with suspend enabled. This was added as patch 6 as requested by
Paolo. netdevsim was chosen instead of veth due to netdevsim's
pre-existing support for netdev-genl. See the commit message of
patch 6 for more details.
v2: https://lore.kernel.org/bpf/20241021015311.95468-1-jdamato@fastly.com/
- Cover letter updated, including a re-run of test data.
- Patch 1 rewritten to use netdev-genl instead of sysfs.
- Patch 3 updated with a comment added to napi_resume_irqs.
- Patch 4 rebased to apply now that commit b9ca079dd6b0 ("eventpoll:
Annotate data-race of busy_poll_usecs") has been picked up from VFS.
- Patch 6 updated the kernel documentation.
rfc -> v1:
- Cover letter updated to include more details.
- Patch 1 updated to remove the documentation added. This was moved to
patch 6 with the rest of the docs (see below).
- Patch 5 updated to fix an error uncovered by the kernel build robot.
See patch 5's changelog for more details.
- Patch 6 added which updates kernel documentation.
Joe Damato (2):
selftests: net: Add busy_poll_test
docs: networking: Describe irq suspension
Martin Karsten (5):
net: Add napi_struct parameter irq_suspend_timeout
net: Suspend softirq when prefer_busy_poll is set
net: Add control functions for irq suspension
eventpoll: Trigger napi_busy_loop, if prefer_busy_poll is set
eventpoll: Control irq suspension for prefer_busy_poll
Documentation/netlink/specs/netdev.yaml | 7 +
Documentation/networking/napi.rst | 176 +++++++++-
fs/eventpoll.c | 35 +-
include/linux/netdevice.h | 2 +
include/net/busy_poll.h | 3 +
include/uapi/linux/netdev.h | 1 +
net/core/dev.c | 58 +++-
net/core/dev.h | 25 ++
net/core/netdev-genl-gen.c | 5 +-
net/core/netdev-genl.c | 12 +
tools/include/uapi/linux/netdev.h | 1 +
tools/testing/selftests/net/.gitignore | 1 +
tools/testing/selftests/net/Makefile | 3 +-
tools/testing/selftests/net/busy_poll_test.sh | 164 +++++++++
tools/testing/selftests/net/busy_poller.c | 328 ++++++++++++++++++
15 files changed, 810 insertions(+), 11 deletions(-)
create mode 100755 tools/testing/selftests/net/busy_poll_test.sh
create mode 100644 tools/testing/selftests/net/busy_poller.c
base-commit: dbb9a7ef347828870df3e5e6ddf19469a3277fc9
--
2.25.1
Greetings:
Welcome to v3, see changelog below. Note that our performance tests were
not re-run for this revision as we only updated a commit message and
added a selftest.
This series introduces a new mechanism, IRQ suspension, which allows
network applications using epoll to mask IRQs during periods of high
traffic while also reducing tail latency (compared to existing
mechanisms, see below) during periods of low traffic. In doing so, this
balances CPU consumption with network processing efficiency.
Martin Karsten (CC'd) and I have been collaborating on this series for
several months and have appreciated the feedback from the community on
our RFC [1]. We've updated the cover letter and kernel documentation in
an attempt to more clearly explain how this mechanism works, how
applications can use it, and how it compares to existing mechanisms in
the kernel. We've added an additional test case, 'fullbusy', achieved by
modifying libevent for comparison. See below for a detailed description,
link to the patch, and test results.
I briefly mentioned this idea at netdev conf 2024 (for those who were
there) and Martin described this idea in an earlier paper presented at
Sigmetrics 2024 [2].
~ The short explanation (TL;DR)
We propose adding a new napi config parameter: irq_suspend_timeout to
help balance CPU usage and network processing efficiency when using IRQ
deferral and napi busy poll.
If this parameter is set to a non-zero value *and* a user application
has enabled preferred busy poll on a busy poll context (via the
EPIOCSPARAMS ioctl introduced in commit 18e2bf0edf4d ("eventpoll: Add
epoll ioctl for epoll_params")), then application calls to epoll_wait
for that context will cause device IRQs and softirq processing to be
suspended as long as epoll_wait successfully retrieves data from the
NAPI. Each time data is retrieved, the irq_suspend_timeout is deferred.
If/when network traffic subsides and epoll_wait returns no data, IRQ
suspension is immediately reverted back to the existing
napi_defer_hard_irqs and gro_flush_timeout mechanism which was
introduced in commit 6f8b12d661d0 ("net: napi: add hard irqs deferral
feature")).
The irq_suspend_timeout serves as a safety mechanism. If userland takes
a long time processing data, irq_suspend_timeout will fire and restart
normal NAPI processing.
For a more in depth explanation, please continue reading.
~ Comparison with existing mechanisms
Interrupt mitigation can be accomplished in napi software, by setting
napi_defer_hard_irqs and gro_flush_timeout, or via interrupt coalescing
in the NIC. This can be quite efficient, but in both cases, a fixed
timeout (or packet count) needs to be configured. However, a fixed
timeout cannot effectively support both low- and high-load situations:
At low load, an application typically processes a few requests and then
waits to receive more input data. In this scenario, a large timeout will
cause unnecessary latency.
At high load, an application typically processes many requests before
being ready to receive more input data. In this case, a small timeout
will likely fire prematurely and trigger irq/softirq processing, which
interferes with the application's execution. This causes overhead, most
likely due to cache contention.
While NICs attempt to provide adaptive interrupt coalescing schemes,
these cannot properly take into account application-level processing.
An alternative packet delivery mechanism is busy-polling, which results
in perfect alignment of application processing and network polling. It
delivers optimal performance (throughput and latency), but results in
100% cpu utilization and is thus inefficient for below-capacity
workloads.
We propose to add a new packet delivery mode that properly alternates
between busy polling and interrupt-based delivery depending on busy and
idle periods of the application. During a busy period, the system
operates in busy-polling mode, which avoids interference. During an idle
period, the system falls back to interrupt deferral, but with a small
timeout to avoid excessive latencies. This delivery mode can also be
viewed as an extension of basic interrupt deferral, but alternating
between a small and a very large timeout.
This delivery mode is efficient, because it avoids softirq execution
interfering with application processing during busy periods. It can be
used with blocking epoll_wait to conserve cpu cycles during idle
periods. The effect of alternating between busy and idle periods is that
performance (throughput and latency) is very close to full busy polling,
while cpu utilization is lower and very close to interrupt mitigation.
~ Usage details
IRQ suspension is introduced via a per-NAPI configuration parameter that
controls the maximum time that IRQs can be suspended.
Here's how it is intended to work:
- The user application (or system administrator) uses the netdev-genl
netlink interface to set the pre-existing napi_defer_hard_irqs and
gro_flush_timeout NAPI config parameters to enable IRQ deferral.
- The user application (or system administrator) sets the proposed
irq_suspend_timeout parameter via the netdev-genl netlink interface
to a larger value than gro_flush_timeout to enable IRQ suspension.
- The user application issues the existing epoll ioctl to set the
prefer_busy_poll flag on the epoll context.
- The user application then calls epoll_wait to busy poll for network
events, as it normally would.
- If epoll_wait returns events to userland, IRQs are suspended for the
duration of irq_suspend_timeout.
- If epoll_wait finds no events and the thread is about to go to
sleep, IRQ handling using napi_defer_hard_irqs and gro_flush_timeout
is resumed.
As long as epoll_wait is retrieving events, IRQs (and softirq
processing) for the NAPI being polled remain disabled. When network
traffic reduces, eventually a busy poll loop in the kernel will retrieve
no data. When this occurs, regular IRQ deferral using gro_flush_timeout
for the polled NAPI is re-enabled.
Unless IRQ suspension is continued by subsequent calls to epoll_wait, it
automatically times out after the irq_suspend_timeout timer expires.
Regular deferral is also immediately re-enabled when the epoll context
is destroyed.
~ Usage scenario
The target scenario for IRQ suspension as packet delivery mode is a
system that runs a dominant application with substantial network I/O.
The target application can be configured to receive input data up to a
certain batch size (via epoll_wait maxevents parameter) and this batch
size determines the worst-case latency that application requests might
experience. Because packet delivery is suspended during the target
application's processing, the batch size also determines the worst-case
latency of concurrent applications using the same RX queue(s).
gro_flush_timeout should be set as small as possible, but large enough to
make sure that a single request is likely not being interfered with.
irq_suspend_timeout is largely a safety mechanism against misbehaving
applications. It should be set large enough to cover the processing of an
entire application batch, i.e., the factor between gro_flush_timeout and
irq_suspend_timeout should roughly correspond to the maximum batch size
that the target application would process in one go.
~ Design rationale
The implementation of the IRQ suspension mechanism very nicely dovetails
with the existing mechanism for IRQ deferral when preferred busy poll is
enabled (introduced in commit 7fd3253a7de6 ("net: Introduce preferred
busy-polling"), see that commit message for more details).
While it would be possible to inject the suspend timeout via
the existing epoll ioctl, it is more natural to avoid this path for one
main reason:
An epoll context is linked to NAPI IDs as file descriptors are added;
this means any epoll context might suddenly be associated with a
different net_device if the application were to replace all existing
fds with fds from a different device. In this case, the scope of the
suspend timeout becomes unclear and many edge cases for both the user
application and the kernel are introduced
Only a single iteration through napi busy polling is needed for this
mechanism to work effectively. Since an important objective for this
mechanism is preserving cpu cycles, exactly one iteration of the napi
busy loop is invoked when busy_poll_usecs is set to 0.
~ Important call outs in the implementation
- Enabling per epoll-context preferred busy poll will now effectively
lead to a nonblocking iteration through napi_busy_loop, even when
busy_poll_usecs is 0. See patch 4.
- Patches apply cleanly on commit 160a810b2a85 ("net: vxlan: update
the document for vxlan_snoop()").
~ Benchmark configs & descriptions
The changes were benchmarked with memcached [3] using the benchmarking
tool mutilate [4].
To facilitate benchmarking, a small patch [5] was applied to memcached
1.6.29 to allow setting per-epoll context preferred busy poll and other
settings via environment variables. Another small patch [6] was applied
to libevent to enable full busy-polling.
Multiple scenarios were benchmarked as described below and the scripts
used for producing these results can be found on github [7] (note: all
scenarios use NAPI-based traffic splitting via SO_INCOMING_ID by passing
-N to memcached):
- base:
- no other options enabled
- deferX:
- set defer_hard_irqs to 100
- set gro_flush_timeout to X,000
- napibusy:
- set defer_hard_irqs to 100
- set gro_flush_timeout to 200,000
- enable busy poll via the existing ioctl (busy_poll_usecs = 64,
busy_poll_budget = 64, prefer_busy_poll = true)
- fullbusy:
- set defer_hard_irqs to 100
- set gro_flush_timeout to 5,000,000
- enable busy poll via the existing ioctl (busy_poll_usecs = 1000,
busy_poll_budget = 64, prefer_busy_poll = true)
- change memcached's nonblocking epoll_wait invocation (via
libevent) to using a 1 ms timeout
- suspendX:
- set defer_hard_irqs to 100
- set gro_flush_timeout to X,000
- set irq_suspend_timeout to 20,000,000
- enable busy poll via the existing ioctl (busy_poll_usecs = 0,
busy_poll_budget = 64, prefer_busy_poll = true)
~ Benchmark results
Tested on:
Single socket AMD EPYC 7662 64-Core Processor
Hyperthreading disabled
4 NUMA Zones (NPS=4)
16 CPUs per NUMA zone (64 cores total)
2 x Dual port 100gbps Mellanox Technologies ConnectX-5 Ex EN NIC
The test machine is configured such that a single interface has 8 RX
queues. The queues' IRQs and memcached are pinned to CPUs that are
NUMA-local to the interface which is under test. The NIC's interrupt
coalescing configuration is left at boot-time defaults.
Results:
Results are shown below. The mechanism added by this series is
represented by the 'suspend' cases. Data presented shows a summary over
at least 10 runs of each test case [8] using the scripts on github [7].
For latency, the median is shown. For throughput and CPU utilization,
the average is shown.
The results also include cycles-per-query (cpq) and
instruction-per-query (ipq) metrics, following the methodology proposed
in [2], to augment the CPU utilization numbers, which could be skewed
due to frequency scaling. We find that this does not appear to be the
case as CPU utilization and low-level metrics show similar trends.
These results were captured using the scripts on github [7] to
illustrate how this approach compares with other pre-existing
mechanisms. This data is not to be interpreted as scientific data
captured in a fully isolated lab setting, but instead as best effort,
illustrative information comparing and contrasting tradeoffs.
The absolute QPS results are higher than our previous submission, but
the relative differences between variants are equivalent. Because the
patches have been rebased on 6.12, several factors have likely
influenced the overall performance. Most importantly, we had to switch
to a new set of basic kernel options, which has likely altered the
baseline performance. Because the overall comparison of variants still
holds, we have not attempted to recreate the exact set of kernel options
from the previous submission.
Compare:
- Throughput (MAX) and latencies of base vs suspend.
- CPU usage of napibusy and fullbusy during lower load (200K, 400K for
example) vs suspend.
- Latency of the defer variants vs suspend as timeout and load
increases.
The overall takeaway is that the suspend variants provide a superior
combination of high throughput, low latency, and low cpu utilization
compared to all other variants. Each of the suspend variants works very
well, but some fine-tuning between latency and cpu utilization is still
possible by tuning the small timeout (gro_flush_timeout).
Note: we've reorganized the results to make comparison among testcases
with the same load easier.
testcase load qps avglat 95%lat 99%lat cpu cpq ipq
base 200K 200024 127 254 458 25 12748 11289
defer10 200K 199991 64 128 166 27 18763 16574
defer20 200K 199986 72 135 178 25 15405 14173
defer50 200K 200025 91 149 198 23 12275 12203
defer200 200K 199996 182 266 326 18 8595 9183
fullbusy 200K 200040 58 123 167 100 43641 23145
napibusy 200K 200009 115 244 299 56 24797 24693
suspend10 200K 200005 63 128 167 32 19559 17240
suspend20 200K 199952 69 132 170 29 16324 14838
suspend50 200K 200019 84 144 189 26 13106 12516
suspend200 200K 199978 168 264 326 20 9331 9643
testcase load qps avglat 95%lat 99%lat cpu cpq ipq
base 400K 400017 157 292 762 39 9287 9325
defer10 400K 400033 71 141 204 53 13950 12943
defer20 400K 399935 79 150 212 47 12027 11673
defer50 400K 399888 101 171 231 39 9556 9921
defer200 400K 399993 200 287 358 32 7428 8576
fullbusy 400K 400018 63 132 203 100 21827 16062
napibusy 400K 399970 89 230 292 83 18156 16508
suspend10 400K 400061 69 139 202 54 13576 13057
suspend20 400K 399988 73 144 206 49 11930 11773
suspend50 400K 399975 88 161 218 42 9996 10270
suspend200 400K 399954 172 276 353 34 7847 8713
testcase load qps avglat 95%lat 99%lat cpu cpq ipq
base 600K 600031 166 289 631 61 9188 8787
defer10 600K 599967 85 167 262 75 11833 10947
defer20 600K 599888 89 165 243 66 10513 10362
defer50 600K 600072 109 185 253 55 8664 9190
defer200 600K 599951 222 315 393 45 6892 8213
fullbusy 600K 600041 69 145 227 100 14549 13936
napibusy 600K 599980 79 188 280 96 13927 14155
suspend10 600K 600028 78 159 267 69 10877 11032
suspend20 600K 600026 81 159 254 64 9922 10320
suspend50 600K 600007 96 178 258 57 8681 9331
suspend200 600K 599964 177 295 369 47 7115 8366
testcase load qps avglat 95%lat 99%lat cpu cpq ipq
base 800K 800034 198 329 698 84 9366 8338
defer10 800K 799718 243 642 1457 95 10532 9007
defer20 800K 800009 132 245 399 89 9956 8979
defer50 800K 800024 136 228 378 80 9002 8598
defer200 800K 799965 255 362 473 66 7481 8147
fullbusy 800K 799927 78 157 253 100 10915 12533
napibusy 800K 799870 81 173 273 99 10826 12532
suspend10 800K 799991 84 167 269 83 9380 9802
suspend20 800K 799979 90 172 290 78 8765 9404
suspend50 800K 800031 106 191 307 71 7945 8805
suspend200 800K 799905 182 307 411 62 6985 8242
testcase load qps avglat 95%lat 99%lat cpu cpq ipq
base 1000K 919543 3805 6390 14229 98 9324 7978
defer10 1000K 850751 4574 7382 15370 99 10218 8470
defer20 1000K 890296 4736 6862 14858 99 9708 8277
defer50 1000K 932694 3463 6180 13251 97 9148 8053
defer200 1000K 951311 3524 6052 13599 96 8875 7845
fullbusy 1000K 1000011 90 181 278 100 8731 10686
napibusy 1000K 1000050 93 184 280 100 8721 10547
suspend10 1000K 999962 101 193 306 92 8138 8980
suspend20 1000K 1000030 103 191 324 88 7844 8763
suspend50 1000K 1000001 114 202 320 83 7396 8431
suspend200 1000K 999965 185 314 428 76 6733 8072
testcase load qps avglat 95%lat 99%lat cpu cpq ipq
base MAX 1005592 4651 6594 14979 100 8679 7918
defer10 MAX 928204 5106 7286 15199 100 9398 8380
defer20 MAX 984663 4774 6518 14920 100 8861 8063
defer50 MAX 1044099 4431 6368 14652 100 8350 7948
defer200 MAX 1040451 4423 6610 14674 100 8380 7931
fullbusy MAX 1236608 3715 3987 12805 100 7051 7936
napibusy MAX 1077516 4345 10155 15957 100 8080 7842
suspend10 MAX 1218344 3760 3990 12585 100 7150 7935
suspend20 MAX 1220056 3752 4053 12602 100 7150 7961
suspend50 MAX 1213666 3791 4103 12919 100 7183 7959
suspend200 MAX 1217411 3768 3988 12863 100 7161 7954
~ FAQ
- Can the new timeout value be threaded through the new epoll ioctl ?
Only with difficulty. The epoll ioctl sets options on an epoll
context and the NAPI ID associated with an epoll context can change
based on what file descriptors a user app adds to the epoll context.
This would introduce complexity in the API from the user perspective
and also complexity in the kernel.
- Can irq suspend be built by combining NIC coalescing and
gro_flush_timeout ?
No. The problem is that the long timeout must engage if and only if
prefer-busy is active.
When using NIC coalescing for the short timeout (without
napi_defer_hard_irqs/gro_flush_timeout), an interrupt after an idle
period will trigger softirq, which will run napi polling. At this
point, prefer-busy is not active, so NIC interrupts would be
re-enabled. Then it is not possible for the longer timeout to
interject to switch control back to polling. In other words, only by
using the software timer for the short timeout, it is possible to
extend the timeout without having to reprogram the NIC timer or
reach down directly and disable interrupts.
Using gro_flush_timeout for the long timeout also has problems, for
the same underlying reason. In the current napi implementation,
gro_flush_timeout is not tied to prefer-busy. We'd either have to
change that and in the process modify the existing deferral
mechanism, or introduce a state variable to determine whether
gro_flush_timeout is used as long timeout for irq suspend or whether
it is used for its default purpose. In an earlier version, we did
try something similar to the latter and made it work, but it ends up
being a lot more convoluted than our current proposal.
- Isn't it already possible to combine busy looping with irq deferral?
Yes, in fact enabling irq deferral via napi_defer_hard_irqs and
gro_flush_timeout is a precondition for prefer_busy_poll to have an
effect. If the application also uses a tight busy loop with
essentially nonblocking epoll_wait (accomplished with a very short
timeout parameter), this is the fullbusy case shown in the results.
An application using blocking epoll_wait is shown as the napibusy
case in the result. It's a hybrid approach that provides limited
latency benefits compared to the base case and plain irq deferral,
but not as good as fullbusy or suspend.
~ Special thanks
Several people were involved in earlier stages of the development of this
mechanism whom we'd like to thank:
- Peter Cai (CC'd), for the initial kernel patch and his contributions
to the paper.
- Mohammadamin Shafie (CC'd), for testing various versions of the kernel
patch and providing helpful feedback.
Thanks,
Martin and Joe
[1]: https://lore.kernel.org/netdev/20240812125717.413108-1-jdamato@fastly.com/
[2]: https://doi.org/10.1145/3626780
[3]: https://github.com/memcached/memcached/blob/master/doc/napi_ids.txt
[4]: https://github.com/leverich/mutilate
[5]: https://raw.githubusercontent.com/martinkarsten/irqsuspend/main/patches/mem…
[6]: https://raw.githubusercontent.com/martinkarsten/irqsuspend/main/patches/lib…
[7]: https://github.com/martinkarsten/irqsuspend
[8]: https://github.com/martinkarsten/irqsuspend/tree/main/results
v3:
- Added Stanislav Fomichev's Acked-by to every patch except the newly
added selftest.
- Added Bagas Sanjaya's Reviewed-by to the documentation patch.
- Fixed the commit message of patch 2 to remove a reference to the now
non-existent sysfs setting.
- Added a self test which tests both "regular" busy poll and busy poll
with suspend enabled. This was added as patch 6 as requested by
Paolo. netdevsim was chosen instead of veth due to netdevsim's
pre-existing support for netdev-genl. See the commit message of
patch 6 for more details.
v2: https://lore.kernel.org/bpf/20241021015311.95468-1-jdamato@fastly.com/
- Cover letter updated, including a re-run of test data.
- Patch 1 rewritten to use netdev-genl instead of sysfs.
- Patch 3 updated with a comment added to napi_resume_irqs.
- Patch 4 rebased to apply now that commit b9ca079dd6b0 ("eventpoll:
Annotate data-race of busy_poll_usecs") has been picked up from VFS.
- Patch 6 updated the kernel documentation.
rfc -> v1:
- Cover letter updated to include more details.
- Patch 1 updated to remove the documentation added. This was moved to
patch 6 with the rest of the docs (see below).
- Patch 5 updated to fix an error uncovered by the kernel build robot.
See patch 5's changelog for more details.
- Patch 6 added which updates kernel documentation.
Joe Damato (2):
selftests: net: Add busy_poll_test
docs: networking: Describe irq suspension
Martin Karsten (5):
net: Add napi_struct parameter irq_suspend_timeout
net: Suspend softirq when prefer_busy_poll is set
net: Add control functions for irq suspension
eventpoll: Trigger napi_busy_loop, if prefer_busy_poll is set
eventpoll: Control irq suspension for prefer_busy_poll
Documentation/netlink/specs/netdev.yaml | 7 +
Documentation/networking/napi.rst | 132 +++++++++-
fs/eventpoll.c | 35 ++-
include/linux/netdevice.h | 2 +
include/net/busy_poll.h | 3 +
include/uapi/linux/netdev.h | 1 +
net/core/dev.c | 58 ++++-
net/core/dev.h | 25 ++
net/core/netdev-genl-gen.c | 5 +-
net/core/netdev-genl.c | 12 +
tools/include/uapi/linux/netdev.h | 1 +
tools/testing/selftests/net/.gitignore | 1 +
tools/testing/selftests/net/Makefile | 2 +
tools/testing/selftests/net/busy_poll_test.sh | 164 ++++++++++++
tools/testing/selftests/net/busy_poller.c | 245 ++++++++++++++++++
15 files changed, 683 insertions(+), 10 deletions(-)
create mode 100755 tools/testing/selftests/net/busy_poll_test.sh
create mode 100644 tools/testing/selftests/net/busy_poller.c
base-commit: 9e114ec8084020e10e1cb7b43dbbf6e69940866b
--
2.25.1
This series was motivated by the regression fixed by 166bf8af9122
("pinctrl: mediatek: common-v2: Fix broken bias-disable for
PULL_PU_PD_RSEL_TYPE"). A bug was introduced in the pinctrl_paris driver
which prevented certain pins from having their bias configured.
Running this test on the mt8195-tomato platform with the test plan
included below[1] shows the test passing with the fix applied, but failing
without the fix:
With fix:
$ ./gpio-setget-config.py
TAP version 13
# Using test plan file: ./google,tomato.yaml
1..3
ok 1 pinctrl_paris.34.pull-up
ok 2 pinctrl_paris.34.pull-down
ok 3 pinctrl_paris.34.disabled
# Totals: pass:3 fail:0 xfail:0 xpass:0 skip:0 error:0
Without fix:
$ ./gpio-setget-config.py
TAP version 13
# Using test plan file: ./google,tomato.yaml
1..3
# Bias doesn't match: Expected pull-up, read pull-down.
not ok 1 pinctrl_paris.34.pull-up
ok 2 pinctrl_paris.34.pull-down
# Bias doesn't match: Expected disabled, read pull-down.
not ok 3 pinctrl_paris.34.disabled
# Totals: pass:1 fail:2 xfail:0 xpass:0 skip:0 error:0
In order to achieve this, the first three patches expose bias
configuration through the GPIO API in the MediaTek pinctrl drivers,
notably, pinctrl_paris, patch 4 extends the gpio-mockup-cdev utility for
use by patch 5, and patch 5 introduces a new GPIO kselftest that takes a
test plan in YAML, which can be tailored per-platform to specify the
configurations to test, and sets and gets back each pin configuration to
verify that they match and thus that the driver is behaving as expected.
Since the GPIO uAPI only allows setting the pin configuration, getting
it back is done through pinconf-pins in the pinctrl debugfs folder.
The test currently only verifies bias but it would be easy to extend to
verify other pin configurations.
The test plan YAML file can be customized for each use-case and is
platform-dependant. For that reason, only an example is included in
patch 3 and the user is supposed to provide their test plan. That said,
the aim is to collect test plans for ease of use at [2].
[1] This is the test plan used for mt8195-tomato:
- label: "pinctrl_paris"
tests:
# Pin 34 has type MTK_PULL_PU_PD_RSEL_TYPE and is unused.
# Setting bias to MTK_PULL_PU_PD_RSEL_TYPE pins was fixed by
# 166bf8af9122 ("pinctrl: mediatek: common-v2: Fix broken bias-disable for PULL_PU_PD_RSEL_TYPE")
- pin: 34
bias: "pull-up"
- pin: 34
bias: "pull-down"
- pin: 34
bias: "disabled"
[2] https://github.com/kernelci/platform-test-parameters
Signed-off-by: Nícolas F. R. A. Prado <nfraprado(a)collabora.com>
---
Changes in v2:
- Added patches 2 and 3 enabling the extra GPIO pin configurations on
the other mediatek drivers: pinctrl-moore and pinctrl-mtk-common
- Tweaked function name in patch 1:
mtk_pinconf_set -> mtk_paris_pin_config_set,
to make it clear it is not a pinconf_ops
- Adjusted commit message to make it clear the current support is
limited to pins supported by the EINT controller
- Link to v1: https://lore.kernel.org/r/20240909-kselftest-gpio-set-get-config-v1-0-16a06…
---
Nícolas F. R. A. Prado (5):
pinctrl: mediatek: paris: Expose more configurations to GPIO set_config
pinctrl: mediatek: moore: Expose more configurations to GPIO set_config
pinctrl: mediatek: common: Expose more configurations to GPIO set_config
selftest: gpio: Add wait flag to gpio-mockup-cdev
selftest: gpio: Add a new set-get config test
drivers/pinctrl/mediatek/pinctrl-moore.c | 283 +++++++++++----------
drivers/pinctrl/mediatek/pinctrl-mtk-common.c | 48 ++--
drivers/pinctrl/mediatek/pinctrl-paris.c | 26 +-
tools/testing/selftests/gpio/Makefile | 2 +-
tools/testing/selftests/gpio/gpio-mockup-cdev.c | 14 +-
.../gpio-set-get-config-example-test-plan.yaml | 15 ++
.../testing/selftests/gpio/gpio-set-get-config.py | 183 +++++++++++++
7 files changed, 395 insertions(+), 176 deletions(-)
---
base-commit: a39230ecf6b3057f5897bc4744a790070cfbe7a8
change-id: 20240906-kselftest-gpio-set-get-config-6e5bb670c1a5
Best regards,
--
Nícolas F. R. A. Prado <nfraprado(a)collabora.com>
Check number of paths by fib_info_num_path(),
and update_or_create_fnhe() for every path.
Problem is that pmtu is cached only for the oif
that has received icmp message "need to frag",
other oifs will still try to use "default" iface mtu.
V4:
- fix selftest, do route lookup before checking cached exceptions
V3:
- added selftest
- fixed compile error
V2:
- fix fib_info_num_path parameter pass
An example topology showing the problem:
| host1
+---------+
| dummy0 | 10.179.20.18/32 mtu9000
+---------+
+-----------+----------------+
+---------+ +---------+
| ens17f0 | 10.179.2.141/31 | ens17f1 | 10.179.2.13/31
+---------+ +---------+
| (all here have mtu 9000) |
+------+ +------+
| ro1 | 10.179.2.140/31 | ro2 | 10.179.2.12/31
+------+ +------+
| |
---------+------------+-------------------+------
|
+-----+
| ro3 | 10.10.10.10 mtu1500
+-----+
|
========================================
some networks
========================================
|
+-----+
| eth0| 10.10.30.30 mtu9000
+-----+
| host2
host1 have enabled multipath and
sysctl net.ipv4.fib_multipath_hash_policy = 1:
default proto static src 10.179.20.18
nexthop via 10.179.2.12 dev ens17f1 weight 1
nexthop via 10.179.2.140 dev ens17f0 weight 1
When host1 tries to do pmtud from 10.179.20.18/32 to host2,
host1 receives at ens17f1 iface an icmp packet from ro3 that ro3 mtu=1500.
And host1 caches it in nexthop exceptions cache.
Problem is that it is cached only for the iface that has received icmp,
and there is no way that ro3 will send icmp msg to host1 via another path.
Host1 now have this routes to host2:
ip r g 10.10.30.30 sport 30000 dport 443
10.10.30.30 via 10.179.2.12 dev ens17f1 src 10.179.20.18 uid 0
cache expires 521sec mtu 1500
ip r g 10.10.30.30 sport 30033 dport 443
10.10.30.30 via 10.179.2.140 dev ens17f0 src 10.179.20.18 uid 0
cache
So when host1 tries again to reach host2 with mtu>1500,
if packet flow is lucky enough to be hashed with oif=ens17f1 its ok,
if oif=ens17f0 it blackholes and still gets icmp msgs from ro3 to ens17f1,
until lucky day when ro3 will send it through another flow to ens17f0.
Signed-off-by: Vladimir Vdovin <deliran(a)verdict.gg>
---
net/ipv4/route.c | 13 +++++
tools/testing/selftests/net/pmtu.sh | 79 ++++++++++++++++++++++++++++-
2 files changed, 91 insertions(+), 1 deletion(-)
diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index 723ac9181558..41162b5cc4cb 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -1027,6 +1027,19 @@ static void __ip_rt_update_pmtu(struct rtable *rt, struct flowi4 *fl4, u32 mtu)
struct fib_nh_common *nhc;
fib_select_path(net, &res, fl4, NULL);
+#ifdef CONFIG_IP_ROUTE_MULTIPATH
+ if (fib_info_num_path(res.fi) > 1) {
+ int nhsel;
+
+ for (nhsel = 0; nhsel < fib_info_num_path(res.fi); nhsel++) {
+ nhc = fib_info_nhc(res.fi, nhsel);
+ update_or_create_fnhe(nhc, fl4->daddr, 0, mtu, lock,
+ jiffies + net->ipv4.ip_rt_mtu_expires);
+ }
+ rcu_read_unlock();
+ return;
+ }
+#endif /* CONFIG_IP_ROUTE_MULTIPATH */
nhc = FIB_RES_NHC(res);
update_or_create_fnhe(nhc, fl4->daddr, 0, mtu, lock,
jiffies + net->ipv4.ip_rt_mtu_expires);
diff --git a/tools/testing/selftests/net/pmtu.sh b/tools/testing/selftests/net/pmtu.sh
index 569bce8b6383..f7ced4c436fb 100755
--- a/tools/testing/selftests/net/pmtu.sh
+++ b/tools/testing/selftests/net/pmtu.sh
@@ -266,7 +266,8 @@ tests="
list_flush_ipv4_exception ipv4: list and flush cached exceptions 1
list_flush_ipv6_exception ipv6: list and flush cached exceptions 1
pmtu_ipv4_route_change ipv4: PMTU exception w/route replace 1
- pmtu_ipv6_route_change ipv6: PMTU exception w/route replace 1"
+ pmtu_ipv6_route_change ipv6: PMTU exception w/route replace 1
+ pmtu_ipv4_mp_exceptions ipv4: PMTU multipath nh exceptions 0"
# Addressing and routing for tests with routers: four network segments, with
# index SEGMENT between 1 and 4, a common prefix (PREFIX4 or PREFIX6) and an
@@ -2329,6 +2330,82 @@ test_pmtu_ipv6_route_change() {
test_pmtu_ipvX_route_change 6
}
+test_pmtu_ipv4_mp_exceptions() {
+ setup namespaces routing || return $ksft_skip
+
+ ip nexthop ls >/dev/null 2>&1
+ if [ $? -ne 0 ]; then
+ echo "Nexthop objects not supported; skipping tests"
+ exit $ksft_skip
+ fi
+
+ trace "${ns_a}" veth_A-R1 "${ns_r1}" veth_R1-A \
+ "${ns_r1}" veth_R1-B "${ns_b}" veth_B-R1 \
+ "${ns_a}" veth_A-R2 "${ns_r2}" veth_R2-A \
+ "${ns_r2}" veth_R2-B "${ns_b}" veth_B-R2
+
+ dummy0_a="192.168.99.99"
+ dummy0_b="192.168.88.88"
+
+ # Set up initial MTU values
+ mtu "${ns_a}" veth_A-R1 2000
+ mtu "${ns_r1}" veth_R1-A 2000
+ mtu "${ns_r1}" veth_R1-B 1500
+ mtu "${ns_b}" veth_B-R1 1500
+
+ mtu "${ns_a}" veth_A-R2 2000
+ mtu "${ns_r2}" veth_R2-A 2000
+ mtu "${ns_r2}" veth_R2-B 1500
+ mtu "${ns_b}" veth_B-R2 1500
+
+ fail=0
+
+ #Set up host A with multipath routes to host B dummy0_b
+ run_cmd ${ns_a} sysctl -q net.ipv4.fib_multipath_hash_policy=1
+ run_cmd ${ns_a} sysctl -q net.ipv4.ip_forward=1
+ run_cmd ${ns_a} ip link add dummy0 mtu 2000 type dummy
+ run_cmd ${ns_a} ip link set dummy0 up
+ run_cmd ${ns_a} ip addr add ${dummy0_a} dev dummy0
+ run_cmd ${ns_a} ip nexthop add id 201 via ${prefix4}.${a_r1}.2 dev veth_A-R1
+ run_cmd ${ns_a} ip nexthop add id 202 via ${prefix4}.${a_r2}.2 dev veth_A-R2
+ run_cmd ${ns_a} ip nexthop add id 203 group 201/202
+ run_cmd ${ns_a} ip route add ${dummy0_b} nhid 203
+
+ #Set up host B with multipath routes to host A dummy0_a
+ run_cmd ${ns_b} sysctl -q net.ipv4.fib_multipath_hash_policy=1
+ run_cmd ${ns_b} sysctl -q net.ipv4.ip_forward=1
+ run_cmd ${ns_b} ip link add dummy0 mtu 2000 type dummy
+ run_cmd ${ns_b} ip link set dummy0 up
+ run_cmd ${ns_b} ip addr add ${dummy0_b} dev dummy0
+ run_cmd ${ns_b} ip nexthop add id 201 via ${prefix4}.${b_r1}.2 dev veth_A-R1
+ run_cmd ${ns_b} ip nexthop add id 202 via ${prefix4}.${b_r2}.2 dev veth_A-R2
+ run_cmd ${ns_b} ip nexthop add id 203 group 201/202
+ run_cmd ${ns_b} ip route add ${dummy0_a} nhid 203
+
+ #Set up routers with routes to dummies
+ run_cmd ${ns_r1} ip route add ${dummy0_a} via ${prefix4}.${a_r1}.1
+ run_cmd ${ns_r2} ip route add ${dummy0_a} via ${prefix4}.${a_r2}.1
+ run_cmd ${ns_r1} ip route add ${dummy0_b} via ${prefix4}.${b_r1}.1
+ run_cmd ${ns_r2} ip route add ${dummy0_b} via ${prefix4}.${b_r2}.1
+
+
+ #Ping and expect two nexthop exceptions for two routes in nh group
+ run_cmd ${ns_a} ping -q -M want -i 0.1 -c 2 -s 1800 "${dummy0_b}"
+
+ #Do route lookup before checking cached exceptions
+ run_cmd ${ns_a} ip route get ${dummy0_b} oif veth_A-R1
+ run_cmd ${ns_a} ip route get ${dummy0_b} oif veth_A-R2
+
+ #Check cached exceptions
+ echo "$(ip -oneline route list cache)"
+ if [ "$(${ns_a} ip -oneline route list cache| grep mtu | wc -l)" -ne 2 ]; then
+ err " there are not enough cached exceptions"
+ fail=1
+ fi
+
+ return ${fail}
+}
+
usage() {
echo
echo "$0 [OPTIONS] [TEST]..."
base-commit: 66600fac7a984dea4ae095411f644770b2561ede
--
2.43.5
This series was originally written by José Expósito, and has been
modified and updated by Matt Gilbride and myself. The original version
can be found here:
https://github.com/Rust-for-Linux/linux/pull/950
Add support for writing KUnit tests in Rust. While Rust doctests are
already converted to KUnit tests and run, they're really better suited
for examples, rather than as first-class unit tests.
This series implements a series of direct Rust bindings for KUnit tests,
as well as a new macro which allows KUnit tests to be written using a
close variant of normal Rust unit test syntax. The only change required
is replacing '#[cfg(test)]' with '#[kunit_tests(kunit_test_suite_name)]'
An example test would look like:
#[kunit_tests(rust_kernel_hid_driver)]
mod tests {
use super::*;
use crate::{c_str, driver, hid, prelude::*};
use core::ptr;
struct SimpleTestDriver;
impl Driver for SimpleTestDriver {
type Data = ();
}
#[test]
fn rust_test_hid_driver_adapter() {
let mut hid = bindings::hid_driver::default();
let name = c_str!("SimpleTestDriver");
static MODULE: ThisModule = unsafe { ThisModule::from_ptr(ptr::null_mut()) };
let res = unsafe {
<hid::Adapter<SimpleTestDriver> as driver::DriverOps>::register(&mut hid, name, &MODULE)
};
assert_eq!(res, Err(ENODEV)); // The mock returns -19
}
}
Please give this a go, and make sure I haven't broken it! There's almost
certainly a lot of improvements which can be made -- and there's a fair
case to be made for replacing some of this with generated C code which
can use the C macros -- but this is hopefully an adequate implementation
for now, and the interface can (with luck) remain the same even if the
implementation changes.
A few small notable missing features:
- Attributes (like the speed of a test) are hardcoded to the default
value.
- Similarly, the module name attribute is hardcoded to NULL. In C, we
use the KBUILD_MODNAME macro, but I couldn't find a way to use this
from Rust which wasn't more ugly than just disabling it.
- Assertions are not automatically rewritten to use KUnit assertions.
---
Changes since v2:
https://lore.kernel.org/linux-kselftest/20241029092422.2884505-1-davidgow@g…
- Include missing rust/macros/kunit.rs file from v2. (Thanks Boqun!)
- The kunit_unsafe_test_suite!() macro will truncate the name of the
suite if it is too long. (Thanks Alice!)
- The proc macro now emits an error if the suite name is too long.
- We no longer needlessly use UnsafeCell<> in
kunit_unsafe_test_suite!(). (Thanks Alice!)
Changes since v1:
https://lore.kernel.org/lkml/20230720-rustbind-v1-0-c80db349e3b5@google.com…
- Rebase on top of the latest rust-next (commit 718c4069896c)
- Make kunit_case a const fn, rather than a macro (Thanks Boqun)
- As a result, the null terminator is now created with
kernel::kunit::kunit_case_null()
- Use the C kunit_get_current_test() function to implement
in_kunit_test(), rather than re-implementing it (less efficiently)
ourselves.
Changes since the GitHub PR:
- Rebased on top of kselftest/kunit
- Add const_mut_refs feature
This may conflict with https://lore.kernel.org/lkml/20230503090708.2524310-6-nmi@metaspace.dk/
- Add rust/macros/kunit.rs to the KUnit MAINTAINERS entry
---
José Expósito (3):
rust: kunit: add KUnit case and suite macros
rust: macros: add macro to easily run KUnit tests
rust: kunit: allow to know if we are in a test
MAINTAINERS | 1 +
rust/kernel/kunit.rs | 191 +++++++++++++++++++++++++++++++++++++++++++
rust/kernel/lib.rs | 1 +
rust/macros/kunit.rs | 153 ++++++++++++++++++++++++++++++++++
rust/macros/lib.rs | 29 +++++++
5 files changed, 375 insertions(+)
create mode 100644 rust/macros/kunit.rs
--
2.47.0.163.g1226f6d8fa-goog
```
readonly STATS="$(mktemp -p /tmp ns-XXXXXX)"
readonly BASE=`basename $STATS`
```
It could be a mistake to write to $BASE rather than $STATS, where $STATS
is used to save the NSTAT_HISTORY and it will be cleaned up before exit.
Although since we've been creating the wrong file this whole time and
everything worked, it's fine to remove these 2 lines completely
Cc: "David S. Miller" <davem(a)davemloft.net>
Cc: Eric Dumazet <edumazet(a)google.com>
Cc: Jakub Kicinski <kuba(a)kernel.org>
Cc: Paolo Abeni <pabeni(a)redhat.com>
Signed-off-by: Li Zhijian <lizhijian(a)fujitsu.com>
---
Cc: netdev(a)vger.kernel.org
---
V3:
Remove these 2 lines rather than fixing the filename
---
Hello,
Cover letter is here.
This patch set aims to make 'git status' clear after 'make' and 'make
run_tests' for kselftests.
---
V2: nothing change
Signed-off-by: Li Zhijian <lizhijian(a)fujitsu.com>
---
tools/testing/selftests/net/veth.sh | 2 --
1 file changed, 2 deletions(-)
diff --git a/tools/testing/selftests/net/veth.sh b/tools/testing/selftests/net/veth.sh
index 4f1edbafb946..6bb7dfaa30b6 100755
--- a/tools/testing/selftests/net/veth.sh
+++ b/tools/testing/selftests/net/veth.sh
@@ -46,8 +46,6 @@ create_ns() {
ip -n $BASE$ns addr add dev veth$ns $BM_NET_V4$ns/24
ip -n $BASE$ns addr add dev veth$ns $BM_NET_V6$ns/64 nodad
done
- echo "#kernel" > $BASE
- chmod go-rw $BASE
}
__chk_flag() {
--
2.44.0
This is the 8th version of the ovpn patchset.
Thanks Sergey for arguing regarding splitting PEER_SET into SET and NEW.
I decided to follow this suggestion as it makes the API and its return
value easier to work with.
Thanks Donald for the suggestions regarding the NL API - they have all
been implemented (unless I forgot some, but hopefully I did not).
Notable changes from v7:
* Netlink API adjustments:
** renamed NL API from OP_OBJ to OBJ_OP (i.e. from SET_PEER to PEER_SET)
** split PEER_SET from PEER_NEW for better clarity in case of error
** renamed NL API from NEW/DEL_IFACE to DEV_NEW/DEL
** converted all underscores to dashes in YML NL spec
** split sockaddr_remote attr into ipv4/6, port and v6_scope_id attrs
** split local_ip attr into local_ipv4 and local_ipv6 attrs
** turned keyconf into a root attribute (it was nested in peer before)
** made key_swap use a keyconf object rather than a peer for consistency
with key mgmt API
** created specific op for peer_del notification (peer_del_ntf)
** created specific op for key_swap notification (key_swap_ntf)
** allow user to update VPN IPv4/6 (peer is now rehashable)
** converted port attrs from u32 to u16 for better consistency with
userspace code
* added rtnl_ops .dellink implementation
* removed patch 2 as it's not needed anymore thanks to the point
above
* moved rtnl_ops .kind initialization to first patch
* updated MAINTAINERS file with Github tree and selftest folder
* wrapped long lines in selftest scripts
BONUS: used b4 for the first time to prepare the patchset and send it
Please note that patches previously reviewed by Andrew Lunn have
retained the Reviewed-by tag as they have been simply rebased without
any modification.
The latest code can also be found at:
https://github.com/OpenVPN/linux-kernel-ovpn
Thanks a lot!
Best Regards,
Antonio Quartulli
OpenVPN Inc.
---
Antonio Quartulli (24):
netlink: add NLA_POLICY_MAX_LEN macro
net: introduce OpenVPN Data Channel Offload (ovpn)
ovpn: add basic netlink support
ovpn: add basic interface creation/destruction/management routines
ovpn: implement interface creation/destruction via netlink
ovpn: keep carrier always on
ovpn: introduce the ovpn_peer object
ovpn: introduce the ovpn_socket object
ovpn: implement basic TX path (UDP)
ovpn: implement basic RX path (UDP)
ovpn: implement packet processing
ovpn: store tunnel and transport statistics
ovpn: implement TCP transport
ovpn: implement multi-peer support
ovpn: implement peer lookup logic
ovpn: implement keepalive mechanism
ovpn: add support for updating local UDP endpoint
ovpn: add support for peer floating
ovpn: implement peer add/dump/delete via netlink
ovpn: implement key add/del/swap via netlink
ovpn: kill key and notify userspace in case of IV exhaustion
ovpn: notify userspace when a peer is deleted
ovpn: add basic ethtool support
testing/selftest: add test tool and scripts for ovpn module
Documentation/netlink/specs/ovpn.yaml | 387 +++++
MAINTAINERS | 11 +
drivers/net/Kconfig | 15 +
drivers/net/Makefile | 1 +
drivers/net/ovpn/Makefile | 22 +
drivers/net/ovpn/bind.c | 54 +
drivers/net/ovpn/bind.h | 117 ++
drivers/net/ovpn/crypto.c | 172 ++
drivers/net/ovpn/crypto.h | 138 ++
drivers/net/ovpn/crypto_aead.c | 356 ++++
drivers/net/ovpn/crypto_aead.h | 31 +
drivers/net/ovpn/io.c | 459 ++++++
drivers/net/ovpn/io.h | 25 +
drivers/net/ovpn/main.c | 363 ++++
drivers/net/ovpn/main.h | 29 +
drivers/net/ovpn/netlink-gen.c | 224 +++
drivers/net/ovpn/netlink-gen.h | 42 +
drivers/net/ovpn/netlink.c | 1099 +++++++++++++
drivers/net/ovpn/netlink.h | 18 +
drivers/net/ovpn/ovpnstruct.h | 60 +
drivers/net/ovpn/packet.h | 40 +
drivers/net/ovpn/peer.c | 1207 ++++++++++++++
drivers/net/ovpn/peer.h | 172 ++
drivers/net/ovpn/pktid.c | 130 ++
drivers/net/ovpn/pktid.h | 87 +
drivers/net/ovpn/proto.h | 104 ++
drivers/net/ovpn/skb.h | 61 +
drivers/net/ovpn/socket.c | 165 ++
drivers/net/ovpn/socket.h | 53 +
drivers/net/ovpn/stats.c | 21 +
drivers/net/ovpn/stats.h | 47 +
drivers/net/ovpn/tcp.c | 506 ++++++
drivers/net/ovpn/tcp.h | 43 +
drivers/net/ovpn/udp.c | 406 +++++
drivers/net/ovpn/udp.h | 26 +
include/net/netlink.h | 1 +
include/uapi/linux/ovpn.h | 116 ++
include/uapi/linux/udp.h | 1 +
tools/net/ynl/ynl-gen-c.py | 2 +
tools/testing/selftests/Makefile | 1 +
tools/testing/selftests/net/ovpn/.gitignore | 2 +
tools/testing/selftests/net/ovpn/Makefile | 18 +
tools/testing/selftests/net/ovpn/config | 8 +
tools/testing/selftests/net/ovpn/data-test-tcp.sh | 9 +
tools/testing/selftests/net/ovpn/data-test.sh | 153 ++
tools/testing/selftests/net/ovpn/data64.key | 5 +
tools/testing/selftests/net/ovpn/float-test.sh | 118 ++
tools/testing/selftests/net/ovpn/ovpn-cli.c | 1822 +++++++++++++++++++++
tools/testing/selftests/net/ovpn/tcp_peers.txt | 5 +
tools/testing/selftests/net/ovpn/udp_peers.txt | 5 +
50 files changed, 8957 insertions(+)
---
base-commit: 44badc908f2c85711cb18e45e13119c10ad3a05f
change-id: 20241002-b4-ovpn-eeee35c694a2
Best regards,
--
Antonio Quartulli <antonio(a)openvpn.net>
The goal of the series is to simplify and make it possible to use
ncdevmem in an automated way from the ksft python wrapper.
ncdevmem is slowly mutated into a state where it uses stdout
to print the payload and the python wrapper is added to
make sure the arrived payload matches the expected one.
v6:
- fix compilation issue in 'Unify error handling' patch (Jakub)
v5:
- properly handle errors from inet_pton() and socket() (Paolo)
- remove unneeded import from python selftest (Paolo)
v4:
- keep usage example with validation (Mina)
- fix compilation issue in one patch (s/start_queues/start_queue/)
v3:
- keep and refine the comment about ncdevmem invocation (Mina)
- add the comment about not enforcing exit status for ntuple reset (Mina)
- make configure_headersplit more robust (Mina)
- use num_queues/2 in selftest and let the users override it (Mina)
- remove memory_provider.memcpy_to_device (Mina)
- keep ksft as is (don't use -v validate flags): we are gonna
need a --debug-disable flag to make it less chatty; otherwise
it times out when sending too much data; so leaving it as
a separate follow up
v2:
- don't remove validation (Mina)
- keep 5-tuple flow steering but use it only when -c is provided (Mina)
- remove separate flag for probing (Mina)
- move ncdevmem under drivers/net/hw, not drivers/net (Jakub)
Cc: Mina Almasry <almasrymina(a)google.com>
Stanislav Fomichev (12):
selftests: ncdevmem: Redirect all non-payload output to stderr
selftests: ncdevmem: Separate out dmabuf provider
selftests: ncdevmem: Unify error handling
selftests: ncdevmem: Make client_ip optional
selftests: ncdevmem: Remove default arguments
selftests: ncdevmem: Switch to AF_INET6
selftests: ncdevmem: Properly reset flow steering
selftests: ncdevmem: Use YNL to enable TCP header split
selftests: ncdevmem: Remove hard-coded queue numbers
selftests: ncdevmem: Run selftest when none of the -s or -c has been
provided
selftests: ncdevmem: Move ncdevmem under drivers/net/hw
selftests: ncdevmem: Add automated test
.../selftests/drivers/net/hw/.gitignore | 1 +
.../testing/selftests/drivers/net/hw/Makefile | 9 +
.../selftests/drivers/net/hw/devmem.py | 45 +
.../selftests/drivers/net/hw/ncdevmem.c | 773 ++++++++++++++++++
tools/testing/selftests/net/.gitignore | 1 -
tools/testing/selftests/net/Makefile | 8 -
tools/testing/selftests/net/ncdevmem.c | 570 -------------
7 files changed, 828 insertions(+), 579 deletions(-)
create mode 100644 tools/testing/selftests/drivers/net/hw/.gitignore
create mode 100755 tools/testing/selftests/drivers/net/hw/devmem.py
create mode 100644 tools/testing/selftests/drivers/net/hw/ncdevmem.c
delete mode 100644 tools/testing/selftests/net/ncdevmem.c
--
2.47.0
The 2024 architecture release includes a number of data processing
extensions, mostly SVE and SME additions with a few others. These are
all very straightforward extensions which add instructions but no
architectural state so only need hwcaps and exposing of the ID registers
to KVM guests and userspace.
Signed-off-by: Mark Brown <broonie(a)kernel.org>
---
Changes in v2:
- Filter KVM guest visible bitfields in ID_AA64ISAR3_EL1 to only those
we make writeable.
- Link to v1: https://lore.kernel.org/r/20241028-arm64-2024-dpisa-v1-0-a38d08b008a8@kerne…
---
Mark Brown (9):
arm64/sysreg: Update ID_AA64PFR2_EL1 to DDI0601 2024-09
arm64/sysreg: Update ID_AA64ISAR3_EL1 to DDI0601 2024-09
arm64/sysreg: Update ID_AA64FPFR0_EL1 to DDI0601 2024-09
arm64/sysreg: Update ID_AA64ZFR0_EL1 to DDI0601 2024-09
arm64/sysreg: Update ID_AA64SMFR0_EL1 to DDI0601 2024-09
arm64/sysreg: Update ID_AA64ISAR2_EL1 to DDI0601 2024-09
arm64/hwcap: Describe 2024 dpISA extensions to userspace
KVM: arm64: Allow control of dpISA extensions in ID_AA64ISAR3_EL1
kselftest/arm64: Add 2024 dpISA extensions to hwcap test
Documentation/arch/arm64/elf_hwcaps.rst | 51 ++++++
arch/arm64/include/asm/hwcap.h | 17 ++
arch/arm64/include/uapi/asm/hwcap.h | 17 ++
arch/arm64/kernel/cpufeature.c | 35 ++++
arch/arm64/kernel/cpuinfo.c | 17 ++
arch/arm64/kvm/sys_regs.c | 6 +-
arch/arm64/tools/sysreg | 87 +++++++++-
tools/testing/selftests/arm64/abi/hwcap.c | 273 +++++++++++++++++++++++++++++-
8 files changed, 493 insertions(+), 10 deletions(-)
---
base-commit: 8e929cb546ee42c9a61d24fae60605e9e3192354
change-id: 20241008-arm64-2024-dpisa-8091074a7f48
Best regards,
--
Mark Brown <broonie(a)kernel.org>
Currently, we are only using the linear search method to find the type
id by the name, which has a time complexity of O(n). This change involves
sorting the names of btf types in ascending order and using binary search,
which has a time complexity of O(log(n)). This idea was inspired by the
following patch:
60443c88f3a8 ("kallsyms: Improve the performance of kallsyms_lookup_name()").
At present, this improvement is only for searching in vmlinux's and module's BTFs.
Another change is the search direction, where we search the BTF first and
then its base, the type id of the first matched btf_type will be returned.
Here is a time-consuming result that finding 87590 type ids by their names in
vmlinux's BTF.
Before: 158426 ms
After: 114 ms
The average lookup performance has improved more than 1000x in the above scenario.
v4:
- Divide the patch into two parts: kernel and libbpf
- Use Eduard's code to sort btf_types in the btf__dedup function
- Correct some btf testcases due to modifications of the order of btf_types.
v3:
- Link: https://lore.kernel.org/all/20240608140835.965949-1-dolinux.peng@gmail.com/
- Sort btf_types during build process other than during boot, to reduce the
overhead of memory and boot time.
v2:
- Link: https://lore.kernel.org/all/20230909091646.420163-1-pengdonglin@sangfor.com…
Donglin Peng (3):
libbpf: Sort btf_types in ascending order by name
bpf: Using binary search to improve the performance of
btf_find_by_name_kind
libbpf: Using binary search to improve the performance of
btf__find_by_name_kind
include/linux/btf.h | 1 +
kernel/bpf/btf.c | 157 +++++++++-
tools/lib/bpf/btf.c | 274 +++++++++++++---
tools/testing/selftests/bpf/prog_tests/btf.c | 296 +++++++++---------
.../bpf/prog_tests/btf_dedup_split.c | 64 ++--
5 files changed, 555 insertions(+), 237 deletions(-)
--
2.34.1