From: Chia-Yu Chang <chia-yu.chang(a)nokia-bell-labs.com>
Hello,
Please find the DualPI2 patch v20.
This patch serise adds DualPI Improved with a Square (DualPI2) with following features:
* Supports congestion controls that comply with the Prague requirements in RFC9331 (e.g. TCP-Prague)
* Coupled dual-queue that separates the L4S traffic in a low latency queue (L-queue), without harming remaining traffic that is scheduled in classic queue (C-queue) due to congestion-coupling using PI2 as defined in RFC9332
* Configurable overload strategies
* Use of sojourn time to reliably estimate queue delay
* Supports ECN L4S-identifier (IP.ECN==0b*1) to classify traffic into respective queues
For more details of DualPI2, please refer IETF RFC9332 (https://datatracker.ietf.org/doc/html/rfc9332).
Best regards,
Chia-Yu
---
v20 (21-Jun-2025)
- Add one more commit to fix warning and style check on tdc.sh reported by shellcheck
- Remove double-prefixed of "tc_tc_dualpi2_attrs" in tc-user.h (Donald Hunter <donald.hunter(a)gmail.com>)
v19 (14-Jun-2025)
- Fix one typo in the comment of #1 (ALOK TIWARI <alok.a.tiwari(a)oracle.com>)
- Update commit message of #4 (ALOK TIWARI <alok.a.tiwari(a)oracle.com>)
- Wrap long lines of Documentation/netlink/specs/tc.yaml to within 80 characters (Jakub Kicinski <kuba(a)kernel.org>)
v18 (13-Jun-2025)
- Add the num of enum used by DualPI2 and fix name and name-prefix of DualPI2 enum and attribute
- Replace from_timer() with timer_container_of() (Pedro Tammela <pctammela(a)mojatatu.com>)
v17 (25-May-2025, Resent at 11-Jun-2025)
- Replace 0xffffffff with U32_MAX (Paolo Abeni <pabeni(a)redhat.com>)
- Use helper function qdisc_dequeue_internal() and add new helper function skb_apply_step() (Paolo Abeni <pabeni(a)redhat.com>)
- Add s64 casting when calculating the delta of the PI controller (Paolo Abeni <pabeni(a)redhat.com>)
- Change the drop reason into SKB_DROP_REASON_QDISC_CONGESTED for drop_early (Paolo Abeni <pabeni(a)redhat.com>)
- Modify the condition to remove the original skb when enqueuing multiple GSO segments (Paolo Abeni <pabeni(a)redhat.com>)
- Add READ_ONCE() in dualpi2_dump_stat() (Paolo Abeni <pabeni(a)redhat.com>)
- Add comments, brackets, and brackets for readability (Paolo Abeni <pabeni(a)redhat.com>)
v16 (16-MAy-2025)
- Add qdisc_lock() to dualpi2_timer() in dualpi2_timer (Paolo Abeni <pabeni(a)redhat.com>)
- Introduce convert_ns_to_usec() to convert usec to nsec without overflow in #1 (Paolo Abeni <pabeni(a)redhat.com>)
- Update convert_us_tonsec() to convert nsec to usec without overflow in #2 (Paolo Abeni <pabeni(a)redhat.com>)
- Add more descriptions with respect to DualPI2 in the cover ltter and add changelog in each patch (Paolo Abeni <pabeni(a)redhat.com>)
v15 (09-May-2025)
- Add enum of TCA_DUALPI2_ECN_MASK_CLA_ECT to remove potential leakeage in #1 (Simon Horman <horms(a)kernel.org>)
- Fix one typo in comment of #2
- Update tc.yaml in #5 to aligh with the updated enum of pkt_sched.h
v14 (05-May-2025)
- Modify tc.yaml: (1) Replace flags with enum and remove enum-as-flags, (2) Remove credit-queue in xstats, and (3) Change attribute types (Donald Hunter <donald.hun
- Add enum and fix the ordering of variables in pkt_sched.h to align with the modified tc.yaml (Donald Hunter <donald.hunter(a)gmail.com>)
- Add validators for DROP_OVERLOAD, DROP_EARLY, ECN_MASK, and SPLIT_GSO in sch_dualpi2.c (Donald Hunter <donald.hunter(a)gmail.com>)
- Update dualpi2.json to align with the updated variable order in pkt_sched.h
- Reorder patches (Donald Hunter <donald.hunter(a)gmail.com>)
v13 (26-Apr-2025)
- Use dashes in member names to follow YNL conventions in tc.yaml (Donald Hunter <donald.hunter(a)gmail.com>)
- Define enumerations separately for flags of drop-early, drop-overload, ecn-mask, credit-queue in tc.yaml (Donald Hunter <donald.hunter(a)gmail.com>)
- Change the types of split-gso and step-packets into flag in tc.yaml (Donald Hunter <donald.hunter(a)gmail.com>)
- Revert to u32/u8 types for tc-dualpi2-xstats members in tc.yaml (Donald Hunter <donald.hunter(a)gmail.com>)
- Add new test cases in tc-tests/qdiscs/dualpi2.json to cover all dualpi2 parameters (Donald Hunter <donald.hunter(a)gmail.com>)
- Change the type of TCA_DUALPI2_STEP_PACKETS into NLA_FLAG (Donald Hunter <donald.hunter(a)gmail.com>)
v12 (22-Apr-2025)
- Remove anonymous struct in sch_dualpi2.c (Paolo Abeni <pabeni(a)redhat.com>)
- Replace u32/u8 with uint and s32 with int in tc spec document (Paolo Abeni <pabeni(a)redhat.com>)
- Introduce get_memory_limit function to handle potential overflow when multipling limit with MTU (Paolo Abeni <pabeni(a)redhat.com>)
- Double the packet length to further include packet overhead in memory_limit (Paolo Abeni <pabeni(a)redhat.com>)
- Remove the check of qdisc_qlen(sch) when calling qdisc_tree_reduce_backlog (Paolo Abeni <pabeni(a)redhat.com>)
v11 (15-Apr-2025)
- Replace hstimer_init with hstimer_setup in sch_dualpi2.c
v10 (25-Mar-2025)
- Remove leftover include in include/linux/netdevice.h and anonymous struct in sch_dualpi2.c (Paolo Abeni <pabeni(a)redhat.com>)
- Use kfree_skb_reason() and add SKB_DROP_REASON_DUALPI2_STEP_DROP drop reason (Paolo Abeni <pabeni(a)redhat.com>)
- Split sch_dualpi2.c into 3 patches (and overall 5 patches): Struct definition & parsing, Dump stats & configuration, Enqueue/Dequeue (Paolo Abeni <pabeni(a)redhat.com>)
v9 (16-Mar-2025)
- Fix mem_usage error in previous version
- Add min_qlen_step to the dualpi2 attribute as the minimum queue length in number of packets in the L-queue to start step threshold marking.
In previous versions, this value was fixed to 2, so the step threshold was applied to mark packets in the L queue only when the queue length of the L queue was greater than or equal to 2 packets.
This will cause larger queuing delays for L4S traffic at low rates (<20Mbps). So we parameterize it and change the default value to 0.
Comparison of tcp_1down run 'HTB 20Mbit + DUALPI2 + 10ms base delay'
Old versions:
avg median # data pts
Ping (ms) ICMP : 11.55 11.70 ms 350
TCP upload avg : 18.96 N/A Mbits/s 350
TCP upload sum : 18.96 N/A Mbits/s 350
New version (v9):
avg median # data pts
Ping (ms) ICMP : 10.81 10.70 ms 350
TCP upload avg : 18.91 N/A Mbits/s 350
TCP upload sum : 18.91 N/A Mbits/s 350
Comparison of tcp_1down run 'HTB 10Mbit + DUALPI2 + 10ms base delay'
Old versions:
avg median # data pts
Ping (ms) ICMP : 12.61 12.80 ms 350
TCP upload avg : 9.48 N/A Mbits/s 350
TCP upload sum : 9.48 N/A Mbits/s 350
New version (v9):
avg median # data pts
Ping (ms) ICMP : 11.06 10.80 ms 350
TCP upload avg : 9.43 N/A Mbits/s 350
TCP upload sum : 9.43 N/A Mbits/s 350
Comparison of tcp_1down run 'HTB 10Mbit + DUALPI2 + 10ms base delay'
Old versions:
avg median # data pts
Ping (ms) ICMP : 40.86 37.45 ms 350
TCP upload avg : 0.88 N/A Mbits/s 350
TCP upload sum : 0.88 N/A Mbits/s 350
TCP upload::1 : 0.88 0.97 Mbits/s 350
New version (v9):
avg median # data pts
Ping (ms) ICMP : 11.07 10.40 ms 350
TCP upload avg : 0.55 N/A Mbits/s 350
TCP upload sum : 0.55 N/A Mbits/s 350
TCP upload::1 : 0.55 0.59 Mbits/s 350
v8 (11-Mar-2025)
- Fix warning messages in v7
v7 (07-Mar-2025)
- Separate into 3 patches to avoid mixing changes of documentation, selftest, and code. (Cong Wang <xiyou.wangcong(a)gmail.com>)
v6 (04-Mar-2025)
- Add modprobe for dulapi2 in tc-testing script tc-testing/tdc.sh (Jakub Kicinski <kuba(a)kernel.org>)
- Update test cases in dualpi2.json
- Update commit message
v5 (22-Feb-2025)
- A comparison was done between MQ + DUALPI2, MQ + FQ_PIE, MQ + FQ_CODEL:
Unshaped 1gigE with 4 download streams test:
- Summary of tcp_4down run 'MQ + FQ_CODEL':
avg median # data pts
Ping (ms) ICMP : 1.19 1.34 ms 349
TCP download avg : 235.42 N/A Mbits/s 349
TCP download sum : 941.68 N/A Mbits/s 349
TCP download::1 : 235.19 235.39 Mbits/s 349
TCP download::2 : 235.03 235.35 Mbits/s 349
TCP download::3 : 236.89 235.44 Mbits/s 349
TCP download::4 : 234.57 235.19 Mbits/s 349
- Summary of tcp_4down run 'MQ + FQ_PIE'
avg median # data pts
Ping (ms) ICMP : 1.21 1.37 ms 350
TCP download avg : 235.42 N/A Mbits/s 350
TCP download sum : 941.61 N/A Mbits/s 350
TCP download::1 : 232.54 233.13 Mbits/s 350
TCP download::2 : 232.52 232.80 Mbits/s 350
TCP download::3 : 233.14 233.78 Mbits/s 350
TCP download::4 : 243.41 241.48 Mbits/s 350
- Summary of tcp_4down run 'MQ + DUALPI2'
avg median # data pts
Ping (ms) ICMP : 1.19 1.34 ms 349
TCP download avg : 235.42 N/A Mbits/s 349
TCP download sum : 941.68 N/A Mbits/s 349
TCP download::1 : 235.19 235.39 Mbits/s 349
TCP download::2 : 235.03 235.35 Mbits/s 349
TCP download::3 : 236.89 235.44 Mbits/s 349
TCP download::4 : 234.57 235.19 Mbits/s 349
Unshaped 1gigE with 128 download streams test:
- Summary of tcp_128down run 'MQ + FQ_CODEL':
avg median # data pts
Ping (ms) ICMP : 1.88 1.86 ms 350
TCP download avg : 7.39 N/A Mbits/s 350
TCP download sum : 946.47 N/A Mbits/s 350
- Summary of tcp_128down run 'MQ + FQ_PIE':
avg median # data pts
Ping (ms) ICMP : 1.88 1.86 ms 350
TCP download avg : 7.39 N/A Mbits/s 350
TCP download sum : 946.47 N/A Mbits/s 350
- Summary of tcp_128down run 'MQ + DUALPI2':
avg median # data pts
Ping (ms) ICMP : 1.88 1.86 ms 350
TCP download avg : 7.39 N/A Mbits/s 350
TCP download sum : 946.47 N/A Mbits/s 350
Unshaped 10gigE with 4 download streams test:
- Summary of tcp_4down run 'MQ + FQ_CODEL':
avg median # data pts
Ping (ms) ICMP : 0.22 0.23 ms 350
TCP download avg : 2354.08 N/A Mbits/s 350
TCP download sum : 9416.31 N/A Mbits/s 350
TCP download::1 : 2353.65 2352.81 Mbits/s 350
TCP download::2 : 2354.54 2354.21 Mbits/s 350
TCP download::3 : 2353.56 2353.78 Mbits/s 350
TCP download::4 : 2354.56 2354.45 Mbits/s 350
- Summary of tcp_4down run 'MQ + FQ_PIE':
avg median # data pts
Ping (ms) ICMP : 0.20 0.19 ms 350
TCP download avg : 2354.76 N/A Mbits/s 350
TCP download sum : 9419.04 N/A Mbits/s 350
TCP download::1 : 2354.77 2353.89 Mbits/s 350
TCP download::2 : 2353.41 2354.29 Mbits/s 350
TCP download::3 : 2356.18 2354.19 Mbits/s 350
TCP download::4 : 2354.68 2353.15 Mbits/s 350
- Summary of tcp_4down run 'MQ + DUALPI2':
avg median # data pts
Ping (ms) ICMP : 0.24 0.24 ms 350
TCP download avg : 2354.11 N/A Mbits/s 350
TCP download sum : 9416.43 N/A Mbits/s 350
TCP download::1 : 2354.75 2353.93 Mbits/s 350
TCP download::2 : 2353.15 2353.75 Mbits/s 350
TCP download::3 : 2353.49 2353.72 Mbits/s 350
TCP download::4 : 2355.04 2353.73 Mbits/s 350
Unshaped 10gigE with 128 download streams test:
- Summary of tcp_128down run 'MQ + FQ_CODEL':
avg median # data pts
Ping (ms) ICMP : 7.57 8.69 ms 350
TCP download avg : 73.97 N/A Mbits/s 350
TCP download sum : 9467.82 N/A Mbits/s 350
- Summary of tcp_128down run 'MQ + FQ_PIE':
avg median # data pts
Ping (ms) ICMP : 7.82 8.91 ms 350
TCP download avg : 73.97 N/A Mbits/s 350
TCP download sum : 9468.42 N/A Mbits/s 350
- Summary of tcp_128down run 'MQ + DUALPI2':
avg median # data pts
Ping (ms) ICMP : 6.87 7.93 ms 350
TCP download avg : 73.95 N/A Mbits/s 350
TCP download sum : 9465.87 N/A Mbits/s 350
From the results shown above, we see small differences between combinations.
- Update commit message to include results of no_split_gso and split_gso (Dave Taht <dave.taht(a)gmail.com> and Paolo Abeni <pabeni(a)redhat.com>)
- Add memlimit in the dualpi2 attribute, and add memory_used, max_memory_used, memory_limit in dualpi2 stats (Dave Taht <dave.taht(a)gmail.com>)
- Update note in sch_dualpi2.c related to BBRv3 status (Dave Taht <dave.taht(a)gmail.com>)
- Update license identifier (Dave Taht <dave.taht(a)gmail.com>)
- Add selftest in tools/testing/selftests/tc-testing (Cong Wang <xiyou.wangcong(a)gmail.com>)
- Use netlink policies for parameter checks (Jamal Hadi Salim <jhs(a)mojatatu.com>)
- Modify texts & fix typos in Documentation/netlink/specs/tc.yaml (Dave Taht <dave.taht(a)gmail.com>)
- Add descriptions of packet counter statistics and the reset function of sch_dualpi2.c
- Fix step_thresh in packets
- Update code comments in sch_dualpi2.c
v4 (22-Oct-2024)
- Update statement in Kconfig for DualPI2 (Stephen Hemminger <stephen(a)networkplumber.org>)
- Put a blank line after #define in sch_dualpi2.c (Stephen Hemminger <stephen(a)networkplumber.org>)
- Fix line length warning.
v3 (19-Oct-2024)
- Fix compilaiton error
- Update Documentation/netlink/specs/tc.yaml (Jakub Kicinski <kuba(a)kernel.org>)
v2 (18-Oct-2024)
- Add Documentation/netlink/specs/tc.yaml (Jakub Kicinski <kuba(a)kernel.org>)
- Use dualpi2 instead of skb prefix (Jamal Hadi Salim <jhs(a)mojatatu.com>)
- Replace nla_parse_nested_deprecated with nla_parse_nested (Jamal Hadi Salim <jhs(a)mojatatu.com>)
- Fix line length warning
---
Chia-Yu Chang (5):
sched: Struct definition and parsing of dualpi2 qdisc
sched: Dump configuration and statistics of dualpi2 qdisc
selftests/tc-testing: Fix warning and style check on tdc.sh
selftests/tc-testing: Add selftests for qdisc DualPI2
Documentation: netlink: specs: tc: Add DualPI2 specification
Koen De Schepper (1):
sched: Add enqueue/dequeue of dualpi2 qdisc
Documentation/netlink/specs/tc.yaml | 166 +++
include/net/dropreason-core.h | 6 +
include/uapi/linux/pkt_sched.h | 70 +-
net/sched/Kconfig | 12 +
net/sched/Makefile | 1 +
net/sched/sch_dualpi2.c | 1146 +++++++++++++++++
tools/testing/selftests/tc-testing/config | 1 +
.../tc-testing/tc-tests/qdiscs/dualpi2.json | 254 ++++
tools/testing/selftests/tc-testing/tdc.sh | 6 +-
9 files changed, 1658 insertions(+), 4 deletions(-)
create mode 100644 net/sched/sch_dualpi2.c
create mode 100644 tools/testing/selftests/tc-testing/tc-tests/qdiscs/dualpi2.json
--
2.34.1
Changes in v4:
- The first patch from previous versions is now
broken down into multiple smaller patches.
- `#![expect(dead_code)]` is reverted from
`rust/kernel/opp.rs` as it failed on the
kernel test bot (see [1] for more).
Link: https://lore.kernel.org/all/202506291507.M9eg5kic-lkp@intel.com [1]
Onur Özkan (6):
rust: switch to `#[expect(...)]` in core modules
rust: switch to `#[expect(...)]` in init and kunit
drivers: gpu: switch to `#[expect(...)]` in nova-core/regs.rs
rust: switch to `#[expect(...)]` in devres, driver and ioctl
rust: remove `#[allow(clippy::unnecessary_cast)]`
rust: remove `#[allow(clippy::non_send_fields_in_send_ty)]`
drivers/gpu/nova-core/regs.rs | 2 +-
rust/kernel/alloc/allocator_test.rs | 2 +-
rust/kernel/cpufreq.rs | 1 -
rust/kernel/devres.rs | 2 +-
rust/kernel/driver.rs | 2 +-
rust/kernel/drm/ioctl.rs | 8 ++++----
rust/kernel/error.rs | 3 +--
rust/kernel/init.rs | 6 +++---
rust/kernel/kunit.rs | 2 +-
rust/kernel/types.rs | 2 +-
rust/macros/helpers.rs | 2 +-
11 files changed, 15 insertions(+), 17 deletions(-)
--
2.50.0
The protocol device drivers under drivers/platform/chrome/ are responsible
to communicate to the ChromeOS EC (Embedded Controller). They need to pack
the data in a pre-defined format and check if the EC responds accordingly.
The series adds some fundamental unit tests for the protocol. It calls the
.cmd_xfer() and .pkt_xfer() callbacks (which are the most crucial parts for
the protocol), mocks the rest of the system, and checks if the interactions
are all correct.
The series isn't ready for landing. It's more like a PoC for the
binary-level function redirection and its use cases.
The 1st patch adds ftrace stub which is originally from [1][2]. There is no
follow-up discussion about the ftrace stub. As a result, the patch is still
on the mailing list.
The 2nd patch adds Kunit tests for cros_ec_i2c. It relies on the ftrace stub
for redirecting cros_ec_{un,}register().
The 3rd patch uses static stub instead (if ftrace stub isn't really an option).
However, I'm not a big fan to change the production code (i.e. adding the
prologue in cros_ec_{un,}register()) for testing.
The 4th patch adds Kunit tests for cros_ec_spi. It relies on the ftrace stub
for redirecting cros_ec_{un,}register() again.
The 5th patch calls .probe() directly instead of forcing the driver probe
needs to be synchronous. In comparison with the 4th patch, I don't think
this is simpler. I'd prefer to the way in the 4th patch.
After talked to Masami about the work, he suggested to use Kprobes for
function redirection. The 6th patch adds kprobes stub.
The 7th patch uses kprobes stub instead for cros_ec_spi.
Questions:
- Are we going to support ftrace stub so that tests can use it?
- If ftrace stub isn't on the plate (e.g. due to too many dependencies), how
about the kprobes stub? Is it something we could pursue?
- (minor) I'm unsure if people would prefer 'kprobes stub' vs. 'kprobe stub'.
[1]: https://kunit.dev/mocking.html#binary-level-ftrace-et-al
[2]: https://lore.kernel.org/linux-kselftest/20220318021314.3225240-3-davidgow@g…
Daniel Latypov (1):
kunit: expose ftrace-based API for stubbing out functions during tests
Tzung-Bi Shih (6):
platform/chrome: kunit: cros_ec_i2c: Add tests with ftrace stub
platform/chrome: kunit: cros_ec_i2c: Use static stub instead
platform/chrome: kunit: cros_ec_spi: Add tests with ftrace stub
platform/chrome: kunit: cros_ec_spi: Call .probe() directly
kunit: Expose 'kprobes stub' API to redirect functions
platform/chrome: kunit: cros_ec_spi: Use kprobes stub instead
drivers/platform/chrome/Kconfig | 17 +
drivers/platform/chrome/Makefile | 2 +
drivers/platform/chrome/cros_ec.c | 6 +
drivers/platform/chrome/cros_ec_i2c_test.c | 479 +++++++++++++++++++++
drivers/platform/chrome/cros_ec_spi_test.c | 355 +++++++++++++++
include/kunit/ftrace_stub.h | 84 ++++
include/kunit/kprobes_stub.h | 19 +
kernel/trace/ftrace.c | 3 +
lib/kunit/Kconfig | 18 +
lib/kunit/Makefile | 8 +
lib/kunit/ftrace_stub.c | 139 ++++++
lib/kunit/kprobes_stub.c | 113 +++++
lib/kunit/kunit-example-test.c | 27 +-
lib/kunit/stubs_example.kunitconfig | 11 +
14 files changed, 1280 insertions(+), 1 deletion(-)
create mode 100644 drivers/platform/chrome/cros_ec_i2c_test.c
create mode 100644 drivers/platform/chrome/cros_ec_spi_test.c
create mode 100644 include/kunit/ftrace_stub.h
create mode 100644 include/kunit/kprobes_stub.h
create mode 100644 lib/kunit/ftrace_stub.c
create mode 100644 lib/kunit/kprobes_stub.c
create mode 100644 lib/kunit/stubs_example.kunitconfig
--
2.49.0.1101.gccaa498523-goog
This patch set improves the documentation and selftests for XDP Rx metadata
handling. The first patch clarifies the documentation around XDP metadata
layout and the use of bpf_xdp_adjust_meta. The second patch enhances the
BPF selftests to make XDP metadata handling more robust and portable across
different NICs.
Prior to this patch set, the user application retrieved the xdp_meta by
calculating backward from the data pointer, while the XDP program fill in
the xdp_meta by calculating backward from data_meta. This approach will
cause mismatch if there is device-reserved metadata.
|<---sizeof(xdp_meta)--|
| |
struct xdp_meta rx_desc->address
^ ^
| |
+----------+----------------------+------------+------+
| headroom | custom metadata | reserved | data |
+----------+----------------------+------------+------+
^ ^ ^
| | |
struct xdp_meta xdp_buff->data_meta xdp_buff->data
| |
|<---sizeof(xdp_meta)--|
Song Yoong Siang (2):
doc: clarify XDP Rx metadata layout and bpf_xdp_adjust_meta usage
selftests/bpf: Enhance XDP Rx Metadata Handling
Documentation/networking/xdp-rx-metadata.rst | 38 +++++++++++++++++++
.../selftests/bpf/prog_tests/xdp_metadata.c | 2 +-
.../selftests/bpf/progs/xdp_hw_metadata.c | 10 ++++-
.../selftests/bpf/progs/xdp_metadata.c | 8 +++-
tools/testing/selftests/bpf/xdp_hw_metadata.c | 2 +-
tools/testing/selftests/bpf/xdp_metadata.h | 7 ++++
6 files changed, 63 insertions(+), 4 deletions(-)
--
2.34.1
The config snippet specifies CONFIG_NET_EMATCH_IPSET. This option
depends on CONFIG_IP_SET.
Set CONFIG_IP_SET to be enabled at part for tc-testing.
Signed-off-by: Sebastian Andrzej Siewior <bigeasy(a)linutronix.de>
---
tools/testing/selftests/tc-testing/config | 1 +
1 file changed, 1 insertion(+)
diff --git a/tools/testing/selftests/tc-testing/config b/tools/testing/selftests/tc-testing/config
index db176fe7d0c3f..8e902f7f1a181 100644
--- a/tools/testing/selftests/tc-testing/config
+++ b/tools/testing/selftests/tc-testing/config
@@ -21,6 +21,7 @@ CONFIG_NF_NAT=m
CONFIG_NETFILTER_XT_TARGET_LOG=m
CONFIG_NET_SCHED=y
+CONFIG_IP_SET=m
#
# Queueing/Scheduling
--
2.50.0
From: Chia-Yu Chang <chia-yu.chang(a)nokia-bell-labs.com>
Hello,
Please find the v9 AccECN protocol patch series, which covers the core
functionality of Accurate ECN, AccECN negotiation, AccECN TCP options,
and AccECN failure handling. The Accurate ECN draft can be found in
https://datatracker.ietf.org/doc/html/draft-ietf-tcpm-accurate-ecn-28
This patch series is part of the full AccECN patch series, which is available at
https://github.com/L4STeam/linux-net-next/commits/upstream_l4steam/
Best Regards,
Chia-Yu
---
v9 (21-Jun-2025)
- Use tcp_data_ecn_check() to set TCP_ECN_SEE flag only for RFC3168 ECN (Paolo Abeni <pabeni(a)redhat.com>)
- Add comments about setting TCP_ECN_SEEN flag for RFC3168 and Accruate ECN (Paolo Abeni <pabeni(a)redhat.com>)
- Restruct the code in the for loop of tcp_accecn_process_option() (Paolo Abeni <pabeni(a)redhat.com>)
- Remove ecn_bytes and add use_synack_ecn_bytes flag to identify whether syn_ack_bytes or received_ecn_bytes is used (Paolo Abeni <pabeni(a)redhat.com>)
- Replace leftover_bytes and leftover_size with leftover_highbyte and leftover_lowbyte and add comments in tcp_options_write() (Paolo Abeni <pabeni(a)redhat.com>)
- Add comments and commit message about the 1st retx SYN still attempt AccECN negotiation (Paolo Abeni <pabeni(a)redhat.com>)
v8 (10-Jun-2025)
- Add new helper function tcp_ecn_received_counters_payload() in #6 (Paolo Abeni <pabeni(a)redhat.com>)
- Set opts->num_sack_blocks=0 to avoid potential undefined value in #8 (Paolo Abeni <pabeni(a)redhat.com>)
- Reset leftover_size to 2 once leftover_bytes is used in #9 (Paolo Abeni <pabeni(a)redhat.com>)
- Add new helper function tcp_accecn_opt_demand_min() in #10 (Paolo Abeni <pabeni(a)redhat.com>)
- Add new helper function tcp_accecn_saw_opt_fail_recv() in #11 (Paolo Abeni <pabeni(a)redhat.com>)
- Update tcp_options_fit_accecn() to avoid using recursion in #14 (Paolo Abeni <pabeni(a)redhat.com>)
v7 (14-May-2025)
- Modify group sizes of tcp_sock_write_txrx and tcp_sock_write_rx in #3 based on pahole results (Paolo Abeni <pabeni(a)redhat.com>)
- Fix the issue in #4 and #5 where the RFC3168 ECN behavior in tcp_ecn_send() is changed (Paolo Abeni <pabeni(a)redhat.com>)
- Modify group size of tcp_sock_write_txrx in #4 and #6 based on pahole results (Paolo Abeni <pabeni(a)redhat.com>)
- Update commit message for #9 to explain the increase in tcp_sock_write_rx group size
- Modify group size of tcp_sock_write_tx in #10 based on pahole results
v6 (09-May-2025)
- Add #3 to utilize exisintg holes of tcp_sock_write_txrx group for later patches (#4, #9, #10) with new u8 members (Paolo Abeni <pabeni(a)redhat.com>)
- Add pahole outcomes before and after commit in #4, #5, #6, #9, #10, #15 (Paolo Abeni <pabeni(a)redhat.com>)
- Define new helper function tcp_send_ack_reflect_ect() for sending ACK with reflected ECT in #5 (Paolo Abeni <pabeni(a)redhat.com>)
- Add comments for function tcp_ecn_rcv_synack() in #5 (Paolo Abeni <pabeni(a)redhat.com>)
- Add enum/define to be used by sysctl_tcp_ecn in #5, sysctl_tcp_ecn_option in #9, and sysctl_tcp_ecn_option_beacon in #10 (Paolo Abeni <pabeni(a)redhat.com>)
- Move accecn_fail_mode and saw_accecn_opt in #5 and #11 to use exisintg holes of tcp_sock (Paolo Abeni <pabeni(a)redhat.com>)
- Change data type of new members of tcp_request_sock and move them to the end of struct in #5 and #11 (Paolo Abeni <pabeni(a)redhat.com>)
- Move new members of tcp_info to the end of struct in #6 (Paolo Abeni <pabeni(a)redhat.com>)
- Merge previous #7 into #9 (Paolo Abeni <pabeni(a)redhat.com>)
- Mask ecnfield with INET_ECN_MASK to remove WARN_ONCE in #9 (Paolo Abeni <pabeni(a)redhat.com>)
- Reduce the indentation levels for reabability in #9 and #10 (Paolo Abeni <pabeni(a)redhat.com>)
- Move delivered_ecn_bytes to the RX group in #9, accecn_opt_tstamp to the TX group in #10, pkts_acked_ewma to the RX group in #15 (Paolo Abeni <pabeni(a)redhat.com>)
- Add changes in Documentation/networking/net_cachelines/tcp_sock.rst for new tcp_sock members in #3, #5, #6, #9, #10, #15
v5 (22-Apr-2025)
- Further fix for 32-bit ARM alignment in tcp.c (Simon Horman <horms(a)kernel.org>)
v4 (18-Apr-2025)
- Fix 32-bit ARM assertion for alignment requirement (Simon Horman <horms(a)kernel.org>)
v3 (14-Apr-2025)
- Fix patch apply issue in v2 (Jakub Kicinski <kuba(a)kernel.org>)
v2 (18-Mar-2025)
- Add one missing patch from the previous AccECN protocol preparation patch series to this patch series.
---
Chia-Yu Chang (3):
tcp: reorganize tcp_sock_write_txrx group for variables later
tcp: accecn: AccECN option failure handling
tcp: accecn: try to fit AccECN option with SACK
Ilpo Järvinen (12):
tcp: reorganize SYN ECN code
tcp: fast path functions later
tcp: AccECN core
tcp: accecn: AccECN negotiation
tcp: accecn: add AccECN rx byte counters
tcp: accecn: AccECN needs to know delivered bytes
tcp: sack option handling improvements
tcp: accecn: AccECN option
tcp: accecn: AccECN option send control
tcp: accecn: AccECN option ceb/cep heuristic
tcp: accecn: AccECN ACE field multi-wrap heuristic
tcp: try to avoid safer when ACKs are thinned
.../networking/net_cachelines/tcp_sock.rst | 14 +
include/linux/tcp.h | 34 +-
include/net/netns/ipv4.h | 2 +
include/net/tcp.h | 225 ++++++-
include/uapi/linux/tcp.h | 7 +
net/ipv4/syncookies.c | 3 +
net/ipv4/sysctl_net_ipv4.c | 19 +
net/ipv4/tcp.c | 30 +-
net/ipv4/tcp_input.c | 622 +++++++++++++++++-
net/ipv4/tcp_ipv4.c | 7 +-
net/ipv4/tcp_minisocks.c | 91 ++-
net/ipv4/tcp_output.c | 325 ++++++++-
net/ipv6/syncookies.c | 1 +
net/ipv6/tcp_ipv6.c | 1 +
14 files changed, 1285 insertions(+), 96 deletions(-)
--
2.34.1
I am submitting a new selftest for the netpoll subsystem specifically
targeting the case where the RX is polling in the TX path, which is
a case that we don't have any test in the tree today. This is done when
netpoll_poll_dev() called, and this test creates a scenario when that is
probably.
The test does the following:
1) Configuring a single RX/TX queue to increase contention on the
interface.
2) Generating background traffic to saturate the network, mimicking
real-world congestion.
3) Sending netconsole messages to trigger netpoll polling and monitor
its behavior.
4) Using dynamic netconsole targets via configfs, with the ability to
delete and recreate targets during the test.
5) Running bpftrace in parallel to verify that netpoll_poll_dev() is
called when expected. If it is called, then the test passes,
otherwise the test is marked as skipped.
In order to achieve it, I stole Jakub's bpftrace helper from [1], and
did some small changes that I found useful to use the helper.
So, this patchset basically contains:
1) The code stolen from Jakub
2) Improvements on bpftrace() helper
3) The selftest itself
Link: https://lore.kernel.org/all/20250421222827.283737-22-kuba@kernel.org/ [1]
---
Changes in v3:
- Make pylint happy (Simon)
- Remove the unnecessary patch in bpftrace to raise an exception when it
fails. (Jakub)
- Improved the bpftrace code (Willem)
- Stop sending messages if bpftrace is not alive anymore.
- Link to v2: https://lore.kernel.org/r/20250625-netpoll_test-v2-0-47d27775222c@debian.org
Changes in v2:
- Stole Jakub's helper to run bpftrace
- Removed the DEBUG option and moved logs to logging
- Change the code to have a higher chance of calling netpoll_poll_dev().
In my current configuration, it is hitting multiple times during the
test.
- Save and restore TX/RX queue size (Jakub)
- Link to v1: https://lore.kernel.org/r/20250620-netpoll_test-v1-1-5068832f72fc@debian.org
---
Breno Leitao (2):
selftests: drv-net: Strip '@' prefix from bpftrace map keys
selftests: net: add netpoll basic functionality test
Jakub Kicinski (1):
selftests: drv-net: add helper/wrapper for bpftrace
tools/testing/selftests/drivers/net/Makefile | 1 +
.../selftests/drivers/net/lib/py/__init__.py | 3 +-
.../testing/selftests/drivers/net/netpoll_basic.py | 345 +++++++++++++++++++++
tools/testing/selftests/net/lib/py/utils.py | 35 +++
4 files changed, 383 insertions(+), 1 deletion(-)
---
base-commit: 8efa26fcbf8a7f783fd1ce7dd2a409e9b7758df0
change-id: 20250612-netpoll_test-a1324d2057c8
Best regards,
--
Breno Leitao <leitao(a)debian.org>
This series creates a new PMU scheme on ARM, a partitioned PMU that
allows reserving a subset of counters for more direct guest access,
significantly reducing overhead. More details, including performance
benchmarks, can be read in the v1 cover letter linked below.
v3:
* Return to using one kernel command line parameter for reserved host
counters, but set the default value meaning no partitioning to -1 so
0 isn't ambiguous. Adjust checks for partitioning accordingly.
* Move the function kvm_pmu_partition_supported() back out of the
driver to avoid subbing has_vhe in 32-bit ARM code.
* Fix the kernel test robot build failures, one with KVM and no PMU,
and one with PMU and no KVM.
* Scan entire series to remove vestigial changes and update comments.
v2:
https://lore.kernel.org/kvm/20250620221326.1261128-1-coltonlewis@google.com/
v1:
https://lore.kernel.org/kvm/20250602192702.2125115-1-coltonlewis@google.com/
Colton Lewis (21):
arm64: cpufeature: Add cpucap for HPMN0
arm64: Generate sign macro for sysreg Enums
KVM: arm64: Define PMI{CNTR,FILTR}_EL0 as undef_access
KVM: arm64: Reorganize PMU functions
perf: arm_pmuv3: Introduce method to partition the PMU
perf: arm_pmuv3: Generalize counter bitmasks
perf: arm_pmuv3: Keep out of guest counter partition
KVM: arm64: Correct kvm_arm_pmu_get_max_counters()
KVM: arm64: Set up FGT for Partitioned PMU
KVM: arm64: Writethrough trapped PMEVTYPER register
KVM: arm64: Use physical PMSELR for PMXEVTYPER if partitioned
KVM: arm64: Writethrough trapped PMOVS register
KVM: arm64: Write fast path PMU register handlers
KVM: arm64: Setup MDCR_EL2 to handle a partitioned PMU
KVM: arm64: Account for partitioning in PMCR_EL0 access
KVM: arm64: Context swap Partitioned PMU guest registers
KVM: arm64: Enforce PMU event filter at vcpu_load()
perf: arm_pmuv3: Handle IRQs for Partitioned PMU guest counters
KVM: arm64: Inject recorded guest interrupts
KVM: arm64: Add ioctl to partition the PMU when supported
KVM: arm64: selftests: Add test case for partitioned PMU
Marc Zyngier (1):
KVM: arm64: Cleanup PMU includes
Documentation/virt/kvm/api.rst | 21 +
arch/arm/include/asm/arm_pmuv3.h | 38 +
arch/arm64/include/asm/arm_pmuv3.h | 61 +-
arch/arm64/include/asm/kvm_host.h | 18 +-
arch/arm64/include/asm/kvm_pmu.h | 100 +++
arch/arm64/kernel/cpufeature.c | 8 +
arch/arm64/kvm/Makefile | 2 +-
arch/arm64/kvm/arm.c | 22 +
arch/arm64/kvm/debug.c | 24 +-
arch/arm64/kvm/hyp/include/hyp/switch.h | 233 ++++++
arch/arm64/kvm/pmu-emul.c | 674 +----------------
arch/arm64/kvm/pmu-part.c | 378 ++++++++++
arch/arm64/kvm/pmu.c | 702 ++++++++++++++++++
arch/arm64/kvm/sys_regs.c | 79 +-
arch/arm64/tools/cpucaps | 1 +
arch/arm64/tools/gen-sysreg.awk | 1 +
arch/arm64/tools/sysreg | 6 +-
drivers/perf/arm_pmuv3.c | 123 ++-
include/linux/perf/arm_pmu.h | 6 +
include/linux/perf/arm_pmuv3.h | 14 +-
include/uapi/linux/kvm.h | 4 +
tools/include/uapi/linux/kvm.h | 2 +
.../selftests/kvm/arm64/vpmu_counter_access.c | 62 +-
virt/kvm/kvm_main.c | 1 +
24 files changed, 1828 insertions(+), 752 deletions(-)
create mode 100644 arch/arm64/kvm/pmu-part.c
base-commit: 79150772457f4d45e38b842d786240c36bb1f97f
--
2.50.0.727.gbf7dc18ff4-goog
Fix a typo:
instaces -> instances
The typo has been identified using codespell, and the tool does not
report any additional issues in the selftests considered.
Signed-off-by: Andrea Mayer <andrea.mayer(a)uniroma2.it>
---
tools/testing/selftests/net/srv6_end_next_csid_l3vpn_test.sh | 2 +-
tools/testing/selftests/net/srv6_end_x_next_csid_l3vpn_test.sh | 2 +-
tools/testing/selftests/net/srv6_hencap_red_l3vpn_test.sh | 2 +-
tools/testing/selftests/net/srv6_hl2encap_red_l2vpn_test.sh | 2 +-
4 files changed, 4 insertions(+), 4 deletions(-)
diff --git a/tools/testing/selftests/net/srv6_end_next_csid_l3vpn_test.sh b/tools/testing/selftests/net/srv6_end_next_csid_l3vpn_test.sh
index ba730655a7bf..4bc135e5c22c 100755
--- a/tools/testing/selftests/net/srv6_end_next_csid_l3vpn_test.sh
+++ b/tools/testing/selftests/net/srv6_end_next_csid_l3vpn_test.sh
@@ -594,7 +594,7 @@ setup_rt_local_sids()
dev "${DUMMY_DEVNAME}"
# all SIDs for VPNs start with a common locator. Routes and SRv6
- # Endpoint behavior instaces are grouped together in the 'localsid'
+ # Endpoint behavior instances are grouped together in the 'localsid'
# table.
ip -netns "${nsname}" -6 rule \
add to "${VPN_LOCATOR_SERVICE}::/16" \
diff --git a/tools/testing/selftests/net/srv6_end_x_next_csid_l3vpn_test.sh b/tools/testing/selftests/net/srv6_end_x_next_csid_l3vpn_test.sh
index bedf0ce885c2..34b781a2ae74 100755
--- a/tools/testing/selftests/net/srv6_end_x_next_csid_l3vpn_test.sh
+++ b/tools/testing/selftests/net/srv6_end_x_next_csid_l3vpn_test.sh
@@ -681,7 +681,7 @@ setup_rt_local_sids()
set_underlay_sids_reachability "${rt}" "${rt_neighs}"
# all SIDs for VPNs start with a common locator. Routes and SRv6
- # Endpoint behavior instaces are grouped together in the 'localsid'
+ # Endpoint behavior instances are grouped together in the 'localsid'
# table.
ip -netns "${nsname}" -6 rule \
add to "${VPN_LOCATOR_SERVICE}::/16" \
diff --git a/tools/testing/selftests/net/srv6_hencap_red_l3vpn_test.sh b/tools/testing/selftests/net/srv6_hencap_red_l3vpn_test.sh
index 3efce1718c5f..6a68c7eff1dc 100755
--- a/tools/testing/selftests/net/srv6_hencap_red_l3vpn_test.sh
+++ b/tools/testing/selftests/net/srv6_hencap_red_l3vpn_test.sh
@@ -395,7 +395,7 @@ setup_rt_local_sids()
dev "${VRF_DEVNAME}"
# all SIDs for VPNs start with a common locator. Routes and SRv6
- # Endpoint behavior instaces are grouped together in the 'localsid'
+ # Endpoint behavior instances are grouped together in the 'localsid'
# table.
ip -netns "${nsname}" -6 rule \
add to "${VPN_LOCATOR_SERVICE}::/16" \
diff --git a/tools/testing/selftests/net/srv6_hl2encap_red_l2vpn_test.sh b/tools/testing/selftests/net/srv6_hl2encap_red_l2vpn_test.sh
index cabc70538ffe..0979b5316fdf 100755
--- a/tools/testing/selftests/net/srv6_hl2encap_red_l2vpn_test.sh
+++ b/tools/testing/selftests/net/srv6_hl2encap_red_l2vpn_test.sh
@@ -343,7 +343,7 @@ setup_rt_local_sids()
encap seg6local action End dev "${DUMMY_DEVNAME}"
# all SIDs for VPNs start with a common locator. Routes and SRv6
- # Endpoint behaviors instaces are grouped together in the 'localsid'
+ # Endpoint behaviors instances are grouped together in the 'localsid'
# table.
ip -netns "${nsname}" -6 rule add \
to "${VPN_LOCATOR_SERVICE}::/16" \
--
2.20.1
When compiling the RTC library functions test as a module, the module
has the non-descriptive name "lib_test.ko". Fix this by adding the
subsystem's name as a prefix.
Signed-off-by: Geert Uytterhoeven <geert(a)linux-m68k.org>
---
drivers/rtc/Makefile | 2 +-
drivers/rtc/{lib_test.c => rtc_lib_test.c} | 0
2 files changed, 1 insertion(+), 1 deletion(-)
rename drivers/rtc/{lib_test.c => rtc_lib_test.c} (100%)
diff --git a/drivers/rtc/Makefile b/drivers/rtc/Makefile
index 4619aa2ac4697591..aa08d1c8a32ec23d 100644
--- a/drivers/rtc/Makefile
+++ b/drivers/rtc/Makefile
@@ -15,7 +15,7 @@ rtc-core-$(CONFIG_RTC_INTF_DEV) += dev.o
rtc-core-$(CONFIG_RTC_INTF_PROC) += proc.o
rtc-core-$(CONFIG_RTC_INTF_SYSFS) += sysfs.o
-obj-$(CONFIG_RTC_LIB_KUNIT_TEST) += lib_test.o
+obj-$(CONFIG_RTC_LIB_KUNIT_TEST) += rtc_lib_test.o
# Keep the list ordered.
diff --git a/drivers/rtc/lib_test.c b/drivers/rtc/rtc_lib_test.c
similarity index 100%
rename from drivers/rtc/lib_test.c
rename to drivers/rtc/rtc_lib_test.c
--
2.43.0
This series introduces VFIO selftests, located in
tools/testing/selftests/vfio/.
VFIO selftests aim to enable kernel developers to write and run tests
that take the form of userspace programs that interact with VFIO and
IOMMUFD uAPIs. VFIO selftests can be used to write functional tests for
new features, regression tests for bugs, and performance tests for
optimizations.
These tests are designed to interact with real PCI devices, i.e. they do
not rely on mocking out or faking any behavior in the kernel. This
allows the tests to exercise not only VFIO but also IOMMUFD, the IOMMU
driver, interrupt remapping, IRQ handling, etc.
We chose selftests to host these tests primarily to enable integration
with the existing KVM selftests. As explained in the next section,
enabling KVM developers to test the interaction between VFIO and KVM is
one of the motivators of this series.
Motivation
-----------------------------------------------------------------------
The main motivation for this series is upcoming development in the
kernel to support Hypervisor Live Updates [1][2]. Live Update is a
specialized reboot process where selected devices are kept operational
and their kernel state is preserved and recreated across a kexec. For
devices, DMA and interrupts may continue during the reboot. VFIO-bound
devices are the main target, since the first usecase of Live Updates is
to enable host kernel upgrades in a Cloud Computing environment without
disrupting running customer VMs.
To prepare for upcoming support for Live Updates in VFIO, IOMMUFD, IOMMU
drivers, the PCI layer, etc., we'd like to first lay the ground work for
exercising and testing VFIO from kernel selftests. This way when we
eventually upstream support for Live Updates, we can also upstream tests
for those changes, rather than purely relying on Live Update integration
tests which would be hard to share and reproduce upstream.
But even without Live Updates, VFIO and IOMMUFD are becoming an
increasingly critical component of running KVM-based VMs in cloud
environments. Virtualized networking and storage are increasingly being
offloaded to smart NICs/cards, and demand for high performance
networking, storage, and AI are also leading to NICs, SSDs, and GPUs
being directly attached to VMs via VFIO.
VFIO selftests increases our ability to test in several ways.
- It enables developers sending VFIO, IOMMUFD, etc. commits upstream to
test their changes against all existing VFIO selftests, reducing the
probability of regressions.
- It enables developers sending VFIO, IOMMUFD, etc. commits upstream to
include tests alongside their changes, increasing the quality of the
code that is merged.
- It enables testing the interaction between VFIO and KVM. There are
some paths in KVM that are only exercised through VFIO, such as IRQ
bypass. VFIO selftests provides a helper library to enable KVM
developers to write KVM selftests to test those interactions [3].
Design
-----------------------------------------------------------------------
VFIO selftests are designed around interacting with with VFIO-managed PCI
devices. As such, the core data struture is struct vfio_pci_device, which
represents a single PCI device.
struct vfio_pci_device *device;
device = vfio_pci_device_init("0000:6a:01.0", iommu_mode);
...
vfio_pci_device_cleanup(device);
vfio_pci_device_init() sets up a container or iommufd, depending on the
iommu_mode argument, to manage DMA mappings, fetches information about
the device and what interrupts it supports from VFIO and caches it, and
mmap()s all mappable BARs for the test to use.
There are helper methods that operate on struct vfio_pci_device to do
things like read and write to PCI config space, enable/disable IRQs, and
map memory for DMA,
struct vfio_pci_device and its methods do not care about what device
they are actually interacting with. It can be a GPU, a NIC, an SSD, etc.
To keep things simple initially, VFIO selftests only support a single
device per group and per container/iommufd. But it should be possible to
relax those restrictions in the future, e.g. to enable testing with
multiple devices in the same container/iommufd.
Driver Framework
-----------------------------------------------------------------------
In order to support VFIO selftests where a device is generating DMA and
interrupts on command, the VFIO selftests supports a driver framework.
This framework abstracts away device-specific details allowing VFIO
selftests to be written in a generic way, and then run against different
devices depending on what hardware developers have access to.
The framework also aims to support carrying drivers out-of-tree, e.g.
so that companies can run VFIO selftests with custom/test hardware.
Drivers must implement the following methods:
- probe(): Check if the driver supports a given device.
- init(): Initialize the driver.
- remove(): Deinitialize the driver and reset the device.
- memcpy_start(): Kick off a series of repeated memcpys (DMA reads and
DMA writes).
- memcpy_wait(): Wait for a memcpy operation to complete.
- send_msi(): Make the device send an MSI interrupt.
memcpy_start/wait() are for generating DMA. We separate the operation
into 2 steps so that tests can trigger a long-running DMA operation. We
expect to use this to stress test Live Updates by kicking off a
long-running mempcy operation and then performing a Live Update. These
methods are required to not generate any interrupts.
send_msi() is used for testing MSI and MSI-x interrupts. The driver
tells the test which MSI it will be using via device->driver.msi.
It's the responsibility of the test to set up a region of memory
and map it into the device for use by the driver, e.g. for in-memory
descriptors, before calling init().
A demo of the driver framework can be found in
tools/testing/selftests/vfio/vfio_pci_driver_test.c.
In addition, this series introduces a new KVM selftest to demonstrate
delivering a device MSI directly into a guest, which can be found in
tools/testing/selftests/kvm/vfio_pci_device_irq_test.c.
Tests
-----------------------------------------------------------------------
There are 5 tests in this series, mostly to demonstrate as a
proof-of-concept:
- tools/testing/selftests/vfio/vfio_pci_device_test.c
- tools/testing/selftests/vfio/vfio_pci_driver_test.c
- tools/testing/selftests/vfio/vfio_iommufd_setup_test.c
- tools/testing/selftests/vfio/vfio_dma_mapping_test.c
- tools/testing/selftests/kvm/vfio_pci_device_irq_test.c
Integrating with KVM selftests
-----------------------------------------------------------------------
To support testing the interactions between VFIO and KVM, the VFIO
selftests support sharing its library with the KVM selftest. The patches
at the end of this series demonstrate how that works.
Essentially, we allow the KVM selftests to build their own copy of
tools/testing/selftests/vfio/lib/ and link it into KVM selftests
binaries. This requires minimal changes to the KVM selftests Makefile.
Future Areas of Development
-----------------------------------------------------------------------
Library:
- Driver support for devices that can be used on AMD, ARM, and other
platforms.
- Driver support for a device available in QEMU VMs.
- Support for tests that use multiple devices.
- Support for IOMMU groups with multiple devices.
- Support for multiple devices sharing the same container/iommufd.
- Sharing TEST_ASSERT() macros and other common code between KVM
and VFIO selftests.
Tests:
- DMA mapping performance tests for BARs/HugeTLB/etc.
- Live Update selftests.
- Porting Sean's KVM selftest for posted interrupts to use the VFIO
selftests library [3]
This series can also be found on GitHub:
https://github.com/dmatlack/linux/tree/vfio/selftests/rfc
Cc: Alex Williamson <alex.williamson(a)redhat.com>
Cc: Jason Gunthorpe <jgg(a)nvidia.com>
Cc: Kevin Tian <kevin.tian(a)intel.com>
Cc: Paolo Bonzini <pbonzini(a)redhat.com>
Cc: Sean Christopherson <seanjc(a)google.com>
Cc: Vipin Sharma <vipinsh(a)google.com>
Cc: Josh Hilke <jrhilke(a)google.com>
Cc: Pasha Tatashin <pasha.tatashin(a)soleen.com>
Cc: Saeed Mahameed <saeedm(a)nvidia.com>
Cc: Saeed Mahameed <saeedm(a)nvidia.com>
Cc: Adithya Jayachandran <ajayachandra(a)nvidia.com>
Cc: Parav Pandit <parav(a)nvidia.com>
Cc: Leon Romanovsky <leonro(a)nvidia.com>
Cc: Vinicius Costa Gomes <vinicius.gomes(a)intel.com>
Cc: Dave Jiang <dave.jiang(a)intel.com>
Cc: Dan Williams <dan.j.williams(a)intel.com>
[1] https://lore.kernel.org/all/f35359d5-63e1-8390-619f-67961443bfe1@google.com/
[2] https://lore.kernel.org/all/20250515182322.117840-1-pasha.tatashin@soleen.c…
[3] https://lore.kernel.org/kvm/20250404193923.1413163-68-seanjc@google.com/
David Matlack (28):
selftests: Create tools/testing/selftests/vfio
vfio: selftests: Add a helper library for VFIO selftests
vfio: selftests: Introduce vfio_pci_device_test
tools headers: Add stub definition for __iomem
tools headers: Import asm-generic MMIO helpers
tools headers: Import x86 MMIO helper overrides
tools headers: Import iosubmit_cmds512()
tools headers: Import drivers/dma/ioat/{hw.h,registers.h}
tools headers: Import drivers/dma/idxd/registers.h
tools headers: Import linux/pci_ids.h
vfio: selftests: Keep track of DMA regions mapped into the device
vfio: selftests: Enable asserting MSI eventfds not firing
vfio: selftests: Add a helper for matching vendor+device IDs
vfio: selftests: Add driver framework
vfio: sefltests: Add vfio_pci_driver_test
vfio: selftests: Add driver for Intel CBDMA
vfio: selftests: Add driver for Intel DSA
vfio: selftests: Move helper to get cdev path to libvfio
vfio: selftests: Encapsulate IOMMU mode
vfio: selftests: Add [-i iommu_mode] option to all tests
vfio: selftests: Add vfio_type1v2_mode
vfio: selftests: Add iommufd_compat_type1{,v2} modes
vfio: selftests: Add iommufd mode
vfio: selftests: Make iommufd the default iommu_mode
vfio: selftests: Add a script to help with running VFIO selftests
KVM: selftests: Build and link sefltests/vfio/lib into KVM selftests
KVM: selftests: Test sending a vfio-pci device IRQ to a VM
KVM: selftests: Use real device MSIs in vfio_pci_device_irq_test
Josh Hilke (5):
vfio: selftests: Test basic VFIO and IOMMUFD integration
vfio: selftests: Move vfio dma mapping test to their own file
vfio: selftests: Add test to reset vfio device.
vfio: selftests: Use command line to set hugepage size for DMA mapping
test
vfio: selftests: Validate 2M/1G HugeTLB are mapped as 2M/1G in IOMMU
MAINTAINERS | 7 +
tools/arch/x86/include/asm/io.h | 101 +
tools/arch/x86/include/asm/special_insns.h | 27 +
tools/include/asm-generic/io.h | 482 +++
tools/include/asm/io.h | 11 +
tools/include/drivers/dma/idxd/registers.h | 601 +++
tools/include/drivers/dma/ioat/hw.h | 270 ++
tools/include/drivers/dma/ioat/registers.h | 251 ++
tools/include/linux/compiler.h | 4 +
tools/include/linux/io.h | 4 +-
tools/include/linux/pci_ids.h | 3212 +++++++++++++++++
tools/testing/selftests/Makefile | 1 +
tools/testing/selftests/kvm/Makefile.kvm | 6 +-
.../testing/selftests/kvm/include/kvm_util.h | 4 +
tools/testing/selftests/kvm/lib/kvm_util.c | 21 +
.../selftests/kvm/vfio_pci_device_irq_test.c | 173 +
tools/testing/selftests/vfio/.gitignore | 7 +
tools/testing/selftests/vfio/Makefile | 20 +
.../testing/selftests/vfio/lib/drivers/dsa.c | 416 +++
.../testing/selftests/vfio/lib/drivers/ioat.c | 235 ++
.../selftests/vfio/lib/include/vfio_util.h | 271 ++
tools/testing/selftests/vfio/lib/libvfio.mk | 26 +
.../selftests/vfio/lib/vfio_pci_device.c | 573 +++
.../selftests/vfio/lib/vfio_pci_driver.c | 126 +
tools/testing/selftests/vfio/run.sh | 110 +
.../selftests/vfio/vfio_dma_mapping_test.c | 239 ++
.../selftests/vfio/vfio_iommufd_setup_test.c | 133 +
.../selftests/vfio/vfio_pci_device_test.c | 195 +
.../selftests/vfio/vfio_pci_driver_test.c | 256 ++
29 files changed, 7780 insertions(+), 2 deletions(-)
create mode 100644 tools/arch/x86/include/asm/io.h
create mode 100644 tools/arch/x86/include/asm/special_insns.h
create mode 100644 tools/include/asm-generic/io.h
create mode 100644 tools/include/asm/io.h
create mode 100644 tools/include/drivers/dma/idxd/registers.h
create mode 100644 tools/include/drivers/dma/ioat/hw.h
create mode 100644 tools/include/drivers/dma/ioat/registers.h
create mode 100644 tools/include/linux/pci_ids.h
create mode 100644 tools/testing/selftests/kvm/vfio_pci_device_irq_test.c
create mode 100644 tools/testing/selftests/vfio/.gitignore
create mode 100644 tools/testing/selftests/vfio/Makefile
create mode 100644 tools/testing/selftests/vfio/lib/drivers/dsa.c
create mode 100644 tools/testing/selftests/vfio/lib/drivers/ioat.c
create mode 100644 tools/testing/selftests/vfio/lib/include/vfio_util.h
create mode 100644 tools/testing/selftests/vfio/lib/libvfio.mk
create mode 100644 tools/testing/selftests/vfio/lib/vfio_pci_device.c
create mode 100644 tools/testing/selftests/vfio/lib/vfio_pci_driver.c
create mode 100755 tools/testing/selftests/vfio/run.sh
create mode 100644 tools/testing/selftests/vfio/vfio_dma_mapping_test.c
create mode 100644 tools/testing/selftests/vfio/vfio_iommufd_setup_test.c
create mode 100644 tools/testing/selftests/vfio/vfio_pci_device_test.c
create mode 100644 tools/testing/selftests/vfio/vfio_pci_driver_test.c
base-commit: a11a72229881d8ac1d52ea727101bc9c744189c1
prerequisite-patch-id: 3bae97c9e1093148763235f47a84fa040b512d04
--
2.49.0.1151.ga128411c76-goog
This patchset refactors non-composite global variables into a common
struct that can be initialized and passed around per-test instead of
relying on the presence of global variables.
This allows:
- Better encapsulation
- Debugging becomes easier -- local variable state can be viewed per
stack frame, and we can more easily reason about the variable
mutations
Patch 1 needs to be applied first and can be followed by any of the
other patches.
I've ensured that the tests are passing locally (or atleast have the
same output as the code on master).
Ujwal Kundur (4):
selftests/mm/uffd: Refactor non-composite global vars into struct
selftests/mm/uffd: Swap global vars with global test options
selftests/mm/uffd: Swap global variables with global test opts
selftests/mm/uffd: Swap global variables with global test opts
tools/testing/selftests/mm/uffd-common.c | 269 +++++-----
tools/testing/selftests/mm/uffd-common.h | 78 +--
tools/testing/selftests/mm/uffd-stress.c | 226 ++++----
tools/testing/selftests/mm/uffd-unit-tests.c | 523 ++++++++++---------
tools/testing/selftests/mm/uffd-wp-mremap.c | 23 +-
5 files changed, 591 insertions(+), 528 deletions(-)
--
2.20.1
FEAT_LSFE is optional from v9.5, it adds new instructions for atomic
memory operations with floating point values. We have no immediate use
for it in kernel, provide a hwcap so userspace can discover it and allow
the ID register field to be exposed to KVM guests.
Signed-off-by: Mark Brown <broonie(a)kernel.org>
---
Mark Brown (3):
arm64/hwcap: Add hwcap for FEAT_LSFE
KVM: arm64: Expose FEAT_LSFE to guests
kselftest/arm64: Add lsfe to the hwcaps test
Documentation/arch/arm64/elf_hwcaps.rst | 4 ++++
arch/arm64/include/asm/hwcap.h | 1 +
arch/arm64/include/uapi/asm/hwcap.h | 1 +
arch/arm64/kernel/cpufeature.c | 2 ++
arch/arm64/kernel/cpuinfo.c | 1 +
arch/arm64/kvm/sys_regs.c | 4 +++-
tools/testing/selftests/arm64/abi/hwcap.c | 21 +++++++++++++++++++++
7 files changed, 33 insertions(+), 1 deletion(-)
---
base-commit: 86731a2a651e58953fc949573895f2fa6d456841
change-id: 20250625-arm64-lsfe-0810cf98adc2
Best regards,
--
Mark Brown <broonie(a)kernel.org>
This series introduces VFIO selftests, located in
tools/testing/selftests/vfio/.
VFIO selftests aim to enable kernel developers to write and run tests
that take the form of userspace programs that interact with VFIO and
IOMMUFD uAPIs. VFIO selftests can be used to write functional tests for
new features, regression tests for bugs, and performance tests for
optimizations.
These tests are designed to interact with real PCI devices, i.e. they do
not rely on mocking out or faking any behavior in the kernel. This
allows the tests to exercise not only VFIO but also IOMMUFD, the IOMMU
driver, interrupt remapping, IRQ handling, etc.
For more background on the motivation and design of this series, please
see the RFC:
https://lore.kernel.org/kvm/20250523233018.1702151-1-dmatlack@google.com/
This series can also be found on GitHub:
https://github.com/dmatlack/linux/tree/vfio/selftests/v1
Changelog
-----------------------------------------------------------------------
RFC: https://lore.kernel.org/kvm/20250523233018.1702151-1-dmatlack@google.com/
- Add symlink to linux/pci_ids.h instead of copying (Jason)
- Add symlinks to drivers/dma/*/*.h instead of copying (Jason)
- Automatically replicate vfio_dma_mapping_test across backing
sources using fixture variants (Jason)
- Automatically replicate vfio_dma_mapping_test and
vfio_pci_driver_test across all iommu_modes using fixture
variants (Jason)
- Invert access() check in vfio_dma_mapping_test (me)
- Use driver_override instead of add/remove_id (Alex)
- Allow tests to get BDF from env var (Alex)
- Use KSFT_FAIL instead of 1 to exit with failure (Alex)
- Unconditionally create $(LIBVFIO_O_DIRS) to avoid target
conflict with ../cgroup/lib/libcgroup.mk when building
KVM selftests (me)
- Allow VFIO selftests to run automatically by switching from
TEST_GEN_PROGS_EXTENDED to TEST_GEN_PROGS. Automatically run
selftests will use $VFIO_SELFTESTS_BDF environment variable
to know which device to use (Alex)
- Replace hardcoded SZ_4K with getpagesize() in vfio_dma_mapping_test
to support platforms with other page sizes (me)
- Make all global variables static where possible (me)
- Pass argc and argv to test_harness_main() so that users can
pass flags to the kselftest harness (me)
Instructions
-----------------------------------------------------------------------
Running VFIO selftests requires at a PCI device bound to vfio-pci for
the tests to use. The address of this device is passed to the test as
a segment:bus:device.function string, which must match the path to
the device in /sys/bus/pci/devices/ (e.g. 0000:00:04.0).
Once you have chosen a device, there is a helper script provided to
unbind the device from its current driver, bind it to vfio-pci, export
the environment variable $VFIO_SELFTESTS_BDF, and launch a shell:
$ tools/testing/selftests/vfio/run.sh -d 0000:00:04.0 -s
The -d option tells the script which device to use and the -s option
tells the script to launch a shell.
Additionally, the VFIO selftest vfio_dma_mapping_test has test cases
that rely on HugeTLB pages being available, otherwise they are skipped.
To enable those tests make sure at least 1 2MB and 1 1GB HugeTLB pages
are available.
$ echo 1 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
$ echo 1 > /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
To run all VFIO selftests using make:
$ make -C tools/testing/selftests/vfio run_tests
To run individual tests:
$ tools/testing/selftests/vfio/vfio_dma_mapping_test
$ tools/testing/selftests/vfio/vfio_dma_mapping_test -v iommufd_anonymous_hugetlb_2mb
$ tools/testing/selftests/vfio/vfio_dma_mapping_test -r vfio_dma_mapping_test.iommufd_anonymous_hugetlb_2mb.dma_map_unmap
The environment variable $VFIO_SELFTESTS_BDF can be overridden for a
specific test by passing in the BDF on the command line as the last
positional argument.
$ tools/testing/selftests/vfio/vfio_dma_mapping_test 0000:00:04.0
$ tools/testing/selftests/vfio/vfio_dma_mapping_test -v iommufd_anonymous_hugetlb_2mb 0000:00:04.0
$ tools/testing/selftests/vfio/vfio_dma_mapping_test -r vfio_dma_mapping_test.iommufd_anonymous_hugetlb_2mb.dma_map_unmap 0000:00:04.0
When you are done, free the HugeTLB pages and exit the shell started by
run.sh. Exiting the shell will cause the device to be unbound from
vfio-pci and bound back to its original driver.
$ echo 0 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
$ echo 0 > /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
$ exit
It's also possible to use run.sh to run just a single test hermetically,
rather than dropping into a shell:
$ tools/testing/selftests/vfio/run.sh -d 0000:00:04.0 -- tools/testing/selftests/vfio/vfio_dma_mapping_test -v iommufd_anonymous
Tests
-----------------------------------------------------------------------
There are 5 tests in this series, mostly to demonstrate as a
proof-of-concept:
- tools/testing/selftests/vfio/vfio_pci_device_test.c
- tools/testing/selftests/vfio/vfio_pci_driver_test.c
- tools/testing/selftests/vfio/vfio_iommufd_setup_test.c
- tools/testing/selftests/vfio/vfio_dma_mapping_test.c
- tools/testing/selftests/kvm/vfio_pci_device_irq_test.c
Future Areas of Development
-----------------------------------------------------------------------
Library:
- Driver support for devices that can be used on AMD, ARM, and other
platforms (e.g. mlx5).
- Driver support for a device available in QEMU VMs (e.g.
pcie-ats-testdev [1])
- Support for tests that use multiple devices.
- Support for IOMMU groups with multiple devices.
- Support for multiple devices sharing the same container/iommufd.
- Sharing TEST_ASSERT() macros and other common code between KVM
and VFIO selftests.
Tests:
- DMA mapping performance tests for BARs/HugeTLB/etc.
- Porting tests from
https://github.com/awilliam/tests/commits/for-clg/ to selftests.
- Live Update selftests.
- Porting Sean's KVM selftest for posted interrupts to use the VFIO
selftests library [2]
Cc: Alex Williamson <alex.williamson(a)redhat.com>
Cc: Jason Gunthorpe <jgg(a)nvidia.com>
Cc: Kevin Tian <kevin.tian(a)intel.com>
Cc: Paolo Bonzini <pbonzini(a)redhat.com>
Cc: Sean Christopherson <seanjc(a)google.com>
Cc: Vipin Sharma <vipinsh(a)google.com>
Cc: Josh Hilke <jrhilke(a)google.com>
Cc: Aaron Lewis <aaronlewis(a)google.com>
Cc: Pasha Tatashin <pasha.tatashin(a)soleen.com>
Cc: Saeed Mahameed <saeedm(a)nvidia.com>
Cc: Adithya Jayachandran <ajayachandra(a)nvidia.com>
Cc: Joel Granados <joel.granados(a)kernel.org>
[1] https://github.com/Joelgranados/qemu/blob/pcie-testdev/hw/misc/pcie-ats-tes…
[2] https://lore.kernel.org/kvm/20250404193923.1413163-68-seanjc@google.com/
David Matlack (28):
selftests: Create tools/testing/selftests/vfio
vfio: selftests: Add a helper library for VFIO selftests
vfio: selftests: Introduce vfio_pci_device_test
tools headers: Add stub definition for __iomem
tools headers: Import asm-generic MMIO helpers
tools headers: Import x86 MMIO helper overrides
tools headers: Import iosubmit_cmds512()
tools headers: Add symlink to linux/pci_ids.h
vfio: selftests: Keep track of DMA regions mapped into the device
vfio: selftests: Enable asserting MSI eventfds not firing
vfio: selftests: Add a helper for matching vendor+device IDs
vfio: selftests: Add driver framework
vfio: sefltests: Add vfio_pci_driver_test
dmaengine: ioat: Move system_has_dca_enabled() to dma.h
vfio: selftests: Add driver for Intel CBDMA
dmaengine: idxd: Allow registers.h to be included from tools/
vfio: selftests: Add driver for Intel DSA
vfio: selftests: Move helper to get cdev path to libvfio
vfio: selftests: Encapsulate IOMMU mode
vfio: selftests: Replicate tests across all iommu_modes
vfio: selftests: Add vfio_type1v2_mode
vfio: selftests: Add iommufd_compat_type1{,v2} modes
vfio: selftests: Add iommufd mode
vfio: selftests: Make iommufd the default iommu_mode
vfio: selftests: Add a script to help with running VFIO selftests
KVM: selftests: Build and link sefltests/vfio/lib into KVM selftests
KVM: selftests: Test sending a vfio-pci device IRQ to a VM
KVM: selftests: Add -d option to vfio_pci_device_irq_test for
device-sent MSIs
Josh Hilke (5):
vfio: selftests: Test basic VFIO and IOMMUFD integration
vfio: selftests: Move vfio dma mapping test to their own file
vfio: selftests: Add test to reset vfio device.
vfio: selftests: Add DMA mapping tests for 2M and 1G HugeTLB
vfio: selftests: Validate 2M/1G HugeTLB are mapped as 2M/1G in IOMMU
MAINTAINERS | 7 +
drivers/dma/idxd/registers.h | 4 +
drivers/dma/ioat/dma.h | 2 +
drivers/dma/ioat/hw.h | 3 -
tools/arch/x86/include/asm/io.h | 101 +++
tools/arch/x86/include/asm/special_insns.h | 27 +
tools/include/asm-generic/io.h | 482 ++++++++++++++
tools/include/asm/io.h | 11 +
tools/include/linux/compiler.h | 4 +
tools/include/linux/io.h | 4 +-
tools/include/linux/pci_ids.h | 1 +
tools/testing/selftests/Makefile | 1 +
tools/testing/selftests/kvm/Makefile.kvm | 4 +
.../testing/selftests/kvm/include/kvm_util.h | 4 +
tools/testing/selftests/kvm/lib/kvm_util.c | 21 +
.../selftests/kvm/vfio_pci_device_irq_test.c | 172 +++++
tools/testing/selftests/vfio/.gitignore | 7 +
tools/testing/selftests/vfio/Makefile | 21 +
.../selftests/vfio/lib/drivers/dsa/dsa.c | 416 ++++++++++++
.../vfio/lib/drivers/dsa/registers.h | 1 +
.../selftests/vfio/lib/drivers/ioat/hw.h | 1 +
.../selftests/vfio/lib/drivers/ioat/ioat.c | 235 +++++++
.../vfio/lib/drivers/ioat/registers.h | 1 +
.../selftests/vfio/lib/include/vfio_util.h | 295 +++++++++
tools/testing/selftests/vfio/lib/libvfio.mk | 24 +
.../selftests/vfio/lib/vfio_pci_device.c | 594 ++++++++++++++++++
.../selftests/vfio/lib/vfio_pci_driver.c | 126 ++++
tools/testing/selftests/vfio/run.sh | 109 ++++
.../selftests/vfio/vfio_dma_mapping_test.c | 199 ++++++
.../selftests/vfio/vfio_iommufd_setup_test.c | 127 ++++
.../selftests/vfio/vfio_pci_device_test.c | 176 ++++++
.../selftests/vfio/vfio_pci_driver_test.c | 247 ++++++++
32 files changed, 3423 insertions(+), 4 deletions(-)
create mode 100644 tools/arch/x86/include/asm/io.h
create mode 100644 tools/arch/x86/include/asm/special_insns.h
create mode 100644 tools/include/asm-generic/io.h
create mode 100644 tools/include/asm/io.h
create mode 120000 tools/include/linux/pci_ids.h
create mode 100644 tools/testing/selftests/kvm/vfio_pci_device_irq_test.c
create mode 100644 tools/testing/selftests/vfio/.gitignore
create mode 100644 tools/testing/selftests/vfio/Makefile
create mode 100644 tools/testing/selftests/vfio/lib/drivers/dsa/dsa.c
create mode 120000 tools/testing/selftests/vfio/lib/drivers/dsa/registers.h
create mode 120000 tools/testing/selftests/vfio/lib/drivers/ioat/hw.h
create mode 100644 tools/testing/selftests/vfio/lib/drivers/ioat/ioat.c
create mode 120000 tools/testing/selftests/vfio/lib/drivers/ioat/registers.h
create mode 100644 tools/testing/selftests/vfio/lib/include/vfio_util.h
create mode 100644 tools/testing/selftests/vfio/lib/libvfio.mk
create mode 100644 tools/testing/selftests/vfio/lib/vfio_pci_device.c
create mode 100644 tools/testing/selftests/vfio/lib/vfio_pci_driver.c
create mode 100755 tools/testing/selftests/vfio/run.sh
create mode 100644 tools/testing/selftests/vfio/vfio_dma_mapping_test.c
create mode 100644 tools/testing/selftests/vfio/vfio_iommufd_setup_test.c
create mode 100644 tools/testing/selftests/vfio/vfio_pci_device_test.c
create mode 100644 tools/testing/selftests/vfio/vfio_pci_driver_test.c
base-commit: e271ed52b344ac02d4581286961d0c40acc54c03
prerequisite-patch-id: c1decca4653262d3d2451e6fd4422ebff9c0b589
--
2.50.0.rc2.701.gf1e915cc24-goog
Fix cur_aux()->nospec_result test after do_check_insn() referring to the
to-be-analyzed (potentially unsafe) instruction, not the
already-analyzed (safe) instruction. This might allow a unsafe insn to
slip through on a speculative path. Create some tests from the
reproducer [1].
Commit d6f1c85f2253 ("bpf: Fall back to nospec for Spectre v1") should
not be in any stable kernel yet, therefore bpf-next should suffice.
[1] https://lore.kernel.org/bpf/685b3c1b.050a0220.2303ee.0010.GAE@google.com/
Changes since v1:
- Fix compiler error due to missed rename of prev_insn_idx in first
patch
- v1: https://lore.kernel.org/bpf/20250628125927.763088-1-luis.gerhorst@fau.de/
Changes since RFC:
- Introduce prev_aux() as suggested by Alexei. For this, we must move
the env->prev_insn_idx assignment to happen directly after
do_check_insn(), for which I have created a separate commit. This
patch could be simplified by using a local prev_aux variable as
sugested by Eduard, but I figured one might find the new
assignment-strategy easier to understand (before, prev_insn_idx and
env->prev_insn_idx were out-of-sync for the latter part of the loop).
Also, like this we do not have an additional prev_* variable that must
be kept in-sync and the local variable's usage (old prev_insn_idx, new
tmp) is much more local. If you think it would be better to not take
the risk and keep the fix simple by just introducing the prev_aux
variable, let me know.
- Change WARN_ON_ONCE() to verifier_bug_if() as suggested by Alexei
- Change assertion to check instruction is BPF_JMP[32] as suggested by
Eduard
- RFC: https://lore.kernel.org/bpf/8734bmoemx.fsf@fau.de/
Luis Gerhorst (3):
bpf: Update env->prev_insn_idx after do_check_insn()
bpf: Fix aux usage after do_check_insn()
selftests/bpf: Add Spectre v4 tests
kernel/bpf/verifier.c | 30 ++--
tools/testing/selftests/bpf/progs/bpf_misc.h | 4 +
.../selftests/bpf/progs/verifier_unpriv.c | 149 ++++++++++++++++++
3 files changed, 174 insertions(+), 9 deletions(-)
base-commit: d69bafe6ee2b5eff6099fa26626ecc2963f0f363
--
2.49.0
This series introduces NUMA-aware memory placement support for KVM guests
with guest_memfd memory backends. It builds upon Fuad Tabba's work that
enabled host-mapping for guest_memfd memory [1] and can be applied directly
on KVM tree (branch:queue, base commit:7915077245) [2].
== Background ==
KVM's guest-memfd memory backend currently lacks support for NUMA policy
enforcement, causing guest memory allocations to be distributed across host
nodes according to kernel's default behavior, irrespective of any policy
specified by the VMM. This limitation arises because conventional userspace
NUMA control mechanisms like mbind(2) don't work since the memory isn't
directly mapped to userspace when allocations occur.
Fuad's work [1] provides the necessary mmap capability, and this series
leverages it to enable mbind(2).
== Implementation ==
This series implements proper NUMA policy support for guest-memfd by:
1. Adding mempolicy-aware allocation APIs to the filemap layer.
2. Introducing custom inodes (via a dedicated slab-allocated inode cache,
kvm_gmem_inode_info) to store NUMA policy and metadata for guest memory.
3. Implementing get/set_policy vm_ops in guest_memfd to support NUMA
policy.
With these changes, VMMs can now control guest memory placement by mapping
guest_memfd file descriptor and using mbind(2) to specify:
- Policy modes: default, bind, interleave, or preferred
- Host NUMA nodes: List of target nodes for memory allocation
These Policies affect only future allocations and do not migrate existing
memory. This matches mbind(2)'s default behavior which affects only new
allocations unless overridden with MPOL_MF_MOVE/MPOL_MF_MOVE_ALL flags (Not
supported for guest_memfd as it is unmovable by design).
== Upstream Plan ==
Phased approach as per David's guest_memfd extension overview [3] and
community calls [4]:
Phase 1 (this series):
1. Focuses on shared guest_memfd support (non-CoCo VMs).
2. Builds on Fuad's host-mapping work.
Phase2 (future work):
1. NUMA support for private guest_memfd (CoCo VMs).
2. Depends on SNP in-place conversion support [5].
This series provides a clean integration path for NUMA-aware memory
management for guest_memfd and lays the groundwork for future confidential
computing NUMA capabilities.
Please review and provide feedback!
Thanks,
Shivank
== Changelog ==
- v1,v2: Extended the KVM_CREATE_GUEST_MEMFD IOCTL to pass mempolicy.
- v3: Introduced fbind() syscall for VMM memory-placement configuration.
- v4-v6: Current approach using shared_policy support and vm_ops (based on
suggestions from David [6] and guest_memfd bi-weekly upstream
call discussion [7]).
- v7: Use inodes to store NUMA policy instead of file [8].
- v8: Rebase on top of Fuad's V12: Host mmaping for guest_memfd memory.
[1] https://lore.kernel.org/all/20250611133330.1514028-1-tabba@google.com
[2] https://git.kernel.org/pub/scm/virt/kvm/kvm.git/log/?h=queue
[3] https://lore.kernel.org/all/c1c9591d-218a-495c-957b-ba356c8f8e09@redhat.com
[4] https://docs.google.com/document/d/1M6766BzdY1Lhk7LiR5IqVR8B8mG3cr-cxTxOrAo…
[5] https://lore.kernel.org/all/20250613005400.3694904-1-michael.roth@amd.com
[6] https://lore.kernel.org/all/6fbef654-36e2-4be5-906e-2a648a845278@redhat.com
[7] https://lore.kernel.org/all/2b77e055-98ac-43a1-a7ad-9f9065d7f38f@amd.com
[8] https://lore.kernel.org/all/diqzbjumm167.fsf@ackerleytng-ctop.c.googlers.com
Ackerley Tng (1):
KVM: guest_memfd: Use guest mem inodes instead of anonymous inodes
Shivank Garg (5):
security: Export anon_inode_make_secure_inode for KVM guest_memfd
mm/mempolicy: Export memory policy symbols
KVM: guest_memfd: Add slab-allocated inode cache
KVM: guest_memfd: Enforce NUMA mempolicy using shared policy
KVM: guest_memfd: selftests: Add tests for mmap and NUMA policy
support
Shivansh Dhiman (1):
mm/filemap: Add mempolicy support to the filemap layer
fs/anon_inodes.c | 20 +-
include/linux/fs.h | 2 +
include/linux/pagemap.h | 41 +++
include/uapi/linux/magic.h | 1 +
mm/filemap.c | 27 +-
mm/mempolicy.c | 6 +
tools/testing/selftests/kvm/Makefile.kvm | 1 +
.../testing/selftests/kvm/guest_memfd_test.c | 123 ++++++++-
virt/kvm/guest_memfd.c | 254 ++++++++++++++++--
virt/kvm/kvm_main.c | 7 +-
virt/kvm/kvm_mm.h | 10 +-
11 files changed, 456 insertions(+), 36 deletions(-)
--
2.43.0
---
== Earlier Postings ==
v7: https://lore.kernel.org/all/20250408112402.181574-1-shivankg@amd.com
v6: https://lore.kernel.org/all/20250226082549.6034-1-shivankg@amd.com
v5: https://lore.kernel.org/all/20250219101559.414878-1-shivankg@amd.com
v4: https://lore.kernel.org/all/20250210063227.41125-1-shivankg@amd.com
v3: https://lore.kernel.org/all/20241105164549.154700-1-shivankg@amd.com
v2: https://lore.kernel.org/all/20240919094438.10987-1-shivankg@amd.com
v1: https://lore.kernel.org/all/20240916165743.201087-1-shivankg@amd.com
V12:
- Fixed YAML indentation in devlink.yaml.
- Removed unused total variable from devlink_nl_rate_tc_bw_set().
- Quoted shell variables in devlink.sh and split declarations to fix
shellcheck warnings.
- Added missing DevlinkFamily imports in selftests to fix pylint
warnings.
- Pulled changes from net-next to enable these adjustments:
Inclusion of DevlinkFamily in YNL test libs.
Introduction of nlmsg_for_each_attr_type() macro and its use in nfsd.
V11:
- Refactored the devlink code to accept relative TC bandwidth share
values instead of percentages.
- Updated documentation to clarify that values are interpreted as
relative shares.
- Refactored the logic in mlx5 to support proportional scaling for
tc-bw values.
- Switched to `nlmsg_for_each_attr_type()` for cleaner attribute
parsing.
- Added a hardware selftest to validate TC bandwidth behavior.
- Refactored esw_qos_is_node_empty for readability.
V10:
- Added netdevsim selftest for tc-bw ops.
- Dropped header: field as it’s unnecessary for local constants in
devlink.yaml.
V9:
- Defined DEVLINK_RATE_TCS_MAX as 8 in uapi/linux/devlink.h.
- Replaced IEEE_8021QAZ_MAX_TCS with DEVLINK_RATE_TCS_MAX throughout
the code.
- Updated devlink-rate-tc-index-max spec to reference the correct UAPI
header.
V8:
- Limit line width to 80 characters in mlx5 changes instead of 100.
- Increase the scheduling node levels to support TC arbitration.
- Ensure parent nodes are set correctly in all code paths that extend
the hierarchy depth for TC arbitration.
- Extended the cover letter with the ongoing discussion on devlink-rate
and net-shapers.
- Extended the cover letter with the Netdev talk link on this series.
V7:
- Fixed disabling tc-bw on leaf nodes that did not have tc-bw
configured.
- Fixed an issue where tc-bw was disabled on a node with assigned
vports, ensuring that vport->qos.sched_node->parent is correctly
updated with the cloned node.
- Declared a constant for the maximum allowed Traffic Class index in
devlink rate.
- Added a range check to validate rate-tc-index.
- Added documentation for the tc-bw argument.
- Add a validation check to ensure that the total bandwidth assigned to
all traffic classes sums to 100.
V6:
- Addressed comments on devlink patch #3.
- Removed first 4 IFC patches, to be pulled from mlx5-next.
V5:
- Fix warning in devlink_nl_rate_tc_bw_set().
- Fix target branch of patch #4.
V4:
- Renamed the nested attribute for traffic class bandwidth to
DEVLINK_ATTR_RATE_TC_BWS.
- Changed the order of the attributes in `devlink.h`.
- Refactored the initialization tc-bw array in
devlink_nl_rate_tc_bw_set().
- Added extack messages to provide clear feedback on issues with tc-bw
arguments.
- Updated `rate-tc-bws` to support a multi-attr set, where each
attribute includes an index and the corresponding bandwidth for that
traffic class.
- Handled the issue where the user could provide
DEVLINK_ATTR_RATE_TC_BWS with duplicate indices.
- Provided ynl exmaples in patch [1/5] commit message.
- Take IFC patches to beginning of the series, targeted for mlx5-next.
V3:
- Dropped rate-tc-index, using tc-bw array index instead.
- Renamed rate-bw to rate-tc-bw.
- Documneted what the rate-tc-bw represents and added a range check for
validation.
- Intorduced devlink_nl_rate_tc_bw_set() to parse and set the TC
bandwidth values.
- Updated the user API in the commit message of patch 1/6 to ensure
bandwidths sum equals 100.
- Fixed missing filling of rate-parent in devlink_nl_rate_fill().
V2:
- Included <linux/dcbnl.h> in devlink.h to resolve missing
IEEE_8021QAZ_MAX_TCS definition.
- Refactored the rate-tc-bw attribute structure to use a separate
rate-tc-index.
- Updated patch 2/6 title.
This patch series extends the devlink-rate API to support traffic class
(TC) bandwidth management, enabling more granular control over traffic
shaping and rate limiting across multiple TCs. The API now allows users
to specify bandwidth proportions for different traffic classes in a
single command. This is particularly useful for managing Enhanced
Transmission Selection (ETS) for groups of Virtual Functions (VFs),
allowing precise bandwidth allocation across traffic classes.
Additionally the series refines the QoS handling in net/mlx5 to support
TC arbitration and bandwidth management on vports and rate nodes.
Discussions on traffic class shaping in net-shapers began in V5 [1],
where we discussed with maintainers whether net-shapers should support
traffic classes and how this could be implemented.
Later, after further conversations with Paolo Abeni and Simon Horman,
Cosmin provided an update [2], confirming that net-shapers' tree-based
hierarchy aligns well with traffic classes when treated as distinct
subsets of netdev queues. Since mlx5 enforces a 1:1 mapping between TX
queues and traffic classes, this approach seems feasible, though some
open questions remain regarding queue reconfiguration and certain mlx5
scheduling behaviors.
Building on that discussion, Cosmin has now shared a concrete
implementation plan on the netdev mailing list [3]. The plan, developed
in collaboration with Paolo and Simon, outlines how net-shapers can be
extended to support the same use cases currently covered by
devlink-rate, with the eventual goal of aligning both and simplifying
the shaping infrastructure in the kernel.
This work was presented at Netdev 0x19 in Zagreb [4].
There we presented how TC scheduling is enforced in mlx5 hardware,
which led to discussions on the mailing list.
A summary of how things work:
Classification means labeling a packet with a traffic class based on
the packet's DSCP or VLAN PCP field, then treating packets with
different traffic classes differently during transmit processing.
In a virtualized setup, VFs are untrusted and do not control
classification or shaping. Classification is done by the hardware using
a prio-to-TC mapping set by the hypervisor. VFs only select which send
queue to use and are expected to respect the classification logic by
sending each traffic class on its dedicated queue. As stated in the
net-shapers plan [3], each transmit queue should carry only a single
traffic class. Mixing classes in a single queue can lead to HOL
blocking.
In the mlx5 implementation, if the queue used does not match the
classified traffic class, the hardware moves the queue to the correct
TC scheduler. This movement is not a reclassification; it’s a necessary
enforcement step to ensure traffic class isolation is maintained.
Extend devlink-rate API to support rate management on TCs:
- devlink: Extend the devlink rate API to support traffic class
bandwidth management
Introduce a no-op implementation:
- net/mlx5: Add no-op implementation for setting tc-bw on rate objects
Add support for enabling and disabling TC QoS on vports and nodes:
- net/mlx5: Add support for setting tc-bw on nodes
- net/mlx5: Add traffic class scheduling support for vport QoS
Support for setting tc-bw on rate objects:
- net/mlx5: Manage TC arbiter nodes and implement full support for
tc-bw
[1]
https://lore.kernel.org/netdev/20241204220931.254964-1-tariqt@nvidia.com/
[2]
https://lore.kernel.org/netdev/67df1a562614b553dcab043f347a0d7c5393ff83.cam…
[3]
https://lore.kernel.org/netdev/d9831d0c940a7b77419abe7c7330e822bbfd1cfb.cam…
[4]
https://netdevconf.info/0x19/sessions/talk/optimizing-bandwidth-allocation-…
Carolina Jubran (8):
netlink: introduce type-checking attribute iteration for nlmsg
devlink: Extend devlink rate API with traffic classes bandwidth management
selftest: netdevsim: Add devlink rate tc-bw test
net/mlx5: Add no-op implementation for setting tc-bw on rate objects
net/mlx5: Add support for setting tc-bw on nodes
net/mlx5: Add traffic class scheduling support for vport QoS
net/mlx5: Manage TC arbiter nodes and implement full support for tc-bw
selftests: drv-net: Add test for devlink-rate traffic class bandwidth distribution
Documentation/netlink/specs/devlink.yaml | 32 +-
.../networking/devlink/devlink-port.rst | 8 +
.../net/ethernet/mellanox/mlx5/core/devlink.c | 2 +
.../net/ethernet/mellanox/mlx5/core/esw/qos.c | 1037 ++++++++++++++++-
.../net/ethernet/mellanox/mlx5/core/esw/qos.h | 8 +
.../net/ethernet/mellanox/mlx5/core/eswitch.h | 14 +-
drivers/net/netdevsim/dev.c | 43 +
drivers/net/netdevsim/netdevsim.h | 1 +
drivers/net/vxlan/vxlan_vnifilter.c | 13 +-
fs/nfsd/nfsctl.c | 36 +-
include/net/devlink.h | 8 +
include/net/netlink.h | 14 +
include/uapi/linux/devlink.h | 9 +
net/devlink/netlink_gen.c | 15 +-
net/devlink/netlink_gen.h | 1 +
net/devlink/rate.c | 127 ++
.../drivers/net/hw/devlink_rate_tc_bw.py | 466 ++++++++
.../drivers/net/hw/lib/py/__init__.py | 2 +-
.../selftests/drivers/net/lib/py/__init__.py | 2 +-
.../drivers/net/netdevsim/devlink.sh | 53 +
.../testing/selftests/net/lib/py/__init__.py | 2 +-
tools/testing/selftests/net/lib/py/ynl.py | 5 +
22 files changed, 1825 insertions(+), 73 deletions(-)
create mode 100755 tools/testing/selftests/drivers/net/hw/devlink_rate_tc_bw.py
base-commit: 20a0c20f82acf46d5731a11743e7c7ac4de25db8
--
2.34.1
A few packets may still be sent and received during the termination of
the iperf processes. These late packets cause failures when they arrive
on queues expected to be empty.
Add a one second delay between repeated _send_traffic_check() calls in
rss_ctx tests to ensure such packets are processed before the next
traffic checks are performed.
Example failure observed:
Check failed 2 != 0 traffic on inactive queues (context 1):
[0, 0, 1, 1, 386385, 397196, 0, 0, 0, 0, ...]
Check failed 4 != 0 traffic on inactive queues (context 2):
[0, 0, 0, 0, 2, 2, 247152, 253013, 0, 0, ...]
Check failed 2 != 0 traffic on inactive queues (context 3):
[0, 0, 0, 0, 0, 0, 1, 1, 282434, 283070, ...]
Note: While the `noise` parameter could be used to tolerate these late
packets, it would be inappropriate here. `noise` tolerates far more
traffic than acceptable in this case, risking false positives.
Inactive queues are supposed to see zero traffic.
Fixes: 847aa551fa78 ("selftests: drv-net: rss_ctx: factor out send traffic and check")
Reviewed-by: Gal Pressman <gal(a)nvidia.com>
Reviewed-by: Carolina Jubran <cjubran(a)nvidia.com>
Signed-off-by: Nimrod Oren <noren(a)nvidia.com>
---
tools/testing/selftests/drivers/net/hw/rss_ctx.py | 3 +++
1 file changed, 3 insertions(+)
diff --git a/tools/testing/selftests/drivers/net/hw/rss_ctx.py b/tools/testing/selftests/drivers/net/hw/rss_ctx.py
index 7bb552f8b182..19be69227693 100755
--- a/tools/testing/selftests/drivers/net/hw/rss_ctx.py
+++ b/tools/testing/selftests/drivers/net/hw/rss_ctx.py
@@ -4,6 +4,7 @@
import datetime
import random
import re
+import time
from lib.py import ksft_run, ksft_pr, ksft_exit
from lib.py import ksft_eq, ksft_ne, ksft_ge, ksft_in, ksft_lt, ksft_true, ksft_raises
from lib.py import NetDrvEpEnv
@@ -492,6 +493,7 @@ def test_rss_context(cfg, ctx_cnt=1, create_with_cfg=None):
{ 'target': (2+i*2, 3+i*2),
'noise': (0, 1),
'empty': list(range(2, 2+i*2)) + list(range(4+i*2, 2+2*ctx_cnt)) })
+ time.sleep(1)
if requested_ctx_cnt != ctx_cnt:
raise KsftSkipEx(f"Tested only {ctx_cnt} contexts, wanted {requested_ctx_cnt}")
@@ -559,6 +561,7 @@ def test_rss_context_out_of_order(cfg, ctx_cnt=4):
}
_send_traffic_check(cfg, ports[i], f"context {i}", expected)
+ time.sleep(1)
# Use queues 0 and 1 for normal traffic
ethtool(f"-X {cfg.ifname} equal 2")
--
2.37.1
I've removed the RFC tag from this version of the series, but the items
that I'm looking for feedback on remains the same:
- The userspace ABI, in particular:
- The vector length used for the SVE registers, access to the SVE
registers and access to ZA and (if available) ZT0 depending on
the current state of PSTATE.{SM,ZA}.
- The use of a single finalisation for both SVE and SME.
- The addition of control for enabling fine grained traps in a similar
manner to FGU but without the UNDEF, I'm not clear if this is desired
at all and at present this requires symmetric read and write traps like
FGU. That seemed like it might be desired from an implementation
point of view but we already have one case where we enable an
asymmetric trap (for ARM64_WORKAROUND_AMPERE_AC03_CPU_38) and it
seems generally useful to enable asymmetrically.
This series implements support for SME use in non-protected KVM guests.
Much of this is very similar to SVE, the main additional challenge that
SME presents is that it introduces a new vector length similar to the
SVE vector length and two new controls which change the registers seen
by guests:
- PSTATE.ZA enables the ZA matrix register and, if SME2 is supported,
the ZT0 LUT register.
- PSTATE.SM enables streaming mode, a new floating point mode which
uses the SVE register set with the separately configured SME vector
length. In streaming mode implementation of the FFR register is
optional.
It is also permitted to build systems which support SME without SVE, in
this case when not in streaming mode no SVE registers or instructions
are available. Further, there is no requirement that there be any
overlap in the set of vector lengths supported by SVE and SME in a
system, this is expected to be a common situation in practical systems.
Since there is a new vector length to configure we introduce a new
feature parallel to the existing SVE one with a new pseudo register for
the streaming mode vector length. Due to the overlap with SVE caused by
streaming mode rather than finalising SME as a separate feature we use
the existing SVE finalisation to also finalise SME, a new define
KVM_ARM_VCPU_VEC is provided to help make user code clearer. Finalising
SVE and SME separately would introduce complication with register access
since finalising SVE makes the SVE registers writeable by userspace and
doing multiple finalisations results in an error being reported.
Dealing with a state where the SVE registers are writeable due to one of
SVE or SME being finalised but may have their VL changed by the other
being finalised seems like needless complexity with minimal practical
utility, it seems clearer to just express directly that only one
finalisation can be done in the ABI.
Access to the floating point registers follows the architecture:
- When both SVE and SME are present:
- If PSTATE.SM == 0 the vector length used for the Z and P registers
is the SVE vector length.
- If PSTATE.SM == 1 the vector length used for the Z and P registers
is the SME vector length.
- If only SME is present:
- If PSTATE.SM == 0 the Z and P registers are inaccessible and the
floating point state accessed via the encodings for the V registers.
- If PSTATE.SM == 1 the vector length used for the Z and P registers
- The SME specific ZA and ZT0 registers are only accessible if SVCR.ZA is 1.
The VMM must understand this, in particular when loading state SVCR
should be configured before other state. It should be noted that while
the architecture refers to PSTATE.SM and PSTATE.ZA these PSTATE bits are
not preserved in SPSR_ELx, they are only accessible via SVCR.
There are a large number of subfeatures for SME, most of which only
offer additional instructions but some of which (SME2 and FA64) add
architectural state. These are configured via the ID registers as per
usual.
Protected KVM supported, with the implementation maintaining the
existing restriction that the hypervisor will refuse to run if streaming
mode or ZA is enabled. This both simplfies the code and avoids the need
to allocate storage for host ZA and ZT0 state, there seems to be little
practical use case for supporting this and the memory usage would be
non-trivial.
The new KVM_ARM_VCPU_VEC feature and ZA and ZT0 registers have not been
added to the get-reg-list selftest, the idea of supporting additional
features there without restructuring the program to generate all
possible feature combinations has been rejected. I will post a separate
series which does that restructuring.
Signed-off-by: Mark Brown <broonie(a)kernel.org>
---
Changes in v6:
- Rebase onto v6.16-rc3.
- Link to v5: https://lore.kernel.org/r/20250417-kvm-arm64-sme-v5-0-f469a2d5f574@kernel.o…
Changes in v5:
- Rebase onto v6.15-rc2.
- Add pKVM guest support.
- Always restore SVCR.
- Link to v4: https://lore.kernel.org/r/20250214-kvm-arm64-sme-v4-0-d64a681adcc2@kernel.o…
Changes in v4:
- Rebase onto v6.14-rc2 and Mark Rutland's fixes.
- Expose SME to nested guests.
- Additional cleanups and test fixes following on from the rebase.
- Flush register state on VMM PSTATE.{SM,ZA}.
- Link to v3: https://lore.kernel.org/r/20241220-kvm-arm64-sme-v3-0-05b018c1ffeb@kernel.o…
Changes in v3:
- Rebase onto v6.12-rc2.
- Link to v2: https://lore.kernel.org/r/20231222-kvm-arm64-sme-v2-0-da226cb180bb@kernel.o…
Changes in v2:
- Rebase onto v6.7-rc3.
- Configure subfeatures based on host system only.
- Complete nVHE support.
- There was some snafu with sending v1 out, it didn't make it to the
lists but in case it hit people's inboxes I'm sending as v2.
---
Mark Brown (28):
arm64/fpsimd: Update FA64 and ZT0 enables when loading SME state
arm64/fpsimd: Decide to save ZT0 and streaming mode FFR at bind time
arm64/fpsimd: Check enable bit for FA64 when saving EFI state
arm64/fpsimd: Determine maximum virtualisable SME vector length
KVM: arm64: Introduce non-UNDEF FGT control
KVM: arm64: Pay attention to FFR parameter in SVE save and load
KVM: arm64: Pull ctxt_has_ helpers to start of sysreg-sr.h
KVM: arm64: Move SVE state access macros after feature test macros
KVM: arm64: Rename SVE finalization constants to be more general
KVM: arm64: Document the KVM ABI for SME
KVM: arm64: Define internal features for SME
KVM: arm64: Rename sve_state_reg_region
KVM: arm64: Store vector lengths in an array
KVM: arm64: Implement SME vector length configuration
KVM: arm64: Support SME control registers
KVM: arm64: Support TPIDR2_EL0
KVM: arm64: Support SME identification registers for guests
KVM: arm64: Support SME priority registers
KVM: arm64: Provide assembly for SME register access
KVM: arm64: Support userspace access to streaming mode Z and P registers
KVM: arm64: Flush register state on writes to SVCR.SM and SVCR.ZA
KVM: arm64: Expose SME specific state to userspace
KVM: arm64: Context switch SME state for guests
KVM: arm64: Handle SME exceptions
KVM: arm64: Expose SME to nested guests
KVM: arm64: Provide interface for configuring and enabling SME for guests
KVM: arm64: selftests: Add SME system registers to get-reg-list
KVM: arm64: selftests: Add SME to set_id_regs test
Documentation/virt/kvm/api.rst | 117 +++++++----
arch/arm64/include/asm/fpsimd.h | 26 +++
arch/arm64/include/asm/kvm_emulate.h | 6 +
arch/arm64/include/asm/kvm_host.h | 168 ++++++++++++---
arch/arm64/include/asm/kvm_hyp.h | 5 +-
arch/arm64/include/asm/kvm_pkvm.h | 2 +-
arch/arm64/include/asm/vncr_mapping.h | 2 +
arch/arm64/include/uapi/asm/kvm.h | 33 +++
arch/arm64/kernel/cpufeature.c | 2 -
arch/arm64/kernel/fpsimd.c | 89 ++++----
arch/arm64/kvm/arm.c | 10 +
arch/arm64/kvm/fpsimd.c | 28 ++-
arch/arm64/kvm/guest.c | 252 ++++++++++++++++++++---
arch/arm64/kvm/handle_exit.c | 14 ++
arch/arm64/kvm/hyp/fpsimd.S | 28 ++-
arch/arm64/kvm/hyp/include/hyp/switch.h | 175 ++++++++++++++--
arch/arm64/kvm/hyp/include/hyp/sysreg-sr.h | 97 +++++----
arch/arm64/kvm/hyp/nvhe/hyp-main.c | 86 ++++++--
arch/arm64/kvm/hyp/nvhe/pkvm.c | 81 ++++++--
arch/arm64/kvm/hyp/nvhe/switch.c | 4 +-
arch/arm64/kvm/hyp/nvhe/sys_regs.c | 6 +
arch/arm64/kvm/hyp/vhe/switch.c | 17 +-
arch/arm64/kvm/nested.c | 3 +-
arch/arm64/kvm/reset.c | 156 ++++++++++----
arch/arm64/kvm/sys_regs.c | 140 ++++++++++++-
include/uapi/linux/kvm.h | 1 +
tools/testing/selftests/kvm/arm64/get-reg-list.c | 32 ++-
tools/testing/selftests/kvm/arm64/set_id_regs.c | 29 ++-
28 files changed, 1315 insertions(+), 294 deletions(-)
---
base-commit: 7204503c922cfdb4fcfce4a4ab61f4558a01a73b
change-id: 20230301-kvm-arm64-sme-06a1246d3636
Best regards,
--
Mark Brown <broonie(a)kernel.org>
DAMON sysfs interface is the bridge between the user space and the
kernel space for DAMON parameters. There is no good and simple test to
see if the parameters are set as expected. Existing DAMON selftests
therefore test end-to-end features. For example, damos_quota_goal.py
runs a DAMOS scheme with quota goal set against a test program running
an artificial access pattern, and see if the result is as expected.
Such tests cover only a few part of DAMON. Adding more tests is also
complicated. Finally, the reliability of the test itself on different
systems is bad.
'drgn' is a tool that can extract kernel internal data structures like
DAMON parameters. Add a test that passes specific DAMON parameters via
DAMON sysfs reusing _damon_sysfs.py, extract resulting DAMON parameters
via 'drgn', and compare those. Note that this test is not adding
exhaustive tests of all DAMON parameters and input combinations but very
basic things. Advancing the test infrastructure and adding more tests
are future works.
Changes from RFC
(https://lore.kernel.org/20250622210330.40490-1-sj@kernel.org)
- Rebase on latest mm-new
SeongJae Park (6):
selftests/damon: add drgn script for extracting damon status
selftests/damon/_damon_sysfs: set Kdamond.pid in start()
selftests/damon: add python and drgn-based DAMON sysfs test
selftests/damon/sysfs.py: test monitoring attribute parameters
selftests/damon/sysfs.py: test adaptive targets parameter
selftests/damon/sysfs.py: test DAMOS schemes parameters setup
tools/testing/selftests/damon/Makefile | 1 +
tools/testing/selftests/damon/_damon_sysfs.py | 3 +
.../selftests/damon/drgn_dump_damon_status.py | 161 ++++++++++++++++++
tools/testing/selftests/damon/sysfs.py | 115 +++++++++++++
4 files changed, 280 insertions(+)
create mode 100755 tools/testing/selftests/damon/drgn_dump_damon_status.py
create mode 100755 tools/testing/selftests/damon/sysfs.py
base-commit: 5ab6feac2d83ebbf0d0d2eedf0505878ba677dcb
--
2.39.5
This patch adds a new robust_list() syscall. The current syscall
can't be expanded to cover the following use case, so a new one is
needed. This new syscall allows users to set multiple robust lists per
process and to have either 32bit or 64bit pointers in the list.
* Use case
FEX-Emu[1] is an application that runs x86 and x86-64 binaries on an
AArch64 Linux host. One of the tasks of FEX-Emu is to translate syscalls
from one platform to another. Existing set_robust_list() can't be easily
translated because of two limitations:
1) x86 apps can have 32bit pointers robust lists. For a x86-64 kernel
this is not a problem, because of the compat entry point. But there's
no such compat entry point for AArch64, so the kernel would do the
pointer arithmetic wrongly. Is also unviable to userspace to keep
track every addition/removal to the robust list and keep a 64bit
version of it somewhere else to feed the kernel. Thus, the new
interface has an option of telling the kernel if the list is filled
with 32bit or 64bit pointers.
2) Apps can set just one robust list (in theory, x86-64 can set two if
they also use the compat entry point). That means that when a x86 app
asks FEX-Emu to call set_robust_list(), FEX have two options: to
overwrite their own robust list pointer and make the app robust, or
to ignore the app robust list and keep the emulator robust. The new
interface allows for multiple robust lists per application, solving
this.
* Interface
This is the proposed interface:
long set_robust_list2(void *head, int index, unsigned int flags)
`head` is the head of the userspace struct robust_list_head, just as old
set_robust_list(). It needs to be a void pointer since it can point to a normal
robust_list_head or a compat_robust_list_head.
`flags` can be used for defining the list type:
enum robust_list_type {
ROBUST_LIST_32BIT,
ROBUST_LIST_64BIT,
};
`index` is the index in the internal robust_list's linked list (the naming
starts to get confusing, I reckon). If `index == -1`, that means that user wants
to set a new robust_list, and the kernel will append it in the end of the list,
assign a new index and return this index to the user. If `index >= 0`, that
means that user wants to re-set `*head` of an already existing list (similarly
to what happens when you call set_robust_list() twice with different `*head`).
If `index` is out of range, or it points to a non-existing robust_list, or if
the internal list is full, an error is returned.
* Implementation
The implementation re-uses most of the existing robust list interface as
possible. The new task_struct member `struct list_head robust_list2` is just a
linked list where new lists are appended as the user requests more lists, and by
futex_cleanup(), the kernel walks through the internal list feeding
exit_robust_list() with the robust_list's.
This implementation supports up to 10 lists (defined at ROBUST_LISTS_PER_TASK),
but it was an arbitrary number for this RFC. For the described use case above, 4
should be enough, I'm not sure which should be the limit.
It doesn't support list removal (should it support?). It doesn't have a proper
get_robust_list2() yet as well, but I can add it in a next revision. We could
also have a generic robust_list() syscall that can be used to set/get and be
controlled by flags.
The new interface has a `unsigned int flags` argument, making it
extensible for future use cases as well.
It refuses unaligned `head` addresses. It doesn't have a limit for elements in a
single list (like ROBUST_LIST_LIMIT), it destroys the list as it is parsed to be
safe against circular lists.
* Testing
This patcheset has a selftest patch that expands this one:
https://lore.kernel.org/lkml/20250212131123.37431-1-andrealmeid@igalia.com/
Also, FEX-Emu added support for this interface to validate it:
https://github.com/FEX-Emu/FEX/pull/3966
Feedback is very welcomed!
Thanks,
André
[1] https://github.com/FEX-Emu/FEX
Changelog:
- Fixed compilation issues when CONFIG_COMPAT or CONFIG_FUTEX are not
set
- Rebased on top of new futex work (private hash)
v4: https://lore.kernel.org/lkml/20250225183531.682556-1-andrealmeid@igalia.com/
- Refuse unaligned head pointers
- Ignore ROBUST_LIST_LIMIT for lists created with this interface and make it
robust against circular lists
- Fix a get_robust_list() syscall bug for getting the list from another thread
- Adapt selftest to use the new interface
v3: https://lore.kernel.org/lkml/20241217174958.477692-1-andrealmeid@igalia.com/
- Old syscall set_robust_list() adds new head to the internal linked list of
robust lists pointers, instead of having a field just for them. Remove
tsk->robust_list and use only tsk->robust_list2
v2: https://lore.kernel.org/lkml/20241101162147.284993-1-andrealmeid@igalia.com/
- Added a patch to properly deal with exit_robust_list() in 64bit vs 32bit
- Wired-up syscall for all archs
- Added more of the cover letter to the commit message
v1: https://lore.kernel.org/lkml/20241024145735.162090-1-andrealmeid@igalia.com/
---
André Almeida (7):
selftests/futex: Add ASSERT_ macros
selftests/futex: Create test for robust list
futex: Use explicit sizes for compat_exit_robust_list
futex: Create set_robust_list2
futex: Remove the limit of elements for sys_set_robust_list2 lists
futex: Wire up set_robust_list2 syscall
selftests: futex: Expand robust list test for the new interface
arch/alpha/kernel/syscalls/syscall.tbl | 1 +
arch/arm/tools/syscall.tbl | 1 +
arch/m68k/kernel/syscalls/syscall.tbl | 1 +
arch/microblaze/kernel/syscalls/syscall.tbl | 1 +
arch/mips/kernel/syscalls/syscall_n32.tbl | 1 +
arch/mips/kernel/syscalls/syscall_n64.tbl | 1 +
arch/mips/kernel/syscalls/syscall_o32.tbl | 1 +
arch/parisc/kernel/syscalls/syscall.tbl | 1 +
arch/powerpc/kernel/syscalls/syscall.tbl | 1 +
arch/s390/kernel/syscalls/syscall.tbl | 1 +
arch/sh/kernel/syscalls/syscall.tbl | 1 +
arch/sparc/kernel/syscalls/syscall.tbl | 1 +
arch/x86/entry/syscalls/syscall_32.tbl | 1 +
arch/x86/entry/syscalls/syscall_64.tbl | 1 +
arch/xtensa/kernel/syscalls/syscall.tbl | 1 +
include/linux/compat.h | 12 +-
include/linux/futex.h | 30 +-
include/linux/sched.h | 5 +-
include/uapi/asm-generic/unistd.h | 2 +
include/uapi/linux/futex.h | 10 +
kernel/futex/core.c | 156 ++++-
kernel/futex/futex.h | 5 +
kernel/futex/syscalls.c | 85 ++-
kernel/sys_ni.c | 1 +
scripts/syscall.tbl | 1 +
.../testing/selftests/futex/functional/.gitignore | 1 +
tools/testing/selftests/futex/functional/Makefile | 3 +-
.../selftests/futex/functional/robust_list.c | 706 +++++++++++++++++++++
tools/testing/selftests/futex/include/logging.h | 38 ++
29 files changed, 1021 insertions(+), 49 deletions(-)
---
base-commit: a24cc6ce1933eade12aa2b9859de0fcd2dac2c06
change-id: 20250225-tonyk-robust_futex-60adeedac695
Best regards,
--
André Almeida <andrealmeid(a)igalia.com>
Fix cur_aux()->nospec_result test after do_check_insn() referring to the
to-be-analyzed (potentially unsafe) instruction, not the
already-analyzed (safe) instruction. This might allow a unsafe insn to
slip through on a speculative path. Create some tests from the
reproducer [1].
Commit d6f1c85f2253 ("bpf: Fall back to nospec for Spectre v1") should
not be in any stable kernel yet, therefore bpf-next should suffice.
[1] https://lore.kernel.org/bpf/685b3c1b.050a0220.2303ee.0010.GAE@google.com/
Changes since RFC:
- Introduce prev_aux() as suggested by Alexei. For this, we must move
the env->prev_insn_idx assignment to happen directly after
do_check_insn(), for which I have created a separate commit. This
patch could be simplified by using a local prev_aux variable as
sugested by Eduard, but I figured one might find the new
assignment-strategy easier to understand (before, prev_insn_idx and
env->prev_insn_idx were out-of-sync for the latter part of the loop).
Also, like this we do not have an additional prev_* variable that must
be kept in-sync and the local variable's usage (old prev_insn_idx, new
tmp) is much more local. If you think it would be better to not take
the risk and keep the fix simple by just introducing the prev_aux
variable, let me know.
- Change WARN_ON_ONCE() to verifier_bug_if() as suggested by Alexei
- Change assertion to check instruction is BPF_JMP[32] as suggested by
Eduard
- RFC: https://lore.kernel.org/bpf/8734bmoemx.fsf@fau.de/
Luis Gerhorst (3):
bpf: Update env->prev_insn_idx after do_check_insn()
bpf: Fix aux usage after do_check_insn()
selftests/bpf: Add Spectre v4 tests
kernel/bpf/verifier.c | 30 ++--
tools/testing/selftests/bpf/progs/bpf_misc.h | 4 +
.../selftests/bpf/progs/verifier_unpriv.c | 149 ++++++++++++++++++
3 files changed, 174 insertions(+), 9 deletions(-)
base-commit: d69bafe6ee2b5eff6099fa26626ecc2963f0f363
--
2.49.0
Hello,
binder_alloc_selftest provides a robust set of checks for the binder allocator,
but it rarely runs because it must hook into a running binder process and block
all other binder threads until it completes. The test itself is a good candidate
for conversion to KUnit, and it can be further isolated from user processes by
using a test-specific lru freelist instead of the global one. This series
converts the selftest to KUnit to make it less burdensome to run and to set up a
foundation for testing future binder_alloc changes.
Thanks,
Tiffany
Tiffany Yang (5):
binder: Fix selftest page indexing
binder: Store lru freelist in binder_alloc
binder: Scaffolding for binder_alloc KUnit tests
binder: Convert binder_alloc selftests to KUnit
binder: encapsulate individual alloc test cases
drivers/android/Kconfig | 15 +-
drivers/android/Makefile | 2 +-
drivers/android/binder.c | 10 +-
drivers/android/binder_alloc.c | 39 +-
drivers/android/binder_alloc.h | 14 +-
drivers/android/binder_alloc_selftest.c | 306 -----------
drivers/android/binder_internal.h | 4 +
drivers/android/tests/.kunitconfig | 3 +
drivers/android/tests/Makefile | 3 +
drivers/android/tests/binder_alloc_kunit.c | 572 +++++++++++++++++++++
include/kunit/test.h | 12 +
lib/kunit/user_alloc.c | 4 +-
12 files changed, 644 insertions(+), 340 deletions(-)
delete mode 100644 drivers/android/binder_alloc_selftest.c
create mode 100644 drivers/android/tests/.kunitconfig
create mode 100644 drivers/android/tests/Makefile
create mode 100644 drivers/android/tests/binder_alloc_kunit.c
--
2.50.0.727.gbf7dc18ff4-goog
glibc does not define SYS_futex for 32-bit architectures using 64-bit
time_t e.g. riscv32, therefore this test fails to compile since it does not
find SYS_futex in C library headers. Define SYS_futex as SYS_futex_time64
in this situation to ensure successful compilation and compatibility.
Signed-off-by: Cynthia Huang <cynthia(a)andestech.com>
Signed-off-by: Ben Zong-You Xie <ben717(a)andestech.com>
---
Changes since v1:
- Fix the SOB chain
v1 : https://lore.kernel.org/all/20250527093536.3646143-1-ben717@andestech.com/
---
tools/testing/selftests/futex/include/futextest.h | 11 +++++++++++
1 file changed, 11 insertions(+)
diff --git a/tools/testing/selftests/futex/include/futextest.h b/tools/testing/selftests/futex/include/futextest.h
index ddbcfc9b7bac..7a5fd1d5355e 100644
--- a/tools/testing/selftests/futex/include/futextest.h
+++ b/tools/testing/selftests/futex/include/futextest.h
@@ -47,6 +47,17 @@ typedef volatile u_int32_t futex_t;
FUTEX_PRIVATE_FLAG)
#endif
+/*
+ * SYS_futex is expected from system C library, in glibc some 32-bit
+ * architectures (e.g. RV32) are using 64-bit time_t, therefore it doesn't have
+ * SYS_futex defined but just SYS_futex_time64. Define SYS_futex as
+ * SYS_futex_time64 in this situation to ensure the compilation and the
+ * compatibility.
+ */
+#if !defined(SYS_futex) && defined(SYS_futex_time64)
+#define SYS_futex SYS_futex_time64
+#endif
+
/**
* futex() - SYS_futex syscall wrapper
* @uaddr: address of first futex
--
2.34.1
Changes in v2:
- Removed lints are not replaced with `expect` in the first diff.
- Removals are done in separate diffs for each.
The `#[allow(clippy::non_send_fields_in_send_ty)]` removal was tested
on 1.81 and clippy was still happy with it. I couldn't test it on 1.78
because when I go below 1.81 `menuconfig` no longer shows the Rust option.
And any manual changes I make to `.config` are immediately reverted on
`make` invocations.
Onur Özkan (3):
replace `#[allow(...)]` with `#[expect(...)]`
rust: remove `#[allow(clippy::unnecessary_cast)]`
rust: remove `#[allow(clippy::non_send_fields_in_send_ty)]`
drivers/gpu/nova-core/regs.rs | 2 +-
rust/compiler_builtins.rs | 2 +-
rust/kernel/alloc/allocator_test.rs | 2 +-
rust/kernel/cpufreq.rs | 1 -
rust/kernel/devres.rs | 2 +-
rust/kernel/driver.rs | 2 +-
rust/kernel/drm/ioctl.rs | 8 ++++----
rust/kernel/error.rs | 3 +--
rust/kernel/init.rs | 6 +++---
rust/kernel/kunit.rs | 2 +-
rust/kernel/opp.rs | 4 ++--
rust/kernel/types.rs | 2 +-
rust/macros/helpers.rs | 2 +-
13 files changed, 18 insertions(+), 20 deletions(-)
--
2.50.0
This patch series was initially sent to security(a)k.o; resending it in
public. I might follow-up with a tests series which addresses similar
issues with TIOCLINUX.
===============
The TIOCSTI ioctl uses capable(CAP_SYS_ADMIN) for access control, which
checks the current process's credentials. However, it doesn't validate
against the file opener's credentials stored in file->f_cred.
This creates a potential security issue where an unprivileged process
can open a TTY fd and pass it to a privileged process via SCM_RIGHTS.
The privileged process may then inadvertently grant access based on its
elevated privileges rather than the original opener's credentials.
Background
==========
As noted in previous discussion, while CONFIG_LEGACY_TIOCSTI can restrict
TIOCSTI usage, it is enabled by default in most distributions. Even when
CONFIG_LEGACY_TIOCSTI=n, processes with CAP_SYS_ADMIN can still use TIOCSTI
according to the Kconfig documentation.
Additionally, CONFIG_LEGACY_TIOCSTI controls the default value for the
dev.tty.legacy_tiocsti sysctl, which remains runtime-configurable. This
means the described attack vector could work on systems even with
CONFIG_LEGACY_TIOCSTI=n, particularly on Ubuntu 24.04 where it's "restricted"
but still functional.
Solution Approach
=================
This series addresses the issue through SELinux LSM integration rather
than modifying core TTY credential checking to avoid potential compatibility
issues with existing userspace.
The enhancement adds proper current task and file credential capability
validation in SELinux's selinux_file_ioctl() hook specifically for
TIOCSTI operations.
Testing
=======
All patches have been validated using:
- scripts/checkpatch.pl --strict (0 errors, 0 warnings)
- Functional testing on kernel v6.16-rc2
- File descriptor passing security test scenarios
- SELinux policy enforcement testing
The fd_passing_security test demonstrates the security concern.
To verify, disable legacy TIOCSTI and run the test:
$ echo "0" | sudo tee /proc/sys/dev/tty/legacy_tiocsti
$ sudo ./tools/testing/selftests/tty/tty_tiocsti_test -t fd_passing_security
Patch Overview
==============
PATCH 1/2: selftests/tty: add TIOCSTI test suite
Comprehensive test suite demonstrating the issue and fix validation
PATCH 2/2: selinux: add capability checks for TIOCSTI ioctl
Core security enhancement via SELinux LSM hook
References
==========
- tty_ioctl(4) - documents TIOCSTI ioctl and capability requirements
- commit 83efeeeb3d04 ("tty: Allow TIOCSTI to be disabled")
- Documentation/security/credentials.rst
- https://github.com/KSPP/linux/issues/156
- https://lore.kernel.org/linux-hardening/Y0m9l52AKmw6Yxi1@hostpad/
- drivers/tty/Kconfig
Configuration References:
[1] - https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/dri…
[2] - https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/dri…
[3] - https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/dri…
To: Shuah Khan <shuah(a)kernel.org>
To: Nathan Chancellor <nathan(a)kernel.org>
To: Nick Desaulniers <nick.desaulniers+lkml(a)gmail.com>
To: Bill Wendling <morbo(a)google.com>
To: Justin Stitt <justinstitt(a)google.com>
To: Paul Moore <paul(a)paul-moore.com>
To: Stephen Smalley <stephen.smalley.work(a)gmail.com>
To: Ondrej Mosnacek <omosnace(a)redhat.com>
Cc: linux-kernel(a)vger.kernel.org
Cc: linux-kselftest(a)vger.kernel.org
Cc: llvm(a)lists.linux.dev
Cc: selinux(a)vger.kernel.org
Signed-off-by: Abhinav Saxena <xandfury(a)gmail.com>
---
Abhinav Saxena (2):
selftests/tty: add TIOCSTI test suite
selinux: add capability checks for TIOCSTI ioctl
security/selinux/hooks.c | 6 +
tools/testing/selftests/tty/Makefile | 6 +-
tools/testing/selftests/tty/config | 1 +
tools/testing/selftests/tty/tty_tiocsti_test.c | 421 +++++++++++++++++++++++++
4 files changed, 433 insertions(+), 1 deletion(-)
---
base-commit: 5adb635077d1b4bd65b183022775a59a378a9c00
change-id: 20250618-toicsti-bug-7822b8e94a32
Best regards,
--
Abhinav Saxena <xandfury(a)gmail.com>
The kernel has recently added support for shadow stacks, currently
x86 only using their CET feature but both arm64 and RISC-V have
equivalent features (GCS and Zicfiss respectively), I am actively
working on GCS[1]. With shadow stacks the hardware maintains an
additional stack containing only the return addresses for branch
instructions which is not generally writeable by userspace and ensures
that any returns are to the recorded addresses. This provides some
protection against ROP attacks and making it easier to collect call
stacks. These shadow stacks are allocated in the address space of the
userspace process.
Our API for shadow stacks does not currently offer userspace any
flexiblity for managing the allocation of shadow stacks for newly
created threads, instead the kernel allocates a new shadow stack with
the same size as the normal stack whenever a thread is created with the
feature enabled. The stacks allocated in this way are freed by the
kernel when the thread exits or shadow stacks are disabled for the
thread. This lack of flexibility and control isn't ideal, in the vast
majority of cases the shadow stack will be over allocated and the
implicit allocation and deallocation is not consistent with other
interfaces. As far as I can tell the interface is done in this manner
mainly because the shadow stack patches were in development since before
clone3() was implemented.
Since clone3() is readily extensible let's add support for specifying a
shadow stack when creating a new thread or process, keeping the current
implicit allocation behaviour if one is not specified either with
clone3() or through the use of clone(). The user must provide a shadow
stack pointer, this must point to memory mapped for use as a shadow
stackby map_shadow_stack() with an architecture specified shadow stack
token at the top of the stack.
Yuri Khrustalev has raised questions from the libc side regarding
discoverability of extended clone3() structure sizes[2], this seems like
a general issue with clone3(). There was a suggestion to add a hwcap on
arm64 which isn't ideal but is doable there, though architecture
specific mechanisms would also be needed for x86 (and RISC-V if it's
support gets merged before this does).
Please note that the x86 portions of this code are build tested only, I
don't appear to have a system that can run CET available to me.
[1] https://lore.kernel.org/linux-arm-kernel/20241001-arm64-gcs-v13-0-222b78d87…
[2] https://lore.kernel.org/r/aCs65ccRQtJBnZ_5@arm.com
Signed-off-by: Mark Brown <broonie(a)kernel.org>
---
Changes in v17:
- Rebase onto v6.16-rc1.
- Link to v16: https://lore.kernel.org/r/20250416-clone3-shadow-stack-v16-0-2ffc9ca3917b@k…
Changes in v16:
- Rebase onto v6.15-rc2.
- Roll in fixes from x86 testing from Rick Edgecombe.
- Rework so that the argument is shadow_stack_token.
- Link to v15: https://lore.kernel.org/r/20250408-clone3-shadow-stack-v15-0-3fa245c6e3be@k…
Changes in v15:
- Rebase onto v6.15-rc1.
- Link to v14: https://lore.kernel.org/r/20250206-clone3-shadow-stack-v14-0-805b53af73b9@k…
Changes in v14:
- Rebase onto v6.14-rc1.
- Link to v13: https://lore.kernel.org/r/20241203-clone3-shadow-stack-v13-0-93b89a81a5ed@k…
Changes in v13:
- Rebase onto v6.13-rc1.
- Link to v12: https://lore.kernel.org/r/20241031-clone3-shadow-stack-v12-0-7183eb8bee17@k…
Changes in v12:
- Add the regular prctl() to the userspace API document since arm64
support is queued in -next.
- Link to v11: https://lore.kernel.org/r/20241005-clone3-shadow-stack-v11-0-2a6a2bd6d651@k…
Changes in v11:
- Rebase onto arm64 for-next/gcs, which is based on v6.12-rc1, and
integrate arm64 support.
- Rework the interface to specify a shadow stack pointer rather than a
base and size like we do for the regular stack.
- Link to v10: https://lore.kernel.org/r/20240821-clone3-shadow-stack-v10-0-06e8797b9445@k…
Changes in v10:
- Integrate fixes & improvements for the x86 implementation from Rick
Edgecombe.
- Require that the shadow stack be VM_WRITE.
- Require that the shadow stack base and size be sizeof(void *) aligned.
- Clean up trailing newline.
- Link to v9: https://lore.kernel.org/r/20240819-clone3-shadow-stack-v9-0-962d74f99464@ke…
Changes in v9:
- Pull token validation earlier and report problems with an error return
to parent rather than signal delivery to the child.
- Verify that the top of the supplied shadow stack is VM_SHADOW_STACK.
- Rework token validation to only do the page mapping once.
- Drop no longer needed support for testing for signals in selftest.
- Fix typo in comments.
- Link to v8: https://lore.kernel.org/r/20240808-clone3-shadow-stack-v8-0-0acf37caf14c@ke…
Changes in v8:
- Fix token verification with user specified shadow stack.
- Don't track user managed shadow stacks for child processes.
- Link to v7: https://lore.kernel.org/r/20240731-clone3-shadow-stack-v7-0-a9532eebfb1d@ke…
Changes in v7:
- Rebase onto v6.11-rc1.
- Typo fixes.
- Link to v6: https://lore.kernel.org/r/20240623-clone3-shadow-stack-v6-0-9ee7783b1fb9@ke…
Changes in v6:
- Rebase onto v6.10-rc3.
- Ensure we don't try to free the parent shadow stack in error paths of
x86 arch code.
- Spelling fixes in userspace API document.
- Additional cleanups and improvements to the clone3() tests to support
the shadow stack tests.
- Link to v5: https://lore.kernel.org/r/20240203-clone3-shadow-stack-v5-0-322c69598e4b@ke…
Changes in v5:
- Rebase onto v6.8-rc2.
- Rework ABI to have the user allocate the shadow stack memory with
map_shadow_stack() and a token.
- Force inlining of the x86 shadow stack enablement.
- Move shadow stack enablement out into a shared header for reuse by
other tests.
- Link to v4: https://lore.kernel.org/r/20231128-clone3-shadow-stack-v4-0-8b28ffe4f676@ke…
Changes in v4:
- Formatting changes.
- Use a define for minimum shadow stack size and move some basic
validation to fork.c.
- Link to v3: https://lore.kernel.org/r/20231120-clone3-shadow-stack-v3-0-a7b8ed3e2acc@ke…
Changes in v3:
- Rebase onto v6.7-rc2.
- Remove stale shadow_stack in internal kargs.
- If a shadow stack is specified unconditionally use it regardless of
CLONE_ parameters.
- Force enable shadow stacks in the selftest.
- Update changelogs for RISC-V feature rename.
- Link to v2: https://lore.kernel.org/r/20231114-clone3-shadow-stack-v2-0-b613f8681155@ke…
Changes in v2:
- Rebase onto v6.7-rc1.
- Remove ability to provide preallocated shadow stack, just specify the
desired size.
- Link to v1: https://lore.kernel.org/r/20231023-clone3-shadow-stack-v1-0-d867d0b5d4d0@ke…
---
Mark Brown (8):
arm64/gcs: Return a success value from gcs_alloc_thread_stack()
Documentation: userspace-api: Add shadow stack API documentation
selftests: Provide helper header for shadow stack testing
fork: Add shadow stack support to clone3()
selftests/clone3: Remove redundant flushes of output streams
selftests/clone3: Factor more of main loop into test_clone3()
selftests/clone3: Allow tests to flag if -E2BIG is a valid error code
selftests/clone3: Test shadow stack support
Documentation/userspace-api/index.rst | 1 +
Documentation/userspace-api/shadow_stack.rst | 44 +++++
arch/arm64/include/asm/gcs.h | 8 +-
arch/arm64/kernel/process.c | 8 +-
arch/arm64/mm/gcs.c | 61 +++++-
arch/x86/include/asm/shstk.h | 11 +-
arch/x86/kernel/process.c | 2 +-
arch/x86/kernel/shstk.c | 57 +++++-
include/asm-generic/cacheflush.h | 11 ++
include/linux/sched/task.h | 17 ++
include/uapi/linux/sched.h | 9 +-
kernel/fork.c | 96 +++++++--
tools/testing/selftests/clone3/clone3.c | 226 ++++++++++++++++++----
tools/testing/selftests/clone3/clone3_selftests.h | 65 ++++++-
tools/testing/selftests/ksft_shstk.h | 98 ++++++++++
15 files changed, 633 insertions(+), 81 deletions(-)
---
base-commit: 19272b37aa4f83ca52bdf9c16d5d81bdd1354494
change-id: 20231019-clone3-shadow-stack-15d40d2bf536
Best regards,
--
Mark Brown <broonie(a)kernel.org>
Rename is_signed_type() to is_signed_var() to avoid colliding with a macro
of the same name defined by linux/overflow.h. Note, overflow.h's version
takes a type as the input, whereas the harness's version takes a variable!
This fixes warnings (and presumably potential test failures) in tests
that utilize the selftests harness and happen to (indirectly) include
overflow.h.
In file included from tools/include/linux/bits.h:34,
from tools/include/linux/bitops.h:14,
from tools/include/linux/hashtable.h:13,
from include/kvm_util.h:11,
from x86/userspace_msr_exit_test.c:11:
tools/include/linux/overflow.h:31:9: error: "is_signed_type" redefined [-Werror]
31 | #define is_signed_type(type) (((type)(-1)) < (type)1)
| ^~~~~~~~~~~~~~
In file included from include/kvm_test_harness.h:11,
from x86/userspace_msr_exit_test.c:9:
../kselftest_harness.h:754:9: note: this is the location of the previous definition
754 | #define is_signed_type(var) (!!(((__typeof__(var))(-1)) < (__typeof__(var))1))
| ^~~~~~~~~~~~~~
Opportunistically use is_signed_type() to implement is_signed_var() so
that the relationship and differences are obvious.
Fixes: fc92099902fb ("tools headers: Synchronize linux/bits.h with the kernel sources")
Cc: Vincent Mailhol <mailhol.vincent(a)wanadoo.fr>
Cc: Arnaldo Carvalho de Melo <acme(a)redhat.com>
Signed-off-by: Sean Christopherson <seanjc(a)google.com>
---
This is probably compile-tested only, I don't think any of the KVM selftests
utilize the harness's EXPECT macros.
tools/testing/selftests/kselftest_harness.h | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)
diff --git a/tools/testing/selftests/kselftest_harness.h b/tools/testing/selftests/kselftest_harness.h
index 2925e47db995..f3e7a46345db 100644
--- a/tools/testing/selftests/kselftest_harness.h
+++ b/tools/testing/selftests/kselftest_harness.h
@@ -56,6 +56,7 @@
#include <asm/types.h>
#include <ctype.h>
#include <errno.h>
+#include <linux/overflow.h>
#include <linux/unistd.h>
#include <poll.h>
#include <stdbool.h>
@@ -751,7 +752,7 @@
for (; _metadata->trigger; _metadata->trigger = \
__bail(_assert, _metadata))
-#define is_signed_type(var) (!!(((__typeof__(var))(-1)) < (__typeof__(var))1))
+#define is_signed_var(var) is_signed_type(__typeof__(var))
#define __EXPECT(_expected, _expected_str, _seen, _seen_str, _t, _assert) do { \
/* Avoid multiple evaluation of the cases */ \
@@ -759,7 +760,7 @@
__typeof__(_seen) __seen = (_seen); \
if (!(__exp _t __seen)) { \
/* Report with actual signedness to avoid weird output. */ \
- switch (is_signed_type(__exp) * 2 + is_signed_type(__seen)) { \
+ switch (is_signed_var(__exp) * 2 + is_signed_var(__seen)) { \
case 0: { \
uintmax_t __exp_print = (uintmax_t)__exp; \
uintmax_t __seen_print = (uintmax_t)__seen; \
base-commit: 78f4e737a53e1163ded2687a922fce138aee73f5
--
2.50.0.714.g196bf9f422-goog
Hi all,
In the last Fall Reinette mentioned functional tests of resctrl would
be preferred over selftests that are based on performance measurement.
This series tries to address that shortcoming by adding some functional
tests for resctrl FS interface and another that checks MSRs match to
what is written through resctrl FS. The MSR test is only available for
Intel CPUs at the moment.
Why RFC?
The new functional selftest itself works, AFAIK. However, calling
ksft_test_result_skip() in cat.c if MSR reading is found to be
unavailable is problematic because of how kselftest harness is
architected. The kselftest.h header itself defines some variables, so
including it into different .c files results in duplicating the test
framework related variables (duplication of ksft_count matters in this
case).
The duplication problem could be worked around by creating a resctrl
selftest specific wrapper for ksft_test_result_skip() into
resctrl_tests.c so the accounting would occur in the "correct" .c file,
but perhaps that is considered hacky and the selftest framework/build
systems should be reworked to avoid duplicating variables?
Ilpo Järvinen (2):
kselftest/resctrl: CAT L3 resctrl FS function tests
kselftest/resctrl: Add CAT L3 CBM functional test for Intel RDT
tools/testing/selftests/resctrl/cat_test.c | 210 ++++++++++++++++++
tools/testing/selftests/resctrl/msr.c | 55 +++++
tools/testing/selftests/resctrl/resctrl.h | 6 +
.../testing/selftests/resctrl/resctrl_tests.c | 2 +
tools/testing/selftests/resctrl/resctrlfs.c | 48 ++++
5 files changed, 321 insertions(+)
create mode 100644 tools/testing/selftests/resctrl/msr.c
base-commit: c1d7e19c70cbb8a19f63c190cf53e71b5f970514
--
2.39.5
Refactors the netpoll UDP transmit path to improve code clarity,
maintainability, and protocol-layer encapsulation.
Function netpoll_send_udp() has more than 100 LoC, which is hard to
understand and review. After this patchset, it has only 32 LoC, which is
more manageable.
The series systematically moves the construction of protocol headers
(UDP, IPv4, IPv6, Ethernet) out of the core `netpoll_send_udp()`
function into dedicated static helpers:
- `push_udp()` for UDP header setup
- `push_ipv4()` and `push_ipv6()` for IP header setup
- `push_eth()` for Ethernet header setup
This results in a clean, layered abstraction that mirrors the protocol
stack, reduces code duplication, and improves readability.
Also, to make sure this is not breaking anything, add IPv6 selftest to
netconsole tests, which will exercise this code. This test would also pick
problems similiar to the one fixed by f599020702698 ("net: netpoll:
Initialize UDP checksum field before checksumming"), which was
embarrassin we didn't have a selftest catch it.
Anyway, there are **no functional changes** intended in this patchset.
Signed-off-by: Breno Leitao <leitao(a)debian.org>
---
Breno Leitao (7):
netpoll: Improve code clarity with explicit struct size calculations
netpoll: factor out UDP checksum calculation into helper
netpoll: factor out IPv6 header setup into push_ipv6() helper
netpoll: factor out IPv4 header setup into push_ipv4() helper
netpoll: factor out UDP header setup into push_udp() helper
netpoll: move Ethernet setup to push_eth() helper
selftests: net: Add IPv6 support to netconsole basic tests
net/core/netpoll.c | 196 +++++++++++++--------
.../selftests/drivers/net/lib/sh/lib_netcons.sh | 74 +++++++-
.../testing/selftests/drivers/net/netcons_basic.sh | 52 +++---
3 files changed, 216 insertions(+), 106 deletions(-)
---
base-commit: 8efa26fcbf8a7f783fd1ce7dd2a409e9b7758df0
change-id: 20250620-netpoll_untagle_ip-e37c799a6925
Best regards,
--
Breno Leitao <leitao(a)debian.org>
Various KUnit tests require PCI infrastructure to work.
All normal platforms enable PCI by default, but UML does not.
Enabling PCI from .kunitconfig files is problematic as it would not be
portable. So in commit 6fc3a8636a7b ("kunit: tool: Enable virtio/PCI by default on UML")
PCI was enabled by way of CONFIG_UML_PCI_OVER_VIRTIO=y.
However CONFIG_UML_PCI_OVER_VIRTIO requires additional configuration of
CONFIG_UML_PCI_OVER_VIRTIO_DEVICE_ID or will otherwise trigger a WARN() in
virtio_pcidev_init(). However there is no one correct value for
UML_PCI_OVER_VIRTIO_DEVICE_ID which could be used by default.
This warning is confusing when debugging test failures.
On the other hand, the functionality of CONFIG_UML_PCI_OVER_VIRTIO is not
used at all, given that it is completely non-functional as indicated by
the WARN() in question. Instead it is only used as a way to enable
CONFIG_UML_PCI which itself is not directly configurable.
Instead of going through CONFIG_UML_PCI_OVER_VIRTIO, introduce a custom
configuration option which enables CONFIG_UML_PCI without triggering
warnings or building dead code.
Signed-off-by: Thomas Weißschuh <thomas.weissschuh(a)linutronix.de>
---
lib/kunit/Kconfig | 7 +++++++
tools/testing/kunit/configs/arch_uml.config | 5 ++---
2 files changed, 9 insertions(+), 3 deletions(-)
diff --git a/lib/kunit/Kconfig b/lib/kunit/Kconfig
index a97897edd9642f3e5df7fdd9dee26ee5cf00d6a4..c8ca155521b2455a221ddbec3f6fc55662c83475 100644
--- a/lib/kunit/Kconfig
+++ b/lib/kunit/Kconfig
@@ -93,4 +93,11 @@ config KUNIT_AUTORUN_ENABLED
In most cases this should be left as Y. Only if additional opt-in
behavior is needed should this be set to N.
+config KUNIT_UML_PCI
+ bool "KUnit UML PCI Support"
+ depends on UML
+ select UML_PCI
+ help
+ Enables the PCI subsystem on UML for use by KUnit tests.
+
endif # KUNIT
diff --git a/tools/testing/kunit/configs/arch_uml.config b/tools/testing/kunit/configs/arch_uml.config
index 54ad8972681a2cc724e6122b19407188910b9025..28edf816aa70e6f408d9486efff8898df79ee090 100644
--- a/tools/testing/kunit/configs/arch_uml.config
+++ b/tools/testing/kunit/configs/arch_uml.config
@@ -1,8 +1,7 @@
# Config options which are added to UML builds by default
-# Enable virtio/pci, as a lot of tests require it.
-CONFIG_VIRTIO_UML=y
-CONFIG_UML_PCI_OVER_VIRTIO=y
+# Enable pci, as a lot of tests require it.
+CONFIG_KUNIT_UML_PCI=y
# Enable FORTIFY_SOURCE for wider checking.
CONFIG_FORTIFY_SOURCE=y
---
base-commit: 19272b37aa4f83ca52bdf9c16d5d81bdd1354494
change-id: 20250626-kunit-uml-pci-a2b687553746
Best regards,
--
Thomas Weißschuh <thomas.weissschuh(a)linutronix.de>
I am submitting a new selftest for the netpoll subsystem specifically
targeting the case where the RX is polling in the TX path, which is
a case that we don't have any test in the tree today. This is done when
netpoll_poll_dev() called, and this test creates a scenario when that is
probably.
The test does the following:
1) Configuring a single RX/TX queue to increase contention on the
interface.
2) Generating background traffic to saturate the network, mimicking
real-world congestion.
3) Sending netconsole messages to trigger netpoll polling and monitor
its behavior.
4) Using dynamic netconsole targets via configfs, with the ability to
delete and recreate targets during the test.
5) Running bpftrace in parallel to verify that netpoll_poll_dev() is
called when expected. If it is called, then the test passes,
otherwise the test is marked as skipped.
In order to achieve it, I stole Jakub's bpftrace helper from [1], and
did some small changes that I found useful to use the helper.
So, this patchset basically contains:
1) The code stolen from Jakub
2) Improvements on bpftrace() helper
3) The selftest itself
Link: https://lore.kernel.org/all/20250421222827.283737-22-kuba@kernel.org/ [1]
---
Changes in v1 (from RFC):
- Toggle the netconsole interfaces up and down after 5 iterations.
- Moved the traffic check under DEBUG (Willem de Bruijn).
- Bumped the iterations to 20 given it runs faster now.
- Link to the RFC: https://lore.kernel.org/r/20250612-netpoll_test-v1-1-4774fd95933f@debian.org
---
Changes in v2:
- Stole Jakub's helper to run bpftrace
- Removed the DEBUG option and moved logs to logging
- Change the code to have a higher chance of calling netpoll_poll_dev().
In my current configuration, it is hitting multiple times during the
test.
- Save and restore TX/RX queue size (Jakub)
- Link to v1: https://lore.kernel.org/r/20250620-netpoll_test-v1-1-5068832f72fc@debian.org
---
Breno Leitao (3):
selftests: drv-net: Improve bpftrace utility error handling
selftests: drv-net: Strip '@' prefix from bpftrace map keys
selftests: net: add netpoll basic functionality test
Jakub Kicinski (1):
selftests: drv-net: add helper/wrapper for bpftrace
tools/testing/selftests/drivers/net/Makefile | 1 +
.../testing/selftests/drivers/net/netpoll_basic.py | 344 +++++++++++++++++++++
tools/testing/selftests/net/lib/py/utils.py | 38 +++
3 files changed, 383 insertions(+)
---
base-commit: eb4c27edb4d8dbfbdcc7bc03e0394a0fab8af7d5
change-id: 20250612-netpoll_test-a1324d2057c8
Best regards,
--
Breno Leitao <leitao(a)debian.org>
From: Yicong Yang <yangyicong(a)hisilicon.com>
Armv8.7 introduces single-copy atomic 64-byte loads and stores
instructions and its variants named under FEAT_{LS64, LS64_V}.
Add support for Armv8.7 FEAT_{LS64, LS64_V}:
- Add identifying and enabling in the cpufeature list
- Expose the support of these features to userspace through HWCAP3
and cpuinfo
- Add related hwcap test
- Handle the trap of unsupported memory (normal/uncacheable) access in a VM
A real scenario for this feature is that the userspace driver can make use of
this to implement direct WQE (workqueue entry) - a mechanism to fill WQE
directly into the hardware.
Picked Marc's 2 patches form [1] for handling the LS64 trap in a VM on emulated
MMIO and the introduce of KVM_EXIT_ARM_LDST64B.
[1] https://lore.kernel.org/linux-arm-kernel/20240815125959.2097734-1-maz@kerne…
Tested with hwcap test [*]:
[*] https://lore.kernel.org/linux-arm-kernel/20250331094320.35226-5-yangyicong@…
On host:
root@localhost:/tmp# dmesg | grep "All CPU(s) started"
[ 0.504846] CPU: All CPU(s) started at EL2
root@localhost:/tmp# ./hwcap
[...]
# LS64 present
ok 217 cpuinfo_match_LS64
ok 218 sigill_LS64
ok 219 # SKIP sigbus_LS64
# LS64_V present
ok 220 cpuinfo_match_LS64_V
ok 221 sigill_LS64_V
ok 222 # SKIP sigbus_LS64_V
# 115 skipped test(s) detected. Consider enabling relevant config options to improve coverage.
# Totals: pass:107 fail:0 xfail:0 xpass:0 skip:115 error:0
On guest:
root@localhost:/# dmesg | grep "All CPU(s) started"
[ 0.451482] CPU: All CPU(s) started at EL1
root@localhost:/mnt# ./hwcap
[...]
# LS64 present
ok 217 cpuinfo_match_LS64
ok 218 sigill_LS64
ok 219 # SKIP sigbus_LS64
# LS64_V present
ok 220 cpuinfo_match_LS64_V
ok 221 sigill_LS64_V
ok 222 # SKIP sigbus_LS64_V
# 115 skipped test(s) detected. Consider enabling relevant config options to improve coverage.
# Totals: pass:107 fail:0 xfail:0 xpass:0 skip:115 error:0
Change since v2:
- Handle the LS64 fault to userspace and allow userspace to inject LS64 fault
- Reorder the patches to make KVM handling prior to feature support
Link: https://lore.kernel.org/linux-arm-kernel/20250331094320.35226-1-yangyicong@…
Change since v1:
- Drop the support for LS64_ACCDATA
- handle the DABT of unsupported memory type after checking the memory attributes
Link: https://lore.kernel.org/linux-arm-kernel/20241202135504.14252-1-yangyicong@…
Marc Zyngier (2):
KVM: arm64: Add exit to userspace on {LD,ST}64B* outside of memslots
KVM: arm64: Add documentation for KVM_EXIT_ARM_LDST64B
Yicong Yang (5):
KVM: arm64: Handle DABT caused by LS64* instructions on unsupported
memory
KVM: arm/arm64: Allow user injection of unsupported exclusive/atomic
DABT
arm64: Provide basic EL2 setup for FEAT_{LS64, LS64_V} usage at EL0/1
arm64: Add support for FEAT_{LS64, LS64_V}
KVM: arm64: Enable FEAT_{LS64, LS64_V} in the supported guest
Documentation/arch/arm64/booting.rst | 12 ++++++
Documentation/arch/arm64/elf_hwcaps.rst | 6 +++
Documentation/virt/kvm/api.rst | 43 +++++++++++++++++----
arch/arm64/include/asm/el2_setup.h | 12 +++++-
arch/arm64/include/asm/esr.h | 8 ++++
arch/arm64/include/asm/hwcap.h | 2 +
arch/arm64/include/asm/kvm_emulate.h | 7 ++++
arch/arm64/include/uapi/asm/hwcap.h | 2 +
arch/arm64/include/uapi/asm/kvm.h | 3 +-
arch/arm64/kernel/cpufeature.c | 51 +++++++++++++++++++++++++
arch/arm64/kernel/cpuinfo.c | 2 +
arch/arm64/kvm/guest.c | 4 ++
arch/arm64/kvm/inject_fault.c | 29 ++++++++++++++
arch/arm64/kvm/mmio.c | 27 ++++++++++++-
arch/arm64/kvm/mmu.c | 21 +++++++++-
arch/arm64/tools/cpucaps | 2 +
include/uapi/linux/kvm.h | 3 +-
17 files changed, 222 insertions(+), 12 deletions(-)
--
2.24.0
Arguments passed into WEXITSTATUS() should have been initialized earlier.
Otherwise following warning show up while building platform selftests on
arm64. Hence just zero out all the relevant local variables to avoid the
build warning.
Warning: ‘status’ may be used uninitialized in this function [-Wmaybe-uninitialized]
Cc: Catalin Marinas <catalin.marinas(a)arm.com>
Cc: Will Deacon <will(a)kernel.org>
Cc: Shuah Khan <shuah(a)kernel.org>
Cc: Mark Brown <broonie(a)kernel.org>
Cc: linux-arm-kernel(a)lists.infradead.org
Cc: linux-kselftest(a)vger.kernel.org
Cc: linux-kernel(a)vger.kernel.org
Signed-off-by: Anshuman Khandual <anshuman.khandual(a)arm.com>
---
This applies on v6.16-rc3
tools/testing/selftests/arm64/abi/tpidr2.c | 2 +-
tools/testing/selftests/arm64/fp/za-fork.c | 2 +-
tools/testing/selftests/arm64/gcs/basic-gcs.c | 2 +-
3 files changed, 3 insertions(+), 3 deletions(-)
diff --git a/tools/testing/selftests/arm64/abi/tpidr2.c b/tools/testing/selftests/arm64/abi/tpidr2.c
index eb19dcc37a755..389a60e5feabf 100644
--- a/tools/testing/selftests/arm64/abi/tpidr2.c
+++ b/tools/testing/selftests/arm64/abi/tpidr2.c
@@ -96,7 +96,7 @@ static int write_sleep_read(void)
static int write_fork_read(void)
{
pid_t newpid, waiting, oldpid;
- int status;
+ int status = 0;
set_tpidr2(getpid());
diff --git a/tools/testing/selftests/arm64/fp/za-fork.c b/tools/testing/selftests/arm64/fp/za-fork.c
index 587b946482226..6098beb3515a0 100644
--- a/tools/testing/selftests/arm64/fp/za-fork.c
+++ b/tools/testing/selftests/arm64/fp/za-fork.c
@@ -24,7 +24,7 @@ int verify_fork(void);
int fork_test_c(void)
{
pid_t newpid, waiting;
- int child_status, parent_result;
+ int child_status = 0, parent_result;
newpid = fork();
if (newpid == 0) {
diff --git a/tools/testing/selftests/arm64/gcs/basic-gcs.c b/tools/testing/selftests/arm64/gcs/basic-gcs.c
index 3fb9742342a34..2b350b6b7e12c 100644
--- a/tools/testing/selftests/arm64/gcs/basic-gcs.c
+++ b/tools/testing/selftests/arm64/gcs/basic-gcs.c
@@ -240,7 +240,7 @@ static bool map_guarded_stack(void)
static bool test_fork(void)
{
unsigned long child_mode;
- int ret, status;
+ int ret, status = 0;
pid_t pid;
bool pass = true;
--
2.30.2
This is the last series to relocate sysctl tables from kernel/sysctl.c
into their respective subsystems. After the move of two ctl_tables
(uevent_helper & overflow{uid,gid}), five remain. They either handle
variables defined within sysctl.c or serve as a common place for
variables that are defined in different architectures. These five will
not be moved. Note that this series includes two auxiliary changes:
Removal of an unused variable and Nix-based rework of sysctl.sh test
script
By decentralizing sysctl registrations, subsystem maintainers regain
control over their sysctl interfaces, improving maintainability and
reducing the likelihood of merge conflicts. All this is made possible by
the work done to reduce the ctl_table memory footprint in commit
d7a76ec87195 ("sysctl: Remove check for sentinel element in ctl_table
arrays").
A few comments on the process:
1. If you prefer to merge this through a non-sysctl tree, please let me
know so I can avoid conflicts in linux-next.
2. Apologies if you were copied by mistake—let me know if you'd like to
be removed.
3. This series builds on [1], so please rebase accordingly for clean
application.
4. Testing done by running sysctl selftests on x86_64 and 0-day.
Comments/Suggestions greatly appreciated
[1] https://lore.kernel.org/20250509-jag-mv_ctltables_iter2-v1-0-d0ad83f5f4c3@k…
Signed-off-by: Joel Granados <joel.granados(a)kernel.org>
---
Joel Granados (5):
sysctl: Nixify sysctl.sh
sysctl: Removed unused variable
uevent: mv uevent_helper into kobject_uevent.c
kernel/sys.c: Move overflow{uid,gid} sysctl into kernel/sys.c
sysctl: rename kern_table -> sysctl_subsys_table
include/linux/sysctl.h | 1 -
kernel/sys.c | 29 +++++++++++++++++++
kernel/sysctl.c | 49 +++++++-------------------------
lib/kobject_uevent.c | 20 +++++++++++++
tools/testing/selftests/sysctl/sysctl.sh | 2 +-
5 files changed, 61 insertions(+), 40 deletions(-)
---
base-commit: 501dd0fbc76bcae57902ea000d9c6ccd9d5f226e
change-id: 20250627-jag-sysctl-823adf5732be
Best regards,
--
Joel Granados <joel.granados(a)kernel.org>
Currently testing of userspace and in-kernel API use two different
frameworks. kselftests for the userspace ones and Kunit for the
in-kernel ones. Besides their different scopes, both have different
strengths and limitations:
Kunit:
* Tests are normal kernel code.
* They use the regular kernel toolchain.
* They can be packaged and distributed as modules conveniently.
Kselftests:
* Tests are normal userspace code
* They need a userspace toolchain.
A kernel cross toolchain is likely not enough.
* A fair amout of userland is required to run the tests,
which means a full distro or handcrafted rootfs.
* There is no way to conveniently package and run kselftests with a
given kernel image.
* The kselftests makefiles are not as powerful as regular kbuild.
For example they are missing proper header dependency tracking or more
complex compiler option modifications.
Therefore kunit is much easier to run against different kernel
configurations and architectures.
This series aims to combine kselftests and kunit, avoiding both their
limitations. It works by compiling the userspace kselftests as part of
the regular kernel build, embedding them into the kunit kernel or module
and executing them from there. If the kernel toolchain is not fit to
produce userspace because of a missing libc, the kernel's own nolibc can
be used instead.
The structured TAP output from the kselftest is integrated into the
kunit KTAP output transparently, the kunit parser can parse the combined
logs together.
Further room for improvements:
* Call each test in its completely dedicated namespace
* Handle additional test files besides the test executable through
archives. CPIO, cramfs, etc.
* Compatibility with kselftest_harness.h (in progress)
* Expose the blobs in debugfs
* Provide some convience wrappers around compat userprogs
* Figure out a migration path/coexistence solution for
kunit UAPI and tools/testing/selftests/
Output from the kunit example testcase, note the output of
"example_uapi_tests".
$ ./tools/testing/kunit/kunit.py run --kunitconfig lib/kunit example
...
Running tests with:
$ .kunit/linux kunit.filter_glob=example kunit.enable=1 mem=1G console=tty kunit_shutdown=halt
[11:53:53] ================== example (10 subtests) ===================
[11:53:53] [PASSED] example_simple_test
[11:53:53] [SKIPPED] example_skip_test
[11:53:53] [SKIPPED] example_mark_skipped_test
[11:53:53] [PASSED] example_all_expect_macros_test
[11:53:53] [PASSED] example_static_stub_test
[11:53:53] [PASSED] example_static_stub_using_fn_ptr_test
[11:53:53] [PASSED] example_priv_test
[11:53:53] =================== example_params_test ===================
[11:53:53] [SKIPPED] example value 3
[11:53:53] [PASSED] example value 2
[11:53:53] [PASSED] example value 1
[11:53:53] [SKIPPED] example value 0
[11:53:53] =============== [PASSED] example_params_test ===============
[11:53:53] [PASSED] example_slow_test
[11:53:53] ======================= (4 subtests) =======================
[11:53:53] [PASSED] procfs
[11:53:53] [PASSED] userspace test 2
[11:53:53] [SKIPPED] userspace test 3: some reason
[11:53:53] [PASSED] userspace test 4
[11:53:53] ================ [PASSED] example_uapi_test ================
[11:53:53] ===================== [PASSED] example =====================
[11:53:53] ============================================================
[11:53:53] Testing complete. Ran 16 tests: passed: 11, skipped: 5
[11:53:53] Elapsed time: 67.543s total, 1.823s configuring, 65.655s building, 0.058s running
Based on v6.16-rc1.
Signed-off-by: Thomas Weißschuh <thomas.weissschuh(a)linutronix.de>
---
Changes in v4:
- Move Kconfig.nolibc from tools/ to init/
- Drop generic userprogs nolibc integration
- Drop generic blob framework
- Pick up review tags from David
- Extend new kunit TAP parser tests
- Add MAINTAINERS entry
- Allow CONFIG_KUNIT_UAPI=m
- Split /proc validation into dedicated UAPI test
- Trim recipient list a bit
- Use KUNIT_FAIL_AND_ABORT() over KUNIT_FAIL()
- Link to v3: https://lore.kernel.org/r/20250611-kunit-kselftests-v3-0-55e3d148cbc6@linut…
Changes in v3:
- Reintroduce CONFIG_CC_CAN_LINK_STATIC
- Enable CONFIG_ARCH_HAS_NOLIBC for m68k and SPARC
- Properly handle 'clean' target for userprogs
- Use ramfs over tmpfs to reduce dependencies
- Inherit userprogs byte order and ABI from kernel
- Drop now unnecessary "#ifndef NOLIBC"
- Pick up review tags
- Drop usage of __private in blob.h,
sparse complains and it is not really necessary
- Fix execution on loongarch when using clang
- Drop userprogs libgcc handling, it was ugly and is not yet necessary
- Link to v2: https://lore.kernel.org/r/20250407-kunit-kselftests-v2-0-454114e287fd@linut…
Changes in v2:
- Rebase onto v6.15-rc1
- Add documentation and kernel docs
- Resolve invalid kconfig breakages
- Drop already applied patch "kbuild: implement CONFIG_HEADERS_INSTALL for Usermode Linux"
- Drop userprogs CONFIG_WERROR integration, it doesn't need to be part of this series
- Replace patch prefix "kconfig" with "kbuild"
- Rename kunit_uapi_run_executable() to kunit_uapi_run_kselftest()
- Generate private, conflict-free symbols in the blob framework
- Handle kselftest exit codes
- Handle SIGABRT
- Forward output also to kunit debugfs log
- Install a fd=0 stdin filedescriptor
- Link to v1: https://lore.kernel.org/r/20250217-kunit-kselftests-v1-0-42b4524c3b0a@linut…
---
Thomas Weißschuh (15):
kbuild: userprogs: avoid duplication of flags inherited from kernel
kbuild: userprogs: also inherit byte order and ABI from kernel
kbuild: doc: add label for userprogs section
init: re-add CONFIG_CC_CAN_LINK_STATIC
init: add nolibc build support
fs,fork,exit: export symbols necessary for KUnit UAPI support
kunit: tool: Add test for nested test result reporting
kunit: tool: Don't overwrite test status based on subtest counts
kunit: tool: Parse skipped tests from kselftest.h
kunit: Always descend into kunit directory during build
kunit: qemu_configs: loongarch: Enable LSX/LSAX
kunit: Introduce UAPI testing framework
kunit: uapi: Add example for UAPI tests
kunit: uapi: Introduce preinit executable
kunit: uapi: Validate usability of /proc
Documentation/dev-tools/kunit/api/index.rst | 5 +
Documentation/dev-tools/kunit/api/uapi.rst | 14 +
Documentation/kbuild/makefiles.rst | 2 +
MAINTAINERS | 11 +
Makefile | 7 +-
fs/exec.c | 2 +
fs/file.c | 1 +
fs/filesystems.c | 2 +
fs/fs_struct.c | 1 +
fs/pipe.c | 2 +
include/kunit/uapi.h | 77 ++++++
init/Kconfig | 7 +
init/Kconfig.nolibc | 15 +
init/Makefile.nolibc | 13 +
kernel/exit.c | 3 +
kernel/fork.c | 2 +
lib/Makefile | 4 -
lib/kunit/Kconfig | 14 +
lib/kunit/Makefile | 27 +-
lib/kunit/kunit-example-test.c | 15 +
lib/kunit/kunit-example-uapi.c | 22 ++
lib/kunit/kunit-test-uapi.c | 51 ++++
lib/kunit/kunit-test.c | 23 +-
lib/kunit/kunit-uapi.c | 305 +++++++++++++++++++++
lib/kunit/uapi-preinit.c | 63 +++++
tools/testing/kunit/kunit_parser.py | 13 +-
tools/testing/kunit/kunit_tool_test.py | 11 +
tools/testing/kunit/qemu_configs/loongarch.py | 2 +
.../test_is_test_passed-failure-nested.log | 10 +
.../test_data/test_is_test_passed-kselftest.log | 3 +-
30 files changed, 714 insertions(+), 13 deletions(-)
---
base-commit: 9d5898b413d17510b2a41664a42390a2c79f8bf4
change-id: 20241015-kunit-kselftests-56273bc40442
Best regards,
--
Thomas Weißschuh <thomas.weissschuh(a)linutronix.de>
V11:
- Refactored the devlink code to accept relative TC bandwidth share
values instead of percentages.
- Updated documentation to clarify that values are interpreted as
relative shares.
- Refactored the logic in mlx5 to support proportional scaling for
tc-bw values.
- Switched to `nlmsg_for_each_attr_type()` for cleaner attribute
parsing.
- Added a hardware selftest to validate TC bandwidth behavior.
- Refactored esw_qos_is_node_empty for readability.
V10:
- Added netdevsim selftest for tc-bw ops.
- Dropped header: field as it’s unnecessary for local constants in
devlink.yaml.
V9:
- Defined DEVLINK_RATE_TCS_MAX as 8 in uapi/linux/devlink.h.
- Replaced IEEE_8021QAZ_MAX_TCS with DEVLINK_RATE_TCS_MAX throughout
the code.
- Updated devlink-rate-tc-index-max spec to reference the correct UAPI
header.
V8:
- Limit line width to 80 characters in mlx5 changes instead of 100.
- Increase the scheduling node levels to support TC arbitration.
- Ensure parent nodes are set correctly in all code paths that extend
the hierarchy depth for TC arbitration.
- Extended the cover letter with the ongoing discussion on devlink-rate
and net-shapers.
- Extended the cover letter with the Netdev talk link on this series.
V7:
- Fixed disabling tc-bw on leaf nodes that did not have tc-bw
configured.
- Fixed an issue where tc-bw was disabled on a node with assigned
vports, ensuring that vport->qos.sched_node->parent is correctly
updated with the cloned node.
- Declared a constant for the maximum allowed Traffic Class index in
devlink rate.
- Added a range check to validate rate-tc-index.
- Added documentation for the tc-bw argument.
- Add a validation check to ensure that the total bandwidth assigned to
all traffic classes sums to 100.
V6:
- Addressed comments on devlink patch #3.
- Removed first 4 IFC patches, to be pulled from mlx5-next.
V5:
- Fix warning in devlink_nl_rate_tc_bw_set().
- Fix target branch of patch #4.
V4:
- Renamed the nested attribute for traffic class bandwidth to
DEVLINK_ATTR_RATE_TC_BWS.
- Changed the order of the attributes in `devlink.h`.
- Refactored the initialization tc-bw array in
devlink_nl_rate_tc_bw_set().
- Added extack messages to provide clear feedback on issues with tc-bw
arguments.
- Updated `rate-tc-bws` to support a multi-attr set, where each
attribute includes an index and the corresponding bandwidth for that
traffic class.
- Handled the issue where the user could provide
DEVLINK_ATTR_RATE_TC_BWS with duplicate indices.
- Provided ynl exmaples in patch [1/5] commit message.
- Take IFC patches to beginning of the series, targeted for mlx5-next.
V3:
- Dropped rate-tc-index, using tc-bw array index instead.
- Renamed rate-bw to rate-tc-bw.
- Documneted what the rate-tc-bw represents and added a range check for
validation.
- Intorduced devlink_nl_rate_tc_bw_set() to parse and set the TC
bandwidth values.
- Updated the user API in the commit message of patch 1/6 to ensure
bandwidths sum equals 100.
- Fixed missing filling of rate-parent in devlink_nl_rate_fill().
V2:
- Included <linux/dcbnl.h> in devlink.h to resolve missing
IEEE_8021QAZ_MAX_TCS definition.
- Refactored the rate-tc-bw attribute structure to use a separate
rate-tc-index.
- Updated patch 2/6 title.
This patch series extends the devlink-rate API to support traffic class
(TC) bandwidth management, enabling more granular control over traffic
shaping and rate limiting across multiple TCs. The API now allows users
to specify bandwidth proportions for different traffic classes in a
single command. This is particularly useful for managing Enhanced
Transmission Selection (ETS) for groups of Virtual Functions (VFs),
allowing precise bandwidth allocation across traffic classes.
Additionally the series refines the QoS handling in net/mlx5 to support
TC arbitration and bandwidth management on vports and rate nodes.
Discussions on traffic class shaping in net-shapers began in V5 [1],
where we discussed with maintainers whether net-shapers should support
traffic classes and how this could be implemented.
Later, after further conversations with Paolo Abeni and Simon Horman,
Cosmin provided an update [2], confirming that net-shapers' tree-based
hierarchy aligns well with traffic classes when treated as distinct
subsets of netdev queues. Since mlx5 enforces a 1:1 mapping between TX
queues and traffic classes, this approach seems feasible, though some
open questions remain regarding queue reconfiguration and certain mlx5
scheduling behaviors.
Building on that discussion, Cosmin has now shared a concrete
implementation plan on the netdev mailing list [3]. The plan, developed
in collaboration with Paolo and Simon, outlines how net-shapers can be
extended to support the same use cases currently covered by
devlink-rate, with the eventual goal of aligning both and simplifying
the shaping infrastructure in the kernel.
This work was presented at Netdev 0x19 in Zagreb [4].
There we presented how TC scheduling is enforced in mlx5 hardware,
which led to discussions on the mailing list.
A summary of how things work:
Classification means labeling a packet with a traffic class based on
the packet's DSCP or VLAN PCP field, then treating packets with
different traffic classes differently during transmit processing.
In a virtualized setup, VFs are untrusted and do not control
classification or shaping. Classification is done by the hardware using
a prio-to-TC mapping set by the hypervisor. VFs only select which send
queue to use and are expected to respect the classification logic by
sending each traffic class on its dedicated queue. As stated in the
net-shapers plan [3], each transmit queue should carry only a single
traffic class. Mixing classes in a single queue can lead to HOL
blocking.
In the mlx5 implementation, if the queue used does not match the
classified traffic class, the hardware moves the queue to the correct
TC scheduler. This movement is not a reclassification; it’s a necessary
enforcement step to ensure traffic class isolation is maintained.
Extend devlink-rate API to support rate management on TCs:
- devlink: Extend the devlink rate API to support traffic class
bandwidth management
Introduce a no-op implementation:
- net/mlx5: Add no-op implementation for setting tc-bw on rate objects
Add support for enabling and disabling TC QoS on vports and nodes:
- net/mlx5: Add support for setting tc-bw on nodes
- net/mlx5: Add traffic class scheduling support for vport QoS
Support for setting tc-bw on rate objects:
- net/mlx5: Manage TC arbiter nodes and implement full support for
tc-bw
[1]
https://lore.kernel.org/netdev/20241204220931.254964-1-tariqt@nvidia.com/
[2]
https://lore.kernel.org/netdev/67df1a562614b553dcab043f347a0d7c5393ff83.cam…
[3]
https://lore.kernel.org/netdev/d9831d0c940a7b77419abe7c7330e822bbfd1cfb.cam…
[4]
https://netdevconf.info/0x19/sessions/talk/optimizing-bandwidth-allocation-…
Carolina Jubran (8):
netlink: introduce type-checking attribute iteration for nlmsg
devlink: Extend devlink rate API with traffic classes bandwidth management
selftest: netdevsim: Add devlink rate tc-bw test
net/mlx5: Add no-op implementation for setting tc-bw on rate objects
net/mlx5: Add support for setting tc-bw on nodes
net/mlx5: Add traffic class scheduling support for vport QoS
net/mlx5: Manage TC arbiter nodes and implement full support for tc-bw
selftests: drv-net: Add test for devlink-rate traffic class bandwidth distribution
Documentation/netlink/specs/devlink.yaml | 32 +-
.../networking/devlink/devlink-port.rst | 8 +
.../net/ethernet/mellanox/mlx5/core/devlink.c | 2 +
.../net/ethernet/mellanox/mlx5/core/esw/qos.c | 1037 ++++++++++++++++-
.../net/ethernet/mellanox/mlx5/core/esw/qos.h | 8 +
.../net/ethernet/mellanox/mlx5/core/eswitch.h | 14 +-
drivers/net/netdevsim/dev.c | 43 +
drivers/net/netdevsim/netdevsim.h | 1 +
drivers/net/vxlan/vxlan_vnifilter.c | 13 +-
fs/nfsd/nfsctl.c | 36 +-
include/net/devlink.h | 8 +
include/net/netlink.h | 14 +
include/uapi/linux/devlink.h | 9 +
net/devlink/netlink_gen.c | 15 +-
net/devlink/netlink_gen.h | 1 +
net/devlink/rate.c | 129 ++
.../drivers/net/hw/devlink_rate_tc_bw.py | 466 ++++++++
.../drivers/net/netdevsim/devlink.sh | 51 +
.../testing/selftests/net/lib/py/__init__.py | 2 +-
tools/testing/selftests/net/lib/py/ynl.py | 5 +
20 files changed, 1823 insertions(+), 71 deletions(-)
create mode 100755 tools/testing/selftests/drivers/net/hw/devlink_rate_tc_bw.py
base-commit: 8dacfd92dbefee829ca555a860e86108fdd1d55b
--
2.34.1
setup_wait() takes an optional argument and then is called from the top
level of the test script. That confuses shellcheck, which thinks that maybe
the intention is to pass $1 of the script to the function, which is never
the case. To avoid having to annotate every single new test with a SC
disable, split the function in two: one that takes a mandatory argument,
and one that takes no argument at all.
Convert the two existing users of that optional argument, both in Spectrum
resource selftest, to use the new form. Clean up vxlan_bridge_1q_mc_ul.sh
to not pass a now-unused argument.
Signed-off-by: Petr Machata <petrm(a)nvidia.com>
---
Notes:
CC: Shuah Khan <shuah(a)kernel.org>
CC: Matthieu Baerts <matttbe(a)kernel.org>
CC: linux-kselftest(a)vger.kernel.org
.../drivers/net/mlxsw/spectrum-2/resource_scale.sh | 2 +-
.../drivers/net/mlxsw/spectrum/resource_scale.sh | 2 +-
tools/testing/selftests/net/forwarding/lib.sh | 9 +++++++--
.../selftests/net/forwarding/vxlan_bridge_1q_mc_ul.sh | 2 +-
4 files changed, 10 insertions(+), 5 deletions(-)
diff --git a/tools/testing/selftests/drivers/net/mlxsw/spectrum-2/resource_scale.sh b/tools/testing/selftests/drivers/net/mlxsw/spectrum-2/resource_scale.sh
index 899b6892603f..d7505b933aef 100755
--- a/tools/testing/selftests/drivers/net/mlxsw/spectrum-2/resource_scale.sh
+++ b/tools/testing/selftests/drivers/net/mlxsw/spectrum-2/resource_scale.sh
@@ -51,7 +51,7 @@ for current_test in ${TESTS:-$ALL_TESTS}; do
fi
${current_test}_setup_prepare
- setup_wait $num_netifs
+ setup_wait_n $num_netifs
# Update target in case occupancy of a certain resource changed
# following the test setup.
target=$(${current_test}_get_target "$should_fail")
diff --git a/tools/testing/selftests/drivers/net/mlxsw/spectrum/resource_scale.sh b/tools/testing/selftests/drivers/net/mlxsw/spectrum/resource_scale.sh
index 482ebb744eba..7b98cdd0580d 100755
--- a/tools/testing/selftests/drivers/net/mlxsw/spectrum/resource_scale.sh
+++ b/tools/testing/selftests/drivers/net/mlxsw/spectrum/resource_scale.sh
@@ -55,7 +55,7 @@ for current_test in ${TESTS:-$ALL_TESTS}; do
continue
fi
${current_test}_setup_prepare
- setup_wait $num_netifs
+ setup_wait_n $num_netifs
# Update target in case occupancy of a certain resource
# changed following the test setup.
target=$(${current_test}_get_target "$should_fail")
diff --git a/tools/testing/selftests/net/forwarding/lib.sh b/tools/testing/selftests/net/forwarding/lib.sh
index 83ee6a07e072..9308b2f77fed 100644
--- a/tools/testing/selftests/net/forwarding/lib.sh
+++ b/tools/testing/selftests/net/forwarding/lib.sh
@@ -526,9 +526,9 @@ setup_wait_dev_with_timeout()
return 1
}
-setup_wait()
+setup_wait_n()
{
- local num_netifs=${1:-$NUM_NETIFS}
+ local num_netifs=$1; shift
local i
for ((i = 1; i <= num_netifs; ++i)); do
@@ -539,6 +539,11 @@ setup_wait()
sleep $WAIT_TIME
}
+setup_wait()
+{
+ setup_wait_n "$NUM_NETIFS"
+}
+
wait_for_dev()
{
local dev=$1; shift
diff --git a/tools/testing/selftests/net/forwarding/vxlan_bridge_1q_mc_ul.sh b/tools/testing/selftests/net/forwarding/vxlan_bridge_1q_mc_ul.sh
index 7ec58b6b1128..462db0b603e7 100755
--- a/tools/testing/selftests/net/forwarding/vxlan_bridge_1q_mc_ul.sh
+++ b/tools/testing/selftests/net/forwarding/vxlan_bridge_1q_mc_ul.sh
@@ -765,7 +765,7 @@ ipv6_mcroute_fdb_sep_rx()
trap cleanup EXIT
setup_prepare
-setup_wait "$NUM_NETIFS"
+setup_wait
tests_run
exit "$EXIT_STATUS"
--
2.49.0
glibc does not define SYS_futex for 32-bit architectures using 64-bit
time_t e.g. riscv32, therefore this test fails to compile since it does not
find SYS_futex in C library headers. Define SYS_futex as SYS_futex_time64
in this situation to ensure successful compilation and compatibility.
Signed-off-by: Ben Zong-You Xie <ben717(a)andestech.com>
Signed-off-by: Cynthia Huang <cynthia(a)andestech.com>
---
tools/testing/selftests/futex/include/futextest.h | 11 +++++++++++
1 file changed, 11 insertions(+)
diff --git a/tools/testing/selftests/futex/include/futextest.h b/tools/testing/selftests/futex/include/futextest.h
index ddbcfc9b7bac..7a5fd1d5355e 100644
--- a/tools/testing/selftests/futex/include/futextest.h
+++ b/tools/testing/selftests/futex/include/futextest.h
@@ -47,6 +47,17 @@ typedef volatile u_int32_t futex_t;
FUTEX_PRIVATE_FLAG)
#endif
+/*
+ * SYS_futex is expected from system C library, in glibc some 32-bit
+ * architectures (e.g. RV32) are using 64-bit time_t, therefore it doesn't have
+ * SYS_futex defined but just SYS_futex_time64. Define SYS_futex as
+ * SYS_futex_time64 in this situation to ensure the compilation and the
+ * compatibility.
+ */
+#if !defined(SYS_futex) && defined(SYS_futex_time64)
+#define SYS_futex SYS_futex_time64
+#endif
+
/**
* futex() - SYS_futex syscall wrapper
* @uaddr: address of first futex
--
2.34.1
Futex_waitv can not accept old_timespec32 struct, so userspace should
convert it from 32bit to 64bit before syscall in 32bit compatible mode.
This fix is based off [1]
Link: https://lore.kernel.org/all/20231203235117.29677-1-wegao@suse.com/ [1]
Signed-off-by: Terry Tritton <terry.tritton(a)linaro.org>
Signed-off-by: Wei Gao <wegao(a)suse.com>
---
The original patch is for an identically named file and function in ltp
and we need the same fix in kselftest. The patch is near identical with
only a slight change to `syscall` instead of `tst_syscall`.
Is the way I have tagged this appropriate?
.../testing/selftests/futex/include/futex2test.h | 15 +++++++++++++++
1 file changed, 15 insertions(+)
diff --git a/tools/testing/selftests/futex/include/futex2test.h b/tools/testing/selftests/futex/include/futex2test.h
index ea79662405bc..6780e51eb2d6 100644
--- a/tools/testing/selftests/futex/include/futex2test.h
+++ b/tools/testing/selftests/futex/include/futex2test.h
@@ -55,6 +55,13 @@ struct futex32_numa {
futex_t numa;
};
+#if !defined(__LP64__)
+struct timespec64 {
+ int64_t tv_sec;
+ int64_t tv_nsec;
+};
+#endif
+
/**
* futex_waitv - Wait at multiple futexes, wake on any
* @waiters: Array of waiters
@@ -65,7 +72,15 @@ struct futex32_numa {
static inline int futex_waitv(volatile struct futex_waitv *waiters, unsigned long nr_waiters,
unsigned long flags, struct timespec *timo, clockid_t clockid)
{
+#if !defined(__LP64__)
+ struct timespec64 timo64 = {0};
+
+ timo64.tv_sec = timo->tv_sec;
+ timo64.tv_nsec = timo->tv_nsec;
+ return syscall(__NR_futex_waitv, waiters, nr_waiters, flags, &timo64, clockid);
+#else
return syscall(__NR_futex_waitv, waiters, nr_waiters, flags, timo, clockid);
+#endif
}
/*
--
2.39.5
The vIOMMU object is designed to represent a slice of an IOMMU HW for its
virtualization features shared with or passed to user space (a VM mostly)
in a way of HW acceleration. This extended the HWPT-based design for more
advanced virtualization feature.
HW QUEUE introduced by this series as a part of the vIOMMU infrastructure
represents a HW accelerated queue/buffer for VM to use exclusively, e.g.
- NVIDIA's Virtual Command Queue
- AMD vIOMMU's Command Buffer, Event Log Buffer, and PPR Log Buffer
each of which allows its IOMMU HW to directly access a queue memory owned
by a guest VM and allows a guest OS to control the HW queue direclty, to
avoid VM Exit overheads to improve the performance.
Introduce IOMMUFD_OBJ_HW_QUEUE and its pairing IOMMUFD_CMD_HW_QUEUE_ALLOC
allowing VMM to forward the IOMMU-specific queue info, such as queue base
address, size, and etc.
Meanwhile, a guest-owned queue needs the guest kernel to control the queue
by reading/writing its consumer and producer indexes, via MMIO acceses to
the hardware MMIO registers. Introduce an mmap infrastructure for iommufd
to support passing through a piece of MMIO region from the host physical
address space to the guest physical address space. The mmap info (offset/
length) used by an mmap syscall must be pre-allocated and returned to the
user space via an output driver-data during an IOMMUFD_CMD_HW_QUEUE_ALLOC
call. Thus, it requires a driver-specific user data support in the vIOMMU
allocation flow.
As a real-world use case, this series implements a HW QUEUE support in the
tegra241-cmdqv driver for VCMDQs on NVIDIA Grace CPU. In another word, it
is also the Tegra CMDQV series Part-2 (user-space support), reworked from
Previous RFCv1:
https://lore.kernel.org/all/cover.1712978212.git.nicolinc@nvidia.com/
This enables the HW accelerated feature for NVIDIA Grace CPU. Compared to
the standard SMMUv3 operating in the nested translation mode trapping CMDQ
for TLBI and ATC_INV commands, this gives a huge performance improvement:
70% to 90% reductions of invalidation time were measured by various DMA
unmap tests running in a guest OS.
// Unmap latencies from "dma_map_benchmark -g @granule -t @threads",
// by toggling "/sys/kernel/debug/iommu/tegra241_cmdqv/bypass_vcmdq"
@granule | @threads | bypass_vcmdq=1 | bypass_vcmdq=0
4KB 1 35.7 us 5.3 us
16KB 1 41.8 us 6.8 us
64KB 1 68.9 us 9.9 us
128KB 1 109.0 us 12.6 us
256KB 1 187.1 us 18.0 us
4KB 2 96.9 us 6.8 us
16KB 2 97.8 us 7.5 us
64KB 2 151.5 us 10.7 us
128KB 2 257.8 us 12.7 us
256KB 2 443.0 us 17.9 us
This is on Github:
https://github.com/nicolinc/iommufd/commits/iommufd_hw_queue-v7
Paring QEMU branch for testing:
https://github.com/nicolinc/qemu/commits/wip/for_iommufd_hw_queue-v7
Changelog
v7
* Rebased on Jason's for-next tree (iommufd_hw_queue-prep series)
* Add Reviewed-by from Baolu, Jason, Pranjal
* Update kdocs and notes
* [iommu] Replace "u32" with "enum iommu_hw_info_type"
* [iommufd] Rename vdev->id to vdev->virt_id
* [iommufd] Replace macros with inline helpers
* [iommufd] Report unmapped_bytes in error path
* [iommufd] Add iommufd_access_is_internal helper
* [iommufd] Do not drop ops->unmap check for mdevs
* [iommufd] Store physical addresses in immap structure
* [iommufd] Reorder access and hw_queue object allocations
* [iommufd] Scan for an internal access before any unmap call
* [iommufd] Drop unused ictx pointer in struct iommufd_hw_queue
* [iommufd] Use kcalloc to avoid failure due to memory fragmentation
* [tegra] Use "else"
* [tegra] Lock destroy() using lvcmdq_mutex
v6
https://lore.kernel.org/all/cover.1749884998.git.nicolinc@nvidia.com/
* Rebase on iommufd_hw_queue-prep-v2
* Add Reviewed-by from Kevin and Jason
* [iommufd] Update kdocs and notes
* [iommufd] Drop redundant pages[i] check
* [iommufd] Allow nesting_parent_iova to be 0
* [iommufd] Add iommufd_hw_queue_alloc_phys()
* [iommufd] Revise iommufd_viommu_alloc/destroy_mmap APIs
* [iommufd] Move destroy ops to vdevice/hw_queue structures
* [iommufd] Add union in hw_info struct to share out_data_type field
* [iommufd] Replace iopt_pin/unpin_pages() with internal access APIs
* [iommufd] Replace vdevice_alloc with vdevice_size and vdevice_init
* [iommufd] Replace hw_queue_alloc with get_hw_queue_size/hw_queue_init
* [iommufd] Replace IOMMUFD_VIOMMU_FLAG_HW_QUEUE_READS_PA with init_phys
* [smmu] Drop arm_smmu_domain_ipa_to_pa
* [smmu] Update arm_smmu_impl_ops changes for vsmmu_init
* [tegra] Add a vdev_to_vsid macro
* [tegra] Add lvcmdq_mutex to protect multi queues
* [tegra] Drop duplicated kcalloc for vintf->lvcmdqs (memory leak)
v5
https://lore.kernel.org/all/cover.1747537752.git.nicolinc@nvidia.com/
* Rebase on v6.15-rc6
* Add Reviewed-by from Jason and Kevin
* Correct typos in kdoc and update commit logs
* [iommufd] Add a cosmetic fix
* [iommufd] Drop unused num_pfns
* [iommufd] Drop unnecessary check
* [iommufd] Reorder patch sequence
* [iommufd] Use io_remap_pfn_range()
* [iommufd] Use success oriented flow
* [iommufd] Fix max_npages calculation
* [iommufd] Add more selftest coverage
* [iommufd] Drop redundant static_assert
* [iommufd] Fix mmap pfn range validation
* [iommufd] Reject unmap on pinned iovas
* [iommufd] Drop redundant vm_flags_set()
* [iommufd] Drop iommufd_struct_destroy()
* [iommufd] Drop redundant queue iova test
* [iommufd] Use "mmio_addr" and "mmio_pfn"
* [iommufd] Rename to "nesting_parent_iova"
* [iommufd] Make iopt_pin_pages call option
* [iommufd] Add ictx comparison in depend()
* [iommufd] Add iommufd_object_alloc_ucmd()
* [iommufd] Move kcalloc() after validations
* [iommufd] Replace ictx setting with WARN_ON
* [iommufd] Make hw_info's type bidirectional
* [smmu] Add supported_vsmmu_type in impl_ops
* [smmu] Drop impl report in smmu vendor struct
* [tegra] Add IOMMU_HW_INFO_TYPE_TEGRA241_CMDQV
* [tegra] Replace "number of VINTFs" with a note
* [tegra] Drop the redundant lvcmdq pointer setting
* [tegra] Flag IOMMUFD_VIOMMU_FLAG_HW_QUEUE_READS_PA
* [tegra] Use "vintf_alloc_vsid" for vdevice_alloc op
v4
https://lore.kernel.org/all/cover.1746757630.git.nicolinc@nvidia.com/
* Rebase on v6.15-rc5
* Add Reviewed-by from Vasant
* Rename "vQUEUE" to "HW QUEUE"
* Use "offset" and "length" for all mmap-related variables
* [iommufd] Use u64 for guest PA
* [iommufd] Fix typo in uAPI doc
* [iommufd] Rename immap_id to offset
* [iommufd] Drop the partial-size mmap support
* [iommufd] Do not replace WARN_ON with WARN_ON_ONCE
* [iommufd] Use "u64 base_addr" for queue base address
* [iommufd] Use u64 base_pfn/num_pfns for immap structure
* [iommufd] Correct the size passed in to mtree_alloc_range()
* [iommufd] Add IOMMUFD_VIOMMU_FLAG_HW_QUEUE_READS_PA to viommu_ops
v3
https://lore.kernel.org/all/cover.1746139811.git.nicolinc@nvidia.com/
* Add Reviewed-by from Baolu, Pranjal, and Alok
* Revise kdocs, uAPI docs, and commit logs
* Rename "vCMDQ" back to "vQUEUE" for AMD cases
* [tegra] Add tegra241_vcmdq_hw_flush_timeout()
* [tegra] Rename vsmmu_alloc to alloc_vintf_user
* [tegra] Use writel for SID replacement registers
* [tegra] Move mmap removal call to vsmmu_destroy op
* [tegra] Fix revert in tegra241_vintf_alloc_lvcmdq_user()
* [iommufd] Replace "& ~PAGE_MASK" with PAGE_ALIGNED()
* [iommufd] Add an object-type "owner" to immap structure
* [iommufd] Drop the ictx input in the new for-driver APIs
* [iommufd] Add iommufd_vma_ops to keep track of mmap lifecycle
* [iommufd] Add viommu-based iommufd_viommu_alloc/destroy_mmap helpers
* [iommufd] Rename iommufd_ctx_alloc/free_mmap to
_iommufd_alloc/destroy_mmap
v2
https://lore.kernel.org/all/cover.1745646960.git.nicolinc@nvidia.com/
* Add Reviewed-by from Jason
* [smmu] Fix vsmmu initial value
* [smmu] Support impl for hw_info
* [tegra] Rename "slot" to "vsid"
* [tegra] Update kdocs and commit logs
* [tegra] Map/unmap LVCMDQ dynamically
* [tegra] Refcount the previous LVCMDQ
* [tegra] Return -EEXIST if LVCMDQ exists
* [tegra] Simplify VINTF cleanup routine
* [tegra] Use vmid and s2_domain in vsmmu
* [tegra] Rename "mmap_pgoff" to "immap_id"
* [tegra] Add more addr and length validation
* [iommufd] Add more narrative to mmap's kdoc
* [iommufd] Add iommufd_struct_depend/undepend()
* [iommufd] Rename vcmdq_free op to vcmdq_destroy
* [iommufd] Fix bug in iommu_copy_struct_to_user()
* [iommufd] Drop is_io from iommufd_ctx_alloc_mmap()
* [iommufd] Test the queue memory for its contiguity
* [iommufd] Return -ENXIO if address or length fails
* [iommufd] Do not change @min_last in mock_viommu_alloc()
* [iommufd] Generalize TEGRA241_VCMDQ data in core structure
* [iommufd] Add selftest coverage for IOMMUFD_CMD_VCMDQ_ALLOC
* [iommufd] Add iopt_pin_pages() to prevent queue memory from unmapping
v1
https://lore.kernel.org/all/cover.1744353300.git.nicolinc@nvidia.com/
Thanks
Nicolin
Nicolin Chen (28):
iommufd: Report unmapped bytes in the error path of
iopt_unmap_iova_range
iommufd/viommu: Explicitly define vdev->virt_id
iommu: Use enum iommu_hw_info_type for type in hw_info op
iommu: Add iommu_copy_struct_to_user helper
iommu: Pass in a driver-level user data structure to viommu_init op
iommufd/viommu: Allow driver-specific user data for a vIOMMU object
iommufd/selftest: Support user_data in mock_viommu_alloc
iommufd/selftest: Add coverage for viommu data
iommufd/access: Add internal APIs for HW queue to use
iommufd/access: Bypass access->ops->unmap for internal use
iommufd/viommu: Add driver-defined vDEVICE support
iommufd/viommu: Introduce IOMMUFD_OBJ_HW_QUEUE and its related struct
iommufd/viommu: Add IOMMUFD_CMD_HW_QUEUE_ALLOC ioctl
iommufd/driver: Add iommufd_hw_queue_depend/undepend() helpers
iommufd/selftest: Add coverage for IOMMUFD_CMD_HW_QUEUE_ALLOC
iommufd: Add mmap interface
iommufd/selftest: Add coverage for the new mmap interface
Documentation: userspace-api: iommufd: Update HW QUEUE
iommu: Allow an input type in hw_info op
iommufd: Allow an input data_type via iommu_hw_info
iommufd/selftest: Update hw_info coverage for an input data_type
iommu/arm-smmu-v3-iommufd: Add vsmmu_size/type and vsmmu_init impl ops
iommu/arm-smmu-v3-iommufd: Add hw_info to impl_ops
iommu/tegra241-cmdqv: Use request_threaded_irq
iommu/tegra241-cmdqv: Simplify deinit flow in
tegra241_cmdqv_remove_vintf()
iommu/tegra241-cmdqv: Do not statically map LVCMDQs
iommu/tegra241-cmdqv: Add user-space use support
iommu/tegra241-cmdqv: Add IOMMU_VEVENTQ_TYPE_TEGRA241_CMDQV support
drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h | 22 +-
drivers/iommu/iommufd/iommufd_private.h | 50 +-
drivers/iommu/iommufd/iommufd_test.h | 20 +
include/linux/iommu.h | 50 +-
include/linux/iommufd.h | 160 ++++++
include/uapi/linux/iommufd.h | 145 +++++-
tools/testing/selftests/iommu/iommufd_utils.h | 89 +++-
.../arm/arm-smmu-v3/arm-smmu-v3-iommufd.c | 28 +-
.../iommu/arm/arm-smmu-v3/tegra241-cmdqv.c | 484 +++++++++++++++++-
drivers/iommu/intel/iommu.c | 7 +-
drivers/iommu/iommufd/device.c | 90 +++-
drivers/iommu/iommufd/driver.c | 81 ++-
drivers/iommu/iommufd/io_pagetable.c | 17 +-
drivers/iommu/iommufd/main.c | 69 +++
drivers/iommu/iommufd/selftest.c | 153 +++++-
drivers/iommu/iommufd/viommu.c | 208 +++++++-
tools/testing/selftests/iommu/iommufd.c | 143 +++++-
.../selftests/iommu/iommufd_fail_nth.c | 15 +-
Documentation/userspace-api/iommufd.rst | 12 +
19 files changed, 1736 insertions(+), 107 deletions(-)
--
2.43.0
The step_after_suspend_test verifies that the system successfully
suspended and resumed by setting a timerfd and checking whether the
timer fully expired. However, this method is unreliable due to timing
races.
In practice, the system may take time to enter suspend, during which the
timer may expire just before or during the transition. As a result,
the remaining time after resume may show non-zero nanoseconds, even if
suspend/resume completed successfully. This leads to false test failures.
Replace the timer-based check with a read from
/sys/power/suspend_stats/success. This counter is incremented only
after a full suspend/resume cycle, providing a reliable and race-free
indicator.
Also remove the unused file descriptor for /sys/power/state, which
remained after switching to a system() call to trigger suspend [1].
[1] https://lore.kernel.org/all/20240930224025.2858767-1-yifei.l.liu@oracle.com/
Fixes: c66be905cda2 ("selftests: breakpoints: use remaining time to check if suspend succeed")
Signed-off-by: Moon Hee Lee <moonhee.lee.ca(a)gmail.com>
---
.../breakpoints/step_after_suspend_test.c | 41 ++++++++++++++-----
1 file changed, 31 insertions(+), 10 deletions(-)
diff --git a/tools/testing/selftests/breakpoints/step_after_suspend_test.c b/tools/testing/selftests/breakpoints/step_after_suspend_test.c
index 8d275f03e977..8d233ac95696 100644
--- a/tools/testing/selftests/breakpoints/step_after_suspend_test.c
+++ b/tools/testing/selftests/breakpoints/step_after_suspend_test.c
@@ -127,22 +127,42 @@ int run_test(int cpu)
return KSFT_PASS;
}
+/*
+ * Reads the suspend success count from sysfs.
+ * Returns the count on success or exits on failure.
+ */
+static int get_suspend_success_count_or_fail(void)
+{
+ FILE *fp;
+ int val;
+
+ fp = fopen("/sys/power/suspend_stats/success", "r");
+ if (!fp)
+ ksft_exit_fail_msg(
+ "Failed to open suspend_stats/success: %s\n",
+ strerror(errno));
+
+ if (fscanf(fp, "%d", &val) != 1) {
+ fclose(fp);
+ ksft_exit_fail_msg(
+ "Failed to read suspend success count\n");
+ }
+
+ fclose(fp);
+ return val;
+}
+
void suspend(void)
{
- int power_state_fd;
int timerfd;
int err;
+ int count_before;
+ int count_after;
struct itimerspec spec = {};
if (getuid() != 0)
ksft_exit_skip("Please run the test as root - Exiting.\n");
- power_state_fd = open("/sys/power/state", O_RDWR);
- if (power_state_fd < 0)
- ksft_exit_fail_msg(
- "open(\"/sys/power/state\") failed %s)\n",
- strerror(errno));
-
timerfd = timerfd_create(CLOCK_BOOTTIME_ALARM, 0);
if (timerfd < 0)
ksft_exit_fail_msg("timerfd_create() failed\n");
@@ -152,14 +172,15 @@ void suspend(void)
if (err < 0)
ksft_exit_fail_msg("timerfd_settime() failed\n");
+ count_before = get_suspend_success_count_or_fail();
+
system("(echo mem > /sys/power/state) 2> /dev/null");
- timerfd_gettime(timerfd, &spec);
- if (spec.it_value.tv_sec != 0 || spec.it_value.tv_nsec != 0)
+ count_after = get_suspend_success_count_or_fail();
+ if (count_after <= count_before)
ksft_exit_fail_msg("Failed to enter Suspend state\n");
close(timerfd);
- close(power_state_fd);
}
int main(int argc, char **argv)
--
2.43.0
The vIOMMU object is designed to represent a slice of an IOMMU HW for its
virtualization features shared with or passed to user space (a VM mostly)
in a way of HW acceleration. This extended the HWPT-based design for more
advanced virtualization feature.
HW QUEUE introduced by this series as a part of the vIOMMU infrastructure
represents a HW accelerated queue/buffer for VM to use exclusively, e.g.
- NVIDIA's Virtual Command Queue
- AMD vIOMMU's Command Buffer, Event Log Buffer, and PPR Log Buffer
each of which allows its IOMMU HW to directly access a queue memory owned
by a guest VM and allows a guest OS to control the HW queue direclty, to
avoid VM Exit overheads to improve the performance.
Introduce IOMMUFD_OBJ_HW_QUEUE and its pairing IOMMUFD_CMD_HW_QUEUE_ALLOC
allowing VMM to forward the IOMMU-specific queue info, such as queue base
address, size, and etc.
Meanwhile, a guest-owned queue needs the guest kernel to control the queue
by reading/writing its consumer and producer indexes, via MMIO acceses to
the hardware MMIO registers. Introduce an mmap infrastructure for iommufd
to support passing through a piece of MMIO region from the host physical
address space to the guest physical address space. The mmap info (offset/
length) used by an mmap syscall must be pre-allocated and returned to the
user space via an output driver-data during an IOMMUFD_CMD_HW_QUEUE_ALLOC
call. Thus, it requires a driver-specific user data support in the vIOMMU
allocation flow.
As a real-world use case, this series implements a HW QUEUE support in the
tegra241-cmdqv driver for VCMDQs on NVIDIA Grace CPU. In another word, it
is also the Tegra CMDQV series Part-2 (user-space support), reworked from
Previous RFCv1:
https://lore.kernel.org/all/cover.1712978212.git.nicolinc@nvidia.com/
This enables the HW accelerated feature for NVIDIA Grace CPU. Compared to
the standard SMMUv3 operating in the nested translation mode trapping CMDQ
for TLBI and ATC_INV commands, this gives a huge performance improvement:
70% to 90% reductions of invalidation time were measured by various DMA
unmap tests running in a guest OS.
// Unmap latencies from "dma_map_benchmark -g @granule -t @threads",
// by toggling "/sys/kernel/debug/iommu/tegra241_cmdqv/bypass_vcmdq"
@granule | @threads | bypass_vcmdq=1 | bypass_vcmdq=0
4KB 1 35.7 us 5.3 us
16KB 1 41.8 us 6.8 us
64KB 1 68.9 us 9.9 us
128KB 1 109.0 us 12.6 us
256KB 1 187.1 us 18.0 us
4KB 2 96.9 us 6.8 us
16KB 2 97.8 us 7.5 us
64KB 2 151.5 us 10.7 us
128KB 2 257.8 us 12.7 us
256KB 2 443.0 us 17.9 us
This is on Github:
https://github.com/nicolinc/iommufd/commits/iommufd_hw_queue-v6
Paring QEMU branch for testing:
https://github.com/nicolinc/qemu/commits/wip/for_iommufd_hw_queue-v6
Changelog
v6
* Rebase on iommufd_hw_queue-prep-v2
* Add Reviewed-by from Kevin and Jason
* [iommufd] Update kdocs and notes
* [iommufd] Drop redundant pages[i] check
* [iommufd] Allow nesting_parent_iova to be 0
* [iommufd] Add iommufd_hw_queue_alloc_phys()
* [iommufd] Revise iommufd_viommu_alloc/destroy_mmap APIs
* [iommufd] Move destroy ops to vdevice/hw_queue structures
* [iommufd] Add union in hw_info struct to share out_data_type field
* [iommufd] Replace iopt_pin/unpin_pages() with internal access APIs
* [iommufd] Replace vdevice_alloc with vdevice_size and vdevice_init
* [iommufd] Replace hw_queue_alloc with get_hw_queue_size/hw_queue_init
* [iommufd] Replace IOMMUFD_VIOMMU_FLAG_HW_QUEUE_READS_PA with init_phys
* [smmu] Drop arm_smmu_domain_ipa_to_pa
* [smmu] Update arm_smmu_impl_ops changes for vsmmu_init
* [tegra] Add a vdev_to_vsid macro
* [tegra] Add lvcmdq_mutex to protect multi queues
* [tegra] Drop duplicated kcalloc for vintf->lvcmdqs (memory leak)
v5
https://lore.kernel.org/all/cover.1747537752.git.nicolinc@nvidia.com/
* Rebase on v6.15-rc6
* Add Reviewed-by from Jason and Kevin
* Correct typos in kdoc and update commit logs
* [iommufd] Add a cosmetic fix
* [iommufd] Drop unused num_pfns
* [iommufd] Drop unnecessary check
* [iommufd] Reorder patch sequence
* [iommufd] Use io_remap_pfn_range()
* [iommufd] Use success oriented flow
* [iommufd] Fix max_npages calculation
* [iommufd] Add more selftest coverage
* [iommufd] Drop redundant static_assert
* [iommufd] Fix mmap pfn range validation
* [iommufd] Reject unmap on pinned iovas
* [iommufd] Drop redundant vm_flags_set()
* [iommufd] Drop iommufd_struct_destroy()
* [iommufd] Drop redundant queue iova test
* [iommufd] Use "mmio_addr" and "mmio_pfn"
* [iommufd] Rename to "nesting_parent_iova"
* [iommufd] Make iopt_pin_pages call option
* [iommufd] Add ictx comparison in depend()
* [iommufd] Add iommufd_object_alloc_ucmd()
* [iommufd] Move kcalloc() after validations
* [iommufd] Replace ictx setting with WARN_ON
* [iommufd] Make hw_info's type bidirectional
* [smmu] Add supported_vsmmu_type in impl_ops
* [smmu] Drop impl report in smmu vendor struct
* [tegra] Add IOMMU_HW_INFO_TYPE_TEGRA241_CMDQV
* [tegra] Replace "number of VINTFs" with a note
* [tegra] Drop the redundant lvcmdq pointer setting
* [tegra] Flag IOMMUFD_VIOMMU_FLAG_HW_QUEUE_READS_PA
* [tegra] Use "vintf_alloc_vsid" for vdevice_alloc op
v4
https://lore.kernel.org/all/cover.1746757630.git.nicolinc@nvidia.com/
* Rebase on v6.15-rc5
* Add Reviewed-by from Vasant
* Rename "vQUEUE" to "HW QUEUE"
* Use "offset" and "length" for all mmap-related variables
* [iommufd] Use u64 for guest PA
* [iommufd] Fix typo in uAPI doc
* [iommufd] Rename immap_id to offset
* [iommufd] Drop the partial-size mmap support
* [iommufd] Do not replace WARN_ON with WARN_ON_ONCE
* [iommufd] Use "u64 base_addr" for queue base address
* [iommufd] Use u64 base_pfn/num_pfns for immap structure
* [iommufd] Correct the size passed in to mtree_alloc_range()
* [iommufd] Add IOMMUFD_VIOMMU_FLAG_HW_QUEUE_READS_PA to viommu_ops
v3
https://lore.kernel.org/all/cover.1746139811.git.nicolinc@nvidia.com/
* Add Reviewed-by from Baolu, Pranjal, and Alok
* Revise kdocs, uAPI docs, and commit logs
* Rename "vCMDQ" back to "vQUEUE" for AMD cases
* [tegra] Add tegra241_vcmdq_hw_flush_timeout()
* [tegra] Rename vsmmu_alloc to alloc_vintf_user
* [tegra] Use writel for SID replacement registers
* [tegra] Move mmap removal call to vsmmu_destroy op
* [tegra] Fix revert in tegra241_vintf_alloc_lvcmdq_user()
* [iommufd] Replace "& ~PAGE_MASK" with PAGE_ALIGNED()
* [iommufd] Add an object-type "owner" to immap structure
* [iommufd] Drop the ictx input in the new for-driver APIs
* [iommufd] Add iommufd_vma_ops to keep track of mmap lifecycle
* [iommufd] Add viommu-based iommufd_viommu_alloc/destroy_mmap helpers
* [iommufd] Rename iommufd_ctx_alloc/free_mmap to
_iommufd_alloc/destroy_mmap
v2
https://lore.kernel.org/all/cover.1745646960.git.nicolinc@nvidia.com/
* Add Reviewed-by from Jason
* [smmu] Fix vsmmu initial value
* [smmu] Support impl for hw_info
* [tegra] Rename "slot" to "vsid"
* [tegra] Update kdocs and commit logs
* [tegra] Map/unmap LVCMDQ dynamically
* [tegra] Refcount the previous LVCMDQ
* [tegra] Return -EEXIST if LVCMDQ exists
* [tegra] Simplify VINTF cleanup routine
* [tegra] Use vmid and s2_domain in vsmmu
* [tegra] Rename "mmap_pgoff" to "immap_id"
* [tegra] Add more addr and length validation
* [iommufd] Add more narrative to mmap's kdoc
* [iommufd] Add iommufd_struct_depend/undepend()
* [iommufd] Rename vcmdq_free op to vcmdq_destroy
* [iommufd] Fix bug in iommu_copy_struct_to_user()
* [iommufd] Drop is_io from iommufd_ctx_alloc_mmap()
* [iommufd] Test the queue memory for its contiguity
* [iommufd] Return -ENXIO if address or length fails
* [iommufd] Do not change @min_last in mock_viommu_alloc()
* [iommufd] Generalize TEGRA241_VCMDQ data in core structure
* [iommufd] Add selftest coverage for IOMMUFD_CMD_VCMDQ_ALLOC
* [iommufd] Add iopt_pin_pages() to prevent queue memory from unmapping
v1
https://lore.kernel.org/all/cover.1744353300.git.nicolinc@nvidia.com/
Thanks
Nicolin
Nicolin Chen (25):
iommu: Add iommu_copy_struct_to_user helper
iommu: Pass in a driver-level user data structure to viommu_init op
iommufd/viommu: Allow driver-specific user data for a vIOMMU object
iommufd/selftest: Support user_data in mock_viommu_alloc
iommufd/selftest: Add coverage for viommu data
iommufd/access: Allow access->ops to be NULL for internal use
iommufd/access: Add internal APIs for HW queue to use
iommufd/viommu: Add driver-defined vDEVICE support
iommufd/viommu: Introduce IOMMUFD_OBJ_HW_QUEUE and its related struct
iommufd/viommu: Add IOMMUFD_CMD_HW_QUEUE_ALLOC ioctl
iommufd/driver: Add iommufd_hw_queue_depend/undepend() helpers
iommufd/selftest: Add coverage for IOMMUFD_CMD_HW_QUEUE_ALLOC
iommufd: Add mmap interface
iommufd/selftest: Add coverage for the new mmap interface
Documentation: userspace-api: iommufd: Update HW QUEUE
iommu: Allow an input type in hw_info op
iommufd: Allow an input data_type via iommu_hw_info
iommufd/selftest: Update hw_info coverage for an input data_type
iommu/arm-smmu-v3-iommufd: Add vsmmu_size/type and vsmmu_init impl ops
iommu/arm-smmu-v3-iommufd: Add hw_info to impl_ops
iommu/tegra241-cmdqv: Use request_threaded_irq
iommu/tegra241-cmdqv: Simplify deinit flow in
tegra241_cmdqv_remove_vintf()
iommu/tegra241-cmdqv: Do not statically map LVCMDQs
iommu/tegra241-cmdqv: Add user-space use support
iommu/tegra241-cmdqv: Add IOMMU_VEVENTQ_TYPE_TEGRA241_CMDQV support
drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h | 16 +-
drivers/iommu/iommufd/iommufd_private.h | 35 +-
drivers/iommu/iommufd/iommufd_test.h | 20 +
include/linux/iommu.h | 49 +-
include/linux/iommufd.h | 156 ++++++
include/uapi/linux/iommufd.h | 145 +++++-
tools/testing/selftests/iommu/iommufd_utils.h | 89 +++-
.../arm/arm-smmu-v3/arm-smmu-v3-iommufd.c | 25 +-
.../iommu/arm/arm-smmu-v3/tegra241-cmdqv.c | 481 +++++++++++++++++-
drivers/iommu/intel/iommu.c | 4 +
drivers/iommu/iommufd/device.c | 84 ++-
drivers/iommu/iommufd/driver.c | 79 +++
drivers/iommu/iommufd/io_pagetable.c | 17 +-
drivers/iommu/iommufd/main.c | 70 +++
drivers/iommu/iommufd/selftest.c | 150 +++++-
drivers/iommu/iommufd/viommu.c | 218 +++++++-
tools/testing/selftests/iommu/iommufd.c | 143 +++++-
.../selftests/iommu/iommufd_fail_nth.c | 15 +-
Documentation/userspace-api/iommufd.rst | 12 +
19 files changed, 1708 insertions(+), 100 deletions(-)
--
2.43.0
Hi,
I noticed that only the RSEQ membarrier command allows specifying a
specific cpu. I have a (extremely toy) lib I was playing around with [1] and
noticed this and that being able to specify the cpu would be useful to
me.
I'm by no means an expert in this code though - and so could be
totally missing something.
Additionally this seems really difficult to actually test. I added a
self test that just proces "some" interrupts are being sent, but
nothing more than that. I don't know what else can be done there.
Patch 1 is the main change
Patch 2 is the self-test. Which maybe you don't want at all.
[1]: https://github.com/DylanZA/rseqlock/commit/be7bc7214fd5aacec47e26126118f8bb…
Thanks,
Dylan
Dylan Yudaken (2):
membarrier: allow cpu_id to be set on more commands
membarrier: self test for cpu specific calls
kernel/sched/membarrier.c | 6 +-
tools/testing/selftests/membarrier/.gitignore | 1 +
tools/testing/selftests/membarrier/Makefile | 3 +-
.../membarrier/membarrier_test_expedited.c | 135 ++++++++++++++++++
.../membarrier/membarrier_test_impl.h | 5 +
5 files changed, 148 insertions(+), 2 deletions(-)
create mode 100644 tools/testing/selftests/membarrier/membarrier_test_expedited.c
base-commit: ee88bddf7f2f5d1f1da87dd7bedc734048b70e88
--
2.49.0
To accommodate varying hardware performance and use cases,
the default kunit test case timeout (currently 300 seconds)
is now configurable. Users can adjust the timeout by
either setting the 'timeout' module parameter or the
KUNIT_DEFAULT_TIMEOUT Kconfig option to their desired
timeout in seconds.
Signed-off-by: Marie Zhussupova <marievic(a)google.com>
---
lib/kunit/Kconfig | 13 +++++++++++++
lib/kunit/test.c | 15 ++++++++-------
2 files changed, 21 insertions(+), 7 deletions(-)
diff --git a/lib/kunit/Kconfig b/lib/kunit/Kconfig
index a97897edd964..c10ede4b1d22 100644
--- a/lib/kunit/Kconfig
+++ b/lib/kunit/Kconfig
@@ -93,4 +93,17 @@ config KUNIT_AUTORUN_ENABLED
In most cases this should be left as Y. Only if additional opt-in
behavior is needed should this be set to N.
+config KUNIT_DEFAULT_TIMEOUT
+ int "Default value of the timeout module parameter"
+ default 300
+ help
+ Sets the default timeout, in seconds, for Kunit test cases. This value
+ is further multiplied by a factor determined by the assigned speed
+ setting: 1x for `DEFAULT`, 3x for `KUNIT_SPEED_SLOW`, and 12x for
+ `KUNIT_SPEED_VERY_SLOW`. This allows slower tests on slower machines
+ sufficient time to complete.
+
+ If unsure, the default timeout of 300 seconds is suitable for most
+ cases.
+
endif # KUNIT
diff --git a/lib/kunit/test.c b/lib/kunit/test.c
index 002121675605..f3c6b11f12b8 100644
--- a/lib/kunit/test.c
+++ b/lib/kunit/test.c
@@ -69,6 +69,13 @@ static bool enable_param;
module_param_named(enable, enable_param, bool, 0);
MODULE_PARM_DESC(enable, "Enable KUnit tests");
+/*
+ * Configure the base timeout.
+ */
+static unsigned long kunit_base_timeout = CONFIG_KUNIT_DEFAULT_TIMEOUT;
+module_param_named(timeout, kunit_base_timeout, ulong, 0644);
+MODULE_PARM_DESC(timeout, "Set the base timeout for Kunit test cases");
+
/*
* KUnit statistic mode:
* 0 - disabled
@@ -393,12 +400,6 @@ static int kunit_timeout_mult(enum kunit_speed speed)
static unsigned long kunit_test_timeout(struct kunit_suite *suite, struct kunit_case *test_case)
{
int mult = 1;
- /*
- * TODO: Make the default (base) timeout configurable, so that users with
- * particularly slow or fast machines can successfully run tests, while
- * still taking advantage of the relative speed.
- */
- unsigned long default_timeout = 300;
/*
* The default test timeout is 300 seconds and will be adjusted by mult
@@ -409,7 +410,7 @@ static unsigned long kunit_test_timeout(struct kunit_suite *suite, struct kunit_
mult = kunit_timeout_mult(suite->attr.speed);
if (test_case->attr.speed != KUNIT_SPEED_UNSET)
mult = kunit_timeout_mult(test_case->attr.speed);
- return mult * default_timeout * msecs_to_jiffies(MSEC_PER_SEC);
+ return mult * kunit_base_timeout * msecs_to_jiffies(MSEC_PER_SEC);
}
--
2.50.0.rc2.761.g2dc52ea45b-goog
A few selftest harness changes being merged to v6.16, which exposed some
bugs and vulnerabilities in the iommufd selftest code. Fix them properly.
Note that the patch fixing the build warnings at mfd is not ideal, as it
has possibly hit some corner case in the gcc:
https://lore.kernel.org/all/aEi8DV+ReF3v3Rlf@nvidia.com/
This is on github:
https://github.com/nicolinc/iommufd/commits/iommufd_selftest_fixes-v6.16
Changelog:
v2
* Add "Reviewed-by" from Jason
* Only use kfree() in the teardown()
* Add an mmap_buffer_size for readability
v1
https://lore.kernel.org/all/cover.1750049883.git.nicolinc@nvidia.com/
Thanks
Nicolin
Nicolin Chen (4):
iommufd/selftest: Fix iommufd_dirty_tracking with large hugepage sizes
iommufd/selftest: Add missing close(mfd) in memfd_mmap()
iommufd/selftest: Add asserts testing global mfd
iommufd/selftest: Fix build warnings due to uninitialized mfd
tools/testing/selftests/iommu/iommufd_utils.h | 9 ++++-
tools/testing/selftests/iommu/iommufd.c | 40 ++++++++++++++-----
2 files changed, 36 insertions(+), 13 deletions(-)
--
2.43.0
This patch series fixes some of the false positives in generic
mm selftests and skips tests that cannot run correctly due to
missing features or system limitations.
Please let us know if you have any feedback.
Thanks,
Aboorva
Aboorva Devarajan (2):
selftests/mm: Fix child process exit codes in KSM tests
selftests/mm: Mark thuge-gen as skipped if shmmax is too small or no
1G pages
Donet Tom (4):
mm/selftests: Fix virtual_address_range test issues.
selftest/mm: Fix ksm_funtional_test failures
selftests/mm : fix test_prctl_fork_exec failure
mm/selftests: Fix split_huge_page_test failure on systems with 64KB
page size
.../selftests/mm/ksm_functional_tests.c | 24 +++++++++++++------
.../selftests/mm/split_huge_page_test.c | 23 ++++++++++++++----
tools/testing/selftests/mm/thuge-gen.c | 11 +++++----
.../selftests/mm/virtual_address_range.c | 14 +++--------
4 files changed, 45 insertions(+), 27 deletions(-)
--
2.43.5
Add support for SuperH/"sh" to nolibc.
Only sh4 is tested for now.
This is only tested on QEMU so far.
Additional testing would be very welcome.
Signed-off-by: Thomas Weißschuh <linux(a)weissschuh.net>
---
Thomas Weißschuh (3):
selftests/nolibc: fix EXTRACONFIG variables ordering
selftests/nolibc: use file driver for QEMU serial
tools/nolibc: add support for SuperH
tools/include/nolibc/arch-sh.h | 162 ++++++++++++++++++++++++++++
tools/include/nolibc/arch.h | 2 +
tools/testing/selftests/nolibc/Makefile | 15 ++-
tools/testing/selftests/nolibc/run-tests.sh | 3 +-
4 files changed, 177 insertions(+), 5 deletions(-)
---
base-commit: 6275a61db2f0586b8a5d651dfc7b4aacf9d0b2d6
change-id: 20250528-nolibc-sh-8b4e3bb8efcb
Best regards,
--
Thomas Weißschuh <linux(a)weissschuh.net>
Reading /proc/pid/maps requires read-locking mmap_lock which prevents any
other task from concurrently modifying the address space. This guarantees
coherent reporting of virtual address ranges, however it can block
important updates from happening. Oftentimes /proc/pid/maps readers are
low priority monitoring tasks and them blocking high priority tasks
results in priority inversion.
Locking the entire address space is required to present fully coherent
picture of the address space, however even current implementation does not
strictly guarantee that by outputting vmas in page-size chunks and
dropping mmap_lock in between each chunk. Address space modifications are
possible while mmap_lock is dropped and userspace reading the content is
expected to deal with possible concurrent address space modifications.
Considering these relaxed rules, holding mmap_lock is not strictly needed
as long as we can guarantee that a concurrently modified vma is reported
either in its original form or after it was modified.
This patchset switches from holding mmap_lock while reading /proc/pid/maps
to taking per-vma locks as we walk the vma tree. This reduces the
contention with tasks modifying the address space because they would have
to contend for the same vma as opposed to the entire address space. Same
is done for PROCMAP_QUERY ioctl which locks only the vma that fell into
the requested range instead of the entire address space. Previous version
of this patchset [1] tried to perform /proc/pid/maps reading under RCU,
however its implementation is quite complex and the results are worse than
the new version because it still relied on mmap_lock speculation which
retries if any part of the address space gets modified. New implementaion
is both simpler and results in less contention. Note that similar approach
would not work for /proc/pid/smaps reading as it also walks the page table
and that's not RCU-safe.
Paul McKenney's designed a test [2] to measure mmap/munmap latencies while
concurrently reading /proc/pid/maps. The test has a pair of processes
scanning /proc/PID/maps, and another process unmapping and remapping 4K
pages from a 128MB range of anonymous memory. At the end of each 10
second run, the latency of each mmap() or munmap() operation is measured,
and for each run the maximum and mean latency is printed. The map/unmap
process is started first, its PID is passed to the scanners, and then the
map/unmap process waits until both scanners are running before starting
its timed test. The scanners keep scanning until the specified
/proc/PID/maps file disappears. This test registered close to 10x
improvement in update latencies:
Before the change:
./run-proc-vs-map.sh --nsamples 100 --rawdata -- --busyduration 2
0.011 0.008 0.455
0.011 0.008 0.472
0.011 0.008 0.535
0.011 0.009 0.545
...
0.011 0.014 2.875
0.011 0.014 2.913
0.011 0.014 3.007
0.011 0.015 3.018
After the change:
./run-proc-vs-map.sh --nsamples 100 --rawdata -- --busyduration 2
0.006 0.005 0.036
0.006 0.005 0.039
0.006 0.005 0.039
0.006 0.005 0.039
...
0.006 0.006 0.403
0.006 0.006 0.474
0.006 0.006 0.479
0.006 0.006 0.498
The patchset also adds a number of tests to check for /proc/pid/maps data
coherency. They are designed to detect any unexpected data tearing while
performing some common address space modifications (vma split, resize and
remap). Even before these changes, reading /proc/pid/maps might have
inconsistent data because the file is read page-by-page with mmap_lock
being dropped between the pages. An example of user-visible inconsistency
can be that the same vma is printed twice: once before it was modified and
then after the modifications. For example if vma was extended, it might be
found and reported twice. What is not expected is to see a gap where there
should have been a vma both before and after modification. This patchset
increases the chances of such tearing, therefore it's even more important
now to test for unexpected inconsistencies.
In [3] Lorenzo identified the following possible vma merging/splitting
scenarios:
Merges with changes to existing vmas:
1 Merge both - mapping a vma over another one and between two vmas which
can be merged after this replacement;
2. Merge left full - mapping a vma at the end of an existing one and
completely over its right neighbor;
3. Merge left partial - mapping a vma at the end of an existing one and
partially over its right neighbor;
4. Merge right full - mapping a vma before the start of an existing one
and completely over its left neighbor;
5. Merge right partial - mapping a vma before the start of an existing one
and partially over its left neighbor;
Merges without changes to existing vmas:
6. Merge both - mapping a vma into a gap between two vmas which can be
merged after the insertion;
7. Merge left - mapping a vma at the end of an existing one;
8. Merge right - mapping a vma before the start end of an existing one;
Splits
9. Split with new vma at the lower address;
10. Split with new vma at the higher address;
If such merges or splits happen concurrently with the /proc/maps reading
we might report a vma twice, once before the modification and once after
it is modified:
Case 1 might report overwritten and previous vma along with the final
merged vma;
Case 2 might report previous and the final merged vma;
Case 3 might cause us to retry once we detect the temporary gap caused by
shrinking of the right neighbor;
Case 4 might report overritten and the final merged vma;
Case 5 might cause us to retry once we detect the temporary gap caused by
shrinking of the left neighbor;
Case 6 might report previous vma and the gap along with the final marged
vma;
Case 7 might report previous and the final merged vma;
Case 8 might report the original gap and the final merged vma covering the
gap;
Case 9 might cause us to retry once we detect the temporary gap caused by
shrinking of the original vma at the vma start;
Case 10 might cause us to retry once we detect the temporary gap caused by
shrinking of the original vma at the vma end;
In all these cases the retry mechanism prevents us from reporting possible
temporary gaps.
Changes from v4 [4]:
- refactored trylock_vma() and other locking parts into mmap_lock.c, per
Lorenzo
- renamed {lock|unlock}_content() into {lock|unlock}_vma_range(), per
Lorenzo
- added clarifying comments for sentinels, per Lorenzo
- introduced is_sentinel_pos() helper function
- fixed position reset logic when last_addr is a sentinel, per Lorenzo
- added Acked-by to the last patch, per Andrii Nakryiko
[1] https://lore.kernel.org/all/20250418174959.1431962-1-surenb@google.com/
[2] https://github.com/paulmckrcu/proc-mmap_sem-test
[3] https://lore.kernel.org/all/e1863f40-39ab-4e5b-984a-c48765ffde1c@lucifer.lo…
[4] https://lore.kernel.org/all/20250604231151.799834-1-surenb@google.com/
Suren Baghdasaryan (7):
selftests/proc: add /proc/pid/maps tearing from vma split test
selftests/proc: extend /proc/pid/maps tearing test to include vma
resizing
selftests/proc: extend /proc/pid/maps tearing test to include vma
remapping
selftests/proc: test PROCMAP_QUERY ioctl while vma is concurrently
modified
selftests/proc: add verbose more for tests to facilitate debugging
mm/maps: read proc/pid/maps under per-vma lock
mm/maps: execute PROCMAP_QUERY ioctl under per-vma locks
fs/proc/internal.h | 5 +
fs/proc/task_mmu.c | 179 ++++-
include/linux/mmap_lock.h | 11 +
mm/mmap_lock.c | 88 +++
tools/testing/selftests/proc/proc-pid-vm.c | 793 ++++++++++++++++++++-
5 files changed, 1053 insertions(+), 23 deletions(-)
base-commit: 0b2a863368fb0cf674b40925c55dc8898c5a33af
--
2.50.0.714.g196bf9f422-goog
eOn Tue, Jun 24, 2025 at 11:45:09AM +0530, Dev Jain wrote:
>
> On 23/06/25 11:02 pm, Donet Tom wrote:
> > On Mon, Jun 23, 2025 at 10:23:02AM +0530, Dev Jain wrote:
> > > On 21/06/25 11:25 pm, Donet Tom wrote:
> > > > On Fri, Jun 20, 2025 at 08:15:25PM +0530, Dev Jain wrote:
> > > > > On 19/06/25 1:53 pm, Donet Tom wrote:
> > > > > > On Wed, Jun 18, 2025 at 08:13:54PM +0530, Dev Jain wrote:
> > > > > > > On 18/06/25 8:05 pm, Lorenzo Stoakes wrote:
> > > > > > > > On Wed, Jun 18, 2025 at 07:47:18PM +0530, Dev Jain wrote:
> > > > > > > > > On 18/06/25 7:37 pm, Lorenzo Stoakes wrote:
> > > > > > > > > > On Wed, Jun 18, 2025 at 07:28:16PM +0530, Dev Jain wrote:
> > > > > > > > > > > On 18/06/25 5:27 pm, Lorenzo Stoakes wrote:
> > > > > > > > > > > > On Wed, Jun 18, 2025 at 05:15:50PM +0530, Dev Jain wrote:
> > > > > > > > > > > > Are you accounting for sys.max_map_count? If not, then you'll be hitting that
> > > > > > > > > > > > first.
> > > > > > > > > > > run_vmtests.sh will run the test in overcommit mode so that won't be an issue.
> > > > > > > > > > Umm, what? You mean overcommit all mode, and that has no bearing on the max
> > > > > > > > > > mapping count check.
> > > > > > > > > >
> > > > > > > > > > In do_mmap():
> > > > > > > > > >
> > > > > > > > > > /* Too many mappings? */
> > > > > > > > > > if (mm->map_count > sysctl_max_map_count)
> > > > > > > > > > return -ENOMEM;
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > As well as numerous other checks in mm/vma.c.
> > > > > > > > > Ah sorry, didn't look at the code properly just assumed that overcommit_always meant overriding
> > > > > > > > > this.
> > > > > > > > No problem! It's hard to be aware of everything in mm :)
> > > > > > > >
> > > > > > > > > > I'm not sure why an overcommit toggle is even necessary when you could use
> > > > > > > > > > MAP_NORESERVE or simply map PROT_NONE to avoid the OVERCOMMIT_GUESS limits?
> > > > > > > > > >
> > > > > > > > > > I'm pretty confused as to what this test is really achieving honestly. This
> > > > > > > > > > isn't a useful way of asserting mmap() behaviour as far as I can tell.
> > > > > > > > > Well, seems like a useful way to me at least : ) Not sure if you are in the mood
> > > > > > > > > to discuss that but if you'd like me to explain from start to end what the test
> > > > > > > > > is doing, I can do that : )
> > > > > > > > >
> > > > > > > > I just don't have time right now, I guess I'll have to come back to it
> > > > > > > > later... it's not the end of the world for it to be iffy in my view as long as
> > > > > > > > it passes, but it might just not be of great value.
> > > > > > > >
> > > > > > > > Philosophically I'd rather we didn't assert internal implementation details like
> > > > > > > > where we place mappings in userland memory. At no point do we promise to not
> > > > > > > > leave larger gaps if we feel like it :)
> > > > > > > You have a fair point. Anyhow a debate for another day.
> > > > > > >
> > > > > > > > I'm guessing, reading more, the _real_ test here is some mathematical assertion
> > > > > > > > about layout from HIGH_ADDR_SHIFT -> end of address space when using hints.
> > > > > > > >
> > > > > > > > But again I'm not sure that achieves much and again also is asserting internal
> > > > > > > > implementation details.
> > > > > > > >
> > > > > > > > Correct behaviour of this kind of thing probably better belongs to tests in the
> > > > > > > > userland VMA testing I'd say.
> > > > > > > >
> > > > > > > > Sorry I don't mean to do down work you've done before, just giving an honest
> > > > > > > > technical appraisal!
> > > > > > > Nah, it will be rather hilarious to see it all go down the drain xD
> > > > > > >
> > > > > > > > Anyway don't let this block work to fix the test if it's failing. We can revisit
> > > > > > > > this later.
> > > > > > > Sure. @Aboorva and Donet, I still believe that the correct approach is to elide
> > > > > > > the gap check at the crossing boundary. What do you think?
> > > > > > >
> > > > > > One problem I am seeing with this approach is that, since the hint address
> > > > > > is generated randomly, the VMAs are also being created at randomly based on
> > > > > > the hint address.So, for the VMAs created at high addresses, we cannot guarantee
> > > > > > that the gaps between them will be aligned to MAP_CHUNK_SIZE.
> > > > > >
> > > > > > High address VMAs
> > > > > > -----------------
> > > > > > 1000000000000-1000040000000 r--p 00000000 00:00 0
> > > > > > 2000000000000-2000040000000 r--p 00000000 00:00 0
> > > > > > 4000000000000-4000040000000 r--p 00000000 00:00 0
> > > > > > 8000000000000-8000040000000 r--p 00000000 00:00 0
> > > > > > e80009d260000-fffff9d260000 r--p 00000000 00:00 0
> > > > > >
> > > > > > I have a different approach to solve this issue.
> > > > > It is really weird that such a large amount of VA space
> > > > > is left between the two VMAs yet mmap is failing.
> > > > >
> > > > >
> > > > >
> > > > > Can you please do the following:
> > > > > set /proc/sys/vm/max_map_count to the highest value possible.
> > > > > If running without run_vmtests.sh, set /proc/sys/vm/overcommit_memory to 1.
> > > > > In validate_complete_va_space:
> > > > >
> > > > > if (start_addr >= HIGH_ADDR_MARK && found == false) {
> > > > > found = true;
> > > > > continue;
> > > > > }
> > > > Thanks Dev for the suggestion. I set max_map_count and set overcommit
> > > > memory to 1, added this code change as well, and then tried. Still, the
> > > > test is failing
> > > >
> > > > > where found is initialized to false. This will skip the check
> > > > > for the boundary.
> > > > >
> > > > > After this can you tell whether the test is still failing.
> > > > >
> > > > > Also can you give me the complete output of proc/pid/maps
> > > > > after putting a sleep at the end of the test.
> > > > >
> > > > on powerpc support DEFAULT_MAP_WINDOW is 128TB and with
> > > > total address space size is 4PB With hint it can map upto
> > > > 4PB. Since the hint addres is random in this test random hing VMAs
> > > > are getting created. IIUC this is expected only.
> > > >
> > > >
> > > > 10000000-10010000 r-xp 00000000 fd:05 134226638 /home/donet/linux/tools/testing/selftests/mm/virtual_address_range
> > > > 10010000-10020000 r--p 00000000 fd:05 134226638 /home/donet/linux/tools/testing/selftests/mm/virtual_address_range
> > > > 10020000-10030000 rw-p 00010000 fd:05 134226638 /home/donet/linux/tools/testing/selftests/mm/virtual_address_range
> > > > 30000000-10030000000 r--p 00000000 00:00 0 [anon:virtual_address_range]
> > > > 10030770000-100307a0000 rw-p 00000000 00:00 0 [heap]
> > > > 1004f000000-7fff8f000000 r--p 00000000 00:00 0 [anon:virtual_address_range]
> > > > 7fff8faf0000-7fff8fe00000 rw-p 00000000 00:00 0
> > > > 7fff8fe00000-7fff90030000 r-xp 00000000 fd:00 792355 /usr/lib64/libc.so.6
> > > > 7fff90030000-7fff90040000 r--p 00230000 fd:00 792355 /usr/lib64/libc.so.6
> > > > 7fff90040000-7fff90050000 rw-p 00240000 fd:00 792355 /usr/lib64/libc.so.6
> > > > 7fff90050000-7fff90130000 r-xp 00000000 fd:00 792358 /usr/lib64/libm.so.6
> > > > 7fff90130000-7fff90140000 r--p 000d0000 fd:00 792358 /usr/lib64/libm.so.6
> > > > 7fff90140000-7fff90150000 rw-p 000e0000 fd:00 792358 /usr/lib64/libm.so.6
> > > > 7fff90160000-7fff901a0000 r--p 00000000 00:00 0 [vvar]
> > > > 7fff901a0000-7fff901b0000 r-xp 00000000 00:00 0 [vdso]
> > > > 7fff901b0000-7fff90200000 r-xp 00000000 fd:00 792351 /usr/lib64/ld64.so.2
> > > > 7fff90200000-7fff90210000 r--p 00040000 fd:00 792351 /usr/lib64/ld64.so.2
> > > > 7fff90210000-7fff90220000 rw-p 00050000 fd:00 792351 /usr/lib64/ld64.so.2
> > > > 7fffc9770000-7fffc9880000 rw-p 00000000 00:00 0 [stack]
> > > > 1000000000000-1000040000000 r--p 00000000 00:00 0 [anon:virtual_address_range]
> > > > 2000000000000-2000040000000 r--p 00000000 00:00 0 [anon:virtual_address_range]
> > > > 4000000000000-4000040000000 r--p 00000000 00:00 0 [anon:virtual_address_range]
> > > > 8000000000000-8000040000000 r--p 00000000 00:00 0 [anon:virtual_address_range]
> > > > eb95410220000-fffff90220000 r--p 00000000 00:00 0 [anon:virtual_address_range]
> > > >
> > > >
> > > >
> > > >
> > > > If I give the hint address serially from 128TB then the address
> > > > space is contigous and gap is also MAP_SIZE, the test is passing.
> > > >
> > > > 10000000-10010000 r-xp 00000000 fd:05 134226638 /home/donet/linux/tools/testing/selftests/mm/virtual_address_range
> > > > 10010000-10020000 r--p 00000000 fd:05 134226638 /home/donet/linux/tools/testing/selftests/mm/virtual_address_range
> > > > 10020000-10030000 rw-p 00010000 fd:05 134226638 /home/donet/linux/tools/testing/selftests/mm/virtual_address_range
> > > > 33000000-10033000000 r--p 00000000 00:00 0 [anon:virtual_address_range]
> > > > 10033380000-100333b0000 rw-p 00000000 00:00 0 [heap]
> > > > 1006f0f0000-10071000000 rw-p 00000000 00:00 0
> > > > 10071000000-7fffb1000000 r--p 00000000 00:00 0 [anon:virtual_address_range]
> > > > 7fffb15d0000-7fffb1800000 r-xp 00000000 fd:00 792355 /usr/lib64/libc.so.6
> > > > 7fffb1800000-7fffb1810000 r--p 00230000 fd:00 792355 /usr/lib64/libc.so.6
> > > > 7fffb1810000-7fffb1820000 rw-p 00240000 fd:00 792355 /usr/lib64/libc.so.6
> > > > 7fffb1820000-7fffb1900000 r-xp 00000000 fd:00 792358 /usr/lib64/libm.so.6
> > > > 7fffb1900000-7fffb1910000 r--p 000d0000 fd:00 792358 /usr/lib64/libm.so.6
> > > > 7fffb1910000-7fffb1920000 rw-p 000e0000 fd:00 792358 /usr/lib64/libm.so.6
> > > > 7fffb1930000-7fffb1970000 r--p 00000000 00:00 0 [vvar]
> > > > 7fffb1970000-7fffb1980000 r-xp 00000000 00:00 0 [vdso]
> > > > 7fffb1980000-7fffb19d0000 r-xp 00000000 fd:00 792351 /usr/lib64/ld64.so.2
> > > > 7fffb19d0000-7fffb19e0000 r--p 00040000 fd:00 792351 /usr/lib64/ld64.so.2
> > > > 7fffb19e0000-7fffb19f0000 rw-p 00050000 fd:00 792351 /usr/lib64/ld64.so.2
> > > > 7fffc5470000-7fffc5580000 rw-p 00000000 00:00 0 [stack]
> > > > 800000000000-2aab000000000 r--p 00000000 00:00 0 [anon:virtual_address_range]
> > > >
> > > >
> > > Thank you for this output. I can't wrap my head around why this behaviour changes
> > > when you generate the hint sequentially. The mmap() syscall is supposed to do the
> > > following (irrespective of high VA space or not) - if the allocation at the hint
> > Yes, it is working as expected. On PowerPC, the DEFAULT_MAP_WINDOW is
> > 128TB, and the system can map up to 4PB.
> >
> > In the test, the first mmap call maps memory up to 128TB without any
> > hint, so the VMAs are created below the 128TB boundary.
> >
> > In the second mmap call, we provide a hint starting from 256TB, and
> > the hint address is generated randomly above 256TB. The mappings are
> > correctly created at these hint addresses. Since the hint addresses
> > are random, the resulting VMAs are also created at random locations.
> >
> > So, what I tried is: mapping from 0 to 128TB without any hint, and
> > then for the second mmap, instead of starting the hint from 256TB, I
> > started from 128TB. Instead of using random hint addresses, I used
> > sequential hint addresses from 128TB up to 512TB. With this change,
> > the VMAs are created in order, and the test passes.
> >
> > 800000000000-2aab000000000 r--p 00000000 00:00 0 128TB to 512TB VMA
> >
> > I think we will see same behaviour on x86 with X86_FEATURE_LA57.
> >
> > I will send the updated patch in V2.
>
> Since you say it fails on both radix and hash, it means that the generic
> code path is failing. I see that on my system, when I run the test with
> LPA2 config, write() fails with errno set to -ENOMEM. Can you apply
> the following diff and check whether the test fails still. Doing this
> fixed it for arm64.
>
> diff --git a/tools/testing/selftests/mm/virtual_address_range.c b/tools/testing/selftests/mm/virtual_address_range.c
>
> index b380e102b22f..3032902d01f2 100644
>
> --- a/tools/testing/selftests/mm/virtual_address_range.c
>
> +++ b/tools/testing/selftests/mm/virtual_address_range.c
>
> @@ -173,10 +173,6 @@ static int validate_complete_va_space(void)
>
> */
>
> hop = 0;
>
> while (start_addr + hop < end_addr) {
>
> - if (write(fd, (void *)(start_addr + hop), 1) != 1)
>
> - return 1;
>
> - lseek(fd, 0, SEEK_SET);
>
> -
>
> if (is_marked_vma(vma_name))
>
> munmap((char *)(start_addr + hop), MAP_CHUNK_SIZE);
>
Even with this change, the test is still failing. In this case,
we are allocating physical memory and writing into it, but our
issue seems to be with the gap between VMAs, so I believe this
might not be directly related.
I will send the next revision where the test passes and no
issues are observed
Just curious — with LPA2, is the second mmap() call successful?
And are the VMAs being created at the hint address as expected?
> >
> > > addr succeeds, then all is well, otherwise, do a top-down search for a large
> > > enough gap. I am not aware of the nuances in powerpc but I really am suspecting
> > > a bug in powerpc mmap code. Can you try to do some tracing - which function
> > > eventually fails to find the empty gap?
> > >
> > > Through my limited code tracing - we should end up in slice_find_area_topdown,
> > > then we ask the generic code to find the gap using vm_unmapped_area. So I
> > > suspect something is happening between this, probably slice_scan_available().
> > >
> > > > > > From 0 to 128TB, we map memory directly without using any hint. For the range above
> > > > > > 256TB up to 512TB, we perform the mapping using hint addresses. In the current test,
> > > > > > we use random hint addresses, but I have modified it to generate hint addresses linearly
> > > > > > starting from 128TB.
> > > > > >
> > > > > > With this change:
> > > > > >
> > > > > > The 0–128TB range is mapped without hints and verified accordingly.
> > > > > >
> > > > > > The 128TB–512TB range is mapped using linear hint addresses and then verified.
> > > > > >
> > > > > > Below are the VMAs obtained with this approach:
> > > > > >
> > > > > > 10000000-10010000 r-xp 00000000 fd:05 135019531
> > > > > > 10010000-10020000 r--p 00000000 fd:05 135019531
> > > > > > 10020000-10030000 rw-p 00010000 fd:05 135019531
> > > > > > 20000000-10020000000 r--p 00000000 00:00 0
> > > > > > 10020800000-10020830000 rw-p 00000000 00:00 0
> > > > > > 1004bcf0000-1004c000000 rw-p 00000000 00:00 0
> > > > > > 1004c000000-7fff8c000000 r--p 00000000 00:00 0
> > > > > > 7fff8c130000-7fff8c360000 r-xp 00000000 fd:00 792355
> > > > > > 7fff8c360000-7fff8c370000 r--p 00230000 fd:00 792355
> > > > > > 7fff8c370000-7fff8c380000 rw-p 00240000 fd:00 792355
> > > > > > 7fff8c380000-7fff8c460000 r-xp 00000000 fd:00 792358
> > > > > > 7fff8c460000-7fff8c470000 r--p 000d0000 fd:00 792358
> > > > > > 7fff8c470000-7fff8c480000 rw-p 000e0000 fd:00 792358
> > > > > > 7fff8c490000-7fff8c4d0000 r--p 00000000 00:00 0
> > > > > > 7fff8c4d0000-7fff8c4e0000 r-xp 00000000 00:00 0
> > > > > > 7fff8c4e0000-7fff8c530000 r-xp 00000000 fd:00 792351
> > > > > > 7fff8c530000-7fff8c540000 r--p 00040000 fd:00 792351
> > > > > > 7fff8c540000-7fff8c550000 rw-p 00050000 fd:00 792351
> > > > > > 7fff8d000000-7fffcd000000 r--p 00000000 00:00 0
> > > > > > 7fffe9c80000-7fffe9d90000 rw-p 00000000 00:00 0
> > > > > > 800000000000-2000000000000 r--p 00000000 00:00 0 -> High Address (128TB to 512TB)
> > > > > >
> > > > > > diff --git a/tools/testing/selftests/mm/virtual_address_range.c b/tools/testing/selftests/mm/virtual_address_range.c
> > > > > > index 4c4c35eac15e..0be008cba4b0 100644
> > > > > > --- a/tools/testing/selftests/mm/virtual_address_range.c
> > > > > > +++ b/tools/testing/selftests/mm/virtual_address_range.c
> > > > > > @@ -56,21 +56,21 @@
> > > > > > #ifdef __aarch64__
> > > > > > #define HIGH_ADDR_MARK ADDR_MARK_256TB
> > > > > > -#define HIGH_ADDR_SHIFT 49
> > > > > > +#define HIGH_ADDR_SHIFT 48
> > > > > > #define NR_CHUNKS_LOW NR_CHUNKS_256TB
> > > > > > #define NR_CHUNKS_HIGH NR_CHUNKS_3840TB
> > > > > > #else
> > > > > > #define HIGH_ADDR_MARK ADDR_MARK_128TB
> > > > > > -#define HIGH_ADDR_SHIFT 48
> > > > > > +#define HIGH_ADDR_SHIFT 47
> > > > > > #define NR_CHUNKS_LOW NR_CHUNKS_128TB
> > > > > > #define NR_CHUNKS_HIGH NR_CHUNKS_384TB
> > > > > > #endif
> > > > > > -static char *hint_addr(void)
> > > > > > +static char *hint_addr(int hint)
> > > > > > {
> > > > > > - int bits = HIGH_ADDR_SHIFT + rand() % (63 - HIGH_ADDR_SHIFT);
> > > > > > + unsigned long addr = ((1UL << HIGH_ADDR_SHIFT) + (hint * MAP_CHUNK_SIZE));
> > > > > > - return (char *) (1UL << bits);
> > > > > > + return (char *) (addr);
> > > > > > }
> > > > > > static void validate_addr(char *ptr, int high_addr)
> > > > > > @@ -217,7 +217,7 @@ int main(int argc, char *argv[])
> > > > > > }
> > > > > > for (i = 0; i < NR_CHUNKS_HIGH; i++) {
> > > > > > - hint = hint_addr();
> > > > > > + hint = hint_addr(i);
> > > > > > hptr[i] = mmap(hint, MAP_CHUNK_SIZE, PROT_READ,
> > > > > > MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
> > > > > >
> > > > > >
> > > > > >
> > > > > > Can we fix it this way?
Add support for SuperH/"sh" to nolibc.
Only sh4 is tested for now.
This is only tested on QEMU so far.
Additional testing would be very welcome.
Test instructions:
$ cd tools/testings/selftests/nolibc/
$ make -f Makefile.nolibc ARCH=sh CROSS_COMPILE=sh4-linux- nolibc-test
$ file nolibc-test
nolibc-test: ELF 32-bit LSB executable, Renesas SH, version 1 (SYSV), statically linked, not stripped
$ ./nolibc-test
Running test 'startup'
0 argc = 1 [OK]
...
Total number of errors: 0
Exiting with status 0
Signed-off-by: Thomas Weißschuh <linux(a)weissschuh.net>
---
Changes in v2:
- Rebase onto latest nolibc-next
- Pick up Ack from Willy
- Provide some test instructions
- Link to v1: https://lore.kernel.org/r/20250609-nolibc-sh-v1-0-9dcdb1b66bb5@weissschuh.n…
---
Thomas Weißschuh (3):
selftests/nolibc: fix EXTRACONFIG variables ordering
selftests/nolibc: use file driver for QEMU serial
tools/nolibc: add support for SuperH
tools/include/nolibc/arch-sh.h | 162 +++++++++++++++++++++++++
tools/include/nolibc/arch.h | 2 +
tools/testing/selftests/nolibc/Makefile.nolibc | 15 ++-
tools/testing/selftests/nolibc/run-tests.sh | 3 +-
4 files changed, 177 insertions(+), 5 deletions(-)
---
base-commit: eb135311083100b6590a7545618cd9760d896a86
change-id: 20250528-nolibc-sh-8b4e3bb8efcb
Best regards,
--
Thomas Weißschuh <linux(a)weissschuh.net>
The current implementation of test_unmerge_uffd_wp() explicitly sets
`uffdio_api.features = UFFD_FEATURE_PAGEFAULT_FLAG_WP` before calling
UFFDIO_API. This can cause the ioctl() call to fail with EINVAL on kernels
that do not support UFFD-WP, leading the test to fail unnecessarily:
# ------------------------------
# running ./ksm_functional_tests
# ------------------------------
# TAP version 13
# 1..9
# # [RUN] test_unmerge
# ok 1 Pages were unmerged
# # [RUN] test_unmerge_zero_pages
# ok 2 KSM zero pages were unmerged
# # [RUN] test_unmerge_discarded
# ok 3 Pages were unmerged
# # [RUN] test_unmerge_uffd_wp
# not ok 4 UFFDIO_API failed <-----
# # [RUN] test_prot_none
# ok 5 Pages were unmerged
# # [RUN] test_prctl
# ok 6 Setting/clearing PR_SET_MEMORY_MERGE works
# # [RUN] test_prctl_fork
# # No pages got merged
# # [RUN] test_prctl_fork_exec
# ok 7 PR_SET_MEMORY_MERGE value is inherited
# # [RUN] test_prctl_unmerge
# ok 8 Pages were unmerged
# Bail out! 1 out of 8 tests failed
# # Planned tests != run tests (9 != 8)
# # Totals: pass:7 fail:1 xfail:0 xpass:0 skip:0 error:0
# [FAIL]
This patch improves compatibility and robustness of the UFFD-WP test
(test_unmerge_uffd_wp) by correctly implementing the UFFDIO_API
two-step handshake as recommended by the userfaultfd(2) man page.
Key changes:
1. Use features=0 in the initial UFFDIO_API call to query supported
feature bits, rather than immediately requesting WP support.
2. Skip the test gracefully if:
- UFFDIO_API fails with EINVAL (e.g. unsupported API version), or
- UFFD_FEATURE_PAGEFAULT_FLAG_WP is not advertised by the kernel.
3. Close the initial userfaultfd and create a new one before enabling
the required feature, since UFFDIO_API can only be called once per fd.
4. Improve diagnostics by distinguishing between expected and unexpected
failures, using strerror() to report errors.
This ensures the test behaves correctly across a wider range of kernel
versions and configurations, while preserving the intended behavior on
kernels that support UFFD-WP.
Suggestted-by: David Hildenbrand <david(a)redhat.com>
Signed-off-by: Li Wang <liwang(a)redhat.com>
Cc: Peter Xu <peterx(a)redhat.com>
Cc: Nadav Amit <nadav.amit(a)gmail.com>
Cc: Aruna Ramakrishna <aruna.ramakrishna(a)oracle.com>
Cc: Bagas Sanjaya <bagasdotme(a)gmail.com>
Cc: Catalin Marinas <catalin.marinas(a)arm.com>
Cc: Dave Hansen <dave.hansen(a)linux.intel.com>
Cc: Joey Gouly <joey.gouly(a)arm.com>
Cc: Johannes Weiner <hannes(a)cmpxchg.org>
Cc: Keith Lucas <keith.lucas(a)oracle.com>
Cc: Ryan Roberts <ryan.roberts(a)arm.com>
Cc: Shuah Khan <shuah(a)kernel.org>
Acked-by: David Hildenbrand <david(a)redhat.com>
---
.../selftests/mm/ksm_functional_tests.c | 28 +++++++++++++++++--
1 file changed, 26 insertions(+), 2 deletions(-)
diff --git a/tools/testing/selftests/mm/ksm_functional_tests.c b/tools/testing/selftests/mm/ksm_functional_tests.c
index b61803e36d1c..d8bd1911dfc0 100644
--- a/tools/testing/selftests/mm/ksm_functional_tests.c
+++ b/tools/testing/selftests/mm/ksm_functional_tests.c
@@ -393,9 +393,13 @@ static void test_unmerge_uffd_wp(void)
/* See if UFFD-WP is around. */
uffdio_api.api = UFFD_API;
- uffdio_api.features = UFFD_FEATURE_PAGEFAULT_FLAG_WP;
+ uffdio_api.features = 0;
if (ioctl(uffd, UFFDIO_API, &uffdio_api) < 0) {
- ksft_test_result_fail("UFFDIO_API failed\n");
+ if (errno == EINVAL)
+ ksft_test_result_skip("The API version requested is not supported\n");
+ else
+ ksft_test_result_fail("UFFDIO_API failed: %s\n", strerror(errno));
+
goto close_uffd;
}
if (!(uffdio_api.features & UFFD_FEATURE_PAGEFAULT_FLAG_WP)) {
@@ -403,6 +407,26 @@ static void test_unmerge_uffd_wp(void)
goto close_uffd;
}
+ /*
+ * UFFDIO_API must only be called once to enable features.
+ * So we close the old userfaultfd and create a new one to
+ * actually enable UFFD_FEATURE_PAGEFAULT_FLAG_WP.
+ */
+ close(uffd);
+ uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);
+ if (uffd < 0) {
+ ksft_test_result_fail("__NR_userfaultfd failed\n");
+ goto unmap;
+ }
+
+ /* Now, enable it ("two-step handshake") */
+ uffdio_api.api = UFFD_API;
+ uffdio_api.features = UFFD_FEATURE_PAGEFAULT_FLAG_WP;
+ if (ioctl(uffd, UFFDIO_API, &uffdio_api) < 0) {
+ ksft_test_result_fail("UFFDIO_API failed: %s\n", strerror(errno));
+ goto close_uffd;
+ }
+
/* Register UFFD-WP, no need for an actual handler. */
if (uffd_register(uffd, map, size, false, true, false)) {
ksft_test_result_fail("UFFDIO_REGISTER_MODE_WP failed\n");
--
2.49.0
The test_kexec_jump program builds correctly when invoked from the top-level
selftests/Makefile, which explicitly sets the OUTPUT variable. However,
building directly in tools/testing/selftests/kexec fails with:
make: *** No rule to make target '/test_kexec_jump', needed by 'test_kexec_jump.sh'. Stop.
This failure occurs because the Makefile rule relies on $(OUTPUT), which is
undefined in direct builds.
Fix this by listing test_kexec_jump in TEST_GEN_PROGS, the standard way to
declare generated test binaries in the kselftest framework. This ensures the
binary is built regardless of invocation context and properly removed by
make clean.
Also add the binary to .gitignore to avoid tracking it in version control.
Signed-off-by: Moon Hee Lee <moonhee.lee.ca(a)gmail.com>
---
tools/testing/selftests/kexec/.gitignore | 2 ++
tools/testing/selftests/kexec/Makefile | 2 +-
2 files changed, 3 insertions(+), 1 deletion(-)
create mode 100644 tools/testing/selftests/kexec/.gitignore
diff --git a/tools/testing/selftests/kexec/.gitignore b/tools/testing/selftests/kexec/.gitignore
new file mode 100644
index 000000000000..5f3d9e089ae8
--- /dev/null
+++ b/tools/testing/selftests/kexec/.gitignore
@@ -0,0 +1,2 @@
+# SPDX-License-Identifier: GPL-2.0-only
+test_kexec_jump
diff --git a/tools/testing/selftests/kexec/Makefile b/tools/testing/selftests/kexec/Makefile
index e3000ccb9a5d..874cfdd3b75b 100644
--- a/tools/testing/selftests/kexec/Makefile
+++ b/tools/testing/selftests/kexec/Makefile
@@ -12,7 +12,7 @@ include ../../../scripts/Makefile.arch
ifeq ($(IS_64_BIT)$(ARCH_PROCESSED),1x86)
TEST_PROGS += test_kexec_jump.sh
-test_kexec_jump.sh: $(OUTPUT)/test_kexec_jump
+TEST_GEN_PROGS := test_kexec_jump
endif
include ../lib.mk
--
2.43.0
This series creates a new PMU scheme on ARM, a partitioned PMU that
allows reserving a subset of counters for more direct guest access,
significantly reducing overhead. More details, including performance
benchmarks, can be read in the v1 cover letter linked below.
v2:
* Rebased on top of kvm/queue to pick up Sean's patch [1] that
reorganizes some of the same headers and would otherwise conflict.
* Changed the semantics of the command line parameters and the
ioctl. It was pointed out in the comments last time that it doesn't
work to repartition at runtime because the perf subsystem assumes
the number of counters it gets will not change after the PMU is
probed. Now the PMUv3 command line parameters are the sole thing
that divides up guest and host counters and the ioctl just toggles a
flag for whether a vcpu should use the partitioned PMU. I've also
moved from one to two parameters: partition_pmu=[y/n] and
reserved_guest_counters=[0-N]. This makes it possible to
unambiguously express configurations like a partitioned PMU with 0
general purpose counters exposed to the guest (which still exposes
the cycle counter.
* Moved the partitioning code into the PMUv3 driver itself so KVM code
isn't modifying fields that are otherwise internal to the driver.
* Define PMI{CNTR,FILTR} as undef_access since KVM isn't ready to
support that counter. It is, however, still handled in the
partitioning because the driver recognizes it.
* Take out the dependency on FEAT_FGT since it is not widely available
on hardware yet. Instead, define a fast path in switch.h for
handling accesses to the registers that would otherwise be
untrapped.
* During MDCR_EL2 setup for guests, ensure the computed HPMN value is
always below the number of guest counters allocated by the driver at
boot and always below the number of counters on the current
CPU. This accounts for the possibiliy of heterogeneous hardware
where I guest might be able to use the partitioned PMU on one CPU
but not another.
* The KVM PMU event filter API says that counters must not count while
the event is filtered. To ensure this, enforce the filter on every
vcpu_load into the guest.
* Settable PMCR_EL0.N with a partitioned PMU now works and the
vcpu_counter_access selftest changes reflect that.
v1:
https://lore.kernel.org/kvm/20250602192702.2125115-1-coltonlewis@google.com/
Colton Lewis (22):
arm64: cpufeature: Add cpucap for HPMN0
arm64: Generate sign macro for sysreg Enums
arm64: cpufeature: Add cpucap for PMICNTR
arm64: Define PMI{CNTR,FILTR}_EL0 as undef_access
KVM: arm64: Reorganize PMU functions
perf: arm_pmuv3: Introduce method to partition the PMU
perf: arm_pmuv3: Generalize counter bitmasks
perf: arm_pmuv3: Keep out of guest counter partition
KVM: arm64: Correct kvm_arm_pmu_get_max_counters()
KVM: arm64: Set up FGT for Partitioned PMU
KVM: arm64: Writethrough trapped PMEVTYPER register
KVM: arm64: Use physical PMSELR for PMXEVTYPER if partitioned
KVM: arm64: Writethrough trapped PMOVS register
KVM: arm64: Write fast path PMU register handlers
KVM: arm64: Setup MDCR_EL2 to handle a partitioned PMU
KVM: arm64: Account for partitioning in PMCR_EL0 access
KVM: arm64: Context swap Partitioned PMU guest registers
KVM: arm64: Enforce PMU event filter at vcpu_load()
perf: arm_pmuv3: Handle IRQs for Partitioned PMU guest counters
KVM: arm64: Inject recorded guest interrupts
KVM: arm64: Add ioctl to partition the PMU when supported
KVM: arm64: selftests: Add test case for partitioned PMU
Marc Zyngier (1):
KVM: arm64: Cleanup PMU includes
Documentation/virt/kvm/api.rst | 21 +
arch/arm/include/asm/arm_pmuv3.h | 34 +
arch/arm64/include/asm/arm_pmuv3.h | 61 +-
arch/arm64/include/asm/kvm_host.h | 20 +-
arch/arm64/include/asm/kvm_pmu.h | 61 ++
arch/arm64/kernel/cpufeature.c | 15 +
arch/arm64/kvm/Makefile | 2 +-
arch/arm64/kvm/arm.c | 22 +
arch/arm64/kvm/debug.c | 24 +-
arch/arm64/kvm/hyp/include/hyp/switch.h | 233 ++++++
arch/arm64/kvm/pmu-emul.c | 676 +----------------
arch/arm64/kvm/pmu-part.c | 359 +++++++++
arch/arm64/kvm/pmu.c | 687 ++++++++++++++++++
arch/arm64/kvm/sys_regs.c | 66 +-
arch/arm64/tools/cpucaps | 2 +
arch/arm64/tools/gen-sysreg.awk | 1 +
arch/arm64/tools/sysreg | 6 +-
drivers/perf/arm_pmuv3.c | 150 +++-
include/linux/perf/arm_pmu.h | 15 +-
include/linux/perf/arm_pmuv3.h | 14 +-
include/uapi/linux/kvm.h | 4 +
tools/include/uapi/linux/kvm.h | 2 +
.../selftests/kvm/arm64/vpmu_counter_access.c | 63 +-
virt/kvm/kvm_main.c | 1 +
24 files changed, 1791 insertions(+), 748 deletions(-)
create mode 100644 arch/arm64/kvm/pmu-part.c
base-commit: 79150772457f4d45e38b842d786240c36bb1f97f
--
2.50.0.714.g196bf9f422-goog
Corrected two instances of the misspelled word 'occurences' to
'occurrences' in comments explaining node invariants in sparsebit.c.
These comments describe core behavior of the data structure and
should be clear.
Signed-off-by: Rahul Kumar <rk0006818(a)gmail.com>
---
tools/testing/selftests/kvm/lib/sparsebit.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/tools/testing/selftests/kvm/lib/sparsebit.c b/tools/testing/selftests/kvm/lib/sparsebit.c
index cfed9d26cc71..a99188f87a38 100644
--- a/tools/testing/selftests/kvm/lib/sparsebit.c
+++ b/tools/testing/selftests/kvm/lib/sparsebit.c
@@ -116,7 +116,7 @@
*
* + A node with all mask bits set only occurs when the last bit
* described by the previous node is not equal to this nodes
- * starting index - 1. All such occurences of this condition are
+ * starting index - 1. All such occurrences of this condition are
* avoided by moving the setting of the nodes mask bits into
* the previous nodes num_after setting.
*
@@ -592,7 +592,7 @@ static struct node *node_split(struct sparsebit *s, sparsebit_idx_t idx)
*
* + A node with all mask bits set only occurs when the last bit
* described by the previous node is not equal to this nodes
- * starting index - 1. All such occurences of this condition are
+ * starting index - 1. All such occurrences of this condition are
* avoided by moving the setting of the nodes mask bits into
* the previous nodes num_after setting.
*/
--
2.43.0
This patch fixes two misspellings of the word 'occurrences' in comments within sparsebit.c used by the KVM selftests.
Fixing the spelling improves readability and clarity of the documented behavior.
Only comment text has been changed — there are no modifications to the functional logic of the tests.
I would appreciate your review and any feedback you may have.
Thank you for your time and support.
Best regards,
Rahul Kumar
Non-KVM folks,
I am hoping to route this through the KVM tree (6.17 or later), as the non-KVM
changes should be glorified nops. Please holler if you object to that idea.
Hyper-V folks in particular, let me know if you want a stable topic branch/tag,
e.g. on the off chance you want to make similar changes to the Hyper-V code,
and I'll make sure that happens.
As for what this series actually does...
Rework KVM's irqfd registration to require that an eventfd is bound to at
most one irqfd throughout the entire system. KVM currently disallows
binding an eventfd to multiple irqfds for a single VM, but doesn't reject
attempts to bind an eventfd to multiple VMs.
This is obviously an ABI change, but I'm fairly confident that it won't
break userspace, because binding an eventfd to multiple irqfds hasn't
truly worked since commit e8dbf19508a1 ("kvm/eventfd: Use priority waitqueue
to catch events before userspace"). A somewhat undocumented, and perhaps
even unintentional, side effect of suppressing eventfd notifications for
userspace is that the priority+exclusive behavior also suppresses eventfd
notifications for any subsequent waiters, even if they are priority waiters.
I.e. only the first VM with an irqfd+eventfd binding will get notifications.
And for IRQ bypass, a.k.a. device posted interrupts, globally unique
bindings are a hard requirement (at least on x86; I assume other archs are
the same). KVM and the IRQ bypass manager kinda sorta handle this, but in
the absolute worst way possible (IMO). Instead of surfacing an error to
userspace, KVM silently ignores IRQ bypass registration errors.
The motivation for this series is to harden against userspace goofs. AFAIK,
we (Google) have never actually had a bug where userspace tries to assign
an eventfd to multiple VMs, but the possibility has come up in more than one
bug investigation (our intra-host, a.k.a. copyless, migration scheme
transfers eventfds from the old to the new VM when updating the host VMM).
v3:
- Retain WQ_FLAG_EXCLUSIVE in mshv_eventfd.c, which snuck in between v1
and v2. [Peter]
- Use EXPORT_SYMBOL_GPL. [Peter]
- Move WQ_FLAG_EXCLUSIVE out of add_wait_queue_priority() in a prep patch
so that the affected subsystems are more explicitly documented (and then
immediately drop the flag from drivers/xen/privcmd.c, which amusingly
hides that file from the diff stats).
v2:
- https://lore.kernel.org/all/20250519185514.2678456-1-seanjc@google.com
- Use guard(spinlock_irqsave). [Prateek]
v1: https://lore.kernel.org/all/20250401204425.904001-1-seanjc@google.com
Sean Christopherson (13):
KVM: Use a local struct to do the initial vfs_poll() on an irqfd
KVM: Acquire SCRU lock outside of irqfds.lock during assignment
KVM: Initialize irqfd waitqueue callback when adding to the queue
KVM: Add irqfd to KVM's list via the vfs_poll() callback
KVM: Add irqfd to eventfd's waitqueue while holding irqfds.lock
sched/wait: Drop WQ_FLAG_EXCLUSIVE from add_wait_queue_priority()
xen: privcmd: Don't mark eventfd waiter as EXCLUSIVE
sched/wait: Add a waitqueue helper for fully exclusive priority
waiters
KVM: Disallow binding multiple irqfds to an eventfd with a priority
waiter
KVM: Drop sanity check that per-VM list of irqfds is unique
KVM: selftests: Assert that eventfd() succeeds in Xen shinfo test
KVM: selftests: Add utilities to create eventfds and do KVM_IRQFD
KVM: selftests: Add a KVM_IRQFD test to verify uniqueness requirements
drivers/hv/mshv_eventfd.c | 8 ++
include/linux/kvm_irqfd.h | 1 -
include/linux/wait.h | 2 +
kernel/sched/wait.c | 22 ++-
tools/testing/selftests/kvm/Makefile.kvm | 1 +
tools/testing/selftests/kvm/arm64/vgic_irq.c | 12 +-
.../testing/selftests/kvm/include/kvm_util.h | 40 ++++++
tools/testing/selftests/kvm/irqfd_test.c | 130 ++++++++++++++++++
.../selftests/kvm/x86/xen_shinfo_test.c | 21 +--
virt/kvm/eventfd.c | 130 +++++++++++++-----
10 files changed, 302 insertions(+), 65 deletions(-)
create mode 100644 tools/testing/selftests/kvm/irqfd_test.c
base-commit: 45eb29140e68ffe8e93a5471006858a018480a45
--
2.49.0.1151.ga128411c76-goog
Add a basic selftest for the netpoll polling mechanism, specifically
targeting the netpoll poll() side.
The test creates a scenario where network transmission is running at
maximum speed, and netpoll needs to poll the NIC. This is achieved by:
1. Configuring a single RX/TX queue to create contention
2. Generating background traffic to saturate the interface
3. Sending netconsole messages to trigger netpoll polling
4. Using dynamic netconsole targets via configfs
5. Delete and create new netconsole targets after 5 iterations
The test validates a critical netpoll code path by monitoring traffic
flow and ensuring netpoll_poll_dev() is called when the normal TX path
is blocked. Perf probing confirms this test successfully triggers
netpoll_poll_dev() in typical test runs.
This addresses a gap in netpoll test coverage for a path that is
tricky for the network stack.
Signed-off-by: Breno Leitao <leitao(a)debian.org>
---
Changes since RFC:
- Toggle the netconsole interfaces up and down after 5 iterations.
- Moved the traffic check under DEBUG (Willem de Bruijn).
- Bumped the iterations to 20 given it runs faster now.
- Link to the RFC: https://lore.kernel.org/r/20250612-netpoll_test-v1-1-4774fd95933f@debian.org
---
tools/testing/selftests/drivers/net/Makefile | 1 +
.../testing/selftests/drivers/net/netpoll_basic.py | 231 +++++++++++++++++++++
2 files changed, 232 insertions(+)
diff --git a/tools/testing/selftests/drivers/net/Makefile b/tools/testing/selftests/drivers/net/Makefile
index bd309b2d39095..9bd84d6b542e5 100644
--- a/tools/testing/selftests/drivers/net/Makefile
+++ b/tools/testing/selftests/drivers/net/Makefile
@@ -16,6 +16,7 @@ TEST_PROGS := \
netcons_fragmented_msg.sh \
netcons_overflow.sh \
netcons_sysdata.sh \
+ netpoll_basic.py \
ping.py \
queues.py \
stats.py \
diff --git a/tools/testing/selftests/drivers/net/netpoll_basic.py b/tools/testing/selftests/drivers/net/netpoll_basic.py
new file mode 100755
index 0000000000000..2a81926169262
--- /dev/null
+++ b/tools/testing/selftests/drivers/net/netpoll_basic.py
@@ -0,0 +1,231 @@
+#!/usr/bin/env python3
+# SPDX-License-Identifier: GPL-2.0
+
+# This test aims to evaluate the netpoll polling mechanism (as in
+# netpoll_poll_dev()). It presents a complex scenario where the network
+# attempts to send a packet but fails, prompting it to poll the NIC from within
+# the netpoll TX side.
+#
+# This has been a crucial path in netpoll that was previously untested. Jakub
+# suggested using a single RX/TX queue, pushing traffic to the NIC, and then
+# sending netpoll messages (via netconsole) to trigger the poll. `perf` probing
+# of netpoll_poll_dev() showed that this test indeed triggers
+# netpoll_poll_dev() once or twice in 10 iterations.
+
+# Author: Breno Leitao <leitao(a)debian.org>
+
+import errno
+import os
+import random
+import string
+import time
+
+from lib.py import (
+ ethtool,
+ GenerateTraffic,
+ ksft_exit,
+ ksft_pr,
+ ksft_run,
+ KsftFailEx,
+ KsftSkipEx,
+ NetdevFamily,
+ NetDrvEpEnv,
+)
+
+NETCONSOLE_CONFIGFS_PATH = "/sys/kernel/config/netconsole"
+REMOTE_PORT = 6666
+LOCAL_PORT = 1514
+# Number of netcons messages to send. I usually see netpoll_poll_dev()
+# being called at least once in 10 iterations. Having 20 to have some buffers
+ITERATIONS = 20
+DEBUG = False
+
+
+def generate_random_netcons_name() -> str:
+ """Generate a random target name starting with 'netcons'"""
+ random_suffix = "".join(random.choices(string.ascii_lowercase + string.digits, k=8))
+ return f"netcons_{random_suffix}"
+
+
+def get_stats(cfg: NetDrvEpEnv, netdevnl: NetdevFamily) -> dict[str, int]:
+ """Get the statistics for the interface"""
+ return netdevnl.qstats_get({"ifindex": cfg.ifindex}, dump=True)[0]
+
+
+def set_single_rx_tx_queue(interface_name: str) -> None:
+ """Set the number of RX and TX queues to 1 using ethtool"""
+ try:
+ # This don't need to be reverted, since interfaces will be deleted after test
+ ethtool(f"-G {interface_name} rx 1 tx 1")
+ except Exception as e:
+ raise KsftSkipEx(
+ f"Failed to configure RX/TX queues: {e}. Ethtool not available?"
+ )
+
+
+def create_netconsole_target(
+ config_data: dict[str, str],
+ target_name: str,
+) -> None:
+ """Create a netconsole dynamic target against the interfaces"""
+ ksft_pr(f"Using netconsole name: {target_name}")
+ try:
+ os.makedirs(f"{NETCONSOLE_CONFIGFS_PATH}/{target_name}", exist_ok=True)
+ ksft_pr(f"Created target directory: {NETCONSOLE_CONFIGFS_PATH}/{target_name}")
+ except OSError as e:
+ if e.errno != errno.EEXIST:
+ raise KsftFailEx(f"Failed to create netconsole target directory: {e}")
+
+ try:
+ for key, value in config_data.items():
+ if DEBUG:
+ ksft_pr(f"Setting {key} to {value}")
+ with open(
+ f"{NETCONSOLE_CONFIGFS_PATH}/{target_name}/{key}",
+ "w",
+ encoding="utf-8",
+ ) as f:
+ # Always convert to string to write to file
+ f.write(str(value))
+ f.close()
+
+ if DEBUG:
+ # Read all configuration values for debugging
+ for debug_key in config_data.keys():
+ with open(
+ f"{NETCONSOLE_CONFIGFS_PATH}/{target_name}/{debug_key}",
+ "r",
+ encoding="utf-8",
+ ) as f:
+ content = f.read()
+ ksft_pr(
+ f"{NETCONSOLE_CONFIGFS_PATH}/{target_name}/{debug_key} {content}"
+ )
+
+ except Exception as e:
+ raise KsftFailEx(f"Failed to configure netconsole target: {e}")
+
+
+def set_netconsole(cfg: NetDrvEpEnv, interface_name: str, target_name: str) -> None:
+ """Configure netconsole on the interface with the given target name"""
+ config_data = {
+ "extended": "1",
+ "dev_name": interface_name,
+ "local_port": LOCAL_PORT,
+ "remote_port": REMOTE_PORT,
+ "local_ip": cfg.addr_v["4"] if cfg.addr_ipver == "4" else cfg.addr_v["6"],
+ "remote_ip": (
+ cfg.remote_addr_v["4"] if cfg.addr_ipver == "4" else cfg.remote_addr_v["6"]
+ ),
+ "remote_mac": "00:00:00:00:00:00", # Not important for this test
+ "enabled": "1",
+ }
+
+ create_netconsole_target(config_data, target_name)
+ ksft_pr(f"Created netconsole target: {target_name} on interface {interface_name}")
+
+
+def delete_netconsole_target(name: str) -> None:
+ """Delete a netconsole dynamic target"""
+ target_path = f"{NETCONSOLE_CONFIGFS_PATH}/{name}"
+ try:
+ if os.path.exists(target_path):
+ os.rmdir(target_path)
+ except OSError as e:
+ raise KsftFailEx(f"Failed to delete netconsole target: {e}")
+
+
+def check_traffic_flowing(cfg: NetDrvEpEnv, netdevnl: NetdevFamily) -> int:
+ """Check if traffic is flowing on the interface"""
+ stat1 = get_stats(cfg, netdevnl)
+ time.sleep(1)
+ stat2 = get_stats(cfg, netdevnl)
+ pkts_per_sec = stat2["rx-packets"] - stat1["rx-packets"]
+ # Just make sure this will not fail even in slow/debug kernels
+ if pkts_per_sec < 10:
+ raise KsftFailEx(f"Traffic seems low: {pkts_per_sec}")
+ if DEBUG:
+ ksft_pr(f"Traffic per second {pkts_per_sec}")
+
+ return pkts_per_sec
+
+
+def do_netpoll_flush(
+ cfg: NetDrvEpEnv, netdevnl: NetdevFamily, ifname: str, target_name: str
+) -> None:
+ """Print messages to the console, trying to trigger a netpoll poll"""
+
+ set_netconsole(cfg, ifname, target_name)
+ for i in range(int(ITERATIONS)):
+ msg = f"netcons test #{i}."
+
+ if DEBUG:
+ pkts_per_s = check_traffic_flowing(cfg, netdevnl)
+ msg += f" ({pkts_per_s} packets/s)"
+
+ with open("/dev/kmsg", "w", encoding="utf-8") as kmsg:
+ kmsg.write(msg)
+
+ if not i % 5:
+ # Every 5 iterations, toggle netconsole
+ delete_netconsole_target(target_name)
+ set_netconsole(cfg, ifname, target_name)
+
+
+def test_netpoll(cfg: NetDrvEpEnv, netdevnl: NetdevFamily) -> None:
+ """
+ Test netpoll by sending traffic to the interface and then sending
+ netconsole messages to trigger a poll
+ """
+
+ target_name = generate_random_netcons_name()
+ ifname = cfg.dev["ifname"]
+ traffic = None
+
+ try:
+ set_single_rx_tx_queue(ifname)
+ traffic = GenerateTraffic(cfg)
+ check_traffic_flowing(cfg, netdevnl)
+ do_netpoll_flush(cfg, netdevnl, ifname, target_name)
+ finally:
+ if traffic:
+ traffic.stop()
+ delete_netconsole_target(target_name)
+
+
+def check_dependencies() -> None:
+ """Check if the dependencies are met"""
+ if not os.path.exists(NETCONSOLE_CONFIGFS_PATH):
+ raise KsftSkipEx(
+ f"Directory {NETCONSOLE_CONFIGFS_PATH} does not exist. CONFIG_NETCONSOLE_DYNAMIC might not be set."
+ )
+
+
+def load_netconsole_module() -> None:
+ """Try to load the netconsole module"""
+ try:
+ os.system("modprobe netconsole")
+ except Exception:
+ # It is fine if we fail to load the module, it will fail later
+ # at check_dependencies()
+ pass
+
+
+def main() -> None:
+ """Main function to run the test"""
+ load_netconsole_module()
+ check_dependencies()
+ netdevnl = NetdevFamily()
+ with NetDrvEpEnv(__file__, nsim_test=True) as cfg:
+ ksft_run(
+ [test_netpoll],
+ args=(
+ cfg,
+ netdevnl,
+ ),
+ )
+ ksft_exit()
+
+
+if __name__ == "__main__":
+ main()
---
base-commit: 4f4040ea5d3e4bebebbef9379f88085c8b99221c
change-id: 20250612-netpoll_test-a1324d2057c8
Best regards,
--
Breno Leitao <leitao(a)debian.org>
The current implementation of test_unmerge_uffd_wp() explicitly sets
`uffdio_api.features = UFFD_FEATURE_PAGEFAULT_FLAG_WP` before calling
UFFDIO_API. This can cause the ioctl() call to fail with EINVAL on kernels
that do not support UFFD-WP, leading the test to fail unnecessarily:
# ------------------------------
# running ./ksm_functional_tests
# ------------------------------
# TAP version 13
# 1..9
# # [RUN] test_unmerge
# ok 1 Pages were unmerged
# # [RUN] test_unmerge_zero_pages
# ok 2 KSM zero pages were unmerged
# # [RUN] test_unmerge_discarded
# ok 3 Pages were unmerged
# # [RUN] test_unmerge_uffd_wp
# not ok 4 UFFDIO_API failed <-----
# # [RUN] test_prot_none
# ok 5 Pages were unmerged
# # [RUN] test_prctl
# ok 6 Setting/clearing PR_SET_MEMORY_MERGE works
# # [RUN] test_prctl_fork
# # No pages got merged
# # [RUN] test_prctl_fork_exec
# ok 7 PR_SET_MEMORY_MERGE value is inherited
# # [RUN] test_prctl_unmerge
# ok 8 Pages were unmerged
# Bail out! 1 out of 8 tests failed
# # Planned tests != run tests (9 != 8)
# # Totals: pass:7 fail:1 xfail:0 xpass:0 skip:0 error:0
# [FAIL]
This patch improves compatibility and error handling by:
1. Changes the feature check to first query supported features (features=0)
rather than specifically requesting WP support.
2. Gracefully skipping the test if:
- UFFDIO_API fails with EINVAL (feature not supported), or
- UFFD_FEATURE_PAGEFAULT_FLAG_WP is not advertised by the kernel.
3. Providing better diagnostics by distinguishing expected failures (e.g.,
EINVAL) from unexpected ones and reporting them using strerror().
The updated logic makes the test more robust across different kernel versions
and configurations, while preserving existing behavior on systems that do
support UFFD-WP.
Signed-off-by: Li Wang <liwang(a)redhat.com>
Cc: Aruna Ramakrishna <aruna.ramakrishna(a)oracle.com>
Cc: Bagas Sanjaya <bagasdotme(a)gmail.com>
Cc: Catalin Marinas <catalin.marinas(a)arm.com>
Cc: Dave Hansen <dave.hansen(a)linux.intel.com>
Cc: David Hildenbrand <david(a)redhat.com>
Cc: Joey Gouly <joey.gouly(a)arm.com>
Cc: Johannes Weiner <hannes(a)cmpxchg.org>
Cc: Keith Lucas <keith.lucas(a)oracle.com>
Cc: Ryan Roberts <ryan.roberts(a)arm.com>
Cc: Shuah Khan <shuah(a)kernel.org>
---
tools/testing/selftests/mm/ksm_functional_tests.c | 8 ++++++--
1 file changed, 6 insertions(+), 2 deletions(-)
diff --git a/tools/testing/selftests/mm/ksm_functional_tests.c b/tools/testing/selftests/mm/ksm_functional_tests.c
index b61803e36d1c..f3db257dc555 100644
--- a/tools/testing/selftests/mm/ksm_functional_tests.c
+++ b/tools/testing/selftests/mm/ksm_functional_tests.c
@@ -393,9 +393,13 @@ static void test_unmerge_uffd_wp(void)
/* See if UFFD-WP is around. */
uffdio_api.api = UFFD_API;
- uffdio_api.features = UFFD_FEATURE_PAGEFAULT_FLAG_WP;
+ uffdio_api.features = 0;
if (ioctl(uffd, UFFDIO_API, &uffdio_api) < 0) {
- ksft_test_result_fail("UFFDIO_API failed\n");
+ if (errno == EINVAL)
+ ksft_test_result_skip("UFFDIO_API not supported (EINVAL)\n");
+ else
+ ksft_test_result_fail("UFFDIO_API failed: %s\n", strerror(errno));
+
goto close_uffd;
}
if (!(uffdio_api.features & UFFD_FEATURE_PAGEFAULT_FLAG_WP)) {
--
2.49.0
On GCC 15 the following warnings is emitted:
nolibc-test.c: In function ‘run_stdlib’:
nolibc-test.c:1416:32: warning: initializer-string for array of ‘char’ truncates NUL terminator but destination lacks ‘nonstring’ attribute (11 chars into 10 available) [-Wunterminated-string-initialization]
1416 | char buf[10] = "test123456";
| ^~~~~~~~~~~~
Increase the size of buf to avoid the warning.
It would also be possible to use __attribute__((nonstring)) but that
would require some ifdeffery to work with older compilers.
Fixes: 1063649cf531 ("selftests/nolibc: Add tests for strlcat() and strlcpy()")
Signed-off-by: Thomas Weißschuh <linux(a)weissschuh.net>
---
tools/testing/selftests/nolibc/nolibc-test.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/tools/testing/selftests/nolibc/nolibc-test.c b/tools/testing/selftests/nolibc/nolibc-test.c
index dbe13000fb1ac153e9a89f627492daeb584a05d4..52640d8ae402b9e34174ae798e74882ca750ec2b 100644
--- a/tools/testing/selftests/nolibc/nolibc-test.c
+++ b/tools/testing/selftests/nolibc/nolibc-test.c
@@ -1413,7 +1413,7 @@ int run_stdlib(int min, int max)
* Add some more chars after the \0, to test functions that overwrite the buffer set
* the \0 at the exact right position.
*/
- char buf[10] = "test123456";
+ char buf[11] = "test123456";
buf[4] = '\0';
---
base-commit: eb135311083100b6590a7545618cd9760d896a86
change-id: 20250623-nolibc-nonstring-7fe6974552b5
Best regards,
--
Thomas Weißschuh <linux(a)weissschuh.net>
Add a .gitignore for the test case build object.
Signed-off-by: Dylan Yudaken <dyudaken(a)gmail.com>
---
Hi,
I noticed this was causing some noise in my git checkout, but perhaps I was
doing something odd that it has not been noticed before?
Regards,
Dylan
tools/testing/selftests/kexec/.gitignore | 2 ++
1 file changed, 2 insertions(+)
create mode 100644 tools/testing/selftests/kexec/.gitignore
diff --git a/tools/testing/selftests/kexec/.gitignore b/tools/testing/selftests/kexec/.gitignore
new file mode 100644
index 000000000000..5f3d9e089ae8
--- /dev/null
+++ b/tools/testing/selftests/kexec/.gitignore
@@ -0,0 +1,2 @@
+# SPDX-License-Identifier: GPL-2.0-only
+test_kexec_jump
base-commit: 86731a2a651e58953fc949573895f2fa6d456841
--
2.49.0
Basics and overview
===================
Software with larger attack surfaces (e.g. network facing apps like databases,
browsers or apps relying on browser runtimes) suffer from memory corruption
issues which can be utilized by attackers to bend control flow of the program
to eventually gain control (by making their payload executable). Attackers are
able to perform such attacks by leveraging call-sites which rely on indirect
calls or return sites which rely on obtaining return address from stack memory.
To mitigate such attacks, risc-v extension zicfilp enforces that all indirect
calls must land on a landing pad instruction `lpad` else cpu will raise software
check exception (a new cpu exception cause code on riscv).
Similarly for return flow, risc-v extension zicfiss extends architecture with
- `sspush` instruction to push return address on a shadow stack
- `sspopchk` instruction to pop return address from shadow stack
and compare with input operand (i.e. return address on stack)
- `sspopchk` to raise software check exception if comparision above
was a mismatch
- Protection mechanism using which shadow stack is not writeable via
regular store instructions
More information an details can be found at extensions github repo [1].
Equivalent to landing pad (zicfilp) on x86 is `ENDBRANCH` instruction in Intel
CET [3] and branch target identification (BTI) [4] on arm.
Similarly x86's Intel CET has shadow stack [5] and arm64 has guarded control
stack (GCS) [6] which are very similar to risc-v's zicfiss shadow stack.
x86 and arm64 support for user mode shadow stack is already in mainline.
Kernel awareness for user control flow integrity
================================================
This series picks up Samuel Holland's envcfg changes [2] as well. So if those are
being applied independently, they should be removed from this series.
Enabling:
In order to maintain compatibility and not break anything in user mode, kernel
doesn't enable control flow integrity cpu extensions on binary by default.
Instead exposes a prctl interface to enable, disable and lock the shadow stack
or landing pad feature for a task. This allows userspace (loader) to enumerate
if all objects in its address space are compiled with shadow stack and landing
pad support and accordingly enable the feature. Additionally if a subsequent
`dlopen` happens on a library, user mode can take a decision again to disable
the feature (if incoming library is not compiled with support) OR terminate the
task (if user mode policy is strict to have all objects in address space to be
compiled with control flow integirty cpu feature). prctl to enable shadow stack
results in allocating shadow stack from virtual memory and activating for user
address space. x86 and arm64 are also following same direction due to similar
reason(s).
clone/fork:
On clone and fork, cfi state for task is inherited by child. Shadow stack is
part of virtual memory and is a writeable memory from kernel perspective
(writeable via a restricted set of instructions aka shadow stack instructions)
Thus kernel changes ensure that this memory is converted into read-only when
fork/clone happens and COWed when fault is taken due to sspush, sspopchk or
ssamoswap. In case `CLONE_VM` is specified and shadow stack is to be enabled,
kernel will automatically allocate a shadow stack for that clone call.
map_shadow_stack:
x86 introduced `map_shadow_stack` system call to allow user space to explicitly
map shadow stack memory in its address space. It is useful to allocate shadow
for different contexts managed by a single thread (green threads or contexts)
risc-v implements this system call as well.
signal management:
If shadow stack is enabled for a task, kernel performs an asynchronous control
flow diversion to deliver the signal and eventually expects userspace to issue
sigreturn so that original execution can be resumed. Even though resume context
is prepared by kernel, it is in user space memory and is subject to memory
corruption and corruption bugs can be utilized by attacker in this race window
to perform arbitrary sigreturn and eventually bypass cfi mechanism.
Another issue is how to ensure that cfi related state on sigcontext area is not
trampled by legacy apps or apps compiled with old kernel headers.
In order to mitigate control-flow hijacting, kernel prepares a token and place
it on shadow stack before signal delivery and places address of token in
sigcontext structure. During sigreturn, kernel obtains address of token from
sigcontext struture, reads token from shadow stack and validates it and only
then allow sigreturn to succeed. Compatiblity issue is solved by adopting
dynamic sigcontext management introduced for vector extension. This series
re-factor the code little bit to allow future sigcontext management easy (as
proposed by Andy Chiu from SiFive)
config and compilation:
Introduce a new risc-v config option `CONFIG_RISCV_USER_CFI`. Selecting this
config option picks the kernel support for user control flow integrity. This
optin is presented only if toolchain has shadow stack and landing pad support.
And is on purpose guarded by toolchain support. Reason being that eventually
vDSO also needs to be compiled in with shadow stack and landing pad support.
vDSO compile patches are not included as of now because landing pad labeling
scheme is yet to settle for usermode runtime.
To get more information on kernel interactions with respect to
zicfilp and zicfiss, patch series adds documentation for
`zicfilp` and `zicfiss` in following:
Documentation/arch/riscv/zicfiss.rst
Documentation/arch/riscv/zicfilp.rst
How to test this series
=======================
Toolchain
---------
$ git clone git@github.com:sifive/riscv-gnu-toolchain.git -b cfi-dev
$ riscv-gnu-toolchain/configure --prefix=<path-to-where-to-build> --with-arch=rv64gc_zicfilp_zicfiss --enable-linux --disable-gdb --with-extra-multilib-test="rv64gc_zicfilp_zicfiss-lp64d:-static"
$ make -j$(nproc)
Qemu
----
Get the lastest qemu
$ cd qemu
$ mkdir build
$ cd build
$ ../configure --target-list=riscv64-softmmu
$ make -j$(nproc)
Opensbi
-------
$ git clone git@github.com:deepak0414/opensbi.git -b v6_cfi_spec_split_opensbi
$ make CROSS_COMPILE=<your riscv toolchain> -j$(nproc) PLATFORM=generic
Linux
-----
Running defconfig is fine. CFI is enabled by default if the toolchain
supports it.
$ make ARCH=riscv CROSS_COMPILE=<path-to-cfi-riscv-gnu-toolchain>/build/bin/riscv64-unknown-linux-gnu- -j$(nproc) defconfig
$ make ARCH=riscv CROSS_COMPILE=<path-to-cfi-riscv-gnu-toolchain>/build/bin/riscv64-unknown-linux-gnu- -j$(nproc)
In case you're building your own rootfs using toolchain, please make sure you
pick following patch to ensure that vDSO compiled with lpad and shadow stack.
"arch/riscv: compile vdso with landing pad"
Branch where above patch can be picked
https://github.com/deepak0414/linux-riscv-cfi/tree/vdso_user_cfi_v6.12-rc1
Running
-------
Modify your qemu command to have:
-bios <path-to-cfi-opensbi>/build/platform/generic/firmware/fw_dynamic.bin
-cpu rv64,zicfilp=true,zicfiss=true,zimop=true,zcmop=true
vDSO related Opens (in the flux)
=================================
I am listing these opens for laying out plan and what to expect in future
patch sets. And of course for the sake of discussion.
Shadow stack and landing pad enabling in vDSO
----------------------------------------------
vDSO must have shadow stack and landing pad support compiled in for task
to have shadow stack and landing pad support. This patch series doesn't
enable that (yet). Enabling shadow stack support in vDSO should be
straight forward (intend to do that in next versions of patch set). Enabling
landing pad support in vDSO requires some collaboration with toolchain folks
to follow a single label scheme for all object binaries. This is necessary to
ensure that all indirect call-sites are setting correct label and target landing
pads are decorated with same label scheme.
How many vDSOs
---------------
Shadow stack instructions are carved out of zimop (may be operations) and if CPU
doesn't implement zimop, they're illegal instructions. Kernel could be running on
a CPU which may or may not implement zimop. And thus kernel will have to carry 2
different vDSOs and expose the appropriate one depending on whether CPU implements
zimop or not.
References
==========
[1] - https://github.com/riscv/riscv-cfi
[2] - https://lore.kernel.org/all/20240814081126.956287-1-samuel.holland@sifive.c…
[3] - https://lwn.net/Articles/889475/
[4] - https://developer.arm.com/documentation/109576/0100/Branch-Target-Identific…
[5] - https://www.intel.com/content/dam/develop/external/us/en/documents/catc17-i…
[6] - https://lwn.net/Articles/940403/
To: Thomas Gleixner <tglx(a)linutronix.de>
To: Ingo Molnar <mingo(a)redhat.com>
To: Borislav Petkov <bp(a)alien8.de>
To: Dave Hansen <dave.hansen(a)linux.intel.com>
To: x86(a)kernel.org
To: H. Peter Anvin <hpa(a)zytor.com>
To: Andrew Morton <akpm(a)linux-foundation.org>
To: Liam R. Howlett <Liam.Howlett(a)oracle.com>
To: Vlastimil Babka <vbabka(a)suse.cz>
To: Lorenzo Stoakes <lorenzo.stoakes(a)oracle.com>
To: Paul Walmsley <paul.walmsley(a)sifive.com>
To: Palmer Dabbelt <palmer(a)dabbelt.com>
To: Albert Ou <aou(a)eecs.berkeley.edu>
To: Conor Dooley <conor(a)kernel.org>
To: Rob Herring <robh(a)kernel.org>
To: Krzysztof Kozlowski <krzk+dt(a)kernel.org>
To: Arnd Bergmann <arnd(a)arndb.de>
To: Christian Brauner <brauner(a)kernel.org>
To: Peter Zijlstra <peterz(a)infradead.org>
To: Oleg Nesterov <oleg(a)redhat.com>
To: Eric Biederman <ebiederm(a)xmission.com>
To: Kees Cook <kees(a)kernel.org>
To: Jonathan Corbet <corbet(a)lwn.net>
To: Shuah Khan <shuah(a)kernel.org>
To: Jann Horn <jannh(a)google.com>
To: Conor Dooley <conor+dt(a)kernel.org>
To: Miguel Ojeda <ojeda(a)kernel.org>
To: Alex Gaynor <alex.gaynor(a)gmail.com>
To: Boqun Feng <boqun.feng(a)gmail.com>
To: Gary Guo <gary(a)garyguo.net>
To: Björn Roy Baron <bjorn3_gh(a)protonmail.com>
To: Benno Lossin <benno.lossin(a)proton.me>
To: Andreas Hindborg <a.hindborg(a)kernel.org>
To: Alice Ryhl <aliceryhl(a)google.com>
To: Trevor Gross <tmgross(a)umich.edu>
Cc: linux-kernel(a)vger.kernel.org
Cc: linux-fsdevel(a)vger.kernel.org
Cc: linux-mm(a)kvack.org
Cc: linux-riscv(a)lists.infradead.org
Cc: devicetree(a)vger.kernel.org
Cc: linux-arch(a)vger.kernel.org
Cc: linux-doc(a)vger.kernel.org
Cc: linux-kselftest(a)vger.kernel.org
Cc: alistair.francis(a)wdc.com
Cc: richard.henderson(a)linaro.org
Cc: jim.shu(a)sifive.com
Cc: andybnac(a)gmail.com
Cc: kito.cheng(a)sifive.com
Cc: charlie(a)rivosinc.com
Cc: atishp(a)rivosinc.com
Cc: evan(a)rivosinc.com
Cc: cleger(a)rivosinc.com
Cc: alexghiti(a)rivosinc.com
Cc: samitolvanen(a)google.com
Cc: broonie(a)kernel.org
Cc: rick.p.edgecombe(a)intel.com
Cc: rust-for-linux(a)vger.kernel.org
changelog
---------
v17:
- fixed warnings due to empty macros in usercfi.h (reported by alexg)
- fixed prefixes in commit titles reported by alexg
- took below uprobe with fcfi v2 patch from Zong Li and squashed it with
"riscv/traps: Introduce software check exception and uprobe handling"
https://lore.kernel.org/all/20250604093403.10916-1-zong.li@sifive.com/
v16:
- If FWFT is not implemented or returns error for shadow stack activation, then
no_usercfi is set to disable shadow stack. Although this should be picked up
by extension validation and activation. Fixed this bug for zicfilp and zicfiss
both. Thanks to Charlie Jenkins for reporting this.
- If toolchain doesn't support cfi, cfi kselftest shouldn't build. Suggested by
Charlie Jenkins.
- Default for CONFIG_RISCV_USER_CFI is set to no. Charlie/Atish suggested to
keep it off till we have more hardware availibility with RVA23 profile and
zimop/zcmop implemented. Else this will start breaking people's workflow
- Includes the fix if "!RV64 and !SBI" then definitions for FWFT in
asm-offsets.c error.
v15:
- Toolchain has been updated to include `-fcf-protection` flag. This
exists for x86 as well. Updated kernel patches to compile vDSO and
selftest to compile with `fcf-protection=full` flag.
- selecting CONFIG_RISCV_USERCFI selects CONFIG_RISCV_SBI.
- Patch to enable shadow stack for kernel wasn't hidden behind
CONFIG_RISCV_USERCFI and CONFIG_RISCV_SBI. fixed that.
v14:
- rebased on top of palmer/sbi-v3. Thus dropped clement's FWFT patches
Updated RISCV_ISA_EXT_XXXX in hwcap and hwprobe constants.
- Took Radim's suggestions on bitfields.
- Placed cfi_state at the end of thread_info block so that current situation
is not disturbed with respect to member fields of thread_info in single
cacheline.
v13:
- cpu_supports_shadow_stack/cpu_supports_indirect_br_lp_instr uses
riscv_has_extension_unlikely()
- uses nops(count) to create nop slide
- RISCV_ACQUIRE_BARRIER is not needed in `amo_user_shstk`. Removed it
- changed ternaries to simply use implicit casting to convert to bool.
- kernel command line allows to disable zicfilp and zicfiss independently.
updated kernel-parameters.txt.
- ptrace user abi for cfi uses bitmasks instead of bitfields. Added ptrace
kselftest.
- cosmetic and grammatical changes to documentation.
v12:
- It seems like I had accidently squashed arch agnostic indirect branch
tracking prctl and riscv implementation of those prctls. Split them again.
- set_shstk_status/set_indir_lp_status perform CSR writes only when CPU
support is available. As suggested by Zong Li.
- Some minor clean up in kselftests as suggested by Zong Li.
v11:
- patch "arch/riscv: compile vdso with landing pad" was unconditionally
selecting `_zicfilp` for vDSO compile. fixed that. Changed `lpad 1` to
to `lpad 0`.
v10:
- dropped "mm: helper `is_shadow_stack_vma` to check shadow stack vma". This patch
is not that interesting to this patch series for risc-v. There are instances in
arch directories where VM_SHADOW_STACK flag is anyways used. Dropping this patch
to expedite merging in riscv tree.
- Took suggestions from `Clement` on "riscv: zicfiss / zicfilp enumeration" to
validate presence of cfi based on config.
- Added a patch for vDSO to have `lpad 0`. I had omitted this earlier to make sure
we add single vdso object with cfi enabled. But a vdso object with scheme of
zero labeled landing pad is least common denominator and should work with all
objects of zero labeled as well as function-signature labeled objects.
v9:
- rebased on master (39a803b754d5 fix braino in "9p: fix ->rename_sem exclusion")
- dropped "mm: Introduce ARCH_HAS_USER_SHADOW_STACK" (master has it from arm64/gcs)
- dropped "prctl: arch-agnostic prctl for shadow stack" (master has it from arm64/gcs)
v8:
- rebased on palmer/for-next
- dropped samuel holland's `envcfg` context switch patches.
they are in parlmer/for-next
v7:
- Removed "riscv/Kconfig: enable HAVE_EXIT_THREAD for riscv"
Instead using `deactivate_mm` flow to clean up.
see here for more context
https://lore.kernel.org/all/20230908203655.543765-1-rick.p.edgecombe@intel.…
- Changed the header include in `kselftest`. Hopefully this fixes compile
issue faced by Zong Li at SiFive.
- Cleaned up an orphaned change to `mm/mmap.c` in below patch
"riscv/mm : ensure PROT_WRITE leads to VM_READ | VM_WRITE"
- Lock interfaces for shadow stack and indirect branch tracking expect arg == 0
Any future evolution of this interface should accordingly define how arg should
be setup.
- `mm/map.c` has an instance of using `VM_SHADOW_STACK`. Fixed it to use helper
`is_shadow_stack_vma`.
- Link to v6: https://lore.kernel.org/r/20241008-v5_user_cfi_series-v6-0-60d9fe073f37@riv…
v6:
- Picked up Samuel Holland's changes as is with `envcfg` placed in
`thread` instead of `thread_info`
- fixed unaligned newline escapes in kselftest
- cleaned up messages in kselftest and included test output in commit message
- fixed a bug in clone path reported by Zong Li
- fixed a build issue if CONFIG_RISCV_ISA_V is not selected
(this was introduced due to re-factoring signal context
management code)
v5:
- rebased on v6.12-rc1
- Fixed schema related issues in device tree file
- Fixed some of the documentation related issues in zicfilp/ss.rst
(style issues and added index)
- added `SHADOW_STACK_SET_MARKER` so that implementation can define base
of shadow stack.
- Fixed warnings on definitions added in usercfi.h when
CONFIG_RISCV_USER_CFI is not selected.
- Adopted context header based signal handling as proposed by Andy Chiu
- Added support for enabling kernel mode access to shadow stack using
FWFT
(https://github.com/riscv-non-isa/riscv-sbi-doc/blob/master/src/ext-firmware…)
- Link to v5: https://lore.kernel.org/r/20241001-v5_user_cfi_series-v1-0-3ba65b6e550f@riv…
(Note: I had an issue in my workflow due to which version number wasn't
picked up correctly while sending out patches)
v4:
- rebased on 6.11-rc6
- envcfg: Converged with Samuel Holland's patches for envcfg management on per-
thread basis.
- vma_is_shadow_stack is renamed to is_vma_shadow_stack
- picked up Mark Brown's `ARCH_HAS_USER_SHADOW_STACK` patch
- signal context: using extended context management to maintain compatibility.
- fixed `-Wmissing-prototypes` compiler warnings for prctl functions
- Documentation fixes and amending typos.
- Link to v4: https://lore.kernel.org/all/20240912231650.3740732-1-debug@rivosinc.com/
v3:
- envcfg
logic to pick up base envcfg had a bug where `ENVCFG_CBZE` could have been
picked on per task basis, even though CPU didn't implement it. Fixed in
this series.
- dt-bindings
As suggested, split into separate commit. fixed the messaging that spec is
in public review
- arch_is_shadow_stack change
arch_is_shadow_stack changed to vma_is_shadow_stack
- hwprobe
zicfiss / zicfilp if present will get enumerated in hwprobe
- selftests
As suggested, added object and binary filenames to .gitignore
Selftest binary anyways need to be compiled with cfi enabled compiler which
will make sure that landing pad and shadow stack are enabled. Thus removed
separate enable/disable tests. Cleaned up tests a bit.
- Link to v3: https://lore.kernel.org/lkml/20240403234054.2020347-1-debug@rivosinc.com/
v2:
- Using config `CONFIG_RISCV_USER_CFI`, kernel support for riscv control flow
integrity for user mode programs can be compiled in the kernel.
- Enabling of control flow integrity for user programs is left to user runtime
- This patch series introduces arch agnostic `prctls` to enable shadow stack
and indirect branch tracking. And implements them on riscv.
---
Changes in v17:
- Link to v16: https://lore.kernel.org/r/20250522-v5_user_cfi_series-v16-0-64f61a35eee7@ri…
Changes in v16:
- Link to v15: https://lore.kernel.org/r/20250502-v5_user_cfi_series-v15-0-914966471885@ri…
Changes in v15:
- changelog posted just below cover letter
- Link to v14: https://lore.kernel.org/r/20250429-v5_user_cfi_series-v14-0-5239410d012a@ri…
Changes in v14:
- changelog posted just below cover letter
- Link to v13: https://lore.kernel.org/r/20250424-v5_user_cfi_series-v13-0-971437de586a@ri…
Changes in v13:
- changelog posted just below cover letter
- Link to v12: https://lore.kernel.org/r/20250314-v5_user_cfi_series-v12-0-e51202b53138@ri…
Changes in v12:
- changelog posted just below cover letter
- Link to v11: https://lore.kernel.org/r/20250310-v5_user_cfi_series-v11-0-86b36cbfb910@ri…
Changes in v11:
- changelog posted just below cover letter
- Link to v10: https://lore.kernel.org/r/20250210-v5_user_cfi_series-v10-0-163dcfa31c60@ri…
---
Andy Chiu (1):
riscv: signal: abstract header saving for setup_sigcontext
Deepak Gupta (25):
mm: VM_SHADOW_STACK definition for riscv
dt-bindings: riscv: zicfilp and zicfiss in dt-bindings (extensions.yaml)
riscv: zicfiss / zicfilp enumeration
riscv: zicfiss / zicfilp extension csr and bit definitions
riscv: usercfi state for task and save/restore of CSR_SSP on trap entry/exit
riscv/mm : ensure PROT_WRITE leads to VM_READ | VM_WRITE
riscv/mm: manufacture shadow stack pte
riscv/mm: teach pte_mkwrite to manufacture shadow stack PTEs
riscv/mm: write protect and shadow stack
riscv/mm: Implement map_shadow_stack() syscall
riscv/shstk: If needed allocate a new shadow stack on clone
riscv: Implements arch agnostic shadow stack prctls
prctl: arch-agnostic prctl for indirect branch tracking
riscv: Implements arch agnostic indirect branch tracking prctls
riscv/traps: Introduce software check exception and uprobe handling
riscv/signal: save and restore of shadow stack for signal
riscv/kernel: update __show_regs to print shadow stack register
riscv/ptrace: riscv cfi status and state via ptrace and in core files
riscv/hwprobe: zicfilp / zicfiss enumeration in hwprobe
riscv: kernel command line option to opt out of user cfi
riscv: enable kernel access to shadow stack memory via FWFT sbi call
riscv: create a config for shadow stack and landing pad instr support
riscv: Documentation for landing pad / indirect branch tracking
riscv: Documentation for shadow stack on riscv
kselftest/riscv: kselftest for user mode cfi
Jim Shu (1):
arch/riscv: compile vdso with landing pad
Documentation/admin-guide/kernel-parameters.txt | 8 +
Documentation/arch/riscv/index.rst | 2 +
Documentation/arch/riscv/zicfilp.rst | 115 +++++
Documentation/arch/riscv/zicfiss.rst | 179 +++++++
.../devicetree/bindings/riscv/extensions.yaml | 14 +
arch/riscv/Kconfig | 21 +
arch/riscv/Makefile | 5 +-
arch/riscv/include/asm/asm-prototypes.h | 1 +
arch/riscv/include/asm/assembler.h | 44 ++
arch/riscv/include/asm/cpufeature.h | 12 +
arch/riscv/include/asm/csr.h | 16 +
arch/riscv/include/asm/entry-common.h | 2 +
arch/riscv/include/asm/hwcap.h | 2 +
arch/riscv/include/asm/mman.h | 26 +
arch/riscv/include/asm/mmu_context.h | 7 +
arch/riscv/include/asm/pgtable.h | 30 +-
arch/riscv/include/asm/processor.h | 2 +
arch/riscv/include/asm/thread_info.h | 3 +
arch/riscv/include/asm/usercfi.h | 95 ++++
arch/riscv/include/asm/vector.h | 3 +
arch/riscv/include/uapi/asm/hwprobe.h | 2 +
arch/riscv/include/uapi/asm/ptrace.h | 34 ++
arch/riscv/include/uapi/asm/sigcontext.h | 1 +
arch/riscv/kernel/Makefile | 1 +
arch/riscv/kernel/asm-offsets.c | 10 +
arch/riscv/kernel/cpufeature.c | 27 +
arch/riscv/kernel/entry.S | 33 +-
arch/riscv/kernel/head.S | 27 +
arch/riscv/kernel/process.c | 27 +-
arch/riscv/kernel/ptrace.c | 95 ++++
arch/riscv/kernel/signal.c | 148 +++++-
arch/riscv/kernel/sys_hwprobe.c | 2 +
arch/riscv/kernel/sys_riscv.c | 10 +
arch/riscv/kernel/traps.c | 51 ++
arch/riscv/kernel/usercfi.c | 545 +++++++++++++++++++++
arch/riscv/kernel/vdso/Makefile | 6 +
arch/riscv/kernel/vdso/flush_icache.S | 4 +
arch/riscv/kernel/vdso/getcpu.S | 4 +
arch/riscv/kernel/vdso/rt_sigreturn.S | 4 +
arch/riscv/kernel/vdso/sys_hwprobe.S | 4 +
arch/riscv/mm/init.c | 2 +-
arch/riscv/mm/pgtable.c | 16 +
include/linux/cpu.h | 4 +
include/linux/mm.h | 7 +
include/uapi/linux/elf.h | 2 +
include/uapi/linux/prctl.h | 27 +
kernel/sys.c | 30 ++
tools/testing/selftests/riscv/Makefile | 2 +-
tools/testing/selftests/riscv/cfi/.gitignore | 3 +
tools/testing/selftests/riscv/cfi/Makefile | 16 +
tools/testing/selftests/riscv/cfi/cfi_rv_test.h | 82 ++++
tools/testing/selftests/riscv/cfi/riscv_cfi_test.c | 173 +++++++
tools/testing/selftests/riscv/cfi/shadowstack.c | 385 +++++++++++++++
tools/testing/selftests/riscv/cfi/shadowstack.h | 27 +
54 files changed, 2369 insertions(+), 29 deletions(-)
---
base-commit: 4181f8ad7a1061efed0219951d608d4988302af7
change-id: 20240930-v5_user_cfi_series-3dc332f8f5b2
--
- debug
From: Chia-Yu Chang <chia-yu.chang(a)nokia-bell-labs.com>
Hello,
Please find the v8 AccECN protocol patch series, which covers the core
functionality of Accurate ECN, AccECN negotiation, AccECN TCP options,
and AccECN failure handling. The Accurate ECN draft can be found in
https://datatracker.ietf.org/doc/html/draft-ietf-tcpm-accurate-ecn-28
This patch series is part of the full AccECN patch series, which is available at
https://github.com/L4STeam/linux-net-next/commits/upstream_l4steam/
v8 (10-Jun-2025)
- Add new helper function tcp_ecn_received_counters_payload() in #6 (Paolo Abeni <pabeni(a)redhat.com>)
- Set opts->num_sack_blocks=0 to avoid potential undefined value in #8 (Paolo Abeni <pabeni(a)redhat.com>)
- Reset leftover_size to 2 once leftover_bytes is used in #9 (Paolo Abeni <pabeni(a)redhat.com>)
- Add new helper function tcp_accecn_opt_demand_min() in #10 (Paolo Abeni <pabeni(a)redhat.com>)
- Add new helper function tcp_accecn_saw_opt_fail_recv() in #11 (Paolo Abeni <pabeni(a)redhat.com>)
- Update tcp_options_fit_accecn() to avoid using recursion in #14 (Paolo Abeni <pabeni(a)redhat.com>)
v7 (14-May-2025)
- Modify group sizes of tcp_sock_write_txrx and tcp_sock_write_rx in #3 based on pahole results (Paolo Abeni <pabeni(a)redhat.com>)
- Fix the issue in #4 and #5 where the RFC3168 ECN behavior in tcp_ecn_send() is changed (Paolo Abeni <pabeni(a)redhat.com>)
- Modify group size of tcp_sock_write_txrx in #4 and #6 based on pahole results (Paolo Abeni <pabeni(a)redhat.com>)
- Update commit message for #9 to explain the increase in tcp_sock_write_rx group size
- Modify group size of tcp_sock_write_tx in #10 based on pahole results
v6 (09-May-2025)
- Add #3 to utilize exisintg holes of tcp_sock_write_txrx group for later patches (#4, #9, #10) with new u8 members (Paolo Abeni <pabeni(a)redhat.com>)
- Add pahole outcomes before and after commit in #4, #5, #6, #9, #10, #15 (Paolo Abeni <pabeni(a)redhat.com>)
- Define new helper function tcp_send_ack_reflect_ect() for sending ACK with reflected ECT in #5 (Paolo Abeni <pabeni(a)redhat.com>)
- Add comments for function tcp_ecn_rcv_synack() in #5 (Paolo Abeni <pabeni(a)redhat.com>)
- Add enum/define to be used by sysctl_tcp_ecn in #5, sysctl_tcp_ecn_option in #9, and sysctl_tcp_ecn_option_beacon in #10 (Paolo Abeni <pabeni(a)redhat.com>)
- Move accecn_fail_mode and saw_accecn_opt in #5 and #11 to use exisintg holes of tcp_sock (Paolo Abeni <pabeni(a)redhat.com>)
- Change data type of new members of tcp_request_sock and move them to the end of struct in #5 and #11 (Paolo Abeni <pabeni(a)redhat.com>)
- Move new members of tcp_info to the end of struct in #6 (Paolo Abeni <pabeni(a)redhat.com>)
- Merge previous #7 into #9 (Paolo Abeni <pabeni(a)redhat.com>)
- Mask ecnfield with INET_ECN_MASK to remove WARN_ONCE in #9 (Paolo Abeni <pabeni(a)redhat.com>)
- Reduce the indentation levels for reabability in #9 and #10 (Paolo Abeni <pabeni(a)redhat.com>)
- Move delivered_ecn_bytes to the RX group in #9, accecn_opt_tstamp to the TX group in #10, pkts_acked_ewma to the RX group in #15 (Paolo Abeni <pabeni(a)redhat.com>)
- Add changes in Documentation/networking/net_cachelines/tcp_sock.rst for new tcp_sock members in #3, #5, #6, #9, #10, #15
v5 (22-Apr-2025)
- Further fix for 32-bit ARM alignment in tcp.c (Simon Horman <horms(a)kernel.org>)
v4 (18-Apr-2025)
- Fix 32-bit ARM assertion for alignment requirement (Simon Horman <horms(a)kernel.org>)
v3 (14-Apr-2025)
- Fix patch apply issue in v2 (Jakub Kicinski <kuba(a)kernel.org>)
v2 (18-Mar-2025)
- Add one missing patch from the previous AccECN protocol preparation patch series to this patch series.
Best regards,
Chia-Yu
Chia-Yu Chang (3):
tcp: reorganize tcp_sock_write_txrx group for variables later
tcp: accecn: AccECN option failure handling
tcp: accecn: try to fit AccECN option with SACK
Ilpo Järvinen (12):
tcp: reorganize SYN ECN code
tcp: fast path functions later
tcp: AccECN core
tcp: accecn: AccECN negotiation
tcp: accecn: add AccECN rx byte counters
tcp: accecn: AccECN needs to know delivered bytes
tcp: sack option handling improvements
tcp: accecn: AccECN option
tcp: accecn: AccECN option send control
tcp: accecn: AccECN option ceb/cep heuristic
tcp: accecn: AccECN ACE field multi-wrap heuristic
tcp: try to avoid safer when ACKs are thinned
.../networking/net_cachelines/tcp_sock.rst | 14 +
include/linux/tcp.h | 34 +-
include/net/netns/ipv4.h | 2 +
include/net/tcp.h | 225 ++++++-
include/uapi/linux/tcp.h | 7 +
net/ipv4/syncookies.c | 3 +
net/ipv4/sysctl_net_ipv4.c | 19 +
net/ipv4/tcp.c | 30 +-
net/ipv4/tcp_input.c | 611 +++++++++++++++++-
net/ipv4/tcp_ipv4.c | 7 +-
net/ipv4/tcp_minisocks.c | 91 ++-
net/ipv4/tcp_output.c | 303 ++++++++-
net/ipv6/syncookies.c | 1 +
net/ipv6/tcp_ipv6.c | 1 +
14 files changed, 1250 insertions(+), 98 deletions(-)
--
2.34.1
NOTE: I'm leaving Daynix Computing Ltd., for which I worked on this
patch series, by the end of this month.
While net-next is closed, this is the last chance for me to send another
version so let me send the local changes now.
Please contact Yuri Benditovich, who is CCed on this email, for anything
about this series.
virtio-net have two usage of hashes: one is RSS and another is hash
reporting. Conventionally the hash calculation was done by the VMM.
However, computing the hash after the queue was chosen defeats the
purpose of RSS.
Another approach is to use eBPF steering program. This approach has
another downside: it cannot report the calculated hash due to the
restrictive nature of eBPF.
Introduce the code to compute hashes to the kernel in order to overcome
thse challenges.
An alternative solution is to extend the eBPF steering program so that it
will be able to report to the userspace, but it is based on context
rewrites, which is in feature freeze. We can adopt kfuncs, but they will
not be UAPIs. We opt to ioctl to align with other relevant UAPIs (KVM
and vhost_net).
The patches for QEMU to use this new feature was submitted as RFC and
is available at:
https://patchew.org/QEMU/20250530-hash-v5-0-343d7d7a8200@daynix.com/
This work was presented at LPC 2024:
https://lpc.events/event/18/contributions/1963/
V1 -> V2:
Changed to introduce a new BPF program type.
Signed-off-by: Akihiko Odaki <akihiko.odaki(a)daynix.com>
---
Changes in v12:
- Updated tools/testing/selftests/net/config.
- Split TUNSETVNETHASH.
- Link to v11: https://lore.kernel.org/r/20250317-rss-v11-0-4cacca92f31f@daynix.com
Changes in v11:
- Added the missing code to free vnet_hash in patch
"tap: Introduce virtio-net hash feature".
- Link to v10: https://lore.kernel.org/r/20250313-rss-v10-0-3185d73a9af0@daynix.com
Changes in v10:
- Split common code and TUN/TAP-specific code into separate patches.
- Reverted a spurious style change in patch "tun: Introduce virtio-net
hash feature".
- Added a comment explaining disable_ipv6 in tests.
- Used AF_PACKET for patch "selftest: tun: Add tests for
virtio-net hashing". I also added the usage of FIXTURE_VARIANT() as
the testing function now needs access to more variant-specific
variables.
- Corrected the message of patch "selftest: tun: Add tests for
virtio-net hashing"; it mentioned validation of configuration but
it is not scope of this patch.
- Expanded the description of patch "selftest: tun: Add tests for
virtio-net hashing".
- Added patch "tun: Allow steering eBPF program to fall back".
- Changed to handle TUNGETVNETHASHCAP before taking the rtnl lock.
- Removed redundant tests for tun_vnet_ioctl().
- Added patch "selftest: tap: Add tests for virtio-net ioctls".
- Added a design explanation of ioctls for extensibility and migration.
- Removed a few branches in patch
"vhost/net: Support VIRTIO_NET_F_HASH_REPORT".
- Link to v9: https://lore.kernel.org/r/20250307-rss-v9-0-df76624025eb@daynix.com
Changes in v9:
- Added a missing return statement in patch
"tun: Introduce virtio-net hash feature".
- Link to v8: https://lore.kernel.org/r/20250306-rss-v8-0-7ab4f56ff423@daynix.com
Changes in v8:
- Disabled IPv6 to eliminate noises in tests.
- Added a branch in tap to avoid unnecessary dissection when hash
reporting is disabled.
- Removed unnecessary rtnl_lock().
- Extracted code to handle new ioctls into separate functions to avoid
adding extra NULL checks to the code handling other ioctls.
- Introduced variable named "fd" to __tun_chr_ioctl().
- s/-/=/g in a patch message to avoid confusing Git.
- Link to v7: https://lore.kernel.org/r/20250228-rss-v7-0-844205cbbdd6@daynix.com
Changes in v7:
- Ensured to set hash_report to VIRTIO_NET_HASH_REPORT_NONE for
VHOST_NET_F_VIRTIO_NET_HDR.
- s/4/sizeof(u32)/ in patch "virtio_net: Add functions for hashing".
- Added tap_skb_cb type.
- Rebased.
- Link to v6: https://lore.kernel.org/r/20250109-rss-v6-0-b1c90ad708f6@daynix.com
Changes in v6:
- Extracted changes to fill vnet header holes into another series.
- Squashed patches "skbuff: Introduce SKB_EXT_TUN_VNET_HASH", "tun:
Introduce virtio-net hash reporting feature", and "tun: Introduce
virtio-net RSS" into patch "tun: Introduce virtio-net hash feature".
- Dropped the RFC tag.
- Link to v5: https://lore.kernel.org/r/20241008-rss-v5-0-f3cf68df005d@daynix.com
Changes in v5:
- Fixed a compilation error with CONFIG_TUN_VNET_CROSS_LE.
- Optimized the calculation of the hash value according to:
https://git.dpdk.org/dpdk/commit/?id=3fb1ea032bd6ff8317af5dac9af901f1f324ca…
- Added patch "tun: Unify vnet implementation".
- Dropped patch "tap: Pad virtio header with zero".
- Added patch "selftest: tun: Test vnet ioctls without device".
- Reworked selftests to skip for older kernels.
- Documented the case when the underlying device is deleted and packets
have queue_mapping set by TC.
- Reordered test harness arguments.
- Added code to handle fragmented packets.
- Link to v4: https://lore.kernel.org/r/20240924-rss-v4-0-84e932ec0e6c@daynix.com
Changes in v4:
- Moved tun_vnet_hash_ext to if_tun.h.
- Renamed virtio_net_toeplitz() to virtio_net_toeplitz_calc().
- Replaced htons() with cpu_to_be16().
- Changed virtio_net_hash_rss() to return void.
- Reordered variable declarations in virtio_net_hash_rss().
- Removed virtio_net_hdr_v1_hash_from_skb().
- Updated messages of "tap: Pad virtio header with zero" and
"tun: Pad virtio header with zero".
- Fixed vnet_hash allocation size.
- Ensured to free vnet_hash when destructing tun_struct.
- Link to v3: https://lore.kernel.org/r/20240915-rss-v3-0-c630015db082@daynix.com
Changes in v3:
- Reverted back to add ioctl.
- Split patch "tun: Introduce virtio-net hashing feature" into
"tun: Introduce virtio-net hash reporting feature" and
"tun: Introduce virtio-net RSS".
- Changed to reuse hash values computed for automq instead of performing
RSS hashing when hash reporting is requested but RSS is not.
- Extracted relevant data from struct tun_struct to keep it minimal.
- Added kernel-doc.
- Changed to allow calling TUNGETVNETHASHCAP before TUNSETIFF.
- Initialized num_buffers with 1.
- Added a test case for unclassified packets.
- Fixed error handling in tests.
- Changed tests to verify that the queue index will not overflow.
- Rebased.
- Link to v2: https://lore.kernel.org/r/20231015141644.260646-1-akihiko.odaki@daynix.com
---
Akihiko Odaki (10):
virtio_net: Add functions for hashing
net: flow_dissector: Export flow_keys_dissector_symmetric
tun: Allow steering eBPF program to fall back
tun: Add common virtio-net hash feature code
tun: Introduce virtio-net hash feature
tap: Introduce virtio-net hash feature
selftest: tun: Test vnet ioctls without device
selftest: tun: Add tests for virtio-net hashing
selftest: tap: Add tests for virtio-net ioctls
vhost/net: Support VIRTIO_NET_F_HASH_REPORT
Documentation/networking/tuntap.rst | 7 +
drivers/net/Kconfig | 1 +
drivers/net/ipvlan/ipvtap.c | 2 +-
drivers/net/macvtap.c | 2 +-
drivers/net/tap.c | 80 +++++-
drivers/net/tun.c | 92 +++++--
drivers/net/tun_vnet.h | 165 +++++++++++-
drivers/vhost/net.c | 68 ++---
include/linux/if_tap.h | 4 +-
include/linux/skbuff.h | 3 +
include/linux/virtio_net.h | 188 ++++++++++++++
include/net/flow_dissector.h | 1 +
include/uapi/linux/if_tun.h | 80 ++++++
net/core/flow_dissector.c | 3 +-
net/core/skbuff.c | 4 +
tools/testing/selftests/net/Makefile | 2 +-
tools/testing/selftests/net/config | 1 +
tools/testing/selftests/net/tap.c | 131 +++++++++-
tools/testing/selftests/net/tun.c | 485 ++++++++++++++++++++++++++++++++++-
19 files changed, 1234 insertions(+), 85 deletions(-)
---
base-commit: 5cb8274d66c611b7889565c418a8158517810f9b
change-id: 20240403-rss-e737d89efa77
Best regards,
--
Akihiko Odaki <akihiko.odaki(a)daynix.com>
Alex posted support for configuring pause frames in fbnic. This flipped
the pause stats test from xfail to fail. Because CI considered xfail as
pass it now flags the test as failing. This shouldn't happen. Also we
currently report pause and FEC tests as passing on virtio which doesn't
make sense.
Jakub Kicinski (2):
selftests: drv-net: stats: fix pylint issues
selftests: drv-net: stats: use skip instead of xfail for unsupported
features
tools/testing/selftests/drivers/net/stats.py | 45 +++++++++++++-------
1 file changed, 30 insertions(+), 15 deletions(-)
--
2.49.0
Remove `use core::ffi::c_void`, which shadows `kernel::ffi::c_void`
brought in via `use crate::prelude::*`, to maintain consistency and
centralize the abstraction.
Since `kernel::ffi::c_void` is a straightforward re-export of
`core::ffi::c_void`, both are functionally equivalent. However, using
`kernel::ffi::c_void` improves consistency across the kernel's Rust code
and provides a unified reference point in case the definition ever needs
to change, even if such a change is unlikely.
Reviewed-by: Benno Lossin <lossin(a)kernel.org>
Signed-off-by: Jesung Yang <y.j3ms.n(a)gmail.com>
Link: https://rust-for-linux.zulipchat.com/#narrow/channel/288089/topic/x/near/52…
---
Changes in v3:
- Rebase on a3b2347343e0
- Remove the explicit import of `kernel::ffi::c_void`
- Reword the commit message accordingly
- Link to v2: https://lore.kernel.org/rust-for-linux/20250528155147.2793921-1-y.j3ms.n@gm…
Changes in v2:
- Add "Link" tag to the related discussion on Zulip
- Reword the commit message to clarify `kernel::ffi::c_void` is a re-export
- Link to v1: https://lore.kernel.org/rust-for-linux/20250526162429.1114862-1-y.j3ms.n@gm…
---
rust/kernel/kunit.rs | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/rust/kernel/kunit.rs b/rust/kernel/kunit.rs
index 4b8cdcb21e77..603330f247c7 100644
--- a/rust/kernel/kunit.rs
+++ b/rust/kernel/kunit.rs
@@ -7,7 +7,7 @@
//! Reference: <https://docs.kernel.org/dev-tools/kunit/index.html>
use crate::prelude::*;
-use core::{ffi::c_void, fmt};
+use core::fmt;
/// Prints a KUnit error-level message.
///
base-commit: a3b2347343e077e81d3c169f32c9b2cb1364f4cc
--
2.39.5
Introduce support for the N32 and N64 ABIs. As preparation, the
entrypoint is first simplified significantly. Thanks to Maciej for all
the valuable information.
Signed-off-by: Thomas Weißschuh <linux(a)weissschuh.net>
---
Changes in v3:
- Rebase onto latest nolibc-next
- Link to v2: https://lore.kernel.org/r/20250225-nolibc-mips-n32-v2-0-664b47d87fa0@weisss…
Changes in v2:
- Clean up entrypoint first
- Annotate #endifs
- Link to v1: https://lore.kernel.org/r/20250212-nolibc-mips-n32-v1-1-6892e58d1321@weisss…
---
Thomas Weißschuh (4):
tools/nolibc: MIPS: drop $gp setup
tools/nolibc: MIPS: drop manual stack pointer alignment
tools/nolibc: MIPS: drop noreorder option
tools/nolibc: MIPS: add support for N64 and N32 ABIs
tools/include/nolibc/arch-mips.h | 117 +++++++++++++++++++------
tools/testing/selftests/nolibc/Makefile.nolibc | 26 ++++++
tools/testing/selftests/nolibc/run-tests.sh | 2 +-
3 files changed, 117 insertions(+), 28 deletions(-)
---
base-commit: eb135311083100b6590a7545618cd9760d896a86
change-id: 20231105-nolibc-mips-n32-234901bd910d
Best regards,
--
Thomas Weißschuh <linux(a)weissschuh.net>
Add a basic selftest for the netpoll polling mechanism, specifically
targeting the netpoll poll() side.
The test creates a scenario where network transmission is running at
maximum sppend, and netpoll needs to poll the NIC. This is achieved by:
1. Configuring a single RX/TX queue to create contention
2. Generating background traffic to saturate the interface
3. Sending netconsole messages to trigger netpoll polling
4. Using dynamic netconsole targets via configfs
The test validates a critical netpoll code path by monitoring traffic
flow and ensuring netpoll_poll_dev() is called when the normal TX path
is blocked. Perf probing confirms this test successfully triggers
netpoll_poll_dev() in typical test runs.
This addresses a gap in netpoll test coverage for a path that is
tricky for the network stack.
Signed-off-by: Breno Leitao <leitao(a)debian.org>
---
Sending as an RFC for your appreciation, but it dpends on [1] which is
stil under review. Once [1] lands, I will send this officially.
Link: https://lore.kernel.org/all/20250611-netdevsim_stat-v1-0-c11b657d96bf@debia… [1]
---
tools/testing/selftests/drivers/net/Makefile | 1 +
.../testing/selftests/drivers/net/netpoll_basic.py | 201 +++++++++++++++++++++
2 files changed, 202 insertions(+)
diff --git a/tools/testing/selftests/drivers/net/Makefile b/tools/testing/selftests/drivers/net/Makefile
index be780bcb73a3b..70d6e3a920b7f 100644
--- a/tools/testing/selftests/drivers/net/Makefile
+++ b/tools/testing/selftests/drivers/net/Makefile
@@ -15,6 +15,7 @@ TEST_PROGS := \
netcons_fragmented_msg.sh \
netcons_overflow.sh \
netcons_sysdata.sh \
+ netpoll_basic.py \
ping.py \
queues.py \
stats.py \
diff --git a/tools/testing/selftests/drivers/net/netpoll_basic.py b/tools/testing/selftests/drivers/net/netpoll_basic.py
new file mode 100755
index 0000000000000..8abdfb2b1eb6e
--- /dev/null
+++ b/tools/testing/selftests/drivers/net/netpoll_basic.py
@@ -0,0 +1,201 @@
+#!/usr/bin/env python3
+# SPDX-License-Identifier: GPL-2.0
+
+# This test aims to evaluate the netpoll polling mechanism (as in netpoll_poll_dev()).
+# It presents a complex scenario where the network attempts to send a packet but fails,
+# prompting it to poll the NIC from within the netpoll TX side.
+#
+# This has been a crucial path in netpoll that was previously untested. Jakub
+# suggested using a single RX/TX queue, pushing traffic to the NIC, and then sending
+# netpoll messages (via netconsole) to trigger the poll. `perf` probing of netpoll_poll_dev()
+# showed that this test indeed triggers netpoll_poll_dev() once or twice in 10 iterations.
+
+# Author: Breno Leitao <leitao(a)debian.org>
+
+import errno
+import os
+import random
+import string
+import time
+
+from lib.py import (
+ ethtool,
+ GenerateTraffic,
+ ksft_exit,
+ ksft_pr,
+ ksft_run,
+ KsftFailEx,
+ KsftSkipEx,
+ NetdevFamily,
+ NetDrvEpEnv,
+)
+
+NETCONSOLE_CONFIGFS_PATH = "/sys/kernel/config/netconsole"
+REMOTE_PORT = 6666
+LOCAL_PORT = 1514
+# Number of netcons messages to send. I usually see netpoll_poll_dev()
+# being called at least once in 10 iterations.
+ITERATIONS = 10
+DEBUG = False
+
+
+def generate_random_netcons_name() -> str:
+ """Generate a random name starting with 'netcons'"""
+ random_suffix = "".join(random.choices(string.ascii_lowercase + string.digits, k=8))
+ return f"netcons_{random_suffix}"
+
+
+def get_stats(cfg: NetDrvEpEnv, netdevnl: NetdevFamily) -> dict[str, int]:
+ """Get the statistics for the interface"""
+ return netdevnl.qstats_get({"ifindex": cfg.ifindex}, dump=True)[0]
+
+
+def set_single_rx_tx_queue(interface_name: str) -> None:
+ """Set the number of RX and TX queues to 1 using ethtool"""
+ try:
+ # This don't need to be reverted, since interfaces will be deleted after test
+ ethtool(f"-G {interface_name} rx 1 tx 1")
+ except Exception as e:
+ raise KsftSkipEx(
+ f"Failed to configure RX/TX queues: {e}. Ethtool not available?"
+ )
+
+
+def create_netconsole_target(
+ config_data: dict[str, str],
+ target_name: str,
+) -> None:
+ """Create a netconsole dynamic target against the interfaces"""
+ ksft_pr(f"Using netconsole name: {target_name}")
+ try:
+ ksft_pr(f"Created target directory: {NETCONSOLE_CONFIGFS_PATH}/{target_name}")
+ os.makedirs(f"{NETCONSOLE_CONFIGFS_PATH}/{target_name}", exist_ok=True)
+ except OSError as e:
+ if e.errno != errno.EEXIST:
+ raise KsftFailEx(f"Failed to create netconsole target directory: {e}")
+
+ try:
+ for key, value in config_data.items():
+ if DEBUG:
+ ksft_pr(f"Setting {key} to {value}")
+ with open(
+ f"{NETCONSOLE_CONFIGFS_PATH}/{target_name}/{key}",
+ "w",
+ encoding="utf-8",
+ ) as f:
+ # Always convert to string to write to file
+ f.write(str(value))
+ f.close()
+
+ if DEBUG:
+ # Read all configuration values for debugging
+ for debug_key in config_data.keys():
+ with open(
+ f"{NETCONSOLE_CONFIGFS_PATH}/{target_name}/{debug_key}",
+ "r",
+ encoding="utf-8",
+ ) as f:
+ content = f.read()
+ ksft_pr(
+ f"{NETCONSOLE_CONFIGFS_PATH}/{target_name}/{debug_key} {content}"
+ )
+
+ except Exception as e:
+ raise KsftFailEx(f"Failed to configure netconsole target: {e}")
+
+
+def set_netconsole(cfg: NetDrvEpEnv, interface_name: str, target_name: str) -> None:
+ """Configure netconsole on the interface with the given target name"""
+ config_data = {
+ "extended": "1",
+ "dev_name": interface_name,
+ "local_port": LOCAL_PORT,
+ "remote_port": REMOTE_PORT,
+ "local_ip": cfg.addr_v["4"] if cfg.addr_ipver == "4" else cfg.addr_v["6"],
+ "remote_ip": (
+ cfg.remote_addr_v["4"] if cfg.addr_ipver == "4" else cfg.remote_addr_v["6"]
+ ),
+ "remote_mac": "00:00:00:00:00:00", # Not important for this test
+ "enabled": "1",
+ }
+
+ create_netconsole_target(config_data, target_name)
+ ksft_pr(f"Created netconsole target: {target_name} on interface {interface_name}")
+
+
+def delete_netconsole_target(name: str) -> None:
+ """Delete a netconsole dynamic target"""
+ target_path = f"{NETCONSOLE_CONFIGFS_PATH}/{name}"
+ try:
+ if os.path.exists(target_path):
+ os.rmdir(target_path)
+ except OSError as e:
+ raise KsftFailEx(f"Failed to delete netconsole target: {e}")
+
+
+def check_traffic_flowing(cfg: NetDrvEpEnv, netdevnl: NetdevFamily) -> int:
+ """Check if traffic is flowing on the interface"""
+ stat1 = get_stats(cfg, netdevnl)
+ time.sleep(1)
+ stat2 = get_stats(cfg, netdevnl)
+ pkts_per_sec = stat2["rx-packets"] - stat1["rx-packets"]
+ # Just make sure this will not fail even in slow/debug kernels
+ if pkts_per_sec < 10:
+ raise KsftFailEx(f"Traffic seems low: {pkts_per_sec}")
+ if DEBUG:
+ ksft_pr(f"Traffic per second {pkts_per_sec} ", pkts_per_sec)
+
+ return pkts_per_sec
+
+
+def do_netpoll_flush(cfg: NetDrvEpEnv, netdevnl: NetdevFamily) -> None:
+ """Print messages to the console, trying to trigger a netpoll poll"""
+ for i in range(int(ITERATIONS)):
+ pkts_per_s = check_traffic_flowing(cfg, netdevnl)
+ with open("/dev/kmsg", "w", encoding="utf-8") as kmsg:
+ kmsg.write(f"netcons test #{i}: ({pkts_per_s} packets/s)\n")
+
+
+def test_netpoll(cfg: NetDrvEpEnv, netdevnl: NetdevFamily) -> None:
+ """Test netpoll by sending traffic to the interface and then sending netconsole messages to trigger a poll"""
+ target_name = generate_random_netcons_name()
+ ifname = cfg.dev["ifname"]
+ traffic = None
+
+ try:
+ set_single_rx_tx_queue(ifname)
+ traffic = GenerateTraffic(cfg)
+ check_traffic_flowing(cfg, netdevnl)
+ set_netconsole(cfg, ifname, target_name)
+ do_netpoll_flush(cfg, netdevnl)
+ finally:
+ if traffic:
+ traffic.stop()
+ delete_netconsole_target(target_name)
+
+
+def check_dependencies() -> None:
+ """Check if the dependencies are met"""
+ if not os.path.exists(NETCONSOLE_CONFIGFS_PATH):
+ raise KsftSkipEx(
+ f"Directory {NETCONSOLE_CONFIGFS_PATH} does not exist. CONFIG_NETCONSOLE_DYNAMIC might not be set."
+ )
+
+
+def main() -> None:
+ """Main function to run the test"""
+ check_dependencies()
+ netdevnl = NetdevFamily()
+ with NetDrvEpEnv(__file__, nsim_test=True) as cfg:
+ ksft_run(
+ [test_netpoll],
+ args=(
+ cfg,
+ netdevnl,
+ ),
+ )
+ ksft_exit()
+
+
+if __name__ == "__main__":
+ main()
---
base-commit: 5d6d67c4cb10a4b4d3ae35758d5eeed6239afdc8
change-id: 20250612-netpoll_test-a1324d2057c8
Best regards,
--
Breno Leitao <leitao(a)debian.org>
This patch series makes the selftest work with NV enabled. The guest code
is run in vEL2 instead of EL1. We add a command line option to enable
testing of NV. The NV tests are disabled by default.
Modified around 12 selftests in this series.
Changes since v1:
- Updated NV helper functions as per comments [1].
- Modified existing testscases to run guest code in vEL2.
[1] https://lkml.iu.edu/hypermail/linux/kernel/2502.0/07001.html
Ganapatrao Kulkarni (9):
KVM: arm64: nv: selftests: Add support to run guest code in vEL2.
KVM: arm64: nv: selftests: Add simple test to run guest code in vEL2
KVM: arm64: nv: selftests: Enable hypervisor timer tests to run in
vEL2
KVM: arm64: nv: selftests: enable aarch32_id_regs test to run in vEL2
KVM: arm64: nv: selftests: Enable vgic tests to run in vEL2
KVM: arm64: nv: selftests: Enable set_id_regs test to run in vEL2
KVM: arm64: nv: selftests: Enable test to run in vEL2
KVM: selftests: arm64: Extend kvm_page_table_test to run guest code in
vEL2
KVM: arm64: nv: selftests: Enable page_fault_test test to run in vEL2
tools/testing/selftests/kvm/Makefile.kvm | 2 +
tools/testing/selftests/kvm/arch_timer.c | 8 +-
.../selftests/kvm/arm64/aarch32_id_regs.c | 34 ++++-
.../testing/selftests/kvm/arm64/arch_timer.c | 118 +++++++++++++++---
.../selftests/kvm/arm64/nv_guest_hypervisor.c | 68 ++++++++++
.../selftests/kvm/arm64/page_fault_test.c | 35 +++++-
.../testing/selftests/kvm/arm64/set_id_regs.c | 57 ++++++++-
tools/testing/selftests/kvm/arm64/vgic_init.c | 54 +++++++-
tools/testing/selftests/kvm/arm64/vgic_irq.c | 27 ++--
.../selftests/kvm/arm64/vgic_lpi_stress.c | 19 ++-
.../testing/selftests/kvm/guest_print_test.c | 32 +++++
.../selftests/kvm/include/arm64/arch_timer.h | 16 +++
.../kvm/include/arm64/kvm_util_arch.h | 3 +
.../selftests/kvm/include/arm64/nv_util.h | 45 +++++++
.../selftests/kvm/include/arm64/vgic.h | 1 +
.../testing/selftests/kvm/include/kvm_util.h | 3 +
.../selftests/kvm/include/timer_test.h | 1 +
.../selftests/kvm/kvm_page_table_test.c | 30 ++++-
tools/testing/selftests/kvm/lib/arm64/nv.c | 46 +++++++
.../selftests/kvm/lib/arm64/processor.c | 61 ++++++---
tools/testing/selftests/kvm/lib/arm64/vgic.c | 8 ++
21 files changed, 604 insertions(+), 64 deletions(-)
create mode 100644 tools/testing/selftests/kvm/arm64/nv_guest_hypervisor.c
create mode 100644 tools/testing/selftests/kvm/include/arm64/nv_util.h
create mode 100644 tools/testing/selftests/kvm/lib/arm64/nv.c
--
2.48.1
Add a script to test various scenarios where a bridge is involved
in the fastpath. It runs tests in the forward path, and also in
a bridged path.
The setup is similar to a basic home router with multiple lan ports.
It uses 3 pairs of veth-devices. Each or all pairs can be
replaced by a pair of real interfaces, interconnected by wire.
This is necessary to test the behavior when dealing with
dsa ports, foreign (dsa) ports and switchdev userports that support
SWITCHDEV_OBJ_ID_PORT_VLAN.
See the head of the script for a detailed description.
Run without arguments to perform all tests on veth-devices.
Signed-off-by: Eric Woudstra <ericwouds(a)gmail.com>
---
This test script is written first for the proposed bridge-fastpath
patch-sets, but it's use is more general and can easily be expanded.
Changes in v2:
- Moved test-series to functions
- Moved code to set_pair_link() up/down
- Added conntrack zone to bridged traffic
- Test bridge chain prerouting in test without fastpath
and bridge chain forward in tests with fastpath
Some example outputs of this last version of patches from different
hardware, without and with patches:
ALL VETH:
=========
./bridge_fastpath.sh -t
Setup:
CLIENT 0
veth0cl
|
veth0rt
WAN
ROUTER
LAN1 LAN2
veth1rt veth2rt
| |
veth1cl veth2cl
CLIENT 1 CLIENT 2
Without patches:
PASS: unaware bridge, without encaps, without fastpath
PASS: unaware bridge, with single vlan encap, without fastpath
ERROR: unaware bridge, with double q vlan encaps, without fastpath: ipv4/6: established bytes 0 < 4194304
ERROR: unaware bridge, with 802.1ad vlan encaps, without fastpath: ipv4/6: established bytes 0 < 4194304
PASS: aware bridge, without/without vlan encap, without fastpath
PASS: aware bridge, with/without vlan encap, without fastpath
PASS: aware bridge, with/with vlan encap, without fastpath
PASS: aware bridge, without/with vlan encap, without fastpath
PASS: forward, without vlan-device, without vlan encap, client1, without fastpath
PASS: forward, without vlan-device, without vlan encap, client1, with fastpath
PASS: forward, without vlan-device, with vlan encap, client1, without fastpath
ERROR: forward, without vlan-device, with vlan encap, client1, with fastpath: ipv4/6: tcp broken
PASS: forward, with vlan-device, with vlan encap, client1, without fastpath
PASS: forward, with vlan-device, with vlan encap, client1, with fastpath
PASS: forward, with vlan-device, without vlan encap, client1, without fastpath
PASS: forward, with vlan-device, without vlan encap, client1, with fastpath
ERROR: bridge fastpath test has failed
With patches:
PASS: unaware bridge, without encaps, without fastpath
PASS: unaware bridge, without encaps, with fastpath
PASS: unaware bridge, with single vlan encap, without fastpath
PASS: unaware bridge, with single vlan encap, with fastpath
PASS: unaware bridge, with double q vlan encaps, without fastpath
PASS: unaware bridge, with double q vlan encaps, with fastpath
PASS: unaware bridge, with 802.1ad vlan encaps, without fastpath
PASS: unaware bridge, with 802.1ad vlan encaps, with fastpath
PASS: aware bridge, without/without vlan encap, without fastpath
PASS: aware bridge, without/without vlan encap, with fastpath
PASS: aware bridge, with/without vlan encap, without fastpath
PASS: aware bridge, with/without vlan encap, with fastpath
PASS: aware bridge, with/with vlan encap, without fastpath
PASS: aware bridge, with/with vlan encap, with fastpath
PASS: aware bridge, without/with vlan encap, without fastpath
PASS: aware bridge, without/with vlan encap, with fastpath
PASS: forward, without vlan-device, without vlan encap, client1, without fastpath
PASS: forward, without vlan-device, without vlan encap, client1, with fastpath
PASS: forward, without vlan-device, with vlan encap, client1, without fastpath
PASS: forward, without vlan-device, with vlan encap, client1, with fastpath
PASS: forward, with vlan-device, with vlan encap, client1, without fastpath
PASS: forward, with vlan-device, with vlan encap, client1, with fastpath
PASS: forward, with vlan-device, without vlan encap, client1, without fastpath
PASS: forward, with vlan-device, without vlan encap, client1, with fastpath
PASS: all tests passed
BANANAPI-R3 (lan1 & lan2 are dsa):
============
Without patches:
./bridge_fastpath.sh -t -0 enu1u2,lan2 -1 enu1u1,lan1 -2 lan4,eth1
Setup:
CLIENT 0
enu1u2
|
lan2
WAN
ROUTER
LAN1 LAN2
lan1 eth1
| |
enu1u1 lan4
CLIENT 1 CLIENT 2
PASS: unaware bridge, without encaps, without fastpath
PASS: unaware bridge, with single vlan encap, without fastpath
PASS: aware bridge, without/without vlan encap, without fastpath
PASS: aware bridge, with/without vlan encap, without fastpath
PASS: aware bridge, with/with vlan encap, without fastpath
PASS: aware bridge, without/with vlan encap, without fastpath
PASS: forward, without vlan-device, without vlan encap, client1, without fastpath
ERROR: forward, without vlan-device, without vlan encap, client1, with fastpath: ipv4: counted bytes 2118540 > 2097152
ERROR: forward, without vlan-device, without vlan encap, client1, with fastpath: ipv6: counted bytes 2117904 > 2097152
PASS: forward, without vlan-device, without vlan encap, client2, without fastpath
PASS: forward, without vlan-device, without vlan encap, client2, with fastpath
PASS: forward, without vlan-device, without vlan encap, client2, with hw_fastpath
PASS: forward, without vlan-device, with vlan encap, client1, without fastpath
ERROR: forward, without vlan-device, with vlan encap, client1, with fastpath: ipv4/6: tcp broken
PASS: forward, without vlan-device, with vlan encap, client2, without fastpath
ERROR: forward, without vlan-device, with vlan encap, client2, with fastpath: ipv4/6: tcp broken
PASS: forward, with vlan-device, with vlan encap, client1, without fastpath
PASS: forward, with vlan-device, with vlan encap, client1, with fastpath
PASS: forward, with vlan-device, with vlan encap, client1, with hw_fastpath
PASS: forward, with vlan-device, with vlan encap, client2, without fastpath
PASS: forward, with vlan-device, with vlan encap, client2, with fastpath
PASS: forward, with vlan-device, with vlan encap, client2, with hw_fastpath
PASS: forward, with vlan-device, without vlan encap, client1, without fastpath
PASS: forward, with vlan-device, without vlan encap, client1, with fastpath
PASS: forward, with vlan-device, without vlan encap, client1, with hw_fastpath
PASS: forward, with vlan-device, without vlan encap, client2, without fastpath
ERROR: forward, with vlan-device, without vlan encap, client2, with fastpath: ipv4: counted bytes 2109596 > 2097152
ERROR: forward, with vlan-device, without vlan encap, client2, with fastpath: ipv6: counted bytes 2121432 > 2097152
ERROR: bridge fastpath test has failed
With patches:
PASS: unaware bridge, without encaps, without fastpath
PASS: unaware bridge, without encaps, with fastpath
PASS: unaware bridge, without encaps, with hw_fastpath
PASS: unaware bridge, with single vlan encap, without fastpath
PASS: unaware bridge, with single vlan encap, with fastpath
PASS: unaware bridge, with single vlan encap, with hw_fastpath
PASS: aware bridge, without/without vlan encap, without fastpath
PASS: aware bridge, without/without vlan encap, with fastpath
PASS: aware bridge, without/without vlan encap, with hw_fastpath
PASS: aware bridge, with/without vlan encap, without fastpath
PASS: aware bridge, with/without vlan encap, with fastpath
PASS: aware bridge, with/without vlan encap, with hw_fastpath
PASS: aware bridge, with/with vlan encap, without fastpath
PASS: aware bridge, with/with vlan encap, with fastpath
PASS: aware bridge, with/with vlan encap, with hw_fastpath
PASS: aware bridge, without/with vlan encap, without fastpath
PASS: aware bridge, without/with vlan encap, with fastpath
PASS: aware bridge, without/with vlan encap, with hw_fastpath
PASS: forward, without vlan-device, without vlan encap, client1, without fastpath
PASS: forward, without vlan-device, without vlan encap, client1, with fastpath
PASS: forward, without vlan-device, without vlan encap, client1, with hw_fastpath
PASS: forward, without vlan-device, without vlan encap, client2, without fastpath
PASS: forward, without vlan-device, without vlan encap, client2, with fastpath
PASS: forward, without vlan-device, without vlan encap, client2, with hw_fastpath
PASS: forward, without vlan-device, with vlan encap, client1, without fastpath
PASS: forward, without vlan-device, with vlan encap, client1, with fastpath
PASS: forward, without vlan-device, with vlan encap, client1, with hw_fastpath
PASS: forward, without vlan-device, with vlan encap, client2, without fastpath
PASS: forward, without vlan-device, with vlan encap, client2, with fastpath
PASS: forward, without vlan-device, with vlan encap, client2, with hw_fastpath
PASS: forward, with vlan-device, with vlan encap, client1, without fastpath
PASS: forward, with vlan-device, with vlan encap, client1, with fastpath
PASS: forward, with vlan-device, with vlan encap, client1, with hw_fastpath
PASS: forward, with vlan-device, with vlan encap, client2, without fastpath
PASS: forward, with vlan-device, with vlan encap, client2, with fastpath
PASS: forward, with vlan-device, with vlan encap, client2, with hw_fastpath
PASS: forward, with vlan-device, without vlan encap, client1, without fastpath
PASS: forward, with vlan-device, without vlan encap, client1, with fastpath
PASS: forward, with vlan-device, without vlan encap, client1, with hw_fastpath
PASS: forward, with vlan-device, without vlan encap, client2, without fastpath
PASS: forward, with vlan-device, without vlan encap, client2, with fastpath
PASS: forward, with vlan-device, without vlan encap, client2, with hw_fastpath
PASS: all tests passed
AM3359 (end1 supports SWITCHDEV_OBJ_ID_PORT_VLAN, ipv4 only for now):
=======
./bridge_fastpath.sh -t -a -4 -d -1 enu1u4c2,end1
Without patches:
Setup:
CLIENT 0
veth0cl
|
veth0rt
WAN
ROUTER
LAN1 LAN2
end1 veth2rt
| |
enu1u4c2 veth2cl
CLIENT 1 CLIENT 2
INFO: Skipping unaware bridge
PASS: aware bridge, without/without vlan encap, without fastpath
PASS: aware bridge, with/without vlan encap, without fastpath
PASS: aware bridge, with/with vlan encap, without fastpath
PASS: aware bridge, without/with vlan encap, without fastpath
PASS: forward, without vlan-device, without vlan encap, client1, without fastpath
ERROR: forward, without vlan-device, without vlan encap, client1, with fastpath: ipv4: counted bytes 2190092 > 2097152
PASS: forward, without vlan-device, without vlan encap, client2, without fastpath
PASS: forward, without vlan-device, without vlan encap, client2, with fastpath
PASS: forward, without vlan-device, with vlan encap, client1, without fastpath
ERROR: forward, without vlan-device, with vlan encap, client1, with fastpath: ipv4: tcp broken
PASS: forward, without vlan-device, with vlan encap, client2, without fastpath
ERROR: forward, without vlan-device, with vlan encap, client2, with fastpath: ipv4: tcp broken
PASS: forward, with vlan-device, with vlan encap, client1, without fastpath
PASS: forward, with vlan-device, with vlan encap, client1, with fastpath
PASS: forward, with vlan-device, with vlan encap, client2, without fastpath
PASS: forward, with vlan-device, with vlan encap, client2, with fastpath
PASS: forward, with vlan-device, without vlan encap, client1, without fastpath
PASS: forward, with vlan-device, without vlan encap, client1, with fastpath
PASS: forward, with vlan-device, without vlan encap, client2, without fastpath
PASS: forward, with vlan-device, without vlan encap, client2, with fastpath
ERROR: bridge fastpath test has failed
With patches:
INFO: Skipping unaware bridge
PASS: aware bridge, without/without vlan encap, without fastpath
PASS: aware bridge, without/without vlan encap, with fastpath
PASS: aware bridge, with/without vlan encap, without fastpath
PASS: aware bridge, with/without vlan encap, with fastpath
PASS: aware bridge, with/with vlan encap, without fastpath
PASS: aware bridge, with/with vlan encap, with fastpath
PASS: aware bridge, without/with vlan encap, without fastpath
PASS: aware bridge, without/with vlan encap, with fastpath
PASS: forward, without vlan-device, without vlan encap, client1, without fastpath
PASS: forward, without vlan-device, without vlan encap, client1, with fastpath
PASS: forward, without vlan-device, without vlan encap, client2, without fastpath
PASS: forward, without vlan-device, without vlan encap, client2, with fastpath
PASS: forward, without vlan-device, with vlan encap, client1, without fastpath
PASS: forward, without vlan-device, with vlan encap, client1, with fastpath
PASS: forward, without vlan-device, with vlan encap, client2, without fastpath
PASS: forward, without vlan-device, with vlan encap, client2, with fastpath
PASS: forward, with vlan-device, with vlan encap, client1, without fastpath
PASS: forward, with vlan-device, with vlan encap, client1, with fastpath
PASS: forward, with vlan-device, with vlan encap, client2, without fastpath
PASS: forward, with vlan-device, with vlan encap, client2, with fastpath
PASS: forward, with vlan-device, without vlan encap, client1, without fastpath
PASS: forward, with vlan-device, without vlan encap, client1, with fastpath
PASS: forward, with vlan-device, without vlan encap, client2, without fastpath
PASS: forward, with vlan-device, without vlan encap, client2, with fastpath
PASS: all tests passed
(Some problem still to figure out for my AM3359 hardware: On the second run
of the command the tcp traffic is ok on all tests ipv4. On the first run
the hardware is not setup correctly, some tests report broken tcp even
without fastpath. Also ipv6 tcp broken even on second run even without
fastpath. This may be a problem with my hardware or the test-script,
but anyway it shows the fastpath is functional)
.../testing/selftests/net/netfilter/Makefile | 1 +
.../net/netfilter/bridge_fastpath.sh | 1008 +++++++++++++++++
2 files changed, 1009 insertions(+)
create mode 100755 tools/testing/selftests/net/netfilter/bridge_fastpath.sh
diff --git a/tools/testing/selftests/net/netfilter/Makefile b/tools/testing/selftests/net/netfilter/Makefile
index 3bdcbbdba925..50afe91bc3e2 100644
--- a/tools/testing/selftests/net/netfilter/Makefile
+++ b/tools/testing/selftests/net/netfilter/Makefile
@@ -8,6 +8,7 @@ MNL_LDLIBS := $(shell $(HOSTPKG_CONFIG) --libs libmnl 2>/dev/null || echo -lmnl)
TEST_PROGS := br_netfilter.sh bridge_brouter.sh
TEST_PROGS += br_netfilter_queue.sh
+TEST_PROGS += bridge_fastpath.sh
TEST_PROGS += conntrack_dump_flush.sh
TEST_PROGS += conntrack_icmp_related.sh
TEST_PROGS += conntrack_ipip_mtu.sh
diff --git a/tools/testing/selftests/net/netfilter/bridge_fastpath.sh b/tools/testing/selftests/net/netfilter/bridge_fastpath.sh
new file mode 100755
index 000000000000..82f2ddc946b8
--- /dev/null
+++ b/tools/testing/selftests/net/netfilter/bridge_fastpath.sh
@@ -0,0 +1,1008 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+#
+# Check if conntrack, nft chain and fastpath is functional in setups
+# where a bridge is in the fastpath.
+#
+# Commandline options make it possible to use real ethernet pairs
+# instead of veth-device pairs. Any, or all, pairs can be tested using
+# real hardware pairs. This is can be useful to test dsa-ports,
+# switchdev (dsa) foreign ports and switchdev ports supporting
+# SWITCHDEV_OBJ_ID_PORT_VLAN.
+#
+# First tcp is tested. Conntrack and nft chain are tested using a counter.
+# When there is a fastpath possible between the interfaces then the
+# fastpath is also tested.
+# When there is a hardware offloaded fastpath possible between the
+# interfaces then the hardware offloaded path is also tested.
+#
+# Setup is as a typical router:
+#
+# nsclientwan
+# |
+# nsrt
+# | |
+# nsclient1 nsclient2
+#
+# Masquerading for ipv4 only.
+#
+# First check if a bridge table forward chain can be setup, skip
+# these tests if this is not possible.
+# Then check if a inet table forward chain can be setup, skip
+# these tests if this is not possible.
+#
+# Different setups of paths are tested that involve a bridge in the
+# fastpath. This can be in the forward-fastpath or in the bridge-fastpath.
+#
+# The first series, in the bridge-fastpath, using a vlan-unaware bridge.
+# Traffic with the following vlan-tags is checked:
+# a. without vlan
+# b. single vlan
+# c. double q vlan (only on veth-devices)
+# d. 802.1ad vlan (only on veth-devices)
+# e. pppoe (when available)
+# f. pppoe-in-q (when available)
+#
+# (for items c to f fastpath can only work when a conntrack zone is set)
+# (double tag testing results in broken tcp traffic on most hardware,
+# in this test setup, use '-a' argument to test it anyway)
+# (pppoe testing takes place if pppd and pppoe-server are installed)
+#
+# The second series, in the bridge-fastpath, using a vlan-aware bridge.
+# Here we test all combinations of ingress/egress with or without single
+# vlan encaps.
+#
+# The third series, in the forward-fastpath, using a vlan-aware bridge,
+# without a vlan-device linked to the master port. We test the same combinations
+# of ingress/egress with or without single vlan encaps.
+#
+# The fourth series, in the forward-fastpath, using a vlan-aware bridge,
+# with a vlan-device linked to the master port. We test the same combinations
+# of ingress/egress with or without single vlan encaps.
+#
+# Note 1: Using dsa userports on both sides of eth-pairs client1 or client2
+# gives erratic and unpredictable results. Use, for example, an usb-eth device
+# on the client side to test a dsa-userport.
+#
+# Note 2: Testing the hardware offloaded fastpath, it is not checked if the
+# packets do not follow the software fastpath instead. A universal way to
+# check this should be added at some point.
+#
+# Note 3: Some interfaces to test on the router side, are netns immutable.
+# Use the -d or --defaultnsrouter option so that the interfaces of the router
+# do not have to change netns. The router is build up in the default netns.
+#
+
+source lib.sh
+
+checktool "nft --version" "run test without nft"
+checktool "socat -h" "run test without socat"
+checktool "bridge -V" "run test without bridge"
+
+NR_OF_TESTS=4
+VID1=100
+VID2=101
+BRWAN=brwan
+BRLAN=brlan
+BRCL=brcl
+LINKUP_TIMEOUT=10
+PING_TIMEOUT=10
+SOCAT_TIMEOUT=10
+filesize=2 # MiB
+
+filein=$(mktemp)
+file1out=$(mktemp)
+file2out=$(mktemp)
+pppoeserveroptions=$(mktemp)
+pppoeserverpid=$(mktemp)
+
+setup_ns nsclientwan nsclientlan1 nsclientlan2
+
+ WAN=0 ; LAN1=1 ; LAN2=2 ; ADWAN=3 ; ADLAN=4
+nsa=( $nsclientwan $nsclientlan1 $nsclientlan2 ) # $nsrt $nsrt
+AD4=( '192.168.1.1' '192.168.2.101' '192.168.2.102' '192.168.1.2' '192.168.2.1' )
+AD6=( 'dead:1::1' 'dead:2::101' 'dead:2::102' 'dead:1::2' 'dead:2::1' )
+
+tests_string=$(seq 1 $NR_OF_TESTS)
+
+while [ "${1:-}" != '' ]; do
+ case "$1" in
+ '-0' | '--pairwan')
+ shift
+ vethcl[$WAN]="${1%,*}"
+ vethrt[$WAN]="${1#*,}"
+ ;;
+ '-1' | '--pairlan1')
+ shift
+ vethcl[$LAN1]="${1%,*}"
+ vethrt[$LAN1]="${1#*,}"
+ ;;
+ '-2' | '--pairlan2')
+ shift
+ vethcl[$LAN2]="${1%,*}"
+ vethrt[$LAN2]="${1#*,}"
+ ;;
+ '-s' | '--filesize')
+ shift
+ filesize=$1
+ ;;
+ '-p' | '--parts')
+ shift
+ tests_string=$1
+ ;;
+ '-4' | '--ipv4')
+ do_ipv4=1
+ ;;
+ '-6' | '--ipv6')
+ do_ipv6=1
+ ;;
+ '-n' | '--noskip')
+ noskip=1
+ ;;
+ '-d' | '--defaultnsrouter')
+ defaultnsrouter=1
+ ;;
+ '-f' | '--fixmac')
+ fixmac=1
+ ;;
+ '-t' | '--showtree')
+ showtree=1
+ ;;
+ *)
+ cat <<-EOF
+ Usage: $(basename $0) [OPTION]...
+ -0 --pairwan eth0cl,eth0rt pair of real interfaces to use on wan side
+ -1 --pairlan1 eth1cl,eth1rt pair of real interfaces to use on lan1 side
+ -2 --pairlan2 eth2cl,eth2rt pair of real interfaces to use on lan2 side
+ -s --filesize filesize to use for testing
+ -p --parts partnumbers of tests to run, comma separated
+ -4|-6 --ipv4|--ipv6 test ipv4/6 only
+ -d --defaultnsrouter router in default network namespace, caution!
+ -f --fixmac change mac address when conflict found
+ -n --noskip also perform the normally skipped tests
+ -t --showtree show the tree of used interfaces
+ EOF
+ exit $ksft_skip
+ ;;
+ esac
+ shift
+done
+
+for i in ${tests_string//','/' '}; do
+ tests[$i]="yes"
+done
+
+if [ -n "$defaultnsrouter" ]; then
+ nsrt="nsrt-$(mktemp -u XXXXXX)"
+ touch /var/run/netns/$nsrt
+ mount --bind /proc/1/ns/net /var/run/netns/$nsrt
+else
+ setup_ns nsrt
+fi
+nsa+=($nsrt $nsrt)
+
+cleanup() {
+ if [ -n "$defaultnsrouter" ]; then
+ umount /var/run/netns/$nsrt
+ rm -f /var/run/netns/$nsrt
+ fi
+ cleanup_all_ns
+ rm -f "$filein" "$file1out" "$file2out" "$pppoeserveroptions" "$pppoeserverpid"
+}
+
+trap cleanup EXIT
+
+head -c $(($filesize * 1024 * 1024)) < /dev/urandom > "$filein"
+
+check_mac()
+{
+ local ns=$1
+ local dev=$2
+ local othermacs=$3
+ local mac
+
+ mac=$(ip -net "$ns" -br link show dev "$dev" | \
+ grep -o -E '([[:xdigit:]]{1,2}:){5}[[:xdigit:]]{1,2}')
+
+ if [[ ! "$othermacs" =~ "$mac" ]]; then
+ echo $mac
+ return 0
+ fi
+ echo "WARN: Conflicting mac address $dev $mac" 1>&2
+
+ [ -z "$fixmac" ] && return 1
+
+ for (( j = 0 ; j < 10 ; j++ )); do
+ mac="${mac::6}$(printf %02x:%02x:%02x:%02x $(($RANDOM%256)) \
+ $(($RANDOM%256)) $(($RANDOM%256)) $(($RANDOM%256)))"
+ [[ "$othermacs" =~ "$mac" ]] && continue
+ echo $mac
+ ip -net "$ns" link set dev "$dev" address "$mac" 1>&2
+ return $?
+ done
+ return 1
+}
+
+is_linkup()
+{
+ local ns=$1
+ local dev=$2
+
+ if [ -n "$(ip -net "$ns" link show dev "$dev" up 2>/dev/null | \
+ grep 'state UP')" ]; then
+ return 0
+ fi
+ return 1
+}
+
+set_pair_link()
+{
+ local arg=$1
+ local all="${@:2}"
+ local lret=0
+ local i j
+
+ for i in $WAN $LAN1 $LAN2; do
+ ns="${nsa[$i]}"
+ ip -net "$ns" link set "${vethcl[$i]}" $arg
+ lret=$(($lret | $?))
+ ip -net "$nsrt" link set "${vethrt[$i]}" $arg
+ lret=$(($lret | $?))
+ done
+ [ $lret -ne 0 ] && return 1
+
+ [[ "$arg" != "up" ]] && return 0
+
+ for j in $(seq 1 $(($LINKUP_TIMEOUT * 5 ))); do
+ lret=0
+ for i in $all; do
+ ns="${nsa[$i]}"
+ is_linkup $ns "${vethcl[$i]}"
+ lret=$(($lret | $?))
+ is_linkup $nsrt "${vethrt[$i]}"
+ lret=$(($lret | $?))
+ done
+ [ $lret -eq 0 ] && break
+ sleep 0.2
+ done
+ return $lret
+}
+
+wait_ping()
+{
+ local i1=$1
+ local i2=$2
+ local ns1=${nsa[$i1]}
+ local j
+
+ for j in $(seq 1 $(($PING_TIMEOUT * 5 ))); do
+ ip netns exec "$ns1" ping -c 1 -w $PING_TIMEOUT -i 0.2 \
+ -q "${AD4[$i2]}" >/dev/null 2>&1
+ [ $? -le 1 ] && return $?
+ sleep 0.2
+ done
+ return 1
+}
+
+add_addr()
+{
+ local i=$1
+ local dev=$2
+ local ns=${nsa[$i]}
+ local ad4=${AD4[$i]}
+ local ad6=${AD6[$i]}
+
+ ip -net "$ns" addr add "${ad4}/24" dev "$dev"
+ ip -net "$ns" addr add "${ad6}/64" dev "$dev" nodad
+ if [[ "$ns" == "nsclientlan"* ]]; then
+ ip -net "$ns" route add default via "${AD4[$ADLAN]}"
+ ip -net "$ns" route add default via "${AD6[$ADLAN]}"
+ elif [[ "$ns" == "nsclientwan"* ]]; then
+ ip -net "$ns" route add default via "${AD6[$ADWAN]}"
+ fi
+
+}
+
+del_addr()
+{
+ local i=$1
+ local dev=$2
+ local ns=${nsa[$i]}
+ local ad4=${AD4[$i]}
+ local ad6=${AD6[$i]}
+
+ if [[ "$ns" == "nsclientlan"* ]]; then
+ ip -net "$ns" route del default via "${AD6[$ADLAN]}"
+ ip -net "$ns" route del default via "${AD4[$ADLAN]}"
+ elif [[ "$ns" == "nsclientwan"* ]]; then
+ ip -net "$ns" route del default via "${AD6[$ADWAN]}"
+ fi
+ ip -net "$ns" addr del "${ad6}/64" dev "$dev" nodad
+ ip -net "$ns" addr del "${ad4}/24" dev "$dev"
+}
+
+set_client()
+{
+ local i=$1
+ local vlan=$2
+ local arg=$3
+ local ns=${nsa[$i]}
+ local vdev="${vethcl[$i]}"
+ local brdev="$BRCL"
+ local proto=""
+ local pvidslave=""
+
+ unset_client $i
+
+ if [[ "$vlan" == "qq" ]]; then
+ ip -net "$ns" link add link "$vdev" name "$vdev.$VID1" type vlan id $VID1
+ ip -net "$ns" link add link "$vdev.$VID1" name "$vdev.$VID1.$VID2" \
+ type vlan id $VID2
+ ip -net "$ns" link set "$vdev.$VID1" up
+ ip -net "$ns" link set "$vdev.$VID1.$VID2" up
+ add_addr $i "$vdev.$VID1.$VID2"
+ return
+ fi
+
+ [[ "$vlan" == "none" ]] && pvidslave="pvid untagged"
+ [[ "$vlan" == "ad" ]] && proto="vlan_protocol 802.1ad"
+
+ ip -net "$ns" link add "$brdev" type bridge vlan_filtering 1 vlan_default_pvid 0 $proto
+ ip -net "$ns" link set "$vdev" master "$brdev"
+ ip -net "$ns" link set "$brdev" up
+
+ bridge -net "$ns" vlan add dev "$brdev" vid $VID1 pvid untagged self
+ bridge -net "$ns" vlan add dev "$vdev" vid $VID1 $pvidslave
+
+ if [[ "$vlan" == "ad" ]]; then
+ ip -net "$ns" link add link "$brdev" name "$brdev.$VID2" type vlan id $VID2
+ brdev="$brdev.$VID2"
+ ip -net "$ns" link set "$brdev" up
+ fi
+
+ if [[ "$arg" != "noaddress" ]]; then
+ add_addr $i "$brdev"
+ fi
+}
+
+unset_client()
+{
+ local i=$1
+ local ns=${nsa[$i]}
+ local vdev="${vethcl[$i]}"
+ local brdev="$BRCL"
+
+ ip -net "$ns" link del "$brdev" type bridge 2>/dev/null
+ ip -net "$ns" link del "$vdev.$VID1" 2>/dev/null
+}
+
+add_pppoe()
+{
+ local i1=$1
+ local i2=$2
+ local dev1=$3
+ local dev2=$4
+ local desc=$5
+ local ns1=${nsa[$i1]}
+ local ns2=${nsa[$i2]}
+
+ ppp1=0
+ while [ -n "$(ip -net "$ns1" link show ppp$ppp1 2>/dev/null)" ]
+ do ((ppp1++)); done
+ echo "noauth defaultroute noipdefault unit $ppp1" >"$pppoeserveroptions"
+ ppp1="ppp$ppp1"
+
+ if ! ip netns exec "$ns1" pppoe-server -k -L "${AD4[$i1]}" -R "${AD4[$i2]}" \
+ -I $dev1 -X "$pppoeserverpid" -O "$pppoeserveroptions" >/dev/null; then
+ echo "ERROR: $desc: failed to setup pppoe server" 1>&2
+ return 1
+ fi
+
+ if ! ip netns exec "$ns2" pppd plugin pppoe.so nic-$dev2 persist holdoff 0 noauth \
+ defaultroute noipdefault noaccomp nodeflate noproxyarp nopcomp \
+ novj novjccomp linkname "selftest-$$" >/dev/null; then
+ echo "ERROR: $desc: failed to setup pppoe client" 1>&2
+ return 1
+ fi
+
+ if ! wait_ping $i1 $i2; then
+ echo "ERROR: $desc: failed to setup functional pppoe connection" 1>&2
+ return 1
+ fi
+
+ ppp2=$(cat "/run/pppd/ppp-selftest-$$.pid" | tail -n 1)
+
+ ip -net "$ns1" addr add "${AD6[$i1]}/64" dev "$ppp1" nodad
+ ip -net "$ns2" addr add "${AD6[$i2]}/64" dev "$ppp2" nodad
+
+ return 0
+}
+
+del_pppoe()
+{
+ local i1=$1
+ local i2=$2
+ local dev1=$3
+ local dev2=$4
+ local ns1=${nsa[$i1]}
+ local ns2=${nsa[$i2]}
+
+ [[ -n "$ppp1" ]] && ip -net "$ns1" addr del "${AD6[$i1]}/64" dev "$ppp1"
+ [[ -n "$ppp2" ]] && ip -net "$ns2" addr del "${AD6[$i2]}/64" dev "$ppp2"
+
+ kill -9 $(cat "/run/pppd/ppp-selftest-$$.pid" | head -n 1) \
+ $(cat "$pppoeserverpid" | head -n 1)
+}
+
+listener_ready()
+{
+ local ns=$1
+ local ipv=$2
+
+ ss -N "$ns" --ipv$ipv -lnt -o "sport = :8080" | grep -q 8080
+}
+
+test_tcp() {
+ local i1=$1
+ local i2=$2
+ local dofast=$3
+ local desc=$4
+ local ns1=${nsa[$i1]}
+ local ns2=${nsa[$i2]}
+ local i=-1
+ local lret=0
+ local ads=""
+ local ipv ad a lpid bytes limit error
+
+ if [ -n "$do_ipv4" ]; then ads="${AD4[$i2]}"
+ elif [ -n "$do_ipv6" ]; then ads="${AD6[$i2]}"
+ else ads="${AD4[$i2]} ${AD6[$i2]}"
+ fi
+ for ad in $ads; do
+ ((i++))
+ if [[ "$ad" =~ ":" ]]
+ then ipv="6"; a="[${ad}]"
+ else ipv="4"; a="${ad}"
+ fi
+
+ rm -f "$file1out" "$file2out"
+
+ # ip netns exec "$nsrt" nft reset counters >/dev/null
+ # But on some systems this results in 4GB values in packet and byte count, so:
+ (echo "flush ruleset"; ip netns exec "$nsrt" nft --stateless list ruleset) | \
+ ip netns exec "$nsrt" nft -f -
+
+ timeout "$SOCAT_TIMEOUT" ip netns exec "$ns2" socat TCP$ipv-LISTEN:8080,reuseaddr \
+ STDIO <"$filein" >"$file2out" 2>/dev/null &
+ lpid=$!
+ busywait 1000 listener_ready "$ns2" "$ipv"
+
+ timeout "$SOCAT_TIMEOUT" ip netns exec "$ns1" socat TCP$ipv:$a:8080 \
+ STDIO <"$filein" >"$file1out" 2>/dev/null
+ wait $lpid
+
+ if [ $? -ne 0 ]; then
+ error[$i]="ipv$ipv: tcp broken"
+ continue
+ fi
+ if ! cmp "$filein" "$file1out" >/dev/null 2>&1; then
+ error[$i]="ipv$ipv: file mismatch to ${ad}"
+ continue
+ fi
+ if ! cmp "$filein" "$file2out" >/dev/null 2>&1; then
+ error[$i]="ipv$ipv: file mismatch from ${ad}"
+ continue
+ fi
+
+ limit=$((2 * $filesize * 1024 * 1024))
+ bytes=$(ip netns exec "$nsrt" nft list counter $family filter "check" | \
+ grep "packets" | cut -d' ' -f4)
+ if [ -z "$dofast" ] && [ "$bytes" -lt "$limit" ]; then
+
+ error[$i]="ipv$ipv: established bytes $bytes < $limit"
+ continue
+ fi
+ if [ -n "$dofast" ] && [ "$bytes" -gt "$((limit/2))" ]; then
+ # Significant reduction of bytes expected
+ error[$i]="ipv$ipv: counted bytes $bytes > $((limit/2))"
+ continue
+ fi
+ done
+
+ if [ -n "${error[0]}" ]; then
+ if [[ "${error[0]#*:}" == "${error[1]#*:}" ]]; then
+ echo "ERROR: $desc: ipv4/6:${error[0]#*:}" 1>&2
+ return 1
+ fi
+ echo "ERROR: $desc: ${error[0]}" 1>&2
+ lret=1
+ fi
+ if [ -n "${error[1]}" ]; then
+ echo "ERROR: $desc: ${error[1]}" 1>&2
+ lret=1
+ fi
+ if [ $lret -eq 0 ]; then
+ echo "PASS: $desc"
+ fi
+ return $lret
+}
+
+test_paths() {
+ local i1=$1
+ local i2=$2
+ local desc=$3
+ local ns1=${nsa[$i1]}
+ local ns2=${nsa[$i2]}
+
+
+ if ! setup_nftables $i1 $i2; then
+ echo "ERROR: $desc: cannot setup nftables" 1>&2
+ return 1
+ fi
+ if ! test_tcp $i1 $i2 "" "$desc without fastpath"; then
+ return 1
+ fi
+
+ if ! setup_fastpath $i1 $i2 "" 2>/dev/null; then
+ return 0
+ fi
+ if ! test_tcp $i1 $i2 "fast" "$desc with fastpath"; then
+ return 1
+ fi
+
+ if ! setup_fastpath $i1 $i2 "hw" 2>/dev/null; then
+ return 0
+ fi
+ if ! test_tcp $i1 $i2 "fast" "$desc with hw_fastpath"; then
+ return 1
+ fi
+
+ return 0
+
+}
+
+add_masq()
+{
+ if [[ $family != "bridge" ]]; then
+ ip netns exec "$nsrt" nft -f - <<-EOF
+ table ip nat {
+ chain postrouting {
+ type nat hook postrouting priority 0;
+ oifname ${BRWAN} masquerade
+ }
+ }
+ EOF
+ else
+ return 0
+ fi
+}
+
+add_zone()
+{
+ local devs=$1
+
+ if [[ $family == "bridge" ]]; then
+ ip netns exec "$nsrt" nft -f - <<-EOF
+ table ${family} filter {
+ chain preroutingzones {
+ type filter hook prerouting priority -300;
+ iif ${devs} ct zone set 23
+ }
+ }
+ EOF
+ fi
+}
+
+setup_nftables()
+{
+ local devs="{ ${vethrt[$1]} , ${vethrt[$2]} }"
+ local i1=$1
+ local i2=$2
+
+ ip netns exec "$nsrt" nft flush ruleset
+
+ if ! add_masq; then
+ return 1
+ fi
+
+ add_zone "${devs}" 2>/dev/null
+
+ ip netns exec "$nsrt" nft -f - <<-EOF
+ table ${family} filter {
+ counter check { }
+ chain prerouting {
+ type filter hook prerouting priority 0; policy accept;
+ ct state established ip saddr ${AD4[$i1]} tcp dport 8080 counter name "check"
+ ct state established ip saddr ${AD4[$i2]} tcp sport 8080 counter name "check"
+ ct state established ip6 saddr ${AD6[$i1]} tcp dport 8080 counter name "check"
+ ct state established ip6 saddr ${AD6[$i2]} tcp sport 8080 counter name "check"
+ }
+ }
+ EOF
+}
+
+setup_fastpath()
+{
+ local devs="{ ${vethrt[$1]} , ${vethrt[$2]} }"
+ local arg=$3
+ local flags=""
+
+ [[ "$arg" == "hw" ]] && flags="flags offload"
+
+ ip netns exec "$nsrt" nft flush ruleset
+
+ if ! add_masq; then
+ return 1
+ fi
+
+ add_zone "${devs}" 2>/dev/null
+
+ ip netns exec "$nsrt" nft -f - <<-EOF
+ table ${family} filter {
+ counter check { }
+ flowtable f {
+ hook ingress priority filter
+ devices = ${devs}
+ ${flags}
+ }
+ chain forward {
+ type filter hook forward priority 0; policy accept;
+ counter name "check"
+ ct state established flow add @f
+ }
+ }
+ EOF
+}
+
+test_1_unaware_bridge()
+{
+ local lret=0
+ local i
+
+ for i in $LAN1 $LAN2; do
+ set_client $i none
+ done
+
+ test_paths $LAN1 $LAN2 "unaware bridge, without encaps, "
+ lret=$(($lret | $?))
+
+ for i in $LAN1 $LAN2; do
+ set_client $i q
+ done
+
+ test_paths $LAN1 $LAN2 "unaware bridge, with single vlan encap, "
+ lret=$(($lret | $?))
+
+ for i in $LAN1 $LAN2; do
+ set_client $i qq
+ done
+
+ # Skip testing double tagged packets on real hardware
+ if [ -n "$lan_all_veth" ] || [ -n "$noskip" ]; then
+
+ test_paths $LAN1 $LAN2 "unaware bridge, with double q vlan encaps,"
+ lret=$(($lret | $?))
+
+ for i in $LAN1 $LAN2; do
+ set_client $i ad
+ done
+
+ test_paths $LAN1 $LAN2 "unaware bridge, with 802.1ad vlan encaps, "
+ lret=$(($lret | $?))
+
+ fi
+ # End Skip testing double tagged packets
+
+ if [ -n "$(command -v pppd 2>/dev/null)" ] &&
+ [ -n "$(command -v pppoe-server 2>/dev/null)" ]; then
+ # Start pppoe
+
+ for i in $LAN1 $LAN2; do
+ set_client $i none noaddress
+ done
+
+ if add_pppoe $LAN1 $LAN2 "$BRCL" "$BRCL" "unaware bridge, with pppoe encap"; then
+ test_paths $LAN1 $LAN2 "unaware bridge, with pppoe encap, "
+ lret=$(($lret | $?))
+ fi
+
+ del_pppoe $LAN1 $LAN2 "$BRCL" "$BRCL"
+
+ for i in $LAN1 $LAN2; do
+ set_client $i q noaddress
+ done
+
+ if add_pppoe $LAN1 $LAN2 "$BRCL" "$BRCL" "unaware bridge, with pppoe-in-q encaps"; then
+ test_paths $LAN1 $LAN2 "unaware bridge, with pppoe-in-q encaps, "
+ lret=$(($lret | $?))
+ fi
+
+ del_pppoe $LAN1 $LAN2 "$BRCL" "$BRCL"
+
+ # End pppoe
+ fi
+
+ for i in $LAN1 $LAN2; do
+ unset_client $i
+ done
+ return $lret
+}
+
+test_2_aware_bridge()
+{
+ local lret=0
+ local i
+
+ for i in $LAN1 $LAN2; do
+ bridge -net "$nsrt" vlan add dev "${vethrt[$i]}" vid $VID1 pvid untagged
+ set_client $i none
+ done
+ test_paths $LAN1 $LAN2 "aware bridge, without/without vlan encap,"
+ lret=$(($lret | $?))
+
+ i=$LAN1
+ bridge -net "$nsrt" vlan del dev "${vethrt[$i]}" vid $VID1 pvid untagged
+ bridge -net "$nsrt" vlan add dev "${vethrt[$i]}" vid $VID1
+ set_client $i q
+
+ test_paths $LAN1 $LAN2 "aware bridge, with/without vlan encap, "
+ lret=$(($lret | $?))
+
+ i=$LAN2
+ bridge -net "$nsrt" vlan del dev "${vethrt[$i]}" vid $VID1 pvid untagged
+ bridge -net "$nsrt" vlan add dev "${vethrt[$i]}" vid $VID1
+ set_client $i q
+
+ test_paths $LAN1 $LAN2 "aware bridge, with/with vlan encap, "
+ lret=$(($lret | $?))
+
+ i=$LAN1
+ bridge -net "$nsrt" vlan del dev "${vethrt[$i]}" vid $VID1
+ bridge -net "$nsrt" vlan add dev "${vethrt[$i]}" vid $VID1 pvid untagged
+ set_client $i none
+
+ test_paths $LAN1 $LAN2 "aware bridge, without/with vlan encap, "
+ lret=$(($lret | $?))
+
+ i=$LAN1
+ bridge -net "$nsrt" vlan del dev "${vethrt[$i]}" vid $VID1 pvid untagged
+ unset_client $i
+ i=$LAN2
+ bridge -net "$nsrt" vlan del dev "${vethrt[$i]}" vid $VID1
+ unset_client $i
+
+ return $lret
+}
+
+test_3_forward_without_vlandev()
+{
+ local wo=$1
+ local lret=0
+ local i
+
+ [[ "$wo" == "" ]] && wo="without"
+
+ for i in $LAN1 $LAN2; do
+ bridge -net "$nsrt" vlan add dev "${vethrt[$i]}" vid $VID1 pvid untagged
+ set_client $i none
+ done
+
+ test_paths $LAN1 $WAN "forward, $wo vlan-device, without vlan encap, client1,"
+ lret=$(($lret | $?))
+ if [ -z "$lan_all_veth" ] || [ -n "$noskip" ]; then
+ test_paths $LAN2 $WAN "forward, $wo vlan-device, without vlan encap, client2,"
+ lret=$(($lret | $?))
+ fi
+
+ for i in $LAN1 $LAN2; do
+ bridge -net "$nsrt" vlan del dev "${vethrt[$i]}" vid $VID1 pvid untagged
+ bridge -net "$nsrt" vlan add dev "${vethrt[$i]}" vid $VID1
+ set_client $i q
+ done
+
+ test_paths $LAN1 $WAN "forward, $wo vlan-device, with vlan encap, client1,"
+ lret=$(($lret | $?))
+ if [ -z "$lan_all_veth" ] || [ -n "$noskip" ]; then
+ test_paths $LAN2 $WAN "forward, $wo vlan-device, with vlan encap, client2,"
+ lret=$(($lret | $?))
+ fi
+
+ for i in $LAN1 $LAN2; do
+ bridge -net "$nsrt" vlan del dev "${vethrt[$i]}" vid $VID1
+ unset_client $i
+ done
+ return $lret
+}
+
+test_4_forward_with_vlandev()
+{
+ test_3_forward_without_vlandev "with"
+ return $?
+}
+
+test_5_bond()
+{
+ local lret=0
+ local i
+
+ for i in $LAN1; do
+ unset_client $i
+ done
+ return $lret
+}
+
+ret=0
+### Start Initial Setup ###
+
+for i in 4 6; do
+ ip netns exec "$nsrt" sysctl -q net.ipv$i.conf.all.forwarding=1
+done
+
+### Use brwan to make sure software fastpath is ###
+### direct xmit in other direction also ###
+
+ip -net "$nsrt" link add $BRWAN type bridge
+ret=$(($ret | $?))
+ip -net "$nsrt" link set $BRWAN up
+ret=$(($ret | $?))
+if [ $ret -ne 0 ]; then
+ echo "SKIP: Can't create bridge"
+ exit $ksft_skip
+fi
+
+# If both lan clients are veth-devices, only test 1 in the forward path
+if [ -z "${vethcl[$LAN1]}" ] && [ -z "${vethcl[$LAN2]}" ]; then
+ lan_all_veth=1
+fi
+
+for i in $WAN $LAN1 $LAN2; do
+ ns="${nsa[$i]}"
+ if [ -z "${vethcl[$i]}" ]; then
+ vethcl[$i]="veth${i}cl"
+ vethrt[$i]="veth${i}rt"
+ ip link add "${vethcl[$i]}" netns "$ns" type veth \
+ peer name "${vethrt[$i]}" netns "$nsrt"
+ ret=$(($ret | $?))
+ else # Use pair of interconnected hardware interfaces
+ ip link set "${vethrt[$i]}" netns "$nsrt"
+ ret=$(($ret | $?))
+ ip link set "${vethcl[$i]}" netns "$ns"
+ ret=$(($ret | $?))
+ fi
+done
+if [ $ret -ne 0 ]; then
+ echo "SKIP: (v)eth pairs cannot be used"
+ exit $ksft_skip
+fi
+
+if [ -n "$showtree" ]; then
+ cat <<-EOF
+ Setup:
+ CLIENT 0
+ ${vethcl[$WAN]}
+ |
+ ${vethrt[$WAN]}
+ WAN
+ ROUTER
+ LAN1 LAN2
+ $(printf "%14.14s" ${vethrt[$LAN1]}) ${vethrt[$LAN2]}
+ | |
+ $(printf "%14.14s" ${vethcl[$LAN1]}) ${vethcl[$LAN2]}
+ CLIENT 1 CLIENT 2
+
+ EOF
+fi
+
+for n in nsclientwan nsclientlan; do
+ routerside=""; clientside=""
+ for i in $WAN $LAN1 $LAN2; do
+ ns="${nsa[$i]}"
+ [[ "$ns" != "$n"* ]] && continue
+ mac=$(check_mac $ns ${vethcl[$i]} "$routerside $clientside")
+ ret=$(($ret | $?))
+ clientside+=" $mac"
+ mac=$(check_mac $nsrt ${vethrt[$i]} "$clientside")
+ ret=$(($ret | $?))
+ routerside+=" $mac"
+ done
+done
+if [ $ret -ne 0 ]; then
+ echo "SKIP: conflicting mac address"
+ exit $ksft_skip
+fi
+
+set_pair_link up $WAN $LAN1 $LAN2
+ret=$(($ret | $?))
+if [ $ret -ne 0 ]; then
+ echo "SKIP: setting (v)eth pairs link up failed"
+ exit $ksft_skip
+fi
+
+i=$WAN
+ip -net "$nsrt" link set "${vethrt[$i]}" master $BRWAN
+set_client $i none
+add_addr $ADWAN "$BRWAN"
+
+family="bridge"
+setup_nftables $LAN1 $LAN2 2>/dev/null
+if [ $? -ne 0 ]; then
+ echo "INFO: Cannot add nftables table $family"
+ tests[1]=""; test[2]=""
+fi
+family="inet"
+if ! setup_nftables $WAN $LAN1 2>/dev/null; then
+ echo "INFO: Cannot add nftables table $family"
+ tests[3]=""; test[4]=""; tests[5]=""
+fi
+
+### End Initial Setup ###
+
+if [ -n "${tests[1]}" ]; then
+ # Setup brlan as vlan unaware bridge
+ family="bridge"
+ ip -net "$nsrt" link add $BRLAN type bridge
+ ip -net "$nsrt" link set $BRLAN up
+ for i in $LAN1 $LAN2; do
+ ip -net "$nsrt" link set "${vethrt[$i]}" master $BRLAN
+ done
+ test_1_unaware_bridge
+ ret=$(($ret | $?))
+ ip -net "$nsrt" link del $BRLAN type bridge
+fi
+
+if [ -n "${tests[2]}" ] || [ -n "${tests[3]}" ] || [ -n "${tests[4]}" ]; then
+ # Setup brlan as vlan aware bridge
+ family="bridge"
+
+ ip -net "$nsrt" link add $BRLAN type bridge vlan_filtering 1 vlan_default_pvid 0
+ ip -net "$nsrt" link set $BRLAN up
+ bridge -net "$nsrt" vlan add dev $BRLAN vid $VID1 pvid untagged self
+ add_addr $ADLAN "$BRLAN"
+ for i in $LAN1 $LAN2; do
+ ip -net "$nsrt" link set "${vethrt[$i]}" master $BRLAN
+ done
+
+ if [ -n "${tests[2]}" ]; then
+ test_2_aware_bridge
+ ret=$(($ret | $?))
+ fi
+
+ family="inet"
+
+ if [ -n "${tests[3]}" ]; then
+ test_3_forward_without_vlandev
+ ret=$(($ret | $?))
+ fi
+
+ if [ -n "${tests[4]}" ]; then
+ # Setup vlan-device linked to brlan master port
+ del_addr $ADLAN "$BRLAN"
+ ip -net "$nsrt" link set $BRLAN down
+ bridge -net "$nsrt" vlan del dev $BRLAN vid $VID1 pvid untagged self
+ bridge -net "$nsrt" vlan add dev $BRLAN vid $VID1 self
+ ip -net "$nsrt" link add link $BRLAN name $BRLAN.$VID1 type vlan id $VID1
+ ip -net "$nsrt" link set $BRLAN up
+ ip -net "$nsrt" link set "$BRLAN.$VID1" up
+ add_addr $ADLAN "$BRLAN.$VID1"
+ test_4_forward_with_vlandev
+ ret=$(($ret | $?))
+ fi
+
+ ip -net "$nsrt" link del $BRLAN type bridge
+fi
+
+### Finish tests ###
+
+ip -net "$nsrt" link del $BRWAN type bridge
+
+for i in $WAN $LAN1 $LAN2; do
+ unset_client $i
+done
+
+if [ $ret -eq 0 ]; then
+ echo "PASS: all tests passed"
+else
+ echo "ERROR: bridge fastpath test has failed"
+fi
+
+exit $ret
--
2.47.1
On Sat, Jun 21, 2025 at 05:08:18PM +0530, Dev Jain wrote:
>
> On 21/06/25 4:40 pm, wang lian wrote:
> > From cb505647eb5f418d1ff5e807361f4c3a337c251f Mon Sep 17 00:00:00 2001
> > From: Lian Wang <lianux.mm(a)gmail.com>
> > Date: Sat, 21 Jun 2025 18:51:49 +0800
> > Subject: [PATCH] selftests/mm: add test for (BATCH_PROCESS)MADV_DONTNEED
> >
> > Let's add a simple test for MADV_DONTNEED and PROCESS_MADV_DONTNEED,
> > and inspired by SeongJae Park's test at GitHub[1] add batch test
> > for PROCESS_MADV_DONTNEED,but for now it influence by workload and
> > need add some race conditions test.We can add it later.
> >
> > Signed-off-by: Lian Wang <lianux.mm(a)gmail.com>
> > References
> > ==========
> >
> > [1] https://github.com/sjp38/eval_proc_madvise
> >
> >
>
> Hello Lian,
>
>
> Thank you for your patch. Please configure your editor to take a TAB as 8
> spaces. And,
>
> your email client to sending plain text messages instead of HTML. Please
> resend the
>
> patch after making these changes.
Thanks for resending this, but please do put '[RESEND PATCH]' rather than
'[PATCH]' so we can differentiate!
Have reviewed the resent version.
Cheers, Lorenzo
Some libc's like musl libc don't provide execinfo.h since it's not part
of POSIX. In order to fix compilation on musl, only include execinfo.h
if available (HAVE_BACKTRACE_SUPPORT)
This was discovered with c104c16073b7 ("Kunit to check the longest symbol length")
which starts to include linux/kallsyms.h with Alpine Linux' configs.
Signed-off-by: Achill Gilgenast <fossdd(a)pwned.life>
Cc: stable(a)vger.kernel.org
---
tools/include/linux/kallsyms.h | 4 ++++
1 file changed, 4 insertions(+)
diff --git a/tools/include/linux/kallsyms.h b/tools/include/linux/kallsyms.h
index 5a37ccbec54f..f61a01dd7eb7 100644
--- a/tools/include/linux/kallsyms.h
+++ b/tools/include/linux/kallsyms.h
@@ -18,6 +18,7 @@ static inline const char *kallsyms_lookup(unsigned long addr,
return NULL;
}
+#ifdef HAVE_BACKTRACE_SUPPORT
#include <execinfo.h>
#include <stdlib.h>
static inline void print_ip_sym(const char *loglvl, unsigned long ip)
@@ -30,5 +31,8 @@ static inline void print_ip_sym(const char *loglvl, unsigned long ip)
free(name);
}
+#else
+static inline void print_ip_sym(const char *loglvl, unsigned long ip) {}
+#endif
#endif
--
2.50.0
validate_addr() function checks whether the address returned by mmap()
lies in the low or high VA space, according to whether a high addr hint
was passed or not. The fix commit mentioned below changed the code in
such a way that this function will always return failure when passed
high_addr == 1; addr will be >= HIGH_ADDR_MARK always, we will fall
down to "if (addr < HIGH_ADDR_MARK)" and return failure. Fix this.
Fixes: d1d86ce28d0f ("selftests/mm: virtual_address_range: conform to TAP format output")
Signed-off-by: Dev Jain <dev.jain(a)arm.com>
---
tools/testing/selftests/mm/virtual_address_range.c | 7 +++++--
1 file changed, 5 insertions(+), 2 deletions(-)
diff --git a/tools/testing/selftests/mm/virtual_address_range.c b/tools/testing/selftests/mm/virtual_address_range.c
index b380e102b22f..169dbd692bf5 100644
--- a/tools/testing/selftests/mm/virtual_address_range.c
+++ b/tools/testing/selftests/mm/virtual_address_range.c
@@ -77,8 +77,11 @@ static void validate_addr(char *ptr, int high_addr)
{
unsigned long addr = (unsigned long) ptr;
- if (high_addr && addr < HIGH_ADDR_MARK)
- ksft_exit_fail_msg("Bad address %lx\n", addr);
+ if (high_addr) {
+ if (addr < HIGH_ADDR_MARK)
+ ksft_exit_fail_msg("Bad address %lx\n", addr);
+ return;
+ }
if (addr > HIGH_ADDR_MARK)
ksft_exit_fail_msg("Bad address %lx\n", addr);
--
2.30.2
The coredump.socket_detect_userspace_client test occasionally fails:
# RUN coredump.socket_detect_userspace_client ...
# stackdump_test.c:500:socket_detect_userspace_client:Expected 0 (0) != WIFEXITED(status) (0)
# socket_detect_userspace_client: Test terminated by assertion
# FAIL coredump.socket_detect_userspace_client
not ok 3 coredump.socket_detect_userspace_client
because there is no guarantee that client's write() happens before server's
close(). The client gets terminated SIGPIPE, and thus the test fails.
Add a read() to server to make sure server's close() doesn't happen before
client's write().
Fixes: 7b6724fe9a6b ("selftests/coredump: add tests for AF_UNIX coredumps")
Signed-off-by: Nam Cao <namcao(a)linutronix.de>
---
tools/testing/selftests/coredump/stackdump_test.c | 5 +++++
1 file changed, 5 insertions(+)
diff --git a/tools/testing/selftests/coredump/stackdump_test.c b/tools/testing/selftests/coredump/stackdump_test.c
index 9984413be9f06..68f8e479ac368 100644
--- a/tools/testing/selftests/coredump/stackdump_test.c
+++ b/tools/testing/selftests/coredump/stackdump_test.c
@@ -461,10 +461,15 @@ TEST_F(coredump, socket_detect_userspace_client)
_exit(EXIT_FAILURE);
}
+ ret = read(fd_coredump, &c, 1);
+
close(fd_coredump);
close(fd_server);
close(fd_peer_pidfd);
close(fd_core_file);
+
+ if (ret < 1)
+ _exit(EXIT_FAILURE);
_exit(EXIT_SUCCESS);
}
self->pid_coredump_server = pid_coredump_server;
--
2.39.5
This picks up from Michal Rostecki's work[0]. Per Michal's guidance I
have omitted Co-authored tags, as the end result is quite different.
Link: https://lore.kernel.org/rust-for-linux/20240819153656.28807-2-vadorovsky@pr… [0]
Closes: https://github.com/Rust-for-Linux/linux/issues/1075
Signed-off-by: Tamir Duberstein <tamird(a)gmail.com>
---
Changes in v12:
- Introduce `kernel::fmt::Display` to allow implementations on foreign
types.
- Tidy up doc comment on `str_to_cstr`. (Alice Ryhl).
- Link to v11: https://lore.kernel.org/r/20250530-cstr-core-v11-0-cd9c0cbcb902@gmail.com
Changes in v11:
- Use `quote_spanned!` to avoid `use<'a, T>` and generally reduce manual
token construction.
- Add a commit to simplify `quote_spanned!`.
- Drop first commit in favor of
https://lore.kernel.org/rust-for-linux/20240906164448.2268368-1-paddymills@….
(Miguel Ojeda)
- Correctly handle expressions such as `pr_info!("{a}", a = a = a)`.
(Benno Lossin)
- Avoid dealing with `}}` escapes, which is not needed. (Benno Lossin)
- Revert some unnecessary changes. (Benno Lossin)
- Rename `c_str_avoid_literals!` to `str_to_cstr!`. (Benno Lossin &
Alice Ryhl).
- Link to v10: https://lore.kernel.org/r/20250524-cstr-core-v10-0-6412a94d9d75@gmail.com
Changes in v10:
- Rebase on cbeaa41dfe26b72639141e87183cb23e00d4b0dd.
- Implement Alice's suggestion to use a proc macro to work around orphan
rules otherwise preventing `core::ffi::CStr` to be directly printed
with `{}`.
- Link to v9: https://lore.kernel.org/r/20250317-cstr-core-v9-0-51d6cc522f62@gmail.com
Changes in v9:
- Rebase on rust-next.
- Restore `impl Display for BStr` which exists upstream[1].
- Link: https://doc.rust-lang.org/nightly/std/bstr/struct.ByteStr.html#impl-Display… [1]
- Link to v8: https://lore.kernel.org/r/20250203-cstr-core-v8-0-cb3f26e78686@gmail.com
Changes in v8:
- Move `{from,as}_char_ptr` back to `CStrExt`. This reduces the diff
some.
- Restore `from_bytes_with_nul_unchecked_mut`, `to_cstring`.
- Link to v7: https://lore.kernel.org/r/20250202-cstr-core-v7-0-da1802520438@gmail.com
Changes in v7:
- Rebased on mainline.
- Restore functionality added in commit a321f3ad0a5d ("rust: str: add
{make,to}_{upper,lower}case() to CString").
- Used `diff.algorithm patience` to improve diff readability.
- Link to v6: https://lore.kernel.org/r/20250202-cstr-core-v6-0-8469cd6d29fd@gmail.com
Changes in v6:
- Split the work into several commits for ease of review.
- Restore `{from,as}_char_ptr` to allow building on ARM (see commit
message).
- Add `CStrExt` to `kernel::prelude`. (Alice Ryhl)
- Remove `CStrExt::from_bytes_with_nul_unchecked_mut` and restore
`DerefMut for CString`. (Alice Ryhl)
- Rename and hide `kernel::c_str!` to encourage use of C-String
literals.
- Drop implementation and invocation changes in kunit.rs. (Trevor Gross)
- Drop docs on `Display` impl. (Trevor Gross)
- Rewrite docs in the style of the standard library.
- Restore the `test_cstr_debug` unit tests to demonstrate that the
implementation has changed.
Changes in v5:
- Keep the `test_cstr_display*` unit tests.
Changes in v4:
- Provide the `CStrExt` trait with `display()` method, which returns a
`CStrDisplay` wrapper with `Display` implementation. This addresses
the lack of `Display` implementation for `core::ffi::CStr`.
- Provide `from_bytes_with_nul_unchecked_mut()` method in `CStrExt`,
which might be useful and is going to prevent manual, unsafe casts.
- Fix a typo (s/preffered/prefered/).
Changes in v3:
- Fix the commit message.
- Remove redundant braces in `use`, when only one item is imported.
Changes in v2:
- Do not remove `c_str` macro. While it's preferred to use C-string
literals, there are two cases where `c_str` is helpful:
- When working with macros, which already return a Rust string literal
(e.g. `stringify!`).
- When building macros, where we want to take a Rust string literal as an
argument (for caller's convenience), but still use it as a C-string
internally.
- Use Rust literals as arguments in macros (`new_mutex`, `new_condvar`,
`new_mutex`). Use the `c_str` macro to convert these literals to C-string
literals.
- Use `c_str` in kunit.rs for converting the output of `stringify!` to a
`CStr`.
- Remove `DerefMut` implementation for `CString`.
---
Tamir Duberstein (5):
rust: macros: reduce collections in `quote!` macro
rust: support formatting of foreign types
rust: replace `CStr` with `core::ffi::CStr`
rust: replace `kernel::c_str!` with C-Strings
rust: remove core::ffi::CStr reexport
drivers/block/rnull.rs | 4 +-
drivers/cpufreq/rcpufreq_dt.rs | 5 +-
drivers/gpu/drm/drm_panic_qr.rs | 5 +-
drivers/gpu/drm/nova/driver.rs | 10 +-
drivers/gpu/nova-core/driver.rs | 6 +-
drivers/gpu/nova-core/firmware.rs | 2 +-
drivers/gpu/nova-core/gpu.rs | 4 +-
drivers/gpu/nova-core/nova_core.rs | 2 +-
drivers/net/phy/ax88796b_rust.rs | 8 +-
drivers/net/phy/qt2025.rs | 6 +-
rust/kernel/auxiliary.rs | 6 +-
rust/kernel/block/mq.rs | 2 +-
rust/kernel/clk.rs | 9 +-
rust/kernel/configfs.rs | 14 +-
rust/kernel/cpufreq.rs | 6 +-
rust/kernel/device.rs | 9 +-
rust/kernel/devres.rs | 2 +-
rust/kernel/driver.rs | 4 +-
rust/kernel/drm/device.rs | 4 +-
rust/kernel/drm/driver.rs | 3 +-
rust/kernel/drm/ioctl.rs | 2 +-
rust/kernel/error.rs | 10 +-
rust/kernel/faux.rs | 5 +-
rust/kernel/firmware.rs | 16 +-
rust/kernel/fmt.rs | 89 +++++++
rust/kernel/kunit.rs | 21 +-
rust/kernel/lib.rs | 3 +-
rust/kernel/miscdevice.rs | 5 +-
rust/kernel/net/phy.rs | 12 +-
rust/kernel/of.rs | 5 +-
rust/kernel/pci.rs | 2 +-
rust/kernel/platform.rs | 6 +-
rust/kernel/prelude.rs | 5 +-
rust/kernel/print.rs | 4 +-
rust/kernel/seq_file.rs | 6 +-
rust/kernel/str.rs | 444 ++++++++++------------------------
rust/kernel/sync.rs | 7 +-
rust/kernel/sync/condvar.rs | 4 +-
rust/kernel/sync/lock.rs | 4 +-
rust/kernel/sync/lock/global.rs | 6 +-
rust/kernel/sync/poll.rs | 1 +
rust/kernel/workqueue.rs | 9 +-
rust/macros/fmt.rs | 99 ++++++++
rust/macros/kunit.rs | 10 +-
rust/macros/lib.rs | 19 ++
rust/macros/module.rs | 2 +-
rust/macros/quote.rs | 111 ++++-----
samples/rust/rust_configfs.rs | 9 +-
samples/rust/rust_driver_auxiliary.rs | 7 +-
samples/rust/rust_driver_faux.rs | 4 +-
samples/rust/rust_driver_pci.rs | 4 +-
samples/rust/rust_driver_platform.rs | 4 +-
samples/rust/rust_misc_device.rs | 3 +-
scripts/rustdoc_test_gen.rs | 6 +-
54 files changed, 540 insertions(+), 515 deletions(-)
---
base-commit: e04c78d86a9699d136910cfc0bdcf01087e3267e
change-id: 20250201-cstr-core-d4b9b69120cf
Best regards,
--
Tamir Duberstein <tamird(a)gmail.com>
This started with a patch that enabled `clippy::ptr_as_ptr`. Benno
Lossin suggested I also look into `clippy::ptr_cast_constness` and I
discovered `clippy::as_ptr_cast_mut`. This series now enables all 3
lints. It also enables `clippy::as_underscore` which ensures other
pointer casts weren't missed.
As a later addition, `clippy::cast_lossless` and `clippy::ref_as_ptr`
are also enabled.
Signed-off-by: Tamir Duberstein <tamird(a)gmail.com>
---
Changes in v12:
- Remove stale mention of a dependency. (Miguel Ojeda)
- Apply to config, cpufreq, and nova. (Miguel Ojeda)
- Link to v11: https://lore.kernel.org/r/20250611-ptr-as-ptr-v11-0-ce5b41c6e9c6@gmail.com
Changes in v11:
- Rebase on v6.16-rc1.
- Replace some `as <integer>` with `as bindings::T` and others with `as
ffi::T`. (Miguel Ojeda)
- Revert explicit `ffi::c_void` import which is in the prelude. (Miguel Ojeda)
- Link to v10: https://lore.kernel.org/r/20250418-ptr-as-ptr-v10-0-3d63d27907aa@gmail.com
Changes in v10:
- Move fragment from "rust: enable `clippy::ptr_cast_constness` lint" to
"rust: enable `clippy::ptr_as_ptr` lint". (Boqun Feng)
- Replace `(...).into()` with `T::from(...)` where the destination type
isn't obvious in "rust: enable `clippy::cast_lossless` lint". (Boqun
Feng)
- Link to v9: https://lore.kernel.org/r/20250416-ptr-as-ptr-v9-0-18ec29b1b1f3@gmail.com
Changes in v9:
- Replace ref-to-ptr coercion using `let` bindings with
`core::ptr::from_{ref,mut}`. (Boqun Feng).
- Link to v8: https://lore.kernel.org/r/20250409-ptr-as-ptr-v8-0-3738061534ef@gmail.com
Changes in v8:
- Use coercion to go ref -> ptr.
- rustfmt.
- Rebase on v6.15-rc1.
- Extract first commit to its own series as it is shared with other
series.
- Link to v7: https://lore.kernel.org/r/20250325-ptr-as-ptr-v7-0-87ab452147b9@gmail.com
Changes in v7:
- Add patch to enable `clippy::ref_as_ptr`.
- Link to v6: https://lore.kernel.org/r/20250324-ptr-as-ptr-v6-0-49d1b7fd4290@gmail.com
Changes in v6:
- Drop strict provenance patch.
- Fix URLs in doc comments.
- Add patch to enable `clippy::cast_lossless`.
- Rebase on rust-next.
- Link to v5: https://lore.kernel.org/r/20250317-ptr-as-ptr-v5-0-5b5f21fa230a@gmail.com
Changes in v5:
- Use `pointer::addr` in OF. (Boqun Feng)
- Add documentation on stubs. (Benno Lossin)
- Mark stubs `#[inline]`.
- Pick up Alice's RB on a shared commit from
https://lore.kernel.org/all/Z9f-3Aj3_FWBZRrm@google.com/.
- Link to v4: https://lore.kernel.org/r/20250315-ptr-as-ptr-v4-0-b2d72c14dc26@gmail.com
Changes in v4:
- Add missing SoB. (Benno Lossin)
- Use `without_provenance_mut` in alloc. (Boqun Feng)
- Limit strict provenance lints to the `kernel` crate to avoid complex
logic in the build system. This can be revisited on MSRV >= 1.84.0.
- Rebase on rust-next.
- Link to v3: https://lore.kernel.org/r/20250314-ptr-as-ptr-v3-0-e7ba61048f4a@gmail.com
Changes in v3:
- Fixed clippy warning in rust/kernel/firmware.rs. (kernel test robot)
Link: https://lore.kernel.org/all/202503120332.YTCpFEvv-lkp@intel.com/
- s/as u64/as bindings::phys_addr_t/g. (Benno Lossin)
- Use strict provenance APIs and enable lints. (Benno Lossin)
- Link to v2: https://lore.kernel.org/r/20250309-ptr-as-ptr-v2-0-25d60ad922b7@gmail.com
Changes in v2:
- Fixed typo in first commit message.
- Added additional patches, converted to series.
- Link to v1: https://lore.kernel.org/r/20250307-ptr-as-ptr-v1-1-582d06514c98@gmail.com
---
Tamir Duberstein (6):
rust: enable `clippy::ptr_as_ptr` lint
rust: enable `clippy::ptr_cast_constness` lint
rust: enable `clippy::as_ptr_cast_mut` lint
rust: enable `clippy::as_underscore` lint
rust: enable `clippy::cast_lossless` lint
rust: enable `clippy::ref_as_ptr` lint
Makefile | 6 ++++
drivers/gpu/drm/drm_panic_qr.rs | 4 +--
drivers/gpu/nova-core/driver.rs | 2 +-
drivers/gpu/nova-core/regs.rs | 2 +-
drivers/gpu/nova-core/regs/macros.rs | 2 +-
rust/bindings/lib.rs | 3 ++
rust/kernel/alloc/allocator_test.rs | 2 +-
rust/kernel/alloc/kvec.rs | 4 +--
rust/kernel/block/mq/operations.rs | 2 +-
rust/kernel/block/mq/request.rs | 11 +++++--
rust/kernel/configfs.rs | 22 +++++---------
rust/kernel/cpufreq.rs | 2 +-
rust/kernel/device.rs | 4 +--
rust/kernel/device_id.rs | 4 +--
rust/kernel/devres.rs | 17 +++++------
rust/kernel/dma.rs | 6 ++--
rust/kernel/drm/device.rs | 6 ++--
rust/kernel/error.rs | 2 +-
rust/kernel/firmware.rs | 3 +-
rust/kernel/fs/file.rs | 2 +-
rust/kernel/io.rs | 18 ++++++------
rust/kernel/kunit.rs | 11 ++++---
rust/kernel/list/impl_list_item_mod.rs | 2 +-
rust/kernel/miscdevice.rs | 2 +-
rust/kernel/mm/virt.rs | 52 +++++++++++++++++-----------------
rust/kernel/net/phy.rs | 4 +--
rust/kernel/of.rs | 6 ++--
rust/kernel/pci.rs | 11 ++++---
rust/kernel/platform.rs | 4 ++-
rust/kernel/print.rs | 6 ++--
rust/kernel/seq_file.rs | 2 +-
rust/kernel/str.rs | 14 ++++-----
rust/kernel/sync/poll.rs | 2 +-
rust/kernel/time/hrtimer/pin.rs | 2 +-
rust/kernel/time/hrtimer/pin_mut.rs | 2 +-
rust/kernel/uaccess.rs | 4 +--
rust/kernel/workqueue.rs | 8 +++---
rust/uapi/lib.rs | 3 ++
38 files changed, 139 insertions(+), 120 deletions(-)
---
base-commit: 19272b37aa4f83ca52bdf9c16d5d81bdd1354494
change-id: 20250307-ptr-as-ptr-21b1867fc4d4
Best regards,
--
Tamir Duberstein <tamird(a)gmail.com>
DAMON sysfs interface is the bridge between the user space and the
kernel space for DAMON parameters. There is no good and simple test to
see if the parameters are set as expected. Existing DAMON selftests
therefore test end-to-end features. For example, damos_quota_goal.py
runs a DAMOS scheme with quota goal set against a test program running
an artificial access pattern, and see if the result is as expected.
Such tests cover only a few part of DAMON. Adding more tests is also
complicated. Finally, the reliability of the test itself on different
systems is bad.
'drgn' is a tool that can extract kernel internal data structures like
DAMON parameters. Add a test that passes specific DAMON parameters via
DAMON sysfs reusing _damon_sysfs.py, extract resulting DAMON parameters
via 'drgn', and compare those. Note that this test is not adding
exhaustive tests of all DAMON parameters and input combinations but very
basic things. Advancing the test infrastructure and adding more tests
are future works.
SeongJae Park (6):
selftests/damon: add drgn script for extracting damon status
selftests/damon/_damon_sysfs: set Kdamond.pid in start()
selftests/damon: add python and drgn-based DAMON sysfs test
selftests/damon/sysfs.py: test monitoring attribute parameters
selftests/damon/sysfs.py: test adaptive targets parameter
selftests/damon/sysfs.py: test DAMOS schemes parameters setup
tools/testing/selftests/damon/Makefile | 1 +
tools/testing/selftests/damon/_damon_sysfs.py | 3 +
.../selftests/damon/drgn_dump_damon_status.py | 161 ++++++++++++++++++
tools/testing/selftests/damon/sysfs.py | 115 +++++++++++++
4 files changed, 280 insertions(+)
create mode 100755 tools/testing/selftests/damon/drgn_dump_damon_status.py
create mode 100755 tools/testing/selftests/damon/sysfs.py
base-commit: 59f618c718d036132b59bcf997943d4f5520149f
--
2.39.5
This commit adds a new kernel selftest to verify RTNLGRP_IPV6_ACADDR
notifications. The test works by adding/removing a dummy interface,
enabling packet forwarding, and then confirming that user space can
correctly receive anycast notifications.
The test relies on the iproute2 version to be 6.13+.
Tested by the following command:
$ vng -v --user root --cpus 16 -- \
make -C tools/testing/selftests TARGETS=net
TEST_PROGS=rtnetlink_notification.sh \
TEST_GEN_PROGS="" run_tests
Cc: Maciej Żenczykowski <maze(a)google.com>
Cc: Lorenzo Colitti <lorenzo(a)google.com>
Signed-off-by: Yuyang Huang <yuyanghuang(a)google.com>
---
Changelog since v1:
- Remote unrelated clean up code.
.../selftests/net/rtnetlink_notification.sh | 44 ++++++++++++++++++-
1 file changed, 43 insertions(+), 1 deletion(-)
diff --git a/tools/testing/selftests/net/rtnetlink_notification.sh b/tools/testing/selftests/net/rtnetlink_notification.sh
index 39c1b815bbe4..3f9780232bd6 100755
--- a/tools/testing/selftests/net/rtnetlink_notification.sh
+++ b/tools/testing/selftests/net/rtnetlink_notification.sh
@@ -8,9 +8,11 @@
ALL_TESTS="
kci_test_mcast_addr_notification
+ kci_test_anycast_addr_notification
"
source lib.sh
+test_dev="test-dummy1"
kci_test_mcast_addr_notification()
{
@@ -18,7 +20,6 @@ kci_test_mcast_addr_notification()
local tmpfile
local monitor_pid
local match_result
- local test_dev="test-dummy1"
tmpfile=$(mktemp)
defer rm "$tmpfile"
@@ -56,6 +57,47 @@ kci_test_mcast_addr_notification()
return $RET
}
+kci_test_anycast_addr_notification()
+{
+ RET=0
+ local tmpfile
+ local monitor_pid
+ local match_result
+
+ tmpfile=$(mktemp)
+ defer rm "$tmpfile"
+
+ ip monitor acaddress > "$tmpfile" &
+ monitor_pid=$!
+ defer kill_process "$monitor_pid"
+ sleep 1
+
+ if [ ! -e "/proc/$monitor_pid" ]; then
+ RET=$ksft_skip
+ log_test "anycast addr notification: iproute2 too old"
+ return "$RET"
+ fi
+
+ ip link add name "$test_dev" type dummy
+ check_err $? "failed to add dummy interface"
+ ip link set "$test_dev" up
+ check_err $? "failed to set dummy interface up"
+ sysctl -qw net.ipv6.conf."$test_dev".forwarding=1
+ ip link del dev "$test_dev"
+ check_err $? "Failed to delete dummy interface"
+ sleep 1
+
+ # There should be 2 line matches as follows.
+ # 9: dummy2 inet6 any fe80:: scope global
+ # Deleted 9: dummy2 inet6 any fe80:: scope global
+ match_result=$(grep -cE "$test_dev.*(fe80::)" "$tmpfile")
+ if [ "$match_result" -ne 2 ]; then
+ RET=$ksft_fail
+ fi
+ log_test "anycast addr notification: Expected 2 matches, got $match_result"
+ return "$RET"
+}
+
#check for needed privileges
if [ "$(id -u)" -ne 0 ];then
RET=$ksft_skip
--
2.50.0.rc2.701.gf1e915cc24-goog
A task in the kernel (task_mm_cid_work) runs somewhat periodically to
compact the mm_cid for each process. Add a test to validate that it runs
correctly and timely.
The test spawns 1 thread pinned to each CPU, then each thread, including
the main one, runs in short bursts for some time. During this period, the
mm_cids should be spanning all numbers between 0 and nproc.
At the end of this phase, a thread with high enough mm_cid (>= nproc/2)
is selected to be the new leader, all other threads terminate.
After some time, the only running thread should see 0 as mm_cid, if that
doesn't happen, the compaction mechanism didn't work and the test fails.
The test never fails if only 1 core is available, in which case, we
cannot test anything as the only available mm_cid is 0.
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers(a)efficios.com>
Signed-off-by: Gabriele Monaco <gmonaco(a)redhat.com>
---
tools/testing/selftests/rseq/.gitignore | 1 +
tools/testing/selftests/rseq/Makefile | 2 +-
.../selftests/rseq/mm_cid_compaction_test.c | 200 ++++++++++++++++++
3 files changed, 202 insertions(+), 1 deletion(-)
create mode 100644 tools/testing/selftests/rseq/mm_cid_compaction_test.c
diff --git a/tools/testing/selftests/rseq/.gitignore b/tools/testing/selftests/rseq/.gitignore
index 0fda241fa62b0..b3920c59bf401 100644
--- a/tools/testing/selftests/rseq/.gitignore
+++ b/tools/testing/selftests/rseq/.gitignore
@@ -3,6 +3,7 @@ basic_percpu_ops_test
basic_percpu_ops_mm_cid_test
basic_test
basic_rseq_op_test
+mm_cid_compaction_test
param_test
param_test_benchmark
param_test_compare_twice
diff --git a/tools/testing/selftests/rseq/Makefile b/tools/testing/selftests/rseq/Makefile
index 0d0a5fae59547..bc4d940f66d40 100644
--- a/tools/testing/selftests/rseq/Makefile
+++ b/tools/testing/selftests/rseq/Makefile
@@ -17,7 +17,7 @@ OVERRIDE_TARGETS = 1
TEST_GEN_PROGS = basic_test basic_percpu_ops_test basic_percpu_ops_mm_cid_test param_test \
param_test_benchmark param_test_compare_twice param_test_mm_cid \
param_test_mm_cid_benchmark param_test_mm_cid_compare_twice \
- syscall_errors_test
+ syscall_errors_test mm_cid_compaction_test
TEST_GEN_PROGS_EXTENDED = librseq.so
diff --git a/tools/testing/selftests/rseq/mm_cid_compaction_test.c b/tools/testing/selftests/rseq/mm_cid_compaction_test.c
new file mode 100644
index 0000000000000..7ddde3b657dd6
--- /dev/null
+++ b/tools/testing/selftests/rseq/mm_cid_compaction_test.c
@@ -0,0 +1,200 @@
+// SPDX-License-Identifier: LGPL-2.1
+#define _GNU_SOURCE
+#include <assert.h>
+#include <pthread.h>
+#include <sched.h>
+#include <stdint.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <stddef.h>
+
+#include "../kselftest.h"
+#include "rseq.h"
+
+#define VERBOSE 0
+#define printf_verbose(fmt, ...) \
+ do { \
+ if (VERBOSE) \
+ printf(fmt, ##__VA_ARGS__); \
+ } while (0)
+
+/* 0.5 s */
+#define RUNNER_PERIOD 500000
+/* Number of runs before we terminate or get the token */
+#define THREAD_RUNS 5
+
+/*
+ * Number of times we check that the mm_cid were compacted.
+ * Checks are repeated every RUNNER_PERIOD.
+ */
+#define MM_CID_COMPACT_TIMEOUT 10
+
+struct thread_args {
+ int cpu;
+ int num_cpus;
+ pthread_mutex_t *token;
+ pthread_barrier_t *barrier;
+ pthread_t *tinfo;
+ struct thread_args *args_head;
+};
+
+static void __noreturn *thread_runner(void *arg)
+{
+ struct thread_args *args = arg;
+ int i, ret, curr_mm_cid;
+ cpu_set_t cpumask;
+
+ CPU_ZERO(&cpumask);
+ CPU_SET(args->cpu, &cpumask);
+ ret = pthread_setaffinity_np(pthread_self(), sizeof(cpumask), &cpumask);
+ if (ret) {
+ errno = ret;
+ perror("Error: failed to set affinity");
+ abort();
+ }
+ pthread_barrier_wait(args->barrier);
+
+ for (i = 0; i < THREAD_RUNS; i++)
+ usleep(RUNNER_PERIOD);
+ curr_mm_cid = rseq_current_mm_cid();
+ /*
+ * We select one thread with high enough mm_cid to be the new leader.
+ * All other threads (including the main thread) will terminate.
+ * After some time, the mm_cid of the only remaining thread should
+ * converge to 0, if not, the test fails.
+ */
+ if (curr_mm_cid >= args->num_cpus / 2 &&
+ !pthread_mutex_trylock(args->token)) {
+ printf_verbose(
+ "cpu%d has mm_cid=%d and will be the new leader.\n",
+ sched_getcpu(), curr_mm_cid);
+ for (i = 0; i < args->num_cpus; i++) {
+ if (args->tinfo[i] == pthread_self())
+ continue;
+ ret = pthread_join(args->tinfo[i], NULL);
+ if (ret) {
+ errno = ret;
+ perror("Error: failed to join thread");
+ abort();
+ }
+ }
+ pthread_barrier_destroy(args->barrier);
+ free(args->tinfo);
+ free(args->token);
+ free(args->barrier);
+ free(args->args_head);
+
+ for (i = 0; i < MM_CID_COMPACT_TIMEOUT; i++) {
+ curr_mm_cid = rseq_current_mm_cid();
+ printf_verbose("run %d: mm_cid=%d on cpu%d.\n", i,
+ curr_mm_cid, sched_getcpu());
+ if (curr_mm_cid == 0)
+ exit(EXIT_SUCCESS);
+ usleep(RUNNER_PERIOD);
+ }
+ exit(EXIT_FAILURE);
+ }
+ printf_verbose("cpu%d has mm_cid=%d and is going to terminate.\n",
+ sched_getcpu(), curr_mm_cid);
+ pthread_exit(NULL);
+}
+
+int test_mm_cid_compaction(void)
+{
+ cpu_set_t affinity;
+ int i, j, ret = 0, num_threads;
+ pthread_t *tinfo;
+ pthread_mutex_t *token;
+ pthread_barrier_t *barrier;
+ struct thread_args *args;
+
+ sched_getaffinity(0, sizeof(affinity), &affinity);
+ num_threads = CPU_COUNT(&affinity);
+ tinfo = calloc(num_threads, sizeof(*tinfo));
+ if (!tinfo) {
+ perror("Error: failed to allocate tinfo");
+ return -1;
+ }
+ args = calloc(num_threads, sizeof(*args));
+ if (!args) {
+ perror("Error: failed to allocate args");
+ ret = -1;
+ goto out_free_tinfo;
+ }
+ token = malloc(sizeof(*token));
+ if (!token) {
+ perror("Error: failed to allocate token");
+ ret = -1;
+ goto out_free_args;
+ }
+ barrier = malloc(sizeof(*barrier));
+ if (!barrier) {
+ perror("Error: failed to allocate barrier");
+ ret = -1;
+ goto out_free_token;
+ }
+ if (num_threads == 1) {
+ fprintf(stderr, "Cannot test on a single cpu. "
+ "Skipping mm_cid_compaction test.\n");
+ /* only skipping the test, this is not a failure */
+ goto out_free_barrier;
+ }
+ pthread_mutex_init(token, NULL);
+ ret = pthread_barrier_init(barrier, NULL, num_threads);
+ if (ret) {
+ errno = ret;
+ perror("Error: failed to initialise barrier");
+ goto out_free_barrier;
+ }
+ for (i = 0, j = 0; i < CPU_SETSIZE && j < num_threads; i++) {
+ if (!CPU_ISSET(i, &affinity))
+ continue;
+ args[j].num_cpus = num_threads;
+ args[j].tinfo = tinfo;
+ args[j].token = token;
+ args[j].barrier = barrier;
+ args[j].cpu = i;
+ args[j].args_head = args;
+ if (!j) {
+ /* The first thread is the main one */
+ tinfo[0] = pthread_self();
+ ++j;
+ continue;
+ }
+ ret = pthread_create(&tinfo[j], NULL, thread_runner, &args[j]);
+ if (ret) {
+ errno = ret;
+ perror("Error: failed to create thread");
+ abort();
+ }
+ ++j;
+ }
+ printf_verbose("Started %d threads.\n", num_threads);
+
+ /* Also main thread will terminate if it is not selected as leader */
+ thread_runner(&args[0]);
+
+ /* only reached in case of errors */
+out_free_barrier:
+ free(barrier);
+out_free_token:
+ free(token);
+out_free_args:
+ free(args);
+out_free_tinfo:
+ free(tinfo);
+
+ return ret;
+}
+
+int main(int argc, char **argv)
+{
+ if (!rseq_mm_cid_available()) {
+ fprintf(stderr, "Error: rseq_mm_cid unavailable\n");
+ return -1;
+ }
+ if (test_mm_cid_compaction())
+ return -1;
+ return 0;
+}
--
2.49.0
This series is built on top of the Fuad's v7 "mapping guest_memfd backed
memory at the host" [1].
With James's KVM userfault [2], it is possible to handle stage-2 faults
in guest_memfd in userspace. However, KVM itself also triggers faults
in guest_memfd in some cases, for example: PV interfaces like kvmclock,
PV EOI and page table walking code when fetching the MMIO instruction on
x86. It was agreed in the guest_memfd upstream call on 23 Jan 2025 [3]
that KVM would be accessing those pages via userspace page tables. In
order for such faults to be handled in userspace, guest_memfd needs to
support userfaultfd.
Changes since v2 [4]:
- James: Fix sgp type when calling shmem_get_folio_gfp
- James: Improved vm_ops->fault() error handling
- James: Add and make use of the can_userfault() VMA operation
- James: Add UFFD_FEATURE_MINOR_GUEST_MEMFD feature flag
- James: Fix typos and add more checks in the test
Nikita
[1] https://lore.kernel.org/kvm/20250318161823.4005529-1-tabba@google.com/T/
[2] https://lore.kernel.org/kvm/20250109204929.1106563-1-jthoughton@google.com/…
[3] https://docs.google.com/document/d/1M6766BzdY1Lhk7LiR5IqVR8B8mG3cr-cxTxOrAo…
[4] https://lore.kernel.org/kvm/20250402160721.97596-1-kalyazin@amazon.com/T/
Nikita Kalyazin (6):
mm: userfaultfd: generic continue for non hugetlbfs
mm: provide can_userfault vma operation
mm: userfaultfd: use can_userfault vma operation
KVM: guest_memfd: add support for userfaultfd minor
mm: userfaultfd: add UFFD_FEATURE_MINOR_GUEST_MEMFD
KVM: selftests: test userfaultfd minor for guest_memfd
fs/userfaultfd.c | 3 +-
include/linux/mm.h | 5 +
include/linux/mm_types.h | 4 +
include/linux/userfaultfd_k.h | 10 +-
include/uapi/linux/userfaultfd.h | 8 +-
mm/hugetlb.c | 9 +-
mm/shmem.c | 17 +++-
mm/userfaultfd.c | 47 ++++++---
.../testing/selftests/kvm/guest_memfd_test.c | 99 +++++++++++++++++++
virt/kvm/guest_memfd.c | 10 ++
10 files changed, 188 insertions(+), 24 deletions(-)
base-commit: 3cc51efc17a2c41a480eed36b31c1773936717e0
--
2.47.1
Currently testing of userspace and in-kernel API use two different
frameworks. kselftests for the userspace ones and Kunit for the
in-kernel ones. Besides their different scopes, both have different
strengths and limitations:
Kunit:
* Tests are normal kernel code.
* They use the regular kernel toolchain.
* They can be packaged and distributed as modules conveniently.
Kselftests:
* Tests are normal userspace code
* They need a userspace toolchain.
A kernel cross toolchain is likely not enough.
* A fair amout of userland is required to run the tests,
which means a full distro or handcrafted rootfs.
* There is no way to conveniently package and run kselftests with a
given kernel image.
* The kselftests makefiles are not as powerful as regular kbuild.
For example they are missing proper header dependency tracking or more
complex compiler option modifications.
Therefore kunit is much easier to run against different kernel
configurations and architectures.
This series aims to combine kselftests and kunit, avoiding both their
limitations. It works by compiling the userspace kselftests as part of
the regular kernel build, embedding them into the kunit kernel or module
and executing them from there. If the kernel toolchain is not fit to
produce userspace because of a missing libc, the kernel's own nolibc can
be used instead.
The structured TAP output from the kselftest is integrated into the
kunit KTAP output transparently, the kunit parser can parse the combined
logs together.
Further room for improvements:
* Call each test in its completely dedicated namespace
* Handle additional test files besides the test executable through
archives. CPIO, cramfs, etc.
* Compatibility with kselftest_harness.h (in progress)
* Expose the blobs in debugfs
* Provide some convience wrappers around compat userprogs
* Figure out a migration path/coexistence solution for
kunit UAPI and tools/testing/selftests/
Output from the kunit example testcase, note the output of
"example_uapi_tests".
$ ./tools/testing/kunit/kunit.py run --kunitconfig lib/kunit example
...
Running tests with:
$ .kunit/linux kunit.filter_glob=example kunit.enable=1 mem=1G console=tty kunit_shutdown=halt
[11:53:53] ================== example (10 subtests) ===================
[11:53:53] [PASSED] example_simple_test
[11:53:53] [SKIPPED] example_skip_test
[11:53:53] [SKIPPED] example_mark_skipped_test
[11:53:53] [PASSED] example_all_expect_macros_test
[11:53:53] [PASSED] example_static_stub_test
[11:53:53] [PASSED] example_static_stub_using_fn_ptr_test
[11:53:53] [PASSED] example_priv_test
[11:53:53] =================== example_params_test ===================
[11:53:53] [SKIPPED] example value 3
[11:53:53] [PASSED] example value 2
[11:53:53] [PASSED] example value 1
[11:53:53] [SKIPPED] example value 0
[11:53:53] =============== [PASSED] example_params_test ===============
[11:53:53] [PASSED] example_slow_test
[11:53:53] ======================= (4 subtests) =======================
[11:53:53] [PASSED] procfs
[11:53:53] [PASSED] userspace test 2
[11:53:53] [SKIPPED] userspace test 3: some reason
[11:53:53] [PASSED] userspace test 4
[11:53:53] ================ [PASSED] example_uapi_test ================
[11:53:53] ===================== [PASSED] example =====================
[11:53:53] ============================================================
[11:53:53] Testing complete. Ran 16 tests: passed: 11, skipped: 5
[11:53:53] Elapsed time: 67.543s total, 1.823s configuring, 65.655s building, 0.058s running
Based on v6.15-rc1.
Signed-off-by: Thomas Weißschuh <thomas.weissschuh(a)linutronix.de>
---
Changes in v3:
- Reintroduce CONFIG_CC_CAN_LINK_STATIC
- Enable CONFIG_ARCH_HAS_NOLIBC for m68k and SPARC
- Properly handle 'clean' target for userprogs
- Use ramfs over tmpfs to reduce dependencies
- Inherit userprogs byte order and ABI from kernel
- Drop now unnecessary "#ifndef NOLIBC"
- Pick up review tags
- Drop usage of __private in blob.h,
sparse complains and it is not really necessary
- Fix execution on loongarch when using clang
- Drop userprogs libgcc handling, it was ugly and is not yet necessary
- Link to v2: https://lore.kernel.org/r/20250407-kunit-kselftests-v2-0-454114e287fd@linut…
Changes in v2:
- Rebase onto v6.15-rc1
- Add documentation and kernel docs
- Resolve invalid kconfig breakages
- Drop already applied patch "kbuild: implement CONFIG_HEADERS_INSTALL for Usermode Linux"
- Drop userprogs CONFIG_WERROR integration, it doesn't need to be part of this series
- Replace patch prefix "kconfig" with "kbuild"
- Rename kunit_uapi_run_executable() to kunit_uapi_run_kselftest()
- Generate private, conflict-free symbols in the blob framework
- Handle kselftest exit codes
- Handle SIGABRT
- Forward output also to kunit debugfs log
- Install a fd=0 stdin filedescriptor
- Link to v1: https://lore.kernel.org/r/20250217-kunit-kselftests-v1-0-42b4524c3b0a@linut…
---
Thomas Weißschuh (16):
kbuild: userprogs: avoid duplicating of flags inherited from kernel
kbuild: userprogs: also inherit byte order and ABI from kernel
init: re-add CONFIG_CC_CAN_LINK_STATIC
kbuild: userprogs: add nolibc support
kbuild: introduce CONFIG_ARCH_HAS_NOLIBC
kbuild: doc: add label for userprogs section
kbuild: introduce blob framework
kunit: tool: Add test for nested test result reporting
kunit: tool: Don't overwrite test status based on subtest counts
kunit: tool: Parse skipped tests from kselftest.h
kunit: Always descend into kunit directory during build
kunit: qemu_configs: loongarch: Enable LSX/LSAX
kunit: Introduce UAPI testing framework
kunit: uapi: Add example for UAPI tests
kunit: uapi: Introduce preinit executable
kunit: uapi: Validate usability of /proc
Documentation/dev-tools/kunit/api/index.rst | 5 +
Documentation/dev-tools/kunit/api/uapi.rst | 12 +
Documentation/kbuild/makefiles.rst | 38 ++-
MAINTAINERS | 2 +
Makefile | 7 +-
include/kunit/uapi.h | 24 ++
include/linux/blob.h | 31 +++
init/Kconfig | 7 +
lib/Makefile | 4 -
lib/kunit/Kconfig | 10 +
lib/kunit/Makefile | 20 +-
lib/kunit/kunit-example-test.c | 15 ++
lib/kunit/kunit-example-uapi.c | 54 ++++
lib/kunit/uapi-preinit.c | 63 +++++
lib/kunit/uapi.c | 294 +++++++++++++++++++++
scripts/Makefile.blobs | 19 ++
scripts/Makefile.build | 6 +
scripts/Makefile.clean | 2 +-
scripts/Makefile.userprogs | 13 +-
scripts/blob-wrap.c | 27 ++
tools/include/nolibc/Kconfig.nolibc | 15 ++
tools/testing/kunit/kunit_parser.py | 13 +-
tools/testing/kunit/kunit_tool_test.py | 9 +
tools/testing/kunit/qemu_configs/loongarch.py | 2 +
.../test_is_test_passed-failure-nested.log | 10 +
.../test_data/test_is_test_passed-kselftest.log | 3 +-
26 files changed, 686 insertions(+), 19 deletions(-)
---
base-commit: f07a3558c4a5d76f3fea004075e5151c4516d055
change-id: 20241015-kunit-kselftests-56273bc40442
Best regards,
--
Thomas Weißschuh <thomas.weissschuh(a)linutronix.de>
Code using IS_ENABLED(CONFIG_MODULES) as a C expression may need access
to the module structure definitions to compile.
Make sure these structure definitions are always visible.
Signed-off-by: Thomas Weißschuh <thomas.weissschuh(a)linutronix.de>
---
Thomas Weißschuh (3):
module: move 'struct module_use' to internal.h
module: make structure definitions always visible
kunit: test: Drop CONFIG_MODULE ifdeffery
include/linux/module.h | 30 ++++++++++++------------------
kernel/module/internal.h | 7 +++++++
lib/kunit/test.c | 8 --------
3 files changed, 19 insertions(+), 26 deletions(-)
---
base-commit: 19272b37aa4f83ca52bdf9c16d5d81bdd1354494
change-id: 20250611-kunit-ifdef-modules-0fefd13ae153
Best regards,
--
Thomas Weißschuh <thomas.weissschuh(a)linutronix.de>
We already have a selftest for harness, while there is not usage
of FIXTURE_VARIANT.
Patch 2 add FIXTURE_VARIANT usage in the selftest.
Patch 1 is a typo fix.
v2:
* drop patch 2 in v1
* adjust patch 2 based on Thomas comment
Wei Yang (2):
selftests: harness: correct typo of __constructor_order_forward in
comment
selftests: harness: Add kselftest harness selftest with variant
tools/testing/selftests/kselftest_harness.h | 2 +-
.../kselftest_harness/harness-selftest.c | 30 +++++++++++++++++++
.../harness-selftest.expected | 20 ++++++++++---
3 files changed, 47 insertions(+), 5 deletions(-)
--
2.34.1
nolibc only supports symbol-based stackprotectors, based on the global
variable __stack_chk_guard. Support for this differs between
architectures and toolchains. Some use the symbol mode by default, some
require a flag to enable it and some don't support it at all.
Before the nolibc test Makefile required the availability of
"-mstack-protector-guard=global" to enable stackprotectors.
While this flag makes sure that the correct mode is available it doesn't
work where the correct mode is the only supported one and therefore the
flag is not implemented.
Switch to a more dynamic probing mechanism.
This correctly enables stack protectors for mips, loongarch and m68k.
Signed-off-by: Thomas Weißschuh <linux(a)weissschuh.net>
---
tools/testing/selftests/nolibc/Makefile | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)
diff --git a/tools/testing/selftests/nolibc/Makefile b/tools/testing/selftests/nolibc/Makefile
index 94176ffe46463548cc9bc787933b6cefa83d6502..853f3a846d4c0fb187922d3063ec3d1a9a30ae46 100644
--- a/tools/testing/selftests/nolibc/Makefile
+++ b/tools/testing/selftests/nolibc/Makefile
@@ -195,7 +195,10 @@ CFLAGS_sparc32 = $(call cc-option,-m32)
ifeq ($(origin XARCH),command line)
CFLAGS_XARCH = $(CFLAGS_$(XARCH))
endif
-CFLAGS_STACKPROTECTOR ?= $(call cc-option,-mstack-protector-guard=global $(call cc-option,-fstack-protector-all))
+_CFLAGS_STACKPROTECTOR = $(call cc-option,-fstack-protector-all) $(call cc-option,-mstack-protector-guard=global)
+CFLAGS_STACKPROTECTOR ?= $(call try-run, \
+ echo 'void foo(void) {}' | $(CC) -x c - -o - -S $(_CFLAGS_STACKPROTECTOR) | grep -q __stack_chk_guard, \
+ $(_CFLAGS_STACKPROTECTOR))
CFLAGS_SANITIZER ?= $(call cc-option,-fsanitize=undefined -fsanitize-trap=all)
CFLAGS ?= -Os -fno-ident -fno-asynchronous-unwind-tables -std=c89 -W -Wall -Wextra \
$(call cc-option,-fno-stack-protector) $(call cc-option,-Wmissing-prototypes) \
---
base-commit: 19272b37aa4f83ca52bdf9c16d5d81bdd1354494
change-id: 20250530-nolibc-stackprotector-robust-77c9f55a3921
Best regards,
--
Thomas Weißschuh <linux(a)weissschuh.net>
When running the khugepaged selftest for shmem (./khugepaged all:shmem),
I encountered the following test failures:
"
Run test: collapse_full (khugepaged:shmem)
Collapse multiple fully populated PTE table.... Fail
...
Run test: collapse_single_pte_entry (khugepaged:shmem)
Collapse PTE table with single PTE entry present.... Fail
...
Run test: collapse_full_of_compound (khugepaged:shmem)
Allocate huge page... OK
Split huge page leaving single PTE page table full of compound pages... OK
Collapse PTE table full of compound pages.... Fail
"
The reason for the failure is that, it will set MADV_NOHUGEPAGE to prevent
khugepaged from continuing to scan shmem VMA after khugepaged finishes
scanning in the wait_for_scan() function. Moreover, shmem requires a refault
to establish PMD mappings.
However, after commit 2b0f922323cc, PMD mappings are prevented if the VMA is
set with MADV_NOHUGEPAGE flag, so shmem cannot establish PMD mappings during
refault.
To fix this issue, we can set the MADV_NOHUGEPAGE flag after the shmem refault.
With this fix, the shmem test case passes.
Fixes: 2b0f922323cc ("mm: don't install PMD mappings when THPs are disabled by the hw/process/vma")
Signed-off-by: Baolin Wang <baolin.wang(a)linux.alibaba.com>
---
tools/testing/selftests/mm/khugepaged.c | 3 +--
1 file changed, 1 insertion(+), 2 deletions(-)
diff --git a/tools/testing/selftests/mm/khugepaged.c b/tools/testing/selftests/mm/khugepaged.c
index 8a4d34cce36b..d462f62d8116 100644
--- a/tools/testing/selftests/mm/khugepaged.c
+++ b/tools/testing/selftests/mm/khugepaged.c
@@ -561,8 +561,6 @@ static bool wait_for_scan(const char *msg, char *p, int nr_hpages,
usleep(TICK);
}
- madvise(p, nr_hpages * hpage_pmd_size, MADV_NOHUGEPAGE);
-
return timeout == -1;
}
@@ -585,6 +583,7 @@ static void khugepaged_collapse(const char *msg, char *p, int nr_hpages,
if (ops != &__anon_ops)
ops->fault(p, 0, nr_hpages * hpage_pmd_size);
+ madvise(p, nr_hpages * hpage_pmd_size, MADV_NOHUGEPAGE);
if (ops->check_huge(p, expect ? nr_hpages : 0))
success("OK");
else
--
2.43.5
When writing a test for fusectl, I referred to this Makefile as a
reference for creating a FUSE daemon in the selftests.
While doing so, I noticed that there is a minor issue in the Makefile.
The fuse_mnt.c file is not actually compiled into fuse_mnt.o,
and the code setting CFLAGS for it never takes effect.
The reason fuse_mnt compiles successfully is because CFLAGS is set
at the very beginning of the file.
Signed-off-by: Chen Linxuan <chenlinxuan(a)uniontech.com>
---
tools/testing/selftests/memfd/Makefile | 4 +---
1 file changed, 1 insertion(+), 3 deletions(-)
diff --git a/tools/testing/selftests/memfd/Makefile b/tools/testing/selftests/memfd/Makefile
index 163b6f68631c4..e9b886c65153d 100644
--- a/tools/testing/selftests/memfd/Makefile
+++ b/tools/testing/selftests/memfd/Makefile
@@ -1,5 +1,4 @@
# SPDX-License-Identifier: GPL-2.0
-CFLAGS += -D_FILE_OFFSET_BITS=64
CFLAGS += $(KHDR_INCLUDES)
TEST_GEN_PROGS := memfd_test
@@ -16,10 +15,9 @@ ifeq ($(VAR_LDLIBS),)
VAR_LDLIBS := -lfuse -pthread
endif
-fuse_mnt.o: CFLAGS += $(VAR_CFLAGS)
-
include ../lib.mk
+$(OUTPUT)/fuse_mnt: CFLAGS += $(VAR_CFLAGS)
$(OUTPUT)/fuse_mnt: LDLIBS += $(VAR_LDLIBS)
$(OUTPUT)/memfd_test: memfd_test.c common.c
--
2.43.0
The netdevsim driver previously lacked RX statistics support, which
prevented its use with the GenerateTraffic() test framework, as this
framework verifies traffic flow by checking RX byte counts.
This patch migrates netdevsim from its custom statistics collection to
the NETDEV_PCPU_STAT_DSTATS framework, as suggested by Jakub. This
change not only standardizes the statistics handling but also adds the
necessary RX statistics support required by the test framework.
Signed-off-by: Breno Leitao <leitao(a)debian.org>
---
Changes in v4:
- Protect dev_dstats_rx_dropped_add() by disabling BH (Jakub)
- Link to v3: https://lore.kernel.org/r/20250617-netdevsim_stat-v3-0-afe4bdcbf237@debian.…
Changes in v3:
- Rely on netdev from caller instead of napi->dev in nsim_queue_free().
- Link to v2: https://lore.kernel.org/r/20250613-netdevsim_stat-v2-0-98fa38836c48@debian.…
Changes in v2:
- Changed the RX collection place from nsim_napi_rx() to nsim_rcv (Joe Damato)
- Collect RX dropped packets statistic in nsim_queue_free() (Jakub)
- Added a helper in dstat to add values to RX dropped packets
- Link to v1: https://lore.kernel.org/r/20250611-netdevsim_stat-v1-0-c11b657d96bf@debian.…
---
Breno Leitao (4):
netdevsim: migrate to dstats stats collection
netdevsim: collect statistics at RX side
net: add dev_dstats_rx_dropped_add() helper
netdevsim: account dropped packet length in stats on queue free
drivers/net/netdevsim/netdev.c | 56 ++++++++++++++++-----------------------
drivers/net/netdevsim/netdevsim.h | 5 ----
include/linux/netdevice.h | 10 +++++++
3 files changed, 33 insertions(+), 38 deletions(-)
---
base-commit: 3b5b1c428260152e47c9584bc176f358b87ca82d
change-id: 20250610-netdevsim_stat-95995921e03e
Best regards,
--
Breno Leitao <leitao(a)debian.org>
Users can leak memory by repeatedly writing a string to DAMOS sysfs
memcg_path file. Fix it (patch 1) and add a selftest (patch 2) to avoid
reoccurrance of the bug.
SeongJae Park (2):
mm/damon/sysfs-schemes: free old damon_sysfs_scheme_filter->memcg_path
on write
selftets/damon: add a test for memcg_path leak
mm/damon/sysfs-schemes.c | 1 +
tools/testing/selftests/damon/Makefile | 1 +
.../selftests/damon/sysfs_memcg_path_leak.sh | 43 +++++++++++++++++++
3 files changed, 45 insertions(+)
create mode 100755 tools/testing/selftests/damon/sysfs_memcg_path_leak.sh
base-commit: 05b89e828eb4f791f721cbdc65f36e1a8287a9d3
--
2.39.5
The two updated tests sometimes failed because the network setup hadn't
completed. Used slowwait to ensure the setup finished and the tests
always passed. I ran both tests 50 times, and all of them passed.
Hangbin Liu (2):
selftests: net: use slowwait to stabilize vrf_route_leaking test
selftests: net: use slowwait to make sure IPv6 setup finished
tools/testing/selftests/net/test_vxlan_vnifiltering.sh | 9 ++++-----
tools/testing/selftests/net/vrf_route_leaking.sh | 4 ++--
2 files changed, 6 insertions(+), 7 deletions(-)
--
2.46.0
From: Ujwal Jain <ujwaljain(a)google.com>
Currently, the in-kernel kunit test case timeout is 300 seconds. (There
is a separate timeout mechanism for the whole test execution in
kunit.py, but that's unrelated.) However, tests marked 'slow' or 'very
slow' may timeout, particularly on slower machines.
Implement a multiplier to the test-case timeout, so that slower tests
have longer to complete:
- DEFAULT -> 1x default timeout
- KUNIT_SPEED_SLOW -> 3x default timeout
- KUNIT_SPEED_VERY_SLOW -> 12x default timeout
A further change is planned to allow user configuration of the
default/base timeout to allow people with faster or slower machines to
adjust these to their use-cases.
Signed-off-by: Ujwal Jain <ujwaljain(a)google.com>
Co-developed-by: David Gow <davidgow(a)google.com>
Signed-off-by: David Gow <davidgow(a)google.com>
---
include/kunit/try-catch.h | 1 +
lib/kunit/kunit-test.c | 9 +++++---
lib/kunit/test.c | 46 ++++++++++++++++++++++++++++++++++++--
lib/kunit/try-catch-impl.h | 4 +++-
lib/kunit/try-catch.c | 29 ++----------------------
5 files changed, 56 insertions(+), 33 deletions(-)
diff --git a/include/kunit/try-catch.h b/include/kunit/try-catch.h
index 7c966a1adbd3..d4e1a5b98ed6 100644
--- a/include/kunit/try-catch.h
+++ b/include/kunit/try-catch.h
@@ -47,6 +47,7 @@ struct kunit_try_catch {
int try_result;
kunit_try_catch_func_t try;
kunit_try_catch_func_t catch;
+ unsigned long timeout;
void *context;
};
diff --git a/lib/kunit/kunit-test.c b/lib/kunit/kunit-test.c
index d9c781c859fd..387cdf7782f6 100644
--- a/lib/kunit/kunit-test.c
+++ b/lib/kunit/kunit-test.c
@@ -43,7 +43,8 @@ static void kunit_test_try_catch_successful_try_no_catch(struct kunit *test)
kunit_try_catch_init(try_catch,
test,
kunit_test_successful_try,
- kunit_test_no_catch);
+ kunit_test_no_catch,
+ 300 * msecs_to_jiffies(MSEC_PER_SEC));
kunit_try_catch_run(try_catch, test);
KUNIT_EXPECT_TRUE(test, ctx->function_called);
@@ -75,7 +76,8 @@ static void kunit_test_try_catch_unsuccessful_try_does_catch(struct kunit *test)
kunit_try_catch_init(try_catch,
test,
kunit_test_unsuccessful_try,
- kunit_test_catch);
+ kunit_test_catch,
+ 300 * msecs_to_jiffies(MSEC_PER_SEC));
kunit_try_catch_run(try_catch, test);
KUNIT_EXPECT_TRUE(test, ctx->function_called);
@@ -129,7 +131,8 @@ static void kunit_test_fault_null_dereference(struct kunit *test)
kunit_try_catch_init(try_catch,
test,
kunit_test_null_dereference,
- kunit_test_catch);
+ kunit_test_catch,
+ 300 * msecs_to_jiffies(MSEC_PER_SEC));
kunit_try_catch_run(try_catch, test);
KUNIT_EXPECT_EQ(test, try_catch->try_result, -EINTR);
diff --git a/lib/kunit/test.c b/lib/kunit/test.c
index 146d1b48a096..002121675605 100644
--- a/lib/kunit/test.c
+++ b/lib/kunit/test.c
@@ -373,6 +373,46 @@ static void kunit_run_case_check_speed(struct kunit *test,
duration.tv_sec, duration.tv_nsec);
}
+/* Returns timeout multiplier based on speed.
+ * DEFAULT: 1
+ * KUNIT_SPEED_SLOW: 3
+ * KUNIT_SPEED_VERY_SLOW: 12
+ */
+static int kunit_timeout_mult(enum kunit_speed speed)
+{
+ switch (speed) {
+ case KUNIT_SPEED_SLOW:
+ return 3;
+ case KUNIT_SPEED_VERY_SLOW:
+ return 12;
+ default:
+ return 1;
+ }
+}
+
+static unsigned long kunit_test_timeout(struct kunit_suite *suite, struct kunit_case *test_case)
+{
+ int mult = 1;
+ /*
+ * TODO: Make the default (base) timeout configurable, so that users with
+ * particularly slow or fast machines can successfully run tests, while
+ * still taking advantage of the relative speed.
+ */
+ unsigned long default_timeout = 300;
+
+ /*
+ * The default test timeout is 300 seconds and will be adjusted by mult
+ * based on the test speed. The test speed will be overridden by the
+ * innermost test component.
+ */
+ if (suite->attr.speed != KUNIT_SPEED_UNSET)
+ mult = kunit_timeout_mult(suite->attr.speed);
+ if (test_case->attr.speed != KUNIT_SPEED_UNSET)
+ mult = kunit_timeout_mult(test_case->attr.speed);
+ return mult * default_timeout * msecs_to_jiffies(MSEC_PER_SEC);
+}
+
+
/*
* Initializes and runs test case. Does not clean up or do post validations.
*/
@@ -527,7 +567,8 @@ static void kunit_run_case_catch_errors(struct kunit_suite *suite,
kunit_try_catch_init(try_catch,
test,
kunit_try_run_case,
- kunit_catch_run_case);
+ kunit_catch_run_case,
+ kunit_test_timeout(suite, test_case));
context.test = test;
context.suite = suite;
context.test_case = test_case;
@@ -537,7 +578,8 @@ static void kunit_run_case_catch_errors(struct kunit_suite *suite,
kunit_try_catch_init(try_catch,
test,
kunit_try_run_case_cleanup,
- kunit_catch_run_case_cleanup);
+ kunit_catch_run_case_cleanup,
+ kunit_test_timeout(suite, test_case));
kunit_try_catch_run(try_catch, &context);
/* Propagate the parameter result to the test case. */
diff --git a/lib/kunit/try-catch-impl.h b/lib/kunit/try-catch-impl.h
index 203ba6a5e740..6f401b97cd0b 100644
--- a/lib/kunit/try-catch-impl.h
+++ b/lib/kunit/try-catch-impl.h
@@ -17,11 +17,13 @@ struct kunit;
static inline void kunit_try_catch_init(struct kunit_try_catch *try_catch,
struct kunit *test,
kunit_try_catch_func_t try,
- kunit_try_catch_func_t catch)
+ kunit_try_catch_func_t catch,
+ unsigned long timeout)
{
try_catch->test = test;
try_catch->try = try;
try_catch->catch = catch;
+ try_catch->timeout = timeout;
}
#endif /* _KUNIT_TRY_CATCH_IMPL_H */
diff --git a/lib/kunit/try-catch.c b/lib/kunit/try-catch.c
index 6bbe0025b079..d84a879f0a78 100644
--- a/lib/kunit/try-catch.c
+++ b/lib/kunit/try-catch.c
@@ -34,31 +34,6 @@ static int kunit_generic_run_threadfn_adapter(void *data)
return 0;
}
-static unsigned long kunit_test_timeout(void)
-{
- /*
- * TODO(brendanhiggins(a)google.com): We should probably have some type of
- * variable timeout here. The only question is what that timeout value
- * should be.
- *
- * The intention has always been, at some point, to be able to label
- * tests with some type of size bucket (unit/small, integration/medium,
- * large/system/end-to-end, etc), where each size bucket would get a
- * default timeout value kind of like what Bazel does:
- * https://docs.bazel.build/versions/master/be/common-definitions.html#test.si…
- * There is still some debate to be had on exactly how we do this. (For
- * one, we probably want to have some sort of test runner level
- * timeout.)
- *
- * For more background on this topic, see:
- * https://mike-bland.com/2011/11/01/small-medium-large.html
- *
- * If tests timeout due to exceeding sysctl_hung_task_timeout_secs,
- * the task will be killed and an oops generated.
- */
- return 300 * msecs_to_jiffies(MSEC_PER_SEC); /* 5 min */
-}
-
void kunit_try_catch_run(struct kunit_try_catch *try_catch, void *context)
{
struct kunit *test = try_catch->test;
@@ -85,8 +60,8 @@ void kunit_try_catch_run(struct kunit_try_catch *try_catch, void *context)
task_done = task_struct->vfork_done;
wake_up_process(task_struct);
- time_remaining = wait_for_completion_timeout(task_done,
- kunit_test_timeout());
+ time_remaining = wait_for_completion_timeout(
+ task_done, try_catch->timeout);
if (time_remaining == 0) {
try_catch->try_result = -ETIMEDOUT;
kthread_stop(task_struct);
--
2.50.0.rc1.591.g9c95f17f64-goog
This commit adds a new kernel selftest to verify RTNLGRP_IPV6_ACADDR
notifications. The test works by adding/removing a dummy interface,
enabling packet forwarding, and then confirming that user space can
correctly receive anycast notifications.
The test relies on the iproute2 version to be 6.13+.
Tested by the following command:
$ vng -v --user root --cpus 16 -- \
make -C tools/testing/selftests TARGETS=net
TEST_PROGS=rtnetlink_notification.sh \
TEST_GEN_PROGS="" run_tests
Signed-off-by: Yuyang Huang <yuyanghuang(a)google.com>
---
.../selftests/net/rtnetlink_notification.sh | 52 +++++++++++++++++--
1 file changed, 47 insertions(+), 5 deletions(-)
diff --git a/tools/testing/selftests/net/rtnetlink_notification.sh b/tools/testing/selftests/net/rtnetlink_notification.sh
index 39c1b815bbe4..2d938861197c 100755
--- a/tools/testing/selftests/net/rtnetlink_notification.sh
+++ b/tools/testing/selftests/net/rtnetlink_notification.sh
@@ -8,9 +8,11 @@
ALL_TESTS="
kci_test_mcast_addr_notification
+ kci_test_anycast_addr_notification
"
source lib.sh
+test_dev="test-dummy1"
kci_test_mcast_addr_notification()
{
@@ -18,12 +20,11 @@ kci_test_mcast_addr_notification()
local tmpfile
local monitor_pid
local match_result
- local test_dev="test-dummy1"
tmpfile=$(mktemp)
defer rm "$tmpfile"
- ip monitor maddr > $tmpfile &
+ ip monitor maddr > "$tmpfile" &
monitor_pid=$!
defer kill_process "$monitor_pid"
@@ -32,7 +33,7 @@ kci_test_mcast_addr_notification()
if [ ! -e "/proc/$monitor_pid" ]; then
RET=$ksft_skip
log_test "mcast addr notification: iproute2 too old"
- return $RET
+ return "$RET"
fi
ip link add name "$test_dev" type dummy
@@ -53,7 +54,48 @@ kci_test_mcast_addr_notification()
RET=$ksft_fail
fi
log_test "mcast addr notification: Expected 4 matches, got $match_result"
- return $RET
+ return "$RET"
+}
+
+kci_test_anycast_addr_notification()
+{
+ RET=0
+ local tmpfile
+ local monitor_pid
+ local match_result
+
+ tmpfile=$(mktemp)
+ defer rm "$tmpfile"
+
+ ip monitor acaddress > "$tmpfile" &
+ monitor_pid=$!
+ defer kill_process "$monitor_pid"
+ sleep 1
+
+ if [ ! -e "/proc/$monitor_pid" ]; then
+ RET=$ksft_skip
+ log_test "anycast addr notification: iproute2 too old"
+ return "$RET"
+ fi
+
+ ip link add name "$test_dev" type dummy
+ check_err $? "failed to add dummy interface"
+ ip link set "$test_dev" up
+ check_err $? "failed to set dummy interface up"
+ sysctl -qw net.ipv6.conf."$test_dev".forwarding=1
+ ip link del dev "$test_dev"
+ check_err $? "Failed to delete dummy interface"
+ sleep 1
+
+ # There should be 2 line matches as follows.
+ # 9: dummy2 inet6 any fe80:: scope global
+ # Deleted 9: dummy2 inet6 any fe80:: scope global
+ match_result=$(grep -cE "$test_dev.*(fe80::)" "$tmpfile")
+ if [ "$match_result" -ne 2 ]; then
+ RET=$ksft_fail
+ fi
+ log_test "anycast addr notification: Expected 2 matches, got $match_result"
+ return "$RET"
}
#check for needed privileges
@@ -67,4 +109,4 @@ require_command ip
tests_run
-exit $EXIT_STATUS
+exit "$EXIT_STATUS"
--
2.50.0.rc2.761.g2dc52ea45b-goog
Fix the spelling error from "multible" to "multiple".
Signed-off-by: Ankit Chauhan <ankitchauhan2065(a)gmail.com>
---
tools/testing/selftests/ptrace/peeksiginfo.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/tools/testing/selftests/ptrace/peeksiginfo.c b/tools/testing/selftests/ptrace/peeksiginfo.c
index a6884f66dc01..2f345d11e4b8 100644
--- a/tools/testing/selftests/ptrace/peeksiginfo.c
+++ b/tools/testing/selftests/ptrace/peeksiginfo.c
@@ -199,7 +199,7 @@ int main(int argc, char *argv[])
/*
* Dump signal from the process-wide queue.
- * The number of signals is not multible to the buffer size
+ * The number of signals is not multiple to the buffer size
*/
if (check_direct_path(child, 1, 3))
goto out;
--
2.34.1
This patch adds a new robust_list() syscall. The current syscall
can't be expanded to cover the following use case, so a new one is
needed. This new syscall allows users to set multiple robust lists per
process and to have either 32bit or 64bit pointers in the list.
* Use case
FEX-Emu[1] is an application that runs x86 and x86-64 binaries on an
AArch64 Linux host. One of the tasks of FEX-Emu is to translate syscalls
from one platform to another. Existing set_robust_list() can't be easily
translated because of two limitations:
1) x86 apps can have 32bit pointers robust lists. For a x86-64 kernel
this is not a problem, because of the compat entry point. But there's
no such compat entry point for AArch64, so the kernel would do the
pointer arithmetic wrongly. Is also unviable to userspace to keep
track every addition/removal to the robust list and keep a 64bit
version of it somewhere else to feed the kernel. Thus, the new
interface has an option of telling the kernel if the list is filled
with 32bit or 64bit pointers.
2) Apps can set just one robust list (in theory, x86-64 can set two if
they also use the compat entry point). That means that when a x86 app
asks FEX-Emu to call set_robust_list(), FEX have two options: to
overwrite their own robust list pointer and make the app robust, or
to ignore the app robust list and keep the emulator robust. The new
interface allows for multiple robust lists per application, solving
this.
* Interface
This is the proposed interface:
long set_robust_list2(void *head, int index, unsigned int flags)
`head` is the head of the userspace struct robust_list_head, just as old
set_robust_list(). It needs to be a void pointer since it can point to a normal
robust_list_head or a compat_robust_list_head.
`flags` can be used for defining the list type:
enum robust_list_type {
ROBUST_LIST_32BIT,
ROBUST_LIST_64BIT,
};
`index` is the index in the internal robust_list's linked list (the naming
starts to get confusing, I reckon). If `index == -1`, that means that user wants
to set a new robust_list, and the kernel will append it in the end of the list,
assign a new index and return this index to the user. If `index >= 0`, that
means that user wants to re-set `*head` of an already existing list (similarly
to what happens when you call set_robust_list() twice with different `*head`).
If `index` is out of range, or it points to a non-existing robust_list, or if
the internal list is full, an error is returned.
* Implementation
The implementation re-uses most of the existing robust list interface as
possible. The new task_struct member `struct list_head robust_list2` is just a
linked list where new lists are appended as the user requests more lists, and by
futex_cleanup(), the kernel walks through the internal list feeding
exit_robust_list() with the robust_list's.
This implementation supports up to 10 lists (defined at ROBUST_LISTS_PER_TASK),
but it was an arbitrary number for this RFC. For the described use case above, 4
should be enough, I'm not sure which should be the limit.
It doesn't support list removal (should it support?). It doesn't have a proper
get_robust_list2() yet as well, but I can add it in a next revision. We could
also have a generic robust_list() syscall that can be used to set/get and be
controlled by flags.
The new interface has a `unsigned int flags` argument, making it
extensible for future use cases as well.
It refuses unaligned `head` addresses. It doesn't have a limit for elements in a
single list (like ROBUST_LIST_LIMIT), it destroys the list as it is parsed to be
safe against circular lists.
* Testing
This patcheset has a selftest patch that expands this one:
https://lore.kernel.org/lkml/20250212131123.37431-1-andrealmeid@igalia.com/
Also, FEX-Emu added support for this interface to validate it:
https://github.com/FEX-Emu/FEX/pull/3966
Feedback is very welcomed!
Thanks,
André
[1] https://github.com/FEX-Emu/FEX
Changelog:
- Rebased on top of new futex work (private hash)
v4: https://lore.kernel.org/lkml/20250225183531.682556-1-andrealmeid@igalia.com/
- Refuse unaligned head pointers
- Ignore ROBUST_LIST_LIMIT for lists created with this interface and make it
robust against circular lists
- Fix a get_robust_list() syscall bug for getting the list from another thread
- Adapt selftest to use the new interface
v3: https://lore.kernel.org/lkml/20241217174958.477692-1-andrealmeid@igalia.com/
- Old syscall set_robust_list() adds new head to the internal linked list of
robust lists pointers, instead of having a field just for them. Remove
tsk->robust_list and use only tsk->robust_list2
v2: https://lore.kernel.org/lkml/20241101162147.284993-1-andrealmeid@igalia.com/
- Added a patch to properly deal with exit_robust_list() in 64bit vs 32bit
- Wired-up syscall for all archs
- Added more of the cover letter to the commit message
v1: https://lore.kernel.org/lkml/20241024145735.162090-1-andrealmeid@igalia.com/
---
André Almeida (7):
selftests/futex: Add ASSERT_ macros
selftests/futex: Create test for robust list
futex: Use explicit sizes for compat_exit_robust_list
futex: Create set_robust_list2
futex: Wire up set_robust_list2 syscall
futex: Remove the limit of elements for sys_set_robust_list2 lists
selftests: futex: Expand robust list test for the new interface
arch/alpha/kernel/syscalls/syscall.tbl | 1 +
arch/arm/tools/syscall.tbl | 1 +
arch/m68k/kernel/syscalls/syscall.tbl | 1 +
arch/microblaze/kernel/syscalls/syscall.tbl | 1 +
arch/mips/kernel/syscalls/syscall_n32.tbl | 1 +
arch/mips/kernel/syscalls/syscall_n64.tbl | 1 +
arch/mips/kernel/syscalls/syscall_o32.tbl | 1 +
arch/parisc/kernel/syscalls/syscall.tbl | 1 +
arch/powerpc/kernel/syscalls/syscall.tbl | 1 +
arch/s390/kernel/syscalls/syscall.tbl | 1 +
arch/sh/kernel/syscalls/syscall.tbl | 1 +
arch/sparc/kernel/syscalls/syscall.tbl | 1 +
arch/x86/entry/syscalls/syscall_32.tbl | 1 +
arch/x86/entry/syscalls/syscall_64.tbl | 1 +
arch/xtensa/kernel/syscalls/syscall.tbl | 1 +
include/linux/compat.h | 12 +-
include/linux/futex.h | 16 +-
include/linux/sched.h | 5 +-
include/uapi/asm-generic/unistd.h | 2 +
include/uapi/linux/futex.h | 24 +
kernel/futex/core.c | 165 ++++-
kernel/futex/futex.h | 5 +
kernel/futex/syscalls.c | 85 ++-
kernel/sys_ni.c | 1 +
scripts/syscall.tbl | 1 +
.../testing/selftests/futex/functional/.gitignore | 1 +
tools/testing/selftests/futex/functional/Makefile | 3 +-
.../selftests/futex/functional/robust_list.c | 706 +++++++++++++++++++++
tools/testing/selftests/futex/include/logging.h | 38 ++
29 files changed, 1026 insertions(+), 53 deletions(-)
---
base-commit: 3ee84e3dd88e39b55b534e17a7b9a181f1d46809
change-id: 20250225-tonyk-robust_futex-60adeedac695
Best regards,
--
André Almeida <andrealmeid(a)igalia.com>
The _rval register variable is meant to be an output operand of the asm
statement but is instead used as input operand.
clang 20.1 notices this and triggers -Wuninitialized warnings:
tools/testing/selftests/timers/auxclock.c:154:10: error: variable '_rval' is uninitialized when used here [-Werror,-Wuninitialized]
154 | return VDSO_CALL(self->vdso_clock_gettime64, 2, clockid, ts);
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
tools/testing/selftests/timers/../vDSO/vdso_call.h:59:10: note: expanded from macro 'VDSO_CALL'
59 | : "r" (_rval) \
| ^~~~~
tools/testing/selftests/timers/auxclock.c:154:10: note: variable '_rval' is declared here
tools/testing/selftests/timers/../vDSO/vdso_call.h:47:2: note: expanded from macro 'VDSO_CALL'
47 | register long _rval asm ("r3"); \
| ^
It seems the list of input and output operands have been switched around.
However as the argument registers are not always initialized they can not
be marked as pure inputs as that would trigger -Wuninitialized warnings.
Adding _rval as another input and output operand does also not work as it
would collide with the existing _r3 variable.
Instead reuse _r3 for both the argument and the return value.
Reported-by: kernel test robot <lkp(a)intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202506180223.BOOk5jDK-lkp@intel.com/
Fixes: 6eda706a535c ("selftests: vDSO: fix the way vDSO functions are called for powerpc")
Signed-off-by: Thomas Weißschuh <thomas.weissschuh(a)linutronix.de>
---
tools/testing/selftests/vDSO/vdso_call.h | 7 +++----
1 file changed, 3 insertions(+), 4 deletions(-)
diff --git a/tools/testing/selftests/vDSO/vdso_call.h b/tools/testing/selftests/vDSO/vdso_call.h
index bb237d771051bd4103367fc60b54b505b7586965..e7205584cbdca5e10c13c1e9425d2023b02cda7f 100644
--- a/tools/testing/selftests/vDSO/vdso_call.h
+++ b/tools/testing/selftests/vDSO/vdso_call.h
@@ -44,7 +44,6 @@
register long _r6 asm ("r6"); \
register long _r7 asm ("r7"); \
register long _r8 asm ("r8"); \
- register long _rval asm ("r3"); \
\
LOADARGS_##nr(fn, args); \
\
@@ -54,13 +53,13 @@
" bns+ 1f\n" \
" neg 3, 3\n" \
"1:" \
- : "+r" (_r0), "=r" (_r3), "+r" (_r4), "+r" (_r5), \
+ : "+r" (_r0), "+r" (_r3), "+r" (_r4), "+r" (_r5), \
"+r" (_r6), "+r" (_r7), "+r" (_r8) \
- : "r" (_rval) \
+ : \
: "r9", "r10", "r11", "r12", "cr0", "cr1", "cr5", \
"cr6", "cr7", "xer", "lr", "ctr", "memory" \
); \
- _rval; \
+ _r3; \
})
#else
---
base-commit: 52da431bf03b5506203bca27fe14a97895c80faf
change-id: 20250618-vdso-vdso_call-uninit-ccc33be00568
Best regards,
--
Thomas Weißschuh <thomas.weissschuh(a)linutronix.de>
A few selftest harness changes being merged to v6.16, which exposed some
bugs and vulnerabilities in the iommufd selftest code. Fix them properly.
Note that the patch fixing the build warnings at mfd is not ideal, as it
has possibly hit some corner case in the gcc:
https://lore.kernel.org/all/aEi8DV+ReF3v3Rlf@nvidia.com/
This is on github:
https://github.com/nicolinc/iommufd/commits/iommufd_selftest_fixes-v6.16
Thanks
Nicolin
Nicolin Chen (4):
iommufd/selftest: Fix iommufd_dirty_tracking with large hugepage sizes
iommufd/selftest: Add missing close(mfd) in memfd_mmap()
iommufd/selftest: Add asserts testing global mfd
iommufd/selftest: Fix build warnings due to uninitialized mfd
tools/testing/selftests/iommu/iommufd_utils.h | 9 ++++-
tools/testing/selftests/iommu/iommufd.c | 38 +++++++++++++++----
2 files changed, 38 insertions(+), 9 deletions(-)
--
2.43.0
This commit adds a new kernel selftest to verify RTNLGRP_IPV4_MCADDR
and RTNLGRP_IPV6_MCADDR notifications. The test works by adding and
removing a dummy interface and then confirming that the system
correctly receives join and removal notifications for the 224.0.0.1
and ff02::1 multicast addresses.
The test relies on the iproute2 version to be 6.13+.
Tested by the following command:
$ vng -v --user root --cpus 16 -- \
make -C tools/testing/selftests TARGETS=net
TEST_PROGS=rtnetlink_notification.sh \
TEST_GEN_PROGS="" run_tests
Cc: Maciej Żenczykowski <maze(a)google.com>
Cc: Lorenzo Colitti <lorenzo(a)google.com>
Signed-off-by: Yuyang Huang <yuyanghuang(a)google.com>
---
Changelog since v2:
- Move the test cases to a separate file.
Changelog since v1:
- Skip the test if the iproute2 is too old.
tools/testing/selftests/net/Makefile | 1 +
.../selftests/net/rtnetlink_notification.sh | 159 ++++++++++++++++++
2 files changed, 160 insertions(+)
create mode 100755 tools/testing/selftests/net/rtnetlink_notification.sh
diff --git a/tools/testing/selftests/net/Makefile b/tools/testing/selftests/net/Makefile
index 70a38f485d4d..ad258b25bc9d 100644
--- a/tools/testing/selftests/net/Makefile
+++ b/tools/testing/selftests/net/Makefile
@@ -40,6 +40,7 @@ TEST_PROGS += netns-name.sh
TEST_PROGS += link_netns.py
TEST_PROGS += nl_netdev.py
TEST_PROGS += rtnetlink.py
+TEST_PROGS += rtnetlink_notification.sh
TEST_PROGS += srv6_end_dt46_l3vpn_test.sh
TEST_PROGS += srv6_end_dt4_l3vpn_test.sh
TEST_PROGS += srv6_end_dt6_l3vpn_test.sh
diff --git a/tools/testing/selftests/net/rtnetlink_notification.sh b/tools/testing/selftests/net/rtnetlink_notification.sh
new file mode 100755
index 000000000000..a2c1afed5023
--- /dev/null
+++ b/tools/testing/selftests/net/rtnetlink_notification.sh
@@ -0,0 +1,159 @@
+#!/bin/bash
+#
+# This test is for checking rtnetlink notification callpaths, and get as much
+# coverage as possible.
+#
+# set -e
+
+ALL_TESTS="
+ kci_test_mcast_addr_notification
+"
+
+VERBOSE=0
+PAUSE=no
+PAUSE_ON_FAIL=no
+
+source lib.sh
+
+# set global exit status, but never reset nonzero one.
+check_err()
+{
+ if [ $ret -eq 0 ]; then
+ ret=$1
+ fi
+ [ -n "$2" ] && echo "$2"
+}
+
+run_cmd_common()
+{
+ local cmd="$*"
+ local out
+ if [ "$VERBOSE" = "1" ]; then
+ echo "COMMAND: ${cmd}"
+ fi
+ out=$($cmd 2>&1)
+ rc=$?
+ if [ "$VERBOSE" = "1" -a -n "$out" ]; then
+ echo " $out"
+ fi
+ return $rc
+}
+
+run_cmd() {
+ run_cmd_common "$@"
+ rc=$?
+ check_err $rc
+ return $rc
+}
+
+end_test()
+{
+ echo "$*"
+ [ "${VERBOSE}" = "1" ] && echo
+
+ if [[ $ret -ne 0 ]] && [[ "${PAUSE_ON_FAIL}" = "yes" ]]; then
+ echo "Hit enter to continue"
+ read a
+ fi;
+
+ if [ "${PAUSE}" = "yes" ]; then
+ echo "Hit enter to continue"
+ read a
+ fi
+
+}
+
+kci_test_mcast_addr_notification()
+{
+ local tmpfile
+ local monitor_pid
+ local match_result
+
+ tmpfile=$(mktemp)
+
+ ip monitor maddr > $tmpfile &
+ monitor_pid=$!
+ sleep 1
+ if [ ! -e "/proc/$monitor_pid" ]; then
+ end_test "SKIP: mcast addr notification: iproute2 too old"
+ rm $tmpfile
+ return $ksft_skip
+ fi
+
+ run_cmd ip link add name test-dummy1 type dummy
+ run_cmd ip link set test-dummy1 up
+ run_cmd ip link del dev test-dummy1
+ sleep 1
+
+ match_result=$(grep -cE "test-dummy1.*(224.0.0.1|ff02::1)" $tmpfile)
+
+ kill $monitor_pid
+ rm $tmpfile
+ # There should be 4 line matches as follows.
+ # 13: test-dummy1 inet6 mcast ff02::1 scope global
+ # 13: test-dummy1 inet mcast 224.0.0.1 scope global
+ # Deleted 13: test-dummy1 inet mcast 224.0.0.1 scope global
+ # Deleted 13: test-dummy1 inet6 mcast ff02::1 scope global
+ if [ $match_result -ne 4 ];then
+ end_test "FAIL: mcast addr notification"
+ return 1
+ fi
+ end_test "PASS: mcast addr notification"
+}
+
+kci_test_rtnl()
+{
+ local current_test
+ local ret=0
+
+ for current_test in ${TESTS:-$ALL_TESTS}; do
+ $current_test
+ check_err $?
+ done
+
+ return $ret
+}
+
+usage()
+{
+ cat <<EOF
+usage: ${0##*/} OPTS
+
+ -t <test> Test(s) to run (default: all)
+ (options: $(echo $ALL_TESTS))
+ -v Verbose mode (show commands and output)
+ -P Pause after every test
+ -p Pause after every failing test before cleanup (for debugging)
+EOF
+}
+
+#check for needed privileges
+if [ "$(id -u)" -ne 0 ];then
+ end_test "SKIP: Need root privileges"
+ exit $ksft_skip
+fi
+
+for x in ip;do
+ $x -Version 2>/dev/null >/dev/null
+ if [ $? -ne 0 ];then
+ end_test "SKIP: Could not run test without the $x tool"
+ exit $ksft_skip
+ fi
+done
+
+while getopts t:hvpP o; do
+ case $o in
+ t) TESTS=$OPTARG;;
+ v) VERBOSE=1;;
+ p) PAUSE_ON_FAIL=yes;;
+ P) PAUSE=yes;;
+ h) usage; exit 0;;
+ *) usage; exit 1;;
+ esac
+done
+
+[ $PAUSE = "yes" ] && PAUSE_ON_FAIL="no"
+
+kci_test_rtnl
+
+exit $?
--
2.50.0.rc1.591.g9c95f17f64-goog
This patch series introduces a new feature to netconsole which allows
appending a message ID to the userdata dictionary.
If the msgid feature is enabled, the message ID is built from a per-target 32
bit counter that is incremented and appended to every message sent to the target.
Example::
echo 1 > "/sys/kernel/config/netconsole/cmdline0/userdata/msgid_enabled"
echo "This is message #1" > /dev/kmsg
echo "This is message #2" > /dev/kmsg
13,434,54928466,-;This is message #1
msgid=1
13,435,54934019,-;This is message #2
msgid=2
This feature can be used by the target to detect if messages were dropped or
reordered before reaching the target. This allows system administrators to
assess the reliability of their netconsole pipeline and detect loss of messages
due to network contention or temporary unavailability.
Suggested-by: Breno Leitao <leitao(a)debian.org>
Signed-off-by: Gustavo Luiz Duarte <gustavold(a)gmail.com>
---
Changes in v3:
- Add kdoc documentation for msgcounter.
- Link to v2: https://lore.kernel.org/r/20250612-netconsole-msgid-v2-0-d4c1abc84bac@gmail…
Changes in v2:
- Use wrapping_assign_add() to avoid warnings in UBSAN and friends.
- Improve documentation to clarify wrapping and distinguish msgid from sequnum.
- Rebase and fix conflict in prepare_extradata().
- Link to v1: https://lore.kernel.org/r/20250611-netconsole-msgid-v1-0-1784a51feb1e@gmail…
---
Gustavo Luiz Duarte (5):
netconsole: introduce 'msgid' as a new sysdata field
netconsole: implement configfs for msgid_enabled
netconsole: append msgid to sysdata
selftests: netconsole: Add tests for 'msgid' feature in sysdata
docs: netconsole: document msgid feature
Documentation/networking/netconsole.rst | 32 +++++++++++
drivers/net/netconsole.c | 66 ++++++++++++++++++++++
.../selftests/drivers/net/netcons_sysdata.sh | 30 ++++++++++
3 files changed, 128 insertions(+)
---
base-commit: 08207f42d3ffee43c97f16baf03d7426a3c353ca
change-id: 20250609-netconsole-msgid-b93c6f8e9c60
Best regards,
--
Gustavo Luiz Duarte <gustavold(a)gmail.com>
We already have a selftest for harness, while there is not usage of
FIXTURE_VARIANT.
Patch 3 add FIXTURE_VARIANT usage in the selftest.
Patch 1/2 are trivial fix.
Wei Yang (3):
selftests: correct one typo in comment
selftests: print 0 if no test is chosen
selftests: harness: Add kselftest harness selftest with variant
tools/testing/selftests/kselftest.h | 2 +-
tools/testing/selftests/kselftest_harness.h | 2 +-
.../kselftest_harness/harness-selftest.c | 34 +++++++++++++++++++
.../harness-selftest.expected | 22 +++++++++---
4 files changed, 54 insertions(+), 6 deletions(-)
--
2.34.1
Trivial fix to a couple of outdated netmem comments. No code changes,
just more accurately describing current code.
Signed-off-by: Mina Almasry <almasrymina(a)google.com>
---
v2: https://lore.kernel.org/netdev/20250613042804.3259045-2-almasrymina@google.…
- Adjust comment for clearing lsb as (Jakub)
---
include/net/netmem.h | 21 ++++++++++++++++-----
1 file changed, 16 insertions(+), 5 deletions(-)
diff --git a/include/net/netmem.h b/include/net/netmem.h
index 386164fb9c18..850869b45b45 100644
--- a/include/net/netmem.h
+++ b/include/net/netmem.h
@@ -89,8 +89,7 @@ static inline unsigned int net_iov_idx(const struct net_iov *niov)
* typedef netmem_ref - a nonexistent type marking a reference to generic
* network memory.
*
- * A netmem_ref currently is always a reference to a struct page. This
- * abstraction is introduced so support for new memory types can be added.
+ * A netmem_ref can be a struct page* or a struct net_iov* underneath.
*
* Use the supplied helpers to obtain the underlying memory pointer and fields.
*/
@@ -117,9 +116,6 @@ static inline struct page *__netmem_to_page(netmem_ref netmem)
return (__force struct page *)netmem;
}
-/* This conversion fails (returns NULL) if the netmem_ref is not struct page
- * backed.
- */
static inline struct page *netmem_to_page(netmem_ref netmem)
{
if (WARN_ON_ONCE(netmem_is_net_iov(netmem)))
@@ -178,6 +174,21 @@ static inline unsigned long netmem_pfn_trace(netmem_ref netmem)
return page_to_pfn(netmem_to_page(netmem));
}
+/* __netmem_clear_lsb - convert netmem_ref to struct net_iov * for access to
+ * common fields.
+ * @netmem: netmem reference to extract as net_iov.
+ *
+ * All the sub types of netmem_ref (page, net_iov) have the same pp, pp_magic,
+ * dma_addr, and pp_ref_count fields at the same offsets. Thus, we can access
+ * these fields without a type check to make sure that the underlying mem is
+ * net_iov or page.
+ *
+ * The resulting value of this function can only be used to access the fields
+ * that are NET_IOV_ASSERT_OFFSET'd. Accessing any other fields will result in
+ * undefined behavior.
+ *
+ * Return: the netmem_ref cast to net_iov* regardless of its underlying type.
+ */
static inline struct net_iov *__netmem_clear_lsb(netmem_ref netmem)
{
return (struct net_iov *)((__force unsigned long)netmem & ~NET_IOV);
base-commit: 8909f5f4ecd551c2299b28e05254b77424c8c7dc
--
2.50.0.rc1.591.g9c95f17f64-goog
This commit adds a new kernel selftest to verify RTNLGRP_IPV4_MCADDR
and RTNLGRP_IPV6_MCADDR notifications. The test works by adding and
removing a dummy interface and then confirming that the system
correctly receives join and removal notifications for the 224.0.0.1
and ff02::1 multicast addresses.
The test relies on the iproute2 version to be 6.13+.
Tested by the following command:
$ vng -v --user root --cpus 16 -- \
make -C tools/testing/selftests TARGETS=net
TEST_PROGS=rtnetlink_notification.sh \
TEST_GEN_PROGS="" run_tests
Cc: Maciej Żenczykowski <maze(a)google.com>
Cc: Lorenzo Colitti <lorenzo(a)google.com>
Signed-off-by: Yuyang Huang <yuyanghuang(a)google.com>
---
Changelog since v3:
- Refactor the test to use utilities provided by lib.h.
- Fix shellcheck warnings.
Changelog since v2:
- Move the test case to a separate file.
Changelog since v1:
- Skip the test if the iproute2 is too old.
tools/testing/selftests/net/Makefile | 1 +
.../selftests/net/rtnetlink_notification.sh | 70 +++++++++++++++++++
2 files changed, 71 insertions(+)
create mode 100755 tools/testing/selftests/net/rtnetlink_notification.sh
diff --git a/tools/testing/selftests/net/Makefile b/tools/testing/selftests/net/Makefile
index ab996bd22a5f..3abb74d563a7 100644
--- a/tools/testing/selftests/net/Makefile
+++ b/tools/testing/selftests/net/Makefile
@@ -41,6 +41,7 @@ TEST_PROGS += netns-name.sh
TEST_PROGS += link_netns.py
TEST_PROGS += nl_netdev.py
TEST_PROGS += rtnetlink.py
+TEST_PROGS += rtnetlink_notification.sh
TEST_PROGS += srv6_end_dt46_l3vpn_test.sh
TEST_PROGS += srv6_end_dt4_l3vpn_test.sh
TEST_PROGS += srv6_end_dt6_l3vpn_test.sh
diff --git a/tools/testing/selftests/net/rtnetlink_notification.sh b/tools/testing/selftests/net/rtnetlink_notification.sh
new file mode 100755
index 000000000000..39c1b815bbe4
--- /dev/null
+++ b/tools/testing/selftests/net/rtnetlink_notification.sh
@@ -0,0 +1,70 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+#
+# This test is for checking rtnetlink notification callpaths, and get as much
+# coverage as possible.
+#
+# set -e
+
+ALL_TESTS="
+ kci_test_mcast_addr_notification
+"
+
+source lib.sh
+
+kci_test_mcast_addr_notification()
+{
+ RET=0
+ local tmpfile
+ local monitor_pid
+ local match_result
+ local test_dev="test-dummy1"
+
+ tmpfile=$(mktemp)
+ defer rm "$tmpfile"
+
+ ip monitor maddr > $tmpfile &
+ monitor_pid=$!
+ defer kill_process "$monitor_pid"
+
+ sleep 1
+
+ if [ ! -e "/proc/$monitor_pid" ]; then
+ RET=$ksft_skip
+ log_test "mcast addr notification: iproute2 too old"
+ return $RET
+ fi
+
+ ip link add name "$test_dev" type dummy
+ check_err $? "failed to add dummy interface"
+ ip link set "$test_dev" up
+ check_err $? "failed to set dummy interface up"
+ ip link del dev "$test_dev"
+ check_err $? "Failed to delete dummy interface"
+ sleep 1
+
+ # There should be 4 line matches as follows.
+ # 13: test-dummy1 inet6 mcast ff02::1 scope global
+ # 13: test-dummy1 inet mcast 224.0.0.1 scope global
+ # Deleted 13: test-dummy1 inet mcast 224.0.0.1 scope global
+ # Deleted 13: test-dummy1 inet6 mcast ff02::1 scope global
+ match_result=$(grep -cE "$test_dev.*(224.0.0.1|ff02::1)" "$tmpfile")
+ if [ "$match_result" -ne 4 ]; then
+ RET=$ksft_fail
+ fi
+ log_test "mcast addr notification: Expected 4 matches, got $match_result"
+ return $RET
+}
+
+#check for needed privileges
+if [ "$(id -u)" -ne 0 ];then
+ RET=$ksft_skip
+ log_test "need root privileges"
+ exit $RET
+fi
+
+require_command ip
+
+tests_run
+
+exit $EXIT_STATUS
--
2.50.0.rc1.591.g9c95f17f64-goog
The netdevsim driver previously lacked RX statistics support, which
prevented its use with the GenerateTraffic() test framework, as this
framework verifies traffic flow by checking RX byte counts.
This patch migrates netdevsim from its custom statistics collection to
the NETDEV_PCPU_STAT_DSTATS framework, as suggested by Jakub. This
change not only standardizes the statistics handling but also adds the
necessary RX statistics support required by the test framework.
Signed-off-by: Breno Leitao <leitao(a)debian.org>
---
Changes in v3:
- Rely on netdev from caller instead of napi->dev in nsim_queue_free().
- Link to v2: https://lore.kernel.org/r/20250613-netdevsim_stat-v2-0-98fa38836c48@debian.…
Changes in v2:
- Changed the RX collection place from nsim_napi_rx() to nsim_rcv (Joe
Damato)
- Collect RX dropped packets statistic in nsim_queue_free() (Jakub)
- Added a helper in dstat to add values to RX dropped packets
- Link to v1: https://lore.kernel.org/r/20250611-netdevsim_stat-v1-0-c11b657d96bf@debian.…
---
Breno Leitao (4):
netdevsim: migrate to dstats stats collection
netdevsim: collect statistics at RX side
net: add dev_dstats_rx_dropped_add() helper
netdevsim: account dropped packet length in stats on queue free
drivers/net/netdevsim/netdev.c | 54 +++++++++++++++------------------------
drivers/net/netdevsim/netdevsim.h | 5 ----
include/linux/netdevice.h | 10 ++++++++
3 files changed, 31 insertions(+), 38 deletions(-)
---
base-commit: 3b5b1c428260152e47c9584bc176f358b87ca82d
change-id: 20250610-netdevsim_stat-95995921e03e
Best regards,
--
Breno Leitao <leitao(a)debian.org>
This comes useful when using selftests from mainline on older
kernels/setups that still rely on cgroup v1 while maintains single
variant of selftests.
The core tests that rely on v2 specific features are skipped, the
remaining ones are adjusted to work with a v1 hierarchy.
The expected output on v1 system:
ok 1 # SKIP test_cgcore_internal_process_constraint
ok 2 # SKIP test_cgcore_top_down_constraint_enable
ok 3 # SKIP test_cgcore_top_down_constraint_disable
ok 4 # SKIP test_cgcore_no_internal_process_constraint_on_threads
ok 5 # SKIP test_cgcore_parent_becomes_threaded
ok 6 # SKIP test_cgcore_invalid_domain
ok 7 # SKIP test_cgcore_populated
ok 8 test_cgcore_proc_migration
ok 9 test_cgcore_thread_migration
ok 10 test_cgcore_destroy
ok 11 test_cgcore_lesser_euid_open
ok 12 # SKIP test_cgcore_lesser_ns_open
Michal Koutný (4):
selftests: cgroup_util: Add helpers for testing named v1 hierarchies
selftests: cgroup: Add support for named v1 hierarchies in test_core
selftests: cgroup: Optionally set up v1 environment
selftests: cgroup: Fix compilation on pre-cgroupns kernels
.../selftests/cgroup/lib/cgroup_util.c | 4 +-
.../cgroup/lib/include/cgroup_util.h | 5 ++
tools/testing/selftests/cgroup/test_core.c | 84 ++++++++++++++++---
3 files changed, 82 insertions(+), 11 deletions(-)
base-commit: 9afe652958c3ee88f24df1e4a97f298afce89407
--
2.49.0
This series is a follow-up to [1], which adds mTHP support to khugepaged.
mTHP khugepaged support is a "loose" dependency for the sysfs/sysctl
configs to make sense. Without it global="defer" and mTHP="inherit" case
is "undefined" behavior.
We've seen cases were customers switching from RHEL7 to RHEL8 see a
significant increase in the memory footprint for the same workloads.
Through our investigations we found that a large contributing factor to
the increase in RSS was an increase in THP usage.
For workloads like MySQL, or when using allocators like jemalloc, it is
often recommended to set /transparent_hugepages/enabled=never. This is
in part due to performance degradations and increased memory waste.
This series introduces enabled=defer, this setting acts as a middle
ground between always and madvise. If the mapping is MADV_HUGEPAGE, the
page fault handler will act normally, making a hugepage if possible. If
the allocation is not MADV_HUGEPAGE, then the page fault handler will
default to the base size allocation. The caveat is that khugepaged can
still operate on pages that are not MADV_HUGEPAGE.
This allows for three things... one, applications specifically designed to
use hugepages will get them, and two, applications that don't use
hugepages can still benefit from them without aggressively inserting
THPs at every possible chance. This curbs the memory waste, and defers
the use of hugepages to khugepaged. Khugepaged can then scan the memory
for eligible collapsing. Lastly there is the added benefit for those who
want THPs but experience higher latency PFs. Now you can get base page
performance at the PF handler and Hugepage performance for those mappings
after they collapse.
Admins may want to lower max_ptes_none, if not, khugepaged may
aggressively collapse single allocations into hugepages.
TESTING:
- Built for x86_64, aarch64, ppc64le, and s390x
- selftests mm
- In [1] I provided a script [2] that has multiple access patterns
- lots of general use.
- redis testing. This test was my original case for the defer mode. What I
was able to prove was that THP=always leads to increased max_latency
cases; hence why it is recommended to disable THPs for redis servers.
However with 'defer' we dont have the max_latency spikes and can still
get the system to utilize THPs. I further tested this with the mTHP
defer setting and found that redis (and probably other jmalloc users)
can utilize THPs via defer (+mTHP defer) without a large latency
penalty and some potential gains. I uploaded some mmtest results
here[3] which compares:
stock+thp=never
stock+(m)thp=always
khugepaged-mthp + defer (max_ptes_none=64)
The results show that (m)THPs can cause some throughput regression in
some cases, but also has gains in other cases. The mTHP+defer results
have more gains and less losses over the (m)THP=always case.
V6 Changes:
- nits
- rebased dependent series and added review tags
V5 Changes:
- rebased dependent series
- added reviewed-by tag on 2/4
V4 Changes:
- Minor Documentation fixes
- rebased the dependent series [1] onto mm-unstable
commit 0e68b850b1d3 ("vmalloc: use atomic_long_add_return_relaxed()")
V3 Changes:
- Combined the documentation commits into one, and moved a section to the
khugepaged mthp patchset
V2 Changes:
- base changes on mTHP khugepaged support
- Fix selftests parsing issue
- add mTHP defer option
- add mTHP defer Documentation
[1] - https://lore.kernel.org/all/20250515032226.128900-1-npache@redhat.com/
[2] - https://gitlab.com/npache/khugepaged_mthp_test
[3] - https://people.redhat.com/npache/mthp_khugepaged_defer/testoutput2/output.h…
Nico Pache (4):
mm: defer THP insertion to khugepaged
mm: document (m)THP defer usage
khugepaged: add defer option to mTHP options
selftests: mm: add defer to thp setting parser
Documentation/admin-guide/mm/transhuge.rst | 31 +++++++---
include/linux/huge_mm.h | 18 +++++-
mm/huge_memory.c | 69 +++++++++++++++++++---
mm/khugepaged.c | 8 +--
tools/testing/selftests/mm/thp_settings.c | 1 +
tools/testing/selftests/mm/thp_settings.h | 1 +
6 files changed, 106 insertions(+), 22 deletions(-)
--
2.49.0
[All the precursor patches are merged now and AMD/RISCV/VTD conversions
are written]
Currently each of the iommu page table formats duplicates all of the logic
to maintain the page table and perform map/unmap/etc operations. There are
several different versions of the algorithms between all the different
formats. The io-pgtable system provides an interface to help isolate the
page table code from the iommu driver, but doesn't provide tools to
implement the common algorithms.
This makes it very hard to improve the state of the pagetable code under
the iommu domains as any proposed improvement needs to alter a large
number of different driver code paths. Combined with a lack of software
based testing this makes improvement in this area very hard.
iommufd wants several new page table operations:
- More efficient map/unmap operations, using iommufd's batching logic
- unmap that returns the physical addresses into a batch as it progresses
- cut that allows splitting areas so large pages can have holes
poked in them dynamically (ie guestmemfd hitless shared/private
transitions)
- More agressive freeing of table memory to avoid waste
- Fragmenting large pages so that dirty tracking can be more granular
- Reassembling large pages so that VMs can run at full IO performance
in migration/dirty tracking error flows
- KHO integration for kernel live upgrade
Together these are algorithmically complex enough to be a very significant
task to go and implement in all the page table formats we support. Just
the "server" focused drivers use almost all the formats (ARMv8 S1&S2 / x86
PAE / AMDv1 / VT-D SS / RISCV)
Instead of doing the duplicated work, this series takes the first step to
consolidate the algorithms into one places. In spirit it is similar to the
work Christoph did a few years back to pull the redundant get_user_pages()
implementations out of the arch code into core MM. This unlocked a great
deal of improvement in that space in the following years. I would like to
see the same benefit in iommu as well.
My first RFC showed a bigger picture with all most all formats and more
algorithms. This series reorganizes that to be narrowly focused on just
enough to convert the AMD driver to use the new mechanism.
kunit tests are provided that allow good testing of the algorithms and all
formats on x86, nothing is arch specific.
AMD is one of the simpler options as the HW is quite uniform with few
different options/bugs while still requiring the complicated contiguous
pages support. The HW also has a very simple range based invalidation
approach that is easy to implement.
The AMD v1 and AMD v2 page table formats are implemented bit for bit
identical to the current code, tested using a compare kunit test that
checks against the io-pgtable version (on github, see below).
Updating the AMD driver to replace the io-pgtable layer with the new stuff
is fairly straightforward now. The layering is fixed up in the new version
so that all the invalidation goes through function pointers.
Several small fixing patches have come out of this as I've been fixing the
problems that the test suite uncovers in the current code, and
implementing the fixed version in iommupt.
On performance, there is a quite wide variety of implementation designs
across all the drivers. Looking at some key performance across
the main formats:
iommu_map():
pgsz ,avg new,old ns, min new,old ns , min % (+ve is better)
2^12, 53,66 , 51,63 , 19.19 (AMDV1)
256*2^12, 386,1909 , 367,1795 , 79.79
256*2^21, 362,1633 , 355,1556 , 77.77
2^12, 56,62 , 52,59 , 11.11 (AMDv2)
256*2^12, 405,1355 , 357,1292 , 72.72
256*2^21, 393,1160 , 358,1114 , 67.67
2^12, 55,65 , 53,62 , 14.14 (VTD second stage)
256*2^12, 391,518 , 332,512 , 35.35
256*2^21, 383,635 , 336,624 , 46.46
2^12, 57,65 , 55,63 , 12.12 (ARM 64 bit)
256*2^12, 380,389 , 361,369 , 2.02
256*2^21, 358,419 , 345,400 , 13.13
iommu_unmap():
pgsz ,avg new,old ns, min new,old ns , min % (+ve is better)
2^12, 69,88 , 65,85 , 23.23 (AMDv1)
256*2^12, 353,6498 , 331,6029 , 94.94
256*2^21, 373,6014 , 360,5706 , 93.93
2^12, 71,72 , 66,69 , 4.04 (AMDv2)
256*2^12, 228,891 , 206,871 , 76.76
256*2^21, 254,721 , 245,711 , 65.65
2^12, 69,87 , 65,82 , 20.20 (VTD second stage)
256*2^12, 210,321 , 200,315 , 36.36
256*2^21, 255,349 , 238,342 , 30.30
2^12, 72,77 , 68,74 , 8.08 (ARM 64 bit)
256*2^12, 521,357 , 447,346 , -29.29
256*2^21, 489,358 , 433,345 , -25.25
* Above numbers include additional patches to remove the iommu_pgsize()
overheads. gcc 13.3.0, i7-12700
This version provides fairly consistent performance across formats. ARM
unmap performance is quite different because this version supports
contiguous pages and uses a very different algorithm for unmapping. Though
why it is so worse compared to AMDv1 I haven't figured out yet.
The per-format commits include a more detailed chart.
There is a second branch:
https://github.com/jgunthorpe/linux/commits/iommu_pt_all
Containing supporting work and future steps:
- ARM short descriptor (32 bit), ARM long descriptor (64 bit) formats
- RISCV format and RISCV conversion
https://github.com/jgunthorpe/linux/commits/iommu_pt_riscv
- Support for a DMA incoherent HW page table walker
- VT-D second stage format and VT-D conversion
https://github.com/jgunthorpe/linux/commits/iommu_pt_vtd
- DART v1 & v2 format
- Draft of a iommufd 'cut' operation to break down huge pages
- A compare test that checks the iommupt formats against the iopgtable
interface, including updating AMD to have a working iopgtable and patches
to make VT-D have an iopgtable for testing.
- A performance test to micro-benchmark map and unmap against iogptable
My strategy is to go one by one for the drivers:
- AMD driver conversion
- RISCV page table and driver
- Intel VT-D driver and VTDSS page table
- Flushing improvements for RISCV
- ARM SMMUv3
And concurrently work on the algorithm side:
- debugfs content dump, like VT-D has
- Cut support
- Increase/Decrease page size support
- map/unmap batching
- KHO
As we make more algorithm improvements the value to convert the drivers
increases.
This is on github: https://github.com/jgunthorpe/linux/commits/iommu_pt
v2:
- Rebase on v6.16-rc2
- s/PT_ENTRY_WORD_SIZE/PT_ITEM_WORD_SIZE/s to follow the language better
- Comment and documentation updates
- Add PT_TOP_PHYS_MASK to help manage alignment restrictions on the top
pointer
- Add missed force_aperture = true
- Make pt_iommu_deinit() take care of the not-yet-inited error case
internally as AMD/RISCV/VTD all shared this logic
- Change gather_range() into gather_range_pages() so it also deals with
the page list. This makes the following cache flushing series simpler
- Fix missed update of unmap->unmapped in some error cases
- Change clear_contig() to order the gather more logically
- Remove goto from the error handling in __map_range_leaf()
- s/log2_/oalog2_/ in places where the argument is an oaddr_t
- Pass the pts to pt_table_install64/32()
- Do not use SIGN_EXTEND for the AMDv2 page table because of Vasant's
information on how PASID 0 works.
v1: https://patch.msgid.link/r/0-v2-5c26bde5c22d+58b-iommu_pt_jgg@nvidia.com
- AMD driver only, many code changes
RFC: https://lore.kernel.org/all/0-v1-01fa10580981+1d-iommu_pt_jgg@nvidia.com/
Alejandro Jimenez (1):
iommu/amd: Use the generic iommu page table
Jason Gunthorpe (14):
genpt: Generic Page Table base API
genpt: Add Documentation/ files
iommupt: Add the basic structure of the iommu implementation
iommupt: Add the AMD IOMMU v1 page table format
iommupt: Add iova_to_phys op
iommupt: Add unmap_pages op
iommupt: Add map_pages op
iommupt: Add read_and_clear_dirty op
iommupt: Add a kunit test for Generic Page Table
iommupt: Add a mock pagetable format for iommufd selftest to use
iommufd: Change the selftest to use iommupt instead of xarray
iommupt: Add the x86 64 bit page table format
iommu/amd: Remove AMD io_pgtable support
iommupt: Add a kunit test for the IOMMU implementation
.clang-format | 1 +
Documentation/driver-api/generic_pt.rst | 140 ++
Documentation/driver-api/index.rst | 1 +
drivers/iommu/Kconfig | 2 +
drivers/iommu/Makefile | 1 +
drivers/iommu/amd/Kconfig | 5 +-
drivers/iommu/amd/Makefile | 2 +-
drivers/iommu/amd/amd_iommu.h | 1 -
drivers/iommu/amd/amd_iommu_types.h | 109 +-
drivers/iommu/amd/io_pgtable.c | 560 --------
drivers/iommu/amd/io_pgtable_v2.c | 370 ------
drivers/iommu/amd/iommu.c | 516 ++++----
drivers/iommu/generic_pt/.kunitconfig | 13 +
drivers/iommu/generic_pt/Kconfig | 72 ++
drivers/iommu/generic_pt/fmt/Makefile | 26 +
drivers/iommu/generic_pt/fmt/amdv1.h | 409 ++++++
drivers/iommu/generic_pt/fmt/defs_amdv1.h | 21 +
drivers/iommu/generic_pt/fmt/defs_x86_64.h | 21 +
drivers/iommu/generic_pt/fmt/iommu_amdv1.c | 15 +
drivers/iommu/generic_pt/fmt/iommu_mock.c | 10 +
drivers/iommu/generic_pt/fmt/iommu_template.h | 48 +
drivers/iommu/generic_pt/fmt/iommu_x86_64.c | 11 +
drivers/iommu/generic_pt/fmt/x86_64.h | 248 ++++
drivers/iommu/generic_pt/iommu_pt.h | 1150 +++++++++++++++++
drivers/iommu/generic_pt/kunit_generic_pt.h | 717 ++++++++++
drivers/iommu/generic_pt/kunit_iommu.h | 183 +++
drivers/iommu/generic_pt/kunit_iommu_pt.h | 451 +++++++
drivers/iommu/generic_pt/pt_common.h | 354 +++++
drivers/iommu/generic_pt/pt_defs.h | 323 +++++
drivers/iommu/generic_pt/pt_fmt_defaults.h | 193 +++
drivers/iommu/generic_pt/pt_iter.h | 640 +++++++++
drivers/iommu/generic_pt/pt_log2.h | 130 ++
drivers/iommu/io-pgtable.c | 4 -
drivers/iommu/iommufd/Kconfig | 1 +
drivers/iommu/iommufd/iommufd_test.h | 11 +-
drivers/iommu/iommufd/selftest.c | 439 +++----
include/linux/generic_pt/common.h | 166 +++
include/linux/generic_pt/iommu.h | 270 ++++
include/linux/io-pgtable.h | 2 -
tools/testing/selftests/iommu/iommufd.c | 60 +-
tools/testing/selftests/iommu/iommufd_utils.h | 12 +
41 files changed, 6119 insertions(+), 1589 deletions(-)
create mode 100644 Documentation/driver-api/generic_pt.rst
delete mode 100644 drivers/iommu/amd/io_pgtable.c
delete mode 100644 drivers/iommu/amd/io_pgtable_v2.c
create mode 100644 drivers/iommu/generic_pt/.kunitconfig
create mode 100644 drivers/iommu/generic_pt/Kconfig
create mode 100644 drivers/iommu/generic_pt/fmt/Makefile
create mode 100644 drivers/iommu/generic_pt/fmt/amdv1.h
create mode 100644 drivers/iommu/generic_pt/fmt/defs_amdv1.h
create mode 100644 drivers/iommu/generic_pt/fmt/defs_x86_64.h
create mode 100644 drivers/iommu/generic_pt/fmt/iommu_amdv1.c
create mode 100644 drivers/iommu/generic_pt/fmt/iommu_mock.c
create mode 100644 drivers/iommu/generic_pt/fmt/iommu_template.h
create mode 100644 drivers/iommu/generic_pt/fmt/iommu_x86_64.c
create mode 100644 drivers/iommu/generic_pt/fmt/x86_64.h
create mode 100644 drivers/iommu/generic_pt/iommu_pt.h
create mode 100644 drivers/iommu/generic_pt/kunit_generic_pt.h
create mode 100644 drivers/iommu/generic_pt/kunit_iommu.h
create mode 100644 drivers/iommu/generic_pt/kunit_iommu_pt.h
create mode 100644 drivers/iommu/generic_pt/pt_common.h
create mode 100644 drivers/iommu/generic_pt/pt_defs.h
create mode 100644 drivers/iommu/generic_pt/pt_fmt_defaults.h
create mode 100644 drivers/iommu/generic_pt/pt_iter.h
create mode 100644 drivers/iommu/generic_pt/pt_log2.h
create mode 100644 include/linux/generic_pt/common.h
create mode 100644 include/linux/generic_pt/iommu.h
base-commit: cd76b0248a38645a3e3f8ca4a48bffc591e9da19
--
2.43.0
This series adds namespace support to vhost-vsock. It does not add
namespaces to any of the guest transports (virtio-vsock, hyperv, or
vmci).
The current revision only supports two modes: local or global. Local
mode is complete isolation of namespaces, while global mode is complete
sharing between namespaces of CIDs (the original behavior).
If it is deemed necessary to add mixed mode up front, it is doable but
at the cost of more complexity than local and global modes. Mixed will
require adding the notion of allocation to the socket lookup functions
(like vhost_vsock_get()) and also more logic will be necessary for
controlling or using lookups differently based on mixed-to-global or
global-to-mixed scenarios.
The current implementation takes into consideration the future need for mixed
mode and makes sure it is possible by making vsock_ns_mode per-namespace, as for
mixed mode we need at least one "global" namespace and one "mixed"
namespace for it to work. Is it feasible to support local and global
modes only initially?
I've demoted this series to RFC, as I haven't been able to re-run the
tests after rebasing onto the upstreamed vmtest.sh, some of the code is
still pretty messy, there are still some TODOs, stale comments, and
other work to do. I thought reviewers might want to see the current
state even though unfinished, since I'll be OoO until the second week of
July and that just feels like a long time of silence given we've already
all done work on this together.
Thanks again for everyone's help and reviews!
Signed-off-by: Bobby Eshleman <bobbyeshleman(a)gmail.com>
---
Changes in v3:
- add notion of "modes"
- add procfs /proc/net/vsock_ns_mode
- local and global modes only
- no /dev/vhost-vsock-netns
- vmtest.sh already merged, so new patch just adds new tests for NS
- Link to v2:
https://lore.kernel.org/kvm/20250312-vsock-netns-v2-0-84bffa1aa97a@gmail.com
Changes in v2:
- only support vhost-vsock namespaces
- all g2h namespaces retain old behavior, only common API changes
impacted by vhost-vsock changes
- add /dev/vhost-vsock-netns for "opt-in"
- leave /dev/vhost-vsock to old behavior
- removed netns module param
- Link to v1:
https://lore.kernel.org/r/20200116172428.311437-1-sgarzare@redhat.com
Changes in v1:
- added 'netns' module param to vsock.ko to enable the
network namespace support (disabled by default)
- added 'vsock_net_eq()' to check the "net" assigned to a socket
only when 'netns' support is enabled
- Link to RFC: https://patchwork.ozlabs.org/cover/1202235/
---
Bobby Eshleman (11):
selftests/vsock: add NS tests to vmtest.sh
vsock: a per-net vsock NS mode state
vsock: add vsock net ns helpers
vsock: add net to vsock skb cb
vsock: add common code for vsock NS support
virtio-vsock: add netns to common code
vhost/vsock: add netns support
vsock/virtio: add netns hooks
hv_sock: add netns hooks
vsock/vmci: add netns hooks
vsock/loopback: add netns support
MAINTAINERS | 1 +
drivers/vhost/vsock.c | 48 ++-
include/linux/virtio_vsock.h | 12 +
include/net/af_vsock.h | 53 ++-
include/net/net_namespace.h | 4 +
include/net/netns/vsock.h | 19 ++
net/vmw_vsock/af_vsock.c | 203 +++++++++++-
net/vmw_vsock/hyperv_transport.c | 2 +-
net/vmw_vsock/virtio_transport.c | 5 +-
net/vmw_vsock/virtio_transport_common.c | 14 +-
net/vmw_vsock/vmci_transport.c | 4 +-
net/vmw_vsock/vsock_loopback.c | 4 +-
tools/testing/selftests/vsock/vmtest.sh | 555 +++++++++++++++++++++++++++++---
13 files changed, 843 insertions(+), 81 deletions(-)
---
base-commit: 8909f5f4ecd551c2299b28e05254b77424c8c7dc
change-id: 20250325-vsock-vmtest-b3a21d2102c2
Best regards,
--
Bobby Eshleman <bobbyeshleman(a)meta.com>
The following build warnings / errors noticed while building the selftest/ublk
with gcc-13 and clang-nightly toolchains on Linux next tree.
Please suggest if I am missing something in my build setup.
Regressions found on arm arm64 x86_64
- selftests ublk
Regression Analysis:
- New regression? Yes
- Reproducibility? Yes
Build regression: selftests ublk UBLK_IO_F_NEED_REG_BUF undeclared
Reported-by: Linux Kernel Functional Testing <lkft(a)linaro.org>
## Build log
CC kublk
In file included from kublk.c:6:
kublk.h: In function 'ublk_io_auto_zc_fallback':
kublk.h:240:35: error: 'UBLK_IO_F_NEED_REG_BUF' undeclared (first use
in this function); did you mean 'UBLKSRV_NEED_REG_BUF'?
240 | return !!(iod->op_flags & UBLK_IO_F_NEED_REG_BUF);
| ^~~~~~~~~~~~~~~~~~~~~~
| UBLKSRV_NEED_REG_BUF
kublk.h:240:35: note: each undeclared identifier is reported only once
for each function it appears in
kublk.c: In function 'ublk_ctrl_update_size':
kublk.c:223:27: error: 'UBLK_U_CMD_UPDATE_SIZE' undeclared (first use
in this function)
223 | .cmd_op = UBLK_U_CMD_UPDATE_SIZE,
| ^~~~~~~~~~~~~~~~~~~~~~
kublk.c: In function 'ublk_ctrl_quiesce_dev':
kublk.c:235:27: error: 'UBLK_U_CMD_QUIESCE_DEV' undeclared (first use
in this function); did you mean 'UBLK_U_CMD_DEL_DEV'?
235 | .cmd_op = UBLK_U_CMD_QUIESCE_DEV,
| ^~~~~~~~~~~~~~~~~~~~~~
| UBLK_U_CMD_DEL_DEV
kublk.c: In function 'ublk_queue_init':
kublk.c:447:63: error: 'UBLK_F_AUTO_BUF_REG' undeclared (first use in
this function); did you mean 'UBLKSRV_AUTO_BUF_REG'?
447 | if (dev->dev_info.flags & (UBLK_F_SUPPORT_ZERO_COPY |
UBLK_F_AUTO_BUF_REG)) {
|
^~~~~~~~~~~~~~~~~~~
|
UBLKSRV_AUTO_BUF_REG
kublk.c: In function 'ublk_thread_init':
kublk.c:507:63: error: 'UBLK_F_AUTO_BUF_REG' undeclared (first use in
this function); did you mean 'UBLKSRV_AUTO_BUF_REG'?
507 | if (dev->dev_info.flags & (UBLK_F_SUPPORT_ZERO_COPY |
UBLK_F_AUTO_BUF_REG)) {
|
^~~~~~~~~~~~~~~~~~~
|
UBLKSRV_AUTO_BUF_REG
kublk.c: In function 'ublk_set_auto_buf_reg':
kublk.c:579:16: error: variable 'buf' has initializer but incomplete type
579 | struct ublk_auto_buf_reg buf = {};
| ^~~~~~~~~~~~~~~~~
kublk.c:579:34: error: storage size of 'buf' isn't known
579 | struct ublk_auto_buf_reg buf = {};
| ^~~
kublk.c:587:29: error: 'UBLK_AUTO_BUF_REG_FALLBACK' undeclared (first
use in this function); did you mean 'UBLKSRV_AUTO_BUF_REG_FALLBACK'?
587 | buf.flags = UBLK_AUTO_BUF_REG_FALLBACK;
| ^~~~~~~~~~~~~~~~~~~~~~~~~~
| UBLKSRV_AUTO_BUF_REG_FALLBACK
kublk.c:589:21: error: implicit declaration of function
'ublk_auto_buf_reg_to_sqe_addr'
[-Werror=implicit-function-declaration]
589 | sqe->addr = ublk_auto_buf_reg_to_sqe_addr(&buf);
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~
kublk.c:579:34: error: unused variable 'buf' [-Werror=unused-variable]
579 | struct ublk_auto_buf_reg buf = {};
| ^~~
kublk.c: In function '__cmd_dev_add':
kublk.c:1178:25: error: 'UBLK_F_QUIESCE' undeclared (first use in this function)
1178 | if ((features & UBLK_F_QUIESCE) &&
| ^~~~~~~~~~~~~~
kublk.c: In function 'cmd_dev_get_features':
kublk.c:1381:30: error: 'UBLK_F_UPDATE_SIZE' undeclared (first use in
this function)
1381 | [const_ilog2(UBLK_F_UPDATE_SIZE)] = "UPDATE_SIZE",
| ^~~~~~~~~~~~~~~~~~
kublk.c:1369:46: note: in definition of macro 'const_ilog2'
1369 | #define const_ilog2(x) (63 - __builtin_clzll(x))
| ^
kublk.c:1369:24: error: array index in initializer not of integer type
1369 | #define const_ilog2(x) (63 - __builtin_clzll(x))
| ^
kublk.c:1381:18: note: in expansion of macro 'const_ilog2'
1381 | [const_ilog2(UBLK_F_UPDATE_SIZE)] = "UPDATE_SIZE",
| ^~~~~~~~~~~
kublk.c:1369:24: note: (near initialization for 'feat_map')
1369 | #define const_ilog2(x) (63 - __builtin_clzll(x))
| ^
kublk.c:1381:18: note: in expansion of macro 'const_ilog2'
1381 | [const_ilog2(UBLK_F_UPDATE_SIZE)] = "UPDATE_SIZE",
| ^~~~~~~~~~~
kublk.c:1382:30: error: 'UBLK_F_AUTO_BUF_REG' undeclared (first use in
this function); did you mean 'UBLKSRV_AUTO_BUF_REG'?
1382 | [const_ilog2(UBLK_F_AUTO_BUF_REG)] = "AUTO_BUF_REG",
| ^~~~~~~~~~~~~~~~~~~
kublk.c:1369:46: note: in definition of macro 'const_ilog2'
1369 | #define const_ilog2(x) (63 - __builtin_clzll(x))
| ^
kublk.c:1369:24: error: array index in initializer not of integer type
1369 | #define const_ilog2(x) (63 - __builtin_clzll(x))
| ^
kublk.c:1382:18: note: in expansion of macro 'const_ilog2'
1382 | [const_ilog2(UBLK_F_AUTO_BUF_REG)] = "AUTO_BUF_REG",
| ^~~~~~~~~~~
kublk.c:1369:24: note: (near initialization for 'feat_map')
1369 | #define const_ilog2(x) (63 - __builtin_clzll(x))
| ^
kublk.c:1382:18: note: in expansion of macro 'const_ilog2'
1382 | [const_ilog2(UBLK_F_AUTO_BUF_REG)] = "AUTO_BUF_REG",
| ^~~~~~~~~~~
kublk.c:1383:30: error: 'UBLK_F_QUIESCE' undeclared (first use in this function)
1383 | [const_ilog2(UBLK_F_QUIESCE)] = "QUIESCE",
| ^~~~~~~~~~~~~~
kublk.c:1369:46: note: in definition of macro 'const_ilog2'
1369 | #define const_ilog2(x) (63 - __builtin_clzll(x))
| ^
kublk.c:1369:24: error: array index in initializer not of integer type
1369 | #define const_ilog2(x) (63 - __builtin_clzll(x))
| ^
kublk.c:1383:18: note: in expansion of macro 'const_ilog2'
1383 | [const_ilog2(UBLK_F_QUIESCE)] = "QUIESCE",
| ^~~~~~~~~~~
kublk.c:1369:24: note: (near initialization for 'feat_map')
1369 | #define const_ilog2(x) (63 - __builtin_clzll(x))
| ^
kublk.c:1383:18: note: in expansion of macro 'const_ilog2'
1383 | [const_ilog2(UBLK_F_QUIESCE)] = "QUIESCE",
| ^~~~~~~~~~~
kublk.c:1384:30: error: 'UBLK_F_PER_IO_DAEMON' undeclared (first use
in this function)
1384 | [const_ilog2(UBLK_F_PER_IO_DAEMON)] = "PER_IO_DAEMON",
| ^~~~~~~~~~~~~~~~~~~~
kublk.c:1369:46: note: in definition of macro 'const_ilog2'
1369 | #define const_ilog2(x) (63 - __builtin_clzll(x))
| ^
kublk.c:1369:24: error: array index in initializer not of integer type
1369 | #define const_ilog2(x) (63 - __builtin_clzll(x))
| ^
kublk.c:1384:18: note: in expansion of macro 'const_ilog2'
1384 | [const_ilog2(UBLK_F_PER_IO_DAEMON)] = "PER_IO_DAEMON",
| ^~~~~~~~~~~
kublk.c:1369:24: note: (near initialization for 'feat_map')
1369 | #define const_ilog2(x) (63 - __builtin_clzll(x))
| ^
kublk.c:1384:18: note: in expansion of macro 'const_ilog2'
1384 | [const_ilog2(UBLK_F_PER_IO_DAEMON)] = "PER_IO_DAEMON",
| ^~~~~~~~~~~
kublk.c: In function 'main':
kublk.c:1613:46: error: 'UBLK_F_AUTO_BUF_REG' undeclared (first use in
this function); did you mean 'UBLKSRV_AUTO_BUF_REG'?
1613 | ctx.flags |= UBLK_F_AUTO_BUF_REG;
| ^~~~~~~~~~~~~~~~~~~
| UBLKSRV_AUTO_BUF_REG
cc1: all warnings being treated as errors
## Source
* Kernel version: 6.16.0-rc2
* Git tree: https://kernel.googlesource.com/pub/scm/linux/kernel/git/next/linux-next.git
* Git sha: 050f8ad7b58d9079455af171ac279c4b9b828c11
* Git describe: next-20250616
* Project details:
https://qa-reports.linaro.org/lkft/linux-next-master/build/next-20250616/
* Architectures: arm, arm64, x86_64
* Toolchains: gcc-13, clang nightly
* Kconfigs: selftest/*/config+defconfig+
## Build arm64
* Build log clang:
https://regressions.linaro.org/lkft/linux-next-master/next-20250617/log-par…
* Build log gcc:
https://regressions.linaro.org/lkft/linux-next-master/next-20250617/log-par…
* Build details:
https://regressions.linaro.org/lkft/linux-next-master/next-20250617/log-par…
* Build link: https://storage.tuxsuite.com/public/linaro/lkft/builds/2ycmBy2r4aPm6emlo7FX…
* Kernel config:
https://storage.tuxsuite.com/public/linaro/lkft/builds/2ycmBy2r4aPm6emlo7FX…
## Steps to reproduce
- tuxmake \
--runtime podman \
--target-arch arm64 \
--toolchain gcc-13 \
--kconfig defconfig \
--kconfig-add
https://gitlab.com/Linaro/lkft/kernel-fragments/-/raw/main/netdev.config
\
--kconfig-add
https://gitlab.com/Linaro/lkft/kernel-fragments/-/raw/main/systemd.config
\
--kconfig-add CONFIG_SYN_COOKIES=y \
--kconfig-add CONFIG_SCHEDSTATS=y debugkernel dtbs dtbs-legacy
headers kernel kselftest modules
--
Linaro LKFT
https://lkft.linaro.org
The following test regressions noticed while running selftests/mm gup_longterm
test cases on Dragonboard-845c, Dragonboard-410c, rock-pi-4, qemu-arm64 and
qemu-x86_64 this build have required selftest/mm/configs included and toolchain
is clang nightly.
Regressions found on Dragonboard-845c, Dragonboard-410c, rock-pi-4,
qemu-arm64 and qemu-x86_64
- selftests mm gup_longterm fails
Regression Analysis:
- New regression? Yes
- Reproducibility? Yes
Test regression: selftests mm gup_longterm error while loading shared
libraries liburing.so.2 cannot open shared object file No such file or
directory
Test regression: selftests mm cow error while loading shared libraries
liburing.so.2 cannot open shared object file No such file or directory
Test regression: selftests mm mlock-random-test exit=139
Test regression: selftests mm pagemap_ioctl exit=1
Test regression: selftests mm guard_regions file hole_punch
Reported-by: Linux Kernel Functional Testing <lkft(a)linaro.org>
## Test log
Linux version 6.15.0-next-20250606 (tuxmake@tuxmake) (Debian clang
version 21.0.0 (++20250602112323+c5a56f74fef7-1~exp1~20250602112342.1487),
Debian LLD 21.0.0) #1 SMP PREEMPT @1749190532
running ./gup_longterm
----------------------
./gup_longterm: error while loading shared libraries: liburing.so.2:
cannot open shared object file: No such file or directory
[FAIL]
not ok 14 gup_longterm # exit=127
./cow: error while loading shared libraries: liburing.so.2: cannot
open shared object file: No such file or directory
[FAIL]
not ok 50 cow # exit=127
running ./mlock-random-test
---------------------------
TAP version 13
1..2
[ 311.408456] traps: mlock-random-te[21661] general protection fault
ip:7f63210dbf0f sp:7ffdff6fca28 error:0 in
libc.so.6[adf0f,7f6321056000+165000]
[FAIL]
not ok 23 mlock-random-test # exit=139
running ./pagemap_ioctl
...
ok 53 Huge page testing: only two middle pages dirty
ok 54 # SKIP Hugetlb shmem testing: all new pages must not be written (dirty)
ok 55 # SKIP Hugetlb shmem testing: all pages must be written (dirty)
ok 56 # SKIP Hugetlb shmem testing: all pages dirty other than first
and the last one
ok 57 # SKIP Hugetlb shmem testing: PM_SCAN_WP_MATCHING | PM_SCAN_CHECK_WPASYNC
ok 58 # SKIP Hugetlb shmem testing: only middle page dirty
ok 59 # SKIP Hugetlb shmem testing: only two middle pages dirty
ok 60 # SKIP Hugetlb mem testing: all new pages must not be written (dirty)
ok 61 # SKIP Hugetlb mem testing: all pages must be written (dirty)
ok 62 # SKIP Hugetlb mem testing: all pages dirty other than first and
the last one
ok 63 # SKIP Hugetlb mem testing: PM_SCAN_WP_MATCHING |
PM_SCAN_CHECK_WPASYNC[ 241.731600] run_vmtests.sh (456): drop_caches:
3
ok 64 # SKIP Hugetlb mem testing: only middle page dirty
ok 65 # SKIP Hugetlb mem testing: only two middle pages dirty
Bail out! uffd-test creation failed 12 Cannot allocate memory
12 skipped test(s) detected. Consider enabling relevant config options
to improve coverage.
Planned tests != run tests (115 != 65)
Totals: pass:53 fail:0 xfail:0 xpass:0 skip:12 error:0
[FAIL]
# not ok 48 pagemap_ioctl # exit=1
running ./guard-regions
...
RUN guard_regions.file.hole_punch ...
guard-regions.c:1905:hole_punch:Expected madvise(&ptr[3 * page_size],
4 * page_size, MADV_REMOVE) (-1) == 0 (0)
hole_punch: Test terminated by assertion
FAIL guard_regions.file.hole_punch
not ok 80 guard_regions.file.hole_punch
## Source
* Kernel version: 6.16.0-rc2
* Git tree: https://kernel.googlesource.com/pub/scm/linux/kernel/git/next/linux-next.git
* Git sha: 050f8ad7b58d9079455af171ac279c4b9b828c11
* Git describe: next-20250616
* Project details:
https://qa-reports.linaro.org/lkft/linux-next-master/build/next-20250616/
* Architectures: arm64, x86_64
* Test environments: Dragonboard-845c, Dragonboard-410c, rock-pi-4,
qemu-arm64, qemu-x86_64 and x86
* Toolchains: clang nightly
* Kconfigs: selftest/mm/config+defconfig+
## Test
* Test log: https://qa-reports.linaro.org/api/testruns/28766026/log_file/
* Test log 2: https://qa-reports.linaro.org/api/testruns/28743077/log_file/
* Build details:
https://regressions.linaro.org/lkft/linux-next-master/next-20250616/kselfte…
* Build link: https://storage.tuxsuite.com/public/linaro/lkft/builds/2ya0viPHafKAe0u89drI…
* Kernel config:
https://storage.tuxsuite.com/public/linaro/lkft/builds/2ya0viPHafKAe0u89drI…
## Steps to reproduce
- tuxrun \
--runtime podman \
--device qemu-x86_64 \
--boot-args rw \
--kernel https://storage.tuxsuite.com/public/linaro/lkft/builds/2ya0wmVl0eHb9koWyQYC…
\
--rootfs https://storage.tuxboot.com/debian/20250605/trixie/amd64/rootfs.ext4.xz
\
--modules https://storage.tuxsuite.com/public/linaro/lkft/builds/2ya0wmVl0eHb9koWyQYC…
/usr/ \
--parameters MODULES_PATH=/usr/ \
--parameters
SQUAD_URL=https://qa-reports.linaro.org//api/submit/lkft/linux-next-master/…
\
--parameters SKIPFILE=skipfile-lkft.yaml \
--parameters
KSELFTEST=https://storage.tuxsuite.com/public/linaro/lkft/builds/2ya0wmVl0e…
\
--image docker.io/linaro/tuxrun-dispatcher:v1.2.2 \
--tests kselftest-mm \
--timeouts boot=15
--
Linaro LKFT
https://lkft.linaro.org
Add basic support to run various MIPS variants via kunit_tool using the
virtualized malta platform.
Signed-off-by: Thomas Weißschuh <thomas.weissschuh(a)linutronix.de>
---
Changes in v4:
- Rebase on v6.16-rc1
- Pick up reviews from David
- Clarify that GIC page is linked to vDSO
- Link to v3: https://lore.kernel.org/r/20250415-kunit-mips-v3-0-4ec2461b5a7e@linutronix.…
Changes in v3:
- Also skip VDSO_RANDOMIZE_SIZE adjustment for kthreads
- Link to v2: https://lore.kernel.org/r/20250414-kunit-mips-v2-0-4cf01e1a29e6@linutronix.…
Changes in v2:
- Fix usercopy kunit test by handling ABI-less tasks in stack_top()
- Drop change to mm initialization.
The broken test is not built by default anymore.
- Link to v1: https://lore.kernel.org/r/20250212-kunit-mips-v1-0-eb49c9d76615@linutronix.…
---
Thomas Weißschuh (2):
MIPS: Don't crash in stack_top() for tasks without ABI or vDSO
kunit: qemu_configs: Add MIPS configurations
arch/mips/kernel/process.c | 16 +++++++++-------
tools/testing/kunit/qemu_configs/mips.py | 18 ++++++++++++++++++
tools/testing/kunit/qemu_configs/mips64.py | 19 +++++++++++++++++++
tools/testing/kunit/qemu_configs/mips64el.py | 19 +++++++++++++++++++
tools/testing/kunit/qemu_configs/mipsel.py | 18 ++++++++++++++++++
5 files changed, 83 insertions(+), 7 deletions(-)
---
base-commit: 19272b37aa4f83ca52bdf9c16d5d81bdd1354494
change-id: 20241014-kunit-mips-e4fe1c265ed7
Best regards,
--
Thomas Weißschuh <thomas.weissschuh(a)linutronix.de>
Tests may wish to add other interfaces to listen on. Notably locally
generated traffic uses dummy interfaces. The multicast daemon needs to know
about these so that it allows forming rules that involve these interfaces,
and so that net.ipv4.conf.X.mc_forwarding is set for the interfaces.
To that end, allow passing in a list of interfaces to configure in addition
to all the physical ones.
Signed-off-by: Petr Machata <petrm(a)nvidia.com>
Reviewed-by: Nikolay Aleksandrov <razor(a)blackwall.org>
---
Notes:
v2:
- Adjust as per shellcheck citations
- Retain Nik's R-b, the changes were very minor.
---
CC: Shuah Khan <shuah(a)kernel.org>
CC: linux-kselftest(a)vger.kernel.org
tools/testing/selftests/net/forwarding/lib.sh | 13 +++++++++++++
1 file changed, 13 insertions(+)
diff --git a/tools/testing/selftests/net/forwarding/lib.sh b/tools/testing/selftests/net/forwarding/lib.sh
index 253847372062..83ee6a07e072 100644
--- a/tools/testing/selftests/net/forwarding/lib.sh
+++ b/tools/testing/selftests/net/forwarding/lib.sh
@@ -1760,9 +1760,12 @@ mc_send()
adf_mcd_start()
{
+ local ifs=("$@")
+
local table_name="$MCD_TABLE_NAME"
local smcroutedir
local pid
+ local if
local i
check_command "$MCD" || return 1
@@ -1776,6 +1779,16 @@ adf_mcd_start()
"$smcroutedir/$table_name.conf"
done
+ for if in "${ifs[@]}"; do
+ if ! ip_link_has_flag "$if" MULTICAST; then
+ ip link set dev "$if" multicast on
+ defer ip link set dev "$if" multicast off
+ fi
+
+ echo "phyint $if enable" >> \
+ "$smcroutedir/$table_name.conf"
+ done
+
"$MCD" -N -I "$table_name" -f "$smcroutedir/$table_name.conf" \
-P "$smcroutedir/$table_name.pid"
busywait "$BUSYWAIT_TIMEOUT" test -e "$smcroutedir/$table_name.pid"
--
2.49.0
Initially netpoll and netconsole were created together, and some
functions are in the wrong file. Seperate netconsole-only functions
in netconsole, avoiding exports.
1. Expose netpoll logging macros in the public header to enable consistent
log formatting across netpoll consumers.
2. Relocate netconsole-specific functions from netpoll to the netconsole
module where they are actually used, reducing unnecessary coupling.
3. Remove unnecessary function exports
4. Rename netpoll parsing functions in netconsole to better reflect their
specific usage.
5. Create a test to check that cmdline works fine. This was in my todo
list since [1], this was a good time to add it here to make sure this
patchset doesn't regress.
PS: The code was split in a way that it is easy to review. When copying
the functions from netpoll to netconsole, I do not change than other
than adding `static`. This will make checkpatch unhappy, but, further
patches will address the issues. It is done this way to make it easy for
reviewers.
Link: https://lore.kernel.org/netdev/Z36TlACdNMwFD7wv@dev-ushankar.dev.purestorag… [1]
Signed-off-by: Breno Leitao <leitao(a)debian.org>
---
Changes in v3:
- The cleanup on the netcons_cmdline.sh test was not cleaning the
netdevsim. Clean it at the end of the test. (Jakub)
- Link to v2: https://lore.kernel.org/r/20250611-rework-v2-0-ab1d92b458ca@debian.org
Changes in v2:
- No change in the code. Just rebased the patches onto netnext/main
- Link to v1: https://lore.kernel.org/r/20250610-rework-v1-0-7cfde283f246@debian.org
---
Breno Leitao (8):
netpoll: remove __netpoll_cleanup from exported API
netpoll: expose netpoll logging macros in public header
netpoll: relocate netconsole-specific functions to netconsole module
netpoll: move netpoll_print_options to netconsole
netconsole: rename functions to better reflect their purpose
netconsole: improve code style in parser function
selftests: net: Refactor cleanup logic in lib_netcons.sh
selftests: net: add netconsole test for cmdline configuration
drivers/net/netconsole.c | 137 ++++++++++++++++++++-
include/linux/netpoll.h | 10 +-
net/core/netpoll.c | 136 +-------------------
tools/testing/selftests/drivers/net/Makefile | 1 +
.../selftests/drivers/net/lib/sh/lib_netcons.sh | 59 ++++++---
.../selftests/drivers/net/netcons_cmdline.sh | 52 ++++++++
6 files changed, 240 insertions(+), 155 deletions(-)
---
base-commit: 6d4e01d29d87356924f1521ca6df7a364e948f13
change-id: 20250603-rework-c175cad8d22e
Best regards,
--
Breno Leitao <leitao(a)debian.org>
The following build warnings were noticed while building selftests/mm
with clang nightly toolchain for arm64 and x86_64 architectures.
Regressions found on arm64 and x86_64
- Build/clang-nightly-lkftconfig-kselftest
Regression Analysis:
- New regression? Yes
- Reproducibility? Yes
Build regression: selftests mm pkey_sighandler_tests.c warning
duplicate 'inline' declaration specifier [-Wduplicate-decl-specifier]
Build regression: selftests mm mremap_test.c warning pointer
comparison always evaluates to false [-Wtautological-compare]
Reported-by: Linux Kernel Functional Testing <lkft(a)linaro.org>
## Build log
make[4]: Entering directory '/builds/linux/tools/testing/selftests/mm'
/bin/sh ./check_config.sh clang --target=aarch64-linux-gnu
-fintegrated-as -Werror=unknown-warning-option
-Werror=ignored-optimization-argument -Werror=option-ignored
-Werror=unused-command-line-argument --target=aarch64-linux-gnu
-fintegrated-as
CC cow
CC compaction_test
CC gup_longterm
CC gup_test
CC hmm-tests
CC hugetlb-madvise
CC hugetlb-read-hwpoison
CC hugetlb-soft-offline
CC hugepage-mmap
CC hugepage-mremap
CC hugepage-shm
CC hugepage-vmemmap
CC khugepaged
CC madv_populate
CC map_fixed_noreplace
CC map_hugetlb
CC map_populate
CC memfd_secret
CC migration
CC mkdirty
CC mlock-random-test
CC mlock2-tests
CC mrelease_test
CC mremap_dontunmap
CC mremap_test
mremap_test.c:425:31: warning: pointer comparison always evaluates to
false [-Wtautological-compare]
425 | if (addr + c.dest_alignment < addr) {
| ^
1 warning generated.
CC mseal_test
CC on-fault-limit
CC pagemap_ioctl
CC pfnmap
CC thuge-gen
CC transhuge-stress
CC uffd-stress
CC uffd-unit-tests
CC uffd-wp-mremap
CC split_huge_page_test
CC ksm_tests
CC ksm_functional_tests
CC mdwe_test
CC hugetlb_fault_after_madv
CC hugetlb_madv_vs_map
CC hugetlb_dio
CC droppable
CC guard-regions
CC merge
CC protection_keys
CC pkey_sighandler_tests
pkey_sighandler_tests.c:44:15: warning: duplicate 'inline' declaration
specifier [-Wduplicate-decl-specifier]
44 | static inline __always_inline
| ^
/usr/lib/gcc-cross/aarch64-linux-gnu/12/../../../../aarch64-linux-gnu/include/sys/cdefs.h:424:26:
note: expanded from macro '__always_inline'
424 | # define __always_inline __inline __attribute__ ((__always_inline__))
| ^
1 warning generated.
CC va_high_addr_switch
CC virtual_address_range
CC write_to_hugetlbfs
Warning: missing Module.symvers, please have the kernel built first.
page_frag test will be skipped.
make[4]: Leaving directory '/builds/linux/tools/testing/selftests/mm'
## Source
* Kernel version: 6.16.0-rc2
* Git tree: https://kernel.googlesource.com/pub/scm/linux/kernel/git/next/linux-next.git
* Git sha: 050f8ad7b58d9079455af171ac279c4b9b828c11
* Git describe: next-20250616
* Project details:
https://qa-reports.linaro.org/lkft/linux-next-master/build/next-20250616/
* Architectures: arm64, x86_64
* Toolchains: clang nightly
* Kconfigs: selftest/mm/config+defconfig+
## Build arm64
* Build log: https://qa-reports.linaro.org/api/testruns/28765515/log_file/
* Build details:
https://regressions.linaro.org/lkft/linux-next-master/next-20250616/log-par…
* Build link: https://storage.tuxsuite.com/public/linaro/lkft/builds/2ya0viPHafKAe0u89drI…
* Kernel config:
https://storage.tuxsuite.com/public/linaro/lkft/builds/2ya0viPHafKAe0u89drI…
## Steps to reproduce on arm64
- tuxmake --runtime podman --target-arch arm64 --toolchain clang-20 \
--kconfig defconfig \
--kconfig-add
https://gitlab.com/Linaro/lkft/kernel-fragments/-/raw/main/netdev.config
\
--kconfig-add
https://gitlab.com/Linaro/lkft/kernel-fragments/-/raw/main/systemd.config
\
--kconfig-add CONFIG_SYN_COOKIES=y \
--kconfig-add CONFIG_SCHEDSTATS=y LLVM=1 LLVM_IAS=1 debugkernel
dtbs dtbs-legacy headers kernel kselftest modules
--
Linaro LKFT
https://lkft.linaro.org
This patch series introduces a new feature to netconsole which allows
appending a message ID to the userdata dictionary.
If the msgid feature is enabled, the message ID is built from a per-target 32
bit counter that is incremented and appended to every message sent to the target.
Example::
echo 1 > "/sys/kernel/config/netconsole/cmdline0/userdata/msgid_enabled"
echo "This is message #1" > /dev/kmsg
echo "This is message #2" > /dev/kmsg
13,434,54928466,-;This is message #1
msgid=1
13,435,54934019,-;This is message #2
msgid=2
This feature can be used by the target to detect if messages were dropped or
reordered before reaching the target. This allows system administrators to
assess the reliability of their netconsole pipeline and detect loss of messages
due to network contention or temporary unavailability.
Suggested-by: Breno Leitao <leitao(a)debian.org>
Signed-off-by: Gustavo Luiz Duarte <gustavold(a)gmail.com>
---
Changes in v2:
- Use wrapping_assign_add() to avoid warnings in UBSAN and friends.
- Improve documentation to clarify wrapping and distinguish msgid from sequnum.
- Rebase and fix conflict in prepare_extradata().
- Link to v1: https://lore.kernel.org/r/20250611-netconsole-msgid-v1-0-1784a51feb1e@gmail…
---
Gustavo Luiz Duarte (5):
netconsole: introduce 'msgid' as a new sysdata field
netconsole: implement configfs for msgid_enabled
netconsole: append msgid to sysdata
selftests: netconsole: Add tests for 'msgid' feature in sysdata
docs: netconsole: document msgid feature
Documentation/networking/netconsole.rst | 32 +++++++++++
drivers/net/netconsole.c | 65 ++++++++++++++++++++++
.../selftests/drivers/net/netcons_sysdata.sh | 30 ++++++++++
3 files changed, 127 insertions(+)
---
base-commit: 535de528015b56e34a40a8e1eb1629fadf809a84
change-id: 20250609-netconsole-msgid-b93c6f8e9c60
Best regards,
--
Gustavo Luiz Duarte <gustavold(a)gmail.com>
From: Chia-Yu Chang <chia-yu.chang(a)nokia-bell-labs.com>
Hello,
Please find the DualPI2 patch v18.
This patch serise adds DualPI Improved with a Square (DualPI2) with following features:
* Supports congestion controls that comply with the Prague requirements in RFC9331 (e.g. TCP-Prague)
* Coupled dual-queue that separates the L4S traffic in a low latency queue (L-queue), without harming remaining traffic that is scheduled in classic queue (C-queue) due to congestion-coupling using PI2 as defined in RFC9332
* Configurable overload strategies
* Use of sojourn time to reliably estimate queue delay
* Supports ECN L4S-identifier (IP.ECN==0b*1) to classify traffic into respective queues
For more details of DualPI2, please refer IETF RFC9332 (https://datatracker.ietf.org/doc/html/rfc9332).
Best regards,
Chia-Yu
---
v19 (14-Jun-2025)
- Fix one typo in the comment of #1 (ALOK TIWARI <alok.a.tiwari(a)oracle.com>)
- Update commit message of #4 (ALOK TIWARI <alok.a.tiwari(a)oracle.com>)
- Wrap long lines of Documentation/netlink/specs/tc.yaml to within 80 characters (Jakub Kicinski <kuba(a)kernel.org>)
v18 (13-Jun-2025)
- Add the num of enum used by DualPI2 and fix name and name-prefix of DualPI2 enum and attribute
- Replace from_timer() with timer_container_of() (Pedro Tammela <pctammela(a)mojatatu.com>)
v17 (25-May-2025, Resent at 11-Jun-2025)
- Replace 0xffffffff with U32_MAX (Paolo Abeni <pabeni(a)redhat.com>)
- Use helper function qdisc_dequeue_internal() and add new helper function skb_apply_step() (Paolo Abeni <pabeni(a)redhat.com>)
- Add s64 casting when calculating the delta of the PI controller (Paolo Abeni <pabeni(a)redhat.com>)
- Change the drop reason into SKB_DROP_REASON_QDISC_CONGESTED for drop_early (Paolo Abeni <pabeni(a)redhat.com>)
- Modify the condition to remove the original skb when enqueuing multiple GSO segments (Paolo Abeni <pabeni(a)redhat.com>)
- Add READ_ONCE() in dualpi2_dump_stat() (Paolo Abeni <pabeni(a)redhat.com>)
- Add comments, brackets, and brackets for readability (Paolo Abeni <pabeni(a)redhat.com>)
v16 (16-MAy-2025)
- Add qdisc_lock() to dualpi2_timer() in dualpi2_timer (Paolo Abeni <pabeni(a)redhat.com>)
- Introduce convert_ns_to_usec() to convert usec to nsec without overflow in #1 (Paolo Abeni <pabeni(a)redhat.com>)
- Update convert_us_tonsec() to convert nsec to usec without overflow in #2 (Paolo Abeni <pabeni(a)redhat.com>)
- Add more descriptions with respect to DualPI2 in the cover ltter and add changelog in each patch (Paolo Abeni <pabeni(a)redhat.com>)
v15 (09-May-2025)
- Add enum of TCA_DUALPI2_ECN_MASK_CLA_ECT to remove potential leakeage in #1 (Simon Horman <horms(a)kernel.org>)
- Fix one typo in comment of #2
- Update tc.yaml in #5 to aligh with the updated enum of pkt_sched.h
v14 (05-May-2025)
- Modify tc.yaml: (1) Replace flags with enum and remove enum-as-flags, (2) Remove credit-queue in xstats, and (3) Change attribute types (Donald Hunter <donald.hun
- Add enum and fix the ordering of variables in pkt_sched.h to align with the modified tc.yaml (Donald Hunter <donald.hunter(a)gmail.com>)
- Add validators for DROP_OVERLOAD, DROP_EARLY, ECN_MASK, and SPLIT_GSO in sch_dualpi2.c (Donald Hunter <donald.hunter(a)gmail.com>)
- Update dualpi2.json to align with the updated variable order in pkt_sched.h
- Reorder patches (Donald Hunter <donald.hunter(a)gmail.com>)
v13 (26-Apr-2025)
- Use dashes in member names to follow YNL conventions in tc.yaml (Donald Hunter <donald.hunter(a)gmail.com>)
- Define enumerations separately for flags of drop-early, drop-overload, ecn-mask, credit-queue in tc.yaml (Donald Hunter <donald.hunter(a)gmail.com>)
- Change the types of split-gso and step-packets into flag in tc.yaml (Donald Hunter <donald.hunter(a)gmail.com>)
- Revert to u32/u8 types for tc-dualpi2-xstats members in tc.yaml (Donald Hunter <donald.hunter(a)gmail.com>)
- Add new test cases in tc-tests/qdiscs/dualpi2.json to cover all dualpi2 parameters (Donald Hunter <donald.hunter(a)gmail.com>)
- Change the type of TCA_DUALPI2_STEP_PACKETS into NLA_FLAG (Donald Hunter <donald.hunter(a)gmail.com>)
v12 (22-Apr-2025)
- Remove anonymous struct in sch_dualpi2.c (Paolo Abeni <pabeni(a)redhat.com>)
- Replace u32/u8 with uint and s32 with int in tc spec document (Paolo Abeni <pabeni(a)redhat.com>)
- Introduce get_memory_limit function to handle potential overflow when multipling limit with MTU (Paolo Abeni <pabeni(a)redhat.com>)
- Double the packet length to further include packet overhead in memory_limit (Paolo Abeni <pabeni(a)redhat.com>)
- Remove the check of qdisc_qlen(sch) when calling qdisc_tree_reduce_backlog (Paolo Abeni <pabeni(a)redhat.com>)
v11 (15-Apr-2025)
- Replace hstimer_init with hstimer_setup in sch_dualpi2.c
v10 (25-Mar-2025)
- Remove leftover include in include/linux/netdevice.h and anonymous struct in sch_dualpi2.c (Paolo Abeni <pabeni(a)redhat.com>)
- Use kfree_skb_reason() and add SKB_DROP_REASON_DUALPI2_STEP_DROP drop reason (Paolo Abeni <pabeni(a)redhat.com>)
- Split sch_dualpi2.c into 3 patches (and overall 5 patches): Struct definition & parsing, Dump stats & configuration, Enqueue/Dequeue (Paolo Abeni <pabeni(a)redhat.com>)
v9 (16-Mar-2025)
- Fix mem_usage error in previous version
- Add min_qlen_step to the dualpi2 attribute as the minimum queue length in number of packets in the L-queue to start step threshold marking.
In previous versions, this value was fixed to 2, so the step threshold was applied to mark packets in the L queue only when the queue length of the L queue was greater than or equal to 2 packets.
This will cause larger queuing delays for L4S traffic at low rates (<20Mbps). So we parameterize it and change the default value to 0.
Comparison of tcp_1down run 'HTB 20Mbit + DUALPI2 + 10ms base delay'
Old versions:
avg median # data pts
Ping (ms) ICMP : 11.55 11.70 ms 350
TCP upload avg : 18.96 N/A Mbits/s 350
TCP upload sum : 18.96 N/A Mbits/s 350
New version (v9):
avg median # data pts
Ping (ms) ICMP : 10.81 10.70 ms 350
TCP upload avg : 18.91 N/A Mbits/s 350
TCP upload sum : 18.91 N/A Mbits/s 350
Comparison of tcp_1down run 'HTB 10Mbit + DUALPI2 + 10ms base delay'
Old versions:
avg median # data pts
Ping (ms) ICMP : 12.61 12.80 ms 350
TCP upload avg : 9.48 N/A Mbits/s 350
TCP upload sum : 9.48 N/A Mbits/s 350
New version (v9):
avg median # data pts
Ping (ms) ICMP : 11.06 10.80 ms 350
TCP upload avg : 9.43 N/A Mbits/s 350
TCP upload sum : 9.43 N/A Mbits/s 350
Comparison of tcp_1down run 'HTB 10Mbit + DUALPI2 + 10ms base delay'
Old versions:
avg median # data pts
Ping (ms) ICMP : 40.86 37.45 ms 350
TCP upload avg : 0.88 N/A Mbits/s 350
TCP upload sum : 0.88 N/A Mbits/s 350
TCP upload::1 : 0.88 0.97 Mbits/s 350
New version (v9):
avg median # data pts
Ping (ms) ICMP : 11.07 10.40 ms 350
TCP upload avg : 0.55 N/A Mbits/s 350
TCP upload sum : 0.55 N/A Mbits/s 350
TCP upload::1 : 0.55 0.59 Mbits/s 350
v8 (11-Mar-2025)
- Fix warning messages in v7
v7 (07-Mar-2025)
- Separate into 3 patches to avoid mixing changes of documentation, selftest, and code. (Cong Wang <xiyou.wangcong(a)gmail.com>)
v6 (04-Mar-2025)
- Add modprobe for dulapi2 in tc-testing script tc-testing/tdc.sh (Jakub Kicinski <kuba(a)kernel.org>)
- Update test cases in dualpi2.json
- Update commit message
v5 (22-Feb-2025)
- A comparison was done between MQ + DUALPI2, MQ + FQ_PIE, MQ + FQ_CODEL:
Unshaped 1gigE with 4 download streams test:
- Summary of tcp_4down run 'MQ + FQ_CODEL':
avg median # data pts
Ping (ms) ICMP : 1.19 1.34 ms 349
TCP download avg : 235.42 N/A Mbits/s 349
TCP download sum : 941.68 N/A Mbits/s 349
TCP download::1 : 235.19 235.39 Mbits/s 349
TCP download::2 : 235.03 235.35 Mbits/s 349
TCP download::3 : 236.89 235.44 Mbits/s 349
TCP download::4 : 234.57 235.19 Mbits/s 349
- Summary of tcp_4down run 'MQ + FQ_PIE'
avg median # data pts
Ping (ms) ICMP : 1.21 1.37 ms 350
TCP download avg : 235.42 N/A Mbits/s 350
TCP download sum : 941.61 N/A Mbits/s 350
TCP download::1 : 232.54 233.13 Mbits/s 350
TCP download::2 : 232.52 232.80 Mbits/s 350
TCP download::3 : 233.14 233.78 Mbits/s 350
TCP download::4 : 243.41 241.48 Mbits/s 350
- Summary of tcp_4down run 'MQ + DUALPI2'
avg median # data pts
Ping (ms) ICMP : 1.19 1.34 ms 349
TCP download avg : 235.42 N/A Mbits/s 349
TCP download sum : 941.68 N/A Mbits/s 349
TCP download::1 : 235.19 235.39 Mbits/s 349
TCP download::2 : 235.03 235.35 Mbits/s 349
TCP download::3 : 236.89 235.44 Mbits/s 349
TCP download::4 : 234.57 235.19 Mbits/s 349
Unshaped 1gigE with 128 download streams test:
- Summary of tcp_128down run 'MQ + FQ_CODEL':
avg median # data pts
Ping (ms) ICMP : 1.88 1.86 ms 350
TCP download avg : 7.39 N/A Mbits/s 350
TCP download sum : 946.47 N/A Mbits/s 350
- Summary of tcp_128down run 'MQ + FQ_PIE':
avg median # data pts
Ping (ms) ICMP : 1.88 1.86 ms 350
TCP download avg : 7.39 N/A Mbits/s 350
TCP download sum : 946.47 N/A Mbits/s 350
- Summary of tcp_128down run 'MQ + DUALPI2':
avg median # data pts
Ping (ms) ICMP : 1.88 1.86 ms 350
TCP download avg : 7.39 N/A Mbits/s 350
TCP download sum : 946.47 N/A Mbits/s 350
Unshaped 10gigE with 4 download streams test:
- Summary of tcp_4down run 'MQ + FQ_CODEL':
avg median # data pts
Ping (ms) ICMP : 0.22 0.23 ms 350
TCP download avg : 2354.08 N/A Mbits/s 350
TCP download sum : 9416.31 N/A Mbits/s 350
TCP download::1 : 2353.65 2352.81 Mbits/s 350
TCP download::2 : 2354.54 2354.21 Mbits/s 350
TCP download::3 : 2353.56 2353.78 Mbits/s 350
TCP download::4 : 2354.56 2354.45 Mbits/s 350
- Summary of tcp_4down run 'MQ + FQ_PIE':
avg median # data pts
Ping (ms) ICMP : 0.20 0.19 ms 350
TCP download avg : 2354.76 N/A Mbits/s 350
TCP download sum : 9419.04 N/A Mbits/s 350
TCP download::1 : 2354.77 2353.89 Mbits/s 350
TCP download::2 : 2353.41 2354.29 Mbits/s 350
TCP download::3 : 2356.18 2354.19 Mbits/s 350
TCP download::4 : 2354.68 2353.15 Mbits/s 350
- Summary of tcp_4down run 'MQ + DUALPI2':
avg median # data pts
Ping (ms) ICMP : 0.24 0.24 ms 350
TCP download avg : 2354.11 N/A Mbits/s 350
TCP download sum : 9416.43 N/A Mbits/s 350
TCP download::1 : 2354.75 2353.93 Mbits/s 350
TCP download::2 : 2353.15 2353.75 Mbits/s 350
TCP download::3 : 2353.49 2353.72 Mbits/s 350
TCP download::4 : 2355.04 2353.73 Mbits/s 350
Unshaped 10gigE with 128 download streams test:
- Summary of tcp_128down run 'MQ + FQ_CODEL':
avg median # data pts
Ping (ms) ICMP : 7.57 8.69 ms 350
TCP download avg : 73.97 N/A Mbits/s 350
TCP download sum : 9467.82 N/A Mbits/s 350
- Summary of tcp_128down run 'MQ + FQ_PIE':
avg median # data pts
Ping (ms) ICMP : 7.82 8.91 ms 350
TCP download avg : 73.97 N/A Mbits/s 350
TCP download sum : 9468.42 N/A Mbits/s 350
- Summary of tcp_128down run 'MQ + DUALPI2':
avg median # data pts
Ping (ms) ICMP : 6.87 7.93 ms 350
TCP download avg : 73.95 N/A Mbits/s 350
TCP download sum : 9465.87 N/A Mbits/s 350
From the results shown above, we see small differences between combinations.
- Update commit message to include results of no_split_gso and split_gso (Dave Taht <dave.taht(a)gmail.com> and Paolo Abeni <pabeni(a)redhat.com>)
- Add memlimit in the dualpi2 attribute, and add memory_used, max_memory_used, memory_limit in dualpi2 stats (Dave Taht <dave.taht(a)gmail.com>)
- Update note in sch_dualpi2.c related to BBRv3 status (Dave Taht <dave.taht(a)gmail.com>)
- Update license identifier (Dave Taht <dave.taht(a)gmail.com>)
- Add selftest in tools/testing/selftests/tc-testing (Cong Wang <xiyou.wangcong(a)gmail.com>)
- Use netlink policies for parameter checks (Jamal Hadi Salim <jhs(a)mojatatu.com>)
- Modify texts & fix typos in Documentation/netlink/specs/tc.yaml (Dave Taht <dave.taht(a)gmail.com>)
- Add descriptions of packet counter statistics and the reset function of sch_dualpi2.c
- Fix step_thresh in packets
- Update code comments in sch_dualpi2.c
v4 (22-Oct-2024)
- Update statement in Kconfig for DualPI2 (Stephen Hemminger <stephen(a)networkplumber.org>)
- Put a blank line after #define in sch_dualpi2.c (Stephen Hemminger <stephen(a)networkplumber.org>)
- Fix line length warning.
v3 (19-Oct-2024)
- Fix compilaiton error
- Update Documentation/netlink/specs/tc.yaml (Jakub Kicinski <kuba(a)kernel.org>)
v2 (18-Oct-2024)
- Add Documentation/netlink/specs/tc.yaml (Jakub Kicinski <kuba(a)kernel.org>)
- Use dualpi2 instead of skb prefix (Jamal Hadi Salim <jhs(a)mojatatu.com>)
- Replace nla_parse_nested_deprecated with nla_parse_nested (Jamal Hadi Salim <jhs(a)mojatatu.com>)
- Fix line length warning
---
Chia-Yu Chang (4):
sched: Struct definition and parsing of dualpi2 qdisc
sched: Dump configuration and statistics of dualpi2 qdisc
selftests/tc-testing: Add selftests for qdisc DualPI2
Documentation: netlink: specs: tc: Add DualPI2 specification
Koen De Schepper (1):
sched: Add enqueue/dequeue of dualpi2 qdisc
Documentation/netlink/specs/tc.yaml | 166 +++
include/net/dropreason-core.h | 6 +
include/uapi/linux/pkt_sched.h | 70 +-
net/sched/Kconfig | 12 +
net/sched/Makefile | 1 +
net/sched/sch_dualpi2.c | 1146 +++++++++++++++++
tools/testing/selftests/tc-testing/config | 1 +
.../tc-testing/tc-tests/qdiscs/dualpi2.json | 254 ++++
tools/testing/selftests/tc-testing/tdc.sh | 1 +
9 files changed, 1656 insertions(+), 1 deletion(-)
create mode 100644 net/sched/sch_dualpi2.c
create mode 100644 tools/testing/selftests/tc-testing/tc-tests/qdiscs/dualpi2.json
--
2.34.1
Check hugetlbfs support before starting tests in run_hugetlbfs_test.sh.
Otherwise on a system that does not support hugetlbfs the free huge
pages availability check will fail with:
./run_hugetlbfs_test.sh: line 47: [: -lt: unary operator expected
./run_hugetlbfs_test.sh: line 60: 12577 Aborted (core dumped) ./memfd_test hugetlbfs
Aborted (core dumped)
And it will left a fuse_mnt process behind, which may cause some
unexpected issues.
Patch tested with a kernel that does not have hugetlbfs support enabled
and the test was skipped as expected.
Po-Hsu Lin (1):
selftests/memfd: skip hugetlbfs test if not supported
tools/testing/selftests/memfd/run_hugetlbfs_test.sh | 5 +++++
1 file changed, 5 insertions(+)
--
2.34.1
This started with a patch that enabled `clippy::ptr_as_ptr`. Benno
Lossin suggested I also look into `clippy::ptr_cast_constness` and I
discovered `clippy::as_ptr_cast_mut`. This series now enables all 3
lints. It also enables `clippy::as_underscore` which ensures other
pointer casts weren't missed.
As a later addition, `clippy::cast_lossless` and `clippy::ref_as_ptr`
are also enabled.
This series depends on "rust: retain pointer mut-ness in
`container_of!`"[1].
Link: https://lore.kernel.org/all/20250409-container-of-mutness-v1-1-64f472b94534… [1]
Signed-off-by: Tamir Duberstein <tamird(a)gmail.com>
---
Changes in v11:
- Rebase on v6.16-rc1.
- Replace some `as <integer>` with `as bindings::T` and others with `as
ffi::T`. (Miguel Ojeda)
- Revert explicit `ffi::c_void` import which is in the prelude. (Miguel Ojeda)
- Link to v10: https://lore.kernel.org/r/20250418-ptr-as-ptr-v10-0-3d63d27907aa@gmail.com
Changes in v10:
- Move fragment from "rust: enable `clippy::ptr_cast_constness` lint" to
"rust: enable `clippy::ptr_as_ptr` lint". (Boqun Feng)
- Replace `(...).into()` with `T::from(...)` where the destination type
isn't obvious in "rust: enable `clippy::cast_lossless` lint". (Boqun
Feng)
- Link to v9: https://lore.kernel.org/r/20250416-ptr-as-ptr-v9-0-18ec29b1b1f3@gmail.com
Changes in v9:
- Replace ref-to-ptr coercion using `let` bindings with
`core::ptr::from_{ref,mut}`. (Boqun Feng).
- Link to v8: https://lore.kernel.org/r/20250409-ptr-as-ptr-v8-0-3738061534ef@gmail.com
Changes in v8:
- Use coercion to go ref -> ptr.
- rustfmt.
- Rebase on v6.15-rc1.
- Extract first commit to its own series as it is shared with other
series.
- Link to v7: https://lore.kernel.org/r/20250325-ptr-as-ptr-v7-0-87ab452147b9@gmail.com
Changes in v7:
- Add patch to enable `clippy::ref_as_ptr`.
- Link to v6: https://lore.kernel.org/r/20250324-ptr-as-ptr-v6-0-49d1b7fd4290@gmail.com
Changes in v6:
- Drop strict provenance patch.
- Fix URLs in doc comments.
- Add patch to enable `clippy::cast_lossless`.
- Rebase on rust-next.
- Link to v5: https://lore.kernel.org/r/20250317-ptr-as-ptr-v5-0-5b5f21fa230a@gmail.com
Changes in v5:
- Use `pointer::addr` in OF. (Boqun Feng)
- Add documentation on stubs. (Benno Lossin)
- Mark stubs `#[inline]`.
- Pick up Alice's RB on a shared commit from
https://lore.kernel.org/all/Z9f-3Aj3_FWBZRrm@google.com/.
- Link to v4: https://lore.kernel.org/r/20250315-ptr-as-ptr-v4-0-b2d72c14dc26@gmail.com
Changes in v4:
- Add missing SoB. (Benno Lossin)
- Use `without_provenance_mut` in alloc. (Boqun Feng)
- Limit strict provenance lints to the `kernel` crate to avoid complex
logic in the build system. This can be revisited on MSRV >= 1.84.0.
- Rebase on rust-next.
- Link to v3: https://lore.kernel.org/r/20250314-ptr-as-ptr-v3-0-e7ba61048f4a@gmail.com
Changes in v3:
- Fixed clippy warning in rust/kernel/firmware.rs. (kernel test robot)
Link: https://lore.kernel.org/all/202503120332.YTCpFEvv-lkp@intel.com/
- s/as u64/as bindings::phys_addr_t/g. (Benno Lossin)
- Use strict provenance APIs and enable lints. (Benno Lossin)
- Link to v2: https://lore.kernel.org/r/20250309-ptr-as-ptr-v2-0-25d60ad922b7@gmail.com
Changes in v2:
- Fixed typo in first commit message.
- Added additional patches, converted to series.
- Link to v1: https://lore.kernel.org/r/20250307-ptr-as-ptr-v1-1-582d06514c98@gmail.com
---
Tamir Duberstein (6):
rust: enable `clippy::ptr_as_ptr` lint
rust: enable `clippy::ptr_cast_constness` lint
rust: enable `clippy::as_ptr_cast_mut` lint
rust: enable `clippy::as_underscore` lint
rust: enable `clippy::cast_lossless` lint
rust: enable `clippy::ref_as_ptr` lint
Makefile | 6 ++++
drivers/gpu/drm/drm_panic_qr.rs | 4 +--
rust/bindings/lib.rs | 3 ++
rust/kernel/alloc/allocator_test.rs | 2 +-
rust/kernel/alloc/kvec.rs | 4 +--
rust/kernel/block/mq/operations.rs | 2 +-
rust/kernel/block/mq/request.rs | 11 +++++--
rust/kernel/device.rs | 4 +--
rust/kernel/device_id.rs | 4 +--
rust/kernel/devres.rs | 17 +++++------
rust/kernel/dma.rs | 6 ++--
rust/kernel/drm/device.rs | 6 ++--
rust/kernel/error.rs | 2 +-
rust/kernel/firmware.rs | 3 +-
rust/kernel/fs/file.rs | 2 +-
rust/kernel/io.rs | 18 ++++++------
rust/kernel/kunit.rs | 11 ++++---
rust/kernel/list/impl_list_item_mod.rs | 2 +-
rust/kernel/miscdevice.rs | 2 +-
rust/kernel/mm/virt.rs | 52 +++++++++++++++++-----------------
rust/kernel/net/phy.rs | 4 +--
rust/kernel/of.rs | 6 ++--
rust/kernel/pci.rs | 11 ++++---
rust/kernel/platform.rs | 4 ++-
rust/kernel/print.rs | 6 ++--
rust/kernel/seq_file.rs | 2 +-
rust/kernel/str.rs | 14 ++++-----
rust/kernel/sync/poll.rs | 2 +-
rust/kernel/time/hrtimer/pin.rs | 2 +-
rust/kernel/time/hrtimer/pin_mut.rs | 2 +-
rust/kernel/uaccess.rs | 4 +--
rust/kernel/workqueue.rs | 8 +++---
rust/uapi/lib.rs | 3 ++
33 files changed, 128 insertions(+), 101 deletions(-)
---
base-commit: 19272b37aa4f83ca52bdf9c16d5d81bdd1354494
change-id: 20250307-ptr-as-ptr-21b1867fc4d4
Best regards,
--
Tamir Duberstein <tamird(a)gmail.com>
Hello,
this series follows some discussions started in [1] around bpf
trampolines limitations on specific cases. When a trampoline is
generated for a target function involving many arguments, it has to
properly find and save the arguments that has been passed through stack.
While this is doable with basic types (eg: scalars), it brings more
uncertainty when dealing with specific types like structs (many ABIs
allow to pass structures by value if they fit in a register or a pair of
registers). The issue is that those structures layout and location on
the stack can be altered (ie with attributes, like packed or
aligned(x)), and this kind of alteration is not encoded in dwarf or BTF,
making the trampolines clueless about the needed adjustments. Rather
than trying to support this specific case, as agreed in [2], this series
aims to properly deny it.
It targets all the architectures currently implementing
arch_prepare_bpf_trampoline (except aarch64, since it has been handled
while adding the support for many args):
- x86
- s390
- riscv
- powerpc
A small validation function is added in the JIT compiler for each of
those architectures, ensuring that no argument passed on stack is a
struct. If so, the trampoline creation is cancelled. Any check on args
already implemented in a JIT comp has been moved in this new function.
On top of that, it updates the tracing_struct_many_args test, which
now merely checks that this case is indeed denied.
[1] https://lore.kernel.org/bpf/20250411-many_args_arm64-v1-0-0a32fe72339e@boot…
[2] https://lore.kernel.org/bpf/CAADnVQKr3ftNt1uQVrXBE0a2o37ZYRo2PHqCoHUnw6PE5T…
Signed-off-by: Alexis Lothoré (eBPF Foundation) <alexis.lothore(a)bootlin.com>
---
Alexis Lothoré (eBPF Foundation) (7):
bpf/x86: use define for max regs count used for arguments
bpf/x86: prevent trampoline attachment when args location on stack is uncertain
bpf/riscv: prevent trampoline attachment when args location on stack is uncertain
bpf/s390: prevent trampoline attachment when args location on stack is uncertain
bpf/powerpc64: use define for max regs count used for arguments
bpf/powerpc64: prevent trampoline attachment when args location on stack is uncertain
selftests/bpf: ensure that functions passing structs on stack can not be hooked
arch/powerpc/net/bpf_jit_comp.c | 38 ++++++++++--
arch/riscv/net/bpf_jit_comp64.c | 26 +++++++-
arch/s390/net/bpf_jit_comp.c | 33 ++++++++--
arch/x86/net/bpf_jit_comp.c | 50 ++++++++++++----
.../selftests/bpf/prog_tests/tracing_struct.c | 37 +-----------
.../selftests/bpf/progs/tracing_struct_many_args.c | 70 ----------------------
.../testing/selftests/bpf/test_kmods/bpf_testmod.c | 43 ++-----------
7 files changed, 129 insertions(+), 168 deletions(-)
---
base-commit: c4f4f8da70044d8b28fccf73016b4119f3e2fd50
change-id: 20250609-deny_trampoline_structs_on_stack-5bbc7bc20dd1
Best regards,
--
Alexis Lothoré, Bootlin
Embedded Linux and Kernel engineering
https://bootlin.com
skb_ensure_writable actually makes sure that the header of the skb is
writable, and doesn't touch the payload. It doesn't need an
skb_frags_readable check.
Removing this check restores DSCP functionality with unreadable skbs as
it's called from dscp_tg.
Fixes: 65249feb6b3d ("net: add support for skbs with unreadable frags")
Signed-off-by: Mina Almasry <almasrymina(a)google.com>
---
net/core/skbuff.c | 3 ---
1 file changed, 3 deletions(-)
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 85fc82f72d26..d6420b74ea9c 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -6261,9 +6261,6 @@ int skb_ensure_writable(struct sk_buff *skb, unsigned int write_len)
if (!pskb_may_pull(skb, write_len))
return -ENOMEM;
- if (!skb_frags_readable(skb))
- return -EFAULT;
-
if (!skb_cloned(skb) || skb_clone_writable(skb, write_len))
return 0;
base-commit: 6d4e01d29d87356924f1521ca6df7a364e948f13
--
2.50.0.rc1.591.g9c95f17f64-goog
From: Chia-Yu Chang <chia-yu.chang(a)nokia-bell-labs.com>
Hello,
Please find the DualPI2 patch v18.
This patch serise adds DualPI Improved with a Square (DualPI2) with following features:
* Supports congestion controls that comply with the Prague requirements in RFC9331 (e.g. TCP-Prague)
* Coupled dual-queue that separates the L4S traffic in a low latency queue (L-queue), without harming remaining traffic that is scheduled in classic queue (C-queue) due to congestion-coupling using PI2 as defined in RFC9332
* Configurable overload strategies
* Use of sojourn time to reliably estimate queue delay
* Supports ECN L4S-identifier (IP.ECN==0b*1) to classify traffic into respective queues
For more details of DualPI2, please refer IETF RFC9332 (https://datatracker.ietf.org/doc/html/rfc9332).
Best regards,
Chia-Yu
---
v18 (13-Jun-2025)
- Add the num of enum used by DualPI2 and fix name and name-prefix of DualPI2 enum and attribute
- Replace from_timer() with timer_container_of() (Pedro Tammela <pctammela(a)mojatatu.com>)
v17 (25-May-2025, Resent at 11-Jun-2025)
- Replace 0xffffffff with U32_MAX (Paolo Abeni <pabeni(a)redhat.com>)
- Use helper function qdisc_dequeue_internal() and add new helper function skb_apply_step() (Paolo Abeni <pabeni(a)redhat.com>)
- Add s64 casting when calculating the delta of the PI controller (Paolo Abeni <pabeni(a)redhat.com>)
- Change the drop reason into SKB_DROP_REASON_QDISC_CONGESTED for drop_early (Paolo Abeni <pabeni(a)redhat.com>)
- Modify the condition to remove the original skb when enqueuing multiple GSO segments (Paolo Abeni <pabeni(a)redhat.com>)
- Add READ_ONCE() in dualpi2_dump_stat() (Paolo Abeni <pabeni(a)redhat.com>)
- Add comments, brackets, and brackets for readability (Paolo Abeni <pabeni(a)redhat.com>)
v16 (16-MAy-2025)
- Add qdisc_lock() to dualpi2_timer() in dualpi2_timer (Paolo Abeni <pabeni(a)redhat.com>)
- Introduce convert_ns_to_usec() to convert usec to nsec without overflow in #1 (Paolo Abeni <pabeni(a)redhat.com>)
- Update convert_us_tonsec() to convert nsec to usec without overflow in #2 (Paolo Abeni <pabeni(a)redhat.com>)
- Add more descriptions with respect to DualPI2 in the cover ltter and add changelog in each patch (Paolo Abeni <pabeni(a)redhat.com>)
v15 (09-May-2025)
- Add enum of TCA_DUALPI2_ECN_MASK_CLA_ECT to remove potential leakeage in #1 (Simon Horman <horms(a)kernel.org>)
- Fix one typo in comment of #2
- Update tc.yaml in #5 to aligh with the updated enum of pkt_sched.h
v14 (05-May-2025)
- Modify tc.yaml: (1) Replace flags with enum and remove enum-as-flags, (2) Remove credit-queue in xstats, and (3) Change attribute types (Donald Hunter <donald.hun
- Add enum and fix the ordering of variables in pkt_sched.h to align with the modified tc.yaml (Donald Hunter <donald.hunter(a)gmail.com>)
- Add validators for DROP_OVERLOAD, DROP_EARLY, ECN_MASK, and SPLIT_GSO in sch_dualpi2.c (Donald Hunter <donald.hunter(a)gmail.com>)
- Update dualpi2.json to align with the updated variable order in pkt_sched.h
- Reorder patches (Donald Hunter <donald.hunter(a)gmail.com>)
v13 (26-Apr-2025)
- Use dashes in member names to follow YNL conventions in tc.yaml (Donald Hunter <donald.hunter(a)gmail.com>)
- Define enumerations separately for flags of drop-early, drop-overload, ecn-mask, credit-queue in tc.yaml (Donald Hunter <donald.hunter(a)gmail.com>)
- Change the types of split-gso and step-packets into flag in tc.yaml (Donald Hunter <donald.hunter(a)gmail.com>)
- Revert to u32/u8 types for tc-dualpi2-xstats members in tc.yaml (Donald Hunter <donald.hunter(a)gmail.com>)
- Add new test cases in tc-tests/qdiscs/dualpi2.json to cover all dualpi2 parameters (Donald Hunter <donald.hunter(a)gmail.com>)
- Change the type of TCA_DUALPI2_STEP_PACKETS into NLA_FLAG (Donald Hunter <donald.hunter(a)gmail.com>)
v12 (22-Apr-2025)
- Remove anonymous struct in sch_dualpi2.c (Paolo Abeni <pabeni(a)redhat.com>)
- Replace u32/u8 with uint and s32 with int in tc spec document (Paolo Abeni <pabeni(a)redhat.com>)
- Introduce get_memory_limit function to handle potential overflow when multipling limit with MTU (Paolo Abeni <pabeni(a)redhat.com>)
- Double the packet length to further include packet overhead in memory_limit (Paolo Abeni <pabeni(a)redhat.com>)
- Remove the check of qdisc_qlen(sch) when calling qdisc_tree_reduce_backlog (Paolo Abeni <pabeni(a)redhat.com>)
v11 (15-Apr-2025)
- Replace hstimer_init with hstimer_setup in sch_dualpi2.c
v10 (25-Mar-2025)
- Remove leftover include in include/linux/netdevice.h and anonymous struct in sch_dualpi2.c (Paolo Abeni <pabeni(a)redhat.com>)
- Use kfree_skb_reason() and add SKB_DROP_REASON_DUALPI2_STEP_DROP drop reason (Paolo Abeni <pabeni(a)redhat.com>)
- Split sch_dualpi2.c into 3 patches (and overall 5 patches): Struct definition & parsing, Dump stats & configuration, Enqueue/Dequeue (Paolo Abeni <pabeni(a)redhat.com>)
v9 (16-Mar-2025)
- Fix mem_usage error in previous version
- Add min_qlen_step to the dualpi2 attribute as the minimum queue length in number of packets in the L-queue to start step threshold marking.
In previous versions, this value was fixed to 2, so the step threshold was applied to mark packets in the L queue only when the queue length of the L queue was greater than or equal to 2 packets.
This will cause larger queuing delays for L4S traffic at low rates (<20Mbps). So we parameterize it and change the default value to 0.
Comparison of tcp_1down run 'HTB 20Mbit + DUALPI2 + 10ms base delay'
Old versions:
avg median # data pts
Ping (ms) ICMP : 11.55 11.70 ms 350
TCP upload avg : 18.96 N/A Mbits/s 350
TCP upload sum : 18.96 N/A Mbits/s 350
New version (v9):
avg median # data pts
Ping (ms) ICMP : 10.81 10.70 ms 350
TCP upload avg : 18.91 N/A Mbits/s 350
TCP upload sum : 18.91 N/A Mbits/s 350
Comparison of tcp_1down run 'HTB 10Mbit + DUALPI2 + 10ms base delay'
Old versions:
avg median # data pts
Ping (ms) ICMP : 12.61 12.80 ms 350
TCP upload avg : 9.48 N/A Mbits/s 350
TCP upload sum : 9.48 N/A Mbits/s 350
New version (v9):
avg median # data pts
Ping (ms) ICMP : 11.06 10.80 ms 350
TCP upload avg : 9.43 N/A Mbits/s 350
TCP upload sum : 9.43 N/A Mbits/s 350
Comparison of tcp_1down run 'HTB 10Mbit + DUALPI2 + 10ms base delay'
Old versions:
avg median # data pts
Ping (ms) ICMP : 40.86 37.45 ms 350
TCP upload avg : 0.88 N/A Mbits/s 350
TCP upload sum : 0.88 N/A Mbits/s 350
TCP upload::1 : 0.88 0.97 Mbits/s 350
New version (v9):
avg median # data pts
Ping (ms) ICMP : 11.07 10.40 ms 350
TCP upload avg : 0.55 N/A Mbits/s 350
TCP upload sum : 0.55 N/A Mbits/s 350
TCP upload::1 : 0.55 0.59 Mbits/s 350
v8 (11-Mar-2025)
- Fix warning messages in v7
v7 (07-Mar-2025)
- Separate into 3 patches to avoid mixing changes of documentation, selftest, and code. (Cong Wang <xiyou.wangcong(a)gmail.com>)
v6 (04-Mar-2025)
- Add modprobe for dulapi2 in tc-testing script tc-testing/tdc.sh (Jakub Kicinski <kuba(a)kernel.org>)
- Update test cases in dualpi2.json
- Update commit message
v5 (22-Feb-2025)
- A comparison was done between MQ + DUALPI2, MQ + FQ_PIE, MQ + FQ_CODEL:
Unshaped 1gigE with 4 download streams test:
- Summary of tcp_4down run 'MQ + FQ_CODEL':
avg median # data pts
Ping (ms) ICMP : 1.19 1.34 ms 349
TCP download avg : 235.42 N/A Mbits/s 349
TCP download sum : 941.68 N/A Mbits/s 349
TCP download::1 : 235.19 235.39 Mbits/s 349
TCP download::2 : 235.03 235.35 Mbits/s 349
TCP download::3 : 236.89 235.44 Mbits/s 349
TCP download::4 : 234.57 235.19 Mbits/s 349
- Summary of tcp_4down run 'MQ + FQ_PIE'
avg median # data pts
Ping (ms) ICMP : 1.21 1.37 ms 350
TCP download avg : 235.42 N/A Mbits/s 350
TCP download sum : 941.61 N/A Mbits/s 350
TCP download::1 : 232.54 233.13 Mbits/s 350
TCP download::2 : 232.52 232.80 Mbits/s 350
TCP download::3 : 233.14 233.78 Mbits/s 350
TCP download::4 : 243.41 241.48 Mbits/s 350
- Summary of tcp_4down run 'MQ + DUALPI2'
avg median # data pts
Ping (ms) ICMP : 1.19 1.34 ms 349
TCP download avg : 235.42 N/A Mbits/s 349
TCP download sum : 941.68 N/A Mbits/s 349
TCP download::1 : 235.19 235.39 Mbits/s 349
TCP download::2 : 235.03 235.35 Mbits/s 349
TCP download::3 : 236.89 235.44 Mbits/s 349
TCP download::4 : 234.57 235.19 Mbits/s 349
Unshaped 1gigE with 128 download streams test:
- Summary of tcp_128down run 'MQ + FQ_CODEL':
avg median # data pts
Ping (ms) ICMP : 1.88 1.86 ms 350
TCP download avg : 7.39 N/A Mbits/s 350
TCP download sum : 946.47 N/A Mbits/s 350
- Summary of tcp_128down run 'MQ + FQ_PIE':
avg median # data pts
Ping (ms) ICMP : 1.88 1.86 ms 350
TCP download avg : 7.39 N/A Mbits/s 350
TCP download sum : 946.47 N/A Mbits/s 350
- Summary of tcp_128down run 'MQ + DUALPI2':
avg median # data pts
Ping (ms) ICMP : 1.88 1.86 ms 350
TCP download avg : 7.39 N/A Mbits/s 350
TCP download sum : 946.47 N/A Mbits/s 350
Unshaped 10gigE with 4 download streams test:
- Summary of tcp_4down run 'MQ + FQ_CODEL':
avg median # data pts
Ping (ms) ICMP : 0.22 0.23 ms 350
TCP download avg : 2354.08 N/A Mbits/s 350
TCP download sum : 9416.31 N/A Mbits/s 350
TCP download::1 : 2353.65 2352.81 Mbits/s 350
TCP download::2 : 2354.54 2354.21 Mbits/s 350
TCP download::3 : 2353.56 2353.78 Mbits/s 350
TCP download::4 : 2354.56 2354.45 Mbits/s 350
- Summary of tcp_4down run 'MQ + FQ_PIE':
avg median # data pts
Ping (ms) ICMP : 0.20 0.19 ms 350
TCP download avg : 2354.76 N/A Mbits/s 350
TCP download sum : 9419.04 N/A Mbits/s 350
TCP download::1 : 2354.77 2353.89 Mbits/s 350
TCP download::2 : 2353.41 2354.29 Mbits/s 350
TCP download::3 : 2356.18 2354.19 Mbits/s 350
TCP download::4 : 2354.68 2353.15 Mbits/s 350
- Summary of tcp_4down run 'MQ + DUALPI2':
avg median # data pts
Ping (ms) ICMP : 0.24 0.24 ms 350
TCP download avg : 2354.11 N/A Mbits/s 350
TCP download sum : 9416.43 N/A Mbits/s 350
TCP download::1 : 2354.75 2353.93 Mbits/s 350
TCP download::2 : 2353.15 2353.75 Mbits/s 350
TCP download::3 : 2353.49 2353.72 Mbits/s 350
TCP download::4 : 2355.04 2353.73 Mbits/s 350
Unshaped 10gigE with 128 download streams test:
- Summary of tcp_128down run 'MQ + FQ_CODEL':
avg median # data pts
Ping (ms) ICMP : 7.57 8.69 ms 350
TCP download avg : 73.97 N/A Mbits/s 350
TCP download sum : 9467.82 N/A Mbits/s 350
- Summary of tcp_128down run 'MQ + FQ_PIE':
avg median # data pts
Ping (ms) ICMP : 7.82 8.91 ms 350
TCP download avg : 73.97 N/A Mbits/s 350
TCP download sum : 9468.42 N/A Mbits/s 350
- Summary of tcp_128down run 'MQ + DUALPI2':
avg median # data pts
Ping (ms) ICMP : 6.87 7.93 ms 350
TCP download avg : 73.95 N/A Mbits/s 350
TCP download sum : 9465.87 N/A Mbits/s 350
From the results shown above, we see small differences between combinations.
- Update commit message to include results of no_split_gso and split_gso (Dave Taht <dave.taht(a)gmail.com> and Paolo Abeni <pabeni(a)redhat.com>)
- Add memlimit in the dualpi2 attribute, and add memory_used, max_memory_used, memory_limit in dualpi2 stats (Dave Taht <dave.taht(a)gmail.com>)
- Update note in sch_dualpi2.c related to BBRv3 status (Dave Taht <dave.taht(a)gmail.com>)
- Update license identifier (Dave Taht <dave.taht(a)gmail.com>)
- Add selftest in tools/testing/selftests/tc-testing (Cong Wang <xiyou.wangcong(a)gmail.com>)
- Use netlink policies for parameter checks (Jamal Hadi Salim <jhs(a)mojatatu.com>)
- Modify texts & fix typos in Documentation/netlink/specs/tc.yaml (Dave Taht <dave.taht(a)gmail.com>)
- Add descriptions of packet counter statistics and the reset function of sch_dualpi2.c
- Fix step_thresh in packets
- Update code comments in sch_dualpi2.c
v4 (22-Oct-2024)
- Update statement in Kconfig for DualPI2 (Stephen Hemminger <stephen(a)networkplumber.org>)
- Put a blank line after #define in sch_dualpi2.c (Stephen Hemminger <stephen(a)networkplumber.org>)
- Fix line length warning.
v3 (19-Oct-2024)
- Fix compilaiton error
- Update Documentation/netlink/specs/tc.yaml (Jakub Kicinski <kuba(a)kernel.org>)
v2 (18-Oct-2024)
- Add Documentation/netlink/specs/tc.yaml (Jakub Kicinski <kuba(a)kernel.org>)
- Use dualpi2 instead of skb prefix (Jamal Hadi Salim <jhs(a)mojatatu.com>)
- Replace nla_parse_nested_deprecated with nla_parse_nested (Jamal Hadi Salim <jhs(a)mojatatu.com>)
- Fix line length warning
---
Chia-Yu Chang (4):
sched: Struct definition and parsing of dualpi2 qdisc
sched: Dump configuration and statistics of dualpi2 qdisc
selftests/tc-testing: Add selftests for qdisc DualPI2
Documentation: netlink: specs: tc: Add DualPI2 specification
Koen De Schepper (1):
sched: Add enqueue/dequeue of dualpi2 qdisc
Documentation/netlink/specs/tc.yaml | 161 +++
include/net/dropreason-core.h | 6 +
include/uapi/linux/pkt_sched.h | 70 +-
net/sched/Kconfig | 12 +
net/sched/Makefile | 1 +
net/sched/sch_dualpi2.c | 1146 +++++++++++++++++
tools/testing/selftests/tc-testing/config | 1 +
.../tc-testing/tc-tests/qdiscs/dualpi2.json | 254 ++++
tools/testing/selftests/tc-testing/tdc.sh | 1 +
9 files changed, 1651 insertions(+), 1 deletion(-)
create mode 100644 net/sched/sch_dualpi2.c
create mode 100644 tools/testing/selftests/tc-testing/tc-tests/qdiscs/dualpi2.json
--
2.34.1
From: Chia-Yu Chang <chia-yu.chang(a)nokia-bell-labs.com>
Hello,
Please find the DualPI2 patch v18.
This patch serise adds DualPI Improved with a Square (DualPI2) with following features:
* Supports congestion controls that comply with the Prague requirements in RFC9331 (e.g. TCP-Prague)
* Coupled dual-queue that separates the L4S traffic in a low latency queue (L-queue), without harming remaining traffic that is scheduled in classic queue (C-queue) due to congestion-coupling using PI2 as defined in RFC9332
* Configurable overload strategies
* Use of sojourn time to reliably estimate queue delay
* Supports ECN L4S-identifier (IP.ECN==0b*1) to classify traffic into respective queues
For more details of DualPI2, please refer IETF RFC9332 (https://datatracker.ietf.org/doc/html/rfc9332).
Best regards,
Chia-Yu
---
v18 (13-Jun-2025)
- Add the num of enum used by DualPI2 and fix name and name-prefix of DualPI2 enum and attribute
- Replace from_timer() with timer_container_of() (Pedro Tammela <pctammela(a)mojatatu.com>)
v17 (25-May-2025, Resent at 11-Jun-2025)
- Replace 0xffffffff with U32_MAX (Paolo Abeni <pabeni(a)redhat.com>)
- Use helper function qdisc_dequeue_internal() and add new helper function skb_apply_step() (Paolo Abeni <pabeni(a)redhat.com>)
- Add s64 casting when calculating the delta of the PI controller (Paolo Abeni <pabeni(a)redhat.com>)
- Change the drop reason into SKB_DROP_REASON_QDISC_CONGESTED for drop_early (Paolo Abeni <pabeni(a)redhat.com>)
- Modify the condition to remove the original skb when enqueuing multiple GSO segments (Paolo Abeni <pabeni(a)redhat.com>)
- Add READ_ONCE() in dualpi2_dump_stat() (Paolo Abeni <pabeni(a)redhat.com>)
- Add comments, brackets, and brackets for readability (Paolo Abeni <pabeni(a)redhat.com>)
v16 (16-MAy-2025)
- Add qdisc_lock() to dualpi2_timer() in dualpi2_timer (Paolo Abeni <pabeni(a)redhat.com>)
- Introduce convert_ns_to_usec() to convert usec to nsec without overflow in #1 (Paolo Abeni <pabeni(a)redhat.com>)
- Update convert_us_tonsec() to convert nsec to usec without overflow in #2 (Paolo Abeni <pabeni(a)redhat.com>)
- Add more descriptions with respect to DualPI2 in the cover ltter and add changelog in each patch (Paolo Abeni <pabeni(a)redhat.com>)
v15 (09-May-2025)
- Add enum of TCA_DUALPI2_ECN_MASK_CLA_ECT to remove potential leakeage in #1 (Simon Horman <horms(a)kernel.org>)
- Fix one typo in comment of #2
- Update tc.yaml in #5 to aligh with the updated enum of pkt_sched.h
v14 (05-May-2025)
- Modify tc.yaml: (1) Replace flags with enum and remove enum-as-flags, (2) Remove credit-queue in xstats, and (3) Change attribute types (Donald Hunter <donald.hun
- Add enum and fix the ordering of variables in pkt_sched.h to align with the modified tc.yaml (Donald Hunter <donald.hunter(a)gmail.com>)
- Add validators for DROP_OVERLOAD, DROP_EARLY, ECN_MASK, and SPLIT_GSO in sch_dualpi2.c (Donald Hunter <donald.hunter(a)gmail.com>)
- Update dualpi2.json to align with the updated variable order in pkt_sched.h
- Reorder patches (Donald Hunter <donald.hunter(a)gmail.com>)
v13 (26-Apr-2025)
- Use dashes in member names to follow YNL conventions in tc.yaml (Donald Hunter <donald.hunter(a)gmail.com>)
- Define enumerations separately for flags of drop-early, drop-overload, ecn-mask, credit-queue in tc.yaml (Donald Hunter <donald.hunter(a)gmail.com>)
- Change the types of split-gso and step-packets into flag in tc.yaml (Donald Hunter <donald.hunter(a)gmail.com>)
- Revert to u32/u8 types for tc-dualpi2-xstats members in tc.yaml (Donald Hunter <donald.hunter(a)gmail.com>)
- Add new test cases in tc-tests/qdiscs/dualpi2.json to cover all dualpi2 parameters (Donald Hunter <donald.hunter(a)gmail.com>)
- Change the type of TCA_DUALPI2_STEP_PACKETS into NLA_FLAG (Donald Hunter <donald.hunter(a)gmail.com>)
v12 (22-Apr-2025)
- Remove anonymous struct in sch_dualpi2.c (Paolo Abeni <pabeni(a)redhat.com>)
- Replace u32/u8 with uint and s32 with int in tc spec document (Paolo Abeni <pabeni(a)redhat.com>)
- Introduce get_memory_limit function to handle potential overflow when multipling limit with MTU (Paolo Abeni <pabeni(a)redhat.com>)
- Double the packet length to further include packet overhead in memory_limit (Paolo Abeni <pabeni(a)redhat.com>)
- Remove the check of qdisc_qlen(sch) when calling qdisc_tree_reduce_backlog (Paolo Abeni <pabeni(a)redhat.com>)
v11 (15-Apr-2025)
- Replace hstimer_init with hstimer_setup in sch_dualpi2.c
v10 (25-Mar-2025)
- Remove leftover include in include/linux/netdevice.h and anonymous struct in sch_dualpi2.c (Paolo Abeni <pabeni(a)redhat.com>)
- Use kfree_skb_reason() and add SKB_DROP_REASON_DUALPI2_STEP_DROP drop reason (Paolo Abeni <pabeni(a)redhat.com>)
- Split sch_dualpi2.c into 3 patches (and overall 5 patches): Struct definition & parsing, Dump stats & configuration, Enqueue/Dequeue (Paolo Abeni <pabeni(a)redhat.com>)
v9 (16-Mar-2025)
- Fix mem_usage error in previous version
- Add min_qlen_step to the dualpi2 attribute as the minimum queue length in number of packets in the L-queue to start step threshold marking.
In previous versions, this value was fixed to 2, so the step threshold was applied to mark packets in the L queue only when the queue length of the L queue was greater than or equal to 2 packets.
This will cause larger queuing delays for L4S traffic at low rates (<20Mbps). So we parameterize it and change the default value to 0.
Comparison of tcp_1down run 'HTB 20Mbit + DUALPI2 + 10ms base delay'
Old versions:
avg median # data pts
Ping (ms) ICMP : 11.55 11.70 ms 350
TCP upload avg : 18.96 N/A Mbits/s 350
TCP upload sum : 18.96 N/A Mbits/s 350
New version (v9):
avg median # data pts
Ping (ms) ICMP : 10.81 10.70 ms 350
TCP upload avg : 18.91 N/A Mbits/s 350
TCP upload sum : 18.91 N/A Mbits/s 350
Comparison of tcp_1down run 'HTB 10Mbit + DUALPI2 + 10ms base delay'
Old versions:
avg median # data pts
Ping (ms) ICMP : 12.61 12.80 ms 350
TCP upload avg : 9.48 N/A Mbits/s 350
TCP upload sum : 9.48 N/A Mbits/s 350
New version (v9):
avg median # data pts
Ping (ms) ICMP : 11.06 10.80 ms 350
TCP upload avg : 9.43 N/A Mbits/s 350
TCP upload sum : 9.43 N/A Mbits/s 350
Comparison of tcp_1down run 'HTB 10Mbit + DUALPI2 + 10ms base delay'
Old versions:
avg median # data pts
Ping (ms) ICMP : 40.86 37.45 ms 350
TCP upload avg : 0.88 N/A Mbits/s 350
TCP upload sum : 0.88 N/A Mbits/s 350
TCP upload::1 : 0.88 0.97 Mbits/s 350
New version (v9):
avg median # data pts
Ping (ms) ICMP : 11.07 10.40 ms 350
TCP upload avg : 0.55 N/A Mbits/s 350
TCP upload sum : 0.55 N/A Mbits/s 350
TCP upload::1 : 0.55 0.59 Mbits/s 350
v8 (11-Mar-2025)
- Fix warning messages in v7
v7 (07-Mar-2025)
- Separate into 3 patches to avoid mixing changes of documentation, selftest, and code. (Cong Wang <xiyou.wangcong(a)gmail.com>)
v6 (04-Mar-2025)
- Add modprobe for dulapi2 in tc-testing script tc-testing/tdc.sh (Jakub Kicinski <kuba(a)kernel.org>)
- Update test cases in dualpi2.json
- Update commit message
v5 (22-Feb-2025)
- A comparison was done between MQ + DUALPI2, MQ + FQ_PIE, MQ + FQ_CODEL:
Unshaped 1gigE with 4 download streams test:
- Summary of tcp_4down run 'MQ + FQ_CODEL':
avg median # data pts
Ping (ms) ICMP : 1.19 1.34 ms 349
TCP download avg : 235.42 N/A Mbits/s 349
TCP download sum : 941.68 N/A Mbits/s 349
TCP download::1 : 235.19 235.39 Mbits/s 349
TCP download::2 : 235.03 235.35 Mbits/s 349
TCP download::3 : 236.89 235.44 Mbits/s 349
TCP download::4 : 234.57 235.19 Mbits/s 349
- Summary of tcp_4down run 'MQ + FQ_PIE'
avg median # data pts
Ping (ms) ICMP : 1.21 1.37 ms 350
TCP download avg : 235.42 N/A Mbits/s 350
TCP download sum : 941.61 N/A Mbits/s 350
TCP download::1 : 232.54 233.13 Mbits/s 350
TCP download::2 : 232.52 232.80 Mbits/s 350
TCP download::3 : 233.14 233.78 Mbits/s 350
TCP download::4 : 243.41 241.48 Mbits/s 350
- Summary of tcp_4down run 'MQ + DUALPI2'
avg median # data pts
Ping (ms) ICMP : 1.19 1.34 ms 349
TCP download avg : 235.42 N/A Mbits/s 349
TCP download sum : 941.68 N/A Mbits/s 349
TCP download::1 : 235.19 235.39 Mbits/s 349
TCP download::2 : 235.03 235.35 Mbits/s 349
TCP download::3 : 236.89 235.44 Mbits/s 349
TCP download::4 : 234.57 235.19 Mbits/s 349
Unshaped 1gigE with 128 download streams test:
- Summary of tcp_128down run 'MQ + FQ_CODEL':
avg median # data pts
Ping (ms) ICMP : 1.88 1.86 ms 350
TCP download avg : 7.39 N/A Mbits/s 350
TCP download sum : 946.47 N/A Mbits/s 350
- Summary of tcp_128down run 'MQ + FQ_PIE':
avg median # data pts
Ping (ms) ICMP : 1.88 1.86 ms 350
TCP download avg : 7.39 N/A Mbits/s 350
TCP download sum : 946.47 N/A Mbits/s 350
- Summary of tcp_128down run 'MQ + DUALPI2':
avg median # data pts
Ping (ms) ICMP : 1.88 1.86 ms 350
TCP download avg : 7.39 N/A Mbits/s 350
TCP download sum : 946.47 N/A Mbits/s 350
Unshaped 10gigE with 4 download streams test:
- Summary of tcp_4down run 'MQ + FQ_CODEL':
avg median # data pts
Ping (ms) ICMP : 0.22 0.23 ms 350
TCP download avg : 2354.08 N/A Mbits/s 350
TCP download sum : 9416.31 N/A Mbits/s 350
TCP download::1 : 2353.65 2352.81 Mbits/s 350
TCP download::2 : 2354.54 2354.21 Mbits/s 350
TCP download::3 : 2353.56 2353.78 Mbits/s 350
TCP download::4 : 2354.56 2354.45 Mbits/s 350
- Summary of tcp_4down run 'MQ + FQ_PIE':
avg median # data pts
Ping (ms) ICMP : 0.20 0.19 ms 350
TCP download avg : 2354.76 N/A Mbits/s 350
TCP download sum : 9419.04 N/A Mbits/s 350
TCP download::1 : 2354.77 2353.89 Mbits/s 350
TCP download::2 : 2353.41 2354.29 Mbits/s 350
TCP download::3 : 2356.18 2354.19 Mbits/s 350
TCP download::4 : 2354.68 2353.15 Mbits/s 350
- Summary of tcp_4down run 'MQ + DUALPI2':
avg median # data pts
Ping (ms) ICMP : 0.24 0.24 ms 350
TCP download avg : 2354.11 N/A Mbits/s 350
TCP download sum : 9416.43 N/A Mbits/s 350
TCP download::1 : 2354.75 2353.93 Mbits/s 350
TCP download::2 : 2353.15 2353.75 Mbits/s 350
TCP download::3 : 2353.49 2353.72 Mbits/s 350
TCP download::4 : 2355.04 2353.73 Mbits/s 350
Unshaped 10gigE with 128 download streams test:
- Summary of tcp_128down run 'MQ + FQ_CODEL':
avg median # data pts
Ping (ms) ICMP : 7.57 8.69 ms 350
TCP download avg : 73.97 N/A Mbits/s 350
TCP download sum : 9467.82 N/A Mbits/s 350
- Summary of tcp_128down run 'MQ + FQ_PIE':
avg median # data pts
Ping (ms) ICMP : 7.82 8.91 ms 350
TCP download avg : 73.97 N/A Mbits/s 350
TCP download sum : 9468.42 N/A Mbits/s 350
- Summary of tcp_128down run 'MQ + DUALPI2':
avg median # data pts
Ping (ms) ICMP : 6.87 7.93 ms 350
TCP download avg : 73.95 N/A Mbits/s 350
TCP download sum : 9465.87 N/A Mbits/s 350
From the results shown above, we see small differences between combinations.
- Update commit message to include results of no_split_gso and split_gso (Dave Taht <dave.taht(a)gmail.com> and Paolo Abeni <pabeni(a)redhat.com>)
- Add memlimit in the dualpi2 attribute, and add memory_used, max_memory_used, memory_limit in dualpi2 stats (Dave Taht <dave.taht(a)gmail.com>)
- Update note in sch_dualpi2.c related to BBRv3 status (Dave Taht <dave.taht(a)gmail.com>)
- Update license identifier (Dave Taht <dave.taht(a)gmail.com>)
- Add selftest in tools/testing/selftests/tc-testing (Cong Wang <xiyou.wangcong(a)gmail.com>)
- Use netlink policies for parameter checks (Jamal Hadi Salim <jhs(a)mojatatu.com>)
- Modify texts & fix typos in Documentation/netlink/specs/tc.yaml (Dave Taht <dave.taht(a)gmail.com>)
- Add descriptions of packet counter statistics and the reset function of sch_dualpi2.c
- Fix step_thresh in packets
- Update code comments in sch_dualpi2.c
v4 (22-Oct-2024)
- Update statement in Kconfig for DualPI2 (Stephen Hemminger <stephen(a)networkplumber.org>)
- Put a blank line after #define in sch_dualpi2.c (Stephen Hemminger <stephen(a)networkplumber.org>)
- Fix line length warning.
v3 (19-Oct-2024)
- Fix compilaiton error
- Update Documentation/netlink/specs/tc.yaml (Jakub Kicinski <kuba(a)kernel.org>)
v2 (18-Oct-2024)
- Add Documentation/netlink/specs/tc.yaml (Jakub Kicinski <kuba(a)kernel.org>)
- Use dualpi2 instead of skb prefix (Jamal Hadi Salim <jhs(a)mojatatu.com>)
- Replace nla_parse_nested_deprecated with nla_parse_nested (Jamal Hadi Salim <jhs(a)mojatatu.com>)
- Fix line length warning
---
Chia-Yu Chang (4):
sched: Struct definition and parsing of dualpi2 qdisc
sched: Dump configuration and statistics of dualpi2 qdisc
selftests/tc-testing: Add selftests for qdisc DualPI2
Documentation: netlink: specs: tc: Add DualPI2 specification
Koen De Schepper (1):
sched: Add enqueue/dequeue of dualpi2 qdisc
Documentation/netlink/specs/tc.yaml | 161 +++
include/net/dropreason-core.h | 6 +
include/uapi/linux/pkt_sched.h | 70 +-
net/sched/Kconfig | 12 +
net/sched/Makefile | 1 +
net/sched/sch_dualpi2.c | 1146 +++++++++++++++++
tools/testing/selftests/tc-testing/config | 1 +
.../tc-testing/tc-tests/qdiscs/dualpi2.json | 254 ++++
tools/testing/selftests/tc-testing/tdc.sh | 1 +
9 files changed, 1651 insertions(+), 1 deletion(-)
create mode 100644 net/sched/sch_dualpi2.c
create mode 100644 tools/testing/selftests/tc-testing/tc-tests/qdiscs/dualpi2.json
--
2.34.1
This commit adds a new kernel selftest to verify RTNLGRP_IPV4_MCADDR
and RTNLGRP_IPV6_MCADDR notifications. The test works by adding and
removing a dummy interface and then confirming that the system
correctly receives join and removal notifications for the 224.0.0.1
and ff02::1 multicast addresses.
The test relies on the iproute2 version to be 6.13+.
Tested by the following command:
$ vng -v --user root --cpus 16 -- \
make -C tools/testing/selftests TARGETS=net
TEST_PROGS=rtnetlink_notification.sh \
TEST_GEN_PROGS="" run_tests
Cc: Maciej Żenczykowski <maze(a)google.com>
Cc: Lorenzo Colitti <lorenzo(a)google.com>
Signed-off-by: Yuyang Huang <yuyanghuang(a)google.com>
---
Changelog since v2:
- Move the test case to a separate file.
Changelog since v1:
- Skip the test if the iproute2 is too old.
tools/testing/selftests/net/Makefile | 1 +
.../selftests/net/rtnetlink_notification.sh | 159 ++++++++++++++++++
2 files changed, 160 insertions(+)
create mode 100755 tools/testing/selftests/net/rtnetlink_notification.sh
diff --git a/tools/testing/selftests/net/Makefile b/tools/testing/selftests/net/Makefile
index 70a38f485d4d..ad258b25bc9d 100644
--- a/tools/testing/selftests/net/Makefile
+++ b/tools/testing/selftests/net/Makefile
@@ -40,6 +40,7 @@ TEST_PROGS += netns-name.sh
TEST_PROGS += link_netns.py
TEST_PROGS += nl_netdev.py
TEST_PROGS += rtnetlink.py
+TEST_PROGS += rtnetlink_notification.sh
TEST_PROGS += srv6_end_dt46_l3vpn_test.sh
TEST_PROGS += srv6_end_dt4_l3vpn_test.sh
TEST_PROGS += srv6_end_dt6_l3vpn_test.sh
diff --git a/tools/testing/selftests/net/rtnetlink_notification.sh b/tools/testing/selftests/net/rtnetlink_notification.sh
new file mode 100755
index 000000000000..a2c1afed5023
--- /dev/null
+++ b/tools/testing/selftests/net/rtnetlink_notification.sh
@@ -0,0 +1,159 @@
+#!/bin/bash
+#
+# This test is for checking rtnetlink notification callpaths, and get as much
+# coverage as possible.
+#
+# set -e
+
+ALL_TESTS="
+ kci_test_mcast_addr_notification
+"
+
+VERBOSE=0
+PAUSE=no
+PAUSE_ON_FAIL=no
+
+source lib.sh
+
+# set global exit status, but never reset nonzero one.
+check_err()
+{
+ if [ $ret -eq 0 ]; then
+ ret=$1
+ fi
+ [ -n "$2" ] && echo "$2"
+}
+
+run_cmd_common()
+{
+ local cmd="$*"
+ local out
+ if [ "$VERBOSE" = "1" ]; then
+ echo "COMMAND: ${cmd}"
+ fi
+ out=$($cmd 2>&1)
+ rc=$?
+ if [ "$VERBOSE" = "1" -a -n "$out" ]; then
+ echo " $out"
+ fi
+ return $rc
+}
+
+run_cmd() {
+ run_cmd_common "$@"
+ rc=$?
+ check_err $rc
+ return $rc
+}
+
+end_test()
+{
+ echo "$*"
+ [ "${VERBOSE}" = "1" ] && echo
+
+ if [[ $ret -ne 0 ]] && [[ "${PAUSE_ON_FAIL}" = "yes" ]]; then
+ echo "Hit enter to continue"
+ read a
+ fi;
+
+ if [ "${PAUSE}" = "yes" ]; then
+ echo "Hit enter to continue"
+ read a
+ fi
+
+}
+
+kci_test_mcast_addr_notification()
+{
+ local tmpfile
+ local monitor_pid
+ local match_result
+
+ tmpfile=$(mktemp)
+
+ ip monitor maddr > $tmpfile &
+ monitor_pid=$!
+ sleep 1
+ if [ ! -e "/proc/$monitor_pid" ]; then
+ end_test "SKIP: mcast addr notification: iproute2 too old"
+ rm $tmpfile
+ return $ksft_skip
+ fi
+
+ run_cmd ip link add name test-dummy1 type dummy
+ run_cmd ip link set test-dummy1 up
+ run_cmd ip link del dev test-dummy1
+ sleep 1
+
+ match_result=$(grep -cE "test-dummy1.*(224.0.0.1|ff02::1)" $tmpfile)
+
+ kill $monitor_pid
+ rm $tmpfile
+ # There should be 4 line matches as follows.
+ # 13: test-dummy1 inet6 mcast ff02::1 scope global
+ # 13: test-dummy1 inet mcast 224.0.0.1 scope global
+ # Deleted 13: test-dummy1 inet mcast 224.0.0.1 scope global
+ # Deleted 13: test-dummy1 inet6 mcast ff02::1 scope global
+ if [ $match_result -ne 4 ];then
+ end_test "FAIL: mcast addr notification"
+ return 1
+ fi
+ end_test "PASS: mcast addr notification"
+}
+
+kci_test_rtnl()
+{
+ local current_test
+ local ret=0
+
+ for current_test in ${TESTS:-$ALL_TESTS}; do
+ $current_test
+ check_err $?
+ done
+
+ return $ret
+}
+
+usage()
+{
+ cat <<EOF
+usage: ${0##*/} OPTS
+
+ -t <test> Test(s) to run (default: all)
+ (options: $(echo $ALL_TESTS))
+ -v Verbose mode (show commands and output)
+ -P Pause after every test
+ -p Pause after every failing test before cleanup (for debugging)
+EOF
+}
+
+#check for needed privileges
+if [ "$(id -u)" -ne 0 ];then
+ end_test "SKIP: Need root privileges"
+ exit $ksft_skip
+fi
+
+for x in ip;do
+ $x -Version 2>/dev/null >/dev/null
+ if [ $? -ne 0 ];then
+ end_test "SKIP: Could not run test without the $x tool"
+ exit $ksft_skip
+ fi
+done
+
+while getopts t:hvpP o; do
+ case $o in
+ t) TESTS=$OPTARG;;
+ v) VERBOSE=1;;
+ p) PAUSE_ON_FAIL=yes;;
+ P) PAUSE=yes;;
+ h) usage; exit 0;;
+ *) usage; exit 1;;
+ esac
+done
+
+[ $PAUSE = "yes" ] && PAUSE_ON_FAIL="no"
+
+kci_test_rtnl
+
+exit $?
--
2.50.0.rc1.591.g9c95f17f64-goog
Reading /proc/pid/maps requires read-locking mmap_lock which prevents any
other task from concurrently modifying the address space. This guarantees
coherent reporting of virtual address ranges, however it can block
important updates from happening. Oftentimes /proc/pid/maps readers are
low priority monitoring tasks and them blocking high priority tasks
results in priority inversion.
Locking the entire address space is required to present fully coherent
picture of the address space, however even current implementation does not
strictly guarantee that by outputting vmas in page-size chunks and
dropping mmap_lock in between each chunk. Address space modifications are
possible while mmap_lock is dropped and userspace reading the content is
expected to deal with possible concurrent address space modifications.
Considering these relaxed rules, holding mmap_lock is not strictly needed
as long as we can guarantee that a concurrently modified vma is reported
either in its original form or after it was modified.
This patchset switches from holding mmap_lock while reading /proc/pid/maps
to taking per-vma locks as we walk the vma tree. This reduces the
contention with tasks modifying the address space because they would have
to contend for the same vma as opposed to the entire address space. Same
is done for PROCMAP_QUERY ioctl which locks only the vma that fell into
the requested range instead of the entire address space. Previous version
of this patchset [1] tried to perform /proc/pid/maps reading under RCU,
however its implementation is quite complex and the results are worse than
the new version because it still relied on mmap_lock speculation which
retries if any part of the address space gets modified. New implementaion
is both simpler and results in less contention. Note that similar approach
would not work for /proc/pid/smaps reading as it also walks the page table
and that's not RCU-safe.
Paul McKenney's designed a test [2] to measure mmap/munmap latencies while
concurrently reading /proc/pid/maps. The test has a pair of processes
scanning /proc/PID/maps, and another process unmapping and remapping 4K
pages from a 128MB range of anonymous memory. At the end of each 10
second run, the latency of each mmap() or munmap() operation is measured,
and for each run the maximum and mean latency is printed. The map/unmap
process is started first, its PID is passed to the scanners, and then the
map/unmap process waits until both scanners are running before starting
its timed test. The scanners keep scanning until the specified
/proc/PID/maps file disappears. This test registered close to 10x
improvement in update latencies:
Before the change:
./run-proc-vs-map.sh --nsamples 100 --rawdata -- --busyduration 2
0.011 0.008 0.455
0.011 0.008 0.472
0.011 0.008 0.535
0.011 0.009 0.545
...
0.011 0.014 2.875
0.011 0.014 2.913
0.011 0.014 3.007
0.011 0.015 3.018
After the change:
./run-proc-vs-map.sh --nsamples 100 --rawdata -- --busyduration 2
0.006 0.005 0.036
0.006 0.005 0.039
0.006 0.005 0.039
0.006 0.005 0.039
...
0.006 0.006 0.403
0.006 0.006 0.474
0.006 0.006 0.479
0.006 0.006 0.498
The patchset also adds a number of tests to check for /proc/pid/maps data
coherency. They are designed to detect any unexpected data tearing while
performing some common address space modifications (vma split, resize and
remap). Even before these changes, reading /proc/pid/maps might have
inconsistent data because the file is read page-by-page with mmap_lock
being dropped between the pages. An example of user-visible inconsistency
can be that the same vma is printed twice: once before it was modified and
then after the modifications. For example if vma was extended, it might be
found and reported twice. What is not expected is to see a gap where there
should have been a vma both before and after modification. This patchset
increases the chances of such tearing, therefore it's event more important
now to test for unexpected inconsistencies.
[1] https://lore.kernel.org/all/20250418174959.1431962-1-surenb@google.com/
[2] https://github.com/paulmckrcu/proc-mmap_sem-test
Suren Baghdasaryan (7):
selftests/proc: add /proc/pid/maps tearing from vma split test
selftests/proc: extend /proc/pid/maps tearing test to include vma
resizing
selftests/proc: extend /proc/pid/maps tearing test to include vma
remapping
selftests/proc: test PROCMAP_QUERY ioctl while vma is concurrently
modified
selftests/proc: add verbose more for tests to facilitate debugging
mm/maps: read proc/pid/maps under per-vma lock
mm/maps: execute PROCMAP_QUERY ioctl under per-vma locks
fs/proc/internal.h | 6 +
fs/proc/task_mmu.c | 233 +++++-
tools/testing/selftests/proc/proc-pid-vm.c | 793 ++++++++++++++++++++-
3 files changed, 1011 insertions(+), 21 deletions(-)
base-commit: 2d0c297637e7d59771c1533847c666cdddc19884
--
2.49.0.1266.g31b7d2e469-goog
The netdevsim driver previously lacked RX statistics support, which
prevented its use with the GenerateTraffic() test framework, as this
framework verifies traffic flow by checking RX byte counts.
This patch migrates netdevsim from its custom statistics collection to
the NETDEV_PCPU_STAT_DSTATS framework, as suggested by Jakub. This
change not only standardizes the statistics handling but also adds the
necessary RX statistics support required by the test framework.
Signed-off-by: Breno Leitao <leitao(a)debian.org>
---
Changes in v2:
- Changed the RX collection place from nsim_napi_rx() to nsim_rcv (Joe
Damato)
- Collect RX dropped packets statistic in nsim_queue_free() (Jakub)
- Added a helper in dstat to add values to RX dropped packets
- Link to v1: https://lore.kernel.org/r/20250611-netdevsim_stat-v1-0-c11b657d96bf@debian.…
---
Breno Leitao (4):
netdevsim: migrate to dstats stats collection
netdevsim: collect statistics at RX side
net: add dev_dstats_rx_dropped_add() helper
netdevsim: account dropped packet length in stats on queue free
drivers/net/netdevsim/netdev.c | 48 ++++++++++++++++-----------------------
drivers/net/netdevsim/netdevsim.h | 5 ----
include/linux/netdevice.h | 10 ++++++++
3 files changed, 29 insertions(+), 34 deletions(-)
---
base-commit: 6d4e01d29d87356924f1521ca6df7a364e948f13
change-id: 20250610-netdevsim_stat-95995921e03e
Best regards,
--
Breno Leitao <leitao(a)debian.org>
From: Chia-Yu Chang <chia-yu.chang(a)nokia-bell-labs.com>
Hello,
Please find DUALPI2 iproute2 patch v9.
For more details of DualPI2, please refer IETF RFC9332
(https://datatracker.ietf.org/doc/html/rfc9332).
Best Regards,
Chia-Yu
---
v9 (13-Jun-25)
- Fix space issue and typos (ALOK TIWARI <alok.a.tiwari(a)oracle.com>)
- Change 'rtt_typical' to 'typical_rtt' in tc/q_dualpi2.c (ALOK TIWARI <alok.a.tiwari(a)oracle.com>)
- Add the num of enum used by DualPI2 in pkt_sched.h
v8 (09-May-25)
- Update pkt_sched.h with the one in nex-next
- Correct a typo in the comment within pkt_sched.h (ALOK TIWARI <alok.a.tiwari(a)oracle.com>)
- Update manual content in man/man8/tc-dualpi2.8 (ALOK TIWARI <alok.a.tiwari(a)oracle.com>)
- Update tc/q_dualpi2.c to fix missing blank lines and add missing case (ALOK TIWARI <alok.a.tiwari(a)oracle.com>)
v7 (05-May-25)
- Align pkt_sched.h with the v14 version of net-next due to spec modification in tc.yaml
- Reorganize dualpi2_print_opt() to match the order in tc.yaml
- Remove credit-queue in PRINT_JSON
v6 (26-Apr-25)
- Update JSON file output due to spec modification in tc.yaml of net-next
v5 (25-Mar-25)
- Use matches() to replace current strcmp() (Stephen Hemminger <stephen(a)networkplumber.org>)
- Use general parse_percent() for handling scaled percentage values (Stephen Hemminger <stephen(a)networkplumber.org>)
- Add print function for JSON of dualpi2 stats (Stephen Hemminger <stephen(a)networkplumber.org>)
v4 (16-Mar-25)
- Add min_qlen_step to the dualpi2 attribute as the minimum queue length in number of packets in the L-queue to start step marking.
v3 (21-Feb-25)
- Add memlimit to the dualpi2 attribute, and add memory_used, max_memory_used, and memory_limit in dualpi2 stats (Dave Taht <dave.taht(a)gmail.com>)
- Update the manual to align with the latest implementation and clarify the queue naming and default unit
- Use common "get_scaled_alpha_beta" and clean print_opt for Dualpi2
v2 (23-Oct-24)
- Rename get_float in dualpi2 to get_float_min_max in utils.c
- Move get_float from iplink_can.c in utils.c (Stephen Hemminger <stephen(a)networkplumber.org>)
- Add print function for JSON of dualpi2 (Stephen Hemminger <stephen(a)networkplumber.org>)
---
Chia-Yu Chang (1):
tc: add dualpi2 scheduler module
bash-completion/tc | 11 +-
include/uapi/linux/pkt_sched.h | 70 ++++-
include/utils.h | 2 +
ip/iplink_can.c | 14 -
lib/utils.c | 30 ++
man/man8/tc-dualpi2.8 | 249 +++++++++++++++
tc/Makefile | 1 +
tc/q_dualpi2.c | 534 +++++++++++++++++++++++++++++++++
8 files changed, 895 insertions(+), 16 deletions(-)
create mode 100644 man/man8/tc-dualpi2.8
create mode 100644 tc/q_dualpi2.c
--
2.34.1
When CONFIG_SBI_CONSOLE is enabled and there is no uart defined in the
device tree kvm-unit-tests fails to start.
Only check if uart exists in device tree if SBI_CONSOLE is false.
Signed-off-by: Jesse Taube <jesse(a)rivosinc.com>
---
lib/riscv/io.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/lib/riscv/io.c b/lib/riscv/io.c
index fb40adb7..96a3c048 100644
--- a/lib/riscv/io.c
+++ b/lib/riscv/io.c
@@ -104,6 +104,7 @@ static void uart0_init_acpi(void)
void io_init(void)
{
+#ifndef CONFIG_SBI_CONSOLE
if (dt_available())
uart0_init_fdt();
else
@@ -114,6 +115,7 @@ void io_init(void)
"Found uart at %p, but early base is %p.\n",
uart0_base, UART_EARLY_BASE);
}
+#endif
}
#ifdef CONFIG_SBI_CONSOLE
--
2.43.0
If CONFIG_UPROBES is not set, a merge subtest fails:
Failure log:
7151 12:46:54.627936 # # # RUN merge.handle_uprobe_upon_merged_vma ...
7152 12:46:54.639014 # # f /sys/bus/event_source/devices/uprobe/type
7153 12:46:54.639306 # # fopen: No such file or directory
7154 12:46:54.650451 # # # merge.c:473:handle_uprobe_upon_merged_vma:Expected read_sysfs("/sys/bus/event_source/devices/uprobe/type", &type) (1) == 0 (0)
7155 12:46:54.650730 # # # handle_uprobe_upon_merged_vma: Test terminated by assertion
7156 12:46:54.661750 # # # FAIL merge.handle_uprobe_upon_merged_vma
7157 12:46:54.662030 # # not ok 8 merge.handle_uprobe_upon_merged_vma
CONFIG_UPROBES is enabled by CONFIG_UPROBE_EVENTS, which gets enabled by
CONFIG_FTRACE. Therefore add these configs to selftests/mm/config so that
CI systems can include this config in the kernel build. To be completely
safe, add CONFIG_PROFILING too, to enable the dependency chain
PROFILING -> PERF_EVENTS -> UPROBE_EVENTS -> UPROBES.
Fixes: efe99fabeb11b ("selftests/mm: add test about uprobe pte be orphan during vma merge")
Reported-by: Aishwarya <aishwarya.tcv(a)arm.com>
Closes: https://lore.kernel.org/all/20250610103729.72440-1-aishwarya.tcv@arm.com/
Tested-by: Aishwarya TCV <aishwarya.tcv(a)arm.com>
Tested-by : Donet Tom <donettom(a)linux.ibm.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes(a)oracle.com>
Signed-off-by: Dev Jain <dev.jain(a)arm.com>
---
v1->v2:
- Add CONFIG_UPROBES (Mark Brown)
- Add CONFIG_PROFILING (Lorenzo)
tools/testing/selftests/mm/config | 3 +++
1 file changed, 3 insertions(+)
diff --git a/tools/testing/selftests/mm/config b/tools/testing/selftests/mm/config
index a28baa536332..deba93379c80 100644
--- a/tools/testing/selftests/mm/config
+++ b/tools/testing/selftests/mm/config
@@ -8,3 +8,6 @@ CONFIG_GUP_TEST=y
CONFIG_TRANSPARENT_HUGEPAGE=y
CONFIG_MEM_SOFT_DIRTY=y
CONFIG_ANON_VMA_NAME=y
+CONFIG_FTRACE=y
+CONFIG_PROFILING=y
+CONFIG_UPROBES=y
--
2.30.2
Currently gup_longterm assumes that filesystems support fallocate() and uses
that to allocate space in files, however this is an optional feature and is
in particular not implemented by NFSv3 which is commonly used in CI systems
leading to spurious failures. Check for lack of support and report a skip
instead for that case.
Signed-off-by: Mark Brown <broonie(a)kernel.org>
---
tools/testing/selftests/mm/gup_longterm.c | 10 +++++++++-
1 file changed, 9 insertions(+), 1 deletion(-)
diff --git a/tools/testing/selftests/mm/gup_longterm.c b/tools/testing/selftests/mm/gup_longterm.c
index 8a97ac5176a4..0e99494268ed 100644
--- a/tools/testing/selftests/mm/gup_longterm.c
+++ b/tools/testing/selftests/mm/gup_longterm.c
@@ -114,7 +114,15 @@ static void do_test(int fd, size_t size, enum test_type type, bool shared)
}
if (fallocate(fd, 0, 0, size)) {
- if (size == pagesize) {
+ /*
+ * Some filesystems (eg, NFSv3) don't support
+ * fallocate(), report this as a skip rather than a
+ * test failure.
+ */
+ if (errno == EOPNOTSUPP) {
+ ksft_print_msg("fallocate() not supported by filesystem\n");
+ result = KSFT_SKIP;
+ } else if (size == pagesize) {
ksft_print_msg("fallocate() failed (%s)\n", strerror(errno));
result = KSFT_FAIL;
} else {
---
base-commit: 19272b37aa4f83ca52bdf9c16d5d81bdd1354494
change-id: 20250610-selftest-mm-gup-longterm-fallocate-nfs-21ef54627ef2
Best regards,
--
Mark Brown <broonie(a)kernel.org>
Initially netpoll and netconsole were created together, and some
functions are in the wrong file. Seperate netconsole-only functions
in netconsole, avoiding exports.
1. Expose netpoll logging macros in the public header to enable consistent
log formatting across netpoll consumers.
2. Relocate netconsole-specific functions from netpoll to the netconsole
module where they are actually used, reducing unnecessary coupling.
3. Remove unnecessary function exports
4. Rename netpoll parsing functions in netconsole to better reflect their
specific usage.
5. Create a test to check that cmdline works fine. This was in my todo
list since [1], this was a good time to add it here to make sure this
patchset doesn't regress.
PS: The code was split in a way that it is easy to review. When copying
the functions from netpoll to netconsole, I do not change than other
than adding `static`. This will make checkpatch unhappy, but, further
patches will address the issues. It is done this way to make it easy for
reviewers.
Link: https://lore.kernel.org/netdev/Z36TlACdNMwFD7wv@dev-ushankar.dev.purestorag… [1]
Signed-off-by: Breno Leitao <leitao(a)debian.org>
---
Changes in v2:
- No change in the code. Just rebased the patches onto netnext/main
- Link to v1: https://lore.kernel.org/r/20250610-rework-v1-0-7cfde283f246@debian.org
---
Breno Leitao (7):
netpoll: remove __netpoll_cleanup from exported API
netpoll: expose netpoll logging macros in public header
netpoll: relocate netconsole-specific functions to netconsole module
netpoll: move netpoll_print_options to netconsole
netconsole: rename functions to better reflect their purpose
netconsole: improve code style in parser function
selftest: netconsole: add test for cmdline configuration
drivers/net/netconsole.c | 137 ++++++++++++++++++++-
include/linux/netpoll.h | 10 +-
net/core/netpoll.c | 136 +-------------------
tools/testing/selftests/drivers/net/Makefile | 1 +
.../selftests/drivers/net/lib/sh/lib_netcons.sh | 39 +++++-
.../selftests/drivers/net/netcons_cmdline.sh | 52 ++++++++
6 files changed, 228 insertions(+), 147 deletions(-)
---
base-commit: 0097c4195b1d0ca57d15979626c769c74747b5a0
change-id: 20250603-rework-c175cad8d22e
Best regards,
--
Breno Leitao <leitao(a)debian.org>
When running the khugepaged selftest for shmem (./khugepaged all:shmem),
I encountered the following test failures:
"
Run test: collapse_full (khugepaged:shmem)
Collapse multiple fully populated PTE table.... Fail
...
Run test: collapse_single_pte_entry (khugepaged:shmem)
Collapse PTE table with single PTE entry present.... Fail
...
Run test: collapse_full_of_compound (khugepaged:shmem)
Allocate huge page... OK
Split huge page leaving single PTE page table full of compound pages... OK
Collapse PTE table full of compound pages.... Fail
"
The reason for the failure is that, it will set MADV_NOHUGEPAGE to prevent
khugepaged from continuing to scan shmem VMA after khugepaged finishes
scanning in the wait_for_scan() function. Moreover, shmem requires a refault
to establish PMD mappings.
However, after commit 2b0f922323cc ("mm: don't install PMD mappings when
THPs are disabled by the hw/process/vma"), PMD mappings are prevented if the
VMA is set with MADV_NOHUGEPAGE flag, so shmem cannot establish PMD mappings
during refault.
One way to fix this issue is to move the MADV_NOHUGEPAGE setting after the
shmem refault. After shmem refault and check huge, the test case will unmap
the shmem immediately. So it seems unnecessary to set the MADV_NOHUGEPAGE.
Then we can simply drop the MADV_NOHUGEPAGE setting, and all khugepaged test
cases passed.
Fixes: 2b0f922323cc ("mm: don't install PMD mappings when THPs are disabled by the hw/process/vma")
Reviewed-by: Zi Yan <ziy(a)nvidia.com>
Signed-off-by: Baolin Wang <baolin.wang(a)linux.alibaba.com>
---
Changes from v1:
- Add reviewed tag from Zi. Thanks.
- Drop the MADV_NOHUGEPAGE setting, per David.
---
tools/testing/selftests/mm/khugepaged.c | 2 --
1 file changed, 2 deletions(-)
diff --git a/tools/testing/selftests/mm/khugepaged.c b/tools/testing/selftests/mm/khugepaged.c
index 8a4d34cce36b..4341ce6b3b38 100644
--- a/tools/testing/selftests/mm/khugepaged.c
+++ b/tools/testing/selftests/mm/khugepaged.c
@@ -561,8 +561,6 @@ static bool wait_for_scan(const char *msg, char *p, int nr_hpages,
usleep(TICK);
}
- madvise(p, nr_hpages * hpage_pmd_size, MADV_NOHUGEPAGE);
-
return timeout == -1;
}
--
2.43.5
Nolibc is useful for selftests as the test programs can be very small,
and compiled with just a kernel crosscompiler, without userspace support.
Currently nolibc is only usable with kselftest.h, not the more
convenient to use kselftest_harness.h
This series provides this compatibility by removing the usage of problematic
libc features from the harness.
Based on nolibc/for-next.
The series is meant to be merged through the nolibc tree.
Signed-off-by: Thomas Weißschuh <thomas.weissschuh(a)linutronix.de>
---
Changes in v4:
- Drop patches for nolibc which where already applied
- Preserve signatures of test functions for tests making assumptions about them
drop 'selftests: harness: Always provide "self" and "variant"'
add 'selftests: harness: Add "variant" and "self" to test metadata'
adapt 'selftests: harness: Stop using setjmp()/longjmp()'
- Validate test function signatures in harness selftest
- Link to v3: https://lore.kernel.org/r/20250411-nolibc-kselftest-harness-v3-0-4d9c029589…
Changes in v3:
- Send patches to correct kselftest harness maintainers
- Move harness selftest to dedicated directory
- Add harness selftest to MAINTAINERS
- Integrate harness selftest cleanup with the selftest framework
- Consistently use "kselftest harness" in commit messages
- Properly propagate kselftest harness failure
- Link to v2: https://lore.kernel.org/r/20250407-nolibc-kselftest-harness-v2-0-f8812f76e9…
Changes in v2:
- Rebase unto v6.15-rc1
- Rename internal nolibc symbols
- Handle edge case of waitpid(INT_MIN) == ESRCH
- Fix arm configurations for final testing patch
- Clean up global getopt.h variable declarations
- Add Acks from Willy
- Link to v1: https://lore.kernel.org/r/20250304-nolibc-kselftest-harness-v1-0-adca7cd231…
---
Thomas Weißschuh (14):
selftests: harness: Add kselftest harness selftest
selftests: harness: Use C89 comment style
selftests: harness: Ignore unused variant argument warning
selftests: harness: Mark functions without prototypes static
selftests: harness: Remove inline qualifier for wrappers
selftests: harness: Remove dependency on libatomic
selftests: harness: Implement test timeouts through pidfd
selftests: harness: Don't set setup_completed for fixtureless tests
selftests: harness: Move teardown conditional into test metadata
selftests: harness: Add teardown callback to test metadata
selftests: harness: Add "variant" and "self" to test metadata
selftests: harness: Stop using setjmp()/longjmp()
selftests: harness: Guard includes on nolibc
HACK: selftests/nolibc: demonstrate usage of the kselftest harness
MAINTAINERS | 1 +
tools/testing/selftests/Makefile | 1 +
tools/testing/selftests/kselftest_harness.h | 175 +-
.../testing/selftests/kselftest_harness/.gitignore | 2 +
tools/testing/selftests/kselftest_harness/Makefile | 7 +
.../selftests/kselftest_harness/harness-selftest.c | 138 ++
.../kselftest_harness/harness-selftest.expected | 64 +
.../kselftest_harness/harness-selftest.sh | 13 +
tools/testing/selftests/nolibc/Makefile | 15 +-
tools/testing/selftests/nolibc/harness-selftest.c | 1 +
tools/testing/selftests/nolibc/nolibc-test.c | 1715 +-------------------
tools/testing/selftests/nolibc/run-tests.sh | 2 +-
12 files changed, 313 insertions(+), 1821 deletions(-)
---
base-commit: 2051d3b830c0889ae55e37e9e8ff0d43a4acd482
change-id: 20250130-nolibc-kselftest-harness-8b2c8cac43bf
Best regards,
--
Thomas Weißschuh <thomas.weissschuh(a)linutronix.de>
Spelling fix:
conneciton --> connection
This is a non-functional change aimed at improving code clarity.
Signed-off-by: Ankit Chauhan <ankitchauhan2065(a)gmail.com>
---
tools/testing/selftests/net/tcp_ao/seq-ext.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/tools/testing/selftests/net/tcp_ao/seq-ext.c b/tools/testing/selftests/net/tcp_ao/seq-ext.c
index f00245263b20..6478da6a71c3 100644
--- a/tools/testing/selftests/net/tcp_ao/seq-ext.c
+++ b/tools/testing/selftests/net/tcp_ao/seq-ext.c
@@ -1,7 +1,7 @@
// SPDX-License-Identifier: GPL-2.0
/* Check that after SEQ number wrap-around:
* 1. SEQ-extension has upper bytes set
- * 2. TCP conneciton is alive and no TCPAOBad segments
+ * 2. TCP connection is alive and no TCPAOBad segments
* In order to test (2), the test doesn't just adjust seq number for a queue
* on a connected socket, but migrates it to another sk+port number, so
* that there won't be any delayed packets that will fail to verify
--
2.34.1
From: Chia-Yu Chang <chia-yu.chang(a)nokia-bell-labs.com>
Hello,
Please find the DualPI2 patch v17.
This patch serise adds DualPI Improved with a Square (DualPI2) with following features:
* Supports congestion controls that comply with the Prague requirements in RFC9331 (e.g. TCP-Prague)
* Coupled dual-queue that separates the L4S traffic in a low latency queue (L-queue), without harming remaining traffic that is scheduled in classic queue (C-queue) due to congestion-coupling using PI2 as defined in RFC9332
* Configurable overload strategies
* Use of sojourn time to reliably estimate queue delay
* Supports ECN L4S-identifier (IP.ECN==0b*1) to classify traffic into respective queues
For more details of DualPI2, please refer IETF RFC9332 (https://datatracker.ietf.org/doc/html/rfc9332).
Best regards,
Chia-Yu
---
v17 (25-May-2025, Resent at 11-Jun-2025)
- Replace 0xffffffff with U32_MAX (Paolo Abeni <pabeni(a)redhat.com>)
- Use helper function qdisc_dequeue_internal() and add new helper function skb_apply_step() (Paolo Abeni <pabeni(a)redhat.com>)
- Add s64 casting when calculating the delta of the PI controller (Paolo Abeni <pabeni(a)redhat.com>)
- Change the drop reason into SKB_DROP_REASON_QDISC_CONGESTED for drop_early (Paolo Abeni <pabeni(a)redhat.com>)
- Modify the condition to remove the original skb when enqueuing multiple GSO segments (Paolo Abeni <pabeni(a)redhat.com>)
- Add READ_ONCE() in dualpi2_dump_stat() (Paolo Abeni <pabeni(a)redhat.com>)
- Add comments, brackets, and brackets for readability (Paolo Abeni <pabeni(a)redhat.com>)
v16 (16-MAy-2025)
- Add qdisc_lock() to dualpi2_timer() in dualpi2_timer (Paolo Abeni <pabeni(a)redhat.com>)
- Introduce convert_ns_to_usec() to convert usec to nsec without overflow in #1 (Paolo Abeni <pabeni(a)redhat.com>)
- Update convert_us_tonsec() to convert nsec to usec without overflow in #2 (Paolo Abeni <pabeni(a)redhat.com>)
- Add more descriptions with respect to DualPI2 in the cover ltter and add changelog in each patch (Paolo Abeni <pabeni(a)redhat.com>)
v15 (09-May-2025)
- Add enum of TCA_DUALPI2_ECN_MASK_CLA_ECT to remove potential leakeage in #1 (Simon Horman <horms(a)kernel.org>)
- Fix one typo in comment of #2
- Update tc.yaml in #5 to aligh with the updated enum of pkt_sched.h
v14 (05-May-2025)
- Modify tc.yaml: (1) Replace flags with enum and remove enum-as-flags, (2) Remove credit-queue in xstats, and (3) Change attribute types (Donald Hunter <donald.hun
- Add enum and fix the ordering of variables in pkt_sched.h to align with the modified tc.yaml (Donald Hunter <donald.hunter(a)gmail.com>)
- Add validators for DROP_OVERLOAD, DROP_EARLY, ECN_MASK, and SPLIT_GSO in sch_dualpi2.c (Donald Hunter <donald.hunter(a)gmail.com>)
- Update dualpi2.json to align with the updated variable order in pkt_sched.h
- Reorder patches (Donald Hunter <donald.hunter(a)gmail.com>)
v13 (26-Apr-2025)
- Use dashes in member names to follow YNL conventions in tc.yaml (Donald Hunter <donald.hunter(a)gmail.com>)
- Define enumerations separately for flags of drop-early, drop-overload, ecn-mask, credit-queue in tc.yaml (Donald Hunter <donald.hunter(a)gmail.com>)
- Change the types of split-gso and step-packets into flag in tc.yaml (Donald Hunter <donald.hunter(a)gmail.com>)
- Revert to u32/u8 types for tc-dualpi2-xstats members in tc.yaml (Donald Hunter <donald.hunter(a)gmail.com>)
- Add new test cases in tc-tests/qdiscs/dualpi2.json to cover all dualpi2 parameters (Donald Hunter <donald.hunter(a)gmail.com>)
- Change the type of TCA_DUALPI2_STEP_PACKETS into NLA_FLAG (Donald Hunter <donald.hunter(a)gmail.com>)
v12 (22-Apr-2025)
- Remove anonymous struct in sch_dualpi2.c (Paolo Abeni <pabeni(a)redhat.com>)
- Replace u32/u8 with uint and s32 with int in tc spec document (Paolo Abeni <pabeni(a)redhat.com>)
- Introduce get_memory_limit function to handle potential overflow when multipling limit with MTU (Paolo Abeni <pabeni(a)redhat.com>)
- Double the packet length to further include packet overhead in memory_limit (Paolo Abeni <pabeni(a)redhat.com>)
- Remove the check of qdisc_qlen(sch) when calling qdisc_tree_reduce_backlog (Paolo Abeni <pabeni(a)redhat.com>)
v11 (15-Apr-2025)
- Replace hstimer_init with hstimer_setup in sch_dualpi2.c
v10 (25-Mar-2025)
- Remove leftover include in include/linux/netdevice.h and anonymous struct in sch_dualpi2.c (Paolo Abeni <pabeni(a)redhat.com>)
- Use kfree_skb_reason() and add SKB_DROP_REASON_DUALPI2_STEP_DROP drop reason (Paolo Abeni <pabeni(a)redhat.com>)
- Split sch_dualpi2.c into 3 patches (and overall 5 patches): Struct definition & parsing, Dump stats & configuration, Enqueue/Dequeue (Paolo Abeni <pabeni(a)redhat.com>)
v9 (16-Mar-2025)
- Fix mem_usage error in previous version
- Add min_qlen_step to the dualpi2 attribute as the minimum queue length in number of packets in the L-queue to start step threshold marking.
In previous versions, this value was fixed to 2, so the step threshold was applied to mark packets in the L queue only when the queue length of the L queue was greater than or equal to 2 packets.
This will cause larger queuing delays for L4S traffic at low rates (<20Mbps). So we parameterize it and change the default value to 0.
Comparison of tcp_1down run 'HTB 20Mbit + DUALPI2 + 10ms base delay'
Old versions:
avg median # data pts
Ping (ms) ICMP : 11.55 11.70 ms 350
TCP upload avg : 18.96 N/A Mbits/s 350
TCP upload sum : 18.96 N/A Mbits/s 350
New version (v9):
avg median # data pts
Ping (ms) ICMP : 10.81 10.70 ms 350
TCP upload avg : 18.91 N/A Mbits/s 350
TCP upload sum : 18.91 N/A Mbits/s 350
Comparison of tcp_1down run 'HTB 10Mbit + DUALPI2 + 10ms base delay'
Old versions:
avg median # data pts
Ping (ms) ICMP : 12.61 12.80 ms 350
TCP upload avg : 9.48 N/A Mbits/s 350
TCP upload sum : 9.48 N/A Mbits/s 350
New version (v9):
avg median # data pts
Ping (ms) ICMP : 11.06 10.80 ms 350
TCP upload avg : 9.43 N/A Mbits/s 350
TCP upload sum : 9.43 N/A Mbits/s 350
Comparison of tcp_1down run 'HTB 10Mbit + DUALPI2 + 10ms base delay'
Old versions:
avg median # data pts
Ping (ms) ICMP : 40.86 37.45 ms 350
TCP upload avg : 0.88 N/A Mbits/s 350
TCP upload sum : 0.88 N/A Mbits/s 350
TCP upload::1 : 0.88 0.97 Mbits/s 350
New version (v9):
avg median # data pts
Ping (ms) ICMP : 11.07 10.40 ms 350
TCP upload avg : 0.55 N/A Mbits/s 350
TCP upload sum : 0.55 N/A Mbits/s 350
TCP upload::1 : 0.55 0.59 Mbits/s 350
v8 (11-Mar-2025)
- Fix warning messages in v7
v7 (07-Mar-2025)
- Separate into 3 patches to avoid mixing changes of documentation, selftest, and code. (Cong Wang <xiyou.wangcong(a)gmail.com>)
v6 (04-Mar-2025)
- Add modprobe for dulapi2 in tc-testing script tc-testing/tdc.sh (Jakub Kicinski <kuba(a)kernel.org>)
- Update test cases in dualpi2.json
- Update commit message
v5 (22-Feb-2025)
- A comparison was done between MQ + DUALPI2, MQ + FQ_PIE, MQ + FQ_CODEL:
Unshaped 1gigE with 4 download streams test:
- Summary of tcp_4down run 'MQ + FQ_CODEL':
avg median # data pts
Ping (ms) ICMP : 1.19 1.34 ms 349
TCP download avg : 235.42 N/A Mbits/s 349
TCP download sum : 941.68 N/A Mbits/s 349
TCP download::1 : 235.19 235.39 Mbits/s 349
TCP download::2 : 235.03 235.35 Mbits/s 349
TCP download::3 : 236.89 235.44 Mbits/s 349
TCP download::4 : 234.57 235.19 Mbits/s 349
- Summary of tcp_4down run 'MQ + FQ_PIE'
avg median # data pts
Ping (ms) ICMP : 1.21 1.37 ms 350
TCP download avg : 235.42 N/A Mbits/s 350
TCP download sum : 941.61 N/A Mbits/s 350
TCP download::1 : 232.54 233.13 Mbits/s 350
TCP download::2 : 232.52 232.80 Mbits/s 350
TCP download::3 : 233.14 233.78 Mbits/s 350
TCP download::4 : 243.41 241.48 Mbits/s 350
- Summary of tcp_4down run 'MQ + DUALPI2'
avg median # data pts
Ping (ms) ICMP : 1.19 1.34 ms 349
TCP download avg : 235.42 N/A Mbits/s 349
TCP download sum : 941.68 N/A Mbits/s 349
TCP download::1 : 235.19 235.39 Mbits/s 349
TCP download::2 : 235.03 235.35 Mbits/s 349
TCP download::3 : 236.89 235.44 Mbits/s 349
TCP download::4 : 234.57 235.19 Mbits/s 349
Unshaped 1gigE with 128 download streams test:
- Summary of tcp_128down run 'MQ + FQ_CODEL':
avg median # data pts
Ping (ms) ICMP : 1.88 1.86 ms 350
TCP download avg : 7.39 N/A Mbits/s 350
TCP download sum : 946.47 N/A Mbits/s 350
- Summary of tcp_128down run 'MQ + FQ_PIE':
avg median # data pts
Ping (ms) ICMP : 1.88 1.86 ms 350
TCP download avg : 7.39 N/A Mbits/s 350
TCP download sum : 946.47 N/A Mbits/s 350
- Summary of tcp_128down run 'MQ + DUALPI2':
avg median # data pts
Ping (ms) ICMP : 1.88 1.86 ms 350
TCP download avg : 7.39 N/A Mbits/s 350
TCP download sum : 946.47 N/A Mbits/s 350
Unshaped 10gigE with 4 download streams test:
- Summary of tcp_4down run 'MQ + FQ_CODEL':
avg median # data pts
Ping (ms) ICMP : 0.22 0.23 ms 350
TCP download avg : 2354.08 N/A Mbits/s 350
TCP download sum : 9416.31 N/A Mbits/s 350
TCP download::1 : 2353.65 2352.81 Mbits/s 350
TCP download::2 : 2354.54 2354.21 Mbits/s 350
TCP download::3 : 2353.56 2353.78 Mbits/s 350
TCP download::4 : 2354.56 2354.45 Mbits/s 350
- Summary of tcp_4down run 'MQ + FQ_PIE':
avg median # data pts
Ping (ms) ICMP : 0.20 0.19 ms 350
TCP download avg : 2354.76 N/A Mbits/s 350
TCP download sum : 9419.04 N/A Mbits/s 350
TCP download::1 : 2354.77 2353.89 Mbits/s 350
TCP download::2 : 2353.41 2354.29 Mbits/s 350
TCP download::3 : 2356.18 2354.19 Mbits/s 350
TCP download::4 : 2354.68 2353.15 Mbits/s 350
- Summary of tcp_4down run 'MQ + DUALPI2':
avg median # data pts
Ping (ms) ICMP : 0.24 0.24 ms 350
TCP download avg : 2354.11 N/A Mbits/s 350
TCP download sum : 9416.43 N/A Mbits/s 350
TCP download::1 : 2354.75 2353.93 Mbits/s 350
TCP download::2 : 2353.15 2353.75 Mbits/s 350
TCP download::3 : 2353.49 2353.72 Mbits/s 350
TCP download::4 : 2355.04 2353.73 Mbits/s 350
Unshaped 10gigE with 128 download streams test:
- Summary of tcp_128down run 'MQ + FQ_CODEL':
avg median # data pts
Ping (ms) ICMP : 7.57 8.69 ms 350
TCP download avg : 73.97 N/A Mbits/s 350
TCP download sum : 9467.82 N/A Mbits/s 350
- Summary of tcp_128down run 'MQ + FQ_PIE':
avg median # data pts
Ping (ms) ICMP : 7.82 8.91 ms 350
TCP download avg : 73.97 N/A Mbits/s 350
TCP download sum : 9468.42 N/A Mbits/s 350
- Summary of tcp_128down run 'MQ + DUALPI2':
avg median # data pts
Ping (ms) ICMP : 6.87 7.93 ms 350
TCP download avg : 73.95 N/A Mbits/s 350
TCP download sum : 9465.87 N/A Mbits/s 350
From the results shown above, we see small differences between combinations.
- Update commit message to include results of no_split_gso and split_gso (Dave Taht <dave.taht(a)gmail.com> and Paolo Abeni <pabeni(a)redhat.com>)
- Add memlimit in the dualpi2 attribute, and add memory_used, max_memory_used, memory_limit in dualpi2 stats (Dave Taht <dave.taht(a)gmail.com>)
- Update note in sch_dualpi2.c related to BBRv3 status (Dave Taht <dave.taht(a)gmail.com>)
- Update license identifier (Dave Taht <dave.taht(a)gmail.com>)
- Add selftest in tools/testing/selftests/tc-testing (Cong Wang <xiyou.wangcong(a)gmail.com>)
- Use netlink policies for parameter checks (Jamal Hadi Salim <jhs(a)mojatatu.com>)
- Modify texts & fix typos in Documentation/netlink/specs/tc.yaml (Dave Taht <dave.taht(a)gmail.com>)
- Add descriptions of packet counter statistics and the reset function of sch_dualpi2.c
- Fix step_thresh in packets
- Update code comments in sch_dualpi2.c
v4 (22-Oct-2024)
- Update statement in Kconfig for DualPI2 (Stephen Hemminger <stephen(a)networkplumber.org>)
- Put a blank line after #define in sch_dualpi2.c (Stephen Hemminger <stephen(a)networkplumber.org>)
- Fix line length warning.
v3 (19-Oct-2024)
- Fix compilaiton error
- Update Documentation/netlink/specs/tc.yaml (Jakub Kicinski <kuba(a)kernel.org>)
v2 (18-Oct-2024)
- Add Documentation/netlink/specs/tc.yaml (Jakub Kicinski <kuba(a)kernel.org>)
- Use dualpi2 instead of skb prefix (Jamal Hadi Salim <jhs(a)mojatatu.com>)
- Replace nla_parse_nested_deprecated with nla_parse_nested (Jamal Hadi Salim <jhs(a)mojatatu.com>)
- Fix line length warning
---
Chia-Yu Chang (4):
sched: Struct definition and parsing of dualpi2 qdisc
sched: Dump configuration and statistics of dualpi2 qdisc
selftests/tc-testing: Add selftests for qdisc DualPI2
Documentation: netlink: specs: tc: Add DualPI2 specification
Koen De Schepper (1):
sched: Add enqueue/dequeue of dualpi2 qdisc
Documentation/netlink/specs/tc.yaml | 156 +++
include/net/dropreason-core.h | 6 +
include/uapi/linux/pkt_sched.h | 68 +
net/sched/Kconfig | 12 +
net/sched/Makefile | 1 +
net/sched/sch_dualpi2.c | 1146 +++++++++++++++++
tools/testing/selftests/tc-testing/config | 1 +
.../tc-testing/tc-tests/qdiscs/dualpi2.json | 254 ++++
tools/testing/selftests/tc-testing/tdc.sh | 1 +
9 files changed, 1645 insertions(+)
create mode 100644 net/sched/sch_dualpi2.c
create mode 100644 tools/testing/selftests/tc-testing/tc-tests/qdiscs/dualpi2.json
--
2.34.1
Tests may wish to add other interfaces to listen on. Notably locally
generated traffic uses dummy interfaces. The multicast daemon needs to know
about these so that it allows forming rules that involve these interfaces,
and so that net.ipv4.conf.X.mc_forwarding is set for the interfaces.
To that end, allow passing in a list of interfaces to configure in addition
to all the physical ones.
Signed-off-by: Petr Machata <petrm(a)nvidia.com>
Reviewed-by: Nikolay Aleksandrov <razor(a)blackwall.org>
---
Notes:
v2:
- Adjust as per shellcheck citations
- Retain Nik's R-b, the changes were very minor.
---
CC: Shuah Khan <shuah(a)kernel.org>
CC: linux-kselftest(a)vger.kernel.org
tools/testing/selftests/net/forwarding/lib.sh | 13 +++++++++++++
1 file changed, 13 insertions(+)
diff --git a/tools/testing/selftests/net/forwarding/lib.sh b/tools/testing/selftests/net/forwarding/lib.sh
index 253847372062..83ee6a07e072 100644
--- a/tools/testing/selftests/net/forwarding/lib.sh
+++ b/tools/testing/selftests/net/forwarding/lib.sh
@@ -1760,9 +1760,12 @@ mc_send()
adf_mcd_start()
{
+ local ifs=("$@")
+
local table_name="$MCD_TABLE_NAME"
local smcroutedir
local pid
+ local if
local i
check_command "$MCD" || return 1
@@ -1776,6 +1779,16 @@ adf_mcd_start()
"$smcroutedir/$table_name.conf"
done
+ for if in "${ifs[@]}"; do
+ if ! ip_link_has_flag "$if" MULTICAST; then
+ ip link set dev "$if" multicast on
+ defer ip link set dev "$if" multicast off
+ fi
+
+ echo "phyint $if enable" >> \
+ "$smcroutedir/$table_name.conf"
+ done
+
"$MCD" -N -I "$table_name" -f "$smcroutedir/$table_name.conf" \
-P "$smcroutedir/$table_name.pid"
busywait "$BUSYWAIT_TIMEOUT" test -e "$smcroutedir/$table_name.pid"
--
2.49.0
IDT event delivery has a debug hole in which it does not generate #DB
upon returning to userspace before the first userspace instruction is
executed if the Trap Flag (TF) is set.
FRED closes this hole by introducing a software event flag, i.e., bit
17 of the augmented SS: if the bit is set and ERETU would result in
RFLAGS.TF = 1, a single-step trap will be pending upon completion of
ERETU.
However I overlooked properly setting and clearing the bit in different
situations. Thus when FRED is enabled, if the Trap Flag (TF) is set
without an external debugger attached, it can lead to an infinite loop
in the SIGTRAP handler. To avoid this, the software event flag in the
augmented SS must be cleared, ensuring that no single-step trap remains
pending when ERETU completes.
This patch set combines the fix [1] and its corresponding selftest [2]
(requested by Dave Hansen) into one patch set.
[1] https://lore.kernel.org/lkml/20250523050153.3308237-1-xin@zytor.com/
[2] https://lore.kernel.org/lkml/20250530230707.2528916-1-xin@zytor.com/
This patch set is based on tip/x86/urgent branch as of today.
Xin Li (Intel) (2):
x86/fred/signal: Prevent single-step upon ERETU completion
selftests/x86: Add a test to detect infinite sigtrap handler loop
arch/x86/include/asm/sighandling.h | 22 +++++
arch/x86/kernel/signal_32.c | 4 +
arch/x86/kernel/signal_64.c | 4 +
tools/testing/selftests/x86/Makefile | 2 +-
tools/testing/selftests/x86/sigtrap_loop.c | 97 ++++++++++++++++++++++
5 files changed, 128 insertions(+), 1 deletion(-)
create mode 100644 tools/testing/selftests/x86/sigtrap_loop.c
base-commit: dd2922dcfaa3296846265e113309e5f7f138839f
--
2.49.0
This series addresses a regression in ethtool flow steering where rules
targeting the default RSS context (context 0) were incorrectly rejected.
The default RSS context always exists but is not stored in the rss_ctx
xarray like additional contexts. The current validation logic was
checking for the existence of context 0 in this array, causing valid
flow steering rules to be rejected.
This prevented configurations such as:
- High priority rules directing specific traffic to the default context
- Low priority catch-all rules directing remaining traffic to additional
contexts
Patch 1 fixes the validation logic to skip the existence check for
context 0.
Patch 2 adds a selftest that verifies this behavior.
Changelog -
v2->v3: https://lore.kernel.org/all/20250609120250.1630125-1-gal@nvidia.com/
* Reworded commit message.
* Fix pylint warning.
v1->v2: https://lore.kernel.org/all/20250225071348.509432-1-gal@nvidia.com/
* Reworded commit message.
* Added a selftest.
Gal Pressman (2):
net: ethtool: Don't check if RSS context exists in case of context 0
selftests: drv-net: rss_ctx: Add test for ntuple rules targeting
default RSS context
net/ethtool/ioctl.c | 3 +-
.../selftests/drivers/net/hw/rss_ctx.py | 59 ++++++++++++++++++-
2 files changed, 60 insertions(+), 2 deletions(-)
--
2.40.1
The SBI Firmware Feature extension allows the S-mode to request some
specific features (either hardware or software) to be enabled. This
series uses this extension to request misaligned access exception
delegation to S-mode in order to let the kernel handle it. It also adds
support for the KVM FWFT SBI extension based on the misaligned access
handling infrastructure.
FWFT SBI extension is part of the SBI V3.0 specifications [1]. It can be
tested using the qemu provided at [2] which contains the series from
[3]. Upstream kvm-unit-tests can be used inside kvm to tests the correct
delegation of misaligned exceptions. Upstream OpenSBI can be used.
Note: Since SBI V3.0 is not yet ratified, FWFT extension API is split
between interface only and implementation, allowing to pick only the
interface which do not have hard dependencies on SBI.
The tests can be run using the kselftest from series [4].
$ qemu-system-riscv64 \
-cpu rv64,trap-misaligned-access=true,v=true \
-M virt \
-m 1024M \
-bios fw_dynamic.bin \
-kernel Image
...
# ./misaligned
TAP version 13
1..23
# Starting 23 tests from 1 test cases.
# RUN global.gp_load_lh ...
# OK global.gp_load_lh
ok 1 global.gp_load_lh
# RUN global.gp_load_lhu ...
# OK global.gp_load_lhu
ok 2 global.gp_load_lhu
# RUN global.gp_load_lw ...
# OK global.gp_load_lw
ok 3 global.gp_load_lw
# RUN global.gp_load_lwu ...
# OK global.gp_load_lwu
ok 4 global.gp_load_lwu
# RUN global.gp_load_ld ...
# OK global.gp_load_ld
ok 5 global.gp_load_ld
# RUN global.gp_load_c_lw ...
# OK global.gp_load_c_lw
ok 6 global.gp_load_c_lw
# RUN global.gp_load_c_ld ...
# OK global.gp_load_c_ld
ok 7 global.gp_load_c_ld
# RUN global.gp_load_c_ldsp ...
# OK global.gp_load_c_ldsp
ok 8 global.gp_load_c_ldsp
# RUN global.gp_load_sh ...
# OK global.gp_load_sh
ok 9 global.gp_load_sh
# RUN global.gp_load_sw ...
# OK global.gp_load_sw
ok 10 global.gp_load_sw
# RUN global.gp_load_sd ...
# OK global.gp_load_sd
ok 11 global.gp_load_sd
# RUN global.gp_load_c_sw ...
# OK global.gp_load_c_sw
ok 12 global.gp_load_c_sw
# RUN global.gp_load_c_sd ...
# OK global.gp_load_c_sd
ok 13 global.gp_load_c_sd
# RUN global.gp_load_c_sdsp ...
# OK global.gp_load_c_sdsp
ok 14 global.gp_load_c_sdsp
# RUN global.fpu_load_flw ...
# OK global.fpu_load_flw
ok 15 global.fpu_load_flw
# RUN global.fpu_load_fld ...
# OK global.fpu_load_fld
ok 16 global.fpu_load_fld
# RUN global.fpu_load_c_fld ...
# OK global.fpu_load_c_fld
ok 17 global.fpu_load_c_fld
# RUN global.fpu_load_c_fldsp ...
# OK global.fpu_load_c_fldsp
ok 18 global.fpu_load_c_fldsp
# RUN global.fpu_store_fsw ...
# OK global.fpu_store_fsw
ok 19 global.fpu_store_fsw
# RUN global.fpu_store_fsd ...
# OK global.fpu_store_fsd
ok 20 global.fpu_store_fsd
# RUN global.fpu_store_c_fsd ...
# OK global.fpu_store_c_fsd
ok 21 global.fpu_store_c_fsd
# RUN global.fpu_store_c_fsdsp ...
# OK global.fpu_store_c_fsdsp
ok 22 global.fpu_store_c_fsdsp
# RUN global.gen_sigbus ...
[12797.988647] misaligned[618]: unhandled signal 7 code 0x1 at 0x0000000000014dc0 in misaligned[4dc0,10000+76000]
[12797.988990] CPU: 0 UID: 0 PID: 618 Comm: misaligned Not tainted 6.13.0-rc6-00008-g4ec4468967c9-dirty #51
[12797.989169] Hardware name: riscv-virtio,qemu (DT)
[12797.989264] epc : 0000000000014dc0 ra : 0000000000014d00 sp : 00007fffe165d100
[12797.989407] gp : 000000000008f6e8 tp : 0000000000095760 t0 : 0000000000000008
[12797.989544] t1 : 00000000000965d8 t2 : 000000000008e830 s0 : 00007fffe165d160
[12797.989692] s1 : 000000000000001a a0 : 0000000000000000 a1 : 0000000000000002
[12797.989831] a2 : 0000000000000000 a3 : 0000000000000000 a4 : ffffffffdeadbeef
[12797.989964] a5 : 000000000008ef61 a6 : 626769735f6e0000 a7 : fffffffffffff000
[12797.990094] s2 : 0000000000000001 s3 : 00007fffe165d838 s4 : 00007fffe165d848
[12797.990238] s5 : 000000000000001a s6 : 0000000000010442 s7 : 0000000000010200
[12797.990391] s8 : 000000000000003a s9 : 0000000000094508 s10: 0000000000000000
[12797.990526] s11: 0000555567460668 t3 : 00007fffe165d070 t4 : 00000000000965d0
[12797.990656] t5 : fefefefefefefeff t6 : 0000000000000073
[12797.990756] status: 0000000200004020 badaddr: 000000000008ef61 cause: 0000000000000006
[12797.990911] Code: 8793 8791 3423 fcf4 3783 fc84 c737 dead 0713 eef7 (c398) 0001
# OK global.gen_sigbus
ok 23 global.gen_sigbus
# PASSED: 23 / 23 tests passed.
# Totals: pass:23 fail:0 xfail:0 xpass:0 skip:0 error:0
With kvm-tools:
# lkvm run -k sbi.flat -m 128
Info: # lkvm run -k sbi.flat -m 128 -c 1 --name guest-97
Info: Removed ghost socket file "/root/.lkvm//guest-97.sock".
##########################################################################
# kvm-unit-tests
##########################################################################
... [test messages elided]
PASS: sbi: fwft: FWFT extension probing no error
PASS: sbi: fwft: get/set reserved feature 0x6 error == SBI_ERR_DENIED
PASS: sbi: fwft: get/set reserved feature 0x3fffffff error == SBI_ERR_DENIED
PASS: sbi: fwft: get/set reserved feature 0x80000000 error == SBI_ERR_DENIED
PASS: sbi: fwft: get/set reserved feature 0xbfffffff error == SBI_ERR_DENIED
PASS: sbi: fwft: misaligned_deleg: Get misaligned deleg feature no error
PASS: sbi: fwft: misaligned_deleg: Set misaligned deleg feature invalid value error
PASS: sbi: fwft: misaligned_deleg: Set misaligned deleg feature invalid value error
PASS: sbi: fwft: misaligned_deleg: Set misaligned deleg feature value no error
PASS: sbi: fwft: misaligned_deleg: Set misaligned deleg feature value 0
PASS: sbi: fwft: misaligned_deleg: Set misaligned deleg feature value no error
PASS: sbi: fwft: misaligned_deleg: Set misaligned deleg feature value 1
PASS: sbi: fwft: misaligned_deleg: Verify misaligned load exception trap in supervisor
SUMMARY: 50 tests, 2 unexpected failures, 12 skipped
This series is available at [5].
Link: https://github.com/riscv-non-isa/riscv-sbi-doc/releases/download/vv3.0-rc2/… [1]
Link: https://github.com/rivosinc/qemu/tree/dev/cleger/misaligned [2]
Link: https://lore.kernel.org/all/20241211211933.198792-3-fkonrad@amd.com/T/ [3]
Link: https://lore.kernel.org/linux-riscv/20250414123543.1615478-1-cleger@rivosin… [4]
Link: https://github.com/rivosinc/linux/tree/dev/cleger/fwft [5]
---
V8:
- Move misaligned_access_speed under CONFIG_RISCV_MISALIGNED and add a
separate commit for that.
V7:
- Fix ifdefery build problems
- Move sbi_fwft_is_supported with fwft_set_req struct
- Added Atish Reviewed-by
- Updated KVM vcpu cfg hedeleg value in set_delegation
- Changed SBI ETIME error mapping to ETIMEDOUT
- Fixed a few typo reported by Alok
V6:
- Rename FWFT interface to remove "_local"
- Fix test for MEDELEG values in KVM FWFT support
- Add __init for unaligned_access_init()
- Rebased on master
V5:
- Return ERANGE as mapping for SBI_ERR_BAD_RANGE
- Removed unused sbi_fwft_get()
- Fix kernel for sbi_fwft_local_set_cpumask()
- Fix indentation for sbi_fwft_local_set()
- Remove spurious space in kvm_sbi_fwft_ops.
- Rebased on origin/master
- Remove fixes commits and sent them as a separate series [4]
V4:
- Check SBI version 3.0 instead of 2.0 for FWFT presence
- Use long for kvm_sbi_fwft operation return value
- Init KVM sbi extension even if default_disabled
- Remove revert_on_fail parameter for sbi_fwft_feature_set().
- Fix comments for sbi_fwft_set/get()
- Only handle local features (there are no globals yet in the spec)
- Add new SBI errors to sbi_err_map_linux_errno()
V3:
- Added comment about kvm sbi fwft supported/set/get callback
requirements
- Move struct kvm_sbi_fwft_feature in kvm_sbi_fwft.c
- Add a FWFT interface
V2:
- Added Kselftest for misaligned testing
- Added get_user() usage instead of __get_user()
- Reenable interrupt when possible in misaligned access handling
- Document that riscv supports unaligned-traps
- Fix KVM extension state when an init function is present
- Rework SBI misaligned accesses trap delegation code
- Added support for CPU hotplugging
- Added KVM SBI reset callback
- Added reset for KVM SBI FWFT lock
- Return SBI_ERR_DENIED_LOCKED when LOCK flag is set
Clément Léger (14):
riscv: sbi: add Firmware Feature (FWFT) SBI extensions definitions
riscv: sbi: remove useless parenthesis
riscv: sbi: add new SBI error mappings
riscv: sbi: add FWFT extension interface
riscv: sbi: add SBI FWFT extension calls
riscv: misaligned: request misaligned exception from SBI
riscv: misaligned: use on_each_cpu() for scalar misaligned access
probing
riscv: misaligned: declare misaligned_access_speed under
CONFIG_RISCV_MISALIGNED
riscv: misaligned: move emulated access uniformity check in a function
riscv: misaligned: add a function to check misalign trap delegability
RISC-V: KVM: add SBI extension init()/deinit() functions
RISC-V: KVM: add SBI extension reset callback
RISC-V: KVM: add support for FWFT SBI extension
RISC-V: KVM: add support for SBI_FWFT_MISALIGNED_DELEG
arch/riscv/include/asm/cpufeature.h | 14 +-
arch/riscv/include/asm/kvm_host.h | 5 +-
arch/riscv/include/asm/kvm_vcpu_sbi.h | 12 +
arch/riscv/include/asm/kvm_vcpu_sbi_fwft.h | 29 +++
arch/riscv/include/asm/sbi.h | 60 +++++
arch/riscv/include/uapi/asm/kvm.h | 1 +
arch/riscv/kernel/sbi.c | 81 ++++++-
arch/riscv/kernel/traps_misaligned.c | 112 ++++++++-
arch/riscv/kernel/unaligned_access_speed.c | 8 +-
arch/riscv/kvm/Makefile | 1 +
arch/riscv/kvm/vcpu.c | 4 +-
arch/riscv/kvm/vcpu_sbi.c | 54 +++++
arch/riscv/kvm/vcpu_sbi_fwft.c | 257 +++++++++++++++++++++
arch/riscv/kvm/vcpu_sbi_sta.c | 3 +-
14 files changed, 620 insertions(+), 21 deletions(-)
create mode 100644 arch/riscv/include/asm/kvm_vcpu_sbi_fwft.h
create mode 100644 arch/riscv/kvm/vcpu_sbi_fwft.c
--
2.49.0
This patch series introduces a new feature to netconsole which allows
appending a message ID to the userdata dictionary.
If the msgid feature is enabled, the message ID is built from a per-target 32
bit counter that is incremented and appended to every message sent to the target.
Example::
echo 1 > "/sys/kernel/config/netconsole/cmdline0/userdata/msgid_enabled"
echo "This is message #1" > /dev/kmsg
echo "This is message #2" > /dev/kmsg
13,434,54928466,-;This is message #1
msgid=1
13,435,54934019,-;This is message #2
msgid=2
This feature can be used by the target to detect if messages were dropped or
reordered before reaching the target. This allows system administrators to
assess the reliability of their netconsole pipeline and detect loss of messages
due to network contention or temporary unavailability.
Suggested-by: Breno Leitao <leitao(a)debian.org>
Signed-off-by: Gustavo Luiz Duarte <gustavold(a)gmail.com>
Note to maintainer:
This will conflict with a fix I sent recently to net:
c85bf1975108 netconsole: fix appending sysdata when sysdata_fields ==
SYSDATA_RELEASE
Please let me know if I should rebase at some point and send a v2.
---
Gustavo Luiz Duarte (5):
netconsole: introduce 'msgid' as a new sysdata field
netconsole: implement configfs for msgid_enabled
netconsole: append msgid to sysdata
selftests: netconsole: Add tests for 'msgid' feature in sysdata
docs: netconsole: document msgid feature
Documentation/networking/netconsole.rst | 22 +++++++
drivers/net/netconsole.c | 67 +++++++++++++++++++++-
.../selftests/drivers/net/netcons_sysdata.sh | 30 ++++++++++
3 files changed, 118 insertions(+), 1 deletion(-)
---
base-commit: 0097c4195b1d0ca57d15979626c769c74747b5a0
change-id: 20250609-netconsole-msgid-b93c6f8e9c60
Best regards,
--
Gustavo Luiz Duarte <gustavold(a)gmail.com>