This series improves the CPU cost of RX token management by adding a
socket option that configures the socket to avoid the xarray allocator
and instead use an niov array and a uref field in niov.
Improvement is ~13% cpu util per RX user thread.
Using kperf, the following results were observed:
Before:
Average RX worker idle %: 13.13, flows 4, test runs 11
After:
Average RX worker idle %: 26.32, flows 4, test runs 11
Two other approaches were tested, but with no improvement. Namely, 1)
using a hashmap for tokens and 2) keeping an xarray of atomic counters
but using RCU so that the hotpath could be mostly lockless. Neither of
these approaches proved better than the simple array in terms of CPU.
The sockopt SO_DEVMEM_AUTORELEASE is added to toggle the optimization.
It defaults to 0 (i.e., optimization on).
Note that prior revs reported only a 5% gain. This lower gain was
measured with cpu frequency boosting (unknowingly) disabled. A
consistent ~13% is measured for both kperf and nccl workloads with cpu
frequency boosting on.
To: David S. Miller <davem(a)davemloft.net>
To: Eric Dumazet <edumazet(a)google.com>
To: Jakub Kicinski <kuba(a)kernel.org>
To: Paolo Abeni <pabeni(a)redhat.com>
To: Simon Horman <horms(a)kernel.org>
To: Kuniyuki Iwashima <kuniyu(a)google.com>
To: Willem de Bruijn <willemb(a)google.com>
To: Neal Cardwell <ncardwell(a)google.com>
To: David Ahern <dsahern(a)kernel.org>
To: Mina Almasry <almasrymina(a)google.com>
To: Arnd Bergmann <arnd(a)arndb.de>
To: Jonathan Corbet <corbet(a)lwn.net>
To: Andrew Lunn <andrew+netdev(a)lunn.ch>
To: Shuah Khan <shuah(a)kernel.org>
Cc: Stanislav Fomichev <sdf(a)fomichev.me>
Cc: netdev(a)vger.kernel.org
Cc: linux-kernel(a)vger.kernel.org
Cc: linux-arch(a)vger.kernel.org
Cc: linux-doc(a)vger.kernel.org
Cc: linux-kselftest(a)vger.kernel.org
Signed-off-by: Bobby Eshleman <bobbyeshleman(a)meta.com>
Changes in v6:
- renamed 'net: devmem: use niov array for token management' to refer to
optionality of new config
- added documentation and tests
- make autorelease flag per-socket sockopt instead of binding
field / sysctl
- many per-patch changes (see Changes sections per-patch)
- Link to v5: https://lore.kernel.org/r/20251023-scratch-bobbyeshleman-devmem-tcp-token-u…
Changes in v5:
- add sysctl to opt-out of performance benefit, back to old token release
- Link to v4: https://lore.kernel.org/all/20250926-scratch-bobbyeshleman-devmem-tcp-token…
Changes in v4:
- rebase to net-next
- Link to v3: https://lore.kernel.org/r/20250926-scratch-bobbyeshleman-devmem-tcp-token-u…
Changes in v3:
- make urefs per-binding instead of per-socket, reducing memory
footprint
- fallback to cleaning up references in dmabuf unbind if socket
leaked tokens
- drop ethtool patch
- Link to v2: https://lore.kernel.org/r/20250911-scratch-bobbyeshleman-devmem-tcp-token-u…
Changes in v2:
- net: ethtool: prevent user from breaking devmem single-binding rule
(Mina)
- pre-assign niovs in binding->vec for RX case (Mina)
- remove WARNs on invalid user input (Mina)
- remove extraneous binding ref get (Mina)
- remove WARN for changed binding (Mina)
- always use GFP_ZERO for binding->vec (Mina)
- fix length of alloc for urefs
- use atomic_set(, 0) to initialize sk_user_frags.urefs
- Link to v1: https://lore.kernel.org/r/20250902-scratch-bobbyeshleman-devmem-tcp-token-u…
---
Bobby Eshleman (6):
net: devmem: rename tx_vec to vec in dmabuf binding
net: devmem: refactor sock_devmem_dontneed for autorelease split
net: devmem: prepare for autorelease rx token management
net: devmem: add SO_DEVMEM_AUTORELEASE for autorelease control
net: devmem: document SO_DEVMEM_AUTORELEASE socket option
net: devmem: add tests for SO_DEVMEM_AUTORELEASE socket option
Documentation/networking/devmem.rst | 70 +++++++++-
include/net/netmem.h | 1 +
include/net/sock.h | 13 +-
include/uapi/asm-generic/socket.h | 2 +
net/core/devmem.c | 54 +++++---
net/core/devmem.h | 4 +-
net/core/sock.c | 152 ++++++++++++++++++----
net/ipv4/tcp.c | 69 ++++++++--
net/ipv4/tcp_ipv4.c | 11 +-
net/ipv4/tcp_minisocks.c | 5 +-
tools/include/uapi/asm-generic/socket.h | 2 +
tools/testing/selftests/drivers/net/hw/devmem.py | 115 +++++++++++++++-
tools/testing/selftests/drivers/net/hw/ncdevmem.c | 20 ++-
13 files changed, 453 insertions(+), 65 deletions(-)
---
base-commit: 255d75ef029f33f75fcf5015052b7302486f7ad2
change-id: 20250829-scratch-bobbyeshleman-devmem-tcp-token-upstream-292be174d503
Best regards,
--
Bobby Eshleman <bobbyeshleman(a)meta.com>
The printf statement attempts to print the DMA direction string using
the syntax 'dir[directions]', which is an invalid array access. The
variable 'dir' is an integer, and 'directions' is a char pointer array.
This incorrect syntax should be 'directions[dir]', using 'dir' as the
index into the 'directions' array. Fix this by correcting the array
access from 'dir[directions]' to 'directions[dir]'.
Signed-off-by: Zhang Chujun <zhangchujun(a)cmss.chinamobile.com>
diff --git a/tools/testing/selftests/dma/dma_map_benchmark.c b/tools/testing/selftests/dma/dma_map_benchmark.c
index b12f1f9babf8..b925756373ce 100644
--- a/tools/testing/selftests/dma/dma_map_benchmark.c
+++ b/tools/testing/selftests/dma/dma_map_benchmark.c
@@ -118,7 +118,7 @@ int main(int argc, char **argv)
}
printf("dma mapping benchmark: threads:%d seconds:%d node:%d dir:%s granule: %d\n",
- threads, seconds, node, dir[directions], granule);
+ threads, seconds, node, directions[dir], granule);
printf("average map latency(us):%.1f standard deviation:%.1f\n",
map.avg_map_100ns/10.0, map.map_stddev/10.0);
printf("average unmap latency(us):%.1f standard deviation:%.1f\n",
--
2.50.1.windows.1
Hello,
This small series is another follow-up to [1], in which I misunderstood
Martin's initial feedback (see [2]). I proposed to make tc-tunnel apply
SO_REUSEPORT once server is brought up. This series updates
start_server_addr to really apply Martin's proposal after his
clarification [3]
[1] https://lore.kernel.org/bpf/20251031-tc_tunnel_improv-v1-0-0ffe44d27eda@boo…
[2] https://lore.kernel.org/bpf/efa3540a-1f52-46ca-9f49-e631a5e3e48c@linux.dev/
[3] https://lore.kernel.org/bpf/4cbabdf1-af2c-490a-a41a-b40c1539c1cb@linux.dev/
Signed-off-by: Alexis Lothoré (eBPF Foundation) <alexis.lothore(a)bootlin.com>
---
Alexis Lothoré (eBPF Foundation) (2):
selftests/bpf: systematically add SO_REUSEADDR in start_server_addr
selftests/bpf: use start_server_str rather than start_reuseport_server in tc_tunnel
tools/testing/selftests/bpf/network_helpers.c | 9 +++++++-
.../selftests/bpf/prog_tests/test_tc_tunnel.c | 27 ++++++++++++----------
2 files changed, 23 insertions(+), 13 deletions(-)
---
base-commit: de0745f7cc98146c70a020bc3a1b73c7f3405282
change-id: 20251104-start-server-soreuseaddr-e442446e2d37
Best regards,
--
Alexis Lothoré, Bootlin
Embedded Linux and Kernel engineering
https://bootlin.com
From: Steven Rostedt <rostedt(a)goodmis.org>
Commit 64cf7d058a00 ("tracing: Have trace_marker use per-cpu data to read
user space") made an update that fixed both trace_marker and
trace_marker_raw. But the small difference made to trace_marker_raw had a
blatant bug in it that any basic testing would have uncovered.
Unfortunately, the self tests have tests for trace_marker but nothing for
trace_marker_raw which allowed the bug to get upstream.
Add basic selftests to test trace_marker_raw so that this doesn't happen
again.
Signed-off-by: Steven Rostedt (Google) <rostedt(a)goodmis.org>
---
.../ftrace/test.d/00basic/trace_marker_raw.tc | 107 ++++++++++++++++++
1 file changed, 107 insertions(+)
create mode 100644 tools/testing/selftests/ftrace/test.d/00basic/trace_marker_raw.tc
diff --git a/tools/testing/selftests/ftrace/test.d/00basic/trace_marker_raw.tc b/tools/testing/selftests/ftrace/test.d/00basic/trace_marker_raw.tc
new file mode 100644
index 000000000000..7daf7292209e
--- /dev/null
+++ b/tools/testing/selftests/ftrace/test.d/00basic/trace_marker_raw.tc
@@ -0,0 +1,107 @@
+#!/bin/sh
+# SPDX-License-Identifier: GPL-2.0
+# description: Basic tests on writing to trace_marker_raw
+# requires: trace_marker_raw
+# flags: instance
+
+is_little_endian() {
+ if lscpu | grep -q 'Little Endian'; then
+ echo 1;
+ else
+ echo 0;
+ fi
+}
+
+little=`is_little_endian`
+
+make_str() {
+ id=$1
+ cnt=$2
+
+ if [ $little -eq 1 ]; then
+ val=`printf "\\%03o\\%03o\\%03o\\%03o" \
+ $(($id & 0xff)) \
+ $((($id >> 8) & 0xff)) \
+ $((($id >> 16) & 0xff)) \
+ $((($id >> 24) & 0xff))`
+ else
+ val=`printf "\\%03o\\%03o\\%03o\\%03o" \
+ $((($id >> 24) & 0xff)) \
+ $((($id >> 16) & 0xff)) \
+ $((($id >> 8) & 0xff)) \
+ $(($id & 0xff))`
+ fi
+
+ data=`printf -- 'X%.0s' $(seq $cnt)`
+
+ printf "${val}${data}"
+}
+
+write_buffer() {
+ id=$1
+ size=$2
+
+ # write the string into the raw marker
+ make_str $id $size > trace_marker_raw
+}
+
+
+test_multiple_writes() {
+
+ # Write a bunch of data where the id is the count of
+ # data to write
+ for i in `seq 1 10` `seq 101 110` `seq 1001 1010`; do
+ write_buffer $i $i
+ done
+
+ # add a little buffer
+ echo stop > trace_marker
+
+ # Check to make sure the number of entries is the id (rounded up by 4)
+ awk '/.*: # [0-9a-f]* / {
+ print;
+ cnt = -1;
+ for (i = 0; i < NF; i++) {
+ # The counter is after the "#" marker
+ if ( $i == "#" ) {
+ i++;
+ cnt = strtonum("0x" $i);
+ num = NF - (i + 1);
+ # The number of items is always rounded up by 4
+ cnt2 = int((cnt + 3) / 4) * 4;
+ if (cnt2 != num) {
+ exit 1;
+ }
+ break;
+ }
+ }
+ }
+ // { if (NR > 30) { exit 0; } } ' trace_pipe;
+}
+
+
+get_buffer_data_size() {
+ sed -ne 's/^.*data.*size:\([0-9][0-9]*\).*/\1/p' events/header_page
+}
+
+test_buffer() {
+
+ # The id must be four bytes, test that 3 bytes fails a write
+ if echo -n abc > ./trace_marker_raw ; then
+ echo "Too small of write expected to fail but did not"
+ exit_fail
+ fi
+
+ size=`get_buffer_data_size`
+ echo size = $size
+
+ # Now add a little more than what it can handle
+
+ if write_buffer 0xdeadbeef $size ; then
+ echo "Too big of write expected to fail but did not"
+ exit_fail
+ fi
+}
+
+test_buffer
+test_multiple_writes
--
2.51.0
The printf statement attempts to print the DMA direction string using
the syntax 'dir[directions]', which is an invalid array access. The
variable 'dir' is an integer, and 'directions' is a char pointer array.
This incorrect syntax should be 'directions[dir]', using 'dir' as the
index into the 'directions' array. Fix this by correcting the array
access from 'dir[directions]' to 'directions[dir]'.
Signed-off-by: Zhang Chujun <zhangchujun(a)cmss.chinamobile.com>
diff --git a/tools/testing/selftests/dma/dma_map_benchmark.c b/tools/testing/selftests/dma/dma_map_benchmark.c
index b12f1f9babf8..b925756373ce 100644
--- a/tools/testing/selftests/dma/dma_map_benchmark.c
+++ b/tools/testing/selftests/dma/dma_map_benchmark.c
@@ -118,7 +118,7 @@ int main(int argc, char **argv)
}
printf("dma mapping benchmark: threads:%d seconds:%d node:%d dir:%s granule: %d\n",
- threads, seconds, node, dir[directions], granule);
+ threads, seconds, node, directions[dir], granule);
printf("average map latency(us):%.1f standard deviation:%.1f\n",
map.avg_map_100ns/10.0, map.map_stddev/10.0);
printf("average unmap latency(us):%.1f standard deviation:%.1f\n",
--
2.50.1.windows.1
Liang reported an issue where setting a slave’s actor_port_prio to
predefined values such as 0, 255, or 65535 would cause a system crash.
The problem occurs because in bond_opt_parse(), when the provided value
matches a predefined table entry, the function returns that table entry,
which does not contain slave information. Later, in
bond_option_actor_port_prio_set(), calling bond_slave_get_rtnl() leads
to a NULL pointer dereference.
Since actor_port_prio is defined as a u16 and initialized to the default
value of 255 in ad_initialize_port(), there is no need for the
bond_actor_port_prio_tbl. Using the BOND_OPTFLAG_RAWVAL flag is sufficient.
Fixes: 6b6dc81ee7e8 ("bonding: add support for per-port LACP actor priority")
Reported-by: Liang Li <liali(a)redhat.com>
Signed-off-by: Hangbin Liu <liuhangbin(a)gmail.com>
---
BTW, the logic in bond_opt_parse() may also need an update after we have
f2b3b28ce523 ("bonding: add slave_dev field for bond_opt_value"), as we
may need range checking on slave options in future. But this should
be another patch and not urgent as this one.
---
drivers/net/bonding/bond_options.c | 9 +--------
1 file changed, 1 insertion(+), 8 deletions(-)
diff --git a/drivers/net/bonding/bond_options.c b/drivers/net/bonding/bond_options.c
index 495a87f2ea7c..384499c869b8 100644
--- a/drivers/net/bonding/bond_options.c
+++ b/drivers/net/bonding/bond_options.c
@@ -225,13 +225,6 @@ static const struct bond_opt_value bond_ad_actor_sys_prio_tbl[] = {
{ NULL, -1, 0},
};
-static const struct bond_opt_value bond_actor_port_prio_tbl[] = {
- { "minval", 0, BOND_VALFLAG_MIN},
- { "maxval", 65535, BOND_VALFLAG_MAX},
- { "default", 255, BOND_VALFLAG_DEFAULT},
- { NULL, -1, 0},
-};
-
static const struct bond_opt_value bond_ad_user_port_key_tbl[] = {
{ "minval", 0, BOND_VALFLAG_MIN | BOND_VALFLAG_DEFAULT},
{ "maxval", 1023, BOND_VALFLAG_MAX},
@@ -497,7 +490,7 @@ static const struct bond_option bond_opts[BOND_OPT_LAST] = {
.id = BOND_OPT_ACTOR_PORT_PRIO,
.name = "actor_port_prio",
.unsuppmodes = BOND_MODE_ALL_EX(BIT(BOND_MODE_8023AD)),
- .values = bond_actor_port_prio_tbl,
+ .flags = BOND_OPTFLAG_RAWVAL,
.set = bond_option_actor_port_prio_set,
},
[BOND_OPT_AD_ACTOR_SYSTEM] = {
--
2.50.1