[PATCH net v2 0/2] netconsole: Fix userdata race condition

List overview All Threads
Download

newer

older

[PATCH bpf-next v1] selftests/bpf:...

[PATCH bpf-next 0/2] bpf: Fix...

Gustavo Luiz Duarte

22 Oct 2025 22 Oct '25

5:39 p.m.

This series fixes a race condition in netconsole's userdata handling where concurrent message transmission could read partially updated userdata fields, resulting in corrupted netconsole output.

The first patch adds a selftest that reproduces the race condition by continuously sending messages while rapidly changing userdata values, detecting any torn reads in the output.

The second patch fixes the issue by ensuring update_userdata() holds the target_list_lock while updating both extradata_complete and userdata_length, preventing readers from seeing inconsistent state.

This targets net tree as it fixes a bug introduced in commit df03f830d099 ("net: netconsole: cache userdata formatted string in netconsole_target").

Signed-off-by: Gustavo Luiz Duarte gustavold@gmail.com

Changes in v2: - Added testcase to Makefile. - Reordered fix and testcase to avoid failure in CI. - testcase: delay cleanup until child process are killed, plus shellcheck fixes. - Link to v1: https://lore.kernel.org/all/20251020-netconsole-fix-race-v1-0-b775be30ee8a@g...

--- Gustavo Luiz Duarte (2): netconsole: Fix race condition in between reader and writer of userdata selftests: netconsole: Add race condition test for userdata corruption

drivers/net/netconsole.c | 5 ++ tools/testing/selftests/drivers/net/Makefile | 1 + .../selftests/drivers/net/netcons_race_userdata.sh | 87 ++++++++++++++++++++++ 3 files changed, 93 insertions(+) --- base-commit: d63f0391d6c7b75e1a847e1a26349fa8cad0004d change-id: 20251020-netconsole-fix-race-f465f37b57ea

Best regards,

-- Gustavo Duarte gustavold@meta.com

Show replies by date

Gustavo Luiz Duarte

22 Oct 22 Oct

5:39 p.m.

New subject: [PATCH net v2 1/2] netconsole: Fix race condition in between reader and writer of userdata

The update_userdata() function constructs the complete userdata string in nt->extradata_complete and updates nt->userdata_length. This data is then read by write_msg() and write_ext_msg() when sending netconsole messages. However, update_userdata() was not holding target_list_lock during this process, allowing concurrent message transmission to read partially updated userdata.

This race condition could result in netconsole messages containing incomplete or inconsistent userdata - for example, reading the old userdata_length with new extradata_complete content, or vice versa, leading to truncated or corrupted output.

Fix this by acquiring target_list_lock with spin_lock_irqsave() before updating extradata_complete and userdata_length, and releasing it after both fields are fully updated. This ensures that readers see a consistent view of the userdata, preventing corruption during concurrent access.

The fix aligns with the existing locking pattern used throughout the netconsole code, where target_list_lock protects access to target fields including buf[] and msgcounter that are accessed during message transmission.

Fixes: df03f830d099 ("net: netconsole: cache userdata formatted string in netconsole_target") Signed-off-by: Gustavo Luiz Duarte gustavold@gmail.com --- drivers/net/netconsole.c | 5 +++++ 1 file changed, 5 insertions(+)

diff --git a/drivers/net/netconsole.c b/drivers/net/netconsole.c index 194570443493..1f9cf6b12dfc 100644 --- a/drivers/net/netconsole.c +++ b/drivers/net/netconsole.c @@ -888,6 +888,9 @@ static void update_userdata(struct netconsole_target *nt) { int complete_idx = 0, child_count = 0; struct list_head *entry; + unsigned long flags; + + spin_lock_irqsave(&target_list_lock, flags);

/* Clear the current string in case the last userdatum was deleted */ nt->userdata_length = 0; @@ -918,6 +921,8 @@ static void update_userdata(struct netconsole_target *nt) } nt->userdata_length = strnlen(nt->extradata_complete, sizeof(nt->extradata_complete)); + + spin_unlock_irqrestore(&target_list_lock, flags); }

static ssize_t userdatum_value_store(struct config_item *item, const char *buf,

-- 2.47.3

Gustavo Luiz Duarte

5:39 p.m.

New subject: [PATCH net v2 2/2] selftests: netconsole: Add race condition test for userdata corruption

Add a test to verify that netconsole userdata handling is properly synchronized under concurrent read/write operations. The test creates two competing loops: one continuously sending netconsole messages (which read userdata), and another rapidly alternating userdata values between two distinct 198-byte patterns filled with 'A' and 'B' characters.

Without proper synchronization, concurrent reads and writes could result in torn reads where a message contains mixed userdata (e.g., starting with 'A' but containing 'B', or vice versa). The test monitors 10,000 messages and fails if it detects any such corruption, ensuring that the netconsole implementation maintains data consistency through proper locking mechanisms.

This test validates the fix for potential race conditions in the netconsole userdata path and serves as a regression test to prevent similar issues in the future.

Signed-off-by: Gustavo Luiz Duarte gustavold@gmail.com --- tools/testing/selftests/drivers/net/Makefile | 1 + .../selftests/drivers/net/netcons_race_userdata.sh | 87 ++++++++++++++++++++++ 2 files changed, 88 insertions(+)

diff --git a/tools/testing/selftests/drivers/net/Makefile b/tools/testing/selftests/drivers/net/Makefile index 6e41635bd55a..ba7dedc54711 100644 --- a/tools/testing/selftests/drivers/net/Makefile +++ b/tools/testing/selftests/drivers/net/Makefile @@ -17,6 +17,7 @@ TEST_PROGS := \ netcons_cmdline.sh \ netcons_fragmented_msg.sh \ netcons_overflow.sh \ + netcons_race_userdata.sh \ netcons_sysdata.sh \ netpoll_basic.py \ ping.py \ diff --git a/tools/testing/selftests/drivers/net/netcons_race_userdata.sh b/tools/testing/selftests/drivers/net/netcons_race_userdata.sh new file mode 100755 index 000000000000..017ee042eb75 --- /dev/null +++ b/tools/testing/selftests/drivers/net/netcons_race_userdata.sh @@ -0,0 +1,87 @@ +#!/usr/bin/env bash +# SPDX-License-Identifier: GPL-2.0 + +# This test verifies that netconsole userdata remains consistent under concurrent +# read/write operations. It creates two loops: one continuously writing netconsole +# messages (which read userdata) and another rapidly alternating userdata values +# between two distinct patterns. The test checks that no message contains corrupted +# or mixed userdata, ensuring proper synchronization in the netconsole implementation. +# +# Author: Gustavo Luiz Duarte gustavold@gmail.com + +set -euo pipefail + +SCRIPTDIR=$(dirname "$(readlink -e "${BASH_SOURCE[0]}")") + +source "${SCRIPTDIR}"/lib/sh/lib_netcons.sh + +function loop_set_userdata() { + MSGA=$(printf 'A%.0s' {1..198}) + MSGB=$(printf 'B%.0s' {1..198}) + + while true; do + echo "$MSGA" > "${NETCONS_PATH}/userdata/${USERDATA_KEY}/value" + echo "$MSGB" > "${NETCONS_PATH}/userdata/${USERDATA_KEY}/value" + done +} + +function loop_print_msg() { + while true; do + echo "test msg" > /dev/kmsg + done +} + +cleanup_children() { + pkill_socat + kill "$child1" "$child2" 2> /dev/null || true + wait "$child1" "$child2" 2> /dev/null || true + # Remove the namespace, interfaces and netconsole target + cleanup +} + +modprobe netdevsim 2> /dev/null || true +modprobe netconsole 2> /dev/null || true + +OUTPUT_FILE="stdout" +# Check for basic system dependency and exit if not found +check_for_dependencies +# Set current loglevel to KERN_INFO(6), and default to KERN_NOTICE(5) +echo "6 5" > /proc/sys/kernel/printk +# kill child processes and remove interfaces on exit +trap cleanup_children EXIT + +# Create one namespace and two interfaces +set_network +# Create a dynamic target for netconsole +create_dynamic_target +# Set userdata "key" with the "value" value +set_user_data + +# Start userdata read loop (printk) +loop_print_msg & +child1=$! + +# Start userdata write loop +loop_set_userdata & +child2=$! + +# Start socat to listen for netconsole messages and check for corrupted userdata. +MAX_COUNT=10000 +i=0 +while read -r line; do + if [ $i -ge $MAX_COUNT ]; then + echo "Test passed." + exit "${ksft_pass}" + fi + + if [[ "$line" == "key=A"* && "$line" == *"B"* || + "$line" == "key=B"* && "$line" == *"A"* ]]; then + echo "Test failed. Found corrupted userdata: $line" + exit "${ksft_fail}" + fi + + i=$((i + 1)) +done < <(listen_port_and_save_to ${OUTPUT_FILE} 2> /dev/null) + +echo "socat died before we could check $MAX_COUNT messages. Skipping test." +exit "${ksft_skip}"

-- 2.47.3

Jakub Kicinski

23 Oct 23 Oct

1:01 a.m.

On Wed, 22 Oct 2025 10:39:56 -0700 Gustavo Luiz Duarte wrote:

...

This series fixes a race condition in netconsole's userdata handling where concurrent message transmission could read partially updated userdata fields, resulting in corrupted netconsole output.

The first patch adds a selftest that reproduces the race condition by continuously sending messages while rapidly changing userdata values, detecting any torn reads in the output.

The second patch fixes the issue by ensuring update_userdata() holds the target_list_lock while updating both extradata_complete and userdata_length, preventing readers from seeing inconsistent state.

This targets net tree as it fixes a bug introduced in commit df03f830d099 ("net: netconsole: cache userdata formatted string in netconsole_target").

This test is skipping on debug kernel builds in netdev CI.

TAP version 13 1..1 # overriding timeout to 360 # selftests: drivers/net: netcons_race_userdata.sh # socat died before we could check 10000 messages. Skipping test. ok 1 selftests: drivers/net: netcons_race_userdata.sh # SKIP

We can't have skips for SW tests.

I think Breno was fighting with a similar problem in the past. Not sure what he ended up doing. Maybe just leave it at the print? Don't actually mark the test as skipped?

Slightly more advanced option is to only do that if KSFT_MACHINE_SLOW per: https://github.com/linux-netdev/nipa/wiki/How-to-run-netdev-selftests-CI-sty...

-- pw-bot: cr

Gustavo Luiz Duarte

24 Oct 24 Oct

9:10 p.m.

On Wed, Oct 22, 2025 at 10:01 PM Jakub Kicinski kuba@kernel.org wrote:

...

On Wed, 22 Oct 2025 10:39:56 -0700 Gustavo Luiz Duarte wrote:

...
This series fixes a race condition in netconsole's userdata handling where concurrent message transmission could read partially updated userdata fields, resulting in corrupted netconsole output.

The first patch adds a selftest that reproduces the race condition by continuously sending messages while rapidly changing userdata values, detecting any torn reads in the output.

The second patch fixes the issue by ensuring update_userdata() holds the target_list_lock while updating both extradata_complete and userdata_length, preventing readers from seeing inconsistent state.

This targets net tree as it fixes a bug introduced in commit df03f830d099 ("net: netconsole: cache userdata formatted string in netconsole_target").

This test is skipping on debug kernel builds in netdev CI.

TAP version 13 1..1 # overriding timeout to 360 # selftests: drivers/net: netcons_race_userdata.sh # socat died before we could check 10000 messages. Skipping test. ok 1 selftests: drivers/net: netcons_race_userdata.sh # SKIP

We can't have skips for SW tests.

I think Breno was fighting with a similar problem in the past. Not sure what he ended up doing. Maybe just leave it at the print? Don't actually mark the test as skipped?

Slightly more advanced option is to only do that if KSFT_MACHINE_SLOW per: https://github.com/linux-netdev/nipa/wiki/How-to-run-netdev-selftests-CI-sty...

There are two reasons for hitting this skip. 1. The hardcoded 2s timeout in listen_port_and_save_to() expired 2. socat died or failed to start for mysterious reasons

#1 should probably be a success (we ran the test for this long and no corruption found), and for #2 we can try to return whatever exit code socat give us. Retrieving socat return code is a bit tricky because we are running it in a subshell, but we can save it in a temp file.

I can also send a follow up patch to use a longer timeout in listen_port_and_save_to() if KSFT_MACHINE_SLOW

...

-- pw-bot: cr

Jakub Kicinski

11:42 p.m.

On Fri, 24 Oct 2025 18:10:06 -0300 Gustavo Luiz Duarte wrote:

...

There are two reasons for hitting this skip.

The hardcoded 2s timeout in listen_port_and_save_to() expired

socat died or failed to start for mysterious reasons

#1 should probably be a success (we ran the test for this long and no corruption found), and for #2 we can try to return whatever exit code socat give us. Retrieving socat return code is a bit tricky because we are running it in a subshell, but we can save it in a temp file.

I can also send a follow up patch to use a longer timeout in listen_port_and_save_to() if KSFT_MACHINE_SLOW

Frankly I'm not sure this test is worth the compute cycles it will burn. It's a direct repro for a very specific problem. The changes it will occur again for the same field a pretty low. Maybe just repost patch 1?

Gustavo Luiz Duarte

27 Oct 27 Oct

3:56 p.m.

On Fri, Oct 24, 2025 at 8:42 PM Jakub Kicinski kuba@kernel.org wrote:

...

Frankly I'm not sure this test is worth the compute cycles it will burn. It's a direct repro for a very specific problem. The changes it will occur again for the same field a pretty low. Maybe just repost patch 1?

Fair enough. I will repost with only patch 1 then.

days inactive

days old

linux-kselftest-mirror@lists.linaro.org

6 comments

participants

tags (0)

participants (2)

Gustavo Luiz Duarte
Jakub Kicinski