From: Qiuxu Zhuo <qiuxu.zhuo(a)intel.com>
[ Upstream commit 197c1eaa7ba633a482ed7588eea6fd4aa57e08d4 ]
When running the mincore_selftest on a system with an XFS file system, it
failed the "check_file_mmap" test case due to the read-ahead pages reaching
the end of the file. The failure log is as below:
RUN global.check_file_mmap ...
mincore_selftest.c:264:check_file_mmap:Expected i (1024) < vec_size (1024)
mincore_selftest.c:265:check_file_mmap:Read-ahead pages reached the end of the file
check_file_mmap: Test failed
FAIL global.check_file_mmap
This is because the read-ahead window size of the XFS file system on this
machine is 4 MB, which is larger than the size from the #PF address to the
end of the file. As a result, all the pages for this file are populated.
blockdev --getra /dev/nvme0n1p5
8192
blockdev --getbsz /dev/nvme0n1p5
512
This issue can be fixed by extending the current FILE_SIZE 4MB to a larger
number, but it will still fail if the read-ahead window size of the file
system is larger enough. Additionally, in the real world, read-ahead pages
reaching the end of the file can happen and is an expected behavior.
Therefore, allowing read-ahead pages to reach the end of the file is a
better choice for the "check_file_mmap" test case.
Link: https://lore.kernel.org/r/20250311080940.21413-1-qiuxu.zhuo@intel.com
Reported-by: Yi Lai <yi1.lai(a)intel.com>
Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo(a)intel.com>
Signed-off-by: Shuah Khan <skhan(a)linuxfoundation.org>
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
---
tools/testing/selftests/mincore/mincore_selftest.c | 3 ---
1 file changed, 3 deletions(-)
diff --git a/tools/testing/selftests/mincore/mincore_selftest.c b/tools/testing/selftests/mincore/mincore_selftest.c
index 4c88238fc8f05..c0ae86c28d7f3 100644
--- a/tools/testing/selftests/mincore/mincore_selftest.c
+++ b/tools/testing/selftests/mincore/mincore_selftest.c
@@ -261,9 +261,6 @@ TEST(check_file_mmap)
TH_LOG("No read-ahead pages found in memory");
}
- EXPECT_LT(i, vec_size) {
- TH_LOG("Read-ahead pages reached the end of the file");
- }
/*
* End of the readahead window. The rest of the pages shouldn't
* be in memory.
--
2.39.5
From: Qiuxu Zhuo <qiuxu.zhuo(a)intel.com>
[ Upstream commit 197c1eaa7ba633a482ed7588eea6fd4aa57e08d4 ]
When running the mincore_selftest on a system with an XFS file system, it
failed the "check_file_mmap" test case due to the read-ahead pages reaching
the end of the file. The failure log is as below:
RUN global.check_file_mmap ...
mincore_selftest.c:264:check_file_mmap:Expected i (1024) < vec_size (1024)
mincore_selftest.c:265:check_file_mmap:Read-ahead pages reached the end of the file
check_file_mmap: Test failed
FAIL global.check_file_mmap
This is because the read-ahead window size of the XFS file system on this
machine is 4 MB, which is larger than the size from the #PF address to the
end of the file. As a result, all the pages for this file are populated.
blockdev --getra /dev/nvme0n1p5
8192
blockdev --getbsz /dev/nvme0n1p5
512
This issue can be fixed by extending the current FILE_SIZE 4MB to a larger
number, but it will still fail if the read-ahead window size of the file
system is larger enough. Additionally, in the real world, read-ahead pages
reaching the end of the file can happen and is an expected behavior.
Therefore, allowing read-ahead pages to reach the end of the file is a
better choice for the "check_file_mmap" test case.
Link: https://lore.kernel.org/r/20250311080940.21413-1-qiuxu.zhuo@intel.com
Reported-by: Yi Lai <yi1.lai(a)intel.com>
Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo(a)intel.com>
Signed-off-by: Shuah Khan <skhan(a)linuxfoundation.org>
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
---
tools/testing/selftests/mincore/mincore_selftest.c | 3 ---
1 file changed, 3 deletions(-)
diff --git a/tools/testing/selftests/mincore/mincore_selftest.c b/tools/testing/selftests/mincore/mincore_selftest.c
index e949a43a61450..efabfcbe0b498 100644
--- a/tools/testing/selftests/mincore/mincore_selftest.c
+++ b/tools/testing/selftests/mincore/mincore_selftest.c
@@ -261,9 +261,6 @@ TEST(check_file_mmap)
TH_LOG("No read-ahead pages found in memory");
}
- EXPECT_LT(i, vec_size) {
- TH_LOG("Read-ahead pages reached the end of the file");
- }
/*
* End of the readahead window. The rest of the pages shouldn't
* be in memory.
--
2.39.5
From: Qiuxu Zhuo <qiuxu.zhuo(a)intel.com>
[ Upstream commit 197c1eaa7ba633a482ed7588eea6fd4aa57e08d4 ]
When running the mincore_selftest on a system with an XFS file system, it
failed the "check_file_mmap" test case due to the read-ahead pages reaching
the end of the file. The failure log is as below:
RUN global.check_file_mmap ...
mincore_selftest.c:264:check_file_mmap:Expected i (1024) < vec_size (1024)
mincore_selftest.c:265:check_file_mmap:Read-ahead pages reached the end of the file
check_file_mmap: Test failed
FAIL global.check_file_mmap
This is because the read-ahead window size of the XFS file system on this
machine is 4 MB, which is larger than the size from the #PF address to the
end of the file. As a result, all the pages for this file are populated.
blockdev --getra /dev/nvme0n1p5
8192
blockdev --getbsz /dev/nvme0n1p5
512
This issue can be fixed by extending the current FILE_SIZE 4MB to a larger
number, but it will still fail if the read-ahead window size of the file
system is larger enough. Additionally, in the real world, read-ahead pages
reaching the end of the file can happen and is an expected behavior.
Therefore, allowing read-ahead pages to reach the end of the file is a
better choice for the "check_file_mmap" test case.
Link: https://lore.kernel.org/r/20250311080940.21413-1-qiuxu.zhuo@intel.com
Reported-by: Yi Lai <yi1.lai(a)intel.com>
Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo(a)intel.com>
Signed-off-by: Shuah Khan <skhan(a)linuxfoundation.org>
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
---
tools/testing/selftests/mincore/mincore_selftest.c | 3 ---
1 file changed, 3 deletions(-)
diff --git a/tools/testing/selftests/mincore/mincore_selftest.c b/tools/testing/selftests/mincore/mincore_selftest.c
index e949a43a61450..efabfcbe0b498 100644
--- a/tools/testing/selftests/mincore/mincore_selftest.c
+++ b/tools/testing/selftests/mincore/mincore_selftest.c
@@ -261,9 +261,6 @@ TEST(check_file_mmap)
TH_LOG("No read-ahead pages found in memory");
}
- EXPECT_LT(i, vec_size) {
- TH_LOG("Read-ahead pages reached the end of the file");
- }
/*
* End of the readahead window. The rest of the pages shouldn't
* be in memory.
--
2.39.5
From: Qiuxu Zhuo <qiuxu.zhuo(a)intel.com>
[ Upstream commit 197c1eaa7ba633a482ed7588eea6fd4aa57e08d4 ]
When running the mincore_selftest on a system with an XFS file system, it
failed the "check_file_mmap" test case due to the read-ahead pages reaching
the end of the file. The failure log is as below:
RUN global.check_file_mmap ...
mincore_selftest.c:264:check_file_mmap:Expected i (1024) < vec_size (1024)
mincore_selftest.c:265:check_file_mmap:Read-ahead pages reached the end of the file
check_file_mmap: Test failed
FAIL global.check_file_mmap
This is because the read-ahead window size of the XFS file system on this
machine is 4 MB, which is larger than the size from the #PF address to the
end of the file. As a result, all the pages for this file are populated.
blockdev --getra /dev/nvme0n1p5
8192
blockdev --getbsz /dev/nvme0n1p5
512
This issue can be fixed by extending the current FILE_SIZE 4MB to a larger
number, but it will still fail if the read-ahead window size of the file
system is larger enough. Additionally, in the real world, read-ahead pages
reaching the end of the file can happen and is an expected behavior.
Therefore, allowing read-ahead pages to reach the end of the file is a
better choice for the "check_file_mmap" test case.
Link: https://lore.kernel.org/r/20250311080940.21413-1-qiuxu.zhuo@intel.com
Reported-by: Yi Lai <yi1.lai(a)intel.com>
Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo(a)intel.com>
Signed-off-by: Shuah Khan <skhan(a)linuxfoundation.org>
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
---
tools/testing/selftests/mincore/mincore_selftest.c | 3 ---
1 file changed, 3 deletions(-)
diff --git a/tools/testing/selftests/mincore/mincore_selftest.c b/tools/testing/selftests/mincore/mincore_selftest.c
index e949a43a61450..efabfcbe0b498 100644
--- a/tools/testing/selftests/mincore/mincore_selftest.c
+++ b/tools/testing/selftests/mincore/mincore_selftest.c
@@ -261,9 +261,6 @@ TEST(check_file_mmap)
TH_LOG("No read-ahead pages found in memory");
}
- EXPECT_LT(i, vec_size) {
- TH_LOG("Read-ahead pages reached the end of the file");
- }
/*
* End of the readahead window. The rest of the pages shouldn't
* be in memory.
--
2.39.5
From: Qiuxu Zhuo <qiuxu.zhuo(a)intel.com>
[ Upstream commit 197c1eaa7ba633a482ed7588eea6fd4aa57e08d4 ]
When running the mincore_selftest on a system with an XFS file system, it
failed the "check_file_mmap" test case due to the read-ahead pages reaching
the end of the file. The failure log is as below:
RUN global.check_file_mmap ...
mincore_selftest.c:264:check_file_mmap:Expected i (1024) < vec_size (1024)
mincore_selftest.c:265:check_file_mmap:Read-ahead pages reached the end of the file
check_file_mmap: Test failed
FAIL global.check_file_mmap
This is because the read-ahead window size of the XFS file system on this
machine is 4 MB, which is larger than the size from the #PF address to the
end of the file. As a result, all the pages for this file are populated.
blockdev --getra /dev/nvme0n1p5
8192
blockdev --getbsz /dev/nvme0n1p5
512
This issue can be fixed by extending the current FILE_SIZE 4MB to a larger
number, but it will still fail if the read-ahead window size of the file
system is larger enough. Additionally, in the real world, read-ahead pages
reaching the end of the file can happen and is an expected behavior.
Therefore, allowing read-ahead pages to reach the end of the file is a
better choice for the "check_file_mmap" test case.
Link: https://lore.kernel.org/r/20250311080940.21413-1-qiuxu.zhuo@intel.com
Reported-by: Yi Lai <yi1.lai(a)intel.com>
Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo(a)intel.com>
Signed-off-by: Shuah Khan <skhan(a)linuxfoundation.org>
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
---
tools/testing/selftests/mincore/mincore_selftest.c | 3 ---
1 file changed, 3 deletions(-)
diff --git a/tools/testing/selftests/mincore/mincore_selftest.c b/tools/testing/selftests/mincore/mincore_selftest.c
index e949a43a61450..efabfcbe0b498 100644
--- a/tools/testing/selftests/mincore/mincore_selftest.c
+++ b/tools/testing/selftests/mincore/mincore_selftest.c
@@ -261,9 +261,6 @@ TEST(check_file_mmap)
TH_LOG("No read-ahead pages found in memory");
}
- EXPECT_LT(i, vec_size) {
- TH_LOG("Read-ahead pages reached the end of the file");
- }
/*
* End of the readahead window. The rest of the pages shouldn't
* be in memory.
--
2.39.5
Hi,
While trying the coredump test on qemu-system-riscv64, I observed test
failures for various reasons.
This series makes the test works on qemu-system-riscv64.
Best regards,
Nam
v1->v2 https://lore.kernel.org/lkml/cover.1743438749.git.namcao@linutronix.de/
- use getline() more precisely [John Ogness]
- be absolutely safe: waitpid() for the child process, and still wait 10s
for stack_values file to be created [John Ogness]
Nam Cao (3):
selftests: coredump: Properly initialize pointer
selftests: coredump: Fix test failure for slow machines
selftests: coredump: Raise timeout to 2 minutes
tools/testing/selftests/coredump/stackdump_test.c | 10 ++++++++--
1 file changed, 8 insertions(+), 2 deletions(-)
--
2.39.5
Add basic support to run various MIPS variants via kunit_tool using the
virtualized malta platform.
Some of the cs_dsp unittests are broken. They are being disabled by default in
the series "Fix up building KUnit tests for Cirrus Logic modules" [0].
[0] https://lore.kernel.org/lkml/20250411123608.1676462-1-rf@opensource.cirrus.…
Signed-off-by: Thomas Weißschuh <thomas.weissschuh(a)linutronix.de>
---
Changes in v2:
- Fix usercopy kunit test by handling ABI-less tasks in stack_top()
- Drop change to mm initialization.
The broken test is not built by default anymore.
- Link to v1: https://lore.kernel.org/r/20250212-kunit-mips-v1-0-eb49c9d76615@linutronix.…
---
Thomas Weißschuh (2):
MIPS: Don't crash in stack_top() for tasks without ABI or vDSO
kunit: qemu_configs: Add MIPS configurations
arch/mips/kernel/process.c | 8 +++++---
tools/testing/kunit/qemu_configs/mips.py | 18 ++++++++++++++++++
tools/testing/kunit/qemu_configs/mips64.py | 19 +++++++++++++++++++
tools/testing/kunit/qemu_configs/mips64el.py | 19 +++++++++++++++++++
tools/testing/kunit/qemu_configs/mipsel.py | 18 ++++++++++++++++++
5 files changed, 79 insertions(+), 3 deletions(-)
---
base-commit: 0466dc03fa779373afb807ce7496c404d98ace4b
change-id: 20241014-kunit-mips-e4fe1c265ed7
Best regards,
--
Thomas Weißschuh <thomas.weissschuh(a)linutronix.de>
When setting the lower-layer link up/down, the ipvlan/macvlan device
synchronizes its state via netif_stacked_transfer_operstate(), which
only checks the carrier state. However, setting the link down does not
necessarily change the carrier state for virtual interfaces like bonding.
This causes the ipvlan/macvlan state to become out of sync with the
lower-layer link state. Fix this by explicitly changing the IFF_UP flag,
similar to how VLAN handles it.
Before the patch:
# ./rtnetlink.sh -t "kci_test_vlan kci_test_ipvlan kci_test_macvlan"
PASS: vlan link state correct
FAIL: ipvlan link state incorrect
FAIL: macvlan link state incorrect
After the patch set:
# ./rtnetlink.sh -t "kci_test_vlan kci_test_ipvlan kci_test_macvlan"
PASS: vlan link state correct
PASS: ipvlan link state correct
PASS: macvlan link state correct
Hangbin Liu (3):
ipvlan: fix NETDEV_UP/NETDEV_DOWN event handling
macvlan: fix NETDEV_UP/NETDEV_DOWN event handling
selftests/rtnetlink.sh: add vlan/ipvlan/macvlan link state test
drivers/net/ipvlan/ipvlan_main.c | 20 +++++++-
drivers/net/macvlan.c | 20 ++++++++
tools/testing/selftests/net/rtnetlink.sh | 64 ++++++++++++++++++++++++
3 files changed, 103 insertions(+), 1 deletion(-)
--
2.46.0
v3:
- Take up Johannes' suggestion of just skip the !usage case and
fix test_memcontrol selftest to fix the rests of the min/low
failures.
The test_memcontrol selftest consistently fails its test_memcg_low
sub-test and sporadically fails its test_memcg_min sub-test. This
patchset fixes the test_memcg_min and test_memcg_low failures by
skipping the !usage case in shrink_node_memcgs() and adjust the
test_memcontrol selftest to fix other causes of the test failures.
Note that I decide not to use the suggested mem_cgroup_usage() call
as it is a real function call defined in mm/memcontrol.c which is not
available if CONFIG_MEMCG isn't defined.
Waiman Long (2):
mm/vmscan: Skip memcg with !usage in shrink_node_memcgs()
selftests: memcg: Increase error tolerance of child memory.current
check in test_memcg_protection()
mm/vmscan.c | 4 ++++
tools/testing/selftests/cgroup/test_memcontrol.c | 11 ++++++++---
2 files changed, 12 insertions(+), 3 deletions(-)
--
2.48.1
The default SH kunit configuration sets CONFIG_CMDLINE_OVERWRITE which
completely disregards the cmdline passed from the bootloader/QEMU in favor
of the builtin CONFIG_CMDLINE.
However the kunit tool needs to pass arguments to the in-kernel kunit core,
for filters and other runtime parameters.
Enable CONFIG_CMDLINE_EXTEND instead, so kunit arguments are respected.
Fixes: 8110a3cab05e ("kunit: tool: Add support for SH under QEMU")
Signed-off-by: Thomas Weißschuh <thomas.weissschuh(a)linutronix.de>
---
tools/testing/kunit/qemu_configs/sh.py | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/tools/testing/kunit/qemu_configs/sh.py b/tools/testing/kunit/qemu_configs/sh.py
index 78a474a5b95f3a7d6064a2d3b728810ced095606..f00cb89fdef6aa1c0abd83ca18e7004a4fdd96e1 100644
--- a/tools/testing/kunit/qemu_configs/sh.py
+++ b/tools/testing/kunit/qemu_configs/sh.py
@@ -7,7 +7,9 @@ CONFIG_CPU_SUBTYPE_SH7751R=y
CONFIG_MEMORY_START=0x0c000000
CONFIG_SH_RTS7751R2D=y
CONFIG_RTS7751R2D_PLUS=y
-CONFIG_SERIAL_SH_SCI=y''',
+CONFIG_SERIAL_SH_SCI=y
+CONFIG_CMDLINE_EXTEND=y
+''',
qemu_arch='sh4',
kernel_path='arch/sh/boot/zImage',
kernel_command_line='console=ttySC1',
---
base-commit: 2014c95afecee3e76ca4a56956a936e23283f05b
change-id: 20250220-kunit-sh-f42a3a8cce35
Best regards,
--
Thomas Weißschuh <thomas.weissschuh(a)linutronix.de>
v3:
- Take up Johannes' suggestion of just skip the !usage case and
fix test_memcontrol selftest to fix the rests of the min/low
failures.
v4:
- Add "#ifdef CONFIG_MEMCG" directives around shrink_node_memcgs() to
avoid compilation problem with !CONFIG_MEMCG configs.
The test_memcontrol selftest consistently fails its test_memcg_low
sub-test and sporadically fails its test_memcg_min sub-test. This
patchset fixes the test_memcg_min and test_memcg_low failures by
skipping the !usage case in shrink_node_memcgs() and adjust the
test_memcontrol selftest to fix other causes of the test failures.
Note that I decide not to use the suggested mem_cgroup_usage() call as
it is a real function call defined in mm/memcontrol.c to be used mainly
by cgroup v1 code.
Waiman Long (2):
mm/vmscan: Skip memcg with !usage in shrink_node_memcgs()
selftests: memcg: Increase error tolerance of child memory.current
check in test_memcg_protection()
mm/vmscan.c | 10 ++++++++++
tools/testing/selftests/cgroup/test_memcontrol.c | 11 ++++++++---
2 files changed, 18 insertions(+), 3 deletions(-)
--
2.48.1
v5:
- Use mem_cgroup_usage() as originally suggested by Johannes.
v4:
- Add "#ifdef CONFIG_MEMCG" directives around shrink_node_memcgs() to
avoid compilation problem with !CONFIG_MEMCG configs.
The test_memcontrol selftest consistently fails its test_memcg_low
sub-test and sporadically fails its test_memcg_min sub-test. This
patchset fixes the test_memcg_min and test_memcg_low failures by
skipping the !usage case in shrink_node_memcgs() and adjust the
test_memcontrol selftest to fix other causes of the test failures.
Waiman Long (2):
mm/vmscan: Skip memcg with !usage in shrink_node_memcgs()
selftests: memcg: Increase error tolerance of child memory.current
check in test_memcg_protection()
mm/internal.h | 9 +++++++++
mm/memcontrol-v1.h | 2 --
mm/vmscan.c | 4 ++++
tools/testing/selftests/cgroup/test_memcontrol.c | 11 ++++++++---
4 files changed, 21 insertions(+), 5 deletions(-)
--
2.48.1
FW_CS_DSP gets enabled if KUNIT is enabled. The test should rather
depend on if the feature is enabled. Fix this by moving FW_CS_DSP to the
depends on clause, and set CONFIG_FW_CS_DSP=y in the kunit tooling.
Fixes: dd0b6b1f29b9 ("firmware: cs_dsp: Add KUnit testing of bin file download")
Signed-off-by: Nico Pache <npache(a)redhat.com>
---
drivers/firmware/cirrus/Kconfig | 3 +--
tools/testing/kunit/configs/all_tests.config | 2 ++
2 files changed, 3 insertions(+), 2 deletions(-)
diff --git a/drivers/firmware/cirrus/Kconfig b/drivers/firmware/cirrus/Kconfig
index 0a883091259a..989568ab5712 100644
--- a/drivers/firmware/cirrus/Kconfig
+++ b/drivers/firmware/cirrus/Kconfig
@@ -11,9 +11,8 @@ config FW_CS_DSP_KUNIT_TEST_UTILS
config FW_CS_DSP_KUNIT_TEST
tristate "KUnit tests for Cirrus Logic cs_dsp" if !KUNIT_ALL_TESTS
- depends on KUNIT && REGMAP
+ depends on KUNIT && REGMAP && FW_CS_DSP
default KUNIT_ALL_TESTS
- select FW_CS_DSP
select FW_CS_DSP_KUNIT_TEST_UTILS
help
This builds KUnit tests for cs_dsp.
diff --git a/tools/testing/kunit/configs/all_tests.config b/tools/testing/kunit/configs/all_tests.config
index b0049be00c70..96c6b4aca87d 100644
--- a/tools/testing/kunit/configs/all_tests.config
+++ b/tools/testing/kunit/configs/all_tests.config
@@ -49,3 +49,5 @@ CONFIG_SOUND=y
CONFIG_SND=y
CONFIG_SND_SOC=y
CONFIG_SND_SOC_TOPOLOGY_BUILD=y
+
+CONFIG_FW_CS_DSP=y
\ No newline at end of file
--
2.48.1
┌────────────┐ ┌───────────────────────────────────┐ ┌────────────────┐
│ │ │ │ │ │
│ │ │ PCI Endpoint │ │ PCI Host │
│ │ │ │ │ │
│ │◄──┤ 1.platform_msi_domain_alloc_irqs()│ │ │
│ │ │ │ │ │
│ MSI ├──►│ 2.write_msi_msg() ├──►├─BAR<n> │
│ Controller │ │ update doorbell register address│ │ │
│ │ │ for BAR │ │ │
│ │ │ │ │ 3. Write BAR<n>│
│ │◄──┼───────────────────────────────────┼───┤ │
│ │ │ │ │ │
│ ├──►│ 4.Irq Handle │ │ │
│ │ │ │ │ │
│ │ │ │ │ │
└────────────┘ └───────────────────────────────────┘ └────────────────┘
This patches based on old https://lore.kernel.org/imx/20221124055036.1630573-1-Frank.Li@nxp.com/
Original patch only target to vntb driver. But actually it is common
method.
This patches add new API to pci-epf-core, so any EP driver can use it.
Previous v2 discussion here.
https://lore.kernel.org/imx/20230911220920.1817033-1-Frank.Li@nxp.com/
Changes in v17:
- move document part to pci-ep.yaml
- Link to v16: https://lore.kernel.org/r/20250404-ep-msi-v16-0-d4919d68c0d0@nxp.com
Changes in v16:
- remove arm64: dts: imx95-19x19-evk: Add PCIe1 endpoint function overlay file
because there are better patches, which under review.
- Add document for pcie-ep msi-map usage
- other change to see each patch's change log
About IMMUTABLE (No change for this part, tglx provide feedback)
> - This IMMUTABLE thing serves no purpose, because you don't randomly
> plug this end-point block on any MSI controller. They come as part
> of an SoC.
"Yes and no. The problem is that the EP implementation is meant to be a
generic library and while GIC-ITS guarantees immutability of the
address/data pair after setup, there are architectures (x86, loongson,
riscv) where the base MSI controller does not and immutability is only
achieved when interrupt remapping is enabled. The latter can be disabled
at boot-time and then the EP implementation becomes a lottery across
affinity changes.
That was my concern about this library implementation and that's why I
asked for a mechanism to ensure that the underlying irqdomain provides a
immutable address/data pair.
So it does not matter for GIC-ITS, but in the larger picture it matters.
Thanks,
tglx
"
So it does not matter for GIC-ITS, but in the larger picture it matters.
- Link to v15: https://lore.kernel.org/r/20250211-ep-msi-v15-0-bcacc1f2b1a9@nxp.com
Changes in v15:
- rebase to v6.14-rc1
- fix build issue find by kernel test robot
- Link to v14: https://lore.kernel.org/r/20250207-ep-msi-v14-0-9671b136f2b8@nxp.com
Changes in v14:
Marc Zyngier raised concerns about adding DOMAIN_BUS_DEVICE_PCI_EP_MSI. As
a result, the approach has been reverted to the v9 method. However, there
are several improvements:
MSI now supports msi-map in addition to msi-parent.
- The struct device: id is used as the endpoint function (EPF) device
identity to map to the stream ID (sideband information).
- The EPC device tree source (DTS) utilizes msi-map to provide such
information.
- The EPF device's of_node is set to the EPC controller’s node. This
approach is commonly used for multi-function device (MFD) platform child
devices, allowing them to inherit properties from the MFD device’s DTS,
such as reset-cells and gpio-cells. This method is well-suited for the
current case, as the EPF is inherently created/binded to the EPC and
should inherit the EPC’s DTS node properties.
Additionally:
Since the basic IMX95 LUT support has already been merged into the
mainline, a DTS and driver increment patch is added to complete the
solution. The patch is rebased onto the latest linux-next tree and
aligned with the new pcitest framework.
- Link to v13: https://lore.kernel.org/r/20241218-ep-msi-v13-0-646e2192dc24@nxp.com
Changes in v13:
- Change to use DOMAIN_BUS_PCI_DEVICE_EP_MSI
- Change request id as func | vfunc << 3
- Remove IRQ_DOMAIN_MSI_IMMUTABLE
Thomas Gleixner:
I hope capture all your points in review comments. If missed, let me know.
- Link to v12: https://lore.kernel.org/r/20241211-ep-msi-v12-0-33d4532fa520@nxp.com
Changes in v12:
- Change to use IRQ_DOMAIN_MSI_IMMUTABLE and add help function
irq_domain_msi_is_immuatble().
- split PCI: endpoint: pci-ep-msi: Add MSI address/data pair mutable check to 3 patches
- Link to v11: https://lore.kernel.org/r/20241209-ep-msi-v11-0-7434fa8397bd@nxp.com
Changes in v11:
- Change to use MSI_FLAG_MSG_IMMUTABLE
- Link to v10: https://lore.kernel.org/r/20241204-ep-msi-v10-0-87c378dbcd6d@nxp.com
Changes in v10:
Thomas Gleixner:
There are big change in pci-ep-msi.c. I am sure if go on the
corrent path. The key improvement is remove only 1 function devices's
limitation.
I use new patch for imutable check, which relative additional
feature compared to base enablement patch.
- Remove patch Add msi_remove_device_irq_domain() in platform_device_msi_free_irqs_all()
- Add new patch irqchip/gic-v3-its: Avoid overwriting msi_prepare callback if provided by msi_domain_info
- Remove only support 1 endpoint function limiation.
- Create one MSI domain for each endpoint function devices.
- Use "msi-map" in pci ep controler node, instead of of msi-parent. first
argument is
(func_no << 8 | vfunc_no)
- Link to v9: https://lore.kernel.org/r/20241203-ep-msi-v9-0-a60dbc3f15dd@nxp.com
Changes in v9
- Add patch platform-msi: Add msi_remove_device_irq_domain() in platform_device_msi_free_irqs_all()
- Remove patch PCI: endpoint: Add pci_epc_get_fn() API for customizable filtering
- Remove API pci_epf_align_inbound_addr_lo_hi
- Move doorbell_alloc in to doorbell_enable function.
- Link to v8: https://lore.kernel.org/r/20241116-ep-msi-v8-0-6f1f68ffd1bb@nxp.com
Changes in v8:
- update helper function name to pci_epf_align_inbound_addr()
- Link to v7: https://lore.kernel.org/r/20241114-ep-msi-v7-0-d4ac7aafbd2c@nxp.com
Changes in v7:
- Add helper function pci_epf_align_addr();
- Link to v6: https://lore.kernel.org/r/20241112-ep-msi-v6-0-45f9722e3c2a@nxp.com
Changes in v6:
- change doorbell_addr to doorbell_offset
- use round_down()
- add Niklas's test by tag
- rebase to pci/endpoint
- Link to v5: https://lore.kernel.org/r/20241108-ep-msi-v5-0-a14951c0d007@nxp.com
Changes in v5:
- Move request_irq to epf test function driver for more flexiable user case
- Add fixed size bar handler
- Some minor improvememtn to see each patches's changelog.
- Link to v4: https://lore.kernel.org/r/20241031-ep-msi-v4-0-717da2d99b28@nxp.com
Changes in v4:
- Remove patch genirq/msi: Add cleanup guard define for msi_lock_descs()/msi_unlock_descs()
- Use new method to avoid compatible problem.
Add new command DOORBELL_ENABLE and DOORBELL_DISABLE.
pcitest -B send DOORBELL_ENABLE first, EP test function driver try to
remap one of BAR_N (except test register bar) to ITS MSI MMIO space. Old
driver don't support new command, so failure return, not side effect.
After test, DOORBELL_DISABLE command send out to recover original map, so
pcitest bar test can pass as normal.
- Other detail change see each patches's change log
- Link to v3: https://lore.kernel.org/r/20241015-ep-msi-v3-0-cedc89a16c1a@nxp.com
Change from v2 to v3
- Fixed manivannan's comments
- Move common part to pci-ep-msi.c and pci-ep-msi.h
- rebase to 6.12-rc1
- use RevID to distingiush old version
mkdir /sys/kernel/config/pci_ep/functions/pci_epf_test/func1
echo 16 > /sys/kernel/config/pci_ep/functions/pci_epf_test/func1/msi_interrupts
echo 0x080c > /sys/kernel/config/pci_ep/functions/pci_epf_test/func1/deviceid
echo 0x1957 > /sys/kernel/config/pci_ep/functions/pci_epf_test/func1/vendorid
echo 1 > /sys/kernel/config/pci_ep/functions/pci_epf_test/func1/revid
^^^^^^ to enable platform msi support.
ln -s /sys/kernel/config/pci_ep/functions/pci_epf_test/func1 /sys/kernel/config/pci_ep/controllers/4c380000.pcie-ep
- use new device ID, which identify support doorbell to avoid broken
compatility.
Enable doorbell support only for PCI_DEVICE_ID_IMX8_DB, while other devices
keep the same behavior as before.
EP side RC with old driver RC with new driver
PCI_DEVICE_ID_IMX8_DB no probe doorbell enabled
Other device ID doorbell disabled* doorbell disabled*
* Behavior remains unchanged.
Change from v1 to v2
- Add missed patch for endpont/pci-epf-test.c
- Move alloc and free to epc driver from epf.
- Provide general help function for EPC driver to alloc platform msi irq.
- Fixed manivannan's comments.
Signed-off-by: Frank Li <Frank.Li(a)nxp.com>
---
Frank Li (15):
platform-msi: Add msi_remove_device_irq_domain() in platform_device_msi_free_irqs_all()
irqdomain: Add IRQ_DOMAIN_FLAG_MSI_IMMUTABLE and irq_domain_is_msi_immutable()
irqchip/gic-v3-its: Set IRQ_DOMAIN_FLAG_MSI_IMMUTABLE for ITS
dt-bindings: PCI: pci-ep: Add support for iommu-map and msi-map
irqchip/gic-v3-its: Add support for device tree msi-map and msi-mask
PCI: endpoint: Set ID and of_node for function driver
PCI: endpoint: Add RC-to-EP doorbell support using platform MSI controller
PCI: endpoint: pci-ep-msi: Add MSI address/data pair mutable check
PCI: endpoint: Add pci_epf_align_inbound_addr() helper for address alignment
PCI: endpoint: pci-epf-test: Add doorbell test support
misc: pci_endpoint_test: Add doorbell test case
selftests: pci_endpoint: Add doorbell test case
pci: imx6: Add helper function imx_pcie_add_lut_by_rid()
pci: imx6: Add LUT setting for MSI/IOMMU in Endpoint mode
arm64: dts: imx95: Add msi-map for pci-ep device
Documentation/devicetree/bindings/pci/pci-ep.yaml | 67 ++++++++++
arch/arm64/boot/dts/freescale/imx95.dtsi | 1 +
drivers/base/platform-msi.c | 1 +
drivers/irqchip/irq-gic-v3-its-msi-parent.c | 8 ++
drivers/irqchip/irq-gic-v3-its.c | 2 +-
drivers/misc/pci_endpoint_test.c | 82 ++++++++++++
drivers/pci/controller/dwc/pci-imx6.c | 25 ++--
drivers/pci/endpoint/Makefile | 1 +
drivers/pci/endpoint/functions/pci-epf-test.c | 142 +++++++++++++++++++++
drivers/pci/endpoint/pci-ep-msi.c | 90 +++++++++++++
drivers/pci/endpoint/pci-epf-core.c | 48 +++++++
include/linux/irqdomain.h | 7 +
include/linux/pci-ep-msi.h | 28 ++++
include/linux/pci-epf.h | 21 +++
include/uapi/linux/pcitest.h | 1 +
.../selftests/pci_endpoint/pci_endpoint_test.c | 28 ++++
16 files changed, 543 insertions(+), 9 deletions(-)
---
base-commit: a4949bd40778aa9beac77c89e4c6a1da52875c8b
change-id: 20241010-ep-msi-8b4cab33b1be
Best regards,
---
Frank Li <Frank.Li(a)nxp.com>
Originally introduced to sysctl-test.c by commit b5ffbd139688 ("sysctl:
move the extra1/2 boundary check of u8 to sysctl_check_table_array"), it
has been shown to lead to a panic under certain conditions related to a
dangling registration.
This series moves the u8 test to lib/test_sysctl.c where the
registration calls are kept and correctly removed on module exit. An
additional 0012 test is added to selftests/sysctl/sysctl.sh in order to
visualize the registration calls done in test_sysctl.c.
Very much related to adding tests to sysctl, the last two patches of
this series reduce the places that need to be changed when tests are
added by managing the initialization and closing of sysctl tables with a
for loop.
Comments are greatly appreciated
Signed-off-by: Joel Granados <joel.granados(a)kernel.org>
---
Joel Granados (4):
sysctl: move u8 register test to lib/test_sysctl.c
sysctl: Add 0012 to test the u8 range check
sysctl: call sysctl tests with a for loop
sysctl: Close test ctl_headers with a for loop
kernel/sysctl-test.c | 49 ------------
lib/test_sysctl.c | 133 +++++++++++++++++++++----------
tools/testing/selftests/sysctl/sysctl.sh | 30 +++++++
3 files changed, 122 insertions(+), 90 deletions(-)
---
base-commit: 7eb172143d5508b4da468ed59ee857c6e5e01da6
change-id: 20250321-jag-test_extra_val-40954050a1f6
Best regards,
--
Joel Granados <joel.granados(a)kernel.org>
Series takes care of few bugs and missing features with the aim to improve
the test coverage of sockmap/sockhash.
Last patch is a create_pair() rewrite making use of
__attribute__((cleanup)) to handle socket fd lifetime.
Signed-off-by: Michal Luczaj <mhal(a)rbox.co>
---
Changes in v2:
- Rebase on bpf-next (Jakub)
- Use cleanup helpers from kernel's cleanup.h (Jakub)
- Fix subject of patch 3, rephrase patch 4, use correct prefix
- Link to v1: https://lore.kernel.org/r/20240724-sockmap-selftest-fixes-v1-0-46165d224712…
Changes in v1:
- No declarations in function body (Jakub)
- Don't touch output arguments until function succeeds (Jakub)
- Link to v0: https://lore.kernel.org/netdev/027fdb41-ee11-4be0-a493-22f28a1abd7c@rbox.co/
---
Michal Luczaj (6):
selftests/bpf: Support more socket types in create_pair()
selftests/bpf: Socket pair creation, cleanups
selftests/bpf: Simplify inet_socketpair() and vsock_socketpair_connectible()
selftests/bpf: Honour the sotype of af_unix redir tests
selftests/bpf: Exercise SOCK_STREAM unix_inet_redir_to_connected()
selftests/bpf: Introduce __attribute__((cleanup)) in create_pair()
.../selftests/bpf/prog_tests/sockmap_basic.c | 28 ++--
.../selftests/bpf/prog_tests/sockmap_helpers.h | 149 ++++++++++++++-------
.../selftests/bpf/prog_tests/sockmap_listen.c | 117 ++--------------
3 files changed, 124 insertions(+), 170 deletions(-)
---
base-commit: 92cc2456e9775dc4333fb4aa430763ae4ac2f2d9
change-id: 20240729-selftest-sockmap-fixes-bcca996e143b
Best regards,
--
Michal Luczaj <mhal(a)rbox.co>
This patch series was motivated by fixing a few bugs in the bonding
driver related to xfrm state migration on device failover.
struct xfrm_dev_offload has two net_device pointers: dev and real_dev.
The first one is the device the xfrm_state is offloaded on and the
second one is used by the bonding driver to manage the underlying device
xfrm_states are actually offloaded on. When bonding isn't used, the two
pointers are the same.
This causes confusion in drivers: Which device pointer should they use?
If they want to support bonding, they need to only use real_dev and
never look at dev.
Furthermore, real_dev is used without proper locking from multiple code
paths and changing it is dangerous. See commit [1] for example.
This patch series clears things out by removing all uses of real_dev
from outside the bonding driver.
Then, the bonding driver is refactored to fix a couple of long standing
races and the original bug which motivated this patch series.
[1] commit f8cde9805981 ("bonding: fix xfrm real_dev null pointer
dereference")
v1 -> v2:
Added missing kdoc for various functions.
Made bond_ipsec_del_sa() use xso.real_dev instead of curr_active_slave.
Cosmin Ratiu (6):
Cleaning up unnecessary uses of xso.real_dev:
net/mlx5: Avoid using xso.real_dev unnecessarily
xfrm: Use xdo.dev instead of xdo.real_dev
xfrm: Remove unneeded device check from validate_xmit_xfrm
Refactoring device operations to get an explicit device pointer:
xfrm: Add explicit dev to .xdo_dev_state_{add,delete,free}
Fixing a bonding xfrm state migration bug:
bonding: Mark active offloaded xfrm_states
Fixing long standing races in bonding:
bonding: Fix multiple long standing offload races
Documentation/networking/xfrm_device.rst | 10 +-
drivers/net/bonding/bond_main.c | 113 +++++++++---------
.../net/ethernet/chelsio/cxgb4/cxgb4_main.c | 20 ++--
.../inline_crypto/ch_ipsec/chcr_ipsec.c | 18 ++-
.../net/ethernet/intel/ixgbe/ixgbe_ipsec.c | 41 ++++---
drivers/net/ethernet/intel/ixgbevf/ipsec.c | 21 ++--
.../marvell/octeontx2/nic/cn10k_ipsec.c | 18 +--
.../mellanox/mlx5/core/en_accel/ipsec.c | 28 ++---
.../mellanox/mlx5/core/en_accel/ipsec.h | 1 +
.../net/ethernet/netronome/nfp/crypto/ipsec.c | 11 +-
drivers/net/netdevsim/ipsec.c | 15 ++-
include/linux/netdevice.h | 10 +-
include/net/xfrm.h | 8 ++
net/xfrm/xfrm_device.c | 13 +-
net/xfrm/xfrm_state.c | 16 +--
15 files changed, 182 insertions(+), 161 deletions(-)
--
2.45.0
Nolibc is useful for selftests as the test programs can be very small,
and compiled with just a kernel crosscompiler, without userspace support.
Currently nolibc is only usable with kselftest.h, not the more
convenient to use kselftest_harness.h
This series provides this compatibility by adding new features to nolibc
and removing the usage of problematic features from the harness.
The first half of the series are changes to the harness, the second one
are for nolibc. Both parts are very independent and should go through
different trees.
The last patch is not meant to be applied and serves as test that
everything works together correctly.
Signed-off-by: Thomas Weißschuh <thomas.weissschuh(a)linutronix.de>
---
Changes in v2:
- Rebase unto v6.15-rc1
- Rename internal nolibc symbols
- Handle edge case of waitpid(INT_MIN) == ESRCH
- Fix arm configurations for final testing patch
- Clean up global getopt.h variable declarations
- Add Acks from Willy
- Link to v1: https://lore.kernel.org/r/20250304-nolibc-kselftest-harness-v1-0-adca7cd231…
---
Thomas Weißschuh (32):
selftests: harness: Add harness selftest
selftests: harness: Use C89 comment style
selftests: harness: Ignore unused variant argument warning
selftests: harness: Mark functions without prototypes static
selftests: harness: Remove inline qualifier for wrappers
selftests: harness: Remove dependency on libatomic
selftests: harness: Implement test timeouts through pidfd
selftests: harness: Don't set setup_completed for fixtureless tests
selftests: harness: Always provide "self" and "variant"
selftests: harness: Move teardown conditional into test metadata
selftests: harness: Add teardown callback to test metadata
selftests: harness: Stop using setjmp()/longjmp()
selftests: harness: Guard includes on nolibc
tools/nolibc: handle intmax_t/uintmax_t in printf
tools/nolibc: use intmax definitions from compiler
tools/nolibc: use pselect6_time64 if available
tools/nolibc: use ppoll_time64 if available
tools/nolibc: add tolower() and toupper()
tools/nolibc: add _exit()
tools/nolibc: add setpgrp()
tools/nolibc: implement waitpid() in terms of waitid()
Revert "selftests/nolibc: use waitid() over waitpid()"
tools/nolibc: add dprintf() and vdprintf()
tools/nolibc: add getopt()
tools/nolibc: allow different write callbacks in printf
tools/nolibc: allow limiting of printf destination size
tools/nolibc: add snprintf() and friends
selftests/nolibc: use snprintf() for printf tests
selftests/nolibc: rename vfprintf test suite
selftests/nolibc: add test for snprintf() truncation
tools/nolibc: implement width padding in printf()
HACK: selftests/nolibc: demonstrate usage of the kselftest harness
tools/include/nolibc/Makefile | 1 +
tools/include/nolibc/getopt.h | 101 ++
tools/include/nolibc/nolibc.h | 1 +
tools/include/nolibc/stdint.h | 4 +-
tools/include/nolibc/stdio.h | 127 +-
tools/include/nolibc/string.h | 17 +
tools/include/nolibc/sys.h | 105 +-
tools/testing/selftests/Makefile | 1 +
tools/testing/selftests/kselftest/.gitignore | 1 +
tools/testing/selftests/kselftest/Makefile | 6 +
.../testing/selftests/kselftest/harness-selftest.c | 129 ++
.../selftests/kselftest/harness-selftest.expected | 62 +
.../selftests/kselftest/harness-selftest.sh | 14 +
tools/testing/selftests/kselftest_harness.h | 181 +-
tools/testing/selftests/nolibc/Makefile | 13 +-
tools/testing/selftests/nolibc/harness-selftest.c | 1 +
tools/testing/selftests/nolibc/nolibc-test.c | 1729 +-------------------
tools/testing/selftests/nolibc/run-tests.sh | 2 +-
18 files changed, 635 insertions(+), 1860 deletions(-)
---
base-commit: 0af2f6be1b4281385b618cb86ad946eded089ac8
change-id: 20250130-nolibc-kselftest-harness-8b2c8cac43bf
Best regards,
--
Thomas Weißschuh <thomas.weissschuh(a)linutronix.de>
Call cs_dsp_mock_xm_header_get_fw_version() to get the firmware version
from the dummy XM header data in cs_dsp_bin_err_test_common_init().
Make the same change to cs_dsp_bin_test_common_init() and remove the
cs_dsp_mock_xm_header_get_fw_version_from_regmap() function.
The code in cs_dsp_test_bin.c was correctly calling
cs_dsp_mock_xm_header_get_fw_version_from_regmap() to fetch the fw version
from a dummy header it wrote to XM registers. However in
cs_dsp_test_bin_error.c the test doesn't stuff a dummy header into XM, it
populates it the normal way using a wmfw file. It should have called
cs_dsp_mock_xm_header_get_fw_version() to get the data from its blob
buffer, but was calling cs_dsp_mock_xm_header_get_fw_version_from_regmap().
As nothing had been written to the registers this returned the value of
uninitialized data.
The only other use of cs_dsp_mock_xm_header_get_fw_version_from_regmap()
was cs_dsp_test_bin.c, but it doesn't need to use it. It already has a
blob buffer containing the dummy XM header so it can use
cs_dsp_mock_xm_header_get_fw_version() to read from that.
Fixes: cd8c058499b6 ("firmware: cs_dsp: Add KUnit testing of bin error cases")
Signed-off-by: Richard Fitzgerald <rf(a)opensource.cirrus.com>
---
.../cirrus/test/cs_dsp_mock_mem_maps.c | 30 -------------------
.../firmware/cirrus/test/cs_dsp_test_bin.c | 2 +-
.../cirrus/test/cs_dsp_test_bin_error.c | 2 +-
.../linux/firmware/cirrus/cs_dsp_test_utils.h | 1 -
4 files changed, 2 insertions(+), 33 deletions(-)
diff --git a/drivers/firmware/cirrus/test/cs_dsp_mock_mem_maps.c b/drivers/firmware/cirrus/test/cs_dsp_mock_mem_maps.c
index 161272e47bda..73412bcef50c 100644
--- a/drivers/firmware/cirrus/test/cs_dsp_mock_mem_maps.c
+++ b/drivers/firmware/cirrus/test/cs_dsp_mock_mem_maps.c
@@ -461,36 +461,6 @@ unsigned int cs_dsp_mock_xm_header_get_alg_base_in_words(struct cs_dsp_test *pri
}
EXPORT_SYMBOL_NS_GPL(cs_dsp_mock_xm_header_get_alg_base_in_words, "FW_CS_DSP_KUNIT_TEST_UTILS");
-/**
- * cs_dsp_mock_xm_header_get_fw_version_from_regmap() - Firmware version.
- *
- * @priv: Pointer to struct cs_dsp_test.
- *
- * Return: Firmware version word value.
- */
-unsigned int cs_dsp_mock_xm_header_get_fw_version_from_regmap(struct cs_dsp_test *priv)
-{
- unsigned int xm = cs_dsp_mock_base_addr_for_mem(priv, WMFW_ADSP2_XM);
- union {
- struct wmfw_id_hdr adsp2;
- struct wmfw_v3_id_hdr halo;
- } hdr;
-
- switch (priv->dsp->type) {
- case WMFW_ADSP2:
- regmap_raw_read(priv->dsp->regmap, xm, &hdr.adsp2, sizeof(hdr.adsp2));
- return be32_to_cpu(hdr.adsp2.ver);
- case WMFW_HALO:
- regmap_raw_read(priv->dsp->regmap, xm, &hdr.halo, sizeof(hdr.halo));
- return be32_to_cpu(hdr.halo.ver);
- default:
- KUNIT_FAIL(priv->test, NULL);
- return 0;
- }
-}
-EXPORT_SYMBOL_NS_GPL(cs_dsp_mock_xm_header_get_fw_version_from_regmap,
- "FW_CS_DSP_KUNIT_TEST_UTILS");
-
/**
* cs_dsp_mock_xm_header_get_fw_version() - Firmware version.
*
diff --git a/drivers/firmware/cirrus/test/cs_dsp_test_bin.c b/drivers/firmware/cirrus/test/cs_dsp_test_bin.c
index 1e161bbc5b4a..163b7faecff4 100644
--- a/drivers/firmware/cirrus/test/cs_dsp_test_bin.c
+++ b/drivers/firmware/cirrus/test/cs_dsp_test_bin.c
@@ -2198,7 +2198,7 @@ static int cs_dsp_bin_test_common_init(struct kunit *test, struct cs_dsp *dsp)
priv->local->bin_builder =
cs_dsp_mock_bin_init(priv, 1,
- cs_dsp_mock_xm_header_get_fw_version_from_regmap(priv));
+ cs_dsp_mock_xm_header_get_fw_version(xm_hdr));
KUNIT_ASSERT_NOT_ERR_OR_NULL(test, priv->local->bin_builder);
/* We must provide a dummy wmfw to load */
diff --git a/drivers/firmware/cirrus/test/cs_dsp_test_bin_error.c b/drivers/firmware/cirrus/test/cs_dsp_test_bin_error.c
index 8748874f0552..a7ec956d2724 100644
--- a/drivers/firmware/cirrus/test/cs_dsp_test_bin_error.c
+++ b/drivers/firmware/cirrus/test/cs_dsp_test_bin_error.c
@@ -451,7 +451,7 @@ static int cs_dsp_bin_err_test_common_init(struct kunit *test, struct cs_dsp *ds
local->bin_builder =
cs_dsp_mock_bin_init(priv, 1,
- cs_dsp_mock_xm_header_get_fw_version_from_regmap(priv));
+ cs_dsp_mock_xm_header_get_fw_version(local->xm_header));
KUNIT_ASSERT_NOT_ERR_OR_NULL(test, local->bin_builder);
/* Init cs_dsp */
diff --git a/include/linux/firmware/cirrus/cs_dsp_test_utils.h b/include/linux/firmware/cirrus/cs_dsp_test_utils.h
index 4f87a908ab4f..ecd821ed8064 100644
--- a/include/linux/firmware/cirrus/cs_dsp_test_utils.h
+++ b/include/linux/firmware/cirrus/cs_dsp_test_utils.h
@@ -104,7 +104,6 @@ unsigned int cs_dsp_mock_num_dsp_words_to_num_packed_regs(unsigned int num_dsp_w
unsigned int cs_dsp_mock_xm_header_get_alg_base_in_words(struct cs_dsp_test *priv,
unsigned int alg_id,
int mem_type);
-unsigned int cs_dsp_mock_xm_header_get_fw_version_from_regmap(struct cs_dsp_test *priv);
unsigned int cs_dsp_mock_xm_header_get_fw_version(struct cs_dsp_mock_xm_header *header);
void cs_dsp_mock_xm_header_drop_from_regmap_cache(struct cs_dsp_test *priv);
int cs_dsp_mock_xm_header_write_to_regmap(struct cs_dsp_mock_xm_header *header);
--
2.39.5
On ARM64, when running with --configs '36*SRCU-P', I noticed that only 1 instance
instead of 36 for starting.
Fix it by checking for Image files, instead of bzImage which ARM does
not seem to have. With this I see all 36 instances running at the same
time in the batch.
Signed-off-by: Joel Fernandes <joelagnelf(a)nvidia.com>
---
tools/testing/selftests/rcutorture/bin/kvm-test-1-run.sh | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/tools/testing/selftests/rcutorture/bin/kvm-test-1-run.sh b/tools/testing/selftests/rcutorture/bin/kvm-test-1-run.sh
index ad79784e552d..957800c9ffba 100755
--- a/tools/testing/selftests/rcutorture/bin/kvm-test-1-run.sh
+++ b/tools/testing/selftests/rcutorture/bin/kvm-test-1-run.sh
@@ -73,7 +73,7 @@ config_override_param "$config_dir/CFcommon.$(uname -m)" KcList \
cp $T/KcList $resdir/ConfigFragment
base_resdir=`echo $resdir | sed -e 's/\.[0-9]\+$//'`
-if test "$base_resdir" != "$resdir" && test -f $base_resdir/bzImage && test -f $base_resdir/vmlinux
+if test "$base_resdir" != "$resdir" && (test -f $base_resdir/bzImage || test -f $base_resdir/Image) && test -f $base_resdir/vmlinux
then
# Rerunning previous test, so use that test's kernel.
QEMU="`identify_qemu $base_resdir/vmlinux`"
--
2.43.0
I was writing a benchmark based on sockmap + TCP and discovered several
issues:
1. When EAGAIN occurs, the direction of skb is incorrect, causing data
loss when retry.
2. When sending partial data, the offset is not recorded, leading to
duplicate data being sent when retry.
3. An unexpected BUG_ON() judgment in skb_linearize is triggered.
4. The memory of psock->ingress_skb is not limited by the socket buffer
and memcg.
Issues 1, 2, and 3 are described in each patch's commit message.
Regarding issue 4, this patchset does not cover it as it is difficult to
handle in practice, and I am still working on it.
Here is a brief description of the issue:
When using sockmap to skb/stream redirect, if the receiving end does not
perform read operations, all data will be buffered in ingress_skb.
For example:
'''
// set memory limit to 50G
cgcreate -g memory:myGroup
cgset -r memory.max="5000M" myGroup
// start benchmark and disable consumer from reading
cgexec -g "memory:myGroup" ./bench sockmap -c 2 -p 1 -a --rx-verdict-ingress --delay-consumer=-1 -d 100
Iter 0 ( 29.179us): Send Speed 2668.548 MB/s (20360.406 calls/s), ... Rcv Speed 0.000 MB/s ( 0.000 calls/s)
Iter 1 ( -7.237us): Send Speed 2694.467 MB/s (20557.149 calls/s), ... Rcv Speed 0.000 MB/s ( 0.000 calls/s)
Iter 2 ( -1.918us): Send Speed 2693.404 MB/s (20548.039 calls/s), ... Rcv Speed 0.000 MB/s ( 0.000 calls/s)
Iter 3 ( -0.684us): Send Speed 2693.138 MB/s (20548.014 calls/s), ... Rcv Speed 0.000 MB/s ( 0.000 calls/s)
Iter 4 ( 7.879us): Send Speed 2698.620 MB/s (20588.838 calls/s), ... Rcv Speed 0.000 MB/s ( 0.000 calls/s)
Iter 5 ( -3.224us): Send Speed 2696.553 MB/s (20573.066 calls/s), ... Rcv Speed 0.000 MB/s ( 0.000 calls/s)
Iter 6 ( -5.409us): Send Speed 2699.705 MB/s (20597.111 calls/s), ... Rcv Speed 0.000 MB/s ( 0.000 calls/s)
Iter 7 ( -0.439us): Send Speed 2699.691 MB/s (20597.009 calls/s), ... Rcv Speed 0.000 MB/s ( 0.000 calls/s)
...
// memory usage are not limited
cat /proc/slabinfo | grep skb
skbuff_small_head 11824024 11824024 704 46 8 : tunables 0 0 0 : slabdata 257044 257044 0
skbuff_fclone_cache 11822080 11822080 512 32 4 : tunables 0 0 0 : slabdata 369440 369440 0
'''
Thus, a simple socket in a large file upload/download model can eat the
entire OS memory.
We must charge the skb memory to psock->sk, and if we do not want losing
skb, we need to feedback the error info to read_sock/read_skb when the
enqueue operation of psock->ingress_skb fails.
---
My another patch related to stability also requires maintainers to spare
some time from their busy schedules for review.
https://lore.kernel.org/bpf/20250317092257.68760-1-jiayuan.chen@linux.dev/T…
Jiayuan Chen (4):
bpf, sockmap: Fix data lost during EAGAIN retries
bpf, sockmap: fix duplicated data transmission
bpf, sockmap: Fix panic when calling skb_linearize
selftest/bpf/benchs: Add benchmark for sockmap usage
net/core/skmsg.c | 48 +-
tools/testing/selftests/bpf/Makefile | 2 +
tools/testing/selftests/bpf/bench.c | 4 +
.../selftests/bpf/benchs/bench_sockmap.c | 599 ++++++++++++++++++
.../selftests/bpf/progs/bench_sockmap_prog.c | 65 ++
5 files changed, 697 insertions(+), 21 deletions(-)
create mode 100644 tools/testing/selftests/bpf/benchs/bench_sockmap.c
create mode 100644 tools/testing/selftests/bpf/progs/bench_sockmap_prog.c
--
2.47.1
Syzkaller reported this issue [1].
The current sockmap has a dependency on sk_socket in both read and write
stages, but there is a possibility that sk->sk_socket is released during
the process, leading to panic situations. For a detailed reproduction,
please refer to the description in the v2:
https://lore.kernel.org/bpf/20250228055106.58071-1-jiayuan.chen@linux.dev/
The corresponding fix approaches are described in the commit messages of
each patch.
By the way, the current sockmap lacks statistical information, especially
global statistics, such as the number of successful or failed rx and tx
operations. These statistics cannot be obtained from the socket interface
itself.
These data will be of great help in troubleshooting issues and observing
sockmap behavior.
If the maintainer/reviewer does not object, I think we can provide these
statistical information in the future, either through proc/trace/bpftool.
[1] https://syzkaller.appspot.com/bug?extid=dd90a702f518e0eac072
---
v3 -> v4:
1. Rebase on -rc.
2. Incorporated valuable feedback from the v3 thread into the commit
message, making it easier to review.
https://lore.kernel.org/all/20250317092257.68760-3-jiayuan.chen@linux.dev/
v2 -> v3:
1. Michal Luczaj reported similar race issue under sockmap sending path.
2. Rcu lock is conflict with mutex_lock in unix socket read implementation.
https://lore.kernel.org/bpf/20250228055106.58071-1-jiayuan.chen@linux.dev/
v1 -> v2:
1. Add Fixes tag.
2. Extend selftest of edge case for TCP/UDP sockets.
3. Add Reviewed-by and Acked-by tag.
https://lore.kernel.org/bpf/20250226132242.52663-1-jiayuan.chen@linux.dev/T…
Jiayuan Chen (3):
bpf, sockmap: avoid using sk_socket after free when sending
bpf, sockmap: avoid using sk_socket after free when reading
selftests/bpf: Add edge case tests for sockmap
net/core/skmsg.c | 22 ++++++-
.../selftests/bpf/prog_tests/socket_helpers.h | 13 +++-
.../selftests/bpf/prog_tests/sockmap_basic.c | 60 +++++++++++++++++++
3 files changed, 91 insertions(+), 4 deletions(-)
--
2.47.1
We can reproduce the issue using the existing test program:
'./test_sockmap --ktls'
Or use the selftest I provided, which will cause a panic:
------------[ cut here ]------------
kernel BUG at lib/iov_iter.c:629!
PKRU: 55555554
Call Trace:
<TASK>
? die+0x36/0x90
? do_trap+0xdd/0x100
? iov_iter_revert+0x178/0x180
? iov_iter_revert+0x178/0x180
? do_error_trap+0x7d/0x110
? iov_iter_revert+0x178/0x180
? exc_invalid_op+0x50/0x70
? iov_iter_revert+0x178/0x180
? asm_exc_invalid_op+0x1a/0x20
? iov_iter_revert+0x178/0x180
? iov_iter_revert+0x5c/0x180
tls_sw_sendmsg_locked.isra.0+0x794/0x840
tls_sw_sendmsg+0x52/0x80
? inet_sendmsg+0x1f/0x70
__sys_sendto+0x1cd/0x200
? find_held_lock+0x2b/0x80
? syscall_trace_enter+0x140/0x270
? __lock_release.isra.0+0x5e/0x170
? find_held_lock+0x2b/0x80
? syscall_trace_enter+0x140/0x270
? lockdep_hardirqs_on_prepare+0xda/0x190
? ktime_get_coarse_real_ts64+0xc2/0xd0
__x64_sys_sendto+0x24/0x30
do_syscall_64+0x90/0x170
1. It looks like the issue started occurring after bpf being introduced to
ktls and later the addition of assertions to iov_iter has caused a panic.
If my fix tag is incorrect, please assist me in correcting the fix tag.
2. I make minimal changes for now, it's enough to make ktls work
correctly.
---
v1->v2: Added more content to the commit message
https://lore.kernel.org/all/20250123171552.57345-1-mrpre@163.com/#r
---
Jiayuan Chen (2):
bpf: fix ktls panic with sockmap
selftests/bpf: add ktls selftest
net/tls/tls_sw.c | 8 +-
.../selftests/bpf/prog_tests/sockmap_ktls.c | 174 +++++++++++++++++-
.../selftests/bpf/progs/test_sockmap_ktls.c | 26 +++
3 files changed, 205 insertions(+), 3 deletions(-)
create mode 100644 tools/testing/selftests/bpf/progs/test_sockmap_ktls.c
--
2.47.1
Hi Linus,
Please pull the following kselftest fixes update for Linux 6.15-rc2
Fixes tpm2, futex, and mincore tests. Creates a dedicated .gitignore
for tpm2
Details:
selftests: tpm2: test_smoke: use POSIX-conformant expression operator
selftests/futex: futex_waitv wouldblock test should fail
selftests: tpm2: create a dedicated .gitignore
selftests/mincore: Allow read-ahead pages to reach the end of the file
diff is attached.
thanks,
-- Shuah
----------------------------------------------------------------
The following changes since commit 0af2f6be1b4281385b618cb86ad946eded089ac8:
Linux 6.15-rc1 (2025-04-06 13:11:33 -0700)
are available in the Git repository at:
git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest tags/linux_kselftest-fixes-6.15-rc2
for you to fetch changes up to 197c1eaa7ba633a482ed7588eea6fd4aa57e08d4:
selftests/mincore: Allow read-ahead pages to reach the end of the file (2025-04-08 17:08:50 -0600)
----------------------------------------------------------------
linux_kselftest-fixes-6.15-rc2
Fixes tpm2, futex, and mincore tests. Creates a dedicated .gitignore
for tpm2
Details:
selftests: tpm2: test_smoke: use POSIX-conformant expression operator
selftests/futex: futex_waitv wouldblock test should fail
selftests: tpm2: create a dedicated .gitignore
selftests/mincore: Allow read-ahead pages to reach the end of the file
----------------------------------------------------------------
Ahmed Salem (1):
selftests: tpm2: test_smoke: use POSIX-conformant expression operator
Edward Liaw (1):
selftests/futex: futex_waitv wouldblock test should fail
Khaled Elnaggar (1):
selftests: tpm2: create a dedicated .gitignore
Qiuxu Zhuo (1):
selftests/mincore: Allow read-ahead pages to reach the end of the file
tools/testing/selftests/.gitignore | 1 -
tools/testing/selftests/futex/functional/futex_wait_wouldblock.c | 2 +-
tools/testing/selftests/mincore/mincore_selftest.c | 3 ---
tools/testing/selftests/tpm2/.gitignore | 3 +++
tools/testing/selftests/tpm2/test_smoke.sh | 2 +-
5 files changed, 5 insertions(+), 6 deletions(-)
create mode 100644 tools/testing/selftests/tpm2/.gitignore
----------------------------------------------------------------
On Tue, 8 Apr 2025 20:18:26 +0200 Matthieu Baerts wrote:
> On 02/04/2025 19:23, Stanislav Fomichev wrote:
> > Recent change [0] resulted in a "BUG: using __this_cpu_read() in
> > preemptible" splat [1]. PREEMPT kernels have additional requirements
> > on what can and can not run with/without preemption enabled.
> > Expose those constrains in the debug kernels.
>
> Good idea to suggest this to find more bugs!
>
> I did some quick tests on my side with our CI, and the MPTCP selftests
> seem to take a bit more time, but without impacting the results.
> Hopefully, there will be no impact in slower/busy environments :)
What kind of slow down do you see? I think we get up to 50% more time
spent in the longer tests. Not sure how bad is too bad.. I'm leaning
towards applying this to net-next and we can see if people running
on linux-next complain?
Let me CC kselftests, patch in question:
https://lore.kernel.org/all/20250402172305.1775226-1-sdf@fomichev.me/
A fix [1] came in that fixed the notrace_filter side of the subops processing
of the function graph tracer. When I started testing that fix, I discovered
that the many more functions were being enabled than were being traced.
The function graph infrastructure uses ftrace to hook to functions. It has
a single ftrace_ops to manage all the users of function graph. Each
individual user (tracing, bpf, fprobes, etc) has its own ftrace_ops to
track the functions it will have its callback called from. These
ftrace_ops are "subops" to the main ftrace_ops of the function graph
infrastructure.
Each ftrace_ops has a filter_hash and a notrace_hash that is defined as:
Only trace functions that are in the filter_hash but not in the
notrace_hash.
If the filter_hash is empty, it means to trace all functions.
If the notrace_hash is empty, it means do not disable any function.
The function graph main ftrace_ops needs to be a superset containing all
the functions to be traced by all the subops it has. The algorithm to
perform this merge was incorrect. It was merging the filter_hashes
of all the subops and taking the intersect of all the notrace_hashes
of the subops. But by taking the intersect of all the notrace_hashes
it ignored how those notrace_hashes are dependent on the associated
filter_hashes of each individual subops.
Instead, modify the algorithm to be a bit simpler and correct.
First, when adding a new subops, do not add the notrace_hash if the
filter_hash is not empty. Instead, just add the functions that are in the
filter_hash of the subops but not in the notrace_hash of the subops into the
main ops filter_hash. There's no reason to add anything to the main ops
notrace_hash for this case.
The notrace_hash of the main ops should only be non empty iff all subops
filter_hashes are empty (meaning to trace all functions) and all subops
notrace_hashes have the same functions.
That is, the main ops notrace_hash is empty if any subops filter_hash is
non empty.
The main ops notrace_hash only has content in it if all subops
filter_hashes are empty, and the content are only functions that intersect
all the subops notrace_hashes. If any subops notrace_hash is empty, then
so is the main ops notrace_hash.
[1] https://lore.kernel.org/all/20250408160258.48563-1-andybnac@gmail.com/
Steven Rostedt (2):
ftrace: Fix accounting of subop hashes
tracing/selftest: Add test to better test subops filtering of function graph
----
kernel/trace/ftrace.c | 314 ++++++++++++---------
.../ftrace/test.d/ftrace/fgraph-multi-filter.tc | 177 ++++++++++++
2 files changed, 354 insertions(+), 137 deletions(-)
create mode 100644 tools/testing/selftests/ftrace/test.d/ftrace/fgraph-multi-filter.tc
On 09/04/2025 3:24 pm, Mark Brown wrote:
> On Wed, Apr 09, 2025 at 11:45:44AM +0100, Richard Fitzgerald wrote:
>> Depend on SND_SOC_CS_AMP_LIB instead of selecting it.
>>
>> KUNIT_ALL_TESTS should only build tests for components that are
>> already being built, it should not cause other stuff to be added
>> to the build.
>
>> config SND_SOC_CS_AMP_LIB_TEST
>> - tristate "KUnit test for Cirrus Logic cs-amp-lib"
>> - depends on KUNIT
>> + tristate "KUnit test for Cirrus Logic cs-amp-lib" if !KUNIT_ALL_TESTS
>> + depends on SND_SOC_CS_AMP_LIB && KUNIT
>> default KUNIT_ALL_TESTS
>> - select SND_SOC_CS_AMP_LIB
>> help
>> This builds KUnit tests for the Cirrus Logic common
>> amplifier library.
>
> This by itself results in the Cirrus tests being removed from a kunit
> --alltests run which is a regression in coverage. I'd expect to see
> some corresponding updates in the KUnit all_tests.config to keep them
> enabled.
That's the defined behaviour of KUNIT_ALL_TESTS. It shouldn't have been
running as part of an alltests if nothing had selected it. That seems to
make people angry. Probably the same people who would complain if there
was a bug in the code that they didn't want to test.
This patch series was motivated by fixing a few bugs in the bonding
driver related to xfrm state migration on device failover.
struct xfrm_dev_offload has two net_device pointers: dev and real_dev.
The first one is the device the xfrm_state is offloaded on and the
second one is used by the bonding driver to manage the underlying device
xfrm_states are actually offloaded on. When bonding isn't used, the two
pointers are the same.
This causes confusion in drivers: Which device pointer should they use?
If they want to support bonding, they need to only use real_dev and
never look at dev.
Furthermore, real_dev is used without proper locking from multiple code
paths and changing it is dangerous. See commit [1] for example.
This patch series clears things out by removing all uses of real_dev
from outside the bonding driver.
Then, the bonding driver is refactored to fix a couple of long standing
races and the original bug which motivated this patch series.
[1] commit f8cde9805981 ("bonding: fix xfrm real_dev null pointer
dereference")
Cosmin Ratiu (6):
Cleaning up unnecessary uses of xso.real_dev:
net/mlx5: Avoid using xso.real_dev unnecessarily
xfrm: Use xdo.dev instead of xdo.real_dev
xfrm: Remove unneeded device check from validate_xmit_xfrm
Refactoring device operations to get an explicit device pointer:
xfrm: Add explicit dev to .xdo_dev_state_{add,delete,free}
Fixing a bonding xfrm state migration bug:
bonding: Mark active offloaded xfrm_states
Fixing long standing races in bonding:
bonding: Fix multiple long standing offload races
Documentation/networking/xfrm_device.rst | 10 +-
drivers/net/bonding/bond_main.c | 93 +++++++++++--------
.../net/ethernet/chelsio/cxgb4/cxgb4_main.c | 20 ++--
.../inline_crypto/ch_ipsec/chcr_ipsec.c | 18 ++--
.../net/ethernet/intel/ixgbe/ixgbe_ipsec.c | 40 ++++----
drivers/net/ethernet/intel/ixgbevf/ipsec.c | 20 ++--
.../marvell/octeontx2/nic/cn10k_ipsec.c | 18 ++--
.../mellanox/mlx5/core/en_accel/ipsec.c | 28 +++---
.../mellanox/mlx5/core/en_accel/ipsec.h | 1 +
.../net/ethernet/netronome/nfp/crypto/ipsec.c | 11 +--
drivers/net/netdevsim/ipsec.c | 15 ++-
include/linux/netdevice.h | 10 +-
include/net/xfrm.h | 8 ++
net/xfrm/xfrm_device.c | 13 +--
net/xfrm/xfrm_state.c | 16 ++--
15 files changed, 175 insertions(+), 146 deletions(-)
--
2.45.0
This patchset adds the base infrastructure for modular BPF verifier.
The motivation remains unchanged from the LSFMMBPF25 proposal [0].
However, the design has diverged. Rather than immediately going for the
facade described in [0], we instead make a stop first at the continously
exported copies of the verifier in an out-of-tree repository, with a
separate copy for each kernel release. Each copy will receive as many
verifier backports as possible within the "boundary" of the modular
portions.
For example, a patch that changes the verifier at the same time as one
of the kernel symbols it depends on cannot be applied, as at runtime
only the verifier portion can be updated. However, a patch that only
changes verifier.c can be applied, as it's within the boundary. Rough
analysis of past data shows that most verifier changes fall within the
latter category. The jupyter notebook for this can be found here [1].
From here, we'll gradually enlarge the "boundary" to enable backports of
more and more patches, with the north star being the facade as described
in the proposal. Ideally, completion of the facade will render the
out-of-tree repository useless.
[0]: https://lore.kernel.org/bpf/nahst74z46ov7ii3vmriyhk25zo6tkf2f3hsulzjzselvob…
[1]: https://github.com/danobi/verifier-analysis/blob/master/analysis.ipynb
Daniel Xu (13):
bpf: Move bpf_prog_ctx_arg_info_init() body into header
bpf: Move BTF related globals out of verifier.c
bpf: Move percpu memory allocator definition into core
bpf: Move bpf_check_attach_target() to core
bpf: Remove map_set_for_each_callback_args callback for maps
bpf: Move kfunc definitions out of verifier.c
bpf: Make bpf_free_kfunc_btf_tab() static in core
selftests: bpf: Avoid attaching to bpf_check()
perf: Export perf_snapshot_branch_stack static key
bpf: verifier: Add indirection to kallsyms_lookup_name()
treewide: bpf: Export symbols used by verifier
bpf: verifier: Make verifier loadable
bpf: Supporting building verifier.ko out-of-tree
arch/x86/net/bpf_jit_comp.c | 2 +
drivers/media/rc/bpf-lirc.c | 1 +
fs/bpf_fs_kfuncs.c | 4 +
include/linux/bpf.h | 82 ++-
include/linux/bpf_verifier.h | 7 -
include/linux/btf.h | 4 +
kernel/bpf/Kbuild | 8 +
kernel/bpf/Kconfig | 12 +
kernel/bpf/Makefile | 3 +-
kernel/bpf/arraymap.c | 2 -
kernel/bpf/bpf_iter.c | 1 +
kernel/bpf/bpf_lsm.c | 5 +
kernel/bpf/bpf_struct_ops.c | 2 +
kernel/bpf/btf.c | 61 +-
kernel/bpf/cgroup.c | 4 +
kernel/bpf/core.c | 463 ++++++++++++++++
kernel/bpf/disasm.c | 4 +
kernel/bpf/hashtab.c | 4 -
kernel/bpf/helpers.c | 2 +
kernel/bpf/local_storage.c | 2 +
kernel/bpf/log.c | 12 +
kernel/bpf/map_iter.c | 1 +
kernel/bpf/memalloc.c | 3 +
kernel/bpf/offload.c | 10 +
kernel/bpf/syscall.c | 52 +-
kernel/bpf/tnum.c | 20 +
kernel/bpf/token.c | 1 +
kernel/bpf/trampoline.c | 5 +
kernel/bpf/verifier.c | 521 ++----------------
kernel/events/callchain.c | 3 +
kernel/events/core.c | 1 +
kernel/trace/bpf_trace.c | 9 +
lib/error-inject.c | 2 +
net/core/filter.c | 26 +
net/core/xdp.c | 2 +
net/netfilter/nf_bpf_link.c | 1 +
.../selftests/bpf/progs/exceptions_assert.c | 2 +-
.../selftests/bpf/progs/exceptions_fail.c | 4 +-
38 files changed, 834 insertions(+), 514 deletions(-)
create mode 100644 kernel/bpf/Kbuild
--
2.47.1
When running the mincore_selftest on a system with an XFS file system, it
failed the "check_file_mmap" test case due to the read-ahead pages reaching
the end of the file. The failure log is as below:
RUN global.check_file_mmap ...
mincore_selftest.c:264:check_file_mmap:Expected i (1024) < vec_size (1024)
mincore_selftest.c:265:check_file_mmap:Read-ahead pages reached the end of the file
check_file_mmap: Test failed
FAIL global.check_file_mmap
This is because the read-ahead window size of the XFS file system on this
machine is 4 MB, which is larger than the size from the #PF address to the
end of the file. As a result, all the pages for this file are populated.
blockdev --getra /dev/nvme0n1p5
8192
blockdev --getbsz /dev/nvme0n1p5
512
This issue can be fixed by extending the current FILE_SIZE 4MB to a larger
number, but it will still fail if the read-ahead window size of the file
system is larger enough. Additionally, in the real world, read-ahead pages
reaching the end of the file can happen and is an expected behavior.
Therefore, allowing read-ahead pages to reach the end of the file is a
better choice for the "check_file_mmap" test case.
Reported-by: Yi Lai <yi1.lai(a)intel.com>
Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo(a)intel.com>
---
tools/testing/selftests/mincore/mincore_selftest.c | 3 ---
1 file changed, 3 deletions(-)
diff --git a/tools/testing/selftests/mincore/mincore_selftest.c b/tools/testing/selftests/mincore/mincore_selftest.c
index e949a43a6145..efabfcbe0b49 100644
--- a/tools/testing/selftests/mincore/mincore_selftest.c
+++ b/tools/testing/selftests/mincore/mincore_selftest.c
@@ -261,9 +261,6 @@ TEST(check_file_mmap)
TH_LOG("No read-ahead pages found in memory");
}
- EXPECT_LT(i, vec_size) {
- TH_LOG("Read-ahead pages reached the end of the file");
- }
/*
* End of the readahead window. The rest of the pages shouldn't
* be in memory.
--
2.17.1
Hi Linus,
Please pull the following kunit fixes update for Linux 6.15-rc2
Fixes tool to report test count in case of a late test plan when tests
are specified before the test plan. Fixes spelling error in the commit
that went into 6.15-rc1.
diff is attached.
thanks,
-- Shuah
----------------------------------------------------------------
The following changes since commit 0af2f6be1b4281385b618cb86ad946eded089ac8:
Linux 6.15-rc1 (2025-04-06 13:11:33 -0700)
are available in the Git repository at:
git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest tags/linux_kselftest-kunit-6.15-rc2
for you to fetch changes up to d1be0cf3b8aeae75bc8fff5b7a3e01ebfe276008:
kunit: Spelling s/slowm/slow/ (2025-04-08 14:57:24 -0600)
----------------------------------------------------------------
linux_kselftest-kunit-6.15-rc2
Fixes tool to report test count in case of a late test plan when tests
are specified before the test plan. Fixes spelling error in the commit
that went into 6.15-rc1.
----------------------------------------------------------------
Geert Uytterhoeven (1):
kunit: Spelling s/slowm/slow/
Rae Moar (1):
kunit: tool: fix count of tests if late test plan
include/kunit/test.h | 2 +-
tools/testing/kunit/kunit_parser.py | 4 ++++
tools/testing/kunit/kunit_tool_test.py | 4 ++--
3 files changed, 7 insertions(+), 3 deletions(-)
----------------------------------------------------------------
Recently we had some issues in parallel TDC where some of IFE tests are
failing due to some of IFE's submodules (like act_meta_skbtcindex and
act_meta_skbprio) taking too long to load [1]. To avoid that issue,
pre-load IFE and all its submodules before running any of the tests in
tdc.sh
[1] https://lore.kernel.org/netdev/e909b2a0-244e-4141-9fa9-1b7d96ab7d71@mojatat…
Signed-off-by: Victor Nogueira <victor(a)mojatatu.com>
---
tools/testing/selftests/tc-testing/tdc.sh | 4 ++++
1 file changed, 4 insertions(+)
diff --git a/tools/testing/selftests/tc-testing/tdc.sh b/tools/testing/selftests/tc-testing/tdc.sh
index cddff1772e10..589b18ed758a 100755
--- a/tools/testing/selftests/tc-testing/tdc.sh
+++ b/tools/testing/selftests/tc-testing/tdc.sh
@@ -31,6 +31,10 @@ try_modprobe act_skbedit
try_modprobe act_skbmod
try_modprobe act_tunnel_key
try_modprobe act_vlan
+try_modprobe act_ife
+try_modprobe act_meta_mark
+try_modprobe act_meta_skbtcindex
+try_modprobe act_meta_skbprio
try_modprobe cls_basic
try_modprobe cls_bpf
try_modprobe cls_cgroup
--
2.49.0
Recently, during a debugging session using local MPTCP connections, I
noticed MPJoinAckHMacFailure was strangely not zero on the server side.
The first patch fixes this issue -- present since v5.9 -- and the second
one validates it in the selftests.
Signed-off-by: Matthieu Baerts (NGI0) <matttbe(a)kernel.org>
---
Matthieu Baerts (NGI0) (2):
mptcp: only inc MPJoinAckHMacFailure for HMAC failures
selftests: mptcp: validate MPJoin HMacFailure counters
net/mptcp/subflow.c | 8 ++++++--
tools/testing/selftests/net/mptcp/mptcp_join.sh | 18 ++++++++++++++++++
2 files changed, 24 insertions(+), 2 deletions(-)
---
base-commit: 61f96e684edd28ca40555ec49ea1555df31ba619
change-id: 20250407-net-mptcp-hmac-failure-mib-66f599305ff3
Best regards,
--
Matthieu Baerts (NGI0) <matttbe(a)kernel.org>
Testcase should fail if -EWOULDBLOCK is not returned when expected value
differs from actual value from the waiter.
Fixes: 9d57f7c79748920636f8293d2f01192d702fe390 ("selftests: futex: Test sys_futex_waitv() wouldblock")
Signed-off-by: Edward Liaw <edliaw(a)google.com>
---
.../testing/selftests/futex/functional/futex_wait_wouldblock.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/tools/testing/selftests/futex/functional/futex_wait_wouldblock.c b/tools/testing/selftests/futex/functional/futex_wait_wouldblock.c
index 7d7a6a06cdb7..2d8230da9064 100644
--- a/tools/testing/selftests/futex/functional/futex_wait_wouldblock.c
+++ b/tools/testing/selftests/futex/functional/futex_wait_wouldblock.c
@@ -98,7 +98,7 @@ int main(int argc, char *argv[])
info("Calling futex_waitv on f1: %u @ %p with val=%u\n", f1, &f1, f1+1);
res = futex_waitv(&waitv, 1, 0, &to, CLOCK_MONOTONIC);
if (!res || errno != EWOULDBLOCK) {
- ksft_test_result_pass("futex_waitv returned: %d %s\n",
+ ksft_test_result_fail("futex_waitv returned: %d %s\n",
res ? errno : res,
res ? strerror(errno) : "");
ret = RET_FAIL;
--
2.49.0.504.g3bcea36a83-goog
Add a script to test various scenarios where a bridge is involved
in the fastpath. It runs tests in the forward path, and also in
a bridged path.
The setup is similar to a basic home router with multiple lan ports.
It uses 3 pairs of veth-devices. Each or all pairs can be
replaced by a pair of real interfaces, interconnected by wire.
This is necessary to test the behavior when dealing with
dsa ports, foreign (dsa) ports and switchdev userports that support
SWITCHDEV_OBJ_ID_PORT_VLAN.
See the head of the script for a detailed description.
Run without arguments to perform all tests on veth-devices.
Signed-off-by: Eric Woudstra <ericwouds(a)gmail.com>
---
This test script is written first for the proposed bridge-fastpath
patch-sets, but it's use is more general and can easily be expanded.
Because the development of this script has helped me find and fix a
few issues in my last version of the patches needed for bridge-fastpath,
I am sending the whole set again (split up in smaller patch-sets),
including the latest fixes.
Some example outputs of this last version of patches from different
hardware, without and with patches:
ALL VETH:
=========
./bridge_fastpath.sh -t
Setup:
CLIENT 0
veth0cl
|
veth0rt
WAN
ROUTER
LAN1 LAN2
veth1rt veth2rt
| |
veth1cl veth2cl
CLIENT 1 CLIENT 2
Without patches:
PASS: unaware bridge, without encaps, without fastpath
PASS: unaware bridge, with single vlan encap, without fastpath
ERROR: unaware bridge, with double q vlan encaps, without fastpath: ipv4/6: established bytes 0 < 4194304
ERROR: unaware bridge, with 802.1ad vlan encaps, without fastpath: ipv4/6: established bytes 0 < 4194304
PASS: aware bridge, without/without vlan encap, without fastpath
PASS: aware bridge, with/without vlan encap, without fastpath
PASS: aware bridge, with/with vlan encap, without fastpath
PASS: aware bridge, without/with vlan encap, without fastpath
PASS: forward, without vlan-device, without vlan encap, client1, without fastpath
PASS: forward, without vlan-device, without vlan encap, client1, with fastpath
PASS: forward, without vlan-device, with vlan encap, client1, without fastpath
ERROR: forward, without vlan-device, with vlan encap, client1, with fastpath: ipv4/6: tcp broken
PASS: forward, with vlan-device, with vlan encap, client1, without fastpath
PASS: forward, with vlan-device, with vlan encap, client1, with fastpath
PASS: forward, with vlan-device, without vlan encap, client1, without fastpath
PASS: forward, with vlan-device, without vlan encap, client1, with fastpath
ERROR: bridge fastpath test has failed
With patches:
PASS: unaware bridge, without encaps, without fastpath
PASS: unaware bridge, without encaps, with fastpath
PASS: unaware bridge, with single vlan encap, without fastpath
PASS: unaware bridge, with single vlan encap, with fastpath
PASS: unaware bridge, with double q vlan encaps, without fastpath
PASS: unaware bridge, with double q vlan encaps, with fastpath
PASS: unaware bridge, with 802.1ad vlan encaps, without fastpath
PASS: unaware bridge, with 802.1ad vlan encaps, with fastpath
PASS: aware bridge, without/without vlan encap, without fastpath
PASS: aware bridge, without/without vlan encap, with fastpath
PASS: aware bridge, with/without vlan encap, without fastpath
PASS: aware bridge, with/without vlan encap, with fastpath
PASS: aware bridge, with/with vlan encap, without fastpath
PASS: aware bridge, with/with vlan encap, with fastpath
PASS: aware bridge, without/with vlan encap, without fastpath
PASS: aware bridge, without/with vlan encap, with fastpath
PASS: forward, without vlan-device, without vlan encap, client1, without fastpath
PASS: forward, without vlan-device, without vlan encap, client1, with fastpath
PASS: forward, without vlan-device, with vlan encap, client1, without fastpath
PASS: forward, without vlan-device, with vlan encap, client1, with fastpath
PASS: forward, with vlan-device, with vlan encap, client1, without fastpath
PASS: forward, with vlan-device, with vlan encap, client1, with fastpath
PASS: forward, with vlan-device, without vlan encap, client1, without fastpath
PASS: forward, with vlan-device, without vlan encap, client1, with fastpath
PASS: all tests passed
BANANAPI-R3 (lan1 & lan2 are dsa):
============
Without patches:
./bridge_fastpath.sh -t -0 enu1u2,lan2 -1 enu1u1,lan1 -2 lan4,eth1
Setup:
CLIENT 0
enu1u2
|
lan2
WAN
ROUTER
LAN1 LAN2
lan1 eth1
| |
enu1u1 lan4
CLIENT 1 CLIENT 2
PASS: unaware bridge, without encaps, without fastpath
PASS: unaware bridge, with single vlan encap, without fastpath
PASS: aware bridge, without/without vlan encap, without fastpath
PASS: aware bridge, with/without vlan encap, without fastpath
PASS: aware bridge, with/with vlan encap, without fastpath
PASS: aware bridge, without/with vlan encap, without fastpath
PASS: forward, without vlan-device, without vlan encap, client1, without fastpath
ERROR: forward, without vlan-device, without vlan encap, client1, with fastpath: ipv4: counted bytes 2118540 > 2097152
ERROR: forward, without vlan-device, without vlan encap, client1, with fastpath: ipv6: counted bytes 2117904 > 2097152
PASS: forward, without vlan-device, without vlan encap, client2, without fastpath
PASS: forward, without vlan-device, without vlan encap, client2, with fastpath
PASS: forward, without vlan-device, without vlan encap, client2, with hw_fastpath
PASS: forward, without vlan-device, with vlan encap, client1, without fastpath
ERROR: forward, without vlan-device, with vlan encap, client1, with fastpath: ipv4/6: tcp broken
PASS: forward, without vlan-device, with vlan encap, client2, without fastpath
ERROR: forward, without vlan-device, with vlan encap, client2, with fastpath: ipv4/6: tcp broken
PASS: forward, with vlan-device, with vlan encap, client1, without fastpath
PASS: forward, with vlan-device, with vlan encap, client1, with fastpath
PASS: forward, with vlan-device, with vlan encap, client1, with hw_fastpath
PASS: forward, with vlan-device, with vlan encap, client2, without fastpath
PASS: forward, with vlan-device, with vlan encap, client2, with fastpath
PASS: forward, with vlan-device, with vlan encap, client2, with hw_fastpath
PASS: forward, with vlan-device, without vlan encap, client1, without fastpath
PASS: forward, with vlan-device, without vlan encap, client1, with fastpath
PASS: forward, with vlan-device, without vlan encap, client1, with hw_fastpath
PASS: forward, with vlan-device, without vlan encap, client2, without fastpath
ERROR: forward, with vlan-device, without vlan encap, client2, with fastpath: ipv4: counted bytes 2109596 > 2097152
ERROR: forward, with vlan-device, without vlan encap, client2, with fastpath: ipv6: counted bytes 2121432 > 2097152
ERROR: bridge fastpath test has failed
With patches:
PASS: unaware bridge, without encaps, without fastpath
PASS: unaware bridge, without encaps, with fastpath
PASS: unaware bridge, without encaps, with hw_fastpath
PASS: unaware bridge, with single vlan encap, without fastpath
PASS: unaware bridge, with single vlan encap, with fastpath
PASS: unaware bridge, with single vlan encap, with hw_fastpath
PASS: aware bridge, without/without vlan encap, without fastpath
PASS: aware bridge, without/without vlan encap, with fastpath
PASS: aware bridge, without/without vlan encap, with hw_fastpath
PASS: aware bridge, with/without vlan encap, without fastpath
PASS: aware bridge, with/without vlan encap, with fastpath
PASS: aware bridge, with/without vlan encap, with hw_fastpath
PASS: aware bridge, with/with vlan encap, without fastpath
PASS: aware bridge, with/with vlan encap, with fastpath
PASS: aware bridge, with/with vlan encap, with hw_fastpath
PASS: aware bridge, without/with vlan encap, without fastpath
PASS: aware bridge, without/with vlan encap, with fastpath
PASS: aware bridge, without/with vlan encap, with hw_fastpath
PASS: forward, without vlan-device, without vlan encap, client1, without fastpath
PASS: forward, without vlan-device, without vlan encap, client1, with fastpath
PASS: forward, without vlan-device, without vlan encap, client1, with hw_fastpath
PASS: forward, without vlan-device, without vlan encap, client2, without fastpath
PASS: forward, without vlan-device, without vlan encap, client2, with fastpath
PASS: forward, without vlan-device, without vlan encap, client2, with hw_fastpath
PASS: forward, without vlan-device, with vlan encap, client1, without fastpath
PASS: forward, without vlan-device, with vlan encap, client1, with fastpath
PASS: forward, without vlan-device, with vlan encap, client1, with hw_fastpath
PASS: forward, without vlan-device, with vlan encap, client2, without fastpath
PASS: forward, without vlan-device, with vlan encap, client2, with fastpath
PASS: forward, without vlan-device, with vlan encap, client2, with hw_fastpath
PASS: forward, with vlan-device, with vlan encap, client1, without fastpath
PASS: forward, with vlan-device, with vlan encap, client1, with fastpath
PASS: forward, with vlan-device, with vlan encap, client1, with hw_fastpath
PASS: forward, with vlan-device, with vlan encap, client2, without fastpath
PASS: forward, with vlan-device, with vlan encap, client2, with fastpath
PASS: forward, with vlan-device, with vlan encap, client2, with hw_fastpath
PASS: forward, with vlan-device, without vlan encap, client1, without fastpath
PASS: forward, with vlan-device, without vlan encap, client1, with fastpath
PASS: forward, with vlan-device, without vlan encap, client1, with hw_fastpath
PASS: forward, with vlan-device, without vlan encap, client2, without fastpath
PASS: forward, with vlan-device, without vlan encap, client2, with fastpath
PASS: forward, with vlan-device, without vlan encap, client2, with hw_fastpath
PASS: all tests passed
AM3359 (end1 supports SWITCHDEV_OBJ_ID_PORT_VLAN, ipv4 only for now):
=======
./bridge_fastpath.sh -t -a -4 -d -1 enu1u4c2,end1
Without patches:
Setup:
CLIENT 0
veth0cl
|
veth0rt
WAN
ROUTER
LAN1 LAN2
end1 veth2rt
| |
enu1u4c2 veth2cl
CLIENT 1 CLIENT 2
INFO: Skipping unaware bridge
PASS: aware bridge, without/without vlan encap, without fastpath
PASS: aware bridge, with/without vlan encap, without fastpath
PASS: aware bridge, with/with vlan encap, without fastpath
PASS: aware bridge, without/with vlan encap, without fastpath
PASS: forward, without vlan-device, without vlan encap, client1, without fastpath
ERROR: forward, without vlan-device, without vlan encap, client1, with fastpath: ipv4: counted bytes 2190092 > 2097152
PASS: forward, without vlan-device, without vlan encap, client2, without fastpath
PASS: forward, without vlan-device, without vlan encap, client2, with fastpath
PASS: forward, without vlan-device, with vlan encap, client1, without fastpath
ERROR: forward, without vlan-device, with vlan encap, client1, with fastpath: ipv4: tcp broken
PASS: forward, without vlan-device, with vlan encap, client2, without fastpath
ERROR: forward, without vlan-device, with vlan encap, client2, with fastpath: ipv4: tcp broken
PASS: forward, with vlan-device, with vlan encap, client1, without fastpath
PASS: forward, with vlan-device, with vlan encap, client1, with fastpath
PASS: forward, with vlan-device, with vlan encap, client2, without fastpath
PASS: forward, with vlan-device, with vlan encap, client2, with fastpath
PASS: forward, with vlan-device, without vlan encap, client1, without fastpath
PASS: forward, with vlan-device, without vlan encap, client1, with fastpath
PASS: forward, with vlan-device, without vlan encap, client2, without fastpath
PASS: forward, with vlan-device, without vlan encap, client2, with fastpath
ERROR: bridge fastpath test has failed
With patches:
INFO: Skipping unaware bridge
PASS: aware bridge, without/without vlan encap, without fastpath
PASS: aware bridge, without/without vlan encap, with fastpath
PASS: aware bridge, with/without vlan encap, without fastpath
PASS: aware bridge, with/without vlan encap, with fastpath
PASS: aware bridge, with/with vlan encap, without fastpath
PASS: aware bridge, with/with vlan encap, with fastpath
PASS: aware bridge, without/with vlan encap, without fastpath
PASS: aware bridge, without/with vlan encap, with fastpath
PASS: forward, without vlan-device, without vlan encap, client1, without fastpath
PASS: forward, without vlan-device, without vlan encap, client1, with fastpath
PASS: forward, without vlan-device, without vlan encap, client2, without fastpath
PASS: forward, without vlan-device, without vlan encap, client2, with fastpath
PASS: forward, without vlan-device, with vlan encap, client1, without fastpath
PASS: forward, without vlan-device, with vlan encap, client1, with fastpath
PASS: forward, without vlan-device, with vlan encap, client2, without fastpath
PASS: forward, without vlan-device, with vlan encap, client2, with fastpath
PASS: forward, with vlan-device, with vlan encap, client1, without fastpath
PASS: forward, with vlan-device, with vlan encap, client1, with fastpath
PASS: forward, with vlan-device, with vlan encap, client2, without fastpath
PASS: forward, with vlan-device, with vlan encap, client2, with fastpath
PASS: forward, with vlan-device, without vlan encap, client1, without fastpath
PASS: forward, with vlan-device, without vlan encap, client1, with fastpath
PASS: forward, with vlan-device, without vlan encap, client2, without fastpath
PASS: forward, with vlan-device, without vlan encap, client2, with fastpath
PASS: all tests passed
(Some problem still to figure out for my AM3359 hardware: On the second run
of the command the tcp traffic is ok on all tests ipv4. On the first run
the hardware is not setup correctly, some tests report broken tcp even
without fastpath. Also ipv6 tcp broken even on second run even without
fastpath. This may be a problem with my hardware or the test-script,
but anyway it shows the fastpath is functional)
.../testing/selftests/net/netfilter/Makefile | 1 +
.../net/netfilter/bridge_fastpath.sh | 922 ++++++++++++++++++
2 files changed, 923 insertions(+)
create mode 100755 tools/testing/selftests/net/netfilter/bridge_fastpath.sh
diff --git a/tools/testing/selftests/net/netfilter/Makefile b/tools/testing/selftests/net/netfilter/Makefile
index ffe161fac8b5..104dd9e5e02a 100644
--- a/tools/testing/selftests/net/netfilter/Makefile
+++ b/tools/testing/selftests/net/netfilter/Makefile
@@ -8,6 +8,7 @@ MNL_LDLIBS := $(shell $(HOSTPKG_CONFIG) --libs libmnl 2>/dev/null || echo -lmnl)
TEST_PROGS := br_netfilter.sh bridge_brouter.sh
TEST_PROGS += br_netfilter_queue.sh
+TEST_PROGS += bridge_fastpath.sh
TEST_PROGS += conntrack_dump_flush.sh
TEST_PROGS += conntrack_icmp_related.sh
TEST_PROGS += conntrack_ipip_mtu.sh
diff --git a/tools/testing/selftests/net/netfilter/bridge_fastpath.sh b/tools/testing/selftests/net/netfilter/bridge_fastpath.sh
new file mode 100755
index 000000000000..68e2f9e70951
--- /dev/null
+++ b/tools/testing/selftests/net/netfilter/bridge_fastpath.sh
@@ -0,0 +1,922 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+#
+# Check if conntrack, nft chain and fastpath is functional in setups
+# where a bridge is in the fastpath.
+#
+# Commandline options make it possible to use real ethernet pairs
+# instead of veth-device pairs. Any, or all, pairs can be tested using
+# real hardware pairs. This is can be useful to test dsa-ports,
+# switchdev (dsa) foreign ports and switchdev ports supporting
+# SWITCHDEV_OBJ_ID_PORT_VLAN.
+#
+# First tcp is tested. Conntrack and nft chain are tested using a counter.
+# When there is a fastpath possible between the interfaces then the
+# fastpath is also tested.
+# When there is a hardware offloaded fastpath possible between the
+# interfaces then the hardware offloaded path is also tested.
+#
+# Setup is as a typical router:
+#
+# nsclientwan
+# |
+# nsrt
+# | |
+# nsclient1 nsclient2
+#
+# Masquerading for ipv4 only.
+#
+# First check if a bridge table forward chain can be setup, skip
+# these tests if this is not possible.
+# Then check if a inet table forward chain can be setup, skip
+# these tests if this is not possible.
+#
+# Different setups of paths are tested that involve a bridge in the
+# fastpath. This can be in the forward-fastpath or in the bridge-fastpath.
+#
+# The first series, in the bridge-fastpath, using a vlan-unaware bridge.
+# Traffic with the following vlan-tags is checked:
+# - without vlan
+# - single vlan
+# - double q vlan (only on veth-devices)
+# - 802.1ad vlan (only on veth-devices)
+# - pppoe (when available)
+# - pppoe-in-q (when available)
+#
+# (double tag testing results in broken tcp traffic on most hardware,
+# in this test setup, use '-a' argument to test it anyway)
+# (pppoe testing takes place if pppd and pppoe-server are installed)
+#
+# The second series, in the bridge-fastpath, using a vlan-aware bridge.
+# Here we test all combinations of ingress/egress with or without single
+# vlan encaps.
+#
+# The third series, in the forward-fastpath, using a vlan-aware bridge,
+# without a vlan-device linked to the master port. We test the same combinations
+# of ingress/egress with or without single vlan encaps.
+#
+# The fourth series, in the forward-fastpath, using a vlan-aware bridge,
+# with a vlan-device linked to the master port. We test the same combinations
+# of ingress/egress with or without single vlan encaps.
+#
+# Note 1: Using dsa userports on both sides of eth-pairs client1 or client2
+# gives erratic and unpredictable results. Use, for example, an usb-eth device
+# on the client side to test a dsa-userport.
+#
+# Note 2: Testing the hardware offloaded fastpath, it is not checked if the
+# packets do not follow the software fastpath instead. A universal way to
+# check this should be added at some point.
+#
+# Mote 3: Some interfaces to test on the router side, are netns immutable.
+# Use the -d or --defaultnsrouter option so that the interfaces of the router
+# do not have to change netns. The router is build up in the default netns.
+#
+
+source lib.sh
+
+checktool "nft --version" "run test without nft"
+checktool "socat -h" "run test without socat"
+checktool "bridge -V" "run test without bridge"
+
+VID1=100
+VID2=101
+BRWAN=brwan
+BRLAN=brlan
+BRCL=brcl
+LINKUP_TIMEOUT=10
+PING_TIMEOUT=10
+SOCAT_TIMEOUT=10
+filesize=2 # MiB
+
+filein=$(mktemp)
+file1out=$(mktemp)
+file2out=$(mktemp)
+pppoeserveroptions=$(mktemp)
+pppoeserverpid=$(mktemp)
+
+setup_ns nsclientwan nsclientlan1 nsclientlan2
+
+ WAN=0 ; LAN1=1 ; LAN2=2 ; ADWAN=3 ; ADLAN=4
+nsa=( $nsclientwan $nsclientlan1 $nsclientlan2 ) # $nsrt $nsrt
+AD4=( '192.168.1.1' '192.168.2.101' '192.168.2.102' '192.168.1.2' '192.168.2.1' )
+AD6=( 'dead:1::1' 'dead:2::101' 'dead:2::102' 'dead:1::2' 'dead:2::1' )
+
+while [ "${1:-}" != '' ]; do
+ case "$1" in
+ '-0' | '--pairwan')
+ shift
+ vethcl[$WAN]="${1%,*}"
+ vethrt[$WAN]="${1#*,}"
+ ;;
+ '-1' | '--pairlan1')
+ shift
+ vethcl[$LAN1]="${1%,*}"
+ vethrt[$LAN1]="${1#*,}"
+ ;;
+ '-2' | '--pairlan2')
+ shift
+ vethcl[$LAN2]="${1%,*}"
+ vethrt[$LAN2]="${1#*,}"
+ ;;
+ '-s' | '--filesize')
+ shift
+ filesize=$1
+ ;;
+ '-4' | '--ipv4')
+ do_ipv4=1
+ ;;
+ '-6' | '--ipv6')
+ do_ipv6=1
+ ;;
+ '-a' | '--aware')
+ skip_unaware=1
+ ;;
+ '-n' | '--noskip')
+ noskip=1
+ ;;
+ '-d' | '--defaultnsrouter')
+ defaultnsrouter=1
+ ;;
+ '-f' | '--fixmac')
+ fixmac=1
+ ;;
+ '-t' | '--showtree')
+ showtree=1
+ ;;
+ *)
+ cat <<-EOF
+ Usage: $(basename $0) [OPTION]...
+ -0 --pairwan eth0cl,eth0rt pair of real interfaces to use on wan side
+ -1 --pairlan1 eth1cl,eth1rt pair of real interfaces to use on lan1 side
+ -2 --pairlan2 eth2cl,eth2rt pair of real interfaces to use on lan2 side
+ -s --filesize filesize to use for testing
+ -4|-6 --ipv4|--ipv6 test ipv4/6 only
+ -a --aware only test vlan aware bridge
+ -d --defaultnsrouter router in default network namespace, caution!
+ -f --fixmac change mac address when conflict found
+ -n --noskip also perform the normally skipped tests
+ -t --showtree show the tree of used interfaces
+ EOF
+ ;;
+ esac
+ shift
+done
+
+if [ -n "$defaultnsrouter" ]; then
+ nsrt="nsrt-$(mktemp -u XXXXXX)"
+ touch /var/run/netns/$nsrt
+ mount --bind /proc/1/ns/net /var/run/netns/$nsrt
+else
+ setup_ns nsrt
+fi
+nsa+=($nsrt $nsrt)
+
+cleanup() {
+ if [ -n "$defaultnsrouter" ]; then
+ umount /var/run/netns/$nsrt
+ rm -f /var/run/netns/$nsrt
+ fi
+ cleanup_all_ns
+ rm -f "$filein" "$file1out" "$file2out" "$pppoeserveroptions" "$pppoeserverpid"
+}
+
+trap cleanup EXIT
+
+head -c $(($filesize * 1024 * 1024)) < /dev/urandom > "$filein"
+
+check_mac()
+{
+ local ns=$1
+ local dev=$2
+ local othermacs=$3
+ local mac
+
+ mac=$(ip -net "$ns" -br link show dev "$dev" | \
+ grep -o -E '([[:xdigit:]]{1,2}:){5}[[:xdigit:]]{1,2}')
+
+ if [[ ! "$othermacs" =~ "$mac" ]]; then
+ echo $mac
+ return 0
+ fi
+ echo "WARN: Conflicting mac address $dev $mac" 1>&2
+
+ [ -z "$fixmac" ] && return 1
+
+ for (( j = 0 ; j < 10 ; j++ )); do
+ mac="${mac::6}$(printf %02x:%02x:%02x:%02x $(($RANDOM%256)) \
+ $(($RANDOM%256)) $(($RANDOM%256)) $(($RANDOM%256)))"
+ [[ "$othermacs" =~ "$mac" ]] && continue
+ echo $mac
+ ip -net "$ns" link set dev "$dev" address "$mac" 1>&2
+ return $?
+ done
+ return 1
+}
+
+is_linkup()
+{
+ local ns=$1
+ local dev=$2
+
+ if [ -n "$(ip -net "$ns" link show dev "$dev" up 2>/dev/null | \
+ grep 'state UP')" ]; then
+ return 0
+ fi
+ return 1
+}
+
+wait_ping()
+{
+ local i1=$1
+ local i2=$2
+ local ns1=${nsa[$i1]}
+ local j
+
+ for j in $(seq 1 $(($PING_TIMEOUT * 5 ))); do
+ ip netns exec "$ns1" ping -c 1 -w $PING_TIMEOUT -i 0.2 \
+ -q "${AD4[$i2]}" >/dev/null 2>&1
+ [ $? -le 1 ] && return $?
+ sleep 0.2
+ done
+ return 1
+}
+
+add_addr()
+{
+ local i=$1
+ local dev=$2
+ local ns=${nsa[$i]}
+ local ad4=${AD4[$i]}
+ local ad6=${AD6[$i]}
+
+ ip -net "$ns" addr add "${ad4}/24" dev "$dev"
+ ip -net "$ns" addr add "${ad6}/64" dev "$dev" nodad
+ if [[ "$ns" == "nsclientlan"* ]]; then
+ ip -net "$ns" route add default via "${AD4[$ADLAN]}"
+ ip -net "$ns" route add default via "${AD6[$ADLAN]}"
+ elif [[ "$ns" == "nsclientwan"* ]]; then
+ ip -net "$ns" route add default via "${AD6[$ADWAN]}"
+ fi
+
+}
+
+del_addr()
+{
+ local i=$1
+ local dev=$2
+ local ns=${nsa[$i]}
+ local ad4=${AD4[$i]}
+ local ad6=${AD6[$i]}
+
+ if [[ "$ns" == "nsclientlan"* ]]; then
+ ip -net "$ns" route del default via "${AD6[$ADLAN]}"
+ ip -net "$ns" route del default via "${AD4[$ADLAN]}"
+ elif [[ "$ns" == "nsclientwan"* ]]; then
+ ip -net "$ns" route del default via "${AD6[$ADWAN]}"
+ fi
+ ip -net "$ns" addr del "${ad6}/64" dev "$dev" nodad
+ ip -net "$ns" addr del "${ad4}/24" dev "$dev"
+}
+
+set_client()
+{
+ local i=$1
+ local vlan=$2
+ local arg=$3
+ local ns=${nsa[$i]}
+ local vdev="${vethcl[$i]}"
+ local brdev="$BRCL"
+ local proto=""
+ local pvidslave=""
+
+ unset_client $i
+
+ if [[ "$vlan" == "qq" ]]; then
+ ip -net "$ns" link add link "$vdev" name "$vdev.$VID1" type vlan id $VID1
+ ip -net "$ns" link add link "$vdev.$VID1" name "$vdev.$VID1.$VID2" \
+ type vlan id $VID2
+ ip -net "$ns" link set "$vdev.$VID1" up
+ ip -net "$ns" link set "$vdev.$VID1.$VID2" up
+ add_addr $i "$vdev.$VID1.$VID2"
+ return
+ fi
+
+ [[ "$vlan" == "none" ]] && pvidslave="pvid untagged"
+ [[ "$vlan" == "ad" ]] && proto="vlan_protocol 802.1ad"
+
+ ip -net "$ns" link add "$brdev" type bridge vlan_filtering 1 vlan_default_pvid 0 $proto
+ ip -net "$ns" link set "$vdev" master "$brdev"
+ ip -net "$ns" link set "$brdev" up
+
+ bridge -net "$ns" vlan add dev "$brdev" vid $VID1 pvid untagged self
+ bridge -net "$ns" vlan add dev "$vdev" vid $VID1 $pvidslave
+
+ if [[ "$vlan" == "ad" ]]; then
+ ip -net "$ns" link add link "$brdev" name "$brdev.$VID2" type vlan id $VID2
+ brdev="$brdev.$VID2"
+ ip -net "$ns" link set "$brdev" up
+ fi
+
+ if [[ "$arg" != "noaddress" ]]; then
+ add_addr $i "$brdev"
+ fi
+}
+
+unset_client()
+{
+ local i=$1
+ local ns=${nsa[$i]}
+ local vdev="${vethcl[$i]}"
+ local brdev="$BRCL"
+
+ ip -net "$ns" link del "$brdev" type bridge 2>/dev/null
+ ip -net "$ns" link del "$vdev.$VID1" 2>/dev/null
+}
+
+add_pppoe()
+{
+ local i1=$1
+ local i2=$2
+ local dev1=$3
+ local dev2=$4
+ local desc=$5
+ local ns1=${nsa[$i1]}
+ local ns2=${nsa[$i2]}
+
+ ppp1=0
+ while [ -n "$(ip -net "$ns1" link show ppp$ppp$LAN1 $LAN2>/dev/null)" ]
+ do ((ppp1++)); done
+ echo "noauth defaultroute noipdefault unit $ppp1" >"$pppoeserveroptions"
+ ppp1="ppp$ppp1"
+
+ if ! ip netns exec "$ns1" pppoe-server -k -L "${AD4[$i1]}" -R "${AD4[$i2]}" \
+ -I $dev1 -X "$pppoeserverpid" -O "$pppoeserveroptions" >/dev/null; then
+ echo "ERROR: $desc: failed to setup pppoe server" 1>&2
+ return 1
+ fi
+
+ if ! ip netns exec "$ns2" pppd plugin pppoe.so nic-$dev2 persist holdoff 0 noauth \
+ defaultroute noipdefault noaccomp nodeflate noproxyarp nopcomp \
+ novj novjccomp linkname "selftest-$$" >/dev/null; then
+ echo "ERROR: $desc: failed to setup pppoe client" 1>&2
+ return 1
+ fi
+
+ if ! wait_ping $i1 $i2; then
+ echo "ERROR: $desc: failed to setup functional pppoe connection" 1>&2
+ return 1
+ fi
+
+ ppp2=$(cat "/run/pppd/ppp-selftest-$$.pid" | tail -n 1)
+
+ ip -net "$ns1" addr add "${AD6[$i1]}/64" dev "$ppp1" nodad
+ ip -net "$ns2" addr add "${AD6[$i2]}/64" dev "$ppp2" nodad
+
+ return 0
+}
+
+del_pppoe()
+{
+ local i1=$1
+ local i2=$2
+ local dev1=$3
+ local dev2=$4
+ local ns1=${nsa[$i1]}
+ local ns2=${nsa[$i2]}
+
+ [[ -n "$ppp1" ]] && ip -net "$ns1" addr del "${AD6[$i1]}/64" dev "$ppp1"
+ [[ -n "$ppp2" ]] && ip -net "$ns2" addr del "${AD6[$i2]}/64" dev "$ppp2"
+
+ kill -9 $(cat "/run/pppd/ppp-selftest-$$.pid" | head -n 1) \
+ $(cat "$pppoeserverpid" | head -n 1)
+}
+
+listener_ready()
+{
+ local ns=$1
+ local ipv=$2
+
+ ss -N "$ns" --ipv$ipv -lnt -o "sport = :8080" | grep -q 8080
+}
+
+test_tcp() {
+ local i1=$1
+ local i2=$2
+ local dofast=$3
+ local desc=$4
+ local ns1=${nsa[$i1]}
+ local ns2=${nsa[$i2]}
+ local i=-1
+ local lret=0
+ local ads=""
+ local ipv ad a lpid bytes limit error
+
+ if [ -n "$do_ipv4" ]; then ads="${AD4[$i2]}"
+ elif [ -n "$do_ipv6" ]; then ads="${AD6[$i2]}"
+ else ads="${AD4[$i2]} ${AD6[$i2]}"
+ fi
+ for ad in $ads; do
+ ((i++))
+ if [[ "$ad" =~ ":" ]]
+ then ipv="6"; a="[${ad}]"
+ else ipv="4"; a="${ad}"
+ fi
+
+ rm -f "$file1out" "$file2out"
+
+ # ip netns exec "$nsrt" nft reset counters >/dev/null
+ # But on some systems this results in 4GB values in packet and byte count, so:
+ (echo "flush ruleset"; ip netns exec "$nsrt" nft --stateless list ruleset) | \
+ ip netns exec "$nsrt" nft -f -
+
+ timeout "$SOCAT_TIMEOUT" ip netns exec "$ns2" socat TCP$ipv-LISTEN:8080,reuseaddr \
+ STDIO <"$filein" >"$file2out" 2>/dev/null &
+ lpid=$!
+ busywait 1000 listener_ready "$ns2" "$ipv"
+
+ timeout "$SOCAT_TIMEOUT" ip netns exec "$ns1" socat TCP$ipv:$a:8080 \
+ STDIO <"$filein" >"$file1out" 2>/dev/null
+ wait $lpid
+
+ if [ $? -ne 0 ]; then
+ error[$i]="ipv$ipv: tcp broken"
+ continue
+ fi
+ if ! cmp "$filein" "$file1out" >/dev/null 2>&1; then
+ error[$i]="ipv$ipv: file mismatch to ${ad}"
+ continue
+ fi
+ if ! cmp "$filein" "$file2out" >/dev/null 2>&1; then
+ error[$i]="ipv$ipv: file mismatch from ${ad}"
+ continue
+ fi
+
+ limit=$((2 * $filesize * 1024 * 1024))
+ bytes=$(ip netns exec "$nsrt" nft list counter $family filter "check" | \
+ grep "packets" | cut -d' ' -f4)
+ if [ -z "$dofast" ] && [ "$bytes" -lt "$limit" ]; then
+
+ error[$i]="ipv$ipv: established bytes $bytes < $limit"
+ continue
+ fi
+ if [ -n "$dofast" ] && [ "$bytes" -gt "$((limit/2))" ]; then
+ # Significant reduction of bytes expected
+ error[$i]="ipv$ipv: counted bytes $bytes > $((limit/2))"
+ continue
+ fi
+ done
+
+ if [ -n "${error[0]}" ]; then
+ if [[ "${error[0]#*:}" == "${error[1]#*:}" ]]; then
+ echo "ERROR: $desc: ipv4/6:${error[0]#*:}" 1>&2
+ return 1
+ fi
+ echo "ERROR: $desc: ${error[0]}" 1>&2
+ lret=1
+ fi
+ if [ -n "${error[1]}" ]; then
+ echo "ERROR: $desc: ${error[1]}" 1>&2
+ lret=1
+ fi
+ if [ $lret -eq 0 ]; then
+ echo "PASS: $desc"
+ fi
+ return $lret
+}
+
+test_paths() {
+ local i1=$1
+ local i2=$2
+ local desc=$3
+ local ns1=${nsa[$i1]}
+ local ns2=${nsa[$i2]}
+
+
+ if ! setup_nftables $i1 $i2; then
+ echo "ERROR: $desc: cannot setup nftables" 1>&2
+ return 1
+ fi
+ if ! test_tcp $i1 $i2 "" "$desc without fastpath"; then
+ return 1
+ fi
+
+ if ! setup_fastpath $i1 $i2 "" 2>/dev/null; then
+ return 0
+ fi
+ if ! test_tcp $i1 $i2 "fast" "$desc with fastpath"; then
+ return 1
+ fi
+
+ if ! setup_fastpath $i1 $i2 "hw" 2>/dev/null; then
+ return 0
+ fi
+ if ! test_tcp $i1 $i2 "fast" "$desc with hw_fastpath"; then
+ return 1
+ fi
+
+ return 0
+
+}
+
+add_masq()
+{
+ if [[ $family != "bridge" ]]; then
+ ip netns exec "$nsrt" nft -f - <<-EOF
+ table ip nat {
+ chain postrouting {
+ type nat hook postrouting priority 0;
+ oifname ${BRWAN} masquerade
+ }
+ }
+ EOF
+ else
+ return 0
+ fi
+}
+
+setup_nftables()
+{
+ local i1=$1
+ local i2=$2
+
+ ip netns exec "$nsrt" nft flush ruleset
+
+ if ! add_masq; then
+ return 1
+ fi
+
+ ip netns exec "$nsrt" nft -f - <<-EOF
+ table ${family} filter {
+ counter check { }
+ chain forward {
+ type filter hook forward priority 0; policy accept;
+ ct state established ip saddr ${AD4[$i1]} tcp dport 8080 counter name "check"
+ ct state established ip saddr ${AD4[$i2]} tcp sport 8080 counter name "check"
+ ct state established ip6 saddr ${AD6[$i1]} tcp dport 8080 counter name "check"
+ ct state established ip6 saddr ${AD6[$i2]} tcp sport 8080 counter name "check"
+ }
+ }
+ EOF
+}
+
+setup_fastpath()
+{
+ local devs="${vethrt[$1]} , ${vethrt[$2]}"
+ local arg=$3
+ local flags=""
+
+ [[ "$arg" == "hw" ]] && flags="flags offload"
+
+ ip netns exec "$nsrt" nft flush ruleset
+
+ if ! add_masq; then
+ return 1
+ fi
+
+ ip netns exec "$nsrt" nft -f - <<-EOF
+ table ${family} filter {
+ counter check { }
+ flowtable f {
+ hook ingress priority filter
+ devices = { ${devs} }
+ ${flags}
+ }
+ chain forward {
+ type filter hook forward priority 0; policy accept;
+ counter name "check"
+ ct state established flow add @f
+ }
+ }
+ EOF
+}
+
+ret=0
+### Start Initial Setup ###
+
+for i in 4 6; do
+ ip netns exec "$nsrt" sysctl -q net.ipv$i.conf.all.forwarding=1
+done
+
+### Setup brlan as vlan unaware bridge ###
+### Use brwan to make sure software fastpath is ###
+### direct xmit in other direction also ###
+
+ip -net "$nsrt" link add $BRWAN type bridge
+ret=$(($ret | $?))
+ip -net "$nsrt" link set $BRWAN up
+ret=$(($ret | $?))
+if [ $ret -ne 0 ]; then
+ echo "SKIP: Can't create bridge"
+ exit $ksft_skip
+fi
+
+# If both lan clients are veth-devices, only test 1 in the forward path
+if [ -z "${vethcl[$LAN1]}" ] && [ -z "${vethcl[$LAN2]}" ]; then
+ lan_all_veth=1
+fi
+
+for i in $WAN $LAN1 $LAN2; do
+ ns="${nsa[$i]}"
+ if [ -z "${vethcl[$i]}" ]; then
+ vethcl[$i]="veth${i}cl"
+ vethrt[$i]="veth${i}rt"
+ ip link add "${vethcl[$i]}" netns "$ns" type veth \
+ peer name "${vethrt[$i]}" netns "$nsrt"
+ ret=$(($ret | $?))
+ else # Use pair of interconnected hardware interfaces
+ ip link set "${vethrt[$i]}" netns "$nsrt"
+ ret=$(($ret | $?))
+ ip link set "${vethcl[$i]}" netns "$ns"
+ ret=$(($ret | $?))
+ fi
+done
+if [ $ret -ne 0 ]; then
+ echo "SKIP: (v)eth pairs cannot be used"
+ exit $ksft_skip
+fi
+
+if [ -n "$showtree" ]; then
+ cat <<-EOF
+ Setup:
+ CLIENT 0
+ ${vethcl[$WAN]}
+ |
+ ${vethrt[$WAN]}
+ WAN
+ ROUTER
+ LAN1 LAN2
+ $(printf "%14.14s" ${vethrt[$LAN1]}) ${vethrt[$LAN2]}
+ | |
+ $(printf "%14.14s" ${vethcl[$LAN1]}) ${vethcl[$LAN2]}
+ CLIENT 1 CLIENT 2
+
+ EOF
+fi
+
+for n in nsclientwan nsclientlan; do
+ routerside=""; clientside=""
+ for i in $WAN $LAN1 $LAN2; do
+ ns="${nsa[$i]}"
+ [[ "$ns" != "$n"* ]] && continue
+ mac=$(check_mac $ns ${vethcl[$i]} "$routerside $clientside")
+ ret=$(($ret | $?))
+ clientside+=" $mac"
+ mac=$(check_mac $nsrt ${vethrt[$i]} "$clientside")
+ ret=$(($ret | $?))
+ routerside+=" $mac"
+ done
+done
+if [ $ret -ne 0 ]; then
+ echo "SKIP: because of conflicting mac address"
+ exit $ksft_skip
+fi
+
+for i in $WAN $LAN1 $LAN2; do
+ ns="${nsa[$i]}"
+ ip -net "$ns" link set "${vethcl[$i]}" up
+ ret=$(($ret | $?))
+ ip -net "$nsrt" link set "${vethrt[$i]}" up
+ ret=$(($ret | $?))
+done
+if [ $ret -ne 0 ]; then
+ echo "SKIP: setting (v)eth pairs link up failed"
+ exit $ksft_skip
+fi
+
+for j in $(seq 1 $(($LINKUP_TIMEOUT * 5 ))); do
+ ret=0
+ for i in $WAN $LAN1 $LAN2; do
+ ns="${nsa[$i]}"
+ is_linkup $ns "${vethcl[$i]}"
+ ret=$(($ret | $?))
+ is_linkup $nsrt "${vethrt[$i]}"
+ ret=$(($ret | $?))
+ done
+ [ $ret -eq 0 ] && break
+ sleep 0.2
+done
+if [ $ret -ne 0 ]; then
+ echo "SKIP: waiting for (v)eth pairs link up failed"
+ exit $ksft_skip
+fi
+
+i=$WAN
+ip -net "$nsrt" link set "${vethrt[$i]}" master $BRWAN
+
+### End Initial Setup ###
+
+family="bridge"
+setup_nftables $LAN1 $LAN2 2>/dev/null
+if [ $? -ne 0 ]; then
+ echo "INFO: Cannot add nftables table $family"
+ skip_family_bridge_part2=1
+elif [ -n "$skip_unaware" ]; then
+ echo "INFO: Skipping unaware bridge"
+else
+
+### Start nft family bridge test part 1 ###
+
+ip -net "$nsrt" link add $BRLAN type bridge
+ip -net "$nsrt" link set $BRLAN up
+for i in $LAN1 $LAN2; do
+ ns="${nsa[$i]}"
+ ip -net "$nsrt" link set "${vethrt[$i]}" master $BRLAN
+done
+
+for i in $LAN1 $LAN2; do
+ set_client $i none
+done
+
+test_paths $LAN1 $LAN2 "unaware bridge, without encaps, "
+ret=$(($ret | $?))
+
+for i in $LAN1 $LAN2; do
+ set_client $i q
+done
+
+test_paths $LAN1 $LAN2 "unaware bridge, with single vlan encap, "
+ret=$(($ret | $?))
+
+for i in $LAN1 $LAN2; do
+ set_client $i qq
+done
+
+# Skip testing double tagged packets on real hardware
+if [ -n "$lan_all_veth" ] || [ -n "$noskip" ]; then
+
+test_paths $LAN1 $LAN2 "unaware bridge, with double q vlan encaps,"
+ret=$(($ret | $?))
+
+for i in $LAN1 $LAN2; do
+ set_client $i ad
+done
+
+test_paths $LAN1 $LAN2 "unaware bridge, with 802.1ad vlan encaps, "
+ret=$(($ret | $?))
+
+fi
+# End Skip testing double tagged packets
+
+if [ -n "$(command -v pppd 2>/dev/null)" ] &&
+ [ -n "$(command -v pppoe-server 2>/dev/null)" ]; then
+# Start pppoe
+
+for i in $LAN1 $LAN2; do
+ set_client $i none noaddress
+done
+
+if add_pppoe $LAN1 $LAN2 "$BRCL" "$BRCL" "unaware bridge, with pppoe encap"; then
+ test_paths $LAN1 $LAN2 "unaware bridge, with pppoe encap, "
+ ret=$(($ret | $?))
+fi
+
+del_pppoe $LAN1 $LAN2 "$BRCL" "$BRCL"
+
+for i in $LAN1 $LAN2; do
+ set_client $i q noaddress
+done
+
+if add_pppoe $LAN1 $LAN2 "$BRCL" "$BRCL" "unaware bridge, with pppoe-in-q encaps"; then
+ test_paths $LAN1 $LAN2 "unaware bridge, with pppoe-in-q encaps, "
+ ret=$(($ret | $?))
+fi
+
+del_pppoe $LAN1 $LAN2 "$BRCL" "$BRCL"
+
+# End pppoe
+fi
+
+ip -net "$nsrt" link del $BRLAN type bridge
+
+### End nft family bridge test part 1 ###
+fi
+
+### Setup brlan as vlan aware bridge ###
+
+ip -net "$nsrt" link add $BRLAN type bridge vlan_filtering 1 vlan_default_pvid 0
+ip -net "$nsrt" link set $BRLAN up
+bridge -net "$nsrt" vlan add dev $BRLAN vid $VID1 pvid untagged self
+for i in $LAN1 $LAN2; do
+ ip -net "$nsrt" link set "${vethrt[$i]}" master $BRLAN
+ bridge -net "$nsrt" vlan add dev "${vethrt[$i]}" vid $VID1 pvid untagged
+done
+
+for i in $LAN1 $LAN2; do
+ set_client $i none
+done
+
+if [ -z "$skip_family_bridge_part2" ]; then
+### Start nft family bridge test part 2 ###
+
+test_paths $LAN1 $LAN2 "aware bridge, without/without vlan encap,"
+ret=$(($ret | $?))
+
+i=$LAN1
+bridge -net "$nsrt" vlan del dev "${vethrt[$i]}" vid $VID1 pvid untagged
+bridge -net "$nsrt" vlan add dev "${vethrt[$i]}" vid $VID1
+set_client $i q
+
+test_paths $LAN1 $LAN2 "aware bridge, with/without vlan encap, "
+ret=$(($ret | $?))
+
+i=$LAN2
+bridge -net "$nsrt" vlan del dev "${vethrt[$i]}" vid $VID1 pvid untagged
+bridge -net "$nsrt" vlan add dev "${vethrt[$i]}" vid $VID1
+set_client $i q
+
+test_paths $LAN1 $LAN2 "aware bridge, with/with vlan encap, "
+ret=$(($ret | $?))
+
+i=$LAN1
+bridge -net "$nsrt" vlan del dev "${vethrt[$i]}" vid $VID1
+bridge -net "$nsrt" vlan add dev "${vethrt[$i]}" vid $VID1 pvid untagged
+set_client $i none
+
+test_paths $LAN1 $LAN2 "aware bridge, without/with vlan encap, "
+ret=$(($ret | $?))
+
+i=$LAN2
+bridge -net "$nsrt" vlan del dev "${vethrt[$i]}" vid $VID1
+bridge -net "$nsrt" vlan add dev "${vethrt[$i]}" vid $VID1 pvid untagged
+set_client $i none
+
+fi
+
+### End nft family bridge test part 2 ###
+
+### Start nft family inet test ###
+family="inet"
+if ! setup_nftables $WAN $LAN1 $LAN2>/dev/null; then
+ echo "INFO: Cannot add nftables table $family"
+ exit $ret
+fi
+
+set_client $WAN none
+add_addr $ADWAN "$BRWAN"
+add_addr $ADLAN "$BRLAN"
+
+test_paths $LAN1 $WAN "forward, without vlan-device, without vlan encap, client1,"
+ret=$(($ret | $?))
+if [ -z "$lan_all_veth" ] || [ -n "$noskip" ]; then
+test_paths $LAN2 $WAN "forward, without vlan-device, without vlan encap, client2,"
+ret=$(($ret | $?))
+fi
+
+for i in $LAN1 $LAN2; do
+bridge -net "$nsrt" vlan del dev "${vethrt[$i]}" vid $VID1 pvid untagged
+bridge -net "$nsrt" vlan add dev "${vethrt[$i]}" vid $VID1
+set_client $i q
+done
+
+test_paths $LAN1 $WAN "forward, without vlan-device, with vlan encap, client1,"
+ret=$(($ret | $?))
+if [ -z "$lan_all_veth" ] || [ -n "$noskip" ]; then
+test_paths $LAN2 $WAN "forward, without vlan-device, with vlan encap, client2,"
+ret=$(($ret | $?))
+fi
+
+# Setup vlan-device linked to brlan master port
+del_addr $ADLAN "$BRLAN"
+ip -net "$nsrt" link set $BRLAN down
+bridge -net "$nsrt" vlan del dev $BRLAN vid $VID1 pvid untagged self
+bridge -net "$nsrt" vlan add dev $BRLAN vid $VID1 self
+ip -net "$nsrt" link add link $BRLAN name $BRLAN.$VID1 type vlan id $VID1
+ip -net "$nsrt" link set $BRLAN up
+ip -net "$nsrt" link set "$BRLAN.$VID1" up
+add_addr $ADLAN "$BRLAN.$VID1"
+
+test_paths $LAN1 $WAN "forward, with vlan-device, with vlan encap, client1,"
+ret=$(($ret | $?))
+if [ -z "$lan_all_veth" ] || [ -n "$noskip" ]; then
+test_paths $LAN2 $WAN "forward, with vlan-device, with vlan encap, client2,"
+ret=$(($ret | $?))
+fi
+
+for i in $LAN1 $LAN2; do
+bridge -net "$nsrt" vlan del dev "${vethrt[$i]}" vid $VID1
+bridge -net "$nsrt" vlan add dev "${vethrt[$i]}" vid $VID1 pvid untagged
+set_client $i none
+done
+
+test_paths $LAN1 $WAN "forward, with vlan-device, without vlan encap, client1,"
+ret=$(($ret | $?))
+if [ -z "$lan_all_veth" ] || [ -n "$noskip" ]; then
+test_paths $LAN2 $WAN "forward, with vlan-device, without vlan encap, client2,"
+ret=$(($ret | $?))
+fi
+
+### End nft family inet test ###
+
+for i in $WAN $LAN1 $LAN2; do
+ unset_client $i
+done
+ip -net "$nsrt" link del $BRLAN type bridge
+ip -net "$nsrt" link del $BRWAN type bridge
+
+if [ $ret -eq 0 ]; then
+ echo "PASS: all tests passed"
+else
+ echo "ERROR: bridge fastpath test has failed"
+fi
+
+exit $ret
--
2.47.1
As the vIOMMU infrastructure series part-3, this introduces a new vEVENTQ
object. The existing FAULT object provides a nice notification pathway to
the user space with a queue already, so let vEVENTQ reuse that.
Mimicing the HWPT structure, add a common EVENTQ structure to support its
derivatives: IOMMUFD_OBJ_FAULT (existing) and IOMMUFD_OBJ_VEVENTQ (new).
An IOMMUFD_CMD_VEVENTQ_ALLOC is introduced to allocate vEVENTQ object for
vIOMMUs. One vIOMMU can have multiple vEVENTQs in different types but can
not support multiple vEVENTQs in the same type.
The forwarding part is fairly simple but might need to replace a physical
device ID with a virtual device ID in a driver-level event data structure.
So, this also adds some helpers for drivers to use.
As usual, this series comes with the selftest coverage for this new ioctl
and with a real world use case in the ARM SMMUv3 driver.
This is on Github:
https://github.com/nicolinc/iommufd/commits/iommufd_veventq-v8
Paring QEMU branch for testing:
https://github.com/nicolinc/qemu/commits/wip/for_iommufd_veventq-v8
Changelog
v8
* Add Reviewed-by from Jason and Pranjal
* Fix errno returned in arm_smmu_handle_event()
* Validate domain->type outside of arm_smmu_attach_prepare_vmaster()
* Drop unnecessary vmaster comparison in arm_smmu_attach_commit_vmaster()
v7
https://lore.kernel.org/all/cover.1740238876.git.nicolinc@nvidia.com/
* Rebase on Jason's for-next tree for latest fault.c
* Add Reviewed-by
* Update commit logs
* Add __reserved field sanity
* Skip kfree() on the static header
* Replace "bool on_list" with list_is_last()
* Use u32 for flags in iommufd_vevent_header
* Drop casting in iommufd_viommu_get_vdev_id()
* Update the bounding logic to veventq->sequence
* Add missing cpu_to_le64() around STRTAB_STE_1_MEV
* Reuse veventq->common.lock to fence sequence and num_events
* Rename overflow to lost_events and log it in upon kmalloc failure
* Correct the error handling part in iommufd_veventq_deliver_fetch()
* Add an arm_smmu_clear_vmaster() to simplify identity/blocked domain
attach ops
* Add additional four event records to forward to user space VM, and
update the uAPI doc
* Reuse the existing smmu->streams_mutex lock to fence master->vmaster
pointer, instead of adding a new rwsem
v6
https://lore.kernel.org/all/cover.1737754129.git.nicolinc@nvidia.com/
* Drop supports_veventq viommu op
* Split bug/cosmetics fixes out of the series
* Drop the blocking mutex around copy_to_user()
* Add veventq_depth in uAPI to limit vEVENTQ size
* Revise the documentation for a clear description
* Fix sparse warnings in arm_vmaster_report_event()
* Rework iommufd_viommu_get_vdev_id() to return -ENOENT v.s. 0
* Allow Abort/Bypass STEs to allocate vEVENTQ and set STE.MEV for DoS
mitigations
v5
https://lore.kernel.org/all/cover.1736237481.git.nicolinc@nvidia.com/
* Add Reviewed-by from Baolu
* Reorder the OBJ list as well
* Fix alphabetical order after renaming in v4
* Add supports_veventq viommu op for vEVENTQ type validation
v4
https://lore.kernel.org/all/cover.1735933254.git.nicolinc@nvidia.com/
* Rename "vIRQ" to "vEVENTQ"
* Use flexible array in struct iommufd_vevent
* Add the new ioctl command to union ucmd_buffer
* Fix the alphabetical order in union ucmd_buffer too
* Rename _TYPE_NONE to _TYPE_DEFAULT aligning with vIOMMU naming
v3
https://lore.kernel.org/all/cover.1734477608.git.nicolinc@nvidia.com/
* Rebase on Will's for-joerg/arm-smmu/updates for arm_smmu_event series
* Add "Reviewed-by" lines from Kevin
* Fix typos in comments, kdocs, and jump tags
* Add a patch to sort struct iommufd_ioctl_op
* Update iommufd's userpsace-api documentation
* Update uAPI kdoc to quote SMMUv3 offical spec
* Drop the unused workqueue in struct iommufd_virq
* Drop might_sleep() in iommufd_viommu_report_irq() helper
* Add missing "break" in iommufd_viommu_get_vdev_id() helper
* Shrink the scope of the vmaster's read lock in SMMUv3 driver
* Pass in two arguments to iommufd_eventq_virq_handler() helper
* Move "!ops || !ops->read" validation into iommufd_eventq_init()
* Move "fault->ictx = ictx" closer to iommufd_ctx_get(fault->ictx)
* Update commit message for arm_smmu_attach_prepare/commit_vmaster()
* Keep "iommufd_fault" as-is and rename "iommufd_eventq_virq" to just
"iommufd_virq"
v2
https://lore.kernel.org/all/cover.1733263737.git.nicolinc@nvidia.com/
* Rebase on v6.13-rc1
* Add IOPF and vIRQ in iommufd.rst (userspace-api)
* Add a proper locking in iommufd_event_virq_destroy
* Add iommufd_event_virq_abort with a lockdep_assert_held
* Rename "EVENT_*" to "EVENTQ_*" to describe the objects better
* Reorganize flows in iommufd_eventq_virq_alloc for abort() to work
* Adde struct arm_smmu_vmaster to store vSID upon attaching to a nested
domain, calling a newly added iommufd_viommu_get_vdev_id helper
* Adde an arm_vmaster_report_event helper in arm-smmu-v3-iommufd file
to simplify the routine in arm_smmu_handle_evt() of the main driver
v1
https://lore.kernel.org/all/cover.1724777091.git.nicolinc@nvidia.com/
Thanks!
Nicolin
Nicolin Chen (14):
iommufd/fault: Move two fault functions out of the header
iommufd/fault: Add an iommufd_fault_init() helper
iommufd: Abstract an iommufd_eventq from iommufd_fault
iommufd: Rename fault.c to eventq.c
iommufd: Add IOMMUFD_OBJ_VEVENTQ and IOMMUFD_CMD_VEVENTQ_ALLOC
iommufd/viommu: Add iommufd_viommu_get_vdev_id helper
iommufd/viommu: Add iommufd_viommu_report_event helper
iommufd/selftest: Require vdev_id when attaching to a nested domain
iommufd/selftest: Add IOMMU_TEST_OP_TRIGGER_VEVENT for vEVENTQ
coverage
iommufd/selftest: Add IOMMU_VEVENTQ_ALLOC test coverage
Documentation: userspace-api: iommufd: Update FAULT and VEVENTQ
iommu/arm-smmu-v3: Introduce struct arm_smmu_vmaster
iommu/arm-smmu-v3: Report events that belong to devices attached to
vIOMMU
iommu/arm-smmu-v3: Set MEV bit in nested STE for DoS mitigations
drivers/iommu/iommufd/Makefile | 2 +-
drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h | 36 ++
drivers/iommu/iommufd/iommufd_private.h | 135 +++-
drivers/iommu/iommufd/iommufd_test.h | 10 +
include/linux/iommufd.h | 23 +
include/uapi/linux/iommufd.h | 105 +++
tools/testing/selftests/iommu/iommufd_utils.h | 115 ++++
.../arm/arm-smmu-v3/arm-smmu-v3-iommufd.c | 64 ++
drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 82 ++-
drivers/iommu/iommufd/driver.c | 72 +++
drivers/iommu/iommufd/eventq.c | 597 ++++++++++++++++++
drivers/iommu/iommufd/fault.c | 342 ----------
drivers/iommu/iommufd/hw_pagetable.c | 6 +-
drivers/iommu/iommufd/main.c | 7 +
drivers/iommu/iommufd/selftest.c | 54 ++
drivers/iommu/iommufd/viommu.c | 2 +
tools/testing/selftests/iommu/iommufd.c | 36 ++
.../selftests/iommu/iommufd_fail_nth.c | 7 +
Documentation/userspace-api/iommufd.rst | 17 +
19 files changed, 1304 insertions(+), 408 deletions(-)
create mode 100644 drivers/iommu/iommufd/eventq.c
delete mode 100644 drivers/iommu/iommufd/fault.c
base-commit: 598749522d4254afb33b8a6c1bea614a95896868
--
2.43.0
There are currently two ways in which ublk server exit is detected by
ublk_drv:
1. uring_cmd cancellation. If there are any outstanding uring_cmds which
have not been completed to the ublk server when it exits, io_uring
calls the uring_cmd callback with a special cancellation flag as the
issuing task is exiting.
2. I/O timeout. This is needed in addition to the above to handle the
"saturated queue" case, when all I/Os for a given queue are in the
ublk server, and therefore there are no outstanding uring_cmds to
cancel when the ublk server exits.
There are a couple of issues with this approach:
- It is complex and inelegant to have two methods to detect the same
condition
- The second method detects ublk server exit only after a long delay
(~30s, the default timeout assigned by the block layer). This delays
the nosrv behavior from kicking in and potential subsequent recovery
of the device.
The second issue is brought to light with the new test_generic_04. It
fails before this fix:
selftests: ublk: test_generic_04.sh
dev id is 0
dd: error writing '/dev/ublkb0': Input/output error
1+0 records in
0+0 records out
0 bytes copied, 30.0611 s, 0.0 kB/s
DEAD
dd took 31 seconds to exit (>= 5s tolerance)!
generic_04 : [FAIL]
Fix this by instead detecting and handling ublk server exit in the
character file release callback. This has several advantages:
- This one place can handle both saturated and unsaturated queues. Thus,
it replaces both preexisting methods of detecting ublk server exit.
- It runs quickly on ublk server exit - there is no 30s delay.
- It starts the process of removing task references in ublk_drv. This is
needed if we want to relax restrictions in the driver like letting
only one thread serve each queue
There is also the disadvantage that the character file release callback
can also be triggered by intentional close of the file, which is a
significant behavior change. Preexisting ublk servers (libublksrv) are
dependent on the ability to open/close the file multiple times. To
address this, only transition to a nosrv state if the file is released
while the ublk device is live. This allows for programs to open/close
the file multiple times during setup. It is still a behavior change if a
ublk server decides to close/reopen the file while the device is LIVE
(i.e. while it is responsible for serving I/O), but that would be highly
unusual. This behavior is in line with what is done by FUSE, which is
very similar to ublk in that a userspace daemon is providing services
traditionally provided by the kernel.
With this change in, the new test (and all other selftests, and all
ublksrv tests) pass:
selftests: ublk: test_generic_04.sh
dev id is 0
dd: error writing '/dev/ublkb0': Input/output error
1+0 records in
0+0 records out
0 bytes copied, 0.0376731 s, 0.0 kB/s
DEAD
generic_04 : [PASS]
Signed-off-by: Uday Shankar <ushankar(a)purestorage.com>
---
Changes in v3:
- Quiesce queue earlier to avoid concurrent cancellation and "normal"
completion of io_uring cmds (Ming Lei)
- Fix del_gendisk hang, found by test_stress_02
- Remove unnecessary parameters in fault_inject target (Ming Lei)
- Fix delay implementation to have separate per-I/O delay instead of
blocking the whole thread (Ming Lei)
- Add delay_us to docs
- Link to v2: https://lore.kernel.org/r/20250402-ublk_timeout-v2-1-249bc5523000@purestora…
Changes in v2:
- Leave null ublk selftests target untouched, instead create new
fault_inject target for injecting per-I/O delay (Ming Lei)
- Allow multiple open/close of ublk character device with some
restrictions
- Drop patches which made it in separately at https://lore.kernel.org/r/20250401-ublk_selftests-v1-1-98129c9bc8bb@puresto…
- Consolidate more nosrv logic in ublk character device release, and
associated code cleanup
- Link to v1: https://lore.kernel.org/r/20250325-ublk_timeout-v1-0-262f0121a7bd@purestora…
---
drivers/block/ublk_drv.c | 228 +++++++++---------------
tools/testing/selftests/ublk/Makefile | 4 +-
tools/testing/selftests/ublk/fault_inject.c | 72 ++++++++
tools/testing/selftests/ublk/kublk.c | 6 +-
tools/testing/selftests/ublk/kublk.h | 4 +
tools/testing/selftests/ublk/test_generic_04.sh | 43 +++++
6 files changed, 215 insertions(+), 142 deletions(-)
diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
index 2fd05c1bd30b03343cb6f357f8c08dd92ff47af9..73baa9d22ccafb00723defa755a0b3aab7238934 100644
--- a/drivers/block/ublk_drv.c
+++ b/drivers/block/ublk_drv.c
@@ -162,7 +162,6 @@ struct ublk_queue {
bool force_abort;
bool timeout;
- bool canceling;
bool fail_io; /* copy of dev->state == UBLK_S_DEV_FAIL_IO */
unsigned short nr_io_ready; /* how many ios setup */
spinlock_t cancel_lock;
@@ -199,8 +198,6 @@ struct ublk_device {
struct completion completion;
unsigned int nr_queues_ready;
unsigned int nr_privileged_daemon;
-
- struct work_struct nosrv_work;
};
/* header of ublk_params */
@@ -209,8 +206,9 @@ struct ublk_params_header {
__u32 types;
};
-static bool ublk_abort_requests(struct ublk_device *ub, struct ublk_queue *ubq);
-
+static void ublk_stop_dev_unlocked(struct ublk_device *ub);
+static void ublk_abort_queue(struct ublk_device *ub, struct ublk_queue *ubq);
+static void __ublk_quiesce_dev(struct ublk_device *ub);
static inline struct request *__ublk_check_and_get_req(struct ublk_device *ub,
struct ublk_queue *ubq, int tag, size_t offset);
static inline unsigned int ublk_req_build_flags(struct request *req);
@@ -1314,8 +1312,6 @@ static void ublk_queue_cmd_list(struct ublk_queue *ubq, struct rq_list *l)
static enum blk_eh_timer_return ublk_timeout(struct request *rq)
{
struct ublk_queue *ubq = rq->mq_hctx->driver_data;
- unsigned int nr_inflight = 0;
- int i;
if (ubq->flags & UBLK_F_UNPRIVILEGED_DEV) {
if (!ubq->timeout) {
@@ -1326,26 +1322,6 @@ static enum blk_eh_timer_return ublk_timeout(struct request *rq)
return BLK_EH_DONE;
}
- if (!ubq_daemon_is_dying(ubq))
- return BLK_EH_RESET_TIMER;
-
- for (i = 0; i < ubq->q_depth; i++) {
- struct ublk_io *io = &ubq->ios[i];
-
- if (!(io->flags & UBLK_IO_FLAG_ACTIVE))
- nr_inflight++;
- }
-
- /* cancelable uring_cmd can't help us if all commands are in-flight */
- if (nr_inflight == ubq->q_depth) {
- struct ublk_device *ub = ubq->dev;
-
- if (ublk_abort_requests(ub, ubq)) {
- schedule_work(&ub->nosrv_work);
- }
- return BLK_EH_DONE;
- }
-
return BLK_EH_RESET_TIMER;
}
@@ -1356,19 +1332,16 @@ static blk_status_t ublk_prep_req(struct ublk_queue *ubq, struct request *rq)
if (unlikely(ubq->fail_io))
return BLK_STS_TARGET;
- /* With recovery feature enabled, force_abort is set in
- * ublk_stop_dev() before calling del_gendisk(). We have to
- * abort all requeued and new rqs here to let del_gendisk()
- * move on. Besides, we cannot not call io_uring_cmd_complete_in_task()
- * to avoid UAF on io_uring ctx.
+ /*
+ * force_abort is set in ublk_stop_dev() before calling
+ * del_gendisk(). We have to abort all requeued and new rqs here
+ * to let del_gendisk() move on. Besides, we cannot not call
+ * io_uring_cmd_complete_in_task() to avoid UAF on io_uring ctx.
*
* Note: force_abort is guaranteed to be seen because it is set
* before request queue is unqiuesced.
*/
- if (ublk_nosrv_should_queue_io(ubq) && unlikely(ubq->force_abort))
- return BLK_STS_IOERR;
-
- if (unlikely(ubq->canceling))
+ if (unlikely(ubq->force_abort))
return BLK_STS_IOERR;
/* fill iod to slot in io cmd buffer */
@@ -1391,16 +1364,6 @@ static blk_status_t ublk_queue_rq(struct blk_mq_hw_ctx *hctx,
if (res != BLK_STS_OK)
return res;
- /*
- * ->canceling has to be handled after ->force_abort and ->fail_io
- * is dealt with, otherwise this request may not be failed in case
- * of recovery, and cause hang when deleting disk
- */
- if (unlikely(ubq->canceling)) {
- __ublk_abort_rq(ubq, rq);
- return BLK_STS_OK;
- }
-
ublk_queue_cmd(ubq, rq);
return BLK_STS_OK;
}
@@ -1461,8 +1424,71 @@ static int ublk_ch_open(struct inode *inode, struct file *filp)
static int ublk_ch_release(struct inode *inode, struct file *filp)
{
struct ublk_device *ub = filp->private_data;
+ int i;
+
+ mutex_lock(&ub->mutex);
+ /*
+ * If the device is not live, we will not transition to a nosrv
+ * state. This protects against:
+ * - accidental poking of the ublk character device
+ * - some ublk servers which may open/close the ublk character
+ * device during startup
+ */
+ if (ub->dev_info.state != UBLK_S_DEV_LIVE)
+ goto out;
+
+ /*
+ * Since we are releasing the ublk character file descriptor, we
+ * know that there cannot be any concurrent file-related
+ * activity (e.g. uring_cmds or reads/writes). However, I/O
+ * might still be getting dispatched. Quiesce that too so that
+ * we don't need to worry about anything concurrent.
+ *
+ * We may have already quiesced the queue if we canceled any
+ * uring_cmds, so only quiesce if necessary (quiesce is not
+ * idempotent, it has an internal counter which we need to
+ * manage carefully).
+ */
+ if (!blk_queue_quiesced(ub->ub_disk->queue))
+ blk_mq_quiesce_queue(ub->ub_disk->queue);
+
+ /*
+ * Handle any requests outstanding to the ublk server
+ */
+ for (i = 0; i < ub->dev_info.nr_hw_queues; i++)
+ ublk_abort_queue(ub, ublk_get_queue(ub, i));
+ /*
+ * Transition the device to the nosrv state. What exactly this
+ * means depends on the recovery flags
+ */
+ if (ublk_nosrv_should_stop_dev(ub)) {
+ /*
+ * Allow any pending/future I/O to pass through quickly
+ * with an error. This is needed because del_gendisk
+ * waits for all pending I/O to complete
+ */
+ for (i = 0; i < ub->dev_info.nr_hw_queues; i++)
+ ublk_get_queue(ub, i)->force_abort = true;
+ blk_mq_unquiesce_queue(ub->ub_disk->queue);
+
+ ublk_stop_dev_unlocked(ub);
+ } else {
+ if (ublk_nosrv_dev_should_queue_io(ub)) {
+ __ublk_quiesce_dev(ub);
+ } else {
+ ub->dev_info.state = UBLK_S_DEV_FAIL_IO;
+ for (i = 0; i < ub->dev_info.nr_hw_queues; i++)
+ ublk_get_queue(ub, i)->fail_io = true;
+ }
+
+ /* pair with earlier quiesce */
+ blk_mq_unquiesce_queue(ub->ub_disk->queue);
+ }
+
+out:
clear_bit(UB_STATE_OPEN, &ub->state);
+ mutex_unlock(&ub->mutex);
return 0;
}
@@ -1556,57 +1582,6 @@ static void ublk_abort_queue(struct ublk_device *ub, struct ublk_queue *ubq)
}
}
-/* Must be called when queue is frozen */
-static bool ublk_mark_queue_canceling(struct ublk_queue *ubq)
-{
- bool canceled;
-
- spin_lock(&ubq->cancel_lock);
- canceled = ubq->canceling;
- if (!canceled)
- ubq->canceling = true;
- spin_unlock(&ubq->cancel_lock);
-
- return canceled;
-}
-
-static bool ublk_abort_requests(struct ublk_device *ub, struct ublk_queue *ubq)
-{
- bool was_canceled = ubq->canceling;
- struct gendisk *disk;
-
- if (was_canceled)
- return false;
-
- spin_lock(&ub->lock);
- disk = ub->ub_disk;
- if (disk)
- get_device(disk_to_dev(disk));
- spin_unlock(&ub->lock);
-
- /* Our disk has been dead */
- if (!disk)
- return false;
-
- /*
- * Now we are serialized with ublk_queue_rq()
- *
- * Make sure that ubq->canceling is set when queue is frozen,
- * because ublk_queue_rq() has to rely on this flag for avoiding to
- * touch completed uring_cmd
- */
- blk_mq_quiesce_queue(disk->queue);
- was_canceled = ublk_mark_queue_canceling(ubq);
- if (!was_canceled) {
- /* abort queue is for making forward progress */
- ublk_abort_queue(ub, ubq);
- }
- blk_mq_unquiesce_queue(disk->queue);
- put_device(disk_to_dev(disk));
-
- return !was_canceled;
-}
-
static void ublk_cancel_cmd(struct ublk_queue *ubq, struct ublk_io *io,
unsigned int issue_flags)
{
@@ -1634,9 +1609,8 @@ static void ublk_uring_cmd_cancel_fn(struct io_uring_cmd *cmd,
{
struct ublk_uring_cmd_pdu *pdu = ublk_get_uring_cmd_pdu(cmd);
struct ublk_queue *ubq = pdu->ubq;
+ struct ublk_device *ub = ubq->dev;
struct task_struct *task;
- struct ublk_device *ub;
- bool need_schedule;
struct ublk_io *io;
if (WARN_ON_ONCE(!ubq))
@@ -1649,16 +1623,20 @@ static void ublk_uring_cmd_cancel_fn(struct io_uring_cmd *cmd,
if (WARN_ON_ONCE(task && task != ubq->ubq_daemon))
return;
- ub = ubq->dev;
- need_schedule = ublk_abort_requests(ub, ubq);
+ /*
+ * We could be the first to notice that the ublk server is dying
+ * here. If we are, quiesce the queue to eliminate concurrent
+ * "normal" io_uring cmd completions in the I/O submission path.
+ */
+ mutex_lock(&ub->mutex);
+ if (ub->dev_info.state == UBLK_S_DEV_LIVE &&
+ !blk_queue_quiesced(ub->ub_disk->queue))
+ blk_mq_quiesce_queue(ub->ub_disk->queue);
+ mutex_unlock(&ub->mutex);
io = &ubq->ios[pdu->tag];
WARN_ON_ONCE(io->cmd != cmd);
ublk_cancel_cmd(ubq, io, issue_flags);
-
- if (need_schedule) {
- schedule_work(&ub->nosrv_work);
- }
}
static inline bool ublk_queue_ready(struct ublk_queue *ubq)
@@ -1756,13 +1734,13 @@ static struct gendisk *ublk_detach_disk(struct ublk_device *ub)
return disk;
}
-static void ublk_stop_dev(struct ublk_device *ub)
+static void ublk_stop_dev_unlocked(struct ublk_device *ub)
+ __must_hold(&ub->mutex)
{
struct gendisk *disk;
- mutex_lock(&ub->mutex);
if (ub->dev_info.state == UBLK_S_DEV_DEAD)
- goto unlock;
+ return;
if (ublk_nosrv_dev_should_queue_io(ub)) {
if (ub->dev_info.state == UBLK_S_DEV_LIVE)
__ublk_quiesce_dev(ub);
@@ -1771,38 +1749,12 @@ static void ublk_stop_dev(struct ublk_device *ub)
del_gendisk(ub->ub_disk);
disk = ublk_detach_disk(ub);
put_disk(disk);
- unlock:
- mutex_unlock(&ub->mutex);
- ublk_cancel_dev(ub);
}
-static void ublk_nosrv_work(struct work_struct *work)
+static void ublk_stop_dev(struct ublk_device *ub)
{
- struct ublk_device *ub =
- container_of(work, struct ublk_device, nosrv_work);
- int i;
-
- if (ublk_nosrv_should_stop_dev(ub)) {
- ublk_stop_dev(ub);
- return;
- }
-
mutex_lock(&ub->mutex);
- if (ub->dev_info.state != UBLK_S_DEV_LIVE)
- goto unlock;
-
- if (ublk_nosrv_dev_should_queue_io(ub)) {
- __ublk_quiesce_dev(ub);
- } else {
- blk_mq_quiesce_queue(ub->ub_disk->queue);
- ub->dev_info.state = UBLK_S_DEV_FAIL_IO;
- for (i = 0; i < ub->dev_info.nr_hw_queues; i++) {
- ublk_get_queue(ub, i)->fail_io = true;
- }
- blk_mq_unquiesce_queue(ub->ub_disk->queue);
- }
-
- unlock:
+ ublk_stop_dev_unlocked(ub);
mutex_unlock(&ub->mutex);
ublk_cancel_dev(ub);
}
@@ -2388,7 +2340,6 @@ static void ublk_remove(struct ublk_device *ub)
bool unprivileged;
ublk_stop_dev(ub);
- cancel_work_sync(&ub->nosrv_work);
cdev_device_del(&ub->cdev, &ub->cdev_dev);
unprivileged = ub->dev_info.flags & UBLK_F_UNPRIVILEGED_DEV;
ublk_put_device(ub);
@@ -2675,7 +2626,6 @@ static int ublk_ctrl_add_dev(struct io_uring_cmd *cmd)
goto out_unlock;
mutex_init(&ub->mutex);
spin_lock_init(&ub->lock);
- INIT_WORK(&ub->nosrv_work, ublk_nosrv_work);
ret = ublk_alloc_dev_number(ub, header->dev_id);
if (ret < 0)
@@ -2807,7 +2757,6 @@ static inline void ublk_ctrl_cmd_dump(struct io_uring_cmd *cmd)
static int ublk_ctrl_stop_dev(struct ublk_device *ub)
{
ublk_stop_dev(ub);
- cancel_work_sync(&ub->nosrv_work);
return 0;
}
@@ -2927,7 +2876,6 @@ static void ublk_queue_reinit(struct ublk_device *ub, struct ublk_queue *ubq)
/* We have to reset it to NULL, otherwise ub won't accept new FETCH_REQ */
ubq->ubq_daemon = NULL;
ubq->timeout = false;
- ubq->canceling = false;
for (i = 0; i < ubq->q_depth; i++) {
struct ublk_io *io = &ubq->ios[i];
diff --git a/tools/testing/selftests/ublk/Makefile b/tools/testing/selftests/ublk/Makefile
index c7781efea0f33c02f340f90f547d3a37c1d1b8a0..afee027cccdd1b8f13f1cb9a90a3348cd54b18bc 100644
--- a/tools/testing/selftests/ublk/Makefile
+++ b/tools/testing/selftests/ublk/Makefile
@@ -6,6 +6,7 @@ LDLIBS += -lpthread -lm -luring
TEST_PROGS := test_generic_01.sh
TEST_PROGS += test_generic_02.sh
TEST_PROGS += test_generic_03.sh
+TEST_PROGS += test_generic_04.sh
TEST_PROGS += test_null_01.sh
TEST_PROGS += test_null_02.sh
@@ -26,7 +27,8 @@ TEST_GEN_PROGS_EXTENDED = kublk
include ../lib.mk
-$(TEST_GEN_PROGS_EXTENDED): kublk.c null.c file_backed.c common.c stripe.c
+$(TEST_GEN_PROGS_EXTENDED): kublk.c null.c file_backed.c common.c stripe.c \
+ fault_inject.c
check:
shellcheck -x -f gcc *.sh
diff --git a/tools/testing/selftests/ublk/fault_inject.c b/tools/testing/selftests/ublk/fault_inject.c
new file mode 100644
index 0000000000000000000000000000000000000000..3a8574e6a73767b1f9d0d81c62c7dbf28d2445d0
--- /dev/null
+++ b/tools/testing/selftests/ublk/fault_inject.c
@@ -0,0 +1,72 @@
+// SPDX-License-Identifier: GPL-2.0
+
+/*
+ * Fault injection ublk target. Hack this up however you like for
+ * testing specific behaviors of ublk_drv. Currently is a null target
+ * with a configurable delay before completing each I/O. This delay can
+ * be used to test ublk_drv's handling of I/O outstanding to the ublk
+ * server when it dies.
+ */
+
+#include "kublk.h"
+
+static int ublk_fault_inject_tgt_init(const struct dev_ctx *ctx,
+ struct ublk_dev *dev)
+{
+ const struct ublksrv_ctrl_dev_info *info = &dev->dev_info;
+ unsigned long dev_size = 250UL << 30;
+
+ dev->tgt.dev_size = dev_size;
+ dev->tgt.params = (struct ublk_params) {
+ .types = UBLK_PARAM_TYPE_BASIC,
+ .basic = {
+ .logical_bs_shift = 9,
+ .physical_bs_shift = 12,
+ .io_opt_shift = 12,
+ .io_min_shift = 9,
+ .max_sectors = info->max_io_buf_bytes >> 9,
+ .dev_sectors = dev_size >> 9,
+ },
+ };
+
+ dev->private_data = (void *)(ctx->delay_us * 1000);
+ return 0;
+}
+
+static int ublk_fault_inject_queue_io(struct ublk_queue *q, int tag)
+{
+ const struct ublksrv_io_desc *iod = ublk_get_iod(q, tag);
+ struct io_uring_sqe *sqe;
+ struct __kernel_timespec ts = {
+ .tv_nsec = (long long)q->dev->private_data,
+ };
+
+ ublk_queue_alloc_sqes(q, &sqe, 1);
+ io_uring_prep_timeout(sqe, &ts, 1, 0);
+ sqe->user_data = build_user_data(tag, ublksrv_get_op(iod), 0, 1);
+
+ ublk_queued_tgt_io(q, tag, 1);
+
+ return 0;
+}
+
+static void ublk_fault_inject_tgt_io_done(struct ublk_queue *q, int tag,
+ const struct io_uring_cqe *cqe)
+{
+ const struct ublksrv_io_desc *iod = ublk_get_iod(q, tag);
+
+ if (cqe->res != -ETIME)
+ ublk_err("%s: unexpected cqe res %d\n", __func__, cqe->res);
+
+ if (ublk_completed_tgt_io(q, tag))
+ ublk_complete_io(q, tag, iod->nr_sectors << 9);
+ else
+ ublk_err("%s: io not complete after 1 cqe\n", __func__);
+}
+
+const struct ublk_tgt_ops fault_inject_tgt_ops = {
+ .name = "fault_inject",
+ .init_tgt = ublk_fault_inject_tgt_init,
+ .queue_io = ublk_fault_inject_queue_io,
+ .tgt_io_done = ublk_fault_inject_tgt_io_done,
+};
diff --git a/tools/testing/selftests/ublk/kublk.c b/tools/testing/selftests/ublk/kublk.c
index 91c282bc767449a418cce7fc816dc8e9fc732d6a..b741d91b2288b19d450ad22a045b014da18c3f8d 100644
--- a/tools/testing/selftests/ublk/kublk.c
+++ b/tools/testing/selftests/ublk/kublk.c
@@ -10,6 +10,7 @@ static const struct ublk_tgt_ops *tgt_ops_list[] = {
&null_tgt_ops,
&loop_tgt_ops,
&stripe_tgt_ops,
+ &fault_inject_tgt_ops,
};
static const struct ublk_tgt_ops *ublk_find_tgt(const char *name)
@@ -1041,7 +1042,7 @@ static int cmd_dev_get_features(void)
static int cmd_dev_help(char *exe)
{
- printf("%s add -t [null|loop] [-q nr_queues] [-d depth] [-n dev_id] [backfile1] [backfile2] ...\n", exe);
+ printf("%s add -t [null|loop|stripe|fault_inject] [-q nr_queues] [-d depth] [-n dev_id] [--delay_us delay] [backfile1] [backfile2] ...\n", exe);
printf("\t default: nr_queues=2(max 4), depth=128(max 128), dev_id=-1(auto allocation)\n");
printf("%s del [-n dev_id] -a \n", exe);
printf("\t -a delete all devices -n delete specified device\n");
@@ -1064,6 +1065,7 @@ int main(int argc, char *argv[])
{ "zero_copy", 0, NULL, 'z' },
{ "foreground", 0, NULL, 0 },
{ "chunk_size", 1, NULL, 0 },
+ { "delay_us", 1, NULL, 0 },
{ 0, 0, 0, 0 }
};
int option_idx, opt;
@@ -1112,6 +1114,8 @@ int main(int argc, char *argv[])
ctx.fg = 1;
if (!strcmp(longopts[option_idx].name, "chunk_size"))
ctx.chunk_size = strtol(optarg, NULL, 10);
+ if (!strcmp(longopts[option_idx].name, "delay_us"))
+ ctx.delay_us = strtoll(optarg, NULL, 10);
}
}
diff --git a/tools/testing/selftests/ublk/kublk.h b/tools/testing/selftests/ublk/kublk.h
index 760ff8ffb8107037a19a8fb7ab408818845e010d..a1a8a802fb43f0fe9272f33c8a3161e9316a5507 100644
--- a/tools/testing/selftests/ublk/kublk.h
+++ b/tools/testing/selftests/ublk/kublk.h
@@ -70,6 +70,9 @@ struct dev_ctx {
/* stripe */
unsigned int chunk_size;
+ /* fault_inject */
+ long long delay_us;
+
int _evtfd;
};
@@ -357,6 +360,7 @@ static inline int ublk_queue_use_zc(const struct ublk_queue *q)
extern const struct ublk_tgt_ops null_tgt_ops;
extern const struct ublk_tgt_ops loop_tgt_ops;
extern const struct ublk_tgt_ops stripe_tgt_ops;
+extern const struct ublk_tgt_ops fault_inject_tgt_ops;
void backing_file_tgt_deinit(struct ublk_dev *dev);
int backing_file_tgt_init(struct ublk_dev *dev);
diff --git a/tools/testing/selftests/ublk/test_generic_04.sh b/tools/testing/selftests/ublk/test_generic_04.sh
new file mode 100755
index 0000000000000000000000000000000000000000..48af48164aa444d8ac6a58fef1743d2a16a56a14
--- /dev/null
+++ b/tools/testing/selftests/ublk/test_generic_04.sh
@@ -0,0 +1,43 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+
+. "$(cd "$(dirname "$0")" && pwd)"/test_common.sh
+
+TID="generic_04"
+ERR_CODE=0
+
+_prep_test "fault_inject" "fast cleanup when all I/Os of one hctx are in server"
+
+# configure ublk server to sleep 2s before completing each I/O
+dev_id=$(_add_ublk_dev -t fault_inject -q 2 -d 1 --delay_us 2000000)
+_check_add_dev $TID $?
+
+echo "dev id is ${dev_id}"
+
+STARTTIME=${SECONDS}
+
+dd if=/dev/urandom of=/dev/ublkb${dev_id} oflag=direct bs=4k count=1 &
+dd_pid=$!
+
+__ublk_kill_daemon ${dev_id} "DEAD"
+
+wait $dd_pid
+dd_exitcode=$?
+
+ENDTIME=${SECONDS}
+ELAPSED=$(($ENDTIME - $STARTTIME))
+
+# assert that dd sees an error and exits quickly after ublk server is
+# killed. previously this relied on seeing an I/O timeout and so would
+# take ~30s
+if [ $dd_exitcode -eq 0 ]; then
+ echo "dd unexpectedly exited successfully!"
+ ERR_CODE=255
+fi
+if [ $ELAPSED -ge 5 ]; then
+ echo "dd took $ELAPSED seconds to exit (>= 5s tolerance)!"
+ ERR_CODE=255
+fi
+
+_cleanup_test "fault_inject"
+_show_result $TID $ERR_CODE
---
base-commit: 710e2c687a16b28a873a282517a85faf02a8b7cc
change-id: 20250325-ublk_timeout-b06b9b51c591
Best regards,
--
Uday Shankar <ushankar(a)purestorage.com>
A hds-thresh value is not set correctly if input value is 0.
The cause is that ethtool_ringparam_get_cfg(), which is a internal
function that returns ringparameters from both ->get_ringparam() and
dev->cfg can't return a correct hds-thresh value.
The first patch fixes ethtool_ringparam_get_cfg() to set hds-thresh
value correcltly.
The second patch adds random test for hds-thresh value.
So that we can test 0 value for a hds-thresh properly.
v2:
- Skips set_hds_thresh_random test when hds-thresh-max value is too
small. (2/2)
- Change random range from 1-MAX to 1-(MAX-1). (2/2)
Taehee Yoo (2):
net: ethtool: fix ethtool_ringparam_get_cfg() returns a hds_thresh
value always as 0.
selftests: drv-net: test random value for hds-thresh
net/ethtool/common.c | 1 +
tools/testing/selftests/drivers/net/hds.py | 33 +++++++++++++++++++++-
2 files changed, 33 insertions(+), 1 deletion(-)
--
2.34.1
Bingbu reported an issue in [1] that udmabuf vmap failed and in [2], we
discussed the scenario of folio vmap due to the misuse of vmap_pfn
in udmabuf.
We reached the conclusion that vmap_pfn prohibits the use of page-based
PFNs:
Christoph Hellwig : 'No, vmap_pfn is entirely for memory not backed by
pages or folios, i.e. PCIe BARs and similar memory. This must not be
mixed with proper folio backed memory.'
But udmabuf still need consider HVO based folio's vmap, and need fix
vmap issue. This RFC code want to show the two point that I mentioned
in [2], and more deep talk it:
Point1. simple copy vmap_pfn code, don't bother common vmap_pfn, use by
itself and remove pfn_valid check.
Point2. implement folio array based vmap(vmap_folios), which can given a
range of each folio(offset, nr_pages), so can suit HVO folio's vmap.
Patch 1-2 implement point1, and add a test simple set in udmabuf driver.
Patch 3-5 implement point2, also can test it.
Kasireddy also show that 'another option is to just limit udmabuf's vmap()
to only shmem folios'(This I guess folio_test_hugetlb_vmemmap_optimized
can help.)
But I prefer point2 to solution this issue, and IMO, folio based vmap still
need.
Compare to page based vmap(or pfn based), we need split each large folio
into single page struct, this need more large array struct and more longer
iter. If each tail page struct not exist(like HVO), can only use pfn vmap,
but there are no common api to do this.
In [2], we talked that udmabuf can use hugetlb as the memory
provider, and can give a range use. So if HVO used in hugetlb, each folio's
tail page may freed, so we can't use page based vmap, only can use pfn
based, which show in point1.
Further more, Folio based vmap only need record each folio(and offset,
nr_pages if range need). For 20MB vmap, page based need 5120 pages(40KB),
2MB folios only need 10 folio(80Byte).
Matthew show that Vishal also offered a folio based vmap - vmap_file[3].
This RFC patch want a range based folio, not only a full folio's map(like
file's folio), to resolve some problem like HVO's range folio vmap.
Please give me more suggestion.
Test case:
//enable/disable HVO
1. echo [1|0] > /proc/sys/vm/hugetlb_optimize_vmemmap
//prepare HUGETLB
2. echo 10 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
3. ./udmabuf_vmap
4. check output, and dmesg if any warn.
[1] https://lore.kernel.org/all/9172a601-c360-0d5b-ba1b-33deba430455@linux.inte…
[2] https://lore.kernel.org/lkml/20250312061513.1126496-1-link@vivo.com/
[3] https://lore.kernel.org/linux-mm/20250131001806.92349-1-vishal.moola@gmail.…
Huan Yang (6):
udmabuf: try fix udmabuf vmap
udmabuf: try udmabuf vmap test
mm/vmalloc: try add vmap folios range
udmabuf: use vmap_range_folios
udmabuf: vmap test suit for pages and pfns compare
udmabuf: remove no need code
drivers/dma-buf/udmabuf.c | 29 +++++++++-----------
include/linux/vmalloc.h | 57 +++++++++++++++++++++++++++++++++++++++
mm/vmalloc.c | 47 ++++++++++++++++++++++++++++++++
3 files changed, 117 insertions(+), 16 deletions(-)
--
2.48.1
Hi!
The Cirrus tests keep failing for me when run on x86
./tools/testing/kunit/kunit.py run --alltests --json --arch=x86_64
https://netdev-3.bots.linux.dev/kunit/results/60103/stdout
It seems like new cases continue to appear and we have to keep adding
them to the local ignored list. Is it possible to get these fixed or
can we exclude the cirrus tests from KUNIT_ALL_TESTS?
If no explicit XARCH is specified, use the toolchains default.
Signed-off-by: Thomas Weißschuh <linux(a)weissschuh.net>
---
Thomas Weißschuh (2):
selftests/nolibc: drop dependency from sysroot to defconfig
selftests/nolibc: only consider XARCH for CFLAGS when requested
tools/testing/selftests/nolibc/Makefile | 7 +++++--
1 file changed, 5 insertions(+), 2 deletions(-)
---
base-commit: 9a9b20007ab833c1aa3791efcfdf67e7e3ea8902
change-id: 20250330-nolibc-nolibc-test-native-6d4d84d764eb
Best regards,
--
Thomas Weißschuh <linux(a)weissschuh.net>
The test_memcontrol selftest consistently fails its test_memcg_low
sub-test due to the fact that two of its test child cgroups which
have a memmory.low of 0 or an effective memory.low of 0 still have low
events generated for them since mem_cgroup_below_low() use the ">="
operator when comparing to elow.
The two failed use cases are as follows:
1) memory.low is set to 0, but low events can still be triggered and
so the cgroup may have a non-zero low event count. I doubt users are
looking for that as they didn't set memory.low at all.
2) memory.low is set to a non-zero value but the cgroup has no task in
it so that it has an effective low value of 0. Again it may have a
non-zero low event count if memory reclaim happens. This is probably
not a result expected by the users and it is really doubtful that
users will check an empty cgroup with no task in it and expecting
some non-zero event counts.
The simple and naive fix of changing the operator to ">", however,
changes the memory reclaim behavior which can lead to other failures
as low events are needed to facilitate memory reclaim. So we can't do
that without some relatively riskier changes in memory reclaim.
Another simpler alternative is to avoid reporting below_low failure
if either memory.low or its effective equivalent is 0 which is done
by this patch specifically for the two failed use cases above.
With this patch applied, the test_memcg_low sub-test finishes
successfully without failure in most cases. Though both test_memcg_low
and test_memcg_min sub-tests may still fail occasionally if the
memory.current values fall outside of the expected ranges.
To be consistent, similar change is appled to mem_cgroup_below_min()
as to avoid the two failed use cases above with low replaced by min.
Signed-off-by: Waiman Long <longman(a)redhat.com>
---
include/linux/memcontrol.h | 18 ++++++++++++++----
1 file changed, 14 insertions(+), 4 deletions(-)
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 53364526d877..4d4a1f159eaa 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -601,21 +601,31 @@ static inline bool mem_cgroup_unprotected(struct mem_cgroup *target,
static inline bool mem_cgroup_below_low(struct mem_cgroup *target,
struct mem_cgroup *memcg)
{
+ unsigned long elow;
+
if (mem_cgroup_unprotected(target, memcg))
return false;
- return READ_ONCE(memcg->memory.elow) >=
- page_counter_read(&memcg->memory);
+ elow = READ_ONCE(memcg->memory.elow);
+ if (!elow || !READ_ONCE(memcg->memory.low))
+ return false;
+
+ return page_counter_read(&memcg->memory) <= elow;
}
static inline bool mem_cgroup_below_min(struct mem_cgroup *target,
struct mem_cgroup *memcg)
{
+ unsigned long emin;
+
if (mem_cgroup_unprotected(target, memcg))
return false;
- return READ_ONCE(memcg->memory.emin) >=
- page_counter_read(&memcg->memory);
+ emin = READ_ONCE(memcg->memory.emin);
+ if (!emin || !READ_ONCE(memcg->memory.min))
+ return false;
+
+ return page_counter_read(&memcg->memory) <= emin;
}
int __mem_cgroup_charge(struct folio *folio, struct mm_struct *mm, gfp_t gfp);
--
2.48.1
┌────────────┐ ┌───────────────────────────────────┐ ┌────────────────┐
│ │ │ │ │ │
│ │ │ PCI Endpoint │ │ PCI Host │
│ │ │ │ │ │
│ │◄──┤ 1.platform_msi_domain_alloc_irqs()│ │ │
│ │ │ │ │ │
│ MSI ├──►│ 2.write_msi_msg() ├──►├─BAR<n> │
│ Controller │ │ update doorbell register address│ │ │
│ │ │ for BAR │ │ │
│ │ │ │ │ 3. Write BAR<n>│
│ │◄──┼───────────────────────────────────┼───┤ │
│ │ │ │ │ │
│ ├──►│ 4.Irq Handle │ │ │
│ │ │ │ │ │
│ │ │ │ │ │
└────────────┘ └───────────────────────────────────┘ └────────────────┘
This patches based on old https://lore.kernel.org/imx/20221124055036.1630573-1-Frank.Li@nxp.com/
Original patch only target to vntb driver. But actually it is common
method.
This patches add new API to pci-epf-core, so any EP driver can use it.
Previous v2 discussion here.
https://lore.kernel.org/imx/20230911220920.1817033-1-Frank.Li@nxp.com/
Changes in v16:
- remove arm64: dts: imx95-19x19-evk: Add PCIe1 endpoint function overlay file
because there are better patches, which under review.
- Add document for pcie-ep msi-map usage
- other change to see each patch's change log
About IMMUTABLE (No change for this part, tglx provide feedback)
> - This IMMUTABLE thing serves no purpose, because you don't randomly
> plug this end-point block on any MSI controller. They come as part
> of an SoC.
"Yes and no. The problem is that the EP implementation is meant to be a
generic library and while GIC-ITS guarantees immutability of the
address/data pair after setup, there are architectures (x86, loongson,
riscv) where the base MSI controller does not and immutability is only
achieved when interrupt remapping is enabled. The latter can be disabled
at boot-time and then the EP implementation becomes a lottery across
affinity changes.
That was my concern about this library implementation and that's why I
asked for a mechanism to ensure that the underlying irqdomain provides a
immutable address/data pair.
So it does not matter for GIC-ITS, but in the larger picture it matters.
Thanks,
tglx
"
So it does not matter for GIC-ITS, but in the larger picture it matters.
- Link to v15: https://lore.kernel.org/r/20250211-ep-msi-v15-0-bcacc1f2b1a9@nxp.com
Changes in v15:
- rebase to v6.14-rc1
- fix build issue find by kernel test robot
- Link to v14: https://lore.kernel.org/r/20250207-ep-msi-v14-0-9671b136f2b8@nxp.com
Changes in v14:
Marc Zyngier raised concerns about adding DOMAIN_BUS_DEVICE_PCI_EP_MSI. As
a result, the approach has been reverted to the v9 method. However, there
are several improvements:
MSI now supports msi-map in addition to msi-parent.
- The struct device: id is used as the endpoint function (EPF) device
identity to map to the stream ID (sideband information).
- The EPC device tree source (DTS) utilizes msi-map to provide such
information.
- The EPF device's of_node is set to the EPC controller’s node. This
approach is commonly used for multi-function device (MFD) platform child
devices, allowing them to inherit properties from the MFD device’s DTS,
such as reset-cells and gpio-cells. This method is well-suited for the
current case, as the EPF is inherently created/binded to the EPC and
should inherit the EPC’s DTS node properties.
Additionally:
Since the basic IMX95 LUT support has already been merged into the
mainline, a DTS and driver increment patch is added to complete the
solution. The patch is rebased onto the latest linux-next tree and
aligned with the new pcitest framework.
- Link to v13: https://lore.kernel.org/r/20241218-ep-msi-v13-0-646e2192dc24@nxp.com
Changes in v13:
- Change to use DOMAIN_BUS_PCI_DEVICE_EP_MSI
- Change request id as func | vfunc << 3
- Remove IRQ_DOMAIN_MSI_IMMUTABLE
Thomas Gleixner:
I hope capture all your points in review comments. If missed, let me know.
- Link to v12: https://lore.kernel.org/r/20241211-ep-msi-v12-0-33d4532fa520@nxp.com
Changes in v12:
- Change to use IRQ_DOMAIN_MSI_IMMUTABLE and add help function
irq_domain_msi_is_immuatble().
- split PCI: endpoint: pci-ep-msi: Add MSI address/data pair mutable check to 3 patches
- Link to v11: https://lore.kernel.org/r/20241209-ep-msi-v11-0-7434fa8397bd@nxp.com
Changes in v11:
- Change to use MSI_FLAG_MSG_IMMUTABLE
- Link to v10: https://lore.kernel.org/r/20241204-ep-msi-v10-0-87c378dbcd6d@nxp.com
Changes in v10:
Thomas Gleixner:
There are big change in pci-ep-msi.c. I am sure if go on the
corrent path. The key improvement is remove only 1 function devices's
limitation.
I use new patch for imutable check, which relative additional
feature compared to base enablement patch.
- Remove patch Add msi_remove_device_irq_domain() in platform_device_msi_free_irqs_all()
- Add new patch irqchip/gic-v3-its: Avoid overwriting msi_prepare callback if provided by msi_domain_info
- Remove only support 1 endpoint function limiation.
- Create one MSI domain for each endpoint function devices.
- Use "msi-map" in pci ep controler node, instead of of msi-parent. first
argument is
(func_no << 8 | vfunc_no)
- Link to v9: https://lore.kernel.org/r/20241203-ep-msi-v9-0-a60dbc3f15dd@nxp.com
Changes in v9
- Add patch platform-msi: Add msi_remove_device_irq_domain() in platform_device_msi_free_irqs_all()
- Remove patch PCI: endpoint: Add pci_epc_get_fn() API for customizable filtering
- Remove API pci_epf_align_inbound_addr_lo_hi
- Move doorbell_alloc in to doorbell_enable function.
- Link to v8: https://lore.kernel.org/r/20241116-ep-msi-v8-0-6f1f68ffd1bb@nxp.com
Changes in v8:
- update helper function name to pci_epf_align_inbound_addr()
- Link to v7: https://lore.kernel.org/r/20241114-ep-msi-v7-0-d4ac7aafbd2c@nxp.com
Changes in v7:
- Add helper function pci_epf_align_addr();
- Link to v6: https://lore.kernel.org/r/20241112-ep-msi-v6-0-45f9722e3c2a@nxp.com
Changes in v6:
- change doorbell_addr to doorbell_offset
- use round_down()
- add Niklas's test by tag
- rebase to pci/endpoint
- Link to v5: https://lore.kernel.org/r/20241108-ep-msi-v5-0-a14951c0d007@nxp.com
Changes in v5:
- Move request_irq to epf test function driver for more flexiable user case
- Add fixed size bar handler
- Some minor improvememtn to see each patches's changelog.
- Link to v4: https://lore.kernel.org/r/20241031-ep-msi-v4-0-717da2d99b28@nxp.com
Changes in v4:
- Remove patch genirq/msi: Add cleanup guard define for msi_lock_descs()/msi_unlock_descs()
- Use new method to avoid compatible problem.
Add new command DOORBELL_ENABLE and DOORBELL_DISABLE.
pcitest -B send DOORBELL_ENABLE first, EP test function driver try to
remap one of BAR_N (except test register bar) to ITS MSI MMIO space. Old
driver don't support new command, so failure return, not side effect.
After test, DOORBELL_DISABLE command send out to recover original map, so
pcitest bar test can pass as normal.
- Other detail change see each patches's change log
- Link to v3: https://lore.kernel.org/r/20241015-ep-msi-v3-0-cedc89a16c1a@nxp.com
Change from v2 to v3
- Fixed manivannan's comments
- Move common part to pci-ep-msi.c and pci-ep-msi.h
- rebase to 6.12-rc1
- use RevID to distingiush old version
mkdir /sys/kernel/config/pci_ep/functions/pci_epf_test/func1
echo 16 > /sys/kernel/config/pci_ep/functions/pci_epf_test/func1/msi_interrupts
echo 0x080c > /sys/kernel/config/pci_ep/functions/pci_epf_test/func1/deviceid
echo 0x1957 > /sys/kernel/config/pci_ep/functions/pci_epf_test/func1/vendorid
echo 1 > /sys/kernel/config/pci_ep/functions/pci_epf_test/func1/revid
^^^^^^ to enable platform msi support.
ln -s /sys/kernel/config/pci_ep/functions/pci_epf_test/func1 /sys/kernel/config/pci_ep/controllers/4c380000.pcie-ep
- use new device ID, which identify support doorbell to avoid broken
compatility.
Enable doorbell support only for PCI_DEVICE_ID_IMX8_DB, while other devices
keep the same behavior as before.
EP side RC with old driver RC with new driver
PCI_DEVICE_ID_IMX8_DB no probe doorbell enabled
Other device ID doorbell disabled* doorbell disabled*
* Behavior remains unchanged.
Change from v1 to v2
- Add missed patch for endpont/pci-epf-test.c
- Move alloc and free to epc driver from epf.
- Provide general help function for EPC driver to alloc platform msi irq.
- Fixed manivannan's comments.
Signed-off-by: Frank Li <Frank.Li(a)nxp.com>
---
Frank Li (15):
platform-msi: Add msi_remove_device_irq_domain() in platform_device_msi_free_irqs_all()
irqdomain: Add IRQ_DOMAIN_FLAG_MSI_IMMUTABLE and irq_domain_is_msi_immutable()
irqchip/gic-v3-its: Set IRQ_DOMAIN_FLAG_MSI_IMMUTABLE for ITS
dt-bindings: pci: pci-msi: Add support for PCI Endpoint msi-map
irqchip/gic-v3-its: Add support for device tree msi-map and msi-mask
PCI: endpoint: Set ID and of_node for function driver
PCI: endpoint: Add RC-to-EP doorbell support using platform MSI controller
PCI: endpoint: pci-ep-msi: Add MSI address/data pair mutable check
PCI: endpoint: Add pci_epf_align_inbound_addr() helper for address alignment
PCI: endpoint: pci-epf-test: Add doorbell test support
misc: pci_endpoint_test: Add doorbell test case
selftests: pci_endpoint: Add doorbell test case
pci: imx6: Add helper function imx_pcie_add_lut_by_rid()
pci: imx6: Add LUT setting for MSI/IOMMU in Endpoint mode
arm64: dts: imx95: Add msi-map for pci-ep device
Documentation/devicetree/bindings/pci/pci-msi.txt | 51 ++++++++
arch/arm64/boot/dts/freescale/imx95.dtsi | 1 +
drivers/base/platform-msi.c | 1 +
drivers/irqchip/irq-gic-v3-its-msi-parent.c | 8 ++
drivers/irqchip/irq-gic-v3-its.c | 2 +-
drivers/misc/pci_endpoint_test.c | 82 ++++++++++++
drivers/pci/controller/dwc/pci-imx6.c | 25 ++--
drivers/pci/endpoint/Makefile | 1 +
drivers/pci/endpoint/functions/pci-epf-test.c | 142 +++++++++++++++++++++
drivers/pci/endpoint/pci-ep-msi.c | 90 +++++++++++++
drivers/pci/endpoint/pci-epf-core.c | 48 +++++++
include/linux/irqdomain.h | 7 +
include/linux/pci-ep-msi.h | 28 ++++
include/linux/pci-epf.h | 21 +++
include/uapi/linux/pcitest.h | 1 +
.../selftests/pci_endpoint/pci_endpoint_test.c | 28 ++++
16 files changed, 527 insertions(+), 9 deletions(-)
---
base-commit: a4949bd40778aa9beac77c89e4c6a1da52875c8b
change-id: 20241010-ep-msi-8b4cab33b1be
Best regards,
---
Frank Li <Frank.Li(a)nxp.com>
Currently if the filesystem for the cgroups version it wants to use is
not mounted charge_reserved_hugetlb.sh and hugetlb_reparenting_test.sh
tests will attempt to mount it on the hard coded path
/dev/cgroup/memory, deleting that directory when the test finishes. This
will fail if there is not a preexisting directory at that path, and
since the directory is deleted subsequent runs of the test will fail.
Instead of relying on this hard coded directory name use mktemp to
generate a temporary directory to use as a mountpoint, fixing both the
assumption and the disruption caused by deleting a preexisting
directory.
This means that if the relevant cgroup filesystem is not already mounted
then we rely on having coreutils (which provides mktemp) installed. I
suspect that many current users are relying on having things automounted
by default, and given that the script relies on bash it's probably not
an unreasonable requirement.
Fixes: 209376ed2a84 ("selftests/vm: make charge_reserved_hugetlb.sh work with existing cgroup setting")
Signed-off-by: Mark Brown <broonie(a)kernel.org>
---
tools/testing/selftests/mm/charge_reserved_hugetlb.sh | 4 ++--
tools/testing/selftests/mm/hugetlb_reparenting_test.sh | 2 +-
2 files changed, 3 insertions(+), 3 deletions(-)
diff --git a/tools/testing/selftests/mm/charge_reserved_hugetlb.sh b/tools/testing/selftests/mm/charge_reserved_hugetlb.sh
index 67df7b47087f..e1fe16bcbbe8 100755
--- a/tools/testing/selftests/mm/charge_reserved_hugetlb.sh
+++ b/tools/testing/selftests/mm/charge_reserved_hugetlb.sh
@@ -29,7 +29,7 @@ fi
if [[ $cgroup2 ]]; then
cgroup_path=$(mount -t cgroup2 | head -1 | awk '{print $3}')
if [[ -z "$cgroup_path" ]]; then
- cgroup_path=/dev/cgroup/memory
+ cgroup_path=$(mktemp -d)
mount -t cgroup2 none $cgroup_path
do_umount=1
fi
@@ -37,7 +37,7 @@ if [[ $cgroup2 ]]; then
else
cgroup_path=$(mount -t cgroup | grep ",hugetlb" | awk '{print $3}')
if [[ -z "$cgroup_path" ]]; then
- cgroup_path=/dev/cgroup/memory
+ cgroup_path=$(mktemp -d)
mount -t cgroup memory,hugetlb $cgroup_path
do_umount=1
fi
diff --git a/tools/testing/selftests/mm/hugetlb_reparenting_test.sh b/tools/testing/selftests/mm/hugetlb_reparenting_test.sh
index 11f9bbe7dc22..0b0d4ba1af27 100755
--- a/tools/testing/selftests/mm/hugetlb_reparenting_test.sh
+++ b/tools/testing/selftests/mm/hugetlb_reparenting_test.sh
@@ -23,7 +23,7 @@ fi
if [[ $cgroup2 ]]; then
CGROUP_ROOT=$(mount -t cgroup2 | head -1 | awk '{print $3}')
if [[ -z "$CGROUP_ROOT" ]]; then
- CGROUP_ROOT=/dev/cgroup/memory
+ CGROUP_ROOT=$(mktemp -d)
mount -t cgroup2 none $CGROUP_ROOT
do_umount=1
fi
---
base-commit: a4cda136f021ad44b8b52286aafd613030a6db5f
change-id: 20250403-kselftest-mm-cgroup2-detection-b761fd232f9d
Best regards,
--
Mark Brown <broonie(a)kernel.org>
Our CI expects output from the test at least once every 10 minutes.
The AMT test when running on debug kernel is just on the edge
of that time for the stress test. Improve the output:
- print the name of the test first, before starting it,
- output a dot every 10% of the way.
Output after:
TEST: amt discovery [ OK ]
TEST: IPv4 amt multicast forwarding [ OK ]
TEST: IPv6 amt multicast forwarding [ OK ]
TEST: IPv4 amt traffic forwarding torture .......... [ OK ]
TEST: IPv6 amt traffic forwarding torture .......... [ OK ]
Signed-off-by: Jakub Kicinski <kuba(a)kernel.org>
---
Since net-next is closed I'm sending this for net.
We enabled DEBUG_PREEMPT in the debug flavor and the test now
times out most of the time.
CC: ap420073(a)gmail.com
CC: shuah(a)kernel.org
CC: linux-kselftest(a)vger.kernel.org
---
tools/testing/selftests/net/amt.sh | 20 ++++++++++++++------
1 file changed, 14 insertions(+), 6 deletions(-)
diff --git a/tools/testing/selftests/net/amt.sh b/tools/testing/selftests/net/amt.sh
index d458b45c775b..3ef209cacb8e 100755
--- a/tools/testing/selftests/net/amt.sh
+++ b/tools/testing/selftests/net/amt.sh
@@ -194,15 +194,21 @@ test_remote_ip()
send_mcast_torture4()
{
- ip netns exec "${SOURCE}" bash -c \
- 'cat /dev/urandom | head -c 1G | nc -w 1 -u 239.0.0.1 4001'
+ for i in `seq 10`; do
+ ip netns exec "${SOURCE}" bash -c \
+ 'cat /dev/urandom | head -c 100M | nc -w 1 -u 239.0.0.1 4001'
+ echo -n "."
+ done
}
send_mcast_torture6()
{
- ip netns exec "${SOURCE}" bash -c \
- 'cat /dev/urandom | head -c 1G | nc -w 1 -u ff0e::5:6 6001'
+ for i in `seq 10`; do
+ ip netns exec "${SOURCE}" bash -c \
+ 'cat /dev/urandom | head -c 100M | nc -w 1 -u ff0e::5:6 6001'
+ echo -n "."
+ done
}
check_features()
@@ -278,10 +284,12 @@ wait $pid || err=$?
if [ $err -eq 1 ]; then
ERR=1
fi
+printf "TEST: %-50s" "IPv4 amt traffic forwarding torture"
send_mcast_torture4
-printf "TEST: %-60s [ OK ]\n" "IPv4 amt traffic forwarding torture"
+printf " [ OK ]\n"
+printf "TEST: %-50s" "IPv6 amt traffic forwarding torture"
send_mcast_torture6
-printf "TEST: %-60s [ OK ]\n" "IPv6 amt traffic forwarding torture"
+printf " [ OK ]\n"
sleep 5
if [ "${ERR}" -eq 1 ]; then
echo "Some tests failed." >&2
--
2.49.0
For testing the functionality of the vDSO, it is necessary to build
userspace programs for multiple different architectures.
It is additional work to acquire matching userspace cross-compilers with
full C libraries and then building root images out of those.
The kernel tree already contains nolibc, a small, header-only C library.
By using it, it is possible to build userspace programs without any
additional dependencies.
For example the kernel.org crosstools or multi-target clang can be used
to build test programs for a multitude of architectures.
While nolibc is very limited, it is enough for many selftests.
With some minor adjustments it is possible to make parse_vdso.c
compatible with nolibc.
As an example, vdso_standalone_test_x86 is now built from the same C
code as the regular vdso_test_gettimeofday, while still being completely
standalone.
Also drop the dependency of parse_vdso.c on the elf.h header from libc and only
use the one from the kernel's UAPI.
While this series is useful on its own now, it will also integrate with the
kunit UAPI framework currently under development:
https://lore.kernel.org/lkml/20250217-kunit-kselftests-v1-0-42b4524c3b0a@li…
Signed-off-by: Thomas Weißschuh <thomas.weissschuh(a)linutronix.de>
---
Changes in v2:
- Provide a limits.h header in nolibc
- Pick up Reviewed-by tags from Kees
- Link to v1: https://lore.kernel.org/r/20250203-parse_vdso-nolibc-v1-0-9cb6268d77be@linu…
---
Thomas Weißschuh (16):
MAINTAINERS: Add vDSO selftests
elf, uapi: Add definition for STN_UNDEF
elf, uapi: Add definition for DT_GNU_HASH
elf, uapi: Add definitions for VER_FLG_BASE and VER_FLG_WEAK
elf, uapi: Add type ElfXX_Versym
elf, uapi: Add types ElfXX_Verdef and ElfXX_Veraux
tools/include: Add uapi/linux/elf.h
selftests: Add headers target
tools/nolibc: add limits.h shim header
selftests: vDSO: vdso_standalone_test_x86: Use vdso_init_form_sysinfo_ehdr
selftests: vDSO: parse_vdso: Drop vdso_init_from_auxv()
selftests: vDSO: parse_vdso: Use UAPI headers instead of libc headers
selftests: vDSO: parse_vdso: Test __SIZEOF_LONG__ instead of ULONG_MAX
selftests: vDSO: vdso_test_gettimeofday: Clean up includes
selftests: vDSO: vdso_test_gettimeofday: Make compatible with nolibc
selftests: vDSO: vdso_standalone_test_x86: Switch to nolibc
MAINTAINERS | 1 +
include/uapi/linux/elf.h | 38 ++
tools/include/nolibc/Makefile | 1 +
tools/include/nolibc/limits.h | 7 +
tools/include/uapi/linux/elf.h | 524 +++++++++++++++++++++
tools/testing/selftests/lib.mk | 5 +-
tools/testing/selftests/vDSO/Makefile | 11 +-
tools/testing/selftests/vDSO/parse_vdso.c | 19 +-
tools/testing/selftests/vDSO/parse_vdso.h | 1 -
.../selftests/vDSO/vdso_standalone_test_x86.c | 143 +-----
.../selftests/vDSO/vdso_test_gettimeofday.c | 4 +-
11 files changed, 590 insertions(+), 164 deletions(-)
---
base-commit: 2014c95afecee3e76ca4a56956a936e23283f05b
change-id: 20241017-parse_vdso-nolibc-e069baa7ff48
Best regards,
--
Thomas Weißschuh <thomas.weissschuh(a)linutronix.de>
Hi Thomas
Following posix_timers kself test failed in 6.12.y
-------------------------
not ok 7 check_sig_ign SIGEV_SIGNAL
not ok 9 check_rearm
not ok 10 check_delete
-------------------------
Reason of failure:
6.12.y does not support these KSELT tests,
Because the following commits were not backported in 6.12.y:
6017a158beb: "posix-timers: Embed sigqueue in struct k_itimer"
69f032c92cf: "signal: Provide ignored_posix_timers list"
is it feasible to backport in these commits or complete series of patch
in 6.12.y ( [patch V7 00/21] posix-timers: Cure the SIG_IGN mess)
6017a158beb: "posix-timers: Embed sigqueue in struct k_itimer"
69f032c92cf8: "signal: Provide ignored_posix_timers list"
if not, we shall revert following kself test from 6.12.y
45c4225c3dcc "selftests/timers/posix_timers: Add SIG_IGN test"
e65bb03e4427 "selftests/timers/posix_timers: Validate signal rules"
Thanks,
Alok
A hds-thresh value is not set correctly if input value is 0.
The cause is that ethtool_ringparam_get_cfg(), which is a internal
function that returns ringparameters from both ->get_ringparam() and
dev->cfg can't return a correct hds-thresh value.
The first patch fixes ethtool_ringparam_get_cfg() to set hds-thresh
value correcltly.
The second patch adds random test for hds-thresh value.
So that we can test 0 value for a hds-thresh properly.
Taehee Yoo (2):
net: ethtool: fix ethtool_ringparam_get_cfg() returns a hds_thresh
value always as 0.
selftests: drv-net: test random value for hds-thresh
net/ethtool/common.c | 1 +
tools/testing/selftests/drivers/net/hds.py | 28 +++++++++++++++++++++-
2 files changed, 28 insertions(+), 1 deletion(-)
--
2.34.1
This patchset originates from my attempt to resolve a KMSAN warning that
has existed for over 3 years:
https://syzkaller.appspot.com/bug?extid=0e6ddb1ef80986bdfe64
Previously, we had a brief discussion in this thread about whether we can
simply perform memset in adjust_{head,meta}:
https://lore.kernel.org/netdev/20250328043941.085de23b@kernel.org/T/#t
Unfortunately, I couldn't find a similar topic in the mail list, but I did
find a similar security-related commit:
commit 6dfb970d3dbd ("xdp: avoid leaking info stored in frame data on page reuse")
I just create a new topic here and make subject more clear, we can discuss
this here.
Meanwhile, I also discovered a related issue that led to a CVE,specifically
the Facebook Katran vulnerability (https://vuldb.com/?id.246309).
Currently, even with unprivileged functionality disabled, a user can load
a BPF program using CAP_BPF and CAP_NET_ADMIN, which I believe we should
avoid exposing kernel memory directly to users now.
Regarding performance considerations, I added corresponding results to the
selftest, testing common MAC headers and IP headers of various sizes.
Compared to not using memset, the execution time increased by 2ns, but I
think this is negligible considering the entire net stack.
Jiayuan Chen (2):
bpf, xdp: clean head/meta when expanding it
selftests/bpf: add perf test for adjust_{head,meta}
include/uapi/linux/bpf.h | 8 +--
net/core/filter.c | 5 +-
tools/include/uapi/linux/bpf.h | 6 ++-
.../selftests/bpf/prog_tests/xdp_perf.c | 52 ++++++++++++++++---
tools/testing/selftests/bpf/progs/xdp_dummy.c | 14 +++++
5 files changed, 72 insertions(+), 13 deletions(-)
--
2.47.1
From: Amery Hung <ameryhung(a)gmail.com>
[ Upstream commit b99f27e90268b1a814c13f8bd72ea1db448ea257 ]
Fix a race condition between the main test_progs thread and the traffic
monitoring thread. The traffic monitor thread tries to print a line
using multiple printf and use flockfile() to prevent the line from being
torn apart. Meanwhile, the main thread doing io redirection can reassign
or close stdout when going through tests. A deadlock as shown below can
happen.
main traffic_monitor_thread
==== ======================
show_transport()
-> flockfile(stdout)
stdio_hijack_init()
-> stdout = open_memstream(log_buf, log_cnt);
...
env.subtest_state->stdout_saved = stdout;
...
funlockfile(stdout)
stdio_restore_cleanup()
-> fclose(env.subtest_state->stdout_saved);
After the traffic monitor thread lock stdout, A new memstream can be
assigned to stdout by the main thread. Therefore, the traffic monitor
thread later will not be able to unlock the original stdout. As the
main thread tries to access the old stdout, it will hang indefinitely
as it is still locked by the traffic monitor thread.
The deadlock can be reproduced by running test_progs repeatedly with
traffic monitor enabled:
for ((i=1;i<=100;i++)); do
./test_progs -a flow_dissector_skb* -m '*'
done
Fix this by only calling printf once and remove flockfile()/funlockfile().
Signed-off-by: Amery Hung <ameryhung(a)gmail.com>
Signed-off-by: Martin KaFai Lau <martin.lau(a)kernel.org>
Link: https://patch.msgid.link/20250213233217.553258-1-ameryhung@gmail.com
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
---
tools/testing/selftests/bpf/network_helpers.c | 33 ++++++++-----------
1 file changed, 13 insertions(+), 20 deletions(-)
diff --git a/tools/testing/selftests/bpf/network_helpers.c b/tools/testing/selftests/bpf/network_helpers.c
index 27784946b01b8..af0ee70a53f9f 100644
--- a/tools/testing/selftests/bpf/network_helpers.c
+++ b/tools/testing/selftests/bpf/network_helpers.c
@@ -771,12 +771,13 @@ static const char *pkt_type_str(u16 pkt_type)
return "Unknown";
}
+#define MAX_FLAGS_STRLEN 21
/* Show the information of the transport layer in the packet */
static void show_transport(const u_char *packet, u16 len, u32 ifindex,
const char *src_addr, const char *dst_addr,
u16 proto, bool ipv6, u8 pkt_type)
{
- char *ifname, _ifname[IF_NAMESIZE];
+ char *ifname, _ifname[IF_NAMESIZE], flags[MAX_FLAGS_STRLEN] = "";
const char *transport_str;
u16 src_port, dst_port;
struct udphdr *udp;
@@ -817,29 +818,21 @@ static void show_transport(const u_char *packet, u16 len, u32 ifindex,
/* TCP or UDP*/
- flockfile(stdout);
+ if (proto == IPPROTO_TCP)
+ snprintf(flags, MAX_FLAGS_STRLEN, "%s%s%s%s",
+ tcp->fin ? ", FIN" : "",
+ tcp->syn ? ", SYN" : "",
+ tcp->rst ? ", RST" : "",
+ tcp->ack ? ", ACK" : "");
+
if (ipv6)
- printf("%-7s %-3s IPv6 %s.%d > %s.%d: %s, length %d",
+ printf("%-7s %-3s IPv6 %s.%d > %s.%d: %s, length %d%s\n",
ifname, pkt_type_str(pkt_type), src_addr, src_port,
- dst_addr, dst_port, transport_str, len);
+ dst_addr, dst_port, transport_str, len, flags);
else
- printf("%-7s %-3s IPv4 %s:%d > %s:%d: %s, length %d",
+ printf("%-7s %-3s IPv4 %s:%d > %s:%d: %s, length %d%s\n",
ifname, pkt_type_str(pkt_type), src_addr, src_port,
- dst_addr, dst_port, transport_str, len);
-
- if (proto == IPPROTO_TCP) {
- if (tcp->fin)
- printf(", FIN");
- if (tcp->syn)
- printf(", SYN");
- if (tcp->rst)
- printf(", RST");
- if (tcp->ack)
- printf(", ACK");
- }
-
- printf("\n");
- funlockfile(stdout);
+ dst_addr, dst_port, transport_str, len, flags);
}
static void show_ipv6_packet(const u_char *packet, u32 ifindex, u8 pkt_type)
--
2.39.5
From: Amery Hung <ameryhung(a)gmail.com>
[ Upstream commit b99f27e90268b1a814c13f8bd72ea1db448ea257 ]
Fix a race condition between the main test_progs thread and the traffic
monitoring thread. The traffic monitor thread tries to print a line
using multiple printf and use flockfile() to prevent the line from being
torn apart. Meanwhile, the main thread doing io redirection can reassign
or close stdout when going through tests. A deadlock as shown below can
happen.
main traffic_monitor_thread
==== ======================
show_transport()
-> flockfile(stdout)
stdio_hijack_init()
-> stdout = open_memstream(log_buf, log_cnt);
...
env.subtest_state->stdout_saved = stdout;
...
funlockfile(stdout)
stdio_restore_cleanup()
-> fclose(env.subtest_state->stdout_saved);
After the traffic monitor thread lock stdout, A new memstream can be
assigned to stdout by the main thread. Therefore, the traffic monitor
thread later will not be able to unlock the original stdout. As the
main thread tries to access the old stdout, it will hang indefinitely
as it is still locked by the traffic monitor thread.
The deadlock can be reproduced by running test_progs repeatedly with
traffic monitor enabled:
for ((i=1;i<=100;i++)); do
./test_progs -a flow_dissector_skb* -m '*'
done
Fix this by only calling printf once and remove flockfile()/funlockfile().
Signed-off-by: Amery Hung <ameryhung(a)gmail.com>
Signed-off-by: Martin KaFai Lau <martin.lau(a)kernel.org>
Link: https://patch.msgid.link/20250213233217.553258-1-ameryhung@gmail.com
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
---
tools/testing/selftests/bpf/network_helpers.c | 33 ++++++++-----------
1 file changed, 13 insertions(+), 20 deletions(-)
diff --git a/tools/testing/selftests/bpf/network_helpers.c b/tools/testing/selftests/bpf/network_helpers.c
index 27784946b01b8..af0ee70a53f9f 100644
--- a/tools/testing/selftests/bpf/network_helpers.c
+++ b/tools/testing/selftests/bpf/network_helpers.c
@@ -771,12 +771,13 @@ static const char *pkt_type_str(u16 pkt_type)
return "Unknown";
}
+#define MAX_FLAGS_STRLEN 21
/* Show the information of the transport layer in the packet */
static void show_transport(const u_char *packet, u16 len, u32 ifindex,
const char *src_addr, const char *dst_addr,
u16 proto, bool ipv6, u8 pkt_type)
{
- char *ifname, _ifname[IF_NAMESIZE];
+ char *ifname, _ifname[IF_NAMESIZE], flags[MAX_FLAGS_STRLEN] = "";
const char *transport_str;
u16 src_port, dst_port;
struct udphdr *udp;
@@ -817,29 +818,21 @@ static void show_transport(const u_char *packet, u16 len, u32 ifindex,
/* TCP or UDP*/
- flockfile(stdout);
+ if (proto == IPPROTO_TCP)
+ snprintf(flags, MAX_FLAGS_STRLEN, "%s%s%s%s",
+ tcp->fin ? ", FIN" : "",
+ tcp->syn ? ", SYN" : "",
+ tcp->rst ? ", RST" : "",
+ tcp->ack ? ", ACK" : "");
+
if (ipv6)
- printf("%-7s %-3s IPv6 %s.%d > %s.%d: %s, length %d",
+ printf("%-7s %-3s IPv6 %s.%d > %s.%d: %s, length %d%s\n",
ifname, pkt_type_str(pkt_type), src_addr, src_port,
- dst_addr, dst_port, transport_str, len);
+ dst_addr, dst_port, transport_str, len, flags);
else
- printf("%-7s %-3s IPv4 %s:%d > %s:%d: %s, length %d",
+ printf("%-7s %-3s IPv4 %s:%d > %s:%d: %s, length %d%s\n",
ifname, pkt_type_str(pkt_type), src_addr, src_port,
- dst_addr, dst_port, transport_str, len);
-
- if (proto == IPPROTO_TCP) {
- if (tcp->fin)
- printf(", FIN");
- if (tcp->syn)
- printf(", SYN");
- if (tcp->rst)
- printf(", RST");
- if (tcp->ack)
- printf(", ACK");
- }
-
- printf("\n");
- funlockfile(stdout);
+ dst_addr, dst_port, transport_str, len, flags);
}
static void show_ipv6_packet(const u_char *packet, u32 ifindex, u8 pkt_type)
--
2.39.5
From: Amery Hung <ameryhung(a)gmail.com>
[ Upstream commit b99f27e90268b1a814c13f8bd72ea1db448ea257 ]
Fix a race condition between the main test_progs thread and the traffic
monitoring thread. The traffic monitor thread tries to print a line
using multiple printf and use flockfile() to prevent the line from being
torn apart. Meanwhile, the main thread doing io redirection can reassign
or close stdout when going through tests. A deadlock as shown below can
happen.
main traffic_monitor_thread
==== ======================
show_transport()
-> flockfile(stdout)
stdio_hijack_init()
-> stdout = open_memstream(log_buf, log_cnt);
...
env.subtest_state->stdout_saved = stdout;
...
funlockfile(stdout)
stdio_restore_cleanup()
-> fclose(env.subtest_state->stdout_saved);
After the traffic monitor thread lock stdout, A new memstream can be
assigned to stdout by the main thread. Therefore, the traffic monitor
thread later will not be able to unlock the original stdout. As the
main thread tries to access the old stdout, it will hang indefinitely
as it is still locked by the traffic monitor thread.
The deadlock can be reproduced by running test_progs repeatedly with
traffic monitor enabled:
for ((i=1;i<=100;i++)); do
./test_progs -a flow_dissector_skb* -m '*'
done
Fix this by only calling printf once and remove flockfile()/funlockfile().
Signed-off-by: Amery Hung <ameryhung(a)gmail.com>
Signed-off-by: Martin KaFai Lau <martin.lau(a)kernel.org>
Link: https://patch.msgid.link/20250213233217.553258-1-ameryhung@gmail.com
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
---
tools/testing/selftests/bpf/network_helpers.c | 33 ++++++++-----------
1 file changed, 13 insertions(+), 20 deletions(-)
diff --git a/tools/testing/selftests/bpf/network_helpers.c b/tools/testing/selftests/bpf/network_helpers.c
index 80844a5fb1fee..95e943270f359 100644
--- a/tools/testing/selftests/bpf/network_helpers.c
+++ b/tools/testing/selftests/bpf/network_helpers.c
@@ -771,12 +771,13 @@ static const char *pkt_type_str(u16 pkt_type)
return "Unknown";
}
+#define MAX_FLAGS_STRLEN 21
/* Show the information of the transport layer in the packet */
static void show_transport(const u_char *packet, u16 len, u32 ifindex,
const char *src_addr, const char *dst_addr,
u16 proto, bool ipv6, u8 pkt_type)
{
- char *ifname, _ifname[IF_NAMESIZE];
+ char *ifname, _ifname[IF_NAMESIZE], flags[MAX_FLAGS_STRLEN] = "";
const char *transport_str;
u16 src_port, dst_port;
struct udphdr *udp;
@@ -817,29 +818,21 @@ static void show_transport(const u_char *packet, u16 len, u32 ifindex,
/* TCP or UDP*/
- flockfile(stdout);
+ if (proto == IPPROTO_TCP)
+ snprintf(flags, MAX_FLAGS_STRLEN, "%s%s%s%s",
+ tcp->fin ? ", FIN" : "",
+ tcp->syn ? ", SYN" : "",
+ tcp->rst ? ", RST" : "",
+ tcp->ack ? ", ACK" : "");
+
if (ipv6)
- printf("%-7s %-3s IPv6 %s.%d > %s.%d: %s, length %d",
+ printf("%-7s %-3s IPv6 %s.%d > %s.%d: %s, length %d%s\n",
ifname, pkt_type_str(pkt_type), src_addr, src_port,
- dst_addr, dst_port, transport_str, len);
+ dst_addr, dst_port, transport_str, len, flags);
else
- printf("%-7s %-3s IPv4 %s:%d > %s:%d: %s, length %d",
+ printf("%-7s %-3s IPv4 %s:%d > %s:%d: %s, length %d%s\n",
ifname, pkt_type_str(pkt_type), src_addr, src_port,
- dst_addr, dst_port, transport_str, len);
-
- if (proto == IPPROTO_TCP) {
- if (tcp->fin)
- printf(", FIN");
- if (tcp->syn)
- printf(", SYN");
- if (tcp->rst)
- printf(", RST");
- if (tcp->ack)
- printf(", ACK");
- }
-
- printf("\n");
- funlockfile(stdout);
+ dst_addr, dst_port, transport_str, len, flags);
}
static void show_ipv6_packet(const u_char *packet, u32 ifindex, u8 pkt_type)
--
2.39.5
From: Dmitry Safonov <0x7f454c46(a)gmail.com>
self-connect-ipv6 got slightly flaky on netdev:
> # timeout set to 120
> # selftests: net/tcp_ao: self-connect_ipv6
> # 1..5
> # # 708[lib/setup.c:250] rand seed 1742872572
> # TAP version 13
> # # 708[lib/proc.c:213] Snmp6 Ip6OutNoRoutes: 0 => 1
> # not ok 1 # error 708[self-connect.c:70] failed to connect()
> # ok 2 No unexpected trace events during the test run
> # # Planned tests != run tests (5 != 2)
> # # Totals: pass:1 fail:0 xfail:0 xpass:0 skip:0 error:1
> ok 1 selftests: net/tcp_ao: self-connect_ipv6
I can not reproduce it on my machines, but judging by "Ip6OutNoRoutes"
there is no route to the local_addr (::1).
Looking at the kernel code, I see that kernel does add link-local
address automatically in init_loopback(), but that is called from
ipv6 notifier block. So, in turn the userspace that brought up
the loopback interface may see rtnetlink ACK earlier than
addrconf_notify() does it's job (at least, on a slow VM such as netdev).
Probably, for ipv4 it's the same, judging by inetdev_event().
The fix is quite simple: set the link-local route straight after
bringing the loopback interface. That will make it synchronous.
Signed-off-by: Dmitry Safonov <0x7f454c46(a)gmail.com>
---
Sorry to send this during the merge window, it's a test stability fix.
It seems that netdev build bot has hit the issue a couple of times, but
seems not hitting it constantly at this moment:
https://netdev.bots.linux.dev/flakes.html?br-cnt=150&tn-needle=tcp-ao
I'm marking it net-next, so that build bot carries it until the merge
closes. If it's not fine, I can re-send it after the merge window.
---
tools/testing/selftests/net/tcp_ao/self-connect.c | 3 +++
1 file changed, 3 insertions(+)
diff --git a/tools/testing/selftests/net/tcp_ao/self-connect.c b/tools/testing/selftests/net/tcp_ao/self-connect.c
index 73b2f2276f3f5410aaa74bede7f366f81761bd6e..2c73bea698a677f9aedd7bec28f6e7fee7845d2e 100644
--- a/tools/testing/selftests/net/tcp_ao/self-connect.c
+++ b/tools/testing/selftests/net/tcp_ao/self-connect.c
@@ -16,6 +16,9 @@ static void __setup_lo_intf(const char *lo_intf,
if (link_set_up(lo_intf))
test_error("Failed to bring %s up", lo_intf);
+
+ if (ip_route_add(lo_intf, TEST_FAMILY, local_addr, local_addr))
+ test_error("Failed to add a local route %s", lo_intf);
}
static void setup_lo_intf(const char *lo_intf)
---
base-commit: 1a9239bb4253f9076b5b4b2a1a4e8d7defd77a95
change-id: 20250402-tcp-ao-selfconnect-flake-e0aabc03c076
Best regards,
--
Dmitry Safonov <0x7f454c46(a)gmail.com>
This improves the expressiveness of unprivileged BPF by inserting
speculation barriers instead of rejecting the programs.
The approach was previously presented at LPC'24 [1] and RAID'24 [2].
To mitigate the Spectre v1 (PHT) vulnerability, the kernel rejects
potentially-dangerous unprivileged BPF programs as of
commit 9183671af6db ("bpf: Fix leakage under speculation on mispredicted
branches"). In [2], we have analyzed 364 object files from open source
projects (Linux Samples and Selftests, BCC, Loxilb, Cilium, libbpf
Examples, Parca, and Prevail) and found that this affects 31% to 54% of
programs.
To resolve this in the majority of cases this patchset adds a fall-back
for mitigating Spectre v1 using speculation barriers. The kernel still
optimistically attempts to verify all speculative paths but uses
speculation barriers against v1 when unsafe behavior is detected. This
allows for more programs to be accepted without disabling the BPF
Spectre mitigations (e.g., by setting cpu_mitigations_off()).
In [1] we have measured the overhead of this approach relative to having
mitigations off and including the upstream Spectre v4 mitigations. For
event tracing and stack-sampling profilers, we found that mitigations
increase BPF program execution time by 0% to 62%. For the Loxilb network
load balancer, we have measured a 14% slowdown in SCTP performance but
no significant slowdown for TCP. This overhead only applies to programs
that were previously rejected.
I reran the expressiveness-evaluation with v6.14 and made sure the main
results still match those from [1] and [2] (which used v6.5).
Main design decisions are:
* Do not use separate bytecode insns for v1 and v4 barriers. This
simplifies the verifier significantly and has the only downside that
performance on PowerPC is not as high as it could be.
* Allow archs to still disable v1/v4 mitigations separately by setting
bpf_jit_bypass_spec_v1/v4(). This has the benefit that archs can
benefit from improved BPF expressiveness / performance if they are not
vulnerable (e.g., ARM64 for v4 in the kernel).
* Do not remove the empty BPF_NOSPEC implementation for backends for
which it is unknown whether they are vulnerable to Spectre v1.
[1] https://lpc.events/event/18/contributions/1954/ ("Mitigating
Spectre-PHT using Speculation Barriers in Linux eBPF")
[2] https://arxiv.org/pdf/2405.00078 ("VeriFence: Lightweight and
Precise Spectre Defenses for Untrusted Linux Kernel Extensions")
Changes:
* RFC -> v1:
- rebase to bpf-next-250313
- tests: mark expected successes/new errors
- add bpt_jit_bypass_spec_v1/v4() to avoid #ifdef in
bpf_bypass_spec_v1/v4()
- ensure that nospec with v1-support is implemented for archs for
which GCC supports speculation barriers, except for MIPS
- arm64: emit speculation barrier
- powerpc: change nospec to include v1 barrier
- discuss potential security (archs that do not impl. BPF nospec) and
performance (only PowerPC) regressions
RFC: https://lore.kernel.org/bpf/20250224203619.594724-1-luis.gerhorst@fau.de/
Luis Gerhorst (11):
bpf: Move insn if/else into do_check_insn()
bpf: Return -EFAULT on misconfigurations
bpf: Return -EFAULT on internal errors
bpf, arm64, powerpc: Add bpf_jit_bypass_spec_v1/v4()
bpf, arm64, powerpc: Change nospec to include v1 barrier
bpf: Rename sanitize_stack_spill to nospec_result
bpf: Fall back to nospec for Spectre v1
bpf: Allow nospec-protected var-offset stack access
bpf: Return PTR_ERR from push_stack()
bpf: Fall back to nospec for sanitization-failures
bpf: Fall back to nospec for spec path verification
arch/arm64/net/bpf_jit.h | 5 +
arch/arm64/net/bpf_jit_comp.c | 28 +-
arch/powerpc/net/bpf_jit_comp64.c | 79 +-
include/linux/bpf.h | 11 +-
include/linux/bpf_verifier.h | 3 +-
include/linux/filter.h | 2 +-
kernel/bpf/core.c | 32 +-
kernel/bpf/verifier.c | 723 ++++++++++--------
.../selftests/bpf/progs/verifier_and.c | 3 +-
.../selftests/bpf/progs/verifier_bounds.c | 35 +-
.../bpf/progs/verifier_bounds_deduction.c | 43 +-
.../selftests/bpf/progs/verifier_map_ptr.c | 12 +-
.../selftests/bpf/progs/verifier_movsx.c | 6 +-
.../selftests/bpf/progs/verifier_unpriv.c | 3 +-
.../bpf/progs/verifier_value_ptr_arith.c | 50 +-
.../selftests/bpf/verifier/dead_code.c | 3 +-
tools/testing/selftests/bpf/verifier/jmp32.c | 33 +-
tools/testing/selftests/bpf/verifier/jset.c | 10 +-
18 files changed, 630 insertions(+), 451 deletions(-)
base-commit: 46d38f489ef02175dcff1e03a849c226eb0729a6
--
2.48.1
This series is built on top of Fuad's v7 "mapping guest_memfd backed
memory at the host" [1].
With James's KVM userfault [2], it is possible to handle stage-2 faults
in guest_memfd in userspace. However, KVM itself also triggers faults
in guest_memfd in some cases, for example: PV interfaces like kvmclock,
PV EOI and page table walking code when fetching the MMIO instruction on
x86. It was agreed in the guest_memfd upstream call on 23 Jan 2025 [3]
that KVM would be accessing those pages via userspace page tables. In
order for such faults to be handled in userspace, guest_memfd needs to
support userfaultfd.
Changes since v1 [4]:
- James, Peter: implement a full minor trap instead of a hybrid
missing/minor trap
- James, Peter: to avoid shmem- and guest_memfd-specific code in the
UFFDIO_CONTINUE implementation make it generic by calling
vm_ops->fault()
While generalising UFFDIO_CONTINUE implementation helped avoid
guest_memfd-specific code in mm/userfaulfd, userfaultfd still needs
access to KVM code to be able to verify the VMA type when handling
UFFDIO_REGISTER_MODE_MINOR, so I used a similar approach to what Fuad
did for now [5].
In v1, Peter was mentioning a potential for eliminating taking a folio
lock [6]. I did not implement that, but according to my testing, the
performance of shmem minor fault handling stayed the same after the
migration to calling vm_ops->fault() (tested on an x86).
Before:
./demand_paging_test -u MINOR -s shmem
Random seed: 0x6b8b4567
Testing guest mode: PA-bits:ANY, VA-bits:48, 4K pages
guest physical test memory: [0x3fffbffff000, 0x3ffffffff000)
Finished creating vCPUs and starting uffd threads
Started all vCPUs
All vCPU threads joined
Total guest execution time: 10.979277020s
Per-vcpu demand paging rate: 23876.253375 pgs/sec/vcpu
Overall demand paging rate: 23876.253375 pgs/sec
After:
./demand_paging_test -u MINOR -s shmem
Random seed: 0x6b8b4567
Testing guest mode: PA-bits:ANY, VA-bits:48, 4K pages
guest physical test memory: [0x3fffbffff000, 0x3ffffffff000)
Finished creating vCPUs and starting uffd threads
Started all vCPUs
All vCPU threads joined
Total guest execution time: 10.978893504s
Per-vcpu demand paging rate: 23877.087423 pgs/sec/vcpu
Overall demand paging rate: 23877.087423 pgs/sec
Nikita
[1] https://lore.kernel.org/kvm/20250318161823.4005529-1-tabba@google.com/T/
[2] https://lore.kernel.org/kvm/20250109204929.1106563-1-jthoughton@google.com/…
[3] https://docs.google.com/document/d/1M6766BzdY1Lhk7LiR5IqVR8B8mG3cr-cxTxOrAo…
[4] https://lore.kernel.org/kvm/20250303133011.44095-1-kalyazin@amazon.com/T/
[5] https://lore.kernel.org/kvm/20250318161823.4005529-1-tabba@google.com/T/#Z2…
[6] https://lore.kernel.org/kvm/20250303133011.44095-1-kalyazin@amazon.com/T/#m…
Nikita Kalyazin (5):
mm: userfaultfd: generic continue for non hugetlbfs
KVM: guest_memfd: add kvm_gmem_vma_is_gmem
mm: userfaultfd: allow to register continue for guest_memfd
KVM: guest_memfd: add support for userfaultfd minor
KVM: selftests: test userfaultfd minor for guest_memfd
include/linux/mm_types.h | 3 +
include/linux/userfaultfd_k.h | 13 ++-
mm/hugetlb.c | 2 +-
mm/shmem.c | 3 +-
mm/userfaultfd.c | 25 +++--
.../testing/selftests/kvm/guest_memfd_test.c | 94 +++++++++++++++++++
virt/kvm/guest_memfd.c | 15 +++
virt/kvm/kvm_mm.h | 1 +
8 files changed, 146 insertions(+), 10 deletions(-)
base-commit: 3cc51efc17a2c41a480eed36b31c1773936717e0
--
2.47.1
Vector registers are zero initialized by the kernel. Stop accepting
"all ones" as a clean value.
Note that this was not working as expected given that
value == 0xff
can be assumed to be always false by the compiler as value's range is
[-128, 127]. Both GCC (-Wtype-limits) and clang
(-Wtautological-constant-out-of-range-compare) warn about this.
Reviewed-by: Charlie Jenkins <charlie(a)rivosinc.com>
Tested-by: Charlie Jenkins <charlie(a)rivosinc.com>
Signed-off-by: Ignacio Encinas <ignacio(a)iencinas.com>
---
Changes in v2:
Remove code that becomes useless now that the only "clean" value for
vector registers is 0.
- Link to v1: https://lore.kernel.org/r/20250305-fix-v_exec_initval_nolibc-v1-1-b87b60e43…
---
tools/testing/selftests/riscv/vector/v_exec_initval_nolibc.c | 10 +++-------
1 file changed, 3 insertions(+), 7 deletions(-)
diff --git a/tools/testing/selftests/riscv/vector/v_exec_initval_nolibc.c b/tools/testing/selftests/riscv/vector/v_exec_initval_nolibc.c
index 35c0812e32de0c82a54f84bd52c4272507121e35..4dde05e45a04122b566cedc36d20b072413b00e2 100644
--- a/tools/testing/selftests/riscv/vector/v_exec_initval_nolibc.c
+++ b/tools/testing/selftests/riscv/vector/v_exec_initval_nolibc.c
@@ -6,7 +6,7 @@
* the values. To further ensure consistency, this file is compiled without
* libc and without auto-vectorization.
*
- * To be "clean" all values must be either all ones or all zeroes.
+ * To be "clean" all values must be all zeroes.
*/
#define __stringify_1(x...) #x
@@ -14,9 +14,8 @@
int main(int argc, char **argv)
{
- char prev_value = 0, value;
+ char value = 0;
unsigned long vl;
- int first = 1;
if (argc > 2 && strcmp(argv[2], "x"))
asm volatile (
@@ -44,14 +43,11 @@ int main(int argc, char **argv)
"vsrl.vi " __stringify(register) ", " __stringify(register) ", 8\n\t" \
".option pop\n\t" \
: "=r" (value)); \
- if (first) { \
- first = 0; \
- } else if (value != prev_value || !(value == 0x00 || value == 0xff)) { \
+ if (value != 0x00) { \
printf("Register " __stringify(register) \
" values not clean! value: %u\n", value); \
exit(-1); \
} \
- prev_value = value; \
} \
})
---
base-commit: 03d38806a902b36bf364cae8de6f1183c0a35a67
change-id: 20250301-fix-v_exec_initval_nolibc-498d976c372d
Best regards,
--
Ignacio Encinas <ignacio(a)iencinas.com>
This patch series fixes a number of bugs in the cpuset partition code as
well as improvement in remote partition handling. The test_cpuset_prs.sh
is also enhanced to allow more vigorous remote partition testing.
Waiman Long (10):
cgroup/cpuset: Fix race between newly created partition and dying one
cgroup/cpuset: Fix incorrect isolated_cpus update in
update_parent_effective_cpumask()
cgroup/cpuset: Fix error handling in remote_partition_disable()
cgroup/cpuset: Remove remote_partition_check() & make
update_cpumasks_hier() handle remote partition
cgroup/cpuset: Don't allow creation of local partition over a remote
one
cgroup/cpuset: Code cleanup and comment update
cgroup/cpuset: Remove unneeded goto in sched_partition_write() and
rename it
selftest/cgroup: Update test_cpuset_prs.sh to use | as effective CPUs
and state separator
selftest/cgroup: Clean up and restructure test_cpuset_prs.sh
selftest/cgroup: Add a remote partition transition test to
test_cpuset_prs.sh
include/linux/cgroup-defs.h | 1 +
include/linux/cgroup.h | 2 +-
kernel/cgroup/cgroup.c | 6 +
kernel/cgroup/cpuset-internal.h | 1 +
kernel/cgroup/cpuset.c | 401 +++++++-----
.../selftests/cgroup/test_cpuset_prs.sh | 617 ++++++++++++------
6 files changed, 649 insertions(+), 379 deletions(-)
--
2.48.1
Cppcheck warning:
int result is assigned to long long variable. If the variable is long long
to avoid loss of information, then you have loss of information.
This patch changes the type of page_size from 'unsigned int' to
'unsigned long' instead of using ULL suffixes. Changing hpage_size to
'unsigned long' was considered, but since gethugepage() expects an int,
this change was avoided.
Reported-by: David Binderman <dcb314(a)hotmail.com>
Closes: https://lore.kernel.org/all/AS8PR02MB10217315060BBFDB21F19643E9CA62@AS8PR02…
Signed-off-by: Siddarth G <siddarthsgml(a)gmail.com>
---
Changes since v2:
- v2 had conflict with current mainline, so this is a fresh patch
tools/testing/selftests/mm/pagemap_ioctl.c | 12 ++++++------
1 file changed, 6 insertions(+), 6 deletions(-)
diff --git a/tools/testing/selftests/mm/pagemap_ioctl.c b/tools/testing/selftests/mm/pagemap_ioctl.c
index 57b4bba2b45f..fe5ae8b25ff6 100644
--- a/tools/testing/selftests/mm/pagemap_ioctl.c
+++ b/tools/testing/selftests/mm/pagemap_ioctl.c
@@ -34,7 +34,7 @@
#define PAGEMAP "/proc/self/pagemap"
int pagemap_fd;
int uffd;
-unsigned int page_size;
+unsigned long page_size;
unsigned int hpage_size;
const char *progname;
@@ -184,7 +184,7 @@ void *gethugetlb_mem(int size, int *shmid)
int userfaultfd_tests(void)
{
- int mem_size, vec_size, written, num_pages = 16;
+ long mem_size, vec_size, written, num_pages = 16;
char *mem, *vec;
mem_size = num_pages * page_size;
@@ -213,7 +213,7 @@ int userfaultfd_tests(void)
written = pagemap_ioctl(mem, mem_size, vec, 1, PM_SCAN_WP_MATCHING | PM_SCAN_CHECK_WPASYNC,
vec_size - 2, PAGE_IS_WRITTEN, 0, 0, PAGE_IS_WRITTEN);
if (written < 0)
- ksft_exit_fail_msg("error %d %d %s\n", written, errno, strerror(errno));
+ ksft_exit_fail_msg("error %ld %d %s\n", written, errno, strerror(errno));
ksft_test_result(written == 0, "%s all new pages must not be written (dirty)\n", __func__);
@@ -995,7 +995,7 @@ int unmapped_region_tests(void)
{
void *start = (void *)0x10000000;
int written, len = 0x00040000;
- int vec_size = len / page_size;
+ long vec_size = len / page_size;
struct page_region *vec = malloc(sizeof(struct page_region) * vec_size);
/* 1. Get written pages */
@@ -1051,7 +1051,7 @@ static void test_simple(void)
int sanity_tests(void)
{
unsigned long long mem_size, vec_size;
- int ret, fd, i, buf_size;
+ long ret, fd, i, buf_size;
struct page_region *vec;
char *mem, *fmem;
struct stat sbuf;
@@ -1160,7 +1160,7 @@ int sanity_tests(void)
ret = stat(progname, &sbuf);
if (ret < 0)
- ksft_exit_fail_msg("error %d %d %s\n", ret, errno, strerror(errno));
+ ksft_exit_fail_msg("error %ld %d %s\n", ret, errno, strerror(errno));
fmem = mmap(NULL, sbuf.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
if (fmem == MAP_FAILED)
--
2.43.0
There are currently two ways in which ublk server exit is detected by
ublk_drv:
1. uring_cmd cancellation. If there are any outstanding uring_cmds which
have not been completed to the ublk server when it exits, io_uring
calls the uring_cmd callback with a special cancellation flag as the
issuing task is exiting.
2. I/O timeout. This is needed in addition to the above to handle the
"saturated queue" case, when all I/Os for a given queue are in the
ublk server, and therefore there are no outstanding uring_cmds to
cancel when the ublk server exits.
There are a couple of issues with this approach:
- It is complex and inelegant to have two methods to detect the same
condition
- The second method detects ublk server exit only after a long delay
(~30s, the default timeout assigned by the block layer). This delays
the nosrv behavior from kicking in and potential subsequent recovery
of the device.
The second issue is brought to light with the new test_generic_04. It
fails before this fix:
selftests: ublk: test_generic_04.sh
dev id is 0
dd: error writing '/dev/ublkb0': Input/output error
1+0 records in
0+0 records out
0 bytes copied, 30.0611 s, 0.0 kB/s
DEAD
dd took 31 seconds to exit (>= 5s tolerance)!
generic_04 : [FAIL]
Fix this by instead detecting and handling ublk server exit in the
character file release callback. This has several advantages:
- This one place can handle both saturated and unsaturated queues. Thus,
it replaces both preexisting methods of detecting ublk server exit.
- It runs quickly on ublk server exit - there is no 30s delay.
- It starts the process of removing task references in ublk_drv. This is
needed if we want to relax restrictions in the driver like letting
only one thread serve each queue
There is also the disadvantage that the character file release callback
can also be triggered by intentional close of the file, which is a
significant behavior change. Preexisting ublk servers (libublksrv) are
dependent on the ability to open/close the file multiple times. To
address this, only transition to a nosrv state if the file is released
while the ublk device is live. This allows for programs to open/close
the file multiple times during setup. It is still a behavior change if a
ublk server decides to close/reopen the file while the device is LIVE
(i.e. while it is responsible for serving I/O), but that would be highly
unusual. This behavior is in line with what is done by FUSE, which is
very similar to ublk in that a userspace daemon is providing services
traditionally provided by the kernel.
With this change in, the new test (and all other selftests, and all
ublksrv tests) pass:
selftests: ublk: test_generic_04.sh
dev id is 0
dd: error writing '/dev/ublkb0': Input/output error
1+0 records in
0+0 records out
0 bytes copied, 0.0376731 s, 0.0 kB/s
DEAD
generic_04 : [PASS]
Signed-off-by: Uday Shankar <ushankar(a)purestorage.com>
---
Changes in v2:
- Leave null ublk selftests target untouched, instead create new
fault_inject target for injecting per-I/O delay (Ming Lei)
- Allow multiple open/close of ublk character device with some
restrictions
- Drop patches which made it in separately at https://lore.kernel.org/r/20250401-ublk_selftests-v1-1-98129c9bc8bb@puresto…
- Consolidate more nosrv logic in ublk character device release, and
associated code cleanup
- Link to v1: https://lore.kernel.org/r/20250325-ublk_timeout-v1-0-262f0121a7bd@purestora…
---
drivers/block/ublk_drv.c | 187 +++++++-----------------
tools/testing/selftests/ublk/Makefile | 4 +-
tools/testing/selftests/ublk/fault_inject.c | 58 ++++++++
tools/testing/selftests/ublk/kublk.c | 6 +-
tools/testing/selftests/ublk/kublk.h | 4 +
tools/testing/selftests/ublk/test_generic_04.sh | 43 ++++++
6 files changed, 165 insertions(+), 137 deletions(-)
diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
index 2fd05c1bd30b03343cb6f357f8c08dd92ff47af9..d06f8a9aa23f8b846928247fc9e29002c10a49e3 100644
--- a/drivers/block/ublk_drv.c
+++ b/drivers/block/ublk_drv.c
@@ -162,7 +162,6 @@ struct ublk_queue {
bool force_abort;
bool timeout;
- bool canceling;
bool fail_io; /* copy of dev->state == UBLK_S_DEV_FAIL_IO */
unsigned short nr_io_ready; /* how many ios setup */
spinlock_t cancel_lock;
@@ -199,8 +198,6 @@ struct ublk_device {
struct completion completion;
unsigned int nr_queues_ready;
unsigned int nr_privileged_daemon;
-
- struct work_struct nosrv_work;
};
/* header of ublk_params */
@@ -209,8 +206,9 @@ struct ublk_params_header {
__u32 types;
};
-static bool ublk_abort_requests(struct ublk_device *ub, struct ublk_queue *ubq);
-
+static void ublk_stop_dev_unlocked(struct ublk_device *ub);
+static void ublk_abort_queue(struct ublk_device *ub, struct ublk_queue *ubq);
+static void __ublk_quiesce_dev(struct ublk_device *ub);
static inline struct request *__ublk_check_and_get_req(struct ublk_device *ub,
struct ublk_queue *ubq, int tag, size_t offset);
static inline unsigned int ublk_req_build_flags(struct request *req);
@@ -1314,8 +1312,6 @@ static void ublk_queue_cmd_list(struct ublk_queue *ubq, struct rq_list *l)
static enum blk_eh_timer_return ublk_timeout(struct request *rq)
{
struct ublk_queue *ubq = rq->mq_hctx->driver_data;
- unsigned int nr_inflight = 0;
- int i;
if (ubq->flags & UBLK_F_UNPRIVILEGED_DEV) {
if (!ubq->timeout) {
@@ -1326,26 +1322,6 @@ static enum blk_eh_timer_return ublk_timeout(struct request *rq)
return BLK_EH_DONE;
}
- if (!ubq_daemon_is_dying(ubq))
- return BLK_EH_RESET_TIMER;
-
- for (i = 0; i < ubq->q_depth; i++) {
- struct ublk_io *io = &ubq->ios[i];
-
- if (!(io->flags & UBLK_IO_FLAG_ACTIVE))
- nr_inflight++;
- }
-
- /* cancelable uring_cmd can't help us if all commands are in-flight */
- if (nr_inflight == ubq->q_depth) {
- struct ublk_device *ub = ubq->dev;
-
- if (ublk_abort_requests(ub, ubq)) {
- schedule_work(&ub->nosrv_work);
- }
- return BLK_EH_DONE;
- }
-
return BLK_EH_RESET_TIMER;
}
@@ -1368,9 +1344,6 @@ static blk_status_t ublk_prep_req(struct ublk_queue *ubq, struct request *rq)
if (ublk_nosrv_should_queue_io(ubq) && unlikely(ubq->force_abort))
return BLK_STS_IOERR;
- if (unlikely(ubq->canceling))
- return BLK_STS_IOERR;
-
/* fill iod to slot in io cmd buffer */
res = ublk_setup_iod(ubq, rq);
if (unlikely(res != BLK_STS_OK))
@@ -1391,16 +1364,6 @@ static blk_status_t ublk_queue_rq(struct blk_mq_hw_ctx *hctx,
if (res != BLK_STS_OK)
return res;
- /*
- * ->canceling has to be handled after ->force_abort and ->fail_io
- * is dealt with, otherwise this request may not be failed in case
- * of recovery, and cause hang when deleting disk
- */
- if (unlikely(ubq->canceling)) {
- __ublk_abort_rq(ubq, rq);
- return BLK_STS_OK;
- }
-
ublk_queue_cmd(ubq, rq);
return BLK_STS_OK;
}
@@ -1461,8 +1424,52 @@ static int ublk_ch_open(struct inode *inode, struct file *filp)
static int ublk_ch_release(struct inode *inode, struct file *filp)
{
struct ublk_device *ub = filp->private_data;
+ int i;
+ mutex_lock(&ub->mutex);
+ /*
+ * If the device is not live, we will not transition to a nosrv
+ * state. This protects against:
+ * - accidental poking of the ublk character device
+ * - some ublk servers which may open/close the ublk character
+ * device during startup
+ */
+ if (ub->dev_info.state != UBLK_S_DEV_LIVE)
+ goto out;
+
+ /*
+ * Since we are releasing the ublk character file descriptor, we
+ * know that there cannot be any concurrent file-related
+ * activity (e.g. uring_cmds or reads/writes). However, I/O
+ * might still be getting dispatched. Quiesce that too so that
+ * we don't need to worry about anything concurrent
+ */
+ blk_mq_quiesce_queue(ub->ub_disk->queue);
+
+ /*
+ * Handle any requests outstanding to the ublk server
+ */
+ for (i = 0; i < ub->dev_info.nr_hw_queues; i++)
+ ublk_abort_queue(ub, ublk_get_queue(ub, i));
+
+ /*
+ * Transition the device to the nosrv state. What exactly this
+ * means depends on the recovery flags
+ */
+ if (ublk_nosrv_should_stop_dev(ub)) {
+ ublk_stop_dev_unlocked(ub);
+ } else if (ublk_nosrv_dev_should_queue_io(ub)) {
+ __ublk_quiesce_dev(ub);
+ } else {
+ ub->dev_info.state = UBLK_S_DEV_FAIL_IO;
+ for (i = 0; i < ub->dev_info.nr_hw_queues; i++)
+ ublk_get_queue(ub, i)->fail_io = true;
+ }
+
+ blk_mq_unquiesce_queue(ub->ub_disk->queue);
+out:
clear_bit(UB_STATE_OPEN, &ub->state);
+ mutex_unlock(&ub->mutex);
return 0;
}
@@ -1556,57 +1563,6 @@ static void ublk_abort_queue(struct ublk_device *ub, struct ublk_queue *ubq)
}
}
-/* Must be called when queue is frozen */
-static bool ublk_mark_queue_canceling(struct ublk_queue *ubq)
-{
- bool canceled;
-
- spin_lock(&ubq->cancel_lock);
- canceled = ubq->canceling;
- if (!canceled)
- ubq->canceling = true;
- spin_unlock(&ubq->cancel_lock);
-
- return canceled;
-}
-
-static bool ublk_abort_requests(struct ublk_device *ub, struct ublk_queue *ubq)
-{
- bool was_canceled = ubq->canceling;
- struct gendisk *disk;
-
- if (was_canceled)
- return false;
-
- spin_lock(&ub->lock);
- disk = ub->ub_disk;
- if (disk)
- get_device(disk_to_dev(disk));
- spin_unlock(&ub->lock);
-
- /* Our disk has been dead */
- if (!disk)
- return false;
-
- /*
- * Now we are serialized with ublk_queue_rq()
- *
- * Make sure that ubq->canceling is set when queue is frozen,
- * because ublk_queue_rq() has to rely on this flag for avoiding to
- * touch completed uring_cmd
- */
- blk_mq_quiesce_queue(disk->queue);
- was_canceled = ublk_mark_queue_canceling(ubq);
- if (!was_canceled) {
- /* abort queue is for making forward progress */
- ublk_abort_queue(ub, ubq);
- }
- blk_mq_unquiesce_queue(disk->queue);
- put_device(disk_to_dev(disk));
-
- return !was_canceled;
-}
-
static void ublk_cancel_cmd(struct ublk_queue *ubq, struct ublk_io *io,
unsigned int issue_flags)
{
@@ -1635,8 +1591,6 @@ static void ublk_uring_cmd_cancel_fn(struct io_uring_cmd *cmd,
struct ublk_uring_cmd_pdu *pdu = ublk_get_uring_cmd_pdu(cmd);
struct ublk_queue *ubq = pdu->ubq;
struct task_struct *task;
- struct ublk_device *ub;
- bool need_schedule;
struct ublk_io *io;
if (WARN_ON_ONCE(!ubq))
@@ -1649,16 +1603,9 @@ static void ublk_uring_cmd_cancel_fn(struct io_uring_cmd *cmd,
if (WARN_ON_ONCE(task && task != ubq->ubq_daemon))
return;
- ub = ubq->dev;
- need_schedule = ublk_abort_requests(ub, ubq);
-
io = &ubq->ios[pdu->tag];
WARN_ON_ONCE(io->cmd != cmd);
ublk_cancel_cmd(ubq, io, issue_flags);
-
- if (need_schedule) {
- schedule_work(&ub->nosrv_work);
- }
}
static inline bool ublk_queue_ready(struct ublk_queue *ubq)
@@ -1756,13 +1703,13 @@ static struct gendisk *ublk_detach_disk(struct ublk_device *ub)
return disk;
}
-static void ublk_stop_dev(struct ublk_device *ub)
+static void ublk_stop_dev_unlocked(struct ublk_device *ub)
+ __must_hold(&ub->mutex)
{
struct gendisk *disk;
- mutex_lock(&ub->mutex);
if (ub->dev_info.state == UBLK_S_DEV_DEAD)
- goto unlock;
+ return;
if (ublk_nosrv_dev_should_queue_io(ub)) {
if (ub->dev_info.state == UBLK_S_DEV_LIVE)
__ublk_quiesce_dev(ub);
@@ -1771,38 +1718,12 @@ static void ublk_stop_dev(struct ublk_device *ub)
del_gendisk(ub->ub_disk);
disk = ublk_detach_disk(ub);
put_disk(disk);
- unlock:
- mutex_unlock(&ub->mutex);
- ublk_cancel_dev(ub);
}
-static void ublk_nosrv_work(struct work_struct *work)
+static void ublk_stop_dev(struct ublk_device *ub)
{
- struct ublk_device *ub =
- container_of(work, struct ublk_device, nosrv_work);
- int i;
-
- if (ublk_nosrv_should_stop_dev(ub)) {
- ublk_stop_dev(ub);
- return;
- }
-
mutex_lock(&ub->mutex);
- if (ub->dev_info.state != UBLK_S_DEV_LIVE)
- goto unlock;
-
- if (ublk_nosrv_dev_should_queue_io(ub)) {
- __ublk_quiesce_dev(ub);
- } else {
- blk_mq_quiesce_queue(ub->ub_disk->queue);
- ub->dev_info.state = UBLK_S_DEV_FAIL_IO;
- for (i = 0; i < ub->dev_info.nr_hw_queues; i++) {
- ublk_get_queue(ub, i)->fail_io = true;
- }
- blk_mq_unquiesce_queue(ub->ub_disk->queue);
- }
-
- unlock:
+ ublk_stop_dev_unlocked(ub);
mutex_unlock(&ub->mutex);
ublk_cancel_dev(ub);
}
@@ -2388,7 +2309,6 @@ static void ublk_remove(struct ublk_device *ub)
bool unprivileged;
ublk_stop_dev(ub);
- cancel_work_sync(&ub->nosrv_work);
cdev_device_del(&ub->cdev, &ub->cdev_dev);
unprivileged = ub->dev_info.flags & UBLK_F_UNPRIVILEGED_DEV;
ublk_put_device(ub);
@@ -2675,7 +2595,6 @@ static int ublk_ctrl_add_dev(struct io_uring_cmd *cmd)
goto out_unlock;
mutex_init(&ub->mutex);
spin_lock_init(&ub->lock);
- INIT_WORK(&ub->nosrv_work, ublk_nosrv_work);
ret = ublk_alloc_dev_number(ub, header->dev_id);
if (ret < 0)
@@ -2807,7 +2726,6 @@ static inline void ublk_ctrl_cmd_dump(struct io_uring_cmd *cmd)
static int ublk_ctrl_stop_dev(struct ublk_device *ub)
{
ublk_stop_dev(ub);
- cancel_work_sync(&ub->nosrv_work);
return 0;
}
@@ -2927,7 +2845,6 @@ static void ublk_queue_reinit(struct ublk_device *ub, struct ublk_queue *ubq)
/* We have to reset it to NULL, otherwise ub won't accept new FETCH_REQ */
ubq->ubq_daemon = NULL;
ubq->timeout = false;
- ubq->canceling = false;
for (i = 0; i < ubq->q_depth; i++) {
struct ublk_io *io = &ubq->ios[i];
diff --git a/tools/testing/selftests/ublk/Makefile b/tools/testing/selftests/ublk/Makefile
index c7781efea0f33c02f340f90f547d3a37c1d1b8a0..afee027cccdd1b8f13f1cb9a90a3348cd54b18bc 100644
--- a/tools/testing/selftests/ublk/Makefile
+++ b/tools/testing/selftests/ublk/Makefile
@@ -6,6 +6,7 @@ LDLIBS += -lpthread -lm -luring
TEST_PROGS := test_generic_01.sh
TEST_PROGS += test_generic_02.sh
TEST_PROGS += test_generic_03.sh
+TEST_PROGS += test_generic_04.sh
TEST_PROGS += test_null_01.sh
TEST_PROGS += test_null_02.sh
@@ -26,7 +27,8 @@ TEST_GEN_PROGS_EXTENDED = kublk
include ../lib.mk
-$(TEST_GEN_PROGS_EXTENDED): kublk.c null.c file_backed.c common.c stripe.c
+$(TEST_GEN_PROGS_EXTENDED): kublk.c null.c file_backed.c common.c stripe.c \
+ fault_inject.c
check:
shellcheck -x -f gcc *.sh
diff --git a/tools/testing/selftests/ublk/fault_inject.c b/tools/testing/selftests/ublk/fault_inject.c
new file mode 100644
index 0000000000000000000000000000000000000000..e92d01e88e478a23df987ebff2a997212b831d31
--- /dev/null
+++ b/tools/testing/selftests/ublk/fault_inject.c
@@ -0,0 +1,58 @@
+// SPDX-License-Identifier: GPL-2.0
+
+/*
+ * Fault injection ublk target. Hack this up however you like for
+ * testing specific behaviors of ublk_drv. Currently is a null target
+ * with a configurable delay before completing each I/O. This delay can
+ * be used to test ublk_drv's handling of I/O outstanding to the ublk
+ * server when it dies.
+ */
+
+#include "kublk.h"
+
+static int ublk_fault_inject_tgt_init(const struct dev_ctx *ctx, struct ublk_dev *dev)
+{
+ const struct ublksrv_ctrl_dev_info *info = &dev->dev_info;
+ unsigned long dev_size = 250UL << 30;
+
+ dev->tgt.dev_size = dev_size;
+ dev->tgt.params = (struct ublk_params) {
+ .types = UBLK_PARAM_TYPE_BASIC | UBLK_PARAM_TYPE_DMA_ALIGN |
+ UBLK_PARAM_TYPE_SEGMENT,
+ .basic = {
+ .logical_bs_shift = 9,
+ .physical_bs_shift = 12,
+ .io_opt_shift = 12,
+ .io_min_shift = 9,
+ .max_sectors = info->max_io_buf_bytes >> 9,
+ .dev_sectors = dev_size >> 9,
+ },
+ .dma = {
+ .alignment = 4095,
+ },
+ .seg = {
+ .seg_boundary_mask = 4095,
+ .max_segment_size = 32 << 10,
+ .max_segments = 32,
+ },
+ };
+
+ dev->private_data = (void *)ctx->delay_us;
+ return 0;
+}
+
+static int ublk_fault_inject_queue_io(struct ublk_queue *q, int tag)
+{
+ const struct ublksrv_io_desc *iod = ublk_get_iod(q, tag);
+
+ usleep((unsigned long)q->dev->private_data);
+
+ ublk_complete_io(q, tag, iod->nr_sectors << 9);
+ return 0;
+}
+
+const struct ublk_tgt_ops fault_inject_tgt_ops = {
+ .name = "fault_inject",
+ .init_tgt = ublk_fault_inject_tgt_init,
+ .queue_io = ublk_fault_inject_queue_io,
+};
diff --git a/tools/testing/selftests/ublk/kublk.c b/tools/testing/selftests/ublk/kublk.c
index 91c282bc767449a418cce7fc816dc8e9fc732d6a..0fbfa43864453471219703451271540d5dfef593 100644
--- a/tools/testing/selftests/ublk/kublk.c
+++ b/tools/testing/selftests/ublk/kublk.c
@@ -10,6 +10,7 @@ static const struct ublk_tgt_ops *tgt_ops_list[] = {
&null_tgt_ops,
&loop_tgt_ops,
&stripe_tgt_ops,
+ &fault_inject_tgt_ops,
};
static const struct ublk_tgt_ops *ublk_find_tgt(const char *name)
@@ -1041,7 +1042,7 @@ static int cmd_dev_get_features(void)
static int cmd_dev_help(char *exe)
{
- printf("%s add -t [null|loop] [-q nr_queues] [-d depth] [-n dev_id] [backfile1] [backfile2] ...\n", exe);
+ printf("%s add -t [null|loop|stripe|fault_inject] [-q nr_queues] [-d depth] [-n dev_id] [backfile1] [backfile2] ...\n", exe);
printf("\t default: nr_queues=2(max 4), depth=128(max 128), dev_id=-1(auto allocation)\n");
printf("%s del [-n dev_id] -a \n", exe);
printf("\t -a delete all devices -n delete specified device\n");
@@ -1064,6 +1065,7 @@ int main(int argc, char *argv[])
{ "zero_copy", 0, NULL, 'z' },
{ "foreground", 0, NULL, 0 },
{ "chunk_size", 1, NULL, 0 },
+ { "delay_us", 1, NULL, 0 },
{ 0, 0, 0, 0 }
};
int option_idx, opt;
@@ -1112,6 +1114,8 @@ int main(int argc, char *argv[])
ctx.fg = 1;
if (!strcmp(longopts[option_idx].name, "chunk_size"))
ctx.chunk_size = strtol(optarg, NULL, 10);
+ if (!strcmp(longopts[option_idx].name, "delay_us"))
+ ctx.delay_us = strtoul(optarg, NULL, 10);
}
}
diff --git a/tools/testing/selftests/ublk/kublk.h b/tools/testing/selftests/ublk/kublk.h
index 760ff8ffb8107037a19a8fb7ab408818845e010d..3750e67727eed89991158add49d30615ea012dae 100644
--- a/tools/testing/selftests/ublk/kublk.h
+++ b/tools/testing/selftests/ublk/kublk.h
@@ -70,6 +70,9 @@ struct dev_ctx {
/* stripe */
unsigned int chunk_size;
+ /* fault_inject */
+ unsigned long delay_us;
+
int _evtfd;
};
@@ -357,6 +360,7 @@ static inline int ublk_queue_use_zc(const struct ublk_queue *q)
extern const struct ublk_tgt_ops null_tgt_ops;
extern const struct ublk_tgt_ops loop_tgt_ops;
extern const struct ublk_tgt_ops stripe_tgt_ops;
+extern const struct ublk_tgt_ops fault_inject_tgt_ops;
void backing_file_tgt_deinit(struct ublk_dev *dev);
int backing_file_tgt_init(struct ublk_dev *dev);
diff --git a/tools/testing/selftests/ublk/test_generic_04.sh b/tools/testing/selftests/ublk/test_generic_04.sh
new file mode 100755
index 0000000000000000000000000000000000000000..48af48164aa444d8ac6a58fef1743d2a16a56a14
--- /dev/null
+++ b/tools/testing/selftests/ublk/test_generic_04.sh
@@ -0,0 +1,43 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+
+. "$(cd "$(dirname "$0")" && pwd)"/test_common.sh
+
+TID="generic_04"
+ERR_CODE=0
+
+_prep_test "fault_inject" "fast cleanup when all I/Os of one hctx are in server"
+
+# configure ublk server to sleep 2s before completing each I/O
+dev_id=$(_add_ublk_dev -t fault_inject -q 2 -d 1 --delay_us 2000000)
+_check_add_dev $TID $?
+
+echo "dev id is ${dev_id}"
+
+STARTTIME=${SECONDS}
+
+dd if=/dev/urandom of=/dev/ublkb${dev_id} oflag=direct bs=4k count=1 &
+dd_pid=$!
+
+__ublk_kill_daemon ${dev_id} "DEAD"
+
+wait $dd_pid
+dd_exitcode=$?
+
+ENDTIME=${SECONDS}
+ELAPSED=$(($ENDTIME - $STARTTIME))
+
+# assert that dd sees an error and exits quickly after ublk server is
+# killed. previously this relied on seeing an I/O timeout and so would
+# take ~30s
+if [ $dd_exitcode -eq 0 ]; then
+ echo "dd unexpectedly exited successfully!"
+ ERR_CODE=255
+fi
+if [ $ELAPSED -ge 5 ]; then
+ echo "dd took $ELAPSED seconds to exit (>= 5s tolerance)!"
+ ERR_CODE=255
+fi
+
+_cleanup_test "fault_inject"
+_show_result $TID $ERR_CODE
---
base-commit: 710e2c687a16b28a873a282517a85faf02a8b7cc
change-id: 20250325-ublk_timeout-b06b9b51c591
Best regards,
--
Uday Shankar <ushankar(a)purestorage.com>
Hi ,
I wanted to confirm if you got my last email.
I can provide more information on the numbers and costs-just say the word!
Regards
Sara Wood
Demand Generation Manager
Leads Data Inc.,
Please reply with REMOVE if you don't wish to receive further emails
-----Original Message-----
From: Sara Wood
To:
Subject: ISC West 2025 Attendee Data to Drive Sales and Networking Efforts
Hi ,
Are you considering getting the ICS West 2025 attendees list?
Expo Name: International Security Conference & Exposition West 2025
Total Number of records: 23,000 records
List includes: Company Name, Contact Name, Job Title, Mailing Address, Phone, Emails, etc.
Do you want to acquire these leads? If so, I'm happy to send the pricing details.
Eager for your response
Regards
Sara Wood
Demand Generation Manager
Leads Data Inc.,
Please reply with REMOVE if you don't wish to receive further emails
Fix a couple of issues I saw when developing selftests for ublk. These
patches are split out from the following series:
https://lore.kernel.org/linux-block/20250325-ublk_timeout-v1-0-262f0121a7bd…
Signed-off-by: Uday Shankar <ushankar(a)purestorage.com>
---
Uday Shankar (2):
selftests: ublk: kublk: use ioctl-encoded opcodes
selftests: ublk: kublk: fix an error log line
tools/testing/selftests/ublk/kublk.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
---
base-commit: 4cfcc398357b0fb3d4c97d47d4a9e3c0653b7903
change-id: 20250325-ublk_selftests-6a055dfbc55b
Best regards,
--
Uday Shankar <ushankar(a)purestorage.com>
This set aims to reduce the long delay in applications reacting to ublk
server exit in the case of a "fully saturated" queue, i.e. one for which
all I/Os are outstanding to the ublk server. The first few patches fix
some minor issues in the ublk selftests, and the last patch contains the
main work and a test to validate it.
Signed-off-by: Uday Shankar <ushankar(a)purestorage.com>
---
Uday Shankar (4):
selftests: ublk: kublk: use ioctl-encoded opcodes
selftests: ublk: kublk: fix an error log line
selftests: ublk: kublk: ignore SIGCHLD
ublk: improve handling of saturated queues when ublk server exits
drivers/block/ublk_drv.c | 40 +++++++++++------------
tools/testing/selftests/ublk/Makefile | 1 +
tools/testing/selftests/ublk/kublk.c | 10 ++++--
tools/testing/selftests/ublk/kublk.h | 3 ++
tools/testing/selftests/ublk/null.c | 4 +++
tools/testing/selftests/ublk/test_generic_02.sh | 43 +++++++++++++++++++++++++
6 files changed, 76 insertions(+), 25 deletions(-)
---
base-commit: 648154b1c78c9e00b6934082cae48bb38714de20
change-id: 20250325-ublk_timeout-b06b9b51c591
Best regards,
--
Uday Shankar <ushankar(a)purestorage.com>
There are alarms which have only minute-granularity. The RTC core
already has a flag to describe them. Use this flag to skip tests which
require the alarm to support seconds.
Signed-off-by: Wolfram Sang <wsa+renesas(a)sang-engineering.com>
---
Tested with a Renesas RZ-N1D board. This RTC obviously has only minute
resolution for the alarms. Output now looks like this:
# RUN rtc.alarm_alm_set ...
# SKIP Skipping test since alarms has only minute granularity.
# OK rtc.alarm_alm_set
ok 5 rtc.alarm_alm_set # SKIP Skipping test since alarms has only minute granularity.
Before it was like this:
# RUN rtc.alarm_alm_set ...
# rtctest.c:255:alarm_alm_set:Alarm time now set to 09:40:00.
# rtctest.c:275:alarm_alm_set:data: 1a0
# rtctest.c:281:alarm_alm_set:Expected new (1489743644) == secs (1489743647)
tools/testing/selftests/rtc/rtctest.c | 19 ++++++++++++++-----
1 file changed, 14 insertions(+), 5 deletions(-)
diff --git a/tools/testing/selftests/rtc/rtctest.c b/tools/testing/selftests/rtc/rtctest.c
index 3e4f0d5c5329..e0a148261e6f 100644
--- a/tools/testing/selftests/rtc/rtctest.c
+++ b/tools/testing/selftests/rtc/rtctest.c
@@ -29,6 +29,7 @@ enum rtc_alarm_state {
RTC_ALARM_UNKNOWN,
RTC_ALARM_ENABLED,
RTC_ALARM_DISABLED,
+ RTC_ALARM_RES_MINUTE,
};
FIXTURE(rtc) {
@@ -88,7 +89,7 @@ static void nanosleep_with_retries(long ns)
}
}
-static enum rtc_alarm_state get_rtc_alarm_state(int fd)
+static enum rtc_alarm_state get_rtc_alarm_state(int fd, int need_seconds)
{
struct rtc_param param = { 0 };
int rc;
@@ -103,6 +104,10 @@ static enum rtc_alarm_state get_rtc_alarm_state(int fd)
if ((param.uvalue & _BITUL(RTC_FEATURE_ALARM)) == 0)
return RTC_ALARM_DISABLED;
+ /* Check if alarm has desired granularity */
+ if (need_seconds && (param.uvalue & _BITUL(RTC_FEATURE_ALARM_RES_MINUTE)))
+ return RTC_ALARM_RES_MINUTE;
+
return RTC_ALARM_ENABLED;
}
@@ -227,9 +232,11 @@ TEST_F(rtc, alarm_alm_set) {
SKIP(return, "Skipping test since %s does not exist", rtc_file);
ASSERT_NE(-1, self->fd);
- alarm_state = get_rtc_alarm_state(self->fd);
+ alarm_state = get_rtc_alarm_state(self->fd, 1);
if (alarm_state == RTC_ALARM_DISABLED)
SKIP(return, "Skipping test since alarms are not supported.");
+ if (alarm_state == RTC_ALARM_RES_MINUTE)
+ SKIP(return, "Skipping test since alarms has only minute granularity.");
rc = ioctl(self->fd, RTC_RD_TIME, &tm);
ASSERT_NE(-1, rc);
@@ -295,9 +302,11 @@ TEST_F(rtc, alarm_wkalm_set) {
SKIP(return, "Skipping test since %s does not exist", rtc_file);
ASSERT_NE(-1, self->fd);
- alarm_state = get_rtc_alarm_state(self->fd);
+ alarm_state = get_rtc_alarm_state(self->fd, 1);
if (alarm_state == RTC_ALARM_DISABLED)
SKIP(return, "Skipping test since alarms are not supported.");
+ if (alarm_state == RTC_ALARM_RES_MINUTE)
+ SKIP(return, "Skipping test since alarms has only minute granularity.");
rc = ioctl(self->fd, RTC_RD_TIME, &alarm.time);
ASSERT_NE(-1, rc);
@@ -357,7 +366,7 @@ TEST_F_TIMEOUT(rtc, alarm_alm_set_minute, 65) {
SKIP(return, "Skipping test since %s does not exist", rtc_file);
ASSERT_NE(-1, self->fd);
- alarm_state = get_rtc_alarm_state(self->fd);
+ alarm_state = get_rtc_alarm_state(self->fd, 0);
if (alarm_state == RTC_ALARM_DISABLED)
SKIP(return, "Skipping test since alarms are not supported.");
@@ -425,7 +434,7 @@ TEST_F_TIMEOUT(rtc, alarm_wkalm_set_minute, 65) {
SKIP(return, "Skipping test since %s does not exist", rtc_file);
ASSERT_NE(-1, self->fd);
- alarm_state = get_rtc_alarm_state(self->fd);
+ alarm_state = get_rtc_alarm_state(self->fd, 0);
if (alarm_state == RTC_ALARM_DISABLED)
SKIP(return, "Skipping test since alarms are not supported.");
--
2.39.2
Hi,
While trying the coredump test on qemu-system-riscv64, I observed test
failures for various reasons.
This series makes the test works on qemu-system-riscv64.
Best regards,
Nam
Nam Cao (3):
selftests: coredump: Properly initialize pointer
selftests: coredump: Use waitpid() instead of busy-wait
selftests: coredump: Raise timeout to 2 minutes
tools/testing/selftests/coredump/stackdump | 6 +-----
.../testing/selftests/coredump/stackdump_test.c | 17 +++++++++--------
2 files changed, 10 insertions(+), 13 deletions(-)
--
2.39.5
Here are 4 unrelated patches:
- Patch 1: fix a NULL pointer when two SYN-ACK for the same request are
handled in parallel. A fix for up to v5.9.
- Patch 2: selftests: fix check for the wrong FD. A fix for up to v5.17.
- Patch 3: selftests: close all FDs in case of error. A fix for up to
v5.17.
- Patch 4: selftests: ignore a new generated file. A fix for 6.15-rc0.
Signed-off-by: Matthieu Baerts (NGI0) <matttbe(a)kernel.org>
---
Cong Liu (1):
selftests: mptcp: fix incorrect fd checks in main_loop
Gang Yan (1):
mptcp: fix NULL pointer in can_accept_new_subflow
Geliang Tang (1):
selftests: mptcp: close fd_in before returning in main_loop
Matthieu Baerts (NGI0) (1):
selftests: mptcp: ignore mptcp_diag binary
net/mptcp/subflow.c | 15 ++++++++-------
tools/testing/selftests/net/mptcp/.gitignore | 1 +
tools/testing/selftests/net/mptcp/mptcp_connect.c | 11 +++++++----
3 files changed, 16 insertions(+), 11 deletions(-)
---
base-commit: 2ea396448f26d0d7d66224cb56500a6789c7ed07
change-id: 20250328-net-mptcp-misc-fixes-6-15-98bfbeaa15ac
Best regards,
--
Matthieu Baerts (NGI0) <matttbe(a)kernel.org>
The current implementation of suppressing warning backtraces uses __func__,
which is a compile-time constant only for non -fPIC compilation.
GCC's support for this situation in position-independent code varies across
versions and architectures.
On the s390 architecture, -fPIC is required for compilation, and support
for this scenario is available in GCC 11 and later.
Fixes: d8b14a2 ("bug/kunit: core support for suppressing warning backtraces")
Signed-off-by: Alessandro Carminati <acarmina(a)redhat.com>
---
lib/kunit/Kconfig | 1 +
1 file changed, 1 insertion(+)
diff --git a/lib/kunit/Kconfig b/lib/kunit/Kconfig
index 201402f0ab49..6c937144dcea 100644
--- a/lib/kunit/Kconfig
+++ b/lib/kunit/Kconfig
@@ -17,6 +17,7 @@ if KUNIT
config KUNIT_SUPPRESS_BACKTRACE
bool "KUnit - Enable backtrace suppression"
+ depends on (!S390 && CC_IS_GCC) || (CC_IS_GCC && GCC_VERSION >= 110000)
default y
help
Enable backtrace suppression for KUnit. If enabled, backtraces
--
2.34.1
Introduce support for the N32 and N64 ABIs. As preparation, the
entrypoint is first simplified significantly. Thanks to Maciej for all
the valuable information.
Signed-off-by: Thomas Weißschuh <linux(a)weissschuh.net>
---
Changes in v2:
- Clean up entrypoint first
- Annotate #endifs
- Link to v1: https://lore.kernel.org/r/20250212-nolibc-mips-n32-v1-1-6892e58d1321@weisss…
---
Thomas Weißschuh (4):
tools/nolibc: MIPS: drop $gp setup
tools/nolibc: MIPS: drop manual stack pointer alignment
tools/nolibc: MIPS: drop noreorder option
tools/nolibc: MIPS: add support for N64 and N32 ABIs
tools/include/nolibc/arch-mips.h | 117 +++++++++++++++++++++-------
tools/testing/selftests/nolibc/Makefile | 28 ++++++-
tools/testing/selftests/nolibc/run-tests.sh | 2 +-
3 files changed, 118 insertions(+), 29 deletions(-)
---
base-commit: 9c812b01f13d37410ea103e00bc47e5e0f6d2bad
change-id: 20231105-nolibc-mips-n32-234901bd910d
Best regards,
--
Thomas Weißschuh <linux(a)weissschuh.net>
Hi ,
Are you considering getting the ICS West 2025 attendees list?
Expo Name: International Security Conference & Exposition West 2025
Total Number of records: 23,000 records
List includes: Company Name, Contact Name, Job Title, Mailing Address, Phone, Emails, etc.
Do you want to acquire these leads? If so, I'm happy to send the pricing details.
Eager for your response
Regards
Sara Wood
Demand Generation Manager
Leads Data Inc.,
Please reply with REMOVE if you don't wish to receive further emails
The include of sys/io.h is not necessary anymore since
commit 67eb617a8e1e ("selftests/nolibc: simplify call to ioperm").
It's existence is also problematic as the header does not exist on all
architectures.
Reported-by: Sebastian Andrzej Siewior <sebastian(a)breakpoint.cc>
Signed-off-by: Thomas Weißschuh <linux(a)weissschuh.net>
---
tools/testing/selftests/nolibc/nolibc-test.c | 1 -
1 file changed, 1 deletion(-)
diff --git a/tools/testing/selftests/nolibc/nolibc-test.c b/tools/testing/selftests/nolibc/nolibc-test.c
index 5884a891c491544050fc35b07322c73a1a9dbaf3..7a60b6ac1457e8d862ab1a6a26c9e46abec92111 100644
--- a/tools/testing/selftests/nolibc/nolibc-test.c
+++ b/tools/testing/selftests/nolibc/nolibc-test.c
@@ -16,7 +16,6 @@
#ifndef _NOLIBC_STDIO_H
/* standard libcs need more includes */
#include <sys/auxv.h>
-#include <sys/io.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <sys/mount.h>
---
base-commit: bceb73904c855c78402dca94c82915f078f259dd
change-id: 20250324-nolibc-ioperm-155646560b95
Best regards,
--
Thomas Weißschuh <linux(a)weissschuh.net>
io_cmd_buf points to an array of ublksrv_io_desc structs but its type is
char *. Indexing the array requires an explicit multiplication and cast.
The compiler also can't check the pointer types.
Change io_cmd_buf's type to struct ublksrv_io_desc * so it can be
indexed directly and the compiler can type-check the code.
Make the same change to the ublk selftests.
Caleb Sander Mateos (2):
ublk: specify io_cmd_buf pointer type
selftests: ublk: specify io_cmd_buf pointer type
drivers/block/ublk_drv.c | 8 ++++----
tools/testing/selftests/ublk/kublk.c | 2 +-
tools/testing/selftests/ublk/kublk.h | 4 ++--
3 files changed, 7 insertions(+), 7 deletions(-)
--
2.45.2
virtio-net have two usage of hashes: one is RSS and another is hash
reporting. Conventionally the hash calculation was done by the VMM.
However, computing the hash after the queue was chosen defeats the
purpose of RSS.
Another approach is to use eBPF steering program. This approach has
another downside: it cannot report the calculated hash due to the
restrictive nature of eBPF.
Introduce the code to compute hashes to the kernel in order to overcome
thse challenges.
An alternative solution is to extend the eBPF steering program so that it
will be able to report to the userspace, but it is based on context
rewrites, which is in feature freeze. We can adopt kfuncs, but they will
not be UAPIs. We opt to ioctl to align with other relevant UAPIs (KVM
and vhost_net).
The patches for QEMU to use this new feature was submitted as RFC and
is available at:
https://patchew.org/QEMU/20240915-hash-v3-0-79cb08d28647@daynix.com/
This work was presented at LPC 2024:
https://lpc.events/event/18/contributions/1963/
V1 -> V2:
Changed to introduce a new BPF program type.
Signed-off-by: Akihiko Odaki <akihiko.odaki(a)daynix.com>
---
Changes in v9:
- Added a missing return statement in patch
"tun: Introduce virtio-net hash feature".
- Link to v8: https://lore.kernel.org/r/20250306-rss-v8-0-7ab4f56ff423@daynix.com
Changes in v8:
- Disabled IPv6 to eliminate noises in tests.
- Added a branch in tap to avoid unnecessary dissection when hash
reporting is disabled.
- Removed unnecessary rtnl_lock().
- Extracted code to handle new ioctls into separate functions to avoid
adding extra NULL checks to the code handling other ioctls.
- Introduced variable named "fd" to __tun_chr_ioctl().
- s/-/=/g in a patch message to avoid confusing Git.
- Link to v7: https://lore.kernel.org/r/20250228-rss-v7-0-844205cbbdd6@daynix.com
Changes in v7:
- Ensured to set hash_report to VIRTIO_NET_HASH_REPORT_NONE for
VHOST_NET_F_VIRTIO_NET_HDR.
- s/4/sizeof(u32)/ in patch "virtio_net: Add functions for hashing".
- Added tap_skb_cb type.
- Rebased.
- Link to v6: https://lore.kernel.org/r/20250109-rss-v6-0-b1c90ad708f6@daynix.com
Changes in v6:
- Extracted changes to fill vnet header holes into another series.
- Squashed patches "skbuff: Introduce SKB_EXT_TUN_VNET_HASH", "tun:
Introduce virtio-net hash reporting feature", and "tun: Introduce
virtio-net RSS" into patch "tun: Introduce virtio-net hash feature".
- Dropped the RFC tag.
- Link to v5: https://lore.kernel.org/r/20241008-rss-v5-0-f3cf68df005d@daynix.com
Changes in v5:
- Fixed a compilation error with CONFIG_TUN_VNET_CROSS_LE.
- Optimized the calculation of the hash value according to:
https://git.dpdk.org/dpdk/commit/?id=3fb1ea032bd6ff8317af5dac9af901f1f324ca…
- Added patch "tun: Unify vnet implementation".
- Dropped patch "tap: Pad virtio header with zero".
- Added patch "selftest: tun: Test vnet ioctls without device".
- Reworked selftests to skip for older kernels.
- Documented the case when the underlying device is deleted and packets
have queue_mapping set by TC.
- Reordered test harness arguments.
- Added code to handle fragmented packets.
- Link to v4: https://lore.kernel.org/r/20240924-rss-v4-0-84e932ec0e6c@daynix.com
Changes in v4:
- Moved tun_vnet_hash_ext to if_tun.h.
- Renamed virtio_net_toeplitz() to virtio_net_toeplitz_calc().
- Replaced htons() with cpu_to_be16().
- Changed virtio_net_hash_rss() to return void.
- Reordered variable declarations in virtio_net_hash_rss().
- Removed virtio_net_hdr_v1_hash_from_skb().
- Updated messages of "tap: Pad virtio header with zero" and
"tun: Pad virtio header with zero".
- Fixed vnet_hash allocation size.
- Ensured to free vnet_hash when destructing tun_struct.
- Link to v3: https://lore.kernel.org/r/20240915-rss-v3-0-c630015db082@daynix.com
Changes in v3:
- Reverted back to add ioctl.
- Split patch "tun: Introduce virtio-net hashing feature" into
"tun: Introduce virtio-net hash reporting feature" and
"tun: Introduce virtio-net RSS".
- Changed to reuse hash values computed for automq instead of performing
RSS hashing when hash reporting is requested but RSS is not.
- Extracted relevant data from struct tun_struct to keep it minimal.
- Added kernel-doc.
- Changed to allow calling TUNGETVNETHASHCAP before TUNSETIFF.
- Initialized num_buffers with 1.
- Added a test case for unclassified packets.
- Fixed error handling in tests.
- Changed tests to verify that the queue index will not overflow.
- Rebased.
- Link to v2: https://lore.kernel.org/r/20231015141644.260646-1-akihiko.odaki@daynix.com
---
Akihiko Odaki (6):
virtio_net: Add functions for hashing
net: flow_dissector: Export flow_keys_dissector_symmetric
tun: Introduce virtio-net hash feature
selftest: tun: Test vnet ioctls without device
selftest: tun: Add tests for virtio-net hashing
vhost/net: Support VIRTIO_NET_F_HASH_REPORT
Documentation/networking/tuntap.rst | 7 +
drivers/net/Kconfig | 1 +
drivers/net/tap.c | 68 +++-
drivers/net/tun.c | 98 +++++-
drivers/net/tun_vnet.h | 159 ++++++++-
drivers/vhost/net.c | 49 +--
include/linux/if_tap.h | 2 +
include/linux/skbuff.h | 3 +
include/linux/virtio_net.h | 188 ++++++++++
include/net/flow_dissector.h | 1 +
include/uapi/linux/if_tun.h | 75 ++++
net/core/flow_dissector.c | 3 +-
net/core/skbuff.c | 4 +
tools/testing/selftests/net/Makefile | 2 +-
tools/testing/selftests/net/tun.c | 656 ++++++++++++++++++++++++++++++++++-
15 files changed, 1255 insertions(+), 61 deletions(-)
---
base-commit: dd83757f6e686a2188997cb58b5975f744bb7786
change-id: 20240403-rss-e737d89efa77
prerequisite-change-id: 20241230-tun-66e10a49b0c7:v6
prerequisite-patch-id: 871dc5f146fb6b0e3ec8612971a8e8190472c0fb
prerequisite-patch-id: 2797ed249d32590321f088373d4055ff3f430a0e
prerequisite-patch-id: ea3370c72d4904e2f0536ec76ba5d26784c0cede
prerequisite-patch-id: 837e4cf5d6b451424f9b1639455e83a260c4440d
prerequisite-patch-id: ea701076f57819e844f5a35efe5cbc5712d3080d
prerequisite-patch-id: 701646fb43ad04cc64dd2bf13c150ccbe6f828ce
prerequisite-patch-id: 53176dae0c003f5b6c114d43f936cf7140d31bb5
prerequisite-change-id: 20250116-buffers-96e14bf023fc:v2
prerequisite-patch-id: 25fd4f99d4236a05a5ef16ab79f3e85ee57e21cc
Best regards,
--
Akihiko Odaki <akihiko.odaki(a)daynix.com>
Hello there,
Static analyser cppcheck says:
> linux-6.14/tools/testing/selftests/mm/pagemap_ioctl.c:1061:11: style: int result is assigned to long long variable. If the variable is long long to avoid loss of information, then you have loss of information. [truncLongCastAssignment]
> linux-6.14/tools/testing/selftests/mm/pagemap_ioctl.c:1510:11: style: int result is assigned to long long variable. If the variable is long long to avoid loss of information, then you have loss of information. [truncLongCastAssignment]
> linux-6.14/tools/testing/selftests/mm/pagemap_ioctl.c:1523:11: style: int result is assigned to long long variable. If the variable is long long to avoid loss of information, then you have loss of information. [truncLongCastAssignment]
> linux-6.14/tools/testing/selftests/mm/pagemap_ioctl.c:247:11: style: int result is assigned to long long variable. If the variable is long long to avoid loss of information, then you have loss of information. [truncLongCastAssignment]
> linux-6.14/tools/testing/selftests/mm/pagemap_ioctl.c:435:11: style: int result is assigned to long long variable. If the variable is long long to avoid loss of information, then you have loss of information. [truncLongCastAssignment]
> linux-6.14/tools/testing/selftests/mm/pagemap_ioctl.c:490:11: style: int result is assigned to long long variable. If the variable is long long to avoid loss of information, then you have loss of information. [truncLongCastAssignment]
The source code of the first one is
mem_size = 10 * page_size;
Maybe better code:
mem_size = 10ULL * page_size;
Regards
David Binderman
My randconfig builds sometimes (around one in every 700 configs) run
into this warning on x86:
Symbol __pfx_snnnng1h2i3j4k5l6m7ng1h2i3j4k5l6m7nng1h2i3j4k5l6m7ng1h2i3j4k5l6m7nnng1h2i3j4k5l6m7ng1h2i3j4k5l6m7nng1h2i3j4k5l6m7ng1h2i3j4k5l6m7nnnng1h2i3j4k5l6m7ng1h2i3j4k5l6m7nng1h2i3j4k5l6m7ng1h2i3j4k5l6m7nnng1h2i3j4k5l6m7ng1h2i3j4k5l6m7nng1h2i3j4k5l6m7ng1h2i3j4k5l6m7nnnnng1h2i3j4k5l6m7ng1h2i3j4k5l6m7nng1h2i3j4k5l6m7ng1h2i3j4k5l6m7nnng1h2i3j4k5l6m7ng1h2i3j4k5l6m7nng1h2i3j4k5l6m7ng1h2i3j4k5l6m7nnnng1h2i3j4k5l6m7ng1h2i3j4k5l6m7nng1h2i3j4k5l6m7ng1h2i3j4k5l6m7nnng1h2i3j4k5l6m7ng1h2i3j4k5l6m7nng1h2i3j4k5l6m7ng1h2i3j4k5l6m7n too long for kallsyms (517 >= 512).
Please increase KSYM_NAME_LEN both in kernel and kallsyms.c
The check that gets triggered was added in commit c104c16073b
("Kunit to check the longest symbol length"), see
https://lore.kernel.org/all/20241117195923.222145-1-sergio.collado@gmail.co…
and the overlong identifier seems to be the result of objtool adding
the six-byte "__pfx_" string to a symbol in elf_create_prefix_symbol()
when CONFIG_FUNCTION_PADDING_CFI is set.
I think the suggestion to "Please increase KSYM_NAME_LEN both in
kernel and kallsyms.c" is misleading here and should probably be
changed. I don't know if this something that objtool should work
around, or something that needs to be adapted in the test.
Arnd
Fix a misspelling of "slow".
Signed-off-by: Geert Uytterhoeven <geert(a)linux-m68k.org>
---
include/kunit/test.h | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/include/kunit/test.h b/include/kunit/test.h
index 58dbab60f8530588..9b773406e01f3c43 100644
--- a/include/kunit/test.h
+++ b/include/kunit/test.h
@@ -67,7 +67,7 @@ enum kunit_status {
/*
* Speed Attribute is stored as an enum and separated into categories of
- * speed: very_slowm, slow, and normal. These speeds are relative to
+ * speed: very_slow, slow, and normal. These speeds are relative to
* other KUnit tests.
*
* Note: unset speed attribute acts as default of KUNIT_SPEED_NORMAL.
--
2.43.0
Hi Linus,
Please pull the following kselftest next update for Linux 6.15-rc1.
Fixes bugs and cleans up code in tracing, ftrace, and user_events tests.
Adds missing executables to ftrace gitignore.
diff is attached.
thanks,
-- Shuah
----------------------------------------------------------------
The following changes since commit a64dcfb451e254085a7daee5fe51bf22959d52d3:
Linux 6.14-rc2 (2025-02-09 12:45:03 -0800)
are available in the Git repository at:
git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest tags/linux_kselftest-next-6.15-rc1
for you to fetch changes up to 82ef781f24ac26f4aa71f02d7624c439ab8389a7:
selftests/ftrace: add 'poll' binary to gitignore (2025-03-04 08:51:17 -0700)
----------------------------------------------------------------
linux_kselftest-next-6.15-rc1
Fixes bugs and cleans up code in tracing, ftrace, and user_events tests.
Adds missing executables to ftrace gitignore.
----------------------------------------------------------------
Bharadwaj Raju (1):
selftests/ftrace: add 'poll' binary to gitignore
Heiko Carstens (1):
selftests/ftrace: Use readelf to find entry point in uprobe test
Steven Rostedt (3):
selftests/tracing: Test only toplevel README file not the instances
selftests/ftrace: Clean up triggers after setting them
selftests/tracing: Allow some more tests to run in instances
Yiqian Xun (1):
selftests/user_events: Fix failures caused by test code
tools/testing/selftests/ftrace/.gitignore | 1 +
.../selftests/ftrace/test.d/dynevent/add_remove_uprobe.tc | 10 +++++++---
tools/testing/selftests/ftrace/test.d/functions | 8 +++++++-
.../test.d/trigger/inter-event/trigger-action-hist-xfail.tc | 1 +
.../test.d/trigger/inter-event/trigger-onchange-action-hist.tc | 3 +++
.../test.d/trigger/inter-event/trigger-snapshot-action-hist.tc | 3 +++
.../ftrace/test.d/trigger/trigger-hist-expressions.tc | 1 +
tools/testing/selftests/user_events/dyn_test.c | 2 ++
8 files changed, 25 insertions(+), 4 deletions(-)
----------------------------------------------------------------
Hi Linus,
Please pull the following kunit next update for Linux 6.15-rc1.
kunit tool:
- Changes to kunit tool to use qboot on QEMU x86_64, and build GDB scripts.
- Fixes kunit tool bug in parsing test plan.
- Adds test to kunit tool to check parsing late test plan.
kunit:
- Clarifies kunit_skip() argument name.
- Adds Kunit check for the longest symbol length.
- Changes qemu_configs for sparc to use Zilog console.
Conflicts in lib/Makefile
between commit:
b341f6fd45ab ("blackhole_dev: convert self-test to KUnit")
from the net-next tree and commit:
c104c16073b7 ("Kunit to check the longest symbol length")
The commit c104c16073b7 conflicts with the mainline now with
62f3802332ed ("vdso: add generic time data storage") from kspp
is now in the mainline.
Stephen has the fixes for these two conflicts in next.
(Thank you Stephen)
diff is attached.
thanks,
-- Shuah
----------------------------------------------------------------
The following changes since commit a64dcfb451e254085a7daee5fe51bf22959d52d3:
Linux 6.14-rc2 (2025-02-09 12:45:03 -0800)
are available in the Git repository at:
git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest tags/linux_kselftest-kunit-6.15-rc1
for you to fetch changes up to 2e0cf2b32f72b20b0db5cc665cd8465d0f257278:
kunit: tool: add test to check parsing late test plan (2025-03-15 18:13:43 -0600)
----------------------------------------------------------------
linux_kselftest-kunit-6.15-rc1
kunit tool:
- Changes to kunit tool to use qboot on QEMU x86_64, and build GDB scripts.
- Fixes kunit tool bug in parsing test plan.
- Adds test to kunit tool to check parsing late test plan.
kunit:
- Clarifies kunit_skip() argument name.
- Adds Kunit check for the longest symbol length.
- Changes qemu_configs for sparc to use Zilog console.
----------------------------------------------------------------
Brendan Jackman (2):
kunit: tool: Use qboot on QEMU x86_64
kunit: tool: Build GDB scripts
Kevin Brodsky (1):
kunit: Clarify kunit_skip() argument name
Rae Moar (2):
kunit: tool: Fix bug in parsing test plan
kunit: tool: add test to check parsing late test plan
Sergio González Collado (1):
Kunit to check the longest symbol length
Thomas Weißschuh (1):
kunit: qemu_configs: sparc: use Zilog console
arch/x86/tools/insn_decoder_test.c | 3 +-
include/kunit/test.h | 20 ++++----
lib/Kconfig.debug | 9 ++++
lib/Makefile | 2 +
lib/longest_symbol_kunit.c | 82 ++++++++++++++++++++++++++++++
tools/testing/kunit/kunit_kernel.py | 4 +-
tools/testing/kunit/kunit_parser.py | 9 ++--
tools/testing/kunit/kunit_tool_test.py | 11 ++++
tools/testing/kunit/qemu_configs/sparc.py | 5 +-
tools/testing/kunit/qemu_configs/x86_64.py | 4 +-
10 files changed, 128 insertions(+), 21 deletions(-)
create mode 100644 lib/longest_symbol_kunit.c
----------------------------------------------------------------
This started with a patch that enabled `clippy::ptr_as_ptr`. Benno
Lossin suggested I also look into `clippy::ptr_cast_constness` and I
discovered `clippy::as_ptr_cast_mut`. This series now enables all 3
lints. It also enables `clippy::as_underscore` which ensures other
pointer casts weren't missed. The first commit reduces the need for
pointer casts and is shared with another series[1].
As a later addition, `clippy::cast_lossless` and `clippy::ref_as_ptr`
are also enabled.
Link: https://lore.kernel.org/all/20250307-no-offset-v1-0-0c728f63b69c@gmail.com/ [1]
Signed-off-by: Tamir Duberstein <tamird(a)gmail.com>
---
Changes in v7:
- Add patch to enable `clippy::ref_as_ptr`.
- Link to v6: https://lore.kernel.org/r/20250324-ptr-as-ptr-v6-0-49d1b7fd4290@gmail.com
Changes in v6:
- Drop strict provenance patch.
- Fix URLs in doc comments.
- Add patch to enable `clippy::cast_lossless`.
- Rebase on rust-next.
- Link to v5: https://lore.kernel.org/r/20250317-ptr-as-ptr-v5-0-5b5f21fa230a@gmail.com
Changes in v5:
- Use `pointer::addr` in OF. (Boqun Feng)
- Add documentation on stubs. (Benno Lossin)
- Mark stubs `#[inline]`.
- Pick up Alice's RB on a shared commit from
https://lore.kernel.org/all/Z9f-3Aj3_FWBZRrm@google.com/.
- Link to v4: https://lore.kernel.org/r/20250315-ptr-as-ptr-v4-0-b2d72c14dc26@gmail.com
Changes in v4:
- Add missing SoB. (Benno Lossin)
- Use `without_provenance_mut` in alloc. (Boqun Feng)
- Limit strict provenance lints to the `kernel` crate to avoid complex
logic in the build system. This can be revisited on MSRV >= 1.84.0.
- Rebase on rust-next.
- Link to v3: https://lore.kernel.org/r/20250314-ptr-as-ptr-v3-0-e7ba61048f4a@gmail.com
Changes in v3:
- Fixed clippy warning in rust/kernel/firmware.rs. (kernel test robot)
Link: https://lore.kernel.org/all/202503120332.YTCpFEvv-lkp@intel.com/
- s/as u64/as bindings::phys_addr_t/g. (Benno Lossin)
- Use strict provenance APIs and enable lints. (Benno Lossin)
- Link to v2: https://lore.kernel.org/r/20250309-ptr-as-ptr-v2-0-25d60ad922b7@gmail.com
Changes in v2:
- Fixed typo in first commit message.
- Added additional patches, converted to series.
- Link to v1: https://lore.kernel.org/r/20250307-ptr-as-ptr-v1-1-582d06514c98@gmail.com
---
Tamir Duberstein (7):
rust: retain pointer mut-ness in `container_of!`
rust: enable `clippy::ptr_as_ptr` lint
rust: enable `clippy::ptr_cast_constness` lint
rust: enable `clippy::as_ptr_cast_mut` lint
rust: enable `clippy::as_underscore` lint
rust: enable `clippy::cast_lossless` lint
rust: enable `clippy::ref_as_ptr` lint
Makefile | 6 ++++++
drivers/gpu/drm/drm_panic_qr.rs | 10 +++++-----
rust/bindings/lib.rs | 3 +++
rust/kernel/alloc/allocator_test.rs | 2 +-
rust/kernel/alloc/kvec.rs | 4 ++--
rust/kernel/block/mq/operations.rs | 2 +-
rust/kernel/block/mq/request.rs | 7 ++++---
rust/kernel/device.rs | 5 +++--
rust/kernel/device_id.rs | 5 +++--
rust/kernel/devres.rs | 19 ++++++++++---------
rust/kernel/dma.rs | 6 +++---
rust/kernel/error.rs | 2 +-
rust/kernel/firmware.rs | 3 ++-
rust/kernel/fs/file.rs | 3 ++-
rust/kernel/io.rs | 18 +++++++++---------
rust/kernel/kunit.rs | 15 +++++++--------
rust/kernel/lib.rs | 5 ++---
rust/kernel/list/impl_list_item_mod.rs | 2 +-
rust/kernel/miscdevice.rs | 2 +-
rust/kernel/net/phy.rs | 4 ++--
rust/kernel/of.rs | 6 +++---
rust/kernel/pci.rs | 13 ++++++++-----
rust/kernel/platform.rs | 6 ++++--
rust/kernel/print.rs | 11 +++++------
rust/kernel/rbtree.rs | 23 ++++++++++-------------
rust/kernel/seq_file.rs | 3 ++-
rust/kernel/str.rs | 14 +++++++-------
rust/kernel/sync/poll.rs | 2 +-
rust/kernel/uaccess.rs | 5 +++--
rust/kernel/workqueue.rs | 12 ++++++------
rust/uapi/lib.rs | 3 +++
31 files changed, 120 insertions(+), 101 deletions(-)
---
base-commit: 28bb48c4cb34f65a9aa602142e76e1426da31293
change-id: 20250307-ptr-as-ptr-21b1867fc4d4
Best regards,
--
Tamir Duberstein <tamird(a)gmail.com>
When compiling the RTC library functions test as a module, the module
has the non-descriptive name "lib_test.ko". Fix this by adding the
subsystem's name as a prefix.
Signed-off-by: Geert Uytterhoeven <geert(a)linux-m68k.org>
---
drivers/rtc/Makefile | 2 +-
drivers/rtc/{lib_test.c => rtc_lib_test.c} | 0
2 files changed, 1 insertion(+), 1 deletion(-)
rename drivers/rtc/{lib_test.c => rtc_lib_test.c} (100%)
diff --git a/drivers/rtc/Makefile b/drivers/rtc/Makefile
index 489b4ab07068c758..c0ccbbfe2739c1aa 100644
--- a/drivers/rtc/Makefile
+++ b/drivers/rtc/Makefile
@@ -15,7 +15,7 @@ rtc-core-$(CONFIG_RTC_INTF_DEV) += dev.o
rtc-core-$(CONFIG_RTC_INTF_PROC) += proc.o
rtc-core-$(CONFIG_RTC_INTF_SYSFS) += sysfs.o
-obj-$(CONFIG_RTC_LIB_KUNIT_TEST) += lib_test.o
+obj-$(CONFIG_RTC_LIB_KUNIT_TEST) += rtc_lib_test.o
# Keep the list ordered.
diff --git a/drivers/rtc/lib_test.c b/drivers/rtc/rtc_lib_test.c
similarity index 100%
rename from drivers/rtc/lib_test.c
rename to drivers/rtc/rtc_lib_test.c
--
2.43.0
This series introduces support in the KVM and ARM PMUv3 driver for
partitioning PMU counters into two separate ranges by taking advantage
of the MDCR_EL2.HPMN register field.
The advantage of a partitioned PMU would be to allow KVM guests direct
access to a subset of PMU functionality, greatly reducing the overhead
of performance monitoring in guests.
While this feature could be accepted on its own merits, practically
there is a lot more to be done before it will be fully useful, so I'm
sending as an RFC for now.
v3:
* Include cpucap definition for FEAT_HPMN0 to allow for setting HPMN
to 0
* Include PMU header cleanup provided by Marc [1] with some minor
changes so compilation works
* Pull functions out of pmu-emul.c that aren't specific to the
emulated PMU. This and the previous item aren't strictly
needed but they provide a nicer starting point.
* As suggested by Oliver, start a file for partitioned PMU functions
and move the reserved_host_counters parameter and MDCR handling into
KVM so the driver does not have to know about it and we need fewer
hacks to keep the driver working on 32-bit ARM. This was not a
complete separation because the driver still needs to start and stop
the host counters all at once and needs to toggle MDCR_EL2.HPME to
do that. Introduce kvm_pmu_host_counters_{enable,disable}()
functions to handle this and define them as no ops on 32-bit ARM.
* As suggested by Oliver, don't limit PMCR.N on emulated PMU. This
value will be read correctly when the right traps are disabled to
use the partitioned PMU
v2:
https://lore.kernel.org/kvm/20250208020111.2068239-1-coltonlewis@google.com/
v1:
https://lore.kernel.org/kvm/20250127222031.3078945-1-coltonlewis@google.com/
[1] https://git.kernel.org/pub/scm/linux/kernel/git/maz/arm-platforms.git/log/?…
Colton Lewis (7):
arm64: cpufeature: Add cap for HPMN0
arm64: Generate sign macro for sysreg Enums
KVM: arm64: Reorganize PMU functions
KVM: arm64: Introduce module param to partition the PMU
perf: arm_pmuv3: Generalize counter bitmasks
perf: arm_pmuv3: Keep out of guest counter partition
KVM: arm64: selftests: Reword selftests error
Marc Zyngier (1):
KVM: arm64: Cleanup PMU includes
arch/arm/include/asm/arm_pmuv3.h | 2 +
arch/arm64/include/asm/arm_pmuv3.h | 2 +-
arch/arm64/include/asm/kvm_host.h | 199 +++++++-
arch/arm64/include/asm/kvm_pmu.h | 47 ++
arch/arm64/kernel/cpufeature.c | 8 +
arch/arm64/kvm/Makefile | 2 +-
arch/arm64/kvm/arm.c | 1 -
arch/arm64/kvm/debug.c | 10 +-
arch/arm64/kvm/hyp/include/hyp/switch.h | 1 +
arch/arm64/kvm/pmu-emul.c | 464 +-----------------
arch/arm64/kvm/pmu-part.c | 63 +++
arch/arm64/kvm/pmu.c | 454 +++++++++++++++++
arch/arm64/kvm/sys_regs.c | 2 +
arch/arm64/tools/cpucaps | 1 +
arch/arm64/tools/gen-sysreg.awk | 1 +
arch/arm64/tools/sysreg | 6 +-
drivers/perf/arm_pmuv3.c | 73 ++-
include/kvm/arm_pmu.h | 204 --------
include/linux/perf/arm_pmu.h | 16 +-
include/linux/perf/arm_pmuv3.h | 27 +-
.../selftests/kvm/arm64/vpmu_counter_access.c | 2 +-
virt/kvm/kvm_main.c | 1 +
22 files changed, 882 insertions(+), 704 deletions(-)
create mode 100644 arch/arm64/include/asm/kvm_pmu.h
create mode 100644 arch/arm64/kvm/pmu-part.c
delete mode 100644 include/kvm/arm_pmu.h
base-commit: 2014c95afecee3e76ca4a56956a936e23283f05b
--
2.48.1.601.g30ceb7b040-goog
v1/v2:
There is only the first patch: RISC-V: Enable cbo.clean/flush in usermode,
which mainly removes the enabling of cbo.inval in user mode.
v3:
Add the functionality of Expose Zicbom and selftests for Zicbom.
v4:
Modify the order of macros, The test_no_cbo_inval function is added
separately.
v5:
1. Modify the order of RISCV_HWPROBE_KEY_ZICBOM_BLOCK_SIZE in hwprobe.rst
2. "TEST_NO_ZICBOINVAL" -> "TEST_NO_CBO_INVAL"
v6:
Change hwprobe_ext0_has's second param to u64.
v7:
Rebase to the latest code of linux-next.
Yunhui Cui (3):
RISC-V: Enable cbo.clean/flush in usermode
RISC-V: hwprobe: Expose Zicbom extension and its block size
RISC-V: selftests: Add TEST_ZICBOM into CBO tests
Documentation/arch/riscv/hwprobe.rst | 6 ++
arch/riscv/include/asm/hwprobe.h | 2 +-
arch/riscv/include/uapi/asm/hwprobe.h | 2 +
arch/riscv/kernel/cpufeature.c | 8 +++
arch/riscv/kernel/sys_hwprobe.c | 8 ++-
tools/testing/selftests/riscv/hwprobe/cbo.c | 66 +++++++++++++++++----
6 files changed, 79 insertions(+), 13 deletions(-)
--
2.39.2
Should fix flaky tcp-ao/connect-deny-ipv6 test.
Begging pardon for the delay since the report and for sending it this
late in the release cycle.
To: David S. Miller <davem(a)davemloft.net>
To: Eric Dumazet <edumazet(a)google.com>
To: Jakub Kicinski <kuba(a)kernel.org>
To: Paolo Abeni <pabeni(a)redhat.com>
To: Simon Horman <horms(a)kernel.org>
To: Shuah Khan <shuah(a)kernel.org>
Cc: netdev(a)vger.kernel.org
Cc: linux-kselftest(a)vger.kernel.org
Cc: linux-kernel(a)vger.kernel.org
Signed-off-by: Dmitry Safonov <0x7f454c46(a)gmail.com>
Changes in v2:
- Base on net-next (Paolo)
- Add a missing Fixes tag (Paolo)
- Link to v1: https://lore.kernel.org/r/20250312-tcp-ao-selftests-polling-v1-0-72a642b855…
---
Dmitry Safonov (7):
selftests/net: Print TCP flags in more common format
selftests/net: Provide tcp-ao counters comparison helper
selftests/net: Fetch and check TCP-MD5 counters
selftests/net: Add mixed select()+polling mode to TCP-AO tests
selftests/net: Print the testing side in unsigned-md5
selftests/net: Delete timeout from test_connect_socket()
selftests/net: Drop timeout argument from test_client_verify()
tools/testing/selftests/net/tcp_ao/connect-deny.c | 58 ++--
tools/testing/selftests/net/tcp_ao/connect.c | 22 +-
tools/testing/selftests/net/tcp_ao/icmps-discard.c | 17 +-
.../testing/selftests/net/tcp_ao/key-management.c | 76 ++---
tools/testing/selftests/net/tcp_ao/lib/aolib.h | 114 ++++++--
.../testing/selftests/net/tcp_ao/lib/ftrace-tcp.c | 7 +-
tools/testing/selftests/net/tcp_ao/lib/sock.c | 315 +++++++++++++++------
tools/testing/selftests/net/tcp_ao/restore.c | 75 +++--
tools/testing/selftests/net/tcp_ao/rst.c | 47 ++-
tools/testing/selftests/net/tcp_ao/self-connect.c | 18 +-
tools/testing/selftests/net/tcp_ao/seq-ext.c | 30 +-
tools/testing/selftests/net/tcp_ao/unsigned-md5.c | 118 ++++----
12 files changed, 552 insertions(+), 345 deletions(-)
---
base-commit: 23c9ff659140f97d44bf6fb59f89526a168f2b86
change-id: 20250312-tcp-ao-selftests-polling-21b6bbdf77b6
Best regards,
--
Dmitry Safonov <0x7f454c46(a)gmail.com>
Due to several issues which are unlikely to be fixed in the near future,
the access_tracking_perf_test sanity check for how many pages are still clean
after an iteration is not reliable when NUMA balancing is active.
This patch series refactors this test to skip this check by default automatically.
V2: adopted Sean's suggestions.
Best regards,
Maxim Levitsky
Maxim Levitsky (1):
KVM: selftests: access_tracking_perf_test: add option to skip the
sanity check
Sean Christopherson (1):
KVM: selftests: Extract guts of THP accessor to standalone sysfs
helpers
.../selftests/kvm/access_tracking_perf_test.c | 33 +++++++++++++--
.../testing/selftests/kvm/include/test_util.h | 1 +
tools/testing/selftests/kvm/lib/test_util.c | 42 ++++++++++++++-----
3 files changed, 61 insertions(+), 15 deletions(-)
--
2.26.3
With the switch over to nolibc the source file vdso_standalone_test_x86.c
was intended to be replaced with a symlink to vdso_test_gettimeofday.c.
This was the patch that was submitted to LKML, but during application the
symlink was replaced by a textual copy of the linked-to file.
Having two copies introduces the possibility of divergence and increases
maintenance burden, switch back to a symlink.
Link: https://lore.kernel.org/lkml/20250226-parse_vdso-nolibc-v2-16-28e14e031ed8@…
Fixes: 8770a9183fe1 ("selftests: vDSO: vdso_standalone_test_x86: Switch to nolibc")
Signed-off-by: Thomas Weißschuh <thomas.weissschuh(a)linutronix.de>
---
If symlinks are problematic an #include shim would also work.
These are not handled really well by the kselftests build system though,
as #include dependencies are not tracked by it.
---
.../selftests/vDSO/vdso_standalone_test_x86.c | 59 +---------------------
1 file changed, 1 insertion(+), 58 deletions(-)
diff --git a/tools/testing/selftests/vDSO/vdso_standalone_test_x86.c b/tools/testing/selftests/vDSO/vdso_standalone_test_x86.c
deleted file mode 100644
index 9ce795b806f0992b83cef78c7e16fac0e54750da..0000000000000000000000000000000000000000
--- a/tools/testing/selftests/vDSO/vdso_standalone_test_x86.c
+++ /dev/null
@@ -1,58 +0,0 @@
-// SPDX-License-Identifier: GPL-2.0-only
-/*
- * vdso_test_gettimeofday.c: Sample code to test parse_vdso.c and
- * vDSO gettimeofday()
- * Copyright (c) 2014 Andy Lutomirski
- *
- * Compile with:
- * gcc -std=gnu99 vdso_test_gettimeofday.c parse_vdso_gettimeofday.c
- *
- * Tested on x86, 32-bit and 64-bit. It may work on other architectures, too.
- */
-
-#include <stdio.h>
-#ifndef NOLIBC
-#include <sys/auxv.h>
-#include <sys/time.h>
-#endif
-
-#include "../kselftest.h"
-#include "parse_vdso.h"
-#include "vdso_config.h"
-#include "vdso_call.h"
-
-int main(int argc, char **argv)
-{
- const char *version = versions[VDSO_VERSION];
- const char **name = (const char **)&names[VDSO_NAMES];
-
- unsigned long sysinfo_ehdr = getauxval(AT_SYSINFO_EHDR);
- if (!sysinfo_ehdr) {
- printf("AT_SYSINFO_EHDR is not present!\n");
- return KSFT_SKIP;
- }
-
- vdso_init_from_sysinfo_ehdr(getauxval(AT_SYSINFO_EHDR));
-
- /* Find gettimeofday. */
- typedef long (*gtod_t)(struct timeval *tv, struct timezone *tz);
- gtod_t gtod = (gtod_t)vdso_sym(version, name[0]);
-
- if (!gtod) {
- printf("Could not find %s\n", name[0]);
- return KSFT_SKIP;
- }
-
- struct timeval tv;
- long ret = VDSO_CALL(gtod, 2, &tv, 0);
-
- if (ret == 0) {
- printf("The time is %lld.%06lld\n",
- (long long)tv.tv_sec, (long long)tv.tv_usec);
- } else {
- printf("%s failed\n", name[0]);
- return KSFT_FAIL;
- }
-
- return 0;
-}
diff --git a/tools/testing/selftests/vDSO/vdso_standalone_test_x86.c b/tools/testing/selftests/vDSO/vdso_standalone_test_x86.c
new file mode 120000
index 0000000000000000000000000000000000000000..4d3d96f1e440c965474681a6f35375a60b3921be
--- /dev/null
+++ b/tools/testing/selftests/vDSO/vdso_standalone_test_x86.c
@@ -0,0 +1 @@
+vdso_test_gettimeofday.c
\ No newline at end of file
---
base-commit: 1e26c5e28ca5821a824e90dd359556f5e9e7b89f
change-id: 20250326-vdso-selftests-fix-vdso_standalone_test_x86-c3a77b57ccbd
Best regards,
--
Thomas Weißschuh <thomas.weissschuh(a)linutronix.de>
Context
=======
We've observed within Red Hat that isolated, NOHZ_FULL CPUs running a
pure-userspace application get regularly interrupted by IPIs sent from
housekeeping CPUs. Those IPIs are caused by activity on the housekeeping CPUs
leading to various on_each_cpu() calls, e.g.:
64359.052209596 NetworkManager 0 1405 smp_call_function_many_cond (cpu=0, func=do_kernel_range_flush)
smp_call_function_many_cond+0x1
smp_call_function+0x39
on_each_cpu+0x2a
flush_tlb_kernel_range+0x7b
__purge_vmap_area_lazy+0x70
_vm_unmap_aliases.part.42+0xdf
change_page_attr_set_clr+0x16a
set_memory_ro+0x26
bpf_int_jit_compile+0x2f9
bpf_prog_select_runtime+0xc6
bpf_prepare_filter+0x523
sk_attach_filter+0x13
sock_setsockopt+0x92c
__sys_setsockopt+0x16a
__x64_sys_setsockopt+0x20
do_syscall_64+0x87
entry_SYSCALL_64_after_hwframe+0x65
The heart of this series is the thought that while we cannot remove NOHZ_FULL
CPUs from the list of CPUs targeted by these IPIs, they may not have to execute
the callbacks immediately. Anything that only affects kernelspace can wait
until the next user->kernel transition, providing it can be executed "early
enough" in the entry code.
The original implementation is from Peter [1]. Nicolas then added kernel TLB
invalidation deferral to that [2], and I picked it up from there.
Deferral approach
=================
Storing each and every callback, like a secondary call_single_queue turned out
to be a no-go: the whole point of deferral is to keep NOHZ_FULL CPUs in
userspace for as long as possible - no signal of any form would be sent when
deferring an IPI. This means that any form of queuing for deferred callbacks
would end up as a convoluted memory leak.
Deferred IPIs must thus be coalesced, which this series achieves by assigning
IPIs a "type" and having a mapping of IPI type to callback, leveraged upon
kernel entry.
What about IPIs whose callback take a parameter, you may ask?
Peter suggested during OSPM23 [3] that since on_each_cpu() targets
housekeeping CPUs *and* isolated CPUs, isolated CPUs can access either global or
housekeeping-CPU-local state to "reconstruct" the data that would have been sent
via the IPI.
This series does not affect any IPI callback that requires an argument, but the
approach would remain the same (one coalescable callback executed on kernel
entry).
Kernel entry vs execution of the deferred operation
===================================================
This is what I've referred to as the "Danger Zone" during my LPC24 talk [4].
There is a non-zero length of code that is executed upon kernel entry before the
deferred operation can be itself executed (i.e. before we start getting into
context_tracking.c proper), i.e.:
idtentry_func_foo() <--- we're in the kernel
irqentry_enter()
enter_from_user_mode()
__ct_user_exit()
ct_kernel_enter_state()
ct_work_flush() <--- deferred operation is executed here
This means one must take extra care to what can happen in the early entry code,
and that <bad things> cannot happen. For instance, we really don't want to hit
instructions that have been modified by a remote text_poke() while we're on our
way to execute a deferred sync_core(). Patches doing the actual deferral have
more detail on this.
Patches
=======
o Patches 1-2 are standalone objtool cleanups.
o Patches 3-4 add an RCU testing feature.
o Patches 5-6 add infrastructure for annotating static keys and static calls
that may be used in noinstr code (courtesy of Josh).
o Patches 7-19 use said annotations on relevant keys / calls.
o Patch 20 enforces proper usage of said annotations (courtesy of Josh).
o Patches 21-23 fiddle with CT_STATE* within context tracking
o Patches 24-29 add the actual IPI deferral faff
o Patch 30 adds a freebie: deferring IPIs for NOHZ_IDLE. Not tested that much!
if you care about battery-powered devices and energy consumption, go give it
a try!
Patches are also available at:
https://gitlab.com/vschneid/linux.git -b redhat/isolirq/defer/v4
Stuff I'd like eyes and neurons on
==================================
Context-tracking vs idle. Patch 22 "improves" the situation by adding an
IDLE->KERNEL transition when getting an IRQ while idle, but it leaves the
following window:
~> IRQ
ct_nmi_enter()
state = state + CT_STATE_KERNEL - CT_STATE_IDLE
[...]
ct_nmi_exit()
state = state - CT_STATE_KERNEL + CT_STATE_IDLE
[...] /!\ CT_STATE_IDLE here while we're really in kernelspace! /!\
ct_cpuidle_exit()
state = state + CT_STATE_KERNEL - CT_STATE_IDLE
Said window is contained within cpu_idle_poll() and the cpuidle call within
cpuidle_enter_state(), both being noinstr (the former is __cpuidle which is
noinstr itself). Thus objtool will consider it as early entry and will warn
accordingly of any static key / call misuse, so the damage is somewhat
contained, but it's not ideal.
I tried fiddling with this but idle polling likes being annoying, as it is
shaped like so:
ct_cpuidle_enter();
raw_local_irq_enable();
while (!tif_need_resched() &&
(cpu_idle_force_poll || tick_check_broadcast_expired()))
cpu_relax();
raw_local_irq_disable();
ct_cpuidle_exit();
IOW, getting an IRQ that doesn't end up setting NEED_RESCHED while idle-polling
doesn't come near ct_cpuidle_exit(), which prevents me from having the outermost
ct_nmi_exit() leave the state as CT_STATE_KERNEL (rather than CT_STATE_IDLE).
Testing
=======
Xeon E5-2699 system with SMToff, NOHZ_FULL, isolated CPUs.
RHEL9 userspace.
Workload is using rteval (kernel compilation + hackbench) on housekeeping CPUs
and a dummy stay-in-userspace loop on the isolated CPUs. The main invocation is:
$ trace-cmd record -e "csd_queue_cpu" -f "cpu & CPUS{$ISOL_CPUS}" \
-e "ipi_send_cpumask" -f "cpumask & CPUS{$ISOL_CPUS}" \
-e "ipi_send_cpu" -f "cpu & CPUS{$ISOL_CPUS}" \
rteval --onlyload --loads-cpulist=$HK_CPUS \
--hackbench-runlowmem=True --duration=$DURATION
This only records IPIs sent to isolated CPUs, so any event there is interference
(with a bit of fuzz at the start/end of the workload when spawning the
processes). All tests were done with a duration of 6 hours.
v6.13-rc6
# This is the actual IPI count
$ trace-cmd report | grep callback | awk '{ print $(NF) }' | sort | uniq -c | sort -nr
531 callback=generic_smp_call_function_single_interrupt+0x0
# These are the different CSD's that caused IPIs
$ trace-cmd report | grep csd_queue | awk '{ print $(NF-1) }' | sort | uniq -c | sort -nr
12818 func=do_flush_tlb_all
910 func=do_kernel_range_flush
78 func=do_sync_core
v6.13-rc6 + patches:
# This is the actual IPI count
$ trace-cmd report | grep callback | awk '{ print $(NF) }' | sort | uniq -c | sort -nr
# Zilch!
# These are the different CSD's that caused IPIs
$ trace-cmd report | grep csd_queue | awk '{ print $(NF-1) }' | sort | uniq -c | sort -nr
# Nada!
Note that tlb_remove_table_smp_sync() showed up during testing of v3, and has
gone as mysteriously as it showed up. Yair had a series adressing this [5] which
would be worth revisiting.
Acknowledgements
================
Special thanks to:
o Clark Williams for listening to my ramblings about this and throwing ideas my way
o Josh Poimboeuf for all his help with everything objtool-related
o All of the folks who attended various (too many?) talks about this and
provided precious feedback.
Links
=====
[1]: https://lore.kernel.org/all/20210929151723.162004989@infradead.org/
[2]: https://github.com/vianpl/linux.git -b ct-work-defer-wip
[3]: https://youtu.be/0vjE6fjoVVE
[4]: https://lpc.events/event/18/contributions/1889/
[5]: https://lore.kernel.org/lkml/20230620144618.125703-1-ypodemsk@redhat.com/
Revisions
=========
RFCv3 -> v4
++++++++++++++
o Rebased onto v6.13-rc6
o New objtool patches from Josh
o More .noinstr static key/call patches
o Static calls now handled as well (again thanks to Josh)
o Fixed clearing the work bits on kernel exit
o Messed with IRQ hitting an idle CPU vs context tracking
o Various comment and naming cleanups
o Made RCU_DYNTICKS_TORTURE depend on !COMPILE_TEST (PeterZ)
o Fixed the CT_STATE_KERNEL check when setting a deferred work (Frederic)
o Cleaned up the __flush_tlb_all() mess thanks to PeterZ
RFCv2 -> RFCv3
++++++++++++++
o Rebased onto v6.12-rc6
o Added objtool documentation for the new warning (Josh)
o Added low-size RCU watching counter to TREE04 torture scenario (Paul)
o Added FORCEFUL jump label and static key types
o Added noinstr-compliant helpers for tlb flush deferral
RFCv1 -> RFCv2
++++++++++++++
o Rebased onto v6.5-rc1
o Updated the trace filter patches (Steven)
o Fixed __ro_after_init keys used in modules (Peter)
o Dropped the extra context_tracking atomic, squashed the new bits in the
existing .state field (Peter, Frederic)
o Added an RCU_EXPERT config for the RCU dynticks counter size, and added an
rcutorture case for a low-size counter (Paul)
o Fixed flush_tlb_kernel_range_deferrable() definition
Josh Poimboeuf (3):
jump_label: Add annotations for validating noinstr usage
static_call: Add read-only-after-init static calls
objtool: Add noinstr validation for static branches/calls
Peter Zijlstra (1):
x86,tlb: Make __flush_tlb_global() noinstr-compliant
Valentin Schneider (26):
objtool: Make validate_call() recognize indirect calls to pv_ops[]
objtool: Flesh out warning related to pv_ops[] calls
rcu: Add a small-width RCU watching counter debug option
rcutorture: Make TREE04 use CONFIG_RCU_DYNTICKS_TORTURE
x86/paravirt: Mark pv_sched_clock static call as __ro_after_init
x86/idle: Mark x86_idle static call as __ro_after_init
x86/paravirt: Mark pv_steal_clock static call as __ro_after_init
riscv/paravirt: Mark pv_steal_clock static call as __ro_after_init
loongarch/paravirt: Mark pv_steal_clock static call as __ro_after_init
arm64/paravirt: Mark pv_steal_clock static call as __ro_after_init
arm/paravirt: Mark pv_steal_clock static call as __ro_after_init
perf/x86/amd: Mark perf_lopwr_cb static call as __ro_after_init
sched/clock: Mark sched_clock_running key as __ro_after_init
x86/speculation/mds: Mark mds_idle_clear key as allowed in .noinstr
sched/clock, x86: Mark __sched_clock_stable key as allowed in .noinstr
x86/kvm/vmx: Mark vmx_l1d_should flush and vmx_l1d_flush_cond keys as
allowed in .noinstr
stackleack: Mark stack_erasing_bypass key as allowed in .noinstr
context_tracking: Explicitely use CT_STATE_KERNEL where it is missing
context_tracking: Exit CT_STATE_IDLE upon irq/nmi entry
context_tracking: Turn CT_STATE_* into bits
context-tracking: Introduce work deferral infrastructure
context_tracking,x86: Defer kernel text patching IPIs
x86/tlb: Make __flush_tlb_local() noinstr-compliant
x86/tlb: Make __flush_tlb_all() noinstr
x86/mm, mm/vmalloc: Defer flush_tlb_kernel_range() targeting NOHZ_FULL
CPUs
context-tracking: Add a Kconfig to enable IPI deferral for NO_HZ_IDLE
arch/Kconfig | 9 ++
arch/arm/kernel/paravirt.c | 2 +-
arch/arm64/kernel/paravirt.c | 2 +-
arch/loongarch/kernel/paravirt.c | 2 +-
arch/riscv/kernel/paravirt.c | 2 +-
arch/x86/Kconfig | 1 +
arch/x86/events/amd/brs.c | 2 +-
arch/x86/include/asm/context_tracking_work.h | 22 ++++
arch/x86/include/asm/invpcid.h | 13 +--
arch/x86/include/asm/paravirt.h | 4 +-
arch/x86/include/asm/text-patching.h | 1 +
arch/x86/include/asm/tlbflush.h | 3 +-
arch/x86/include/asm/xen/hypercall.h | 11 +-
arch/x86/kernel/alternative.c | 38 ++++++-
arch/x86/kernel/cpu/bugs.c | 9 +-
arch/x86/kernel/kprobes/core.c | 4 +-
arch/x86/kernel/kprobes/opt.c | 4 +-
arch/x86/kernel/module.c | 2 +-
arch/x86/kernel/paravirt.c | 4 +-
arch/x86/kernel/process.c | 2 +-
arch/x86/kvm/vmx/vmx.c | 11 +-
arch/x86/mm/tlb.c | 46 ++++++--
arch/x86/xen/mmu_pv.c | 10 +-
arch/x86/xen/xen-ops.h | 12 +-
include/asm-generic/sections.h | 15 +++
include/linux/context_tracking.h | 21 ++++
include/linux/context_tracking_state.h | 64 +++++++++--
include/linux/context_tracking_work.h | 28 +++++
include/linux/jump_label.h | 30 ++++-
include/linux/objtool.h | 7 ++
include/linux/static_call.h | 19 ++++
kernel/context_tracking.c | 98 ++++++++++++++--
kernel/rcu/Kconfig.debug | 15 +++
kernel/sched/clock.c | 7 +-
kernel/stackleak.c | 6 +-
kernel/time/Kconfig | 19 ++++
mm/vmalloc.c | 35 +++++-
tools/objtool/Documentation/objtool.txt | 34 ++++++
tools/objtool/check.c | 106 +++++++++++++++---
tools/objtool/include/objtool/check.h | 1 +
tools/objtool/include/objtool/elf.h | 1 +
tools/objtool/include/objtool/special.h | 1 +
tools/objtool/special.c | 18 ++-
.../selftests/rcutorture/configs/rcu/TREE04 | 1 +
44 files changed, 635 insertions(+), 107 deletions(-)
create mode 100644 arch/x86/include/asm/context_tracking_work.h
create mode 100644 include/linux/context_tracking_work.h
--
2.43.0
This patch set convert iptables to nftables for wireguard testing, as
iptables is deparated and nftables is the default framework of most releases.
v5: remove the counter in nft rules and link nft statically (Jason A. Donenfeld)
v4: no update, just re-send
v3: drop iptables directly (Jason A. Donenfeld)
Also convert to using nft for qemu testing (Jason A. Donenfeld)
v2: use one nft table for testing (Phil Sutter)
Hangbin Liu (2):
wireguard: selftests: convert iptables to nft
wireguard: selftests: update to using nft for qemu test
tools/testing/selftests/wireguard/netns.sh | 29 +++++++++------
.../testing/selftests/wireguard/qemu/Makefile | 36 ++++++++++++++-----
.../selftests/wireguard/qemu/kernel.config | 7 ++--
3 files changed, 49 insertions(+), 23 deletions(-)
--
2.46.0
Hello
I am keen to exchange thoughts on financial opportunities with you. If it
aligns with your preferences, I would appreciate your WhatsApp contact
details for a more fluid dialogue. Should that not suit, please let me know
a convenient time for us to correspond directly.
Thank you for your kind consideration. I eagerly anticipate your reply.
Warm regards,
General Johanne
This started with a patch that enabled `clippy::ptr_as_ptr`. Benno
Lossin suggested I also look into `clippy::ptr_cast_constness` and I
discovered `clippy::as_ptr_cast_mut`. This series now enables all 3
lints. It also enables `clippy::as_underscore` which ensures other
pointer casts weren't missed. The first commit reduces the need for
pointer casts and is shared with another series[1].
The final patch also enables pointer provenance lints and fixes
violations. See that commit message for details. The build system
portion of that commit is pretty messy but I couldn't find a better way
to convincingly ensure that these lints were applied globally.
Suggestions would be very welcome.
Link: https://lore.kernel.org/all/20250307-no-offset-v1-0-0c728f63b69c@gmail.com/ [1]
Signed-off-by: Tamir Duberstein <tamird(a)gmail.com>
---
Changes in v5:
- Use `pointer::addr` in OF. (Boqun Feng)
- Add documentation on stubs. (Benno Lossin)
- Mark stubs `#[inline]`.
- Pick up Alice's RB on a shared commit from
https://lore.kernel.org/all/Z9f-3Aj3_FWBZRrm@google.com/.
- Link to v4: https://lore.kernel.org/r/20250315-ptr-as-ptr-v4-0-b2d72c14dc26@gmail.com
Changes in v4:
- Add missing SoB. (Benno Lossin)
- Use `without_provenance_mut` in alloc. (Boqun Feng)
- Limit strict provenance lints to the `kernel` crate to avoid complex
logic in the build system. This can be revisited on MSRV >= 1.84.0.
- Rebase on rust-next.
- Link to v3: https://lore.kernel.org/r/20250314-ptr-as-ptr-v3-0-e7ba61048f4a@gmail.com
Changes in v3:
- Fixed clippy warning in rust/kernel/firmware.rs. (kernel test robot)
Link: https://lore.kernel.org/all/202503120332.YTCpFEvv-lkp@intel.com/
- s/as u64/as bindings::phys_addr_t/g. (Benno Lossin)
- Use strict provenance APIs and enable lints. (Benno Lossin)
- Link to v2: https://lore.kernel.org/r/20250309-ptr-as-ptr-v2-0-25d60ad922b7@gmail.com
Changes in v2:
- Fixed typo in first commit message.
- Added additional patches, converted to series.
- Link to v1: https://lore.kernel.org/r/20250307-ptr-as-ptr-v1-1-582d06514c98@gmail.com
---
Tamir Duberstein (6):
rust: retain pointer mut-ness in `container_of!`
rust: enable `clippy::ptr_as_ptr` lint
rust: enable `clippy::ptr_cast_constness` lint
rust: enable `clippy::as_ptr_cast_mut` lint
rust: enable `clippy::as_underscore` lint
rust: use strict provenance APIs
Makefile | 4 ++
init/Kconfig | 3 +
rust/bindings/lib.rs | 1 +
rust/kernel/alloc.rs | 2 +-
rust/kernel/alloc/allocator_test.rs | 2 +-
rust/kernel/alloc/kvec.rs | 4 +-
rust/kernel/block/mq/operations.rs | 2 +-
rust/kernel/block/mq/request.rs | 7 +-
rust/kernel/device.rs | 5 +-
rust/kernel/device_id.rs | 2 +-
rust/kernel/devres.rs | 19 +++---
rust/kernel/error.rs | 2 +-
rust/kernel/firmware.rs | 3 +-
rust/kernel/fs/file.rs | 2 +-
rust/kernel/io.rs | 16 ++---
rust/kernel/kunit.rs | 15 ++---
rust/kernel/lib.rs | 113 ++++++++++++++++++++++++++++++++-
rust/kernel/list/impl_list_item_mod.rs | 2 +-
rust/kernel/miscdevice.rs | 2 +-
rust/kernel/of.rs | 6 +-
rust/kernel/pci.rs | 15 +++--
rust/kernel/platform.rs | 6 +-
rust/kernel/print.rs | 11 ++--
rust/kernel/rbtree.rs | 23 +++----
rust/kernel/seq_file.rs | 3 +-
rust/kernel/str.rs | 18 ++----
rust/kernel/sync/poll.rs | 2 +-
rust/kernel/uaccess.rs | 12 ++--
rust/kernel/workqueue.rs | 12 ++--
rust/uapi/lib.rs | 1 +
30 files changed, 218 insertions(+), 97 deletions(-)
---
base-commit: 498f7ee4773f22924f00630136da8575f38954e8
change-id: 20250307-ptr-as-ptr-21b1867fc4d4
Best regards,
--
Tamir Duberstein <tamird(a)gmail.com>
Hi,
"kci_test_bridge_parent_id" test failed with error "as device can not be
enslaved while up".
Here is Error log.
-------------------
> ./rtnetlink.sh -t kci_test_bridge_parent_id -v
COMMAND: ip link add name test-dummy0 type dummy
COMMAND: ip link set test-dummy0 up
COMMAND: modprobe -q netdevsim
COMMAND: ip link add name test-bond0 type bond mode 802.3ad
COMMAND: ip link set dev eni10np1 master test-bond0
Error: Device can not be enslaved while up.
COMMAND: ip link set dev eni20np1 master test-bond0
Error: Device can not be enslaved while up.
COMMAND: ip link add name test-br0 type bridge
COMMAND: ip link set dev test-bond0 master test-br0
FAIL: bridge_parent_id
-------------------
upstream commit ec4ffd100ffb ("Revert "net: rtnetlink: Enslave device
before bringing it up""), suggest the following scenario!
$ ip link set dummy0 up
$ ip link set dummy0 master bond0 down
According to last commit, do we need to modify
"kci_test_bridge_parent_id" test set to down.
--- a/tools/testing/selftests/net/rtnetlink.sh
+++ b/tools/testing/selftests/net/rtnetlink.sh
@@ -1129,8 +1129,8 @@ kci_test_bridge_parent_id()
dev10=`ls ${sysfsnet}10/net/`
dev20=`ls ${sysfsnet}20/net/`
run_cmd ip link add name test-bond0 type bond mode 802.3ad
- run_cmd ip link set dev $dev10 master test-bond0
- run_cmd ip link set dev $dev20 master test-bond0
+ run_cmd ip link set dev $dev10 master test-bond0 down
+ run_cmd ip link set dev $dev20 master test-bond0 down
run_cmd ip link add name test-br0 type bridge
Success log with modified test case
> ./rtnetlink.sh -t kci_test_bridge_parent_id -v
COMMAND: ip link add name test-dummy0 type dummy
COMMAND: ip link set test-dummy0 up
COMMAND: modprobe -q netdevsim
COMMAND: ip link add name test-bond0 type bond mode 802.3ad
COMMAND: ip link set dev eni10np1 master test-bond0 down
COMMAND: ip link set dev eni20np1 master test-bond0 down
COMMAND: ip link add name test-br0 type bridge
COMMAND: ip link set dev test-bond0 master test-br0
PASS: bridge_parent_id
Thanks,
Alok
Set ret = 0 on successful completion of the processing loop in
cs_dsp_load() and cs_dsp_load_coeff() to ensure that the function
returns 0 on success.
All normal firmware files will have at least one data block, and
processing this block will set ret == 0, from the result of either
regmap_raw_write() or cs_dsp_parse_coeff().
The kunit tests create a dummy firmware file that contains only the
header, without any data blocks. This gives cs_dsp a file to "load"
that will not cause any side-effects. As there aren't any data blocks,
the processing loop will not set ret == 0.
Originally there was a line after the processing loop:
ret = regmap_async_complete(regmap);
which would set ret == 0 before the function returned.
Commit fe08b7d5085a ("firmware: cs_dsp: Remove async regmap writes")
changed the regmap write to a normal sync write, so the call to
regmap_async_complete() wasn't necessary and was removed. It was
overlooked that the ret here wasn't only to check the result of
regmap_async_complete(), it also set the final return value of the
function.
Fixes: fe08b7d5085a ("firmware: cs_dsp: Remove async regmap writes")
Signed-off-by: Richard Fitzgerald <rf(a)opensource.cirrus.com>
---
drivers/firmware/cirrus/cs_dsp.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/drivers/firmware/cirrus/cs_dsp.c b/drivers/firmware/cirrus/cs_dsp.c
index 42433c19eb30..560724ce21aa 100644
--- a/drivers/firmware/cirrus/cs_dsp.c
+++ b/drivers/firmware/cirrus/cs_dsp.c
@@ -1631,6 +1631,7 @@ static int cs_dsp_load(struct cs_dsp *dsp, const struct firmware *firmware,
cs_dsp_debugfs_save_wmfwname(dsp, file);
+ ret = 0;
out_fw:
cs_dsp_buf_free(&buf_list);
@@ -2338,6 +2339,7 @@ static int cs_dsp_load_coeff(struct cs_dsp *dsp, const struct firmware *firmware
cs_dsp_debugfs_save_binname(dsp, file);
+ ret = 0;
out_fw:
cs_dsp_buf_free(&buf_list);
--
2.43.0
This patchset add ftrace helpers functions and
add a new test makes sure that ftrace can trace
a function that was introduced by a livepatch.
Signed-off-by: Filipe Xavier <felipeaggger(a)gmail.com>
Suggested-by: Marcos Paulo de Souza <mpdesouza(a)suse.com>
Reviewed-by: Marcos Paulo de Souza <mpdesouza(a)suse.com>
Acked-by: Miroslav Benes <mbenes(a)suse.cz>
---
Changes in v3:
- functions.sh: fixed sed to remove warning from shellcheck and add grep -Fw params.
- test-ftrace.sh: change constant to use common SYSFS_KLP_DIR.
- Link to v2: https://lore.kernel.org/r/20250318-ftrace-sftest-livepatch-v2-0-60cb0aa95cc…
Changes in v2:
- functions.sh: change check traced function to accept a list of functions.
- Link to v1: https://lore.kernel.org/r/20250306-ftrace-sftest-livepatch-v1-0-a6f1dfc30e1…
---
Filipe Xavier (2):
selftests: livepatch: add new ftrace helpers functions
selftests: livepatch: test if ftrace can trace a livepatched function
tools/testing/selftests/livepatch/functions.sh | 49 ++++++++++++++++++++++++
tools/testing/selftests/livepatch/test-ftrace.sh | 34 ++++++++++++++++
2 files changed, 83 insertions(+)
---
base-commit: 848e076317446f9c663771ddec142d7c2eb4cb43
change-id: 20250306-ftrace-sftest-livepatch-60d9dc472235
Best regards,
--
Filipe Xavier <felipeaggger(a)gmail.com>
Bugfixes for a few issues I ran into.
Signed-off-by: Thomas Weißschuh <thomas.weissschuh(a)linutronix.de>
---
Thomas Weißschuh (3):
selftests: vDSO: chacha: Correctly skip test if necessary
selftests: vDSO: chacha: Include asm/hwcap.h for arm64
selftests: vDSO: chacha: Provide default definition of HWCAP_S390_VXRS
tools/testing/selftests/vDSO/vdso_test_chacha.c | 7 ++++++-
1 file changed, 6 insertions(+), 1 deletion(-)
---
base-commit: 586de92313fcab8ed84ac5f78f4d2aae2db92c59
change-id: 20250324-s390-vdso-hwcap-0914c0f7c7f1
Best regards,
--
Thomas Weißschuh <thomas.weissschuh(a)linutronix.de>
This started with a patch that enabled `clippy::ptr_as_ptr`. Benno
Lossin suggested I also look into `clippy::ptr_cast_constness` and I
discovered `clippy::as_ptr_cast_mut`. This series now enables all 3
lints. It also enables `clippy::as_underscore` which ensures other
pointer casts weren't missed. The first commit reduces the need for
pointer casts and is shared with another series[1].
As a late addition, `clippy::cast_lossless` is also enabled.
Link: https://lore.kernel.org/all/20250307-no-offset-v1-0-0c728f63b69c@gmail.com/ [1]
Signed-off-by: Tamir Duberstein <tamird(a)gmail.com>
---
Changes in v6:
- Drop strict provenance patch.
- Fix URLs in doc comments.
- Add patch to enable `clippy::cast_lossless`.
- Rebase on rust-next.
- Link to v5: https://lore.kernel.org/r/20250317-ptr-as-ptr-v5-0-5b5f21fa230a@gmail.com
Changes in v5:
- Use `pointer::addr` in OF. (Boqun Feng)
- Add documentation on stubs. (Benno Lossin)
- Mark stubs `#[inline]`.
- Pick up Alice's RB on a shared commit from
https://lore.kernel.org/all/Z9f-3Aj3_FWBZRrm@google.com/.
- Link to v4: https://lore.kernel.org/r/20250315-ptr-as-ptr-v4-0-b2d72c14dc26@gmail.com
Changes in v4:
- Add missing SoB. (Benno Lossin)
- Use `without_provenance_mut` in alloc. (Boqun Feng)
- Limit strict provenance lints to the `kernel` crate to avoid complex
logic in the build system. This can be revisited on MSRV >= 1.84.0.
- Rebase on rust-next.
- Link to v3: https://lore.kernel.org/r/20250314-ptr-as-ptr-v3-0-e7ba61048f4a@gmail.com
Changes in v3:
- Fixed clippy warning in rust/kernel/firmware.rs. (kernel test robot)
Link: https://lore.kernel.org/all/202503120332.YTCpFEvv-lkp@intel.com/
- s/as u64/as bindings::phys_addr_t/g. (Benno Lossin)
- Use strict provenance APIs and enable lints. (Benno Lossin)
- Link to v2: https://lore.kernel.org/r/20250309-ptr-as-ptr-v2-0-25d60ad922b7@gmail.com
Changes in v2:
- Fixed typo in first commit message.
- Added additional patches, converted to series.
- Link to v1: https://lore.kernel.org/r/20250307-ptr-as-ptr-v1-1-582d06514c98@gmail.com
---
Tamir Duberstein (6):
rust: retain pointer mut-ness in `container_of!`
rust: enable `clippy::ptr_as_ptr` lint
rust: enable `clippy::ptr_cast_constness` lint
rust: enable `clippy::as_ptr_cast_mut` lint
rust: enable `clippy::as_underscore` lint
rust: enable `clippy::cast_lossless` lint
Makefile | 5 +++++
drivers/gpu/drm/drm_panic_qr.rs | 10 +++++-----
rust/bindings/lib.rs | 1 +
rust/kernel/alloc/allocator_test.rs | 2 +-
rust/kernel/alloc/kvec.rs | 4 ++--
rust/kernel/block/mq/operations.rs | 2 +-
rust/kernel/block/mq/request.rs | 7 ++++---
rust/kernel/device.rs | 5 +++--
rust/kernel/device_id.rs | 2 +-
rust/kernel/devres.rs | 19 ++++++++++---------
rust/kernel/dma.rs | 6 +++---
rust/kernel/error.rs | 2 +-
rust/kernel/firmware.rs | 3 ++-
rust/kernel/fs/file.rs | 2 +-
rust/kernel/io.rs | 18 +++++++++---------
rust/kernel/kunit.rs | 15 +++++++--------
rust/kernel/lib.rs | 5 ++---
rust/kernel/list/impl_list_item_mod.rs | 2 +-
rust/kernel/miscdevice.rs | 2 +-
rust/kernel/net/phy.rs | 4 ++--
rust/kernel/of.rs | 6 +++---
rust/kernel/pci.rs | 13 ++++++++-----
rust/kernel/platform.rs | 6 ++++--
rust/kernel/print.rs | 11 +++++------
rust/kernel/rbtree.rs | 23 ++++++++++-------------
rust/kernel/seq_file.rs | 3 ++-
rust/kernel/str.rs | 10 +++++-----
rust/kernel/sync/poll.rs | 2 +-
rust/kernel/workqueue.rs | 12 ++++++------
rust/uapi/lib.rs | 1 +
30 files changed, 107 insertions(+), 96 deletions(-)
---
base-commit: 28bb48c4cb34f65a9aa602142e76e1426da31293
change-id: 20250307-ptr-as-ptr-21b1867fc4d4
Best regards,
--
Tamir Duberstein <tamird(a)gmail.com>
This patchset add ftrace helpers functions and
add a new test makes sure that ftrace can trace
a function that was introduced by a livepatch.
Signed-off-by: Filipe Xavier <felipeaggger(a)gmail.com>
Suggested-by: Marcos Paulo de Souza <mpdesouza(a)suse.com>
Reviewed-by: Marcos Paulo de Souza <mpdesouza(a)suse.com>
---
Changes in v2:
- functions.sh: change check traced function to accept a list of functions.
- Link to v1: https://lore.kernel.org/r/20250306-ftrace-sftest-livepatch-v1-0-a6f1dfc30e1…
---
Filipe Xavier (2):
selftests: livepatch: add new ftrace helpers functions
selftests: livepatch: test if ftrace can trace a livepatched function
tools/testing/selftests/livepatch/functions.sh | 49 ++++++++++++++++++++++++
tools/testing/selftests/livepatch/test-ftrace.sh | 34 ++++++++++++++++
2 files changed, 83 insertions(+)
---
base-commit: 848e076317446f9c663771ddec142d7c2eb4cb43
change-id: 20250306-ftrace-sftest-livepatch-60d9dc472235
Best regards,
--
Filipe Xavier <felipeaggger(a)gmail.com>
v7: https://lore.kernel.org/netdev/20250227041209.2031104-1-almasrymina@google.…
===
Changelog:
- Check the dmabuf net_iov binding belongs to the device the TX is going
out on. (Jakub)
- Provide detailed inspection of callsites of
__skb_frag_ref/skb_page_unref in patch 2's changelog (Jakub)
v6: https://lore.kernel.org/netdev/20250222191517.743530-1-almasrymina@google.c…
===
v6 has no major changes. Addressed a few issues from Paolo and David,
and collected Acks from Stan. Thank you everyone for the review!
Changes:
- retain behavior to process MSG_FASTOPEN even if the provided cmsg is
invalid (Paolo).
- Rework the freeing of tx_vec slightly (it now has its own err label).
(Paolo).
- Squash the commit that makes dmabuf unbinding scheduled work into the
same one which implements the TX path so we don't run into future
errors on bisecting (Paolo).
- Fix/add comments to explain how dmabuf binding refcounting works
(David).
v5: https://lore.kernel.org/netdev/20250220020914.895431-1-almasrymina@google.c…
===
v5 has no major changes; it clears up the relatively minor issues
pointed out to in v4, and rebases the series on top of net-next to
resolve the conflict with a patch that raced to the tree. It also
collects the review tags from v4.
Changes:
- Rebase to net-next
- Fix issues in selftest (Stan).
- Address comments in the devmem and netmem driver docs (Stan and Bagas)
- Fix zerocopy_fill_skb_from_devmem return error code (Stan).
v4: https://lore.kernel.org/netdev/20250203223916.1064540-1-almasrymina@google.…
===
v4 mainly addresses the critical driver support issue surfaced in v3 by
Paolo and Stan. Drivers aiming to support netmem_tx should make sure not
to pass the netmem dma-addrs to the dma-mapping APIs, as these dma-addrs
may come from dma-bufs.
Additionally other feedback from v3 is addressed.
Major changes:
- Add helpers to handle netmem dma-addrs. Add GVE support for
netmem_tx.
- Fix binding->tx_vec not being freed on error paths during the
tx binding.
- Add a minimal devmem_tx test to devmem.py.
- Clean up everything obsolete from the cover letter (Paolo).
v3: https://patchwork.kernel.org/project/netdevbpf/list/?series=929401&state=*
===
Address minor comments from RFCv2 and fix a few build warnings and
ynl-regen issues. No major changes.
RFC v2: https://patchwork.kernel.org/project/netdevbpf/list/?series=920056&state=*
=======
RFC v2 addresses much of the feedback from RFC v1. I plan on sending
something close to this as net-next reopens, sending it slightly early
to get feedback if any.
Major changes:
--------------
- much improved UAPI as suggested by Stan. We now interpret the iov_base
of the passed in iov from userspace as the offset into the dmabuf to
send from. This removes the need to set iov.iov_base = NULL which may
be confusing to users, and enables us to send multiple iovs in the
same sendmsg() call. ncdevmem and the docs show a sample use of that.
- Removed the duplicate dmabuf iov_iter in binding->iov_iter. I think
this is good improvment as it was confusing to keep track of
2 iterators for the same sendmsg, and mistracking both iterators
caused a couple of bugs reported in the last iteration that are now
resolved with this streamlining.
- Improved test coverage in ncdevmem. Now multiple sendmsg() are tested,
and sending multiple iovs in the same sendmsg() is tested.
- Fixed issue where dmabuf unmapping was happening in invalid context
(Stan).
====================================================================
The TX path had been dropped from the Device Memory TCP patch series
post RFCv1 [1], to make that series slightly easier to review. This
series rebases the implementation of the TX path on top of the
net_iov/netmem framework agreed upon and merged. The motivation for
the feature is thoroughly described in the docs & cover letter of the
original proposal, so I don't repeat the lengthy descriptions here, but
they are available in [1].
Full outline on usage of the TX path is detailed in the documentation
included with this series.
Test example is available via the kselftest included in the series as well.
The series is relatively small, as the TX path for this feature largely
piggybacks on the existing MSG_ZEROCOPY implementation.
Patch Overview:
---------------
1. Documentation & tests to give high level overview of the feature
being added.
1. Add netmem refcounting needed for the TX path.
2. Devmem TX netlink API.
3. Devmem TX net stack implementation.
4. Make dma-buf unbinding scheduled work to handle TX cases where it gets
freed from contexts where we can't sleep.
5. Add devmem TX documentation.
6. Add scaffolding enabling driver support for netmem_tx. Add helpers, driver
feature flag, and docs to enable drivers to declare netmem_tx support.
7. Guard netmem_tx against being enabled against drivers that don't
support it.
8. Add devmem_tx selftests. Add TX path to ncdevmem and add a test to
devmem.py.
Testing:
--------
Testing is very similar to devmem TCP RX path. The ncdevmem test used
for the RX path is now augemented with client functionality to test TX
path.
* Test Setup:
Kernel: net-next with this RFC and memory provider API cherry-picked
locally.
Hardware: Google Cloud A3 VMs.
NIC: GVE with header split & RSS & flow steering support.
Performance results are not included with this version, unfortunately.
I'm having issues running the dma-buf exporter driver against the
upstream kernel on my test setup. The issues are specific to that
dma-buf exporter and do not affect this patch series. I plan to follow
up this series with perf fixes if the tests point to issues once they're
up and running.
Special thanks to Stan who took a stab at rebasing the TX implementation
on top of the netmem/net_iov framework merged. Parts of his proposal [2]
that are reused as-is are forked off into their own patches to give full
credit.
[1] https://lore.kernel.org/netdev/20240909054318.1809580-1-almasrymina@google.…
[2] https://lore.kernel.org/netdev/20240913150913.1280238-2-sdf@fomichev.me/T/#…
Cc: sdf(a)fomichev.me
Cc: asml.silence(a)gmail.com
Cc: dw(a)davidwei.uk
Cc: Jamal Hadi Salim <jhs(a)mojatatu.com>
Cc: Victor Nogueira <victor(a)mojatatu.com>
Cc: Pedro Tammela <pctammela(a)mojatatu.com>
Cc: Samiullah Khawaja <skhawaja(a)google.com>
Mina Almasry (8):
netmem: add niov->type attribute to distinguish different net_iov
types
net: add get_netmem/put_netmem support
net: devmem: Implement TX path
net: add devmem TCP TX documentation
net: enable driver support for netmem TX
gve: add netmem TX support to GVE DQO-RDA mode
net: check for driver support in netmem TX
selftests: ncdevmem: Implement devmem TCP TX
Stanislav Fomichev (1):
net: devmem: TCP tx netlink api
Documentation/netlink/specs/netdev.yaml | 12 +
Documentation/networking/devmem.rst | 150 ++++++++-
.../networking/net_cachelines/net_device.rst | 1 +
Documentation/networking/netdev-features.rst | 5 +
Documentation/networking/netmem.rst | 23 +-
drivers/net/ethernet/google/gve/gve_main.c | 4 +
drivers/net/ethernet/google/gve/gve_tx_dqo.c | 8 +-
include/linux/netdevice.h | 2 +
include/linux/skbuff.h | 17 +-
include/linux/skbuff_ref.h | 4 +-
include/net/netmem.h | 38 ++-
include/net/sock.h | 1 +
include/uapi/linux/netdev.h | 1 +
net/core/datagram.c | 48 ++-
net/core/dev.c | 33 ++
net/core/devmem.c | 118 ++++++-
net/core/devmem.h | 83 ++++-
net/core/netdev-genl-gen.c | 13 +
net/core/netdev-genl-gen.h | 1 +
net/core/netdev-genl.c | 73 ++++-
net/core/skbuff.c | 48 ++-
net/core/sock.c | 6 +
net/ipv4/ip_output.c | 3 +-
net/ipv4/tcp.c | 50 ++-
net/ipv6/ip6_output.c | 3 +-
net/vmw_vsock/virtio_transport_common.c | 5 +-
tools/include/uapi/linux/netdev.h | 1 +
.../selftests/drivers/net/hw/devmem.py | 26 +-
.../selftests/drivers/net/hw/ncdevmem.c | 300 +++++++++++++++++-
29 files changed, 1002 insertions(+), 75 deletions(-)
base-commit: 8ef890df4031121a94407c84659125cbccd3fdbe
--
2.49.0.rc0.332.g42c0ae87b1-goog
The test_rss_context_dump() test assumes the indirection table is always
supported, which is not true for all drivers, e.g., virtio_net when
VIRTIO_NET_F_RSS is disabled.
Skip the check if 'indir' is not present.
Reviewed-by: Nimrod Oren <noren(a)nvidia.com>
Signed-off-by: Gal Pressman <gal(a)nvidia.com>
---
tools/testing/selftests/drivers/net/hw/rss_ctx.py | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/tools/testing/selftests/drivers/net/hw/rss_ctx.py b/tools/testing/selftests/drivers/net/hw/rss_ctx.py
index d6e69d7d5e43..ca60ae325c22 100755
--- a/tools/testing/selftests/drivers/net/hw/rss_ctx.py
+++ b/tools/testing/selftests/drivers/net/hw/rss_ctx.py
@@ -392,7 +392,7 @@ def test_rss_context_dump(cfg):
# Sanity-check the results
for data in ctxs:
- ksft_ne(set(data['indir']), {0}, "indir table is all zero")
+ ksft_ne(set(data.get('indir', [1])), {0}, "indir table is all zero")
ksft_ne(set(data.get('hkey', [1])), {0}, "key is all zero")
# More specific checks
--
2.40.1
virtio-net have two usage of hashes: one is RSS and another is hash
reporting. Conventionally the hash calculation was done by the VMM.
However, computing the hash after the queue was chosen defeats the
purpose of RSS.
Another approach is to use eBPF steering program. This approach has
another downside: it cannot report the calculated hash due to the
restrictive nature of eBPF.
Introduce the code to compute hashes to the kernel in order to overcome
thse challenges.
An alternative solution is to extend the eBPF steering program so that it
will be able to report to the userspace, but it is based on context
rewrites, which is in feature freeze. We can adopt kfuncs, but they will
not be UAPIs. We opt to ioctl to align with other relevant UAPIs (KVM
and vhost_net).
The patches for QEMU to use this new feature was submitted as RFC and
is available at:
https://patchew.org/QEMU/20250313-hash-v4-0-c75c494b495e@daynix.com/
This work was presented at LPC 2024:
https://lpc.events/event/18/contributions/1963/
V1 -> V2:
Changed to introduce a new BPF program type.
Signed-off-by: Akihiko Odaki <akihiko.odaki(a)daynix.com>
---
Changes in v11:
- Added the missing code to free vnet_hash in patch
"tap: Introduce virtio-net hash feature".
- Link to v10: https://lore.kernel.org/r/20250313-rss-v10-0-3185d73a9af0@daynix.com
Changes in v10:
- Split common code and TUN/TAP-specific code into separate patches.
- Reverted a spurious style change in patch "tun: Introduce virtio-net
hash feature".
- Added a comment explaining disable_ipv6 in tests.
- Used AF_PACKET for patch "selftest: tun: Add tests for
virtio-net hashing". I also added the usage of FIXTURE_VARIANT() as
the testing function now needs access to more variant-specific
variables.
- Corrected the message of patch "selftest: tun: Add tests for
virtio-net hashing"; it mentioned validation of configuration but
it is not scope of this patch.
- Expanded the description of patch "selftest: tun: Add tests for
virtio-net hashing".
- Added patch "tun: Allow steering eBPF program to fall back".
- Changed to handle TUNGETVNETHASHCAP before taking the rtnl lock.
- Removed redundant tests for tun_vnet_ioctl().
- Added patch "selftest: tap: Add tests for virtio-net ioctls".
- Added a design explanation of ioctls for extensibility and migration.
- Removed a few branches in patch
"vhost/net: Support VIRTIO_NET_F_HASH_REPORT".
- Link to v9: https://lore.kernel.org/r/20250307-rss-v9-0-df76624025eb@daynix.com
Changes in v9:
- Added a missing return statement in patch
"tun: Introduce virtio-net hash feature".
- Link to v8: https://lore.kernel.org/r/20250306-rss-v8-0-7ab4f56ff423@daynix.com
Changes in v8:
- Disabled IPv6 to eliminate noises in tests.
- Added a branch in tap to avoid unnecessary dissection when hash
reporting is disabled.
- Removed unnecessary rtnl_lock().
- Extracted code to handle new ioctls into separate functions to avoid
adding extra NULL checks to the code handling other ioctls.
- Introduced variable named "fd" to __tun_chr_ioctl().
- s/-/=/g in a patch message to avoid confusing Git.
- Link to v7: https://lore.kernel.org/r/20250228-rss-v7-0-844205cbbdd6@daynix.com
Changes in v7:
- Ensured to set hash_report to VIRTIO_NET_HASH_REPORT_NONE for
VHOST_NET_F_VIRTIO_NET_HDR.
- s/4/sizeof(u32)/ in patch "virtio_net: Add functions for hashing".
- Added tap_skb_cb type.
- Rebased.
- Link to v6: https://lore.kernel.org/r/20250109-rss-v6-0-b1c90ad708f6@daynix.com
Changes in v6:
- Extracted changes to fill vnet header holes into another series.
- Squashed patches "skbuff: Introduce SKB_EXT_TUN_VNET_HASH", "tun:
Introduce virtio-net hash reporting feature", and "tun: Introduce
virtio-net RSS" into patch "tun: Introduce virtio-net hash feature".
- Dropped the RFC tag.
- Link to v5: https://lore.kernel.org/r/20241008-rss-v5-0-f3cf68df005d@daynix.com
Changes in v5:
- Fixed a compilation error with CONFIG_TUN_VNET_CROSS_LE.
- Optimized the calculation of the hash value according to:
https://git.dpdk.org/dpdk/commit/?id=3fb1ea032bd6ff8317af5dac9af901f1f324ca…
- Added patch "tun: Unify vnet implementation".
- Dropped patch "tap: Pad virtio header with zero".
- Added patch "selftest: tun: Test vnet ioctls without device".
- Reworked selftests to skip for older kernels.
- Documented the case when the underlying device is deleted and packets
have queue_mapping set by TC.
- Reordered test harness arguments.
- Added code to handle fragmented packets.
- Link to v4: https://lore.kernel.org/r/20240924-rss-v4-0-84e932ec0e6c@daynix.com
Changes in v4:
- Moved tun_vnet_hash_ext to if_tun.h.
- Renamed virtio_net_toeplitz() to virtio_net_toeplitz_calc().
- Replaced htons() with cpu_to_be16().
- Changed virtio_net_hash_rss() to return void.
- Reordered variable declarations in virtio_net_hash_rss().
- Removed virtio_net_hdr_v1_hash_from_skb().
- Updated messages of "tap: Pad virtio header with zero" and
"tun: Pad virtio header with zero".
- Fixed vnet_hash allocation size.
- Ensured to free vnet_hash when destructing tun_struct.
- Link to v3: https://lore.kernel.org/r/20240915-rss-v3-0-c630015db082@daynix.com
Changes in v3:
- Reverted back to add ioctl.
- Split patch "tun: Introduce virtio-net hashing feature" into
"tun: Introduce virtio-net hash reporting feature" and
"tun: Introduce virtio-net RSS".
- Changed to reuse hash values computed for automq instead of performing
RSS hashing when hash reporting is requested but RSS is not.
- Extracted relevant data from struct tun_struct to keep it minimal.
- Added kernel-doc.
- Changed to allow calling TUNGETVNETHASHCAP before TUNSETIFF.
- Initialized num_buffers with 1.
- Added a test case for unclassified packets.
- Fixed error handling in tests.
- Changed tests to verify that the queue index will not overflow.
- Rebased.
- Link to v2: https://lore.kernel.org/r/20231015141644.260646-1-akihiko.odaki@daynix.com
---
Akihiko Odaki (10):
virtio_net: Add functions for hashing
net: flow_dissector: Export flow_keys_dissector_symmetric
tun: Allow steering eBPF program to fall back
tun: Add common virtio-net hash feature code
tun: Introduce virtio-net hash feature
tap: Introduce virtio-net hash feature
selftest: tun: Test vnet ioctls without device
selftest: tun: Add tests for virtio-net hashing
selftest: tap: Add tests for virtio-net ioctls
vhost/net: Support VIRTIO_NET_F_HASH_REPORT
Documentation/networking/tuntap.rst | 7 +
drivers/net/Kconfig | 1 +
drivers/net/ipvlan/ipvtap.c | 2 +-
drivers/net/macvtap.c | 2 +-
drivers/net/tap.c | 78 +++++-
drivers/net/tun.c | 90 +++++--
drivers/net/tun_vnet.h | 155 ++++++++++-
drivers/vhost/net.c | 68 ++---
include/linux/if_tap.h | 4 +-
include/linux/skbuff.h | 3 +
include/linux/virtio_net.h | 188 ++++++++++++++
include/net/flow_dissector.h | 1 +
include/uapi/linux/if_tun.h | 82 ++++++
net/core/flow_dissector.c | 3 +-
net/core/skbuff.c | 4 +
tools/testing/selftests/net/Makefile | 2 +-
tools/testing/selftests/net/tap.c | 97 ++++++-
tools/testing/selftests/net/tun.c | 491 ++++++++++++++++++++++++++++++++++-
18 files changed, 1194 insertions(+), 84 deletions(-)
---
base-commit: dd83757f6e686a2188997cb58b5975f744bb7786
change-id: 20240403-rss-e737d89efa77
prerequisite-change-id: 20241230-tun-66e10a49b0c7:v6
prerequisite-patch-id: 871dc5f146fb6b0e3ec8612971a8e8190472c0fb
prerequisite-patch-id: 2797ed249d32590321f088373d4055ff3f430a0e
prerequisite-patch-id: ea3370c72d4904e2f0536ec76ba5d26784c0cede
prerequisite-patch-id: 837e4cf5d6b451424f9b1639455e83a260c4440d
prerequisite-patch-id: ea701076f57819e844f5a35efe5cbc5712d3080d
prerequisite-patch-id: 701646fb43ad04cc64dd2bf13c150ccbe6f828ce
prerequisite-patch-id: 53176dae0c003f5b6c114d43f936cf7140d31bb5
prerequisite-change-id: 20250116-buffers-96e14bf023fc:v2
prerequisite-patch-id: 25fd4f99d4236a05a5ef16ab79f3e85ee57e21cc
Best regards,
--
Akihiko Odaki <akihiko.odaki(a)daynix.com>
The SBI Firmware Feature extension allows the S-mode to request some
specific features (either hardware or software) to be enabled. This
series uses this extension to request misaligned access exception
delegation to S-mode in order to let the kernel handle it. It also adds
support for the KVM FWFT SBI extension based on the misaligned access
handling infrastructure.
FWFT SBI extension is part of the SBI V3.0 specifications [1]. It can be
tested using the qemu provided at [2] which contains the series from
[3]. kvm-unit-tests [4] can be used inside kvm to tests the correct
delegation of misaligned exceptions. Upstream OpenSBI can be used.
Note: Since SBI V3.0 is not yet ratified, FWFT extension API is split
between interface only and implementation, allowing to pick only the
interface which do not have hard dependencies on SBI.
The tests can be run using the included kselftest:
$ qemu-system-riscv64 \
-cpu rv64,trap-misaligned-access=true,v=true \
-M virt \
-m 1024M \
-bios fw_dynamic.bin \
-kernel Image
...
# ./misaligned
TAP version 13
1..23
# Starting 23 tests from 1 test cases.
# RUN global.gp_load_lh ...
# OK global.gp_load_lh
ok 1 global.gp_load_lh
# RUN global.gp_load_lhu ...
# OK global.gp_load_lhu
ok 2 global.gp_load_lhu
# RUN global.gp_load_lw ...
# OK global.gp_load_lw
ok 3 global.gp_load_lw
# RUN global.gp_load_lwu ...
# OK global.gp_load_lwu
ok 4 global.gp_load_lwu
# RUN global.gp_load_ld ...
# OK global.gp_load_ld
ok 5 global.gp_load_ld
# RUN global.gp_load_c_lw ...
# OK global.gp_load_c_lw
ok 6 global.gp_load_c_lw
# RUN global.gp_load_c_ld ...
# OK global.gp_load_c_ld
ok 7 global.gp_load_c_ld
# RUN global.gp_load_c_ldsp ...
# OK global.gp_load_c_ldsp
ok 8 global.gp_load_c_ldsp
# RUN global.gp_load_sh ...
# OK global.gp_load_sh
ok 9 global.gp_load_sh
# RUN global.gp_load_sw ...
# OK global.gp_load_sw
ok 10 global.gp_load_sw
# RUN global.gp_load_sd ...
# OK global.gp_load_sd
ok 11 global.gp_load_sd
# RUN global.gp_load_c_sw ...
# OK global.gp_load_c_sw
ok 12 global.gp_load_c_sw
# RUN global.gp_load_c_sd ...
# OK global.gp_load_c_sd
ok 13 global.gp_load_c_sd
# RUN global.gp_load_c_sdsp ...
# OK global.gp_load_c_sdsp
ok 14 global.gp_load_c_sdsp
# RUN global.fpu_load_flw ...
# OK global.fpu_load_flw
ok 15 global.fpu_load_flw
# RUN global.fpu_load_fld ...
# OK global.fpu_load_fld
ok 16 global.fpu_load_fld
# RUN global.fpu_load_c_fld ...
# OK global.fpu_load_c_fld
ok 17 global.fpu_load_c_fld
# RUN global.fpu_load_c_fldsp ...
# OK global.fpu_load_c_fldsp
ok 18 global.fpu_load_c_fldsp
# RUN global.fpu_store_fsw ...
# OK global.fpu_store_fsw
ok 19 global.fpu_store_fsw
# RUN global.fpu_store_fsd ...
# OK global.fpu_store_fsd
ok 20 global.fpu_store_fsd
# RUN global.fpu_store_c_fsd ...
# OK global.fpu_store_c_fsd
ok 21 global.fpu_store_c_fsd
# RUN global.fpu_store_c_fsdsp ...
# OK global.fpu_store_c_fsdsp
ok 22 global.fpu_store_c_fsdsp
# RUN global.gen_sigbus ...
[12797.988647] misaligned[618]: unhandled signal 7 code 0x1 at 0x0000000000014dc0 in misaligned[4dc0,10000+76000]
[12797.988990] CPU: 0 UID: 0 PID: 618 Comm: misaligned Not tainted 6.13.0-rc6-00008-g4ec4468967c9-dirty #51
[12797.989169] Hardware name: riscv-virtio,qemu (DT)
[12797.989264] epc : 0000000000014dc0 ra : 0000000000014d00 sp : 00007fffe165d100
[12797.989407] gp : 000000000008f6e8 tp : 0000000000095760 t0 : 0000000000000008
[12797.989544] t1 : 00000000000965d8 t2 : 000000000008e830 s0 : 00007fffe165d160
[12797.989692] s1 : 000000000000001a a0 : 0000000000000000 a1 : 0000000000000002
[12797.989831] a2 : 0000000000000000 a3 : 0000000000000000 a4 : ffffffffdeadbeef
[12797.989964] a5 : 000000000008ef61 a6 : 626769735f6e0000 a7 : fffffffffffff000
[12797.990094] s2 : 0000000000000001 s3 : 00007fffe165d838 s4 : 00007fffe165d848
[12797.990238] s5 : 000000000000001a s6 : 0000000000010442 s7 : 0000000000010200
[12797.990391] s8 : 000000000000003a s9 : 0000000000094508 s10: 0000000000000000
[12797.990526] s11: 0000555567460668 t3 : 00007fffe165d070 t4 : 00000000000965d0
[12797.990656] t5 : fefefefefefefeff t6 : 0000000000000073
[12797.990756] status: 0000000200004020 badaddr: 000000000008ef61 cause: 0000000000000006
[12797.990911] Code: 8793 8791 3423 fcf4 3783 fc84 c737 dead 0713 eef7 (c398) 0001
# OK global.gen_sigbus
ok 23 global.gen_sigbus
# PASSED: 23 / 23 tests passed.
# Totals: pass:23 fail:0 xfail:0 xpass:0 skip:0 error:0
With kvm-tools:
# lkvm run -k sbi.flat -m 128
Info: # lkvm run -k sbi.flat -m 128 -c 1 --name guest-97
Info: Removed ghost socket file "/root/.lkvm//guest-97.sock".
##########################################################################
# kvm-unit-tests
##########################################################################
... [test messages elided]
PASS: sbi: fwft: FWFT extension probing no error
PASS: sbi: fwft: get/set reserved feature 0x6 error == SBI_ERR_DENIED
PASS: sbi: fwft: get/set reserved feature 0x3fffffff error == SBI_ERR_DENIED
PASS: sbi: fwft: get/set reserved feature 0x80000000 error == SBI_ERR_DENIED
PASS: sbi: fwft: get/set reserved feature 0xbfffffff error == SBI_ERR_DENIED
PASS: sbi: fwft: misaligned_deleg: Get misaligned deleg feature no error
PASS: sbi: fwft: misaligned_deleg: Set misaligned deleg feature invalid value error
PASS: sbi: fwft: misaligned_deleg: Set misaligned deleg feature invalid value error
PASS: sbi: fwft: misaligned_deleg: Set misaligned deleg feature value no error
PASS: sbi: fwft: misaligned_deleg: Set misaligned deleg feature value 0
PASS: sbi: fwft: misaligned_deleg: Set misaligned deleg feature value no error
PASS: sbi: fwft: misaligned_deleg: Set misaligned deleg feature value 1
PASS: sbi: fwft: misaligned_deleg: Verify misaligned load exception trap in supervisor
SUMMARY: 50 tests, 2 unexpected failures, 12 skipped
This series is available at [6].
Link: https://github.com/riscv-non-isa/riscv-sbi-doc/releases/download/vv3.0-rc2/… [1]
Link: https://github.com/rivosinc/qemu/tree/dev/cleger/misaligned [2]
Link: https://lore.kernel.org/all/20241211211933.198792-3-fkonrad@amd.com/T/ [3]
Link: https://github.com/clementleger/kvm-unit-tests/tree/dev/cleger/fwft_v1 [4]
Link: https://github.com/clementleger/unaligned_test [5]
Link: https://github.com/rivosinc/linux/tree/dev/cleger/fwft_v1 [6]
---
V4:
- Check SBI version 3.0 instead of 2.0 for FWFT presence
- Use long for kvm_sbi_fwft operation return value
- Init KVM sbi extension even if default_disabled
- Remove revert_on_fail parameter for sbi_fwft_feature_set().
- Fix comments for sbi_fwft_set/get()
- Only handle local features (there are no globals yet in the spec)
- Add new SBI errors to sbi_err_map_linux_errno()
V3:
- Added comment about kvm sbi fwft supported/set/get callback
requirements
- Move struct kvm_sbi_fwft_feature in kvm_sbi_fwft.c
- Add a FWFT interface
V2:
- Added Kselftest for misaligned testing
- Added get_user() usage instead of __get_user()
- Reenable interrupt when possible in misaligned access handling
- Document that riscv supports unaligned-traps
- Fix KVM extension state when an init function is present
- Rework SBI misaligned accesses trap delegation code
- Added support for CPU hotplugging
- Added KVM SBI reset callback
- Added reset for KVM SBI FWFT lock
- Return SBI_ERR_DENIED_LOCKED when LOCK flag is set
Clément Léger (18):
riscv: add Firmware Feature (FWFT) SBI extensions definitions
riscv: sbi: add new SBI error mappings
riscv: sbi: add FWFT extension interface
riscv: sbi: add SBI FWFT extension calls
riscv: misaligned: request misaligned exception from SBI
riscv: misaligned: use on_each_cpu() for scalar misaligned access
probing
riscv: misaligned: use correct CONFIG_ ifdef for
misaligned_access_speed
riscv: misaligned: move emulated access uniformity check in a function
riscv: misaligned: add a function to check misalign trap delegability
riscv: misaligned: factorize trap handling
riscv: misaligned: enable IRQs while handling misaligned accesses
riscv: misaligned: use get_user() instead of __get_user()
Documentation/sysctl: add riscv to unaligned-trap supported archs
selftests: riscv: add misaligned access testing
RISC-V: KVM: add SBI extension init()/deinit() functions
RISC-V: KVM: add SBI extension reset callback
RISC-V: KVM: add support for FWFT SBI extension
RISC-V: KVM: add support for SBI_FWFT_MISALIGNED_DELEG
Documentation/admin-guide/sysctl/kernel.rst | 4 +-
arch/riscv/include/asm/cpufeature.h | 8 +-
arch/riscv/include/asm/kvm_host.h | 5 +-
arch/riscv/include/asm/kvm_vcpu_sbi.h | 12 +
arch/riscv/include/asm/kvm_vcpu_sbi_fwft.h | 29 ++
arch/riscv/include/asm/sbi.h | 62 +++++
arch/riscv/include/uapi/asm/kvm.h | 1 +
arch/riscv/kernel/sbi.c | 95 +++++++
arch/riscv/kernel/traps.c | 57 ++--
arch/riscv/kernel/traps_misaligned.c | 118 +++++++-
arch/riscv/kernel/unaligned_access_speed.c | 11 +-
arch/riscv/kvm/Makefile | 1 +
arch/riscv/kvm/vcpu.c | 7 +-
arch/riscv/kvm/vcpu_sbi.c | 54 ++++
arch/riscv/kvm/vcpu_sbi_fwft.c | 252 +++++++++++++++++
arch/riscv/kvm/vcpu_sbi_sta.c | 3 +-
.../selftests/riscv/misaligned/.gitignore | 1 +
.../selftests/riscv/misaligned/Makefile | 12 +
.../selftests/riscv/misaligned/common.S | 33 +++
.../testing/selftests/riscv/misaligned/fpu.S | 180 +++++++++++++
tools/testing/selftests/riscv/misaligned/gp.S | 103 +++++++
.../selftests/riscv/misaligned/misaligned.c | 254 ++++++++++++++++++
22 files changed, 1255 insertions(+), 47 deletions(-)
create mode 100644 arch/riscv/include/asm/kvm_vcpu_sbi_fwft.h
create mode 100644 arch/riscv/kvm/vcpu_sbi_fwft.c
create mode 100644 tools/testing/selftests/riscv/misaligned/.gitignore
create mode 100644 tools/testing/selftests/riscv/misaligned/Makefile
create mode 100644 tools/testing/selftests/riscv/misaligned/common.S
create mode 100644 tools/testing/selftests/riscv/misaligned/fpu.S
create mode 100644 tools/testing/selftests/riscv/misaligned/gp.S
create mode 100644 tools/testing/selftests/riscv/misaligned/misaligned.c
--
2.47.2
Replacing all occurrences of `addr_of!(place)` with
`&raw const place`.
This will allow us to reduce macro complexity, and improve consistency
with existing reference syntax as `&raw const` is similar to `&` making
it fit more naturally with other existing code.
Suggested-by: Benno Lossin <benno.lossin(a)proton.me>
Link: https://github.com/Rust-for-Linux/linux/issues/1148
Signed-off-by: Antonio Hickey <contact(a)antoniohickey.com>
---
rust/kernel/kunit.rs | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/rust/kernel/kunit.rs b/rust/kernel/kunit.rs
index 824da0e9738a..a17ef3b2e860 100644
--- a/rust/kernel/kunit.rs
+++ b/rust/kernel/kunit.rs
@@ -128,9 +128,9 @@ unsafe impl Sync for UnaryAssert {}
unsafe {
$crate::bindings::__kunit_do_failed_assertion(
kunit_test,
- core::ptr::addr_of!(LOCATION.0),
+ &raw const LOCATION.0,
$crate::bindings::kunit_assert_type_KUNIT_ASSERTION,
- core::ptr::addr_of!(ASSERTION.0.assert),
+ &raw const ASSERTION.0.assert,
Some($crate::bindings::kunit_unary_assert_format),
core::ptr::null(),
);
This patch set convert iptables to nftables for wireguard testing, as
iptables is deparated and nftables is the default framework of most releases.
v3: drop iptables directly (Jason A. Donenfeld)
Also convert to using nft for qemu testing (Jason A. Donenfeld)
v2: use one nft table for testing (Phil Sutter)
Hangbin Liu (2):
selftests: wireguards: convert iptables to nft
selftests: wireguard: update to using nft for qemu test
tools/testing/selftests/wireguard/netns.sh | 29 +++++++++-----
.../testing/selftests/wireguard/qemu/Makefile | 40 ++++++++++++++-----
.../selftests/wireguard/qemu/kernel.config | 7 ++--
3 files changed, 53 insertions(+), 23 deletions(-)
--
2.46.0
Greetings:
Welcome to the RFC.
Currently, when a user app uses sendfile the user app has no way to know
if the bytes were transmit; sendfile simply returns, but it is possible
that a slow client on the other side may take time to receive and ACK
the bytes. In the meantime, the user app which called sendfile has no
way to know whether it can overwrite the data on disk that it just
sendfile'd.
One way to fix this is to add zerocopy notifications to sendfile similar
to how MSG_ZEROCOPY works with sendmsg. This is possible thanks to the
extensive work done by Pavel [1].
To support this, two important user ABI changes are proposed:
- A new splice flag, SPLICE_F_ZC, which allows users to signal that
splice should generate zerocopy notifications if possible.
- A new system call, sendfile2, which is similar to sendfile64 except
that it takes an additional argument, flags, which allows the user
to specify either a "regular" sendfile or a sendfile with zerocopy
notifications enabled.
In either case, user apps can read notifications from the error queue
(like they would with MSG_ZEROCOPY) to determine when their call to
sendfile has completed.
I tested this RFC using the selftest modified in the last patch and also
by using the selftest between two different physical hosts:
# server
./msg_zerocopy -4 -i eth0 -t 2 -v -r tcp
# client (does the sendfiling)
dd if=/dev/zero of=sendfile_data bs=1M count=8
./msg_zerocopy -4 -i eth0 -D $SERVER_IP -v -l 1 -t 2 -z -f sendfile_data tcp
I would love to get high level feedback from folks on a few things:
- Is this functionality, at a high level, something that would be
desirable / useful? I think so, but I'm of course I am biased ;)
- Is this approach generally headed in the right direction? Are the
proposed user ABI changes reasonable?
If the above two points are generally agreed upon then I'd welcome
feedback on the patches themselves :)
This is kind of a net thing, but also kind of a splice thing so hope I
am sending this to right places to get appropriate feedback. I based my
code on the vfs/for-next tree, but am happy to rebase on another tree if
desired. The cc-list got a little out of control, so I manually trimmed
it down quite a bit; sorry if I missed anyone I should have CC'd in the
process.
Thanks,
Joe
[1]: https://lore.kernel.org/netdev/cover.1657643355.git.asml.silence@gmail.com/
Joe Damato (10):
splice: Add ubuf_info to prepare for ZC
splice: Add helper that passes through splice_desc
splice: Factor splice_socket into a helper
splice: Add SPLICE_F_ZC and attach ubuf
fs: Add splice_write_sd to file operations
fs: Extend do_sendfile to take a flags argument
fs: Add sendfile2 which accepts a flags argument
fs: Add sendfile flags for sendfile2
fs: Add sendfile2 syscall
selftests: Add sendfile zerocopy notification test
arch/alpha/kernel/syscalls/syscall.tbl | 1 +
arch/arm/tools/syscall.tbl | 1 +
arch/arm64/tools/syscall_32.tbl | 1 +
arch/m68k/kernel/syscalls/syscall.tbl | 1 +
arch/microblaze/kernel/syscalls/syscall.tbl | 1 +
arch/mips/kernel/syscalls/syscall_n32.tbl | 1 +
arch/mips/kernel/syscalls/syscall_n64.tbl | 1 +
arch/mips/kernel/syscalls/syscall_o32.tbl | 1 +
arch/parisc/kernel/syscalls/syscall.tbl | 1 +
arch/powerpc/kernel/syscalls/syscall.tbl | 1 +
arch/s390/kernel/syscalls/syscall.tbl | 1 +
arch/sh/kernel/syscalls/syscall.tbl | 1 +
arch/sparc/kernel/syscalls/syscall.tbl | 1 +
arch/x86/entry/syscalls/syscall_32.tbl | 1 +
arch/x86/entry/syscalls/syscall_64.tbl | 1 +
arch/xtensa/kernel/syscalls/syscall.tbl | 1 +
fs/read_write.c | 40 +++++++---
fs/splice.c | 87 +++++++++++++++++----
include/linux/fs.h | 2 +
include/linux/sendfile.h | 10 +++
include/linux/splice.h | 7 +-
include/linux/syscalls.h | 2 +
include/uapi/asm-generic/unistd.h | 4 +-
net/socket.c | 1 +
scripts/syscall.tbl | 1 +
tools/testing/selftests/net/msg_zerocopy.c | 54 ++++++++++++-
tools/testing/selftests/net/msg_zerocopy.sh | 5 ++
27 files changed, 200 insertions(+), 29 deletions(-)
create mode 100644 include/linux/sendfile.h
base-commit: 2e72b1e0aac24a12f3bf3eec620efaca7ab7d4de
--
2.43.0
This is one of just 3 remaining "Test Module" kselftests (the others
being printf and scanf), the rest having been converted to KUnit.
I tested this using:
$ tools/testing/kunit/kunit.py run --arch arm64 --make_options LLVM=1 bitmap.
I've already sent out a conversion series for each of printf[0] and scanf[1].
There was a previous attempt[2] to do this in July 2024. Please bear
with me as I try to understand and address the objections from that
time. I've spoken with Muhammad Usama Anjum, the author of that series,
and received their approval to "take over" this work. Here we go...
On 7/26/24 11:45 PM, John Hubbard wrote:
>
> This changes the situation from "works for Linus' tab completion
> case", to "causes a tab completion problem"! :)
>
> I think a tests/ subdir is how we eventually decided to do this [1],
> right?
>
> So:
>
> lib/tests/bitmap_kunit.c
>
> [1] https://lore.kernel.org/20240724201354.make.730-kees@kernel.org
This is true and unfortunate, but not trivial to fix because new
kallsyms tests were placed in lib/tests in commit 84b4a51fce4c
("selftests: add new kallsyms selftests") *after* the KUnit filename
best practices were adopted.
I propose that the KUnit maintainers blaze this trail using
`string_kunit.c` which currently still lives in lib/ despite the KUnit
docs giving it as an example at lib/tests/.
On 7/27/24 12:24 AM, Shuah Khan wrote:
>
> This change will take away the ability to run bitmap tests during
> boot on a non-kunit kernel.
>
> Nack on this change. I wan to see all tests that are being removed
> from lib because they have been converted - also it doesn't make
> sense to convert some tests like this one that add the ability test
> during boot.
This point was also discussed in another thread[3] in which:
On 7/27/24 12:35 AM, Shuah Khan wrote:
>
> Please make sure you aren't taking away the ability to run these tests during
> boot.
>
> It doesn't make sense to convert every single test especially when it
> is intended to be run during boot without dependencies - not as a kunit test
> but a regression test during boot.
>
> bitmap is one example - pay attention to the config help test - bitmap
> one clearly states it runs regression testing during boot. Any test that
> says that isn't a candidate for conversion.
>
> I am going to nack any such conversions.
The crux of the argument seems to be that the config help text is taken
to describe the author's intent with the fragment "at boot". I think
this may be a case of confirmation bias: I see at least the following
KUnit tests with "at boot" in their help text:
- CPUMASK_KUNIT_TEST
- BITFIELD_KUNIT
- CHECKSUM_KUNIT
- UTIL_MACROS_KUNIT
It seems to me that the inference being made is that any test that runs
"at boot" is intended to be run by both developers and users, but I find
no evidence that bitmap in particular would ever provide additional
value when run by users.
There's further discussion about KUnit not being "ideal for cases where
people would want to check a subsystem on a running kernel", but I find
no evidence that bitmap in particular is actually testing the running
kernel; it is a unit test of the bitmap functions, which is also stated
in the config help text.
David Gow made many of the same points in his final reply[4], which was
never replied to.
Link: https://lore.kernel.org/all/20250207-printf-kunit-convert-v2-0-057b23860823… [0]
Link: https://lore.kernel.org/all/20250207-scanf-kunit-convert-v4-0-a23e2afaede8@… [1]
Link: https://lore.kernel.org/all/20240726110658.2281070-1-usama.anjum@collabora.… [2]
Link: https://lore.kernel.org/all/327831fb-47ab-4555-8f0b-19a8dbcaacd7@collabora.… [3]
Link: https://lore.kernel.org/all/CABVgOSmMoPD3JfzVd4VTkzGL2fZCo8LfwzaVSzeFimPrhg… [4]
Thanks for your attention.
Signed-off-by: Tamir Duberstein <tamird(a)gmail.com>
---
Tamir Duberstein (3):
bitmap: remove _check_eq_u32_array
bitmap: convert self-test to KUnit
bitmap: break kunit into test cases
MAINTAINERS | 2 +-
arch/m68k/configs/amiga_defconfig | 1 -
arch/m68k/configs/apollo_defconfig | 1 -
arch/m68k/configs/atari_defconfig | 1 -
arch/m68k/configs/bvme6000_defconfig | 1 -
arch/m68k/configs/hp300_defconfig | 1 -
arch/m68k/configs/mac_defconfig | 1 -
arch/m68k/configs/multi_defconfig | 1 -
arch/m68k/configs/mvme147_defconfig | 1 -
arch/m68k/configs/mvme16x_defconfig | 1 -
arch/m68k/configs/q40_defconfig | 1 -
arch/m68k/configs/sun3_defconfig | 1 -
arch/m68k/configs/sun3x_defconfig | 1 -
arch/powerpc/configs/ppc64_defconfig | 1 -
lib/Kconfig.debug | 24 +-
lib/Makefile | 2 +-
lib/{test_bitmap.c => bitmap_kunit.c} | 454 +++++++++++++---------------------
tools/testing/selftests/lib/bitmap.sh | 3 -
tools/testing/selftests/lib/config | 1 -
19 files changed, 195 insertions(+), 304 deletions(-)
---
base-commit: 2014c95afecee3e76ca4a56956a936e23283f05b
change-id: 20250207-bitmap-kunit-convert-92d3147b2eee
Best regards,
--
Tamir Duberstein <tamird(a)gmail.com>
Replacing all occurrences of `addr_of!(place)` and `addr_of_mut!(place)`
with `&raw const place` and `&raw mut place` respectively.
This will allow us to reduce macro complexity, and improve consistency
with existing reference syntax as `&raw const`, `&raw mut` are similar
to `&`, `&mut` making it fit more naturally with other existing code.
Suggested-by: Benno Lossin <benno.lossin(a)proton.me>
Link: https://github.com/Rust-for-Linux/linux/issues/1148
Signed-off-by: Antonio Hickey <contact(a)antoniohickey.com>
---
rust/kernel/kunit.rs | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/rust/kernel/kunit.rs b/rust/kernel/kunit.rs
index 824da0e9738a..a17ef3b2e860 100644
--- a/rust/kernel/kunit.rs
+++ b/rust/kernel/kunit.rs
@@ -128,9 +128,9 @@ unsafe impl Sync for UnaryAssert {}
unsafe {
$crate::bindings::__kunit_do_failed_assertion(
kunit_test,
- core::ptr::addr_of!(LOCATION.0),
+ &raw const LOCATION.0,
$crate::bindings::kunit_assert_type_KUNIT_ASSERTION,
- core::ptr::addr_of!(ASSERTION.0.assert),
+ &raw const ASSERTION.0.assert,
Some($crate::bindings::kunit_unary_assert_format),
core::ptr::null(),
);
--
2.48.1
I am submitting a series of patches that introduce a new feature for the
netconsole subsystem, specifically the addition of the 'release' field
to the sysdata structure. This feature allows the kernel release/version
to be appended to the userdata dictionary in every message sent,
enhancing the information available for debugging and monitoring
purposes.
This complements the already supported release prepend feature, which
was added some time ago. The release prepend appends the release
information at the message header, which is not ideal for two reasons:
1) It is difficult to determine if a message includes this information,
making it hard and resource-intensive to parse.
2) When a message is fragmented, the release information is appended to
every message fragment, consuming valuable space in the packet.
The "release prepend" feature was created before the concept of userdata
and sysdata. Now that this format has proven successful, we are
implementing the release feature as part of this enhanced structure.
This patch series aims to improve the netconsole subsystem by providing
a more efficient and user-friendly way to include kernel release
information in messages. I believe these changes will significantly aid
in system analysis and troubleshooting.
Suggested-by: Manu Bretelle <chantr4(a)gmail.com>
Signed-off-by: Breno Leitao <leitao(a)debian.org>
---
Breno Leitao (6):
netconsole: introduce 'release' as a new sysdata field
netconsole: implement configfs for release_enabled
netconsole: add 'sysdata' suffix to related functions
netconsole: append release to sysdata
selftests: netconsole: Add tests for 'release' feature in sysdata
docs: netconsole: document release feature
Documentation/networking/netconsole.rst | 25 ++++++++
drivers/net/netconsole.c | 71 ++++++++++++++++++++--
.../selftests/drivers/net/netcons_sysdata.sh | 44 +++++++++++++-
3 files changed, 133 insertions(+), 7 deletions(-)
---
base-commit: 941defcea7e11ad7ff8f0d4856716dd637d757dd
change-id: 20250314-netcons_release-dc1f1f5ca0f7
Best regards,
--
Breno Leitao <leitao(a)debian.org>
Lei Chen raised an issue with CLOCK_MONOTONIC_COARSE seeing
time inconsistencies.
Lei tracked down that this was being caused by the adjustment
tk->tkr_mono.xtime_nsec -= offset;
which is made to compensate for the unaccumulated cycles in
offset when the mult value is adjusted forward, so that
the non-_COARSE clockids don't see inconsistencies.
However, the _COARSE clockids don't use the mult*offset value
in their calculations, so this subtraction can cause the
_COARSE clock ids to jump back a bit.
Now, by design, this negative adjustment should be fine, because
the logic run from timekeeping_adjust() is done after we
accumulate approx mult*interval_cycles into xtime_nsec.
The accumulated (mult*interval_cycles) will be larger then the
(mult_adj*offset) value subtracted from xtime_nsec, and both
operations are done together under the tk_core.lock, so the net
change to xtime_nsec should always be positive.
However, do_adjtimex() calls into timekeeping_advance() as well,
since we want to apply the ntp freq adjustment immediately.
In this case, we don't return early when the offset is smaller
then interval_cycles, so we don't end up accumulating any time
into xtime_nsec. But we do go on to call timekeeping_adjust(),
which modifies the mult value, and subtracts from xtime_nsec
to correct for the new mult value.
Here because we did not accumulate anything, we have a window
where the _COARSE clockids that don't utilize the mult*offset
value, can see an inconsistency.
So to fix this, rework the timekeeping_advance() logic a bit
so that when we are called from do_adjtimex() and the offset
is smaller then cycle_interval, that we call
timekeeping_forward(), to first accumulate the sub-interval
time into xtime_nsec. Then with no unaccumulated cycles in
offset, we can do the mult adjustment without worry of the
subtraction having an impact.
NOTE: This was implemented as a potential alternative to
Thomas' approach here:
https://lore.kernel.org/lkml/87cyej5rid.ffs@tglx/
And similarly, it needs some additional review and testing,
as it was developed while packing for conference travel.
Cc: Thomas Gleixner <tglx(a)linutronix.de>
Cc: Stephen Boyd <sboyd(a)kernel.org>
Cc: Anna-Maria Behnsen <anna-maria(a)linutronix.de>
Cc: Frederic Weisbecker <frederic(a)kernel.org>
Cc: Shuah Khan <shuah(a)kernel.org>
Cc: Miroslav Lichvar <mlichvar(a)redhat.com>
Cc: linux-kselftest(a)vger.kernel.org
Cc: kernel-team(a)android.com
Cc: Lei Chen <lei.chen(a)smartx.com>
Fixes: da15cfdae033 ("time: Introduce CLOCK_REALTIME_COARSE")
Reported-by: Lei Chen <lei.chen(a)smartx.com>
Closes: https://lore.kernel.org/lkml/20250310030004.3705801-1-lei.chen@smartx.com/
Diagnosed-by: Thomas Gleixner <tglx(a)linutronix.de>
Signed-off-by: John Stultz <jstultz(a)google.com>
---
kernel/time/timekeeping.c | 87 ++++++++++++++++++++++++++++-----------
1 file changed, 62 insertions(+), 25 deletions(-)
diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c
index 1e67d076f1955..6f3a145e7b113 100644
--- a/kernel/time/timekeeping.c
+++ b/kernel/time/timekeeping.c
@@ -682,18 +682,18 @@ static void timekeeping_update_from_shadow(struct tk_data *tkd, unsigned int act
}
/**
- * timekeeping_forward_now - update clock to the current time
+ * timekeeping_forward - update clock to given cycle now value
* @tk: Pointer to the timekeeper to update
+ * @cycle_now: Current clocksource read value
*
* Forward the current clock to update its state since the last call to
* update_wall_time(). This is useful before significant clock changes,
* as it avoids having to deal with this time offset explicitly.
*/
-static void timekeeping_forward_now(struct timekeeper *tk)
+static void timekeeping_forward(struct timekeeper *tk, u64 cycle_now)
{
- u64 cycle_now, delta;
+ u64 delta;
- cycle_now = tk_clock_read(&tk->tkr_mono);
delta = clocksource_delta(cycle_now, tk->tkr_mono.cycle_last, tk->tkr_mono.mask,
tk->tkr_mono.clock->max_raw_delta);
tk->tkr_mono.cycle_last = cycle_now;
@@ -710,6 +710,21 @@ static void timekeeping_forward_now(struct timekeeper *tk)
}
}
+/**
+ * timekeeping_forward_now - update clock to the current time
+ * @tk: Pointer to the timekeeper to update
+ *
+ * Forward the current clock to update its state since the last call to
+ * update_wall_time(). This is useful before significant clock changes,
+ * as it avoids having to deal with this time offset explicitly.
+ */
+static void timekeeping_forward_now(struct timekeeper *tk)
+{
+ u64 cycle_now = tk_clock_read(&tk->tkr_mono);
+
+ timekeeping_forward(tk, cycle_now);
+}
+
/**
* ktime_get_real_ts64 - Returns the time of day in a timespec64.
* @ts: pointer to the timespec to be set
@@ -2151,6 +2166,45 @@ static u64 logarithmic_accumulation(struct timekeeper *tk, u64 offset,
return offset;
}
+static u64 timekeeping_accumulate(struct timekeeper *tk, u64 now, u64 offset,
+ unsigned int *clock_set)
+{
+ struct timekeeper *real_tk = &tk_core.timekeeper;
+ int shift = 0, maxshift;
+
+ /*
+ * If we have a sub-cycle_interval offset, we
+ * are likely doing a TK_FREQ_ADJ, so accumulate
+ * everything so we don't have a remainder offset
+ * when later adjusting the multiplier
+ */
+ if (offset < real_tk->cycle_interval) {
+ timekeeping_forward(tk, now);
+ *clock_set = 1;
+ return 0;
+ }
+
+ /*
+ * With NO_HZ we may have to accumulate many cycle_intervals
+ * (think "ticks") worth of time at once. To do this efficiently,
+ * we calculate the largest doubling multiple of cycle_intervals
+ * that is smaller than the offset. We then accumulate that
+ * chunk in one go, and then try to consume the next smaller
+ * doubled multiple.
+ */
+ shift = ilog2(offset) - ilog2(tk->cycle_interval);
+ shift = max(0, shift);
+ /* Bound shift to one less than what overflows tick_length */
+ maxshift = (64 - (ilog2(ntp_tick_length()) + 1)) - 1;
+ shift = min(shift, maxshift);
+ while (offset >= tk->cycle_interval) {
+ offset = logarithmic_accumulation(tk, offset, shift, clock_set);
+ if (offset < tk->cycle_interval << shift)
+ shift--;
+ }
+ return offset;
+}
+
/*
* timekeeping_advance - Updates the timekeeper to the current time and
* current NTP tick length
@@ -2160,8 +2214,7 @@ static bool timekeeping_advance(enum timekeeping_adv_mode mode)
struct timekeeper *tk = &tk_core.shadow_timekeeper;
struct timekeeper *real_tk = &tk_core.timekeeper;
unsigned int clock_set = 0;
- int shift = 0, maxshift;
- u64 offset;
+ u64 cycle_now, offset;
guard(raw_spinlock_irqsave)(&tk_core.lock);
@@ -2169,7 +2222,8 @@ static bool timekeeping_advance(enum timekeeping_adv_mode mode)
if (unlikely(timekeeping_suspended))
return false;
- offset = clocksource_delta(tk_clock_read(&tk->tkr_mono),
+ cycle_now = tk_clock_read(&tk->tkr_mono);
+ offset = clocksource_delta(cycle_now,
tk->tkr_mono.cycle_last, tk->tkr_mono.mask,
tk->tkr_mono.clock->max_raw_delta);
@@ -2177,24 +2231,7 @@ static bool timekeeping_advance(enum timekeeping_adv_mode mode)
if (offset < real_tk->cycle_interval && mode == TK_ADV_TICK)
return false;
- /*
- * With NO_HZ we may have to accumulate many cycle_intervals
- * (think "ticks") worth of time at once. To do this efficiently,
- * we calculate the largest doubling multiple of cycle_intervals
- * that is smaller than the offset. We then accumulate that
- * chunk in one go, and then try to consume the next smaller
- * doubled multiple.
- */
- shift = ilog2(offset) - ilog2(tk->cycle_interval);
- shift = max(0, shift);
- /* Bound shift to one less than what overflows tick_length */
- maxshift = (64 - (ilog2(ntp_tick_length())+1)) - 1;
- shift = min(shift, maxshift);
- while (offset >= tk->cycle_interval) {
- offset = logarithmic_accumulation(tk, offset, shift, &clock_set);
- if (offset < tk->cycle_interval<<shift)
- shift--;
- }
+ offset = timekeeping_accumulate(tk, cycle_now, offset, &clock_set);
/* Adjust the multiplier to correct NTP error */
timekeeping_adjust(tk, offset);
--
2.49.0.rc1.451.g8f38331e32-goog
Hi all,
This is v8 of the Rust/KUnit integration patch. I think all of the
suggestions have at least been responded to (even if there are a few I'm
leaving as either future projects or matters of taste). Hopefully this
is good-to-go for 6.15, so we can start using it concurrently with
making any additional improvements we may wish.
This series was originally written by José Expósito, and has been
modified and updated by Matt Gilbride, Miguel Ojeda, and myself. The
original version can be found here:
https://github.com/Rust-for-Linux/linux/pull/950
Add support for writing KUnit tests in Rust. While Rust doctests are
already converted to KUnit tests and run, they're really better suited
for examples, rather than as first-class unit tests.
This series implements a series of direct Rust bindings for KUnit tests,
as well as a new macro which allows KUnit tests to be written using a
close variant of normal Rust unit test syntax. The only change required
is replacing '#[cfg(test)]' with '#[kunit_tests(kunit_test_suite_name)]'
An example test would look like:
#[kunit_tests(rust_kernel_hid_driver)]
mod tests {
use super::*;
use crate::{c_str, driver, hid, prelude::*};
use core::ptr;
struct SimpleTestDriver;
impl Driver for SimpleTestDriver {
type Data = ();
}
#[test]
fn rust_test_hid_driver_adapter() {
let mut hid = bindings::hid_driver::default();
let name = c_str!("SimpleTestDriver");
static MODULE: ThisModule = unsafe { ThisModule::from_ptr(ptr::null_mut()) };
let res = unsafe {
<hid::Adapter<SimpleTestDriver> as driver::DriverOps>::register(&mut hid, name, &MODULE)
};
assert_eq!(res, Err(ENODEV)); // The mock returns -19
}
}
Please give this a go, and make sure I haven't broken it! There's almost
certainly a lot of improvements which can be made -- and there's a fair
case to be made for replacing some of this with generated C code which
can use the C macros -- but this is hopefully an adequate implementation
for now, and the interface can (with luck) remain the same even if the
implementation changes.
A few small notable missing features:
- Attributes (like the speed of a test) are hardcoded to the default
value.
- Similarly, the module name attribute is hardcoded to NULL. In C, we
use the KBUILD_MODNAME macro, but I couldn't find a way to use this
from Rust which wasn't more ugly than just disabling it.
- Assertions are not automatically rewritten to use KUnit assertions.
---
Changes since v7:
https://lore.kernel.org/rust-for-linux/20250214074051.1619256-1-davidgow@go…
- Reworked the SAFETY comment for addr_of_mut! use with statics in
kunit_unsafe_test_suite!() (again)
- Removed the second mocking example, which was causing confusion.
The first example of in_kunit_test() should be clear enough.
Changes since v6:
https://lore.kernel.org/rust-for-linux/20250214074051.1619256-1-davidgow@go…
- Fixed an [allow(unused_unsafe)] which ended up in patch 2 instead of
patch 1. (Thanks, Tamir!)
- Doc comments now have several useful links. (Thanks, Tamir!)
- Fix a potential compile error under macos. (Thanks, Tamir!)
- Several small tidy-ups to limit unsafe usage. (Thanks, Tamir!)
Changes since v5:
https://lore.kernel.org/all/20241213081035.2069066-1-davidgow@google.com/
- Rebased against 6.14-rc1
- Fixed a bunch of warnings / clippy lints introduced in Rust 1.83 and
1.84.
- No longer needs static_mut_refs / const_mut_refs, and is much cleaned
up as a result. (Thanks, Miguel)
- Major documentation and example fixes. (Thanks, Miguel)
Changes since v4:
https://lore.kernel.org/linux-kselftest/20241101064505.3820737-1-davidgow@g…
- Rebased against 6.13-rc1
- Allowed an unused_unsafe warning after the behaviour of addr_of_mut!()
changed in Rust 1.82. (Thanks Boqun, Miguel)
- "Expect" that the sample assert_eq!(1+1, 2) produces a clippy warning
due to a redundant assertion. (Thanks Boqun, Miguel)
- Fix some missing safety comments, and remove some unneeded 'unsafe'
blocks. (Thanks Boqun)
- Fix a couple of minor rustfmt issues which were triggering checkpatch
warnings.
Changes since v3:
https://lore.kernel.org/linux-kselftest/20241030045719.3085147-2-davidgow@g…
- The kunit_unsafe_test_suite!() macro now panic!s if the suite name is
too long, triggering a compile error. (Thanks, Alice!)
- The #[kunit_tests()] macro now preserves span information, so
errors can be better reported. (Thanks, Boqun!)
- The example tests have been updated to no longer use assert_eq!() with
a constant bool argument (which triggered a clippy warning now we
have the span info).
Changes since v2:
https://lore.kernel.org/linux-kselftest/20241029092422.2884505-1-davidgow@g…
- Include missing rust/macros/kunit.rs file from v2. (Thanks Boqun!)
- The kunit_unsafe_test_suite!() macro will truncate the name of the
suite if it is too long. (Thanks Alice!)
- The proc macro now emits an error if the suite name is too long.
- We no longer needlessly use UnsafeCell<> in
kunit_unsafe_test_suite!(). (Thanks Alice!)
Changes since v1:
https://lore.kernel.org/lkml/20230720-rustbind-v1-0-c80db349e3b5@google.com…
- Rebase on top of the latest rust-next (commit 718c4069896c)
- Make kunit_case a const fn, rather than a macro (Thanks Boqun)
- As a result, the null terminator is now created with
kernel::kunit::kunit_case_null()
- Use the C kunit_get_current_test() function to implement
in_kunit_test(), rather than re-implementing it (less efficiently)
ourselves.
Changes since the GitHub PR:
- Rebased on top of kselftest/kunit
- Add const_mut_refs feature
This may conflict with https://lore.kernel.org/lkml/20230503090708.2524310-6-nmi@metaspace.dk/
- Add rust/macros/kunit.rs to the KUnit MAINTAINERS entry
---
José Expósito (3):
rust: kunit: add KUnit case and suite macros
rust: macros: add macro to easily run KUnit tests
rust: kunit: allow to know if we are in a test
MAINTAINERS | 1 +
rust/kernel/kunit.rs | 171 +++++++++++++++++++++++++++++++++++++++++++
rust/macros/kunit.rs | 161 ++++++++++++++++++++++++++++++++++++++++
rust/macros/lib.rs | 29 ++++++++
4 files changed, 362 insertions(+)
create mode 100644 rust/macros/kunit.rs
--
2.49.0.rc0.332.g42c0ae87b1-goog
Signal delivery during connect() may disconnect an already established
socket. Problem is that such socket might have been placed in a sockmap
before the connection was closed.
PATCH 1 ensures this race won't lead to an unconnected vsock staying in the
sockmap. PATCH 2 selftests it.
PATCH 3 fixes a related race. Note that selftest in PATCH 2 does test this
code as well, but winning this race variant may take more than 2 seconds,
so I'm not advertising it.
Signed-off-by: Michal Luczaj <mhal(a)rbox.co>
---
Changes in v4:
- Selftest: send signal to only our own process
- Link to v3: https://lore.kernel.org/r/20250316-vsock-trans-signal-race-v3-0-17a6862277c…
Changes in v3:
- Selftest: drop unnecessary variable initialization and reorder the calls
- Link to v2: https://lore.kernel.org/r/20250314-vsock-trans-signal-race-v2-0-421a41f60f4…
Changes in v2:
- Handle one more path of tripping the warning
- Add a selftest
- Collect R-b [Stefano]
- Link to v1: https://lore.kernel.org/r/20250307-vsock-trans-signal-race-v1-1-3aca3f771fb…
---
Michal Luczaj (3):
vsock/bpf: Fix EINTR connect() racing sockmap update
selftest/bpf: Add test for AF_VSOCK connect() racing sockmap update
vsock/bpf: Fix bpf recvmsg() racing transport reassignment
net/vmw_vsock/af_vsock.c | 10 ++-
net/vmw_vsock/vsock_bpf.c | 24 ++++--
.../selftests/bpf/prog_tests/sockmap_basic.c | 99 ++++++++++++++++++++++
3 files changed, 124 insertions(+), 9 deletions(-)
---
base-commit: da9e8efe7ee10e8425dc356a9fc593502c8e3933
change-id: 20250305-vsock-trans-signal-race-d62f7718d099
Best regards,
--
Michal Luczaj <mhal(a)rbox.co>
From: Björn Töpel <bjorn(a)rivosinc.com>
There are scenarios where env.{sub,}test_state->stdout_saved, can be
NULL, e.g. sometimes when the watchdog timeout kicks in, or if the
open_memstream syscall is not available.
Avoid crashing test_progs by adding an explicit NULL check prior the
fclose() call.
Signed-off-by: Björn Töpel <bjorn(a)rivosinc.com>
---
tools/testing/selftests/bpf/test_progs.c | 6 ++++--
1 file changed, 4 insertions(+), 2 deletions(-)
diff --git a/tools/testing/selftests/bpf/test_progs.c b/tools/testing/selftests/bpf/test_progs.c
index d4ec9586b98c..309d9d4a8ace 100644
--- a/tools/testing/selftests/bpf/test_progs.c
+++ b/tools/testing/selftests/bpf/test_progs.c
@@ -103,12 +103,14 @@ static void stdio_restore(void)
pthread_mutex_lock(&stdout_lock);
if (env.subtest_state) {
- fclose(env.subtest_state->stdout_saved);
+ if (env.subtest_state->stdout_saved)
+ fclose(env.subtest_state->stdout_saved);
env.subtest_state->stdout_saved = NULL;
stdout = env.test_state->stdout_saved;
stderr = env.test_state->stdout_saved;
} else {
- fclose(env.test_state->stdout_saved);
+ if (env.test_state->stdout_saved)
+ fclose(env.test_state->stdout_saved);
env.test_state->stdout_saved = NULL;
stdout = env.stdout_saved;
stderr = env.stderr_saved;
base-commit: f3f8649585a445414521a6d5b76f41b51205086d
--
2.45.2
Hi all, this is the v3 version.
===
Syzkaller reported this issue [1].
The current sockmap has a dependency on sk_socket in both read and write
stages, but there is a possibility that sk->sk_socket is released during
the process, leading to panic situations. For a detailed reproduction,
please refer to the description in the v2:
https://lore.kernel.org/bpf/20250228055106.58071-1-jiayuan.chen@linux.dev/
The corresponding fix approaches are described in the commit messages of
each patch.
By the way, the current sockmap lacks statistical information, especially
global statistics, such as the number of successful or failed rx and tx
operations. These statistics cannot be obtained from the socket interface
itself.
These data will be of great help in troubleshooting issues and observing
sockmap behavior.
If the maintainer/reviewer does not object, I think we can provide these
statistical information in the future, either through proc/trace/bpftool.
[1] https://syzkaller.appspot.com/bug?extid=dd90a702f518e0eac072
---
v2 -> v3:
1. Michal Luczaj reported similar race issue under sockmap sending path.
2. Rcu lock is conflict with mutex_lock in unix socket read implementation.
https://lore.kernel.org/bpf/20250228055106.58071-1-jiayuan.chen@linux.dev/
v1 -> v2:
1. Add Fixes tag.
2. Extend selftest of edge case for TCP/UDP sockets.
3. Add Reviewed-by and Acked-by tag.
https://lore.kernel.org/bpf/20250226132242.52663-1-jiayuan.chen@linux.dev/T…
Jiayuan Chen (3):
bpf, sockmap: avoid using sk_socket after free when sending
bpf, sockmap: avoid using sk_socket after free when reading
selftests/bpf: Add edge case tests for sockmap
net/core/skmsg.c | 22 ++++++-
.../selftests/bpf/prog_tests/socket_helpers.h | 13 +++-
.../selftests/bpf/prog_tests/sockmap_basic.c | 60 +++++++++++++++++++
3 files changed, 91 insertions(+), 4 deletions(-)
--
2.47.1
Fix test count with late test plan.
For example,
TAP version 13
ok 1 test1
1..4
Returns a count of 1 passed, 1 crashed (because it expects tests after
the test plan): returning the total count of 2 tests
Change this to be 1 passed, 1 error: total count of 1 test
Signed-off-by: Rae Moar <rmoar(a)google.com>
---
tools/testing/kunit/kunit_parser.py | 4 ++++
tools/testing/kunit/kunit_tool_test.py | 4 ++--
2 files changed, 6 insertions(+), 2 deletions(-)
diff --git a/tools/testing/kunit/kunit_parser.py b/tools/testing/kunit/kunit_parser.py
index da53a709773a..c176487356e6 100644
--- a/tools/testing/kunit/kunit_parser.py
+++ b/tools/testing/kunit/kunit_parser.py
@@ -809,6 +809,10 @@ def parse_test(lines: LineStream, expected_num: int, log: List[str], is_subtest:
test.log.extend(parse_diagnostic(lines))
if test.name != "" and not peek_test_name_match(lines, test):
test.add_error(printer, 'missing subtest result line!')
+ elif not lines:
+ print_log(test.log, printer)
+ test.status = TestStatus.NO_TESTS
+ test.add_error(printer, 'No more test results!')
else:
parse_test_result(lines, test, expected_num, printer)
diff --git a/tools/testing/kunit/kunit_tool_test.py b/tools/testing/kunit/kunit_tool_test.py
index 5ff4f6ffd873..bbba921e0eac 100755
--- a/tools/testing/kunit/kunit_tool_test.py
+++ b/tools/testing/kunit/kunit_tool_test.py
@@ -371,8 +371,8 @@ class KUnitParserTest(unittest.TestCase):
"""
result = kunit_parser.parse_run_tests(output.splitlines(), stdout)
# Missing test results after test plan should alert a suspected test crash.
- self.assertEqual(kunit_parser.TestStatus.TEST_CRASHED, result.status)
- self.assertEqual(result.counts, kunit_parser.TestCounts(passed=1, crashed=1, errors=1))
+ self.assertEqual(kunit_parser.TestStatus.SUCCESS, result.status)
+ self.assertEqual(result.counts, kunit_parser.TestCounts(passed=1, errors=2))
def line_stream_from_strs(strs: Iterable[str]) -> kunit_parser.LineStream:
return kunit_parser.LineStream(enumerate(strs, start=1))
base-commit: 2e0cf2b32f72b20b0db5cc665cd8465d0f257278
--
2.49.0.395.g12beb8f557-goog
A bug was identified where the KTAP below caused an infinite loop:
TAP version 13
ok 4 test_case
1..4
The infinite loop was caused by the parser not parsing a test plan
if following a test result line.
Fix this bug by parsing test plan line to avoid the infinite loop.
Signed-off-by: Rae Moar <rmoar(a)google.com>
---
Changes since v2:
- None, adds test in second patch
tools/testing/kunit/kunit_parser.py | 9 ++++-----
1 file changed, 4 insertions(+), 5 deletions(-)
diff --git a/tools/testing/kunit/kunit_parser.py b/tools/testing/kunit/kunit_parser.py
index 29fc27e8949b..da53a709773a 100644
--- a/tools/testing/kunit/kunit_parser.py
+++ b/tools/testing/kunit/kunit_parser.py
@@ -759,7 +759,7 @@ def parse_test(lines: LineStream, expected_num: int, log: List[str], is_subtest:
# If parsing the main/top-level test, parse KTAP version line and
# test plan
test.name = "main"
- ktap_line = parse_ktap_header(lines, test, printer)
+ parse_ktap_header(lines, test, printer)
test.log.extend(parse_diagnostic(lines))
parse_test_plan(lines, test)
parent_test = True
@@ -768,13 +768,12 @@ def parse_test(lines: LineStream, expected_num: int, log: List[str], is_subtest:
# the KTAP version line and/or subtest header line
ktap_line = parse_ktap_header(lines, test, printer)
subtest_line = parse_test_header(lines, test)
+ test.log.extend(parse_diagnostic(lines))
+ parse_test_plan(lines, test)
parent_test = (ktap_line or subtest_line)
if parent_test:
- # If KTAP version line and/or subtest header is found, attempt
- # to parse test plan and print test header
- test.log.extend(parse_diagnostic(lines))
- parse_test_plan(lines, test)
print_test_header(test, printer)
+
expected_count = test.expected_count
subtests = []
test_num = 1
base-commit: 0619a4868fc1b32b07fb9ed6c69adc5e5cf4e4b2
--
2.49.0.rc1.451.g8f38331e32-goog
Here are a few cleanups, preparation work for the new PM ops, and sysctl
knobs.
- Patch 1: reorg: move generic NL code used by all PMs to pm_netlink.c.
- Patch 2: use kmemdup() instead of kmalloc + copy.
- Patch 3: small cleanup to use pm var instead of msk->pm.
- Patch 4: reorg: id_avail_bitmap is only used by the in-kernel PM.
- Patch 5: use struct_group to easily reset a subset of PM data vars.
- Patch 6: introduce the minimal skeleton for the new PM ops.
- Patch 7: register in-kernel and userspace PM ops.
- Patch 8: new net.mptcp.path_manager sysctl knob, deprecating pm_type.
- Patch 9: map the new path_manager sysctl knob with pm_type.
- Patch 10: map the old pm_type sysctl knob with path_manager.
- Patch 11: new net.mptcp.available_path_managers sysctl knob.
- Patch 12: new test to validate path_manager and pm_type mapping.
Signed-off-by: Matthieu Baerts (NGI0) <matttbe(a)kernel.org>
---
Geliang Tang (11):
mptcp: pm: in-kernel: use kmemdup helper
mptcp: pm: use pm variable instead of msk->pm
mptcp: pm: only fill id_avail_bitmap for in-kernel pm
mptcp: pm: add struct_group in mptcp_pm_data
mptcp: pm: define struct mptcp_pm_ops
mptcp: pm: register in-kernel and userspace PM
mptcp: sysctl: set path manager by name
mptcp: sysctl: map path_manager to pm_type
mptcp: sysctl: map pm_type to path_manager
mptcp: sysctl: add available_path_managers
selftests: mptcp: add pm sysctl mapping tests
Matthieu Baerts (NGI0) (1):
mptcp: pm: split netlink and in-kernel init
Documentation/networking/mptcp-sysctl.rst | 23 +++++
include/net/mptcp.h | 14 +++
net/mptcp/ctrl.c | 113 +++++++++++++++++++++-
net/mptcp/pm.c | 97 ++++++++++++++++---
net/mptcp/pm_kernel.c | 16 +--
net/mptcp/pm_netlink.c | 6 ++
net/mptcp/pm_userspace.c | 10 ++
net/mptcp/protocol.h | 17 ++++
tools/testing/selftests/net/mptcp/userspace_pm.sh | 30 +++++-
9 files changed, 301 insertions(+), 25 deletions(-)
---
base-commit: e016cf5f39e9c53e274a7b7122a949d8839b8782
change-id: 20250312-net-next-mptcp-pm-ops-intro-01510135cd5e
Best regards,
--
Matthieu Baerts (NGI0) <matttbe(a)kernel.org>
For platforms on powerpc architecture with a default page size greater
than 4096, there was an inconsistency in fragment size calculation.
This caused the BPF selftest xdp_adjust_tail/xdp_adjust_frags_tail_grow
to fail on powerpc.
The issue occurred because the fragment buffer size in
bpf_prog_test_run_xdp() was set to 4096, while the actual data size in
the fragment within the shared skb was checked against PAGE_SIZE
(65536 on powerpc) in min_t, causing it to exceed 4096 and be set
accordingly. This discrepancy led to an overflow when
bpf_xdp_frags_increase_tail() checked for tailroom, as skb_frag_size(frag)
could be greater than rxq->frag_size (when PAGE_SIZE > 4096).
This change fixes:
1. test_run by getting the correct arch dependent PAGE_SIZE.
2. selftest by caculating tailroom and getting correct PAGE_SIZE.
Changes:
v1 -> v2:
* Address comments from Alexander
* Use dynamic page size, cacheline size and size of
struct skb_shared_info to calculate parameters.
* Fixed both test_run and selftest.
v1: https://lore.kernel.org/all/20250122183720.1411176-1-skb99@linux.ibm.com/
Saket Kumar Bhaskar (2):
bpf, test_run: Replace hardcoded page size with dynamic PAGE_SIZE in
test_run
selftests/bpf: Refactor xdp_adjust_tail selftest with dynamic sizing
.../bpf/prog_tests/xdp_adjust_tail.c | 160 +++++++++++++-----
.../bpf/progs/test_xdp_adjust_tail_grow.c | 41 +++--
2 files changed, 149 insertions(+), 52 deletions(-)
--
2.43.5
Currently there is no means of determining whether a give page in a mapping
range is designated a guard region (as installed via madvise() using the
MADV_GUARD_INSTALL flag).
This is generally not an issue, but in some instances users may wish to
determine whether this is the case.
This series adds this ability via /proc/$pid/pagemap, updates the
documentation and adds a self test to assert that this functions correctly.
Lorenzo Stoakes (2):
fs/proc/task_mmu: add guard region bit to pagemap
tools/selftests: add guard region test for /proc/$pid/pagemap
Documentation/admin-guide/mm/pagemap.rst | 3 +-
fs/proc/task_mmu.c | 6 ++-
tools/testing/selftests/mm/guard-regions.c | 47 ++++++++++++++++++++++
tools/testing/selftests/mm/vm_util.h | 1 +
4 files changed, 55 insertions(+), 2 deletions(-)
--
2.48.1
Hi all,
This patch series continues the work to migrate the script tests into
prog_tests.
test_xdp_vlan.sh tests the ability of an XDP program to modify the VLAN
ids on the fly. This isn't currently covered by an other test in the
test_progs framework so I add a new file prog_tests/xdp_vlan.c that does
the exact same tests (same network topology, same BPF programs) and
remove the script.
Signed-off-by: Bastien Curutchet (eBPF Foundation) <bastien.curutchet(a)bootlin.com>
---
Bastien Curutchet (eBPF Foundation) (2):
selftests/bpf: test_xdp_vlan: Rename BPF sections
selftests/bpf: Migrate test_xdp_vlan.sh into test_progs
tools/testing/selftests/bpf/Makefile | 4 +-
tools/testing/selftests/bpf/prog_tests/xdp_vlan.c | 175 ++++++++++++++++
tools/testing/selftests/bpf/progs/test_xdp_vlan.c | 20 +-
tools/testing/selftests/bpf/test_xdp_vlan.sh | 233 ---------------------
.../selftests/bpf/test_xdp_vlan_mode_generic.sh | 9 -
.../selftests/bpf/test_xdp_vlan_mode_native.sh | 9 -
6 files changed, 186 insertions(+), 264 deletions(-)
---
base-commit: a814b9be27fb3c3f49343aee4b015b76f5875558
change-id: 20250130-xdp_vlan-e825cc4df14a
Best regards,
--
Bastien Curutchet (eBPF Foundation) <bastien.curutchet(a)bootlin.com>
When the rseq UAPI header is included 'union rseq' clashes with 'struct
rseq', it's not the case in the rseq selftests but it does break the KVM
selftests that also include this file.
Rename 'union rseq' to 'union rseq_tls' to fix this.
Fixes: e6644c967d3c ("rseq/selftests: Ensure the rseq ABI TLS is actually 1024 bytes")
Reported-by: Mark Brown <broonie(a)kernel.org>
Signed-off-by: Michael Jeanson <mjeanson(a)efficios.com>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers(a)efficios.com>
---
tools/testing/selftests/rseq/rseq.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/tools/testing/selftests/rseq/rseq.c b/tools/testing/selftests/rseq/rseq.c
index 6d8997d815ef..663a9cef1952 100644
--- a/tools/testing/selftests/rseq/rseq.c
+++ b/tools/testing/selftests/rseq/rseq.c
@@ -75,13 +75,13 @@ static int rseq_ownership;
* Use a union to ensure we allocate a TLS area of 1024 bytes to accomodate an
* rseq registration that is larger than the current rseq ABI.
*/
-union rseq {
+union rseq_tls {
struct rseq_abi abi;
char dummy[RSEQ_THREAD_AREA_ALLOC_SIZE];
};
static
-__thread union rseq __rseq __attribute__((tls_model("initial-exec"))) = {
+__thread union rseq_tls __rseq __attribute__((tls_model("initial-exec"))) = {
.abi = {
.cpu_id = RSEQ_ABI_CPU_ID_UNINITIALIZED,
},
--
2.43.0
Conntrack bridge only tracks untagged and 802.1q.
To make the bridge-fastpath experience more similar to the
forward-fastpath experience, add double vlan, pppoe and pppoe-in-q
tagged packets to bridge conntrack and to bridge filter chain.
Split from patch-set: bridge-fastpath and related improvements v9
Eric Woudstra (3):
netfilter: bridge: Add conntrack double vlan and pppoe
netfilter: nft_chain_filter: Add bridge double vlan and pppoe
selftests: netfilter: Add conntrack_bridge.sh
net/bridge/netfilter/nf_conntrack_bridge.c | 83 +++++++--
net/netfilter/nft_chain_filter.c | 20 +-
.../testing/selftests/net/netfilter/Makefile | 1 +
.../net/netfilter/conntrack_bridge.sh | 176 ++++++++++++++++++
4 files changed, 267 insertions(+), 13 deletions(-)
create mode 100755 tools/testing/selftests/net/netfilter/conntrack_bridge.sh
--
2.47.1
Make sure the test cleans up after itself. The XDP off statements
at the end of the test may not be reached.
Fixes: 75cc19c8ff89 ("selftests: drv-net: add xdp cases for ping.py")
Signed-off-by: Jakub Kicinski <kuba(a)kernel.org>
---
CC: shuah(a)kernel.org
CC: ap420073(a)gmail.com
CC: linux-kselftest(a)vger.kernel.org
---
tools/testing/selftests/drivers/net/ping.py | 16 ++++++++--------
1 file changed, 8 insertions(+), 8 deletions(-)
diff --git a/tools/testing/selftests/drivers/net/ping.py b/tools/testing/selftests/drivers/net/ping.py
index 93f4b411b378..fc69bfcc37c4 100755
--- a/tools/testing/selftests/drivers/net/ping.py
+++ b/tools/testing/selftests/drivers/net/ping.py
@@ -7,7 +7,7 @@ from lib.py import ksft_run, ksft_exit
from lib.py import ksft_eq, KsftSkipEx, KsftFailEx
from lib.py import EthtoolFamily, NetDrvEpEnv
from lib.py import bkg, cmd, wait_port_listen, rand_port
-from lib.py import ethtool, ip
+from lib.py import defer, ethtool, ip
remote_ifname=""
no_sleep=False
@@ -60,6 +60,7 @@ no_sleep=False
prog = test_dir + "/../../net/lib/xdp_dummy.bpf.o"
cmd(f"ip link set dev {remote_ifname} mtu 1500", shell=True, host=cfg.remote)
cmd(f"ip link set dev {cfg.ifname} mtu 1500 xdpgeneric obj {prog} sec xdp", shell=True)
+ defer(cmd, f"ip link set dev {cfg.ifname} xdpgeneric off")
if no_sleep != True:
time.sleep(10)
@@ -68,7 +69,9 @@ no_sleep=False
test_dir = os.path.dirname(os.path.realpath(__file__))
prog = test_dir + "/../../net/lib/xdp_dummy.bpf.o"
cmd(f"ip link set dev {remote_ifname} mtu 9000", shell=True, host=cfg.remote)
+ defer(ip, f"link set dev {remote_ifname} mtu 1500", host=cfg.remote)
ip("link set dev %s mtu 9000 xdpgeneric obj %s sec xdp.frags" % (cfg.ifname, prog))
+ defer(ip, f"link set dev {cfg.ifname} mtu 1500 xdpgeneric off")
if no_sleep != True:
time.sleep(10)
@@ -78,6 +81,7 @@ no_sleep=False
prog = test_dir + "/../../net/lib/xdp_dummy.bpf.o"
cmd(f"ip link set dev {remote_ifname} mtu 1500", shell=True, host=cfg.remote)
cmd(f"ip -j link set dev {cfg.ifname} mtu 1500 xdp obj {prog} sec xdp", shell=True)
+ defer(ip, f"link set dev {cfg.ifname} mtu 1500 xdp off")
xdp_info = ip("-d link show %s" % (cfg.ifname), json=True)[0]
if xdp_info['xdp']['mode'] != 1:
"""
@@ -94,10 +98,11 @@ no_sleep=False
test_dir = os.path.dirname(os.path.realpath(__file__))
prog = test_dir + "/../../net/lib/xdp_dummy.bpf.o"
cmd(f"ip link set dev {remote_ifname} mtu 9000", shell=True, host=cfg.remote)
+ defer(ip, f"link set dev {remote_ifname} mtu 1500", host=cfg.remote)
try:
cmd(f"ip link set dev {cfg.ifname} mtu 9000 xdp obj {prog} sec xdp.frags", shell=True)
+ defer(ip, f"link set dev {cfg.ifname} mtu 1500 xdp off")
except Exception as e:
- cmd(f"ip link set dev {remote_ifname} mtu 1500", shell=True, host=cfg.remote)
raise KsftSkipEx('device does not support native-multi-buffer XDP')
if no_sleep != True:
@@ -111,6 +116,7 @@ no_sleep=False
cmd(f"ip link set dev {cfg.ifname} xdpoffload obj {prog} sec xdp", shell=True)
except Exception as e:
raise KsftSkipEx('device does not support offloaded XDP')
+ defer(ip, f"link set dev {cfg.ifname} xdpoffload off")
cmd(f"ip link set dev {remote_ifname} mtu 1500", shell=True, host=cfg.remote)
if no_sleep != True:
@@ -157,7 +163,6 @@ no_sleep=False
_test_v4(cfg)
_test_v6(cfg)
_test_tcp(cfg)
- ip("link set dev %s xdpgeneric off" % cfg.ifname)
def test_xdp_generic_mb(cfg, netnl) -> None:
_set_xdp_generic_mb_on(cfg)
@@ -169,7 +174,6 @@ no_sleep=False
_test_v4(cfg)
_test_v6(cfg)
_test_tcp(cfg)
- ip("link set dev %s xdpgeneric off" % cfg.ifname)
def test_xdp_native_sb(cfg, netnl) -> None:
_set_xdp_native_sb_on(cfg)
@@ -181,7 +185,6 @@ no_sleep=False
_test_v4(cfg)
_test_v6(cfg)
_test_tcp(cfg)
- ip("link set dev %s xdp off" % cfg.ifname)
def test_xdp_native_mb(cfg, netnl) -> None:
_set_xdp_native_mb_on(cfg)
@@ -193,14 +196,12 @@ no_sleep=False
_test_v4(cfg)
_test_v6(cfg)
_test_tcp(cfg)
- ip("link set dev %s xdp off" % cfg.ifname)
def test_xdp_offload(cfg, netnl) -> None:
_set_xdp_offload_on(cfg)
_test_v4(cfg)
_test_v6(cfg)
_test_tcp(cfg)
- ip("link set dev %s xdpoffload off" % cfg.ifname)
def main() -> None:
with NetDrvEpEnv(__file__) as cfg:
@@ -213,7 +214,6 @@ no_sleep=False
test_xdp_native_mb,
test_xdp_offload],
args=(cfg, EthtoolFamily()))
- set_interface_init(cfg)
ksft_exit()
--
2.48.1
Unmapping virtual machine guest memory from the host kernel's direct map
is a successful mitigation against Spectre-style transient execution
issues: If the kernel page tables do not contain entries pointing to
guest memory, then any attempted speculative read through the direct map
will necessarily be blocked by the MMU before any observable
microarchitectural side-effects happen. This means that Spectre-gadgets
and similar cannot be used to target virtual machine memory. Roughly 60%
of speculative execution issues fall into this category [1, Table 1].
This patch series extends guest_memfd with the ability to remove its
memory from the host kernel's direct map, to be able to attain the above
protection for KVM guests running inside guest_memfd.
=== Changes to RFC v3 ===
- Settle relationship between direct map removal and shared/private
memory in guest_memfd (David H.)
- Omit TLB flushes upon direct map removal again
- Settle uABI for how KVM accesses guest memory in non-CoCo guest_memfd
VMs (upstream guest_memfd calls)
- Add selftests that exercise the codepaths of non-CoCo guest_memfd VMs
Lastly, this series is rebased on top of Fuad's v4 for shared mapping of
guest_memfd [2]. The KVM parts should also apply on top of 0ad2507d5d93
("Linux 6.14-rc3"), but the selftest patches need Fuad's series as base.
=== Overview ===
guest_memfd should be usable for "non-CoCo" VMs - virtual machines where
host userspace is trusted (e.g. can have access to all of guest memory),
but which should still be hardened against speculative execution attacks
(Spectre, etc.) staged through potentially existing gadgets in the host
kernel.
To attain this hardening, unmap guest memory from the host kernels
address space (e.g. zap direct map entries), while allowing KVM to
continue accessing guest memory through userspace mappings. This works
because KVM already almost always uses userspace mappings whenever KVM
needs to access guest memory - the only parts that require direct map
entries (because they use GUP) are KVM's MMU, and kvm-clock on x86.
Building on top of guest_memfd sidesteps the MMU problem, as for
memslots with KVM_MEM_GUEST_MEMFD set, the MMU consumes fd + offset
directly without going through any VMAs. kvm-clock on the other hand is
not strictly needed (guests boot fine without it), so ignore it for
now.
=== Implementation ===
Make KVM_CREATE_GUEST_MEMFD accept a flag (KVM_GMEM_NO_DIRECT_MAP) that
instructs it to remove newly allocated folios from the host kernels
direct map immediately after preparation.
Nothing further is needed to make non-CoCo VMs work - particularly, KVM
does not need to be taught any special ways of accessing guest memory if
it is in guest_memfd. Userspace can simply mmap guest_memfd (via
KVM_GMEM_SHARED_MEM added in Fuad's series), and set the memslot's
userspace_addr to this userspace mapping of guest_memfd.
=== Open Questions ===
In this patch series, stale TLB entries do not get flushed after direct
map entries are marked as not present. This is fine from a functional
point of view (as the mapping is still valid, it's just temporarily not
supposed to be used), but pokes a theoretical hole into the speculation
protection: Something could try to keep alive stale TLB entries for
specific pages until the guest starts using them for sensitive
information, and then stage a Spectre attack on that memory, despite it
being unmapped. In practice, this would require knowing in advance, at
gmem fault-time, which pages will eventually contain information of
interest, and then preventing these specific TLB entries from getting
naturally evicted (where the number of pages that can be targeted like
this is limited by the size of the TLB). These seem to be fairly
difficult requisites to fulfill, but we were wondering what the
community thinks.
=== Summary ===
Patch 1 adds a struct address_space flag that indices that folios in a
mapping are direct map removed, and threads it through mm code to ensure
direct map removed folios don't end up in places where they can cause
mayhem (particularly, we reject them in get_user_pages). Since these
checks end up being duplicates of already existing checks for secretmem
folios, patch 2 unifies the two by using the new address_space flag for
secretmem mappings. Patches 3 through 5 are about support for direct map
removal in guest_memfd, while patches 6 through 12 are about testing the
non-CoCo setup in KVM selftests, with patches 6 through 9 being
preparatory, and patches 10 through 12 adding the actual test cases.
[1]: https://download.vusec.net/papers/quarantine_raid23.pdf
[2]: https://lore.kernel.org/kvm/20250218172500.807733-1-tabba@google.com/
[RFC v1]: https://lore.kernel.org/kvm/20240709132041.3625501-1-roypat@amazon.co.uk/
[RFC v2]: https://lore.kernel.org/kvm/20240910163038.1298452-1-roypat@amazon.co.uk/
[RFC v3]: https://lore.kernel.org/kvm/20241030134912.515725-1-roypat@amazon.co.uk/
Patrick Roy (12):
mm: introduce AS_NO_DIRECT_MAP
mm/secretmem: set AS_NO_DIRECT_MAP instead of special-casing
KVM: guest_memfd: Add flag to remove from direct map
KVM: Add capability to discover KVM_GMEM_NO_DIRECT_MAP support
KVM: Documentation: document KVM_GMEM_NO_DIRECT_MAP flag
KVM: selftests: load elf via bounce buffer
KVM: selftests: set KVM_MEM_GUEST_MEMFD in vm_mem_add() if guest_memfd
!= -1
KVM: selftests: Add guest_memfd based vm_mem_backing_src_types
KVM: selftests: stuff vm_mem_backing_src_type into vm_shape
KVM: selftests: adjust test_create_guest_memfd_invalid
KVM: selftests: set KVM_GMEM_NO_DIRECT_MAP in mem conversion tests
KVM: selftests: Test guest execution from direct map removed gmem
Documentation/virt/kvm/api.rst | 13 ++++
include/linux/pagemap.h | 16 +++++
include/linux/secretmem.h | 18 ------
include/uapi/linux/kvm.h | 3 +
lib/buildid.c | 4 +-
mm/gup.c | 14 +---
mm/mlock.c | 2 +-
mm/secretmem.c | 6 +-
.../testing/selftests/kvm/guest_memfd_test.c | 2 +-
.../testing/selftests/kvm/include/kvm_util.h | 29 ++++++---
.../testing/selftests/kvm/include/test_util.h | 8 +++
tools/testing/selftests/kvm/lib/elf.c | 8 +--
tools/testing/selftests/kvm/lib/io.c | 23 +++++++
tools/testing/selftests/kvm/lib/kvm_util.c | 64 +++++++++++--------
tools/testing/selftests/kvm/lib/test_util.c | 8 +++
tools/testing/selftests/kvm/lib/x86/sev.c | 1 +
.../selftests/kvm/pre_fault_memory_test.c | 1 +
.../selftests/kvm/set_memory_region_test.c | 40 ++++++++++++
.../kvm/x86/private_mem_conversions_test.c | 7 +-
virt/kvm/guest_memfd.c | 24 ++++++-
virt/kvm/kvm_main.c | 5 ++
21 files changed, 214 insertions(+), 82 deletions(-)
base-commit: da40655874b54a2b563f8ceb3ed839c6cd38e0b4
--
2.48.1
$half_ufd_size_MB is supposed to be half of the available hugetlb memory
expressed in MB. But previously it was calculated in pages since
$freepgs is the number of free pages.
When huge pages are 2M it doesn't make a whole lot of difference; the
number of pages that get used is just halved. But on arm64 with 16K or
64K base pages, the PMD size (and default hugetlb size) is 32M and 512M
respectively. So in this case we end up passing a number of MB that is
smaller than a single hugetlb page and the test raises an error.
Fixes: 2e47a445d7b3 ("selftests/mm: run_vmtests.sh: fix hugetlb mem size calculation")
Signed-off-by: Ryan Roberts <ryan.roberts(a)arm.com>
---
tools/testing/selftests/mm/run_vmtests.sh | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/tools/testing/selftests/mm/run_vmtests.sh b/tools/testing/selftests/mm/run_vmtests.sh
index da7e26668103..14fa9d40d574 100755
--- a/tools/testing/selftests/mm/run_vmtests.sh
+++ b/tools/testing/selftests/mm/run_vmtests.sh
@@ -304,7 +304,7 @@ uffd_stress_bin=./uffd-stress
CATEGORY="userfaultfd" run_test ${uffd_stress_bin} anon 20 16
# Hugetlb tests require source and destination huge pages. Pass in half
# the size of the free pages we have, which is used for *each*.
-half_ufd_size_MB=$((freepgs / 2))
+half_ufd_size_MB=$(((freepgs * hpgsize_KB / 2) / 1024))
CATEGORY="userfaultfd" run_test ${uffd_stress_bin} hugetlb "$half_ufd_size_MB" 32
CATEGORY="userfaultfd" run_test ${uffd_stress_bin} hugetlb-private "$half_ufd_size_MB" 32
CATEGORY="userfaultfd" run_test ${uffd_stress_bin} shmem 20 16
--
2.43.0
Adding the aligned(1024) attribute to the definition of __rseq_abi did
not increase its size to 1024, for this attribute to impact the size of
__rseq_abi it would need to be added to the declaration of 'struct
rseq_abi'. We only want to increase the size of the TLS allocation to
ensure registration will succeed with future extended ABI. Use a union
with a dummy member to ensure we allocate 1024 bytes.
Signed-off-by: Michael Jeanson <mjeanson(a)efficios.com>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers(a)efficios.com>
---
tools/testing/selftests/rseq/rseq.c | 21 ++++++++++++++++-----
1 file changed, 16 insertions(+), 5 deletions(-)
diff --git a/tools/testing/selftests/rseq/rseq.c b/tools/testing/selftests/rseq/rseq.c
index f6156790c3b4..aa9ae866bc1a 100644
--- a/tools/testing/selftests/rseq/rseq.c
+++ b/tools/testing/selftests/rseq/rseq.c
@@ -71,9 +71,20 @@ static int rseq_ownership;
/* Original struct rseq allocation size is 32 bytes. */
#define ORIG_RSEQ_ALLOC_SIZE 32
+/*
+ * Use a union to ensure we allocate a TLS area of 1024 bytes to accomodate an
+ * rseq registration that is larger than the current rseq ABI.
+ */
+union rseq {
+ struct rseq_abi abi;
+ char dummy[RSEQ_THREAD_AREA_ALLOC_SIZE];
+};
+
static
-__thread struct rseq_abi __rseq_abi __attribute__((tls_model("initial-exec"), aligned(RSEQ_THREAD_AREA_ALLOC_SIZE))) = {
- .cpu_id = RSEQ_ABI_CPU_ID_UNINITIALIZED,
+__thread union rseq __rseq __attribute__((tls_model("initial-exec"))) = {
+ .abi = {
+ .cpu_id = RSEQ_ABI_CPU_ID_UNINITIALIZED,
+ },
};
static int sys_rseq(struct rseq_abi *rseq_abi, uint32_t rseq_len,
@@ -149,7 +160,7 @@ int rseq_register_current_thread(void)
/* Treat libc's ownership as a successful registration. */
return 0;
}
- rc = sys_rseq(&__rseq_abi, get_rseq_min_alloc_size(), 0, RSEQ_SIG);
+ rc = sys_rseq(&__rseq.abi, get_rseq_min_alloc_size(), 0, RSEQ_SIG);
if (rc) {
/*
* After at least one thread has registered successfully
@@ -183,7 +194,7 @@ int rseq_unregister_current_thread(void)
/* Treat libc's ownership as a successful unregistration. */
return 0;
}
- rc = sys_rseq(&__rseq_abi, get_rseq_min_alloc_size(), RSEQ_ABI_FLAG_UNREGISTER, RSEQ_SIG);
+ rc = sys_rseq(&__rseq.abi, get_rseq_min_alloc_size(), RSEQ_ABI_FLAG_UNREGISTER, RSEQ_SIG);
if (rc)
return -1;
return 0;
@@ -249,7 +260,7 @@ void rseq_init(void)
rseq_ownership = 1;
/* Calculate the offset of the rseq area from the thread pointer. */
- rseq_offset = (void *)&__rseq_abi - rseq_thread_pointer();
+ rseq_offset = (void *)&__rseq.abi - rseq_thread_pointer();
/* rseq flags are deprecated, always set to 0. */
rseq_flags = 0;
--
2.43.0
Hi all,
This patch series continues the work to migrate the script tests into
prog_tests.
The test_xsk.sh script tests lots of AF_XDP use cases. The tests it uses
are defined in xksxceiver.c. As this script is used to test real
hardware, the goal here is to keep it as is and only integrate the
tests on veth peers into the test_progs framework.
Three tests are flaky on s390 so they won't be integrated to test_progs
yet (I'm currently trying to make them more robust).
PATCH 1 & 2 fix some small issues xskxceiver.c
PATCH 3 to 9 rework the xskxceiver to ease the integration in the
test_progs framework. Two main points are addressed in them :
- wrap kselftest calls behind macros to ease their replacement later
- handle all errors to release resources instead of calling exit() when
any error occurs.
PATCH 10 extracts test_xsk[.c/.h] from xskxceiver[.c/.h] to make the
tests available to test_progs
PATCH 11 enables kselftest de-activation
PATCH 12 isolates the flaky tests
PATCH 13 integrate the non-flaky tests to the test_progs framework
Signed-off-by: Bastien Curutchet (eBPF Foundation) <bastien.curutchet(a)bootlin.com>
---
Bastien Curutchet (eBPF Foundation) (13):
selftests/bpf: test_xsk: Initialize bitmap before use
selftests/bpf: test_xsk: Fix memory leaks
selftests/bpf: test_xsk: Wrap ksft_*() behind macros
selftests/bpf: test_xsk: Add return value to init_iface()
selftests/bpf: test_xsk: Don't exit immediately when xsk_attach fails
selftests/bpf: test_xsk: Don't exit immediately when gettimeofday fails
selftests/bpf: test_xsk: Don't exit immediately when workers fail
selftests/bpf: test_xsk: Don't exit immediately if validate_traffic fails
selftests/bpf: test_xsk: Don't exit immediately on allocation failures
selftests/bpf: test_xsk: Split xskxceiver
selftests/bpf: test_xsk: Make kselftest dependency optional
selftests/bpf: test_xsk: Isolate flaky tests
selftests/bpf: test_xsk: Integrate test_xsk.c to test_progs framework
tools/testing/selftests/bpf/Makefile | 13 +-
tools/testing/selftests/bpf/prog_tests/test_xsk.c | 2416 ++++++++++++++++++++
tools/testing/selftests/bpf/prog_tests/test_xsk.h | 299 +++
tools/testing/selftests/bpf/prog_tests/xsk.c | 178 ++
tools/testing/selftests/bpf/xskxceiver.c | 2543 +--------------------
tools/testing/selftests/bpf/xskxceiver.h | 153 --
6 files changed, 3021 insertions(+), 2581 deletions(-)
---
base-commit: 720c696b16a1b1680f64cac9b3bb9e312a23ac47
change-id: 20250218-xsk-0cf90e975d14
Best regards,
--
Bastien Curutchet (eBPF Foundation) <bastien.curutchet(a)bootlin.com>
When using 'make LLVM=1 W=1 -C tools/testing/selftests/pid_namespace'
with clang-20, I've noticed the following:
pid_max.c:42:8: error: call to undeclared function 'mount'; ISO
C99 and later do not support implicit function declarations
[-Wimplicit-function-declaration]
42 | ret = mount("", "/", NULL, MS_PRIVATE | MS_REC, 0);
| ^
pid_max.c:42:29: error: use of undeclared identifier 'MS_PRIVATE'
42 | ret = mount("", "/", NULL, MS_PRIVATE | MS_REC, 0);
| ^
...
So include '<sys/mount.h>' to add all of the required declarations.
Signed-off-by: Dmitry Antipov <dmantipov(a)yandex.ru>
---
tools/testing/selftests/pid_namespace/pid_max.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/tools/testing/selftests/pid_namespace/pid_max.c b/tools/testing/selftests/pid_namespace/pid_max.c
index 51c414faabb0..972bedc475f1 100644
--- a/tools/testing/selftests/pid_namespace/pid_max.c
+++ b/tools/testing/selftests/pid_namespace/pid_max.c
@@ -11,6 +11,7 @@
#include <string.h>
#include <syscall.h>
#include <sys/wait.h>
+#include <sys/mount.h>
#include "../kselftest_harness.h"
#include "../pidfd/pidfd.h"
--
2.48.1
This patchset add ftrace helpers functions and
add a new test makes sure that ftrace can trace
a function that was introduced by a livepatch.
Signed-off-by: Filipe Xavier <felipeaggger(a)gmail.com>
---
Filipe Xavier (2):
selftests: livepatch: add new ftrace helpers functions
selftests: livepatch: test if ftrace can trace a livepatched function
tools/testing/selftests/livepatch/functions.sh | 45 ++++++++++++++++++++++++
tools/testing/selftests/livepatch/test-ftrace.sh | 35 ++++++++++++++++++
2 files changed, 80 insertions(+)
---
base-commit: 848e076317446f9c663771ddec142d7c2eb4cb43
change-id: 20250306-ftrace-sftest-livepatch-60d9dc472235
Best regards,
--
Filipe Xavier <felipeaggger(a)gmail.com>
As the vIOMMU infrastructure series part-3, this introduces a new vEVENTQ
object. The existing FAULT object provides a nice notification pathway to
the user space with a queue already, so let vEVENTQ reuse that.
Mimicing the HWPT structure, add a common EVENTQ structure to support its
derivatives: IOMMUFD_OBJ_FAULT (existing) and IOMMUFD_OBJ_VEVENTQ (new).
An IOMMUFD_CMD_VEVENTQ_ALLOC is introduced to allocate vEVENTQ object for
vIOMMUs. One vIOMMU can have multiple vEVENTQs in different types but can
not support multiple vEVENTQs in the same type.
The forwarding part is fairly simple but might need to replace a physical
device ID with a virtual device ID in a driver-level event data structure.
So, this also adds some helpers for drivers to use.
As usual, this series comes with the selftest coverage for this new ioctl
and with a real world use case in the ARM SMMUv3 driver.
This is on Github:
https://github.com/nicolinc/iommufd/commits/iommufd_veventq-v9
Paring QEMU branch for testing:
https://github.com/nicolinc/qemu/commits/wip/for_iommufd_veventq-v9
Changelog
v9
* Add Acked-by from Will
* Fix typo in commit logs and reviewer name
* Allow invaid nested STE for C_BAD_STE report
* Drop extra indentation in arm_smmu_handle_event()
* Drop comments in arm_smmu_attach_prepare_vmaster()
v8
https://lore.kernel.org/all/cover.1740504232.git.nicolinc@nvidia.com/
* Add Reviewed-by from Jason and Pranjal
* Fix errno returned in arm_smmu_handle_event()
* Validate domain->type outside of arm_smmu_attach_prepare_vmaster()
* Drop unnecessary vmaster comparison in arm_smmu_attach_commit_vmaster()
v7
https://lore.kernel.org/all/cover.1740238876.git.nicolinc@nvidia.com/
* Rebase on Jason's for-next tree for latest fault.c
* Add Reviewed-by
* Update commit logs
* Add __reserved field sanity
* Skip kfree() on the static header
* Replace "bool on_list" with list_is_last()
* Use u32 for flags in iommufd_vevent_header
* Drop casting in iommufd_viommu_get_vdev_id()
* Update the bounding logic to veventq->sequence
* Add missing cpu_to_le64() around STRTAB_STE_1_MEV
* Reuse veventq->common.lock to fence sequence and num_events
* Rename overflow to lost_events and log it in upon kmalloc failure
* Correct the error handling part in iommufd_veventq_deliver_fetch()
* Add an arm_smmu_clear_vmaster() to simplify identity/blocked domain
attach ops
* Add additional four event records to forward to user space VM, and
update the uAPI doc
* Reuse the existing smmu->streams_mutex lock to fence master->vmaster
pointer, instead of adding a new rwsem
v6
https://lore.kernel.org/all/cover.1737754129.git.nicolinc@nvidia.com/
* Drop supports_veventq viommu op
* Split bug/cosmetics fixes out of the series
* Drop the blocking mutex around copy_to_user()
* Add veventq_depth in uAPI to limit vEVENTQ size
* Revise the documentation for a clear description
* Fix sparse warnings in arm_vmaster_report_event()
* Rework iommufd_viommu_get_vdev_id() to return -ENOENT v.s. 0
* Allow Abort/Bypass STEs to allocate vEVENTQ and set STE.MEV for DoS
mitigations
v5
https://lore.kernel.org/all/cover.1736237481.git.nicolinc@nvidia.com/
* Add Reviewed-by from Baolu
* Reorder the OBJ list as well
* Fix alphabetical order after renaming in v4
* Add supports_veventq viommu op for vEVENTQ type validation
v4
https://lore.kernel.org/all/cover.1735933254.git.nicolinc@nvidia.com/
* Rename "vIRQ" to "vEVENTQ"
* Use flexible array in struct iommufd_vevent
* Add the new ioctl command to union ucmd_buffer
* Fix the alphabetical order in union ucmd_buffer too
* Rename _TYPE_NONE to _TYPE_DEFAULT aligning with vIOMMU naming
v3
https://lore.kernel.org/all/cover.1734477608.git.nicolinc@nvidia.com/
* Rebase on Will's for-joerg/arm-smmu/updates for arm_smmu_event series
* Add "Reviewed-by" lines from Kevin
* Fix typos in comments, kdocs, and jump tags
* Add a patch to sort struct iommufd_ioctl_op
* Update iommufd's userpsace-api documentation
* Update uAPI kdoc to quote SMMUv3 offical spec
* Drop the unused workqueue in struct iommufd_virq
* Drop might_sleep() in iommufd_viommu_report_irq() helper
* Add missing "break" in iommufd_viommu_get_vdev_id() helper
* Shrink the scope of the vmaster's read lock in SMMUv3 driver
* Pass in two arguments to iommufd_eventq_virq_handler() helper
* Move "!ops || !ops->read" validation into iommufd_eventq_init()
* Move "fault->ictx = ictx" closer to iommufd_ctx_get(fault->ictx)
* Update commit message for arm_smmu_attach_prepare/commit_vmaster()
* Keep "iommufd_fault" as-is and rename "iommufd_eventq_virq" to just
"iommufd_virq"
v2
https://lore.kernel.org/all/cover.1733263737.git.nicolinc@nvidia.com/
* Rebase on v6.13-rc1
* Add IOPF and vIRQ in iommufd.rst (userspace-api)
* Add a proper locking in iommufd_event_virq_destroy
* Add iommufd_event_virq_abort with a lockdep_assert_held
* Rename "EVENT_*" to "EVENTQ_*" to describe the objects better
* Reorganize flows in iommufd_eventq_virq_alloc for abort() to work
* Adde struct arm_smmu_vmaster to store vSID upon attaching to a nested
domain, calling a newly added iommufd_viommu_get_vdev_id helper
* Adde an arm_vmaster_report_event helper in arm-smmu-v3-iommufd file
to simplify the routine in arm_smmu_handle_evt() of the main driver
v1
https://lore.kernel.org/all/cover.1724777091.git.nicolinc@nvidia.com/
Thanks!
Nicolin
Nicolin Chen (14):
iommufd/fault: Move two fault functions out of the header
iommufd/fault: Add an iommufd_fault_init() helper
iommufd: Abstract an iommufd_eventq from iommufd_fault
iommufd: Rename fault.c to eventq.c
iommufd: Add IOMMUFD_OBJ_VEVENTQ and IOMMUFD_CMD_VEVENTQ_ALLOC
iommufd/viommu: Add iommufd_viommu_get_vdev_id helper
iommufd/viommu: Add iommufd_viommu_report_event helper
iommufd/selftest: Require vdev_id when attaching to a nested domain
iommufd/selftest: Add IOMMU_TEST_OP_TRIGGER_VEVENT for vEVENTQ
coverage
iommufd/selftest: Add IOMMU_VEVENTQ_ALLOC test coverage
Documentation: userspace-api: iommufd: Update FAULT and VEVENTQ
iommu/arm-smmu-v3: Introduce struct arm_smmu_vmaster
iommu/arm-smmu-v3: Report events that belong to devices attached to
vIOMMU
iommu/arm-smmu-v3: Set MEV bit in nested STE for DoS mitigations
drivers/iommu/iommufd/Makefile | 2 +-
drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h | 36 ++
drivers/iommu/iommufd/iommufd_private.h | 135 +++-
drivers/iommu/iommufd/iommufd_test.h | 10 +
include/linux/iommufd.h | 23 +
include/uapi/linux/iommufd.h | 105 +++
tools/testing/selftests/iommu/iommufd_utils.h | 115 ++++
.../arm/arm-smmu-v3/arm-smmu-v3-iommufd.c | 60 ++
drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 80 ++-
drivers/iommu/iommufd/driver.c | 72 +++
drivers/iommu/iommufd/eventq.c | 597 ++++++++++++++++++
drivers/iommu/iommufd/fault.c | 342 ----------
drivers/iommu/iommufd/hw_pagetable.c | 6 +-
drivers/iommu/iommufd/main.c | 7 +
drivers/iommu/iommufd/selftest.c | 54 ++
drivers/iommu/iommufd/viommu.c | 2 +
tools/testing/selftests/iommu/iommufd.c | 36 ++
.../selftests/iommu/iommufd_fail_nth.c | 7 +
Documentation/userspace-api/iommufd.rst | 17 +
19 files changed, 1298 insertions(+), 408 deletions(-)
create mode 100644 drivers/iommu/iommufd/eventq.c
delete mode 100644 drivers/iommu/iommufd/fault.c
base-commit: a05df03a88bc1088be8e9d958f208d6484691e43
--
2.43.0
This picks up from Michal Rostecki's work[0]. Per Michal's guidance I
have omitted Co-authored tags, as the end result is quite different.
Link: https://lore.kernel.org/rust-for-linux/20240819153656.28807-2-vadorovsky@pr… [0]
Closes: https://github.com/Rust-for-Linux/linux/issues/1075
Signed-off-by: Tamir Duberstein <tamird(a)gmail.com>
---
Changes in v9:
- Rebase on rust-next.
- Restore `impl Display for BStr` which exists upstream[1].
- Link: https://doc.rust-lang.org/nightly/std/bstr/struct.ByteStr.html#impl-Display… [1]
- Link to v8: https://lore.kernel.org/r/20250203-cstr-core-v8-0-cb3f26e78686@gmail.com
Changes in v8:
- Move `{from,as}_char_ptr` back to `CStrExt`. This reduces the diff
some.
- Restore `from_bytes_with_nul_unchecked_mut`, `to_cstring`.
- Link to v7: https://lore.kernel.org/r/20250202-cstr-core-v7-0-da1802520438@gmail.com
Changes in v7:
- Rebased on mainline.
- Restore functionality added in commit a321f3ad0a5d ("rust: str: add
{make,to}_{upper,lower}case() to CString").
- Used `diff.algorithm patience` to improve diff readability.
- Link to v6: https://lore.kernel.org/r/20250202-cstr-core-v6-0-8469cd6d29fd@gmail.com
Changes in v6:
- Split the work into several commits for ease of review.
- Restore `{from,as}_char_ptr` to allow building on ARM (see commit
message).
- Add `CStrExt` to `kernel::prelude`. (Alice Ryhl)
- Remove `CStrExt::from_bytes_with_nul_unchecked_mut` and restore
`DerefMut for CString`. (Alice Ryhl)
- Rename and hide `kernel::c_str!` to encourage use of C-String
literals.
- Drop implementation and invocation changes in kunit.rs. (Trevor Gross)
- Drop docs on `Display` impl. (Trevor Gross)
- Rewrite docs in the style of the standard library.
- Restore the `test_cstr_debug` unit tests to demonstrate that the
implementation has changed.
Changes in v5:
- Keep the `test_cstr_display*` unit tests.
Changes in v4:
- Provide the `CStrExt` trait with `display()` method, which returns a
`CStrDisplay` wrapper with `Display` implementation. This addresses
the lack of `Display` implementation for `core::ffi::CStr`.
- Provide `from_bytes_with_nul_unchecked_mut()` method in `CStrExt`,
which might be useful and is going to prevent manual, unsafe casts.
- Fix a typo (s/preffered/prefered/).
Changes in v3:
- Fix the commit message.
- Remove redundant braces in `use`, when only one item is imported.
Changes in v2:
- Do not remove `c_str` macro. While it's preferred to use C-string
literals, there are two cases where `c_str` is helpful:
- When working with macros, which already return a Rust string literal
(e.g. `stringify!`).
- When building macros, where we want to take a Rust string literal as an
argument (for caller's convenience), but still use it as a C-string
internally.
- Use Rust literals as arguments in macros (`new_mutex`, `new_condvar`,
`new_mutex`). Use the `c_str` macro to convert these literals to C-string
literals.
- Use `c_str` in kunit.rs for converting the output of `stringify!` to a
`CStr`.
- Remove `DerefMut` implementation for `CString`.
---
Tamir Duberstein (4):
rust: move `CStr`'s `Display` to helper struct
rust: replace `CStr` with `core::ffi::CStr`
rust: replace `kernel::c_str!` with C-Strings
rust: remove core::ffi::CStr reexport
drivers/gpu/drm/drm_panic_qr.rs | 6 +-
drivers/net/phy/ax88796b_rust.rs | 8 +-
drivers/net/phy/qt2025.rs | 6 +-
rust/kernel/device.rs | 7 +-
rust/kernel/devres.rs | 2 +-
rust/kernel/driver.rs | 4 +-
rust/kernel/error.rs | 10 +-
rust/kernel/faux.rs | 5 +-
rust/kernel/firmware.rs | 8 +-
rust/kernel/kunit.rs | 18 +-
rust/kernel/lib.rs | 2 +-
rust/kernel/miscdevice.rs | 5 +-
rust/kernel/net/phy.rs | 12 +-
rust/kernel/of.rs | 5 +-
rust/kernel/pci.rs | 3 +-
rust/kernel/platform.rs | 7 +-
rust/kernel/prelude.rs | 2 +-
rust/kernel/seq_file.rs | 4 +-
rust/kernel/str.rs | 499 +++++++++++++----------------------
rust/kernel/sync.rs | 4 +-
rust/kernel/sync/condvar.rs | 3 +-
rust/kernel/sync/lock.rs | 4 +-
rust/kernel/sync/lock/global.rs | 6 +-
rust/kernel/sync/poll.rs | 1 +
rust/kernel/workqueue.rs | 1 +
rust/macros/module.rs | 2 +-
samples/rust/rust_driver_faux.rs | 4 +-
samples/rust/rust_driver_pci.rs | 4 +-
samples/rust/rust_driver_platform.rs | 4 +-
samples/rust/rust_misc_device.rs | 3 +-
30 files changed, 256 insertions(+), 393 deletions(-)
---
base-commit: 433b1bd6e0a98938105c43c0553f24e0747ef52c
change-id: 20250201-cstr-core-d4b9b69120cf
Best regards,
--
Tamir Duberstein <tamird(a)gmail.com>
This started with a patch that enabled `clippy::ptr_as_ptr`. Benno
Lossin suggested I also look into `clippy::ptr_cast_constness` and I
discovered `clippy::as_ptr_cast_mut`. This series now enables all 3
lints. It also enables `clippy::as_underscore` which ensures other
pointer casts weren't missed. The first commit reduces the need for
pointer casts and is shared with another series[1].
The final patch also enables pointer provenance lints and fixes
violations. See that commit message for details. The build system
portion of that commit is pretty messy but I couldn't find a better way
to convincingly ensure that these lints were applied globally.
Suggestions would be very welcome.
Link: https://lore.kernel.org/all/20250307-no-offset-v1-0-0c728f63b69c@gmail.com/ [1]
Signed-off-by: Tamir Duberstein <tamird(a)gmail.com>
---
Changes in v4:
- Add missing SoB. (Benno Lossin)
- Use `without_provenance_mut` in alloc. (Boqun Feng)
- Limit strict provenance lints to the `kernel` crate to avoid complex
logic in the build system. This can be revisited on MSRV >= 1.84.0.
- Rebase on rust-next.
- Link to v3: https://lore.kernel.org/r/20250314-ptr-as-ptr-v3-0-e7ba61048f4a@gmail.com
Changes in v3:
- Fixed clippy warning in rust/kernel/firmware.rs. (kernel test robot)
Link: https://lore.kernel.org/all/202503120332.YTCpFEvv-lkp@intel.com/
- s/as u64/as bindings::phys_addr_t/g. (Benno Lossin)
- Use strict provenance APIs and enable lints. (Benno Lossin)
- Link to v2: https://lore.kernel.org/r/20250309-ptr-as-ptr-v2-0-25d60ad922b7@gmail.com
Changes in v2:
- Fixed typo in first commit message.
- Added additional patches, converted to series.
- Link to v1: https://lore.kernel.org/r/20250307-ptr-as-ptr-v1-1-582d06514c98@gmail.com
---
Tamir Duberstein (6):
rust: retain pointer mut-ness in `container_of!`
rust: enable `clippy::ptr_as_ptr` lint
rust: enable `clippy::ptr_cast_constness` lint
rust: enable `clippy::as_ptr_cast_mut` lint
rust: enable `clippy::as_underscore` lint
rust: use strict provenance APIs
Makefile | 4 +++
init/Kconfig | 3 ++
rust/bindings/lib.rs | 1 +
rust/kernel/alloc.rs | 2 +-
rust/kernel/alloc/allocator_test.rs | 2 +-
rust/kernel/alloc/kvec.rs | 4 +--
rust/kernel/block/mq/operations.rs | 2 +-
rust/kernel/block/mq/request.rs | 7 +++--
rust/kernel/device.rs | 5 +--
rust/kernel/device_id.rs | 2 +-
rust/kernel/devres.rs | 19 ++++++------
rust/kernel/error.rs | 2 +-
rust/kernel/firmware.rs | 3 +-
rust/kernel/fs/file.rs | 2 +-
rust/kernel/io.rs | 16 +++++-----
rust/kernel/kunit.rs | 15 +++++----
rust/kernel/lib.rs | 57 ++++++++++++++++++++++++++++++++--
rust/kernel/list/impl_list_item_mod.rs | 2 +-
rust/kernel/miscdevice.rs | 2 +-
rust/kernel/of.rs | 6 ++--
rust/kernel/pci.rs | 15 +++++----
rust/kernel/platform.rs | 6 ++--
rust/kernel/print.rs | 11 +++----
rust/kernel/rbtree.rs | 23 ++++++--------
rust/kernel/seq_file.rs | 3 +-
rust/kernel/str.rs | 18 +++++------
rust/kernel/sync/poll.rs | 2 +-
rust/kernel/uaccess.rs | 12 ++++---
rust/kernel/workqueue.rs | 12 +++----
rust/uapi/lib.rs | 1 +
30 files changed, 162 insertions(+), 97 deletions(-)
---
base-commit: 2aadc0fc1f85d7a9ed2822ba7ee9f06775eb6d84
change-id: 20250307-ptr-as-ptr-21b1867fc4d4
Best regards,
--
Tamir Duberstein <tamird(a)gmail.com>
As discussed here:
https://lore.kernel.org/lkml/Z9RRkL1hom48z3Tt@google.com/
This code could benefit from some more commentary.
To avoid needing to comment the same thing in multiple places (I guess
more of these SKIPs will need to be added over time, for now I am only
like 20% of the way through Project Run run_vmtests.sh Successfully),
add a dummy "skip tests for this specific reason" function that
basically just serves as a hook to hang comments on.
Signed-off-by: Brendan Jackman <jackmanb(a)google.com>
---
To: David Hildenbrand <david(a)redhat.com>
---
tools/testing/selftests/mm/gup_longterm.c | 6 +-----
tools/testing/selftests/mm/map_populate.c | 8 +++-----
tools/testing/selftests/mm/vm_util.h | 18 ++++++++++++++++++
3 files changed, 22 insertions(+), 10 deletions(-)
diff --git a/tools/testing/selftests/mm/gup_longterm.c b/tools/testing/selftests/mm/gup_longterm.c
index 03271442aae5aed060fd44010df552a2eedcdafc..21595b20bbc391a0e5d0ab0563ac4ce5e1e0069f 100644
--- a/tools/testing/selftests/mm/gup_longterm.c
+++ b/tools/testing/selftests/mm/gup_longterm.c
@@ -97,11 +97,7 @@ static void do_test(int fd, size_t size, enum test_type type, bool shared)
if (ftruncate(fd, size)) {
if (errno == ENOENT) {
- /*
- * This can happen if the file has been unlinked and the
- * filesystem doesn't support truncating unlinked files.
- */
- ksft_test_result_skip("ftruncate() failed with ENOENT\n");
+ skip_test_dodgy_fs("ftruncate()");
} else {
ksft_test_result_fail("ftruncate() failed (%s)\n", strerror(errno));
}
diff --git a/tools/testing/selftests/mm/map_populate.c b/tools/testing/selftests/mm/map_populate.c
index 433e54fb634f793f2eb4c53ba6b791045c9f4986..9df2636c829bf34d6d0517e126b3deda1f3ba834 100644
--- a/tools/testing/selftests/mm/map_populate.c
+++ b/tools/testing/selftests/mm/map_populate.c
@@ -18,6 +18,8 @@
#include <unistd.h>
#include "../kselftest.h"
+#include "vm_util.h"
+
#define MMAP_SZ 4096
#define BUG_ON(condition, description) \
@@ -88,11 +90,7 @@ int main(int argc, char **argv)
ret = ftruncate(fileno(ftmp), MMAP_SZ);
if (ret < 0 && errno == ENOENT) {
- /*
- * This probably means tmpfile() made a file on a filesystem
- * that doesn't handle temporary files the way we want.
- */
- ksft_exit_skip("ftruncate(fileno(tmpfile())) gave ENOENT, weird filesystem?\n");
+ skip_test_dodgy_fs("ftruncate()");
}
BUG_ON(ret, "ftruncate()");
diff --git a/tools/testing/selftests/mm/vm_util.h b/tools/testing/selftests/mm/vm_util.h
index 0e629586556b5aae580d8e4ce7491bc93adcc4d6..6effafdc4d8a23f91f0adcb9e43d6196d651ba88 100644
--- a/tools/testing/selftests/mm/vm_util.h
+++ b/tools/testing/selftests/mm/vm_util.h
@@ -5,6 +5,7 @@
#include <err.h>
#include <strings.h> /* ffsl() */
#include <unistd.h> /* _SC_PAGESIZE */
+#include "../kselftest.h"
#define BIT_ULL(nr) (1ULL << (nr))
#define PM_SOFT_DIRTY BIT_ULL(55)
@@ -32,6 +33,23 @@ static inline unsigned int pshift(void)
return __page_shift;
}
+/*
+ * Plan 9 FS has bugs (at least on QEMU) where certain operations fail with
+ * ENOENT on unlinked files. See
+ * https://gitlab.com/qemu-project/qemu/-/issues/103 for some info about such
+ * bugs. There are rumours of NFS implementations with similar bugs.
+ *
+ * Ideally, tests should just detect filesystems known to have such issues and
+ * bail early. But 9pfs has the additional "feature" that it causes fstatfs to
+ * pass through the f_type field from the host filesystem. To avoid having to
+ * scrape /proc/mounts or some other hackery, tests can call this function when
+ * it seems such a bug might have been encountered.
+ */
+static inline void skip_test_dodgy_fs(const char *op_name)
+{
+ ksft_test_result_skip("%s failed with ENOENT. Filesystem might be buggy (9pfs?)\n", op_name);
+}
+
uint64_t pagemap_get_entry(int fd, char *start);
bool pagemap_is_softdirty(int fd, char *start);
bool pagemap_is_swapped(int fd, char *start);
---
base-commit: a91aaf8dd549dcee9caab227ecaa6cbc243bbc5a
change-id: 20250317-9pfs-comments-24b6fa5417cd
Best regards,
--
Brendan Jackman <jackmanb(a)google.com>
Signal delivery during connect() may disconnect an already established
socket. Problem is that such socket might have been placed in a sockmap
before the connection was closed.
PATCH 1 ensures this race won't lead to an unconnected vsock staying in the
sockmap. PATCH 2 selftests it.
PATCH 3 fixes a related race. Note that selftest in PATCH 2 does test this
code as well, but winning this race variant may take more than 2 seconds,
so I'm not advertising it.
Signed-off-by: Michal Luczaj <mhal(a)rbox.co>
---
Changes in v3:
- Selftest: drop unnecessary variable initialization and reorder the calls
- Link to v2: https://lore.kernel.org/r/20250314-vsock-trans-signal-race-v2-0-421a41f60f4…
Changes in v2:
- Handle one more path of tripping the warning
- Add a selftest
- Collect R-b [Stefano]
- Link to v1: https://lore.kernel.org/r/20250307-vsock-trans-signal-race-v1-1-3aca3f771fb…
---
Michal Luczaj (3):
vsock/bpf: Fix EINTR connect() racing sockmap update
selftest/bpf: Add test for AF_VSOCK connect() racing sockmap update
vsock/bpf: Fix bpf recvmsg() racing transport reassignment
net/vmw_vsock/af_vsock.c | 10 ++-
net/vmw_vsock/vsock_bpf.c | 24 ++++--
.../selftests/bpf/prog_tests/sockmap_basic.c | 97 ++++++++++++++++++++++
3 files changed, 122 insertions(+), 9 deletions(-)
---
base-commit: da9e8efe7ee10e8425dc356a9fc593502c8e3933
change-id: 20250305-vsock-trans-signal-race-d62f7718d099
Best regards,
--
Michal Luczaj <mhal(a)rbox.co>
virtio-net have two usage of hashes: one is RSS and another is hash
reporting. Conventionally the hash calculation was done by the VMM.
However, computing the hash after the queue was chosen defeats the
purpose of RSS.
Another approach is to use eBPF steering program. This approach has
another downside: it cannot report the calculated hash due to the
restrictive nature of eBPF.
Introduce the code to compute hashes to the kernel in order to overcome
thse challenges.
An alternative solution is to extend the eBPF steering program so that it
will be able to report to the userspace, but it is based on context
rewrites, which is in feature freeze. We can adopt kfuncs, but they will
not be UAPIs. We opt to ioctl to align with other relevant UAPIs (KVM
and vhost_net).
The patches for QEMU to use this new feature was submitted as RFC and
is available at:
https://patchew.org/QEMU/20250313-hash-v4-0-c75c494b495e@daynix.com/
This work was presented at LPC 2024:
https://lpc.events/event/18/contributions/1963/
V1 -> V2:
Changed to introduce a new BPF program type.
Signed-off-by: Akihiko Odaki <akihiko.odaki(a)daynix.com>
---
Changes in v10:
- Split common code and TUN/TAP-specific code into separate patches.
- Reverted a spurious style change in patch "tun: Introduce virtio-net
hash feature".
- Added a comment explaining disable_ipv6 in tests.
- Used AF_PACKET for patch "selftest: tun: Add tests for
virtio-net hashing". I also added the usage of FIXTURE_VARIANT() as
the testing function now needs access to more variant-specific
variables.
- Corrected the message of patch "selftest: tun: Add tests for
virtio-net hashing"; it mentioned validation of configuration but
it is not scope of this patch.
- Expanded the description of patch "selftest: tun: Add tests for
virtio-net hashing".
- Added patch "tun: Allow steering eBPF program to fall back".
- Changed to handle TUNGETVNETHASHCAP before taking the rtnl lock.
- Removed redundant tests for tun_vnet_ioctl().
- Added patch "selftest: tap: Add tests for virtio-net ioctls".
- Added a design explanation of ioctls for extensibility and migration.
- Removed a few branches in patch
"vhost/net: Support VIRTIO_NET_F_HASH_REPORT".
- Link to v9: https://lore.kernel.org/r/20250307-rss-v9-0-df76624025eb@daynix.com
Changes in v9:
- Added a missing return statement in patch
"tun: Introduce virtio-net hash feature".
- Link to v8: https://lore.kernel.org/r/20250306-rss-v8-0-7ab4f56ff423@daynix.com
Changes in v8:
- Disabled IPv6 to eliminate noises in tests.
- Added a branch in tap to avoid unnecessary dissection when hash
reporting is disabled.
- Removed unnecessary rtnl_lock().
- Extracted code to handle new ioctls into separate functions to avoid
adding extra NULL checks to the code handling other ioctls.
- Introduced variable named "fd" to __tun_chr_ioctl().
- s/-/=/g in a patch message to avoid confusing Git.
- Link to v7: https://lore.kernel.org/r/20250228-rss-v7-0-844205cbbdd6@daynix.com
Changes in v7:
- Ensured to set hash_report to VIRTIO_NET_HASH_REPORT_NONE for
VHOST_NET_F_VIRTIO_NET_HDR.
- s/4/sizeof(u32)/ in patch "virtio_net: Add functions for hashing".
- Added tap_skb_cb type.
- Rebased.
- Link to v6: https://lore.kernel.org/r/20250109-rss-v6-0-b1c90ad708f6@daynix.com
Changes in v6:
- Extracted changes to fill vnet header holes into another series.
- Squashed patches "skbuff: Introduce SKB_EXT_TUN_VNET_HASH", "tun:
Introduce virtio-net hash reporting feature", and "tun: Introduce
virtio-net RSS" into patch "tun: Introduce virtio-net hash feature".
- Dropped the RFC tag.
- Link to v5: https://lore.kernel.org/r/20241008-rss-v5-0-f3cf68df005d@daynix.com
Changes in v5:
- Fixed a compilation error with CONFIG_TUN_VNET_CROSS_LE.
- Optimized the calculation of the hash value according to:
https://git.dpdk.org/dpdk/commit/?id=3fb1ea032bd6ff8317af5dac9af901f1f324ca…
- Added patch "tun: Unify vnet implementation".
- Dropped patch "tap: Pad virtio header with zero".
- Added patch "selftest: tun: Test vnet ioctls without device".
- Reworked selftests to skip for older kernels.
- Documented the case when the underlying device is deleted and packets
have queue_mapping set by TC.
- Reordered test harness arguments.
- Added code to handle fragmented packets.
- Link to v4: https://lore.kernel.org/r/20240924-rss-v4-0-84e932ec0e6c@daynix.com
Changes in v4:
- Moved tun_vnet_hash_ext to if_tun.h.
- Renamed virtio_net_toeplitz() to virtio_net_toeplitz_calc().
- Replaced htons() with cpu_to_be16().
- Changed virtio_net_hash_rss() to return void.
- Reordered variable declarations in virtio_net_hash_rss().
- Removed virtio_net_hdr_v1_hash_from_skb().
- Updated messages of "tap: Pad virtio header with zero" and
"tun: Pad virtio header with zero".
- Fixed vnet_hash allocation size.
- Ensured to free vnet_hash when destructing tun_struct.
- Link to v3: https://lore.kernel.org/r/20240915-rss-v3-0-c630015db082@daynix.com
Changes in v3:
- Reverted back to add ioctl.
- Split patch "tun: Introduce virtio-net hashing feature" into
"tun: Introduce virtio-net hash reporting feature" and
"tun: Introduce virtio-net RSS".
- Changed to reuse hash values computed for automq instead of performing
RSS hashing when hash reporting is requested but RSS is not.
- Extracted relevant data from struct tun_struct to keep it minimal.
- Added kernel-doc.
- Changed to allow calling TUNGETVNETHASHCAP before TUNSETIFF.
- Initialized num_buffers with 1.
- Added a test case for unclassified packets.
- Fixed error handling in tests.
- Changed tests to verify that the queue index will not overflow.
- Rebased.
- Link to v2: https://lore.kernel.org/r/20231015141644.260646-1-akihiko.odaki@daynix.com
---
Akihiko Odaki (10):
virtio_net: Add functions for hashing
net: flow_dissector: Export flow_keys_dissector_symmetric
tun: Allow steering eBPF program to fall back
tun: Add common virtio-net hash feature code
tun: Introduce virtio-net hash feature
tap: Introduce virtio-net hash feature
selftest: tun: Test vnet ioctls without device
selftest: tun: Add tests for virtio-net hashing
selftest: tap: Add tests for virtio-net ioctls
vhost/net: Support VIRTIO_NET_F_HASH_REPORT
Documentation/networking/tuntap.rst | 7 +
drivers/net/Kconfig | 1 +
drivers/net/tap.c | 68 ++++-
drivers/net/tun.c | 90 +++++--
drivers/net/tun_vnet.h | 155 ++++++++++-
drivers/vhost/net.c | 68 ++---
include/linux/if_tap.h | 2 +
include/linux/skbuff.h | 3 +
include/linux/virtio_net.h | 188 ++++++++++++++
include/net/flow_dissector.h | 1 +
include/uapi/linux/if_tun.h | 82 ++++++
net/core/flow_dissector.c | 3 +-
net/core/skbuff.c | 4 +
tools/testing/selftests/net/Makefile | 2 +-
tools/testing/selftests/net/tap.c | 97 ++++++-
tools/testing/selftests/net/tun.c | 491 ++++++++++++++++++++++++++++++++++-
16 files changed, 1185 insertions(+), 77 deletions(-)
---
base-commit: dd83757f6e686a2188997cb58b5975f744bb7786
change-id: 20240403-rss-e737d89efa77
prerequisite-change-id: 20241230-tun-66e10a49b0c7:v6
prerequisite-patch-id: 871dc5f146fb6b0e3ec8612971a8e8190472c0fb
prerequisite-patch-id: 2797ed249d32590321f088373d4055ff3f430a0e
prerequisite-patch-id: ea3370c72d4904e2f0536ec76ba5d26784c0cede
prerequisite-patch-id: 837e4cf5d6b451424f9b1639455e83a260c4440d
prerequisite-patch-id: ea701076f57819e844f5a35efe5cbc5712d3080d
prerequisite-patch-id: 701646fb43ad04cc64dd2bf13c150ccbe6f828ce
prerequisite-patch-id: 53176dae0c003f5b6c114d43f936cf7140d31bb5
prerequisite-change-id: 20250116-buffers-96e14bf023fc:v2
prerequisite-patch-id: 25fd4f99d4236a05a5ef16ab79f3e85ee57e21cc
Best regards,
--
Akihiko Odaki <akihiko.odaki(a)daynix.com>
On Friday, 14 March 2025 05:14:30 CDT Su Hui wrote:
> On 2025/3/14 17:21, Dan Carpenter wrote:
> > On Fri, Mar 14, 2025 at 03:14:51PM +0800, Su Hui wrote:
> >> When 'manual=false' and 'signaled=true', then expected value when using
> >> NTSYNC_IOC_CREATE_EVENT should be greater than zero. Fix this typo error.
> >>
> >> Signed-off-by: Su Hui<suhui(a)nfschina.com>
> >> ---
> >> tools/testing/selftests/drivers/ntsync/ntsync.c | 2 +-
> >> 1 file changed, 1 insertion(+), 1 deletion(-)
> >>
> >> diff --git a/tools/testing/selftests/drivers/ntsync/ntsync.c b/tools/testing/selftests/drivers/ntsync/ntsync.c
> >> index 3aad311574c4..bfb6fad653d0 100644
> >> --- a/tools/testing/selftests/drivers/ntsync/ntsync.c
> >> +++ b/tools/testing/selftests/drivers/ntsync/ntsync.c
> >> @@ -968,7 +968,7 @@ TEST(wake_all)
> >> auto_event_args.manual = false;
> >> auto_event_args.signaled = true;
> >> objs[3] = ioctl(fd, NTSYNC_IOC_CREATE_EVENT, &auto_event_args);
> >> - EXPECT_EQ(0, objs[3]);
> >> + EXPECT_LE(0, objs[3]);
> > It's kind of weird how these macros put the constant on the left.
> > It returns an "fd" on success. So this look reasonable. It probably
> > won't return the zero fd so we could probably check EXPECT_LT()?
> Agreed, there are about 29 items that can be changed to EXPECT_LT().
> I can send a v2 patchset with this change if there is no more other
> suggestions.
I personally think it looks wrong to use EXPECT_LT(), but I'll certainly defer to a higher maintainer on this point.
Replacing all occurrences of `addr_of!(place)` with `&raw const place`, and
all occurrences of `addr_of_mut!(place)` with `&raw mut place`.
Utilizing the new feature will allow us to reduce macro complexity, and
improve consistency with existing reference syntax as `&raw const`, `&raw mut`
is very similar to `&`, `&mut` making it fit more naturally with other
existing code.
Suggested-by: Benno Lossin <benno.lossin(a)proton.me>
Link: https://github.com/Rust-for-Linux/linux/issues/1148
Signed-off-by: Antonio Hickey <contact(a)antoniohickey.com>
---
rust/kernel/block/mq/request.rs | 4 ++--
rust/kernel/faux.rs | 4 ++--
rust/kernel/fs/file.rs | 2 +-
rust/kernel/init.rs | 8 ++++----
rust/kernel/init/macros.rs | 28 +++++++++++++-------------
rust/kernel/jump_label.rs | 4 ++--
rust/kernel/kunit.rs | 4 ++--
rust/kernel/list.rs | 2 +-
rust/kernel/list/impl_list_item_mod.rs | 6 +++---
rust/kernel/net/phy.rs | 4 ++--
rust/kernel/pci.rs | 4 ++--
rust/kernel/platform.rs | 4 +---
rust/kernel/rbtree.rs | 22 ++++++++++----------
rust/kernel/sync/arc.rs | 2 +-
rust/kernel/task.rs | 4 ++--
rust/kernel/workqueue.rs | 8 ++++----
16 files changed, 54 insertions(+), 56 deletions(-)
diff --git a/rust/kernel/block/mq/request.rs b/rust/kernel/block/mq/request.rs
index 7943f43b9575..4a5b7ec914ef 100644
--- a/rust/kernel/block/mq/request.rs
+++ b/rust/kernel/block/mq/request.rs
@@ -12,7 +12,7 @@
};
use core::{
marker::PhantomData,
- ptr::{addr_of_mut, NonNull},
+ ptr::NonNull,
sync::atomic::{AtomicU64, Ordering},
};
@@ -187,7 +187,7 @@ pub(crate) fn refcount(&self) -> &AtomicU64 {
pub(crate) unsafe fn refcount_ptr(this: *mut Self) -> *mut AtomicU64 {
// SAFETY: Because of the safety requirements of this function, the
// field projection is safe.
- unsafe { addr_of_mut!((*this).refcount) }
+ unsafe { &raw mut (*this).refcount }
}
}
diff --git a/rust/kernel/faux.rs b/rust/kernel/faux.rs
index 5acc0c02d451..52ac554c1119 100644
--- a/rust/kernel/faux.rs
+++ b/rust/kernel/faux.rs
@@ -7,7 +7,7 @@
//! C header: [`include/linux/device/faux.h`]
use crate::{bindings, device, error::code::*, prelude::*};
-use core::ptr::{addr_of_mut, null, null_mut, NonNull};
+use core::ptr::{null, null_mut, NonNull};
/// The registration of a faux device.
///
@@ -45,7 +45,7 @@ impl AsRef<device::Device> for Registration {
fn as_ref(&self) -> &device::Device {
// SAFETY: The underlying `device` in `faux_device` is guaranteed by the C API to be
// a valid initialized `device`.
- unsafe { device::Device::as_ref(addr_of_mut!((*self.as_raw()).dev)) }
+ unsafe { device::Device::as_ref((&raw mut (*self.as_raw()).dev)) }
}
}
diff --git a/rust/kernel/fs/file.rs b/rust/kernel/fs/file.rs
index ed57e0137cdb..7ee4830b67f3 100644
--- a/rust/kernel/fs/file.rs
+++ b/rust/kernel/fs/file.rs
@@ -331,7 +331,7 @@ pub fn flags(&self) -> u32 {
// SAFETY: The file is valid because the shared reference guarantees a nonzero refcount.
//
// FIXME(read_once): Replace with `read_once` when available on the Rust side.
- unsafe { core::ptr::addr_of!((*self.as_ptr()).f_flags).read_volatile() }
+ unsafe { (&raw const (*self.as_ptr()).f_flags).read_volatile() }
}
}
diff --git a/rust/kernel/init.rs b/rust/kernel/init.rs
index 7fd1ea8265a5..a8fac6558671 100644
--- a/rust/kernel/init.rs
+++ b/rust/kernel/init.rs
@@ -122,7 +122,7 @@
//! ```rust
//! # #![expect(unreachable_pub, clippy::disallowed_names)]
//! use kernel::{init, types::Opaque};
-//! use core::{ptr::addr_of_mut, marker::PhantomPinned, pin::Pin};
+//! use core::{marker::PhantomPinned, pin::Pin};
//! # mod bindings {
//! # #![expect(non_camel_case_types)]
//! # #![expect(clippy::missing_safety_doc)]
@@ -159,7 +159,7 @@
//! unsafe {
//! init::pin_init_from_closure(move |slot: *mut Self| {
//! // `slot` contains uninit memory, avoid creating a reference.
-//! let foo = addr_of_mut!((*slot).foo);
+//! let foo = &raw mut (*slot).foo;
//!
//! // Initialize the `foo`
//! bindings::init_foo(Opaque::raw_get(foo));
@@ -541,7 +541,7 @@ macro_rules! stack_try_pin_init {
///
/// ```rust
/// # use kernel::{macros::{Zeroable, pin_data}, pin_init};
-/// # use core::{ptr::addr_of_mut, marker::PhantomPinned};
+/// # use core::marker::PhantomPinned;
/// #[pin_data]
/// #[derive(Zeroable)]
/// struct Buf {
@@ -554,7 +554,7 @@ macro_rules! stack_try_pin_init {
/// pin_init!(&this in Buf {
/// buf: [0; 64],
/// // SAFETY: TODO.
-/// ptr: unsafe { addr_of_mut!((*this.as_ptr()).buf).cast() },
+/// ptr: unsafe { &raw mut (*this.as_ptr()).buf.cast() },
/// pin: PhantomPinned,
/// });
/// pin_init!(Buf {
diff --git a/rust/kernel/init/macros.rs b/rust/kernel/init/macros.rs
index 1fd146a83241..af525fbb2f01 100644
--- a/rust/kernel/init/macros.rs
+++ b/rust/kernel/init/macros.rs
@@ -244,25 +244,25 @@
//! struct __InitOk;
//! // This is the expansion of `t,`, which is syntactic sugar for `t: t,`.
//! {
-//! unsafe { ::core::ptr::write(::core::addr_of_mut!((*slot).t), t) };
+//! unsafe { ::core::ptr::write(&raw mut (*slot).t, t) };
//! }
//! // Since initialization could fail later (not in this case, since the
//! // error type is `Infallible`) we will need to drop this field if there
//! // is an error later. This `DropGuard` will drop the field when it gets
//! // dropped and has not yet been forgotten.
//! let __t_guard = unsafe {
-//! ::pinned_init::__internal::DropGuard::new(::core::addr_of_mut!((*slot).t))
+//! ::pinned_init::__internal::DropGuard::new(&raw mut (*slot).t)
//! };
//! // Expansion of `x: 0,`:
//! // Since this can be an arbitrary expression we cannot place it inside
//! // of the `unsafe` block, so we bind it here.
//! {
//! let x = 0;
-//! unsafe { ::core::ptr::write(::core::addr_of_mut!((*slot).x), x) };
+//! unsafe { ::core::ptr::write(&raw mut (*slot).x, x) };
//! }
//! // We again create a `DropGuard`.
//! let __x_guard = unsafe {
-//! ::kernel::init::__internal::DropGuard::new(::core::addr_of_mut!((*slot).x))
+//! ::kernel::init::__internal::DropGuard::new(&raw mut (*slot).x)
//! };
//! // Since initialization has successfully completed, we can now forget
//! // the guards. This is not `mem::forget`, since we only have
@@ -459,15 +459,15 @@
//! {
//! struct __InitOk;
//! {
-//! unsafe { ::core::ptr::write(::core::addr_of_mut!((*slot).a), a) };
+//! unsafe { ::core::ptr::write(&raw mut (*slot).a, a) };
//! }
//! let __a_guard = unsafe {
-//! ::kernel::init::__internal::DropGuard::new(::core::addr_of_mut!((*slot).a))
+//! ::kernel::init::__internal::DropGuard::new(&raw mut (*slot).a)
//! };
//! let init = Bar::new(36);
-//! unsafe { data.b(::core::addr_of_mut!((*slot).b), b)? };
+//! unsafe { data.b(&raw mut (*slot).b, b)? };
//! let __b_guard = unsafe {
-//! ::kernel::init::__internal::DropGuard::new(::core::addr_of_mut!((*slot).b))
+//! ::kernel::init::__internal::DropGuard::new(&raw mut (*slot).b)
//! };
//! ::core::mem::forget(__b_guard);
//! ::core::mem::forget(__a_guard);
@@ -1210,7 +1210,7 @@ fn assert_zeroable<T: $crate::init::Zeroable>(_: *mut T) {}
// SAFETY: `slot` is valid, because we are inside of an initializer closure, we
// return when an error/panic occurs.
// We also use the `data` to require the correct trait (`Init` or `PinInit`) for `$field`.
- unsafe { $data.$field(::core::ptr::addr_of_mut!((*$slot).$field), init)? };
+ unsafe { $data.$field(&raw mut (*$slot).$field, init)? };
// Create the drop guard:
//
// We rely on macro hygiene to make it impossible for users to access this local variable.
@@ -1218,7 +1218,7 @@ fn assert_zeroable<T: $crate::init::Zeroable>(_: *mut T) {}
::kernel::macros::paste! {
// SAFETY: We forget the guard later when initialization has succeeded.
let [< __ $field _guard >] = unsafe {
- $crate::init::__internal::DropGuard::new(::core::ptr::addr_of_mut!((*$slot).$field))
+ $crate::init::__internal::DropGuard::new(&raw mut (*$slot).$field)
};
$crate::__init_internal!(init_slot($use_data):
@@ -1241,7 +1241,7 @@ fn assert_zeroable<T: $crate::init::Zeroable>(_: *mut T) {}
//
// SAFETY: `slot` is valid, because we are inside of an initializer closure, we
// return when an error/panic occurs.
- unsafe { $crate::init::Init::__init(init, ::core::ptr::addr_of_mut!((*$slot).$field))? };
+ unsafe { $crate::init::Init::__init(init, &raw mut (*$slot).$field)? };
// Create the drop guard:
//
// We rely on macro hygiene to make it impossible for users to access this local variable.
@@ -1249,7 +1249,7 @@ fn assert_zeroable<T: $crate::init::Zeroable>(_: *mut T) {}
::kernel::macros::paste! {
// SAFETY: We forget the guard later when initialization has succeeded.
let [< __ $field _guard >] = unsafe {
- $crate::init::__internal::DropGuard::new(::core::ptr::addr_of_mut!((*$slot).$field))
+ $crate::init::__internal::DropGuard::new(&raw mut (*$slot).$field)
};
$crate::__init_internal!(init_slot():
@@ -1272,7 +1272,7 @@ fn assert_zeroable<T: $crate::init::Zeroable>(_: *mut T) {}
// Initialize the field.
//
// SAFETY: The memory at `slot` is uninitialized.
- unsafe { ::core::ptr::write(::core::ptr::addr_of_mut!((*$slot).$field), $field) };
+ unsafe { ::core::ptr::write(&raw mut (*$slot).$field, $field) };
}
// Create the drop guard:
//
@@ -1281,7 +1281,7 @@ fn assert_zeroable<T: $crate::init::Zeroable>(_: *mut T) {}
::kernel::macros::paste! {
// SAFETY: We forget the guard later when initialization has succeeded.
let [< __ $field _guard >] = unsafe {
- $crate::init::__internal::DropGuard::new(::core::ptr::addr_of_mut!((*$slot).$field))
+ $crate::init::__internal::DropGuard::new(&raw mut (*$slot).$field)
};
$crate::__init_internal!(init_slot($($use_data)?):
diff --git a/rust/kernel/jump_label.rs b/rust/kernel/jump_label.rs
index 4e974c768dbd..ca10abae0eee 100644
--- a/rust/kernel/jump_label.rs
+++ b/rust/kernel/jump_label.rs
@@ -20,8 +20,8 @@
#[macro_export]
macro_rules! static_branch_unlikely {
($key:path, $keytyp:ty, $field:ident) => {{
- let _key: *const $keytyp = ::core::ptr::addr_of!($key);
- let _key: *const $crate::bindings::static_key_false = ::core::ptr::addr_of!((*_key).$field);
+ let _key: *const $keytyp = &raw const $key;
+ let _key: *const $crate::bindings::static_key_false = &raw const (*_key).$field;
let _key: *const $crate::bindings::static_key = _key.cast();
#[cfg(not(CONFIG_JUMP_LABEL))]
diff --git a/rust/kernel/kunit.rs b/rust/kernel/kunit.rs
index 824da0e9738a..a17ef3b2e860 100644
--- a/rust/kernel/kunit.rs
+++ b/rust/kernel/kunit.rs
@@ -128,9 +128,9 @@ unsafe impl Sync for UnaryAssert {}
unsafe {
$crate::bindings::__kunit_do_failed_assertion(
kunit_test,
- core::ptr::addr_of!(LOCATION.0),
+ &raw const LOCATION.0,
$crate::bindings::kunit_assert_type_KUNIT_ASSERTION,
- core::ptr::addr_of!(ASSERTION.0.assert),
+ &raw const ASSERTION.0.assert,
Some($crate::bindings::kunit_unary_assert_format),
core::ptr::null(),
);
diff --git a/rust/kernel/list.rs b/rust/kernel/list.rs
index c0ed227b8a4f..e98f0820f002 100644
--- a/rust/kernel/list.rs
+++ b/rust/kernel/list.rs
@@ -176,7 +176,7 @@ pub fn new() -> impl PinInit<Self> {
#[inline]
unsafe fn fields(me: *mut Self) -> *mut ListLinksFields {
// SAFETY: The caller promises that the pointer is valid.
- unsafe { Opaque::raw_get(ptr::addr_of!((*me).inner)) }
+ unsafe { Opaque::raw_get(&raw const (*me).inner) }
}
/// # Safety
diff --git a/rust/kernel/list/impl_list_item_mod.rs b/rust/kernel/list/impl_list_item_mod.rs
index a0438537cee1..014b6713d59d 100644
--- a/rust/kernel/list/impl_list_item_mod.rs
+++ b/rust/kernel/list/impl_list_item_mod.rs
@@ -49,7 +49,7 @@ macro_rules! impl_has_list_links {
// SAFETY: The implementation of `raw_get_list_links` only compiles if the field has the
// right type.
//
- // The behavior of `raw_get_list_links` is not changed since the `addr_of_mut!` macro is
+ // The behavior of `raw_get_list_links` is not changed since the `&raw mut` op is
// equivalent to the pointer offset operation in the trait definition.
unsafe impl$(<$($implarg),*>)? $crate::list::HasListLinks$(<$id>)? for
$self $(<$($selfarg),*>)?
@@ -61,7 +61,7 @@ unsafe fn raw_get_list_links(ptr: *mut Self) -> *mut $crate::list::ListLinks$(<$
// SAFETY: The caller promises that the pointer is not dangling. We know that this
// expression doesn't follow any pointers, as the `offset_of!` invocation above
// would otherwise not compile.
- unsafe { ::core::ptr::addr_of_mut!((*ptr)$(.$field)*) }
+ unsafe { &raw mut (*ptr)$(.$field)* }
}
}
)*};
@@ -103,7 +103,7 @@ macro_rules! impl_has_list_links_self_ptr {
unsafe fn raw_get_list_links(ptr: *mut Self) -> *mut $crate::list::ListLinks$(<$id>)? {
// SAFETY: The caller promises that the pointer is not dangling.
let ptr: *mut $crate::list::ListLinksSelfPtr<$item_type $(, $id)?> =
- unsafe { ::core::ptr::addr_of_mut!((*ptr).$field) };
+ unsafe { &raw mut (*ptr).$field };
ptr.cast()
}
}
diff --git a/rust/kernel/net/phy.rs b/rust/kernel/net/phy.rs
index a59469c785e3..757db052cc09 100644
--- a/rust/kernel/net/phy.rs
+++ b/rust/kernel/net/phy.rs
@@ -7,7 +7,7 @@
//! C headers: [`include/linux/phy.h`](srctree/include/linux/phy.h).
use crate::{error::*, prelude::*, types::Opaque};
-use core::{marker::PhantomData, ptr::addr_of_mut};
+use core::marker::PhantomData;
pub mod reg;
@@ -285,7 +285,7 @@ impl AsRef<kernel::device::Device> for Device {
fn as_ref(&self) -> &kernel::device::Device {
let phydev = self.0.get();
// SAFETY: The struct invariant ensures that `mdio.dev` is valid.
- unsafe { kernel::device::Device::as_ref(addr_of_mut!((*phydev).mdio.dev)) }
+ unsafe { kernel::device::Device::as_ref(&raw mut (*phydev).mdio.dev) }
}
}
diff --git a/rust/kernel/pci.rs b/rust/kernel/pci.rs
index f7b2743828ae..6cb9ed1e7cbf 100644
--- a/rust/kernel/pci.rs
+++ b/rust/kernel/pci.rs
@@ -17,7 +17,7 @@
types::{ARef, ForeignOwnable, Opaque},
ThisModule,
};
-use core::{ops::Deref, ptr::addr_of_mut};
+use core::ops::Deref;
use kernel::prelude::*;
/// An adapter for the registration of PCI drivers.
@@ -60,7 +60,7 @@ extern "C" fn probe_callback(
) -> kernel::ffi::c_int {
// SAFETY: The PCI bus only ever calls the probe callback with a valid pointer to a
// `struct pci_dev`.
- let dev = unsafe { device::Device::get_device(addr_of_mut!((*pdev).dev)) };
+ let dev = unsafe { device::Device::get_device(&raw mut (*pdev).dev) };
// SAFETY: `dev` is guaranteed to be embedded in a valid `struct pci_dev` by the call
// above.
let mut pdev = unsafe { Device::from_dev(dev) };
diff --git a/rust/kernel/platform.rs b/rust/kernel/platform.rs
index 1297f5292ba9..344875ad7b82 100644
--- a/rust/kernel/platform.rs
+++ b/rust/kernel/platform.rs
@@ -14,8 +14,6 @@
ThisModule,
};
-use core::ptr::addr_of_mut;
-
/// An adapter for the registration of platform drivers.
pub struct Adapter<T: Driver>(T);
@@ -55,7 +53,7 @@ unsafe fn unregister(pdrv: &Opaque<Self::RegType>) {
impl<T: Driver + 'static> Adapter<T> {
extern "C" fn probe_callback(pdev: *mut bindings::platform_device) -> kernel::ffi::c_int {
// SAFETY: The platform bus only ever calls the probe callback with a valid `pdev`.
- let dev = unsafe { device::Device::get_device(addr_of_mut!((*pdev).dev)) };
+ let dev = unsafe { device::Device::get_device(&raw mut (*pdev).dev) };
// SAFETY: `dev` is guaranteed to be embedded in a valid `struct platform_device` by the
// call above.
let mut pdev = unsafe { Device::from_dev(dev) };
diff --git a/rust/kernel/rbtree.rs b/rust/kernel/rbtree.rs
index 1ea25c7092fb..b0ad35663cb0 100644
--- a/rust/kernel/rbtree.rs
+++ b/rust/kernel/rbtree.rs
@@ -11,7 +11,7 @@
cmp::{Ord, Ordering},
marker::PhantomData,
mem::MaybeUninit,
- ptr::{addr_of_mut, from_mut, NonNull},
+ ptr::{from_mut, NonNull},
};
/// A red-black tree with owned nodes.
@@ -238,7 +238,7 @@ pub fn values_mut(&mut self) -> impl Iterator<Item = &'_ mut V> {
/// Returns a cursor over the tree nodes, starting with the smallest key.
pub fn cursor_front(&mut self) -> Option<Cursor<'_, K, V>> {
- let root = addr_of_mut!(self.root);
+ let root = &raw mut self.root;
// SAFETY: `self.root` is always a valid root node
let current = unsafe { bindings::rb_first(root) };
NonNull::new(current).map(|current| {
@@ -253,7 +253,7 @@ pub fn cursor_front(&mut self) -> Option<Cursor<'_, K, V>> {
/// Returns a cursor over the tree nodes, starting with the largest key.
pub fn cursor_back(&mut self) -> Option<Cursor<'_, K, V>> {
- let root = addr_of_mut!(self.root);
+ let root = &raw mut self.root;
// SAFETY: `self.root` is always a valid root node
let current = unsafe { bindings::rb_last(root) };
NonNull::new(current).map(|current| {
@@ -459,7 +459,7 @@ pub fn cursor_lower_bound(&mut self, key: &K) -> Option<Cursor<'_, K, V>>
let best = best_match?;
// SAFETY: `best` is a non-null node so it is valid by the type invariants.
- let links = unsafe { addr_of_mut!((*best.as_ptr()).links) };
+ let links = unsafe { &raw mut (*best.as_ptr()).links };
NonNull::new(links).map(|current| {
// INVARIANT:
@@ -767,7 +767,7 @@ pub fn remove_current(self) -> (Option<Self>, RBTreeNode<K, V>) {
let node = RBTreeNode { node };
// SAFETY: The reference to the tree used to create the cursor outlives the cursor, so
// the tree cannot change. By the tree invariant, all nodes are valid.
- unsafe { bindings::rb_erase(&mut (*this).links, addr_of_mut!(self.tree.root)) };
+ unsafe { bindings::rb_erase(&mut (*this).links, &raw mut self.tree.root) };
let current = match (prev, next) {
(_, Some(next)) => next,
@@ -803,7 +803,7 @@ fn remove_neighbor(&mut self, direction: Direction) -> Option<RBTreeNode<K, V>>
let neighbor = neighbor.as_ptr();
// SAFETY: The reference to the tree used to create the cursor outlives the cursor, so
// the tree cannot change. By the tree invariant, all nodes are valid.
- unsafe { bindings::rb_erase(neighbor, addr_of_mut!(self.tree.root)) };
+ unsafe { bindings::rb_erase(neighbor, &raw mut self.tree.root) };
// SAFETY: By the type invariant of `Self`, all non-null `rb_node` pointers stored in `self`
// point to the links field of `Node<K, V>` objects.
let this = unsafe { container_of!(neighbor, Node<K, V>, links) }.cast_mut();
@@ -918,7 +918,7 @@ unsafe fn to_key_value_raw<'b>(node: NonNull<bindings::rb_node>) -> (&'b K, *mut
let k = unsafe { &(*this).key };
// SAFETY: The passed `node` is the current node or a non-null neighbor,
// thus `this` is valid by the type invariants.
- let v = unsafe { addr_of_mut!((*this).value) };
+ let v = unsafe { &raw mut (*this).value };
(k, v)
}
}
@@ -1027,7 +1027,7 @@ fn next(&mut self) -> Option<Self::Item> {
self.next = unsafe { bindings::rb_next(self.next) };
// SAFETY: By the same reasoning above, it is safe to dereference the node.
- Some(unsafe { (addr_of_mut!((*cur).key), addr_of_mut!((*cur).value)) })
+ Some(unsafe { (&raw mut (*cur).key, &raw mut (*cur).value) })
}
}
@@ -1170,7 +1170,7 @@ fn insert(self, node: RBTreeNode<K, V>) -> &'a mut V {
// SAFETY: `node` is valid at least until we call `Box::from_raw`, which only happens when
// the node is removed or replaced.
- let node_links = unsafe { addr_of_mut!((*node).links) };
+ let node_links = unsafe { &raw mut (*node).links };
// INVARIANT: We are linking in a new node, which is valid. It remains valid because we
// "forgot" it with `Box::into_raw`.
@@ -1178,7 +1178,7 @@ fn insert(self, node: RBTreeNode<K, V>) -> &'a mut V {
unsafe { bindings::rb_link_node(node_links, self.parent, self.child_field_of_parent) };
// SAFETY: All pointers are valid. `node` has just been inserted into the tree.
- unsafe { bindings::rb_insert_color(node_links, addr_of_mut!((*self.rbtree).root)) };
+ unsafe { bindings::rb_insert_color(node_links, &raw mut (*self.rbtree).root) };
// SAFETY: The node is valid until we remove it from the tree.
unsafe { &mut (*node).value }
@@ -1261,7 +1261,7 @@ fn replace(self, node: RBTreeNode<K, V>) -> RBTreeNode<K, V> {
// SAFETY: `node` is valid at least until we call `Box::from_raw`, which only happens when
// the node is removed or replaced.
- let new_node_links = unsafe { addr_of_mut!((*node).links) };
+ let new_node_links = unsafe { &raw mut (*node).links };
// SAFETY: This updates the pointers so that `new_node_links` is in the tree where
// `self.node_links` used to be.
diff --git a/rust/kernel/sync/arc.rs b/rust/kernel/sync/arc.rs
index 3cefda7a4372..81d8b0f84957 100644
--- a/rust/kernel/sync/arc.rs
+++ b/rust/kernel/sync/arc.rs
@@ -243,7 +243,7 @@ pub fn into_raw(self) -> *const T {
let ptr = self.ptr.as_ptr();
core::mem::forget(self);
// SAFETY: The pointer is valid.
- unsafe { core::ptr::addr_of!((*ptr).data) }
+ unsafe { &raw const (*ptr).data }
}
/// Recreates an [`Arc`] instance previously deconstructed via [`Arc::into_raw`].
diff --git a/rust/kernel/task.rs b/rust/kernel/task.rs
index 49012e711942..b2ac768eed23 100644
--- a/rust/kernel/task.rs
+++ b/rust/kernel/task.rs
@@ -257,7 +257,7 @@ pub fn as_ptr(&self) -> *mut bindings::task_struct {
pub fn group_leader(&self) -> &Task {
// SAFETY: The group leader of a task never changes after initialization, so reading this
// field is not a data race.
- let ptr = unsafe { *ptr::addr_of!((*self.as_ptr()).group_leader) };
+ let ptr = unsafe { *(&raw const (*self.as_ptr()).group_leader) };
// SAFETY: The lifetime of the returned task reference is tied to the lifetime of `self`,
// and given that a task has a reference to its group leader, we know it must be valid for
@@ -269,7 +269,7 @@ pub fn group_leader(&self) -> &Task {
pub fn pid(&self) -> Pid {
// SAFETY: The pid of a task never changes after initialization, so reading this field is
// not a data race.
- unsafe { *ptr::addr_of!((*self.as_ptr()).pid) }
+ unsafe { *(&raw const (*self.as_ptr()).pid) }
}
/// Returns the UID of the given task.
diff --git a/rust/kernel/workqueue.rs b/rust/kernel/workqueue.rs
index 0cd100d2aefb..34e8abb38974 100644
--- a/rust/kernel/workqueue.rs
+++ b/rust/kernel/workqueue.rs
@@ -401,9 +401,9 @@ pub fn new(name: &'static CStr, key: &'static LockClassKey) -> impl PinInit<Self
pub unsafe fn raw_get(ptr: *const Self) -> *mut bindings::work_struct {
// SAFETY: The caller promises that the pointer is aligned and not dangling.
//
- // A pointer cast would also be ok due to `#[repr(transparent)]`. We use `addr_of!` so that
- // the compiler does not complain that the `work` field is unused.
- unsafe { Opaque::raw_get(core::ptr::addr_of!((*ptr).work)) }
+ // A pointer cast would also be ok due to `#[repr(transparent)]`. We use `&raw const (*ptr).work`
+ // so that the compiler does not complain that the `work` field is unused.
+ unsafe { Opaque::raw_get(&raw const (*ptr).work) }
}
}
@@ -510,7 +510,7 @@ macro_rules! impl_has_work {
unsafe fn raw_get_work(ptr: *mut Self) -> *mut $crate::workqueue::Work<$work_type $(, $id)?> {
// SAFETY: The caller promises that the pointer is not dangling.
unsafe {
- ::core::ptr::addr_of_mut!((*ptr).$field)
+ &raw mut (*ptr).$field
}
}
}
--
2.48.1
There are four small fixes for ntsync test and doc. I divided these into
four different patches due to different types of errors. If one patch is
better, I can do it too.
Su Hui (4):
selftests: ntsync: fix the wrong condition in wake_all
selftests: ntsync: avoid possible overflow in 32-bit machine
selftests: ntsync: update config
docs: ntsync: update NTSYNC_IOC_*
Documentation/userspace-api/ntsync.rst | 18 +++++++++---------
tools/testing/selftests/drivers/ntsync/config | 2 +-
.../testing/selftests/drivers/ntsync/ntsync.c | 6 +++---
3 files changed, 13 insertions(+), 13 deletions(-)
--
2.30.2
I never had much luck running mm selftests so I spent a few hours
digging into why.
Looks like most of the reason is missing SKIP checks, so this series is
just adding a bunch of those that I found. I did not do anything like
all of them, just the ones I spotted in gup_longterm, gup_test, mmap,
userfaultfd and memfd_secret.
It's a bit unfortunate to have to skip those tests when ftruncate()
fails, but I don't have time to dig deep enough into it to actually make
them pass. I have observed the issue on 9pfs and heard rumours that NFS
has a similar problem.
I'm now able to run these test groups successfully:
- mmap
- gup_test
- compaction
- migration
- page_frag
- userfaultfd
Signed-off-by: Brendan Jackman <jackmanb(a)google.com>
---
Changes in v3:
- Added fix for userfaultfd tests.
- Dropped attempts to use sudo.
- Fixed garbage printf in uffd-stress.
(Added EXTRA_CFLAGS=-Werror FORCE_TARGETS=1 to my scripts to prevent
such errors happening again).
- Fixed missing newlines in ksft_test_result_skip() calls.
- Link to v2: https://lore.kernel.org/r/20250221-mm-selftests-v2-0-28c4d66383c5@google.com
Changes in v2 (Thanks to Dev for the reviews):
- Improve and cleanup some error messages
- Add some extra SKIPs
- Fix misnaming of nr_cpus variable in uffd tests
- Link to v1: https://lore.kernel.org/r/20250220-mm-selftests-v1-0-9bbf57d64463@google.com
---
Brendan Jackman (10):
selftests/mm: Report errno when things fail in gup_longterm
selftests/mm: Skip uffd-stress if userfaultfd not available
selftests/mm: Skip uffd-wp-mremap if userfaultfd not available
selftests/mm/uffd: Rename nr_cpus -> nr_threads
selftests/mm: Print some details when uffd-stress gets bad params
selftests/mm: Don't fail uffd-stress if too many CPUs
selftests/mm: Skip map_populate on weird filesystems
selftests/mm: Skip gup_longerm tests on weird filesystems
selftests/mm: Drop unnecessary sudo usage
selftests/mm: Ensure uffd-wp-mremap gets pages of each size
tools/testing/selftests/mm/gup_longterm.c | 45 ++++++++++++++++++----------
tools/testing/selftests/mm/map_populate.c | 7 +++++
tools/testing/selftests/mm/run_vmtests.sh | 25 ++++++++++++++--
tools/testing/selftests/mm/uffd-common.c | 8 ++---
tools/testing/selftests/mm/uffd-common.h | 2 +-
tools/testing/selftests/mm/uffd-stress.c | 42 ++++++++++++++++----------
tools/testing/selftests/mm/uffd-unit-tests.c | 2 +-
tools/testing/selftests/mm/uffd-wp-mremap.c | 5 +++-
8 files changed, 95 insertions(+), 41 deletions(-)
---
base-commit: 76544811c850a1f4c055aa182b513b7a843868ea
change-id: 20250220-mm-selftests-2d7d0542face
Best regards,
--
Brendan Jackman <jackmanb(a)google.com>
This series is built on top of the v3 write syscall support [1].
With James's KVM userfault [2], it is possible to handle stage-2 faults
in guest_memfd in userspace. However, KVM itself also triggers faults
in guest_memfd in some cases, for example: PV interfaces like kvmclock,
PV EOI and page table walking code when fetching the MMIO instruction on
x86. It was agreed in the guest_memfd upstream call on 23 Jan 2025 [3]
that KVM would be accessing those pages via userspace page tables. In
order for such faults to be handled in userspace, guest_memfd needs to
support userfaultfd.
This series proposes a limited support for userfaultfd in guest_memfd:
- userfaultfd support is conditional to `CONFIG_KVM_GMEM_SHARED_MEM`
(as is fault support in general)
- Only `page missing` event is currently supported
- Userspace is supposed to respond to the event with the `write`
syscall followed by `UFFDIO_CONTINUE` ioctl to unblock the faulting
process. Note that we can't use `UFFDIO_COPY` here because
userfaulfd code does not know how to prepare guest_memfd pages, eg
remove them from direct map [4].
Not included in this series:
- Proper interface for userfaultfd to recognise guest_memfd mappings
- Proper handling of truncation cases after locking the page
Request for comments:
- Is it a sensible workflow for guest_memfd to resolve a userfault
`page missing` event with `write` syscall + `UFFDIO_CONTINUE`? One
of the alternatives is teaching `UFFDIO_COPY` how to deal with
guest_memfd pages.
- What is a way forward to make userfaultfd code aware of guest_memfd?
I saw that Patrick hit a somewhat similar problem in [5] when trying
to use direct map manipulation functions in KVM and was pointed by
David at Elliot's guestmem library [6] that might include a shim for that.
Would the library be the right place to expose required interfaces like
`vma_is_gmem`?
Nikita
[1] https://lore.kernel.org/kvm/20250303130838.28812-1-kalyazin@amazon.com/T/
[2] https://lore.kernel.org/kvm/20250109204929.1106563-1-jthoughton@google.com/…
[3] https://docs.google.com/document/d/1M6766BzdY1Lhk7LiR5IqVR8B8mG3cr-cxTxOrAo…
[4] https://lore.kernel.org/kvm/20250221160728.1584559-1-roypat@amazon.co.uk/T/
[4] https://lore.kernel.org/kvm/20250221160728.1584559-1-roypat@amazon.co.uk/T/…
[5] https://lore.kernel.org/kvm/20241122-guestmem-library-v5-2-450e92951a15@qui…
Nikita Kalyazin (5):
KVM: guest_memfd: add kvm_gmem_vma_is_gmem
KVM: guest_memfd: add support for uffd missing
mm: userfaultfd: allow to register userfaultfd for guest_memfd
mm: userfaultfd: support continue for guest_memfd
KVM: selftests: add uffd missing test for guest_memfd
include/linux/userfaultfd_k.h | 9 ++
mm/userfaultfd.c | 23 ++++-
.../testing/selftests/kvm/guest_memfd_test.c | 88 +++++++++++++++++++
virt/kvm/guest_memfd.c | 17 +++-
virt/kvm/kvm_mm.h | 1 +
5 files changed, 136 insertions(+), 2 deletions(-)
base-commit: 592e7531753dc4b711f96cd1daf808fd493d3223
--
2.47.1
Basics and overview
===================
Software with larger attack surfaces (e.g. network facing apps like databases,
browsers or apps relying on browser runtimes) suffer from memory corruption
issues which can be utilized by attackers to bend control flow of the program
to eventually gain control (by making their payload executable). Attackers are
able to perform such attacks by leveraging call-sites which rely on indirect
calls or return sites which rely on obtaining return address from stack memory.
To mitigate such attacks, risc-v extension zicfilp enforces that all indirect
calls must land on a landing pad instruction `lpad` else cpu will raise software
check exception (a new cpu exception cause code on riscv).
Similarly for return flow, risc-v extension zicfiss extends architecture with
- `sspush` instruction to push return address on a shadow stack
- `sspopchk` instruction to pop return address from shadow stack
and compare with input operand (i.e. return address on stack)
- `sspopchk` to raise software check exception if comparision above
was a mismatch
- Protection mechanism using which shadow stack is not writeable via
regular store instructions
More information an details can be found at extensions github repo [1].
Equivalent to landing pad (zicfilp) on x86 is `ENDBRANCH` instruction in Intel
CET [3] and branch target identification (BTI) [4] on arm.
Similarly x86's Intel CET has shadow stack [5] and arm64 has guarded control
stack (GCS) [6] which are very similar to risc-v's zicfiss shadow stack.
x86 and arm64 support for user mode shadow stack is already in mainline.
Kernel awareness for user control flow integrity
================================================
This series picks up Samuel Holland's envcfg changes [2] as well. So if those are
being applied independently, they should be removed from this series.
Enabling:
In order to maintain compatibility and not break anything in user mode, kernel
doesn't enable control flow integrity cpu extensions on binary by default.
Instead exposes a prctl interface to enable, disable and lock the shadow stack
or landing pad feature for a task. This allows userspace (loader) to enumerate
if all objects in its address space are compiled with shadow stack and landing
pad support and accordingly enable the feature. Additionally if a subsequent
`dlopen` happens on a library, user mode can take a decision again to disable
the feature (if incoming library is not compiled with support) OR terminate the
task (if user mode policy is strict to have all objects in address space to be
compiled with control flow integirty cpu feature). prctl to enable shadow stack
results in allocating shadow stack from virtual memory and activating for user
address space. x86 and arm64 are also following same direction due to similar
reason(s).
clone/fork:
On clone and fork, cfi state for task is inherited by child. Shadow stack is
part of virtual memory and is a writeable memory from kernel perspective
(writeable via a restricted set of instructions aka shadow stack instructions)
Thus kernel changes ensure that this memory is converted into read-only when
fork/clone happens and COWed when fault is taken due to sspush, sspopchk or
ssamoswap. In case `CLONE_VM` is specified and shadow stack is to be enabled,
kernel will automatically allocate a shadow stack for that clone call.
map_shadow_stack:
x86 introduced `map_shadow_stack` system call to allow user space to explicitly
map shadow stack memory in its address space. It is useful to allocate shadow
for different contexts managed by a single thread (green threads or contexts)
risc-v implements this system call as well.
signal management:
If shadow stack is enabled for a task, kernel performs an asynchronous control
flow diversion to deliver the signal and eventually expects userspace to issue
sigreturn so that original execution can be resumed. Even though resume context
is prepared by kernel, it is in user space memory and is subject to memory
corruption and corruption bugs can be utilized by attacker in this race window
to perform arbitrary sigreturn and eventually bypass cfi mechanism.
Another issue is how to ensure that cfi related state on sigcontext area is not
trampled by legacy apps or apps compiled with old kernel headers.
In order to mitigate control-flow hijacting, kernel prepares a token and place
it on shadow stack before signal delivery and places address of token in
sigcontext structure. During sigreturn, kernel obtains address of token from
sigcontext struture, reads token from shadow stack and validates it and only
then allow sigreturn to succeed. Compatiblity issue is solved by adopting
dynamic sigcontext management introduced for vector extension. This series
re-factor the code little bit to allow future sigcontext management easy (as
proposed by Andy Chiu from SiFive)
config and compilation:
Introduce a new risc-v config option `CONFIG_RISCV_USER_CFI`. Selecting this
config option picks the kernel support for user control flow integrity. This
optin is presented only if toolchain has shadow stack and landing pad support.
And is on purpose guarded by toolchain support. Reason being that eventually
vDSO also needs to be compiled in with shadow stack and landing pad support.
vDSO compile patches are not included as of now because landing pad labeling
scheme is yet to settle for usermode runtime.
To get more information on kernel interactions with respect to
zicfilp and zicfiss, patch series adds documentation for
`zicfilp` and `zicfiss` in following:
Documentation/arch/riscv/zicfiss.rst
Documentation/arch/riscv/zicfilp.rst
How to test this series
=======================
Toolchain
---------
$ git clone git@github.com:sifive/riscv-gnu-toolchain.git -b cfi-dev
$ riscv-gnu-toolchain/configure --prefix=<path-to-where-to-build> --with-arch=rv64gc_zicfilp_zicfiss --enable-linux --disable-gdb --with-extra-multilib-test="rv64gc_zicfilp_zicfiss-lp64d:-static"
$ make -j$(nproc)
Qemu
----
Get the lastest qemu
$ cd qemu
$ mkdir build
$ cd build
$ ../configure --target-list=riscv64-softmmu
$ make -j$(nproc)
Opensbi
-------
$ git clone git@github.com:deepak0414/opensbi.git -b v6_cfi_spec_split_opensbi
$ make CROSS_COMPILE=<your riscv toolchain> -j$(nproc) PLATFORM=generic
Linux
-----
Running defconfig is fine. CFI is enabled by default if the toolchain
supports it.
$ make ARCH=riscv CROSS_COMPILE=<path-to-cfi-riscv-gnu-toolchain>/build/bin/riscv64-unknown-linux-gnu- -j$(nproc) defconfig
$ make ARCH=riscv CROSS_COMPILE=<path-to-cfi-riscv-gnu-toolchain>/build/bin/riscv64-unknown-linux-gnu- -j$(nproc)
In case you're building your own rootfs using toolchain, please make sure you
pick following patch to ensure that vDSO compiled with lpad and shadow stack.
"arch/riscv: compile vdso with landing pad"
Branch where above patch can be picked
https://github.com/deepak0414/linux-riscv-cfi/tree/vdso_user_cfi_v6.12-rc1
Running
-------
Modify your qemu command to have:
-bios <path-to-cfi-opensbi>/build/platform/generic/firmware/fw_dynamic.bin
-cpu rv64,zicfilp=true,zicfiss=true,zimop=true,zcmop=true
vDSO related Opens (in the flux)
=================================
I am listing these opens for laying out plan and what to expect in future
patch sets. And of course for the sake of discussion.
Shadow stack and landing pad enabling in vDSO
----------------------------------------------
vDSO must have shadow stack and landing pad support compiled in for task
to have shadow stack and landing pad support. This patch series doesn't
enable that (yet). Enabling shadow stack support in vDSO should be
straight forward (intend to do that in next versions of patch set). Enabling
landing pad support in vDSO requires some collaboration with toolchain folks
to follow a single label scheme for all object binaries. This is necessary to
ensure that all indirect call-sites are setting correct label and target landing
pads are decorated with same label scheme.
How many vDSOs
---------------
Shadow stack instructions are carved out of zimop (may be operations) and if CPU
doesn't implement zimop, they're illegal instructions. Kernel could be running on
a CPU which may or may not implement zimop. And thus kernel will have to carry 2
different vDSOs and expose the appropriate one depending on whether CPU implements
zimop or not.
References
==========
[1] - https://github.com/riscv/riscv-cfi
[2] - https://lore.kernel.org/all/20240814081126.956287-1-samuel.holland@sifive.c…
[3] - https://lwn.net/Articles/889475/
[4] - https://developer.arm.com/documentation/109576/0100/Branch-Target-Identific…
[5] - https://www.intel.com/content/dam/develop/external/us/en/documents/catc17-i…
[6] - https://lwn.net/Articles/940403/
---
changelog
---------
v11:
- patch "arch/riscv: compile vdso with landing pad" was unconditionally
selecting `_zicfilp` for vDSO compile. fixed that. Changed `lpad 1` to
to `lpad 0`.
v10:
- dropped "mm: helper `is_shadow_stack_vma` to check shadow stack vma". This patch
is not that interesting to this patch series for risc-v. There are instances in
arch directories where VM_SHADOW_STACK flag is anyways used. Dropping this patch
to expedite merging in riscv tree.
- Took suggestions from `Clement` on "riscv: zicfiss / zicfilp enumeration" to
validate presence of cfi based on config.
- Added a patch for vDSO to have `lpad 0`. I had omitted this earlier to make sure
we add single vdso object with cfi enabled. But a vdso object with scheme of
zero labeled landing pad is least common denominator and should work with all
objects of zero labeled as well as function-signature labeled objects.
v9:
- rebased on master (39a803b754d5 fix braino in "9p: fix ->rename_sem exclusion")
- dropped "mm: Introduce ARCH_HAS_USER_SHADOW_STACK" (master has it from arm64/gcs)
- dropped "prctl: arch-agnostic prctl for shadow stack" (master has it from arm64/gcs)
v8:
- rebased on palmer/for-next
- dropped samuel holland's `envcfg` context switch patches.
they are in parlmer/for-next
v7:
- Removed "riscv/Kconfig: enable HAVE_EXIT_THREAD for riscv"
Instead using `deactivate_mm` flow to clean up.
see here for more context
https://lore.kernel.org/all/20230908203655.543765-1-rick.p.edgecombe@intel.…
- Changed the header include in `kselftest`. Hopefully this fixes compile
issue faced by Zong Li at SiFive.
- Cleaned up an orphaned change to `mm/mmap.c` in below patch
"riscv/mm : ensure PROT_WRITE leads to VM_READ | VM_WRITE"
- Lock interfaces for shadow stack and indirect branch tracking expect arg == 0
Any future evolution of this interface should accordingly define how arg should
be setup.
- `mm/map.c` has an instance of using `VM_SHADOW_STACK`. Fixed it to use helper
`is_shadow_stack_vma`.
- Link to v6: https://lore.kernel.org/r/20241008-v5_user_cfi_series-v6-0-60d9fe073f37@riv…
v6:
- Picked up Samuel Holland's changes as is with `envcfg` placed in
`thread` instead of `thread_info`
- fixed unaligned newline escapes in kselftest
- cleaned up messages in kselftest and included test output in commit message
- fixed a bug in clone path reported by Zong Li
- fixed a build issue if CONFIG_RISCV_ISA_V is not selected
(this was introduced due to re-factoring signal context
management code)
v5:
- rebased on v6.12-rc1
- Fixed schema related issues in device tree file
- Fixed some of the documentation related issues in zicfilp/ss.rst
(style issues and added index)
- added `SHADOW_STACK_SET_MARKER` so that implementation can define base
of shadow stack.
- Fixed warnings on definitions added in usercfi.h when
CONFIG_RISCV_USER_CFI is not selected.
- Adopted context header based signal handling as proposed by Andy Chiu
- Added support for enabling kernel mode access to shadow stack using
FWFT
(https://github.com/riscv-non-isa/riscv-sbi-doc/blob/master/src/ext-firmware…)
- Link to v5: https://lore.kernel.org/r/20241001-v5_user_cfi_series-v1-0-3ba65b6e550f@riv…
(Note: I had an issue in my workflow due to which version number wasn't
picked up correctly while sending out patches)
v4:
- rebased on 6.11-rc6
- envcfg: Converged with Samuel Holland's patches for envcfg management on per-
thread basis.
- vma_is_shadow_stack is renamed to is_vma_shadow_stack
- picked up Mark Brown's `ARCH_HAS_USER_SHADOW_STACK` patch
- signal context: using extended context management to maintain compatibility.
- fixed `-Wmissing-prototypes` compiler warnings for prctl functions
- Documentation fixes and amending typos.
- Link to v4: https://lore.kernel.org/all/20240912231650.3740732-1-debug@rivosinc.com/
v3:
- envcfg
logic to pick up base envcfg had a bug where `ENVCFG_CBZE` could have been
picked on per task basis, even though CPU didn't implement it. Fixed in
this series.
- dt-bindings
As suggested, split into separate commit. fixed the messaging that spec is
in public review
- arch_is_shadow_stack change
arch_is_shadow_stack changed to vma_is_shadow_stack
- hwprobe
zicfiss / zicfilp if present will get enumerated in hwprobe
- selftests
As suggested, added object and binary filenames to .gitignore
Selftest binary anyways need to be compiled with cfi enabled compiler which
will make sure that landing pad and shadow stack are enabled. Thus removed
separate enable/disable tests. Cleaned up tests a bit.
- Link to v3: https://lore.kernel.org/lkml/20240403234054.2020347-1-debug@rivosinc.com/
v2:
- Using config `CONFIG_RISCV_USER_CFI`, kernel support for riscv control flow
integrity for user mode programs can be compiled in the kernel.
- Enabling of control flow integrity for user programs is left to user runtime
- This patch series introduces arch agnostic `prctls` to enable shadow stack
and indirect branch tracking. And implements them on riscv.
---
---
Changes in v11:
- EDITME: describe what is new in this series revision.
- EDITME: use bulletpoints and terse descriptions.
- Link to v10: https://lore.kernel.org/r/20250210-v5_user_cfi_series-v10-0-163dcfa31c60@ri…
---
Andy Chiu (1):
riscv: signal: abstract header saving for setup_sigcontext
Clément Léger (1):
riscv: Add Firmware Feature SBI extensions definitions
Deepak Gupta (24):
mm: VM_SHADOW_STACK definition for riscv
dt-bindings: riscv: zicfilp and zicfiss in dt-bindings (extensions.yaml)
riscv: zicfiss / zicfilp enumeration
riscv: zicfiss / zicfilp extension csr and bit definitions
riscv: usercfi state for task and save/restore of CSR_SSP on trap entry/exit
riscv/mm : ensure PROT_WRITE leads to VM_READ | VM_WRITE
riscv mm: manufacture shadow stack pte
riscv mmu: teach pte_mkwrite to manufacture shadow stack PTEs
riscv mmu: write protect and shadow stack
riscv/mm: Implement map_shadow_stack() syscall
riscv/shstk: If needed allocate a new shadow stack on clone
riscv: Implements arch agnostic shadow stack prctls
prctl: arch-agnostic prctl for indirect branch tracking
riscv/traps: Introduce software check exception
riscv/signal: save and restore of shadow stack for signal
riscv/kernel: update __show_regs to print shadow stack register
riscv/ptrace: riscv cfi status and state via ptrace and in core files
riscv/hwprobe: zicfilp / zicfiss enumeration in hwprobe
riscv: enable kernel access to shadow stack memory via FWFT sbi call
riscv: kernel command line option to opt out of user cfi
riscv: create a config for shadow stack and landing pad instr support
riscv: Documentation for landing pad / indirect branch tracking
riscv: Documentation for shadow stack on riscv
kselftest/riscv: kselftest for user mode cfi
Jim Shu (1):
arch/riscv: compile vdso with landing pad
Documentation/arch/riscv/index.rst | 2 +
Documentation/arch/riscv/zicfilp.rst | 115 +++++
Documentation/arch/riscv/zicfiss.rst | 176 +++++++
.../devicetree/bindings/riscv/extensions.yaml | 14 +
arch/riscv/Kconfig | 20 +
arch/riscv/Makefile | 7 +-
arch/riscv/include/asm/asm-prototypes.h | 1 +
arch/riscv/include/asm/assembler.h | 44 ++
arch/riscv/include/asm/cpufeature.h | 13 +
arch/riscv/include/asm/csr.h | 16 +
arch/riscv/include/asm/entry-common.h | 2 +
arch/riscv/include/asm/hwcap.h | 2 +
arch/riscv/include/asm/mman.h | 25 +
arch/riscv/include/asm/mmu_context.h | 7 +
arch/riscv/include/asm/pgtable.h | 30 +-
arch/riscv/include/asm/processor.h | 2 +
arch/riscv/include/asm/sbi.h | 26 +
arch/riscv/include/asm/thread_info.h | 3 +
arch/riscv/include/asm/usercfi.h | 89 ++++
arch/riscv/include/asm/vector.h | 3 +
arch/riscv/include/uapi/asm/hwprobe.h | 2 +
arch/riscv/include/uapi/asm/ptrace.h | 22 +
arch/riscv/include/uapi/asm/sigcontext.h | 1 +
arch/riscv/kernel/Makefile | 1 +
arch/riscv/kernel/asm-offsets.c | 8 +
arch/riscv/kernel/cpufeature.c | 13 +
arch/riscv/kernel/entry.S | 31 +-
arch/riscv/kernel/head.S | 12 +
arch/riscv/kernel/process.c | 26 +-
arch/riscv/kernel/ptrace.c | 83 ++++
arch/riscv/kernel/signal.c | 142 +++++-
arch/riscv/kernel/sys_hwprobe.c | 2 +
arch/riscv/kernel/sys_riscv.c | 10 +
arch/riscv/kernel/traps.c | 43 ++
arch/riscv/kernel/usercfi.c | 524 +++++++++++++++++++++
arch/riscv/kernel/vdso/Makefile | 12 +
arch/riscv/kernel/vdso/flush_icache.S | 4 +
arch/riscv/kernel/vdso/getcpu.S | 4 +
arch/riscv/kernel/vdso/rt_sigreturn.S | 4 +
arch/riscv/kernel/vdso/sys_hwprobe.S | 4 +
arch/riscv/mm/init.c | 2 +-
arch/riscv/mm/pgtable.c | 17 +
include/linux/cpu.h | 4 +
include/linux/mm.h | 7 +
include/uapi/linux/elf.h | 1 +
include/uapi/linux/prctl.h | 27 ++
kernel/sys.c | 30 ++
tools/testing/selftests/riscv/Makefile | 2 +-
tools/testing/selftests/riscv/cfi/.gitignore | 3 +
tools/testing/selftests/riscv/cfi/Makefile | 10 +
tools/testing/selftests/riscv/cfi/cfi_rv_test.h | 84 ++++
tools/testing/selftests/riscv/cfi/riscv_cfi_test.c | 78 +++
tools/testing/selftests/riscv/cfi/shadowstack.c | 375 +++++++++++++++
tools/testing/selftests/riscv/cfi/shadowstack.h | 37 ++
54 files changed, 2193 insertions(+), 29 deletions(-)
---
base-commit: 39a803b754d5224a3522016b564113ee1e4091b2
change-id: 20240930-v5_user_cfi_series-3dc332f8f5b2
--
- debug
In Rust 1.51.0, Clippy introduced the `ignored_unit_patterns` lint [1]:
> Though `as` casts between raw pointers are not terrible,
> `pointer::cast` is safer because it cannot accidentally change the
> pointer's mutability, nor cast the pointer to other types like `usize`.
There are a few classes of changes required:
- Modules generated by bindgen are marked
`#[allow(clippy::ptr_as_ptr)]`.
- Inferred casts (` as _`) are replaced with `.cast()`.
- Ascribed casts (` as *... T`) are replaced with `.cast::<T>()`.
- Multistep casts from references (` as *const _ as *const T`) are
replaced with `let x: *const _ = &x;` and `.cast()` or `.cast::<T>()`
according to the previous rules. The intermediate `let` binding is
required because `(x as *const _).cast::<T>()` results in inference
failure.
- Native literal C strings are replaced with `c_str!().as_char_ptr()`.
Apply these changes and enable the lint -- no functional change
intended.
Link: https://rust-lang.github.io/rust-clippy/master/index.html#ptr_as_ptr [1]
Signed-off-by: Tamir Duberstein <tamird(a)gmail.com>
---
Makefile | 1 +
rust/bindings/lib.rs | 1 +
rust/kernel/alloc/allocator_test.rs | 2 +-
rust/kernel/alloc/kvec.rs | 4 ++--
rust/kernel/device.rs | 5 +++--
rust/kernel/devres.rs | 2 +-
rust/kernel/error.rs | 2 +-
rust/kernel/fs/file.rs | 2 +-
rust/kernel/kunit.rs | 15 +++++++--------
rust/kernel/lib.rs | 4 ++--
rust/kernel/list/impl_list_item_mod.rs | 2 +-
rust/kernel/pci.rs | 2 +-
rust/kernel/platform.rs | 4 +++-
rust/kernel/print.rs | 11 +++++------
rust/kernel/seq_file.rs | 3 ++-
rust/kernel/str.rs | 2 +-
rust/kernel/sync/poll.rs | 2 +-
rust/kernel/workqueue.rs | 10 +++++-----
rust/uapi/lib.rs | 1 +
19 files changed, 40 insertions(+), 35 deletions(-)
diff --git a/Makefile b/Makefile
index 70bdbf2218fc..ec8efc8e23ba 100644
--- a/Makefile
+++ b/Makefile
@@ -483,6 +483,7 @@ export rust_common_flags := --edition=2021 \
-Wclippy::needless_continue \
-Aclippy::needless_lifetimes \
-Wclippy::no_mangle_with_rust_abi \
+ -Wclippy::ptr_as_ptr \
-Wclippy::undocumented_unsafe_blocks \
-Wclippy::unnecessary_safety_comment \
-Wclippy::unnecessary_safety_doc \
diff --git a/rust/bindings/lib.rs b/rust/bindings/lib.rs
index 014af0d1fc70..0486a32ed314 100644
--- a/rust/bindings/lib.rs
+++ b/rust/bindings/lib.rs
@@ -25,6 +25,7 @@
)]
#[allow(dead_code)]
+#[allow(clippy::ptr_as_ptr)]
#[allow(clippy::undocumented_unsafe_blocks)]
mod bindings_raw {
// Manual definition for blocklisted types.
diff --git a/rust/kernel/alloc/allocator_test.rs b/rust/kernel/alloc/allocator_test.rs
index c37d4c0c64e9..8017aa9d5213 100644
--- a/rust/kernel/alloc/allocator_test.rs
+++ b/rust/kernel/alloc/allocator_test.rs
@@ -82,7 +82,7 @@ unsafe fn realloc(
// SAFETY: Returns either NULL or a pointer to a memory allocation that satisfies or
// exceeds the given size and alignment requirements.
- let dst = unsafe { libc_aligned_alloc(layout.align(), layout.size()) } as *mut u8;
+ let dst = unsafe { libc_aligned_alloc(layout.align(), layout.size()) }.cast::<u8>();
let dst = NonNull::new(dst).ok_or(AllocError)?;
if flags.contains(__GFP_ZERO) {
diff --git a/rust/kernel/alloc/kvec.rs b/rust/kernel/alloc/kvec.rs
index ae9d072741ce..c12844764671 100644
--- a/rust/kernel/alloc/kvec.rs
+++ b/rust/kernel/alloc/kvec.rs
@@ -262,7 +262,7 @@ pub fn spare_capacity_mut(&mut self) -> &mut [MaybeUninit<T>] {
// - `self.len` is smaller than `self.capacity` and hence, the resulting pointer is
// guaranteed to be part of the same allocated object.
// - `self.len` can not overflow `isize`.
- let ptr = unsafe { self.as_mut_ptr().add(self.len) } as *mut MaybeUninit<T>;
+ let ptr = unsafe { self.as_mut_ptr().add(self.len) }.cast::<MaybeUninit<T>>();
// SAFETY: The memory between `self.len` and `self.capacity` is guaranteed to be allocated
// and valid, but uninitialized.
@@ -554,7 +554,7 @@ fn drop(&mut self) {
// - `ptr` points to memory with at least a size of `size_of::<T>() * len`,
// - all elements within `b` are initialized values of `T`,
// - `len` does not exceed `isize::MAX`.
- unsafe { Vec::from_raw_parts(ptr as _, len, len) }
+ unsafe { Vec::from_raw_parts(ptr.cast(), len, len) }
}
}
diff --git a/rust/kernel/device.rs b/rust/kernel/device.rs
index db2d9658ba47..9e500498835d 100644
--- a/rust/kernel/device.rs
+++ b/rust/kernel/device.rs
@@ -168,16 +168,17 @@ pub fn pr_dbg(&self, args: fmt::Arguments<'_>) {
/// `KERN_*`constants, for example, `KERN_CRIT`, `KERN_ALERT`, etc.
#[cfg_attr(not(CONFIG_PRINTK), allow(unused_variables))]
unsafe fn printk(&self, klevel: &[u8], msg: fmt::Arguments<'_>) {
+ let msg: *const _ = &msg;
// SAFETY: `klevel` is null-terminated and one of the kernel constants. `self.as_raw`
// is valid because `self` is valid. The "%pA" format string expects a pointer to
// `fmt::Arguments`, which is what we're passing as the last argument.
#[cfg(CONFIG_PRINTK)]
unsafe {
bindings::_dev_printk(
- klevel as *const _ as *const crate::ffi::c_char,
+ klevel.as_ptr().cast::<crate::ffi::c_char>(),
self.as_raw(),
c_str!("%pA").as_char_ptr(),
- &msg as *const _ as *const crate::ffi::c_void,
+ msg.cast::<crate::ffi::c_void>(),
)
};
}
diff --git a/rust/kernel/devres.rs b/rust/kernel/devres.rs
index 942376f6f3af..3a9d998ec371 100644
--- a/rust/kernel/devres.rs
+++ b/rust/kernel/devres.rs
@@ -157,7 +157,7 @@ fn remove_action(this: &Arc<Self>) {
#[allow(clippy::missing_safety_doc)]
unsafe extern "C" fn devres_callback(ptr: *mut kernel::ffi::c_void) {
- let ptr = ptr as *mut DevresInner<T>;
+ let ptr = ptr.cast::<DevresInner<T>>();
// Devres owned this memory; now that we received the callback, drop the `Arc` and hence the
// reference.
// SAFETY: Safe, since we leaked an `Arc` reference to devm_add_action() in
diff --git a/rust/kernel/error.rs b/rust/kernel/error.rs
index f6ecf09cb65f..8654d52b0bb9 100644
--- a/rust/kernel/error.rs
+++ b/rust/kernel/error.rs
@@ -152,7 +152,7 @@ pub(crate) fn to_blk_status(self) -> bindings::blk_status_t {
/// Returns the error encoded as a pointer.
pub fn to_ptr<T>(self) -> *mut T {
// SAFETY: `self.0` is a valid error due to its invariant.
- unsafe { bindings::ERR_PTR(self.0.get() as _) as *mut _ }
+ unsafe { bindings::ERR_PTR(self.0.get() as _).cast() }
}
/// Returns a string representing the error, if one exists.
diff --git a/rust/kernel/fs/file.rs b/rust/kernel/fs/file.rs
index e03dbe14d62a..8936afc234a4 100644
--- a/rust/kernel/fs/file.rs
+++ b/rust/kernel/fs/file.rs
@@ -364,7 +364,7 @@ fn deref(&self) -> &LocalFile {
//
// By the type invariants, there are no `fdget_pos` calls that did not take the
// `f_pos_lock` mutex.
- unsafe { LocalFile::from_raw_file(self as *const File as *const bindings::file) }
+ unsafe { LocalFile::from_raw_file((self as *const Self).cast()) }
}
}
diff --git a/rust/kernel/kunit.rs b/rust/kernel/kunit.rs
index 824da0e9738a..7ed2063c1af0 100644
--- a/rust/kernel/kunit.rs
+++ b/rust/kernel/kunit.rs
@@ -8,19 +8,20 @@
use core::{ffi::c_void, fmt};
+#[cfg(CONFIG_PRINTK)]
+use crate::c_str;
+
/// Prints a KUnit error-level message.
///
/// Public but hidden since it should only be used from KUnit generated code.
#[doc(hidden)]
pub fn err(args: fmt::Arguments<'_>) {
+ let args: *const _ = &args;
// SAFETY: The format string is null-terminated and the `%pA` specifier matches the argument we
// are passing.
#[cfg(CONFIG_PRINTK)]
unsafe {
- bindings::_printk(
- c"\x013%pA".as_ptr() as _,
- &args as *const _ as *const c_void,
- );
+ bindings::_printk(c_str!("\x013%pA").as_char_ptr(), args.cast::<c_void>());
}
}
@@ -29,14 +30,12 @@ pub fn err(args: fmt::Arguments<'_>) {
/// Public but hidden since it should only be used from KUnit generated code.
#[doc(hidden)]
pub fn info(args: fmt::Arguments<'_>) {
+ let args: *const _ = &args;
// SAFETY: The format string is null-terminated and the `%pA` specifier matches the argument we
// are passing.
#[cfg(CONFIG_PRINTK)]
unsafe {
- bindings::_printk(
- c"\x016%pA".as_ptr() as _,
- &args as *const _ as *const c_void,
- );
+ bindings::_printk(c_str!("\x016%pA").as_char_ptr(), args.cast::<c_void>());
}
}
diff --git a/rust/kernel/lib.rs b/rust/kernel/lib.rs
index 7697c60b2d1a..01264e459c92 100644
--- a/rust/kernel/lib.rs
+++ b/rust/kernel/lib.rs
@@ -196,9 +196,9 @@ fn panic(info: &core::panic::PanicInfo<'_>) -> ! {
#[macro_export]
macro_rules! container_of {
($ptr:expr, $type:ty, $($f:tt)*) => {{
- let ptr = $ptr as *const _ as *const u8;
+ let ptr: *const _ = $ptr;
let offset: usize = ::core::mem::offset_of!($type, $($f)*);
- ptr.sub(offset) as *const $type
+ ptr.cast::<u8>().sub(offset).cast::<$type>()
}}
}
diff --git a/rust/kernel/list/impl_list_item_mod.rs b/rust/kernel/list/impl_list_item_mod.rs
index a0438537cee1..1f9498c1458f 100644
--- a/rust/kernel/list/impl_list_item_mod.rs
+++ b/rust/kernel/list/impl_list_item_mod.rs
@@ -34,7 +34,7 @@ pub unsafe trait HasListLinks<const ID: u64 = 0> {
unsafe fn raw_get_list_links(ptr: *mut Self) -> *mut ListLinks<ID> {
// SAFETY: The caller promises that the pointer is valid. The implementer promises that the
// `OFFSET` constant is correct.
- unsafe { (ptr as *mut u8).add(Self::OFFSET) as *mut ListLinks<ID> }
+ unsafe { ptr.cast::<u8>().add(Self::OFFSET).cast() }
}
}
diff --git a/rust/kernel/pci.rs b/rust/kernel/pci.rs
index 4c98b5b9aa1e..206f71d33ab2 100644
--- a/rust/kernel/pci.rs
+++ b/rust/kernel/pci.rs
@@ -75,7 +75,7 @@ extern "C" fn probe_callback(
// Let the `struct pci_dev` own a reference of the driver's private data.
// SAFETY: By the type invariant `pdev.as_raw` returns a valid pointer to a
// `struct pci_dev`.
- unsafe { bindings::pci_set_drvdata(pdev.as_raw(), data.into_foreign() as _) };
+ unsafe { bindings::pci_set_drvdata(pdev.as_raw(), data.into_foreign().cast()) };
}
Err(err) => return Error::to_errno(err),
}
diff --git a/rust/kernel/platform.rs b/rust/kernel/platform.rs
index 50e6b0421813..8f9e6b125faf 100644
--- a/rust/kernel/platform.rs
+++ b/rust/kernel/platform.rs
@@ -66,7 +66,9 @@ extern "C" fn probe_callback(pdev: *mut bindings::platform_device) -> kernel::ff
// Let the `struct platform_device` own a reference of the driver's private data.
// SAFETY: By the type invariant `pdev.as_raw` returns a valid pointer to a
// `struct platform_device`.
- unsafe { bindings::platform_set_drvdata(pdev.as_raw(), data.into_foreign() as _) };
+ unsafe {
+ bindings::platform_set_drvdata(pdev.as_raw(), data.into_foreign().cast())
+ };
}
Err(err) => return Error::to_errno(err),
}
diff --git a/rust/kernel/print.rs b/rust/kernel/print.rs
index b19ee490be58..0245c145ea32 100644
--- a/rust/kernel/print.rs
+++ b/rust/kernel/print.rs
@@ -25,7 +25,7 @@
// SAFETY: The C contract guarantees that `buf` is valid if it's less than `end`.
let mut w = unsafe { RawFormatter::from_ptrs(buf.cast(), end.cast()) };
// SAFETY: TODO.
- let _ = w.write_fmt(unsafe { *(ptr as *const fmt::Arguments<'_>) });
+ let _ = w.write_fmt(unsafe { *ptr.cast::<fmt::Arguments<'_>>() });
w.pos().cast()
}
@@ -102,6 +102,7 @@ pub unsafe fn call_printk(
module_name: &[u8],
args: fmt::Arguments<'_>,
) {
+ let args: *const _ = &args;
// `_printk` does not seem to fail in any path.
#[cfg(CONFIG_PRINTK)]
// SAFETY: TODO.
@@ -109,7 +110,7 @@ pub unsafe fn call_printk(
bindings::_printk(
format_string.as_ptr(),
module_name.as_ptr(),
- &args as *const _ as *const c_void,
+ args.cast::<c_void>(),
);
}
}
@@ -122,15 +123,13 @@ pub unsafe fn call_printk(
#[doc(hidden)]
#[cfg_attr(not(CONFIG_PRINTK), allow(unused_variables))]
pub fn call_printk_cont(args: fmt::Arguments<'_>) {
+ let args: *const _ = &args;
// `_printk` does not seem to fail in any path.
//
// SAFETY: The format string is fixed.
#[cfg(CONFIG_PRINTK)]
unsafe {
- bindings::_printk(
- format_strings::CONT.as_ptr(),
- &args as *const _ as *const c_void,
- );
+ bindings::_printk(format_strings::CONT.as_ptr(), args.cast::<c_void>());
}
}
diff --git a/rust/kernel/seq_file.rs b/rust/kernel/seq_file.rs
index 04947c672979..90545d28e6b7 100644
--- a/rust/kernel/seq_file.rs
+++ b/rust/kernel/seq_file.rs
@@ -31,12 +31,13 @@ pub unsafe fn from_raw<'a>(ptr: *mut bindings::seq_file) -> &'a SeqFile {
/// Used by the [`seq_print`] macro.
pub fn call_printf(&self, args: core::fmt::Arguments<'_>) {
+ let args: *const _ = &args;
// SAFETY: Passing a void pointer to `Arguments` is valid for `%pA`.
unsafe {
bindings::seq_printf(
self.inner.get(),
c_str!("%pA").as_char_ptr(),
- &args as *const _ as *const crate::ffi::c_void,
+ args.cast::<crate::ffi::c_void>(),
);
}
}
diff --git a/rust/kernel/str.rs b/rust/kernel/str.rs
index 28e2201604d6..6a1a982b946d 100644
--- a/rust/kernel/str.rs
+++ b/rust/kernel/str.rs
@@ -191,7 +191,7 @@ pub unsafe fn from_char_ptr<'a>(ptr: *const crate::ffi::c_char) -> &'a Self {
// to a `NUL`-terminated C string.
let len = unsafe { bindings::strlen(ptr) } + 1;
// SAFETY: Lifetime guaranteed by the safety precondition.
- let bytes = unsafe { core::slice::from_raw_parts(ptr as _, len) };
+ let bytes = unsafe { core::slice::from_raw_parts(ptr.cast(), len) };
// SAFETY: As `len` is returned by `strlen`, `bytes` does not contain interior `NUL`.
// As we have added 1 to `len`, the last byte is known to be `NUL`.
unsafe { Self::from_bytes_with_nul_unchecked(bytes) }
diff --git a/rust/kernel/sync/poll.rs b/rust/kernel/sync/poll.rs
index d5f17153b424..a151f54cde91 100644
--- a/rust/kernel/sync/poll.rs
+++ b/rust/kernel/sync/poll.rs
@@ -73,7 +73,7 @@ pub fn register_wait(&mut self, file: &File, cv: &PollCondVar) {
// be destroyed, the destructor must run. That destructor first removes all waiters,
// and then waits for an rcu grace period. Therefore, `cv.wait_queue_head` is valid for
// long enough.
- unsafe { qproc(file.as_ptr() as _, cv.wait_queue_head.get(), self.0.get()) };
+ unsafe { qproc(file.as_ptr().cast(), cv.wait_queue_head.get(), self.0.get()) };
}
}
}
diff --git a/rust/kernel/workqueue.rs b/rust/kernel/workqueue.rs
index 0cd100d2aefb..8ff54105be3f 100644
--- a/rust/kernel/workqueue.rs
+++ b/rust/kernel/workqueue.rs
@@ -170,7 +170,7 @@ impl Queue {
pub unsafe fn from_raw<'a>(ptr: *const bindings::workqueue_struct) -> &'a Queue {
// SAFETY: The `Queue` type is `#[repr(transparent)]`, so the pointer cast is valid. The
// caller promises that the pointer is not dangling.
- unsafe { &*(ptr as *const Queue) }
+ unsafe { &*ptr.cast::<Queue>() }
}
/// Enqueues a work item.
@@ -457,7 +457,7 @@ fn get_work_offset(&self) -> usize {
#[inline]
unsafe fn raw_get_work(ptr: *mut Self) -> *mut Work<T, ID> {
// SAFETY: The caller promises that the pointer is valid.
- unsafe { (ptr as *mut u8).add(Self::OFFSET) as *mut Work<T, ID> }
+ unsafe { ptr.cast::<u8>().add(Self::OFFSET).cast::<Work<T, ID>>() }
}
/// Returns a pointer to the struct containing the [`Work<T, ID>`] field.
@@ -472,7 +472,7 @@ unsafe fn work_container_of(ptr: *mut Work<T, ID>) -> *mut Self
{
// SAFETY: The caller promises that the pointer points at a field of the right type in the
// right kind of struct.
- unsafe { (ptr as *mut u8).sub(Self::OFFSET) as *mut Self }
+ unsafe { ptr.cast::<u8>().sub(Self::OFFSET).cast::<Self>() }
}
}
@@ -538,7 +538,7 @@ unsafe impl<T, const ID: u64> WorkItemPointer<ID> for Arc<T>
{
unsafe extern "C" fn run(ptr: *mut bindings::work_struct) {
// The `__enqueue` method always uses a `work_struct` stored in a `Work<T, ID>`.
- let ptr = ptr as *mut Work<T, ID>;
+ let ptr = ptr.cast::<Work<T, ID>>();
// SAFETY: This computes the pointer that `__enqueue` got from `Arc::into_raw`.
let ptr = unsafe { T::work_container_of(ptr) };
// SAFETY: This pointer comes from `Arc::into_raw` and we've been given back ownership.
@@ -591,7 +591,7 @@ unsafe impl<T, const ID: u64> WorkItemPointer<ID> for Pin<KBox<T>>
{
unsafe extern "C" fn run(ptr: *mut bindings::work_struct) {
// The `__enqueue` method always uses a `work_struct` stored in a `Work<T, ID>`.
- let ptr = ptr as *mut Work<T, ID>;
+ let ptr = ptr.cast::<Work<T, ID>>();
// SAFETY: This computes the pointer that `__enqueue` got from `Arc::into_raw`.
let ptr = unsafe { T::work_container_of(ptr) };
// SAFETY: This pointer comes from `Arc::into_raw` and we've been given back ownership.
diff --git a/rust/uapi/lib.rs b/rust/uapi/lib.rs
index 13495910271f..fe9bf7b5a306 100644
--- a/rust/uapi/lib.rs
+++ b/rust/uapi/lib.rs
@@ -15,6 +15,7 @@
#![allow(
clippy::all,
clippy::undocumented_unsafe_blocks,
+ clippy::ptr_as_ptr,
dead_code,
missing_docs,
non_camel_case_types,
---
base-commit: ff64846bee0e7e3e7bc9363ebad3bab42dd27e24
change-id: 20250307-ptr-as-ptr-21b1867fc4d4
Best regards,
--
Tamir Duberstein <tamird(a)gmail.com>
Signal delivery during connect() may disconnect an already established
socket. Problem is that such socket might have been placed in a sockmap
before the connection was closed.
PATCH 1 ensures this race won't lead to an unconnected vsock staying in the
sockmap. PATCH 2 selftests it.
PATCH 3 fixes a related race. Note that here the race window is rather
difficult to hit and I can't think of an easy way of testing it.
Signed-off-by: Michal Luczaj <mhal(a)rbox.co>
---
Changes in v2:
- Handle one more path of tripping the warning
- Add a selftest
- Collect R-b [Stefano]
- Link to v1: https://lore.kernel.org/r/20250307-vsock-trans-signal-race-v1-1-3aca3f771fb…
---
Michal Luczaj (3):
vsock/bpf: Fix EINTR connect() racing sockmap update
selftest/bpf: Add test for AF_VSOCK connect() racing sockmap update
vsock/bpf: Fix bpf recvmsg() racing transport reassignment
net/vmw_vsock/af_vsock.c | 10 +-
net/vmw_vsock/vsock_bpf.c | 24 +++--
.../selftests/bpf/prog_tests/sockmap_basic.c | 111 +++++++++++++++++++++
3 files changed, 136 insertions(+), 9 deletions(-)
---
base-commit: da9e8efe7ee10e8425dc356a9fc593502c8e3933
change-id: 20250305-vsock-trans-signal-race-d62f7718d099
Best regards,
--
Michal Luczaj <mhal(a)rbox.co>
On 2025/3/14 18:14, Su Hui wrote:
> On 2025/3/14 17:21, Dan Carpenter wrote:
>> On Fri, Mar 14, 2025 at 03:14:51PM +0800, Su Hui wrote:
>>> When 'manual=false' and 'signaled=true', then expected value when using
>>> NTSYNC_IOC_CREATE_EVENT should be greater than zero. Fix this typo error.
>>>
>>> Signed-off-by: Su Hui<suhui(a)nfschina.com>
>>> ---
>>> tools/testing/selftests/drivers/ntsync/ntsync.c | 2 +-
>>> 1 file changed, 1 insertion(+), 1 deletion(-)
>>>
>>> diff --git a/tools/testing/selftests/drivers/ntsync/ntsync.c b/tools/testing/selftests/drivers/ntsync/ntsync.c
>>> index 3aad311574c4..bfb6fad653d0 100644
>>> --- a/tools/testing/selftests/drivers/ntsync/ntsync.c
>>> +++ b/tools/testing/selftests/drivers/ntsync/ntsync.c
>>> @@ -968,7 +968,7 @@ TEST(wake_all)
>>> auto_event_args.manual = false;
>>> auto_event_args.signaled = true;
>>> objs[3] = ioctl(fd, NTSYNC_IOC_CREATE_EVENT, &auto_event_args);
>>> - EXPECT_EQ(0, objs[3]);
>>> + EXPECT_LE(0, objs[3]);
>> It's kind of weird how these macros put the constant on the left.
>> It returns an "fd" on success. So this look reasonable. It probably
>> won't return the zero fd so we could probably check EXPECT_LT()?
> Agreed, there are about 29 items that can be changed to EXPECT_LT().
> I can send a v2 patchset with this change if there is no more other
> suggestions.
Sorry for the wrong style of email:(.
Su Hui