virtio-net uses hashes in two ways: RSS and hash reporting.
Conventionally the hash calculation was done by the VMM. However,
computing the hash after the queue was chosen defeats the purpose of
RSS.
Another approach is to use an eBPF steering program. This approach has
another downside: it cannot report the calculated hash due to the
restrictive nature of eBPF.
Extend the steering program feature by introducing a dedicated program
type: BPF_PROG_TYPE_VNET_HASH. This program type is capable of reporting
the hash value and the queue to use at the same time.
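
For context, the ordering constraint can be illustrated with a minimal
user-space sketch of RSS: the hash is computed first, it then indexes an
indirection table to pick the queue, and the same hash value is what hash
reporting would hand to the guest. This is not code from this series.
toeplitz_hash(), rss_select_queue() and struct rss_result are hypothetical
names, and the sketch assumes a Toeplitz hash and a power-of-two
indirection table.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical result type: the hash and the queue derived from it. */
struct rss_result {
	uint32_t hash;
	uint16_t queue;
};

/* Toeplitz hash: XOR the 32-bit key window seen at every set input bit. */
static uint32_t toeplitz_hash(const uint8_t *key, size_t key_len,
			      const uint8_t *data, size_t data_len)
{
	uint32_t hash = 0;
	uint32_t window;
	size_t i;
	int bit;

	if (key_len < 4)
		return 0;

	/* The window starts as the first 32 bits of the key. */
	window = ((uint32_t)key[0] << 24) | ((uint32_t)key[1] << 16) |
		 ((uint32_t)key[2] << 8) | key[3];

	for (i = 0; i < data_len; i++) {
		for (bit = 7; bit >= 0; bit--) {
			if (data[i] & (1u << bit))
				hash ^= window;
			/* Slide the window by one key bit. */
			window <<= 1;
			if (i + 4 < key_len && (key[i + 4] & (1u << bit)))
				window |= 1;
		}
	}
	return hash;
}

/*
 * Pick the queue from the hash. Assumption: table_len is a power of two,
 * so masking is equivalent to a modulo.
 */
static struct rss_result rss_select_queue(uint32_t hash,
					  const uint16_t *indirection_table,
					  size_t table_len)
{
	struct rss_result r = {
		.hash = hash,
		.queue = indirection_table[hash & (table_len - 1)],
	};
	return r;
}
```

Because the queue is a pure function of the hash, a component that needs
both values has to produce them together, which is what the new program
type is meant to allow.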
This is a rewrite of an RFC patch series submitted by Yuri Benditovich that
incorporates feedback on that series and on V1 of this series:
https://lore.kernel.org/lkml/20210112194143.1494-1-yuri.benditovich@daynix.…
QEMU patched to use this new feature is available at:
https://github.com/daynix/qemu/tree/akihikodaki/bpf
The QEMU patches will soon be submitted upstream as an RFC too.
V1 -> V2:
Changed to introduce a new BPF program type.
Akihiko Odaki (7):
bpf: Introduce BPF_PROG_TYPE_VNET_HASH
bpf: Add vnet_hash members to __sk_buff
skbuff: Introduce SKB_EXT_TUN_VNET_HASH
virtio_net: Add virtio_net_hdr_v1_hash_from_skb()
tun: Support BPF_PROG_TYPE_VNET_HASH
selftests/bpf: Test BPF_PROG_TYPE_VNET_HASH
vhost_net: Support VIRTIO_NET_F_HASH_REPORT
Documentation/bpf/bpf_prog_run.rst | 1 +
Documentation/bpf/libbpf/program_types.rst | 2 +
drivers/net/tun.c | 158 +++++--
drivers/vhost/net.c | 16 +-
include/linux/bpf_types.h | 2 +
include/linux/filter.h | 7 +
include/linux/skbuff.h | 10 +
include/linux/virtio_net.h | 22 +
include/uapi/linux/bpf.h | 5 +
kernel/bpf/verifier.c | 6 +
net/core/filter.c | 86 +++-
net/core/skbuff.c | 3 +
tools/include/uapi/linux/bpf.h | 5 +
tools/lib/bpf/libbpf.c | 2 +
tools/testing/selftests/bpf/config | 1 +
tools/testing/selftests/bpf/config.aarch64 | 1 -
.../selftests/bpf/prog_tests/vnet_hash.c | 385 ++++++++++++++++++
tools/testing/selftests/bpf/progs/vnet_hash.c | 16 +
18 files changed, 681 insertions(+), 47 deletions(-)
create mode 100644 tools/testing/selftests/bpf/prog_tests/vnet_hash.c
create mode 100644 tools/testing/selftests/bpf/progs/vnet_hash.c
--
2.42.0
Hi,
Changes since v2 [1]:
* Added a new patch (sent separately earlier) at the end, to error out
if "make headers" has not yet been run.
* Reworked and simplified the uffd movement patch. Now it only moves
some uffd*() routines, not all, and doesn't have to touch the Makefile
at all. This lighter touch also allowed me to drop the "move psize(),
pshift() into vm_utils.c" patch entirely. I expect Peter Xu will be a little
happier with this new approach.
* Fixed the commit description for the MADV_COLLAPSE patch.
* Added more Reviewed-by tags from David Hildenbrand and Peter Xu.
[1] https://lore.kernel.org/all/20230603021558.95299-1-jhubbard@nvidia.com/
John Hubbard (11):
selftests/mm: fix uffd-stress unused function warning
selftests/mm: fix unused variable warnings in hugetlb-madvise.c,
migration.c
selftests/mm: fix "warning: expression which evaluates to zero..." in
mlock2-tests.c
selftests/mm: fix invocation of tests that are run via shell scripts
selftests/mm: .gitignore: add mkdirty, va_high_addr_switch
selftests/mm: fix two -Wformat-security warnings in uffd builds
selftests/mm: fix a "possibly uninitialized" warning in pkey-x86.h
selftests/mm: fix build failures due to missing MADV_COLLAPSE
selftests/mm: move certain uffd*() routines from vm_util.c to
uffd-common.c
Documentation: kselftest: "make headers" is a prerequisite
selftests: error out if kernel header files are not yet built
Documentation/dev-tools/kselftest.rst | 1 +
tools/testing/selftests/lib.mk | 36 +++++++++++-
tools/testing/selftests/mm/.gitignore | 2 +
tools/testing/selftests/mm/cow.c | 7 ---
tools/testing/selftests/mm/hugetlb-madvise.c | 8 ++-
tools/testing/selftests/mm/khugepaged.c | 10 ----
tools/testing/selftests/mm/migration.c | 5 +-
tools/testing/selftests/mm/mlock2-tests.c | 1 -
tools/testing/selftests/mm/pkey-x86.h | 2 +-
tools/testing/selftests/mm/run_vmtests.sh | 6 +-
tools/testing/selftests/mm/uffd-common.c | 59 ++++++++++++++++++++
tools/testing/selftests/mm/uffd-common.h | 5 ++
tools/testing/selftests/mm/uffd-stress.c | 10 ----
tools/testing/selftests/mm/uffd-unit-tests.c | 16 ++----
tools/testing/selftests/mm/vm_util.c | 59 --------------------
tools/testing/selftests/mm/vm_util.h | 14 +++--
16 files changed, 130 insertions(+), 111 deletions(-)
base-commit: f8dba31b0a826e691949cd4fdfa5c30defaac8c5
--
2.40.1
Regressions that cause a device to no longer be probed by a driver can
have a big impact on the platform's functionality, and despite being
relatively common there isn't currently any generic test to detect them.
As an example, bootrr [1] does test for device probe, but it requires
defining the expected probed devices for each platform.
Given that the Devicetree already provides a static description of
devices on the system, it is a good basis on which to build such a
test.
This series introduces a test to catch regressions that prevent devices
from probing.
Patches 1 and 2 extend the existing dt-extract-compatibles to be able to
output only the compatibles that can be expected to match a Devicetree
node to a driver. Patch 3 adds a kselftest that walks over the
Devicetree nodes on the current platform and compares the compatibles to
the ones on the list, and on an ignore list, to point out devices that
failed to be probed.
A compatible list is needed because not all compatibles that can show up
in a Devicetree node can be used to match to a driver, for example the
code for that compatible might use "OF_DECLARE" type macros and avoid
the driver framework, or the node might be controlled by a driver that
was bound to a different node.
An ignore list is needed for the few cases where it's common for a
driver to match a device but not probe, like for the "simple-mfd"
compatible, where the driver only probes if that compatible is the
node's first compatible.
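
The per-node decision described above can be sketched in C as follows.
This is only an illustration of the logic, not the test itself (which is
a shell script), and it rests on assumptions: classify_node(), in_list()
and enum probe_verdict are hypothetical names, and whether a driver is
bound to the node is passed in as a flag rather than read from sysfs.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

enum probe_verdict { PROBE_SKIP, PROBE_PASS, PROBE_FAIL };

/* Return true if name appears in the list of strings. */
static bool in_list(const char *name, const char *const *list, size_t len)
{
	for (size_t i = 0; i < len; i++)
		if (strcmp(name, list[i]) == 0)
			return true;
	return false;
}

/*
 * A node is only expected to probe if at least one of its compatibles
 * can match a driver and is not on the ignore list. Otherwise there is
 * nothing to check, so the node is skipped.
 */
static enum probe_verdict classify_node(const char *const *compats,
					size_t n_compats,
					const char *const *matchable,
					size_t n_matchable,
					const char *const *ignored,
					size_t n_ignored,
					bool driver_bound)
{
	for (size_t i = 0; i < n_compats; i++) {
		if (in_list(compats[i], ignored, n_ignored))
			continue;
		if (in_list(compats[i], matchable, n_matchable))
			return driver_bound ? PROBE_PASS : PROBE_FAIL;
	}
	return PROBE_SKIP;
}
```

With "simple-mfd" on the ignore list, a node whose only driver-matching
compatible is "simple-mfd" is skipped instead of being reported as a
failure.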
The reason for parsing the kernel source instead of relying on
information exposed by the kernel at runtime (say, looking at modaliases
or introducing some other mechanism), is to be able to catch issues
where a config was renamed or a driver moved across configs, and the
.config used by the kernel not updated accordingly. We need to parse the
source to find all compatibles present in the kernel independent of the
current config being run.
[1] https://github.com/kernelci/bootrr
Changes in v3:
- Added DT selftest path to MAINTAINERS
- Enabled device probe test for nodes with 'status = "ok"'
- Added pass/fail/skip totals to end of test output
Changes in v2:
- Extended dt-extract-compatibles script to be able to extract driver
matching compatibles, instead of adding a new one in Coccinelle
- Made kselftest output in the KTAP format
Nícolas F. R. A. Prado (3):
dt: dt-extract-compatibles: Handle cfile arguments in generator
function
dt: dt-extract-compatibles: Add flag for driver matching compatibles
kselftest: Add new test for detecting unprobed Devicetree devices
MAINTAINERS | 1 +
scripts/dtc/dt-extract-compatibles | 74 +++++++++++++----
tools/testing/selftests/Makefile | 1 +
tools/testing/selftests/dt/.gitignore | 1 +
tools/testing/selftests/dt/Makefile | 21 +++++
.../selftests/dt/compatible_ignore_list | 1 +
tools/testing/selftests/dt/ktap_helpers.sh | 70 ++++++++++++++++
.../selftests/dt/test_unprobed_devices.sh | 83 +++++++++++++++++++
8 files changed, 236 insertions(+), 16 deletions(-)
create mode 100644 tools/testing/selftests/dt/.gitignore
create mode 100644 tools/testing/selftests/dt/Makefile
create mode 100644 tools/testing/selftests/dt/compatible_ignore_list
create mode 100644 tools/testing/selftests/dt/ktap_helpers.sh
create mode 100755 tools/testing/selftests/dt/test_unprobed_devices.sh
--
2.42.0
Hi Reinette, Fenghua,
This series introduces a new mount option enabling an alternate mode for
MBM to work around an issue on present AMD implementations and any other
resctrl implementation where there are more RMIDs (or equivalent) than
hardware counters.
The L3 External Bandwidth Monitoring feature of the AMD PQoS
extension[1] only guarantees that RMIDs currently assigned to a
processor will be tracked by hardware. The counters of any other RMIDs
which are no longer being tracked will be reset to zero. The MBM event
counters return "Unavailable" to indicate when this has happened.
An interval for effectively measuring memory bandwidth typically needs
to be multiple seconds long. In Google's workloads, it is not feasible
to bound the number of jobs with different RMIDs which will run in a
cache domain over any period of time. Consequently, on a
fully-committed system where all RMIDs are allocated, few groups'
counters return non-zero values.
To demonstrate the underlying issue, the first patch provides a test
case in tools/testing/selftests/resctrl/test_rmids.sh.
On an AMD EPYC 7B12 64-Core Processor with the default behavior:
# ./test_rmids.sh
Created 255 monitoring groups.
g1: mbm_total_bytes: Unavailable -> Unavailable (FAIL)
g2: mbm_total_bytes: Unavailable -> Unavailable (FAIL)
g3: mbm_total_bytes: Unavailable -> Unavailable (FAIL)
[..]
g238: mbm_total_bytes: Unavailable -> Unavailable (FAIL)
g239: mbm_total_bytes: Unavailable -> Unavailable (FAIL)
g240: mbm_total_bytes: Unavailable -> Unavailable (FAIL)
g241: mbm_total_bytes: Unavailable -> 660497472
g242: mbm_total_bytes: Unavailable -> 660793344
g243: mbm_total_bytes: Unavailable -> 660477312
g244: mbm_total_bytes: Unavailable -> 660495360
g245: mbm_total_bytes: Unavailable -> 660775360
g246: mbm_total_bytes: Unavailable -> 660645504
g247: mbm_total_bytes: Unavailable -> 660696128
g248: mbm_total_bytes: Unavailable -> 660605248
g249: mbm_total_bytes: Unavailable -> 660681280
g250: mbm_total_bytes: Unavailable -> 660834240
g251: mbm_total_bytes: Unavailable -> 660440064
g252: mbm_total_bytes: Unavailable -> 660501504
g253: mbm_total_bytes: Unavailable -> 660590720
g254: mbm_total_bytes: Unavailable -> 660548352
g255: mbm_total_bytes: Unavailable -> 660607296
255 groups, 0 returned counts in first pass, 15 in second
successfully measured bandwidth from 15/255 groups
To compare, here is the output from an Intel(R) Xeon(R) Platinum 8173M
CPU:
# ./test_rmids.sh
Created 223 monitoring groups.
g1: mbm_total_bytes: 0 -> 606126080
g2: mbm_total_bytes: 0 -> 613236736
g3: mbm_total_bytes: 0 -> 610254848
[..]
g221: mbm_total_bytes: 0 -> 584679424
g222: mbm_total_bytes: 0 -> 588808192
g223: mbm_total_bytes: 0 -> 587317248
223 groups, 223 returned counts in first pass, 223 in second
successfully measured bandwidth from 223/223 groups
To make better use of the hardware in such a use case, this patchset
introduces a "soft" RMID implementation, where each CPU is permanently
assigned a "hard" RMID. On context switches which change the current
soft RMID, the difference between each CPU's current event counts and
most recent counts is added to the totals for the current or outgoing
soft RMID.
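
The accounting described above can be sketched in plain C. This is a
simplified model under stated assumptions, not the code in this series:
the names (cpu_state, soft_totals, mbm_flush_cpu_sketch(),
soft_rmid_switch_sketch()) are hypothetical, the hardware count is passed
in as a parameter instead of being read from the CPU's hard RMID, and
counter wraparound and locking are ignored.

```c
#include <stdint.h>

#define NR_SKETCH_CPUS		4
#define NR_SKETCH_SOFT_RMIDS	8

struct cpu_mbm_state {
	uint64_t prev_count;	/* hardware count seen at the last flush */
	uint32_t soft_rmid;	/* soft RMID currently running on this CPU */
};

static struct cpu_mbm_state cpu_state[NR_SKETCH_CPUS];
static uint64_t soft_totals[NR_SKETCH_SOFT_RMIDS];

/*
 * Credit the bytes counted on this CPU since the last flush to the soft
 * RMID that was running during that interval.
 */
static void mbm_flush_cpu_sketch(unsigned int cpu, uint64_t hw_count)
{
	struct cpu_mbm_state *s = &cpu_state[cpu];

	soft_totals[s->soft_rmid] += hw_count - s->prev_count;
	s->prev_count = hw_count;
}

/* Called on a context switch; only flushes when the soft RMID changes. */
static void soft_rmid_switch_sketch(unsigned int cpu, uint32_t next_rmid,
				    uint64_t hw_count)
{
	if (cpu_state[cpu].soft_rmid == next_rmid)
		return;

	mbm_flush_cpu_sketch(cpu, hw_count);
	cpu_state[cpu].soft_rmid = next_rmid;
}
```

Since the hardware only ever tracks one RMID per CPU in this scheme, the
number of soft RMIDs is no longer limited by the number of hardware
counters.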
This technique does not work for cache occupancy counters, so this patch
series disables cache occupancy events when soft RMIDs are enabled.
This series adds the "mbm_soft_rmid" mount option to allow users to
opt in to the functionality when they deem it helpful.
When the same system from the earlier AMD example enables the
mbm_soft_rmid mount option:
# ./test_rmids.sh
Created 255 monitoring groups.
g1: mbm_total_bytes: 0 -> 686560576
g2: mbm_total_bytes: 0 -> 668204416
[..]
g252: mbm_total_bytes: 0 -> 672651200
g253: mbm_total_bytes: 0 -> 666956800
g254: mbm_total_bytes: 0 -> 665917056
g255: mbm_total_bytes: 0 -> 671049600
255 groups, 255 returned counts in first pass, 255 in second
successfully measured bandwidth from 255/255 groups
(patches are based on tip/master)
[1] https://www.amd.com/system/files/TechDocs/56375_1.03_PUB.pdf
Peter Newman (8):
selftests/resctrl: Verify all RMIDs count together
x86/resctrl: Add resctrl_mbm_flush_cpu() to collect CPUs' MBM events
x86/resctrl: Flush MBM event counts on soft RMID change
x86/resctrl: Call mon_event_count() directly for soft RMIDs
x86/resctrl: Create soft RMID version of __mon_event_count()
x86/resctrl: Assign HW RMIDs to CPUs for soft RMID
x86/resctrl: Use mbm_update() to push soft RMID counts
x86/resctrl: Add mount option to enable soft RMID
Stephane Eranian (1):
x86/resctrl: Hold a spinlock in __rmid_read() on AMD
arch/x86/include/asm/resctrl.h | 29 +++-
arch/x86/kernel/cpu/resctrl/core.c | 80 ++++++++-
arch/x86/kernel/cpu/resctrl/ctrlmondata.c | 9 +-
arch/x86/kernel/cpu/resctrl/internal.h | 19 ++-
arch/x86/kernel/cpu/resctrl/monitor.c | 158 +++++++++++++++++-
arch/x86/kernel/cpu/resctrl/rdtgroup.c | 52 ++++++
tools/testing/selftests/resctrl/test_rmids.sh | 93 +++++++++++
7 files changed, 425 insertions(+), 15 deletions(-)
create mode 100755 tools/testing/selftests/resctrl/test_rmids.sh
base-commit: dd806e2f030e57dd5bac973372aa252b6c175b73
--
2.40.0.634.g4ca3ef3211-goog
In kunit_debugfs_create_suite() give up and skip creating the debugfs
file if any of the alloc_string_stream() calls return an error or NULL.
Only put a value in the log pointer of kunit_suite and kunit_test if it
is a valid pointer to a log.
This prevents the potential invalid dereference reported by smatch:
lib/kunit/debugfs.c:115 kunit_debugfs_create_suite() error: 'suite->log'
dereferencing possible ERR_PTR()
lib/kunit/debugfs.c:119 kunit_debugfs_create_suite() error: 'test_case->log'
dereferencing possible ERR_PTR()
Signed-off-by: Richard Fitzgerald <rf(a)opensource.cirrus.com>
Reported-by: Dan Carpenter <dan.carpenter(a)linaro.org>
Fixes: 05e2006ce493 ("kunit: Use string_stream for test log")
---
lib/kunit/debugfs.c | 30 +++++++++++++++++++++++++-----
1 file changed, 25 insertions(+), 5 deletions(-)
diff --git a/lib/kunit/debugfs.c b/lib/kunit/debugfs.c
index 270d185737e6..9d167adfa746 100644
--- a/lib/kunit/debugfs.c
+++ b/lib/kunit/debugfs.c
@@ -109,14 +109,28 @@ static const struct file_operations debugfs_results_fops = {
void kunit_debugfs_create_suite(struct kunit_suite *suite)
{
struct kunit_case *test_case;
+ struct string_stream *stream;
- /* Allocate logs before creating debugfs representation. */
- suite->log = alloc_string_stream(GFP_KERNEL);
- string_stream_set_append_newlines(suite->log, true);
+ /*
+ * Allocate logs before creating debugfs representation.
+ * The suite->log and test_case->log pointer are expected to be NULL
+ * if there isn't a log, so only set it if the log stream was created
+ * successfully.
+ */
+ stream = alloc_string_stream(GFP_KERNEL);
+ if (IS_ERR_OR_NULL(stream))
+ return;
+
+ string_stream_set_append_newlines(stream, true);
+ suite->log = stream;
kunit_suite_for_each_test_case(suite, test_case) {
- test_case->log = alloc_string_stream(GFP_KERNEL);
- string_stream_set_append_newlines(test_case->log, true);
+ stream = alloc_string_stream(GFP_KERNEL);
+ if (IS_ERR_OR_NULL(stream))
+ goto err;
+
+ string_stream_set_append_newlines(stream, true);
+ test_case->log = stream;
}
suite->debugfs = debugfs_create_dir(suite->name, debugfs_rootdir);
@@ -124,6 +138,12 @@ void kunit_debugfs_create_suite(struct kunit_suite *suite)
debugfs_create_file(KUNIT_DEBUGFS_RESULTS, S_IFREG | 0444,
suite->debugfs,
suite, &debugfs_results_fops);
+ return;
+
+err:
+ string_stream_destroy(suite->log);
+ kunit_suite_for_each_test_case(suite, test_case)
+ string_stream_destroy(test_case->log);
}
void kunit_debugfs_destroy_suite(struct kunit_suite *suite)
--
2.30.2
Check the stream pointer passed to string_stream_destroy() for
IS_ERR_OR_NULL() instead of only NULL.
Whatever alloc_string_stream() returns should be safe to pass
to string_stream_destroy(), and on failure that will be an ERR_PTR.
It's obviously good practise and generally helpful to also check
for NULL pointers so that client cleanup code can call
string_stream_destroy() unconditionally - which could include
pointers that have never been set to anything and so are NULL.
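
As a reminder of why a NULL check alone is not enough, here is a small
user-space model of the kernel's error-pointer helpers. It mirrors the
idea behind include/linux/err.h (errno values encoded in the top 4095
addresses) but is a sketch for illustration, not the kernel source.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_ERRNO 4095

/* Encode a negative errno value as a pointer. */
static inline void *ERR_PTR(long error)
{
	return (void *)error;
}

/* True if ptr lies in the last MAX_ERRNO bytes of the address space. */
static inline bool IS_ERR(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

/* True for NULL and for encoded errors alike. */
static inline bool IS_ERR_OR_NULL(const void *ptr)
{
	return !ptr || IS_ERR(ptr);
}
```

An error pointer such as ERR_PTR(-12) is non-NULL, so a plain "if
(!stream)" test lets it through to code that dereferences it.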
Signed-off-by: Richard Fitzgerald <rf(a)opensource.cirrus.com>
---
lib/kunit/string-stream.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/lib/kunit/string-stream.c b/lib/kunit/string-stream.c
index a6f3616c2048..54f4fdcbfac8 100644
--- a/lib/kunit/string-stream.c
+++ b/lib/kunit/string-stream.c
@@ -173,7 +173,7 @@ void string_stream_destroy(struct string_stream *stream)
{
KUNIT_STATIC_STUB_REDIRECT(string_stream_destroy, stream);
- if (!stream)
+ if (IS_ERR_OR_NULL(stream))
return;
string_stream_clear(stream);
--
2.30.2
Handle namespace creation differently for root and non-root
users in the create_and_enter_ns() function.
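
The intended decision can be summarised by the sketch below. It is a
simplification under stated assumptions: ns_flags_for_uid() is a
hypothetical helper that does not exist in test_execve.c, it only returns
the flags that would be passed to unshare(), and it assumes non-root users
use a combined user and mount namespace, as the "[NOTE] Using a user
namespace for tests" output below suggests.

```c
#include <linux/sched.h>	/* CLONE_NEWNS, CLONE_NEWUSER */
#include <sys/types.h>		/* uid_t */

/*
 * Root only needs a private mount namespace and keeps global UIDs.
 * Anyone else needs a user namespace as well to gain the privileges
 * required inside it.
 */
static int ns_flags_for_uid(uid_t outer_uid)
{
	if (outer_uid == 0)
		return CLONE_NEWNS;

	return CLONE_NEWUSER | CLONE_NEWNS;
}
```

Unlike this sketch, the patched condition also falls through to the next
branch when root's unshare(CLONE_NEWNS) call fails.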
Test result with root user:
$sudo make TARGETS="capabilities" kselftest
...
TAP version 13
1..1
timeout set to 45
selftests: capabilities: test_execve
TAP version 13
1..12
[RUN] +++ Tests with uid == 0 +++
[NOTE] Using global UIDs for tests
[RUN] Root => ep
...
ok 12 Passed
Totals: pass:12 fail:0 xfail:0 xpass:0 skip:0 error:0
==================================================
TAP version 13
1..9
[RUN] +++ Tests with uid != 0 +++
[NOTE] Using global UIDs for tests
[RUN] Non-root => no caps
...
ok 9 Passed
Totals: pass:9 fail:0 xfail:0 xpass:0 skip:0 error:0
Test result as a normal (non-root) user:
$make TARGETS="capabilities" kselftest
...
timeout set to 45
selftests: capabilities: test_execve
TAP version 13
1..12
[RUN] +++ Tests with uid == 0 +++
[NOTE] Using a user namespace for tests
[RUN] Root => ep
validate_cap:: Capabilities after execve were correct
ok 1 Passed
Check cap_ambient manipulation rules
ok 2 PR_CAP_AMBIENT_RAISE failed on non-inheritable cap
ok 3 PR_CAP_AMBIENT_RAISE failed on non-permitted cap
ok 4 PR_CAP_AMBIENT_RAISE worked
ok 5 Basic manipulation appears to work
[RUN] Root +i => eip
validate_cap:: Capabilities after execve were correct
ok 6 Passed
[RUN] UID 0 +ia => eipa
validate_cap:: Capabilities after execve were correct
ok 7 Passed
ok 8 # SKIP SUID/SGID tests (needs privilege)
Planned tests != run tests (12 != 8)
Totals: pass:7 fail:0 xfail:0 xpass:0 skip:1 error:0
==================================================
TAP version 13
1..9
[RUN] +++ Tests with uid != 0 +++
[NOTE] Using a user namespace for tests
[RUN] Non-root => no caps
validate_cap:: Capabilities after execve were correct
ok 1 Passed
Check cap_ambient manipulation rules
ok 2 PR_CAP_AMBIENT_RAISE failed on non-inheritable cap
ok 3 PR_CAP_AMBIENT_RAISE failed on non-permitted cap
ok 4 PR_CAP_AMBIENT_RAISE worked
ok 5 Basic manipulation appears to work
[RUN] Non-root +i => i
validate_cap:: Capabilities after execve were correct
ok 6 Passed
[RUN] UID 1 +ia => eipa
validate_cap:: Capabilities after execve were correct
ok 7 Passed
ok 8 # SKIP SUID/SGID tests (needs privilege)
Planned tests != run tests (9 != 8)
Totals: pass:7 fail:0 xfail:0 xpass:0 skip:1 error:0
Signed-off-by: Swarup Laxman Kotiaklapudi <swarupkotikalapudi(a)gmail.com>
---
tools/testing/selftests/capabilities/test_execve.c | 6 +-----
1 file changed, 1 insertion(+), 5 deletions(-)
diff --git a/tools/testing/selftests/capabilities/test_execve.c b/tools/testing/selftests/capabilities/test_execve.c
index df0ef02b4036..8236150d377e 100644
--- a/tools/testing/selftests/capabilities/test_execve.c
+++ b/tools/testing/selftests/capabilities/test_execve.c
@@ -96,11 +96,7 @@ static bool create_and_enter_ns(uid_t inner_uid)
outer_uid = getuid();
outer_gid = getgid();
- /*
- * TODO: If we're already root, we could skip creating the userns.
- */
-
- if (unshare(CLONE_NEWNS) == 0) {
+ if (outer_uid == 0 && unshare(CLONE_NEWNS) == 0) {
ksft_print_msg("[NOTE]\tUsing global UIDs for tests\n");
if (prctl(PR_SET_KEEPCAPS, 1, 0, 0, 0) != 0)
ksft_exit_fail_msg("PR_SET_KEEPCAPS - %s\n",
--
2.34.1