Greetings:
This is an attempt to follow up on something Jakub asked me about [1],
adding an xsk attribute to queues and more clearly documenting which
queues are linked to NAPIs...
After the RFC [2], Jakub suggested creating an empty nest for queues
which have a pool, so I've adjusted this version to work that way.
The nest can be extended in the future to express attributes about XSK
as needed. Queues which are not used for AF_XDP do not have the xsk
attribute present.
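As a rough illustration of how a consumer of the queue dump might use this, the sketch below partitions queue entries by whether the xsk nest is present. The dict layout (keys "id", "type", "napi-id", "xsk") and the sample values are assumptions made for illustration; they are not output captured from a real device, and split_by_xsk() is a hypothetical helper, not part of the series.

```python
def split_by_xsk(queues):
    """Partition queue entries by presence of the (currently empty) 'xsk' nest."""
    with_xsk, without_xsk = [], []
    for q in queues:
        # Test for the key itself: an empty nest decodes to an empty dict,
        # which is falsy, so q.get("xsk") would misclassify it.
        (with_xsk if "xsk" in q else without_xsk).append(q)
    return with_xsk, without_xsk


# Hypothetical queue dump; values are made up for illustration.
sample = [
    {"id": 0, "type": "rx", "napi-id": 8193},
    {"id": 1, "type": "rx", "napi-id": 8194, "xsk": {}},
    {"id": 1, "type": "tx", "napi-id": 8194, "xsk": {}},
]

bound, unbound = split_by_xsk(sample)
print("AF_XDP queues:", [(q["type"], q["id"]) for q in bound])
print("other queues: ", [(q["type"], q["id"]) for q in unbound])
```

Checking for key presence rather than truthiness is the point of the sketch: with an empty nest, presence alone carries the information.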
I've run the included test on:
- my mlx5 machine (via NETIF=)
- without setting NETIF
And the test seems to pass in both cases.
Thanks,
Joe
[1]: https://lore.kernel.org/netdev/20250113143109.60afa59a@kernel.org/
[2]: https://lore.kernel.org/netdev/20250129172431.65773-1-jdamato@fastly.com/
v2:
- Switched from RFC to actual submission now that net-next is open
- Adjusted patch 1 to include an empty nest as suggested by Jakub
- Adjusted patch 2 to update the test based on changes to patch 1, and
to incorporate some Python feedback from Jakub :)
rfc: https://lore.kernel.org/netdev/20250129172431.65773-1-jdamato@fastly.com/
Joe Damato (2):
netdev-genl: Add an XSK attribute to queues
selftests: drv-net: Test queue xsk attribute
Documentation/netlink/specs/netdev.yaml | 13 ++-
include/uapi/linux/netdev.h | 6 ++
net/core/netdev-genl.c | 11 +++
tools/include/uapi/linux/netdev.h | 6 ++
.../testing/selftests/drivers/net/.gitignore | 2 +
tools/testing/selftests/drivers/net/Makefile | 3 +
tools/testing/selftests/drivers/net/queues.py | 35 +++++++-
.../selftests/drivers/net/xdp_helper.c | 90 +++++++++++++++++++
8 files changed, 163 insertions(+), 3 deletions(-)
create mode 100644 tools/testing/selftests/drivers/net/.gitignore
create mode 100644 tools/testing/selftests/drivers/net/xdp_helper.c
base-commit: c2933b2befe25309f4c5cfbea0ca80909735fd76
--
2.25.1
This series introduces support in the ARM PMUv3 driver for
partitioning PMU counters into two separate ranges by taking advantage
of the MDCR_EL2.HPMN register field.
The advantage of a partitioned PMU would be to allow KVM guests direct
access to a subset of PMU functionality, greatly reducing the overhead
of performance monitoring in guests.
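To make the split concrete: architecturally, setting MDCR_EL2.HPMN to n leaves event counters 0..n-1 accessible below EL2 and reserves counters n..N-1 for EL2. The sketch below computes the two resulting bitmasks. The function name and the "guest"/"host" labels are assumptions for illustration and are not taken from the patches; architectural constraints on edge values of HPMN are not modeled.

```python
def partition_masks(num_counters, hpmn):
    """Return (guest_mask, host_mask) for event counters split at index hpmn.

    Counters [0, hpmn) go to the lower (guest-accessible) range and
    counters [hpmn, num_counters) to the upper (EL2-reserved) range.
    """
    if not 0 <= hpmn <= num_counters:
        raise ValueError("hpmn must be within [0, num_counters]")
    all_mask = (1 << num_counters) - 1
    guest_mask = (1 << hpmn) - 1
    host_mask = all_mask & ~guest_mask
    return guest_mask, host_mask


# Example: 6 event counters, 4 of them handed to the guest range.
guest, host = partition_masks(6, 4)
print(f"guest={guest:#x} host={host:#x}")
```

The two masks are disjoint and together cover every implemented counter, which is the property a partitioned driver would need to preserve.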
While this series could be accepted on its own merits, practically
there is a lot more to be done before it will be fully useful, so I'm
sending as an RFC for now.
This series is based on v6.13-rc7. It needs a small additional change
after Oliver's Debug cleanups series going into 6.14, specifically
this patch [1], because it changes kvm_arm_setup_mdcr_el2() to
initialize HPMN from a cached value read early in the boot process
instead of reading from the register. The only sensible way I can see
to deal with this is returning to reading the register.
[1] https://lore.kernel.org/kvmarm/20241219224116.3941496-3-oliver.upton@linux.…
Colton Lewis (4):
perf: arm_pmuv3: Introduce module param to partition the PMU
KVM: arm64: Make guests see only counters they can access
perf: arm_pmuv3: Generalize counter bitmasks
perf: arm_pmuv3: Keep out of guest counter partition
arch/arm/include/asm/arm_pmuv3.h | 10 ++
arch/arm64/include/asm/arm_pmuv3.h | 10 ++
arch/arm64/kvm/pmu-emul.c | 8 +-
drivers/perf/arm_pmuv3.c | 113 ++++++++++++++++--
include/linux/perf/arm_pmu.h | 2 +
include/linux/perf/arm_pmuv3.h | 34 +++++-
.../kvm/aarch64/vpmu_counter_access.c | 2 +-
7 files changed, 160 insertions(+), 19 deletions(-)
base-commit: 5bc55a333a2f7316b58edc7573e8e893f7acb532
--
2.48.1.262.g85cc9f2d1e-goog
On Wed, Nov 13, 2024 at 2:31 AM Paolo Bonzini <pbonzini(a)redhat.com> wrote:
>
>
>
> On Tue, Nov 12, 2024, 21:44 Doug Covelli <doug.covelli(a)broadcom.com> wrote:
>>
>> > Split irqchip should be the best tradeoff. Without it, moves from cr8
>> > stay in the kernel, but moves to cr8 always go to userspace with a
>> > KVM_EXIT_SET_TPR exit. You also won't be able to use Intel
>> > flexpriority (in-processor accelerated TPR) because KVM does not know
>> > which bits are set in IRR. So it will be *really* every move to cr8
>> > that goes to userspace.
>>
>> Sorry to hijack this thread but is there a technical reason not to allow CR8
>> based accesses to the TPR (not MMIO accesses) when the in-kernel local APIC is
>> not in use?
>
>
> No worries, you're not hijacking :) The only reason is that it would be more code for a seldom used feature and anyway with worse performance. (To be clear, CR8 based accesses are allowed, but stores cause an exit in order to check the new TPR against IRR. That's because KVM's API does not have an equivalent of the TPR threshold as you point out below).
I have not really looked at the code but it seems like it could also
simplify things as CR8 would be handled more uniformly regardless of
who is virtualizing the local APIC.
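For readers following the TPR discussion, the check that a CR8 store has to trigger can be sketched as follows. CR8 holds the 4-bit priority class (TPR bits 7:4), and a pending vector is deliverable only if its own class (vector >> 4) is strictly greater. This is an illustrative model with hypothetical names, not KVM code, and it deliberately ignores in-service interrupts (ISR/PPR).

```python
def highest_pending_vector(irr):
    """irr is an integer bitmap: bit n set means vector n is pending."""
    return irr.bit_length() - 1 if irr else None


def deliverable_after_cr8_write(irr, cr8):
    """True if the highest pending vector outranks the new TPR class."""
    vec = highest_pending_vector(irr)
    return vec is not None and (vec >> 4) > (cr8 & 0xF)


# Vectors 0x31 and 0x51 pending; the highest one has priority class 5.
irr = (1 << 0x51) | (1 << 0x31)
print(deliverable_after_cr8_write(irr, 5))  # class 5 is not above TPR class 5
print(deliverable_after_cr8_write(irr, 4))  # class 5 is above TPR class 4
```

Whoever owns IRR has to run this comparison on every CR8 store, which is why the stores exit when userspace holds the APIC state.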
>> Also I could not find these documented anywhere but with MSFT's APIC our monitor
>> relies on extensions for trapping certain events such as INIT/SIPI plus LINT0
>> and SVR writes:
>>
>> UINT64 X64ApicInitSipiExitTrap : 1; // WHvRunVpExitReasonX64ApicInitSipiTrap
>> UINT64 X64ApicWriteLint0ExitTrap : 1; // WHvRunVpExitReasonX64ApicWriteTrap
>> UINT64 X64ApicWriteLint1ExitTrap : 1; // WHvRunVpExitReasonX64ApicWriteTrap
>> UINT64 X64ApicWriteSvrExitTrap : 1; // WHvRunVpExitReasonX64ApicWriteTrap
>
>
> There's no need for this in KVM's in-kernel APIC model. INIT and SIPI are handled in the hypervisor and you can get the current state of APs via KVM_GET_MPSTATE. LINT0 and LINT1 are injected with KVM_INTERRUPT and KVM_NMI respectively, and they obey IF/PPR and NMI blocking respectively, plus the interrupt shadow; so there's no need for userspace to know when LINT0/LINT1 themselves change. The spurious interrupt vector register is also handled completely in kernel.
I realize that KVM can handle LINT0/SVR updates themselves but our
interrupt subsystem relies on knowing the current values of these
registers even when not virtualizing the local APIC. I suppose we
could use KVM_GET_LAPIC to sync things up on demand but that seems
like it might not be great from a performance point of view.
>> I did not see any similar functionality for KVM. Does anything like that exist?
>> In any case we would be happy to add support for handling CR8 accesses w/o
>> exiting w/o the in-kernel APIC along with some sort of a way to configure the
>> TPR threshold if folks are not opposed to that.
>
>
> As far as I know everybody who's using KVM (whether proprietary or open source) has had no need for that, so I don't think it's a good idea to make the API more complex. Performance of Windows guests is going to be bad anyway with userspace APIC.
From what I have seen the exit cost with KVM is significantly lower
than with WHP/Hyper-V. I don't think performance of Windows guests
with userspace APIC emulation would be bad if CR8 exits could be
avoided (Linux guests perf isn't bad from what I have observed and the
main difference is the astronomical number of CR8 exits). It seems
like it would be pretty decent although I agree if you want the
absolute best performance then you would want to use the in kernel
APIC to speed up handling of ICR/EOI writes but those are relatively
infrequent compared to CR8 accesses.
Anyway I just saw Sean's response while writing this and it seems he
is not in favor of avoiding CR8 exits w/o the in kernel APIC either so
I suppose we will have to look into making use of the in kernel APIC.
Doug
> Paolo
>
>> Doug
>>
>> > > For now I think it makes sense to handle BDOOR_CMD_GET_VCPU_INFO at userlevel
>> > > like we do on Windows and macOS.
>> > >
>> > > BDOOR_CMD_GETTIME/BDOOR_CMD_GETTIMEFULL are similar with the former being
>> > > deprecated in favor of the latter. Both do essentially the same thing which is
>> > > to return the host OS's time - on Linux this is obtained via gettimeofday. I
>> > > believe this is mainly used by tools to fix up the VM's time when resuming from
>> > > suspend. I think it is fine to continue handling these at userlevel.
>> >
>> > As long as the TSC is not involved it should be okay.
>> >
>> > Paolo
>> >
>> > > > >> Anyway, one question apart from this: is the API the same for the I/O
>> > > > >> port and hypercall backdoors?
>> > > > >
>> > > > > Yeah the calls and arguments are the same. The hypercall based
>> > > > > interface is an attempt to modernize the backdoor since as you pointed
>> > > > > out the I/O based interface is kind of hacky as it bypasses the normal
>> > > > > checks for an I/O port access at CPL3. It would be nice to get rid of
>> > > > > it but unfortunately I don't think that will happen in the foreseeable
>> > > > > future as there are a lot of existing VMs out there with older SW that
>> > > > > still uses this interface.
>> > > >
>> > > > Yeah, but I think it still justifies that the KVM_ENABLE_CAP API can
>> > > > enable the hypercall but not the I/O port.
>> > > >
>> > > > Paolo
>> >
>>
>> --
>> This electronic communication and the information and any files transmitted
>> with it, or attached to it, are confidential and are intended solely for
>> the use of the individual or entity to whom it is addressed and may contain
>> information that is confidential, legally privileged, protected by privacy
>> laws, or otherwise restricted from disclosure to anyone else. If you are
>> not the intended recipient or the person responsible for delivering the
>> e-mail to the intended recipient, you are hereby notified that any use,
>> copying, distributing, dissemination, forwarding, printing, or copying of
>> this e-mail is strictly prohibited. If you received this e-mail in error,
>> please return the e-mail to the sender, delete it from your computer, and
>> destroy any printed copy of it.
>>
Nolibc supports riscv32, but so far the testsuite could not test it.
Add a test configuration.
Signed-off-by: Thomas Weißschuh <linux(a)weissschuh.net>
---
Thomas Weißschuh (6):
tools/nolibc: add support for waitid()
selftests/nolibc: use waitid() over waitpid()
selftests/nolibc: use a pipe to in vfprintf tests
selftests/nolibc: skip tests for unimplemented syscalls
selftests/nolibc: rename riscv to riscv64
selftests/nolibc: add configurations for riscv32
tools/include/nolibc/sys.h | 18 ++++++++++++
tools/testing/selftests/nolibc/Makefile | 11 +++++++
tools/testing/selftests/nolibc/nolibc-test.c | 44 ++++++++++++++++------------
tools/testing/selftests/nolibc/run-tests.sh | 2 +-
4 files changed, 56 insertions(+), 19 deletions(-)
---
base-commit: 499551201b5f4fd3c0618a3e95e3d0d15ea18f31
change-id: 20241219-nolibc-rv32-cff8a3e22394
Best regards,
--
Thomas Weißschuh <linux(a)weissschuh.net>
Two fixes for nullness elision. See commits for more details.
Daniel Xu (3):
bpf: verifier: Do not extract constant map keys for irrelevant maps
bpf: selftests: Test constant key extraction on irrelevant maps
bpf: verifier: Disambiguate get_constant_map_key() errors
kernel/bpf/verifier.c | 29 ++++++++++++++-----
.../bpf/progs/verifier_array_access.c | 15 ++++++++++
2 files changed, 36 insertions(+), 8 deletions(-)
--
2.47.1
Add a new selftest to verify netconsole's handling of messages that
exceed the packet size limit and require fragmentation. The test sends
messages with varying sizes and userdata, validating that:
1. Large messages are correctly fragmented and reassembled
2. Userdata fields are properly preserved across fragments
3. Messages work correctly with and without kernel release version
appending
The test creates a networking environment using netdevsim, sends
messages through /dev/kmsg, and verifies the received fragments maintain
message integrity.
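The header-stripping step the test relies on can be sketched in Python as follows. It mirrors the approach of the shell helpers in this patch (derive a regex from the first header, then delete every matching header), but the function bodies, the shortened sample payload, and the fragment offsets are assumptions made for illustration, not captured netconsole output.

```python
import re


def header_to_regex(header):
    """Build a regex matching any fragment header that shares this prefix."""
    # Keep everything before the first '=' (e.g. "13,468,514729715,-,ncfrag").
    prefix = header.split("=", 1)[0]
    return re.escape(prefix) + r"=[0-9]*/[0-9]*;"


def extract_msg(raw):
    """Strip all fragment headers, leaving the payload as it was sent."""
    first_line = raw.splitlines()[0]
    header = first_line.split(";", 1)[0]  # the header is everything before ';'
    return re.sub(header_to_regex(header), "", raw)


# Hypothetical two-fragment capture; the second header lands mid-userdata.
raw = (
    "13,468,514729715,-,ncfrag=0/30;MSG1=MSG2=: target\n"
    " key=1-2-13,468,514729715,-,ncfrag=20/30;3-4-\n"
)
print(extract_msg(raw))
```

Unlike the sed expression, this sketch escapes the prefix, so the dots in a release string such as "6.13.0" match literally rather than as wildcards.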
Signed-off-by: Breno Leitao <leitao(a)debian.org>
---
tools/testing/selftests/drivers/net/Makefile | 1 +
.../selftests/drivers/net/lib/sh/lib_netcons.sh | 7 ++
.../drivers/net/netcons_fragmented_msg.sh | 122 +++++++++++++++++++++
3 files changed, 130 insertions(+)
diff --git a/tools/testing/selftests/drivers/net/Makefile b/tools/testing/selftests/drivers/net/Makefile
index 137470bdee0c7fd2517bd1baafc12d575de4b4ac..c7f1c443f2af091aa13f67dd1df9ae05d7a43f40 100644
--- a/tools/testing/selftests/drivers/net/Makefile
+++ b/tools/testing/selftests/drivers/net/Makefile
@@ -7,6 +7,7 @@ TEST_INCLUDES := $(wildcard lib/py/*.py) \
TEST_PROGS := \
netcons_basic.sh \
+ netcons_fragmented_msg.sh \
netcons_overflow.sh \
ping.py \
queues.py \
diff --git a/tools/testing/selftests/drivers/net/lib/sh/lib_netcons.sh b/tools/testing/selftests/drivers/net/lib/sh/lib_netcons.sh
index 3acaba41ac7b21aa2fd8457ed640a5ac8a41bc12..0c262b123fdd3082c40b2bd899ec626d223226ed 100644
--- a/tools/testing/selftests/drivers/net/lib/sh/lib_netcons.sh
+++ b/tools/testing/selftests/drivers/net/lib/sh/lib_netcons.sh
@@ -110,6 +110,13 @@ function create_dynamic_target() {
echo 1 > "${NETCONS_PATH}"/enabled
}
+# Do not append the release to the header of the message
+function disable_release_append() {
+ echo 0 > "${NETCONS_PATH}"/enabled
+ echo 0 > "${NETCONS_PATH}"/release
+ echo 1 > "${NETCONS_PATH}"/enabled
+}
+
function cleanup() {
local NSIM_DEV_SYS_DEL="/sys/bus/netdevsim/del_device"
diff --git a/tools/testing/selftests/drivers/net/netcons_fragmented_msg.sh b/tools/testing/selftests/drivers/net/netcons_fragmented_msg.sh
new file mode 100755
index 0000000000000000000000000000000000000000..d175d5b9db662ab9a6ee203794569cc620801a4f
--- /dev/null
+++ b/tools/testing/selftests/drivers/net/netcons_fragmented_msg.sh
@@ -0,0 +1,122 @@
+#!/usr/bin/env bash
+# SPDX-License-Identifier: GPL-2.0
+
+# Test netconsole's message fragmentation functionality.
+#
+# When a message exceeds the maximum packet size, netconsole splits it into
+# multiple fragments for transmission. This test verifies:
+# - Correct fragmentation of large messages
+# - Proper reassembly of fragments at the receiver
+# - Preservation of userdata across fragments
+# - Behavior with and without kernel release version appending
+#
+# Author: Breno Leitao <leitao(a)debian.org>
+
+set -euo pipefail
+
+SCRIPTDIR=$(dirname "$(readlink -e "${BASH_SOURCE[0]}")")
+
+source "${SCRIPTDIR}"/lib/sh/lib_netcons.sh
+
+modprobe netdevsim 2> /dev/null || true
+modprobe netconsole 2> /dev/null || true
+
+# The content of kmsg will be saved to the following file
+OUTPUT_FILE="/tmp/${TARGET}"
+
+# Set userdata to a long value. In this case, it is "1-2-3-4...60-"
+USERDATA_VALUE=$(printf -- '%.2s-' {1..60})
+
+# Convert the header string into a regexp, so that we can remove
+# the second header as well.
+# A header looks like "13,468,514729715,-,ncfrag=0/1135;". If
+# release is appended, you might find something like:
+# "6.13.0-04048-g4f561a87745a,13,468,514729715,-,ncfrag=0/1135;"
+function header_to_regex() {
+ # header is everything before ;
+ local HEADER="${1}"
+ REGEX=$(echo "${HEADER}" | cut -d'=' -f1)
+ echo "${REGEX}=[0-9]*\/[0-9]*;"
+}
+
+# We have two headers in the message. Remove both to recover the full
+# message as it was sent.
+function extract_msg() {
+ local MSGFILE="${1}"
+	# Extract the header, which is the very first thing that arrives in the
+	# first line.
+ HEADER=$(sed -n '1p' "${MSGFILE}" | cut -d';' -f1)
+ HEADER_REGEX=$(header_to_regex "${HEADER}")
+
+ # Remove the two headers from the received message
+ # This will return the message without any header, similarly to what
+ # was sent.
+ sed "s/""${HEADER_REGEX}""//g" "${MSGFILE}"
+}
+
+# Validate the message, which arrives as two fragments glued together.
+# Unwrap them to make sure all the characters were transmitted.
+# File will look like the following:
+# 13,468,514729715,-,ncfrag=0/1135;MSG1=MSG2=MSG3=MSG4=MSG5=MSG6=MSG7=MSG8=MSG9=MSG10=MSG11=MSG12=MSG13=MSG14=MSG15=MSG16=MSG17=MSG18=MSG19=MSG20=MSG21=MSG22=MSG23=MSG24=MSG25=MSG26=MSG27=MSG28=MSG29=MSG30=MSG31=MSG32=MSG33=MSG34=MSG35=MSG36=MSG37=MSG38=MSG39=MSG40=MSG41=MSG42=MSG43=MSG44=MSG45=MSG46=MSG47=MSG48=MSG49=MSG50=MSG51=MSG52=MSG53=MSG54=MSG55=MSG56=MSG57=MSG58=MSG59=MSG60=MSG61=MSG62=MSG63=MSG64=MSG65=MSG66=MSG67=MSG68=MSG69=MSG70=MSG71=MSG72=MSG73=MSG74=MSG75=MSG76=MSG77=MSG78=MSG79=MSG80=MSG81=MSG82=MSG83=MSG84=MSG85=MSG86=MSG87=MSG88=MSG89=MSG90=MSG91=MSG92=MSG93=MSG94=MSG95=MSG96=MSG97=MSG98=MSG99=MSG100=MSG101=MSG102=MSG103=MSG104=MSG105=MSG106=MSG107=MSG108=MSG109=MSG110=MSG111=MSG112=MSG113=MSG114=MSG115=MSG116=MSG117=MSG118=MSG119=MSG120=MSG121=MSG122=MSG123=MSG124=MSG125=MSG126=MSG127=MSG128=MSG129=MSG130=MSG131=MSG132=MSG133=MSG134=MSG135=MSG136=MSG137=MSG138=MSG139=MSG140=MSG141=MSG142=MSG143=MSG144=MSG145=MSG146=MSG147=MSG148=MSG149=MSG150=: netcons_nzmJQ
+# key=1-2-13,468,514729715,-,ncfrag=967/1135;3-4-5-6-7-8-9-10-11-12-13-14-15-16-17-18-19-20-21-22-23-24-25-26-27-28-29-30-31-32-33-34-35-36-37-38-39-40-41-42-43-44-45-46-47-48-49-50-51-52-53-54-55-56-57-58-59-60-
+function validate_fragmented_result() {
+ # Discard the netconsole headers, and assemble the full message
+ RCVMSG=$(extract_msg "${1}")
+
+ # check for the main message
+ if ! echo "${RCVMSG}" | grep -q "${MSG}"; then
+ echo "Message body doesn't match." >&2
+ echo "msg received=" "${RCVMSG}" >&2
+ exit "${ksft_fail}"
+ fi
+
+ # check userdata
+ if ! echo "${RCVMSG}" | grep -q "${USERDATA_VALUE}"; then
+ echo "message userdata doesn't match" >&2
+ echo "msg received=" "${RCVMSG}" >&2
+ exit "${ksft_fail}"
+ fi
+ # test passed. hooray
+}
+
+# Check for basic system dependency and exit if not found
+check_for_dependencies
+# Set current loglevel to KERN_INFO(6), and default to KERN_NOTICE(5)
+echo "6 5" > /proc/sys/kernel/printk
+# Remove the namespace, interfaces and netconsole target on exit
+trap cleanup EXIT
+# Create one namespace and two interfaces
+set_network
+# Create a dynamic target for netconsole
+create_dynamic_target
+# Set userdata "key" with the "value" value
+set_user_data
+
+
+# TEST 1: Send message and userdata. They will fragment
+# =======
+MSG=$(printf -- 'MSG%.3s=' {1..150})
+
+# Listen for netconsole port inside the namespace and destination interface
+listen_port_and_save_to "${OUTPUT_FILE}" &
+# Wait for socat to start and listen to the port.
+wait_local_port_listen "${NAMESPACE}" "${PORT}" udp
+# Send the message
+echo "${MSG}: ${TARGET}" > /dev/kmsg
+# Wait until socat saves the file to disk
+busywait "${BUSYWAIT_TIMEOUT}" test -s "${OUTPUT_FILE}"
+# Check if the message was not corrupted
+validate_fragmented_result "${OUTPUT_FILE}"
+
+# TEST 2: Test with smaller message, and without release appended
+# =======
+MSG=$(printf -- 'FOOBAR%.3s=' {1..100})
+# Let's disable release and test again.
+disable_release_append
+
+listen_port_and_save_to "${OUTPUT_FILE}" &
+wait_local_port_listen "${NAMESPACE}" "${PORT}" udp
+echo "${MSG}: ${TARGET}" > /dev/kmsg
+busywait "${BUSYWAIT_TIMEOUT}" test -s "${OUTPUT_FILE}"
+validate_fragmented_result "${OUTPUT_FILE}"
+exit "${ksft_pass}"
---
base-commit: 0ad9617c78acbc71373fb341a6f75d4012b01d69
change-id: 20250129-netcons_frag_msgs-91506d136f50
Best regards,
--
Breno Leitao <leitao(a)debian.org>