Note: this contains only the net/ bits and doesn't include the other changes,
which should be merged separately and are posted separately. The full branch
for convenience is at [1], and the patch is here:
https://lore.kernel.org/io-uring/7486ab32e99be1f614b3ef8d0e9bc77015b173f7.1…
Many modern NICs support configurable receive buffer lengths, and zcrx and
memory providers can use buffers larger than 4K to improve performance. When
paired with hw-gro, larger rx buffer sizes can drastically reduce the number
of buffers traversing the stack and save a lot of processing time. They also
allow giving users larger contiguous chunks of data. The idea was first
floated by Saeed during netdev conf 2024 and was asked about by a few
folks.
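As a rough illustration of why larger buffers cut the number of buffers
traversing the stack, the sketch below counts how many fixed-size rx buffers a
payload occupies. The function name and the 64 KiB payload are assumptions made
for the example, not values from the series.

```c
#include <stddef.h>

/*
 * Number of fixed-size rx buffers needed to hold `payload` bytes
 * (ceiling division). buf_len must be non-zero.
 * Hypothetical helper, for illustration only.
 */
size_t rx_bufs_needed(size_t payload, size_t buf_len)
{
	return (payload + buf_len - 1) / buf_len;
}
```

Under these assumptions a 64 KiB aggregated packet occupies 16 buffers of 4K
but only 2 buffers of 32K.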
Single stream benchmarks showed up to ~30% CPU util improvement.
E.g. comparison for 4K vs 32K buffers using a 200Gbit NIC:
packets=23987040 (MB=2745098), rps=199559 (MB/s=22837)
CPU %usr %nice %sys %iowait %irq %soft %idle
0 1.53 0.00 27.78 2.72 1.31 66.45 0.22
packets=24078368 (MB=2755550), rps=200319 (MB/s=22924)
CPU %usr %nice %sys %iowait %irq %soft %idle
0 0.69 0.00 8.26 31.65 1.83 57.00 0.57
This series adds net infrastructure that lets memory providers configure
the rx buffer size and implements it for bnxt. It's an opt-in feature for
drivers: they should advertise support for the parameter in the qops and must
check whether the hardware supports the given size. It's limited to memory
providers, as that drastically simplifies the implementation. It doesn't affect
the fast path zcrx uAPI, and the user exposed parameter is defined in zcrx
terms, which allows it to be flexible and adjusted in the future.
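The driver-side check described above might look roughly like the sketch
below. Everything in it is an assumption for illustration: the struct, the
function name, the limits and the power-of-two requirement are hypothetical and
are not taken from bnxt or from the qops added by this series.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical limits a device might report; not taken from bnxt. */
struct hw_rx_caps {
	size_t min_buf_len;	/* e.g. the system page size */
	size_t max_buf_len;	/* largest buffer the NIC can post */
};

/* True if v is a non-zero power of two. */
static bool is_pow2(size_t v)
{
	return v && !(v & (v - 1));
}

/*
 * Returns 0 if the requested rx buffer length is usable and -EINVAL
 * otherwise. The power-of-two rule is an assumption of this sketch.
 */
int check_rx_buf_len(const struct hw_rx_caps *caps, size_t len)
{
	if (!is_pow2(len))
		return -EINVAL;
	if (len < caps->min_buf_len || len > caps->max_buf_len)
		return -EINVAL;
	return 0;
}
```

Rejecting an unsupported size up front keeps the opt-in contract simple: a
provider either gets the size it asked for or a clear error.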
A liburing example can be found at [2]
full branch:
[1] https://github.com/isilence/linux.git zcrx/large-buffers-v8
Liburing example:
[2] https://github.com/isilence/liburing.git zcrx/rx-buf-len
---
The following changes since commit 9ace4753a5202b02191d54e9fdf7f9e3d02b85eb:
Linux 6.19-rc4 (2026-01-04 14:41:55 -0800)
are available in the Git repository at:
https://github.com/isilence/linux.git tags/net-queue-rx-buf-len-v8
for you to fetch changes up to 37f5abe6929963fc6086777056b59ecb034d0e19:
io_uring/zcrx: document area chunking parameter (2026-01-08 11:35:20 +0000)
v8: - Add stripped down qcfg
- Retain the page size across resets for bnxt
v7: - Add xa_destroy
- Rebase
v6: - Update docs and add a selftest
v5: https://lore.kernel.org/netdev/cover.1760440268.git.asml.silence@gmail.com/
- Remove all unnecessary bits like configuration via netlink, and
multi-stage queue configuration.
v4: https://lore.kernel.org/all/cover.1760364551.git.asml.silence@gmail.com/
- Update fbnic qops
- Propagate max buf len for hns3
- Use configured buf size in __bnxt_alloc_rx_netmem
- Minor stylistic changes
v3: https://lore.kernel.org/all/cover.1755499375.git.asml.silence@gmail.com/
- Rebased, excluded zcrx specific patches
- Set agg_size_fac to 1 on warning
v2: https://lore.kernel.org/all/cover.1754657711.git.asml.silence@gmail.com/
- Add MAX_PAGE_ORDER check on pp init
- Applied comments rewording
- Adjust pp.max_len based on order
- Patch up mlx5 queue callbacks after rebase
- Minor ->queue_mgmt_ops refactoring
- Rebased to account for both fill level and agg_size_fac
- Pass the provider's buf length in struct pp_memory_provider_params and
apply it in __netdev_queue_config().
- Use ->supported_ring_params to validate driver support of set
qcfg parameters.
Jakub Kicinski (2):
net: reduce indent of struct netdev_queue_mgmt_ops members
eth: bnxt: adjust the fill level of agg queues with larger buffers
Pavel Begunkov (7):
net: memzero mp params when closing a queue
net: add bare bone queue configs
net: pass queue rx page size from memory provider
eth: bnxt: store rx buffer size per queue
eth: bnxt: support qcfg provided rx page size
selftests: iou-zcrx: test large chunk sizes
io_uring/zcrx: document area chunking parameter
Documentation/networking/iou-zcrx.rst | 20 +++
drivers/net/ethernet/broadcom/bnxt/bnxt.c | 126 ++++++++++++++----
drivers/net/ethernet/broadcom/bnxt/bnxt.h | 2 +
drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.c | 6 +-
drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.h | 2 +-
drivers/net/ethernet/google/gve/gve_main.c | 9 +-
.../net/ethernet/mellanox/mlx5/core/en_main.c | 10 +-
drivers/net/ethernet/meta/fbnic/fbnic_txrx.c | 8 +-
drivers/net/netdevsim/netdev.c | 7 +-
include/net/netdev_queues.h | 47 +++++--
include/net/netdev_rx_queue.h | 2 +
include/net/page_pool/types.h | 1 +
net/core/dev.c | 17 +++
net/core/netdev_rx_queue.c | 31 +++--
.../selftests/drivers/net/hw/iou-zcrx.c | 72 ++++++++--
.../selftests/drivers/net/hw/iou-zcrx.py | 37 +++++
16 files changed, 318 insertions(+), 79 deletions(-)
--
2.52.0
From: Chia-Yu Chang <chia-yu.chang(a)nokia-bell-labs.com>
Hello,
Please find the v7 AccECN case handling patch series, which covers
the handling of several exceptional cases from the Accurate ECN spec (RFC9768),
adds new identifiers to be used by CC modules, adds ecn_delta into
rate_sample, and keeps the ACE counter for computation, etc.
This patch series is part of the full AccECN patch series, which is available at
https://github.com/L4STeam/linux-net-next/commits/upstream_l4steam/
Best regards,
Chia-Yu
---
v7:
- Update comments in #3 (Paolo Abeni <pabeni(a)redhat.com>)
- Update comments and use synack_type TCP_SYNACK_RETRANS and num_timeout in #9. (Paolo Abeni <pabeni(a)redhat.com>)
v6:
- Update comment in #3 to highlight RX path is only used for virtio-net (Paolo Abeni <pabeni(a)redhat.com>)
- Rename TCP_CONG_WANTS_ECT_1 to TCP_CONG_ECT_1_NEGOTIATION to distinguish from TCP_CONG_ECT_1_ESTABLISH (Paolo Abeni <pabeni(a)redhat.com>)
- Move TCP_CONG_ECT_1_ESTABLISH in #6 to a later patch series (Paolo Abeni <pabeni(a)redhat.com>)
- Add new synack_type instead of moving the increment of num_retrans in #9 (Paolo Abeni <pabeni(a)redhat.com>)
- Use new synack_type TCP_SYNACK_RETRANS and num_retrans for SYN/ACK retx fallback for AccECN in #10 (Paolo Abeni <pabeni(a)redhat.com>)
- Do not cast const struct into non-const in #11, and set AccECN fail mode after tcp_rtx_synack() (Paolo Abeni <pabeni(a)redhat.com>)
v5:
- Move previous #11 in v4 to a later patch after discussion with the RFC author.
- Add #3 to update the comments for SKB_GSO_TCP_ECN and SKB_GSO_TCP_ACCECN. (Parav Pandit <parav(a)nvidia.com>)
- Add gro self-test for TCP CWR flag in #4. (Eric Dumazet <edumazet(a)google.com>)
- Add fixes: tag into #7 (Paolo Abeni <pabeni(a)redhat.com>)
- Update commit message of #8 and if condition check (Paolo Abeni <pabeni(a)redhat.com>)
- Add empty line between variable declarations and code in #13 (Paolo Abeni <pabeni(a)redhat.com>)
v4:
- Add previous #13 in v2 back after discussion with the RFC author.
- Add TCP_ACCECN_OPTION_PERSIST to tcp_ecn_option sysctl to ignore AccECN fallback policy on sending AccECN option.
v3:
- Add additional min() check if pkts_acked_ewma is not initialized in #1. (Paolo Abeni <pabeni(a)redhat.com>)
- Change TCP_CONG_WANTS_ECT_1 into an individual flag and add helper function INET_ECN_xmit_wants_ect_1() in #3. (Paolo Abeni <pabeni(a)redhat.com>)
- Add empty line between variable declarations and code in #4. (Paolo Abeni <pabeni(a)redhat.com>)
- Update commit message to fix old AccECN commits in #5. (Paolo Abeni <pabeni(a)redhat.com>)
- Remove unnecessary brackets in #10. (Paolo Abeni <pabeni(a)redhat.com>)
- Move patch #3 in v2 to a later Prague patch series and remove patch #13 in v2. (Paolo Abeni <pabeni(a)redhat.com>)
---
Chia-Yu Chang (11):
selftests/net: gro: add self-test for TCP CWR flag
tcp: ECT_1_NEGOTIATION and NEEDS_ACCECN identifiers
tcp: disable RFC3168 fallback identifier for CC modules
tcp: accecn: handle unexpected AccECN negotiation feedback
tcp: accecn: retransmit downgraded SYN in AccECN negotiation
tcp: add TCP_SYNACK_RETRANS synack_type
tcp: accecn: retransmit SYN/ACK without AccECN option or non-AccECN
SYN/ACK
tcp: accecn: unset ECT if receive or send ACE=0 in AccECN negotiaion
tcp: accecn: fallback outgoing half link to non-AccECN
tcp: accecn: detect loss ACK w/ AccECN option and add
TCP_ACCECN_OPTION_PERSIST
tcp: accecn: enable AccECN
Ilpo Järvinen (2):
tcp: try to avoid safer when ACKs are thinned
gro: flushing when CWR is set negatively affects AccECN
Documentation/networking/ip-sysctl.rst | 4 +-
.../networking/net_cachelines/tcp_sock.rst | 1 +
include/linux/tcp.h | 4 +-
include/net/inet_ecn.h | 20 +++-
include/net/tcp.h | 32 ++++++-
include/net/tcp_ecn.h | 92 ++++++++++++++-----
net/ipv4/inet_connection_sock.c | 4 +
net/ipv4/sysctl_net_ipv4.c | 4 +-
net/ipv4/tcp.c | 2 +
net/ipv4/tcp_cong.c | 5 +-
net/ipv4/tcp_input.c | 37 +++++++-
net/ipv4/tcp_minisocks.c | 46 +++++++---
net/ipv4/tcp_offload.c | 3 +-
net/ipv4/tcp_output.c | 32 ++++---
net/ipv4/tcp_timer.c | 3 +
tools/testing/selftests/drivers/net/gro.c | 81 +++++++++++-----
16 files changed, 284 insertions(+), 86 deletions(-)
--
2.34.1
This series improves the CPU cost of RX token management by adding an
attribute to NETDEV_CMD_BIND_RX that configures sockets using the
binding to avoid the xarray allocator and instead use a per-binding niov
array and a uref field in niov.
Improvement is ~13% cpu util per RX user thread.
Using kperf, the following results were observed:
Before:
Average RX worker idle %: 13.13, flows 4, test runs 11
After:
Average RX worker idle %: 26.32, flows 4, test runs 11
Two other approaches were tested, but with no improvement. Namely, 1)
using a hashmap for tokens and 2) keeping an xarray of atomic counters
but using RCU so that the hotpath could be mostly lockless. Neither of
these approaches proved better than the simple array in terms of CPU.
The attribute NETDEV_A_DMABUF_AUTORELEASE is added to toggle the
optimization. It is an optional attribute and defaults to 0 (i.e.,
optimization on).
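To make the "per-binding niov array plus uref field" idea concrete, here is a
minimal userspace sketch, assuming tokens are plain array indexes. All names
(demo_niov, demo_binding, demo_token_get, demo_token_put) are hypothetical and
the code is not the kernel implementation; it only shows why handing out and
returning a token can reduce to a bounds check and one atomic operation, with
no allocator involved.

```c
#include <stdatomic.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for a net_iov carrying a user reference count. */
struct demo_niov {
	atomic_uint uref;
};

/* Hypothetical binding: niovs pre-assigned in an array indexed by token. */
struct demo_binding {
	struct demo_niov *vec;
	size_t nr;
};

/* Allocate a zeroed array of nr entries. Returns 0 on success, -1 on error. */
int demo_binding_init(struct demo_binding *b, size_t nr)
{
	size_t i;

	b->vec = calloc(nr, sizeof(*b->vec));
	if (!b->vec)
		return -1;
	for (i = 0; i < nr; i++)
		atomic_init(&b->vec[i].uref, 0);
	b->nr = nr;
	return 0;
}

/* Hand a token to userspace: a bounds check plus one atomic increment. */
int demo_token_get(struct demo_binding *b, size_t token)
{
	if (token >= b->nr)
		return -1;
	atomic_fetch_add(&b->vec[token].uref, 1);
	return 0;
}

/* Userspace returns a token; refuse it if no reference is outstanding. */
int demo_token_put(struct demo_binding *b, size_t token)
{
	unsigned int old;

	if (token >= b->nr)
		return -1;
	old = atomic_load(&b->vec[token].uref);
	while (old > 0) {
		/* On failure `old` is reloaded and the loop retries. */
		if (atomic_compare_exchange_weak(&b->vec[token].uref,
						 &old, old - 1))
			return 0;
	}
	return -1;
}
```

The compare-and-exchange loop in demo_token_put() is one way to keep a bogus
extra release from underflowing the counter; the real series may handle invalid
user input differently.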
Signed-off-by: Bobby Eshleman <bobbyeshleman(a)meta.com>
Changes in v9:
- fixed build with NET_DEVMEM=n
- fixed bug in rx bindings count logic
- Link to v8: https://lore.kernel.org/r/20260107-scratch-bobbyeshleman-devmem-tcp-token-u…
Changes in v8:
- change static branch logic (only set when enabled, otherwise just
always revert back to disabled)
- fix missing tests
- Link to v7: https://lore.kernel.org/r/20251119-scratch-bobbyeshleman-devmem-tcp-token-u…
Changes in v7:
- use netlink instead of sockopt (Stan)
- restrict system to only one mode, dmabuf bindings can not co-exist
with different modes (Stan)
- use static branching to enforce single system-wide mode (Stan)
- Link to v6: https://lore.kernel.org/r/20251104-scratch-bobbyeshleman-devmem-tcp-token-u…
Changes in v6:
- renamed 'net: devmem: use niov array for token management' to refer to
optionality of new config
- added documentation and tests
- make autorelease flag per-socket sockopt instead of binding
field / sysctl
- many per-patch changes (see Changes sections per-patch)
- Link to v5: https://lore.kernel.org/r/20251023-scratch-bobbyeshleman-devmem-tcp-token-u…
Changes in v5:
- add sysctl to opt-out of performance benefit, back to old token release
- Link to v4: https://lore.kernel.org/all/20250926-scratch-bobbyeshleman-devmem-tcp-token…
Changes in v4:
- rebase to net-next
- Link to v3: https://lore.kernel.org/r/20250926-scratch-bobbyeshleman-devmem-tcp-token-u…
Changes in v3:
- make urefs per-binding instead of per-socket, reducing memory
footprint
- fallback to cleaning up references in dmabuf unbind if socket
leaked tokens
- drop ethtool patch
- Link to v2: https://lore.kernel.org/r/20250911-scratch-bobbyeshleman-devmem-tcp-token-u…
Changes in v2:
- net: ethtool: prevent user from breaking devmem single-binding rule
(Mina)
- pre-assign niovs in binding->vec for RX case (Mina)
- remove WARNs on invalid user input (Mina)
- remove extraneous binding ref get (Mina)
- remove WARN for changed binding (Mina)
- always use GFP_ZERO for binding->vec (Mina)
- fix length of alloc for urefs
- use atomic_set(, 0) to initialize sk_user_frags.urefs
- Link to v1: https://lore.kernel.org/r/20250902-scratch-bobbyeshleman-devmem-tcp-token-u…
---
Bobby Eshleman (5):
net: devmem: rename tx_vec to vec in dmabuf binding
net: devmem: refactor sock_devmem_dontneed for autorelease split
net: devmem: implement autorelease token management
net: devmem: document NETDEV_A_DMABUF_AUTORELEASE netlink attribute
selftests: drv-net: devmem: add autorelease test
Documentation/netlink/specs/netdev.yaml | 12 +++
Documentation/networking/devmem.rst | 70 +++++++++++++
include/net/netmem.h | 1 +
include/net/sock.h | 7 +-
include/uapi/linux/netdev.h | 1 +
net/core/devmem.c | 116 ++++++++++++++++++----
net/core/devmem.h | 29 +++++-
net/core/netdev-genl-gen.c | 5 +-
net/core/netdev-genl.c | 10 +-
net/core/sock.c | 103 ++++++++++++++-----
net/ipv4/tcp.c | 76 +++++++++++---
net/ipv4/tcp_ipv4.c | 11 +-
net/ipv4/tcp_minisocks.c | 3 +-
tools/include/uapi/linux/netdev.h | 1 +
tools/testing/selftests/drivers/net/hw/devmem.py | 21 +++-
tools/testing/selftests/drivers/net/hw/ncdevmem.c | 19 ++--
16 files changed, 407 insertions(+), 78 deletions(-)
---
base-commit: 6ad078fa0ababa8de2a2b39f476d2abd179a3cf6
change-id: 20250829-scratch-bobbyeshleman-devmem-tcp-token-upstream-292be174d503
Best regards,
--
Bobby Eshleman <bobbyeshleman(a)meta.com>
The "struct alg" object contains a union of 3 xfrm structures:
union {
struct xfrm_algo;
struct xfrm_algo_aead;
struct xfrm_algo_auth;
}
All of them end with a flexible array member used to store key material,
but the flexible array appears at *different offsets* in each struct.
Because of this, the union itself is variable-sized, and placing it above
char buf[...] triggers:
ipsec.c:835:5: warning: field 'u' with variable sized type 'union
(unnamed union at ipsec.c:831:3)' not at the end of a struct or class
is a GNU extension [-Wgnu-variable-sized-type-not-at-end]
835 | } u;
| ^
One fix is to use "TRAILING_OVERLAP()", which works with one flexible
array member only.
But in "struct alg" a flexible array member exists in all union members,
though not at the same offset, so TRAILING_OVERLAP cannot be applied.
So the fix is to explicitly overlay the key buffer at the correct offset
for the largest union member (xfrm_algo_auth). This ensures that the
flexible-array region and the fixed buffer line up.
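The overlay technique can be tried in isolation with the sketch below. The
demo_* structs are hypothetical stand-ins that only mimic the shape of the xfrm
ones (a flexible array member at different offsets); they are not the uapi
definitions. It relies on C11 anonymous structs and on offsetof() being a
constant expression.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical structs: each ends in a flexible array member, but at a
 * different offset. */
struct demo_algo {
	char name[64];
	unsigned int key_len;
	char key[];
};

struct demo_algo_auth {
	char name[64];
	unsigned int key_len;
	unsigned int trunc_len;
	char key[];
};

#define DEMO_KEY_BUF_SIZE 512

/* The outer union overlays `buf` exactly where demo_algo_auth's key starts,
 * so no variable-sized member sits in the middle of a struct. */
union demo_alg {
	union {
		struct demo_algo alg;
		struct demo_algo_auth auth;
	} u;
	struct {
		unsigned char pad[offsetof(struct demo_algo_auth, key)];
		char buf[DEMO_KEY_BUF_SIZE];
	};
};

/* Offset of the fixed buffer inside the overlay. */
size_t demo_buf_offset(void)
{
	return offsetof(union demo_alg, buf);
}

/* Write through the fixed buffer, read back through the flexible array.
 * Returns 1 if both views see the same bytes. */
int demo_key_roundtrip(void)
{
	union demo_alg a;

	memset(&a, 0, sizeof(a));
	memcpy(a.buf, "secret", 6);
	return memcmp(a.u.auth.key, "secret", 6) == 0;
}
```

In this sketch the padding array is sized from the member with the furthest
flexible array, so the buffer covers that member's key region entirely.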
No functional change.
Reviewed-by: Simon Horman <horms(a)kernel.org>
Signed-off-by: Ankit Khushwaha <ankitkhushwaha.linux(a)gmail.com>
---
CCed Gustavo and linux-hardening as suggested by Simon.
Previous patch: https://lore.kernel.org/all/aSiXmp4mh7M3RaRv@horms.kernel.org/t/#u
---
tools/testing/selftests/net/ipsec.c | 11 +++++++++--
1 file changed, 9 insertions(+), 2 deletions(-)
diff --git a/tools/testing/selftests/net/ipsec.c b/tools/testing/selftests/net/ipsec.c
index 0ccf484b1d9d..f4afef51b930 100644
--- a/tools/testing/selftests/net/ipsec.c
+++ b/tools/testing/selftests/net/ipsec.c
@@ -43,6 +43,10 @@
#define BUILD_BUG_ON(condition) ((void)sizeof(char[1 - 2*!!(condition)]))
+#ifndef offsetof
+#define offsetof(TYPE, MEMBER) __builtin_offsetof(TYPE, MEMBER)
+#endif
+
#define IPV4_STR_SZ 16 /* xxx.xxx.xxx.xxx is longest + \0 */
#define MAX_PAYLOAD 2048
#define XFRM_ALGO_KEY_BUF_SIZE 512
@@ -827,13 +831,16 @@ static int xfrm_fill_key(char *name, char *buf,
static int xfrm_state_pack_algo(struct nlmsghdr *nh, size_t req_sz,
struct xfrm_desc *desc)
{
- struct {
+ union {
union {
struct xfrm_algo alg;
struct xfrm_algo_aead aead;
struct xfrm_algo_auth auth;
} u;
- char buf[XFRM_ALGO_KEY_BUF_SIZE];
+ struct {
+ unsigned char __offset_to_FAM[offsetof(struct xfrm_algo_auth, alg_key)];
+ char buf[XFRM_ALGO_KEY_BUF_SIZE];
+ };
} alg = {};
size_t alen, elen, clen, aelen;
unsigned short type;
--
2.52.0
syzkaller reported a bug [1] where a socket using sockmap, after being
unloaded, exposed incorrect copied_seq calculation. The selftest I
provided can be used to reproduce the issue reported by syzkaller.
TCP recvmsg seq # bug 2: copied E92C873, seq E68D125, rcvnxt E7CEB7C, fl 40
WARNING: CPU: 1 PID: 5997 at net/ipv4/tcp.c:2724 tcp_recvmsg_locked+0xb2f/0x2910 net/ipv4/tcp.c:2724
Call Trace:
<TASK>
receive_fallback_to_copy net/ipv4/tcp.c:1968 [inline]
tcp_zerocopy_receive+0x131a/0x2120 net/ipv4/tcp.c:2200
do_tcp_getsockopt+0xe28/0x26c0 net/ipv4/tcp.c:4713
tcp_getsockopt+0xdf/0x100 net/ipv4/tcp.c:4812
do_sock_getsockopt+0x34d/0x440 net/socket.c:2421
__sys_getsockopt+0x12f/0x260 net/socket.c:2450
__do_sys_getsockopt net/socket.c:2457 [inline]
__se_sys_getsockopt net/socket.c:2454 [inline]
__x64_sys_getsockopt+0xbd/0x160 net/socket.c:2454
do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
do_syscall_64+0xcd/0xfa0 arch/x86/entry/syscall_64.c:94
entry_SYSCALL_64_after_hwframe+0x77/0x7f
A sockmap socket maintains its own receive queue (ingress_msg) which may
contain data from either its own protocol stack or forwarded from other
sockets.
                                                     FD1:read()
                                                     --  FD1->copied_seq++
                                                         |  [read data]
                                                         |
                                [enqueue data]           v
                  [sockmap]     -> ingress to self ->  ingress_msg queue
FD1 native stack  ------>                                 ^
-- FD1->rcv_nxt++               -> redirect to other      | [enqueue data]
                                       |                  |
                                       |             ingress to FD1
                                       v                  ^
                                      ...                 |  [sockmap]
                                                     FD2 native stack
The issue occurs when reading from ingress_msg: we update tp->copied_seq
by default, but if the data comes from other sockets (not the socket's
own protocol stack), tp->rcv_nxt remains unchanged. Later, when
converting back to a native socket, reads may fail as copied_seq could
be significantly larger than rcv_nxt.
Additionally, FIONREAD calculation based on copied_seq and rcv_nxt is
insufficient for sockmap sockets, requiring separate field tracking.
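The invariant at stake can be shown with a toy model. This is only a sketch
under stated assumptions: the demo_* names are hypothetical, and the rule
"advance copied_seq only for data that came from the socket's own stack" is an
illustration of the problem described above, not a copy of how the patches
implement the fix.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical miniature of the two TCP sequence counters involved. */
struct demo_tcp {
	uint32_t rcv_nxt;	/* next sequence expected from the peer */
	uint32_t copied_seq;	/* first sequence not yet read by the user */
};

/* Wrap-safe "a is after b" comparison on 32-bit sequence numbers. */
bool demo_seq_after(uint32_t a, uint32_t b)
{
	return (int32_t)(a - b) > 0;
}

/* Data arrives through the socket's own protocol stack. */
void demo_native_rx(struct demo_tcp *tp, uint32_t len)
{
	tp->rcv_nxt += len;
}

/* Read len bytes from the ingress queue. Only bytes that came from the
 * socket's own stack advance copied_seq; redirected bytes do not. */
void demo_ingress_read(struct demo_tcp *tp, uint32_t len, bool from_self)
{
	if (from_self)
		tp->copied_seq += len;
}

/* State a native socket expects: copied_seq never runs past rcv_nxt. */
bool demo_seq_sane(const struct demo_tcp *tp)
{
	return !demo_seq_after(tp->copied_seq, tp->rcv_nxt);
}
```

If redirected bytes were allowed to bump copied_seq as well, demo_seq_sane()
would fail after the first read of forwarded data, which mirrors the warning in
the trace above.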
[1] https://syzkaller.appspot.com/bug?extid=06dbd397158ec0ea4983
---
v5 -> v7: Some modifications suggested by Jakub Sitnicki, and added Reviewed-by tag.
https://lore.kernel.org/bpf/20260106051458.279151-1-jiayuan.chen@linux.dev/
v1 -> v5: Use skmsg.sk instead of extending BPF_F_XXX macro and fix the
failure reported by CI
v1: https://lore.kernel.org/bpf/20251117110736.293040-1-jiayuan.chen@linux.dev/
Jiayuan Chen (3):
bpf, sockmap: Fix incorrect copied_seq calculation
bpf, sockmap: Fix FIONREAD for sockmap
bpf, selftest: Add tests for FIONREAD and copied_seq
include/linux/skmsg.h | 70 ++++-
net/core/skmsg.c | 31 +-
net/ipv4/tcp_bpf.c | 37 ++-
net/ipv4/udp_bpf.c | 23 +-
.../selftests/bpf/prog_tests/sockmap_basic.c | 277 +++++++++++++++++-
.../bpf/progs/test_sockmap_pass_prog.c | 14 +
6 files changed, 435 insertions(+), 17 deletions(-)
--
2.43.0
Various improvements/fixes for the mm kselftests:
- Patch 1-3 extend support for more build configurations: out-of-tree
$KDIR, cross-compilation, etc.
- Patch 4-6 fix issues related to faulting in pages, introducing a new
helper for that purpose.
- Patch 7 fixes the value returned by pagemap_ioctl (PASS was always
returned, which explains why the issue fixed in patch 6 went
unnoticed).
- Patch 8 improves the exit code of pfnmap.
Net results:
- 1 test no longer fails (patch 6)
- 3 tests are no longer skipped (patch 4)
- More accurate return values for whole suites (patch 7, 8)
- Extra tests are more likely to be built (patch 1-3)
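For readers unfamiliar with the faulting-in problem, a helper that reads every
page in a range could be sketched as follows. The name demo_read_every_page and
its signature are assumptions made for this example and are not the helper
added by the series; the point is only that the read goes through a volatile
pointer so the compiler cannot drop it.

```c
#include <stddef.h>

/*
 * Hypothetical helper: read one byte from every page in [addr, addr + len)
 * so that each page is faulted in. Returns the number of pages touched.
 * page_size must be non-zero.
 */
size_t demo_read_every_page(const void *addr, size_t len, size_t page_size)
{
	const volatile char *p = addr;
	size_t off, n = 0;

	for (off = 0; off < len; off += page_size) {
		(void)p[off];	/* volatile read: cannot be optimised away */
		n++;
	}
	return n;
}
```

A plain non-volatile read whose result is unused may be removed by the
optimiser, in which case the pages would never be touched and a test relying on
them being present could misbehave.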
---
v1..v2:
- New patches: 1, 4, 5, 8
v1: https://lore.kernel.org/all/20251216142633.2401447-1-kevin.brodsky@arm.com/
---
Cc: Andrew Morton <akpm(a)linux-foundation.org>
Cc: David Hildenbrand <david(a)kernel.org>
Cc: Lorenzo Stoakes <lorenzo.stoakes(a)oracle.com>
Cc: Mark Brown <broonie(a)kernel.org>
Cc: Ryan Roberts <ryan.roberts(a)arm.com>
Cc: Shuah Khan <shuah(a)kernel.org>
---
Kevin Brodsky (8):
selftests/mm: default KDIR to build directory
selftests/mm: remove flaky header check
selftests/mm: pass down full CC and CFLAGS to check_config.sh
selftests/mm: fix usage of FORCE_READ() in cow tests
selftests/mm: introduce helper to read every page in range
selftests/mm: fix faulting-in code in pagemap_ioctl test
selftests/mm: fix exit code in pagemap_ioctl
selftests/mm: report SKIP in pfnmap if a check fails
tools/testing/selftests/mm/Makefile | 8 +-
tools/testing/selftests/mm/check_config.sh | 3 +-
tools/testing/selftests/mm/cow.c | 16 ++--
tools/testing/selftests/mm/hugetlb-madvise.c | 9 +-
tools/testing/selftests/mm/page_frag/Makefile | 2 +-
tools/testing/selftests/mm/pagemap_ioctl.c | 10 +-
tools/testing/selftests/mm/pfnmap.c | 95 ++++++++++++-------
.../selftests/mm/split_huge_page_test.c | 6 +-
tools/testing/selftests/mm/vm_util.h | 6 ++
9 files changed, 84 insertions(+), 71 deletions(-)
base-commit: 9ace4753a5202b02191d54e9fdf7f9e3d02b85eb
--
2.51.2
KVM's implementation of nested SVM treats PAT the same way whether or
not nested NPT is enabled: L1 and L2 share a PAT.
This is correct when nested NPT is disabled, but incorrect when nested
NPT is enabled. When nested NPT is enabled, L1 and L2 have independent
PATs.
The architectural specification for this separation is unusual. There
is a "guest PAT register" that is accessed by references to the PAT
MSR in guest mode, but it is different from the (host) PAT MSR. Other
resources that have distinct host and guest values have a shared
storage location, and the values are swapped on VM-entry/VM-exit.
In
https://lore.kernel.org/kvm/20251107201151.3303170-1-jmattson@google.com/,
I proposed an implementation that adhered to the architectural
specification. It had a few warts. The worst was the necessity of
"fixing up" KVM_SET_MSRS when executing KVM_SET_NESTED_STATE if L2 was
active and nested NPT was enabled when a snapshot was taken. Aside
from Yosry's clarification, no one has responded. I will take silence
to imply rejection. That's okay; I wasn't fond of that implementation
myself.
The current series treats PAT just like any other resource with
distinct host and guest values. There is a single shared storage
location (vcpu->arch.pat), and the values are swapped on
VM-entry/VM-exit. Though this implementation doesn't precisely follow
the architectural specification, the guest visible behavior is the
same as architected.
The first three patches ensure that the vmcb01.g_pat value at VMRUN is
preserved through virtual SMM and serialization. When NPT is enabled,
this field holds the host (L1) hPAT value from emulated VMRUN to
emulated #VMEXIT.
The fourth patch restores (L1) hPAT value from vmcb01.g_pat at
emulated #VMEXIT. Note that this is not architected, but it is
required for this implementation, because hPAT and gPAT occupy the
same storage location.
The next three patches handle loading vmcb12.g_pat into the (L2) guest
PAT register at VMRUN. Most of this behavior is architected, but the
architectural specification states that the value is loaded into the
guest PAT register, leaving the hPAT register unchanged.
The eighth patch stores the (L2) guest PAT register into vmcb12.g_pat
on emulated #VMEXIT, as architected.
The ninth patch fixes the emulation of WRMSR(IA32_PAT) when nested NPT
is enabled.
The tenth patch introduces a new KVM selftest to validate virtualized
PAT behavior.
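The swap-on-entry/exit scheme described above can be modelled in a few lines.
This is a heavily reduced sketch assuming nested NPT is enabled; the struct and
function names are hypothetical, and only the field names echo the ones used in
this cover letter.

```c
#include <stdint.h>

/* Hypothetical, heavily reduced state. */
struct demo_vcpu {
	uint64_t pat;		/* single shared storage location */
	uint64_t vmcb01_g_pat;	/* holds L1's hPAT while L2 runs */
	uint64_t vmcb12_g_pat;	/* L1's view of L2's gPAT */
};

/* Emulated VMRUN with nested NPT: stash L1's hPAT, load L2's gPAT. */
void demo_nested_vmrun(struct demo_vcpu *v)
{
	v->vmcb01_g_pat = v->pat;
	v->pat = v->vmcb12_g_pat;
}

/* Emulated #VMEXIT: save L2's gPAT to vmcb12, restore L1's hPAT. */
void demo_nested_vmexit(struct demo_vcpu *v)
{
	v->vmcb12_g_pat = v->pat;
	v->pat = v->vmcb01_g_pat;
}
```

Because both values pass through the same `pat` field, any write L2 makes to
the PAT MSR lands in vmcb12.g_pat at exit, while L1's value survives untouched
in vmcb01.g_pat.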
Jim Mattson (10):
KVM: x86: nSVM: Add g_pat to fields copied by svm_copy_vmrun_state()
KVM: x86: nSVM: Add VALID_GPAT flag to kvm_svm_nested_state_hdr
KVM: x86: nSVM: Handle legacy SVM nested state in SET_NESTED_STATE
KVM: x86: nSVM: Restore L1's PAT on emulated #VMEXIT from L2 to L1
KVM: x86: nSVM: Cache g_pat in vmcb_save_area_cached
KVM: x86: nSVM: Add validity check for VMCB12 g_pat
KVM: x86: nSVM: Set vmcb02.g_pat correctly for nested NPT
KVM: x86: nSVM: Save gPAT to vmcb12.g_pat on emulated #VMEXIT from L2
to L1
KVM: x86: nSVM: Fix assignment to IA32_PAT from L2
KVM: selftests: nSVM: Add svm_nested_pat test
arch/x86/include/uapi/asm/kvm.h | 3 +
arch/x86/kvm/svm/nested.c | 74 +++-
arch/x86/kvm/svm/svm.c | 14 +-
arch/x86/kvm/svm/svm.h | 2 +-
tools/testing/selftests/kvm/Makefile.kvm | 1 +
.../selftests/kvm/x86/svm_nested_pat_test.c | 357 ++++++++++++++++++
6 files changed, 432 insertions(+), 19 deletions(-)
create mode 100644 tools/testing/selftests/kvm/x86/svm_nested_pat_test.c
base-commit: f62b64b970570c92fe22503b0cdc65be7ce7fc7c
--
2.52.0.457.g6b5491de43-goog
Add support for running our existing GRO test against HW GRO
and LRO implementation. The first 3 patches are just ksft lib
nice-to-haves, and patch 4 cleans up the existing gro Python.
Patches 5 and 6 are of most practical interest. They support
reconfiguring the NIC to disable SW GRO and enable HW GRO and LRO.
Additionally, the last patch breaks up the existing GRO cases to
track HW compliance at finer granularity.
v2:
- fix restoring all features
- apply the generic XDP hack selectively (print a msg when it happens)
- a lot of small tweaks and 4 extra patches
v1: https://lore.kernel.org/20251128005242.2604732-1-kuba@kernel.org
Jakub Kicinski (6):
selftests: net: py: teach ksft_pr() multi-line safety
selftests: net: py: teach cmd() how to print itself
selftests: drv-net: gro: use cmd print
selftests: drv-net: gro: improve feature config
selftests: drv-net: gro: run the test against HW GRO and LRO
selftests: drv-net: gro: break out all individual test cases
tools/testing/selftests/drivers/net/gro.c | 399 +++++++++++---------
tools/testing/selftests/drivers/net/gro.py | 158 ++++++--
tools/testing/selftests/net/lib/py/ksft.py | 29 +-
tools/testing/selftests/net/lib/py/utils.py | 23 ++
4 files changed, 406 insertions(+), 203 deletions(-)
--
2.52.0
Add support for running our existing GRO test against HW GRO
and LRO implementation. The first 3 patches are just ksft lib
nice-to-haves, and patch 4 cleans up the existing gro Python.
Patches 5 and 6 are of most practical interest. They support
reconfiguring the NIC to disable SW GRO and enable HW GRO and LRO.
Additionally, the last patch breaks up the existing GRO cases to
track HW compliance at finer granularity.
v3:
- patch 4 - s/tso/tcp-segmentation-offload/ for ethtool feature names
- patch 5 - explicitly skip LRO on netdevsim, it lies about support
- patch 6 - add enum for the flush_id test configs
v2: https://lore.kernel.org/20260110005121.3561437-1-kuba@kernel.org
- fix restoring all features
- apply the generic XDP hack selectively (print a msg when it happens)
- a lot of small tweaks and 4 extra patches
v1: https://lore.kernel.org/20251128005242.2604732-1-kuba@kernel.org
Jakub Kicinski (6):
selftests: net: py: teach ksft_pr() multi-line safety
selftests: net: py: teach cmd() how to print itself
selftests: drv-net: gro: use cmd print
selftests: drv-net: gro: improve feature config
selftests: drv-net: gro: run the test against HW GRO and LRO
selftests: drv-net: gro: break out all individual test cases
tools/testing/selftests/drivers/net/gro.c | 433 ++++++++++--------
tools/testing/selftests/drivers/net/gro.py | 163 ++++++-
.../selftests/drivers/net/lib/py/env.py | 7 +-
tools/testing/selftests/net/lib/py/ksft.py | 29 +-
tools/testing/selftests/net/lib/py/utils.py | 23 +
5 files changed, 439 insertions(+), 216 deletions(-)
--
2.52.0