Hi all,
This patchset adds a new buddy allocator like (or non-uniform) large folio
split to reduce the total number of after-split folios, the amount of memory
needed for multi-index xarray split, and keep more large folios after a split.
It is on top of mm-everything-2025-02-01-05-58. It is ready to be merged.
Instead of duplicating existing split_huge_page*() code, __folio_split()
is introduced as the shared backend code for both
split_huge_page_to_list_to_order() and folio_split(). __folio_split()
can support both uniform split and buddy allocator like (or non-uniform) split.
All existing split_huge_page*() users can be gradually converted to use
folio_split() if possible. In this patchset, I converted
truncate_inode_partial_folio() to use folio_split().
xfstests quick group passed for both tmpfs and xfs.
Changelog
===
From V5[7]:
1. Split shmem to any lower order patches are in mm tree, so dropped
from this series.
2. Rename split_folio_at() to try_folio_split() to clarify that
non-uniform split will not be used if it is not supported.
From V4[6]:
1. Enabled shmem support in both uniform and buddy allocator like split
and added selftests for it.
2. Added functions to check if uniform split and buddy allocator like
split are supported for the given folio and order.
3. Made truncate fall back to uniform split if buddy allocator split is
not supported (CONFIG_READ_ONLY_THP_FOR_FS and FS without large folio).
4. Added the missing folio_clear_has_hwpoisoned() to
__split_unmapped_folio().
From V3[5]:
1. Used xas_split_alloc(GFP_NOWAIT) instead of xas_nomem(), since extra
operations inside xas_split_alloc() are needed for correctness.
2. Enabled folio_split() for shmem and no issue was found with xfstests
quick test group.
3. Split both ends of a truncate range in truncate_inode_partial_folio()
to avoid wasting memory in shmem truncate (per David Hildenbrand).
4. Removed page_in_folio_offset() since page_folio() does the same
thing.
5. Finished truncate related tests from xfstests quick test group on XFS and
tmpfs without issues.
6. Disabled buddy allocator like split on CONFIG_READ_ONLY_THP_FOR_FS
and FS without large folio. This check was missed in the prior
versions.
From V2[3]:
1. Incorporated all the feedback from Kirill[4].
2. Used GFP_NOWAIT for xas_nomem().
3. Tested the code path when xas_nomem() fails.
4. Added selftests for folio_split().
5. Fixed no THP config build error.
From V1[2]:
1. Split the original patch 1 into multiple ones for easy review (per
Kirill).
2. Added xas_destroy() to avoid memory leak.
3. Fixed nr_dropped not used error (per kernel test robot).
4. Added proper error handling when xas_nomem() fails to allocate memory
for xas_split() during buddy allocator like split.
From RFC[1]:
1. Merged backend code of split_huge_page_to_list_to_order() and
folio_split(). The same code is used for both uniform split and buddy
allocator like split.
2. Use xas_nomem() instead of xas_split_alloc() for folio_split().
3. folio_split() now leaves the first after-split folio unlocked,
instead of the one containing the given page, since
the caller of truncate_inode_partial_folio() locks and unlocks the
first folio.
4. Extended split_huge_page debugfs to use folio_split().
5. Added truncate_inode_partial_folio() as first user of folio_split().
Design
===
folio_split() splits a large folio in the same way as buddy allocator
splits a large free page for allocation. The purpose is to minimize the
number of folios after the split. For example, if user wants to free the
3rd subpage in a order-9 folio, folio_split() will split the order-9 folio
as:
O-0, O-0, O-0, O-0, O-2, O-3, O-4, O-5, O-6, O-7, O-8 if it is anon
O-1, O-0, O-0, O-2, O-3, O-4, O-5, O-6, O-7, O-9 if it is pagecache
Since anon folio does not support order-1 yet.
The split process is similar to existing approach:
1. Unmap all page mappings (split PMD mappings if exist);
2. Split meta data like memcg, page owner, page alloc tag;
3. Copy meta data in struct folio to sub pages, but instead of spliting
the whole folio into multiple smaller ones with the same order in a
shot, this approach splits the folio iteratively. Taking the example
above, this approach first splits the original order-9 into two order-8,
then splits left part of order-8 to two order-7 and so on;
4. Post-process split folios, like write mapping->i_pages for pagecache,
adjust folio refcounts, add split folios to corresponding list;
5. Remap split folios
6. Unlock split folios.
__split_unmapped_folio() and __split_folio_to_order() replace
__split_huge_page() and __split_huge_page_tail() respectively.
__split_unmapped_folio() uses different approaches to perform
uniform split and buddy allocator like split:
1. uniform split: one single call to __split_folio_to_order() is used to
uniformly split the given folio. All resulting folios are put back to
the list after split. The folio containing the given page is left to
caller to unlock and others are unlocked.
2. buddy allocator like (or non-uniform) split: (old_order - new_order) calls
to __split_folio_to_order() are used to split the given folio at order N to
order N-1. After each call, the target folio is changed to the one
containing the page, which is given as a folio_split() parameter.
After each call, folios not containing the page are put back to the list.
The folio containing the page is put back to the list when its order
is new_order. All folios are unlocked except the first folio, which
is left to caller to unlock.
Patch Overview
===
1. Patch 1 added __split_unmapped_folio() and __split_folio_to_order() to
prepare for moving to new backend split code.
2. Patch 2 moved common code in split_huge_page_to_list_to_order() to
__folio_split().
3. Patch 3 added new folio_split() and made
split_huge_page_to_list_to_order() share the new
__split_unmapped_folio() with folio_split().
4. Patch 4 removed no longer used __split_huge_page() and
__split_huge_page_tail().
5. Patch 5 added a new in_folio_offset to split_huge_page debugfs for
folio_split() test.
6. Patch 6 used try_folio_split() for truncate operation.
7. Patch 7 added folio_split() tests.
Any comments and/or suggestions are welcome. Thanks.
[1] https://lore.kernel.org/linux-mm/20241008223748.555845-1-ziy@nvidia.com/
[2] https://lore.kernel.org/linux-mm/20241028180932.1319265-1-ziy@nvidia.com/
[3] https://lore.kernel.org/linux-mm/20241101150357.1752726-1-ziy@nvidia.com/
[4] https://lore.kernel.org/linux-mm/e6ppwz5t4p4kvir6eqzoto4y5fmdjdxdyvxvtw43nc…
[5] https://lore.kernel.org/linux-mm/20241205001839.2582020-1-ziy@nvidia.com/
[6] https://lore.kernel.org/linux-mm/20250106165513.104899-1-ziy@nvidia.com/
[7] https://lore.kernel.org/linux-mm/20250116211042.741543-1-ziy@nvidia.com/
Zi Yan (7):
mm/huge_memory: add two new (not yet used) functions for folio_split()
mm/huge_memory: move folio split common code to __folio_split()
mm/huge_memory: add buddy allocator like folio_split()
mm/huge_memory: remove the old, unused __split_huge_page()
mm/huge_memory: add folio_split() to debugfs testing interface.
mm/truncate: use buddy allocator like folio split for truncate
operation.
selftests/mm: add tests for folio_split(), buddy allocator like split.
include/linux/huge_mm.h | 36 +
mm/huge_memory.c | 749 ++++++++++++------
mm/truncate.c | 31 +-
.../selftests/mm/split_huge_page_test.c | 34 +-
4 files changed, 582 insertions(+), 268 deletions(-)
--
2.47.2
Notable changes since v17:
* fixed netdevice_tracker pointer assignment in netlink post_doit
(triggered by kernel test robot on m86k)
* renamed nla_get_uint() to ovpn_nla_get_uint() in ovpn-cli.c to avoid
clashes with libnl-3.11.0
FTR, here are the notable changes since v16:
* fixed usage of netdev tracker by removing dev_tracker member from
ovpn_priv and adding it to ovpn_peer and ovpn_socket as those are the
objects really holding a ref to the netdev
* switched ovpn_get_dev_from_attrs() to GFP_ATOMIC to prevent sleep under
rcu_read_lock
* allocated netdevice_tracker in ovpn_nl_pre_doit() [stored in
user_ptr[1]] to keep track of the netdev reference held during netlink
handler calls
* moved whole socket detaching routine to worker. This way the code is
allowed to sleep and in turn it can be executed under lock_sock. This
lock allows us to happily coordinate concurrent attach/detach calls.
(note: lock is acquired everytime the refcnt for the socket is
decremented, because this guarantees us that setting the refcnt to 0
and detaching the socket will happen atomically)
* dropped kref_put_sock()/refcount handler as it's not required anymore,
thanks to the point above
* re-arranged ovpn_socket_new() in order to simplify error path by first
allocating the new ovpn_sock and then attaching
Please note that some patches were already reviewed/tested by a few
people. iThese patches have retained the tags as they have hardly been
touched.
The latest code can also be found at:
https://github.com/OpenVPN/linux-kernel-ovpn
Thanks a lot!
Best Regards,
Antonio Quartulli
OpenVPN Inc.
---
Antonio Quartulli (25):
net: introduce OpenVPN Data Channel Offload (ovpn)
ovpn: add basic netlink support
ovpn: add basic interface creation/destruction/management routines
ovpn: keep carrier always on for MP interfaces
ovpn: introduce the ovpn_peer object
ovpn: introduce the ovpn_socket object
ovpn: implement basic TX path (UDP)
ovpn: implement basic RX path (UDP)
ovpn: implement packet processing
ovpn: store tunnel and transport statistics
ipv6: export inet6_stream_ops via EXPORT_SYMBOL_GPL
ovpn: implement TCP transport
skb: implement skb_send_sock_locked_with_flags()
ovpn: add support for MSG_NOSIGNAL in tcp_sendmsg
ovpn: implement multi-peer support
ovpn: implement peer lookup logic
ovpn: implement keepalive mechanism
ovpn: add support for updating local UDP endpoint
ovpn: add support for peer floating
ovpn: implement peer add/get/dump/delete via netlink
ovpn: implement key add/get/del/swap via netlink
ovpn: kill key and notify userspace in case of IV exhaustion
ovpn: notify userspace when a peer is deleted
ovpn: add basic ethtool support
testing/selftests: add test tool and scripts for ovpn module
Documentation/netlink/specs/ovpn.yaml | 372 +++
Documentation/netlink/specs/rt_link.yaml | 16 +
MAINTAINERS | 11 +
drivers/net/Kconfig | 15 +
drivers/net/Makefile | 1 +
drivers/net/ovpn/Makefile | 22 +
drivers/net/ovpn/bind.c | 55 +
drivers/net/ovpn/bind.h | 101 +
drivers/net/ovpn/crypto.c | 211 ++
drivers/net/ovpn/crypto.h | 145 ++
drivers/net/ovpn/crypto_aead.c | 382 ++++
drivers/net/ovpn/crypto_aead.h | 33 +
drivers/net/ovpn/io.c | 446 ++++
drivers/net/ovpn/io.h | 34 +
drivers/net/ovpn/main.c | 350 +++
drivers/net/ovpn/main.h | 14 +
drivers/net/ovpn/netlink-gen.c | 213 ++
drivers/net/ovpn/netlink-gen.h | 41 +
drivers/net/ovpn/netlink.c | 1183 ++++++++++
drivers/net/ovpn/netlink.h | 18 +
drivers/net/ovpn/ovpnstruct.h | 54 +
drivers/net/ovpn/peer.c | 1269 +++++++++++
drivers/net/ovpn/peer.h | 164 ++
drivers/net/ovpn/pktid.c | 129 ++
drivers/net/ovpn/pktid.h | 87 +
drivers/net/ovpn/proto.h | 118 +
drivers/net/ovpn/skb.h | 60 +
drivers/net/ovpn/socket.c | 204 ++
drivers/net/ovpn/socket.h | 49 +
drivers/net/ovpn/stats.c | 21 +
drivers/net/ovpn/stats.h | 47 +
drivers/net/ovpn/tcp.c | 565 +++++
drivers/net/ovpn/tcp.h | 33 +
drivers/net/ovpn/udp.c | 421 ++++
drivers/net/ovpn/udp.h | 22 +
include/linux/skbuff.h | 2 +
include/uapi/linux/if_link.h | 15 +
include/uapi/linux/ovpn.h | 111 +
include/uapi/linux/udp.h | 1 +
net/core/skbuff.c | 18 +-
net/ipv6/af_inet6.c | 1 +
tools/testing/selftests/Makefile | 1 +
tools/testing/selftests/net/ovpn/.gitignore | 2 +
tools/testing/selftests/net/ovpn/Makefile | 17 +
tools/testing/selftests/net/ovpn/config | 10 +
tools/testing/selftests/net/ovpn/data64.key | 5 +
tools/testing/selftests/net/ovpn/ovpn-cli.c | 2367 ++++++++++++++++++++
tools/testing/selftests/net/ovpn/tcp_peers.txt | 5 +
.../testing/selftests/net/ovpn/test-chachapoly.sh | 9 +
tools/testing/selftests/net/ovpn/test-float.sh | 9 +
tools/testing/selftests/net/ovpn/test-tcp.sh | 9 +
tools/testing/selftests/net/ovpn/test.sh | 185 ++
tools/testing/selftests/net/ovpn/udp_peers.txt | 5 +
53 files changed, 9673 insertions(+), 5 deletions(-)
---
base-commit: 7d0da8f862340c5f42f0062b8560b8d0971a6ac4
change-id: 20241002-b4-ovpn-eeee35c694a2
Best regards,
--
Antonio Quartulli <antonio(a)openvpn.net>
The mac address on backup slave should be convert from Solicited-Node
Multicast address, not from bonding unicast target address.
v2: fix patch 01's subject
Hangbin Liu (2):
bonding: fix incorrect MAC address setting to receive NS messages
selftests: bonding: fix incorrect mac address
drivers/net/bonding/bond_options.c | 4 +++-
tools/testing/selftests/drivers/net/bonding/bond_options.sh | 4 ++--
2 files changed, 5 insertions(+), 3 deletions(-)
--
2.46.0
Use ASSERT_OK_FD to check the return value of join cgroup,
or else this test will pass even if the fd < 0. ASSERT_OK_FD
can print the error message to the console.
Link: https://lore.kernel.org/all/6d62bd77-6733-40c7-b240-a1aeff55566c@linux.dev/
Suggested-by: Martin KaFai Lau <martin.lau(a)kernel.org>
Signed-off-by: Jason Xing <kerneljasonxing(a)gmail.com>
---
tools/testing/selftests/bpf/prog_tests/setget_sockopt.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/tools/testing/selftests/bpf/prog_tests/setget_sockopt.c b/tools/testing/selftests/bpf/prog_tests/setget_sockopt.c
index e12255121c15..e4dac529d424 100644
--- a/tools/testing/selftests/bpf/prog_tests/setget_sockopt.c
+++ b/tools/testing/selftests/bpf/prog_tests/setget_sockopt.c
@@ -202,7 +202,7 @@ static void test_nonstandard_opt(int family)
void test_setget_sockopt(void)
{
cg_fd = test__join_cgroup(CG_NAME);
- if (cg_fd < 0)
+ if (!ASSERT_OK_FD(cg_fd, "join cgroup"))
return;
if (create_netns())
--
2.43.5
Greetings:
Welcome to v4. Small functional change, which makes the code cleaner
(see changelog) and tests pass on my machine with mlx5 and netdevsim.
This is an attempt to followup on something Jakub asked me about [1],
adding an xsk attribute to queues and more clearly documenting which
queues are linked to NAPIs...
After the RFC [2], Jakub suggested creating an empty nest for queues
which have a pool, so I've adjusted this version to work that way.
The nest can be extended in the future to express attributes about XSK
as needed. Queues which are not used for AF_XDP do not have the xsk
attribute present.
I've run the included test on:
- my mlx5 machine (via NETIF=)
- without setting NETIF
And the test seems to pass in both cases.
Thanks,
Joe
[1]: https://lore.kernel.org/netdev/20250113143109.60afa59a@kernel.org/
[2]: https://lore.kernel.org/netdev/20250129172431.65773-1-jdamato@fastly.com/
v4:
- Add patch 1, as suggested by Jakub, which adds an empty nest helper.
- Use the helper in patch 2, which makes the code cleaner and prevents
a possible bug.
v3: https://lore.kernel.org/netdev/20250204191108.161046-1-jdamato@fastly.com/
- Change comment format in patch 2 to avoid kdoc warnings. No other
changes.
v2: https://lore.kernel.org/all/20250203185828.19334-1-jdamato@fastly.com/
- Switched from RFC to actual submission now that net-next is open
- Adjusted patch 1 to include an empty nest as suggested by Jakub
- Adjusted patch 2 to update the test based on changes to patch 1, and
to incorporate some Python feedback from Jakub :)
rfc: https://lore.kernel.org/netdev/20250129172431.65773-1-jdamato@fastly.com/
Joe Damato (3):
netlink: Add nla_put_empty_nest helper
netdev-genl: Add an XSK attribute to queues
selftests: drv-net: Test queue xsk attribute
Documentation/netlink/specs/netdev.yaml | 13 ++-
include/net/netlink.h | 15 ++++
include/uapi/linux/netdev.h | 6 ++
net/core/netdev-genl.c | 12 +++
tools/include/uapi/linux/netdev.h | 6 ++
.../testing/selftests/drivers/net/.gitignore | 2 +
tools/testing/selftests/drivers/net/Makefile | 3 +
tools/testing/selftests/drivers/net/queues.py | 35 +++++++-
.../selftests/drivers/net/xdp_helper.c | 89 +++++++++++++++++++
9 files changed, 178 insertions(+), 3 deletions(-)
create mode 100644 tools/testing/selftests/drivers/net/.gitignore
create mode 100644 tools/testing/selftests/drivers/net/xdp_helper.c
base-commit: f3eba8edd885db439f4bfaa2cf9d766bad1ae6c5
--
2.43.0
The kernel has recently added support for shadow stacks, currently
x86 only using their CET feature but both arm64 and RISC-V have
equivalent features (GCS and Zicfiss respectively), I am actively
working on GCS[1]. With shadow stacks the hardware maintains an
additional stack containing only the return addresses for branch
instructions which is not generally writeable by userspace and ensures
that any returns are to the recorded addresses. This provides some
protection against ROP attacks and making it easier to collect call
stacks. These shadow stacks are allocated in the address space of the
userspace process.
Our API for shadow stacks does not currently offer userspace any
flexiblity for managing the allocation of shadow stacks for newly
created threads, instead the kernel allocates a new shadow stack with
the same size as the normal stack whenever a thread is created with the
feature enabled. The stacks allocated in this way are freed by the
kernel when the thread exits or shadow stacks are disabled for the
thread. This lack of flexibility and control isn't ideal, in the vast
majority of cases the shadow stack will be over allocated and the
implicit allocation and deallocation is not consistent with other
interfaces. As far as I can tell the interface is done in this manner
mainly because the shadow stack patches were in development since before
clone3() was implemented.
Since clone3() is readily extensible let's add support for specifying a
shadow stack when creating a new thread or process, keeping the current
implicit allocation behaviour if one is not specified either with
clone3() or through the use of clone(). The user must provide a shadow
stack pointer, this must point to memory mapped for use as a shadow
stackby map_shadow_stack() with an architecture specified shadow stack
token at the top of the stack.
Please note that the x86 portions of this code are build tested only, I
don't appear to have a system that can run CET available to me.
[1] https://lore.kernel.org/linux-arm-kernel/20241001-arm64-gcs-v13-0-222b78d87…
Signed-off-by: Mark Brown <broonie(a)kernel.org>
---
Changes in v14:
- Rebase onto v6.14-rc1.
- Link to v13: https://lore.kernel.org/r/20241203-clone3-shadow-stack-v13-0-93b89a81a5ed@k…
Changes in v13:
- Rebase onto v6.13-rc1.
- Link to v12: https://lore.kernel.org/r/20241031-clone3-shadow-stack-v12-0-7183eb8bee17@k…
Changes in v12:
- Add the regular prctl() to the userspace API document since arm64
support is queued in -next.
- Link to v11: https://lore.kernel.org/r/20241005-clone3-shadow-stack-v11-0-2a6a2bd6d651@k…
Changes in v11:
- Rebase onto arm64 for-next/gcs, which is based on v6.12-rc1, and
integrate arm64 support.
- Rework the interface to specify a shadow stack pointer rather than a
base and size like we do for the regular stack.
- Link to v10: https://lore.kernel.org/r/20240821-clone3-shadow-stack-v10-0-06e8797b9445@k…
Changes in v10:
- Integrate fixes & improvements for the x86 implementation from Rick
Edgecombe.
- Require that the shadow stack be VM_WRITE.
- Require that the shadow stack base and size be sizeof(void *) aligned.
- Clean up trailing newline.
- Link to v9: https://lore.kernel.org/r/20240819-clone3-shadow-stack-v9-0-962d74f99464@ke…
Changes in v9:
- Pull token validation earlier and report problems with an error return
to parent rather than signal delivery to the child.
- Verify that the top of the supplied shadow stack is VM_SHADOW_STACK.
- Rework token validation to only do the page mapping once.
- Drop no longer needed support for testing for signals in selftest.
- Fix typo in comments.
- Link to v8: https://lore.kernel.org/r/20240808-clone3-shadow-stack-v8-0-0acf37caf14c@ke…
Changes in v8:
- Fix token verification with user specified shadow stack.
- Don't track user managed shadow stacks for child processes.
- Link to v7: https://lore.kernel.org/r/20240731-clone3-shadow-stack-v7-0-a9532eebfb1d@ke…
Changes in v7:
- Rebase onto v6.11-rc1.
- Typo fixes.
- Link to v6: https://lore.kernel.org/r/20240623-clone3-shadow-stack-v6-0-9ee7783b1fb9@ke…
Changes in v6:
- Rebase onto v6.10-rc3.
- Ensure we don't try to free the parent shadow stack in error paths of
x86 arch code.
- Spelling fixes in userspace API document.
- Additional cleanups and improvements to the clone3() tests to support
the shadow stack tests.
- Link to v5: https://lore.kernel.org/r/20240203-clone3-shadow-stack-v5-0-322c69598e4b@ke…
Changes in v5:
- Rebase onto v6.8-rc2.
- Rework ABI to have the user allocate the shadow stack memory with
map_shadow_stack() and a token.
- Force inlining of the x86 shadow stack enablement.
- Move shadow stack enablement out into a shared header for reuse by
other tests.
- Link to v4: https://lore.kernel.org/r/20231128-clone3-shadow-stack-v4-0-8b28ffe4f676@ke…
Changes in v4:
- Formatting changes.
- Use a define for minimum shadow stack size and move some basic
validation to fork.c.
- Link to v3: https://lore.kernel.org/r/20231120-clone3-shadow-stack-v3-0-a7b8ed3e2acc@ke…
Changes in v3:
- Rebase onto v6.7-rc2.
- Remove stale shadow_stack in internal kargs.
- If a shadow stack is specified unconditionally use it regardless of
CLONE_ parameters.
- Force enable shadow stacks in the selftest.
- Update changelogs for RISC-V feature rename.
- Link to v2: https://lore.kernel.org/r/20231114-clone3-shadow-stack-v2-0-b613f8681155@ke…
Changes in v2:
- Rebase onto v6.7-rc1.
- Remove ability to provide preallocated shadow stack, just specify the
desired size.
- Link to v1: https://lore.kernel.org/r/20231023-clone3-shadow-stack-v1-0-d867d0b5d4d0@ke…
---
Mark Brown (8):
arm64/gcs: Return a success value from gcs_alloc_thread_stack()
Documentation: userspace-api: Add shadow stack API documentation
selftests: Provide helper header for shadow stack testing
fork: Add shadow stack support to clone3()
selftests/clone3: Remove redundant flushes of output streams
selftests/clone3: Factor more of main loop into test_clone3()
selftests/clone3: Allow tests to flag if -E2BIG is a valid error code
selftests/clone3: Test shadow stack support
Documentation/userspace-api/index.rst | 1 +
Documentation/userspace-api/shadow_stack.rst | 44 +++++
arch/arm64/include/asm/gcs.h | 8 +-
arch/arm64/kernel/process.c | 8 +-
arch/arm64/mm/gcs.c | 62 +++++-
arch/x86/include/asm/shstk.h | 11 +-
arch/x86/kernel/process.c | 2 +-
arch/x86/kernel/shstk.c | 57 +++++-
include/asm-generic/cacheflush.h | 11 ++
include/linux/sched/task.h | 17 ++
include/uapi/linux/sched.h | 10 +-
kernel/fork.c | 96 +++++++--
tools/testing/selftests/clone3/clone3.c | 226 ++++++++++++++++++----
tools/testing/selftests/clone3/clone3_selftests.h | 65 ++++++-
tools/testing/selftests/ksft_shstk.h | 98 ++++++++++
15 files changed, 635 insertions(+), 81 deletions(-)
---
base-commit: 2014c95afecee3e76ca4a56956a936e23283f05b
change-id: 20231019-clone3-shadow-stack-15d40d2bf536
Best regards,
--
Mark Brown <broonie(a)kernel.org>
This series expands the XDP TX metadata framework to allow user
applications to pass per packet 64-bit launch time directly to the kernel
driver, requesting launch time hardware offload support. The XDP TX
metadata framework will not perform any clock conversion or packet
reordering.
Please note that the role of Tx metadata is just to pass the launch time,
not to enable the offload feature. Users will need to enable the launch
time hardware offload feature of the device by using the respective
command, such as the tc-etf command.
Although some devices use the tc-etf command to enable their launch time
hardware offload feature, xsk packets will not go through the etf qdisc.
Therefore, in my opinion, the launch time should always be based on the PTP
Hardware Clock (PHC). Thus, i did not include a clock ID to indicate the
clock source.
To simplify the test steps, I modified the xdp_hw_metadata bpf self-test
tool in such a way that it will set the launch time based on the offset
provided by the user and the value of the Receive Hardware Timestamp, which
is against the PHC. This will eliminate the need to discipline System Clock
with the PHC and then use clock_gettime() to get the time.
Please note that AF_XDP lacks a feedback mechanism to inform the
application if the requested launch time is invalid. So, users are expected
to familiar with the horizon of the launch time of the device they use and
not request a launch time that is beyond the horizon. Otherwise, the driver
might interpret the launch time incorrectly and react wrongly. For stmmac
and igc, where modulo computation is used, a launch time larger than the
horizon will cause the device to transmit the packet earlier that the
requested launch time.
Although there is no feedback mechanism for the launch time request
for now, user still can check whether the requested launch time is
working or not, by requesting the Transmit Completion Hardware Timestamp.
V9:
- Remove the igc_desc_unused() checking (Maciej)
- Ensure that skb allocation and DMA mapping work before proceeding to
fill in igc_tx_buffer info, context desc, and data desc (Maciej)
- Rate limit the error messages (Maciej)
- Update the comment to indicate that the 2 descriptors needed by the
empty frame are already taken into consideration (Maciej)
- Handle the case where the insertion of an empty frame fails and
explain the reason behind (Maciej)
- put self SOB tag as last tag (Maciej)
V8: https://lore.kernel.org/netdev/20250205024116.798862-1-yoong.siang.song@int…
- check the number of used descriptor in xsk_tx_metadata_request()
by using used_desc of struct igc_metadata_request, and then decreases
the budget with it (Maciej)
- submit another bug fix patch to set the buffer type for empty frame (Maciej):
https://lore.kernel.org/netdev/20250205023603.798819-1-yoong.siang.song@int…
V7: https://lore.kernel.org/netdev/20250204004907.789330-1-yoong.siang.song@int…
- split the refactoring code of igc empty packet insertion into a separate
commit (Faizal)
- add explanation on why the value "4" is used as igc transmit budget
(Faizal)
- perform a stress test by sending 1000 packets with 10ms interval and
launch time set to 500us in the future (Faizal & Yong Liang)
V6: https://lore.kernel.org/netdev/20250116155350.555374-1-yoong.siang.song@int…
- fix selftest build errors by using asprintf() and realloc(), instead of
managing the buffer sizes manually (Daniel, Stanislav)
V5: https://lore.kernel.org/netdev/20250114152718.120588-1-yoong.siang.song@int…
- change netdev feature name from tx-launch-time to tx-launch-time-fifo
to explicitly state the FIFO behaviour (Stanislav)
- improve the looping of xdp_hw_metadata app to wait for packet tx
completion to be more readable by using clock_gettime() (Stanislav)
- add launch time setup steps into xdp_hw_metadata app (Stanislav)
V4: https://lore.kernel.org/netdev/20250106135506.9687-1-yoong.siang.song@intel…
- added XDP launch time support to the igc driver (Jesper & Florian)
- added per-driver launch time limitation on xsk-tx-metadata.rst (Jesper)
- added explanation on FIFO behavior on xsk-tx-metadata.rst (Jakub)
- added step to enable launch time in the commit message (Jesper & Willem)
- explicitly documented the type of launch_time and which clock source
it is against (Willem)
V3: https://lore.kernel.org/netdev/20231203165129.1740512-1-yoong.siang.song@in…
- renamed to use launch time (Jesper & Willem)
- changed the default launch time in xdp_hw_metadata apps from 1s to 0.1s
because some NICs do not support such a large future time.
V2: https://lore.kernel.org/netdev/20231201062421.1074768-1-yoong.siang.song@in…
- renamed to use Earliest TxTime First (Willem)
- renamed to use txtime (Willem)
V1: https://lore.kernel.org/netdev/20231130162028.852006-1-yoong.siang.song@int…
Song Yoong Siang (5):
xsk: Add launch time hardware offload support to XDP Tx metadata
selftests/bpf: Add launch time request to xdp_hw_metadata
net: stmmac: Add launch time support to XDP ZC
igc: Refactor empty frame insertion for launch time support
igc: Add launch time support to XDP ZC
Documentation/netlink/specs/netdev.yaml | 4 +
Documentation/networking/xsk-tx-metadata.rst | 62 +++++++
drivers/net/ethernet/intel/igc/igc.h | 1 +
drivers/net/ethernet/intel/igc/igc_main.c | 141 +++++++++++----
drivers/net/ethernet/stmicro/stmmac/stmmac.h | 2 +
.../net/ethernet/stmicro/stmmac/stmmac_main.c | 13 ++
include/net/xdp_sock.h | 10 ++
include/net/xdp_sock_drv.h | 1 +
include/uapi/linux/if_xdp.h | 10 ++
include/uapi/linux/netdev.h | 3 +
net/core/netdev-genl.c | 2 +
net/xdp/xsk.c | 3 +
tools/include/uapi/linux/if_xdp.h | 10 ++
tools/include/uapi/linux/netdev.h | 3 +
tools/testing/selftests/bpf/xdp_hw_metadata.c | 168 +++++++++++++++++-
15 files changed, 394 insertions(+), 39 deletions(-)
--
2.34.1