RFC v6:
=======

Major Changes:
--------------
This revision largely rebases on top of net-next and addresses the little feedback RFCv5 received.
The series remains in RFC because the queue-API ndos defined in this series are not yet implemented. I have a GVE implementation I carry out of tree for my testing. An upstreamable GVE implementation is in the works. Aside from that, in my estimation all the patches are ready for review/merge. Please do take a look.
As usual the full devmem TCP changes including the full GVE driver implementation are here:
https://github.com/mina/linux/commits/tcpdevmem-v6/
This version also comes with some performance data recorded in the cover letter (see below changelog).
Detailed changelog:
- Rebased on top of the merged netmem_ref changes.
- Converted skb->dmabuf to skb->readable (Pavel). Pavel's original suggestion was to remove the skb->dmabuf flag entirely, but when I looked into it closely, I found the issue that if we remove the flag we have to dereference the shinfo(skb) pointer to obtain the first frag to tell whether an skb is readable or not. This can cause a performance regression if it dirties the cache line when the shinfo(skb) was not really needed. Instead, I converted the skb->dmabuf flag into a generic skb->readable flag which can be re-used by io_uring 0-copy RX.
- Squashed a few locking optimizations from Eric Dumazet in the RX path and the DEVMEM_DONTNEED setsockopt.
- Expanded the tests a bit. Added validation for invalid scenarios and added some more coverage.
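To illustrate the skb->readable item above, here is a minimal userspace sketch of the trade-off. It is not the kernel code: every `model_*` name and the struct layouts are stand-ins assumed for illustration. The point is only that a flag cached in the skb answers the question without touching the shared info, which lives on a separate cache line.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel structures; not the real layouts. */
struct model_frag {
	bool is_net_iov;	/* true if the frag points at device memory */
};

struct model_shinfo {
	unsigned int nr_frags;
	struct model_frag frags[4];
};

struct model_skb {
	unsigned int readable : 1;	/* cached in the skb itself */
	struct model_shinfo *shinfo;	/* separate allocation / cache line */
};

/* Flag-based check: reads only the skb. */
static bool model_skb_readable(const struct model_skb *skb)
{
	return skb->readable;
}

/* Flag-less alternative: must dereference shinfo to inspect the first frag. */
static bool model_skb_readable_via_shinfo(const struct model_skb *skb)
{
	const struct model_shinfo *si = skb->shinfo;

	return si->nr_frags == 0 || !si->frags[0].is_net_iov;
}
```

Both helpers return the same answer as long as the flag is kept in sync when frags are attached; the difference is which memory each one has to load.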
Perf - page-pool benchmark:
---------------------------
bench_page_pool_simple.ko tests with and without these changes: https://pastebin.com/raw/ncHDwAbn
AFAIK the number that really matters in the perf tests is the 'tasklet_page_pool01_fast_path Per elem'. This one measures at about 8 cycles without the changes, with about 1 cycle of noise in some results.
With the patches this regresses to 9 cycles, again with occasional 1 cycle noise when running the test repeatedly.
Lastly I tried disabling the static_branch_unlikely() in the netmem_is_net_iov() check. To my surprise, disabling the static_branch_unlikely() check brings the fast path back to 8 cycles, but the 1 cycle noise remains.
Perf - Devmem TCP benchmark:
---------------------
189/200gbps bi-directional throughput with RX devmem TCP and regular TCP TX i.e. ~95% line rate.
Major changes in RFC v5:
========================
1. Rebased on top of 'Abstract page from net stack' series and used the new netmem type to refer to LSB set pointers instead of re-using struct page.
2. Downgraded this series back to RFC and called it RFC v5. This is because this series is now dependent on 'Abstract page from net stack'[1] and the queue API. Both are removed from the series to reduce the patch # and those bits are fairly independent or pre-requisite work.
3. Reworked the page_pool devmem support to use netmem and for some more unified handling.
4. Reworked the reference counting of net_iov (renamed from page_pool_iov) to use pp_ref_count for refcounting.
The full changes including the dependent series and GVE page pool support is here:
https://github.com/mina/linux/commits/tcpdevmem-rfcv5/
[1] https://patchwork.kernel.org/project/netdevbpf/list/?series=810774
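As a rough illustration of the "LSB set pointers" idea from item 1 above, the sketch below models a reference type that can carry either a page pointer or a net_iov pointer, using the low bit as the discriminator. This is a userspace model under stated assumptions: the `model_*` names are invented here, and the real netmem helpers in the series may differ in naming and detail.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Stand-in types; both are at least 4-byte aligned, so bit 0 is free. */
struct model_page    { int id; };
struct model_net_iov { int id; };

/* One opaque reference type for both kinds of memory. */
typedef uintptr_t model_netmem_ref;

#define MODEL_NET_IOV_BIT ((uintptr_t)1)

static model_netmem_ref model_page_to_netmem(struct model_page *p)
{
	return (uintptr_t)p;			/* LSB clear: regular page */
}

static model_netmem_ref model_net_iov_to_netmem(struct model_net_iov *n)
{
	return (uintptr_t)n | MODEL_NET_IOV_BIT;	/* LSB set: net_iov */
}

static bool model_netmem_is_net_iov(model_netmem_ref ref)
{
	return (ref & MODEL_NET_IOV_BIT) != 0;
}

static struct model_page *model_netmem_to_page(model_netmem_ref ref)
{
	return (struct model_page *)ref;
}

static struct model_net_iov *model_netmem_to_net_iov(model_netmem_ref ref)
{
	/* Strip the tag bit before using the value as a pointer. */
	return (struct model_net_iov *)(ref & ~MODEL_NET_IOV_BIT);
}
```

Callers that must touch page contents check `model_netmem_is_net_iov()` first and take a different path for device memory.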
Major changes in v1:
====================
1. Implemented MVP queue API ndos to remove the userspace-visible driver reset.
2. Fixed issues in the napi_pp_put_page() devmem frag unref path.
3. Removed RFC tag.
Many smaller addressed comments across all the patches (patches have individual change log).
Full tree including the rest of the GVE driver changes: https://github.com/mina/linux/commits/tcpdevmem-v1
Changes in RFC v3:
==================
1. Pulled in the memory-provider dependency from Jakub's RFC[1] to make the series reviewable and mergeable.
2. Implemented multi-rx-queue binding which was a todo in v2.
3. Fix to cmsg handling.
The sticking point in RFC v2[2] was the device reset required to refill the device rx-queues after the dmabuf bind/unbind. The suggested solution, as I understand it, is a subset of the per-queue management ops Jakub suggested, or similar:
https://lore.kernel.org/netdev/20230815171638.4c057dcd@kernel.org/
This is not addressed in this revision, because:
1. This point was discussed at netconf & netdev and there is openness to using the current approach of requiring a device reset.
2. Implementing individual queue resetting seems to be difficult for my test bed with GVE. My prototype to test this ran into issues with the rx-queues not coming back up properly if reset individually. At the moment I'm unsure if it's a mistake in the POC or a genuine issue in the virtualization stack behind GVE, which currently doesn't test individual rx-queue restart.
3. Our usecases are not bothered by requiring a device reset to refill the buffer queues, and we'd like to support NICs that run into this limitation with resetting individual queues.
My thought is that drivers that have trouble with per-queue configs can use the support in this series, while drivers that support new netdev ops to reset individual queues can automatically reset the queue as part of the dma-buf bind/unbind.
The same approach with device resets is presented again for consideration with other sticking points addressed.
This proposal includes only the rx devmem path, which is what is proposed for merge. For a snapshot of my entire tree which includes the GVE POC page pool support & device memory support:
https://github.com/torvalds/linux/compare/master...mina:linux:tcpdevmem-v3
[1] https://lore.kernel.org/netdev/f8270765-a27b-6ccf-33ea-cda097168d79@redhat.c...
[2] https://lore.kernel.org/netdev/CAHS8izOVJGJH5WF68OsRWFKJid1_huzzUK+hpKbLcL4p...
Changes in RFC v2:
==================
The sticking point in RFC v1[1] was the dma-buf pages approach we used to deliver the device memory to the TCP stack. RFC v2 is a proof-of-concept that attempts to resolve this by implementing scatterlist support in the networking stack, such that we can import the dma-buf scatterlist directly. This is the approach proposed at a high level here[2].
Detailed changes:
1. Replaced dma-buf pages approach with importing scatterlist into the page pool.
2. Replace the dma-buf pages centric API with a netlink API.
3. Removed the TX path implementation - there is no issue with implementing the TX path with scatterlist approach, but leaving out the TX path makes it easier to review.
4. Functionality is tested with this proposal, but I have not conducted perf testing yet. I'm not sure there are regressions, but I removed perf claims from the cover letter until they can be re-confirmed.
5. Added Signed-off-by: contributors to the implementation.
6. Fixed some bugs with the RX path since RFC v1.
Any feedback welcome, but specifically the biggest pending questions needing feedback IMO are:
1. Feedback on the scatterlist-based approach in general.
2. Netlink API (Patch 1 & 2).
3. Approach to handle all the drivers that expect to receive pages from the page pool (Patch 6).
[1] https://lore.kernel.org/netdev/dfe4bae7-13a0-3c5d-d671-f61b375cb0b4@gmail.co...
[2] https://lore.kernel.org/netdev/CAHS8izPm6XRS54LdCDZVd0C75tA1zHSu6jLVO8nzTLXC...
==================
* TL;DR:
Device memory TCP (devmem TCP) is a proposal for transferring data to and/or from device memory efficiently, without bouncing the data to a host memory buffer.
* Problem:
A large number of data transfers have device memory as the source and/or destination. Accelerators drastically increased the volume of such transfers. Some examples include:
- ML accelerators transferring large amounts of training data from storage into GPU/TPU memory. In some cases ML training setup time can be as long as 50% of TPU compute time; improving data transfer throughput & efficiency can help improve GPU/TPU utilization.
- Distributed training, where ML accelerators, such as GPUs on different hosts, exchange data among them.
- Distributed raw block storage applications transfer large amounts of data with remote SSDs; much of this data does not require host processing.
Today, the majority of the Device-to-Device data transfers over the network are implemented as the following low level operations: Device-to-Host copy, Host-to-Host network transfer, and Host-to-Device copy.
The implementation is suboptimal, especially for bulk data transfers, and can put significant strains on system resources, such as host memory bandwidth, PCIe bandwidth, etc. One important reason behind the current state is the kernel’s lack of semantics to express device to network transfers.
* Proposal:
In this patch series we attempt to optimize this use case by implementing socket APIs that enable the user to:
1. send device memory across the network directly, and
2. receive incoming network packets directly into device memory.
Packet _payloads_ go directly from the NIC to device memory for receive and from device memory to NIC for transmit. Packet _headers_ go to/from host memory and are processed by the TCP/IP stack normally. The NIC _must_ support header split to achieve this.
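The header/payload separation described above can be pictured with a small userspace model. This is only a sketch: on real hardware the NIC performs the split via DMA, whereas here a hypothetical `model_header_split()` uses memcpy() into two ordinary buffers standing in for host memory and device memory.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Result of a split: how many bytes landed in each buffer. */
struct model_split {
	size_t hdr_len;
	size_t payload_len;
};

/*
 * Copy the first hdr_len bytes of pkt into host_buf (headers, processed by
 * the TCP/IP stack) and the remainder into dev_buf (payload, never read by
 * the host in the real design). Returns 0 on success, -1 if the header
 * length exceeds the packet or either buffer is too small.
 */
static int model_header_split(const unsigned char *pkt, size_t pkt_len,
			      size_t hdr_len,
			      unsigned char *host_buf, size_t host_cap,
			      unsigned char *dev_buf, size_t dev_cap,
			      struct model_split *out)
{
	if (hdr_len > pkt_len || hdr_len > host_cap ||
	    pkt_len - hdr_len > dev_cap)
		return -1;

	memcpy(host_buf, pkt, hdr_len);
	memcpy(dev_buf, pkt + hdr_len, pkt_len - hdr_len);
	out->hdr_len = hdr_len;
	out->payload_len = pkt_len - hdr_len;
	return 0;
}
```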
Advantages:
- Alleviate host memory bandwidth pressure, compared to existing network-transfer + device-copy semantics.
- Alleviate PCIe BW pressure, by limiting data transfer to the lowest level of the PCIe tree, compared to traditional path which sends data through the root complex.
* Patch overview:
** Part 1: netlink API
Gives the user the ability to bind a dma-buf to an RX queue.
** Part 2: scatterlist support
Currently the standard for device memory sharing is DMABUF, which doesn't generate struct pages. On the other hand, the networking stack (skbs, drivers, and page pool) operates on pages. We have 2 options:
1. Generate struct pages for dmabuf device memory, or,
2. Modify the networking stack to process scatterlist.
Approach #1 was attempted in RFC v1. RFC v2 implements approach #2.
** Part 3: page pool support
We piggyback on the page pool memory providers proposal: https://github.com/kuba-moo/linux/tree/pp-providers
It allows the page pool to define a memory provider that provides the page allocation and freeing. It helps abstract most of the device memory TCP changes from the driver.
** Part 4: support for unreadable skb frags
Page pool iovs are not accessible by the host; we implement changes throughout the networking stack to correctly handle skbs with unreadable frags.
** Part 5: recvmsg() APIs
We define user APIs for the user to send and receive device memory.
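One piece of the receive-side API is handing consumed frags back to the kernel (the DEVMEM_DONTNEED setsockopt mentioned in the changelog). The sketch below shows one way userspace might batch the frag tokens it has finished with into contiguous ranges before releasing them. The range layout (`token_start`, `token_count`) is an assumption modelled for this example, not a definition taken from this cover letter, and `model_coalesce_tokens()` is a hypothetical helper.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed shape of a token range; a local stand-in, not the uapi struct. */
struct model_token_range {
	uint32_t token_start;
	uint32_t token_count;
};

/*
 * Coalesce tokens (assumed sorted ascending, no duplicates) into ranges of
 * consecutive values. Returns the number of ranges written to out, or -1 if
 * out_cap ranges are not enough.
 */
static int model_coalesce_tokens(const uint32_t *tokens, size_t n,
				 struct model_token_range *out, size_t out_cap)
{
	size_t used = 0;

	for (size_t i = 0; i < n; i++) {
		struct model_token_range *last = used ? &out[used - 1] : NULL;

		/* Extend the current range when the token is the next value. */
		if (last && last->token_start + last->token_count == tokens[i]) {
			last->token_count++;
			continue;
		}
		if (used == out_cap)
			return -1;
		out[used].token_start = tokens[i];
		out[used].token_count = 1;
		used++;
	}
	return (int)used;
}
```

Batching this way keeps the number of release calls small when frags are consumed in order, which is the common case for a streaming receiver.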
Not included with this series is the GVE devmem TCP support, just to simplify the review. Code available here if desired: https://github.com/mina/linux/tree/tcpdevmem
This series is built on top of net-next with Jakub's pp-providers changes cherry-picked.
* NIC dependencies:
1. (strict) Devmem TCP requires the NIC to support header split, i.e. the capability to split incoming packets into a header + payload and to put each into a separate buffer. Devmem TCP works by using device memory for the packet payload, and host memory for the packet headers.
2. (optional) Devmem TCP works better with flow steering support & RSS support, i.e. the NIC's ability to steer flows into certain rx queues. This allows the sysadmin to enable devmem TCP on a subset of the rx queues, and steer devmem TCP traffic onto these queues and non devmem TCP elsewhere.
The NIC I have access to with these properties is the GVE with DQO support running in Google Cloud, but any NIC that supports these features would suffice. I may be able to help reviewers bring up devmem TCP on their NICs.
* Testing:
The series includes a udmabuf kselftest that shows a simple use case of devmem TCP and validates the entire data path end to end without a dependency on a specific dmabuf provider.
** Test Setup
Kernel: net-next with this series and memory provider API cherry-picked locally.
Hardware: Google Cloud A3 VMs.
NIC: GVE with header split & RSS & flow steering support.
Cc: Pavel Begunkov asml.silence@gmail.com
Cc: David Wei dw@davidwei.uk
Cc: Jason Gunthorpe jgg@ziepe.ca
Cc: Yunsheng Lin linyunsheng@huawei.com
Cc: Shailend Chand shailend@google.com
Cc: Harshitha Ramamurthy hramamurthy@google.com
Cc: Shakeel Butt shakeelb@google.com
Cc: Jeroen de Borst jeroendb@google.com
Cc: Praveen Kaligineedi pkaligineedi@google.com
Jakub Kicinski (1):
  net: page_pool: create hooks for custom page providers

Mina Almasry (14):
  queue_api: define queue api
  net: page_pool: factor out page_pool recycle check
  net: netdev netlink api to bind dma-buf to a net device
  netdev: support binding dma-buf to netdevice
  netdev: netdevice devmem allocator
  page_pool: convert to use netmem
  page_pool: devmem support
  memory-provider: dmabuf devmem memory provider
  net: support non paged skb frags
  net: add support for skbs with unreadable frags
  tcp: RX path for devmem TCP
  net: add SO_DEVMEM_DONTNEED setsockopt to release RX frags
  net: add devmem TCP documentation
  selftests: add ncdevmem, netcat for devmem TCP
 Documentation/netlink/specs/netdev.yaml |  52 +++
 Documentation/networking/devmem.rst     | 271 ++++++++++++
 Documentation/networking/index.rst      |   1 +
 arch/alpha/include/uapi/asm/socket.h    |   6 +
 arch/mips/include/uapi/asm/socket.h     |   6 +
 arch/parisc/include/uapi/asm/socket.h   |   6 +
 arch/sparc/include/uapi/asm/socket.h    |   6 +
 include/linux/netdevice.h               |  24 ++
 include/linux/skbuff.h                  |  67 ++-
 include/linux/socket.h                  |   1 +
 include/net/devmem.h                    | 127 ++++++
 include/net/netdev_rx_queue.h           |   1 +
 include/net/netmem.h                    | 234 +++++++++-
 include/net/page_pool/helpers.h         | 154 +++++--
 include/net/page_pool/types.h           |  28 +-
 include/net/sock.h                      |   2 +
 include/net/tcp.h                       |   5 +-
 include/trace/events/page_pool.h        |  29 +-
 include/uapi/asm-generic/socket.h       |   6 +
 include/uapi/linux/netdev.h             |  19 +
 include/uapi/linux/uio.h                |  14 +
 net/bpf/test_run.c                      |   5 +-
 net/core/Makefile                       |   2 +-
 net/core/datagram.c                     |   6 +
 net/core/dev.c                          |   6 +-
 net/core/devmem.c                       | 413 ++++++++++++++++++
 net/core/gro.c                          |   7 +-
 net/core/netdev-genl-gen.c              |  19 +
 net/core/netdev-genl-gen.h              |   2 +
 net/core/netdev-genl.c                  | 123 ++++++
 net/core/page_pool.c                    | 362 +++++++++-------
 net/core/skbuff.c                       | 110 ++++-
 net/core/sock.c                         |  61 +++
 net/ipv4/tcp.c                          | 257 ++++++++++-
 net/ipv4/tcp_input.c                    |  13 +-
 net/ipv4/tcp_ipv4.c                     |   9 +
 net/ipv4/tcp_output.c                   |   5 +-
 net/packet/af_packet.c                  |   4 +-
 tools/include/uapi/linux/netdev.h       |  19 +
 tools/testing/selftests/net/.gitignore  |   1 +
 tools/testing/selftests/net/Makefile    |   5 +
 tools/testing/selftests/net/ncdevmem.c  | 546 ++++++++++++++++++++++++
 42 files changed, 2764 insertions(+), 270 deletions(-)
 create mode 100644 Documentation/networking/devmem.rst
 create mode 100644 include/net/devmem.h
 create mode 100644 net/core/devmem.c
 create mode 100644 tools/testing/selftests/net/ncdevmem.c
This API enables the net stack to reset the queues used for devmem.
Signed-off-by: Mina Almasry almasrymina@google.com
---
 include/linux/netdevice.h | 24 ++++++++++++++++++++++++
 1 file changed, 24 insertions(+)
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index c41019f34179..3105c586355d 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1435,6 +1435,20 @@ struct netdev_net_notifier {
  *			   struct kernel_hwtstamp_config *kernel_config,
  *			   struct netlink_ext_ack *extack);
  *	Change the hardware timestamping parameters for NIC device.
+ *
+ * void *(*ndo_queue_mem_alloc)(struct net_device *dev, int idx);
+ *	Allocate memory for an RX queue. The memory returned in the form of
+ *	a void * can be passed to ndo_queue_mem_free() for freeing or to
+ *	ndo_queue_start to create an RX queue with this memory.
+ *
+ * void	(*ndo_queue_mem_free)(struct net_device *dev, void *);
+ *	Free memory from an RX queue.
+ *
+ * int (*ndo_queue_start)(struct net_device *dev, int idx, void *);
+ *	Start an RX queue at the specified index.
+ *
+ * int (*ndo_queue_stop)(struct net_device *dev, int idx, void **);
+ *	Stop the RX queue at the specified index.
  */
 struct net_device_ops {
 	int			(*ndo_init)(struct net_device *dev);
@@ -1679,6 +1693,16 @@ struct net_device_ops {
 	int			(*ndo_hwtstamp_set)(struct net_device *dev,
 						    struct kernel_hwtstamp_config *kernel_config,
 						    struct netlink_ext_ack *extack);
+	void *			(*ndo_queue_mem_alloc)(struct net_device *dev,
+						       int idx);
+	void			(*ndo_queue_mem_free)(struct net_device *dev,
+						      void *queue_mem);
+	int			(*ndo_queue_start)(struct net_device *dev,
+						   int idx,
+						   void *queue_mem);
+	int			(*ndo_queue_stop)(struct net_device *dev,
+						  int idx,
+						  void **out_queue_mem);
 };
 
 /**
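To make the intended use of these four ops concrete, here is a userspace sketch of how core code might restart a single RX queue with them: allocate new queue memory, stop the queue (which hands back the old memory), start it on the new memory, then free the old memory. This is an assumption about the calling sequence, not code from the series. `model_rx_queue_restart()`, `struct model_dev`, the `toy_*` driver callbacks and the error handling are all hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

struct model_dev;

/* Same four operations as the patch, against a model device. */
struct model_queue_ops {
	void *(*queue_mem_alloc)(struct model_dev *dev, int idx);
	void (*queue_mem_free)(struct model_dev *dev, void *queue_mem);
	int (*queue_start)(struct model_dev *dev, int idx, void *queue_mem);
	int (*queue_stop)(struct model_dev *dev, int idx, void **out_queue_mem);
};

struct model_dev {
	const struct model_queue_ops *ops;
	void *rxq_mem[4];	/* memory currently backing each RX queue */
	int allocs, frees;	/* bookkeeping for the toy driver below */
};

/* Hypothetical core helper: swap the memory behind RX queue idx. */
static int model_rx_queue_restart(struct model_dev *dev, int idx)
{
	void *new_mem, *old_mem = NULL;
	int err;

	new_mem = dev->ops->queue_mem_alloc(dev, idx);
	if (!new_mem)
		return -1;

	err = dev->ops->queue_stop(dev, idx, &old_mem);
	if (err)
		goto err_free_new;

	err = dev->ops->queue_start(dev, idx, new_mem);
	if (err)
		goto err_restart_old;

	dev->ops->queue_mem_free(dev, old_mem);
	return 0;

err_restart_old:
	/* Best effort: bring the queue back up on its previous memory. */
	dev->ops->queue_start(dev, idx, old_mem);
err_free_new:
	dev->ops->queue_mem_free(dev, new_mem);
	return err;
}

/* Toy driver: queue memory is a plain heap allocation. */
static void *toy_mem_alloc(struct model_dev *dev, int idx)
{
	(void)idx;
	dev->allocs++;
	return malloc(64);
}

static void toy_mem_free(struct model_dev *dev, void *queue_mem)
{
	dev->frees++;
	free(queue_mem);
}

static int toy_start(struct model_dev *dev, int idx, void *queue_mem)
{
	dev->rxq_mem[idx] = queue_mem;
	return 0;
}

static int toy_stop(struct model_dev *dev, int idx, void **out_queue_mem)
{
	*out_queue_mem = dev->rxq_mem[idx];
	dev->rxq_mem[idx] = NULL;
	return 0;
}

static const struct model_queue_ops toy_ops = {
	.queue_mem_alloc = toy_mem_alloc,
	.queue_mem_free = toy_mem_free,
	.queue_start = toy_start,
	.queue_stop = toy_stop,
};
```

Allocating before stopping means an allocation failure leaves the running queue untouched, which is one reason to keep memory allocation as a separate op from start.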
On Mon, 4 Mar 2024 18:01:36 -0800 Mina Almasry wrote:
> + * void *(*ndo_queue_mem_alloc)(struct net_device *dev, int idx);
> + *	Allocate memory for an RX queue. The memory returned in the form of
> + *	a void * can be passed to ndo_queue_mem_free() for freeing or to
> + *	ndo_queue_start to create an RX queue with this memory.
> + *
> + * void	(*ndo_queue_mem_free)(struct net_device *dev, void *);
> + *	Free memory from an RX queue.
> + *
> + * int (*ndo_queue_start)(struct net_device *dev, int idx, void *);
> + *	Start an RX queue at the specified index.
> + *
> + * int (*ndo_queue_stop)(struct net_device *dev, int idx, void **);
> + *	Stop the RX queue at the specified index.
>   */
>  struct net_device_ops {
>  	int			(*ndo_init)(struct net_device *dev);
> @@ -1679,6 +1693,16 @@ struct net_device_ops {
>  	int			(*ndo_hwtstamp_set)(struct net_device *dev,
>  						    struct kernel_hwtstamp_config *kernel_config,
>  						    struct netlink_ext_ack *extack);
> +	void *			(*ndo_queue_mem_alloc)(struct net_device *dev,
> +						       int idx);
> +	void			(*ndo_queue_mem_free)(struct net_device *dev,
> +						      void *queue_mem);
> +	int			(*ndo_queue_start)(struct net_device *dev,
> +						   int idx,
> +						   void *queue_mem);
> +	int			(*ndo_queue_stop)(struct net_device *dev,
> +						  int idx,
> +						  void **out_queue_mem);
The queue configuration object was quite an integral part of the design, I'm slightly worried that it's not here :)
Also we may want to rename the about-to-be-merged ops from netdev_stat_ops to netdev_queue_ops, and add these there?
https://lore.kernel.org/all/20240306195509.1502746-2-kuba@kernel.org/
Very excited to hear that you made progress on this and ported GVE over!
On Thu, Mar 7, 2024 at 5:30 PM Jakub Kicinski kuba@kernel.org wrote:
> The queue configuration object was quite an integral part of the design, I'm slightly worried that it's not here :)
That was a bit of a simplification I'm making since we just want to restart the queue. I thought it was OK to define some minimal version here and extend it later with configuration? Because in this context all we really need is to restart the queue, yes?
If extending with some configuration is a must please let me know what configuration struct you're envisioning. Were you envisioning a stub? Or some real configuration struct that we just don't use at the moment? Or one that we use for this use case somehow?
> Also we may want to rename the about-to-be-merged ops from netdev_stat_ops to netdev_queue_ops, and add these there?
> https://lore.kernel.org/all/20240306195509.1502746-2-kuba@kernel.org/
Yeah, that sounds reasonable! Thanks! We could also keep the netdev_stat_ops and add new netdev_queue_ops alongside them if you prefer.
> Very excited to hear that you made progress on this and ported GVE over!
Actually, we're still discussing but it looks like my GVE queue API implementation I proposed earlier may be a no-go. Likely someone from the GVE team will follow up here with this piece, probably in a separate series.
For now I'm carrying my POC for the GVE implementation out of tree with the rest of the driver changes:
https://github.com/mina/linux/commit/501b734c80186545281e9edb1bf313f5a2d8cbe...
On Thu, 7 Mar 2024 18:08:24 -0800 Mina Almasry wrote:
> On Thu, Mar 7, 2024 at 5:30 PM Jakub Kicinski kuba@kernel.org wrote:
>> The queue configuration object was quite an integral part of the design, I'm slightly worried that it's not here :)
> That was a bit of a simplification I'm making since we just want to restart the queue. I thought it was OK to define some minimal version here and extend it later with configuration? Because in this context all we really need is to restart the queue, yes?
Right, I think it's perfectly fine for the time being. It works, and is internal to the kernel.
> If extending with some configuration is a must please let me know what configuration struct you're envisioning. Were you envisioning a stub? Or some real configuration struct that we just don't use at the moment? Or one that we use for this use case somehow?
I had some ideas about storing the configuration as rules, instead of directly in struct netdev_rx_queue. E.g. default queue length = 2000, but for select queues you may want a different length. But application binding to a queue would always take precedence, so even if the ideas ever materialize there will be no uAPI change.
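The "configuration as rules" idea can be sketched as a lookup with three levels of precedence: an application binding, then a per-queue rule, then the default. This is purely illustrative. Only the default length of 2000 and the precedence of the binding come from the paragraph above; the types, the notion that a binding carries its own length, and `model_resolve_queue_len()` are assumptions made for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_DEFAULT_QUEUE_LEN 2000u

/* A rule overriding the length of one specific queue. */
struct model_queue_rule {
	int queue_idx;
	unsigned int len;
};

/* State of an application binding on a queue (assumed to carry a length). */
struct model_binding {
	bool bound;
	unsigned int len;
};

/*
 * Resolve the effective length for queue idx:
 * binding (if any) > matching rule > default.
 */
static unsigned int model_resolve_queue_len(int idx,
					    const struct model_queue_rule *rules,
					    size_t nr_rules,
					    const struct model_binding *binding)
{
	if (binding && binding->bound)
		return binding->len;

	for (size_t i = 0; i < nr_rules; i++)
		if (rules[i].queue_idx == idx)
			return rules[i].len;

	return MODEL_DEFAULT_QUEUE_LEN;
}
```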
>> Also we may want to rename the about-to-be-merged ops from netdev_stat_ops to netdev_queue_ops, and add these there?
>> https://lore.kernel.org/all/20240306195509.1502746-2-kuba@kernel.org/
> Yeah, that sounds reasonable! Thanks! We could also keep the netdev_stat_ops and add new netdev_queue_ops alongside them if you prefer.
Up to you, after some soul searching we renamed the uAPI to call these stats qstats, I just forgot to rename the op struct. But it doesn't matter much.
>> Very excited to hear that you made progress on this and ported GVE over!
> Actually, we're still discussing but it looks like my GVE queue API implementation I proposed earlier may be a no-go. Likely someone from the GVE team will follow up here with this piece, probably in a separate series.
Well, it's going to be ready when it's ready :) Speaking of things which can be merged independently, feel free to post patch 3, maybe it can make v6.9..
> For now I'm carrying my POC for the GVE implementation out of tree with the rest of the driver changes:
> https://github.com/mina/linux/commit/501b734c80186545281e9edb1bf313f5a2d8cbe...
On 2024-03-04 18:01, Mina Almasry wrote:
> This API enables the net stack to reset the queues used for devmem.
>
> Signed-off-by: Mina Almasry almasrymina@google.com
> ---
>  include/linux/netdevice.h | 24 ++++++++++++++++++++++++
>  1 file changed, 24 insertions(+)
>
> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> index c41019f34179..3105c586355d 100644
> --- a/include/linux/netdevice.h
> +++ b/include/linux/netdevice.h
> @@ -1435,6 +1435,20 @@ struct netdev_net_notifier {
>   *			   struct kernel_hwtstamp_config *kernel_config,
>   *			   struct netlink_ext_ack *extack);
>   *	Change the hardware timestamping parameters for NIC device.
> + *
> + * void *(*ndo_queue_mem_alloc)(struct net_device *dev, int idx);
> + *	Allocate memory for an RX queue. The memory returned in the form of
> + *	a void * can be passed to ndo_queue_mem_free() for freeing or to
> + *	ndo_queue_start to create an RX queue with this memory.
> + *
> + * void	(*ndo_queue_mem_free)(struct net_device *dev, void *);
> + *	Free memory from an RX queue.
> + *
> + * int (*ndo_queue_start)(struct net_device *dev, int idx, void *);
> + *	Start an RX queue at the specified index.
> + *
> + * int (*ndo_queue_stop)(struct net_device *dev, int idx, void **);
> + *	Stop the RX queue at the specified index.
>   */
>  struct net_device_ops {
>  	int			(*ndo_init)(struct net_device *dev);
> @@ -1679,6 +1693,16 @@ struct net_device_ops {
>  	int			(*ndo_hwtstamp_set)(struct net_device *dev,
>  						    struct kernel_hwtstamp_config *kernel_config,
>  						    struct netlink_ext_ack *extack);
> +	void *			(*ndo_queue_mem_alloc)(struct net_device *dev,
> +						       int idx);
> +	void			(*ndo_queue_mem_free)(struct net_device *dev,
> +						      void *queue_mem);
> +	int			(*ndo_queue_start)(struct net_device *dev,
> +						   int idx,
> +						   void *queue_mem);
> +	int			(*ndo_queue_stop)(struct net_device *dev,
> +						  int idx,
> +						  void **out_queue_mem);
>  };
I'm working to port bnxt over to using this API. What are your thoughts on maybe pulling this out and use bnxt to drive it?
>  /**
On Fri, Mar 8, 2024 at 3:48 PM David Wei dw@davidwei.uk wrote:
> I'm working to port bnxt over to using this API. What are your thoughts on maybe pulling this out and use bnxt to drive it?
Sure thing, go for it! Thanks!
I think we're going to have someone from GVE working on this in parallel. I see no issue with us aligning on what the core-net ndos would look like and implementing those in parallel for both drivers. We're not currently planning to make any changes to the ndos besides applying Jakub's feedback from this thread. If you find a need to deviate from this, let us know and we'll work on staying in line with that. Thanks!
On 3/8/24 4:47 PM, David Wei wrote:
> I'm working to port bnxt over to using this API. What are your thoughts on maybe pulling this out and use bnxt to drive it?
I would love to see a second NIC implementation; this patch set and overall design is driven by GVE limitations.
From: Jakub Kicinski kuba@kernel.org
The page providers which try to reuse the same pages will need to hold onto the ref, even if page gets released from the pool - as in releasing the page from the pp just transfers the "ownership" reference from pp to the provider, and provider will wait for other references to be gone before feeding this page back into the pool.
Signed-off-by: Jakub Kicinski kuba@kernel.org Signed-off-by: Mina Almasry almasrymina@google.com
---
This is implemented by Jakub in his RFC: https://lore.kernel.org/netdev/f8270765-a27b-6ccf-33ea-cda097168d79@redhat.c...
I take no credit for the idea or implementation; I only added minor edits to make this workable with device memory TCP, and removed some hacky test code. This is a critical dependency of device memory TCP and thus I'm pulling it into this series to make it reviewable and mergeable.
RFC v3 -> v1
- Removed unused mem_provider. (Yunsheng).
- Replaced memory_provider & mp_priv with netdev_rx_queue (Jakub).
---
 include/net/page_pool/types.h | 12 ++++++++++
 net/core/page_pool.c          | 43 +++++++++++++++++++++++++++++++----
 2 files changed, 50 insertions(+), 5 deletions(-)
diff --git a/include/net/page_pool/types.h b/include/net/page_pool/types.h
index 5e43a08d3231..ffe5f31fb0da 100644
--- a/include/net/page_pool/types.h
+++ b/include/net/page_pool/types.h
@@ -52,6 +52,7 @@ struct pp_alloc_cache {
  * @dev:	device, for DMA pre-mapping purposes
  * @netdev:	netdev this pool will serve (leave as NULL if none or multiple)
  * @napi:	NAPI which is the sole consumer of pages, otherwise NULL
+ * @queue:	struct netdev_rx_queue this page_pool is being created for.
  * @dma_dir:	DMA mapping direction
  * @max_len:	max DMA sync memory size for PP_FLAG_DMA_SYNC_DEV
  * @offset:	DMA sync address offset for PP_FLAG_DMA_SYNC_DEV
@@ -64,6 +65,7 @@ struct page_pool_params {
 		int		nid;
 		struct device	*dev;
 		struct napi_struct *napi;
+		struct netdev_rx_queue *queue;
 		enum dma_data_direction dma_dir;
 		unsigned int	max_len;
 		unsigned int	offset;
@@ -126,6 +128,13 @@ struct page_pool_stats {
 };
 #endif
 
+struct memory_provider_ops {
+	int (*init)(struct page_pool *pool);
+	void (*destroy)(struct page_pool *pool);
+	struct page *(*alloc_pages)(struct page_pool *pool, gfp_t gfp);
+	bool (*release_page)(struct page_pool *pool, struct page *page);
+};
+
 struct page_pool {
 	struct page_pool_params_fast p;
 
@@ -176,6 +185,9 @@ struct page_pool {
 	 */
 	struct ptr_ring ring;
 
+	void *mp_priv;
+	const struct memory_provider_ops *mp_ops;
+
 #ifdef CONFIG_PAGE_POOL_STATS
 	/* recycle stats are per-cpu to avoid locking */
 	struct page_pool_recycle_stats __percpu *recycle_stats;
diff --git a/net/core/page_pool.c b/net/core/page_pool.c
index d706fe5548df..8776fcad064a 100644
--- a/net/core/page_pool.c
+++ b/net/core/page_pool.c
@@ -25,6 +25,8 @@
 
 #include "page_pool_priv.h"
 
+static DEFINE_STATIC_KEY_FALSE(page_pool_mem_providers);
+
 #define DEFER_TIME (msecs_to_jiffies(1000))
 #define DEFER_WARN_INTERVAL (60 * HZ)
 
@@ -177,6 +179,7 @@ static int page_pool_init(struct page_pool *pool,
 			  int cpuid)
 {
 	unsigned int ring_qsize = 1024; /* Default */
+	int err;
 
 	memcpy(&pool->p, &params->fast, sizeof(pool->p));
 	memcpy(&pool->slow, &params->slow, sizeof(pool->slow));
@@ -248,10 +251,25 @@ static int page_pool_init(struct page_pool *pool,
 	/* Driver calling page_pool_create() also call page_pool_destroy() */
 	refcount_set(&pool->user_cnt, 1);
 
+	if (pool->mp_ops) {
+		err = pool->mp_ops->init(pool);
+		if (err) {
+			pr_warn("%s() mem-provider init failed %d\n",
+				__func__, err);
+			goto free_ptr_ring;
+		}
+
+		static_branch_inc(&page_pool_mem_providers);
+	}
+
 	if (pool->p.flags & PP_FLAG_DMA_MAP)
 		get_device(pool->p.dev);
 
 	return 0;
+
+free_ptr_ring:
+	ptr_ring_cleanup(&pool->ring, NULL);
+	return err;
 }
 
 static void page_pool_uninit(struct page_pool *pool)
@@ -546,7 +564,10 @@ struct page *page_pool_alloc_pages(struct page_pool *pool, gfp_t gfp)
 		return page;
 
 	/* Slow-path: cache empty, do real allocation */
-	page = __page_pool_alloc_pages_slow(pool, gfp);
+	if (static_branch_unlikely(&page_pool_mem_providers) && pool->mp_ops)
+		page = pool->mp_ops->alloc_pages(pool, gfp);
+	else
+		page = __page_pool_alloc_pages_slow(pool, gfp);
 	return page;
 }
 EXPORT_SYMBOL(page_pool_alloc_pages);
@@ -603,10 +624,13 @@ void __page_pool_release_page_dma(struct page_pool *pool, struct page *page)
 void page_pool_return_page(struct page_pool *pool, struct page *page)
 {
 	int count;
+	bool put;
 
-	__page_pool_release_page_dma(pool, page);
-
-	page_pool_clear_pp_info(page);
+	put = true;
+	if (static_branch_unlikely(&page_pool_mem_providers) && pool->mp_ops)
+		put = pool->mp_ops->release_page(pool, page);
+	else
+		__page_pool_release_page_dma(pool, page);
 
 	/* This may be the last page returned, releasing the pool, so
 	 * it is not safe to reference pool afterwards.
@@ -614,7 +638,10 @@ void page_pool_return_page(struct page_pool *pool, struct page *page)
 	count = atomic_inc_return_relaxed(&pool->pages_state_release_cnt);
 	trace_page_pool_state_release(pool, page, count);
 
-	put_page(page);
+	if (put) {
+		page_pool_clear_pp_info(page);
+		put_page(page);
+	}
 	/* An optimization would be to call __free_pages(page, pool->p.order)
 	 * knowing page is not part of page-cache (thus avoiding a
 	 * __page_cache_release() call).
@@ -884,6 +911,12 @@ static void __page_pool_destroy(struct page_pool *pool)
 
 	page_pool_unlist(pool);
 	page_pool_uninit(pool);
+
+	if (pool->mp_ops) {
+		pool->mp_ops->destroy(pool);
+		static_branch_dec(&page_pool_mem_providers);
+	}
+
 	kfree(pool);
 }
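To make the dispatch in the patch above easier to follow, here is a minimal userspace sketch of the same idea. It is not kernel code: every mock_* and toy_* name is a hypothetical stand-in, and the static-key optimization is left out. It only illustrates that allocation falls back to the default slow path when no provider is installed, and that the release_page() return value decides whether the pool still puts the page itself.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel types; not the real definitions. */
struct mock_page { int id; };
struct mock_pool;

struct mock_provider_ops {
	int (*init)(struct mock_pool *pool);
	void (*destroy)(struct mock_pool *pool);
	struct mock_page *(*alloc_pages)(struct mock_pool *pool);
	bool (*release_page)(struct mock_pool *pool, struct mock_page *page);
};

struct mock_pool {
	void *mp_priv;
	const struct mock_provider_ops *mp_ops;
	int slow_allocs;	/* pages handed out by the default path */
	int put_calls;		/* times the default put path ran */
};

static struct mock_page default_page = { .id = 0 };

/* Default slow path, used when no provider is installed. */
static struct mock_page *mock_alloc_slow(struct mock_pool *pool)
{
	pool->slow_allocs++;
	return &default_page;
}

/* Models the branch added to page_pool_alloc_pages(). */
static struct mock_page *mock_pool_alloc(struct mock_pool *pool)
{
	if (pool->mp_ops)
		return pool->mp_ops->alloc_pages(pool);
	return mock_alloc_slow(pool);
}

/* Models page_pool_return_page(): the provider's return value decides
 * whether the pool should still drop the page reference itself.
 */
static void mock_pool_return(struct mock_pool *pool, struct mock_page *page)
{
	bool put = true;

	if (pool->mp_ops)
		put = pool->mp_ops->release_page(pool, page);
	if (put)
		pool->put_calls++;
}

/* A toy provider that hands out one fixed page and keeps ownership. */
static struct mock_page provider_page = { .id = 42 };

static int toy_init(struct mock_pool *pool)
{
	pool->mp_priv = &provider_page;
	return 0;
}

static void toy_destroy(struct mock_pool *pool)
{
	pool->mp_priv = NULL;
}

static struct mock_page *toy_alloc(struct mock_pool *pool)
{
	return pool->mp_priv;
}

static bool toy_release(struct mock_pool *pool, struct mock_page *page)
{
	(void)pool;
	(void)page;
	return false;	/* provider keeps the page; the pool must not put it */
}

static const struct mock_provider_ops toy_ops = {
	.init = toy_init,
	.destroy = toy_destroy,
	.alloc_pages = toy_alloc,
	.release_page = toy_release,
};
```

A pool with mp_ops left NULL behaves as before; a pool pointing at toy_ops never touches the default allocation or put paths.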
On 2024-03-04 18:01, Mina Almasry wrote:
+struct memory_provider_ops {
+	int (*init)(struct page_pool *pool);
+	void (*destroy)(struct page_pool *pool);
+	struct page *(*alloc_pages)(struct page_pool *pool, gfp_t gfp);
+	bool (*release_page)(struct page_pool *pool, struct page *page);
For ZC Rx we added a scrub() function to memory_provider_ops that is called from page_pool_scrub(). Does TCP devmem not need custom behaviour, i.e. waiting for all netmem_refs to return before destroying the page pool? What happens if e.g. the application crashes?
On Tue, Mar 5, 2024 at 1:55 PM David Wei dw@davidwei.uk wrote:
For ZC Rx we added a scrub() function to memory_provider_ops that is called from page_pool_scrub(). Does TCP devmem not need custom behaviour, i.e. waiting for all netmem_refs to return before destroying the page pool? What happens if e.g. the application crashes?
(sorry for the long reply, but the refcounting is pretty complicated to explain and I feel like we need to agree on how things currently work)
Yeah, the addition of the page_pool_scrub() function is a bit of a head scratcher for me. Here is how the (complicated) refcounting works for devmem TCP (assuming the driver is not doing its own recycling logic which complicates things further):
1. When a netmem_ref is allocated by the page_pool (from dmabuf or page), the netmem_get_pp_ref_count_ref()==1 and belongs to the page pool as long as the netmem is waiting in the pool for driver allocation.
2. When a netmem is allocated by the driver, no refcounting is changed, but the ownership of the netmem_get_pp_ref_count_ref() is implicitly transferred from the page pool to the driver. i.e. the ref now belongs to the driver until an skb is formed.
3. When the driver forms an skb using skb_rx_add_frag_netmem(), no refcounting is changed, but the ownership of the netmem_get_pp_ref_count_ref() is transferred from the driver to the TCP stack.
4. When the TCP stack hands the skb to the application, the TCP stack obtains an additional refcount, so netmem_get_pp_ref_count_ref()==2, and frees the skb using skb_frag_unref(), which drops it back to netmem_get_pp_ref_count_ref()==1.
5. When the user is done with the skb, the user calls the DEVMEM_DONTNEED setsockopt which calls napi_pp_put_netmem() which recycles the netmem back to the page pool. This doesn't modify any refcounting, but the refcount ownership transfers from the userspace back to the page pool, and we're back at step 1.
So all in all netmem can belong either to (a) the page pool, or (b) the driver, or (c) the TCP stack, or (d) the application depending on where exactly it is in the RX path.
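Steps 1-5 above can be sketched as a tiny state machine. This is only a userspace model under the assumptions stated in the message (no driver-side recycling); the enum, struct and step_* helpers are hypothetical names, not kernel APIs. It shows that the count stays at 1 for most of the cycle and only briefly reaches 2 in step 4, while the owner changes at every step.

```c
#include <assert.h>

/* Who currently owns the single long-lived reference. */
enum owner { OWNER_POOL, OWNER_DRIVER, OWNER_TCP, OWNER_APP };

struct model_netmem {
	int pp_ref_count;	/* models netmem_get_pp_ref_count_ref() */
	enum owner owner;
};

/* Step 1: the pool allocates the netmem and holds the only reference. */
static void step_pool_alloc(struct model_netmem *n)
{
	n->pp_ref_count = 1;
	n->owner = OWNER_POOL;
}

/* Step 2: the driver takes the netmem; the count is unchanged. */
static void step_driver_alloc(struct model_netmem *n)
{
	assert(n->owner == OWNER_POOL);
	n->owner = OWNER_DRIVER;
}

/* Step 3: the driver attaches the netmem to an skb; count unchanged. */
static void step_form_skb(struct model_netmem *n)
{
	assert(n->owner == OWNER_DRIVER);
	n->owner = OWNER_TCP;
}

/* Step 4a: the stack takes an extra reference for the application. */
static void step_take_user_ref(struct model_netmem *n)
{
	assert(n->owner == OWNER_TCP);
	n->pp_ref_count++;
}

/* Step 4b: the skb is freed, dropping the stack's own reference. */
static void step_free_skb(struct model_netmem *n)
{
	assert(n->owner == OWNER_TCP);
	n->pp_ref_count--;
	n->owner = OWNER_APP;
}

/* Step 5: DEVMEM_DONTNEED hands the reference back to the pool. */
static void step_dontneed(struct model_netmem *n)
{
	assert(n->owner == OWNER_APP);
	n->owner = OWNER_POOL;
}
```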
When an application running devmem TCP crashes, the netmems that belong to the page pool or driver are not touched, because in our case the page pool is not really tied to the application. However, the TCP stack notices the application's devmem socket closing, and when it does, the TCP stack will:
1. Free all the skbs in the socket's receive queue. This is not custom behavior for devmem TCP, it's just standard for TCP to free all skbs waiting to be received by the application.
2. Free the references that belong to the application. Since the application crashed, it will not call the DEVMEM_DONTNEED setsockopt, so we need to free those on behalf of the application. This is done in this diff:
@@ -2498,6 +2498,15 @@ static void tcp_md5sig_info_free_rcu(struct rcu_head *head)
 void tcp_v4_destroy_sock(struct sock *sk)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
+	__maybe_unused unsigned long index;
+	__maybe_unused void *netmem;
+
+#ifdef CONFIG_PAGE_POOL
+	xa_for_each(&sk->sk_user_frags, index, netmem)
+		WARN_ON_ONCE(!napi_pp_put_page((__force netmem_ref)netmem, false));
+#endif
+
+	xa_destroy(&sk->sk_user_frags);
 
 	trace_tcp_destroy_sock(sk);
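As a rough userspace analogue of the cleanup in the diff above: the socket tracks the frags that userspace still holds, and destroying the socket returns each of them. The fixed-size array stands in for the sk_user_frags xarray, and all model_* names are hypothetical; the failure counter plays the role of the WARN_ON_ONCE.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_MAX_USER_FRAGS 4

struct model_frag {
	bool held_by_user;	/* handed to userspace, not yet returned */
	bool in_pool;		/* available to the pool again */
};

struct model_sock {
	/* Stand-in for sk->sk_user_frags. */
	struct model_frag *user_frags[MODEL_MAX_USER_FRAGS];
};

/* Return a user-held frag to the pool; false if it was not user-held. */
static bool model_put_frag(struct model_frag *f)
{
	if (!f->held_by_user)
		return false;
	f->held_by_user = false;
	f->in_pool = true;
	return true;
}

/* Put every frag the application never returned, then empty the table.
 * Returns the number of unexpected failures (where the patch would warn).
 */
static int model_destroy_sock(struct model_sock *sk)
{
	int failures = 0;

	for (size_t i = 0; i < MODEL_MAX_USER_FRAGS; i++) {
		struct model_frag *f = sk->user_frags[i];

		if (!f)
			continue;
		if (!model_put_frag(f))
			failures++;
		sk->user_frags[i] = NULL;
	}
	return failures;
}
```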
To be honest, I think it makes sense for the TCP stack to be responsible for putting the references that belong to it and the application. To me, it does not make much sense for the page pool to be responsible for putting the reference that belongs to the TCP stack or driver via a page_pool_scrub() function, as those references do not belong to the page pool really. I'm not sure why there is a diff between our use cases here because I'm not an io_uring expert. Why do you need to scrub all the references on page pool destruction? Don't these belong to non-page pool components like io_uring stack or TCP stack or otherwise?
On 3/5/24 22:36, Mina Almasry wrote:
To be honest, I think it makes sense for the TCP stack to be responsible for putting the references that belong to it and the application. To me, it does not make much sense for the page pool to be responsible for putting the reference that belongs to the TCP stack or driver via a page_pool_scrub() function, as those references do not belong to the page pool really. I'm not sure why there is a diff between our use cases here because I'm not an io_uring expert. Why do you need to scrub all the references on page pool destruction? Don't these belong to non-page pool components like io_uring stack or TCP stack or otherwise?
That one is about cleaning buffers that are in b/w 4 and 5, i.e. owned by the user, which devmem does at sock destruction. io_uring could get by without scrub, dropping user refs while unregistering ifq, but then it'd need to wait for all requests to finish so there is no step 4 in the meantime. Might change, can be useful, but it was much easier to hook into the pp release loop.
Another concern is who and when can reset ifq / kill pp outside of io_uring/devmem. I assume it can happen on a whim, which is hard to handle gracefully.
On Wed, Mar 6, 2024 at 6:30 AM Pavel Begunkov asml.silence@gmail.com wrote:
That one is about cleaning buffers that are in b/w 4 and 5, i.e. owned by the user, which devmem does at sock destruction. io_uring could get by without scrub, dropping user refs while unregistering ifq, but then it'd need to wait for all requests to finish so there is no step 4 in the meantime. Might change, can be useful, but it was much easier to hook into the pp release loop.
Another concern is who and when can reset ifq / kill pp outside of io_uring/devmem. I assume it can happen on a whim, which is hard to handle gracefully.
If this is about dropping application refs in step 4 & step 5, then from devmem TCP perspective it must be done on socket close & skb freeing AFAIU, and not delayed until page_pool destruction. Think about a stupid or malicious user that does something like:
1. Set up dmabuf binding using netlink api.
2. While (100000):
3.   create devmem TCP socket.
4.   receive some devmem data on TCP socket.
5.   close TCP socket without calling DEVMEM_DONTNEED.
6. clean up dmabuf binding using netlink api.
In this case, we need to drop the references in step 5 when the socket is destroyed, so the memory is freed to the page pool and available for the next socket in step 3. We cannot delay the freeing until step 6 when the rx queue is recreated and the page pool is destroyed, otherwise the net_iovs would leak in the loop and eventually the NIC would fail to find available memory. The same bug would be reproducible with io_uring unless you're creating a new page pool for each new io_uring socket equivalent.
But even outside of this, I think it's a bit semantically off to ask the page_pool to drop references that belong to the application IMO, because those references are not the page_pool's.
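The starvation argument can be checked with a small counting simulation. This is a sketch with made-up numbers and hypothetical sim_* names, not a model of the real allocator: a binding has a fixed number of buffers, each socket lifetime takes some and closes without DEVMEM_DONTNEED, and the references are either released at close or deferred until the binding is torn down.

```c
#include <assert.h>
#include <stdbool.h>

struct sim_binding {
	int free_bufs;		/* buffers the NIC can still allocate */
	int deferred_bufs;	/* buffers whose release waits for unbind */
};

/* One socket lifetime: receive up to `want` buffers, then close without
 * returning them. Returns how many buffers the socket actually got.
 */
static int sim_socket_lifetime(struct sim_binding *b, int want,
			       bool release_on_close)
{
	int got = want < b->free_bufs ? want : b->free_bufs;

	b->free_bufs -= got;
	if (release_on_close)
		b->free_bufs += got;		/* dropped at socket destroy */
	else
		b->deferred_bufs += got;	/* held until unbind */
	return got;
}

/* Run the loop and return how many sockets received all their buffers. */
static int sim_run(int pool_size, int iterations, int per_socket,
		   bool release_on_close)
{
	struct sim_binding b = { .free_bufs = pool_size, .deferred_bufs = 0 };
	int served = 0;

	for (int i = 0; i < iterations; i++)
		if (sim_socket_lifetime(&b, per_socket,
					release_on_close) == per_socket)
			served++;

	/* Unbinding finally reclaims whatever was deferred. */
	b.free_bufs += b.deferred_bufs;
	assert(b.free_bufs == pool_size);
	return served;
}
```

With 16 buffers and 4 per socket, releasing at close serves every iteration, while deferring the release serves only the first 4 sockets before the binding runs dry.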
-- Pavel Begunkov
On 3/6/24 17:04, Mina Almasry wrote:
If this is about dropping application refs in step 4 & step 5, then from devmem TCP perspective it must be done on socket close & skb freeing AFAIU, and not delayed until page_pool destruction.
Right, something in the kernel should take care of it. You temporarily attach it to the socket, which is fine. And you could've also stored it in the netlink socket or some other object. In case of zcrx io_uring impl, it's bound to io_uring, io_uring is responsible for cleaning them up. And we do it before __page_pool_destroy, otherwise there would be a ref dependency.
A side note, attaching to netlink or some other global object sounds conceptually better, as once you return a buffer to the user, the socket should not have any further business with the buffer. FWIW, that better resembles io_uring approach. For example allows to:
recv(sock); close(sock); process_rx_buffers();
or to return (i.e. DEVMEM_DONTNEED) buffers from different sockets in one call. However, I don't think it's important for devmem and perhaps more implementation dictated.
Think about a stupid or malicious user that does something like:
1. Set up dmabuf binding using netlink api.
2. While (100000):
3.   create devmem TCP socket.
4.   receive some devmem data on TCP socket.
5.   close TCP socket without calling DEVMEM_DONTNEED.
6. clean up dmabuf binding using netlink api.
In this case, we need to drop the references in step 5 when the socket is destroyed, so the memory is freed to the page pool and available for the next socket in step 3. We cannot delay the freeing until step 6 when the rx queue is recreated and the page pool is destroyed, otherwise the net_iovs would leak in the loop and eventually the NIC would fail to find available memory. The same bug would be
By "would leak" you probably mean until step 6, right? There are always many ways to shoot yourself in the leg. Even if you clean up in 5, the user can just leak the socket and get the same result with pp starvation. I see it not as a requirement but rather a uapi choice, that's assuming netlink would be cleaned as a normal socket when the task exits.
reproducible with io_uring unless you're creating a new page pool for each new io_uring socket equivalent.
Surely we don't, but it's still the user's responsibility to return buffers back. And in case of io_uring buffers returned to the user are not attached to a socket, so even the scope / lifetime is a bit different.
But even outside of this, I think it's a bit semantically off to ask the page_pool to drop references that belong to the application IMO, because those references are not the page_pool's.
Completely agree with you, which is why it was in a callback, totally controlled by io_uring.
On Wed, Mar 6, 2024 at 11:14 AM Pavel Begunkov asml.silence@gmail.com wrote:
Right, something in the kernel should take care of it. You temporarily attach it to the socket, which is fine. And you could've also stored it in the netlink socket or some other object. In case of zcrx io_uring impl, it's bound to io_uring, io_uring is responsible for cleaning them up. And we do it before __page_pool_destroy, otherwise there would be a ref dependency.
AFAIU today the page_pool_release() waits until there are no more pages in flight before calling __page_pool_destroy(), and in existing use cases it's common for the page_pool to stay alive after page_pool_destroy() is called until all the skbs waiting in the receive queue to be recvmsg()'d are received and the page_pool is freed. I just didn't modify that behavior.
A side note, attaching to netlink or some other global object sounds conceptually better, as once you return a buffer to the user, the socket should not have any further business with the buffer. FWIW, that better resembles io_uring approach. For example allows to:
recv(sock); close(sock); process_rx_buffers();
or to return (i.e. DEVMEM_DONTNEED) buffers from different sockets in one call. However, I don't think it's important for devmem and perhaps more implementation dictated.
Hmm I think this may be a sockets vs io_uring difference? For sockets there is no way to recvmsg() new buffers after the sock close and there is no way to release buffers to the kernel via the setsockopt() after the sock close.
But I don't think we need to align on everything here, right? If page_pool_scrub makes more sense in your use case because the lifetime of the io_uring buffers is different, I don't see any issue with you extending the ops with page_pool_scrub(), and not defining it for the dmabuf_devmem provider which doesn't need a scrub, right?
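The "optional op" idea could look roughly like the sketch below: the ops table gains a scrub hook, the pool only calls it when a provider defines it, and a provider that cleans up elsewhere simply leaves it NULL. All opt_* and *_like_* names are hypothetical and the struct is trimmed to what the example needs; this is not the zcrx or devmem implementation.

```c
#include <assert.h>
#include <stddef.h>

struct opt_pool;

struct opt_provider_ops {
	void (*destroy)(struct opt_pool *pool);
	void (*scrub)(struct opt_pool *pool);	/* optional; may be NULL */
};

struct opt_pool {
	const struct opt_provider_ops *mp_ops;
	int user_held;	/* buffers still held by userspace */
	int destroyed;
};

/* Call the provider's scrub hook only if it defines one. */
static void opt_pool_scrub(struct opt_pool *pool)
{
	if (pool->mp_ops && pool->mp_ops->scrub)
		pool->mp_ops->scrub(pool);
}

static void common_destroy(struct opt_pool *pool)
{
	pool->destroyed = 1;
}

/* A provider that drops user-held buffers from the scrub hook. */
static void zcrx_like_scrub(struct opt_pool *pool)
{
	pool->user_held = 0;
}

static const struct opt_provider_ops zcrx_like_ops = {
	.destroy = common_destroy,
	.scrub = zcrx_like_scrub,
};

/* A provider that releases user references elsewhere (for example at
 * socket destruction) and therefore leaves .scrub unset.
 */
static const struct opt_provider_ops devmem_like_ops = {
	.destroy = common_destroy,
};
```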
Think about a stupid or malicious user that does something like:
1. Set up dmabuf binding using netlink api.
2. While (100000):
3.   create devmem TCP socket.
4.   receive some devmem data on TCP socket.
5.   close TCP socket without calling DEVMEM_DONTNEED.
6. clean up dmabuf binding using netlink api.
In this case, we need to drop the references in step 5 when the socket is destroyed, so the memory is freed to the page pool and available for the next socket in step 3. We cannot delay the freeing until step 6 when the rx queue is recreated and the page pool is destroyed, otherwise the net_iovs would leak in the loop and eventually the NIC would fail to find available memory. The same bug would be
By "would leak" you probably mean until step 6, right? There are
Yes, sorry I wasn't clear!
always many ways to shoot yourself in the leg. Even if you clean up in 5, the user can just leak the socket and get the same result with pp starvation. I see it not as a requirement but rather a uapi choice, that's assuming netlink would be cleaned as a normal socket when the task exits.
Yes, thanks for pointing out. The above was a pathological example meant to describe the point, but I think this generates a realistic edge case I may run into in production. I don't know if you care about the specifics, but FWIW we split our userspace into an orchestrator that allocates dma-bufs and binds them via netlink and the ML application that creates tcp connections. We do this because then the orchestrator needs CAP_NET_ADMIN for netlink but the ML applications do not. If we delay dropping references until page_pool_destroy then we delay dropping references until the orchestrator exits, i.e. we risk one ML application crashing, leaving references unfreed, and the next application (that reuses the buffers) seeing a smaller address space because the previous application did not get to release them before crash and so on.
Now of course it's possible to work around this by making sure we don't reuse bound buffers (when they should be reusable for the same user), but in general I think in the socket use case it's a bit unnatural IMO for one socket to leave state behind like this and this would be a subtlety that the userspace needs to take care of, but like you said, maybe a uapi or buffer lifetime choice.
reproducible with io_uring unless you're creating a new page pool for each new io_uring socket equivalent.
Surely we don't, but it's still the user's responsibility to return buffers back. And in case of io_uring buffers returned to the user are not attached to a socket, so even the scope / lifetime is a bit different.
Yes, sorry, without understanding the specifics it seems your lifetime management is different. IMO it's not an issue if we diverge in this aspect.
But even outside of this, I think it's a bit semantically off to ask the page_pool to drop references that belong to the application IMO, because those references are not the page_pool's.
Completely agree with you, which is why it was in a callback, totally controlled by io_uring.
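The loop scenario quoted above can be made concrete with a small userspace model. This is only a sketch under stated assumptions, not kernel code: `struct pool`, `struct sock`, `sock_recv()`, `sock_close()`, `run_loop()` and the `release_on_close` switch are hypothetical names invented for illustration. It counts how many loop iterations receive no buffers at all, depending on whether unreturned buffers are released when the socket is closed or only when the binding goes away.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace model of a bound dma-buf; not kernel code. */
struct pool {
	size_t total;     /* buffers in the binding */
	size_t available; /* buffers the NIC can still be given */
};

struct sock {
	struct pool *pool;
	size_t held;      /* buffers handed to userspace, not yet returned */
};

/* Model recvmsg(): move up to n buffers from the pool to the socket. */
static size_t sock_recv(struct sock *sk, size_t n)
{
	size_t got = n < sk->pool->available ? n : sk->pool->available;

	sk->pool->available -= got;
	sk->held += got;
	assert(sk->held <= sk->pool->total);
	return got;
}

/* Model close(): optionally drop the references the app never returned. */
static void sock_close(struct sock *sk, bool release_on_close)
{
	if (release_on_close) {
		sk->pool->available += sk->held;
		sk->held = 0;
	}
	/* Otherwise the buffers stay pinned until the binding is torn down. */
}

/* Run the loop; return how many iterations received no buffers at all. */
static size_t run_loop(size_t pool_size, size_t iterations,
		       size_t bufs_per_sock, bool release_on_close)
{
	struct pool p = { .total = pool_size, .available = pool_size };
	size_t starved = 0;

	for (size_t i = 0; i < iterations; i++) {
		struct sock sk = { .pool = &p, .held = 0 };

		if (sock_recv(&sk, bufs_per_sock) == 0)
			starved++;
		sock_close(&sk, release_on_close);
	}
	return starved;
}
```

With a 64-buffer pool and 16 buffers taken per socket, `run_loop(64, 100, 16, true)` never starves, while `run_loop(64, 100, 16, false)` starves every iteration after the fourth, which is the pool starvation described in the thread.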
-- Pavel Begunkov
On 3/6/24 21:59, Mina Almasry wrote:
On Wed, Mar 6, 2024 at 11:14 AM Pavel Begunkov asml.silence@gmail.com wrote:
On 3/6/24 17:04, Mina Almasry wrote:
On Wed, Mar 6, 2024 at 6:30 AM Pavel Begunkov asml.silence@gmail.com wrote:
On 3/5/24 22:36, Mina Almasry wrote:
...
To be honest, I think it makes sense for the TCP stack to be responsible for putting the references that belong to it and the application. To me, it does not make much sense for the page pool to be responsible for putting the reference that belongs to the TCP stack or driver via a page_pool_scrub() function, as those references do not belong to the page pool really. I'm not sure why there is a diff between our use cases here because I'm not an io_uring expert. Why do you need to scrub all the references on page pool destruction? Don't these belong to non-page pool components like io_uring stack or TCP stack or otherwise?
That one is about cleaning buffers that are in b/w 4 and 5, i.e. owned by the user, which devmem does at sock destruction. io_uring could get by without scrub, dropping user refs while unregistering ifq, but then it'd need to wait for all requests to finish so there is no step 4 in the meantime. Might change, can be useful, but it was much easier to hook into the pp release loop.
Another concern is who and when can reset ifq / kill pp outside of io_uring/devmem. I assume it can happen on a whim, which is hard to handle gracefully.
If this is about dropping application refs in step 4 & step 5, then from devmem TCP perspective it must be done on socket close & skb freeing AFAIU, and not delayed until page_pool destruction.
Right, something in the kernel should take care of it. You temporarily attach it to the socket, which is fine. And you could've also stored it in the netlink socket or some other object. In case of zcrx io_uring impl, it's bound to io_uring, io_uring is responsible for cleaning them up. And we do it before __page_pool_destroy, otherwise there would be a ref dependency.
AFAIU today the page_pool_release() waits until there are no more pages in flight before calling __page_pool_destroy(), and in existing use cases it's common for the page_pool to stay alive after page_pool_destroy() is called until all the skbs waiting in the receive queue to be recvmsg()'d are received and the page_pool is freed. I just didn't modify that behavior.
A side note, attaching to netlink or some other global object sounds conceptually better, as once you return a buffer to the user, the socket should not have any further business with the buffer. FWIW, that better resembles io_uring approach. For example allows to:
recv(sock); close(sock); process_rx_buffers();
or to return (i.e. DEVMEM_DONTNEED) buffers from different sockets in one call. However, I don't think it's important for devmem and perhaps more implementation dictated.
Hmm I think this may be a sockets vs io_uring difference? For sockets there is no way to recvmsg() new buffers after the sock close and
That is true for io_uring as well, but io_uring can't wait for socket tear down as it holds socket refs
there is no way to release buffers to the kernel via the setsockopt() after the sock close.
If, for example, you store the xarray with userspace buffers in a netlink socket and implement DONT_NEED setsockopt against it, you would be able to return bufs after the TCP socket is closed. FWIW, I'm not saying you should or even want to have it this way.
But I don't think we need to align on everything here, right? If
Absolutely, mentioned just to entertain with a design consideration
page_pool_scrub makes more sense in your use case because the lifetime of the io_uring buffers is different I don't see any issue with you extending the ops with page_pool_scrub(), and not define it for the dmabuf_devmem provider which doesn't need a scrub, right?
Yes, and it's a slow path, but I'll look at removing it anyway in later rfcs.
Think about a stupid or malicious user that does something like:
1. Set up dmabuf binding using netlink api.
2. While (100000):
3.     create devmem TCP socket.
4.     receive some devmem data on TCP socket.
5.     close TCP socket without calling DEVMEM_DONTNEED.
6. clean up dmabuf binding using netlink api.
In this case, we need to drop the references in step 5 when the socket is destroyed, so the memory is freed to the page pool and available for the next socket in step 3. We cannot delay the freeing until step 6 when the rx queue is recreated and the page pool is destroyed, otherwise the net_iovs would leak in the loop and eventually the NIC would fail to find available memory. The same bug would be
By "would leak" you probably mean until step 6, right? There are
Yes, sorry I wasn't clear!
always many ways to shoot yourself in the leg. Even if you clean up in 5, the user can just leak the socket and get the same result with pp starvation. I see it not as a requirement but rather a uapi choice, that's assuming netlink would be cleaned as a normal socket when the task exits.
Yes, thanks for pointing out. The above was a pathological example meant to describe the point, but I think this generates a realistic edge case I may run into in production. I don't know if you care about the specifics, but FWIW we split our userspace into an orchestrator that allocates dma-bufs and binds them via netlink and the ML application that creates tcp connections. We do this because then the orchestrator needs CAP_NET_ADMIN for netlink but the ML applications do not. If we delay dropping references until page_pool_destroy then we delay dropping references until the orchestrator exits, i.e. we risk one ML application crashing, leaving references unfreed, and the next application (that reuses the buffers) seeing a smaller address space because the previous application did not get to release them before crash and so on.
Makes sense
Now of course it's possible to work around this by making sure we don't reuse bound buffers (when they should be reusable for the same user), but in general I think in the socket use case it's a bit unnatural IMO for one socket to leave state behind like this and this would be a subtlety that the userspace needs to take care of, but like you said, maybe a uapi or buffer lifetime choice.
reproducible with io_uring unless you're creating a new page pool for each new io_uring socket equivalent.
Surely we don't, but it's still the user's responsibility to return buffers back. And in case of io_uring buffers returned to the user are not attached to a socket, so even the scope / lifetime is a bit different.
Yes, sorry, without understanding the specifics it seems your lifetime management is different. IMO it's not an issue if we diverge in this aspect.
In terms of devmem that would be if you attach userspace buffers to the netlink socket instead of a TCP socket as mentioned above, not that much different
On 2024-03-04 18:01, Mina Almasry wrote:
From: Jakub Kicinski kuba@kernel.org
The page providers which try to reuse the same pages will need to hold onto the ref, even if page gets released from the pool - as in releasing the page from the pp just transfers the "ownership" reference from pp to the provider, and provider will wait for other references to be gone before feeding this page back into the pool.
Signed-off-by: Jakub Kicinski kuba@kernel.org Signed-off-by: Mina Almasry almasrymina@google.com
This is implemented by Jakub in his RFC: https://lore.kernel.org/netdev/f8270765-a27b-6ccf-33ea-cda097168d79@redhat.c...
I take no credit for the idea or implementation; I only added minor edits to make this workable with device memory TCP, and removed some hacky test code. This is a critical dependency of device memory TCP and thus I'm pulling it into this series to make it reviewable and mergeable.
RFC v3 -> v1
- Removed unused mem_provider. (Yunsheng).
- Replaced memory_provider & mp_priv with netdev_rx_queue (Jakub).
include/net/page_pool/types.h | 12 ++++++++++ net/core/page_pool.c | 43 +++++++++++++++++++++++++++++++---- 2 files changed, 50 insertions(+), 5 deletions(-)
diff --git a/include/net/page_pool/types.h b/include/net/page_pool/types.h
index 5e43a08d3231..ffe5f31fb0da 100644
--- a/include/net/page_pool/types.h
+++ b/include/net/page_pool/types.h
@@ -52,6 +52,7 @@ struct pp_alloc_cache {
  * @dev:	device, for DMA pre-mapping purposes
  * @netdev:	netdev this pool will serve (leave as NULL if none or multiple)
  * @napi:	NAPI which is the sole consumer of pages, otherwise NULL
+ * @queue:	struct netdev_rx_queue this page_pool is being created for.
  * @dma_dir:	DMA mapping direction
  * @max_len:	max DMA sync memory size for PP_FLAG_DMA_SYNC_DEV
  * @offset:	DMA sync address offset for PP_FLAG_DMA_SYNC_DEV
@@ -64,6 +65,7 @@ struct page_pool_params {
 		int		nid;
 		struct device	*dev;
 		struct napi_struct *napi;
+		struct netdev_rx_queue *queue;
 		enum dma_data_direction dma_dir;
 		unsigned int	max_len;
 		unsigned int	offset;
@@ -126,6 +128,13 @@ struct page_pool_stats {
 };
 #endif

+struct memory_provider_ops {
+	int (*init)(struct page_pool *pool);
+	void (*destroy)(struct page_pool *pool);
+	struct page *(*alloc_pages)(struct page_pool *pool, gfp_t gfp);
+	bool (*release_page)(struct page_pool *pool, struct page *page);
+};
Separate question as I try to adapt bnxt to this and your queue configuration API.
How does GVE handle the need to allocate kernel pages for headers and dmabuf for payloads?
Reading the code, struct gve_rx_ring is the main per-ring object with a page pool. gve_queue_page_lists are filled with page pool netmem allocations from the page pool in gve_alloc_queue_page_list(). Are these strictly used for payloads only?
I found a struct gve_header_buf in both gve_rx_ring and struct gve_per_rx_queue_mem_dpo. This is allocated in gve_rx_queue_mem_alloc() using dma_alloc_coherent(). Is this where GVE stores headers?
IOW, GVE only uses page pool to allocate memory for QPLs, and QPLs are used by the device for split payloads. Is my understanding correct?
 struct page_pool {
 	struct page_pool_params_fast p;

@@ -176,6 +185,9 @@ struct page_pool {
 	 */
 	struct ptr_ring ring;

+	void *mp_priv;
+	const struct memory_provider_ops *mp_ops;
+
 #ifdef CONFIG_PAGE_POOL_STATS
 	/* recycle stats are per-cpu to avoid locking */
 	struct page_pool_recycle_stats __percpu *recycle_stats;
diff --git a/net/core/page_pool.c b/net/core/page_pool.c
index d706fe5548df..8776fcad064a 100644
--- a/net/core/page_pool.c
+++ b/net/core/page_pool.c
@@ -25,6 +25,8 @@

 #include "page_pool_priv.h"

+static DEFINE_STATIC_KEY_FALSE(page_pool_mem_providers);
+
 #define DEFER_TIME (msecs_to_jiffies(1000))
 #define DEFER_WARN_INTERVAL (60 * HZ)

@@ -177,6 +179,7 @@ static int page_pool_init(struct page_pool *pool,
 			  int cpuid)
 {
 	unsigned int ring_qsize = 1024; /* Default */
+	int err;

 	memcpy(&pool->p, &params->fast, sizeof(pool->p));
 	memcpy(&pool->slow, &params->slow, sizeof(pool->slow));
@@ -248,10 +251,25 @@ static int page_pool_init(struct page_pool *pool,
 	/* Driver calling page_pool_create() also call page_pool_destroy() */
 	refcount_set(&pool->user_cnt, 1);

+	if (pool->mp_ops) {
+		err = pool->mp_ops->init(pool);
+		if (err) {
+			pr_warn("%s() mem-provider init failed %d\n",
+				__func__, err);
+			goto free_ptr_ring;
+		}
+
+		static_branch_inc(&page_pool_mem_providers);
+	}
+
 	if (pool->p.flags & PP_FLAG_DMA_MAP)
 		get_device(pool->p.dev);

 	return 0;
+
+free_ptr_ring:
+	ptr_ring_cleanup(&pool->ring, NULL);
+	return err;
 }

 static void page_pool_uninit(struct page_pool *pool)
@@ -546,7 +564,10 @@ struct page *page_pool_alloc_pages(struct page_pool *pool, gfp_t gfp)
 		return page;

 	/* Slow-path: cache empty, do real allocation */
-	page = __page_pool_alloc_pages_slow(pool, gfp);
+	if (static_branch_unlikely(&page_pool_mem_providers) && pool->mp_ops)
+		page = pool->mp_ops->alloc_pages(pool, gfp);
+	else
+		page = __page_pool_alloc_pages_slow(pool, gfp);
 	return page;
 }
 EXPORT_SYMBOL(page_pool_alloc_pages);
@@ -603,10 +624,13 @@ void __page_pool_release_page_dma(struct page_pool *pool, struct page *page)
 void page_pool_return_page(struct page_pool *pool, struct page *page)
 {
 	int count;
+	bool put;

-	__page_pool_release_page_dma(pool, page);
-
-	page_pool_clear_pp_info(page);
+	put = true;
+	if (static_branch_unlikely(&page_pool_mem_providers) && pool->mp_ops)
+		put = pool->mp_ops->release_page(pool, page);
+	else
+		__page_pool_release_page_dma(pool, page);

 	/* This may be the last page returned, releasing the pool, so
 	 * it is not safe to reference pool afterwards.
@@ -614,7 +638,10 @@ void page_pool_return_page(struct page_pool *pool, struct page *page)
 	count = atomic_inc_return_relaxed(&pool->pages_state_release_cnt);
 	trace_page_pool_state_release(pool, page, count);

-	put_page(page);
+	if (put) {
+		page_pool_clear_pp_info(page);
+		put_page(page);
+	}
 	/* An optimization would be to call __free_pages(page, pool->p.order)
 	 * knowing page is not part of page-cache (thus avoiding a
 	 * __page_cache_release() call).
@@ -884,6 +911,12 @@ static void __page_pool_destroy(struct page_pool *pool)

 	page_pool_unlist(pool);
 	page_pool_uninit(pool);
+
+	if (pool->mp_ops) {
+		pool->mp_ops->destroy(pool);
+		static_branch_dec(&page_pool_mem_providers);
+	}
+
 	kfree(pool);
 }
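To illustrate how the hooks in this patch are meant to interact, here is a minimal userspace model. It is a sketch, not the kernel implementation: `struct page`, `struct page_pool`, `pool_alloc()` and `pool_return()` are simplified stand-ins, `gfp_t` is dropped, and the single-page provider (`opp_*`, `one_page_ops`) is a hypothetical example. The point it shows is the contract of `release_page()`: when it returns false, the pool does not put the page, and ownership goes back to the provider.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

struct page { bool in_use; };
struct page_pool;

/* Same shape as the ops in the patch, minus gfp_t (userspace model). */
struct memory_provider_ops {
	int (*init)(struct page_pool *pool);
	void (*destroy)(struct page_pool *pool);
	struct page *(*alloc_pages)(struct page_pool *pool);
	bool (*release_page)(struct page_pool *pool, struct page *page);
};

struct page_pool {
	const struct memory_provider_ops *mp_ops;
	void *mp_priv;
	int put_count; /* pages the pool itself freed */
};

/* Allocation: use the provider when one is attached, else the default. */
static struct page *pool_alloc(struct page_pool *pool)
{
	if (pool->mp_ops)
		return pool->mp_ops->alloc_pages(pool);
	return calloc(1, sizeof(struct page));
}

/* Return: the provider decides whether the pool may free the page. */
static void pool_return(struct page_pool *pool, struct page *page)
{
	bool put = true;

	assert(page != NULL);
	if (pool->mp_ops)
		put = pool->mp_ops->release_page(pool, page);
	if (put) {
		free(page);
		pool->put_count++;
	}
}

/* Hypothetical provider that owns a single page and recycles it. */
static int opp_init(struct page_pool *pool)
{
	pool->mp_priv = calloc(1, sizeof(struct page));
	return pool->mp_priv ? 0 : -1;
}

static void opp_destroy(struct page_pool *pool)
{
	free(pool->mp_priv);
	pool->mp_priv = NULL;
}

static struct page *opp_alloc(struct page_pool *pool)
{
	struct page *page = pool->mp_priv;

	if (page->in_use)
		return NULL; /* the one page is still out */
	page->in_use = true;
	return page;
}

/* Returning false keeps the page with the provider: the pool must not
 * free it, and the provider can hand it out again later.
 */
static bool opp_release(struct page_pool *pool, struct page *page)
{
	(void)pool;
	page->in_use = false;
	return false;
}

static const struct memory_provider_ops one_page_ops = {
	.init = opp_init,
	.destroy = opp_destroy,
	.alloc_pages = opp_alloc,
	.release_page = opp_release,
};
```

A pool created with `.mp_ops = &one_page_ops` gets the same page back on every allocate/return cycle and never increments `put_count`; a pool with no ops falls back to the default allocation and frees on return.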
On Thu, Mar 7, 2024 at 8:57 PM David Wei dw@davidwei.uk wrote:
On 2024-03-04 18:01, Mina Almasry wrote:
From: Jakub Kicinski kuba@kernel.org
The page providers which try to reuse the same pages will need to hold onto the ref, even if page gets released from the pool - as in releasing the page from the pp just transfers the "ownership" reference from pp to the provider, and provider will wait for other references to be gone before feeding this page back into the pool.
Signed-off-by: Jakub Kicinski kuba@kernel.org Signed-off-by: Mina Almasry almasrymina@google.com
This is implemented by Jakub in his RFC: https://lore.kernel.org/netdev/f8270765-a27b-6ccf-33ea-cda097168d79@redhat.c...
I take no credit for the idea or implementation; I only added minor edits to make this workable with device memory TCP, and removed some hacky test code. This is a critical dependency of device memory TCP and thus I'm pulling it into this series to make it reviewable and mergeable.
RFC v3 -> v1
- Removed unused mem_provider. (Yunsheng).
- Replaced memory_provider & mp_priv with netdev_rx_queue (Jakub).
include/net/page_pool/types.h | 12 ++++++++++ net/core/page_pool.c | 43 +++++++++++++++++++++++++++++++---- 2 files changed, 50 insertions(+), 5 deletions(-)
diff --git a/include/net/page_pool/types.h b/include/net/page_pool/types.h
index 5e43a08d3231..ffe5f31fb0da 100644
--- a/include/net/page_pool/types.h
+++ b/include/net/page_pool/types.h
@@ -52,6 +52,7 @@ struct pp_alloc_cache {
  * @dev:	device, for DMA pre-mapping purposes
  * @netdev:	netdev this pool will serve (leave as NULL if none or multiple)
  * @napi:	NAPI which is the sole consumer of pages, otherwise NULL
+ * @queue:	struct netdev_rx_queue this page_pool is being created for.
  * @dma_dir:	DMA mapping direction
  * @max_len:	max DMA sync memory size for PP_FLAG_DMA_SYNC_DEV
  * @offset:	DMA sync address offset for PP_FLAG_DMA_SYNC_DEV
@@ -64,6 +65,7 @@ struct page_pool_params {
 		int		nid;
 		struct device	*dev;
 		struct napi_struct *napi;
+		struct netdev_rx_queue *queue;
 		enum dma_data_direction dma_dir;
 		unsigned int	max_len;
 		unsigned int	offset;
@@ -126,6 +128,13 @@ struct page_pool_stats {
 };
 #endif

+struct memory_provider_ops {
+	int (*init)(struct page_pool *pool);
+	void (*destroy)(struct page_pool *pool);
+	struct page *(*alloc_pages)(struct page_pool *pool, gfp_t gfp);
+	bool (*release_page)(struct page_pool *pool, struct page *page);
+};
Separate question as I try to adapt bnxt to this and your queue configuration API.
How does GVE handle the need to allocate kernel pages for headers and dmabuf for payloads?
Reading the code, struct gve_rx_ring is the main per-ring object with a page pool. gve_queue_page_lists are filled with page pool netmem allocations from the page pool in gve_alloc_queue_page_list(). Are these strictly used for payloads only?
You're almost correct. We actually don't use the gve queue page lists for devmem TCP, that's an unrelated GVE feature/code path for low memory VMs. The code in effect is the !qpl code. In that code, for incoming RX packets we allocate a new or recycled netmem from the page pool in gve_alloc_page_dqo(). These buffers are used for payload only in the case where header split is enabled. In the case header split is disabled, these buffers are used for the entire incoming packet.
I found a struct gve_header_buf in both gve_rx_ring and struct gve_per_rx_queue_mem_dpo. This is allocated in gve_rx_queue_mem_alloc() using dma_alloc_coherent(). Is this where GVE stores headers?
Yes, this is where GVE stores headers.
IOW, GVE only uses page pool to allocate memory for QPLs, and QPLs are used by the device for split payloads. Is my understanding correct?
On Mon, Mar 04, 2024 at 06:01:37PM -0800, Mina Almasry wrote:
From: Jakub Kicinski kuba@kernel.org
The page providers which try to reuse the same pages will need to hold onto the ref, even if page gets released from the pool - as in releasing the page from the pp just transfers the "ownership" reference from pp to the provider, and provider will wait for other references to be gone before feeding this page back into the pool.
The word hook always rings a giant warning bell for me, and looking into this series I am concerned indeed.
The only provider provided here is the dma-buf one, and that basically is the only sensible one for the documented design. So instead of adding hooks that random proprietary crap can hook into, why not hard code the dma buf provide and just use a flag? That'll also avoid expensive indirect calls.
On 2024-03-17 19:02, Christoph Hellwig wrote:
On Mon, Mar 04, 2024 at 06:01:37PM -0800, Mina Almasry wrote:
From: Jakub Kicinski kuba@kernel.org
The page providers which try to reuse the same pages will need to hold onto the ref, even if page gets released from the pool - as in releasing the page from the pp just transfers the "ownership" reference from pp to the provider, and provider will wait for other references to be gone before feeding this page back into the pool.
The word hook always rings a giant warning bell for me, and looking into this series I am concerned indeed.
The only provider provided here is the dma-buf one, and that basically is the only sensible one for the documented design. So instead of adding hooks that random proprietary crap can hook into, why not hard code the dma buf provide and just use a flag? That'll also avoid expensive indirect calls.
I'm working on a similar proposal for zero copy Rx but to host memory and depend on this memory provider API.
Jakub also designed this API for hugepages too IIRC. Basically there's going to be at least three fancy ways of providing pages (one of which isn't actually pages, hence the merged netmem_t series) to drivers.
On Sun, Mar 17, 2024 at 07:49:43PM -0700, David Wei wrote:
I'm working on a similar proposal for zero copy Rx but to host memory and depend on this memory provider API.
How do you need a different provider for that vs just udmabuf?
Jakub also designed this API for hugepages too IIRC. Basically there's going to be at least three fancy ways of providing pages (one of which isn't actually pages, hence the merged netmem_t series) to drivers.
How do hugepages differ from a normal page allocation? They should just be a different order passed to the page allocator.
Hi Christoph,
Sorry for the late reply, I've been out for a few days.
On Mon, Mar 18, 2024 at 4:22 PM Christoph Hellwig hch@infradead.org wrote:
On Sun, Mar 17, 2024 at 07:49:43PM -0700, David Wei wrote:
I'm working on a similar proposal for zero copy Rx but to host memory and depend on this memory provider API.
How do you need a different provider for that vs just udmabuf?
This was discussed on the io_uring ZC RFC in one of the earliest RFCs. Here is a link to Pavel's response:
https://patchwork.kernel.org/project/netdevbpf/patch/20231106024413.2801438-...
The UAPI of wrapping io_uring memory into a udmabuf just to use it with devmem TCP only for the user to have to unwrap it is undesirable to him.
Jakub also designed this API for hugepages too IIRC. Basically there's going to be at least three fancy ways of providing pages (one of which isn't actually pages, hence the merged netmem_t series) to drivers.
How do hugepages differ from a normal page allocation? They should just be a different order passed to the page allocator.
Yes, that's more or less what the hugepage memory provider Jakub proposed does. The memory provider would allocate a hugepage and hold a reference to it. Then when the page_pool needs a page, the provider would carve a PAGE_SIZE page out of said hugepage region and provide it to the page_pool, and the pool would pass it on to the driver. This allows hugepages to work without the page_pool and driver having to be hugepage aware or to contain hugepage-specific processing.
Other designs for this hugepage use case are possible, I'm just describing Jakub's idea for it as a potential use-case for these hooks. For example technically the page_pool at the moment does support non-0 order allocations, but most drivers simply set the order to 0 and use the page pool only for PAGE_SIZE allocations. An alternative design could be to use this support in the page pool, but that requires every driver to adopt this rather than a core networking change that can apply transparently (to a large extent) to all page_pool drivers.
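As a rough illustration of the hugepage idea described above, the following userspace sketch carves fixed-size chunks out of one large region. It is not Jakub's implementation: `struct huge_provider`, the `huge_*` functions, `CHUNK_SIZE` (standing in for PAGE_SIZE) and `REGION_SIZE` (standing in for one 2 MiB hugepage) are all hypothetical, and `aligned_alloc()` stands in for a real hugepage allocation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define CHUNK_SIZE 4096u                /* stands in for PAGE_SIZE */
#define REGION_SIZE (2u * 1024 * 1024)  /* stands in for one 2 MiB hugepage */
#define CHUNKS (REGION_SIZE / CHUNK_SIZE)

/* Hypothetical provider state: one big region plus a free-chunk stack. */
struct huge_provider {
	unsigned char *region;
	size_t free_idx[CHUNKS];
	size_t nr_free;
};

/* Allocate the region once and mark every chunk as free. */
static int huge_init(struct huge_provider *hp)
{
	hp->region = aligned_alloc(CHUNK_SIZE, REGION_SIZE);
	if (!hp->region)
		return -1;
	for (size_t i = 0; i < CHUNKS; i++)
		hp->free_idx[i] = CHUNKS - 1 - i; /* pop lowest index first */
	hp->nr_free = CHUNKS;
	return 0;
}

/* Hand out one CHUNK_SIZE piece of the region, or NULL when exhausted. */
static void *huge_alloc_chunk(struct huge_provider *hp)
{
	if (hp->nr_free == 0)
		return NULL;
	return hp->region + hp->free_idx[--hp->nr_free] * CHUNK_SIZE;
}

/* Take a chunk back so that it can be handed out again. */
static void huge_release_chunk(struct huge_provider *hp, void *chunk)
{
	size_t idx = (size_t)((unsigned char *)chunk - hp->region) / CHUNK_SIZE;

	assert(idx < CHUNKS && hp->nr_free < CHUNKS);
	hp->free_idx[hp->nr_free++] = idx;
}

static void huge_destroy(struct huge_provider *hp)
{
	free(hp->region);
	hp->region = NULL;
}
```

The consumer only ever sees `CHUNK_SIZE` buffers, which mirrors the claim that neither the page_pool nor the driver needs to know a larger region sits behind them.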
On Fri, 22 Mar 2024 10:40:26 -0700 Mina Almasry wrote:
Other designs for this hugepage use case are possible, I'm just describing Jakub's idea for it as a potential use-case for these hooks.
I made it ops because I had 4 different implementations with different recycling algorithms. I think it's a fairly reasonable piece of code.
On Fri, Mar 22, 2024 at 04:19:44PM -0700, Jakub Kicinski wrote:
On Fri, 22 Mar 2024 10:40:26 -0700 Mina Almasry wrote:
Other designs for this hugepage use case are possible, I'm just describing Jakub's idea for it as a potential use-case for these hooks.
I made it ops because I had 4 different implementations with different recycling algorithms. I think it's a fairly reasonable piece of code.
Assuming we need 4 different implementations. And I strongly question that.
On Fri, Mar 22, 2024 at 10:40:26AM -0700, Mina Almasry wrote:
Hi Christoph,
Sorry for the late reply, I've been out for a few days.
On Mon, Mar 18, 2024 at 4:22 PM Christoph Hellwig hch@infradead.org wrote:
On Sun, Mar 17, 2024 at 07:49:43PM -0700, David Wei wrote:
I'm working on a similar proposal for zero copy Rx but to host memory and depend on this memory provider API.
How do you need a different provider for that vs just udmabuf?
This was discussed on the io_uring ZC RFC in one of the earliest RFCs. Here is a link to Pavel's response:
https://patchwork.kernel.org/project/netdevbpf/patch/20231106024413.2801438-...
Undesirable is not a good argument. We need one proper API that different subsystems share for this use case (this is the same feedback I gave Keith for the similar block proposal btw, not picking on the net folks here).
If dmabuf/udmabuf doesn't work for that we need to enhance or replace it, but not come up with little subsystem specific side channels.
On Sun, Mar 17, 2024 at 7:03 PM Christoph Hellwig hch@infradead.org wrote:
On Mon, Mar 04, 2024 at 06:01:37PM -0800, Mina Almasry wrote:
From: Jakub Kicinski kuba@kernel.org
The page providers which try to reuse the same pages will need to hold onto the ref, even if page gets released from the pool - as in releasing the page from the pp just transfers the "ownership" reference from pp to the provider, and provider will wait for other references to be gone before feeding this page back into the pool.
The word hook always rings a giant warning bell for me, and looking into this series I am concerned indeed.
The only provider provided here is the dma-buf one, and that basically is the only sensible one for the documented design.
Sorry I don't mean to argue but as David mentioned, there are some plans in the works and ones not in the works to extend this to other memory types. David mentioned io_uring & Jakub's huge page use cases which may want to re-use this design. I have an additional one in mind, which is extending devmem TCP for storage devices. Currently storage devices do not support dmabuf and my understanding is that it's very hard to do so, and NVMe uses pci_p2pdma instead. I wonder if it's possible to extend devmem TCP in the future to support pci_p2pdma to support nvme devices in the future.
Additionally I've been thinking about a use case of limiting the amount of memory the net stack can use. Currently the page pool is free to allocate as much memory as it wants from the buddy allocator. This may be undesirable in very low memory setups such as overcommitted VMs. We can imagine a memory provider that allows allocation only if the page_pool is below a certain limit. We can also imagine a memory provider that preallocates memory and only uses that pinned pool. None of these are in the works at the moment, but are examples of how this can be (reasonably?) extended.
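The limit-based provider imagined above could look roughly like the following userspace sketch. Nothing like this exists in the series: `struct capped_provider`, `capped_alloc()` and `capped_release()` are hypothetical names, and `malloc()` stands in for the buddy allocator.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical provider that refuses to exceed a fixed page budget. */
struct capped_provider {
	size_t limit;       /* maximum pages allowed in flight */
	size_t outstanding; /* pages currently handed out */
};

/* Allocate one page unless the budget is already used up. */
static void *capped_alloc(struct capped_provider *cp, size_t page_size)
{
	void *page;

	if (cp->outstanding >= cp->limit)
		return NULL; /* over budget: caller must cope with failure */
	page = malloc(page_size);
	if (page)
		cp->outstanding++;
	return page;
}

/* Free a page and give its slot back to the budget. */
static void capped_release(struct capped_provider *cp, void *page)
{
	assert(cp->outstanding > 0);
	free(page);
	cp->outstanding--;
}
```

With `limit = 2`, a third allocation fails until one of the first two pages is released, which is the "allocate only while below a certain limit" behaviour described in the paragraph above.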
So instead of adding hooks that random proprietary crap can hook into,
To be completely honest I'm unsure how to design hooks for proprietary code to hook into. I think that would be done on the basis of EXPORT_SYMBOL? We do not export these hooks, nor plan to at the moment.
why not hard code the dma buf provide and just use a flag? That'll also avoid expensive indirect calls.
Thankfully the indirect calls do not seem to be an issue. We've been able to hit 95% line rate with devmem TCP and I think the remaining 5% are a bottleneck unrelated to the indirect calls. Page_pool benchmarks show a very minor degradation in the fast path, so small it may be just noise in the measurement (may!):
https://lore.kernel.org/netdev/20240305020153.2787423-1-almasrymina@google.c...
This is because the code path that does indirect allocations is the slow path. The page_pool recycles netmem aggressively.
On Fri, Mar 22, 2024 at 10:54:54AM -0700, Mina Almasry wrote:
Sorry I don't mean to argue but as David mentioned, there are some plans in the works and ones not in the works to extend this to other memory types. David mentioned io_uring & Jakub's huge page use cases which may want to re-use this design. I have an additional one in mind, which is extending devmem TCP for storage devices. Currently storage devices do not support dmabuf and my understanding is that it's very hard to do so, and NVMe uses pci_p2pdma instead. I wonder if it's possible to extend devmem TCP in the future to support pci_p2pdma to support nvme devices in the future.
The block layer needs to support dmabuf for this kind of I/O. Any special netdev to block side channel will be NAKed before you can even send it out.
On Sun, Mar 24, 2024 at 4:37 PM Christoph Hellwig hch@infradead.org wrote:
On Fri, Mar 22, 2024 at 10:54:54AM -0700, Mina Almasry wrote:
Sorry I don't mean to argue but as David mentioned, there are some plans in the works and ones not in the works to extend this to other memory types. David mentioned io_uring & Jakub's huge page use cases which may want to re-use this design. I have an additional one in mind, which is extending devmem TCP for storage devices. Currently storage devices do not support dmabuf and my understanding is that it's very hard to do so, and NVMe uses pci_p2pdma instead. I wonder if it's possible to extend devmem TCP in the future to support pci_p2pdma to support nvme devices in the future.
The block layer needs to support dmabuf for this kind of I/O. Any special netdev to block side channel will be NAKed before you can even send it out.
Thanks, a few questions if you have time to help me understand the potential of extending this to storage devices.
Are you envisioning that dmabuf support would be added to the block layer (which I understand is part of the VFS and not driver specific), or as part of the specific storage driver (like nvme for example)? If we can add dmabuf support to the block layer itself that sounds awesome. We may then be able to do devmem TCP on all/most storage devices without having to modify each individual driver.
In your estimation, is adding dmabuf support to the block layer something technically feasible & acceptable upstream? I notice you suggested it so I'm guessing yes to both, but I thought I'd confirm.
Worthy of note this is all pertaining to potential follow up use cases, nothing in this particular proposal is trying to do any of this yet.
On Tue, Mar 26, 2024 at 01:19:20PM -0700, Mina Almasry wrote:
Are you envisioning that dmabuf support would be added to the block layer
Yes.
(which I understand is part of the VFS and not driver specific),
The block layer isn't really the VFS, it's just another core stack like the network stack.
or as part of the specific storage driver (like nvme for example)? If we can add dmabuf support to the block layer itself that sounds awesome. We may then be able to do devmem TCP on all/most storage devices without having to modify each individual driver.
I suspect we'll still need to touch the drivers to understand it, but hopefully all the main infrastructure can live in the block layer.
In your estimation, is adding dmabuf support to the block layer something technically feasible & acceptable upstream? I notice you suggested it so I'm guessing yes to both, but I thought I'd confirm.
I think so, and I know there has been quite some interest to at least pre-register userspace memory so that the iommu overhead can be pre-loaded. It also is a much better interface for Peer to Peer transfers than what we currently have.
On Thu, Mar 28, 2024 at 12:31 AM Christoph Hellwig hch@infradead.org wrote:
I think this is positively thrilling news for me. I was worried that adding devmemTCP support to storage devices would involve using a non-dmabuf standard of buffer sharing like pci_p2pdma_ (drivers/pci/p2pdma.c) and that would require messy changes to pci_p2pdma_ that would get nacked. Also it would require adding pci_p2pdma_ support to devmem TCP, which is a can of worms. If adding dma-buf support to storage devices is feasible and desirable, that's a much better approach IMO. (a) it will maybe work with devmem TCP without any changes needed on the netdev side of things and (b) dma-buf support may be generically useful and a good contribution even outside of devmem TCP.
I don't have a concrete user for devmem TCP for storage devices but the use case is very similar to GPU and I imagine the benefits in perf can be significant in some setups.
Christoph, if you have any hints or rough specific design in mind for how dma-buf support can be added to the block layer, please do let us know and we'll follow your hints to investigate. But I don't want to use up too much of your time. Marc and I can definitely read enough code to figure out how to do it ourselves :-)
Marc, please review and consider this thread and work, this could be a good project for you and me. I imagine the work would be:
1. Investigate how to add dma-buf support to the block layer (maybe write prototype code, and maybe even test it with devmem TCP).
2. Share a code or no-code proposal with the netdev/fs/block layer mailing lists and try to work through concerns/nacks.
3. Finally, share an RFC and carry it through merging, etc.
-- Thanks, Mina
On Mon, Apr 01, 2024 at 12:22:24PM -0700, Mina Almasry wrote:
Thanks for copying me on this. This sounds really great.
Also, P2PDMA requires the PCI root complex to support this kind of direct transfer, and IIUC dmabuf has no such hardware dependency.
I think this is positively thrilling news for me. I was worried that adding devmemTCP support to storage devices would involve using a non-dmabuf standard of buffer sharing like pci_p2pdma_ (drivers/pci/p2pdma.c) and that would require messy changes to pci_p2pdma_ that would get nacked. Also it would require adding pci_p2pdma_ support to devmem TCP, which is a can of worms. If adding dma-buf support to storage devices is feasible and desirable, that's a much better approach IMO. (a) it will maybe work with devmem TCP without any changes needed on the netdev side of things and (b) dma-buf support may be generically useful and a good contribution even outside of devmem TCP.
I think the major difference is its interface, which exposes an mmap memory region instead of an fd: https://lwn.net/Articles/906092/.
I don't have a concrete user for devmem TCP for storage devices but the use case is very similar to GPU and I imagine the benefits in perf can be significant in some setups.
We have storage use cases at ByteDance: we use NVMe SSDs to cache videos transferred over the network, so moving data directly from SSD to NIC would help a lot.
Thanks!
The check is duplicated in 2 places, factor it out into a common helper.
Signed-off-by: Mina Almasry almasrymina@google.com
---
 net/core/page_pool.c | 9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/net/core/page_pool.c b/net/core/page_pool.c
index 8776fcad064a..fe9de4ecce94 100644
--- a/net/core/page_pool.c
+++ b/net/core/page_pool.c
@@ -684,6 +684,11 @@ static bool page_pool_recycle_in_cache(struct page *page,
 	return true;
 }

+static bool __page_pool_page_can_be_recycled(const struct page *page)
+{
+	return page_ref_count(page) == 1 && !page_is_pfmemalloc(page);
+}
+
 /* If the page refcnt == 1, this will try to recycle the page.
  * if PP_FLAG_DMA_SYNC_DEV is set, we'll try to sync the DMA area for
  * the configured size min(dma_sync_size, pool->max_len).
@@ -705,7 +710,7 @@ __page_pool_put_page(struct page_pool *pool, struct page *page,
 	 * page is NOT reusable when allocated when system is under
 	 * some pressure. (page_is_pfmemalloc)
 	 */
-	if (likely(page_ref_count(page) == 1 && !page_is_pfmemalloc(page))) {
+	if (likely(__page_pool_page_can_be_recycled(page))) {
 		/* Read barrier done in page_ref_count / READ_ONCE */

 		if (pool->p.flags & PP_FLAG_DMA_SYNC_DEV)
@@ -820,7 +825,7 @@ static struct page *page_pool_drain_frag(struct page_pool *pool,
 	if (likely(page_pool_unref_page(page, drain_count)))
 		return NULL;

-	if (page_ref_count(page) == 1 && !page_is_pfmemalloc(page)) {
+	if (__page_pool_page_can_be_recycled(page)) {
 		if (pool->p.flags & PP_FLAG_DMA_SYNC_DEV)
 			page_pool_dma_sync_for_device(pool, page, -1);
On 2024/3/5 10:01, Mina Almasry wrote:
The check is duplicated in 2 places, factor it out into a common helper.
Reviewed-by: Yunsheng Lin linyunsheng@huawei.com
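For readers following along outside a kernel tree, the factored-out check can be modelled in plain userspace C. This is only a sketch: `struct mock_page` and `mock_page_can_be_recycled()` are hypothetical stand-ins for `struct page` and the helper in the patch, and the two fields merely model what `page_ref_count()` and `page_is_pfmemalloc()` report.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct page: only what the check reads. */
struct mock_page {
	int refcount;    /* models page_ref_count(page) */
	bool pfmemalloc; /* models page_is_pfmemalloc(page) */
};

/*
 * A page may go back into the pool only if the pool holds the sole
 * reference and the page was not allocated from emergency reserves.
 */
bool mock_page_can_be_recycled(const struct mock_page *page)
{
	assert(page != NULL);
	return page->refcount == 1 && !page->pfmemalloc;
}
```

Both call sites in the patch then reduce to a single call to the predicate, which is the whole point of the refactor.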
API takes the dma-buf fd as input, and binds it to the netdevice. The user can specify the rx queues to bind the dma-buf to.
Suggested-by: Stanislav Fomichev sdf@google.com Signed-off-by: Mina Almasry almasrymina@google.com
---
Changes in v1:
- Add rx-queue-type to distinguish rx from tx (Jakub)
- Return dma-buf ID from netlink API (David, Stan)

Changes in RFC-v3:
- Support binding multiple rx-queues
---
 Documentation/netlink/specs/netdev.yaml | 52 +++++++++++++++++++++++++
 include/uapi/linux/netdev.h             | 19 +++++++++
 net/core/netdev-genl-gen.c              | 19 +++++++++
 net/core/netdev-genl-gen.h              |  2 +
 net/core/netdev-genl.c                  |  6 +++
 tools/include/uapi/linux/netdev.h       | 19 +++++++++
 6 files changed, 117 insertions(+)
diff --git a/Documentation/netlink/specs/netdev.yaml b/Documentation/netlink/specs/netdev.yaml index 3addac970680..6f235fd7b14d 100644 --- a/Documentation/netlink/specs/netdev.yaml +++ b/Documentation/netlink/specs/netdev.yaml @@ -264,6 +264,45 @@ attribute-sets: name: napi-id doc: ID of the NAPI instance which services this queue. type: u32 + - + name: queue-dmabuf + attributes: + - + name: type + doc: rx or tx queue + type: u8 + enum: queue-type + - + name: idx + doc: queue index + type: u32 + + - + name: bind-dmabuf + attributes: + - + name: ifindex + doc: netdev ifindex to bind the dma-buf to. + type: u32 + checks: + min: 1 + - + name: queues + doc: receive queues to bind the dma-buf to. + type: nest + nested-attributes: queue-dmabuf + multi-attr: true + - + name: dmabuf-fd + doc: dmabuf file descriptor to bind. + type: u32 + - + name: dmabuf-id + doc: id of the dmabuf binding + type: u32 + checks: + min: 1 +
operations: list: @@ -386,6 +425,19 @@ operations: attributes: - ifindex reply: *queue-get-op + - + name: bind-rx + doc: Bind dmabuf to netdev + attribute-set: bind-dmabuf + do: + request: + attributes: + - ifindex + - dmabuf-fd + - queues + reply: + attributes: + - dmabuf-id - name: napi-get doc: Get information about NAPI instances configured on the system. diff --git a/include/uapi/linux/netdev.h b/include/uapi/linux/netdev.h index 93cb411adf72..6a5cd4af68c8 100644 --- a/include/uapi/linux/netdev.h +++ b/include/uapi/linux/netdev.h @@ -132,6 +132,24 @@ enum { NETDEV_A_QUEUE_MAX = (__NETDEV_A_QUEUE_MAX - 1) };
+enum { + NETDEV_A_QUEUE_DMABUF_TYPE = 1, + NETDEV_A_QUEUE_DMABUF_IDX, + + __NETDEV_A_QUEUE_DMABUF_MAX, + NETDEV_A_QUEUE_DMABUF_MAX = (__NETDEV_A_QUEUE_DMABUF_MAX - 1) +}; + +enum { + NETDEV_A_BIND_DMABUF_IFINDEX = 1, + NETDEV_A_BIND_DMABUF_QUEUES, + NETDEV_A_BIND_DMABUF_DMABUF_FD, + NETDEV_A_BIND_DMABUF_DMABUF_ID, + + __NETDEV_A_BIND_DMABUF_MAX, + NETDEV_A_BIND_DMABUF_MAX = (__NETDEV_A_BIND_DMABUF_MAX - 1) +}; + enum { NETDEV_CMD_DEV_GET = 1, NETDEV_CMD_DEV_ADD_NTF, @@ -143,6 +161,7 @@ enum { NETDEV_CMD_PAGE_POOL_CHANGE_NTF, NETDEV_CMD_PAGE_POOL_STATS_GET, NETDEV_CMD_QUEUE_GET, + NETDEV_CMD_BIND_RX, NETDEV_CMD_NAPI_GET,
__NETDEV_CMD_MAX, diff --git a/net/core/netdev-genl-gen.c b/net/core/netdev-genl-gen.c index be7f2ebd61b2..3384b1ae3f40 100644 --- a/net/core/netdev-genl-gen.c +++ b/net/core/netdev-genl-gen.c @@ -27,6 +27,11 @@ const struct nla_policy netdev_page_pool_info_nl_policy[NETDEV_A_PAGE_POOL_IFIND [NETDEV_A_PAGE_POOL_IFINDEX] = NLA_POLICY_FULL_RANGE(NLA_U32, &netdev_a_page_pool_ifindex_range), };
+const struct nla_policy netdev_queue_dmabuf_nl_policy[NETDEV_A_QUEUE_DMABUF_IDX + 1] = { + [NETDEV_A_QUEUE_DMABUF_TYPE] = NLA_POLICY_MAX(NLA_U8, 1), + [NETDEV_A_QUEUE_DMABUF_IDX] = { .type = NLA_U32, }, +}; + /* NETDEV_CMD_DEV_GET - do */ static const struct nla_policy netdev_dev_get_nl_policy[NETDEV_A_DEV_IFINDEX + 1] = { [NETDEV_A_DEV_IFINDEX] = NLA_POLICY_MIN(NLA_U32, 1), @@ -58,6 +63,13 @@ static const struct nla_policy netdev_queue_get_dump_nl_policy[NETDEV_A_QUEUE_IF [NETDEV_A_QUEUE_IFINDEX] = NLA_POLICY_MIN(NLA_U32, 1), };
+/* NETDEV_CMD_BIND_RX - do */ +static const struct nla_policy netdev_bind_rx_nl_policy[NETDEV_A_BIND_DMABUF_DMABUF_FD + 1] = { + [NETDEV_A_BIND_DMABUF_IFINDEX] = NLA_POLICY_MIN(NLA_U32, 1), + [NETDEV_A_BIND_DMABUF_DMABUF_FD] = { .type = NLA_U32, }, + [NETDEV_A_BIND_DMABUF_QUEUES] = NLA_POLICY_NESTED(netdev_queue_dmabuf_nl_policy), +}; + /* NETDEV_CMD_NAPI_GET - do */ static const struct nla_policy netdev_napi_get_do_nl_policy[NETDEV_A_NAPI_ID + 1] = { [NETDEV_A_NAPI_ID] = { .type = NLA_U32, }, @@ -124,6 +136,13 @@ static const struct genl_split_ops netdev_nl_ops[] = { .maxattr = NETDEV_A_QUEUE_IFINDEX, .flags = GENL_CMD_CAP_DUMP, }, + { + .cmd = NETDEV_CMD_BIND_RX, + .doit = netdev_nl_bind_rx_doit, + .policy = netdev_bind_rx_nl_policy, + .maxattr = NETDEV_A_BIND_DMABUF_DMABUF_FD, + .flags = GENL_CMD_CAP_DO, + }, { .cmd = NETDEV_CMD_NAPI_GET, .doit = netdev_nl_napi_get_doit, diff --git a/net/core/netdev-genl-gen.h b/net/core/netdev-genl-gen.h index a47f2bcbe4fa..a7ede514eccd 100644 --- a/net/core/netdev-genl-gen.h +++ b/net/core/netdev-genl-gen.h @@ -13,6 +13,7 @@
/* Common nested types */ extern const struct nla_policy netdev_page_pool_info_nl_policy[NETDEV_A_PAGE_POOL_IFINDEX + 1]; +extern const struct nla_policy netdev_queue_dmabuf_nl_policy[NETDEV_A_QUEUE_DMABUF_IDX + 1];
int netdev_nl_dev_get_doit(struct sk_buff *skb, struct genl_info *info); int netdev_nl_dev_get_dumpit(struct sk_buff *skb, struct netlink_callback *cb); @@ -26,6 +27,7 @@ int netdev_nl_page_pool_stats_get_dumpit(struct sk_buff *skb, int netdev_nl_queue_get_doit(struct sk_buff *skb, struct genl_info *info); int netdev_nl_queue_get_dumpit(struct sk_buff *skb, struct netlink_callback *cb); +int netdev_nl_bind_rx_doit(struct sk_buff *skb, struct genl_info *info); int netdev_nl_napi_get_doit(struct sk_buff *skb, struct genl_info *info); int netdev_nl_napi_get_dumpit(struct sk_buff *skb, struct netlink_callback *cb);
diff --git a/net/core/netdev-genl.c b/net/core/netdev-genl.c index fd98936da3ae..0ed292d87ae0 100644 --- a/net/core/netdev-genl.c +++ b/net/core/netdev-genl.c @@ -469,6 +469,12 @@ int netdev_nl_queue_get_dumpit(struct sk_buff *skb, struct netlink_callback *cb) return skb->len; }
+/* Stub */ +int netdev_nl_bind_rx_doit(struct sk_buff *skb, struct genl_info *info) +{ + return 0; +} + static int netdev_genl_netdevice_event(struct notifier_block *nb, unsigned long event, void *ptr) { diff --git a/tools/include/uapi/linux/netdev.h b/tools/include/uapi/linux/netdev.h index 93cb411adf72..6a5cd4af68c8 100644 --- a/tools/include/uapi/linux/netdev.h +++ b/tools/include/uapi/linux/netdev.h @@ -132,6 +132,24 @@ enum { NETDEV_A_QUEUE_MAX = (__NETDEV_A_QUEUE_MAX - 1) };
+enum { + NETDEV_A_QUEUE_DMABUF_TYPE = 1, + NETDEV_A_QUEUE_DMABUF_IDX, + + __NETDEV_A_QUEUE_DMABUF_MAX, + NETDEV_A_QUEUE_DMABUF_MAX = (__NETDEV_A_QUEUE_DMABUF_MAX - 1) +}; + +enum { + NETDEV_A_BIND_DMABUF_IFINDEX = 1, + NETDEV_A_BIND_DMABUF_QUEUES, + NETDEV_A_BIND_DMABUF_DMABUF_FD, + NETDEV_A_BIND_DMABUF_DMABUF_ID, + + __NETDEV_A_BIND_DMABUF_MAX, + NETDEV_A_BIND_DMABUF_MAX = (__NETDEV_A_BIND_DMABUF_MAX - 1) +}; + enum { NETDEV_CMD_DEV_GET = 1, NETDEV_CMD_DEV_ADD_NTF, @@ -143,6 +161,7 @@ enum { NETDEV_CMD_PAGE_POOL_CHANGE_NTF, NETDEV_CMD_PAGE_POOL_STATS_GET, NETDEV_CMD_QUEUE_GET, + NETDEV_CMD_BIND_RX, NETDEV_CMD_NAPI_GET,
__NETDEV_CMD_MAX,
Add a netdev_dmabuf_binding struct which represents the dma-buf-to-netdevice binding. The netlink API will bind the dma-buf to rx queues on the netdevice. At bind time, dma_buf_attach() and dma_buf_map_attachment() are called. The entries in the sg_table returned by the mapping are inserted into a genpool to make them ready for allocation.
The chunks in the genpool are owned by a dmabuf_chunk_owner struct which holds the dma-buf offset of the base of the chunk and the dma_addr of the chunk. Both are needed to use allocations that come from this chunk.
We create a new type that represents an allocation from the genpool: net_iov. We set the net_iov allocation size in the genpool to PAGE_SIZE for simplicity, matching the PAGE_SIZE normally allocated by the page pool and given to the drivers.
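To illustrate why the chunk owner carries both the dma-buf offset and the dma_addr, here is a userspace sketch of how a PAGE_SIZE-sized net_iov could be translated back to either value from its position in the owner's array. All `sketch_*` names and the fixed 4096-byte page size are assumptions for illustration; this patch itself does not add such helpers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096UL /* assumed page size for the sketch */

struct sketch_chunk_owner;

/* Models struct net_iov: it only knows which chunk it belongs to. */
struct sketch_net_iov {
	struct sketch_chunk_owner *owner;
};

/* Models struct dmabuf_genpool_chunk_owner from the patch. */
struct sketch_chunk_owner {
	unsigned long base_virtual; /* dma-buf offset of the chunk start */
	uint64_t base_dma_addr;     /* dma address of the chunk start */
	struct sketch_net_iov *niovs;
	size_t num_niovs;
};

/* Position of a net_iov inside its owner's array. */
size_t sketch_niov_index(const struct sketch_net_iov *niov)
{
	assert(niov->owner != NULL && niov >= niov->owner->niovs);
	return (size_t)(niov - niov->owner->niovs);
}

/* dma address the device would use for this net_iov. */
uint64_t sketch_niov_dma_addr(const struct sketch_net_iov *niov)
{
	return niov->owner->base_dma_addr +
	       (uint64_t)sketch_niov_index(niov) * SKETCH_PAGE_SIZE;
}

/* Offset into the dma-buf that userspace would see for this net_iov. */
unsigned long sketch_niov_dmabuf_offset(const struct sketch_net_iov *niov)
{
	return niov->owner->base_virtual +
	       sketch_niov_index(niov) * SKETCH_PAGE_SIZE;
}
```

Keeping the per-chunk bases in the owner means each net_iov needs only a single back-pointer, which is what the struct in this patch contains.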
The user can unbind the dmabuf from the netdevice by closing the netlink socket that established the binding. We do this so that the binding is automatically unbound even if the userspace process crashes.
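The cleanup-on-close behaviour can be pictured as a walk over the list of live bindings, keyed by the netlink port id of the socket that created them. The sketch below is a userspace model with hypothetical names (`sketch_binding_node`, `sketch_release_portid`); unlike the notifier in this patch, which stops at the first match, it releases every binding owned by the closing port.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of a binding tracked on a singly linked list. */
struct sketch_binding_node {
	uint32_t owner_portid; /* netlink port id that created the binding */
	int bound;             /* 1 while the binding is active */
	struct sketch_binding_node *next;
};

/*
 * Unlink and deactivate every binding owned by @portid.
 * Returns the number of bindings released.
 */
int sketch_release_portid(struct sketch_binding_node **head, uint32_t portid)
{
	int released = 0;

	while (*head) {
		struct sketch_binding_node *cur = *head;

		if (cur->owner_portid == portid) {
			*head = cur->next; /* unlink before touching cur */
			cur->next = NULL;
			cur->bound = 0;
			released++;
		} else {
			head = &cur->next;
		}
	}

	return released;
}
```

Because the release is driven by the socket going away rather than by an explicit request, a crashed process is handled the same way as a clean exit.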
Binding and unbinding leave an indicator in struct netdev_rx_queue that the given queue is bound, but the binding doesn't take effect until the driver actually reconfigures its queues and re-initializes its page pool.
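The queue reconfiguration step follows an allocate-first, roll-back-on-failure pattern. Below is a userspace sketch of that ordering, assuming a driver that exposes four callbacks; `struct sketch_queue_ops` and the `toy_*` driver are hypothetical and exist only to exercise the control flow, they are not the queue-API ndos themselves.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Hypothetical callback table modelling the four queue operations. */
struct sketch_queue_ops {
	void *(*mem_alloc)(int idx);
	void (*mem_free)(void *mem);
	int (*stop)(int idx, void **out_old_mem);
	int (*start)(int idx, void *mem);
};

/* Swap a queue onto freshly allocated memory; restore the old on failure. */
int sketch_restart_rx_queue(const struct sketch_queue_ops *ops, int idx)
{
	void *new_mem, *old_mem = NULL;
	int err;

	if (!ops || !ops->mem_alloc || !ops->mem_free || !ops->stop ||
	    !ops->start)
		return -EOPNOTSUPP;

	/* Allocate before stopping so an ENOMEM never disturbs traffic. */
	new_mem = ops->mem_alloc(idx);
	if (!new_mem)
		return -ENOMEM;

	err = ops->stop(idx, &old_mem);
	if (err)
		goto err_free_new_mem;

	err = ops->start(idx, new_mem);
	if (err)
		goto err_restart_old;

	ops->mem_free(old_mem);
	return 0;

err_restart_old:
	ops->start(idx, old_mem); /* best effort: bring the old queue back */
err_free_new_mem:
	ops->mem_free(new_mem);
	return err;
}

/* Toy driver used only to exercise the sketch. */
struct toy_driver {
	void *live_mem;      /* memory the "queue" currently runs on */
	int fail_start_once; /* make the next start() fail with -EIO */
	int freed;           /* number of mem_free() calls seen */
};
struct toy_driver toy;

void *toy_mem_alloc(int idx) { (void)idx; return malloc(1); }
void toy_mem_free(void *mem) { toy.freed++; free(mem); }
int toy_stop(int idx, void **out_old_mem)
{
	(void)idx;
	*out_old_mem = toy.live_mem;
	toy.live_mem = NULL;
	return 0;
}
int toy_start(int idx, void *mem)
{
	(void)idx;
	if (toy.fail_start_once) {
		toy.fail_start_once = 0;
		return -EIO;
	}
	toy.live_mem = mem;
	return 0;
}
const struct sketch_queue_ops toy_ops = {
	toy_mem_alloc, toy_mem_free, toy_stop, toy_start,
};
```

The design choice worth noting is that the old memory is only freed after the new queue has started successfully, so every failure path leaves the queue running on memory it already owned.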
The netdev_dmabuf_binding struct is refcounted, and releases its resources only when all the refs are released.
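As a rough model of that lifetime rule, the sketch below uses C11 atomics in place of the kernel's refcount_t. The `sketch_binding` type and its `released` flag are hypothetical; in the patch the final put calls __netdev_dmabuf_binding_free(), which unmaps and detaches the dma-buf.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical model of the refcounted binding. */
struct sketch_binding {
	atomic_int ref;
	bool released; /* stands in for unmapping and freeing the dma-buf */
};

/* The creator (the netlink user in the patch) holds the first reference. */
void sketch_binding_init(struct sketch_binding *b)
{
	atomic_init(&b->ref, 1);
	b->released = false;
}

/* Taken by each page pool and each allocated net_iov using the binding. */
void sketch_binding_get(struct sketch_binding *b)
{
	int old = atomic_fetch_add(&b->ref, 1);

	assert(old > 0); /* taking a ref on a dead binding is a bug */
	(void)old;
}

/* Resources are torn down only when the last reference is dropped. */
void sketch_binding_put(struct sketch_binding *b)
{
	if (atomic_fetch_sub(&b->ref, 1) != 1)
		return;

	b->released = true;
}
```

With this scheme the user can drop its reference at unbind time while in-flight net_iovs keep the mapping alive until they are returned.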
Signed-off-by: Willem de Bruijn willemb@google.com Signed-off-by: Kaiyuan Zhang kaiyuanz@google.com Signed-off-by: Mina Almasry almasrymina@google.com
---
RFC v6:
- Validate rx queue index
- Refactor new functions into devmem.c (Pavel)

RFC v5:
- Renamed page_pool_iov to net_iov, and moved that support to devmem.h or netmem.h.

v1:
- Introduce devmem.h instead of bloating netdevice.h (Jakub)
- ENOTSUPP -> EOPNOTSUPP (checkpatch.pl I think)
- Remove unneeded rcu protection for binding->list (rtnl protected)
- Removed extraneous err_binding_put: label.
- Removed dma_addr += len (Paolo).
- Don't override err on netdev_bind_dmabuf_to_queue failure.
- Rename devmem -> dmabuf (David).
- Add id to dmabuf binding (David/Stan).
- Fix missing xa_destroy bound_rq_list.
- Use queue api to reset bound RX queues (Jakub).
- Update netlink API for rx-queue type (tx/rx) (Jakub).

RFC v3:
- Support multi rx-queue binding
---
 include/net/devmem.h          | 115 +++++++++++++
 include/net/netdev_rx_queue.h |   1 +
 include/net/netmem.h          |  10 ++
 net/core/Makefile             |   2 +-
 net/core/dev.c                |   3 +
 net/core/devmem.c             | 293 ++++++++++++++++++++++++++++++++++
 net/core/netdev-genl.c        | 121 +++++++++++++-
 7 files changed, 542 insertions(+), 3 deletions(-)
 create mode 100644 include/net/devmem.h
 create mode 100644 net/core/devmem.c
diff --git a/include/net/devmem.h b/include/net/devmem.h
new file mode 100644
index 000000000000..85ccbbe84c65
--- /dev/null
+++ b/include/net/devmem.h
@@ -0,0 +1,115 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+/*
+ * Device memory TCP support
+ *
+ * Authors:	Mina Almasry almasrymina@google.com
+ *		Willem de Bruijn willemb@google.com
+ *		Kaiyuan Zhang kaiyuanz@google.com
+ *
+ */
+#ifndef _NET_DEVMEM_H
+#define _NET_DEVMEM_H
+
+struct netdev_dmabuf_binding {
+	struct dma_buf *dmabuf;
+	struct dma_buf_attachment *attachment;
+	struct sg_table *sgt;
+	struct net_device *dev;
+	struct gen_pool *chunk_pool;
+
+	/* The user holds a ref (via the netlink API) for as long as they want
+	 * the binding to remain alive. Each page pool using this binding holds
+	 * a ref to keep the binding alive. Each allocated net_iov holds a
+	 * ref.
+	 *
+	 * The binding undos itself and unmaps the underlying dmabuf once all
+	 * those refs are dropped and the binding is no longer desired or in
+	 * use.
+	 */
+	refcount_t ref;
+
+	/* The portid of the user that owns this binding. Used for netlink to
+	 * notify us of the user dropping the bind.
+	 */
+	u32 owner_nlportid;
+
+	/* The list of bindings currently active. Used for netlink to notify us
+	 * of the user dropping the bind.
+	 */
+	struct list_head list;
+
+	/* rxq's this binding is active on. */
+	struct xarray bound_rxq_list;
+
+	/* ID of this binding. Globally unique to all bindings currently
+	 * active.
+	 */
+	u32 id;
+};
+
+/* Owner of the dma-buf chunks inserted into the gen pool. Each scatterlist
+ * entry from the dmabuf is inserted into the genpool as a chunk, and needs
+ * this owner struct to keep track of some metadata necessary to create
+ * allocations from this chunk.
+ */
+struct dmabuf_genpool_chunk_owner {
+	/* Offset into the dma-buf where this chunk starts. */
+	unsigned long base_virtual;
+
+	/* dma_addr of the start of the chunk. */
+	dma_addr_t base_dma_addr;
+
+	/* Array of net_iovs for this chunk. */
+	struct net_iov *niovs;
+	size_t num_niovs;
+
+	struct netdev_dmabuf_binding *binding;
+};
+
+#ifdef CONFIG_DMA_SHARED_BUFFER
+void __netdev_dmabuf_binding_free(struct netdev_dmabuf_binding *binding);
+int netdev_bind_dmabuf(struct net_device *dev, unsigned int dmabuf_fd,
+		       struct netdev_dmabuf_binding **out);
+void netdev_unbind_dmabuf(struct netdev_dmabuf_binding *binding);
+int netdev_bind_dmabuf_to_queue(struct net_device *dev, u32 rxq_idx,
+				struct netdev_dmabuf_binding *binding);
+#else
+static inline void
+__netdev_dmabuf_binding_free(struct netdev_dmabuf_binding *binding)
+{
+}
+
+static inline int netdev_bind_dmabuf(struct net_device *dev,
+				     unsigned int dmabuf_fd,
+				     struct netdev_dmabuf_binding **out)
+{
+	return -EOPNOTSUPP;
+}
+static inline void netdev_unbind_dmabuf(struct netdev_dmabuf_binding *binding)
+{
+}
+
+static inline int
+netdev_bind_dmabuf_to_queue(struct net_device *dev, u32 rxq_idx,
+			    struct netdev_dmabuf_binding *binding)
+{
+	return -EOPNOTSUPP;
+}
+#endif
+
+static inline void
+netdev_dmabuf_binding_get(struct netdev_dmabuf_binding *binding)
+{
+	refcount_inc(&binding->ref);
+}
+
+static inline void
+netdev_dmabuf_binding_put(struct netdev_dmabuf_binding *binding)
+{
+	if (!refcount_dec_and_test(&binding->ref))
+		return;
+
+	__netdev_dmabuf_binding_free(binding);
+}
+
+#endif /* _NET_DEVMEM_H */
diff --git a/include/net/netdev_rx_queue.h b/include/net/netdev_rx_queue.h
index aa1716fb0e53..5dc35628633a 100644
--- a/include/net/netdev_rx_queue.h
+++ b/include/net/netdev_rx_queue.h
@@ -25,6 +25,7 @@ struct netdev_rx_queue {
 	 * Readers and writers must hold RTNL
 	 */
 	struct napi_struct *napi;
+	struct netdev_dmabuf_binding *binding;
 } ____cacheline_aligned_in_smp;

 /*
diff --git a/include/net/netmem.h b/include/net/netmem.h
index d8b810245c1d..72e932a1a948 100644
--- a/include/net/netmem.h
+++ b/include/net/netmem.h
@@ -8,6 +8,16 @@
 #ifndef _NET_NETMEM_H
 #define _NET_NETMEM_H

+#include <net/devmem.h>
+
+/* net_iov */
+
+struct net_iov {
+	struct dmabuf_genpool_chunk_owner *owner;
+};
+
+/* netmem */
+
 /**
  * typedef netmem_ref - a nonexistent type marking a reference to generic
  * network memory.
diff --git a/net/core/Makefile b/net/core/Makefile
index 821aec06abf1..592f955c1241 100644
--- a/net/core/Makefile
+++ b/net/core/Makefile
@@ -13,7 +13,7 @@ obj-y += dev.o dev_addr_lists.o dst.o netevent.o \
 			neighbour.o rtnetlink.o utils.o link_watch.o filter.o \
 			sock_diag.o dev_ioctl.o tso.o sock_reuseport.o \
 			fib_notifier.o xdp.o flow_offload.o gro.o \
-			netdev-genl.o netdev-genl-gen.o gso.o
+			netdev-genl.o netdev-genl-gen.o gso.o devmem.o

 obj-$(CONFIG_NETDEV_ADDR_LIST_TEST) += dev_addr_lists_test.o

diff --git a/net/core/dev.c b/net/core/dev.c
index fe054cbd41e9..bbea1b252529 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -155,6 +155,9 @@
 #include <net/netdev_rx_queue.h>
 #include <net/page_pool/types.h>
 #include <net/page_pool/helpers.h>
+#include <linux/genalloc.h>
+#include <linux/dma-buf.h>
+#include <net/devmem.h>
#include "dev.h" #include "net-sysfs.h" diff --git a/net/core/devmem.c b/net/core/devmem.c new file mode 100644 index 000000000000..779ad990971e --- /dev/null +++ b/net/core/devmem.c @@ -0,0 +1,293 @@ +// SPDX-License-Identifier: GPL-2.0-or-later +/* + * Devmem TCP + * + * Authors: Mina Almasry almasrymina@google.com + * Willem de Bruijn willemdebruijn.kernel@gmail.com + * Kaiyuan Zhang <kaiyuanz@google.com + */ + +#include <linux/types.h> +#include <linux/mm.h> +#include <linux/netdevice.h> +#include <trace/events/page_pool.h> +#include <net/netdev_rx_queue.h> +#include <net/page_pool/types.h> +#include <net/page_pool/helpers.h> +#include <linux/genalloc.h> +#include <linux/dma-buf.h> +#include <net/devmem.h> + +/* Device memory support */ + +#ifdef CONFIG_DMA_SHARED_BUFFER +static void netdev_dmabuf_free_chunk_owner(struct gen_pool *genpool, + struct gen_pool_chunk *chunk, + void *not_used) +{ + struct dmabuf_genpool_chunk_owner *owner = chunk->owner; + + kvfree(owner->niovs); + kfree(owner); +} + +void __netdev_dmabuf_binding_free(struct netdev_dmabuf_binding *binding) +{ + size_t size, avail; + + gen_pool_for_each_chunk(binding->chunk_pool, + netdev_dmabuf_free_chunk_owner, NULL); + + size = gen_pool_size(binding->chunk_pool); + avail = gen_pool_avail(binding->chunk_pool); + + if (!WARN(size != avail, "can't destroy genpool. 
size=%lu, avail=%lu", + size, avail)) + gen_pool_destroy(binding->chunk_pool); + + dma_buf_unmap_attachment(binding->attachment, binding->sgt, + DMA_BIDIRECTIONAL); + dma_buf_detach(binding->dmabuf, binding->attachment); + dma_buf_put(binding->dmabuf); + xa_destroy(&binding->bound_rxq_list); + kfree(binding); +} + +static int netdev_restart_rx_queue(struct net_device *dev, int rxq_idx) +{ + void *new_mem; + void *old_mem; + int err; + + if (!dev || !dev->netdev_ops) + return -EINVAL; + + if (!dev->netdev_ops->ndo_queue_stop || + !dev->netdev_ops->ndo_queue_mem_free || + !dev->netdev_ops->ndo_queue_mem_alloc || + !dev->netdev_ops->ndo_queue_start) + return -EOPNOTSUPP; + + new_mem = dev->netdev_ops->ndo_queue_mem_alloc(dev, rxq_idx); + if (!new_mem) + return -ENOMEM; + + err = dev->netdev_ops->ndo_queue_stop(dev, rxq_idx, &old_mem); + if (err) + goto err_free_new_mem; + + err = dev->netdev_ops->ndo_queue_start(dev, rxq_idx, new_mem); + if (err) + goto err_start_queue; + + dev->netdev_ops->ndo_queue_mem_free(dev, old_mem); + + return 0; + +err_start_queue: + dev->netdev_ops->ndo_queue_start(dev, rxq_idx, old_mem); + +err_free_new_mem: + dev->netdev_ops->ndo_queue_mem_free(dev, new_mem); + + return err; +} + +/* Protected by rtnl_lock() */ +static DEFINE_XARRAY_FLAGS(netdev_dmabuf_bindings, XA_FLAGS_ALLOC1); + +void netdev_unbind_dmabuf(struct netdev_dmabuf_binding *binding) +{ + struct netdev_rx_queue *rxq; + unsigned long xa_idx; + unsigned int rxq_idx; + + if (!binding) + return; + + if (binding->list.next) + list_del(&binding->list); + + xa_for_each(&binding->bound_rxq_list, xa_idx, rxq) { + if (rxq->binding == binding) { + /* We hold the rtnl_lock while binding/unbinding + * dma-buf, so we can't race with another thread that + * is also modifying this value. However, the driver + * may read this config while it's creating its + * rx-queues. WRITE_ONCE() here to match the + * READ_ONCE() in the driver. 
+ */ + WRITE_ONCE(rxq->binding, NULL); + + rxq_idx = get_netdev_rx_queue_index(rxq); + + netdev_restart_rx_queue(binding->dev, rxq_idx); + } + } + + xa_erase(&netdev_dmabuf_bindings, binding->id); + + netdev_dmabuf_binding_put(binding); +} + +int netdev_bind_dmabuf_to_queue(struct net_device *dev, u32 rxq_idx, + struct netdev_dmabuf_binding *binding) +{ + struct netdev_rx_queue *rxq; + u32 xa_idx; + int err; + + if (rxq_idx >= dev->num_rx_queues) + return -ERANGE; + + rxq = __netif_get_rx_queue(dev, rxq_idx); + + if (rxq->binding) + return -EEXIST; + + err = xa_alloc(&binding->bound_rxq_list, &xa_idx, rxq, xa_limit_32b, + GFP_KERNEL); + if (err) + return err; + + /* We hold the rtnl_lock while binding/unbinding dma-buf, so we can't + * race with another thread that is also modifying this value. However, + * the driver may read this config while it's creating its * rx-queues. + * WRITE_ONCE() here to match the READ_ONCE() in the driver. + */ + WRITE_ONCE(rxq->binding, binding); + + err = netdev_restart_rx_queue(dev, rxq_idx); + if (err) + goto err_xa_erase; + + return 0; + +err_xa_erase: + xa_erase(&binding->bound_rxq_list, xa_idx); + WRITE_ONCE(rxq->binding, NULL); + + return err; +} + +int netdev_bind_dmabuf(struct net_device *dev, unsigned int dmabuf_fd, + struct netdev_dmabuf_binding **out) +{ + struct netdev_dmabuf_binding *binding; + static u32 id_alloc_next; + struct scatterlist *sg; + struct dma_buf *dmabuf; + unsigned int sg_idx, i; + unsigned long virtual; + int err; + + if (!capable(CAP_NET_ADMIN)) + return -EPERM; + + dmabuf = dma_buf_get(dmabuf_fd); + if (IS_ERR_OR_NULL(dmabuf)) + return -EBADFD; + + binding = kzalloc_node(sizeof(*binding), GFP_KERNEL, + dev_to_node(&dev->dev)); + if (!binding) { + err = -ENOMEM; + goto err_put_dmabuf; + } + binding->dev = dev; + + err = xa_alloc_cyclic(&netdev_dmabuf_bindings, &binding->id, binding, + xa_limit_32b, &id_alloc_next, GFP_KERNEL); + if (err < 0) + goto err_free_binding; + + 
xa_init_flags(&binding->bound_rxq_list, XA_FLAGS_ALLOC); + + refcount_set(&binding->ref, 1); + + binding->dmabuf = dmabuf; + + binding->attachment = dma_buf_attach(binding->dmabuf, dev->dev.parent); + if (IS_ERR(binding->attachment)) { + err = PTR_ERR(binding->attachment); + goto err_free_id; + } + + binding->sgt = + dma_buf_map_attachment(binding->attachment, DMA_BIDIRECTIONAL); + if (IS_ERR(binding->sgt)) { + err = PTR_ERR(binding->sgt); + goto err_detach; + } + + /* For simplicity we expect to make PAGE_SIZE allocations, but the + * binding can be much more flexible than that. We may be able to + * allocate MTU sized chunks here. Leave that for future work... + */ + binding->chunk_pool = + gen_pool_create(PAGE_SHIFT, dev_to_node(&dev->dev)); + if (!binding->chunk_pool) { + err = -ENOMEM; + goto err_unmap; + } + + virtual = 0; + for_each_sgtable_dma_sg(binding->sgt, sg, sg_idx) { + dma_addr_t dma_addr = sg_dma_address(sg); + struct dmabuf_genpool_chunk_owner *owner; + size_t len = sg_dma_len(sg); + struct net_iov *niov; + + owner = kzalloc_node(sizeof(*owner), GFP_KERNEL, + dev_to_node(&dev->dev)); + owner->base_virtual = virtual; + owner->base_dma_addr = dma_addr; + owner->num_niovs = len / PAGE_SIZE; + owner->binding = binding; + + err = gen_pool_add_owner(binding->chunk_pool, dma_addr, + dma_addr, len, dev_to_node(&dev->dev), + owner); + if (err) { + err = -EINVAL; + goto err_free_chunks; + } + + owner->niovs = kvmalloc_array(owner->num_niovs, + sizeof(*owner->niovs), + GFP_KERNEL); + if (!owner->niovs) { + err = -ENOMEM; + goto err_free_chunks; + } + + for (i = 0; i < owner->num_niovs; i++) { + niov = &owner->niovs[i]; + niov->owner = owner; + } + + virtual += len; + } + + *out = binding; + + return 0; + +err_free_chunks: + gen_pool_for_each_chunk(binding->chunk_pool, + netdev_dmabuf_free_chunk_owner, NULL); + gen_pool_destroy(binding->chunk_pool); +err_unmap: + dma_buf_unmap_attachment(binding->attachment, binding->sgt, + DMA_BIDIRECTIONAL); +err_detach: + 
dma_buf_detach(dmabuf, binding->attachment); +err_free_id: + xa_erase(&netdev_dmabuf_bindings, binding->id); +err_free_binding: + kfree(binding); +err_put_dmabuf: + dma_buf_put(dmabuf); + return err; +} +#endif diff --git a/net/core/netdev-genl.c b/net/core/netdev-genl.c index 0ed292d87ae0..8f0867ae5eeb 100644 --- a/net/core/netdev-genl.c +++ b/net/core/netdev-genl.c @@ -9,6 +9,7 @@ #include <net/xdp_sock.h> #include <net/netdev_rx_queue.h> #include <net/busy_poll.h> +#include <net/devmem.h>
 #include "netdev-genl-gen.h"
 #include "dev.h"
@@ -469,10 +470,93 @@ int netdev_nl_queue_get_dumpit(struct sk_buff *skb, struct netlink_callback *cb)
 	return skb->len;
 }

-/* Stub */
+static LIST_HEAD(netdev_rbinding_list);
+
 int netdev_nl_bind_rx_doit(struct sk_buff *skb, struct genl_info *info)
 {
-	return 0;
+	struct nlattr *tb[ARRAY_SIZE(netdev_queue_dmabuf_nl_policy)];
+	struct netdev_dmabuf_binding *out_binding;
+	u32 ifindex, dmabuf_fd, rxq_idx;
+	struct net_device *netdev;
+	struct sk_buff *rsp;
+	struct nlattr *attr;
+	int rem, err = 0;
+	void *hdr;
+
+	if (GENL_REQ_ATTR_CHECK(info, NETDEV_A_DEV_IFINDEX) ||
+	    GENL_REQ_ATTR_CHECK(info, NETDEV_A_BIND_DMABUF_DMABUF_FD) ||
+	    GENL_REQ_ATTR_CHECK(info, NETDEV_A_BIND_DMABUF_QUEUES))
+		return -EINVAL;
+
+	ifindex = nla_get_u32(info->attrs[NETDEV_A_DEV_IFINDEX]);
+	dmabuf_fd = nla_get_u32(info->attrs[NETDEV_A_BIND_DMABUF_DMABUF_FD]);
+
+	rtnl_lock();
+
+	netdev = __dev_get_by_index(genl_info_net(info), ifindex);
+	if (!netdev) {
+		err = -ENODEV;
+		goto err_unlock;
+	}
+
+	err = netdev_bind_dmabuf(netdev, dmabuf_fd, &out_binding);
+	if (err)
+		goto err_unlock;
+
+	nla_for_each_attr(attr, genlmsg_data(info->genlhdr),
+			  genlmsg_len(info->genlhdr), rem) {
+		if (nla_type(attr) != NETDEV_A_BIND_DMABUF_QUEUES)
+			continue;
+
+		err = nla_parse_nested(
+			tb, ARRAY_SIZE(netdev_queue_dmabuf_nl_policy) - 1, attr,
+			netdev_queue_dmabuf_nl_policy, info->extack);
+
+		if (err < 0)
+			goto err_unbind;
+
+		rxq_idx = nla_get_u32(tb[NETDEV_A_QUEUE_DMABUF_IDX]);
+
+		if (rxq_idx >= netdev->num_rx_queues) {
+			err = -ERANGE;
+			goto err_unbind;
+		}
+
+		err = netdev_bind_dmabuf_to_queue(netdev, rxq_idx, out_binding);
+		if (err)
+			goto err_unbind;
+	}
+
+	out_binding->owner_nlportid = info->snd_portid;
+	list_add(&out_binding->list, &netdev_rbinding_list);
+
+	rsp = genlmsg_new(GENLMSG_DEFAULT_SIZE, GFP_KERNEL);
+	if (!rsp) {
+		err = -ENOMEM;
+		goto err_unbind;
+	}
+
+	hdr = genlmsg_put(rsp, info->snd_portid, info->snd_seq,
+			  &netdev_nl_family, 0, info->genlhdr->cmd);
+	if (!hdr) {
+		err = -EMSGSIZE;
+		goto err_genlmsg_free;
+	}
+
+	nla_put_u32(rsp, NETDEV_A_BIND_DMABUF_DMABUF_ID, out_binding->id);
+	genlmsg_end(rsp, hdr);
+
+	rtnl_unlock();
+
+	return genlmsg_reply(rsp, info);
+
+err_genlmsg_free:
+	nlmsg_free(rsp);
+err_unbind:
+	netdev_unbind_dmabuf(out_binding);
+err_unlock:
+	rtnl_unlock();
+	return err;
 }

 static int netdev_genl_netdevice_event(struct notifier_block *nb,
@@ -495,10 +579,37 @@ static int netdev_genl_netdevice_event(struct notifier_block *nb,
 	return NOTIFY_OK;
 }

+static int netdev_netlink_notify(struct notifier_block *nb, unsigned long state,
+				 void *_notify)
+{
+	struct netlink_notify *notify = _notify;
+	struct netdev_dmabuf_binding *rbinding;
+
+	if (state != NETLINK_URELEASE || notify->protocol != NETLINK_GENERIC)
+		return NOTIFY_DONE;
+
+	rtnl_lock();
+
+	list_for_each_entry(rbinding, &netdev_rbinding_list, list) {
+		if (rbinding->owner_nlportid == notify->portid) {
+			netdev_unbind_dmabuf(rbinding);
+			break;
+		}
+	}
+
+	rtnl_unlock();
+
+	return NOTIFY_OK;
+}
+
 static struct notifier_block netdev_genl_nb = {
 	.notifier_call	= netdev_genl_netdevice_event,
 };

+static struct notifier_block netdev_netlink_notifier = {
+	.notifier_call = netdev_netlink_notify,
+};
+
 static int __init netdev_genl_init(void)
 {
 	int err;
@@ -511,8 +622,14 @@ static int __init netdev_genl_init(void)
 	if (err)
 		goto err_unreg_ntf;

+	err = netlink_register_notifier(&netdev_netlink_notifier);
+	if (err)
+		goto err_unreg_family;
+
 	return 0;

+err_unreg_family:
+	genl_unregister_family(&netdev_nl_family);
 err_unreg_ntf:
 	unregister_netdevice_notifier(&netdev_genl_nb);
 	return err;
On Tue, Mar 5, 2024, at 03:01, Mina Almasry wrote:
+int netdev_bind_dmabuf(struct net_device *dev, unsigned int dmabuf_fd,
+		       struct netdev_dmabuf_binding **out)
+{
+	struct netdev_dmabuf_binding *binding;
+	static u32 id_alloc_next;
+	struct scatterlist *sg;
+	struct dma_buf *dmabuf;
+	unsigned int sg_idx, i;
+	unsigned long virtual;
+	int err;
+
+	if (!capable(CAP_NET_ADMIN))
+		return -EPERM;
+
+	dmabuf = dma_buf_get(dmabuf_fd);
+	if (IS_ERR_OR_NULL(dmabuf))
+		return -EBADFD;
You should never need to use IS_ERR_OR_NULL() for a properly defined kernel interface. This one should always return an error or a valid pointer, so don't check for NULL.
+	binding->attachment = dma_buf_attach(binding->dmabuf, dev->dev.parent);
+	if (IS_ERR(binding->attachment)) {
+		err = PTR_ERR(binding->attachment);
+		goto err_free_id;
+	}
+
+	binding->sgt =
+		dma_buf_map_attachment(binding->attachment, DMA_BIDIRECTIONAL);
+	if (IS_ERR(binding->sgt)) {
+		err = PTR_ERR(binding->sgt);
+		goto err_detach;
+	}
Should there be a check to verify that this buffer is suitable for network data?
In general, dmabuf allows buffers that are uncached or reside in MMIO space of another device, but I think this would break when you get an skb with those buffers and try to parse the data inside of the kernel on architectures where MMIO space is not a normal pointer or unaligned access is disallowed on uncached data.
Arnd
On Tue, Mar 5, 2024 at 1:05 AM Arnd Bergmann arnd@arndb.de wrote:
On Tue, Mar 5, 2024, at 03:01, Mina Almasry wrote:
+int netdev_bind_dmabuf(struct net_device *dev, unsigned int dmabuf_fd,
struct netdev_dmabuf_binding **out)
+{
struct netdev_dmabuf_binding *binding;
static u32 id_alloc_next;
struct scatterlist *sg;
struct dma_buf *dmabuf;
unsigned int sg_idx, i;
unsigned long virtual;
int err;
if (!capable(CAP_NET_ADMIN))
return -EPERM;
dmabuf = dma_buf_get(dmabuf_fd);
if (IS_ERR_OR_NULL(dmabuf))
return -EBADFD;
You should never need to use IS_ERR_OR_NULL() for a properly defined kernel interface. This one should always return an error or a valid pointer, so don't check for NULL.
Thanks for clarifying. I will convert to IS_ERR().
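For readers following along, here is a minimal userspace sketch of the error-pointer convention being discussed. The macros are simplified stand-ins modeled on the kernel's include/linux/err.h, and fake_dma_buf_get() is a hypothetical stand-in for dma_buf_get(); none of this is the patch's code. It only illustrates that a getter which returns either a valid pointer or an encoded errno needs IS_ERR() alone, with no NULL check.

```c
#include <errno.h>
#include <stdint.h>

/* Simplified userspace stand-ins for the helpers in include/linux/err.h. */
#define MAX_ERRNO 4095

static inline void *ERR_PTR(long error)
{
	return (void *)(intptr_t)error;
}

static inline long PTR_ERR(const void *ptr)
{
	return (long)(intptr_t)ptr;
}

static inline int IS_ERR(const void *ptr)
{
	/* Error pointers occupy the top MAX_ERRNO values of the address space. */
	return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

/* Hypothetical stand-in for struct dma_buf. */
struct fake_dma_buf {
	int fd;
};

/* Hypothetical getter: returns a valid pointer or an ERR_PTR(), never NULL. */
static struct fake_dma_buf *fake_dma_buf_get(int fd)
{
	static struct fake_dma_buf buf;

	if (fd < 0)
		return ERR_PTR(-EBADF);
	buf.fd = fd;
	return &buf;
}

/* The caller only needs IS_ERR(); a NULL check would be dead code here. */
static int bind_sketch(int fd)
{
	struct fake_dma_buf *dmabuf = fake_dma_buf_get(fd);

	if (IS_ERR(dmabuf))
		return (int)PTR_ERR(dmabuf);
	return 0;
}
```

In this sketch bind_sketch() forwards the errno carried in the pointer; whether the real function should do that or keep returning a fixed code is a separate choice that the thread does not settle.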
binding->attachment = dma_buf_attach(binding->dmabuf, dev->dev.parent);
if (IS_ERR(binding->attachment)) {
err = PTR_ERR(binding->attachment);
goto err_free_id;
}
binding->sgt =
dma_buf_map_attachment(binding->attachment, DMA_BIDIRECTIONAL);
if (IS_ERR(binding->sgt)) {
err = PTR_ERR(binding->sgt);
goto err_detach;
}
Should there be a check to verify that this buffer is suitable for network data?
In general, dmabuf allows buffers that are uncached or reside in MMIO space of another device, but I think this would break when you get an skb with those buffers and try to parse the data inside of the kernel on architectures where MMIO space is not a normal pointer or unaligned access is disallowed on uncached data.
Arnd
A key goal of this patch series is that the kernel does not try to parse the skb frags that reside in the dma-buf for that precise reason. This is achieved using patch "net: add support for skbs with unreadable frags" which disables the kernel touching the payload in these skbs, and "tcp: RX path for devmem TCP" which implements a uapi where the kernel hands the data in the dmabuf to the userspace via a cmsg that gives the user a pointer to the data in the dmabuf (offset + size).
So really AFAICT the only restriction here is that the NIC should be able to DMA into the dmabuf that we're attaching, and dma_buf_attach() fails if it cannot, so we're covered there.
On Tue, Mar 5, 2024, at 21:00, Mina Almasry wrote:
On Tue, Mar 5, 2024 at 1:05 AM Arnd Bergmann arnd@arndb.de wrote:
On Tue, Mar 5, 2024, at 03:01, Mina Almasry wrote:
A key goal of this patch series is that the kernel does not try to parse the skb frags that reside in the dma-buf for that precise reason. This is achieved using patch "net: add support for skbs with unreadable frags" which disables the kernel touching the payload in these skbs, and "tcp: RX path for devmem TCP" which implements a uapi where the kernel hands the data in the dmabuf to the userspace via a cmsg that gives the user a pointer to the data in the dmabuf (offset + size).
So really AFAICT the only restriction here is that the NIC should be able to DMA into the dmabuf that we're attaching, and dma_buf_attach() fails if it cannot, so we're covered there.
Ok, makes sense. Thanks for the clarification.
Arnd
On 2024/3/5 10:01, Mina Almasry wrote:
...
The netdev_dmabuf_binding struct is refcounted, and releases its resources only when all the refs are released.
Signed-off-by: Willem de Bruijn willemb@google.com Signed-off-by: Kaiyuan Zhang kaiyuanz@google.com Signed-off-by: Mina Almasry almasrymina@google.com
RFC v6:
- Validate rx queue index
- Refactor new functions into devmem.c (Pavel)
It seems odd that the functions or structs in a file called devmem.c are named after 'dmabuf' instead of 'devmem'.
...
diff --git a/include/net/netmem.h b/include/net/netmem.h
index d8b810245c1d..72e932a1a948 100644
--- a/include/net/netmem.h
+++ b/include/net/netmem.h
@@ -8,6 +8,16 @@
 #ifndef _NET_NETMEM_H
 #define _NET_NETMEM_H

+#include <net/devmem.h>
+
+/* net_iov */
+
+struct net_iov {
+	struct dmabuf_genpool_chunk_owner *owner;
+};
+
+/* netmem */
+
 /**
  * typedef netmem_ref - a nonexistent type marking a reference to generic
  * network memory.
diff --git a/net/core/Makefile b/net/core/Makefile
index 821aec06abf1..592f955c1241 100644
--- a/net/core/Makefile
+++ b/net/core/Makefile
@@ -13,7 +13,7 @@ obj-y += dev.o dev_addr_lists.o dst.o netevent.o \
 			neighbour.o rtnetlink.o utils.o link_watch.o filter.o \
 			sock_diag.o dev_ioctl.o tso.o sock_reuseport.o \
 			fib_notifier.o xdp.o flow_offload.o gro.o \
-			netdev-genl.o netdev-genl-gen.o gso.o
+			netdev-genl.o netdev-genl-gen.o gso.o devmem.o

 obj-$(CONFIG_NETDEV_ADDR_LIST_TEST) += dev_addr_lists_test.o
diff --git a/net/core/dev.c b/net/core/dev.c
index fe054cbd41e9..bbea1b252529 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -155,6 +155,9 @@
 #include <net/netdev_rx_queue.h>
 #include <net/page_pool/types.h>
 #include <net/page_pool/helpers.h>
+#include <linux/genalloc.h>
+#include <linux/dma-buf.h>
+#include <net/devmem.h>

 #include "dev.h"
 #include "net-sysfs.h"
diff --git a/net/core/devmem.c b/net/core/devmem.c
new file mode 100644
index 000000000000..779ad990971e
--- /dev/null
+++ b/net/core/devmem.c
@@ -0,0 +1,293 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ *      Devmem TCP
+ *
+ *      Authors:	Mina Almasry <almasrymina@google.com>
+ *			Willem de Bruijn <willemdebruijn.kernel@gmail.com>
+ *			Kaiyuan Zhang <kaiyuanz@google.com
+ */
+
+#include <linux/types.h>
+#include <linux/mm.h>
+#include <linux/netdevice.h>
+#include <trace/events/page_pool.h>
+#include <net/netdev_rx_queue.h>
+#include <net/page_pool/types.h>
+#include <net/page_pool/helpers.h>
+#include <linux/genalloc.h>
+#include <linux/dma-buf.h>
+#include <net/devmem.h>
+
+/* Device memory support */
+
+#ifdef CONFIG_DMA_SHARED_BUFFER
I still think it is worth adding a dedicated config option for devmem or dma-buf for networking, thinking about embedded systems.
+static void netdev_dmabuf_free_chunk_owner(struct gen_pool *genpool,
+					   struct gen_pool_chunk *chunk,
+					   void *not_used)
It seems odd to still keep the netdev_ prefix as it is not really related to netdev, perhaps use 'net_' or something better.
+{
+	struct dmabuf_genpool_chunk_owner *owner = chunk->owner;
+
+	kvfree(owner->niovs);
+	kfree(owner);
+}
+void __netdev_dmabuf_binding_free(struct netdev_dmabuf_binding *binding)
+{
+	size_t size, avail;
+
+	gen_pool_for_each_chunk(binding->chunk_pool,
+				netdev_dmabuf_free_chunk_owner, NULL);
+
+	size = gen_pool_size(binding->chunk_pool);
+	avail = gen_pool_avail(binding->chunk_pool);
+
+	if (!WARN(size != avail, "can't destroy genpool. size=%lu, avail=%lu",
+		  size, avail))
+		gen_pool_destroy(binding->chunk_pool);
+
+	dma_buf_unmap_attachment(binding->attachment, binding->sgt,
+				 DMA_BIDIRECTIONAL);
For now DMA_FROM_DEVICE seems enough as tx is not supported yet.
+	dma_buf_detach(binding->dmabuf, binding->attachment);
+	dma_buf_put(binding->dmabuf);
+	xa_destroy(&binding->bound_rxq_list);
+	kfree(binding);
+}
+static int netdev_restart_rx_queue(struct net_device *dev, int rxq_idx)
+{
+	void *new_mem;
+	void *old_mem;
+	int err;
+
+	if (!dev || !dev->netdev_ops)
+		return -EINVAL;
+
+	if (!dev->netdev_ops->ndo_queue_stop ||
+	    !dev->netdev_ops->ndo_queue_mem_free ||
+	    !dev->netdev_ops->ndo_queue_mem_alloc ||
+	    !dev->netdev_ops->ndo_queue_start)
+		return -EOPNOTSUPP;
+
+	new_mem = dev->netdev_ops->ndo_queue_mem_alloc(dev, rxq_idx);
+	if (!new_mem)
+		return -ENOMEM;
+
+	err = dev->netdev_ops->ndo_queue_stop(dev, rxq_idx, &old_mem);
+	if (err)
+		goto err_free_new_mem;
+
+	err = dev->netdev_ops->ndo_queue_start(dev, rxq_idx, new_mem);
+	if (err)
+		goto err_start_queue;
+
+	dev->netdev_ops->ndo_queue_mem_free(dev, old_mem);
+
+	return 0;
+
+err_start_queue:
+	dev->netdev_ops->ndo_queue_start(dev, rxq_idx, old_mem);
It might be worth mentioning why the queue start with old_mem will always succeed here, as the return value seems to be ignored.
+err_free_new_mem:
+	dev->netdev_ops->ndo_queue_mem_free(dev, new_mem);
+
+	return err;
+}
+
+/* Protected by rtnl_lock() */
+static DEFINE_XARRAY_FLAGS(netdev_dmabuf_bindings, XA_FLAGS_ALLOC1);
+
+void netdev_unbind_dmabuf(struct netdev_dmabuf_binding *binding)
+{
+	struct netdev_rx_queue *rxq;
+	unsigned long xa_idx;
+	unsigned int rxq_idx;
+
+	if (!binding)
+		return;
+
+	if (binding->list.next)
+		list_del(&binding->list);
The above does not seem to be a good pattern for deleting an entry. Is there any reason for having a check before the list_del()? Is it just defensive programming?
+	xa_for_each(&binding->bound_rxq_list, xa_idx, rxq) {
+		if (rxq->binding == binding) {
It seems like defensive programming here too?
+			/* We hold the rtnl_lock while binding/unbinding
+			 * dma-buf, so we can't race with another thread that
+			 * is also modifying this value. However, the driver
+			 * may read this config while it's creating its
+			 * rx-queues. WRITE_ONCE() here to match the
+			 * READ_ONCE() in the driver.
+			 */
+			WRITE_ONCE(rxq->binding, NULL);
+
+			rxq_idx = get_netdev_rx_queue_index(rxq);
+
+			netdev_restart_rx_queue(binding->dev, rxq_idx);
+		}
+	}
+
+	xa_erase(&netdev_dmabuf_bindings, binding->id);
+
+	netdev_dmabuf_binding_put(binding);
+}
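The final netdev_dmabuf_binding_put() above depends on the refcounting described in the commit message: resources are released only when the last reference is dropped. A minimal userspace sketch of that get/put pattern follows. It uses C11 atomics in place of the kernel's refcount helpers, and struct refcounted_binding, binding_alloc(), binding_get() and binding_put() are hypothetical names for illustration, not the patch's implementation.

```c
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical stand-in for a refcounted binding. */
struct refcounted_binding {
	atomic_int ref;		/* number of outstanding references */
	int *released;		/* demo only: set to 1 when resources are freed */
};

static struct refcounted_binding *binding_alloc(int *released)
{
	struct refcounted_binding *b = malloc(sizeof(*b));

	if (!b)
		return NULL;
	atomic_init(&b->ref, 1);	/* the creator holds the first reference */
	b->released = released;
	return b;
}

static void binding_get(struct refcounted_binding *b)
{
	atomic_fetch_add(&b->ref, 1);
}

static void binding_put(struct refcounted_binding *b)
{
	/* atomic_fetch_sub() returns the previous value; 1 means last ref. */
	if (atomic_fetch_sub(&b->ref, 1) == 1) {
		*b->released = 1;
		free(b);
	}
}
```

With this shape, an unbind path can drop its own reference while in-flight users that still hold references keep the object alive until they call put as well.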
On Tue, Mar 5, 2024 at 4:55 AM Yunsheng Lin linyunsheng@huawei.com wrote:
On 2024/3/5 10:01, Mina Almasry wrote:
...
The netdev_dmabuf_binding struct is refcounted, and releases its resources only when all the refs are released.
Signed-off-by: Willem de Bruijn willemb@google.com Signed-off-by: Kaiyuan Zhang kaiyuanz@google.com Signed-off-by: Mina Almasry almasrymina@google.com
RFC v6:
- Validate rx queue index
- Refactor new functions into devmem.c (Pavel)
It seems odd that the functions or structs in a file called devmem.c are named after 'dmabuf' instead of 'devmem'.
So my intention with this naming is that devmem.c contains all the functions for devmem TCP specific support. Currently the only devmem we support is dmabuf. In the future, other devmem may be supported and it can fit nicely in devmem.c. For example, if we want to extend devmem TCP to support NVMe devices, we may need to add support for p2pdma, and we can add that support under the devmem.c umbrella rather than add new files.
But I can rename to dmabuf.c if there is strong objection to the current name.
...
diff --git a/include/net/netmem.h b/include/net/netmem.h index d8b810245c1d..72e932a1a948 100644 --- a/include/net/netmem.h +++ b/include/net/netmem.h @@ -8,6 +8,16 @@ #ifndef _NET_NETMEM_H #define _NET_NETMEM_H
+#include <net/devmem.h>
+/* net_iov */
+struct net_iov {
struct dmabuf_genpool_chunk_owner *owner;
+};
+/* netmem */
/**
- typedef netmem_ref - a nonexistent type marking a reference to generic
- network memory.
diff --git a/net/core/Makefile b/net/core/Makefile index 821aec06abf1..592f955c1241 100644 --- a/net/core/Makefile +++ b/net/core/Makefile @@ -13,7 +13,7 @@ obj-y += dev.o dev_addr_lists.o dst.o netevent.o \ neighbour.o rtnetlink.o utils.o link_watch.o filter.o \ sock_diag.o dev_ioctl.o tso.o sock_reuseport.o \ fib_notifier.o xdp.o flow_offload.o gro.o \
netdev-genl.o netdev-genl-gen.o gso.o
netdev-genl.o netdev-genl-gen.o gso.o devmem.o
obj-$(CONFIG_NETDEV_ADDR_LIST_TEST) += dev_addr_lists_test.o
diff --git a/net/core/dev.c b/net/core/dev.c index fe054cbd41e9..bbea1b252529 100644 --- a/net/core/dev.c +++ b/net/core/dev.c @@ -155,6 +155,9 @@ #include <net/netdev_rx_queue.h> #include <net/page_pool/types.h> #include <net/page_pool/helpers.h> +#include <linux/genalloc.h> +#include <linux/dma-buf.h> +#include <net/devmem.h>
#include "dev.h" #include "net-sysfs.h" diff --git a/net/core/devmem.c b/net/core/devmem.c new file mode 100644 index 000000000000..779ad990971e --- /dev/null +++ b/net/core/devmem.c @@ -0,0 +1,293 @@ +// SPDX-License-Identifier: GPL-2.0-or-later +/*
Devmem TCP
Authors: Mina Almasry <almasrymina@google.com>
Willem de Bruijn <willemdebruijn.kernel@gmail.com>
Kaiyuan Zhang <kaiyuanz@google.com
- */
+#include <linux/types.h> +#include <linux/mm.h> +#include <linux/netdevice.h> +#include <trace/events/page_pool.h> +#include <net/netdev_rx_queue.h> +#include <net/page_pool/types.h> +#include <net/page_pool/helpers.h> +#include <linux/genalloc.h> +#include <linux/dma-buf.h> +#include <net/devmem.h>
+/* Device memory support */
+#ifdef CONFIG_DMA_SHARED_BUFFER
I still think it is worth adding a dedicated config option for devmem or dma-buf for networking, thinking about embedded systems.
FWIW Willem did weigh in on this previously and said he prefers to have it unguarded by a CONFIG, but I will submit to whatever the consensus is here. It shouldn't be a huge deal to add a CONFIG, technically speaking.
+static void netdev_dmabuf_free_chunk_owner(struct gen_pool *genpool,
struct gen_pool_chunk *chunk,
void *not_used)
It seems odd to still keep the netdev_ prefix as it is not really related to netdev, perhaps use 'net_' or something better.
Yes, thanks for catching. I can change to net_devmem_* maybe, or net_dmabuf_*.
+{
struct dmabuf_genpool_chunk_owner *owner = chunk->owner;
kvfree(owner->niovs);
kfree(owner);
+}
+void __netdev_dmabuf_binding_free(struct netdev_dmabuf_binding *binding) +{
size_t size, avail;
gen_pool_for_each_chunk(binding->chunk_pool,
netdev_dmabuf_free_chunk_owner, NULL);
size = gen_pool_size(binding->chunk_pool);
avail = gen_pool_avail(binding->chunk_pool);
if (!WARN(size != avail, "can't destroy genpool. size=%lu, avail=%lu",
size, avail))
gen_pool_destroy(binding->chunk_pool);
dma_buf_unmap_attachment(binding->attachment, binding->sgt,
DMA_BIDIRECTIONAL);
For now DMA_FROM_DEVICE seems enough as tx is not supported yet.
Yes, good catch. I suspect we want to reuse this code for TX path. But for now, I'll test with DMA_FROM_DEVICE and if I see no issues I'll apply this change.
dma_buf_detach(binding->dmabuf, binding->attachment);
dma_buf_put(binding->dmabuf);
xa_destroy(&binding->bound_rxq_list);
kfree(binding);
+}
+static int netdev_restart_rx_queue(struct net_device *dev, int rxq_idx) +{
void *new_mem;
void *old_mem;
int err;
if (!dev || !dev->netdev_ops)
return -EINVAL;
if (!dev->netdev_ops->ndo_queue_stop ||
!dev->netdev_ops->ndo_queue_mem_free ||
!dev->netdev_ops->ndo_queue_mem_alloc ||
!dev->netdev_ops->ndo_queue_start)
return -EOPNOTSUPP;
new_mem = dev->netdev_ops->ndo_queue_mem_alloc(dev, rxq_idx);
if (!new_mem)
return -ENOMEM;
err = dev->netdev_ops->ndo_queue_stop(dev, rxq_idx, &old_mem);
if (err)
goto err_free_new_mem;
err = dev->netdev_ops->ndo_queue_start(dev, rxq_idx, new_mem);
if (err)
goto err_start_queue;
dev->netdev_ops->ndo_queue_mem_free(dev, old_mem);
return 0;
+err_start_queue:
dev->netdev_ops->ndo_queue_start(dev, rxq_idx, old_mem);
It might be worth mentioning why the queue start with old_mem will always succeed here, as the return value seems to be ignored.
So the old queue, we stopped it, and if we fail to bring up the new queue, then we want to start the old queue back up to get the queue back to a workable state.
I don't see what we can do to recover if restarting the old queue fails. Seems like it should be a requirement that the driver tries as much as possible to keep the old queue restartable.
I can improve this by at least logging or warning if restarting the old queue fails.
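As a rough illustration of the logging idea, here is a userspace sketch of the same alloc/stop/start/free sequence that reports a failed rollback instead of ignoring it. struct queue_ops and the mock_* helpers are hypothetical simplifications of the ndo_queue_* callbacks, written only so the control flow can be exercised outside the kernel; in kernel code the fprintf() would presumably be a WARN or netdev_err() call.

```c
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical, simplified model of the queue-API callbacks in the patch. */
struct queue_ops {
	void *(*mem_alloc)(int rxq_idx);
	void (*mem_free)(void *mem);
	int (*stop)(int rxq_idx, void **out_old_mem);
	int (*start)(int rxq_idx, void *mem);
};

static int restart_rx_queue_sketch(const struct queue_ops *ops, int rxq_idx)
{
	void *new_mem, *old_mem;
	int err, rollback_err;

	new_mem = ops->mem_alloc(rxq_idx);
	if (!new_mem)
		return -ENOMEM;

	err = ops->stop(rxq_idx, &old_mem);
	if (err)
		goto err_free_new_mem;

	err = ops->start(rxq_idx, new_mem);
	if (err)
		goto err_start_queue;

	ops->mem_free(old_mem);
	return 0;

err_start_queue:
	/* Roll back to the old memory; report a failure rather than ignore it. */
	rollback_err = ops->start(rxq_idx, old_mem);
	if (rollback_err)
		fprintf(stderr, "rxq %d: restart with old memory failed: %d\n",
			rxq_idx, rollback_err);
err_free_new_mem:
	ops->mem_free(new_mem);
	return err;
}

/* Mock driver, for illustration only. */
static void *mock_cur_mem;		/* memory backing the running queue */
static int mock_running;		/* 1 while the queue is started */
static int mock_start_failures;		/* upcoming start() calls that fail */

static void *mock_mem_alloc(int rxq_idx)
{
	(void)rxq_idx;
	return malloc(16);
}

static void mock_mem_free(void *mem)
{
	free(mem);
}

static int mock_stop(int rxq_idx, void **out_old_mem)
{
	(void)rxq_idx;
	*out_old_mem = mock_cur_mem;
	mock_cur_mem = NULL;
	mock_running = 0;
	return 0;
}

static int mock_start(int rxq_idx, void *mem)
{
	(void)rxq_idx;
	if (mock_start_failures > 0) {
		mock_start_failures--;
		return -EIO;
	}
	mock_cur_mem = mem;
	mock_running = 1;
	return 0;
}

static const struct queue_ops mock_ops = {
	.mem_alloc = mock_mem_alloc,
	.mem_free = mock_mem_free,
	.stop = mock_stop,
	.start = mock_start,
};
```

Note that when the rollback itself fails, this sketch leaves the queue stopped and does not free old_mem; what the right recovery is in that case remains an open question in the thread.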
+err_free_new_mem:
dev->netdev_ops->ndo_queue_mem_free(dev, new_mem);
return err;
+}
+/* Protected by rtnl_lock() */ +static DEFINE_XARRAY_FLAGS(netdev_dmabuf_bindings, XA_FLAGS_ALLOC1);
+void netdev_unbind_dmabuf(struct netdev_dmabuf_binding *binding) +{
struct netdev_rx_queue *rxq;
unsigned long xa_idx;
unsigned int rxq_idx;
if (!binding)
return;
if (binding->list.next)
list_del(&binding->list);
The above does not seem to be a good pattern for deleting an entry. Is there any reason for having a check before the list_del()? Is it just defensive programming?
I think I needed to apply this condition to handle the case where netdev_unbind_dmabuf() is called when binding->list is not initialized or is empty.
netdev_nl_bind_rx_doit() will call unbind to free a partially allocated binding in error paths, so netdev_unbind_dmabuf() may be called with a partially initialized binding. This is why we check that binding->list is initialized here and check that rxq->binding == binding below. The main point is that netdev_unbind_dmabuf() may be asked to unbind a partially bound dmabuf due to error paths.
Maybe a comment here will explain this better. I will double check that the check is needed for the error paths in netdev_nl_bind_rx_doit().
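One possible alternative to testing binding->list.next, sketched here in userspace and not what the patch does: initialize the list node when the binding is allocated and unlink it with list_del_init(), which the kernel list API provides. The list helpers below are minimal re-implementations for illustration, and struct binding_sketch, binding_init() and binding_unlink() are hypothetical names.

```c
#include <stddef.h>

/* Minimal userspace model of the kernel's circular doubly linked list. */
struct list_head {
	struct list_head *next, *prev;
};

static void INIT_LIST_HEAD(struct list_head *head)
{
	head->next = head;
	head->prev = head;
}

static int list_empty(const struct list_head *head)
{
	return head->next == head;
}

static void list_add(struct list_head *entry, struct list_head *head)
{
	entry->next = head->next;
	entry->prev = head;
	head->next->prev = entry;
	head->next = entry;
}

/* Unlink the entry and leave it self-pointing, so a repeat call is harmless. */
static void list_del_init(struct list_head *entry)
{
	entry->prev->next = entry->next;
	entry->next->prev = entry->prev;
	INIT_LIST_HEAD(entry);
}

/* Hypothetical stand-in for the binding; only the list node matters here. */
struct binding_sketch {
	struct list_head list;
};

/* Done once at allocation time, before any error path can run. */
static void binding_init(struct binding_sketch *b)
{
	INIT_LIST_HEAD(&b->list);
}

/* Safe whether or not the binding was ever added to a list. */
static void binding_unlink(struct binding_sketch *b)
{
	list_del_init(&b->list);
}
```

Because an initialized but never-added node points at itself, the unlink step becomes a no-op on the error paths, which removes the need for a conditional in the unbind function.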
xa_for_each(&binding->bound_rxq_list, xa_idx, rxq) {
if (rxq->binding == binding) {
It seems like defensive programming here too?
/* We hold the rtnl_lock while binding/unbinding
* dma-buf, so we can't race with another thread that
* is also modifying this value. However, the driver
* may read this config while it's creating its
* rx-queues. WRITE_ONCE() here to match the
* READ_ONCE() in the driver.
*/
WRITE_ONCE(rxq->binding, NULL);
rxq_idx = get_netdev_rx_queue_index(rxq);
netdev_restart_rx_queue(binding->dev, rxq_idx);
}
}
xa_erase(&netdev_dmabuf_bindings, binding->id);
netdev_dmabuf_binding_put(binding);
+}
On 2024/3/6 5:17, Mina Almasry wrote:
On Tue, Mar 5, 2024 at 4:55 AM Yunsheng Lin linyunsheng@huawei.com wrote:
On 2024/3/5 10:01, Mina Almasry wrote:
...
The netdev_dmabuf_binding struct is refcounted, and releases its resources only when all the refs are released.
Signed-off-by: Willem de Bruijn willemb@google.com Signed-off-by: Kaiyuan Zhang kaiyuanz@google.com Signed-off-by: Mina Almasry almasrymina@google.com
RFC v6:
- Validate rx queue index
- Refactor new functions into devmem.c (Pavel)
It seems odd that the functions or structs in a file called devmem.c are named after 'dmabuf' instead of 'devmem'.
So my intention with this naming is that devmem.c contains all the functions for devmem TCP specific support. Currently the only devmem we support is dmabuf. In the future, other devmem may be supported and it can fit nicely in devmem.c. For example, if we want to extend devmem TCP to support NVMe devices, we may need to add support for p2pdma, and we can add that support under the devmem.c umbrella rather than add new files.
But I can rename to dmabuf.c if there is strong objection to the current name.
Grepping 'dmabuf' seems to show that it may be more common to name it something like *_dmabuf.c.
...
diff --git a/include/net/netmem.h b/include/net/netmem.h index d8b810245c1d..72e932a1a948 100644 --- a/include/net/netmem.h +++ b/include/net/netmem.h @@ -8,6 +8,16 @@ #ifndef _NET_NETMEM_H #define _NET_NETMEM_H
+#include <net/devmem.h>
+/* net_iov */
+struct net_iov {
struct dmabuf_genpool_chunk_owner *owner;
+};
+/* netmem */
/**
- typedef netmem_ref - a nonexistent type marking a reference to generic
- network memory.
diff --git a/net/core/Makefile b/net/core/Makefile index 821aec06abf1..592f955c1241 100644 --- a/net/core/Makefile +++ b/net/core/Makefile @@ -13,7 +13,7 @@ obj-y += dev.o dev_addr_lists.o dst.o netevent.o \ neighbour.o rtnetlink.o utils.o link_watch.o filter.o \ sock_diag.o dev_ioctl.o tso.o sock_reuseport.o \ fib_notifier.o xdp.o flow_offload.o gro.o \
netdev-genl.o netdev-genl-gen.o gso.o
netdev-genl.o netdev-genl-gen.o gso.o devmem.o
obj-$(CONFIG_NETDEV_ADDR_LIST_TEST) += dev_addr_lists_test.o
diff --git a/net/core/dev.c b/net/core/dev.c index fe054cbd41e9..bbea1b252529 100644 --- a/net/core/dev.c +++ b/net/core/dev.c @@ -155,6 +155,9 @@ #include <net/netdev_rx_queue.h> #include <net/page_pool/types.h> #include <net/page_pool/helpers.h> +#include <linux/genalloc.h> +#include <linux/dma-buf.h> +#include <net/devmem.h>
#include "dev.h" #include "net-sysfs.h" diff --git a/net/core/devmem.c b/net/core/devmem.c new file mode 100644 index 000000000000..779ad990971e --- /dev/null +++ b/net/core/devmem.c @@ -0,0 +1,293 @@ +// SPDX-License-Identifier: GPL-2.0-or-later +/*
Devmem TCP
Authors: Mina Almasry <almasrymina@google.com>
Willem de Bruijn <willemdebruijn.kernel@gmail.com>
Kaiyuan Zhang <kaiyuanz@google.com
- */
+#include <linux/types.h> +#include <linux/mm.h> +#include <linux/netdevice.h> +#include <trace/events/page_pool.h> +#include <net/netdev_rx_queue.h> +#include <net/page_pool/types.h> +#include <net/page_pool/helpers.h> +#include <linux/genalloc.h> +#include <linux/dma-buf.h> +#include <net/devmem.h>
+/* Device memory support */
+#ifdef CONFIG_DMA_SHARED_BUFFER
I still think it is worth adding a dedicated config option for devmem or dma-buf for networking, thinking about embedded systems.
FWIW Willem did weigh in on this previously and said he prefers to have it unguarded by a CONFIG, but I will submit to whatever the consensus is here. It shouldn't be a huge deal to add a CONFIG, technically speaking.
Grepping 'CONFIG_DMA_SHARED_BUFFER' shows that users of the dmabuf API do not seem to reuse CONFIG_DMA_SHARED_BUFFER directly; instead they seem to define their own config and select CONFIG_DMA_SHARED_BUFFER if necessary. Is there any reason it is different here?
+static void netdev_dmabuf_free_chunk_owner(struct gen_pool *genpool,
struct gen_pool_chunk *chunk,
void *not_used)
It seems odd to still keep the netdev_ prefix as it is not really related to netdev, perhaps use 'net_' or something better.
Yes, thanks for catching. I can change to net_devmem_* maybe, or net_dmabuf_*.
FWIW, net_dmabuf_* seems like a better name technically.
+{
struct dmabuf_genpool_chunk_owner *owner = chunk->owner;
kvfree(owner->niovs);
kfree(owner);
+}
+void __netdev_dmabuf_binding_free(struct netdev_dmabuf_binding *binding) +{
size_t size, avail;
gen_pool_for_each_chunk(binding->chunk_pool,
netdev_dmabuf_free_chunk_owner, NULL);
size = gen_pool_size(binding->chunk_pool);
avail = gen_pool_avail(binding->chunk_pool);
if (!WARN(size != avail, "can't destroy genpool. size=%lu, avail=%lu",
size, avail))
gen_pool_destroy(binding->chunk_pool);
dma_buf_unmap_attachment(binding->attachment, binding->sgt,
DMA_BIDIRECTIONAL);
For now DMA_FROM_DEVICE seems enough as tx is not supported yet.
Yes, good catch. I suspect we want to reuse this code for TX path. But for now, I'll test with DMA_FROM_DEVICE and if I see no issues I'll apply this change.
dma_buf_detach(binding->dmabuf, binding->attachment);
dma_buf_put(binding->dmabuf);
xa_destroy(&binding->bound_rxq_list);
kfree(binding);
+}
+static int netdev_restart_rx_queue(struct net_device *dev, int rxq_idx) +{
void *new_mem;
void *old_mem;
int err;
if (!dev || !dev->netdev_ops)
return -EINVAL;
if (!dev->netdev_ops->ndo_queue_stop ||
!dev->netdev_ops->ndo_queue_mem_free ||
!dev->netdev_ops->ndo_queue_mem_alloc ||
!dev->netdev_ops->ndo_queue_start)
return -EOPNOTSUPP;
new_mem = dev->netdev_ops->ndo_queue_mem_alloc(dev, rxq_idx);
if (!new_mem)
return -ENOMEM;
err = dev->netdev_ops->ndo_queue_stop(dev, rxq_idx, &old_mem);
if (err)
goto err_free_new_mem;
err = dev->netdev_ops->ndo_queue_start(dev, rxq_idx, new_mem);
if (err)
goto err_start_queue;
dev->netdev_ops->ndo_queue_mem_free(dev, old_mem);
return 0;
+err_start_queue:
dev->netdev_ops->ndo_queue_start(dev, rxq_idx, old_mem);
It might be worth mentioning why the queue start with old_mem will always succeed here, as the return value seems to be ignored.
So the old queue, we stopped it, and if we fail to bring up the new queue, then we want to start the old queue back up to get the queue back to a workable state.
I don't see what we can do to recover if restarting the old queue fails. Seems like it should be a requirement that the driver tries as much as possible to keep the old queue restartable.
Is it possible that 'old_mem' leaks if the driver fails to restart the old queue? How does the driver handle a firmware command failure in ndo_queue_start()? It seems a little tricky to implement.
I can improve this by at least logging or warning if restarting the old queue fails.
Also, the semantics of the above function seem odd: it not only restarts the rx queue but also frees and allocates memory, even though the name only suggests 'restart'. I am a little afraid that it may conflict with a future use case where the user only needs the 'restart' part; perhaps rename it to something more appropriate.
+err_free_new_mem:
dev->netdev_ops->ndo_queue_mem_free(dev, new_mem);
return err;
+}
+/* Protected by rtnl_lock() */ +static DEFINE_XARRAY_FLAGS(netdev_dmabuf_bindings, XA_FLAGS_ALLOC1);
+void netdev_unbind_dmabuf(struct netdev_dmabuf_binding *binding) +{
struct netdev_rx_queue *rxq;
unsigned long xa_idx;
unsigned int rxq_idx;
if (!binding)
return;
if (binding->list.next)
list_del(&binding->list);
The above does not seem to be a good pattern for deleting an entry. Is there any reason for having a check before the list_del()? Is it just defensive programming?
I think I needed to apply this condition to handle the case where netdev_unbind_dmabuf() is called when binding->list is not initialized or is empty.
netdev_nl_bind_rx_doit() will call unbind to free a partially allocated binding in error paths, so netdev_unbind_dmabuf() may be called with a partially initialized binding. This is why we check that binding->list is initialized here and check that rxq->binding == binding below. The main point is that netdev_unbind_dmabuf() may be asked to unbind a partially bound dmabuf due to error paths.
Maybe a comment here will explain this better. I will double check that the check is needed for the error paths in netdev_nl_bind_rx_doit().
xa_for_each(&binding->bound_rxq_list, xa_idx, rxq) {
if (rxq->binding == binding) {
It seems like defensive programming here too?
On Wed, Mar 6, 2024 at 4:38 AM Yunsheng Lin linyunsheng@huawei.com wrote:
On 2024/3/6 5:17, Mina Almasry wrote:
On Tue, Mar 5, 2024 at 4:55 AM Yunsheng Lin linyunsheng@huawei.com wrote:
On 2024/3/5 10:01, Mina Almasry wrote:
...
The netdev_dmabuf_binding struct is refcounted, and releases its resources only when all the refs are released.
Signed-off-by: Willem de Bruijn willemb@google.com Signed-off-by: Kaiyuan Zhang kaiyuanz@google.com Signed-off-by: Mina Almasry almasrymina@google.com
RFC v6:
- Validate rx queue index
- Refactor new functions into devmem.c (Pavel)
It seems odd that the functions or structs in a file called devmem.c are named after 'dmabuf' instead of 'devmem'.
So my intention with this naming is that devmem.c contains all the functions for devmem TCP specific support. Currently the only devmem we support is dmabuf. In the future, other devmem may be supported and it can fit nicely in devmem.c. For example, if we want to extend devmem TCP to support NVMe devices, we may need to add support for p2pdma, and we can add that support under the devmem.c umbrella rather than add new files.
But I can rename to dmabuf.c if there is strong objection to the current name.
Grepping 'dmabuf' seems to show that it may be more common to name it something like *_dmabuf.c.
...
diff --git a/include/net/netmem.h b/include/net/netmem.h index d8b810245c1d..72e932a1a948 100644 --- a/include/net/netmem.h +++ b/include/net/netmem.h @@ -8,6 +8,16 @@ #ifndef _NET_NETMEM_H #define _NET_NETMEM_H
+#include <net/devmem.h>
+/* net_iov */
+struct net_iov {
struct dmabuf_genpool_chunk_owner *owner;
+};
+/* netmem */
/**
- typedef netmem_ref - a nonexistent type marking a reference to generic
- network memory.
diff --git a/net/core/Makefile b/net/core/Makefile index 821aec06abf1..592f955c1241 100644 --- a/net/core/Makefile +++ b/net/core/Makefile @@ -13,7 +13,7 @@ obj-y += dev.o dev_addr_lists.o dst.o netevent.o \ neighbour.o rtnetlink.o utils.o link_watch.o filter.o \ sock_diag.o dev_ioctl.o tso.o sock_reuseport.o \ fib_notifier.o xdp.o flow_offload.o gro.o \
netdev-genl.o netdev-genl-gen.o gso.o
netdev-genl.o netdev-genl-gen.o gso.o devmem.o
obj-$(CONFIG_NETDEV_ADDR_LIST_TEST) += dev_addr_lists_test.o
diff --git a/net/core/dev.c b/net/core/dev.c index fe054cbd41e9..bbea1b252529 100644 --- a/net/core/dev.c +++ b/net/core/dev.c @@ -155,6 +155,9 @@ #include <net/netdev_rx_queue.h> #include <net/page_pool/types.h> #include <net/page_pool/helpers.h> +#include <linux/genalloc.h> +#include <linux/dma-buf.h> +#include <net/devmem.h>
#include "dev.h" #include "net-sysfs.h" diff --git a/net/core/devmem.c b/net/core/devmem.c new file mode 100644 index 000000000000..779ad990971e --- /dev/null +++ b/net/core/devmem.c @@ -0,0 +1,293 @@ +// SPDX-License-Identifier: GPL-2.0-or-later +/*
Devmem TCP
Authors: Mina Almasry <almasrymina@google.com>
Willem de Bruijn <willemdebruijn.kernel@gmail.com>
Kaiyuan Zhang <kaiyuanz@google.com
- */
+#include <linux/types.h> +#include <linux/mm.h> +#include <linux/netdevice.h> +#include <trace/events/page_pool.h> +#include <net/netdev_rx_queue.h> +#include <net/page_pool/types.h> +#include <net/page_pool/helpers.h> +#include <linux/genalloc.h> +#include <linux/dma-buf.h> +#include <net/devmem.h>
+/* Device memory support */
+#ifdef CONFIG_DMA_SHARED_BUFFER
I still think it is worth adding a dedicated config option for devmem or dma-buf for networking, thinking about embedded systems.
FWIW Willem did weigh in on this previously and said he prefers to have it unguarded by a CONFIG, but I will submit to whatever the consensus is here. It shouldn't be a huge deal to add a CONFIG, technically speaking.
Grepping 'CONFIG_DMA_SHARED_BUFFER' shows that users of the dmabuf API do not seem to reuse CONFIG_DMA_SHARED_BUFFER directly; instead they seem to define their own config and select CONFIG_DMA_SHARED_BUFFER if necessary. Is there any reason it is different here?
+static void netdev_dmabuf_free_chunk_owner(struct gen_pool *genpool,
struct gen_pool_chunk *chunk,
void *not_used)
It seems odd to still keep the netdev_ prefix as it is not really related to netdev, perhaps use 'net_' or something better.
Yes, thanks for catching. I can change to net_devmem_* maybe, or net_dmabuf_*.
FWIW, net_dmabuf_* seems like a better name technically.
+{
struct dmabuf_genpool_chunk_owner *owner = chunk->owner;
kvfree(owner->niovs);
kfree(owner);
+}
+void __netdev_dmabuf_binding_free(struct netdev_dmabuf_binding *binding) +{
size_t size, avail;
gen_pool_for_each_chunk(binding->chunk_pool,
netdev_dmabuf_free_chunk_owner, NULL);
size = gen_pool_size(binding->chunk_pool);
avail = gen_pool_avail(binding->chunk_pool);
if (!WARN(size != avail, "can't destroy genpool. size=%lu, avail=%lu",
size, avail))
gen_pool_destroy(binding->chunk_pool);
dma_buf_unmap_attachment(binding->attachment, binding->sgt,
DMA_BIDIRECTIONAL);
For now DMA_FROM_DEVICE seems enough as tx is not supported yet.
Yes, good catch. I suspect we want to reuse this code for TX path. But for now, I'll test with DMA_FROM_DEVICE and if I see no issues I'll apply this change.
dma_buf_detach(binding->dmabuf, binding->attachment);
dma_buf_put(binding->dmabuf);
xa_destroy(&binding->bound_rxq_list);
kfree(binding);
+}
+static int netdev_restart_rx_queue(struct net_device *dev, int rxq_idx) +{
void *new_mem;
void *old_mem;
int err;
if (!dev || !dev->netdev_ops)
return -EINVAL;
if (!dev->netdev_ops->ndo_queue_stop ||
!dev->netdev_ops->ndo_queue_mem_free ||
!dev->netdev_ops->ndo_queue_mem_alloc ||
!dev->netdev_ops->ndo_queue_start)
return -EOPNOTSUPP;
new_mem = dev->netdev_ops->ndo_queue_mem_alloc(dev, rxq_idx);
if (!new_mem)
return -ENOMEM;
err = dev->netdev_ops->ndo_queue_stop(dev, rxq_idx, &old_mem);
if (err)
goto err_free_new_mem;
err = dev->netdev_ops->ndo_queue_start(dev, rxq_idx, new_mem);
if (err)
goto err_start_queue;
dev->netdev_ops->ndo_queue_mem_free(dev, old_mem);
return 0;
+err_start_queue:
dev->netdev_ops->ndo_queue_start(dev, rxq_idx, old_mem);
It might be worth mentioning why starting the queue with old_mem will always succeed here, as the return value is ignored.
So the old queue, we stopped it, and if we fail to bring up the new queue, then we want to start the old queue back up to get the queue back to a workable state.
I don't see what we can do to recover if restarting the old queue fails. Seems like it should be a requirement that the driver tries as much as possible to keep the old queue restartable.
Is it possible that 'old_mem' leaks if the driver fails to restart the old queue? How does the driver handle a firmware command failure in ndo_queue_start()? It seems a little tricky to implement.
I'm not sure what we can do to meaningfully recover from a failure to restart the old queue, except log it so the error is visible. In theory, because we have not modified any queue configuration, restarting it should be straightforward, but since it's dealing with hardware, any failure is possible.
I can improve this by at least logging or warning if restarting the old queue fails.
Also, the semantics of the above function seem odd: it not only restarts the rx queue but also frees and allocates memory, although the name only suggests 'restart'. I am a little afraid that it may conflict with a future use case where the user only needs the 'restart' part; perhaps rename it to something more appropriate.
Oh, what we want here is just the 'restart' part. However, Jakub mandates that if you restart a queue (or a driver), you do it like this, hence the slightly more complicated implementation.
https://patchwork.kernel.org/project/netdevbpf/patch/20231106024413.2801438-... https://lore.kernel.org/netdev/20230815171638.4c057dcd@kernel.org/
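To make the rollback discussion above concrete, here is a minimal userspace sketch of the restart sequence with the proposed warning added. Everything in it is an assumption for illustration: struct queue_ops and the mock_* driver are hypothetical stand-ins for the ndos, not the API in this series, and freeing old_mem when the rollback also fails is just one possible policy, chosen here so the buffer is not leaked.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical stand-ins for the four queue ndos discussed above. */
struct queue_ops {
	void *(*mem_alloc)(int idx);
	void (*mem_free)(void *mem);
	int (*stop)(int idx, void **old_mem);
	int (*start)(int idx, void *mem);
};

/* Swap the queue memory; on failure, try to bring the old queue back. */
static int restart_rx_queue(const struct queue_ops *ops, int idx)
{
	void *new_mem, *old_mem = NULL;
	int err, rollback_err;

	assert(ops != NULL);

	new_mem = ops->mem_alloc(idx);
	if (!new_mem)
		return -ENOMEM;

	err = ops->stop(idx, &old_mem);
	if (err)
		goto err_free_new_mem;

	err = ops->start(idx, new_mem);
	if (err)
		goto err_start_queue;

	ops->mem_free(old_mem);
	return 0;

err_start_queue:
	rollback_err = ops->start(idx, old_mem);
	if (rollback_err) {
		/* Nothing left to recover with: report it, and (assumed
		 * policy) free old_mem so it does not leak.
		 */
		fprintf(stderr, "queue %d: restarting old queue failed: %d\n",
			idx, rollback_err);
		ops->mem_free(old_mem);
	}
err_free_new_mem:
	ops->mem_free(new_mem);
	return err;
}

/* Mock driver, only here to exercise the three paths. */
static int mock_start_failures;	/* how many upcoming start() calls fail */
static int mock_live_bufs;	/* buffers allocated and not yet freed */
static void *mock_active_mem;	/* memory the mock queue is running on */

static void *mock_mem_alloc(int idx)
{
	void *mem = malloc(1);

	(void)idx;
	if (mem)
		mock_live_bufs++;
	return mem;
}

static void mock_mem_free(void *mem)
{
	free(mem);
	mock_live_bufs--;
}

static int mock_stop(int idx, void **old_mem)
{
	(void)idx;
	*old_mem = mock_active_mem;
	mock_active_mem = NULL;
	return 0;
}

static int mock_start(int idx, void *mem)
{
	(void)idx;
	if (mock_start_failures > 0) {
		mock_start_failures--;
		return -EIO;
	}
	mock_active_mem = mem;
	return 0;
}

static const struct queue_ops mock_ops = {
	mock_mem_alloc, mock_mem_free, mock_stop, mock_start,
};

/* Put the mock queue into a running state with one live buffer. */
static void mock_reset(int start_failures)
{
	free(mock_active_mem);
	mock_active_mem = malloc(1);
	mock_live_bufs = 1;
	mock_start_failures = start_failures;
}
```

With one injected start failure the queue ends up running on its old memory again; with two, the queue is down, the error is logged, and no buffer is left allocated. Whether the core or the driver should own that last free is exactly the open question in the thread.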
+err_free_new_mem:
dev->netdev_ops->ndo_queue_mem_free(dev, new_mem);
return err;
+}
+/* Protected by rtnl_lock() */ +static DEFINE_XARRAY_FLAGS(netdev_dmabuf_bindings, XA_FLAGS_ALLOC1);
+void netdev_unbind_dmabuf(struct netdev_dmabuf_binding *binding) +{
struct netdev_rx_queue *rxq;
unsigned long xa_idx;
unsigned int rxq_idx;
if (!binding)
return;
if (binding->list.next)
list_del(&binding->list);
The above does not seem to be a good pattern for deleting an entry. Is there any reason for the check before list_del()? It looks like defensive programming.
I think I needed to apply this condition to handle the case where netdev_unbind_dmabuf() is called when binding->list is not initialized or is empty.
netdev_nl_bind_rx_doit() will call unbind to free a partially allocated binding in error paths, so netdev_unbind_dmabuf() may be called with a partially initialized binding. This is why we check that binding->list is initialized here and check that rxq->binding == binding below. The main point is that netdev_unbind_dmabuf() may be asked to unbind a partially bound dmabuf due to error paths.
Maybe a comment here will explain this better. I will double-check that the check is needed for the error paths in netdev_nl_bind_rx_doit().
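One way to handle a partially initialized binding without testing list.next is to initialize the list node when the binding is created, so that deleting it is safe whether or not it was ever linked. The sketch below is a userspace illustration under that assumption. The list helpers are a minimal copy of the kernel's list semantics, and struct binding, binding_init() and binding_unbind() are hypothetical names, not code from this series.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Minimal userspace copy of the kernel's circular doubly linked list. */
struct list_head {
	struct list_head *next, *prev;
};

static void INIT_LIST_HEAD(struct list_head *head)
{
	head->next = head;
	head->prev = head;
}

static bool list_empty(const struct list_head *head)
{
	return head->next == head;
}

static void list_add(struct list_head *entry, struct list_head *head)
{
	entry->next = head->next;
	entry->prev = head;
	head->next->prev = entry;
	head->next = entry;
}

/* Unlink the entry and leave it self-pointing, so it can be deleted again. */
static void list_del_init(struct list_head *entry)
{
	entry->prev->next = entry->next;
	entry->next->prev = entry->prev;
	INIT_LIST_HEAD(entry);
}

/* Hypothetical binding object, reduced to its list node. */
struct binding {
	struct list_head list;
	int id;
};

static void binding_init(struct binding *b, int id)
{
	assert(b != NULL);
	/* Initialize the node up front, before any error path can run. */
	INIT_LIST_HEAD(&b->list);
	b->id = id;
}

static void binding_unbind(struct binding *b)
{
	if (!b)
		return;
	/* Safe for linked, never-linked and already-unlinked bindings. */
	list_del_init(&b->list);
}
```

The trade-off is that the initialization has to happen at allocation time; in exchange the unbind path needs no special case for error-path callers.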
xa_for_each(&binding->bound_rxq_list, xa_idx, rxq) {
if (rxq->binding == binding) {
It seems like defensive programming here too?
On 2024/3/7 6:10, Mina Almasry wrote:
...
+static int netdev_restart_rx_queue(struct net_device *dev, int rxq_idx) +{
void *new_mem;
void *old_mem;
int err;
if (!dev || !dev->netdev_ops)
return -EINVAL;
if (!dev->netdev_ops->ndo_queue_stop ||
!dev->netdev_ops->ndo_queue_mem_free ||
!dev->netdev_ops->ndo_queue_mem_alloc ||
!dev->netdev_ops->ndo_queue_start)
return -EOPNOTSUPP;
new_mem = dev->netdev_ops->ndo_queue_mem_alloc(dev, rxq_idx);
if (!new_mem)
return -ENOMEM;
err = dev->netdev_ops->ndo_queue_stop(dev, rxq_idx, &old_mem);
if (err)
goto err_free_new_mem;
err = dev->netdev_ops->ndo_queue_start(dev, rxq_idx, new_mem);
if (err)
goto err_start_queue;
dev->netdev_ops->ndo_queue_mem_free(dev, old_mem);
return 0;
+err_start_queue:
dev->netdev_ops->ndo_queue_start(dev, rxq_idx, old_mem);
It might be worth mentioning why starting the queue with old_mem will always succeed here, as the return value is ignored.
So the old queue, we stopped it, and if we fail to bring up the new queue, then we want to start the old queue back up to get the queue back to a workable state.
I don't see what we can do to recover if restarting the old queue fails. Seems like it should be a requirement that the driver tries as much as possible to keep the old queue restartable.
Is it possible that 'old_mem' leaks if the driver fails to restart the old queue? How does the driver handle a firmware command failure in ndo_queue_start()? It seems a little tricky to implement.
I'm not sure what we can do to meaningfully recover from a failure to restart the old queue, except log it so the error is visible. In theory, because we have not modified any queue configuration, restarting it should be straightforward, but since it's dealing with hardware, any failure is possible.
Yes, we may need clear semantics for how the driver should implement the above interface. For example, should the driver free the memory when it fails to start a queue, or should it restart the queue when it fails to stop a queue? Otherwise we may have different drivers implementing different behavior.
From the discussion you mentioned below, does it make sense to model this on the rdma subsystem, using create_queue/modify_queue/destroy_queue semantics instead?
I can improve this by at least logging or warning if restarting the old queue fails.
Also, the semantics of the above function seem odd: it not only restarts the rx queue but also frees and allocates memory, although the name only suggests 'restart'. I am a little afraid that it may conflict with a future use case where the user only needs the 'restart' part; perhaps rename it to something more appropriate.
Oh, what we want here is just the 'restart' part. However, Jakub mandates that if you restart a queue (or a driver), you do it like this, hence the slightly more complicated implementation.
https://patchwork.kernel.org/project/netdevbpf/patch/20231106024413.2801438-... https://lore.kernel.org/netdev/20230815171638.4c057dcd@kernel.org/
Thanks for the link.
I like David's idea of "a more generic design where H/W queues are created and destroyed - e.g., queues unique to a process which makes the cleanup so much easier.", but it seems like a lot of work for networking to implement that for now.
On Mon, 4 Mar 2024 18:01:40 -0800 Mina Almasry wrote:
+	if (!dev || !dev->netdev_ops)
+		return -EINVAL;

too defensive

+	if (!dev->netdev_ops->ndo_queue_stop ||
+	    !dev->netdev_ops->ndo_queue_mem_free ||
+	    !dev->netdev_ops->ndo_queue_mem_alloc ||
+	    !dev->netdev_ops->ndo_queue_start)
+		return -EOPNOTSUPP;
+	new_mem = dev->netdev_ops->ndo_queue_mem_alloc(dev, rxq_idx);
+	if (!new_mem)
+		return -ENOMEM;
+	err = dev->netdev_ops->ndo_queue_stop(dev, rxq_idx, &old_mem);
+	if (err)
+		goto err_free_new_mem;
+	err = dev->netdev_ops->ndo_queue_start(dev, rxq_idx, new_mem);
+	if (err)
+		goto err_start_queue;
+	dev->netdev_ops->ndo_queue_mem_free(dev, old_mem);

nice :)

+	rxq = __netif_get_rx_queue(dev, rxq_idx);
+	if (rxq->binding)

nit: a few places have an empty line between call and error check

+		return -EEXIST;
+	if (!capable(CAP_NET_ADMIN))
+		return -EPERM;

this can be a flag on the netlink policy, no?
flags: [ admin-perm ]
on the op

+	dmabuf = dma_buf_get(dmabuf_fd);
+	if (IS_ERR_OR_NULL(dmabuf))
+		return -EBADFD;
+	hdr = genlmsg_put(rsp, info->snd_portid, info->snd_seq,

genlmsg_iput()

+static int netdev_netlink_notify(struct notifier_block *nb, unsigned long state,
+				 void *_notify)
+{
+	struct netlink_notify *notify = _notify;
+	struct netdev_dmabuf_binding *rbinding;
+	if (state != NETLINK_URELEASE || notify->protocol != NETLINK_GENERIC)
+		return NOTIFY_DONE;
+	rtnl_lock();
+	list_for_each_entry(rbinding, &netdev_rbinding_list, list) {
+		if (rbinding->owner_nlportid == notify->portid) {
+			netdev_unbind_dmabuf(rbinding);
+			break;
+		}
+	}
+	rtnl_unlock();
+	return NOTIFY_OK;
+}
While you were not looking we added three new members to the netlink family:
 * @sock_priv_size: the size of per-socket private memory
 * @sock_priv_init: the per-socket private memory initializer
 * @sock_priv_destroy: the per-socket private memory destructor
You should be able to associate state with a netlink socket and have it auto-destroyed if the socket closes. LMK if that doesn't work for you, I was hoping it would fit nicely.
I just realized now that the code gen doesn't know how to spit those members out, but I'll send a patch tomorrow, you can hack it manually until that gets in.
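As a rough illustration of the suggestion above, the following userspace sketch shows how per-socket private memory with an init/destroy pair can replace a release notifier: state hung off the socket is torn down automatically when the socket closes. Only the three member names come from the kernel-doc quoted above. The callback signatures, the eager allocation at open time, and all mini_* and demo_* names are assumptions made for this sketch, not the genetlink implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical miniature of a family with per-socket private memory. */
struct mini_family {
	size_t sock_priv_size;
	void (*sock_priv_init)(void *priv);
	void (*sock_priv_destroy)(void *priv);
};

struct mini_sock {
	const struct mini_family *family;
	void *priv;
};

static int mini_sock_open(struct mini_sock *sk, const struct mini_family *family)
{
	assert(family != NULL);

	sk->family = family;
	sk->priv = calloc(1, family->sock_priv_size);
	if (!sk->priv)
		return -1;
	if (family->sock_priv_init)
		family->sock_priv_init(sk->priv);
	return 0;
}

/* Closing the socket runs the destructor, so no notifier is needed. */
static void mini_sock_close(struct mini_sock *sk)
{
	if (!sk->priv)
		return;
	if (sk->family->sock_priv_destroy)
		sk->family->sock_priv_destroy(sk->priv);
	free(sk->priv);
	sk->priv = NULL;
}

/* Demo state: the bindings owned by one socket. */
struct demo_binding {
	struct demo_binding *next;
	int id;
};

struct demo_sock_state {
	struct demo_binding *bindings;
};

static int demo_unbound_count;	/* bindings released by the destructor */

static void demo_priv_init(void *priv)
{
	struct demo_sock_state *state = priv;

	state->bindings = NULL;
}

static void demo_priv_destroy(void *priv)
{
	struct demo_sock_state *state = priv;

	while (state->bindings) {
		struct demo_binding *b = state->bindings;

		state->bindings = b->next;
		free(b);
		demo_unbound_count++;
	}
}

static int demo_bind(struct mini_sock *sk, int id)
{
	struct demo_sock_state *state = sk->priv;
	struct demo_binding *b = malloc(sizeof(*b));

	if (!b)
		return -1;
	b->id = id;
	b->next = state->bindings;
	state->bindings = b;
	return 0;
}

static const struct mini_family demo_family = {
	.sock_priv_size = sizeof(struct demo_sock_state),
	.sock_priv_init = demo_priv_init,
	.sock_priv_destroy = demo_priv_destroy,
};
```

In this model the owner lookup by port id from the notifier disappears: each socket only ever sees its own bindings.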
Implement netdev devmem allocator. The allocator takes a given struct netdev_dmabuf_binding as input and allocates net_iov from that binding.
The allocation simply delegates to the binding's genpool for the allocation logic and wraps the returned memory region in a net_iov struct.
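The address-to-net_iov mapping described above can be sketched in userspace as follows. This is an illustration only: the struct layouts are reduced to the fields the arithmetic needs, DEMO_PAGE_SHIFT is assumed to be 12, the genpool itself is left out, and the bounds check in addr_to_niov() is an addition for the sketch that the patch does not have.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_PAGE_SHIFT	12
#define DEMO_PAGE_SIZE	((uint64_t)1 << DEMO_PAGE_SHIFT)

struct chunk_owner;

/* Reduced stand-in for struct net_iov: only the back-pointer is kept. */
struct net_iov_demo {
	struct chunk_owner *owner;
};

/* Reduced stand-in for the genpool chunk owner. */
struct chunk_owner {
	uint64_t base_dma_addr;		/* DMA address of the chunk's first page */
	struct net_iov_demo *niovs;	/* one entry per page in the chunk */
	size_t num_niovs;
};

static void chunk_owner_init(struct chunk_owner *owner, uint64_t base,
			     struct net_iov_demo *niovs, size_t num)
{
	size_t i;

	assert(owner != NULL);
	owner->base_dma_addr = base;
	owner->niovs = niovs;
	owner->num_niovs = num;
	for (i = 0; i < num; i++)
		niovs[i].owner = owner;
}

/* Allocation side: turn an address handed out by the pool into its niov. */
static struct net_iov_demo *addr_to_niov(struct chunk_owner *owner,
					 uint64_t dma_addr)
{
	uint64_t index;

	if (dma_addr < owner->base_dma_addr)
		return NULL;
	index = (dma_addr - owner->base_dma_addr) / DEMO_PAGE_SIZE;
	if (index >= owner->num_niovs)
		return NULL;
	return &owner->niovs[index];
}

/* Free side: recover the index and the address from the niov alone. */
static size_t niov_idx(const struct net_iov_demo *niov)
{
	return (size_t)(niov - niov->owner->niovs);
}

static uint64_t niov_dma_addr(const struct net_iov_demo *niov)
{
	return niov->owner->base_dma_addr +
	       ((uint64_t)niov_idx(niov) << DEMO_PAGE_SHIFT);
}
```

The two directions are inverses of each other, which is what lets the free path hand the same address back to the pool without storing it in the niov.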
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Kaiyuan Zhang <kaiyuanz@google.com>
Signed-off-by: Mina Almasry <almasrymina@google.com>
---
v6:
- Add comment on net_iov_dma_addr to explain why we don't use niov->dma_addr (Pavel)
- Refactor new functions into net/core/devmem.c (Pavel)

v1:
- Rename devmem -> dmabuf (David).

---
 include/net/devmem.h | 12 ++++++++++++
 include/net/netmem.h | 40 ++++++++++++++++++++++++++++++++++++++++
 net/core/devmem.c    | 38 ++++++++++++++++++++++++++++++++++++++
 3 files changed, 90 insertions(+)
diff --git a/include/net/devmem.h b/include/net/devmem.h
index 85ccbbe84c65..4207adadc2bb 100644
--- a/include/net/devmem.h
+++ b/include/net/devmem.h
@@ -67,6 +67,8 @@ struct dmabuf_genpool_chunk_owner {
 };
 
 #ifdef CONFIG_DMA_SHARED_BUFFER
+struct net_iov *netdev_alloc_dmabuf(struct netdev_dmabuf_binding *binding);
+void netdev_free_dmabuf(struct net_iov *ppiov);
 void __netdev_dmabuf_binding_free(struct netdev_dmabuf_binding *binding);
 int netdev_bind_dmabuf(struct net_device *dev, unsigned int dmabuf_fd,
 		       struct netdev_dmabuf_binding **out);
@@ -74,6 +76,16 @@ void netdev_unbind_dmabuf(struct netdev_dmabuf_binding *binding);
 int netdev_bind_dmabuf_to_queue(struct net_device *dev, u32 rxq_idx,
 				struct netdev_dmabuf_binding *binding);
 #else
+static inline struct net_iov *
+netdev_alloc_dmabuf(struct netdev_dmabuf_binding *binding)
+{
+	return NULL;
+}
+
+static inline void netdev_free_dmabuf(struct net_iov *ppiov)
+{
+}
+
 static inline void
 __netdev_dmabuf_binding_free(struct netdev_dmabuf_binding *binding)
 {
diff --git a/include/net/netmem.h b/include/net/netmem.h
index 72e932a1a948..ca17ea1d33f8 100644
--- a/include/net/netmem.h
+++ b/include/net/netmem.h
@@ -14,8 +14,48 @@
 
 struct net_iov {
 	struct dmabuf_genpool_chunk_owner *owner;
+	unsigned long dma_addr;
 };
 
+static inline struct dmabuf_genpool_chunk_owner *
+net_iov_owner(const struct net_iov *niov)
+{
+	return niov->owner;
+}
+
+static inline unsigned int net_iov_idx(const struct net_iov *niov)
+{
+	return niov - net_iov_owner(niov)->niovs;
+}
+
+/* This returns the absolute dma_addr_t calculated from
+ * net_iov_owner(niov)->owner->base_dma_addr, not the page_pool-owned
+ * niov->dma_addr.
+ *
+ * The absolute dma_addr_t is a dma_addr_t that is always uncompressed.
+ *
+ * The page_pool-owner niov->dma_addr is the absolute dma_addr compressed into
+ * an unsigned long. Special handling is done when the unsigned long is 32-bit
+ * but the dma_addr_t is 64-bit.
+ *
+ * In general code looking for the dma_addr_t should use net_iov_dma_addr(),
+ * while page_pool code looking for the unsigned long dma_addr which mirrors
+ * the field in struct page should use niov->dma_addr.
+ */
+static inline dma_addr_t net_iov_dma_addr(const struct net_iov *niov)
+{
+	struct dmabuf_genpool_chunk_owner *owner = net_iov_owner(niov);
+
+	return owner->base_dma_addr +
+	       ((dma_addr_t)net_iov_idx(niov) << PAGE_SHIFT);
+}
+
+static inline struct netdev_dmabuf_binding *
+net_iov_binding(const struct net_iov *niov)
+{
+	return net_iov_owner(niov)->binding;
+}
+
 /* netmem */
 
 /**
diff --git a/net/core/devmem.c b/net/core/devmem.c
index 779ad990971e..57d3a1f223ef 100644
--- a/net/core/devmem.c
+++ b/net/core/devmem.c
@@ -93,6 +93,44 @@ static int netdev_restart_rx_queue(struct net_device *dev, int rxq_idx)
 	return err;
 }
 
+struct net_iov *netdev_alloc_dmabuf(struct netdev_dmabuf_binding *binding)
+{
+	struct dmabuf_genpool_chunk_owner *owner;
+	unsigned long dma_addr;
+	struct net_iov *niov;
+	ssize_t offset;
+	ssize_t index;
+
+	dma_addr = gen_pool_alloc_owner(binding->chunk_pool, PAGE_SIZE,
+					(void **)&owner);
+	if (!dma_addr)
+		return NULL;
+
+	offset = dma_addr - owner->base_dma_addr;
+	index = offset / PAGE_SIZE;
+	niov = &owner->niovs[index];
+
+	niov->pp_magic = 0;
+	niov->pp = NULL;
+	niov->dma_addr = 0;
+	atomic_long_set(&niov->pp_ref_count, 0);
+
+	netdev_dmabuf_binding_get(binding);
+
+	return niov;
+}
+
+void netdev_free_dmabuf(struct net_iov *niov)
+{
+	struct netdev_dmabuf_binding *binding = net_iov_binding(niov);
+	unsigned long dma_addr = net_iov_dma_addr(niov);
+
+	if (gen_pool_has_addr(binding->chunk_pool, dma_addr, PAGE_SIZE))
+		gen_pool_free(binding->chunk_pool, dma_addr, PAGE_SIZE);
+
+	netdev_dmabuf_binding_put(binding);
+}
+
 /* Protected by rtnl_lock() */
 static DEFINE_XARRAY_FLAGS(netdev_dmabuf_bindings, XA_FLAGS_ALLOC1);
 
Abstract the memory type from the page_pool so we can later add support for new memory types. Convert the page_pool to use the new netmem type abstraction, rather than use struct page directly.
As of this patch the netmem type is a no-op abstraction: it's always a struct page underneath. All the page pool internals are converted to use struct netmem instead of struct page, and the page pool now exports 2 APIs:
1. The existing struct page API.
2. The new struct netmem API.
Keeping the existing API is transitional; we do not want to refactor all the current drivers using the page pool at once.
The netmem abstraction is currently a no-op. The page_pool uses page_to_netmem() to convert allocated pages to netmem, and uses netmem_to_page() to convert the netmem back to pages to pass to mm APIs.
Follow up patches to this series add non-paged netmem support to the page_pool. This change is factored out on its own to limit the code churn to this 1 patch, for ease of code review.
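The shape of the conversion can be sketched in userspace as follows: an opaque integer handle that is always a page pointer underneath, a handle-based API, and the existing page API kept as a thin wrapper around it. All demo_* names are hypothetical, and uintptr_t is used here for portability of the sketch; it is not meant to describe how netmem_ref itself is declared.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Reduced stand-in for struct page: only the pool refcount is kept. */
struct demo_page {
	long pp_ref_count;
};

/* Opaque handle. In this sketch it is always a page pointer underneath. */
typedef uintptr_t demo_netmem_ref;

static inline demo_netmem_ref demo_page_to_netmem(struct demo_page *page)
{
	return (demo_netmem_ref)page;
}

static inline struct demo_page *demo_netmem_to_page(demo_netmem_ref netmem)
{
	return (struct demo_page *)netmem;
}

/* Handle-based API: the pool internals only ever see the handle. */
static inline void demo_ref_netmem(demo_netmem_ref netmem)
{
	assert(netmem != 0);
	demo_netmem_to_page(netmem)->pp_ref_count++;
}

/* Existing page API, kept for drivers and reduced to a wrapper. */
static inline void demo_ref_page(struct demo_page *page)
{
	demo_ref_netmem(demo_page_to_netmem(page));
}
```

Because failure is signalled by a zero handle rather than a NULL pointer, callers of the handle-based functions test for 0, which matches the "return 0" changes in the helpers.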
Signed-off-by: Mina Almasry <almasrymina@google.com>
---
v6:
- Rebased on top of the merged netmem_ref type.
---
 include/linux/skbuff.h           |   4 +-
 include/net/netmem.h             |  15 ++
 include/net/page_pool/helpers.h  | 122 +++++++++----
 include/net/page_pool/types.h    |  17 +-
 include/trace/events/page_pool.h |  29 +--
 net/bpf/test_run.c               |   5 +-
 net/core/page_pool.c             | 303 +++++++++++++++++--------------
 net/core/skbuff.c                |   7 +-
 8 files changed, 302 insertions(+), 200 deletions(-)
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h index d577e0bee18d..ca29d1fd4561 100644 --- a/include/linux/skbuff.h +++ b/include/linux/skbuff.h @@ -3504,7 +3504,7 @@ int skb_pp_cow_data(struct page_pool *pool, struct sk_buff **pskb, unsigned int headroom); int skb_cow_data_for_xdp(struct page_pool *pool, struct sk_buff **pskb, struct bpf_prog *prog); -bool napi_pp_put_page(struct page *page, bool napi_safe); +bool napi_pp_put_page(netmem_ref netmem, bool napi_safe);
static inline void napi_frag_unref(skb_frag_t *frag, bool recycle, bool napi_safe) @@ -3512,7 +3512,7 @@ napi_frag_unref(skb_frag_t *frag, bool recycle, bool napi_safe) struct page *page = skb_frag_page(frag);
#ifdef CONFIG_PAGE_POOL - if (recycle && napi_pp_put_page(page, napi_safe)) + if (recycle && napi_pp_put_page(page_to_netmem(page), napi_safe)) return; #endif put_page(page); diff --git a/include/net/netmem.h b/include/net/netmem.h index ca17ea1d33f8..21f53b29e5fe 100644 --- a/include/net/netmem.h +++ b/include/net/netmem.h @@ -88,4 +88,19 @@ static inline netmem_ref page_to_netmem(struct page *page) return (__force netmem_ref)page; }
+static inline int netmem_ref_count(netmem_ref netmem) +{ + return page_ref_count(netmem_to_page(netmem)); +} + +static inline unsigned long netmem_to_pfn(netmem_ref netmem) +{ + return page_to_pfn(netmem_to_page(netmem)); +} + +static inline netmem_ref netmem_compound_head(netmem_ref netmem) +{ + return page_to_netmem(compound_head(netmem_to_page(netmem))); +} + #endif /* _NET_NETMEM_H */ diff --git a/include/net/page_pool/helpers.h b/include/net/page_pool/helpers.h index 1d397c1a0043..61814f91a458 100644 --- a/include/net/page_pool/helpers.h +++ b/include/net/page_pool/helpers.h @@ -53,6 +53,8 @@ #define _NET_PAGE_POOL_HELPERS_H
#include <net/page_pool/types.h> +#include <net/net_debug.h> +#include <net/netmem.h>
#ifdef CONFIG_PAGE_POOL_STATS /* Deprecated driver-facing API, use netlink instead */ @@ -101,7 +103,7 @@ static inline struct page *page_pool_dev_alloc_pages(struct page_pool *pool) * Get a page fragment from the page allocator or page_pool caches. * * Return: - * Return allocated page fragment, otherwise return NULL. + * Return allocated page fragment, otherwise return 0. */ static inline struct page *page_pool_dev_alloc_frag(struct page_pool *pool, unsigned int *offset, @@ -112,22 +114,22 @@ static inline struct page *page_pool_dev_alloc_frag(struct page_pool *pool, return page_pool_alloc_frag(pool, offset, size, gfp); }
-static inline struct page *page_pool_alloc(struct page_pool *pool, - unsigned int *offset, - unsigned int *size, gfp_t gfp) +static inline netmem_ref page_pool_alloc(struct page_pool *pool, + unsigned int *offset, + unsigned int *size, gfp_t gfp) { unsigned int max_size = PAGE_SIZE << pool->p.order; - struct page *page; + netmem_ref netmem;
if ((*size << 1) > max_size) { *size = max_size; *offset = 0; - return page_pool_alloc_pages(pool, gfp); + return page_pool_alloc_netmem(pool, gfp); }
- page = page_pool_alloc_frag(pool, offset, *size, gfp); - if (unlikely(!page)) - return NULL; + netmem = page_pool_alloc_frag_netmem(pool, offset, *size, gfp); + if (unlikely(!netmem)) + return 0;
/* There is very likely not enough space for another fragment, so append * the remaining size to the current fragment to avoid truesize @@ -138,7 +140,7 @@ static inline struct page *page_pool_alloc(struct page_pool *pool, pool->frag_offset = max_size; }
- return page; + return netmem; }
/** @@ -152,7 +154,7 @@ static inline struct page *page_pool_alloc(struct page_pool *pool, * utilization and performance penalty. * * Return: - * Return allocated page or page fragment, otherwise return NULL. + * Return allocated page or page fragment, otherwise return 0. */ static inline struct page *page_pool_dev_alloc(struct page_pool *pool, unsigned int *offset, @@ -160,7 +162,7 @@ static inline struct page *page_pool_dev_alloc(struct page_pool *pool, { gfp_t gfp = (GFP_ATOMIC | __GFP_NOWARN);
- return page_pool_alloc(pool, offset, size, gfp); + return netmem_to_page(page_pool_alloc(pool, offset, size, gfp)); }
static inline void *page_pool_alloc_va(struct page_pool *pool, @@ -170,9 +172,10 @@ static inline void *page_pool_alloc_va(struct page_pool *pool, struct page *page;
/* Mask off __GFP_HIGHMEM to ensure we can use page_address() */ - page = page_pool_alloc(pool, &offset, size, gfp & ~__GFP_HIGHMEM); + page = netmem_to_page( + page_pool_alloc(pool, &offset, size, gfp & ~__GFP_HIGHMEM)); if (unlikely(!page)) - return NULL; + return 0;
return page_address(page) + offset; } @@ -187,7 +190,7 @@ static inline void *page_pool_alloc_va(struct page_pool *pool, * it returns va of the allocated page or page fragment. * * Return: - * Return the va for the allocated page or page fragment, otherwise return NULL. + * Return the va for the allocated page or page fragment, otherwise return 0. */ static inline void *page_pool_dev_alloc_va(struct page_pool *pool, unsigned int *size) @@ -210,6 +213,11 @@ inline enum dma_data_direction page_pool_get_dma_dir(struct page_pool *pool) return pool->p.dma_dir; }
+static inline void page_pool_fragment_netmem(netmem_ref netmem, long nr) +{ + atomic_long_set(&netmem_to_page(netmem)->pp_ref_count, nr); +} + /** * page_pool_fragment_page() - split a fresh page into fragments * @page: page to split @@ -230,11 +238,12 @@ inline enum dma_data_direction page_pool_get_dma_dir(struct page_pool *pool) */ static inline void page_pool_fragment_page(struct page *page, long nr) { - atomic_long_set(&page->pp_ref_count, nr); + page_pool_fragment_netmem(page_to_netmem(page), nr); }
-static inline long page_pool_unref_page(struct page *page, long nr) +static inline long page_pool_unref_netmem(netmem_ref netmem, long nr) { + struct page *page = netmem_to_page(netmem); long ret;
/* If nr == pp_ref_count then we have cleared all remaining @@ -277,15 +286,41 @@ static inline long page_pool_unref_page(struct page *page, long nr) return ret; }
+static inline long page_pool_unref_page(struct page *page, long nr) +{ + return page_pool_unref_netmem(page_to_netmem(page), nr); +} + +static inline void page_pool_ref_netmem(netmem_ref netmem) +{ + atomic_long_inc(&netmem_to_page(netmem)->pp_ref_count); +} + static inline void page_pool_ref_page(struct page *page) { - atomic_long_inc(&page->pp_ref_count); + page_pool_ref_netmem(page_to_netmem(page)); }
-static inline bool page_pool_is_last_ref(struct page *page) +static inline bool page_pool_is_last_ref(netmem_ref netmem) { /* If page_pool_unref_page() returns 0, we were the last user */ - return page_pool_unref_page(page, 1) == 0; + return page_pool_unref_netmem(netmem, 1) == 0; +} + +static inline void page_pool_put_netmem(struct page_pool *pool, + netmem_ref netmem, + unsigned int dma_sync_size, + bool allow_direct) +{ + /* When page_pool isn't compiled-in, net/core/xdp.c doesn't + * allow registering MEM_TYPE_PAGE_POOL, but shield linker. + */ +#ifdef CONFIG_PAGE_POOL + if (!page_pool_is_last_ref(netmem)) + return; + + page_pool_put_unrefed_netmem(pool, netmem, dma_sync_size, allow_direct); +#endif }
/** @@ -306,15 +341,15 @@ static inline void page_pool_put_page(struct page_pool *pool, unsigned int dma_sync_size, bool allow_direct) { - /* When page_pool isn't compiled-in, net/core/xdp.c doesn't - * allow registering MEM_TYPE_PAGE_POOL, but shield linker. - */ -#ifdef CONFIG_PAGE_POOL - if (!page_pool_is_last_ref(page)) - return; + page_pool_put_netmem(pool, page_to_netmem(page), dma_sync_size, + allow_direct); +}
- page_pool_put_unrefed_page(pool, page, dma_sync_size, allow_direct); -#endif +static inline void page_pool_put_full_netmem(struct page_pool *pool, + netmem_ref netmem, + bool allow_direct) +{ + page_pool_put_netmem(pool, netmem, -1, allow_direct); }
/** @@ -329,7 +364,7 @@ static inline void page_pool_put_page(struct page_pool *pool, static inline void page_pool_put_full_page(struct page_pool *pool, struct page *page, bool allow_direct) { - page_pool_put_page(pool, page, -1, allow_direct); + page_pool_put_netmem(pool, page_to_netmem(page), -1, allow_direct); }
/** @@ -363,6 +398,18 @@ static inline void page_pool_free_va(struct page_pool *pool, void *va, page_pool_put_page(pool, virt_to_head_page(va), -1, allow_direct); }
+static inline dma_addr_t page_pool_get_dma_addr_netmem(netmem_ref netmem) +{ + struct page *page = netmem_to_page(netmem); + + dma_addr_t ret = page->dma_addr; + + if (PAGE_POOL_32BIT_ARCH_WITH_64BIT_DMA) + ret <<= PAGE_SHIFT; + + return ret; +} + /** * page_pool_get_dma_addr() - Retrieve the stored DMA address. * @page: page allocated from a page pool @@ -372,16 +419,14 @@ static inline void page_pool_free_va(struct page_pool *pool, void *va, */ static inline dma_addr_t page_pool_get_dma_addr(struct page *page) { - dma_addr_t ret = page->dma_addr; - - if (PAGE_POOL_32BIT_ARCH_WITH_64BIT_DMA) - ret <<= PAGE_SHIFT; - - return ret; + return page_pool_get_dma_addr_netmem(page_to_netmem(page)); }
-static inline bool page_pool_set_dma_addr(struct page *page, dma_addr_t addr) +static inline bool page_pool_set_dma_addr_netmem(netmem_ref netmem, + dma_addr_t addr) { + struct page *page = netmem_to_page(netmem); + if (PAGE_POOL_32BIT_ARCH_WITH_64BIT_DMA) { page->dma_addr = addr >> PAGE_SHIFT;
@@ -395,6 +440,11 @@ static inline bool page_pool_set_dma_addr(struct page *page, dma_addr_t addr) return false; }
+static inline bool page_pool_set_dma_addr(struct page *page, dma_addr_t addr) +{ + return page_pool_set_dma_addr_netmem(page_to_netmem(page), addr); +} + static inline bool page_pool_put(struct page_pool *pool) { return refcount_dec_and_test(&pool->user_cnt); diff --git a/include/net/page_pool/types.h b/include/net/page_pool/types.h index ffe5f31fb0da..68a24c5ae827 100644 --- a/include/net/page_pool/types.h +++ b/include/net/page_pool/types.h @@ -40,7 +40,7 @@ #define PP_ALLOC_CACHE_REFILL 64 struct pp_alloc_cache { u32 count; - struct page *cache[PP_ALLOC_CACHE_SIZE]; + netmem_ref cache[PP_ALLOC_CACHE_SIZE]; };
/** @@ -73,7 +73,7 @@ struct page_pool_params { struct_group_tagged(page_pool_params_slow, slow, struct net_device *netdev; /* private: used by test code only */ - void (*init_callback)(struct page *page, void *arg); + void (*init_callback)(netmem_ref netmem, void *arg); void *init_arg; ); }; @@ -131,8 +131,8 @@ struct page_pool_stats { struct memory_provider_ops { int (*init)(struct page_pool *pool); void (*destroy)(struct page_pool *pool); - struct page *(*alloc_pages)(struct page_pool *pool, gfp_t gfp); - bool (*release_page)(struct page_pool *pool, struct page *page); + netmem_ref (*alloc_pages)(struct page_pool *pool, gfp_t gfp); + bool (*release_page)(struct page_pool *pool, netmem_ref netmem); };
struct page_pool { @@ -142,7 +142,7 @@ struct page_pool { bool has_init_callback;
long frag_users; - struct page *frag_page; + netmem_ref frag_page; unsigned int frag_offset; u32 pages_state_hold_cnt;
@@ -214,8 +214,12 @@ struct page_pool { };
struct page *page_pool_alloc_pages(struct page_pool *pool, gfp_t gfp); +netmem_ref page_pool_alloc_netmem(struct page_pool *pool, gfp_t gfp); struct page *page_pool_alloc_frag(struct page_pool *pool, unsigned int *offset, unsigned int size, gfp_t gfp); +netmem_ref page_pool_alloc_frag_netmem(struct page_pool *pool, + unsigned int *offset, unsigned int size, + gfp_t gfp); struct page_pool *page_pool_create(const struct page_pool_params *params); struct page_pool *page_pool_create_percpu(const struct page_pool_params *params, int cpuid); @@ -245,6 +249,9 @@ static inline void page_pool_put_page_bulk(struct page_pool *pool, void **data, } #endif
+void page_pool_put_unrefed_netmem(struct page_pool *pool, netmem_ref netmem, + unsigned int dma_sync_size, + bool allow_direct); void page_pool_put_unrefed_page(struct page_pool *pool, struct page *page, unsigned int dma_sync_size, bool allow_direct); diff --git a/include/trace/events/page_pool.h b/include/trace/events/page_pool.h index 6834356b2d2a..c5b6383ff276 100644 --- a/include/trace/events/page_pool.h +++ b/include/trace/events/page_pool.h @@ -42,51 +42,52 @@ TRACE_EVENT(page_pool_release, TRACE_EVENT(page_pool_state_release,
TP_PROTO(const struct page_pool *pool, - const struct page *page, u32 release), + netmem_ref netmem, u32 release),
- TP_ARGS(pool, page, release), + TP_ARGS(pool, netmem, release),
TP_STRUCT__entry( __field(const struct page_pool *, pool) - __field(const struct page *, page) + __field(netmem_ref, netmem) __field(u32, release) __field(unsigned long, pfn) ),
TP_fast_assign( __entry->pool = pool; - __entry->page = page; + __entry->netmem = netmem; __entry->release = release; - __entry->pfn = page_to_pfn(page); + __entry->pfn = netmem_to_pfn(netmem); ),
- TP_printk("page_pool=%p page=%p pfn=0x%lx release=%u", - __entry->pool, __entry->page, __entry->pfn, __entry->release) + TP_printk("page_pool=%p netmem=%lu pfn=0x%lx release=%u", + __entry->pool, (__force unsigned long)__entry->netmem, + __entry->pfn, __entry->release) );
TRACE_EVENT(page_pool_state_hold,
TP_PROTO(const struct page_pool *pool, - const struct page *page, u32 hold), + netmem_ref netmem, u32 hold),
- TP_ARGS(pool, page, hold), + TP_ARGS(pool, netmem, hold),
TP_STRUCT__entry( __field(const struct page_pool *, pool) - __field(const struct page *, page) + __field(netmem_ref, netmem) __field(u32, hold) __field(unsigned long, pfn) ),
TP_fast_assign( __entry->pool = pool; - __entry->page = page; + __entry->netmem = netmem; __entry->hold = hold; - __entry->pfn = page_to_pfn(page); + __entry->pfn = netmem_to_pfn(netmem); ),
- TP_printk("page_pool=%p page=%p pfn=0x%lx hold=%u", - __entry->pool, __entry->page, __entry->pfn, __entry->hold) + TP_printk("page_pool=%p netmem=%lu pfn=0x%lx hold=%u", + __entry->pool, __entry->netmem, __entry->pfn, __entry->hold) );
TRACE_EVENT(page_pool_update_nid, diff --git a/net/bpf/test_run.c b/net/bpf/test_run.c index 5535f9adc658..bc8f7ab88f86 100644 --- a/net/bpf/test_run.c +++ b/net/bpf/test_run.c @@ -126,9 +126,10 @@ struct xdp_test_data { #define TEST_XDP_FRAME_SIZE (PAGE_SIZE - sizeof(struct xdp_page_head)) #define TEST_XDP_MAX_BATCH 256
-static void xdp_test_run_init_page(struct page *page, void *arg) +static void xdp_test_run_init_page(netmem_ref netmem, void *arg) { - struct xdp_page_head *head = phys_to_virt(page_to_phys(page)); + struct xdp_page_head *head = + phys_to_virt(page_to_phys(netmem_to_page(netmem))); struct xdp_buff *new_ctx, *orig_ctx; u32 headroom = XDP_PACKET_HEADROOM; struct xdp_test_data *xdp = arg; diff --git a/net/core/page_pool.c b/net/core/page_pool.c index fe9de4ecce94..24d5236b2efc 100644 --- a/net/core/page_pool.c +++ b/net/core/page_pool.c @@ -329,19 +329,18 @@ struct page_pool *page_pool_create(const struct page_pool_params *params) } EXPORT_SYMBOL(page_pool_create);
-static void page_pool_return_page(struct page_pool *pool, struct page *page); +static void page_pool_return_page(struct page_pool *pool, netmem_ref netmem);
-noinline -static struct page *page_pool_refill_alloc_cache(struct page_pool *pool) +static noinline netmem_ref page_pool_refill_alloc_cache(struct page_pool *pool) { struct ptr_ring *r = &pool->ring; - struct page *page; + netmem_ref netmem; int pref_nid; /* preferred NUMA node */
/* Quicker fallback, avoid locks when ring is empty */ if (__ptr_ring_empty(r)) { alloc_stat_inc(pool, empty); - return NULL; + return 0; }
/* Softirq guarantee CPU and thus NUMA node is stable. This, @@ -356,56 +355,56 @@ static struct page *page_pool_refill_alloc_cache(struct page_pool *pool)
/* Refill alloc array, but only if NUMA match */ do { - page = __ptr_ring_consume(r); - if (unlikely(!page)) + netmem = (__force netmem_ref)__ptr_ring_consume(r); + if (unlikely(!netmem)) break;
- if (likely(page_to_nid(page) == pref_nid)) { - pool->alloc.cache[pool->alloc.count++] = page; + if (likely(page_to_nid(netmem_to_page(netmem)) == pref_nid)) { + pool->alloc.cache[pool->alloc.count++] = netmem; } else { /* NUMA mismatch; * (1) release 1 page to page-allocator and * (2) break out to fallthrough to alloc_pages_node. * This limit stress on page buddy alloactor. */ - page_pool_return_page(pool, page); + page_pool_return_page(pool, netmem); alloc_stat_inc(pool, waive); - page = NULL; + netmem = 0; break; } } while (pool->alloc.count < PP_ALLOC_CACHE_REFILL);
/* Return last page */ if (likely(pool->alloc.count > 0)) { - page = pool->alloc.cache[--pool->alloc.count]; + netmem = pool->alloc.cache[--pool->alloc.count]; alloc_stat_inc(pool, refill); }
- return page; + return netmem; }
/* fast path */ -static struct page *__page_pool_get_cached(struct page_pool *pool) +static netmem_ref __page_pool_get_cached(struct page_pool *pool) { - struct page *page; + netmem_ref netmem;
/* Caller MUST guarantee safe non-concurrent access, e.g. softirq */ if (likely(pool->alloc.count)) { /* Fast-path */ - page = pool->alloc.cache[--pool->alloc.count]; + netmem = pool->alloc.cache[--pool->alloc.count]; alloc_stat_inc(pool, fast); } else { - page = page_pool_refill_alloc_cache(pool); + netmem = page_pool_refill_alloc_cache(pool); }
- return page; + return netmem; }
static void page_pool_dma_sync_for_device(struct page_pool *pool, - struct page *page, + netmem_ref netmem, unsigned int dma_sync_size) { - dma_addr_t dma_addr = page_pool_get_dma_addr(page); + dma_addr_t dma_addr = page_pool_get_dma_addr_netmem(netmem);
dma_sync_size = min(dma_sync_size, pool->p.max_len); dma_sync_single_range_for_device(pool->p.dev, dma_addr, @@ -413,7 +412,7 @@ static void page_pool_dma_sync_for_device(struct page_pool *pool, pool->p.dma_dir); }
-static bool page_pool_dma_map(struct page_pool *pool, struct page *page) +static bool page_pool_dma_map(struct page_pool *pool, netmem_ref netmem) { dma_addr_t dma;
@@ -422,18 +421,18 @@ static bool page_pool_dma_map(struct page_pool *pool, struct page *page) * into page private data (i.e 32bit cpu with 64bit DMA caps) * This mapping is kept for lifetime of page, until leaving pool. */ - dma = dma_map_page_attrs(pool->p.dev, page, 0, - (PAGE_SIZE << pool->p.order), - pool->p.dma_dir, DMA_ATTR_SKIP_CPU_SYNC | - DMA_ATTR_WEAK_ORDERING); + dma = dma_map_page_attrs(pool->p.dev, netmem_to_page(netmem), 0, + (PAGE_SIZE << pool->p.order), pool->p.dma_dir, + DMA_ATTR_SKIP_CPU_SYNC | + DMA_ATTR_WEAK_ORDERING); if (dma_mapping_error(pool->p.dev, dma)) return false;
- if (page_pool_set_dma_addr(page, dma)) + if (page_pool_set_dma_addr_netmem(netmem, dma)) goto unmap_failed;
if (pool->p.flags & PP_FLAG_DMA_SYNC_DEV) - page_pool_dma_sync_for_device(pool, page, pool->p.max_len); + page_pool_dma_sync_for_device(pool, netmem, pool->p.max_len);
return true;
@@ -445,9 +444,10 @@ static bool page_pool_dma_map(struct page_pool *pool, struct page *page) return false; }
-static void page_pool_set_pp_info(struct page_pool *pool, - struct page *page) +static void page_pool_set_pp_info(struct page_pool *pool, netmem_ref netmem) { + struct page *page = netmem_to_page(netmem); + page->pp = pool; page->pp_magic |= PP_SIGNATURE;
@@ -457,13 +457,15 @@ static void page_pool_set_pp_info(struct page_pool *pool, * is dirtying the same cache line as the page->pp_magic above, so * the overhead is negligible. */ - page_pool_fragment_page(page, 1); + page_pool_fragment_netmem(netmem, 1); if (pool->has_init_callback) - pool->slow.init_callback(page, pool->slow.init_arg); + pool->slow.init_callback(netmem, pool->slow.init_arg); }
-static void page_pool_clear_pp_info(struct page *page) +static void page_pool_clear_pp_info(netmem_ref netmem) { + struct page *page = netmem_to_page(netmem); + page->pp_magic = 0; page->pp = NULL; } @@ -479,34 +481,34 @@ static struct page *__page_pool_alloc_page_order(struct page_pool *pool, return NULL;
if ((pool->p.flags & PP_FLAG_DMA_MAP) && - unlikely(!page_pool_dma_map(pool, page))) { + unlikely(!page_pool_dma_map(pool, page_to_netmem(page)))) { put_page(page); return NULL; }
alloc_stat_inc(pool, slow_high_order); - page_pool_set_pp_info(pool, page); + page_pool_set_pp_info(pool, page_to_netmem(page));
/* Track how many pages are held 'in-flight' */ pool->pages_state_hold_cnt++; - trace_page_pool_state_hold(pool, page, pool->pages_state_hold_cnt); + trace_page_pool_state_hold(pool, page_to_netmem(page), + pool->pages_state_hold_cnt); return page; }
/* slow path */ -noinline -static struct page *__page_pool_alloc_pages_slow(struct page_pool *pool, - gfp_t gfp) +static noinline netmem_ref __page_pool_alloc_pages_slow(struct page_pool *pool, + gfp_t gfp) { const int bulk = PP_ALLOC_CACHE_REFILL; unsigned int pp_flags = pool->p.flags; unsigned int pp_order = pool->p.order; - struct page *page; + netmem_ref netmem; int i, nr_pages;
/* Don't support bulk alloc for high-order pages */ if (unlikely(pp_order)) - return __page_pool_alloc_page_order(pool, gfp); + return page_to_netmem(__page_pool_alloc_page_order(pool, gfp));
/* Unnecessary as alloc cache is empty, but guarantees zero count */ if (unlikely(pool->alloc.count > 0)) @@ -515,60 +517,67 @@ static struct page *__page_pool_alloc_pages_slow(struct page_pool *pool, /* Mark empty alloc.cache slots "empty" for alloc_pages_bulk_array */ memset(&pool->alloc.cache, 0, sizeof(void *) * bulk);
- nr_pages = alloc_pages_bulk_array_node(gfp, pool->p.nid, bulk, - pool->alloc.cache); + nr_pages = alloc_pages_bulk_array_node(gfp, + pool->p.nid, bulk, + (struct page **)pool->alloc.cache); if (unlikely(!nr_pages)) - return NULL; + return 0;
/* Pages have been filled into alloc.cache array, but count is zero and * page element have not been (possibly) DMA mapped. */ for (i = 0; i < nr_pages; i++) { - page = pool->alloc.cache[i]; + netmem = pool->alloc.cache[i]; if ((pp_flags & PP_FLAG_DMA_MAP) && - unlikely(!page_pool_dma_map(pool, page))) { - put_page(page); + unlikely(!page_pool_dma_map(pool, netmem))) { + put_page(netmem_to_page(netmem)); continue; }
- page_pool_set_pp_info(pool, page); - pool->alloc.cache[pool->alloc.count++] = page; + page_pool_set_pp_info(pool, netmem); + pool->alloc.cache[pool->alloc.count++] = netmem; /* Track how many pages are held 'in-flight' */ pool->pages_state_hold_cnt++; - trace_page_pool_state_hold(pool, page, + trace_page_pool_state_hold(pool, netmem, pool->pages_state_hold_cnt); }
/* Return last page */ if (likely(pool->alloc.count > 0)) { - page = pool->alloc.cache[--pool->alloc.count]; + netmem = pool->alloc.cache[--pool->alloc.count]; alloc_stat_inc(pool, slow); } else { - page = NULL; + netmem = 0; }
/* When page just alloc'ed is should/must have refcnt 1. */ - return page; + return netmem; }
/* For using page_pool replace: alloc_pages() API calls, but provide * synchronization guarantee for allocation side. */ -struct page *page_pool_alloc_pages(struct page_pool *pool, gfp_t gfp) +netmem_ref page_pool_alloc_netmem(struct page_pool *pool, gfp_t gfp) { - struct page *page; + netmem_ref netmem;
/* Fast-path: Get a page from cache */ - page = __page_pool_get_cached(pool); - if (page) - return page; + netmem = __page_pool_get_cached(pool); + if (netmem) + return netmem;
/* Slow-path: cache empty, do real allocation */ if (static_branch_unlikely(&page_pool_mem_providers) && pool->mp_ops) - page = pool->mp_ops->alloc_pages(pool, gfp); + netmem = pool->mp_ops->alloc_pages(pool, gfp); else - page = __page_pool_alloc_pages_slow(pool, gfp); - return page; + netmem = __page_pool_alloc_pages_slow(pool, gfp); + return netmem; +} +EXPORT_SYMBOL(page_pool_alloc_netmem); + +struct page *page_pool_alloc_pages(struct page_pool *pool, gfp_t gfp) +{ + return netmem_to_page(page_pool_alloc_netmem(pool, gfp)); } EXPORT_SYMBOL(page_pool_alloc_pages);
@@ -596,8 +605,8 @@ s32 page_pool_inflight(const struct page_pool *pool, bool strict) return inflight; }
-static __always_inline -void __page_pool_release_page_dma(struct page_pool *pool, struct page *page) +static __always_inline void __page_pool_release_page_dma(struct page_pool *pool, + netmem_ref netmem) { dma_addr_t dma;
@@ -607,13 +616,13 @@ void __page_pool_release_page_dma(struct page_pool *pool, struct page *page) */ return;
- dma = page_pool_get_dma_addr(page); + dma = page_pool_get_dma_addr_netmem(netmem);
/* When page is unmapped, it cannot be returned to our pool */ dma_unmap_page_attrs(pool->p.dev, dma, PAGE_SIZE << pool->p.order, pool->p.dma_dir, DMA_ATTR_SKIP_CPU_SYNC | DMA_ATTR_WEAK_ORDERING); - page_pool_set_dma_addr(page, 0); + page_pool_set_dma_addr_netmem(netmem, 0); }
/* Disconnects a page (from a page_pool). API users can have a need @@ -621,26 +630,26 @@ void __page_pool_release_page_dma(struct page_pool *pool, struct page *page) * a regular page (that will eventually be returned to the normal * page-allocator via put_page). */ -void page_pool_return_page(struct page_pool *pool, struct page *page) +void page_pool_return_page(struct page_pool *pool, netmem_ref netmem) { int count; bool put;
put = true; if (static_branch_unlikely(&page_pool_mem_providers) && pool->mp_ops) - put = pool->mp_ops->release_page(pool, page); + put = pool->mp_ops->release_page(pool, netmem); else - __page_pool_release_page_dma(pool, page); + __page_pool_release_page_dma(pool, netmem);
/* This may be the last page returned, releasing the pool, so * it is not safe to reference pool afterwards. */ count = atomic_inc_return_relaxed(&pool->pages_state_release_cnt); - trace_page_pool_state_release(pool, page, count); + trace_page_pool_state_release(pool, netmem, count);
if (put) { - page_pool_clear_pp_info(page); - put_page(page); + page_pool_clear_pp_info(netmem); + put_page(netmem_to_page(netmem)); } /* An optimization would be to call __free_pages(page, pool->p.order) * knowing page is not part of page-cache (thus avoiding a @@ -648,14 +657,14 @@ void page_pool_return_page(struct page_pool *pool, struct page *page) */ }
-static bool page_pool_recycle_in_ring(struct page_pool *pool, struct page *page) +static bool page_pool_recycle_in_ring(struct page_pool *pool, netmem_ref netmem) { int ret; /* BH protection not needed if current is softirq */ if (in_softirq()) - ret = ptr_ring_produce(&pool->ring, page); + ret = ptr_ring_produce(&pool->ring, (__force void *)netmem); else - ret = ptr_ring_produce_bh(&pool->ring, page); + ret = ptr_ring_produce_bh(&pool->ring, (__force void *)netmem);
if (!ret) { recycle_stat_inc(pool, ring); @@ -670,7 +679,7 @@ static bool page_pool_recycle_in_ring(struct page_pool *pool, struct page *page) * * Caller must provide appropriate safe context. */ -static bool page_pool_recycle_in_cache(struct page *page, +static bool page_pool_recycle_in_cache(netmem_ref netmem, struct page_pool *pool) { if (unlikely(pool->alloc.count == PP_ALLOC_CACHE_SIZE)) { @@ -679,14 +688,15 @@ static bool page_pool_recycle_in_cache(struct page *page, }
/* Caller MUST have verified/know (page_ref_count(page) == 1) */ - pool->alloc.cache[pool->alloc.count++] = page; + pool->alloc.cache[pool->alloc.count++] = netmem; recycle_stat_inc(pool, cached); return true; }
-static bool __page_pool_page_can_be_recycled(const struct page *page) +static bool __page_pool_page_can_be_recycled(netmem_ref netmem) { - return page_ref_count(page) == 1 && !page_is_pfmemalloc(page); + return page_ref_count(netmem_to_page(netmem)) == 1 && + !page_is_pfmemalloc(netmem_to_page(netmem)); }
/* If the page refcnt == 1, this will try to recycle the page. @@ -695,8 +705,8 @@ static bool __page_pool_page_can_be_recycled(const struct page *page) * If the page refcnt != 1, then the page will be returned to memory * subsystem. */ -static __always_inline struct page * -__page_pool_put_page(struct page_pool *pool, struct page *page, +static __always_inline netmem_ref +__page_pool_put_page(struct page_pool *pool, netmem_ref netmem, unsigned int dma_sync_size, bool allow_direct) { lockdep_assert_no_hardirq(); @@ -710,19 +720,19 @@ __page_pool_put_page(struct page_pool *pool, struct page *page, * page is NOT reusable when allocated when system is under * some pressure. (page_is_pfmemalloc) */ - if (likely(__page_pool_page_can_be_recycled(page))) { + if (likely(__page_pool_page_can_be_recycled(netmem))) { /* Read barrier done in page_ref_count / READ_ONCE */
if (pool->p.flags & PP_FLAG_DMA_SYNC_DEV) - page_pool_dma_sync_for_device(pool, page, + page_pool_dma_sync_for_device(pool, netmem, dma_sync_size);
if (allow_direct && in_softirq() && - page_pool_recycle_in_cache(page, pool)) - return NULL; + page_pool_recycle_in_cache(netmem, pool)) + return 0;
/* Page found as candidate for recycling */ - return page; + return netmem; } /* Fallback/non-XDP mode: API user have elevated refcnt. * @@ -738,21 +748,30 @@ __page_pool_put_page(struct page_pool *pool, struct page *page, * will be invoking put_page. */ recycle_stat_inc(pool, released_refcnt); - page_pool_return_page(pool, page); + page_pool_return_page(pool, netmem);
- return NULL; + return 0; }
-void page_pool_put_unrefed_page(struct page_pool *pool, struct page *page, - unsigned int dma_sync_size, bool allow_direct) +void page_pool_put_unrefed_netmem(struct page_pool *pool, netmem_ref netmem, + unsigned int dma_sync_size, bool allow_direct) { - page = __page_pool_put_page(pool, page, dma_sync_size, allow_direct); - if (page && !page_pool_recycle_in_ring(pool, page)) { + netmem = + __page_pool_put_page(pool, netmem, dma_sync_size, allow_direct); + if (netmem && !page_pool_recycle_in_ring(pool, netmem)) { /* Cache full, fallback to free pages */ recycle_stat_inc(pool, ring_full); - page_pool_return_page(pool, page); + page_pool_return_page(pool, netmem); } } +EXPORT_SYMBOL(page_pool_put_unrefed_netmem); + +void page_pool_put_unrefed_page(struct page_pool *pool, struct page *page, + unsigned int dma_sync_size, bool allow_direct) +{ + page_pool_put_unrefed_netmem(pool, page_to_netmem(page), dma_sync_size, + allow_direct); +} EXPORT_SYMBOL(page_pool_put_unrefed_page);
/** @@ -777,16 +796,16 @@ void page_pool_put_page_bulk(struct page_pool *pool, void **data, bool in_softirq;
for (i = 0; i < count; i++) { - struct page *page = virt_to_head_page(data[i]); + netmem_ref netmem = page_to_netmem(virt_to_head_page(data[i]));
/* It is not the last user for the page frag case */ - if (!page_pool_is_last_ref(page)) + if (!page_pool_is_last_ref(netmem)) continue;
- page = __page_pool_put_page(pool, page, -1, false); + netmem = __page_pool_put_page(pool, netmem, -1, false); /* Approved for bulk recycling in ptr_ring cache */ - if (page) - data[bulk_len++] = page; + if (netmem) + data[bulk_len++] = (__force void *)netmem; }
if (unlikely(!bulk_len)) @@ -812,100 +831,108 @@ void page_pool_put_page_bulk(struct page_pool *pool, void **data, * since put_page() with refcnt == 1 can be an expensive operation */ for (; i < bulk_len; i++) - page_pool_return_page(pool, data[i]); + page_pool_return_page(pool, (__force netmem_ref)data[i]); } EXPORT_SYMBOL(page_pool_put_page_bulk);
-static struct page *page_pool_drain_frag(struct page_pool *pool, - struct page *page) +static netmem_ref page_pool_drain_frag(struct page_pool *pool, + netmem_ref netmem) { long drain_count = BIAS_MAX - pool->frag_users;
/* Some user is still using the page frag */ - if (likely(page_pool_unref_page(page, drain_count))) - return NULL; + if (likely(page_pool_unref_netmem(netmem, drain_count))) + return 0;
- if (__page_pool_page_can_be_recycled(page)) { + if (__page_pool_page_can_be_recycled(netmem)) { if (pool->p.flags & PP_FLAG_DMA_SYNC_DEV) - page_pool_dma_sync_for_device(pool, page, -1); + page_pool_dma_sync_for_device(pool, netmem, -1);
- return page; + return netmem; }
- page_pool_return_page(pool, page); - return NULL; + page_pool_return_page(pool, netmem); + return 0; }
static void page_pool_free_frag(struct page_pool *pool) { long drain_count = BIAS_MAX - pool->frag_users; - struct page *page = pool->frag_page; + netmem_ref netmem = pool->frag_page;
- pool->frag_page = NULL; + pool->frag_page = 0;
- if (!page || page_pool_unref_page(page, drain_count)) + if (!netmem || page_pool_unref_netmem(netmem, drain_count)) return;
- page_pool_return_page(pool, page); + page_pool_return_page(pool, netmem); }
-struct page *page_pool_alloc_frag(struct page_pool *pool, - unsigned int *offset, - unsigned int size, gfp_t gfp) +netmem_ref page_pool_alloc_frag_netmem(struct page_pool *pool, + unsigned int *offset, unsigned int size, + gfp_t gfp) { unsigned int max_size = PAGE_SIZE << pool->p.order; - struct page *page = pool->frag_page; + netmem_ref netmem = pool->frag_page;
if (WARN_ON(size > max_size)) - return NULL; + return 0;
size = ALIGN(size, dma_get_cache_alignment()); *offset = pool->frag_offset;
- if (page && *offset + size > max_size) { - page = page_pool_drain_frag(pool, page); - if (page) { + if (netmem && *offset + size > max_size) { + netmem = page_pool_drain_frag(pool, netmem); + if (netmem) { alloc_stat_inc(pool, fast); goto frag_reset; } }
- if (!page) { - page = page_pool_alloc_pages(pool, gfp); - if (unlikely(!page)) { - pool->frag_page = NULL; - return NULL; + if (!netmem) { + netmem = page_pool_alloc_netmem(pool, gfp); + if (unlikely(!netmem)) { + pool->frag_page = 0; + return 0; }
- pool->frag_page = page; + pool->frag_page = netmem;
frag_reset: pool->frag_users = 1; *offset = 0; pool->frag_offset = size; - page_pool_fragment_page(page, BIAS_MAX); - return page; + page_pool_fragment_netmem(netmem, BIAS_MAX); + return netmem; }
pool->frag_users++; pool->frag_offset = *offset + size; alloc_stat_inc(pool, fast); - return page; + return netmem; +} +EXPORT_SYMBOL(page_pool_alloc_frag_netmem); + +struct page *page_pool_alloc_frag(struct page_pool *pool, unsigned int *offset, + unsigned int size, gfp_t gfp) +{ + return netmem_to_page(page_pool_alloc_frag_netmem(pool, offset, size, + gfp)); } EXPORT_SYMBOL(page_pool_alloc_frag);
static void page_pool_empty_ring(struct page_pool *pool) { - struct page *page; + netmem_ref netmem;
/* Empty recycle ring */ - while ((page = ptr_ring_consume_bh(&pool->ring))) { + while ((netmem = (__force netmem_ref)ptr_ring_consume_bh(&pool->ring))) { /* Verify the refcnt invariant of cached pages */ - if (!(page_ref_count(page) == 1)) + if (!(page_ref_count(netmem_to_page(netmem)) == 1)) pr_crit("%s() page_pool refcnt %d violation\n", - __func__, page_ref_count(page)); + __func__, netmem_ref_count(netmem));
- page_pool_return_page(pool, page); + page_pool_return_page(pool, netmem); } }
@@ -927,7 +954,7 @@ static void __page_pool_destroy(struct page_pool *pool)
static void page_pool_empty_alloc_cache_once(struct page_pool *pool) { - struct page *page; + netmem_ref netmem;
if (pool->destroy_cnt) return; @@ -937,8 +964,8 @@ static void page_pool_empty_alloc_cache_once(struct page_pool *pool) * call concurrently. */ while (pool->alloc.count) { - page = pool->alloc.cache[--pool->alloc.count]; - page_pool_return_page(pool, page); + netmem = pool->alloc.cache[--pool->alloc.count]; + page_pool_return_page(pool, netmem); } }
@@ -1044,15 +1071,15 @@ EXPORT_SYMBOL(page_pool_destroy); /* Caller must provide appropriate safe context, e.g. NAPI. */ void page_pool_update_nid(struct page_pool *pool, int new_nid) { - struct page *page; + netmem_ref netmem;
trace_page_pool_update_nid(pool, new_nid); pool->p.nid = new_nid;
/* Flush pool alloc cache, as refill will check NUMA node */ while (pool->alloc.count) { - page = pool->alloc.cache[--pool->alloc.count]; - page_pool_return_page(pool, page); + netmem = pool->alloc.cache[--pool->alloc.count]; + page_pool_return_page(pool, netmem); } } EXPORT_SYMBOL(page_pool_update_nid); diff --git a/net/core/skbuff.c b/net/core/skbuff.c index 1f918e602bc4..e1118b637085 100644 --- a/net/core/skbuff.c +++ b/net/core/skbuff.c @@ -1006,8 +1006,9 @@ int skb_cow_data_for_xdp(struct page_pool *pool, struct sk_buff **pskb, EXPORT_SYMBOL(skb_cow_data_for_xdp);
#if IS_ENABLED(CONFIG_PAGE_POOL) -bool napi_pp_put_page(struct page *page, bool napi_safe) +bool napi_pp_put_page(netmem_ref netmem, bool napi_safe) { + struct page *page = netmem_to_page(netmem); bool allow_direct = false; struct page_pool *pp;
@@ -1044,7 +1045,7 @@ bool napi_pp_put_page(struct page *page, bool napi_safe) * The page will be returned to the pool here regardless of the * 'flipped' fragment being in use or not. */ - page_pool_put_full_page(pp, page, allow_direct); + page_pool_put_full_netmem(pp, page_to_netmem(page), allow_direct);
return true; } @@ -1055,7 +1056,7 @@ static bool skb_pp_recycle(struct sk_buff *skb, void *data, bool napi_safe) { if (!IS_ENABLED(CONFIG_PAGE_POOL) || !skb->pp_recycle) return false; - return napi_pp_put_page(virt_to_page(data), napi_safe); + return napi_pp_put_page(page_to_netmem(virt_to_page(data)), napi_safe); }
/**
Hi Mina,
I recommend you cc linux-mm and Matthew Wilcox on these two patches also.
David
On Mon, Mar 4, 2024 at 6:02 PM Mina Almasry <almasrymina@google.com> wrote:
Abstract the memory type from the page_pool so we can later add support for new memory types. Convert the page_pool to use the new netmem type abstraction, rather than use struct page directly.
As of this patch the netmem type is a no-op abstraction: it's always a struct page underneath. All the page pool internals are converted to use netmem_ref instead of struct page, and the page pool now exports 2 APIs:
- The existing struct page API.
- The new netmem_ref API.
Keeping the existing API is transitional; we do not want to refactor all the current drivers using the page pool at once.
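The dual-API arrangement described above can be sketched in plain userspace C. This is only an illustration under stated assumptions: `struct page` is a stand-in, `netmem_ref` is modelled as `uintptr_t` rather than the kernel's bitwise type, and `struct toy_pool`, `toy_pool_alloc_netmem()` and `toy_pool_alloc_pages()` are hypothetical names that do not exist in the series. It shows the legacy struct page entry point reduced to a thin wrapper around the netmem one.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for the kernel's struct page (assumption for this sketch). */
struct page {
	int id;
};

/* Opaque handle; modelled here as an integer wide enough for a pointer. */
typedef uintptr_t netmem_ref;

static inline netmem_ref page_to_netmem(struct page *page)
{
	return (netmem_ref)page;
}

static inline struct page *netmem_to_page(netmem_ref netmem)
{
	return (struct page *)netmem;
}

/* Hypothetical toy pool that can hand out a single spare page. */
struct toy_pool {
	struct page *spare;
};

/* New-style API: internals and new callers deal in netmem_ref only. */
static netmem_ref toy_pool_alloc_netmem(struct toy_pool *pool)
{
	struct page *page = pool->spare;

	pool->spare = NULL;
	return page_to_netmem(page);	/* 0 when the pool is empty */
}

/* Legacy API: kept for existing callers, implemented via the new one. */
static struct page *toy_pool_alloc_pages(struct toy_pool *pool)
{
	return netmem_to_page(toy_pool_alloc_netmem(pool));
}
```

A caller still using the struct page entry point sees the same behaviour as before, including a NULL return when nothing is available, while the allocation logic lives in one place.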
The netmem abstraction is currently a no-op. The page_pool uses page_to_netmem() to convert allocated pages to netmem, and uses netmem_to_page() to convert the netmem back to pages to pass to mm APIs.
Follow-up patches in this series add non-paged netmem support to the page_pool. This change is factored out on its own to limit the code churn to this 1 patch, for ease of code review.
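As a rough userspace illustration of the no-op conversion, the sketch below round-trips a page through page_to_netmem() and netmem_to_page() and delegates a helper to the page-based version. The `struct page` layout, the `uintptr_t` model of `netmem_ref`, and the local `page_ref_count()` are stand-ins assumed for this example, not the kernel definitions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for struct page; only the field this sketch needs. */
struct page {
	int refcount;
};

/* Assumed model of the handle type: an integer holding the pointer bits. */
typedef uintptr_t netmem_ref;

static inline netmem_ref page_to_netmem(struct page *page)
{
	return (netmem_ref)page;
}

static inline struct page *netmem_to_page(netmem_ref netmem)
{
	return (struct page *)netmem;
}

/* Stand-in for the mm helper of the same name. */
static inline int page_ref_count(const struct page *page)
{
	return page->refcount;
}

/* netmem helper: convert back to a page, then call the page helper. */
static inline int netmem_ref_count(netmem_ref netmem)
{
	return page_ref_count(netmem_to_page(netmem));
}
```

Because the handle carries nothing but the pointer in this model, converting in either direction costs nothing at run time, and a NULL page maps to the value 0 that the converted code tests for.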
Signed-off-by: Mina Almasry <almasrymina@google.com>
Per David Howells' request, I'm forwarding these 2 patches from the series to linux-mm & Matthew Wilcox.
https://lore.kernel.org/netdev/950858.1709622997@warthog.procyon.org.uk/
+linux-mm +Matthew Wilcox +David Howells
v6:
- Rebased on top of the merged netmem_ref type.
 include/linux/skbuff.h           |   4 +-
 include/net/netmem.h             |  15 ++
 include/net/page_pool/helpers.h  | 122 +++++++++----
 include/net/page_pool/types.h    |  17 +-
 include/trace/events/page_pool.h |  29 +--
 net/bpf/test_run.c               |   5 +-
 net/core/page_pool.c             | 303 +++++++++++++++++--------------
 net/core/skbuff.c                |   7 +-
 8 files changed, 302 insertions(+), 200 deletions(-)
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index d577e0bee18d..ca29d1fd4561 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -3504,7 +3504,7 @@ int skb_pp_cow_data(struct page_pool *pool, struct sk_buff **pskb,
 		    unsigned int headroom);
 int skb_cow_data_for_xdp(struct page_pool *pool, struct sk_buff **pskb,
 			 struct bpf_prog *prog);
-bool napi_pp_put_page(struct page *page, bool napi_safe);
+bool napi_pp_put_page(netmem_ref netmem, bool napi_safe);

 static inline void
 napi_frag_unref(skb_frag_t *frag, bool recycle, bool napi_safe)
@@ -3512,7 +3512,7 @@ napi_frag_unref(skb_frag_t *frag, bool recycle, bool napi_safe)
 	struct page *page = skb_frag_page(frag);

 #ifdef CONFIG_PAGE_POOL
-	if (recycle && napi_pp_put_page(page, napi_safe))
+	if (recycle && napi_pp_put_page(page_to_netmem(page), napi_safe))
 		return;
 #endif
 	put_page(page);
diff --git a/include/net/netmem.h b/include/net/netmem.h
index ca17ea1d33f8..21f53b29e5fe 100644
--- a/include/net/netmem.h
+++ b/include/net/netmem.h
@@ -88,4 +88,19 @@ static inline netmem_ref page_to_netmem(struct page *page)
 	return (__force netmem_ref)page;
 }
+static inline int netmem_ref_count(netmem_ref netmem)
+{
+	return page_ref_count(netmem_to_page(netmem));
+}
+
+static inline unsigned long netmem_to_pfn(netmem_ref netmem)
+{
+	return page_to_pfn(netmem_to_page(netmem));
+}
+
+static inline netmem_ref netmem_compound_head(netmem_ref netmem)
+{
+	return page_to_netmem(compound_head(netmem_to_page(netmem)));
+}
+
 #endif /* _NET_NETMEM_H */
diff --git a/include/net/page_pool/helpers.h b/include/net/page_pool/helpers.h
index 1d397c1a0043..61814f91a458 100644
--- a/include/net/page_pool/helpers.h
+++ b/include/net/page_pool/helpers.h
@@ -53,6 +53,8 @@
 #define _NET_PAGE_POOL_HELPERS_H

 #include <net/page_pool/types.h>
+#include <net/net_debug.h>
+#include <net/netmem.h>

 #ifdef CONFIG_PAGE_POOL_STATS
 /* Deprecated driver-facing API, use netlink instead */
@@ -101,7 +103,7 @@ static inline struct page *page_pool_dev_alloc_pages(struct page_pool *pool)
  * Get a page fragment from the page allocator or page_pool caches.
  *
  * Return:
- * Return allocated page fragment, otherwise return NULL.
+ * Return allocated page fragment, otherwise return 0.
  */
 static inline struct page *page_pool_dev_alloc_frag(struct page_pool *pool,
 						    unsigned int *offset,
@@ -112,22 +114,22 @@ static inline struct page *page_pool_dev_alloc_frag(struct page_pool *pool,
 	return page_pool_alloc_frag(pool, offset, size, gfp);
 }

-static inline struct page *page_pool_alloc(struct page_pool *pool,
-					   unsigned int *offset,
-					   unsigned int *size, gfp_t gfp)
+static inline netmem_ref page_pool_alloc(struct page_pool *pool,
+					 unsigned int *offset,
+					 unsigned int *size, gfp_t gfp)
 {
 	unsigned int max_size = PAGE_SIZE << pool->p.order;
-	struct page *page;
+	netmem_ref netmem;

 	if ((*size << 1) > max_size) {
 		*size = max_size;
 		*offset = 0;
-		return page_pool_alloc_pages(pool, gfp);
+		return page_pool_alloc_netmem(pool, gfp);
 	}

-	page = page_pool_alloc_frag(pool, offset, *size, gfp);
-	if (unlikely(!page))
-		return NULL;
+	netmem = page_pool_alloc_frag_netmem(pool, offset, *size, gfp);
+	if (unlikely(!netmem))
+		return 0;

 	/* There is very likely not enough space for another fragment, so append
 	 * the remaining size to the current fragment to avoid truesize
@@ -138,7 +140,7 @@ static inline struct page *page_pool_alloc(struct page_pool *pool,
 		pool->frag_offset = max_size;
 	}

-	return page;
+	return netmem;
 }
/** @@ -152,7 +154,7 @@ static inline struct page *page_pool_alloc(struct page_pool *pool,
- utilization and performance penalty.
- Return:
- Return allocated page or page fragment, otherwise return NULL.
*/
- Return allocated page or page fragment, otherwise return 0.
static inline struct page *page_pool_dev_alloc(struct page_pool *pool, unsigned int *offset, @@ -160,7 +162,7 @@ static inline struct page *page_pool_dev_alloc(struct page_pool *pool, { gfp_t gfp = (GFP_ATOMIC | __GFP_NOWARN);
return page_pool_alloc(pool, offset, size, gfp);
return netmem_to_page(page_pool_alloc(pool, offset, size, gfp));
}
static inline void *page_pool_alloc_va(struct page_pool *pool, @@ -170,9 +172,10 @@ static inline void *page_pool_alloc_va(struct page_pool *pool, struct page *page;
/* Mask off __GFP_HIGHMEM to ensure we can use page_address() */
page = page_pool_alloc(pool, &offset, size, gfp & ~__GFP_HIGHMEM);
page = netmem_to_page(
page_pool_alloc(pool, &offset, size, gfp & ~__GFP_HIGHMEM)); if (unlikely(!page))
return NULL;
return 0; return page_address(page) + offset;
} @@ -187,7 +190,7 @@ static inline void *page_pool_alloc_va(struct page_pool *pool,
- it returns va of the allocated page or page fragment.
- Return:
- Return the va for the allocated page or page fragment, otherwise return NULL.
*/
- Return the va for the allocated page or page fragment, otherwise return 0.
static inline void *page_pool_dev_alloc_va(struct page_pool *pool, unsigned int *size) @@ -210,6 +213,11 @@ inline enum dma_data_direction page_pool_get_dma_dir(struct page_pool *pool) return pool->p.dma_dir; }
+static inline void page_pool_fragment_netmem(netmem_ref netmem, long nr) +{
atomic_long_set(&netmem_to_page(netmem)->pp_ref_count, nr);
+}
/**
- page_pool_fragment_page() - split a fresh page into fragments
- @page: page to split
@@ -230,11 +238,12 @@ inline enum dma_data_direction page_pool_get_dma_dir(struct page_pool *pool) */ static inline void page_pool_fragment_page(struct page *page, long nr) {
atomic_long_set(&page->pp_ref_count, nr);
page_pool_fragment_netmem(page_to_netmem(page), nr);
}
-static inline long page_pool_unref_page(struct page *page, long nr) +static inline long page_pool_unref_netmem(netmem_ref netmem, long nr) {
struct page *page = netmem_to_page(netmem); long ret; /* If nr == pp_ref_count then we have cleared all remaining
@@ -277,15 +286,41 @@ static inline long page_pool_unref_page(struct page *page, long nr) return ret; }
+static inline long page_pool_unref_page(struct page *page, long nr) +{
return page_pool_unref_netmem(page_to_netmem(page), nr);
+}
+static inline void page_pool_ref_netmem(netmem_ref netmem) +{
atomic_long_inc(&netmem_to_page(netmem)->pp_ref_count);
+}
static inline void page_pool_ref_page(struct page *page) {
atomic_long_inc(&page->pp_ref_count);
page_pool_ref_netmem(page_to_netmem(page));
}
-static inline bool page_pool_is_last_ref(struct page *page) +static inline bool page_pool_is_last_ref(netmem_ref netmem) { /* If page_pool_unref_page() returns 0, we were the last user */
return page_pool_unref_page(page, 1) == 0;
return page_pool_unref_netmem(netmem, 1) == 0;
+}
+static inline void page_pool_put_netmem(struct page_pool *pool,
netmem_ref netmem,
unsigned int dma_sync_size,
bool allow_direct)
+{
/* When page_pool isn't compiled-in, net/core/xdp.c doesn't
* allow registering MEM_TYPE_PAGE_POOL, but shield linker.
*/
+#ifdef CONFIG_PAGE_POOL
if (!page_pool_is_last_ref(netmem))
return;
page_pool_put_unrefed_netmem(pool, netmem, dma_sync_size, allow_direct);
+#endif }
/** @@ -306,15 +341,15 @@ static inline void page_pool_put_page(struct page_pool *pool, unsigned int dma_sync_size, bool allow_direct) {
/* When page_pool isn't compiled-in, net/core/xdp.c doesn't
* allow registering MEM_TYPE_PAGE_POOL, but shield linker.
*/
-#ifdef CONFIG_PAGE_POOL
if (!page_pool_is_last_ref(page))
return;
page_pool_put_netmem(pool, page_to_netmem(page), dma_sync_size,
allow_direct);
+}
page_pool_put_unrefed_page(pool, page, dma_sync_size, allow_direct);
-#endif +static inline void page_pool_put_full_netmem(struct page_pool *pool,
netmem_ref netmem,
bool allow_direct)
+{
page_pool_put_netmem(pool, netmem, -1, allow_direct);
}
/** @@ -329,7 +364,7 @@ static inline void page_pool_put_page(struct page_pool *pool, static inline void page_pool_put_full_page(struct page_pool *pool, struct page *page, bool allow_direct) {
page_pool_put_page(pool, page, -1, allow_direct);
page_pool_put_netmem(pool, page_to_netmem(page), -1, allow_direct);
}
/** @@ -363,6 +398,18 @@ static inline void page_pool_free_va(struct page_pool *pool, void *va, page_pool_put_page(pool, virt_to_head_page(va), -1, allow_direct); }
+static inline dma_addr_t page_pool_get_dma_addr_netmem(netmem_ref netmem) +{
struct page *page = netmem_to_page(netmem);
dma_addr_t ret = page->dma_addr;
if (PAGE_POOL_32BIT_ARCH_WITH_64BIT_DMA)
ret <<= PAGE_SHIFT;
return ret;
+}
/**
- page_pool_get_dma_addr() - Retrieve the stored DMA address.
- @page: page allocated from a page pool
@@ -372,16 +419,14 @@ static inline void page_pool_free_va(struct page_pool *pool, void *va, */ static inline dma_addr_t page_pool_get_dma_addr(struct page *page) {
dma_addr_t ret = page->dma_addr;
if (PAGE_POOL_32BIT_ARCH_WITH_64BIT_DMA)
ret <<= PAGE_SHIFT;
return ret;
return page_pool_get_dma_addr_netmem(page_to_netmem(page));
}
-static inline bool page_pool_set_dma_addr(struct page *page, dma_addr_t addr) +static inline bool page_pool_set_dma_addr_netmem(netmem_ref netmem,
dma_addr_t addr)
{
struct page *page = netmem_to_page(netmem);
if (PAGE_POOL_32BIT_ARCH_WITH_64BIT_DMA) { page->dma_addr = addr >> PAGE_SHIFT;
@@ -395,6 +440,11 @@ static inline bool page_pool_set_dma_addr(struct page *page, dma_addr_t addr) return false; }
+static inline bool page_pool_set_dma_addr(struct page *page, dma_addr_t addr) +{
return page_pool_set_dma_addr_netmem(page_to_netmem(page), addr);
+}
static inline bool page_pool_put(struct page_pool *pool) { return refcount_dec_and_test(&pool->user_cnt); diff --git a/include/net/page_pool/types.h b/include/net/page_pool/types.h index ffe5f31fb0da..68a24c5ae827 100644 --- a/include/net/page_pool/types.h +++ b/include/net/page_pool/types.h @@ -40,7 +40,7 @@ #define PP_ALLOC_CACHE_REFILL 64 struct pp_alloc_cache { u32 count;
struct page *cache[PP_ALLOC_CACHE_SIZE];
netmem_ref cache[PP_ALLOC_CACHE_SIZE];
};
/** @@ -73,7 +73,7 @@ struct page_pool_params { struct_group_tagged(page_pool_params_slow, slow, struct net_device *netdev; /* private: used by test code only */
void (*init_callback)(struct page *page, void *arg);
void (*init_callback)(netmem_ref netmem, void *arg); void *init_arg; );
}; @@ -131,8 +131,8 @@ struct page_pool_stats { struct memory_provider_ops { int (*init)(struct page_pool *pool); void (*destroy)(struct page_pool *pool);
struct page *(*alloc_pages)(struct page_pool *pool, gfp_t gfp);
bool (*release_page)(struct page_pool *pool, struct page *page);
netmem_ref (*alloc_pages)(struct page_pool *pool, gfp_t gfp);
bool (*release_page)(struct page_pool *pool, netmem_ref netmem);
};
struct page_pool { @@ -142,7 +142,7 @@ struct page_pool { bool has_init_callback;
long frag_users;
struct page *frag_page;
netmem_ref frag_page; unsigned int frag_offset; u32 pages_state_hold_cnt;
@@ -214,8 +214,12 @@ struct page_pool { };
struct page *page_pool_alloc_pages(struct page_pool *pool, gfp_t gfp); +netmem_ref page_pool_alloc_netmem(struct page_pool *pool, gfp_t gfp); struct page *page_pool_alloc_frag(struct page_pool *pool, unsigned int *offset, unsigned int size, gfp_t gfp); +netmem_ref page_pool_alloc_frag_netmem(struct page_pool *pool,
unsigned int *offset, unsigned int size,
gfp_t gfp);
struct page_pool *page_pool_create(const struct page_pool_params *params); struct page_pool *page_pool_create_percpu(const struct page_pool_params *params, int cpuid); @@ -245,6 +249,9 @@ static inline void page_pool_put_page_bulk(struct page_pool *pool, void **data, } #endif
+void page_pool_put_unrefed_netmem(struct page_pool *pool, netmem_ref netmem,
unsigned int dma_sync_size,
bool allow_direct);
void page_pool_put_unrefed_page(struct page_pool *pool, struct page *page, unsigned int dma_sync_size, bool allow_direct); diff --git a/include/trace/events/page_pool.h b/include/trace/events/page_pool.h index 6834356b2d2a..c5b6383ff276 100644 --- a/include/trace/events/page_pool.h +++ b/include/trace/events/page_pool.h @@ -42,51 +42,52 @@ TRACE_EVENT(page_pool_release, TRACE_EVENT(page_pool_state_release,
TP_PROTO(const struct page_pool *pool,
const struct page *page, u32 release),
netmem_ref netmem, u32 release),
TP_ARGS(pool, page, release),
TP_ARGS(pool, netmem, release), TP_STRUCT__entry( __field(const struct page_pool *, pool)
__field(const struct page *, page)
__field(netmem_ref, netmem) __field(u32, release) __field(unsigned long, pfn) ), TP_fast_assign( __entry->pool = pool;
__entry->page = page;
__entry->netmem = netmem; __entry->release = release;
__entry->pfn = page_to_pfn(page);
__entry->pfn = netmem_to_pfn(netmem); ),
TP_printk("page_pool=%p page=%p pfn=0x%lx release=%u",
__entry->pool, __entry->page, __entry->pfn, __entry->release)
TP_printk("page_pool=%p netmem=%lu pfn=0x%lx release=%u",
__entry->pool, (__force unsigned long)__entry->netmem,
__entry->pfn, __entry->release)
);
TRACE_EVENT(page_pool_state_hold,
TP_PROTO(const struct page_pool *pool,
const struct page *page, u32 hold),
netmem_ref netmem, u32 hold),
TP_ARGS(pool, page, hold),
TP_ARGS(pool, netmem, hold), TP_STRUCT__entry( __field(const struct page_pool *, pool)
__field(const struct page *, page)
__field(netmem_ref, netmem) __field(u32, hold) __field(unsigned long, pfn) ), TP_fast_assign( __entry->pool = pool;
__entry->page = page;
__entry->netmem = netmem; __entry->hold = hold;
__entry->pfn = page_to_pfn(page);
__entry->pfn = netmem_to_pfn(netmem); ),
TP_printk("page_pool=%p page=%p pfn=0x%lx hold=%u",
__entry->pool, __entry->page, __entry->pfn, __entry->hold)
TP_printk("page_pool=%p netmem=%lu pfn=0x%lx hold=%u",
__entry->pool, (__force unsigned long)__entry->netmem,
__entry->pfn, __entry->hold)
);
TRACE_EVENT(page_pool_update_nid, diff --git a/net/bpf/test_run.c b/net/bpf/test_run.c index 5535f9adc658..bc8f7ab88f86 100644 --- a/net/bpf/test_run.c +++ b/net/bpf/test_run.c @@ -126,9 +126,10 @@ struct xdp_test_data { #define TEST_XDP_FRAME_SIZE (PAGE_SIZE - sizeof(struct xdp_page_head)) #define TEST_XDP_MAX_BATCH 256
-static void xdp_test_run_init_page(struct page *page, void *arg) +static void xdp_test_run_init_page(netmem_ref netmem, void *arg) {
struct xdp_page_head *head = phys_to_virt(page_to_phys(page));
struct xdp_page_head *head =
phys_to_virt(page_to_phys(netmem_to_page(netmem))); struct xdp_buff *new_ctx, *orig_ctx; u32 headroom = XDP_PACKET_HEADROOM; struct xdp_test_data *xdp = arg;
diff --git a/net/core/page_pool.c b/net/core/page_pool.c index fe9de4ecce94..24d5236b2efc 100644 --- a/net/core/page_pool.c +++ b/net/core/page_pool.c @@ -329,19 +329,18 @@ struct page_pool *page_pool_create(const struct page_pool_params *params) } EXPORT_SYMBOL(page_pool_create);
-static void page_pool_return_page(struct page_pool *pool, struct page *page); +static void page_pool_return_page(struct page_pool *pool, netmem_ref netmem);
-noinline -static struct page *page_pool_refill_alloc_cache(struct page_pool *pool) +static noinline netmem_ref page_pool_refill_alloc_cache(struct page_pool *pool) { struct ptr_ring *r = &pool->ring;
struct page *page;
netmem_ref netmem; int pref_nid; /* preferred NUMA node */ /* Quicker fallback, avoid locks when ring is empty */ if (__ptr_ring_empty(r)) { alloc_stat_inc(pool, empty);
return NULL;
return 0; } /* Softirq guarantee CPU and thus NUMA node is stable. This,
@@ -356,56 +355,56 @@ static struct page *page_pool_refill_alloc_cache(struct page_pool *pool)
/* Refill alloc array, but only if NUMA match */ do {
page = __ptr_ring_consume(r);
if (unlikely(!page))
netmem = (__force netmem_ref)__ptr_ring_consume(r);
if (unlikely(!netmem)) break;
if (likely(page_to_nid(page) == pref_nid)) {
pool->alloc.cache[pool->alloc.count++] = page;
if (likely(page_to_nid(netmem_to_page(netmem)) == pref_nid)) {
pool->alloc.cache[pool->alloc.count++] = netmem; } else { /* NUMA mismatch; * (1) release 1 page to page-allocator and * (2) break out to fallthrough to alloc_pages_node. * This limit stress on page buddy alloactor. */
page_pool_return_page(pool, page);
page_pool_return_page(pool, netmem); alloc_stat_inc(pool, waive);
page = NULL;
netmem = 0; break; } } while (pool->alloc.count < PP_ALLOC_CACHE_REFILL); /* Return last page */ if (likely(pool->alloc.count > 0)) {
page = pool->alloc.cache[--pool->alloc.count];
netmem = pool->alloc.cache[--pool->alloc.count]; alloc_stat_inc(pool, refill); }
return page;
return netmem;
}
/* fast path */ -static struct page *__page_pool_get_cached(struct page_pool *pool) +static netmem_ref __page_pool_get_cached(struct page_pool *pool) {
struct page *page;
netmem_ref netmem; /* Caller MUST guarantee safe non-concurrent access, e.g. softirq */ if (likely(pool->alloc.count)) { /* Fast-path */
page = pool->alloc.cache[--pool->alloc.count];
netmem = pool->alloc.cache[--pool->alloc.count]; alloc_stat_inc(pool, fast); } else {
page = page_pool_refill_alloc_cache(pool);
netmem = page_pool_refill_alloc_cache(pool); }
return page;
return netmem;
}
static void page_pool_dma_sync_for_device(struct page_pool *pool,
struct page *page,
netmem_ref netmem, unsigned int dma_sync_size)
{
dma_addr_t dma_addr = page_pool_get_dma_addr(page);
dma_addr_t dma_addr = page_pool_get_dma_addr_netmem(netmem); dma_sync_size = min(dma_sync_size, pool->p.max_len); dma_sync_single_range_for_device(pool->p.dev, dma_addr,
@@ -413,7 +412,7 @@ static void page_pool_dma_sync_for_device(struct page_pool *pool, pool->p.dma_dir); }
-static bool page_pool_dma_map(struct page_pool *pool, struct page *page) +static bool page_pool_dma_map(struct page_pool *pool, netmem_ref netmem) { dma_addr_t dma;
@@ -422,18 +421,18 @@ static bool page_pool_dma_map(struct page_pool *pool, struct page *page) * into page private data (i.e 32bit cpu with 64bit DMA caps) * This mapping is kept for lifetime of page, until leaving pool. */
dma = dma_map_page_attrs(pool->p.dev, page, 0,
(PAGE_SIZE << pool->p.order),
pool->p.dma_dir, DMA_ATTR_SKIP_CPU_SYNC |
DMA_ATTR_WEAK_ORDERING);
dma = dma_map_page_attrs(pool->p.dev, netmem_to_page(netmem), 0,
(PAGE_SIZE << pool->p.order), pool->p.dma_dir,
DMA_ATTR_SKIP_CPU_SYNC |
DMA_ATTR_WEAK_ORDERING); if (dma_mapping_error(pool->p.dev, dma)) return false;
if (page_pool_set_dma_addr(page, dma))
if (page_pool_set_dma_addr_netmem(netmem, dma)) goto unmap_failed; if (pool->p.flags & PP_FLAG_DMA_SYNC_DEV)
page_pool_dma_sync_for_device(pool, page, pool->p.max_len);
page_pool_dma_sync_for_device(pool, netmem, pool->p.max_len); return true;
@@ -445,9 +444,10 @@ static bool page_pool_dma_map(struct page_pool *pool, struct page *page) return false; }
-static void page_pool_set_pp_info(struct page_pool *pool,
struct page *page)
+static void page_pool_set_pp_info(struct page_pool *pool, netmem_ref netmem) {
struct page *page = netmem_to_page(netmem);
page->pp = pool; page->pp_magic |= PP_SIGNATURE;
@@ -457,13 +457,15 @@ static void page_pool_set_pp_info(struct page_pool *pool, * is dirtying the same cache line as the page->pp_magic above, so * the overhead is negligible. */
page_pool_fragment_page(page, 1);
page_pool_fragment_netmem(netmem, 1); if (pool->has_init_callback)
pool->slow.init_callback(page, pool->slow.init_arg);
pool->slow.init_callback(netmem, pool->slow.init_arg);
}
-static void page_pool_clear_pp_info(struct page *page) +static void page_pool_clear_pp_info(netmem_ref netmem) {
struct page *page = netmem_to_page(netmem);
page->pp_magic = 0; page->pp = NULL;
} @@ -479,34 +481,34 @@ static struct page *__page_pool_alloc_page_order(struct page_pool *pool, return NULL;
if ((pool->p.flags & PP_FLAG_DMA_MAP) &&
unlikely(!page_pool_dma_map(pool, page))) {
unlikely(!page_pool_dma_map(pool, page_to_netmem(page)))) { put_page(page); return NULL; } alloc_stat_inc(pool, slow_high_order);
page_pool_set_pp_info(pool, page);
page_pool_set_pp_info(pool, page_to_netmem(page)); /* Track how many pages are held 'in-flight' */ pool->pages_state_hold_cnt++;
trace_page_pool_state_hold(pool, page, pool->pages_state_hold_cnt);
trace_page_pool_state_hold(pool, page_to_netmem(page),
pool->pages_state_hold_cnt); return page;
}
/* slow path */ -noinline -static struct page *__page_pool_alloc_pages_slow(struct page_pool *pool,
gfp_t gfp)
+static noinline netmem_ref __page_pool_alloc_pages_slow(struct page_pool *pool,
gfp_t gfp)
{ const int bulk = PP_ALLOC_CACHE_REFILL; unsigned int pp_flags = pool->p.flags; unsigned int pp_order = pool->p.order;
struct page *page;
netmem_ref netmem; int i, nr_pages; /* Don't support bulk alloc for high-order pages */ if (unlikely(pp_order))
return __page_pool_alloc_page_order(pool, gfp);
return page_to_netmem(__page_pool_alloc_page_order(pool, gfp)); /* Unnecessary as alloc cache is empty, but guarantees zero count */ if (unlikely(pool->alloc.count > 0))
@@ -515,60 +517,67 @@ static struct page *__page_pool_alloc_pages_slow(struct page_pool *pool, /* Mark empty alloc.cache slots "empty" for alloc_pages_bulk_array */ memset(&pool->alloc.cache, 0, sizeof(void *) * bulk);
nr_pages = alloc_pages_bulk_array_node(gfp, pool->p.nid, bulk,
pool->alloc.cache);
nr_pages = alloc_pages_bulk_array_node(gfp,
pool->p.nid, bulk,
(struct page **)pool->alloc.cache); if (unlikely(!nr_pages))
return NULL;
return 0; /* Pages have been filled into alloc.cache array, but count is zero and * page element have not been (possibly) DMA mapped. */ for (i = 0; i < nr_pages; i++) {
page = pool->alloc.cache[i];
netmem = pool->alloc.cache[i]; if ((pp_flags & PP_FLAG_DMA_MAP) &&
unlikely(!page_pool_dma_map(pool, page))) {
put_page(page);
unlikely(!page_pool_dma_map(pool, netmem))) {
put_page(netmem_to_page(netmem)); continue; }
page_pool_set_pp_info(pool, page);
pool->alloc.cache[pool->alloc.count++] = page;
page_pool_set_pp_info(pool, netmem);
pool->alloc.cache[pool->alloc.count++] = netmem; /* Track how many pages are held 'in-flight' */ pool->pages_state_hold_cnt++;
trace_page_pool_state_hold(pool, page,
trace_page_pool_state_hold(pool, netmem, pool->pages_state_hold_cnt); } /* Return last page */ if (likely(pool->alloc.count > 0)) {
page = pool->alloc.cache[--pool->alloc.count];
netmem = pool->alloc.cache[--pool->alloc.count]; alloc_stat_inc(pool, slow); } else {
page = NULL;
netmem = 0; } /* When page just alloc'ed is should/must have refcnt 1. */
return page;
return netmem;
}
/* For using page_pool replace: alloc_pages() API calls, but provide
- synchronization guarantee for allocation side.
*/ -struct page *page_pool_alloc_pages(struct page_pool *pool, gfp_t gfp) +netmem_ref page_pool_alloc_netmem(struct page_pool *pool, gfp_t gfp) {
struct page *page;
netmem_ref netmem; /* Fast-path: Get a page from cache */
page = __page_pool_get_cached(pool);
if (page)
return page;
netmem = __page_pool_get_cached(pool);
if (netmem)
return netmem; /* Slow-path: cache empty, do real allocation */ if (static_branch_unlikely(&page_pool_mem_providers) && pool->mp_ops)
page = pool->mp_ops->alloc_pages(pool, gfp);
netmem = pool->mp_ops->alloc_pages(pool, gfp); else
page = __page_pool_alloc_pages_slow(pool, gfp);
return page;
netmem = __page_pool_alloc_pages_slow(pool, gfp);
return netmem;
+} +EXPORT_SYMBOL(page_pool_alloc_netmem);
+struct page *page_pool_alloc_pages(struct page_pool *pool, gfp_t gfp) +{
return netmem_to_page(page_pool_alloc_netmem(pool, gfp));
} EXPORT_SYMBOL(page_pool_alloc_pages);
@@ -596,8 +605,8 @@ s32 page_pool_inflight(const struct page_pool *pool, bool strict) return inflight; }
-static __always_inline -void __page_pool_release_page_dma(struct page_pool *pool, struct page *page) +static __always_inline void __page_pool_release_page_dma(struct page_pool *pool,
netmem_ref netmem)
{ dma_addr_t dma;
@@ -607,13 +616,13 @@ void __page_pool_release_page_dma(struct page_pool *pool, struct page *page) */ return;
dma = page_pool_get_dma_addr(page);
dma = page_pool_get_dma_addr_netmem(netmem); /* When page is unmapped, it cannot be returned to our pool */ dma_unmap_page_attrs(pool->p.dev, dma, PAGE_SIZE << pool->p.order, pool->p.dma_dir, DMA_ATTR_SKIP_CPU_SYNC | DMA_ATTR_WEAK_ORDERING);
page_pool_set_dma_addr(page, 0);
page_pool_set_dma_addr_netmem(netmem, 0);
}
/* Disconnects a page (from a page_pool). API users can have a need @@ -621,26 +630,26 @@ void __page_pool_release_page_dma(struct page_pool *pool, struct page *page)
- a regular page (that will eventually be returned to the normal
- page-allocator via put_page).
*/ -void page_pool_return_page(struct page_pool *pool, struct page *page) +void page_pool_return_page(struct page_pool *pool, netmem_ref netmem) { int count; bool put;
put = true; if (static_branch_unlikely(&page_pool_mem_providers) && pool->mp_ops)
put = pool->mp_ops->release_page(pool, page);
put = pool->mp_ops->release_page(pool, netmem); else
__page_pool_release_page_dma(pool, page);
__page_pool_release_page_dma(pool, netmem); /* This may be the last page returned, releasing the pool, so * it is not safe to reference pool afterwards. */ count = atomic_inc_return_relaxed(&pool->pages_state_release_cnt);
trace_page_pool_state_release(pool, page, count);
trace_page_pool_state_release(pool, netmem, count); if (put) {
page_pool_clear_pp_info(page);
put_page(page);
page_pool_clear_pp_info(netmem);
put_page(netmem_to_page(netmem)); } /* An optimization would be to call __free_pages(page, pool->p.order) * knowing page is not part of page-cache (thus avoiding a
@@ -648,14 +657,14 @@ void page_pool_return_page(struct page_pool *pool, struct page *page) */ }
-static bool page_pool_recycle_in_ring(struct page_pool *pool, struct page *page) +static bool page_pool_recycle_in_ring(struct page_pool *pool, netmem_ref netmem) { int ret; /* BH protection not needed if current is softirq */ if (in_softirq())
ret = ptr_ring_produce(&pool->ring, page);
ret = ptr_ring_produce(&pool->ring, (__force void *)netmem); else
ret = ptr_ring_produce_bh(&pool->ring, page);
ret = ptr_ring_produce_bh(&pool->ring, (__force void *)netmem); if (!ret) { recycle_stat_inc(pool, ring);
@@ -670,7 +679,7 @@ static bool page_pool_recycle_in_ring(struct page_pool *pool, struct page *page)
- Caller must provide appropriate safe context.
*/ -static bool page_pool_recycle_in_cache(struct page *page, +static bool page_pool_recycle_in_cache(netmem_ref netmem, struct page_pool *pool) { if (unlikely(pool->alloc.count == PP_ALLOC_CACHE_SIZE)) { @@ -679,14 +688,15 @@ static bool page_pool_recycle_in_cache(struct page *page, }
/* Caller MUST have verified/know (page_ref_count(page) == 1) */
pool->alloc.cache[pool->alloc.count++] = page;
pool->alloc.cache[pool->alloc.count++] = netmem; recycle_stat_inc(pool, cached); return true;
}
-static bool __page_pool_page_can_be_recycled(const struct page *page) +static bool __page_pool_page_can_be_recycled(netmem_ref netmem) {
return page_ref_count(page) == 1 && !page_is_pfmemalloc(page);
return page_ref_count(netmem_to_page(netmem)) == 1 &&
!page_is_pfmemalloc(netmem_to_page(netmem));
}
/* If the page refcnt == 1, this will try to recycle the page. @@ -695,8 +705,8 @@ static bool __page_pool_page_can_be_recycled(const struct page *page)
- If the page refcnt != 1, then the page will be returned to memory
- subsystem.
*/ -static __always_inline struct page * -__page_pool_put_page(struct page_pool *pool, struct page *page, +static __always_inline netmem_ref +__page_pool_put_page(struct page_pool *pool, netmem_ref netmem, unsigned int dma_sync_size, bool allow_direct) { lockdep_assert_no_hardirq(); @@ -710,19 +720,19 @@ __page_pool_put_page(struct page_pool *pool, struct page *page, * page is NOT reusable when allocated when system is under * some pressure. (page_is_pfmemalloc) */
if (likely(__page_pool_page_can_be_recycled(page))) {
if (likely(__page_pool_page_can_be_recycled(netmem))) { /* Read barrier done in page_ref_count / READ_ONCE */ if (pool->p.flags & PP_FLAG_DMA_SYNC_DEV)
page_pool_dma_sync_for_device(pool, page,
page_pool_dma_sync_for_device(pool, netmem, dma_sync_size); if (allow_direct && in_softirq() &&
page_pool_recycle_in_cache(page, pool))
return NULL;
page_pool_recycle_in_cache(netmem, pool))
return 0; /* Page found as candidate for recycling */
return page;
return netmem; } /* Fallback/non-XDP mode: API user have elevated refcnt. *
@@ -738,21 +748,30 @@ __page_pool_put_page(struct page_pool *pool, struct page *page, * will be invoking put_page. */ recycle_stat_inc(pool, released_refcnt);
page_pool_return_page(pool, page);
page_pool_return_page(pool, netmem);
return NULL;
return 0;
}
-void page_pool_put_unrefed_page(struct page_pool *pool, struct page *page,
unsigned int dma_sync_size, bool allow_direct)
+void page_pool_put_unrefed_netmem(struct page_pool *pool, netmem_ref netmem,
unsigned int dma_sync_size, bool allow_direct)
{
page = __page_pool_put_page(pool, page, dma_sync_size, allow_direct);
if (page && !page_pool_recycle_in_ring(pool, page)) {
netmem =
__page_pool_put_page(pool, netmem, dma_sync_size, allow_direct);
if (netmem && !page_pool_recycle_in_ring(pool, netmem)) { /* Cache full, fallback to free pages */ recycle_stat_inc(pool, ring_full);
page_pool_return_page(pool, page);
page_pool_return_page(pool, netmem); }
} +EXPORT_SYMBOL(page_pool_put_unrefed_netmem);
+void page_pool_put_unrefed_page(struct page_pool *pool, struct page *page,
unsigned int dma_sync_size, bool allow_direct)
+{
page_pool_put_unrefed_netmem(pool, page_to_netmem(page), dma_sync_size,
allow_direct);
+} EXPORT_SYMBOL(page_pool_put_unrefed_page);
/** @@ -777,16 +796,16 @@ void page_pool_put_page_bulk(struct page_pool *pool, void **data, bool in_softirq;
for (i = 0; i < count; i++) {
struct page *page = virt_to_head_page(data[i]);
netmem_ref netmem = page_to_netmem(virt_to_head_page(data[i])); /* It is not the last user for the page frag case */
if (!page_pool_is_last_ref(page))
if (!page_pool_is_last_ref(netmem)) continue;
page = __page_pool_put_page(pool, page, -1, false);
netmem = __page_pool_put_page(pool, netmem, -1, false); /* Approved for bulk recycling in ptr_ring cache */
if (page)
data[bulk_len++] = page;
if (netmem)
data[bulk_len++] = (__force void *)netmem; } if (unlikely(!bulk_len))
@@ -812,100 +831,108 @@ void page_pool_put_page_bulk(struct page_pool *pool, void **data, * since put_page() with refcnt == 1 can be an expensive operation */ for (; i < bulk_len; i++)
page_pool_return_page(pool, data[i]);
page_pool_return_page(pool, (__force netmem_ref)data[i]);
} EXPORT_SYMBOL(page_pool_put_page_bulk);
-static struct page *page_pool_drain_frag(struct page_pool *pool,
struct page *page)
+static netmem_ref page_pool_drain_frag(struct page_pool *pool,
netmem_ref netmem)
{ long drain_count = BIAS_MAX - pool->frag_users;
/* Some user is still using the page frag */
if (likely(page_pool_unref_page(page, drain_count)))
return NULL;
if (likely(page_pool_unref_netmem(netmem, drain_count)))
return 0;
if (__page_pool_page_can_be_recycled(page)) {
if (__page_pool_page_can_be_recycled(netmem)) { if (pool->p.flags & PP_FLAG_DMA_SYNC_DEV)
page_pool_dma_sync_for_device(pool, page, -1);
page_pool_dma_sync_for_device(pool, netmem, -1);
return page;
return netmem; }
page_pool_return_page(pool, page);
return NULL;
page_pool_return_page(pool, netmem);
return 0;
}
static void page_pool_free_frag(struct page_pool *pool) { long drain_count = BIAS_MAX - pool->frag_users;
struct page *page = pool->frag_page;
netmem_ref netmem = pool->frag_page;
pool->frag_page = NULL;
pool->frag_page = 0;
if (!page || page_pool_unref_page(page, drain_count))
if (!netmem || page_pool_unref_netmem(netmem, drain_count)) return;
page_pool_return_page(pool, page);
page_pool_return_page(pool, netmem);
}
-struct page *page_pool_alloc_frag(struct page_pool *pool,
unsigned int *offset,
unsigned int size, gfp_t gfp)
+netmem_ref page_pool_alloc_frag_netmem(struct page_pool *pool,
unsigned int *offset, unsigned int size,
gfp_t gfp)
{ unsigned int max_size = PAGE_SIZE << pool->p.order;
struct page *page = pool->frag_page;
netmem_ref netmem = pool->frag_page; if (WARN_ON(size > max_size))
return NULL;
return 0; size = ALIGN(size, dma_get_cache_alignment()); *offset = pool->frag_offset;
if (page && *offset + size > max_size) {
page = page_pool_drain_frag(pool, page);
if (page) {
if (netmem && *offset + size > max_size) {
netmem = page_pool_drain_frag(pool, netmem);
if (netmem) { alloc_stat_inc(pool, fast); goto frag_reset; } }
if (!page) {
page = page_pool_alloc_pages(pool, gfp);
if (unlikely(!page)) {
pool->frag_page = NULL;
return NULL;
if (!netmem) {
netmem = page_pool_alloc_netmem(pool, gfp);
if (unlikely(!netmem)) {
pool->frag_page = 0;
return 0; }
pool->frag_page = page;
pool->frag_page = netmem;
frag_reset: pool->frag_users = 1; *offset = 0; pool->frag_offset = size;
page_pool_fragment_page(page, BIAS_MAX);
return page;
page_pool_fragment_netmem(netmem, BIAS_MAX);
return netmem; } pool->frag_users++; pool->frag_offset = *offset + size; alloc_stat_inc(pool, fast);
return page;
return netmem;
+} +EXPORT_SYMBOL(page_pool_alloc_frag_netmem);
+struct page *page_pool_alloc_frag(struct page_pool *pool, unsigned int *offset,
unsigned int size, gfp_t gfp)
+{
return netmem_to_page(page_pool_alloc_frag_netmem(pool, offset, size,
gfp));
} EXPORT_SYMBOL(page_pool_alloc_frag);
static void page_pool_empty_ring(struct page_pool *pool) {
struct page *page;
netmem_ref netmem; /* Empty recycle ring */
while ((page = ptr_ring_consume_bh(&pool->ring))) {
while ((netmem = (__force netmem_ref)ptr_ring_consume_bh(&pool->ring))) { /* Verify the refcnt invariant of cached pages */
if (!(page_ref_count(page) == 1))
if (!(page_ref_count(netmem_to_page(netmem)) == 1)) pr_crit("%s() page_pool refcnt %d violation\n",
__func__, page_ref_count(page));
__func__, netmem_ref_count(netmem));
page_pool_return_page(pool, page);
page_pool_return_page(pool, netmem); }
}
@@ -927,7 +954,7 @@ static void __page_pool_destroy(struct page_pool *pool)
static void page_pool_empty_alloc_cache_once(struct page_pool *pool) {
struct page *page;
netmem_ref netmem; if (pool->destroy_cnt) return;
@@ -937,8 +964,8 @@ static void page_pool_empty_alloc_cache_once(struct page_pool *pool) * call concurrently. */ while (pool->alloc.count) {
page = pool->alloc.cache[--pool->alloc.count];
page_pool_return_page(pool, page);
netmem = pool->alloc.cache[--pool->alloc.count];
page_pool_return_page(pool, netmem); }
}
@@ -1044,15 +1071,15 @@ EXPORT_SYMBOL(page_pool_destroy); /* Caller must provide appropriate safe context, e.g. NAPI. */ void page_pool_update_nid(struct page_pool *pool, int new_nid) {
struct page *page;
netmem_ref netmem; trace_page_pool_update_nid(pool, new_nid); pool->p.nid = new_nid; /* Flush pool alloc cache, as refill will check NUMA node */ while (pool->alloc.count) {
page = pool->alloc.cache[--pool->alloc.count];
page_pool_return_page(pool, page);
netmem = pool->alloc.cache[--pool->alloc.count];
page_pool_return_page(pool, netmem); }
} EXPORT_SYMBOL(page_pool_update_nid); diff --git a/net/core/skbuff.c b/net/core/skbuff.c index 1f918e602bc4..e1118b637085 100644 --- a/net/core/skbuff.c +++ b/net/core/skbuff.c @@ -1006,8 +1006,9 @@ int skb_cow_data_for_xdp(struct page_pool *pool, struct sk_buff **pskb, EXPORT_SYMBOL(skb_cow_data_for_xdp);
#if IS_ENABLED(CONFIG_PAGE_POOL) -bool napi_pp_put_page(struct page *page, bool napi_safe) +bool napi_pp_put_page(netmem_ref netmem, bool napi_safe) {
struct page *page = netmem_to_page(netmem); bool allow_direct = false; struct page_pool *pp;
@@ -1044,7 +1045,7 @@ bool napi_pp_put_page(struct page *page, bool napi_safe) * The page will be returned to the pool here regardless of the * 'flipped' fragment being in use or not. */
page_pool_put_full_page(pp, page, allow_direct);
page_pool_put_full_netmem(pp, page_to_netmem(page), allow_direct); return true;
} @@ -1055,7 +1056,7 @@ static bool skb_pp_recycle(struct sk_buff *skb, void *data, bool napi_safe) { if (!IS_ENABLED(CONFIG_PAGE_POOL) || !skb->pp_recycle) return false;
return napi_pp_put_page(virt_to_page(data), napi_safe);
return napi_pp_put_page(page_to_netmem(virt_to_page(data)), napi_safe);
}
/**
2.44.0.rc1.240.g4c46232300-goog
Make netmem_ref able to refer to either a struct page or a struct net_iov. The LSB of the netmem_ref value is overloaded: when it is set the reference points to a net_iov, otherwise it points to a page.
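To make the LSB overloading concrete, here is a minimal userspace sketch of the idea. It is not the kernel code: demo_page, demo_iov, demo_ref and every demo_* helper are hypothetical stand-ins, and the sketch leaves out the page_pool_mem_providers static key that gates the real check. It only shows how one integer-sized reference can carry either pointer type, and how a page-only conversion can refuse a tagged reference.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for struct page and struct net_iov. */
struct demo_page { unsigned long pp_magic; };
struct demo_iov  { unsigned long pp_magic; };

/* Both types are at least 2-byte aligned, so bit 0 of a valid pointer
 * is always clear and can be borrowed as a type tag.
 */
static_assert(_Alignof(struct demo_page) > 1, "LSB must be free");
static_assert(_Alignof(struct demo_iov) > 1, "LSB must be free");

#define DEMO_NET_IOV 0x01UL

/* Plays the role of netmem_ref in this sketch. */
typedef uintptr_t demo_ref;

static inline demo_ref demo_page_to_ref(struct demo_page *page)
{
	return (demo_ref)page;
}

/* Assumption: an iov is turned into a reference by setting the tag bit. */
static inline demo_ref demo_iov_to_ref(struct demo_iov *iov)
{
	return (demo_ref)iov | DEMO_NET_IOV;
}

static inline bool demo_ref_is_iov(demo_ref ref)
{
	return ref & DEMO_NET_IOV;
}

/* Page-only conversion: returns NULL rather than handing an iov to code
 * that expects a real page.
 */
static inline struct demo_page *demo_ref_to_page(demo_ref ref)
{
	if (demo_ref_is_iov(ref))
		return NULL;
	return (struct demo_page *)ref;
}

/* Strip the tag bit to recover the iov pointer. */
static inline struct demo_iov *demo_ref_to_iov(demo_ref ref)
{
	if (!demo_ref_is_iov(ref))
		return NULL;
	return (struct demo_iov *)(ref & ~(uintptr_t)DEMO_NET_IOV);
}
```

Because a page reference is the untagged pointer value, existing page users pay nothing extra; only code that may see an iov needs the tag check.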
Currently these entries in struct page are rented by the page_pool and used exclusively by the net stack:
	struct {
		unsigned long pp_magic;
		struct page_pool *pp;
		unsigned long _pp_mapping_pad;
		unsigned long dma_addr;
		atomic_long_t pp_ref_count;
	};
Mirror these (and only these) entries into struct net_iov and implement netmem helpers that can access these common fields regardless of whether the underlying type is page or net_iov.
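As a rough illustration of why the mirrored fields must sit at identical offsets, the following userspace sketch defines two hypothetical layouts (mirror_page and mirror_iov, which are not the real struct page or struct net_iov) and one accessor that works on either. All mirror_* names are assumptions made for this example. The accessor computes the field address from the offset so that the sketch stays valid under strict aliasing; the helpers in this patch simply clear the LSB and cast to struct net_iov *.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for struct page: one mm-private word, then the
 * fields the page_pool uses.
 */
struct mirror_page {
	unsigned long flags;
	unsigned long pp_magic;
	void *pp;
	unsigned long _pp_mapping_pad;
	unsigned long dma_addr;
	long pp_ref_count;
};

/* Hypothetical stand-in for struct net_iov: padding keeps the shared
 * fields at the same offsets as in mirror_page.
 */
struct mirror_iov {
	unsigned long __unused_padding;
	unsigned long pp_magic;
	void *pp;
	void *owner;
	unsigned long dma_addr;
	long pp_ref_count;
};

/* Fail the build if a shared field ever moves in only one of the structs. */
#define MIRROR_ASSERT_OFFSET(f) \
	static_assert(offsetof(struct mirror_page, f) == \
		      offsetof(struct mirror_iov, f), \
		      "page_pool field offsets must match")
MIRROR_ASSERT_OFFSET(pp_magic);
MIRROR_ASSERT_OFFSET(pp);
MIRROR_ASSERT_OFFSET(dma_addr);
MIRROR_ASSERT_OFFSET(pp_ref_count);
#undef MIRROR_ASSERT_OFFSET

#define MIRROR_NET_IOV 0x01UL
typedef uintptr_t mirror_ref;

/* One accessor for both types: drop the tag bit, then index by the
 * (asserted identical) field offset.
 */
static inline unsigned long *mirror_dma_addr_ptr(mirror_ref ref)
{
	char *base = (char *)(ref & ~(uintptr_t)MIRROR_NET_IOV);

	return (unsigned long *)(base + offsetof(struct mirror_iov, dma_addr));
}

static inline unsigned long mirror_get_dma_addr(mirror_ref ref)
{
	return *mirror_dma_addr_ptr(ref);
}

static inline void mirror_set_dma_addr(mirror_ref ref, unsigned long addr)
{
	*mirror_dma_addr_ptr(ref) = addr;
}
```

With the offsets pinned at build time, the accessors need no branch on the underlying type, which is the property the page_pool fast paths rely on.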
Implement checks for net_iov in netmem helpers which delegate to mm APIs, to ensure net_iov are never passed to the mm stack.
Signed-off-by: Mina Almasry <almasrymina@google.com>
---
v6:
- Rebased on top of the merged netmem_ref type.
- Rebased on top of the merged skb_pp_frag_ref() changes.

v5:
- Use netmem instead of page* with LSB set.
- Use pp_ref_count for refcounting net_iov.
- Removed many of the custom checks for netmem.

v1:
- Disable fragmentation support for iov properly.
- fix napi_pp_put_page() path (Yunsheng).
- Use pp_frag_count for devmem refcounting.

---
 include/net/netmem.h            | 142 ++++++++++++++++++++++++++++++--
 include/net/page_pool/helpers.h |  25 +++---
 include/net/page_pool/types.h   |   1 +
 net/core/page_pool.c            |  26 +++---
 net/core/skbuff.c               |  23 +++---
 5 files changed, 171 insertions(+), 46 deletions(-)

diff --git a/include/net/netmem.h b/include/net/netmem.h
index 21f53b29e5fe..8699788d587d 100644
--- a/include/net/netmem.h
+++ b/include/net/netmem.h
@@ -9,14 +9,51 @@
 #define _NET_NETMEM_H
 
 #include <net/devmem.h>
+#include <net/net_debug.h>
 
 /* net_iov */
 
+DECLARE_STATIC_KEY_FALSE(page_pool_mem_providers);
+
+/* We overload the LSB of the struct page pointer to indicate whether it's
+ * a page or net_iov.
+ */
+#define NET_IOV 0x01UL
+
 struct net_iov {
+	unsigned long __unused_padding;
+	unsigned long pp_magic;
+	struct page_pool *pp;
 	struct dmabuf_genpool_chunk_owner *owner;
 	unsigned long dma_addr;
+	atomic_long_t pp_ref_count;
 };
 
+/* These fields in struct page are used by the page_pool and net stack:
+ *
+ *	struct {
+ *		unsigned long pp_magic;
+ *		struct page_pool *pp;
+ *		unsigned long _pp_mapping_pad;
+ *		unsigned long dma_addr;
+ *		atomic_long_t pp_ref_count;
+ *	};
+ *
+ * We mirror the page_pool fields here so the page_pool can access these fields
+ * without worrying whether the underlying fields belong to a page or net_iov.
+ *
+ * The non-net stack fields of struct page are private to the mm stack and must
+ * never be mirrored to net_iov.
+ */
+#define NET_IOV_ASSERT_OFFSET(pg, iov)             \
+	static_assert(offsetof(struct page, pg) == \
+		      offsetof(struct net_iov, iov))
+NET_IOV_ASSERT_OFFSET(pp_magic, pp_magic);
+NET_IOV_ASSERT_OFFSET(pp, pp);
+NET_IOV_ASSERT_OFFSET(dma_addr, dma_addr);
+NET_IOV_ASSERT_OFFSET(pp_ref_count, pp_ref_count);
+#undef NET_IOV_ASSERT_OFFSET
+
 static inline struct dmabuf_genpool_chunk_owner *
 net_iov_owner(const struct net_iov *niov)
 {
@@ -69,20 +106,27 @@ net_iov_binding(const struct net_iov *niov)
  */
 typedef unsigned long __bitwise netmem_ref;
 
+static inline bool netmem_is_net_iov(const netmem_ref netmem)
+{
+#ifdef CONFIG_PAGE_POOL
+	return static_branch_unlikely(&page_pool_mem_providers) &&
+	       (__force unsigned long)netmem & NET_IOV;
+#else
+	return false;
+#endif
+}
+
 /* This conversion fails (returns NULL) if the netmem_ref is not struct page
  * backed.
- *
- * Currently struct page is the only possible netmem, and this helper never
- * fails.
  */
 static inline struct page *netmem_to_page(netmem_ref netmem)
 {
+	if (WARN_ON_ONCE(netmem_is_net_iov(netmem)))
+		return NULL;
+
 	return (__force struct page *)netmem;
 }
 
-/* Converting from page to netmem is always safe, because a page can always be
- * a netmem.
- */
 static inline netmem_ref page_to_netmem(struct page *page)
 {
 	return (__force netmem_ref)page;
@@ -90,17 +134,103 @@ static inline netmem_ref page_to_netmem(struct page *page)
 
 static inline int netmem_ref_count(netmem_ref netmem)
 {
+	/* The non-pp refcount of net_iov is always 1. On net_iov, we only
+	 * support pp refcounting which uses the pp_ref_count field.
+	 */
+	if (netmem_is_net_iov(netmem))
+		return 1;
+
 	return page_ref_count(netmem_to_page(netmem));
 }
 
 static inline unsigned long netmem_to_pfn(netmem_ref netmem)
 {
+	if (netmem_is_net_iov(netmem))
+		return 0;
+
 	return page_to_pfn(netmem_to_page(netmem));
 }
 
+static inline struct net_iov *__netmem_clear_lsb(netmem_ref netmem)
+{
+	return (struct net_iov *)((__force unsigned long)netmem & ~NET_IOV);
+}
+
+static inline unsigned long netmem_get_pp_magic(netmem_ref netmem)
+{
+	return __netmem_clear_lsb(netmem)->pp_magic;
+}
+
+static inline void netmem_or_pp_magic(netmem_ref netmem, unsigned long pp_magic)
+{
+	__netmem_clear_lsb(netmem)->pp_magic |= pp_magic;
+}
+
+static inline void netmem_clear_pp_magic(netmem_ref netmem)
+{
+	__netmem_clear_lsb(netmem)->pp_magic = 0;
+}
+
+static inline struct page_pool *netmem_get_pp(netmem_ref netmem)
+{
+	return __netmem_clear_lsb(netmem)->pp;
+}
+
+static inline void netmem_set_pp(netmem_ref netmem, struct page_pool *pool)
+{
+	__netmem_clear_lsb(netmem)->pp = pool;
+}
+
+static inline unsigned long netmem_get_dma_addr(netmem_ref netmem)
+{
+	return __netmem_clear_lsb(netmem)->dma_addr;
+}
+
+static inline void netmem_set_dma_addr(netmem_ref netmem,
+				       unsigned long dma_addr)
+{
+	__netmem_clear_lsb(netmem)->dma_addr = dma_addr;
+}
+
+static inline atomic_long_t *netmem_get_pp_ref_count_ref(netmem_ref netmem)
+{
+	return &__netmem_clear_lsb(netmem)->pp_ref_count;
+}
+
+static inline bool netmem_is_pref_nid(netmem_ref netmem, int pref_nid)
+{
+	/* Assume net_iov are on the preferred node without actually
+	 * checking...
+	 *
+	 * This check is only used to check for recycling memory in the page
+	 * pool's fast paths. Currently the only implementation of net_iov
+	 * is dmabuf device memory. It's a deliberate decision by the user to
+	 * bind a certain dmabuf to a certain netdev, and the netdev rx queue
+	 * would not be able to reallocate memory from another dmabuf that
+	 * exists on the preferred node, so, this check doesn't make much sense
+	 * in this case. Assume all net_iovs can be recycled for now.
+	 */
+	if (netmem_is_net_iov(netmem))
+		return true;
+
+	return page_to_nid(netmem_to_page(netmem)) == pref_nid;
+}
+
 static inline netmem_ref netmem_compound_head(netmem_ref netmem)
 {
+	/* niov are never compounded */
+	if (netmem_is_net_iov(netmem))
+		return netmem;
+
 	return page_to_netmem(compound_head(netmem_to_page(netmem)));
 }
 
+static inline void *netmem_address(netmem_ref netmem)
+{
+	if (netmem_is_net_iov(netmem))
+		return NULL;
+
+	return page_address(netmem_to_page(netmem));
+}
+
 #endif /* _NET_NETMEM_H */
diff --git a/include/net/page_pool/helpers.h b/include/net/page_pool/helpers.h
index 61814f91a458..c6a55eddefae 100644
--- a/include/net/page_pool/helpers.h
+++ b/include/net/page_pool/helpers.h
@@ -215,7 +215,7 @@ inline enum dma_data_direction page_pool_get_dma_dir(struct page_pool *pool)
static inline void page_pool_fragment_netmem(netmem_ref netmem, long nr) { - atomic_long_set(&netmem_to_page(netmem)->pp_ref_count, nr); + atomic_long_set(netmem_get_pp_ref_count_ref(netmem), nr); }
/** @@ -243,7 +243,7 @@ static inline void page_pool_fragment_page(struct page *page, long nr)
static inline long page_pool_unref_netmem(netmem_ref netmem, long nr) { - struct page *page = netmem_to_page(netmem); + atomic_long_t *pp_ref_count = netmem_get_pp_ref_count_ref(netmem); long ret;
/* If nr == pp_ref_count then we have cleared all remaining @@ -260,19 +260,19 @@ static inline long page_pool_unref_netmem(netmem_ref netmem, long nr) * initially, and only overwrite it when the page is partitioned into * more than one piece. */ - if (atomic_long_read(&page->pp_ref_count) == nr) { + if (atomic_long_read(pp_ref_count) == nr) { /* As we have ensured nr is always one for constant case using * the BUILD_BUG_ON(), only need to handle the non-constant case * here for pp_ref_count draining, which is a rare case. */ BUILD_BUG_ON(__builtin_constant_p(nr) && nr != 1); if (!__builtin_constant_p(nr)) - atomic_long_set(&page->pp_ref_count, 1); + atomic_long_set(pp_ref_count, 1);
return 0; }
- ret = atomic_long_sub_return(nr, &page->pp_ref_count); + ret = atomic_long_sub_return(nr, pp_ref_count); WARN_ON(ret < 0);
/* We are the last user here too, reset pp_ref_count back to 1 to @@ -281,7 +281,7 @@ static inline long page_pool_unref_netmem(netmem_ref netmem, long nr) * page_pool_unref_page() currently. */ if (unlikely(!ret)) - atomic_long_set(&page->pp_ref_count, 1); + atomic_long_set(pp_ref_count, 1);
return ret; } @@ -400,9 +400,7 @@ static inline void page_pool_free_va(struct page_pool *pool, void *va,
static inline dma_addr_t page_pool_get_dma_addr_netmem(netmem_ref netmem) { - struct page *page = netmem_to_page(netmem); - - dma_addr_t ret = page->dma_addr; + dma_addr_t ret = netmem_get_dma_addr(netmem);
if (PAGE_POOL_32BIT_ARCH_WITH_64BIT_DMA) ret <<= PAGE_SHIFT; @@ -425,18 +423,17 @@ static inline dma_addr_t page_pool_get_dma_addr(struct page *page) static inline bool page_pool_set_dma_addr_netmem(netmem_ref netmem, dma_addr_t addr) { - struct page *page = netmem_to_page(netmem); - if (PAGE_POOL_32BIT_ARCH_WITH_64BIT_DMA) { - page->dma_addr = addr >> PAGE_SHIFT; + netmem_set_dma_addr(netmem, addr >> PAGE_SHIFT);
/* We assume page alignment to shave off bottom bits, * if this "compression" doesn't work we need to drop. */ - return addr != (dma_addr_t)page->dma_addr << PAGE_SHIFT; + return addr != (dma_addr_t)netmem_get_dma_addr(netmem) + << PAGE_SHIFT; }
- page->dma_addr = addr; + netmem_set_dma_addr(netmem, addr); return false; }
diff --git a/include/net/page_pool/types.h b/include/net/page_pool/types.h index 68a24c5ae827..e29e77f7934e 100644 --- a/include/net/page_pool/types.h +++ b/include/net/page_pool/types.h @@ -6,6 +6,7 @@ #include <linux/dma-direction.h> #include <linux/ptr_ring.h> #include <linux/types.h> +#include <net/netmem.h>
#define PP_FLAG_DMA_MAP BIT(0) /* Should page_pool do the DMA * map/unmap diff --git a/net/core/page_pool.c b/net/core/page_pool.c index 24d5236b2efc..22e3d439da18 100644 --- a/net/core/page_pool.c +++ b/net/core/page_pool.c @@ -25,7 +25,7 @@
#include "page_pool_priv.h"
-static DEFINE_STATIC_KEY_FALSE(page_pool_mem_providers); +DEFINE_STATIC_KEY_FALSE(page_pool_mem_providers);
#define DEFER_TIME (msecs_to_jiffies(1000)) #define DEFER_WARN_INTERVAL (60 * HZ) @@ -359,7 +359,7 @@ static noinline netmem_ref page_pool_refill_alloc_cache(struct page_pool *pool) if (unlikely(!netmem)) break;
- if (likely(page_to_nid(netmem_to_page(netmem)) == pref_nid)) { + if (likely(netmem_is_pref_nid(netmem, pref_nid))) { pool->alloc.cache[pool->alloc.count++] = netmem; } else { /* NUMA mismatch; @@ -446,10 +446,8 @@ static bool page_pool_dma_map(struct page_pool *pool, netmem_ref netmem)
static void page_pool_set_pp_info(struct page_pool *pool, netmem_ref netmem) { - struct page *page = netmem_to_page(netmem); - - page->pp = pool; - page->pp_magic |= PP_SIGNATURE; + netmem_set_pp(netmem, pool); + netmem_or_pp_magic(netmem, PP_SIGNATURE);
/* Ensuring all pages have been split into one fragment initially: * page_pool_set_pp_info() is only called once for every page when it @@ -464,10 +462,8 @@ static void page_pool_set_pp_info(struct page_pool *pool, netmem_ref netmem)
static void page_pool_clear_pp_info(netmem_ref netmem) { - struct page *page = netmem_to_page(netmem); - - page->pp_magic = 0; - page->pp = NULL; + netmem_clear_pp_magic(netmem); + netmem_set_pp(netmem, NULL); }
static struct page *__page_pool_alloc_page_order(struct page_pool *pool, @@ -695,8 +691,9 @@ static bool page_pool_recycle_in_cache(netmem_ref netmem,
static bool __page_pool_page_can_be_recycled(netmem_ref netmem) { - return page_ref_count(netmem_to_page(netmem)) == 1 && - !page_is_pfmemalloc(netmem_to_page(netmem)); + return netmem_is_net_iov(netmem) || + (page_ref_count(netmem_to_page(netmem)) == 1 && + !page_is_pfmemalloc(netmem_to_page(netmem))); }
/* If the page refcnt == 1, this will try to recycle the page. @@ -718,7 +715,7 @@ __page_pool_put_page(struct page_pool *pool, netmem_ref netmem, * refcnt == 1 means page_pool owns page, and can recycle it. * * page is NOT reusable when allocated when system is under - * some pressure. (page_is_pfmemalloc) + * some pressure. (page_pool_page_is_pfmemalloc) */ if (likely(__page_pool_page_can_be_recycled(netmem))) { /* Read barrier done in page_ref_count / READ_ONCE */ @@ -734,6 +731,7 @@ __page_pool_put_page(struct page_pool *pool, netmem_ref netmem, /* Page found as candidate for recycling */ return netmem; } + /* Fallback/non-XDP mode: API user have elevated refcnt. * * Many drivers split up the page into fragments, and some @@ -928,7 +926,7 @@ static void page_pool_empty_ring(struct page_pool *pool) /* Empty recycle ring */ while ((netmem = (__force netmem_ref)ptr_ring_consume_bh(&pool->ring))) { /* Verify the refcnt invariant of cached pages */ - if (!(page_ref_count(netmem_to_page(netmem)) == 1)) + if (!(netmem_ref_count(netmem) == 1)) pr_crit("%s() page_pool refcnt %d violation\n", __func__, netmem_ref_count(netmem));
diff --git a/net/core/skbuff.c b/net/core/skbuff.c index e1118b637085..cf23392e97f5 100644 --- a/net/core/skbuff.c +++ b/net/core/skbuff.c @@ -908,9 +908,9 @@ static void skb_clone_fraglist(struct sk_buff *skb) skb_get(list); }
-static bool is_pp_page(struct page *page) +static bool is_pp_netmem(netmem_ref netmem) { - return (page->pp_magic & ~0x3UL) == PP_SIGNATURE; + return (netmem_get_pp_magic(netmem) & ~0x3UL) == PP_SIGNATURE; }
int skb_pp_cow_data(struct page_pool *pool, struct sk_buff **pskb, @@ -1008,11 +1008,10 @@ EXPORT_SYMBOL(skb_cow_data_for_xdp); #if IS_ENABLED(CONFIG_PAGE_POOL) bool napi_pp_put_page(netmem_ref netmem, bool napi_safe) { - struct page *page = netmem_to_page(netmem); bool allow_direct = false; struct page_pool *pp;
- page = compound_head(page); + netmem = netmem_compound_head(netmem);
/* page->pp_magic is OR'ed with PP_SIGNATURE after the allocation * in order to preserve any existing bits, such as bit 0 for the @@ -1021,10 +1020,10 @@ bool napi_pp_put_page(netmem_ref netmem, bool napi_safe) * and page_is_pfmemalloc() is checked in __page_pool_put_page() * to avoid recycling the pfmemalloc page. */ - if (unlikely(!is_pp_page(page))) + if (unlikely(!is_pp_netmem(netmem))) return false;
- pp = page->pp; + pp = netmem_get_pp(netmem);
/* Allow direct recycle if we have reasons to believe that we are * in the same context as the consumer would run, so there's @@ -1045,7 +1044,7 @@ bool napi_pp_put_page(netmem_ref netmem, bool napi_safe) * The page will be returned to the pool here regardless of the * 'flipped' fragment being in use or not. */ - page_pool_put_full_netmem(pp, page_to_netmem(page), allow_direct); + page_pool_put_full_netmem(pp, netmem, allow_direct);
return true; } @@ -1072,7 +1071,7 @@ static bool skb_pp_recycle(struct sk_buff *skb, void *data, bool napi_safe) static int skb_pp_frag_ref(struct sk_buff *skb) { struct skb_shared_info *shinfo; - struct page *head_page; + netmem_ref head_netmem; int i;
if (!skb->pp_recycle) @@ -1081,11 +1080,11 @@ static int skb_pp_frag_ref(struct sk_buff *skb) shinfo = skb_shinfo(skb);
for (i = 0; i < shinfo->nr_frags; i++) { - head_page = compound_head(skb_frag_page(&shinfo->frags[i])); - if (likely(is_pp_page(head_page))) - page_pool_ref_page(head_page); + head_netmem = netmem_compound_head(shinfo->frags[i].netmem); + if (likely(is_pp_netmem(head_netmem))) + page_pool_ref_netmem(head_netmem); else - page_ref_inc(head_page); + page_ref_inc(netmem_to_page(head_netmem)); } return 0; }
On Mon, Mar 4, 2024 at 6:02 PM Mina Almasry almasrymina@google.com wrote:
Convert netmem to be able to refer to either a struct page or a struct net_iov. Overload the LSB of netmem_ref to indicate that it refers to a net_iov; otherwise it refers to a page.
Currently these entries in struct page are rented by the page_pool and used exclusively by the net stack:
struct { unsigned long pp_magic; struct page_pool *pp; unsigned long _pp_mapping_pad; unsigned long dma_addr; atomic_long_t pp_ref_count; };
Mirror these (and only these) entries into struct net_iov and implement netmem helpers that can access these common fields regardless of whether the underlying type is page or net_iov.
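To illustrate the mirroring, here is a minimal userspace sketch. The types demo_page and demo_iov and the helper demo_get_dma_addr() are hypothetical stand-ins, not the kernel structs. It only shows how compile-time offset checks let one accessor serve both types:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for struct page: one mm-private word, then the
 * five words that the page_pool rents.
 */
struct demo_page {
	uintptr_t mm_private;
	uintptr_t pp_magic;
	void *pp;
	uintptr_t _pp_mapping_pad;
	uintptr_t dma_addr;
	intptr_t pp_ref_count;
};

/* Hypothetical stand-in for struct net_iov: padding where the mm-private
 * word sits, and an owner pointer in the slot that the page only pads.
 */
struct demo_iov {
	uintptr_t __unused_padding;
	uintptr_t pp_magic;
	void *pp;
	void *owner;
	uintptr_t dma_addr;
	intptr_t pp_ref_count;
};

/* Fail the build if a mirrored field drifts away from its twin. */
#define DEMO_ASSERT_OFFSET(pg, iov)                     \
	static_assert(offsetof(struct demo_page, pg) == \
		      offsetof(struct demo_iov, iov),   \
		      "mirrored field moved")
DEMO_ASSERT_OFFSET(pp_magic, pp_magic);
DEMO_ASSERT_OFFSET(pp, pp);
DEMO_ASSERT_OFFSET(dma_addr, dma_addr);
DEMO_ASSERT_OFFSET(pp_ref_count, pp_ref_count);
#undef DEMO_ASSERT_OFFSET

/* One accessor for both types. The sketch copies from the shared offset
 * to stay within ISO C aliasing rules; the helpers in the patch cast to
 * struct net_iov * instead.
 */
static uintptr_t demo_get_dma_addr(const void *obj)
{
	uintptr_t val;

	memcpy(&val, (const char *)obj + offsetof(struct demo_iov, dma_addr),
	       sizeof(val));
	return val;
}
```

Because the offsets are asserted at build time, demo_get_dma_addr() never needs to know which of the two types it was handed.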
Implement checks for net_iov in netmem helpers which delegate to mm APIs, to ensure net_iov are never passed to the mm stack.
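The LSB overloading and the guard on mm-facing helpers can be sketched in userspace as follows. All demo_* names are hypothetical, and the sketch leaves out the static-branch check that the real netmem_is_net_iov() performs:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical tag bit, playing the role of NET_IOV. */
#define DEMO_IOV_TAG ((uintptr_t)1)

typedef uintptr_t demo_ref;

struct demo_page { int id; };
struct demo_iov { int id; };

/* The tag is only usable if real pointers never have bit 0 set. */
static_assert(_Alignof(struct demo_page) >= 2, "LSB must be free");
static_assert(_Alignof(struct demo_iov) >= 2, "LSB must be free");

static demo_ref demo_page_to_ref(struct demo_page *page)
{
	return (uintptr_t)page;
}

static demo_ref demo_iov_to_ref(struct demo_iov *iov)
{
	return (uintptr_t)iov | DEMO_IOV_TAG;
}

static bool demo_ref_is_iov(demo_ref ref)
{
	return ref & DEMO_IOV_TAG;
}

/* Guarded conversion: a tagged reference never reaches page-only code. */
static struct demo_page *demo_ref_to_page(demo_ref ref)
{
	if (demo_ref_is_iov(ref))
		return NULL;

	return (struct demo_page *)ref;
}

static struct demo_iov *demo_ref_to_iov(demo_ref ref)
{
	if (!demo_ref_is_iov(ref))
		return NULL;

	return (struct demo_iov *)(ref & ~DEMO_IOV_TAG);
}
```

Any helper that would hand the pointer to page-only code goes through demo_ref_to_page() first, so a tagged reference yields NULL rather than a bogus page pointer.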
Signed-off-by: Mina Almasry almasrymina@google.com
Per David Howells's request, I'm forwarding these 2 patches from the series to linux-mm & Matthew Wilcox.
https://lore.kernel.org/netdev/950858.1709622997@warthog.procyon.org.uk/
+linux-mm +Matthew Wilcox +David Howells
Some background on how we arrived at the current design may be useful. This was discussed across 2 (long) email threads, mostly with Jason Gunthorpe and Christian König:
https://lore.kernel.org/lkml/CAHS8izPm6XRS54LdCDZVd0C75tA1zHSu6jLVO8nzTLXCc=...
https://lore.kernel.org/bpf/CAHS8izMdKYyjE9bdcFDWWPWECwVZL7XQjtjOFoTq5_bEEJv...
The approach is based on supporting dma-buf scatterlist entries in the networking stack, so that data can be received directly into GPU memory, which is typically represented by a dmabuf. In a recent series I added the abstract netmem_ref type, which can multiplex between struct pages and non-struct pages (called net_iovs in this series):
https://lore.kernel.org/netdev/20240214223405.1972973-1-almasrymina@google.c...
v6:
- Rebased on top of the merged netmem_ref type.
- Rebased on top of the merged skb_pp_frag_ref() changes.
v5:
- Use netmem instead of page* with LSB set.
- Use pp_ref_count for refcounting net_iov.
- Removed many of the custom checks for netmem.
v1:
- Disable fragmentation support for iov properly.
- fix napi_pp_put_page() path (Yunsheng).
- Use pp_frag_count for devmem refcounting.
include/net/netmem.h | 142 ++++++++++++++++++++++++++++++-- include/net/page_pool/helpers.h | 25 +++--- include/net/page_pool/types.h | 1 + net/core/page_pool.c | 26 +++--- net/core/skbuff.c | 23 +++--- 5 files changed, 171 insertions(+), 46 deletions(-)
diff --git a/include/net/netmem.h b/include/net/netmem.h
index 21f53b29e5fe..8699788d587d 100644
--- a/include/net/netmem.h
+++ b/include/net/netmem.h
@@ -9,14 +9,51 @@
 #define _NET_NETMEM_H

 #include <net/devmem.h>
+#include <net/net_debug.h>

 /* net_iov */

+DECLARE_STATIC_KEY_FALSE(page_pool_mem_providers);
+
+/* We overload the LSB of the struct page pointer to indicate whether it's
+ * a page or net_iov.
+ */
+#define NET_IOV 0x01UL
+
 struct net_iov {
+	unsigned long __unused_padding;
+	unsigned long pp_magic;
+	struct page_pool *pp;
 	struct dmabuf_genpool_chunk_owner *owner;
 	unsigned long dma_addr;
+	atomic_long_t pp_ref_count;
 };

+/* These fields in struct page are used by the page_pool and net stack:
+ *
+ *	struct {
+ *		unsigned long pp_magic;
+ *		struct page_pool *pp;
+ *		unsigned long _pp_mapping_pad;
+ *		unsigned long dma_addr;
+ *		atomic_long_t pp_ref_count;
+ *	};
+ *
+ * We mirror the page_pool fields here so the page_pool can access these fields
+ * without worrying whether the underlying fields belong to a page or net_iov.
+ *
+ * The non-net stack fields of struct page are private to the mm stack and must
+ * never be mirrored to net_iov.
+ */
+#define NET_IOV_ASSERT_OFFSET(pg, iov)             \
+	static_assert(offsetof(struct page, pg) == \
+		      offsetof(struct net_iov, iov))
+NET_IOV_ASSERT_OFFSET(pp_magic, pp_magic);
+NET_IOV_ASSERT_OFFSET(pp, pp);
+NET_IOV_ASSERT_OFFSET(dma_addr, dma_addr);
+NET_IOV_ASSERT_OFFSET(pp_ref_count, pp_ref_count);
+#undef NET_IOV_ASSERT_OFFSET
+
 static inline struct dmabuf_genpool_chunk_owner *
 net_iov_owner(const struct net_iov *niov)
 {
@@ -69,20 +106,27 @@ net_iov_binding(const struct net_iov *niov)
  */
 typedef unsigned long __bitwise netmem_ref;

+static inline bool netmem_is_net_iov(const netmem_ref netmem)
+{
+#ifdef CONFIG_PAGE_POOL
+	return static_branch_unlikely(&page_pool_mem_providers) &&
+	       (__force unsigned long)netmem & NET_IOV;
+#else
+	return false;
+#endif
+}
+
 /* This conversion fails (returns NULL) if the netmem_ref is not struct page
  * backed.
- *
- * Currently struct page is the only possible netmem, and this helper never
- * fails.
  */
 static inline struct page *netmem_to_page(netmem_ref netmem)
 {
+	if (WARN_ON_ONCE(netmem_is_net_iov(netmem)))
+		return NULL;
+
 	return (__force struct page *)netmem;
 }

-/* Converting from page to netmem is always safe, because a page can always be
- * a netmem.
- */
 static inline netmem_ref page_to_netmem(struct page *page)
 {
 	return (__force netmem_ref)page;
@@ -90,17 +134,103 @@ static inline netmem_ref page_to_netmem(struct page *page)

 static inline int netmem_ref_count(netmem_ref netmem)
 {
+	/* The non-pp refcount of net_iov is always 1. On net_iov, we only
+	 * support pp refcounting which uses the pp_ref_count field.
+	 */
+	if (netmem_is_net_iov(netmem))
+		return 1;
+
 	return page_ref_count(netmem_to_page(netmem));
 }

 static inline unsigned long netmem_to_pfn(netmem_ref netmem)
 {
+	if (netmem_is_net_iov(netmem))
+		return 0;
+
 	return page_to_pfn(netmem_to_page(netmem));
 }

+static inline struct net_iov *__netmem_clear_lsb(netmem_ref netmem)
+{
+	return (struct net_iov *)((__force unsigned long)netmem & ~NET_IOV);
+}
+
+static inline unsigned long netmem_get_pp_magic(netmem_ref netmem)
+{
+	return __netmem_clear_lsb(netmem)->pp_magic;
+}
+
+static inline void netmem_or_pp_magic(netmem_ref netmem, unsigned long pp_magic)
+{
+	__netmem_clear_lsb(netmem)->pp_magic |= pp_magic;
+}
+
+static inline void netmem_clear_pp_magic(netmem_ref netmem)
+{
+	__netmem_clear_lsb(netmem)->pp_magic = 0;
+}
+
+static inline struct page_pool *netmem_get_pp(netmem_ref netmem)
+{
+	return __netmem_clear_lsb(netmem)->pp;
+}
+
+static inline void netmem_set_pp(netmem_ref netmem, struct page_pool *pool)
+{
+	__netmem_clear_lsb(netmem)->pp = pool;
+}
+
+static inline unsigned long netmem_get_dma_addr(netmem_ref netmem)
+{
+	return __netmem_clear_lsb(netmem)->dma_addr;
+}
+
+static inline void netmem_set_dma_addr(netmem_ref netmem,
+				       unsigned long dma_addr)
+{
+	__netmem_clear_lsb(netmem)->dma_addr = dma_addr;
+}
+
+static inline atomic_long_t *netmem_get_pp_ref_count_ref(netmem_ref netmem)
+{
+	return &__netmem_clear_lsb(netmem)->pp_ref_count;
+}
+
+static inline bool netmem_is_pref_nid(netmem_ref netmem, int pref_nid)
+{
+	/* Assume net_iov are on the preferred node without actually
+	 * checking...
+	 *
+	 * This check is only used to check for recycling memory in the page
+	 * pool's fast paths. Currently the only implementation of net_iov
+	 * is dmabuf device memory. It's a deliberate decision by the user to
+	 * bind a certain dmabuf to a certain netdev, and the netdev rx queue
+	 * would not be able to reallocate memory from another dmabuf that
+	 * exists on the preferred node, so, this check doesn't make much sense
+	 * in this case. Assume all net_iovs can be recycled for now.
+	 */
+	if (netmem_is_net_iov(netmem))
+		return true;
+
+	return page_to_nid(netmem_to_page(netmem)) == pref_nid;
+}
+
 static inline netmem_ref netmem_compound_head(netmem_ref netmem)
 {
+	/* niov are never compounded */
+	if (netmem_is_net_iov(netmem))
+		return netmem;
+
 	return page_to_netmem(compound_head(netmem_to_page(netmem)));
 }

+static inline void *netmem_address(netmem_ref netmem)
+{
+	if (netmem_is_net_iov(netmem))
+		return NULL;
+
+	return page_address(netmem_to_page(netmem));
+}
+
 #endif /* _NET_NETMEM_H */
diff --git a/include/net/page_pool/helpers.h b/include/net/page_pool/helpers.h
index 61814f91a458..c6a55eddefae 100644
--- a/include/net/page_pool/helpers.h
+++ b/include/net/page_pool/helpers.h
@@ -215,7 +215,7 @@ inline enum dma_data_direction page_pool_get_dma_dir(struct page_pool *pool)

 static inline void page_pool_fragment_netmem(netmem_ref netmem, long nr)
 {
-	atomic_long_set(&netmem_to_page(netmem)->pp_ref_count, nr);
+	atomic_long_set(netmem_get_pp_ref_count_ref(netmem), nr);
 }

 /**
@@ -243,7 +243,7 @@ static inline void page_pool_fragment_page(struct page *page, long nr)

 static inline long page_pool_unref_netmem(netmem_ref netmem, long nr)
 {
-	struct page *page = netmem_to_page(netmem);
+	atomic_long_t *pp_ref_count = netmem_get_pp_ref_count_ref(netmem);
 	long ret;

 	/* If nr == pp_ref_count then we have cleared all remaining
@@ -260,19 +260,19 @@ static inline long page_pool_unref_netmem(netmem_ref netmem, long nr)
 	 * initially, and only overwrite it when the page is partitioned into
 	 * more than one piece.
 	 */
-	if (atomic_long_read(&page->pp_ref_count) == nr) {
+	if (atomic_long_read(pp_ref_count) == nr) {
 		/* As we have ensured nr is always one for constant case using
 		 * the BUILD_BUG_ON(), only need to handle the non-constant case
 		 * here for pp_ref_count draining, which is a rare case.
 		 */
 		BUILD_BUG_ON(__builtin_constant_p(nr) && nr != 1);
 		if (!__builtin_constant_p(nr))
-			atomic_long_set(&page->pp_ref_count, 1);
+			atomic_long_set(pp_ref_count, 1);

 		return 0;
 	}

-	ret = atomic_long_sub_return(nr, &page->pp_ref_count);
+	ret = atomic_long_sub_return(nr, pp_ref_count);
 	WARN_ON(ret < 0);

 	/* We are the last user here too, reset pp_ref_count back to 1 to
@@ -281,7 +281,7 @@ static inline long page_pool_unref_netmem(netmem_ref netmem, long nr)
 	 * page_pool_unref_page() currently.
 	 */
 	if (unlikely(!ret))
-		atomic_long_set(&page->pp_ref_count, 1);
+		atomic_long_set(pp_ref_count, 1);

 	return ret;
 }
@@ -400,9 +400,7 @@ static inline void page_pool_free_va(struct page_pool *pool, void *va,

 static inline dma_addr_t page_pool_get_dma_addr_netmem(netmem_ref netmem)
 {
-	struct page *page = netmem_to_page(netmem);
-
-	dma_addr_t ret = page->dma_addr;
+	dma_addr_t ret = netmem_get_dma_addr(netmem);

 	if (PAGE_POOL_32BIT_ARCH_WITH_64BIT_DMA)
 		ret <<= PAGE_SHIFT;
@@ -425,18 +423,17 @@ static inline dma_addr_t page_pool_get_dma_addr(struct page *page)
 static inline bool page_pool_set_dma_addr_netmem(netmem_ref netmem,
 						 dma_addr_t addr)
 {
-	struct page *page = netmem_to_page(netmem);
-
 	if (PAGE_POOL_32BIT_ARCH_WITH_64BIT_DMA) {
-		page->dma_addr = addr >> PAGE_SHIFT;
+		netmem_set_dma_addr(netmem, addr >> PAGE_SHIFT);

 		/* We assume page alignment to shave off bottom bits,
 		 * if this "compression" doesn't work we need to drop.
 		 */
-		return addr != (dma_addr_t)page->dma_addr << PAGE_SHIFT;
+		return addr != (dma_addr_t)netmem_get_dma_addr(netmem)
+				       << PAGE_SHIFT;
 	}

-	page->dma_addr = addr;
+	netmem_set_dma_addr(netmem, addr);
 	return false;
 }

diff --git a/include/net/page_pool/types.h b/include/net/page_pool/types.h
index 68a24c5ae827..e29e77f7934e 100644
--- a/include/net/page_pool/types.h
+++ b/include/net/page_pool/types.h
@@ -6,6 +6,7 @@
 #include <linux/dma-direction.h>
 #include <linux/ptr_ring.h>
 #include <linux/types.h>
+#include <net/netmem.h>

 #define PP_FLAG_DMA_MAP		BIT(0) /* Should page_pool do the DMA
 					* map/unmap
diff --git a/net/core/page_pool.c b/net/core/page_pool.c
index 24d5236b2efc..22e3d439da18 100644
--- a/net/core/page_pool.c
+++ b/net/core/page_pool.c
@@ -25,7 +25,7 @@

 #include "page_pool_priv.h"

-static DEFINE_STATIC_KEY_FALSE(page_pool_mem_providers);
+DEFINE_STATIC_KEY_FALSE(page_pool_mem_providers);

 #define DEFER_TIME (msecs_to_jiffies(1000))
 #define DEFER_WARN_INTERVAL (60 * HZ)
@@ -359,7 +359,7 @@ static noinline netmem_ref page_pool_refill_alloc_cache(struct page_pool *pool)
 		if (unlikely(!netmem))
 			break;

-		if (likely(page_to_nid(netmem_to_page(netmem)) == pref_nid)) {
+		if (likely(netmem_is_pref_nid(netmem, pref_nid))) {
 			pool->alloc.cache[pool->alloc.count++] = netmem;
 		} else {
 			/* NUMA mismatch;
@@ -446,10 +446,8 @@ static bool page_pool_dma_map(struct page_pool *pool, netmem_ref netmem)

 static void page_pool_set_pp_info(struct page_pool *pool, netmem_ref netmem)
 {
-	struct page *page = netmem_to_page(netmem);
-
-	page->pp = pool;
-	page->pp_magic |= PP_SIGNATURE;
+	netmem_set_pp(netmem, pool);
+	netmem_or_pp_magic(netmem, PP_SIGNATURE);

 	/* Ensuring all pages have been split into one fragment initially:
 	 * page_pool_set_pp_info() is only called once for every page when it
@@ -464,10 +462,8 @@ static void page_pool_set_pp_info(struct page_pool *pool, netmem_ref netmem)

 static void page_pool_clear_pp_info(netmem_ref netmem)
 {
-	struct page *page = netmem_to_page(netmem);
-
-	page->pp_magic = 0;
-	page->pp = NULL;
+	netmem_clear_pp_magic(netmem);
+	netmem_set_pp(netmem, NULL);
 }

 static struct page *__page_pool_alloc_page_order(struct page_pool *pool,
@@ -695,8 +691,9 @@ static bool page_pool_recycle_in_cache(netmem_ref netmem,

 static bool __page_pool_page_can_be_recycled(netmem_ref netmem)
 {
-	return page_ref_count(netmem_to_page(netmem)) == 1 &&
-	       !page_is_pfmemalloc(netmem_to_page(netmem));
+	return netmem_is_net_iov(netmem) ||
+	       (page_ref_count(netmem_to_page(netmem)) == 1 &&
+		!page_is_pfmemalloc(netmem_to_page(netmem)));
 }

 /* If the page refcnt == 1, this will try to recycle the page.
@@ -718,7 +715,7 @@ __page_pool_put_page(struct page_pool *pool, netmem_ref netmem,
 	 * refcnt == 1 means page_pool owns page, and can recycle it.
 	 *
 	 * page is NOT reusable when allocated when system is under
-	 * some pressure. (page_is_pfmemalloc)
+	 * some pressure. (page_pool_page_is_pfmemalloc)
 	 */
 	if (likely(__page_pool_page_can_be_recycled(netmem))) {
 		/* Read barrier done in page_ref_count / READ_ONCE */
@@ -734,6 +731,7 @@ __page_pool_put_page(struct page_pool *pool, netmem_ref netmem,
 		/* Page found as candidate for recycling */
 		return netmem;
 	}
+
 	/* Fallback/non-XDP mode: API user have elevated refcnt.
 	 *
 	 * Many drivers split up the page into fragments, and some
@@ -928,7 +926,7 @@ static void page_pool_empty_ring(struct page_pool *pool)
 	/* Empty recycle ring */
 	while ((netmem = (__force netmem_ref)ptr_ring_consume_bh(&pool->ring))) {
 		/* Verify the refcnt invariant of cached pages */
-		if (!(page_ref_count(netmem_to_page(netmem)) == 1))
+		if (!(netmem_ref_count(netmem) == 1))
 			pr_crit("%s() page_pool refcnt %d violation\n",
 				__func__, netmem_ref_count(netmem));

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index e1118b637085..cf23392e97f5 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -908,9 +908,9 @@ static void skb_clone_fraglist(struct sk_buff *skb)
 		skb_get(list);
 }

-static bool is_pp_page(struct page *page)
+static bool is_pp_netmem(netmem_ref netmem)
 {
-	return (page->pp_magic & ~0x3UL) == PP_SIGNATURE;
+	return (netmem_get_pp_magic(netmem) & ~0x3UL) == PP_SIGNATURE;
 }

 int skb_pp_cow_data(struct page_pool *pool, struct sk_buff **pskb,
@@ -1008,11 +1008,10 @@ EXPORT_SYMBOL(skb_cow_data_for_xdp);
 #if IS_ENABLED(CONFIG_PAGE_POOL)
 bool napi_pp_put_page(netmem_ref netmem, bool napi_safe)
 {
-	struct page *page = netmem_to_page(netmem);
 	bool allow_direct = false;
 	struct page_pool *pp;

-	page = compound_head(page);
+	netmem = netmem_compound_head(netmem);

 	/* page->pp_magic is OR'ed with PP_SIGNATURE after the allocation
 	 * in order to preserve any existing bits, such as bit 0 for the
@@ -1021,10 +1020,10 @@ bool napi_pp_put_page(netmem_ref netmem, bool napi_safe)
 	 * and page_is_pfmemalloc() is checked in __page_pool_put_page()
 	 * to avoid recycling the pfmemalloc page.
 	 */
-	if (unlikely(!is_pp_page(page)))
+	if (unlikely(!is_pp_netmem(netmem)))
 		return false;

-	pp = page->pp;
+	pp = netmem_get_pp(netmem);

 	/* Allow direct recycle if we have reasons to believe that we are
 	 * in the same context as the consumer would run, so there's
@@ -1045,7 +1044,7 @@ bool napi_pp_put_page(netmem_ref netmem, bool napi_safe)
 	 * The page will be returned to the pool here regardless of the
 	 * 'flipped' fragment being in use or not.
 	 */
-	page_pool_put_full_netmem(pp, page_to_netmem(page), allow_direct);
+	page_pool_put_full_netmem(pp, netmem, allow_direct);

 	return true;
 }
@@ -1072,7 +1071,7 @@ static bool skb_pp_recycle(struct sk_buff *skb, void *data, bool napi_safe)
 static int skb_pp_frag_ref(struct sk_buff *skb)
 {
 	struct skb_shared_info *shinfo;
-	struct page *head_page;
+	netmem_ref head_netmem;
 	int i;

 	if (!skb->pp_recycle)
@@ -1081,11 +1080,11 @@ static int skb_pp_frag_ref(struct sk_buff *skb)
 	shinfo = skb_shinfo(skb);

 	for (i = 0; i < shinfo->nr_frags; i++) {
-		head_page = compound_head(skb_frag_page(&shinfo->frags[i]));
-		if (likely(is_pp_page(head_page)))
-			page_pool_ref_page(head_page);
+		head_netmem = netmem_compound_head(shinfo->frags[i].netmem);
+		if (likely(is_pp_netmem(head_netmem)))
+			page_pool_ref_netmem(head_netmem);
 		else
-			page_ref_inc(head_page);
+			page_ref_inc(netmem_to_page(head_netmem));
 	}
 	return 0;
 }
2.44.0.rc1.240.g4c46232300-goog
Implement a memory provider that allocates dmabuf devmem in the form of net_iov.
The provider receives a reference to the struct netdev_dmabuf_binding via the pool->mp_priv pointer. The driver needs to set this pointer for the provider in the net_iov.
The provider obtains a reference on the netdev_dmabuf_binding, which guarantees that the binding and the underlying mapping remain alive until the provider is destroyed.
Usage of PP_FLAG_DMA_MAP is required for this memory provider so that the page_pool can provide the driver with the dma-addrs of the devmem.
Support for PP_FLAG_DMA_SYNC_DEV and for p.order != 0 is omitted for simplicity.
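The constraints above amount to a short validation sequence at provider init time. A userspace sketch, assuming hypothetical demo_* types and flag values in place of the page_pool ones, with a plain counter standing in for the binding refcount:

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical flag values, standing in for the PP_FLAG_* bits. */
#define DEMO_FLAG_DMA_MAP	(1u << 0)
#define DEMO_FLAG_DMA_SYNC_DEV	(1u << 1)

struct demo_binding {
	int refs;	/* stand-in for the binding's reference count */
};

struct demo_pool {
	unsigned int flags;
	unsigned int order;
	struct demo_binding *mp_priv;
};

/* Reject unsupported configurations, then pin the binding. */
static int demo_provider_init(struct demo_pool *pool)
{
	struct demo_binding *binding = pool->mp_priv;

	if (!binding)
		return -EINVAL;

	/* The pool must hand out DMA addresses for the device memory. */
	if (!(pool->flags & DEMO_FLAG_DMA_MAP))
		return -EOPNOTSUPP;

	if (pool->flags & DEMO_FLAG_DMA_SYNC_DEV)
		return -EOPNOTSUPP;

	if (pool->order != 0)
		return -E2BIG;

	binding->refs++;
	return 0;
}

/* Drop the reference taken at init time. */
static void demo_provider_destroy(struct demo_pool *pool)
{
	pool->mp_priv->refs--;
}
```

The reference is taken only after every check has passed, so a failed init leaves the binding untouched and needs no unwind.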
Signed-off-by: Willem de Bruijn willemb@google.com
Signed-off-by: Kaiyuan Zhang kaiyuanz@google.com
Signed-off-by: Mina Almasry almasrymina@google.com
---
v6:
- refactor new memory provider functions into net/core/devmem.c (Pavel)
v2:
- Disable devmem for p.order != 0
v1:
- static_branch check in page_is_page_pool_iov() (Willem & Paolo).
- PP_DEVMEM -> PP_IOV (David).
- Require PP_FLAG_DMA_MAP (Jakub).
--- include/net/netmem.h | 14 ++++++ include/net/page_pool/helpers.h | 21 +++++++++ include/net/page_pool/types.h | 2 + net/core/devmem.c | 82 +++++++++++++++++++++++++++++++++ net/core/page_pool.c | 35 ++++++-------- 5 files changed, 132 insertions(+), 22 deletions(-)
diff --git a/include/net/netmem.h b/include/net/netmem.h index 8699788d587d..a2de9411025d 100644 --- a/include/net/netmem.h +++ b/include/net/netmem.h @@ -127,6 +127,20 @@ static inline struct page *netmem_to_page(netmem_ref netmem) return (__force struct page *)netmem; }
+static inline struct net_iov *netmem_to_net_iov(netmem_ref netmem) +{ + if (netmem_is_net_iov(netmem)) + return (struct net_iov *)((__force unsigned long)netmem & ~NET_IOV); + + DEBUG_NET_WARN_ON_ONCE(true); + return NULL; +} + +static inline netmem_ref net_iov_to_netmem(struct net_iov *niov) +{ + return (__force netmem_ref)((unsigned long)niov | NET_IOV); +} + static inline netmem_ref page_to_netmem(struct page *page) { return (__force netmem_ref)page; diff --git a/include/net/page_pool/helpers.h b/include/net/page_pool/helpers.h index c6a55eddefae..00682b4de6e8 100644 --- a/include/net/page_pool/helpers.h +++ b/include/net/page_pool/helpers.h @@ -453,4 +453,25 @@ static inline void page_pool_nid_changed(struct page_pool *pool, int new_nid) page_pool_update_nid(pool, new_nid); }
+static inline void page_pool_set_pp_info(struct page_pool *pool, netmem_ref netmem) +{ + netmem_set_pp(netmem, pool); + netmem_or_pp_magic(netmem, PP_SIGNATURE); + + /* Ensuring all pages have been split into one fragment initially: + * page_pool_set_pp_info() is only called once for every page when it + * is allocated from the page allocator and page_pool_fragment_page() + * is dirtying the same cache line as the page->pp_magic above, so + * the overhead is negligible. + */ + page_pool_fragment_netmem(netmem, 1); + if (pool->has_init_callback) + pool->slow.init_callback(netmem, pool->slow.init_arg); +} + +static inline void page_pool_clear_pp_info(netmem_ref netmem) +{ + netmem_clear_pp_magic(netmem); + netmem_set_pp(netmem, NULL); +} #endif /* _NET_PAGE_POOL_HELPERS_H */ diff --git a/include/net/page_pool/types.h b/include/net/page_pool/types.h index e29e77f7934e..096cd2455b2c 100644 --- a/include/net/page_pool/types.h +++ b/include/net/page_pool/types.h @@ -136,6 +136,8 @@ struct memory_provider_ops { bool (*release_page)(struct page_pool *pool, netmem_ref netmem); };
+extern const struct memory_provider_ops dmabuf_devmem_ops; + struct page_pool { struct page_pool_params_fast p;
diff --git a/net/core/devmem.c b/net/core/devmem.c index 57d3a1f223ef..3ced312f7860 100644 --- a/net/core/devmem.c +++ b/net/core/devmem.c @@ -329,3 +329,85 @@ int netdev_bind_dmabuf(struct net_device *dev, unsigned int dmabuf_fd, return err; } #endif + +/*** "Dmabuf devmem memory provider" ***/ + +static int mp_dmabuf_devmem_init(struct page_pool *pool) +{ + struct netdev_dmabuf_binding *binding = pool->mp_priv; + + if (!binding) + return -EINVAL; + + if (!(pool->p.flags & PP_FLAG_DMA_MAP)) + return -EOPNOTSUPP; + + if (pool->p.flags & PP_FLAG_DMA_SYNC_DEV) + return -EOPNOTSUPP; + + if (pool->p.order != 0) + return -E2BIG; + + netdev_dmabuf_binding_get(binding); + return 0; +} + +static netmem_ref mp_dmabuf_devmem_alloc_pages(struct page_pool *pool, + gfp_t gfp) +{ + struct netdev_dmabuf_binding *binding = pool->mp_priv; + netmem_ref netmem; + struct net_iov *niov; + dma_addr_t dma_addr; + + niov = netdev_alloc_dmabuf(binding); + if (!niov) + return 0; + + dma_addr = net_iov_dma_addr(niov); + + netmem = net_iov_to_netmem(niov); + + page_pool_set_pp_info(pool, netmem); + + if (page_pool_set_dma_addr_netmem(netmem, dma_addr)) + goto err_free; + + pool->pages_state_hold_cnt++; + trace_page_pool_state_hold(pool, netmem, pool->pages_state_hold_cnt); + return netmem; + +err_free: + netdev_free_dmabuf(niov); + return 0; +} + +static void mp_dmabuf_devmem_destroy(struct page_pool *pool) +{ + struct netdev_dmabuf_binding *binding = pool->mp_priv; + + netdev_dmabuf_binding_put(binding); +} + +static bool mp_dmabuf_devmem_release_page(struct page_pool *pool, + netmem_ref netmem) +{ + WARN_ON_ONCE(!netmem_is_net_iov(netmem)); + WARN_ON_ONCE(atomic_long_read(netmem_get_pp_ref_count_ref(netmem)) + != 1); + + page_pool_clear_pp_info(netmem); + + netdev_free_dmabuf(netmem_to_net_iov(netmem)); + + /* We don't want the page pool put_page()ing our net_iovs. 
*/ + return false; +} + +const struct memory_provider_ops dmabuf_devmem_ops = { + .init = mp_dmabuf_devmem_init, + .destroy = mp_dmabuf_devmem_destroy, + .alloc_pages = mp_dmabuf_devmem_alloc_pages, + .release_page = mp_dmabuf_devmem_release_page, +}; +EXPORT_SYMBOL(dmabuf_devmem_ops); diff --git a/net/core/page_pool.c b/net/core/page_pool.c index 22e3d439da18..2cee7d9f6ca6 100644 --- a/net/core/page_pool.c +++ b/net/core/page_pool.c @@ -12,6 +12,7 @@
#include <net/page_pool/helpers.h> #include <net/xdp.h> +#include <net/netdev_rx_queue.h>
#include <linux/dma-direction.h> #include <linux/dma-mapping.h> @@ -20,12 +21,15 @@ #include <linux/poison.h> #include <linux/ethtool.h> #include <linux/netdevice.h> +#include <linux/genalloc.h> +#include <net/devmem.h>
#include <trace/events/page_pool.h>
#include "page_pool_priv.h"
DEFINE_STATIC_KEY_FALSE(page_pool_mem_providers); +EXPORT_SYMBOL(page_pool_mem_providers);
#define DEFER_TIME (msecs_to_jiffies(1000)) #define DEFER_WARN_INTERVAL (60 * HZ) @@ -178,6 +182,7 @@ static int page_pool_init(struct page_pool *pool, const struct page_pool_params *params, int cpuid) { + struct netdev_dmabuf_binding *binding = NULL; unsigned int ring_qsize = 1024; /* Default */ int err;
@@ -251,6 +256,14 @@ static int page_pool_init(struct page_pool *pool, /* Driver calling page_pool_create() also call page_pool_destroy() */ refcount_set(&pool->user_cnt, 1);
+ if (pool->p.queue) + binding = READ_ONCE(pool->p.queue->binding); + + if (binding) { + pool->mp_ops = &dmabuf_devmem_ops; + pool->mp_priv = binding; + } + if (pool->mp_ops) { err = pool->mp_ops->init(pool); if (err) { @@ -444,28 +457,6 @@ static bool page_pool_dma_map(struct page_pool *pool, netmem_ref netmem) return false; }
-static void page_pool_set_pp_info(struct page_pool *pool, netmem_ref netmem) -{ - netmem_set_pp(netmem, pool); - netmem_or_pp_magic(netmem, PP_SIGNATURE); - - /* Ensuring all pages have been split into one fragment initially: - * page_pool_set_pp_info() is only called once for every page when it - * is allocated from the page allocator and page_pool_fragment_page() - * is dirtying the same cache line as the page->pp_magic above, so - * the overhead is negligible. - */ - page_pool_fragment_netmem(netmem, 1); - if (pool->has_init_callback) - pool->slow.init_callback(netmem, pool->slow.init_arg); -} - -static void page_pool_clear_pp_info(netmem_ref netmem) -{ - netmem_clear_pp_magic(netmem); - netmem_set_pp(netmem, NULL); -} - static struct page *__page_pool_alloc_page_order(struct page_pool *pool, gfp_t gfp) {
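The provider hook-up in the patch above can be modelled in userspace. This is only a rough sketch: every `mock_`/`MOCK_` name and both flag values are made up for illustration and are not the kernel definitions. It shows the same shape as `mp_dmabuf_devmem_init()`: the pool picks up the provider from the queue's binding, the provider validates the pool parameters, and it takes a reference on the binding that `destroy` later drops.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical flag values, chosen only for this model. */
#define MOCK_PP_FLAG_DMA_MAP      (1u << 0)
#define MOCK_PP_FLAG_DMA_SYNC_DEV (1u << 1)

/* Stand-in for netdev_dmabuf_binding: just a reference count. */
struct mock_binding {
	int refs;
};

struct mock_pool;

/* Stand-in for memory_provider_ops (only init/destroy are modelled). */
struct mock_provider_ops {
	int  (*init)(struct mock_pool *pool);
	void (*destroy)(struct mock_pool *pool);
};

struct mock_pool {
	unsigned int flags;
	unsigned int order;
	const struct mock_provider_ops *mp_ops;
	void *mp_priv;
};

/* Same checks, in the same order, as mp_dmabuf_devmem_init() in the patch. */
static int mock_dmabuf_init(struct mock_pool *pool)
{
	struct mock_binding *binding = pool->mp_priv;

	if (!binding)
		return -EINVAL;
	if (!(pool->flags & MOCK_PP_FLAG_DMA_MAP))
		return -EOPNOTSUPP;
	if (pool->flags & MOCK_PP_FLAG_DMA_SYNC_DEV)
		return -EOPNOTSUPP;
	if (pool->order != 0)
		return -E2BIG;

	binding->refs++;	/* the pool now holds a reference */
	return 0;
}

static void mock_dmabuf_destroy(struct mock_pool *pool)
{
	struct mock_binding *binding = pool->mp_priv;

	binding->refs--;	/* drop the reference taken in init */
}

static const struct mock_provider_ops mock_dmabuf_ops = {
	.init    = mock_dmabuf_init,
	.destroy = mock_dmabuf_destroy,
};

/* Models the page_pool_init() hunk: a bound queue selects the provider. */
static int mock_pool_init(struct mock_pool *pool,
			  struct mock_binding *queue_binding,
			  unsigned int flags, unsigned int order)
{
	pool->flags = flags;
	pool->order = order;
	pool->mp_ops = NULL;
	pool->mp_priv = NULL;

	if (queue_binding) {
		pool->mp_ops = &mock_dmabuf_ops;
		pool->mp_priv = queue_binding;
	}

	if (pool->mp_ops)
		return pool->mp_ops->init(pool);
	return 0;
}
```

A pool created on an unbound queue keeps `mp_ops == NULL` and takes the regular page allocation path, which is why the provider checks only run when a binding is present.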
On 2024-03-04 18:01, Mina Almasry wrote:
> +	if (pool->p.queue)
> +		binding = READ_ONCE(pool->p.queue->binding);
> +
> +	if (binding) {
> +		pool->mp_ops = &dmabuf_devmem_ops;
> +		pool->mp_priv = binding;
> +	}
This is specific to TCP devmem. For ZC Rx we will need something more generic to let us pass our own memory provider backend down to the page pool.
What about storing ops and priv void ptr in struct netdev_rx_queue instead? Then we can both use it.
On Tue, Mar 5, 2024 at 6:28 PM David Wei dw@davidwei.uk wrote:
> On 2024-03-04 18:01, Mina Almasry wrote:
> > +	if (pool->p.queue)
> > +		binding = READ_ONCE(pool->p.queue->binding);
> > +
> > +	if (binding) {
> > +		pool->mp_ops = &dmabuf_devmem_ops;
> > +		pool->mp_priv = binding;
> > +	}
> This is specific to TCP devmem. For ZC Rx we will need something more generic to let us pass our own memory provider backend down to the page pool.
> What about storing ops and priv void ptr in struct netdev_rx_queue instead? Then we can both use it.
Yes, this is dmabuf specific, I was thinking you'd define your own member of netdev_rx_queue, and then add something like this to page_pool_init:
+	if (pool->p.queue)
+		io_uring_metadata = READ_ONCE(pool->p.queue->io_uring_metadata);
+
+	/* We don't support rx-queues that are configured for both
+	 * io_uring & dmabuf binding */
+	BUG_ON(io_uring_metadata && binding);
+
+	if (io_uring_metadata) {
+		pool->mp_ops = &io_uring_ops;
+		pool->mp_priv = io_uring_metadata;
+	}
I.e., we share the pool->mp_ops and the pool->mp_priv but we don't really need to share the same netdev_rx_queue member. For me it's a dma-buf specific data structure (netdev_dmabuf_binding) and for you it's something else.
page_pool_init() probably needs to validate that the queue is configured for dma-buf or io_uring but not both. If it's configured for both then the user is doing something funky we shouldn't support.
Perhaps I can make the intention clearer by renaming 'binding' to something more specific to dma-buf like queue->dmabuf_binding, to make it clear that this is the dma-buf binding and not some other binding like io_uring?
On 2024-03-05 18:42, Mina Almasry wrote:
> On Tue, Mar 5, 2024 at 6:28 PM David Wei dw@davidwei.uk wrote:
> > On 2024-03-04 18:01, Mina Almasry wrote:
> > > +	if (pool->p.queue)
> > > +		binding = READ_ONCE(pool->p.queue->binding);
> > > +
> > > +	if (binding) {
> > > +		pool->mp_ops = &dmabuf_devmem_ops;
> > > +		pool->mp_priv = binding;
> > > +	}
> > This is specific to TCP devmem. For ZC Rx we will need something more generic to let us pass our own memory provider backend down to the page pool.
> > What about storing ops and priv void ptr in struct netdev_rx_queue instead? Then we can both use it.
> Yes, this is dmabuf specific, I was thinking you'd define your own member of netdev_rx_queue, and then add something like this to page_pool_init:
> +	if (pool->p.queue)
> +		io_uring_metadata = READ_ONCE(pool->p.queue->io_uring_metadata);
> +
> +	/* We don't support rx-queues that are configured for both
> +	 * io_uring & dmabuf binding */
> +	BUG_ON(io_uring_metadata && binding);
> +
> +	if (io_uring_metadata) {
> +		pool->mp_ops = &io_uring_ops;
> +		pool->mp_priv = io_uring_metadata;
> +	}
> I.e., we share the pool->mp_ops and the pool->mp_priv but we don't really need to share the same netdev_rx_queue member. For me it's a dma-buf specific data structure (netdev_dmabuf_binding) and for you it's something else.
This adds size to struct netdev_rx_queue and requires checks on whether both are set. There can be thousands of these structs at any one time so if we don't need to add size unnecessarily then that would be best.
We can disambiguate by comparing &mp_ops and then cast the void ptr to our impl specific objects.
What do you not like about this approach?
> page_pool_init() probably needs to validate that the queue is configured for dma-buf or io_uring but not both. If it's configured for both then the user is doing something funky we shouldn't support.
> Perhaps I can make the intention clearer by renaming 'binding' to something more specific to dma-buf like queue->dmabuf_binding, to make it clear that this is the dma-buf binding and not some other binding like io_uring?
On Tue, Mar 5, 2024 at 6:47 PM David Wei dw@davidwei.uk wrote:
> On 2024-03-05 18:42, Mina Almasry wrote:
> > On Tue, Mar 5, 2024 at 6:28 PM David Wei dw@davidwei.uk wrote:
> > > On 2024-03-04 18:01, Mina Almasry wrote:
> > > > +	if (pool->p.queue)
> > > > +		binding = READ_ONCE(pool->p.queue->binding);
> > > > +
> > > > +	if (binding) {
> > > > +		pool->mp_ops = &dmabuf_devmem_ops;
> > > > +		pool->mp_priv = binding;
> > > > +	}
> > > This is specific to TCP devmem. For ZC Rx we will need something more generic to let us pass our own memory provider backend down to the page pool.
> > > What about storing ops and priv void ptr in struct netdev_rx_queue instead? Then we can both use it.
> > Yes, this is dmabuf specific, I was thinking you'd define your own member of netdev_rx_queue, and then add something like this to page_pool_init:
> > +	if (pool->p.queue)
> > +		io_uring_metadata = READ_ONCE(pool->p.queue->io_uring_metadata);
> > +
> > +	/* We don't support rx-queues that are configured for both
> > +	 * io_uring & dmabuf binding */
> > +	BUG_ON(io_uring_metadata && binding);
> > +
> > +	if (io_uring_metadata) {
> > +		pool->mp_ops = &io_uring_ops;
> > +		pool->mp_priv = io_uring_metadata;
> > +	}
> > I.e., we share the pool->mp_ops and the pool->mp_priv but we don't really need to share the same netdev_rx_queue member. For me it's a dma-buf specific data structure (netdev_dmabuf_binding) and for you it's something else.
> This adds size to struct netdev_rx_queue and requires checks on whether both are set. There can be thousands of these structs at any one time so if we don't need to add size unnecessarily then that would be best.
> We can disambiguate by comparing &mp_ops and then cast the void ptr to our impl specific objects.
> What do you not like about this approach?
I was thinking it leaks page_pool specifics into a generic struct unrelated to the page pool like netdev_rx_queue. My mental model is that the rx-queue just says that it's bound to a dma-buf/io_uring unaware of page_pool internals, and the page pool internals figure out what to do from there.
Currently netdev_rx_queue.h doesn't include net/page_pool/types.h for example because there is no dependency between netdev_rx_queue & page_pool, I think this change would add a dependency.
But I concede it does not matter much AFAICT, I can certainly change the netdev_rx_queue to hold the mp_priv & mp_ops directly and include net/page_pool/types.h if you prefer that. I'll look into applying this change in the next iteration if there are no objections.
> > page_pool_init() probably needs to validate that the queue is configured for dma-buf or io_uring but not both. If it's configured for both then the user is doing something funky we shouldn't support.
> > Perhaps I can make the intention clearer by renaming 'binding' to something more specific to dma-buf like queue->dmabuf_binding, to make it clear that this is the dma-buf binding and not some other binding like io_uring?
On 3/6/24 02:42, Mina Almasry wrote:
> On Tue, Mar 5, 2024 at 6:28 PM David Wei dw@davidwei.uk wrote:
> > On 2024-03-04 18:01, Mina Almasry wrote:
> > > +	if (pool->p.queue)
> > > +		binding = READ_ONCE(pool->p.queue->binding);
> > > +
> > > +	if (binding) {
> > > +		pool->mp_ops = &dmabuf_devmem_ops;
> > > +		pool->mp_priv = binding;
> > > +	}
> > This is specific to TCP devmem. For ZC Rx we will need something more generic to let us pass our own memory provider backend down to the page pool.
> > What about storing ops and priv void ptr in struct netdev_rx_queue instead? Then we can both use it.
> Yes, this is dmabuf specific, I was thinking you'd define your own member of netdev_rx_queue, and then add something like this to page_pool_init:
That would be quite annoying, there are 3 expected users together with huge pages, each would need a field and check all others are disabled as you mentioned and so on. It should be cleaner to pass a generic {pp_ops,pp_private} pair instead.
If header dependencies are a problem, it can probably be
struct pp_provider_params {
	struct pp_ops ops;
	void *private;
};
# netdev_rx_queue.h
// definition is not included here
struct pp_provider_params;
struct netdev_rx_queue {
	...
	struct pp_provider_params *pp_params;
};
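As a rough userspace model of the generic {ops, private} proposal: the queue side sees only a forward declaration, the page pool side sees the full definition, and a pool can tell which provider it got by comparing the ops pointer, as suggested earlier in the thread. All `mock_` names are hypothetical, and the sketch stores a pointer to the ops table rather than embedding `struct pp_ops` by value, which is an assumption made so that pointer comparison works.

```c
#include <assert.h>
#include <stddef.h>

/* What the rx-queue header would see: a forward declaration only. */
struct mock_provider_params;

struct mock_rx_queue {
	struct mock_provider_params *pp_params;	/* NULL: no provider */
};

/* What the page pool side would see: the full definitions. */
struct mock_pool;

struct mock_provider_ops {
	int (*init)(struct mock_pool *pool);
};

struct mock_provider_params {
	const struct mock_provider_ops *ops;
	void *priv;
};

struct mock_pool {
	const struct mock_provider_ops *mp_ops;
	void *mp_priv;
};

/* One generic lookup replaces a per-user field and cross-checks. */
static int mock_pool_init(struct mock_pool *pool,
			  const struct mock_rx_queue *queue)
{
	pool->mp_ops = NULL;
	pool->mp_priv = NULL;

	if (queue && queue->pp_params) {
		pool->mp_ops = queue->pp_params->ops;
		pool->mp_priv = queue->pp_params->priv;
	}

	return pool->mp_ops ? pool->mp_ops->init(pool) : 0;
}

/* Two stand-in providers. */
static int mock_dmabuf_init(struct mock_pool *pool)
{
	return pool->mp_priv ? 0 : -1;
}

static int mock_io_uring_init(struct mock_pool *pool)
{
	(void)pool;
	return 0;
}

static const struct mock_provider_ops mock_dmabuf_ops = {
	.init = mock_dmabuf_init,
};

static const struct mock_provider_ops mock_io_uring_ops = {
	.init = mock_io_uring_init,
};

/* Disambiguate by ops pointer before casting mp_priv to a specific type. */
static int mock_pool_is_dmabuf(const struct mock_pool *pool)
{
	return pool->mp_ops == &mock_dmabuf_ops;
}
```

Because the queue holds a single params pointer, "configured for both dma-buf and io_uring" cannot be expressed at all in this model, so the pool needs no check for that case.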
On Wed, Mar 6, 2024 at 6:59 AM Pavel Begunkov asml.silence@gmail.com wrote:
> On 3/6/24 02:42, Mina Almasry wrote:
> > On Tue, Mar 5, 2024 at 6:28 PM David Wei dw@davidwei.uk wrote:
> > > On 2024-03-04 18:01, Mina Almasry wrote:
> > > > +	if (pool->p.queue)
> > > > +		binding = READ_ONCE(pool->p.queue->binding);
> > > > +
> > > > +	if (binding) {
> > > > +		pool->mp_ops = &dmabuf_devmem_ops;
> > > > +		pool->mp_priv = binding;
> > > > +	}
> > > This is specific to TCP devmem. For ZC Rx we will need something more generic to let us pass our own memory provider backend down to the page pool.
> > > What about storing ops and priv void ptr in struct netdev_rx_queue instead? Then we can both use it.
> > Yes, this is dmabuf specific, I was thinking you'd define your own member of netdev_rx_queue, and then add something like this to page_pool_init:
> That would be quite annoying, there are 3 expected users together with huge pages, each would need a field and check all others are disabled as you mentioned and so on. It should be cleaner to pass a generic {pp_ops,pp_private} pair instead.
> If header dependencies are a problem, it can probably be
> struct pp_provider_params {
> 	struct pp_ops ops;
> 	void *private;
> };
> # netdev_rx_queue.h
> // definition is not included here
> struct pp_provider_params;
> struct netdev_rx_queue {
> 	...
> 	struct pp_provider_params *pp_params;
> };
Seems very reasonable, will do! Thanks!
-- Pavel Begunkov
Make skb_frag_page() fail in the case where the frag is not backed by a page, and fix its relevant callers to handle this case.
Signed-off-by: Mina Almasry almasrymina@google.com
---
v6: - Rebased on top of the merged netmem changes.
Changes in v1: - Fix illegal_highdma() (Yunsheng). - Rework napi_pp_put_page() slightly to reduce code churn (Willem).
---
 include/linux/skbuff.h | 47 +++++++++++++++++++++++++++++++++++++-----
 net/core/dev.c         |  3 ++-
 net/core/gro.c         |  2 +-
 net/core/skbuff.c      | 11 ++++++++++
 net/ipv4/tcp.c         |  3 +++
 5 files changed, 59 insertions(+), 7 deletions(-)
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h index ca29d1fd4561..d68d430dc596 100644 --- a/include/linux/skbuff.h +++ b/include/linux/skbuff.h @@ -3466,17 +3466,53 @@ static inline void skb_frag_off_copy(skb_frag_t *fragto, fragto->offset = fragfrom->offset; }
+/* Returns true if the skb_frag contains a net_iov. */ +static inline bool skb_frag_is_net_iov(const skb_frag_t *frag) +{ + return netmem_is_net_iov(frag->netmem); +} + +/** + * skb_frag_net_iov - retrieve the net_iov referred to by fragment + * @frag: the fragment + * + * Returns the &struct net_iov associated with @frag. Returns NULL if this + * frag has no associated net_iov. + */ +static inline struct net_iov *skb_frag_net_iov(const skb_frag_t *frag) +{ + if (!skb_frag_is_net_iov(frag)) + return NULL; + + return netmem_to_net_iov(frag->netmem); +} + /** * skb_frag_page - retrieve the page referred to by a paged fragment * @frag: the paged fragment * - * Returns the &struct page associated with @frag. + * Returns the &struct page associated with @frag. Returns NULL if this frag + * has no associated page. */ static inline struct page *skb_frag_page(const skb_frag_t *frag) { + if (skb_frag_is_net_iov(frag)) + return NULL; + return netmem_to_page(frag->netmem); }
+/** + * skb_frag_netmem - retrieve the netmem referred to by a fragment + * @frag: the fragment + * + * Returns the &netmem_ref associated with @frag. + */ +static inline netmem_ref skb_frag_netmem(const skb_frag_t *frag) +{ + return frag->netmem; +} + /** * __skb_frag_ref - take an addition reference on a paged fragment. * @frag: the paged fragment @@ -3509,13 +3545,11 @@ bool napi_pp_put_page(netmem_ref netmem, bool napi_safe); static inline void napi_frag_unref(skb_frag_t *frag, bool recycle, bool napi_safe) { - struct page *page = skb_frag_page(frag); - #ifdef CONFIG_PAGE_POOL - if (recycle && napi_pp_put_page(page_to_netmem(page), napi_safe)) + if (recycle && napi_pp_put_page(skb_frag_netmem(frag), napi_safe)) return; #endif - put_page(page); + put_page(skb_frag_page(frag)); }
/** @@ -3555,6 +3589,9 @@ static inline void skb_frag_unref(struct sk_buff *skb, int f) */ static inline void *skb_frag_address(const skb_frag_t *frag) { + if (!skb_frag_page(frag)) + return NULL; + return page_address(skb_frag_page(frag)) + skb_frag_off(frag); }
diff --git a/net/core/dev.c b/net/core/dev.c index bbea1b252529..765f4a995693 100644 --- a/net/core/dev.c +++ b/net/core/dev.c @@ -3385,8 +3385,9 @@ static int illegal_highdma(struct net_device *dev, struct sk_buff *skb) if (!(dev->features & NETIF_F_HIGHDMA)) { for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) { skb_frag_t *frag = &skb_shinfo(skb)->frags[i]; + struct page *page = skb_frag_page(frag);
- if (PageHighMem(skb_frag_page(frag))) + if (page && PageHighMem(page)) return 1; } } diff --git a/net/core/gro.c b/net/core/gro.c index 0759277dc14e..42d7f6755f32 100644 --- a/net/core/gro.c +++ b/net/core/gro.c @@ -376,7 +376,7 @@ static inline void skb_gro_reset_offset(struct sk_buff *skb, u32 nhoff) NAPI_GRO_CB(skb)->frag0 = NULL; NAPI_GRO_CB(skb)->frag0_len = 0;
- if (!skb_headlen(skb) && pinfo->nr_frags && + if (!skb_headlen(skb) && pinfo->nr_frags && skb_frag_page(frag0) && !PageHighMem(skb_frag_page(frag0)) && (!NET_IP_ALIGN || !((skb_frag_off(frag0) + nhoff) & 3))) { NAPI_GRO_CB(skb)->frag0 = skb_frag_address(frag0); diff --git a/net/core/skbuff.c b/net/core/skbuff.c index cf23392e97f5..907fff2894d3 100644 --- a/net/core/skbuff.c +++ b/net/core/skbuff.c @@ -1378,6 +1378,14 @@ void skb_dump(const char *level, const struct sk_buff *skb, bool full_pkt) struct page *p; u8 *vaddr;
+ if (skb_frag_is_net_iov(frag)) { + printk("%sskb frag %d: not readable\n", level, i); + len -= frag->bv_len; + if (!len) + break; + continue; + } + skb_frag_foreach_page(frag, skb_frag_off(frag), skb_frag_size(frag), p, p_off, p_len, copied) { @@ -3145,6 +3153,9 @@ static bool __skb_splice_bits(struct sk_buff *skb, struct pipe_inode_info *pipe, for (seg = 0; seg < skb_shinfo(skb)->nr_frags; seg++) { const skb_frag_t *f = &skb_shinfo(skb)->frags[seg];
+ if (WARN_ON_ONCE(!skb_frag_page(f))) + return false; + if (__splice_segment(skb_frag_page(f), skb_frag_off(f), skb_frag_size(f), offset, len, spd, false, sk, pipe)) diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c index c82dc42f57c6..51a5d263e8b4 100644 --- a/net/ipv4/tcp.c +++ b/net/ipv4/tcp.c @@ -2166,6 +2166,9 @@ static int tcp_zerocopy_receive(struct sock *sk, break; } page = skb_frag_page(frags); + if (WARN_ON_ONCE(!page)) + break; + prefetchw(page); pages[pages_to_map++] = page; length += PAGE_SIZE;
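The contract introduced by the patch above (skb_frag_page() returns NULL for a frag that is not backed by a page, and callers must check) can be modelled in userspace. This is a sketch only: the `mock_` types are hypothetical, and the low-bit tag used to tell a net_iov from a page is an assumption of the model, not a statement about the kernel's netmem_ref encoding.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct mock_page    { int id; };
struct mock_net_iov { int id; };

/* One word that refers to either a page or a net_iov. */
typedef uintptr_t mock_netmem_ref;

/* Assumed tag bit; both structs are at least 2-byte aligned. */
#define MOCK_NET_IOV ((uintptr_t)1)

static mock_netmem_ref mock_page_to_netmem(struct mock_page *page)
{
	return (uintptr_t)page;
}

static mock_netmem_ref mock_net_iov_to_netmem(struct mock_net_iov *niov)
{
	return (uintptr_t)niov | MOCK_NET_IOV;
}

static int mock_netmem_is_net_iov(mock_netmem_ref netmem)
{
	return (netmem & MOCK_NET_IOV) != 0;
}

struct mock_frag {
	mock_netmem_ref netmem;
	unsigned int offset;
};

/* Mirrors skb_frag_page(): NULL when the frag holds a net_iov. */
static struct mock_page *mock_frag_page(const struct mock_frag *frag)
{
	if (mock_netmem_is_net_iov(frag->netmem))
		return NULL;
	return (struct mock_page *)frag->netmem;
}

/* Mirrors skb_frag_net_iov(): NULL when the frag holds a page. */
static struct mock_net_iov *mock_frag_net_iov(const struct mock_frag *frag)
{
	if (!mock_netmem_is_net_iov(frag->netmem))
		return NULL;
	return (struct mock_net_iov *)(frag->netmem & ~MOCK_NET_IOV);
}

/* Caller pattern from the patch: test the page before touching it. */
static unsigned int mock_count_page_frags(const struct mock_frag *frags,
					  unsigned int nr_frags)
{
	unsigned int i, count = 0;

	for (i = 0; i < nr_frags; i++) {
		struct mock_page *page = mock_frag_page(&frags[i]);

		if (page)
			count++;
	}
	return count;
}
```

The loop in `mock_count_page_frags()` has the same shape as the illegal_highdma() change: fetch the page once, then guard every page-only operation with a NULL check.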
For device memory TCP, we expect the skb headers to be available in host memory for access, and we expect the skb frags to be in device memory and inaccessible to the host. We expect there to be no mixing and matching of device memory frags (inaccessible) with host memory frags (accessible) in the same skb.
Add a skb->readable flag which indicates whether the frags in this skb are readable by the host, i.e. whether they are not device memory frags.
__skb_fill_netmem_desc() now checks frags added to skbs for net_iov, and clears skb->readable accordingly.
Add checks through the network stack to avoid accessing the frags of unreadable skbs and to avoid coalescing unreadable skbs with readable skbs.
Signed-off-by: Willem de Bruijn willemb@google.com Signed-off-by: Kaiyuan Zhang kaiyuanz@google.com Signed-off-by: Mina Almasry almasrymina@google.com
---
v6 - skb->dmabuf -> skb->readable (Pavel). Pavel's original suggestion was to remove the skb->dmabuf flag entirely, but when I looked into it closely, I found the issue that if we remove the flag we have to dereference the shinfo(skb) pointer to obtain the first frag, which can cause a performance regression if it dirties the cache line when the shinfo(skb) was not really needed. Instead, I converted the skb->dmabuf flag into a generic skb->readable flag which can be re-used by io_uring.
Changes in v1: - Rename devmem -> dmabuf (David). - Flip skb_frags_not_readable (Jakub).
---
 include/linux/skbuff.h | 18 ++++++++--
 include/net/tcp.h      |  5 +--
 net/core/datagram.c    |  6 ++++
 net/core/gro.c         |  5 ++-
 net/core/skbuff.c      | 75 +++++++++++++++++++++++++++++++++++-------
 net/ipv4/tcp.c         |  3 ++
 net/ipv4/tcp_input.c   | 13 ++++++--
 net/ipv4/tcp_output.c  |  5 ++-
 net/packet/af_packet.c |  4 +--
 9 files changed, 112 insertions(+), 22 deletions(-)
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h index d68d430dc596..919ab795a3b0 100644 --- a/include/linux/skbuff.h +++ b/include/linux/skbuff.h @@ -818,6 +818,7 @@ typedef unsigned char *sk_buff_data_t; * @csum_level: indicates the number of consecutive checksums found in * the packet minus one that have been verified as * CHECKSUM_UNNECESSARY (max 3) + * @readable: indicates that all the fragments in this skb are readable. * @dst_pending_confirm: need to confirm neighbour * @decrypted: Decrypted SKB * @slow_gro: state present at GRO time, slower prepare step required @@ -1004,7 +1005,7 @@ struct sk_buff { #if IS_ENABLED(CONFIG_IP_SCTP) __u8 csum_not_inet:1; #endif - + __u8 readable:1; #if defined(CONFIG_NET_SCHED) || defined(CONFIG_NET_XGRESS) __u16 tc_index; /* traffic control index */ #endif @@ -1779,6 +1780,12 @@ static inline void skb_zcopy_downgrade_managed(struct sk_buff *skb) __skb_zcopy_downgrade_managed(skb); }
+/* Return true if frags in this skb are readable by the host. */ +static inline bool skb_frags_readable(const struct sk_buff *skb) +{ + return skb->readable; +} + static inline void skb_mark_not_on_list(struct sk_buff *skb) { skb->next = NULL; @@ -2495,10 +2502,17 @@ static inline void skb_len_add(struct sk_buff *skb, int delta) static inline void __skb_fill_netmem_desc(struct sk_buff *skb, int i, netmem_ref netmem, int off, int size) { - struct page *page = netmem_to_page(netmem); + struct page *page;
__skb_fill_netmem_desc_noacc(skb_shinfo(skb), i, netmem, off, size);
+ if (netmem_is_net_iov(netmem)) { + skb->readable = false; + return; + } + + page = netmem_to_page(netmem); + /* Propagate page pfmemalloc to the skb if we can. The problem is * that not all callers have unique ownership of the page but rely * on page_is_pfmemalloc doing the right thing(tm). diff --git a/include/net/tcp.h b/include/net/tcp.h index 6ae35199d3b3..8f086e14b21d 100644 --- a/include/net/tcp.h +++ b/include/net/tcp.h @@ -1062,7 +1062,7 @@ static inline int tcp_skb_mss(const struct sk_buff *skb)
static inline bool tcp_skb_can_collapse_to(const struct sk_buff *skb) { - return likely(!TCP_SKB_CB(skb)->eor); + return likely(!TCP_SKB_CB(skb)->eor && skb_frags_readable(skb)); }
static inline bool tcp_skb_can_collapse(const struct sk_buff *to, @@ -1070,7 +1070,8 @@ static inline bool tcp_skb_can_collapse(const struct sk_buff *to, { return likely(tcp_skb_can_collapse_to(to) && mptcp_skb_can_collapse(to, from) && - skb_pure_zcopy_same(to, from)); + skb_pure_zcopy_same(to, from) && + skb_frags_readable(to) == skb_frags_readable(from)); }
/* Events passed to congestion control interface */ diff --git a/net/core/datagram.c b/net/core/datagram.c index a8b625abe242..5dd39ec30287 100644 --- a/net/core/datagram.c +++ b/net/core/datagram.c @@ -426,6 +426,9 @@ static int __skb_datagram_iter(const struct sk_buff *skb, int offset, return 0; }
+ if (!skb_frags_readable(skb)) + goto short_copy; + /* Copy paged appendix. Hmm... why does this look so complicated? */ for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) { int end; @@ -638,6 +641,9 @@ int __zerocopy_sg_from_iter(struct msghdr *msg, struct sock *sk, if (msg && msg->msg_ubuf && msg->sg_from_iter) return msg->sg_from_iter(sk, skb, from, length);
+ if (!skb_frags_readable(skb)) + return -EFAULT; + frag = skb_shinfo(skb)->nr_frags;
while (length && iov_iter_count(from)) { diff --git a/net/core/gro.c b/net/core/gro.c index 42d7f6755f32..26df48f1b355 100644 --- a/net/core/gro.c +++ b/net/core/gro.c @@ -390,6 +390,9 @@ static void gro_pull_from_frag0(struct sk_buff *skb, int grow) { struct skb_shared_info *pinfo = skb_shinfo(skb);
+ if (WARN_ON_ONCE(!skb_frags_readable(skb))) + return; + BUG_ON(skb->end - skb->tail < grow);
memcpy(skb_tail_pointer(skb), NAPI_GRO_CB(skb)->frag0, grow); @@ -411,7 +414,7 @@ static void gro_try_pull_from_frag0(struct sk_buff *skb) { int grow = skb_gro_offset(skb) - skb_headlen(skb);
- if (grow > 0) + if (grow > 0 && skb_frags_readable(skb)) gro_pull_from_frag0(skb, grow); }
diff --git a/net/core/skbuff.c b/net/core/skbuff.c index 907fff2894d3..3c7d825bb8b5 100644 --- a/net/core/skbuff.c +++ b/net/core/skbuff.c @@ -693,6 +693,8 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask, refcount_set(&fclones->fclone_ref, 1); }
+ skb->readable = true; + return skb;
nodata: @@ -765,6 +767,7 @@ struct sk_buff *__netdev_alloc_skb(struct net_device *dev, unsigned int len, if (pfmemalloc) skb->pfmemalloc = 1; skb->head_frag = 1; + skb->readable = true;
skb_success: skb_reserve(skb, NET_SKB_PAD); @@ -853,6 +856,7 @@ struct sk_buff *__napi_alloc_skb(struct napi_struct *napi, unsigned int len, if (pfmemalloc) skb->pfmemalloc = 1; skb->head_frag = 1; + skb->readable = true;
skb_success: skb_reserve(skb, NET_SKB_PAD + NET_IP_ALIGN); @@ -1380,7 +1384,7 @@ void skb_dump(const char *level, const struct sk_buff *skb, bool full_pkt)
if (skb_frag_is_net_iov(frag)) { printk("%sskb frag %d: not readable\n", level, i); - len -= frag->bv_len; + len -= skb_frag_size(frag); if (!len) break; continue; @@ -1965,6 +1969,9 @@ int skb_copy_ubufs(struct sk_buff *skb, gfp_t gfp_mask) if (skb_shared(skb) || skb_unclone(skb, gfp_mask)) return -EINVAL;
+ if (!skb_frags_readable(skb)) + return -EFAULT; + if (!num_frags) goto release;
@@ -2136,8 +2143,12 @@ struct sk_buff *skb_copy(const struct sk_buff *skb, gfp_t gfp_mask) { int headerlen = skb_headroom(skb); unsigned int size = skb_end_offset(skb) + skb->data_len; - struct sk_buff *n = __alloc_skb(size, gfp_mask, - skb_alloc_rx_flag(skb), NUMA_NO_NODE); + struct sk_buff *n; + + if (!skb_frags_readable(skb)) + return NULL; + + n = __alloc_skb(size, gfp_mask, skb_alloc_rx_flag(skb), NUMA_NO_NODE);
if (!n) return NULL; @@ -2463,14 +2474,16 @@ struct sk_buff *skb_copy_expand(const struct sk_buff *skb, int newheadroom, int newtailroom, gfp_t gfp_mask) { - /* - * Allocate the copy buffer - */ - struct sk_buff *n = __alloc_skb(newheadroom + skb->len + newtailroom, - gfp_mask, skb_alloc_rx_flag(skb), - NUMA_NO_NODE); int oldheadroom = skb_headroom(skb); int head_copy_len, head_copy_off; + struct sk_buff *n; + + if (!skb_frags_readable(skb)) + return NULL; + + /* Allocate the copy buffer */ + n = __alloc_skb(newheadroom + skb->len + newtailroom, gfp_mask, + skb_alloc_rx_flag(skb), NUMA_NO_NODE);
if (!n) return NULL; @@ -2809,6 +2822,9 @@ void *__pskb_pull_tail(struct sk_buff *skb, int delta) */ int i, k, eat = (skb->tail + delta) - skb->end;
+ if (!skb_frags_readable(skb)) + return NULL; + if (eat > 0 || skb_cloned(skb)) { if (pskb_expand_head(skb, 0, eat > 0 ? eat + 128 : 0, GFP_ATOMIC)) @@ -2962,6 +2978,9 @@ int skb_copy_bits(const struct sk_buff *skb, int offset, void *to, int len) to += copy; }
+ if (!skb_frags_readable(skb)) + goto fault; + for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) { int end; skb_frag_t *f = &skb_shinfo(skb)->frags[i]; @@ -3150,6 +3169,9 @@ static bool __skb_splice_bits(struct sk_buff *skb, struct pipe_inode_info *pipe, /* * then map the fragments */ + if (!skb_frags_readable(skb)) + return false; + for (seg = 0; seg < skb_shinfo(skb)->nr_frags; seg++) { const skb_frag_t *f = &skb_shinfo(skb)->frags[seg];
@@ -3373,6 +3395,9 @@ int skb_store_bits(struct sk_buff *skb, int offset, const void *from, int len) from += copy; }
+ if (!skb_frags_readable(skb)) + goto fault; + for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) { skb_frag_t *frag = &skb_shinfo(skb)->frags[i]; int end; @@ -3452,6 +3477,9 @@ __wsum __skb_checksum(const struct sk_buff *skb, int offset, int len, pos = copy; }
+ if (!skb_frags_readable(skb)) + return 0; + for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) { int end; skb_frag_t *frag = &skb_shinfo(skb)->frags[i]; @@ -3552,6 +3580,9 @@ __wsum skb_copy_and_csum_bits(const struct sk_buff *skb, int offset, pos = copy; }
+ if (!skb_frags_readable(skb)) + return 0; + for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) { int end;
@@ -4043,7 +4074,9 @@ static inline void skb_split_inside_header(struct sk_buff *skb, skb_shinfo(skb1)->frags[i] = skb_shinfo(skb)->frags[i];
skb_shinfo(skb1)->nr_frags = skb_shinfo(skb)->nr_frags; + skb1->readable = skb->readable; skb_shinfo(skb)->nr_frags = 0; + skb->readable = 1; skb1->data_len = skb->data_len; skb1->len += skb1->data_len; skb->data_len = 0; @@ -4057,6 +4090,7 @@ static inline void skb_split_no_header(struct sk_buff *skb, { int i, k = 0; const int nfrags = skb_shinfo(skb)->nr_frags; + const int readable = skb->readable;
skb_shinfo(skb)->nr_frags = 0; skb1->len = skb1->data_len = skb->len - len; @@ -4090,6 +4124,16 @@ static inline void skb_split_no_header(struct sk_buff *skb, pos += size; } skb_shinfo(skb1)->nr_frags = k; + + if (skb_shinfo(skb)->nr_frags) + skb->readable = readable; + else + skb->readable = 1; + + if (skb_shinfo(skb1)->nr_frags) + skb1->readable = readable; + else + skb1->readable = 1; }
/** @@ -4325,6 +4369,9 @@ unsigned int skb_seq_read(unsigned int consumed, const u8 **data, return block_limit - abs_offset; }
+ if (!skb_frags_readable(st->cur_skb)) + return 0; + if (st->frag_idx == 0 && !st->frag_data) st->stepped_offset += skb_headlen(st->cur_skb);
@@ -5937,7 +5984,10 @@ bool skb_try_coalesce(struct sk_buff *to, struct sk_buff *from, if (to->pp_recycle != from->pp_recycle) return false;
- if (len <= skb_tailroom(to)) { + if (skb_frags_readable(from) != skb_frags_readable(to)) + return false; + + if (len <= skb_tailroom(to) && skb_frags_readable(from)) { if (len) BUG_ON(skb_copy_bits(from, 0, skb_put(to, len), len)); *delta_truesize = 0; @@ -6114,6 +6164,9 @@ int skb_ensure_writable(struct sk_buff *skb, unsigned int write_len) if (!pskb_may_pull(skb, write_len)) return -ENOMEM;
+ if (!skb_frags_readable(skb)) + return -EFAULT; + if (!skb_cloned(skb) || skb_clone_writable(skb, write_len)) return 0;
@@ -6793,7 +6846,7 @@ void skb_condense(struct sk_buff *skb) { if (skb->data_len) { if (skb->data_len > skb->end - skb->tail || - skb_cloned(skb)) + skb_cloned(skb) || !skb_frags_readable(skb)) return;
/* Nice, we can free page frag(s) right now */ diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c index 51a5d263e8b4..00b5ffa06ab6 100644 --- a/net/ipv4/tcp.c +++ b/net/ipv4/tcp.c @@ -2149,6 +2149,9 @@ static int tcp_zerocopy_receive(struct sock *sk, skb = tcp_recv_skb(sk, seq, &offset); }
+ if (!skb_frags_readable(skb)) + break; + if (TCP_SKB_CB(skb)->has_rxtstamp) { tcp_update_recv_tstamps(skb, tss); zc->msg_flags |= TCP_CMSG_TS; diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c index 5d874817a78d..799501fefccc 100644 --- a/net/ipv4/tcp_input.c +++ b/net/ipv4/tcp_input.c @@ -5331,6 +5331,9 @@ tcp_collapse(struct sock *sk, struct sk_buff_head *list, struct rb_root *root, for (end_of_skbs = true; skb != NULL && skb != tail; skb = n) { n = tcp_skb_next(skb, list);
+ if (!skb_frags_readable(skb)) + goto skip_this; + /* No new bits? It is possible on ofo queue. */ if (!before(start, TCP_SKB_CB(skb)->end_seq)) { skb = tcp_collapse_one(sk, skb, list, root); @@ -5351,17 +5354,20 @@ tcp_collapse(struct sock *sk, struct sk_buff_head *list, struct rb_root *root, break; }
- if (n && n != tail && mptcp_skb_can_collapse(skb, n) && + if (n && n != tail && skb_frags_readable(n) && + mptcp_skb_can_collapse(skb, n) && TCP_SKB_CB(skb)->end_seq != TCP_SKB_CB(n)->seq) { end_of_skbs = false; break; }
+skip_this: /* Decided to skip this, advance start seq. */ start = TCP_SKB_CB(skb)->end_seq; } if (end_of_skbs || - (TCP_SKB_CB(skb)->tcp_flags & (TCPHDR_SYN | TCPHDR_FIN))) + (TCP_SKB_CB(skb)->tcp_flags & (TCPHDR_SYN | TCPHDR_FIN)) || + !skb_frags_readable(skb)) return;
__skb_queue_head_init(&tmp); @@ -5405,7 +5411,8 @@ tcp_collapse(struct sock *sk, struct sk_buff_head *list, struct rb_root *root, if (!skb || skb == tail || !mptcp_skb_can_collapse(nskb, skb) || - (TCP_SKB_CB(skb)->tcp_flags & (TCPHDR_SYN | TCPHDR_FIN))) + (TCP_SKB_CB(skb)->tcp_flags & (TCPHDR_SYN | TCPHDR_FIN)) || + !skb_frags_readable(skb)) goto end; #ifdef CONFIG_TLS_DEVICE if (skb->decrypted != nskb->decrypted) diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c index e3167ad96567..30f53de14a24 100644 --- a/net/ipv4/tcp_output.c +++ b/net/ipv4/tcp_output.c @@ -2343,7 +2343,8 @@ static bool tcp_can_coalesce_send_queue_head(struct sock *sk, int len)
if (unlikely(TCP_SKB_CB(skb)->eor) || tcp_has_tx_tstamp(skb) || - !skb_pure_zcopy_same(skb, next)) + !skb_pure_zcopy_same(skb, next) || + skb_frags_readable(skb) != skb_frags_readable(next)) return false;
len -= skb->len; @@ -3227,6 +3228,8 @@ static bool tcp_can_collapse(const struct sock *sk, const struct sk_buff *skb) return false; if (skb_cloned(skb)) return false; + if (!skb_frags_readable(skb)) + return false; /* Some heuristics for collapsing over SACK'd could be invented */ if (TCP_SKB_CB(skb)->sacked & TCPCB_SACKED_ACKED) return false; diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c index c9bbc2686690..d302e8de914d 100644 --- a/net/packet/af_packet.c +++ b/net/packet/af_packet.c @@ -2156,7 +2156,7 @@ static int packet_rcv(struct sk_buff *skb, struct net_device *dev, } }
- snaplen = skb->len; + snaplen = skb_frags_readable(skb) ? skb->len : skb_headlen(skb);
res = run_filter(skb, sk, snaplen); if (!res) @@ -2276,7 +2276,7 @@ static int tpacket_rcv(struct sk_buff *skb, struct net_device *dev, } }
- snaplen = skb->len; + snaplen = skb_frags_readable(skb) ? skb->len : skb_headlen(skb);
res = run_filter(skb, sk, snaplen); if (!res)
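The bookkeeping for skb->readable in the patch above reduces to three rules: a fresh skb is readable, adding a net_iov frag clears the flag, and an skb left with no frags after a split becomes readable again. A small userspace model, with hypothetical `mock_` names and a frag count standing in for the real frag array:

```c
#include <assert.h>
#include <stdbool.h>

struct mock_skb {
	bool readable;
	unsigned int nr_frags;
};

/* Allocation paths in the patch all set readable = true. */
static void mock_skb_init(struct mock_skb *skb)
{
	skb->readable = true;
	skb->nr_frags = 0;
}

/* Models __skb_fill_netmem_desc(): a net_iov frag makes the skb unreadable. */
static void mock_skb_add_frag(struct mock_skb *skb, bool frag_is_net_iov)
{
	skb->nr_frags++;
	if (frag_is_net_iov)
		skb->readable = false;
}

/* Models the coalescing checks: never mix readable and unreadable skbs. */
static bool mock_can_coalesce(const struct mock_skb *to,
			      const struct mock_skb *from)
{
	return to->readable == from->readable;
}

/*
 * Models the tail of skb_split_no_header(): move the last `moved` frags
 * to skb1, then let each skb inherit the flag only if it still has frags.
 */
static void mock_skb_split_frags(struct mock_skb *skb, struct mock_skb *skb1,
				 unsigned int moved)
{
	const bool readable = skb->readable;

	if (moved > skb->nr_frags)
		moved = skb->nr_frags;

	skb1->nr_frags = moved;
	skb->nr_frags -= moved;

	skb->readable = skb->nr_frags ? readable : true;
	skb1->readable = skb1->nr_frags ? readable : true;
}
```

Keeping the flag in the skb itself means a check such as `mock_can_coalesce()` never has to look at the frag array, which is the cache-line argument given in the v6 changelog.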
In tcp_recvmsg_locked(), detect if the skb being received by the user is a devmem skb. In this case, if the user provided the MSG_SOCK_DEVMEM flag, pass it to tcp_recvmsg_dmabuf() for custom handling.
tcp_recvmsg_dmabuf() copies any data in the skb header to the linear buffer, and returns a cmsg to the user indicating the number of bytes returned in the linear buffer.
tcp_recvmsg_dmabuf() then loops over the inaccessible devmem skb frags, and returns to the user one dmabuf_cmsg per frag indicating the location of the data in the dmabuf device memory. dmabuf_cmsg contains this information:
1. 'frag_offset': the offset into the dmabuf where the payload starts.
2. 'frag_size': the size of the frag.
3. 'frag_token': an opaque token to return to the kernel when the buffer is to be released.
4. 'dmabuf_id': the id of the dmabuf this frag belongs to.
The pages awaiting freeing are stored in the newly added sk->sk_user_frags, and each page passed to userspace is get_page()'d. This reference is dropped once the userspace indicates that it is done reading this page. All pages are released when the socket is destroyed.
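As a rough userspace illustration of how 'frag_offset' and 'frag_size' might be consumed, the sketch below resolves a frag to a pointer inside a CPU mapping of the dmabuf. It assumes the dmabuf can be mmap()ed by the application (true for the udmabuf used in testing, not necessarily for real device memory). struct frag_desc and frag_payload() are hypothetical names for this example, not part of the series.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical mirror of the cmsg fields described above. */
struct frag_desc {
	uint64_t frag_offset;	/* offset into the dmabuf */
	uint32_t frag_size;	/* payload length in bytes */
	uint32_t frag_token;	/* handed back to the kernel on release */
};

/*
 * Return a pointer to the frag payload inside a CPU-visible mapping of
 * the dmabuf, or NULL if the frag does not fit inside the mapping.
 * map_base/map_len describe the mapping (an assumption of this sketch).
 */
static const uint8_t *frag_payload(const uint8_t *map_base, size_t map_len,
				   const struct frag_desc *d)
{
	assert(map_base != NULL && d != NULL);

	/* Reject frags that start or end outside the mapping. */
	if (d->frag_offset > map_len)
		return NULL;
	if (d->frag_size > map_len - d->frag_offset)
		return NULL;

	return map_base + (size_t)d->frag_offset;
}
```

The bounds check is written as a subtraction so that a large frag_offset cannot overflow the comparison.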
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Kaiyuan Zhang <kaiyuanz@google.com>
Signed-off-by: Mina Almasry <almasrymina@google.com>
---
v6:
- skb->dmabuf -> skb->readable (Pavel).
- Fixed asm definitions of SO_DEVMEM_LINEAR/SO_DEVMEM_DMABUF not found
  on some archs.
- Squashed in locking optimizations from edumazet@google.com. With this
  change we lock the xarray once per tcp_recvmsg_dmabuf() call rather
  than once per frag in xa_alloc().

Changes in v1:
- Added dmabuf_id to dmabuf_cmsg (David/Stan).
- Devmem -> dmabuf (David).
- Change tcp_recvmsg_dmabuf() check to skb->dmabuf (Paolo).
- Use __skb_frag_ref() & napi_pp_put_page() for refcounting (Yunsheng).

RFC v3:
- Fixed issue with put_cmsg() failing silently.
--- arch/alpha/include/uapi/asm/socket.h | 5 + arch/mips/include/uapi/asm/socket.h | 5 + arch/parisc/include/uapi/asm/socket.h | 5 + arch/sparc/include/uapi/asm/socket.h | 5 + include/linux/socket.h | 1 + include/net/netmem.h | 13 ++ include/net/sock.h | 2 + include/uapi/asm-generic/socket.h | 5 + include/uapi/linux/uio.h | 10 + net/ipv4/tcp.c | 251 +++++++++++++++++++++++++- net/ipv4/tcp_ipv4.c | 9 + 11 files changed, 306 insertions(+), 5 deletions(-)
diff --git a/arch/alpha/include/uapi/asm/socket.h b/arch/alpha/include/uapi/asm/socket.h index e94f621903fe..1a9439f1c02e 100644 --- a/arch/alpha/include/uapi/asm/socket.h +++ b/arch/alpha/include/uapi/asm/socket.h @@ -140,6 +140,11 @@ #define SO_PASSPIDFD 76 #define SO_PEERPIDFD 77
+#define SO_DEVMEM_LINEAR	79
+#define SCM_DEVMEM_LINEAR	SO_DEVMEM_LINEAR
+#define SO_DEVMEM_DMABUF	80
+#define SCM_DEVMEM_DMABUF	SO_DEVMEM_DMABUF
+
 #if !defined(__KERNEL__)
#if __BITS_PER_LONG == 64 diff --git a/arch/mips/include/uapi/asm/socket.h b/arch/mips/include/uapi/asm/socket.h index 60ebaed28a4c..a940747504a0 100644 --- a/arch/mips/include/uapi/asm/socket.h +++ b/arch/mips/include/uapi/asm/socket.h @@ -151,6 +151,11 @@ #define SO_PASSPIDFD 76 #define SO_PEERPIDFD 77
+#define SO_DEVMEM_LINEAR	79
+#define SCM_DEVMEM_LINEAR	SO_DEVMEM_LINEAR
+#define SO_DEVMEM_DMABUF	80
+#define SCM_DEVMEM_DMABUF	SO_DEVMEM_DMABUF
+
 #if !defined(__KERNEL__)
#if __BITS_PER_LONG == 64 diff --git a/arch/parisc/include/uapi/asm/socket.h b/arch/parisc/include/uapi/asm/socket.h index be264c2b1a11..1d399f104f08 100644 --- a/arch/parisc/include/uapi/asm/socket.h +++ b/arch/parisc/include/uapi/asm/socket.h @@ -132,6 +132,11 @@ #define SO_PASSPIDFD 0x404A #define SO_PEERPIDFD 0x404B
+#define SO_DEVMEM_LINEAR	98
+#define SCM_DEVMEM_LINEAR	SO_DEVMEM_LINEAR
+#define SO_DEVMEM_DMABUF	99
+#define SCM_DEVMEM_DMABUF	SO_DEVMEM_DMABUF
+
 #if !defined(__KERNEL__)
#if __BITS_PER_LONG == 64 diff --git a/arch/sparc/include/uapi/asm/socket.h b/arch/sparc/include/uapi/asm/socket.h index 682da3714686..a5961ecff31d 100644 --- a/arch/sparc/include/uapi/asm/socket.h +++ b/arch/sparc/include/uapi/asm/socket.h @@ -133,6 +133,11 @@ #define SO_PASSPIDFD 0x0055 #define SO_PEERPIDFD 0x0056
+#define SO_DEVMEM_LINEAR	0x0058
+#define SCM_DEVMEM_LINEAR	SO_DEVMEM_LINEAR
+#define SO_DEVMEM_DMABUF	0x0059
+#define SCM_DEVMEM_DMABUF	SO_DEVMEM_DMABUF
+
 #if !defined(__KERNEL__)
diff --git a/include/linux/socket.h b/include/linux/socket.h index cfcb7e2c3813..fe2b9e2081bb 100644 --- a/include/linux/socket.h +++ b/include/linux/socket.h @@ -326,6 +326,7 @@ struct ucred { * plain text and require encryption */
+#define MSG_SOCK_DEVMEM 0x2000000 /* Receive devmem skbs as cmsg */ #define MSG_ZEROCOPY 0x4000000 /* Use user data in kernel path */ #define MSG_SPLICE_PAGES 0x8000000 /* Splice the pages from the iterator in sendmsg() */ #define MSG_FASTOPEN 0x20000000 /* Send data in TCP SYN */ diff --git a/include/net/netmem.h b/include/net/netmem.h index a2de9411025d..2f36f21b3a3c 100644 --- a/include/net/netmem.h +++ b/include/net/netmem.h @@ -65,6 +65,19 @@ static inline unsigned int net_iov_idx(const struct net_iov *niov) return niov - net_iov_owner(niov)->niovs; }
+static inline unsigned long net_iov_virtual_addr(const struct net_iov *niov) +{ + struct dmabuf_genpool_chunk_owner *owner = net_iov_owner(niov); + + return owner->base_virtual + + ((unsigned long)net_iov_idx(niov) << PAGE_SHIFT); +} + +static inline u32 net_iov_binding_id(const struct net_iov *niov) +{ + return net_iov_owner(niov)->binding->id; +} + /* This returns the absolute dma_addr_t calculated from * net_iov_owner(niov)->owner->base_dma_addr, not the page_pool-owned * niov->dma_addr. diff --git a/include/net/sock.h b/include/net/sock.h index 09a0cde8bf52..2b500a804191 100644 --- a/include/net/sock.h +++ b/include/net/sock.h @@ -337,6 +337,7 @@ struct sk_filter; * @sk_txtime_report_errors: set report errors mode for SO_TXTIME * @sk_txtime_unused: unused txtime flags * @ns_tracker: tracker for netns reference + * @sk_user_frags: xarray of pages the user is holding a reference on. */ struct sock { /* @@ -542,6 +543,7 @@ struct sock { #endif struct rcu_head sk_rcu; netns_tracker ns_tracker; + struct xarray sk_user_frags; };
enum sk_pacing { diff --git a/include/uapi/asm-generic/socket.h b/include/uapi/asm-generic/socket.h index 8ce8a39a1e5f..25a2f5255f52 100644 --- a/include/uapi/asm-generic/socket.h +++ b/include/uapi/asm-generic/socket.h @@ -135,6 +135,11 @@ #define SO_PASSPIDFD 76 #define SO_PEERPIDFD 77
+#define SO_DEVMEM_LINEAR	98
+#define SCM_DEVMEM_LINEAR	SO_DEVMEM_LINEAR
+#define SO_DEVMEM_DMABUF	99
+#define SCM_DEVMEM_DMABUF	SO_DEVMEM_DMABUF
+
 #if !defined(__KERNEL__)
#if __BITS_PER_LONG == 64 || (defined(__x86_64__) && defined(__ILP32__)) diff --git a/include/uapi/linux/uio.h b/include/uapi/linux/uio.h index 059b1a9147f4..ad92e37699da 100644 --- a/include/uapi/linux/uio.h +++ b/include/uapi/linux/uio.h @@ -20,6 +20,16 @@ struct iovec __kernel_size_t iov_len; /* Must be size_t (1003.1g) */ };
+struct dmabuf_cmsg { + __u64 frag_offset; /* offset into the dmabuf where the frag starts. + */ + __u32 frag_size; /* size of the frag. */ + __u32 frag_token; /* token representing this frag for + * DEVMEM_DONTNEED. + */ + __u32 dmabuf_id; /* dmabuf id this frag belongs to. */ +}; + /* * UIO_MAXIOV shall be at least 16 1003.1g (5.4.1.1) */ diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c index 00b5ffa06ab6..60cbd166f0df 100644 --- a/net/ipv4/tcp.c +++ b/net/ipv4/tcp.c @@ -461,6 +461,7 @@ void tcp_init_sock(struct sock *sk)
set_bit(SOCK_SUPPORT_ZC, &sk->sk_socket->flags); sk_sockets_allocated_inc(sk); + xa_init_flags(&sk->sk_user_frags, XA_FLAGS_ALLOC1); } EXPORT_SYMBOL(tcp_init_sock);
@@ -2312,6 +2313,216 @@ static int tcp_inq_hint(struct sock *sk) return inq; }
+/* batch __xa_alloc() calls and reduce xa_lock()/xa_unlock() overhead. */ +struct tcp_xa_pool { + u8 max; /* max <= MAX_SKB_FRAGS */ + u8 idx; /* idx <= max */ + __u32 tokens[MAX_SKB_FRAGS]; + netmem_ref netmems[MAX_SKB_FRAGS]; +}; + +static void tcp_xa_pool_commit(struct sock *sk, struct tcp_xa_pool *p, + bool lock) +{ + int i; + + if (!p->max) + return; + if (lock) + xa_lock_bh(&sk->sk_user_frags); + /* Commit part that has been copied to user space. */ + for (i = 0; i < p->idx; i++) + __xa_cmpxchg(&sk->sk_user_frags, + p->tokens[i], + XA_ZERO_ENTRY, + (__force void *)p->netmems[i], + GFP_KERNEL); + /* Rollback what has been pre-allocated and is no longer needed. */ + for (; i < p->max; i++) + __xa_erase(&sk->sk_user_frags, p->tokens[i]); + if (lock) + xa_unlock_bh(&sk->sk_user_frags); + p->max = 0; + p->idx = 0; +} + +static int tcp_xa_pool_refill(struct sock *sk, struct tcp_xa_pool *p, + unsigned int max_frags) +{ + int err, k; + + if (p->idx < p->max) + return 0; + + xa_lock_bh(&sk->sk_user_frags); + + tcp_xa_pool_commit(sk, p, false); + for (k = 0; k < max_frags; k++) { + err = __xa_alloc(&sk->sk_user_frags, &p->tokens[k], + XA_ZERO_ENTRY, xa_limit_31b, GFP_KERNEL); + if (err) + break; + } + + xa_unlock_bh(&sk->sk_user_frags); + + p->max = k; + p->idx = 0; + return k ? 0 : err; +} + +/* On error, returns the -errno. On success, returns number of bytes sent to the + * user. May not consume all of @remaining_len. + */ +static int tcp_recvmsg_dmabuf(struct sock *sk, const struct sk_buff *skb, + unsigned int offset, struct msghdr *msg, + int remaining_len) +{ + struct dmabuf_cmsg dmabuf_cmsg = { 0 }; + struct tcp_xa_pool tcp_xa_pool; + unsigned int start; + int i, copy, n; + int sent = 0; + int err = 0; + + tcp_xa_pool.max = 0; + tcp_xa_pool.idx = 0; + do { + start = skb_headlen(skb); + + if (skb->readable) { + err = -ENODEV; + goto out; + } + + /* Copy header. 
*/ + copy = start - offset; + if (copy > 0) { + copy = min(copy, remaining_len); + + n = copy_to_iter(skb->data + offset, copy, + &msg->msg_iter); + if (n != copy) { + err = -EFAULT; + goto out; + } + + offset += copy; + remaining_len -= copy; + + /* First a dmabuf_cmsg for # bytes copied to user + * buffer. + */ + memset(&dmabuf_cmsg, 0, sizeof(dmabuf_cmsg)); + dmabuf_cmsg.frag_size = copy; + err = put_cmsg(msg, SOL_SOCKET, SO_DEVMEM_LINEAR, + sizeof(dmabuf_cmsg), &dmabuf_cmsg); + if (err || msg->msg_flags & MSG_CTRUNC) { + msg->msg_flags &= ~MSG_CTRUNC; + if (!err) + err = -ETOOSMALL; + goto out; + } + + sent += copy; + + if (remaining_len == 0) + goto out; + } + + /* after that, send information of dmabuf pages through a + * sequence of cmsg + */ + for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) { + skb_frag_t *frag = &skb_shinfo(skb)->frags[i]; + struct net_iov *niov; + u64 frag_offset; + int end; + + /* !skb->readable should indicate that ALL the frags in + * this skb are dmabuf net_iovs. We're checking + * for that flag above, but also check individual frags + * here. If the tcp stack is not setting skb->readable + * correctly, we still don't want to crash here when + * accessing pgmap or priv below. 
+ */ + if (!skb_frag_net_iov(frag)) { + net_err_ratelimited("Found non-dmabuf skb with net_iov"); + err = -ENODEV; + goto out; + } + + niov = skb_frag_net_iov(frag); + end = start + skb_frag_size(frag); + copy = end - offset; + + if (copy > 0) { + copy = min(copy, remaining_len); + + frag_offset = net_iov_virtual_addr(niov) + + skb_frag_off(frag) + offset - + start; + dmabuf_cmsg.frag_offset = frag_offset; + dmabuf_cmsg.frag_size = copy; + err = tcp_xa_pool_refill(sk, &tcp_xa_pool, + skb_shinfo(skb)->nr_frags - i); + if (err) + goto out; + + /* Will perform the exchange later */ + dmabuf_cmsg.frag_token = tcp_xa_pool.tokens[tcp_xa_pool.idx]; + dmabuf_cmsg.dmabuf_id = net_iov_binding_id(niov); + + offset += copy; + remaining_len -= copy; + + err = put_cmsg(msg, SOL_SOCKET, + SO_DEVMEM_DMABUF, + sizeof(dmabuf_cmsg), + &dmabuf_cmsg); + if (err || msg->msg_flags & MSG_CTRUNC) { + msg->msg_flags &= ~MSG_CTRUNC; + if (!err) + err = -ETOOSMALL; + goto out; + } + + atomic_long_inc(&niov->pp_ref_count); + tcp_xa_pool.netmems[tcp_xa_pool.idx++] = skb_frag_netmem(frag); + + sent += copy; + + if (remaining_len == 0) + goto out; + } + start = end; + } + + tcp_xa_pool_commit(sk, &tcp_xa_pool, true); + if (!remaining_len) + goto out; + + /* if remaining_len is not satisfied yet, we need to go to the + * next frag in the frag_list to satisfy remaining_len. + */ + skb = skb_shinfo(skb)->frag_list ?: skb->next; + + offset = offset - start; + } while (skb); + + if (remaining_len) { + err = -EFAULT; + goto out; + } + +out: + tcp_xa_pool_commit(sk, &tcp_xa_pool, true); + if (!sent) + sent = err; + + return sent; +} + /* * This routine copies from a sock struct into the user buffer. 
* @@ -2325,6 +2536,7 @@ static int tcp_recvmsg_locked(struct sock *sk, struct msghdr *msg, size_t len, int *cmsg_flags) { struct tcp_sock *tp = tcp_sk(sk); + int last_copied_dmabuf = -1; /* uninitialized */ int copied = 0; u32 peek_seq; u32 *seq; @@ -2502,15 +2714,44 @@ static int tcp_recvmsg_locked(struct sock *sk, struct msghdr *msg, size_t len, }
if (!(flags & MSG_TRUNC)) { - err = skb_copy_datagram_msg(skb, offset, msg, used); - if (err) { - /* Exception. Bailout! */ - if (!copied) - copied = -EFAULT; + if (last_copied_dmabuf != -1 && + last_copied_dmabuf != !skb->readable) break; + + if (skb->readable) { + err = skb_copy_datagram_msg(skb, offset, msg, + used); + if (err) { + /* Exception. Bailout! */ + if (!copied) + copied = -EFAULT; + break; + } + } else { + if (!(flags & MSG_SOCK_DEVMEM)) { + /* dmabuf skbs can only be received + * with the MSG_SOCK_DEVMEM flag. + */ + if (!copied) + copied = -EFAULT; + + break; + } + + err = tcp_recvmsg_dmabuf(sk, skb, offset, msg, + used); + if (err <= 0) { + if (!copied) + copied = -EFAULT; + + break; + } + used = err; } }
+ last_copied_dmabuf = !skb->readable; + WRITE_ONCE(*seq, *seq + used); copied += used; len -= used; diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c index a22ee5838751..e8dc831df007 100644 --- a/net/ipv4/tcp_ipv4.c +++ b/net/ipv4/tcp_ipv4.c @@ -2498,6 +2498,15 @@ static void tcp_md5sig_info_free_rcu(struct rcu_head *head) void tcp_v4_destroy_sock(struct sock *sk) { struct tcp_sock *tp = tcp_sk(sk); + __maybe_unused unsigned long index; + __maybe_unused void *netmem; + +#ifdef CONFIG_PAGE_POOL + xa_for_each(&sk->sk_user_frags, index, netmem) + WARN_ON_ONCE(!napi_pp_put_page((__force netmem_ref)netmem, false)); +#endif + + xa_destroy(&sk->sk_user_frags);
trace_tcp_destroy_sock(sk);
On Tue, Mar 5, 2024, at 03:01, Mina Almasry wrote:
--- a/arch/alpha/include/uapi/asm/socket.h
+++ b/arch/alpha/include/uapi/asm/socket.h
 #define SO_PEERPIDFD 77
+#define SO_DEVMEM_LINEAR 79
+#define SO_DEVMEM_DMABUF 80
--- a/arch/mips/include/uapi/asm/socket.h
+++ b/arch/mips/include/uapi/asm/socket.h
 #define SO_PEERPIDFD 77
+#define SO_DEVMEM_LINEAR 79
+#define SO_DEVMEM_DMABUF 80
--- a/arch/parisc/include/uapi/asm/socket.h
+++ b/arch/parisc/include/uapi/asm/socket.h
 #define SO_PEERPIDFD 0x404B
+#define SO_DEVMEM_LINEAR 98
+#define SO_DEVMEM_DMABUF 99
--- a/arch/sparc/include/uapi/asm/socket.h
+++ b/arch/sparc/include/uapi/asm/socket.h
 #define SO_PEERPIDFD 0x0056
+#define SO_DEVMEM_LINEAR 0x0058
+#define SO_DEVMEM_DMABUF 0x0059
--- a/include/uapi/asm-generic/socket.h
+++ b/include/uapi/asm-generic/socket.h
@@ -135,6 +135,11 @@
 #define SO_PEERPIDFD 77
+#define SO_DEVMEM_LINEAR 98
+#define SO_DEVMEM_DMABUF 99
These look inconsistent. I can see how you picked the alpha and mips numbers, but how did you come up with the generic and parisc ones? Can you follow the existing scheme instead?
diff --git a/include/uapi/linux/uio.h b/include/uapi/linux/uio.h index 059b1a9147f4..ad92e37699da 100644 --- a/include/uapi/linux/uio.h +++ b/include/uapi/linux/uio.h @@ -20,6 +20,16 @@ struct iovec __kernel_size_t iov_len; /* Must be size_t (1003.1g) */ };
+struct dmabuf_cmsg {
+	__u64 frag_offset;	/* offset into the dmabuf where the frag starts.
+				 */
+	__u32 frag_size;	/* size of the frag. */
+	__u32 frag_token;	/* token representing this frag for
+				 * DEVMEM_DONTNEED.
+				 */
+	__u32 dmabuf_id;	/* dmabuf id this frag belongs to. */
+};
This structure requires a special compat handler to run x86-32 binaries on x86-64 because of the different alignment requirements. Any uapi-visible structures should be defined to avoid this and just have no holes in them. Maybe extend one of the __u32 members to __u64 or add another 32-bit padding field?
Arnd
On Tue, Mar 5, 2024 at 12:42 AM Arnd Bergmann <arnd@arndb.de> wrote:
On Tue, Mar 5, 2024, at 03:01, Mina Almasry wrote:
--- a/arch/alpha/include/uapi/asm/socket.h
+++ b/arch/alpha/include/uapi/asm/socket.h
 #define SO_PEERPIDFD 77
+#define SO_DEVMEM_LINEAR 79
+#define SO_DEVMEM_DMABUF 80
--- a/arch/mips/include/uapi/asm/socket.h
+++ b/arch/mips/include/uapi/asm/socket.h
 #define SO_PEERPIDFD 77
+#define SO_DEVMEM_LINEAR 79
+#define SO_DEVMEM_DMABUF 80
--- a/arch/parisc/include/uapi/asm/socket.h
+++ b/arch/parisc/include/uapi/asm/socket.h
 #define SO_PEERPIDFD 0x404B
+#define SO_DEVMEM_LINEAR 98
+#define SO_DEVMEM_DMABUF 99
--- a/arch/sparc/include/uapi/asm/socket.h
+++ b/arch/sparc/include/uapi/asm/socket.h
 #define SO_PEERPIDFD 0x0056
+#define SO_DEVMEM_LINEAR 0x0058
+#define SO_DEVMEM_DMABUF 0x0059
--- a/include/uapi/asm-generic/socket.h
+++ b/include/uapi/asm-generic/socket.h
@@ -135,6 +135,11 @@
 #define SO_PEERPIDFD 77
+#define SO_DEVMEM_LINEAR 98
+#define SO_DEVMEM_DMABUF 99
These look inconsistent. I can see how you picked the alpha and mips numbers, but how did you come up with the generic and parisc ones? Can you follow the existing scheme instead?
Sorry, yes, this is a bit weird. I'll change this to use the next available entry rather than leave a gap.
diff --git a/include/uapi/linux/uio.h b/include/uapi/linux/uio.h index 059b1a9147f4..ad92e37699da 100644 --- a/include/uapi/linux/uio.h +++ b/include/uapi/linux/uio.h @@ -20,6 +20,16 @@ struct iovec __kernel_size_t iov_len; /* Must be size_t (1003.1g) */ };
+struct dmabuf_cmsg {
+	__u64 frag_offset;	/* offset into the dmabuf where the frag starts.
+				 */
+	__u32 frag_size;	/* size of the frag. */
+	__u32 frag_token;	/* token representing this frag for
+				 * DEVMEM_DONTNEED.
+				 */
+	__u32 dmabuf_id;	/* dmabuf id this frag belongs to. */
+};
This structure requires a special compat handler to run x86-32 binaries on x86-64 because of the different alignment requirements. Any uapi-visible structures should be defined to avoid this and just have no holes in them. Maybe extend one of the __u32 members to __u64 or add another 32-bit padding field?
Honestly the 32-bit fields as-is are somewhat comically large. I don't think extending the __u32 -> __u64 is preferred because I don't see us needing that much, so maybe I can add another 32-bit padding field. Does this look good to you?
struct dmabuf_cmsg {
	__u64 frag_offset;
	__u32 frag_size;
	__u32 frag_token;
	__u32 dmabuf_id;
	__u32 ext; /* reserved for future flags */
};
Another option is to actually compress frag_token & dmabuf_id to be 32-bit combined size if that addresses your concern. I prefer that less in case they end up being too small for future use cases.
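The layout concern discussed in this exchange can be checked at compile time. The sketch below is only an illustration: dmabuf_cmsg_v6 and dmabuf_cmsg_padded are hypothetical local names that mirror the posted struct and the proposed variant with a trailing 32-bit field, using fixed-width userspace types in place of __u64/__u32.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Layout as posted in this revision (hypothetical local name). */
struct dmabuf_cmsg_v6 {
	uint64_t frag_offset;
	uint32_t frag_size;
	uint32_t frag_token;
	uint32_t dmabuf_id;
};

/* Layout with the extra 32-bit field proposed above (hypothetical name). */
struct dmabuf_cmsg_padded {
	uint64_t frag_offset;
	uint32_t frag_size;
	uint32_t frag_token;
	uint32_t dmabuf_id;
	uint32_t ext;
};

/* Bytes the compiler appends after the last member of the posted layout. */
static size_t v6_tail_padding(void)
{
	return sizeof(struct dmabuf_cmsg_v6) -
	       (offsetof(struct dmabuf_cmsg_v6, dmabuf_id) + sizeof(uint32_t));
}

/* Same computation for the padded layout; expected to be zero. */
static size_t padded_tail_padding(void)
{
	return sizeof(struct dmabuf_cmsg_padded) -
	       (offsetof(struct dmabuf_cmsg_padded, ext) + sizeof(uint32_t));
}

/* The padded layout has no implicit holes, so its size is ABI-independent. */
static_assert(sizeof(struct dmabuf_cmsg_padded) == 24,
	      "padded layout should be exactly 24 bytes");
static_assert(offsetof(struct dmabuf_cmsg_padded, ext) == 20,
	      "ext should directly follow dmabuf_id");
```

On x86-64, v6_tail_padding() would be expected to return 4, while an i386 build, where 64-bit integers are only 4-byte aligned inside structs, would be expected to return 0. That size difference is what would force a compat handler.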
On Tue, Mar 5, 2024, at 20:22, Mina Almasry wrote:
On Tue, Mar 5, 2024 at 12:42 AM Arnd Bergmann <arnd@arndb.de> wrote:
On Tue, Mar 5, 2024, at 03:01, Mina Almasry wrote:
This structure requires a special compat handler to run x86-32 binaries on x86-64 because of the different alignment requirements. Any uapi-visible structures should be defined to avoid this and just have no holes in them. Maybe extend one of the __u32 members to __u64 or add another 32-bit padding field?
Honestly the 32-bit fields as-is are somewhat comically large. I don't think extending the __u32 -> __u64 is preferred because I don't see us needing that much, so maybe I can add another 32-bit padding field. Does this look good to you?
Having a reserved field works but requires that you check it for being zero already, so you can detect an incompatible caller.
struct dmabuf_cmsg {
	__u64 frag_offset;
	__u32 frag_size;
	__u32 frag_token;
	__u32 dmabuf_id;
	__u32 ext; /* reserved for future flags */
};
Maybe call it 'flags'?
Another option is to actually compress frag_token & dmabuf_id to be 32-bit combined size if that addresses your concern. I prefer that less in case they end up being too small for future use cases.
I don't know what either of those fields is. Is dmabuf_id not a file descriptor? If it is, it has to be 32 bits wide. Otherwise having two 16-bit fields and a 32-bit field would indeed add up to a multiple of the structure alignment on all architectures and solve the problem.
Arnd
Add an interface for the user to notify the kernel that it is done reading the devmem dmabuf frags returned as cmsg. The kernel will drop the reference on the frags to make them available for reuse.
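Since the uapi struct added in this patch carries a (token_start, token_count) pair and the setsockopt handler accepts at most 128 entries per call, userspace may want to merge runs of consecutive tokens before releasing them. The following is a minimal sketch of one way to do that. struct token_range and coalesce_tokens() are hypothetical names for this example, and the input is assumed to be sorted ascending with no duplicates.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical local mirror of the token_start/token_count pair. */
struct token_range {
	uint32_t token_start;
	uint32_t token_count;
};

/*
 * Merge sorted, unique frag tokens into contiguous ranges.
 * Writes at most out_cap ranges, returns how many were written, and
 * stores in *consumed how many input tokens those ranges cover, so the
 * caller can submit the remainder in a later call.
 */
static size_t coalesce_tokens(const uint32_t *tokens, size_t n,
			      struct token_range *out, size_t out_cap,
			      size_t *consumed)
{
	size_t used = 0;
	size_t i;

	for (i = 0; i < n; i++) {
		/* Extend the current range if this token is adjacent to it. */
		if (used > 0 &&
		    out[used - 1].token_start + out[used - 1].token_count ==
		    tokens[i]) {
			out[used - 1].token_count++;
			continue;
		}
		/* Otherwise start a new range, if there is room left. */
		if (used == out_cap)
			break;
		out[used].token_start = tokens[i];
		out[used].token_count = 1;
		used++;
	}

	*consumed = i;
	return used;
}
```

With an out_cap of 128, the resulting array could then be passed as the option value for a single SO_DEVMEM_DONTNEED call, assuming the final uapi keeps the layout shown in this revision.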
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Kaiyuan Zhang <kaiyuanz@google.com>
Signed-off-by: Mina Almasry <almasrymina@google.com>
---
v6:
- Squash in locking optimizations from edumazet@google.com. With his
  changes we lock the xarray once per sock_devmem_dontneed operation
  rather than once per frag.

Changes in v1:
- devmemtoken -> dmabuf_token (David).
- Use napi_pp_put_page() for refcounting (Yunsheng).
- Fix build error with missing socket options on other asms.
--- arch/alpha/include/uapi/asm/socket.h | 1 + arch/mips/include/uapi/asm/socket.h | 1 + arch/parisc/include/uapi/asm/socket.h | 1 + arch/sparc/include/uapi/asm/socket.h | 1 + include/uapi/asm-generic/socket.h | 1 + include/uapi/linux/uio.h | 4 ++ net/core/sock.c | 61 +++++++++++++++++++++++++++ 7 files changed, 70 insertions(+)
diff --git a/arch/alpha/include/uapi/asm/socket.h b/arch/alpha/include/uapi/asm/socket.h index 1a9439f1c02e..43c98719120a 100644 --- a/arch/alpha/include/uapi/asm/socket.h +++ b/arch/alpha/include/uapi/asm/socket.h @@ -140,6 +140,7 @@ #define SO_PASSPIDFD 76 #define SO_PEERPIDFD 77
+#define SO_DEVMEM_DONTNEED 78 #define SO_DEVMEM_LINEAR 79 #define SCM_DEVMEM_LINEAR SO_DEVMEM_LINEAR #define SO_DEVMEM_DMABUF 80 diff --git a/arch/mips/include/uapi/asm/socket.h b/arch/mips/include/uapi/asm/socket.h index a940747504a0..9a71ee8f36db 100644 --- a/arch/mips/include/uapi/asm/socket.h +++ b/arch/mips/include/uapi/asm/socket.h @@ -151,6 +151,7 @@ #define SO_PASSPIDFD 76 #define SO_PEERPIDFD 77
+#define SO_DEVMEM_DONTNEED 78 #define SO_DEVMEM_LINEAR 79 #define SCM_DEVMEM_LINEAR SO_DEVMEM_LINEAR #define SO_DEVMEM_DMABUF 80 diff --git a/arch/parisc/include/uapi/asm/socket.h b/arch/parisc/include/uapi/asm/socket.h index 1d399f104f08..1e77efcc6d63 100644 --- a/arch/parisc/include/uapi/asm/socket.h +++ b/arch/parisc/include/uapi/asm/socket.h @@ -132,6 +132,7 @@ #define SO_PASSPIDFD 0x404A #define SO_PEERPIDFD 0x404B
+#define SO_DEVMEM_DONTNEED 97 #define SO_DEVMEM_LINEAR 98 #define SCM_DEVMEM_LINEAR SO_DEVMEM_LINEAR #define SO_DEVMEM_DMABUF 99 diff --git a/arch/sparc/include/uapi/asm/socket.h b/arch/sparc/include/uapi/asm/socket.h index a5961ecff31d..ecfc8bfa9fe0 100644 --- a/arch/sparc/include/uapi/asm/socket.h +++ b/arch/sparc/include/uapi/asm/socket.h @@ -133,6 +133,7 @@ #define SO_PASSPIDFD 0x0055 #define SO_PEERPIDFD 0x0056
+#define SO_DEVMEM_DONTNEED 0x0057 #define SO_DEVMEM_LINEAR 0x0058 #define SCM_DEVMEM_LINEAR SO_DEVMEM_LINEAR #define SO_DEVMEM_DMABUF 0x0059 diff --git a/include/uapi/asm-generic/socket.h b/include/uapi/asm-generic/socket.h index 25a2f5255f52..1acb77780f10 100644 --- a/include/uapi/asm-generic/socket.h +++ b/include/uapi/asm-generic/socket.h @@ -135,6 +135,7 @@ #define SO_PASSPIDFD 76 #define SO_PEERPIDFD 77
+#define SO_DEVMEM_DONTNEED 97 #define SO_DEVMEM_LINEAR 98 #define SCM_DEVMEM_LINEAR SO_DEVMEM_LINEAR #define SO_DEVMEM_DMABUF 99 diff --git a/include/uapi/linux/uio.h b/include/uapi/linux/uio.h index ad92e37699da..65f33178a601 100644 --- a/include/uapi/linux/uio.h +++ b/include/uapi/linux/uio.h @@ -30,6 +30,10 @@ struct dmabuf_cmsg { __u32 dmabuf_id; /* dmabuf id this frag belongs to. */ };
+struct dmabuf_token { + __u32 token_start; + __u32 token_count; +}; /* * UIO_MAXIOV shall be at least 16 1003.1g (5.4.1.1) */ diff --git a/net/core/sock.c b/net/core/sock.c index df2ac54a8f74..dc15d676f46f 100644 --- a/net/core/sock.c +++ b/net/core/sock.c @@ -1049,6 +1049,63 @@ static int sock_reserve_memory(struct sock *sk, int bytes) return 0; }
+#ifdef CONFIG_PAGE_POOL +static noinline_for_stack int +sock_devmem_dontneed(struct sock *sk, sockptr_t optval, unsigned int optlen) +{ + unsigned int num_tokens, i, j, k, netmem_num = 0; + struct dmabuf_token *tokens; + netmem_ref netmems[16]; + int ret; + + if (sk->sk_type != SOCK_STREAM || sk->sk_protocol != IPPROTO_TCP) + return -EBADF; + + if (optlen % sizeof(struct dmabuf_token) || + optlen > sizeof(*tokens) * 128) + return -EINVAL; + + tokens = kvmalloc_array(128, sizeof(*tokens), GFP_KERNEL); + if (!tokens) + return -ENOMEM; + + num_tokens = optlen / sizeof(struct dmabuf_token); + if (copy_from_sockptr(tokens, optval, optlen)) + return -EFAULT; + + ret = 0; + + xa_lock_bh(&sk->sk_user_frags); + for (i = 0; i < num_tokens; i++) { + for (j = 0; j < tokens[i].token_count; j++) { + netmem_ref netmem = (__force netmem_ref)__xa_erase( + &sk->sk_user_frags, tokens[i].token_start + j); + + if (netmem && + !WARN_ON_ONCE(!netmem_is_net_iov(netmem))) { + netmems[netmem_num++] = netmem; + if (netmem_num == ARRAY_SIZE(netmems)) { + xa_unlock_bh(&sk->sk_user_frags); + for (k = 0; k < netmem_num; k++) + WARN_ON_ONCE(!napi_pp_put_page(netmems[k], + false)); + netmem_num = 0; + xa_lock_bh(&sk->sk_user_frags); + } + ret++; + } + } + } + + xa_unlock_bh(&sk->sk_user_frags); + for (k = 0; k < netmem_num; k++) + WARN_ON_ONCE(!napi_pp_put_page(netmems[k], false)); + + kvfree(tokens); + return ret; +} +#endif + void sockopt_lock_sock(struct sock *sk) { /* When current->bpf_ctx is set, the setsockopt is called from @@ -1200,6 +1257,10 @@ int sk_setsockopt(struct sock *sk, int level, int optname, ret = -EOPNOTSUPP; return ret; } +#ifdef CONFIG_PAGE_POOL + case SO_DEVMEM_DONTNEED: + return sock_devmem_dontneed(sk, optval, optlen); +#endif }
sockopt_lock_sock(sk);
Signed-off-by: Mina Almasry <almasrymina@google.com>
---
v1 -> v2:
- Missing spdx (simon)
- add to index.rst (simon)
--- Documentation/networking/devmem.rst | 271 ++++++++++++++++++++++++++++ Documentation/networking/index.rst | 1 + 2 files changed, 272 insertions(+) create mode 100644 Documentation/networking/devmem.rst
diff --git a/Documentation/networking/devmem.rst b/Documentation/networking/devmem.rst new file mode 100644 index 000000000000..4712f029e5ed --- /dev/null +++ b/Documentation/networking/devmem.rst @@ -0,0 +1,271 @@ +.. SPDX-License-Identifier: GPL-2.0 + +================= +Device Memory TCP +================= + + +Intro +===== + +Device memory TCP (devmem TCP) enables receiving data directly into device +memory (dmabuf). The feature is currently implemented for TCP sockets. + + +Opportunity +----------- + +A large amount of data transfers have device memory as the source and/or +destination. Accelerators drastically increased the volume of such transfers. +Some examples include: + +- Distributed training, where ML accelerators, such as GPUs on different hosts, + exchange data among them. + +- Distributed raw block storage applications transfer large amounts of data with + remote SSDs, much of this data does not require host processing. + +Today, the majority of the Device-to-Device data transfers the network are +implemented as the following low level operations: Device-to-Host copy, +Host-to-Host network transfer, and Host-to-Device copy. + +The implementation is suboptimal, especially for bulk data transfers, and can +put significant strains on system resources such as host memory bandwidth and +PCIe bandwidth. + +Devmem TCP optimizes this use case by implementing socket APIs that enable +the user to receive incoming network packets directly into device memory. + +Packet payloads go directly from the NIC to device memory. + +Packet headers go to host memory and are processed by the TCP/IP stack +normally. The NIC must support header split to achieve this. + +Advantages: + +- Alleviate host memory bandwidth pressure, compared to existing + network-transfer + device-copy semantics. + +- Alleviate PCIe bandwidth pressure, by limiting data transfer to the lowest + level of the PCIe tree, compared to traditional path which sends data through + the root complex. 
+ + +More Info +--------- + + slides, video + https://netdevconf.org/0x17/sessions/talk/device-memory-tcp.html + + patchset + [RFC PATCH v3 00/12] Device Memory TCP + https://lore.kernel.org/lkml/20231106024413.2801438-1-almasrymina@google.com... + + +Interface +========= + +Example +------- + +tools/testing/selftests/net/ncdevmem.c:do_server shows an example of setting up +the RX path of this API. + +NIC Setup +--------- + +Header split, flow steering, & RSS are required features for devmem TCP. + +Header split is used to split incoming packets into a header buffer in host +memory, and a payload buffer in device memory. + +Flow steering & RSS are used to ensure that only flows targeting devmem land on +RX queue bound to devmem. + +Enable header split & flow steering: + +:: + + # enable header split (assuming priv-flag) + ethtool --set-priv-flags eth1 enable-header-split on + + # enable flow steering + ethtool -K eth1 ntuple on + +Configure RSS to steer all traffic away from the target RX queue (queue 15 in +this example): + +:: + + ethtool --set-rxfh-indir eth1 equal 15 + + +The user must bind a dmabuf to any number of RX queues on a given NIC using +netlink API: + +:: + + /* Bind dmabuf to NIC RX queue 15 */ + struct netdev_queue *queues; + queues = malloc(sizeof(*queues) * 1); + + queues[0]._present.type = 1; + queues[0]._present.idx = 1; + queues[0].type = NETDEV_RX_QUEUE_TYPE_RX; + queues[0].idx = 15; + + *ys = ynl_sock_create(&ynl_netdev_family, &yerr); + + req = netdev_bind_rx_req_alloc(); + netdev_bind_rx_req_set_ifindex(req, 1 /* ifindex */); + netdev_bind_rx_req_set_dmabuf_fd(req, dmabuf_fd); + __netdev_bind_rx_req_set_queues(req, queues, n_queue_index); + + rsp = netdev_bind_rx(*ys, req); + + dmabuf_id = rsp->dmabuf_id; + + +The netlink API returns a dmabuf_id: a unique ID that refers to this dmabuf +that has been bound. 
+
+Socket Setup
+------------
+
+The socket must be flow steered to the dmabuf bound RX queue:
+
+::
+
+	ethtool -N eth1 flow-type tcp4 ... queue 15
+
+
+Receiving data
+--------------
+
+The user application must signal to the kernel that it is capable of receiving
+devmem data by passing the MSG_SOCK_DEVMEM flag to recvmsg:
+
+::
+
+	ret = recvmsg(fd, &msg, MSG_SOCK_DEVMEM);
+
+Applications that do not specify the MSG_SOCK_DEVMEM flag will receive an EFAULT
+on devmem data.
+
+Devmem data is received directly into the dmabuf bound to the NIC in 'NIC
+Setup', and the kernel signals such to the user via the SCM_DEVMEM_* cmsgs:
+
+::
+
+	for (cm = CMSG_FIRSTHDR(&msg); cm; cm = CMSG_NXTHDR(&msg, cm)) {
+		if (cm->cmsg_level != SOL_SOCKET ||
+		    (cm->cmsg_type != SCM_DEVMEM_DMABUF &&
+		     cm->cmsg_type != SCM_DEVMEM_LINEAR))
+			continue;
+
+		dmabuf_cmsg = (struct dmabuf_cmsg *)CMSG_DATA(cm);
+
+		if (cm->cmsg_type == SCM_DEVMEM_DMABUF) {
+			/* Frag landed in dmabuf.
+			 *
+			 * dmabuf_cmsg->dmabuf_id is the dmabuf the
+			 * frag landed on.
+			 *
+			 * dmabuf_cmsg->frag_offset is the offset into
+			 * the dmabuf where the frag starts.
+			 *
+			 * dmabuf_cmsg->frag_size is the size of the
+			 * frag.
+			 *
+			 * dmabuf_cmsg->frag_token is a token used to
+			 * refer to this frag for later freeing.
+			 */
+
+			struct dmabuf_token token;
+			token.token_start = dmabuf_cmsg->frag_token;
+			token.token_count = 1;
+			continue;
+		}
+
+		if (cm->cmsg_type == SCM_DEVMEM_LINEAR)
+			/* Frag landed in linear buffer.
+			 *
+			 * dmabuf_cmsg->frag_size is the size of the
+			 * frag.
+			 */
+			continue;
+
+	}
+
+Applications may receive 2 cmsgs:
+
+- SCM_DEVMEM_DMABUF: this indicates the fragment landed in the dmabuf indicated
+  by dmabuf_id.
+
+- SCM_DEVMEM_LINEAR: this indicates the fragment landed in the linear buffer.
+  This typically happens when the NIC is unable to split the packet at the
+  header boundary, such that part (or all) of the payload landed in host
+  memory.
+
+Applications may receive no SCM_DEVMEM_* cmsgs. That indicates non-devmem,
+regular TCP data that landed on an RX queue not bound to a dmabuf.
+
+
+Freeing frags
+-------------
+
+Frags received via SCM_DEVMEM_DMABUF are pinned by the kernel while the user
+processes the frag. The user must return the frag to the kernel via
+SO_DEVMEM_DONTNEED:
+
+::
+
+	ret = setsockopt(client_fd, SOL_SOCKET, SO_DEVMEM_DONTNEED, &token,
+			 sizeof(token));
+
+The user must ensure the tokens are returned to the kernel in a timely manner.
+Failure to do so will exhaust the limited dmabuf that is bound to the RX queue
+and will lead to packet drops.
+
+
+Implementation & Caveats
+========================
+
+Unreadable skbs
+---------------
+
+Devmem payloads are inaccessible to the kernel processing the packets. This
+results in a few quirks for payloads of devmem skbs:
+
+- Loopback is not functional. Loopback relies on copying the payload, which is
+  not possible with devmem skbs.
+
+- Software checksum calculation fails.
+
+- TCP Dump and bpf can't access devmem packet payloads.
+
+
+Testing
+=======
+
+More realistic example code can be found in the kernel source under
+tools/testing/selftests/net/ncdevmem.c
+
+ncdevmem is a devmem TCP netcat. It works very similarly to netcat, but
+receives data directly into a udmabuf.
+
+To run ncdevmem, you need to run it as a server on the machine under test, and
+you need to run netcat on a peer to provide the TX data.
+
+ncdevmem has a validation mode as well that expects a repeating pattern of
+incoming data and validates it as such:
+
+::
+
+	# On server:
+	ncdevmem -s <server IP> -c <client IP> -f eth1 -d 3 -n 0000:06:00.0 -l \
+		 -p 5201 -v 7
+
+	# On client:
+	yes $(echo -e \x01\x02\x03\x04\x05\x06) | \
+		tr \n \0 | head -c 5G | nc <server IP> 5201 -p 5201
diff --git a/Documentation/networking/index.rst b/Documentation/networking/index.rst
index 69f3d6dcd9fd..d9f86514aa1e 100644
--- a/Documentation/networking/index.rst
+++ b/Documentation/networking/index.rst
@@ -48,6 +48,7 @@ Contents:
    cdc_mbim
    dccp
    dctcp
+   devmem
    dns_resolver
    driver
    eql
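For illustration only, here is a minimal userspace sketch of how an application might batch the frag tokens it collects before returning them with SO_DEVMEM_DONTNEED. It assumes that a token entry with token_count > 1 covers a run of consecutive token values starting at token_start; the field names suggest this, but the documentation above only shows token_count = 1. `struct token_range` and `coalesce_tokens()` are hypothetical names, not part of the series.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical local stand-in for struct dmabuf_token from the text above.
 * Assumed meaning: token_count consecutive tokens starting at token_start.
 */
struct token_range {
	uint32_t token_start;
	uint32_t token_count;
};

/*
 * Fold frag tokens (in arrival order) into ranges of consecutive values.
 * Returns the number of ranges written to out (at most max_out).
 * *consumed is set to the number of input tokens covered by those ranges,
 * so the caller can resume from tokens[*consumed] if out filled up.
 */
static size_t coalesce_tokens(const uint32_t *tokens, size_t n,
			      struct token_range *out, size_t max_out,
			      size_t *consumed)
{
	size_t i, n_out = 0;

	for (i = 0; i < n; i++) {
		/* Extend the current range when the token is the next value. */
		if (n_out > 0 &&
		    out[n_out - 1].token_start + out[n_out - 1].token_count ==
			    tokens[i]) {
			out[n_out - 1].token_count++;
			continue;
		}
		/* Otherwise start a new range, if there is room for one. */
		if (n_out == max_out)
			break;
		out[n_out].token_start = tokens[i];
		out[n_out].token_count = 1;
		n_out++;
	}

	*consumed = i;
	return n_out;
}
```

Under that assumption, the resulting array could be handed to setsockopt() in one call instead of one call per frag. Whether the kernel accepts an array of ranges in a single SO_DEVMEM_DONTNEED call is not stated in this thread and would need to be checked against the series.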
On Mon, 4 Mar 2024 18:01:49 -0800 Mina Almasry wrote:
+Intro
+=====
+Device memory TCP (devmem TCP) enables receiving data directly into device
+memory (dmabuf). The feature is currently implemented for TCP sockets.
+Opportunity
+-----------
+A large amount of data transfers have device memory as the source and/or
s/amount/number/
+destination. Accelerators drastically increased the volume of such transfers.
s/volume/prevalence/
+Some examples include:
+- Distributed training, where ML accelerators, such as GPUs on different hosts,
+  exchange data among them.
s/among them//
+- Distributed raw block storage applications transfer large amounts of data with
+  remote SSDs, much of this data does not require host processing.
+Today, the majority of the Device-to-Device data transfers over the network are
"Today" won't age well.
+implemented as the following low level operations: Device-to-Host copy,
+Host-to-Host network transfer, and Host-to-Device copy.
+The implementation is suboptimal, especially for bulk data transfers, and can
/The implementation/The flow involving host copies/
+put significant strains on system resources such as host memory bandwidth and
+PCIe bandwidth.
+Devmem TCP optimizes this use case by implementing socket APIs that enable
+the user to receive incoming network packets directly into device memory.
+More Info
+---------
+  slides, video
+    https://netdevconf.org/0x17/sessions/talk/device-memory-tcp.html
+  patchset
+    [RFC PATCH v3 00/12] Device Memory TCP
+    https://lore.kernel.org/lkml/20231106024413.2801438-1-almasrymina@google.com...
Won't age well? :)
+Interface
+=========
+Example
+-------
+tools/testing/selftests/net/ncdevmem.c:do_server shows an example of setting up
+the RX path of this API.
+NIC Setup
+---------
+Header split, flow steering, & RSS are required features for devmem TCP.
+Header split is used to split incoming packets into a header buffer in host
+memory, and a payload buffer in device memory.
+Flow steering & RSS are used to ensure that only flows targeting devmem land on
+RX queue bound to devmem.
+Enable header split & flow steering:
+::
You can put the :: at the end of the text, IIRC, like this:
Enable header split & flow steering::
+	# enable header split (assuming priv-flag)
+	ethtool --set-priv-flags eth1 enable-header-split on
Olek added the "set" in commit 50d73710715d ("ethtool: add SET for TCP_DATA_SPLIT ringparam"), no need for the priv flag any more.
ncdevmem is a devmem TCP netcat. It works similarly to netcat, but it sends and receives data using the devmem TCP APIs. It uses udmabuf as the dmabuf provider. It is compatible with a regular netcat running on a peer, or a ncdevmem running on a peer.
In addition to normal netcat support, ncdevmem has a validation mode, where it sends a specific pattern and validates this pattern on the receiver side to ensure data integrity.
Suggested-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Mina Almasry <almasrymina@google.com>

---

v6:
- Updated to bind 8 queues.
- Added RSS configuration.
- Added some more tests for the netlink API.

Changes in v1:
- Many more general cleanups (Willem).
- Removed driver reset (Jakub).
- Removed hardcoded if index (Paolo).

RFC v2:
- General cleanups (Willem).
---
 tools/testing/selftests/net/.gitignore |   1 +
 tools/testing/selftests/net/Makefile   |   5 +
 tools/testing/selftests/net/ncdevmem.c | 546 +++++++++++++++++++++++++
 3 files changed, 552 insertions(+)
 create mode 100644 tools/testing/selftests/net/ncdevmem.c

diff --git a/tools/testing/selftests/net/.gitignore b/tools/testing/selftests/net/.gitignore
index 2f9d378edec3..b644dbae58b7 100644
--- a/tools/testing/selftests/net/.gitignore
+++ b/tools/testing/selftests/net/.gitignore
@@ -17,6 +17,7 @@ ipv6_flowlabel
 ipv6_flowlabel_mgr
 log.txt
 msg_zerocopy
+ncdevmem
 nettest
 psock_fanout
 psock_snd
diff --git a/tools/testing/selftests/net/Makefile b/tools/testing/selftests/net/Makefile
index 7b6918d5f4af..c9853573e60c 100644
--- a/tools/testing/selftests/net/Makefile
+++ b/tools/testing/selftests/net/Makefile
@@ -5,6 +5,10 @@ CFLAGS = -Wall -Wl,--no-as-needed -O2 -g
 CFLAGS += -I../../../../usr/include/ $(KHDR_INCLUDES)
 # Additional include paths needed by kselftest.h
 CFLAGS += -I../
+CFLAGS += -I../../../net/ynl/generated/
+CFLAGS += -I../../../net/ynl/lib/
+
+LDLIBS += ../../../net/ynl/lib/ynl.a ../../../net/ynl/generated/protos.a
 
 TEST_PROGS := run_netsocktests run_afpackettests test_bpf.sh netdevice.sh \
 	      rtnetlink.sh xfrm_policy.sh test_blackhole_dev.sh
@@ -93,6 +97,7 @@ TEST_PROGS += test_bridge_backup_port.sh
 TEST_PROGS += fdb_flush.sh
 TEST_PROGS += fq_band_pktlimit.sh
 TEST_PROGS += vlan_hw_filter.sh
+TEST_GEN_FILES += ncdevmem
 
 TEST_FILES := settings
 TEST_FILES += in_netns.sh lib.sh net_helper.sh setup_loopback.sh setup_veth.sh
diff --git a/tools/testing/selftests/net/ncdevmem.c b/tools/testing/selftests/net/ncdevmem.c
new file mode 100644
index 000000000000..11bfe3e1125b
--- /dev/null
+++ b/tools/testing/selftests/net/ncdevmem.c
@@ -0,0 +1,546 @@
+// SPDX-License-Identifier: GPL-2.0
+#define _GNU_SOURCE
+#define __EXPORTED_HEADERS__
+
+#include <linux/uio.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <unistd.h>
+#include <stdbool.h>
+#include <string.h>
+#include <errno.h>
+#define __iovec_defined
+#include <fcntl.h>
+#include <malloc.h>
+#include <error.h>
+
+#include <arpa/inet.h>
+#include <sys/socket.h>
+#include <sys/mman.h>
+#include <sys/ioctl.h>
+#include <sys/syscall.h>
+
+#include <linux/memfd.h>
+#include <linux/if.h>
+#include <linux/dma-buf.h>
+#include <linux/udmabuf.h>
+#include <libmnl/libmnl.h>
+#include <linux/types.h>
+#include <linux/netlink.h>
+#include <linux/genetlink.h>
+#include <linux/netdev.h>
+#include <time.h>
+
+#include "netdev-user.h"
+#include <ynl.h>
+
+#define PAGE_SHIFT 12
+#define TEST_PREFIX "ncdevmem"
+#define NUM_PAGES 16000
+
+#ifndef MSG_SOCK_DEVMEM
+#define MSG_SOCK_DEVMEM 0x2000000
+#endif
+
+/*
+ * tcpdevmem netcat. Works similarly to netcat but does device memory TCP
+ * instead of regular TCP. Uses udmabuf to mock a dmabuf provider.
+ *
+ * Usage:
+ *
+ *	On server:
+ *	ncdevmem -s <server IP> -c <client IP> -f eth1 -d 3 -n 0000:06:00.0 -l \
+ *		-p 5201 -v 7
+ *
+ *	On client:
+ *	yes $(echo -e \x01\x02\x03\x04\x05\x06) | \
+ *		tr \n \0 | \
+ *		head -c 5G | \
+ *		nc <server IP> 5201 -p 5201
+ *
+ * Note this is compatible with regular netcat. i.e. the sender or receiver can
+ * be replaced with regular netcat to test the RX or TX path in isolation.
+ */
+
+static char *server_ip = "192.168.1.4";
+static char *client_ip = "192.168.1.2";
+static char *port = "5201";
+static size_t do_validation;
+static int start_queue = 8;
+static int num_queues = 8;
+static char *ifname = "eth1";
+static unsigned int ifindex = 3;
+static char *nic_pci_addr = "0000:06:00.0";
+static unsigned int iterations;
+static unsigned int dmabuf_id;
+
+void print_bytes(void *ptr, size_t size)
+{
+	unsigned char *p = ptr;
+	int i;
+
+	for (i = 0; i < size; i++)
+		printf("%02hhX ", p[i]);
+	printf("\n");
+}
+
+void print_nonzero_bytes(void *ptr, size_t size)
+{
+	unsigned char *p = ptr;
+	unsigned int i;
+
+	for (i = 0; i < size; i++)
+		putchar(p[i]);
+	printf("\n");
+}
+
+void validate_buffer(void *line, size_t size)
+{
+	static unsigned char seed = 1;
+	unsigned char *ptr = line;
+	int errors = 0;
+	size_t i;
+
+	for (i = 0; i < size; i++) {
+		if (ptr[i] != seed) {
+			fprintf(stderr,
+				"Failed validation: expected=%u, actual=%u, index=%lu\n",
+				seed, ptr[i], i);
+			errors++;
+			if (errors > 20)
+				error(1, 0, "validation failed.");
+		}
+		seed++;
+		if (seed == do_validation)
+			seed = 0;
+	}
+
+	fprintf(stdout, "Validated buffer\n");
+}
+
+static void reset_flow_steering(void)
+{
+	char command[256];
+
+	memset(command, 0, sizeof(command));
+	snprintf(command, sizeof(command), "sudo ethtool -K %s ntuple off",
+		 "eth1");
+	system(command);
+
+	memset(command, 0, sizeof(command));
+	snprintf(command, sizeof(command), "sudo ethtool -K %s ntuple on",
+		 "eth1");
+	system(command);
+}
+
+static void configure_rss(void)
+{
+	char command[256];
+
+	memset(command, 0, sizeof(command));
+	snprintf(command, sizeof(command), "sudo ethtool -X %s equal %d",
+		 ifname, start_queue);
+	system(command);
+}
+
+static void configure_flow_steering(void)
+{
+	char command[256];
+
+	memset(command, 0, sizeof(command));
+	snprintf(command, sizeof(command),
+		 "sudo ethtool -N %s flow-type tcp4 src-ip %s dst-ip %s src-port %s dst-port %s queue %d",
+		 ifname, client_ip, server_ip, port, port, start_queue);
+	system(command);
+}
+
+static int bind_rx_queue(unsigned int ifindex, unsigned int dmabuf_fd,
+			 struct netdev_queue_dmabuf *queues,
+			 unsigned int n_queue_index, struct ynl_sock **ys)
+{
+	struct netdev_bind_rx_req *req = NULL;
+	struct netdev_bind_rx_rsp *rsp = NULL;
+	struct ynl_error yerr;
+
+	*ys = ynl_sock_create(&ynl_netdev_family, &yerr);
+	if (!*ys) {
+		fprintf(stderr, "YNL: %s\n", yerr.msg);
+		return -1;
+	}
+
+	req = netdev_bind_rx_req_alloc();
+	netdev_bind_rx_req_set_ifindex(req, ifindex);
+	netdev_bind_rx_req_set_dmabuf_fd(req, dmabuf_fd);
+	__netdev_bind_rx_req_set_queues(req, queues, n_queue_index);
+
+	rsp = netdev_bind_rx(*ys, req);
+	if (!rsp) {
+		perror("netdev_bind_rx");
+		goto err_close;
+	}
+
+	if (!rsp->_present.dmabuf_id) {
+		perror("dmabuf_id not present");
+		goto err_close;
+	}
+
+	printf("got dmabuf id=%d\n", rsp->dmabuf_id);
+	dmabuf_id = rsp->dmabuf_id;
+
+	netdev_bind_rx_req_free(req);
+	netdev_bind_rx_rsp_free(rsp);
+
+	return 0;
+
+err_close:
+	fprintf(stderr, "YNL failed: %s\n", (*ys)->err.msg);
+	netdev_bind_rx_req_free(req);
+	ynl_sock_destroy(*ys);
+	return -1;
+}
+
+static void create_udmabuf(int *devfd, int *memfd, int *buf, size_t dmabuf_size)
+{
+	struct udmabuf_create create;
+	int ret;
+
+	*devfd = open("/dev/udmabuf", O_RDWR);
+	if (*devfd < 0) {
+		error(70, 0,
+		      "%s: [skip,no-udmabuf: Unable to access DMA buffer device file]\n",
+		      TEST_PREFIX);
+	}
+
+	*memfd = memfd_create("udmabuf-test", MFD_ALLOW_SEALING);
+	if (*memfd < 0)
+		error(70, 0, "%s: [skip,no-memfd]\n", TEST_PREFIX);
+
+	/* Required for udmabuf */
+	ret = fcntl(*memfd, F_ADD_SEALS, F_SEAL_SHRINK);
+	if (ret < 0)
+		error(73, 0, "%s: [skip,fcntl-add-seals]\n", TEST_PREFIX);
+
+	ret = ftruncate(*memfd, dmabuf_size);
+	if (ret == -1)
+		error(74, 0, "%s: [FAIL,memfd-truncate]\n", TEST_PREFIX);
+
+	memset(&create, 0, sizeof(create));
+
+	create.memfd = *memfd;
+	create.offset = 0;
+	create.size = dmabuf_size;
+	*buf = ioctl(*devfd, UDMABUF_CREATE, &create);
+	if (*buf < 0)
+		error(75, 0, "%s: [FAIL, create udmabuf]\n", TEST_PREFIX);
+}
+
+int do_server(void)
+{
+	char ctrl_data[sizeof(int) * 20000];
+	struct netdev_queue_dmabuf *queues;
+	size_t non_page_aligned_frags = 0;
+	struct sockaddr_in client_addr;
+	struct sockaddr_in server_sin;
+	size_t page_aligned_frags = 0;
+	int devfd, memfd, buf, ret;
+	size_t total_received = 0;
+	socklen_t client_addr_len;
+	bool is_devmem = false;
+	char *buf_mem = NULL;
+	struct ynl_sock *ys;
+	size_t dmabuf_size;
+	char iobuf[819200];
+	char buffer[256];
+	int socket_fd;
+	int client_fd;
+	size_t i = 0;
+	int opt = 1;
+
+	dmabuf_size = getpagesize() * NUM_PAGES;
+
+	create_udmabuf(&devfd, &memfd, &buf, dmabuf_size);
+
+	reset_flow_steering();
+
+	/* Configure RSS to divert all traffic from our devmem queues */
+	configure_rss();
+
+	/* Flow steer our devmem flows to start_queue */
+	configure_flow_steering();
+
+	sleep(1);
+
+	queues = malloc(sizeof(*queues) * num_queues);
+
+	for (i = 0; i < num_queues; i++) {
+		queues[i]._present.type = 1;
+		queues[i]._present.idx = 1;
+		queues[i].type = NETDEV_QUEUE_TYPE_RX;
+		queues[i].idx = start_queue + i;
+	}
+
+	if (bind_rx_queue(ifindex, buf, queues, num_queues, &ys))
+		error(1, 0, "Failed to bind\n");
+
+	buf_mem = mmap(NULL, dmabuf_size, PROT_READ | PROT_WRITE, MAP_SHARED,
+		       buf, 0);
+	if (buf_mem == MAP_FAILED)
+		error(1, 0, "mmap()");
+
+	server_sin.sin_family = AF_INET;
+	server_sin.sin_port = htons(atoi(port));
+
+	ret = inet_pton(server_sin.sin_family, server_ip, &server_sin.sin_addr);
+	if (ret < 0)
+		error(79, 0, "%s: [FAIL, create socket]\n", TEST_PREFIX);
+
+	socket_fd = socket(server_sin.sin_family, SOCK_STREAM, 0);
+	if (socket_fd < 0)
+		error(errno, errno, "%s: [FAIL, create socket]\n", TEST_PREFIX);
+
+	ret = setsockopt(socket_fd, SOL_SOCKET, SO_REUSEPORT, &opt,
+			 sizeof(opt));
+	if (ret)
+		error(errno, errno, "%s: [FAIL, set sock opt]\n", TEST_PREFIX);
+
+	ret = setsockopt(socket_fd, SOL_SOCKET, SO_REUSEADDR, &opt,
+			 sizeof(opt));
+	if (ret)
+		error(errno, errno, "%s: [FAIL, set sock opt]\n", TEST_PREFIX);
+
+	printf("binding to address %s:%d\n", server_ip,
+	       ntohs(server_sin.sin_port));
+
+	ret = bind(socket_fd, &server_sin, sizeof(server_sin));
+	if (ret)
+		error(errno, errno, "%s: [FAIL, bind]\n", TEST_PREFIX);
+
+	ret = listen(socket_fd, 1);
+	if (ret)
+		error(errno, errno, "%s: [FAIL, listen]\n", TEST_PREFIX);
+
+	client_addr_len = sizeof(client_addr);
+
+	inet_ntop(server_sin.sin_family, &server_sin.sin_addr, buffer,
+		  sizeof(buffer));
+	printf("Waiting for connection on %s:%d\n", buffer,
+	       ntohs(server_sin.sin_port));
+	client_fd = accept(socket_fd, &client_addr, &client_addr_len);
+
+	inet_ntop(client_addr.sin_family, &client_addr.sin_addr, buffer,
+		  sizeof(buffer));
+	printf("Got connection from %s:%d\n", buffer,
+	       ntohs(client_addr.sin_port));
+
+	while (1) {
+		struct iovec iov = { .iov_base = iobuf,
+				     .iov_len = sizeof(iobuf) };
+		struct dmabuf_cmsg *dmabuf_cmsg = NULL;
+		struct dma_buf_sync sync = { 0 };
+		struct cmsghdr *cm = NULL;
+		struct msghdr msg = { 0 };
+		struct dmabuf_token token;
+		ssize_t ret;
+
+		is_devmem = false;
+		printf("\n\n");
+
+		msg.msg_iov = &iov;
+		msg.msg_iovlen = 1;
+		msg.msg_control = ctrl_data;
+		msg.msg_controllen = sizeof(ctrl_data);
+		ret = recvmsg(client_fd, &msg, MSG_SOCK_DEVMEM);
+		printf("recvmsg ret=%ld\n", ret);
+		if (ret < 0 && (errno == EAGAIN || errno == EWOULDBLOCK))
+			continue;
+		if (ret < 0) {
+			perror("recvmsg");
+			continue;
+		}
+		if (ret == 0) {
+			printf("client exited\n");
+			goto cleanup;
+		}
+
+		i++;
+		for (cm = CMSG_FIRSTHDR(&msg); cm; cm = CMSG_NXTHDR(&msg, cm)) {
+			if (cm->cmsg_level != SOL_SOCKET ||
+			    (cm->cmsg_type != SCM_DEVMEM_DMABUF &&
+			     cm->cmsg_type != SCM_DEVMEM_LINEAR)) {
+				fprintf(stdout, "skipping non-devmem cmsg\n");
+				continue;
+			}
+
+			dmabuf_cmsg = (struct dmabuf_cmsg *)CMSG_DATA(cm);
+			is_devmem = true;
+
+			if (cm->cmsg_type == SCM_DEVMEM_LINEAR) {
+				/* TODO: process data copied from skb's linear
+				 * buffer.
+				 */
+				fprintf(stdout,
+					"SCM_DEVMEM_LINEAR. dmabuf_cmsg->frag_size=%u\n",
+					dmabuf_cmsg->frag_size);
+
+				continue;
+			}
+
+			token.token_start = dmabuf_cmsg->frag_token;
+			token.token_count = 1;
+
+			total_received += dmabuf_cmsg->frag_size;
+			printf("received frag_page=%llu, in_page_offset=%llu, frag_offset=%llu, frag_size=%u, token=%u, total_received=%lu, dmabuf_id=%u\n",
+			       dmabuf_cmsg->frag_offset >> PAGE_SHIFT,
+			       dmabuf_cmsg->frag_offset % getpagesize(),
+			       dmabuf_cmsg->frag_offset, dmabuf_cmsg->frag_size,
+			       dmabuf_cmsg->frag_token, total_received,
+			       dmabuf_cmsg->dmabuf_id);
+
+			if (dmabuf_cmsg->dmabuf_id != dmabuf_id)
+				error(1, 0,
+				      "received on wrong dmabuf_id: flow steering error\n");
+
+			if (dmabuf_cmsg->frag_size % getpagesize())
+				non_page_aligned_frags++;
+			else
+				page_aligned_frags++;
+
+			sync.flags = DMA_BUF_SYNC_READ | DMA_BUF_SYNC_START;
+			ioctl(buf, DMA_BUF_IOCTL_SYNC, &sync);
+
+			if (do_validation)
+				validate_buffer(
+					((unsigned char *)buf_mem) +
+						dmabuf_cmsg->frag_offset,
+					dmabuf_cmsg->frag_size);
+			else
+				print_nonzero_bytes(
+					((unsigned char *)buf_mem) +
+						dmabuf_cmsg->frag_offset,
+					dmabuf_cmsg->frag_size);
+
+			sync.flags = DMA_BUF_SYNC_READ | DMA_BUF_SYNC_END;
+			ioctl(buf, DMA_BUF_IOCTL_SYNC, &sync);
+
+			ret = setsockopt(client_fd, SOL_SOCKET,
+					 SO_DEVMEM_DONTNEED, &token,
+					 sizeof(token));
+			if (ret != 1)
+				error(1, 0,
+				      "SO_DEVMEM_DONTNEED not enough tokens");
+		}
+		if (!is_devmem)
+			error(1, 0, "flow steering error\n");
+
+		printf("total_received=%lu\n", total_received);
+	}
+
+	fprintf(stdout, "%s: ok\n", TEST_PREFIX);
+
+	fprintf(stdout, "page_aligned_frags=%lu, non_page_aligned_frags=%lu\n",
+		page_aligned_frags, non_page_aligned_frags);
+
+	fprintf(stdout, "page_aligned_frags=%lu, non_page_aligned_frags=%lu\n",
+		page_aligned_frags, non_page_aligned_frags);
+
+cleanup:
+
+	munmap(buf_mem, dmabuf_size);
+	close(client_fd);
+	close(socket_fd);
+	close(buf);
+	close(memfd);
+	close(devfd);
+	ynl_sock_destroy(ys);
+
+	return 0;
+}
+
+void run_devmem_tests(void)
+{
+	struct netdev_queue_dmabuf *queues;
+	int devfd, memfd, buf;
+	struct ynl_sock *ys;
+	size_t dmabuf_size;
+	size_t i = 0;
+
+	dmabuf_size = getpagesize() * NUM_PAGES;
+
+	create_udmabuf(&devfd, &memfd, &buf, dmabuf_size);
+
+	/* Configure RSS to divert all traffic from our devmem queues */
+	configure_rss();
+
+	sleep(1);
+
+	queues = malloc(sizeof(*queues) * num_queues);
+
+	for (i = 0; i < num_queues; i++) {
+		queues[i]._present.type = 1;
+		queues[i]._present.idx = 1;
+		queues[i].type = NETDEV_QUEUE_TYPE_RX;
+		queues[i].idx = start_queue + i;
+	}
+
+	if (bind_rx_queue(ifindex, buf, queues, num_queues, &ys))
+		error(1, 0, "Failed to bind\n");
+
+	/* Closing the netlink socket does an implicit unbind */
+	ynl_sock_destroy(ys);
+}
+
+int main(int argc, char *argv[])
+{
+	int is_server = 0, opt;
+
+	while ((opt = getopt(argc, argv, "ls:c:p:v:q:t:f:n:i:d:")) != -1) {
+		switch (opt) {
+		case 'l':
+			is_server = 1;
+			break;
+		case 's':
+			server_ip = optarg;
+			break;
+		case 'c':
+			client_ip = optarg;
+			break;
+		case 'p':
+			port = optarg;
+			break;
+		case 'v':
+			do_validation = atoll(optarg);
+			break;
+		case 'q':
+			num_queues = atoi(optarg);
+			break;
+		case 't':
+			start_queue = atoi(optarg);
+			break;
+		case 'f':
+			ifname = optarg;
+			break;
+		case 'd':
+			ifindex = atoi(optarg);
+			break;
+		case 'n':
+			nic_pci_addr = optarg;
+			break;
+		case 'i':
+			iterations = atoll(optarg);
+			break;
+		case '?':
+			printf("unknown option: %c\n", optopt);
+			break;
+		}
+	}
+
+	for (; optind < argc; optind++)
+		printf("extra arguments: %s\n", argv[optind]);
+
+	run_devmem_tests();
+
+	if (is_server)
+		return do_server();
+
+	return 0;
+}
On 2024/3/5 10:01, Mina Almasry wrote:
...
Perf - page-pool benchmark:
bench_page_pool_simple.ko tests with and without these changes: https://pastebin.com/raw/ncHDwAbn
AFAIK the number that really matters in the perf tests is the 'tasklet_page_pool01_fast_path Per elem'. This one measures at about 8 cycles without the changes but there is some 1 cycle noise in some results.
With the patches this regresses to 9 cycles with the changes but there is 1 cycle noise occasionally running this test repeatedly.
Lastly I tried disable the static_branch_unlikely() in netmem_is_net_iov() check. To my surprise disabling the static_branch_unlikely() check reduces the fast path back to 8 cycles, but the 1 cycle noise remains.
The last sentence seems to suggest that the 1 ns regression above is caused by the static_branch_unlikely() check?
On Tue, Mar 5, 2024 at 4:54 AM Yunsheng Lin linyunsheng@huawei.com wrote:
The last sentence seems to suggest that the 1 ns regression above is caused by the static_branch_unlikely() check?
Note it's not a 1ns regression, it looks like maybe a 1 cycle regression (slightly less than 1ns if I'm reading the output of the test correctly):
# clean net-next time_bench: Type:tasklet_page_pool01_fast_path Per elem: 8 cycles(tsc) 2.993 ns (step:0)
# with patches time_bench: Type:tasklet_page_pool01_fast_path Per elem: 9 cycles(tsc) 3.679 ns (step:0)
# with patches and with diff that disables static branching: time_bench: Type:tasklet_page_pool01_fast_path Per elem: 8 cycles(tsc) 3.248 ns (step:0)
I do see noise in the test results from run to run, and any regression is slightly obscured by the noise, so it's a bit hard to make confident statements. So far it looks like a ~0.25ns regression without static branch and ~0.65ns with static branch.
Honestly when I saw all 3 results were within some noise I did not investigate more, but if this looks concerning to you I can dig further. I likely need to gather a few test runs to filter out the noise and maybe investigate the assembly my compiler is generating to maybe narrow down what changes there.
On 2024/3/6 3:38, Mina Almasry wrote:
Honestly when I saw all 3 results were within some noise I did not investigate more, but if this looks concerning to you I can dig further. I likely need to gather a few test runs to filter out the noise and maybe investigate the assembly my compiler is generating to maybe narrow down what changes there.
Yes, that is confusing enough that it needs more investigation.
On Tue, Mar 5, 2024 at 11:38 AM Mina Almasry almasrymina@google.com wrote:
Honestly when I saw all 3 results were within some noise I did not investigate more, but if this looks concerning to you I can dig further. I likely need to gather a few test runs to filter out the noise and maybe investigate the assembly my compiler is generating to maybe narrow down what changes there.
I did some more investigation here to gather more data to filter out the noise, and recorded the summary here:
https://pastebin.com/raw/v5dYRg8L
Long story short, the page_pool benchmark results are consistent with some outlier noise results that I'm discounting here. Currently page_pool fast path is at 8 cycles
[ 2115.724510] time_bench: Type:tasklet_page_pool01_fast_path Per elem: 8 cycles(tsc) 3.187 ns (step:0) - (measurement period time:0.031870585 sec time_interval:31870585) - (invoke count:10000000 tsc_interval:86043192)
and with this patch series it degrades to 10 cycles, or about a 0.7ns degradation or so:
[ 498.226127] time_bench: Type:tasklet_page_pool01_fast_path Per elem: 10 cycles(tsc) 3.944 ns (step:0) - (measurement period time:0.039442539 sec time_interval:39442539) - (invoke count:10000000 tsc_interval:106485268)
I took the time to dig into where the degradation comes from, and to my surprise we can shave off 1 cycle in perf by removing the static_branch_unlikely check in netmem_is_net_iov() like so:
diff --git a/include/net/netmem.h b/include/net/netmem.h
index fe354d11a421..2b4310ac1115 100644
--- a/include/net/netmem.h
+++ b/include/net/netmem.h
@@ -122,8 +122,7 @@ typedef unsigned long __bitwise netmem_ref;
 static inline bool netmem_is_net_iov(const netmem_ref netmem)
 {
 #ifdef CONFIG_PAGE_POOL
-	return static_branch_unlikely(&page_pool_mem_providers) &&
-	       (__force unsigned long)netmem & NET_IOV;
+	return (__force unsigned long)netmem & NET_IOV;
 #else
 	return false;
 #endif
With this change, the fast path is 9 cycles, only a 1 cycle (~0.35ns) regression:
[ 199.184429] time_bench: Type:tasklet_page_pool01_fast_path Per elem: 9 cycles(tsc) 3.552 ns (step:0) - (measurement period time:0.035524013 sec time_interval:35524013) - (invoke count:10000000 tsc_interval:95907775)
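As a cross-check on the three result lines above, the per-element figures can be recomputed from the raw counters that time_bench prints. The sketch below assumes time_interval is in nanoseconds (it matches the "measurement period" in seconds) and that the reported cycle count is the truncated integer quotient, which is what the numbers suggest. The struct and function names are made up for this example.

```c
#include <stdint.h>

/* Raw counters from one time_bench result line. */
struct bench_sample {
	uint64_t invoke_count;  /* "invoke count" */
	uint64_t tsc_interval;  /* "tsc_interval", in TSC cycles */
	uint64_t time_interval; /* "time_interval", assumed nanoseconds */
};

/* Unrounded cycles per element. */
static double cycles_per_elem(const struct bench_sample *s)
{
	return (double)s->tsc_interval / (double)s->invoke_count;
}

/* Nanoseconds per element. */
static double ns_per_elem(const struct bench_sample *s)
{
	return (double)s->time_interval / (double)s->invoke_count;
}

/* Implied TSC rate in cycles per nanosecond, i.e. GHz. */
static double tsc_ghz(const struct bench_sample *s)
{
	return (double)s->tsc_interval / (double)s->time_interval;
}

/* Relative change of b over a, in percent. */
static double pct_change(double a, double b)
{
	return (b - a) / a * 100.0;
}

/* The three samples quoted above. */
static const struct bench_sample baseline = {
	10000000, 86043192, 31870585
};
static const struct bench_sample with_series = {
	10000000, 106485268, 39442539
};
static const struct bench_sample no_static_branch = {
	10000000, 95907775, 35524013
};
```

Worked through by hand, the unrounded figures are roughly 8.60, 10.65 and 9.59 cycles per element, so the series costs about 2.0 cycles (around 24%) and the variant without the static branch about 1.0 cycle (around 11.5%). The implied TSC rate is about 2.7 GHz, which is why one cycle is a bit under 0.4ns.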
I did some digging with YiFei on why the static_branch_unlikely appears to be causing a 1 cycle regression, but could not get an answer that makes sense. The # of instructions in page_pool_return_page() with the static_branch_unlikely and without is about the same in the compiled .o file, and my understanding is that static_branch will cause code re-writing anyway so looking at the compiled code may not be representative.
Worth noting is that I get ~95% line rate of devmem TCP with or without the static_branch_unlikely(), so the impact of the static_branch is not large enough to be measurable end-to-end. I'm thinking I want to drop the static_branch_unlikely() in the next RFC, since it doesn't improve the end-to-end throughput number and dropping it results in a measurable improvement in the page pool benchmark.
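For readers unfamiliar with the check being discussed: once the static branch is gone, what remains in netmem_is_net_iov() is a single bit test on the reference value. The userspace sketch below shows the general low-bit tagging idea. It assumes the tag is bit 0 and that the tagged objects are at least 2-byte aligned; all `sketch_` names are hypothetical and are not the kernel's types.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumption for this sketch: the tag occupies bit 0 of the reference. */
#define SKETCH_NET_IOV ((uintptr_t)1)

/* An opaque reference that is either a page pointer or a tagged iov pointer. */
typedef uintptr_t sketch_netmem_ref;

struct sketch_page { int refcount; };
struct sketch_net_iov { int refcount; };

/* Pages are stored untagged. */
static inline sketch_netmem_ref page_to_ref(struct sketch_page *p)
{
	return (sketch_netmem_ref)p;
}

/* Non-page memory is stored with the low bit set. */
static inline sketch_netmem_ref net_iov_to_ref(struct sketch_net_iov *niov)
{
	return (sketch_netmem_ref)niov | SKETCH_NET_IOV;
}

/* The whole check is one AND on the reference value. */
static inline bool ref_is_net_iov(sketch_netmem_ref ref)
{
	return (ref & SKETCH_NET_IOV) != 0;
}

/* Clear the tag bit to recover the original pointer. */
static inline struct sketch_net_iov *ref_to_net_iov(sketch_netmem_ref ref)
{
	return (struct sketch_net_iov *)(ref & ~SKETCH_NET_IOV);
}
```

The point of the sketch is only that the remaining test is cheap and needs no memory access beyond the reference itself; it says nothing about why the static branch version measured slower.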
On 2024/3/26 8:28, Mina Almasry wrote:
Long story short, the page_pool benchmark results are consistent with some outlier noise results that I'm discounting here. Currently page_pool fast path is at 8 cycles
[ 2115.724510] time_bench: Type:tasklet_page_pool01_fast_path Per elem: 8 cycles(tsc) 3.187 ns (step:0) - (measurement period time:0.031870585 sec time_interval:31870585) - (invoke count:10000000 tsc_interval:86043192)
and with this patch series it degrades to 10 cycles, or about a 0.7ns degradation or so:
Even if the absolute value of the overhead is small, we seem to have a degradation of about 20% for the tasklet_page_pool01_fast_path testcase, which seems scary.
I am assuming that every page is recyclable in the tasklet_page_pool01_fast_path testcase. That code path matters for page_pool, so it would be good to remove any additional checking from it.
We already have the pool->has_init_callback check for when we have to use a new page; it may make sense to refactor that so the provider shares the same check, to avoid the overhead as much as possible.
Also, I am not sure it really matters that much: with netmem_is_net_iov() checks spreading across the networking code, the overhead might add up in other cases too.
[ 498.226127] time_bench: Type:tasklet_page_pool01_fast_path Per elem: 10 cycles(tsc) 3.944 ns (step:0) - (measurement period time:0.039442539 sec time_interval:39442539) - (invoke count:10000000 tsc_interval:106485268)
I took the time to dig into where the degradation comes from, and to my surprise we can shave off 1 cycle in perf by removing the static_branch_unlikely check in netmem_is_net_iov() like so:
diff --git a/include/net/netmem.h b/include/net/netmem.h
index fe354d11a421..2b4310ac1115 100644
--- a/include/net/netmem.h
+++ b/include/net/netmem.h
@@ -122,8 +122,7 @@ typedef unsigned long __bitwise netmem_ref;
 static inline bool netmem_is_net_iov(const netmem_ref netmem)
 {
 #ifdef CONFIG_PAGE_POOL
-	return static_branch_unlikely(&page_pool_mem_providers) &&
-	       (__force unsigned long)netmem & NET_IOV;
+	return (__force unsigned long)netmem & NET_IOV;
 #else
 	return false;
 #endif
With this change, the fast path is 9 cycles, only a 1 cycle (~0.35ns) regression:
[ 199.184429] time_bench: Type:tasklet_page_pool01_fast_path Per elem: 9 cycles(tsc) 3.552 ns (step:0) - (measurement period time:0.035524013 sec time_interval:35524013) - (invoke count:10000000 tsc_interval:95907775)
I did some digging with YiFei on why the static_branch_unlikely appears to be causing a 1 cycle regression, but could not get an answer that makes sense. The # of instructions in page_pool_return_page() with the static_branch_unlikely and without is about the same in the compiled .o file, and my understanding is that static_branch will cause code re-writing anyway so looking at the compiled code may not be representative.
It's worth noting that I get ~95% of line rate with devmem TCP with or without the static_branch_unlikely(), so the impact of the static_branch is not large enough to be measurable end-to-end. I'm thinking of dropping the static_branch_unlikely() in the next RFC, since it doesn't improve the end-to-end throughput number and removing it gives a measurable improvement in the page pool benchmark.
On Tue, Mar 26, 2024 at 5:47 AM Yunsheng Lin linyunsheng@huawei.com wrote:
On 2024/3/26 8:28, Mina Almasry wrote:
On Tue, Mar 5, 2024 at 11:38 AM Mina Almasry almasrymina@google.com wrote:
On Tue, Mar 5, 2024 at 4:54 AM Yunsheng Lin linyunsheng@huawei.com wrote:
On 2024/3/5 10:01, Mina Almasry wrote:
...
Perf - page-pool benchmark:
bench_page_pool_simple.ko tests with and without these changes: https://pastebin.com/raw/ncHDwAbn
AFAIK the number that really matters in the perf tests is the 'tasklet_page_pool01_fast_path Per elem'. This one measures at about 8 cycles without the changes but there is some 1 cycle noise in some results.
With the patches this regresses to 9 cycles with the changes but there is 1 cycle noise occasionally running this test repeatedly.
Lastly, I tried disabling the static_branch_unlikely() check in netmem_is_net_iov(). To my surprise, disabling the static_branch_unlikely() check brings the fast path back to 8 cycles, but the 1 cycle noise remains.
The last sentence seems to suggest that the 1 ns regression above is caused by the static_branch_unlikely() check?
Note it's not a 1ns regression, it looks like maybe a 1 cycle regression (slightly less than 1ns if I'm reading the output of the test correctly):
# clean net-next
time_bench: Type:tasklet_page_pool01_fast_path Per elem: 8 cycles(tsc) 2.993 ns (step:0)
# with patches
time_bench: Type:tasklet_page_pool01_fast_path Per elem: 9 cycles(tsc) 3.679 ns (step:0)
# with patches and with diff that disables static branching:
time_bench: Type:tasklet_page_pool01_fast_path Per elem: 8 cycles(tsc) 3.248 ns (step:0)
Even if the absolute value of the overhead is small, we seem to have a degradation of about 20% for the tasklet_page_pool01_fast_path testcase, which seems scary.
I am assuming that every page is recyclable in the tasklet_page_pool01_fast_path testcase, and that code path matters for page_pool, so it would be good to remove any additional checking from that code path.
We can remove the usage of static_branch_unlikely in the net_iov check, which reduces the overhead to 1 cycle (8->9), only 12.5% overhead. The static_branch_unlikely is not improving the performance of devmem TCP anyway. In previous discussions Jesper deemed a 1 cycle degradation acceptable. He hasn't commented in a while and may have changed his mind, but so far there have been no complaints.
We can additionally compile in the check only if CONFIG_DMA_SHARED_BUFFER is enabled. I've tested that and the fast path goes back to 8 cycles (0 overhead). If CONFIG_DMA_SHARED_BUFFER is not enabled then netmem can't be dmabuf anyway, so there is no reason to check.