* TL;DR:
Device memory TCP (devmem TCP) is a proposal for transferring data to and/or from device memory efficiently, without bouncing the data to a host memory buffer.
* Problem:
A large amount of data transfers have device memory as the source and/or destination. Accelerators drastically increased the volume of such transfers. Some examples include: - ML accelerators transferring large amounts of training data from storage into GPU/TPU memory. In some cases ML training setup time can be as long as 50% of TPU compute time, improving data transfer throughput & efficiency can help improving GPU/TPU utilization.
- Distributed training, where ML accelerators, such as GPUs on different hosts, exchange data among them.
- Distributed raw block storage applications transfer large amounts of data with remote SSDs, much of this data does not require host processing.
Today, the majority of the Device-to-Device data transfers the network are implemented as the following low level operations: Device-to-Host copy, Host-to-Host network transfer, and Host-to-Device copy.
The implementation is suboptimal, especially for bulk data transfers, and can put significant strains on system resources, such as host memory bandwidth, PCIe bandwidth, etc. One important reason behind the current state is the kernel’s lack of semantics to express device to network transfers.
* Proposal:
In this patch series we attempt to optimize this use case by implementing socket APIs that enable the user to:
1. send device memory across the network directly, and 2. receive incoming network packets directly into device memory.
Packet _payloads_ go directly from the NIC to device memory for receive and from device memory to NIC for transmit. Packet _headers_ go to/from host memory and are processed by the TCP/IP stack normally. The NIC _must_ support header split to achieve this.
Advantages:
- Alleviate host memory bandwidth pressure, compared to existing network-transfer + device-copy semantics.
- Alleviate PCIe BW pressure, by limiting data transfer to the lowest level of the PCIe tree, compared to traditional path which sends data through the root complex.
With this proposal we're able to reach ~96.6% line rate speeds with data sent and received directly from/to device memory.
* Patch overview:
** Part 1: struct paged device memory
Currently the standard for device memory sharing is DMABUF, which doesn't generate struct pages. On the other hand, networking stack (skbs, drivers, and page pool) operate on pages. We have 2 options:
1. Generate struct pages for dmabuf device memory, or, 2. Modify the networking stack to understand a new memory type.
This proposal implements option #1. We implement a small framework to generate struct pages for an sg_table returned from dma_buf_map_attachment(). The support added here should be generic and easily extended to other use cases interested in struct paged device memory. We use this framework to generate pages that can be used in the networking stack.
** Part 2: recvmsg() & sendmsg() APIs
We define user APIs for the user to send and receive these dmabuf pages.
** part 3: support for unreadable skb frags
Dmabuf pages are not accessible by the host; we implement changes throughput the networking stack to correctly handle skbs with unreadable frags.
** part 4: page pool support
We piggy back on Jakub's page pool memory providers idea: https://github.com/kuba-moo/linux/tree/pp-providers
It allows the page pool to define a memory provider that provides the page allocation and freeing. It helps abstract most of the device memory TCP changes from the driver.
This is not strictly necessary, the driver can choose to allocate dmabuf pages and use them directly without going through the page pool (if acceptable to their maintainers).
Not included with this RFC is the GVE devmem TCP support, just to simplify the review. Code available here if desired: https://github.com/mina/linux/tree/tcpdevmem
This RFC is built on top of v6.4-rc7 with Jakub's pp-providers changes cherry-picked.
* NIC dependencies:
1. (strict) Devmem TCP require the NIC to support header split, i.e. the capability to split incoming packets into a header + payload and to put each into a separate buffer. Devmem TCP works by using dmabuf pages for the packet payload, and host memory for the packet headers.
2. (optional) Devmem TCP works better with flow steering support & RSS support, i.e. the NIC's ability to steer flows into certain rx queues. This allows the sysadmin to enable devmem TCP on a subset of the rx queues, and steer devmem TCP traffic onto these queues and non devmem TCP elsewhere.
The NIC I have access to with these properties is the GVE with DQO support running in Google Cloud, but any NIC that supports these features would suffice. I may be able to help reviewers bring up devmem TCP on their NICs.
* Testing:
The series includes a udmabuf kselftest that show a simple use case of devmem TCP and validates the entire data path end to end without a dependency on a specific dmabuf provider.
Not included in this series is our devmem TCP benchmark, which transfers data to/from GPU dmabufs directly.
With this implementation & benchmark we're able to reach ~96.6% line rate speeds with 4 GPU/NIC pairs running bi-direction traffic, with all the packet payloads going straight to the GPU memory (no host buffer bounce).
** Test Setup
Kernel: v6.4-rc7, with this RFC and Jakub's memory provider API cherry-picked locally.
Hardware: Google Cloud A3 VMs.
NIC: GVE with header split & RSS & flow steering support.
Benchmark: custom devmem TCP benchmark not yet open sourced.
Mina Almasry (10): dma-buf: add support for paged attachment mappings dma-buf: add support for NET_RX pages dma-buf: add support for NET_TX pages net: add support for skbs with unreadable frags tcp: implement recvmsg() RX path for devmem TCP net: add SO_DEVMEM_DONTNEED setsockopt to release RX pages tcp: implement sendmsg() TX path for for devmem tcp selftests: add ncdevmem, netcat for devmem TCP memory-provider: updates core provider API for devmem TCP memory-provider: add dmabuf devmem provider
drivers/dma-buf/dma-buf.c | 444 ++++++++++++++++ include/linux/dma-buf.h | 142 +++++ include/linux/netdevice.h | 1 + include/linux/skbuff.h | 34 +- include/linux/socket.h | 1 + include/net/page_pool.h | 21 + include/net/sock.h | 4 + include/net/tcp.h | 6 +- include/uapi/asm-generic/socket.h | 6 + include/uapi/linux/dma-buf.h | 12 + include/uapi/linux/uio.h | 10 + net/core/datagram.c | 3 + net/core/page_pool.c | 111 +++- net/core/skbuff.c | 81 ++- net/core/sock.c | 47 ++ net/ipv4/tcp.c | 262 +++++++++- net/ipv4/tcp_input.c | 13 +- net/ipv4/tcp_ipv4.c | 8 + net/ipv4/tcp_output.c | 5 +- net/packet/af_packet.c | 4 +- tools/testing/selftests/net/.gitignore | 1 + tools/testing/selftests/net/Makefile | 1 + tools/testing/selftests/net/ncdevmem.c | 693 +++++++++++++++++++++++++ 23 files changed, 1868 insertions(+), 42 deletions(-) create mode 100644 tools/testing/selftests/net/ncdevmem.c
Currently dmabuf p2p memory doesn't present itself in the form of struct pages and the memory can't be easily used with code that expects memory in that form. Add support for paged attachment mappings. We use existing dmabuf APIs to create a mapped attachment (dma_buf_attach() & dma_buf_map_attachment()), and we create struct pages for this mapping. We write the dma_addr's from the sg_table into the created pages. These pages can then be passed into code that expects struct pages and can largely operate on these pages with minimal modifications:
1. The pages need not be dma mapped. The dma addr can be queried from page->zone_device_data and used directly. 2. The pages are not kmappable.
Add a new ioctl that enables the user to create a struct page backed dmabuf attachment mapping. This ioctl returns a new file descriptor which represents the dmabuf pages. The pages are freed when (a) the user closes the file, and (b) the struct pages backing the dmabuf are no longer in use. Once the pages are no longer in use, the mapped attachment is removed.
The support added in this patch should be generic - the pages are created by the base code, but the user specifies the type of page to create using the dmabuf_create_pages_info->type flag. The base code hands of any handling specific to the use case of the ops of that page type.
Signed-off-by: Mina Almasry almasrymina@google.com --- drivers/dma-buf/dma-buf.c | 223 +++++++++++++++++++++++++++++++++++ include/linux/dma-buf.h | 90 ++++++++++++++ include/uapi/linux/dma-buf.h | 9 ++ 3 files changed, 322 insertions(+)
diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c index aa4ea8530cb3..50b1d813cf5c 100644 --- a/drivers/dma-buf/dma-buf.c +++ b/drivers/dma-buf/dma-buf.c @@ -22,6 +22,7 @@ #include <linux/module.h> #include <linux/seq_file.h> #include <linux/sync_file.h> +#include <linux/pci.h> #include <linux/poll.h> #include <linux/dma-resv.h> #include <linux/mm.h> @@ -442,12 +443,16 @@ static long dma_buf_import_sync_file(struct dma_buf *dmabuf, } #endif
+static long dma_buf_create_pages(struct file *file, + struct dma_buf_create_pages_info *create_info); + static long dma_buf_ioctl(struct file *file, unsigned int cmd, unsigned long arg) { struct dma_buf *dmabuf; struct dma_buf_sync sync; enum dma_data_direction direction; + struct dma_buf_create_pages_info create_info; int ret;
dmabuf = file->private_data; @@ -484,6 +489,12 @@ static long dma_buf_ioctl(struct file *file, case DMA_BUF_SET_NAME_A: case DMA_BUF_SET_NAME_B: return dma_buf_set_name(dmabuf, (const char __user *)arg); + case DMA_BUF_CREATE_PAGES: + if (copy_from_user(&create_info, (void __user *)arg, + sizeof(create_info))) + return -EFAULT; + + return dma_buf_create_pages(file, &create_info);
#if IS_ENABLED(CONFIG_SYNC_FILE) case DMA_BUF_IOCTL_EXPORT_SYNC_FILE: @@ -1613,6 +1624,218 @@ void dma_buf_vunmap_unlocked(struct dma_buf *dmabuf, struct iosys_map *map) } EXPORT_SYMBOL_NS_GPL(dma_buf_vunmap_unlocked, DMA_BUF);
+static int dma_buf_pages_release(struct inode *inode, struct file *file) +{ + struct dma_buf_pages *priv = file->private_data; + + if (priv->type_ops->dma_buf_pages_release) + priv->type_ops->dma_buf_pages_release(priv, file); + + percpu_ref_kill(&priv->pgmap.ref); + /* Drop initial ref after percpu_ref_kill(). */ + percpu_ref_put(&priv->pgmap.ref); + + return 0; +} + +static void dma_buf_page_free(struct page *page) +{ + struct dma_buf_pages *priv; + struct dev_pagemap *pgmap; + + pgmap = page->pgmap; + priv = container_of(pgmap, struct dma_buf_pages, pgmap); + + if (priv->type_ops->dma_buf_page_free) + priv->type_ops->dma_buf_page_free(priv, page); +} + +const struct dev_pagemap_ops dma_buf_pgmap_ops = { + .page_free = dma_buf_page_free, +}; +EXPORT_SYMBOL_GPL(dma_buf_pgmap_ops); + +const struct file_operations dma_buf_pages_fops = { + .release = dma_buf_pages_release, +}; +EXPORT_SYMBOL_GPL(dma_buf_pages_fops); + +#ifdef CONFIG_ZONE_DEVICE +static void dma_buf_pages_destroy(struct percpu_ref *ref) +{ + struct dma_buf_pages *priv; + struct dev_pagemap *pgmap; + + pgmap = container_of(ref, struct dev_pagemap, ref); + priv = container_of(pgmap, struct dma_buf_pages, pgmap); + + if (priv->type_ops->dma_buf_pages_destroy) + priv->type_ops->dma_buf_pages_destroy(priv); + + kvfree(priv->pages); + kfree(priv); + + dma_buf_unmap_attachment(priv->attachment, priv->sgt, priv->direction); + dma_buf_detach(priv->dmabuf, priv->attachment); + dma_buf_put(priv->dmabuf); + pci_dev_put(priv->pci_dev); +} + +static long dma_buf_create_pages(struct file *file, + struct dma_buf_create_pages_info *create_info) +{ + int err, fd, i, pg_idx; + struct scatterlist *sg; + struct dma_buf_pages *priv; + struct file *new_file; + + fd = get_unused_fd_flags(O_RDWR | O_CLOEXEC); + if (fd < 0) { + err = fd; + goto out_err; + } + + priv = kzalloc(sizeof(*priv), GFP_KERNEL); + if (!priv) { + err = -ENOMEM; + goto out_put_fd; + } + + priv->pgmap.type = MEMORY_DEVICE_PRIVATE; + priv->pgmap.ops = &dma_buf_pgmap_ops; + init_completion(&priv->pgmap.done); + + /* This refcount is incremented every time a page in priv->pages is + * allocated, and decremented every time a page is freed. When + * it drops to 0, the dma_buf_pages can be destroyed. An initial ref is + * held and the dma_buf_pages is not destroyed until that is dropped. + */ + err = percpu_ref_init(&priv->pgmap.ref, dma_buf_pages_destroy, 0, + GFP_KERNEL); + if (err) + goto out_free_priv; + + /* Initial ref to be dropped after percpu_ref_kill(). */ + percpu_ref_get(&priv->pgmap.ref); + + priv->pci_dev = pci_get_domain_bus_and_slot( + 0, create_info->pci_bdf[0], + PCI_DEVFN(create_info->pci_bdf[1], create_info->pci_bdf[2])); + if (!priv->pci_dev) { + err = -ENODEV; + goto out_exit_percpu_ref; + } + + priv->dmabuf = dma_buf_get(create_info->dma_buf_fd); + if (IS_ERR(priv->dmabuf)) { + err = PTR_ERR(priv->dmabuf); + goto out_put_pci_dev; + } + + if (priv->dmabuf->size % PAGE_SIZE != 0) { + err = -EINVAL; + goto out_put_dma_buf; + } + + priv->attachment = dma_buf_attach(priv->dmabuf, &priv->pci_dev->dev); + if (IS_ERR(priv->attachment)) { + err = PTR_ERR(priv->attachment); + goto out_put_dma_buf; + } + + priv->num_pages = priv->dmabuf->size / PAGE_SIZE; + priv->pages = kvmalloc_array(priv->num_pages, sizeof(struct page), + GFP_KERNEL); + if (!priv->pages) { + err = -ENOMEM; + goto out_detach_dma_buf; + } + + for (i = 0; i < priv->num_pages; i++) { + struct page *page = &priv->pages[i]; + + mm_zero_struct_page(page); + set_page_zone(page, ZONE_DEVICE); + set_page_count(page, 1); + page->pgmap = &priv->pgmap; + } + + priv->direction = DMA_BIDIRECTIONAL; + priv->sgt = dma_buf_map_attachment(priv->attachment, priv->direction); + if (IS_ERR(priv->sgt)) { + err = PTR_ERR(priv->sgt); + goto out_free_pages; + } + + /* write each dma addresses from sgt to each page */ + pg_idx = 0; + for_each_sgtable_dma_sg(priv->sgt, sg, i) { + size_t len = sg_dma_len(sg); + dma_addr_t dma_addr = sg_dma_address(sg); + + BUG_ON(!PAGE_ALIGNED(len)); + while (len > 0) { + priv->pages[pg_idx].zone_device_data = (void *)dma_addr; + pg_idx++; + dma_addr += PAGE_SIZE; + len -= PAGE_SIZE; + } + } + + new_file = anon_inode_getfile("[dma_buf_pages]", &dma_buf_pages_fops, + (void *)priv, O_RDWR | O_CLOEXEC); + if (IS_ERR(new_file)) { + err = PTR_ERR(new_file); + goto out_unmap_dma_buf; + } + + priv->type = create_info->type; + priv->create_flags = create_info->create_flags; + + switch (priv->type) { + default: + err = -EINVAL; + goto out_put_new_file; + } + + if (priv->type_ops->dma_buf_pages_init) { + err = priv->type_ops->dma_buf_pages_init(priv, new_file); + if (err) + goto out_put_new_file; + } + + fd_install(fd, new_file); + return fd; + +out_put_new_file: + fput(new_file); +out_unmap_dma_buf: + dma_buf_unmap_attachment(priv->attachment, priv->sgt, priv->direction); +out_free_pages: + kvfree(priv->pages); +out_detach_dma_buf: + dma_buf_detach(priv->dmabuf, priv->attachment); +out_put_dma_buf: + dma_buf_put(priv->dmabuf); +out_put_pci_dev: + pci_dev_put(priv->pci_dev); +out_exit_percpu_ref: + percpu_ref_exit(&priv->pgmap.ref); +out_free_priv: + kfree(priv); +out_put_fd: + put_unused_fd(fd); +out_err: + return err; +} +#else +static long dma_buf_create_pages(struct file *file, + struct dma_buf_create_pages_info *create_info) +{ + return -ENOTSUPP; +} +#endif + #ifdef CONFIG_DEBUG_FS static int dma_buf_debug_show(struct seq_file *s, void *unused) { diff --git a/include/linux/dma-buf.h b/include/linux/dma-buf.h index 3f31baa3293f..5789006180ea 100644 --- a/include/linux/dma-buf.h +++ b/include/linux/dma-buf.h @@ -540,6 +540,36 @@ struct dma_buf_export_info { void *priv; };
+struct dma_buf_pages; + +struct dma_buf_pages_type_ops { + int (*dma_buf_pages_init)(struct dma_buf_pages *priv, + struct file *file); + void (*dma_buf_pages_release)(struct dma_buf_pages *priv, + struct file *file); + void (*dma_buf_pages_destroy)(struct dma_buf_pages *priv); + void (*dma_buf_page_free)(struct dma_buf_pages *priv, + struct page *page); +}; + +struct dma_buf_pages { + /* fields for dmabuf */ + struct dma_buf *dmabuf; + struct dma_buf_attachment *attachment; + struct sg_table *sgt; + struct pci_dev *pci_dev; + enum dma_data_direction direction; + + /* fields for dma-buf pages */ + size_t num_pages; + struct page *pages; + struct dev_pagemap pgmap; + + unsigned int type; + const struct dma_buf_pages_type_ops *type_ops; + __u64 create_flags; +}; + /** * DEFINE_DMA_BUF_EXPORT_INFO - helper macro for exporters * @name: export-info name @@ -631,4 +661,64 @@ int dma_buf_vmap(struct dma_buf *dmabuf, struct iosys_map *map); void dma_buf_vunmap(struct dma_buf *dmabuf, struct iosys_map *map); int dma_buf_vmap_unlocked(struct dma_buf *dmabuf, struct iosys_map *map); void dma_buf_vunmap_unlocked(struct dma_buf *dmabuf, struct iosys_map *map); + +#ifdef CONFIG_DMA_SHARED_BUFFER +extern const struct file_operations dma_buf_pages_fops; +extern const struct dev_pagemap_ops dma_buf_pgmap_ops; + +static inline bool is_dma_buf_pages_file(struct file *file) +{ + return file->f_op == &dma_buf_pages_fops; +} + +static inline bool is_dma_buf_page(struct page *page) +{ + return (is_zone_device_page(page) && page->pgmap && + page->pgmap->ops == &dma_buf_pgmap_ops); +} + +static inline dma_addr_t dma_buf_page_to_dma_addr(struct page *page) +{ + return (dma_addr_t)page->zone_device_data; +} + +static inline int dma_buf_map_sg(struct device *dev, struct scatterlist *sg, + int nents, enum dma_data_direction dir) +{ + struct scatterlist *s; + int i; + + for_each_sg(sg, s, nents, i) { + struct page *pg = sg_page(s); + + s->dma_address = dma_buf_page_to_dma_addr(pg); + sg_dma_len(s) = s->length; + } + + return nents; +} +#else +static inline bool is_dma_buf_page(struct page *page) +{ + return false; +} + +static inline bool is_dma_buf_pages_file(struct file *file) +{ + return false; +} + +static inline dma_addr_t dma_buf_page_to_dma_addr(struct page *page) +{ + return 0; +} + +static inline int dma_buf_map_sg(struct device *dev, struct scatterlist *sg, + int nents, enum dma_data_direction dir) +{ + return 0; +} +#endif + + #endif /* __DMA_BUF_H__ */ diff --git a/include/uapi/linux/dma-buf.h b/include/uapi/linux/dma-buf.h index 5a6fda66d9ad..d0f63a2ab7e4 100644 --- a/include/uapi/linux/dma-buf.h +++ b/include/uapi/linux/dma-buf.h @@ -179,4 +179,13 @@ struct dma_buf_import_sync_file { #define DMA_BUF_IOCTL_EXPORT_SYNC_FILE _IOWR(DMA_BUF_BASE, 2, struct dma_buf_export_sync_file) #define DMA_BUF_IOCTL_IMPORT_SYNC_FILE _IOW(DMA_BUF_BASE, 3, struct dma_buf_import_sync_file)
+struct dma_buf_create_pages_info { + __u8 pci_bdf[3]; + __s32 dma_buf_fd; + __u32 type; + __u64 create_flags; +}; + +#define DMA_BUF_CREATE_PAGES _IOW(DMA_BUF_BASE, 4, struct dma_buf_create_pages_info) + #endif
Am 11.07.23 um 00:32 schrieb Mina Almasry:
Currently dmabuf p2p memory doesn't present itself in the form of struct pages and the memory can't be easily used with code that expects memory in that form.
Well, this won't fly at all.
First of all DMA-buf is *not* about passing struct pages around, drivers are *only* supposed to use the DMA addresses.
That code can't use the pages inside a DMA-buf is absolutely intentional. We even mangle the page pointers from the sg_tables because people sometimes get the impression they can use those.
See function mangle_sg_table() in dma-buf.c
/* To catch abuse of the underlying struct page by importers mix * up the bits, but take care to preserve the low SG_ bits to * not corrupt the sgt. The mixing is undone in __unmap_dma_buf * before passing the sgt back to the exporter. */ for_each_sgtable_sg(sg_table, sg, i) sg->page_link ^= ~0xffUL;
Add support for paged attachment mappings. We use existing dmabuf APIs to create a mapped attachment (dma_buf_attach() & dma_buf_map_attachment()), and we create struct pages for this mapping. We write the dma_addr's from the sg_table into the created pages.
Hui, what? Not really.
Regards, Christian.
These pages can then be passed into code that expects struct pages and can largely operate on these pages with minimal modifications:
- The pages need not be dma mapped. The dma addr can be queried from page->zone_device_data and used directly.
- The pages are not kmappable.
Add a new ioctl that enables the user to create a struct page backed dmabuf attachment mapping. This ioctl returns a new file descriptor which represents the dmabuf pages. The pages are freed when (a) the user closes the file, and (b) the struct pages backing the dmabuf are no longer in use. Once the pages are no longer in use, the mapped attachment is removed.
The support added in this patch should be generic - the pages are created by the base code, but the user specifies the type of page to create using the dmabuf_create_pages_info->type flag. The base code hands of any handling specific to the use case of the ops of that page type.
Signed-off-by: Mina Almasry almasrymina@google.com
drivers/dma-buf/dma-buf.c | 223 +++++++++++++++++++++++++++++++++++ include/linux/dma-buf.h | 90 ++++++++++++++ include/uapi/linux/dma-buf.h | 9 ++ 3 files changed, 322 insertions(+)
diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c index aa4ea8530cb3..50b1d813cf5c 100644 --- a/drivers/dma-buf/dma-buf.c +++ b/drivers/dma-buf/dma-buf.c @@ -22,6 +22,7 @@ #include <linux/module.h> #include <linux/seq_file.h> #include <linux/sync_file.h> +#include <linux/pci.h> #include <linux/poll.h> #include <linux/dma-resv.h> #include <linux/mm.h> @@ -442,12 +443,16 @@ static long dma_buf_import_sync_file(struct dma_buf *dmabuf, } #endif +static long dma_buf_create_pages(struct file *file,
struct dma_buf_create_pages_info *create_info);
- static long dma_buf_ioctl(struct file *file, unsigned int cmd, unsigned long arg) { struct dma_buf *dmabuf; struct dma_buf_sync sync; enum dma_data_direction direction;
- struct dma_buf_create_pages_info create_info; int ret;
dmabuf = file->private_data; @@ -484,6 +489,12 @@ static long dma_buf_ioctl(struct file *file, case DMA_BUF_SET_NAME_A: case DMA_BUF_SET_NAME_B: return dma_buf_set_name(dmabuf, (const char __user *)arg);
- case DMA_BUF_CREATE_PAGES:
if (copy_from_user(&create_info, (void __user *)arg,
sizeof(create_info)))
return -EFAULT;
return dma_buf_create_pages(file, &create_info);
#if IS_ENABLED(CONFIG_SYNC_FILE) case DMA_BUF_IOCTL_EXPORT_SYNC_FILE: @@ -1613,6 +1624,218 @@ void dma_buf_vunmap_unlocked(struct dma_buf *dmabuf, struct iosys_map *map) } EXPORT_SYMBOL_NS_GPL(dma_buf_vunmap_unlocked, DMA_BUF); +static int dma_buf_pages_release(struct inode *inode, struct file *file) +{
- struct dma_buf_pages *priv = file->private_data;
- if (priv->type_ops->dma_buf_pages_release)
priv->type_ops->dma_buf_pages_release(priv, file);
- percpu_ref_kill(&priv->pgmap.ref);
- /* Drop initial ref after percpu_ref_kill(). */
- percpu_ref_put(&priv->pgmap.ref);
- return 0;
+}
+static void dma_buf_page_free(struct page *page) +{
- struct dma_buf_pages *priv;
- struct dev_pagemap *pgmap;
- pgmap = page->pgmap;
- priv = container_of(pgmap, struct dma_buf_pages, pgmap);
- if (priv->type_ops->dma_buf_page_free)
priv->type_ops->dma_buf_page_free(priv, page);
+}
+const struct dev_pagemap_ops dma_buf_pgmap_ops = {
- .page_free = dma_buf_page_free,
+}; +EXPORT_SYMBOL_GPL(dma_buf_pgmap_ops);
+const struct file_operations dma_buf_pages_fops = {
- .release = dma_buf_pages_release,
+}; +EXPORT_SYMBOL_GPL(dma_buf_pages_fops);
+#ifdef CONFIG_ZONE_DEVICE +static void dma_buf_pages_destroy(struct percpu_ref *ref) +{
- struct dma_buf_pages *priv;
- struct dev_pagemap *pgmap;
- pgmap = container_of(ref, struct dev_pagemap, ref);
- priv = container_of(pgmap, struct dma_buf_pages, pgmap);
- if (priv->type_ops->dma_buf_pages_destroy)
priv->type_ops->dma_buf_pages_destroy(priv);
- kvfree(priv->pages);
- kfree(priv);
- dma_buf_unmap_attachment(priv->attachment, priv->sgt, priv->direction);
- dma_buf_detach(priv->dmabuf, priv->attachment);
- dma_buf_put(priv->dmabuf);
- pci_dev_put(priv->pci_dev);
+}
+static long dma_buf_create_pages(struct file *file,
struct dma_buf_create_pages_info *create_info)
+{
- int err, fd, i, pg_idx;
- struct scatterlist *sg;
- struct dma_buf_pages *priv;
- struct file *new_file;
- fd = get_unused_fd_flags(O_RDWR | O_CLOEXEC);
- if (fd < 0) {
err = fd;
goto out_err;
- }
- priv = kzalloc(sizeof(*priv), GFP_KERNEL);
- if (!priv) {
err = -ENOMEM;
goto out_put_fd;
- }
- priv->pgmap.type = MEMORY_DEVICE_PRIVATE;
- priv->pgmap.ops = &dma_buf_pgmap_ops;
- init_completion(&priv->pgmap.done);
- /* This refcount is incremented every time a page in priv->pages is
* allocated, and decremented every time a page is freed. When
* it drops to 0, the dma_buf_pages can be destroyed. An initial ref is
* held and the dma_buf_pages is not destroyed until that is dropped.
*/
- err = percpu_ref_init(&priv->pgmap.ref, dma_buf_pages_destroy, 0,
GFP_KERNEL);
- if (err)
goto out_free_priv;
- /* Initial ref to be dropped after percpu_ref_kill(). */
- percpu_ref_get(&priv->pgmap.ref);
- priv->pci_dev = pci_get_domain_bus_and_slot(
0, create_info->pci_bdf[0],
PCI_DEVFN(create_info->pci_bdf[1], create_info->pci_bdf[2]));
- if (!priv->pci_dev) {
err = -ENODEV;
goto out_exit_percpu_ref;
- }
- priv->dmabuf = dma_buf_get(create_info->dma_buf_fd);
- if (IS_ERR(priv->dmabuf)) {
err = PTR_ERR(priv->dmabuf);
goto out_put_pci_dev;
- }
- if (priv->dmabuf->size % PAGE_SIZE != 0) {
err = -EINVAL;
goto out_put_dma_buf;
- }
- priv->attachment = dma_buf_attach(priv->dmabuf, &priv->pci_dev->dev);
- if (IS_ERR(priv->attachment)) {
err = PTR_ERR(priv->attachment);
goto out_put_dma_buf;
- }
- priv->num_pages = priv->dmabuf->size / PAGE_SIZE;
- priv->pages = kvmalloc_array(priv->num_pages, sizeof(struct page),
GFP_KERNEL);
- if (!priv->pages) {
err = -ENOMEM;
goto out_detach_dma_buf;
- }
- for (i = 0; i < priv->num_pages; i++) {
struct page *page = &priv->pages[i];
mm_zero_struct_page(page);
set_page_zone(page, ZONE_DEVICE);
set_page_count(page, 1);
page->pgmap = &priv->pgmap;
- }
- priv->direction = DMA_BIDIRECTIONAL;
- priv->sgt = dma_buf_map_attachment(priv->attachment, priv->direction);
- if (IS_ERR(priv->sgt)) {
err = PTR_ERR(priv->sgt);
goto out_free_pages;
- }
- /* write each dma addresses from sgt to each page */
- pg_idx = 0;
- for_each_sgtable_dma_sg(priv->sgt, sg, i) {
size_t len = sg_dma_len(sg);
dma_addr_t dma_addr = sg_dma_address(sg);
BUG_ON(!PAGE_ALIGNED(len));
while (len > 0) {
priv->pages[pg_idx].zone_device_data = (void *)dma_addr;
pg_idx++;
dma_addr += PAGE_SIZE;
len -= PAGE_SIZE;
}
- }
- new_file = anon_inode_getfile("[dma_buf_pages]", &dma_buf_pages_fops,
(void *)priv, O_RDWR | O_CLOEXEC);
- if (IS_ERR(new_file)) {
err = PTR_ERR(new_file);
goto out_unmap_dma_buf;
- }
- priv->type = create_info->type;
- priv->create_flags = create_info->create_flags;
- switch (priv->type) {
- default:
err = -EINVAL;
goto out_put_new_file;
- }
- if (priv->type_ops->dma_buf_pages_init) {
err = priv->type_ops->dma_buf_pages_init(priv, new_file);
if (err)
goto out_put_new_file;
- }
- fd_install(fd, new_file);
- return fd;
+out_put_new_file:
- fput(new_file);
+out_unmap_dma_buf:
- dma_buf_unmap_attachment(priv->attachment, priv->sgt, priv->direction);
+out_free_pages:
- kvfree(priv->pages);
+out_detach_dma_buf:
- dma_buf_detach(priv->dmabuf, priv->attachment);
+out_put_dma_buf:
- dma_buf_put(priv->dmabuf);
+out_put_pci_dev:
- pci_dev_put(priv->pci_dev);
+out_exit_percpu_ref:
- percpu_ref_exit(&priv->pgmap.ref);
+out_free_priv:
- kfree(priv);
+out_put_fd:
- put_unused_fd(fd);
+out_err:
- return err;
+} +#else +static long dma_buf_create_pages(struct file *file,
struct dma_buf_create_pages_info *create_info)
+{
- return -ENOTSUPP;
+} +#endif
- #ifdef CONFIG_DEBUG_FS static int dma_buf_debug_show(struct seq_file *s, void *unused) {
diff --git a/include/linux/dma-buf.h b/include/linux/dma-buf.h index 3f31baa3293f..5789006180ea 100644 --- a/include/linux/dma-buf.h +++ b/include/linux/dma-buf.h @@ -540,6 +540,36 @@ struct dma_buf_export_info { void *priv; }; +struct dma_buf_pages;
+struct dma_buf_pages_type_ops {
- int (*dma_buf_pages_init)(struct dma_buf_pages *priv,
struct file *file);
- void (*dma_buf_pages_release)(struct dma_buf_pages *priv,
struct file *file);
- void (*dma_buf_pages_destroy)(struct dma_buf_pages *priv);
- void (*dma_buf_page_free)(struct dma_buf_pages *priv,
struct page *page);
+};
+struct dma_buf_pages {
- /* fields for dmabuf */
- struct dma_buf *dmabuf;
- struct dma_buf_attachment *attachment;
- struct sg_table *sgt;
- struct pci_dev *pci_dev;
- enum dma_data_direction direction;
- /* fields for dma-buf pages */
- size_t num_pages;
- struct page *pages;
- struct dev_pagemap pgmap;
- unsigned int type;
- const struct dma_buf_pages_type_ops *type_ops;
- __u64 create_flags;
+};
- /**
- DEFINE_DMA_BUF_EXPORT_INFO - helper macro for exporters
- @name: export-info name
@@ -631,4 +661,64 @@ int dma_buf_vmap(struct dma_buf *dmabuf, struct iosys_map *map); void dma_buf_vunmap(struct dma_buf *dmabuf, struct iosys_map *map); int dma_buf_vmap_unlocked(struct dma_buf *dmabuf, struct iosys_map *map); void dma_buf_vunmap_unlocked(struct dma_buf *dmabuf, struct iosys_map *map);
+#ifdef CONFIG_DMA_SHARED_BUFFER +extern const struct file_operations dma_buf_pages_fops; +extern const struct dev_pagemap_ops dma_buf_pgmap_ops;
+static inline bool is_dma_buf_pages_file(struct file *file) +{
- return file->f_op == &dma_buf_pages_fops;
+}
+static inline bool is_dma_buf_page(struct page *page) +{
- return (is_zone_device_page(page) && page->pgmap &&
page->pgmap->ops == &dma_buf_pgmap_ops);
+}
+static inline dma_addr_t dma_buf_page_to_dma_addr(struct page *page) +{
- return (dma_addr_t)page->zone_device_data;
+}
+static inline int dma_buf_map_sg(struct device *dev, struct scatterlist *sg,
int nents, enum dma_data_direction dir)
+{
- struct scatterlist *s;
- int i;
- for_each_sg(sg, s, nents, i) {
struct page *pg = sg_page(s);
s->dma_address = dma_buf_page_to_dma_addr(pg);
sg_dma_len(s) = s->length;
- }
- return nents;
+} +#else +static inline bool is_dma_buf_page(struct page *page) +{
- return false;
+}
+static inline bool is_dma_buf_pages_file(struct file *file) +{
- return false;
+}
+static inline dma_addr_t dma_buf_page_to_dma_addr(struct page *page) +{
- return 0;
+}
+static inline int dma_buf_map_sg(struct device *dev, struct scatterlist *sg,
int nents, enum dma_data_direction dir)
+{
- return 0;
+} +#endif
- #endif /* __DMA_BUF_H__ */
diff --git a/include/uapi/linux/dma-buf.h b/include/uapi/linux/dma-buf.h index 5a6fda66d9ad..d0f63a2ab7e4 100644 --- a/include/uapi/linux/dma-buf.h +++ b/include/uapi/linux/dma-buf.h @@ -179,4 +179,13 @@ struct dma_buf_import_sync_file { #define DMA_BUF_IOCTL_EXPORT_SYNC_FILE _IOWR(DMA_BUF_BASE, 2, struct dma_buf_export_sync_file) #define DMA_BUF_IOCTL_IMPORT_SYNC_FILE _IOW(DMA_BUF_BASE, 3, struct dma_buf_import_sync_file) +struct dma_buf_create_pages_info {
- __u8 pci_bdf[3];
- __s32 dma_buf_fd;
- __u32 type;
- __u64 create_flags;
+};
+#define DMA_BUF_CREATE_PAGES _IOW(DMA_BUF_BASE, 4, struct dma_buf_create_pages_info)
- #endif
On Tue, Jul 11, 2023 at 12:59 AM Christian König christian.koenig@amd.com wrote:
Am 11.07.23 um 00:32 schrieb Mina Almasry:
Currently dmabuf p2p memory doesn't present itself in the form of struct pages and the memory can't be easily used with code that expects memory in that form.
Well, this won't fly at all.
First of all DMA-buf is *not* about passing struct pages around, drivers are *only* supposed to use the DMA addresses.
That code can't use the pages inside a DMA-buf is absolutely intentional. We even mangle the page pointers from the sg_tables because people sometimes get the impression they can use those.
See function mangle_sg_table() in dma-buf.c
/* To catch abuse of the underlying struct page by importers mix * up the bits, but take care to preserve the low SG_ bits to * not corrupt the sgt. The mixing is undone in __unmap_dma_buf * before passing the sgt back to the exporter. */ for_each_sgtable_sg(sg_table, sg, i) sg->page_link ^= ~0xffUL;
Add support for paged attachment mappings. We use existing dmabuf APIs to create a mapped attachment (dma_buf_attach() & dma_buf_map_attachment()), and we create struct pages for this mapping. We write the dma_addr's from the sg_table into the created pages.
Hui, what? Not really.
Thanks for talking a look Christian,
FWIW, the patch doesn't try to use the pages backing the sg_table at all. The patch allocates pages here:
+ priv->pages = kvmalloc_array(priv->num_pages, sizeof(struct page), + GFP_KERNEL);
And writes the dma_addrs to the pages here:
+ /* write each dma addresses from sgt to each page */ + pg_idx = 0; + for_each_sgtable_dma_sg(priv->sgt, sg, i) { + size_t len = sg_dma_len(sg); + dma_addr_t dma_addr = sg_dma_address(sg); + + BUG_ON(!PAGE_ALIGNED(len)); + while (len > 0) { + priv->pages[pg_idx].zone_device_data = (void *)dma_addr; + pg_idx++; + dma_addr += PAGE_SIZE; + len -= PAGE_SIZE; + } + }
But, from your response it doesn't seem like it matters. It seems you're not interested in creating struct pages for the DMA-buf at all. My options for this RFC are:
1. Use p2pdma so I get struct pages for the device memory, or 2. Modify the page_pool, networking drivers, and sk_buff to use dma_addr instead of a struct page.
As far as I understand option #2 is not feasible or desired. I'll do some homework to investigate the feasibility of using p2pdma instead, and CCing the maintainers of p2pdma. The cover letter of the RFC is here:
https://lore.kernel.org/netdev/20230710223304.1174642-1-almasrymina@google.c...
If there is something obvious to you that is blocking utilizing p2pdma for this work, please do let me know.
Regards, Christian.
These pages can then be passed into code that expects struct pages and can largely operate on these pages with minimal modifications:
- The pages need not be dma mapped. The dma addr can be queried from page->zone_device_data and used directly.
- The pages are not kmappable.
Add a new ioctl that enables the user to create a struct page backed dmabuf attachment mapping. This ioctl returns a new file descriptor which represents the dmabuf pages. The pages are freed when (a) the user closes the file, and (b) the struct pages backing the dmabuf are no longer in use. Once the pages are no longer in use, the mapped attachment is removed.
The support added in this patch should be generic - the pages are created by the base code, but the user specifies the type of page to create using the dmabuf_create_pages_info->type flag. The base code hands of any handling specific to the use case of the ops of that page type.
Signed-off-by: Mina Almasry almasrymina@google.com
drivers/dma-buf/dma-buf.c | 223 +++++++++++++++++++++++++++++++++++ include/linux/dma-buf.h | 90 ++++++++++++++ include/uapi/linux/dma-buf.h | 9 ++ 3 files changed, 322 insertions(+)
diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c index aa4ea8530cb3..50b1d813cf5c 100644 --- a/drivers/dma-buf/dma-buf.c +++ b/drivers/dma-buf/dma-buf.c @@ -22,6 +22,7 @@ #include <linux/module.h> #include <linux/seq_file.h> #include <linux/sync_file.h> +#include <linux/pci.h> #include <linux/poll.h> #include <linux/dma-resv.h> #include <linux/mm.h> @@ -442,12 +443,16 @@ static long dma_buf_import_sync_file(struct dma_buf *dmabuf, } #endif
+static long dma_buf_create_pages(struct file *file,
struct dma_buf_create_pages_info *create_info);
static long dma_buf_ioctl(struct file *file, unsigned int cmd, unsigned long arg) { struct dma_buf *dmabuf; struct dma_buf_sync sync; enum dma_data_direction direction;
struct dma_buf_create_pages_info create_info; int ret; dmabuf = file->private_data;
@@ -484,6 +489,12 @@ static long dma_buf_ioctl(struct file *file, case DMA_BUF_SET_NAME_A: case DMA_BUF_SET_NAME_B: return dma_buf_set_name(dmabuf, (const char __user *)arg);
case DMA_BUF_CREATE_PAGES:
if (copy_from_user(&create_info, (void __user *)arg,
sizeof(create_info)))
return -EFAULT;
return dma_buf_create_pages(file, &create_info);
#if IS_ENABLED(CONFIG_SYNC_FILE) case DMA_BUF_IOCTL_EXPORT_SYNC_FILE:
@@ -1613,6 +1624,218 @@ void dma_buf_vunmap_unlocked(struct dma_buf *dmabuf, struct iosys_map *map) } EXPORT_SYMBOL_NS_GPL(dma_buf_vunmap_unlocked, DMA_BUF);
+static int dma_buf_pages_release(struct inode *inode, struct file *file) +{
struct dma_buf_pages *priv = file->private_data;
if (priv->type_ops->dma_buf_pages_release)
priv->type_ops->dma_buf_pages_release(priv, file);
percpu_ref_kill(&priv->pgmap.ref);
/* Drop initial ref after percpu_ref_kill(). */
percpu_ref_put(&priv->pgmap.ref);
return 0;
+}
+static void dma_buf_page_free(struct page *page) +{
struct dma_buf_pages *priv;
struct dev_pagemap *pgmap;
pgmap = page->pgmap;
priv = container_of(pgmap, struct dma_buf_pages, pgmap);
if (priv->type_ops->dma_buf_page_free)
priv->type_ops->dma_buf_page_free(priv, page);
+}
+const struct dev_pagemap_ops dma_buf_pgmap_ops = {
.page_free = dma_buf_page_free,
+}; +EXPORT_SYMBOL_GPL(dma_buf_pgmap_ops);
+const struct file_operations dma_buf_pages_fops = {
.release = dma_buf_pages_release,
+}; +EXPORT_SYMBOL_GPL(dma_buf_pages_fops);
+#ifdef CONFIG_ZONE_DEVICE +static void dma_buf_pages_destroy(struct percpu_ref *ref) +{
struct dma_buf_pages *priv;
struct dev_pagemap *pgmap;
pgmap = container_of(ref, struct dev_pagemap, ref);
priv = container_of(pgmap, struct dma_buf_pages, pgmap);
if (priv->type_ops->dma_buf_pages_destroy)
priv->type_ops->dma_buf_pages_destroy(priv);
kvfree(priv->pages);
kfree(priv);
dma_buf_unmap_attachment(priv->attachment, priv->sgt, priv->direction);
dma_buf_detach(priv->dmabuf, priv->attachment);
dma_buf_put(priv->dmabuf);
pci_dev_put(priv->pci_dev);
+}
+static long dma_buf_create_pages(struct file *file,
struct dma_buf_create_pages_info *create_info)
+{
int err, fd, i, pg_idx;
struct scatterlist *sg;
struct dma_buf_pages *priv;
struct file *new_file;
fd = get_unused_fd_flags(O_RDWR | O_CLOEXEC);
if (fd < 0) {
err = fd;
goto out_err;
}
priv = kzalloc(sizeof(*priv), GFP_KERNEL);
if (!priv) {
err = -ENOMEM;
goto out_put_fd;
}
priv->pgmap.type = MEMORY_DEVICE_PRIVATE;
priv->pgmap.ops = &dma_buf_pgmap_ops;
init_completion(&priv->pgmap.done);
/* This refcount is incremented every time a page in priv->pages is
* allocated, and decremented every time a page is freed. When
* it drops to 0, the dma_buf_pages can be destroyed. An initial ref is
* held and the dma_buf_pages is not destroyed until that is dropped.
*/
err = percpu_ref_init(&priv->pgmap.ref, dma_buf_pages_destroy, 0,
GFP_KERNEL);
if (err)
goto out_free_priv;
/* Initial ref to be dropped after percpu_ref_kill(). */
percpu_ref_get(&priv->pgmap.ref);
priv->pci_dev = pci_get_domain_bus_and_slot(
0, create_info->pci_bdf[0],
PCI_DEVFN(create_info->pci_bdf[1], create_info->pci_bdf[2]));
if (!priv->pci_dev) {
err = -ENODEV;
goto out_exit_percpu_ref;
}
priv->dmabuf = dma_buf_get(create_info->dma_buf_fd);
if (IS_ERR(priv->dmabuf)) {
err = PTR_ERR(priv->dmabuf);
goto out_put_pci_dev;
}
if (priv->dmabuf->size % PAGE_SIZE != 0) {
err = -EINVAL;
goto out_put_dma_buf;
}
priv->attachment = dma_buf_attach(priv->dmabuf, &priv->pci_dev->dev);
if (IS_ERR(priv->attachment)) {
err = PTR_ERR(priv->attachment);
goto out_put_dma_buf;
}
priv->num_pages = priv->dmabuf->size / PAGE_SIZE;
priv->pages = kvmalloc_array(priv->num_pages, sizeof(struct page),
GFP_KERNEL);
if (!priv->pages) {
err = -ENOMEM;
goto out_detach_dma_buf;
}
for (i = 0; i < priv->num_pages; i++) {
struct page *page = &priv->pages[i];
mm_zero_struct_page(page);
set_page_zone(page, ZONE_DEVICE);
set_page_count(page, 1);
page->pgmap = &priv->pgmap;
}
priv->direction = DMA_BIDIRECTIONAL;
priv->sgt = dma_buf_map_attachment(priv->attachment, priv->direction);
if (IS_ERR(priv->sgt)) {
err = PTR_ERR(priv->sgt);
goto out_free_pages;
}
/* write each dma addresses from sgt to each page */
pg_idx = 0;
for_each_sgtable_dma_sg(priv->sgt, sg, i) {
size_t len = sg_dma_len(sg);
dma_addr_t dma_addr = sg_dma_address(sg);
BUG_ON(!PAGE_ALIGNED(len));
while (len > 0) {
priv->pages[pg_idx].zone_device_data = (void *)dma_addr;
pg_idx++;
dma_addr += PAGE_SIZE;
len -= PAGE_SIZE;
}
}
new_file = anon_inode_getfile("[dma_buf_pages]", &dma_buf_pages_fops,
(void *)priv, O_RDWR | O_CLOEXEC);
if (IS_ERR(new_file)) {
err = PTR_ERR(new_file);
goto out_unmap_dma_buf;
}
priv->type = create_info->type;
priv->create_flags = create_info->create_flags;
switch (priv->type) {
default:
err = -EINVAL;
goto out_put_new_file;
}
if (priv->type_ops->dma_buf_pages_init) {
err = priv->type_ops->dma_buf_pages_init(priv, new_file);
if (err)
goto out_put_new_file;
}
fd_install(fd, new_file);
return fd;
+out_put_new_file:
fput(new_file);
+out_unmap_dma_buf:
dma_buf_unmap_attachment(priv->attachment, priv->sgt, priv->direction);
+out_free_pages:
kvfree(priv->pages);
+out_detach_dma_buf:
dma_buf_detach(priv->dmabuf, priv->attachment);
+out_put_dma_buf:
dma_buf_put(priv->dmabuf);
+out_put_pci_dev:
pci_dev_put(priv->pci_dev);
+out_exit_percpu_ref:
percpu_ref_exit(&priv->pgmap.ref);
+out_free_priv:
kfree(priv);
+out_put_fd:
put_unused_fd(fd);
+out_err:
return err;
+} +#else +static long dma_buf_create_pages(struct file *file,
struct dma_buf_create_pages_info *create_info)
+{
return -ENOTSUPP;
+} +#endif
- #ifdef CONFIG_DEBUG_FS static int dma_buf_debug_show(struct seq_file *s, void *unused) {
diff --git a/include/linux/dma-buf.h b/include/linux/dma-buf.h index 3f31baa3293f..5789006180ea 100644 --- a/include/linux/dma-buf.h +++ b/include/linux/dma-buf.h @@ -540,6 +540,36 @@ struct dma_buf_export_info { void *priv; };
+struct dma_buf_pages;
+struct dma_buf_pages_type_ops {
int (*dma_buf_pages_init)(struct dma_buf_pages *priv,
struct file *file);
void (*dma_buf_pages_release)(struct dma_buf_pages *priv,
struct file *file);
void (*dma_buf_pages_destroy)(struct dma_buf_pages *priv);
void (*dma_buf_page_free)(struct dma_buf_pages *priv,
struct page *page);
+};
+struct dma_buf_pages {
/* fields for dmabuf */
struct dma_buf *dmabuf;
struct dma_buf_attachment *attachment;
struct sg_table *sgt;
struct pci_dev *pci_dev;
enum dma_data_direction direction;
/* fields for dma-buf pages */
size_t num_pages;
struct page *pages;
struct dev_pagemap pgmap;
unsigned int type;
const struct dma_buf_pages_type_ops *type_ops;
__u64 create_flags;
+};
- /**
- DEFINE_DMA_BUF_EXPORT_INFO - helper macro for exporters
- @name: export-info name
@@ -631,4 +661,64 @@ int dma_buf_vmap(struct dma_buf *dmabuf, struct iosys_map *map); void dma_buf_vunmap(struct dma_buf *dmabuf, struct iosys_map *map); int dma_buf_vmap_unlocked(struct dma_buf *dmabuf, struct iosys_map *map); void dma_buf_vunmap_unlocked(struct dma_buf *dmabuf, struct iosys_map *map);
+#ifdef CONFIG_DMA_SHARED_BUFFER +extern const struct file_operations dma_buf_pages_fops; +extern const struct dev_pagemap_ops dma_buf_pgmap_ops;
+static inline bool is_dma_buf_pages_file(struct file *file) +{
return file->f_op == &dma_buf_pages_fops;
+}
+static inline bool is_dma_buf_page(struct page *page) +{
return (is_zone_device_page(page) && page->pgmap &&
page->pgmap->ops == &dma_buf_pgmap_ops);
+}
+static inline dma_addr_t dma_buf_page_to_dma_addr(struct page *page) +{
return (dma_addr_t)page->zone_device_data;
+}
+static inline int dma_buf_map_sg(struct device *dev, struct scatterlist *sg,
int nents, enum dma_data_direction dir)
+{
struct scatterlist *s;
int i;
for_each_sg(sg, s, nents, i) {
struct page *pg = sg_page(s);
s->dma_address = dma_buf_page_to_dma_addr(pg);
sg_dma_len(s) = s->length;
}
return nents;
+} +#else +static inline bool is_dma_buf_page(struct page *page) +{
return false;
+}
+static inline bool is_dma_buf_pages_file(struct file *file) +{
return false;
+}
+static inline dma_addr_t dma_buf_page_to_dma_addr(struct page *page) +{
return 0;
+}
+static inline int dma_buf_map_sg(struct device *dev, struct scatterlist *sg,
int nents, enum dma_data_direction dir)
+{
return 0;
+} +#endif
- #endif /* __DMA_BUF_H__ */
diff --git a/include/uapi/linux/dma-buf.h b/include/uapi/linux/dma-buf.h index 5a6fda66d9ad..d0f63a2ab7e4 100644 --- a/include/uapi/linux/dma-buf.h +++ b/include/uapi/linux/dma-buf.h @@ -179,4 +179,13 @@ struct dma_buf_import_sync_file { #define DMA_BUF_IOCTL_EXPORT_SYNC_FILE _IOWR(DMA_BUF_BASE, 2, struct dma_buf_export_sync_file) #define DMA_BUF_IOCTL_IMPORT_SYNC_FILE _IOW(DMA_BUF_BASE, 3, struct dma_buf_import_sync_file)
+struct dma_buf_create_pages_info {
__u8 pci_bdf[3];
__s32 dma_buf_fd;
__u32 type;
__u64 create_flags;
+};
+#define DMA_BUF_CREATE_PAGES _IOW(DMA_BUF_BASE, 4, struct dma_buf_create_pages_info)
- #endif
Hi Mina,
Am 11.07.23 um 13:44 schrieb Mina Almasry:
On Tue, Jul 11, 2023 at 12:59 AM Christian König christian.koenig@amd.com wrote:
Am 11.07.23 um 00:32 schrieb Mina Almasry:
Currently dmabuf p2p memory doesn't present itself in the form of struct pages and the memory can't be easily used with code that expects memory in that form.
Well, this won't fly at all.
First of all DMA-buf is *not* about passing struct pages around, drivers are *only* supposed to use the DMA addresses.
That code can't use the pages inside a DMA-buf is absolutely intentional. We even mangle the page pointers from the sg_tables because people sometimes get the impression they can use those.
See function mangle_sg_table() in dma-buf.c
/* To catch abuse of the underlying struct page by importers mix * up the bits, but take care to preserve the low SG_ bits to * not corrupt the sgt. The mixing is undone in __unmap_dma_buf * before passing the sgt back to the exporter. */ for_each_sgtable_sg(sg_table, sg, i) sg->page_link ^= ~0xffUL;
Add support for paged attachment mappings. We use existing dmabuf APIs to create a mapped attachment (dma_buf_attach() & dma_buf_map_attachment()), and we create struct pages for this mapping. We write the dma_addr's from the sg_table into the created pages.
Hui, what? Not really.
Thanks for talking a look Christian,
FWIW, the patch doesn't try to use the pages backing the sg_table at all. The patch allocates pages here:
priv->pages = kvmalloc_array(priv->num_pages, sizeof(struct page),
GFP_KERNEL);
And writes the dma_addrs to the pages here:
/* write each dma addresses from sgt to each page */
pg_idx = 0;
for_each_sgtable_dma_sg(priv->sgt, sg, i) {
size_t len = sg_dma_len(sg);
dma_addr_t dma_addr = sg_dma_address(sg);
BUG_ON(!PAGE_ALIGNED(len));
while (len > 0) {
priv->pages[pg_idx].zone_device_data = (void *)dma_addr;
pg_idx++;
dma_addr += PAGE_SIZE;
len -= PAGE_SIZE;
}
}
Yes, I have seen that and this is completely utterly nonsense.
I have absolutely no idea how you came to the conclusion that this would work.
But, from your response it doesn't seem like it matters. It seems you're not interested in creating struct pages for the DMA-buf at all.
Well the problem isn't creating struct pages for DMA-buf, most exporters actually do have struct pages backing the memory they are exporting.
The problem is that the importer should *not* touch those struct pages because that would create all kinds of trouble.
My options for this RFC are:
- Use p2pdma so I get struct pages for the device memory, or
- Modify the page_pool, networking drivers, and sk_buff to use
dma_addr instead of a struct page.
As far as I understand option #2 is not feasible or desired. I'll do some homework to investigate the feasibility of using p2pdma instead, and CCing the maintainers of p2pdma.
Well as far as I can see neither approach will work.
The question is what you are trying to archive here? Let network drivers work with p2p? That is certainly possible.
The cover letter of the RFC is here:
https://lore.kernel.org/netdev/20230710223304.1174642-1-almasrymina@google.c...
If there is something obvious to you that is blocking utilizing p2pdma for this work, please do let me know.
The fundamental problem seems to be that you don't understand what a struct page is good for and why this is used the way it is in the network and I/O layer and why it is not used in DMA-buf.
I strongly suggest that you read up on the basic of this before proposing any solution.
Regards, Christian.
Regards, Christian.
These pages can then be passed into code that expects struct pages and can largely operate on these pages with minimal modifications:
- The pages need not be dma mapped. The dma addr can be queried from page->zone_device_data and used directly.
- The pages are not kmappable.
Add a new ioctl that enables the user to create a struct page backed dmabuf attachment mapping. This ioctl returns a new file descriptor which represents the dmabuf pages. The pages are freed when (a) the user closes the file, and (b) the struct pages backing the dmabuf are no longer in use. Once the pages are no longer in use, the mapped attachment is removed.
The support added in this patch should be generic - the pages are created by the base code, but the user specifies the type of page to create using the dmabuf_create_pages_info->type flag. The base code hands of any handling specific to the use case of the ops of that page type.
Signed-off-by: Mina Almasry almasrymina@google.com
drivers/dma-buf/dma-buf.c | 223 +++++++++++++++++++++++++++++++++++ include/linux/dma-buf.h | 90 ++++++++++++++ include/uapi/linux/dma-buf.h | 9 ++ 3 files changed, 322 insertions(+)
diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c index aa4ea8530cb3..50b1d813cf5c 100644 --- a/drivers/dma-buf/dma-buf.c +++ b/drivers/dma-buf/dma-buf.c @@ -22,6 +22,7 @@ #include <linux/module.h> #include <linux/seq_file.h> #include <linux/sync_file.h> +#include <linux/pci.h> #include <linux/poll.h> #include <linux/dma-resv.h> #include <linux/mm.h> @@ -442,12 +443,16 @@ static long dma_buf_import_sync_file(struct dma_buf *dmabuf, } #endif
+static long dma_buf_create_pages(struct file *file,
struct dma_buf_create_pages_info *create_info);
static long dma_buf_ioctl(struct file *file, unsigned int cmd, unsigned long arg) { struct dma_buf *dmabuf; struct dma_buf_sync sync; enum dma_data_direction direction;
struct dma_buf_create_pages_info create_info; int ret; dmabuf = file->private_data;
@@ -484,6 +489,12 @@ static long dma_buf_ioctl(struct file *file, case DMA_BUF_SET_NAME_A: case DMA_BUF_SET_NAME_B: return dma_buf_set_name(dmabuf, (const char __user *)arg);
case DMA_BUF_CREATE_PAGES:
if (copy_from_user(&create_info, (void __user *)arg,
sizeof(create_info)))
return -EFAULT;
return dma_buf_create_pages(file, &create_info);
#if IS_ENABLED(CONFIG_SYNC_FILE) case DMA_BUF_IOCTL_EXPORT_SYNC_FILE:
@@ -1613,6 +1624,218 @@ void dma_buf_vunmap_unlocked(struct dma_buf *dmabuf, struct iosys_map *map) } EXPORT_SYMBOL_NS_GPL(dma_buf_vunmap_unlocked, DMA_BUF);
+static int dma_buf_pages_release(struct inode *inode, struct file *file) +{
struct dma_buf_pages *priv = file->private_data;
if (priv->type_ops->dma_buf_pages_release)
priv->type_ops->dma_buf_pages_release(priv, file);
percpu_ref_kill(&priv->pgmap.ref);
/* Drop initial ref after percpu_ref_kill(). */
percpu_ref_put(&priv->pgmap.ref);
return 0;
+}
+static void dma_buf_page_free(struct page *page) +{
struct dma_buf_pages *priv;
struct dev_pagemap *pgmap;
pgmap = page->pgmap;
priv = container_of(pgmap, struct dma_buf_pages, pgmap);
if (priv->type_ops->dma_buf_page_free)
priv->type_ops->dma_buf_page_free(priv, page);
+}
+const struct dev_pagemap_ops dma_buf_pgmap_ops = {
.page_free = dma_buf_page_free,
+}; +EXPORT_SYMBOL_GPL(dma_buf_pgmap_ops);
+const struct file_operations dma_buf_pages_fops = {
.release = dma_buf_pages_release,
+}; +EXPORT_SYMBOL_GPL(dma_buf_pages_fops);
+#ifdef CONFIG_ZONE_DEVICE +static void dma_buf_pages_destroy(struct percpu_ref *ref) +{
struct dma_buf_pages *priv;
struct dev_pagemap *pgmap;
pgmap = container_of(ref, struct dev_pagemap, ref);
priv = container_of(pgmap, struct dma_buf_pages, pgmap);
if (priv->type_ops->dma_buf_pages_destroy)
priv->type_ops->dma_buf_pages_destroy(priv);
kvfree(priv->pages);
kfree(priv);
dma_buf_unmap_attachment(priv->attachment, priv->sgt, priv->direction);
dma_buf_detach(priv->dmabuf, priv->attachment);
dma_buf_put(priv->dmabuf);
pci_dev_put(priv->pci_dev);
+}
+static long dma_buf_create_pages(struct file *file,
struct dma_buf_create_pages_info *create_info)
+{
int err, fd, i, pg_idx;
struct scatterlist *sg;
struct dma_buf_pages *priv;
struct file *new_file;
fd = get_unused_fd_flags(O_RDWR | O_CLOEXEC);
if (fd < 0) {
err = fd;
goto out_err;
}
priv = kzalloc(sizeof(*priv), GFP_KERNEL);
if (!priv) {
err = -ENOMEM;
goto out_put_fd;
}
priv->pgmap.type = MEMORY_DEVICE_PRIVATE;
priv->pgmap.ops = &dma_buf_pgmap_ops;
init_completion(&priv->pgmap.done);
/* This refcount is incremented every time a page in priv->pages is
* allocated, and decremented every time a page is freed. When
* it drops to 0, the dma_buf_pages can be destroyed. An initial ref is
* held and the dma_buf_pages is not destroyed until that is dropped.
*/
err = percpu_ref_init(&priv->pgmap.ref, dma_buf_pages_destroy, 0,
GFP_KERNEL);
if (err)
goto out_free_priv;
/* Initial ref to be dropped after percpu_ref_kill(). */
percpu_ref_get(&priv->pgmap.ref);
priv->pci_dev = pci_get_domain_bus_and_slot(
0, create_info->pci_bdf[0],
PCI_DEVFN(create_info->pci_bdf[1], create_info->pci_bdf[2]));
if (!priv->pci_dev) {
err = -ENODEV;
goto out_exit_percpu_ref;
}
priv->dmabuf = dma_buf_get(create_info->dma_buf_fd);
if (IS_ERR(priv->dmabuf)) {
err = PTR_ERR(priv->dmabuf);
goto out_put_pci_dev;
}
if (priv->dmabuf->size % PAGE_SIZE != 0) {
err = -EINVAL;
goto out_put_dma_buf;
}
priv->attachment = dma_buf_attach(priv->dmabuf, &priv->pci_dev->dev);
if (IS_ERR(priv->attachment)) {
err = PTR_ERR(priv->attachment);
goto out_put_dma_buf;
}
priv->num_pages = priv->dmabuf->size / PAGE_SIZE;
priv->pages = kvmalloc_array(priv->num_pages, sizeof(struct page),
GFP_KERNEL);
if (!priv->pages) {
err = -ENOMEM;
goto out_detach_dma_buf;
}
for (i = 0; i < priv->num_pages; i++) {
struct page *page = &priv->pages[i];
mm_zero_struct_page(page);
set_page_zone(page, ZONE_DEVICE);
set_page_count(page, 1);
page->pgmap = &priv->pgmap;
}
priv->direction = DMA_BIDIRECTIONAL;
priv->sgt = dma_buf_map_attachment(priv->attachment, priv->direction);
if (IS_ERR(priv->sgt)) {
err = PTR_ERR(priv->sgt);
goto out_free_pages;
}
/* write each dma addresses from sgt to each page */
pg_idx = 0;
for_each_sgtable_dma_sg(priv->sgt, sg, i) {
size_t len = sg_dma_len(sg);
dma_addr_t dma_addr = sg_dma_address(sg);
BUG_ON(!PAGE_ALIGNED(len));
while (len > 0) {
priv->pages[pg_idx].zone_device_data = (void *)dma_addr;
pg_idx++;
dma_addr += PAGE_SIZE;
len -= PAGE_SIZE;
}
}
new_file = anon_inode_getfile("[dma_buf_pages]", &dma_buf_pages_fops,
(void *)priv, O_RDWR | O_CLOEXEC);
if (IS_ERR(new_file)) {
err = PTR_ERR(new_file);
goto out_unmap_dma_buf;
}
priv->type = create_info->type;
priv->create_flags = create_info->create_flags;
switch (priv->type) {
default:
err = -EINVAL;
goto out_put_new_file;
}
if (priv->type_ops->dma_buf_pages_init) {
err = priv->type_ops->dma_buf_pages_init(priv, new_file);
if (err)
goto out_put_new_file;
}
fd_install(fd, new_file);
return fd;
+out_put_new_file:
fput(new_file);
+out_unmap_dma_buf:
dma_buf_unmap_attachment(priv->attachment, priv->sgt, priv->direction);
+out_free_pages:
kvfree(priv->pages);
+out_detach_dma_buf:
dma_buf_detach(priv->dmabuf, priv->attachment);
+out_put_dma_buf:
dma_buf_put(priv->dmabuf);
+out_put_pci_dev:
pci_dev_put(priv->pci_dev);
+out_exit_percpu_ref:
percpu_ref_exit(&priv->pgmap.ref);
+out_free_priv:
kfree(priv);
+out_put_fd:
put_unused_fd(fd);
+out_err:
return err;
+} +#else +static long dma_buf_create_pages(struct file *file,
struct dma_buf_create_pages_info *create_info)
+{
return -ENOTSUPP;
+} +#endif
- #ifdef CONFIG_DEBUG_FS static int dma_buf_debug_show(struct seq_file *s, void *unused) {
diff --git a/include/linux/dma-buf.h b/include/linux/dma-buf.h index 3f31baa3293f..5789006180ea 100644 --- a/include/linux/dma-buf.h +++ b/include/linux/dma-buf.h @@ -540,6 +540,36 @@ struct dma_buf_export_info { void *priv; };
+struct dma_buf_pages;
+struct dma_buf_pages_type_ops {
int (*dma_buf_pages_init)(struct dma_buf_pages *priv,
struct file *file);
void (*dma_buf_pages_release)(struct dma_buf_pages *priv,
struct file *file);
void (*dma_buf_pages_destroy)(struct dma_buf_pages *priv);
void (*dma_buf_page_free)(struct dma_buf_pages *priv,
struct page *page);
+};
+struct dma_buf_pages {
/* fields for dmabuf */
struct dma_buf *dmabuf;
struct dma_buf_attachment *attachment;
struct sg_table *sgt;
struct pci_dev *pci_dev;
enum dma_data_direction direction;
/* fields for dma-buf pages */
size_t num_pages;
struct page *pages;
struct dev_pagemap pgmap;
unsigned int type;
const struct dma_buf_pages_type_ops *type_ops;
__u64 create_flags;
+};
- /**
- DEFINE_DMA_BUF_EXPORT_INFO - helper macro for exporters
- @name: export-info name
@@ -631,4 +661,64 @@ int dma_buf_vmap(struct dma_buf *dmabuf, struct iosys_map *map); void dma_buf_vunmap(struct dma_buf *dmabuf, struct iosys_map *map); int dma_buf_vmap_unlocked(struct dma_buf *dmabuf, struct iosys_map *map); void dma_buf_vunmap_unlocked(struct dma_buf *dmabuf, struct iosys_map *map);
+#ifdef CONFIG_DMA_SHARED_BUFFER +extern const struct file_operations dma_buf_pages_fops; +extern const struct dev_pagemap_ops dma_buf_pgmap_ops;
+static inline bool is_dma_buf_pages_file(struct file *file) +{
return file->f_op == &dma_buf_pages_fops;
+}
+static inline bool is_dma_buf_page(struct page *page) +{
return (is_zone_device_page(page) && page->pgmap &&
page->pgmap->ops == &dma_buf_pgmap_ops);
+}
+static inline dma_addr_t dma_buf_page_to_dma_addr(struct page *page) +{
return (dma_addr_t)page->zone_device_data;
+}
+static inline int dma_buf_map_sg(struct device *dev, struct scatterlist *sg,
int nents, enum dma_data_direction dir)
+{
struct scatterlist *s;
int i;
for_each_sg(sg, s, nents, i) {
struct page *pg = sg_page(s);
s->dma_address = dma_buf_page_to_dma_addr(pg);
sg_dma_len(s) = s->length;
}
return nents;
+} +#else +static inline bool is_dma_buf_page(struct page *page) +{
return false;
+}
+static inline bool is_dma_buf_pages_file(struct file *file) +{
return false;
+}
+static inline dma_addr_t dma_buf_page_to_dma_addr(struct page *page) +{
return 0;
+}
+static inline int dma_buf_map_sg(struct device *dev, struct scatterlist *sg,
int nents, enum dma_data_direction dir)
+{
return 0;
+} +#endif
- #endif /* __DMA_BUF_H__ */
diff --git a/include/uapi/linux/dma-buf.h b/include/uapi/linux/dma-buf.h index 5a6fda66d9ad..d0f63a2ab7e4 100644 --- a/include/uapi/linux/dma-buf.h +++ b/include/uapi/linux/dma-buf.h @@ -179,4 +179,13 @@ struct dma_buf_import_sync_file { #define DMA_BUF_IOCTL_EXPORT_SYNC_FILE _IOWR(DMA_BUF_BASE, 2, struct dma_buf_export_sync_file) #define DMA_BUF_IOCTL_IMPORT_SYNC_FILE _IOW(DMA_BUF_BASE, 3, struct dma_buf_import_sync_file)
+struct dma_buf_create_pages_info {
__u8 pci_bdf[3];
__s32 dma_buf_fd;
__u32 type;
__u64 create_flags;
+};
+#define DMA_BUF_CREATE_PAGES _IOW(DMA_BUF_BASE, 4, struct dma_buf_create_pages_info)
- #endif
Use the paged attachment mappings support to create NET_RX pages. NET_RX pages are pages that can be used in the networking receive path:
Bind the pages to the driver's rx queues specified by the create_flags param, and create a gen_pool to hold the free pages available for the driver to allocate.
Signed-off-by: Mina Almasry almasrymina@google.com --- drivers/dma-buf/dma-buf.c | 174 +++++++++++++++++++++++++++++++++++ include/linux/dma-buf.h | 20 ++++ include/linux/netdevice.h | 1 + include/uapi/linux/dma-buf.h | 2 + 4 files changed, 197 insertions(+)
diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c index 50b1d813cf5c..acb86bf406f4 100644 --- a/drivers/dma-buf/dma-buf.c +++ b/drivers/dma-buf/dma-buf.c @@ -27,6 +27,7 @@ #include <linux/dma-resv.h> #include <linux/mm.h> #include <linux/mount.h> +#include <linux/netdevice.h> #include <linux/pseudo_fs.h>
#include <uapi/linux/dma-buf.h> @@ -1681,6 +1682,8 @@ static void dma_buf_pages_destroy(struct percpu_ref *ref) pci_dev_put(priv->pci_dev); }
+const struct dma_buf_pages_type_ops net_rx_ops; + static long dma_buf_create_pages(struct file *file, struct dma_buf_create_pages_info *create_info) { @@ -1793,6 +1796,9 @@ static long dma_buf_create_pages(struct file *file, priv->create_flags = create_info->create_flags;
switch (priv->type) { + case DMA_BUF_PAGES_NET_RX: + priv->type_ops = &net_rx_ops; + break; default: err = -EINVAL; goto out_put_new_file; @@ -1966,3 +1972,171 @@ static void __exit dma_buf_deinit(void) dma_buf_uninit_sysfs_statistics(); } __exitcall(dma_buf_deinit); + +/******************************** + * dma_buf_pages_net_rx * + ********************************/ + +void dma_buf_pages_net_rx_release(struct dma_buf_pages *priv, struct file *file) +{ + struct netdev_rx_queue *rxq; + unsigned long xa_idx; + + xa_for_each(&priv->net_rx.bound_rxq_list, xa_idx, rxq) + if (rxq->dmabuf_pages == file) + rxq->dmabuf_pages = NULL; +} + +static int dev_is_class(struct device *dev, void *class) +{ + if (dev->class != NULL && !strcmp(dev->class->name, class)) + return 1; + + return 0; +} + +int dma_buf_pages_net_rx_init(struct dma_buf_pages *priv, struct file *file) +{ + struct netdev_rx_queue *rxq; + struct net_device *netdev; + int xa_id, err, rxq_idx; + struct device *device; + + priv->net_rx.page_pool = + gen_pool_create(PAGE_SHIFT, dev_to_node(&priv->pci_dev->dev)); + + if (!priv->net_rx.page_pool) + return -ENOMEM; + + /* + * We start with PAGE_SIZE instead of 0 since gen_pool_alloc_*() returns + * NULL on error + */ + err = gen_pool_add_virt(priv->net_rx.page_pool, PAGE_SIZE, 0, + PAGE_SIZE * priv->num_pages, + dev_to_node(&priv->pci_dev->dev)); + if (err) + goto out_destroy_pool; + + xa_init_flags(&priv->net_rx.bound_rxq_list, XA_FLAGS_ALLOC); + + device = device_find_child(&priv->pci_dev->dev, "net", dev_is_class); + if (!device) { + err = -ENODEV; + goto out_destroy_xarray; + } + + netdev = to_net_dev(device); + if (!netdev) { + err = -ENODEV; + goto out_put_dev; + } + + for (rxq_idx = 0; rxq_idx < (sizeof(priv->create_flags) * 8); + rxq_idx++) { + if (!(priv->create_flags & (1ULL << rxq_idx))) + continue; + + if (rxq_idx >= netdev->num_rx_queues) { + err = -ERANGE; + goto out_release_rx; + } + + rxq = __netif_get_rx_queue(netdev, rxq_idx); + + err = xa_alloc(&priv->net_rx.bound_rxq_list, &xa_id, rxq, + xa_limit_32b, GFP_KERNEL); + if (err) + goto out_release_rx; + + /* We previously have done a dma_buf_attach(), which validates + * that the net_device we're trying to attach to can reach the + * dmabuf, so we don't need to check here as well. + */ + rxq->dmabuf_pages = file; + } + put_device(device); + return 0; + +out_release_rx: + dma_buf_pages_net_rx_release(priv, file); +out_put_dev: + put_device(device); +out_destroy_xarray: + xa_destroy(&priv->net_rx.bound_rxq_list); +out_destroy_pool: + gen_pool_destroy(priv->net_rx.page_pool); + return err; +} + +void dma_buf_pages_net_rx_free(struct dma_buf_pages *priv) +{ + xa_destroy(&priv->net_rx.bound_rxq_list); + gen_pool_destroy(priv->net_rx.page_pool); +} + +static unsigned long dma_buf_page_to_gen_pool_addr(struct page *page) +{ + struct dma_buf_pages *priv; + struct dev_pagemap *pgmap; + unsigned long offset; + + pgmap = page->pgmap; + priv = container_of(pgmap, struct dma_buf_pages, pgmap); + offset = page - priv->pages; + /* Offset + 1 is due to the fact that we want to avoid 0 virt address + * returned from the gen_pool. The gen_pool returns 0 on error, and virt + * address 0 is indistinguishable from an error. + */ + return (offset + 1) << PAGE_SHIFT; +} + +static struct page * +dma_buf_gen_pool_addr_to_page(unsigned long addr, struct dma_buf_pages *priv) +{ + /* - 1 is due to the fact that we want to avoid 0 virt address + * returned from the gen_pool. See comment in dma_buf_create_pages() + * for details. + */ + unsigned long offset = (addr >> PAGE_SHIFT) - 1; + return &priv->pages[offset]; +} + +void dma_buf_page_free_net_rx(struct dma_buf_pages *priv, struct page *page) +{ + unsigned long addr = dma_buf_page_to_gen_pool_addr(page); + + if (gen_pool_has_addr(priv->net_rx.page_pool, addr, PAGE_SIZE)) + gen_pool_free(priv->net_rx.page_pool, addr, PAGE_SIZE); +} + +const struct dma_buf_pages_type_ops net_rx_ops = { + .dma_buf_pages_init = dma_buf_pages_net_rx_init, + .dma_buf_pages_release = dma_buf_pages_net_rx_release, + .dma_buf_pages_destroy = dma_buf_pages_net_rx_free, + .dma_buf_page_free = dma_buf_page_free_net_rx, +}; + +struct page *dma_buf_pages_net_rx_alloc(struct dma_buf_pages *priv) +{ + unsigned long gen_pool_addr; + struct page *pg; + + if (!(priv->type & DMA_BUF_PAGES_NET_RX)) + return NULL; + + gen_pool_addr = gen_pool_alloc(priv->net_rx.page_pool, PAGE_SIZE); + if (!gen_pool_addr) + return NULL; + + if (!PAGE_ALIGNED(gen_pool_addr)) { + net_err_ratelimited("dmabuf page pool allocation not aligned"); + gen_pool_free(priv->net_rx.page_pool, gen_pool_addr, PAGE_SIZE); + return NULL; + } + + pg = dma_buf_gen_pool_addr_to_page(gen_pool_addr, priv); + + percpu_ref_get(&priv->pgmap.ref); + return pg; +} diff --git a/include/linux/dma-buf.h b/include/linux/dma-buf.h index 5789006180ea..e8e66d6407d0 100644 --- a/include/linux/dma-buf.h +++ b/include/linux/dma-buf.h @@ -22,6 +22,9 @@ #include <linux/fs.h> #include <linux/dma-fence.h> #include <linux/wait.h> +#include <linux/genalloc.h> +#include <linux/xarray.h> +#include <net/page_pool.h>
struct device; struct dma_buf; @@ -552,6 +555,11 @@ struct dma_buf_pages_type_ops { struct page *page); };
+struct dma_buf_pages_net_rx { + struct gen_pool *page_pool; + struct xarray bound_rxq_list; +}; + struct dma_buf_pages { /* fields for dmabuf */ struct dma_buf *dmabuf; @@ -568,6 +576,10 @@ struct dma_buf_pages { unsigned int type; const struct dma_buf_pages_type_ops *type_ops; __u64 create_flags; + + union { + struct dma_buf_pages_net_rx net_rx; + }; };
/** @@ -671,6 +683,8 @@ static inline bool is_dma_buf_pages_file(struct file *file) return file->f_op == &dma_buf_pages_fops; }
+struct page *dma_buf_pages_net_rx_alloc(struct dma_buf_pages *priv); + static inline bool is_dma_buf_page(struct page *page) { return (is_zone_device_page(page) && page->pgmap && @@ -718,6 +732,12 @@ static inline int dma_buf_map_sg(struct device *dev, struct scatterlist *sg, { return 0; } + +static inline struct page *dma_buf_pages_net_rx_alloc(struct dma_buf_pages *priv) +{ + return NULL; +} + #endif
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h index c2f0c6002a84..7a087ffa9baa 100644 --- a/include/linux/netdevice.h +++ b/include/linux/netdevice.h @@ -796,6 +796,7 @@ struct netdev_rx_queue { #ifdef CONFIG_XDP_SOCKETS struct xsk_buff_pool *pool; #endif + struct file __rcu *dmabuf_pages; } ____cacheline_aligned_in_smp;
/* diff --git a/include/uapi/linux/dma-buf.h b/include/uapi/linux/dma-buf.h index d0f63a2ab7e4..b392cef9d3c6 100644 --- a/include/uapi/linux/dma-buf.h +++ b/include/uapi/linux/dma-buf.h @@ -186,6 +186,8 @@ struct dma_buf_create_pages_info { __u64 create_flags; };
+#define DMA_BUF_PAGES_NET_RX (1 << 0) + #define DMA_BUF_CREATE_PAGES _IOW(DMA_BUF_BASE, 4, struct dma_buf_create_pages_info)
#endif
Used the paged attachment mappings support to create NET_TX pages. NET_TX pages can be used in the networking transmit path:
1. Create an iov_iter & bio_vec entries to represent this dmabuf.
2. Initialize the bio_vec with the backing dmabuf pages.
Signed-off-by: Mina Almasry almasrymina@google.com --- drivers/dma-buf/dma-buf.c | 47 ++++++++++++++++++++++++++++++++++++ include/linux/dma-buf.h | 7 ++++++ include/uapi/linux/dma-buf.h | 1 + 3 files changed, 55 insertions(+)
diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c index acb86bf406f4..3ca71297b9b4 100644 --- a/drivers/dma-buf/dma-buf.c +++ b/drivers/dma-buf/dma-buf.c @@ -1683,6 +1683,7 @@ static void dma_buf_pages_destroy(struct percpu_ref *ref) }
const struct dma_buf_pages_type_ops net_rx_ops; +const struct dma_buf_pages_type_ops net_tx_ops;
static long dma_buf_create_pages(struct file *file, struct dma_buf_create_pages_info *create_info) @@ -1799,6 +1800,9 @@ static long dma_buf_create_pages(struct file *file, case DMA_BUF_PAGES_NET_RX: priv->type_ops = &net_rx_ops; break; + case DMA_BUF_PAGES_NET_TX: + priv->type_ops = &net_tx_ops; + break; default: err = -EINVAL; goto out_put_new_file; @@ -2140,3 +2144,46 @@ struct page *dma_buf_pages_net_rx_alloc(struct dma_buf_pages *priv) percpu_ref_get(&priv->pgmap.ref); return pg; } + +/******************************** + * dma_buf_pages_net_tx * + ********************************/ + +static void dma_buf_pages_net_tx_release(struct dma_buf_pages *priv, + struct file *file) +{ + int i; + for (i = 0; i < priv->num_pages; i++) + put_page(&priv->pages[i]); +} + +static int dma_buf_pages_net_tx_init(struct dma_buf_pages *priv, + struct file *file) +{ + int i; + priv->net_tx.tx_bv = kvmalloc_array(priv->num_pages, + sizeof(struct bio_vec), GFP_KERNEL); + if (!priv->net_tx.tx_bv) + return -ENOMEM; + + for (i = 0; i < priv->num_pages; i++) { + priv->net_tx.tx_bv[i].bv_page = &priv->pages[i]; + priv->net_tx.tx_bv[i].bv_offset = 0; + priv->net_tx.tx_bv[i].bv_len = PAGE_SIZE; + } + percpu_ref_get_many(&priv->pgmap.ref, priv->num_pages); + iov_iter_bvec(&priv->net_tx.iter, WRITE, priv->net_tx.tx_bv, + priv->num_pages, priv->dmabuf->size); + return 0; +} + +static void dma_buf_pages_net_tx_free(struct dma_buf_pages *priv) +{ + kvfree(priv->net_tx.tx_bv); +} + +const struct dma_buf_pages_type_ops net_tx_ops = { + .dma_buf_pages_init = dma_buf_pages_net_tx_init, + .dma_buf_pages_release = dma_buf_pages_net_tx_release, + .dma_buf_pages_destroy = dma_buf_pages_net_tx_free, +}; diff --git a/include/linux/dma-buf.h b/include/linux/dma-buf.h index e8e66d6407d0..93228a2fec47 100644 --- a/include/linux/dma-buf.h +++ b/include/linux/dma-buf.h @@ -22,6 +22,7 @@ #include <linux/fs.h> #include <linux/dma-fence.h> #include <linux/wait.h> +#include <linux/uio.h> #include <linux/genalloc.h> #include <linux/xarray.h> #include <net/page_pool.h> @@ -555,6 +556,11 @@ struct dma_buf_pages_type_ops { struct page *page); };
+struct dma_buf_pages_net_tx { + struct iov_iter iter; + struct bio_vec *tx_bv; +}; + struct dma_buf_pages_net_rx { struct gen_pool *page_pool; struct xarray bound_rxq_list; @@ -579,6 +585,7 @@ struct dma_buf_pages {
union { struct dma_buf_pages_net_rx net_rx; + struct dma_buf_pages_net_tx net_tx; }; };
diff --git a/include/uapi/linux/dma-buf.h b/include/uapi/linux/dma-buf.h index b392cef9d3c6..546f211a7556 100644 --- a/include/uapi/linux/dma-buf.h +++ b/include/uapi/linux/dma-buf.h @@ -187,6 +187,7 @@ struct dma_buf_create_pages_info { };
#define DMA_BUF_PAGES_NET_RX (1 << 0) +#define DMA_BUF_PAGES_NET_TX (2 << 0)
#define DMA_BUF_CREATE_PAGES _IOW(DMA_BUF_BASE, 4, struct dma_buf_create_pages_info)
For device memory TCP, we expect the skb headers to be available in host memory for access, and we expect the skb frags to be in device memory and unaccessible to the host. We expect there to be no mixing and matching of device memory frags (unaccessible) with host memory frags (accessible) in the same skb.
Add a skb->devmem flag which indicates whether the frags in this skb are device memory frags or not.
__skb_fill_page_desc() & skb_fill_page_desc_noacc() now checks frags added to skbs for dmabuf pages, and marks the skb as skb->devmem if the page is a device memory page.
Add checks through the network stack to avoid accessing the frags of devmem skbs and avoid coallescing devmem skbs with non devmem skbs.
Signed-off-by: Mina Almasry almasrymina@google.com --- include/linux/skbuff.h | 15 +++++++++ include/net/tcp.h | 6 ++-- net/core/skbuff.c | 73 ++++++++++++++++++++++++++++++++++-------- net/ipv4/tcp.c | 3 ++ net/ipv4/tcp_input.c | 13 ++++++-- net/ipv4/tcp_output.c | 5 ++- net/packet/af_packet.c | 4 +-- 7 files changed, 97 insertions(+), 22 deletions(-)
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h index 0b40417457cd..f5e03aa84160 100644 --- a/include/linux/skbuff.h +++ b/include/linux/skbuff.h @@ -38,6 +38,7 @@ #endif #include <net/net_debug.h> #include <net/dropreason-core.h> +#include <linux/dma-buf.h>
/** * DOC: skb checksums @@ -805,6 +806,8 @@ typedef unsigned char *sk_buff_data_t; * @csum_level: indicates the number of consecutive checksums found in * the packet minus one that have been verified as * CHECKSUM_UNNECESSARY (max 3) + * @devmem: indicates that all the fragments in this skb is backed by + * device memory. * @dst_pending_confirm: need to confirm neighbour * @decrypted: Decrypted SKB * @slow_gro: state present at GRO time, slower prepare step required @@ -992,6 +995,7 @@ struct sk_buff { __u8 csum_not_inet:1; #endif
+ __u8 devmem:1; #ifdef CONFIG_NET_SCHED __u16 tc_index; /* traffic control index */ #endif @@ -1766,6 +1770,12 @@ static inline void skb_zcopy_downgrade_managed(struct sk_buff *skb) __skb_zcopy_downgrade_managed(skb); }
+/* Return true if frags in this skb are not readable by the host. */ +static inline bool skb_frags_not_readable(const struct sk_buff *skb) +{ + return skb->devmem; +} + static inline void skb_mark_not_on_list(struct sk_buff *skb) { skb->next = NULL; @@ -2469,6 +2479,8 @@ static inline void __skb_fill_page_desc(struct sk_buff *skb, int i, page = compound_head(page); if (page_is_pfmemalloc(page)) skb->pfmemalloc = true; + if (is_dma_buf_page(page)) + skb->devmem = true; }
/** @@ -2511,6 +2523,9 @@ static inline void skb_fill_page_desc_noacc(struct sk_buff *skb, int i,
__skb_fill_page_desc_noacc(shinfo, i, page, off, size); shinfo->nr_frags = i + 1; + + if (is_dma_buf_page(page)) + skb->devmem = true; }
void skb_add_rx_frag(struct sk_buff *skb, int i, struct page *page, int off, diff --git a/include/net/tcp.h b/include/net/tcp.h index 5066e4586cf0..6d86ed3736ad 100644 --- a/include/net/tcp.h +++ b/include/net/tcp.h @@ -986,7 +986,7 @@ static inline int tcp_skb_mss(const struct sk_buff *skb)
static inline bool tcp_skb_can_collapse_to(const struct sk_buff *skb) { - return likely(!TCP_SKB_CB(skb)->eor); + return likely(!TCP_SKB_CB(skb)->eor && !skb_frags_not_readable(skb)); }
static inline bool tcp_skb_can_collapse(const struct sk_buff *to, @@ -994,7 +994,9 @@ static inline bool tcp_skb_can_collapse(const struct sk_buff *to, { return likely(tcp_skb_can_collapse_to(to) && mptcp_skb_can_collapse(to, from) && - skb_pure_zcopy_same(to, from)); + skb_pure_zcopy_same(to, from) && + skb_frags_not_readable(to) == + skb_frags_not_readable(from)); }
/* Events passed to congestion control interface */ diff --git a/net/core/skbuff.c b/net/core/skbuff.c index cea28d30abb5..9b83da794641 100644 --- a/net/core/skbuff.c +++ b/net/core/skbuff.c @@ -1191,11 +1191,16 @@ void skb_dump(const char *level, const struct sk_buff *skb, bool full_pkt) skb_frag_size(frag), p, p_off, p_len, copied) { seg_len = min_t(int, p_len, len); - vaddr = kmap_atomic(p); - print_hex_dump(level, "skb frag: ", - DUMP_PREFIX_OFFSET, - 16, 1, vaddr + p_off, seg_len, false); - kunmap_atomic(vaddr); + if (!is_dma_buf_page(p)) { + vaddr = kmap_atomic(p); + print_hex_dump(level, "skb frag: ", + DUMP_PREFIX_OFFSET, 16, 1, + vaddr + p_off, seg_len, false); + kunmap_atomic(vaddr); + } else { + printk("%sskb frag: devmem", level); + } + len -= seg_len; if (!len) break; @@ -1764,6 +1769,9 @@ int skb_copy_ubufs(struct sk_buff *skb, gfp_t gfp_mask) if (skb_shared(skb) || skb_unclone(skb, gfp_mask)) return -EINVAL;
+ if (skb_frags_not_readable(skb)) + return -EFAULT; + if (!num_frags) goto release;
@@ -1934,8 +1942,10 @@ struct sk_buff *skb_copy(const struct sk_buff *skb, gfp_t gfp_mask) { int headerlen = skb_headroom(skb); unsigned int size = skb_end_offset(skb) + skb->data_len; - struct sk_buff *n = __alloc_skb(size, gfp_mask, - skb_alloc_rx_flag(skb), NUMA_NO_NODE); + struct sk_buff *n = skb_frags_not_readable(skb) ? NULL : + __alloc_skb(size, gfp_mask, + skb_alloc_rx_flag(skb), + NUMA_NO_NODE);
if (!n) return NULL; @@ -2266,9 +2276,10 @@ struct sk_buff *skb_copy_expand(const struct sk_buff *skb, /* * Allocate the copy buffer */ - struct sk_buff *n = __alloc_skb(newheadroom + skb->len + newtailroom, - gfp_mask, skb_alloc_rx_flag(skb), - NUMA_NO_NODE); + struct sk_buff *n = skb_frags_not_readable(skb) ? NULL : + __alloc_skb(newheadroom + skb->len + newtailroom, + gfp_mask, skb_alloc_rx_flag(skb), + NUMA_NO_NODE); int oldheadroom = skb_headroom(skb); int head_copy_len, head_copy_off;
@@ -2609,6 +2620,9 @@ void *__pskb_pull_tail(struct sk_buff *skb, int delta) */ int i, k, eat = (skb->tail + delta) - skb->end;
+ if (skb_frags_not_readable(skb)) + return NULL; + if (eat > 0 || skb_cloned(skb)) { if (pskb_expand_head(skb, 0, eat > 0 ? eat + 128 : 0, GFP_ATOMIC)) @@ -2762,6 +2776,9 @@ int skb_copy_bits(const struct sk_buff *skb, int offset, void *to, int len) to += copy; }
+ if (skb_frags_not_readable(skb)) + goto fault; + for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) { int end; skb_frag_t *f = &skb_shinfo(skb)->frags[i]; @@ -2835,7 +2852,7 @@ static struct page *linear_to_page(struct page *page, unsigned int *len, { struct page_frag *pfrag = sk_page_frag(sk);
- if (!sk_page_frag_refill(sk, pfrag)) + if (!sk_page_frag_refill(sk, pfrag) || is_dma_buf_page(pfrag->page)) return NULL;
*len = min_t(unsigned int, *len, pfrag->size - pfrag->offset); @@ -3164,6 +3181,9 @@ int skb_store_bits(struct sk_buff *skb, int offset, const void *from, int len) from += copy; }
+ if (skb_frags_not_readable(skb)) + goto fault; + for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) { skb_frag_t *frag = &skb_shinfo(skb)->frags[i]; int end; @@ -3243,6 +3263,9 @@ __wsum __skb_checksum(const struct sk_buff *skb, int offset, int len, pos = copy; }
+ if (skb_frags_not_readable(skb)) + return 0; + for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) { int end; skb_frag_t *frag = &skb_shinfo(skb)->frags[i]; @@ -3343,6 +3366,9 @@ __wsum skb_copy_and_csum_bits(const struct sk_buff *skb, int offset, pos = copy; }
+ if (skb_frags_not_readable(skb)) + return 0; + for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) { int end;
@@ -3800,7 +3826,9 @@ static inline void skb_split_inside_header(struct sk_buff *skb, skb_shinfo(skb1)->frags[i] = skb_shinfo(skb)->frags[i];
skb_shinfo(skb1)->nr_frags = skb_shinfo(skb)->nr_frags; + skb1->devmem = skb->devmem; skb_shinfo(skb)->nr_frags = 0; + skb->devmem = 0; skb1->data_len = skb->data_len; skb1->len += skb1->data_len; skb->data_len = 0; @@ -3814,11 +3842,13 @@ static inline void skb_split_no_header(struct sk_buff *skb, { int i, k = 0; const int nfrags = skb_shinfo(skb)->nr_frags; + const int devmem = skb->devmem;
skb_shinfo(skb)->nr_frags = 0; skb1->len = skb1->data_len = skb->len - len; skb->len = len; skb->data_len = len - pos; + skb->devmem = skb1->devmem = 0;
for (i = 0; i < nfrags; i++) { int size = skb_frag_size(&skb_shinfo(skb)->frags[i]); @@ -3847,6 +3877,12 @@ static inline void skb_split_no_header(struct sk_buff *skb, pos += size; } skb_shinfo(skb1)->nr_frags = k; + + if (skb_shinfo(skb)->nr_frags) + skb->devmem = devmem; + + if (skb_shinfo(skb1)->nr_frags) + skb1->devmem = devmem; }
/** @@ -4082,6 +4118,9 @@ unsigned int skb_seq_read(unsigned int consumed, const u8 **data, return block_limit - abs_offset; }
+ if (skb_frags_not_readable(st->cur_skb)) + return 0; + if (st->frag_idx == 0 && !st->frag_data) st->stepped_offset += skb_headlen(st->cur_skb);
@@ -5681,7 +5720,10 @@ bool skb_try_coalesce(struct sk_buff *to, struct sk_buff *from, (from->pp_recycle && skb_cloned(from))) return false;
- if (len <= skb_tailroom(to)) { + if (skb_frags_not_readable(from) != skb_frags_not_readable(to)) + return false; + + if (len <= skb_tailroom(to) && !skb_frags_not_readable(from)) { if (len) BUG_ON(skb_copy_bits(from, 0, skb_put(to, len), len)); *delta_truesize = 0; @@ -5997,6 +6039,9 @@ int skb_ensure_writable(struct sk_buff *skb, unsigned int write_len) if (!pskb_may_pull(skb, write_len)) return -ENOMEM;
+ if (skb_frags_not_readable(skb)) + return -EFAULT; + if (!skb_cloned(skb) || skb_clone_writable(skb, write_len)) return 0;
@@ -6656,8 +6701,8 @@ EXPORT_SYMBOL(pskb_extract); void skb_condense(struct sk_buff *skb) { if (skb->data_len) { - if (skb->data_len > skb->end - skb->tail || - skb_cloned(skb)) + if (skb->data_len > skb->end - skb->tail || skb_cloned(skb) || + skb_frags_not_readable(skb)) return;
/* Nice, we can free page frag(s) right now */ diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c index 8d20d9221238..51e8d5872670 100644 --- a/net/ipv4/tcp.c +++ b/net/ipv4/tcp.c @@ -4520,6 +4520,9 @@ int tcp_md5_hash_skb_data(struct tcp_md5sig_pool *hp, if (crypto_ahash_update(req)) return 1;
+ if (skb_frags_not_readable(skb)) + return 1; + for (i = 0; i < shi->nr_frags; ++i) { const skb_frag_t *f = &shi->frags[i]; unsigned int offset = skb_frag_off(f); diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c index bf8b22218dd4..8d28d96a3c24 100644 --- a/net/ipv4/tcp_input.c +++ b/net/ipv4/tcp_input.c @@ -5188,6 +5188,9 @@ tcp_collapse(struct sock *sk, struct sk_buff_head *list, struct rb_root *root, for (end_of_skbs = true; skb != NULL && skb != tail; skb = n) { n = tcp_skb_next(skb, list);
+ if (skb_frags_not_readable(skb)) + goto skip_this; + /* No new bits? It is possible on ofo queue. */ if (!before(start, TCP_SKB_CB(skb)->end_seq)) { skb = tcp_collapse_one(sk, skb, list, root); @@ -5208,17 +5211,20 @@ tcp_collapse(struct sock *sk, struct sk_buff_head *list, struct rb_root *root, break; }
- if (n && n != tail && mptcp_skb_can_collapse(skb, n) && + if (n && n != tail && !skb_frags_not_readable(n) && + mptcp_skb_can_collapse(skb, n) && TCP_SKB_CB(skb)->end_seq != TCP_SKB_CB(n)->seq) { end_of_skbs = false; break; }
+skip_this: /* Decided to skip this, advance start seq. */ start = TCP_SKB_CB(skb)->end_seq; } if (end_of_skbs || - (TCP_SKB_CB(skb)->tcp_flags & (TCPHDR_SYN | TCPHDR_FIN))) + (TCP_SKB_CB(skb)->tcp_flags & (TCPHDR_SYN | TCPHDR_FIN)) || + skb_frags_not_readable(skb)) return;
__skb_queue_head_init(&tmp); @@ -5262,7 +5268,8 @@ tcp_collapse(struct sock *sk, struct sk_buff_head *list, struct rb_root *root, if (!skb || skb == tail || !mptcp_skb_can_collapse(nskb, skb) || - (TCP_SKB_CB(skb)->tcp_flags & (TCPHDR_SYN | TCPHDR_FIN))) + (TCP_SKB_CB(skb)->tcp_flags & (TCPHDR_SYN | TCPHDR_FIN)) || + skb_frags_not_readable(skb)) goto end; #ifdef CONFIG_TLS_DEVICE if (skb->decrypted != nskb->decrypted) diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c index cfe128b81a01..eddade864c7f 100644 --- a/net/ipv4/tcp_output.c +++ b/net/ipv4/tcp_output.c @@ -2310,7 +2310,8 @@ static bool tcp_can_coalesce_send_queue_head(struct sock *sk, int len)
if (unlikely(TCP_SKB_CB(skb)->eor) || tcp_has_tx_tstamp(skb) || - !skb_pure_zcopy_same(skb, next)) + !skb_pure_zcopy_same(skb, next) || + skb->devmem != next->devmem) return false;
len -= skb->len; @@ -3087,6 +3088,8 @@ static bool tcp_can_collapse(const struct sock *sk, const struct sk_buff *skb) return false; if (skb_cloned(skb)) return false; + if (skb_frags_not_readable(skb)) + return false; /* Some heuristics for collapsing over SACK'd could be invented */ if (TCP_SKB_CB(skb)->sacked & TCPCB_SACKED_ACKED) return false; diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c index a2dbeb264f26..9b31f688163c 100644 --- a/net/packet/af_packet.c +++ b/net/packet/af_packet.c @@ -2152,7 +2152,7 @@ static int packet_rcv(struct sk_buff *skb, struct net_device *dev, } }
- snaplen = skb->len; + snaplen = skb_frags_not_readable(skb) ? skb_headlen(skb) : skb->len;
res = run_filter(skb, sk, snaplen); if (!res) @@ -2275,7 +2275,7 @@ static int tpacket_rcv(struct sk_buff *skb, struct net_device *dev, } }
- snaplen = skb->len; + snaplen = skb_frags_not_readable(skb) ? skb_headlen(skb) : skb->len;
res = run_filter(skb, sk, snaplen); if (!res)
In tcp_recvmsg_locked(), detect if the skb being received by the user is a devmem skb. In this case - if the user provided the MSG_SOCK_DEVMEM flag - pass it to tcp_recvmsg_devmem() for custom handling.
tcp_recvmsg_devmem() copies any data in the skb header to the linear buffer, and returns a cmsg to the user indicating the number of bytes returned in the linear buffer.
tcp_recvmsg_devmem() then loops over the unaccessible devmem skb frags, and returns to the user a cmsg_devmem indicating the location of the data in the dmabuf device memory. cmsg_devmem contains this information:
1. the offset into the dmabuf where the payload starts. 'frag_offset'. 2. the size of the frag. 'frag_size'. 3. an opaque token 'frag_token' to return to the kernel when the buffer is to be released.
The pages awaiting freeing are stored in the newly added sk->sk_pagepool, and each page passed to userspace is get_page()'d. This reference is dropped once the userspace indicates that it is done reading this page. All pages are released when the socket is destroyed.
Signed-off-by: Mina Almasry almasrymina@google.com --- include/linux/socket.h | 1 + include/net/sock.h | 2 + include/uapi/asm-generic/socket.h | 5 + include/uapi/linux/uio.h | 6 + net/core/datagram.c | 3 + net/ipv4/tcp.c | 186 +++++++++++++++++++++++++++++- net/ipv4/tcp_ipv4.c | 8 ++ 7 files changed, 209 insertions(+), 2 deletions(-)
diff --git a/include/linux/socket.h b/include/linux/socket.h index 13c3a237b9c9..12905b2f1215 100644 --- a/include/linux/socket.h +++ b/include/linux/socket.h @@ -326,6 +326,7 @@ struct ucred { * plain text and require encryption */
+#define MSG_SOCK_DEVMEM 0x2000000 /* Receive devmem skbs as cmsg */ #define MSG_ZEROCOPY 0x4000000 /* Use user data in kernel path */ #define MSG_FASTOPEN 0x20000000 /* Send data in TCP SYN */ #define MSG_CMSG_CLOEXEC 0x40000000 /* Set close_on_exec for file diff --git a/include/net/sock.h b/include/net/sock.h index 6f428a7f3567..c615666ff19a 100644 --- a/include/net/sock.h +++ b/include/net/sock.h @@ -353,6 +353,7 @@ struct sk_filter; * @sk_txtime_unused: unused txtime flags * @ns_tracker: tracker for netns reference * @sk_bind2_node: bind node in the bhash2 table + * @sk_pagepool: page pool associated with this socket. */ struct sock { /* @@ -545,6 +546,7 @@ struct sock { struct rcu_head sk_rcu; netns_tracker ns_tracker; struct hlist_node sk_bind2_node; + struct xarray sk_pagepool; };
enum sk_pacing { diff --git a/include/uapi/asm-generic/socket.h b/include/uapi/asm-generic/socket.h index 638230899e98..88f9234f78cb 100644 --- a/include/uapi/asm-generic/socket.h +++ b/include/uapi/asm-generic/socket.h @@ -132,6 +132,11 @@
#define SO_RCVMARK 75
+#define SO_DEVMEM_HEADER 98 +#define SCM_DEVMEM_HEADER SO_DEVMEM_HEADER +#define SO_DEVMEM_OFFSET 99 +#define SCM_DEVMEM_OFFSET SO_DEVMEM_OFFSET + #if !defined(__KERNEL__)
#if __BITS_PER_LONG == 64 || (defined(__x86_64__) && defined(__ILP32__)) diff --git a/include/uapi/linux/uio.h b/include/uapi/linux/uio.h index 059b1a9147f4..8b0be0f50838 100644 --- a/include/uapi/linux/uio.h +++ b/include/uapi/linux/uio.h @@ -20,6 +20,12 @@ struct iovec __kernel_size_t iov_len; /* Must be size_t (1003.1g) */ };
+struct cmsg_devmem { + __u32 frag_offset; + __u32 frag_size; + __u32 frag_token; +}; + /* * UIO_MAXIOV shall be at least 16 1003.1g (5.4.1.1) */ diff --git a/net/core/datagram.c b/net/core/datagram.c index 176eb5834746..3a82598aa6ed 100644 --- a/net/core/datagram.c +++ b/net/core/datagram.c @@ -455,6 +455,9 @@ static int __skb_datagram_iter(const struct sk_buff *skb, int offset, skb_walk_frags(skb, frag_iter) { int end;
+ if (frag_iter->devmem) + goto short_copy; + WARN_ON(start > offset + len);
end = start + frag_iter->len; diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c index 51e8d5872670..a894b8a9dbb0 100644 --- a/net/ipv4/tcp.c +++ b/net/ipv4/tcp.c @@ -279,6 +279,7 @@ #include <linux/uaccess.h> #include <asm/ioctls.h> #include <net/busy_poll.h> +#include <linux/dma-buf.h>
/* Track pending CMSGs. */ enum { @@ -460,6 +461,7 @@ void tcp_init_sock(struct sock *sk)
set_bit(SOCK_SUPPORT_ZC, &sk->sk_socket->flags); sk_sockets_allocated_inc(sk); + xa_init_flags(&sk->sk_pagepool, XA_FLAGS_ALLOC); } EXPORT_SYMBOL(tcp_init_sock);
@@ -2408,6 +2410,165 @@ static int tcp_inq_hint(struct sock *sk) return inq; }
+static int tcp_recvmsg_devmem(const struct sock *sk, const struct sk_buff *skb, + unsigned int offset, struct msghdr *msg, int len) +{ + unsigned int start = skb_headlen(skb); + struct cmsg_devmem cmsg_devmem = { 0 }; + unsigned int tokens_added_idx = 0; + int i, copy = start - offset, n; + struct sk_buff *frag_iter; + u32 *tokens_added; + int err = 0; + + if (!skb->devmem) + return -ENODEV; + + tokens_added = kzalloc(sizeof(u32) * skb_shinfo(skb)->nr_frags, + GFP_KERNEL); + + if (!tokens_added) + return -ENOMEM; + + /* Copy header. */ + if (copy > 0) { + copy = min(copy, len); + + n = copy_to_iter(skb->data + offset, copy, &msg->msg_iter); + if (n != copy) { + err = -EFAULT; + goto err_release_pages; + } + + offset += copy; + len -= copy; + + /* First a cmsg_devmem for # bytes copied to user buffer */ + cmsg_devmem.frag_size = copy; + err = put_cmsg(msg, SOL_SOCKET, SO_DEVMEM_HEADER, + sizeof(cmsg_devmem), &cmsg_devmem); + if (err) + goto err_release_pages; + + if (len == 0) + goto out; + } + + /* after that, send information of devmem pages through a sequence + * of cmsg + */ + for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) { + const skb_frag_t *frag = &skb_shinfo(skb)->frags[i]; + struct page *page = skb_frag_page(frag); + struct dma_buf_pages *priv; + u32 user_token, frag_offset; + struct page *dmabuf_pages; + int end; + + /* skb->devmem should indicate that ALL the pages in this skb + * are dma buf pages. We're checking for that flag above, but + * also check individual pages here. If the driver is not + * setting skb->devmem correctly, we still don't want to crash + * here when accessing pgmap or priv below. + */ + if (!is_dma_buf_page(page)) { + net_err_ratelimited("Found non-devmem skb with dma_buf " + "page"); + err = -ENODEV; + goto err_release_pages; + } + + end = start + skb_frag_size(frag); + copy = end - offset; + memset(&cmsg_devmem, 0, sizeof(cmsg_devmem)); + + if (copy > 0) { + copy = min(copy, len); + + priv = (struct dma_buf_pages *)page->pp->mp_priv; + + dmabuf_pages = priv->pages; + frag_offset = ((page - dmabuf_pages) << PAGE_SHIFT) + + skb_frag_off(frag) + offset - start; + cmsg_devmem.frag_offset = frag_offset; + cmsg_devmem.frag_size = copy; + err = xa_alloc((struct xarray *)&sk->sk_pagepool, + &user_token, page, xa_limit_31b, + GFP_KERNEL); + if (err) + goto err_release_pages; + + tokens_added[tokens_added_idx++] = user_token; + + get_page(page); + cmsg_devmem.frag_token = user_token; + + offset += copy; + len -= copy; + + err = put_cmsg(msg, SOL_SOCKET, SO_DEVMEM_OFFSET, + sizeof(cmsg_devmem), &cmsg_devmem); + if (err) { + put_page(page); + goto err_release_pages; + } + + if (len == 0) + goto out; + } + start = end; + } + + if (!len) + goto out; + + /* if len is not satisfied yet, we need to skb_walk_frags() to satisfy + * len + */ + skb_walk_frags(skb, frag_iter) + { + int end; + + if (!frag_iter->devmem) { + err = -EFAULT; + goto err_release_pages; + } + + WARN_ON(start > offset + len); + end = start + frag_iter->len; + copy = end - offset; + if (copy > 0) { + if (copy > len) + copy = len; + err = tcp_recvmsg_devmem(sk, frag_iter, offset - start, + msg, copy); + if (err) + goto err_release_pages; + len -= copy; + if (len == 0) + goto out; + offset += copy; + } + start = end; + } + + if (len) { + err = -EFAULT; + goto err_release_pages; + } + + goto out; + +err_release_pages: + for (i = 0; i < tokens_added_idx; i++) + put_page(xa_erase((struct xarray *)&sk->sk_pagepool, + tokens_added[i])); + +out: + kfree(tokens_added); + return err; +} + /* * This routine copies from a sock struct into the user buffer. * @@ -2428,7 +2589,7 @@ static int tcp_recvmsg_locked(struct sock *sk, struct msghdr *msg, size_t len, int err; int target; /* Read at least this many bytes */ long timeo; - struct sk_buff *skb, *last; + struct sk_buff *skb, *last, *skb_last_copied = NULL; u32 urg_hole = 0;
err = -ENOTCONN; @@ -2593,7 +2754,27 @@ static int tcp_recvmsg_locked(struct sock *sk, struct msghdr *msg, size_t len, } }
- if (!(flags & MSG_TRUNC)) { + if (skb_last_copied && skb_last_copied->devmem != skb->devmem) + break; + + if (skb->devmem) { + if (!(flags & MSG_SOCK_DEVMEM)) { + /* skb->devmem skbs can only be received with + * the MSG_SOCK_DEVMEM flag. + */ + + copied = -EFAULT; + break; + } + + err = tcp_recvmsg_devmem(sk, skb, offset, msg, used); + if (err) { + if (!copied) + copied = -EFAULT; + break; + } + skb_last_copied = skb; + } else if (!(flags & MSG_TRUNC)) { err = skb_copy_datagram_msg(skb, offset, msg, used); if (err) { /* Exception. Bailout! */ @@ -2601,6 +2782,7 @@ static int tcp_recvmsg_locked(struct sock *sk, struct msghdr *msg, size_t len, copied = -EFAULT; break; } + skb_last_copied = skb; }
WRITE_ONCE(*seq, *seq + used); diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c index 06d2573685ca..d7dee38e0410 100644 --- a/net/ipv4/tcp_ipv4.c +++ b/net/ipv4/tcp_ipv4.c @@ -2291,6 +2291,14 @@ void tcp_v4_destroy_sock(struct sock *sk) { struct tcp_sock *tp = tcp_sk(sk);
+ unsigned long index; + struct page *page; + + xa_for_each(&sk->sk_pagepool, index, page) + put_page(page); + + xa_destroy(&sk->sk_pagepool); + trace_tcp_destroy_sock(sk);
tcp_clear_xmit_timers(sk);
Add an interface for the user to notify the kernel that it is done reading the NET_RX dmabuf pages returned as cmsg. The kernel will drop the reference on the NET_RX pages to make them available for re-use.
Signed-off-by: Mina Almasry almasrymina@google.com --- include/uapi/asm-generic/socket.h | 1 + include/uapi/linux/uio.h | 4 +++ net/core/sock.c | 41 +++++++++++++++++++++++++++++++ 3 files changed, 46 insertions(+)
diff --git a/include/uapi/asm-generic/socket.h b/include/uapi/asm-generic/socket.h index 88f9234f78cb..2a5a7f5da358 100644 --- a/include/uapi/asm-generic/socket.h +++ b/include/uapi/asm-generic/socket.h @@ -132,6 +132,7 @@
#define SO_RCVMARK 75
+#define SO_DEVMEM_DONTNEED 97 #define SO_DEVMEM_HEADER 98 #define SCM_DEVMEM_HEADER SO_DEVMEM_HEADER #define SO_DEVMEM_OFFSET 99 diff --git a/include/uapi/linux/uio.h b/include/uapi/linux/uio.h index 8b0be0f50838..faaa765fd5a4 100644 --- a/include/uapi/linux/uio.h +++ b/include/uapi/linux/uio.h @@ -26,6 +26,10 @@ struct cmsg_devmem { __u32 frag_token; };
+struct devmemtoken { + __u32 token_start; + __u32 token_count; +}; /* * UIO_MAXIOV shall be at least 16 1003.1g (5.4.1.1) */ diff --git a/net/core/sock.c b/net/core/sock.c index 24f2761bdb1d..f9b9d9ec7322 100644 --- a/net/core/sock.c +++ b/net/core/sock.c @@ -1531,7 +1531,48 @@ int sk_setsockopt(struct sock *sk, int level, int optname, /* Paired with READ_ONCE() in tcp_rtx_synack() */ WRITE_ONCE(sk->sk_txrehash, (u8)val); break; + case SO_DEVMEM_DONTNEED: { + struct devmemtoken tokens[128]; + unsigned int num_tokens, i, j;
+ if (sk->sk_type != SOCK_STREAM || + sk->sk_protocol != IPPROTO_TCP) { + ret = -EBADF; + break; + } + + if (optlen % sizeof(struct devmemtoken) || + optlen > sizeof(tokens)) { + ret = -EINVAL; + break; + } + + num_tokens = optlen / sizeof(struct devmemtoken); + if (copy_from_sockptr(tokens, optval, optlen)) { + ret = -EFAULT; + break; + } + + ret = 0; + + for (i = 0; i < num_tokens; i++) { + for (j = 0; j < tokens[i].token_count; j++) { + struct page *pg = xa_erase(&sk->sk_pagepool, + tokens[i].token_start + j); + + if (pg) + put_page(pg); + else + /* -EINTR here notifies the userspace + * that not all tokens passed to it have + * been freed. + */ + ret = -EINTR; + } + } + + break; + } default: ret = -ENOPROTOOPT; break;
On 7/10/23 15:32, Mina Almasry wrote:
Add an interface for the user to notify the kernel that it is done reading the NET_RX dmabuf pages returned as cmsg. The kernel will drop the reference on the NET_RX pages to make them available for re-use.
Signed-off-by: Mina Almasry almasrymina@google.com
for (i = 0; i < num_tokens; i++) {
for (j = 0; j < tokens[i].token_count; j++) {
struct page *pg = xa_erase(&sk->sk_pagepool,
tokens[i].token_start + j);
if (pg)
put_page(pg);
else
/* -EINTR here notifies the userspace
* that not all tokens passed to it have
* been freed.
*/
ret = -EINTR;
Unless I'm missing something, this type of error reporting is unrecoverable -- userspace doesn't know how many tokens have been freed.
I think you should either make it explicitly unrecoverable (somehow shut down dmabuf handling entirely) or tell userspace how many tokens were successfully freed.
--Andy
On Sun, Jul 16, 2023 at 4:57 PM Andy Lutomirski luto@kernel.org wrote:
On 7/10/23 15:32, Mina Almasry wrote:
Add an interface for the user to notify the kernel that it is done reading the NET_RX dmabuf pages returned as cmsg. The kernel will drop the reference on the NET_RX pages to make them available for re-use.
Signed-off-by: Mina Almasry almasrymina@google.com
for (i = 0; i < num_tokens; i++) {
for (j = 0; j < tokens[i].token_count; j++) {
struct page *pg = xa_erase(&sk->sk_pagepool,
tokens[i].token_start + j);
if (pg)
put_page(pg);
else
/* -EINTR here notifies the userspace
* that not all tokens passed to it have
* been freed.
*/
ret = -EINTR;
Unless I'm missing something, this type of error reporting is unrecoverable -- userspace doesn't know how many tokens have been freed.
I think you should either make it explicitly unrecoverable (somehow shut down dmabuf handling entirely) or tell userspace how many tokens were successfully freed.
Thank you, the latter suggestion sounds perfect actually. The user can't do much if the kernel fails to free all the tokens, but at least it can know that something wrong is happening and can log an error for further debugging. Great suggestion! Thanks!
For device memory TCP, we let the user provide the kernel with a cmsg container 2 items:
1. the dmabuf pages fd that the user would like to send data from. 2. the offset into this dmabuf that the user would like to start sending from.
In tcp_sendmsg_locked(), if this cmsg is provided, we send the data using the dmabuf NET_TX pages bio_vec.
Also provide drivers with a new skb_devmem_frag_dma_map() helper. This helper is similar to skb_frag_dma_map(), but it first checks whether the frag being mapped is backed by dmabuf NET_TX pages, and provides the correct dma_addr if so.
Signed-off-by: Mina Almasry almasrymina@google.com --- include/linux/skbuff.h | 19 +++++++++-- include/net/sock.h | 2 ++ net/core/skbuff.c | 8 ++--- net/core/sock.c | 6 ++++ net/ipv4/tcp.c | 73 +++++++++++++++++++++++++++++++++++++++++- 5 files changed, 101 insertions(+), 7 deletions(-)
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h index f5e03aa84160..ad4e7bfcab07 100644 --- a/include/linux/skbuff.h +++ b/include/linux/skbuff.h @@ -1660,8 +1660,8 @@ static inline int skb_zerocopy_iter_dgram(struct sk_buff *skb, }
int skb_zerocopy_iter_stream(struct sock *sk, struct sk_buff *skb, - struct msghdr *msg, int len, - struct ubuf_info *uarg); + struct msghdr *msg, struct iov_iter *iov_iter, + int len, struct ubuf_info *uarg);
/* Internal */ #define skb_shinfo(SKB) ((struct skb_shared_info *)(skb_end_pointer(SKB))) @@ -3557,6 +3557,21 @@ static inline dma_addr_t skb_frag_dma_map(struct device *dev, skb_frag_off(frag) + offset, size, dir); }
+/* Similar to skb_frag_dma_map, but handles devmem skbs correctly. */ +static inline dma_addr_t skb_devmem_frag_dma_map(struct device *dev, + const struct sk_buff *skb, + const skb_frag_t *frag, + size_t offset, size_t size, + enum dma_data_direction dir) +{ + if (unlikely(skb->devmem && is_dma_buf_page(skb_frag_page(frag)))) { + dma_addr_t dma_addr = + dma_buf_page_to_dma_addr(skb_frag_page(frag)); + return dma_addr + skb_frag_off(frag) + offset; + } + return skb_frag_dma_map(dev, frag, offset, size, dir); +} + static inline struct sk_buff *pskb_copy(struct sk_buff *skb, gfp_t gfp_mask) { diff --git a/include/net/sock.h b/include/net/sock.h index c615666ff19a..733865f89635 100644 --- a/include/net/sock.h +++ b/include/net/sock.h @@ -1890,6 +1890,8 @@ struct sockcm_cookie { u64 transmit_time; u32 mark; u32 tsflags; + u32 devmem_fd; + u32 devmem_offset; };
static inline void sockcm_init(struct sockcm_cookie *sockc, diff --git a/net/core/skbuff.c b/net/core/skbuff.c index 9b83da794641..b1e28e7ad6a8 100644 --- a/net/core/skbuff.c +++ b/net/core/skbuff.c @@ -1685,8 +1685,8 @@ void msg_zerocopy_put_abort(struct ubuf_info *uarg, bool have_uref) EXPORT_SYMBOL_GPL(msg_zerocopy_put_abort);
int skb_zerocopy_iter_stream(struct sock *sk, struct sk_buff *skb, - struct msghdr *msg, int len, - struct ubuf_info *uarg) + struct msghdr *msg, struct iov_iter *iov_iter, + int len, struct ubuf_info *uarg) { struct ubuf_info *orig_uarg = skb_zcopy(skb); int err, orig_len = skb->len; @@ -1697,12 +1697,12 @@ int skb_zerocopy_iter_stream(struct sock *sk, struct sk_buff *skb, if (orig_uarg && uarg != orig_uarg) return -EEXIST;
- err = __zerocopy_sg_from_iter(msg, sk, skb, &msg->msg_iter, len); + err = __zerocopy_sg_from_iter(msg, sk, skb, iov_iter, len); if (err == -EFAULT || (err == -EMSGSIZE && skb->len == orig_len)) { struct sock *save_sk = skb->sk;
/* Streams do not free skb on error. Reset to prev state. */ - iov_iter_revert(&msg->msg_iter, skb->len - orig_len); + iov_iter_revert(iov_iter, skb->len - orig_len); skb->sk = sk; ___pskb_trim(skb, orig_len); skb->sk = save_sk; diff --git a/net/core/sock.c b/net/core/sock.c index f9b9d9ec7322..854624bee5d0 100644 --- a/net/core/sock.c +++ b/net/core/sock.c @@ -2813,6 +2813,12 @@ int __sock_cmsg_send(struct sock *sk, struct cmsghdr *cmsg, return -EINVAL; sockc->transmit_time = get_unaligned((u64 *)CMSG_DATA(cmsg)); break; + case SCM_DEVMEM_OFFSET: + if (cmsg->cmsg_len != CMSG_LEN(2 * sizeof(u32))) + return -EINVAL; + sockc->devmem_fd = ((u32 *)CMSG_DATA(cmsg))[0]; + sockc->devmem_offset = ((u32 *)CMSG_DATA(cmsg))[1]; + break; /* SCM_RIGHTS and SCM_CREDENTIALS are semantically in SOL_UNIX. */ case SCM_RIGHTS: case SCM_CREDENTIALS: diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c index a894b8a9dbb0..85d6cdc832ef 100644 --- a/net/ipv4/tcp.c +++ b/net/ipv4/tcp.c @@ -280,6 +280,7 @@ #include <asm/ioctls.h> #include <net/busy_poll.h> #include <linux/dma-buf.h> +#include <uapi/linux/dma-buf.h>
/* Track pending CMSGs. */ enum { @@ -1216,6 +1217,52 @@ int tcp_sendmsg_fastopen(struct sock *sk, struct msghdr *msg, int *copied, return err; }
+static int tcp_prepare_devmem_data(struct msghdr *msg, int devmem_fd, + unsigned int devmem_offset, + struct file **devmem_file, + struct iov_iter *devmem_tx_iter, size_t size) +{ + struct dma_buf_pages *priv; + int err = 0; + + *devmem_file = fget_raw(devmem_fd); + if (!*devmem_file) { + err = -EINVAL; + goto err; + } + + if (!is_dma_buf_pages_file(*devmem_file)) { + err = -EBADF; + goto err_fput; + } + + priv = (*devmem_file)->private_data; + if (!priv) { + WARN_ONCE(!priv, "dma_buf_pages_file has no private_data"); + err = -EINTR; + goto err_fput; + } + + if (!(priv->type & DMA_BUF_PAGES_NET_TX)) + return -EINVAL; + + if (devmem_offset + size > priv->dmabuf->size) { + err = -ENOSPC; + goto err_fput; + } + + *devmem_tx_iter = priv->net_tx.iter; + iov_iter_advance(devmem_tx_iter, devmem_offset); + + return 0; + +err_fput: + fput(*devmem_file); + *devmem_file = NULL; +err: + return err; +} + int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size) { struct tcp_sock *tp = tcp_sk(sk); @@ -1227,6 +1274,8 @@ int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size) int process_backlog = 0; bool zc = false; long timeo; + struct file *devmem_file = NULL; + struct iov_iter devmem_tx_iter;
flags = msg->msg_flags;
@@ -1295,6 +1344,14 @@ int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size) } }
+ if (sockc.devmem_fd) { + err = tcp_prepare_devmem_data(msg, sockc.devmem_fd, + sockc.devmem_offset, &devmem_file, + &devmem_tx_iter, size); + if (err) + goto out_err; + } + /* This should be in poll */ sk_clear_bit(SOCKWQ_ASYNC_NOSPACE, sk);
@@ -1408,7 +1465,17 @@ int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size) goto wait_for_space; }
- err = skb_zerocopy_iter_stream(sk, skb, msg, copy, uarg); + if (devmem_file) { + err = skb_zerocopy_iter_stream(sk, skb, msg, + &devmem_tx_iter, + copy, uarg); + if (err > 0) + iov_iter_advance(&msg->msg_iter, err); + } else { + err = skb_zerocopy_iter_stream(sk, skb, msg, + &msg->msg_iter, + copy, uarg); + } if (err == -EMSGSIZE || err == -EEXIST) { tcp_mark_push(tp, skb); goto new_segment; @@ -1462,6 +1529,8 @@ int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size) } out_nopush: net_zcopy_put(uarg); + if (devmem_file) + fput(devmem_file); return copied + copied_syn;
do_error: @@ -1470,6 +1539,8 @@ int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size) if (copied + copied_syn) goto out; out_err: + if (devmem_file) + fput(devmem_file); net_zcopy_put_abort(uarg, true); err = sk_stream_error(sk, flags, err); /* make sure we wake any epoll edge trigger waiter */
ncdevmem is a devmem TCP netcat. It works similarly to netcat, but it sends and receives data using the devmem TCP APIs. It uses udmabuf as the dmabuf provider. It is compatible with a regular netcat running on a peer, or a ncdevmem running on a peer.
In addition to normal netcat support, ncdevmem has a validation mode, where it sends a specific pattern and validates this pattern on the receiver side to ensure data integrity.
Signed-off-by: Mina Almasry almasrymina@google.com --- tools/testing/selftests/net/.gitignore | 1 + tools/testing/selftests/net/Makefile | 1 + tools/testing/selftests/net/ncdevmem.c | 693 +++++++++++++++++++++++++ 3 files changed, 695 insertions(+) create mode 100644 tools/testing/selftests/net/ncdevmem.c
diff --git a/tools/testing/selftests/net/.gitignore b/tools/testing/selftests/net/.gitignore index f27a7338b60e..1153ba177f1b 100644 --- a/tools/testing/selftests/net/.gitignore +++ b/tools/testing/selftests/net/.gitignore @@ -16,6 +16,7 @@ ipsec ipv6_flowlabel ipv6_flowlabel_mgr msg_zerocopy +ncdevmem nettest psock_fanout psock_snd diff --git a/tools/testing/selftests/net/Makefile b/tools/testing/selftests/net/Makefile index c12df57d5539..ea5ccbd99d3b 100644 --- a/tools/testing/selftests/net/Makefile +++ b/tools/testing/selftests/net/Makefile @@ -84,6 +84,7 @@ TEST_GEN_FILES += ip_local_port_range TEST_GEN_FILES += bind_wildcard TEST_PROGS += test_vxlan_mdb.sh TEST_PROGS += test_bridge_neigh_suppress.sh +TEST_GEN_FILES += ncdevmem
TEST_FILES := settings
diff --git a/tools/testing/selftests/net/ncdevmem.c b/tools/testing/selftests/net/ncdevmem.c new file mode 100644 index 000000000000..4f62f22bf763 --- /dev/null +++ b/tools/testing/selftests/net/ncdevmem.c @@ -0,0 +1,693 @@ +// SPDX-License-Identifier: GPL-2.0 +#define _GNU_SOURCE +#define __EXPORTED_HEADERS__ + +#include <linux/uio.h> +#include <stdio.h> +#include <stdlib.h> +#include <unistd.h> +#include <stdbool.h> +#include <string.h> +#include <errno.h> +#define __iovec_defined +#include <fcntl.h> +#include <malloc.h> + +#include <arpa/inet.h> +#include <sys/socket.h> +#include <sys/mman.h> +#include <sys/ioctl.h> +#include <sys/syscall.h> + +#include <linux/memfd.h> +#include <linux/if.h> +#include <linux/dma-buf.h> +#include <linux/udmabuf.h> + +#define PAGE_SHIFT 12 +#define PAGE_SIZE 4096 +#define TEST_PREFIX "ncdevmem" +#define NUM_PAGES 16000 + +#ifndef MSG_SOCK_DEVMEM +#define MSG_SOCK_DEVMEM 0x2000000 +#endif + +/* + * tcpdevmem netcat. Works similarly to netcat but does device memory TCP + * instead of regular TCP. Uses udmabuf to mock a dmabuf provider. + * + * Usage: + * + * * Without validation: + * + * On server: + * ncdevmem -s <server IP> -c <client IP> -f eth1 -n 0000:06:00.0 -l \ + * -p 5201 + * + * On client: + * ncdevmem -s <server IP> -c <client IP> -f eth1 -n 0000:06:00.0 -p 5201 + * + * * With Validation: + * On server: + * ncdevmem -s <server IP> -c <client IP> -l -f eth1 -n 0000:06:00.0 \ + * -p 5202 -v 1 + * + * On client: + * ncdevmem -s <server IP> -c <client IP> -f eth1 -n 0000:06:00.0 -p 5202 \ + * -v 100000 + * + * Note this is compatible with regular netcat. i.e. the sender or receiver can + * be replaced with regular netcat to test the RX or TX path in isolation. + */ + +static char *server_ip = "192.168.1.4"; +static char *client_ip = "192.168.1.2"; +static char *port = "5201"; +static size_t do_validation = 0; +static int queue_num = 15; +static char *ifname = "eth1"; +static char *nic_pci_addr = "0000:06:00.0"; +static unsigned int iterations = 0; + +void print_bytes(void *ptr, size_t size) +{ + unsigned char *p = ptr; + int i; + for (i = 0; i < size; i++) { + printf("%02hhX ", p[i]); + } + printf("\n"); +} + +void print_nonzero_bytes(void *ptr, size_t size) +{ + unsigned char *p = ptr; + unsigned int i; + for (i = 0; i < size; i++) { + if (p[i]) + printf("%c", p[i]); + } + printf("\n"); +} + +void initialize_validation(void *line, size_t size) +{ + static unsigned char seed = 1; + unsigned char *ptr = line; + for (size_t i = 0; i < size; i++) { + ptr[i] = seed; + seed++; + if (seed == 254) + seed = 0; + } +} + +void validate_buffer(void *line, size_t size) +{ + static unsigned char seed = 1; + int errors = 0; + + unsigned char *ptr = line; + for (size_t i = 0; i < size; i++) { + if (ptr[i] != seed) { + fprintf(stderr, + "Failed validation: expected=%u, " + "actual=%u, index=%lu\n", + seed, ptr[i], i); + errors++; + if (errors > 20) + exit(1); + } + seed++; + if (seed == 254) + seed = 0; + } + + fprintf(stdout, "Validated buffer\n"); +} + +/* Triggers a driver reset... + * + * The proper way to do this is probably 'ethtool --reset', but I don't have + * that supported on my current test bed. I resort to changing this + * configuration in the driver which also causes a driver reset... + */ +static void reset_flow_steering() +{ + char command[256]; + memset(command, 0, sizeof(command)); + snprintf(command, sizeof(command), "sudo ethtool -K %s ntuple off", + "eth1"); + system(command); + + memset(command, 0, sizeof(command)); + snprintf(command, sizeof(command), "sudo ethtool -K %s ntuple on", + "eth1"); + system(command); +} + +static void configure_flow_steering() +{ + char command[256]; + memset(command, 0, sizeof(command)); + snprintf(command, sizeof(command), + "sudo ethtool -N %s flow-type tcp4 src-ip %s dst-ip %s " + "src-port %s dst-port %s queue %d", + ifname, client_ip, server_ip, port, port, queue_num); + system(command); +} + +/* Triggers a device reset, which causes the dmabuf pages binding to take + * effect. A better and more generic way to do this may be ethtool --reset. + */ +static void trigger_device_reset() +{ + char command[256]; + memset(command, 0, sizeof(command)); + snprintf(command, sizeof(command), + "sudo ethtool --set-priv-flags %s enable-header-split off", + ifname); + system(command); + + memset(command, 0, sizeof(command)); + snprintf(command, sizeof(command), + "sudo ethtool --set-priv-flags %s enable-header-split on", + ifname); + system(command); +} + +static void create_udmabuf(int *devfd, int *memfd, int *buf, size_t dmabuf_size) +{ + struct udmabuf_create create; + int ret; + + *devfd = open("/dev/udmabuf", O_RDWR); + if (*devfd < 0) { + fprintf(stderr, + "%s: [skip,no-udmabuf: Unable to access DMA " + "buffer device file]\n", + TEST_PREFIX); + exit(70); + } + + *memfd = memfd_create("udmabuf-test", MFD_ALLOW_SEALING); + if (*memfd < 0) { + printf("%s: [skip,no-memfd]\n", TEST_PREFIX); + exit(72); + } + + ret = fcntl(*memfd, F_ADD_SEALS, F_SEAL_SHRINK); + if (ret < 0) { + printf("%s: [skip,fcntl-add-seals]\n", TEST_PREFIX); + exit(73); + } + + ret = ftruncate(*memfd, dmabuf_size); + if (ret == -1) { + printf("%s: [FAIL,memfd-truncate]\n", TEST_PREFIX); + exit(74); + } + + memset(&create, 0, sizeof(create)); + + create.memfd = *memfd; + create.offset = 0; + create.size = dmabuf_size; + *buf = ioctl(*devfd, UDMABUF_CREATE, &create); + if (*buf < 0) { + printf("%s: [FAIL, create udmabuf]\n", TEST_PREFIX); + exit(75); + } +} + + +int do_server() +{ + struct dma_buf_create_pages_info pages_create_info = { 0 }; + int devfd, memfd, buf, buf_pages, ret; + size_t dmabuf_size; + + dmabuf_size = getpagesize() * NUM_PAGES; + + create_udmabuf(&devfd, &memfd, &buf, dmabuf_size); + + pages_create_info.dma_buf_fd = buf; + pages_create_info.type = DMA_BUF_PAGES_NET_RX; + pages_create_info.create_flags = (1 << queue_num); + + ret = sscanf(nic_pci_addr, "0000:%hhx:%hhx.%hhx", + &pages_create_info.pci_bdf[0], + &pages_create_info.pci_bdf[1], + &pages_create_info.pci_bdf[2]); + + if (ret != 3) { + printf("%s: [FAIL, parse fail]\n", TEST_PREFIX); + exit(76); + } + + buf_pages = ioctl(buf, DMA_BUF_CREATE_PAGES, &pages_create_info); + if (buf_pages < 0) { + perror("create pages"); + exit(77); + } + + char *buf_mem = NULL; + buf_mem = mmap(NULL, dmabuf_size, PROT_READ | PROT_WRITE, MAP_SHARED, + buf, 0); + if (buf_mem == MAP_FAILED) { + perror("mmap()"); + exit(1); + } + + /* Need to trigger the NIC to reallocate its RX pages, otherwise the + * bind doesn't take effect. + */ + trigger_device_reset(); + + sleep(1); + + reset_flow_steering(); + configure_flow_steering(); + + struct sockaddr_in server_sin; + server_sin.sin_family = AF_INET; + server_sin.sin_port = htons(atoi(port)); + + ret = inet_pton(server_sin.sin_family, server_ip, &server_sin.sin_addr); + if (socket < 0) { + printf("%s: [FAIL, create socket]\n", TEST_PREFIX); + exit(79); + } + + int socket_fd = socket(server_sin.sin_family, SOCK_STREAM, 0); + if (socket < 0) { + printf("%s: [FAIL, create socket]\n", TEST_PREFIX); + exit(76); + } + + int opt = 1; + ret = setsockopt(socket_fd, SOL_SOCKET, + SO_REUSEADDR | SO_REUSEPORT | SO_ZEROCOPY, &opt, + sizeof(opt)); + if (ret) { + printf("%s: [FAIL, set sock opt]: %s\n", TEST_PREFIX, + strerror(errno)); + exit(76); + } + + printf("binding to address %s:%d\n", server_ip, + ntohs(server_sin.sin_port)); + + ret = bind(socket_fd, &server_sin, sizeof(server_sin)); + if (ret) { + printf("%s: [FAIL, bind]: %s\n", TEST_PREFIX, strerror(errno)); + exit(76); + } + + ret = listen(socket_fd, 1); + if (ret) { + printf("%s: [FAIL, listen]: %s\n", TEST_PREFIX, + strerror(errno)); + exit(76); + } + + struct sockaddr_in client_addr; + socklen_t client_addr_len = sizeof(client_addr); + + char buffer[256]; + + inet_ntop(server_sin.sin_family, &server_sin.sin_addr, buffer, + sizeof(buffer)); + printf("Waiting or connection on %s:%d\n", buffer, + ntohs(server_sin.sin_port)); + int client_fd = accept(socket_fd, &client_addr, &client_addr_len); + + inet_ntop(client_addr.sin_family, &client_addr.sin_addr, buffer, + sizeof(buffer)); + printf("Got connection from %s:%d\n", buffer, + ntohs(client_addr.sin_port)); + + char iobuf[819200]; + char ctrl_data[sizeof(int) * 20000]; + + size_t total_received = 0; + size_t i = 0; + size_t page_aligned_frags = 0; + size_t non_page_aligned_frags = 0; + while (1) { + bool is_devmem = false; + printf("\n\n"); + + struct msghdr msg = { 0 }; + struct iovec iov = { .iov_base = iobuf, + .iov_len = sizeof(iobuf) }; + msg.msg_iov = &iov; + msg.msg_iovlen = 1; + msg.msg_control = ctrl_data; + msg.msg_controllen = sizeof(ctrl_data); + ssize_t ret = recvmsg(client_fd, &msg, MSG_SOCK_DEVMEM); + printf("recvmsg ret=%ld\n", ret); + if (ret < 0 && (errno == EAGAIN || errno == EWOULDBLOCK)) { + continue; + } + if (ret < 0) { + perror("recvmsg"); + continue; + } + if (ret == 0) { + printf("client exited\n"); + goto cleanup; + } + + i++; + struct cmsghdr *cm = NULL; + struct cmsg_devmem *cmsg_devmem = NULL; + for (cm = CMSG_FIRSTHDR(&msg); cm; cm = CMSG_NXTHDR(&msg, cm)) { + if (cm->cmsg_level != SOL_SOCKET || + (cm->cmsg_type != SCM_DEVMEM_OFFSET && + cm->cmsg_type != SCM_DEVMEM_HEADER)) { + fprintf(stdout, "skipping non-devmem cmsg\n"); + continue; + } + + cmsg_devmem = (struct cmsg_devmem *)CMSG_DATA(cm); + is_devmem = true; + + if (cm->cmsg_type == SCM_DEVMEM_HEADER) { + // TODO: process data copied from skb's linear + // buffer. + fprintf(stdout, + "SCM_DEVMEM_HEADER. " + "cmsg_devmem->frag_size=%u\n", + cmsg_devmem->frag_size); + + continue; + } + + struct devmemtoken token = { cmsg_devmem->frag_token, + 1 }; + + total_received += cmsg_devmem->frag_size; + printf("received frag_page=%u, in_page_offset=%u," + " frag_offset=%u, frag_size=%u, token=%u" + " total_received=%lu\n", + cmsg_devmem->frag_offset >> PAGE_SHIFT, + cmsg_devmem->frag_offset % PAGE_SIZE, + cmsg_devmem->frag_offset, cmsg_devmem->frag_size, + cmsg_devmem->frag_token, total_received); + + if (cmsg_devmem->frag_size % PAGE_SIZE) + non_page_aligned_frags++; + else + page_aligned_frags++; + + struct dma_buf_sync sync = { 0 }; + sync.flags = DMA_BUF_SYNC_READ | DMA_BUF_SYNC_START; + ioctl(buf, DMA_BUF_IOCTL_SYNC, &sync); + + if (do_validation) + validate_buffer( + ((unsigned char *)buf_mem) + + cmsg_devmem->frag_offset, + cmsg_devmem->frag_size); + else + print_nonzero_bytes( + ((unsigned char *)buf_mem) + + cmsg_devmem->frag_offset, + cmsg_devmem->frag_size); + + sync.flags = DMA_BUF_SYNC_READ | DMA_BUF_SYNC_END; + ioctl(buf, DMA_BUF_IOCTL_SYNC, &sync); + + ret = setsockopt(client_fd, SOL_SOCKET, + SO_DEVMEM_DONTNEED, &token, + sizeof(token)); + if (ret) { + perror("SO_DEVMEM_DONTNEED"); + exit(1); + } + } + if (!is_devmem) + printf("flow steering error\n"); + + printf("total_received=%lu\n", total_received); + } + +cleanup: + fprintf(stdout, "%s: ok\n", TEST_PREFIX); + + fprintf(stdout, "page_aligned_frags=%lu, non_page_aligned_frags=%lu\n", + page_aligned_frags, non_page_aligned_frags); + + fprintf(stdout, "page_aligned_frags=%lu, non_page_aligned_frags=%lu\n", + page_aligned_frags, non_page_aligned_frags); + + munmap(buf_mem, dmabuf_size); + close(client_fd); + close(socket_fd); + close(buf_pages); + close(buf); + close(memfd); + close(devfd); + trigger_device_reset(); + + return 0; +} + +int do_client() +{ + printf("doing client\n"); + + struct dma_buf_create_pages_info pages_create_info = { 0 }; + int devfd, memfd, buf, buf_pages, ret; + size_t dmabuf_size; + + dmabuf_size = getpagesize() * NUM_PAGES; + + create_udmabuf(&devfd, &memfd, &buf, dmabuf_size); + + pages_create_info.dma_buf_fd = buf; + pages_create_info.type = DMA_BUF_PAGES_NET_TX; + + ret = sscanf(nic_pci_addr, "0000:%hhx:%hhx.%hhx", + &pages_create_info.pci_bdf[0], + &pages_create_info.pci_bdf[1], + &pages_create_info.pci_bdf[2]); + + if (ret != 3) { + printf("%s: [FAIL, parse fail]\n", TEST_PREFIX); + exit(76); + } + + buf_pages = ioctl(buf, DMA_BUF_CREATE_PAGES, &pages_create_info); + if (buf_pages < 0) { + perror("create pages"); + exit(77); + } + + char *buf_mem = NULL; + buf_mem = mmap(NULL, dmabuf_size, PROT_READ | PROT_WRITE, MAP_SHARED, + buf, 0); + if (buf_mem == MAP_FAILED) { + perror("mmap()"); + exit(1); + } + + struct sockaddr_in server_sin; + server_sin.sin_family = AF_INET; + server_sin.sin_port = htons(atoi(port)); + + ret = inet_pton(server_sin.sin_family, server_ip, &server_sin.sin_addr); + if (socket < 0) { + printf("%s: [FAIL, create socket]\n", TEST_PREFIX); + exit(79); + } + + int socket_fd = socket(server_sin.sin_family, SOCK_STREAM, 0); + if (socket < 0) { + printf("%s: [FAIL, create socket]\n", TEST_PREFIX); + exit(76); + } + + int opt = 1; + ret = setsockopt(socket_fd, SOL_SOCKET, + SO_REUSEADDR | SO_REUSEPORT | SO_ZEROCOPY, &opt, + sizeof(opt)); + if (ret) { + printf("%s: [FAIL, set sock opt]: %s\n", TEST_PREFIX, + strerror(errno)); + exit(76); + } + + struct sockaddr_in client_sin; + client_sin.sin_family = AF_INET; + client_sin.sin_port = htons(atoi(port)); + + ret = inet_pton(client_sin.sin_family, client_ip, &client_sin.sin_addr); + if (socket < 0) { + printf("%s: [FAIL, create socket]\n", TEST_PREFIX); + exit(79); + } + + ret = bind(socket_fd, &client_sin, sizeof(client_sin)); + if (ret) { + printf("%s: [FAIL, bind]: %s\n", TEST_PREFIX, strerror(errno)); + exit(76); + } + + ret = setsockopt(socket_fd, SOL_SOCKET, SO_ZEROCOPY, &opt, sizeof(opt)); + if (ret) { + printf("%s: [FAIL, set sock opt]: %s\n", TEST_PREFIX, + strerror(errno)); + exit(76); + } + + ret = connect(socket_fd, &server_sin, sizeof(server_sin)); + if (ret) { + printf("%s: [FAIL, connect]: %s\n", TEST_PREFIX, + strerror(errno)); + exit(76); + } + + char *line = NULL; + size_t line_size = 0; + size_t len = 0; + + size_t i = 0; + while (!iterations || i < iterations) { + i++; + free(line); + line = NULL; + if (do_validation) { + line = malloc(do_validation); + if (!line) { + fprintf(stderr, "Failed to allocate\n"); + exit(1); + } + memset(line, 0, do_validation); + initialize_validation(line, do_validation); + line_size = do_validation; + } else { + line_size = getline(&line, &len, stdin); + } + + if (line_size < 0) + continue; + + fprintf(stdout, "DEBUG: read line_size=%lu\n", line_size); + + struct dma_buf_sync sync = { 0 }; + sync.flags = DMA_BUF_SYNC_WRITE | DMA_BUF_SYNC_START; + ioctl(buf, DMA_BUF_IOCTL_SYNC, &sync); + + memset(buf_mem, 0, dmabuf_size); + memcpy(buf_mem, line, line_size); + + if (do_validation) + validate_buffer(buf_mem, line_size); + + struct iovec iov = { .iov_base = NULL, .iov_len = line_size }; + + struct msghdr msg = { 0 }; + + msg.msg_iov = &iov; + msg.msg_iovlen = 1; + + char ctrl_data[CMSG_SPACE(sizeof(int) * 2)]; + msg.msg_control = ctrl_data; + msg.msg_controllen = sizeof(ctrl_data); + + struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg); + cmsg->cmsg_level = SOL_SOCKET; + cmsg->cmsg_type = SCM_DEVMEM_OFFSET; + cmsg->cmsg_len = CMSG_LEN(sizeof(int) * 2); + *((int *)CMSG_DATA(cmsg)) = buf_pages; + ((int *)CMSG_DATA(cmsg))[1] = 0; + + ret = sendmsg(socket_fd, &msg, MSG_ZEROCOPY); + if (ret < 0) + perror("sendmsg"); + else + fprintf(stdout, "DEBUG: sendmsg_ret=%d\n", ret); + + sync.flags = DMA_BUF_SYNC_WRITE | DMA_BUF_SYNC_END; + ioctl(buf, DMA_BUF_IOCTL_SYNC, &sync); + + // sleep for a bit before we overwrite the dmabuf that's being + // sent. + if (do_validation) + sleep(1); + } + + fprintf(stdout, "%s: ok\n", TEST_PREFIX); + + while (1) + sleep(10); + + munmap(buf_mem, dmabuf_size); + + free(line); + close(socket_fd); + close(buf_pages); + close(buf); + close(memfd); + close(devfd); + trigger_device_reset(); + + return 0; +} + +int main(int argc, char *argv[]) +{ + int is_server = 0, opt; + + // put ':' in the starting of the + // string so that program can + //distinguish between '?' and ':' + while ((opt = getopt(argc, argv, "ls:c:p:v:q:f:n:i:")) != -1) { + switch (opt) { + case 'l': + is_server = 1; + break; + case 's': + server_ip = optarg; + break; + case 'c': + client_ip = optarg; + break; + case 'p': + port = optarg; + break; + case 'v': + do_validation = atoll(optarg); + break; + case 'q': + queue_num = atoi(optarg); + break; + case 'f': + ifname = optarg; + break; + case 'n': + nic_pci_addr = optarg; + break; + case 'i': + iterations = atoll(optarg); + break; + case '?': + printf("unknown option: %c\n", optopt); + break; + } + } + + // optind is for the extra arguments + // which are not parsed + for (; optind < argc; optind++) { + printf("extra arguments: %s\n", argv[optind]); + } + + if (is_server) + return do_server(); + + return do_client(); +}
Implement a few updates to Jakub's RFC memory provider API to make it suitable for device memory TCP:
1. Currently for devmem TCP the driver's netdev_rx_queue holds a reference to the dma_buf_pages struct and needs to pass that to the page_pool's memory provider somehow. For PoC purposes, create a pp->mp_priv field that is set by the driver. Likely needs a better API (likely dependant on the general memory provider API).
2. The current memory_provider API gives the memory_provider the option to override put_page(), but tries page_pool_clear_pp_info() after the memory provider has released the page. IMO if the page freeing is delegated to the provider then the page_pool should not modify the page after release_page() has been called.
Signed-off-by: Mina Almasry almasrymina@google.com --- include/net/page_pool.h | 1 + net/core/page_pool.c | 7 ++++--- 2 files changed, 5 insertions(+), 3 deletions(-)
diff --git a/include/net/page_pool.h b/include/net/page_pool.h index 364fe6924258..7b6668479baf 100644 --- a/include/net/page_pool.h +++ b/include/net/page_pool.h @@ -78,6 +78,7 @@ struct page_pool_params { struct device *dev; /* device, for DMA pre-mapping purposes */ struct napi_struct *napi; /* Sole consumer of pages, otherwise NULL */ u8 memory_provider; /* haaacks! should be user-facing */ + void *mp_priv; /* argument to pass to the memory provider */ enum dma_data_direction dma_dir; /* DMA mapping direction */ unsigned int max_len; /* max DMA sync memory size */ unsigned int offset; /* DMA addr offset */ diff --git a/net/core/page_pool.c b/net/core/page_pool.c index d50f6728e4f6..df3f431fcff3 100644 --- a/net/core/page_pool.c +++ b/net/core/page_pool.c @@ -241,6 +241,7 @@ static int page_pool_init(struct page_pool *pool, goto free_ptr_ring; }
+ pool->mp_priv = pool->p.mp_priv; if (pool->mp_ops) { err = pool->mp_ops->init(pool); if (err) { @@ -564,16 +565,16 @@ void page_pool_return_page(struct page_pool *pool, struct page *page) else __page_pool_release_page_dma(pool, page);
- page_pool_clear_pp_info(page); - /* This may be the last page returned, releasing the pool, so * it is not safe to reference pool afterwards. */ count = atomic_inc_return_relaxed(&pool->pages_state_release_cnt); trace_page_pool_state_release(pool, page, count);
- if (put) + if (put) { + page_pool_clear_pp_info(page); put_page(page); + } /* An optimization would be to call __free_pages(page, pool->p.order) * knowing page is not part of page-cache (thus avoiding a * __page_cache_release() call).
Use Jakub's memory provider PoC API: https://github.com/kuba-moo/linux/tree/pp-providers
To implement a dmabuf devmem memory provider. The provider allocates NET_RX dmabuf pages to the page pool. This abstracts any custom memory allocation or freeing changes for devmem TCP from drivers using the page pool.
The memory provider allocates NET_RX pages from the dmabuf pages provided by the driver. These pages are ZONE_DEVICE pages with the sg dma_addrs stored in the zone_device_data entry in the page. The page pool entries in struct page are in a union with the ZONE_DEVICE entries, and - without special handling - the page pool would accidentally overwrite the data in the ZONE_DEVICE fields.
To solve this, the memory provider converts the page from a ZONE_DEVICE page to a ZONE_NORMAL page upon giving it to the page pool, and converts it back to ZONE_DEVICE page upon getting it back from the page pool. This is safe to do because the NET_RX pages are dmabuf pages created to hold the dma_addr in the dma_buf_map_attachement sg_table entries, and are only used with code that handles them specifically.
However, since dmabuf pages can now also be page pool page, we need to update 2 places to detect this correctly:
1. is_dma_buf_page() needs to be updated to correctly detect dmabuf pages after they've been inserted into the pool.
2. dma_buf_page_to_dma_addr() needs to be updated. For page pool pages, the dma_addr exists in page->dma_addr. For non page pool pages, the dma_addr exists in page->zone_device_data.
Signed-off-by: Mina Almasry almasrymina@google.com --- include/linux/dma-buf.h | 29 ++++++++++- include/net/page_pool.h | 20 ++++++++ net/core/page_pool.c | 104 ++++++++++++++++++++++++++++++++++++---- 3 files changed, 143 insertions(+), 10 deletions(-)
diff --git a/include/linux/dma-buf.h b/include/linux/dma-buf.h index 93228a2fec47..896359fa998d 100644 --- a/include/linux/dma-buf.h +++ b/include/linux/dma-buf.h @@ -692,15 +692,26 @@ static inline bool is_dma_buf_pages_file(struct file *file)
struct page *dma_buf_pages_net_rx_alloc(struct dma_buf_pages *priv);
+static inline bool is_dma_buf_page_net_rx(struct page *page) +{ + struct dma_buf_pages *priv; + + return (is_page_pool_page(page) && (priv = page->pp->mp_priv) && + priv->pgmap.ops == &dma_buf_pgmap_ops); +} + static inline bool is_dma_buf_page(struct page *page) { return (is_zone_device_page(page) && page->pgmap && - page->pgmap->ops == &dma_buf_pgmap_ops); + page->pgmap->ops == &dma_buf_pgmap_ops) || + is_dma_buf_page_net_rx(page); }
static inline dma_addr_t dma_buf_page_to_dma_addr(struct page *page) { - return (dma_addr_t)page->zone_device_data; + return is_dma_buf_page_net_rx(page) ? + (dma_addr_t)page->dma_addr : + (dma_addr_t)page->zone_device_data; }
static inline int dma_buf_map_sg(struct device *dev, struct scatterlist *sg, @@ -718,6 +729,16 @@ static inline int dma_buf_map_sg(struct device *dev, struct scatterlist *sg,
return nents; } + +static inline bool is_dma_buf_pages_priv(void *ptr) +{ + struct dma_buf_pages *priv = (struct dma_buf_pages *)ptr; + + if (!priv || priv->pgmap.ops != &dma_buf_pgmap_ops) + return false; + + return true; +} #else static inline bool is_dma_buf_page(struct page *page) { @@ -745,6 +766,10 @@ static inline struct page *dma_buf_pages_net_rx_alloc(struct dma_buf_pages *priv return NULL; }
+static inline bool is_dma_buf_pages_priv(void *ptr) +{ + return false; +} #endif
diff --git a/include/net/page_pool.h b/include/net/page_pool.h index 7b6668479baf..a57757a13cc8 100644 --- a/include/net/page_pool.h +++ b/include/net/page_pool.h @@ -157,6 +157,7 @@ enum pp_memory_provider_type { PP_MP_HUGE_SPLIT, /* 2MB, online page alloc */ PP_MP_HUGE, /* 2MB, all memory pre-allocated */ PP_MP_HUGE_1G, /* 1G pages, MEP, pre-allocated */ + PP_MP_DMABUF_DEVMEM, /* dmabuf devmem provider */ };
struct pp_memory_provider_ops { @@ -170,6 +171,7 @@ extern const struct pp_memory_provider_ops basic_ops; extern const struct pp_memory_provider_ops hugesp_ops; extern const struct pp_memory_provider_ops huge_ops; extern const struct pp_memory_provider_ops huge_1g_ops; +extern const struct pp_memory_provider_ops dmabuf_devmem_ops;
struct page_pool { struct page_pool_params p; @@ -420,4 +422,22 @@ static inline void page_pool_nid_changed(struct page_pool *pool, int new_nid) page_pool_update_nid(pool, new_nid); }
+static inline bool is_page_pool_page(struct page *page) +{ + /* page->pp_magic is OR'ed with PP_SIGNATURE after the allocation + * in order to preserve any existing bits, such as bit 0 for the + * head page of compound page and bit 1 for pfmemalloc page, so + * mask those bits for freeing side when doing below checking, + * and page_is_pfmemalloc() is checked in __page_pool_put_page() + * to avoid recycling the pfmemalloc page. + */ + if (unlikely((page->pp_magic & ~0x3UL) != PP_SIGNATURE)) + return false; + + if (!page->pp) + return false; + + return true; +} + #endif /* _NET_PAGE_POOL_H */ diff --git a/net/core/page_pool.c b/net/core/page_pool.c index df3f431fcff3..e626d4e309c1 100644 --- a/net/core/page_pool.c +++ b/net/core/page_pool.c @@ -236,6 +236,9 @@ static int page_pool_init(struct page_pool *pool, case PP_MP_HUGE_1G: pool->mp_ops = &huge_1g_ops; break; + case PP_MP_DMABUF_DEVMEM: + pool->mp_ops = &dmabuf_devmem_ops; + break; default: err = -EINVAL; goto free_ptr_ring; @@ -975,14 +978,7 @@ bool page_pool_return_skb_page(struct page *page, bool napi_safe)
page = compound_head(page);
- /* page->pp_magic is OR'ed with PP_SIGNATURE after the allocation - * in order to preserve any existing bits, such as bit 0 for the - * head page of compound page and bit 1 for pfmemalloc page, so - * mask those bits for freeing side when doing below checking, - * and page_is_pfmemalloc() is checked in __page_pool_put_page() - * to avoid recycling the pfmemalloc page. - */ - if (unlikely((page->pp_magic & ~0x3UL) != PP_SIGNATURE)) + if (!is_page_pool_page(page)) return false;
pp = page->pp; @@ -1538,3 +1534,95 @@ const struct pp_memory_provider_ops huge_1g_ops = { .alloc_pages = mp_huge_1g_alloc_pages, .release_page = mp_huge_1g_release, }; + +/*** "Dmabuf devmem page" ***/ + +/* Dmabuf devmem memory provider allocates DMA_BUF_PAGES_NET_RX pages which are + * backing the dma_buf_map_attachment() from the NIC to the device memory. + * + * These pages are wrappers around the dma_addr of the sg entries in the + * sg_table returned from dma_buf_map_attachment(). They can be passed to the + * networking stack, which will generate devmem skbs from them and process them + * correctly. + */ +static int mp_dmabuf_devmem_init(struct page_pool *pool) +{ + struct dma_buf_pages *priv; + + priv = pool->mp_priv; + if (!is_dma_buf_pages_priv(priv)) + return -EINVAL; + + return 0; +} + +static void mp_dmabuf_devmem_destroy(struct page_pool *pool) +{ +} + +static struct page *mp_dmabuf_devmem_alloc_pages(struct page_pool *pool, + gfp_t gfp) +{ + struct dma_buf_pages *priv = pool->mp_priv; + dma_addr_t dma_addr; + struct page *page; + + page = dma_buf_pages_net_rx_alloc(priv); + if (!page) + return page; + + /* It shouldn't be possible for the allocation to give us a page not + * belonging to this page_pool's pgmap. + */ + BUG_ON(page->pgmap != &priv->pgmap); + + /* netdev_rxq_alloc_dma_buf_page() allocates a ZONE_DEVICE page. + * Prepare to convert it into a page_pool page. We need to hold pgmap + * and zone_device_data (which holds the dma_addr). + * + * DMA_BUF_PAGES_NET_RX are dmabuf pages created specifically to wrap + * the dma_addr of the sg_table into a struct page. These pages are + * used by code specifically equipped to handle them, so this + * conversation from ZONE_DEVICE page to page pool page should be safe. + */ + dma_addr = (dma_addr_t)page->zone_device_data; + + set_page_zone(page, ZONE_NORMAL); + page->pp_magic = 0; + page_pool_set_pp_info(pool, page); + + page->dma_addr = dma_addr; + + return page; +} + +static bool mp_dmabuf_devmem_release_page(struct page_pool *pool, + struct page *page) +{ + struct dma_buf_pages *priv = pool->mp_priv; + unsigned long dma_addr = page->dma_addr; + + page_pool_clear_pp_info(page); + + /* As the page pool releases the page, restore it back to a ZONE_DEVICE + * page so it gets freed according to the + * page->pgmap->ops->page_free(). + */ + set_page_zone(page, ZONE_DEVICE); + page->zone_device_data = (void*)dma_addr; + page->pgmap = &priv->pgmap; + put_page(page); + + /* Return false here as we don't want the page pool touching the page + * after it's released to us. + */ + return false; +} + +const struct pp_memory_provider_ops dmabuf_devmem_ops = { + .init = mp_dmabuf_devmem_init, + .destroy = mp_dmabuf_devmem_destroy, + .alloc_pages = mp_dmabuf_devmem_alloc_pages, + .release_page = mp_dmabuf_devmem_release_page, +}; +EXPORT_SYMBOL(dmabuf_devmem_ops);
On 7/10/23 15:32, Mina Almasry wrote:
- TL;DR:
Device memory TCP (devmem TCP) is a proposal for transferring data to and/or from device memory efficiently, without bouncing the data to a host memory buffer.
(I'm writing this as someone who might plausibly use this mechanism, but I don't think I'm very likely to end up working on the kernel side, unless I somehow feel extremely inspired to implement it for i40e.)
I looked at these patches and the GVE tree, and I'm trying to wrap my head around the data path. As I understand it, for RX:
1. The GVE driver notices that the queue is programmed to use devmem, and it programs the NIC to copy packet payloads to the devmem that has been programmed. 2. The NIC receives the packet and copies the header to kernel memory and the payload to dma-buf memory. 3. The kernel tells userspace where in the dma-buf the data is. 4. Userspace does something with the data. 5. Userspace does DONTNEED to recycle the memory and make it available for new received packets.
Did I get this right?
This seems a bit awkward if there's any chance that packets not intended for the target device end up in the rxq.
I'm wondering if a more capable if somewhat higher latency model could work where the NIC stores received packets in its own device memory. Then userspace (or the kernel or a driver or whatever) could initiate a separate DMA from the NIC to the final target *after* reading the headers. Can the hardware support this?
Another way of putting this is: steering received data to a specific device based on the *receive queue* forces the logic selecting a destination device to be the same as the logic selecting the queue. RX steering logic is pretty limited on most hardware (as far as I know -- certainly I've never had much luck doing anything especially intelligent with RX flow steering, and I've tried on a couple of different brands of supposedly fancy NICs). But Linux has very nice capabilities to direct packets, in software, to where they are supposed to go, and it would be nice if all that logic could just work, scalably, with device memory. If Linux could examine headers *before* the payload gets DMAed to wherever it goes, I think this could plausibly work quite nicely. One could even have an easy-to-use interface in which one directs a *socket* to a PCIe device. I expect, although I've never looked at the datasheets, that the kernel could even efficiently make rx decisions based on data in device memory on upcoming CXL NICs where device memory could participate in the host cache hierarchy.
My real ulterior motive is that I think it would be great to use an ability like this for DPDK-like uses. Wouldn't it be nifty if I could open a normal TCP socket, then, after it's open, ask the kernel to kindly DMA the results directly to my application memory (via udmabuf, perhaps)? Or have a whole VLAN or macvlan get directed to a userspace queue, etc?
It also seems a bit odd to me that the binding from rxq to dma-buf is established by programming the dma-buf. This makes the security model (and the mental model) awkward -- this binding is a setting on the *queue*, not the dma-buf, and in a containerized or privilege-separated system, a process could have enough privilege to make a dma-buf somewhere but not have any privileges on the NIC. (And may not even have the NIC present in its network namespace!)
--Andy
On Sun, 16 Jul 2023 19:41:28 -0700 Andy Lutomirski wrote:
I'm wondering if a more capable if somewhat higher latency model could work where the NIC stores received packets in its own device memory. Then userspace (or the kernel or a driver or whatever) could initiate a separate DMA from the NIC to the final target *after* reading the headers. Can the hardware support this?
No, no, that's impossible. SW response times are in 100s of usec (at best) which at 200Gbps already means megabytes of data _per-queue_. Way more than the amount of buffer NICs will have.
The Rx application can bind to a IP addr + Port and then install a one-sided-3-tuple (dst IP+proto+port) rule in the HW. Worst case a full 5-tuple per flow.
Most NICs support OvS offloads with 100s of thousands of flows. The steering should be bread and butter.
It does require splitting flows into separate data and control channels, but it's the right trade-off - complexity should be on the SW side.
On Sun, Jul 16, 2023 at 7:41 PM Andy Lutomirski luto@kernel.org wrote:
On 7/10/23 15:32, Mina Almasry wrote:
- TL;DR:
Device memory TCP (devmem TCP) is a proposal for transferring data to and/or from device memory efficiently, without bouncing the data to a host memory buffer.
(I'm writing this as someone who might plausibly use this mechanism, but I don't think I'm very likely to end up working on the kernel side, unless I somehow feel extremely inspired to implement it for i40e.)
I looked at these patches and the GVE tree, and I'm trying to wrap my head around the data path. As I understand it, for RX:
- The GVE driver notices that the queue is programmed to use devmem,
and it programs the NIC to copy packet payloads to the devmem that has been programmed. 2. The NIC receives the packet and copies the header to kernel memory and the payload to dma-buf memory. 3. The kernel tells userspace where in the dma-buf the data is. 4. Userspace does something with the data. 5. Userspace does DONTNEED to recycle the memory and make it available for new received packets.
Did I get this right?
Sorry for the late reply. I'm a bit buried working on the follow up to this proposal: exploring using dma-bufs without pages.
Yes, this is completely correct.
This seems a bit awkward if there's any chance that packets not intended for the target device end up in the rxq.
It does a bit. What happens in practice is that we use RSS to steer general traffic away from the devmem queues, and we use flow steering to steer specific flows to devem queues.
In the case where the RSS/flow steering configuration is done incorrectly, the user would call recvmsg() on a devmem skb and if they haven't specified the MSG_SOCK_DEVMEM flag they'd get an error.
I'm wondering if a more capable if somewhat higher latency model could work where the NIC stores received packets in its own device memory. Then userspace (or the kernel or a driver or whatever) could initiate a separate DMA from the NIC to the final target *after* reading the headers. Can the hardware support this?
Not that I know of. I guess Jakub also responded with the same.
Another way of putting this is: steering received data to a specific device based on the *receive queue* forces the logic selecting a destination device to be the same as the logic selecting the queue. RX steering logic is pretty limited on most hardware (as far as I know -- certainly I've never had much luck doing anything especially intelligent with RX flow steering, and I've tried on a couple of different brands of supposedly fancy NICs). But Linux has very nice capabilities to direct packets, in software, to where they are supposed to go, and it would be nice if all that logic could just work, scalably, with device memory. If Linux could examine headers *before* the payload gets DMAed to wherever it goes, I think this could plausibly work quite nicely. One could even have an easy-to-use interface in which one directs a *socket* to a PCIe device. I expect, although I've never looked at the datasheets, that the kernel could even efficiently make rx decisions based on data in device memory on upcoming CXL NICs where device memory could participate in the host cache hierarchy.
My real ulterior motive is that I think it would be great to use an ability like this for DPDK-like uses. Wouldn't it be nifty if I could open a normal TCP socket, then, after it's open, ask the kernel to kindly DMA the results directly to my application memory (via udmabuf, perhaps)? Or have a whole VLAN or macvlan get directed to a userspace queue, etc?
It also seems a bit odd to me that the binding from rxq to dma-buf is established by programming the dma-buf.
That is specific to this proposal, and will likely be very different in future ones. I thought the dma-buf pages approach was extensible and the uapi belonged somewhere in dma-buf. Clearly not. The next proposal, I think, will program the rxq via some net uapi and will take the dma-buf as input. Probably some netlink api (not sure if ethtool family or otherwise). I'm working out details of this non-paged networking first.
This makes the security model (and the mental model) awkward -- this binding is a setting on the *queue*, not the dma-buf, and in a containerized or privilege-separated system, a process could have enough privilege to make a dma-buf somewhere but not have any privileges on the NIC. (And may not even have the NIC present in its network namespace!)
--Andy
-- Thanks, Mina
On Tue, Jul 18, 2023 at 10:36:52AM -0700, Mina Almasry wrote:
That is specific to this proposal, and will likely be very different in future ones. I thought the dma-buf pages approach was extensible and the uapi belonged somewhere in dma-buf. Clearly not. The next proposal, I think, will program the rxq via some net uapi and will take the dma-buf as input. Probably some netlink api (not sure if ethtool family or otherwise). I'm working out details of this non-paged networking first.
In practice you want the application to startup, get itself some 3/5 tuples and then request the kernel to setup the flow steering and provision the NIC queues.
This is the right moment for the application to provide the backing for the rx queue memory via a DMABUF handle.
Ideally this would all be accessible to non-priv applications as well, so I think you'd want some kind of system call that sets all this up and takes in a FD for the 3/5-tuple socket (to prove ownership over the steering) and the DMABUF FD.
The queues and steering should exist only as long as the application is still running (whatever that means). Otherwise you have a big mess to clean up whenever anything crashes.
netlink feels like a weird API choice for that, in particular it would be really wrong to somehow bind the lifecycle of a netlink object to a process.
Further, if you are going to all the trouble of doing this, it seems to me you should make it work with any kind of memory, including CPU memory. Get a consistent approach to zero-copy TCP RX. So also allow a memfd or similar to be passed in as the backing storage.
Jason
On Tue, 18 Jul 2023 15:06:29 -0300 Jason Gunthorpe wrote:
netlink feels like a weird API choice for that, in particular it would be really wrong to somehow bind the lifecycle of a netlink object to a process.
Netlink is the right API, life cycle of objects can be easily tied to a netlink socket.
On 7/18/23 12:15 PM, Jakub Kicinski wrote:
On Tue, 18 Jul 2023 15:06:29 -0300 Jason Gunthorpe wrote:
netlink feels like a weird API choice for that, in particular it would be really wrong to somehow bind the lifecycle of a netlink object to a process.
Netlink is the right API, life cycle of objects can be easily tied to a netlink socket.
That is an untuitive connection -- memory references, h/w queues, flow steering should be tied to the datapath socket, not a control plane socket.
On Tue, 18 Jul 2023 12:20:59 -0600 David Ahern wrote:
On 7/18/23 12:15 PM, Jakub Kicinski wrote:
On Tue, 18 Jul 2023 15:06:29 -0300 Jason Gunthorpe wrote:
netlink feels like a weird API choice for that, in particular it would be really wrong to somehow bind the lifecycle of a netlink object to a process.
Netlink is the right API, life cycle of objects can be easily tied to a netlink socket.
That is an untuitive connection -- memory references, h/w queues, flow steering should be tied to the datapath socket, not a control plane socket.
There's one RSS context for may datapath sockets. Plus a lot of the APIs already exist, and it's more of a question of packaging them up at the user space level. For things which do not have an API, however, netlink, please.
On 7/18/23 12:29 PM, Jakub Kicinski wrote:
On Tue, 18 Jul 2023 12:20:59 -0600 David Ahern wrote:
On 7/18/23 12:15 PM, Jakub Kicinski wrote:
On Tue, 18 Jul 2023 15:06:29 -0300 Jason Gunthorpe wrote:
netlink feels like a weird API choice for that, in particular it would be really wrong to somehow bind the lifecycle of a netlink object to a process.
Netlink is the right API, life cycle of objects can be easily tied to a netlink socket.
That is an untuitive connection -- memory references, h/w queues, flow steering should be tied to the datapath socket, not a control plane socket.
There's one RSS context for may datapath sockets. Plus a lot of the APIs already exist, and it's more of a question of packaging them up at the user space level. For things which do not have an API, however, netlink, please.
I do not see how 1 RSS context (or more specifically a h/w Rx queue) can be used properly with memory from different processes (or dma-buf references). When the process dies, that memory needs to be flushed from the H/W queues. Queues with interlaced submissions make that more complicated.
I guess the devil is in the details; I look forward to the evolution of the patches.
On Tue, 18 Jul 2023 16:35:17 -0600 David Ahern wrote:
I do not see how 1 RSS context (or more specifically a h/w Rx queue) can be used properly with memory from different processes (or dma-buf references). When the process dies, that memory needs to be flushed from the H/W queues. Queues with interlaced submissions make that more complicated.
Agreed, one process, one control path socket.
FWIW the rtnetlink use of netlink is very basic. genetlink already has some infra which allows associate state with a user socket and cleaning it up when the socket gets closed. This needs some improvements. A bit of a chicken and egg problem, I can't make the improvements until there are families making use of it, and nobody will make use of it until it's in tree... But the basics are already in place and I can help with building it out.
I guess the devil is in the details; I look forward to the evolution of the patches.
+1
On Tue, Jul 18, 2023 at 3:45 PM Jakub Kicinski kuba@kernel.org wrote:
On Tue, 18 Jul 2023 16:35:17 -0600 David Ahern wrote:
I do not see how 1 RSS context (or more specifically a h/w Rx queue) can be used properly with memory from different processes (or dma-buf references).
Right, my experience with dma-bufs from GPUs are that they're allocated from the userspace and owned by the process that allocated the backing GPU memory and generated the dma-buf from it. I.e., we're limited to 1 dma-buf per RX queue. If we enable binding multiple dma-bufs to the same RX queue, we have a problem, because AFAIU the NIC can't decide which dma-buf to put the packet into (it hasn't parsed the packet's destination yet).
When the process dies, that memory needs to be flushed from the H/W queues. Queues with interlaced submissions make that more complicated.
When the process dies, do we really want to flush the memory from the hardware queues? The drivers I looked at don't seem to have a function to flush the rx queues alone, they usually do an entire driver reset to achieve that. Not sure if that's just convenience or there is some technical limitation there. Do we really want to trigger a driver reset at the event a userspace process crashes?
Agreed, one process, one control path socket.
FWIW the rtnetlink use of netlink is very basic. genetlink already has some infra which allows associate state with a user socket and cleaning it up when the socket gets closed. This needs some improvements. A bit of a chicken and egg problem, I can't make the improvements until there are families making use of it, and nobody will make use of it until it's in tree... But the basics are already in place and I can help with building it out.
I had this approach in mind (which doesn't need netlink improvements) for the next POC. It's mostly inspired by the comments from the cover letter of Jakub's memory-provider RFC, if I understood it correctly. I'm sure there's going to be some iteration, but roughly:
1. A netlink CAP_NET_ADMIN API which binds the dma-buf to any number of rx queues on 1 NIC. It will do the dma_buf_attach() and dma_buf_map_attachment() and leave some indicator in the struct net_device to tell the NIC that it's bound to a dma-buf. The actual binding doesn't actuate until the next driver reset. The API, I guess, can cause a driver reset (or just a refill of the rx queues, if you think that's feasible) as well to streamline things a bit. The API returns a file handle to the user representing that binding.
2. On the driver reset, the driver notices that its struct net_device is bound to a dma-buf, and sets up the dma-buf memory-provider instead of the basic one which provides host memory.
3. The user can close the file handle received in #1 to unbind the dma-buf from the rx queues. Or if the user crashes, the kernel closes the handle for us. The unbind doesn't take effect until the next flushing or rx queues, or the next driver reset (not sure the former is feasible).
4. The dma-buf memory provider keeps the dma buf mapping alive until the next driver reset, where all the dma-buf slices are freed, and the dma buf attachment mapping can be unmapped.
I'm thinking the user sets up RSS and flow steering outside this API using existing ethtool APIs, but things can be streamlined a bit by doing some of these RSS/flow steering steps in cohesion with the dma-buf binding/unbinding. The complication with setting up flow steering in cohesion with dma-buf bind unbind is that the application may start more connections after the bind, and it will need to install flow steering rules for those too, and use the ethtool api for that. May as well use the ethtool apis for all of it...?
From Jakub and David's comments it sounds (if I understood correctly), you'd like to tie the dma-buf bind/unbind functions to the lifetime of a netlink socket, rather than a struct file like I was thinking. That does sound cleaner, but I'm not sure how. Can you link me to any existing code examples? Or rough pointers to any existing code?
I guess the devil is in the details; I look forward to the evolution of the patches.
+1
On Wed, 19 Jul 2023 08:10:58 -0700 Mina Almasry almasrymina@google.com wrote:
On Tue, Jul 18, 2023 at 3:45 PM Jakub Kicinski kuba@kernel.org wrote:
On Tue, 18 Jul 2023 16:35:17 -0600 David Ahern wrote:
I do not see how 1 RSS context (or more specifically a h/w Rx queue) can be used properly with memory from different processes (or dma-buf references).
Right, my experience with dma-bufs from GPUs are that they're allocated from the userspace and owned by the process that allocated the backing GPU memory and generated the dma-buf from it. I.e., we're limited to 1 dma-buf per RX queue. If we enable binding multiple dma-bufs to the same RX queue, we have a problem, because AFAIU the NIC can't decide which dma-buf to put the packet into (it hasn't parsed the packet's destination yet).
When the process dies, that memory needs to be flushed from the H/W queues. Queues with interlaced submissions make that more complicated.
When the process dies, do we really want to flush the memory from the hardware queues? The drivers I looked at don't seem to have a function to flush the rx queues alone, they usually do an entire driver reset to achieve that. Not sure if that's just convenience or there is some technical limitation there. Do we really want to trigger a driver reset at the event a userspace process crashes?
Naive idea. Would it be possible for process to use mmap() on the GPU memory and then do zero copy TCP receive some how? Or is this what is being proposed.
On Wed, Jul 19, 2023 at 10:57:11AM -0700, Stephen Hemminger wrote:
Naive idea. Would it be possible for process to use mmap() on the GPU memory and then do zero copy TCP receive some how? Or is this what is being proposed.
It could be possible, but currently there is no API to recover the underlying dmabuf from the VMA backing the mmap.
Also you can't just take arbitary struct pages from any old VMA and make them "netmem"
Jason
Am 20.07.23 um 01:24 schrieb Jason Gunthorpe:
On Wed, Jul 19, 2023 at 10:57:11AM -0700, Stephen Hemminger wrote:
Naive idea. Would it be possible for process to use mmap() on the GPU memory and then do zero copy TCP receive some how? Or is this what is being proposed.
It could be possible, but currently there is no API to recover the underlying dmabuf from the VMA backing the mmap.
Sorry for being a bit late, have been on vacation.
Well actually this was discussed before to work around problems with Windows applications through wine/proton.
Not 100% sure what the outcome of that was, but if I'm not completely mistaken getting the fd behind a VMA should be possible.
It might just not be the DMA-buf fd, because we use mmap() re-routing to be able to work around problems with the reverse tracking of mappings.
Christian.
Also you can't just take arbitary struct pages from any old VMA and make them "netmem"
Jason _______________________________________________ Linaro-mm-sig mailing list -- linaro-mm-sig@lists.linaro.org To unsubscribe send an email to linaro-mm-sig-leave@lists.linaro.org
On Wed, 19 Jul 2023 08:10:58 -0700 Mina Almasry wrote:
From Jakub and David's comments it sounds (if I understood correctly), you'd like to tie the dma-buf bind/unbind functions to the lifetime of a netlink socket, rather than a struct file like I was thinking. That does sound cleaner, but I'm not sure how. Can you link me to any existing code examples? Or rough pointers to any existing code?
I don't have a strong preference whether the lifetime is bound to the socket or not. My main point was that if we're binding lifetimes to processes, it should be done via netlink sockets, not special- -purpose FDs. Inevitably more commands and info will be needed and we'll start reinventing the uAPI wheel which is Netlink.
Currently adding state to netlink sockets is a bit raw. You can create an Xarray which stores the per socket state using socket's portid (genl_info->snd_portid) and use netlink_register_notifier() to get notifications when sockets are closed.
linaro-mm-sig@lists.linaro.org