On Mon, Nov 6, 2023 at 11:34 AM David Ahern dsahern@kernel.org wrote:
On 11/6/23 11:47 AM, Stanislav Fomichev wrote:
On 11/05, Mina Almasry wrote:
For device memory TCP, we expect the skb headers to be available in host memory for access, and we expect the skb frags to be in device memory and inaccessible to the host. We expect there to be no mixing and matching of device memory frags (inaccessible) with host memory frags (accessible) in the same skb.
Add a skb->devmem flag which indicates whether the frags in this skb are device memory frags or not.
__skb_fill_page_desc() now checks frags added to skbs for page_pool_iovs, and marks the skb as skb->devmem accordingly.
Add checks through the network stack to avoid accessing the frags of devmem skbs and avoid coalescing devmem skbs with non devmem skbs.
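To make the invariant above concrete, here is a minimal userspace sketch. It is not the code from this patch: struct mock_frag, struct mock_skb, mock_fill_frag() and mock_can_coalesce() are hypothetical stand-ins, and refusing a mismatched frag at fill time is an assumption made only for illustration. It shows the first frag deciding the skb's memory type and coalescing being refused when the two skbs disagree.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MOCK_MAX_FRAGS 4

/* Hypothetical stand-in for a frag; is_devmem models "backed by a page_pool_iov". */
struct mock_frag {
	bool is_devmem;
	size_t len;
};

/* Hypothetical stand-in for struct sk_buff, reduced to what the sketch needs. */
struct mock_skb {
	struct mock_frag frags[MOCK_MAX_FRAGS];
	unsigned int nr_frags;
	bool devmem;	/* models skb->devmem */
};

/*
 * Models the marking described for __skb_fill_page_desc(): the skb takes
 * the memory type of its frags. Assumption: a frag of the other type is
 * rejected, so an skb never mixes host and device memory frags.
 */
static bool mock_fill_frag(struct mock_skb *skb, bool is_devmem, size_t len)
{
	assert(skb->nr_frags < MOCK_MAX_FRAGS);

	if (skb->nr_frags > 0 && skb->devmem != is_devmem)
		return false;

	skb->frags[skb->nr_frags].is_devmem = is_devmem;
	skb->frags[skb->nr_frags].len = len;
	skb->nr_frags++;
	skb->devmem = is_devmem;
	return true;
}

/* Coalescing is only allowed between skbs of the same memory type. */
static bool mock_can_coalesce(const struct mock_skb *to,
			      const struct mock_skb *from)
{
	return to->devmem == from->devmem;
}
```

In this sketch a devmem skb rejects a host memory frag, and mock_can_coalesce() returns false for a devmem/non-devmem pair.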
Signed-off-by: Willem de Bruijn willemb@google.com
Signed-off-by: Kaiyuan Zhang kaiyuanz@google.com
Signed-off-by: Mina Almasry almasrymina@google.com
 include/linux/skbuff.h | 14 +++++++-
 include/net/tcp.h      |  5 +--
 net/core/datagram.c    |  6 ++++
 net/core/gro.c         |  5 ++-
 net/core/skbuff.c      | 77 ++++++++++++++++++++++++++++++++++++------
 net/ipv4/tcp.c         |  6 ++++
 net/ipv4/tcp_input.c   | 13 +++++--
 net/ipv4/tcp_output.c  |  5 ++-
 net/packet/af_packet.c |  4 +--
 9 files changed, 115 insertions(+), 20 deletions(-)
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 1fae276c1353..8fb468ff8115 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -805,6 +805,8 @@ typedef unsigned char *sk_buff_data_t;
  *	@csum_level: indicates the number of consecutive checksums found in
  *		the packet minus one that have been verified as
  *		CHECKSUM_UNNECESSARY (max 3)
+ *	@devmem: indicates that all the fragments in this skb are backed by
+ *		device memory.
  *	@dst_pending_confirm: need to confirm neighbour
  *	@decrypted: Decrypted SKB
  *	@slow_gro: state present at GRO time, slower prepare step required
@@ -991,7 +993,7 @@ struct sk_buff {
 #if IS_ENABLED(CONFIG_IP_SCTP)
 	__u8			csum_not_inet:1;
 #endif
+	__u8			devmem:1;
 #if defined(CONFIG_NET_SCHED) || defined(CONFIG_NET_XGRESS)
 	__u16			tc_index;	/* traffic control index */
 #endif
@@ -1766,6 +1768,12 @@ static inline void skb_zcopy_downgrade_managed(struct sk_buff *skb)
 	__skb_zcopy_downgrade_managed(skb);
 }

+/* Return true if frags in this skb are not readable by the host. */
+static inline bool skb_frags_not_readable(const struct sk_buff *skb)
+{
+	return skb->devmem;
bikeshedding: should we also rename 'devmem' sk_buff flag to 'not_readable'? It better communicates the fact that the stack shouldn't dereference the frags (because it has 'devmem' fragments or for some other potential future reason).
+1.
Also, the flag on the skb is an optimization - a high level signal that one or more frags are in unreadable memory. There is no requirement that all of the frags are in the same memory type.
The flag indicates that the skb contains all devmem dma-buf memory specifically, not generic 'not_readable' frags as the comment says:
+ *	@devmem: indicates that all the fragments in this skb are backed by
+ *		device memory.
The reason it's not a generic 'not_readable' flag is that handing off a generic not_readable skb to userspace is semantically not what we're doing. recvmsg() is augmented in this patch series to return a devmem skb to the user via a cmsg_devmem struct which refers specifically to the memory in the dma-buf. recvmsg() in this patch series is not augmented to give any 'not_readable' skb to userspace.
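The two readings of the flag discussed above only differ for an skb that mixes memory types. A small userspace sketch of the difference follows. enum mock_mem, mock_any_unreadable() and mock_all_devmem() are hypothetical names made up for illustration, not kernel code, and a plain array of memory types stands in for the frags.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical memory types for a frag; not kernel definitions. */
enum mock_mem { MOCK_HOST, MOCK_DEVMEM };

/* Reading 1: a high level hint, true if at least one frag is unreadable. */
static bool mock_any_unreadable(const enum mock_mem *frags, size_t n)
{
	assert(frags != NULL || n == 0);

	for (size_t i = 0; i < n; i++)
		if (frags[i] != MOCK_HOST)
			return true;
	return false;
}

/* Reading 2: true only if there is at least one frag and all are devmem. */
static bool mock_all_devmem(const enum mock_mem *frags, size_t n)
{
	assert(frags != NULL || n == 0);

	for (size_t i = 0; i < n; i++)
		if (frags[i] != MOCK_DEVMEM)
			return false;
	return n > 0;
}
```

For a mixed { host, devmem } array the first helper returns true and the second returns false. For an all-devmem array both return true, which is the only case the series as posted expects to see.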
IMHO skb->devmem + an skb_frags_not_readable() helper as implemented is correct. If a new type of unreadable skb is introduced to the stack, I imagine the stack would implement:
1. A new header flag: skb->newmem

2. static inline bool skb_frags_not_readable(const struct sk_buff *skb)
   {
           return skb->devmem || skb->newmem;
   }

3. tcp_recvmsg_devmem() would handle skb->devmem skbs as in this patch series, but tcp_recvmsg_newmem() would handle skb->newmem skbs.
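Roughly, the three steps could fit together as in the userspace sketch below. Everything here is hypothetical: struct mock_skb, enum mock_recv_path and mock_pick_recv_path() are invented for illustration, 'newmem' does not exist anywhere, and the sketch assumes the two flags are mutually exclusive.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the two sk_buff flag bits discussed above. */
struct mock_skb {
	unsigned int devmem:1;
	unsigned int newmem:1;	/* imaginary future memory type */
};

/* Which receive path would handle the skb's payload. */
enum mock_recv_path {
	MOCK_RECV_HOST,		/* regular copy to user */
	MOCK_RECV_DEVMEM,	/* what tcp_recvmsg_devmem() would cover */
	MOCK_RECV_NEWMEM,	/* what tcp_recvmsg_newmem() would cover */
};

/* Generic check: the stack must not dereference the frags. */
static bool mock_frags_not_readable(const struct mock_skb *skb)
{
	return skb->devmem || skb->newmem;
}

/* The specific flag, not the generic check, selects the recvmsg path. */
static enum mock_recv_path mock_pick_recv_path(const struct mock_skb *skb)
{
	/* Assumption: an skb carries at most one unreadable memory type. */
	assert(!(skb->devmem && skb->newmem));

	if (!mock_frags_not_readable(skb))
		return MOCK_RECV_HOST;
	return skb->devmem ? MOCK_RECV_DEVMEM : MOCK_RECV_NEWMEM;
}
```

The point of the split is that the generic helper only answers "may the stack touch the frags?", while the per-type flag decides what gets handed to the user.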