Greetings:
Welcome to the RFC.
Currently, when a user app uses sendfile the user app has no way to know if the bytes were transmit; sendfile simply returns, but it is possible that a slow client on the other side may take time to receive and ACK the bytes. In the meantime, the user app which called sendfile has no way to know whether it can overwrite the data on disk that it just sendfile'd.
One way to fix this is to add zerocopy notifications to sendfile similar to how MSG_ZEROCOPY works with sendmsg. This is possible thanks to the extensive work done by Pavel [1].
To support this, two important user ABI changes are proposed:
- A new splice flag, SPLICE_F_ZC, which allows users to signal that splice should generate zerocopy notifications if possible.
- A new system call, sendfile2, which is similar to sendfile64 except that it takes an additional argument, flags, which allows the user to specify either a "regular" sendfile or a sendfile with zerocopy notifications enabled.
In either case, user apps can read notifications from the error queue (like they would with MSG_ZEROCOPY) to determine when their call to sendfile has completed.
I tested this RFC using the selftest modified in the last patch and also by using the selftest between two different physical hosts:
# server ./msg_zerocopy -4 -i eth0 -t 2 -v -r tcp
# client (does the sendfiling) dd if=/dev/zero of=sendfile_data bs=1M count=8 ./msg_zerocopy -4 -i eth0 -D $SERVER_IP -v -l 1 -t 2 -z -f sendfile_data tcp
I would love to get high level feedback from folks on a few things:
- Is this functionality, at a high level, something that would be desirable / useful? I think so, but I'm of course I am biased ;)
- Is this approach generally headed in the right direction? Are the proposed user ABI changes reasonable?
If the above two points are generally agreed upon then I'd welcome feedback on the patches themselves :)
This is kind of a net thing, but also kind of a splice thing so hope I am sending this to right places to get appropriate feedback. I based my code on the vfs/for-next tree, but am happy to rebase on another tree if desired. The cc-list got a little out of control, so I manually trimmed it down quite a bit; sorry if I missed anyone I should have CC'd in the process.
Thanks, Joe
[1]: https://lore.kernel.org/netdev/cover.1657643355.git.asml.silence@gmail.com/
Joe Damato (10): splice: Add ubuf_info to prepare for ZC splice: Add helper that passes through splice_desc splice: Factor splice_socket into a helper splice: Add SPLICE_F_ZC and attach ubuf fs: Add splice_write_sd to file operations fs: Extend do_sendfile to take a flags argument fs: Add sendfile2 which accepts a flags argument fs: Add sendfile flags for sendfile2 fs: Add sendfile2 syscall selftests: Add sendfile zerocopy notification test
arch/alpha/kernel/syscalls/syscall.tbl | 1 + arch/arm/tools/syscall.tbl | 1 + arch/arm64/tools/syscall_32.tbl | 1 + arch/m68k/kernel/syscalls/syscall.tbl | 1 + arch/microblaze/kernel/syscalls/syscall.tbl | 1 + arch/mips/kernel/syscalls/syscall_n32.tbl | 1 + arch/mips/kernel/syscalls/syscall_n64.tbl | 1 + arch/mips/kernel/syscalls/syscall_o32.tbl | 1 + arch/parisc/kernel/syscalls/syscall.tbl | 1 + arch/powerpc/kernel/syscalls/syscall.tbl | 1 + arch/s390/kernel/syscalls/syscall.tbl | 1 + arch/sh/kernel/syscalls/syscall.tbl | 1 + arch/sparc/kernel/syscalls/syscall.tbl | 1 + arch/x86/entry/syscalls/syscall_32.tbl | 1 + arch/x86/entry/syscalls/syscall_64.tbl | 1 + arch/xtensa/kernel/syscalls/syscall.tbl | 1 + fs/read_write.c | 40 +++++++--- fs/splice.c | 87 +++++++++++++++++---- include/linux/fs.h | 2 + include/linux/sendfile.h | 10 +++ include/linux/splice.h | 7 +- include/linux/syscalls.h | 2 + include/uapi/asm-generic/unistd.h | 4 +- net/socket.c | 1 + scripts/syscall.tbl | 1 + tools/testing/selftests/net/msg_zerocopy.c | 54 ++++++++++++- tools/testing/selftests/net/msg_zerocopy.sh | 5 ++ 27 files changed, 200 insertions(+), 29 deletions(-) create mode 100644 include/linux/sendfile.h
base-commit: 2e72b1e0aac24a12f3bf3eec620efaca7ab7d4de
Update struct splice_desc to include ubuf_info to prepare splice for zero copy notifications.
Signed-off-by: Joe Damato jdamato@fastly.com --- include/linux/splice.h | 2 ++ 1 file changed, 2 insertions(+)
diff --git a/include/linux/splice.h b/include/linux/splice.h index 9dec4861d09f..7477df3916e2 100644 --- a/include/linux/splice.h +++ b/include/linux/splice.h @@ -10,6 +10,7 @@ #define SPLICE_H
#include <linux/pipe_fs_i.h> +#include <linux/skbuff.h>
/* * Flags passed in from splice/tee/vmsplice @@ -43,6 +44,7 @@ struct splice_desc { loff_t *opos; /* sendfile: output position */ size_t num_spliced; /* number of bytes already spliced */ bool need_wakeup; /* need to wake up writer */ + struct ubuf_info *ubuf_info; /* zerocopy infrastructure */ };
struct partial_page {
Add do_splice_from_sd which takes splice_desc as an argument. This helper is just a wrapper around splice_write but will be extended. Use the helper from existing splice code.
Signed-off-by: Joe Damato jdamato@fastly.com --- fs/splice.c | 13 +++++++++++-- 1 file changed, 11 insertions(+), 2 deletions(-)
diff --git a/fs/splice.c b/fs/splice.c index 2898fa1e9e63..9575074a1296 100644 --- a/fs/splice.c +++ b/fs/splice.c @@ -941,6 +941,15 @@ static ssize_t do_splice_from(struct pipe_inode_info *pipe, struct file *out, return out->f_op->splice_write(pipe, out, ppos, len, flags); }
+static ssize_t do_splice_from_sd(struct pipe_inode_info *pipe, struct file *out, + struct splice_desc *sd) +{ + if (unlikely(!out->f_op->splice_write)) + return warn_unsupported(out, "write"); + return out->f_op->splice_write(pipe, out, sd->opos, sd->total_len, + sd->flags); +} + /* * Indicate to the caller that there was a premature EOF when reading from the * source and the caller didn't indicate they would be sending more data after @@ -1161,7 +1170,7 @@ static int direct_splice_actor(struct pipe_inode_info *pipe, long ret;
file_start_write(file); - ret = do_splice_from(pipe, file, sd->opos, sd->total_len, sd->flags); + ret = do_splice_from_sd(pipe, file, sd); file_end_write(file); return ret; } @@ -1171,7 +1180,7 @@ static int splice_file_range_actor(struct pipe_inode_info *pipe, { struct file *file = sd->u.file;
- return do_splice_from(pipe, file, sd->opos, sd->total_len, sd->flags); + return do_splice_from_sd(pipe, file, sd); }
static void direct_file_splice_eof(struct splice_desc *sd)
splice_socket becomes a wrapper around splice_socket_generic which takes a ubuf pointer to prepare for zerocopy notifications.
Signed-off-by: Joe Damato jdamato@fastly.com --- fs/splice.c | 40 +++++++++++++++++++++++++--------------- 1 file changed, 25 insertions(+), 15 deletions(-)
diff --git a/fs/splice.c b/fs/splice.c index 9575074a1296..1f27ce6d1c34 100644 --- a/fs/splice.c +++ b/fs/splice.c @@ -37,6 +37,8 @@ #include <linux/socket.h> #include <linux/sched/signal.h>
+#include <net/sock.h> + #include "internal.h"
/* @@ -783,21 +785,10 @@ iter_file_splice_write(struct pipe_inode_info *pipe, struct file *out, EXPORT_SYMBOL(iter_file_splice_write);
#ifdef CONFIG_NET -/** - * splice_to_socket - splice data from a pipe to a socket - * @pipe: pipe to splice from - * @out: socket to write to - * @ppos: position in @out - * @len: number of bytes to splice - * @flags: splice modifier flags - * - * Description: - * Will send @len bytes from the pipe to a network socket. No data copying - * is involved. - * - */ -ssize_t splice_to_socket(struct pipe_inode_info *pipe, struct file *out, - loff_t *ppos, size_t len, unsigned int flags) +static ssize_t splice_socket_generic(struct pipe_inode_info *pipe, + struct file *out, loff_t *ppos, + size_t len, unsigned int flags, + struct ubuf_info *ubuf_info) { struct socket *sock = sock_from_file(out); struct bio_vec bvec[16]; @@ -920,6 +911,25 @@ ssize_t splice_to_socket(struct pipe_inode_info *pipe, struct file *out, wakeup_pipe_writers(pipe); return spliced ?: ret; } + +/** + * splice_to_socket - splice data from a pipe to a socket + * @pipe: pipe to splice from + * @out: socket to write to + * @ppos: position in @out + * @len: number of bytes to splice + * @flags: splice modifier flags + * + * Description: + * Will send @len bytes from the pipe to a network socket. No data copying + * is involved. + * + */ +ssize_t splice_to_socket(struct pipe_inode_info *pipe, struct file *out, + loff_t *ppos, size_t len, unsigned int flags) +{ + return splice_socket_generic(pipe, out, ppos, len, flags, NULL); +} #endif
static int warn_unsupported(struct file *file, const char *op)
Add the SPLICE_F_ZC flag and when it is set, allocate a ubuf and attach it to generate zerocopy notifications.
Signed-off-by: Joe Damato jdamato@fastly.com --- fs/splice.c | 20 ++++++++++++++++++++ include/linux/splice.h | 3 ++- 2 files changed, 22 insertions(+), 1 deletion(-)
diff --git a/fs/splice.c b/fs/splice.c index 1f27ce6d1c34..6dc60f47f84e 100644 --- a/fs/splice.c +++ b/fs/splice.c @@ -875,6 +875,11 @@ static ssize_t splice_socket_generic(struct pipe_inode_info *pipe, if (out->f_flags & O_NONBLOCK) msg.msg_flags |= MSG_DONTWAIT;
+ if (unlikely(flags & SPLICE_F_ZC) && ubuf_info) { + msg.msg_flags = MSG_ZEROCOPY; + msg.msg_ubuf = ubuf_info; + } + iov_iter_bvec(&msg.msg_iter, ITER_SOURCE, bvec, bc, len - remain); ret = sock_sendmsg(sock, &msg); @@ -1223,12 +1228,27 @@ static ssize_t do_splice_direct_actor(struct file *in, loff_t *ppos, if (unlikely(out->f_flags & O_APPEND)) return -EINVAL;
+ if (unlikely(flags & SPLICE_F_ZC)) { + struct socket *sock = sock_from_file(out); + struct sock *sk = sock->sk; + struct ubuf_info *ubuf_info; + + ubuf_info = msg_zerocopy_realloc(sk, len, NULL); + if (!ubuf_info) + return -ENOMEM; + sd.ubuf_info = ubuf_info; + } + ret = splice_direct_to_actor(in, &sd, actor); if (ret > 0) *ppos = sd.pos;
+ if (unlikely(flags & SPLICE_F_ZC)) + refcount_dec(&sd.ubuf_info->refcnt); + return ret; } + /** * do_splice_direct - splices data directly between two files * @in: file to splice from diff --git a/include/linux/splice.h b/include/linux/splice.h index 7477df3916e2..a88588cf2754 100644 --- a/include/linux/splice.h +++ b/include/linux/splice.h @@ -21,8 +21,9 @@ /* from/to, of course */ #define SPLICE_F_MORE (0x04) /* expect more data */ #define SPLICE_F_GIFT (0x08) /* pages passed in are a gift */ +#define SPLICE_F_ZC (0x10) /* generate zero copy notifications */
-#define SPLICE_F_ALL (SPLICE_F_MOVE|SPLICE_F_NONBLOCK|SPLICE_F_MORE|SPLICE_F_GIFT) +#define SPLICE_F_ALL (SPLICE_F_MOVE|SPLICE_F_NONBLOCK|SPLICE_F_MORE|SPLICE_F_GIFT|SPLICE_F_ZC)
/* * Passed to the actors
Introduce splice_write_sd to file operations and export a new helper for sockets splice_to_socket_sd to pass through the splice_desc context allowing the allocated ubuf to be attached.
Signed-off-by: Joe Damato jdamato@fastly.com --- fs/splice.c | 22 ++++++++++++++++++---- include/linux/fs.h | 2 ++ include/linux/splice.h | 2 ++ net/socket.c | 1 + 4 files changed, 23 insertions(+), 4 deletions(-)
diff --git a/fs/splice.c b/fs/splice.c index 6dc60f47f84e..d08fa2a6d930 100644 --- a/fs/splice.c +++ b/fs/splice.c @@ -935,6 +935,16 @@ ssize_t splice_to_socket(struct pipe_inode_info *pipe, struct file *out, { return splice_socket_generic(pipe, out, ppos, len, flags, NULL); } + +ssize_t splice_to_socket_sd(struct pipe_inode_info *pipe, + struct file *out, struct splice_desc *sd) +{ + ssize_t ret; + + ret = splice_socket_generic(pipe, out, sd->opos, sd->total_len, + sd->flags, sd->ubuf_info); + return ret; +} #endif
static int warn_unsupported(struct file *file, const char *op) @@ -959,10 +969,14 @@ static ssize_t do_splice_from(struct pipe_inode_info *pipe, struct file *out, static ssize_t do_splice_from_sd(struct pipe_inode_info *pipe, struct file *out, struct splice_desc *sd) { - if (unlikely(!out->f_op->splice_write)) - return warn_unsupported(out, "write"); - return out->f_op->splice_write(pipe, out, sd->opos, sd->total_len, - sd->flags); + if (likely(!(sd->flags & SPLICE_F_ZC))) { + if (unlikely(!out->f_op->splice_write)) + return warn_unsupported(out, "write"); + return out->f_op->splice_write(pipe, out, sd->opos, + sd->total_len, sd->flags); + } else { + return out->f_op->splice_write_sd(pipe, out, sd); + } }
/* diff --git a/include/linux/fs.h b/include/linux/fs.h index 7e29433c5ecc..843e8b8a1d4d 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -2065,6 +2065,7 @@ struct dir_context { struct iov_iter; struct io_uring_cmd; struct offset_ctx; +struct splice_desc;
typedef unsigned int __bitwise fop_flags_t;
@@ -2093,6 +2094,7 @@ struct file_operations { int (*check_flags)(int); int (*flock) (struct file *, int, struct file_lock *); ssize_t (*splice_write)(struct pipe_inode_info *, struct file *, loff_t *, size_t, unsigned int); + ssize_t (*splice_write_sd)(struct pipe_inode_info *, struct file *, struct splice_desc *); ssize_t (*splice_read)(struct file *, loff_t *, struct pipe_inode_info *, size_t, unsigned int); void (*splice_eof)(struct file *file); int (*setlease)(struct file *, int, struct file_lease **, void **); diff --git a/include/linux/splice.h b/include/linux/splice.h index a88588cf2754..356b8cae4818 100644 --- a/include/linux/splice.h +++ b/include/linux/splice.h @@ -100,6 +100,8 @@ static inline long splice_copy_file_range(struct file *in, loff_t pos_in,
ssize_t do_tee(struct file *in, struct file *out, size_t len, unsigned int flags); +ssize_t splice_to_socket_sd(struct pipe_inode_info *pipe, struct file *out, + struct splice_desc *sd); ssize_t splice_to_socket(struct pipe_inode_info *pipe, struct file *out, loff_t *ppos, size_t len, unsigned int flags);
diff --git a/net/socket.c b/net/socket.c index 9a117248f18f..4baf26a36477 100644 --- a/net/socket.c +++ b/net/socket.c @@ -165,6 +165,7 @@ static const struct file_operations socket_file_ops = { .release = sock_close, .fasync = sock_fasync, .splice_write = splice_to_socket, + .splice_write_sd = splice_to_socket_sd, .splice_read = sock_splice_read, .splice_eof = sock_splice_eof, .show_fdinfo = sock_show_fdinfo,
Extend the internal do_sendfile to take a flags argument, which will be used in future commits to signal that userland wants zerocopy notifications.
This commit does not change anything about sendfile or sendfile64.
Signed-off-by: Joe Damato jdamato@fastly.com --- fs/read_write.c | 18 +++++++++--------- 1 file changed, 9 insertions(+), 9 deletions(-)
diff --git a/fs/read_write.c b/fs/read_write.c index a6133241dfb8..03d2a93c3d1b 100644 --- a/fs/read_write.c +++ b/fs/read_write.c @@ -1293,7 +1293,7 @@ COMPAT_SYSCALL_DEFINE6(pwritev2, compat_ulong_t, fd, #endif /* CONFIG_COMPAT */
static ssize_t do_sendfile(int out_fd, int in_fd, loff_t *ppos, - size_t count, loff_t max) + size_t count, loff_t max, int flags) { struct inode *in_inode, *out_inode; struct pipe_inode_info *opipe; @@ -1398,13 +1398,13 @@ SYSCALL_DEFINE4(sendfile, int, out_fd, int, in_fd, off_t __user *, offset, size_ if (unlikely(get_user(off, offset))) return -EFAULT; pos = off; - ret = do_sendfile(out_fd, in_fd, &pos, count, MAX_NON_LFS); + ret = do_sendfile(out_fd, in_fd, &pos, count, MAX_NON_LFS, 0); if (unlikely(put_user(pos, offset))) return -EFAULT; return ret; }
- return do_sendfile(out_fd, in_fd, NULL, count, 0); + return do_sendfile(out_fd, in_fd, NULL, count, 0, 0); }
SYSCALL_DEFINE4(sendfile64, int, out_fd, int, in_fd, loff_t __user *, offset, size_t, count) @@ -1415,13 +1415,13 @@ SYSCALL_DEFINE4(sendfile64, int, out_fd, int, in_fd, loff_t __user *, offset, si if (offset) { if (unlikely(copy_from_user(&pos, offset, sizeof(loff_t)))) return -EFAULT; - ret = do_sendfile(out_fd, in_fd, &pos, count, 0); + ret = do_sendfile(out_fd, in_fd, &pos, count, 0, 0); if (unlikely(put_user(pos, offset))) return -EFAULT; return ret; }
- return do_sendfile(out_fd, in_fd, NULL, count, 0); + return do_sendfile(out_fd, in_fd, NULL, count, 0, 0); }
#ifdef CONFIG_COMPAT @@ -1436,13 +1436,13 @@ COMPAT_SYSCALL_DEFINE4(sendfile, int, out_fd, int, in_fd, if (unlikely(get_user(off, offset))) return -EFAULT; pos = off; - ret = do_sendfile(out_fd, in_fd, &pos, count, MAX_NON_LFS); + ret = do_sendfile(out_fd, in_fd, &pos, count, MAX_NON_LFS, 0); if (unlikely(put_user(pos, offset))) return -EFAULT; return ret; }
- return do_sendfile(out_fd, in_fd, NULL, count, 0); + return do_sendfile(out_fd, in_fd, NULL, count, 0, 0); }
COMPAT_SYSCALL_DEFINE4(sendfile64, int, out_fd, int, in_fd, @@ -1454,13 +1454,13 @@ COMPAT_SYSCALL_DEFINE4(sendfile64, int, out_fd, int, in_fd, if (offset) { if (unlikely(copy_from_user(&pos, offset, sizeof(loff_t)))) return -EFAULT; - ret = do_sendfile(out_fd, in_fd, &pos, count, 0); + ret = do_sendfile(out_fd, in_fd, &pos, count, 0, 0); if (unlikely(put_user(pos, offset))) return -EFAULT; return ret; }
- return do_sendfile(out_fd, in_fd, NULL, count, 0); + return do_sendfile(out_fd, in_fd, NULL, count, 0, 0); } #endif
Add sendfile2 which is similar to sendfile64, but takes a flags argument.
Signed-off-by: Joe Damato jdamato@fastly.com --- fs/read_write.c | 17 +++++++++++++++++ 1 file changed, 17 insertions(+)
diff --git a/fs/read_write.c b/fs/read_write.c index 03d2a93c3d1b..057e5f37645d 100644 --- a/fs/read_write.c +++ b/fs/read_write.c @@ -1424,6 +1424,23 @@ SYSCALL_DEFINE4(sendfile64, int, out_fd, int, in_fd, loff_t __user *, offset, si return do_sendfile(out_fd, in_fd, NULL, count, 0, 0); }
+SYSCALL_DEFINE5(sendfile2, int, out_fd, int, in_fd, loff_t __user *, offset, size_t, count, int, flags) +{ + loff_t pos; + ssize_t ret; + + if (offset) { + if (unlikely(copy_from_user(&pos, offset, sizeof(loff_t)))) + return -EFAULT; + ret = do_sendfile(out_fd, in_fd, &pos, count, 0, flags); + if (unlikely(put_user(pos, offset))) + return -EFAULT; + return ret; + } + + return do_sendfile(out_fd, in_fd, NULL, count, 0, flags); +} + #ifdef CONFIG_COMPAT COMPAT_SYSCALL_DEFINE4(sendfile, int, out_fd, int, in_fd, compat_off_t __user *, offset, compat_size_t, count)
Add a default flag (SENDFILE_DEFAULT) and a flag for requesting zerocopy notifications (SENDFILE_ZC). do_sendfile is updated to pass through the corresponding splice flag to enable zerocopy notifications.
Signed-off-by: Joe Damato jdamato@fastly.com --- fs/read_write.c | 5 +++++ include/linux/sendfile.h | 10 ++++++++++ 2 files changed, 15 insertions(+) create mode 100644 include/linux/sendfile.h
diff --git a/fs/read_write.c b/fs/read_write.c index 057e5f37645d..e3929fd0f605 100644 --- a/fs/read_write.c +++ b/fs/read_write.c @@ -16,6 +16,7 @@ #include <linux/export.h> #include <linux/syscalls.h> #include <linux/pagemap.h> +#include <linux/sendfile.h> #include <linux/splice.h> #include <linux/compat.h> #include <linux/mount.h> @@ -1360,6 +1361,10 @@ static ssize_t do_sendfile(int out_fd, int in_fd, loff_t *ppos, retval = rw_verify_area(WRITE, fd_file(out), &out_pos, count); if (retval < 0) return retval; + + if (flags & SENDFILE_ZC) + fl |= SPLICE_F_ZC; + retval = do_splice_direct(fd_file(in), &pos, fd_file(out), &out_pos, count, fl); } else { diff --git a/include/linux/sendfile.h b/include/linux/sendfile.h new file mode 100644 index 000000000000..0bd3c76ea6f2 --- /dev/null +++ b/include/linux/sendfile.h @@ -0,0 +1,10 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +#ifndef SENDFILE_H +#define SENDFILE_H + +#define SENDFILE_DEFAULT (0x1) /* normal sendfile */ +#define SENDFILE_ZC (0x2) /* sendfile which generates ZC notifications */ + +#define SENDFILE_ALL (SENDFILE_DEFAULT|SENDFILE_ZC) + +#endif
The sendfile2 system call is similar to sendfile64, but takes a flags argument allowing the user to select either a default sendfile or for sendfile to generate zerocopy notifications similar to MSG_ZEROCOPY and sendmsg.
Signed-off-by: Joe Damato jdamato@fastly.com --- arch/alpha/kernel/syscalls/syscall.tbl | 1 + arch/arm/tools/syscall.tbl | 1 + arch/arm64/tools/syscall_32.tbl | 1 + arch/m68k/kernel/syscalls/syscall.tbl | 1 + arch/microblaze/kernel/syscalls/syscall.tbl | 1 + arch/mips/kernel/syscalls/syscall_n32.tbl | 1 + arch/mips/kernel/syscalls/syscall_n64.tbl | 1 + arch/mips/kernel/syscalls/syscall_o32.tbl | 1 + arch/parisc/kernel/syscalls/syscall.tbl | 1 + arch/powerpc/kernel/syscalls/syscall.tbl | 1 + arch/s390/kernel/syscalls/syscall.tbl | 1 + arch/sh/kernel/syscalls/syscall.tbl | 1 + arch/sparc/kernel/syscalls/syscall.tbl | 1 + arch/x86/entry/syscalls/syscall_32.tbl | 1 + arch/x86/entry/syscalls/syscall_64.tbl | 1 + arch/xtensa/kernel/syscalls/syscall.tbl | 1 + include/linux/syscalls.h | 2 ++ include/uapi/asm-generic/unistd.h | 4 +++- scripts/syscall.tbl | 1 + 19 files changed, 22 insertions(+), 1 deletion(-)
diff --git a/arch/alpha/kernel/syscalls/syscall.tbl b/arch/alpha/kernel/syscalls/syscall.tbl index c59d53d6d3f3..124313c745b6 100644 --- a/arch/alpha/kernel/syscalls/syscall.tbl +++ b/arch/alpha/kernel/syscalls/syscall.tbl @@ -506,3 +506,4 @@ 574 common getxattrat sys_getxattrat 575 common listxattrat sys_listxattrat 576 common removexattrat sys_removexattrat +577 common sendfile2 sys_sendfile2 diff --git a/arch/arm/tools/syscall.tbl b/arch/arm/tools/syscall.tbl index 49eeb2ad8dbd..ca61b5792148 100644 --- a/arch/arm/tools/syscall.tbl +++ b/arch/arm/tools/syscall.tbl @@ -481,3 +481,4 @@ 464 common getxattrat sys_getxattrat 465 common listxattrat sys_listxattrat 466 common removexattrat sys_removexattrat +467 common sendfile2 sys_sendfile2 diff --git a/arch/arm64/tools/syscall_32.tbl b/arch/arm64/tools/syscall_32.tbl index 69a829912a05..71695a61a1df 100644 --- a/arch/arm64/tools/syscall_32.tbl +++ b/arch/arm64/tools/syscall_32.tbl @@ -478,3 +478,4 @@ 464 common getxattrat sys_getxattrat 465 common listxattrat sys_listxattrat 466 common removexattrat sys_removexattrat +467 common sendfile2 sys_sendfile2 diff --git a/arch/m68k/kernel/syscalls/syscall.tbl b/arch/m68k/kernel/syscalls/syscall.tbl index f5ed71f1910d..6096a22b4472 100644 --- a/arch/m68k/kernel/syscalls/syscall.tbl +++ b/arch/m68k/kernel/syscalls/syscall.tbl @@ -466,3 +466,4 @@ 464 common getxattrat sys_getxattrat 465 common listxattrat sys_listxattrat 466 common removexattrat sys_removexattrat +467 common sendfile2 sys_sendfile2 diff --git a/arch/microblaze/kernel/syscalls/syscall.tbl b/arch/microblaze/kernel/syscalls/syscall.tbl index 680f568b77f2..0429dc26ceee 100644 --- a/arch/microblaze/kernel/syscalls/syscall.tbl +++ b/arch/microblaze/kernel/syscalls/syscall.tbl @@ -472,3 +472,4 @@ 464 common getxattrat sys_getxattrat 465 common listxattrat sys_listxattrat 466 common removexattrat sys_removexattrat +467 common sendfile2 sys_sendfile2 diff --git a/arch/mips/kernel/syscalls/syscall_n32.tbl b/arch/mips/kernel/syscalls/syscall_n32.tbl index 0b9b7e25b69a..f6571c8ecb15 100644 --- a/arch/mips/kernel/syscalls/syscall_n32.tbl +++ b/arch/mips/kernel/syscalls/syscall_n32.tbl @@ -405,3 +405,4 @@ 464 n32 getxattrat sys_getxattrat 465 n32 listxattrat sys_listxattrat 466 n32 removexattrat sys_removexattrat +467 n32 sendfile2 sys_sendfile2 diff --git a/arch/mips/kernel/syscalls/syscall_n64.tbl b/arch/mips/kernel/syscalls/syscall_n64.tbl index c844cd5cda62..532ce99478ee 100644 --- a/arch/mips/kernel/syscalls/syscall_n64.tbl +++ b/arch/mips/kernel/syscalls/syscall_n64.tbl @@ -381,3 +381,4 @@ 464 n64 getxattrat sys_getxattrat 465 n64 listxattrat sys_listxattrat 466 n64 removexattrat sys_removexattrat +467 n64 sendfile2 sys_sendfile2 diff --git a/arch/mips/kernel/syscalls/syscall_o32.tbl b/arch/mips/kernel/syscalls/syscall_o32.tbl index 349b8aad1159..9cacbbff6b12 100644 --- a/arch/mips/kernel/syscalls/syscall_o32.tbl +++ b/arch/mips/kernel/syscalls/syscall_o32.tbl @@ -454,3 +454,4 @@ 464 o32 getxattrat sys_getxattrat 465 o32 listxattrat sys_listxattrat 466 o32 removexattrat sys_removexattrat +467 o32 sendfile2 sys_sendfile2 diff --git a/arch/parisc/kernel/syscalls/syscall.tbl b/arch/parisc/kernel/syscalls/syscall.tbl index d9fc94c86965..ca5a3e6eb8f3 100644 --- a/arch/parisc/kernel/syscalls/syscall.tbl +++ b/arch/parisc/kernel/syscalls/syscall.tbl @@ -465,3 +465,4 @@ 464 common getxattrat sys_getxattrat 465 common listxattrat sys_listxattrat 466 common removexattrat sys_removexattrat +467 common sendfile2 sys_sendfile2 diff --git a/arch/powerpc/kernel/syscalls/syscall.tbl b/arch/powerpc/kernel/syscalls/syscall.tbl index d8b4ab78bef0..450392aed1eb 100644 --- a/arch/powerpc/kernel/syscalls/syscall.tbl +++ b/arch/powerpc/kernel/syscalls/syscall.tbl @@ -557,3 +557,4 @@ 464 common getxattrat sys_getxattrat 465 common listxattrat sys_listxattrat 466 common removexattrat sys_removexattrat +467 common sendfile2 sys_sendfile2 diff --git a/arch/s390/kernel/syscalls/syscall.tbl b/arch/s390/kernel/syscalls/syscall.tbl index e9115b4d8b63..e7e1b16f4d39 100644 --- a/arch/s390/kernel/syscalls/syscall.tbl +++ b/arch/s390/kernel/syscalls/syscall.tbl @@ -469,3 +469,4 @@ 464 common getxattrat sys_getxattrat sys_getxattrat 465 common listxattrat sys_listxattrat sys_listxattrat 466 common removexattrat sys_removexattrat sys_removexattrat +467 64 sendfile2 sys_sendfile2 - diff --git a/arch/sh/kernel/syscalls/syscall.tbl b/arch/sh/kernel/syscalls/syscall.tbl index c8cad33bf250..c75a0e69c033 100644 --- a/arch/sh/kernel/syscalls/syscall.tbl +++ b/arch/sh/kernel/syscalls/syscall.tbl @@ -470,3 +470,4 @@ 464 common getxattrat sys_getxattrat 465 common listxattrat sys_listxattrat 466 common removexattrat sys_removexattrat +467 common sendfile2 sys_sendfile2 diff --git a/arch/sparc/kernel/syscalls/syscall.tbl b/arch/sparc/kernel/syscalls/syscall.tbl index 727f99d333b3..fd15465b5330 100644 --- a/arch/sparc/kernel/syscalls/syscall.tbl +++ b/arch/sparc/kernel/syscalls/syscall.tbl @@ -512,3 +512,4 @@ 464 common getxattrat sys_getxattrat 465 common listxattrat sys_listxattrat 466 common removexattrat sys_removexattrat +467 common sendfile2 sys_sendfile2 diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl index 4d0fb2fba7e2..f711ee6068ec 100644 --- a/arch/x86/entry/syscalls/syscall_32.tbl +++ b/arch/x86/entry/syscalls/syscall_32.tbl @@ -472,3 +472,4 @@ 464 i386 getxattrat sys_getxattrat 465 i386 listxattrat sys_listxattrat 466 i386 removexattrat sys_removexattrat +467 i386 sendfile2 sys_sendfile2 diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl index 5eb708bff1c7..0ba4edb1e4c0 100644 --- a/arch/x86/entry/syscalls/syscall_64.tbl +++ b/arch/x86/entry/syscalls/syscall_64.tbl @@ -390,6 +390,7 @@ 464 common getxattrat sys_getxattrat 465 common listxattrat sys_listxattrat 466 common removexattrat sys_removexattrat +467 common sendfile2 sys_sendfile2
# # Due to a historical design error, certain syscalls are numbered differently diff --git a/arch/xtensa/kernel/syscalls/syscall.tbl b/arch/xtensa/kernel/syscalls/syscall.tbl index 37effc1b134e..142597c92baf 100644 --- a/arch/xtensa/kernel/syscalls/syscall.tbl +++ b/arch/xtensa/kernel/syscalls/syscall.tbl @@ -437,3 +437,4 @@ 464 common getxattrat sys_getxattrat 465 common listxattrat sys_listxattrat 466 common removexattrat sys_removexattrat +467 common sendfile2 sys_sendfile2 diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h index c6333204d451..3ee0e997d6c6 100644 --- a/include/linux/syscalls.h +++ b/include/linux/syscalls.h @@ -491,6 +491,8 @@ asmlinkage long sys_pwritev(unsigned long fd, const struct iovec __user *vec, unsigned long vlen, unsigned long pos_l, unsigned long pos_h); asmlinkage long sys_sendfile64(int out_fd, int in_fd, loff_t __user *offset, size_t count); +asmlinkage long sys_sendfile2(int out_fd, int in_fd, + loff_t __user *offset, size_t count, int flags); asmlinkage long sys_pselect6(int, fd_set __user *, fd_set __user *, fd_set __user *, struct __kernel_timespec __user *, void __user *); diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h index 88dc393c2bca..ec0ac5a8d519 100644 --- a/include/uapi/asm-generic/unistd.h +++ b/include/uapi/asm-generic/unistd.h @@ -849,9 +849,11 @@ __SYSCALL(__NR_getxattrat, sys_getxattrat) __SYSCALL(__NR_listxattrat, sys_listxattrat) #define __NR_removexattrat 466 __SYSCALL(__NR_removexattrat, sys_removexattrat) +#define __NR_sendfile2 467 +__SYSCALL(__NR_sendfile2, sys_sendfile2)
#undef __NR_syscalls -#define __NR_syscalls 467 +#define __NR_syscalls 468
/* * 32 bit systems traditionally used different diff --git a/scripts/syscall.tbl b/scripts/syscall.tbl index ebbdb3c42e9f..1911a64d3b33 100644 --- a/scripts/syscall.tbl +++ b/scripts/syscall.tbl @@ -407,3 +407,4 @@ 464 common getxattrat sys_getxattrat 465 common listxattrat sys_listxattrat 466 common removexattrat sys_removexattrat +467 common sendfile2 sys_sendfile2
Extend the existing the msg_zerocopy test to allow testing sendfile to ensure that notifications are generated.
Signed-off-by: Joe Damato jdamato@fastly.com --- tools/testing/selftests/net/msg_zerocopy.c | 54 ++++++++++++++++++++- tools/testing/selftests/net/msg_zerocopy.sh | 5 ++ 2 files changed, 58 insertions(+), 1 deletion(-)
diff --git a/tools/testing/selftests/net/msg_zerocopy.c b/tools/testing/selftests/net/msg_zerocopy.c index 7ea5fb28c93d..20e334b25fbd 100644 --- a/tools/testing/selftests/net/msg_zerocopy.c +++ b/tools/testing/selftests/net/msg_zerocopy.c @@ -30,6 +30,7 @@ #include <arpa/inet.h> #include <error.h> #include <errno.h> +#include <fcntl.h> #include <limits.h> #include <linux/errqueue.h> #include <linux/if_packet.h> @@ -50,6 +51,7 @@ #include <stdlib.h> #include <string.h> #include <sys/ioctl.h> +#include <sys/sendfile.h> #include <sys/socket.h> #include <sys/stat.h> #include <sys/time.h> @@ -74,6 +76,14 @@ #define MSG_ZEROCOPY 0x4000000 #endif
+#ifndef SENDFILE_ZC +#define SENDFILE_ZC (0x2) +#endif + +#ifndef __NR_sendfile2 +#define __NR_sendfile2 467 +#endif + static int cfg_cork; static bool cfg_cork_mixed; static int cfg_cpu = -1; /* default: pin to last cpu */ @@ -87,6 +97,8 @@ static int cfg_verbose; static int cfg_waittime_ms = 500; static int cfg_notification_limit = 32; static bool cfg_zerocopy; +static bool cfg_sendfile; +static const char *cfg_sendfile_path;
static socklen_t cfg_alen; static struct sockaddr_storage cfg_dst_addr; @@ -182,6 +194,37 @@ static void add_zcopy_cookie(struct msghdr *msg, uint32_t cookie) memcpy(CMSG_DATA(cm), &cookie, sizeof(cookie)); }
+static bool do_sendfile(int fd) +{ + int from_fd = open(cfg_sendfile_path, O_RDONLY, 0); + struct stat buf; + ssize_t total = 0; + ssize_t ret = 0; + off_t off = 0; + + if (fd < 0) + error(1, errno, "couldn't open sendfile path"); + + if (fstat(from_fd, &buf)) + error(1, errno, "couldn't fstat"); + + while (total < buf.st_size) { + ret = syscall(__NR_sendfile2, fd, from_fd, &off, buf.st_size, + SENDFILE_ZC); + if (ret < 0) + error(1, errno, "unable to sendfile"); + total += ret; + sends_since_notify++; + bytes += ret; + packets++; + if (ret > 0) + expected_completions++; + } + + close(from_fd); + return total == buf.st_size; +} + static bool do_sendmsg(int fd, struct msghdr *msg, bool do_zerocopy, int domain) { int ret, len, i, flags; @@ -550,6 +593,8 @@ static void do_tx(int domain, int type, int protocol) do { if (cfg_cork) do_sendmsg_corked(fd, &msg); + else if (cfg_sendfile) + do_sendfile(fd); else do_sendmsg(fd, &msg, cfg_zerocopy, domain);
@@ -715,7 +760,7 @@ static void parse_opts(int argc, char **argv)
cfg_payload_len = max_payload_len;
- while ((c = getopt(argc, argv, "46c:C:D:i:l:mp:rs:S:t:vz")) != -1) { + while ((c = getopt(argc, argv, "46c:C:D:i:l:mp:rs:S:t:vzf:w:")) != -1) { switch (c) { case '4': if (cfg_family != PF_UNSPEC) @@ -767,9 +812,16 @@ static void parse_opts(int argc, char **argv) case 'v': cfg_verbose++; break; + case 'f': + cfg_sendfile = true; + cfg_sendfile_path = optarg; + break; case 'z': cfg_zerocopy = true; break; + case 'w': + cfg_waittime_ms = 200 + strtoul(optarg, NULL, 10) * 1000; + break; } }
diff --git a/tools/testing/selftests/net/msg_zerocopy.sh b/tools/testing/selftests/net/msg_zerocopy.sh index 89c22f5320e0..c735e4ab86b5 100755 --- a/tools/testing/selftests/net/msg_zerocopy.sh +++ b/tools/testing/selftests/net/msg_zerocopy.sh @@ -74,6 +74,7 @@ esac cleanup() { ip netns del "${NS2}" ip netns del "${NS1}" + rm -f sendfile_data }
trap cleanup EXIT @@ -106,6 +107,9 @@ ip -netns "${NS2}" addr add fd::2/64 dev "${DEV}" nodad # Optionally disable sg or csum offload to test edge cases # ip netns exec "${NS1}" ethtool -K "${DEV}" sg off
+# create sendfile test data +dd if=/dev/zero of=sendfile_data bs=1M count=8 2> /dev/null + do_test() { local readonly ARGS="$1"
@@ -118,4 +122,5 @@ do_test() {
do_test "${EXTRA_ARGS}" do_test "-z ${EXTRA_ARGS}" +do_test "-z -f sendfile_data ${EXTRA_ARGS}" echo ok
On Wed, Mar 19, 2025 at 12:15:11AM +0000, Joe Damato wrote:
One way to fix this is to add zerocopy notifications to sendfile similar to how MSG_ZEROCOPY works with sendmsg. This is possible thanks to the extensive work done by Pavel [1].
What is a "zerocopy notification" and why aren't you simply plugging this into io_uring and generate a CQE so that it works like all other asynchronous operations?
On Wed, Mar 19, 2025 at 01:04:48AM -0700, Christoph Hellwig wrote:
On Wed, Mar 19, 2025 at 12:15:11AM +0000, Joe Damato wrote:
One way to fix this is to add zerocopy notifications to sendfile similar to how MSG_ZEROCOPY works with sendmsg. This is possible thanks to the extensive work done by Pavel [1].
What is a "zerocopy notification"
See the docs on MSG_ZEROCOPY [1], but in short when a user app calls sendmsg and passes MSG_ZEROCOPY a completion notification is added to the error queue. The user app can poll for these to find out when the TX has completed and the buffer it passed to the kernel can be overwritten.
My series provides the same functionality via splice and sendfile2.
[1]: https://www.kernel.org/doc/html/v6.13/networking/msg_zerocopy.html
and why aren't you simply plugging this into io_uring and generate a CQE so that it works like all other asynchronous operations?
I linked to the iouring work that Pavel did in the cover letter. Please take a look.
That work refactored the internals of how zerocopy completion notifications are wired up, allowing other pieces of code to use the same infrastructure and extend it, if needed.
My series is using the same internals that iouring (and others) use to generate zerocopy completion notifications. Unlike iouring, though, I don't need a fully customized implementation with a new user API for harvesting completion events; I can use the existing mechanism already in the kernel that user apps already use for sendmsg (the error queue, as explained above and in the MSG_ZEROCOPY documentation).
Let me know if that answers your question or if you have other questions.
Thanks, Joe
On 3/19/25 9:32 AM, Joe Damato wrote:
On Wed, Mar 19, 2025 at 01:04:48AM -0700, Christoph Hellwig wrote:
On Wed, Mar 19, 2025 at 12:15:11AM +0000, Joe Damato wrote:
One way to fix this is to add zerocopy notifications to sendfile similar to how MSG_ZEROCOPY works with sendmsg. This is possible thanks to the extensive work done by Pavel [1].
What is a "zerocopy notification"
See the docs on MSG_ZEROCOPY [1], but in short when a user app calls sendmsg and passes MSG_ZEROCOPY a completion notification is added to the error queue. The user app can poll for these to find out when the TX has completed and the buffer it passed to the kernel can be overwritten.
My series provides the same functionality via splice and sendfile2.
and why aren't you simply plugging this into io_uring and generate a CQE so that it works like all other asynchronous operations?
I linked to the iouring work that Pavel did in the cover letter. Please take a look.
That work refactored the internals of how zerocopy completion notifications are wired up, allowing other pieces of code to use the same infrastructure and extend it, if needed.
My series is using the same internals that iouring (and others) use to generate zerocopy completion notifications. Unlike iouring, though, I don't need a fully customized implementation with a new user API for harvesting completion events; I can use the existing mechanism already in the kernel that user apps already use for sendmsg (the error queue, as explained above and in the MSG_ZEROCOPY documentation).
The error queue is arguably a work-around for _not_ having a delivery mechanism that works with a sync syscall in the first place. The main question here imho would be "why add a whole new syscall etc when there's already an existing way to do accomplish this, with free-to-reuse notifications". If the answer is "because splice", then it would seem saner to plumb up those bits only. Would be much simpler too...
On Wed, Mar 19, 2025 at 10:07:27AM -0600, Jens Axboe wrote:
On 3/19/25 9:32 AM, Joe Damato wrote:
On Wed, Mar 19, 2025 at 01:04:48AM -0700, Christoph Hellwig wrote:
On Wed, Mar 19, 2025 at 12:15:11AM +0000, Joe Damato wrote:
One way to fix this is to add zerocopy notifications to sendfile similar to how MSG_ZEROCOPY works with sendmsg. This is possible thanks to the extensive work done by Pavel [1].
What is a "zerocopy notification"
See the docs on MSG_ZEROCOPY [1], but in short when a user app calls sendmsg and passes MSG_ZEROCOPY a completion notification is added to the error queue. The user app can poll for these to find out when the TX has completed and the buffer it passed to the kernel can be overwritten.
My series provides the same functionality via splice and sendfile2.
and why aren't you simply plugging this into io_uring and generate a CQE so that it works like all other asynchronous operations?
I linked to the iouring work that Pavel did in the cover letter. Please take a look.
That work refactored the internals of how zerocopy completion notifications are wired up, allowing other pieces of code to use the same infrastructure and extend it, if needed.
My series is using the same internals that iouring (and others) use to generate zerocopy completion notifications. Unlike iouring, though, I don't need a fully customized implementation with a new user API for harvesting completion events; I can use the existing mechanism already in the kernel that user apps already use for sendmsg (the error queue, as explained above and in the MSG_ZEROCOPY documentation).
The error queue is arguably a work-around for _not_ having a delivery mechanism that works with a sync syscall in the first place. The main question here imho would be "why add a whole new syscall etc when there's already an existing way to do accomplish this, with free-to-reuse notifications". If the answer is "because splice", then it would seem saner to plumb up those bits only. Would be much simpler too...
I may be misunderstanding your comment, but my response would be:
There are existing apps which use sendfile today unsafely and it would be very nice to have a safe sendfile equivalent. Converting existing apps to using iouring (if I understood your suggestion?) would be significantly more work compared to calling sendfile2 and adding code to check the error queue.
I would also argue that there are likely user apps out there that use both sendmsg MSG_ZEROCOPY for certain writes (for data in memory) and also use sendfile (for data on disk). One example would be a reverse proxy that might write HTTP headers to clients via sendmsg but transmit the response body with sendfile.
For those apps, the code to check the error queue already exists for sendmsg + MSG_ZEROCOPY, so swapping in sendfile2 seems like an easy way to ensure safe sendfile usage.
As far as the bit about plumbing only the splice bits, sorry if I'm being dense here, do you mean plumbing the error queue through to splice only and dropping sendfile2?
That is an option. Then the apps currently using sendfile could use splice instead and get completion notifications on the error queue. That would probably work and be less work than rewriting to use iouring, but probably a bit more work than using a new syscall.
Thanks for taking a look and responding.
On 3/19/25 11:04 AM, Joe Damato wrote:
On Wed, Mar 19, 2025 at 10:07:27AM -0600, Jens Axboe wrote:
On 3/19/25 9:32 AM, Joe Damato wrote:
On Wed, Mar 19, 2025 at 01:04:48AM -0700, Christoph Hellwig wrote:
On Wed, Mar 19, 2025 at 12:15:11AM +0000, Joe Damato wrote:
One way to fix this is to add zerocopy notifications to sendfile similar to how MSG_ZEROCOPY works with sendmsg. This is possible thanks to the extensive work done by Pavel [1].
What is a "zerocopy notification"
See the docs on MSG_ZEROCOPY [1], but in short when a user app calls sendmsg and passes MSG_ZEROCOPY a completion notification is added to the error queue. The user app can poll for these to find out when the TX has completed and the buffer it passed to the kernel can be overwritten.
My series provides the same functionality via splice and sendfile2.
and why aren't you simply plugging this into io_uring and generate a CQE so that it works like all other asynchronous operations?
I linked to the iouring work that Pavel did in the cover letter. Please take a look.
That work refactored the internals of how zerocopy completion notifications are wired up, allowing other pieces of code to use the same infrastructure and extend it, if needed.
My series is using the same internals that iouring (and others) use to generate zerocopy completion notifications. Unlike iouring, though, I don't need a fully customized implementation with a new user API for harvesting completion events; I can use the existing mechanism already in the kernel that user apps already use for sendmsg (the error queue, as explained above and in the MSG_ZEROCOPY documentation).
The error queue is arguably a work-around for _not_ having a delivery mechanism that works with a sync syscall in the first place. The main question here imho would be "why add a whole new syscall etc when there's already an existing way to do accomplish this, with free-to-reuse notifications". If the answer is "because splice", then it would seem saner to plumb up those bits only. Would be much simpler too...
I may be misunderstanding your comment, but my response would be:
There are existing apps which use sendfile today unsafely and it would be very nice to have a safe sendfile equivalent. Converting existing apps to using iouring (if I understood your suggestion?) would be significantly more work compared to calling sendfile2 and adding code to check the error queue.
It's really not, if you just want to use it as a sync kind of thing. If you want to have multiple things in flight etc, yeah it could be more work, you'd also get better performance that way. And you could use things like registered buffers for either of them, which again would likely make it more efficient.
If you just use it as a sync thing, it'd be pretty trivial to just wrap a my_sendfile_foo() in a submit_and_wait operation, which issues and waits on the completion in a single syscall. And if you want to wait on the notification too, you could even do that in the same syscall and wait on 2 CQEs. That'd be a downright trivial way to provide a sync way of doing the same thing.
I would also argue that there are likely user apps out there that use both sendmsg MSG_ZEROCOPY for certain writes (for data in memory) and also use sendfile (for data on disk). One example would be a reverse proxy that might write HTTP headers to clients via sendmsg but transmit the response body with sendfile.
For those apps, the code to check the error queue already exists for sendmsg + MSG_ZEROCOPY, so swapping in sendfile2 seems like an easy way to ensure safe sendfile usage.
Sure that is certainly possible. I didn't say that wasn't the case, rather that the error queue approach is a work-around in the first place for not having some kind of async notification mechanism for when it's free to reuse.
As far as the bit about plumbing only the splice bits, sorry if I'm being dense here, do you mean plumbing the error queue through to splice only and dropping sendfile2?
That is an option. Then the apps currently using sendfile could use splice instead and get completion notifications on the error queue. That would probably work and be less work than rewriting to use iouring, but probably a bit more work than using a new syscall.
Yep
On Wed, Mar 19, 2025 at 11:20:50AM -0600, Jens Axboe wrote:
On 3/19/25 11:04 AM, Joe Damato wrote:
On Wed, Mar 19, 2025 at 10:07:27AM -0600, Jens Axboe wrote:
On 3/19/25 9:32 AM, Joe Damato wrote:
On Wed, Mar 19, 2025 at 01:04:48AM -0700, Christoph Hellwig wrote:
On Wed, Mar 19, 2025 at 12:15:11AM +0000, Joe Damato wrote:
One way to fix this is to add zerocopy notifications to sendfile similar to how MSG_ZEROCOPY works with sendmsg. This is possible thanks to the extensive work done by Pavel [1].
What is a "zerocopy notification"
See the docs on MSG_ZEROCOPY [1], but in short when a user app calls sendmsg and passes MSG_ZEROCOPY a completion notification is added to the error queue. The user app can poll for these to find out when the TX has completed and the buffer it passed to the kernel can be overwritten.
My series provides the same functionality via splice and sendfile2.
and why aren't you simply plugging this into io_uring and generate a CQE so that it works like all other asynchronous operations?
I linked to the iouring work that Pavel did in the cover letter. Please take a look.
That work refactored the internals of how zerocopy completion notifications are wired up, allowing other pieces of code to use the same infrastructure and extend it, if needed.
My series is using the same internals that iouring (and others) use to generate zerocopy completion notifications. Unlike iouring, though, I don't need a fully customized implementation with a new user API for harvesting completion events; I can use the existing mechanism already in the kernel that user apps already use for sendmsg (the error queue, as explained above and in the MSG_ZEROCOPY documentation).
The error queue is arguably a work-around for _not_ having a delivery mechanism that works with a sync syscall in the first place. The main question here imho would be "why add a whole new syscall etc when there's already an existing way to do accomplish this, with free-to-reuse notifications". If the answer is "because splice", then it would seem saner to plumb up those bits only. Would be much simpler too...
I may be misunderstanding your comment, but my response would be:
There are existing apps which use sendfile today unsafely and it would be very nice to have a safe sendfile equivalent. Converting existing apps to using iouring (if I understood your suggestion?) would be significantly more work compared to calling sendfile2 and adding code to check the error queue.
It's really not, if you just want to use it as a sync kind of thing. If you want to have multiple things in flight etc, yeah it could be more work, you'd also get better performance that way. And you could use things like registered buffers for either of them, which again would likely make it more efficient.
I haven't argued that performance would be better using sendfile2 compared to iouring, just that existing apps which already use sendfile (but do so unsafely) would probably be more likely to use a safe alternative with existing examples of how to harvest completion notifications vs something more complex, like wrapping iouring.
If you just use it as a sync thing, it'd be pretty trivial to just wrap a my_sendfile_foo() in a submit_and_wait operation, which issues and waits on the completion in a single syscall. And if you want to wait on the notification too, you could even do that in the same syscall and wait on 2 CQEs. That'd be a downright trivial way to provide a sync way of doing the same thing.
I don't disagree; I just don't know if app developers: a.) know that this is possible to do, and b.) know how to do it
In general: it does seem a bit odd to me that there isn't a safe sendfile syscall in Linux that uses existing completion notification mechanisms.
I would also argue that there are likely user apps out there that use both sendmsg MSG_ZEROCOPY for certain writes (for data in memory) and also use sendfile (for data on disk). One example would be a reverse proxy that might write HTTP headers to clients via sendmsg but transmit the response body with sendfile.
For those apps, the code to check the error queue already exists for sendmsg + MSG_ZEROCOPY, so swapping in sendfile2 seems like an easy way to ensure safe sendfile usage.
Sure that is certainly possible. I didn't say that wasn't the case, rather that the error queue approach is a work-around in the first place for not having some kind of async notification mechanism for when it's free to reuse.
Of course, I certainly agree that the error queue is a work around. But it works, app use it, and its fairly well known. I don't see any reason, other than historical context, why sendmsg can use this mechanism, splice can, but sendfile shouldn't?
As far as the bit about plumbing only the splice bits, sorry if I'm being dense here, do you mean plumbing the error queue through to splice only and dropping sendfile2?
That is an option. Then the apps currently using sendfile could use splice instead and get completion notifications on the error queue. That would probably work and be less work than rewriting to use iouring, but probably a bit more work than using a new syscall.
Yep
I'm not opposed to dropping the sendfile2 part of the series for the official submission. I do think it is a bit odd to add the functionality to splice only, though, when probably many apps are using splice via calls to sendfile and there is no way to safely use sendfile.
If you feel very strongly that this cannot be merged without dropping sendfile2 and only plumbing this through for splice, then I'll drop the sendfile2 syscall when I submit officially (probably next week?).
I do feel pretty strongly that it's more likely apps would use sendfile2 and we'd have safer apps out in the wild. But, I could be wrong.
That said: if the new syscsall is the blocker, I'll drop it and offer a change to the sendfile man page suggesting users swap it with calls to splice + error queue for safety.
I greatly appreciate you taking a look and your feedback.
Thanks, Joe
On 3/19/25 11:45 AM, Joe Damato wrote:
On Wed, Mar 19, 2025 at 11:20:50AM -0600, Jens Axboe wrote:
On 3/19/25 11:04 AM, Joe Damato wrote:
On Wed, Mar 19, 2025 at 10:07:27AM -0600, Jens Axboe wrote:
On 3/19/25 9:32 AM, Joe Damato wrote:
On Wed, Mar 19, 2025 at 01:04:48AM -0700, Christoph Hellwig wrote:
On Wed, Mar 19, 2025 at 12:15:11AM +0000, Joe Damato wrote: > One way to fix this is to add zerocopy notifications to sendfile similar > to how MSG_ZEROCOPY works with sendmsg. This is possible thanks to the > extensive work done by Pavel [1].
What is a "zerocopy notification"
See the docs on MSG_ZEROCOPY [1], but in short when a user app calls sendmsg and passes MSG_ZEROCOPY a completion notification is added to the error queue. The user app can poll for these to find out when the TX has completed and the buffer it passed to the kernel can be overwritten.
My series provides the same functionality via splice and sendfile2.
and why aren't you simply plugging this into io_uring and generate a CQE so that it works like all other asynchronous operations?
I linked to the iouring work that Pavel did in the cover letter. Please take a look.
That work refactored the internals of how zerocopy completion notifications are wired up, allowing other pieces of code to use the same infrastructure and extend it, if needed.
My series is using the same internals that iouring (and others) use to generate zerocopy completion notifications. Unlike iouring, though, I don't need a fully customized implementation with a new user API for harvesting completion events; I can use the existing mechanism already in the kernel that user apps already use for sendmsg (the error queue, as explained above and in the MSG_ZEROCOPY documentation).
The error queue is arguably a work-around for _not_ having a delivery mechanism that works with a sync syscall in the first place. The main question here imho would be "why add a whole new syscall etc when there's already an existing way to do accomplish this, with free-to-reuse notifications". If the answer is "because splice", then it would seem saner to plumb up those bits only. Would be much simpler too...
I may be misunderstanding your comment, but my response would be:
There are existing apps which use sendfile today unsafely and it would be very nice to have a safe sendfile equivalent. Converting existing apps to using iouring (if I understood your suggestion?) would be significantly more work compared to calling sendfile2 and adding code to check the error queue.
It's really not, if you just want to use it as a sync kind of thing. If you want to have multiple things in flight etc, yeah it could be more work, you'd also get better performance that way. And you could use things like registered buffers for either of them, which again would likely make it more efficient.
I haven't argued that performance would be better using sendfile2 compared to iouring, just that existing apps which already use sendfile (but do so unsafely) would probably be more likely to use a safe alternative with existing examples of how to harvest completion notifications vs something more complex, like wrapping iouring.
Sure and I get that, just not sure it'd be worth doing on the kernel side for such (fairly) weak reasoning. The performance benefit is just a side note in that if you did do it this way, you'd potentially be able to run it more efficiently too. And regardless what people do or use now, they are generally always interested in that aspect.
If you just use it as a sync thing, it'd be pretty trivial to just wrap a my_sendfile_foo() in a submit_and_wait operation, which issues and waits on the completion in a single syscall. And if you want to wait on the notification too, you could even do that in the same syscall and wait on 2 CQEs. That'd be a downright trivial way to provide a sync way of doing the same thing.
I don't disagree; I just don't know if app developers: a.) know that this is possible to do, and b.) know how to do it
Writing that wrapper would be not even a screenful of code. Yes maybe they don't know how to do it now, but it's _really_ trivial to do. It'd take me roughly 1 min to do that, would be happy to help out with that side so it could go into a commit or man page or whatever.
In general: it does seem a bit odd to me that there isn't a safe sendfile syscall in Linux that uses existing completion notification mechanisms.
Pretty natural, I think. sendfile(2) predates that by quite a bit, and the last real change to sendfile was using splice underneath. Which I did, and that was probably almost 20 years ago at this point...
I do think it makes sense to have a sendfile that's both fast and efficient, and can be used sanely with buffer reuse without relying on odd heuristics.
I would also argue that there are likely user apps out there that use both sendmsg MSG_ZEROCOPY for certain writes (for data in memory) and also use sendfile (for data on disk). One example would be a reverse proxy that might write HTTP headers to clients via sendmsg but transmit the response body with sendfile.
For those apps, the code to check the error queue already exists for sendmsg + MSG_ZEROCOPY, so swapping in sendfile2 seems like an easy way to ensure safe sendfile usage.
Sure that is certainly possible. I didn't say that wasn't the case, rather that the error queue approach is a work-around in the first place for not having some kind of async notification mechanism for when it's free to reuse.
Of course, I certainly agree that the error queue is a work around. But it works, app use it, and its fairly well known. I don't see any reason, other than historical context, why sendmsg can use this mechanism, splice can, but sendfile shouldn't?
My argument would be the same as for other features - if you can do it simpler this other way, why not consider that? The end result would be the same, you can do fast sendfile() with sane buffer reuse. But the kernel side would be simpler, which is always a kernel main goal for those of us that have to maintain it.
Just adding sendfile2() works in the sense that it's an easier drop in replacement for an app, though the error queue side does mean it needs to change anyway - it's not just replacing one syscall with another. And if we want to be lazy, sure that's fine. I just don't think it's the best way to do it when we literally have a mechanism that's designed for this and works with reuse already with normal send zc (and receive side too, in the next kernel).
Am 19.03.25 um 19:37 schrieb Jens Axboe:
On 3/19/25 11:45 AM, Joe Damato wrote:
On Wed, Mar 19, 2025 at 11:20:50AM -0600, Jens Axboe wrote:
On 3/19/25 11:04 AM, Joe Damato wrote:
On Wed, Mar 19, 2025 at 10:07:27AM -0600, Jens Axboe wrote:
On 3/19/25 9:32 AM, Joe Damato wrote:
On Wed, Mar 19, 2025 at 01:04:48AM -0700, Christoph Hellwig wrote: > On Wed, Mar 19, 2025 at 12:15:11AM +0000, Joe Damato wrote: >> One way to fix this is to add zerocopy notifications to sendfile similar >> to how MSG_ZEROCOPY works with sendmsg. This is possible thanks to the >> extensive work done by Pavel [1]. > > What is a "zerocopy notification"
See the docs on MSG_ZEROCOPY [1], but in short when a user app calls sendmsg and passes MSG_ZEROCOPY a completion notification is added to the error queue. The user app can poll for these to find out when the TX has completed and the buffer it passed to the kernel can be overwritten.
My series provides the same functionality via splice and sendfile2.
> and why aren't you simply plugging this into io_uring and generate > a CQE so that it works like all other asynchronous operations?
I linked to the iouring work that Pavel did in the cover letter. Please take a look.
That work refactored the internals of how zerocopy completion notifications are wired up, allowing other pieces of code to use the same infrastructure and extend it, if needed.
My series is using the same internals that iouring (and others) use to generate zerocopy completion notifications. Unlike iouring, though, I don't need a fully customized implementation with a new user API for harvesting completion events; I can use the existing mechanism already in the kernel that user apps already use for sendmsg (the error queue, as explained above and in the MSG_ZEROCOPY documentation).
The error queue is arguably a work-around for _not_ having a delivery mechanism that works with a sync syscall in the first place. The main question here imho would be "why add a whole new syscall etc when there's already an existing way to do accomplish this, with free-to-reuse notifications". If the answer is "because splice", then it would seem saner to plumb up those bits only. Would be much simpler too...
I may be misunderstanding your comment, but my response would be:
There are existing apps which use sendfile today unsafely and it would be very nice to have a safe sendfile equivalent. Converting existing apps to using iouring (if I understood your suggestion?) would be significantly more work compared to calling sendfile2 and adding code to check the error queue.
It's really not, if you just want to use it as a sync kind of thing. If you want to have multiple things in flight etc, yeah it could be more work, you'd also get better performance that way. And you could use things like registered buffers for either of them, which again would likely make it more efficient.
I haven't argued that performance would be better using sendfile2 compared to iouring, just that existing apps which already use sendfile (but do so unsafely) would probably be more likely to use a safe alternative with existing examples of how to harvest completion notifications vs something more complex, like wrapping iouring.
Sure and I get that, just not sure it'd be worth doing on the kernel side for such (fairly) weak reasoning. The performance benefit is just a side note in that if you did do it this way, you'd potentially be able to run it more efficiently too. And regardless what people do or use now, they are generally always interested in that aspect.
If you just use it as a sync thing, it'd be pretty trivial to just wrap a my_sendfile_foo() in a submit_and_wait operation, which issues and waits on the completion in a single syscall. And if you want to wait on the notification too, you could even do that in the same syscall and wait on 2 CQEs. That'd be a downright trivial way to provide a sync way of doing the same thing.
I don't disagree; I just don't know if app developers: a.) know that this is possible to do, and b.) know how to do it
Writing that wrapper would be not even a screenful of code. Yes maybe they don't know how to do it now, but it's _really_ trivial to do. It'd take me roughly 1 min to do that, would be happy to help out with that side so it could go into a commit or man page or whatever.
In general: it does seem a bit odd to me that there isn't a safe sendfile syscall in Linux that uses existing completion notification mechanisms.
Pretty natural, I think. sendfile(2) predates that by quite a bit, and the last real change to sendfile was using splice underneath. Which I did, and that was probably almost 20 years ago at this point...
I do think it makes sense to have a sendfile that's both fast and efficient, and can be used sanely with buffer reuse without relying on odd heuristics.
I would also argue that there are likely user apps out there that use both sendmsg MSG_ZEROCOPY for certain writes (for data in memory) and also use sendfile (for data on disk). One example would be a reverse proxy that might write HTTP headers to clients via sendmsg but transmit the response body with sendfile.
For those apps, the code to check the error queue already exists for sendmsg + MSG_ZEROCOPY, so swapping in sendfile2 seems like an easy way to ensure safe sendfile usage.
Sure that is certainly possible. I didn't say that wasn't the case, rather that the error queue approach is a work-around in the first place for not having some kind of async notification mechanism for when it's free to reuse.
Of course, I certainly agree that the error queue is a work around. But it works, app use it, and its fairly well known. I don't see any reason, other than historical context, why sendmsg can use this mechanism, splice can, but sendfile shouldn't?
My argument would be the same as for other features - if you can do it simpler this other way, why not consider that? The end result would be the same, you can do fast sendfile() with sane buffer reuse. But the kernel side would be simpler, which is always a kernel main goal for those of us that have to maintain it.
Just adding sendfile2() works in the sense that it's an easier drop in replacement for an app, though the error queue side does mean it needs to change anyway - it's not just replacing one syscall with another. And if we want to be lazy, sure that's fine. I just don't think it's the best way to do it when we literally have a mechanism that's designed for this and works with reuse already with normal send zc (and receive side too, in the next kernel).
A few month (or even years) back, Pavel came up with an idea to implement some kind of splice into a fixed buffer, if that would be implemented I guess it would help me in Samba too. My first usage was on the receive side (from the network).
But the other side might also be possible now we have RWF_DONTCACHE. Instead of dropping the pages from the page cache, it might be possible move them to fixed buffer instead. It would mean the pages would be 'stable' when they are no longer part of the pagecache. But maybe my assumption for that is too naive...
Anyway that splice into a fixed buffer would great to have, as the new IORING_OP_RECV_ZC, requires control over the hardware queues of the nic and only allows a single process to provide buffers for that receive queue (at least that's how I understand it). And that's not possible for multiple process (maybe not belonging to the same high level application and likely non-root applications). So it would be great have splice into fixed buffer as alternative to IORING_OP_SPLICE/IORING_OP_TEE, as it would be more flexible to use in combination with IORING_OP_SENDMSG_ZC as well as IORING_OP_WRITE[V]_FIXED with RWF_DONTCACHE.
I guess such a splice into fixed buffer linked to IORING_OP_SENDMSG_ZC would be the way to simulate the sendfile2() in userspace?
Thanks! metze
On 3/19/25 19:15, Stefan Metzmacher wrote:
Am 19.03.25 um 19:37 schrieb Jens Axboe:
On 3/19/25 11:45 AM, Joe Damato wrote:
On Wed, Mar 19, 2025 at 11:20:50AM -0600, Jens Axboe wrote:
...
My argument would be the same as for other features - if you can do it simpler this other way, why not consider that? The end result would be the same, you can do fast sendfile() with sane buffer reuse. But the kernel side would be simpler, which is always a kernel main goal for those of us that have to maintain it.
Just adding sendfile2() works in the sense that it's an easier drop in replacement for an app, though the error queue side does mean it needs to change anyway - it's not just replacing one syscall with another. And if we want to be lazy, sure that's fine. I just don't think it's the best way to do it when we literally have a mechanism that's designed for this and works with reuse already with normal send zc (and receive side too, in the next kernel).
A few month (or even years) back, Pavel came up with an idea to implement some kind of splice into a fixed buffer, if that would be implemented I guess it would help me in Samba too. My first usage was on the receive side (from the network).
I did it as a testing ground for infra needed for ublk zerocopy, but if that's of interest I can resurrect the patches and see where it goes, especially since the aforementioned infra just got queued.
But the other side might also be possible now we have RWF_DONTCACHE. Instead of dropping the pages from the page cache, it might be possible move them to fixed buffer instead. It would mean the pages would be 'stable' when they are no longer part of the pagecache. But maybe my assumption for that is too naive...
That's an interesting idea
Anyway that splice into a fixed buffer would great to have, as the new IORING_OP_RECV_ZC, requires control over the hardware queues of the nic and only allows a single process
Right, it basically borrows a hardware rx queue and that needs CAP_NET_ADMIN, and the user also has to set up steering rules.
to provide buffers for that receive queue (at least that's how I understand it). And that's not possible for multiple process (maybe not belonging to the same high level application and likely
It's up to the user to decide who returns buffers back (and how to sychronise that) as the api is just a user mapped ring. Regardless, it's not a finished project, David and I looked at features we want to add to make life easier for multithreaded apps that can't throw that many queues. I see your point though.
non-root applications). So it would be great have splice into fixed buffer as alternative to IORING_OP_SPLICE/IORING_OP_TEE, as it would be more flexible to use in combination with IORING_OP_SENDMSG_ZC as well as IORING_OP_WRITE[V]_FIXED with RWF_DONTCACHE.
I guess such a splice into fixed buffer linked to IORING_OP_SENDMSG_ZC would be the way to simulate the sendfile2() in userspace?
Right, and that approach allows to handle intermediate errors, which is why it doesn't need to put restrictions on the input file.
Am 20.03.25 um 11:46 schrieb Pavel Begunkov:
On 3/19/25 19:15, Stefan Metzmacher wrote:
Am 19.03.25 um 19:37 schrieb Jens Axboe:
On 3/19/25 11:45 AM, Joe Damato wrote:
On Wed, Mar 19, 2025 at 11:20:50AM -0600, Jens Axboe wrote:
...
My argument would be the same as for other features - if you can do it simpler this other way, why not consider that? The end result would be the same, you can do fast sendfile() with sane buffer reuse. But the kernel side would be simpler, which is always a kernel main goal for those of us that have to maintain it.
Just adding sendfile2() works in the sense that it's an easier drop in replacement for an app, though the error queue side does mean it needs to change anyway - it's not just replacing one syscall with another. And if we want to be lazy, sure that's fine. I just don't think it's the best way to do it when we literally have a mechanism that's designed for this and works with reuse already with normal send zc (and receive side too, in the next kernel).
A few month (or even years) back, Pavel came up with an idea to implement some kind of splice into a fixed buffer, if that would be implemented I guess it would help me in Samba too. My first usage was on the receive side (from the network).
I did it as a testing ground for infra needed for ublk zerocopy, but if that's of interest I can resurrect the patches and see where it goes, especially since the aforementioned infra just got queued.
Would be great!
Have you posted the work in progress somewhere?
Thanks! metze
On 3/21/25 07:55, Stefan Metzmacher wrote:
Am 20.03.25 um 11:46 schrieb Pavel Begunkov:
On 3/19/25 19:15, Stefan Metzmacher wrote:
Am 19.03.25 um 19:37 schrieb Jens Axboe:
On 3/19/25 11:45 AM, Joe Damato wrote:
On Wed, Mar 19, 2025 at 11:20:50AM -0600, Jens Axboe wrote:
...
My argument would be the same as for other features - if you can do it simpler this other way, why not consider that? The end result would be the same, you can do fast sendfile() with sane buffer reuse. But the kernel side would be simpler, which is always a kernel main goal for those of us that have to maintain it.
Just adding sendfile2() works in the sense that it's an easier drop in replacement for an app, though the error queue side does mean it needs to change anyway - it's not just replacing one syscall with another. And if we want to be lazy, sure that's fine. I just don't think it's the best way to do it when we literally have a mechanism that's designed for this and works with reuse already with normal send zc (and receive side too, in the next kernel).
A few month (or even years) back, Pavel came up with an idea to implement some kind of splice into a fixed buffer, if that would be implemented I guess it would help me in Samba too. My first usage was on the receive side (from the network).
I did it as a testing ground for infra needed for ublk zerocopy, but if that's of interest I can resurrect the patches and see where it goes, especially since the aforementioned infra just got queued.
Would be great!
Have you posted the work in progress somewhere?
Nope apart from a dirty hack I believe I posted back then.
On Wed, Mar 19, 2025 at 12:37:29PM -0600, Jens Axboe wrote:
On 3/19/25 11:45 AM, Joe Damato wrote:
On Wed, Mar 19, 2025 at 11:20:50AM -0600, Jens Axboe wrote:
On 3/19/25 11:04 AM, Joe Damato wrote:
On Wed, Mar 19, 2025 at 10:07:27AM -0600, Jens Axboe wrote:
On 3/19/25 9:32 AM, Joe Damato wrote:
On Wed, Mar 19, 2025 at 01:04:48AM -0700, Christoph Hellwig wrote: > On Wed, Mar 19, 2025 at 12:15:11AM +0000, Joe Damato wrote: >> One way to fix this is to add zerocopy notifications to sendfile similar >> to how MSG_ZEROCOPY works with sendmsg. This is possible thanks to the >> extensive work done by Pavel [1]. > > What is a "zerocopy notification"
See the docs on MSG_ZEROCOPY [1], but in short when a user app calls sendmsg and passes MSG_ZEROCOPY a completion notification is added to the error queue. The user app can poll for these to find out when the TX has completed and the buffer it passed to the kernel can be overwritten.
My series provides the same functionality via splice and sendfile2.
> and why aren't you simply plugging this into io_uring and generate > a CQE so that it works like all other asynchronous operations?
I linked to the iouring work that Pavel did in the cover letter. Please take a look.
That work refactored the internals of how zerocopy completion notifications are wired up, allowing other pieces of code to use the same infrastructure and extend it, if needed.
My series is using the same internals that iouring (and others) use to generate zerocopy completion notifications. Unlike iouring, though, I don't need a fully customized implementation with a new user API for harvesting completion events; I can use the existing mechanism already in the kernel that user apps already use for sendmsg (the error queue, as explained above and in the MSG_ZEROCOPY documentation).
The error queue is arguably a work-around for _not_ having a delivery mechanism that works with a sync syscall in the first place. The main question here imho would be "why add a whole new syscall etc when there's already an existing way to do accomplish this, with free-to-reuse notifications". If the answer is "because splice", then it would seem saner to plumb up those bits only. Would be much simpler too...
I may be misunderstanding your comment, but my response would be:
There are existing apps which use sendfile today unsafely and it would be very nice to have a safe sendfile equivalent. Converting existing apps to using iouring (if I understood your suggestion?) would be significantly more work compared to calling sendfile2 and adding code to check the error queue.
It's really not, if you just want to use it as a sync kind of thing. If you want to have multiple things in flight etc, yeah it could be more work, you'd also get better performance that way. And you could use things like registered buffers for either of them, which again would likely make it more efficient.
I haven't argued that performance would be better using sendfile2 compared to iouring, just that existing apps which already use sendfile (but do so unsafely) would probably be more likely to use a safe alternative with existing examples of how to harvest completion notifications vs something more complex, like wrapping iouring.
Sure and I get that, just not sure it'd be worth doing on the kernel side for such (fairly) weak reasoning. The performance benefit is just a side note in that if you did do it this way, you'd potentially be able to run it more efficiently too. And regardless what people do or use now, they are generally always interested in that aspect.
Fair enough.
If you just use it as a sync thing, it'd be pretty trivial to just wrap a my_sendfile_foo() in a submit_and_wait operation, which issues and waits on the completion in a single syscall. And if you want to wait on the notification too, you could even do that in the same syscall and wait on 2 CQEs. That'd be a downright trivial way to provide a sync way of doing the same thing.
I don't disagree; I just don't know if app developers: a.) know that this is possible to do, and b.) know how to do it
Writing that wrapper would be not even a screenful of code. Yes maybe they don't know how to do it now, but it's _really_ trivial to do. It'd take me roughly 1 min to do that, would be happy to help out with that side so it could go into a commit or man page or whatever.
I'd never be opposed to more documentation ;)
In general: it does seem a bit odd to me that there isn't a safe sendfile syscall in Linux that uses existing completion notification mechanisms.
Pretty natural, I think. sendfile(2) predates that by quite a bit, and the last real change to sendfile was using splice underneath. Which I did, and that was probably almost 20 years ago at this point...
I do think it makes sense to have a sendfile that's both fast and efficient, and can be used sanely with buffer reuse without relying on odd heuristics.
Just trying to tie this together in my head -- are you saying that you think the kernel internals of sendfile could be changed in a different way or that this a userland problem (and they should use the io_uring wrapper you suggested above) ?
I would also argue that there are likely user apps out there that use both sendmsg MSG_ZEROCOPY for certain writes (for data in memory) and also use sendfile (for data on disk). One example would be a reverse proxy that might write HTTP headers to clients via sendmsg but transmit the response body with sendfile.
For those apps, the code to check the error queue already exists for sendmsg + MSG_ZEROCOPY, so swapping in sendfile2 seems like an easy way to ensure safe sendfile usage.
Sure that is certainly possible. I didn't say that wasn't the case, rather that the error queue approach is a work-around in the first place for not having some kind of async notification mechanism for when it's free to reuse.
Of course, I certainly agree that the error queue is a work around. But it works, app use it, and its fairly well known. I don't see any reason, other than historical context, why sendmsg can use this mechanism, splice can, but sendfile shouldn't?
My argument would be the same as for other features - if you can do it simpler this other way, why not consider that? The end result would be the same, you can do fast sendfile() with sane buffer reuse. But the kernel side would be simpler, which is always a kernel main goal for those of us that have to maintain it.
Just adding sendfile2() works in the sense that it's an easier drop in replacement for an app, though the error queue side does mean it needs to change anyway - it's not just replacing one syscall with another. And if we want to be lazy, sure that's fine. I just don't think it's the best way to do it when we literally have a mechanism that's designed for this and works with reuse already with normal send zc (and receive side too, in the next kernel).
It seems like you've answered the question I asked above and that you are suggesting there might be a better and simpler sendfile2 kernel-side implementation that doesn't rely on splice internals at all.
Am I following you? If so, I'll drop the sendfile2 stuff from this series and stick with the splice changes only, if you are (at a high level) OK with the idea of adding a flag for this to splice.
In the meantime, I'll take a few more reads through the iouring code to see if I can work out how sendfile2 might be built on top of that instead of splice in the kernel.
Thank you very much for your time, feedback, and attention, Joe
On 3/19/25 1:16 PM, Joe Damato wrote:
In general: it does seem a bit odd to me that there isn't a safe sendfile syscall in Linux that uses existing completion notification mechanisms.
Pretty natural, I think. sendfile(2) predates that by quite a bit, and the last real change to sendfile was using splice underneath. Which I did, and that was probably almost 20 years ago at this point...
I do think it makes sense to have a sendfile that's both fast and efficient, and can be used sanely with buffer reuse without relying on odd heuristics.
Just trying to tie this together in my head -- are you saying that you think the kernel internals of sendfile could be changed in a different way or that this a userland problem (and they should use the io_uring wrapper you suggested above) ?
I'm saying that it of course makes sense to have a way to do sendfile where you know when reuse is safe, and that we have an API that provides that very nicely already without needing to add syscalls. If you used io_uring for this, then the "tx is done, reuse is fine" notification is just another notification, not anything special that needs new plumbing.
I would also argue that there are likely user apps out there that use both sendmsg MSG_ZEROCOPY for certain writes (for data in memory) and also use sendfile (for data on disk). One example would be a reverse proxy that might write HTTP headers to clients via sendmsg but transmit the response body with sendfile.
For those apps, the code to check the error queue already exists for sendmsg + MSG_ZEROCOPY, so swapping in sendfile2 seems like an easy way to ensure safe sendfile usage.
Sure that is certainly possible. I didn't say that wasn't the case, rather that the error queue approach is a work-around in the first place for not having some kind of async notification mechanism for when it's free to reuse.
Of course, I certainly agree that the error queue is a work around. But it works, app use it, and its fairly well known. I don't see any reason, other than historical context, why sendmsg can use this mechanism, splice can, but sendfile shouldn't?
My argument would be the same as for other features - if you can do it simpler this other way, why not consider that? The end result would be the same, you can do fast sendfile() with sane buffer reuse. But the kernel side would be simpler, which is always a kernel main goal for those of us that have to maintain it.
Just adding sendfile2() works in the sense that it's an easier drop in replacement for an app, though the error queue side does mean it needs to change anyway - it's not just replacing one syscall with another. And if we want to be lazy, sure that's fine. I just don't think it's the best way to do it when we literally have a mechanism that's designed for this and works with reuse already with normal send zc (and receive side too, in the next kernel).
It seems like you've answered the question I asked above and that you are suggesting there might be a better and simpler sendfile2 kernel-side implementation that doesn't rely on splice internals at all.
Am I following you? If so, I'll drop the sendfile2 stuff from this series and stick with the splice changes only, if you are (at a high level) OK with the idea of adding a flag for this to splice.
In the meantime, I'll take a few more reads through the iouring code to see if I can work out how sendfile2 might be built on top of that instead of splice in the kernel.
Heh I don't know how you jumped to that conclusion based on my feedback, and seems like it's solidified through other replies. No I'm not saying that the approach makes sense for the kernel, it makes some vague amount of sense only on the premise of "oh but this is easy for applications as they already know how to use sendfile(2)".
On Wed, Mar 19, 2025 at 10:45:22AM -0700, Joe Damato wrote:
I don't disagree; I just don't know if app developers: a.) know that this is possible to do, and b.) know how to do it
So if you don't know that why do you even do the work?
In general: it does seem a bit odd to me that there isn't a safe sendfile syscall in Linux that uses existing completion notification mechanisms.
Agreed. Where the existing notification mechanism is called io_uring.
Of course, I certainly agree that the error queue is a work around. But it works, app use it, and its fairly well known. I don't see any reason, other than historical context, why sendmsg can use this mechanism, splice can, but sendfile shouldn't?
Because sendmsg should never have done that it certainly should not spread beyond purely socket specific syscalls.
If you feel very strongly that this cannot be merged without dropping sendfile2 and only plumbing this through for splice, then I'll drop the sendfile2 syscall when I submit officially (probably next week?).
Splice should also not do "error queue notifications". Nothing new and certainly nothing outside of net/ should.
I do feel pretty strongly that it's more likely apps would use sendfile2 and we'd have safer apps out in the wild. But, I could be wrong.
A purely synchronous sendfile that is safe is a good thing. Spreading non-standard out of band notifications is not. How to build that safe sendmsg is a good question, and a sendmsg2 might be a sane option for that. The important thing is that the underlying code should use iocbs and ki_complete to notify I/O completion so that all the existing infrastucture like io_uring and in-kernel callers can reuse this.
Note that this also ties into the currently broken memory mamangement in the networking code that directly messeѕ with page references rather than notifying the caller about I/O completion.
On Wed, Mar 19, 2025 at 10:57:29PM -0700, Christoph Hellwig wrote:
On Wed, Mar 19, 2025 at 10:45:22AM -0700, Joe Damato wrote:
I don't disagree; I just don't know if app developers: a.) know that this is possible to do, and b.) know how to do it
So if you don't know that why do you even do the work?
I am doing the work because I use splice and sendfile and it seems relatively straightforward to make them safer using an existing mechanism, at least for network sockets.
After dropping the sendfile2 patches completely, it looks like in my new set all of the code is within CONFIG_NET defines in fs/splice.c.
In general: it does seem a bit odd to me that there isn't a safe sendfile syscall in Linux that uses existing completion notification mechanisms.
Agreed. Where the existing notification mechanism is called io_uring.
Sure. As I mentioned to Jens: I agree that any new system call should be built differently.
But does that mean we should leave splice and sendfile as-is when there is a way to potentially make them safer?
In my other message to Jens I proposed: - SPLICE_F_ZC for splice to generate zc completion notifications to the error queue - Modifying sendfile so that if SO_ZEROCOPY (which already exists) is set on a network socket, zc completion notifications are generated.
In both cases no new system call is needed and both splice and sendfile become safer to use.
At some point in the future a mechanism built on top of iouring introduced as new system calls (sendmsg2, sendfile2, splice2, etc) can be built.
Of course, I certainly agree that the error queue is a work around. But it works, app use it, and its fairly well known. I don't see any reason, other than historical context, why sendmsg can use this mechanism, splice can, but sendfile shouldn't?
Because sendmsg should never have done that it certainly should not spread beyond purely socket specific syscalls.
I don't know the entire historical context, but I presume sendmsg did that because there was no other mechanism at the time.
I will explain it more clearly in the next cover letter, but the way I see the situation is: - There are existing system calls which operate on network sockets (splice and sendfile) that avoid copies - There is a mechanism already in the kernel in the networking stack for generating completion notifications - Both splice and sendfile could be extended to support this for network sockets so they can be used more safely, without introducing a new system call
If you feel very strongly that this cannot be merged without dropping sendfile2 and only plumbing this through for splice, then I'll drop the sendfile2 syscall when I submit officially (probably next week?).
Splice should also not do "error queue notifications". Nothing new and certainly nothing outside of net/ should.
It seems like Jens suggested that plumbing this through for splice was a possibility, but sounds like you disagree.
Not really sure how to proceed here?
If code I am modifying is within CONFIG_NET defines, but lives in fs/splice.c ... is that within the realm of net or fs ?
I am asking because I genuinely don't know.
As mentioned above and in other messages, it seems like it is possible to improve the networking parts of splice (and therefore sendfile) to make them safer to use without introducing a new system call.
Are you saying that you are against doing that, even if the code is network specific (but lives in fs/)?
I do feel pretty strongly that it's more likely apps would use sendfile2 and we'd have safer apps out in the wild. But, I could be wrong.
A purely synchronous sendfile that is safe is a good thing. Spreading non-standard out of band notifications is not. How to build that safe sendmsg is a good question, and a sendmsg2 might be a sane option for that. The important thing is that the underlying code should use iocbs and ki_complete to notify I/O completion so that all the existing infrastucture like io_uring and in-kernel callers can reuse this.
I'm not currently planning to build sendmsg2 (and I've already mentioned to Jens and above I will drop sendfile2), but if I have the time it sounds like an interesting project.
On Thu, Mar 20, 2025 at 11:23:57AM -0700, Joe Damato wrote:
In my other message to Jens I proposed:
- SPLICE_F_ZC for splice to generate zc completion notifications to the error queue
- Modifying sendfile so that if SO_ZEROCOPY (which already exists) is set on a network socket, zc completion notifications are generated.
In both cases no new system call is needed and both splice and sendfile become safer to use.
At some point in the future a mechanism built on top of iouring introduced as new system calls (sendmsg2, sendfile2, splice2, etc) can be built.
I strongly disagree with this. This is spreading the broken SO_ZEROCOPY to futher places outside the pure networking realm. Don't do that.
It also doesn't help that more than 7 years after adding it, SO_ZEROCOPY is still completely undocumented.
Because sendmsg should never have done that it certainly should not spread beyond purely socket specific syscalls.
I don't know the entire historical context, but I presume sendmsg did that because there was no other mechanism at the time.
At least aio had been around for about 15 years at the point, but networking folks tend to be pretty insular and reinvent things.
It seems like Jens suggested that plumbing this through for splice was a possibility, but sounds like you disagree.
Yes, very strongly.
As mentioned above and in other messages, it seems like it is possible to improve the networking parts of splice (and therefore sendfile) to make them safer to use without introducing a new system call.
Are you saying that you are against doing that, even if the code is network specific (but lives in fs/)?
Yes.
Please take the work and integrate it with the kiocb-based system we use for all other in-kernel I/O that needs completion notifications and which makes it trivial to integate with io_uring instead of spreading an imcompatible and inferior event system.
On 3/20/25 11:56 PM, Christoph Hellwig wrote:
I don't know the entire historical context, but I presume sendmsg did that because there was no other mechanism at the time.
At least aio had been around for about 15 years at the point, but networking folks tend to be pretty insular and reinvent things.
Yep...
It seems like Jens suggested that plumbing this through for splice was a possibility, but sounds like you disagree.
Yes, very strongly.
And that is very much not what I suggested, fwiw.
As mentioned above and in other messages, it seems like it is possible to improve the networking parts of splice (and therefore sendfile) to make them safer to use without introducing a new system call.
Are you saying that you are against doing that, even if the code is network specific (but lives in fs/)?
Yes.
Please take the work and integrate it with the kiocb-based system we use for all other in-kernel I/O that needs completion notifications and which makes it trivial to integate with io_uring instead of spreading an imcompatible and inferior event system.
Exactly, this is how we do async IO elsewhere, not sure why networking needs to be special here, and definitely not special in a good way.
On Fri, Mar 21, 2025 at 05:14:59AM -0600, Jens Axboe wrote:
On 3/20/25 11:56 PM, Christoph Hellwig wrote:
I don't know the entire historical context, but I presume sendmsg did that because there was no other mechanism at the time.
At least aio had been around for about 15 years at the point, but networking folks tend to be pretty insular and reinvent things.
Yep...
It seems like Jens suggested that plumbing this through for splice was a possibility, but sounds like you disagree.
Yes, very strongly.
And that is very much not what I suggested, fwiw.
Your earlier message said:
If the answer is "because splice", then it would seem saner to plumb up those bits only. Would be much simpler too...
wherein I interpreted "plumb those bits" to mean plumbing the error queue notifications on TX completions.
My sincere apologies that I misunderstood your prior message and/or misconstrued what you said -- it was not clear to me what you meant.
It is clear to me now, though, that adding a flag to splice as previously proposed and extending sendfile based on the SO_ZEROCOPY sock flag being set are both unacceptable solutions.
If you happen to have a suggestion of some piece of code that I should read (other than the iouring implementation) to inform how I might build an RFCv2, I would appreciate the pointer.
Thanks for your time and energy, Joe
On Fri, Mar 21, 2025 at 09:36:34AM -0700, Joe Damato wrote:
On Fri, Mar 21, 2025 at 05:14:59AM -0600, Jens Axboe wrote:
On 3/20/25 11:56 PM, Christoph Hellwig wrote:
I don't know the entire historical context, but I presume sendmsg did that because there was no other mechanism at the time.
At least aio had been around for about 15 years at the point, but networking folks tend to be pretty insular and reinvent things.
Yep...
It seems like Jens suggested that plumbing this through for splice was a possibility, but sounds like you disagree.
Yes, very strongly.
And that is very much not what I suggested, fwiw.
Your earlier message said:
If the answer is "because splice", then it would seem saner to plumb up those bits only. Would be much simpler too...
wherein I interpreted "plumb those bits" to mean plumbing the error queue notifications on TX completions.
My sincere apologies that I misunderstood your prior message and/or misconstrued what you said -- it was not clear to me what you meant.
I think what added to my confusion here was this bit, Jens:
As far as the bit about plumbing only the splice bits, sorry if I'm being dense here, do you mean plumbing the error queue through to splice only and dropping sendfile2?
That is an option. Then the apps currently using sendfile could use splice instead and get completion notifications on the error queue. That would probably work and be less work than rewriting to use iouring, but probably a bit more work than using a new syscall.
Yep
I thought I was explicitly asking if adding SPLICE_F_ZC and plumbing through the error queue notifications was OK and your response here ("Yep") suggested to me that it would be a suitable path to consider.
I take it from your other responses, though, that I was mistaken.
On 3/21/25 2:30 PM, Joe Damato wrote:
On Fri, Mar 21, 2025 at 09:36:34AM -0700, Joe Damato wrote:
On Fri, Mar 21, 2025 at 05:14:59AM -0600, Jens Axboe wrote:
On 3/20/25 11:56 PM, Christoph Hellwig wrote:
I don't know the entire historical context, but I presume sendmsg did that because there was no other mechanism at the time.
At least aio had been around for about 15 years at the point, but networking folks tend to be pretty insular and reinvent things.
Yep...
It seems like Jens suggested that plumbing this through for splice was a possibility, but sounds like you disagree.
Yes, very strongly.
And that is very much not what I suggested, fwiw.
Your earlier message said:
If the answer is "because splice", then it would seem saner to plumb up those bits only. Would be much simpler too...
wherein I interpreted "plumb those bits" to mean plumbing the error queue notifications on TX completions.
My sincere apologies that I misunderstood your prior message and/or misconstrued what you said -- it was not clear to me what you meant.
I think what added to my confusion here was this bit, Jens:
As far as the bit about plumbing only the splice bits, sorry if I'm being dense here, do you mean plumbing the error queue through to splice only and dropping sendfile2?
That is an option. Then the apps currently using sendfile could use splice instead and get completion notifications on the error queue. That would probably work and be less work than rewriting to use iouring, but probably a bit more work than using a new syscall.
Yep
I thought I was explicitly asking if adding SPLICE_F_ZC and plumbing through the error queue notifications was OK and your response here ("Yep") suggested to me that it would be a suitable path to consider.
I take it from your other responses, though, that I was mistaken.
I guess I missed your error queue thing here, I was definitely pretty clear in other ones that I consider that part a hack and something that only exists because networking never looked into doing a proper async API for anything.
On Fri, Mar 21, 2025 at 02:33:16PM -0600, Jens Axboe wrote:
On 3/21/25 2:30 PM, Joe Damato wrote:
On Fri, Mar 21, 2025 at 09:36:34AM -0700, Joe Damato wrote:
On Fri, Mar 21, 2025 at 05:14:59AM -0600, Jens Axboe wrote:
On 3/20/25 11:56 PM, Christoph Hellwig wrote:
I don't know the entire historical context, but I presume sendmsg did that because there was no other mechanism at the time.
At least aio had been around for about 15 years at the point, but networking folks tend to be pretty insular and reinvent things.
Yep...
It seems like Jens suggested that plumbing this through for splice was a possibility, but sounds like you disagree.
Yes, very strongly.
And that is very much not what I suggested, fwiw.
Your earlier message said:
If the answer is "because splice", then it would seem saner to plumb up those bits only. Would be much simpler too...
wherein I interpreted "plumb those bits" to mean plumbing the error queue notifications on TX completions.
My sincere apologies that I misunderstood your prior message and/or misconstrued what you said -- it was not clear to me what you meant.
I think what added to my confusion here was this bit, Jens:
As far as the bit about plumbing only the splice bits, sorry if I'm being dense here, do you mean plumbing the error queue through to splice only and dropping sendfile2?
That is an option. Then the apps currently using sendfile could use splice instead and get completion notifications on the error queue. That would probably work and be less work than rewriting to use iouring, but probably a bit more work than using a new syscall.
Yep
I thought I was explicitly asking if adding SPLICE_F_ZC and plumbing through the error queue notifications was OK and your response here ("Yep") suggested to me that it would be a suitable path to consider.
I take it from your other responses, though, that I was mistaken.
I guess I missed your error queue thing here, I was definitely pretty clear in other ones that I consider that part a hack and something that only exists because networking never looked into doing a proper async API for anything.
OK, so then I have no idea what you meant in your earlier response with "Yep" :)
Combing everything said amongst a set of emails it sounds like the summary is something like:
A safe, synchronous sendfile can be implemented in userland using iouring, so there is no need to do anything in the kernel at all. At most, some documentation or examples are needed somewhere so people who want a safe version of sendfile know what to do instead.
Is that right?
If so, then great, I guess there is nothing for me to do except figure out what documentation or man pages to update.
On 3/21/25 10:36 AM, Joe Damato wrote:
On Fri, Mar 21, 2025 at 05:14:59AM -0600, Jens Axboe wrote:
On 3/20/25 11:56 PM, Christoph Hellwig wrote:
I don't know the entire historical context, but I presume sendmsg did that because there was no other mechanism at the time.
At least aio had been around for about 15 years at the point, but networking folks tend to be pretty insular and reinvent things.
Yep...
It seems like Jens suggested that plumbing this through for splice was a possibility, but sounds like you disagree.
Yes, very strongly.
And that is very much not what I suggested, fwiw.
Your earlier message said:
If the answer is "because splice", then it would seem saner to plumb up those bits only. Would be much simpler too...
wherein I interpreted "plumb those bits" to mean plumbing the error queue notifications on TX completions.
My sincere apologies that I misunderstood your prior message and/or misconstrued what you said -- it was not clear to me what you meant.
It is clear to me now, though, that adding a flag to splice as previously proposed and extending sendfile based on the SO_ZEROCOPY sock flag being set are both unacceptable solutions.
If you happen to have a suggestion of some piece of code that I should read (other than the iouring implementation) to inform how I might build an RFCv2, I would appreciate the pointer.
I don't know what to point you at - you need an API that can deliver notifications, and I'm obviously going to say that io_uring would be one way to do that. Nothing else exists on the networking side, as far as I'm aware.
Like Christoph said, struct kiocb is generally how the kernel passes around async or sync IO, which comes with a completion callback for IO that initially returns -EIOCBQUEUED, meaning the operation is started but not yet complete. Outside of that, yeah need some delivery mechanism, as you're generating two events per zero copy send here. You could obviously roll your own, good luck with that, or use the existing one.
On Thu, Mar 20, 2025 at 10:56:15PM -0700, Christoph Hellwig wrote:
On Thu, Mar 20, 2025 at 11:23:57AM -0700, Joe Damato wrote:
In my other message to Jens I proposed:
- SPLICE_F_ZC for splice to generate zc completion notifications to the error queue
- Modifying sendfile so that if SO_ZEROCOPY (which already exists) is set on a network socket, zc completion notifications are generated.
In both cases no new system call is needed and both splice and sendfile become safer to use.
At some point in the future a mechanism built on top of iouring introduced as new system calls (sendmsg2, sendfile2, splice2, etc) can be built.
I strongly disagree with this. This is spreading the broken SO_ZEROCOPY to futher places outside the pure networking realm. Don't do that.
OK. I won't proceed down that path. Thank you for the feedback.
Because sendmsg should never have done that it certainly should not spread beyond purely socket specific syscalls.
I don't know the entire historical context, but I presume sendmsg did that because there was no other mechanism at the time.
At least aio had been around for about 15 years at the point, but networking folks tend to be pretty insular and reinvent things.
Sorry, but whatever issue there is between networking and other folks is well beyond my understanding and historical context. I'm not a reviewer or maintainer or anything like that; I'm just a developer who saw a problem and wanted a solution.
I've read your message loud and clear, though, and I won't proceed down the path I've proposed.
I appreciate your feedback; this is precisely why I sent the RFC - to get comments - so thank you for taking a look and letting me know.
As mentioned above and in other messages, it seems like it is possible to improve the networking parts of splice (and therefore sendfile) to make them safer to use without introducing a new system call.
Are you saying that you are against doing that, even if the code is network specific (but lives in fs/)?
Yes.
Please take the work and integrate it with the kiocb-based system we use for all other in-kernel I/O that needs completion notifications and which makes it trivial to integate with io_uring instead of spreading an imcompatible and inferior event system.
If you have any suggestions or pointers to code I should look at for inspiration I would very much appreciate the guidance.
Thanks for your time and energy in reviewing my RFC and responding.
- Joe
On Wed, Mar 19, 2025 at 10:07:27AM -0600, Jens Axboe wrote:
On 3/19/25 9:32 AM, Joe Damato wrote:
On Wed, Mar 19, 2025 at 01:04:48AM -0700, Christoph Hellwig wrote:
On Wed, Mar 19, 2025 at 12:15:11AM +0000, Joe Damato wrote:
One way to fix this is to add zerocopy notifications to sendfile similar to how MSG_ZEROCOPY works with sendmsg. This is possible thanks to the extensive work done by Pavel [1].
What is a "zerocopy notification"
See the docs on MSG_ZEROCOPY [1], but in short when a user app calls sendmsg and passes MSG_ZEROCOPY a completion notification is added to the error queue. The user app can poll for these to find out when the TX has completed and the buffer it passed to the kernel can be overwritten.
My series provides the same functionality via splice and sendfile2.
and why aren't you simply plugging this into io_uring and generate a CQE so that it works like all other asynchronous operations?
I linked to the iouring work that Pavel did in the cover letter. Please take a look.
That work refactored the internals of how zerocopy completion notifications are wired up, allowing other pieces of code to use the same infrastructure and extend it, if needed.
My series is using the same internals that iouring (and others) use to generate zerocopy completion notifications. Unlike iouring, though, I don't need a fully customized implementation with a new user API for harvesting completion events; I can use the existing mechanism already in the kernel that user apps already use for sendmsg (the error queue, as explained above and in the MSG_ZEROCOPY documentation).
The error queue is arguably a work-around for _not_ having a delivery mechanism that works with a sync syscall in the first place. The main question here imho would be "why add a whole new syscall etc when there's already an existing way to do accomplish this, with free-to-reuse notifications". If the answer is "because splice", then it would seem saner to plumb up those bits only. Would be much simpler too...
OK, I reworked the patches to drop all the sendfile2 stuff so no new system call is added. Only a flag for splice, SPLICE_F_ZC.
It feels weird to add this to the splice path but not the path that sendfile takes through splice.
I understand and agree with you: if we are adding a new system call, like sendfile2, it should probably be done as you've described in your other messages.
What about an alternative?
Would you be open to the idea that sendfile could be extended to generate error queue completions if the network socket has SO_ZEROCOPY set?
If so, that would solve the original problem without introducing a new system call and still leaves the door open for a more efficient sendfile2 based on iouring internals later.
What do you think?
On 3/19/25 5:22 PM, Joe Damato wrote:
Would you be open to the idea that sendfile could be extended to generate error queue completions if the network socket has SO_ZEROCOPY set?
I thought I was quite clear on my view of SO_ZEROCOPY and its error queue usage, I guess I was not. No I don't think this is a good path at all, when the whole issue is that pretending to handle two different types of completions via two different interfaces is pretty dumb and inefficient to begin with, particularly when we have a method of doing exactly that where the reuse notifications arrive in the normal completion stream.
On Wed, Mar 19, 2025 at 08:32:19AM -0700, Joe Damato wrote:
See the docs on MSG_ZEROCOPY [1], but in short when a user app calls sendmsg and passes MSG_ZEROCOPY a completion notification is added to the error queue. The user app can poll for these to find out when the TX has completed and the buffer it passed to the kernel can be overwritten.
Yikes. That's not just an ugly interface, but something entirely specific to sockets and incompatible with all other asynchronous I/O interfaces.
and why aren't you simply plugging this into io_uring and generate a CQE so that it works like all other asynchronous operations?
I linked to the iouring work that Pavel did in the cover letter. Please take a look.
Please write down what matters in the cover letter, including all the important tradeoffs.
On Wed, Mar 19, 2025 at 10:50:18PM -0700, Christoph Hellwig wrote:
On Wed, Mar 19, 2025 at 08:32:19AM -0700, Joe Damato wrote:
See the docs on MSG_ZEROCOPY [1], but in short when a user app calls sendmsg and passes MSG_ZEROCOPY a completion notification is added to the error queue. The user app can poll for these to find out when the TX has completed and the buffer it passed to the kernel can be overwritten.
Yikes. That's not just an ugly interface, but something entirely specific to sockets and incompatible with all other asynchronous I/O interfaces.
I don't really know but I would assume it was introduced, as Jens said, as a work-around long before other completion mechanisms existed.
and why aren't you simply plugging this into io_uring and generate a CQE so that it works like all other asynchronous operations?
I linked to the iouring work that Pavel did in the cover letter. Please take a look.
Please write down what matters in the cover letter, including all the important tradeoffs.
OK, I will enhance the cover letter for the next submission. I had originally thought I'd submit something officially, but I think I'll probably submit another RFC with some of the changes I've made based on the discussion with Jens.
Namely: dropping sendfile2 completely and plumbing the bits through for splice. I'll wait a bit to hear what Jens thinks about the SO_ZEROCOPY thing (basically: if a network socket has that option set, maybe the existing sendfile can generate error queue completions without needing a separate system call?).
I agree overall that sendfile2 or sendmsg2 or whatever else could likely be built differently now that better interfaces and mechanisms exist in the kernel - but I still think there's room to improve existing system calls so they can be used safely.
linux-kselftest-mirror@lists.linaro.org