A successful call to NOTIFY_RETRIEVE by filesystem carries promise from the kernel to send back NOTIFY_REPLY message. However if the filesystem is not reading requests with fuse_conn->max_pages capacity, fuse_dev_do_read might see that the "request is too large" and decide to "reply with an error and restart the read". "Reply with an error" has underlying assumption that there is a "requester thread" that is waiting for request completion, which is true for most requests, but is not true for NOTIFY_REPLY: NOTIFY_RETRIEVE handler completes with OK status right after it could successfully queue NOTIFY_REPLY message without waiting for NOTIFY_REPLY completion. This leads to situation when filesystem requested to retrieve inode data with NOTIFY_RETRIEVE, got err=OK for that notification request, but NOTIFY_REPLY is not coming back.
More, since there is no "requester thread" to handle the error, the situation shows itself as /sys/fs/fuse/connections/X/waiting=1 _and_ /dev/fuse read(s) queued. Which is misleading since NOTIFY_REPLY request was removed from pending queue and abandoned.
One way to fix would be to change NOTIFY_RETRIEVE handler to wait until queued NOTIFY_REPLY is actually read back to the server and only then return NOTIFY_RETRIEVE status. However this is change in behaviour and would require filesystems to have at least 2 threads. In particular a single-threaded filesystem that was previously successfully using NOTIFY_RETRIEVE would become stuck after the change. This way of fixing is thus not acceptable.
However we can fix it another way - by always returning NOTIFY_REPLY irregardless of its original size - with so much data as provided read buffer could fit. This aligns with the way NOTIFY_RETRIEVE handler works, which already unconditionally caps requested retrieve size to fuse_conn->max_pages. This way it should not hurt NOTIFY_RETRIEVE semantic if we return less data than was originally requested.
This fix requires another behaviour change however - to be sure that read buffer has enough capacity to always fit fixed NOTIFY_REPLY part plus at least some (0 or more) data, we have to precheck the buffer before dequeuing and handling a request. And if the buffer is very small - return EINVAL to read in filesystem with semantic that queued read was invalid from the viewpoint of FUSE protocol. Even though this is also behaviour change, this should not practically cause problems: 1d3d752b47 (fuse: clean up request size limit checking), which originally removed such EINVAL return and reworked fuse_dev_do_read to loop and retry, also added FUSE_MIN_READ_BUFFER=8K to user-visible fuse.h with comment that "The read buffer is required to be at least 8k ..." Even though FUSE_MIN_READ_BUFFER is not currently checked anywhere in the kernel, libfuse always initializes session with bufsize=32·pages and, since its beginning, (at least from 2005) issues a warning should user modify fuse_session->bufsize directly to be sure that queued buffers are at least as large as that sane minimum:
https://github.com/libfuse/libfuse/blob/fuse-3.3.0-22-g63d53ecc3a/lib/fuse_l... https://github.com/libfuse/libfuse/blob/fuse-3.3.0-22-g63d53ecc3a/lib/fuse_l... (semantic added in https://github.com/libfuse/libfuse/commit/044da2e9e0)
This way we should be safe to add the check for minimum read buffer size.
I've hit this bug for real with my filesystem that is using https://github.com/hanwen/go-fuse: there was no NOTIFY_REPLY after successful NOTIFY_RETRIEVE and the filesystem was stuck waiting, because FUSE protocol (definition scattered through many places) states that NOTIFY_REPLY is guaranteed to come after successful NOTIFY_RETRIEVE (see 2d45ba381a "fuse: add retrieve request"). After inspecting /sys/fs/fuse/connections/X/waiting and seeing it was 1, I was initially suspecting that it was user-space who is not issuing /dev/fuse reads and NOTIFY_REPLY is there but stuck in kernel pending queue. However tracing what is going on in /dev/fuse exchange and in both kernel and userspace (see https://lab.nexedi.com/kirr/wendelin.core/blob/13d2d1f8/wcfs/fusetrace) showed that there are correctly queued /dev/fuse reads still pending after NOTIFY_RETRIEVE returns and it is the kernel who is not replying back:
...
P2 2.215710 /dev/fuse <- qread wcfs/11399_4_r:
syscall.Syscall+48 syscall.Read+73 github.com/hanwen/go-fuse/fuse.(*Server).readRequest.func1+85 github.com/hanwen/go-fuse/fuse.handleEINTR+39 github.com/hanwen/go-fuse/fuse.(*Server).readRequest+355 github.com/hanwen/go-fuse/fuse.(*Server).loop+107 runtime.goexit+1
P2 2.215810 /dev/fuse -> read wcfs/11399_4_r: .56 RELEASE i8 ... (ret=64)
P2 2.215859 /dev/fuse <- write wcfs/11399_5_w: .56 (0) ...
syscall.Syscall+48 syscall.Write+73 github.com/hanwen/go-fuse/fuse.(*Server).systemWrite.func1+76 github.com/hanwen/go-fuse/fuse.handleEINTR+39 github.com/hanwen/go-fuse/fuse.(*Server).systemWrite+931 github.com/hanwen/go-fuse/fuse.(*Server).write+194 github.com/hanwen/go-fuse/fuse.(*Server).handleRequest+179 github.com/hanwen/go-fuse/fuse.(*Server).loop+399 runtime.goexit+1
P2 2.215871 /dev/fuse -> write_ack wcfs/11399_5_w (ret=16)
P2 2.215876 /dev/fuse <- qread wcfs/11399_5_r: <-- NOTE
syscall.Syscall+48 syscall.Read+73 github.com/hanwen/go-fuse/fuse.(*Server).readRequest.func1+85 github.com/hanwen/go-fuse/fuse.handleEINTR+39 github.com/hanwen/go-fuse/fuse.(*Server).readRequest+355 github.com/hanwen/go-fuse/fuse.(*Server).loop+107 runtime.goexit+1
P0 2.221527 /dev/fuse <- qread wcfs/11401_1_r: <-- NOTE
syscall.Syscall+48 syscall.Read+73 github.com/hanwen/go-fuse/fuse.(*Server).readRequest.func1+85 github.com/hanwen/go-fuse/fuse.handleEINTR+39 github.com/hanwen/go-fuse/fuse.(*Server).readRequest+355 github.com/hanwen/go-fuse/fuse.(*Server).loop+107 runtime.goexit+1
P1 2.239384 /dev/fuse -> read wcfs/11398_6_r: # woken read that was queued before "..." .57 READ i5 ... (ret=80)
P0 2.239626 /dev/fuse <- write wcfs/11397_0_w: NOTIFY_RETRIEVE ...
syscall.Syscall+48 syscall.Write+73 github.com/hanwen/go-fuse/fuse.(*Server).systemWrite.func1+76 github.com/hanwen/go-fuse/fuse.handleEINTR+39 github.com/hanwen/go-fuse/fuse.(*Server).systemWrite+931 github.com/hanwen/go-fuse/fuse.(*Server).write+194 github.com/hanwen/go-fuse/fuse.(*Server).InodeRetrieveCache+764 github.com/hanwen/go-fuse/fuse/nodefs.(*FileSystemConnector).FileRetrieveCache+157 main.(*BigFile).invalidateBlk+232 main.(*Root).zδhandle1.func1+72 golang.org/x/sync/errgroup.(*Group).Go.func1+87 runtime.goexit+1
P0 2.239660 /dev/fuse -> write_ack wcfs/11397_0_w (ret=48)
# stuck # (full trace: https://lab.nexedi.com/kirr/wendelin.core/commit/96416aaabd)
with queued / served read analysis confirming that two reads were indeed queued and not served:
grep -w -e '<- qread>' y.log |awk {'print $6'} |sort >qread.txt grep -w -e '-> read>' y.log |awk {'print $6'} |sort >read.txt
# xdiff qread.txt read.txt diff --git a/qread.txt b/read.txt index 4ab50d7..fdd2be1 100644 --- a/qread.txt +++ b/read.txt @@ -53,7 +53,5 @@ wcfs/11399_1_r: wcfs/11399_2_r: wcfs/11399_3_r: wcfs/11399_4_r: -wcfs/11399_5_r: wcfs/11400_0_r: wcfs/11401_0_r: -wcfs/11401_1_r:
The bug was hit because go-fuse by default uses 64K for read buffer size
https://github.com/hanwen/go-fuse/blob/33711add/fuse/server.go#L142
and the kernel presets fuse_conn->max_pages to be 128K (= 32·4K pages).
Go-fuse will be likely fixed to both use bufsize=kernel's and to correctly handle size > bufsize in InodeRetrieveCache. However we should also fix the kernel to always deliver NOTIFY_REPLY once NOTIFY_RETRIEVE was successful, so that FUSE protocol guarantee always holds irregardless of whether userspace used default or other valid buffer size setting, and so that filesystems can count not to get stuck waiting for kernel who promised a reply.
This way this patch is here.
Signed-off-by: Kirill Smelkov kirr@nexedi.com Cc: Han-Wen Nienhuys hanwen@google.com Cc: Jakob Unterwurzacher jakobunt@gmail.com Cc: stable@vger.kernel.org # v2.6.36+ ---
First patch version was sent 1 week ago, but got no response: https://marc.info/?l=linux-fsdevel&m=155000277921155&w=2
Changes since v1: don't forget to also update req->misc.retrieve_in.size after truncation.
( This is my first patch to fs/fuse, so please forgive me if I missed anything. )
fs/fuse/dev.c | 71 ++++++++++++++++++++++++++++++++++++++++++++++----- 1 file changed, 65 insertions(+), 6 deletions(-)
diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c index 8a63e52785e9..93deb8e54d88 100644 --- a/fs/fuse/dev.c +++ b/fs/fuse/dev.c @@ -381,6 +381,40 @@ static void queue_request(struct fuse_iqueue *fiq, struct fuse_req *req) kill_fasync(&fiq->fasync, SIGIO, POLL_IN); }
+/* + * fuse_req_truncate_data truncates data in request that has paged data + * (req.in.argpages=1), so that whole request, when serialized, is <= nbytes. + * + * nbytes must be >= size(request without data). + */ +static void fuse_req_truncate_data(struct fuse_req *req, unsigned nbytes) { + unsigned size, n; + + BUG_ON(!req->in.argpages); + BUG_ON(req->in.numargs < 1); + + /* request size without data */ + size = sizeof(struct fuse_in_header) + + len_args(req->in.numargs - 1, (struct fuse_arg *) req->in.args); + BUG_ON(nbytes < size); + + /* truncate paged data */ + for (n = 0; n < req->num_pages; n++) { + struct fuse_page_desc *p = &req->page_descs[n]; + + if (size >= nbytes) { + p->length = 0; + } else { + p->length = min_t(unsigned, p->length, nbytes - size); + } + + size += p->length; + } + + /* update whole request length in the header */ + req->in.h.len = size; +} + void fuse_queue_forget(struct fuse_conn *fc, struct fuse_forget_link *forget, u64 nodeid, u64 nlookup) { @@ -1317,6 +1351,15 @@ static ssize_t fuse_dev_do_read(struct fuse_dev *fud, struct file *file, unsigned reqsize; unsigned int hash;
+ /* + * Require sane minimum read buffer - that has capacity for fixed part + * of any request + some room for data. If the requirement is not + * satisfied return EINVAL to the filesystem without dequeueing / + * aborting any request. + */ + if (nbytes < FUSE_MIN_READ_BUFFER) + return -EINVAL; + restart: spin_lock(&fiq->waitq.lock); err = -EAGAIN; @@ -1358,12 +1401,28 @@ static ssize_t fuse_dev_do_read(struct fuse_dev *fud, struct file *file,
/* If request is too large, reply with an error and restart the read */ if (nbytes < reqsize) { - req->out.h.error = -EIO; - /* SETXATTR is special, since it may contain too large data */ - if (in->h.opcode == FUSE_SETXATTR) - req->out.h.error = -E2BIG; - request_end(fc, req); - goto restart; + switch (in->h.opcode) { + default: + req->out.h.error = -EIO; + /* SETXATTR is special, since it may contain too large data */ + if (in->h.opcode == FUSE_SETXATTR) + req->out.h.error = -E2BIG; + request_end(fc, req); + goto restart; + + /* + * NOTIFY_REPLY is special: if it was queued we already + * promised to filesystem to deliver it when handling + * NOTIFY_RETRIVE. We know that read buffer has capacity for at + * least some data. Truncate retrieved data to read buffer size + * and deliver it to stay to the promise. + */ + case FUSE_NOTIFY_REPLY: + fuse_req_truncate_data(req, nbytes); + req->misc.retrieve_in.size -= reqsize - in->h.len; + reqsize = in->h.len; + } + } spin_lock(&fpq->lock); list_add(&req->list, &fpq->io);
On Tue, Feb 19, 2019 at 10:42 AM Kirill Smelkov kirr@nexedi.com wrote:
A successful call to NOTIFY_RETRIEVE by filesystem carries promise from the kernel to send back NOTIFY_REPLY message. However if the filesystem is not reading requests with fuse_conn->max_pages capacity,
That's a violation of the contract by the fuse server, not the kernel.
fuse_dev_do_read might see that the "request is too large" and decide to "reply with an error and restart the read". "Reply with an error" has underlying assumption that there is a "requester thread" that is waiting for request completion, which is true for most requests, but is not true for NOTIFY_REPLY: NOTIFY_RETRIEVE handler completes with OK status right after it could successfully queue NOTIFY_REPLY message without waiting for NOTIFY_REPLY completion. This leads to situation when filesystem requested to retrieve inode data with NOTIFY_RETRIEVE, got err=OK for that notification request, but NOTIFY_REPLY is not coming back.
More, since there is no "requester thread" to handle the error, the situation shows itself as /sys/fs/fuse/connections/X/waiting=1 _and_ /dev/fuse read(s) queued. Which is misleading since NOTIFY_REPLY request was removed from pending queue and abandoned.
Now I don't understand how that would happen. If the request is abandoned, its refcount should go down to zero and the num_waiting count decremented accordingly.
Thanks, Miklos
Miklos, first of all thanks for feedback.
On Tue, Feb 26, 2019 at 04:14:22PM +0100, Miklos Szeredi wrote:
On Tue, Feb 19, 2019 at 10:42 AM Kirill Smelkov kirr@nexedi.com wrote:
A successful call to NOTIFY_RETRIEVE by filesystem carries promise from the kernel to send back NOTIFY_REPLY message. However if the filesystem is not reading requests with fuse_conn->max_pages capacity,
That's a violation of the contract by the fuse server, not the kernel.
Do you mean that even if filesystem server configures via init_out.max_write that it is accepting e.g. only 32K max writes, it still has to be issuing sys_read with buffer of 128K (= hardcoded fuse_conn->max_pages before Linux 4.20, and default since Linux 4.20)?
Also, I could not find any FUSE contract being specified anywhere, so I used message of the commit that added support for NOTIFY_RETRIEVE to sense its semantic:
commit 2d45ba381a74a743eeaa2b06c7c5c0d2bf73ba1a Author: Miklos Szeredi mszeredi@suse.cz Date: Mon Jul 12 14:41:40 2010 +0200 fuse: add retrieve request Userspace filesystem can request data to be retrieved from the inode's mapping. This request is synchronous and the retrieved data is queued as a new request. If the write to the fuse device returns an error then the retrieve request was not completed and a reply will not be sent. Only present pages are returned in the retrieve reply. Retrieving stops when it finds a non-present page and only data prior to that is returned. This request doesn't change the dirty state of pages. Signed-off-by: Miklos Szeredi mszeredi@suse.cz which, even if not explicitly, gives the impression that if NOTIFY_RETRIEVE was queued successfully, the reply will come.
Also: if it is a violation of contract by filesystem server, the kernel should return ESOMETHING for violating sys_read, instead of making that read to be waiting indefinitely, isn't it?
In summary: instead of getting clients stuck silently I still suggest for NOTIFY_RETRIEVE to come to client, if it can come. And also to return EINVAL for /dev/fuse sys_read calls that are violating the server/kernel contract.
fuse_dev_do_read might see that the "request is too large" and decide to "reply with an error and restart the read". "Reply with an error" has underlying assumption that there is a "requester thread" that is waiting for request completion, which is true for most requests, but is not true for NOTIFY_REPLY: NOTIFY_RETRIEVE handler completes with OK status right after it could successfully queue NOTIFY_REPLY message without waiting for NOTIFY_REPLY completion. This leads to situation when filesystem requested to retrieve inode data with NOTIFY_RETRIEVE, got err=OK for that notification request, but NOTIFY_REPLY is not coming back.
More, since there is no "requester thread" to handle the error, the situation shows itself as /sys/fs/fuse/connections/X/waiting=1 _and_ /dev/fuse read(s) queued. Which is misleading since NOTIFY_REPLY request was removed from pending queue and abandoned.
Now I don't understand how that would happen. If the request is abandoned, its refcount should go down to zero and the num_waiting count decremented accordingly.
You are right - it was my mistake. I misinterpreted waiting=1 as a request not being transferred to filesystem server yet, but in my test it turned out to be already transferred and sitting on client "processing" list waiting for corresponding reply (which was not coming because the filesystem in turn was stuck waiting for NOTIFY_REPLY to come from the kernel):
root@(none):~/src/wendelin/wendelin.core/wcfs# cat /sys/fs/fuse/connections/25/waiting 1 root@(none):~/src/wendelin/wendelin.core/wcfs# cat /sys/fs/fuse/connections/25/queue #waiting: 1 (0)Interrupt: (0)Forget: (0)Request: (0)Processing.IO: (1)Processing.P: #0: .52 R15 i5
So the part of commit message that discussed X/waiting=1 is wrong and should be dropped - thanks for catching this.
Still the main point is what should be the semantic of NOTIFY_RETRIEVE vs NOTIFY_REPLY vs INIT.max_write, and that it is better to always send retrieve data if client promised it, and also to explicitly indicate with an error if filesystem server is violating FUSE server/client contract.
Thanks, Kirill
P.S. I attach the draft patch for /sys/fs/fuse/connections/X/queue in case someone is interested.
---- 8< ----
diff --git a/fs/fuse/control.c b/fs/fuse/control.c index fe80bea4ad89..f4e22f5436e2 100644 --- a/fs/fuse/control.c +++ b/fs/fuse/control.c @@ -63,6 +63,160 @@ static ssize_t fuse_conn_waiting_read(struct file *file, char __user *buf, return simple_read_from_buffer(buf, len, ppos, tmp, size); }
+/* fuse_conn_iqueue_print prints input queue into provided buf. + * + * buf can be NULL in which case only the length of would-be printed text is + * returned and nothing is actually printed. + * + * must be called with fc->iq->waitq locked. */ +static size_t fuse_conn_iqueue_print(char *buf, size_t size, struct fuse_conn *fc) +{ + struct fuse_iqueue *fiq = &fc->iq; + struct fuse_req *req; + struct fuse_forget_link *freq; + size_t nreq; + + size_t __n, __total = 0; +#define emitf(FORMAT, ...) do { \ + __n = snprintf(buf, size, FORMAT, __VA_ARGS__); \ + __total += __n; \ + if (buf) { \ + size -= __n; \ + buf += __n; \ + } \ +} while (0) + + + if (!buf) + size = 0; + + // XXX temp + emitf("#waiting: %d\n", atomic_read(&fc->num_waiting)); + + /* interrupts */ + nreq = 0; + list_for_each_entry(req, &fiq->interrupts, list) { + nreq++; + } + + emitf("(%lu)Interrupt:\n", nreq); + + nreq = 0; + list_for_each_entry(req, &fiq->interrupts, list) { + emitf("\t#%lu: R%d\n", nreq, req->in.h.opcode); + nreq++; + } + + /* forgets */ + nreq = 0; + for (freq = fiq->forget_list_head.next; freq != NULL; freq = freq->next) { + nreq++; + } + + emitf("(%lu)Forget:\n", nreq); + + nreq = 0; + for(freq = fiq->forget_list_head.next; freq != NULL; freq = freq->next) { + emitf("\t#%lu: FORGET i%llu -%llu\n", nreq, + freq->forget_one.nodeid, freq->forget_one.nlookup); + nreq++; + } + + /* all other requests */ + nreq = 0; + list_for_each_entry(req, &fiq->pending, list) { + nreq++; + } + + emitf("(%lu)Request:\n", nreq); + + nreq = 0; + list_for_each_entry(req, &fiq->pending, list) { + emitf("\t#%lu: R%d\n", nreq, req->in.h.opcode); + nreq++; + } + + /* processing */ + // XXX temp? XXX locking + { + int i; + struct fuse_dev *fud; + list_for_each_entry(fud, &fc->devices, entry) { + struct fuse_pqueue *fpq = &fud->pq; + + nreq = 0; + list_for_each_entry(req, &fpq->io, list) { + nreq++; + } + emitf("(%lu)Processing.IO:\n", nreq); + + // XXX print IO elements + + nreq = 0; + for (i = 0; i < FUSE_PQ_HASH_SIZE; i++) { + list_for_each_entry(req, &fpq->processing[i], list) { + nreq++; + } + } + emitf("(%lu)Processing.P:\n", nreq); + + nreq = 0; + for (i = 0; i < FUSE_PQ_HASH_SIZE; i++) { + list_for_each_entry(req, &fpq->processing[i], list) { + struct fuse_in_header *h = &req->in.h; + emitf("\t#%lu: .%lld R%d i%llu\n", nreq, h->unique, h->opcode, h->nodeid); + nreq++; + } + } + } + } + + return __total; +#undef emitf +} + +static ssize_t fuse_conn_iqueue_read(struct file *file, char __user *buf, + size_t len, loff_t *ppos) +{ + char *qdump; + + if (!*ppos) { + struct fuse_conn *fc = fuse_ctl_file_conn_get(file); + struct fuse_iqueue *fiq; + size_t n; + char *qdump2; + if (!fc) + return 0; + + fiq = &fc->iq; + spin_lock(&fiq->waitq.lock); + n = fuse_conn_iqueue_print(NULL, 0, fc); + n += 1; /* trailing 0 */ + + qdump = kmalloc(n, GFP_ATOMIC); + if (qdump) { + fuse_conn_iqueue_print(qdump, n, fc); + } + + spin_unlock(&fiq->waitq.lock); + fuse_conn_put(fc); + + if (!qdump) { + return -ENOMEM; + } + + /* release atomic memory, since it is scarce resource */ + qdump2 = kstrdup(qdump, GFP_KERNEL); + kfree(qdump); + + file->private_data = (void *)qdump2; + // TODO release qdump on file release + } + + qdump = (char *)file->private_data; + return simple_read_from_buffer(buf, len, ppos, qdump, strlen(qdump)); +} + static ssize_t fuse_conn_limit_read(struct file *file, char __user *buf, size_t len, loff_t *ppos, unsigned val) { @@ -202,6 +356,12 @@ static const struct file_operations fuse_ctl_waiting_ops = { .llseek = no_llseek, };
+static const struct file_operations fuse_ctl_queue_ops = { + .open = nonseekable_open, + .read = fuse_conn_iqueue_read, + .llseek = no_llseek, +}; + static const struct file_operations fuse_conn_max_background_ops = { .open = nonseekable_open, .read = fuse_conn_max_background_read, @@ -278,6 +438,8 @@ int fuse_ctl_add_conn(struct fuse_conn *fc)
if (!fuse_ctl_add_dentry(parent, fc, "waiting", S_IFREG | 0400, 1, NULL, &fuse_ctl_waiting_ops) || + !fuse_ctl_add_dentry(parent, fc, "queue", S_IFREG | 0400, 1, + NULL, &fuse_ctl_queue_ops) || !fuse_ctl_add_dentry(parent, fc, "abort", S_IFREG | 0200, 1, NULL, &fuse_ctl_abort_ops) || !fuse_ctl_add_dentry(parent, fc, "max_background", S_IFREG | 0600, diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h index 0920c0c032a0..7efef59caaa9 100644 --- a/fs/fuse/fuse_i.h +++ b/fs/fuse/fuse_i.h @@ -41,7 +41,7 @@ #define FUSE_NAME_MAX 1024
/** Number of dentries for each connection in the control filesystem */ -#define FUSE_CTL_NUM_DENTRIES 5 +#define FUSE_CTL_NUM_DENTRIES 6
/** Number of page pointers embedded in fuse_req */ #define FUSE_REQ_INLINE_PAGES 1
On Wed, Feb 27, 2019 at 9:02 PM Kirill Smelkov kirr@nexedi.com wrote:
Miklos, first of all thanks for feedback.
On Tue, Feb 26, 2019 at 04:14:22PM +0100, Miklos Szeredi wrote:
On Tue, Feb 19, 2019 at 10:42 AM Kirill Smelkov kirr@nexedi.com wrote:
A successful call to NOTIFY_RETRIEVE by filesystem carries promise from the kernel to send back NOTIFY_REPLY message. However if the filesystem is not reading requests with fuse_conn->max_pages capacity,
That's a violation of the contract by the fuse server, not the kernel.
Do you mean that even if filesystem server configures via init_out.max_write that it is accepting e.g. only 32K max writes, it still has to be issuing sys_read with buffer of 128K (= hardcoded fuse_conn->max_pages before Linux 4.20, and default since Linux 4.20)?
Filesystem is asking for a specific number of bytes to retrieve. It does not have to be less than max_writes, but it does have to fit into the request buffer it is using. If the filesystem is asking to retrieve 64k and it is using a 32k request buffer, then that obviously won't work. Kernel could limit the retrieve length to max_writes, that may make sense, but it doesn't fundamentally change the fact that if the filesystem is not properly sizing the request buffer, it may result in various forms of breakage.
Thanks, Miklos
On Wed, Feb 27, 2019 at 09:26:32PM +0100, Miklos Szeredi wrote:
On Wed, Feb 27, 2019 at 9:02 PM Kirill Smelkov kirr@nexedi.com wrote:
Miklos, first of all thanks for feedback.
On Tue, Feb 26, 2019 at 04:14:22PM +0100, Miklos Szeredi wrote:
On Tue, Feb 19, 2019 at 10:42 AM Kirill Smelkov kirr@nexedi.com wrote:
A successful call to NOTIFY_RETRIEVE by filesystem carries promise from the kernel to send back NOTIFY_REPLY message. However if the filesystem is not reading requests with fuse_conn->max_pages capacity,
That's a violation of the contract by the fuse server, not the kernel.
Do you mean that even if filesystem server configures via init_out.max_write that it is accepting e.g. only 32K max writes, it still has to be issuing sys_read with buffer of 128K (= hardcoded fuse_conn->max_pages before Linux 4.20, and default since Linux 4.20)?
Filesystem is asking for a specific number of bytes to retrieve. It does not have to be less than max_writes, but it does have to fit into the request buffer it is using. If the filesystem is asking to retrieve 64k and it is using a 32k request buffer, then that obviously won't work. Kernel could limit the retrieve length to max_writes, that may make sense, but it doesn't fundamentally change the fact that if the filesystem is not properly sizing the request buffer, it may result in various forms of breakage.
I more or less agree with this statement. However can we please make the breakage to be explicitly visible with an error instead of exhibiting it via harder to debug stucks/deadlocks? For example sys_read < max_write -> error instead of getting stuck. And if notify_retrieve requests buffer larger than max_write -> error or cut to max_write, but don't return OK when we know we will never send what was requested to filesystem even if it uses max_write sized reads. What is the point of breaking in hard to diagnose way when we can make the breakage showing itself explicitly? Would a patch for such behaviour accepted?
Thanks, Kirill
On Wed, Feb 27, 2019 at 9:39 PM Kirill Smelkov kirr@nexedi.com wrote:
I more or less agree with this statement. However can we please make the breakage to be explicitly visible with an error instead of exhibiting it via harder to debug stucks/deadlocks? For example sys_read < max_write -> error instead of getting stuck. And if notify_retrieve requests buffer larger than max_write -> error or cut to max_write, but don't return OK when we know we will never send what was requested to filesystem even if it uses max_write sized reads. What is the point of breaking in hard to diagnose way when we can make the breakage showing itself explicitly? Would a patch for such behaviour accepted?
Sure, if it's only adds a couple of lines. Adding more than say ten lines for such a non-bug fix is definitely excessive.
Thanks, Miklos
On Thu, Feb 28, 2019 at 09:10:15AM +0100, Miklos Szeredi wrote:
On Wed, Feb 27, 2019 at 9:39 PM Kirill Smelkov kirr@nexedi.com wrote:
I more or less agree with this statement. However can we please make the breakage to be explicitly visible with an error instead of exhibiting it via harder to debug stucks/deadlocks? For example sys_read < max_write -> error instead of getting stuck. And if notify_retrieve requests buffer larger than max_write -> error or cut to max_write, but don't return OK when we know we will never send what was requested to filesystem even if it uses max_write sized reads. What is the point of breaking in hard to diagnose way when we can make the breakage showing itself explicitly? Would a patch for such behaviour accepted?
Sure, if it's only adds a couple of lines. Adding more than say ten lines for such a non-bug fix is definitely excessive.
Ok, thanks. Please consider applying the following patch. (It's a bit pity to hear the problem is not considered to be a bug, but anyway).
I will also send the second patch as another mail, since I could not made `git am --scissors` to apply several patched extracted from one mail successfully.
Thanks, Kirill
---- 8< ---- From: Kirill Smelkov kirr@nexedi.com Date: Thu, 28 Feb 2019 13:06:18 +0300 Subject: [PATCH 1/2] fuse: retrieve: cap requested size to negotiated max_write MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 8bit
FUSE filesystem server and kernel client negotiate during initialization phase, what should be the maximum write size the client will ever issue. Correspondingly the filesystem server then queues sys_read calls to read requests with buffer capacity large enough to carry request header + that max_write bytes. A filesystem server is free to set its max_write in anywhere in the range between [1·page, fc->max_pages·page]. In particular go-fuse[2] sets max_write by default as 64K, wheres default fc->max_pages corresponds to 128K. Libfuse also allows users to configure max_write, but by default presets it to possible maximum.
If max_write is < fc->max_pages·page, and in NOTIFY_RETRIEVE handler we allow to retrieve more than max_write bytes, corresponding prepared NOTIFY_REPLY will be thrown away by fuse_dev_do_read, because the filesystem server, in full correspondence with server/client contract, will be only queuing sys_read with ~max_write buffer capacity, and fuse_dev_do_read throws away requests that cannot fit into server request buffer. In turn the filesystem server could get stuck waiting indefinitely for NOTIFY_REPLY since NOTIFY_RETRIEVE handler returned OK which is understood by clients as that NOTIFY_REPLY was queued and will be sent back.
-> Cap requested size to negotiate max_write to avoid the problem. This aligns with the way NOTIFY_RETRIEVE handler works, which already unconditionally caps requested retrieve size to fuse_conn->max_pages. This way it should not hurt NOTIFY_RETRIEVE semantic if we return less data than was originally requested.
Please see [1] for context where the problem of stuck filesystem was hit for real, how the situation was traced and for more involving patch that did not make it into the tree.
[1] https://marc.info/?l=linux-fsdevel&m=155057023600853&w=2 [2] https://github.com/hanwen/go-fuse
Signed-off-by: Kirill Smelkov kirr@nexedi.com Cc: Han-Wen Nienhuys hanwen@google.com Cc: Jakob Unterwurzacher jakobunt@gmail.com Cc: stable@vger.kernel.org # v2.6.36+ --- fs/fuse/dev.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c index 8a63e52785e9..38e94bc43053 100644 --- a/fs/fuse/dev.c +++ b/fs/fuse/dev.c @@ -1749,7 +1749,7 @@ static int fuse_retrieve(struct fuse_conn *fc, struct inode *inode, offset = outarg->offset & ~PAGE_MASK; file_size = i_size_read(inode);
- num = outarg->size; + num = min(outarg->size, fc->max_write); if (outarg->offset > file_size) num = 0; else if (outarg->offset + num > file_size)
On Thu, Feb 28, 2019 at 02:47:57PM +0300, Kirill Smelkov wrote:
On Thu, Feb 28, 2019 at 09:10:15AM +0100, Miklos Szeredi wrote:
On Wed, Feb 27, 2019 at 9:39 PM Kirill Smelkov kirr@nexedi.com wrote:
I more or less agree with this statement. However can we please make the breakage to be explicitly visible with an error instead of exhibiting it via harder to debug stucks/deadlocks? For example sys_read < max_write -> error instead of getting stuck. And if notify_retrieve requests buffer larger than max_write -> error or cut to max_write, but don't return OK when we know we will never send what was requested to filesystem even if it uses max_write sized reads. What is the point of breaking in hard to diagnose way when we can make the breakage showing itself explicitly? Would a patch for such behaviour accepted?
Sure, if it's only adds a couple of lines. Adding more than say ten lines for such a non-bug fix is definitely excessive.
Ok, thanks. Please consider applying the following patch. (It's a bit pity to hear the problem is not considered to be a bug, but anyway).
I will also send the second patch as another mail, since I could not made `git am --scissors` to apply several patched extracted from one mail successfully.
Ping. Miklos, is there anything wrong with this patch and its second counterpart?
Thank beforehand for feedback, Kirill
---- 8< ---- From: Kirill Smelkov kirr@nexedi.com Date: Thu, 28 Feb 2019 13:06:18 +0300 Subject: [PATCH 1/2] fuse: retrieve: cap requested size to negotiated max_write MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 8bit
FUSE filesystem server and kernel client negotiate during initialization phase, what should be the maximum write size the client will ever issue. Correspondingly the filesystem server then queues sys_read calls to read requests with buffer capacity large enough to carry request header
- that max_write bytes. A filesystem server is free to set its max_write
in anywhere in the range between [1·page, fc->max_pages·page]. In particular go-fuse[2] sets max_write by default as 64K, wheres default fc->max_pages corresponds to 128K. Libfuse also allows users to configure max_write, but by default presets it to possible maximum.
If max_write is < fc->max_pages·page, and in NOTIFY_RETRIEVE handler we allow to retrieve more than max_write bytes, corresponding prepared NOTIFY_REPLY will be thrown away by fuse_dev_do_read, because the filesystem server, in full correspondence with server/client contract, will be only queuing sys_read with ~max_write buffer capacity, and fuse_dev_do_read throws away requests that cannot fit into server request buffer. In turn the filesystem server could get stuck waiting indefinitely for NOTIFY_REPLY since NOTIFY_RETRIEVE handler returned OK which is understood by clients as that NOTIFY_REPLY was queued and will be sent back.
-> Cap requested size to negotiate max_write to avoid the problem. This aligns with the way NOTIFY_RETRIEVE handler works, which already unconditionally caps requested retrieve size to fuse_conn->max_pages. This way it should not hurt NOTIFY_RETRIEVE semantic if we return less data than was originally requested.
Please see [1] for context where the problem of stuck filesystem was hit for real, how the situation was traced and for more involving patch that did not make it into the tree.
[1] https://marc.info/?l=linux-fsdevel&m=155057023600853&w=2 [2] https://github.com/hanwen/go-fuse
Signed-off-by: Kirill Smelkov kirr@nexedi.com Cc: Han-Wen Nienhuys hanwen@google.com Cc: Jakob Unterwurzacher jakobunt@gmail.com Cc: stable@vger.kernel.org # v2.6.36+
fs/fuse/dev.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c index 8a63e52785e9..38e94bc43053 100644 --- a/fs/fuse/dev.c +++ b/fs/fuse/dev.c @@ -1749,7 +1749,7 @@ static int fuse_retrieve(struct fuse_conn *fc, struct inode *inode, offset = outarg->offset & ~PAGE_MASK; file_size = i_size_read(inode);
- num = outarg->size;
- num = min(outarg->size, fc->max_write); if (outarg->offset > file_size) num = 0; else if (outarg->offset + num > file_size)
-- 2.21.0.352.gf09ad66450
linux-stable-mirror@lists.linaro.org