Miklos, first of all thanks for the feedback.
On Tue, Feb 26, 2019 at 04:14:22PM +0100, Miklos Szeredi wrote:
> On Tue, Feb 19, 2019 at 10:42 AM Kirill Smelkov <kirr@nexedi.com> wrote:
> > A successful call to NOTIFY_RETRIEVE by the filesystem carries a promise from the kernel to send back a NOTIFY_REPLY message. However, if the filesystem is not reading requests with fuse_conn->max_pages capacity,
>
> That's a violation of the contract by the fuse server, not the kernel.
Do you mean that even if the filesystem server configures, via init_out.max_write, that it accepts e.g. only 32K max writes, it still has to issue sys_read with a 128K buffer (= hardcoded fuse_conn->max_pages before Linux 4.20, and the default since Linux 4.20)?
Also, I could not find the FUSE contract specified anywhere, so I used the message of the commit that added NOTIFY_RETRIEVE support to infer its semantics:
    commit 2d45ba381a74a743eeaa2b06c7c5c0d2bf73ba1a
    Author: Miklos Szeredi <mszeredi@suse.cz>
    Date:   Mon Jul 12 14:41:40 2010 +0200

        fuse: add retrieve request

        Userspace filesystem can request data to be retrieved from the
        inode's mapping. This request is synchronous and the retrieved data
        is queued as a new request. If the write to the fuse device returns
        an error then the retrieve request was not completed and a reply
        will not be sent.

        Only present pages are returned in the retrieve reply. Retrieving
        stops when it finds a non-present page and only data prior to that
        is returned.

        This request doesn't change the dirty state of pages.

        Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>

which, even if not explicitly, gives the impression that if NOTIFY_RETRIEVE was queued successfully, the reply will come.
Also: if it is a violation of the contract by the filesystem server, shouldn't the kernel return ESOMETHING for the violating sys_read, instead of leaving that read waiting indefinitely?
In summary: instead of getting clients stuck silently, I still suggest that the data for NOTIFY_RETRIEVE be sent to the client whenever it can be sent, and also that EINVAL be returned for /dev/fuse sys_read calls that violate the server/kernel contract.
> > fuse_dev_do_read might see that the "request is too large" and decide to
> > "reply with an error and restart the read". "Reply with an error" has the
> > underlying assumption that there is a "requester thread" waiting for
> > request completion, which is true for most requests, but is not true for
> > NOTIFY_REPLY: the NOTIFY_RETRIEVE handler completes with OK status right
> > after it could successfully queue the NOTIFY_REPLY message, without
> > waiting for NOTIFY_REPLY completion. This leads to a situation where the
> > filesystem requested to retrieve inode data with NOTIFY_RETRIEVE, got
> > err=OK for that notification request, but NOTIFY_REPLY is not coming back.
> >
> > More, since there is no "requester thread" to handle the error, the
> > situation shows itself as /sys/fs/fuse/connections/X/waiting=1 _and_
> > /dev/fuse read(s) queued. Which is misleading, since the NOTIFY_REPLY
> > request was removed from the pending queue and abandoned.
>
> Now I don't understand how that would happen. If the request is
> abandoned, its refcount should go down to zero and the num_waiting count
> decremented accordingly.
You are right - it was my mistake. I misinterpreted waiting=1 as the request not having been transferred to the filesystem server yet, but in my test it turned out to be already transferred and sitting on the "processing" list, waiting for the corresponding reply (which was not coming because the filesystem in turn was stuck waiting for NOTIFY_REPLY to come from the kernel):
    root@(none):~/src/wendelin/wendelin.core/wcfs# cat /sys/fs/fuse/connections/25/waiting
    1
    root@(none):~/src/wendelin/wendelin.core/wcfs# cat /sys/fs/fuse/connections/25/queue
    #waiting: 1
    (0)Interrupt:
    (0)Forget:
    (0)Request:
    (0)Processing.IO:
    (1)Processing.P:
            #0: .52 R15 i5
So the part of commit message that discussed X/waiting=1 is wrong and should be dropped - thanks for catching this.
Still, the main point is what the semantics of NOTIFY_RETRIEVE vs NOTIFY_REPLY vs INIT.max_write should be, and that it is better to always send the retrieve data if it was promised, and also to explicitly indicate with an error when a filesystem server violates the FUSE server/kernel contract.
Thanks, Kirill
P.S. I attach the draft patch for /sys/fs/fuse/connections/X/queue in case someone is interested.
---- 8< ----
diff --git a/fs/fuse/control.c b/fs/fuse/control.c
index fe80bea4ad89..f4e22f5436e2 100644
--- a/fs/fuse/control.c
+++ b/fs/fuse/control.c
@@ -63,6 +63,160 @@ static ssize_t fuse_conn_waiting_read(struct file *file, char __user *buf,
 	return simple_read_from_buffer(buf, len, ppos, tmp, size);
 }
 
+/* fuse_conn_iqueue_print prints input queue into provided buf.
+ *
+ * buf can be NULL in which case only the length of would-be printed text is
+ * returned and nothing is actually printed.
+ *
+ * must be called with fc->iq->waitq locked. */
+static size_t fuse_conn_iqueue_print(char *buf, size_t size, struct fuse_conn *fc)
+{
+	struct fuse_iqueue *fiq = &fc->iq;
+	struct fuse_req *req;
+	struct fuse_forget_link *freq;
+	size_t nreq;
+
+	size_t __n, __total = 0;
+#define emitf(FORMAT, ...) do {					\
+	__n = snprintf(buf, size, FORMAT, __VA_ARGS__);		\
+	__total += __n;						\
+	if (buf) {						\
+		size -= __n;					\
+		buf += __n;					\
+	}							\
+} while (0)
+
+
+	if (!buf)
+		size = 0;
+
+	// XXX temp
+	emitf("#waiting: %d\n", atomic_read(&fc->num_waiting));
+
+	/* interrupts */
+	nreq = 0;
+	list_for_each_entry(req, &fiq->interrupts, list) {
+		nreq++;
+	}
+
+	emitf("(%lu)Interrupt:\n", nreq);
+
+	nreq = 0;
+	list_for_each_entry(req, &fiq->interrupts, list) {
+		emitf("\t#%lu: R%d\n", nreq, req->in.h.opcode);
+		nreq++;
+	}
+
+	/* forgets */
+	nreq = 0;
+	for (freq = fiq->forget_list_head.next; freq != NULL; freq = freq->next) {
+		nreq++;
+	}
+
+	emitf("(%lu)Forget:\n", nreq);
+
+	nreq = 0;
+	for (freq = fiq->forget_list_head.next; freq != NULL; freq = freq->next) {
+		emitf("\t#%lu: FORGET i%llu -%llu\n", nreq,
+		      freq->forget_one.nodeid, freq->forget_one.nlookup);
+		nreq++;
+	}
+
+	/* all other requests */
+	nreq = 0;
+	list_for_each_entry(req, &fiq->pending, list) {
+		nreq++;
+	}
+
+	emitf("(%lu)Request:\n", nreq);
+
+	nreq = 0;
+	list_for_each_entry(req, &fiq->pending, list) {
+		emitf("\t#%lu: R%d\n", nreq, req->in.h.opcode);
+		nreq++;
+	}
+
+	/* processing */
+	// XXX temp?  XXX locking
+	{
+		int i;
+		struct fuse_dev *fud;
+		list_for_each_entry(fud, &fc->devices, entry) {
+			struct fuse_pqueue *fpq = &fud->pq;
+
+			nreq = 0;
+			list_for_each_entry(req, &fpq->io, list) {
+				nreq++;
+			}
+			emitf("(%lu)Processing.IO:\n", nreq);
+
+			// XXX print IO elements
+
+			nreq = 0;
+			for (i = 0; i < FUSE_PQ_HASH_SIZE; i++) {
+				list_for_each_entry(req, &fpq->processing[i], list) {
+					nreq++;
+				}
+			}
+			emitf("(%lu)Processing.P:\n", nreq);
+
+			nreq = 0;
+			for (i = 0; i < FUSE_PQ_HASH_SIZE; i++) {
+				list_for_each_entry(req, &fpq->processing[i], list) {
+					struct fuse_in_header *h = &req->in.h;
+					emitf("\t#%lu: .%lld R%d i%llu\n", nreq, h->unique, h->opcode, h->nodeid);
+					nreq++;
+				}
+			}
+		}
+	}
+
+	return __total;
+#undef emitf
+}
+
+static ssize_t fuse_conn_iqueue_read(struct file *file, char __user *buf,
+				     size_t len, loff_t *ppos)
+{
+	char *qdump;
+
+	if (!*ppos) {
+		struct fuse_conn *fc = fuse_ctl_file_conn_get(file);
+		struct fuse_iqueue *fiq;
+		size_t n;
+		char *qdump2;
+		if (!fc)
+			return 0;
+
+		fiq = &fc->iq;
+		spin_lock(&fiq->waitq.lock);
+		n = fuse_conn_iqueue_print(NULL, 0, fc);
+		n += 1;	/* trailing 0 */
+
+		qdump = kmalloc(n, GFP_ATOMIC);
+		if (qdump) {
+			fuse_conn_iqueue_print(qdump, n, fc);
+		}
+
+		spin_unlock(&fiq->waitq.lock);
+		fuse_conn_put(fc);
+
+		if (!qdump) {
+			return -ENOMEM;
+		}
+
+		/* release atomic memory, since it is scarce resource */
+		qdump2 = kstrdup(qdump, GFP_KERNEL);
+		kfree(qdump);
+
+		file->private_data = (void *)qdump2;
+		// TODO release qdump on file release
+	}
+
+	qdump = (char *)file->private_data;
+	return simple_read_from_buffer(buf, len, ppos, qdump, strlen(qdump));
+}
+
 static ssize_t fuse_conn_limit_read(struct file *file, char __user *buf,
 				    size_t len, loff_t *ppos, unsigned val)
 {
@@ -202,6 +356,12 @@ static const struct file_operations fuse_ctl_waiting_ops = {
 	.llseek = no_llseek,
 };
 
+static const struct file_operations fuse_ctl_queue_ops = {
+	.open = nonseekable_open,
+	.read = fuse_conn_iqueue_read,
+	.llseek = no_llseek,
+};
+
 static const struct file_operations fuse_conn_max_background_ops = {
 	.open = nonseekable_open,
 	.read = fuse_conn_max_background_read,
@@ -278,6 +438,8 @@ int fuse_ctl_add_conn(struct fuse_conn *fc)
 
 	if (!fuse_ctl_add_dentry(parent, fc, "waiting", S_IFREG | 0400, 1,
 				 NULL, &fuse_ctl_waiting_ops) ||
+	    !fuse_ctl_add_dentry(parent, fc, "queue", S_IFREG | 0400, 1,
+				 NULL, &fuse_ctl_queue_ops) ||
 	    !fuse_ctl_add_dentry(parent, fc, "abort", S_IFREG | 0200, 1,
 				 NULL, &fuse_ctl_abort_ops) ||
 	    !fuse_ctl_add_dentry(parent, fc, "max_background", S_IFREG | 0600,
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index 0920c0c032a0..7efef59caaa9 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -41,7 +41,7 @@
 #define FUSE_NAME_MAX		1024
 
 /** Number of dentries for each connection in the control filesystem */
-#define FUSE_CTL_NUM_DENTRIES	5
+#define FUSE_CTL_NUM_DENTRIES	6
 
 /** Number of page pointers embedded in fuse_req */
 #define FUSE_REQ_INLINE_PAGES	1