From: Chen Linxuan me@black-desk.cn
When using fsconfig(..., FSCONFIG_CMD_CREATE, ...), the filesystem context is retrieved from the file descriptor. Since the file structure persists across syscall restarts, the context state is preserved:
// fs/fsopen.c SYSCALL_DEFINE5(fsconfig, ...) { ... fc = fd_file(f)->private_data; ... ret = vfs_fsconfig_locked(fc, cmd, ¶m); ... }
In vfs_cmd_create(), the context phase is transitioned to FS_CONTEXT_CREATING before calling vfs_get_tree():
// fs/fsopen.c static int vfs_cmd_create(struct fs_context *fc, bool exclusive) { ... fc->phase = FS_CONTEXT_CREATING; ... ret = vfs_get_tree(fc); ... }
However, vfs_get_tree() may return -ERESTARTNOINTR if the filesystem implementation needs to restart the syscall. For example, cgroup v1 does this when it encounters a race condition where the root is dying:
// kernel/cgroup/cgroup-v1.c int cgroup1_get_tree(struct fs_context *fc) { ... if (unlikely(ret > 0)) { msleep(10); return restart_syscall(); } return ret; }
If the syscall is restarted, fsconfig() is called again and retrieves the *same* fs_context. However, vfs_cmd_create() rejects the call because the phase was left as FS_CONTEXT_CREATING during the first attempt:
if (fc->phase != FS_CONTEXT_CREATE_PARAMS) return -EBUSY;
Fix this by resetting fc->phase back to FS_CONTEXT_CREATE_PARAMS if vfs_get_tree() returns -ERESTARTNOINTR.
Cc: stable@vger.kernel.org Signed-off-by: Chen Linxuan me@black-desk.cn --- fs/fsopen.c | 4 ++++ 1 file changed, 4 insertions(+)
diff --git a/fs/fsopen.c b/fs/fsopen.c index f645c99204eb..8a7cb031af50 100644 --- a/fs/fsopen.c +++ b/fs/fsopen.c @@ -229,6 +229,10 @@ static int vfs_cmd_create(struct fs_context *fc, bool exclusive) fc->exclusive = exclusive;
ret = vfs_get_tree(fc); + if (ret == -ERESTARTNOINTR) { + fc->phase = FS_CONTEXT_CREATE_PARAMS; + return ret; + } if (ret) { fc->phase = FS_CONTEXT_FAILED; return ret;
--- base-commit: 187d0801404f415f22c0b31531982c7ea97fa341 change-id: 20251213-mount-ebusy-8ee3888a7e4f
Best regards,
On Sat 13-12-25 02:03:56, Chen Linxuan via B4 Relay wrote:
From: Chen Linxuan me@black-desk.cn
When using fsconfig(..., FSCONFIG_CMD_CREATE, ...), the filesystem context is retrieved from the file descriptor. Since the file structure persists across syscall restarts, the context state is preserved:
// fs/fsopen.c SYSCALL_DEFINE5(fsconfig, ...) { ... fc = fd_file(f)->private_data; ... ret = vfs_fsconfig_locked(fc, cmd, ¶m); ... }
In vfs_cmd_create(), the context phase is transitioned to FS_CONTEXT_CREATING before calling vfs_get_tree():
// fs/fsopen.c static int vfs_cmd_create(struct fs_context *fc, bool exclusive) { ... fc->phase = FS_CONTEXT_CREATING; ... ret = vfs_get_tree(fc); ... }
However, vfs_get_tree() may return -ERESTARTNOINTR if the filesystem implementation needs to restart the syscall. For example, cgroup v1 does this when it encounters a race condition where the root is dying:
// kernel/cgroup/cgroup-v1.c int cgroup1_get_tree(struct fs_context *fc) { ... if (unlikely(ret > 0)) { msleep(10); return restart_syscall(); } return ret; }
If the syscall is restarted, fsconfig() is called again and retrieves the *same* fs_context. However, vfs_cmd_create() rejects the call because the phase was left as FS_CONTEXT_CREATING during the first attempt:
Well, not quite. The phase is actually set to FS_CONTEXT_FAILED if vfs_get_tree() returns any error. Still the effect is the same.
if (fc->phase != FS_CONTEXT_CREATE_PARAMS) return -EBUSY;
Fix this by resetting fc->phase back to FS_CONTEXT_CREATE_PARAMS if vfs_get_tree() returns -ERESTARTNOINTR.
Cc: stable@vger.kernel.org Signed-off-by: Chen Linxuan me@black-desk.cn
fs/fsopen.c | 4 ++++ 1 file changed, 4 insertions(+)
diff --git a/fs/fsopen.c b/fs/fsopen.c index f645c99204eb..8a7cb031af50 100644 --- a/fs/fsopen.c +++ b/fs/fsopen.c @@ -229,6 +229,10 @@ static int vfs_cmd_create(struct fs_context *fc, bool exclusive) fc->exclusive = exclusive; ret = vfs_get_tree(fc);
- if (ret == -ERESTARTNOINTR) {
fc->phase = FS_CONTEXT_CREATE_PARAMS;return ret;- } if (ret) { fc->phase = FS_CONTEXT_FAILED; return ret;
Thanks for the patch. It looks good to me. I'd slightly prefer style like:
if (ret) { if (ret == -ERESTARTNOINTR) fc->phase = FS_CONTEXT_CREATE_PARAMS; else fc->phase = FS_CONTEXT_FAILED; return ret; }
but I can live with what you propose as well. So feel free to add:
Reviewed-by: Jan Kara jack@suse.cz
Honza
On Mon, Dec 15, 2025 at 4:46 PM Jan Kara jack@suse.cz wrote:
On Sat 13-12-25 02:03:56, Chen Linxuan via B4 Relay wrote:
If the syscall is restarted, fsconfig() is called again and retrieves the *same* fs_context. However, vfs_cmd_create() rejects the call because the phase was left as FS_CONTEXT_CREATING during the first attempt:
Well, not quite. The phase is actually set to FS_CONTEXT_FAILED if vfs_get_tree() returns any error. Still the effect is the same.
Oh, that's a mistake.
Thanks for the patch. It looks good to me. I'd slightly prefer style like:
if (ret) { if (ret == -ERESTARTNOINTR) fc->phase = FS_CONTEXT_CREATE_PARAMS; else fc->phase = FS_CONTEXT_FAILED; return ret; }
Will be applied in v2.
On Mon, Dec 15, 2025 at 09:46:19AM +0100, Jan Kara wrote:
On Sat 13-12-25 02:03:56, Chen Linxuan via B4 Relay wrote:
From: Chen Linxuan me@black-desk.cn
When using fsconfig(..., FSCONFIG_CMD_CREATE, ...), the filesystem context is retrieved from the file descriptor. Since the file structure persists across syscall restarts, the context state is preserved:
// fs/fsopen.c SYSCALL_DEFINE5(fsconfig, ...) { ... fc = fd_file(f)->private_data; ... ret = vfs_fsconfig_locked(fc, cmd, ¶m); ... }
In vfs_cmd_create(), the context phase is transitioned to FS_CONTEXT_CREATING before calling vfs_get_tree():
// fs/fsopen.c static int vfs_cmd_create(struct fs_context *fc, bool exclusive) { ... fc->phase = FS_CONTEXT_CREATING; ... ret = vfs_get_tree(fc); ... }
However, vfs_get_tree() may return -ERESTARTNOINTR if the filesystem implementation needs to restart the syscall. For example, cgroup v1 does this when it encounters a race condition where the root is dying:
// kernel/cgroup/cgroup-v1.c int cgroup1_get_tree(struct fs_context *fc) { ... if (unlikely(ret > 0)) { msleep(10); return restart_syscall(); } return ret; }
If the syscall is restarted, fsconfig() is called again and retrieves the *same* fs_context. However, vfs_cmd_create() rejects the call because the phase was left as FS_CONTEXT_CREATING during the first attempt:
Well, not quite. The phase is actually set to FS_CONTEXT_FAILED if vfs_get_tree() returns any error. Still the effect is the same.
Uh, I'm not sure we should do this. If this only affects cgroup v1 then I say we should simply not care at all. It's a deprecated api and anyone using it uses something that is inherently broken and a big portion of userspace has already migrated. The current or upcoming systemd release has dropped all cgroup v1 support.
Generally, making fsconfig() restartable is not as trivial as it looks because once you called into the filesystem the config that was setup might have already been consumed. That's definitely the case for stuff in overlayfs and others. So no, that patch won't work and btw, I remembered that we already had that discussion a few years ago and I was right:
https://lore.kernel.org/20200923201958.b27ecda5a1e788fb5f472bcd@virtuozzo.co...
On Mon 15-12-25 12:55:12, Christian Brauner wrote:
On Mon, Dec 15, 2025 at 09:46:19AM +0100, Jan Kara wrote:
On Sat 13-12-25 02:03:56, Chen Linxuan via B4 Relay wrote:
From: Chen Linxuan me@black-desk.cn
When using fsconfig(..., FSCONFIG_CMD_CREATE, ...), the filesystem context is retrieved from the file descriptor. Since the file structure persists across syscall restarts, the context state is preserved:
// fs/fsopen.c SYSCALL_DEFINE5(fsconfig, ...) { ... fc = fd_file(f)->private_data; ... ret = vfs_fsconfig_locked(fc, cmd, ¶m); ... }
In vfs_cmd_create(), the context phase is transitioned to FS_CONTEXT_CREATING before calling vfs_get_tree():
// fs/fsopen.c static int vfs_cmd_create(struct fs_context *fc, bool exclusive) { ... fc->phase = FS_CONTEXT_CREATING; ... ret = vfs_get_tree(fc); ... }
However, vfs_get_tree() may return -ERESTARTNOINTR if the filesystem implementation needs to restart the syscall. For example, cgroup v1 does this when it encounters a race condition where the root is dying:
// kernel/cgroup/cgroup-v1.c int cgroup1_get_tree(struct fs_context *fc) { ... if (unlikely(ret > 0)) { msleep(10); return restart_syscall(); } return ret; }
If the syscall is restarted, fsconfig() is called again and retrieves the *same* fs_context. However, vfs_cmd_create() rejects the call because the phase was left as FS_CONTEXT_CREATING during the first attempt:
Well, not quite. The phase is actually set to FS_CONTEXT_FAILED if vfs_get_tree() returns any error. Still the effect is the same.
Uh, I'm not sure we should do this. If this only affects cgroup v1 then I say we should simply not care at all. It's a deprecated api and anyone using it uses something that is inherently broken and a big portion of userspace has already migrated. The current or upcoming systemd release has dropped all cgroup v1 support.
Generally, making fsconfig() restartable is not as trivial as it looks because once you called into the filesystem the config that was setup might have already been consumed. That's definitely the case for stuff in overlayfs and others. So no, that patch won't work and btw, I remembered that we already had that discussion a few years ago and I was right:
https://lore.kernel.org/20200923201958.b27ecda5a1e788fb5f472bcd@virtuozzo.co...
I see. I was assuming that if the filesystem returns ERESTART* it will make sure to not consume the context it wants to restart with. But I can see why that might be too errorprone and not really worth the hassle in this case. So fine by me if you don't want to go that route.
Honza
On Mon, Dec 15, 2025 at 7:55 PM Christian Brauner brauner@kernel.org wrote:
Uh, I'm not sure we should do this. If this only affects cgroup v1 then I say we should simply not care at all. It's a deprecated api and anyone using it uses something that is inherently broken and a big portion of userspace has already migrated. The current or upcoming systemd release has dropped all cgroup v1 support.
I am not quite sure if nfs will be affected by this or not. It looks like vfs_get_tree -> nfs_get_tree -> nfs_try_get_tree -> nfs_try_get_tree -> nfs_request_mount -> nfs_mount -> rpc_call_sync -> rpc_run_task -> rpc_execute -> __rpc_execute -> rpc_wait_bit_killable might return -ERESTARTSYS.
linux-stable-mirror@lists.linaro.org