An undocumented feature of the mount interface was that it was possible to mount over a symlink (even with the old mount API) by mounting over /proc/self/fd/$n -- where the corresponding file descrpitor was opened with (O_PATH|O_NOFOLLOW). This didn't work with traditional "new" mounts (for a variety of reasons), but MS_BIND worked without issue. With the new mount API it was even easier.
A reasonably detailed explanation of the issues is provided in the patch itself, but the full traces produced by both the oopses and deadlocks is included below (it makes little sense to include them in the commit since we are disabling this feature, not directly fixing the bugs themselves).
I've posted this as an RFC on whether this feature should be allowed at all (and if anyone knows of legitimate uses for it), or if we should work on fixing these other kernel bugs that it exposes.
Oops on NULL dereference: BUG: kernel NULL pointer dereference, address: 0000000000000000 #PF: supervisor instruction fetch in kernel mode #PF: error_code(0x0010) - not-present page PGD 8000000181b1f067 P4D 8000000181b1f067 PUD 24829c067 PMD 0 Oops: 0010 [#1] SMP PTI CPU: 6 PID: 20796 Comm: mount_to_symlin Tainted: G OE 5.5.0-rc1+openat2~v18+ #123 Hardware name: LENOVO 20KHCTO1WW/20KHCTO1WW, BIOS N23ET55W (1.30 ) 08/31/2018 RIP: 0010:0x0 Code: Bad RIP value. RSP: 0018:ffffbc7d87e1bcb0 EFLAGS: 00010206 RAX: 0000000000000000 RBX: ffffa0c28cb633c0 RCX: 000000000000ae5a RDX: 0000000000000089 RSI: ffffa0c0eece8840 RDI: ffffa0c0eb8843b0 RBP: ffffa0c0eb8843b0 R08: ffffdc7d7fbbb770 R09: ffffa0c0ca333000 R10: 0000000000000000 R11: 808080807fffffff R12: ffffa0c0eece8840 R13: 0000000000000089 R14: ffffbc7d87e1bdb0 R15: 0000000000000080 FS: 00007fd921508540(0000) GS:ffffa0c3cf580000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: ffffffffffffffd6 CR3: 000000018878a003 CR4: 00000000003606e0 Call Trace: __lookup_slow+0x94/0x160 lookup_slow+0x36/0x50 path_mountpoint+0x1be/0x350 filename_mountpoint+0xa5/0x150 ? __lookup_hash+0xa0/0xa0 ksys_umount+0x78/0x490 __x64_sys_umount+0x12/0x20 do_syscall_64+0x64/0x240 entry_SYSCALL_64_after_hwframe+0x49/0xbe RIP: 0033:0x7fd92143f4e7 Code: 09 0c 00 f7 d8 64 89 01 48 83 c8 ff c3 66 0f 1f 44 00 00 31 f6 e9 09 00 00 00 66 0f 1f 84 00 00 00 00 00 b8 a6 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 69 09 0c 00 f7 d8 64 89 01 48 RSP: 002b:00007ffe98c89cc8 EFLAGS: 00000202 ORIG_RAX: 00000000000000a6 RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fd92143f4e7 RDX: 0000000000000000 RSI: 0000000000000002 RDI: 000000000167a330 RBP: 00007ffe98c89da0 R08: 0000000000000000 R09: 000000000000000f R10: 00000000004004c6 R11: 0000000000000202 R12: 00000000004010c0 R13: 00007ffe98c89e80 R14: 0000000000000000 R15: 0000000000000000 CR2: 0000000000000000
Oops on kernel address: BUG: unable to handle page fault for address: ffffbc7d87e1bcc0 #PF: supervisor write access in kernel mode #PF: error_code(0x0002) - not-present page PGD 107d4a067 P4D 107d4a067 PUD 107d4b067 PMD 46d753067 PTE 0 Oops: 0002 [#2] SMP PTI CPU: 4 PID: 20975 Comm: mount_to_symlin Tainted: G D OE 5.5.0-rc1+openat2~v18+ #123 Hardware name: LENOVO 20KHCTO1WW/20KHCTO1WW, BIOS N23ET55W (1.30 ) 08/31/2018 RIP: 0010:_raw_spin_lock_irqsave+0x28/0x50 Code: 00 00 0f 1f 44 00 00 41 54 53 48 89 fb 9c 58 0f 1f 44 00 00 49 89 c4 fa 66 0f 1f 44 00 00 e8 3f 55 82 ff 31 c0 ba 01 00 00 00 <f0> 0f b1 13 75 07 4c 89 e0 5b 41 5c c3 89 c6 48 89 df e8 01 52 77 RSP: 0018:ffffbc7d90067bd8 EFLAGS: 00010046 RAX: 0000000000000000 RBX: ffffbc7d87e1bcc0 RCX: 0000000200000000 RDX: 0000000000000001 RSI: ffffbc7d90067c50 RDI: ffffbc7d87e1bcc0 RBP: ffffbc7d87e1bcc0 R08: 0000000000000001 R09: 0000000000000003 R10: 0000000000000000 R11: 808080807fffffff R12: 0000000000000246 R13: ffffa0c28cb633c0 R14: ffffbc7d90067db0 R15: ffffa0c0eece8898 FS: 00007f4b80214540(0000) GS:ffffa0c3cf500000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: ffffbc7d87e1bcc0 CR3: 000000026d4d0002 CR4: 00000000003606e0 Call Trace: add_wait_queue+0x15/0x40 d_alloc_parallel+0x36d/0x480 ? get_acl+0x1a/0x160 ? wake_up_q+0xa0/0xa0 __lookup_slow+0x6b/0x160 lookup_slow+0x36/0x50 path_mountpoint+0x1be/0x350 filename_mountpoint+0xa5/0x150 ? __lookup_hash+0xa0/0xa0 ksys_umount+0x78/0x490 __x64_sys_umount+0x12/0x20 do_syscall_64+0x64/0x240 entry_SYSCALL_64_after_hwframe+0x49/0xbe RIP: 0033:0x7f4b8014b4e7 Code: 09 0c 00 f7 d8 64 89 01 48 83 c8 ff c3 66 0f 1f 44 00 00 31 f6 e9 09 00 00 00 66 0f 1f 84 00 00 00 00 00 b8 a6 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 69 09 0c 00 f7 d8 64 89 01 48 RSP: 002b:00007ffee8041b28 EFLAGS: 00000206 ORIG_RAX: 00000000000000a6 RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f4b8014b4e7 RDX: 0000000000000000 RSI: 0000000000000002 RDI: 00000000019c8330 RBP: 00007ffee8041c00 R08: 0000000000000000 R09: 000000000000000f R10: 00000000004004c6 R11: 0000000000000206 R12: 00000000004010c0 R13: 00007ffee8041ce0 R14: 0000000000000000 R15: 0000000000000000 CR2: ffffbc7d87e1bcc0
Apparent deadlock in d_alloc_parallel: watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [mount_to_symlin:21285] CPU: 0 PID: 21285 Comm: mount_to_symlin Tainted: G D OE 5.5.0-rc1+openat2~v18+ #123 Hardware name: LENOVO 20KHCTO1WW/20KHCTO1WW, BIOS N23ET55W (1.30 ) 08/31/2018 RIP: 0010:native_queued_spin_lock_slowpath+0x5b/0x1d0 Code: 6d f0 0f ba 2f 08 0f 92 c0 0f b6 c0 c1 e0 08 89 c2 8b 07 30 e4 09 d0 a9 00 01 ff ff 75 47 85 c0 74 0e 8b 07 84 c0 74 08 f3 90 <8b> 07 84 c0 75 f8 b8 01 00 00 00 66 89 07 c3 8b 37 81 fe 00 01 00 RSP: 0018:ffffbc7d90547be8 EFLAGS: 00000202 ORIG_RAX: ffffffffffffff13 RAX: 0000000000000101 RBX: ffffffffbac7ac60 RCX: 0000000000000018 RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffa0c0eece8898 RBP: ffffa0c0eece8898 R08: 00000000006f6f66 R09: 0000000000000003 R10: 0000000000000000 R11: 808080807fffffff R12: 00000000e25b3c73 R13: ffffa0c28cb633c0 R14: ffffbc7d90547db0 R15: ffffa0c0eece8898 FS: 00007fbb1fd30540(0000) GS:ffffa0c3cf400000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007fbb1fbd25a0 CR3: 0000000181ace005 CR4: 00000000003606f0 Call Trace: _raw_spin_lock+0x1a/0x20 lockref_get_not_dead+0x4f/0x90 d_alloc_parallel+0x1a8/0x480 ? get_acl+0x1a/0x160 __lookup_slow+0x6b/0x160 lookup_slow+0x36/0x50 path_mountpoint+0x1be/0x350 filename_mountpoint+0xa5/0x150 ? __lookup_hash+0xa0/0xa0 ksys_umount+0x78/0x490 __x64_sys_umount+0x12/0x20 do_syscall_64+0x64/0x240 entry_SYSCALL_64_after_hwframe+0x49/0xbe RIP: 0033:0x7fbb1fc674e7 Code: 09 0c 00 f7 d8 64 89 01 48 83 c8 ff c3 66 0f 1f 44 00 00 31 f6 e9 09 00 00 00 66 0f 1f 84 00 00 00 00 00 b8 a6 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 69 09 0c 00 f7 d8 64 89 01 48 RSP: 002b:00007ffd75fcb858 EFLAGS: 00000202 ORIG_RAX: 00000000000000a6 RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fbb1fc674e7 RDX: 0000000000000000 RSI: 0000000000000002 RDI: 0000000000f6c330 RBP: 00007ffd75fcb930 R08: 0000000000000000 R09: 000000000000000f R10: 00000000004004a6 R11: 0000000000000202 R12: 00000000004010b0 R13: 00007ffd75fcba10 R14: 0000000000000000 R15: 0000000000000000
RCU stall when trying to grab /proc/$pid/stack for the stuck process: rcu: INFO: rcu_sched self-detected stall on CPU rcu: 0-....: (15000 ticks this GP) idle=2c6/1/0x4000000000000002 softirq=1172554/1172554 fqs=6849 (t=15001 jiffies g=1935177 q=25734) NMI backtrace for cpu 0 CPU: 0 PID: 21285 Comm: mount_to_symlin Tainted: G D OEL 5.5.0-rc1+openat2~v18+ #123 Hardware name: LENOVO 20KHCTO1WW/20KHCTO1WW, BIOS N23ET55W (1.30 ) 08/31/2018 Call Trace: <IRQ> dump_stack+0x8f/0xd0 ? lapic_can_unplug_cpu.cold+0x3e/0x3e nmi_cpu_backtrace.cold+0x14/0x52 nmi_trigger_cpumask_backtrace+0xf6/0xf8 rcu_dump_cpu_stacks+0x8f/0xbd rcu_sched_clock_irq.cold+0x1b2/0x39f update_process_times+0x24/0x50 tick_sched_handle+0x22/0x60 tick_sched_timer+0x38/0x80 ? tick_sched_do_timer+0x60/0x60 __hrtimer_run_queues+0xf6/0x270 hrtimer_interrupt+0x10e/0x240 smp_apic_timer_interrupt+0x6c/0x130 apic_timer_interrupt+0xf/0x20 </IRQ> RIP: 0010:native_queued_spin_lock_slowpath+0x5b/0x1d0 Code: 6d f0 0f ba 2f 08 0f 92 c0 0f b6 c0 c1 e0 08 89 c2 8b 07 30 e4 09 d0 a9 00 01 ff ff 75 47 85 c0 74 0e 8b 07 84 c0 74 08 f3 90 <8b> 07 84 c0 75 f8 b8 01 00 00 00 66 89 07 c3 8b 37 81 fe 00 01 00 RSP: 0018:ffffbc7d90547be8 EFLAGS: 00000202 ORIG_RAX: ffffffffffffff13 RAX: 0000000000000101 RBX: ffffffffbac7ac60 RCX: 0000000000000018 RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffa0c0eece8898 RBP: ffffa0c0eece8898 R08: 00000000006f6f66 R09: 0000000000000003 R10: 0000000000000000 R11: 808080807fffffff R12: 00000000e25b3c73 R13: ffffa0c28cb633c0 R14: ffffbc7d90547db0 R15: ffffa0c0eece8898 _raw_spin_lock+0x1a/0x20 lockref_get_not_dead+0x4f/0x90 d_alloc_parallel+0x1a8/0x480 ? get_acl+0x1a/0x160 __lookup_slow+0x6b/0x160 lookup_slow+0x36/0x50 path_mountpoint+0x1be/0x350 filename_mountpoint+0xa5/0x150 ? __lookup_hash+0xa0/0xa0 ksys_umount+0x78/0x490 __x64_sys_umount+0x12/0x20 do_syscall_64+0x64/0x240 entry_SYSCALL_64_after_hwframe+0x49/0xbe RIP: 0033:0x7fbb1fc674e7 Code: 09 0c 00 f7 d8 64 89 01 48 83 c8 ff c3 66 0f 1f 44 00 00 31 f6 e9 09 00 00 00 66 0f 1f 84 00 00 00 00 00 b8 a6 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 69 09 0c 00 f7 d8 64 89 01 48 RSP: 002b:00007ffd75fcb858 EFLAGS: 00000202 ORIG_RAX: 00000000000000a6 RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fbb1fc674e7 RDX: 0000000000000000 RSI: 0000000000000002 RDI: 0000000000f6c330 RBP: 00007ffd75fcb930 R08: 0000000000000000 R09: 000000000000000f R10: 00000000004004a6 R11: 0000000000000202 R12: 00000000004010b0 R13: 00007ffd75fcba10 R14: 0000000000000000 R15: 0000000000000000
Deadlock on lock_mount after a successful umount(). The watchdog does trigger, but I could only find this stall when trying to suspend the system in my logs: Freezing of tasks failed after 20.010 seconds (2 tasks refusing to freeze, wq_busy=0): mount_to_symlin D 0 5850 5849 0x00000004 Call Trace: ? __schedule+0x2dd/0x770 schedule+0x4a/0xb0 rwsem_down_write_slowpath+0x256/0x500 lock_mount+0x22/0xf0 do_mount+0x4b7/0x9f0 ksys_mount+0x7e/0xc0 __x64_sys_mount+0x21/0x30 do_syscall_64+0x64/0x240 entry_SYSCALL_64_after_hwframe+0x49/0xbe RIP: 0033:0x7f86e6355fda Code: Bad RIP value. RSP: 002b:00007ffc36f952d8 EFLAGS: 00000206 ORIG_RAX: 00000000000000a5 RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f86e6355fda RDX: 0000000000402099 RSI: 00000000019a5310 RDI: 00007ffc36f96ee1 RBP: 00007ffc36f953b0 R08: 0000000000402099 R09: 000000000000000f R10: 0000000000001000 R11: 0000000000000206 R12: 00000000004010c0 R13: 00007ffc36f95490 R14: 0000000000000000 R15: 0000000000000000
Cc: stable@vger.kernel.org # pre-git Cc: Al Viro viro@zeniv.linux.org.uk Cc: David Howells dhowells@redhat.com Cc: Eric Biederman ebiederm@xmission.com Cc: Linus Torvalds torvalds@linux-foundation.org Signed-off-by: Aleksa Sarai cyphar@cyphar.com
Aleksa Sarai (1): mount: universally disallow mounting over symlinks
fs/namespace.c | 17 ++++++++++++----- 1 file changed, 12 insertions(+), 5 deletions(-)
base-commit: fd6988496e79a6a4bdb514a4655d2920209eb85d
An undocumented feature of the mount interface was that it was possible to mount over a symlink (even with the old mount API) by mounting over /proc/self/fd/$n -- where the corresponding file descrpitor was opened with (O_PATH|O_NOFOLLOW). This didn't work with traditional "new" mounts (for a variety of reasons), but MS_BIND worked without issue. With the new mount API it was even easier.
From userspace's perspective, this capability is only really useful as
an attack vector. Until the introduction of openat2(RESOLVE_NO_XDEV), there was no trivial way to detect if a bind-mount was present. In the container runtime context (in a similar vein to CVE-2019-19921), this could result in a privileged process being unable to detect that a configuration resulted in magic-link usage operating on the wrong magic-links. Additionally, the API to use this feature was incredibly strange -- in order to umount, you would have go through /proc/self/fd/$n again (umounting the path would result in the *underlying* symlink being followed).
Which brings us to the issues on the kernel side. When umounting a mount on top of a symlink, several oopses (both NULL and garbage kernel address dereferences) and deadlocks could be triggered incredibly trivially. Note that because this works in user namespaces, an unprivileged user could trigger these oopses incredibly trivially. While these bugs could be fixed separately, it seems much cleaner to disable a "feature" which clearly was not intentional (and is not used -- otherwise we would've seen bug reports about it breaking on umount).
Note that because the linux-utils mount(1) helper will expand paths containing symlinks in user-space, only users which used the mount(2) syscall directly could possibly have seen this behaviour.
Cc: stable@vger.kernel.org # pre-git Cc: Al Viro viro@zeniv.linux.org.uk Cc: David Howells dhowells@redhat.com Cc: Eric Biederman ebiederm@xmission.com Cc: Linus Torvalds torvalds@linux-foundation.org Signed-off-by: Aleksa Sarai cyphar@cyphar.com --- fs/namespace.c | 17 ++++++++++++----- 1 file changed, 12 insertions(+), 5 deletions(-)
diff --git a/fs/namespace.c b/fs/namespace.c index be601d3a8008..01a62bce105f 100644 --- a/fs/namespace.c +++ b/fs/namespace.c @@ -2172,8 +2172,12 @@ static int graft_tree(struct mount *mnt, struct mount *p, struct mountpoint *mp) if (mnt->mnt.mnt_sb->s_flags & SB_NOUSER) return -EINVAL;
+ if (d_is_symlink(mp->m_dentry) || + d_is_symlink(mnt->mnt.mnt_root)) + return -EINVAL; + if (d_is_dir(mp->m_dentry) != - d_is_dir(mnt->mnt.mnt_root)) + d_is_dir(mnt->mnt.mnt_root)) return -ENOTDIR;
return attach_recursive_mnt(mnt, p, mp, false); @@ -2251,6 +2255,9 @@ static struct mount *__do_loopback(struct path *old_path, int recurse) if (IS_MNT_UNBINDABLE(old)) return mnt;
+ if (d_is_symlink(old_path->dentry)) + return mnt; + if (!check_mnt(old) && old_path->dentry->d_op != &ns_dentry_operations) return mnt;
@@ -2635,6 +2642,10 @@ static int do_move_mount(struct path *old_path, struct path *new_path) if (old_path->dentry != old_path->mnt->mnt_root) goto out;
+ if (d_is_symlink(new_path->dentry) || + d_is_symlink(old_path->dentry)) + goto out; + if (d_is_dir(new_path->dentry) != d_is_dir(old_path->dentry)) goto out; @@ -2726,10 +2737,6 @@ static int do_add_mount(struct mount *newmnt, struct path *path, int mnt_flags) path->mnt->mnt_root == path->dentry) goto unlock;
- err = -EINVAL; - if (d_is_symlink(newmnt->mnt.mnt_root)) - goto unlock; - newmnt->mnt.mnt_flags = mnt_flags; err = graft_tree(newmnt, parent, mp);
On Sun, Dec 29, 2019 at 9:21 PM Aleksa Sarai cyphar@cyphar.com wrote:
if (d_is_symlink(mp->m_dentry) ||
d_is_symlink(mnt->mnt.mnt_root))
return -EINVAL;
So I don't hate this kind of check in general - overmounting a symlink sounds odd, but at the same time I get the feeling that the real issue is that something went wrong earlier.
Yeah, the mount target kind of _is_ a path, but at the same time, we most definitely want to have the permission to really open the directory in question, don't we, and I don't see that we should accept a O_PATH file descriptor.
I feel like the only valid use of "O_PATH" files is to then use them as the base for an openat() and friends (ie fchmodat/execveat() etc).
But maybe I'm completely wrong, and people really do want O_PATH handling exactly for mounting too. It does sound a bit odd. By definition, mounting wants permissions to the mount-point, so what's the point of using O_PATH?
So instead of saying "don't overmount symlinks", I would feel like it's the mount system call that should use a proper file descriptor that isn't FMODE_PATH.
Is it really the symlink that is the issue? Because if it's the symlink that is the issue then I feel like O_NOFOLLOW should have triggered it, but your other email seems to say that you really need O_PATH | O_SYMLINK.
So I'm not sayng that this patch is wrong, but it really smells a bit like it's papering over the more fundamental issue.
For example, is the problem that when you do a proper
fd = open("somepath", O_PATH);
in one process, and then another thread does
fd = open("/proc/<pid>/fd/<opathfd>", O_RDWR);
then we get confused and do bad things on that *second* open? Because now the second open doesn't have O_PATH, and doesn't ghet marked FMODE_PATH, but the underlying file descriptor is one of those limited "is really only useful for openat() and friends".
I dunno. I haven't thought through the whole thing. But the oopses you quote seem like we're really doing something wrong, and it really does feel like your patch in no way _fixes_ the wrong thing we're doing, it's just hiding the symptoms.
Linus
On 2019-12-29, Linus Torvalds torvalds@linux-foundation.org wrote:
On Sun, Dec 29, 2019 at 9:21 PM Aleksa Sarai cyphar@cyphar.com wrote:
if (d_is_symlink(mp->m_dentry) ||
d_is_symlink(mnt->mnt.mnt_root))
return -EINVAL;
So I don't hate this kind of check in general - overmounting a symlink sounds odd, but at the same time I get the feeling that the real issue is that something went wrong earlier.
Yeah, the mount target kind of _is_ a path, but at the same time, we most definitely want to have the permission to really open the directory in question, don't we, and I don't see that we should accept a O_PATH file descriptor.
The new mount API uses O_PATH under the hood (which is a good thing since some files you'd like to avoid actually opening -- FIFOs are the obvious example) so I'm not sure that's something we could really avoid.
But if we block O_PATH for mounts this will achieve the same thing, because the only way to get a file descriptor that references a symlink is through (O_PATH | O_NOFOLLOW).
I feel like the only valid use of "O_PATH" files is to then use them as the base for an openat() and friends (ie fchmodat/execveat() etc).
See below, we use this for all sorts of dirty^Wclever tricks.
But maybe I'm completely wrong, and people really do want O_PATH handling exactly for mounting too. It does sound a bit odd. By definition, mounting wants permissions to the mount-point, so what's the point of using O_PATH?
When you go through O_PATH, you still get a proper 'struct path' which means that for operations such as mount (or open) you will operate on the *real* underlying file.
This is part of what makes magic-links so useful (but also quite terrifying).
For example, is the problem that when you do a proper
fd = open("somepath", O_PATH);
in one process, and then another thread does
fd = open("/proc/<pid>/fd/<opathfd>", O_RDWR);
then we get confused and do bad things on that *second* open? Because now the second open doesn't have O_PATH, and doesn't ghet marked FMODE_PATH, but the underlying file descriptor is one of those limited "is really only useful for openat() and friends".
Actually, this isn't true (for the same reason as above) -- when you do a re-open through /proc/$pid/fd/$n you get a real-as-a-heart-attack file descriptor. We make lots of use of this in container runtimes in order to do some dirty^Wfun tricks that help us harden the runtime against malicious container processes.
You might recall that when I was posting the earlier revisions of openat2(), I also included a patch for O_EMPTYPATH (which basically did a re-open of /proc/self/fd/$dfd but without needing /proc). That had precisely the same semantics so that you could do the same operation without procfs. That patch was dropped before Al merged openat2(), but I am probably going to revive it for the reasons I outlined below.
I dunno. I haven't thought through the whole thing. But the oopses you quote seem like we're really doing something wrong, and it really does feel like your patch in no way _fixes_ the wrong thing we're doing, it's just hiding the symptoms.
That's fair enough.
I'll be honest, the real reason why I don't want mounts over symlinks to be possible is for an entirely different reason. I'm working on a safe path resolution library to accompany openat2()[1] -- and one of the things I want to do is to harden all of our uses of procfs (such that if we are running in a context where procfs has been messed with -- such as having files bind-mounted -- we can detect it and abort). The issue with symlinks is that we need to be able to operate on magic-links (such as /proc/self/fd/$n and /proc/self/exe) -- and if it's possible bind-mount over those magic-links then we can't detect it at all.
openat2(RESOLVE_NO_XDEV) would block it, but it also blocks going through magic-links which change your mount (which would almost always be true). You can't trust /proc/self/mountinfo by definition -- not just because of the TOCTOU race but also because you can't depend on /proc to harden against a "bad" /proc. All other options such as umount2(MNT_EXPIRE) won't help with magic-links because we cannot take an O_PATH to a magic-link and follow it -- O_PATHs of symlinks are completely stunted in this respect.
If allowing bind-mounts over symlinks is allowed (which I don't have a problem with really), it just means we'll need a few more kernel pieces to get this hardening to work. But these features would be useful outside of the problems I'm dealing with (O_EMPTYPATH and some kind of pidfd-based interface to grab the equivalent of /proc/self/exe and a few other such magic-link targets).
[1]: https://github.com/openSUSE/libpathrs
On Mon, Dec 30, 2019 at 12:29 AM Aleksa Sarai cyphar@cyphar.com wrote:
On 2019-12-29, Linus Torvalds torvalds@linux-foundation.org wrote:
On Sun, Dec 29, 2019 at 9:21 PM Aleksa Sarai cyphar@cyphar.com wrote:
If allowing bind-mounts over symlinks is allowed (which I don't have a problem with really), it just means we'll need a few more kernel pieces to get this hardening to work. But these features would be useful outside of the problems I'm dealing with (O_EMPTYPATH and some kind of pidfd-based interface to grab the equivalent of /proc/self/exe and a few other such magic-link targets).
As one data point, I would use this ability in virtme: this would allow me to more reliably mount over /etc/resolve.conf even when it's a symlink.
(Perhaps I should use overlayfs instead. Hmm.)
On Mon, Dec 30, 2019 at 04:20:35PM +1100, Aleksa Sarai wrote:
A reasonably detailed explanation of the issues is provided in the patch itself, but the full traces produced by both the oopses and deadlocks is included below (it makes little sense to include them in the commit since we are disabling this feature, not directly fixing the bugs themselves).
I've posted this as an RFC on whether this feature should be allowed at all (and if anyone knows of legitimate uses for it), or if we should work on fixing these other kernel bugs that it exposes.
Umm... Are all of those traces a) reproducible on mainline and b) reproducible as the first oopsen?
As it is, quite a few might be secondary results of earlier memory corruption...
On 2019-12-30, Al Viro viro@zeniv.linux.org.uk wrote:
On Mon, Dec 30, 2019 at 04:20:35PM +1100, Aleksa Sarai wrote:
A reasonably detailed explanation of the issues is provided in the patch itself, but the full traces produced by both the oopses and deadlocks is included below (it makes little sense to include them in the commit since we are disabling this feature, not directly fixing the bugs themselves).
I've posted this as an RFC on whether this feature should be allowed at all (and if anyone knows of legitimate uses for it), or if we should work on fixing these other kernel bugs that it exposes.
Umm... Are all of those traces a) reproducible on mainline and
This was on viro/for-next, I'll retry it on v5.5-rc4.
b) reproducible as the first oopsen?
The NULL and garbage pointer derefs are reproducible as the first oops. Looking at my logs, it looks like the deadlocks were always triggered after the oops, but that might just have been a mistake on my part while testing things.
As it is, quite a few might be secondary results of earlier memory corruption...
Yeah, I thought that might be the case but decided to include them anyway (the /proc/self/stack RCU stall is definitely the result of other corruption and stalls).
On 2019-12-30, Aleksa Sarai cyphar@cyphar.com wrote:
On 2019-12-30, Al Viro viro@zeniv.linux.org.uk wrote:
On Mon, Dec 30, 2019 at 04:20:35PM +1100, Aleksa Sarai wrote:
A reasonably detailed explanation of the issues is provided in the patch itself, but the full traces produced by both the oopses and deadlocks is included below (it makes little sense to include them in the commit since we are disabling this feature, not directly fixing the bugs themselves).
I've posted this as an RFC on whether this feature should be allowed at all (and if anyone knows of legitimate uses for it), or if we should work on fixing these other kernel bugs that it exposes.
Umm... Are all of those traces a) reproducible on mainline and
This was on viro/for-next, I'll retry it on v5.5-rc4.
The NULL deref oops is reproducible on v5.5-rc4. Strangely it seems harder to reproduce than on viro/for-next (I kept reproducing it there by accident), but I'll double-check if that really is the case.
The simplest reproducer is (using the attached programs and .config):
ln -s . link sudo ./umount_symlink link
There's also a few other whacky behaviours where you get -ELOOP or -EACCES in cases where you shouldn't -- which results in MNT_DETACH failing and the mount being impossible to get rid of. A good example is
sudo ./mount_to_symlink /proc/self/exe link sudo ./umount_symlink link # -EACCES
Or
ln -s . link1 ln -s . link2 sudo ./mount_to_symlink link1 link2 sudo ./umount_symlink link1 # -ELOOP sudo ./umount_symlink link2 # -ELOOP
But I am trying to find a reproducer for the "umount of a mount triggering an Oops" issue.
On another note -- I guess this is considered a feature which should "just work" and not a bug?
BUG: kernel NULL pointer dereference, address: 0000000000000000 #PF: supervisor instruction fetch in kernel mode #PF: error_code(0x0010) - not-present page PGD 80000003c6fca067 P4D 80000003c6fca067 PUD 3c6f42067 PMD 0 Oops: 0010 [#1] SMP PTI CPU: 4 PID: 4486 Comm: umount_symlink Tainted: G E 5.5.0-rc4-cyphar #126 Hardware name: LENOVO 20KHCTO1WW/20KHCTO1WW, BIOS N23ET55W (1.30 ) 08/31/2018 RIP: 0010:0x0 Code: Bad RIP value. RSP: 0018:ffffb70b82963cc0 EFLAGS: 00010206 RAX: 0000000000000000 RBX: ffff906d0cc3bb40 RCX: 0000000000000abc RDX: 0000000000000089 RSI: ffff906d74623cc0 RDI: ffff906d74475df0 RBP: ffff906d74475df0 R08: ffffd70b7fb24c20 R09: ffff906d066a5000 R10: 0000000000000000 R11: 8080807fffffffff R12: ffff906d74623cc0 R13: 0000000000000089 R14: ffffb70b82963dc0 R15: 0000000000000080 FS: 00007fbc2a8f0540(0000) GS:ffff906dcf500000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: ffffffffffffffd6 CR3: 00000003c68f8001 CR4: 00000000003606e0 Call Trace: __lookup_slow+0x94/0x160 lookup_slow+0x36/0x50 path_mountpoint+0x1be/0x360 filename_mountpoint+0xa5/0x150 ? __lookup_hash+0xa0/0xa0 ksys_umount+0x78/0x490 __x64_sys_umount+0x12/0x20 do_syscall_64+0x64/0x240 entry_SYSCALL_64_after_hwframe+0x49/0xbe RIP: 0033:0x7fbc2a8274e7 Code: 09 0c 00 f7 d8 64 89 01 48 83 c8 ff c3 66 0f 1f 44 00 00 31 f6 e9 09 00 00 00 66 0f 1f 84 00 00 00 00 00 b8 a6 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 69 09 0c 00 f7 d8 64 89 01 48 RSP: 002b:00007ffd1da9b3f8 EFLAGS: 00000202 ORIG_RAX: 00000000000000a6 RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fbc2a8274e7 RDX: 0000000000000000 RSI: 0000000000000002 RDI: 0000000001300310 RBP: 00007ffd1da9b4c0 R08: 0000000000000000 R09: 000000000000000f R10: 00007fbc2a92f800 R11: 0000000000000202 R12: 0000000000401090 R13: 00007ffd1da9b5a0 R14: 0000000000000000 R15: 0000000000000000 Modules linked in: [snip] CR2: 0000000000000000 ---[ end trace ae473813e34e641d ]--- RIP: 0010:0x0 Code: Bad RIP value. RSP: 0018:ffffb70b82963cc0 EFLAGS: 00010206 RAX: 0000000000000000 RBX: ffff906d0cc3bb40 RCX: 0000000000000abc RDX: 0000000000000089 RSI: ffff906d74623cc0 RDI: ffff906d74475df0 RBP: ffff906d74475df0 R08: ffffd70b7fb24c20 R09: ffff906d066a5000 R10: 0000000000000000 R11: 8080807fffffffff R12: ffff906d74623cc0 R13: 0000000000000089 R14: ffffb70b82963dc0 R15: 0000000000000080 FS: 00007fbc2a8f0540(0000) GS:ffff906dcf500000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: ffffffffffffffd6 CR3: 00000003c68f8001 CR4: 00000000003606e0
On Sun, Dec 29, 2019 at 11:30 PM Aleksa Sarai cyphar@cyphar.com wrote:
BUG: kernel NULL pointer dereference, address: 0000000000000000
Would you mind building with debug info, and then running the oops through
scripts/decode_stacktrace.sh
which makes those addresses much more legible.
#PF: supervisor instruction fetch in kernel mode #PF: error_code(0x0010) - not-present page
Somebody jumped through a NULL pointer.
RAX: 0000000000000000 RBX: ffff906d0cc3bb40 RCX: 0000000000000abc RDX: 0000000000000089 RSI: ffff906d74623cc0 RDI: ffff906d74475df0 RBP: ffff906d74475df0 R08: ffffd70b7fb24c20 R09: ffff906d066a5000 R10: 0000000000000000 R11: 8080807fffffffff R12: ffff906d74623cc0 R13: 0000000000000089 R14: ffffb70b82963dc0 R15: 0000000000000080 FS: 00007fbc2a8f0540(0000) GS:ffff906dcf500000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: ffffffffffffffd6 CR3: 00000003c68f8001 CR4: 00000000003606e0 Call Trace: __lookup_slow+0x94/0x160
And "__lookup_slow()" has two indirect calls (they aren't obvious with retpoline, but look for something like
call __x86_indirect_thunk_rax
which is the modern sad way of doing "call *%rax"). One is for revalidatinging an old dentry, but the one I _suspect_ you trigger is this one:
old = inode->i_op->lookup(inode, dentry, flags);
but I thought we only could get here if we know it's a directory.
How did we miss the "d_can_lookup()", which is what should check that yes, we can call that ->lookup() routine.
This is why I have that suspicion that it's somehow that O_PATH fd opened in another process without O_PATH causes confusion...
So what I think has happened is that because of the O_PATH thing, we've ended up with an inode that has never been truly opened (because O_PATH skips that part), but then with the /proc/<pid>/fd/xyz open, we now have a file descriptor that _looks_ like it is valid, and we're treating that inode as if it can be used.
But I'm handwaving.
Linus
On 2019-12-29, Linus Torvalds torvalds@linux-foundation.org wrote:
On Sun, Dec 29, 2019 at 11:30 PM Aleksa Sarai cyphar@cyphar.com wrote:
BUG: kernel NULL pointer dereference, address: 0000000000000000
Would you mind building with debug info, and then running the oops through
scripts/decode_stacktrace.sh
which makes those addresses much more legible.
Will do.
#PF: supervisor instruction fetch in kernel mode #PF: error_code(0x0010) - not-present page
Somebody jumped through a NULL pointer.
RAX: 0000000000000000 RBX: ffff906d0cc3bb40 RCX: 0000000000000abc RDX: 0000000000000089 RSI: ffff906d74623cc0 RDI: ffff906d74475df0 RBP: ffff906d74475df0 R08: ffffd70b7fb24c20 R09: ffff906d066a5000 R10: 0000000000000000 R11: 8080807fffffffff R12: ffff906d74623cc0 R13: 0000000000000089 R14: ffffb70b82963dc0 R15: 0000000000000080 FS: 00007fbc2a8f0540(0000) GS:ffff906dcf500000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: ffffffffffffffd6 CR3: 00000003c68f8001 CR4: 00000000003606e0 Call Trace: __lookup_slow+0x94/0x160
And "__lookup_slow()" has two indirect calls (they aren't obvious with retpoline, but look for something like
call __x86_indirect_thunk_rax
which is the modern sad way of doing "call *%rax"). One is for revalidatinging an old dentry, but the one I _suspect_ you trigger is this one:
old = inode->i_op->lookup(inode, dentry, flags);
but I thought we only could get here if we know it's a directory.
How did we miss the "d_can_lookup()", which is what should check that yes, we can call that ->lookup() routine.
I'll try applying a trivial patch to add d_can_lookup() to see if it fixes the immediate issue.
This is why I have that suspicion that it's somehow that O_PATH fd opened in another process without O_PATH causes confusion...
So what I think has happened is that because of the O_PATH thing, we've ended up with an inode that has never been truly opened (because O_PATH skips that part), but then with the /proc/<pid>/fd/xyz open, we now have a file descriptor that _looks_ like it is valid, and we're treating that inode as if it can be used.
I'm not sure I agree -- as I mentioned in my other mail, re-opening through /proc/self/fd/$n works *very* well and has for a long time (in fact, both LXC and runc depend on this working).
From: Aleksa Sarai
Sent: 30 December 2019 08:32
...
I'm not sure I agree -- as I mentioned in my other mail, re-opening through /proc/self/fd/$n works *very* well and has for a long time (in fact, both LXC and runc depend on this working).
I thought it was marginally broken because it is followed as a symlink? On, for example, NetBSD /proc/<n>/fd/<n> is a real reference to the filesystem inode and can be used to link the file back into the filesystem if all the directory entries have been removed.
David
- Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK Registration No: 1397386 (Wales)
On 2020-01-02, David Laight David.Laight@ACULAB.COM wrote:
From: Aleksa Sarai
Sent: 30 December 2019 08:32
...
I'm not sure I agree -- as I mentioned in my other mail, re-opening through /proc/self/fd/$n works *very* well and has for a long time (in fact, both LXC and runc depend on this working).
I thought it was marginally broken because it is followed as a symlink? On, for example, NetBSD /proc/<n>/fd/<n> is a real reference to the filesystem inode and can be used to link the file back into the filesystem if all the directory entries have been removed.
That is also the case on Linux. It (strictly speaking) isn't a symlink in the normal sense of the word, it's a magic-link (nd_jump_link switches the nd->path to the actual 'struct file' in the case of /proc/self/fd/$n).
On Mon, Dec 30, 2019 at 06:29:59PM +1100, Aleksa Sarai wrote:
On 2019-12-30, Aleksa Sarai cyphar@cyphar.com wrote:
On 2019-12-30, Al Viro viro@zeniv.linux.org.uk wrote:
On Mon, Dec 30, 2019 at 04:20:35PM +1100, Aleksa Sarai wrote:
A reasonably detailed explanation of the issues is provided in the patch itself, but the full traces produced by both the oopses and deadlocks is included below (it makes little sense to include them in the commit since we are disabling this feature, not directly fixing the bugs themselves).
I've posted this as an RFC on whether this feature should be allowed at all (and if anyone knows of legitimate uses for it), or if we should work on fixing these other kernel bugs that it exposes.
Umm... Are all of those traces a) reproducible on mainline and
This was on viro/for-next, I'll retry it on v5.5-rc4.
The NULL deref oops is reproducible on v5.5-rc4. Strangely it seems harder to reproduce than on viro/for-next (I kept reproducing it there by accident), but I'll double-check if that really is the case.
The simplest reproducer is (using the attached programs and .config):
ln -s . link sudo ./umount_symlink link
FWIW, the problem with that reproducer is that we *CAN'T* resolve that path. Look: you have /proc/self/fd/3 resolve to ./link. OK, you've asked to follow that. Got ./link, which is a symlink, so we need to follow it further. Relative to what, though?
The meaning of symlink is dependent upon the directory you find it in. And we don't have any here.
The bug is in mountpoint_last() - we have if (unlikely(nd->last_type != LAST_NORM)) { error = handle_dots(nd, nd->last_type); if (error) return error; path.dentry = dget(nd->path.dentry); } else { path.dentry = d_lookup(dir, &nd->last); if (!path.dentry) { /* * No cached dentry. Mounted dentries are pinned in the * cache, so that means that this dentry is probably * a symlink or the path doesn't actually point * to a mounted dentry. */ path.dentry = lookup_slow(&nd->last, dir, nd->flags | LOOKUP_NO_REVAL); if (IS_ERR(path.dentry)) return PTR_ERR(path.dentry); } } if (d_flags_negative(smp_load_acquire(&path.dentry->d_flags))) { dput(path.dentry); return -ENOENT; } path.mnt = nd->path.mnt; return step_into(nd, &path, 0, d_backing_inode(path.dentry), 0); in there, and that ends up with step_into() called in case of LAST_DOT/LAST_DOTDOT (where it's harmless) *AND* in case of LAST_BIND. Where it very much isn't.
I'm not sure if you have caught anything else, but we really, really should *NOT* consider the LAST_BIND as "maybe we should follow the result" material. So at least the following is needed; could you check if anything else remains with that applied?
diff --git a/fs/namei.c b/fs/namei.c index d6c91d1e88cb..d4fbbda8a7ff 100644 --- a/fs/namei.c +++ b/fs/namei.c @@ -2656,10 +2656,7 @@ mountpoint_last(struct nameidata *nd) nd->flags &= ~LOOKUP_PARENT;
if (unlikely(nd->last_type != LAST_NORM)) { - error = handle_dots(nd, nd->last_type); - if (error) - return error; - path.dentry = dget(nd->path.dentry); + return handle_dots(nd, nd->last_type); } else { path.dentry = d_lookup(dir, &nd->last); if (!path.dentry) {
On Wed, Jan 01, 2020 at 12:43:24AM +0000, Al Viro wrote:
I'm not sure if you have caught anything else, but we really, really should *NOT* consider the LAST_BIND as "maybe we should follow the result" material. So at least the following is needed; could you check if anything else remains with that applied?
diff --git a/fs/namei.c b/fs/namei.c index d6c91d1e88cb..d4fbbda8a7ff 100644 --- a/fs/namei.c +++ b/fs/namei.c @@ -2656,10 +2656,7 @@ mountpoint_last(struct nameidata *nd) nd->flags &= ~LOOKUP_PARENT; if (unlikely(nd->last_type != LAST_NORM)) {
error = handle_dots(nd, nd->last_type);
if (error)
return error;
path.dentry = dget(nd->path.dentry);
} else { path.dentry = d_lookup(dir, &nd->last); if (!path.dentry) {return handle_dots(nd, nd->last_type);
Note, BTW, that lookup_last() (aka walk_component()) does just that - we only hit step_into() on LAST_NORM. The same goes for do_last(). mountpoint_last() not doing the same is _not_ intentional - it's definitely a bug.
Consider your testcase; link points to . here. So the only thing you could expect from trying to follow it would be the directory 'link' lives in. And you don't have it when you reach the fscker via /proc/self/fd/3; what happens instead is nd->path set to ./link (by nd_jump_link()) *AND* step_into() called, pushing the same ./link onto stack. It violates all kinds of assumptions made by fs/namei.c - when pushing a symlink onto stack nd->path is expected to contain the base directory for resolving it.
I'm fairly sure that this is the cause of at least some of the insanity you've caught; there always could be something else, of course, but this hole needs to be closed in any case.
On Wed, Jan 01, 2020 at 12:54:46AM +0000, Al Viro wrote:
Note, BTW, that lookup_last() (aka walk_component()) does just that - we only hit step_into() on LAST_NORM. The same goes for do_last(). mountpoint_last() not doing the same is _not_ intentional - it's definitely a bug.
Consider your testcase; link points to . here. So the only thing you could expect from trying to follow it would be the directory 'link' lives in. And you don't have it when you reach the fscker via /proc/self/fd/3; what happens instead is nd->path set to ./link (by nd_jump_link()) *AND* step_into() called, pushing the same ./link onto stack. It violates all kinds of assumptions made by fs/namei.c - when pushing a symlink onto stack nd->path is expected to contain the base directory for resolving it.
I'm fairly sure that this is the cause of at least some of the insanity you've caught; there always could be something else, of course, but this hole needs to be closed in any case.
... and with removal of now unused local variable, that's
mountpoint_last(): fix the treatment of LAST_BIND
step_into() should be attempted only in LAST_NORM case, when we have the parent directory (in nd->path). We get away with that for LAST_DOT and LOST_DOTDOT, since those can't be symlinks, making step_init() and equivalent of path_to_nameidata() - we do a bit of useless work, but that's it. For LAST_BIND (i.e. the case when we'd just followed a procfs-style symlink) we really can't go there - result might be a symlink and we really can't attempt following it.
lookup_last() and do_last() do handle that properly; mountpoint_last() should do the same.
Cc: stable@vger.kernel.org Signed-off-by: Al Viro viro@zeniv.linux.org.uk --- diff --git a/fs/namei.c b/fs/namei.c index d6c91d1e88cb..13f9f973722b 100644 --- a/fs/namei.c +++ b/fs/namei.c @@ -2643,7 +2643,6 @@ EXPORT_SYMBOL(user_path_at_empty); static int mountpoint_last(struct nameidata *nd) { - int error = 0; struct dentry *dir = nd->path.dentry; struct path path;
@@ -2656,10 +2655,7 @@ mountpoint_last(struct nameidata *nd) nd->flags &= ~LOOKUP_PARENT;
if (unlikely(nd->last_type != LAST_NORM)) { - error = handle_dots(nd, nd->last_type); - if (error) - return error; - path.dentry = dget(nd->path.dentry); + return handle_dots(nd, nd->last_type); } else { path.dentry = d_lookup(dir, &nd->last); if (!path.dentry) {
On 2020-01-01, Al Viro viro@zeniv.linux.org.uk wrote:
On Wed, Jan 01, 2020 at 12:54:46AM +0000, Al Viro wrote:
Note, BTW, that lookup_last() (aka walk_component()) does just that - we only hit step_into() on LAST_NORM. The same goes for do_last(). mountpoint_last() not doing the same is _not_ intentional - it's definitely a bug.
Consider your testcase; link points to . here. So the only thing you could expect from trying to follow it would be the directory 'link' lives in. And you don't have it when you reach the fscker via /proc/self/fd/3; what happens instead is nd->path set to ./link (by nd_jump_link()) *AND* step_into() called, pushing the same ./link onto stack. It violates all kinds of assumptions made by fs/namei.c - when pushing a symlink onto stack nd->path is expected to contain the base directory for resolving it.
I'm fairly sure that this is the cause of at least some of the insanity you've caught; there always could be something else, of course, but this hole needs to be closed in any case.
... and with removal of now unused local variable, that's
mountpoint_last(): fix the treatment of LAST_BIND
step_into() should be attempted only in LAST_NORM case, when we have the parent directory (in nd->path). We get away with that for LAST_DOT and LOST_DOTDOT, since those can't be symlinks, making step_init() and equivalent of path_to_nameidata() - we do a bit of useless work, but that's it. For LAST_BIND (i.e. the case when we'd just followed a procfs-style symlink) we really can't go there - result might be a symlink and we really can't attempt following it.
lookup_last() and do_last() do handle that properly; mountpoint_last() should do the same.
Cc: stable@vger.kernel.org Signed-off-by: Al Viro viro@zeniv.linux.org.uk
Thanks, this fixes the issue for me (and also fixes another reproducer I found -- mounting a symlink on top of itself then trying to umount it).
Reported-by: Aleksa Sarai cyphar@cyphar.com Tested-by: Aleksa Sarai cyphar@cyphar.com
As for the original topic of bind-mounting symlinks -- given this is a supported feature, would you be okay with me sending an updated O_EMPTYPATH series?
diff --git a/fs/namei.c b/fs/namei.c index d6c91d1e88cb..13f9f973722b 100644 --- a/fs/namei.c +++ b/fs/namei.c @@ -2643,7 +2643,6 @@ EXPORT_SYMBOL(user_path_at_empty); static int mountpoint_last(struct nameidata *nd) {
- int error = 0; struct dentry *dir = nd->path.dentry; struct path path;
@@ -2656,10 +2655,7 @@ mountpoint_last(struct nameidata *nd) nd->flags &= ~LOOKUP_PARENT; if (unlikely(nd->last_type != LAST_NORM)) {
error = handle_dots(nd, nd->last_type);
if (error)
return error;
path.dentry = dget(nd->path.dentry);
} else { path.dentry = d_lookup(dir, &nd->last); if (!path.dentry) {return handle_dots(nd, nd->last_type);
On Thu, Jan 02, 2020 at 01:44:07AM +1100, Aleksa Sarai wrote:
Thanks, this fixes the issue for me (and also fixes another reproducer I found -- mounting a symlink on top of itself then trying to umount it).
Reported-by: Aleksa Sarai cyphar@cyphar.com Tested-by: Aleksa Sarai cyphar@cyphar.com
Pushed into #fixes.
As for the original topic of bind-mounting symlinks -- given this is a supported feature, would you be okay with me sending an updated O_EMPTYPATH series?
Post it on fsdevel; I'll need to reread it anyway to say anything useful...
On 2020-01-01, Al Viro viro@zeniv.linux.org.uk wrote:
On Thu, Jan 02, 2020 at 01:44:07AM +1100, Aleksa Sarai wrote:
Thanks, this fixes the issue for me (and also fixes another reproducer I found -- mounting a symlink on top of itself then trying to umount it).
Reported-by: Aleksa Sarai cyphar@cyphar.com Tested-by: Aleksa Sarai cyphar@cyphar.com
Pushed into #fixes.
Thanks. One other thing I noticed is that umount applies to the underlying symlink rather than the mountpoint on top. So, for example (using the same scripts I posted in the thread):
# ln -s /tmp/foo link # ./mount_to_symlink /etc/passwd link # umount -l link # will attempt to unmount "/tmp/foo"
Is that intentional?
On Thu, Jan 02, 2020 at 02:59:20PM +1100, Aleksa Sarai wrote:
On 2020-01-01, Al Viro viro@zeniv.linux.org.uk wrote:
On Thu, Jan 02, 2020 at 01:44:07AM +1100, Aleksa Sarai wrote:
Thanks, this fixes the issue for me (and also fixes another reproducer I found -- mounting a symlink on top of itself then trying to umount it).
Reported-by: Aleksa Sarai cyphar@cyphar.com Tested-by: Aleksa Sarai cyphar@cyphar.com
Pushed into #fixes.
Thanks. One other thing I noticed is that umount applies to the underlying symlink rather than the mountpoint on top. So, for example (using the same scripts I posted in the thread):
# ln -s /tmp/foo link # ./mount_to_symlink /etc/passwd link # umount -l link # will attempt to unmount "/tmp/foo"
Is that intentional?
It's a mess, again in mountpoint_last(). FWIW, at some point I proposed to have nd_jump_link() to fail with -ELOOP if the target was a symlink; Linus asked for reasons deeper than my dislike of the semantics, I looked around and hadn't spotted anything. And there hadn't been at the time, but when four months later umount_lookup_last() went in I failed to look for that source of potential problems in it ;-/
I've looked at that area again now. Aside of usual cursing at do_last() horrors (yes, its control flow is a horror; yes, it needs serious massage; no, it's not a good idea to get sidetracked into that right now), there are several fun questions: * d_manage() and d_automount(). We almost certainly don't want those for autofs on the final component of pathname in umount, including the trailing symlinks. But do we want those on usual access via /proc/*/fd/*? I.e. suppose somebody does open() (O_PATH or not) in autofs; do we want ->d_manage()/->d_automount() called when resolving /proc/self/fd/<whatever>/foo/bar? We do not; is that correct from autofs point of view? I suspect that refusing to do ->d_automount() is correct, but I don't understand ->d_manage() purpose well enough to tell. * I really hope that the weird "trailing / forces automount even in cases when we normally wouldn't trigger it" (stat /mnt/foo vs. stat /mnt/foo/) is not meant to extend to umount. I'd like Ian's confirmation, though. * do we want ->d_manage() on following .. into overmounted directory? Again, autofs question...
The minimal fix to mountpoint_last() would be to have follow_mount() done in LAST_NORM case. However, I'd like to understand (and hopefully regularize) the rules for follow_mount()/follow_managed(). Additional scary question is nfsd iterplay with automount. For nfs4 exports it's potentially interesting...
Ian, could you comment on the autofs questions above? I'd rather avoid doing changes in that area without your input - it's subtle and breakage in automount-related behaviour can be mysterious as hell.
It may be a bit off-topic here but, in autofs symlinks can be used in place of mounts. That mechanism can be used (mostly nowadays) with amd map format maps.
If I'm using symlinks instead of mounts (where I can) I definitely don't want these to be over mounted by a mount.
I haven't seen problems like that happening but if it did happen that would be a bug in automount or user mis-use of some sort.
On Fri, 2020-01-03 at 01:49 +0000, Al Viro wrote:
On Thu, Jan 02, 2020 at 02:59:20PM +1100, Aleksa Sarai wrote:
On 2020-01-01, Al Viro viro@zeniv.linux.org.uk wrote:
On Thu, Jan 02, 2020 at 01:44:07AM +1100, Aleksa Sarai wrote:
Thanks, this fixes the issue for me (and also fixes another reproducer I found -- mounting a symlink on top of itself then trying to umount it).
Reported-by: Aleksa Sarai cyphar@cyphar.com Tested-by: Aleksa Sarai cyphar@cyphar.com
Pushed into #fixes.
Thanks. One other thing I noticed is that umount applies to the underlying symlink rather than the mountpoint on top. So, for example (using the same scripts I posted in the thread):
# ln -s /tmp/foo link # ./mount_to_symlink /etc/passwd link # umount -l link # will attempt to unmount "/tmp/foo"
Is that intentional?
It's a mess, again in mountpoint_last(). FWIW, at some point I proposed to have nd_jump_link() to fail with -ELOOP if the target was a symlink; Linus asked for reasons deeper than my dislike of the semantics, I looked around and hadn't spotted anything. And there hadn't been at the time, but when four months later umount_lookup_last() went in I failed to look for that source of potential problems in it ;-/
I've looked at that area again now. Aside of usual cursing at do_last() horrors (yes, its control flow is a horror; yes, it needs serious massage; no, it's not a good idea to get sidetracked into that right now), there are several fun questions:
- d_manage() and d_automount(). We almost certainly don't
want those for autofs on the final component of pathname in umount, including the trailing symlinks. But do we want those on usual access via /proc/*/fd/*? I.e. suppose somebody does open() (O_PATH or not) in autofs; do we want ->d_manage()/->d_automount() called when resolving /proc/self/fd/<whatever>/foo/bar? We do not; is that correct from autofs point of view? I suspect that refusing to do ->d_automount() is correct, but I don't understand ->d_manage() purpose well enough to tell.
Yes, we don't want those on the final component of the path in umount. The following of a symlink will give use a new path of some sort so the rules would change to the usual ones for the new path.
The semantics of following a symlink, be the source a proc entry or not (I think) should always be the same. If the follow takes us to an autofs file system (be it a trigger mount or an indirect mount in an autofs file system) the behaviour should be that of the autofs file system when we arrive there, from an auto-mount POV.
The original intent of ->d_manage() was to prevent walks into an under construction mount and that might not be as simple as mounting a source on a mount point.
For example take the case of an automount indirect mount map entry like this:
test /some/path/one server:/source/path1 \ /some/path/two server2:/source/path2 \ /some/other/path server:/source/path3 \ /some/other/path/three server:/source/path4
This entry has no mount at the root of the tree (so called root-less multi-mount) but walks need to block when it's under construction as the topology isn't known until the directory tree and any associated mounts (usually trigger mounts) have been completed.
In this case it's needed to go to ref-walk mode and block until it's done.
The other (perhaps not so obvious) use of ->d_manage() is to detect expire to mount races. When an automount is expiring at the same time a process (that would cause an automount) is traversing the path. The base (I'll not say root, since the root of the expire might not be the root of the tree) needs to block the walk until the expire is done.
These multi-mounts are meant to provide a "mount as you go" mechanism so that only portions of the tree of mounts are mounted or expired at any one time.
For example, the offsets in the above entry are /some/path/one, /some/path/two, /some/other/path and /some/other/path/three.
On access to <autofs mount>/test automount is meant to mount trigger mounts for offsets /some/path/one, /some/path/two and /some/other/path and mount an offset trigger for /some/other/path/three into the mount for /some/other/path when it's accessed and that might not happen during the initial mount of the tree. The reverse being done on umount in sub-trees of mounts when a nesting point like /some/other/path is encountered.
But that's something of an aside because in all cases below the root there will be an actual mount preventing walks into the tree under nesting point mounts being constructed or expired.
Anyway, returning to the topic at hand, the answer to whether we want ->d_manage()/->d_automount() after a symlink has been followed is yes, I think, because at that point we could be within a file system that has automounts of some sort.
But perhaps I'm missing something about the description of the case above ...
- I really hope that the weird "trailing / forces automount
even in cases when we normally wouldn't trigger it" (stat /mnt/foo vs. stat /mnt/foo/) is not meant to extend to umount. I'd like Ian's confirmation, though.
I can't see any way that the trailing "/" can realte to umount.
It has always been meant to be used to trigger a mount on something that would otherwise not be mounted and that's the only case I'm aware of.
- do we want ->d_manage() on following .. into overmounted
directory? Again, autofs question...
I think that amounts to asking "can the target of the ../ be in the process of being constructed or expired at this time" and that's probably yes. A root-less multi-mount would be one case where this could happen (although it's not strictly an over-mounted directory).
The minimal fix to mountpoint_last() would be to have follow_mount() done in LAST_NORM case. However, I'd like to understand (and hopefully regularize) the rules for follow_mount()/follow_managed(). Additional scary question is nfsd iterplay with automount. For nfs4 exports it's potentially interesting...
I'm not sure about nfs (and other cross mounting file systems). The automounting in file systems other than autofs always have a real mount as the target (AFAIK) so there's an implied blocking that occurs on crossing the mount point. That's always made the nfs automounting case simpler to my thinking anyway.
The real problem with nfs automount trees is when the topology of the exports tree changes while parts of it are in use. People that have any idea of how nfs cross mounting (and mount dependencies in general) work shouldn't do that but they do it and then wonder why things go wrong ...
Ian, could you comment on the autofs questions above? I'd rather avoid doing changes in that area without your input - it's subtle and breakage in automount-related behaviour can be mysterious as hell.
Thanks for the heads up.
As always I can run tests on changes you want to do. Fortunately that's generally worked out ok for us in the past.
Ian
On Fri, Jan 03, 2020 at 01:49:01AM +0000, Al Viro wrote:
It's a mess, again in mountpoint_last(). FWIW, at some point I proposed to have nd_jump_link() to fail with -ELOOP if the target was a symlink; Linus asked for reasons deeper than my dislike of the semantics, I looked around and hadn't spotted anything. And there hadn't been at the time, but when four months later umount_lookup_last() went in I failed to look for that source of potential problems in it ;-/
I've looked at that area again now. Aside of usual cursing at do_last() horrors (yes, its control flow is a horror; yes, it needs serious massage; no, it's not a good idea to get sidetracked into that right now), there are several fun questions:
- d_manage() and d_automount(). We almost certainly don't
want those for autofs on the final component of pathname in umount, including the trailing symlinks. But do we want those on usual access via /proc/*/fd/*? I.e. suppose somebody does open() (O_PATH or not) in autofs; do we want ->d_manage()/->d_automount() called when resolving /proc/self/fd/<whatever>/foo/bar? We do not; is that correct from autofs point of view? I suspect that refusing to do ->d_automount() is correct, but I don't understand ->d_manage() purpose well enough to tell.
- I really hope that the weird "trailing / forces automount
even in cases when we normally wouldn't trigger it" (stat /mnt/foo vs. stat /mnt/foo/) is not meant to extend to umount. I'd like Ian's confirmation, though.
- do we want ->d_manage() on following .. into overmounted
directory? Again, autofs question...
FWIW, I suspect that we want to do something along the following lines:
1) make build_open_flags() treat O_CREAT | O_EXCL as if there had been O_NOFOLLOW in the mix. Reason: if there is a trailing symlink, we want to fail with EEXIST anyway. Benefit: this fragment in do_last() error = follow_managed(&path, nd); if (unlikely(error < 0)) return error;
/* * create/update audit record if it already exists. */ audit_inode(nd->name, path.dentry, 0);
if (unlikely((open_flag & (O_EXCL | O_CREAT)) == (O_EXCL | O_CREAT))) { path_to_nameidata(&path, nd); return -EEXIST; }
seq = 0; /* out of RCU mode, so the value doesn't matter */ inode = d_backing_inode(path.dentry); finish_lookup: error = step_into(nd, &path, 0, inode, seq); if (unlikely(error)) return error; can become error = follow_managed(&path, nd); if (unlikely(error < 0)) return error;
seq = 0; /* out of RCU mode, so the value doesn't matter */ inode = d_backing_inode(path.dentry); finish_lookup: error = step_into(nd, &path, 0, inode, seq); if (unlikely(error)) return error;
if (unlikely((open_flag & (O_EXCL | O_CREAT)) == (O_EXCL | O_CREAT))) { audit_inode(nd->name, nd->path.dentry, 0); return -EEXIST; } Equivalent transformation, since the the only goto finish_lookup; is under if (!(open_flag & O_CREAT)). What it buys us is more regular structure of follow_managed() callers.
2) make follow_managed() take &inode and &seq. Look: follow_managed() never returns 0 (we have if (ret == -EISDIR || !ret) ret = 1; on the way to the only return in it) and the callers are err = follow_managed(path, nd); if (likely(err > 0)) *inode = d_backing_inode(path->dentry); return err; in lookup_fast(), err = follow_managed(&path, nd); if (unlikely(err < 0)) return err;
seq = 0; /* we are already out of RCU mode */ inode = d_backing_inode(path.dentry); in walk_component(), err = follow_managed(&path, nd); if (unlikely(err < 0)) return err; inode = d_backing_inode(path.dentry); seq = 0; in handle_lookup_down() and (after the previous change) error = follow_managed(&path, nd); if (unlikely(error < 0)) return error;
seq = 0; /* out of RCU mode, so the value doesn't matter */ inode = d_backing_inode(path.dentry); in do_last(). That's begging to fold those followups into follow_managed() itself, doesn't it? And having *seqp = 0; equivalent added in lookup_fast() is not going to hurt the performance - in all callers it's an address of local variable, right next to the one whose address is passed as inodep. Which we'd just dirtied, and the cacheline is not going to have been shared anyway.
Note that after that the arguments for follow_managed() become identical to those for __follow_mount_rcu(). Which makes a lot of sense, since the latter is RCU-mode counterpart of the former.
3) have the followup to failing __follow_mount_rcu() taken into it. After (2) we have this in lookup_fast():
*seqp = seq; status = d_revalidate(dentry, nd->flags); if (likely(status > 0)) { /* * Note: do negative dentry check after revalidation in * case that drops it. */ if (unlikely(negative)) return -ENOENT; path->mnt = mnt; path->dentry = dentry; if (likely(__follow_mount_rcu(nd, path, inode, seqp))) return 1; } if (unlazy_child(nd, dentry, seq)) return -ECHILD; if (unlikely(status == -ECHILD)) /* we'd been told to redo it in non-rcu mode */ status = d_revalidate(dentry, nd->flags); } else { ... } if (unlikely(status <= 0)) { if (!status) d_invalidate(dentry); dput(dentry); return status; }
path->mnt = mnt; path->dentry = dentry; return follow_managed(path, nd, inode, seqp);
Suppose __follow_mount_rcu() returns false; what follows is if (unlazy_child(nd, dentry, seq)) return -ECHILD; seq here is equal to *seqp here, dentry - the value of path->dentry at the time of __follow_mount_rcu() call. if (unlikely(status == -ECHILD)) .... not taken - we know that status must have been positive if (unlikely(status <= 0)) { ... } ditto path->mnt = mnt; path->dentry = dentry; return follow_managed(path, nd, inode, seqp); we return *path to original and call follow_managed(). IOW, we could bloody well do all of that in the __follow_mount_rcu() itself, having it return 1 when the original would've returned true and doing that "revert *path, call unlazy_child() and fall back to follow_mount_rcu() in case of success" in cases when the original would've returned false. The caller turns into /* * Note: do negative dentry check after revalidation in * case that drops it. */ if (unlikely(negative)) return -ENOENT; path->mnt = mnt; path->dentry = dentry; return __follow_mount_rcu(nd, path, inode, seqp);
4) fold __follow_mount_rcu() into follow_managed(), using the latter both in RCU and non-RCU cases.
5) take the calls of follow_managed() out of lookup_fast() into its callers. That would be err = lookup_fast(nd, &path, &inode, &seq); if (unlikely(err <= 0)) { if (err < 0) return err; path.dentry = lookup_slow(&nd->last, nd->path.dentry, nd->flags); if (IS_ERR(path.dentry)) return PTR_ERR(path.dentry);
path.mnt = nd->path.mnt; err = follow_managed(&path, nd, &inode, &seq); if (unlikely(err < 0)) return err; } turning into err = lookup_fast(nd, &path, &inode, &seq); if (unlikely(err <= 0)) { if (err < 0) return err; path.dentry = lookup_slow(&nd->last, nd->path.dentry, nd->flags); if (IS_ERR(path.dentry)) return PTR_ERR(path.dentry);
path.mnt = nd->path.mnt; } err = follow_managed(&path, nd, &inode, &seq); if (unlikely(err < 0)) return err; in walk_component() and error = lookup_fast(nd, &path, &inode, &seq); if (likely(error > 0)) goto finish_lookup; ... error = follow_managed(&path, nd, &inode, &seq); if (unlikely(error < 0)) return error; finish_lookup: turning into error = lookup_fast(nd, &path, &inode, &seq); if (likely(error > 0)) goto finish_lookup; ... finish_lookup: error = follow_managed(&path, nd, &inode, &seq); if (unlikely(error < 0)) return error; in do_last().
6) after that we have 3 callers of step_into(); the ones in walk_component() and in do_last() would be immediately preceded by the calls of follow_managed(). The last one is in mountpoint_last(). That's if (d_flags_negative(smp_load_acquire(&path.dentry->d_flags))) { dput(path.dentry); return -ENOENT; } path.mnt = nd->path.mnt; return step_into(nd, &path, 0, d_backing_inode(path.dentry), 0); And that's where we are missing the mountpoint traversal in symlink case - sure, the caller does follow_mount(), but it doesn't catch the case when we have a symlink overmounted - we run into step_into() before that. Note that smp_load_acquire + d_flags_negative is what we would've done in follow_managed(), as well as getting d_backing_inode(). So here we also have an open-coded bastardized variant of follow_managed(). The difference is, we don't want to trigger ->d_automount() and ->d_manage() in that one.
And at that point the only call of follow_managed() *NOT* followed by step_into() is in handle_lookup_down(). What it is followed by is path_to_nameidata(&path, nd); nd->inode = inode; nd->seq = seq; And that's a piece of step_into(): if (likely(!d_is_symlink(path->dentry)) || !(flags & WALK_FOLLOW || nd->flags & LOOKUP_FOLLOW)) { /* not a symlink or should not follow */ path_to_nameidata(path, nd); nd->inode = inode; nd->seq = seq; return 0; } is the normal path through that sucker. What's more, we are guaranteed that this will _not_ be a symlink (it's the starting point of pathwalk, and path_init() would've told us to sod off were it not a directory).
So if we manage to convert the damn thing in mountpoint_last() into follow_managed(), we could fold follow_managed() into step_into(). Which suggests the way to do that - not that step_into() takes an argument containing ORed WALK_... constants. So we can simply add WALK_NOAUTOMOUNT and put a check for it into if (flags & DCACHE_MANAGE_TRANSIT) { and if (flags & DCACHE_NEED_AUTOMOUNT) { bodies, so that they would be ignored if that's passed to follow_mount()/step_into() hybrid.
At that point we have one primitive for moving into child, handling both the mountpoint traversals and keeping track of symlinks. Moreover, there's a fairly strong argument for using it in case of .. as well. As it is, if the parent is overmounted, we cross into whatever is mounted on top of it. And we ignore ->d_manage/->d_automount on the damn thing. Which is not an issue for anything other than autofs (nobody else has ->d_manage() and nfs/afs/cifs automount points don't have children) and for autofs we *want* those called; that's not something likely to be encountered, but it's an impossible setup (autofs direct mount set on an ancestor of somebody's current directory) and autofs does count upon not walking into something being set up by the daemon.
I'll put together such series and see how well does it work; it would fix the idiocies in user_path_mountpoint_at() and make the pathwalk machinery easier to follow - the boilerplate around mountpoint crossing and symlink handling is demonstrably easy to get wrong. If that works and doesn't cause observable slowdown, I'll put it into -next, either stepping around the changes done by openat2() series, or rebasing it on top of that.
Another interesting question is whether we want O_PATH open to trigger automounts. The thing is, we do *NOT* trigger them (or traverse mountpoints) at the starting point of lookups. I believe it's a mistake (and mine, at that), but I doubt that there's anything that can be done about it at that point. It's a user-visible behaviour and I can easily imagine a custom /init that ends up relying upon it ;-/ mkdir /root, mount the final root there, chdir /root, mount --move . /, remove everything on initramfs using absolute pathnames and chroot to "." to finish... Traversing mounts at the beginning of pathwalk would break the hell out of that, potentially with root filesystem contents wiped out... ;-/
I wish we could change that, but I'm afraid that's cast in stone by now (and had been for 20 years or so). As it is, we have an unpleasant side effect - O_PATH open does *NOT* trigger automounts. So if you do that to e.g. referral point and try to do ...at() syscalls with that as the origin, you'll get an unpleasant surprise - automount won't trigger at all.
I think the easiest way to handle that is to have O_PATH turn LOOKUP_AUTOMOUNT, same as the normal open() does. That's trivial to do, but that changes user-visible behaviour. OTOH, with the current behaviour nobody can rely upon automount not getting triggered by somebody else just as they are entering their open(dir, O_PATH), so I think that's not a problem.
Linus, do you have any objections to such O_PATH semantics change?
PS: I think I see how to untangle the control flow horrors in do_last() with this massage done, but I'm not going there until this is sorted out - by previous experience touching the damn thing can easily turn into several weeks of digging through the nfs/gfs2/etc. guts trying to verify something, with a couple of detours into fixing something in there found in process... ;-/
On Tue, Jan 7, 2020 at 7:13 PM Al Viro viro@zeniv.linux.org.uk wrote:
FWIW, I suspect that we want to do something along the following lines:
- make build_open_flags() treat O_CREAT | O_EXCL as if there had been
O_NOFOLLOW in the mix.
My reaction to that is "Whee, that's a big change".
But:
Benefit: this fragment in do_last()
you're right.
That's the semantics we have right now (and I think it's the correct safe semantics when I think about it). But when I first looked at your email I without thinking more about it actually thought we followed the symlink, and then did the O_CREAT | O_EXCL on the target (and potentially succeeded).
So I agree - making O_CREAT | O_EXCL imply O_NOFOLLOW seems to be the right thing to do, and not only should simplify our code, it's much more descriptive of what the real semantics are.
Even if my first reaction was that it would act differently.
Slash-and-burn approach to your explanatory subsequent steps:
- make follow_managed() take &inode and &seq.
- have the followup to failing __follow_mount_rcu() taken into it.
- fold __follow_mount_rcu() into follow_managed(), using the latter both in
RCU and non-RCU cases. 5) take the calls of follow_managed() out of lookup_fast() into its callers. 6) after that we have 3 callers of step_into(); [..] So if we manage to convert the damn thing in mountpoint_last() into follow_managed(), we could fold follow_managed() into step_into().
I think that all makes sense. I didn't go to look at the source, but from the email contents your steps look reasonable to me.
Another interesting question is whether we want O_PATH open to trigger automounts.
It does sound like they shouldn't, but as you say:
The thing is, we do *NOT* trigger them
(or traverse mountpoints) at the starting point of lookups. I believe it's a mistake (and mine, at that), but I doubt that there's anything that can be done about it at that point. It's a user-visible behaviour [..]
Hmm. I wonder how set in stone that is. We may have two decades of history of not doing it at start point of lookups, but we do *not* have two decades of history of O_PATH.
So what I think we agree would be sane behavior would be for O_PATH opens to not trigger automounts (unless there's a slash at the end, whatever), but _do_ add the mount-point traversal to the beginning of lookups.
But only do it for the actual O_PATH fd case, not the cwd/root/non-O_PATH case.
That way we maintain original behavior: if somebody overmounts your cwd, you still see the pre-mount directory on lookups, because your cwd is "under" the mount.
But if you open a file with O_PATH, and somebody does a mount _afterwards_, the openat() will see that later mount and/or do the automount.
Don't you think that would be the more sane/obvious semantics of how O_PATH should work?
I think the easiest way to handle that is to have O_PATH turn LOOKUP_AUTOMOUNT, same as the normal open() does. That's trivial to do, but that changes user-visible behaviour. OTOH, with the current behaviour nobody can rely upon automount not getting triggered by somebody else just as they are entering their open(dir, O_PATH), so I think that's not a problem.
Linus, do you have any objections to such O_PATH semantics change?
See above: I think I'd prefer the O_PATH behavior the other way around. That seems to be more of a consistent behavior of what "O_PATH" means - it means "don't really open, we'll do it only when you use it as a directory".
But I don't have any _strong_ opinions. If you have a good reason to tell me that I'm being stupid, go ahead and do so and override my stupidity.
Linus
On Tue, Jan 07, 2020 at 07:54:02PM -0800, Linus Torvalds wrote:
Another interesting question is whether we want O_PATH open to trigger automounts.
It does sound like they shouldn't, but as you say:
The thing is, we do *NOT* trigger them
(or traverse mountpoints) at the starting point of lookups. I believe it's a mistake (and mine, at that), but I doubt that there's anything that can be done about it at that point. It's a user-visible behaviour [..]
Hmm. I wonder how set in stone that is. We may have two decades of history of not doing it at start point of lookups, but we do *not* have two decades of history of O_PATH.
So what I think we agree would be sane behavior would be for O_PATH opens to not trigger automounts (unless there's a slash at the end, whatever), but _do_ add the mount-point traversal to the beginning of lookups.
But only do it for the actual O_PATH fd case, not the cwd/root/non-O_PATH case.
That way we maintain original behavior: if somebody overmounts your cwd, you still see the pre-mount directory on lookups, because your cwd is "under" the mount.
But if you open a file with O_PATH, and somebody does a mount _afterwards_, the openat() will see that later mount and/or do the automount.
Don't you think that would be the more sane/obvious semantics of how O_PATH should work?
Maybe, but... note that we do not (and AFAICS never had) follow mounts on /proc/self/cwd, /proc/self/fd/42, etc. And there are very good reasons for that. First of all, if your stdin is from /tmp/foo, you'd better get that file when you open /dev/stdin, even if somebody has done mount --bind /tmp/bar /tmp/foo; another issue is with the use of stat("/proc/self/fd/42", &buf) - it should be an equivalent of fstat(42, &buf), even if somebody has overmounted that. BTW, for similar reason after link(".", "foo"); fd = open("foo", O_PATH); // return 42 we really should (and do) have resolution of /proc/self/fd/42 stop at foo, not . Reason: consistency of stat() behaviour...
The point is, we'd never followed mounts on /proc/self/cwd et.al. I hadn't checked 2.0, but 2.1.100 ('97, before any changes from me) is that way. Actually, scratch that - 2.0 behaves the same way (mountpoint crossing is done in iget() there; is that Minix influence or straight from the Lions' book?)
Hmm... Looking through the history, we have
(for reference) v7: mount traversal in iget() (forward) and namei() (back); due to the way it's done, forward traversal happens * at starting point * after any component (. and .. included) * on results of forward traversal (due to a loop in iget()). Back traversal (to covered on .. from root directory) is also to unlimited depth.
0.01: no mount handling
0.10: forward traversal in iget(), back traversal in fs/namei.c:find_entry() (not by Lions' Book, then - v6 didn't do back traversals at all). Forward traversal * after any component (. and .. included) No traversal on starting point, no traversal on result of traversal. OTOH, mount(2) refuses to mount on top of root, so the lack of the last one is not an issue.
0.12: symlinks added; no mount traversal on starting point of those either. We start at the process' root for absolute ones, even if it happens to be overmounted, and we start from parent for relative ones. The latter matters only if we were in the beginning of the pathwalk, since anything else would've traversed mounts back when we'd picked said parent. Mount traversal takes precedence over symlink traversal, but that's not an issue since mount follows links on mountpoint. It does not, at that point, reject fs image with symlink for root, but that actually more or less works.
0.97.3: same, with addition of procfs symlinks. No mount crossing on their targets (for normal symlinks we don't do mount crossing in the beginning and any component inside triggers mount crossing as usual; for procfs ones there's no components inside)
Situation remains essentially unchanged until 2.1.42. Next few kernels are in flux, to put it politely - initial merge had been insane and it took until 2.1.44 or so for the things to get more or less working.
At 2.1.44: forward traversal in fs/namei.c:lookup(), back traversal in fs/namei.c:reserved_lookup(). Otherwise the same behaviour as pre-dcache (wrt mount traversals, that is).
2.1.51pre1: forward traversal moved into real_lookup() and __d_lookup(). Forward traversal happens *ONLY* after normal components - not after . or ..
2.1.61: forward traversal moved into follow_mount(), behaviour reverted to pre-dcache one.
Previous is from reading through the historical trees; my involvement started circa 2.1.120-something.
2.3.50pre3: call of follow_mount() moved a bit, reverting to 2.1.51pre1 behaviour (nor traversal on . or ..) *again*. Not sure whose idea had that been - might've been mine, but unlike the other patch that went into fs/namei.c in the same release, I hadn't been able to find anything related to that one. If your memories (or mail archives) are better...
2.3.99pre4-5: massive surgery in there. Preparations to allowing mount on top of mount; forward traversal adjusted accordingly, back traversal still isn't.
2.3.99pre7-1: more surgery, back traversals are also to unlimited depth now and mount on top of mount has been allowed.
2.3.99pre9-4: mount --bind taught to mount non-directories on top of non-directories. At that point it does *NOT* follow trailing symlinks, so mounting of symlinks and mounting on top of symlinks becomes possible. Mount traversal still takes precedence over symlink traversal, symlink traversal of mount traversal result still generally works, even though it's not something I considered at the time.
v2.4.5.2: mount --bind started to follow symlinks. So that source of mounting of and on the symlinks was no more.
2.5.0.5: forward mount traversal is done after .. (in handle_dotdot()). That brings back the pre-dcache behaviour for those suckers. Still no forward traversal after ., though.
At about the same time I'd been getting rid of the early-boot incestous relationships with fs/namespace.c (initramfs work) and that was probably the last time we could realistically switch to following mounts at starting point; I considered trying to do that, but decided not to. Pity, that...
2.6.5-rc2: normal mount now checks for corrupt fs with symlink for root. Since it has always been following symlinks for mountpoint, the remaining source of mounting of and on symlinks was gone; that lasted until after O_PATH introduction.
2.6.39-rc1: mount traps support - instead of abusing ->follow_link() for automounting, we have an explicit pair of methods that can be called at the same places where we traverse mounts. None too consistent - we don't do that on .. results. That was Dave Howells and Ian Kent.
2.6.39-rc1: O_PATH introduced and, later in the same series, allowed for symlinks. That has changed things - now procfs symlink targets could be symlinks themselves. Originally an attempt to follow those would blow up with -ELOOP (there's simply no good way to follow such beast; it's either "stop even if we are asked to follow" or "give an error").
3.6.0-rc1: nd_jump_link() introduction (hch) had unnoticed side effects - we'd switched from "fail traversal with -ELOOP" to "stop there". Mostly it doesn't change behaviour, but it has opened a way to mount symlinks and mount on top of symlinks. Which generally worked.
circa 3.8--3.9: side effects had been noticed; my first reaction had been "let's make nd_jump_link() return an error, then", but I hadn't been able to find good reasons when challenged to do so. Did an audit, found no obvious problems, went "oh, well - whether it works by accident or by design, it doesn't break anything".
3.12.0-rc1: lookups for umount(2) are different - we don't want revalidate on the last component. Which had been handled by introduction of path_umountat()/umount_lookup_last(), parallel to path_lookupat(). Which has gotten quite a few things wrong - it *did* try to follow symlinks obtained by following procfs ones (and blew up big way) and it didn't follow mounts on overmounted trailing symlinks. Nobody noticed for 6 years, until folks actually tried to play with mount-on-symlink... Patches were by Jeff Layton, neither he nor I have spotted the problem back then. And I should have, since it had been only a few months since the audit for exactly that kind of problems...
AFAICS, there'd been no serious semantical changes since then. What we have right now: * no mount traversal on the starting point * mount traversal after any component other than "." * symlink traversal consists of possibly jumping to given point plus following a given (possibly empty) series of components. It can be both - e.g. symlink to "/foo/bar" is 'jump to root, then traverse "foo", then traverse "bar"'. Procfs "magic" symlinks are not really magical - they behave as symlinks to "/" as far as the pathwalk semantics is concerned. The only differences is that jump might be not to process' root. * mount traversal takes precedence over symlink traversal. * jump (if any) in symlink traversal is treated the same as the starting point - it's not followed by mount traversal. It's also not followed by symlink traversal, even if we are jumping into a symlink. Of course, in any position other than the end of pathname that's an instant error. That's also not different from the starting point treatment - if ...at(2) is given a symlink for starting point, it leaves it as-is if AT_EMPTY_PATH is given and fails with -ENOTDIR otherwise. * umount(2) handles the final component differently - for one thing, it does not do revalidate, for another - its mount traversal (if any) does not include automount-related parts. And there we *do* want mount traversal at the final point, for obvious reasons.
I think the easiest way to handle that is to have O_PATH turn LOOKUP_AUTOMOUNT, same as the normal open() does. That's trivial to do, but that changes user-visible behaviour. OTOH, with the current behaviour nobody can rely upon automount not getting triggered by somebody else just as they are entering their open(dir, O_PATH), so I think that's not a problem.
Linus, do you have any objections to such O_PATH semantics change?
See above: I think I'd prefer the O_PATH behavior the other way around. That seems to be more of a consistent behavior of what "O_PATH" means - it means "don't really open, we'll do it only when you use it as a directory".
How would your proposal deal with access("/proc/self/fd/42/foo", MAY_READ) vs. faccessat(42, "foo", MAY_READ)? The latter would trigger automount, the former would not... Or would you extend that to "traverse mounts upon following procfs links, if the file in question had been opened with O_PATH"? We could do that (give nd_jump_link() an extra argument telling if we want mount traversal), but I'm not sure if the resulting semantics is sane...
Note, BTW, that O_PATH users really can't rely upon automounts _not_ being triggered - all it takes is a lookup on bogus path with such prefix by anybody who can reach that place... We are not opening anything, really, but we are not able to ignore automounts triggered by somebody else.
On Wed, Jan 8, 2020 at 1:34 PM Al Viro viro@zeniv.linux.org.uk wrote:
The point is, we'd never followed mounts on /proc/self/cwd et.al. I hadn't checked 2.0, but 2.1.100 ('97, before any changes from me) is that way.
Hmm. If that's the case, maybe they should be marked implicitly as O_PATH when opened?
Actually, scratch that - 2.0 behaves the same way (mountpoint crossing is done in iget() there; is that Minix influence or straight from the Lions' book?)
I don't think I ever had access to Lions' - I've _seen_ a printout of it later, and obviously maybe others did,
More likely it's from Maurice Bach: the Design of the Unix Operating System. I'm pretty sure that's where a lot of the FS layer stuff came from. Certainly the bad old buffer head interfaces, and quite likely the iget() stuff too.
0.10: forward traversal in iget(), back traversal in fs/namei.c:find_entry()
Whee, you _really_ went back in time.
So I did too.
And looking at that code in iget(), I doubt it came from anywhere. Christ. It's just looping over a fixed-size array, both when finding the inode, and finding the superblock.
Cute, but unbelievably stupid. It was a more innocent time.
In other words, I think you can chalk it up to just me, because blaming anybody else for that garbage would be very very unfair indeed ;)
How would your proposal deal with access("/proc/self/fd/42/foo", MAY_READ) vs. faccessat(42, "foo", MAY_READ)?
I think that in a perfect world, the O_PATH'ness of '42' would be the deciding factor. Wouldn't those be the best and most consistent semantics?
And then 'cwd'/'root' always have the O_PATH behavior.
The latter would trigger automount, the former would not... Or would you extend that to "traverse mounts upon following procfs links, if the file in question had been opened with O_PATH"?
Exactly.
But you know what? I do not believe this is all that important, and I doubt it will matter to anybody.
So what matters most is what makes the most sense to the VFS layer, and what makes the most sense to _you_.
Because my reaction from this thread is that not only have you thought about this issue and followed the history a whole lot more than I would ever have done, it's also that I trust you to DTRT.
I think it would be good to have some self-consistency, but at the same time clearly we already don't really, and our behavior here has subtly changed over the years (and not so subtly - if you go back sufficiently far, /proc behavior wrt file descriptors has had both "dup()" behavior and "make a new file descriptor with the same inode" behavior, afaik).
Linus
On Thu, Jan 09, 2020 at 04:08:16PM -0800, Linus Torvalds wrote:
On Wed, Jan 8, 2020 at 1:34 PM Al Viro viro@zeniv.linux.org.uk wrote:
The point is, we'd never followed mounts on /proc/self/cwd et.al. I hadn't checked 2.0, but 2.1.100 ('97, before any changes from me) is that way.
Hmm. If that's the case, maybe they should be marked implicitly as O_PATH when opened?
I thought you wanted O_PATH as starting point to have mounts traversed? Confused...
Actually, scratch that - 2.0 behaves the same way (mountpoint crossing is done in iget() there; is that Minix influence or straight from the Lions' book?)
I don't think I ever had access to Lions' - I've _seen_ a printout of it later, and obviously maybe others did,
More likely it's from Maurice Bach: the Design of the Unix Operating System. I'm pretty sure that's where a lot of the FS layer stuff came from. Certainly the bad old buffer head interfaces, and quite likely the iget() stuff too.
0.10: forward traversal in iget(), back traversal in fs/namei.c:find_entry()
Whee, you _really_ went back in time.
So I did too.
And looking at that code in iget(), I doubt it came from anywhere. Christ. It's just looping over a fixed-size array, both when finding the inode, and finding the superblock.
Cute, but unbelievably stupid. It was a more innocent time.
In other words, I think you can chalk it up to just me, because blaming anybody else for that garbage would be very very unfair indeed ;)
See https://minnie.tuhs.org/cgi-bin/utree.pl?file=V7/usr/sys/sys/iget.c Exactly the same algorithm, complete with linear searches over those fixed-sized array.
<grabs Bach> Right, he simply transcribes v7 iget().
So I suspect that you are right - your variant of iget was pretty much one-to-one implementation of Bach's description of v7 iget.
Your namei wasn't - Bach has 'if the entry points to root and you are in the root and name is "..", find mount table entry (by device number), drop your directory inode, grab the inode of mountpount and restart the search for ".." in there', which gives back traversals to arbitrary depth. And v7 namei() (as Bach mentions) uses iget() for starting point as well as for each component. You kept pointers instead, which is where the other difference has come from (no mount traversal at the starting point)...
Actually, I've misread your code in 0.10 - it does unlimited forward traversals; it's back traversals that go only one level. The forward ones got limited to one level in 0.95, but then mount-over-root had been banned all along. I'd read the pre-dcache variant of iget(), seen it go pretty much all the way back to beginning and hadn't sorted out the 0.12 -> 0.95 transition...
How would your proposal deal with access("/proc/self/fd/42/foo", MAY_READ) vs. faccessat(42, "foo", MAY_READ)?
I think that in a perfect world, the O_PATH'ness of '42' would be the deciding factor. Wouldn't those be the best and most consistent semantics?
And then 'cwd'/'root' always have the O_PATH behavior.
See above - unless I'm misparsing you, you wanted mount traversals in the starting point if it's ...at() with O_PATH fd. With O_PATH open() not doing them.
For cwd and root the situation is opposite - we do NOT traverse mounts for those. And that's really too late to change.
The latter would trigger automount, the former would not... Or would you extend that to "traverse mounts upon following procfs links, if the file in question had been opened with O_PATH"?
Exactly.
But you know what? I do not believe this is all that important, and I doubt it will matter to anybody.
FWIW, digging through the automount-related parts of that stuff has caught several fun issues. One (and I'm rather embarrassed by it) should've been caught back in commit 8aef18845266 (VFS: Fix vfsmount overput on simultaneous automount). To quote the commit message: The problem is that lock_mount() drops the caller's reference to the mountpoint's vfsmount in the case where it finds something already mounted on the mountpoint as it transits to the mounted filesystem and replaces path->mnt with the new mountpoint vfsmount.
During a pathwalk, however, we don't take a reference on the vfsmount if it is the same as the one in the nameidata struct, but do_add_mount() doesn't know this. At which point I should've gone "what the fuck?" - lock_mount() does, indeed, drop path->mnt in this situation and replaces it with the whatever's come to cover it. For mount(2) that's the right thing to do - we _want_ to mount on top of whatever we have at the mountpoint. For automounts we very much don't want that - it's either "mount right on top of the automount trigger" or discard whatever we'd been about to mount and walk into whatever's got mounted there (presumably the same thing triggered by another process). We kinda-sorta get that effect, but in a very convoluted way: do_add_mount() will refuse to mount something on top of itself - /* Refuse the same filesystem on the same mount point */ err = -EBUSY; if (path->mnt->mnt_sb == newmnt->mnt.mnt_sb && path->mnt->mnt_root == path->dentry) goto unlock; which will end up with -EBUSY returned (and recognized by follow_automount()).
First of all, that's unreliable. If somebody not only has triggered that automount, but managed to _mount_ something else on top (for example, has triggered it by lookup of mountpoint-to-be in mount(2)), we'll end up not triggering that check. In which case we'll get something like nfs referral point under nfs automounted there under tmpfs from explicit overmount under same nfs mount we'd automounted there - identical to what's been buried under tmpfs. It's hard to hit, but not impossibly so.
What's more, the whole solution is a kludge - the root of problem is that lock_mount() is the wrong thing to do in case of finish_automount(). We don't want to go into whatever's overmounting us there, both for the reasons above *and* because it's a PITA for the caller. So the right solution is * lift lock_mount() call from do_add_mount() into its callers (all 2 of them); while we are at it, lift unlock_mount() as well (makes for simpler failure exits in do_add_mount()). * replace the call of lock_mount() in finish_automount() with variant that doesn't do "unlock, walk deeper and retry locking", returning ERR_PTR(-EBUSY) in such case. * get rid of the kludge introduced in that commit. Better yet, don't bother with traversing into the covering mount in case of success - let the caller of follow_automount() do that. Which eliminates the need to pass need_mntput to the sucker and suggests an even better solution - have this analogue of lock_mount() return NULL instead of ERR_PTR(-EBUSY) and treat it in finish_automount() as "OK, discard what we wanted to mount and return 0". That gets rid of the entire err = finish_automount(mnt, path); switch (err) { case -EBUSY: /* Someone else made a mount here whilst we were busy */ return 0; case 0: path_put(path); path->mnt = mnt; path->dentry = dget(mnt->mnt_root); return 0; default: return err; } chunk in follow_automount() - it would just be return finish_automount(mnt, path);
Another thing (in the same area) is not a bug per se, but... after the call of ->d_automount() we have this: if (IS_ERR(mnt)) { /* * The filesystem is allowed to return -EISDIR here to indicate * it doesn't want to automount. For instance, autofs would do * this so that its userspace daemon can mount on this dentry. * * However, we can only permit this if it's a terminal point in * the path being looked up; if it wasn't then the remainder of * the path is inaccessible and we should say so. */ if (PTR_ERR(mnt) == -EISDIR && (nd->flags & LOOKUP_PARENT)) return -EREMOTE; return PTR_ERR(mnt); } Except that not a single instance of ->d_automount() has ever returned -EISDIR. Certainly not autofs one, despite the what the comment says. That chunk has come from dhowells, back when the whole mount trap series had been merged. After talking that thing over (fun: trying to figure out what had been intended nearly 9 years ago, when people involved are in UK, US east coast and AU west coast respectively. The only way it could suck more would've been if I were on the west coast - then all timezone deltas would be 8-hour ones)... looks like it's a rudiment of plans that got superseded during the series development, nobody quite remembers exact details. Conclusion: it's not even dead, it's stillborn; bury it.
Unfortunately, there are other interesting questions related to autofs-specific bits (->d_manage()) and the timezone-related fun is, of course, still there. I hope to sort that out today or tomorrow, at least enough to do a reasonable set of backportable fixes to put in front of follow_managed()/step_into() queue. Oh, well...
On Thu, Jan 9, 2020 at 8:15 PM Al Viro viro@zeniv.linux.org.uk wrote:
Hmm. If that's the case, maybe they should be marked implicitly as O_PATH when opened?
I thought you wanted O_PATH as starting point to have mounts traversed? Confused...
No, I'm confused. I meant "non-O_PATH", just got the rules reversed in my mind.
So cwd/root would always act as it non-O_PATH, and only using an actual fd would look at the O_PATH flag, and if it was set would walk the mountpoints.
<grabs Bach> Right, he simply transcribes v7 iget().
So I suspect that you are right - your variant of iget was pretty much one-to-one implementation of Bach's description of v7 iget.
Ok, that makes sense. My copy of Bach literally had the system call list "marked off" when I implemented them back when.
I may still have that paperbook copy somewhere. I don't _think_ I'd have thrown it out, it has sentimental value.
I think that in a perfect world, the O_PATH'ness of '42' would be the deciding factor. Wouldn't those be the best and most consistent semantics?
And then 'cwd'/'root' always have the O_PATH behavior.
See above - unless I'm misparsing you, you wanted mount traversals in the starting point if it's ...at() with O_PATH fd.
.. and see above, it was just my confusion about the sense of O_PATH.
For cwd and root the situation is opposite - we do NOT traverse mounts for those. And that's really too late to change.
Oh, absolutely.
[ snip some more about your automount digging. Looks about right, but I'm not going to make a peep after getting O_PATH reversed ;) ]
Linus
On Fri, 2020-01-10 at 04:15 +0000, Al Viro wrote:
On Thu, Jan 09, 2020 at 04:08:16PM -0800, Linus Torvalds wrote:
On Wed, Jan 8, 2020 at 1:34 PM Al Viro viro@zeniv.linux.org.uk wrote:
The point is, we'd never followed mounts on /proc/self/cwd et.al. I hadn't checked 2.0, but 2.1.100 ('97, before any changes from me) is that way.
Hmm. If that's the case, maybe they should be marked implicitly as O_PATH when opened?
I thought you wanted O_PATH as starting point to have mounts traversed? Confused...
Actually, scratch that - 2.0 behaves the same way (mountpoint crossing is done in iget() there; is that Minix influence or straight from the Lions' book?)
I don't think I ever had access to Lions' - I've _seen_ a printout of it later, and obviously maybe others did,
More likely it's from Maurice Bach: the Design of the Unix Operating System. I'm pretty sure that's where a lot of the FS layer stuff came from. Certainly the bad old buffer head interfaces, and quite likely the iget() stuff too.
0.10: forward traversal in iget(), back traversal in fs/namei.c:find_entry()
Whee, you _really_ went back in time.
So I did too.
And looking at that code in iget(), I doubt it came from anywhere. Christ. It's just looping over a fixed-size array, both when finding the inode, and finding the superblock.
Cute, but unbelievably stupid. It was a more innocent time.
In other words, I think you can chalk it up to just me, because blaming anybody else for that garbage would be very very unfair indeed ;)
See https://minnie.tuhs.org/cgi-bin/utree.pl?file=V7/usr/sys/sys/iget.c Exactly the same algorithm, complete with linear searches over those fixed-sized array.
<grabs Bach> Right, he simply transcribes v7 iget().
So I suspect that you are right - your variant of iget was pretty much one-to-one implementation of Bach's description of v7 iget.
Your namei wasn't - Bach has 'if the entry points to root and you are in the root and name is "..", find mount table entry (by device number), drop your directory inode, grab the inode of mountpount and restart the search for ".." in there', which gives back traversals to arbitrary depth. And v7 namei() (as Bach mentions) uses iget() for starting point as well as for each component. You kept pointers instead, which is where the other difference has come from (no mount traversal at the starting point)...
Actually, I've misread your code in 0.10 - it does unlimited forward traversals; it's back traversals that go only one level. The forward ones got limited to one level in 0.95, but then mount-over-root had been banned all along. I'd read the pre-dcache variant of iget(), seen it go pretty much all the way back to beginning and hadn't sorted out the 0.12 -> 0.95 transition...
How would your proposal deal with access("/proc/self/fd/42/foo", MAY_READ) vs. faccessat(42, "foo", MAY_READ)?
I think that in a perfect world, the O_PATH'ness of '42' would be the deciding factor. Wouldn't those be the best and most consistent semantics?
And then 'cwd'/'root' always have the O_PATH behavior.
See above - unless I'm misparsing you, you wanted mount traversals in the starting point if it's ...at() with O_PATH fd. With O_PATH open() not doing them.
For cwd and root the situation is opposite - we do NOT traverse mounts for those. And that's really too late to change.
The latter would trigger automount, the former would not... Or would you extend that to "traverse mounts upon following procfs links, if the file in question had been opened with O_PATH"?
Exactly.
But you know what? I do not believe this is all that important, and I doubt it will matter to anybody.
FWIW, digging through the automount-related parts of that stuff has caught several fun issues. One (and I'm rather embarrassed by it) should've been caught back in commit 8aef18845266 (VFS: Fix vfsmount overput on simultaneous automount). To quote the commit message: The problem is that lock_mount() drops the caller's reference to the mountpoint's vfsmount in the case where it finds something already mounted on the mountpoint as it transits to the mounted filesystem and replaces path->mnt with the new mountpoint vfsmount. During a pathwalk, however, we don't take a reference on the vfsmount if it is the same as the one in the nameidata struct, but do_add_mount() doesn't know this. At which point I should've gone "what the fuck?" - lock_mount() does, indeed, drop path->mnt in this situation and replaces it with the whatever's come to cover it. For mount(2) that's the right thing to do - we _want_ to mount on top of whatever we have at the mountpoint. For automounts we very much don't want that - it's either "mount right on top of the automount trigger" or discard whatever we'd been about to mount and walk into whatever's got mounted there (presumably the same thing triggered by another process). We kinda-sorta get that effect, but in a very convoluted way: do_add_mount() will refuse to mount something on top of itself - /* Refuse the same filesystem on the same mount point */ err = -EBUSY; if (path->mnt->mnt_sb == newmnt->mnt.mnt_sb && path->mnt->mnt_root == path->dentry) goto unlock; which will end up with -EBUSY returned (and recognized by follow_automount()).
First of all, that's unreliable. If somebody not only has triggered that automount, but managed to _mount_ something else on top (for example, has triggered it by lookup of mountpoint-to-be in mount(2)), we'll end up not triggering that check. In which case we'll get something like nfs referral point under nfs automounted there under tmpfs from explicit overmount under same nfs mount we'd automounted there - identical to what's been buried under tmpfs. It's hard to hit, but not impossibly so.
What's more, the whole solution is a kludge - the root of problem is that lock_mount() is the wrong thing to do in case of finish_automount(). We don't want to go into whatever's overmounting us there, both for the reasons above *and* because it's a PITA for the caller. So the right solution is
- lift lock_mount() call from do_add_mount() into its callers
(all 2 of them); while we are at it, lift unlock_mount() as well (makes for simpler failure exits in do_add_mount()).
- replace the call of lock_mount() in finish_automount()
with variant that doesn't do "unlock, walk deeper and retry locking", returning ERR_PTR(-EBUSY) in such case.
- get rid of the kludge introduced in that commit. Better
yet, don't bother with traversing into the covering mount in case of success - let the caller of follow_automount() do that. Which eliminates the need to pass need_mntput to the sucker and suggests an even better solution - have this analogue of lock_mount() return NULL instead of ERR_PTR(-EBUSY) and treat it in finish_automount() as "OK, discard what we wanted to mount and return 0". That gets rid of the entire err = finish_automount(mnt, path); switch (err) { case -EBUSY: /* Someone else made a mount here whilst we were busy */ return 0; case 0: path_put(path); path->mnt = mnt; path->dentry = dget(mnt->mnt_root); return 0; default: return err; } chunk in follow_automount() - it would just be return finish_automount(mnt, path);
Another thing (in the same area) is not a bug per se, but... after the call of ->d_automount() we have this: if (IS_ERR(mnt)) { /* * The filesystem is allowed to return -EISDIR here to indicate * it doesn't want to automount. For instance, autofs would do * this so that its userspace daemon can mount on this dentry. * * However, we can only permit this if it's a terminal point in * the path being looked up; if it wasn't then the remainder of * the path is inaccessible and we should say so. */ if (PTR_ERR(mnt) == -EISDIR && (nd->flags & LOOKUP_PARENT)) return -EREMOTE; return PTR_ERR(mnt); } Except that not a single instance of ->d_automount() has ever returned -EISDIR. Certainly not autofs one, despite the what the comment says. That chunk has come from dhowells, back when the whole mount trap series had been merged. After talking that thing over (fun: trying to figure out what had been intended nearly 9 years ago, when people involved are in UK, US east coast and AU west coast respectively. The only way it could suck more would've been if I were on the west coast - then all timezone deltas would be 8-hour ones)... looks like it's a rudiment of plans that got superseded during the series development, nobody quite remembers exact details. Conclusion: it's not even dead, it's stillborn; bury it.
Yeah, autofs ->d_automount() doesn't return -EISDIR, by the time we get there it's not relevant any more, so that check looks redundant. I'm not aware of any other fs automount implementation that needs that EISDIR pass-thru function.
I didn't notice it at the time of the merge, sorry about that.
While we're at it that: if (!path->dentry->d_op || !path->dentry->d_op->d_automount) return -EREMOTE;
at the top of follow_automount() isn't going to be be relevant for autofs because ->d_automount() really must always be defined for it.
But, at the time of the merge, I didn't object to it because there were (are) other file systems that use the VFS automount function which may accidentally not define the method.
Unfortunately, there are other interesting questions related to autofs-specific bits (->d_manage()) and the timezone-related fun is, of course, still there. I hope to sort that out today or tomorrow, at least enough to do a reasonable set of backportable fixes to put in front of follow_managed()/step_into() queue. Oh, well...
Yeah, I know it slows you down but I kink-off like having a chance to look at what's going and think about your questions before trying to answer them, rather than replying prematurely, as I usually do ...
It's been a bit of a busy day so far but I'm getting to look into the questions you've asked.
Ian
On Fri, Jan 10, 2020 at 02:20:55PM +0800, Ian Kent wrote:
Yeah, autofs ->d_automount() doesn't return -EISDIR, by the time we get there it's not relevant any more, so that check looks redundant. I'm not aware of any other fs automount implementation that needs that EISDIR pass-thru function.
I didn't notice it at the time of the merge, sorry about that.
While we're at it that: if (!path->dentry->d_op || !path->dentry->d_op->d_automount) return -EREMOTE;
at the top of follow_automount() isn't going to be be relevant for autofs because ->d_automount() really must always be defined for it.
But, at the time of the merge, I didn't object to it because there were (are) other file systems that use the VFS automount function which may accidentally not define the method.
OK...
Unfortunately, there are other interesting questions related to autofs-specific bits (->d_manage()) and the timezone-related fun is, of course, still there. I hope to sort that out today or tomorrow, at least enough to do a reasonable set of backportable fixes to put in front of follow_managed()/step_into() queue. Oh, well...
Yeah, I know it slows you down but I kink-off like having a chance
Nice typo, that ;-)
to look at what's going and think about your questions before trying to answer them, rather than replying prematurely, as I usually do ...
It's been a bit of a busy day so far but I'm getting to look into the questions you've asked.
Here's a bit more of those (I might've missed some of your replies on IRC; my apologies if that's the case):
1) AFAICS, -EISDIR from ->d_manage() actually means "don't even try ->d_automount() here". If its effect can be delayed until the decision to call ->d_automount(), the things seem to get simpler. Is it ever returned in situation when the sucker _is_ overmounted?
2) can autofs_d_automount() ever be called for a daemon? Looks like it shouldn't be...
3) is _anything_ besides root directory ever created in direct autofs superblocks by anyone? If not, why does autofs_lookup() even bother to do anything there? IOW, why not have it return ERR_PTR(-ENOENT) immediately for direct ones? Or am I missing something and it is, in fact, possible to have the daemon create something in those?
4) Symlinks look like they should qualify for parent being non-empty; at least autofs_d_manage() seems to think so (simple_empty() use). So shouldn't we remove the trap from its parent on symlink/restore on unlink if parent gets empty? For version 4 or earlier, that is. Or is it simply that daemon only creates symlinks in root directory?
Anyway, intermediate state of the series is in #work.namei right now, and some _very_ interesting possibilities open up. It definitely needs more massage around __follow_mount_rcu() (as it is, the fastpath in there is still too twisted). Said that * call graph is less convoluted * follow_managed() calls are folded into step_into(). Interface: int step_into(nd, flags, dentry, inode, seq), with inode/seq used only if we are in RCU mode. * ".." still doesn't use that; it probably ought to. * lookup_fast() doesn't take path - nd, &inode, &seq and returns dentry * lookup_open() and fs/namei.c:atomic_open() get similar treatment - don't take path, return dentry. * calls of follow_managed()/step_into() combination returning 1 are always followed by get_link(), and very shortly, at that. So much that we can realistically merge pick_link() (in the end of step_into()) with get_link(). That merge is NOT done in this branch yet.
The last one promises to get rid of a rather unpleasant group of calling conventions. Right now we have several functions (step_into()/ walk_component()/lookup_last()/do_last()) with the following calling conventions: -E... => error 0 => non-symlink or symlink not followed; nd->path points to it 1 => picked a symlink to follow; its mount/dentry/seq has been pushed on nd->stack[]; its inode is stashed into nd->link_inode for subsequent get_link() to pick. nd->path is left unchanged.
That way all of those become ERR_PTR(-E...) => error NULL => non-symlink, symlink not followed or a pure jump (bare "/" or procfs ones); nd->path points to where we end up string => symlink being followed; the sucker's pushed to stack, initial jump (if any) has been handled and the string returned is what we need to traverse.
IMO it's less arbitrary that way. More importantly, the separation between step_into() committing to symlink traversal and (inevitably following) get_link() is gone - it's one operation after that change. No nd->link_inode either - it's only needed to carry the information from pick_link() to the next get_link().
Loops turn into while (!(err = link_path_walk(nd, s)) && (s = lookup_last(nd)) != NULL) ; and while (!(err = link_path_walk(nd, s)) && (s = do_last(nd, file, op)) != NULL) ;
trailing_symlink() goes away (folded into pick_link()/get_link() combo, conditional upon nd->depth at the entry). And in link_path_walk() we'll have if (unlikely(!*name)) { /* pathname body, done */ if (!nd->depth) return 0; name = nd->stack[nd->depth - 1].name; /* trailing symlink, done */ if (!name) return 0; /* last component of nested symlink */ s = walk_component(nd, WALK_FOLLOW); } else { /* not the last component */ s = walk_component(nd, WALK_FOLLOW | WALK_MORE); } if (s) { if (IS_ERR(s)) return PTR_ERR(s); /* a symlink to follow */ nd->stack[nd->depth - 1].name = name; name = s; continue; }
Anyway, before I try that one I'm going to fold path_openat2() into that series - that step is definitely going to require some massage there; it's too close to get_link() changes done in Aleksa's series.
If we do that, we get a single primitive for "here's the result of lookup; traverse mounts and either move into the result or, if it's a symlink that needs to be traversed, start the symlink traversal - jump into the base position for it (if needed) and return the pathname that needs to be handled". As it is, mainline has that logics spread over about a dozen locations...
Diffstat at the moment: fs/autofs/dev-ioctl.c | 6 +- fs/internal.h | 1 - fs/namei.c | 460 ++++++++++++++------------------------------------ fs/namespace.c | 97 +++++++---- fs/nfs/nfstrace.h | 2 - fs/open.c | 4 +- include/linux/namei.h | 3 +- 7 files changed, 197 insertions(+), 376 deletions(-)
In the current form the sucker appears to work (so far - about 30% into the usual xfstests run) without visible slowdowns...
On Sun, 2020-01-12 at 21:33 +0000, Al Viro wrote:
On Fri, Jan 10, 2020 at 02:20:55PM +0800, Ian Kent wrote:
Yeah, autofs ->d_automount() doesn't return -EISDIR, by the time we get there it's not relevant any more, so that check looks redundant. I'm not aware of any other fs automount implementation that needs that EISDIR pass-thru function.
I didn't notice it at the time of the merge, sorry about that.
While we're at it that: if (!path->dentry->d_op || !path->dentry->d_op->d_automount) return -EREMOTE;
at the top of follow_automount() isn't going to be be relevant for autofs because ->d_automount() really must always be defined for it.
But, at the time of the merge, I didn't object to it because there were (are) other file systems that use the VFS automount function which may accidentally not define the method.
OK...
Unfortunately, there are other interesting questions related to autofs-specific bits (->d_manage()) and the timezone-related fun is, of course, still there. I hope to sort that out today or tomorrow, at least enough to do a reasonable set of backportable fixes to put in front of follow_managed()/step_into() queue. Oh, well...
Yeah, I know it slows you down but I kink-off like having a chance
Nice typo, that ;-)
to look at what's going and think about your questions before trying to answer them, rather than replying prematurely, as I usually do ...
It's been a bit of a busy day so far but I'm getting to look into the questions you've asked.
Here's a bit more of those (I might've missed some of your replies on IRC; my apologies if that's the case):
- AFAICS, -EISDIR from ->d_manage() actually means "don't even try
->d_automount() here". If its effect can be delayed until the decision to call ->d_automount(), the things seem to get simpler. Is it ever returned in situation when the sucker _is_ overmounted?
In theory it shouldn't need to be returned when there is an actual mount there.
If there is a real mount at this point that should be enough to prevent walks into that mount until it's mount is complete.
The whole idea of -EISDIR is to prevent processes from walking into a directory tree that "doesn't have a real mount at its base" (the so called multi-mount map construct).
- can autofs_d_automount() ever be called for a daemon? Looks like
it shouldn't be...
Can't do that, it will lead to deadlock very quickly.
- is _anything_ besides root directory ever created in direct autofs
superblocks by anyone? If not, why does autofs_lookup() even bother to do anything there? IOW, why not have it return ERR_PTR(-ENOENT) immediately for direct ones? Or am I missing something and it is, in fact, possible to have the daemon create something in those?
Short answer is no, longer answer is directories "shouldn't" ever be created inside direct mount points.
The thing is that the multi-mount map construct can be used with direct mounts too, but they must always have a real mount at the base because they are direct mounts. So processes should not be able to walk into them while they are being mounted (constructed).
But I'm pretty sure it's rare (maybe not done at all) that this map construct is used with direct mounts.
- Symlinks look like they should qualify for parent being non-empty;
at least autofs_d_manage() seems to think so (simple_empty() use). So shouldn't we remove the trap from its parent on symlink/restore on unlink if parent gets empty? For version 4 or earlier, that is. Or is it simply that daemon only creates symlinks in root directory?
Yes, they have to be empty.
If a symlink is to be used (based on autofs config or map option) and the "browse" option is used for the indirect mount (browse only makes sense for indirect autofs managed mounts) then the mount point directory has to be removed and a symlink created so it must be empty to for this to make sense.
If it's a "nobrowse" autofs mount then nothing should already exist, it just gets created.
The catch is that a map entry for which a symlink is to be used instead of a mount can't be a multi-mount. I'm pretty sure I don't have sufficient error checking for that in the daemon but I also haven't had reports of problems with it either.
For a very long time the use of symlinks was not common but when the amd format map parser was added it made sense to use symlinks in some cases for those. That was partly to reduce the number of mounts needed and because I deliberately don't support amd map entries that provide the multi-mount construct. The way amd did this looked ugly to me, very much a hack to add a Sun format mount feature.
As far as keeping the trap flags up to date, I don't.
It seemed so much simpler to just leave the flags in place but, at that time, symlinks were not used (although it was possible to do so), now that's changed fiddling with the flags might now make sense.
As I said on IRC: "DCACHE_NEED_AUTOMOUNT is set on symlink dentries because, when ->lookup() is called the dentry may trigger a callback to the daemon that will either create a directory (since, in this case, one does not already exist) and attempt to mount on it or create a symlink if the autofs config/map requires it.
I didn't think there would be potential simplification by setting and clearing the DCACHE_NEED_AUTOMOUNT flag based on it being a directory (mountpoint) or a symlink so the flag is always left set. Although, as you point out, symlinks won't actually trigger mounts so the flag being left set when the dentry is a symlink is due to lazyness, since there's nothing to gain. If you can see potential simplification in the VFS code by managing this flag better then that would be worth while."
Anyway, intermediate state of the series is in #work.namei right now, and some _very_ interesting possibilities open up. It definitely needs more massage around __follow_mount_rcu() (as it is, the fastpath in there is still too twisted). Said that
- call graph is less convoluted
- follow_managed() calls are folded into
step_into(). Interface: int step_into(nd, flags, dentry, inode, seq), with inode/seq used only if we are in RCU mode.
- ".." still doesn't use that; it probably ought to.
- lookup_fast() doesn't take path - nd, &inode, &seq and
returns dentry
- lookup_open() and fs/namei.c:atomic_open() get similar
treatment
- don't take path, return dentry.
- calls of follow_managed()/step_into() combination returning 1
are always followed by get_link(), and very shortly, at that. So much that we can realistically merge pick_link() (in the end of step_into()) with get_link(). That merge is NOT done in this branch yet.
The last one promises to get rid of a rather unpleasant group of calling conventions. Right now we have several functions (step_into()/ walk_component()/lookup_last()/do_last()) with the following calling conventions: -E... => error 0 => non-symlink or symlink not followed; nd->path points to it 1 => picked a symlink to follow; its mount/dentry/seq has been pushed on nd->stack[]; its inode is stashed into nd->link_inode for subsequent get_link() to pick. nd->path is left unchanged.
That way all of those become ERR_PTR(-E...) => error NULL => non-symlink, symlink not followed or a pure jump (bare "/" or procfs ones); nd->path points to where we end up string => symlink being followed; the sucker's pushed to stack, initial jump (if any) has been handled and the string returned is what we need to traverse.
IMO it's less arbitrary that way. More importantly, the separation between step_into() committing to symlink traversal and (inevitably following) get_link() is gone - it's one operation after that change. No nd-
link_inode
either - it's only needed to carry the information from pick_link() to the next get_link().
Loops turn into while (!(err = link_path_walk(nd, s)) && (s = lookup_last(nd)) != NULL) ; and while (!(err = link_path_walk(nd, s)) && (s = do_last(nd, file, op)) != NULL) ;
trailing_symlink() goes away (folded into pick_link()/get_link() combo, conditional upon nd->depth at the entry). And in link_path_walk() we'll have if (unlikely(!*name)) { /* pathname body, done */ if (!nd->depth) return 0; name = nd->stack[nd->depth - 1].name; /* trailing symlink, done */ if (!name) return 0; /* last component of nested symlink */ s = walk_component(nd, WALK_FOLLOW); } else { /* not the last component */ s = walk_component(nd, WALK_FOLLOW | WALK_MORE); } if (s) { if (IS_ERR(s)) return PTR_ERR(s); /* a symlink to follow */ nd->stack[nd->depth - 1].name = name; name = s; continue; }
Anyway, before I try that one I'm going to fold path_openat2() into that series - that step is definitely going to require some massage there; it's too close to get_link() changes done in Aleksa's series.
If we do that, we get a single primitive for "here's the result of lookup; traverse mounts and either move into the result or, if it's a symlink that needs to be traversed, start the symlink traversal - jump into the base position for it (if needed) and return the pathname that needs to be handled". As it is, mainline has that logics spread over about a dozen locations...
Diffstat at the moment: fs/autofs/dev-ioctl.c | 6 +- fs/internal.h | 1 - fs/namei.c | 460 ++++++++++++++------------------------
fs/namespace.c | 97 +++++++---- fs/nfs/nfstrace.h | 2 - fs/open.c | 4 +- include/linux/namei.h | 3 +- 7 files changed, 197 insertions(+), 376 deletions(-)
In the current form the sucker appears to work (so far - about 30% into the usual xfstests run) without visible slowdowns...
Ok, I'll have a look at that branch, ;)
Ian
On Mon, 2020-01-13 at 10:59 +0800, Ian Kent wrote:
- is _anything_ besides root directory ever created in direct
autofs superblocks by anyone? If not, why does autofs_lookup() even bother to do anything there? IOW, why not have it return ERR_PTR(-ENOENT) immediately for direct ones? Or am I missing something and it is, in fact, possible to have the daemon create something in those?
Short answer is no, longer answer is directories "shouldn't" ever be created inside direct mount points.
The thing is that the multi-mount map construct can be used with direct mounts too, but they must always have a real mount at the base because they are direct mounts. So processes should not be able to walk into them while they are being mounted (constructed).
But I'm pretty sure it's rare (maybe not done at all) that this map construct is used with direct mounts.
This isn't right.
There's actually nothing stopping a user from using a direct map entry that's a multi-mount without an actual mount at its root. So there could be directories created under these, it's just not usually done.
I'm pretty sure I don't check and disallow this.
Ian
On Tue, Jan 14, 2020 at 08:25:19AM +0800, Ian Kent wrote:
This isn't right.
There's actually nothing stopping a user from using a direct map entry that's a multi-mount without an actual mount at its root. So there could be directories created under these, it's just not usually done.
I'm pretty sure I don't check and disallow this.
IDGI... How the hell will that work in v5? Who will set _any_ traps outside the one in root in that scenario? autofs_lookup() won't (there it's conditional upon indirect mount). Neither will autofs_dir_mkdir() (conditional upon version being less than 5). Who will, then?
Confused...
On Tue, 2020-01-14 at 04:39 +0000, Al Viro wrote:
On Tue, Jan 14, 2020 at 08:25:19AM +0800, Ian Kent wrote:
This isn't right.
There's actually nothing stopping a user from using a direct map entry that's a multi-mount without an actual mount at its root. So there could be directories created under these, it's just not usually done.
I'm pretty sure I don't check and disallow this.
IDGI... How the hell will that work in v5? Who will set _any_ traps outside the one in root in that scenario? autofs_lookup() won't (there it's conditional upon indirect mount). Neither will autofs_dir_mkdir() (conditional upon version being less than 5). Who will, then?
Confused...
It's easy to miss.
For autofs type direct and offset mounts the flags are set at fill super time.
They have to be set then because they are direct mounts and offset mounts behave the same as direct mounts so they need to be set then too. So, like direct mounts, offset mounts are each distinct autofs (trigger) mounts.
I could check for this construct and refuse it if that's really needed. I'm pretty sure this map construct isn't much used by people using direct mounts.
Ian
On Tue, 2020-01-14 at 13:01 +0800, Ian Kent wrote:
On Tue, 2020-01-14 at 04:39 +0000, Al Viro wrote:
On Tue, Jan 14, 2020 at 08:25:19AM +0800, Ian Kent wrote:
This isn't right.
There's actually nothing stopping a user from using a direct map entry that's a multi-mount without an actual mount at its root. So there could be directories created under these, it's just not usually done.
I'm pretty sure I don't check and disallow this.
IDGI... How the hell will that work in v5? Who will set _any_ traps outside the one in root in that scenario? autofs_lookup() won't (there it's conditional upon indirect mount). Neither will autofs_dir_mkdir() (conditional upon version being less than 5). Who will, then?
Confused...
It's easy to miss.
For autofs type direct and offset mounts the flags are set at fill super time.
They have to be set then because they are direct mounts and offset mounts behave the same as direct mounts so they need to be set then too. So, like direct mounts, offset mounts are each distinct autofs (trigger) mounts.
I could check for this construct and refuse it if that's really needed. I'm pretty sure this map construct isn't much used by people using direct mounts.
Ok, once again I'm not exactly accurate is some of what I said.
It turns out that the autofs connectathon tests, one of the tests that I use, does test direct mounts with offsets both with and without a real mount at the base of the mount.
Based on that, I have to say this map construct is meant to be supported with Sun format maps of autofs (even though I think it's probably not used much).
So not allowing it is probably the wrong thing to do.
OTOH initial testing with the #work.namei branch shows these are functioning as required.
Ian
On 2020-01-07, Linus Torvalds torvalds@linux-foundation.org wrote:
On Tue, Jan 7, 2020 at 7:13 PM Al Viro viro@zeniv.linux.org.uk wrote:
Another interesting question is whether we want O_PATH open to trigger automounts.
It does sound like they shouldn't, but as you say:
The thing is, we do *NOT* trigger them
(or traverse mountpoints) at the starting point of lookups. I believe it's a mistake (and mine, at that), but I doubt that there's anything that can be done about it at that point. It's a user-visible behaviour [..]
Hmm. I wonder how set in stone that is. We may have two decades of history of not doing it at start point of lookups, but we do *not* have two decades of history of O_PATH.
So what I think we agree would be sane behavior would be for O_PATH opens to not trigger automounts (unless there's a slash at the end, whatever), but _do_ add the mount-point traversal to the beginning of lookups.
But only do it for the actual O_PATH fd case, not the cwd/root/non-O_PATH case.
That way we maintain original behavior: if somebody overmounts your cwd, you still see the pre-mount directory on lookups, because your cwd is "under" the mount.
But if you open a file with O_PATH, and somebody does a mount _afterwards_, the openat() will see that later mount and/or do the automount.
Don't you think that would be the more sane/obvious semantics of how O_PATH should work?
If I'm understanding this proposal correctly, this would be a problem for the libpathrs use-case -- if this is done then there's no way to avoid a TOCTOU with someone mounting and the userspace program checking whether something is a mountpoint (unless you have Linux >5.6 and RESOLVE_NO_XDEV). Today, you can (in theory) do it with MNT_EXPIRE:
1. Open the candidate directory. 2. umount2(MNT_EXPIRE) the fd. * -EINVAL means it wasn't a mountpoint when we got the fd, and the fd is a stable handle to the underlying directory. * -EAGAIN or -EBUSY means that it was a mountpoint or became a mountpoint after the fd was opened (we don't care about that, but fail-safe is better here). 3. Use the fd from (1) for all operations.
Don't get me wrong, I want to fix this issue *properly* by adding some new kernel features that allow us to avoid worrying about mounts-over-magiclinks -- but on old kernels (which libpathrs cares about) I would be worried about changes like this being backported resulting in it being not possible to implement the hardening I mentioned up-thread.
On Sat, Jan 11, 2020 at 08:07:19AM +1100, Aleksa Sarai wrote:
If I'm understanding this proposal correctly, this would be a problem for the libpathrs use-case -- if this is done then there's no way to avoid a TOCTOU with someone mounting and the userspace program checking whether something is a mountpoint (unless you have Linux >5.6 and RESOLVE_NO_XDEV). Today, you can (in theory) do it with MNT_EXPIRE:
- Open the candidate directory.
- umount2(MNT_EXPIRE) the fd.
* -EINVAL means it wasn't a mountpoint when we got the fd, and the fd is a stable handle to the underlying directory.
- -EAGAIN or -EBUSY means that it was a mountpoint or became a mountpoint after the fd was opened (we don't care about that, but fail-safe is better here).
- Use the fd from (1) for all operations.
... except that foo/../bar *WILL* cross into the covering mount, on any kernel that supports ...at(2) at all, so I would be very cautious about any kind "hardening" claims in that case.
I'm not sure about Linus' proposal - it looks rather convoluted and we get a hard to describe twist of semantics in an area (procfs symlinks vs. mount traversal) on top of everything else in there...
Anyway, a couple of questions:
1) do you see any problems on your testcases with the current #fixes? That's commit 7a955b7363b8 as branch tip.
2) do you have any updates you would like to fold into stuff in #work.openat2? Right now I have a local variant of #work.namei (with fairly cosmetical change compared to vfs.git one) that merges clean with #work.openat2; I would like to do any updates/fold-ins/etc. of #work.openat2 *before* doing a merge and continuing to work on top of the merge results...
On Tue, Jan 14, 2020 at 04:57:33AM +0000, Al Viro wrote:
On Sat, Jan 11, 2020 at 08:07:19AM +1100, Aleksa Sarai wrote:
If I'm understanding this proposal correctly, this would be a problem for the libpathrs use-case -- if this is done then there's no way to avoid a TOCTOU with someone mounting and the userspace program checking whether something is a mountpoint (unless you have Linux >5.6 and RESOLVE_NO_XDEV). Today, you can (in theory) do it with MNT_EXPIRE:
- Open the candidate directory.
- umount2(MNT_EXPIRE) the fd.
* -EINVAL means it wasn't a mountpoint when we got the fd, and the fd is a stable handle to the underlying directory.
- -EAGAIN or -EBUSY means that it was a mountpoint or became a mountpoint after the fd was opened (we don't care about that, but fail-safe is better here).
- Use the fd from (1) for all operations.
... except that foo/../bar *WILL* cross into the covering mount, on any kernel that supports ...at(2) at all, so I would be very cautious about any kind "hardening" claims in that case.
I'm not sure about Linus' proposal - it looks rather convoluted and we get a hard to describe twist of semantics in an area (procfs symlinks vs. mount traversal) on top of everything else in there...
PS: one thing that might be interesting is exposing LOOKUP_DOWN via AT_... flag - it would allow to request mount traversals at the starting point explicitly. Pretty much all code needed for that is already there; all it would take is checking the flag in path_openat() and path_parentat() and having handle_lookup_down() called there, same as in path_lookupat().
A tricky question is whether such flag should affect absolute symlinks - i.e.
chdir /foo ln -s /bar barf overmount / do lookup with that flag for /bar/splat do lookup with that flag for barf/splat
Do we want the same results in both calls? The first one would traverse mounts on / and walk into /bar/splat in overmounting; the second - see no mounts whatsoever on current directory (/foo in old root), see the symlink to "/bar", jump to process' root and proceed from there, first for "bar", then "splat" in it...
On 2020-01-14, Al Viro viro@zeniv.linux.org.uk wrote:
On Sat, Jan 11, 2020 at 08:07:19AM +1100, Aleksa Sarai wrote:
If I'm understanding this proposal correctly, this would be a problem for the libpathrs use-case -- if this is done then there's no way to avoid a TOCTOU with someone mounting and the userspace program checking whether something is a mountpoint (unless you have Linux >5.6 and RESOLVE_NO_XDEV). Today, you can (in theory) do it with MNT_EXPIRE:
- Open the candidate directory.
- umount2(MNT_EXPIRE) the fd.
* -EINVAL means it wasn't a mountpoint when we got the fd, and the fd is a stable handle to the underlying directory.
- -EAGAIN or -EBUSY means that it was a mountpoint or became a mountpoint after the fd was opened (we don't care about that, but fail-safe is better here).
- Use the fd from (1) for all operations.
... except that foo/../bar *WILL* cross into the covering mount, on any kernel that supports ...at(2) at all, so I would be very cautious about any kind "hardening" claims in that case.
In the use-case I have, we would have full control over what the path being opened is (and thus you wouldn't open "foo/../bar"). But I agree that generally the MNT_EXPIRE solution is really non-ideal anyway.
Not to mention that we're still screwed when it comes to using magic-links (because if someone bind-mounts a magic-link over a magic-link there's absolutely no race-free way to be sure that we're traversing the right magic-link -- for that we'll need to have a different solution).
I'm not sure about Linus' proposal - it looks rather convoluted and we get a hard to describe twist of semantics in an area (procfs symlinks vs. mount traversal) on top of everything else in there...
Yeah, I agree.
- do you see any problems on your testcases with the current #fixes?
That's commit 7a955b7363b8 as branch tip.
I will take a quick look later today, but I'm currently at a conference.
- do you have any updates you would like to fold into stuff in
#work.openat2? Right now I have a local variant of #work.namei (with fairly cosmetical change compared to vfs.git one) that merges clean with #work.openat2; I would like to do any updates/fold-ins/etc. of #work.openat2 *before* doing a merge and continuing to work on top of the merge results...
Yes, there were two patches I sent a while ago[1]. I can re-send them if you like. The second patch switches open_how->mode to a u64, but I'm still on the fence about whether that makes sense to do...
[1]: https://lore.kernel.org/lkml/20191219105533.12508-1-cyphar@cyphar.com/
On Wed, Jan 15, 2020 at 07:01:50AM +1100, Aleksa Sarai wrote:
Yes, there were two patches I sent a while ago[1]. I can re-send them if you like. The second patch switches open_how->mode to a u64, but I'm still on the fence about whether that makes sense to do...
IMO plain __u64 is better than games with __aligned_u64 - all sizes are fixed, so...
Do you want that series folded into "open: introduce openat2(2) syscall" and "selftests: add openat2(2) selftests" or would you rather have them appended at the end of the series. Personally I'd go for "fold them in" if it had been about my code, but it's really up to you.
On 2020-01-15, Al Viro viro@zeniv.linux.org.uk wrote:
On Wed, Jan 15, 2020 at 07:01:50AM +1100, Aleksa Sarai wrote:
Yes, there were two patches I sent a while ago[1]. I can re-send them if you like. The second patch switches open_how->mode to a u64, but I'm still on the fence about whether that makes sense to do...
IMO plain __u64 is better than games with __aligned_u64 - all sizes are fixed, so...
Do you want that series folded into "open: introduce openat2(2) syscall" and "selftests: add openat2(2) selftests" or would you rather have them appended at the end of the series. Personally I'd go for "fold them in" if it had been about my code, but it's really up to you.
"fold them in" would probably be better to avoid making the mainline history confusing afterwards. Thanks.
On 2020-01-16, Aleksa Sarai cyphar@cyphar.com wrote:
On 2020-01-15, Al Viro viro@zeniv.linux.org.uk wrote:
On Wed, Jan 15, 2020 at 07:01:50AM +1100, Aleksa Sarai wrote:
Yes, there were two patches I sent a while ago[1]. I can re-send them if you like. The second patch switches open_how->mode to a u64, but I'm still on the fence about whether that makes sense to do...
IMO plain __u64 is better than games with __aligned_u64 - all sizes are fixed, so...
Do you want that series folded into "open: introduce openat2(2) syscall" and "selftests: add openat2(2) selftests" or would you rather have them appended at the end of the series. Personally I'd go for "fold them in" if it had been about my code, but it's really up to you.
"fold them in" would probably be better to avoid making the mainline history confusing afterwards. Thanks.
Also (if you prefer) I can send a v3 which uses u64s rather than aligned_u64s.
On Thu, Jan 16, 2020 at 01:34:59AM +1100, Aleksa Sarai wrote:
On 2020-01-16, Aleksa Sarai cyphar@cyphar.com wrote:
On 2020-01-15, Al Viro viro@zeniv.linux.org.uk wrote:
On Wed, Jan 15, 2020 at 07:01:50AM +1100, Aleksa Sarai wrote:
Yes, there were two patches I sent a while ago[1]. I can re-send them if you like. The second patch switches open_how->mode to a u64, but I'm still on the fence about whether that makes sense to do...
IMO plain __u64 is better than games with __aligned_u64 - all sizes are fixed, so...
Do you want that series folded into "open: introduce openat2(2) syscall" and "selftests: add openat2(2) selftests" or would you rather have them appended at the end of the series. Personally I'd go for "fold them in" if it had been about my code, but it's really up to you.
"fold them in" would probably be better to avoid making the mainline history confusing afterwards. Thanks.
Also (if you prefer) I can send a v3 which uses u64s rather than aligned_u64s.
<mode "lazy bastard"> Could you fold and resend the results of folding (i.e. replacements for two commits in question)? </mode>
The hard part is, of course, in updating commit messages ;-)
On 2020-01-14, Al Viro viro@zeniv.linux.org.uk wrote:
- do you see any problems on your testcases with the current #fixes?
That's commit 7a955b7363b8 as branch tip.
I just finished testing the few cases I reported earlier and they both appear to be fixed with the current #work.namei branch. And I don't have any troubles booting whatsoever.
OK, vfs.git #work.namei seems to survive xfstests. I think it cleans the things quite a bit, but it obviously needs more review and testing.
Review and testing would be _very_ welcome; it does a lot of massage, so there had been a plenty of opportunities to fuck up and fail to spot that. The same goes for profiling - it doesn't seem to slow the things down, but that needs to be verified.
It does include #work.openat2. Topology: 17 commits, followed by clean merge with #work.openat2, followed by 9 followups. The part is #work.openat2 is as posted by Aleksa; I can repost it, but I don't see much point. Description of the rest follows; patches themselves will be in followups.
part 1: follow_automount() cleanups and fixes.
Quite a bit of that function had been about working around the wrong calling conventions of finish_automount(). The problem is that finish_automount() misuses the primitive intended for mount(2) and friends, where we want to mount on top of the pile, even if something has managed to add to that while we'd been trying to lock the namespace. For automount that's not the right thing to do - there we want to discard whatever it was going to attach and just cross into what got mounted there in the meanwhile (most likely - the results of the same automount triggered by somebody else). Current mainline kinda-sorta manages to do that, but it's unreliable and very convoluted. Much simpler approach is to stop using lock_mount() in finish_automount() and have it bail out if something turns out to have been mounted on top where we wanted to attach. That allows to get rid of a lot of PITA in the caller. Another simplification comes from not trying to cross into the results of automount - simply ride through the next iteration of the loop and let it move into overmount.
Another thing in the same series is divorcing follow_automount() from nameidata; that'll play later when we get to unifying follow_down() with the guts of follow_managed().
4 commits, the second one fixes a hard-to-hit race. The first is a prereq for it.
1/17 do_add_mount(): lift lock_mount/unlock_mount into callers 2/17 fix automount/automount race properly 3/17 follow_automount(): get rid of dead^Wstillborn code 4/17 follow_automount() doesn't need the entire nameidata
part 2: unifying mount traversals in pathwalk.
Handling of mount traversal (follow_managed()) is currently called in a bunch of places. Each of them is shortly followed by a call of step_into() or an open-coded equivalent thereof. However, the locations of those step_into() calls are far from preceding follow_managed(); moreover, that preceding call might happen on different paths that converge to given step_into() call. It's harder to analyse that it should be (especially when it comes to liveness analysis) and it forces rather ugly calling conventions on lookup_fast()/atomic_open()/lookup_open(). The series below massages the code to the point when the calls of follow_managed() (and __follow_mount_rcu()) move into the beginning of step_into().
5/17 make build_open_flags() treat O_CREAT | O_EXCL as implying O_NOFOLLOW gets EEXIST handling in do_last() past the step_into() call there. 6/17 handle_mounts(): start building a sane wrapper for follow_managed() rather than mangling follow_managed() itself (and creating conflicts with openat2 series), add a wrapper that will absorb the required interface changes. 7/17 atomic_open(): saner calling conventions (return dentry on success) struct path passed to it is pure out parameter; only dentry part ever varies, though - mnt is always nd->path.mnt. Just return the dentry on success, and ERR_PTR(-E...) on failure. 8/17 lookup_open(): saner calling conventions (return dentry on success) propagate the same change one level up the call chain. 9/17 do_last(): collapse the call of path_to_nameidata() struct path filled in lookup_open() call is eventually given to handle_mounts(); the only use it has before that is path_to_nameidata() call in "->atomic_open() has actually opened it" case, and there path_to_nameidata() is an overkill - we are guaranteed to replace only nd->path.dentry. So have the struct path filled only immediately prior to handle_mounts(). 10/17 handle_mounts(): pass dentry in, turn path into a pure out argument now all callers of handle_mount() are directly preceded by filling struct path it gets. path->mnt is nd->path.mnt in all cases, so we can pass just the dentry instead and fill path in handle_mount() itself. Some boilerplate gone, path is pure out argument of handle_mount() now. 11/17 lookup_fast(): consolidate the RCU success case massage to gather what will become an RCU case equivalent of handle_mounts(); basically, that's what we do if revalidate succeeds in RCU case of lookup_fast(), including unlazy and fallback to handle_mounts() if __follow_mount_rcu() says "it's too tricky". 12/17 teach handle_mounts() to handle RCU mode ... and take that into handle_mount() itself. The other caller of __follow_mount_rcu() is fine with the same fallback (it just didn't bother since it's in the very beginning of pathwalk), switched to handle_mount() as well. 13/17 lookup_fast(): take mount traversal into callers Now we are getting somewhere - both RCU and non-RCU success cases of lookup_fast() are ended with the same return handle_mounts(...); move that to the callers - there it will merge with the identical calls that had been on the paths where we had to do slow lookups. lookup_fast() returns dentry now. 14/17 new step_into() flag: WALK_NOFOLLOW use step_into() instead of open-coding it in handle_lookup_down(). Add a flag for "don't follow symlinks regardless of LOOKUP_FOLLOW" for that (and eventually, I hope, for .. handling). Now *all* calls of handle_mounts() and step_into() are right next to each other. 15/17 fold handle_mounts() into step_into() ... and we can move the call of handle_mounts() into step_into(), getting a slightly saner calling conventions out of that. 16/17 LOOKUP_MOUNTPOINT: fold path_mountpointat() into path_lookupat() another payoff from 14/17 - we can teach path_lookupat() to do what path_mountpointat() used to. And kill the latter, along with its wrappers. 17/17 expand the only remaining call of path_lookup_conditional() minor cleanup - RIP path_lookup_conditional(). Only one caller left.
At that point we run out of things that can be done without textual conflicts with openat2 series. Changes so far: * mount traversal is taken into step_into(). * lookup_fast(), atomic_open() and lookup_open() calling conventions are slightly changed. All of them return dentry now, instead of returning an int and filling struct path on success. For lookup_fast() the old "0 for cache miss, 1 for cache hit" is replaced with "NULL stands for cache miss, dentry - for hit". * step_into() can be called in RCU mode as well. Takes nameidata, WALK_... flags, dentry and, in RCU case, corresponding inode and seq value. Handles mount traversals, decides whether it's a symlink to be followed. Error => returns -E...; symlink to follow => returns 1, puts symlink on stack; non-symlink or symlink not to follow => returns 0, moves nd->path to new location. * LOOKUP_MOUNTPOINT introduced; user_path_mountpoint_at() and friends became calls of user_path_at() et.al. with LOOKUP_MOUNTPOINT in flags.
Next comes the merge with Aleksa's openat2 patchset; everything up to that point had been non-conflicting with it. That patchset has been posted earlier; it's in #work.openat2. The next series comes on top of the merge.
part 3: untangling the symlink handling.
Right now when we decide to follow a symlink it happens this way: * step_into() decides that it has been given a symlink that needs to be followed. * it calls pick_link(), which pushes the symlink on stack and returns 1 on success / -E... on error. Symlink's mount/dentry/seq is stored on stack and the inode is stashed in nd->link_inode. * step_into() passes that 1 to its callers, which proceed to pass it up the call chain for several layers. In all cases we get to get_link() call shortly afterwards. * get_link() is called, picks the inode stashed in nd->link_inode by the pick_link(), does some checks, touches the atime, etc. * get_link() either picks the link body out of inode or calls ->get_link(). If it's an absolute symlink, we move to the root and return the relative portion of the body; if it's a relative one - just return the body. If it's a procfs-style one, the call of nd_jump_link() has been made and we'd moved to whatever location is desired. And return NULL, same as we do for symlink to "/". * the caller proceeds to deal with the string returned to it.
The sequence is the same in all cases (nested symlink, trailing symlink on lookup, trailing symlink on open), but its pieces are not close to each other and the bit between the call of pick_link() and (inevitable) call of get_link() afterwards is not easy to follow. Moreover, a bunch of functions (walk_component/lookup_last/do_last) ends up with the same conventions for return values as step_into(). And those conventions (see above) are not pretty - 0/1/-E... is asking for mistakes, especially when returned 1 is used only to direct control flow on a rather twisted way to matching get_link() call. And that path can be seriously twisted. E.g. when we are trying to open /dev/stdin, we get the following sequence: * path_init() has put us into root and returned "/dev/stdin" * link_path_walk() has eventually reached /dev and left <LAST_NORM, "stdin"> in nd->last_type/nd->last * we call do_last(), which sees that we have LAST_NORM and calls lookup_fast(). Let's assume that everything is in dcache; we get the dentry of /dev/stdin and proceed to finish_lookup:, where we call step_into() * it's a symlink, we have LOOKUP_FOLLOW, so we decide to pick the damn thing. Into the stack it goes and we return 1. * do_last() sees 1 and returns it. * trailing_symlink() is called (in the top-level loop) and it calls get_link(). OK, we get "/proc/self/fd/0" for body, move to root again and return "proc/self/fd/0". * link_path_walk() is given that string, eventually leading us into /proc/self/fd, with <LAST_NORM, "0"> left as the component to handle. * do_last() is called, and similar to the previous case we eventually reach the call of step_into() with dentry of /proc/self/fd/0. * _now_ we can discard /dev/stdin from the stack (we'd been using its body until now). It's dropped (from step_into()) and we get to look at what we'd been given. A symlink to follow, so on the stack it goes and we return 1. * again, do_last() passes 1 to caller * trailing_symlink() is called and calls get_link(). * this time it's a procfs symlink and its ->get_link() method moves us to the mount/dentry of our stdin. And returns NULL. But the fun doesn't stop yet. * trailing_symlink() returns "" to the caller * link_path_walk() is called on that and does nothing whatsoever. * do_last() is called and sees LAST_BIND left by the get_link(). It calls handle_dots() * handle_dots() drops the symlink from stack and returns * do_last() *FINALLY* proceeds to the point after its call of step_into() (finish_open:) and gets around to opening the damn thing.
Making sense of the control flow through all of that is not fun, to put it mildly; debugging anything in that area can be a massive PITA, and this example has touched only one of 3 cases. Arguably, the worst one, but... Anyway, it turns out that this code can be massaged to considerably saner shape - both in terms of control flow and wrt calling conventions.
1/9 merging pick_link() with get_link(), part 1 prep work: move the "hardening" crap from trailing_symlink() into get_link() (conditional on the absense of LOOKUP_PARENT in nd->flags). We'll be moving the calls of get_link() around quite a bit through that series, and the next step will be to eliminate trailing_symlink(). 2/9 merging pick_link() with get_link(), part 2 fold trailing_symlink() into lookup_last() and do_last(). Now these are returning strings; it's not the final calling conventions, but it's almost there. NULL => old 0, we are done. ERR_PTR(-E...) => old -E..., we'd failed. string => old 1, and the string is the symlink body to follow. Just as for trailing_symlink(), "/" and procfs ones (where get_link() returns NULL) yield "", so the ugly song and dance with no-op trip through link_path_walk()/handle_dots() still remains. 3/9 merging pick_link() with get_link(), part 3 elimination of that round-trip. In *all* cases having get_link() return NULL on such symlinks means that we'll proceed to drop the symlink from stack and get back to the point near that get_link() call - basically, where we would be if it hadn't been a symlink at all. The path by which we are getting there depends upon the call site; the end result is the same in all cases - such symlinks (procfs ones and symlink to "/") are fully processed by the time get_link() returns, so we could as well drop them from the stack right in get_link(). Makes life simpler in terms of control flow analysis... And now the calling conventions for do_last() and lookup_last() have reached the final shape - ERR_PTR(-E...) for error, NULL for "we are done", string for "traverse this". 4/9 merging pick_link() with get_link(), part 4 now all calls of walk_component() are followed by the same boilerplate - "if it has returned 1, call get_link() and if that has returned NULL treat that as if walk_component() has returned 0". Eliminate by folding that into walk_component() itself. Now walk_component() return value conventions have joined those of do_last()/lookup_last(). 5/9 merging pick_link() with get_link(), part 5 same as for the previous, only this time the boilerplate migrates one level down, into step_into(). Only one caller of get_link() left, step_into() has joined the same return value conventions. 6/9 merging pick_link() with get_link(), part 6 move that thing into pick_link(). Now all traces of "return 1 if we are following a symlink" are gone. 7/9 finally fold get_link() into pick_link() ta-da - expand get_link() into the only caller. As a side benefit, we get rid of stashing the inode in nd->link_inode - it was done only to carry that piece of information from pick_link() to eventual get_link(). That's not the main benefit, though - the control flow became considerably easier to reason about.
For what it's worth, the example above (/dev/stdin) becomes * path_init() has put us into root and returned "/dev/stdin" * link_path_walk() has eventually reached /dev and left <LAST_NORM, "stdin"> in nd->last_type/nd->last * we call do_last(), which sees that we have LAST_NORM and calls lookup_fast(). Let's assume that everything is in dcache; we get the dentry of /dev/stdin and proceed to finish_lookup:, where we call step_into() * it's a symlink, we have LOOKUP_FOLLOW, so we decide to pick the damn thing. On the stack it goes and we get its body. Which is "/proc/self/fd/0", so we move to root and return "proc/self/fd/0". * do_last() sees non-NULL and returns it - whether it's an error or a pathname to traverse, we hadn't reached something we'll be opening. * link_path_walk() is given that string, eventually leading us into /proc/self/fd, with <LAST_NORM, "0"> left as the component to handle. * do_last() is called, and similar to the previous case we eventually reach the call of step_into() with dentry of /proc/self/fd/0. * _now_ we can discard /dev/stdin from the stack (we'd been using its body until now). It's dropped (from step_into()) and we get to look at what we'd been given. A symlink to follow, so on the stack it goes. This time it's a procfs symlink and its ->get_link() method moves us to the mount/dentry of our stdin. And returns NULL. So we drop symlink from stack and return that NULL to caller. * that NULL is returned by step_into(), same as if we had just moved to a non-symlink. * do_last() proceeds to open the damn thing.
part 4. some mount traversal cleanups.
8/9 massage __follow_mount_rcu() a bit make it more similar to non-RCU counterpart 9/9 new helper: traverse_mounts() the guts of follow_managed() are very similar to follow_down(). The calling conventions are different (follow_managed() works with nameidata, follow_down() - with standalone struct path), but the core loop is pretty much the same in both. Turned that loop into a common helper (traverse_mounts()) and since follow_managed() becomes a very thin wrapper around it, expand follow_managed() at its only call site (in handle_mounts()),
That's where the series stands right now. FWIW, at 5.5-rc1 fs/namei.c had been 4867 lines, at the tip of #work.openat2 - 4998, at the tip of #work.namei (containing #work.openat2) - 4730... And IMO the thing has become considerably easier to follow.
What's more, it might be possible to untangle the control flow in do_last() now. Probably a separate series, though - do_last() is one hell of a tarpit, so I'm not stepping into it for the rest of this cycle...
On Sun, 2020-01-19 at 03:14 +0000, Al Viro wrote:
OK, vfs.git #work.namei seems to survive xfstests. I think it cleans the things quite a bit, but it obviously needs more review and testing.
Review and testing would be _very_ welcome; it does a lot of massage, so there had been a plenty of opportunities to fuck up and fail to spot that. The same goes for profiling - it doesn't seem to slow the things down, but that needs to be verified.
I have run my usual tests (the second run of my submount-test is still going) and they have run through fine.
I spend what time I can looking through the series tomorrow but will probably need to complete that when I return from my trip to Albany (Western Australia) some time on Friday.
It does include #work.openat2. Topology: 17 commits, followed by clean merge with #work.openat2, followed by 9 followups. The part is #work.openat2 is as posted by Aleksa; I can repost it, but I don't see much point. Description of the rest follows; patches themselves will be in followups.
part 1: follow_automount() cleanups and fixes.
Quite a bit of that function had been about working around the wrong calling conventions of finish_automount(). The problem is that finish_automount() misuses the primitive intended for mount(2) and friends, where we want to mount on top of the pile, even if something has managed to add to that while we'd been trying to lock the namespace. For automount that's not the right thing to do - there we want to discard whatever it was going to attach and just cross into what got mounted there in the meanwhile (most likely - the results of the same automount triggered by somebody else). Current mainline kinda-sorta manages to do that, but it's unreliable and very convoluted. Much simpler approach is to stop using lock_mount() in finish_automount() and have it bail out if something turns out to have been mounted on top where we wanted to attach. That allows to get rid of a lot of PITA in the caller. Another simplification comes from not trying to cross into the results of automount - simply ride through the next iteration of the loop and let it move into overmount.
Another thing in the same series is divorcing follow_automount() from nameidata; that'll play later when we get to unifying follow_down() with the guts of follow_managed().
4 commits, the second one fixes a hard-to-hit race. The first is a prereq for it.
1/17 do_add_mount(): lift lock_mount/unlock_mount into callers 2/17 fix automount/automount race properly 3/17 follow_automount(): get rid of dead^Wstillborn code 4/17 follow_automount() doesn't need the entire nameidata
part 2: unifying mount traversals in pathwalk.
Handling of mount traversal (follow_managed()) is currently called in a bunch of places. Each of them is shortly followed by a call of step_into() or an open-coded equivalent thereof. However, the locations of those step_into() calls are far from preceding follow_managed(); moreover, that preceding call might happen on different paths that converge to given step_into() call. It's harder to analyse that it should be (especially when it comes to liveness analysis) and it forces rather ugly calling conventions on lookup_fast()/atomic_open()/lookup_open(). The series below massages the code to the point when the calls of follow_managed() (and __follow_mount_rcu()) move into the beginning of step_into().
5/17 make build_open_flags() treat O_CREAT | O_EXCL as implying O_NOFOLLOW gets EEXIST handling in do_last() past the step_into() call there. 6/17 handle_mounts(): start building a sane wrapper for follow_managed() rather than mangling follow_managed() itself (and creating conflicts with openat2 series), add a wrapper that will absorb the required interface changes. 7/17 atomic_open(): saner calling conventions (return dentry on success) struct path passed to it is pure out parameter; only dentry part ever varies, though - mnt is always nd->path.mnt. Just return the dentry on success, and ERR_PTR(-E...) on failure. 8/17 lookup_open(): saner calling conventions (return dentry on success) propagate the same change one level up the call chain. 9/17 do_last(): collapse the call of path_to_nameidata() struct path filled in lookup_open() call is eventually given to handle_mounts(); the only use it has before that is path_to_nameidata() call in "->atomic_open() has actually opened it" case, and there path_to_nameidata() is an overkill - we are guaranteed to replace only nd->path.dentry. So have the struct path filled only immediately prior to handle_mounts(). 10/17 handle_mounts(): pass dentry in, turn path into a pure out argument now all callers of handle_mount() are directly preceded by filling struct path it gets. path->mnt is nd->path.mnt in all cases, so we can pass just the dentry instead and fill path in handle_mount() itself. Some boilerplate gone, path is pure out argument of handle_mount() now. 11/17 lookup_fast(): consolidate the RCU success case massage to gather what will become an RCU case equivalent of handle_mounts(); basically, that's what we do if revalidate succeeds in RCU case of lookup_fast(), including unlazy and fallback to handle_mounts() if __follow_mount_rcu() says "it's too tricky". 12/17 teach handle_mounts() to handle RCU mode ... and take that into handle_mount() itself. The other caller of __follow_mount_rcu() is fine with the same fallback (it just didn't bother since it's in the very beginning of pathwalk), switched to handle_mount() as well. 13/17 lookup_fast(): take mount traversal into callers Now we are getting somewhere - both RCU and non-RCU success cases of lookup_fast() are ended with the same return handle_mounts(...); move that to the callers - there it will merge with the identical calls that had been on the paths where we had to do slow lookups. lookup_fast() returns dentry now. 14/17 new step_into() flag: WALK_NOFOLLOW use step_into() instead of open-coding it in handle_lookup_down(). Add a flag for "don't follow symlinks regardless of LOOKUP_FOLLOW" for that (and eventually, I hope, for .. handling). Now *all* calls of handle_mounts() and step_into() are right next to each other. 15/17 fold handle_mounts() into step_into() ... and we can move the call of handle_mounts() into step_into(), getting a slightly saner calling conventions out of that. 16/17 LOOKUP_MOUNTPOINT: fold path_mountpointat() into path_lookupat() another payoff from 14/17 - we can teach path_lookupat() to do what path_mountpointat() used to. And kill the latter, along with its wrappers. 17/17 expand the only remaining call of path_lookup_conditional() minor cleanup - RIP path_lookup_conditional(). Only one caller left.
At that point we run out of things that can be done without textual conflicts with openat2 series. Changes so far:
- mount traversal is taken into step_into().
- lookup_fast(), atomic_open() and lookup_open() calling
conventions are slightly changed. All of them return dentry now, instead of returning an int and filling struct path on success. For lookup_fast() the old "0 for cache miss, 1 for cache hit" is replaced with "NULL stands for cache miss, dentry - for hit".
- step_into() can be called in RCU mode as well. Takes
nameidata, WALK_... flags, dentry and, in RCU case, corresponding inode and seq value. Handles mount traversals, decides whether it's a symlink to be followed. Error => returns -E...; symlink to follow => returns 1, puts symlink on stack; non-symlink or symlink not to follow => returns 0, moves nd->path to new location.
- LOOKUP_MOUNTPOINT introduced; user_path_mountpoint_at() and
friends became calls of user_path_at() et.al. with LOOKUP_MOUNTPOINT in flags.
Next comes the merge with Aleksa's openat2 patchset; everything up to that point had been non-conflicting with it. That patchset has been posted earlier; it's in #work.openat2. The next series comes on top of the merge.
part 3: untangling the symlink handling.
Right now when we decide to follow a symlink it happens this way:
- step_into() decides that it has been given a symlink that
needs to be followed.
- it calls pick_link(), which pushes the symlink on stack and
returns 1 on success / -E... on error. Symlink's mount/dentry/seq is stored on stack and the inode is stashed in nd->link_inode.
- step_into() passes that 1 to its callers, which proceed to
pass it up the call chain for several layers. In all cases we get to get_link() call shortly afterwards.
- get_link() is called, picks the inode stashed in nd-
link_inode
by the pick_link(), does some checks, touches the atime, etc.
- get_link() either picks the link body out of inode or calls
->get_link(). If it's an absolute symlink, we move to the root and return the relative portion of the body; if it's a relative one - just return the body. If it's a procfs-style one, the call of nd_jump_link() has been made and we'd moved to whatever location is desired. And return NULL, same as we do for symlink to "/".
- the caller proceeds to deal with the string returned to it.
The sequence is the same in all cases (nested symlink, trailing symlink on lookup, trailing symlink on open), but its pieces are not close to each other and the bit between the call of pick_link() and (inevitable) call of get_link() afterwards is not easy to follow. Moreover, a bunch of functions (walk_component/lookup_last/do_last) ends up with the same conventions for return values as step_into(). And those conventions (see above) are not pretty - 0/1/-E... is asking for mistakes, especially when returned 1 is used only to direct control flow on a rather twisted way to matching get_link() call. And that path can be seriously twisted. E.g. when we are trying to open /dev/stdin, we get the following sequence:
- path_init() has put us into root and returned "/dev/stdin"
- link_path_walk() has eventually reached /dev and left
<LAST_NORM, "stdin"> in nd->last_type/nd->last
- we call do_last(), which sees that we have LAST_NORM and
calls lookup_fast(). Let's assume that everything is in dcache; we get the dentry of /dev/stdin and proceed to finish_lookup:, where we call step_into()
- it's a symlink, we have LOOKUP_FOLLOW, so we decide to pick
the damn thing. Into the stack it goes and we return 1.
- do_last() sees 1 and returns it.
- trailing_symlink() is called (in the top-level loop) and it
calls get_link(). OK, we get "/proc/self/fd/0" for body, move to root again and return "proc/self/fd/0".
- link_path_walk() is given that string, eventually leading us
into /proc/self/fd, with <LAST_NORM, "0"> left as the component to handle.
- do_last() is called, and similar to the previous case we
eventually reach the call of step_into() with dentry of /proc/self/fd/0.
- _now_ we can discard /dev/stdin from the stack (we'd been
using its body until now). It's dropped (from step_into()) and we get to look at what we'd been given. A symlink to follow, so on the stack it goes and we return 1.
- again, do_last() passes 1 to caller
- trailing_symlink() is called and calls get_link().
- this time it's a procfs symlink and its ->get_link() method
moves us to the mount/dentry of our stdin. And returns NULL. But the fun doesn't stop yet.
- trailing_symlink() returns "" to the caller
- link_path_walk() is called on that and does nothing
whatsoever.
- do_last() is called and sees LAST_BIND left by the
get_link(). It calls handle_dots()
- handle_dots() drops the symlink from stack and returns
- do_last() *FINALLY* proceeds to the point after its call of
step_into() (finish_open:) and gets around to opening the damn thing.
Making sense of the control flow through all of that is not fun, to put it mildly; debugging anything in that area can be a massive PITA, and this example has touched only one of 3 cases. Arguably, the worst one, but... Anyway, it turns out that this code can be massaged to considerably saner shape - both in terms of control flow and wrt calling conventions.
1/9 merging pick_link() with get_link(), part 1 prep work: move the "hardening" crap from trailing_symlink() into get_link() (conditional on the absense of LOOKUP_PARENT in nd-
flags).
We'll be moving the calls of get_link() around quite a bit through that series, and the next step will be to eliminate trailing_symlink(). 2/9 merging pick_link() with get_link(), part 2 fold trailing_symlink() into lookup_last() and do_last(). Now these are returning strings; it's not the final calling conventions, but it's almost there. NULL => old 0, we are done. ERR_PTR(-E...) => old -E..., we'd failed. string => old 1, and the string is the symlink body to follow. Just as for trailing_symlink(), "/" and procfs ones (where get_link() returns NULL) yield "", so the ugly song and dance with no-op trip through link_path_walk()/handle_dots() still remains. 3/9 merging pick_link() with get_link(), part 3 elimination of that round-trip. In *all* cases having get_link() return NULL on such symlinks means that we'll proceed to drop the symlink from stack and get back to the point near that get_link() call - basically, where we would be if it hadn't been a symlink at all. The path by which we are getting there depends upon the call site; the end result is the same in all cases - such symlinks (procfs ones and symlink to "/") are fully processed by the time get_link() returns, so we could as well drop them from the stack right in get_link(). Makes life simpler in terms of control flow analysis... And now the calling conventions for do_last() and lookup_last() have reached the final shape - ERR_PTR(-E...) for error, NULL for "we are done", string for "traverse this". 4/9 merging pick_link() with get_link(), part 4 now all calls of walk_component() are followed by the same boilerplate - "if it has returned 1, call get_link() and if that has returned NULL treat that as if walk_component() has returned 0". Eliminate by folding that into walk_component() itself. Now walk_component() return value conventions have joined those of do_last()/lookup_last(). 5/9 merging pick_link() with get_link(), part 5 same as for the previous, only this time the boilerplate migrates one level down, into step_into(). Only one caller of get_link() left, step_into() has joined the same return value conventions. 6/9 merging pick_link() with get_link(), part 6 move that thing into pick_link(). Now all traces of "return 1 if we are following a symlink" are gone. 7/9 finally fold get_link() into pick_link() ta-da - expand get_link() into the only caller. As a side benefit, we get rid of stashing the inode in nd->link_inode - it was done only to carry that piece of information from pick_link() to eventual get_link(). That's not the main benefit, though - the control flow became considerably easier to reason about.
For what it's worth, the example above (/dev/stdin) becomes
- path_init() has put us into root and returned "/dev/stdin"
- link_path_walk() has eventually reached /dev and left
<LAST_NORM, "stdin"> in nd->last_type/nd->last
- we call do_last(), which sees that we have LAST_NORM and
calls lookup_fast(). Let's assume that everything is in dcache; we get the dentry of /dev/stdin and proceed to finish_lookup:, where we call step_into()
- it's a symlink, we have LOOKUP_FOLLOW, so we decide to pick
the damn thing. On the stack it goes and we get its body. Which is "/proc/self/fd/0", so we move to root and return "proc/self/fd/0".
- do_last() sees non-NULL and returns it - whether it's an
error or a pathname to traverse, we hadn't reached something we'll be opening.
- link_path_walk() is given that string, eventually leading us
into /proc/self/fd, with <LAST_NORM, "0"> left as the component to handle.
- do_last() is called, and similar to the previous case we
eventually reach the call of step_into() with dentry of /proc/self/fd/0.
- _now_ we can discard /dev/stdin from the stack (we'd been
using its body until now). It's dropped (from step_into()) and we get to look at what we'd been given. A symlink to follow, so on the stack it goes. This time it's a procfs symlink and its ->get_link() method moves us to the mount/dentry of our stdin. And returns NULL. So we drop symlink from stack and return that NULL to caller.
- that NULL is returned by step_into(), same as if we had just
moved to a non-symlink.
- do_last() proceeds to open the damn thing.
part 4. some mount traversal cleanups.
8/9 massage __follow_mount_rcu() a bit make it more similar to non-RCU counterpart 9/9 new helper: traverse_mounts() the guts of follow_managed() are very similar to follow_down(). The calling conventions are different (follow_managed() works with nameidata, follow_down() - with standalone struct path), but the core loop is pretty much the same in both. Turned that loop into a common helper (traverse_mounts()) and since follow_managed() becomes a very thin wrapper around it, expand follow_managed() at its only call site (in handle_mounts()),
That's where the series stands right now. FWIW, at 5.5-rc1 fs/namei.c had been 4867 lines, at the tip of #work.openat2 - 4998, at the tip of #work.namei (containing #work.openat2) - 4730... And IMO the thing has become considerably easier to follow.
What's more, it might be possible to untangle the control flow in do_last() now. Probably a separate series, though - do_last() is one hell of a tarpit, so I'm not stepping into it for the rest of this cycle...
On Fri, Jan 03, 2020 at 01:49:01AM +0000, Al Viro wrote:
On Thu, Jan 02, 2020 at 02:59:20PM +1100, Aleksa Sarai wrote:
On 2020-01-01, Al Viro viro@zeniv.linux.org.uk wrote:
On Thu, Jan 02, 2020 at 01:44:07AM +1100, Aleksa Sarai wrote:
Thanks, this fixes the issue for me (and also fixes another reproducer I found -- mounting a symlink on top of itself then trying to umount it).
Reported-by: Aleksa Sarai cyphar@cyphar.com Tested-by: Aleksa Sarai cyphar@cyphar.com
Pushed into #fixes.
Thanks. One other thing I noticed is that umount applies to the underlying symlink rather than the mountpoint on top. So, for example (using the same scripts I posted in the thread):
# ln -s /tmp/foo link # ./mount_to_symlink /etc/passwd link # umount -l link # will attempt to unmount "/tmp/foo"
Is that intentional?
It's a mess, again in mountpoint_last(). FWIW, at some point I proposed to have nd_jump_link() to fail with -ELOOP if the target was a symlink; Linus asked for reasons deeper than my dislike of the semantics, I looked around and hadn't spotted anything. And there hadn't been at the time, but when four months later umount_lookup_last() went in I failed to look for that source of potential problems in it ;-/
FWIW, since Ian appears to agree that we want ->d_manage() on the mount crossing at the end of umount(2) lookup, here's a much simpler solution - kill mountpoint_last() and switch to using lookup_last(). As a side benefit, LOOKUP_NO_REVAL also goes away. It's possible to trim the things even more (path_mountpoint() is very similar to path_lookupat() at that point, and it's not hard to make the differences conditional on something like LOOKUP_UMOUNT); I would rather do that part in the cleanups series - the one below is easier to backport.
Aleksa, Ian - could you see if the patch below works for you?
commit e56b43b971a7c08762fceab330a52b7245041dbc Author: Al Viro viro@zeniv.linux.org.uk Date: Fri Jan 10 17:17:19 2020 -0500
reimplement path_mountpoint() with less magic
... and get rid of a bunch of bugs in it. Background: the reason for path_mountpoint() is that umount() really doesn't want attempts to revalidate the root of what it's trying to umount. The thing we want to avoid actually happen from complete_walk(); solution was to do something parallel to normal path_lookupat() and it both went overboard and got the boilerplate subtly (and not so subtly) wrong.
A better solution is to do pretty much what the normal path_lookupat() does, but instead of complete_walk() do unlazy_walk(). All it takes to avoid that ->d_weak_revalidate() call... mountpoint_last() goes away, along with everything it got wrong, and so does the magic around LOOKUP_NO_REVAL.
Another source of bugs is that when we traverse mounts at the final location (and we need to do that - umount . expects to get whatever's overmounting ., if any, out of the lookup) we really ought to take care of ->d_manage() - as it is, manual umount of autofs automount in progress can lead to unpleasant surprises for the daemon. Easily solved by using handle_lookup_down() instead of follow_mount().
Signed-off-by: Al Viro viro@zeniv.linux.org.uk
diff --git a/fs/namei.c b/fs/namei.c index d6c91d1e88cb..1793661c3342 100644 --- a/fs/namei.c +++ b/fs/namei.c @@ -1649,17 +1649,15 @@ static struct dentry *__lookup_slow(const struct qstr *name, if (IS_ERR(dentry)) return dentry; if (unlikely(!d_in_lookup(dentry))) { - if (!(flags & LOOKUP_NO_REVAL)) { - int error = d_revalidate(dentry, flags); - if (unlikely(error <= 0)) { - if (!error) { - d_invalidate(dentry); - dput(dentry); - goto again; - } + int error = d_revalidate(dentry, flags); + if (unlikely(error <= 0)) { + if (!error) { + d_invalidate(dentry); dput(dentry); - dentry = ERR_PTR(error); + goto again; } + dput(dentry); + dentry = ERR_PTR(error); } } else { old = inode->i_op->lookup(inode, dentry, flags); @@ -2618,72 +2616,6 @@ int user_path_at_empty(int dfd, const char __user *name, unsigned flags, EXPORT_SYMBOL(user_path_at_empty);
/** - * mountpoint_last - look up last component for umount - * @nd: pathwalk nameidata - currently pointing at parent directory of "last" - * - * This is a special lookup_last function just for umount. In this case, we - * need to resolve the path without doing any revalidation. - * - * The nameidata should be the result of doing a LOOKUP_PARENT pathwalk. Since - * mountpoints are always pinned in the dcache, their ancestors are too. Thus, - * in almost all cases, this lookup will be served out of the dcache. The only - * cases where it won't are if nd->last refers to a symlink or the path is - * bogus and it doesn't exist. - * - * Returns: - * -error: if there was an error during lookup. This includes -ENOENT if the - * lookup found a negative dentry. - * - * 0: if we successfully resolved nd->last and found it to not to be a - * symlink that needs to be followed. - * - * 1: if we successfully resolved nd->last and found it to be a symlink - * that needs to be followed. - */ -static int -mountpoint_last(struct nameidata *nd) -{ - int error = 0; - struct dentry *dir = nd->path.dentry; - struct path path; - - /* If we're in rcuwalk, drop out of it to handle last component */ - if (nd->flags & LOOKUP_RCU) { - if (unlazy_walk(nd)) - return -ECHILD; - } - - nd->flags &= ~LOOKUP_PARENT; - - if (unlikely(nd->last_type != LAST_NORM)) { - error = handle_dots(nd, nd->last_type); - if (error) - return error; - path.dentry = dget(nd->path.dentry); - } else { - path.dentry = d_lookup(dir, &nd->last); - if (!path.dentry) { - /* - * No cached dentry. Mounted dentries are pinned in the - * cache, so that means that this dentry is probably - * a symlink or the path doesn't actually point - * to a mounted dentry. - */ - path.dentry = lookup_slow(&nd->last, dir, - nd->flags | LOOKUP_NO_REVAL); - if (IS_ERR(path.dentry)) - return PTR_ERR(path.dentry); - } - } - if (d_flags_negative(smp_load_acquire(&path.dentry->d_flags))) { - dput(path.dentry); - return -ENOENT; - } - path.mnt = nd->path.mnt; - return step_into(nd, &path, 0, d_backing_inode(path.dentry), 0); -} - -/** * path_mountpoint - look up a path to be umounted * @nd: lookup context * @flags: lookup flags @@ -2699,14 +2631,17 @@ path_mountpoint(struct nameidata *nd, unsigned flags, struct path *path) int err;
while (!(err = link_path_walk(s, nd)) && - (err = mountpoint_last(nd)) > 0) { + (err = lookup_last(nd)) > 0) { s = trailing_symlink(nd); } + if (!err) + err = unlazy_walk(nd); + if (!err) + err = handle_lookup_down(nd); if (!err) { *path = nd->path; nd->path.mnt = NULL; nd->path.dentry = NULL; - follow_mount(path); } terminate_walk(nd); return err; diff --git a/fs/nfs/nfstrace.h b/fs/nfs/nfstrace.h index f64a33d2a1d1..2a82dcce5fc1 100644 --- a/fs/nfs/nfstrace.h +++ b/fs/nfs/nfstrace.h @@ -206,7 +206,6 @@ TRACE_DEFINE_ENUM(LOOKUP_AUTOMOUNT); TRACE_DEFINE_ENUM(LOOKUP_PARENT); TRACE_DEFINE_ENUM(LOOKUP_REVAL); TRACE_DEFINE_ENUM(LOOKUP_RCU); -TRACE_DEFINE_ENUM(LOOKUP_NO_REVAL); TRACE_DEFINE_ENUM(LOOKUP_OPEN); TRACE_DEFINE_ENUM(LOOKUP_CREATE); TRACE_DEFINE_ENUM(LOOKUP_EXCL); @@ -224,7 +223,6 @@ TRACE_DEFINE_ENUM(LOOKUP_DOWN); { LOOKUP_PARENT, "PARENT" }, \ { LOOKUP_REVAL, "REVAL" }, \ { LOOKUP_RCU, "RCU" }, \ - { LOOKUP_NO_REVAL, "NO_REVAL" }, \ { LOOKUP_OPEN, "OPEN" }, \ { LOOKUP_CREATE, "CREATE" }, \ { LOOKUP_EXCL, "EXCL" }, \ diff --git a/include/linux/namei.h b/include/linux/namei.h index 7fe7b87a3ded..07bfb0874033 100644 --- a/include/linux/namei.h +++ b/include/linux/namei.h @@ -34,7 +34,6 @@ enum {LAST_NORM, LAST_ROOT, LAST_DOT, LAST_DOTDOT, LAST_BIND};
/* internal use only */ #define LOOKUP_PARENT 0x0010 -#define LOOKUP_NO_REVAL 0x0080 #define LOOKUP_JUMPED 0x1000 #define LOOKUP_ROOT 0x2000 #define LOOKUP_ROOT_GRABBED 0x0008
On Fri, 2020-01-10 at 23:19 +0000, Al Viro wrote:
On Fri, Jan 03, 2020 at 01:49:01AM +0000, Al Viro wrote:
On Thu, Jan 02, 2020 at 02:59:20PM +1100, Aleksa Sarai wrote:
On 2020-01-01, Al Viro viro@zeniv.linux.org.uk wrote:
On Thu, Jan 02, 2020 at 01:44:07AM +1100, Aleksa Sarai wrote:
Thanks, this fixes the issue for me (and also fixes another reproducer I found -- mounting a symlink on top of itself then trying to umount it).
Reported-by: Aleksa Sarai cyphar@cyphar.com Tested-by: Aleksa Sarai cyphar@cyphar.com
Pushed into #fixes.
Thanks. One other thing I noticed is that umount applies to the underlying symlink rather than the mountpoint on top. So, for example (using the same scripts I posted in the thread):
# ln -s /tmp/foo link # ./mount_to_symlink /etc/passwd link # umount -l link # will attempt to unmount "/tmp/foo"
Is that intentional?
It's a mess, again in mountpoint_last(). FWIW, at some point I proposed to have nd_jump_link() to fail with -ELOOP if the target was a symlink; Linus asked for reasons deeper than my dislike of the semantics, I looked around and hadn't spotted anything. And there hadn't been at the time, but when four months later umount_lookup_last() went in I failed to look for that source of potential problems in it ;-/
FWIW, since Ian appears to agree that we want ->d_manage() on the mount crossing at the end of umount(2) lookup, here's a much simpler solution - kill mountpoint_last() and switch to using lookup_last(). As a side benefit, LOOKUP_NO_REVAL also goes away. It's possible to trim the things even more (path_mountpoint() is very similar to path_lookupat() at that point, and it's not hard to make the differences conditional on something like LOOKUP_UMOUNT); I would rather do that part in the cleanups series - the one below is easier to backport.
Aleksa, Ian - could you see if the patch below works for you?
I did try this patch and I was trying to work out why it didn't work. But thought I'd let you know what I saw.
Applying it to current Linus tree systemd stops at switch root.
Not sure what causes that, I couldn't see any reason for it.
I see you have a development branch in your repo. I'll have a look at that rather than continue with this.
commit e56b43b971a7c08762fceab330a52b7245041dbc Author: Al Viro viro@zeniv.linux.org.uk Date: Fri Jan 10 17:17:19 2020 -0500
reimplement path_mountpoint() with less magic
... and get rid of a bunch of bugs in it. Background: the reason for path_mountpoint() is that umount() really doesn't want attempts to revalidate the root of what it's trying to umount. The thing we want to avoid actually happen from complete_walk(); solution was to do something parallel to normal path_lookupat() and it both went overboard and got the boilerplate subtly (and not so subtly) wrong. A better solution is to do pretty much what the normal path_lookupat() does, but instead of complete_walk() do unlazy_walk(). All it takes to avoid that ->d_weak_revalidate() call... mountpoint_last() goes away, along with everything it got wrong, and so does the magic around LOOKUP_NO_REVAL. Another source of bugs is that when we traverse mounts at the final location (and we need to do that - umount . expects to get whatever's overmounting ., if any, out of the lookup) we really ought to take care of ->d_manage() - as it is, manual umount of autofs automount in progress can lead to unpleasant surprises for the daemon. Easily solved by using handle_lookup_down() instead of follow_mount(). Signed-off-by: Al Viro viro@zeniv.linux.org.uk
diff --git a/fs/namei.c b/fs/namei.c index d6c91d1e88cb..1793661c3342 100644 --- a/fs/namei.c +++ b/fs/namei.c @@ -1649,17 +1649,15 @@ static struct dentry *__lookup_slow(const struct qstr *name, if (IS_ERR(dentry)) return dentry; if (unlikely(!d_in_lookup(dentry))) {
if (!(flags & LOOKUP_NO_REVAL)) {
int error = d_revalidate(dentry, flags);
if (unlikely(error <= 0)) {
if (!error) {
d_invalidate(dentry);
dput(dentry);
goto again;
}
int error = d_revalidate(dentry, flags);
if (unlikely(error <= 0)) {
if (!error) {
d_invalidate(dentry); dput(dentry);
dentry = ERR_PTR(error);
goto again; }
dput(dentry);
} } else { old = inode->i_op->lookup(inode, dentry, flags);dentry = ERR_PTR(error);
@@ -2618,72 +2616,6 @@ int user_path_at_empty(int dfd, const char __user *name, unsigned flags, EXPORT_SYMBOL(user_path_at_empty); /**
- mountpoint_last - look up last component for umount
- @nd: pathwalk nameidata - currently pointing at parent
directory of "last"
- This is a special lookup_last function just for umount. In this
case, we
- need to resolve the path without doing any revalidation.
- The nameidata should be the result of doing a LOOKUP_PARENT
pathwalk. Since
- mountpoints are always pinned in the dcache, their ancestors are
too. Thus,
- in almost all cases, this lookup will be served out of the
dcache. The only
- cases where it won't are if nd->last refers to a symlink or the
path is
- bogus and it doesn't exist.
- Returns:
- -error: if there was an error during lookup. This includes
-ENOENT if the
lookup found a negative dentry.
- 0: if we successfully resolved nd->last and found it to not
to be a
symlink that needs to be followed.
- 1: if we successfully resolved nd->last and found it to be a
symlink
that needs to be followed.
- */
-static int -mountpoint_last(struct nameidata *nd) -{
- int error = 0;
- struct dentry *dir = nd->path.dentry;
- struct path path;
- /* If we're in rcuwalk, drop out of it to handle last component
*/
- if (nd->flags & LOOKUP_RCU) {
if (unlazy_walk(nd))
return -ECHILD;
- }
- nd->flags &= ~LOOKUP_PARENT;
- if (unlikely(nd->last_type != LAST_NORM)) {
error = handle_dots(nd, nd->last_type);
if (error)
return error;
path.dentry = dget(nd->path.dentry);
- } else {
path.dentry = d_lookup(dir, &nd->last);
if (!path.dentry) {
/*
* No cached dentry. Mounted dentries are
pinned in the
* cache, so that means that this dentry is
probably
* a symlink or the path doesn't actually point
* to a mounted dentry.
*/
path.dentry = lookup_slow(&nd->last, dir,
nd->flags |
LOOKUP_NO_REVAL);
if (IS_ERR(path.dentry))
return PTR_ERR(path.dentry);
}
- }
- if (d_flags_negative(smp_load_acquire(&path.dentry->d_flags)))
{
dput(path.dentry);
return -ENOENT;
- }
- path.mnt = nd->path.mnt;
- return step_into(nd, &path, 0, d_backing_inode(path.dentry),
0); -}
-/**
- path_mountpoint - look up a path to be umounted
- @nd: lookup context
- @flags: lookup flags
@@ -2699,14 +2631,17 @@ path_mountpoint(struct nameidata *nd, unsigned flags, struct path *path) int err; while (!(err = link_path_walk(s, nd)) &&
(err = mountpoint_last(nd)) > 0) {
s = trailing_symlink(nd); }(err = lookup_last(nd)) > 0) {
- if (!err)
err = unlazy_walk(nd);
- if (!err)
if (!err) { *path = nd->path; nd->path.mnt = NULL; nd->path.dentry = NULL;err = handle_lookup_down(nd);
} terminate_walk(nd); return err;follow_mount(path);
diff --git a/fs/nfs/nfstrace.h b/fs/nfs/nfstrace.h index f64a33d2a1d1..2a82dcce5fc1 100644 --- a/fs/nfs/nfstrace.h +++ b/fs/nfs/nfstrace.h @@ -206,7 +206,6 @@ TRACE_DEFINE_ENUM(LOOKUP_AUTOMOUNT); TRACE_DEFINE_ENUM(LOOKUP_PARENT); TRACE_DEFINE_ENUM(LOOKUP_REVAL); TRACE_DEFINE_ENUM(LOOKUP_RCU); -TRACE_DEFINE_ENUM(LOOKUP_NO_REVAL); TRACE_DEFINE_ENUM(LOOKUP_OPEN); TRACE_DEFINE_ENUM(LOOKUP_CREATE); TRACE_DEFINE_ENUM(LOOKUP_EXCL); @@ -224,7 +223,6 @@ TRACE_DEFINE_ENUM(LOOKUP_DOWN); { LOOKUP_PARENT, "PARENT" }, \ { LOOKUP_REVAL, "REVAL" }, \ { LOOKUP_RCU, "RCU" }, \
{ LOOKUP_NO_REVAL, "NO_REVAL" }, \ { LOOKUP_OPEN, "OPEN" }, \ { LOOKUP_CREATE, "CREATE" }, \ { LOOKUP_EXCL, "EXCL" }, \
diff --git a/include/linux/namei.h b/include/linux/namei.h index 7fe7b87a3ded..07bfb0874033 100644 --- a/include/linux/namei.h +++ b/include/linux/namei.h @@ -34,7 +34,6 @@ enum {LAST_NORM, LAST_ROOT, LAST_DOT, LAST_DOTDOT, LAST_BIND}; /* internal use only */ #define LOOKUP_PARENT 0x0010 -#define LOOKUP_NO_REVAL 0x0080 #define LOOKUP_JUMPED 0x1000 #define LOOKUP_ROOT 0x2000 #define LOOKUP_ROOT_GRABBED 0x0008
On Mon, Jan 13, 2020 at 09:48:23AM +0800, Ian Kent wrote:
I did try this patch and I was trying to work out why it didn't work. But thought I'd let you know what I saw.
Applying it to current Linus tree systemd stops at switch root.
Not sure what causes that, I couldn't see any reason for it.
Wait a minute... So you are seeing problems early in the boot, before any autofs ioctls might come into play?
Sigh... Guess I'll have to dig that Fedora KVM image out and try to see what it's about... ;-/ Here comes a couple of hours of build...
On Mon, 2020-01-13 at 03:54 +0000, Al Viro wrote:
On Mon, Jan 13, 2020 at 09:48:23AM +0800, Ian Kent wrote:
I did try this patch and I was trying to work out why it didn't work. But thought I'd let you know what I saw.
Applying it to current Linus tree systemd stops at switch root.
Not sure what causes that, I couldn't see any reason for it.
Wait a minute... So you are seeing problems early in the boot, before any autofs ioctls might come into play?
I did, then I checked it booted without the patch, then tried building from scratch with the patch twice and same thing happened each time.
Looked like this, such as it is: [ OK ] Reached target Switch Root. [ OK ] Started Plymouth switch root service. Starting Switch Root...
I don't have any evidence but thought it might be this: https://github.com/karelzak/util-linux/blob/master/sys-utils/switch_root.c
Mind you, that's not the actual systemd repo. either I probably need to look a lot deeper (and at the actual systemd repo) to work out what's actually being called.
Sigh... Guess I'll have to dig that Fedora KVM image out and try to see what it's about... ;-/ Here comes a couple of hours of build...
On Mon, 2020-01-13 at 14:00 +0800, Ian Kent wrote:
On Mon, 2020-01-13 at 03:54 +0000, Al Viro wrote:
On Mon, Jan 13, 2020 at 09:48:23AM +0800, Ian Kent wrote:
I did try this patch and I was trying to work out why it didn't work. But thought I'd let you know what I saw.
Applying it to current Linus tree systemd stops at switch root.
Not sure what causes that, I couldn't see any reason for it.
Wait a minute... So you are seeing problems early in the boot, before any autofs ioctls might come into play?
I did, then I checked it booted without the patch, then tried building from scratch with the patch twice and same thing happened each time.
Looked like this, such as it is: [ OK ] Reached target Switch Root. [ OK ] Started Plymouth switch root service. Starting Switch Root...
I don't have any evidence but thought it might be this: https://github.com/karelzak/util-linux/blob/master/sys-utils/switch_root.c
Oh wait, for systemd I was actually looking at: https://github.com/systemd/systemd/blob/master/src/shared/switch-root.c
Mind you, that's not the actual systemd repo. either I probably need to look a lot deeper (and at the actual systemd repo) to work out what's actually being called.
Sigh... Guess I'll have to dig that Fedora KVM image out and try to see what it's about... ;-/ Here comes a couple of hours of build...
On Mon, Jan 13, 2020 at 02:03:00PM +0800, Ian Kent wrote:
Oh wait, for systemd I was actually looking at: https://github.com/systemd/systemd/blob/master/src/shared/switch-root.c
Mind you, that's not the actual systemd repo. either I probably need to look a lot deeper (and at the actual systemd repo) to work out what's actually being called.
Sigh... Guess I'll have to dig that Fedora KVM image out and try to see what it's about... ;-/ Here comes a couple of hours of build...
D'oh... And yes, that would've been a bisect hazard - switch to path_lookupat() later in the series gets rid of that. Incremental (to be foldede, of course):
diff --git a/fs/namei.c b/fs/namei.c index 1793661c3342..204677c37751 100644 --- a/fs/namei.c +++ b/fs/namei.c @@ -2634,7 +2634,7 @@ path_mountpoint(struct nameidata *nd, unsigned flags, struct path *path) (err = lookup_last(nd)) > 0) { s = trailing_symlink(nd); } - if (!err) + if (!err && (nd->flags & LOOKUP_RCU)) err = unlazy_walk(nd); if (!err) err = handle_lookup_down(nd);
On Mon, 2020-01-13 at 13:30 +0000, Al Viro wrote:
On Mon, Jan 13, 2020 at 02:03:00PM +0800, Ian Kent wrote:
Oh wait, for systemd I was actually looking at: https://github.com/systemd/systemd/blob/master/src/shared/switch-root.c
Mind you, that's not the actual systemd repo. either I probably need to look a lot deeper (and at the actual systemd repo) to work out what's actually being called.
Sigh... Guess I'll have to dig that Fedora KVM image out and try to see what it's about... ;-/ Here comes a couple of hours of build...
D'oh... And yes, that would've been a bisect hazard - switch to path_lookupat() later in the series gets rid of that. Incremental (to be foldede, of course):
diff --git a/fs/namei.c b/fs/namei.c index 1793661c3342..204677c37751 100644 --- a/fs/namei.c +++ b/fs/namei.c @@ -2634,7 +2634,7 @@ path_mountpoint(struct nameidata *nd, unsigned flags, struct path *path) (err = lookup_last(nd)) > 0) { s = trailing_symlink(nd); }
- if (!err)
- if (!err && (nd->flags & LOOKUP_RCU)) err = unlazy_walk(nd); if (!err) err = handle_lookup_down(nd);
Ok, so I've tested with the updated patch.
The autofs connectathon tests I use function fine.
I also tested sending a SIGKILL to the daemon with about 180 active mounts and restarted the daemon to test the function of the ioctls that Al was concerned about.
While the connectathon test expired everything I had 3 mounts left after allowing sufficient expire time with the SIGKILL test.
Those mounts correspond to one map entry that has a mix of NFS vers=3 and vers=2 mount options and NFSv2 isn't supported by the servers I use in testing.
I'm inclined to think this is a bug in the automount mount tree re-connection code rather than a problem with this patch since all the other mounts, some simple and others with not so simple constructs, expired fine after automount re-connected to them.
There are two other map entries that have an NFS vers=2 option but they are simple mounts that will fail on attempting the automount because the server doesn't support v2 so they don't end up with mounts to reconnect to.
This particular map entry, having a mix of NFS vers=3 and vers=2 in the offsets of the entry, will lead to a partial mount of the map entry which is probably not being handled properly by automount when re-connecting to the mounts in the tree.
So I think the patch here is fine from an autofs POV.
Ian
On Tue, 2020-01-14 at 15:25 +0800, Ian Kent wrote:
On Mon, 2020-01-13 at 13:30 +0000, Al Viro wrote:
On Mon, Jan 13, 2020 at 02:03:00PM +0800, Ian Kent wrote:
Oh wait, for systemd I was actually looking at: https://github.com/systemd/systemd/blob/master/src/shared/switch-root.c
Mind you, that's not the actual systemd repo. either I probably need to look a lot deeper (and at the actual systemd repo) to work out what's actually being called.
Sigh... Guess I'll have to dig that Fedora KVM image out and try to see what it's about... ;-/ Here comes a couple of hours of build...
D'oh... And yes, that would've been a bisect hazard - switch to path_lookupat() later in the series gets rid of that. Incremental (to be foldede, of course):
diff --git a/fs/namei.c b/fs/namei.c index 1793661c3342..204677c37751 100644 --- a/fs/namei.c +++ b/fs/namei.c @@ -2634,7 +2634,7 @@ path_mountpoint(struct nameidata *nd, unsigned flags, struct path *path) (err = lookup_last(nd)) > 0) { s = trailing_symlink(nd); }
- if (!err)
- if (!err && (nd->flags & LOOKUP_RCU)) err = unlazy_walk(nd); if (!err) err = handle_lookup_down(nd);
Ok, so I've tested with the updated patch.
The autofs connectathon tests I use function fine.
I also tested sending a SIGKILL to the daemon with about 180 active mounts and restarted the daemon to test the function of the ioctls that Al was concerned about.
While the connectathon test expired everything I had 3 mounts left after allowing sufficient expire time with the SIGKILL test.
Those mounts correspond to one map entry that has a mix of NFS vers=3 and vers=2 mount options and NFSv2 isn't supported by the servers I use in testing.
I'm inclined to think this is a bug in the automount mount tree re-connection code rather than a problem with this patch since all the other mounts, some simple and others with not so simple constructs, expired fine after automount re-connected to them.
There are two other map entries that have an NFS vers=2 option but they are simple mounts that will fail on attempting the automount because the server doesn't support v2 so they don't end up with mounts to reconnect to.
This particular map entry, having a mix of NFS vers=3 and vers=2 in the offsets of the entry, will lead to a partial mount of the map entry which is probably not being handled properly by automount when re-connecting to the mounts in the tree.
So I think the patch here is fine from an autofs POV.
Umm ... unfortunately further testing shows an autofs problem.
It appears to be present in the current kernel (so far I've only been able to check the current git head and an earlier kernel but can't remember the version and can't check) so I must have missed it.
I'm attempting to bisect now but managed to trash the root file system on my VM. I'll get this done as quickly as I can.
Ian
On Jan 1, 2020, at 11:44 PM, Aleksa Sarai cyphar@cyphar.com wrote:
On 2020-01-01, Al Viro viro@zeniv.linux.org.uk wrote:
On Wed, Jan 01, 2020 at 12:54:46AM +0000, Al Viro wrote: Note, BTW, that lookup_last() (aka walk_component()) does just that - we only hit step_into() on LAST_NORM. The same goes for do_last(). mountpoint_last() not doing the same is _not_ intentional - it's definitely a bug.
Consider your testcase; link points to . here. So the only thing you could expect from trying to follow it would be the directory 'link' lives in. And you don't have it when you reach the fscker via /proc/self/fd/3; what happens instead is nd->path set to ./link (by nd_jump_link()) *AND* step_into() called, pushing the same ./link onto stack. It violates all kinds of assumptions made by fs/namei.c - when pushing a symlink onto stack nd->path is expected to contain the base directory for resolving it.
I'm fairly sure that this is the cause of at least some of the insanity you've caught; there always could be something else, of course, but this hole needs to be closed in any case.
... and with removal of now unused local variable, that's
mountpoint_last(): fix the treatment of LAST_BIND
step_into() should be attempted only in LAST_NORM case, when we have the parent directory (in nd->path). We get away with that for LAST_DOT and LOST_DOTDOT, since those can't be symlinks, making step_init() and equivalent of path_to_nameidata() - we do a bit of useless work, but that's it. For LAST_BIND (i.e. the case when we'd just followed a procfs-style symlink) we really can't go there - result might be a symlink and we really can't attempt following it.
lookup_last() and do_last() do handle that properly; mountpoint_last() should do the same.
Cc: stable@vger.kernel.org Signed-off-by: Al Viro viro@zeniv.linux.org.uk
Thanks, this fixes the issue for me (and also fixes another reproducer I found -- mounting a symlink on top of itself then trying to umount it).
Reported-by: Aleksa Sarai cyphar@cyphar.com Tested-by: Aleksa Sarai cyphar@cyphar.com
As for the original topic of bind-mounting symlinks -- given this is a supported feature, would you be okay with me sending an updated O_EMPTYPATH series?
FWIW, I have an actual use case for mounting over a symlink: replacing /etc/resolv.conf. My virtme tool is presented with somewhat arbitrary crud in /etc, where /etc/resolv.conf might be a plain file or a symlink, but, regardless, has inappropriate contents. If it’s a file, I can mount a new file over it. If it’s a symlink and the kernel properly supported it, I could also mount over it.
Yes, I could also use overlayfs. Maybe I should regardless.
linux-stable-mirror@lists.linaro.org