On Tue, Jan 07, 2020 at 07:54:02PM -0800, Linus Torvalds wrote:
Another interesting question is whether we want O_PATH open to trigger automounts.
It does sound like they shouldn't, but as you say:
The thing is, we do *NOT* trigger them
(or traverse mountpoints) at the starting point of lookups. I believe it's a mistake (and mine, at that), but I doubt that there's anything that can be done about it at that point. It's a user-visible behaviour [..]
Hmm. I wonder how set in stone that is. We may have two decades of history of not doing it at start point of lookups, but we do *not* have two decades of history of O_PATH.
So what I think we agree would be sane behavior would be for O_PATH opens to not trigger automounts (unless there's a slash at the end, whatever), but _do_ add the mount-point traversal to the beginning of lookups.
But only do it for the actual O_PATH fd case, not the cwd/root/non-O_PATH case.
That way we maintain original behavior: if somebody overmounts your cwd, you still see the pre-mount directory on lookups, because your cwd is "under" the mount.
But if you open a file with O_PATH, and somebody does a mount _afterwards_, the openat() will see that later mount and/or do the automount.
Don't you think that would be the more sane/obvious semantics of how O_PATH should work?
Maybe, but... note that we do not (and AFAICS never had) follow mounts on /proc/self/cwd, /proc/self/fd/42, etc. And there are very good reasons for that. First of all, if your stdin is from /tmp/foo, you'd better get that file when you open /dev/stdin, even if somebody has done mount --bind /tmp/bar /tmp/foo; another issue is with the use of stat("/proc/self/fd/42", &buf) - it should be an equivalent of fstat(42, &buf), even if somebody has overmounted that. BTW, for similar reason after link(".", "foo"); fd = open("foo", O_PATH); // return 42 we really should (and do) have resolution of /proc/self/fd/42 stop at foo, not . Reason: consistency of stat() behaviour...
The point is, we'd never followed mounts on /proc/self/cwd et.al. I hadn't checked 2.0, but 2.1.100 ('97, before any changes from me) is that way. Actually, scratch that - 2.0 behaves the same way (mountpoint crossing is done in iget() there; is that Minix influence or straight from the Lions' book?)
Hmm... Looking through the history, we have
(for reference) v7: mount traversal in iget() (forward) and namei() (back); due to the way it's done, forward traversal happens * at starting point * after any component (. and .. included) * on results of forward traversal (due to a loop in iget()). Back traversal (to covered on .. from root directory) is also to unlimited depth.
0.01: no mount handling
0.10: forward traversal in iget(), back traversal in fs/namei.c:find_entry() (not by Lions' Book, then - v6 didn't do back traversals at all). Forward traversal * after any component (. and .. included) No traversal on starting point, no traversal on result of traversal. OTOH, mount(2) refuses to mount on top of root, so the lack of the last one is not an issue.
0.12: symlinks added; no mount traversal on starting point of those either. We start at the process' root for absolute ones, even if it happens to be overmounted, and we start from parent for relative ones. The latter matters only if we were in the beginning of the pathwalk, since anything else would've traversed mounts back when we'd picked said parent. Mount traversal takes precedence over symlink traversal, but that's not an issue since mount follows links on mountpoint. It does not, at that point, reject fs image with symlink for root, but that actually more or less works.
0.97.3: same, with addition of procfs symlinks. No mount crossing on their targets (for normal symlinks we don't do mount crossing in the beginning and any component inside triggers mount crossing as usual; for procfs ones there's no components inside)
Situation remains essentially unchanged until 2.1.42. Next few kernels are in flux, to put it politely - initial merge had been insane and it took until 2.1.44 or so for the things to get more or less working.
At 2.1.44: forward traversal in fs/namei.c:lookup(), back traversal in fs/namei.c:reserved_lookup(). Otherwise the same behaviour as pre-dcache (wrt mount traversals, that is).
2.1.51pre1: forward traversal moved into real_lookup() and __d_lookup(). Forward traversal happens *ONLY* after normal components - not after . or ..
2.1.61: forward traversal moved into follow_mount(), behaviour reverted to pre-dcache one.
Previous is from reading through the historical trees; my involvement started circa 2.1.120-something.
2.3.50pre3: call of follow_mount() moved a bit, reverting to 2.1.51pre1 behaviour (nor traversal on . or ..) *again*. Not sure whose idea had that been - might've been mine, but unlike the other patch that went into fs/namei.c in the same release, I hadn't been able to find anything related to that one. If your memories (or mail archives) are better...
2.3.99pre4-5: massive surgery in there. Preparations to allowing mount on top of mount; forward traversal adjusted accordingly, back traversal still isn't.
2.3.99pre7-1: more surgery, back traversals are also to unlimited depth now and mount on top of mount has been allowed.
2.3.99pre9-4: mount --bind taught to mount non-directories on top of non-directories. At that point it does *NOT* follow trailing symlinks, so mounting of symlinks and mounting on top of symlinks becomes possible. Mount traversal still takes precedence over symlink traversal, symlink traversal of mount traversal result still generally works, even though it's not something I considered at the time.
v2.4.5.2: mount --bind started to follow symlinks. So that source of mounting of and on the symlinks was no more.
2.5.0.5: forward mount traversal is done after .. (in handle_dotdot()). That brings back the pre-dcache behaviour for those suckers. Still no forward traversal after ., though.
At about the same time I'd been getting rid of the early-boot incestous relationships with fs/namespace.c (initramfs work) and that was probably the last time we could realistically switch to following mounts at starting point; I considered trying to do that, but decided not to. Pity, that...
2.6.5-rc2: normal mount now checks for corrupt fs with symlink for root. Since it has always been following symlinks for mountpoint, the remaining source of mounting of and on symlinks was gone; that lasted until after O_PATH introduction.
2.6.39-rc1: mount traps support - instead of abusing ->follow_link() for automounting, we have an explicit pair of methods that can be called at the same places where we traverse mounts. None too consistent - we don't do that on .. results. That was Dave Howells and Ian Kent.
2.6.39-rc1: O_PATH introduced and, later in the same series, allowed for symlinks. That has changed things - now procfs symlink targets could be symlinks themselves. Originally an attempt to follow those would blow up with -ELOOP (there's simply no good way to follow such beast; it's either "stop even if we are asked to follow" or "give an error").
3.6.0-rc1: nd_jump_link() introduction (hch) had unnoticed side effects - we'd switched from "fail traversal with -ELOOP" to "stop there". Mostly it doesn't change behaviour, but it has opened a way to mount symlinks and mount on top of symlinks. Which generally worked.
circa 3.8--3.9: side effects had been noticed; my first reaction had been "let's make nd_jump_link() return an error, then", but I hadn't been able to find good reasons when challenged to do so. Did an audit, found no obvious problems, went "oh, well - whether it works by accident or by design, it doesn't break anything".
3.12.0-rc1: lookups for umount(2) are different - we don't want revalidate on the last component. Which had been handled by introduction of path_umountat()/umount_lookup_last(), parallel to path_lookupat(). Which has gotten quite a few things wrong - it *did* try to follow symlinks obtained by following procfs ones (and blew up big way) and it didn't follow mounts on overmounted trailing symlinks. Nobody noticed for 6 years, until folks actually tried to play with mount-on-symlink... Patches were by Jeff Layton, neither he nor I have spotted the problem back then. And I should have, since it had been only a few months since the audit for exactly that kind of problems...
AFAICS, there'd been no serious semantical changes since then. What we have right now: * no mount traversal on the starting point * mount traversal after any component other than "." * symlink traversal consists of possibly jumping to given point plus following a given (possibly empty) series of components. It can be both - e.g. symlink to "/foo/bar" is 'jump to root, then traverse "foo", then traverse "bar"'. Procfs "magic" symlinks are not really magical - they behave as symlinks to "/" as far as the pathwalk semantics is concerned. The only differences is that jump might be not to process' root. * mount traversal takes precedence over symlink traversal. * jump (if any) in symlink traversal is treated the same as the starting point - it's not followed by mount traversal. It's also not followed by symlink traversal, even if we are jumping into a symlink. Of course, in any position other than the end of pathname that's an instant error. That's also not different from the starting point treatment - if ...at(2) is given a symlink for starting point, it leaves it as-is if AT_EMPTY_PATH is given and fails with -ENOTDIR otherwise. * umount(2) handles the final component differently - for one thing, it does not do revalidate, for another - its mount traversal (if any) does not include automount-related parts. And there we *do* want mount traversal at the final point, for obvious reasons.
I think the easiest way to handle that is to have O_PATH turn LOOKUP_AUTOMOUNT, same as the normal open() does. That's trivial to do, but that changes user-visible behaviour. OTOH, with the current behaviour nobody can rely upon automount not getting triggered by somebody else just as they are entering their open(dir, O_PATH), so I think that's not a problem.
Linus, do you have any objections to such O_PATH semantics change?
See above: I think I'd prefer the O_PATH behavior the other way around. That seems to be more of a consistent behavior of what "O_PATH" means - it means "don't really open, we'll do it only when you use it as a directory".
How would your proposal deal with access("/proc/self/fd/42/foo", MAY_READ) vs. faccessat(42, "foo", MAY_READ)? The latter would trigger automount, the former would not... Or would you extend that to "traverse mounts upon following procfs links, if the file in question had been opened with O_PATH"? We could do that (give nd_jump_link() an extra argument telling if we want mount traversal), but I'm not sure if the resulting semantics is sane...
Note, BTW, that O_PATH users really can't rely upon automounts _not_ being triggered - all it takes is a lookup on bogus path with such prefix by anybody who can reach that place... We are not opening anything, really, but we are not able to ignore automounts triggered by somebody else.