On 2019-07-13, Al Viro viro@zeniv.linux.org.uk wrote:
On Fri, Jul 12, 2019 at 04:00:26PM +0100, Al Viro wrote:
On Fri, Jul 12, 2019 at 02:25:53PM +0100, Al Viro wrote:
if (flags & LOOKUP_BENEATH) { nd->root = nd->path; if (!(flags & LOOKUP_RCU)) path_get(&nd->root); else nd->root_seq = nd->seq;
BTW, this assignment is needed for LOOKUP_RCU case. Without it you are pretty much guaranteed that lazy pathwalk will fail, when it comes to complete_walk().
Speaking of which, what would happen if LOOKUP_ROOT/LOOKUP_BENEATH combination would someday get passed?
I don't understand what's going on with ->r_seq in there - your call of path_is_under() is after having (re-)sampled rename_lock, but if that was the only .. in there, who's going to recheck the value? For that matter, what's to guarantee that the thing won't get moved just as you are returning from handle_dots()?
IOW, what does LOOKUP_IN_ROOT guarantee for caller (openat2())?
I tried to explain this in the commit message for "namei: aggressively check for nd->root escape on ".." resolution", but I probably could've explained it better.
The basic property being guaranteed by LOOKUP_IN_ROOT is that it will not result in resolution of a path component which was not inside the root of the dirfd tree at some point during resolution (and that all absolute symlink and ".." resolution will be done relative to the dirfd). This may smell slightly of chroot(2), because unfortunately it is a similar concept -- the reason for this is to allow for a more efficient way to safely resolve paths inside a rootfs than spawning a separate process to then pass back the fd to the caller.
We don't want to do a path_is_under() check for every ".." (otherwise lookups have a quadratic slowdown when doing many ".."s), so I instead only do a check after a rename or a mount (which are the only operations which could change what ".." points to). And since we do the path_is_under() check if m_seq or r_seq need a retry, we can re-take them[+].
The main reason why I don't re-check path_is_under() after handle_dots() is that there is no way to be sure that a racing rename didn't happen after your last path_is_under() check. The rename could happen after the syscall returns, after all.
So, the main purpose of the check is to make sure that a ".."s after a rename doesn't result in an escape. If the rename happens after we've traversed through a ".." that means that the ".." was inside the root in the first place (a root ".." is handled by follow_dotdot). If the rename happens after we've gone through handle_dots() and there is no subsequent ".." then to userspace it looks identical to the rename occurring after the syscall has returned. If there is a subsequent ".." after a racing rename then we may have moved into a path that wasn't path_is_under() and so we have to check it.
The only way I could see you could solve the race completely is if you had a way for userspace to lock things from being able to be renamed (or MS_MOVE'd). And that feels like a really bad idea to me.
[+]: You asked why don't I re-take m_seq. The reason is that I don't fully understand all the other m_seq checks being done during resolution, and we aren't definitely doing them all in handle_dots(). So I assumed re-taking it could result in me breaking RCU-walk which obviously would be bad. Since I am the only thing using nd->r_seq, I can re-take it without issue.