Letting the following set of commands run long enough on a multi-core machine causes soft lockups in the kernel:
(cd /sys/fs/selinux/; while true; do find >/dev/null 2>&1; done) &
(cd /sys/fs/selinux/; while true; do find >/dev/null 2>&1; done) &
(cd /sys/fs/selinux/; while true; do find >/dev/null 2>&1; done) &
while true; do load_policy; echo -n .; sleep 0.1; done
The problem is that sel_remove_entries() calls d_genocide() to remove the entire contents of certain subdirectories of /sys/fs/selinux/. That function is apparently intended only for removing filesystem entries that are no longer reachable from userspace, because it does not follow the rule that any code removing entries from a directory must hold the write lock on the directory inode's RW semaphore (formerly a mutex; see 9902af79c01a ("parallel lookups: actual switch to rwsem")).
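To illustrate the rule (a sketch only; remove_one_child() is a hypothetical helper, not part of this patch):

	/*
	 * Sketch of the locking rule: removing an entry from a directory
	 * must happen under the parent inode's rwsem held for write.
	 * d_genocide() takes no such lock, which is the bug here.
	 */
	static void remove_one_child(struct dentry *parent, struct dentry *child)
	{
		inode_lock_nested(d_inode(parent), I_MUTEX_PARENT);
		if (d_is_dir(child))
			simple_rmdir(d_inode(parent), child);
		else
			simple_unlink(d_inode(parent), child);
		d_delete(child);
		inode_unlock(d_inode(parent));
	}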
In order to fix this, I had to open-code d_genocide() again, adding the necessary parent inode locking. I based the new implementation on d_walk() (fs/dcache.c) and the original code from before ad52184b705c ("selinuxfs: don't open-code d_genocide()"), with some inspiration from debugfs_remove() (fs/debugfs/inode.c).
The new code no longer triggers soft lockups with the above reproducer and it also gives no warnings when run with CONFIG_PROVE_LOCKING=y.
Note that I am adding the Fixes: tag on the commit that replaced the open-coded version with d_genocide(), but the bug is also present in the original code, which dates back to pre-2.6 times... This patch obviously doesn't apply to that older code, so pre-4.0 kernels will need a dedicated patch (just adding the parent inode locks should be sufficient there).
Link: https://github.com/SELinuxProject/selinux-kernel/issues/42
Fixes: ad52184b705c ("selinuxfs: don't open-code d_genocide()")
Cc: stable@vger.kernel.org # 4.0+
Cc: Stephen Smalley <sds@tycho.nsa.gov>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Ondrej Mosnacek <omosnace@redhat.com>
---
 security/selinux/selinuxfs.c | 88 ++++++++++++++++++++++++++++++++++--
 1 file changed, 85 insertions(+), 3 deletions(-)
diff --git a/security/selinux/selinuxfs.c b/security/selinux/selinuxfs.c
index f3a5a138a096..4ac9b17e24d1 100644
--- a/security/selinux/selinuxfs.c
+++ b/security/selinux/selinuxfs.c
@@ -1316,10 +1316,92 @@ static const struct file_operations sel_commit_bools_ops = {
 	.llseek = generic_file_llseek,
 };
-static void sel_remove_entries(struct dentry *de)
+/* removes all children of the given dentry (but not itself) */
+static void sel_remove_entries(struct dentry *root)
 {
-	d_genocide(de);
-	shrink_dcache_parent(de);
+	struct dentry *parent = root;
+	struct dentry *child;
+	struct list_head *next;
+
+	spin_lock(&parent->d_lock);
+
+repeat:
+	next = parent->d_subdirs.next;
+
+	while (next != &parent->d_subdirs) {
+		child = list_entry(next, struct dentry, d_child);
+		next = next->next;
+
+		spin_lock_nested(&child->d_lock, DENTRY_D_LOCK_NESTED);
+
+		if (!list_empty(&child->d_subdirs)) {
+			/* Current entry has children, we need to remove those
+			 * first. We unlock the parent but keep the child locked
+			 * (it will become the new parent). */
+			spin_unlock(&parent->d_lock);
+
+			/* we need to re-lock the child (new parent) in a
+			 * special way (see d_walk() in fs/dcache.c) */
+			spin_release(&child->d_lock.dep_map, 1, _RET_IP_);
+			parent = child;
+			spin_acquire(&parent->d_lock.dep_map, 0, 1, _RET_IP_);
+
+			goto repeat;
+		}
+
+resume:
+		if (child->d_inode) {
+			/* acquire a reference to the child */
+			dget_dlock(child);
+
+			/* drop all locks (some of the callees will try to
+			 * acquire them) */
+			spin_unlock(&child->d_lock);
+			spin_unlock(&parent->d_lock);
+
+			/* IMPORTANT: we need to hold the parent's inode lock
+			 * because some functions (namely dcache_readdir) assume
+			 * holding this lock means children won't be removed */
+			inode_lock_nested(parent->d_inode, I_MUTEX_PARENT);
+
+			/* now unlink the child */
+			if (d_is_dir(child))
+				simple_rmdir(d_inode(parent), child);
+			else
+				simple_unlink(d_inode(parent), child);
+			d_delete(child);
+
+			inode_unlock(parent->d_inode);
+
+			/* drop our extra reference */
+			dput(child);
+
+			/* re-lock only the parent */
+			spin_lock(&parent->d_lock);
+		} else
+			spin_unlock(&child->d_lock);
+	}
+
+	if (parent != root) {
+		/* we processed contents of a subdirectory, ascend and continue
+		 * by removing the subdirectory itself (now empty) */
+		child = parent;
+		parent = child->d_parent;
+
+		/* we have to re-lock the child after locking the parent */
+		spin_unlock(&child->d_lock);
+		spin_lock(&parent->d_lock);
+		spin_lock_nested(&child->d_lock, DENTRY_D_LOCK_NESTED);
+
+		/* set next to the next sibling */
+		next = child->d_child.next;
+		goto resume;
+	}
+
+	spin_unlock(&root->d_lock);
+
+	/* try to clean up the stale entries */
+	shrink_dcache_parent(root);
 }
#define BOOL_DIR_NAME "booleans"
On Tue, Oct 02, 2018 at 01:18:30PM +0200, Ondrej Mosnacek wrote:
No. With the side of Hell, No. The bug is real, but this is not the way to fix it.
First of all, it's still broken - e.g. mount something on a subdirectory and watch what that thing will do to it. And anyone who has permission to reload policy _will_ have one for mount.
And yes, debugfs_remove() suffers the same problem. Frankly, the only wish the debugfs_remove() implementation inspires is to shoot it before it breeds. Which is the second problem with that approach - such stuff really shouldn't be duplicated in individual filesystems.
Don't copy (let alone open-code) d_walk() in there. The guts of dcache tree handling are subtle enough (and had more than enough locking bugs over the years) to make spreading the dependencies on its details all over the tree an invitation for massive PITA down the road.
I have the beginnings of a patch doing that for debugfs; the same thing should be usable for selinuxfs as well. However, the main problem with selinuxfs wrt policy loading is different - what to do if sel_make_policy_nodes() fails halfway through? And what to do with accesses to the unholy mix of old and new nodes we currently have left behind?
Before security_load_policy() we don't know which nodes to create; after it we have nothing to fall back onto. It looks like we need to split security_load_policy() into "load", "switch over" and "free old" parts, so that the whole thing would look like

  load:         create new directory trees, unconnected to the root so they
                couldn't be reached by lookups until we are done
  switch over:  move the new directory trees in place; kill the old trees
                (carefully, using that invalidate+genocide combination)
  free old:     free the old data structures

with failures upon the first two steps handled by killing whatever detached trees we'd created (if any) and freeing the new data structures.
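To illustrate the shape of such a split (a rough sketch only - every helper name and type below is hypothetical, not existing kernel API):

	/*
	 * Hypothetical sketch of the three-phase reload described above.
	 * None of these helpers exist; they only illustrate the control
	 * flow and the failure handling for the first two phases.
	 */
	static int sel_reload_policy(void *data, size_t len)
	{
		struct sel_policy_handle *new;	/* hypothetical type */
		struct dentry *detached;
		int rc;

		/* "load": parse into a detached policy object */
		rc = security_load_policy_prepare(data, len, &new);
		if (rc)
			return rc;

		/* build new directory trees, unreachable by lookups so far */
		detached = sel_make_policy_nodes_detached(new);
		if (IS_ERR(detached)) {
			security_load_policy_cancel(new);
			return PTR_ERR(detached);
		}

		/* "switch over": move new trees in place, kill old ones */
		security_load_policy_commit(new);
		sel_swap_policy_nodes(detached);

		/* "free old": drop the now-unused data structures */
		security_load_policy_free_old(new);
		return 0;
	}

Failures in the first two phases unwind by killing the detached trees and freeing the new structures, so the old policy and nodes stay fully intact.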
However, I'd really like to have the folks familiar with selinux guts comment upon the feasibility of the above. AFAICS, nobody has ever seriously looked at that code wrt graceful error handling, etc.[*], so I'm not happy with making inferences from what the existing code is doing.
If you are interested in getting selinuxfs into sane shape, that would be a good place to start. As for the kernel-side rm -rf (which is what debugfs_remove() et al. are trying to be)...

* it absolutely needs to be in fs/*.c - either dcache or libfs. It's too closely tied to dcache guts to do otherwise.

* as the first step it needs to do d_invalidate(), to get rid of anything that might be mounted on it and to prevent new mounts from appearing. It's rather tempting to unhash everything in the victim tree at the same time, but that needs to be done with care - I'm still not entirely happy with the solution I've got in that area. An alternative is to unhash them on the way out of the subtree.

* simple_unlink()/simple_rmdir() are wrong there - we don't want to bother with the parent's timestamps as we go, for one thing; that should be done only once, to the parent of the root of that subtree. For another, we bloody well enforce the emptiness ourselves, so this simple_empty() is pointless (especially since we have no choice other than ignoring it anyway).
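A minimal sketch of that shape (the function name is made up and the recursive walk is elided; d_invalidate() and current_time() are real APIs):

	/*
	 * Hypothetical outline of a libfs-side "rm -rf" entry point per
	 * the points above; the bottom-up subtree walk itself is omitted.
	 */
	void fs_tree_genocide(struct dentry *root)
	{
		struct inode *dir = d_inode(root->d_parent);

		/* step 1: evict submounts and block new ones from appearing */
		d_invalidate(root);

		/* step 2: walk the subtree bottom-up dropping each dentry;
		 * no simple_empty() checks, no per-entry timestamp updates */
		/* ... subtree walk elided ... */

		/* step 3: touch the parent's timestamps once, for the whole
		 * removal rather than per entry */
		dir->i_ctime = dir->i_mtime = current_time(dir);
	}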
BTW, another selinuxfs unpleasantness is that things like sel_write_enforce() don't have any exclusion against themselves, let alone against policy reloads. And quite a few of them obviously expect that e.g. the permission check is done against the same policy the operation will apply to, not the previous one. That one definitely needs selinux folks involved.
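To illustrate the missing exclusion (a sketch only; the wrapper and the mutex are hypothetical, sel_write_enforce() is the real handler in selinuxfs.c):

	/* Hypothetical: one lock serializing state-changing writes against
	 * each other and against policy reloads. */
	static DEFINE_MUTEX(sel_state_mutex);

	static ssize_t sel_write_enforce_excl(struct file *file,
					      const char __user *buf,
					      size_t count, loff_t *ppos)
	{
		ssize_t ret;

		/* the permission check and the state update now run against
		 * the same policy, with reloads excluded meanwhile */
		mutex_lock(&sel_state_mutex);
		ret = sel_write_enforce(file, buf, count, ppos);
		mutex_unlock(&sel_state_mutex);
		return ret;
	}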
[*] not too unreasonably so - anyone who gets to use _that_ as an attack vector has already won, so it's not a security problem pretty much by definition and running into heavy OOM at the time of policy reload is almost certainly going to end up with the userland parts of the entire thing not handling failures gracefully.
Hi Al,
On Tue, Oct 2, 2018 at 5:58 PM Al Viro viro@zeniv.linux.org.uk wrote:
> On Tue, Oct 02, 2018 at 01:18:30PM +0200, Ondrej Mosnacek wrote:
>
> No. With the side of Hell, No. The bug is real, but this is not the way to fix it.
>
> First of all, it's still broken - e.g. mount something on a subdirectory and watch what that thing will do to it. And anyone who has permission to reload policy _will_ have one for mount.
I have no doubts there are still tons of bugs left over, but having processes traverse selinuxfs while load_policy is in progress is something that can (and will) easily happen in practice. Mounting over a selinuxfs subdirectory, OTOH, isn't something you would normally do. I think it is worth applying a quick, if partial, fix for this practical problem and then working towards a better long-term solution.
I feel like you are assuming that I am trying to fix some security problem here, but that's not true. It *may* be seen as a security issue, but as you point out having permission to load policy will usually imply you can do other kinds of damage anyway. I simply see this as an annoying real-life bug (I know of at least one user that is/was affected) that I want to fix (even if not in a perfect way for now).
> And yes, debugfs_remove() suffers the same problem. Frankly, the only wish the debugfs_remove() implementation inspires is to shoot it before it breeds. Which is the second problem with that approach - such stuff really shouldn't be duplicated in individual filesystems.
>
> Don't copy (let alone open-code) d_walk() in there. The guts of dcache tree handling are subtle enough (and had more than enough locking bugs over the years) to make spreading the dependencies on its details all over the tree an invitation for massive PITA down the road.
Right, I am now convinced that it is not a good idea at all. The thought of adding a new function to dcache had crossed my mind, but I dismissed it because I didn't want to needlessly touch other parts of the kernel. Looking back now, it would have been the better choice indeed.
> I have the beginnings of a patch doing that for debugfs; the same thing should be usable for selinuxfs as well. However, the main problem with selinuxfs wrt policy loading is different - what to do if sel_make_policy_nodes() fails halfway through? And what to do with accesses to the unholy mix of old and new nodes we currently have left behind?
Again, this is a big problem, too, but outside the scope I care about right now.
> Before security_load_policy() we don't know which nodes to create; after it we have nothing to fall back onto. It looks like we need to split security_load_policy() into "load", "switch over" and "free old" parts, so that the whole thing would look like
>
>   load:         create new directory trees, unconnected to the root so they
>                 couldn't be reached by lookups until we are done
>   switch over:  move the new directory trees in place; kill the old trees
>                 (carefully, using that invalidate+genocide combination)
>   free old:     free the old data structures
>
> with failures upon the first two steps handled by killing whatever detached trees we'd created (if any) and freeing the new data structures.
> However, I'd really like to have the folks familiar with selinux guts comment upon the feasibility of the above. AFAICS, nobody has ever seriously looked at that code wrt graceful error handling, etc.[*], so I'm not happy with making inferences from what the existing code is doing.
Yes, that sounds like it could work. I'd be willing to work on that as a longer term solution. Let's hope we get some feedback from them.
> If you are interested in getting selinuxfs into sane shape, that would be a good place to start. As for the kernel-side rm -rf (which is what debugfs_remove() et al. are trying to be)...
>
> * it absolutely needs to be in fs/*.c - either dcache or libfs. It's too closely tied to dcache guts to do otherwise.
>
> * as the first step it needs to do d_invalidate(), to get rid of anything that might be mounted on it and to prevent new mounts from appearing. It's rather tempting to unhash everything in the victim tree at the same time, but that needs to be done with care - I'm still not entirely happy with the solution I've got in that area. An alternative is to unhash them on the way out of the subtree.
>
> * simple_unlink()/simple_rmdir() are wrong there - we don't want to bother with the parent's timestamps as we go, for one thing; that should be done only once, to the parent of the root of that subtree. For another, we bloody well enforce the emptiness ourselves, so this simple_empty() is pointless (especially since we have no choice other than ignoring it anyway).
Right, I was suspicious about the simple_*() functions anyway; I simply wanted to stay close to what the old code and debugfs were doing (turns out they were both wrong anyway).
So, it sounds like you are already working on a better fix... Should I wait for your patch(es) or should I try again? There is no rush, I just want to know if it makes sense for me to still work on it :)
> BTW, another selinuxfs unpleasantness is that things like sel_write_enforce() don't have any exclusion against themselves, let alone against policy reloads. And quite a few of them obviously expect that e.g. the permission check is done against the same policy the operation will apply to, not the previous one. That one definitely needs selinux folks involved.
> [*] not too unreasonably so - anyone who gets to use _that_ as an attack vector has already won, so it's not a security problem pretty much by definition and running into heavy OOM at the time of policy reload is almost certainly going to end up with the userland parts of the entire thing not handling failures gracefully.
Thanks a lot for your comments!
On 10/02/2018 11:58 AM, Al Viro wrote:
> On Tue, Oct 02, 2018 at 01:18:30PM +0200, Ondrej Mosnacek wrote:
>
> No. With the side of Hell, No. The bug is real, but this is not the way to fix it.
>
> First of all, it's still broken - e.g. mount something on a subdirectory and watch what that thing will do to it. And anyone who has permission to reload policy _will_ have one for mount.
>
> And yes, debugfs_remove() suffers the same problem. Frankly, the only wish the debugfs_remove() implementation inspires is to shoot it before it breeds. Which is the second problem with that approach - such stuff really shouldn't be duplicated in individual filesystems.
>
> Don't copy (let alone open-code) d_walk() in there. The guts of dcache tree handling are subtle enough (and had more than enough locking bugs over the years) to make spreading the dependencies on its details all over the tree an invitation for massive PITA down the road.
>
> I have the beginnings of a patch doing that for debugfs; the same thing should be usable for selinuxfs as well. However, the main problem with selinuxfs wrt policy loading is different - what to do if sel_make_policy_nodes() fails halfway through? And what to do with accesses to the unholy mix of old and new nodes we currently have left behind?
>
> Before security_load_policy() we don't know which nodes to create; after it we have nothing to fall back onto. It looks like we need to split security_load_policy() into "load", "switch over" and "free old" parts, so that the whole thing would look like
>
>   load:         create new directory trees, unconnected to the root so they
>                 couldn't be reached by lookups until we are done
>   switch over:  move the new directory trees in place; kill the old trees
>                 (carefully, using that invalidate+genocide combination)
>   free old:     free the old data structures
>
> with failures upon the first two steps handled by killing whatever detached trees we'd created (if any) and freeing the new data structures.
>
> However, I'd really like to have the folks familiar with selinux guts comment upon the feasibility of the above. AFAICS, nobody has ever seriously looked at that code wrt graceful error handling, etc.[*], so I'm not happy with making inferences from what the existing code is doing.
That sounds viable to me as an approach, and hopefully the splitting/refactoring of security_load_policy() will be a bit easier now that the global selinux state is wrapped in the selinux_state structure and explicitly passed around.
> If you are interested in getting selinuxfs into sane shape, that would be a good place to start. As for the kernel-side rm -rf (which is what debugfs_remove() et al. are trying to be)...
>
> * it absolutely needs to be in fs/*.c - either dcache or libfs. It's too closely tied to dcache guts to do otherwise.
>
> * as the first step it needs to do d_invalidate(), to get rid of anything that might be mounted on it and to prevent new mounts from appearing. It's rather tempting to unhash everything in the victim tree at the same time, but that needs to be done with care - I'm still not entirely happy with the solution I've got in that area. An alternative is to unhash them on the way out of the subtree.
>
> * simple_unlink()/simple_rmdir() are wrong there - we don't want to bother with the parent's timestamps as we go, for one thing; that should be done only once, to the parent of the root of that subtree. For another, we bloody well enforce the emptiness ourselves, so this simple_empty() is pointless (especially since we have no choice other than ignoring it anyway).
> BTW, another selinuxfs unpleasantness is that things like sel_write_enforce() don't have any exclusion against themselves, let alone against policy reloads. And quite a few of them obviously expect that e.g. the permission check is done against the same policy the operation will apply to, not the previous one. That one definitely needs selinux folks involved.
>
> [*] not too unreasonably so - anyone who gets to use _that_ as an attack vector has already won, so it's not a security problem pretty much by definition and running into heavy OOM at the time of policy reload is almost certainly going to end up with the userland parts of the entire thing not handling failures gracefully.