在 2020/12/31 下午5:22, Greg Kroah-Hartman 写道:
On Thu, Dec 17, 2020 at 10:26:23AM +0800, Wen Yang wrote:
在 2020/12/4 上午2:31, Wen Yang 写道:
The dentries such as /proc/<pid>/ns/ have the DCACHE_OP_DELETE flag, they should be deleted when the process exits.
Suppose the following race appears:
release_task dput -> proc_flush_task -> dentry->d_op->d_delete(dentry) -> __exit_signal -> dentry->d_lockref.count-- and return.
In the proc_flush_task(), if another process is using this dentry, it will not be deleted. At the same time, in dput(), d_op->d_delete() can be executed before __exit_signal(pid has not been hashed), d_delete returns false, so this dentry still cannot be deleted.
This dentry will always be cached (although its count is 0 and the DCACHE_OP_DELETE flag is set), its parent denry will also be cached too, and these dentries can only be deleted when drop_caches is manually triggered.
This will result in wasted memory. What's more troublesome is that these dentries reference pid, according to the commit f333c700c610 ("pidns: Add a limit on the number of pid namespaces"), if the pid cannot be released, it may result in the inability to create a new pid_ns.
This problem occurred in our cluster environment (Linux 4.9 LTS). We could reproduce it by manually constructing a test program + adding some debugging switches in the kernel:
- A test program to open the directory (/proc/<pid>/ns) [1]
- Adding some debugging switches to the kernel, adding a delay between proc_flush_task and __exit_signal in release_task() [2]
The test process is as follows:
A, terminal #1
Turn on the debug switch: echo 1> /proc/sys/vm/dentry_debug_trace
Execute the following unshare command: sudo unshare --pid --fork --mount-proc bash
B, terminal #2
Find the pid of the unshare process:
# pstree -p | grep unshare | `-sshd(716)---bash(718)--sudo(816)---unshare(817)---bash(818)
Find the corresponding dentry: # dmesg | grep pid=818 [70.424722] XXX proc_pid_instantiate:3119 pid=818 tid=818 entry=818/ffff8802c7b670e8
C, terminal #3
Execute the opendir program, it will always open the /proc/818/ns/ directory:
# ./a.out /proc/818/ns/ pid: 876 . .. net uts ipc pid user mnt cgroup
D, go back to terminal #2
Turn on the debugging switches to construct the race: # echo 818> /proc/sys/vm/dentry_debug_pid # echo 1> /proc/sys/vm/dentry_debug_delay
Kill the unshare process (pid 818). Since the debugging switches have been turned on, it will get stuck in release_task(): # kill -9 818
Then kill the process that opened the /proc/818/ns/ directory: # kill -9 876
Then turn off these debugging switches to allow the 818 process to exit: # echo 0> /proc/sys/vm/dentry_debug_delay # echo 0> /proc/sys/vm/dentry_debug_pid
Checking the dmesg, we will find that the dentry(/proc/818/ns) ’s count is 0, and the flag is 2800cc (#define DCACHE_OP_DELETE 0x00000008), but it is still cached: # dmesg | grep ffff8802a3999548 … [565.559156] XXX dput:853 dentry=ns/ffff8802bea7b528, flag=2800cc, cnt=0, inode=ffff8802b38c2010, pdentry=818/ffff8802c7b670e8, pflag=20008c, pcnt=1, pinode=ffff8802c7812010, keywords: be cached
It could also be verified via the crash tool:
crash> dentry.d_flags,d_iname,d_inode,d_lockref -x ffff8802bea7b528 d_flags = 0x2800cc d_iname = "ns\000kkkkkkkkkkkkkkkkkkkkkkkkkkkk" d_inode = 0xffff8802b38c2010 d_lockref = { { lock_count = 0x0, { lock = { { rlock = { raw_lock = { { val = { counter = 0x0 }, { locked = 0x0, pending = 0x0 }, { locked_pending = 0x0, tail = 0x0 } } } } } }, count = 0x0 } } } crash> kmem ffff8802bea7b528 CACHE OBJSIZE ALLOCATED TOTAL SLABS SSIZE NAME ffff8802dd5f5900 192 23663 26130 871 16k dentry SLAB MEMORY NODE TOTAL ALLOCATED FREE ffffea000afa9e00 ffff8802bea78000 0 30 25 5 FREE / [ALLOCATED] [ffff8802bea7b520]
PAGE PHYSICAL MAPPING INDEX CNT FLAGS
ffffea000afa9ec0 2bea7b000 dead000000000400 0 0 2fffff80000000 crash>
This series of patches is to fix this issue.
Regards, Wen
Alexey Dobriyan (1): proc: use %u for pid printing and slightly less stack
Andreas Gruenbacher (1): proc: Pass file mode to proc_pid_make_inode
Christian Brauner (1): clone: add CLONE_PIDFD
Eric W. Biederman (6): proc: Better ownership of files for non-dumpable tasks in user namespaces proc: Rename in proc_inode rename sysctl_inodes sibling_inodes proc: Generalize proc_sys_prune_dcache into proc_prune_siblings_dcache proc: Clear the pieces of proc_inode that proc_evict_inode cares about proc: Use d_invalidate in proc_prune_siblings_dcache proc: Use a list of inodes to flush from proc
Joel Fernandes (Google) (1): pidfd: add polling support
fs/proc/base.c | 242 ++++++++++++++++++++------------------------- fs/proc/fd.c | 20 +--- fs/proc/inode.c | 67 ++++++++++++- fs/proc/internal.h | 22 ++--- fs/proc/namespaces.c | 3 +- fs/proc/proc_sysctl.c | 45 ++------- fs/proc/self.c | 6 +- fs/proc/thread_self.c | 5 +- include/linux/pid.h | 5 + include/linux/proc_fs.h | 4 +- include/uapi/linux/sched.h | 1 + kernel/exit.c | 5 +- kernel/fork.c | 145 ++++++++++++++++++++++++++- kernel/pid.c | 3 + kernel/signal.c | 11 +++ security/selinux/hooks.c | 1 + 16 files changed, 357 insertions(+), 228 deletions(-)
[1] A test program to open the directory (/proc/<pid>/ns) #include <stdio.h> #include <sys/types.h> #include <dirent.h> #include <errno.h>
int main(int argc, char *argv[]) { DIR *dip; struct dirent *dit;
if (argc < 2) { printf("Usage :%s <directory>\n", argv[0]); return -1; }
if ((dip = opendir(argv[1])) == NULL) { perror("opendir"); return -1; }
printf("pid: %d\n", getpid()); while((dit = readdir (dip)) != NULL) { printf("%s\n", dit->d_name); }
while (1) sleep (1);
return 0; }
[2] Adding some debugging switches to the kernel, also adding a delay between proc_flush_task and __exit_signal in release_task():
diff --git a/fs/dcache.c b/fs/dcache.c index 05bad55..fafad37 100644 --- a/fs/dcache.c +++ b/fs/dcache.c @@ -84,6 +84,9 @@ int sysctl_vfs_cache_pressure __read_mostly = 100; EXPORT_SYMBOL_GPL(sysctl_vfs_cache_pressure);
+int sysctl_dentry_debug_trace __read_mostly = 0; +EXPORT_SYMBOL_GPL(sysctl_dentry_debug_trace);
__cacheline_aligned_in_smp DEFINE_SEQLOCK(rename_lock);
EXPORT_SYMBOL(rename_lock);
@@ -758,6 +761,26 @@ static inline bool fast_dput(struct dentry *dentry) return 0; }
+#define DENTRY_DEBUG_TRACE(dentry, keywords) \ +do { \
- if (sysctl_dentry_debug_trace) \
printk("XXX %s:%d " \
"dentry=%s/%p, flag=%x, cnt=%d, inode=%p, " \
"pdentry=%s/%p, pflag=%x, pcnt=%d, pinode=%p, " \
"keywords: %s\n", \
__func__, __LINE__, \
dentry->d_name.name, \
dentry, \
dentry->d_flags, \
dentry->d_lockref.count, \
dentry->d_inode, \
dentry->d_parent->d_name.name, \
dentry->d_parent, \
dentry->d_parent->d_flags, \
dentry->d_parent->d_lockref.count, \
dentry->d_parent->d_inode, \
keywords); \
+} while (0)
/* * This is dput @@ -804,6 +827,8 @@ void dput(struct dentry *dentry)
WARN_ON(d_in_lookup(dentry));
- DENTRY_DEBUG_TRACE(dentry, "be checked");
- /* Unreachable? Get rid of it */ if (unlikely(d_unhashed(dentry))) goto kill_it;
@@ -812,8 +837,10 @@ void dput(struct dentry *dentry) goto kill_it;
if (unlikely(dentry->d_flags & DCACHE_OP_DELETE)) {
if (dentry->d_op->d_delete(dentry))
if (dentry->d_op->d_delete(dentry)) {
DENTRY_DEBUG_TRACE(dentry, "be killed"); goto kill_it;
}
}
if (!(dentry->d_flags & DCACHE_REFERENCED))
@@ -822,6 +849,9 @@ void dput(struct dentry *dentry)
dentry->d_lockref.count--; spin_unlock(&dentry->d_lock);
DENTRY_DEBUG_TRACE(dentry, "be cached");
return;
kill_it:
diff --git a/fs/proc/base.c b/fs/proc/base.c index b9e4183..419a409 100644 --- a/fs/proc/base.c +++ b/fs/proc/base.c @@ -3090,6 +3090,8 @@ void proc_flush_task(struct task_struct *task) } }
+extern int sysctl_dentry_debug_trace;
- static int proc_pid_instantiate(struct inode *dir, struct dentry * dentry, struct task_struct *task, const void *ptr)
@@ -3111,6 +3113,12 @@ static int proc_pid_instantiate(struct inode *dir, d_set_d_op(dentry, &pid_dentry_operations);
d_add(dentry, inode);
- if (sysctl_dentry_debug_trace)
printk("XXX %s:%d pid=%d tid=%d entry=%s/%p\n",
__func__, __LINE__, task->pid, task->tgid,
dentry->d_name.name, dentry);
- /* Close the race of the process dying before we return the dentry */ if (pid_revalidate(dentry, 0)) return 0;
diff --git a/kernel/exit.c b/kernel/exit.c index 27f4168..2b3e1b6 100644 --- a/kernel/exit.c +++ b/kernel/exit.c @@ -55,6 +55,8 @@ #include <linux/shm.h> #include <linux/kcov.h>
+#include <linux/delay.h>
- #include <asm/uaccess.h> #include <asm/unistd.h> #include <asm/pgtable.h>
@@ -164,6 +166,8 @@ static void delayed_put_task_struct(struct rcu_head *rhp) put_task_struct(tsk); }
+int sysctl_dentry_debug_delay __read_mostly = 0; +int sysctl_dentry_debug_pid __read_mostly = 0;
void release_task(struct task_struct *p) { @@ -178,6 +182,11 @@ void release_task(struct task_struct *p)
proc_flush_task(p);
- if (sysctl_dentry_debug_delay && p->pid == sysctl_dentry_debug_pid) {
while (sysctl_dentry_debug_delay)
mdelay(1);
- }
- write_lock_irq(&tasklist_lock); ptrace_release_task(p); __exit_signal(p);
diff --git a/kernel/sysctl.c b/kernel/sysctl.c index 513e6da..27f1395 100644 --- a/kernel/sysctl.c +++ b/kernel/sysctl.c @@ -282,6 +282,10 @@ static int sysrq_sysctl_handler(struct ctl_table *table, int write, static int max_extfrag_threshold = 1000; #endif
+extern int sysctl_dentry_debug_trace; +extern int sysctl_dentry_debug_delay; +extern int sysctl_dentry_debug_pid;
- static struct ctl_table kern_table[] = { { .procname = "sched_child_runs_first",
@@ -1498,6 +1502,30 @@ static int sysrq_sysctl_handler(struct ctl_table *table, int write, .proc_handler = proc_dointvec, .extra1 = &zero, },
- {
.procname = "dentry_debug_trace",
.data = &sysctl_dentry_debug_trace,
.maxlen = sizeof(sysctl_dentry_debug_trace),
.mode = 0644,
.proc_handler = proc_dointvec,
.extra1 = &zero,
- },
- {
.procname = "dentry_debug_delay",
.data = &sysctl_dentry_debug_delay,
.maxlen = sizeof(sysctl_dentry_debug_delay),
.mode = 0644,
.proc_handler = proc_dointvec,
.extra1 = &zero,
- },
- {
.procname = "dentry_debug_pid",
.data = &sysctl_dentry_debug_pid,
.maxlen = sizeof(sysctl_dentry_debug_pid),
.mode = 0644,
.proc_handler = proc_dointvec,
.extra1 = &zero,
- }, #ifdef HAVE_ARCH_PICK_MMAP_LAYOUT { .procname = "legacy_va_layout",
Signed-off-by: Wen Yang wenyang@linux.alibaba.com Cc: Pavel Emelyanov xemul@openvz.org Cc: Oleg Nesterov oleg@tv-sign.ru Cc: Sukadev Bhattiprolu sukadev@us.ibm.com Cc: Paul Menage menage@google.com Cc: "Eric W. Biederman" ebiederm@xmission.com Cc: Greg Kroah-Hartman gregkh@linuxfoundation.org Cc: stable@vger.kernel.org
Hi Greg,
Could you kindly give some suggestions?
I'm sorry, but I don't really understand what this patch series is for.
Why is there a patch in the 00/10 email? What stable kernel(s) should this be backported to? And why can't you just use a newer kernel (like 4.19) if you are hitting this issue with much older ones?
This bug was introduced by 60347f6716aa ("pid namespaces: prepare proc_flust_task() to flush entries from multiple proc trees"), exposed by f333c700c610 ("pidns: Add a limit on the number of pid namespaces"), and then fixed by 7bc3e6e55acf (“proc: Use a list of inodes to flush from proc”)
4.19 LTS did not solve it either.
The one that fixes the bug is this patch (10/10): https://lkml.org/lkml/2020/12/3/1046 The previous 9 patches (from 01/10 to 09/10) are its pre-dependencies.
The 00/10 is some of our test programs, including a user mode program (open the /proc/<pid>/ns directory), and some hacks added to the kernel (just add log printing and some delays for easy construction this race).
Thanks.