It's currently possible to create pidfds for kthreads but it is unclear what that is supposed to mean. Until we have use-cases for it and we figured out what behavior we want block the creation of pidfds for kthreads.
Fixes: 32fcb426ec00 ("pid: add pidfd_open()") Cc: stable@vger.kernel.org Signed-off-by: Christian Brauner brauner@kernel.org --- kernel/fork.c | 25 ++++++++++++++++++++++--- 1 file changed, 22 insertions(+), 3 deletions(-)
diff --git a/kernel/fork.c b/kernel/fork.c index cc760491f201..18bdc87209d0 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -2053,11 +2053,24 @@ static int __pidfd_prepare(struct pid *pid, unsigned int flags, struct file **re */ int pidfd_prepare(struct pid *pid, unsigned int flags, struct file **ret) { - bool thread = flags & PIDFD_THREAD; - - if (!pid || !pid_has_task(pid, thread ? PIDTYPE_PID : PIDTYPE_TGID)) + if (!pid) return -EINVAL;
+ scoped_guard(rcu) { + struct task_struct *tsk; + + if (flags & PIDFD_THREAD) + tsk = pid_task(pid, PIDTYPE_PID); + else + tsk = pid_task(pid, PIDTYPE_TGID); + if (!tsk) + return -EINVAL; + + /* Don't create pidfds for kernel threads for now. */ + if (tsk->flags & PF_KTHREAD) + return -EINVAL; + } + return __pidfd_prepare(pid, flags, ret); }
@@ -2403,6 +2416,12 @@ __latent_entropy struct task_struct *copy_process( if (clone_flags & CLONE_PIDFD) { int flags = (clone_flags & CLONE_THREAD) ? PIDFD_THREAD : 0;
+ /* Don't create pidfds for kernel threads for now. */ + if (args->kthread) { + retval = -EINVAL; + goto bad_fork_free_pid; + } + /* Note that no task has been attached to @pid yet. */ retval = __pidfd_prepare(pid, flags, &pidfile); if (retval < 0)
On 07/31, Christian Brauner wrote:
It's currently possible to create pidfds for kthreads but it is unclear what that is supposed to mean. Until we have use-cases for it and we figured out what behavior we want block the creation of pidfds for kthreads.
Hmm... could you explain your concerns? Why do you think we should disallow pidfd_open(pid-of-kthread) ?
@@ -2403,6 +2416,12 @@ __latent_entropy struct task_struct *copy_process( if (clone_flags & CLONE_PIDFD) { int flags = (clone_flags & CLONE_THREAD) ? PIDFD_THREAD : 0;
/* Don't create pidfds for kernel threads for now. */
if (args->kthread) {
retval = -EINVAL;
goto bad_fork_free_pid;
Do we really need this check? Userspace can't use args->kthread != NULL, the kernel users should not use CLONE_PIDFD.
Oleg.
On Wed, Jul 31, 2024 at 04:51:33PM GMT, Oleg Nesterov wrote:
On 07/31, Christian Brauner wrote:
It's currently possible to create pidfds for kthreads but it is unclear what that is supposed to mean. Until we have use-cases for it and we figured out what behavior we want block the creation of pidfds for kthreads.
Hmm... could you explain your concerns? Why do you think we should disallow pidfd_open(pid-of-kthread) ?
It basically just works now and it's not intentional - at least not on my part. You can't send signals to them, you may or may not get notified via poll when a kthread exits. If we ever want this to be useful I would like to enable it explicitly.
Plus, this causes confusion in userspace. When you have qemu running with kvm support then kvm creates several kthreads (that inherit the cgroup of the calling process). If you try to kill those instances via systemctl kill or systemctl stop then pidfds for these kthreads are opened but sending a signal to them is meaningless.
(So imho this causes more confusion then it is actually helpful. If we add supports for kthreads I'd also like pidfs to gain a way to identify them via statx() or fdinfo.)
@@ -2403,6 +2416,12 @@ __latent_entropy struct task_struct *copy_process( if (clone_flags & CLONE_PIDFD) { int flags = (clone_flags & CLONE_THREAD) ? PIDFD_THREAD : 0;
/* Don't create pidfds for kernel threads for now. */
if (args->kthread) {
retval = -EINVAL;
goto bad_fork_free_pid;
Do we really need this check? Userspace can't use args->kthread != NULL, the kernel users should not use CLONE_PIDFD.
Yeah, I know. That's really just proactive so that user of e.g., copy_process() such as vhost or so on don't start handing out pidfds for stuff without requring changes to the helper itself.
OK, I won't argue, but ....
On 08/01, Christian Brauner wrote:
On Wed, Jul 31, 2024 at 04:51:33PM GMT, Oleg Nesterov wrote:
On 07/31, Christian Brauner wrote:
It's currently possible to create pidfds for kthreads but it is unclear what that is supposed to mean. Until we have use-cases for it and we figured out what behavior we want block the creation of pidfds for kthreads.
Hmm... could you explain your concerns? Why do you think we should disallow pidfd_open(pid-of-kthread) ?
It basically just works now and it's not intentional - at least not on my part. You can't send signals to them,
Yes, you can't send signals to kthread. So what?
You can't send signals to the normal processes if check_kill_permission() fails. And even if you are root, you can't send an unhandled signal via pidfd = pidfd_open(1).
you may or may not get notified via poll when a kthread exits.
Why? the exiting kthread should not differ in this respect?
(So imho this causes more confusion then it is actually helpful. If we add supports for kthreads I'd also like pidfs to gain a way to identify them via statx() or fdinfo.)
/proc/$pid/status has a "Kthread" field...
@@ -2403,6 +2416,12 @@ __latent_entropy struct task_struct *copy_process( if (clone_flags & CLONE_PIDFD) { int flags = (clone_flags & CLONE_THREAD) ? PIDFD_THREAD : 0;
/* Don't create pidfds for kernel threads for now. */
if (args->kthread) {
retval = -EINVAL;
goto bad_fork_free_pid;
Do we really need this check? Userspace can't use args->kthread != NULL, the kernel users should not use CLONE_PIDFD.
Yeah, I know. That's really just proactive so that user of e.g., copy_process() such as vhost or so on don't start handing out pidfds for stuff without requring changes to the helper itself.
Then I'd suggest WARN_ON_ONCE(args->kthread).
But as I said I won't argue. I see nothing wrong in this patch.
Oleg.
On Thu, Aug 01, 2024 at 10:01:20AM GMT, Oleg Nesterov wrote:
OK, I won't argue, but ....
On 08/01, Christian Brauner wrote:
On Wed, Jul 31, 2024 at 04:51:33PM GMT, Oleg Nesterov wrote:
On 07/31, Christian Brauner wrote:
It's currently possible to create pidfds for kthreads but it is unclear what that is supposed to mean. Until we have use-cases for it and we figured out what behavior we want block the creation of pidfds for kthreads.
Hmm... could you explain your concerns? Why do you think we should disallow pidfd_open(pid-of-kthread) ?
It basically just works now and it's not intentional - at least not on my part. You can't send signals to them,
Yes, you can't send signals to kthread. So what?
You can't send signals to the normal processes if check_kill_permission() fails. And even if you are root, you can't send an unhandled signal via pidfd = pidfd_open(1).
you may or may not get notified via poll when a kthread exits.
Why? the exiting kthread should not differ in this respect?
Why do you want to allow it? I see zero reason to get a reference to a kthread if there's no use-case for it. kthreads are mostly a kernel thing so why give userspace handles to it. And as I said before, there's userspace out there that's already confused why they can get references to them in the first place.
(So imho this causes more confusion then it is actually helpful. If we add supports for kthreads I'd also like pidfs to gain a way to identify them via statx() or fdinfo.)
/proc/$pid/status has a "Kthread" field...
Going forward, I don't want to force people to parse basic stuff out of procfs. Ideally, they'll be able to mostly rely on pidfd operations only.
On 08/01, Christian Brauner wrote:
On Thu, Aug 01, 2024 at 10:01:20AM GMT, Oleg Nesterov wrote:
OK, I won't argue, but ....
you may or may not get notified via poll when a kthread exits.
Why? the exiting kthread should not differ in this respect?
Why do you want to allow it?
Again, I didn't try to "nack" your patch. Just tried to understand your motivation.
And "may not get notified" above doesn't look right to me...
/proc/$pid/status has a "Kthread" field...
Going forward, I don't want to force people to parse basic stuff out of procfs. Ideally, they'll be able to mostly rely on pidfd operations only.
Agreed.
Oleg.
Hi Christian,
On Wed, Jul 31, 2024 at 12:01:12PM +0200, Christian Brauner wrote:
It's currently possible to create pidfds for kthreads but it is unclear what that is supposed to mean. Until we have use-cases for it and we figured out what behavior we want block the creation of pidfds for kthreads.
Fixes: 32fcb426ec00 ("pid: add pidfd_open()") Cc: stable@vger.kernel.org Signed-off-by: Christian Brauner brauner@kernel.org
kernel/fork.c | 25 ++++++++++++++++++++++--- 1 file changed, 22 insertions(+), 3 deletions(-)
Unfortunately this commit broke systemd-shutdown's ability to kill processes, which makes some filesystems no longer get unmounted at shutdown.
It looks like systemd-shutdown relies on being able to create a pidfd for any process listed in /proc (even a kthread), and if it gets EINVAL it treats it a fatal error and stops looking for more processes...
This is what shows up in the system log:
systemd[1]: Shutting down. systemd-shutdown[1]: Syncing filesystems and block devices. systemd-shutdown[1]: Sending SIGTERM to remaining processes... systemd-shutdown[1]: Failed to enumerate /proc/: Invalid argument systemd-shutdown[1]: Sending SIGKILL to remaining processes... systemd-shutdown[1]: Failed to enumerate /proc/: Invalid argument systemd-shutdown[1]: Unmounting file systems. (sd-umount)[17359]: Unmounting '/run/credentials/systemd-vconsole-setup.service'. (sd-umount)[17360]: Unmounting '/run/credentials/systemd-journald.service'. (sd-remount)[17361]: Remounting '/' read-only with options ''. (sd-remount)[17361]: Failed to remount '/' read-only: Device or resource busy (sd-remount)[17362]: Remounting '/' read-only with options ''. (sd-remount)[17362]: Failed to remount '/' read-only: Device or resource busy systemd-shutdown[1]: Not all file systems unmounted, 1 left.
- Eric
On Sat, Aug 17, 2024 at 08:58:18PM GMT, Eric Biggers wrote:
Hi Christian,
On Wed, Jul 31, 2024 at 12:01:12PM +0200, Christian Brauner wrote:
It's currently possible to create pidfds for kthreads but it is unclear what that is supposed to mean. Until we have use-cases for it and we figured out what behavior we want block the creation of pidfds for kthreads.
Fixes: 32fcb426ec00 ("pid: add pidfd_open()") Cc: stable@vger.kernel.org Signed-off-by: Christian Brauner brauner@kernel.org
kernel/fork.c | 25 ++++++++++++++++++++++--- 1 file changed, 22 insertions(+), 3 deletions(-)
Unfortunately this commit broke systemd-shutdown's ability to kill processes, which makes some filesystems no longer get unmounted at shutdown.
It looks like systemd-shutdown relies on being able to create a pidfd for any process listed in /proc (even a kthread), and if it gets EINVAL it treats it a fatal error and stops looking for more processes...
Thanks for the report! I talked to Daan De Meyer who made that change and he said that this must a systemd version that hasn't gotten his fixes yet. In any case, if this causes regression then I'll revert it right now. See the appended revert.
On Mon, Aug 19, 2024 at 10:41:15AM +0200, Christian Brauner wrote:
On Sat, Aug 17, 2024 at 08:58:18PM GMT, Eric Biggers wrote:
Hi Christian,
On Wed, Jul 31, 2024 at 12:01:12PM +0200, Christian Brauner wrote:
It's currently possible to create pidfds for kthreads but it is unclear what that is supposed to mean. Until we have use-cases for it and we figured out what behavior we want block the creation of pidfds for kthreads.
Fixes: 32fcb426ec00 ("pid: add pidfd_open()") Cc: stable@vger.kernel.org Signed-off-by: Christian Brauner brauner@kernel.org
kernel/fork.c | 25 ++++++++++++++++++++++--- 1 file changed, 22 insertions(+), 3 deletions(-)
Unfortunately this commit broke systemd-shutdown's ability to kill processes, which makes some filesystems no longer get unmounted at shutdown.
It looks like systemd-shutdown relies on being able to create a pidfd for any process listed in /proc (even a kthread), and if it gets EINVAL it treats it a fatal error and stops looking for more processes...
Thanks for the report! I talked to Daan De Meyer who made that change and he said that this must a systemd version that hasn't gotten his fixes yet. In any case, if this causes regression then I'll revert it right now. See the appended revert.
Thanks for queueing up a revert.
This was on systemd 256.4 which was released less than a month ago.
I'm not sure what systemd fix you are talking about. Looking at killall() in src/shared/killall.c on the latest "main" branch of systemd, it calls proc_dir_read_pidref() => pidref_set_pid() => pidfd_open(), and EINVAL gets passed back up to killall() and treated as a fatal error. ignore_proc() skips kernel threads but is executed too late. I didn't test it, so I could be wrong, but based on the code it does not appear to be fixed.
- Erici
On Tue, Aug 20, 2024 at 12:34:14PM GMT, Eric Biggers wrote:
On Mon, Aug 19, 2024 at 10:41:15AM +0200, Christian Brauner wrote:
On Sat, Aug 17, 2024 at 08:58:18PM GMT, Eric Biggers wrote:
Hi Christian,
On Wed, Jul 31, 2024 at 12:01:12PM +0200, Christian Brauner wrote:
It's currently possible to create pidfds for kthreads but it is unclear what that is supposed to mean. Until we have use-cases for it and we figured out what behavior we want block the creation of pidfds for kthreads.
Fixes: 32fcb426ec00 ("pid: add pidfd_open()") Cc: stable@vger.kernel.org Signed-off-by: Christian Brauner brauner@kernel.org
kernel/fork.c | 25 ++++++++++++++++++++++--- 1 file changed, 22 insertions(+), 3 deletions(-)
Unfortunately this commit broke systemd-shutdown's ability to kill processes, which makes some filesystems no longer get unmounted at shutdown.
It looks like systemd-shutdown relies on being able to create a pidfd for any process listed in /proc (even a kthread), and if it gets EINVAL it treats it a fatal error and stops looking for more processes...
Thanks for the report! I talked to Daan De Meyer who made that change and he said that this must a systemd version that hasn't gotten his fixes yet. In any case, if this causes regression then I'll revert it right now. See the appended revert.
Thanks for queueing up a revert.
This was on systemd 256.4 which was released less than a month ago.
I'm not sure what systemd fix you are talking about. Looking at killall() in src/shared/killall.c on the latest "main" branch of systemd, it calls proc_dir_read_pidref() => pidref_set_pid() => pidfd_open(), and EINVAL gets passed back up to killall() and treated as a fatal error. ignore_proc() skips kernel threads but is executed too late. I didn't test it, so I could be wrong, but based on the code it does not appear to be fixed.
Yeah, I think you're right. What they fixed is ead48ec35c86 ("cgroup-util: Don't try to open pidfd for kernel threads") when reading pids from cgroup.procs. Daan is currently prepping a fix for reading pids from /proc as well.
Hi,
I just prepped https://github.com/systemd/systemd/pull/34058 which will apply the same fix from ead48ec35c86 ("cgroup-util: Don't try to open pidfd for kernel threads") for pids read from /proc as well.
Cheers,
Daan
On Wed, 21 Aug 2024 at 09:41, Christian Brauner brauner@kernel.org wrote:
On Tue, Aug 20, 2024 at 12:34:14PM GMT, Eric Biggers wrote:
On Mon, Aug 19, 2024 at 10:41:15AM +0200, Christian Brauner wrote:
On Sat, Aug 17, 2024 at 08:58:18PM GMT, Eric Biggers wrote:
Hi Christian,
On Wed, Jul 31, 2024 at 12:01:12PM +0200, Christian Brauner wrote:
It's currently possible to create pidfds for kthreads but it is unclear what that is supposed to mean. Until we have use-cases for it and we figured out what behavior we want block the creation of pidfds for kthreads.
Fixes: 32fcb426ec00 ("pid: add pidfd_open()") Cc: stable@vger.kernel.org Signed-off-by: Christian Brauner brauner@kernel.org
kernel/fork.c | 25 ++++++++++++++++++++++--- 1 file changed, 22 insertions(+), 3 deletions(-)
Unfortunately this commit broke systemd-shutdown's ability to kill processes, which makes some filesystems no longer get unmounted at shutdown.
It looks like systemd-shutdown relies on being able to create a pidfd for any process listed in /proc (even a kthread), and if it gets EINVAL it treats it a fatal error and stops looking for more processes...
Thanks for the report! I talked to Daan De Meyer who made that change and he said that this must a systemd version that hasn't gotten his fixes yet. In any case, if this causes regression then I'll revert it right now. See the appended revert.
Thanks for queueing up a revert.
This was on systemd 256.4 which was released less than a month ago.
I'm not sure what systemd fix you are talking about. Looking at killall() in src/shared/killall.c on the latest "main" branch of systemd, it calls proc_dir_read_pidref() => pidref_set_pid() => pidfd_open(), and EINVAL gets passed back up to killall() and treated as a fatal error. ignore_proc() skips kernel threads but is executed too late. I didn't test it, so I could be wrong, but based on the code it does not appear to be fixed.
Yeah, I think you're right. What they fixed is ead48ec35c86 ("cgroup-util: Don't try to open pidfd for kernel threads") when reading pids from cgroup.procs. Daan is currently prepping a fix for reading pids from /proc as well.
On 19.08.24 10:41, Christian Brauner wrote:
On Sat, Aug 17, 2024 at 08:58:18PM GMT, Eric Biggers wrote:
On Wed, Jul 31, 2024 at 12:01:12PM +0200, Christian Brauner wrote:
It's currently possible to create pidfds for kthreads but it is unclear what that is supposed to mean. Until we have use-cases for it and we figured out what behavior we want block the creation of pidfds for kthreads.
Fixes: 32fcb426ec00 ("pid: add pidfd_open()") Cc: stable@vger.kernel.org Signed-off-by: Christian Brauner brauner@kernel.org
kernel/fork.c | 25 ++++++++++++++++++++++--- 1 file changed, 22 insertions(+), 3 deletions(-)
Unfortunately this commit broke systemd-shutdown's ability to kill processes, which makes some filesystems no longer get unmounted at shutdown.
It looks like systemd-shutdown relies on being able to create a pidfd for any process listed in /proc (even a kthread), and if it gets EINVAL it treats it a fatal error and stops looking for more processes...
Thanks for the report! I talked to Daan De Meyer who made that change and he said that this must a systemd version that hasn't gotten his fixes yet. In any case, if this causes regression then I'll revert it right now. See the appended revert.
Greg, Sasha, JFYI in case you are not already aware of it: I by chance[1] noticed that the patch Christian plans to revert is still in the 6.10-queue. You might want to drop it (or apply the revert as well, which is in -next, but not yet in mainline afaics).
Ciao, Thorsten
[1] after I noticed this thread a third time, this time via some SELinux problems it caused in Fedora land: https://bugzilla.redhat.com/show_bug.cgi?id=2306818 https://bugzilla.redhat.com/show_bug.cgi?id=2305270
On Fri, Aug 23, 2024 at 07:23:12AM +0200, Linux regression tracking (Thorsten Leemhuis) wrote:
On 19.08.24 10:41, Christian Brauner wrote:
On Sat, Aug 17, 2024 at 08:58:18PM GMT, Eric Biggers wrote:
On Wed, Jul 31, 2024 at 12:01:12PM +0200, Christian Brauner wrote:
It's currently possible to create pidfds for kthreads but it is unclear what that is supposed to mean. Until we have use-cases for it and we figured out what behavior we want block the creation of pidfds for kthreads.
Fixes: 32fcb426ec00 ("pid: add pidfd_open()") Cc: stable@vger.kernel.org Signed-off-by: Christian Brauner brauner@kernel.org
kernel/fork.c | 25 ++++++++++++++++++++++--- 1 file changed, 22 insertions(+), 3 deletions(-)
Unfortunately this commit broke systemd-shutdown's ability to kill processes, which makes some filesystems no longer get unmounted at shutdown.
It looks like systemd-shutdown relies on being able to create a pidfd for any process listed in /proc (even a kthread), and if it gets EINVAL it treats it a fatal error and stops looking for more processes...
Thanks for the report! I talked to Daan De Meyer who made that change and he said that this must a systemd version that hasn't gotten his fixes yet. In any case, if this causes regression then I'll revert it right now. See the appended revert.
Greg, Sasha, JFYI in case you are not already aware of it: I by chance[1] noticed that the patch Christian plans to revert is still in the 6.10-queue. You might want to drop it (or apply the revert as well, which is in -next, but not yet in mainline afaics).
I was hoping it would get into Linus's tree "soon" so I could take the revert too. As it's in -next, I'll grab it from there when I get a chance.
thanks,
greg k-h
linux-stable-mirror@lists.linaro.org