On Mon, Oct 28, 2024 at 04:34:33PM +0100, Christian Brauner wrote:
On Sat, Oct 26, 2024 at 08:24:58AM +0100, Lorenzo Stoakes wrote:
It is useful to be able to utilise the pidfd mechanism to reference the current thread or process (from a userland point of view - thread group leader from the kernel's point of view).
Therefore introduce PIDFD_SELF_THREAD to refer to the current thread, and PIDFD_SELF_THREAD_GROUP to refer to the current thread group leader.
For convenience and to avoid confusion from userland's perspective we alias these:
PIDFD_SELF is an alias for PIDFD_SELF_THREAD - This is nearly always what the user will want to use, as they would find it surprising if for instance fd's were unshared()'d and they wanted to invoke pidfd_getfd() and that failed.
PIDFD_SELF_PROCESS is an alias for PIDFD_SELF_THREAD_GROUP - Most users have no concept of thread groups or what a thread group leader is, and from userland's perspective and nomenclature this is what userland considers to be a process.
Due to the refactoring of the central __pidfd_get_pid() function we can implement this functionality centrally, providing the use of this sentinel in most functionality which utilises pidfd's.
We need to explicitly adjust kernel_waitid_prepare() to permit this (though it wouldn't really make sense to use this there, we provide the ability for consistency).
We explicitly disallow use of this in setns(), which would otherwise have required explicit custom handling, as it doesn't make sense to set the current calling thread to join the namespace of itself.
As the callers of pidfd_get_pid() expect an increased reference count on the pid we do so in the self case, reducing churn and avoiding any breakage from existing logic which decrements this reference count.
This change implicitly provides PIDFD_SELF_* support in the waitid(P_PIDFS, ...), process_madvise(), process_mrelease(), pidfd_send_signal(), and pidfd_getfd() system calls.
Things such as polling a pidfs and general fd operations are not supported, this strictly provides the sentinel for APIs which explicitly accept a pidfd.
Suggested-by: Pedro Falcato pedro.falcato@gmail.com Reviewed-by: Shakeel Butt shakeel.butt@linux.dev Signed-off-by: Lorenzo Stoakes lorenzo.stoakes@oracle.com
Currently, a pidfd based system call like pidfd_send_signal() would simply do:
fdget(pidfd); // use struct pid fdput(pidfd);
Where the lifetime of @pid is guaranteed by @file. And in the regular case where there's only a single thread the file code will avoid taking a reference. Thus, there's no reference count bump on fdget(), nor a drop on fdput(), nor a get_pid() or put_pid().
Right I missed that fdget() wouldn't take a reference count I assumed it would be equivalent, my mistake.
With your patch series you will always cause reference counts on @pid to be taken for everyone. And I wouldn't be surprised if we get performance regressions for this.
This was in response to you review saying I can't pass around a pointer to the fd, originally I didn't do this.
This was the only way I could find to de-jank and make my shared function not end up problematic in the light of wanting to keep the fd within a single scope, I didn't realise that passing that by value would be ok.
But obviously hadn't realised that fdget()/fdput() sometimes doesn't change a reference count, mea culpa on that not an fs person...
In one of my earlier mails I had mused about a fdput() like primitive. What I roughly, hastily, and under the influence of the flu, sketched in the _completey untested_ patch I appended illustrates roughly what I had been thinking about.
OK, I was really uncertain as to what you meant regarding the scope of this value so had assumed we couldn't do something like assigning the value like that.
I guess I'll try to adapt that and respin a v7 when I get a chance.
include/linux/pid.h | 8 ++++-- include/uapi/linux/pidfd.h | 10 ++++++++ kernel/exit.c | 4 ++- kernel/nsproxy.c | 1 + kernel/pid.c | 51 ++++++++++++++++++++++++-------------- 5 files changed, 53 insertions(+), 21 deletions(-)
diff --git a/include/linux/pid.h b/include/linux/pid.h index d466890e1b35..3b2ac7567a88 100644 --- a/include/linux/pid.h +++ b/include/linux/pid.h @@ -78,11 +78,15 @@ struct file;
- __pidfd_get_pid() - Retrieve a pid associated with the specified pidfd.
- @pidfd: The pidfd whose pid we want, or the fd of a /proc/<pid> file if
@alloc_proc is also set.
@alloc_proc is also set, or PIDFD_SELF_* to refer to the current
thread or thread group leader.
- @allow_proc: If set, then an fd of a /proc/<pid> file can be passed instead
of a pidfd, and this will be used to determine the pid.
- @flags: Output variable, if non-NULL, then the file->f_flags of the
pidfd will be set here.
pidfd will be set here or If PIDFD_SELF_THREAD is set, this is
set to PIDFD_THREAD, otherwise if PIDFD_SELF_THREAD_GROUP then
this is set to zero.
- Returns: If successful, the pid associated with the pidfd, otherwise an
error.
diff --git a/include/uapi/linux/pidfd.h b/include/uapi/linux/pidfd.h index 565fc0629fff..6fe1d63b2086 100644 --- a/include/uapi/linux/pidfd.h +++ b/include/uapi/linux/pidfd.h @@ -29,4 +29,14 @@ #define PIDFD_GET_USER_NAMESPACE _IO(PIDFS_IOCTL_MAGIC, 9) #define PIDFD_GET_UTS_NAMESPACE _IO(PIDFS_IOCTL_MAGIC, 10)
+/*
- Special sentinel values which can be used to refer to the current thread or
- thread group leader (which from a userland perspective is the process).
- */
+#define PIDFD_SELF PIDFD_SELF_THREAD +#define PIDFD_SELF_PROCESS PIDFD_SELF_THREAD_GROUP
+#define PIDFD_SELF_THREAD -10000 /* Current thread. */ +#define PIDFD_SELF_THREAD_GROUP -20000 /* Current thread group leader. */
#endif /* _UAPI_LINUX_PIDFD_H */ diff --git a/kernel/exit.c b/kernel/exit.c index 619f0014c33b..e4f85ec4ba78 100644 --- a/kernel/exit.c +++ b/kernel/exit.c @@ -71,6 +71,7 @@ #include <linux/user_events.h> #include <linux/uaccess.h>
+#include <uapi/linux/pidfd.h> #include <uapi/linux/wait.h>
#include <asm/unistd.h> @@ -1739,7 +1740,8 @@ int kernel_waitid_prepare(struct wait_opts *wo, int which, pid_t upid, break; case P_PIDFD: type = PIDTYPE_PID;
if (upid < 0)
if (upid < 0 && upid != PIDFD_SELF_THREAD &&
upid != PIDFD_SELF_THREAD_GROUP) return -EINVAL;
pid = pidfd_get_pid(upid, &f_flags);
diff --git a/kernel/nsproxy.c b/kernel/nsproxy.c index dc952c3b05af..d239f7eeaa1f 100644 --- a/kernel/nsproxy.c +++ b/kernel/nsproxy.c @@ -550,6 +550,7 @@ SYSCALL_DEFINE2(setns, int, fd, int, flags) struct nsset nsset = {}; int err = 0;
- /* If fd is PIDFD_SELF_*, implicitly fail here, as invalid. */ if (!fd_file(f)) return -EBADF;
diff --git a/kernel/pid.c b/kernel/pid.c index 94c97559e5c5..0a1861b4422c 100644 --- a/kernel/pid.c +++ b/kernel/pid.c @@ -535,33 +535,48 @@ struct pid *find_ge_pid(int nr, struct pid_namespace *ns) } EXPORT_SYMBOL_GPL(find_ge_pid);
+static struct pid *pidfd_get_pid_self(unsigned int pidfd, unsigned int *flags)
The @flags argument is unused afaict.
Oops will rework on v7.
+{
- bool is_thread = pidfd == PIDFD_SELF_THREAD;
- enum pid_type type = is_thread ? PIDTYPE_PID : PIDTYPE_TGID;
- struct pid *pid = *task_pid_ptr(current, type);
- /* The caller expects an elevated reference count. */
- get_pid(pid);
- return pid;
+}
Fwiw, what you've done here is essentially reimplement the already existing get_task_pid() helper that you could simply use.
We're looking up PIDFD_SELF_* values here. So presumably you mean the:
struct pid *pid = *task_pid_ptr(current, type); /* The caller expects an elevated reference count. */ get_pid(pid);
Bit is duplicated vs. get_task_pid()?
I did that because it wasn't clear doing that under the RCU lock was necessary or useful?
It seems useful still to have the PIDFD_SELF stuff qseparate, I can replace those two lines with a call to get_task_pid() if you prefer? Unless you meant something else?
struct pid *__pidfd_get_pid(unsigned int pidfd, bool allow_proc, unsigned int *flags) {
- struct pid *pid;
- struct fd f = fdget(pidfd);
- struct file *file = fd_file(f);
- if (pidfd == PIDFD_SELF_THREAD || pidfd == PIDFD_SELF_THREAD_GROUP) {
return pidfd_get_pid_self(pidfd, flags);
- } else {
I think the else can just go and we can save an indentation level.
This has been raised a couple times before by other reviewers, this is just so we can declare variables, especially the fd variable, which you were very clear _must_ retain scope only where it used.
Otherwise I have to do something like;
struct fd f = {};
if (...) { return ...; }
f = fdget(...);
This way we don't need to do that.
I mean probably the compiler would do the right thing but it just seems ugly to assign/reassign a stack value like that.
Ah, I see struct fd is just a wrapper around an unsigned long, so probably not a big deal to just leave it unassigned then.
This was the only reason I did this, I usually much prefer the guard pattern.
OK if you're fine with this value being assigned like that then no problem will change!
struct pid *pid;
struct fd f = fdget(pidfd);
struct file *file = fd_file(f);
- if (!file)
return ERR_PTR(-EBADF);
if (!file)
return ERR_PTR(-EBADF);
- pid = pidfd_pid(file);
- /* If we allow opening a pidfd via /proc/<pid>, do so. */
- if (IS_ERR(pid) && allow_proc)
pid = tgid_pidfd_to_pid(file);
pid = pidfd_pid(file);
/* If we allow opening a pidfd via /proc/<pid>, do so. */
if (IS_ERR(pid) && allow_proc)
pid = tgid_pidfd_to_pid(file);
- if (IS_ERR(pid)) {
if (IS_ERR(pid)) {
fdput(f);
return pid;
}
/* Pin pid before we release fd. */
get_pid(pid);
if (flags)
fdput(f);*flags = file->f_flags;
- return pid; }
- /* Pin pid before we release fd. */
- get_pid(pid);
- if (flags)
*flags = file->f_flags;
- fdput(f);
- return pid;
}
/**
2.47.0