On Wed, Aug 06, 2025 at 11:57:42AM -0700, Randy Dunlap wrote:
On 8/6/25 11:02 AM, Aleksa Sarai wrote:
On 2025-08-05, Randy Dunlap <rdunlap@infradead.org> wrote:
On 8/4/25 10:45 PM, Aleksa Sarai wrote:
/proc has historically had very opaque semantics about PID namespaces, which is a little unfortunate for container runtimes and other programs that deal with switching namespaces very often. One common issue is that of converting between PIDs in the process's namespace and PIDs in the namespace of /proc.
In principle, it is possible to do this today by opening a pidfd with pidfd_open(2) and then looking at /proc/self/fdinfo/$n (which will contain a PID value translated to the pid namespace associated with that procfs superblock). However, allocating a new file for each PID to be converted is less than ideal for programs that may need to scan procfs, and it is generally useful for userspace to be able to finally get this information from procfs.
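For readers who want to see the existing workaround concretely, here is a minimal sketch of the pidfd_open(2) plus fdinfo approach described above. The helper names parse_fdinfo_pid() and pid_in_procfs_ns() are hypothetical, and the sketch assumes the procfs instance of interest is the one mounted at /proc. It relies on the "Pid:" field that pidfd fdinfo exposes.

```c
#define _GNU_SOURCE
#include <stdio.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: return the "Pid:" value from a pidfd's fdinfo
 * stream, or -2 if no such line is present. */
static long parse_fdinfo_pid(FILE *f)
{
	char line[256];
	long pid;

	while (fgets(line, sizeof(line), f)) {
		/* "NSpid:" and other lines fail to match the literal "Pid:". */
		if (sscanf(line, "Pid: %ld", &pid) == 1)
			return pid;
	}
	return -2;
}

/* Hypothetical helper: translate `pid` (as seen in the caller's pid
 * namespace) to the pid namespace of the procfs mounted at /proc.
 * Returns -2 on error; otherwise whatever the kernel reports in "Pid:". */
static long pid_in_procfs_ns(pid_t pid)
{
	char path[64];
	FILE *f;
	long ret;
	int pidfd = (int)syscall(SYS_pidfd_open, pid, 0);

	if (pidfd < 0)
		return -2;

	/* One new file is allocated per PID; this is the cost noted above. */
	snprintf(path, sizeof(path), "/proc/self/fdinfo/%d", pidfd);
	f = fopen(path, "r");
	if (!f) {
		close(pidfd);
		return -2;
	}
	ret = parse_fdinfo_pid(f);
	fclose(f);
	close(pidfd);
	return ret;
}
```

Calling pid_in_procfs_ns(getpid()) from a process whose /proc belongs to its own pid namespace should simply give back getpid(); the interesting case is a /proc mounted from a different namespace.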
So, add a new API to get the pid namespace of a procfs instance, in the form of an ioctl(2) you can call on the root directory of said procfs. The returned file descriptor will have O_CLOEXEC set. This acts as a sister feature to the new "pidns" mount option, finally allowing userspace full control of the pid namespaces associated with procfs instances.
The permission model for this is a bit looser than that of the "pidns" mount option (and also setns(2)) because /proc/1/ns/pid provides the same information, so as long as you have access to that magic-link (or something equivalently reasonable such as being in an ancestor pid namespace) it makes sense to allow userspace to grab a handle. Ideally we would check for ptrace-read access against all processes in the pidns (which is very likely to be true for at least one process, as SUID_DUMP_DISABLE is cleared on exec(2) and is rarely set by most programs), but this would obviously not scale.
setns(2) will still have its own permission checks, so being able to open a pidns handle doesn't really provide too many other capabilities.
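As a rough illustration of how userspace might call the proposed ioctl, here is a minimal sketch. It assumes the ioctl number from this patch, _IO('f', 32), which is not in released uapi headers and may change before merging. It also assumes the namespace file descriptor comes back as the ioctl return value, which is how the description above reads. The helper name procfs_get_pidns() is hypothetical.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <sys/ioctl.h>
#include <unistd.h>

/* Assumption: values as proposed in this patch, defined locally because
 * installed headers may not carry them. */
#ifndef PROCFS_IOCTL_MAGIC
#define PROCFS_IOCTL_MAGIC 'f'
#endif
#ifndef PROCFS_GET_PID_NAMESPACE
#define PROCFS_GET_PID_NAMESPACE _IO(PROCFS_IOCTL_MAGIC, 32)
#endif

/* Hypothetical helper: return a pidns file descriptor for the procfs
 * instance whose root directory is `procroot`, or -1 with errno set.
 * Kernels without the ioctl are expected to fail the call (typically
 * with ENOTTY). */
static int procfs_get_pidns(const char *procroot)
{
	int nsfd, saved_errno;
	int rootfd = open(procroot, O_RDONLY | O_DIRECTORY | O_CLOEXEC);

	if (rootfd < 0)
		return -1;

	/* The ioctl takes no argument and is issued on the procfs root. */
	nsfd = ioctl(rootfd, PROCFS_GET_PID_NAMESPACE);
	saved_errno = errno;
	close(rootfd);
	errno = saved_errno;
	return nsfd;
}
```

A caller could then pass the returned descriptor to setns(2) with CLONE_NEWPID, subject to setns(2)'s own permission checks as noted above.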
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
 Documentation/filesystems/proc.rst |  4 +++
 fs/proc/root.c                     | 68 ++++++++++++++++++++++++++++++++++++--
 include/uapi/linux/fs.h            |  4 +++
 3 files changed, 74 insertions(+), 2 deletions(-)
diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
index 0bd678a4a10e..68e65e6d7d6b 100644
--- a/include/uapi/linux/fs.h
+++ b/include/uapi/linux/fs.h
@@ -435,8 +435,12 @@ typedef int __bitwise __kernel_rwf_t;
 			 RWF_APPEND | RWF_NOAPPEND | RWF_ATOMIC |\
 			 RWF_DONTCACHE)
+/* This matches XSDFEC_MAGIC, so we need to allocate subvalues carefully. */
 #define PROCFS_IOCTL_MAGIC 'f'
+/* procfs root ioctls */
+#define PROCFS_GET_PID_NAMESPACE	_IO(PROCFS_IOCTL_MAGIC, 32)
Since the _IO() nr here is 32, Documentation/userspace-api/ioctl/ioctl-number.rst should be updated like:
-'f'   00-0F  linux/fs.h  conflict!
+'f'   00-1F  linux/fs.h  conflict!
Should this be 00-20 (or 00-2F) instead?
Oops, yes, it should be one of those. Thanks.
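To make the range arithmetic explicit: the nr field of _IO('f', 32) is 32 decimal, i.e. 0x20, which falls just outside 00-1F (0x1F is 31). A small sketch that checks this with the standard _IOC_NR()/_IOC_TYPE() macros follows; EXAMPLE_GET_PID_NAMESPACE and nr_in_range() are hypothetical names used only for illustration.

```c
#include <linux/ioctl.h>

/* Same encoding as the define proposed in the patch. */
#define EXAMPLE_GET_PID_NAMESPACE _IO('f', 32)

/* Hypothetical helper: does the command's nr field fall in [lo, hi]? */
static int nr_in_range(unsigned int cmd, unsigned int lo, unsigned int hi)
{
	unsigned int nr = _IOC_NR(cmd);

	return nr >= lo && nr <= hi;
}
```

With this, nr_in_range(EXAMPLE_GET_PID_NAMESPACE, 0x00, 0x1F) is false, while the 00-20 and 00-2F ranges both cover it.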
Also, is there a better value to use for this new ioctl? I'm not quite sure what is the best practice to handle these kinds of conflicts...
I wouldn't worry about it. We have *many* conflicts. (unless Al or Christian are concerned)
We try to minimize conflicts but we unfortunately give no strong guarantees in any way. I always defer to Arnd in such matters as he's got a pretty good mental model of what is best to do for ioctls.
(17 is already used for PROCFS_IOCTL_MAGIC somewhere else, so that probably should have updated the Doc/rst file.)