On Tue, Aug 05, 2025 at 03:45:07PM +1000, Aleksa Sarai wrote:
Ever since the introduction of pid namespaces, procfs has had very implicit behaviour surrounding them (the pidns used by a procfs mount is auto-selected based on the mounting process's active pidns, and the pidns itself is basically hidden once the mount has been constructed).
/* pidns mount option for procfs */
This implicit behaviour has historically meant that userspace was required to do some special dances in order to configure the pidns of a procfs mount as desired. Examples include:
In order to bypass the mnt_too_revealing() check, Kubernetes creates a procfs mount from an empty pidns so that user namespaced containers can be nested (without this, the nested containers would fail to mount procfs). But this requires forking off a helper process because you cannot just one-shot this using mount(2).
Container runtimes in general need to fork into a container before configuring its mounts, which can lead to security issues in the case of shared-pidns containers (a privileged process in the pidns can interact with your container runtime process). While SUID_DUMP_DISABLE and user namespaces make this less of an issue, the strict need for this due to a minor uAPI wart is kind of unfortunate.
Things would be much easier if there was a way for userspace to just specify the pidns they want. Patch 1 implements a new "pidns" argument which can be set using fsconfig(2):
fsconfig(procfd, FSCONFIG_SET_FD, "pidns", NULL, nsfd); fsconfig(procfd, FSCONFIG_SET_STRING, "pidns", "/proc/self/ns/pid", 0);
or classic mount(2) / mount(8):
// mount -t proc -o pidns=/proc/self/ns/pid proc /tmp/proc mount("proc", "/tmp/proc", "proc", MS_..., "pidns=/proc/self/ns/pid");
The initial security model I have in this RFC is to be as conservative as possible and just mirror the security model for setns(2) -- which means that you can only set pidns=... to pid namespaces that your current pid namespace is a direct ancestor of and you have CAP_SYS_ADMIN privileges over the pid namespace. This fulfils the requirements of container runtimes, but I suspect that this may be too strict for some usecases.
The pidns argument is not displayed in mountinfo -- it's not clear to me what value it would make sense to show (maybe we could just use ns_dname to provide an identifier for the namespace, but this number would be fairly useless to userspace). I'm open to suggestions. Note that PROCFS_GET_PID_NAMESPACE (see below) does at least let userspace get information about this outside of mountinfo.
Note that you cannot change the pidns of an already-created procfs instance. The primary reason is that allowing this to be changed would require RCU-protecting proc_pid_ns(sb) and thus auditing all of fs/proc/* and some of the users in fs/* to make sure they wouldn't UAF the pid namespace. Since creating procfs instances is very cheap, it seems unnecessary to overcomplicate this upfront. Trying to reconfigure procfs this way errors out with -EBUSY.
/* ioctl(PROCFS_GET_PID_NAMESPACE) */
In addition, being able to figure out what pid namespace is being used by a procfs mount is quite useful when you have an administrative process (such as a container runtime) which wants to figure out the correct way of mapping PIDs between its own namespace and the namespace for procfs (using NS_GET_{PID,TGID}_{IN,FROM}_PIDNS). There are alternative ways to do this, but they all rely on ancillary information that third-party libraries and tools do not necessarily have access to.
To make this easier, add a new ioctl (PROCFS_GET_PID_NAMESPACE) which can be used to get a reference to the pidns that a procfs is using.
Rather than copying the (fairly strict) security model for setns(2), apply a slightly looser model to better match what userspace can already do:
Make the ioctl only valid on the root (meaning that a process without access to the procfs root -- such as only having an fd to a procfs file or some open_tree(2)-like subset -- cannot use this API). This means that the process already has some level of access to the /proc/$pid directories.
If the calling process is in an ancestor pidns, then they can already create pidfd for processes inside the pidns, which is morally equivalent to a pidns file descriptor according to setns(2). So it seems reasonable to just allow it in this case. (The justification for this model was suggested by Christian.)
If the process has access to /proc/1/ns/pid already (i.e. has ptrace-read access to the pidns pid1), then this ioctl is equivalent to just opening a handle to it that way.
Ideally we would check for ptrace-read access against all processes in the pidns (which is very likely to be true for at least one process, as SUID_DUMP_DISABLE is cleared on exec(2) and is rarely set by most programs), but this would obviously not scale.
I'm open to suggestions for whether we need to make this stricter (or possibly allow more cases).
Signed-off-by: Aleksa Sarai cyphar@cyphar.com
Thanks for the patchset. Being able to specify what pid namespace the procfs instance is supposed to belong to is super useful and will make things easier for userspace for sure.
The code you added contains a minor wrinkle that I disliked which I've changed and you tell me if you can live with this restriction or not.
The way you've implemented it specifying a pid namespace that the caller holds privilege over would silently also override the user namespace the filesystem is supposed to belong to.
Specifically, you did something like:
put_pid_ns(ctx->pid_ns); ctx->pid_ns = get_pid_ns(target); put_user_ns(fc->user_ns); fc->user_ns = get_user_ns(ctx->pid_ns->user_ns);
This silently overrides the user namespace recorded at fsopen() time. I think that's too subtle and we should just not allow that at all for now.
Instead I've changed this to:
if (fc->user_ns != target->user_ns) return invalfc(fc, "owning user namespace of pid namespace doesn't match procfs user namespace");
put_pid_ns(ctx->pid_ns); ctx->pid_ns = get_pid_ns(target);
so we just refuse different owernship.
I've also dropped the procfs ioctl because I'm not sure how much value it will actually add given that you can do this via /proc/1/ns/pid.
If that is something that libpathrs despearately needs I would like to do it as a separate patch anyways.
Thanks for the excellent cover letter. This was a pleasure merging!