While user namespaces do not make the kernel more vulnerable, they are nevertheless used to initiate exploits. Some users do not want to block namespace creation for the entire system, which is what some distributions provide. Instead, we needed a way to block some applications and allow others, which is not possible with those tools. Managing hierarchies also did not fit our case because we're determining which tasks are allowed based on their attributes.
While exploring a solution, we first leveraged the LSM cred_prepare hook because that is the closest hook to prevent a call to create_user_ns().
The calls look something like this:
cred = prepare_creds()
        security_prepare_creds()
                call_int_hook(cred_prepare, ...
if (cred)
        create_user_ns(cred)
We noticed that error codes were not propagated from this hook and introduced a patch [1] to propagate those errors.
The discussion notes that security_prepare_creds() is not appropriate for MAC policies, and instead the hook is meant for LSM authors to prepare credentials for mutation. [2]
Additionally, the cred_prepare hook is not without problems. Handling the clone3 case is trickier because of the user space pointer passed to it, which makes checking the syscall subject to a possible TOCTTOU attack.
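The TOCTTOU concern can be illustrated with a small user-space model. This is only a sketch, not kernel code: model_clone_args, MODEL_CLONE_NEWUSER, and all three functions are hypothetical names invented for this illustration, and the racer callback stands in for a second thread rewriting the argument block between the check and the use.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for the flag bit; same value as CLONE_NEWUSER. */
#define MODEL_CLONE_NEWUSER 0x10000000ULL

/* Hypothetical stand-in for a clone3() argument block in user memory. */
struct model_clone_args {
	unsigned long long flags;
};

/* Simulates another thread that rewrites user memory mid-syscall. */
typedef void (*racer_fn)(struct model_clone_args *uargs);

/*
 * Racy variant: the policy check and the later use each fetch the flags
 * from "user memory", so the value can change in between.
 * Returns -EPERM if denied, otherwise 0 with the flags actually used.
 */
static int model_check_then_refetch(struct model_clone_args *uargs,
				    racer_fn racer, unsigned long long *used)
{
	assert(uargs != NULL && used != NULL);

	if (uargs->flags & MODEL_CLONE_NEWUSER)	/* time of check */
		return -EPERM;
	if (racer)
		racer(uargs);			/* race window */
	*used = uargs->flags;			/* time of use: second fetch */
	return 0;
}

/* Safe variant: snapshot once, then check and use the same copy. */
static int model_copy_then_check(struct model_clone_args *uargs,
				 racer_fn racer, unsigned long long *used)
{
	struct model_clone_args snap;

	assert(uargs != NULL && used != NULL);

	snap = *uargs;				/* single fetch */
	if (snap.flags & MODEL_CLONE_NEWUSER)
		return -EPERM;
	if (racer)
		racer(uargs);			/* too late to matter */
	*used = snap.flags;
	return 0;
}

/* Example racer: turns the namespace flag on after the check has passed. */
static void model_set_newuser(struct model_clone_args *uargs)
{
	uargs->flags |= MODEL_CLONE_NEWUSER;
}
```

With model_set_newuser() as the racer, the racy variant returns 0 yet reports the namespace flag as in use, which is the bypass. A hook that is handed kernel-side state instead of a user pointer has no second fetch to race against.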
Ultimately, we concluded that a better course of action is to introduce a new security hook for LSM authors. [3]
This patch set first introduces a new security_create_user_ns() function and userns_create LSM hook, then marks the hook as sleepable in BPF. The remaining patches add a BPF test and an SELinux implementation.
We want to encourage use of user namespaces, and also cater to the needs of users/administrators to observe and/or control access. There is no expectation of an impact on user space applications because access control is opt-in, and users wishing to observe within an LSM context
Links:
1. https://lore.kernel.org/all/20220608150942.776446-1-fred@cloudflare.com/
2. https://lore.kernel.org/all/87y1xzyhub.fsf@email.froward.int.ebiederm.org/
3. https://lore.kernel.org/all/9fe9cd9f-1ded-a179-8ded-5fde8960a586@cloudflare....

Past discussions:
V4: https://lore.kernel.org/all/20220801180146.1157914-1-fred@cloudflare.com/
V3: https://lore.kernel.org/all/20220721172808.585539-1-fred@cloudflare.com/
V2: https://lore.kernel.org/all/20220707223228.1940249-1-fred@cloudflare.com/
V1: https://lore.kernel.org/all/20220621233939.993579-1-fred@cloudflare.com/

Changes since v4:
- Update commit description
- Update cover letter
Changes since v3:
- Explicitly set CAP_SYS_ADMIN to test namespace is created given permission
- Simplify BPF test to use sleepable hook only
- Prefer unshare() over clone() for tests
Changes since v2:
- Rename create_user_ns hook to userns_create
- Use user_namespace as an object opposed to a generic namespace object
- s/domB_t/domA_t in commit message
Changes since v1:
- Add selftests/bpf: Add tests verifying bpf lsm create_user_ns hook patch
- Add selinux: Implement create_user_ns hook patch
- Change function signature of security_create_user_ns() to only take struct cred
- Move security_create_user_ns() call after id mapping check in create_user_ns()
- Update documentation to reflect changes

Frederick Lawler (4):
  security, lsm: Introduce security_create_user_ns()
  bpf-lsm: Make bpf_lsm_userns_create() sleepable
  selftests/bpf: Add tests verifying bpf lsm userns_create hook
  selinux: Implement userns_create hook

 include/linux/lsm_hook_defs.h                 |   1 +
 include/linux/lsm_hooks.h                     |   4 +
 include/linux/security.h                      |   6 ++
 kernel/bpf/bpf_lsm.c                          |   1 +
 kernel/user_namespace.c                       |   5 +
 security/security.c                           |   5 +
 security/selinux/hooks.c                      |   9 ++
 security/selinux/include/classmap.h           |   2 +
 .../selftests/bpf/prog_tests/deny_namespace.c | 102 ++++++++++++++++++
 .../selftests/bpf/progs/test_deny_namespace.c |  33 ++++++
 10 files changed, 168 insertions(+)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/deny_namespace.c
 create mode 100644 tools/testing/selftests/bpf/progs/test_deny_namespace.c
User namespaces are an effective tool for letting programs run with the permissions they need without requiring them to run as root. User namespaces may also be used as a sandboxing technique. However, attackers sometimes leverage user namespaces as an initial attack vector to perform some exploit. [1,2,3]
While it is not the unprivileged user namespace functionality itself that makes the kernel exploitable, users/administrators might want to limit more granularly, or at least monitor, how various processes use this functionality while vulnerable kernel subsystems are being patched.
Preventing user namespace creation already comes in a few forms, in order of granularity:

1. /proc/sys/user/max_user_namespaces sysctl
2. Distro specific patch(es)
3. CONFIG_USER_NS
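For reference, the first of these controls can be inspected from user space by reading the sysctl file; writing 0 to it prevents creation of new user namespaces. The helper below is a minimal sketch: read_userns_limit() is a name made up for this example, and the path is passed in as a parameter so the helper can be pointed at any file.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>

/*
 * Hypothetical helper: read a single integer limit from a sysctl-style
 * file such as /proc/sys/user/max_user_namespaces.
 * Returns 0 and stores the value in *out, or a negative errno on failure.
 */
static int read_userns_limit(const char *path, long *out)
{
	FILE *f;
	int parsed;

	assert(path != NULL && out != NULL);

	f = fopen(path, "r");
	if (!f)
		return -errno;

	parsed = fscanf(f, "%ld", out);
	fclose(f);

	/* Anything other than exactly one parsed integer is malformed. */
	return parsed == 1 ? 0 : -EINVAL;
}
```

A caller would pass "/proc/sys/user/max_user_namespaces" and treat a stored value of 0 as "user namespace creation disabled". Note that this control cannot distinguish one task from another, which is the gap the rest of this message is about.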
To block a task based on its attributes, the LSM hook cred_prepare is a decent candidate for use because it provides more granular control, and it is called before create_user_ns():
cred = prepare_creds()
        security_prepare_creds()
                call_int_hook(cred_prepare, ...
if (cred)
        create_user_ns(cred)
Since security_prepare_creds() is meant for LSMs to copy and prepare credentials, access control is an unintended use of the hook. [4] Further, security_prepare_creds() will always return ENOMEM if the hook returns any non-zero error code.
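The loss of the error code can be sketched in user space. This is a model only, under the assumption that a hook failure during credential preparation surfaces to the caller as a NULL credential pointer, which the caller can only report as -ENOMEM. Every name below (model_cred, model_prepare_creds(), and so on) is hypothetical and chosen for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for an LSM hook: returns 0 or a negative errno. */
typedef int (*model_hook_fn)(void);

/* Hypothetical stand-in for struct cred. */
struct model_cred {
	int unused;
};

static struct model_cred model_cred_storage;

/*
 * Models credential preparation: any hook failure becomes a NULL return,
 * so the specific error code is dropped at this point.
 */
static struct model_cred *model_prepare_creds(model_hook_fn hook)
{
	assert(hook != NULL);

	if (hook() < 0)
		return NULL;
	return &model_cred_storage;
}

/* Models a caller that can only interpret NULL as "out of memory". */
static int model_unshare_via_cred_prepare(model_hook_fn hook)
{
	struct model_cred *cred = model_prepare_creds(hook);

	if (!cred)
		return -ENOMEM;
	return 0;
}

/* Models a dedicated hook whose return value is passed straight through. */
static int model_unshare_via_userns_create(model_hook_fn hook)
{
	assert(hook != NULL);

	return hook();
}

/* Example policies. */
static int model_deny(void)  { return -EPERM; }
static int model_allow(void) { return 0; }
```

With model_deny() as the policy, the first path reports -ENOMEM while the second reports -EPERM, so only the dedicated hook lets user space tell a policy denial apart from an allocation failure.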
This hook also does not handle the clone3 case, which requires us to access a user space pointer to know if we're in the CLONE_NEWUSER call path, and that access may be subject to a TOCTTOU attack.
Lastly, cred_prepare is called in many call paths, and a targeted hook further limits the frequency of calls which is a beneficial outcome. Therefore introduce a new function security_create_user_ns() with an accompanying userns_create LSM hook.
With the new userns_create hook, users will have more control over both the observability of and access to user namespace creation. Users should expect user namespaces to behave as usual, and to be impacted only when users or administrators implement controls.
This hook takes the prepared creds for LSM authors to write policy against. On success, the new namespace is applied to credentials, otherwise an error is returned.
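The hook semantics described above can be modeled in user space as follows. This is a sketch under stated assumptions rather than the kernel implementation: model_cred, the hook table, and model_security_create_user_ns() are hypothetical, and the model assumes hooks are called in order with the first non-zero return value ending the walk, and that no registered hooks means "allow" (matching the hook's default return value of 0).

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* CAP_SYS_ADMIN is capability number 21. */
#define MODEL_CAP_SYS_ADMIN 21

/* Hypothetical, simplified stand-in for struct cred. */
struct model_cred {
	unsigned long long cap_effective;	/* one bit per capability */
};

/* Hypothetical hook signature mirroring userns_create(const struct cred *). */
typedef int (*model_userns_create_hook)(const struct model_cred *cred);

/*
 * Models the dispatcher: call each registered hook in order and stop at
 * the first one that returns non-zero. An empty table allows the request.
 */
static int model_security_create_user_ns(const model_userns_create_hook *hooks,
					 size_t nr_hooks,
					 const struct model_cred *cred)
{
	size_t i;

	assert(cred != NULL);

	for (i = 0; i < nr_hooks; i++) {
		int rc = hooks[i](cred);

		if (rc != 0)
			return rc;
	}
	return 0;
}

/* Example policy: only tasks with CAP_SYS_ADMIN may create a namespace. */
static int model_require_sys_admin(const struct model_cred *cred)
{
	if (cred->cap_effective & (1ULL << MODEL_CAP_SYS_ADMIN))
		return 0;
	return -EPERM;
}
```

In this model, a caller that receives a negative value abandons namespace creation before anything is allocated, which mirrors the placement of the hook call ahead of the allocation in create_user_ns().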
Links:
1. https://nvd.nist.gov/vuln/detail/CVE-2022-0492
2. https://nvd.nist.gov/vuln/detail/CVE-2022-25636
3. https://nvd.nist.gov/vuln/detail/CVE-2022-34918
4. https://lore.kernel.org/all/1c4b1c0d-12f6-6e9e-a6a3-cdce7418110c@schaufler-c...

Reviewed-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Reviewed-by: KP Singh <kpsingh@kernel.org>
Signed-off-by: Frederick Lawler <fred@cloudflare.com>
---
Changes since v4:
- Update commit description
Changes since v3:
- No changes
Changes since v2:
- Rename create_user_ns hook to userns_create
Changes since v1:
- Changed commit wording
- Moved execution to be after id mapping check
- Changed signature to only accept a const struct cred *
---
 include/linux/lsm_hook_defs.h | 1 +
 include/linux/lsm_hooks.h     | 4 ++++
 include/linux/security.h      | 6 ++++++
 kernel/user_namespace.c       | 5 +++++
 security/security.c           | 5 +++++
 5 files changed, 21 insertions(+)

diff --git a/include/linux/lsm_hook_defs.h b/include/linux/lsm_hook_defs.h
index 806448173033..aa7272e83626 100644
--- a/include/linux/lsm_hook_defs.h
+++ b/include/linux/lsm_hook_defs.h
@@ -224,6 +224,7 @@ LSM_HOOK(int, -ENOSYS, task_prctl, int option, unsigned long arg2,
 	 unsigned long arg3, unsigned long arg4, unsigned long arg5)
 LSM_HOOK(void, LSM_RET_VOID, task_to_inode, struct task_struct *p,
 	 struct inode *inode)
+LSM_HOOK(int, 0, userns_create, const struct cred *cred)
 LSM_HOOK(int, 0, ipc_permission, struct kern_ipc_perm *ipcp, short flag)
 LSM_HOOK(void, LSM_RET_VOID, ipc_getsecid, struct kern_ipc_perm *ipcp,
 	 u32 *secid)
diff --git a/include/linux/lsm_hooks.h b/include/linux/lsm_hooks.h
index 84a0d7e02176..2e11a2a22ed1 100644
--- a/include/linux/lsm_hooks.h
+++ b/include/linux/lsm_hooks.h
@@ -806,6 +806,10 @@
  *	security attributes, e.g. for /proc/pid inodes.
  *	@p contains the task_struct for the task.
  *	@inode contains the inode structure for the inode.
+ * @userns_create:
+ *	Check permission prior to creating a new user namespace.
+ *	@cred points to prepared creds.
+ *	Return 0 if successful, otherwise < 0 error code.
  *
  * Security hooks for Netlink messaging.
  *
diff --git a/include/linux/security.h b/include/linux/security.h
index 1bc362cb413f..767802fe9bfa 100644
--- a/include/linux/security.h
+++ b/include/linux/security.h
@@ -437,6 +437,7 @@ int security_task_kill(struct task_struct *p, struct kernel_siginfo *info,
 int security_task_prctl(int option, unsigned long arg2, unsigned long arg3,
 			unsigned long arg4, unsigned long arg5);
 void security_task_to_inode(struct task_struct *p, struct inode *inode);
+int security_create_user_ns(const struct cred *cred);
 int security_ipc_permission(struct kern_ipc_perm *ipcp, short flag);
 void security_ipc_getsecid(struct kern_ipc_perm *ipcp, u32 *secid);
 int security_msg_msg_alloc(struct msg_msg *msg);
@@ -1194,6 +1195,11 @@ static inline int security_task_prctl(int option, unsigned long arg2,
 static inline void security_task_to_inode(struct task_struct *p, struct inode *inode)
 { }
 
+static inline int security_create_user_ns(const struct cred *cred)
+{
+	return 0;
+}
+
 static inline int security_ipc_permission(struct kern_ipc_perm *ipcp,
 					  short flag)
 {
diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c
index 5481ba44a8d6..3f464bbda0e9 100644
--- a/kernel/user_namespace.c
+++ b/kernel/user_namespace.c
@@ -9,6 +9,7 @@
 #include <linux/highuid.h>
 #include <linux/cred.h>
 #include <linux/securebits.h>
+#include <linux/security.h>
 #include <linux/keyctl.h>
 #include <linux/key-type.h>
 #include <keys/user-type.h>
@@ -113,6 +114,10 @@ int create_user_ns(struct cred *new)
 	    !kgid_has_mapping(parent_ns, group))
 		goto fail_dec;
 
+	ret = security_create_user_ns(new);
+	if (ret < 0)
+		goto fail_dec;
+
 	ret = -ENOMEM;
 	ns = kmem_cache_zalloc(user_ns_cachep, GFP_KERNEL);
 	if (!ns)
diff --git a/security/security.c b/security/security.c
index 14d30fec8a00..1e60c4b570ec 100644
--- a/security/security.c
+++ b/security/security.c
@@ -1909,6 +1909,11 @@ void security_task_to_inode(struct task_struct *p, struct inode *inode)
 	call_void_hook(task_to_inode, p, inode);
 }
 
+int security_create_user_ns(const struct cred *cred)
+{
+	return call_int_hook(userns_create, 0, cred);
+}
+
 int security_ipc_permission(struct kern_ipc_perm *ipcp, short flag)
 {
 	return call_int_hook(ipc_permission, 0, ipcp, flag);
Users may want to audit calls to security_create_user_ns() and access user space memory. Also create_user_ns() runs without pagefault_disabled(). Therefore, make bpf_lsm_userns_create() sleepable for mandatory access control policies.
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Acked-by: KP Singh <kpsingh@kernel.org>
Signed-off-by: Frederick Lawler <fred@cloudflare.com>
---
Changes since v4:
- None
Changes since v3:
- None
Changes since v2:
- Rename create_user_ns hook to userns_create
Changes since v1:
- None
---
 kernel/bpf/bpf_lsm.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/kernel/bpf/bpf_lsm.c b/kernel/bpf/bpf_lsm.c
index fa71d58b7ded..761998fda762 100644
--- a/kernel/bpf/bpf_lsm.c
+++ b/kernel/bpf/bpf_lsm.c
@@ -335,6 +335,7 @@ BTF_ID(func, bpf_lsm_task_getsecid_obj)
 BTF_ID(func, bpf_lsm_task_prctl)
 BTF_ID(func, bpf_lsm_task_setscheduler)
 BTF_ID(func, bpf_lsm_task_to_inode)
+BTF_ID(func, bpf_lsm_userns_create)
 BTF_SET_END(sleepable_lsm_hooks)
 
 bool bpf_lsm_is_sleepable_hook(u32 btf_id)
The LSM hook userns_create was introduced to provide LSMs an opportunity to block or allow unprivileged user namespace creation. This test serves two purposes: it provides a test eBPF implementation, and it tests that the hook successfully blocks or allows user namespace creation.

This tests 3 cases:

1. Unattached bpf program does not block unpriv user namespace creation.
2. Attached bpf program allows user namespace creation given CAP_SYS_ADMIN privileges.
3. Attached bpf program denies user namespace creation for a user without CAP_SYS_ADMIN.

Acked-by: KP Singh <kpsingh@kernel.org>
Signed-off-by: Frederick Lawler <fred@cloudflare.com>
---
The generic deny_namespace file name is used for future namespace expansion. I didn't want to limit these files to just the create_user_ns hook.

Changes since v4:
- None
Changes since v3:
- Explicitly set CAP_SYS_ADMIN to test namespace is created given permission
- Simplify BPF test to use sleepable hook only
- Prefer unshare() over clone() for tests
Changes since v2:
- Rename create_user_ns hook to userns_create
Changes since v1:
- Introduce this patch
---
 .../selftests/bpf/prog_tests/deny_namespace.c | 102 ++++++++++++++++++
 .../selftests/bpf/progs/test_deny_namespace.c |  33 ++++++
 2 files changed, 135 insertions(+)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/deny_namespace.c
 create mode 100644 tools/testing/selftests/bpf/progs/test_deny_namespace.c

diff --git a/tools/testing/selftests/bpf/prog_tests/deny_namespace.c b/tools/testing/selftests/bpf/prog_tests/deny_namespace.c
new file mode 100644
index 000000000000..1bc6241b755b
--- /dev/null
+++ b/tools/testing/selftests/bpf/prog_tests/deny_namespace.c
@@ -0,0 +1,102 @@
+// SPDX-License-Identifier: GPL-2.0
+#define _GNU_SOURCE
+#include <test_progs.h>
+#include "test_deny_namespace.skel.h"
+#include <sched.h>
+#include "cap_helpers.h"
+#include <stdio.h>
+
+static int wait_for_pid(pid_t pid)
+{
+	int status, ret;
+
+again:
+	ret = waitpid(pid, &status, 0);
+	if (ret == -1) {
+		if (errno == EINTR)
+			goto again;
+
+		return -1;
+	}
+
+	if (!WIFEXITED(status))
+		return -1;
+
+	return WEXITSTATUS(status);
+}
+
+/* negative return value -> some internal error
+ * positive return value -> userns creation failed
+ * 0                     -> userns creation succeeded
+ */
+static int create_user_ns(void)
+{
+	pid_t pid;
+
+	pid = fork();
+	if (pid < 0)
+		return -1;
+
+	if (pid == 0) {
+		if (unshare(CLONE_NEWUSER))
+			_exit(EXIT_FAILURE);
+		_exit(EXIT_SUCCESS);
+	}
+
+	return wait_for_pid(pid);
+}
+
+static void test_userns_create_bpf(void)
+{
+	__u32 cap_mask = 1ULL << CAP_SYS_ADMIN;
+	__u64 old_caps = 0;
+
+	cap_enable_effective(cap_mask, &old_caps);
+
+	ASSERT_OK(create_user_ns(), "priv new user ns");
+
+	cap_disable_effective(cap_mask, &old_caps);
+
+	ASSERT_EQ(create_user_ns(), EPERM, "unpriv new user ns");
+
+	if (cap_mask & old_caps)
+		cap_enable_effective(cap_mask, NULL);
+}
+
+static void test_unpriv_userns_create_no_bpf(void)
+{
+	__u32 cap_mask = 1ULL << CAP_SYS_ADMIN;
+	__u64 old_caps = 0;
+
+	cap_disable_effective(cap_mask, &old_caps);
+
+	ASSERT_OK(create_user_ns(), "no-bpf unpriv new user ns");
+
+	if (cap_mask & old_caps)
+		cap_enable_effective(cap_mask, NULL);
+}
+
+void test_deny_namespace(void)
+{
+	struct test_deny_namespace *skel = NULL;
+	int err;
+
+	if (test__start_subtest("unpriv_userns_create_no_bpf"))
+		test_unpriv_userns_create_no_bpf();
+
+	skel = test_deny_namespace__open_and_load();
+	if (!ASSERT_OK_PTR(skel, "skel load"))
+		goto close_prog;
+
+	err = test_deny_namespace__attach(skel);
+	if (!ASSERT_OK(err, "attach"))
+		goto close_prog;
+
+	if (test__start_subtest("userns_create_bpf"))
+		test_userns_create_bpf();
+
+	test_deny_namespace__detach(skel);
+
+close_prog:
+	test_deny_namespace__destroy(skel);
+}
diff --git a/tools/testing/selftests/bpf/progs/test_deny_namespace.c b/tools/testing/selftests/bpf/progs/test_deny_namespace.c
new file mode 100644
index 000000000000..09ad5a4ebd1f
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/test_deny_namespace.c
@@ -0,0 +1,33 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <linux/bpf.h>
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_tracing.h>
+#include <errno.h>
+#include <linux/capability.h>
+
+struct kernel_cap_struct {
+	__u32 cap[_LINUX_CAPABILITY_U32S_3];
+} __attribute__((preserve_access_index));
+
+struct cred {
+	struct kernel_cap_struct cap_effective;
+} __attribute__((preserve_access_index));
+
+char _license[] SEC("license") = "GPL";
+
+SEC("lsm.s/userns_create")
+int BPF_PROG(test_userns_create, const struct cred *cred, int ret)
+{
+	struct kernel_cap_struct caps = cred->cap_effective;
+	int cap_index = CAP_TO_INDEX(CAP_SYS_ADMIN);
+	__u32 cap_mask = CAP_TO_MASK(CAP_SYS_ADMIN);
+
+	if (ret)
+		return 0;
+
+	ret = -EPERM;
+	if (caps.cap[cap_index] & cap_mask)
+		return 0;
+
+	return -EPERM;
+}
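Outside the selftest harness, the same fork-and-unshare probe can be written as a standalone helper that reports the errno the child saw. This is a sketch, not part of the patch: probe_userns_create() is a hypothetical name, and the raw syscall() form is used here only so the example does not depend on feature-test macros; the selftest itself calls unshare() directly.

```c
#include <assert.h>
#include <errno.h>
#include <linux/sched.h>	/* CLONE_NEWUSER */
#include <sys/syscall.h>	/* SYS_unshare */
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical probe: try to create a user namespace in a child process.
 * Returns 0 if creation succeeded, the positive errno reported by the
 * unshare system call if it failed, or -1 on an internal error
 * (fork/waitpid failure or abnormal child exit).
 */
static int probe_userns_create(void)
{
	int status;
	pid_t pid = fork();

	if (pid < 0)
		return -1;

	if (pid == 0) {
		/* Child: report the errno through the exit status. */
		if (syscall(SYS_unshare, CLONE_NEWUSER))
			_exit(errno);
		_exit(0);
	}

	/* Parent: wait for the child, retrying if interrupted. */
	while (waitpid(pid, &status, 0) < 0) {
		if (errno != EINTR)
			return -1;
	}

	if (!WIFEXITED(status))
		return -1;

	return WEXITSTATUS(status);
}
```

The result depends on the environment. With a policy like the BPF program above attached, a caller without CAP_SYS_ADMIN would be expected to get EPERM back from this probe, whereas other controls, such as a max_user_namespaces limit of 0, surface as different errno values.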
Unprivileged user namespace creation is an intended feature to enable sandboxing; however, this feature is often used as an initial step to perform a privilege escalation attack.
This patch implements a new user_namespace { create } access control permission to restrict which domains allow or deny user namespace creation. This is necessary for system administrators to quickly protect their systems while waiting for vulnerability patches to be applied.
This permission can be used in the following way:
allow domA_t domA_t : user_namespace { create };
Signed-off-by: Frederick Lawler <fred@cloudflare.com>
---
Changes since v4:
- None
Changes since v3:
- None
Changes since v2:
- Rename create_user_ns hook to userns_create
- Use user_namespace as an object opposed to a generic namespace object
- s/domB_t/domA_t in commit message
Changes since v1:
- Introduce this patch
---
 security/selinux/hooks.c            | 9 +++++++++
 security/selinux/include/classmap.h | 2 ++
 2 files changed, 11 insertions(+)

diff --git a/security/selinux/hooks.c b/security/selinux/hooks.c
index 79573504783b..b9f1078450b3 100644
--- a/security/selinux/hooks.c
+++ b/security/selinux/hooks.c
@@ -4221,6 +4221,14 @@ static void selinux_task_to_inode(struct task_struct *p,
 	spin_unlock(&isec->lock);
 }
 
+static int selinux_userns_create(const struct cred *cred)
+{
+	u32 sid = current_sid();
+
+	return avc_has_perm(&selinux_state, sid, sid, SECCLASS_USER_NAMESPACE,
+						USER_NAMESPACE__CREATE, NULL);
+}
+
 /* Returns error only if unable to parse addresses */
 static int selinux_parse_skb_ipv4(struct sk_buff *skb,
 			struct common_audit_data *ad, u8 *proto)
@@ -7111,6 +7119,7 @@ static struct security_hook_list selinux_hooks[] __lsm_ro_after_init = {
 	LSM_HOOK_INIT(task_movememory, selinux_task_movememory),
 	LSM_HOOK_INIT(task_kill, selinux_task_kill),
 	LSM_HOOK_INIT(task_to_inode, selinux_task_to_inode),
+	LSM_HOOK_INIT(userns_create, selinux_userns_create),
 
 	LSM_HOOK_INIT(ipc_permission, selinux_ipc_permission),
 	LSM_HOOK_INIT(ipc_getsecid, selinux_ipc_getsecid),
diff --git a/security/selinux/include/classmap.h b/security/selinux/include/classmap.h
index ff757ae5f253..0bff55bb9cde 100644
--- a/security/selinux/include/classmap.h
+++ b/security/selinux/include/classmap.h
@@ -254,6 +254,8 @@ const struct security_class_mapping secclass_map[] = {
 	  { COMMON_FILE_PERMS, NULL } },
 	{ "io_uring",
 	  { "override_creds", "sqpoll", NULL } },
+	{ "user_namespace",
+	  { "create", NULL } },
 	{ NULL }
   };
On Mon, Aug 15, 2022 at 12:20 PM Frederick Lawler fred@cloudflare.com wrote:
[...]
I just merged this into the lsm/next tree, thanks for seeing this through Frederick, and thank you to everyone who took the time to review the patches and add their tags.
git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/lsm.git next
Paul, Frederick
I repeat my NACK, in part because I am being ignored and in part because the hook does not make technical sense.
Linus I want you to know that this has been put in the lsm tree against my explicit and clear objections.
My request to talk about the actual problems that are being addressed has been completely ignored.
I have been a bit slow in dealing with this conversation because I am very much sick and not on top of my game, but that is no excuse to steam roll over me, instead of addressing my concerns.
This is an irresponsible way of adding an access control to user namespace creation. This is a linux-api and manpages level kind of change, as this is a semantic change visible to userspace. Instead that concern has been brushed off as a different return code to userspace.
For observability this is a terrible LSM interface because there is no pair with user namespace destruction, nor is there any ability for the LSM to allocate any state to track the user namespace. As there is no patch actually calling audit or anything else, observability does not appear to be a driving factor of this new interface.
The common scenarios I am aware of for using the user namespace are: - Creating a container. - Using the user namespace to sandbox your application like chrome does. - Running an exploit.
Returning an error code in the first 2 scenarios will create a userspace regression as either userspace will run less securely or it won't work at all.
Returning an error code in the third scenario when someone is trying to exploit your machine is equally foolish as you are giving the exploit the chance to continue running. The application should be killed instead.
Further adding a random failure mode to user namespace creation if it is used at all will just encourage userspace to use a setuid application to perform the namespace creation instead. Creating a less secure system overall.
If the concern is to reduce the attack surface everything this proposed hook can do is already possible with the security_capable security hook.
So Paul, Frederick please drop this. I can't see what this new hook is good for except creating regressions in existing userspace code. I am not willing to support such a hook in code that I maintain.
Eric
On Wed, Aug 17, 2022 at 11:08 AM Eric W. Biederman ebiederm@xmission.com wrote:
Paul, Frederick
I repeat my NACK, in part because I am being ignored and in part because the hook does not make technical sense.
Linus I want you to know that this has been put in the lsm tree against my explicit and clear objections.
Eric, we are disagreeing with you, not ignoring you; that's an important distinction. This is the fifth iteration of the patchset, or the sixth (?) if you count Frederick's earlier attempts using the credential hooks, and with each revision multiple people have tried to work with you to find a mutually agreeable solution to the use cases presented by Frederick and others. In the end of the v4 discussion it was my opinion that you kept moving the goalposts in an effort to prevent any additional hooks/controls/etc. to the user namespace code which is why I made the decision to merge the code into the lsm/next branch against your wishes. Multiple people have come out in support of this functionality, and you remain the only one opposed to the change; normally a maintainer's objection would be enough to block the change, but it is my opinion that Eric is acting in bad faith.
At the end of the v4 patchset I suggested merging this into lsm/next so it could get a full -rc cycle in linux-next, assuming no issues were uncovered during testing I was planning to send it to Linus during the next merge window with commentary on the contentiousness of the patchset, including Eric's NACK. I'm personally very disappointed that it has come to this, but I'm at a loss of how to work with you (Eric) to find a solution; this is the only path forward that I can see at this point. Others have expressed their agreement with this approach, both on-list and privately.
If anyone other than Eric or myself has a different view of the situation, *please* add your comments now. I believe I've done a fair job of summarizing things, but everyone has a bias and I'm definitely no exception.
Finally, I'm going to refrain from rehashing the same arguments over again in this revision of the patchset, instead I'll just provide links to the previous drafts in case anyone wants to spend an hour or two:
Revision v1 https://lore.kernel.org/linux-security-module/20220621233939.993579-1-fred@c...
Revision v2 https://lore.kernel.org/linux-security-module/20220707223228.1940249-1-fred@...
Revision v3 https://lore.kernel.org/linux-security-module/20220721172808.585539-1-fred@c...
Revision v4 https://lore.kernel.org/linux-security-module/20220801180146.1157914-1-fred@...
-- paul-moore.com
Paul Moore paul@paul-moore.com writes:
At the end of the v4 patchset I suggested merging this into lsm/next so it could get a full -rc cycle in linux-next, assuming no issues were uncovered during testing
What in the world can be uncovered in linux-next for code that has no in tree users.
That is one of my largest problems. I want to talk about the users and the use cases and I don't get dialog. Nor do I get hey look back there you missed it.
Since you don't want to rehash this, I will just repeat my conclusion that the patchset appears to introduce an ineffective defense that will achieve nothing in the defense of the kernel, and so all it will achieve is a code maintenance burden and occasional breakage of legitimate users of the user namespace.
Further, the process is broken. You are changing the semantics of an operation with the introduction of a security hook. That needs a man-page and discussion on linux-api, and in general all of the scrutiny we give to new and changed system calls, as this change fundamentally changes the semantics of creating a user namespace.
Skipping that part of the process is not simply disagreeing; that is being irresponsible.
Eric
On Wed, Aug 17, 2022 at 3:58 PM Eric W. Biederman ebiederm@xmission.com wrote:
Paul Moore paul@paul-moore.com writes:
At the end of the v4 patchset I suggested merging this into lsm/next so it could get a full -rc cycle in linux-next, assuming no issues were uncovered during testing
What in the world can be uncovered in linux-next for code that has no in tree users.
The patchset provides both BPF LSM and SELinux implementations of the hooks along with a BPF LSM test under tools/testing/selftests/bpf/. If no one beats me to it, I plan to work on adding a test to the selinux-testsuite as soon as I'm done dealing with other urgent LSM/SELinux issues (io_uring CMD passthrough, SCTP problems, etc.); I run these tests multiple times a week (multiple times a day sometimes) against the -rcX kernels with the lsm/next, selinux/next, and audit/next branches applied on top. I know others do similar things.
Paul Moore paul@paul-moore.com writes:
The patchset provides both BPF LSM and SELinux implementations of the hooks along with a BPF LSM test under tools/testing/selftests/bpf/. If no one beats me to it, I plan to work on adding a test to the selinux-testsuite as soon as I'm done dealing with other urgent LSM/SELinux issues (io_uring CMD passthrough, SCTP problems, etc.); I run these tests multiple times a week (multiple times a day sometimes) against the -rcX kernels with the lsm/next, selinux/next, and audit/next branches applied on top. I know others do similar things.
A layer of hooks that leaves all of the logic to userspace is not an in-tree user for purposes of understanding the logic of the code.
The reason why I implemented user namespaces is so that all of linux's neat features could be exposed to non-root userspace processes, in a way that doesn't break suid root processes.
The access control you are adding to user namespaces looks to take that away. It looks to remove the whole point of user namespaces.
So without any mention of how people intend to use this feature, without any code that uses this hook to implement semantics, and without any talk about how this semantic change is reasonable, I strenuously object.
Eric
On Wed, Aug 17, 2022 at 4:56 PM Eric W. Biederman ebiederm@xmission.com wrote:
Paul Moore paul@paul-moore.com writes:
On Wed, Aug 17, 2022 at 3:58 PM Eric W. Biederman ebiederm@xmission.com wrote:
Paul Moore paul@paul-moore.com writes:
At the end of the v4 patchset I suggested merging this into lsm/next so it could get a full -rc cycle in linux-next, assuming no issues were uncovered during testing
What in the world can be uncovered in linux-next for code that has no in-tree users?
The patchset provides both BPF LSM and SELinux implementations of the hooks along with a BPF LSM test under tools/testing/selftests/bpf/. If no one beats me to it, I plan to work on adding a test to the selinux-testsuite as soon as I'm done dealing with other urgent LSM/SELinux issues (io_uring CMD passthrough, SCTP problems, etc.); I run these tests multiple times a week (multiple times a day sometimes) against the -rcX kernels with the lsm/next, selinux/next, and audit/next branches applied on top. I know others do similar things.
A layer of hooks that leaves all of the logic to userspace is not an in-tree user for purposes of understanding the logic of the code.
The BPF LSM selftests which are part of this patchset live in-tree. The SELinux hook implementation is completely in-tree with the subject/verb/object relationship clearly described by the code itself. After all, the selinux_userns_create() function consists of only two lines, one of which is an assignment. Yes, it is true that the SELinux policy lives outside the kernel, but that is because there is no singular SELinux policy for everyone. From a practical perspective, the SELinux policy is really just a configuration file used to set up the kernel at runtime; it is not significantly different than an iptables script, /etc/sysctl.conf, or any of the other myriad of configuration files used to configure the kernel during boot.
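As a rough illustration of the subject/verb/object relationship described above, here is a toy userspace model in Python. It is not the kernel code. The policy tuples, the type names (`container_runtime_t`, `third_party_t`), the `user_namespace`/`create` labels, and the helper names are all assumptions made for the sketch; the only point is that the hook asks a single question and the answer lives in a policy loaded at runtime.

```python
import errno

# Hypothetical policy: (source type, target type, class, permission) tuples,
# loosely modeled on SELinux "allow" rules. The type names are made up.
POLICY = {
    ("container_runtime_t", "container_runtime_t", "user_namespace", "create"),
}


def has_perm(policy, source, target, tclass, perm):
    """Return 0 if the policy allows the access, otherwise -EACCES."""
    return 0 if (source, target, tclass, perm) in policy else -errno.EACCES


def userns_create(policy, current_type):
    """Model of the hook: the current task is both subject and object."""
    return has_perm(policy, current_type, current_type,
                    "user_namespace", "create")
```

Swapping `POLICY` for a different set changes the outcome without touching `userns_create()`, which is the sense in which the policy behaves like a configuration file.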
Paul Moore paul@paul-moore.com writes:
On Wed, Aug 17, 2022 at 4:56 PM Eric W. Biederman ebiederm@xmission.com wrote:
Paul Moore paul@paul-moore.com writes:
On Wed, Aug 17, 2022 at 3:58 PM Eric W. Biederman ebiederm@xmission.com wrote:
Paul Moore paul@paul-moore.com writes:
At the end of the v4 patchset I suggested merging this into lsm/next so it could get a full -rc cycle in linux-next, assuming no issues were uncovered during testing
What in the world can be uncovered in linux-next for code that has no in-tree users?
The patchset provides both BPF LSM and SELinux implementations of the hooks along with a BPF LSM test under tools/testing/selftests/bpf/. If no one beats me to it, I plan to work on adding a test to the selinux-testsuite as soon as I'm done dealing with other urgent LSM/SELinux issues (io_uring CMD passthrough, SCTP problems, etc.); I run these tests multiple times a week (multiple times a day sometimes) against the -rcX kernels with the lsm/next, selinux/next, and audit/next branches applied on top. I know others do similar things.
A layer of hooks that leaves all of the logic to userspace is not an in-tree user for purposes of understanding the logic of the code.
The BPF LSM selftests which are part of this patchset live in-tree. The SELinux hook implementation is completely in-tree with the subject/verb/object relationship clearly described by the code itself. After all, the selinux_userns_create() function consists of only two lines, one of which is an assignment. Yes, it is true that the SELinux policy lives outside the kernel, but that is because there is no singular SELinux policy for everyone. From a practical perspective, the SELinux policy is really just a configuration file used to set up the kernel at runtime; it is not significantly different than an iptables script, /etc/sysctl.conf, or any of the other myriad of configuration files used to configure the kernel during boot.
I object to adding the new system configuration knob.
Especially when I don't see people explaining why such a knob is a good idea. What is userspace going to do with this new feature that makes it worth maintaining in the kernel?
That is always the conversation we have when adding new features, and that is exactly the conversation that has not happened here.
Adding a layer of indirection should not exempt a new feature from needing to justify itself.
Eric
On Wed, Aug 17, 2022 at 5:24 PM Eric W. Biederman ebiederm@xmission.com wrote:
I object to adding the new system configuration knob.
Especially when I don't see people explaining why such a knob is a good idea. What is userspace going to do with this new feature that makes it worth maintaining in the kernel?
From https://lore.kernel.org/all/CAEiveUdPhEPAk7Y0ZXjPsD=Vb5hn453CHzS9aG-tkyRa8bf...
"We have valid use cases not specifically related to the attack surface, but go into the middle from bpf observability to enforcement. As we want to track namespace creation, changes, nesting and per task creds context depending on the nature of the workload." -Djalal Harouni
From https://lore.kernel.org/linux-security-module/CALrw=nGT0kcHh4wyBwUF-Q8+v8Dgn...
"[W]e do want to embrace user namespaces in our code and some of our workloads already depend on it. Hence we didn't agree to Debian's approach of just having a global sysctl. But there is "our code" and there is "third party" code, which might not even be open source due to various reasons. And while the path exists for that code to do something bad - we want to block it." -Ignat Korchagin
From https://lore.kernel.org/linux-security-module/CAHC9VhSKmqn5wxF3BZ67Z+-CV7sZz...
"I've heard you talk about bugs being the only reason why people would want to ever block user namespaces, but I think we've all seen use cases now where it goes beyond that. However, even if it didn't, the need to build high confidence/assurance systems where big chunks of functionality can be disabled based on a security policy is a very real use case, and this patchset would help enable that." -Paul Moore (with apologies for self-quoting)
From https://lore.kernel.org/linux-security-module/CAHC9VhRSCXCM51xpOT95G_WVi=UQ4...
"One of the selling points of the BPF LSM is that it allows for various different ways of reporting and logging beyond audit. However, even if it was limited to just audit I believe that provides some useful justification as auditing fork()/clone() isn't quite the same and could be difficult to do at scale in some configurations." -Paul Moore (my apologies again)
From https://lore.kernel.org/linux-security-module/20220722082159.jgvw7jgds3qwfyq...
"Nice and straightforward." -Christian Brauner
Hi,
Please remove me from this list and stop harassing me.
Jonathan Moore
-----Original Message----- From: Paul Moore paul@paul-moore.com Sent: Wednesday, August 17, 2022 5:51 PM To: Eric W. Biederman ebiederm@xmission.com Cc: Linus Torvalds torvalds@linux-foundation.org; Frederick Lawler fred@cloudflare.com; kpsingh@kernel.org; revest@chromium.org; jackmanb@chromium.org; ast@kernel.org; daniel@iogearbox.net; andrii@kernel.org; kafai@fb.com; songliubraving@fb.com; yhs@fb.com; john.fastabend@gmail.com; jmorris@namei.org; serge@hallyn.com; stephen.smalley.work@gmail.com; eparis@parisplace.org; shuah@kernel.org; brauner@kernel.org; casey@schaufler-ca.com; bpf@vger.kernel.org; linux-security-module@vger.kernel.org; selinux@vger.kernel.org; linux-kselftest@vger.kernel.org; linux-kernel@vger.kernel.org; netdev@vger.kernel.org; kernel-team@cloudflare.com; cgzones@googlemail.com; karl@bigbadwolfsecurity.com; tixxdz@gmail.com Subject: Re: [PATCH v5 0/4] Introduce security_create_user_ns()
On Wed, Aug 17, 2022 at 04:24:28PM -0500, Eric W. Biederman wrote:
Paul Moore paul@paul-moore.com writes:
On Wed, Aug 17, 2022 at 4:56 PM Eric W. Biederman ebiederm@xmission.com wrote:
Paul Moore paul@paul-moore.com writes:
On Wed, Aug 17, 2022 at 3:58 PM Eric W. Biederman ebiederm@xmission.com wrote:
Paul Moore paul@paul-moore.com writes:
At the end of the v4 patchset I suggested merging this into lsm/next so it could get a full -rc cycle in linux-next, assuming no issues were uncovered during testing
What in the world can be uncovered in linux-next for code that has no in-tree users?
The patchset provides both BPF LSM and SELinux implementations of the hooks along with a BPF LSM test under tools/testing/selftests/bpf/. If no one beats me to it, I plan to work on adding a test to the selinux-testsuite as soon as I'm done dealing with other urgent LSM/SELinux issues (io_uring CMD passthrough, SCTP problems, etc.); I run these tests multiple times a week (multiple times a day sometimes) against the -rcX kernels with the lsm/next, selinux/next, and audit/next branches applied on top. I know others do similar things.
A layer of hooks that leaves all of the logic to userspace is not an in-tree user for purposes of understanding the logic of the code.
The BPF LSM selftests which are part of this patchset live in-tree. The SELinux hook implementation is completely in-tree with the subject/verb/object relationship clearly described by the code itself. After all, the selinux_userns_create() function consists of only two lines, one of which is an assignment. Yes, it is true that the SELinux policy lives outside the kernel, but that is because there is no singular SELinux policy for everyone. From a practical perspective, the SELinux policy is really just a configuration file used to set up the kernel at runtime; it is not significantly different than an iptables script, /etc/sysctl.conf, or any of the other myriad of configuration files used to configure the kernel during boot.
I object to adding the new system configuration knob.
I do strongly sympathize with Eric's points. It will be very easy, once user namespace creation has been further restricted in some distros, to say "well see this stuff is silly" and go back to simply requiring root to create all containers and namespaces, which is generally quite a bit easier anyway. And then, of course, give everyone root so they can start containers.
As Eric said,
| Further adding a random failure mode to user namespace creation if it is used at all will just encourage userspace to use a setuid application to perform the namespace creation instead. Creating a less secure system overall.
However, I'm also looking at e.g. CVE-2022-2588 and CVE-2022-2586, and yes there are two issues which do require discussion (three if you count reportability, which is mainly a tool in guarding against the others).
The first is, indeed, configuration knobs. There are tools, including chrome, which use user namespaces to make things better. The hope is that more and more tools will do so.
The second is damage control. When a 0-day has been announced, things change. You can say "well the bug was there all along", but it is different when every lazy ne'er-do-well can pick an exploit off a mailing list and use it against a product for which spinning a new version with a new kernel and getting customers to update is probably a months-long endeavor. Some of these products do in fact require namespaces (user and otherwise) as part of their function. And - to my chagrin - I suspect most of them create user namespaces as the root user, before possibly processing untrusted user input, so unprivileged_userns_clone isn't a good fit.
SELinux (and LSMs in general) do in fact seem like a useful place to add some configuration, because they tend to assign different domains to tasks with different purposes and trust levels. But another such place is the init system / service manager. And in most cases these days, this will use cgroups to collect tasks of certain types. So I wonder (this is ALMOST ENTIRELY thinking out loud, not thought through sufficiently) whether we should be setting a cgroup.nslock or somesuch.
Of course, kernel livepatch is another potentially useful mitigation. Currently that's not possible for everyone.
Maybe there is a more fundamental way we can approach this. Part of me still likes the idea of splitting the id mapping and capability-in-userns parts, but that's not sufficient. Maybe looking over all the relevant CVEs would give a better hint.
Eric, you said
| If the concern is to reduce the attack surface everything this proposed hook can do is already possible with the security_capable security hook.
I suppose I could envision an LSM which gets activated when we find out there was a net-ns-exacerbated 0-day, which refuses CAP_NET_ADMIN for a task not in init_user_ns? Ideally it would be more flexible than that.
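In the same thinking-out-loud spirit, such a mitigation could be modeled roughly as below. This is a toy Python model, not an LSM: `Task`, `capable_hook`, and the `mitigation_active` switch are hypothetical names, and only the numeric value of CAP_NET_ADMIN (12) is taken from the kernel's UAPI headers.

```python
import errno
from dataclasses import dataclass

CAP_NET_ADMIN = 12              # value from include/uapi/linux/capability.h
INIT_USER_NS = "init_user_ns"   # stand-in label for the initial user namespace


@dataclass(frozen=True)
class Task:
    user_ns: str      # hypothetical: name of the task's user namespace
    caps: frozenset   # capabilities the task holds in that namespace


def capable_hook(task, cap, mitigation_active):
    """Return 0 to allow the capability check, -EPERM to refuse it."""
    if cap not in task.caps:
        return -errno.EPERM
    # Emergency switch: refuse CAP_NET_ADMIN outside the initial namespace.
    if (mitigation_active and cap == CAP_NET_ADMIN
            and task.user_ns != INIT_USER_NS):
        return -errno.EPERM
    return 0


host_admin = Task(INIT_USER_NS, frozenset({CAP_NET_ADMIN}))
container_root = Task("userns-1", frozenset({CAP_NET_ADMIN}))
```

With the switch off, `container_root` keeps CAP_NET_ADMIN inside its namespace; with it on, only `host_admin` passes the check. A more flexible version would presumably key on more than the namespace alone.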
Especially when I don't see people explaining why such a knob is a good idea. What is userspace going to do with this new feature that makes it worth maintaining in the kernel?
That is always the conversation we have when adding new features, and that is exactly the conversation that has not happened here.
Eric and Paul, I wonder, will you - or some people you'd like to represent you - be at plumbers in September? Should there be a BOF session there? (I won't be there, but could join over video) I think a brainstorming session for solutions to the above problems would be good.
Adding a layer of indirection should not exempt a new feature from needing to justify itself.
Eric
On Thu, Aug 18, 2022 at 10:05 AM Serge E. Hallyn serge@hallyn.com wrote:
On Wed, Aug 17, 2022 at 04:24:28PM -0500, Eric W. Biederman wrote:
Paul Moore paul@paul-moore.com writes:
On Wed, Aug 17, 2022 at 4:56 PM Eric W. Biederman ebiederm@xmission.com wrote:
Paul Moore paul@paul-moore.com writes:
On Wed, Aug 17, 2022 at 3:58 PM Eric W. Biederman ebiederm@xmission.com wrote:
Paul Moore paul@paul-moore.com writes:
At the end of the v4 patchset I suggested merging this into lsm/next so it could get a full -rc cycle in linux-next, assuming no issues were uncovered during testing
What in the world can be uncovered in linux-next for code that has no in-tree users?
The patchset provides both BPF LSM and SELinux implementations of the hooks along with a BPF LSM test under tools/testing/selftests/bpf/. If no one beats me to it, I plan to work on adding a test to the selinux-testsuite as soon as I'm done dealing with other urgent LSM/SELinux issues (io_uring CMD passthrough, SCTP problems, etc.); I run these tests multiple times a week (multiple times a day sometimes) against the -rcX kernels with the lsm/next, selinux/next, and audit/next branches applied on top. I know others do similar things.
A layer of hooks that leaves all of the logic to userspace is not an in-tree user for purposes of understanding the logic of the code.
The BPF LSM selftests which are part of this patchset live in-tree. The SELinux hook implementation is completely in-tree with the subject/verb/object relationship clearly described by the code itself. After all, the selinux_userns_create() function consists of only two lines, one of which is an assignment. Yes, it is true that the SELinux policy lives outside the kernel, but that is because there is no singular SELinux policy for everyone. From a practical perspective, the SELinux policy is really just a configuration file used to set up the kernel at runtime; it is not significantly different than an iptables script, /etc/sysctl.conf, or any of the other myriad of configuration files used to configure the kernel during boot.
I object to adding the new system configuration knob.
I do strongly sympathize with Eric's points. It will be very easy, once user namespace creation has been further restricted in some distros, to say "well see this stuff is silly" and go back to simply requiring root to create all containers and namespaces, which is generally quite a bit easier anyway. And then, of course, give everyone root so they can start containers.
That's assuming a lot. Many years have passed since namespaces were first introduced, and awareness of good security practices has improved, perhaps not as much as any of us would like, but to say that distros, system builders, and even users are the same as they were so many years ago is a bit of a stretch in my opinion.
However, even ignoring that for a moment, do we really want to go to a place where we dictate how users compose and secure their systems? Linux "took over the world" because it offered a level of flexibility that wasn't really possible before, and it has flourished because it has kept that mentality. The Linux Kernel can be shoehorned onto most hardware that you can get your hands on these days, with driver support for most anything you can think to plug into the system. Do you want a single-user environment with no per-user separation? We can do that. Do you want a traditional DAC based system that leans heavily on ACLs and capabilities? We can do that. Do you want a container host that allows you to carve up the system with a high degree of granularity thanks to the different namespaces? We can do that. How about a system that leverages the LSM to enforce a least privilege ideal, even on the most privileged root user? We can do that too. This patchset is about giving distros, system builders, and users another choice in how they build their system. We've seen both in this patchset and in previously failed attempts that there is a definite want from a user perspective for functionality such as this, and I think it's time we deliver it in the upstream kernel so they don't have to keep patching their own systems with out-of-tree patches.
Eric and Paul, I wonder, will you - or some people you'd like to represent you - be at plumbers in September? Should there be a BOF session there? (I won't be there, but could join over video) I think a brainstorming session for solutions to the above problems would be good.
Regardless of if Eric or I will be at LPC, it is doubtful that all of the people who have participated in this discussion will be able to attend, and I think it's important that the users who are asking for this patchset have a chance to be heard in each forum where this is discussed. While conferences are definitely nice - I definitely missed them over the past couple of years - we can't use them as a crutch to help us reach a conclusion on this issue; we've debated much more difficult things over the mailing lists, I see no reason why this would be any different.
On Thu, Aug 18, 2022 at 11:11:06AM -0400, Paul Moore wrote:
On Thu, Aug 18, 2022 at 10:05 AM Serge E. Hallyn serge@hallyn.com wrote:
On Wed, Aug 17, 2022 at 04:24:28PM -0500, Eric W. Biederman wrote:
Paul Moore paul@paul-moore.com writes:
On Wed, Aug 17, 2022 at 4:56 PM Eric W. Biederman ebiederm@xmission.com wrote:
Paul Moore paul@paul-moore.com writes:
On Wed, Aug 17, 2022 at 3:58 PM Eric W. Biederman ebiederm@xmission.com wrote:
Paul Moore paul@paul-moore.com writes:
At the end of the v4 patchset I suggested merging this into lsm/next so it could get a full -rc cycle in linux-next, assuming no issues were uncovered during testing
What in the world can be uncovered in linux-next for code that has no in-tree users?
The patchset provides both BPF LSM and SELinux implementations of the hooks along with a BPF LSM test under tools/testing/selftests/bpf/. If no one beats me to it, I plan to work on adding a test to the selinux-testsuite as soon as I'm done dealing with other urgent LSM/SELinux issues (io_uring CMD passthrough, SCTP problems, etc.); I run these tests multiple times a week (multiple times a day sometimes) against the -rcX kernels with the lsm/next, selinux/next, and audit/next branches applied on top. I know others do similar things.
A layer of hooks that leaves all of the logic to userspace is not an in-tree user for purposes of understanding the logic of the code.
The BPF LSM selftests which are part of this patchset live in-tree. The SELinux hook implementation is completely in-tree with the subject/verb/object relationship clearly described by the code itself. After all, the selinux_userns_create() function consists of only two lines, one of which is an assignment. Yes, it is true that the SELinux policy lives outside the kernel, but that is because there is no singular SELinux policy for everyone. From a practical perspective, the SELinux policy is really just a configuration file used to set up the kernel at runtime; it is not significantly different than an iptables script, /etc/sysctl.conf, or any of the other myriad of configuration files used to configure the kernel during boot.
I object to adding the new system configuration knob.
I do strongly sympathize with Eric's points. It will be very easy, once user namespace creation has been further restricted in some distros, to say "well see this stuff is silly" and go back to simply requiring root to create all containers and namespaces, which is generally quite a bit easier anyway. And then, of course, give everyone root so they can start containers.
That's assuming a lot. Many years have passed since namespaces were first introduced, and awareness of good security practices has improved, perhaps not as much as any of us would like, but to say that distros, system builders, and even users are the same as they were so many years ago is a bit of a stretch in my opinion.
Maybe. But I do get a bit worried based on some of what I've been reading in mailing lists lately. Kernel dev definitely moves like fashion - remember when every api should have its own filesystem? That was not a different group of people.
However, even ignoring that for a moment, do we really want to go to a place where we dictate how users compose and secure their systems? Linux "took over the world" because it offered a level of flexibility that wasn't really possible before, and it has flourished because it has kept that mentality. The Linux Kernel can be shoehorned onto most hardware that you can get your hands on these days, with driver support for most anything you can think to plug into the system. Do you want a single-user environment with no per-user separation? We can do that. Do you want a traditional DAC based system that leans heavily on ACLs and capabilities? We can do that. Do you want a container host that allows you to carve up the system with a high degree of granularity thanks to the different namespaces? We can do that. How about a system that leverages the LSM to enforce a least privilege ideal, even on the most privileged root user? We can do that too. This patchset is about giving distros, system builders, and users another choice in how they build their system. We've seen both
Oh, you misunderstand. Whereas I do feel there are important concerns in Eric's objections, and whereas I don't feel this set sufficiently addresses the problems that I see and outlined above, I do see value in this set, and was not aiming to deter it. We need better ways to mitigate a certain class of 0-days without completely disallowing use of user namespaces, and this may help.
in this patchset and in previously failed attempts that there is a definite want from a user perspective for functionality such as this, and I think it's time we deliver it in the upstream kernel so they don't have to keep patching their own systems with out-of-tree patches.
Eric and Paul, I wonder, will you - or some people you'd like to represent you - be at plumbers in September? Should there be a BOF session there? (I won't be there, but could join over video) I think a brainstorming session for solutions to the above problems would be good.
Regardless of if Eric or I will be at LPC, it is doubtful that all of the people who have participated in this discussion will be able to attend, and I think it's important that the users who are asking for this patchset have a chance to be heard in each forum where this is discussed. While conferences are definitely nice - I definitely missed them over the past couple of years - we can't use them as a crutch to help us reach a conclusion on this issue; we've debated much
No I wasn't thinking we would use LPC to decide on this patchset. As far as I can see, the patchset is merged. I am hoping we can come up with "something better" to address people's needs, make everyone happy, and bring forth world peace. Which would stack just fine with what's here for defense in depth.
You may well not be interested in further work, and that's fine. I need to set aside a few days to think on this.
more difficult things over the mailing lists, I see no reason why this would be any different.
-- paul-moore.com
On Fri, Aug 19, 2022 at 10:45 AM Serge E. Hallyn serge@hallyn.com wrote:
On Thu, Aug 18, 2022 at 11:11:06AM -0400, Paul Moore wrote:
On Thu, Aug 18, 2022 at 10:05 AM Serge E. Hallyn serge@hallyn.com wrote:
...
I do strongly sympathize with Eric's points. It will be very easy, once user namespace creation has been further restricted in some distros, to say "well see this stuff is silly" and go back to simply requiring root to create all containers and namespaces, which is generally quite a bit easier anyway. And then, of course, give everyone root so they can start containers.
That's assuming a lot. Many years have passed since namespaces were first introduced, and awareness of good security practices has improved, perhaps not as much as any of us would like, but to say that distros, system builders, and even users are the same as they were so many years ago is a bit of a stretch in my opinion.
Maybe. But I do get a bit worried based on some of what I've been reading in mailing lists lately. Kernel dev definitely moves like fashion - remember when every api should have its own filesystem? That was not a different group of people.
I'm not going to argue against the idea that kernel development is subject to fads, I just don't agree that adding a LSM control point for user namespace creation is going to be the end of user namespaces.
However, even ignoring that for a moment, do we really want to go to a place where we dictate how users compose and secure their systems? Linux "took over the world" because it offered a level of flexibility that wasn't really possible before, and it has flourished because it has kept that mentality. The Linux Kernel can be shoehorned onto most hardware that you can get your hands on these days, with driver support for most anything you can think to plug into the system. Do you want a single-user environment with no per-user separation? We can do that. Do you want a traditional DAC based system that leans heavily on ACLs and capabilities? We can do that. Do you want a container host that allows you to carve up the system with a high degree of granularity thanks to the different namespaces? We can do that. How about a system that leverages the LSM to enforce a least privilege ideal, even on the most privileged root user? We can do that too. This patchset is about giving distros, system builders, and users another choice in how they build their system. We've seen both
Oh, you misunderstand. Whereas I do feel there are important concerns in Eric's objections, and whereas I don't feel this set sufficiently addresses the problems that I see and outlined above, I do see value in this set, and was not aiming to deter it. We need better ways to mitigate a certain class of 0-days without completely disallowing use of user namespaces, and this may help.
Ah, thanks for the explanation, I missed that (obviously) in your previous email. If I'm perfectly honest, I suppose the protracted debate with Eric has also left me a little overly sensitive to any perceived arguments against this patchset.
in this patchset and in previously failed attempts that there is a definite want from a user perspective for functionality such as this, and I think it's time we deliver it in the upstream kernel so they don't have to keep patching their own systems with out-of-tree patches.
Eric and Paul, I wonder, will you - or some people you'd like to represent you - be at plumbers in September? Should there be a BOF session there? (I won't be there, but could join over video) I think a brainstorming session for solutions to the above problems would be good.
Regardless of if Eric or I will be at LPC, it is doubtful that all of the people who have participated in this discussion will be able to attend, and I think it's important that the users who are asking for this patchset have a chance to be heard in each forum where this is discussed. While conferences are definitely nice - I definitely missed them over the past couple of years - we can't use them as a crutch to help us reach a conclusion on this issue; we've debated much
No I wasn't thinking we would use LPC to decide on this patchset. As far as I can see, the patchset is merged.
While I maintain that Frederick's patches are a good thing, I'm not going to consider them "merged" until I see them in Linus' tree or Linus decides to voice his support on the lists. These patches do have Eric's NACK, and a maintainer's NACK isn't something to take lightly. I certainly don't.
I am hoping we can come up with "something better" to address people's needs, make everyone happy, and bring forth world peace. Which would stack just fine with what's here for defense in depth.
You may well not be interested in further work, and that's fine. I need to set aside a few days to think on this.
I'm happy to continue the discussion as long as it's constructive; I think we all are. My gut feeling is that Frederick's approach falls closest to the sweet spot of "workable without being overly offensive" (*cough*), but if you've got an additional approach in mind, or an alternative approach that solves the same use case problems, I think we'd all love to hear about it.
Paul Moore paul@paul-moore.com writes:
> On Fri, Aug 19, 2022 at 10:45 AM Serge E. Hallyn serge@hallyn.com wrote:
>> I am hoping we can come up with "something better" to address people's needs, make everyone happy, and bring forth world peace. Which would stack just fine with what's here for defense in depth.
>> You may well not be interested in further work, and that's fine. I need to set aside a few days to think on this.
> I'm happy to continue the discussion as long as it's constructive; I think we all are. My gut feeling is that Frederick's approach falls closest to the sweet spot of "workable without being overly offensive" (*cough*), but if you've got an additional approach in mind, or an alternative approach that solves the same use case problems, I think we'd all love to hear about it.
I would love to actually hear the problems people are trying to solve so that we can have a sensible conversation about the trade offs.
As best I can tell without more information people want to use the creation of a user namespace as a signal that the code is attempting an exploit.
As such let me propose instead of returning an error code which will let the exploit continue, have the security hook return a bool. With true meaning the code can continue and on false it will trigger using SIGSYS to terminate the program like seccomp does.
I am not super fond of that idea, but it means that userspace code is not expected to deal with the situation, and the only conversation a userspace application developer needs to enter into with a system administrator or security policy developer is one to prove they are not exploit code. Plus it makes much more sense to kill an exploit immediately instead of letting it run.
In general when addressing code coverage concerns I think it makes more sense to use the security hooks to implement some variety of the principle of least privilege and only give applications access to the kernel facilities they are known to use.
As far as I can tell creating a user namespace does not increase the attack surface. It is the creation of the other namespaces from a user namespace that begins to do that. So in general I would think restrictions should be in places they matter.
Just like the bugs that have exploits that involve the user namespace are not user namespace bugs, but instead they are bugs in other subsystems that just happen to go through the user namespace as the easiest path to the buggy code, not the only path to the buggy code.
Eric
On Thu, Aug 25, 2022 at 2:15 PM Eric W. Biederman ebiederm@xmission.com wrote:
> I would love to actually hear the problems people are trying to solve so that we can have a sensible conversation about the trade offs.
Here are several taken from the previous threads, it's surely not a complete list, but it should give you a good idea:
https://lore.kernel.org/linux-security-module/CAHC9VhQnPAsmjmKo-e84XDJ1wmaOF...
> As best I can tell without more information people want to use the creation of a user namespace as a signal that the code is attempting an exploit.
Some use cases are like that, there are several other use cases that go beyond this; see all of our previous discussions on this topic/patchset. As has been mentioned before, there are use cases that require improved observability, access control, or both.
> As such let me propose instead of returning an error code which will let the exploit continue, have the security hook return a bool. With true meaning the code can continue and on false it will trigger using SIGSYS to terminate the program like seccomp does.
Having the kernel forcibly exit the process isn't something that most LSMs would likely want. I suppose we could modify the hook/caller so that *if* an LSM wanted to return SIGSYS the system would kill the process, but I would want that to be something in addition to returning an error code like LSMs normally do (e.g. EACCES).
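The two hook conventions being debated above can be contrasted with a small userspace model. This is only a sketch and not kernel code: `policy_errno`, `policy_bool`, the two `create_with_*` callers, and the `outcome` enum are hypothetical names invented for illustration, and the boolean "task allowed" input stands in for whatever attributes a real policy would inspect.

```c
#include <errno.h>
#include <stdbool.h>

/* What the calling task observes in this model. */
enum outcome {
    OUTCOME_CREATED, /* namespace creation proceeds */
    OUTCOME_ERROR,   /* syscall fails and the task keeps running */
    OUTCOME_KILLED   /* task is terminated, SIGSYS-style */
};

/* Convention 1: the usual LSM style, 0 allows and a negative errno denies. */
int policy_errno(bool task_allowed)
{
    return task_allowed ? 0 : -EACCES;
}

/* Convention 2: the proposed alternative, true allows and false kills. */
bool policy_bool(bool task_allowed)
{
    return task_allowed;
}

/* Caller for convention 1: a denial is handed back as an error code. */
enum outcome create_with_errno_hook(bool task_allowed, int *err)
{
    *err = policy_errno(task_allowed);
    return *err ? OUTCOME_ERROR : OUTCOME_CREATED;
}

/* Caller for convention 2: a denial never returns to the task at all. */
enum outcome create_with_bool_hook(bool task_allowed)
{
    return policy_bool(task_allowed) ? OUTCOME_CREATED : OUTCOME_KILLED;
}
```

In this model the only observable difference is what a denied task sees: an error it may handle (or ignore), versus termination. The combined variant suggested above, where killing is optional and in addition to the error code, would need a richer return type than either function here.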
On Aug 25, 2022, at 12:19 PM, Paul Moore paul@paul-moore.com wrote:
>> I would love to actually hear the problems people are trying to solve so that we can have a sensible conversation about the trade offs.
> Here are several taken from the previous threads, it's surely not a complete list, but it should give you a good idea:
> https://lore.kernel.org/linux-security-module/CAHC9VhQnPAsmjmKo-e84XDJ1wmaOF...
>> As best I can tell without more information people want to use the creation of a user namespace as a signal that the code is attempting an exploit.
> Some use cases are like that, there are several other use cases that go beyond this; see all of our previous discussions on this topic/patchset. As has been mentioned before, there are use cases that require improved observability, access control, or both.
>> As such let me propose instead of returning an error code which will let the exploit continue, have the security hook return a bool. With true meaning the code can continue and on false it will trigger using SIGSYS to terminate the program like seccomp does.
> Having the kernel forcibly exit the process isn't something that most LSMs would likely want. I suppose we could modify the hook/caller so that *if* an LSM wanted to return SIGSYS the system would kill the process, but I would want that to be something in addition to returning an error code like LSMs normally do (e.g. EACCES).
I am new to user_namespace and security work, so please pardon me if anything below is very wrong.
IIUC, user_namespace is a tool that enables trusted userspace code to control the behavior of untrusted (or less trusted) userspace code. Failing create_user_ns() doesn't make the system more reliable. Specifically, we call create_user_ns() via two paths: fork/clone and unshare. For both paths, we need the userspace to use user_namespace, and to honor failed create_user_ns().
On the other hand, I would echo that killing the process is not practical in some use cases. Specifically, allowing the application to run in a less secure environment for a short period of time might be much better than killing it and taking down the whole service. Of course, there are other cases that security is more important, and taking down the whole service is the better choice.
I guess the ultimate solution is a way to enforce using user_namespace in the kernel (if it ever makes sense...). But I don't know how that gonna work. Before we have such solution, maybe we only need an void hook for observability (or just a tracepoint, coming from BPF background).
Thanks, Song
On Thu, Aug 25, 2022 at 5:58 PM Song Liu songliubraving@fb.com wrote:
> I am new to user_namespace and security work, so please pardon me if anything below is very wrong.
> IIUC, user_namespace is a tool that enables trusted userspace code to control the behavior of untrusted (or less trusted) userspace code. Failing create_user_ns() doesn't make the system more reliable. Specifically, we call create_user_ns() via two paths: fork/clone and unshare. For both paths, we need the userspace to use user_namespace, and to honor failed create_user_ns().
> On the other hand, I would echo that killing the process is not practical in some use cases. Specifically, allowing the application to run in a less secure environment for a short period of time might be much better than killing it and taking down the whole service. Of course, there are other cases that security is more important, and taking down the whole service is the better choice.
> I guess the ultimate solution is a way to enforce using user_namespace in the kernel (if it ever makes sense...).
The LSM framework, and the BPF and SELinux LSM implementations in this patchset, provide a mechanism to do just that: kernel enforced access controls using flexible security policies which can be tailored by the distro, solution provider, or end user to meet the specific needs of their use case.
On Aug 25, 2022, at 3:10 PM, Paul Moore paul@paul-moore.com wrote:
>> I guess the ultimate solution is a way to enforce using user_namespace in the kernel (if it ever makes sense...).
> The LSM framework, and the BPF and SELinux LSM implementations in this patchset, provide a mechanism to do just that: kernel enforced access controls using flexible security policies which can be tailored by the distro, solution provider, or end user to meet the specific needs of their use case.
In this case, I wouldn't call the kernel is enforcing access control. (I might be wrong). There are 3 components here: kernel, LSM, and trusted userspace (whoever calls unshare). AFAICT, kernel simply passes the decision made by LSM (BPF or SELinux) to the trusted userspace. It is up to the trusted userspace to honor the return value of unshare(). If the userspace simply ignores unshare failures, or does not call unshare(CLONE_NEWUSER), kernel and LSM cannot do much about it, right?
This might still be useful in some cases. (I am far from an expert on these). I just feel this is not the typical solution to enforce something.
Thanks, Song
PS: If I said something very stupid, I would not feel offended if someone pointed it out loud. :)
On Thu, Aug 25, 2022 at 6:42 PM Song Liu songliubraving@fb.com wrote:
> On Aug 25, 2022, at 3:10 PM, Paul Moore paul@paul-moore.com wrote:
>> On Thu, Aug 25, 2022 at 5:58 PM Song Liu songliubraving@fb.com wrote:
...
>>> I am new to user_namespace and security work, so please pardon me if anything below is very wrong.
>>> IIUC, user_namespace is a tool that enables trusted userspace code to control the behavior of untrusted (or less trusted) userspace code. Failing create_user_ns() doesn't make the system more reliable. Specifically, we call create_user_ns() via two paths: fork/clone and unshare. For both paths, we need the userspace to use user_namespace, and to honor failed create_user_ns().
>>> On the other hand, I would echo that killing the process is not practical in some use cases. Specifically, allowing the application to run in a less secure environment for a short period of time might be much better than killing it and taking down the whole service. Of course, there are other cases that security is more important, and taking down the whole service is the better choice.
>>> I guess the ultimate solution is a way to enforce using user_namespace in the kernel (if it ever makes sense...).
>> The LSM framework, and the BPF and SELinux LSM implementations in this patchset, provide a mechanism to do just that: kernel enforced access controls using flexible security policies which can be tailored by the distro, solution provider, or end user to meet the specific needs of their use case.
> In this case, I wouldn't call the kernel is enforcing access control. (I might be wrong). There are 3 components here: kernel, LSM, and trusted userspace (whoever calls unshare).
The LSM layer, and the LSMs themselves are part of the kernel; look at the changes in this patchset to see the LSM, BPF LSM, and SELinux kernel changes. Explaining how the different LSMs work is quite a bit beyond the scope of this discussion, but there is plenty of information available online that should be able to serve as an introduction, not to mention the kernel source itself. However, in very broad terms you can think of the individual LSMs as somewhat analogous to filesystem drivers, e.g. ext4, and the LSM itself as the VFS layer.
> AFAICT, kernel simply passes the decision made by LSM (BPF or SELinux) to the trusted userspace. It is up to the trusted userspace to honor the return value of unshare().
With a LSM enabled and enforcing a security policy on user namespace creation, which appears to be the case of most concern, the kernel would make a decision on the namespace creation based on various factors (e.g. for SELinux this would be the calling process' security domain and the domain's permission set as determined by the configured security policy) and if the operation was rejected an error code would be returned to userspace and the operation rejected. It is the exact same thing as what would happen if the calling process is chrooted or doesn't have a proper UID/GID mapping. Don't forget that the create_user_ns() function already enforces a security policy and returns errors to userspace; this patchset doesn't add anything new in that regard, it just allows for a richer and more flexible security policy to be built on top of the existing constraints.
> If the userspace simply ignores unshare failures, or does not call unshare(CLONE_NEWUSER), kernel and LSM cannot do much about it, right?
The process is still subject to any security policies that are active and being enforced by the kernel. A malicious or misconfigured application can still be constrained by the kernel using both the kernel's legacy Discretionary Access Controls (DAC) as well as the more comprehensive Mandatory Access Controls (MAC) provided by many of the LSMs.
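As a rough illustration of what "an error code would be returned to userspace" looks like from the caller's side, the sketch below attempts to create a user namespace and classifies the result. It is an assumption-laden example, not part of the patchset: `try_create_userns`, `classify_unshare_errno`, and the `userns_result` enum are hypothetical names, and treating EPERM and EACCES as "denied" is only a guess at which codes a policy or the existing checks would return. The raw `syscall()` form is used so the example does not depend on how feature-test macros are set before `<sched.h>` is included.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <errno.h>
#include <linux/sched.h>  /* CLONE_NEWUSER */
#include <sys/syscall.h>  /* SYS_unshare */
#include <unistd.h>       /* syscall() */

enum userns_result {
    USERNS_CREATED, /* the kernel allowed the new namespace */
    USERNS_DENIED,  /* rejected by an access check (assumed: EPERM/EACCES) */
    USERNS_FAILED   /* any other failure, e.g. resource limits */
};

/* Map an errno value from a failed unshare() to a coarse result. */
enum userns_result classify_unshare_errno(int err)
{
    if (err == 0)
        return USERNS_CREATED;
    if (err == EPERM || err == EACCES)
        return USERNS_DENIED;
    return USERNS_FAILED;
}

/* Try to move the calling process into a new user namespace. */
enum userns_result try_create_userns(void)
{
    if (syscall(SYS_unshare, CLONE_NEWUSER) == 0)
        return USERNS_CREATED;
    return classify_unshare_errno(errno);
}
```

A caller that receives `USERNS_DENIED` simply keeps running in its original namespace, still subject to whatever DAC and MAC policy applies there; whether it falls back, retries, or exits is its own decision.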
On Aug 26, 2022, at 8:02 AM, Paul Moore paul@paul-moore.com wrote:
>> In this case, I wouldn't call the kernel is enforcing access control. (I might be wrong). There are 3 components here: kernel, LSM, and trusted userspace (whoever calls unshare).
> The LSM layer, and the LSMs themselves are part of the kernel; look at the changes in this patchset to see the LSM, BPF LSM, and SELinux kernel changes. Explaining how the different LSMs work is quite a bit beyond the scope of this discussion, but there is plenty of information available online that should be able to serve as an introduction, not to mention the kernel source itself. However, in very broad terms you can think of the individual LSMs as somewhat analogous to filesystem drivers, e.g. ext4, and the LSM itself as the VFS layer.
Thanks for the explanation. This matches my understanding with LSM.
>> AFAICT, kernel simply passes the decision made by LSM (BPF or SELinux) to the trusted userspace. It is up to the trusted userspace to honor the return value of unshare().
> With a LSM enabled and enforcing a security policy on user namespace creation, which appears to be the case of most concern, the kernel would make a decision on the namespace creation based on various factors (e.g. for SELinux this would be the calling process' security domain and the domain's permission set as determined by the configured security policy) and if the operation was rejected an error code would be returned to userspace and the operation rejected. It is the exact same thing as what would happen if the calling process is chrooted or doesn't have a proper UID/GID mapping. Don't forget that the create_user_ns() function already enforces a security policy and returns errors to userspace; this patchset doesn't add anything new in that regard, it just allows for a richer and more flexible security policy to be built on top of the existing constraints.
I believe I don't understand user namespace enough to agree or disagree here. I guess I should read more.
Thanks, Song
>> If the userspace simply ignores unshare failures, or does not call unshare(CLONE_NEWUSER), kernel and LSM cannot do much about it, right?
> The process is still subject to any security policies that are active and being enforced by the kernel. A malicious or misconfigured application can still be constrained by the kernel using both the kernel's legacy Discretionary Access Controls (DAC) as well as the more comprehensive Mandatory Access Controls (MAC) provided by many of the LSMs.
> -- paul-moore.com
On Thu, Aug 25, 2022 at 09:58:46PM +0000, Song Liu wrote:
> I am new to user_namespace and security work, so please pardon me if anything below is very wrong.
> IIUC, user_namespace is a tool that enables trusted userspace code to control the behavior of untrusted (or less trusted) userspace code.
No. user namespaces are not a way for more trusted code to control the behavior of less trusted code.
> Failing create_user_ns() doesn't make the system more reliable. Specifically, we call create_user_ns() via two paths: fork/clone and unshare. For both paths, we need the userspace to use user_namespace, and to honor failed create_user_ns().
> On the other hand, I would echo that killing the process is not practical in some use cases. Specifically, allowing the application to run in a less secure environment for a short period of time might be much better than killing it and taking down the whole service. Of course, there are other cases that security is more important, and taking down the whole service is the better choice.
> I guess the ultimate solution is a way to enforce using user_namespace in the kernel (if it ever makes sense...). But I don't know how that gonna work. Before we have such solution, maybe we only need an void hook for observability (or just a tracepoint, coming from BPF background).
> Thanks, Song
On Aug 26, 2022, at 8:24 AM, Serge E. Hallyn serge@hallyn.com wrote:
>> IIUC, user_namespace is a tool that enables trusted userspace code to control the behavior of untrusted (or less trusted) userspace code.
> No. user namespaces are not a way for more trusted code to control the behavior of less trusted code.
Hmm.. In this case, I think I really need to learn more.
Thanks for pointing out my misunderstanding.
Song
>> Failing create_user_ns() doesn't make the system more reliable. Specifically, we call create_user_ns() via two paths: fork/clone and unshare. For both paths, we need the userspace to use user_namespace, and to honor failed create_user_ns().
>> On the other hand, I would echo that killing the process is not practical in some use cases. Specifically, allowing the application to run in a less secure environment for a short period of time might be much better than killing it and taking down the whole service. Of course, there are other cases that security is more important, and taking down the whole service is the better choice.
>> I guess the ultimate solution is a way to enforce using user_namespace in the kernel (if it ever makes sense...). But I don't know how that gonna work. Before we have such solution, maybe we only need an void hook for observability (or just a tracepoint, coming from BPF background).
>> Thanks, Song
On Fri, Aug 26, 2022 at 05:00:51PM +0000, Song Liu wrote:
>>> IIUC, user_namespace is a tool that enables trusted userspace code to control the behavior of untrusted (or less trusted) userspace code.
>> No. user namespaces are not a way for more trusted code to control the behavior of less trusted code.
Hmm.. In this case, I think I really need to learn more.
Thanks for pointing out my misunderstanding.
(I thought maybe Eric would chime in with a better explanation, but I'll fill it in for now :)
One of the main goals of user namespaces is to allow unprivileged users to do things like chroot and mount, which are very useful development tools, without needing admin privileges. So it's almost the opposite of what you said: rather than to enable trusted userspace code to control the behavior of less trusted code, it's to allow less privileged code to do things which do not affect other users, without having to assume *more* privilege.
To be precise, the goals were:
1. uid mapping - allow two users to both "use uid 500" without conflicting
2. provide (unprivileged) users privilege over their own resources
3. absolutely no extra privilege over other resources
4. be able to nest
While (3) was technically achieved, the problem we have is that (2) provides unprivileged users the ability to exercise kernel code which they previously could not.
-serge
On Aug 26, 2022, at 2:00 PM, Serge E. Hallyn serge@hallyn.com wrote:
> (I thought maybe Eric would chime in with a better explanation, but I'll fill it in for now :)
>
> One of the main goals of user namespaces is to allow unprivileged users to do things like chroot and mount, which are very useful development tools, without needing admin privileges. So it's almost the opposite of what you said: rather than to enable trusted userspace code to control the behavior of less trusted code, it's to allow less privileged code to do things which do not affect other users, without having to assume *more* privilege.
Thanks for the explanation!
> To be precise, the goals were:
>
> 1. uid mapping - allow two users to both "use uid 500" without conflicting
> 2. provide (unprivileged) users privilege over their own resources
> 3. absolutely no extra privilege over other resources
> 4. be able to nest
Now I have a better idea about "what". But I am not quite sure about how to do it. I will do more homework, and probably come back with more questions. :)
> While (3) was technically achieved, the problem we have is that (2) provides unprivileged users the ability to exercise kernel code which they previously could not.
Do you mean this one?
""" I think the problem is that it seems you can pretty reliably get a root shell at some point in the future by creating a user namespace, leaving it open for a bit, and waiting for a new announcement of the latest netfilter or whatever exploit that requires root in a user namespace. Then go back to your userns shell and run the exploit. """
Please don't share how to do it yet. I want to use it as a test for my study. :)
Thanks again!
Song
On Fri, Aug 26, 2022 at 04:00:39PM -0500, Serge Hallyn wrote:
> To be precise, the goals were:
>
> 1. uid mapping - allow two users to both "use uid 500" without conflicting
> 2. provide (unprivileged) users privilege over their own resources
> 3. absolutely no extra privilege over other resources
> 4. be able to nest
>
> While (3) was technically achieved, the problem we have is that (2) provides unprivileged users the ability to exercise kernel code which they previously could not.
The consequence of the refusal to give users any way to control whether or not user namespaces are available to unprivileged users is that a not insignificant number of distros have carried the same patch for about 10 years now, one that adds an unprivileged_userns_clone sysctl to restrict them to privileged users. That includes current Debian and Archlinux btw.
The LSM hook is a simple way to allow administrators to control this and will allow user namespaces to be enabled in scenarios where they would otherwise not be accepted precisely because they are available to unprivileged users.
I fully understand the motivation and usefulness in unprivileged scenarios but it's an unfounded fear that giving users the ability to control user namespace creation via an LSM hook will cause proliferation of setuid binaries (Ignoring for a moment that any fully unprivileged container with useful idmappings has to rely on the new{g,u}idmap setuid binaries to setup useful mappings anyway.) or decrease system safety let alone cause regressions (Which I don't think is an applicable term here at all.). Distros that have unprivileged user namespaces turned on by default are extremely unlikely to switch to an LSM profile that turns them off and distros that already turn them off will continue to turn them off whether or not that LSM hook is available.
It's much more likely that workloads that want to minimize their attack surface while still getting the benefits of user namespaces for e.g. service isolation will feel comfortable enabling them for the first time since they can control them via an LSM profile.
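For readers who want to see which of these system-wide switches a given machine exposes, the following is a small Python sketch. The /proc path for the distro knob is an assumption derived from the sysctl name mentioned above (kernel.unprivileged_userns_clone); the user.max_user_namespaces limit is not discussed in this thread and is included only for comparison. The helper names are hypothetical.

```python
from pathlib import Path
from typing import Optional

# Out-of-tree knob carried by some distros (assumed path for the sysctl
# named in the message above); absent on kernels without that patch.
DISTRO_KNOB = Path("/proc/sys/kernel/unprivileged_userns_clone")
# Upstream limit on user namespaces; 0 means none may be created.
UPSTREAM_LIMIT = Path("/proc/sys/user/max_user_namespaces")


def read_int(path: Path) -> Optional[int]:
    """Return the integer stored at path, or None if it is missing or unreadable."""
    try:
        return int(path.read_text().strip())
    except (OSError, ValueError):
        return None


def userns_policy() -> dict:
    """Collect the system-wide user namespace switches visible on this host."""
    return {
        "unprivileged_userns_clone": read_int(DISTRO_KNOB),
        "max_user_namespaces": read_int(UPSTREAM_LIMIT),
    }
```

Both switches are all-or-nothing for the whole system, which is the limitation the LSM hook is meant to address: a policy can instead decide per task.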
On Mon, Aug 29, 2022 at 05:33:04PM +0200, Christian Brauner wrote:
> The consequence of the refusal to give users any way to control whether or not user namespaces are available to unprivileged users is that a not insignificant number of distros have carried the same patch for about 10 years now, one that adds an unprivileged_userns_clone sysctl to restrict them to privileged users. That includes current Debian and Archlinux btw.
Hi Christian,
I'm wondering about your placement of this argument in the thread, and whether you interpreted what I said above as an argument against this patchset, or whether you're just expanding on what I said.
> The LSM hook is a simple way to allow administrators to control this and
(I think the "control" here is suboptimal, but I've not seen - nor conceived of - anything better as of yet)
> will allow user namespaces to be enabled in scenarios where they would otherwise not be accepted precisely because they are available to unprivileged users.
>
> I fully understand the motivation and usefulness in unprivileged scenarios but it's an unfounded fear that giving users the ability to control user namespace creation via an LSM hook will cause proliferation of setuid binaries (Ignoring for a moment that any fully unprivileged container with useful idmappings has to rely on the new{g,u}idmap setuid binaries to setup useful mappings anyway.) or decrease system safety let alone cause regressions (Which I don't think is an applicable term here at all.). Distros that have unprivileged user namespaces turned on by default are extremely unlikely to switch to an LSM profile that turns them off and distros that already turn them off will continue to turn them off whether or not that LSM hook is available.
>
> It's much more likely that workloads that want to minimize their attack surface while still getting the benefits of user namespaces for e.g. service isolation will feel comfortable enabling them for the first time since they can control them via an LSM profile.
On Thu, Aug 25, 2022 at 8:19 PM Paul Moore paul@paul-moore.com wrote:
>> As such let me propose instead of returning an error code which will let the exploit continue, have the security hook return a bool. With true meaning the code can continue and on false it will trigger using SIGSYS to terminate the program like seccomp does.
>
> Having the kernel forcibly exit the process isn't something that most LSMs would likely want. I suppose we could modify the hook/caller so that *if* an LSM wanted to return SIGSYS the system would kill the process, but I would want that to be something in addition to returning an error code like LSMs normally do (e.g. EACCES).
I would also add here that seccomp allows more flexibility than just delivering SIGSYS to a violating application. We can program seccomp bpf to:
* deliver a signal
* return a CUSTOM error code (and BTW somehow this does not trigger any requirements to change userapi or document in manpages: in my toy example in [1] I'm delivering ENETDOWN from a uname(2) system call, which is not documented in the man pages, but totally valid from a seccomp usage perspective)
* do nothing, but log the action

So I would say the seccomp reference supports the current approach more than the alternative approach of delivering SIGSYS: technically an LSM implementation of the hook (at least an in-kernel one) can choose to deliver a signal to a task via the kernel API, but BPF-LSM (and others) can deliver custom error codes and log the actions as well.
Ignat
-- paul-moore.com
[1]: https://blog.cloudflare.com/sandboxing-in-linux-with-zero-lines-of-code/
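The three seccomp outcomes listed above correspond to different filter return values. As a rough sketch, the Python below shows how such a return value is encoded and decoded; the constants are taken from <linux/seccomp.h>, while the helpers ret_errno() and describe() are hypothetical names used only for illustration. No filter is installed here.

```python
import errno

# Action values from <linux/seccomp.h>
SECCOMP_RET_KILL_PROCESS = 0x80000000
SECCOMP_RET_TRAP = 0x00030000         # deliver SIGSYS to the caller
SECCOMP_RET_ERRNO = 0x00050000        # fail the syscall with an errno
SECCOMP_RET_LOG = 0x7FFC0000          # allow, after logging
SECCOMP_RET_ALLOW = 0x7FFF0000
SECCOMP_RET_ACTION_FULL = 0xFFFF0000  # mask for the action part
SECCOMP_RET_DATA = 0x0000FFFF         # mask for the data part (e.g. errno)


def ret_errno(err: int) -> int:
    """Build a filter return value that fails the syscall with err."""
    return SECCOMP_RET_ERRNO | (err & SECCOMP_RET_DATA)


def describe(ret: int) -> str:
    """Decode a filter return value into a human-readable action."""
    action = ret & SECCOMP_RET_ACTION_FULL
    data = ret & SECCOMP_RET_DATA
    if action == SECCOMP_RET_ERRNO:
        return "fail syscall with " + errno.errorcode.get(data, str(data))
    if action == SECCOMP_RET_TRAP:
        return "deliver SIGSYS"
    if action == SECCOMP_RET_LOG:
        return "allow, but log"
    if action == SECCOMP_RET_ALLOW:
        return "allow"
    if action == SECCOMP_RET_KILL_PROCESS:
        return "kill process"
    return "other"
```

For example, describe(ret_errno(errno.ENETDOWN)) reports that the syscall fails with ENETDOWN, the custom error code used in the toy example referenced in [1].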
On Fri, Aug 26, 2022 at 5:11 AM Ignat Korchagin ignat@cloudflare.com wrote:
> I would also add here that seccomp allows more flexibility than just delivering SIGSYS to a violating application. We can program seccomp bpf to:
> - deliver a signal
> - return a CUSTOM error code (and BTW somehow this does not trigger any requirements to change userapi or document in manpages: in my toy example in [1] I'm delivering ENETDOWN from a uname(2) system call, which is not documented in the man pages, but totally valid from a seccomp usage perspective)
> - do-nothing, but log the action
>
> So I would say the seccomp reference supports the current approach more than the alternative approach of delivering SIGSYS as technically an LSM implementation of the hook (at least in-kernel one) can chose to deliver a signal to a task via kernel-api, but BPF-LSM (and others) can deliver custom error codes and log the actions as well.
I agree that seccomp mode 2 allows for more flexibility than was mentioned earlier, however seccomp filtering has some limitations in this particular case which can be an issue for some. The first, and perhaps most important, is that some of the information that a seccomp filter might want to inspect is effectively hidden with the clone3(2) syscall due to the clone_args struct; this would make it difficult for a seccomp filter to identify namespace-related operations. The second issue is that a seccomp mode 2 based approach requires the applications themselves to "Do The Right Thing" and ensure that the proper seccomp filter is loaded into the kernel before the target fork()/clone()/unshare() call is executed; an LSM which implements a proper mandatory access control mechanism does not rely on the application, it enforces the system's security policy regardless of what actions userspace performs.
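To make the first limitation concrete, here is a simplified model in Python of what a seccomp filter gets to look at. A filter sees the raw syscall argument values only and cannot dereference pointers: for unshare(2) the flags are an argument, but for clone3(2) the first argument is just the address of a struct clone_args. The CloneArgs class mirrors the first-version layout of that struct from <linux/sched.h>; the two helper functions are hypothetical and only model the argument slots (unused slots are shown as 0 for simplicity).

```python
import ctypes

CLONE_NEWUSER = 0x10000000  # value from <linux/sched.h>


class CloneArgs(ctypes.Structure):
    """First-version layout of struct clone_args (eight 64-bit fields)."""
    _fields_ = [
        (name, ctypes.c_uint64)
        for name in ("flags", "pidfd", "child_tid", "parent_tid",
                     "exit_signal", "stack", "stack_size", "tls")
    ]


def seccomp_args_for_unshare(flags: int) -> tuple:
    """unshare(flags): the flags themselves are visible to a filter."""
    return (flags, 0, 0, 0, 0, 0)


def seccomp_args_for_clone3(args: CloneArgs) -> tuple:
    """clone3(&args, size): a filter sees only a pointer and a size."""
    return (ctypes.addressof(args), ctypes.sizeof(args), 0, 0, 0, 0)


# The filter input for clone3 does not change when the flags do.
args = CloneArgs(flags=CLONE_NEWUSER)
before = seccomp_args_for_clone3(args)
args.flags = 0
after = seccomp_args_for_clone3(args)
```

In this model before and after are identical even though one call would create a user namespace and the other would not, which is why a hook inside the kernel, after the arguments have been copied in, is better placed to make the decision.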
On Thu, Aug 25, 2022 at 01:15:46PM -0500, Eric W. Biederman wrote:
> As best I can tell without more information people want to use the creation of a user namespace as a signal that the code is attempting an exploit.
I don't think that's it at all. I think the problem is that it seems you can pretty reliably get a root shell at some point in the future by creating a user namespace, leaving it open for a bit, and waiting for a new announcement of the latest netfilter or whatever exploit that requires root in a user namespace. Then go back to your userns shell and run the exploit.
So I was hoping we could do something more targeted. Be it splitting off the ability to run code under capable_ns code from uid mapping (to an extent), or maybe some limited-livepatch type of thing where certain parts of code become inaccessible to code in a non-init userns after some sysctl has been toggled, or something cooler that I've failed to think of.
-serge