From: Jeff Xu jeffxu@chromium.org
Since Linux introduced the memfd feature, memfd have always had their execute bit set, and the memfd_create() syscall doesn't allow setting it differently.
However, in a secure by default system, such as ChromeOS, (where all executables should come from the rootfs, which is protected by Verified boot), this executable nature of memfd opens a door for NoExec bypass and enables “confused deputy attack”. E.g, in VRP bug [1]: cros_vm process created a memfd to share the content with an external process, however the memfd is overwritten and used for executing arbitrary code and root escalation. [2] lists more VRP in this kind.
On the other hand, executable memfd has its legit use, runc uses memfd’s seal and executable feature to copy the contents of the binary then execute them, for such system, we need a solution to differentiate runc's use of executable memfds and an attacker's [3].
To address those above, this set of patches add following: 1> Let memfd_create() set X bit at creation time. 2> Let memfd to be sealed for modifying X bit. 3> A new pid namespace sysctl: vm.memfd_noexec to control behavior of X bit. For example, if a container has vm.memfd_noexec=2, then memfd_create() without MFD_NOEXEC_SEAL will be rejected. 4> A new security hook in memfd_create(). This make it possible to a new LSM, which rejects or allows executable memfd based on its security policy.
This is V3 version of patch: see [4] [5] for previous versions. [1] https://crbug.com/1305411 [2] https://bugs.chromium.org/p/chromium/issues/list?q=type%3Dbug-security%20mem... [3] https://lwn.net/Articles/781013/ [4] https://lwn.net/Articles/890096/ [5] https://lore.kernel.org/lkml/20220805222126.142525-1-jeffxu@google.com/
Daniel Verkamp (2): mm/memfd: add F_SEAL_EXEC selftests/memfd: add tests for F_SEAL_EXEC
Jeff Xu (4): mm/memfd: add MFD_NOEXEC_SEAL and MFD_EXEC selftests/memfd: add tests for MFD_NOEXEC_SEAL MFD_EXEC mm/memfd: security hook for memfd_create mm/memfd: Add write seals when apply SEAL_EXEC to executable memfd
include/linux/lsm_hook_defs.h | 1 + include/linux/lsm_hooks.h | 4 + include/linux/pid_namespace.h | 19 ++ include/linux/security.h | 6 + include/uapi/linux/fcntl.h | 1 + include/uapi/linux/memfd.h | 4 + kernel/pid_namespace.c | 47 ++++ mm/memfd.c | 54 +++- mm/shmem.c | 6 + tools/testing/selftests/memfd/fuse_test.c | 1 + tools/testing/selftests/memfd/memfd_test.c | 305 ++++++++++++++++++++- 11 files changed, 445 insertions(+), 3 deletions(-)
base-commit: eb7081409f94a9a8608593d0fb63a1aa3d6f95d8
From: Daniel Verkamp dverkamp@chromium.org
The new F_SEAL_EXEC flag will prevent modification of the exec bits: written as traditional octal mask, 0111, or as named flags, S_IXUSR | S_IXGRP | S_IXOTH. Any chmod(2) or similar call that attempts to modify any of these bits after the seal is applied will fail with errno EPERM.
This will preserve the execute bits as they are at the time of sealing, so the memfd will become either permanently executable or permanently un-executable.
Co-developed-by: Jeff Xu jeffxu@chromium.org Signed-off-by: Jeff Xu jeffxu@chromium.org Signed-off-by: Daniel Verkamp dverkamp@chromium.org --- include/uapi/linux/fcntl.h | 1 + mm/memfd.c | 2 ++ mm/shmem.c | 6 ++++++ 3 files changed, 9 insertions(+)
diff --git a/include/uapi/linux/fcntl.h b/include/uapi/linux/fcntl.h index 2f86b2ad6d7e..e8c07da58c9f 100644 --- a/include/uapi/linux/fcntl.h +++ b/include/uapi/linux/fcntl.h @@ -43,6 +43,7 @@ #define F_SEAL_GROW 0x0004 /* prevent file from growing */ #define F_SEAL_WRITE 0x0008 /* prevent writes */ #define F_SEAL_FUTURE_WRITE 0x0010 /* prevent future writes while mapped */ +#define F_SEAL_EXEC 0x0020 /* prevent chmod modifying exec bits */ /* (1U << 31) is reserved for signed error codes */
/* diff --git a/mm/memfd.c b/mm/memfd.c index 08f5f8304746..4ebeab94aa74 100644 --- a/mm/memfd.c +++ b/mm/memfd.c @@ -147,6 +147,7 @@ static unsigned int *memfd_file_seals_ptr(struct file *file) }
#define F_ALL_SEALS (F_SEAL_SEAL | \ + F_SEAL_EXEC | \ F_SEAL_SHRINK | \ F_SEAL_GROW | \ F_SEAL_WRITE | \ @@ -175,6 +176,7 @@ static int memfd_add_seals(struct file *file, unsigned int seals) * SEAL_SHRINK: Prevent the file from shrinking * SEAL_GROW: Prevent the file from growing * SEAL_WRITE: Prevent write access to the file + * SEAL_EXEC: Prevent modification of the exec bits in the file mode * * As we don't require any trust relationship between two parties, we * must prevent seals from being removed. Therefore, sealing a file diff --git a/mm/shmem.c b/mm/shmem.c index c1d8b8a1aa3b..e18a9cf9d937 100644 --- a/mm/shmem.c +++ b/mm/shmem.c @@ -1085,6 +1085,12 @@ static int shmem_setattr(struct user_namespace *mnt_userns, if (error) return error;
+ if ((info->seals & F_SEAL_EXEC) && (attr->ia_valid & ATTR_MODE)) { + if ((inode->i_mode ^ attr->ia_mode) & 0111) { + return -EPERM; + } + } + if (S_ISREG(inode->i_mode) && (attr->ia_valid & ATTR_SIZE)) { loff_t oldsize = inode->i_size; loff_t newsize = attr->ia_size;
On Fri, Dec 02, 2022 at 01:33:59AM +0000, jeffxu@chromium.org wrote:
From: Daniel Verkamp dverkamp@chromium.org
The new F_SEAL_EXEC flag will prevent modification of the exec bits: written as traditional octal mask, 0111, or as named flags, S_IXUSR | S_IXGRP | S_IXOTH. Any chmod(2) or similar call that attempts to modify any of these bits after the seal is applied will fail with errno EPERM.
This will preserve the execute bits as they are at the time of sealing, so the memfd will become either permanently executable or permanently un-executable.
Co-developed-by: Jeff Xu jeffxu@chromium.org Signed-off-by: Jeff Xu jeffxu@chromium.org Signed-off-by: Daniel Verkamp dverkamp@chromium.org
include/uapi/linux/fcntl.h | 1 + mm/memfd.c | 2 ++ mm/shmem.c | 6 ++++++ 3 files changed, 9 insertions(+)
diff --git a/include/uapi/linux/fcntl.h b/include/uapi/linux/fcntl.h index 2f86b2ad6d7e..e8c07da58c9f 100644 --- a/include/uapi/linux/fcntl.h +++ b/include/uapi/linux/fcntl.h @@ -43,6 +43,7 @@ #define F_SEAL_GROW 0x0004 /* prevent file from growing */ #define F_SEAL_WRITE 0x0008 /* prevent writes */ #define F_SEAL_FUTURE_WRITE 0x0010 /* prevent future writes while mapped */ +#define F_SEAL_EXEC 0x0020 /* prevent chmod modifying exec bits */ /* (1U << 31) is reserved for signed error codes */ /* diff --git a/mm/memfd.c b/mm/memfd.c index 08f5f8304746..4ebeab94aa74 100644 --- a/mm/memfd.c +++ b/mm/memfd.c @@ -147,6 +147,7 @@ static unsigned int *memfd_file_seals_ptr(struct file *file) } #define F_ALL_SEALS (F_SEAL_SEAL | \
F_SEAL_EXEC | \ F_SEAL_SHRINK | \ F_SEAL_GROW | \ F_SEAL_WRITE | \
@@ -175,6 +176,7 @@ static int memfd_add_seals(struct file *file, unsigned int seals) * SEAL_SHRINK: Prevent the file from shrinking * SEAL_GROW: Prevent the file from growing * SEAL_WRITE: Prevent write access to the file
* SEAL_EXEC: Prevent modification of the exec bits in the file mode
- As we don't require any trust relationship between two parties, we
- must prevent seals from being removed. Therefore, sealing a file
diff --git a/mm/shmem.c b/mm/shmem.c index c1d8b8a1aa3b..e18a9cf9d937 100644 --- a/mm/shmem.c +++ b/mm/shmem.c @@ -1085,6 +1085,12 @@ static int shmem_setattr(struct user_namespace *mnt_userns, if (error) return error;
- if ((info->seals & F_SEAL_EXEC) && (attr->ia_valid & ATTR_MODE)) {
if ((inode->i_mode ^ attr->ia_mode) & 0111) {
return -EPERM;
}
- }
- if (S_ISREG(inode->i_mode) && (attr->ia_valid & ATTR_SIZE)) { loff_t oldsize = inode->i_size; loff_t newsize = attr->ia_size;
-- 2.39.0.rc0.267.gcb52ba06e7-goog
This looks sensible to me!
Reviewed-by: Kees Cook keescook@chromium.org
On Fri, Dec 02, 2022 at 01:33:59AM +0000, jeffxu@chromium.org wrote:
From: Daniel Verkamp dverkamp@chromium.org
The new F_SEAL_EXEC flag will prevent modification of the exec bits: written as traditional octal mask, 0111, or as named flags, S_IXUSR | S_IXGRP | S_IXOTH. Any chmod(2) or similar call that attempts to modify any of these bits after the seal is applied will fail with errno EPERM.
This will preserve the execute bits as they are at the time of sealing, so the memfd will become either permanently executable or permanently un-executable.
Co-developed-by: Jeff Xu jeffxu@chromium.org Signed-off-by: Jeff Xu jeffxu@chromium.org Signed-off-by: Daniel Verkamp dverkamp@chromium.org
Oh, one note on tag ordering here. Since you're sending it, I would expect this to read as:
From: Daniel Verkamp dverkamp@chromium.org ... Signed-off-by: Daniel Verkamp dverkamp@chromium.org Co-developed-by: Jeff Xu jeffxu@chromium.org Signed-off-by: Jeff Xu jeffxu@chromium.org
From: Jeff Xu jeffxu@chromium.org
The new MFD_NOEXEC_SEAL and MFD_EXEC flags allows application to set executable bit at creation time (memfd_create).
When MFD_NOEXEC_SEAL is set, memfd is created without executable bit (mode:0666), and sealed with F_SEAL_EXEC, so it can't be chmod to be executable (mode: 0777) after creation.
when MFD_EXEC flag is set, memfd is created with executable bit (mode:0777), this is the same as the old behavior of memfd_create.
The new pid namespaced sysctl vm.memfd_noexec has 3 values: 0: memfd_create() without MFD_EXEC nor MFD_NOEXEC_SEAL acts like MFD_EXEC was set. 1: memfd_create() without MFD_EXEC nor MFD_NOEXEC_SEAL acts like MFD_NOEXEC_SEAL was set. 2:memfd_create() without MFD_NOEXEC_SEAL will be rejected.
The sysctl allows finer control of memfd_create for old-software that doesn't set the executable bit, for example, a container with vm.memfd_noexec=1 means the old-software will create non-executable memfd by default.
Co-developed-by: Daniel Verkamp dverkamp@chromium.org Signed-off-by: Daniel Verkamp dverkamp@chromium.org Signed-off-by: Jeff Xu jeffxu@chromium.org --- include/linux/pid_namespace.h | 19 ++++++++++++++ include/uapi/linux/memfd.h | 4 +++ kernel/pid_namespace.c | 47 +++++++++++++++++++++++++++++++++++ mm/memfd.c | 44 ++++++++++++++++++++++++++++++-- 4 files changed, 112 insertions(+), 2 deletions(-)
diff --git a/include/linux/pid_namespace.h b/include/linux/pid_namespace.h index 07481bb87d4e..a4789a7b34a9 100644 --- a/include/linux/pid_namespace.h +++ b/include/linux/pid_namespace.h @@ -16,6 +16,21 @@
struct fs_pin;
+#if defined(CONFIG_SYSCTL) && defined(CONFIG_MEMFD_CREATE) +/* + * sysctl for vm.memfd_noexec + * 0: memfd_create() without MFD_EXEC nor MFD_NOEXEC_SEAL + * acts like MFD_EXEC was set. + * 1: memfd_create() without MFD_EXEC nor MFD_NOEXEC_SEAL + * acts like MFD_NOEXEC_SEAL was set. + * 2: memfd_create() without MFD_NOEXEC_SEAL will be + * rejected. + */ +#define MEMFD_NOEXEC_SCOPE_EXEC 0 +#define MEMFD_NOEXEC_SCOPE_NOEXEC_SEAL 1 +#define MEMFD_NOEXEC_SCOPE_NOEXEC_ENFORCED 2 +#endif + struct pid_namespace { struct idr idr; struct rcu_head rcu; @@ -31,6 +46,10 @@ struct pid_namespace { struct ucounts *ucounts; int reboot; /* group exit code if this pidns was rebooted */ struct ns_common ns; +#if defined(CONFIG_SYSCTL) && defined(CONFIG_MEMFD_CREATE) + /* sysctl for vm.memfd_noexec */ + int memfd_noexec_scope; +#endif } __randomize_layout;
extern struct pid_namespace init_pid_ns; diff --git a/include/uapi/linux/memfd.h b/include/uapi/linux/memfd.h index 7a8a26751c23..273a4e15dfcf 100644 --- a/include/uapi/linux/memfd.h +++ b/include/uapi/linux/memfd.h @@ -8,6 +8,10 @@ #define MFD_CLOEXEC 0x0001U #define MFD_ALLOW_SEALING 0x0002U #define MFD_HUGETLB 0x0004U +/* not executable and sealed to prevent changing to executable. */ +#define MFD_NOEXEC_SEAL 0x0008U +/* executable */ +#define MFD_EXEC 0x0010U
/* * Huge page size encoding when MFD_HUGETLB is specified, and a huge page diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c index f4f8cb0435b4..71dd9b0a0f62 100644 --- a/kernel/pid_namespace.c +++ b/kernel/pid_namespace.c @@ -110,6 +110,10 @@ static struct pid_namespace *create_pid_namespace(struct user_namespace *user_ns ns->ucounts = ucounts; ns->pid_allocated = PIDNS_ADDING;
+#if defined(CONFIG_SYSCTL) && defined(CONFIG_MEMFD_CREATE) + ns->memfd_noexec_scope = MEMFD_NOEXEC_SCOPE_EXEC; +#endif + return ns;
out_free_idr: @@ -255,6 +259,45 @@ void zap_pid_ns_processes(struct pid_namespace *pid_ns) return; }
+#if defined(CONFIG_SYSCTL) && defined(CONFIG_MEMFD_CREATE) +int pid_mfd_noexec_dointvec_minmax(struct ctl_table *table, int write, + void *buffer, size_t *lenp, loff_t *ppos) +{ + struct pid_namespace *ns = task_active_pid_ns(current); + struct ctl_table table_copy; + + if (write && !capable(CAP_SYS_ADMIN)) + return -EPERM; + + table_copy = *table; + if (ns != &init_pid_ns) + table_copy.data = &ns->memfd_noexec_scope; + + /* + * set minimum to current value, the effect is only bigger + * value is accepted. + */ + if (*(int *)table_copy.data > *(int *)table_copy.extra1) + table_copy.extra1 = table_copy.data; + + return proc_dointvec_minmax(&table_copy, write, buffer, lenp, ppos); +} + +static struct ctl_table pid_ns_ctl_table_vm[] = { + { + .procname = "memfd_noexec", + .data = &init_pid_ns.memfd_noexec_scope, + .maxlen = sizeof(init_pid_ns.memfd_noexec_scope), + .mode = 0644, + .proc_handler = pid_mfd_noexec_dointvec_minmax, + .extra1 = SYSCTL_ZERO, + .extra2 = SYSCTL_TWO, + }, + { } +}; +static struct ctl_path vm_path[] = { { .procname = "vm", }, { } }; +#endif + #ifdef CONFIG_CHECKPOINT_RESTORE static int pid_ns_ctl_handler(struct ctl_table *table, int write, void *buffer, size_t *lenp, loff_t *ppos) @@ -455,6 +498,10 @@ static __init int pid_namespaces_init(void) #ifdef CONFIG_CHECKPOINT_RESTORE register_sysctl_paths(kern_path, pid_ns_ctl_table); #endif + +#if defined(CONFIG_SYSCTL) && defined(CONFIG_MEMFD_CREATE) + register_sysctl_paths(vm_path, pid_ns_ctl_table_vm); +#endif return 0; }
diff --git a/mm/memfd.c b/mm/memfd.c index 4ebeab94aa74..69e897dea6d5 100644 --- a/mm/memfd.c +++ b/mm/memfd.c @@ -18,6 +18,7 @@ #include <linux/hugetlb.h> #include <linux/shmem_fs.h> #include <linux/memfd.h> +#include <linux/pid_namespace.h> #include <uapi/linux/memfd.h>
/* @@ -263,12 +264,13 @@ long memfd_fcntl(struct file *file, unsigned int cmd, unsigned long arg) #define MFD_NAME_PREFIX_LEN (sizeof(MFD_NAME_PREFIX) - 1) #define MFD_NAME_MAX_LEN (NAME_MAX - MFD_NAME_PREFIX_LEN)
-#define MFD_ALL_FLAGS (MFD_CLOEXEC | MFD_ALLOW_SEALING | MFD_HUGETLB) +#define MFD_ALL_FLAGS (MFD_CLOEXEC | MFD_ALLOW_SEALING | MFD_HUGETLB | MFD_NOEXEC_SEAL | MFD_EXEC)
SYSCALL_DEFINE2(memfd_create, const char __user *, uname, unsigned int, flags) { + struct pid_namespace *ns; unsigned int *file_seals; struct file *file; int fd, error; @@ -285,6 +287,36 @@ SYSCALL_DEFINE2(memfd_create, return -EINVAL; }
+ /* Invalid if both EXEC and NOEXEC_SEAL are set.*/ + if ((flags & MFD_EXEC) && (flags & MFD_NOEXEC_SEAL)) + return -EINVAL; + + if (!(flags & (MFD_EXEC | MFD_NOEXEC_SEAL))) { +#ifdef CONFIG_SYSCTL + int sysctl = MEMFD_NOEXEC_SCOPE_EXEC; + + ns = task_active_pid_ns(current); + if (ns) + sysctl = ns->memfd_noexec_scope; + + if (sysctl == MEMFD_NOEXEC_SCOPE_EXEC) { + flags |= MFD_EXEC; + } else if (sysctl == MEMFD_NOEXEC_SCOPE_NOEXEC_SEAL) { + flags |= MFD_NOEXEC_SEAL; + } else { + pr_warn_ratelimited( + "memfd_create(): MFD_NOEXEC_SEAL is enforced, pid=%d\n", + task_pid_nr(current)); + return -EINVAL; + } +#else + flags |= MFD_EXEC; +#endif + pr_warn_ratelimited( + "memfd_create() without MFD_EXEC nor MFD_NOEXEC_SEAL, pid=%d\n", + task_pid_nr(current)); + } + /* length includes terminating zero */ len = strnlen_user(uname, MFD_NAME_MAX_LEN + 1); if (len <= 0) @@ -328,7 +360,15 @@ SYSCALL_DEFINE2(memfd_create, file->f_mode |= FMODE_LSEEK | FMODE_PREAD | FMODE_PWRITE; file->f_flags |= O_LARGEFILE;
- if (flags & MFD_ALLOW_SEALING) { + if (flags & MFD_NOEXEC_SEAL) { + struct inode *inode = file_inode(file); + + inode->i_mode &= ~0111; + file_seals = memfd_file_seals_ptr(file); + *file_seals &= ~F_SEAL_SEAL; + *file_seals |= F_SEAL_EXEC; + } else if (flags & MFD_ALLOW_SEALING) { + /* MFD_EXEC and MFD_ALLOW_SEALING are set */ file_seals = memfd_file_seals_ptr(file); *file_seals &= ~F_SEAL_SEAL; }
On Fri, Dec 02, 2022 at 01:34:00AM +0000, jeffxu@chromium.org wrote:
From: Jeff Xu jeffxu@chromium.org
The new MFD_NOEXEC_SEAL and MFD_EXEC flags allows application to set executable bit at creation time (memfd_create).
When MFD_NOEXEC_SEAL is set, memfd is created without executable bit (mode:0666), and sealed with F_SEAL_EXEC, so it can't be chmod to be executable (mode: 0777) after creation.
when MFD_EXEC flag is set, memfd is created with executable bit (mode:0777), this is the same as the old behavior of memfd_create.
The new pid namespaced sysctl vm.memfd_noexec has 3 values: 0: memfd_create() without MFD_EXEC nor MFD_NOEXEC_SEAL acts like MFD_EXEC was set. 1: memfd_create() without MFD_EXEC nor MFD_NOEXEC_SEAL acts like MFD_NOEXEC_SEAL was set. 2:memfd_create() without MFD_NOEXEC_SEAL will be rejected.
^ nit: missing space
The sysctl allows finer control of memfd_create for old-software that doesn't set the executable bit, for example, a container with vm.memfd_noexec=1 means the old-software will create non-executable memfd by default.
Co-developed-by: Daniel Verkamp dverkamp@chromium.org Signed-off-by: Daniel Verkamp dverkamp@chromium.org Signed-off-by: Jeff Xu jeffxu@chromium.org
include/linux/pid_namespace.h | 19 ++++++++++++++ include/uapi/linux/memfd.h | 4 +++ kernel/pid_namespace.c | 47 +++++++++++++++++++++++++++++++++++ mm/memfd.c | 44 ++++++++++++++++++++++++++++++-- 4 files changed, 112 insertions(+), 2 deletions(-)
diff --git a/include/linux/pid_namespace.h b/include/linux/pid_namespace.h index 07481bb87d4e..a4789a7b34a9 100644 --- a/include/linux/pid_namespace.h +++ b/include/linux/pid_namespace.h @@ -16,6 +16,21 @@ struct fs_pin; +#if defined(CONFIG_SYSCTL) && defined(CONFIG_MEMFD_CREATE) +/*
- sysctl for vm.memfd_noexec
- 0: memfd_create() without MFD_EXEC nor MFD_NOEXEC_SEAL
- acts like MFD_EXEC was set.
- 1: memfd_create() without MFD_EXEC nor MFD_NOEXEC_SEAL
- acts like MFD_NOEXEC_SEAL was set.
- 2: memfd_create() without MFD_NOEXEC_SEAL will be
- rejected.
- */
+#define MEMFD_NOEXEC_SCOPE_EXEC 0 +#define MEMFD_NOEXEC_SCOPE_NOEXEC_SEAL 1 +#define MEMFD_NOEXEC_SCOPE_NOEXEC_ENFORCED 2 +#endif
struct pid_namespace { struct idr idr; struct rcu_head rcu; @@ -31,6 +46,10 @@ struct pid_namespace { struct ucounts *ucounts; int reboot; /* group exit code if this pidns was rebooted */ struct ns_common ns; +#if defined(CONFIG_SYSCTL) && defined(CONFIG_MEMFD_CREATE)
- /* sysctl for vm.memfd_noexec */
- int memfd_noexec_scope;
+#endif } __randomize_layout; extern struct pid_namespace init_pid_ns; diff --git a/include/uapi/linux/memfd.h b/include/uapi/linux/memfd.h index 7a8a26751c23..273a4e15dfcf 100644 --- a/include/uapi/linux/memfd.h +++ b/include/uapi/linux/memfd.h @@ -8,6 +8,10 @@ #define MFD_CLOEXEC 0x0001U #define MFD_ALLOW_SEALING 0x0002U #define MFD_HUGETLB 0x0004U +/* not executable and sealed to prevent changing to executable. */ +#define MFD_NOEXEC_SEAL 0x0008U +/* executable */ +#define MFD_EXEC 0x0010U /*
- Huge page size encoding when MFD_HUGETLB is specified, and a huge page
diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c index f4f8cb0435b4..71dd9b0a0f62 100644 --- a/kernel/pid_namespace.c +++ b/kernel/pid_namespace.c @@ -110,6 +110,10 @@ static struct pid_namespace *create_pid_namespace(struct user_namespace *user_ns ns->ucounts = ucounts; ns->pid_allocated = PIDNS_ADDING; +#if defined(CONFIG_SYSCTL) && defined(CONFIG_MEMFD_CREATE)
- ns->memfd_noexec_scope = MEMFD_NOEXEC_SCOPE_EXEC;
+#endif
I think this should be inherited from the parent pid namespace instead?
- return ns;
out_free_idr: @@ -255,6 +259,45 @@ void zap_pid_ns_processes(struct pid_namespace *pid_ns) return; } +#if defined(CONFIG_SYSCTL) && defined(CONFIG_MEMFD_CREATE) +int pid_mfd_noexec_dointvec_minmax(struct ctl_table *table, int write,
- void *buffer, size_t *lenp, loff_t *ppos)
+{
- struct pid_namespace *ns = task_active_pid_ns(current);
- struct ctl_table table_copy;
- if (write && !capable(CAP_SYS_ADMIN))
return -EPERM;
- table_copy = *table;
- if (ns != &init_pid_ns)
table_copy.data = &ns->memfd_noexec_scope;
- /*
* set minimum to current value, the effect is only bigger
* value is accepted.
*/
- if (*(int *)table_copy.data > *(int *)table_copy.extra1)
table_copy.extra1 = table_copy.data;
Yeah, I like this kind of enforcement.
- return proc_dointvec_minmax(&table_copy, write, buffer, lenp, ppos);
+}
+static struct ctl_table pid_ns_ctl_table_vm[] = {
- {
.procname = "memfd_noexec",
.data = &init_pid_ns.memfd_noexec_scope,
.maxlen = sizeof(init_pid_ns.memfd_noexec_scope),
.mode = 0644,
.proc_handler = pid_mfd_noexec_dointvec_minmax,
.extra1 = SYSCTL_ZERO,
.extra2 = SYSCTL_TWO,
- },
- { }
+}; +static struct ctl_path vm_path[] = { { .procname = "vm", }, { } }; +#endif
#ifdef CONFIG_CHECKPOINT_RESTORE static int pid_ns_ctl_handler(struct ctl_table *table, int write, void *buffer, size_t *lenp, loff_t *ppos) @@ -455,6 +498,10 @@ static __init int pid_namespaces_init(void) #ifdef CONFIG_CHECKPOINT_RESTORE register_sysctl_paths(kern_path, pid_ns_ctl_table); #endif
+#if defined(CONFIG_SYSCTL) && defined(CONFIG_MEMFD_CREATE)
- register_sysctl_paths(vm_path, pid_ns_ctl_table_vm);
+#endif return 0; } diff --git a/mm/memfd.c b/mm/memfd.c index 4ebeab94aa74..69e897dea6d5 100644 --- a/mm/memfd.c +++ b/mm/memfd.c @@ -18,6 +18,7 @@ #include <linux/hugetlb.h> #include <linux/shmem_fs.h> #include <linux/memfd.h> +#include <linux/pid_namespace.h> #include <uapi/linux/memfd.h> /* @@ -263,12 +264,13 @@ long memfd_fcntl(struct file *file, unsigned int cmd, unsigned long arg) #define MFD_NAME_PREFIX_LEN (sizeof(MFD_NAME_PREFIX) - 1) #define MFD_NAME_MAX_LEN (NAME_MAX - MFD_NAME_PREFIX_LEN) -#define MFD_ALL_FLAGS (MFD_CLOEXEC | MFD_ALLOW_SEALING | MFD_HUGETLB) +#define MFD_ALL_FLAGS (MFD_CLOEXEC | MFD_ALLOW_SEALING | MFD_HUGETLB | MFD_NOEXEC_SEAL | MFD_EXEC) SYSCALL_DEFINE2(memfd_create, const char __user *, uname, unsigned int, flags) {
- struct pid_namespace *ns; unsigned int *file_seals; struct file *file; int fd, error;
@@ -285,6 +287,36 @@ SYSCALL_DEFINE2(memfd_create, return -EINVAL; }
- /* Invalid if both EXEC and NOEXEC_SEAL are set.*/
- if ((flags & MFD_EXEC) && (flags & MFD_NOEXEC_SEAL))
return -EINVAL;
- if (!(flags & (MFD_EXEC | MFD_NOEXEC_SEAL))) {
+#ifdef CONFIG_SYSCTL
int sysctl = MEMFD_NOEXEC_SCOPE_EXEC;
ns = task_active_pid_ns(current);
if (ns)
sysctl = ns->memfd_noexec_scope;
if (sysctl == MEMFD_NOEXEC_SCOPE_EXEC) {
flags |= MFD_EXEC;
} else if (sysctl == MEMFD_NOEXEC_SCOPE_NOEXEC_SEAL) {
flags |= MFD_NOEXEC_SEAL;
} else {
pr_warn_ratelimited(
"memfd_create(): MFD_NOEXEC_SEAL is enforced, pid=%d\n",
task_pid_nr(current));
return -EINVAL;
}
Not a huge deal, but the above could be a switch statement to improve readability.
+#else
flags |= MFD_EXEC;
+#endif
pr_warn_ratelimited(
"memfd_create() without MFD_EXEC nor MFD_NOEXEC_SEAL, pid=%d\n",
task_pid_nr(current));
Perhaps include process name as well -- makes admin lives easier. :)
pr_warn_ratelimited( "memfd_create() without MFD_EXEC nor MFD_NOEXEC_SEAL: %s[%d]\n", current->comm, task_pid_nr(current));
- }
- /* length includes terminating zero */ len = strnlen_user(uname, MFD_NAME_MAX_LEN + 1); if (len <= 0)
@@ -328,7 +360,15 @@ SYSCALL_DEFINE2(memfd_create, file->f_mode |= FMODE_LSEEK | FMODE_PREAD | FMODE_PWRITE; file->f_flags |= O_LARGEFILE;
- if (flags & MFD_ALLOW_SEALING) {
- if (flags & MFD_NOEXEC_SEAL) {
struct inode *inode = file_inode(file);
inode->i_mode &= ~0111;
file_seals = memfd_file_seals_ptr(file);
*file_seals &= ~F_SEAL_SEAL;
*file_seals |= F_SEAL_EXEC;
- } else if (flags & MFD_ALLOW_SEALING) {
file_seals = memfd_file_seals_ptr(file); *file_seals &= ~F_SEAL_SEAL; }/* MFD_EXEC and MFD_ALLOW_SEALING are set */
-- 2.39.0.rc0.267.gcb52ba06e7-goog
Otherwise, looks good to me.
Reviewed-by: Kees Cook keescook@chromium.org
On Fri, Dec 2, 2022 at 2:56 PM Kees Cook keescook@chromium.org wrote:
On Fri, Dec 02, 2022 at 01:34:00AM +0000, jeffxu@chromium.org wrote:
From: Jeff Xu jeffxu@chromium.org
The new MFD_NOEXEC_SEAL and MFD_EXEC flags allows application to set executable bit at creation time (memfd_create).
When MFD_NOEXEC_SEAL is set, memfd is created without executable bit (mode:0666), and sealed with F_SEAL_EXEC, so it can't be chmod to be executable (mode: 0777) after creation.
when MFD_EXEC flag is set, memfd is created with executable bit (mode:0777), this is the same as the old behavior of memfd_create.
The new pid namespaced sysctl vm.memfd_noexec has 3 values: 0: memfd_create() without MFD_EXEC nor MFD_NOEXEC_SEAL acts like MFD_EXEC was set. 1: memfd_create() without MFD_EXEC nor MFD_NOEXEC_SEAL acts like MFD_NOEXEC_SEAL was set. 2:memfd_create() without MFD_NOEXEC_SEAL will be rejected.
^ nit: missing space
The sysctl allows finer control of memfd_create for old-software that doesn't set the executable bit, for example, a container with vm.memfd_noexec=1 means the old-software will create non-executable memfd by default.
Co-developed-by: Daniel Verkamp dverkamp@chromium.org Signed-off-by: Daniel Verkamp dverkamp@chromium.org Signed-off-by: Jeff Xu jeffxu@chromium.org
include/linux/pid_namespace.h | 19 ++++++++++++++ include/uapi/linux/memfd.h | 4 +++ kernel/pid_namespace.c | 47 +++++++++++++++++++++++++++++++++++ mm/memfd.c | 44 ++++++++++++++++++++++++++++++-- 4 files changed, 112 insertions(+), 2 deletions(-)
diff --git a/include/linux/pid_namespace.h b/include/linux/pid_namespace.h index 07481bb87d4e..a4789a7b34a9 100644 --- a/include/linux/pid_namespace.h +++ b/include/linux/pid_namespace.h @@ -16,6 +16,21 @@
struct fs_pin;
+#if defined(CONFIG_SYSCTL) && defined(CONFIG_MEMFD_CREATE) +/*
- sysctl for vm.memfd_noexec
- 0: memfd_create() without MFD_EXEC nor MFD_NOEXEC_SEAL
- acts like MFD_EXEC was set.
- 1: memfd_create() without MFD_EXEC nor MFD_NOEXEC_SEAL
- acts like MFD_NOEXEC_SEAL was set.
- 2: memfd_create() without MFD_NOEXEC_SEAL will be
- rejected.
- */
+#define MEMFD_NOEXEC_SCOPE_EXEC 0 +#define MEMFD_NOEXEC_SCOPE_NOEXEC_SEAL 1 +#define MEMFD_NOEXEC_SCOPE_NOEXEC_ENFORCED 2 +#endif
struct pid_namespace { struct idr idr; struct rcu_head rcu; @@ -31,6 +46,10 @@ struct pid_namespace { struct ucounts *ucounts; int reboot; /* group exit code if this pidns was rebooted */ struct ns_common ns; +#if defined(CONFIG_SYSCTL) && defined(CONFIG_MEMFD_CREATE)
/* sysctl for vm.memfd_noexec */
int memfd_noexec_scope;
+#endif } __randomize_layout;
extern struct pid_namespace init_pid_ns; diff --git a/include/uapi/linux/memfd.h b/include/uapi/linux/memfd.h index 7a8a26751c23..273a4e15dfcf 100644 --- a/include/uapi/linux/memfd.h +++ b/include/uapi/linux/memfd.h @@ -8,6 +8,10 @@ #define MFD_CLOEXEC 0x0001U #define MFD_ALLOW_SEALING 0x0002U #define MFD_HUGETLB 0x0004U +/* not executable and sealed to prevent changing to executable. */ +#define MFD_NOEXEC_SEAL 0x0008U +/* executable */ +#define MFD_EXEC 0x0010U
/*
- Huge page size encoding when MFD_HUGETLB is specified, and a huge page
diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c index f4f8cb0435b4..71dd9b0a0f62 100644 --- a/kernel/pid_namespace.c +++ b/kernel/pid_namespace.c @@ -110,6 +110,10 @@ static struct pid_namespace *create_pid_namespace(struct user_namespace *user_ns ns->ucounts = ucounts; ns->pid_allocated = PIDNS_ADDING;
+#if defined(CONFIG_SYSCTL) && defined(CONFIG_MEMFD_CREATE)
ns->memfd_noexec_scope = MEMFD_NOEXEC_SCOPE_EXEC;
+#endif
I think this should be inherited from the parent pid namespace instead?
OK, I will add logic to make the child name space to inherit from the parent namespace at creation time, but if the parent changes this setting later, the child will not pick up the parent's change automatically, would that be OK ?
Or do we want the child to keep in sync with the parent, to the highest value. For example, if init namespace val=1, child namespace=1, later when init namespace = 2, child namespace will automatically become 2.
Jeff
return ns;
out_free_idr: @@ -255,6 +259,45 @@ void zap_pid_ns_processes(struct pid_namespace *pid_ns) return; }
+#if defined(CONFIG_SYSCTL) && defined(CONFIG_MEMFD_CREATE) +int pid_mfd_noexec_dointvec_minmax(struct ctl_table *table, int write,
void *buffer, size_t *lenp, loff_t *ppos)
+{
struct pid_namespace *ns = task_active_pid_ns(current);
struct ctl_table table_copy;
if (write && !capable(CAP_SYS_ADMIN))
return -EPERM;
table_copy = *table;
if (ns != &init_pid_ns)
table_copy.data = &ns->memfd_noexec_scope;
/*
* set minimum to current value, the effect is only bigger
* value is accepted.
*/
if (*(int *)table_copy.data > *(int *)table_copy.extra1)
table_copy.extra1 = table_copy.data;
Yeah, I like this kind of enforcement.
return proc_dointvec_minmax(&table_copy, write, buffer, lenp, ppos);
+}
+static struct ctl_table pid_ns_ctl_table_vm[] = {
{
.procname = "memfd_noexec",
.data = &init_pid_ns.memfd_noexec_scope,
.maxlen = sizeof(init_pid_ns.memfd_noexec_scope),
.mode = 0644,
.proc_handler = pid_mfd_noexec_dointvec_minmax,
.extra1 = SYSCTL_ZERO,
.extra2 = SYSCTL_TWO,
},
{ }
+}; +static struct ctl_path vm_path[] = { { .procname = "vm", }, { } }; +#endif
#ifdef CONFIG_CHECKPOINT_RESTORE static int pid_ns_ctl_handler(struct ctl_table *table, int write, void *buffer, size_t *lenp, loff_t *ppos) @@ -455,6 +498,10 @@ static __init int pid_namespaces_init(void) #ifdef CONFIG_CHECKPOINT_RESTORE register_sysctl_paths(kern_path, pid_ns_ctl_table); #endif
+#if defined(CONFIG_SYSCTL) && defined(CONFIG_MEMFD_CREATE)
register_sysctl_paths(vm_path, pid_ns_ctl_table_vm);
+#endif return 0; }
diff --git a/mm/memfd.c b/mm/memfd.c index 4ebeab94aa74..69e897dea6d5 100644 --- a/mm/memfd.c +++ b/mm/memfd.c @@ -18,6 +18,7 @@ #include <linux/hugetlb.h> #include <linux/shmem_fs.h> #include <linux/memfd.h> +#include <linux/pid_namespace.h> #include <uapi/linux/memfd.h>
/* @@ -263,12 +264,13 @@ long memfd_fcntl(struct file *file, unsigned int cmd, unsigned long arg) #define MFD_NAME_PREFIX_LEN (sizeof(MFD_NAME_PREFIX) - 1) #define MFD_NAME_MAX_LEN (NAME_MAX - MFD_NAME_PREFIX_LEN)
-#define MFD_ALL_FLAGS (MFD_CLOEXEC | MFD_ALLOW_SEALING | MFD_HUGETLB) +#define MFD_ALL_FLAGS (MFD_CLOEXEC | MFD_ALLOW_SEALING | MFD_HUGETLB | MFD_NOEXEC_SEAL | MFD_EXEC)
SYSCALL_DEFINE2(memfd_create, const char __user *, uname, unsigned int, flags) {
struct pid_namespace *ns; unsigned int *file_seals; struct file *file; int fd, error;
@@ -285,6 +287,36 @@ SYSCALL_DEFINE2(memfd_create, return -EINVAL; }
/* Invalid if both EXEC and NOEXEC_SEAL are set.*/
if ((flags & MFD_EXEC) && (flags & MFD_NOEXEC_SEAL))
return -EINVAL;
if (!(flags & (MFD_EXEC | MFD_NOEXEC_SEAL))) {
+#ifdef CONFIG_SYSCTL
int sysctl = MEMFD_NOEXEC_SCOPE_EXEC;
ns = task_active_pid_ns(current);
if (ns)
sysctl = ns->memfd_noexec_scope;
if (sysctl == MEMFD_NOEXEC_SCOPE_EXEC) {
flags |= MFD_EXEC;
} else if (sysctl == MEMFD_NOEXEC_SCOPE_NOEXEC_SEAL) {
flags |= MFD_NOEXEC_SEAL;
} else {
pr_warn_ratelimited(
"memfd_create(): MFD_NOEXEC_SEAL is enforced, pid=%d\n",
task_pid_nr(current));
return -EINVAL;
}
Not a huge deal, but the above could be a switch statement to improve readability.
+#else
flags |= MFD_EXEC;
+#endif
pr_warn_ratelimited(
"memfd_create() without MFD_EXEC nor MFD_NOEXEC_SEAL, pid=%d\n",
task_pid_nr(current));
Perhaps include process name as well -- makes admin lives easier. :)
pr_warn_ratelimited( "memfd_create() without MFD_EXEC nor MFD_NOEXEC_SEAL: %s[%d]\n", current->comm, task_pid_nr(current));
}
/* length includes terminating zero */ len = strnlen_user(uname, MFD_NAME_MAX_LEN + 1); if (len <= 0)
@@ -328,7 +360,15 @@ SYSCALL_DEFINE2(memfd_create, file->f_mode |= FMODE_LSEEK | FMODE_PREAD | FMODE_PWRITE; file->f_flags |= O_LARGEFILE;
if (flags & MFD_ALLOW_SEALING) {
if (flags & MFD_NOEXEC_SEAL) {
struct inode *inode = file_inode(file);
inode->i_mode &= ~0111;
file_seals = memfd_file_seals_ptr(file);
*file_seals &= ~F_SEAL_SEAL;
*file_seals |= F_SEAL_EXEC;
} else if (flags & MFD_ALLOW_SEALING) {
/* MFD_EXEC and MFD_ALLOW_SEALING are set */ file_seals = memfd_file_seals_ptr(file); *file_seals &= ~F_SEAL_SEAL; }
-- 2.39.0.rc0.267.gcb52ba06e7-goog
Otherwise, looks good to me.
Reviewed-by: Kees Cook keescook@chromium.org
-- Kees Cook
From: Daniel Verkamp dverkamp@chromium.org
Basic tests to ensure that user/group/other execute bits cannot be changed after applying F_SEAL_EXEC to a memfd.
Co-developed-by: Jeff Xu jeffxu@chromium.org Signed-off-by: Jeff Xu jeffxu@chromium.org Signed-off-by: Daniel Verkamp dverkamp@chromium.org --- tools/testing/selftests/memfd/memfd_test.c | 129 ++++++++++++++++++++- 1 file changed, 128 insertions(+), 1 deletion(-)
diff --git a/tools/testing/selftests/memfd/memfd_test.c b/tools/testing/selftests/memfd/memfd_test.c index 94df2692e6e4..1d7e7b36bbdd 100644 --- a/tools/testing/selftests/memfd/memfd_test.c +++ b/tools/testing/selftests/memfd/memfd_test.c @@ -28,12 +28,44 @@ #define MFD_DEF_SIZE 8192 #define STACK_SIZE 65536
+#ifndef F_SEAL_EXEC +#define F_SEAL_EXEC 0x0020 +#endif + +#ifndef MAX_PATH +#define MAX_PATH 256 +#endif + /* * Default is not to test hugetlbfs */ static size_t mfd_def_size = MFD_DEF_SIZE; static const char *memfd_str = MEMFD_STR;
+static ssize_t fd2name(int fd, char *buf, size_t bufsize) +{ + char buf1[MAX_PATH]; + int size; + ssize_t nbytes; + + size = snprintf(buf1, MAX_PATH, "/proc/self/fd/%d", fd); + if (size < 0) { + printf("snprintf(%d) failed on %m\n", fd); + abort(); + } + + /* + * reserver one byte for string termination. + */ + nbytes = readlink(buf1, buf, bufsize-1); + if (nbytes == -1) { + printf("readlink(%s) failed %m\n", buf1); + abort(); + } + buf[nbytes] = '\0'; + return nbytes; +} + static int mfd_assert_new(const char *name, loff_t sz, unsigned int flags) { int r, fd; @@ -98,11 +130,14 @@ static unsigned int mfd_assert_get_seals(int fd)
static void mfd_assert_has_seals(int fd, unsigned int seals) { + char buf[MAX_PATH]; + int nbytes; unsigned int s; + fd2name(fd, buf, MAX_PATH);
s = mfd_assert_get_seals(fd); if (s != seals) { - printf("%u != %u = GET_SEALS(%d)\n", seals, s, fd); + printf("%u != %u = GET_SEALS(%s)\n", seals, s, buf); abort(); } } @@ -594,6 +629,64 @@ static void mfd_fail_grow_write(int fd) } }
+static void mfd_assert_mode(int fd, int mode) +{ + struct stat st; + char buf[MAX_PATH]; + int nbytes; + + fd2name(fd, buf, MAX_PATH); + + if (fstat(fd, &st) < 0) { + printf("fstat(%s) failed: %m\n", buf); + abort(); + } + + if ((st.st_mode & 07777) != mode) { + printf("fstat(%s) wrong file mode 0%04o, but expected 0%04o\n", + buf, (int)st.st_mode & 07777, mode); + abort(); + } +} + +static void mfd_assert_chmod(int fd, int mode) +{ + char buf[MAX_PATH]; + int nbytes; + + fd2name(fd, buf, MAX_PATH); + + if (fchmod(fd, mode) < 0) { + printf("fchmod(%s, 0%04o) failed: %m\n", buf, mode); + abort(); + } + + mfd_assert_mode(fd, mode); +} + +static void mfd_fail_chmod(int fd, int mode) +{ + struct stat st; + char buf[MAX_PATH]; + int nbytes; + + fd2name(fd, buf, MAX_PATH); + + if (fstat(fd, &st) < 0) { + printf("fstat(%s) failed: %m\n", buf); + abort(); + } + + if (fchmod(fd, mode) == 0) { + printf("fchmod(%s, 0%04o) didn't fail as expected\n", + buf, mode); + abort(); + } + + /* verify that file mode bits did not change */ + mfd_assert_mode(fd, st.st_mode & 07777); +} + static int idle_thread_fn(void *arg) { sigset_t set; @@ -880,6 +973,39 @@ static void test_seal_resize(void) close(fd); }
+/* + * Test SEAL_EXEC + * Test that chmod() cannot change x bits after sealing + */ +static void test_seal_exec(void) +{ + int fd; + + printf("%s SEAL-EXEC\n", memfd_str); + + fd = mfd_assert_new("kern_memfd_seal_exec", + mfd_def_size, + MFD_CLOEXEC | MFD_ALLOW_SEALING); + + mfd_assert_mode(fd, 0777); + + mfd_assert_chmod(fd, 0644); + + mfd_assert_has_seals(fd, 0); + mfd_assert_add_seals(fd, F_SEAL_EXEC); + mfd_assert_has_seals(fd, F_SEAL_EXEC); + + mfd_assert_chmod(fd, 0600); + mfd_fail_chmod(fd, 0777); + mfd_fail_chmod(fd, 0670); + mfd_fail_chmod(fd, 0605); + mfd_fail_chmod(fd, 0700); + mfd_fail_chmod(fd, 0100); + mfd_assert_chmod(fd, 0666); + + close(fd); +} + /* * Test sharing via dup() * Test that seals are shared between dupped FDs and they're all equal. @@ -1059,6 +1185,7 @@ int main(int argc, char **argv) test_seal_shrink(); test_seal_grow(); test_seal_resize(); + test_seal_exec();
test_share_dup("SHARE-DUP", ""); test_share_mmap("SHARE-MMAP", "");
From: Jeff Xu jeffxu@chromium.org
Tests to verify MFD_NOEXEC, MFD_EXEC and vm.memfd_noexec sysctl.
Co-developed-by: Daniel Verkamp dverkamp@chromium.org Signed-off-by: Daniel Verkamp dverkamp@chromium.org Signed-off-by: Jeff Xu jeffxu@chromium.org --- tools/testing/selftests/memfd/fuse_test.c | 1 + tools/testing/selftests/memfd/memfd_test.c | 161 ++++++++++++++++++++- 2 files changed, 157 insertions(+), 5 deletions(-)
diff --git a/tools/testing/selftests/memfd/fuse_test.c b/tools/testing/selftests/memfd/fuse_test.c index be675002f918..93798c8c5d54 100644 --- a/tools/testing/selftests/memfd/fuse_test.c +++ b/tools/testing/selftests/memfd/fuse_test.c @@ -22,6 +22,7 @@ #include <linux/falloc.h> #include <fcntl.h> #include <linux/memfd.h> +#include <linux/types.h> #include <sched.h> #include <stdio.h> #include <stdlib.h> diff --git a/tools/testing/selftests/memfd/memfd_test.c b/tools/testing/selftests/memfd/memfd_test.c index 1d7e7b36bbdd..775c9e6c061e 100644 --- a/tools/testing/selftests/memfd/memfd_test.c +++ b/tools/testing/selftests/memfd/memfd_test.c @@ -36,6 +36,10 @@ #define MAX_PATH 256 #endif
+#ifndef MFD_NOEXEC_SEAL +#define MFD_NOEXEC_SEAL 0x0008U +#endif + /* * Default is not to test hugetlbfs */ @@ -86,6 +90,21 @@ static int mfd_assert_new(const char *name, loff_t sz, unsigned int flags) return fd; }
+static void sysctl_assert_write(const char *val) +{ + int fd = open("/proc/sys/vm/memfd_noexec", O_WRONLY | O_CLOEXEC); + + if (fd < 0) { + printf("open sysctl failed\n"); + abort(); + } + + if (write(fd, val, strlen(val)) < 0) { + printf("write sysctl failed\n"); + abort(); + } +} + static int mfd_assert_reopen_fd(int fd_in) { int fd; @@ -764,6 +783,9 @@ static void test_create(void) mfd_fail_new("", ~0); mfd_fail_new("", 0x80000000U);
+ /* verify EXEC and NOEXEC_SEAL can't both be set */ + mfd_fail_new("", MFD_EXEC | MFD_NOEXEC_SEAL); + /* verify MFD_CLOEXEC is allowed */ fd = mfd_assert_new("", 0, MFD_CLOEXEC); close(fd); @@ -975,9 +997,10 @@ static void test_seal_resize(void)
/* * Test SEAL_EXEC - * Test that chmod() cannot change x bits after sealing + * Test fd is created with exec and allow sealing. + * chmod() cannot change x bits after sealing. */ -static void test_seal_exec(void) +static void test_exec_seal(void) { int fd;
@@ -985,10 +1008,9 @@ static void test_seal_exec(void)
fd = mfd_assert_new("kern_memfd_seal_exec", mfd_def_size, - MFD_CLOEXEC | MFD_ALLOW_SEALING); + MFD_CLOEXEC | MFD_ALLOW_SEALING | MFD_EXEC);
mfd_assert_mode(fd, 0777); - mfd_assert_chmod(fd, 0644);
mfd_assert_has_seals(fd, 0); @@ -1006,6 +1028,131 @@ static void test_seal_exec(void) close(fd); }
+/* + * Test EXEC_NO_SEAL + * Test fd is created with exec and not allow sealing. + */ +static void test_exec_no_seal(void) +{ + int fd; + + printf("%s EXEC_NO_SEAL\n", memfd_str); + + /* Create with EXEC but without ALLOW_SEALING */ + fd = mfd_assert_new("kern_memfd_exec_no_sealing", + mfd_def_size, + MFD_CLOEXEC | MFD_EXEC); + mfd_assert_mode(fd, 0777); + mfd_assert_has_seals(fd, F_SEAL_SEAL); + mfd_assert_chmod(fd, 0666); + close(fd); +} + +/* + * Test memfd_create with MFD_NOEXEC flag + */ +static void test_noexec_seal(void) +{ + int fd; + + printf("%s NOEXEC_SEAL\n", memfd_str); + + /* Create with NOEXEC and ALLOW_SEALING */ + fd = mfd_assert_new("kern_memfd_noexec", + mfd_def_size, + MFD_CLOEXEC | MFD_ALLOW_SEALING | MFD_NOEXEC_SEAL); + mfd_assert_mode(fd, 0666); + mfd_assert_has_seals(fd, F_SEAL_EXEC); + mfd_fail_chmod(fd, 0777); + close(fd); + + /* Create with NOEXEC but without ALLOW_SEALING */ + fd = mfd_assert_new("kern_memfd_noexec", + mfd_def_size, + MFD_CLOEXEC | MFD_NOEXEC_SEAL); + mfd_assert_mode(fd, 0666); + mfd_assert_has_seals(fd, F_SEAL_EXEC); + mfd_fail_chmod(fd, 0777); + close(fd); +} + +static void test_sysctl_child(void) +{ + int fd, pid, ret; + + printf("%s sysctl 0\n", memfd_str); + sysctl_assert_write("0"); + fd = mfd_assert_new("kern_memfd_sysctl_0", + mfd_def_size, + MFD_CLOEXEC | MFD_ALLOW_SEALING); + + mfd_assert_mode(fd, 0777); + mfd_assert_has_seals(fd, 0); + mfd_assert_chmod(fd, 0644); + close(fd); + + printf("%s sysctl 1\n", memfd_str); + sysctl_assert_write("1"); + fd = mfd_assert_new("kern_memfd_sysctl_1", + mfd_def_size, + MFD_CLOEXEC | MFD_ALLOW_SEALING); + + mfd_assert_mode(fd, 0666); + mfd_assert_has_seals(fd, F_SEAL_EXEC); + mfd_fail_chmod(fd, 0777); + close(fd); + + printf("%s sysctl 2\n", memfd_str); + sysctl_assert_write("2"); + mfd_fail_new("kern_memfd_sysctl_2", + MFD_CLOEXEC | MFD_ALLOW_SEALING); +} + +static int newpid_thread_fn(void *arg) +{ + test_sysctl_child(); + return 0; +} + +static pid_t spawn_newpid_thread(unsigned int flags) +{ + uint8_t *stack; + pid_t pid; + + stack = malloc(STACK_SIZE); + if (!stack) { + printf("malloc(STACK_SIZE) failed: %m\n"); + abort(); + } + + pid = clone(newpid_thread_fn, + stack + STACK_SIZE, + SIGCHLD | flags, + NULL); + if (pid < 0) { + printf("clone() failed: %m\n"); + abort(); + } + + return pid; +} + +static void join_newpid_thread(pid_t pid) +{ + waitpid(pid, NULL, 0); +} + +/* + * Test sysctl + * A very basic sealing test to see whether setting/retrieving seals works. + */ +static void test_sysctl(void) +{ + int pid = spawn_newpid_thread(CLONE_NEWPID); + + join_newpid_thread(pid); +} + /* * Test sharing via dup() * Test that seals are shared between dupped FDs and they're all equal. @@ -1179,13 +1326,15 @@ int main(int argc, char **argv)
test_create(); test_basic(); + test_exec_seal(); + test_exec_no_seal(); + test_noexec_seal();
test_seal_write(); test_seal_future_write(); test_seal_shrink(); test_seal_grow(); test_seal_resize(); - test_seal_exec();
test_share_dup("SHARE-DUP", ""); test_share_mmap("SHARE-MMAP", ""); @@ -1201,6 +1350,8 @@ int main(int argc, char **argv) test_share_fork("SHARE-FORK", SHARED_FT_STR); join_idle_thread(pid);
+ test_sysctl(); + printf("memfd: DONE\n");
return 0;
From: Jeff Xu jeffxu@chromium.org
The new security_memfd_create allows lsm to check flags of memfd_create.
The security by default system (such as chromeos) can use this to implement system wide lsm to allow only non-executable memfd being created.
Signed-off-by: Jeff Xu jeffxu@chromium.org --- include/linux/lsm_hook_defs.h | 1 + include/linux/lsm_hooks.h | 4 ++++ include/linux/security.h | 6 ++++++ mm/memfd.c | 5 +++++ 4 files changed, 16 insertions(+)
diff --git a/include/linux/lsm_hook_defs.h b/include/linux/lsm_hook_defs.h index ec119da1d89b..fd40840927c8 100644 --- a/include/linux/lsm_hook_defs.h +++ b/include/linux/lsm_hook_defs.h @@ -164,6 +164,7 @@ LSM_HOOK(int, 0, file_alloc_security, struct file *file) LSM_HOOK(void, LSM_RET_VOID, file_free_security, struct file *file) LSM_HOOK(int, 0, file_ioctl, struct file *file, unsigned int cmd, unsigned long arg) +LSM_HOOK(int, 0, memfd_create, char *name, unsigned int flags) LSM_HOOK(int, 0, mmap_addr, unsigned long addr) LSM_HOOK(int, 0, mmap_file, struct file *file, unsigned long reqprot, unsigned long prot, unsigned long flags) diff --git a/include/linux/lsm_hooks.h b/include/linux/lsm_hooks.h index 4ec80b96c22e..5a18a6552278 100644 --- a/include/linux/lsm_hooks.h +++ b/include/linux/lsm_hooks.h @@ -543,6 +543,10 @@ * simple integer value. When @arg represents a user space pointer, it * should never be used by the security module. * Return 0 if permission is granted. + * @memfd_create: + * @name is the name of memfd file. + * @flags is the flags used in memfd_create. + * Return 0 if permission is granted. * @mmap_addr : * Check permissions for a mmap operation at @addr. * @addr contains virtual address that will be used for the operation. diff --git a/include/linux/security.h b/include/linux/security.h index ca1b7109c0db..5b87a780822a 100644 --- a/include/linux/security.h +++ b/include/linux/security.h @@ -384,6 +384,7 @@ int security_file_permission(struct file *file, int mask); int security_file_alloc(struct file *file); void security_file_free(struct file *file); int security_file_ioctl(struct file *file, unsigned int cmd, unsigned long arg); +int security_memfd_create(char *name, unsigned int flags); int security_mmap_file(struct file *file, unsigned long prot, unsigned long flags); int security_mmap_addr(unsigned long addr); @@ -963,6 +964,11 @@ static inline int security_file_ioctl(struct file *file, unsigned int cmd, return 0; }
+static inline int security_memfd_create(char *name, unsigned int flags) +{ + return 0; +} + static inline int security_mmap_file(struct file *file, unsigned long prot, unsigned long flags) { diff --git a/mm/memfd.c b/mm/memfd.c index 69e897dea6d5..96dcfbfed09e 100644 --- a/mm/memfd.c +++ b/mm/memfd.c @@ -346,6 +346,11 @@ SYSCALL_DEFINE2(memfd_create, goto err_name; }
+ /* security hook for memfd_create */ + error = security_memfd_create(name, flags); + if (error) + return error; + if (flags & MFD_HUGETLB) { file = hugetlb_file_setup(name, 0, VM_NORESERVE, HUGETLB_ANONHUGE_INODE,
On Fri, Dec 02, 2022 at 01:34:03AM +0000, jeffxu@chromium.org wrote:
From: Jeff Xu jeffxu@chromium.org
The new security_memfd_create allows lsm to check flags of memfd_create.
The security by default system (such as chromeos) can use this to implement system wide lsm to allow only non-executable memfd being created.
Signed-off-by: Jeff Xu jeffxu@chromium.org
include/linux/lsm_hook_defs.h | 1 + include/linux/lsm_hooks.h | 4 ++++ include/linux/security.h | 6 ++++++ mm/memfd.c | 5 +++++ 4 files changed, 16 insertions(+)
diff --git a/include/linux/lsm_hook_defs.h b/include/linux/lsm_hook_defs.h index ec119da1d89b..fd40840927c8 100644 --- a/include/linux/lsm_hook_defs.h +++ b/include/linux/lsm_hook_defs.h @@ -164,6 +164,7 @@ LSM_HOOK(int, 0, file_alloc_security, struct file *file) LSM_HOOK(void, LSM_RET_VOID, file_free_security, struct file *file) LSM_HOOK(int, 0, file_ioctl, struct file *file, unsigned int cmd, unsigned long arg) +LSM_HOOK(int, 0, memfd_create, char *name, unsigned int flags) LSM_HOOK(int, 0, mmap_addr, unsigned long addr) LSM_HOOK(int, 0, mmap_file, struct file *file, unsigned long reqprot, unsigned long prot, unsigned long flags) diff --git a/include/linux/lsm_hooks.h b/include/linux/lsm_hooks.h index 4ec80b96c22e..5a18a6552278 100644 --- a/include/linux/lsm_hooks.h +++ b/include/linux/lsm_hooks.h @@ -543,6 +543,10 @@
- simple integer value. When @arg represents a user space pointer, it
- should never be used by the security module.
- Return 0 if permission is granted.
- @memfd_create:
- @name is the name of memfd file.
- @flags is the flags used in memfd_create.
- Return 0 if permission is granted.
- @mmap_addr :
- Check permissions for a mmap operation at @addr.
- @addr contains virtual address that will be used for the operation.
diff --git a/include/linux/security.h b/include/linux/security.h index ca1b7109c0db..5b87a780822a 100644 --- a/include/linux/security.h +++ b/include/linux/security.h @@ -384,6 +384,7 @@ int security_file_permission(struct file *file, int mask); int security_file_alloc(struct file *file); void security_file_free(struct file *file); int security_file_ioctl(struct file *file, unsigned int cmd, unsigned long arg); +int security_memfd_create(char *name, unsigned int flags); int security_mmap_file(struct file *file, unsigned long prot, unsigned long flags); int security_mmap_addr(unsigned long addr); @@ -963,6 +964,11 @@ static inline int security_file_ioctl(struct file *file, unsigned int cmd, return 0; } +static inline int security_memfd_create(char *name, unsigned int flags) +{
- return 0;
+}
I think this is missing the security/security.c changes for the non-inline version?
-Kees
static inline int security_mmap_file(struct file *file, unsigned long prot, unsigned long flags) { diff --git a/mm/memfd.c b/mm/memfd.c index 69e897dea6d5..96dcfbfed09e 100644 --- a/mm/memfd.c +++ b/mm/memfd.c @@ -346,6 +346,11 @@ SYSCALL_DEFINE2(memfd_create, goto err_name; }
- /* security hook for memfd_create */
- error = security_memfd_create(name, flags);
- if (error)
return error;
- if (flags & MFD_HUGETLB) { file = hugetlb_file_setup(name, 0, VM_NORESERVE, HUGETLB_ANONHUGE_INODE,
-- 2.39.0.rc0.267.gcb52ba06e7-goog
On Fri, Dec 2, 2022 at 2:58 PM Kees Cook keescook@chromium.org wrote:
On Fri, Dec 02, 2022 at 01:34:03AM +0000, jeffxu@chromium.org wrote:
From: Jeff Xu jeffxu@chromium.org
The new security_memfd_create allows lsm to check flags of memfd_create.
The security by default system (such as chromeos) can use this to implement system wide lsm to allow only non-executable memfd being created.
Signed-off-by: Jeff Xu jeffxu@chromium.org
include/linux/lsm_hook_defs.h | 1 + include/linux/lsm_hooks.h | 4 ++++ include/linux/security.h | 6 ++++++ mm/memfd.c | 5 +++++ 4 files changed, 16 insertions(+)
diff --git a/include/linux/lsm_hook_defs.h b/include/linux/lsm_hook_defs.h index ec119da1d89b..fd40840927c8 100644 --- a/include/linux/lsm_hook_defs.h +++ b/include/linux/lsm_hook_defs.h @@ -164,6 +164,7 @@ LSM_HOOK(int, 0, file_alloc_security, struct file *file) LSM_HOOK(void, LSM_RET_VOID, file_free_security, struct file *file) LSM_HOOK(int, 0, file_ioctl, struct file *file, unsigned int cmd, unsigned long arg) +LSM_HOOK(int, 0, memfd_create, char *name, unsigned int flags) LSM_HOOK(int, 0, mmap_addr, unsigned long addr) LSM_HOOK(int, 0, mmap_file, struct file *file, unsigned long reqprot, unsigned long prot, unsigned long flags) diff --git a/include/linux/lsm_hooks.h b/include/linux/lsm_hooks.h index 4ec80b96c22e..5a18a6552278 100644 --- a/include/linux/lsm_hooks.h +++ b/include/linux/lsm_hooks.h @@ -543,6 +543,10 @@
- simple integer value. When @arg represents a user space pointer, it
- should never be used by the security module.
- Return 0 if permission is granted.
- @memfd_create:
- @name is the name of memfd file.
- @flags is the flags used in memfd_create.
- Return 0 if permission is granted.
- @mmap_addr :
- Check permissions for a mmap operation at @addr.
- @addr contains virtual address that will be used for the operation.
diff --git a/include/linux/security.h b/include/linux/security.h index ca1b7109c0db..5b87a780822a 100644 --- a/include/linux/security.h +++ b/include/linux/security.h @@ -384,6 +384,7 @@ int security_file_permission(struct file *file, int mask); int security_file_alloc(struct file *file); void security_file_free(struct file *file); int security_file_ioctl(struct file *file, unsigned int cmd, unsigned long arg); +int security_memfd_create(char *name, unsigned int flags); int security_mmap_file(struct file *file, unsigned long prot, unsigned long flags); int security_mmap_addr(unsigned long addr); @@ -963,6 +964,11 @@ static inline int security_file_ioctl(struct file *file, unsigned int cmd, return 0; }
+static inline int security_memfd_create(char *name, unsigned int flags) +{
return 0;
+}
I think this is missing the security/security.c changes for the non-inline version?
Yes. I will add that in V4.
-Kees
static inline int security_mmap_file(struct file *file, unsigned long prot, unsigned long flags) { diff --git a/mm/memfd.c b/mm/memfd.c index 69e897dea6d5..96dcfbfed09e 100644 --- a/mm/memfd.c +++ b/mm/memfd.c @@ -346,6 +346,11 @@ SYSCALL_DEFINE2(memfd_create, goto err_name; }
/* security hook for memfd_create */
error = security_memfd_create(name, flags);
if (error)
return error;
if (flags & MFD_HUGETLB) { file = hugetlb_file_setup(name, 0, VM_NORESERVE, HUGETLB_ANONHUGE_INODE,
-- 2.39.0.rc0.267.gcb52ba06e7-goog
-- Kees Cook
From: Jeff Xu jeffxu@chromium.org
When apply F_SEAL_EXEC to an executable memfd, add write seals also to prevent modification of memfd.
Signed-off-by: Jeff Xu jeffxu@chromium.org --- mm/memfd.c | 3 +++ tools/testing/selftests/memfd/memfd_test.c | 25 ++++++++++++++++++++++ 2 files changed, 28 insertions(+)
diff --git a/mm/memfd.c b/mm/memfd.c index 96dcfbfed09e..3a04c0698957 100644 --- a/mm/memfd.c +++ b/mm/memfd.c @@ -222,6 +222,9 @@ static int memfd_add_seals(struct file *file, unsigned int seals) } }
+ if (seals & F_SEAL_EXEC && inode->i_mode & 0111) + seals |= F_ALL_SEALS; + *file_seals |= seals; error = 0;
diff --git a/tools/testing/selftests/memfd/memfd_test.c b/tools/testing/selftests/memfd/memfd_test.c index 775c9e6c061e..0731e5b3cdce 100644 --- a/tools/testing/selftests/memfd/memfd_test.c +++ b/tools/testing/selftests/memfd/memfd_test.c @@ -32,6 +32,13 @@ #define F_SEAL_EXEC 0x0020 #endif
+#define F_ALL_SEALS (F_SEAL_SEAL | \ + F_SEAL_SHRINK | \ + F_SEAL_GROW | \ + F_SEAL_WRITE | \ + F_SEAL_FUTURE_WRITE | \ + F_SEAL_EXEC) + #ifndef MAX_PATH #define MAX_PATH 256 #endif @@ -1006,6 +1013,7 @@ static void test_exec_seal(void)
printf("%s SEAL-EXEC\n", memfd_str);
+ printf("%s Apply SEAL_EXEC\n", memfd_str); fd = mfd_assert_new("kern_memfd_seal_exec", mfd_def_size, MFD_CLOEXEC | MFD_ALLOW_SEALING | MFD_EXEC); @@ -1024,7 +1032,24 @@ static void test_exec_seal(void) mfd_fail_chmod(fd, 0700); mfd_fail_chmod(fd, 0100); mfd_assert_chmod(fd, 0666); + mfd_assert_write(fd); + close(fd); + + printf("%s Apply ALL_SEALS\n", memfd_str); + fd = mfd_assert_new("kern_memfd_seal_exec", + mfd_def_size, + MFD_CLOEXEC | MFD_ALLOW_SEALING | MFD_EXEC);
+ mfd_assert_mode(fd, 0777); + mfd_assert_chmod(fd, 0700); + + mfd_assert_has_seals(fd, 0); + mfd_assert_add_seals(fd, F_SEAL_EXEC); + mfd_assert_has_seals(fd, F_ALL_SEALS); + + mfd_fail_chmod(fd, 0711); + mfd_fail_chmod(fd, 0600); + mfd_fail_write(fd); close(fd); }
On Thu, Dec 1, 2022 at 5:36 PM jeffxu@chromium.org wrote:
From: Jeff Xu jeffxu@chromium.org
When apply F_SEAL_EXEC to an executable memfd, add write seals also to prevent modification of memfd.
Signed-off-by: Jeff Xu jeffxu@chromium.org
mm/memfd.c | 3 +++ tools/testing/selftests/memfd/memfd_test.c | 25 ++++++++++++++++++++++ 2 files changed, 28 insertions(+)
diff --git a/mm/memfd.c b/mm/memfd.c index 96dcfbfed09e..3a04c0698957 100644 --- a/mm/memfd.c +++ b/mm/memfd.c @@ -222,6 +222,9 @@ static int memfd_add_seals(struct file *file, unsigned int seals) } }
if (seals & F_SEAL_EXEC && inode->i_mode & 0111)
seals |= F_ALL_SEALS;
*file_seals |= seals; error = 0;
Hi Jeff,
(Following up on some discussion on the original review, sorry for any duplicate comments.)
Making F_SEAL_EXEC imply all seals (including F_SEAL_SEAL) seems a bit confusing. This at least needs documentation to make it clear.
Rather than silently adding other seals, perhaps we could return an error if the caller requests F_SEAL_EXEC but not the write seals, so the other seals would have to be explicitly listed in the application code. This would have the same net effect without making the F_SEAL_EXEC operation too magical.
Additionally, if the goal is to enforce W^X, I don't think this completely closes the gap. There will always be a period where it is both writable and executable with this API:
1. memfd_create(MFD_EXEC). Can't use MFD_NOEXEC since that would seal chmod(+x), so the memfd is W + X here. 2. write() code to the memfd. 3. fcntl(F_ADD_SEALS, F_SEAL_EXEC) to convert the memfd to !W + X.
I think one of the attack vectors involved the attacker waiting for another process to create a memfd, pausing/delaying the victim process, overwriting the memfd with their own code, and calling exec() on it, which is still possible in the window between steps 1 and 3 with this design.
Thanks, -- Daniel
Hi Daniel
Thanks for your review.
On Fri, Dec 2, 2022 at 3:24 PM Daniel Verkamp dverkamp@chromium.org wrote:
On Thu, Dec 1, 2022 at 5:36 PM jeffxu@chromium.org wrote:
From: Jeff Xu jeffxu@chromium.org
When apply F_SEAL_EXEC to an executable memfd, add write seals also to prevent modification of memfd.
Signed-off-by: Jeff Xu jeffxu@chromium.org
mm/memfd.c | 3 +++ tools/testing/selftests/memfd/memfd_test.c | 25 ++++++++++++++++++++++ 2 files changed, 28 insertions(+)
diff --git a/mm/memfd.c b/mm/memfd.c index 96dcfbfed09e..3a04c0698957 100644 --- a/mm/memfd.c +++ b/mm/memfd.c @@ -222,6 +222,9 @@ static int memfd_add_seals(struct file *file, unsigned int seals) } }
if (seals & F_SEAL_EXEC && inode->i_mode & 0111)
seals |= F_ALL_SEALS;
*file_seals |= seals; error = 0;
Hi Jeff,
(Following up on some discussion on the original review, sorry for any duplicate comments.)
Making F_SEAL_EXEC imply all seals (including F_SEAL_SEAL) seems a bit confusing. This at least needs documentation to make it clear.
Rather than silently adding other seals, perhaps we could return an error if the caller requests F_SEAL_EXEC but not the write seals, so the other seals would have to be explicitly listed in the application code. This would have the same net effect without making the F_SEAL_EXEC operation too magical.
If we take error out approach, application need to add F_SEAL_SHRINK|F_SEAL_GROW|F_SEAL_WRITE|F_SEAL_FUTURE_WRITE when F_SEAL_EXEC is used. Personally I think it is a bit long. From an API point of view, we can think of this as sealing the whole executable instead of just "X" bit.
If there is a new type of write SEAL in future, all applications need to be updated, that is much harder, and updating the kernel is easier.
Maybe I should remove F_SEAL_SEAL, so this code is still correct if a new type of "Non-Write" seal is added in future.
Additionally, if the goal is to enforce W^X, I don't think this completely closes the gap. There will always be a period where it is both writable and executable with this API:
- memfd_create(MFD_EXEC). Can't use MFD_NOEXEC since that would seal
chmod(+x), so the memfd is W + X here. 2. write() code to the memfd. 3. fcntl(F_ADD_SEALS, F_SEAL_EXEC) to convert the memfd to !W + X.
I think one of the attack vectors involved the attacker waiting for another process to create a memfd, pausing/delaying the victim process, overwriting the memfd with their own code, and calling exec() on it, which is still possible in the window between steps 1 and 3 with this design.
There are also step 4. 4. call exec on the memfd, In confused deputy attack, attacker wants to inject content into memfd before step 4, because step 4 is by a privilege process, attackers can gain root escalation this way.
Ideally step 2 rewrites the whole memfd, (injecting content between 1 and 2 won't work), and step 3 is the next line after 2, making the process to stop exactly between 2 and 3 is not easy.
So enforcing W^X can reduce the attack surface. It also defines the most secure way for dev, or else, dev might: - forget to apply the W seal. - choose to apply X and W seal in multiple calls, thus adding a gap.
Thanks, -- Daniel
Thanks Jeff
On Fri, Dec 02, 2022 at 01:33:58AM +0000, jeffxu@chromium.org wrote:
From: Jeff Xu jeffxu@chromium.org
Hi! Thanks for this update! For future versions, please also use "-n" with "git format-patch" so the patches are numbered, otherwise it's more difficult to figure out what order they should be applied in, etc.
linux-kselftest-mirror@lists.linaro.org