From: Jeff Xu jeffxu@chromium.org
Hi,
This v2 series MFD_NOEXEC, this series includes: 1> address comments in V1 2> add sysctl (vm.mfd_noexec) to change the default file permissions of memfd_create to be non-executable.
Below are cover-level for v1:
The default file permissions on a memfd include execute bits, which means that such a memfd can be filled with a executable and passed to the exec() family of functions. This is undesirable on systems where all code is verified and all filesystems are intended to be mounted noexec, since an attacker may be able to use a memfd to load unverified code and execute it.
Additionally, execution via memfd is a common way to avoid scrutiny for malicious code, since it allows execution of a program without a file ever appearing on disk. This attack vector is not totally mitigated with this new flag, since the default memfd file permissions must remain executable to avoid breaking existing legitimate uses, but it should be possible to use other security mechanisms to prevent memfd_create calls without MFD_NOEXEC on systems where it is known that executable memfds are not necessary.
This patch series adds a new MFD_NOEXEC flag for memfd_create(), which allows creation of non-executable memfds, and as part of the implementation of this new flag, it also adds a new F_SEAL_EXEC seal, which will prevent modification of any of the execute bits of a sealed memfd.
I am not sure if this is the best way to implement the desired behavior (for example, the F_SEAL_EXEC seal is really more of an implementation detail and feels a bit clunky to expose), so suggestions are welcome for alternate approaches.
v1: https://lwn.net/Articles/890096/
Daniel Verkamp (4): mm/memfd: add F_SEAL_EXEC mm/memfd: add MFD_NOEXEC flag to memfd_create selftests/memfd: add tests for F_SEAL_EXEC selftests/memfd: add tests for MFD_NOEXEC
Jeff Xu (1): sysctl: add support for mfd_noexec
include/linux/mm.h | 4 + include/uapi/linux/fcntl.h | 1 + include/uapi/linux/memfd.h | 1 + kernel/sysctl.c | 9 ++ mm/memfd.c | 39 ++++- mm/shmem.c | 6 + tools/testing/selftests/memfd/memfd_test.c | 163 ++++++++++++++++++++- 7 files changed, 221 insertions(+), 2 deletions(-)
base-commit: 9e2f40233670c70c25e0681cb66d50d1e2742829
From: Daniel Verkamp dverkamp@chromium.org
The new F_SEAL_EXEC flag will prevent modification of the exec bits: written as traditional octal mask, 0111, or as named flags, S_IXUSR | S_IXGRP | S_IXOTH. Any chmod(2) or similar call that attempts to modify any of these bits after the seal is applied will fail with errno EPERM.
This will preserve the execute bits as they are at the time of sealing, so the memfd will become either permanently executable or permanently un-executable.
Co-developed-by: Jeff Xu jeffxu@chromium.org Signed-off-by: Jeff Xu jeffxu@chromium.org Signed-off-by: Daniel Verkamp dverkamp@chromium.org --- include/uapi/linux/fcntl.h | 1 + mm/memfd.c | 2 ++ mm/shmem.c | 6 ++++++ 3 files changed, 9 insertions(+)
diff --git a/include/uapi/linux/fcntl.h b/include/uapi/linux/fcntl.h index 2f86b2ad6d7e..a472ba69596c 100644 --- a/include/uapi/linux/fcntl.h +++ b/include/uapi/linux/fcntl.h @@ -43,6 +43,7 @@ #define F_SEAL_GROW 0x0004 /* prevent file from growing */ #define F_SEAL_WRITE 0x0008 /* prevent writes */ #define F_SEAL_FUTURE_WRITE 0x0010 /* prevent future writes while mapped */ +#define F_SEAL_EXEC 0x0020 /* prevent chmod modifying exec bits */ /* (1U << 31) is reserved for signed error codes */
/* diff --git a/mm/memfd.c b/mm/memfd.c index 08f5f8304746..4ebeab94aa74 100644 --- a/mm/memfd.c +++ b/mm/memfd.c @@ -147,6 +147,7 @@ static unsigned int *memfd_file_seals_ptr(struct file *file) }
#define F_ALL_SEALS (F_SEAL_SEAL | \ + F_SEAL_EXEC | \ F_SEAL_SHRINK | \ F_SEAL_GROW | \ F_SEAL_WRITE | \ @@ -175,6 +176,7 @@ static int memfd_add_seals(struct file *file, unsigned int seals) * SEAL_SHRINK: Prevent the file from shrinking * SEAL_GROW: Prevent the file from growing * SEAL_WRITE: Prevent write access to the file + * SEAL_EXEC: Prevent modification of the exec bits in the file mode * * As we don't require any trust relationship between two parties, we * must prevent seals from being removed. Therefore, sealing a file diff --git a/mm/shmem.c b/mm/shmem.c index e5e43b990fdc..bb530f562bdd 100644 --- a/mm/shmem.c +++ b/mm/shmem.c @@ -1082,6 +1082,12 @@ static int shmem_setattr(struct user_namespace *mnt_userns, if (error) return error;
+ if ((info->seals & F_SEAL_EXEC) && (attr->ia_valid & ATTR_MODE)) { + if ((inode->i_mode ^ attr->ia_mode) & 0111) { + return -EPERM; + } + } + if (S_ISREG(inode->i_mode) && (attr->ia_valid & ATTR_SIZE)) { loff_t oldsize = inode->i_size; loff_t newsize = attr->ia_size;
From: Daniel Verkamp dverkamp@chromium.org
The new MFD_NOEXEC flag allows the creation of a permanently non-executable memfd. This is accomplished by creating it with a different set of file mode bits (0666) than the default (0777) and applying the F_SEAL_EXEC seal at creation time, so there is no window between memfd creation and seal application.
Unfortunately, the default for memfd must remain executable, since changing this would be an API break, and some programs depend on being able to exec code from a memfd directly. However, this new flag will allow programs to create non-executable memfds, and a distribution may choose to enforce use of this flag in memfd_create calls via other security mechanisms.
Co-developed-by: Jeff Xu jeffxu@chromium.org Signed-off-by: Jeff Xu jeffxu@chromium.org Signed-off-by: Daniel Verkamp dverkamp@chromium.org --- include/uapi/linux/memfd.h | 1 + mm/memfd.c | 10 +++++++++- 2 files changed, 10 insertions(+), 1 deletion(-)
diff --git a/include/uapi/linux/memfd.h b/include/uapi/linux/memfd.h index 7a8a26751c23..140e125c9f65 100644 --- a/include/uapi/linux/memfd.h +++ b/include/uapi/linux/memfd.h @@ -8,6 +8,7 @@ #define MFD_CLOEXEC 0x0001U #define MFD_ALLOW_SEALING 0x0002U #define MFD_HUGETLB 0x0004U +#define MFD_NOEXEC 0x0008U
/* * Huge page size encoding when MFD_HUGETLB is specified, and a huge page diff --git a/mm/memfd.c b/mm/memfd.c index 4ebeab94aa74..b841514eb0fd 100644 --- a/mm/memfd.c +++ b/mm/memfd.c @@ -263,7 +263,7 @@ long memfd_fcntl(struct file *file, unsigned int cmd, unsigned long arg) #define MFD_NAME_PREFIX_LEN (sizeof(MFD_NAME_PREFIX) - 1) #define MFD_NAME_MAX_LEN (NAME_MAX - MFD_NAME_PREFIX_LEN)
-#define MFD_ALL_FLAGS (MFD_CLOEXEC | MFD_ALLOW_SEALING | MFD_HUGETLB) +#define MFD_ALL_FLAGS (MFD_CLOEXEC | MFD_ALLOW_SEALING | MFD_HUGETLB | MFD_NOEXEC)
SYSCALL_DEFINE2(memfd_create, const char __user *, uname, @@ -333,6 +333,14 @@ SYSCALL_DEFINE2(memfd_create, *file_seals &= ~F_SEAL_SEAL; }
+ if (flags & MFD_NOEXEC) { + struct inode *inode = file_inode(file); + + inode->i_mode &= ~0111; + file_seals = memfd_file_seals_ptr(file); + *file_seals |= F_SEAL_EXEC; + } + fd_install(fd, file); kfree(name); return fd;
From: Daniel Verkamp dverkamp@chromium.org
Basic tests to ensure that user/group/other execute bits cannot be changed after applying F_SEAL_EXEC to a memfd.
Co-developed-by: Jeff Xu jeffxu@chromium.org Signed-off-by: Jeff Xu jeffxu@chromium.org Signed-off-by: Daniel Verkamp dverkamp@chromium.org --- tools/testing/selftests/memfd/memfd_test.c | 129 ++++++++++++++++++++- 1 file changed, 128 insertions(+), 1 deletion(-)
diff --git a/tools/testing/selftests/memfd/memfd_test.c b/tools/testing/selftests/memfd/memfd_test.c index 94df2692e6e4..1d7e7b36bbdd 100644 --- a/tools/testing/selftests/memfd/memfd_test.c +++ b/tools/testing/selftests/memfd/memfd_test.c @@ -28,12 +28,44 @@ #define MFD_DEF_SIZE 8192 #define STACK_SIZE 65536
+#ifndef F_SEAL_EXEC +#define F_SEAL_EXEC 0x0020 +#endif + +#ifndef MAX_PATH +#define MAX_PATH 256 +#endif + /* * Default is not to test hugetlbfs */ static size_t mfd_def_size = MFD_DEF_SIZE; static const char *memfd_str = MEMFD_STR;
+static ssize_t fd2name(int fd, char *buf, size_t bufsize) +{ + char buf1[MAX_PATH]; + int size; + ssize_t nbytes; + + size = snprintf(buf1, MAX_PATH, "/proc/self/fd/%d", fd); + if (size < 0) { + printf("snprintf(%d) failed on %m\n", fd); + abort(); + } + + /* + * reserver one byte for string termination. + */ + nbytes = readlink(buf1, buf, bufsize-1); + if (nbytes == -1) { + printf("readlink(%s) failed %m\n", buf1); + abort(); + } + buf[nbytes] = '\0'; + return nbytes; +} + static int mfd_assert_new(const char *name, loff_t sz, unsigned int flags) { int r, fd; @@ -98,11 +130,14 @@ static unsigned int mfd_assert_get_seals(int fd)
static void mfd_assert_has_seals(int fd, unsigned int seals) { + char buf[MAX_PATH]; + int nbytes; unsigned int s; + fd2name(fd, buf, MAX_PATH);
s = mfd_assert_get_seals(fd); if (s != seals) { - printf("%u != %u = GET_SEALS(%d)\n", seals, s, fd); + printf("%u != %u = GET_SEALS(%s)\n", seals, s, buf); abort(); } } @@ -594,6 +629,64 @@ static void mfd_fail_grow_write(int fd) } }
+static void mfd_assert_mode(int fd, int mode) +{ + struct stat st; + char buf[MAX_PATH]; + int nbytes; + + fd2name(fd, buf, MAX_PATH); + + if (fstat(fd, &st) < 0) { + printf("fstat(%s) failed: %m\n", buf); + abort(); + } + + if ((st.st_mode & 07777) != mode) { + printf("fstat(%s) wrong file mode 0%04o, but expected 0%04o\n", + buf, (int)st.st_mode & 07777, mode); + abort(); + } +} + +static void mfd_assert_chmod(int fd, int mode) +{ + char buf[MAX_PATH]; + int nbytes; + + fd2name(fd, buf, MAX_PATH); + + if (fchmod(fd, mode) < 0) { + printf("fchmod(%s, 0%04o) failed: %m\n", buf, mode); + abort(); + } + + mfd_assert_mode(fd, mode); +} + +static void mfd_fail_chmod(int fd, int mode) +{ + struct stat st; + char buf[MAX_PATH]; + int nbytes; + + fd2name(fd, buf, MAX_PATH); + + if (fstat(fd, &st) < 0) { + printf("fstat(%s) failed: %m\n", buf); + abort(); + } + + if (fchmod(fd, mode) == 0) { + printf("fchmod(%s, 0%04o) didn't fail as expected\n", + buf, mode); + abort(); + } + + /* verify that file mode bits did not change */ + mfd_assert_mode(fd, st.st_mode & 07777); +} + static int idle_thread_fn(void *arg) { sigset_t set; @@ -880,6 +973,39 @@ static void test_seal_resize(void) close(fd); }
+/* + * Test SEAL_EXEC + * Test that chmod() cannot change x bits after sealing + */ +static void test_seal_exec(void) +{ + int fd; + + printf("%s SEAL-EXEC\n", memfd_str); + + fd = mfd_assert_new("kern_memfd_seal_exec", + mfd_def_size, + MFD_CLOEXEC | MFD_ALLOW_SEALING); + + mfd_assert_mode(fd, 0777); + + mfd_assert_chmod(fd, 0644); + + mfd_assert_has_seals(fd, 0); + mfd_assert_add_seals(fd, F_SEAL_EXEC); + mfd_assert_has_seals(fd, F_SEAL_EXEC); + + mfd_assert_chmod(fd, 0600); + mfd_fail_chmod(fd, 0777); + mfd_fail_chmod(fd, 0670); + mfd_fail_chmod(fd, 0605); + mfd_fail_chmod(fd, 0700); + mfd_fail_chmod(fd, 0100); + mfd_assert_chmod(fd, 0666); + + close(fd); +} + /* * Test sharing via dup() * Test that seals are shared between dupped FDs and they're all equal. @@ -1059,6 +1185,7 @@ int main(int argc, char **argv) test_seal_shrink(); test_seal_grow(); test_seal_resize(); + test_seal_exec();
test_share_dup("SHARE-DUP", ""); test_share_mmap("SHARE-MMAP", "");
From: Daniel Verkamp dverkamp@chromium.org
Tests that ensure MFD_NOEXEC memfds have the appropriate mode bits and cannot be chmod-ed into being executable.
Co-developed-by: Jeff Xu jeffxu@chromium.org Signed-off-by: Jeff Xu jeffxu@chromium.org Signed-off-by: Daniel Verkamp dverkamp@chromium.org --- tools/testing/selftests/memfd/memfd_test.c | 34 ++++++++++++++++++++++ 1 file changed, 34 insertions(+)
diff --git a/tools/testing/selftests/memfd/memfd_test.c b/tools/testing/selftests/memfd/memfd_test.c index 1d7e7b36bbdd..4906f778564e 100644 --- a/tools/testing/selftests/memfd/memfd_test.c +++ b/tools/testing/selftests/memfd/memfd_test.c @@ -36,6 +36,10 @@ #define MAX_PATH 256 #endif
+#ifndef MFD_NOEXEC +#define MFD_NOEXEC 0x0008U +#endif + /* * Default is not to test hugetlbfs */ @@ -1006,6 +1010,35 @@ static void test_seal_exec(void) close(fd); }
+/* + * Test memfd_create with MFD_NOEXEC flag + * Test that MFD_NOEXEC applies F_SEAL_EXEC and prevents change of exec bits + */ +static void test_noexec(void) +{ + int fd; + + printf("%s NOEXEC\n", memfd_str); + + /* Create with NOEXEC and ALLOW_SEALING */ + fd = mfd_assert_new("kern_memfd_noexec", + mfd_def_size, + MFD_CLOEXEC | MFD_ALLOW_SEALING | MFD_NOEXEC); + mfd_assert_mode(fd, 0666); + mfd_assert_has_seals(fd, F_SEAL_EXEC); + mfd_fail_chmod(fd, 0777); + close(fd); + + /* Create with NOEXEC but without ALLOW_SEALING */ + fd = mfd_assert_new("kern_memfd_noexec", + mfd_def_size, + MFD_CLOEXEC | MFD_NOEXEC); + mfd_assert_mode(fd, 0666); + mfd_assert_has_seals(fd, F_SEAL_EXEC | F_SEAL_SEAL); + mfd_fail_chmod(fd, 0777); + close(fd); +} + /* * Test sharing via dup() * Test that seals are shared between dupped FDs and they're all equal. @@ -1179,6 +1212,7 @@ int main(int argc, char **argv)
test_create(); test_basic(); + test_noexec();
test_seal_write(); test_seal_future_write();
From: Jeff Xu jeffxu@chromium.org
Add vm.mfd_noexec. When the value is 1 (enabled), memfd_create syscall will created non-executable memfd.
The default value is 0 (disabled), admin can change the setting from 0 => 1, however 1 => 0 is not allowed, unless reboot.
Signed-off-by: Jeff Xu jeffxu@chromium.org --- include/linux/mm.h | 4 ++++ kernel/sysctl.c | 9 +++++++++ mm/memfd.c | 27 +++++++++++++++++++++++++++ 3 files changed, 40 insertions(+)
diff --git a/include/linux/mm.h b/include/linux/mm.h index 7898e29bcfb5..1c66cf4aca11 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -203,6 +203,10 @@ extern int sysctl_overcommit_memory; extern int sysctl_overcommit_ratio; extern unsigned long sysctl_overcommit_kbytes;
+extern int sysctl_mfd_noexec_scope; +extern int mfd_noexec_dointvec_minmax(struct ctl_table *table, int write, + void *buffer, size_t *lenp, loff_t *ppos); + int overcommit_ratio_handler(struct ctl_table *, int, void *, size_t *, loff_t *); int overcommit_kbytes_handler(struct ctl_table *, int, void *, size_t *, diff --git a/kernel/sysctl.c b/kernel/sysctl.c index b233714a1c78..54510da007ff 100644 --- a/kernel/sysctl.c +++ b/kernel/sysctl.c @@ -2362,6 +2362,15 @@ static struct ctl_table vm_table[] = { .mode = 0644, .proc_handler = mmap_min_addr_handler, }, + { + .procname = "mfd_noexec", + .data = &sysctl_mfd_noexec_scope, + .maxlen = sizeof(sysctl_mfd_noexec_scope), + .mode = 0644, + .proc_handler = mfd_noexec_dointvec_minmax, + .extra1 = SYSCTL_ZERO, + .extra2 = SYSCTL_ONE, + }, #endif #ifdef CONFIG_NUMA { diff --git a/mm/memfd.c b/mm/memfd.c index b841514eb0fd..c6ccb8481ed2 100644 --- a/mm/memfd.c +++ b/mm/memfd.c @@ -20,6 +20,11 @@ #include <linux/memfd.h> #include <uapi/linux/memfd.h>
+#define MFD_NOEXEC_SCOPE_DISABLED 0 +#define MFD_NOEXEC_SCOPE_ENABLED 1 + +int sysctl_mfd_noexec_scope __read_mostly = MFD_NOEXEC_SCOPE_DISABLED; + /* * We need a tag: a new tag would expand every xa_node by 8 bytes, * so reuse a tag which we firmly believe is never set or cleared on tmpfs @@ -275,6 +280,10 @@ SYSCALL_DEFINE2(memfd_create, char *name; long len;
+ if (sysctl_mfd_noexec_scope == MFD_NOEXEC_SCOPE_ENABLED) { + flags |= MFD_NOEXEC; + } + if (!(flags & MFD_HUGETLB)) { if (flags & ~(unsigned int)MFD_ALL_FLAGS) return -EINVAL; @@ -351,3 +360,21 @@ SYSCALL_DEFINE2(memfd_create, kfree(name); return error; } + +#ifdef CONFIG_SYSCTL +int mfd_noexec_dointvec_minmax(struct ctl_table *table, int write, + void *buffer, size_t *lenp, loff_t *ppos) +{ + struct ctl_table table_copy; + + if (write && !capable(CAP_SYS_ADMIN)) + return -EPERM; + + /* Lock the max value if it ever gets set. */ + table_copy = *table; + if (*(int *)table_copy.data == *(int *)table_copy.extra2) + table_copy.extra1 = table_copy.extra2; + + return proc_dointvec_minmax(&table_copy, write, buffer, lenp, ppos); +} +#endif
Hi,
Thank you for the patch! Yet something to improve:
[auto build test ERROR on 9e2f40233670c70c25e0681cb66d50d1e2742829]
url: https://github.com/intel-lab-lkp/linux/commits/jeffxu-google-com/mm-memfd-MF... base: 9e2f40233670c70c25e0681cb66d50d1e2742829 config: csky-allnoconfig (https://download.01.org/0day-ci/archive/20220814/202208140256.CLDuIPhh-lkp@i...) compiler: csky-linux-gcc (GCC) 12.1.0 reproduce (this is a W=1 build): wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross chmod +x ~/bin/make.cross # https://github.com/intel-lab-lkp/linux/commit/e71897c778df5381c6d1ca858ae096... git remote add linux-review https://github.com/intel-lab-lkp/linux git fetch --no-tags linux-review jeffxu-google-com/mm-memfd-MFD_NOEXEC-for-memfd_create/20220806-062430 git checkout e71897c778df5381c6d1ca858ae096a557a08a2a # save the config file mkdir build_dir && cp config build_dir/.config COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-12.1.0 make.cross W=1 O=build_dir ARCH=csky SHELL=/bin/bash
If you fix the issue, kindly add following tag where applicable Reported-by: kernel test robot lkp@intel.com
All errors (new ones prefixed by >>):
csky-linux-ld: kernel/sysctl.o:(.data+0x398): undefined reference to `sysctl_mfd_noexec_scope' csky-linux-ld: kernel/sysctl.o:(.data+0x3a8): undefined reference to `mfd_noexec_dointvec_minmax'
Hi,
Thank you for the patch! Yet something to improve:
[auto build test ERROR on 9e2f40233670c70c25e0681cb66d50d1e2742829]
url: https://github.com/intel-lab-lkp/linux/commits/jeffxu-google-com/mm-memfd-MF... base: 9e2f40233670c70c25e0681cb66d50d1e2742829 config: openrisc-randconfig-r004-20220805 (https://download.01.org/0day-ci/archive/20220814/202208140205.5zR1h5VB-lkp@i...) compiler: or1k-linux-gcc (GCC) 12.1.0 reproduce (this is a W=1 build): wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross chmod +x ~/bin/make.cross # https://github.com/intel-lab-lkp/linux/commit/e71897c778df5381c6d1ca858ae096... git remote add linux-review https://github.com/intel-lab-lkp/linux git fetch --no-tags linux-review jeffxu-google-com/mm-memfd-MFD_NOEXEC-for-memfd_create/20220806-062430 git checkout e71897c778df5381c6d1ca858ae096a557a08a2a # save the config file mkdir build_dir && cp config build_dir/.config COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-12.1.0 make.cross W=1 O=build_dir ARCH=openrisc SHELL=/bin/bash
If you fix the issue, kindly add following tag where applicable Reported-by: kernel test robot lkp@intel.com
All errors (new ones prefixed by >>):
or1k-linux-ld: kernel/sysctl.o:(.data+0x440): undefined reference to `sysctl_mfd_noexec_scope' or1k-linux-ld: kernel/sysctl.o:(.data+0x450): undefined reference to `mfd_noexec_dointvec_minmax'
On Fri, Aug 05, 2022 at 10:21:21PM +0000, jeffxu@google.com wrote:
This v2 series MFD_NOEXEC, this series includes: 1> address comments in V1 2> add sysctl (vm.mfd_noexec) to change the default file permissions of memfd_create to be non-executable.
Below are cover-level for v1:
The default file permissions on a memfd include execute bits, which means that such a memfd can be filled with a executable and passed to the exec() family of functions. This is undesirable on systems where all code is verified and all filesystems are intended to be mounted noexec, since an attacker may be able to use a memfd to load unverified code and execute it.
I would absolutely like to see some kind of protection here. However, I'd like a more specific threat model. What are the cases where the X bit has been abused (e.g.[1])? What are the cases where the X bit is needed (e.g.[2])? With those in mind, it should be possible to draw a clear line between the two cases. (e.g. we need to avoid a confused deputy attack where an "unprivileged" user can pass an executable memfd to a "privileged" user. How those privileges are defined may matter a lot based on how memfds are being used. For example, can runc's use of executable memfds be distinguished from an attacker's?)
Additionally, execution via memfd is a common way to avoid scrutiny for malicious code, since it allows execution of a program without a file ever appearing on disk. This attack vector is not totally mitigated with this new flag, since the default memfd file permissions must remain executable to avoid breaking existing legitimate uses, but it should be possible to use other security mechanisms to prevent memfd_create calls without MFD_NOEXEC on systems where it is known that executable memfds are not necessary.
This reminds me of dealing with non-executable stacks. There ended up being three states:
- requested to be executable (PT_GNU_STACK X) - requested to be non-executable (PT_GNU_STACK NX) - undefined (no PT_GNU_STACK)
The first two are clearly defined, but the third needed a lot of special handling. For a "safe by default" world, the third should be "NX", but old stuff depended on it being "X".
Here, we have a bit being present or not, so we only have a binary state. I'd much rather the default be NX (no bit set) instead of making every future (safe) user of memfd have to specify MFD_NOEXEC.
It's also easier on a filtering side to say "disallow memfd_create with MFD_EXEC", but how do we deal with the older software?
If the default perms of memfd_create()'s exec bit is controlled by a sysctl and the sysctl is set to "leave it executable", how does a user create an NX memfd? (i.e. setting MFD_EXEC means "exec" and not setting it means "exec" also.) Are two bits needed? Seems wasteful. MFD_I_KNOW_HOW_TO_SET_EXEC | MFD_EXEC, etc...
For F_SEAL_EXEC, it seems this should imply F_SEAL_WRITE if forced executable to avoid WX mappings (i.e. provide W^X from the start).
-Kees
[1] https://bugs.chromium.org/p/chromium/issues/list?q=type%3Dbug-security%20mem... [2] https://lwn.net/Articles/781013/
Hi Kees
Sorry for the long overdue reply.
Those questions are really helpful to understand the usage of memfd_create, I will try to answer them, please see below inline.
On Mon, Aug 8, 2022 at 10:46 AM Kees Cook keescook@chromium.org wrote:
On Fri, Aug 05, 2022 at 10:21:21PM +0000, jeffxu@google.com wrote:
This v2 series MFD_NOEXEC, this series includes: 1> address comments in V1 2> add sysctl (vm.mfd_noexec) to change the default file permissions of memfd_create to be non-executable.
Below are cover-level for v1:
The default file permissions on a memfd include execute bits, which means that such a memfd can be filled with a executable and passed to the exec() family of functions. This is undesirable on systems where all code is verified and all filesystems are intended to be mounted noexec, since an attacker may be able to use a memfd to load unverified code and execute it.
I would absolutely like to see some kind of protection here. However, I'd like a more specific threat model. What are the cases where the X bit has been abused (e.g.[1])? What are the cases where the X bit is needed (e.g.[2])? With those in mind, it should be possible to draw a clear line between the two cases. (e.g. we need to avoid a confused deputy attack where an "unprivileged" user can pass an executable memfd to a "privileged" user. How those privileges are defined may matter a lot based on how memfds are being used. For example, can runc's use of executable memfds be distinguished from an attacker's?)
runc needs memfd to be executable, so the host with runc need to be able to create both non-executable memfd and executable memfd. memfd_create API itself can't enforce the security of how it is being used.
Additionally, execution via memfd is a common way to avoid scrutiny for malicious code, since it allows execution of a program without a file ever appearing on disk. This attack vector is not totally mitigated with this new flag, since the default memfd file permissions must remain executable to avoid breaking existing legitimate uses, but it should be possible to use other security mechanisms to prevent memfd_create calls without MFD_NOEXEC on systems where it is known that executable memfds are not necessary.
This reminds me of dealing with non-executable stacks. There ended up being three states:
- requested to be executable (PT_GNU_STACK X)
- requested to be non-executable (PT_GNU_STACK NX)
- undefined (no PT_GNU_STACK)
The first two are clearly defined, but the third needed a lot of special handling. For a "safe by default" world, the third should be "NX", but old stuff depended on it being "X".
Here, we have a bit being present or not, so we only have a binary state. I'd much rather the default be NX (no bit set) instead of making every future (safe) user of memfd have to specify MFD_NOEXEC.
It's also easier on a filtering side to say "disallow memfd_create with MFD_EXEC", but how do we deal with the older software?
If the default perms of memfd_create()'s exec bit is controlled by a sysctl and the sysctl is set to "leave it executable", how does a user create an NX memfd? (i.e. setting MFD_EXEC means "exec" and not setting it means "exec" also.) Are two bits needed? Seems wasteful. MFD_I_KNOW_HOW_TO_SET_EXEC | MFD_EXEC, etc...
Great points, with those questions and usages in mind, I m thinking below:
1> memfd_create: Add two flags: #define MFD_EXEC 0x0008 #define MFD_NOEXEC _SEAL 0x0010 This lets application to set executable bit explicitly. (If application set both, it will be rejected)
2> For old application that doesn't set executable bit: Add a pid name-spaced sysctl.kernel.pid_mfd_noexec, with: value = 0: Default_EXEC Honor MFD_EXEC and MFD_NOEXEC_SEAL When none is set, will fall back to original behavior (EXEC) value = 1: Default_NOEXEC_SEAL Honor MFD_EXEC and MFD_NOEXEC_SEAL When none is set, will default to MFD_NOEXEC_SEAL
3> Add a pid name-spaced sysctl kernel.pid_mfd_noexec_enforced: with: value = 0: default, not enforced. value = 1: enforce NOEXEC_SEAL (overwrite everything)
Then we can use and secure memfd at host and container as below: At host level: Case A> In secure by default system where doesn't allow executable memfd: sysctl.kernel.pid_mfd_noexec_enforced = 1 LSM to block creation of executable memfd system wide. This requires a new hook: secure_memfd_create
Case B> In system that need both (runc case), use sysctl kernel.pid_mfd_noexec = 0/1 during converting application to new API. SELINUX or landlock to sandbox the process.(requires work).
At container level: It would be nice for container to control creation of executable memfd too. This is through sysctl kernel.pid_mfd_noexec_enforced This lets runc to create two type of contains: one with ability to create executable memfd, one without.
The sysctl.kernel.pid_mfd_noexec sets the default value, it is helpful during applications are being migrated to set the executable bit. Alternatively, we can have a new syscall: memfd_create2, where it is mandatary to set executable bit (or default to NOEXEC_SEAL), then sysctl.kernel.pid_mfd_noexec is not needed.
For F_SEAL_EXEC, it seems this should imply F_SEAL_WRITE if forced executable to avoid WX mappings (i.e. provide W^X from the start).
Yes. I agree.
Thanks! Best regards, Jeff Xu
-Kees
[1] https://bugs.chromium.org/p/chromium/issues/list?q=type%3Dbug-security%20mem... [2] https://lwn.net/Articles/781013/
-- Kees Cook
On Tue, Nov 01, 2022 at 04:14:39PM -0700, Jeff Xu wrote:
Sorry for the long overdue reply.
No worries! I am a fan of thread necromancy. :)
[...] 1> memfd_create: Add two flags: #define MFD_EXEC 0x0008 #define MFD_NOEXEC_SEAL 0x0010 This lets application to set executable bit explicitly. (If application set both, it will be rejected)
So no MFD_NOEXEC without seal? (I'm fine with that.)
2> For old application that doesn't set executable bit: Add a pid name-spaced sysctl.kernel.pid_mfd_noexec, with:
bikeshed: vm.memfd_noexec (doesn't belong in "kernel", and seems better suited to "vm" than "fs")
value = 0: Default_EXEC Honor MFD_EXEC and MFD_NOEXEC_SEAL When none is set, will fall back to original behavior (EXEC)
Yeah. Rephrasing for myself to understand more clearly:
"memfd_create() without MFD_EXEC nor MFD_NOEXEC_SEAL act like MFD_EXEC was set."
value = 1: Default_NOEXEC_SEAL Honor MFD_EXEC and MFD_NOEXEC_SEAL When none is set, will default to MFD_NOEXEC_SEAL
"memfd_create() without MFD_EXEC nor MFD_NOEXEC_SEAL act like MFD_NOEXEC_SEAL was set."
Also, I think there should be a pr_warn_ratelimited() when memfd_create() is used without either bit, so that there is some pressure to please adjust their API calls to explicitly set a bit.
3> Add a pid name-spaced sysctl kernel.pid_mfd_noexec_enforced: with: value = 0: default, not enforced. value = 1: enforce NOEXEC_SEAL (overwrite everything)
How about making this just mode "value 2" for the first sysctl? "memfd_create() without MFD_NOEXEC_SEAL will be rejected."
-Kees
On Tue, Nov 1, 2022 at 7:45 PM Kees Cook keescook@chromium.org wrote:
On Tue, Nov 01, 2022 at 04:14:39PM -0700, Jeff Xu wrote:
Sorry for the long overdue reply.
No worries! I am a fan of thread necromancy. :)
[...] 1> memfd_create: Add two flags: #define MFD_EXEC 0x0008 #define MFD_NOEXEC_SEAL 0x0010 This lets application to set executable bit explicitly. (If application set both, it will be rejected)
So no MFD_NOEXEC without seal? (I'm fine with that.)
no MFD_NOEXEC because memfd can be chmod to add x after creation, it is not secure.
no MFD_EXEC_SEAL because it is better to apply both w and x seal within the same function call, and w seal can't be applied at creation time.
2> For old application that doesn't set executable bit: Add a pid name-spaced sysctl.kernel.pid_mfd_noexec, with:
bikeshed: vm.memfd_noexec (doesn't belong in "kernel", and seems better suited to "vm" than "fs")
SG, will use vm.memfd_noexec
value = 0: Default_EXEC Honor MFD_EXEC and MFD_NOEXEC_SEAL When none is set, will fall back to original behavior (EXEC)
Yeah. Rephrasing for myself to understand more clearly:
"memfd_create() without MFD_EXEC nor MFD_NOEXEC_SEAL act like MFD_EXEC was set."
value = 1: Default_NOEXEC_SEAL Honor MFD_EXEC and MFD_NOEXEC_SEAL When none is set, will default to MFD_NOEXEC_SEAL
"memfd_create() without MFD_EXEC nor MFD_NOEXEC_SEAL act like MFD_NOEXEC_SEAL was set."
Copy, this is clearer. Thanks.
Also, I think there should be a pr_warn_ratelimited() when memfd_create() is used without either bit, so that there is some pressure to please adjust their API calls to explicitly set a bit.
Sure
3> Add a pid name-spaced sysctl kernel.pid_mfd_noexec_enforced: with: value = 0: default, not enforced. value = 1: enforce NOEXEC_SEAL (overwrite everything)
How about making this just mode "value 2" for the first sysctl? "memfd_create() without MFD_NOEXEC_SEAL will be rejected."
Good point. Kernel overwriting might not be a good practice. I will add to vm.mfd_noexec. value = 2: "memfd_create() without MFD_NOEXEC_SEAL will be rejected."
Thanks! Jeff
-Kees
-- Kees Cook
linux-kselftest-mirror@lists.linaro.org