From: Jeff Xu jeffxu@chromium.org
When MFD_NOEXEC_SEAL was introduced, there was one big mistake: it didn't have proper documentation. This led to a lot of confusion, especially about whether or not memfd created with the MFD_NOEXEC_SEAL flag is sealable. Before MFD_NOEXEC_SEAL, memfd had to explicitly set MFD_ALLOW_SEALING to be sealable, so it's a fair question.
As one might have noticed, unlike other flags in memfd_create, MFD_NOEXEC_SEAL is actually a combination of multiple flags. The idea is to make it easier to use memfd in the most common way, which is NOEXEC + F_SEAL_EXEC + MFD_ALLOW_SEALING. This works together with the vm.memfd_noexec sysctl to help existing applications move to a more secure way of using memfd.
Proposals have been made to make MFD_NOEXEC_SEAL non-sealable unless MFD_ALLOW_SEALING is set, for consistency with other flags [1] [2]. Those are based on the viewpoint that each flag is an atomic unit, which is a reasonable assumption. However, MFD_NOEXEC_SEAL was designed to promote the most secure way of using memfd, and therefore combines multiple functionalities into one bit.
Furthermore, MFD_NOEXEC_SEAL was added more than a year ago, and multiple applications and distributions have backported and used it. Altering the ABI now carries a degree of risk and may lead to disruption.
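To make the combination concrete, here is a minimal C sketch, assuming a Linux system. `probe_noexec_seal` is a hypothetical helper name, the raw syscall is used in the style of the kernel selftests, and the fallback constants are assumed to match the Linux UAPI headers for libc versions that do not define them yet. It creates a memfd with MFD_NOEXEC_SEAL alone and checks the three properties listed above.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Fallbacks for older libc headers (values as in the Linux UAPI headers). */
#ifndef MFD_NOEXEC_SEAL
#define MFD_NOEXEC_SEAL 0x0008U
#endif
#ifndef F_ADD_SEALS
#define F_ADD_SEALS 1033
#define F_GET_SEALS 1034
#define F_SEAL_SEAL 0x0001
#define F_SEAL_GROW 0x0004
#endif
#ifndef F_SEAL_EXEC
#define F_SEAL_EXEC 0x0020
#endif

/*
 * Hypothetical helper. Returns 1 if all three properties hold, 0 if an
 * MFD_NOEXEC_SEAL memfd cannot be created here (for example EINVAL on a
 * kernel without the flag), and -1 if the memfd exists but does not
 * behave as described.
 */
int probe_noexec_seal(void)
{
	struct stat st;
	int seals, ok;
	int fd = (int)syscall(SYS_memfd_create, "probe", MFD_NOEXEC_SEAL);

	if (fd < 0)
		return 0;

	seals = fcntl(fd, F_GET_SEALS);
	ok = fstat(fd, &st) == 0 &&
	     (st.st_mode & 0111) == 0 &&		/* NOEXEC: no x bits */
	     seals >= 0 && (seals & F_SEAL_EXEC) &&	/* x bits are frozen */
	     !(seals & F_SEAL_SEAL) &&			/* sealing still allowed */
	     fcntl(fd, F_ADD_SEALS, F_SEAL_GROW) == 0;	/* and a new seal applies */

	close(fd);
	return ok ? 1 : -1;
}
```

Note that MFD_ALLOW_SEALING is never passed; the last check is what shows the memfd is sealable anyway.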
MFD_NOEXEC_SEAL is a new flag, and applications must change their code to use it. There is no backward compatibility problem.
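As an illustration of what "change their code" can look like, here is a hedged C sketch of one possible migration pattern: try the new flag first and retry without it when the kernel rejects it. `memfd_create_noexec_compat` is a hypothetical name, and the EINVAL handling assumes that kernels without the flag reject unknown flag bits with EINVAL.

```c
#include <errno.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Fallback for older libc headers (value as in the Linux UAPI headers). */
#ifndef MFD_NOEXEC_SEAL
#define MFD_NOEXEC_SEAL 0x0008U
#endif

/*
 * Hypothetical wrapper: request a non-executable, exec-sealed memfd and
 * fall back to the caller's original flags on kernels that predate
 * MFD_NOEXEC_SEAL. Returns a file descriptor, or -1 with errno set.
 */
int memfd_create_noexec_compat(const char *name, unsigned int flags)
{
	int fd = (int)syscall(SYS_memfd_create, name, flags | MFD_NOEXEC_SEAL);

	/* Assumption: an unknown flag bit is reported as EINVAL. */
	if (fd < 0 && errno == EINVAL)
		fd = (int)syscall(SYS_memfd_create, name, flags);

	return fd;
}
```

On the fallback path the memfd is executable, and it is sealable only if the caller passed MFD_ALLOW_SEALING, so callers that depend on sealing would need to check for that case.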
When sysctl vm.memfd_noexec is 1 or 2, applications that set neither MFD_NOEXEC_SEAL nor MFD_EXEC will get an MFD_NOEXEC_SEAL memfd. An old application might break; that is by design, and on such a system vm.memfd_noexec = 0 should be used. This is also not a backward compatibility problem.
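The interaction between the sysctl (named vm.memfd_noexec in the documentation patch) and the two flags can be written down as a small userspace model. This is only a sketch of the behaviour described here, not kernel code: `model_memfd_flags` is a hypothetical name, and the EACCES and EINVAL return values are assumptions about the kernel's error codes.

```c
#include <errno.h>

/* Fallbacks for older libc headers (values as in the Linux UAPI headers). */
#ifndef MFD_NOEXEC_SEAL
#define MFD_NOEXEC_SEAL 0x0008U
#endif
#ifndef MFD_EXEC
#define MFD_EXEC 0x0010U
#endif

/*
 * Model of how the sysctl value (0, 1 or 2) affects memfd_create() flags.
 * Returns the effective flags, or a negative errno value on rejection.
 */
long model_memfd_flags(int scope, unsigned int flags)
{
	/* Assumption: asking for both flags at once is invalid. */
	if ((flags & MFD_EXEC) && (flags & MFD_NOEXEC_SEAL))
		return -EINVAL;

	/* Old applications set neither flag; the sysctl picks the default. */
	if (!(flags & (MFD_EXEC | MFD_NOEXEC_SEAL)))
		flags |= (scope >= 1) ? MFD_NOEXEC_SEAL : MFD_EXEC;

	/* Value 2 rejects anything that still ends up executable. */
	if (!(flags & MFD_NOEXEC_SEAL) && scope >= 2)
		return -EACCES;

	return (long)flags;
}
```

Under this model an old application that passes neither flag is never rejected; with the value 2, only an explicit MFD_EXEC request fails.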
I propose to include this documentation patch to assist in clarifying the semantics of MFD_NOEXEC_SEAL, thereby preventing any potential future confusion.
This patch supersedes the previous patch, which tried a different direction [3]. Please remove [2] from the mm-unstable branch when applying this patch.
Finally, I would like to express my gratitude to David Rheinsberg and Barnabás Pőcze for initiating the discussion on the topic of sealability.
[1] https://lore.kernel.org/lkml/20230714114753.170814-1-david@readahead.eu/
[2] https://lore.kernel.org/lkml/20240513191544.94754-1-pobrn@protonmail.com/
[3] https://lore.kernel.org/lkml/20240524033933.135049-1-jeffxu@google.com/
Jeff Xu (1):
  mm/memfd: add documentation for MFD_NOEXEC_SEAL MFD_EXEC

 Documentation/userspace-api/index.rst      |  1 +
 Documentation/userspace-api/mfd_noexec.rst | 86 ++++++++++++++++++++++
 2 files changed, 87 insertions(+)
 create mode 100644 Documentation/userspace-api/mfd_noexec.rst
From: Jeff Xu jeffxu@chromium.org
Add documentation for memfd_create flags: FMD_NOEXEC_SEAL and MFD_EXEC
Signed-off-by: Jeff Xu jeffxu@chromium.org
---
 Documentation/userspace-api/index.rst      |  1 +
 Documentation/userspace-api/mfd_noexec.rst | 86 ++++++++++++++++++++++
 2 files changed, 87 insertions(+)
 create mode 100644 Documentation/userspace-api/mfd_noexec.rst

diff --git a/Documentation/userspace-api/index.rst b/Documentation/userspace-api/index.rst
index 5926115ec0ed..8a251d71fa6e 100644
--- a/Documentation/userspace-api/index.rst
+++ b/Documentation/userspace-api/index.rst
@@ -32,6 +32,7 @@ Security-related interfaces
    seccomp_filter
    landlock
    lsm
+   mfd_noexec
    spec_ctrl
    tee
diff --git a/Documentation/userspace-api/mfd_noexec.rst b/Documentation/userspace-api/mfd_noexec.rst
new file mode 100644
index 000000000000..0d2c840f37e1
--- /dev/null
+++ b/Documentation/userspace-api/mfd_noexec.rst
@@ -0,0 +1,86 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+==================================
+Introduction of non executable mfd
+==================================
+:Author:
+    Daniel Verkamp dverkamp@chromium.org
+    Jeff Xu jeffxu@chromium.org
+
+:Contributor:
+    Aleksa Sarai cyphar@cyphar.com
+
+Since Linux introduced the memfd feature, memfd have always had their
+execute bit set, and the memfd_create() syscall doesn't allow setting
+it differently.
+
+However, in a secure by default system, such as ChromeOS, (where all
+executables should come from the rootfs, which is protected by Verified
+boot), this executable nature of memfd opens a door for NoExec bypass
+and enables “confused deputy attack”. E.g, in VRP bug [1]: cros_vm
+process created a memfd to share the content with an external process,
+however the memfd is overwritten and used for executing arbitrary code
+and root escalation. [2] lists more VRP in this kind.
+
+On the other hand, executable memfd has its legit use, runc uses memfd’s
+seal and executable feature to copy the contents of the binary then
+execute them, for such system, we need a solution to differentiate runc's
+use of executable memfds and an attacker's [3].
+
+To address those above.
+ - Let memfd_create() set X bit at creation time.
+ - Let memfd be sealed for modifying X bit when NX is set.
+ - A new pid namespace sysctl: vm.memfd_noexec to help applications to
+   migrating and enforcing non-executable MFD.
+
+User API
+========
+``int memfd_create(const char *name, unsigned int flags)``
+
+``MFD_NOEXEC_SEAL``
+	When MFD_NOEXEC_SEAL bit is set in the ``flags``, memfd is created
+	with NX. F_SEAL_EXEC is set and the memfd can't be modified to
+	add X later. MFD_ALLOW_SEALING is also implied.
+	This is the most common case for the application to use memfd.
+
+``MFD_EXEC``
+	When MFD_EXEC bit is set in the ``flags``, memfd is created with X.
+
+Note:
+	``MFD_NOEXEC_SEAL`` implies ``MFD_ALLOW_SEALING``. In case that
+	app doesn't want sealing, it can add F_SEAL_SEAL after creation.
+
+
+Sysctl:
+========
+``pid namespaced sysctl vm.memfd_noexec``
+
+The new pid namespaced sysctl vm.memfd_noexec has 3 values:
+
+ - 0: MEMFD_NOEXEC_SCOPE_EXEC
+	memfd_create() without MFD_EXEC nor MFD_NOEXEC_SEAL acts like
+	MFD_EXEC was set.
+
+ - 1: MEMFD_NOEXEC_SCOPE_NOEXEC_SEAL
+	memfd_create() without MFD_EXEC nor MFD_NOEXEC_SEAL acts like
+	MFD_NOEXEC_SEAL was set.
+
+ - 2: MEMFD_NOEXEC_SCOPE_NOEXEC_ENFORCED
+	memfd_create() without MFD_NOEXEC_SEAL will be rejected.
+
+The sysctl allows finer control of memfd_create for old-software that
+doesn't set the executable bit, for example, a container with
+vm.memfd_noexec=1 means the old-software will create non-executable memfd
+by default while new-software can create executable memfd by setting
+MFD_EXEC.
+
+The value of vm.memfd_noexec is passed to child namespace at creation
+time, in addition, the setting is hierarchical, i.e. during memfd_create,
+we will search from current ns to root ns and use the most restrictive
+setting.
+
+[1] https://crbug.com/1305267
+
+[2] https://bugs.chromium.org/p/chromium/issues/list?q=type%3Dbug-security%20mem...
+
+[3] https://lwn.net/Articles/781013/
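The hierarchical lookup described at the end of the patch (walk from the current namespace to the root and keep the most restrictive value) can be sketched as a small userspace model in C. `struct model_pidns` and `model_effective_scope` are hypothetical names used only for illustration; this is not the kernel's data structure.

```c
#include <stddef.h>

/* Hypothetical stand-in for a pid namespace and its sysctl setting. */
struct model_pidns {
	int memfd_noexec_scope;			/* 0, 1 or 2 */
	const struct model_pidns *parent;	/* NULL for the root namespace */
};

/*
 * Walk from the given namespace up to the root and return the highest,
 * i.e. most restrictive, vm.memfd_noexec value found on the way.
 */
int model_effective_scope(const struct model_pidns *ns)
{
	int scope = 0;

	for (; ns != NULL; ns = ns->parent)
		if (ns->memfd_noexec_scope > scope)
			scope = ns->memfd_noexec_scope;

	return scope;
}
```

In this model a child namespace set to 0 under a parent set to 1 still resolves to 1, so a child cannot relax what an ancestor enforces.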
Hi--
On 6/7/24 1:35 PM, jeffxu@chromium.org wrote:
> From: Jeff Xu jeffxu@chromium.org
>
> Add documentation for memfd_create flags: FMD_NOEXEC_SEAL

s/FMD/MFD/

> and MFD_EXEC
>
> Signed-off-by: Jeff Xu jeffxu@chromium.org
> ---
>  Documentation/userspace-api/index.rst      |  1 +
>  Documentation/userspace-api/mfd_noexec.rst | 86 ++++++++++++++++++++++
>  2 files changed, 87 insertions(+)
>  create mode 100644 Documentation/userspace-api/mfd_noexec.rst
>
> diff --git a/Documentation/userspace-api/index.rst b/Documentation/userspace-api/index.rst
> index 5926115ec0ed..8a251d71fa6e 100644
> --- a/Documentation/userspace-api/index.rst
> +++ b/Documentation/userspace-api/index.rst
> @@ -32,6 +32,7 @@ Security-related interfaces
>     seccomp_filter
>     landlock
>     lsm
> +   mfd_noexec
>     spec_ctrl
>     tee
> diff --git a/Documentation/userspace-api/mfd_noexec.rst b/Documentation/userspace-api/mfd_noexec.rst
> new file mode 100644
> index 000000000000..0d2c840f37e1
> --- /dev/null
> +++ b/Documentation/userspace-api/mfd_noexec.rst
> @@ -0,0 +1,86 @@
> +.. SPDX-License-Identifier: GPL-2.0
> +
> +==================================
> +Introduction of non executable mfd

                   non-executable mfd

> +==================================
> +:Author:
> +    Daniel Verkamp dverkamp@chromium.org
> +    Jeff Xu jeffxu@chromium.org
> +
> +:Contributor:
> +    Aleksa Sarai cyphar@cyphar.com
> +
> +Since Linux introduced the memfd feature, memfd have always had their

                                             memfds
i.e., plural

> +execute bit set, and the memfd_create() syscall doesn't allow setting
> +it differently.
> +
> +However, in a secure by default system, such as ChromeOS, (where all

                 secure-by-default

> +executables should come from the rootfs, which is protected by Verified
> +boot), this executable nature of memfd opens a door for NoExec bypass
> +and enables “confused deputy attack”. E.g, in VRP bug [1]: cros_vm
> +process created a memfd to share the content with an external process,
> +however the memfd is overwritten and used for executing arbitrary code
> +and root escalation. [2] lists more VRP in this kind.

                                          of this kind.

> +
> +On the other hand, executable memfd has its legit use, runc uses memfd’s

                                                     use:

> +seal and executable feature to copy the contents of the binary then
> +execute them, for such system, we need a solution to differentiate runc's

           them. For such a system,

> +use of executable memfds and an attacker's [3].
> +
> +To address those above.

                    above:

> + - Let memfd_create() set X bit at creation time.
> + - Let memfd be sealed for modifying X bit when NX is set.
> + - A new pid namespace sysctl: vm.memfd_noexec to help applications to

      Add a new                                           applications in

> +   migrating and enforcing non-executable MFD.
> +
> +User API
> +========
> +``int memfd_create(const char *name, unsigned int flags)``
> +
> +``MFD_NOEXEC_SEAL``
> +	When MFD_NOEXEC_SEAL bit is set in the ``flags``, memfd is created
> +	with NX. F_SEAL_EXEC is set and the memfd can't be modified to
> +	add X later. MFD_ALLOW_SEALING is also implied.
> +	This is the most common case for the application to use memfd.
> +
> +``MFD_EXEC``
> +	When MFD_EXEC bit is set in the ``flags``, memfd is created with X.
> +
> +Note:
> +	``MFD_NOEXEC_SEAL`` implies ``MFD_ALLOW_SEALING``. In case that
> +	app doesn't want sealing, it can add F_SEAL_SEAL after creation.

	an app

> +
> +
> +Sysctl:
> +========
> +``pid namespaced sysctl vm.memfd_noexec``
> +
> +The new pid namespaced sysctl vm.memfd_noexec has 3 values:
> +
> + - 0: MEMFD_NOEXEC_SCOPE_EXEC
> +	memfd_create() without MFD_EXEC nor MFD_NOEXEC_SEAL acts like
> +	MFD_EXEC was set.
> +
> + - 1: MEMFD_NOEXEC_SCOPE_NOEXEC_SEAL
> +	memfd_create() without MFD_EXEC nor MFD_NOEXEC_SEAL acts like
> +	MFD_NOEXEC_SEAL was set.
> +
> + - 2: MEMFD_NOEXEC_SCOPE_NOEXEC_ENFORCED
> +	memfd_create() without MFD_NOEXEC_SEAL will be rejected.
> +
> +The sysctl allows finer control of memfd_create for old-software that

                                                      old software

> +doesn't set the executable bit, for example, a container with

                               bit;

> +vm.memfd_noexec=1 means the old-software will create non-executable memfd

                              old software

> +by default while new-software can create executable memfd by setting

                   new software

> +MFD_EXEC.
> +
> +The value of vm.memfd_noexec is passed to child namespace at creation
> +time, in addition, the setting is hierarchical, i.e. during memfd_create,

   time. In addition,

> +we will search from current ns to root ns and use the most restrictive
> +setting.
> +
> +[1] https://crbug.com/1305267
> +
> +[2] https://bugs.chromium.org/p/chromium/issues/list?q=type%3Dbug-security%20mem...
Hi
On Mon, Jun 10, 2024 at 7:20 PM Randy Dunlap rdunlap@infradead.org wrote:
Updated in V2. Thanks! -Jeff
Hi
On Friday, June 7, 2024 at 22:35, jeffxu@chromium.org wrote:
> Furthermore, the MFD_NOEXEC_SEAL has been added for more than one year, and multiple applications and distributions have backported and utilized it. Altering ABI now presents a degree of risk and may lead to disruption.
I feel compelled to mention again that based on my investigation the risk is minimal. Not to mention that it can easily be reverted if need be.
In my view, it is better to fix the inconsistency than to document it. I would argue that "`MFD_ALLOW_SEALING` is needed to enable sealing except that XYZ" is unintuitive and confusing for a not insignificant number of people.
In conclusion, I think it would be unfortunate if the inconsistency was not fixed and the problem was considered "solved" by a passing mention in the documentation.
Regards, Barnabás Pőcze
Resent (the previous email was not plain text).
Hi
On Fri, Jun 7, 2024 at 2:41 PM Barnabás Pőcze pobrn@protonmail.com wrote:
> > Furthermore, the MFD_NOEXEC_SEAL has been added for more than one year, and multiple applications and distributions have backported and utilized it. Altering ABI now presents a degree of risk and may lead to disruption.
>
> I feel compelled to mention again that based on my investigation the risk is minimal. Not to mention that it can easily be reverted if need be.
The risk is not zero. If we changed the ABI, the change would be propagated to earlier stable kernel versions. Various Linux distributions have also backported the patch to earlier kernels such as 5.4. If the change then needs to be reverted, everyone has to do it all over again.
> In my view, it is better to fix the inconsistency than to document it. I would argue that "`MFD_ALLOW_SEALING` is needed to enable sealing except that XYZ" is unintuitive and confusing for a non-significant amount of people.
I understand. Documentation helps resolve the confusion; the next step is to update the man page for memfd.
Thanks -Jeff