From: Roberto Sassu <roberto.sassu(a)huawei.com>
IMA and EVM are not effectively LSMs, mainly because, in the past, they
could not provide a security blob while another LSM was active.
That changed in recent years: the LSM stacking feature now makes it
possible to stack multiple LSMs together, and allows them to provide a
security blob for most kernel objects. While the LSM stacking feature
still has some limitations being worked out, it is already suitable for
turning IMA and EVM into LSMs.
The main purpose of this patch set is to remove IMA and EVM function calls,
hardcoded in the LSM infrastructure and other places in the kernel, and to
register them as LSM hook implementations, so that those functions are
called by the LSM infrastructure like other regular LSMs.
This patch set introduces two new LSMs 'ima' and 'evm', so that functions
can be registered to their respective LSM, and removes the 'integrity' LSM.
integrity_kernel_module_request() was moved to IMA, since a deadlock
could occur only there. integrity_inode_free() was replaced with ima_inode_free()
(EVM does not need to free memory).
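In its final form, the registration looks roughly like the following in
ima_main.c (an abridged sketch based on this series; the real hook list
is much longer, and 'evm' registers its hooks the same way):

static struct security_hook_list ima_hooks[] __ro_after_init = {
        LSM_HOOK_INIT(bprm_check_security, ima_bprm_check),
        LSM_HOOK_INIT(file_post_open, ima_file_check),
        LSM_HOOK_INIT(file_release, ima_file_free),
        LSM_HOOK_INIT(path_post_mknod, ima_post_path_mknod),
        /* ... the remaining IMA hooks ... */
};

static const struct lsm_id ima_lsmid = {
        .name = "ima",
        .id = LSM_ID_IMA,
};

static int __init init_ima_lsm(void)
{
        ima_iintcache_init();
        security_add_hooks(ima_hooks, ARRAY_SIZE(ima_hooks), &ima_lsmid);
        init_ima_appraise_lsm(&ima_lsmid);
        return 0;
}

DEFINE_LSM(ima) = {
        .name = "ima",
        .init = init_ima_lsm,
        .order = LSM_ORDER_LAST,
};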
In order to make 'ima' and 'evm' independent LSMs, it was necessary to
split integrity metadata used by both IMA and EVM, and to let them manage
their own. The special case of the IMA_NEW_FILE flag, managed by IMA and
used by EVM, was handled by introducing a new flag in EVM, EVM_NEW_FILE,
managed by two additional LSM hooks, evm_post_path_mknod() and
evm_file_release(), equivalent to their counterparts ima_post_path_mknod()
and ima_file_free().
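A sketch of those two EVM hooks (an approximation of the series' logic,
using the evm_iint_inode() accessor from the changelog below):

/* Mark a brand new file, so EVM can treat its first xattr writes
 * accordingly. */
static void evm_post_path_mknod(struct mnt_idmap *idmap,
                                struct dentry *dentry)
{
        struct inode *inode = d_backing_inode(dentry);
        struct evm_iint_cache *iint = evm_iint_inode(inode);

        if (!S_ISREG(inode->i_mode))
                return;

        if (iint)
                iint->flags |= EVM_NEW_FILE;
}

/* Clear the flag once the last writer releases the file. */
static void evm_file_release(struct file *file)
{
        struct inode *inode = file_inode(file);
        struct evm_iint_cache *iint = evm_iint_inode(inode);

        if (!S_ISREG(inode->i_mode) || !(file->f_mode & FMODE_WRITE))
                return;

        if (iint && atomic_read(&inode->i_writecount) == 1)
                iint->flags &= ~EVM_NEW_FILE;
}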
In addition to splitting the metadata, it was decided to embed the
evm_iint_inode structure into the inode security blob, since it is small,
and because EVM can no longer rely on IMA to allocate it (IMA now uses a
different structure).
On the other hand, to avoid memory pressure concerns, only a pointer to the
ima_iint_cache structure is stored in the inode security blob, and the
structure is allocated on demand, like before.
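In code, the split looks roughly like this (a sketch; the accessors are
the ones mentioned in the v8 changelog below, and the offset arithmetic
follows the usual LSM blob pattern):

/* EVM: the metadata is small, so it is embedded in the blob. */
static struct lsm_blob_sizes evm_blob_sizes __ro_after_init = {
        .lbs_inode = sizeof(struct evm_iint_cache),
};

/* IMA: only a pointer is reserved; the ima_iint_cache itself is still
 * allocated on demand, as before, to avoid memory pressure. */
static struct lsm_blob_sizes ima_blob_sizes __ro_after_init = {
        .lbs_inode = sizeof(struct ima_iint_cache *),
};

static inline struct ima_iint_cache *
ima_inode_get_iint(const struct inode *inode)
{
        struct ima_iint_cache **iint_sec;

        if (!inode->i_security)
                return NULL;

        iint_sec = inode->i_security + ima_blob_sizes.lbs_inode;
        return *iint_sec;
}

static inline void ima_inode_set_iint(const struct inode *inode,
                                      struct ima_iint_cache *iint)
{
        struct ima_iint_cache **iint_sec;

        iint_sec = inode->i_security + ima_blob_sizes.lbs_inode;
        *iint_sec = iint;
}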
Another follow-up change was removing the iint parameter from
evm_verifyxattr(), which IMA used to pass integrity metadata to EVM.
After splitting the metadata and aligning EVM_NEW_FILE with
IMA_NEW_FILE, this parameter was no longer necessary.
The last part was to ensure that the relative order of IMA and EVM
functions is preserved after they become LSMs. Since the order of
lsm_info structures in the .lsm_info.init section depends on the order
in which the object files containing those structures are passed to the
linker of the kernel image, and since IMA comes before EVM in the
Makefile, that is sufficient to guarantee that IMA functions are
executed before EVM ones.
The patch set is organized as follows.
Patches 1-9 make IMA and EVM functions suitable to be registered to the LSM
infrastructure, by aligning function parameters.
Patches 10-18 add new LSM hooks in the same places where IMA and EVM
functions are called, if there is no LSM hook already.
Patch 19 moves integrity_kernel_module_request() to IMA, as a prerequisite
for removing the 'integrity' LSM.
Patches 20-22 introduce the new standalone LSMs 'ima' and 'evm', and move
hardcoded calls to IMA, EVM and integrity functions to those LSMs.
Patches 23-24 remove the dependency on the 'integrity' LSM by splitting
integrity metadata, so that the 'ima' and 'evm' LSMs can use their own.
They also duplicate iint_lockdep_annotate() in ima_main.c, since the mutex
field was moved from integrity_iint_cache to ima_iint_cache.
Patch 25 finally removes the 'integrity' LSM, since 'ima' and 'evm' are now
self-contained and independent.
The patch set applies on top of lsm/dev, commit f1bb47a31dff ("lsm: new
security_file_ioctl_compat() hook"), with linux-integrity/next-integrity
merged in at commit c00f94b3a5be ("overlay: disable EVM").
Changelog:
v8:
- Restore dynamic allocation of IMA integrity metadata, and store only the
pointer in the inode security blob
- Select SECURITY_PATH both in IMA and EVM
- Rename evm_file_free() to evm_file_release()
- Unconditionally register evm_post_path_mknod()
- Introduce the new ima_iint.c file for the management of IMA integrity
metadata
- Introduce ima_inode_set_iint()/ima_inode_get_iint() in ima.h to
respectively store/retrieve the IMA integrity metadata pointer
- Replace ima_iint_inode() with ima_inode_get() and ima_iint_find(), with
the same behavior as integrity_inode_get() and integrity_iint_find()
- Initialize the ima_iint_cache in ima_iintcache_init() and call it from
init_ima_lsm()
- Move integrity_kernel_module_request() to IMA in a separate patch
(suggested by Mimi)
- Compile ima_kernel_module_request() if CONFIG_INTEGRITY_ASYMMETRIC_KEYS
is enabled
- Remove ima_inode_alloc_security() and ima_inode_free_security(), since
the IMA integrity metadata is not fully embedded in the inode security
blob
- Fix the missed initialization of ima_iint_cache in
process_measurement() and __ima_inode_hash()
- Add a sentence in 'evm: Move to LSM infrastructure' to mention moving
evm_inode_remove_acl(), evm_inode_post_remove_acl() and
evm_inode_post_set_acl() to evm_main.c
- Add a sentence in 'ima: Move IMA-Appraisal to LSM infrastructure' to
mention moving ima_inode_remove_acl() to ima_appraise.c
v7:
- Use return instead of goto in __vfs_removexattr_locked() (suggested by
Casey)
- Clarify in security/integrity/Makefile that the order of 'ima' and 'evm'
LSMs depends on the order in which IMA and EVM are compiled
- Move integrity_iint_cache flags to ima.h and evm.h in security/ and
duplicate IMA_NEW_FILE to EVM_NEW_FILE
- Rename evm_inode_get_iint() to evm_iint_inode() and ima_inode_get_iint()
to ima_iint_inode(), check if inode->i_security is NULL, and just return
the pointer from the inode security blob
- Restore the non-NULL checks after ima_iint_inode() and evm_iint_inode()
(suggested by Casey)
- Introduce evm_file_free() to clear EVM_NEW_FILE
- Remove comment about LSM_ORDER_LAST not guaranteeing the order of 'ima'
and 'evm' LSMs
- Lock iint->mutex before reading the IMA_COLLECTED flag in __ima_inode_hash()
and restore the ima_policy_flag check
- Remove patch about the hardcoded ordering of 'ima' and 'evm' LSMs in
security.c
- Add missing ima_inode_free_security() to free iint->ima_hash
- Add the cases for LSM_ID_IMA and LSM_ID_EVM in lsm_list_modules_test.c
- Mention the change in IMA and EVM post functions for private
inodes
v6:
- See v7
v5:
- Rename security_file_pre_free() to security_file_release() and the LSM
hook file_pre_free_security to file_release (suggested by Paul)
- Move integrity_kernel_module_request() to ima_main.c (renamed to
ima_kernel_module_request())
- Split the integrity_iint_cache structure into ima_iint_cache and
evm_iint_cache, so that IMA and EVM can use disjoint metadata and
reserve space with the LSM infrastructure
- Reserve space for the entire ima_iint_cache and evm_iint_cache
structures, not just the pointer (suggested by Paul)
- Introduce ima_inode_get_iint() and evm_inode_get_iint() to retrieve
respectively the ima_iint_cache and evm_iint_cache structure from the
security blob
- Remove the various non-NULL checks for the ima_iint_cache and
evm_iint_cache structures, since the LSM infrastructure ensures that they
always exist
- Remove the iint parameter from evm_verifyxattr() since IMA and EVM
use disjoint integrity metadata
- Introduce evm_post_path_mknod() to set the IMA_NEW_FILE flag
- Register the inode_alloc_security LSM hook in IMA and EVM to
initialize the respective integrity metadata structures
- Remove the 'integrity' LSM completely and instead make 'ima' and 'evm'
proper standalone LSMs
- Add the inode parameter to ima_get_verity_digest(), since the inode
field is not present in ima_iint_cache
- Move iint_lockdep_annotate() to ima_main.c (renamed to
ima_iint_lockdep_annotate())
- Remove ima_get_lsm_id() and evm_get_lsm_id(), since IMA and EVM directly
register the needed LSM hooks
- Enforce 'ima' and 'evm' LSM ordering at LSM infrastructure level
v4:
- Improve short and long description of
security_inode_post_create_tmpfile(), security_inode_post_set_acl(),
security_inode_post_remove_acl() and security_file_post_open()
(suggested by Mimi)
- Improve commit message of 'ima: Move to LSM infrastructure' (suggested
by Mimi)
v3:
- Drop 'ima: Align ima_post_path_mknod() definition with LSM
infrastructure' and 'ima: Align ima_post_create_tmpfile() definition
with LSM infrastructure', define the new LSM hooks with the same
IMA parameters instead (suggested by Mimi)
- Do IS_PRIVATE() check in security_path_post_mknod() and
security_inode_post_create_tmpfile() on the new inode rather than the
parent directory (in the post method it is available)
- Don't export ima_file_check() (suggested by Stefan)
- Remove redundant check of file mode in ima_post_path_mknod() (suggested
by Mimi)
- Mention that ima_post_path_mknod() is now conditionally invoked when
CONFIG_SECURITY_PATH=y (suggested by Mimi)
- Mention when a LSM hook will be introduced in the IMA/EVM alignment
patches (suggested by Mimi)
- Simplify the commit messages when introducing a new LSM hook
- Still keep the 'extern' in the function declaration, until the
declaration is removed (suggested by Mimi)
- Improve documentation of security_file_pre_free()
- Register 'ima' and 'evm' as standalone LSMs (suggested by Paul)
- Initialize the 'ima' and 'evm' LSMs from 'integrity', to keep the
original ordering of IMA and EVM functions as when they were hardcoded
- Return the IMA and EVM LSM IDs to 'integrity' for registration of the
integrity-specific hooks
- Reserve an xattr slot from the 'evm' LSM instead of 'integrity'
- Pass the LSM ID to init_ima_appraise_lsm()
v2:
- Add description for newly introduced LSM hooks (suggested by Casey)
- Clarify in the description of security_file_pre_free() that actions can
be performed while the file is still open
v1:
- Drop 'evm: Complete description of evm_inode_setattr()', 'fs: Fix
description of vfs_tmpfile()' and 'security: Introduce LSM_ORDER_LAST',
they were sent separately (suggested by Christian Brauner)
- Replace dentry with file descriptor parameter for
security_inode_post_create_tmpfile()
- Introduce mode_stripped and pass it as mode argument to
security_path_mknod() and security_path_post_mknod()
- Use goto in do_mknodat() and __vfs_removexattr_locked() (suggested by
Mimi)
- Replace __lsm_ro_after_init with __ro_after_init
- Modify short description of security_inode_post_create_tmpfile() and
security_inode_post_set_acl() (suggested by Stefan)
- Move security_inode_post_setattr() just after security_inode_setattr()
(suggested by Mimi)
- Modify short description of security_key_post_create_or_update()
(suggested by Mimi)
- Add back exported functions ima_file_check() and
evm_inode_init_security() respectively to ima.h and evm.h (reported by
kernel robot)
- Remove extern from prototype declarations and fix style issues
- Remove unnecessary include of linux/lsm_hooks.h in ima_main.c and
ima_appraise.c
Roberto Sassu (25):
ima: Align ima_inode_post_setattr() definition with LSM infrastructure
ima: Align ima_file_mprotect() definition with LSM infrastructure
ima: Align ima_inode_setxattr() definition with LSM infrastructure
ima: Align ima_inode_removexattr() definition with LSM infrastructure
ima: Align ima_post_read_file() definition with LSM infrastructure
evm: Align evm_inode_post_setattr() definition with LSM infrastructure
evm: Align evm_inode_setxattr() definition with LSM infrastructure
evm: Align evm_inode_post_setxattr() definition with LSM
infrastructure
security: Align inode_setattr hook definition with EVM
security: Introduce inode_post_setattr hook
security: Introduce inode_post_removexattr hook
security: Introduce file_post_open hook
security: Introduce file_release hook
security: Introduce path_post_mknod hook
security: Introduce inode_post_create_tmpfile hook
security: Introduce inode_post_set_acl hook
security: Introduce inode_post_remove_acl hook
security: Introduce key_post_create_or_update hook
integrity: Move integrity_kernel_module_request() to IMA
ima: Move to LSM infrastructure
ima: Move IMA-Appraisal to LSM infrastructure
evm: Move to LSM infrastructure
evm: Make it independent from 'integrity' LSM
ima: Make it independent from 'integrity' LSM
integrity: Remove LSM
fs/attr.c | 5 +-
fs/file_table.c | 3 +-
fs/namei.c | 12 +-
fs/nfsd/vfs.c | 3 +-
fs/open.c | 1 -
fs/posix_acl.c | 5 +-
fs/xattr.c | 9 +-
include/linux/evm.h | 117 +-------
include/linux/ima.h | 142 ----------
include/linux/integrity.h | 27 --
include/linux/lsm_hook_defs.h | 20 +-
include/linux/security.h | 59 ++++
include/uapi/linux/lsm.h | 2 +
security/integrity/Makefile | 1 +
security/integrity/digsig_asymmetric.c | 23 --
security/integrity/evm/Kconfig | 1 +
security/integrity/evm/evm.h | 19 ++
security/integrity/evm/evm_crypto.c | 4 +-
security/integrity/evm/evm_main.c | 195 ++++++++++---
security/integrity/iint.c | 197 +------------
security/integrity/ima/Kconfig | 1 +
security/integrity/ima/Makefile | 2 +-
security/integrity/ima/ima.h | 148 ++++++++--
security/integrity/ima/ima_api.c | 23 +-
security/integrity/ima/ima_appraise.c | 66 +++--
security/integrity/ima/ima_iint.c | 142 ++++++++++
security/integrity/ima/ima_init.c | 2 +-
security/integrity/ima/ima_main.c | 144 +++++++---
security/integrity/ima/ima_policy.c | 2 +-
security/integrity/integrity.h | 80 +-----
security/keys/key.c | 10 +-
security/security.c | 263 +++++++++++-------
security/selinux/hooks.c | 3 +-
security/smack/smack_lsm.c | 4 +-
.../selftests/lsm/lsm_list_modules_test.c | 6 +
35 files changed, 902 insertions(+), 839 deletions(-)
create mode 100644 security/integrity/ima/ima_iint.c
--
2.34.1
Calling get_system_loc_code() before checking devfd and errno fails the
test when the device is not available; a SKIP is expected instead.
Change the order of the 'SKIP_IF_MSG' checks to correctly SKIP when the
/dev/papr-vpd device is not available.
Without the patch: Test FAILED on line 271
With the patch: [SKIP] Test skipped on line 266: /dev/papr-vpd not present
Signed-off-by: R Nageswara Sastry <rnsastry(a)linux.ibm.com>
---
tools/testing/selftests/powerpc/papr_vpd/papr_vpd.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/tools/testing/selftests/powerpc/papr_vpd/papr_vpd.c b/tools/testing/selftests/powerpc/papr_vpd/papr_vpd.c
index 98cbb9109ee6..505294da1b9f 100644
--- a/tools/testing/selftests/powerpc/papr_vpd/papr_vpd.c
+++ b/tools/testing/selftests/powerpc/papr_vpd/papr_vpd.c
@@ -263,10 +263,10 @@ static int papr_vpd_system_loc_code(void)
off_t size;
int fd;
- SKIP_IF_MSG(get_system_loc_code(&lc),
- "Cannot determine system location code");
SKIP_IF_MSG(devfd < 0 && errno == ENOENT,
DEVPATH " not present");
+ SKIP_IF_MSG(get_system_loc_code(&lc),
+ "Cannot determine system location code");
FAIL_IF(devfd < 0);
--
2.37.1 (Apple Git-137.1)
The kernel has recently added support for shadow stacks, currently x86
only, using its CET feature, but both arm64 and RISC-V have equivalent
features (GCS and Zicfiss respectively); I am actively working on
GCS [1]. With shadow stacks the hardware maintains an additional stack
containing only the return addresses for branch instructions, which is
not generally writeable by userspace and which ensures that any returns
go to the recorded addresses. This provides some protection against ROP
attacks and makes it easier to collect call stacks. These shadow stacks
are allocated in the address space of the userspace process.
Our API for shadow stacks does not currently offer userspace any
flexibility for managing the allocation of shadow stacks for newly
created threads; instead, the kernel allocates a new shadow stack with
the same size as the normal stack whenever a thread is created with the
feature enabled. The stacks allocated in this way are freed by the
kernel when the thread exits or when shadow stacks are disabled for the
thread. This lack of flexibility and control isn't ideal: in the vast
majority of cases the shadow stack will be over-allocated, and the
implicit allocation and deallocation are not consistent with other
interfaces. As far as I can tell the interface was done this way mainly
because the shadow stack patches were in development since before
clone3() was implemented.
Since clone3() is readily extensible, let's add support for specifying a
shadow stack when creating a new thread or process, in a similar manner
to how the normal stack is specified, keeping the current implicit
allocation behaviour if one is not specified either with clone3() or
through the use of clone(). The user must provide a shadow stack address
and size; this must point to memory mapped for use as a shadow stack by
map_shadow_stack(), with a shadow stack token at the top of the stack.
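For illustration, a minimal sketch of the resulting usage; it assumes
the uapi proposed in this series (the shadow_stack/shadow_stack_size
clone_args fields), so it only builds against patched kernel headers:

/* Create a thread with a caller-allocated shadow stack via clone3().
 * Error handling is omitted. */
#include <linux/sched.h>        /* struct clone_args, CLONE_* */
#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

#ifndef SHADOW_STACK_SET_TOKEN
#define SHADOW_STACK_SET_TOKEN  (1ULL << 0)     /* from asm/mman.h */
#endif

static pid_t thread_with_shstk(void *stack, size_t stack_size)
{
        size_t shstk_size = 4 * 4096;
        /* The caller maps the shadow stack, with a token at the top,
         * which clone3() will consume. */
        void *shstk = (void *)syscall(__NR_map_shadow_stack, 0,
                                      shstk_size, SHADOW_STACK_SET_TOKEN);
        struct clone_args args;

        memset(&args, 0, sizeof(args));
        args.flags = CLONE_VM | CLONE_FS | CLONE_FILES | CLONE_SIGHAND |
                     CLONE_THREAD;
        args.stack = (uintptr_t)stack;
        args.stack_size = stack_size;
        args.shadow_stack = (uintptr_t)shstk;           /* proposed field */
        args.shadow_stack_size = shstk_size;            /* proposed field */

        return syscall(__NR_clone3, &args, sizeof(args));
}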
Please note that the x86 portions of this code are build tested only; I
don't appear to have a system that can run CET available to me. I have
done testing with an integration into my pending work for GCS. There is
some possibility that the arm64 implementation may require the use of
clone3() and explicit userspace allocation of shadow stacks; this is
still under discussion.
Please further note that the token consumption done by clone3() is not
currently implemented in an atomic fashion; Rick indicated that he would
look into fixing this if people are OK with the implementation.
A new architecture feature Kconfig option for shadow stacks is added
here; this was suggested as part of the review comments for the arm64
GCS series, and since we need to detect whether shadow stacks are
supported it seemed sensible to roll it in here.
[1] https://lore.kernel.org/r/20231009-arm64-gcs-v6-0-78e55deaa4dd@kernel.org/
Signed-off-by: Mark Brown <broonie(a)kernel.org>
---
Changes in v5:
- Rebase onto v6.8-rc2.
- Rework ABI to have the user allocate the shadow stack memory with
map_shadow_stack() and a token.
- Force inlining of the x86 shadow stack enablement.
- Move shadow stack enablement out into a shared header for reuse by
other tests.
- Link to v4: https://lore.kernel.org/r/20231128-clone3-shadow-stack-v4-0-8b28ffe4f676@ke…
Changes in v4:
- Formatting changes.
- Use a define for minimum shadow stack size and move some basic
validation to fork.c.
- Link to v3: https://lore.kernel.org/r/20231120-clone3-shadow-stack-v3-0-a7b8ed3e2acc@ke…
Changes in v3:
- Rebase onto v6.7-rc2.
- Remove stale shadow_stack in internal kargs.
- If a shadow stack is specified unconditionally use it regardless of
CLONE_ parameters.
- Force enable shadow stacks in the selftest.
- Update changelogs for RISC-V feature rename.
- Link to v2: https://lore.kernel.org/r/20231114-clone3-shadow-stack-v2-0-b613f8681155@ke…
Changes in v2:
- Rebase onto v6.7-rc1.
- Remove ability to provide preallocated shadow stack, just specify the
desired size.
- Link to v1: https://lore.kernel.org/r/20231023-clone3-shadow-stack-v1-0-d867d0b5d4d0@ke…
---
Mark Brown (7):
Documentation: userspace-api: Add shadow stack API documentation
selftests: Provide helper header for shadow stack testing
mm: Introduce ARCH_HAS_USER_SHADOW_STACK
fork: Add shadow stack support to clone3()
selftests/clone3: Factor more of main loop into test_clone3()
selftests/clone3: Allow tests to flag if -E2BIG is a valid error code
selftests/clone3: Test shadow stack support
Documentation/userspace-api/index.rst | 1 +
Documentation/userspace-api/shadow_stack.rst | 41 +++++
arch/x86/Kconfig | 1 +
arch/x86/include/asm/shstk.h | 11 +-
arch/x86/kernel/process.c | 2 +-
arch/x86/kernel/shstk.c | 91 +++++++---
fs/proc/task_mmu.c | 2 +-
include/linux/mm.h | 2 +-
include/linux/sched/task.h | 2 +
include/uapi/linux/sched.h | 13 +-
kernel/fork.c | 61 +++++--
mm/Kconfig | 6 +
tools/testing/selftests/clone3/clone3.c | 211 ++++++++++++++++++----
tools/testing/selftests/clone3/clone3_selftests.h | 8 +
tools/testing/selftests/ksft_shstk.h | 63 +++++++
15 files changed, 430 insertions(+), 85 deletions(-)
---
base-commit: 41bccc98fb7931d63d03f326a746ac4d429c1dd3
change-id: 20231019-clone3-shadow-stack-15d40d2bf536
Best regards,
--
Mark Brown <broonie(a)kernel.org>
Hi Linus,
Please pull the following KUnit fixes update for Linux 6.8-rc5.
This KUnit update for Linux 6.8-rc5 consists of one important fix
to unregister the kunit_bus when the KUnit module is unloaded. Not doing
so causes an error when the KUnit module tries to re-register the bus
when it gets reloaded.
diff is attached.
thanks,
-- Shuah
----------------------------------------------------------------
The following changes since commit 1a9f2c776d1416c4ea6cb0d0b9917778c41a1a7d:
Documentation: KUnit: Update the instructions on how to test static functions (2024-01-22 07:59:03 -0700)
are available in the Git repository at:
git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest tags/linux_kselftest-kunit-fixes-6.8-rc5
for you to fetch changes up to 829388b725f8d266ccec32a2f446717d8693eaba:
kunit: device: Unregister the kunit_bus on shutdown (2024-02-06 17:07:37 -0700)
----------------------------------------------------------------
linux_kselftest-kunit-fixes-6.8-rc5
This KUnit update for Linux 6.8-rc5 consists of one important fix
to unregister the kunit_bus when the KUnit module is unloaded. Not doing
so causes an error when the KUnit module tries to re-register the bus
when it gets reloaded.
----------------------------------------------------------------
David Gow (1):
kunit: device: Unregister the kunit_bus on shutdown
lib/kunit/device-impl.h | 2 ++
lib/kunit/device.c | 14 ++++++++++++++
lib/kunit/test.c | 3 +++
3 files changed, 19 insertions(+)
----------------------------------------------------------------
[Putting this as an RFC because I'm pretty sure I'm not doing things
correctly at the BPF level.]
[Also using bpf-next as the base tree as there will be conflicting
changes otherwise]
Ideally I'd like to have something similar to bpf_timers, but not
in soft IRQ context. So I'm emulating this with a sleepable
bpf_tail_call() (see "HID: bpf: allow to defer work in a delayed
workqueue").
Basically, I need to be able to defer a HID-BPF program for the
following reasons (from the aforementioned patch):
1. defer an event:
Sometimes we receive an out of proximity event, but the device cannot
be trusted enough, and we need to ensure that we won't receive another
one in the following n milliseconds. So we need to wait those n
milliseconds, and eventually re-inject that event into the stack.
2. inject new events in reaction to one given event:
We might want to transform one given event into several. This is the
case for macro keys where a single key press is supposed to send
a sequence of key presses. But this could also be used to patch a
faulty behavior, if a device forgets to send a release event.
3. communicate with the device in reaction to one event:
We might want to communicate back to the device after a given event.
For example a device might send us an event saying that it came back
from sleeping state and needs to be re-initialized.
Currently we can achieve that by keeping a userspace program around,
raising a bpf event, and letting that userspace program inject the
events and commands.
However, we are just keeping that program alive as a daemon merely for
scheduling commands. There is no logic in it, so it doesn't really
justify an actual userspace wakeup. A kernel workqueue seems simpler to
handle.
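For illustration, the BPF side could look roughly like this (the
deferral kfunc name comes from the patch titles below; its exact
signature, and the event_is_out_of_proximity() helper, are assumptions
made for the sketch):

#include "vmlinux.h"
#include <bpf/bpf_tracing.h>
#include "hid_bpf_helpers.h"

SEC("fmod_ret/hid_bpf_device_event")
int BPF_PROG(filter_proximity, struct hid_bpf_ctx *hctx)
{
        __u8 *data = hid_bpf_get_data(hctx, 0 /* offset */, 8 /* size */);

        if (!data)
                return 0;       /* let the event through untouched */

        if (event_is_out_of_proximity(data)) {
                /* Hold the event back; a sleepable program run from the
                 * delayed workqueue re-injects it via
                 * hid_bpf_input_report() if the device stays quiet for
                 * n milliseconds. */
                hid_bpf_schedule_delayed_work(hctx, 50 /* ms, assumed */);
                return -1;      /* swallow the original event for now */
        }

        return 0;
}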
Besides the "this is insane to reimplement the wheel", the other part I'm
not sure is whether we can say that BPF maps of type prog_array and of
type queue/stack can be used in sleepable context.
I don't see any warning when running the test programs, but that's
probably not a guarantee that I'm doing things properly :)
Cheers,
Benjamin
Signed-off-by: Benjamin Tissoires <bentiss(a)kernel.org>
---
Benjamin Tissoires (9):
bpf: allow more maps in sleepable bpf programs
HID: bpf/dispatch: regroup kfuncs definitions
HID: bpf: export hid_hw_output_report as a BPF kfunc
selftests/hid: Add test for hid_bpf_hw_output_report
HID: bpf: allow to inject HID event from BPF
selftests/hid: add tests for hid_bpf_input_report
HID: bpf: allow to defer work in a delayed workqueue
selftests/hid: add test for hid_bpf_schedule_delayed_work
selftests/hid: add another set of delayed work tests
Documentation/hid/hid-bpf.rst | 26 +-
drivers/hid/bpf/entrypoints/entrypoints.bpf.c | 8 +
drivers/hid/bpf/entrypoints/entrypoints.lskel.h | 236 +++++++++------
drivers/hid/bpf/hid_bpf_dispatch.c | 315 ++++++++++++++++-----
drivers/hid/bpf/hid_bpf_jmp_table.c | 34 ++-
drivers/hid/hid-core.c | 2 +
include/linux/hid_bpf.h | 10 +
kernel/bpf/verifier.c | 3 +
tools/testing/selftests/hid/hid_bpf.c | 286 ++++++++++++++++++-
tools/testing/selftests/hid/progs/hid.c | 205 ++++++++++++++
.../testing/selftests/hid/progs/hid_bpf_helpers.h | 8 +
11 files changed, 962 insertions(+), 171 deletions(-)
---
base-commit: 4f7a05917237b006ceae760507b3d15305769ade
change-id: 20240205-hid-bpf-sleepable-c01260fd91c4
Best regards,
--
Benjamin Tissoires <bentiss(a)kernel.org>
From: Paul Durrant <pdurrant(a)amazon.com>
This series has one small fix to what was in v11 [1]:
* KVM: xen: re-initialize shared_info if guest (32/64-bit) mode is set
The v11 patch failed to set the return code of the ioctl if the mode
was not actually changed, leading to a spurious failure.
This version of the series also contains a new bug-fix to the pfncache
code from David Woodhouse.
[1] https://lore.kernel.org/kvm/20231219161109.1318-1-paul@xen.org/
David Woodhouse (1):
KVM: pfncache: rework __kvm_gpc_refresh() to fix locking issues
Paul Durrant (19):
KVM: pfncache: Add a map helper function
KVM: pfncache: remove unnecessary exports
KVM: xen: mark guest pages dirty with the pfncache lock held
KVM: pfncache: add a mark-dirty helper
KVM: pfncache: remove KVM_GUEST_USES_PFN usage
KVM: pfncache: stop open-coding offset_in_page()
KVM: pfncache: include page offset in uhva and use it consistently
KVM: pfncache: allow a cache to be activated with a fixed (userspace)
HVA
KVM: xen: separate initialization of shared_info cache and content
KVM: xen: re-initialize shared_info if guest (32/64-bit) mode is set
KVM: xen: allow shared_info to be mapped by fixed HVA
KVM: xen: allow vcpu_info to be mapped by fixed HVA
KVM: selftests / xen: map shared_info using HVA rather than GFN
KVM: selftests / xen: re-map vcpu_info using HVA rather than GPA
KVM: xen: advertize the KVM_XEN_HVM_CONFIG_SHARED_INFO_HVA capability
KVM: xen: split up kvm_xen_set_evtchn_fast()
KVM: xen: don't block on pfncache locks in kvm_xen_set_evtchn_fast()
KVM: pfncache: check the need for invalidation under read lock first
KVM: xen: allow vcpu_info content to be 'safely' copied
Documentation/virt/kvm/api.rst | 53 ++-
arch/x86/kvm/x86.c | 7 +-
arch/x86/kvm/xen.c | 356 +++++++++++-------
include/linux/kvm_host.h | 40 +-
include/linux/kvm_types.h | 8 -
include/uapi/linux/kvm.h | 9 +-
.../selftests/kvm/x86_64/xen_shinfo_test.c | 59 ++-
virt/kvm/pfncache.c | 340 +++++++++--------
8 files changed, 535 insertions(+), 337 deletions(-)
base-commit: 1c6d984f523f67ecfad1083bb04c55d91977bb15
--
2.39.2
Major changes in RFC v5:
------------------------
1. Rebased on top of 'Abstract page from net stack' series and used the
new netmem type to refer to LSB set pointers instead of re-using
struct page.
2. Downgraded this series back to RFC and called it RFC v5. This is
because this series is now dependent on 'Abstract page from net
stack' [1] and the queue API. Both have been moved out of this series
into prerequisite work.
3. Reworked the page_pool devmem support to use netmem, and unified
some of the handling.
4. Reworked the reference counting of net_iov (renamed from
page_pool_iov) to use pp_ref_count for refcounting.
The full changes including the dependent series and GVE page pool
support is here:
https://github.com/mina/linux/commits/tcpdevmem-rfcv5/
[1] https://patchwork.kernel.org/project/netdevbpf/list/?series=810774
Major changes in v1:
--------------------
1. Implemented MVP queue API ndos to remove the userspace-visible
driver reset.
2. Fixed issues in the napi_pp_put_page() devmem frag unref path.
3. Removed RFC tag.
Many smaller addressed comments across all the patches (patches have
individual change log).
Full tree including the rest of the GVE driver changes:
https://github.com/mina/linux/commits/tcpdevmem-v1
Changes in RFC v3:
------------------
1. Pulled in the memory-provider dependency from Jakub's RFC [1] to make
the series reviewable and mergeable.
2. Implemented multi-rx-queue binding which was a todo in v2.
3. Fix to cmsg handling.
The sticking point in RFC v2 [2] was the device reset required to refill
the device rx-queues after the dmabuf bind/unbind. The suggested
solution, as I understand it, is a subset of (or similar to) the
per-queue management ops Jakub suggested:
https://lore.kernel.org/netdev/20230815171638.4c057dcd@kernel.org/
This is not addressed in this revision, because:
1. This point was discussed at netconf & netdev and there is openness to
using the current approach of requiring a device reset.
2. Implementing individual queue resetting seems to be difficult for my
test bed with GVE. My prototype to test this ran into issues with the
rx-queues not coming back up properly if reset individually. At the
moment I'm unsure if it's a mistake in the POC or a genuine issue in
the virtualization stack behind GVE, which currently doesn't test
individual rx-queue restart.
3. Our usecases are not bothered by requiring a device reset to refill
the buffer queues, and we'd like to support NICs that run into this
limitation with resetting individual queues.
My thought is that drivers that have trouble with per-queue configs can
use the support in this series, while drivers that support new netdev
ops to reset individual queues can automatically reset the queue as
part of the dma-buf bind/unbind.
The same approach with device resets is presented again for consideration
with other sticking points addressed.
This proposal includes the rx devmem path only proposed for merge. For a
snapshot of my entire tree which includes the GVE POC page pool support &
device memory support:
https://github.com/torvalds/linux/compare/master...mina:linux:tcpdevmem-v3
[1] https://lore.kernel.org/netdev/f8270765-a27b-6ccf-33ea-cda097168d79@redhat.…
[2] https://lore.kernel.org/netdev/CAHS8izOVJGJH5WF68OsRWFKJid1_huzzUK+hpKbLcL4…
Changes in RFC v2:
------------------
The sticking point in RFC v1[1] was the dma-buf pages approach we used to
deliver the device memory to the TCP stack. RFC v2 is a proof-of-concept
that attempts to resolve this by implementing scatterlist support in the
networking stack, such that we can import the dma-buf scatterlist
directly. This is the approach proposed at a high level here[2].
Detailed changes:
1. Replaced dma-buf pages approach with importing scatterlist into the
page pool.
2. Replace the dma-buf pages centric API with a netlink API.
3. Removed the TX path implementation - there is no issue with
implementing the TX path with scatterlist approach, but leaving
out the TX path makes it easier to review.
4. Functionality is tested with this proposal, but I have not conducted
perf testing yet. I'm not sure there are regressions, but I removed
perf claims from the cover letter until they can be re-confirmed.
5. Added Signed-off-by: contributors to the implementation.
6. Fixed some bugs with the RX path since RFC v1.
Any feedback welcome, but specifically the biggest pending questions
needing feedback IMO are:
1. Feedback on the scatterlist-based approach in general.
2. Netlink API (Patch 1 & 2).
3. Approach to handle all the drivers that expect to receive pages from
the page pool (Patch 6).
[1] https://lore.kernel.org/netdev/dfe4bae7-13a0-3c5d-d671-f61b375cb0b4@gmail.c…
[2] https://lore.kernel.org/netdev/CAHS8izPm6XRS54LdCDZVd0C75tA1zHSu6jLVO8nzTLX…
----------------------
* TL;DR:
Device memory TCP (devmem TCP) is a proposal for transferring data to and/or
from device memory efficiently, without bouncing the data to a host memory
buffer.
* Problem:
A large share of data transfers have device memory as the source and/or
destination. Accelerators have drastically increased the volume of such
transfers. Some examples include:
- ML accelerators transferring large amounts of training data from storage into
GPU/TPU memory. In some cases ML training setup time can be as long as 50% of
TPU compute time; improving data transfer throughput & efficiency can help
improve GPU/TPU utilization.
- Distributed training, where ML accelerators, such as GPUs on different hosts,
exchange data among them.
- Distributed raw block storage applications transfer large amounts of data with
remote SSDs, much of this data does not require host processing.
Today, the majority of Device-to-Device data transfers over the network
are implemented as the following low-level operations: Device-to-Host
copy, Host-to-Host network transfer, and Host-to-Device copy.
The implementation is suboptimal, especially for bulk data transfers,
and can put significant strain on system resources, such as host memory
bandwidth, PCIe bandwidth, etc. One important reason behind the current
state is the kernel’s lack of semantics to express device-to-network
transfers.
* Proposal:
In this patch series we attempt to optimize this use case by implementing
socket APIs that enable the user to:
1. send device memory across the network directly, and
2. receive incoming network packets directly into device memory.
Packet _payloads_ go directly from the NIC to device memory for receive and from
device memory to NIC for transmit.
Packet _headers_ go to/from host memory and are processed by the TCP/IP stack
normally. The NIC _must_ support header split to achieve this.
Advantages:
- Alleviate host memory bandwidth pressure, compared to existing
network-transfer + device-copy semantics.
- Alleviate PCIe BW pressure, by limiting data transfer to the lowest level
of the PCIe tree, compared to traditional path which sends data through the
root complex.
* Patch overview:
** Part 1: netlink API
Gives user ability to bind dma-buf to an RX queue.
** Part 2: scatterlist support
Currently the standard for device memory sharing is DMABUF, which
doesn't generate struct pages. On the other hand, the networking stack
(skbs, drivers, and page pool) operates on pages. We have 2 options:
1. Generate struct pages for dmabuf device memory, or,
2. Modify the networking stack to process scatterlist.
** Part 3: page pool support
We piggyback on the page pool memory providers proposal:
https://github.com/kuba-moo/linux/tree/pp-providers
It allows the page pool to define a memory provider that provides the
page allocation and freeing. It helps abstract most of the device memory
TCP changes from the driver.
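The hook surface is roughly of this shape (a sketch paraphrasing the
pp-providers tree; names and signatures are assumptions rather than this
series' final API):

/* The provider, rather than the driver, sources and reclaims the page
 * pool's buffers; a dmabuf-backed provider hands out net_iovs instead
 * of real pages. */
struct memory_provider_ops {
        int             (*init)(struct page_pool *pool);
        void            (*destroy)(struct page_pool *pool);
        netmem_ref      (*alloc_netmem)(struct page_pool *pool, gfp_t gfp);
        bool            (*release_netmem)(struct page_pool *pool,
                                          netmem_ref netmem);
};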
** Part 4: support for unreadable skb frags
Page pool iovs are not accessible by the host; we implement changes
throughout the networking stack to correctly handle skbs with unreadable
frags.
** Part 5: recvmsg() APIs
We define user APIs for the user to send and receive device memory.
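The receive side could look roughly like the following
(SO_DEVMEM_DONTNEED comes from this series; struct dmabuf_cmsg, its
fields, and consume_frag() are illustrative assumptions):

#include <stdint.h>
#include <sys/socket.h>

/* Stand-in for the uapi cmsg payload this series defines. */
struct dmabuf_cmsg {
        uint64_t frag_offset;   /* where the payload sits in the dmabuf */
        uint32_t frag_size;
        uint32_t frag_token;    /* handed back via SO_DEVMEM_DONTNEED */
};

static void recv_devmem(int fd)
{
        char linear[4096];      /* protocol headers land in host memory */
        char ctrl[CMSG_SPACE(sizeof(struct dmabuf_cmsg)) * 16];
        struct iovec iov = { .iov_base = linear, .iov_len = sizeof(linear) };
        struct msghdr msg = {
                .msg_iov = &iov, .msg_iovlen = 1,
                .msg_control = ctrl, .msg_controllen = sizeof(ctrl),
        };
        struct cmsghdr *cm;

        if (recvmsg(fd, &msg, 0) < 0)
                return;

        for (cm = CMSG_FIRSTHDR(&msg); cm; cm = CMSG_NXTHDR(&msg, cm)) {
                struct dmabuf_cmsg *dc = (void *)CMSG_DATA(cm);

                /* dc->frag_offset/frag_size locate the payload inside
                 * the bound dmabuf; process it in place. */
                consume_frag(dc);       /* hypothetical */

                /* Return the fragment so the page pool can reuse it. */
                setsockopt(fd, SOL_SOCKET, SO_DEVMEM_DONTNEED,
                           &dc->frag_token, sizeof(dc->frag_token));
        }
}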
Not included with this series is the GVE devmem TCP support, just to
simplify the review. Code available here if desired:
https://github.com/mina/linux/tree/tcpdevmem
This series is built on top of net-next with Jakub's pp-providers changes
cherry-picked.
* NIC dependencies:
1. (strict) Devmem TCP requires the NIC to support header split, i.e. the
capability to split incoming packets into a header + payload and to put
each into a separate buffer. Devmem TCP works by using device memory
for the packet payload, and host memory for the packet headers.
2. (optional) Devmem TCP works better with flow steering support & RSS support,
i.e. the NIC's ability to steer flows into certain rx queues. This allows the
sysadmin to enable devmem TCP on a subset of the rx queues, and steer
devmem TCP traffic onto these queues and non devmem TCP elsewhere.
The NIC I have access to with these properties is the GVE with DQO support
running in Google Cloud, but any NIC that supports these features would suffice.
I may be able to help reviewers bring up devmem TCP on their NICs.
* Testing:
The series includes a udmabuf kselftest that shows a simple use case of
devmem TCP and validates the entire data path end to end without
a dependency on a specific dmabuf provider.
** Test Setup
Kernel: net-next with this series and memory provider API cherry-picked
locally.
Hardware: Google Cloud A3 VMs.
NIC: GVE with header split & RSS & flow steering support.
Cc: Pavel Begunkov <asml.silence(a)gmail.com>
Cc: David Wei <dw(a)davidwei.uk>
Cc: Jason Gunthorpe <jgg(a)ziepe.ca>
Cc: Yunsheng Lin <linyunsheng(a)huawei.com>
Cc: Shailend Chand <shailend(a)google.com>
Cc: Harshitha Ramamurthy <hramamurthy(a)google.com>
Cc: Shakeel Butt <shakeelb(a)google.com>
Cc: Jeroen de Borst <jeroendb(a)google.com>
Cc: Praveen Kaligineedi <pkaligineedi(a)google.com>
Jakub Kicinski (1):
net: page_pool: create hooks for custom page providers
Mina Almasry (13):
net: page_pool: factor out page_pool recycle check
net: netdev netlink api to bind dma-buf to a net device
netdev: support binding dma-buf to netdevice
netdev: netdevice devmem allocator
page_pool: convert to use netmem
page_pool: devmem support
memory-provider: dmabuf devmem memory provider
net: support non paged skb frags
net: add support for skbs with unreadable frags
tcp: RX path for devmem TCP
net: add SO_DEVMEM_DONTNEED setsockopt to release RX frags
net: add devmem TCP documentation
selftests: add ncdevmem, netcat for devmem TCP
Documentation/netlink/specs/netdev.yaml | 52 +++
Documentation/networking/devmem.rst | 271 +++++++++++++
Documentation/networking/index.rst | 1 +
arch/alpha/include/uapi/asm/socket.h | 8 +-
arch/mips/include/uapi/asm/socket.h | 6 +
arch/parisc/include/uapi/asm/socket.h | 6 +
arch/sparc/include/uapi/asm/socket.h | 6 +
include/linux/skbuff.h | 68 +++-
include/linux/socket.h | 1 +
include/net/devmem.h | 127 ++++++
include/net/netdev_rx_queue.h | 1 +
include/net/netmem.h | 223 ++++++++++-
include/net/page_pool/helpers.h | 117 ++++--
include/net/page_pool/types.h | 27 +-
include/net/sock.h | 2 +
include/net/tcp.h | 5 +-
include/trace/events/page_pool.h | 28 +-
include/uapi/asm-generic/socket.h | 6 +
include/uapi/linux/netdev.h | 19 +
include/uapi/linux/uio.h | 14 +
net/bpf/test_run.c | 5 +-
net/core/datagram.c | 6 +
net/core/dev.c | 314 ++++++++++++++-
net/core/gro.c | 7 +-
net/core/netdev-genl-gen.c | 19 +
net/core/netdev-genl-gen.h | 2 +
net/core/netdev-genl.c | 123 ++++++
net/core/page_pool.c | 419 +++++++++++++-------
net/core/skbuff.c | 92 ++++-
net/core/sock.c | 45 +++
net/ipv4/tcp.c | 196 +++++++++-
net/ipv4/tcp_input.c | 13 +-
net/ipv4/tcp_ipv4.c | 9 +
net/ipv4/tcp_output.c | 5 +-
net/packet/af_packet.c | 4 +-
tools/include/uapi/linux/netdev.h | 19 +
tools/testing/selftests/net/.gitignore | 1 +
tools/testing/selftests/net/Makefile | 5 +
tools/testing/selftests/net/ncdevmem.c | 489 ++++++++++++++++++++++++
39 files changed, 2527 insertions(+), 234 deletions(-)
create mode 100644 Documentation/networking/devmem.rst
create mode 100644 include/net/devmem.h
create mode 100644 tools/testing/selftests/net/ncdevmem.c
--
2.43.0.472.g3155946c3a-goog
From: Jeff Xu <jeffxu(a)chromium.org>
This is the V9 version, which removes MAP_SEALABLE and PROT_SEAL
from mmap(), adds a performance benchmark, and adds a test
to demo the sealing of the read-only memory segment of an ELF mapping.
------------------------------------------------------------------
This patchset proposes a new mseal() syscall for the Linux kernel.
In a nutshell, mseal() protects the VMAs of a given virtual memory
range against modifications, such as changes to their permission bits.
Modern CPUs support memory permissions, such as the read/write (RW)
and no-execute (NX) bits. Linux has supported NX since the release of
kernel version 2.6.8 in August 2004 [1]. The memory permission feature
improves the security stance on memory corruption bugs, as an attacker
cannot simply write to arbitrary memory and point the code to it. The
memory must be marked with the X bit, or else an exception will occur.
Internally, the kernel maintains the memory permissions in a data
structure called VMA (vm_area_struct). mseal() additionally protects
the VMA itself against modifications of the selected seal type.
Memory sealing is useful to mitigate memory corruption issues where a
corrupted pointer is passed to a memory management system. For
example, such an attacker primitive can break control-flow integrity
guarantees since read-only memory that is supposed to be trusted can
become writable or .text pages can get remapped. Memory sealing can
automatically be applied by the runtime loader to seal .text and
.rodata pages and applications can additionally seal security critical
data at runtime. A similar feature already exists in the XNU kernel
with the VM_FLAGS_PERMANENT [3] flag and on OpenBSD with the
mimmutable syscall [4]. Also, Chrome wants to adopt this feature for
their CFI work [2] and this patchset has been designed to be
compatible with the Chrome use case.
Two system calls are involved in sealing the map: mmap() and mseal().
The new mseal() is a syscall on 64-bit CPUs, with the
following signature:
int mseal(void *addr, size_t len, unsigned long flags)
addr/len: the memory range.
flags: reserved.
mseal() blocks the following operations for the given memory range (a
usage sketch follows the list).
1> Unmapping, moving to another location, and shrinking the size,
via munmap() and mremap(): these can leave an empty space that could
then be filled by a VMA with a new set of attributes.
2> Moving or expanding a different VMA into the current location,
via mremap().
3> Modifying a VMA via mmap(MAP_FIXED).
4> Size expansion, via mremap(), does not appear to pose any specific
risks to sealed VMAs. It is included anyway because the use case is
unclear. In any case, users can rely on merging to expand a sealed VMA.
5> mprotect() and pkey_mprotect().
6> Some destructive madvise() behaviors (e.g. MADV_DONTNEED) for anonymous
memory, when users don't have write permission to the memory. Those
behaviors can alter region contents by discarding pages, effectively a
memset(0) for anonymous memory.
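To make the blocked-operation semantics concrete, a minimal usage
sketch (EPERM for blocked operations per the V7 changelog below; the
raw syscall number is an assumption, since glibc has no wrapper yet):

#include <stdio.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef __NR_mseal
#define __NR_mseal 462          /* assumed syscall number */
#endif

int main(void)
{
        size_t len = 4096;
        void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);

        /* Seal the VMA: its layout and permissions are now fixed. */
        syscall(__NR_mseal, p, len, 0);

        /* Both of these now fail with EPERM on the sealed range. */
        if (mprotect(p, len, PROT_READ))
                perror("mprotect");     /* EPERM */
        if (munmap(p, len))
                perror("munmap");       /* EPERM */
        return 0;
}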
The idea that inspired this patch comes from Stephen Röttger’s work in
V8 CFI [5]. Chrome browser in ChromeOS will be the first user of this
API.
Indeed, the Chrome browser has very specific requirements for sealing,
which are distinct from those of most applications. For example, in
the case of libc, sealing is only applied to read-only (RO) or
read-execute (RX) memory segments (such as .text and .RELRO) to
prevent them from becoming writable; the lifetime of those mappings
is tied to the lifetime of the process.
Chrome wants to seal two large address space reservations that are
managed by different allocators. The memory is mapped RW- and RWX
respectively but write access to it is restricted using pkeys (or in
the future ARM permission overlay extensions). The lifetime of those
mappings is not tied to the lifetime of the process; therefore, while
the memory is sealed, the allocators still need to free or discard the
unused memory, for example with madvise(DONTNEED).
However, always allowing madvise(DONTNEED) on this range poses a
security risk. For example if a jump instruction crosses a page
boundary and the second page gets discarded, it will overwrite the
target bytes with zeros and change the control flow. Checking
write-permission before the discard operation allows us to control
when the operation is valid. In this case, the madvise will only
succeed if the executing thread has PKEY write permissions and PKRU
changes are protected in software by control-flow integrity.
Although the initial version of this patch series is targeting the
Chrome browser as its first user, it became evident during upstream
discussions that we would also want to ensure that the patch set
eventually is a complete solution for memory sealing and compatible
with other use cases. The specific scenario currently in mind is
glibc's use case of loading and sealing ELF executables. To this end,
Stephen is working on a change to glibc to add sealing support to the
dynamic linker, which will seal all non-writable segments at startup.
Once this work is completed, all applications will be able to
automatically benefit from these new protections.
In closing, I would like to formally acknowledge the valuable
contributions received during the RFC process, which were instrumental
in shaping this patch:
Jann Horn: raising awareness and providing valuable insights on the
destructive madvise operations.
Liam R. Howlett: perf optimization.
Linus Torvalds: assisting in defining system call signature and scope.
Theo de Raadt: sharing the experiences and insight gained from
implementing mimmutable() in OpenBSD.
MM perf benchmarks
==================
This patch adds a loop in mprotect/munmap/madvise(DONTNEED) to
check the VMAs’ sealing flag, so that no partial update can be made
when any segment within the given memory range is sealed.
To measure the performance impact of this loop, two tests were developed
[8].
The first test measures the time taken by a particular system call,
using clock_gettime(CLOCK_MONOTONIC). The second uses
PERF_COUNT_HW_REF_CPU_CYCLES (excluding user space). Both tests show
similar results.
The tests roughly follow the sequence below:

for (i = 0; i < 1000; i++) {
        create 1000 mappings (1 page per VMA)
        start the sampling
        for (j = 0; j < 1000; j++)
                mprotect one mapping
        stop and save the sample
        delete 1000 mappings
}
calculate all samples.
The tests below were performed on an Intel(R) Pentium(R) Gold 7505 @
2.00GHz, 4G memory Chromebook.
Based on the latest upstream code:
The first test (measuring time)
syscall__ vmas t t_mseal delta_ns per_vma %
munmap__ 1 909 944 35 35 104%
munmap__ 2 1398 1502 104 52 107%
munmap__ 4 2444 2594 149 37 106%
munmap__ 8 4029 4323 293 37 107%
munmap__ 16 6647 6935 288 18 104%
munmap__ 32 11811 12398 587 18 105%
mprotect 1 439 465 26 26 106%
mprotect 2 1659 1745 86 43 105%
mprotect 4 3747 3889 142 36 104%
mprotect 8 6755 6969 215 27 103%
mprotect 16 13748 14144 396 25 103%
mprotect 32 27827 28969 1142 36 104%
madvise_ 1 240 262 22 22 109%
madvise_ 2 366 442 76 38 121%
madvise_ 4 623 751 128 32 121%
madvise_ 8 1110 1324 215 27 119%
madvise_ 16 2127 2451 324 20 115%
madvise_ 32 4109 4642 534 17 113%
The second test (measuring cpu cycle)
syscall__ vmas cpu cmseal delta_cpu per_vma %
munmap__ 1 1790 1890 100 100 106%
munmap__ 2 2819 3033 214 107 108%
munmap__ 4 4959 5271 312 78 106%
munmap__ 8 8262 8745 483 60 106%
munmap__ 16 13099 14116 1017 64 108%
munmap__ 32 23221 24785 1565 49 107%
mprotect 1 906 967 62 62 107%
mprotect 2 3019 3203 184 92 106%
mprotect 4 6149 6569 420 105 107%
mprotect 8 9978 10524 545 68 105%
mprotect 16 20448 21427 979 61 105%
mprotect 32 40972 42935 1963 61 105%
madvise_ 1 434 497 63 63 115%
madvise_ 2 752 899 147 74 120%
madvise_ 4 1313 1513 200 50 115%
madvise_ 8 2271 2627 356 44 116%
madvise_ 16 4312 4883 571 36 113%
madvise_ 32 8376 9319 943 29 111%
Based on the results, for the 6.8 kernel, the sealing check adds
20-40 nanoseconds, or around 50-100 CPU cycles, per VMA.
In addition, I applied the sealing to the 5.10 kernel:
The first test (measuring time)
syscall__ vmas t t_mseal delta_ns per_vma %
munmap__ 1 357 390 33 33 109%
munmap__ 2 442 463 21 11 105%
munmap__ 4 614 634 20 5 103%
munmap__ 8 1017 1137 120 15 112%
munmap__ 16 1889 2153 263 16 114%
munmap__ 32 4109 4088 -21 -1 99%
mprotect 1 235 227 -7 -7 97%
mprotect 2 495 464 -30 -15 94%
mprotect 4 741 764 24 6 103%
mprotect 8 1434 1437 2 0 100%
mprotect 16 2958 2991 33 2 101%
mprotect 32 6431 6608 177 6 103%
madvise_ 1 191 208 16 16 109%
madvise_ 2 300 324 24 12 108%
madvise_ 4 450 473 23 6 105%
madvise_ 8 753 806 53 7 107%
madvise_ 16 1467 1592 125 8 108%
madvise_ 32 2795 3405 610 19 122%
The second test (measuring cpu cycle)
syscall__ vmas cpu cmseal delta_cpu per_vma %
munmap__ 1 684 715 31 31 105%
munmap__ 2 861 898 38 19 104%
munmap__ 4 1183 1235 51 13 104%
munmap__ 8 1999 2045 46 6 102%
munmap__ 16 3839 3816 -23 -1 99%
munmap__ 32 7672 7887 216 7 103%
mprotect 1 397 443 46 46 112%
mprotect 2 738 788 50 25 107%
mprotect 4 1221 1256 35 9 103%
mprotect 8 2356 2429 72 9 103%
mprotect 16 4961 4935 -26 -2 99%
mprotect 32 9882 10172 291 9 103%
madvise_ 1 351 380 29 29 108%
madvise_ 2 565 615 49 25 109%
madvise_ 4 872 933 61 15 107%
madvise_ 8 1508 1640 132 16 109%
madvise_ 16 3078 3323 245 15 108%
madvise_ 32 5893 6704 811 25 114%
For the 5.10 kernel, the sealing check adds 0-15 ns in time, or 10-30
CPU cycles; there is even a decrease in some cases.
It might be interesting to compare the 5.10 and 6.8 kernels.
The first test (measuring time)
syscall__ vmas t_5_10 t_6_8 delta_ns per_vma %
munmap__ 1 357 909 552 552 254%
munmap__ 2 442 1398 956 478 316%
munmap__ 4 614 2444 1830 458 398%
munmap__ 8 1017 4029 3012 377 396%
munmap__ 16 1889 6647 4758 297 352%
munmap__ 32 4109 11811 7702 241 287%
mprotect 1 235 439 204 204 187%
mprotect 2 495 1659 1164 582 335%
mprotect 4 741 3747 3006 752 506%
mprotect 8 1434 6755 5320 665 471%
mprotect 16 2958 13748 10790 674 465%
mprotect 32 6431 27827 21397 669 433%
madvise_ 1 191 240 49 49 125%
madvise_ 2 300 366 67 33 122%
madvise_ 4 450 623 173 43 138%
madvise_ 8 753 1110 357 45 147%
madvise_ 16 1467 2127 660 41 145%
madvise_ 32 2795 4109 1314 41 147%
The second test (measuring cpu cycle)
syscall__ vmas cpu_5_10 c_6_8 delta_cpu per_vma %
munmap__ 1 684 1790 1106 1106 262%
munmap__ 2 861 2819 1958 979 327%
munmap__ 4 1183 4959 3776 944 419%
munmap__ 8 1999 8262 6263 783 413%
munmap__ 16 3839 13099 9260 579 341%
munmap__ 32 7672 23221 15549 486 303%
mprotect 1 397 906 509 509 228%
mprotect 2 738 3019 2281 1140 409%
mprotect 4 1221 6149 4929 1232 504%
mprotect 8 2356 9978 7622 953 423%
mprotect 16 4961 20448 15487 968 412%
mprotect 32 9882 40972 31091 972 415%
madvise_ 1 351 434 82 82 123%
madvise_ 2 565 752 186 93 133%
madvise_ 4 872 1313 442 110 151%
madvise_ 8 1508 2271 763 95 151%
madvise_ 16 3078 4312 1234 77 140%
madvise_ 32 5893 8376 2483 78 142%
From 5.10 to 6.8:
munmap: added 250-550 ns in time, or 500-1100 CPU cycles, per VMA.
mprotect: added 200-750 ns in time, or 500-1200 CPU cycles, per VMA.
madvise: added 33-50 ns in time, or 70-110 CPU cycles, per VMA.
In comparison to mseal, which adds 20-40 ns or 50-100 CPU cycles, the
increase from 5.10 to 6.8 is significantly larger, approximately ten
times greater for munmap and mprotect.
When I discussed mm performance with Brian Makin, an engineer who worked
on performance, it was brought to my attention that such performance
benchmarks, which measure millions of mm syscalls in a tight loop, may
not accurately reflect real-world scenarios, such as that of a database
service. Also, this was tested on a single piece of hardware with
ChromeOS; the data from other hardware or distributions might differ.
It might be best to take this data with a grain of salt.
Change history:
===============
V9:
- remove mmap(PROT_SEAL) and mmap(MAP_SEALABLE) (Linus, Theo de Raadt)
- Update mseal_test to check for prot bit (Liam R. Howlett)
- Update documentation to give more detail on sealing check (Liam R. Howlett)
- Add seal_elf test.
- Add performance measure data.
- mseal_test: fix arm build.
V8:
- perf optimization in mmap. (Liam R. Howlett)
- add one testcase (test_seal_zero_address)
- Update mseal.rst to add note for MAP_SEALABLE.
https://lore.kernel.org/lkml/20240131175027.3287009-1-jeffxu@chromium.org/
V7:
- fix index.rst (Randy Dunlap)
- fix arm build (Randy Dunlap)
- return EPERM for blocked operations (Theo de Raadt)
https://lore.kernel.org/linux-mm/20240122152905.2220849-2-jeffxu@chromium.o…
V6:
- Drop RFC from subject, Given Linus's general approval.
- Adjust syscall number for mseal (main Jan.11/2024)
- Code style fix (Matthew Wilcox)
- selftest: use ksft macros (Muhammad Usama Anjum)
- Document fix. (Randy Dunlap)
https://lore.kernel.org/all/20240111234142.2944934-1-jeffxu@chromium.org/
V5:
- fix build issue in mseal-Wire-up-mseal-syscall
(Suggested by Linus Torvalds, and Greg KH)
- updates on selftest.
https://lore.kernel.org/lkml/20240109154547.1839886-1-jeffxu@chromium.org/#r
V4:
(Suggested by Linus Torvalds)
- new signature: mseal(start,len,flags)
- 32 bit is not supported. vm_seal is removed, use vm_flags instead.
- single bit in vm_flags for sealed state.
- CONFIG_MSEAL kernel config is removed.
- single bit of PROT_SEAL in the "Prot" field of mmap().
Other changes:
- update selftest (Suggested by Muhammad Usama Anjum)
- update documentation.
https://lore.kernel.org/all/20240104185138.169307-1-jeffxu@chromium.org/
V3:
- Abandon per-syscall approach, (Suggested by Linus Torvalds).
- Organize sealing types around their functionality, such as
MM_SEAL_BASE, MM_SEAL_PROT_PKEY.
- Extend the scope of sealing from calls originated in userspace to
both kernel and userspace. (Suggested by Linus Torvalds)
- Add seal type support in mmap(). (Suggested by Pedro Falcato)
- Add a new sealing type: MM_SEAL_DISCARD_RO_ANON to prevent
destructive operations of madvise. (Suggested by Jann Horn and
Stephen Röttger)
- Make sealed VMAs mergeable. (Suggested by Jann Horn)
- Add MAP_SEALABLE to mmap()
- Add documentation - mseal.rst
https://lore.kernel.org/linux-mm/20231212231706.2680890-2-jeffxu@chromium.o…
v2:
Use _BITUL to define MM_SEAL_XX type.
Use unsigned long for seal type in sys_mseal() and other functions.
Remove internal VM_SEAL_XX type and convert_user_seal_type().
Remove MM_ACTION_XX type.
Remove caller_origin(ON_BEHALF_OF_XX) and replace with sealing bitmask.
Add more comments in code.
Add a detailed commit message.
https://lore.kernel.org/lkml/20231017090815.1067790-1-jeffxu@chromium.org/
v1:
https://lore.kernel.org/lkml/20231016143828.647848-1-jeffxu@chromium.org/
----------------------------------------------------------------
[1] https://kernelnewbies.org/Linux_2_6_8
[2] https://v8.dev/blog/control-flow-integrity
[3] https://github.com/apple-oss-distributions/xnu/blob/1031c584a5e37aff177559b…
[4] https://man.openbsd.org/mimmutable.2
[5] https://docs.google.com/document/d/1O2jwK4dxI3nRcOJuPYkonhTkNQfbmwdvxQMyXge…
[6] https://lore.kernel.org/lkml/CAG48ez3ShUYey+ZAFsU2i1RpQn0a5eOs2hzQ426Fkcgnf…
[7] https://lore.kernel.org/lkml/20230515130553.2311248-1-jeffxu@chromium.org/
[8] https://github.com/peaktocreek/mmperf
Jeff Xu (5):
mseal: Wire up mseal syscall
mseal: add mseal syscall
selftest mm/mseal memory sealing
mseal:add documentation
selftest mm/mseal read-only elf memory segment
Documentation/userspace-api/index.rst | 1 +
Documentation/userspace-api/mseal.rst | 199 ++
arch/alpha/kernel/syscalls/syscall.tbl | 1 +
arch/arm/tools/syscall.tbl | 1 +
arch/arm64/include/asm/unistd.h | 2 +-
arch/arm64/include/asm/unistd32.h | 2 +
arch/m68k/kernel/syscalls/syscall.tbl | 1 +
arch/microblaze/kernel/syscalls/syscall.tbl | 1 +
arch/mips/kernel/syscalls/syscall_n32.tbl | 1 +
arch/mips/kernel/syscalls/syscall_n64.tbl | 1 +
arch/mips/kernel/syscalls/syscall_o32.tbl | 1 +
arch/parisc/kernel/syscalls/syscall.tbl | 1 +
arch/powerpc/kernel/syscalls/syscall.tbl | 1 +
arch/s390/kernel/syscalls/syscall.tbl | 1 +
arch/sh/kernel/syscalls/syscall.tbl | 1 +
arch/sparc/kernel/syscalls/syscall.tbl | 1 +
arch/x86/entry/syscalls/syscall_32.tbl | 1 +
arch/x86/entry/syscalls/syscall_64.tbl | 1 +
arch/xtensa/kernel/syscalls/syscall.tbl | 1 +
include/linux/syscalls.h | 1 +
include/uapi/asm-generic/unistd.h | 5 +-
kernel/sys_ni.c | 1 +
mm/Makefile | 4 +
mm/internal.h | 37 +
mm/madvise.c | 12 +
mm/mmap.c | 31 +-
mm/mprotect.c | 10 +
mm/mremap.c | 31 +
mm/mseal.c | 307 ++++
tools/testing/selftests/mm/.gitignore | 2 +
tools/testing/selftests/mm/Makefile | 2 +
tools/testing/selftests/mm/mseal_test.c | 1836 +++++++++++++++++++
tools/testing/selftests/mm/seal_elf.c | 183 ++
33 files changed, 2678 insertions(+), 3 deletions(-)
create mode 100644 Documentation/userspace-api/mseal.rst
create mode 100644 mm/mseal.c
create mode 100644 tools/testing/selftests/mm/mseal_test.c
create mode 100644 tools/testing/selftests/mm/seal_elf.c
--
2.43.0.687.g38aa6559b0-goog