Hi Linus,
Please pull the following Kselftest fixes update for Linux 6.6-rc7.
This Kselftest update for Linux 6.6-rc7 consists of a single fix
to the assert check in the user_events abi_test so that it properly
checks bit values on Big Endian architectures. The current code treats
the bit values as Little Endian, so the check fails on Big Endian.
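As an illustration (a minimal userspace sketch, not the actual abi_test
change): a bit set in a 32-bit word lands in a different byte depending
on the architecture's endianness, so a byte-indexed assert written with
Little Endian in mind fails on Big Endian:

	#include <stdint.h>
	#include <stdio.h>

	int main(void)
	{
		uint32_t word = 1u << 0;	/* e.g. an "enabled" bit */
		uint8_t *bytes = (uint8_t *)&word;

		/* LE: bytes[0] == 1; BE: bytes[3] == 1, so asserting on
		 * bytes[0] only passes on Little Endian. Checking the
		 * word (or the documented bit) directly is portable. */
		printf("first byte: %u, bit 0 set: %d\n",
		       bytes[0], (word & 1) != 0);
		return 0;
	}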
diff is attached.
thanks,
-- Shuah
----------------------------------------------------------------
The following changes since commit 6f874fa021dfc7bf37f4f37da3a5aaa41fe9c39c:
selftests: Fix wrong TARGET in kselftest top level Makefile (2023-09-26 18:47:37 -0600)
are available in the Git repository at:
git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest tags/linux_kselftest_active-fixes-6.6-rc7
for you to fetch changes up to cf5a103c98a6fb9ee3164334cb5502df6360749b:
selftests/user_events: Fix abi_test for BE archs (2023-10-17 15:07:19 -0600)
----------------------------------------------------------------
linux_kselftest_active-fixes-6.6-rc7
This Kselftest update for Linux 6.6-rc7 consists of a single fix
to the assert check in the user_events abi_test so that it properly
checks bit values on Big Endian architectures. The current code treats
the bit values as Little Endian, so the check fails on Big Endian.
----------------------------------------------------------------
Beau Belgrave (1):
selftests/user_events: Fix abi_test for BE archs
tools/testing/selftests/user_events/abi_test.c | 16 +++++++++-------
1 file changed, 9 insertions(+), 7 deletions(-)
----------------------------------------------------------------
Changelog:
v3:
* Add a patch to export per-cgroup zswap writeback counters
* Add a patch to update zswap's kselftest
* Separate the new list_lru functions into its own prep patch
* Do not start from the top of the hierarchy when encountering a memcg
that is not online during global-limit zswap writeback (patch 2)
(suggested by Yosry Ahmed)
* Do not remove the swap entry from list_lru in
__read_swapcache_async() (patch 2) (suggested by Yosry Ahmed)
* Remove a redundant zswap pool get (patch 2)
(reported by Ryan Roberts)
* Use atomic for the nr_zswap_protected (instead of lruvec's lock)
(patch 5) (suggested by Yosry Ahmed)
* Remove the per-cgroup zswap shrinker knob (patch 5)
(suggested by Yosry Ahmed)
v2:
* Fix LoongArch compiler errors
* Use pool stats instead of memcg stats when !CONFIG_MEMCG_KMEM
There are currently several issues with zswap writeback:
1. There is only a single global LRU for zswap, making it impossible to
perform workload-specific shrinking - a memcg under memory pressure
cannot determine which pages in the pool it owns, and often ends up
writing pages from other memcgs. This issue has been previously
observed in practice and mitigated by simply disabling
memcg-initiated shrinking:
https://lore.kernel.org/all/20230530232435.3097106-1-nphamcs@gmail.com/T/#u
But this solution leaves a lot to be desired, as we still do not
have an avenue for a memcg to free up its own memory locked up in
the zswap pool.
2. We only shrink the zswap pool when the user-defined limit is hit.
This means that if we set the limit too high, cold data that are
unlikely to be used again will reside in the pool, wasting precious
memory. It is hard to predict how much zswap space will be needed
ahead of time, as this depends on the workload (specifically, on
factors such as memory access patterns and compressibility of the
memory pages).
This patch series solves these issues by separating the global zswap
LRU into per-memcg and per-NUMA LRUs, and performing workload-specific
(i.e. memcg- and NUMA-aware) zswap writeback under memory pressure. The
new shrinker does not have any parameter that must be tuned by the
user, and can be opted in or out on a per-memcg basis.
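To make the structure concrete, here is a tiny userspace model of the
idea (hypothetical, with made-up names; it is not the series' code):
each entry is tracked per (memcg, NUMA node), so shrinking one cgroup
never walks another cgroup's entries.

	#include <stdio.h>

	#define NR_MEMCGS	3
	#define NR_NODES	2

	/* One LRU per (memcg, node); modelled here as a simple counter. */
	static unsigned long nr_entries[NR_MEMCGS][NR_NODES];

	static void zswap_lru_add(int memcg, int nid)
	{
		nr_entries[memcg][nid]++;	/* entry lands on its own LRU */
	}

	static void shrink_memcg_node(int memcg, int nid)
	{
		/* Reclaim considers only what this memcg owns on this node. */
		printf("memcg %d node %d: %lu entries eligible\n",
		       memcg, nid, nr_entries[memcg][nid]);
	}

	int main(void)
	{
		zswap_lru_add(0, 0);
		zswap_lru_add(1, 0);
		shrink_memcg_node(1, 0);	/* does not touch memcg 0 */
		return 0;
	}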
As a proof of concept, we ran the following synthetic benchmark:
build the Linux kernel in a memory-limited cgroup, and allocate some
cold data in tmpfs to see if the shrinker could write it out and
improve the overall performance. Depending on the amount of cold data
generated, we observe a 14% to 35% reduction in kernel CPU time used
in the kernel builds.
Domenico Cerasuolo (3):
zswap: make shrinking memcg-aware
mm: memcg: add per-memcg zswap writeback stat
selftests: cgroup: update per-memcg zswap writeback selftest
Nhat Pham (2):
mm: list_lru: allow external numa node and cgroup tracking
zswap: shrinks zswap pool based on memory pressure
Documentation/admin-guide/mm/zswap.rst | 7 +
include/linux/list_lru.h | 38 +++
include/linux/memcontrol.h | 7 +
include/linux/mmzone.h | 14 +
mm/list_lru.c | 43 ++-
mm/memcontrol.c | 15 +
mm/mmzone.c | 3 +
mm/swap.h | 3 +-
mm/swap_state.c | 38 ++-
mm/zswap.c | 335 ++++++++++++++++----
tools/testing/selftests/cgroup/test_zswap.c | 74 +++--
11 files changed, 485 insertions(+), 92 deletions(-)
--
2.34.1
From: Jeff Xu <jeffxu@google.com>
This patchset proposes a new mseal() syscall for the Linux kernel.
Modern CPUs support memory permissions such as RW and NX bits. Linux has
supported NX since the release of kernel version 2.6.8 in August 2004 [1].
The memory permission feature improves the security stance on memory
corruption bugs, i.e. the attacker can’t just write to arbitrary memory
and redirect execution to it; the memory has to be marked with the X
bit, or else an exception will happen. The protection is set by
mmap(2), mprotect(2), mremap(2).
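For example (a minimal sketch of the existing API, not part of this
series), an anonymous mapping created without PROT_EXEC only gains the
X bit through an explicit mprotect(2) call:

	#include <sys/mman.h>
	#include <stdio.h>

	int main(void)
	{
		/* Writable, non-executable anonymous mapping: data written
		 * here cannot be executed until the X bit is granted. */
		void *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
			       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
		if (p == MAP_FAILED)
			return 1;

		/* Granting execute permission is a deliberate, separate step. */
		if (mprotect(p, 4096, PROT_READ | PROT_EXEC) != 0)
			perror("mprotect");

		munmap(p, 4096);
		return 0;
	}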
Memory sealing additionally protects the mapping itself against
modifications. This is useful to mitigate memory corruption issues where
a corrupted pointer is passed to a memory management syscall. For example,
such an attacker primitive can break control-flow integrity guarantees
since read-only memory that is supposed to be trusted can become writable
or .text pages can get remapped. Memory sealing can automatically be
applied by the runtime loader to seal .text and .rodata pages and
applications can additionally seal security critical data at runtime.
A similar feature already exists in the XNU kernel with the
VM_FLAGS_PERMANENT [3] flag and on OpenBSD with the mimmutable syscall [4].
Also, Chrome wants to adopt this feature for its CFI work [2], and this
patchset has been designed to be compatible with the Chrome use case.
The new mseal() is an architecture-independent syscall with the
following signature:
mseal(void *addr, size_t len, unsigned long types, unsigned long flags)
addr/len: memory range. Must be contiguous, allocated memory, or else
mseal() will fail and no VMA is updated. For details on acceptable
arguments, please refer to the comments in mseal.c. Those are also fully
covered by the selftest.
types: bitmask specifying which syscalls to seal.
Five syscalls can be sealed, as specified by bitmasks:
MM_SEAL_MPROTECT: Deny mprotect(2)/pkey_mprotect(2).
MM_SEAL_MUNMAP: Deny munmap(2).
MM_SEAL_MMAP: Deny mmap(2).
MM_SEAL_MREMAP: Deny mremap(2).
MM_SEAL_MSEAL: Deny adding a new seal type.
Each bit represents sealing for one specific syscall type, e.g.
MM_SEAL_MPROTECT will deny the mprotect syscall. The rationale for a
bitmask is that the API is extendable, i.e. when needed, sealing can be
extended to madvise, mlock, etc. Backward compatibility is also easy.
The kernel will remember which seal types are applied, and the
application doesn’t need to repeat all existing seal types in the next
mseal() call. Once a seal type is applied, it can’t be unsealed. Calling
mseal() with an already-applied seal type is a no-op, not a failure.
MM_SEAL_MSEAL will deny mseal() calls that try to add a new seal type.
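A rough usage sketch (assuming the MM_SEAL_* constants and the
__NR_mseal syscall number from this series' uapi headers are available;
there is no libc wrapper, so syscall(2) is used directly):

	#include <sys/mman.h>
	#include <sys/syscall.h>
	#include <unistd.h>
	#include <linux/mman.h>		/* MM_SEAL_* from this series */

	int main(void)
	{
		size_t len = 4096;
		void *p = mmap(NULL, len, PROT_READ,
			       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
		if (p == MAP_FAILED)
			return 1;

		/* Seal the range against mprotect() and munmap(). */
		if (syscall(__NR_mseal, p, len,
			    MM_SEAL_MPROTECT | MM_SEAL_MUNMAP, 0) != 0)
			return 1;

		/* A later call may add more seal types without repeating the
		 * existing ones; already-applied types are simply a no-op. */
		syscall(__NR_mseal, p, len, MM_SEAL_MSEAL, 0);

		/* From here on, mprotect()/munmap() on [p, p+len) fail, and
		 * no new seal types can be added either. */
		return 0;
	}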
Internally, vm_area_struct gains a new field, vm_seals, to store the
bitmask.
For the affected syscalls, such as mprotect, a sealing check
(can_modify_mm) is added. This usually happens early in the syscall,
before any update is made to the VMAs. The effect is that if any of the
VMAs in the given address range fails the sealing check, none of the
VMAs will be updated.
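A simplified sketch of that check (illustrative only, not the actual
patch; it assumes the new vm_seals field and the existing maple-tree
VMA iterator):

	/* Walk every VMA overlapping [addr, addr + len) and refuse the
	 * whole operation if any of them carries the requested seal bit.
	 * Because this runs before any VMA is modified, the syscall is
	 * all-or-nothing with respect to sealing. */
	static bool can_modify_mm(struct mm_struct *mm, unsigned long addr,
				  unsigned long len, unsigned long seal_type)
	{
		struct vm_area_struct *vma;
		VMA_ITERATOR(vmi, mm, addr);

		for_each_vma_range(vmi, vma, addr + len)
			if (vma->vm_seals & seal_type)
				return false;	/* sealed: bail out early */

		return true;
	}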
The idea that inspired this patch comes from Stephen Röttger’s work on
V8 CFI [5]; the Chrome browser in ChromeOS will be the first user of
this API.
[1] https://kernelnewbies.org/Linux_2_6_8
[2] https://v8.dev/blog/control-flow-integrity
[3] https://github.com/apple-oss-distributions/xnu/blob/1031c584a5e37aff177559b…
[4] https://man.openbsd.org/mimmutable.2
[5] https://docs.google.com/document/d/1O2jwK4dxI3nRcOJuPYkonhTkNQfbmwdvxQMyXge…
PATCH history:
v1:
Use _BITUL to define MM_SEAL_XX type.
Use unsigned long for seal type in sys_mseal() and other functions.
Remove internal VM_SEAL_XX type and convert_user_seal_type().
Remove MM_ACTION_XX type.
Remove caller_origin(ON_BEHALF_OF_XX) and replace with sealing bitmask.
Add more comments in code.
Add detailed commit message.
v0:
https://lore.kernel.org/lkml/20231016143828.647848-1-jeffxu@chromium.org/
Jeff Xu (8):
mseal: Add mseal(2) syscall.
mseal: Wire up mseal syscall
mseal: add can_modify_mm and can_modify_vma
mseal: Check seal flag for mprotect(2)
mseal: Check seal flag for munmap(2)
mseal: Check seal flag for mremap(2)
mseal:Check seal flag for mmap(2)
selftest mm/mseal mprotect/munmap/mremap/mmap
arch/alpha/kernel/syscalls/syscall.tbl | 1 +
arch/arm/tools/syscall.tbl | 1 +
arch/arm64/include/asm/unistd.h | 2 +-
arch/arm64/include/asm/unistd32.h | 2 +
arch/ia64/kernel/syscalls/syscall.tbl | 1 +
arch/m68k/kernel/syscalls/syscall.tbl | 1 +
arch/microblaze/kernel/syscalls/syscall.tbl | 1 +
arch/mips/kernel/syscalls/syscall_n32.tbl | 1 +
arch/mips/kernel/syscalls/syscall_n64.tbl | 1 +
arch/mips/kernel/syscalls/syscall_o32.tbl | 1 +
arch/parisc/kernel/syscalls/syscall.tbl | 1 +
arch/powerpc/kernel/syscalls/syscall.tbl | 1 +
arch/s390/kernel/syscalls/syscall.tbl | 1 +
arch/sh/kernel/syscalls/syscall.tbl | 1 +
arch/sparc/kernel/syscalls/syscall.tbl | 1 +
arch/x86/entry/syscalls/syscall_32.tbl | 1 +
arch/x86/entry/syscalls/syscall_64.tbl | 1 +
arch/xtensa/kernel/syscalls/syscall.tbl | 1 +
fs/aio.c | 5 +-
include/linux/mm.h | 44 +-
include/linux/mm_types.h | 7 +
include/linux/syscalls.h | 2 +
include/uapi/asm-generic/unistd.h | 5 +-
include/uapi/linux/mman.h | 6 +
ipc/shm.c | 3 +-
kernel/sys_ni.c | 1 +
mm/Kconfig | 8 +
mm/Makefile | 1 +
mm/internal.h | 4 +-
mm/mmap.c | 57 +-
mm/mprotect.c | 15 +
mm/mremap.c | 30 +-
mm/mseal.c | 268 ++++
mm/nommu.c | 6 +-
mm/util.c | 8 +-
tools/testing/selftests/mm/Makefile | 1 +
tools/testing/selftests/mm/mseal_test.c | 1428 +++++++++++++++++++
37 files changed, 1891 insertions(+), 28 deletions(-)
create mode 100644 mm/mseal.c
create mode 100644 tools/testing/selftests/mm/mseal_test.c
--
2.42.0.655.g421f12c284-goog