Hi,
This patch series fixes a corner-case with non-overlapping access rights
coming from different layers. This is now handled in a generic way and
verified with new tests. A stricter check is enforced for
landlock_add_rule(2) to forbid useless rules. Finally, the previous
landlock_enforce_ruleset_self(2) is renamed to
landlock_restrict_self(2), which is more consistent.
The SLOC count is 1314 for security/landlock/ and 2484 for
tools/testing/selftest/landlock/ . Test coverage for security/landlock/
is 94.7% of lines. The code not covered only deals with internal kernel
errors (e.g. memory allocation) and race conditions. This series is
being fuzzed by syzkaller, and patches are on their way:
https://github.com/google/syzkaller/pull/2380
The compiled documentation is available here:
https://landlock.io/linux-doc/landlock-v28/userspace-api/landlock.html
This series can be applied on top of v5.11-rc6 . This can be tested
with CONFIG_SECURITY_LANDLOCK, CONFIG_SAMPLE_LANDLOCK and by prepending
"landlock," to CONFIG_LSM. This patch series can be found in a Git
repository here:
https://github.com/landlock-lsm/linux/commits/landlock-v28
This patch series seems ready for upstream and I would really appreciate
final reviews.
# Landlock LSM
The goal of Landlock is to enable to restrict ambient rights (e.g.
global filesystem access) for a set of processes. Because Landlock is a
stackable LSM [1], it makes possible to create safe security sandboxes
as new security layers in addition to the existing system-wide
access-controls. This kind of sandbox is expected to help mitigate the
security impact of bugs or unexpected/malicious behaviors in user-space
applications. Landlock empowers any process, including unprivileged
ones, to securely restrict themselves.
Landlock is inspired by seccomp-bpf but instead of filtering syscalls
and their raw arguments, a Landlock rule can restrict the use of kernel
objects like file hierarchies, according to the kernel semantic.
Landlock also takes inspiration from other OS sandbox mechanisms: XNU
Sandbox, FreeBSD Capsicum or OpenBSD Pledge/Unveil.
In this current form, Landlock misses some access-control features.
This enables to minimize this patch series and ease review. This series
still addresses multiple use cases, especially with the combined use of
seccomp-bpf: applications with built-in sandboxing, init systems,
security sandbox tools and security-oriented APIs [2].
Previous version:
https://lore.kernel.org/lkml/20210121205119.793296-1-mic@digikod.net/
[1] https://lore.kernel.org/lkml/50db058a-7dde-441b-a7f9-f6837fe8b69f@schaufler…
[2] https://lore.kernel.org/lkml/f646e1c7-33cf-333f-070c-0a40ad0468cd@digikod.n…
Casey Schaufler (1):
LSM: Infrastructure management of the superblock
Mickaël Salaün (11):
landlock: Add object management
landlock: Add ruleset and domain management
landlock: Set up the security framework and manage credentials
landlock: Add ptrace restrictions
fs,security: Add sb_delete hook
landlock: Support filesystem access-control
landlock: Add syscall implementations
arch: Wire up Landlock syscalls
selftests/landlock: Add user space tests
samples/landlock: Add a sandbox manager example
landlock: Add user and kernel documentation
Documentation/security/index.rst | 1 +
Documentation/security/landlock.rst | 79 +
Documentation/userspace-api/index.rst | 1 +
Documentation/userspace-api/landlock.rst | 307 ++
MAINTAINERS | 15 +
arch/Kconfig | 7 +
arch/alpha/kernel/syscalls/syscall.tbl | 3 +
arch/arm/tools/syscall.tbl | 3 +
arch/arm64/include/asm/unistd.h | 2 +-
arch/arm64/include/asm/unistd32.h | 6 +
arch/ia64/kernel/syscalls/syscall.tbl | 3 +
arch/m68k/kernel/syscalls/syscall.tbl | 3 +
arch/microblaze/kernel/syscalls/syscall.tbl | 3 +
arch/mips/kernel/syscalls/syscall_n32.tbl | 3 +
arch/mips/kernel/syscalls/syscall_n64.tbl | 3 +
arch/mips/kernel/syscalls/syscall_o32.tbl | 3 +
arch/parisc/kernel/syscalls/syscall.tbl | 3 +
arch/powerpc/kernel/syscalls/syscall.tbl | 3 +
arch/s390/kernel/syscalls/syscall.tbl | 3 +
arch/sh/kernel/syscalls/syscall.tbl | 3 +
arch/sparc/kernel/syscalls/syscall.tbl | 3 +
arch/um/Kconfig | 1 +
arch/x86/entry/syscalls/syscall_32.tbl | 3 +
arch/x86/entry/syscalls/syscall_64.tbl | 3 +
arch/xtensa/kernel/syscalls/syscall.tbl | 3 +
fs/super.c | 1 +
include/linux/lsm_hook_defs.h | 1 +
include/linux/lsm_hooks.h | 3 +
include/linux/security.h | 4 +
include/linux/syscalls.h | 7 +
include/uapi/asm-generic/unistd.h | 8 +-
include/uapi/linux/landlock.h | 128 +
kernel/sys_ni.c | 5 +
samples/Kconfig | 7 +
samples/Makefile | 1 +
samples/landlock/.gitignore | 1 +
samples/landlock/Makefile | 13 +
samples/landlock/sandboxer.c | 238 ++
security/Kconfig | 11 +-
security/Makefile | 2 +
security/landlock/Kconfig | 21 +
security/landlock/Makefile | 4 +
security/landlock/common.h | 20 +
security/landlock/cred.c | 46 +
security/landlock/cred.h | 58 +
security/landlock/fs.c | 627 ++++
security/landlock/fs.h | 56 +
security/landlock/limits.h | 21 +
security/landlock/object.c | 67 +
security/landlock/object.h | 91 +
security/landlock/ptrace.c | 120 +
security/landlock/ptrace.h | 14 +
security/landlock/ruleset.c | 473 +++
security/landlock/ruleset.h | 165 +
security/landlock/setup.c | 40 +
security/landlock/setup.h | 18 +
security/landlock/syscalls.c | 444 +++
security/security.c | 51 +-
security/selinux/hooks.c | 58 +-
security/selinux/include/objsec.h | 6 +
security/selinux/ss/services.c | 3 +-
security/smack/smack.h | 6 +
security/smack/smack_lsm.c | 35 +-
tools/testing/selftests/Makefile | 1 +
tools/testing/selftests/landlock/.gitignore | 2 +
tools/testing/selftests/landlock/Makefile | 24 +
tools/testing/selftests/landlock/base_test.c | 219 ++
tools/testing/selftests/landlock/common.h | 169 ++
tools/testing/selftests/landlock/config | 6 +
tools/testing/selftests/landlock/fs_test.c | 2664 +++++++++++++++++
.../testing/selftests/landlock/ptrace_test.c | 314 ++
tools/testing/selftests/landlock/true.c | 5 +
72 files changed, 6668 insertions(+), 77 deletions(-)
create mode 100644 Documentation/security/landlock.rst
create mode 100644 Documentation/userspace-api/landlock.rst
create mode 100644 include/uapi/linux/landlock.h
create mode 100644 samples/landlock/.gitignore
create mode 100644 samples/landlock/Makefile
create mode 100644 samples/landlock/sandboxer.c
create mode 100644 security/landlock/Kconfig
create mode 100644 security/landlock/Makefile
create mode 100644 security/landlock/common.h
create mode 100644 security/landlock/cred.c
create mode 100644 security/landlock/cred.h
create mode 100644 security/landlock/fs.c
create mode 100644 security/landlock/fs.h
create mode 100644 security/landlock/limits.h
create mode 100644 security/landlock/object.c
create mode 100644 security/landlock/object.h
create mode 100644 security/landlock/ptrace.c
create mode 100644 security/landlock/ptrace.h
create mode 100644 security/landlock/ruleset.c
create mode 100644 security/landlock/ruleset.h
create mode 100644 security/landlock/setup.c
create mode 100644 security/landlock/setup.h
create mode 100644 security/landlock/syscalls.c
create mode 100644 tools/testing/selftests/landlock/.gitignore
create mode 100644 tools/testing/selftests/landlock/Makefile
create mode 100644 tools/testing/selftests/landlock/base_test.c
create mode 100644 tools/testing/selftests/landlock/common.h
create mode 100644 tools/testing/selftests/landlock/config
create mode 100644 tools/testing/selftests/landlock/fs_test.c
create mode 100644 tools/testing/selftests/landlock/ptrace_test.c
create mode 100644 tools/testing/selftests/landlock/true.c
base-commit: 1048ba83fb1c00cd24172e23e8263972f6b5d9ac
--
2.30.0
Hi,
This patch series introduces the futex2 syscalls.
* What happened to the current futex()?
For some years now, developers have been trying to add new features to
futex, but maintainers have been reluctant to accept then, given the
multiplexed interface full of legacy features and tricky to do big
changes. Some problems that people tried to address with patchsets are:
NUMA-awareness[0], smaller sized futexes[1], wait on multiple futexes[2].
NUMA, for instance, just doesn't fit the current API in a reasonable
way. Considering that, it's not possible to merge new features into the
current futex.
** The NUMA problem
At the current implementation, all futex kernel side infrastructure is
stored on a single node. Given that, all futex() calls issued by
processors that aren't located on that node will have a memory access
penalty when doing it.
** The 32bit sized futex problem
Embedded systems or anything with memory constrains would benefit of
using smaller sizes for the futex userspace integer. Also, a mutex
implementation can be done using just three values, so 8 bits is enough
for various scenarios.
** The wait on multiple problem
The use case lies in the Wine implementation of the Windows NT interface
WaitMultipleObjects. This Windows API function allows a thread to sleep
waiting on the first of a set of event sources (mutexes, timers, signal,
console input, etc) to signal. Considering this is a primitive
synchronization operation for Windows applications, being able to quickly
signal events on the producer side, and quickly go to sleep on the
consumer side is essential for good performance of those running
over Wine.
[0] https://lore.kernel.org/lkml/20160505204230.932454245@linutronix.de/
[1] https://lore.kernel.org/lkml/20191221155659.3159-2-malteskarupke@web.de/
[2] https://lore.kernel.org/lkml/20200213214525.183689-1-andrealmeid@collabora.…
* The solution
As proposed by Peter Zijlstra and Florian Weimer[3], a new interface
is required to solve this, which must be designed with those features in
mind. futex2() is that interface. As opposed to the current multiplexed
interface, the new one should have one syscall per operation. This will
allow the maintainability of the API if it gets extended, and will help
users with type checking of arguments.
In particular, the new interface is extended to support the ability to
wait on any of a list of futexes at a time, which could be seen as a
vectored extension of the FUTEX_WAIT semantics.
[3] https://lore.kernel.org/lkml/20200303120050.GC2596@hirez.programming.kicks-…
* The interface
The new interface can be seen in details in the following patches, but
this is a high level summary of what the interface can do:
- Supports wake/wait semantics, as in futex()
- Supports requeue operations, similarly as FUTEX_CMP_REQUEUE, but with
individual flags for each address
- Supports waiting for a vector of futexes, using a new syscall named
futex_waitv()
- Supports variable sized futexes (8bits, 16bits and 32bits)
- Supports NUMA-awareness operations, where the user can specify on
which memory node would like to operate
* Implementation
The internal implementation follows a similar design to the original futex.
Given that we want to replicate the same external behavior of current
futex, this should be somewhat expected. For some functions, like the
init and the code to get a shared key, I literally copied code and
comments from kernel/futex.c. I decided to do so instead of exposing the
original function as a public function since in that way we can freely
modify our implementation if required, without any impact on old futex.
Also, the comments precisely describes the details and corner cases of
the implementation.
Each patch contains a brief description of implementation, but patch 6
"docs: locking: futex2: Add documentation" adds a more complete document
about it.
* The patchset
This patchset can be also found at my git tree:
https://gitlab.collabora.com/tonyk/linux/-/tree/futex2
- Patch 1: Implements wait/wake, and the basics foundations of futex2
- Patches 2-4: Implement the remaining features (shared, waitv, requeue).
- Patch 5: Adds the x86_x32 ABI handling. I kept it in a separated
patch since I'm not sure if x86_x32 is still a thing, or if it should
return -ENOSYS.
- Patch 6: Add a documentation file which details the interface and
the internal implementation.
- Patches 7-13: Selftests for all operations along with perf
support for futex2.
- Patch 14: While working on porting glibc for futex2, I found out
that there's a futex_wake() call at the user thread exit path, if
that thread was created with clone(..., CLONE_CHILD_SETTID, ...). In
order to make pthreads work with futex2, it was required to add
this patch. Note that this is more a proof-of-concept of what we
will need to do in future, rather than part of the interface and
shouldn't be merged as it is.
* Testing:
This patchset provides selftests for each operation and their flags.
Along with that, the following work was done:
** Stability
To stress the interface in "real world scenarios":
- glibc[4]: nptl's low level locking was modified to use futex2 API
(except for robust and PI things). All relevant nptl/ tests passed.
- Wine[5]: Proton/Wine was modified in order to use futex2() for the
emulation of Windows NT sync mechanisms based on futex, called "fsync".
Triple-A games with huge CPU's loads and tons of parallel jobs worked
as expected when compared with the previous FUTEX_WAIT_MULTIPLE
implementation at futex(). Some games issue 42k futex2() calls
per second.
- Full GNU/Linux distro: I installed the modified glibc in my host
machine, so all pthread's programs would use futex2(). After tweaking
systemd[6] to allow futex2() calls at seccomp, everything worked as
expected (web browsers do some syscall sandboxing and need some
configuration as well).
- perf: The perf benchmarks tests can also be used to stress the
interface, and they can be found in this patchset.
** Performance
- For comparing futex() and futex2() performance, I used the artificial
benchmarks implemented at perf (wake, wake-parallel, hash and
requeue). The setup was 200 runs for each test and using 8, 80, 800,
8000 for the number of threads, Note that for this test, I'm not using
patch 14 ("kernel: Enable waitpid() for futex2") , for reasons explained
at "The patchset" section.
- For the first three ones, I measured an average of 4% gain in
performance. This is not a big step, but it shows that the new
interface is at least comparable in performance with the current one.
- For requeue, I measured an average of 21% decrease in performance
compared to the original futex implementation. This is expected given
the new design with individual flags. The performance trade-offs are
explained at patch 4 ("futex2: Implement requeue operation").
[4] https://gitlab.collabora.com/tonyk/glibc/-/tree/futex2
[5] https://gitlab.collabora.com/tonyk/wine/-/tree/proton_5.13
[6] https://gitlab.collabora.com/tonyk/systemd
* FAQ
** "Where's the code for NUMA and FUTEX_8/16?"
The current code is already complex enough to take some time for
review, so I believe it's better to split that work out to a future
iteration of this patchset. Besides that, this RFC is the core part of the
infrastructure, and the following features will not pose big design
changes to it, the work will be more about wiring up the flags and
modifying some functions.
** "And what's about FUTEX_64?"
By supporting 64 bit futexes, the kernel structure for futex would
need to have a 64 bit field for the value, and that could defeat one of
the purposes of having different sized futexes in the first place:
supporting smaller ones to decrease memory usage. This might be
something that could be disabled for 32bit archs (and even for
CONFIG_BASE_SMALL).
Which use case would benefit for FUTEX_64? Does it worth the trade-offs?
** "Where's the PI/robust stuff?"
As said by Peter Zijlstra at [3], all those new features are related to
the "simple" futex interface, that doesn't use PI or robust. Do we want
to have this complexity at futex2() and if so, should it be part of
this patchset or can it be future work?
Thanks,
André
André Almeida (13):
futex2: Implement wait and wake functions
futex2: Add support for shared futexes
futex2: Implement vectorized wait
futex2: Implement requeue operation
futex2: Add compatibility entry point for x86_x32 ABI
docs: locking: futex2: Add documentation
selftests: futex2: Add wake/wait test
selftests: futex2: Add timeout test
selftests: futex2: Add wouldblock test
selftests: futex2: Add waitv test
selftests: futex2: Add requeue test
perf bench: Add futex2 benchmark tests
kernel: Enable waitpid() for futex2
Documentation/locking/futex2.rst | 198 +++
Documentation/locking/index.rst | 1 +
MAINTAINERS | 2 +-
arch/arm/tools/syscall.tbl | 4 +
arch/arm64/include/asm/unistd.h | 2 +-
arch/arm64/include/asm/unistd32.h | 4 +
arch/x86/entry/syscalls/syscall_32.tbl | 4 +
arch/x86/entry/syscalls/syscall_64.tbl | 4 +
fs/inode.c | 1 +
include/linux/compat.h | 23 +
include/linux/fs.h | 1 +
include/linux/syscalls.h | 18 +
include/uapi/asm-generic/unistd.h | 14 +-
include/uapi/linux/futex.h | 56 +
init/Kconfig | 7 +
kernel/Makefile | 1 +
kernel/fork.c | 2 +
kernel/futex2.c | 1255 +++++++++++++++++
kernel/sys_ni.c | 6 +
tools/arch/x86/include/asm/unistd_64.h | 12 +
tools/include/uapi/asm-generic/unistd.h | 11 +-
.../arch/x86/entry/syscalls/syscall_64.tbl | 3 +
tools/perf/bench/bench.h | 4 +
tools/perf/bench/futex-hash.c | 24 +-
tools/perf/bench/futex-requeue.c | 57 +-
tools/perf/bench/futex-wake-parallel.c | 41 +-
tools/perf/bench/futex-wake.c | 37 +-
tools/perf/bench/futex.h | 47 +
tools/perf/builtin-bench.c | 18 +-
.../selftests/futex/functional/.gitignore | 3 +
.../selftests/futex/functional/Makefile | 8 +-
.../futex/functional/futex2_requeue.c | 164 +++
.../selftests/futex/functional/futex2_wait.c | 209 +++
.../selftests/futex/functional/futex2_waitv.c | 157 +++
.../futex/functional/futex_wait_timeout.c | 58 +-
.../futex/functional/futex_wait_wouldblock.c | 33 +-
.../testing/selftests/futex/functional/run.sh | 6 +
.../selftests/futex/include/futex2test.h | 121 ++
38 files changed, 2563 insertions(+), 53 deletions(-)
create mode 100644 Documentation/locking/futex2.rst
create mode 100644 kernel/futex2.c
create mode 100644 tools/testing/selftests/futex/functional/futex2_requeue.c
create mode 100644 tools/testing/selftests/futex/functional/futex2_wait.c
create mode 100644 tools/testing/selftests/futex/functional/futex2_waitv.c
create mode 100644 tools/testing/selftests/futex/include/futex2test.h
--
2.30.1
v1 by Uriel is here: [1].
Since it's been a while, I've dropped the Reviewed-By's.
It depended on commit 83c4e7a0363b ("KUnit: KASAN Integration") which
hadn't been merged yet, so that caused some kerfuffle with applying them
previously and the series was reverted.
This revives the series but makes the kunit_fail_current_test() function
take a format string and logs the file and line number of the failing
code, addressing Alan Maguire's comments on the previous version.
As a result, the patch that makes UBSAN errors was tweaked slightly to
include an error message.
v2 -> v3:
Fix kunit_fail_current_test() so it works w/ CONFIG_KUNIT=m
s/_/__ on the helper func to match others in test.c
[1] https://lore.kernel.org/linux-kselftest/20200806174326.3577537-1-urielguaja…
Uriel Guajardo (2):
kunit: support failure from dynamic analysis tools
kunit: ubsan integration
include/kunit/test-bug.h | 30 ++++++++++++++++++++++++++++++
lib/kunit/test.c | 37 +++++++++++++++++++++++++++++++++----
lib/ubsan.c | 3 +++
3 files changed, 66 insertions(+), 4 deletions(-)
create mode 100644 include/kunit/test-bug.h
base-commit: 1e0d27fce010b0a4a9e595506b6ede75934c31be
--
2.30.0.478.g8a0d178c01-goog
pipe() and socketpair() have different behavior wrt edge-triggered
read epoll, in that no event is generated when data is written into
a non-empty pipe, but an event is generated if socketpair() is used
instead.
This simple modification of the epoll2 testlet from
tools/testing/selftests/filesystems/epoll/epoll_wakeup_test.c
(it just adds a second write) shows the different behavior.
The testlet passes with pipe() but fails with socketpair() with 5.10.
They both fail with 4.19.
Is it fair to assume that 5.10 pipe's behavior is the correct one?
Thanks,
Francesco Ruggeri
/*
* t0
* | (ew)
* e0
* | (et)
* s0
*/
TEST(epoll2)
{
int efd;
int sfd[2];
struct epoll_event e;
ASSERT_EQ(socketpair(AF_UNIX, SOCK_STREAM, 0, sfd), 0);
//ASSERT_EQ(pipe(sfd), 0);
efd = epoll_create(1);
ASSERT_GE(efd, 0);
e.events = EPOLLIN | EPOLLET;
ASSERT_EQ(epoll_ctl(efd, EPOLL_CTL_ADD, sfd[0], &e), 0);
ASSERT_EQ(write(sfd[1], "w", 1), 1);
EXPECT_EQ(epoll_wait(efd, &e, 1, 0), 1);
ASSERT_EQ(write(sfd[1], "w", 1), 1);
EXPECT_EQ(epoll_wait(efd, &e, 1, 0), 0);
close(efd);
close(sfd[0]);
close(sfd[1]);
}
This is an optimization of a set of sgx-related codes, each of which
is independent of the patch. Because the second and third patches have
conflicting dependencies, these patches are put together.
---
v3 changes:
* split free_cnt count and spin lock optimization into two patches
v2 changes:
* review suggested changes
Tianjia Zhang (5):
selftests/x86: Simplify the code to get vdso base address in sgx
x86/sgx: Optimize the locking range in sgx_sanitize_section()
x86/sgx: Optimize the free_cnt count in sgx_epc_section
x86/sgx: Allows ioctl PROVISION to execute before CREATE
x86/sgx: Remove redundant if conditions in sgx_encl_create
arch/x86/kernel/cpu/sgx/driver.c | 1 +
arch/x86/kernel/cpu/sgx/ioctl.c | 9 +++++----
arch/x86/kernel/cpu/sgx/main.c | 13 +++++--------
tools/testing/selftests/sgx/main.c | 24 ++++--------------------
4 files changed, 15 insertions(+), 32 deletions(-)
--
2.19.1.3.ge56e4f7
Changelog
---------
v11
- Another build fix reported by robot on i386: moved is_pinnable_page()
below set_page_section() in linux/mm.h
v10
- Fixed !CONFIG_MMU compiler issues by adding is_zero_pfn() stub.
v9
- Renamed gpf_to_alloc_flags() to gfp_to_alloc_flags_cma(); thanks Lecopzer
Chen for noticing.
- Fixed warning reported scripts/checkpatch.pl:
"Logical continuations should be on the previous line"
v8
- Added reviewed by's from John Hubbard
- Fixed subjects for selftests patches
- Moved zero page check inside is_pinnable_page() as requested by Jason
Gunthorpe.
v7
- Added reviewed-by's
- Fixed a compile bug on non-mmu builds reported by robot
v6
Small update, but I wanted to send it out quicker, as it removes a
controversial patch and replaces it with something sane.
- Removed forcing FOLL_WRITE for longterm gup, instead added a patch to
skip zero pages during migration.
- Added reviewed-by's and minor log changes.
v5
- Added the following patches to the beginning of series, which are fixes
to the other existing problems with CMA migration code:
mm/gup: check every subpage of a compound page during isolation
mm/gup: return an error on migration failure
mm/gup: check for isolation errors also at the beginning of series
mm/gup: do not allow zero page for pinned pages
- remove .gfp_mask/.reclaim_idx changes from mm/vmscan.c
- update movable zone header comment in patch 8 instead of patch 3, fix
the comment
- Added acked, sign-offs
- Updated commit logs based on feedback
- Addressed issues reported by Michal and Jason.
- Remove:
#define PINNABLE_MIGRATE_MAX 10
#define PINNABLE_ISOLATE_MAX 100
Instead: fail on the first migration failure, and retry isolation
forever as their failures are transient.
- In self-set addressed some of the comments from John Hubbard, updated
commit logs, and added comments. Renamed gup->flags with gup->test_flags.
v4
- Address page migration comments. New patch:
mm/gup: limit number of gup migration failures, honor failures
Implements the limiting number of retries for migration failures, and
also check for isolation failures.
Added a test case into gup_test to verify that pages never long-term
pinned in a movable zone, and also added tests to fault both in kernel
and in userland.
v3
- Merged with linux-next, which contains clean-up patch from Jason,
therefore this series is reduced by two patches which did the same
thing.
v2
- Addressed all review comments
- Added Reviewed-by's.
- Renamed PF_MEMALLOC_NOMOVABLE to PF_MEMALLOC_PIN
- Added is_pinnable_page() to check if page can be longterm pinned
- Fixed gup fast path by checking is_in_pinnable_zone()
- rename cma_page_list to movable_page_list
- add a admin-guide note about handling pinned pages in ZONE_MOVABLE,
updated caveat about pinned pages from linux/mmzone.h
- Move current_gfp_context() to fast-path
---------
When page is pinned it cannot be moved and its physical address stays
the same until pages is unpinned.
This is useful functionality to allows userland to implementation DMA
access. For example, it is used by vfio in vfio_pin_pages().
However, this functionality breaks memory hotplug/hotremove assumptions
that pages in ZONE_MOVABLE can always be migrated.
This patch series fixes this issue by forcing new allocations during
page pinning to omit ZONE_MOVABLE, and also to migrate any existing
pages from ZONE_MOVABLE during pinning.
It uses the same scheme logic that is currently used by CMA, and extends
the functionality for all allocations.
For more information read the discussion [1] about this problem.
[1] https://lore.kernel.org/lkml/CA+CK2bBffHBxjmb9jmSKacm0fJMinyt3Nhk8Nx6iudcQS…
Previous versions:
v1
https://lore.kernel.org/lkml/20201202052330.474592-1-pasha.tatashin@soleen.…
v2
https://lore.kernel.org/lkml/20201210004335.64634-1-pasha.tatashin@soleen.c…
v3
https://lore.kernel.org/lkml/20201211202140.396852-1-pasha.tatashin@soleen.…
v4
https://lore.kernel.org/lkml/20201217185243.3288048-1-pasha.tatashin@soleen…
v5
https://lore.kernel.org/lkml/20210119043920.155044-1-pasha.tatashin@soleen.…
v6
https://lore.kernel.org/lkml/20210120014333.222547-1-pasha.tatashin@soleen.…
v7
https://lore.kernel.org/lkml/20210122033748.924330-1-pasha.tatashin@soleen.…
v8
https://lore.kernel.org/lkml/20210125194751.1275316-1-pasha.tatashin@soleen…
v9
https://lore.kernel.org/lkml/20210201153827.444374-1-pasha.tatashin@soleen.…
v10
https://lore.kernel.org/lkml/20210211162427.618913-1-pasha.tatashin@soleen.…
Pavel Tatashin (14):
mm/gup: don't pin migrated cma pages in movable zone
mm/gup: check every subpage of a compound page during isolation
mm/gup: return an error on migration failure
mm/gup: check for isolation errors
mm cma: rename PF_MEMALLOC_NOCMA to PF_MEMALLOC_PIN
mm: apply per-task gfp constraints in fast path
mm: honor PF_MEMALLOC_PIN for all movable pages
mm/gup: do not migrate zero page
mm/gup: migrate pinned pages out of movable zone
memory-hotplug.rst: add a note about ZONE_MOVABLE and page pinning
mm/gup: change index type to long as it counts pages
mm/gup: longterm pin migration cleanup
selftests/vm: gup_test: fix test flag
selftests/vm: gup_test: test faulting in kernel, and verify pinnable
pages
.../admin-guide/mm/memory-hotplug.rst | 9 +
include/linux/migrate.h | 1 +
include/linux/mm.h | 19 ++
include/linux/mmzone.h | 13 +-
include/linux/pgtable.h | 12 ++
include/linux/sched.h | 2 +-
include/linux/sched/mm.h | 27 +--
include/trace/events/migrate.h | 3 +-
mm/gup.c | 174 ++++++++----------
mm/gup_test.c | 29 +--
mm/gup_test.h | 3 +-
mm/hugetlb.c | 4 +-
mm/page_alloc.c | 33 ++--
tools/testing/selftests/vm/gup_test.c | 36 +++-
14 files changed, 208 insertions(+), 157 deletions(-)
--
2.25.1