Patch changelog:
v3:
* Merge changes into the original patches to make Al's life easier.
[Al Viro]
v2:
* Add include <linux/types.h> to openat2.h. [Florian Weimer]
* Move OPEN_HOW_SIZE_* constants out of UAPI. [Florian Weimer]
* Switch from __aligned_u64 to __u64 since it isn't necessary.
[David Laight]
v1: <https://lore.kernel.org/lkml/20191219105533.12508-1-cyphar@cyphar.com/>
While openat2(2) is still not yet in Linus's tree, we can take this
opportunity to iron out some small warts that weren't noticed earlier:
* A fix was suggested by Florian Weimer, to separate the openat2
definitions so glibc can use the header directly. I've put the
maintainership under VFS but let me know if you'd prefer it belong
ot the fcntl folks.
* Having heterogenous field sizes in an extensible struct results in
"padding hole" problems when adding new fields (in addition the
correct error to use for non-zero padding isn't entirely clear ).
The simplest solution is to just copy clone(3)'s model -- always use
u64s. It will waste a little more space in the struct, but it
removes a possible future headache.
This patch is intended to replace the corresponding patches in Al's
#work.openat2 tree (and *will not* apply on Linus' tree).
@Al: I will send some additional patches later, but they will require
proper design review since they're ABI-related features (namely,
adding a way to check what features a syscall supports as I
outlined in my talk here[1]).
[1]: https://youtu.be/ggD-eb3yPVs
Aleksa Sarai (2):
open: introduce openat2(2) syscall
selftests: add openat2(2) selftests
CREDITS | 4 +-
MAINTAINERS | 1 +
arch/alpha/kernel/syscalls/syscall.tbl | 1 +
arch/arm/tools/syscall.tbl | 1 +
arch/arm64/include/asm/unistd.h | 2 +-
arch/arm64/include/asm/unistd32.h | 2 +
arch/ia64/kernel/syscalls/syscall.tbl | 1 +
arch/m68k/kernel/syscalls/syscall.tbl | 1 +
arch/microblaze/kernel/syscalls/syscall.tbl | 1 +
arch/mips/kernel/syscalls/syscall_n32.tbl | 1 +
arch/mips/kernel/syscalls/syscall_n64.tbl | 1 +
arch/mips/kernel/syscalls/syscall_o32.tbl | 1 +
arch/parisc/kernel/syscalls/syscall.tbl | 1 +
arch/powerpc/kernel/syscalls/syscall.tbl | 1 +
arch/s390/kernel/syscalls/syscall.tbl | 1 +
arch/sh/kernel/syscalls/syscall.tbl | 1 +
arch/sparc/kernel/syscalls/syscall.tbl | 1 +
arch/x86/entry/syscalls/syscall_32.tbl | 1 +
arch/x86/entry/syscalls/syscall_64.tbl | 1 +
arch/xtensa/kernel/syscalls/syscall.tbl | 1 +
fs/open.c | 147 +++--
include/linux/fcntl.h | 16 +-
include/linux/syscalls.h | 3 +
include/uapi/asm-generic/unistd.h | 5 +-
include/uapi/linux/fcntl.h | 2 +-
include/uapi/linux/openat2.h | 39 ++
tools/testing/selftests/Makefile | 1 +
tools/testing/selftests/openat2/.gitignore | 1 +
tools/testing/selftests/openat2/Makefile | 8 +
tools/testing/selftests/openat2/helpers.c | 109 ++++
tools/testing/selftests/openat2/helpers.h | 106 ++++
.../testing/selftests/openat2/openat2_test.c | 312 +++++++++++
.../selftests/openat2/rename_attack_test.c | 160 ++++++
.../testing/selftests/openat2/resolve_test.c | 523 ++++++++++++++++++
34 files changed, 1418 insertions(+), 39 deletions(-)
create mode 100644 include/uapi/linux/openat2.h
create mode 100644 tools/testing/selftests/openat2/.gitignore
create mode 100644 tools/testing/selftests/openat2/Makefile
create mode 100644 tools/testing/selftests/openat2/helpers.c
create mode 100644 tools/testing/selftests/openat2/helpers.h
create mode 100644 tools/testing/selftests/openat2/openat2_test.c
create mode 100644 tools/testing/selftests/openat2/rename_attack_test.c
create mode 100644 tools/testing/selftests/openat2/resolve_test.c
--
2.24.1
Memory protection keys enables an application to protect its address
space from inadvertent access by its own code.
This feature is now enabled on powerpc and has been available since
4.16-rc1. The patches move the selftests to arch neutral directory
and enhance their test coverage.
Tested on powerpc64 and x86_64 (Skylake-SP).
Changelog
---------
Link to previous version (v15):
https://patchwork.ozlabs.org/project/linuxppc-dev/list/?series=149238
v16:
(1) Rebased on top of latest master.
(2) Switched to u64 instead of using an arch-dependent
pkey_reg_t type for references to the pkey register
based on suggestions from Dave, Michal and Michael.
(3) Removed build time determination of page size based
on suggestion from Michael.
(4) Fixed comment before the definition of __page_o_noops()
from patch 13 ("selftests/vm/pkeys: Introduce powerpc
support").
v15:
(1) Rebased on top of latest master.
(2) Addressed review comments from Dave Hansen.
(3) Moved code for getting or setting pkey bits to new
helpers. These changes replace patch 7 of v14.
(4) Added a fix which ensures that the correct count of
reserved keys is used across different platforms.
(5) Added a fix which ensures that the correct page size
is used as powerpc supports both 4K and 64K pages.
v14:
(1) Incorporated another round of comments from Dave Hansen.
v13:
(1) Incorporated comments for Dave Hansen.
(2) Added one more test for correct pkey-0 behavior.
v12:
(1) Fixed the offset of pkey field in the siginfo structure for
x86_64 and powerpc. And tries to use the actual field
if the headers have it defined.
v11:
(1) Fixed a deadlock in the ptrace testcase.
v10 and prior:
(1) Moved the testcase to arch neutral directory.
(2) Split the changes into incremental patches.
Desnes A. Nunes do Rosario (1):
selftests/vm/pkeys: Fix number of reserved powerpc pkeys
Ram Pai (17):
selftests/x86/pkeys: Move selftests to arch-neutral directory
selftests/vm: Rename all references to pkru to a generic name
selftests/vm: Move generic definitions to header file
selftests/vm: Typecast references to pkey register
selftests/vm: Fix pkey_disable_clear()
selftests/vm/pkeys: Fix assertion in pkey_disable_set/clear()
selftests/vm/pkeys: Fix alloc_random_pkey() to make it really random
selftests/vm/pkeys: Introduce generic pkey abstractions
selftests/vm/pkeys: Introduce powerpc support
selftests/vm/pkeys: Fix assertion in test_pkey_alloc_exhaust()
selftests/vm/pkeys: Improve checks to determine pkey support
selftests/vm/pkeys: Associate key on a mapped page and detect access
violation
selftests/vm/pkeys: Associate key on a mapped page and detect write
violation
selftests/vm/pkeys: Detect write violation on a mapped
access-denied-key page
selftests/vm/pkeys: Introduce a sub-page allocator
selftests/vm/pkeys: Test correct behaviour of pkey-0
selftests/vm/pkeys: Override access right definitions on powerpc
Sandipan Das (3):
selftests: vm: pkeys: Add helpers for pkey bits
selftests: vm: pkeys: Use the correct huge page size
selftests: vm: pkeys: Use the correct page size on powerpc
Thiago Jung Bauermann (2):
selftests/vm: Move some definitions to arch-specific header
selftests/vm: Make gcc check arguments of sigsafe_printf()
tools/testing/selftests/vm/.gitignore | 1 +
tools/testing/selftests/vm/Makefile | 1 +
tools/testing/selftests/vm/pkey-helpers.h | 225 ++++++
tools/testing/selftests/vm/pkey-powerpc.h | 136 ++++
tools/testing/selftests/vm/pkey-x86.h | 181 +++++
.../selftests/{x86 => vm}/protection_keys.c | 693 ++++++++++--------
tools/testing/selftests/x86/.gitignore | 1 -
tools/testing/selftests/x86/pkey-helpers.h | 219 ------
8 files changed, 927 insertions(+), 530 deletions(-)
create mode 100644 tools/testing/selftests/vm/pkey-helpers.h
create mode 100644 tools/testing/selftests/vm/pkey-powerpc.h
create mode 100644 tools/testing/selftests/vm/pkey-x86.h
rename tools/testing/selftests/{x86 => vm}/protection_keys.c (74%)
delete mode 100644 tools/testing/selftests/x86/pkey-helpers.h
--
2.17.1
It was observed[1] on arm64 that __builtin_strlen led to an infinite
loop in the get_size selftest. This is because __builtin_strlen (and
other builtins) may sometimes result in a call to the C library
function. The C library implementation of strlen uses an IFUNC
resolver to load the most efficient strlen implementation for the
underlying machine and hence has a PLT indirection even for static
binaries. Because this binary avoids the C library startup routines,
the PLT initialization never happens and hence the program gets stuck
in an infinite loop.
On x86_64 the __builtin_strlen just happens to expand inline and avoid
the call but that is not always guaranteed.
Further, while testing on x86_64 (Fedora 31), it was observed that the
test also failed with a segfault inside write() because the generated
code for the write function in glibc seems to access TLS before the
syscall (probably due to the cancellation point check) and fails
because TLS is not initialised.
To mitigate these problems, this patch reduces the interface with the
C library to just the syscall function. The syscall function still
sets errno on failure, which is undesirable but for now it only
affects cases where syscalls fail.
[1] https://bugs.linaro.org/show_bug.cgi?id=5479
Signed-off-by: Siddhesh Poyarekar <siddhesh(a)gotplt.org>
Reported-by: Masami Hiramatsu <masami.hiramatsu(a)linaro.org>
---
tools/testing/selftests/size/get_size.c | 24 ++++++++++++++++++------
1 file changed, 18 insertions(+), 6 deletions(-)
diff --git a/tools/testing/selftests/size/get_size.c b/tools/testing/selftests/size/get_size.c
index d4b59ab979a0..f55943b6d1e2 100644
--- a/tools/testing/selftests/size/get_size.c
+++ b/tools/testing/selftests/size/get_size.c
@@ -12,23 +12,35 @@
* own execution. It also attempts to have as few dependencies
* on kernel features as possible.
*
- * It should be statically linked, with startup libs avoided.
- * It uses no library calls, and only the following 3 syscalls:
+ * It should be statically linked, with startup libs avoided. It uses
+ * no library calls except the syscall() function for the following 3
+ * syscalls:
* sysinfo(), write(), and _exit()
*
* For output, it avoids printf (which in some C libraries
* has large external dependencies) by implementing it's own
* number output and print routines, and using __builtin_strlen()
+ *
+ * The test may crash if any of the above syscalls fails because in some
+ * libc implementations (e.g. the GNU C Library) errno is saved in
+ * thread-local storage, which does not get initialized due to avoiding
+ * startup libs.
*/
#include <sys/sysinfo.h>
#include <unistd.h>
+#include <sys/syscall.h>
#define STDOUT_FILENO 1
static int print(const char *s)
{
- return write(STDOUT_FILENO, s, __builtin_strlen(s));
+ size_t len = 0;
+
+ while (s[len] != '\0')
+ len++;
+
+ return syscall(SYS_write, STDOUT_FILENO, s, len);
}
static inline char *num_to_str(unsigned long num, char *buf, int len)
@@ -80,12 +92,12 @@ void _start(void)
print("TAP version 13\n");
print("# Testing system size.\n");
- ccode = sysinfo(&info);
+ ccode = syscall(SYS_sysinfo, &info);
if (ccode < 0) {
print("not ok 1");
print(test_name);
print(" ---\n reason: \"could not get sysinfo\"\n ...\n");
- _exit(ccode);
+ syscall(SYS_exit, ccode);
}
print("ok 1");
print(test_name);
@@ -101,5 +113,5 @@ void _start(void)
print(" ...\n");
print("1..1\n");
- _exit(0);
+ syscall(SYS_exit, 0);
}
--
2.24.1
Hi,
The "track FOLL_PIN pages" would have been the very next patch, but it is
not included here because I'm still debugging a bug report from Leon.
Let's get all of the prerequisite work (it's been reviewed) into the tree
so that future reviews are easier. It's clear that any fixes that are
required to the tracking patch, won't affect these patches here.
This implements an API naming change (put_user_page*() -->
unpin_user_page*()), and also adds FOLL_PIN page support, up to
*but not including* actually tracking FOLL_PIN pages. It extends
the FOLL_PIN support to a few select subsystems. More subsystems will
be added in follow up work.
Christoph Hellwig, a point of interest:
a) I've moved the bulk of the code out of the inline functions, as
requested, for the devmap changes (patch 4: "mm: devmap: refactor
1-based refcounting for ZONE_DEVICE pages").
Changes since v11: Fixes resulting from Kirill Shutemov's review, plus
a fix for a kbuild robot-reported warning.
* Only include the first 22 patches: up to, but not including, the "track
FOLL_PIN pages" patch.
* Improved the efficiency of put_compound_head(), by avoiding get_page()
entirely, and instead doing the mass subtraction on one less than
refs, followed by a final put_page().
* Got rid of the forward declaration of __gup_longterm_locked(), by
moving get_user_pages_remote() further down in gup.c
* Got rid of a redundant page_is_devmap_managed() call, and simplified
put_devmap_managed_page() as part of that small cleanup.
* Changed put_devmap_managed_page() to do an early out if the page is
not devmap managed. This saves an indentation level.
* Applied the same type of change to __unpin_devmap_managed_user_page(),
which has the same checks.
* Changed release_pages() to handle the changed put_devmap_managed_page()
API.
* Removed EXPORT_SYMBOL(free_devmap_managed_page), as it is not required,
after the other refactoring.
* Fixed a kbuild robot sparse warning: added "static" to
try_pin_compound_head()'s declaration.
There is a git repo and branch, for convenience:
git@github.com:johnhubbard/linux.git pin_user_pages_tracking_v8
For the remaining list of "changes since version N", those are all in
v11, which is here:
https://lore.kernel.org/r/20191216222537.491123-1-jhubbard@nvidia.com
============================================================
Overview:
This is a prerequisite to solving the problem of proper interactions
between file-backed pages, and [R]DMA activities, as discussed in [1],
[2], [3], and in a remarkable number of email threads since about
2017. :)
A new internal gup flag, FOLL_PIN is introduced, and thoroughly
documented in the last patch's Documentation/vm/pin_user_pages.rst.
I believe that this will provide a good starting point for doing the
layout lease work that Ira Weiny has been working on. That's because
these new wrapper functions provide a clean, constrained, systematically
named set of functionality that, again, is required in order to even
know if a page is "dma-pinned".
In contrast to earlier approaches, the page tracking can be
incrementally applied to the kernel call sites that, until now, have
been simply calling get_user_pages() ("gup"). In other words, opt-in by
changing from this:
get_user_pages() (sets FOLL_GET)
put_page()
to this:
pin_user_pages() (sets FOLL_PIN)
unpin_user_page()
============================================================
Testing:
* I've done some overall kernel testing (LTP, and a few other goodies),
and some directed testing to exercise some of the changes. And as you
can see, gup_benchmark is enhanced to exercise this. Basically, I've
been able to runtime test the core get_user_pages() and
pin_user_pages() and related routines, but not so much on several of
the call sites--but those are generally just a couple of lines
changed, each.
Not much of the kernel is actually using this, which on one hand
reduces risk quite a lot. But on the other hand, testing coverage
is low. So I'd love it if, in particular, the Infiniband and PowerPC
folks could do a smoke test of this series for me.
Runtime testing for the call sites so far is pretty light:
* io_uring: Some directed tests from liburing exercise this, and
they pass.
* process_vm_access.c: A small directed test passes.
* gup_benchmark: the enhanced version hits the new gup.c code, and
passes.
* infiniband: Ran rdma-core tests: rdma-core/build/bin/run_tests.py
* VFIO: compiles (I'm vowing to set up a run time test soon, but it's
not ready just yet)
* powerpc: it compiles...
* drm/via: compiles...
* goldfish: compiles...
* net/xdp: compiles...
* media/v4l2: compiles...
[1] Some slow progress on get_user_pages() (Apr 2, 2019): https://lwn.net/Articles/784574/
[2] DMA and get_user_pages() (LPC: Dec 12, 2018): https://lwn.net/Articles/774411/
[3] The trouble with get_user_pages() (Apr 30, 2018): https://lwn.net/Articles/753027/
Dan Williams (1):
mm: Cleanup __put_devmap_managed_page() vs ->page_free()
John Hubbard (21):
mm/gup: factor out duplicate code from four routines
mm/gup: move try_get_compound_head() to top, fix minor issues
mm: devmap: refactor 1-based refcounting for ZONE_DEVICE pages
goldish_pipe: rename local pin_user_pages() routine
mm: fix get_user_pages_remote()'s handling of FOLL_LONGTERM
vfio: fix FOLL_LONGTERM use, simplify get_user_pages_remote() call
mm/gup: allow FOLL_FORCE for get_user_pages_fast()
IB/umem: use get_user_pages_fast() to pin DMA pages
media/v4l2-core: set pages dirty upon releasing DMA buffers
mm/gup: introduce pin_user_pages*() and FOLL_PIN
goldish_pipe: convert to pin_user_pages() and put_user_page()
IB/{core,hw,umem}: set FOLL_PIN via pin_user_pages*(), fix up ODP
mm/process_vm_access: set FOLL_PIN via pin_user_pages_remote()
drm/via: set FOLL_PIN via pin_user_pages_fast()
fs/io_uring: set FOLL_PIN via pin_user_pages()
net/xdp: set FOLL_PIN via pin_user_pages()
media/v4l2-core: pin_user_pages (FOLL_PIN) and put_user_page()
conversion
vfio, mm: pin_user_pages (FOLL_PIN) and put_user_page() conversion
powerpc: book3s64: convert to pin_user_pages() and put_user_page()
mm/gup_benchmark: use proper FOLL_WRITE flags instead of hard-coding
"1"
mm, tree-wide: rename put_user_page*() to unpin_user_page*()
Documentation/core-api/index.rst | 1 +
Documentation/core-api/pin_user_pages.rst | 232 +++++++++
arch/powerpc/mm/book3s64/iommu_api.c | 10 +-
drivers/gpu/drm/via/via_dmablit.c | 6 +-
drivers/infiniband/core/umem.c | 19 +-
drivers/infiniband/core/umem_odp.c | 13 +-
drivers/infiniband/hw/hfi1/user_pages.c | 4 +-
drivers/infiniband/hw/mthca/mthca_memfree.c | 8 +-
drivers/infiniband/hw/qib/qib_user_pages.c | 4 +-
drivers/infiniband/hw/qib/qib_user_sdma.c | 8 +-
drivers/infiniband/hw/usnic/usnic_uiom.c | 4 +-
drivers/infiniband/sw/siw/siw_mem.c | 4 +-
drivers/media/v4l2-core/videobuf-dma-sg.c | 8 +-
drivers/nvdimm/pmem.c | 6 -
drivers/platform/goldfish/goldfish_pipe.c | 35 +-
drivers/vfio/vfio_iommu_type1.c | 35 +-
fs/io_uring.c | 6 +-
include/linux/mm.h | 95 +++-
mm/gup.c | 495 ++++++++++++--------
mm/gup_benchmark.c | 9 +-
mm/memremap.c | 75 ++-
mm/process_vm_access.c | 28 +-
mm/swap.c | 27 +-
net/xdp/xdp_umem.c | 4 +-
tools/testing/selftests/vm/gup_benchmark.c | 6 +-
25 files changed, 762 insertions(+), 380 deletions(-)
create mode 100644 Documentation/core-api/pin_user_pages.rst
--
2.24.1