Hi Mathieu,
I am seeing rseq test build failure on Linux 5.5-rc1.
gcc -O2 -Wall -g -I./ -I../../../../usr/include/ -L./ -Wl,-rpath=./
param_test.c -lpthread -lrseq -o ...tools/testing/selftests/rseq/param_test
param_test.c:18:21: error: static declaration of ‘gettid’ follows
non-static declaration
18 | static inline pid_t gettid(void)
| ^~~~~~
In file included from /usr/include/unistd.h:1170,
from param_test.c:11:
/usr/include/x86_64-linux-gnu/bits/unistd_ext.h:34:16: note: previous
declaration of ‘gettid’ was here
34 | extern __pid_t gettid (void) __THROW;
| ^~~~~~
make: *** [Makefile:28: ...tools/testing/selftests/rseq/param_test] Error 1
The following obvious change fixes it. However, there could be reason
why this was defined here. If you think this is the right fix, I can
send the patch. I started seeing this with gcc version 9.2.1 20191008
diff --git a/tools/testing/selftests/rseq/param_test.c
b/tools/testing/selftests/rseq/param_test.c
index eec2663261f2..18a0fa1235a7 100644
--- a/tools/testing/selftests/rseq/param_test.c
+++ b/tools/testing/selftests/rseq/param_test.c
@@ -15,11 +15,6 @@
#include <errno.h>
#include <stddef.h>
-static inline pid_t gettid(void)
-{
- return syscall(__NR_gettid);
-}
-
thanks,
-- Shuah
Hi,
This implements an API naming change (put_user_page*() -->
unpin_user_page*()), and also implements tracking of FOLL_PIN pages. It
extends that tracking to a few select subsystems. More subsystems will
be added in follow up work.
Christoph Hellwig, a couple of points of interest:
a) I've moved the bulk of the code out of the inline functions, as
requested, for the devmap changes (patch 4: "mm: devmap: refactor
1-based refcounting for ZONE_DEVICE pages").
b) Contrary to my earlier response to your review, I have not actually
merged patch 23 ("mm/gup: pass flags arg to __gup_device_*
functions") into patch 24 ("mm/gup: track FOLL_PIN pages"). This is
because I suspect that it's better to avoid making patch 24 any larger
and worse to review than it already is. But if you feel strongly
about it, I'll combine them anyway.
Changes since v7:
* Rebased onto Linux 5.5-rc1
* Reworked the grab_page() and try_grab_compound_head(), for API
consistency and less diffs (thanks to Jan Kara's reviews).
* Added Leon Romanovsky's reviewed-by tags for two of the IB-related
patches.
* patch 4 refactoring changes, as mentioned above.
There is a git repo and branch, for convenience:
git@github.com:johnhubbard/linux.git pin_user_pages_tracking_v8
For the remaining list of "changes since version N", those are all in
v7, which is here:
https://lore.kernel.org/r/20191121071354.456618-1-jhubbard@nvidia.com
============================================================
Overview:
This is a prerequisite to solving the problem of proper interactions
between file-backed pages, and [R]DMA activities, as discussed in [1],
[2], [3], and in a remarkable number of email threads since about
2017. :)
A new internal gup flag, FOLL_PIN is introduced, and thoroughly
documented in the last patch's Documentation/vm/pin_user_pages.rst.
I believe that this will provide a good starting point for doing the
layout lease work that Ira Weiny has been working on. That's because
these new wrapper functions provide a clean, constrained, systematically
named set of functionality that, again, is required in order to even
know if a page is "dma-pinned".
In contrast to earlier approaches, the page tracking can be
incrementally applied to the kernel call sites that, until now, have
been simply calling get_user_pages() ("gup"). In other words, opt-in by
changing from this:
get_user_pages() (sets FOLL_GET)
put_page()
to this:
pin_user_pages() (sets FOLL_PIN)
unpin_user_page()
============================================================
Testing:
* I've done some overall kernel testing (LTP, and a few other goodies),
and some directed testing to exercise some of the changes. And as you
can see, gup_benchmark is enhanced to exercise this. Basically, I've
been able to runtime test the core get_user_pages() and
pin_user_pages() and related routines, but not so much on several of
the call sites--but those are generally just a couple of lines
changed, each.
Not much of the kernel is actually using this, which on one hand
reduces risk quite a lot. But on the other hand, testing coverage
is low. So I'd love it if, in particular, the Infiniband and PowerPC
folks could do a smoke test of this series for me.
Runtime testing for the call sites so far is pretty light:
* io_uring: Some directed tests from liburing exercise this, and
they pass.
* process_vm_access.c: A small directed test passes.
* gup_benchmark: the enhanced version hits the new gup.c code, and
passes.
* infiniband: ran "ib_write_bw", which exercises the umem.c changes,
but not the other changes.
* VFIO: compiles (I'm vowing to set up a run time test soon, but it's
not ready just yet)
* powerpc: it compiles...
* drm/via: compiles...
* goldfish: compiles...
* net/xdp: compiles...
* media/v4l2: compiles...
[1] Some slow progress on get_user_pages() (Apr 2, 2019): https://lwn.net/Articles/784574/
[2] DMA and get_user_pages() (LPC: Dec 12, 2018): https://lwn.net/Articles/774411/
[3] The trouble with get_user_pages() (Apr 30, 2018): https://lwn.net/Articles/753027/
Dan Williams (1):
mm: Cleanup __put_devmap_managed_page() vs ->page_free()
John Hubbard (25):
mm/gup: factor out duplicate code from four routines
mm/gup: move try_get_compound_head() to top, fix minor issues
mm: devmap: refactor 1-based refcounting for ZONE_DEVICE pages
goldish_pipe: rename local pin_user_pages() routine
mm: fix get_user_pages_remote()'s handling of FOLL_LONGTERM
vfio: fix FOLL_LONGTERM use, simplify get_user_pages_remote() call
mm/gup: allow FOLL_FORCE for get_user_pages_fast()
IB/umem: use get_user_pages_fast() to pin DMA pages
mm/gup: introduce pin_user_pages*() and FOLL_PIN
goldish_pipe: convert to pin_user_pages() and put_user_page()
IB/{core,hw,umem}: set FOLL_PIN via pin_user_pages*(), fix up ODP
mm/process_vm_access: set FOLL_PIN via pin_user_pages_remote()
drm/via: set FOLL_PIN via pin_user_pages_fast()
fs/io_uring: set FOLL_PIN via pin_user_pages()
net/xdp: set FOLL_PIN via pin_user_pages()
media/v4l2-core: set pages dirty upon releasing DMA buffers
media/v4l2-core: pin_user_pages (FOLL_PIN) and put_user_page()
conversion
vfio, mm: pin_user_pages (FOLL_PIN) and put_user_page() conversion
powerpc: book3s64: convert to pin_user_pages() and put_user_page()
mm/gup_benchmark: use proper FOLL_WRITE flags instead of hard-coding
"1"
mm, tree-wide: rename put_user_page*() to unpin_user_page*()
mm/gup: pass flags arg to __gup_device_* functions
mm/gup: track FOLL_PIN pages
mm/gup_benchmark: support pin_user_pages() and related calls
selftests/vm: run_vmtests: invoke gup_benchmark with basic FOLL_PIN
coverage
Documentation/core-api/index.rst | 1 +
Documentation/core-api/pin_user_pages.rst | 233 ++++++++
arch/powerpc/mm/book3s64/iommu_api.c | 12 +-
drivers/gpu/drm/via/via_dmablit.c | 6 +-
drivers/infiniband/core/umem.c | 19 +-
drivers/infiniband/core/umem_odp.c | 13 +-
drivers/infiniband/hw/hfi1/user_pages.c | 4 +-
drivers/infiniband/hw/mthca/mthca_memfree.c | 8 +-
drivers/infiniband/hw/qib/qib_user_pages.c | 4 +-
drivers/infiniband/hw/qib/qib_user_sdma.c | 8 +-
drivers/infiniband/hw/usnic/usnic_uiom.c | 4 +-
drivers/infiniband/sw/siw/siw_mem.c | 4 +-
drivers/media/v4l2-core/videobuf-dma-sg.c | 8 +-
drivers/nvdimm/pmem.c | 6 -
drivers/platform/goldfish/goldfish_pipe.c | 35 +-
drivers/vfio/vfio_iommu_type1.c | 35 +-
fs/io_uring.c | 6 +-
include/linux/mm.h | 145 ++++-
include/linux/mmzone.h | 2 +
include/linux/page_ref.h | 10 +
mm/gup.c | 595 +++++++++++++++-----
mm/gup_benchmark.c | 74 ++-
mm/huge_memory.c | 23 +-
mm/hugetlb.c | 15 +-
mm/memremap.c | 76 ++-
mm/process_vm_access.c | 28 +-
mm/swap.c | 24 +
mm/vmstat.c | 2 +
net/xdp/xdp_umem.c | 4 +-
tools/testing/selftests/vm/gup_benchmark.c | 21 +-
tools/testing/selftests/vm/run_vmtests | 22 +
31 files changed, 1093 insertions(+), 354 deletions(-)
create mode 100644 Documentation/core-api/pin_user_pages.rst
--
2.24.0
From: Ivan Khoronzhuk <ivan.khoronzhuk(a)linaro.org>
[ Upstream commit c588146378962786ddeec817f7736a53298a7b01 ]
The "path" buf is supposed to contain path + printf msg up to 24 bytes.
It will be cut anyway, but compiler generates truncation warns like:
"
samples/bpf/../../tools/testing/selftests/bpf/cgroup_helpers.c: In
function ‘setup_cgroup_environment’:
samples/bpf/../../tools/testing/selftests/bpf/cgroup_helpers.c:52:34:
warning: ‘/cgroup.controllers’ directive output may be truncated
writing 19 bytes into a region of size between 1 and 4097
[-Wformat-truncation=]
snprintf(path, sizeof(path), "%s/cgroup.controllers", cgroup_path);
^~~~~~~~~~~~~~~~~~~
samples/bpf/../../tools/testing/selftests/bpf/cgroup_helpers.c:52:2:
note: ‘snprintf’ output between 20 and 4116 bytes into a destination
of size 4097
snprintf(path, sizeof(path), "%s/cgroup.controllers", cgroup_path);
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
samples/bpf/../../tools/testing/selftests/bpf/cgroup_helpers.c:72:34:
warning: ‘/cgroup.subtree_control’ directive output may be truncated
writing 23 bytes into a region of size between 1 and 4097
[-Wformat-truncation=]
snprintf(path, sizeof(path), "%s/cgroup.subtree_control",
^~~~~~~~~~~~~~~~~~~~~~~
cgroup_path);
samples/bpf/../../tools/testing/selftests/bpf/cgroup_helpers.c:72:2:
note: ‘snprintf’ output between 24 and 4120 bytes into a destination
of size 4097
snprintf(path, sizeof(path), "%s/cgroup.subtree_control",
cgroup_path);
"
In order to avoid warns, lets decrease buf size for cgroup workdir on
24 bytes with assumption to include also "/cgroup.subtree_control" to
the address. The cut will never happen anyway.
Signed-off-by: Ivan Khoronzhuk <ivan.khoronzhuk(a)linaro.org>
Signed-off-by: Daniel Borkmann <daniel(a)iogearbox.net>
Acked-by: Song Liu <songliubraving(a)fb.com>
Link: https://lore.kernel.org/bpf/20191002120404.26962-3-ivan.khoronzhuk@linaro.o…
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
---
tools/testing/selftests/bpf/cgroup_helpers.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/tools/testing/selftests/bpf/cgroup_helpers.c b/tools/testing/selftests/bpf/cgroup_helpers.c
index cf16948aad4ad..6af24f9a780de 100644
--- a/tools/testing/selftests/bpf/cgroup_helpers.c
+++ b/tools/testing/selftests/bpf/cgroup_helpers.c
@@ -44,7 +44,7 @@
*/
int setup_cgroup_environment(void)
{
- char cgroup_workdir[PATH_MAX + 1];
+ char cgroup_workdir[PATH_MAX - 24];
format_cgroup_path(cgroup_workdir, "");
--
2.20.1
From: Yonghong Song <yhs(a)fb.com>
[ Upstream commit 2ea2612b987ad703235c92be21d4e98ee9c2c67c ]
Currently, with latest llvm trunk, selftest test_progs failed obj
file test_seg6_loop.o with the following error in verifier:
infinite loop detected at insn 76
The byte code sequence looks like below, and noted that alu32 has been
turned off by default for better generated codes in general:
48: w3 = 100
49: *(u32 *)(r10 - 68) = r3
...
; if (tlv.type == SR6_TLV_PADDING) {
76: if w3 == 5 goto -18 <LBB0_19>
...
85: r1 = *(u32 *)(r10 - 68)
; for (int i = 0; i < 100; i++) {
86: w1 += -1
87: if w1 == 0 goto +5 <LBB0_20>
88: *(u32 *)(r10 - 68) = r1
The main reason for verification failure is due to partial spills at
r10 - 68 for induction variable "i".
Current verifier only handles spills with 8-byte values. The above 4-byte
value spill to stack is treated to STACK_MISC and its content is not
saved. For the above example:
w3 = 100
R3_w=inv100 fp-64_w=inv1086626730498
*(u32 *)(r10 - 68) = r3
R3_w=inv100 fp-64_w=inv1086626730498
...
r1 = *(u32 *)(r10 - 68)
R1_w=inv(id=0,umax_value=4294967295,var_off=(0x0; 0xffffffff))
fp-64=inv1086626730498
To resolve this issue, verifier needs to be extended to track sub-registers
in spilling, or llvm needs to enhanced to prevent sub-register spilling
in register allocation phase. The former will increase verifier complexity
and the latter will need some llvm "hacking".
Let us workaround this issue by declaring the induction variable as "long"
type so spilling will happen at non sub-register level. We can revisit this
later if sub-register spilling causes similar or other verification issues.
Signed-off-by: Yonghong Song <yhs(a)fb.com>
Signed-off-by: Daniel Borkmann <daniel(a)iogearbox.net>
Acked-by: Andrii Nakryiko <andriin(a)fb.com>
Link: https://lore.kernel.org/bpf/20191117214036.1309510-1-yhs@fb.com
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
---
tools/testing/selftests/bpf/progs/test_seg6_loop.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/tools/testing/selftests/bpf/progs/test_seg6_loop.c b/tools/testing/selftests/bpf/progs/test_seg6_loop.c
index c4d104428643e..69880c1e7700c 100644
--- a/tools/testing/selftests/bpf/progs/test_seg6_loop.c
+++ b/tools/testing/selftests/bpf/progs/test_seg6_loop.c
@@ -132,8 +132,10 @@ static __always_inline int is_valid_tlv_boundary(struct __sk_buff *skb,
*pad_off = 0;
// we can only go as far as ~10 TLVs due to the BPF max stack size
+ // workaround: define induction variable "i" as "long" instead
+ // of "int" to prevent alu32 sub-register spilling.
#pragma clang loop unroll(disable)
- for (int i = 0; i < 100; i++) {
+ for (long i = 0; i < 100; i++) {
struct sr6_tlv_t tlv;
if (cur_off == *tlv_off)
--
2.20.1
From: Jiri Benc <jbenc(a)redhat.com>
[ Upstream commit 3b054b7133b4ad93671c82e8d6185258e3f1a7a5 ]
When run_kselftests.sh is run, it hangs after test_tc_tunnel.sh. The reason
is test_tc_tunnel.sh ensures the server ('nc -l') is run all the time,
starting it again every time it is expected to terminate. The exception is
the final client_connect: the server is not started anymore, which ensures
no process is kept running after the test is finished.
For a sit test, though, the script is terminated prematurely without the
final client_connect and the 'nc' process keeps running. This in turn causes
the run_one function in kselftest/runner.sh to hang forever, waiting for the
runaway process to finish.
Ensure a remaining server is terminated on cleanup.
Fixes: f6ad6accaa99 ("selftests/bpf: expand test_tc_tunnel with SIT encap")
Signed-off-by: Jiri Benc <jbenc(a)redhat.com>
Signed-off-by: Daniel Borkmann <daniel(a)iogearbox.net>
Acked-by: Willem de Bruijn <willemb(a)google.com>
Link: https://lore.kernel.org/bpf/60919291657a9ee89c708d8aababc28ebe1420be.157382…
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
---
tools/testing/selftests/bpf/test_tc_tunnel.sh | 5 +++++
1 file changed, 5 insertions(+)
diff --git a/tools/testing/selftests/bpf/test_tc_tunnel.sh b/tools/testing/selftests/bpf/test_tc_tunnel.sh
index ff0d31d38061f..7c76b841b17bb 100755
--- a/tools/testing/selftests/bpf/test_tc_tunnel.sh
+++ b/tools/testing/selftests/bpf/test_tc_tunnel.sh
@@ -62,6 +62,10 @@ cleanup() {
if [[ -f "${infile}" ]]; then
rm "${infile}"
fi
+
+ if [[ -n $server_pid ]]; then
+ kill $server_pid 2> /dev/null
+ fi
}
server_listen() {
@@ -77,6 +81,7 @@ client_connect() {
verify_data() {
wait "${server_pid}"
+ server_pid=
# sha1sum returns two fields [sha1] [filepath]
# convert to bash array and access first elem
insum=($(sha1sum ${infile}))
--
2.20.1
From: Yonghong Song <yhs(a)fb.com>
[ Upstream commit b7a0d65d80a0c5034b366392624397a0915b7556 ]
With latest llvm compiler, running test_progs will have the following
verifier failure for test_sysctl_loop1.o:
libbpf: load bpf program failed: Permission denied
libbpf: -- BEGIN DUMP LOG ---
libbpf:
invalid indirect read from stack var_off (0x0; 0xff)+196 size 7
...
libbpf: -- END LOG --
libbpf: failed to load program 'cgroup/sysctl'
libbpf: failed to load object 'test_sysctl_loop1.o'
The related bytecode looks as below:
0000000000000308 LBB0_8:
97: r4 = r10
98: r4 += -288
99: r4 += r7
100: w8 &= 255
101: r1 = r10
102: r1 += -488
103: r1 += r8
104: r2 = 7
105: r3 = 0
106: call 106
107: w1 = w0
108: w1 += -1
109: if w1 > 6 goto -24 <LBB0_5>
110: w0 += w8
111: r7 += 8
112: w8 = w0
113: if r7 != 224 goto -17 <LBB0_8>
And source code:
for (i = 0; i < ARRAY_SIZE(tcp_mem); ++i) {
ret = bpf_strtoul(value + off, MAX_ULONG_STR_LEN, 0,
tcp_mem + i);
if (ret <= 0 || ret > MAX_ULONG_STR_LEN)
return 0;
off += ret & MAX_ULONG_STR_LEN;
}
Current verifier is not able to conclude that register w0 before '+'
at insn 110 has a range of 1 to 7 and thinks it is from 0 - 255. This
leads to more conservative range for w8 at insn 112, and later verifier
complaint.
Let us workaround this issue until we found a compiler and/or verifier
solution. The workaround in this patch is to make variable 'ret' volatile,
which will force a reload and then '&' operation to ensure better value
range. With this patch, I got the below byte code for the loop:
0000000000000328 LBB0_9:
101: r4 = r10
102: r4 += -288
103: r4 += r7
104: w8 &= 255
105: r1 = r10
106: r1 += -488
107: r1 += r8
108: r2 = 7
109: r3 = 0
110: call 106
111: *(u32 *)(r10 - 64) = r0
112: r1 = *(u32 *)(r10 - 64)
113: if w1 s< 1 goto -28 <LBB0_5>
114: r1 = *(u32 *)(r10 - 64)
115: if w1 s> 7 goto -30 <LBB0_5>
116: r1 = *(u32 *)(r10 - 64)
117: w1 &= 7
118: w1 += w8
119: r7 += 8
120: w8 = w1
121: if r7 != 224 goto -21 <LBB0_9>
Insn 117 did the '&' operation and we got more precise value range
for 'w8' at insn 120. The test is happy then:
#3/17 test_sysctl_loop1.o:OK
Signed-off-by: Yonghong Song <yhs(a)fb.com>
Signed-off-by: Daniel Borkmann <daniel(a)iogearbox.net>
Acked-by: Song Liu <songliubraving(a)fb.com>
Link: https://lore.kernel.org/bpf/20191107170045.2503480-1-yhs@fb.com
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
---
tools/testing/selftests/bpf/progs/test_sysctl_loop1.c | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)
diff --git a/tools/testing/selftests/bpf/progs/test_sysctl_loop1.c b/tools/testing/selftests/bpf/progs/test_sysctl_loop1.c
index 608a06871572d..d22e438198cf7 100644
--- a/tools/testing/selftests/bpf/progs/test_sysctl_loop1.c
+++ b/tools/testing/selftests/bpf/progs/test_sysctl_loop1.c
@@ -44,7 +44,10 @@ int sysctl_tcp_mem(struct bpf_sysctl *ctx)
unsigned long tcp_mem[TCP_MEM_LOOPS] = {};
char value[MAX_VALUE_STR_LEN];
unsigned char i, off = 0;
- int ret;
+ /* a workaround to prevent compiler from generating
+ * codes verifier cannot handle yet.
+ */
+ volatile int ret;
if (ctx->write)
return 0;
--
2.20.1
From: Masami Hiramatsu <mhiramat(a)kernel.org>
[ Upstream commit 2f3571ea71311bbb2cbb9c3bbefc9c1969a3e889 ]
Currently proc-self-map-files-002.c sets va_max (max test address
of user virtual address) to 4GB, but it is too big for 32bit
arch and 1UL << 32 is overflow on 32bit long.
Also since this value should be enough bigger than vm.mmap_min_addr
(64KB or 32KB by default), 1MB should be enough.
Make va_max 1MB unconditionally.
Signed-off-by: Masami Hiramatsu <mhiramat(a)kernel.org>
Cc: Alexey Dobriyan <adobriyan(a)gmail.com>
Signed-off-by: Shuah Khan <skhan(a)linuxfoundation.org>
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
---
tools/testing/selftests/proc/proc-self-map-files-002.c | 6 +++++-
1 file changed, 5 insertions(+), 1 deletion(-)
diff --git a/tools/testing/selftests/proc/proc-self-map-files-002.c b/tools/testing/selftests/proc/proc-self-map-files-002.c
index 47b7473dedef7..e6aa00a183bcd 100644
--- a/tools/testing/selftests/proc/proc-self-map-files-002.c
+++ b/tools/testing/selftests/proc/proc-self-map-files-002.c
@@ -47,7 +47,11 @@ static void fail(const char *fmt, unsigned long a, unsigned long b)
int main(void)
{
const int PAGE_SIZE = sysconf(_SC_PAGESIZE);
- const unsigned long va_max = 1UL << 32;
+ /*
+ * va_max must be enough bigger than vm.mmap_min_addr, which is
+ * 64KB/32KB by default. (depends on CONFIG_LSM_MMAP_MIN_ADDR)
+ */
+ const unsigned long va_max = 1UL << 20;
unsigned long va;
void *p;
int fd;
--
2.20.1
From: Ivan Khoronzhuk <ivan.khoronzhuk(a)linaro.org>
[ Upstream commit c588146378962786ddeec817f7736a53298a7b01 ]
The "path" buf is supposed to contain path + printf msg up to 24 bytes.
It will be cut anyway, but compiler generates truncation warns like:
"
samples/bpf/../../tools/testing/selftests/bpf/cgroup_helpers.c: In
function ‘setup_cgroup_environment’:
samples/bpf/../../tools/testing/selftests/bpf/cgroup_helpers.c:52:34:
warning: ‘/cgroup.controllers’ directive output may be truncated
writing 19 bytes into a region of size between 1 and 4097
[-Wformat-truncation=]
snprintf(path, sizeof(path), "%s/cgroup.controllers", cgroup_path);
^~~~~~~~~~~~~~~~~~~
samples/bpf/../../tools/testing/selftests/bpf/cgroup_helpers.c:52:2:
note: ‘snprintf’ output between 20 and 4116 bytes into a destination
of size 4097
snprintf(path, sizeof(path), "%s/cgroup.controllers", cgroup_path);
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
samples/bpf/../../tools/testing/selftests/bpf/cgroup_helpers.c:72:34:
warning: ‘/cgroup.subtree_control’ directive output may be truncated
writing 23 bytes into a region of size between 1 and 4097
[-Wformat-truncation=]
snprintf(path, sizeof(path), "%s/cgroup.subtree_control",
^~~~~~~~~~~~~~~~~~~~~~~
cgroup_path);
samples/bpf/../../tools/testing/selftests/bpf/cgroup_helpers.c:72:2:
note: ‘snprintf’ output between 24 and 4120 bytes into a destination
of size 4097
snprintf(path, sizeof(path), "%s/cgroup.subtree_control",
cgroup_path);
"
In order to avoid warns, lets decrease buf size for cgroup workdir on
24 bytes with assumption to include also "/cgroup.subtree_control" to
the address. The cut will never happen anyway.
Signed-off-by: Ivan Khoronzhuk <ivan.khoronzhuk(a)linaro.org>
Signed-off-by: Daniel Borkmann <daniel(a)iogearbox.net>
Acked-by: Song Liu <songliubraving(a)fb.com>
Link: https://lore.kernel.org/bpf/20191002120404.26962-3-ivan.khoronzhuk@linaro.o…
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
---
tools/testing/selftests/bpf/cgroup_helpers.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/tools/testing/selftests/bpf/cgroup_helpers.c b/tools/testing/selftests/bpf/cgroup_helpers.c
index e95c33e333a40..b29a73fe64dbc 100644
--- a/tools/testing/selftests/bpf/cgroup_helpers.c
+++ b/tools/testing/selftests/bpf/cgroup_helpers.c
@@ -98,7 +98,7 @@ int enable_all_controllers(char *cgroup_path)
*/
int setup_cgroup_environment(void)
{
- char cgroup_workdir[PATH_MAX + 1];
+ char cgroup_workdir[PATH_MAX - 24];
format_cgroup_path(cgroup_workdir, "");
--
2.20.1
Currently, when some of the KSFT subsystems fails to build, the toplevel
KSFT Makefile just keeps carrying on with the build process.
This behaviour is expected and desirable especially in the context of a CI
system running KSelfTest, since it is not always easy to guarantee that the
most recent and esoteric dependencies are respected across all KSFT TARGETS
in a timely manner.
Unfortunately, as of now, this holds true only if the very last of the
built subsystems could have been successfully compiled: if the last of
those subsystem instead failed to build, such failure is taken as the whole
outcome of the Makefile target and the complete build/install process halts
even though many other preceding subsytems were in fact already built
successfully.
Fix the KSFT Makefile behaviour related to all/install targets in order
to fail as a whole only when the all/install targets have failed for all
of the requested TARGETS, while succeeding when at least one of TARGETS
has been successfully built.
Signed-off-by: Cristian Marussi <cristian.marussi(a)arm.com>
---
This patch is based on ksft/fixes branch from:
git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest.git
on top of commit (~5.5-rc1):
99e51aa8f701 Documentation: kunit: add documentation for kunit_tool
Building with either:
make kselftest-install \
KSFT_INSTALL_PATH=/tmp/KSFT \
TARGETS="exec arm64 bpf"
make -C tools/testing/selftests install \
KSFT_INSTALL_PATH=/tmp/KSFT \
TARGETS="exec arm64 bpf"
(with 'bpf' not building clean on my setup in the above case)
and veryfying that build/install completes if at least one of TARGETS can
be successfully built, and any successfully built subsystem is installed.
Changes:
-------
V1 --> V2
- rebased on 5.5-rc1
- rewording commit message
- dropped RFC tag
---
tools/testing/selftests/Makefile | 18 +++++++++++-------
1 file changed, 11 insertions(+), 7 deletions(-)
diff --git a/tools/testing/selftests/Makefile b/tools/testing/selftests/Makefile
index b001c602414b..86b2a3fca04d 100644
--- a/tools/testing/selftests/Makefile
+++ b/tools/testing/selftests/Makefile
@@ -143,11 +143,13 @@ else
endif
all: khdr
- @for TARGET in $(TARGETS); do \
- BUILD_TARGET=$$BUILD/$$TARGET; \
- mkdir $$BUILD_TARGET -p; \
- $(MAKE) OUTPUT=$$BUILD_TARGET -C $$TARGET;\
- done;
+ @ret=1; \
+ for TARGET in $(TARGETS); do \
+ BUILD_TARGET=$$BUILD/$$TARGET; \
+ mkdir $$BUILD_TARGET -p; \
+ $(MAKE) OUTPUT=$$BUILD_TARGET -C $$TARGET; \
+ ret=$$((ret * $$?)); \
+ done; exit $$ret;
run_tests: all
@for TARGET in $(TARGETS); do \
@@ -196,10 +198,12 @@ ifdef INSTALL_PATH
install -m 744 kselftest/module.sh $(INSTALL_PATH)/kselftest/
install -m 744 kselftest/runner.sh $(INSTALL_PATH)/kselftest/
install -m 744 kselftest/prefix.pl $(INSTALL_PATH)/kselftest/
- @for TARGET in $(TARGETS); do \
+ @ret=1; \
+ for TARGET in $(TARGETS); do \
BUILD_TARGET=$$BUILD/$$TARGET; \
$(MAKE) OUTPUT=$$BUILD_TARGET -C $$TARGET INSTALL_PATH=$(INSTALL_PATH)/$$TARGET install; \
- done;
+ ret=$$((ret * $$?)); \
+ done; exit $$ret;
@# Ask all targets to emit their test scripts
echo "#!/bin/sh" > $(ALL_SCRIPT)
--
2.17.1
When reading the codes, I find the definitions of interrupt-window exiting and
nmi-window exiting don't match the names in latest intel SDM.
I have no idea whether it's the historical names, rename them to match the
latest SDM to avoid confusion.
CPU_BASED_USE_TSC_OFFSETING mis-spelling in Patch 3 is found by checkpatch.pl.
Xiaoyao Li (3):
KVM: VMX: Rename INTERRUPT_PENDING to INTERRUPT_WINDOW
KVM: VMX: Rename NMI_PENDING to NMI_WINDOW
KVM: VMX: Fix the spelling of CPU_BASED_USE_TSC_OFFSETTING
arch/x86/include/asm/vmx.h | 6 ++--
arch/x86/include/uapi/asm/vmx.h | 4 +--
arch/x86/kvm/vmx/nested.c | 28 +++++++++----------
arch/x86/kvm/vmx/vmx.c | 20 ++++++-------
tools/arch/x86/include/uapi/asm/vmx.h | 4 +--
.../selftests/kvm/include/x86_64/vmx.h | 8 +++---
.../kvm/x86_64/vmx_tsc_adjust_test.c | 2 +-
7 files changed, 36 insertions(+), 36 deletions(-)
--
2.19.1
When using fragments with size 8 and payload larger than 8000, the backlog
might fill up and packets will be dropped, causing the test to fail. This
happens often enough when conntrack is on during the IPv6 test.
As the larger payload in the test is 10000, using a backlog of 1250 allow
the test to run repeatedly without failure. At least a 1000 runs were
possible with no failures, when usually less than 50 runs were good enough
for showing a failure.
As netdev_max_backlog is not a pernet setting, this sets the backlog to
1000 during exit to prevent disturbing following tests.
Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo(a)canonical.com>
Fixes: 4c3510483d26 (selftests: net: ip_defrag: cover new IPv6 defrag behavior)
---
tools/testing/selftests/net/ip_defrag.sh | 3 +++
1 file changed, 3 insertions(+)
diff --git a/tools/testing/selftests/net/ip_defrag.sh b/tools/testing/selftests/net/ip_defrag.sh
index 15d3489ecd9c..c91cfecfa245 100755
--- a/tools/testing/selftests/net/ip_defrag.sh
+++ b/tools/testing/selftests/net/ip_defrag.sh
@@ -12,6 +12,8 @@ setup() {
ip netns add "${NETNS}"
ip -netns "${NETNS}" link set lo up
+ sysctl -w net.core.netdev_max_backlog=1250 >/dev/null 2>&1
+
ip netns exec "${NETNS}" sysctl -w net.ipv4.ipfrag_high_thresh=9000000 >/dev/null 2>&1
ip netns exec "${NETNS}" sysctl -w net.ipv4.ipfrag_low_thresh=7000000 >/dev/null 2>&1
ip netns exec "${NETNS}" sysctl -w net.ipv4.ipfrag_time=1 >/dev/null 2>&1
@@ -30,6 +32,7 @@ setup() {
cleanup() {
ip netns del "${NETNS}"
+ sysctl -w net.core.netdev_max_backlog=1000 >/dev/null 2>&1
}
trap cleanup EXIT
--
2.24.0
This patchset is being developed here:
<https://github.com/cyphar/linux/tree/openat2/master>
Patch changelog:
v18:
* Further fixups from Al Viro:
- Don't WARN_ON in complete_walk() check since it can be trivially
triggered by userspace. Also, improve the comment so the purpose of the
check is more clear.
- Avoid duplicate smp_rmb() when in handle_dots() by doing
__read_seqcount_retry().
- Drop vestigial UPGRADE_NO* flag definitions in uapi.
* Update non-zero __padding test to include all bytes of the padding.
v17: <https://lore.kernel.org/lkml/20191117011713.13032-1-cyphar@cyphar.com/>
<https://lore.kernel.org/lkml/20191120050631.12816-1-cyphar@cyphar.com/>
v16: <https://lore.kernel.org/lkml/20191116002802.6663-1-cyphar@cyphar.com/>
v15: <https://lore.kernel.org/lkml/20191105090553.6350-1-cyphar@cyphar.com/>
v14: <https://lore.kernel.org/lkml/20191010054140.8483-1-cyphar@cyphar.com/>
<https://lore.kernel.org/lkml/20191026185700.10708-1-cyphar@cyphar.com>
v13: <https://lore.kernel.org/lkml/20190930183316.10190-1-cyphar@cyphar.com/>
v12: <https://lore.kernel.org/lkml/20190904201933.10736-1-cyphar@cyphar.com/>
v11: <https://lore.kernel.org/lkml/20190820033406.29796-1-cyphar@cyphar.com/>
<https://lore.kernel.org/lkml/20190728010207.9781-1-cyphar@cyphar.com/>
v10: <https://lore.kernel.org/lkml/20190719164225.27083-1-cyphar@cyphar.com/>
v09: <https://lore.kernel.org/lkml/20190706145737.5299-1-cyphar@cyphar.com/>
v08: <https://lore.kernel.org/lkml/20190520133305.11925-1-cyphar@cyphar.com/>
v07: <https://lore.kernel.org/lkml/20190507164317.13562-1-cyphar@cyphar.com/>
v06: <https://lore.kernel.org/lkml/20190506165439.9155-1-cyphar@cyphar.com/>
v05: <https://lore.kernel.org/lkml/20190320143717.2523-1-cyphar@cyphar.com/>
v04: <https://lore.kernel.org/lkml/20181112142654.341-1-cyphar@cyphar.com/>
v03: <https://lore.kernel.org/lkml/20181009070230.12884-1-cyphar@cyphar.com/>
v02: <https://lore.kernel.org/lkml/20181009065300.11053-1-cyphar@cyphar.com/>
v01: <https://lore.kernel.org/lkml/20180929103453.12025-1-cyphar@cyphar.com/>
For a very long time, extending openat(2) with new features has been
incredibly frustrating. This stems from the fact that openat(2) is
possibly the most famous counter-example to the mantra "don't silently
accept garbage from userspace" -- it doesn't check whether unknown flags
are present[1].
This means that (generally) the addition of new flags to openat(2) has
been fraught with backwards-compatibility issues (O_TMPFILE has to be
defined as __O_TMPFILE|O_DIRECTORY|[O_RDWR or O_WRONLY] to ensure old
kernels gave errors, since it's insecure to silently ignore the
flag[2]). All new security-related flags therefore have a tough road to
being added to openat(2).
Furthermore, the need for some sort of control over VFS's path resolution (to
avoid malicious paths resulting in inadvertent breakouts) has been a very
long-standing desire of many userspace applications. This patchset is a revival
of Al Viro's old AT_NO_JUMPS[3] patchset (which was a variant of David
Drysdale's O_BENEATH patchset[4] which was a spin-off of the Capsicum
project[5]) with a few additions and changes made based on the previous
discussion within [6] as well as others I felt were useful.
In line with the conclusions of the original discussion of AT_NO_JUMPS, the
flag has been split up into separate flags. However, instead of being an
openat(2) flag it is provided through a new syscall openat2(2) which provides
several other improvements to the openat(2) interface (see the patch
description for more details). The following new LOOKUP_* flags are added:
* LOOKUP_NO_XDEV blocks all mountpoint crossings (upwards, downwards,
or through absolute links). Absolute pathnames alone in openat(2) do not
trigger this. Magic-link traversal which implies a vfsmount jump is also
blocked (though magic-link jumps on the same vfsmount are permitted).
* LOOKUP_NO_MAGICLINKS blocks resolution through /proc/$pid/fd-style
links. This is done by blocking the usage of nd_jump_link() during
resolution in a filesystem. The term "magic-links" is used to match
with the only reference to these links in Documentation/, but I'm
happy to change the name.
It should be noted that this is different to the scope of
~LOOKUP_FOLLOW in that it applies to all path components. However,
you can do openat2(NO_FOLLOW|NO_MAGICLINKS) on a magic-link and it
will *not* fail (assuming that no parent component was a
magic-link), and you will have an fd for the magic-link.
In order to correctly detect magic-links, the introduction of a new
LOOKUP_MAGICLINK_JUMPED state flag was required.
* LOOKUP_BENEATH disallows escapes to outside the starting dirfd's
tree, using techniques such as ".." or absolute links. Absolute
paths in openat(2) are also disallowed. Conceptually this flag is to
ensure you "stay below" a certain point in the filesystem tree --
but this requires some additional to protect against various races
that would allow escape using "..".
Currently LOOKUP_BENEATH implies LOOKUP_NO_MAGICLINKS, because it
can trivially beam you around the filesystem (breaking the
protection). In future, there might be similar safety checks done as
in LOOKUP_IN_ROOT, but that requires more discussion.
In addition, two new flags are added that expand on the above ideas:
* LOOKUP_NO_SYMLINKS does what it says on the tin. No symlink
resolution is allowed at all, including magic-links. Just as with
LOOKUP_NO_MAGICLINKS this can still be used with NOFOLLOW to open an
fd for the symlink as long as no parent path had a symlink
component.
* LOOKUP_IN_ROOT is an extension of LOOKUP_BENEATH that, rather than
blocking attempts to move past the root, forces all such movements
to be scoped to the starting point. This provides chroot(2)-like
protection but without the cost of a chroot(2) for each filesystem
operation, as well as being safe against race attacks that chroot(2)
is not.
If a race is detected (as with LOOKUP_BENEATH) then an error is
generated, and similar to LOOKUP_BENEATH it is not permitted to cross
magic-links with LOOKUP_IN_ROOT.
The primary need for this is from container runtimes, which
currently need to do symlink scoping in userspace[7] when opening
paths in a potentially malicious container. There is a long list of
CVEs that could have bene mitigated by having RESOLVE_THIS_ROOT
(such as CVE-2017-1002101, CVE-2017-1002102, CVE-2018-15664, and
CVE-2019-5736, just to name a few).
In order to make all of the above more usable, I'm working on
libpathrs[8] which is a C-friendly library for safe path resolution. It
features a userspace-emulated backend if the kernel doesn't support
openat2(2). Hopefully we can get userspace to switch to using it, and
thus get openat2(2) support for free once it's ready.
Future work would include implementing things like RESOLVE_NO_AUTOMOUNT and
possibly a RESOLVE_NO_REMOTE (to allow programs to be sure they don't hit DoSes
though stale NFS handles).
[1]: https://lwn.net/Articles/588444/
[2]: https://lore.kernel.org/lkml/CA+55aFyyxJL1LyXZeBsf2ypriraj5ut1XkNDsunRBqgVj…
[3]: https://lore.kernel.org/lkml/20170429220414.GT29622@ZenIV.linux.org.uk
[4]: https://lore.kernel.org/lkml/1415094884-18349-1-git-send-email-drysdale@goo…
[5]: https://lore.kernel.org/lkml/1404124096-21445-1-git-send-email-drysdale@goo…
[6]: https://lwn.net/Articles/723057/
[7]: https://github.com/cyphar/filepath-securejoin
[8]: https://github.com/openSUSE/libpathrs
The current draft of the openat2(2) man-page is included below.
--8<---------------------------------------------------------------------------
OPENAT2(2) Linux Programmer's Manual OPENAT2(2)
NAME
openat2 - open and possibly create a file (extended)
SYNOPSIS
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
int openat2(int dirfd, const char *pathname, struct open_how *how, size_t size);
Note: There is no glibc wrapper for this system call; see NOTES.
DESCRIPTION
The openat2() system call opens the file specified by pathname. If the specified file does not exist, it may
optionally (if O_CREAT is specified in how.flags) be created by openat2().
As with openat(2), if pathname is relative, then it is interpreted relative to the directory referred to by
the file descriptor dirfd (or the current working directory of the calling process, if dirfd is the special
value AT_FDCWD.) If pathname is absolute, then dirfd is ignored (unless how.resolve contains RESOLVE_IN_ROOT,
in which case pathname is resolved relative to dirfd.)
The openat2() system call is an extension of openat(2) and provides a superset of its functionality. Rather
than taking a single flag argument, an extensible structure (how) is passed instead to allow for future exten-
sions. size must be set to sizeof(struct open_how), to facilitate future extensions (see the "Extensibility"
section of the NOTES for more detail on how extensions are handled.)
The open_how structure
The following structure indicates how pathname should be opened, and acts as a superset of the flag and mode
arguments to openat(2).
struct open_how {
__aligned_u64 flags; /* O_* flags. */
__u16 mode; /* Mode for O_{CREAT,TMPFILE}. */
__u16 __padding[3]; /* Must be zeroed. */
__aligned_u64 resolve; /* RESOLVE_* flags. */
};
Any future extensions to openat2() will be implemented as new fields appended to the above structure (or
through reuse of pre-existing padding space), with the zero value of the new fields acting as though the ex-
tension were not present.
The meaning of each field is as follows:
flags
The file creation and status flags to use for this operation. All of the O_* flags defined for
openat(2) are valid openat2() flag values.
Unlike openat(2), it is an error to provide openat2() unknown or conflicting flags in flags.
mode
File mode for the new file, with identical semantics to the mode argument to openat(2). How-
ever, unlike openat(2), it is an error to provide openat2() with a mode which contains bits
other than 0777.
It is an error to provide openat2() a non-zero mode if flags does not contain O_CREAT or O_TMP-
FILE.
resolve
Change how the components of pathname will be resolved (see path_resolution(7) for background
information.) The primary use case for these flags is to allow trusted programs to restrict how
untrusted paths (or paths inside untrusted directories) are resolved. The full list of resolve
flags is given below.
RESOLVE_NO_XDEV
Disallow traversal of mount points during path resolution (including all bind mounts).
Users of this flag are encouraged to make its use configurable (unless it is used for a
specific security purpose), as bind mounts are very widely used by end-users. Setting
this flag indiscrimnately for all uses of openat2() may result in spurious errors on pre-
viously-functional systems.
RESOLVE_NO_SYMLINKS
Disallow resolution of symbolic links during path resolution. This option implies RE-
SOLVE_NO_MAGICLINKS.
If the trailing component is a symbolic link, and flags contains both O_PATH and O_NOFOL-
LOW, then an O_PATH file descriptor referencing the symbolic link will be returned.
Users of this flag are encouraged to make its use configurable (unless it is used for a
specific security purpose), as symbolic links are very widely used by end-users. Setting
this flag indiscrimnately for all uses of openat2() may result in spurious errors on pre-
viously-functional systems.
RESOLVE_NO_MAGICLINKS
Disallow all magic link resolution during path resolution.
If the trailing component is a magic link, and flags contains both O_PATH and O_NOFOLLOW,
then an O_PATH file descriptor referencing the magic link will be returned.
Magic-links are symbolic link-like objects that are most notably found in proc(5) (exam-
ples include /proc/[pid]/exe and /proc/[pid]/fd/*.) Due to the potential danger of un-
knowingly opening these magic links, it may be preferable for users to disable their res-
olution entirely (see symboliclink(7) for more details.)
RESOLVE_BENEATH
Do not permit the path resolution to succeed if any component of the resolution is not a
descendant of the directory indicated by dirfd. This results in absolute symbolic links
(and absolute values of pathname) to be rejected.
Currently, this flag also disables magic link resolution. However, this may change in
the future. The caller should explicitly specify RESOLVE_NO_MAGICLINKS to ensure that
magic links are not resolved.
RESOLVE_IN_ROOT
Treat dirfd as the root directory while resolving pathname (as though the user called ch-
root(2) with dirfd as the argument.) Absolute symbolic links and ".." path components
will be scoped to dirfd. If pathname is an absolute path, it is also treated relative to
dirfd.
However, unlike chroot(2) (which changes the filesystem root permanently for a process),
RESOLVE_IN_ROOT allows a program to efficiently restrict path resolution for only certain
operations. It also has several hardening features (such detecting escape attempts dur-
ing .. resolution) which chroot(2) does not.
Currently, this flag also disables magic link resolution. However, this may change in
the future. The caller should explicitly specify RESOLVE_NO_MAGICLINKS to ensure that
magic links are not resolved.
It is an error to provide openat2() unknown flags in resolve.
RETURN VALUE
On success, a new file descriptor is returned. On error, -1 is returned, and errno is set appropriately.
ERRORS
The set of errors returned by openat2() includes all of the errors returned by openat(2), as well as the fol-
lowing additional errors:
EINVAL An unknown flag or invalid value was specified in how.
EINVAL mode is non-zero, but flags does not contain O_CREAT or O_TMPFILE.
EINVAL size was smaller than any known version of struct open_how.
E2BIG An extension was specified in how, which the current kernel does not support (see the "Extensibility"
section of the NOTES for more detail on how extensions are handled.)
EAGAIN resolve contains either RESOLVE_IN_ROOT or RESOLVE_BENEATH, and the kernel could not ensure that a ".."
component didn't escape (due to a race condition or potential attack.) Callers may choose to retry the
openat2() call.
EXDEV resolve contains either RESOLVE_IN_ROOT or RESOLVE_BENEATH, and an escape from the root during path
resolution was detected.
EXDEV resolve contains RESOLVE_NO_XDEV, and a path component attempted to cross a mount point.
ELOOP resolve contains RESOLVE_NO_SYMLINKS, and one of the path components was a symbolic link (or magic
link).
ELOOP resolve contains RESOLVE_NO_MAGICLINKS, and one of the path components was a magic link.
VERSIONS
openat2() first appeared in Linux 5.6.
CONFORMING TO
This system call is Linux-specific.
The semantics of RESOLVE_BENEATH were modelled after FreeBSD's O_BENEATH.
NOTES
Glibc does not provide a wrapper for this system call; call it using syscall(2).
Extensibility
In order to allow for struct open_how to be extended in future kernel revisions, openat2() requires userspace
to specify the size of struct open_how structure they are passing. By providing this information, it is pos-
sible for openat2() to provide both forwards- and backwards-compatibility — with size acting as an implicit
version number (because new extension fields will always be appended, the size will always increase.) This
extensibility design is very similar to other system calls such as perf_setattr(2), perf_event_open(2), and
clone(3).
If we let usize be the size of the structure according to userspace and ksize be the size of the structure
which the kernel supports, then there are only three cases to consider:
* If ksize equals usize, then there is no version mismatch and how can be used verbatim.
* If ksize is larger than usize, then there are some extensions the kernel supports which the
userspace program is unaware of. Because all extensions must have their zero values be a no-op, the
kernel treats all of the extension fields not set by userspace to have zero values. This provides
backwards-compatibility.
* If ksize is smaller than usize, then there are some extensions which the userspace program is aware
of but the kernel does not support. Because all extensions must have their zero values be a no-op,
the kernel can safely ignore the unsupported extension fields if they are all-zero. If any unsup-
ported extension fields are non-zero, then -1 is returned and errno is set to E2BIG. This provides
forwards-compatibility.
Therefore, most userspace programs will not need to have any special handling of extensions. However, if a
userspace program wishes to determine what extensions the running kernel supports, they may conduct a binary
search on size (to find the largest value which doesn't produce an error of E2BIG.)
SEE ALSO
openat(2), path_resolution(7), symlink(7)
Linux 2019-11-05 OPENAT2(2)
--8<---------------------------------------------------------------------------
Aleksa Sarai (13):
namei: only return -ECHILD from follow_dotdot_rcu()
nsfs: clean-up ns_get_path() signature to return int
namei: allow nd_jump_link() to produce errors
namei: allow set_root() to produce errors
namei: LOOKUP_NO_SYMLINKS: block symlink resolution
namei: LOOKUP_NO_MAGICLINKS: block magic-link resolution
namei: LOOKUP_NO_XDEV: block mountpoint crossing
namei: LOOKUP_BENEATH: O_BENEATH-like scoped resolution
namei: LOOKUP_IN_ROOT: chroot-like scoped resolution
namei: LOOKUP_{IN_ROOT,BENEATH}: permit limited ".." resolution
open: introduce openat2(2) syscall
selftests: add openat2(2) selftests
Documentation: path-lookup: include new LOOKUP flags
CREDITS | 4 +-
Documentation/filesystems/path-lookup.rst | 68 ++-
arch/alpha/kernel/syscalls/syscall.tbl | 1 +
arch/arm/tools/syscall.tbl | 1 +
arch/arm64/include/asm/unistd.h | 2 +-
arch/arm64/include/asm/unistd32.h | 2 +
arch/ia64/kernel/syscalls/syscall.tbl | 1 +
arch/m68k/kernel/syscalls/syscall.tbl | 1 +
arch/microblaze/kernel/syscalls/syscall.tbl | 1 +
arch/mips/kernel/syscalls/syscall_n32.tbl | 1 +
arch/mips/kernel/syscalls/syscall_n64.tbl | 1 +
arch/mips/kernel/syscalls/syscall_o32.tbl | 1 +
arch/parisc/kernel/syscalls/syscall.tbl | 1 +
arch/powerpc/kernel/syscalls/syscall.tbl | 1 +
arch/s390/kernel/syscalls/syscall.tbl | 1 +
arch/sh/kernel/syscalls/syscall.tbl | 1 +
arch/sparc/kernel/syscalls/syscall.tbl | 1 +
arch/x86/entry/syscalls/syscall_32.tbl | 1 +
arch/x86/entry/syscalls/syscall_64.tbl | 1 +
arch/xtensa/kernel/syscalls/syscall.tbl | 1 +
fs/namei.c | 199 +++++--
fs/nsfs.c | 29 +-
fs/open.c | 149 +++--
fs/proc/base.c | 3 +-
fs/proc/namespaces.c | 20 +-
include/linux/fcntl.h | 12 +-
include/linux/namei.h | 12 +-
include/linux/proc_ns.h | 4 +-
include/linux/syscalls.h | 3 +
include/uapi/asm-generic/unistd.h | 5 +-
include/uapi/linux/fcntl.h | 35 ++
kernel/bpf/offload.c | 12 +-
kernel/events/core.c | 2 +-
security/apparmor/apparmorfs.c | 6 +-
tools/testing/selftests/Makefile | 1 +
tools/testing/selftests/openat2/.gitignore | 1 +
tools/testing/selftests/openat2/Makefile | 8 +
tools/testing/selftests/openat2/helpers.c | 109 ++++
tools/testing/selftests/openat2/helpers.h | 107 ++++
.../testing/selftests/openat2/openat2_test.c | 320 +++++++++++
.../selftests/openat2/rename_attack_test.c | 160 ++++++
.../testing/selftests/openat2/resolve_test.c | 523 ++++++++++++++++++
42 files changed, 1697 insertions(+), 115 deletions(-)
create mode 100644 tools/testing/selftests/openat2/.gitignore
create mode 100644 tools/testing/selftests/openat2/Makefile
create mode 100644 tools/testing/selftests/openat2/helpers.c
create mode 100644 tools/testing/selftests/openat2/helpers.h
create mode 100644 tools/testing/selftests/openat2/openat2_test.c
create mode 100644 tools/testing/selftests/openat2/rename_attack_test.c
create mode 100644 tools/testing/selftests/openat2/resolve_test.c
base-commit: 219d54332a09e8d8741c1e1982f5eae56099de85
--
2.24.0
The current kunit execution model is to provide base kunit functionality
and tests built-in to the kernel. The aim of this series is to allow
building kunit itself and tests as modules. This in turn allows a
simple form of selective execution; load the module you wish to test.
In doing so, kunit itself (if also built as a module) will be loaded as
an implicit dependency.
Because this requires a core API modification - if a module delivers
multiple suites, they must be declared with the kunit_test_suites()
macro - we're proposing this patch set as a candidate to be applied to the
test tree before too many kunit consumers appear. We attempt to deal
with existing consumers in patch 3.
Changes since v4:
- fixed signoff chain to use Co-developed-by: prior to Knut's signoff
(Stephen, all patches)
- added Reviewed-by, Tested-by for patches 1, 2, 4 and 6
- updated comment describing try-catch-impl.h (Stephen, patch 2)
- fixed MODULE_LICENSEs to be GPL v2 (Stephen, patches 3, 5)
- added __init to kunit_init() (Stephen, patch 5)
Changes since v3:
- removed symbol lookup patch for separate submission later
- removed use of sysctl_hung_task_timeout_seconds (patch 4, as discussed
with Brendan and Stephen)
- disabled build of string-stream-test when CONFIG_KUNIT_TEST=m; this
is to avoid having to deal with symbol lookup issues
- changed string-stream-impl.h back to string-stream.h (Brendan)
- added module build support to new list, ext4 tests
Changes since v2:
- moved string-stream.h header to lib/kunit/string-stream-impl.h (Brendan)
(patch 1)
- split out non-exported interfaces in try-catch-impl.h (Brendan)
(patch 2)
- added kunit_find_symbol() and KUNIT_INIT_SYMBOL to lookup non-exported
symbols (patches 3, 4)
- removed #ifdef MODULE around module licenses (Randy, Brendan, Andy)
(patch 4)
- replaced kunit_test_suite() with kunit_test_suites() rather than
supporting both (Brendan) (patch 4)
- lookup sysctl_hung_task_timeout_secs as kunit may be built as a module
and the symbol may not be available (patch 5)
Alan Maguire (6):
kunit: move string-stream.h to lib/kunit
kunit: hide unexported try-catch interface in try-catch-impl.h
kunit: allow kunit tests to be loaded as a module
kunit: remove timeout dependence on sysctl_hung_task_timeout_seconds
kunit: allow kunit to be loaded as a module
kunit: update documentation to describe module-based build
Documentation/dev-tools/kunit/faq.rst | 3 +-
Documentation/dev-tools/kunit/index.rst | 3 ++
Documentation/dev-tools/kunit/usage.rst | 16 ++++++++++
fs/ext4/Kconfig | 2 +-
fs/ext4/Makefile | 5 +++
fs/ext4/inode-test.c | 4 ++-
include/kunit/assert.h | 3 +-
include/kunit/test.h | 35 ++++++++++++++------
include/kunit/try-catch.h | 10 ------
kernel/sysctl-test.c | 4 ++-
lib/Kconfig.debug | 4 +--
lib/kunit/Kconfig | 6 ++--
lib/kunit/Makefile | 14 +++++---
lib/kunit/assert.c | 10 ++++++
lib/kunit/{example-test.c => kunit-example-test.c} | 4 ++-
lib/kunit/{test-test.c => kunit-test.c} | 7 ++--
lib/kunit/string-stream-test.c | 5 +--
lib/kunit/string-stream.c | 3 +-
{include => lib}/kunit/string-stream.h | 0
lib/kunit/test.c | 25 ++++++++++++++-
lib/kunit/try-catch-impl.h | 27 ++++++++++++++++
lib/kunit/try-catch.c | 37 +++++-----------------
lib/list-test.c | 4 ++-
23 files changed, 160 insertions(+), 71 deletions(-)
rename lib/kunit/{example-test.c => kunit-example-test.c} (97%)
rename lib/kunit/{test-test.c => kunit-test.c} (98%)
rename {include => lib}/kunit/string-stream.h (100%)
create mode 100644 lib/kunit/try-catch-impl.h
--
1.8.3.1
Hi,
Here is the v2 series to fix build warnings and erorrs on
kselftest safesetid.
This version includes a fix for a runtime error.
Thank you,
---
Masami Hiramatsu (3):
selftests: safesetid: Move link library to LDLIBS
selftests: safesetid: Check the return value of setuid/setgid
selftests: safesetid: Fix Makefile to set correct test program
tools/testing/selftests/safesetid/Makefile | 5 +++--
tools/testing/selftests/safesetid/safesetid-test.c | 15 ++++++++++-----
2 files changed, 13 insertions(+), 7 deletions(-)
--
Masami Hiramatsu (Linaro) <mhiramat(a)kernel.org>
Hi,
Here are the patches to fix build warnings and erorrs on
kselftest safesetid.
Thank you,
---
Masami Hiramatsu (2):
selftests: safesetid: Move link library to LDLIBS
selftests: safesetid: Check the return value of setuid/setgid
tools/testing/selftests/safesetid/Makefile | 3 ++-
tools/testing/selftests/safesetid/safesetid-test.c | 15 ++++++++++-----
2 files changed, 12 insertions(+), 6 deletions(-)
--
Masami Hiramatsu (Linaro) <mhiramat(a)kernel.org>
This patchset contains trivial fixes for the kunit documentations and the
wrapper python scripts.
Changes from v2 (https://lore.kernel.org/linux-kselftest/1575361141-6806-1-git-send-email-sj…):
- Make 'build_dir' if not exists (missed from v3 by mistake)
SeongJae Park (5):
docs/kunit/start: Use in-tree 'kunit_defconfig'
kunit: Remove duplicated defconfig creation
kunit: Create default config in '--build_dir'
kunit: Place 'test.log' under the 'build_dir'
kunit: Rename 'kunitconfig' to '.kunitconfig'
Documentation/dev-tools/kunit/start.rst | 13 +++++--------
tools/testing/kunit/kunit.py | 16 ++++++++++------
tools/testing/kunit/kunit_kernel.py | 8 ++++----
3 files changed, 19 insertions(+), 18 deletions(-)
--
2.7.4
Since single_step_syscall.c depends on sys/syscall.h and
its include, asm/unistd.h, we should check the availability
of those headers.
Without this fix, if gcc-multilib is not installed but
libc6-dev-i386 is installed, kselftest tries to build 32bit
binary and failed with following error message.
In file included from single_step_syscall.c:18:
/usr/include/sys/syscall.h:24:10: fatal error: asm/unistd.h: No such file or directory
#include <asm/unistd.h>
^~~~~~~~~~~~~~
compilation terminated.
Signed-off-by: Masami Hiramatsu <mhiramat(a)kernel.org>
---
.../testing/selftests/x86/trivial_32bit_program.c | 1 +
.../testing/selftests/x86/trivial_64bit_program.c | 1 +
2 files changed, 2 insertions(+)
diff --git a/tools/testing/selftests/x86/trivial_32bit_program.c b/tools/testing/selftests/x86/trivial_32bit_program.c
index aa1f58c2f71c..6b455eda24f7 100644
--- a/tools/testing/selftests/x86/trivial_32bit_program.c
+++ b/tools/testing/selftests/x86/trivial_32bit_program.c
@@ -8,6 +8,7 @@
# error wrong architecture
#endif
+#include <sys/syscall.h>
#include <stdio.h>
int main()
diff --git a/tools/testing/selftests/x86/trivial_64bit_program.c b/tools/testing/selftests/x86/trivial_64bit_program.c
index 39f4b84fbf15..07ae86df18ff 100644
--- a/tools/testing/selftests/x86/trivial_64bit_program.c
+++ b/tools/testing/selftests/x86/trivial_64bit_program.c
@@ -8,6 +8,7 @@
# error wrong architecture
#endif
+#include <sys/syscall.h>
#include <stdio.h>
int main()
This patchset contains trivial fixes for the kunit documentations and the
wrapper python scripts.
Changes from v3
(https://lore.kernel.org/linux-kselftest/20191204192141.GA247851@google.com):
- Fix the 4th patch, "kunit: Place 'test.log' under the 'build_dir'" to set
default value of 'build_dir' as '' instead of NULL so that kunit can run
even though '--build_dir' option is not given.
Changes from v2
(https://lore.kernel.org/linux-kselftest/1575361141-6806-1-git-send-email-sj…):
- Make 'build_dir' if not exists (missed from v3 by mistake)
Changes from v1
(https://lore.kernel.org/linux-doc/1575242724-4937-1-git-send-email-sj38.par…):
- Remove "docs/kunit/start: Skip wrapper run command" (A similar approach is
ongoing)
- Make 'build_dir' if not exists
SeongJae Park (5):
docs/kunit/start: Use in-tree 'kunit_defconfig'
kunit: Remove duplicated defconfig creation
kunit: Create default config in '--build_dir'
kunit: Place 'test.log' under the 'build_dir'
kunit: Rename 'kunitconfig' to '.kunitconfig'
Documentation/dev-tools/kunit/start.rst | 13 +++++--------
tools/testing/kunit/kunit.py | 18 +++++++++++-------
tools/testing/kunit/kunit_kernel.py | 10 +++++-----
3 files changed, 21 insertions(+), 20 deletions(-)
--
2.7.4
Hello and good day,
My name is Charles, and I am contacting you to know first and
foremost, if your private or company bank account in Japan or
elsewhere can receive a total of 15 Million US Dollars
(Approximately 1.6 Billion JP¥) at the shortest notice? If yes,
then kindly reply for further and complete details.
Expecting your reply as soon as possible because we have limited
time to have this fund in your nominated bank account.
Doumo Arigatou.
Charles Renfroe.
Toronto-Canada
This patchset contains trivial fixes for the kunit documentations and the
wrapper python scripts.
Changes from v1 (https://lore.kernel.org/linux-doc/1575242724-4937-1-git-send-email-sj38.par…):
- Remove "docs/kunit/start: Skip wrapper run command" (A similar approach is
ongoing)
- Make 'build_dir' if not exists
SeongJae Park (5):
docs/kunit/start: Use in-tree 'kunit_defconfig'
kunit: Remove duplicated defconfig creation
kunit: Create default config in '--build_dir'
kunit: Place 'test.log' under the 'build_dir'
kunit: Rename 'kunitconfig' to '.kunitconfig'
Documentation/dev-tools/kunit/start.rst | 13 +++++--------
tools/testing/kunit/kunit.py | 14 ++++++++------
tools/testing/kunit/kunit_kernel.py | 8 ++++----
3 files changed, 17 insertions(+), 18 deletions(-)
--
2.7.4
Commit 9852ae3fe529 ("mm, memcg: consider subtrees in memory.events") made
memory.events recursive: all events are propagated upwards by the
tree. It was a change in semantics.
It broke the oom group leaf events test: it assumes that after
an OOM the oom_kill counter is zero on parent's level.
Let's adjust the test: it should have similar expectations
for the child and parent levels.
The test passes after this fix.
Signed-off-by: Roman Gushchin <guro(a)fb.com>
Cc: Chris Down <chris(a)chrisdown.name>
Cc: Johannes Weiner <hannes(a)cmpxchg.org>
---
tools/testing/selftests/cgroup/test_memcontrol.c | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)
diff --git a/tools/testing/selftests/cgroup/test_memcontrol.c b/tools/testing/selftests/cgroup/test_memcontrol.c
index c19a97dd02d4..60bfe53c0289 100644
--- a/tools/testing/selftests/cgroup/test_memcontrol.c
+++ b/tools/testing/selftests/cgroup/test_memcontrol.c
@@ -1002,7 +1002,8 @@ static int test_memcg_sock(const char *root)
/*
* This test disables swapping and tries to allocate anonymous memory
* up to OOM with memory.group.oom set. Then it checks that all
- * processes in the leaf (but not the parent) were killed.
+ * processes in the leaf were killed. It also checks that oom_events
+ * were propagated to the parent level.
*/
static int test_memcg_oom_group_leaf_events(const char *root)
{
@@ -1045,7 +1046,7 @@ static int test_memcg_oom_group_leaf_events(const char *root)
if (cg_read_key_long(child, "memory.events", "oom_kill ") <= 0)
goto cleanup;
- if (cg_read_key_long(parent, "memory.events", "oom_kill ") != 0)
+ if (cg_read_key_long(parent, "memory.events", "oom_kill ") <= 0)
goto cleanup;
ret = KSFT_PASS;
--
2.17.1
Hello and good day,
My name is Charles, and I am contacting you to know first and
foremost, if your private or company bank account in Japan or
elsewhere can receive a total of 15 Million US Dollars
(Approximately 1.6 Billion JP¥) at the shortest notice? If yes,
then kindly reply for further and complete details.
Expecting your reply as soon as possible because we have limited
time to have this fund in your nominated bank account.
Doumo Arigatou.
Charles Renfroe.
Toronto-Canada
Hi Linus,
Please pull this second Kselftest fixes update for Linux 5.5-rc1.
This second Kselftest fixes update for Linux 5.5-rc1 consists of
an urgent revert to fix regression in CI coverage.
diff is attached.
thanks,
-- Shuah
----------------------------------------------------------------
The following changes since commit ed2d8fa734e7759ac3788a19f308d3243d0eb164:
selftests: sync: Fix cast warnings on arm (2019-11-07 14:54:37 -0700)
are available in the Git repository at:
git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest
tags/linux-kselftest-5.5-rc1-fixes2
for you to fetch changes up to f60b85e83659b5fbd3eb2c8f68d33ef4e35ebb2c:
Revert "selftests: Fix O= and KBUILD_OUTPUT handling for relative
paths" (2019-11-28 16:27:44 -0700)
----------------------------------------------------------------
linux-kselftest-5.5-rc1-fixes2
This second Kselftest fixes update for Linux 5.5-rc1 consists of
an urgent revert to fix regression in CI coverage.
----------------------------------------------------------------
Shuah Khan (1):
Revert "selftests: Fix O= and KBUILD_OUTPUT handling for relative
paths"
tools/testing/selftests/Makefile | 5 ++---
1 file changed, 2 insertions(+), 3 deletions(-)
----------------------------------------------------------------
Support for frequency limits in dev_pm_qos was removed when cpufreq was
switched to freq_qos, this series attempts to restore it by
reimplementing on top of freq_qos.
Discussion about removal is here:
https://lore.kernel.org/linux-pm/VI1PR04MB7023DF47D046AEADB4E051EBEE680@VI1…
The cpufreq core switched away because it needs contraints at the level
of a "cpufreq_policy" which cover multiple cpus so dev_pm_qos coupling
to struct device was not useful. Cpufreq could only use dev_pm_qos by
implementing an additional layer of aggregation anyway.
However in the devfreq subsystem scaling is always performed on a per-device
basis so dev_pm_qos is a very good match. Support for dev_pm_qos in devfreq
core is here (latest version, no dependencies outside this series):
https://patchwork.kernel.org/cover/11252409/
That series is RFC mostly because it needs these PM core patches.
Earlier versions got entangled in some locking cleanups but those are
not strictly necessary to get dev_pm_qos functionality.
In theory if freq_qos is extended to handle conflicting min/max values then
this sharing would be valuable. Right now freq_qos just ties two unrelated
pm_qos aggregations for min and max freq.
---
This is implemented by embeding a freq_qos_request inside dev_pm_qos_request:
the data field was already an union in order to deal with flag requests.
The internal freq_qos_apply is exported so that it can be called from
dev_pm_qos apply_constraints.
The dev_pm_qos_constraints_destroy function has no obvious equivalent in
freq_qos and the whole approach of "removing requests" is somewhat dubios:
request objects should be owned by consumers and the list of qos requests
will most likely be empty when the target device is deleted. Series follows
current pattern for dev_pm_qos.
First two patches can be applied separately.
Changes since v3:
* Fix s/QOS/QoS in patch 2 title
* Improves comments in kunit test
* Fix assertions after freq_qos_remove_request
* Remove (c) from NXP copyright header
* Wrap long lines in qos.c to be under 80 chars. This fixes checkpatch but the
rule is already broken by code in the files.
* Collect reviews
Link to v3: https://patchwork.kernel.org/cover/11260627/
Changes since v2:
* #define PM_QOS_MAX_FREQUENCY_DEFAULT_VALUE FREQ_QOS_MAX_DEFAULT_VALUE
* #define FREQ_QOS_MAX_DEFAULT_VALUE S32_MAX (in new patch)
* Add initial kunit test for freq_qos, validating the MAX_DEFAULT_VALUE found
by Matthias and another recent fix. Testing this should be easier!
Link to v2: https://patchwork.kernel.org/cover/11250413/
Changes since v1:
* Don't rename or EXPORT_SYMBOL_GPL the freq_qos_apply function; just
drop the static marker.
Link to v1: https://patchwork.kernel.org/cover/11212887/
Leonard Crestez (4):
PM / QoS: Initial kunit test
PM / QoS: Redefine FREQ_QOS_MAX_DEFAULT_VALUE to S32_MAX
PM / QoS: Reorder pm_qos/freq_qos/dev_pm_qos structs
PM / QoS: Restore DEV_PM_QOS_MIN/MAX_FREQUENCY
drivers/base/Kconfig | 4 ++
drivers/base/power/Makefile | 1 +
drivers/base/power/qos-test.c | 117 ++++++++++++++++++++++++++++++++++
drivers/base/power/qos.c | 73 +++++++++++++++++++--
include/linux/pm_qos.h | 86 ++++++++++++++-----------
kernel/power/qos.c | 4 +-
6 files changed, 242 insertions(+), 43 deletions(-)
create mode 100644 drivers/base/power/qos-test.c
--
2.17.1
Fixes the issue caused by the fact that in C in the expression
of the form -1234L only 1234L is the actual literal, the unary
minus is an operation applied to the literal. Which means that
to express the lower bound for the type one has to negate the
upper bound and subtract 1.
Original error:
Expected test_data[i].expected.tv_sec == timestamp.tv_sec, but
test_data[i].expected.tv_sec == -2147483648
timestamp.tv_sec == 2147483648
1901-12-13 Lower bound of 32bit < 0 timestamp, no extra bits: msb:1
lower_bound:1 extra_bits: 0
Expected test_data[i].expected.tv_sec == timestamp.tv_sec, but
test_data[i].expected.tv_sec == 2147483648
timestamp.tv_sec == 6442450944
2038-01-19 Lower bound of 32bit <0 timestamp, lo extra sec bit on:
msb:1 lower_bound:1 extra_bits: 1
Expected test_data[i].expected.tv_sec == timestamp.tv_sec, but
test_data[i].expected.tv_sec == 6442450944
timestamp.tv_sec == 10737418240
2174-02-25 Lower bound of 32bit <0 timestamp, hi extra sec bit on:
msb:1 lower_bound:1 extra_bits: 2
not ok 1 - inode_test_xtimestamp_decoding
not ok 1 - ext4_inode_test
Reported-by: Geert Uytterhoeven <geert(a)linux-m68k.org>
Signed-off-by: Iurii Zaikin <yzaikin(a)google.com>
Tested-by: Geert Uytterhoeven <geert(a)linux-m68k.org>
---
fs/ext4/inode-test.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/fs/ext4/inode-test.c b/fs/ext4/inode-test.c
index 92a9da1774aa..bbce1c328d85 100644
--- a/fs/ext4/inode-test.c
+++ b/fs/ext4/inode-test.c
@@ -25,7 +25,7 @@
* For constructing the negative timestamp lower bound value.
* binary: 10000000 00000000 00000000 00000000
*/
-#define LOWER_MSB_1 (-0x80000000L)
+#define LOWER_MSB_1 (-(UPPER_MSB_0) - 1L) /* avoid overflow */
/*
* For constructing the negative timestamp upper bound value.
* binary: 11111111 11111111 11111111 11111111
--
2.24.0.432.g9d3f5f5b63-goog
Hi Linus,
Please pull these seccomp updates for v5.5-rc1. Mostly this is
implementing the new flag SECCOMP_USER_NOTIF_FLAG_CONTINUE, but there
are cleanups as well. Most notably, the secure_computing() prototype
has changed (to remove an unused argument), but this has happened at the
same time as riscv adding seccomp support, so the cleanest merge order
would be to merge riscv first, then seccomp with the following patch for
riscv to handle the change from "seccomp: simplify secure_computing()":
diff --git a/arch/riscv/kernel/ptrace.c b/arch/riscv/kernel/ptrace.c
index 0f84628b9385..407464201b91 100644
--- a/arch/riscv/kernel/ptrace.c
+++ b/arch/riscv/kernel/ptrace.c
@@ -159,7 +159,7 @@ __visible void do_syscall_trace_enter(struct pt_regs *regs)
* If this fails we might have return value in a0 from seccomp
* (via SECCOMP_RET_ERRNO/TRACE).
*/
- if (secure_computing(NULL) == -1) {
+ if (secure_computing() == -1) {
syscall_set_nr(current, regs, -1);
return;
}
Thanks!
-Kees
The following changes since commit da0c9ea146cbe92b832f1b0f694840ea8eb33cce:
Linux 5.4-rc2 (2019-10-06 14:27:30 -0700)
are available in the Git repository at:
https://git.kernel.org/pub/scm/linux/kernel/git/kees/linux.git tags/seccomp-v5.5-rc1
for you to fetch changes up to 23b2c96fad21886c53f5e1a4ffedd45ddd2e85ba:
seccomp: rework define for SECCOMP_USER_NOTIF_FLAG_CONTINUE (2019-10-28 12:29:46 -0700)
----------------------------------------------------------------
seccomp updates for v5.5
- implement SECCOMP_USER_NOTIF_FLAG_CONTINUE (Christian Brauner)
- fixes to selftests (Christian Brauner)
- remove secure_computing() argument (Christian Brauner)
----------------------------------------------------------------
Christian Brauner (6):
seccomp: avoid overflow in implicit constant conversion
seccomp: add SECCOMP_USER_NOTIF_FLAG_CONTINUE
seccomp: test SECCOMP_USER_NOTIF_FLAG_CONTINUE
seccomp: simplify secure_computing()
seccomp: fix SECCOMP_USER_NOTIF_FLAG_CONTINUE test
seccomp: rework define for SECCOMP_USER_NOTIF_FLAG_CONTINUE
arch/arm/kernel/ptrace.c | 2 +-
arch/arm64/kernel/ptrace.c | 2 +-
arch/parisc/kernel/ptrace.c | 2 +-
arch/s390/kernel/ptrace.c | 2 +-
arch/um/kernel/skas/syscall.c | 2 +-
arch/x86/entry/vsyscall/vsyscall_64.c | 2 +-
include/linux/seccomp.h | 6 +-
include/uapi/linux/seccomp.h | 29 +++++++
kernel/seccomp.c | 28 +++++--
tools/testing/selftests/seccomp/seccomp_bpf.c | 110 +++++++++++++++++++++++++-
10 files changed, 169 insertions(+), 16 deletions(-)
--
Kees Cook
Hi,
OK, here is v7, maybe this is the last one. The corresponding git repo
and branch is:
git@github.com:johnhubbard/linux.git pin_user_pages_tracking_v7
Ira, you reviewed the gup_benchmark patches a bit earlier, but I
removed one or two of those review-by tags, due to invasive changes
I made after your review (in response to further reviews).
So could you please reply to any patches you'd like to have
reviewed-by's restoredto, if any? Mainly I'm thinking of
"mm/gup_benchmark: support pin_user_pages() and related calls". Also
various FOLL_LONGTERM vs pin_longterm*() patches.
The following blurb from the v6 cover letter is still applicable, and
I'll repeat it here so it doesn't get lost in the patch blizzard:
Christoph Hellwig has a preference to do things a little differently,
for the devmap cleanup in patch 5 ("mm: devmap: refactor 1-based
refcounting for ZONE_DEVICE pages"). That came up in a different
review thread, because the patch is out for review in two locations.
Here's that review thread:
https://lore.kernel.org/r/20191118070826.GB3099@infradead.org
...and I'm hoping that we can defer that request, because otherwise
it derails this series, which is starting to otherwise look like
it could be ready for 5.5.
Changes since v6:
* Renamed a couple of routines, to get rid of unnecessary leading
underscores:
__pin_compound_head() --> grab_compound_head()
__record_subpages() --> record_subpages()
* Fixed the error fallback (put_compound_head()) so as to match the fix
in the previous version: need to put back N * GUP_PIN_COUNTING_BIAS
pages, for FOLL_PIN cases.
* Factored out yet another common chunk of code, into a new grab_page()
routine.
* Added a missing compound_head() call to put_compound_head().
* [Re-]added Jens Axboe's reviewed-by tag to the fs/io_uring patch.
* Added more reviewed-by's from Jan Kara.
Changes since v5:
* Fixed the refcounting for huge pages: in most cases, it was
only taking one GUP_PIN_COUNTING_BIAS's worth of refs, when it
should have been taking one GUP_PIN_COUNTING_BIAS for each subpage.
(Much thanks to Jan Kara for spotting that one!)
* Renamed user_page_ref_inc() to try_pin_page(), and added a new
try_pin_compound_head(). This definitely improves readability.
* Factored out some more duplication in the FOLL_PIN and FOLL_GET
cases, in gup.c.
* Fixed up some straggling "get_" --> "pin_" references in the comments.
* Added reviewed-by tags.
Changes since v4:
* Renamed put_user_page*() --> unpin_user_page().
* Removed all pin_longterm_pages*() calls. We will use FOLL_LONGTERM
at the call sites. (FOLL_PIN, however, remains an internal gup flag).
This is very nice: many patches just change three characters now:
get_user_pages --> pin_user_pages. I think we've found the right
balance of wrapper calls and gup flags, for the call sites.
* Updated a lot of documentation and commit logs to match the above
two large changes.
* Changed gup_benchmark tests and run_vmtests, to adapt to one less
use case: there is no pin_longterm_pages() call anymore.
* This includes a new devmap cleanup patch from Dan Williams, along
with a rebased follow-up: patches 4 and 5, already mentioned above.
* Fixed patch 10 ("mm/gup: introduce pin_user_pages*() and FOLL_PIN"),
so as to make pin_user_pages*() calls act as placeholders for the
corresponding get_user_pages*() calls, until a later patch fully
implements the DMA-pinning functionality.
Thanks to Jan Kara for noticing that.
* Fixed the implementation of pin_user_pages_remote().
* Further tweaked patch 2 ("mm/gup: factor out duplicate code from four
routines"), in response to Jan Kara's feedback.
* Dropped a few reviewed-by tags due to changes that invalidated
them.
Changes since v3:
* VFIO fix (patch 8): applied further cleanup: removed a pre-existing,
unnecessary release and reacquire of mmap_sem. Moved the DAX vma
checks from the vfio call site, to gup internals, and added comments
(and commit log) to clarify.
* Due to the above, made a corresponding fix to the
pin_longterm_pages_remote(), which was actually calling the wrong
gup internal function.
* Changed put_user_page() comments, to refer to pin*() APIs, rather than
get_user_pages*() APIs.
* Reverted an accidental whitespace-only change in the IB ODP code.
* Added a few more reviewed-by tags.
Changes since v2:
* Added a patch to convert IB/umem from normal gup, to gup_fast(). This
is also posted separately, in order to hopefully get some runtime
testing.
* Changed the page devmap code to be a little clearer,
thanks to Jerome for that.
* Split out the page devmap changes into a separate patch (and moved
Ira's Signed-off-by to that patch).
* Fixed my bug in IB: ODP code does not require pin_user_pages()
semantics. Therefore, revert the put_user_page() calls to put_page(),
and leave the get_user_pages() call as-is.
* As part of the revert, I am proposing here a change directly
from put_user_pages(), to release_pages(). I'd feel better if
someone agrees that this is the best way. It uses the more
efficient release_pages(), instead of put_page() in a loop,
and keep the change to just a few character on one line,
but OTOH it is not a pure revert.
* Loosened the FOLL_LONGTERM restrictions in the
__get_user_pages_locked() implementation, and used that in order
to fix up a VFIO bug. Thanks to Jason for that idea.
* Note the use of release_pages() in IB: is that OK?
* Added a few more WARN's and clarifying comments nearby.
* Many documentation improvements in various comments.
* Moved the new pin_user_pages.rst from Documentation/vm/ to
Documentation/core-api/ .
* Commit descriptions: added clarifying notes to the three patches
(drm/via, fs/io_uring, net/xdp) that already had put_user_page()
calls in place.
* Collected all pending Reviewed-by and Acked-by tags, from v1 and v2
email threads.
* Lot of churn from v2 --> v3, so it's possible that new bugs
sneaked in.
NOT DONE: separate patchset is required:
* __get_user_pages_locked(): stop compensating for
buggy callers who failed to set FOLL_GET. Instead, assert
that FOLL_GET is set (and fail if it's not).
======================================================================
Original cover letter (edited to fix up the patch description numbers)
This applies cleanly to linux-next and mmotm, and also to linux.git if
linux-next's commit 20cac10710c9 ("mm/gup_benchmark: fix MAP_HUGETLB
case") is first applied there.
This provides tracking of dma-pinned pages. This is a prerequisite to
solving the larger problem of proper interactions between file-backed
pages, and [R]DMA activities, as discussed in [1], [2], [3], and in
a remarkable number of email threads since about 2017. :)
A new internal gup flag, FOLL_PIN is introduced, and thoroughly
documented in the last patch's Documentation/vm/pin_user_pages.rst.
I believe that this will provide a good starting point for doing the
layout lease work that Ira Weiny has been working on. That's because
these new wrapper functions provide a clean, constrained, systematically
named set of functionality that, again, is required in order to even
know if a page is "dma-pinned".
In contrast to earlier approaches, the page tracking can be
incrementally applied to the kernel call sites that, until now, have
been simply calling get_user_pages() ("gup"). In other words, opt-in by
changing from this:
get_user_pages() (sets FOLL_GET)
put_page()
to this:
pin_user_pages() (sets FOLL_PIN)
put_user_page()
Because there are interdependencies with FOLL_LONGTERM, a similar
conversion as for FOLL_PIN, was applied. The change was from this:
get_user_pages(FOLL_LONGTERM) (also sets FOLL_GET)
put_page()
to this:
pin_longterm_pages() (sets FOLL_PIN | FOLL_LONGTERM)
put_user_page()
============================================================
Patch summary:
* Patches 1-9: refactoring and preparatory cleanup, independent fixes
* Patch 10: introduce pin_user_pages(), FOLL_PIN, but no functional
changes yet
* Patches 11-16: Convert existing put_user_page() callers, to use the
new pin*()
* Patch 17: Activate tracking of FOLL_PIN pages.
* Patches 18-20: convert various callers
* Patches: 21-23: gup_benchmark and run_vmtests support
* Patch 24: rename put_user_page*() --> unpin_user_page*()
============================================================
Testing:
* I've done some overall kernel testing (LTP, and a few other goodies),
and some directed testing to exercise some of the changes. And as you
can see, gup_benchmark is enhanced to exercise this. Basically, I've been
able to runtime test the core get_user_pages() and pin_user_pages() and
related routines, but not so much on several of the call sites--but those
are generally just a couple of lines changed, each.
Not much of the kernel is actually using this, which on one hand
reduces risk quite a lot. But on the other hand, testing coverage
is low. So I'd love it if, in particular, the Infiniband and PowerPC
folks could do a smoke test of this series for me.
Also, my runtime testing for the call sites so far is very weak:
* io_uring: Some directed tests from liburing exercise this, and they pass.
* process_vm_access.c: A small directed test passes.
* gup_benchmark: the enhanced version hits the new gup.c code, and passes.
* infiniband (still only have crude "IB pingpong" working, on a
good day: it's not exercising my conversions at runtime...)
* VFIO: compiles (I'm vowing to set up a run time test soon, but it's
not ready just yet)
* powerpc: it compiles...
* drm/via: compiles...
* goldfish: compiles...
* net/xdp: compiles...
* media/v4l2: compiles...
============================================================
Next:
* Get the block/bio_vec sites converted to use pin_user_pages().
* Work with Ira and Dave Chinner to weave this together with the
layout lease stuff.
============================================================
[1] Some slow progress on get_user_pages() (Apr 2, 2019): https://lwn.net/Articles/784574/
[2] DMA and get_user_pages() (LPC: Dec 12, 2018): https://lwn.net/Articles/774411/
[3] The trouble with get_user_pages() (Apr 30, 2018): https://lwn.net/Articles/753027/
Dan Williams (1):
mm: Cleanup __put_devmap_managed_page() vs ->page_free()
John Hubbard (23):
mm/gup: pass flags arg to __gup_device_* functions
mm/gup: factor out duplicate code from four routines
mm/gup: move try_get_compound_head() to top, fix minor issues
mm: devmap: refactor 1-based refcounting for ZONE_DEVICE pages
goldish_pipe: rename local pin_user_pages() routine
IB/umem: use get_user_pages_fast() to pin DMA pages
media/v4l2-core: set pages dirty upon releasing DMA buffers
vfio, mm: fix get_user_pages_remote() and FOLL_LONGTERM
mm/gup: introduce pin_user_pages*() and FOLL_PIN
goldish_pipe: convert to pin_user_pages() and put_user_page()
IB/{core,hw,umem}: set FOLL_PIN via pin_user_pages*(), fix up ODP
mm/process_vm_access: set FOLL_PIN via pin_user_pages_remote()
drm/via: set FOLL_PIN via pin_user_pages_fast()
fs/io_uring: set FOLL_PIN via pin_user_pages()
net/xdp: set FOLL_PIN via pin_user_pages()
mm/gup: track FOLL_PIN pages
media/v4l2-core: pin_user_pages (FOLL_PIN) and put_user_page()
conversion
vfio, mm: pin_user_pages (FOLL_PIN) and put_user_page() conversion
powerpc: book3s64: convert to pin_user_pages() and put_user_page()
mm/gup_benchmark: use proper FOLL_WRITE flags instead of hard-coding
"1"
mm/gup_benchmark: support pin_user_pages() and related calls
selftests/vm: run_vmtests: invoke gup_benchmark with basic FOLL_PIN
coverage
mm, tree-wide: rename put_user_page*() to unpin_user_page*()
Documentation/core-api/index.rst | 1 +
Documentation/core-api/pin_user_pages.rst | 233 +++++++++
arch/powerpc/mm/book3s64/iommu_api.c | 12 +-
drivers/gpu/drm/via/via_dmablit.c | 6 +-
drivers/infiniband/core/umem.c | 19 +-
drivers/infiniband/core/umem_odp.c | 13 +-
drivers/infiniband/hw/hfi1/user_pages.c | 4 +-
drivers/infiniband/hw/mthca/mthca_memfree.c | 8 +-
drivers/infiniband/hw/qib/qib_user_pages.c | 4 +-
drivers/infiniband/hw/qib/qib_user_sdma.c | 8 +-
drivers/infiniband/hw/usnic/usnic_uiom.c | 4 +-
drivers/infiniband/sw/siw/siw_mem.c | 4 +-
drivers/media/v4l2-core/videobuf-dma-sg.c | 8 +-
drivers/nvdimm/pmem.c | 6 -
drivers/platform/goldfish/goldfish_pipe.c | 35 +-
drivers/vfio/vfio_iommu_type1.c | 35 +-
fs/io_uring.c | 6 +-
include/linux/mm.h | 195 ++++++-
include/linux/mmzone.h | 2 +
include/linux/page_ref.h | 10 +
mm/gup.c | 553 +++++++++++++++-----
mm/gup_benchmark.c | 74 ++-
mm/huge_memory.c | 44 +-
mm/hugetlb.c | 36 +-
mm/memremap.c | 76 ++-
mm/process_vm_access.c | 28 +-
mm/vmstat.c | 2 +
net/xdp/xdp_umem.c | 4 +-
tools/testing/selftests/vm/gup_benchmark.c | 21 +-
tools/testing/selftests/vm/run_vmtests | 22 +
30 files changed, 1121 insertions(+), 352 deletions(-)
create mode 100644 Documentation/core-api/pin_user_pages.rst
--
2.24.0
USER_NOTIF_MAGIC is used to both initialize seccomp_notif_resp::val and
verify syscall resturn value. On 32-bit architectures syscall return
value has type long, but the value of USER_NOTIF_MAGIC has type long
long because it doesn't fit into long. As a result all syscall return
value comparisons with USER_NOTIF_MAGIC are false. This is also reported
by the compiler when '-W' is added to CFLAGS.
Add explicit type cast to USER_NOTIF_MAGIC definition.
This fixes the following seccomp_bpf tests on 32-bit architectures:
global.user_notification_basic
global.user_notification_child_pid_ns
global.user_notification_sibling_pid_ns
global.user_notification_fault_recv
Signed-off-by: Max Filippov <jcmvbkbc(a)gmail.com>
---
tools/testing/selftests/seccomp/seccomp_bpf.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/tools/testing/selftests/seccomp/seccomp_bpf.c b/tools/testing/selftests/seccomp/seccomp_bpf.c
index 7f8b5c8982e3..16cc30e2ade4 100644
--- a/tools/testing/selftests/seccomp/seccomp_bpf.c
+++ b/tools/testing/selftests/seccomp/seccomp_bpf.c
@@ -3077,7 +3077,7 @@ static int user_trap_syscall(int nr, unsigned int flags)
return seccomp(SECCOMP_SET_MODE_FILTER, flags, &prog);
}
-#define USER_NOTIF_MAGIC 116983961184613L
+#define USER_NOTIF_MAGIC ((unsigned long)116983961184613L)
TEST(user_notification_basic)
{
pid_t pid;
--
2.20.1
Hi,
Changes since v1:
* Fixed up ppc in response to Jan Kara's review comments (thanks for
those!).
* Fixed a kbuilt robot-detected build failure: added a stub function for
the !CONFIG_MMU case.
* Cover letter: now refers to "unpin_user_page()", reflecting the name
change in the last patch (instead of put_user_page() ).
* Rebased onto today's linux-next: c165016bac27 ("Add linux-next
specific files for 20191125")
========================================================================
Here is a set of well-reviewed (expect for one patch), lower-risk items
that can go into Linux 5.5. (Update: the powerpc conversion patch has
had some initial review now, since v1 was posted.)
This is essentially a cut-down v8 of "mm/gup: track dma-pinned pages:
FOLL_PIN" [1], and with one of the VFIO patches split into two patches.
The idea here is to get this long list of "noise" checked into 5.5, so
that the actual, higher-risk "track FOLL_PIN pages" (which is deferred:
not part of this series) will be a much shorter patchset to review.
For the v4l2-core changes, I've left those here (instead of sending
them separately to the -media tree), in order to get the name change
done now (put_user_page --> unpin_user_page). However, I've added a Cc
stable, as recommended during the last round of reviews.
Here are the relevant notes from the original cover letter, edited to
match the current situation:
This is a prerequisite to tracking dma-pinned pages. That in turn is a
prerequisite to solving the larger problem of proper interactions
between file-backed pages, and [R]DMA activities, as discussed in [1],
[2], [3], and in a remarkable number of email threads since about
2017. :)
A new internal gup flag, FOLL_PIN is introduced, and thoroughly
documented in the last patch's Documentation/vm/pin_user_pages.rst.
I believe that this will provide a good starting point for doing the
layout lease work that Ira Weiny has been working on. That's because
these new wrapper functions provide a clean, constrained, systematically
named set of functionality that, again, is required in order to even
know if a page is "dma-pinned".
In contrast to earlier approaches, the page tracking can be
incrementally applied to the kernel call sites that, until now, have
been simply calling get_user_pages() ("gup"). In other words, opt-in by
changing from this:
get_user_pages() (sets FOLL_GET)
put_page()
to this:
pin_user_pages() (sets FOLL_PIN)
unpin_user_page()
Because there are interdependencies with FOLL_LONGTERM, a similar
conversion as for FOLL_PIN, was applied. The change was from this:
get_user_pages(FOLL_LONGTERM) (also sets FOLL_GET)
put_page()
to this:
pin_longterm_pages() (sets FOLL_PIN | FOLL_LONGTERM)
unpin_user_page()
[1] https://lore.kernel.org/r/20191121071354.456618-1-jhubbard@nvidia.com
thanks,
John Hubbard
NVIDIA
Dan Williams (1):
mm: Cleanup __put_devmap_managed_page() vs ->page_free()
John Hubbard (18):
mm/gup: factor out duplicate code from four routines
mm/gup: move try_get_compound_head() to top, fix minor issues
goldish_pipe: rename local pin_user_pages() routine
mm: fix get_user_pages_remote()'s handling of FOLL_LONGTERM
vfio: fix FOLL_LONGTERM use, simplify get_user_pages_remote() call
mm/gup: introduce pin_user_pages*() and FOLL_PIN
goldish_pipe: convert to pin_user_pages() and put_user_page()
IB/{core,hw,umem}: set FOLL_PIN via pin_user_pages*(), fix up ODP
mm/process_vm_access: set FOLL_PIN via pin_user_pages_remote()
drm/via: set FOLL_PIN via pin_user_pages_fast()
fs/io_uring: set FOLL_PIN via pin_user_pages()
net/xdp: set FOLL_PIN via pin_user_pages()
media/v4l2-core: set pages dirty upon releasing DMA buffers
media/v4l2-core: pin_user_pages (FOLL_PIN) and put_user_page()
conversion
vfio, mm: pin_user_pages (FOLL_PIN) and put_user_page() conversion
powerpc: book3s64: convert to pin_user_pages() and put_user_page()
mm/gup_benchmark: use proper FOLL_WRITE flags instead of hard-coding
"1"
mm, tree-wide: rename put_user_page*() to unpin_user_page*()
Documentation/core-api/index.rst | 1 +
Documentation/core-api/pin_user_pages.rst | 233 ++++++++++++++
arch/powerpc/mm/book3s64/iommu_api.c | 12 +-
drivers/gpu/drm/via/via_dmablit.c | 6 +-
drivers/infiniband/core/umem.c | 4 +-
drivers/infiniband/core/umem_odp.c | 13 +-
drivers/infiniband/hw/hfi1/user_pages.c | 4 +-
drivers/infiniband/hw/mthca/mthca_memfree.c | 8 +-
drivers/infiniband/hw/qib/qib_user_pages.c | 4 +-
drivers/infiniband/hw/qib/qib_user_sdma.c | 8 +-
drivers/infiniband/hw/usnic/usnic_uiom.c | 4 +-
drivers/infiniband/sw/siw/siw_mem.c | 4 +-
drivers/media/v4l2-core/videobuf-dma-sg.c | 8 +-
drivers/nvdimm/pmem.c | 6 -
drivers/platform/goldfish/goldfish_pipe.c | 35 +-
drivers/vfio/vfio_iommu_type1.c | 35 +-
fs/io_uring.c | 6 +-
include/linux/mm.h | 77 +++--
mm/gup.c | 340 +++++++++++++-------
mm/gup_benchmark.c | 9 +-
mm/memremap.c | 80 ++---
mm/process_vm_access.c | 28 +-
net/xdp/xdp_umem.c | 4 +-
tools/testing/selftests/vm/gup_benchmark.c | 6 +-
24 files changed, 650 insertions(+), 285 deletions(-)
create mode 100644 Documentation/core-api/pin_user_pages.rst
--
2.24.0
Clean up a handful of interrelated warts in the kernel's handling of VMX:
- Enable VMX in IA32_FEATURE_CONTROL during boot instead of on-demand
during KVM load to avoid future contention over IA32_FEATURE_CONTROL.
- Rework VMX feature reporting so that it is accurate and up-to-date,
now and in the future.
- Consolidate code across CPUs that support VMX.
This series stems from two separate but related issues. The first issue,
pointed out by Boris in the SGX enabling series[1], is that the kernel
currently doesn't ensure the IA32_FEATURE_CONTROL MSR is configured during
boot. The second issue is that the kernel's reporting of VMX features is
stale, potentially inaccurate, and difficult to maintain.
Please holler if you don't want to be cc'd on future versions of this
series, or only want to be cc'd on select patches.
v3:
- Rebase to tip/master, ceceaf1f12ba ("Merge branch 'WIP.x86/cleanups'").
- Rename the feature control MSR bit defines [Boris].
- Rewrite the error message displayed when reading feature control MSR
faults on a VMX capable CPU to explicitly state that it's likely a
hardware or hypervisor issue [Boris].
- Collect a Reviewed-by for the LMCE change [Boris].
- Enable VMX in feature control (if it's unlocked) if and only if
KVM is enabled [Paolo].
- Remove a big pile of redudant MSR defines from the KVM selftests that
was discovered when renaming the feature control defines.
- Fix a changelog typoe [Boris].
v2:
- Rebase to latest tip/x86/cpu (1edae1ae6258, "x86/Kconfig: Enforce...)
- Collect Jim's reviews.
- Fix a typo in setting of EPT capabilities [TonyWWang-oc].
- Remove defines for reserved VMX feature flags [Paolo].
- Print the VMX features under "flags" and maintain all existing names
to be backward compatible with the ABI [Paolo].
- Create aggregate APIC features to report FLEXPRIORITY and APICV, so
that the full feature *and* their associated individual features are
printed, e.g. to aid in recognizing why an APIC feature isn't being
used.
- Fix a few copy paste errors in changelogs.
v1 cover letter:
== IA32_FEATURE_CONTROL ==
Lack of IA32_FEATURE_CONTROL configuration during boot isn't a functional
issue in the current kernel as the majority of platforms set and lock
IA32_FEATURE_CONTROL in firmware. And when the MSR is left unlocked, KVM
is the only subsystem that writes IA32_FEATURE_CONTROL. That will change
if/when SGX support is enabled, as SGX will also want to fully enable
itself when IA32_FEATURE_CONTROL is unlocked.
== VMX Feature Reporting ==
VMX features are not enumerated via CPUID, but instead are enumerated
through VMX MSRs. As a result, new VMX features are not automatically
reported via /proc/cpuinfo.
An attempt was made long ago to report interesting and/or meaningful VMX
features by synthesizing select features into a Linux-defined cpufeatures
word. Synthetic feature flags worked for the initial purpose, but the
existence of the synthetic flags was forgotten almost immediately, e.g.
only one new flag (EPT A/D) has been added in the the decade since the
synthetic VMX features were introduced, while VMX and KVM have gained
support for many new features.
Placing the synthetic flags in x86_capability also allows them to be
queried via cpu_has() and company, which is misleading as the flags exist
purely for reporting via /proc/cpuinfo. KVM, the only in-kernel user of
VMX, ignores the flags.
Last but not least, VMX features are reported in /proc/cpuinfo even
when VMX is unusable due to lack of enabling in IA32_FEATURE_CONTROL.
== Caveats ==
All of the testing of non-standard flows was done in a VM, as I don't
have a system that leaves IA32_FEATURE_CONTROL unlocked, or locks it with
VMX disabled.
The Centaur and Zhaoxin changes are somewhat speculative, as I haven't
confirmed they actually support IA32_FEATURE_CONTROL, or that they want to
gain "official" KVM support. I assume they unofficially support KVM given
that both CPUs went through the effort of enumerating VMX features. That
in turn would require them to support IA32_FEATURE_CONTROL since KVM will
fault and refuse to load if the MSR doesn't exist.
[1] https://lkml.kernel.org/r/20190925085156.GA3891@zn.tnic
Sean Christopherson (19):
x86/msr-index: Clean up bit defines for IA32_FEATURE_CONTROL MSR
selftests: kvm: Replace manual MSR defs with common msr-index.h
tools arch x86: Sync msr-index.h from kernel sources
x86/intel: Initialize IA32_FEATURE_CONTROL MSR at boot
x86/mce: WARN once if IA32_FEATURE_CONTROL MSR is left unlocked
x86/centaur: Use common IA32_FEATURE_CONTROL MSR initialization
x86/zhaoxin: Use common IA32_FEATURE_CONTROL MSR initialization
KVM: VMX: Drop initialization of IA32_FEATURE_CONTROL MSR
x86/cpu: Clear VMX feature flag if VMX is not fully enabled
KVM: VMX: Use VMX feature flag to query BIOS enabling
KVM: VMX: Check for full VMX support when verifying CPU compatibility
x86/vmx: Introduce VMX_FEATURES_*
x86/cpu: Detect VMX features on Intel, Centaur and Zhaoxin CPUs
x86/cpu: Print VMX flags in /proc/cpuinfo using VMX_FEATURES_*
x86/cpufeatures: Drop synthetic VMX feature flags
KVM: VMX: Use VMX_FEATURE_* flags to define VMCS control bits
x86/cpufeatures: Clean up synthetic virtualization flags
perf/x86: Provide stubs of KVM helpers for non-Intel CPUs
KVM: VMX: Allow KVM_INTEL when building for Centaur and/or Zhaoxin
CPUs
MAINTAINERS | 2 +-
arch/x86/Kconfig.cpu | 8 +
arch/x86/boot/mkcpustr.c | 1 +
arch/x86/include/asm/cpufeatures.h | 15 +-
arch/x86/include/asm/msr-index.h | 11 +-
arch/x86/include/asm/perf_event.h | 22 +-
arch/x86/include/asm/processor.h | 4 +
arch/x86/include/asm/vmx.h | 105 +--
arch/x86/include/asm/vmxfeatures.h | 86 +++
arch/x86/kernel/cpu/Makefile | 6 +-
arch/x86/kernel/cpu/centaur.c | 35 +-
arch/x86/kernel/cpu/common.c | 3 +
arch/x86/kernel/cpu/cpu.h | 4 +
arch/x86/kernel/cpu/feature_control.c | 127 +++
arch/x86/kernel/cpu/intel.c | 49 +-
arch/x86/kernel/cpu/mce/intel.c | 7 +-
arch/x86/kernel/cpu/mkcapflags.sh | 15 +-
arch/x86/kernel/cpu/proc.c | 14 +
arch/x86/kernel/cpu/zhaoxin.c | 35 +-
arch/x86/kvm/Kconfig | 10 +-
arch/x86/kvm/vmx/nested.c | 4 +-
arch/x86/kvm/vmx/vmx.c | 57 +-
arch/x86/kvm/vmx/vmx.h | 2 +-
tools/arch/x86/include/asm/msr-index.h | 27 +-
tools/testing/selftests/kvm/Makefile | 4 +-
.../selftests/kvm/include/x86_64/processor.h | 726 +-----------------
tools/testing/selftests/kvm/lib/x86_64/vmx.c | 4 +-
27 files changed, 400 insertions(+), 983 deletions(-)
create mode 100644 arch/x86/include/asm/vmxfeatures.h
create mode 100644 arch/x86/kernel/cpu/feature_control.c
--
2.24.0
Some versions of iproute2 will output more than one line per entry, which
will cause the test to fail, like:
TEST: ipv6: list and flush cached exceptions [FAIL]
can't list cached exceptions
That happens, for example, with iproute2 4.15.0. When using the -oneline
option, this will work just fine:
TEST: ipv6: list and flush cached exceptions [ OK ]
This also works just fine with a more recent version of iproute2, like
5.4.0.
For some reason, two lines are printed for the IPv4 test no matter what
version of iproute2 is used. Use the same -oneline parameter there instead
of counting the lines twice.
Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo(a)canonical.com>
---
tools/testing/selftests/net/pmtu.sh | 5 ++---
1 file changed, 2 insertions(+), 3 deletions(-)
diff --git a/tools/testing/selftests/net/pmtu.sh b/tools/testing/selftests/net/pmtu.sh
index ab367e75f095..d697815d2785 100755
--- a/tools/testing/selftests/net/pmtu.sh
+++ b/tools/testing/selftests/net/pmtu.sh
@@ -1249,8 +1249,7 @@ test_list_flush_ipv4_exception() {
done
run_cmd ${ns_a} ping -q -M want -i 0.1 -c 2 -s 1800 "${dst2}"
- # Each exception is printed as two lines
- if [ "$(${ns_a} ip route list cache | wc -l)" -ne 202 ]; then
+ if [ "$(${ns_a} ip -oneline route list cache | wc -l)" -ne 101 ]; then
err " can't list cached exceptions"
fail=1
fi
@@ -1300,7 +1299,7 @@ test_list_flush_ipv6_exception() {
run_cmd ${ns_a} ping -q -M want -i 0.1 -w 1 -s 1800 "${dst_prefix1}${i}"
done
run_cmd ${ns_a} ping -q -M want -i 0.1 -w 1 -s 1800 "${dst2}"
- if [ "$(${ns_a} ip -6 route list cache | wc -l)" -ne 101 ]; then
+ if [ "$(${ns_a} ip -oneline -6 route list cache | wc -l)" -ne 101 ]; then
err " can't list cached exceptions"
fail=1
fi
--
2.20.1
Hi
while testing on linux-next
I see that, when KBUILD_OUTPUT is set in the env, running something like (using TARGETS=exec as a random subsystem here...)
$ make TARGETS=exec INSTALL_PATH=/nfs/LTP-official-debian-aarch64-rootfs/opt/KSFT_next kselftest-install
works fine as usual, WHILE the alternative invocation (still documented in Documentation/dev-tools/kselftest.rst)
make -C tools/testing/selftests/ TARGETS=exec INSTALL_PATH=/nfs/LTP-official-debian-aarch64-rootfs/opt/KSFT_next install
fails miserably with:
...
...
REMOVE usr/include/rdma/cxgb3-abi.h usr/include/rdma/nes-abi.h
HDRINST usr/include/asm/kvm.h
INSTALL /kselftest/usr/include
mkdir: cannot create directory ‘/kselftest’: Permission denied
/home/crimar01/ARM/dev/src/pdsw/linux/Makefile:1187: recipe for target 'headers_install' failed
make[2]: *** [headers_install] Error 1
This is fixed by unsetting KBUILD_OUTPUT OR reverting:
303e6218ecec (ksft/fixes) selftests: Fix O= and KBUILD_OUTPUT handling for relative paths
since bypassing top makefile with -C, the definition of abs-objtree used by the above patch
is no more available.
As a side effect when KBUILD_OUTPUT is set, this breaks also the usage kselftest_install.sh.
$ ./kselftest_install.sh /home/crimar01/ARM/dev/nfs/LTP-official-debian-aarch64-rootfs/opt/KSFT_full_next
./kselftest_install.sh: Installing in specified location - /home/crimar01/ARM/dev/nfs/LTP-official-debian-aarch64-rootfs/opt/KSFT_full_next ...
make --no-builtin-rules INSTALL_HDR_PATH=$BUILD/usr \
ARCH=arm64 -C ../../.. headers_install
make[1]: Entering directory '/home/crimar01/ARM/dev/src/pdsw/linux'
make[2]: Entering directory '/home/crimar01/ARM/dev/src/pdsw/out_linux'
INSTALL /kselftest/usr/include
mkdir: cannot create directory ‘/kselftest’: Permission denied
/home/crimar01/ARM/dev/src/pdsw/linux/Makefile:1187: recipe for target 'headers_install' failed
make[2]: *** [headers_install] Error 1
make[2]: Leaving directory '/home/crimar01/ARM/dev/src/pdsw/out_linux'
Makefile:179: recipe for target 'sub-make' failed
make[1]: *** [sub-make] Error 2
make[1]: Leaving directory '/home/crimar01/ARM/dev/src/pdsw/linux'
Makefile:142: recipe for target 'khdr' failed
make: *** [khdr] Error 2
A possible fix would be (but duplicates in fact the main Makefile logic)
diff --git a/tools/testing/selftests/Makefile b/tools/testing/selftests/Makefile
index 319e094c3212..491d8b3ef1c7 100644
--- a/tools/testing/selftests/Makefile
+++ b/tools/testing/selftests/Makefile
@@ -88,6 +88,15 @@ ifdef building_out_of_srctree
override LDFLAGS =
endif
+ifeq ($(abs_objtree),)
+ifneq ($(KBUILD_OUTPUT),)
+abs_objtree := $(shell cd $(KBUILD_OUTPUT) && pwd)
+abs_objtree := $(realpath $(abs_objtree))
+else
+abs_objtree := $(shell pwd)
+endif
+endif #ifeq ($(abs_objtree),)
+
ifneq ($(O),)
BUILD := $(abs_objtree)
else
Any thoughts ? ... or am I missing something ?
(I think I'm starting to see this in latest CI linaro kselftest while they cross-compile for arm64)
Thanks
Cristian
On Wed, Nov 27, 2019 at 5:49 AM Jeffrin Jose
<jeffrin(a)rajagiritech.edu.in> wrote:
> Tested-by: Jeffrin Jose T <jeffrin(a)rajagiritech.edu.in>
> Signed-off-by: Jeffrin Jose T <jeffrin(a)rajagiritech.edu.in>
i do
--
software engineer
rajagiri school of engineering and technology
Fixes the issue caused by the fact that in C in the expression
of the form -1234L only 1234L is the actual literal, the unary
minus is an operation applied to the literal. Which means that
to express the lower bound for the type one has to negate the
upper bound and subtract 1.
Original error:
Expected test_data[i].expected.tv_sec == timestamp.tv_sec, but
test_data[i].expected.tv_sec == -2147483648
timestamp.tv_sec == 2147483648
1901-12-13 Lower bound of 32bit < 0 timestamp, no extra bits: msb:1
lower_bound:1 extra_bits: 0
Expected test_data[i].expected.tv_sec == timestamp.tv_sec, but
test_data[i].expected.tv_sec == 2147483648
timestamp.tv_sec == 6442450944
2038-01-19 Lower bound of 32bit <0 timestamp, lo extra sec bit on:
msb:1 lower_bound:1 extra_bits: 1
Expected test_data[i].expected.tv_sec == timestamp.tv_sec, but
test_data[i].expected.tv_sec == 6442450944
timestamp.tv_sec == 10737418240
2174-02-25 Lower bound of 32bit <0 timestamp, hi extra sec bit on:
msb:1 lower_bound:1 extra_bits: 2
not ok 1 - inode_test_xtimestamp_decoding
not ok 1 - ext4_inode_test
Reported-by: Geert Uytterhoeven geert(a)linux-m68k.org
Signed-off-by: Iurii Zaikin <yzaikin(a)google.com>
---
fs/ext4/inode-test.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/fs/ext4/inode-test.c b/fs/ext4/inode-test.c
index 92a9da1774aa..bbce1c328d85 100644
--- a/fs/ext4/inode-test.c
+++ b/fs/ext4/inode-test.c
@@ -25,7 +25,7 @@
* For constructing the negative timestamp lower bound value.
* binary: 10000000 00000000 00000000 00000000
*/
-#define LOWER_MSB_1 (-0x80000000L)
+#define LOWER_MSB_1 (-(UPPER_MSB_0) - 1L) /* avoid overflow */
/*
* For constructing the negative timestamp upper bound value.
* binary: 11111111 11111111 11111111 11111111
--
2.24.0.432.g9d3f5f5b63-goog
Hi,
Here is the 4th version of patches to fix some issues which happens on
the kernel with CONFIG_FUNCTION_TRACER=n or CONFIG_DYNAMIC_FTRACE=n.
In this version I fixed [1/4] to cleanup set_ftrace_notrace (Thanks Steve!)
Thank you,
---
Masami Hiramatsu (4):
selftests/ftrace: Fix to check the existence of set_ftrace_filter
selftests/ftrace: Fix ftrace test cases to check unsupported
selftests/ftrace: Do not to use absolute debugfs path
selftests/ftrace: Fix multiple kprobe testcase
.../ftrace/test.d/ftrace/func-filter-stacktrace.tc | 2 ++
.../selftests/ftrace/test.d/ftrace/func_cpumask.tc | 5 +++++
tools/testing/selftests/ftrace/test.d/functions | 5 ++++-
.../ftrace/test.d/kprobe/multiple_kprobes.tc | 6 +++---
.../inter-event/trigger-action-hist-xfail.tc | 4 ++--
.../inter-event/trigger-onchange-action-hist.tc | 2 +-
.../inter-event/trigger-snapshot-action-hist.tc | 4 ++--
7 files changed, 19 insertions(+), 9 deletions(-)
--
Masami Hiramatsu (Linaro) <mhiramat(a)kernel.org>
Fixes the issue caused by the fact that in C in the expression
of the form -1234L only 1234L is the actual literal, the unary
minus is an operation applied to the literal. Which means that
to express the lower bound for the type one has to negate the
upper bound and subtract 1.
Signed-off-by: Iurii Zaikin <yzaikin(a)google.com>
---
fs/ext4/inode-test.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/fs/ext4/inode-test.c b/fs/ext4/inode-test.c
index 92a9da1774aa..bbce1c328d85 100644
--- a/fs/ext4/inode-test.c
+++ b/fs/ext4/inode-test.c
@@ -25,7 +25,7 @@
* For constructing the negative timestamp lower bound value.
* binary: 10000000 00000000 00000000 00000000
*/
-#define LOWER_MSB_1 (-0x80000000L)
+#define LOWER_MSB_1 (-(UPPER_MSB_0) - 1L) /* avoid overflow */
/*
* For constructing the negative timestamp upper bound value.
* binary: 11111111 11111111 11111111 11111111
--
2.24.0.432.g9d3f5f5b63-goog
Hi Lurii,
On Tue, Nov 26, 2019 at 3:12 AM Linux Kernel Mailing List
<linux-kernel(a)vger.kernel.org> wrote:
> Commit: 1cbeab1b242d16fdb22dc3dab6a7d6afe746ae6d
> Parent: d460623c5fa126dc51bb2571dd7714ca75b0116c
> Refname: refs/heads/master
> Web: https://git.kernel.org/torvalds/c/1cbeab1b242d16fdb22dc3dab6a7d6afe746ae6d
> Author: Iurii Zaikin <yzaikin(a)google.com>
> AuthorDate: Thu Oct 17 15:12:33 2019 -0700
> Committer: Shuah Khan <skhan(a)linuxfoundation.org>
> CommitDate: Wed Oct 23 10:28:23 2019 -0600
>
> ext4: add kunit test for decoding extended timestamps
>
> KUnit tests for decoding extended 64 bit timestamps that verify the
> seconds part of [a/c/m] timestamps in ext4 inode structs are decoded
> correctly.
>
> Test data is derived from the table in the Inode Timestamps section of
> Documentation/filesystems/ext4/inodes.rst.
>
> KUnit tests run during boot and output the results to the debug log
> in TAP format (http://testanything.org/). Only useful for kernel devs
> running KUnit test harness and are not for inclusion into a production
> build.
>
> Signed-off-by: Iurii Zaikin <yzaikin(a)google.com>
> Reviewed-by: Theodore Ts'o <tytso(a)mit.edu>
> Reviewed-by: Brendan Higgins <brendanhiggins(a)google.com>
> Tested-by: Brendan Higgins <brendanhiggins(a)google.com>
> Reviewed-by: Shuah Khan <skhan(a)linuxfoundation.org>
> Signed-off-by: Shuah Khan <skhan(a)linuxfoundation.org>
While this test succeeds on arm64, it fails on m68k and arm32 (presumably
all 32-bit platforms?):
# Subtest: ext4_inode_test
1..1
# inode_test_xtimestamp_decoding: EXPECTATION FAILED at fs/ext4/inode-test.c:250
Expected test_data[i].expected.tv_sec == timestamp.tv_sec, but
test_data[i].expected.tv_sec == -2147483648
timestamp.tv_sec == 2147483648
1901-12-13 Lower bound of 32bit < 0 timestamp, no extra bits: msb:1
lower_bound:1 extra_bits: 0
# inode_test_xtimestamp_decoding: EXPECTATION FAILED at fs/ext4/inode-test.c:250
Expected test_data[i].expected.tv_sec == timestamp.tv_sec, but
test_data[i].expected.tv_sec == 2147483648
timestamp.tv_sec == 6442450944
2038-01-19 Lower bound of 32bit <0 timestamp, lo extra sec bit on:
msb:1 lower_bound:1 extra_bits: 1
# inode_test_xtimestamp_decoding: EXPECTATION FAILED at fs/ext4/inode-test.c:250
Expected test_data[i].expected.tv_sec == timestamp.tv_sec, but
test_data[i].expected.tv_sec == 6442450944
timestamp.tv_sec == 10737418240
2174-02-25 Lower bound of 32bit <0 timestamp, hi extra sec bit on:
msb:1 lower_bound:1 extra_bits: 2
not ok 1 - inode_test_xtimestamp_decoding
not ok 1 - ext4_inode_test
Gr{oetje,eeting}s,
Geert
--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert(a)linux-m68k.org
In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
-- Linus Torvalds
Previous error message for invalid kunitconfig was vague. Added to it so
that it lists invalid fields and prompts for them to be removed. Added
validate_config function returning whether or not this kconfig is valid.
Signed-off-by: Heidi Fahim <heidifahim(a)google.com>
---
tools/testing/kunit/kunit_kernel.py | 27 +++++++++++++++------------
1 file changed, 15 insertions(+), 12 deletions(-)
diff --git a/tools/testing/kunit/kunit_kernel.py b/tools/testing/kunit/kunit_kernel.py
index bf3876835331..010d3f5030d2 100644
--- a/tools/testing/kunit/kunit_kernel.py
+++ b/tools/testing/kunit/kunit_kernel.py
@@ -93,6 +93,19 @@ class LinuxSourceTree(object):
return False
return True
+ def validate_config(self, build_dir):
+ kconfig_path = get_kconfig_path(build_dir)
+ validated_kconfig = kunit_config.Kconfig()
+ validated_kconfig.read_from_file(kconfig_path)
+ if not self._kconfig.is_subset_of(validated_kconfig):
+ invalid = self._kconfig.entries() - validated_kconfig.entries()
+ message = 'Provided Kconfig is not contained in validated .config. Invalid fields found in kunitconfig: %s' % (
+ ', '.join([str(e) for e in invalid])
+ )
+ logging.error(message)
+ return False
+ return True
+
def build_config(self, build_dir):
kconfig_path = get_kconfig_path(build_dir)
if build_dir and not os.path.exists(build_dir):
@@ -103,12 +116,7 @@ class LinuxSourceTree(object):
except ConfigError as e:
logging.error(e)
return False
- validated_kconfig = kunit_config.Kconfig()
- validated_kconfig.read_from_file(kconfig_path)
- if not self._kconfig.is_subset_of(validated_kconfig):
- logging.error('Provided Kconfig is not contained in validated .config!')
- return False
- return True
+ return self.validate_config(build_dir)
def build_reconfig(self, build_dir):
"""Creates a new .config if it is not a subset of the kunitconfig."""
@@ -133,12 +141,7 @@ class LinuxSourceTree(object):
except (ConfigError, BuildError) as e:
logging.error(e)
return False
- used_kconfig = kunit_config.Kconfig()
- used_kconfig.read_from_file(get_kconfig_path(build_dir))
- if not self._kconfig.is_subset_of(used_kconfig):
- logging.error('Provided Kconfig is not contained in final config!')
- return False
- return True
+ return self.validate_config(build_dir)
def run_kernel(self, args=[], timeout=None, build_dir=None):
args.extend(['mem=256M'])
--
2.24.0.432.g9d3f5f5b63-goog
Support for frequency limits in dev_pm_qos was removed when cpufreq was
switched to freq_qos, this series attempts to restore it by
reimplementing on top of freq_qos.
Discussion about removal is here:
https://lore.kernel.org/linux-pm/VI1PR04MB7023DF47D046AEADB4E051EBEE680@VI1…
The cpufreq core switched away because it needs contraints at the level
of a "cpufreq_policy" which cover multiple cpus so dev_pm_qos coupling
to struct device was not useful. Cpufreq could only use dev_pm_qos by
implementing an additional layer of aggregation anyway.
However in the devfreq subsystem scaling is always performed on a per-device
basis so dev_pm_qos is a very good match. Support for dev_pm_qos in devfreq
core is here:
https://patchwork.kernel.org/cover/11252409/
That series is RFC mostly because it needs these PM core patches.
Earlier versions got entangled in some locking cleanups but those are
not strictly necessary to get dev_pm_qos functionality.
In theory if freq_qos is extended to handle conflicting min/max values then
this sharing would be valuable. Right now freq_qos just ties two unrelated
pm_qos aggregations for min and max freq.
---
This is implemented by embeding a freq_qos_request inside dev_pm_qos_request:
the data field was already an union in order to deal with flag requests.
The internal freq_qos_apply is exported so that it can be called from
dev_pm_qos apply_constraints.
The dev_pm_qos_constraints_destroy function has no obvious equivalent in
freq_qos and the whole approach of "removing requests" is somewhat dubios:
request objects should be owned by consumers and the list of qos requests
should be empty when the target device is deleted.
Changes since v2:
* #define PM_QOS_MAX_FREQUENCY_DEFAULT_VALUE FREQ_QOS_MAX_DEFAULT_VALUE
* #define FREQ_QOS_MAX_DEFAULT_VALUE S32_MAX (in new patch)
* Add initial kunit test for freq_qos, validating the MAX_DEFAULT_VALUE found
by Matthias.
Link to v2: https://patchwork.kernel.org/cover/11250413/
First two patches can be applied separated
Changes since v1:
* Don't rename or EXPORT_SYMBOL_GPL the freq_qos_apply function; just
drop the static marker.
Link to v1: https://patchwork.kernel.org/cover/11212887/
Leonard Crestez (4):
PM / QoS: Initial kunit test
PM / QOS: Redefine FREQ_QOS_MAX_DEFAULT_VALUE to S32_MAX
PM / QoS: Reorder pm_qos/freq_qos/dev_pm_qos structs
PM / QoS: Restore DEV_PM_QOS_MIN/MAX_FREQUENCY
drivers/base/Kconfig | 4 ++
drivers/base/power/Makefile | 1 +
drivers/base/power/qos-test.c | 116 ++++++++++++++++++++++++++++++++++
drivers/base/power/qos.c | 69 ++++++++++++++++++--
include/linux/pm_qos.h | 86 ++++++++++++++-----------
kernel/power/qos.c | 4 +-
6 files changed, 237 insertions(+), 43 deletions(-)
create mode 100644 drivers/base/power/qos-test.c
--
2.17.1
Hi,
Here is the 3rd version of patches to fix some issues which happens on
the kernel with CONFIG_FUNCTION_TRACER=n or CONFIG_DYNAMIC_FTRACE=n.
In this version and v2, I updated the descriptions of the first 2 patches
according to Steve's comment, added Steve's Reviewed-by to the 3rd patch,
and added the 4th patch which was newly found.
Thank you,
---
Masami Hiramatsu (4):
selftests/ftrace: Fix to check the existence of set_ftrace_filter
selftests/ftrace: Fix ftrace test cases to check unsupported
selftests/ftrace: Do not to use absolute debugfs path
selftests/ftrace: Fix multiple kprobe testcase
.../ftrace/test.d/ftrace/func-filter-stacktrace.tc | 2 ++
.../selftests/ftrace/test.d/ftrace/func_cpumask.tc | 5 +++++
tools/testing/selftests/ftrace/test.d/functions | 4 +++-
.../ftrace/test.d/kprobe/multiple_kprobes.tc | 6 +++---
.../inter-event/trigger-action-hist-xfail.tc | 4 ++--
.../inter-event/trigger-onchange-action-hist.tc | 2 +-
.../inter-event/trigger-snapshot-action-hist.tc | 4 ++--
7 files changed, 18 insertions(+), 9 deletions(-)
--
Masami Hiramatsu (Linaro) <mhiramat(a)kernel.org>
Hi Linus,
Please pull the following kselftest fixes update for Linux 5.5-rc1.
This kselftest fixes update for Linux 5.5-rc1 consists of several
fixes to tests and framework. Masami Hiramatsu fixed several tests
to build and run correctly on arm and other 32bit architectures.
diff is attached.
thanks,
-- Shuah
----------------------------------------------------------------
The following changes since commit ce3a677802121e038d2f062e90f96f84e7351da0:
selftests: watchdog: Add command line option to show watchdog_info
(2019-10-02 13:44:43 -0600)
are available in the Git repository at:
git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest
tags/linux-kselftest-5.5-rc1-fixes
for you to fetch changes up to ed2d8fa734e7759ac3788a19f308d3243d0eb164:
selftests: sync: Fix cast warnings on arm (2019-11-07 14:54:37 -0700)
----------------------------------------------------------------
linux-kselftest-5.5-rc1-fixes
This kselftest fixes update for Linux 5.5-rc1 consists of several
fixes to tests and framework. Masami Hiramatsu fixed several tests
to build and run correctly on arm and other 32bit architectures.
----------------------------------------------------------------
Kees Cook (2):
selftests: gen_kselftest_tar.sh: Do not clobber kselftest/
selftests: Move kselftest_module.sh into kselftest/
Masami Hiramatsu (6):
selftests: breakpoints: Fix a typo of function name
selftests: proc: Make va_max 1MB
selftests: vm: Build/Run 64bit tests only on 64bit arch
selftests: net: Use size_t and ssize_t for counting file size
selftests: net: Fix printf format warnings on arm
selftests: sync: Fix cast warnings on arm
Prabhakar Kushwaha (1):
kselftest: Fix NULL INSTALL_PATH for TARGETS runlist
Shuah Khan (1):
selftests: Fix O= and KBUILD_OUTPUT handling for relative paths
tools/testing/selftests/Makefile | 8 +++++---
.../selftests/breakpoints/breakpoint_test_arm64.c | 2 +-
tools/testing/selftests/gen_kselftest_tar.sh | 21
+++++++++++--------
.../{kselftest_module.sh => kselftest/module.sh} | 0
tools/testing/selftests/kselftest_install.sh | 24
+++++++++++-----------
tools/testing/selftests/lib/bitmap.sh | 2 +-
tools/testing/selftests/lib/prime_numbers.sh | 2 +-
tools/testing/selftests/lib/printf.sh | 2 +-
tools/testing/selftests/lib/strscpy.sh | 2 +-
tools/testing/selftests/net/so_txtime.c | 4 ++--
tools/testing/selftests/net/tcp_mmap.c | 8 ++++----
tools/testing/selftests/net/udpgso.c | 3 ++-
tools/testing/selftests/net/udpgso_bench_tx.c | 3 ++-
.../selftests/proc/proc-self-map-files-002.c | 6 +++++-
tools/testing/selftests/sync/sync.c | 6 +++---
tools/testing/selftests/vm/Makefile | 5 +++++
tools/testing/selftests/vm/run_vmtests | 10 +++++++++
17 files changed, 68 insertions(+), 40 deletions(-)
rename tools/testing/selftests/{kselftest_module.sh =>
kselftest/module.sh} (100%)
----------------------------------------------------------------
These counters will track hugetlb reservations rather than hugetlb
memory faulted in. This patch only adds the counter, following patches
add the charging and uncharging of the counter.
Problem:
Currently tasks attempting to allocate more hugetlb memory than is available get
a failure at mmap/shmget time. This is thanks to Hugetlbfs Reservations [1].
However, if a task attempts to allocate hugetlb memory only more than its
hugetlb_cgroup limit allows, the kernel will allow the mmap/shmget call,
but will SIGBUS the task when it attempts to fault the memory in.
We have developers interested in using hugetlb_cgroups, and they have expressed
dissatisfaction regarding this behavior. We'd like to improve this
behavior such that tasks violating the hugetlb_cgroup limits get an error on
mmap/shmget time, rather than getting SIGBUS'd when they try to fault
the excess memory in.
The underlying problem is that today's hugetlb_cgroup accounting happens
at hugetlb memory *fault* time, rather than at *reservation* time.
Thus, enforcing the hugetlb_cgroup limit only happens at fault time, and
the offending task gets SIGBUS'd.
Proposed Solution:
A new page counter named hugetlb.xMB.reservation_[limit|usage]_in_bytes. This
counter has slightly different semantics than
hugetlb.xMB.[limit|usage]_in_bytes:
- While usage_in_bytes tracks all *faulted* hugetlb memory,
reservation_usage_in_bytes tracks all *reserved* hugetlb memory and
hugetlb memory faulted in without a prior reservation.
- If a task attempts to reserve more memory than limit_in_bytes allows,
the kernel will allow it to do so. But if a task attempts to reserve
more memory than reservation_limit_in_bytes, the kernel will fail this
reservation.
This proposal is implemented in this patch series, with tests to verify
functionality and show the usage. We also added cgroup-v2 support to
hugetlb_cgroup so that the new use cases can be extended to v2.
Alternatives considered:
1. A new cgroup, instead of only a new page_counter attached to
the existing hugetlb_cgroup. Adding a new cgroup seemed like a lot of code
duplication with hugetlb_cgroup. Keeping hugetlb related page counters under
hugetlb_cgroup seemed cleaner as well.
2. Instead of adding a new counter, we considered adding a sysctl that modifies
the behavior of hugetlb.xMB.[limit|usage]_in_bytes, to do accounting at
reservation time rather than fault time. Adding a new page_counter seems
better as userspace could, if it wants, choose to enforce different cgroups
differently: one via limit_in_bytes, and another via
reservation_limit_in_bytes. This could be very useful if you're
transitioning how hugetlb memory is partitioned on your system one
cgroup at a time, for example. Also, someone may find usage for both
limit_in_bytes and reservation_limit_in_bytes concurrently, and this
approach gives them the option to do so.
Testing:
- Added tests passing.
- libhugetlbfs tests mostly passing, but some tests have trouble with and
without this patch series. Seems environment issue rather than code:
- Overall results:
********** TEST SUMMARY
* 2M
* 32-bit 64-bit
* Total testcases: 84 0
* Skipped: 0 0
* PASS: 66 0
* FAIL: 14 0
* Killed by signal: 0 0
* Bad configuration: 4 0
* Expected FAIL: 0 0
* Unexpected PASS: 0 0
* Test not present: 0 0
* Strange test result: 0 0
**********
- Failing tests:
- elflink_rw_and_share_test("linkhuge_rw") segfaults with and without this
patch series.
- LD_PRELOAD=libhugetlbfs.so HUGETLB_MORECORE=yes malloc (2M: 32):
FAIL Address is not hugepage
- LD_PRELOAD=libhugetlbfs.so HUGETLB_RESTRICT_EXE=unknown:malloc
HUGETLB_MORECORE=yes malloc (2M: 32):
FAIL Address is not hugepage
- LD_PRELOAD=libhugetlbfs.so HUGETLB_MORECORE=yes malloc_manysmall (2M: 32):
FAIL Address is not hugepage
- GLIBC_TUNABLES=glibc.malloc.tcache_count=0 LD_PRELOAD=libhugetlbfs.so
HUGETLB_MORECORE=yes heapshrink (2M: 32):
FAIL Heap not on hugepages
- GLIBC_TUNABLES=glibc.malloc.tcache_count=0 LD_PRELOAD=libhugetlbfs.so
libheapshrink.so HUGETLB_MORECORE=yes heapshrink (2M: 32):
FAIL Heap not on hugepages
- HUGETLB_ELFMAP=RW linkhuge_rw (2M: 32): FAIL small_data is not hugepage
- HUGETLB_ELFMAP=RW HUGETLB_MINIMAL_COPY=no linkhuge_rw (2M: 32):
FAIL small_data is not hugepage
- alloc-instantiate-race shared (2M: 32):
Bad configuration: sched_setaffinity(cpu1): Invalid argument -
FAIL Child 1 killed by signal Killed
- shmoverride_linked (2M: 32):
FAIL shmget failed size 2097152 from line 176: Invalid argument
- HUGETLB_SHM=yes shmoverride_linked (2M: 32):
FAIL shmget failed size 2097152 from line 176: Invalid argument
- shmoverride_linked_static (2M: 32):
FAIL shmget failed size 2097152 from line 176: Invalid argument
- HUGETLB_SHM=yes shmoverride_linked_static (2M: 32):
FAIL shmget failed size 2097152 from line 176: Invalid argument
- LD_PRELOAD=libhugetlbfs.so shmoverride_unlinked (2M: 32):
FAIL shmget failed size 2097152 from line 176: Invalid argument
- LD_PRELOAD=libhugetlbfs.so HUGETLB_SHM=yes shmoverride_unlinked (2M: 32):
FAIL shmget failed size 2097152 from line 176: Invalid argument
[1]: https://www.kernel.org/doc/html/latest/vm/hugetlbfs_reserv.html
Signed-off-by: Mina Almasry <almasrymina(a)google.com>
Acked-by: Hillf Danton <hdanton(a)sina.com>
---
include/linux/hugetlb.h | 23 ++++++++-
mm/hugetlb_cgroup.c | 111 ++++++++++++++++++++++++++++++----------
2 files changed, 107 insertions(+), 27 deletions(-)
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 53fc34f930d08..9c49a0ba894d3 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -320,6 +320,27 @@ unsigned long hugetlb_get_unmapped_area(struct file *file, unsigned long addr,
#ifdef CONFIG_HUGETLB_PAGE
+enum {
+ /* Tracks hugetlb memory faulted in. */
+ HUGETLB_RES_USAGE,
+ /* Tracks hugetlb memory reserved. */
+ HUGETLB_RES_RESERVATION_USAGE,
+ /* Limit for hugetlb memory faulted in. */
+ HUGETLB_RES_LIMIT,
+ /* Limit for hugetlb memory reserved. */
+ HUGETLB_RES_RESERVATION_LIMIT,
+ /* Max usage for hugetlb memory faulted in. */
+ HUGETLB_RES_MAX_USAGE,
+ /* Max usage for hugetlb memory reserved. */
+ HUGETLB_RES_RESERVATION_MAX_USAGE,
+ /* Faulted memory accounting fail count. */
+ HUGETLB_RES_FAILCNT,
+ /* Reserved memory accounting fail count. */
+ HUGETLB_RES_RESERVATION_FAILCNT,
+ HUGETLB_RES_NULL,
+ HUGETLB_RES_MAX,
+};
+
#define HSTATE_NAME_LEN 32
/* Defines one hugetlb page size */
struct hstate {
@@ -340,7 +361,7 @@ struct hstate {
unsigned int surplus_huge_pages_node[MAX_NUMNODES];
#ifdef CONFIG_CGROUP_HUGETLB
/* cgroup control files */
- struct cftype cgroup_files[5];
+ struct cftype cgroup_files[HUGETLB_RES_MAX];
#endif
char name[HSTATE_NAME_LEN];
};
diff --git a/mm/hugetlb_cgroup.c b/mm/hugetlb_cgroup.c
index f1930fa0b445d..1ed4448ca41d3 100644
--- a/mm/hugetlb_cgroup.c
+++ b/mm/hugetlb_cgroup.c
@@ -25,6 +25,10 @@ struct hugetlb_cgroup {
* the counter to account for hugepages from hugetlb.
*/
struct page_counter hugepage[HUGE_MAX_HSTATE];
+ /*
+ * the counter to account for hugepage reservations from hugetlb.
+ */
+ struct page_counter reserved_hugepage[HUGE_MAX_HSTATE];
};
#define MEMFILE_PRIVATE(x, val) (((x) << 16) | (val))
@@ -33,6 +37,14 @@ struct hugetlb_cgroup {
static struct hugetlb_cgroup *root_h_cgroup __read_mostly;
+static inline struct page_counter *
+hugetlb_cgroup_get_counter(struct hugetlb_cgroup *h_cg, int idx, bool reserved)
+{
+ if (reserved)
+ return &h_cg->reserved_hugepage[idx];
+ return &h_cg->hugepage[idx];
+}
+
static inline
struct hugetlb_cgroup *hugetlb_cgroup_from_css(struct cgroup_subsys_state *s)
{
@@ -254,30 +266,33 @@ void hugetlb_cgroup_uncharge_cgroup(int idx, unsigned long nr_pages,
return;
}
-enum {
- RES_USAGE,
- RES_LIMIT,
- RES_MAX_USAGE,
- RES_FAILCNT,
-};
-
static u64 hugetlb_cgroup_read_u64(struct cgroup_subsys_state *css,
struct cftype *cft)
{
struct page_counter *counter;
+ struct page_counter *reserved_counter;
struct hugetlb_cgroup *h_cg = hugetlb_cgroup_from_css(css);
counter = &h_cg->hugepage[MEMFILE_IDX(cft->private)];
+ reserved_counter = &h_cg->reserved_hugepage[MEMFILE_IDX(cft->private)];
switch (MEMFILE_ATTR(cft->private)) {
- case RES_USAGE:
+ case HUGETLB_RES_USAGE:
return (u64)page_counter_read(counter) * PAGE_SIZE;
- case RES_LIMIT:
+ case HUGETLB_RES_RESERVATION_USAGE:
+ return (u64)page_counter_read(reserved_counter) * PAGE_SIZE;
+ case HUGETLB_RES_LIMIT:
return (u64)counter->max * PAGE_SIZE;
- case RES_MAX_USAGE:
+ case HUGETLB_RES_RESERVATION_LIMIT:
+ return (u64)reserved_counter->max * PAGE_SIZE;
+ case HUGETLB_RES_MAX_USAGE:
return (u64)counter->watermark * PAGE_SIZE;
- case RES_FAILCNT:
+ case HUGETLB_RES_RESERVATION_MAX_USAGE:
+ return (u64)reserved_counter->watermark * PAGE_SIZE;
+ case HUGETLB_RES_FAILCNT:
return counter->failcnt;
+ case HUGETLB_RES_RESERVATION_FAILCNT:
+ return reserved_counter->failcnt;
default:
BUG();
}
@@ -291,6 +306,7 @@ static ssize_t hugetlb_cgroup_write(struct kernfs_open_file *of,
int ret, idx;
unsigned long nr_pages;
struct hugetlb_cgroup *h_cg = hugetlb_cgroup_from_css(of_css(of));
+ bool reserved = false;
if (hugetlb_cgroup_is_root(h_cg)) /* Can't set limit on root */
return -EINVAL;
@@ -304,9 +320,14 @@ static ssize_t hugetlb_cgroup_write(struct kernfs_open_file *of,
nr_pages = round_down(nr_pages, 1 << huge_page_order(&hstates[idx]));
switch (MEMFILE_ATTR(of_cft(of)->private)) {
- case RES_LIMIT:
+ case HUGETLB_RES_RESERVATION_LIMIT:
+ reserved = true;
+ /* Fall through. */
+ case HUGETLB_RES_LIMIT:
mutex_lock(&hugetlb_limit_mutex);
- ret = page_counter_set_max(&h_cg->hugepage[idx], nr_pages);
+ ret = page_counter_set_max(hugetlb_cgroup_get_counter(h_cg, idx,
+ reserved),
+ nr_pages);
mutex_unlock(&hugetlb_limit_mutex);
break;
default:
@@ -320,18 +341,26 @@ static ssize_t hugetlb_cgroup_reset(struct kernfs_open_file *of,
char *buf, size_t nbytes, loff_t off)
{
int ret = 0;
- struct page_counter *counter;
+ struct page_counter *counter, *reserved_counter;
struct hugetlb_cgroup *h_cg = hugetlb_cgroup_from_css(of_css(of));
counter = &h_cg->hugepage[MEMFILE_IDX(of_cft(of)->private)];
+ reserved_counter =
+ &h_cg->reserved_hugepage[MEMFILE_IDX(of_cft(of)->private)];
switch (MEMFILE_ATTR(of_cft(of)->private)) {
- case RES_MAX_USAGE:
+ case HUGETLB_RES_MAX_USAGE:
page_counter_reset_watermark(counter);
break;
- case RES_FAILCNT:
+ case HUGETLB_RES_RESERVATION_MAX_USAGE:
+ page_counter_reset_watermark(reserved_counter);
+ break;
+ case HUGETLB_RES_FAILCNT:
counter->failcnt = 0;
break;
+ case HUGETLB_RES_RESERVATION_FAILCNT:
+ reserved_counter->failcnt = 0;
+ break;
default:
ret = -EINVAL;
break;
@@ -357,37 +386,67 @@ static void __init __hugetlb_cgroup_file_init(int idx)
struct hstate *h = &hstates[idx];
/* format the size */
- mem_fmt(buf, 32, huge_page_size(h));
+ mem_fmt(buf, sizeof(buf), huge_page_size(h));
/* Add the limit file */
- cft = &h->cgroup_files[0];
+ cft = &h->cgroup_files[HUGETLB_RES_LIMIT];
snprintf(cft->name, MAX_CFTYPE_NAME, "%s.limit_in_bytes", buf);
- cft->private = MEMFILE_PRIVATE(idx, RES_LIMIT);
+ cft->private = MEMFILE_PRIVATE(idx, HUGETLB_RES_LIMIT);
+ cft->read_u64 = hugetlb_cgroup_read_u64;
+ cft->write = hugetlb_cgroup_write;
+
+ /* Add the reservation limit file */
+ cft = &h->cgroup_files[HUGETLB_RES_RESERVATION_LIMIT];
+ snprintf(cft->name, MAX_CFTYPE_NAME, "%s.reservation_limit_in_bytes",
+ buf);
+ cft->private = MEMFILE_PRIVATE(idx, HUGETLB_RES_RESERVATION_LIMIT);
cft->read_u64 = hugetlb_cgroup_read_u64;
cft->write = hugetlb_cgroup_write;
/* Add the usage file */
- cft = &h->cgroup_files[1];
+ cft = &h->cgroup_files[HUGETLB_RES_USAGE];
snprintf(cft->name, MAX_CFTYPE_NAME, "%s.usage_in_bytes", buf);
- cft->private = MEMFILE_PRIVATE(idx, RES_USAGE);
+ cft->private = MEMFILE_PRIVATE(idx, HUGETLB_RES_USAGE);
+ cft->read_u64 = hugetlb_cgroup_read_u64;
+
+ /* Add the reservation usage file */
+ cft = &h->cgroup_files[HUGETLB_RES_RESERVATION_USAGE];
+ snprintf(cft->name, MAX_CFTYPE_NAME, "%s.reservation_usage_in_bytes",
+ buf);
+ cft->private = MEMFILE_PRIVATE(idx, HUGETLB_RES_RESERVATION_USAGE);
cft->read_u64 = hugetlb_cgroup_read_u64;
/* Add the MAX usage file */
- cft = &h->cgroup_files[2];
+ cft = &h->cgroup_files[HUGETLB_RES_MAX_USAGE];
snprintf(cft->name, MAX_CFTYPE_NAME, "%s.max_usage_in_bytes", buf);
- cft->private = MEMFILE_PRIVATE(idx, RES_MAX_USAGE);
+ cft->private = MEMFILE_PRIVATE(idx, HUGETLB_RES_MAX_USAGE);
+ cft->write = hugetlb_cgroup_reset;
+ cft->read_u64 = hugetlb_cgroup_read_u64;
+
+ /* Add the MAX reservation usage file */
+ cft = &h->cgroup_files[HUGETLB_RES_RESERVATION_MAX_USAGE];
+ snprintf(cft->name, MAX_CFTYPE_NAME,
+ "%s.reservation_max_usage_in_bytes", buf);
+ cft->private = MEMFILE_PRIVATE(idx, HUGETLB_RES_RESERVATION_MAX_USAGE);
cft->write = hugetlb_cgroup_reset;
cft->read_u64 = hugetlb_cgroup_read_u64;
/* Add the failcntfile */
- cft = &h->cgroup_files[3];
+ cft = &h->cgroup_files[HUGETLB_RES_FAILCNT];
snprintf(cft->name, MAX_CFTYPE_NAME, "%s.failcnt", buf);
- cft->private = MEMFILE_PRIVATE(idx, RES_FAILCNT);
+ cft->private = MEMFILE_PRIVATE(idx, HUGETLB_RES_FAILCNT);
+ cft->write = hugetlb_cgroup_reset;
+ cft->read_u64 = hugetlb_cgroup_read_u64;
+
+ /* Add the reservation failcntfile */
+ cft = &h->cgroup_files[HUGETLB_RES_RESERVATION_FAILCNT];
+ snprintf(cft->name, MAX_CFTYPE_NAME, "%s.reservation_failcnt", buf);
+ cft->private = MEMFILE_PRIVATE(idx, HUGETLB_RES_RESERVATION_FAILCNT);
cft->write = hugetlb_cgroup_reset;
cft->read_u64 = hugetlb_cgroup_read_u64;
/* NULL terminate the last cft */
- cft = &h->cgroup_files[4];
+ cft = &h->cgroup_files[HUGETLB_RES_NULL];
memset(cft, 0, sizeof(*cft));
WARN_ON(cgroup_add_legacy_cftypes(&hugetlb_cgrp_subsys,
--
2.24.0.rc1.363.gb1bccd3e3d-goog
Hi,
Here is a set of well-reviewed (expect for one patch), lower-risk items
that can go into Linux 5.5. The one patch that wasn't reviewed is the
powerpc conversion, and it's still at this point a no-op, because
tracking isn't yet activated.
This is based on linux-next: b9d3d01405061bb42358fe53f824e894a1922ced
("Add linux-next specific files for 20191122").
This is essentially a cut-down v8 of "mm/gup: track dma-pinned pages:
FOLL_PIN" [1], and with one of the VFIO patches split into two patches.
The idea here is to get this long list of "noise" checked into 5.5, so
that the actual, higher-risk "track FOLL_PIN pages" (which is deferred:
not part of this series) will be a much shorter patchset to review.
For the v4l2-core changes, I've left those here (instead of sending
them separately to the -media tree), in order to get the name change
done now (put_user_page --> unpin_user_page). However, I've added a Cc
stable, as recommended during the last round of reviews.
Here are the relevant notes from the original cover letter, edited to
match the current situation:
This is a prerequisite to tracking dma-pinned pages. That in turn is a
prerequisite to solving the larger problem of proper interactions
between file-backed pages, and [R]DMA activities, as discussed in [1],
[2], [3], and in a remarkable number of email threads since about
2017. :)
A new internal gup flag, FOLL_PIN is introduced, and thoroughly
documented in the last patch's Documentation/vm/pin_user_pages.rst.
I believe that this will provide a good starting point for doing the
layout lease work that Ira Weiny has been working on. That's because
these new wrapper functions provide a clean, constrained, systematically
named set of functionality that, again, is required in order to even
know if a page is "dma-pinned".
In contrast to earlier approaches, the page tracking can be
incrementally applied to the kernel call sites that, until now, have
been simply calling get_user_pages() ("gup"). In other words, opt-in by
changing from this:
get_user_pages() (sets FOLL_GET)
put_page()
to this:
pin_user_pages() (sets FOLL_PIN)
put_user_page()
Because there are interdependencies with FOLL_LONGTERM, a similar
conversion as for FOLL_PIN, was applied. The change was from this:
get_user_pages(FOLL_LONGTERM) (also sets FOLL_GET)
put_page()
to this:
pin_longterm_pages() (sets FOLL_PIN | FOLL_LONGTERM)
put_user_page()
[1] https://lore.kernel.org/r/20191121071354.456618-1-jhubbard@nvidia.com
thanks,
John Hubbard
NVIDIA
Dan Williams (1):
mm: Cleanup __put_devmap_managed_page() vs ->page_free()
John Hubbard (18):
mm/gup: factor out duplicate code from four routines
mm/gup: move try_get_compound_head() to top, fix minor issues
goldish_pipe: rename local pin_user_pages() routine
mm: fix get_user_pages_remote()'s handling of FOLL_LONGTERM
vfio: fix FOLL_LONGTERM use, simplify get_user_pages_remote() call
mm/gup: introduce pin_user_pages*() and FOLL_PIN
goldish_pipe: convert to pin_user_pages() and put_user_page()
IB/{core,hw,umem}: set FOLL_PIN via pin_user_pages*(), fix up ODP
mm/process_vm_access: set FOLL_PIN via pin_user_pages_remote()
drm/via: set FOLL_PIN via pin_user_pages_fast()
fs/io_uring: set FOLL_PIN via pin_user_pages()
net/xdp: set FOLL_PIN via pin_user_pages()
media/v4l2-core: set pages dirty upon releasing DMA buffers
media/v4l2-core: pin_user_pages (FOLL_PIN) and put_user_page()
conversion
vfio, mm: pin_user_pages (FOLL_PIN) and put_user_page() conversion
powerpc: book3s64: convert to pin_user_pages() and put_user_page()
mm/gup_benchmark: use proper FOLL_WRITE flags instead of hard-coding
"1"
mm, tree-wide: rename put_user_page*() to unpin_user_page*()
Documentation/core-api/index.rst | 1 +
Documentation/core-api/pin_user_pages.rst | 233 ++++++++++++++
arch/powerpc/mm/book3s64/iommu_api.c | 12 +-
drivers/gpu/drm/via/via_dmablit.c | 6 +-
drivers/infiniband/core/umem.c | 4 +-
drivers/infiniband/core/umem_odp.c | 13 +-
drivers/infiniband/hw/hfi1/user_pages.c | 4 +-
drivers/infiniband/hw/mthca/mthca_memfree.c | 8 +-
drivers/infiniband/hw/qib/qib_user_pages.c | 4 +-
drivers/infiniband/hw/qib/qib_user_sdma.c | 8 +-
drivers/infiniband/hw/usnic/usnic_uiom.c | 4 +-
drivers/infiniband/sw/siw/siw_mem.c | 4 +-
drivers/media/v4l2-core/videobuf-dma-sg.c | 8 +-
drivers/nvdimm/pmem.c | 6 -
drivers/platform/goldfish/goldfish_pipe.c | 35 +--
drivers/vfio/vfio_iommu_type1.c | 35 +--
fs/io_uring.c | 6 +-
include/linux/mm.h | 77 +++--
mm/gup.c | 332 +++++++++++++-------
mm/gup_benchmark.c | 9 +-
mm/memremap.c | 80 ++---
mm/process_vm_access.c | 28 +-
net/xdp/xdp_umem.c | 4 +-
tools/testing/selftests/vm/gup_benchmark.c | 6 +-
24 files changed, 642 insertions(+), 285 deletions(-)
create mode 100644 Documentation/core-api/pin_user_pages.rst
--
2.24.0
Hi,
Here is the 2nd version of patches to fix some issues which happens on
the kernel with CONFIG_FUNCTION_TRACER=n or CONFIG_DYNAMIC_FTRACE=n.
In this version I just updated the descriptions of the first 2 patches
according to Steve's comment and add his Reviewed-by to the last patch.
Thank you,
---
Masami Hiramatsu (3):
selftests/ftrace: Fix to check the existence of set_ftrace_filter
selftests/ftrace: Fix ftrace test cases to check unsupported
selftests/ftrace: Do not to use absolute debugfs path
.../ftrace/test.d/ftrace/func-filter-stacktrace.tc | 2 ++
.../selftests/ftrace/test.d/ftrace/func_cpumask.tc | 5 +++++
tools/testing/selftests/ftrace/test.d/functions | 4 +++-
.../inter-event/trigger-action-hist-xfail.tc | 4 ++--
.../inter-event/trigger-onchange-action-hist.tc | 2 +-
.../inter-event/trigger-snapshot-action-hist.tc | 4 ++--
6 files changed, 15 insertions(+), 6 deletions(-)
--
Masami Hiramatsu (Linaro) <mhiramat(a)kernel.org>
Hi,
Here is a series of patches to fix some issues which happens on
the kernel with CONFIG_FUNCTION_TRACER=n but CONFIG_TRACER=y.
Thank you,
---
Masami Hiramatsu (3):
selftests/ftrace: Fix to check the existence of set_ftrace_filter
selftests/ftrace: Fix ftrace test cases to check unsupported
selftests/ftrace: Do not to use absolute debugfs path
.../ftrace/test.d/ftrace/func-filter-stacktrace.tc | 2 ++
.../selftests/ftrace/test.d/ftrace/func_cpumask.tc | 5 +++++
tools/testing/selftests/ftrace/test.d/functions | 4 +++-
.../inter-event/trigger-action-hist-xfail.tc | 4 ++--
.../inter-event/trigger-onchange-action-hist.tc | 2 +-
.../inter-event/trigger-snapshot-action-hist.tc | 4 ++--
6 files changed, 15 insertions(+), 6 deletions(-)
--
Masami Hiramatsu (Linaro) <mhiramat(a)kernel.org>
If settings files are present in upper directories of a test directory,
load those first before the local settings. This allows a top-level
directory to define settings for subdirectories while still allowing the
subdirectories to override them on a per-directory basis.
Signed-off-by: Kees Cook <keescook(a)chromium.org>
---
Note that this depends on Matt's patch:
https://lore.kernel.org/lkml/20191022171223.27934-1-matthieu.baerts@tessare…
---
tools/testing/selftests/kselftest/runner.sh | 30 +++++++++++++++------
1 file changed, 22 insertions(+), 8 deletions(-)
diff --git a/tools/testing/selftests/kselftest/runner.sh b/tools/testing/selftests/kselftest/runner.sh
index 84de7bc74f2c..666cfeaa8046 100644
--- a/tools/testing/selftests/kselftest/runner.sh
+++ b/tools/testing/selftests/kselftest/runner.sh
@@ -39,6 +39,27 @@ tap_timeout()
fi
}
+load_settings()
+{
+ local fullpath="$1"
+ local upperpath=${fullpath%/*}
+
+ # Load upper path settings first.
+ if [ "$fullpath" != "$upperpath" ] ; then
+ load_settings "$upperpath"
+ fi
+
+ # Load per-test-directory kselftest "settings" file.
+ local settings="$BASE_DIR/$fullpath/settings"
+ if [ -r "$settings" ] ; then
+ while read line ; do
+ local field=$(echo "$line" | cut -d= -f1)
+ local value=$(echo "$line" | cut -d= -f2-)
+ eval "kselftest_$field"="$value"
+ done < "$settings"
+ fi
+}
+
run_one()
{
DIR="$1"
@@ -50,14 +71,7 @@ run_one()
# Reset any "settings"-file variables.
export kselftest_timeout="$kselftest_default_timeout"
# Load per-test-directory kselftest "settings" file.
- settings="$BASE_DIR/$DIR/settings"
- if [ -r "$settings" ] ; then
- while read line ; do
- field=$(echo "$line" | cut -d= -f1)
- value=$(echo "$line" | cut -d= -f2-)
- eval "kselftest_$field"="$value"
- done < "$settings"
- fi
+ load_settings "$DIR"
TEST_HDR_MSG="selftests: $DIR: $BASENAME_TEST"
echo "# $TEST_HDR_MSG"
--
2.17.1
--
Kees Cook
Add documentation for the Python script used to build, run, and collect
results from the kernel known as kunit_tool. kunit_tool
(tools/testing/kunit/kunit.py) was already added in previous commits.
Signed-off-by: Brendan Higgins <brendanhiggins(a)google.com>
Reviewed-by: David Gow <davidgow(a)google.com>
---
Documentation/dev-tools/kunit/index.rst | 1 +
Documentation/dev-tools/kunit/kunit-tool.rst | 57 ++++++++++++++++++++
Documentation/dev-tools/kunit/start.rst | 5 +-
3 files changed, 62 insertions(+), 1 deletion(-)
create mode 100644 Documentation/dev-tools/kunit/kunit-tool.rst
diff --git a/Documentation/dev-tools/kunit/index.rst b/Documentation/dev-tools/kunit/index.rst
index 26ffb46bdf99d..c60d760a0eed1 100644
--- a/Documentation/dev-tools/kunit/index.rst
+++ b/Documentation/dev-tools/kunit/index.rst
@@ -9,6 +9,7 @@ KUnit - Unit Testing for the Linux Kernel
start
usage
+ kunit-tool
api/index
faq
diff --git a/Documentation/dev-tools/kunit/kunit-tool.rst b/Documentation/dev-tools/kunit/kunit-tool.rst
new file mode 100644
index 0000000000000..37509527c04e1
--- /dev/null
+++ b/Documentation/dev-tools/kunit/kunit-tool.rst
@@ -0,0 +1,57 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=================
+kunit_tool How-To
+=================
+
+What is kunit_tool?
+===================
+
+kunit_tool is a script (``tools/testing/kunit/kunit.py``) that aids in building
+the Linux kernel as UML (`User Mode Linux
+<http://user-mode-linux.sourceforge.net/>`_), running KUnit tests, parsing
+the test results and displaying them in a user friendly manner.
+
+What is a kunitconfig?
+======================
+
+It's just a defconfig that kunit_tool looks for in the base directory.
+kunit_tool uses it to generate a .config as you might expect. In addition, it
+verifies that the generated .config contains the CONFIG options in the
+kunitconfig; the reason it does this is so that it is easy to be sure that a
+CONFIG that enables a test actually ends up in the .config.
+
+How do I use kunit_tool?
+=================================
+
+If a kunitconfig is present at the root directory, all you have to do is:
+
+.. code-block:: bash
+
+ ./tools/testing/kunit/kunit.py run
+
+However, you most likely want to use it with the following options:
+
+.. code-block:: bash
+
+ ./tools/testing/kunit/kunit.py run --timeout=30 --jobs=`nproc --all`
+
+- ``--timeout`` sets a maximum amount of time to allow tests to run.
+- ``--jobs`` sets the number of threads to use to build the kernel.
+
+If you just want to use the defconfig that ships with the kernel, you can
+append the ``--defconfig`` flag as well:
+
+.. code-block:: bash
+
+ ./tools/testing/kunit/kunit.py run --timeout=30 --jobs=`nproc --all` --defconfig
+
+.. note::
+ This command is particularly helpful for getting started because it
+ just works. No kunitconfig needs to be present.
+
+For a list of all the flags supported by kunit_tool, you can run:
+
+.. code-block:: bash
+
+ ./tools/testing/kunit/kunit.py run --help
diff --git a/Documentation/dev-tools/kunit/start.rst b/Documentation/dev-tools/kunit/start.rst
index aeeddfafeea20..f4d9a4fa914f8 100644
--- a/Documentation/dev-tools/kunit/start.rst
+++ b/Documentation/dev-tools/kunit/start.rst
@@ -19,7 +19,10 @@ The wrapper can be run with:
.. code-block:: bash
- ./tools/testing/kunit/kunit.py run
+ ./tools/testing/kunit/kunit.py run --defconfig
+
+For more information on this wrapper (also called kunit_tool) checkout the
+:doc:`kunit-tool` page.
Creating a kunitconfig
======================
--
2.24.0.432.g9d3f5f5b63-goog
The current kunit execution model is to provide base kunit functionality
and tests built-in to the kernel. The aim of this series is to allow
building kunit itself and tests as modules. This in turn allows a
simple form of selective execution; load the module you wish to test.
In doing so, kunit itself (if also built as a module) will be loaded as
an implicit dependency.
Because this requires a core API modification - if a module delivers
multiple suites, they must be declared with the kunit_test_suites()
macro - we're proposing this patch set as a candidate to be applied to the
test tree before too many kunit consumers appear. We attempt to deal
with existing consumers in patch 3.
Changes since v3:
- removed symbol lookup patch for separate submission later
- removed use of sysctl_hung_task_timeout_seconds (patch 4, as discussed
with Brendan and Stephen)
- disabled build of string-stream-test when CONFIG_KUNIT_TEST=m; this
is to avoid having to deal with symbol lookup issues
- changed string-stream-impl.h back to string-stream.h (Brendan)
- added module build support to new list, ext4 tests
Changes since v2:
- moved string-stream.h header to lib/kunit/string-stream-impl.h (Brendan)
(patch 1)
- split out non-exported interfaces in try-catch-impl.h (Brendan)
(patch 2)
- added kunit_find_symbol() and KUNIT_INIT_SYMBOL to lookup non-exported
symbols (patches 3, 4)
- removed #ifdef MODULE around module licenses (Randy, Brendan, Andy)
(patch 4)
- replaced kunit_test_suite() with kunit_test_suites() rather than
supporting both (Brendan) (patch 4)
- lookup sysctl_hung_task_timeout_secs as kunit may be built as a module
and the symbol may not be available (patch 5)
Alan Maguire (6):
kunit: move string-stream.h to lib/kunit
kunit: hide unexported try-catch interface in try-catch-impl.h
kunit: allow kunit tests to be loaded as a module
kunit: remove timeout dependence on sysctl_hung_task_timeout_seconds
kunit: allow kunit to be loaded as a module
kunit: update documentation to describe module-based build
Documentation/dev-tools/kunit/faq.rst | 3 +-
Documentation/dev-tools/kunit/index.rst | 3 +
Documentation/dev-tools/kunit/usage.rst | 16 ++
fs/ext4/Kconfig | 2 +-
fs/ext4/Makefile | 5 +
fs/ext4/inode-test.c | 4 +-
include/kunit/assert.h | 3 +-
include/kunit/string-stream.h | 51 -----
include/kunit/test.h | 35 +++-
include/kunit/try-catch.h | 10 -
kernel/sysctl-test.c | 4 +-
lib/Kconfig.debug | 4 +-
lib/kunit/Kconfig | 6 +-
lib/kunit/Makefile | 14 +-
lib/kunit/assert.c | 10 +
lib/kunit/example-test.c | 88 ---------
lib/kunit/kunit-example-test.c | 90 +++++++++
lib/kunit/kunit-test.c | 334 ++++++++++++++++++++++++++++++++
lib/kunit/string-stream-test.c | 5 +-
lib/kunit/string-stream.c | 3 +-
lib/kunit/string-stream.h | 51 +++++
lib/kunit/test-test.c | 331 -------------------------------
lib/kunit/test.c | 25 ++-
lib/kunit/try-catch-impl.h | 28 +++
lib/kunit/try-catch.c | 37 +---
lib/list-test.c | 4 +-
26 files changed, 628 insertions(+), 538 deletions(-)
delete mode 100644 include/kunit/string-stream.h
delete mode 100644 lib/kunit/example-test.c
create mode 100644 lib/kunit/kunit-example-test.c
create mode 100644 lib/kunit/kunit-test.c
create mode 100644 lib/kunit/string-stream.h
delete mode 100644 lib/kunit/test-test.c
create mode 100644 lib/kunit/try-catch-impl.h
--
1.8.3.1
Summary
------------------------------------------------------------------------
kernel: 5.4.0-rc8
git repo: https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git
git branch: master
git commit: 1fef9976397fc9b951ee54467eccd65e0c508785
git describe: next-20191120
Test details: https://qa-reports.linaro.org/lkft/linux-next-oe/build/next-20191120
Regressions (compared to build next-20191119)
------------------------------------------------------------------------
No regressions
Fixes (compared to build next-20191119)
------------------------------------------------------------------------
No fixes
In total:
------------------------------------------------------------------------
Ran 0 total tests in the following environments and test suites.
pass 0
fail 0
xfail 0
skip 0
Environments
--------------
- x15 - arm
Test Suites
-----------
Failures
------------------------------------------------------------------------
x15:
Skips
------------------------------------------------------------------------
No skips
--
Linaro LKFT
https://lkft.linaro.org
Hi,
Christoph Hellwig has a preference to do things a little differently,
for the devmap cleanup in patch 5 ("mm: devmap: refactor 1-based
refcounting for ZONE_DEVICE pages"). That came up in a different
review thread, because the patch is out for review in two locations.
Here's that review thread:
https://lore.kernel.org/r/20191118070826.GB3099@infradead.org
...and I'm hoping that we can defer that request, because otherwise
it derails this series, which is starting to otherwise look like
it could be ready for 5.5.
There is a git repo and branch, for convenience:
git@github.com:johnhubbard/linux.git pin_user_pages_tracking_v6
Changes since v5:
* Fixed the refcounting for huge pages: in most cases, it was
only taking one GUP_PIN_COUNTING_BIAS's worth of refs, when it
should have been taking one GUP_PIN_COUNTING_BIAS for each subpage.
(Much thanks to Jan Kara for spotting that one!)
* Renamed user_page_ref_inc() to try_pin_page(), and added a new
try_pin_compound_head(). This definitely improves readability.
* Factored out some more duplication in the FOLL_PIN and FOLL_GET
cases, in gup.c.
* Fixed up some straggling "get_" --> "pin_" references in the comments.
* Added reviewed-by tags.
Changes since v4:
* Renamed put_user_page*() --> unpin_user_page().
* Removed all pin_longterm_pages*() calls. We will use FOLL_LONGTERM
at the call sites. (FOLL_PIN, however, remains an internal gup flag).
This is very nice: many patches just change three characters now:
get_user_pages --> pin_user_pages. I think we've found the right
balance of wrapper calls and gup flags, for the call sites.
* Updated a lot of documentation and commit logs to match the above
two large changes.
* Changed gup_benchmark tests and run_vmtests, to adapt to one less
use case: there is no pin_longterm_pages() call anymore.
* This includes a new devmap cleanup patch from Dan Williams, along
with a rebased follow-up: patches 4 and 5, already mentioned above.
* Fixed patch 10 ("mm/gup: introduce pin_user_pages*() and FOLL_PIN"),
so as to make pin_user_pages*() calls act as placeholders for the
corresponding get_user_pages*() calls, until a later patch fully
implements the DMA-pinning functionality.
Thanks to Jan Kara for noticing that.
* Fixed the implementation of pin_user_pages_remote().
* Further tweaked patch 2 ("mm/gup: factor out duplicate code from four
routines"), in response to Jan Kara's feedback.
* Dropped a few reviewed-by tags due to changes that invalidated
them.
Changes since v3:
* VFIO fix (patch 8): applied further cleanup: removed a pre-existing,
unnecessary release and reacquire of mmap_sem. Moved the DAX vma
checks from the vfio call site, to gup internals, and added comments
(and commit log) to clarify.
* Due to the above, made a corresponding fix to the
pin_longterm_pages_remote(), which was actually calling the wrong
gup internal function.
* Changed put_user_page() comments, to refer to pin*() APIs, rather than
get_user_pages*() APIs.
* Reverted an accidental whitespace-only change in the IB ODP code.
* Added a few more reviewed-by tags.
Changes since v2:
* Added a patch to convert IB/umem from normal gup, to gup_fast(). This
is also posted separately, in order to hopefully get some runtime
testing.
* Changed the page devmap code to be a little clearer,
thanks to Jerome for that.
* Split out the page devmap changes into a separate patch (and moved
Ira's Signed-off-by to that patch).
* Fixed my bug in IB: ODP code does not require pin_user_pages()
semantics. Therefore, revert the put_user_page() calls to put_page(),
and leave the get_user_pages() call as-is.
* As part of the revert, I am proposing here a change directly
from put_user_pages(), to release_pages(). I'd feel better if
someone agrees that this is the best way. It uses the more
efficient release_pages(), instead of put_page() in a loop,
and keep the change to just a few character on one line,
but OTOH it is not a pure revert.
* Loosened the FOLL_LONGTERM restrictions in the
__get_user_pages_locked() implementation, and used that in order
to fix up a VFIO bug. Thanks to Jason for that idea.
* Note the use of release_pages() in IB: is that OK?
* Added a few more WARN's and clarifying comments nearby.
* Many documentation improvements in various comments.
* Moved the new pin_user_pages.rst from Documentation/vm/ to
Documentation/core-api/ .
* Commit descriptions: added clarifying notes to the three patches
(drm/via, fs/io_uring, net/xdp) that already had put_user_page()
calls in place.
* Collected all pending Reviewed-by and Acked-by tags, from v1 and v2
email threads.
* Lot of churn from v2 --> v3, so it's possible that new bugs
sneaked in.
NOT DONE: separate patchset is required:
* __get_user_pages_locked(): stop compensating for
buggy callers who failed to set FOLL_GET. Instead, assert
that FOLL_GET is set (and fail if it's not).
======================================================================
Original cover letter (edited to fix up the patch description numbers)
This applies cleanly to linux-next and mmotm, and also to linux.git if
linux-next's commit 20cac10710c9 ("mm/gup_benchmark: fix MAP_HUGETLB
case") is first applied there.
This provides tracking of dma-pinned pages. This is a prerequisite to
solving the larger problem of proper interactions between file-backed
pages, and [R]DMA activities, as discussed in [1], [2], [3], and in
a remarkable number of email threads since about 2017. :)
A new internal gup flag, FOLL_PIN is introduced, and thoroughly
documented in the last patch's Documentation/vm/pin_user_pages.rst.
I believe that this will provide a good starting point for doing the
layout lease work that Ira Weiny has been working on. That's because
these new wrapper functions provide a clean, constrained, systematically
named set of functionality that, again, is required in order to even
know if a page is "dma-pinned".
In contrast to earlier approaches, the page tracking can be
incrementally applied to the kernel call sites that, until now, have
been simply calling get_user_pages() ("gup"). In other words, opt-in by
changing from this:
get_user_pages() (sets FOLL_GET)
put_page()
to this:
pin_user_pages() (sets FOLL_PIN)
put_user_page()
Because there are interdependencies with FOLL_LONGTERM, a similar
conversion as for FOLL_PIN, was applied. The change was from this:
get_user_pages(FOLL_LONGTERM) (also sets FOLL_GET)
put_page()
to this:
pin_longterm_pages() (sets FOLL_PIN | FOLL_LONGTERM)
put_user_page()
============================================================
Patch summary:
* Patches 1-9: refactoring and preparatory cleanup, independent fixes
* Patch 10: introduce pin_user_pages(), FOLL_PIN, but no functional
changes yet
* Patches 11-16: Convert existing put_user_page() callers, to use the
new pin*()
* Patch 17: Activate tracking of FOLL_PIN pages.
* Patches 18-20: convert various callers
* Patches: 21-23: gup_benchmark and run_vmtests support
* Patch 24: rename put_user_page*() --> unpin_user_page*()
============================================================
Testing:
* I've done some overall kernel testing (LTP, and a few other goodies),
and some directed testing to exercise some of the changes. And as you
can see, gup_benchmark is enhanced to exercise this. Basically, I've been
able to runtime test the core get_user_pages() and pin_user_pages() and
related routines, but not so much on several of the call sites--but those
are generally just a couple of lines changed, each.
Not much of the kernel is actually using this, which on one hand
reduces risk quite a lot. But on the other hand, testing coverage
is low. So I'd love it if, in particular, the Infiniband and PowerPC
folks could do a smoke test of this series for me.
Also, my runtime testing for the call sites so far is very weak:
* io_uring: Some directed tests from liburing exercise this, and they pass.
* process_vm_access.c: A small directed test passes.
* gup_benchmark: the enhanced version hits the new gup.c code, and passes.
* infiniband (still only have crude "IB pingpong" working, on a
good day: it's not exercising my conversions at runtime...)
* VFIO: compiles (I'm vowing to set up a run time test soon, but it's
not ready just yet)
* powerpc: it compiles...
* drm/via: compiles...
* goldfish: compiles...
* net/xdp: compiles...
* media/v4l2: compiles...
============================================================
Next:
* Get the block/bio_vec sites converted to use pin_user_pages().
* Work with Ira and Dave Chinner to weave this together with the
layout lease stuff.
============================================================
[1] Some slow progress on get_user_pages() (Apr 2, 2019): https://lwn.net/Articles/784574/
[2] DMA and get_user_pages() (LPC: Dec 12, 2018): https://lwn.net/Articles/774411/
[3] The trouble with get_user_pages() (Apr 30, 2018): https://lwn.net/Articles/753027/
Dan Williams (1):
mm: Cleanup __put_devmap_managed_page() vs ->page_free()
John Hubbard (23):
mm/gup: pass flags arg to __gup_device_* functions
mm/gup: factor out duplicate code from four routines
mm/gup: move try_get_compound_head() to top, fix minor issues
mm: devmap: refactor 1-based refcounting for ZONE_DEVICE pages
goldish_pipe: rename local pin_user_pages() routine
IB/umem: use get_user_pages_fast() to pin DMA pages
media/v4l2-core: set pages dirty upon releasing DMA buffers
vfio, mm: fix get_user_pages_remote() and FOLL_LONGTERM
mm/gup: introduce pin_user_pages*() and FOLL_PIN
goldish_pipe: convert to pin_user_pages() and put_user_page()
IB/{core,hw,umem}: set FOLL_PIN via pin_user_pages*(), fix up ODP
mm/process_vm_access: set FOLL_PIN via pin_user_pages_remote()
drm/via: set FOLL_PIN via pin_user_pages_fast()
fs/io_uring: set FOLL_PIN via pin_user_pages()
net/xdp: set FOLL_PIN via pin_user_pages()
mm/gup: track FOLL_PIN pages
media/v4l2-core: pin_user_pages (FOLL_PIN) and put_user_page()
conversion
vfio, mm: pin_user_pages (FOLL_PIN) and put_user_page() conversion
powerpc: book3s64: convert to pin_user_pages() and put_user_page()
mm/gup_benchmark: use proper FOLL_WRITE flags instead of hard-coding
"1"
mm/gup_benchmark: support pin_user_pages() and related calls
selftests/vm: run_vmtests: invoke gup_benchmark with basic FOLL_PIN
coverage
mm, tree-wide: rename put_user_page*() to unpin_user_page*()
Documentation/core-api/index.rst | 1 +
Documentation/core-api/pin_user_pages.rst | 233 +++++++++
arch/powerpc/mm/book3s64/iommu_api.c | 12 +-
drivers/gpu/drm/via/via_dmablit.c | 6 +-
drivers/infiniband/core/umem.c | 19 +-
drivers/infiniband/core/umem_odp.c | 13 +-
drivers/infiniband/hw/hfi1/user_pages.c | 4 +-
drivers/infiniband/hw/mthca/mthca_memfree.c | 8 +-
drivers/infiniband/hw/qib/qib_user_pages.c | 4 +-
drivers/infiniband/hw/qib/qib_user_sdma.c | 8 +-
drivers/infiniband/hw/usnic/usnic_uiom.c | 4 +-
drivers/infiniband/sw/siw/siw_mem.c | 4 +-
drivers/media/v4l2-core/videobuf-dma-sg.c | 8 +-
drivers/nvdimm/pmem.c | 6 -
drivers/platform/goldfish/goldfish_pipe.c | 35 +-
drivers/vfio/vfio_iommu_type1.c | 35 +-
fs/io_uring.c | 6 +-
include/linux/mm.h | 168 +++++-
include/linux/mmzone.h | 2 +
include/linux/page_ref.h | 10 +
mm/gup.c | 548 +++++++++++++++-----
mm/gup_benchmark.c | 74 ++-
mm/huge_memory.c | 54 +-
mm/hugetlb.c | 39 +-
mm/memremap.c | 76 ++-
mm/process_vm_access.c | 28 +-
mm/vmstat.c | 2 +
net/xdp/xdp_umem.c | 4 +-
tools/testing/selftests/vm/gup_benchmark.c | 21 +-
tools/testing/selftests/vm/run_vmtests | 22 +
30 files changed, 1104 insertions(+), 350 deletions(-)
create mode 100644 Documentation/core-api/pin_user_pages.rst
--
2.24.0
This patchset is being developed here:
<https://github.com/cyphar/linux/tree/openat2/master>
This is a re-send of
<https://lore.kernel.org/lkml/20191117011713.13032-1-cyphar@cyphar.com/>
but rebased on top of 5.4-rc8 (also my mails got duplicated the first
time I sent v17 -- hopefully that doesn't happen this time).
Patch changelog:
v17:
* Add a path_is_under() check for LOOKUP_IS_SCOPED in complete_walk(), as a
last line of defence to ensure that namei bugs will not break the contract
of LOOKUP_BENEATH or LOOKUP_IN_ROOT.
* Update based on feedback by Al Viro:
* Make nd_jump_link() free the passed path on error, so that callers don't
need to worry about it in the error path.
* Remove needless m_retry and r_retry variables in handle_dots().
* Always return -ECHILD from follow_dotdot_rcu().
v16: <https://lore.kernel.org/lkml/20191116002802.6663-1-cyphar@cyphar.com/>
v15: <https://lore.kernel.org/lkml/20191105090553.6350-1-cyphar@cyphar.com/>
v14: <https://lore.kernel.org/lkml/20191010054140.8483-1-cyphar@cyphar.com/>
<https://lore.kernel.org/lkml/20191026185700.10708-1-cyphar@cyphar.com>
v13: <https://lore.kernel.org/lkml/20190930183316.10190-1-cyphar@cyphar.com/>
v12: <https://lore.kernel.org/lkml/20190904201933.10736-1-cyphar@cyphar.com/>
v11: <https://lore.kernel.org/lkml/20190820033406.29796-1-cyphar@cyphar.com/>
<https://lore.kernel.org/lkml/20190728010207.9781-1-cyphar@cyphar.com/>
v10: <https://lore.kernel.org/lkml/20190719164225.27083-1-cyphar@cyphar.com/>
v09: <https://lore.kernel.org/lkml/20190706145737.5299-1-cyphar@cyphar.com/>
v08: <https://lore.kernel.org/lkml/20190520133305.11925-1-cyphar@cyphar.com/>
v07: <https://lore.kernel.org/lkml/20190507164317.13562-1-cyphar@cyphar.com/>
v06: <https://lore.kernel.org/lkml/20190506165439.9155-1-cyphar@cyphar.com/>
v05: <https://lore.kernel.org/lkml/20190320143717.2523-1-cyphar@cyphar.com/>
v04: <https://lore.kernel.org/lkml/20181112142654.341-1-cyphar@cyphar.com/>
v03: <https://lore.kernel.org/lkml/20181009070230.12884-1-cyphar@cyphar.com/>
v02: <https://lore.kernel.org/lkml/20181009065300.11053-1-cyphar@cyphar.com/>
v01: <https://lore.kernel.org/lkml/20180929103453.12025-1-cyphar@cyphar.com/>
For a very long time, extending openat(2) with new features has been
incredibly frustrating. This stems from the fact that openat(2) is
possibly the most famous counter-example to the mantra "don't silently
accept garbage from userspace" -- it doesn't check whether unknown flags
are present[1].
This means that (generally) the addition of new flags to openat(2) has
been fraught with backwards-compatibility issues (O_TMPFILE has to be
defined as __O_TMPFILE|O_DIRECTORY|[O_RDWR or O_WRONLY] to ensure old
kernels gave errors, since it's insecure to silently ignore the
flag[2]). All new security-related flags therefore have a tough road to
being added to openat(2).
Furthermore, the need for some sort of control over VFS's path resolution (to
avoid malicious paths resulting in inadvertent breakouts) has been a very
long-standing desire of many userspace applications. This patchset is a revival
of Al Viro's old AT_NO_JUMPS[3] patchset (which was a variant of David
Drysdale's O_BENEATH patchset[4] which was a spin-off of the Capsicum
project[5]) with a few additions and changes made based on the previous
discussion within [6] as well as others I felt were useful.
In line with the conclusions of the original discussion of AT_NO_JUMPS, the
flag has been split up into separate flags. However, instead of being an
openat(2) flag it is provided through a new syscall openat2(2) which provides
several other improvements to the openat(2) interface (see the patch
description for more details). The following new LOOKUP_* flags are added:
* LOOKUP_NO_XDEV blocks all mountpoint crossings (upwards, downwards,
or through absolute links). Absolute pathnames alone in openat(2) do not
trigger this. Magic-link traversal which implies a vfsmount jump is also
blocked (though magic-link jumps on the same vfsmount are permitted).
* LOOKUP_NO_MAGICLINKS blocks resolution through /proc/$pid/fd-style
links. This is done by blocking the usage of nd_jump_link() during
resolution in a filesystem. The term "magic-links" is used to match
with the only reference to these links in Documentation/, but I'm
happy to change the name.
It should be noted that this is different to the scope of
~LOOKUP_FOLLOW in that it applies to all path components. However,
you can do openat2(NO_FOLLOW|NO_MAGICLINKS) on a magic-link and it
will *not* fail (assuming that no parent component was a
magic-link), and you will have an fd for the magic-link.
In order to correctly detect magic-links, the introduction of a new
LOOKUP_MAGICLINK_JUMPED state flag was required.
* LOOKUP_BENEATH disallows escapes to outside the starting dirfd's
tree, using techniques such as ".." or absolute links. Absolute
paths in openat(2) are also disallowed. Conceptually this flag is to
ensure you "stay below" a certain point in the filesystem tree --
but this requires some additional to protect against various races
that would allow escape using "..".
Currently LOOKUP_BENEATH implies LOOKUP_NO_MAGICLINKS, because it
can trivially beam you around the filesystem (breaking the
protection). In future, there might be similar safety checks done as
in LOOKUP_IN_ROOT, but that requires more discussion.
In addition, two new flags are added that expand on the above ideas:
* LOOKUP_NO_SYMLINKS does what it says on the tin. No symlink
resolution is allowed at all, including magic-links. Just as with
LOOKUP_NO_MAGICLINKS this can still be used with NOFOLLOW to open an
fd for the symlink as long as no parent path had a symlink
component.
* LOOKUP_IN_ROOT is an extension of LOOKUP_BENEATH that, rather than
blocking attempts to move past the root, forces all such movements
to be scoped to the starting point. This provides chroot(2)-like
protection but without the cost of a chroot(2) for each filesystem
operation, as well as being safe against race attacks that chroot(2)
is not.
If a race is detected (as with LOOKUP_BENEATH) then an error is
generated, and similar to LOOKUP_BENEATH it is not permitted to cross
magic-links with LOOKUP_IN_ROOT.
The primary need for this is from container runtimes, which
currently need to do symlink scoping in userspace[7] when opening
paths in a potentially malicious container. There is a long list of
CVEs that could have bene mitigated by having RESOLVE_THIS_ROOT
(such as CVE-2017-1002101, CVE-2017-1002102, CVE-2018-15664, and
CVE-2019-5736, just to name a few).
In order to make all of the above more usable, I'm working on
libpathrs[8] which is a C-friendly library for safe path resolution. It
features a userspace-emulated backend if the kernel doesn't support
openat2(2). Hopefully we can get userspace to switch to using it, and
thus get openat2(2) support for free once it's ready.
Future work would include implementing things like RESOLVE_NO_AUTOMOUNT and
possibly a RESOLVE_NO_REMOTE (to allow programs to be sure they don't hit DoSes
though stale NFS handles).
[1]: https://lwn.net/Articles/588444/
[2]: https://lore.kernel.org/lkml/CA+55aFyyxJL1LyXZeBsf2ypriraj5ut1XkNDsunRBqgVj…
[3]: https://lore.kernel.org/lkml/20170429220414.GT29622@ZenIV.linux.org.uk
[4]: https://lore.kernel.org/lkml/1415094884-18349-1-git-send-email-drysdale@goo…
[5]: https://lore.kernel.org/lkml/1404124096-21445-1-git-send-email-drysdale@goo…
[6]: https://lwn.net/Articles/723057/
[7]: https://github.com/cyphar/filepath-securejoin
[8]: https://github.com/openSUSE/libpathrs
The current draft of the openat2(2) man-page is included below.
--8<---------------------------------------------------------------------------
OPENAT2(2) Linux Programmer's Manual OPENAT2(2)
NAME
openat2 - open and possibly create a file (extended)
SYNOPSIS
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
int openat2(int dirfd, const char *pathname, struct open_how *how, size_t size);
Note: There is no glibc wrapper for this system call; see NOTES.
DESCRIPTION
The openat2() system call opens the file specified by pathname. If the specified file
does not exist, it may optionally (if O_CREAT is specified in how.flags) be created by
openat2().
As with openat(2), if pathname is relative, then it is interpreted relative to the direc-
tory referred to by the file descriptor dirfd (or the current working directory of the
calling process, if dirfd is the special value AT_FDCWD.) If pathname is absolute, then
dirfd is ignored (unless how.resolve contains RESOLVE_IN_ROOT, in which case pathname is
resolved relative to dirfd.)
The openat2() system call is an extension of openat(2) and provides a superset of its
functionality. Rather than taking a single flag argument, an extensible structure (how)
is passed instead to allow for future extensions. size must be set to sizeof(struct
open_how), to facilitate future extensions (see the "Extensibility" section of the NOTES
for more detail on how extensions are handled.)
The open_how structure
The following structure indicates how pathname should be opened, and acts as a superset of
the flag and mode arguments to openat(2).
struct open_how {
__aligned_u64 flags; /* O_* flags. */
__u16 mode; /* Mode for O_{CREAT,TMPFILE}. */
__u16 __padding[3]; /* Must be zeroed. */
__aligned_u64 resolve; /* RESOLVE_* flags. */
};
Any future extensions to openat2() will be implemented as new fields appended to the above
structure (or through reuse of pre-existing padding space), with the zero value of the new
fields acting as though the extension were not present.
The meaning of each field is as follows:
flags
The file creation and status flags to use for this operation. All of the
O_* flags defined for openat(2) are valid openat2() flag values.
Unlike openat(2), it is an error to provide openat2() unknown or conflicting
flags in flags.
mode
File mode for the new file, with identical semantics to the mode argument to
openat(2). However, unlike openat(2), it is an error to provide openat2()
with a mode which contains bits other than 0777.
It is an error to provide openat2() a non-zero mode if flags does not con-
tain O_CREAT or O_TMPFILE.
resolve
Change how the components of pathname will be resolved (see path_resolu-
tion(7) for background information.) The primary use case for these flags
is to allow trusted programs to restrict how untrusted paths (or paths in-
side untrusted directories) are resolved. The full list of resolve flags is
given below.
RESOLVE_NO_XDEV
Disallow traversal of mount points during path resolution (including
all bind mounts).
Users of this flag are encouraged to make its use configurable (un-
less it is used for a specific security purpose), as bind mounts are
very widely used by end-users. Setting this flag indiscrimnately for
all uses of openat2() may result in spurious errors on previously-
functional systems.
RESOLVE_NO_SYMLINKS
Disallow resolution of symbolic links during path resolution. This
option implies RESOLVE_NO_MAGICLINKS.
If the trailing component is a symbolic link, and flags contains both
O_PATH and O_NOFOLLOW, then an O_PATH file descriptor referencing the
symbolic link will be returned.
Users of this flag are encouraged to make its use configurable (un-
less it is used for a specific security purpose), as symbolic links
are very widely used by end-users. Setting this flag indiscrimnately
for all uses of openat2() may result in spurious errors on previ-
ously-functional systems.
RESOLVE_NO_MAGICLINKS
Disallow all magic link resolution during path resolution.
If the trailing component is a magic link, and flags contains both
O_PATH and O_NOFOLLOW, then an O_PATH file descriptor referencing the
magic link will be returned.
Magic-links are symbolic link-like objects that are most notably
found in proc(5) (examples include /proc/[pid]/exe and
/proc/[pid]/fd/*.) Due to the potential danger of unknowingly open-
ing these magic links, it may be preferable for users to disable
their resolution entirely (see symboliclink(7) for more details.)
RESOLVE_BENEATH
Do not permit the path resolution to succeed if any component of the
resolution is not a descendant of the directory indicated by dirfd.
This results in absolute symbolic links (and absolute values of path-
name) to be rejected.
Currently, this flag also disables magic link resolution. However,
this may change in the future. The caller should explicitly specify
RESOLVE_NO_MAGICLINKS to ensure that magic links are not resolved.
RESOLVE_IN_ROOT
Treat dirfd as the root directory while resolving pathname (as though
the user called chroot(2) with dirfd as the argument.) Absolute sym-
bolic links and ".." path components will be scoped to dirfd. If
pathname is an absolute path, it is also treated relative to dirfd.
However, unlike chroot(2) (which changes the filesystem root perma-
nently for a process), RESOLVE_IN_ROOT allows a program to effi-
ciently restrict path resolution for only certain operations. It
also has several hardening features (such detecting escape attempts
during .. resolution) which chroot(2) does not.
Currently, this flag also disables magic link resolution. However,
this may change in the future. The caller should explicitly specify
RESOLVE_NO_MAGICLINKS to ensure that magic links are not resolved.
It is an error to provide openat2() unknown flags in resolve.
RETURN VALUE
On success, a new file descriptor is returned. On error, -1 is returned, and errno is set
appropriately.
ERRORS
The set of errors returned by openat2() includes all of the errors returned by openat(2),
as well as the following additional errors:
EINVAL An unknown flag or invalid value was specified in how.
EINVAL mode is non-zero, but flags does not contain O_CREAT or O_TMPFILE.
EINVAL size was smaller than any known version of struct open_how.
E2BIG An extension was specified in how, which the current kernel does not support (see
the "Extensibility" section of the NOTES for more detail on how extensions are han-
dled.)
EAGAIN resolve contains either RESOLVE_IN_ROOT or RESOLVE_BENEATH, and the kernel could
not ensure that a ".." component didn't escape (due to a race condition or poten-
tial attack.) Callers may choose to retry the openat2() call.
EXDEV resolve contains either RESOLVE_IN_ROOT or RESOLVE_BENEATH, and an escape from the
root during path resolution was detected.
EXDEV resolve contains RESOLVE_NO_XDEV, and a path component attempted to cross a mount
point.
ELOOP resolve contains RESOLVE_NO_SYMLINKS, and one of the path components was a symbolic
link (or magic link).
ELOOP resolve contains RESOLVE_NO_MAGICLINKS, and one of the path components was a magic
link.
VERSIONS
openat2() was added to Linux in kernel 5.FOO.
CONFORMING TO
This system call is Linux-specific.
The semantics of RESOLVE_BENEATH were modelled after FreeBSD's O_BENEATH.
NOTES
Glibc does not provide a wrapper for this system call; call it using systemcall(2).
Extensibility
In order to allow for struct open_how to be extended in future kernel revisions, openat2()
requires userspace to specify the size of struct open_how structure they are passing. By
providing this information, it is possible for openat2() to provide both forwards- and
backwards-compatibility — with size acting as an implicit version number (because new ex-
tension fields will always be appended, the size will always increase.) This extensibil-
ity design is very similar to other system calls such as perf_setattr(2),
perf_event_open(2), and clone(3).
If we let usize be the size of the structure according to userspace and ksize be the size
of the structure which the kernel supports, then there are only three cases to consider:
* If ksize equals usize, then there is no version mismatch and how can be used
verbatim.
* If ksize is larger than usize, then there are some extensions the kernel sup-
ports which the userspace program is unaware of. Because all extensions must
have their zero values be a no-op, the kernel treats all of the extension fields
not set by userspace to have zero values. This provides backwards-compatibil-
ity.
* If ksize is smaller than usize, then there are some extensions which the
userspace program is aware of but the kernel does not support. Because all ex-
tensions must have their zero values be a no-op, the kernel can safely ignore
the unsupported extension fields if they are all-zero. If any unsupported ex-
tension fields are non-zero, then -1 is returned and errno is set to E2BIG.
This provides forwards-compatibility.
Therefore, most userspace programs will not need to have any special handling of exten-
sions. However, if a userspace program wishes to determine what extensions the running
kernel supports, they may conduct a binary search on size (to find the largest value which
doesn't produce an error of E2BIG.)
SEE ALSO
openat(2), path_resolution(7), symlink(7)
Linux 2019-11-05 OPENAT2(2)
--8<---------------------------------------------------------------------------
Aleksa Sarai (13):
namei: only return -ECHILD from follow_dotdot_rcu()
nsfs: clean-up ns_get_path() signature to return int
namei: allow nd_jump_link() to produce errors
namei: allow set_root() to produce errors
namei: LOOKUP_NO_SYMLINKS: block symlink resolution
namei: LOOKUP_NO_MAGICLINKS: block magic-link resolution
namei: LOOKUP_NO_XDEV: block mountpoint crossing
namei: LOOKUP_BENEATH: O_BENEATH-like scoped resolution
namei: LOOKUP_IN_ROOT: chroot-like scoped resolution
namei: LOOKUP_{IN_ROOT,BENEATH}: permit limited ".." resolution
open: introduce openat2(2) syscall
selftests: add openat2(2) selftests
Documentation: path-lookup: include new LOOKUP flags
CREDITS | 4 +-
Documentation/filesystems/path-lookup.rst | 68 ++-
arch/alpha/kernel/syscalls/syscall.tbl | 1 +
arch/arm/tools/syscall.tbl | 1 +
arch/arm64/include/asm/unistd.h | 2 +-
arch/arm64/include/asm/unistd32.h | 2 +
arch/ia64/kernel/syscalls/syscall.tbl | 1 +
arch/m68k/kernel/syscalls/syscall.tbl | 1 +
arch/microblaze/kernel/syscalls/syscall.tbl | 1 +
arch/mips/kernel/syscalls/syscall_n32.tbl | 1 +
arch/mips/kernel/syscalls/syscall_n64.tbl | 1 +
arch/mips/kernel/syscalls/syscall_o32.tbl | 1 +
arch/parisc/kernel/syscalls/syscall.tbl | 1 +
arch/powerpc/kernel/syscalls/syscall.tbl | 1 +
arch/s390/kernel/syscalls/syscall.tbl | 1 +
arch/sh/kernel/syscalls/syscall.tbl | 1 +
arch/sparc/kernel/syscalls/syscall.tbl | 1 +
arch/x86/entry/syscalls/syscall_32.tbl | 1 +
arch/x86/entry/syscalls/syscall_64.tbl | 1 +
arch/xtensa/kernel/syscalls/syscall.tbl | 1 +
fs/namei.c | 185 +++++--
fs/nsfs.c | 29 +-
fs/open.c | 149 +++--
fs/proc/base.c | 3 +-
fs/proc/namespaces.c | 20 +-
include/linux/fcntl.h | 12 +-
include/linux/namei.h | 12 +-
include/linux/proc_ns.h | 4 +-
include/linux/syscalls.h | 3 +
include/uapi/asm-generic/unistd.h | 5 +-
include/uapi/linux/fcntl.h | 40 ++
kernel/bpf/offload.c | 12 +-
kernel/events/core.c | 2 +-
security/apparmor/apparmorfs.c | 6 +-
tools/testing/selftests/Makefile | 1 +
tools/testing/selftests/openat2/.gitignore | 1 +
tools/testing/selftests/openat2/Makefile | 8 +
tools/testing/selftests/openat2/helpers.c | 109 ++++
tools/testing/selftests/openat2/helpers.h | 107 ++++
.../testing/selftests/openat2/openat2_test.c | 316 +++++++++++
.../selftests/openat2/rename_attack_test.c | 160 ++++++
.../testing/selftests/openat2/resolve_test.c | 523 ++++++++++++++++++
42 files changed, 1686 insertions(+), 113 deletions(-)
create mode 100644 tools/testing/selftests/openat2/.gitignore
create mode 100644 tools/testing/selftests/openat2/Makefile
create mode 100644 tools/testing/selftests/openat2/helpers.c
create mode 100644 tools/testing/selftests/openat2/helpers.h
create mode 100644 tools/testing/selftests/openat2/openat2_test.c
create mode 100644 tools/testing/selftests/openat2/rename_attack_test.c
create mode 100644 tools/testing/selftests/openat2/resolve_test.c
base-commit: af42d3466bdc8f39806b26f593604fdc54140bcb
--
2.24.0
Add documentation for the Python script used to build, run, and collect
results from the kernel known as kunit_tool. kunit_tool
(tools/testing/kunit/kunit.py) was already added in previous commits.
Signed-off-by: Brendan Higgins <brendanhiggins(a)google.com>
Reviewed-by: David Gow <davidgow(a)google.com>
Cc: Randy Dunlap <rdunlap(a)infradead.org>
---
Documentation/dev-tools/kunit/index.rst | 1 +
Documentation/dev-tools/kunit/kunit-tool.rst | 57 ++++++++++++++++++++
Documentation/dev-tools/kunit/start.rst | 5 +-
3 files changed, 62 insertions(+), 1 deletion(-)
create mode 100644 Documentation/dev-tools/kunit/kunit-tool.rst
diff --git a/Documentation/dev-tools/kunit/index.rst b/Documentation/dev-tools/kunit/index.rst
index 26ffb46bdf99d..c60d760a0eed1 100644
--- a/Documentation/dev-tools/kunit/index.rst
+++ b/Documentation/dev-tools/kunit/index.rst
@@ -9,6 +9,7 @@ KUnit - Unit Testing for the Linux Kernel
start
usage
+ kunit-tool
api/index
faq
diff --git a/Documentation/dev-tools/kunit/kunit-tool.rst b/Documentation/dev-tools/kunit/kunit-tool.rst
new file mode 100644
index 0000000000000..50d46394e97e3
--- /dev/null
+++ b/Documentation/dev-tools/kunit/kunit-tool.rst
@@ -0,0 +1,57 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=================
+kunit_tool How-To
+=================
+
+What is kunit_tool?
+===================
+
+kunit_tool is a script (``tools/testing/kunit/kunit.py``) that aids in building
+the Linux kernel as UML (`User Mode Linux
+<http://user-mode-linux.sourceforge.net/>`_), running KUnit tests, parsing
+the test results and displaying them in a user friendly manner.
+
+What is a kunitconfig?
+======================
+
+It's just a defconfig that kunit_tool looks for in the base directory.
+kunit_tool uses it to generate a .config as you might expect. In addition, it
+verifies that the generated .config contains the CONFIG options in the
+kunitconfig; the reason it does this is so that it is easy to be sure that a
+CONFIG that enables a test actually ends up in the .config.
+
+How do I use kunit_tool?
+========================
+
+If a kunitconfig is present at the root directory, all you have to do is:
+
+.. code-block:: bash
+
+ ./tools/testing/kunit/kunit.py run
+
+However, you most likely want to use it with the following options:
+
+.. code-block:: bash
+
+ ./tools/testing/kunit/kunit.py run --timeout=30 --jobs=`nproc --all`
+
+- ``--timeout`` sets a maximum amount of time to allow tests to run.
+- ``--jobs`` sets the number of threads to use to build the kernel.
+
+If you just want to use the defconfig that ships with the kernel, you can
+append the ``--defconfig`` flag as well:
+
+.. code-block:: bash
+
+ ./tools/testing/kunit/kunit.py run --timeout=30 --jobs=`nproc --all` --defconfig
+
+.. note::
+ This command is particularly helpful for getting started because it
+ just works. No kunitconfig needs to be present.
+
+For a list of all the flags supported by kunit_tool, you can run:
+
+.. code-block:: bash
+
+ ./tools/testing/kunit/kunit.py run --help
diff --git a/Documentation/dev-tools/kunit/start.rst b/Documentation/dev-tools/kunit/start.rst
index aeeddfafeea20..f4d9a4fa914f8 100644
--- a/Documentation/dev-tools/kunit/start.rst
+++ b/Documentation/dev-tools/kunit/start.rst
@@ -19,7 +19,10 @@ The wrapper can be run with:
.. code-block:: bash
- ./tools/testing/kunit/kunit.py run
+ ./tools/testing/kunit/kunit.py run --defconfig
+
+For more information on this wrapper (also called kunit_tool) checkout the
+:doc:`kunit-tool` page.
Creating a kunitconfig
======================
--
2.24.0.432.g9d3f5f5b63-goog
Fix typos and gramatical errors in the Getting Started and Usage guide
for KUnit.
Reported-by: Randy Dunlap <rdunlap(a)infradead.org>
Link: https://patchwork.kernel.org/patch/11156481/
Reported-by: Rinat Ibragimov <ibragimovrinat(a)mail.ru>
Link: https://github.com/google/kunit-docs/issues/1
Signed-off-by: Brendan Higgins <brendanhiggins(a)google.com>
Reviewed-by: David Gow <davidgow(a)google.com>
---
Documentation/dev-tools/kunit/start.rst | 8 ++++----
Documentation/dev-tools/kunit/usage.rst | 24 ++++++++++++------------
2 files changed, 16 insertions(+), 16 deletions(-)
diff --git a/Documentation/dev-tools/kunit/start.rst b/Documentation/dev-tools/kunit/start.rst
index f4d9a4fa914f8..9d6db892c41c0 100644
--- a/Documentation/dev-tools/kunit/start.rst
+++ b/Documentation/dev-tools/kunit/start.rst
@@ -26,7 +26,7 @@ For more information on this wrapper (also called kunit_tool) checkout the
Creating a kunitconfig
======================
-The Python script is a thin wrapper around Kbuild as such, it needs to be
+The Python script is a thin wrapper around Kbuild. As such, it needs to be
configured with a ``kunitconfig`` file. This file essentially contains the
regular Kernel config, with the specific test targets as well.
@@ -62,8 +62,8 @@ If everything worked correctly, you should see the following:
followed by a list of tests that are run. All of them should be passing.
.. note::
- Because it is building a lot of sources for the first time, the ``Building
- kunit kernel`` step may take a while.
+ Because it is building a lot of sources for the first time, the
+ ``Building KUnit kernel`` step may take a while.
Writing your first test
=======================
@@ -162,7 +162,7 @@ Now you can run the test:
.. code-block:: bash
- ./tools/testing/kunit/kunit.py
+ ./tools/testing/kunit/kunit.py run
You should see the following failure:
diff --git a/Documentation/dev-tools/kunit/usage.rst b/Documentation/dev-tools/kunit/usage.rst
index c6e69634e274b..b9a065ab681ee 100644
--- a/Documentation/dev-tools/kunit/usage.rst
+++ b/Documentation/dev-tools/kunit/usage.rst
@@ -16,7 +16,7 @@ Organization of this document
=============================
This document is organized into two main sections: Testing and Isolating
-Behavior. The first covers what a unit test is and how to use KUnit to write
+Behavior. The first covers what unit tests are and how to use KUnit to write
them. The second covers how to use KUnit to isolate code and make it possible
to unit test code that was otherwise un-unit-testable.
@@ -174,13 +174,13 @@ Test Suites
~~~~~~~~~~~
Now obviously one unit test isn't very helpful; the power comes from having
-many test cases covering all of your behaviors. Consequently it is common to
-have many *similar* tests; in order to reduce duplication in these closely
-related tests most unit testing frameworks provide the concept of a *test
-suite*, in KUnit we call it a *test suite*; all it is is just a collection of
-test cases for a unit of code with a set up function that gets invoked before
-every test cases and then a tear down function that gets invoked after every
-test case completes.
+many test cases covering all of a unit's behaviors. Consequently it is common
+to have many *similar* tests; in order to reduce duplication in these closely
+related tests most unit testing frameworks - including KUnit - provide the
+concept of a *test suite*. A *test suite* is just a collection of test cases
+for a unit of code with a set up function that gets invoked before every test
+case and then a tear down function that gets invoked after every test case
+completes.
Example:
@@ -211,7 +211,7 @@ KUnit test framework.
.. note::
A test case will only be run if it is associated with a test suite.
-For a more information on these types of things see the :doc:`api/test`.
+For more information on these types of things see the :doc:`api/test`.
Isolating Behavior
==================
@@ -338,7 +338,7 @@ We can easily test this code by *faking out* the underlying EEPROM:
return count;
}
- ssize_t fake_eeprom_write(struct eeprom *this, size_t offset, const char *buffer, size_t count)
+ ssize_t fake_eeprom_write(struct eeprom *parent, size_t offset, const char *buffer, size_t count)
{
struct fake_eeprom *this = container_of(parent, struct fake_eeprom, parent);
@@ -454,7 +454,7 @@ KUnit on non-UML architectures
By default KUnit uses UML as a way to provide dependencies for code under test.
Under most circumstances KUnit's usage of UML should be treated as an
implementation detail of how KUnit works under the hood. Nevertheless, there
-are instances where being able to run architecture specific code, or test
+are instances where being able to run architecture specific code or test
against real hardware is desirable. For these reasons KUnit supports running on
other architectures.
@@ -557,7 +557,7 @@ run your tests on your hardware setup just by compiling for your architecture.
.. important::
Always prefer tests that run on UML to tests that only run under a particular
architecture, and always prefer tests that run under QEMU or another easy
- (and monitarily free) to obtain software environment to a specific piece of
+ (and monetarily free) to obtain software environment to a specific piece of
hardware.
Nevertheless, there are still valid reasons to write an architecture or hardware
--
2.24.0.432.g9d3f5f5b63-goog
Fix typos and gramatical errors in the Getting Started and Usage guide
for KUnit.
Reported-by: Randy Dunlap <rdunlap(a)infradead.org>
Link: https://patchwork.kernel.org/patch/11156481/
Signed-off-by: Brendan Higgins <brendanhiggins(a)google.com>
---
Documentation/dev-tools/kunit/start.rst | 6 +++---
Documentation/dev-tools/kunit/usage.rst | 22 +++++++++++-----------
2 files changed, 14 insertions(+), 14 deletions(-)
diff --git a/Documentation/dev-tools/kunit/start.rst b/Documentation/dev-tools/kunit/start.rst
index f4d9a4fa914f8..db146c7d77490 100644
--- a/Documentation/dev-tools/kunit/start.rst
+++ b/Documentation/dev-tools/kunit/start.rst
@@ -26,7 +26,7 @@ For more information on this wrapper (also called kunit_tool) checkout the
Creating a kunitconfig
======================
-The Python script is a thin wrapper around Kbuild as such, it needs to be
+The Python script is a thin wrapper around Kbuild. As such, it needs to be
configured with a ``kunitconfig`` file. This file essentially contains the
regular Kernel config, with the specific test targets as well.
@@ -62,8 +62,8 @@ If everything worked correctly, you should see the following:
followed by a list of tests that are run. All of them should be passing.
.. note::
- Because it is building a lot of sources for the first time, the ``Building
- kunit kernel`` step may take a while.
+ Because it is building a lot of sources for the first time, the
+ ``Building KUnit kernel`` step may take a while.
Writing your first test
=======================
diff --git a/Documentation/dev-tools/kunit/usage.rst b/Documentation/dev-tools/kunit/usage.rst
index c6e69634e274b..ae42a0d128c27 100644
--- a/Documentation/dev-tools/kunit/usage.rst
+++ b/Documentation/dev-tools/kunit/usage.rst
@@ -16,7 +16,7 @@ Organization of this document
=============================
This document is organized into two main sections: Testing and Isolating
-Behavior. The first covers what a unit test is and how to use KUnit to write
+Behavior. The first covers what unit tests are and how to use KUnit to write
them. The second covers how to use KUnit to isolate code and make it possible
to unit test code that was otherwise un-unit-testable.
@@ -174,13 +174,13 @@ Test Suites
~~~~~~~~~~~
Now obviously one unit test isn't very helpful; the power comes from having
-many test cases covering all of your behaviors. Consequently it is common to
-have many *similar* tests; in order to reduce duplication in these closely
-related tests most unit testing frameworks provide the concept of a *test
-suite*, in KUnit we call it a *test suite*; all it is is just a collection of
-test cases for a unit of code with a set up function that gets invoked before
-every test cases and then a tear down function that gets invoked after every
-test case completes.
+many test cases covering all of a unit's behaviors. Consequently it is common
+to have many *similar* tests; in order to reduce duplication in these closely
+related tests most unit testing frameworks - including KUnit - provide the
+concept of a *test suite*. A *test suite* is just a collection of test cases
+for a unit of code with a set up function that gets invoked before every test
+case and then a tear down function that gets invoked after every test case
+completes.
Example:
@@ -211,7 +211,7 @@ KUnit test framework.
.. note::
A test case will only be run if it is associated with a test suite.
-For a more information on these types of things see the :doc:`api/test`.
+For more information on these types of things see the :doc:`api/test`.
Isolating Behavior
==================
@@ -454,7 +454,7 @@ KUnit on non-UML architectures
By default KUnit uses UML as a way to provide dependencies for code under test.
Under most circumstances KUnit's usage of UML should be treated as an
implementation detail of how KUnit works under the hood. Nevertheless, there
-are instances where being able to run architecture specific code, or test
+are instances where being able to run architecture specific code or test
against real hardware is desirable. For these reasons KUnit supports running on
other architectures.
@@ -557,7 +557,7 @@ run your tests on your hardware setup just by compiling for your architecture.
.. important::
Always prefer tests that run on UML to tests that only run under a particular
architecture, and always prefer tests that run under QEMU or another easy
- (and monitarily free) to obtain software environment to a specific piece of
+ (and monetarily free) to obtain software environment to a specific piece of
hardware.
Nevertheless, there are still valid reasons to write an architecture or hardware
--
2.24.0.432.g9d3f5f5b63-goog
Hi,
Please note that two of these patches are also out for review
separately, and may go in earlier than this series. This series
requires those, so to ease review and testing, they are also
included here: patches 4 and 5, the devmap cleanups.
There is a git repo and branch, for convenience:
git@github.com:johnhubbard/linux.git pin_user_pages_tracking_v5
Despite the large number of changes, it does feel like the review
comments are converging, btw.
Changes since v4:
* Renamed put_user_page*() --> unpin_user_page().
* Removed all pin_longterm_pages*() calls. We will use FOLL_LONGTERM
at the call sites. (FOLL_PIN, however, remains an internal gup flag).
This is very nice: many patches just change three characters now:
get_user_pages --> pin_user_pages. I think we've found the right
balance of wrapper calls and gup flags, for the call sites.
* Updated a lot of documentation and commit logs to match the above
two large changes.
* Changed gup_benchmark tests and run_vmtests, to adapt to one less
use case: there is no pin_longterm_pages() call anymore.
* This includes a new devmap cleanup patch from Dan Williams, along
with a rebased follow-up: patches 4 and 5, already mentioned above.
* Fixed patch 10 ("mm/gup: introduce pin_user_pages*() and FOLL_PIN"),
so as to make pin_user_pages*() calls act as placeholders for the
corresponding get_user_pages*() calls, until a later patch fully
implements the DMA-pinning functionality.
Thanks to Jan Kara for noticing that.
* Fixed the implementation of pin_user_pages_remote().
* Further tweaked patch 2 ("mm/gup: factor out duplicate code from four
routines"), in response to Jan Kara's feedback.
* Dropped a few reviewed-by tags due to changes that invalidated
them.
Changes since v3:
* VFIO fix (patch 8): applied further cleanup: removed a pre-existing,
unnecessary release and reacquire of mmap_sem. Moved the DAX vma
checks from the vfio call site, to gup internals, and added comments
(and commit log) to clarify.
* Due to the above, made a corresponding fix to the
pin_longterm_pages_remote(), which was actually calling the wrong
gup internal function.
* Changed put_user_page() comments, to refer to pin*() APIs, rather than
get_user_pages*() APIs.
* Reverted an accidental whitespace-only change in the IB ODP code.
* Added a few more reviewed-by tags.
Changes since v2:
* Added a patch to convert IB/umem from normal gup, to gup_fast(). This
is also posted separately, in order to hopefully get some runtime
testing.
* Changed the page devmap code to be a little clearer,
thanks to Jerome for that.
* Split out the page devmap changes into a separate patch (and moved
Ira's Signed-off-by to that patch).
* Fixed my bug in IB: ODP code does not require pin_user_pages()
semantics. Therefore, revert the put_user_page() calls to put_page(),
and leave the get_user_pages() call as-is.
* As part of the revert, I am proposing here a change directly
from put_user_pages(), to release_pages(). I'd feel better if
someone agrees that this is the best way. It uses the more
efficient release_pages(), instead of put_page() in a loop,
and keep the change to just a few character on one line,
but OTOH it is not a pure revert.
* Loosened the FOLL_LONGTERM restrictions in the
__get_user_pages_locked() implementation, and used that in order
to fix up a VFIO bug. Thanks to Jason for that idea.
* Note the use of release_pages() in IB: is that OK?
* Added a few more WARN's and clarifying comments nearby.
* Many documentation improvements in various comments.
* Moved the new pin_user_pages.rst from Documentation/vm/ to
Documentation/core-api/ .
* Commit descriptions: added clarifying notes to the three patches
(drm/via, fs/io_uring, net/xdp) that already had put_user_page()
calls in place.
* Collected all pending Reviewed-by and Acked-by tags, from v1 and v2
email threads.
* Lot of churn from v2 --> v3, so it's possible that new bugs
sneaked in.
NOT DONE: separate patchset is required:
* __get_user_pages_locked(): stop compensating for
buggy callers who failed to set FOLL_GET. Instead, assert
that FOLL_GET is set (and fail if it's not).
======================================================================
Original cover letter (edited to fix up the patch description numbers)
This applies cleanly to linux-next and mmotm, and also to linux.git if
linux-next's commit 20cac10710c9 ("mm/gup_benchmark: fix MAP_HUGETLB
case") is first applied there.
This provides tracking of dma-pinned pages. This is a prerequisite to
solving the larger problem of proper interactions between file-backed
pages, and [R]DMA activities, as discussed in [1], [2], [3], and in
a remarkable number of email threads since about 2017. :)
A new internal gup flag, FOLL_PIN is introduced, and thoroughly
documented in the last patch's Documentation/vm/pin_user_pages.rst.
I believe that this will provide a good starting point for doing the
layout lease work that Ira Weiny has been working on. That's because
these new wrapper functions provide a clean, constrained, systematically
named set of functionality that, again, is required in order to even
know if a page is "dma-pinned".
In contrast to earlier approaches, the page tracking can be
incrementally applied to the kernel call sites that, until now, have
been simply calling get_user_pages() ("gup"). In other words, opt-in by
changing from this:
get_user_pages() (sets FOLL_GET)
put_page()
to this:
pin_user_pages() (sets FOLL_PIN)
put_user_page()
Because there are interdependencies with FOLL_LONGTERM, a similar
conversion as for FOLL_PIN, was applied. The change was from this:
get_user_pages(FOLL_LONGTERM) (also sets FOLL_GET)
put_page()
to this:
pin_longterm_pages() (sets FOLL_PIN | FOLL_LONGTERM)
put_user_page()
============================================================
Patch summary:
* Patches 1-9: refactoring and preparatory cleanup, independent fixes
* Patch 10: introduce pin_user_pages(), FOLL_PIN, but no functional
changes yet
* Patches 11-16: Convert existing put_user_page() callers, to use the
new pin*()
* Patch 17: Activate tracking of FOLL_PIN pages.
* Patches 18-20: convert various callers
* Patches: 21-23: gup_benchmark and run_vmtests support
* Patch 24: rename put_user_page*() --> unpin_user_page*()
============================================================
Testing:
* I've done some overall kernel testing (LTP, and a few other goodies),
and some directed testing to exercise some of the changes. And as you
can see, gup_benchmark is enhanced to exercise this. Basically, I've been
able to runtime test the core get_user_pages() and pin_user_pages() and
related routines, but not so much on several of the call sites--but those
are generally just a couple of lines changed, each.
Not much of the kernel is actually using this, which on one hand
reduces risk quite a lot. But on the other hand, testing coverage
is low. So I'd love it if, in particular, the Infiniband and PowerPC
folks could do a smoke test of this series for me.
Also, my runtime testing for the call sites so far is very weak:
* io_uring: Some directed tests from liburing exercise this, and they pass.
* process_vm_access.c: A small directed test passes.
* gup_benchmark: the enhanced version hits the new gup.c code, and passes.
* infiniband (still only have crude "IB pingpong" working, on a
good day: it's not exercising my conversions at runtime...)
* VFIO: compiles (I'm vowing to set up a run time test soon, but it's
not ready just yet)
* powerpc: it compiles...
* drm/via: compiles...
* goldfish: compiles...
* net/xdp: compiles...
* media/v4l2: compiles...
============================================================
Next:
* Get the block/bio_vec sites converted to use pin_user_pages().
* Work with Ira and Dave Chinner to weave this together with the
layout lease stuff.
============================================================
[1] Some slow progress on get_user_pages() (Apr 2, 2019): https://lwn.net/Articles/784574/
[2] DMA and get_user_pages() (LPC: Dec 12, 2018): https://lwn.net/Articles/774411/
[3] The trouble with get_user_pages() (Apr 30, 2018): https://lwn.net/Articles/753027/
Dan Williams (1):
mm: Cleanup __put_devmap_managed_page() vs ->page_free()
John Hubbard (23):
mm/gup: pass flags arg to __gup_device_* functions
mm/gup: factor out duplicate code from four routines
mm/gup: move try_get_compound_head() to top, fix minor issues
mm: devmap: refactor 1-based refcounting for ZONE_DEVICE pages
goldish_pipe: rename local pin_user_pages() routine
IB/umem: use get_user_pages_fast() to pin DMA pages
media/v4l2-core: set pages dirty upon releasing DMA buffers
vfio, mm: fix get_user_pages_remote() and FOLL_LONGTERM
mm/gup: introduce pin_user_pages*() and FOLL_PIN
goldish_pipe: convert to pin_user_pages() and put_user_page()
IB/{core,hw,umem}: set FOLL_PIN via pin_user_pages*(), fix up ODP
mm/process_vm_access: set FOLL_PIN via pin_user_pages_remote()
drm/via: set FOLL_PIN via pin_user_pages_fast()
fs/io_uring: set FOLL_PIN via pin_user_pages()
net/xdp: set FOLL_PIN via pin_user_pages()
mm/gup: track FOLL_PIN pages
media/v4l2-core: pin_user_pages (FOLL_PIN) and put_user_page()
conversion
vfio, mm: pin_user_pages (FOLL_PIN) and put_user_page() conversion
powerpc: book3s64: convert to pin_user_pages() and put_user_page()
mm/gup_benchmark: use proper FOLL_WRITE flags instead of hard-coding
"1"
mm/gup_benchmark: support pin_user_pages() and related calls
selftests/vm: run_vmtests: invoke gup_benchmark with basic FOLL_PIN
coverage
mm, tree-wide: rename put_user_page*() to unpin_user_page*()
Documentation/core-api/index.rst | 1 +
Documentation/core-api/pin_user_pages.rst | 233 ++++++++
arch/powerpc/mm/book3s64/iommu_api.c | 12 +-
drivers/gpu/drm/via/via_dmablit.c | 6 +-
drivers/infiniband/core/umem.c | 19 +-
drivers/infiniband/core/umem_odp.c | 13 +-
drivers/infiniband/hw/hfi1/user_pages.c | 4 +-
drivers/infiniband/hw/mthca/mthca_memfree.c | 8 +-
drivers/infiniband/hw/qib/qib_user_pages.c | 4 +-
drivers/infiniband/hw/qib/qib_user_sdma.c | 8 +-
drivers/infiniband/hw/usnic/usnic_uiom.c | 4 +-
drivers/infiniband/sw/siw/siw_mem.c | 4 +-
drivers/media/v4l2-core/videobuf-dma-sg.c | 8 +-
drivers/nvdimm/pmem.c | 6 -
drivers/platform/goldfish/goldfish_pipe.c | 35 +-
drivers/vfio/vfio_iommu_type1.c | 35 +-
fs/io_uring.c | 6 +-
include/linux/mm.h | 155 +++++-
include/linux/mmzone.h | 2 +
include/linux/page_ref.h | 10 +
mm/gup.c | 561 +++++++++++++++-----
mm/gup_benchmark.c | 74 ++-
mm/huge_memory.c | 54 +-
mm/hugetlb.c | 39 +-
mm/memremap.c | 76 ++-
mm/process_vm_access.c | 28 +-
mm/vmstat.c | 2 +
net/xdp/xdp_umem.c | 4 +-
tools/testing/selftests/vm/gup_benchmark.c | 21 +-
tools/testing/selftests/vm/run_vmtests | 22 +
30 files changed, 1104 insertions(+), 350 deletions(-)
create mode 100644 Documentation/core-api/pin_user_pages.rst
--
2.24.0
These changes are based on Jason's rdma/hmm branch (5.4.0-rc5).
Patch 1 was previously posted here [1] but was dropped from that orginal
series. Hopefully, the tests will reduce concerns about edge conditions.
I'm sure more tests could be usefully added but I thought this was a good
starting point.
Changes since v3:
patch 1:
Unchanged except rebased on Jason's latest hmm (bbe3329e354d3ab5dc18).
patch 2:
Is now part of Jason's tree.
patch 3 (now 2):
Major changes to incorporate Jason's review feedback.
* drivers/char/hmm_dmirror.c driver moved to lib/test_hmm.c
* XArray used instead of "page tables".
* platform device driver removed.
* remove redundant copyright.
Changes since v2:
patch 1:
Removed hmm_range_needs_fault() and just use hmm_range_need_fault().
Updated the change log to include that it fixes a bug where
hmm_range_fault() incorrectly returned an error when no fault is requested.
patch 2:
Removed the confusing change log wording about DMA.
Changed hmm_range_fault() to return the PFN of the zero page like any other
page.
patch 3:
Adjusted the test code to match the new zero page behavior.
Changes since v1:
Rebased to Jason's rdma/hmm branch (5.4.0-rc1).
Cleaned up locking for the test driver's page tables.
Incorporated Christoph Hellwig's comments.
[1] https://lore.kernel.org/linux-mm/20190726005650.2566-6-rcampbell@nvidia.com/
Ralph Campbell (2):
mm/hmm: make full use of walk_page_range()
mm/hmm/test: add self tests for HMM
MAINTAINERS | 3 +
include/uapi/linux/test_hmm.h | 59 ++
lib/Kconfig.debug | 11 +
lib/Makefile | 1 +
lib/test_hmm.c | 1306 ++++++++++++++++++++++++
mm/hmm.c | 121 ++-
tools/testing/selftests/vm/.gitignore | 1 +
tools/testing/selftests/vm/Makefile | 3 +
tools/testing/selftests/vm/config | 2 +
tools/testing/selftests/vm/hmm-tests.c | 1295 +++++++++++++++++++++++
tools/testing/selftests/vm/run_vmtests | 16 +
tools/testing/selftests/vm/test_hmm.sh | 97 ++
12 files changed, 2852 insertions(+), 63 deletions(-)
create mode 100644 include/uapi/linux/test_hmm.h
create mode 100644 lib/test_hmm.c
create mode 100644 tools/testing/selftests/vm/hmm-tests.c
create mode 100755 tools/testing/selftests/vm/test_hmm.sh
--
2.20.1
This patchset is being developed here:
<https://github.com/cyphar/linux/tree/openat2/master>
Patch changelog:
v17:
* Add a path_is_under() check for LOOKUP_IS_SCOPED in complete_walk(), as a
last line of defence to ensure that namei bugs will not break the contract
of LOOKUP_BENEATH or LOOKUP_IN_ROOT.
* Update based on feedback by Al Viro:
* Make nd_jump_link() free the passed path on error, so that callers don't
need to worry about it in the error path.
* Remove needless m_retry and r_retry variables in handle_dots().
* Always return -ECHILD from follow_dotdot_rcu().
v16: <https://lore.kernel.org/lkml/20191116002802.6663-1-cyphar@cyphar.com/>
v15: <https://lore.kernel.org/lkml/20191105090553.6350-1-cyphar@cyphar.com/>
v14: <https://lore.kernel.org/lkml/20191010054140.8483-1-cyphar@cyphar.com/>
<https://lore.kernel.org/lkml/20191026185700.10708-1-cyphar@cyphar.com>
v13: <https://lore.kernel.org/lkml/20190930183316.10190-1-cyphar@cyphar.com/>
v12: <https://lore.kernel.org/lkml/20190904201933.10736-1-cyphar@cyphar.com/>
v11: <https://lore.kernel.org/lkml/20190820033406.29796-1-cyphar@cyphar.com/>
<https://lore.kernel.org/lkml/20190728010207.9781-1-cyphar@cyphar.com/>
v10: <https://lore.kernel.org/lkml/20190719164225.27083-1-cyphar@cyphar.com/>
v09: <https://lore.kernel.org/lkml/20190706145737.5299-1-cyphar@cyphar.com/>
v08: <https://lore.kernel.org/lkml/20190520133305.11925-1-cyphar@cyphar.com/>
v07: <https://lore.kernel.org/lkml/20190507164317.13562-1-cyphar@cyphar.com/>
v06: <https://lore.kernel.org/lkml/20190506165439.9155-1-cyphar@cyphar.com/>
v05: <https://lore.kernel.org/lkml/20190320143717.2523-1-cyphar@cyphar.com/>
v04: <https://lore.kernel.org/lkml/20181112142654.341-1-cyphar@cyphar.com/>
v03: <https://lore.kernel.org/lkml/20181009070230.12884-1-cyphar@cyphar.com/>
v02: <https://lore.kernel.org/lkml/20181009065300.11053-1-cyphar@cyphar.com/>
v01: <https://lore.kernel.org/lkml/20180929103453.12025-1-cyphar@cyphar.com/>
For a very long time, extending openat(2) with new features has been
incredibly frustrating. This stems from the fact that openat(2) is
possibly the most famous counter-example to the mantra "don't silently
accept garbage from userspace" -- it doesn't check whether unknown flags
are present[1].
This means that (generally) the addition of new flags to openat(2) has
been fraught with backwards-compatibility issues (O_TMPFILE has to be
defined as __O_TMPFILE|O_DIRECTORY|[O_RDWR or O_WRONLY] to ensure old
kernels gave errors, since it's insecure to silently ignore the
flag[2]). All new security-related flags therefore have a tough road to
being added to openat(2).
Furthermore, the need for some sort of control over VFS's path resolution (to
avoid malicious paths resulting in inadvertent breakouts) has been a very
long-standing desire of many userspace applications. This patchset is a revival
of Al Viro's old AT_NO_JUMPS[3] patchset (which was a variant of David
Drysdale's O_BENEATH patchset[4] which was a spin-off of the Capsicum
project[5]) with a few additions and changes made based on the previous
discussion within [6] as well as others I felt were useful.
In line with the conclusions of the original discussion of AT_NO_JUMPS, the
flag has been split up into separate flags. However, instead of being an
openat(2) flag it is provided through a new syscall openat2(2) which provides
several other improvements to the openat(2) interface (see the patch
description for more details). The following new LOOKUP_* flags are added:
* LOOKUP_NO_XDEV blocks all mountpoint crossings (upwards, downwards,
or through absolute links). Absolute pathnames alone in openat(2) do not
trigger this. Magic-link traversal which implies a vfsmount jump is also
blocked (though magic-link jumps on the same vfsmount are permitted).
* LOOKUP_NO_MAGICLINKS blocks resolution through /proc/$pid/fd-style
links. This is done by blocking the usage of nd_jump_link() during
resolution in a filesystem. The term "magic-links" is used to match
with the only reference to these links in Documentation/, but I'm
happy to change the name.
It should be noted that this is different to the scope of
~LOOKUP_FOLLOW in that it applies to all path components. However,
you can do openat2(NO_FOLLOW|NO_MAGICLINKS) on a magic-link and it
will *not* fail (assuming that no parent component was a
magic-link), and you will have an fd for the magic-link.
In order to correctly detect magic-links, the introduction of a new
LOOKUP_MAGICLINK_JUMPED state flag was required.
* LOOKUP_BENEATH disallows escapes to outside the starting dirfd's
tree, using techniques such as ".." or absolute links. Absolute
paths in openat(2) are also disallowed. Conceptually this flag is to
ensure you "stay below" a certain point in the filesystem tree --
but this requires some additional to protect against various races
that would allow escape using "..".
Currently LOOKUP_BENEATH implies LOOKUP_NO_MAGICLINKS, because it
can trivially beam you around the filesystem (breaking the
protection). In future, there might be similar safety checks done as
in LOOKUP_IN_ROOT, but that requires more discussion.
In addition, two new flags are added that expand on the above ideas:
* LOOKUP_NO_SYMLINKS does what it says on the tin. No symlink
resolution is allowed at all, including magic-links. Just as with
LOOKUP_NO_MAGICLINKS this can still be used with NOFOLLOW to open an
fd for the symlink as long as no parent path had a symlink
component.
* LOOKUP_IN_ROOT is an extension of LOOKUP_BENEATH that, rather than
blocking attempts to move past the root, forces all such movements
to be scoped to the starting point. This provides chroot(2)-like
protection but without the cost of a chroot(2) for each filesystem
operation, as well as being safe against race attacks that chroot(2)
is not.
If a race is detected (as with LOOKUP_BENEATH) then an error is
generated, and similar to LOOKUP_BENEATH it is not permitted to cross
magic-links with LOOKUP_IN_ROOT.
The primary need for this is from container runtimes, which
currently need to do symlink scoping in userspace[7] when opening
paths in a potentially malicious container. There is a long list of
CVEs that could have bene mitigated by having RESOLVE_THIS_ROOT
(such as CVE-2017-1002101, CVE-2017-1002102, CVE-2018-15664, and
CVE-2019-5736, just to name a few).
In order to make all of the above more usable, I'm working on
libpathrs[8] which is a C-friendly library for safe path resolution. It
features a userspace-emulated backend if the kernel doesn't support
openat2(2). Hopefully we can get userspace to switch to using it, and
thus get openat2(2) support for free once it's ready.
Future work would include implementing things like RESOLVE_NO_AUTOMOUNT and
possibly a RESOLVE_NO_REMOTE (to allow programs to be sure they don't hit DoSes
though stale NFS handles).
[1]: https://lwn.net/Articles/588444/
[2]: https://lore.kernel.org/lkml/CA+55aFyyxJL1LyXZeBsf2ypriraj5ut1XkNDsunRBqgVj…
[3]: https://lore.kernel.org/lkml/20170429220414.GT29622@ZenIV.linux.org.uk
[4]: https://lore.kernel.org/lkml/1415094884-18349-1-git-send-email-drysdale@goo…
[5]: https://lore.kernel.org/lkml/1404124096-21445-1-git-send-email-drysdale@goo…
[6]: https://lwn.net/Articles/723057/
[7]: https://github.com/cyphar/filepath-securejoin
[8]: https://github.com/openSUSE/libpathrs
The current draft of the openat2(2) man-page is included below.
--8<---------------------------------------------------------------------------
OPENAT2(2) Linux Programmer's Manual OPENAT2(2)
NAME
openat2 - open and possibly create a file (extended)
SYNOPSIS
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
int openat2(int dirfd, const char *pathname, struct open_how *how, size_t size);
Note: There is no glibc wrapper for this system call; see NOTES.
DESCRIPTION
The openat2() system call opens the file specified by pathname. If the specified file
does not exist, it may optionally (if O_CREAT is specified in how.flags) be created by
openat2().
As with openat(2), if pathname is relative, then it is interpreted relative to the direc-
tory referred to by the file descriptor dirfd (or the current working directory of the
calling process, if dirfd is the special value AT_FDCWD.) If pathname is absolute, then
dirfd is ignored (unless how.resolve contains RESOLVE_IN_ROOT, in which case pathname is
resolved relative to dirfd.)
The openat2() system call is an extension of openat(2) and provides a superset of its
functionality. Rather than taking a single flag argument, an extensible structure (how)
is passed instead to allow for future extensions. size must be set to sizeof(struct
open_how), to facilitate future extensions (see the "Extensibility" section of the NOTES
for more detail on how extensions are handled.)
The open_how structure
The following structure indicates how pathname should be opened, and acts as a superset of
the flag and mode arguments to openat(2).
struct open_how {
__aligned_u64 flags; /* O_* flags. */
__u16 mode; /* Mode for O_{CREAT,TMPFILE}. */
__u16 __padding[3]; /* Must be zeroed. */
__aligned_u64 resolve; /* RESOLVE_* flags. */
};
Any future extensions to openat2() will be implemented as new fields appended to the above
structure (or through reuse of pre-existing padding space), with the zero value of the new
fields acting as though the extension were not present.
The meaning of each field is as follows:
flags
The file creation and status flags to use for this operation. All of the
O_* flags defined for openat(2) are valid openat2() flag values.
Unlike openat(2), it is an error to provide openat2() unknown or conflicting
flags in flags.
mode
File mode for the new file, with identical semantics to the mode argument to
openat(2). However, unlike openat(2), it is an error to provide openat2()
with a mode which contains bits other than 0777.
It is an error to provide openat2() a non-zero mode if flags does not con-
tain O_CREAT or O_TMPFILE.
resolve
Change how the components of pathname will be resolved (see path_resolu-
tion(7) for background information.) The primary use case for these flags
is to allow trusted programs to restrict how untrusted paths (or paths in-
side untrusted directories) are resolved. The full list of resolve flags is
given below.
RESOLVE_NO_XDEV
Disallow traversal of mount points during path resolution (including
all bind mounts).
Users of this flag are encouraged to make its use configurable (un-
less it is used for a specific security purpose), as bind mounts are
very widely used by end-users. Setting this flag indiscrimnately for
all uses of openat2() may result in spurious errors on previously-
functional systems.
RESOLVE_NO_SYMLINKS
Disallow resolution of symbolic links during path resolution. This
option implies RESOLVE_NO_MAGICLINKS.
If the trailing component is a symbolic link, and flags contains both
O_PATH and O_NOFOLLOW, then an O_PATH file descriptor referencing the
symbolic link will be returned.
Users of this flag are encouraged to make its use configurable (un-
less it is used for a specific security purpose), as symbolic links
are very widely used by end-users. Setting this flag indiscrimnately
for all uses of openat2() may result in spurious errors on previ-
ously-functional systems.
RESOLVE_NO_MAGICLINKS
Disallow all magic link resolution during path resolution.
If the trailing component is a magic link, and flags contains both
O_PATH and O_NOFOLLOW, then an O_PATH file descriptor referencing the
magic link will be returned.
Magic-links are symbolic link-like objects that are most notably
found in proc(5) (examples include /proc/[pid]/exe and
/proc/[pid]/fd/*.) Due to the potential danger of unknowingly open-
ing these magic links, it may be preferable for users to disable
their resolution entirely (see symboliclink(7) for more details.)
RESOLVE_BENEATH
Do not permit the path resolution to succeed if any component of the
resolution is not a descendant of the directory indicated by dirfd.
This results in absolute symbolic links (and absolute values of path-
name) to be rejected.
Currently, this flag also disables magic link resolution. However,
this may change in the future. The caller should explicitly specify
RESOLVE_NO_MAGICLINKS to ensure that magic links are not resolved.
RESOLVE_IN_ROOT
Treat dirfd as the root directory while resolving pathname (as though
the user called chroot(2) with dirfd as the argument.) Absolute sym-
bolic links and ".." path components will be scoped to dirfd. If
pathname is an absolute path, it is also treated relative to dirfd.
However, unlike chroot(2) (which changes the filesystem root perma-
nently for a process), RESOLVE_IN_ROOT allows a program to effi-
ciently restrict path resolution for only certain operations. It
also has several hardening features (such detecting escape attempts
during .. resolution) which chroot(2) does not.
Currently, this flag also disables magic link resolution. However,
this may change in the future. The caller should explicitly specify
RESOLVE_NO_MAGICLINKS to ensure that magic links are not resolved.
It is an error to provide openat2() unknown flags in resolve.
RETURN VALUE
On success, a new file descriptor is returned. On error, -1 is returned, and errno is set
appropriately.
ERRORS
The set of errors returned by openat2() includes all of the errors returned by openat(2),
as well as the following additional errors:
EINVAL An unknown flag or invalid value was specified in how.
EINVAL mode is non-zero, but flags does not contain O_CREAT or O_TMPFILE.
EINVAL size was smaller than any known version of struct open_how.
E2BIG An extension was specified in how, which the current kernel does not support (see
the "Extensibility" section of the NOTES for more detail on how extensions are han-
dled.)
EAGAIN resolve contains either RESOLVE_IN_ROOT or RESOLVE_BENEATH, and the kernel could
not ensure that a ".." component didn't escape (due to a race condition or poten-
tial attack.) Callers may choose to retry the openat2() call.
EXDEV resolve contains either RESOLVE_IN_ROOT or RESOLVE_BENEATH, and an escape from the
root during path resolution was detected.
EXDEV resolve contains RESOLVE_NO_XDEV, and a path component attempted to cross a mount
point.
ELOOP resolve contains RESOLVE_NO_SYMLINKS, and one of the path components was a symbolic
link (or magic link).
ELOOP resolve contains RESOLVE_NO_MAGICLINKS, and one of the path components was a magic
link.
VERSIONS
openat2() was added to Linux in kernel 5.FOO.
CONFORMING TO
This system call is Linux-specific.
The semantics of RESOLVE_BENEATH were modelled after FreeBSD's O_BENEATH.
NOTES
Glibc does not provide a wrapper for this system call; call it using systemcall(2).
Extensibility
In order to allow for struct open_how to be extended in future kernel revisions, openat2()
requires userspace to specify the size of struct open_how structure they are passing. By
providing this information, it is possible for openat2() to provide both forwards- and
backwards-compatibility — with size acting as an implicit version number (because new ex-
tension fields will always be appended, the size will always increase.) This extensibil-
ity design is very similar to other system calls such as perf_setattr(2),
perf_event_open(2), and clone(3).
If we let usize be the size of the structure according to userspace and ksize be the size
of the structure which the kernel supports, then there are only three cases to consider:
* If ksize equals usize, then there is no version mismatch and how can be used
verbatim.
* If ksize is larger than usize, then there are some extensions the kernel sup-
ports which the userspace program is unaware of. Because all extensions must
have their zero values be a no-op, the kernel treats all of the extension fields
not set by userspace to have zero values. This provides backwards-compatibil-
ity.
* If ksize is smaller than usize, then there are some extensions which the
userspace program is aware of but the kernel does not support. Because all ex-
tensions must have their zero values be a no-op, the kernel can safely ignore
the unsupported extension fields if they are all-zero. If any unsupported ex-
tension fields are non-zero, then -1 is returned and errno is set to E2BIG.
This provides forwards-compatibility.
Therefore, most userspace programs will not need to have any special handling of exten-
sions. However, if a userspace program wishes to determine what extensions the running
kernel supports, they may conduct a binary search on size (to find the largest value which
doesn't produce an error of E2BIG.)
SEE ALSO
openat(2), path_resolution(7), symlink(7)
Linux 2019-11-05 OPENAT2(2)
--8<---------------------------------------------------------------------------
Aleksa Sarai (13):
namei: only return -ECHILD from follow_dotdot_rcu()
nsfs: clean-up ns_get_path() signature to return int
namei: allow nd_jump_link() to produce errors
namei: allow set_root() to produce errors
namei: LOOKUP_NO_SYMLINKS: block symlink resolution
namei: LOOKUP_NO_MAGICLINKS: block magic-link resolution
namei: LOOKUP_NO_XDEV: block mountpoint crossing
namei: LOOKUP_BENEATH: O_BENEATH-like scoped resolution
namei: LOOKUP_IN_ROOT: chroot-like scoped resolution
namei: LOOKUP_{IN_ROOT,BENEATH}: permit limited ".." resolution
open: introduce openat2(2) syscall
selftests: add openat2(2) selftests
Documentation: path-lookup: include new LOOKUP flags
CREDITS | 4 +-
Documentation/filesystems/path-lookup.rst | 68 ++-
arch/alpha/kernel/syscalls/syscall.tbl | 1 +
arch/arm/tools/syscall.tbl | 1 +
arch/arm64/include/asm/unistd.h | 2 +-
arch/arm64/include/asm/unistd32.h | 2 +
arch/ia64/kernel/syscalls/syscall.tbl | 1 +
arch/m68k/kernel/syscalls/syscall.tbl | 1 +
arch/microblaze/kernel/syscalls/syscall.tbl | 1 +
arch/mips/kernel/syscalls/syscall_n32.tbl | 1 +
arch/mips/kernel/syscalls/syscall_n64.tbl | 1 +
arch/mips/kernel/syscalls/syscall_o32.tbl | 1 +
arch/parisc/kernel/syscalls/syscall.tbl | 1 +
arch/powerpc/kernel/syscalls/syscall.tbl | 1 +
arch/s390/kernel/syscalls/syscall.tbl | 1 +
arch/sh/kernel/syscalls/syscall.tbl | 1 +
arch/sparc/kernel/syscalls/syscall.tbl | 1 +
arch/x86/entry/syscalls/syscall_32.tbl | 1 +
arch/x86/entry/syscalls/syscall_64.tbl | 1 +
arch/xtensa/kernel/syscalls/syscall.tbl | 1 +
fs/namei.c | 185 +++++--
fs/nsfs.c | 29 +-
fs/open.c | 149 +++--
fs/proc/base.c | 3 +-
fs/proc/namespaces.c | 20 +-
include/linux/fcntl.h | 12 +-
include/linux/namei.h | 12 +-
include/linux/proc_ns.h | 4 +-
include/linux/syscalls.h | 3 +
include/uapi/asm-generic/unistd.h | 5 +-
include/uapi/linux/fcntl.h | 40 ++
kernel/bpf/offload.c | 12 +-
kernel/events/core.c | 2 +-
security/apparmor/apparmorfs.c | 6 +-
tools/testing/selftests/Makefile | 1 +
tools/testing/selftests/openat2/.gitignore | 1 +
tools/testing/selftests/openat2/Makefile | 8 +
tools/testing/selftests/openat2/helpers.c | 109 ++++
tools/testing/selftests/openat2/helpers.h | 107 ++++
.../testing/selftests/openat2/openat2_test.c | 316 +++++++++++
.../selftests/openat2/rename_attack_test.c | 160 ++++++
.../testing/selftests/openat2/resolve_test.c | 523 ++++++++++++++++++
42 files changed, 1686 insertions(+), 113 deletions(-)
create mode 100644 tools/testing/selftests/openat2/.gitignore
create mode 100644 tools/testing/selftests/openat2/Makefile
create mode 100644 tools/testing/selftests/openat2/helpers.c
create mode 100644 tools/testing/selftests/openat2/helpers.h
create mode 100644 tools/testing/selftests/openat2/openat2_test.c
create mode 100644 tools/testing/selftests/openat2/rename_attack_test.c
create mode 100644 tools/testing/selftests/openat2/resolve_test.c
base-commit: 31f4f5b495a62c9a8b15b1c3581acd5efeb9af8c
--
2.24.0
This patchset is being developed here:
<https://github.com/cyphar/linux/tree/openat2/master>
Patch changelog:
v16:
* Update based on review by Al Viro:
* Handle magic-link related errors from with nd_jump_link() and drop
LOOKUP_MAGICLINK_JUMPED.
* Drop the slash-skipping logic for LOOKUP_IN_ROOT since it's not actually
necessary (link_path_walk() already does it, and it's possible that doing
it slightly breaks downstream code).
* Update outdated open_how documentation to match new semantics.
* Update commit message to further explain why -EAGAIN is preferable for
path_is_under() when checking whether ".." is safe.
* Split out the set_root() and nd_jump_root() errors from the
LOOKUP_BENEATH patch.
* Expand path-lookup documentation changes to describe all new LOOKUP flags.
* Cleanup the signature of ns_get_path() such that it returns an int
(previously it returned a void * -- even though the only two
possible return values were ERR_PTR or NULL).
v15: <https://lore.kernel.org/lkml/20191105090553.6350-1-cyphar@cyphar.com/>
v14: <https://lore.kernel.org/lkml/20191010054140.8483-1-cyphar@cyphar.com/>
<https://lore.kernel.org/lkml/20191026185700.10708-1-cyphar@cyphar.com>
v13: <https://lore.kernel.org/lkml/20190930183316.10190-1-cyphar@cyphar.com/>
v12: <https://lore.kernel.org/lkml/20190904201933.10736-1-cyphar@cyphar.com/>
v11: <https://lore.kernel.org/lkml/20190820033406.29796-1-cyphar@cyphar.com/>
<https://lore.kernel.org/lkml/20190728010207.9781-1-cyphar@cyphar.com/>
v10: <https://lore.kernel.org/lkml/20190719164225.27083-1-cyphar@cyphar.com/>
v09: <https://lore.kernel.org/lkml/20190706145737.5299-1-cyphar@cyphar.com/>
v08: <https://lore.kernel.org/lkml/20190520133305.11925-1-cyphar@cyphar.com/>
v07: <https://lore.kernel.org/lkml/20190507164317.13562-1-cyphar@cyphar.com/>
v06: <https://lore.kernel.org/lkml/20190506165439.9155-1-cyphar@cyphar.com/>
v05: <https://lore.kernel.org/lkml/20190320143717.2523-1-cyphar@cyphar.com/>
v04: <https://lore.kernel.org/lkml/20181112142654.341-1-cyphar@cyphar.com/>
v03: <https://lore.kernel.org/lkml/20181009070230.12884-1-cyphar@cyphar.com/>
v02: <https://lore.kernel.org/lkml/20181009065300.11053-1-cyphar@cyphar.com/>
v01: <https://lore.kernel.org/lkml/20180929103453.12025-1-cyphar@cyphar.com/>
For a very long time, extending openat(2) with new features has been
incredibly frustrating. This stems from the fact that openat(2) is
possibly the most famous counter-example to the mantra "don't silently
accept garbage from userspace" -- it doesn't check whether unknown flags
are present[1].
This means that (generally) the addition of new flags to openat(2) has
been fraught with backwards-compatibility issues (O_TMPFILE has to be
defined as __O_TMPFILE|O_DIRECTORY|[O_RDWR or O_WRONLY] to ensure old
kernels gave errors, since it's insecure to silently ignore the
flag[2]). All new security-related flags therefore have a tough road to
being added to openat(2).
Furthermore, the need for some sort of control over VFS's path resolution (to
avoid malicious paths resulting in inadvertent breakouts) has been a very
long-standing desire of many userspace applications. This patchset is a revival
of Al Viro's old AT_NO_JUMPS[3] patchset (which was a variant of David
Drysdale's O_BENEATH patchset[4] which was a spin-off of the Capsicum
project[5]) with a few additions and changes made based on the previous
discussion within [6] as well as others I felt were useful.
In line with the conclusions of the original discussion of AT_NO_JUMPS, the
flag has been split up into separate flags. However, instead of being an
openat(2) flag it is provided through a new syscall openat2(2) which provides
several other improvements to the openat(2) interface (see the patch
description for more details). The following new LOOKUP_* flags are added:
* LOOKUP_NO_XDEV blocks all mountpoint crossings (upwards, downwards,
or through absolute links). Absolute pathnames alone in openat(2) do not
trigger this. Magic-link traversal which implies a vfsmount jump is also
blocked (though magic-link jumps on the same vfsmount are permitted).
* LOOKUP_NO_MAGICLINKS blocks resolution through /proc/$pid/fd-style
links. This is done by blocking the usage of nd_jump_link() during
resolution in a filesystem. The term "magic-links" is used to match
with the only reference to these links in Documentation/, but I'm
happy to change the name.
It should be noted that this is different to the scope of
~LOOKUP_FOLLOW in that it applies to all path components. However,
you can do openat2(NO_FOLLOW|NO_MAGICLINKS) on a magic-link and it
will *not* fail (assuming that no parent component was a
magic-link), and you will have an fd for the magic-link.
In order to correctly detect magic-links, the introduction of a new
LOOKUP_MAGICLINK_JUMPED state flag was required.
* LOOKUP_BENEATH disallows escapes to outside the starting dirfd's
tree, using techniques such as ".." or absolute links. Absolute
paths in openat(2) are also disallowed. Conceptually this flag is to
ensure you "stay below" a certain point in the filesystem tree --
but this requires some additional to protect against various races
that would allow escape using "..".
Currently LOOKUP_BENEATH implies LOOKUP_NO_MAGICLINKS, because it
can trivially beam you around the filesystem (breaking the
protection). In future, there might be similar safety checks done as
in LOOKUP_IN_ROOT, but that requires more discussion.
In addition, two new flags are added that expand on the above ideas:
* LOOKUP_NO_SYMLINKS does what it says on the tin. No symlink
resolution is allowed at all, including magic-links. Just as with
LOOKUP_NO_MAGICLINKS this can still be used with NOFOLLOW to open an
fd for the symlink as long as no parent path had a symlink
component.
* LOOKUP_IN_ROOT is an extension of LOOKUP_BENEATH that, rather than
blocking attempts to move past the root, forces all such movements
to be scoped to the starting point. This provides chroot(2)-like
protection but without the cost of a chroot(2) for each filesystem
operation, as well as being safe against race attacks that chroot(2)
is not.
If a race is detected (as with LOOKUP_BENEATH) then an error is
generated, and similar to LOOKUP_BENEATH it is not permitted to cross
magic-links with LOOKUP_IN_ROOT.
The primary need for this is from container runtimes, which
currently need to do symlink scoping in userspace[7] when opening
paths in a potentially malicious container. There is a long list of
CVEs that could have bene mitigated by having RESOLVE_THIS_ROOT
(such as CVE-2017-1002101, CVE-2017-1002102, CVE-2018-15664, and
CVE-2019-5736, just to name a few).
In order to make all of the above more usable, I'm working on
libpathrs[8] which is a C-friendly library for safe path resolution. It
features a userspace-emulated backend if the kernel doesn't support
openat2(2). Hopefully we can get userspace to switch to using it, and
thus get openat2(2) support for free once it's ready.
Future work would include implementing things like RESOLVE_NO_AUTOMOUNT and
possibly a RESOLVE_NO_REMOTE (to allow programs to be sure they don't hit DoSes
though stale NFS handles).
[1]: https://lwn.net/Articles/588444/
[2]: https://lore.kernel.org/lkml/CA+55aFyyxJL1LyXZeBsf2ypriraj5ut1XkNDsunRBqgVj…
[3]: https://lore.kernel.org/lkml/20170429220414.GT29622@ZenIV.linux.org.uk
[4]: https://lore.kernel.org/lkml/1415094884-18349-1-git-send-email-drysdale@goo…
[5]: https://lore.kernel.org/lkml/1404124096-21445-1-git-send-email-drysdale@goo…
[6]: https://lwn.net/Articles/723057/
[7]: https://github.com/cyphar/filepath-securejoin
[8]: https://github.com/openSUSE/libpathrs
The current draft of the openat2(2) man-page is included below.
--8<---------------------------------------------------------------------------
OPENAT2(2) Linux Programmer's Manual OPENAT2(2)
NAME
openat2 - open and possibly create a file (extended)
SYNOPSIS
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
int openat2(int dirfd, const char *pathname, struct open_how *how, size_t size);
Note: There is no glibc wrapper for this system call; see NOTES.
DESCRIPTION
The openat2() system call opens the file specified by pathname. If the specified file
does not exist, it may optionally (if O_CREAT is specified in how.flags) be created by
openat2().
As with openat(2), if pathname is relative, then it is interpreted relative to the direc-
tory referred to by the file descriptor dirfd (or the current working directory of the
calling process, if dirfd is the special value AT_FDCWD.) If pathname is absolute, then
dirfd is ignored (unless how.resolve contains RESOLVE_IN_ROOT, in which case pathname is
resolved relative to dirfd.)
The openat2() system call is an extension of openat(2) and provides a superset of its
functionality. Rather than taking a single flag argument, an extensible structure (how)
is passed instead to allow for future extensions. size must be set to sizeof(struct
open_how), to facilitate future extensions (see the "Extensibility" section of the NOTES
for more detail on how extensions are handled.)
The open_how structure
The following structure indicates how pathname should be opened, and acts as a superset of
the flag and mode arguments to openat(2).
struct open_how {
__aligned_u64 flags; /* O_* flags. */
__u16 mode; /* Mode for O_{CREAT,TMPFILE}. */
__u16 __padding[3]; /* Must be zeroed. */
__aligned_u64 resolve; /* RESOLVE_* flags. */
};
Any future extensions to openat2() will be implemented as new fields appended to the above
structure (or through reuse of pre-existing padding space), with the zero value of the new
fields acting as though the extension were not present.
The meaning of each field is as follows:
flags
The file creation and status flags to use for this operation. All of the
O_* flags defined for openat(2) are valid openat2() flag values.
Unlike openat(2), it is an error to provide openat2() unknown or conflicting
flags in flags.
mode
File mode for the new file, with identical semantics to the mode argument to
openat(2). However, unlike openat(2), it is an error to provide openat2()
with a mode which contains bits other than 0777.
It is an error to provide openat2() a non-zero mode if flags does not con-
tain O_CREAT or O_TMPFILE.
resolve
Change how the components of pathname will be resolved (see path_resolu-
tion(7) for background information.) The primary use case for these flags
is to allow trusted programs to restrict how untrusted paths (or paths in-
side untrusted directories) are resolved. The full list of resolve flags is
given below.
RESOLVE_NO_XDEV
Disallow traversal of mount points during path resolution (including
all bind mounts).
Users of this flag are encouraged to make its use configurable (un-
less it is used for a specific security purpose), as bind mounts are
very widely used by end-users. Setting this flag indiscrimnately for
all uses of openat2() may result in spurious errors on previously-
functional systems.
RESOLVE_NO_SYMLINKS
Disallow resolution of symbolic links during path resolution. This
option implies RESOLVE_NO_MAGICLINKS.
If the trailing component is a symbolic link, and flags contains both
O_PATH and O_NOFOLLOW, then an O_PATH file descriptor referencing the
symbolic link will be returned.
Users of this flag are encouraged to make its use configurable (un-
less it is used for a specific security purpose), as symbolic links
are very widely used by end-users. Setting this flag indiscrimnately
for all uses of openat2() may result in spurious errors on previ-
ously-functional systems.
RESOLVE_NO_MAGICLINKS
Disallow all magic link resolution during path resolution.
If the trailing component is a magic link, and flags contains both
O_PATH and O_NOFOLLOW, then an O_PATH file descriptor referencing the
magic link will be returned.
Magic-links are symbolic link-like objects that are most notably
found in proc(5) (examples include /proc/[pid]/exe and
/proc/[pid]/fd/*.) Due to the potential danger of unknowingly open-
ing these magic links, it may be preferable for users to disable
their resolution entirely (see symboliclink(7) for more details.)
RESOLVE_BENEATH
Do not permit the path resolution to succeed if any component of the
resolution is not a descendant of the directory indicated by dirfd.
This results in absolute symbolic links (and absolute values of path-
name) to be rejected.
Currently, this flag also disables magic link resolution. However,
this may change in the future. The caller should explicitly specify
RESOLVE_NO_MAGICLINKS to ensure that magic links are not resolved.
RESOLVE_IN_ROOT
Treat dirfd as the root directory while resolving pathname (as though
the user called chroot(2) with dirfd as the argument.) Absolute sym-
bolic links and ".." path components will be scoped to dirfd. If
pathname is an absolute path, it is also treated relative to dirfd.
However, unlike chroot(2) (which changes the filesystem root perma-
nently for a process), RESOLVE_IN_ROOT allows a program to effi-
ciently restrict path resolution for only certain operations. It
also has several hardening features (such detecting escape attempts
during .. resolution) which chroot(2) does not.
Currently, this flag also disables magic link resolution. However,
this may change in the future. The caller should explicitly specify
RESOLVE_NO_MAGICLINKS to ensure that magic links are not resolved.
It is an error to provide openat2() unknown flags in resolve.
RETURN VALUE
On success, a new file descriptor is returned. On error, -1 is returned, and errno is set
appropriately.
ERRORS
The set of errors returned by openat2() includes all of the errors returned by openat(2),
as well as the following additional errors:
EINVAL An unknown flag or invalid value was specified in how.
EINVAL mode is non-zero, but flags does not contain O_CREAT or O_TMPFILE.
EINVAL size was smaller than any known version of struct open_how.
E2BIG An extension was specified in how, which the current kernel does not support (see
the "Extensibility" section of the NOTES for more detail on how extensions are han-
dled.)
EAGAIN resolve contains either RESOLVE_IN_ROOT or RESOLVE_BENEATH, and the kernel could
not ensure that a ".." component didn't escape (due to a race condition or poten-
tial attack.) Callers may choose to retry the openat2() call.
EXDEV resolve contains either RESOLVE_IN_ROOT or RESOLVE_BENEATH, and an escape from the
root during path resolution was detected.
EXDEV resolve contains RESOLVE_NO_XDEV, and a path component attempted to cross a mount
point.
ELOOP resolve contains RESOLVE_NO_SYMLINKS, and one of the path components was a symbolic
link (or magic link).
ELOOP resolve contains RESOLVE_NO_MAGICLINKS, and one of the path components was a magic
link.
VERSIONS
openat2() was added to Linux in kernel 5.FOO.
CONFORMING TO
This system call is Linux-specific.
The semantics of RESOLVE_BENEATH were modelled after FreeBSD's O_BENEATH.
NOTES
Glibc does not provide a wrapper for this system call; call it using systemcall(2).
Extensibility
In order to allow for struct open_how to be extended in future kernel revisions, openat2()
requires userspace to specify the size of struct open_how structure they are passing. By
providing this information, it is possible for openat2() to provide both forwards- and
backwards-compatibility — with size acting as an implicit version number (because new ex-
tension fields will always be appended, the size will always increase.) This extensibil-
ity design is very similar to other system calls such as perf_setattr(2),
perf_event_open(2), and clone(3).
If we let usize be the size of the structure according to userspace and ksize be the size
of the structure which the kernel supports, then there are only three cases to consider:
* If ksize equals usize, then there is no version mismatch and how can be used
verbatim.
* If ksize is larger than usize, then there are some extensions the kernel sup-
ports which the userspace program is unaware of. Because all extensions must
have their zero values be a no-op, the kernel treats all of the extension fields
not set by userspace to have zero values. This provides backwards-compatibil-
ity.
* If ksize is smaller than usize, then there are some extensions which the
userspace program is aware of but the kernel does not support. Because all ex-
tensions must have their zero values be a no-op, the kernel can safely ignore
the unsupported extension fields if they are all-zero. If any unsupported ex-
tension fields are non-zero, then -1 is returned and errno is set to E2BIG.
This provides forwards-compatibility.
Therefore, most userspace programs will not need to have any special handling of exten-
sions. However, if a userspace program wishes to determine what extensions the running
kernel supports, they may conduct a binary search on size (to find the largest value which
doesn't produce an error of E2BIG.)
SEE ALSO
openat(2), path_resolution(7), symlink(7)
Linux 2019-11-05 OPENAT2(2)
--8<---------------------------------------------------------------------------
Aleksa Sarai (12):
nsfs: clean-up ns_get_path() signature to return int
namei: allow nd_jump_link() to produce errors
namei: allow set_root() to produce errors
namei: LOOKUP_NO_SYMLINKS: block symlink resolution
namei: LOOKUP_NO_MAGICLINKS: block magic-link resolution
namei: LOOKUP_NO_XDEV: block mountpoint crossing
namei: LOOKUP_BENEATH: O_BENEATH-like scoped resolution
namei: LOOKUP_IN_ROOT: chroot-like scoped resolution
namei: LOOKUP_{IN_ROOT,BENEATH}: permit limited ".." resolution
open: introduce openat2(2) syscall
selftests: add openat2(2) selftests
Documentation: path-lookup: include new LOOKUP flags
CREDITS | 4 +-
Documentation/filesystems/path-lookup.rst | 68 ++-
arch/alpha/kernel/syscalls/syscall.tbl | 1 +
arch/arm/tools/syscall.tbl | 1 +
arch/arm64/include/asm/unistd.h | 2 +-
arch/arm64/include/asm/unistd32.h | 2 +
arch/ia64/kernel/syscalls/syscall.tbl | 1 +
arch/m68k/kernel/syscalls/syscall.tbl | 1 +
arch/microblaze/kernel/syscalls/syscall.tbl | 1 +
arch/mips/kernel/syscalls/syscall_n32.tbl | 1 +
arch/mips/kernel/syscalls/syscall_n64.tbl | 1 +
arch/mips/kernel/syscalls/syscall_o32.tbl | 1 +
arch/parisc/kernel/syscalls/syscall.tbl | 1 +
arch/powerpc/kernel/syscalls/syscall.tbl | 1 +
arch/s390/kernel/syscalls/syscall.tbl | 1 +
arch/sh/kernel/syscalls/syscall.tbl | 1 +
arch/sparc/kernel/syscalls/syscall.tbl | 1 +
arch/x86/entry/syscalls/syscall_32.tbl | 1 +
arch/x86/entry/syscalls/syscall_64.tbl | 1 +
arch/xtensa/kernel/syscalls/syscall.tbl | 1 +
fs/namei.c | 166 ++++--
fs/nsfs.c | 29 +-
fs/open.c | 149 +++--
fs/proc/base.c | 5 +-
fs/proc/namespaces.c | 23 +-
include/linux/fcntl.h | 12 +-
include/linux/namei.h | 12 +-
include/linux/proc_ns.h | 4 +-
include/linux/syscalls.h | 3 +
include/uapi/asm-generic/unistd.h | 5 +-
include/uapi/linux/fcntl.h | 40 ++
kernel/bpf/offload.c | 12 +-
kernel/events/core.c | 2 +-
security/apparmor/apparmorfs.c | 8 +-
tools/testing/selftests/Makefile | 1 +
tools/testing/selftests/openat2/.gitignore | 1 +
tools/testing/selftests/openat2/Makefile | 8 +
tools/testing/selftests/openat2/helpers.c | 109 ++++
tools/testing/selftests/openat2/helpers.h | 107 ++++
.../testing/selftests/openat2/openat2_test.c | 316 +++++++++++
.../selftests/openat2/rename_attack_test.c | 160 ++++++
.../testing/selftests/openat2/resolve_test.c | 523 ++++++++++++++++++
42 files changed, 1675 insertions(+), 112 deletions(-)
create mode 100644 tools/testing/selftests/openat2/.gitignore
create mode 100644 tools/testing/selftests/openat2/Makefile
create mode 100644 tools/testing/selftests/openat2/helpers.c
create mode 100644 tools/testing/selftests/openat2/helpers.h
create mode 100644 tools/testing/selftests/openat2/openat2_test.c
create mode 100644 tools/testing/selftests/openat2/rename_attack_test.c
create mode 100644 tools/testing/selftests/openat2/resolve_test.c
base-commit: 31f4f5b495a62c9a8b15b1c3581acd5efeb9af8c
--
2.24.0
From: Michael Ellerman <mpe(a)ellerman.id.au>
[ Upstream commit 69f8117f17b332a68cd8f4bf8c2d0d3d5b84efc5 ]
Use TEST_GEN_PROGS and don't redefine all, this makes the out-of-tree
build work. We need to move the extra dependencies below the include
of lib.mk, because it adds the $(OUTPUT) prefix if it's defined.
We can also drop the clean rule, lib.mk does it for us.
Signed-off-by: Michael Ellerman <mpe(a)ellerman.id.au>
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
---
tools/testing/selftests/powerpc/cache_shape/Makefile | 9 ++-------
1 file changed, 2 insertions(+), 7 deletions(-)
diff --git a/tools/testing/selftests/powerpc/cache_shape/Makefile b/tools/testing/selftests/powerpc/cache_shape/Makefile
index 1be547434a49c..7e0c175b82978 100644
--- a/tools/testing/selftests/powerpc/cache_shape/Makefile
+++ b/tools/testing/selftests/powerpc/cache_shape/Makefile
@@ -1,11 +1,6 @@
# SPDX-License-Identifier: GPL-2.0
-TEST_PROGS := cache_shape
-
-all: $(TEST_PROGS)
-
-$(TEST_PROGS): ../harness.c ../utils.c
+TEST_GEN_PROGS := cache_shape
include ../../lib.mk
-clean:
- rm -f $(TEST_PROGS) *.o
+$(TEST_GEN_PROGS): ../harness.c ../utils.c
--
2.20.1
From: Joel Stanley <joel(a)jms.id.au>
[ Upstream commit 27825349d7b238533a47e3d98b8bb0efd886b752 ]
We should use TEST_GEN_PROGS, not TEST_PROGS. That tells the selftests
makefile (lib.mk) that those tests are generated (built), and so it
adds the $(OUTPUT) prefix for us, making the out-of-tree build work
correctly.
It also means we don't need our own clean rule, lib.mk does it.
We also have to update the signal_tm rule to use $(OUTPUT).
Signed-off-by: Joel Stanley <joel(a)jms.id.au>
Signed-off-by: Michael Ellerman <mpe(a)ellerman.id.au>
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
---
tools/testing/selftests/powerpc/signal/Makefile | 11 +++--------
1 file changed, 3 insertions(+), 8 deletions(-)
diff --git a/tools/testing/selftests/powerpc/signal/Makefile b/tools/testing/selftests/powerpc/signal/Makefile
index a7cbd5082e271..4213978f3ee2c 100644
--- a/tools/testing/selftests/powerpc/signal/Makefile
+++ b/tools/testing/selftests/powerpc/signal/Makefile
@@ -1,14 +1,9 @@
# SPDX-License-Identifier: GPL-2.0
-TEST_PROGS := signal signal_tm
-
-all: $(TEST_PROGS)
-
-$(TEST_PROGS): ../harness.c ../utils.c signal.S
+TEST_GEN_PROGS := signal signal_tm
CFLAGS += -maltivec
-signal_tm: CFLAGS += -mhtm
+$(OUTPUT)/signal_tm: CFLAGS += -mhtm
include ../../lib.mk
-clean:
- rm -f $(TEST_PROGS) *.o
+$(TEST_GEN_PROGS): ../harness.c ../utils.c signal.S
--
2.20.1
From: "Shuah Khan (Samsung OSG)" <shuah(a)kernel.org>
[ Upstream commit 9a244229a4b850b11952a0df79607c69b18fd8df ]
When /dev/watchdog open fails, watchdog exits with "watchdog not enabled"
message. This is incorrect when open fails due to insufficient privilege.
Fix message to clearly state the reason when open fails with EACCESS when
a non-root user runs it.
Signed-off-by: Shuah Khan (Samsung OSG) <shuah(a)kernel.org>
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
---
tools/testing/selftests/watchdog/watchdog-test.c | 8 +++++++-
1 file changed, 7 insertions(+), 1 deletion(-)
diff --git a/tools/testing/selftests/watchdog/watchdog-test.c b/tools/testing/selftests/watchdog/watchdog-test.c
index 6e290874b70e2..e029e2017280f 100644
--- a/tools/testing/selftests/watchdog/watchdog-test.c
+++ b/tools/testing/selftests/watchdog/watchdog-test.c
@@ -89,7 +89,13 @@ int main(int argc, char *argv[])
fd = open("/dev/watchdog", O_WRONLY);
if (fd == -1) {
- printf("Watchdog device not enabled.\n");
+ if (errno == ENOENT)
+ printf("Watchdog device not enabled.\n");
+ else if (errno == EACCES)
+ printf("Run watchdog as root.\n");
+ else
+ printf("Watchdog device open failed %s\n",
+ strerror(errno));
exit(-1);
}
--
2.20.1
From: Michael Ellerman <mpe(a)ellerman.id.au>
[ Upstream commit 69f8117f17b332a68cd8f4bf8c2d0d3d5b84efc5 ]
Use TEST_GEN_PROGS and don't redefine all, this makes the out-of-tree
build work. We need to move the extra dependencies below the include
of lib.mk, because it adds the $(OUTPUT) prefix if it's defined.
We can also drop the clean rule, lib.mk does it for us.
Signed-off-by: Michael Ellerman <mpe(a)ellerman.id.au>
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
---
tools/testing/selftests/powerpc/cache_shape/Makefile | 9 ++-------
1 file changed, 2 insertions(+), 7 deletions(-)
diff --git a/tools/testing/selftests/powerpc/cache_shape/Makefile b/tools/testing/selftests/powerpc/cache_shape/Makefile
index ede4d3dae7505..689f6c8ebcd8d 100644
--- a/tools/testing/selftests/powerpc/cache_shape/Makefile
+++ b/tools/testing/selftests/powerpc/cache_shape/Makefile
@@ -1,12 +1,7 @@
# SPDX-License-Identifier: GPL-2.0
-TEST_PROGS := cache_shape
-
-all: $(TEST_PROGS)
-
-$(TEST_PROGS): ../harness.c ../utils.c
+TEST_GEN_PROGS := cache_shape
top_srcdir = ../../../../..
include ../../lib.mk
-clean:
- rm -f $(TEST_PROGS) *.o
+$(TEST_GEN_PROGS): ../harness.c ../utils.c
--
2.20.1
From: Joel Stanley <joel(a)jms.id.au>
[ Upstream commit 27825349d7b238533a47e3d98b8bb0efd886b752 ]
We should use TEST_GEN_PROGS, not TEST_PROGS. That tells the selftests
makefile (lib.mk) that those tests are generated (built), and so it
adds the $(OUTPUT) prefix for us, making the out-of-tree build work
correctly.
It also means we don't need our own clean rule, lib.mk does it.
We also have to update the signal_tm rule to use $(OUTPUT).
Signed-off-by: Joel Stanley <joel(a)jms.id.au>
Signed-off-by: Michael Ellerman <mpe(a)ellerman.id.au>
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
---
tools/testing/selftests/powerpc/signal/Makefile | 11 +++--------
1 file changed, 3 insertions(+), 8 deletions(-)
diff --git a/tools/testing/selftests/powerpc/signal/Makefile b/tools/testing/selftests/powerpc/signal/Makefile
index 1fca25c6ace06..209a958dca127 100644
--- a/tools/testing/selftests/powerpc/signal/Makefile
+++ b/tools/testing/selftests/powerpc/signal/Makefile
@@ -1,15 +1,10 @@
# SPDX-License-Identifier: GPL-2.0
-TEST_PROGS := signal signal_tm
-
-all: $(TEST_PROGS)
-
-$(TEST_PROGS): ../harness.c ../utils.c signal.S
+TEST_GEN_PROGS := signal signal_tm
CFLAGS += -maltivec
-signal_tm: CFLAGS += -mhtm
+$(OUTPUT)/signal_tm: CFLAGS += -mhtm
top_srcdir = ../../../../..
include ../../lib.mk
-clean:
- rm -f $(TEST_PROGS) *.o
+$(TEST_GEN_PROGS): ../harness.c ../utils.c signal.S
--
2.20.1
From: Joel Stanley <joel(a)jms.id.au>
[ Upstream commit c39b79082a38a4f8c801790edecbbb4d62ed2992 ]
We should use TEST_GEN_PROGS, not TEST_PROGS. That tells the selftests
makefile (lib.mk) that those tests are generated (built), and so it
adds the $(OUTPUT) prefix for us, making the out-of-tree build work
correctly.
It also means we don't need our own clean rule, lib.mk does it.
We also have to update the ptrace-pkey and core-pkey rules to use
$(OUTPUT).
Signed-off-by: Joel Stanley <joel(a)jms.id.au>
Signed-off-by: Michael Ellerman <mpe(a)ellerman.id.au>
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
---
tools/testing/selftests/powerpc/ptrace/Makefile | 13 ++++---------
1 file changed, 4 insertions(+), 9 deletions(-)
diff --git a/tools/testing/selftests/powerpc/ptrace/Makefile b/tools/testing/selftests/powerpc/ptrace/Makefile
index 923d531265f8c..9f9423430059e 100644
--- a/tools/testing/selftests/powerpc/ptrace/Makefile
+++ b/tools/testing/selftests/powerpc/ptrace/Makefile
@@ -1,5 +1,5 @@
# SPDX-License-Identifier: GPL-2.0
-TEST_PROGS := ptrace-gpr ptrace-tm-gpr ptrace-tm-spd-gpr \
+TEST_GEN_PROGS := ptrace-gpr ptrace-tm-gpr ptrace-tm-spd-gpr \
ptrace-tar ptrace-tm-tar ptrace-tm-spd-tar ptrace-vsx ptrace-tm-vsx \
ptrace-tm-spd-vsx ptrace-tm-spr ptrace-hwbreak ptrace-pkey core-pkey \
perf-hwbreak
@@ -7,14 +7,9 @@ TEST_PROGS := ptrace-gpr ptrace-tm-gpr ptrace-tm-spd-gpr \
top_srcdir = ../../../../..
include ../../lib.mk
-all: $(TEST_PROGS)
-
CFLAGS += -m64 -I../../../../../usr/include -I../tm -mhtm -fno-pie
-ptrace-pkey core-pkey: child.h
-ptrace-pkey core-pkey: LDLIBS += -pthread
-
-$(TEST_PROGS): ../harness.c ../utils.c ../lib/reg.S ptrace.h
+$(OUTPUT)/ptrace-pkey $(OUTPUT)/core-pkey: child.h
+$(OUTPUT)/ptrace-pkey $(OUTPUT)/core-pkey: LDLIBS += -pthread
-clean:
- rm -f $(TEST_PROGS) *.o
+$(TEST_GEN_PROGS): ../harness.c ../utils.c ../lib/reg.S ptrace.h
--
2.20.1
From: Andrea Parri <andrea.parri(a)amarulasolutions.com>
[ Upstream commit fb363e2d20351e1d16629df19e7bce1a31b3227a ]
Fixes the following warnings:
dirty_log_test.c: In function ‘help’:
dirty_log_test.c:216:9: warning: format ‘%lu’ expects argument of type ‘long unsigned int’, but argument 2 has type ‘int’ [-Wformat=]
printf(" -i: specify iteration counts (default: %"PRIu64")\n",
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
In file included from include/test_util.h:18:0,
from dirty_log_test.c:16:
/usr/include/inttypes.h:105:34: note: format string is defined here
# define PRIu64 __PRI64_PREFIX "u"
dirty_log_test.c:218:9: warning: format ‘%lu’ expects argument of type ‘long unsigned int’, but argument 2 has type ‘int’ [-Wformat=]
printf(" -I: specify interval in ms (default: %"PRIu64" ms)\n",
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
In file included from include/test_util.h:18:0,
from dirty_log_test.c:16:
/usr/include/inttypes.h:105:34: note: format string is defined here
# define PRIu64 __PRI64_PREFIX "u"
Signed-off-by: Andrea Parri <andrea.parri(a)amarulasolutions.com>
Signed-off-by: Shuah Khan (Samsung OSG) <shuah(a)kernel.org>
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
---
tools/testing/selftests/kvm/dirty_log_test.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/tools/testing/selftests/kvm/dirty_log_test.c b/tools/testing/selftests/kvm/dirty_log_test.c
index 0c2cdc105f968..a9c4b5e21d7e7 100644
--- a/tools/testing/selftests/kvm/dirty_log_test.c
+++ b/tools/testing/selftests/kvm/dirty_log_test.c
@@ -31,9 +31,9 @@
/* How many pages to dirty for each guest loop */
#define TEST_PAGES_PER_LOOP 1024
/* How many host loops to run (one KVM_GET_DIRTY_LOG for each loop) */
-#define TEST_HOST_LOOP_N 32
+#define TEST_HOST_LOOP_N 32UL
/* Interval for each host loop (ms) */
-#define TEST_HOST_LOOP_INTERVAL 10
+#define TEST_HOST_LOOP_INTERVAL 10UL
/*
* Guest variables. We use these variables to share data between host
--
2.20.1
From: "Shuah Khan (Samsung OSG)" <shuah(a)kernel.org>
[ Upstream commit 9a244229a4b850b11952a0df79607c69b18fd8df ]
When /dev/watchdog open fails, watchdog exits with "watchdog not enabled"
message. This is incorrect when open fails due to insufficient privilege.
Fix message to clearly state the reason when open fails with EACCESS when
a non-root user runs it.
Signed-off-by: Shuah Khan (Samsung OSG) <shuah(a)kernel.org>
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
---
tools/testing/selftests/watchdog/watchdog-test.c | 8 +++++++-
1 file changed, 7 insertions(+), 1 deletion(-)
diff --git a/tools/testing/selftests/watchdog/watchdog-test.c b/tools/testing/selftests/watchdog/watchdog-test.c
index 6e290874b70e2..e029e2017280f 100644
--- a/tools/testing/selftests/watchdog/watchdog-test.c
+++ b/tools/testing/selftests/watchdog/watchdog-test.c
@@ -89,7 +89,13 @@ int main(int argc, char *argv[])
fd = open("/dev/watchdog", O_WRONLY);
if (fd == -1) {
- printf("Watchdog device not enabled.\n");
+ if (errno == ENOENT)
+ printf("Watchdog device not enabled.\n");
+ else if (errno == EACCES)
+ printf("Run watchdog as root.\n");
+ else
+ printf("Watchdog device open failed %s\n",
+ strerror(errno));
exit(-1);
}
--
2.20.1
From: Quentin Monnet <quentin.monnet(a)netronome.com>
[ Upstream commit c5fa5d602221362f8341ecd9e32d83194abf5bd9 ]
The return value for each test in test_libbpf.sh is compared with
if (( $? == 0 )) ; then ...
This works well with bash, but not with dash, that /bin/sh is aliased to
on some systems (such as Ubuntu).
Let's replace this comparison by something that works on both shells.
Signed-off-by: Quentin Monnet <quentin.monnet(a)netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski(a)netronome.com>
Signed-off-by: Alexei Starovoitov <ast(a)kernel.org>
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
---
tools/testing/selftests/bpf/test_libbpf.sh | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/tools/testing/selftests/bpf/test_libbpf.sh b/tools/testing/selftests/bpf/test_libbpf.sh
index 8b1bc96d8e0cc..2989b2e2d856d 100755
--- a/tools/testing/selftests/bpf/test_libbpf.sh
+++ b/tools/testing/selftests/bpf/test_libbpf.sh
@@ -6,7 +6,7 @@ export TESTNAME=test_libbpf
# Determine selftest success via shell exit code
exit_handler()
{
- if (( $? == 0 )); then
+ if [ $? -eq 0 ]; then
echo "selftests: $TESTNAME [PASS]";
else
echo "$TESTNAME: failed at file $LAST_LOADED" 1>&2
--
2.20.1
Add documentation for the Python script used to build, run, and collect
results from the kernel known as kunit_tool. kunit_tool
(tools/testing/kunit/kunit.py) was already added in previous commits.
Signed-off-by: Brendan Higgins <brendanhiggins(a)google.com>
---
Documentation/dev-tools/kunit/index.rst | 1 +
Documentation/dev-tools/kunit/kunit-tool.rst | 57 ++++++++++++++++++++
Documentation/dev-tools/kunit/start.rst | 5 +-
3 files changed, 62 insertions(+), 1 deletion(-)
create mode 100644 Documentation/dev-tools/kunit/kunit-tool.rst
diff --git a/Documentation/dev-tools/kunit/index.rst b/Documentation/dev-tools/kunit/index.rst
index 26ffb46bdf99d..c60d760a0eed1 100644
--- a/Documentation/dev-tools/kunit/index.rst
+++ b/Documentation/dev-tools/kunit/index.rst
@@ -9,6 +9,7 @@ KUnit - Unit Testing for the Linux Kernel
start
usage
+ kunit-tool
api/index
faq
diff --git a/Documentation/dev-tools/kunit/kunit-tool.rst b/Documentation/dev-tools/kunit/kunit-tool.rst
new file mode 100644
index 0000000000000..37509527c04e1
--- /dev/null
+++ b/Documentation/dev-tools/kunit/kunit-tool.rst
@@ -0,0 +1,57 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=================
+kunit_tool How-To
+=================
+
+What is kunit_tool?
+===================
+
+kunit_tool is a script (``tools/testing/kunit/kunit.py``) that aids in building
+the Linux kernel as UML (`User Mode Linux
+<http://user-mode-linux.sourceforge.net/>`_), running KUnit tests, parsing
+the test results and displaying them in a user friendly manner.
+
+What is a kunitconfig?
+======================
+
+It's just a defconfig that kunit_tool looks for in the base directory.
+kunit_tool uses it to generate a .config as you might expect. In addition, it
+verifies that the generated .config contains the CONFIG options in the
+kunitconfig; the reason it does this is so that it is easy to be sure that a
+CONFIG that enables a test actually ends up in the .config.
+
+How do I use kunit_tool?
+=================================
+
+If a kunitconfig is present at the root directory, all you have to do is:
+
+.. code-block:: bash
+
+ ./tools/testing/kunit/kunit.py run
+
+However, you most likely want to use it with the following options:
+
+.. code-block:: bash
+
+ ./tools/testing/kunit/kunit.py run --timeout=30 --jobs=`nproc --all`
+
+- ``--timeout`` sets a maximum amount of time to allow tests to run.
+- ``--jobs`` sets the number of threads to use to build the kernel.
+
+If you just want to use the defconfig that ships with the kernel, you can
+append the ``--defconfig`` flag as well:
+
+.. code-block:: bash
+
+ ./tools/testing/kunit/kunit.py run --timeout=30 --jobs=`nproc --all` --defconfig
+
+.. note::
+ This command is particularly helpful for getting started because it
+ just works. No kunitconfig needs to be present.
+
+For a list of all the flags supported by kunit_tool, you can run:
+
+.. code-block:: bash
+
+ ./tools/testing/kunit/kunit.py run --help
diff --git a/Documentation/dev-tools/kunit/start.rst b/Documentation/dev-tools/kunit/start.rst
index aeeddfafeea20..4248a6f9038b8 100644
--- a/Documentation/dev-tools/kunit/start.rst
+++ b/Documentation/dev-tools/kunit/start.rst
@@ -19,7 +19,10 @@ The wrapper can be run with:
.. code-block:: bash
- ./tools/testing/kunit/kunit.py run
+ ./tools/testing/kunit/kunit.py run --timeout=30 --jobs=`nproc --all` --defconfig
+
+For more information on this wrapper (also called kunit_tool) checkout the
+:doc:`kunit-tool` page.
Creating a kunitconfig
======================
--
2.24.0.432.g9d3f5f5b63-goog
Currently the following command produces an error message:
linux# make kselftest TARGETS=bpf O=/mnt/linux-build
# selftests: bpf: test_libbpf.sh
# ./test_libbpf.sh: line 23: ./test_libbpf_open: No such file or directory
# test_libbpf: failed at file test_l4lb.o
# selftests: test_libbpf [FAILED]
The error message might not affect make return code, therefore one might
need to grep make output in order to detect it.
The current logic prepends $(OUTPUT) only to the first member of
$(TEST_PROGS). After that, run_one() does
cd `dirname $TEST`
For all tests except the first one, `dirname $TEST` is ., which means
they cannot access the files generated in $(OUTPUT).
Fix by using $(addprefix) to prepend $(OUTPUT)/ to each member of
$(TEST_PROGS).
Fixes: 1a940687e424 ("selftests: lib.mk: copy test scripts and test files for make O=dir run")
Signed-off-by: Ilya Leoshkevich <iii(a)linux.ibm.com>
---
v1->v2:
- Append / to $(OUTPUT).
- Use $(addprefix) instead of $(foreach).
v2->v3:
- Split the patch in two.
- Improve the commit message.
v3->v4:
- Drop the first patch.
- Add a note regarding make return code to the commit message.
tools/testing/selftests/lib.mk | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/tools/testing/selftests/lib.mk b/tools/testing/selftests/lib.mk
index 1c8a1963d03f..0cf510df1ee2 100644
--- a/tools/testing/selftests/lib.mk
+++ b/tools/testing/selftests/lib.mk
@@ -75,7 +75,8 @@ ifdef building_out_of_srctree
@rsync -aq $(TEST_PROGS) $(TEST_PROGS_EXTENDED) $(TEST_FILES) $(OUTPUT)
fi
@if [ "X$(TEST_PROGS)" != "X" ]; then
- $(call RUN_TESTS, $(TEST_GEN_PROGS) $(TEST_CUSTOM_PROGS) $(OUTPUT)/$(TEST_PROGS))
+ $(call RUN_TESTS, $(TEST_GEN_PROGS) $(TEST_CUSTOM_PROGS) \
+ $(addprefix $(OUTPUT)/,$(TEST_PROGS)))
else
$(call RUN_TESTS, $(TEST_GEN_PROGS) $(TEST_CUSTOM_PROGS))
fi
--
2.23.0
Greetings,
Find attached email very confidential. reply for more details
Thanks.
Peter Wong
----------------------------------------------------
This email was sent by the shareware version of Postman Professional.
Add documentation for the Python script used to build, run, and collect
results from the kernel known as kunit_tool. kunit_tool
(tools/testing/kunit/kunit.py) was already added in previous commits.
Signed-off-by: Brendan Higgins <brendanhiggins(a)google.com>
---
Documentation/dev-tools/kunit/index.rst | 1 +
Documentation/dev-tools/kunit/kunit-tool.rst | 57 ++++++++++++++++++++
Documentation/dev-tools/kunit/start.rst | 3 ++
3 files changed, 61 insertions(+)
create mode 100644 Documentation/dev-tools/kunit/kunit-tool.rst
diff --git a/Documentation/dev-tools/kunit/index.rst b/Documentation/dev-tools/kunit/index.rst
index 26ffb46bdf99d..c60d760a0eed1 100644
--- a/Documentation/dev-tools/kunit/index.rst
+++ b/Documentation/dev-tools/kunit/index.rst
@@ -9,6 +9,7 @@ KUnit - Unit Testing for the Linux Kernel
start
usage
+ kunit-tool
api/index
faq
diff --git a/Documentation/dev-tools/kunit/kunit-tool.rst b/Documentation/dev-tools/kunit/kunit-tool.rst
new file mode 100644
index 0000000000000..aa1a93649a45a
--- /dev/null
+++ b/Documentation/dev-tools/kunit/kunit-tool.rst
@@ -0,0 +1,57 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=================
+kunit_tool How-To
+=================
+
+What is kunit_tool?
+===================
+
+kunit_tool is a set of scripts that aid in building the Linux kernel as UML
+(`User Mode Linux <http://user-mode-linux.sourceforge.net/old/>`_), running
+KUnit tests, parsing the test results and displaying them in a user friendly
+manner.
+
+What is a kunitconfig?
+======================
+
+It's just a defconfig that kunit_tool looks for in the base directory.
+kunit_tool uses it to generate a .config as you might expect. In addition, it
+verifies that the generated .config contains the CONFIG options in the
+kunitconfig; the reason it does this is so that it is easy to be sure that a
+CONFIG that enables a test actually ends up in the .config.
+
+How do I use kunit_tool?
+=================================
+
+If a kunitconfig is present at the root directory, all you have to do is:
+
+.. code-block:: bash
+
+ ./tools/testing/kunit/kunit.py run
+
+However, you most likely want to use it with the following options:
+
+.. code-block:: bash
+
+ ./tools/testing/kunit/kunit.py run --timeout=30 --jobs=8
+
+- ``--timeout`` sets a maximum amount of time to allow tests to run.
+- ``--jobs`` sets the number of threads to use to build the kernel.
+
+If you just want to use the defconfig that ships with the kernel, you can
+append the ``--defconfig`` flag as well:
+
+.. code-block:: bash
+
+ ./tools/testing/kunit/kunit.py run --timeout=30 --jobs=8 --defconfig
+
+.. note::
+ This command is particularly helpful for getting started because it
+ just works. No kunitconfig needs to be present.
+
+For a list of all the flags supported by kunit_tool, you can run:
+
+.. code-block:: bash
+
+ ./tools/testing/kunit/kunit.py run --help
diff --git a/Documentation/dev-tools/kunit/start.rst b/Documentation/dev-tools/kunit/start.rst
index aeeddfafeea20..1535c4394cfa2 100644
--- a/Documentation/dev-tools/kunit/start.rst
+++ b/Documentation/dev-tools/kunit/start.rst
@@ -21,6 +21,9 @@ The wrapper can be run with:
./tools/testing/kunit/kunit.py run
+For more information on this wrapper (also called kunit_tool) checkout the
+:doc:`kunit-tool` page.
+
Creating a kunitconfig
======================
The Python script is a thin wrapper around Kbuild as such, it needs to be
--
2.24.0.432.g9d3f5f5b63-goog
On Thu, Nov 14, 2019 at 1:31 PM Brendan Higgins
<brendanhiggins(a)google.com> wrote:
>
> +kselftest and kunit lists to document this decision.
Sorry for the spam. I accidentally CC'ed the doc list instead of the
kselftest list in my previous email.
> On Wed, Nov 13, 2019 at 11:54 PM Alan Maguire <alan.maguire(a)oracle.com> wrote:
> >
> > On Wed, 13 Nov 2019, Stephen Boyd wrote:
> >
> > > Quoting Brendan Higgins (2019-11-11 13:41:38)
> > > > +Stephen Boyd - since he is more of an expert on the hung task timer than I am.
> > > >
> > > > On Fri, Nov 8, 2019 at 7:30 AM Alan Maguire <alan.maguire(a)oracle.com> wrote:
> > > > >
> > > > > On Thu, 7 Nov 2019, Brendan Higgins wrote:
> > > > >
> > > > > > On Thu, Oct 17, 2019 at 11:09 AM Alan Maguire <alan.maguire(a)oracle.com> wrote:
> > > > > > > +MODULE_LICENSE("GPL");
> > > > > > > diff --git a/lib/kunit/try-catch.c b/lib/kunit/try-catch.c
> > > > > > > index 1c1e9af..72fc8ed 100644
> > > > > > > --- a/lib/kunit/try-catch.c
> > > > > > > +++ b/lib/kunit/try-catch.c
> > > > > > > @@ -31,6 +31,8 @@ static int kunit_generic_run_threadfn_adapter(void *data)
> > > > > > > complete_and_exit(try_catch->try_completion, 0);
> > > > > > > }
> > > > > > >
> > > > > > > +KUNIT_VAR_SYMBOL(sysctl_hung_task_timeout_secs, unsigned long);
> > > > > >
> > > > > > Can you just export sysctl_hung_task_timeout_secs?
> > > > > >
> > > > > > I don't mean to make you redo all this work for one symbol twice, but
> > > > > > I thought we agreed on just exposing this symbol, but in a namespace.
> > > > > > It seemed like a good use case for that namespaced exporting thing
> > > > > > that Luis was talking about. As I understood it, you would have to
> > > > > > export it in the module that defines it, and then use the new
> > > > > > MODULE_IMPORT_NS() macro here.
> > > > > >
> > > > >
> > > > > Sure, I can certainly look into that, though I wonder if we should
> > > > > consider another possibility - should kunit have its own sysctl table for
> > > > > things like configuring timeouts? I can look at adding a patch for that
> > > >
> > > > So on the one hand, yes, I would like to have configurable test
> > > > timeouts for KUnit, but that is not what the parameter check is for
> > > > here. This is to make sure KUnit times a test case out before the hung
> > > > task timer does.
> > > >
> > > > > prior to the module patch so the issues with exporting the hung task
> > > > > timeout would go away. Now the reason I suggest this isn't as much a hack
> > > > > to solve this specific problem, rather it seems to fit better with the
> > > > > longer-term intent expressed by the comment around use of the field (at
> > > > > least as I read it, I may be wrong).
> > > >
> > > > Not really. Although I do agree that adding configurability here might
> > > > be a good idea, I believe we would need to clamp such a value by
> > > > sysctl_hung_task_timeout_secs regardless since we don't want to be
> > > > killed by the hung task timer; thus, we still need access to
> > > > sysctl_hung_task_timeout_secs either way, and so doing what you are
> > > > proposing would be off topic.
> > > >
> > > > > Exporting the symbol does allow us to piggy-back on an existing value, but
> > > > > maybe we should support out our own tunable "kunit_timeout_secs" here?
> > > > > Doing so would also lay the groundwork for supporting other kunit
> > > > > tunables in the future if needed. What do you think?
> > > >
> > > > The goal is not to piggy back on the value as I mentioned above.
> > > > Stephen, do you have any thoughts on this? Do you see any other
> > > > preferable solution to what Alan is trying to do?
> > >
> > > One idea would be to make some sort of process flag that says "this is a
> > > kunit task, ignore me with regards to the hung task timeout". Then we
> > > can hardcode the 5 minute kunit timeout. I'm not sure we have any more
> > > flags though.
> > >
> > > Or drop the whole timeout clamping logic, let the hung task timeout kick
> > > in and potentially oops the kernel, but then continue to let the test
> > > run and maybe sometimes get the kunit timeout here. This last option
> > > doesn't sound so bad to me given that this is all a corner case anyway
> > > where we don't expect to actually ever hit this problem so letting the
> > > hung task detector do its job is probably fine. This nicely avoids
> > > having to export this symbol to modules too.
> > >
> >
> > Thanks for suggestions! This latter approach seems fine to me; presumably
> > something has gone wrong if we are tripping the hung task timeout anyway,
> > so having an oops to document that seems fine. Brendan, what do you think?
>
> If Stephen thinks it is fine to drop the clamping logic, I think it is
> fine too. I think it would probably be good to replace it with a
> comment under the TODO that explains that a hung test *can* cause an
> oops if the hung task timeout is less than the kunit timeout value. It
> would probably be good to also select a timeout value that is less
> than the default hung task timeout. We might also want to link to this
> discussion. I fully expect that the timeout logic will get more
> attention at some point in the future.
>
> One more thing: Alan, can you submit the commit that drops the
> clamping logic in its own commit? I would prefer to make sure that it
> is easy to spot in the commit history.
>
> Cheers!
This patchset is being developed here:
<https://github.com/cyphar/linux/tree/openat2/master>
Patch changelog:
v15:
* Fix code style for LOOKUP_IN_ROOT handling in path_init(). [Linus Torvalds]
* Split out patches for each individual LOOKUP flag.
* Reword commit messages to give more background information about the
series, as well as mention the semantics of each flag in more detail.
v14: <https://lore.kernel.org/lkml/20191010054140.8483-1-cyphar@cyphar.com/>
<https://lore.kernel.org/lkml/20191026185700.10708-1-cyphar@cyphar.com>
v13: <https://lore.kernel.org/lkml/20190930183316.10190-1-cyphar@cyphar.com/>
v12: <https://lore.kernel.org/lkml/20190904201933.10736-1-cyphar@cyphar.com/>
v11: <https://lore.kernel.org/lkml/20190820033406.29796-1-cyphar@cyphar.com/>
<https://lore.kernel.org/lkml/20190728010207.9781-1-cyphar@cyphar.com/>
v10: <https://lore.kernel.org/lkml/20190719164225.27083-1-cyphar@cyphar.com/>
v09: <https://lore.kernel.org/lkml/20190706145737.5299-1-cyphar@cyphar.com/>
v08: <https://lore.kernel.org/lkml/20190520133305.11925-1-cyphar@cyphar.com/>
v07: <https://lore.kernel.org/lkml/20190507164317.13562-1-cyphar@cyphar.com/>
v06: <https://lore.kernel.org/lkml/20190506165439.9155-1-cyphar@cyphar.com/>
v05: <https://lore.kernel.org/lkml/20190320143717.2523-1-cyphar@cyphar.com/>
v04: <https://lore.kernel.org/lkml/20181112142654.341-1-cyphar@cyphar.com/>
v03: <https://lore.kernel.org/lkml/20181009070230.12884-1-cyphar@cyphar.com/>
v02: <https://lore.kernel.org/lkml/20181009065300.11053-1-cyphar@cyphar.com/>
v01: <https://lore.kernel.org/lkml/20180929103453.12025-1-cyphar@cyphar.com/>
For a very long time, extending openat(2) with new features has been
incredibly frustrating. This stems from the fact that openat(2) is
possibly the most famous counter-example to the mantra "don't silently
accept garbage from userspace" -- it doesn't check whether unknown flags
are present[1].
This means that (generally) the addition of new flags to openat(2) has
been fraught with backwards-compatibility issues (O_TMPFILE has to be
defined as __O_TMPFILE|O_DIRECTORY|[O_RDWR or O_WRONLY] to ensure old
kernels gave errors, since it's insecure to silently ignore the
flag[2]). All new security-related flags therefore have a tough road to
being added to openat(2).
Furthermore, the need for some sort of control over VFS's path resolution (to
avoid malicious paths resulting in inadvertent breakouts) has been a very
long-standing desire of many userspace applications. This patchset is a revival
of Al Viro's old AT_NO_JUMPS[3] patchset (which was a variant of David
Drysdale's O_BENEATH patchset[4] which was a spin-off of the Capsicum
project[5]) with a few additions and changes made based on the previous
discussion within [6] as well as others I felt were useful.
In line with the conclusions of the original discussion of AT_NO_JUMPS, the
flag has been split up into separate flags. However, instead of being an
openat(2) flag it is provided through a new syscall openat2(2) which provides
several other improvements to the openat(2) interface (see the patch
description for more details). The following new LOOKUP_* flags are added:
* LOOKUP_NO_XDEV blocks all mountpoint crossings (upwards, downwards,
or through absolute links). Absolute pathnames alone in openat(2) do not
trigger this. Magic-link traversal which implies a vfsmount jump is also
blocked (though magic-link jumps on the same vfsmount are permitted).
* LOOKUP_NO_MAGICLINKS blocks resolution through /proc/$pid/fd-style
links. This is done by blocking the usage of nd_jump_link() during
resolution in a filesystem. The term "magic-links" is used to match
with the only reference to these links in Documentation/, but I'm
happy to change the name.
It should be noted that this is different to the scope of
~LOOKUP_FOLLOW in that it applies to all path components. However,
you can do openat2(NO_FOLLOW|NO_MAGICLINKS) on a magic-link and it
will *not* fail (assuming that no parent component was a
magic-link), and you will have an fd for the magic-link.
In order to correctly detect magic-links, the introduction of a new
LOOKUP_MAGICLINK_JUMPED state flag was required.
* LOOKUP_BENEATH disallows escapes to outside the starting dirfd's
tree, using techniques such as ".." or absolute links. Absolute
paths in openat(2) are also disallowed. Conceptually this flag is to
ensure you "stay below" a certain point in the filesystem tree --
but this requires some additional to protect against various races
that would allow escape using "..".
Currently LOOKUP_BENEATH implies LOOKUP_NO_MAGICLINKS, because it
can trivially beam you around the filesystem (breaking the
protection). In future, there might be similar safety checks done as
in LOOKUP_IN_ROOT, but that requires more discussion.
In addition, two new flags are added that expand on the above ideas:
* LOOKUP_NO_SYMLINKS does what it says on the tin. No symlink
resolution is allowed at all, including magic-links. Just as with
LOOKUP_NO_MAGICLINKS this can still be used with NOFOLLOW to open an
fd for the symlink as long as no parent path had a symlink
component.
* LOOKUP_IN_ROOT is an extension of LOOKUP_BENEATH that, rather than
blocking attempts to move past the root, forces all such movements
to be scoped to the starting point. This provides chroot(2)-like
protection but without the cost of a chroot(2) for each filesystem
operation, as well as being safe against race attacks that chroot(2)
is not.
If a race is detected (as with LOOKUP_BENEATH) then an error is
generated, and similar to LOOKUP_BENEATH it is not permitted to cross
magic-links with LOOKUP_IN_ROOT.
The primary need for this is from container runtimes, which
currently need to do symlink scoping in userspace[7] when opening
paths in a potentially malicious container. There is a long list of
CVEs that could have bene mitigated by having RESOLVE_THIS_ROOT
(such as CVE-2017-1002101, CVE-2017-1002102, CVE-2018-15664, and
CVE-2019-5736, just to name a few).
In order to make all of the above more usable, I'm working on
libpathrs[8] which is a C-friendly library for safe path resolution. It
features a userspace-emulated backend if the kernel doesn't support
openat2(2). Hopefully we can get userspace to switch to using it, and
thus get openat2(2) support for free once it's ready.
Future work would include implementing things like RESOLVE_NO_AUTOMOUNT and
possibly a RESOLVE_NO_REMOTE (to allow programs to be sure they don't hit DoSes
though stale NFS handles).
[1]: https://lwn.net/Articles/588444/
[2]: https://lore.kernel.org/lkml/CA+55aFyyxJL1LyXZeBsf2ypriraj5ut1XkNDsunRBqgVj…
[3]: https://lore.kernel.org/lkml/20170429220414.GT29622@ZenIV.linux.org.uk
[4]: https://lore.kernel.org/lkml/1415094884-18349-1-git-send-email-drysdale@goo…
[5]: https://lore.kernel.org/lkml/1404124096-21445-1-git-send-email-drysdale@goo…
[6]: https://lwn.net/Articles/723057/
[7]: https://github.com/cyphar/filepath-securejoin
[8]: https://github.com/openSUSE/libpathrs
The current draft of the openat2(2) man-page is included below.
--8<---------------------------------------------------------------------------
OPENAT2(2) Linux Programmer's Manual OPENAT2(2)
NAME
openat2 - open and possibly create a file (extended)
SYNOPSIS
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
int openat2(int dirfd, const char *pathname, struct open_how *how, size_t size);
Note: There is no glibc wrapper for this system call; see NOTES.
DESCRIPTION
The openat2() system call opens the file specified by pathname. If the specified file
does not exist, it may optionally (if O_CREAT is specified in how.flags) be created by
openat2().
As with openat(2), if pathname is relative, then it is interpreted relative to the direc-
tory referred to by the file descriptor dirfd (or the current working directory of the
calling process, if dirfd is the special value AT_FDCWD.) If pathname is absolute, then
dirfd is ignored (unless how.resolve contains RESOLVE_IN_ROOT, in which case pathname is
resolved relative to dirfd.)
The openat2() system call is an extension of openat(2) and provides a superset of its
functionality. Rather than taking a single flag argument, an extensible structure (how)
is passed instead to allow for future extensions. size must be set to sizeof(struct
open_how), to facilitate future extensions (see the "Extensibility" section of the NOTES
for more detail on how extensions are handled.)
The open_how structure
The following structure indicates how pathname should be opened, and acts as a superset of
the flag and mode arguments to openat(2).
struct open_how {
__aligned_u64 flags; /* O_* flags. */
__u16 mode; /* Mode for O_{CREAT,TMPFILE}. */
__u16 __padding[3]; /* Must be zeroed. */
__aligned_u64 resolve; /* RESOLVE_* flags. */
};
Any future extensions to openat2() will be implemented as new fields appended to the above
structure (or through reuse of pre-existing padding space), with the zero value of the new
fields acting as though the extension were not present.
The meaning of each field is as follows:
flags
The file creation and status flags to use for this operation. All of the
O_* flags defined for openat(2) are valid openat2() flag values.
Unlike openat(2), it is an error to provide openat2() unknown or conflicting
flags in flags.
mode
File mode for the new file, with identical semantics to the mode argument to
openat(2). However, unlike openat(2), it is an error to provide openat2()
with a mode which contains bits other than 0777.
It is an error to provide openat2() a non-zero mode if flags does not con-
tain O_CREAT or O_TMPFILE.
resolve
Change how the components of pathname will be resolved (see path_resolu-
tion(7) for background information.) The primary use case for these flags
is to allow trusted programs to restrict how untrusted paths (or paths in-
side untrusted directories) are resolved. The full list of resolve flags is
given below.
RESOLVE_NO_XDEV
Disallow traversal of mount points during path resolution (including
all bind mounts).
Users of this flag are encouraged to make its use configurable (un-
less it is used for a specific security purpose), as bind mounts are
very widely used by end-users. Setting this flag indiscrimnately for
all uses of openat2() may result in spurious errors on previously-
functional systems.
RESOLVE_NO_SYMLINKS
Disallow resolution of symbolic links during path resolution. This
option implies RESOLVE_NO_MAGICLINKS.
If the trailing component is a symbolic link, and flags contains both
O_PATH and O_NOFOLLOW, then an O_PATH file descriptor referencing the
symbolic link will be returned.
Users of this flag are encouraged to make its use configurable (un-
less it is used for a specific security purpose), as symbolic links
are very widely used by end-users. Setting this flag indiscrimnately
for all uses of openat2() may result in spurious errors on previ-
ously-functional systems.
RESOLVE_NO_MAGICLINKS
Disallow all magic link resolution during path resolution.
If the trailing component is a magic link, and flags contains both
O_PATH and O_NOFOLLOW, then an O_PATH file descriptor referencing the
magic link will be returned.
Magic-links are symbolic link-like objects that are most notably
found in proc(5) (examples include /proc/[pid]/exe and
/proc/[pid]/fd/*.) Due to the potential danger of unknowingly open-
ing these magic links, it may be preferable for users to disable
their resolution entirely (see symboliclink(7) for more details.)
RESOLVE_BENEATH
Do not permit the path resolution to succeed if any component of the
resolution is not a descendant of the directory indicated by dirfd.
This results in absolute symbolic links (and absolute values of path-
name) to be rejected.
Currently, this flag also disables magic link resolution. However,
this may change in the future. The caller should explicitly specify
RESOLVE_NO_MAGICLINKS to ensure that magic links are not resolved.
RESOLVE_IN_ROOT
Treat dirfd as the root directory while resolving pathname (as though
the user called chroot(2) with dirfd as the argument.) Absolute sym-
bolic links and ".." path components will be scoped to dirfd. If
pathname is an absolute path, it is also treated relative to dirfd.
However, unlike chroot(2) (which changes the filesystem root perma-
nently for a process), RESOLVE_IN_ROOT allows a program to effi-
ciently restrict path resolution for only certain operations. It
also has several hardening features (such detecting escape attempts
during .. resolution) which chroot(2) does not.
Currently, this flag also disables magic link resolution. However,
this may change in the future. The caller should explicitly specify
RESOLVE_NO_MAGICLINKS to ensure that magic links are not resolved.
It is an error to provide openat2() unknown flags in resolve.
RETURN VALUE
On success, a new file descriptor is returned. On error, -1 is returned, and errno is set
appropriately.
ERRORS
The set of errors returned by openat2() includes all of the errors returned by openat(2),
as well as the following additional errors:
EINVAL An unknown flag or invalid value was specified in how.
EINVAL mode is non-zero, but flags does not contain O_CREAT or O_TMPFILE.
EINVAL size was smaller than any known version of struct open_how.
E2BIG An extension was specified in how, which the current kernel does not support (see
the "Extensibility" section of the NOTES for more detail on how extensions are han-
dled.)
EAGAIN resolve contains either RESOLVE_IN_ROOT or RESOLVE_BENEATH, and the kernel could
not ensure that a ".." component didn't escape (due to a race condition or poten-
tial attack.) Callers may choose to retry the openat2() call.
EXDEV resolve contains either RESOLVE_IN_ROOT or RESOLVE_BENEATH, and an escape from the
root during path resolution was detected.
EXDEV resolve contains RESOLVE_NO_XDEV, and a path component attempted to cross a mount
point.
ELOOP resolve contains RESOLVE_NO_SYMLINKS, and one of the path components was a symbolic
link (or magic link).
ELOOP resolve contains RESOLVE_NO_MAGICLINKS, and one of the path components was a magic
link.
VERSIONS
openat2() was added to Linux in kernel 5.FOO.
CONFORMING TO
This system call is Linux-specific.
The semantics of RESOLVE_BENEATH were modelled after FreeBSD's O_BENEATH.
NOTES
Glibc does not provide a wrapper for this system call; call it using systemcall(2).
Extensibility
In order to allow for struct open_how to be extended in future kernel revisions, openat2()
requires userspace to specify the size of struct open_how structure they are passing. By
providing this information, it is possible for openat2() to provide both forwards- and
backwards-compatibility — with size acting as an implicit version number (because new ex-
tension fields will always be appended, the size will always increase.) This extensibil-
ity design is very similar to other system calls such as perf_setattr(2),
perf_event_open(2), and clone(3).
If we let usize be the size of the structure according to userspace and ksize be the size
of the structure which the kernel supports, then there are only three cases to consider:
* If ksize equals usize, then there is no version mismatch and how can be used
verbatim.
* If ksize is larger than usize, then there are some extensions the kernel sup-
ports which the userspace program is unaware of. Because all extensions must
have their zero values be a no-op, the kernel treats all of the extension fields
not set by userspace to have zero values. This provides backwards-compatibil-
ity.
* If ksize is smaller than usize, then there are some extensions which the
userspace program is aware of but the kernel does not support. Because all ex-
tensions must have their zero values be a no-op, the kernel can safely ignore
the unsupported extension fields if they are all-zero. If any unsupported ex-
tension fields are non-zero, then -1 is returned and errno is set to E2BIG.
This provides forwards-compatibility.
Therefore, most userspace programs will not need to have any special handling of exten-
sions. However, if a userspace program wishes to determine what extensions the running
kernel supports, they may conduct a binary search on size (to find the largest value which
doesn't produce an error of E2BIG.)
SEE ALSO
openat(2), path_resolution(7), symlink(7)
Linux 2019-11-05 OPENAT2(2)
--8<---------------------------------------------------------------------------
Aleksa Sarai (9):
namei: LOOKUP_NO_SYMLINKS: block symlink resolution
namei: LOOKUP_NO_MAGICLINKS: block magic-link resolution
namei: LOOKUP_NO_XDEV: block mountpoint crossing
namei: LOOKUP_BENEATH: O_BENEATH-like scoped resolution
namei: LOOKUP_IN_ROOT: chroot-like scoped resolution
namei: LOOKUP_{IN_ROOT,BENEATH}: permit limited ".." resolution
open: introduce openat2(2) syscall
selftests: add openat2(2) selftests
Documentation: path-lookup: mention LOOKUP_MAGICLINK_JUMPED
CREDITS | 4 +-
Documentation/filesystems/path-lookup.rst | 18 +-
arch/alpha/kernel/syscalls/syscall.tbl | 1 +
arch/arm/tools/syscall.tbl | 1 +
arch/arm64/include/asm/unistd.h | 2 +-
arch/arm64/include/asm/unistd32.h | 2 +
arch/ia64/kernel/syscalls/syscall.tbl | 1 +
arch/m68k/kernel/syscalls/syscall.tbl | 1 +
arch/microblaze/kernel/syscalls/syscall.tbl | 1 +
arch/mips/kernel/syscalls/syscall_n32.tbl | 1 +
arch/mips/kernel/syscalls/syscall_n64.tbl | 1 +
arch/mips/kernel/syscalls/syscall_o32.tbl | 1 +
arch/parisc/kernel/syscalls/syscall.tbl | 1 +
arch/powerpc/kernel/syscalls/syscall.tbl | 1 +
arch/s390/kernel/syscalls/syscall.tbl | 1 +
arch/sh/kernel/syscalls/syscall.tbl | 1 +
arch/sparc/kernel/syscalls/syscall.tbl | 1 +
arch/x86/entry/syscalls/syscall_32.tbl | 1 +
arch/x86/entry/syscalls/syscall_64.tbl | 1 +
arch/xtensa/kernel/syscalls/syscall.tbl | 1 +
fs/namei.c | 176 +++++-
fs/open.c | 149 +++--
include/linux/fcntl.h | 12 +-
include/linux/namei.h | 11 +
include/linux/syscalls.h | 3 +
include/uapi/asm-generic/unistd.h | 5 +-
include/uapi/linux/fcntl.h | 41 ++
tools/testing/selftests/Makefile | 1 +
tools/testing/selftests/openat2/.gitignore | 1 +
tools/testing/selftests/openat2/Makefile | 8 +
tools/testing/selftests/openat2/helpers.c | 109 ++++
tools/testing/selftests/openat2/helpers.h | 107 ++++
.../testing/selftests/openat2/openat2_test.c | 316 +++++++++++
.../selftests/openat2/rename_attack_test.c | 160 ++++++
.../testing/selftests/openat2/resolve_test.c | 523 ++++++++++++++++++
35 files changed, 1591 insertions(+), 73 deletions(-)
create mode 100644 tools/testing/selftests/openat2/.gitignore
create mode 100644 tools/testing/selftests/openat2/Makefile
create mode 100644 tools/testing/selftests/openat2/helpers.c
create mode 100644 tools/testing/selftests/openat2/helpers.h
create mode 100644 tools/testing/selftests/openat2/openat2_test.c
create mode 100644 tools/testing/selftests/openat2/rename_attack_test.c
create mode 100644 tools/testing/selftests/openat2/resolve_test.c
base-commit: a99d8080aaf358d5d23581244e5da23b35e340b9
--
2.23.0
On kvm_create_max_vcpus test remove unneeded local
variable in the loop that add vcpus to the VM.
Signed-off-by: Wainer dos Santos Moschetta <wainersm(a)redhat.com>
---
tools/testing/selftests/kvm/kvm_create_max_vcpus.c | 7 ++-----
1 file changed, 2 insertions(+), 5 deletions(-)
diff --git a/tools/testing/selftests/kvm/kvm_create_max_vcpus.c b/tools/testing/selftests/kvm/kvm_create_max_vcpus.c
index 231d79e57774..6f38c3dc0d56 100644
--- a/tools/testing/selftests/kvm/kvm_create_max_vcpus.c
+++ b/tools/testing/selftests/kvm/kvm_create_max_vcpus.c
@@ -29,12 +29,9 @@ void test_vcpu_creation(int first_vcpu_id, int num_vcpus)
vm = vm_create(VM_MODE_DEFAULT, DEFAULT_GUEST_PHY_PAGES, O_RDWR);
- for (i = 0; i < num_vcpus; i++) {
- int vcpu_id = first_vcpu_id + i;
-
+ for (i = first_vcpu_id; i < first_vcpu_id + num_vcpus; i++)
/* This asserts that the vCPU was created. */
- vm_vcpu_add(vm, vcpu_id);
- }
+ vm_vcpu_add(vm, i);
kvm_vm_free(vm);
}
--
2.18.1
OK, here we go. Any VFIO and Infiniband runtime testing from anyone, is
especially welcome here.
Changes since v3:
* VFIO fix (patch 8): applied further cleanup: removed a pre-existing,
unnecessary release and reacquire of mmap_sem. Moved the DAX vma
checks from the vfio call site, to gup internals, and added comments
(and commit log) to clarify.
* Due to the above, made a corresponding fix to the
pin_longterm_pages_remote(), which was actually calling the wrong
gup internal function.
* Changed put_user_page() comments, to refer to pin*() APIs, rather than
get_user_pages*() APIs.
* Reverted an accidental whitespace-only change in the IB ODP code.
* Added a few more reviewed-by tags.
Changes since v2:
* Added a patch to convert IB/umem from normal gup, to gup_fast(). This
is also posted separately, in order to hopefully get some runtime
testing.
* Changed the page devmap code to be a little clearer,
thanks to Jerome for that.
* Split out the page devmap changes into a separate patch (and moved
Ira's Signed-off-by to that patch).
* Fixed my bug in IB: ODP code does not require pin_user_pages()
semantics. Therefore, revert the put_user_page() calls to put_page(),
and leave the get_user_pages() call as-is.
* As part of the revert, I am proposing here a change directly
from put_user_pages(), to release_pages(). I'd feel better if
someone agrees that this is the best way. It uses the more
efficient release_pages(), instead of put_page() in a loop,
and keep the change to just a few character on one line,
but OTOH it is not a pure revert.
* Loosened the FOLL_LONGTERM restrictions in the
__get_user_pages_locked() implementation, and used that in order
to fix up a VFIO bug. Thanks to Jason for that idea.
* Note the use of release_pages() in IB: is that OK?
* Added a few more WARN's and clarifying comments nearby.
* Many documentation improvements in various comments.
* Moved the new pin_user_pages.rst from Documentation/vm/ to
Documentation/core-api/ .
* Commit descriptions: added clarifying notes to the three patches
(drm/via, fs/io_uring, net/xdp) that already had put_user_page()
calls in place.
* Collected all pending Reviewed-by and Acked-by tags, from v1 and v2
email threads.
* Lot of churn from v2 --> v3, so it's possible that new bugs
sneaked in.
NOT DONE: separate patchset is required:
* __get_user_pages_locked(): stop compensating for
buggy callers who failed to set FOLL_GET. Instead, assert
that FOLL_GET is set (and fail if it's not).
======================================================================
Original cover letter (edited to fix up the patch description numbers)
This applies cleanly to linux-next and mmotm, and also to linux.git if
linux-next's commit 20cac10710c9 ("mm/gup_benchmark: fix MAP_HUGETLB
case") is first applied there.
This provides tracking of dma-pinned pages. This is a prerequisite to
solving the larger problem of proper interactions between file-backed
pages, and [R]DMA activities, as discussed in [1], [2], [3], and in
a remarkable number of email threads since about 2017. :)
A new internal gup flag, FOLL_PIN is introduced, and thoroughly
documented in the last patch's Documentation/vm/pin_user_pages.rst.
I believe that this will provide a good starting point for doing the
layout lease work that Ira Weiny has been working on. That's because
these new wrapper functions provide a clean, constrained, systematically
named set of functionality that, again, is required in order to even
know if a page is "dma-pinned".
In contrast to earlier approaches, the page tracking can be
incrementally applied to the kernel call sites that, until now, have
been simply calling get_user_pages() ("gup"). In other words, opt-in by
changing from this:
get_user_pages() (sets FOLL_GET)
put_page()
to this:
pin_user_pages() (sets FOLL_PIN)
put_user_page()
Because there are interdependencies with FOLL_LONGTERM, a similar
conversion as for FOLL_PIN, was applied. The change was from this:
get_user_pages(FOLL_LONGTERM) (also sets FOLL_GET)
put_page()
to this:
pin_longterm_pages() (sets FOLL_PIN | FOLL_LONGTERM)
put_user_page()
============================================================
Patch summary:
* Patches 1-8: refactoring and preparatory cleanup, independent fixes
* Patch 9: introduce pin_user_pages(), FOLL_PIN, but no functional
changes yet
* Patches 10-15: Convert existing put_user_page() callers, to use the
new pin*()
* Patch 16: Activate tracking of FOLL_PIN pages.
* Patches 17-19: convert FOLL_LONGTERM callers
* Patches: 20-22: gup_benchmark and run_vmtests support
* Patch 23: enforce FOLL_LONGTERM as a gup-internal (only) flag
============================================================
Testing:
* I've done some overall kernel testing (LTP, and a few other goodies),
and some directed testing to exercise some of the changes. And as you
can see, gup_benchmark is enhanced to exercise this. Basically, I've been
able to runtime test the core get_user_pages() and pin_user_pages() and
related routines, but not so much on several of the call sites--but those
are generally just a couple of lines changed, each.
Not much of the kernel is actually using this, which on one hand
reduces risk quite a lot. But on the other hand, testing coverage
is low. So I'd love it if, in particular, the Infiniband and PowerPC
folks could do a smoke test of this series for me.
Also, my runtime testing for the call sites so far is very weak:
* io_uring: Some directed tests from liburing exercise this, and they pass.
* process_vm_access.c: A small directed test passes.
* gup_benchmark: the enhanced version hits the new gup.c code, and passes.
* infiniband (still only have crude "IB pingpong" working, on a
good day: it's not exercising my conversions at runtime...)
* VFIO: compiles (I'm vowing to set up a run time test soon, but it's
not ready just yet)
* powerpc: it compiles...
* drm/via: compiles...
* goldfish: compiles...
* net/xdp: compiles...
* media/v4l2: compiles...
============================================================
Next:
* Get the block/bio_vec sites converted to use pin_user_pages().
* Work with Ira and Dave Chinner to weave this together with the
layout lease stuff.
============================================================
[1] Some slow progress on get_user_pages() (Apr 2, 2019): https://lwn.net/Articles/784574/
[2] DMA and get_user_pages() (LPC: Dec 12, 2018): https://lwn.net/Articles/774411/
[3] The trouble with get_user_pages() (Apr 30, 2018): https://lwn.net/Articles/753027/
John Hubbard (23):
mm/gup: pass flags arg to __gup_device_* functions
mm/gup: factor out duplicate code from four routines
mm/gup: move try_get_compound_head() to top, fix minor issues
mm: devmap: refactor 1-based refcounting for ZONE_DEVICE pages
goldish_pipe: rename local pin_user_pages() routine
IB/umem: use get_user_pages_fast() to pin DMA pages
media/v4l2-core: set pages dirty upon releasing DMA buffers
vfio, mm: fix get_user_pages_remote() and FOLL_LONGTERM
mm/gup: introduce pin_user_pages*() and FOLL_PIN
goldish_pipe: convert to pin_user_pages() and put_user_page()
IB/{core,hw,umem}: set FOLL_PIN, FOLL_LONGTERM via
pin_longterm_pages*()
mm/process_vm_access: set FOLL_PIN via pin_user_pages_remote()
drm/via: set FOLL_PIN via pin_user_pages_fast()
fs/io_uring: set FOLL_PIN via pin_user_pages()
net/xdp: set FOLL_PIN via pin_user_pages()
mm/gup: track FOLL_PIN pages
media/v4l2-core: pin_longterm_pages (FOLL_PIN) and put_user_page()
conversion
vfio, mm: pin_longterm_pages (FOLL_PIN) and put_user_page() conversion
powerpc: book3s64: convert to pin_longterm_pages() and put_user_page()
mm/gup_benchmark: use proper FOLL_WRITE flags instead of hard-coding
"1"
mm/gup_benchmark: support pin_user_pages() and related calls
selftests/vm: run_vmtests: invoke gup_benchmark with basic FOLL_PIN
coverage
mm/gup: remove support for gup(FOLL_LONGTERM)
Documentation/core-api/index.rst | 1 +
Documentation/core-api/pin_user_pages.rst | 218 +++++++
arch/powerpc/mm/book3s64/iommu_api.c | 15 +-
drivers/gpu/drm/via/via_dmablit.c | 2 +-
drivers/infiniband/core/umem.c | 17 +-
drivers/infiniband/core/umem_odp.c | 13 +-
drivers/infiniband/hw/hfi1/user_pages.c | 4 +-
drivers/infiniband/hw/mthca/mthca_memfree.c | 3 +-
drivers/infiniband/hw/qib/qib_user_pages.c | 8 +-
drivers/infiniband/hw/qib/qib_user_sdma.c | 2 +-
drivers/infiniband/hw/usnic/usnic_uiom.c | 9 +-
drivers/infiniband/sw/siw/siw_mem.c | 5 +-
drivers/media/v4l2-core/videobuf-dma-sg.c | 10 +-
drivers/platform/goldfish/goldfish_pipe.c | 35 +-
drivers/vfio/vfio_iommu_type1.c | 30 +-
fs/io_uring.c | 5 +-
include/linux/mm.h | 164 ++++-
include/linux/mmzone.h | 2 +
include/linux/page_ref.h | 10 +
mm/gup.c | 636 ++++++++++++++++----
mm/gup_benchmark.c | 87 ++-
mm/huge_memory.c | 54 +-
mm/hugetlb.c | 39 +-
mm/memremap.c | 67 +--
mm/process_vm_access.c | 28 +-
mm/vmstat.c | 2 +
net/xdp/xdp_umem.c | 4 +-
tools/testing/selftests/vm/gup_benchmark.c | 34 +-
tools/testing/selftests/vm/run_vmtests | 22 +
29 files changed, 1191 insertions(+), 335 deletions(-)
create mode 100644 Documentation/core-api/pin_user_pages.rst
--
2.24.0
Hi,
The cover letter is long, so the more important stuff is first:
* Jason, if you or someone could look at the the VFIO cleanup (patch 8)
and conversion to FOLL_PIN (patch 18), to make sure it's use of
remote and longterm gup matches what we discussed during the review
of v2, I'd appreciate it.
* Also for Jason and IB: as noted below, in patch 11, I am (too?) boldly
converting from put_user_pages() to release_pages().
* Jerome, I am going to take a look at doing your FOLL_GET change idea
(some callers should set FOLL_GET) separately, because it blew up "a
little bit" in my face. It's definitely a separate--tiny, but risky--project.
It also looks more reasonable when applied on top of this series here
(and it conflicts a lot), so I'm going to send it as a follow-up.
Changes since v2:
* Added a patch to convert IB/umem from normal gup, to gup_fast(). This
is also posted separately, in order to hopefully get some runtime
testing.
* Changed the page devmap code to be a little clearer,
thanks to Jerome for that.
* Split out the page devmap changes into a separate patch (and moved
Ira's Signed-off-by to that patch).
* Fixed my bug in IB: ODP code does not require pin_user_pages()
semantics. Therefore, revert the put_user_page() calls to put_page(),
and leave the get_user_pages() call as-is.
* As part of the revert, I am proposing here a change directly
from put_user_pages(), to release_pages(). I'd feel better if
someone agrees that this is the best way. It uses the more
efficient release_pages(), instead of put_page() in a loop,
and keep the change to just a few character on one line,
but OTOH it is not a pure revert.
* Loosened the FOLL_LONGTERM restrictions in the
__get_user_pages_locked() implementation, and used that in order
to fix up a VFIO bug. Thanks to Jason for that idea.
* Note the use of release_pages() in IB: is that OK?
* Added a few more WARN's and clarifying comments nearby.
* Many documentation improvements in various comments.
* Moved the new pin_user_pages.rst from Documentation/vm/ to
Documentation/core-api/ .
* Commit descriptions: added clarifying notes to the three patches
(drm/via, fs/io_uring, net/xdp) that already had put_user_page()
calls in place.
* Collected all pending Reviewed-by and Acked-by tags, from v1 and v2
email threads.
* Lot of churn from v2 --> v3, so it's possible that new bugs
sneaked in.
NOT DONE: separate patchset is required:
* __get_user_pages_locked(): stop compensating for
buggy callers who failed to set FOLL_GET. Instead, assert
that FOLL_GET is set (and fail if it's not).
======================================================================
Original cover letter (edited to fix up the patch description numbers)
This applies cleanly to linux-next and mmotm, and also to linux.git if
linux-next's commit 20cac10710c9 ("mm/gup_benchmark: fix MAP_HUGETLB
case") is first applied there.
This provides tracking of dma-pinned pages. This is a prerequisite to
solving the larger problem of proper interactions between file-backed
pages, and [R]DMA activities, as discussed in [1], [2], [3], and in
a remarkable number of email threads since about 2017. :)
A new internal gup flag, FOLL_PIN is introduced, and thoroughly
documented in the last patch's Documentation/vm/pin_user_pages.rst.
I believe that this will provide a good starting point for doing the
layout lease work that Ira Weiny has been working on. That's because
these new wrapper functions provide a clean, constrained, systematically
named set of functionality that, again, is required in order to even
know if a page is "dma-pinned".
In contrast to earlier approaches, the page tracking can be
incrementally applied to the kernel call sites that, until now, have
been simply calling get_user_pages() ("gup"). In other words, opt-in by
changing from this:
get_user_pages() (sets FOLL_GET)
put_page()
to this:
pin_user_pages() (sets FOLL_PIN)
put_user_page()
Because there are interdependencies with FOLL_LONGTERM, a similar
conversion as for FOLL_PIN, was applied. The change was from this:
get_user_pages(FOLL_LONGTERM) (also sets FOLL_GET)
put_page()
to this:
pin_longterm_pages() (sets FOLL_PIN | FOLL_LONGTERM)
put_user_page()
============================================================
Patch summary:
* Patches 1-8: refactoring and preparatory cleanup, independent fixes
* Patch 9: introduce pin_user_pages(), FOLL_PIN, but no functional
changes yet
* Patches 10-15: Convert existing put_user_page() callers, to use the
new pin*()
* Patch 16: Activate tracking of FOLL_PIN pages.
* Patches 17-19: convert FOLL_LONGTERM callers
* Patches: 20-22: gup_benchmark and run_vmtests support
* Patch 23: enforce FOLL_LONGTERM as a gup-internal (only) flag
============================================================
Testing:
* I've done some overall kernel testing (LTP, and a few other goodies),
and some directed testing to exercise some of the changes. And as you
can see, gup_benchmark is enhanced to exercise this. Basically, I've been
able to runtime test the core get_user_pages() and pin_user_pages() and
related routines, but not so much on several of the call sites--but those
are generally just a couple of lines changed, each.
Not much of the kernel is actually using this, which on one hand
reduces risk quite a lot. But on the other hand, testing coverage
is low. So I'd love it if, in particular, the Infiniband and PowerPC
folks could do a smoke test of this series for me.
Also, my runtime testing for the call sites so far is very weak:
* io_uring: Some directed tests from liburing exercise this, and they pass.
* process_vm_access.c: A small directed test passes.
* gup_benchmark: the enhanced version hits the new gup.c code, and passes.
* infiniband (still only have crude "IB pingpong" working, on a
good day: it's not exercising my conversions at runtime...)
* VFIO: compiles (I'm vowing to set up a run time test soon, but it's
not ready just yet)
* powerpc: it compiles...
* drm/via: compiles...
* goldfish: compiles...
* net/xdp: compiles...
* media/v4l2: compiles...
============================================================
Next:
* Get the block/bio_vec sites converted to use pin_user_pages().
* Work with Ira and Dave Chinner to weave this together with the
layout lease stuff.
============================================================
[1] Some slow progress on get_user_pages() (Apr 2, 2019): https://lwn.net/Articles/784574/
[2] DMA and get_user_pages() (LPC: Dec 12, 2018): https://lwn.net/Articles/774411/
[3] The trouble with get_user_pages() (Apr 30, 2018): https://lwn.net/Articles/753027/
John Hubbard (23):
mm/gup: pass flags arg to __gup_device_* functions
mm/gup: factor out duplicate code from four routines
mm/gup: move try_get_compound_head() to top, fix minor issues
mm: devmap: refactor 1-based refcounting for ZONE_DEVICE pages
goldish_pipe: rename local pin_user_pages() routine
IB/umem: use get_user_pages_fast() to pin DMA pages
media/v4l2-core: set pages dirty upon releasing DMA buffers
vfio, mm: fix get_user_pages_remote() and FOLL_LONGTERM
mm/gup: introduce pin_user_pages*() and FOLL_PIN
goldish_pipe: convert to pin_user_pages() and put_user_page()
IB/{core,hw,umem}: set FOLL_PIN, FOLL_LONGTERM via
pin_longterm_pages*()
mm/process_vm_access: set FOLL_PIN via pin_user_pages_remote()
drm/via: set FOLL_PIN via pin_user_pages_fast()
fs/io_uring: set FOLL_PIN via pin_user_pages()
net/xdp: set FOLL_PIN via pin_user_pages()
mm/gup: track FOLL_PIN pages
media/v4l2-core: pin_longterm_pages (FOLL_PIN) and put_user_page()
conversion
vfio, mm: pin_longterm_pages (FOLL_PIN) and put_user_page() conversion
powerpc: book3s64: convert to pin_longterm_pages() and put_user_page()
mm/gup_benchmark: use proper FOLL_WRITE flags instead of hard-coding
"1"
mm/gup_benchmark: support pin_user_pages() and related calls
selftests/vm: run_vmtests: invoke gup_benchmark with basic FOLL_PIN
coverage
mm/gup: remove support for gup(FOLL_LONGTERM)
Documentation/core-api/index.rst | 1 +
Documentation/core-api/pin_user_pages.rst | 218 +++++++
arch/powerpc/mm/book3s64/iommu_api.c | 15 +-
drivers/gpu/drm/via/via_dmablit.c | 2 +-
drivers/infiniband/core/umem.c | 17 +-
drivers/infiniband/core/umem_odp.c | 24 +-
drivers/infiniband/hw/hfi1/user_pages.c | 4 +-
drivers/infiniband/hw/mthca/mthca_memfree.c | 3 +-
drivers/infiniband/hw/qib/qib_user_pages.c | 8 +-
drivers/infiniband/hw/qib/qib_user_sdma.c | 2 +-
drivers/infiniband/hw/usnic/usnic_uiom.c | 9 +-
drivers/infiniband/sw/siw/siw_mem.c | 5 +-
drivers/media/v4l2-core/videobuf-dma-sg.c | 10 +-
drivers/platform/goldfish/goldfish_pipe.c | 35 +-
drivers/vfio/vfio_iommu_type1.c | 35 +-
fs/io_uring.c | 5 +-
include/linux/mm.h | 164 +++++-
include/linux/mmzone.h | 2 +
include/linux/page_ref.h | 10 +
mm/gup.c | 608 ++++++++++++++++----
mm/gup_benchmark.c | 87 ++-
mm/huge_memory.c | 54 +-
mm/hugetlb.c | 39 +-
mm/memremap.c | 67 +--
mm/process_vm_access.c | 28 +-
mm/vmstat.c | 2 +
net/xdp/xdp_umem.c | 4 +-
tools/testing/selftests/vm/gup_benchmark.c | 34 +-
tools/testing/selftests/vm/run_vmtests | 22 +
29 files changed, 1180 insertions(+), 334 deletions(-)
create mode 100644 Documentation/core-api/pin_user_pages.rst
--
2.24.0
From: Petr Machata <petrm(a)mellanox.com>
[ Upstream commit 372809055f6c830ff978564e09f58bcb9e9b937c ]
Immediately after mlxsw module is probed and lldpad started, added APP
entries are briefly in "unknown" state before becoming "pending". That's
the state that lldpad_app_wait_set() typically sees, and since there are
no pending entries at that time, it bails out. However the entries have
not been pushed to the kernel yet at that point, and thus the test case
fails.
Fix by waiting for both unknown and pending entries to disappear before
proceeding.
Fixes: d159261f3662 ("selftests: mlxsw: Add test for trust-DSCP")
Signed-off-by: Petr Machata <petrm(a)mellanox.com>
Signed-off-by: Ido Schimmel <idosch(a)mellanox.com>
Signed-off-by: David S. Miller <davem(a)davemloft.net>
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
---
tools/testing/selftests/net/forwarding/lib.sh | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/tools/testing/selftests/net/forwarding/lib.sh b/tools/testing/selftests/net/forwarding/lib.sh
index ca53b539aa2d1..08bac6cf1bb3a 100644
--- a/tools/testing/selftests/net/forwarding/lib.sh
+++ b/tools/testing/selftests/net/forwarding/lib.sh
@@ -251,7 +251,7 @@ lldpad_app_wait_set()
{
local dev=$1; shift
- while lldptool -t -i $dev -V APP -c app | grep -q pending; do
+ while lldptool -t -i $dev -V APP -c app | grep -Eq "pending|unknown"; do
echo "$dev: waiting for lldpad to push pending APP updates"
sleep 5
done
--
2.20.1
On Tue, Nov 12, 2019 at 3:08 PM John Hubbard <jhubbard(a)nvidia.com> wrote:
>
> On 11/12/19 2:43 PM, Dan Williams wrote:
> ...
> > Ah, sorry. This was the first time I had looked at this series and
> > jumped in without reading the background.
> >
> > Your patch as is looks ok, I assume you've removed the FOLL_LONGTERM
> > warning in get_user_pages_remote in another patch?
> >
>
> Actually, I haven't gone quite that far. Actually this patch is the last
> change to that function. Therefore, at the end of this patchset,
> get_user_pages_remote() ends up with this check in it which
> is a less-restrictive version of the warning:
>
> /*
> * Current FOLL_LONGTERM behavior is incompatible with
> * FAULT_FLAG_ALLOW_RETRY because of the FS DAX check requirement on
> * vmas. However, this only comes up if locked is set, and there are
> * callers that do request FOLL_LONGTERM, but do not set locked. So,
> * allow what we can.
> */
> if (gup_flags & FOLL_LONGTERM) {
> if (WARN_ON_ONCE(locked))
> return -EINVAL;
> }
>
> Is that OK, or did you want to go further (possibly in a follow-up
> patchset, as I'm hoping to get this one in soon)?
That looks ok. Something to maybe push down into the core in a future
cleanup, but not something that needs to be done now.
> ...
> >>> I think check_vma_flags() should do the ((FOLL_LONGTERM | FOLL_GET) &&
> >>> vma_is_fsdax()) check and that would also remove the need for
> >>> __gup_longterm_locked.
> >>>
> >>
> >> Good idea, but there is still the call to check_and_migrate_cma_pages(),
> >> inside __gup_longterm_locked(). So it's a little more involved and
> >> we can't trivially delete __gup_longterm_locked() yet, right?
> >
> > [ add Aneesh ]
> >
> > Yes, you're right. I had overlooked that had snuck in there. That to
> > me similarly needs to be pushed down into the core with its own FOLL
> > flag, or it needs to be an explicit fixup that each caller does after
> > get_user_pages. The fact that migration silently happens as a side
> > effect of gup is too magical for my taste.
> >
>
> Yes. It's an intrusive side effect that is surprising, and not in a
> "happy surprise" way. :) . Fixing up the CMA pages by splitting that
> functionality into separate function calls sounds like an improvement
> worth exploring.
Right, future work.
Greetings,
Find attached email very confidential. reply for more details
Thanks.
Peter Wong
----------------------------------------------------
This email was sent by the shareware version of Postman Professional.
It is necessary to set fd to -1 when inotify_add_watch() fails in
cg_prepare_for_wait. Otherwise the fd which has been closed in
cg_prepare_for_wait may be misused in other functions such as
cg_enter_and_wait_for_frozen and cg_freeze_wait.
Fixes: 5313bfe425c8 ("selftests: cgroup: add freezer controller self-tests")
Signed-off-by: Hewenliang <hewenliang4(a)huawei.com>
---
tools/testing/selftests/cgroup/test_freezer.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/tools/testing/selftests/cgroup/test_freezer.c b/tools/testing/selftests/cgroup/test_freezer.c
index 0fc1b6d4b0f9..62a27ab3c2f3 100644
--- a/tools/testing/selftests/cgroup/test_freezer.c
+++ b/tools/testing/selftests/cgroup/test_freezer.c
@@ -72,6 +72,7 @@ static int cg_prepare_for_wait(const char *cgroup)
if (ret == -1) {
debug("Error: inotify_add_watch() failed\n");
close(fd);
+ fd = -1;
}
return fd;
--
2.19.1
Hi,
Changes since v1:
* Changed the function signature of __huge_pt_done() from int to void.
* Renamed __remove_refs_from_head() to put_compound_head().
* Improved the comment documentation in mm.h and gup.c
* Merged Documentation/vm/pin_user_pages.rst into the "introduce
FOLL_PIN" patch.
* Fixed Documentation/vm/pin_user_pages.rst:
* Fixed up a TODO about DAX.
* 31, not 32 bits total are available for counting
* Deleted some stale comments from the commit description of the
VFIO patch.
* Added Reviewed-by tags from Ira Weiny and Jens Axboe, and Acked-by
from Björn Töpel.
======================================================================
Original cover letter (edited to fix up the patch description numbers)
This applies cleanly to linux-next and mmotm, and also to linux.git if
linux-next's commit 20cac10710c9 ("mm/gup_benchmark: fix MAP_HUGETLB
case") is first applied there.
This provides tracking of dma-pinned pages. This is a prerequisite to
solving the larger problem of proper interactions between file-backed
pages, and [R]DMA activities, as discussed in [1], [2], [3], and in
a remarkable number of email threads since about 2017. :)
A new internal gup flag, FOLL_PIN is introduced, and thoroughly
documented in the last patch's Documentation/vm/pin_user_pages.rst.
I believe that this will provide a good starting point for doing the
layout lease work that Ira Weiny has been working on. That's because
these new wrapper functions provide a clean, constrained, systematically
named set of functionality that, again, is required in order to even
know if a page is "dma-pinned".
In contrast to earlier approaches, the page tracking can be
incrementally applied to the kernel call sites that, until now, have
been simply calling get_user_pages() ("gup"). In other words, opt-in by
changing from this:
get_user_pages() (sets FOLL_GET)
put_page()
to this:
pin_user_pages() (sets FOLL_PIN)
put_user_page()
Because there are interdependencies with FOLL_LONGTERM, a similar
conversion as for FOLL_PIN, was applied. The change was from this:
get_user_pages(FOLL_LONGTERM) (also sets FOLL_GET)
put_page()
to this:
pin_longterm_pages() (sets FOLL_PIN | FOLL_LONGTERM)
put_user_page()
============================================================
Patch summary:
* Patches 1-4: refactoring and preparatory cleanup, independent fixes
(Patch 4: V4L2-core bug fix (can be separately applied))
* Patch 5: introduce pin_user_pages(), FOLL_PIN, but no functional
changes yet
* Patches 6-11: Convert existing put_user_page() callers, to use the
new pin*()
* Patch 12: Activate tracking of FOLL_PIN pages.
* Patches 13-15: convert FOLL_LONGTERM callers
* Patches: 16-17: gup_benchmark and run_vmtests support
* Patch 18: enforce FOLL_LONGTERM as a gup-internal (only) flag
============================================================
Testing:
* I've done some overall kernel testing (LTP, and a few other goodies),
and some directed testing to exercise some of the changes. And as you
can see, gup_benchmark is enhanced to exercise this. Basically, I've been
able to runtime test the core get_user_pages() and pin_user_pages() and
related routines, but not so much on several of the call sites--but those
are generally just a couple of lines changed, each.
Not much of the kernel is actually using this, which on one hand
reduces risk quite a lot. But on the other hand, testing coverage
is low. So I'd love it if, in particular, the Infiniband and PowerPC
folks could do a smoke test of this series for me.
Also, my runtime testing for the call sites so far is very weak:
* io_uring: Some directed tests from liburing exercise this, and they pass.
* process_vm_access.c: A small directed test passes.
* gup_benchmark: the enhanced version hits the new gup.c code, and passes.
* infiniband (still only have crude "IB pingpong" working, on a
good day: it's not exercising my conversions at runtime...)
* VFIO: compiles (I'm vowing to set up a run time test soon, but it's
not ready just yet)
* powerpc: it compiles...
* drm/via: compiles...
* goldfish: compiles...
* net/xdp: compiles...
* media/v4l2: compiles...
============================================================
Next:
* Get the block/bio_vec sites converted to use pin_user_pages().
* Work with Ira and Dave Chinner to weave this together with the
layout lease stuff.
============================================================
[1] Some slow progress on get_user_pages() (Apr 2, 2019): https://lwn.net/Articles/784574/
[2] DMA and get_user_pages() (LPC: Dec 12, 2018): https://lwn.net/Articles/774411/
[3] The trouble with get_user_pages() (Apr 30, 2018): https://lwn.net/Articles/753027/
John Hubbard (18):
mm/gup: pass flags arg to __gup_device_* functions
mm/gup: factor out duplicate code from four routines
goldish_pipe: rename local pin_user_pages() routine
media/v4l2-core: set pages dirty upon releasing DMA buffers
mm/gup: introduce pin_user_pages*() and FOLL_PIN
goldish_pipe: convert to pin_user_pages() and put_user_page()
infiniband: set FOLL_PIN, FOLL_LONGTERM via pin_longterm_pages*()
mm/process_vm_access: set FOLL_PIN via pin_user_pages_remote()
drm/via: set FOLL_PIN via pin_user_pages_fast()
fs/io_uring: set FOLL_PIN via pin_user_pages()
net/xdp: set FOLL_PIN via pin_user_pages()
mm/gup: track FOLL_PIN pages
media/v4l2-core: pin_longterm_pages (FOLL_PIN) and put_user_page()
conversion
vfio, mm: pin_longterm_pages (FOLL_PIN) and put_user_page() conversion
powerpc: book3s64: convert to pin_longterm_pages() and put_user_page()
mm/gup_benchmark: support pin_user_pages() and related calls
selftests/vm: run_vmtests: invoke gup_benchmark with basic FOLL_PIN
coverage
mm/gup: remove support for gup(FOLL_LONGTERM)
Documentation/vm/index.rst | 1 +
Documentation/vm/pin_user_pages.rst | 212 +++++++
arch/powerpc/mm/book3s64/iommu_api.c | 15 +-
drivers/gpu/drm/via/via_dmablit.c | 2 +-
drivers/infiniband/core/umem.c | 5 +-
drivers/infiniband/core/umem_odp.c | 10 +-
drivers/infiniband/hw/hfi1/user_pages.c | 4 +-
drivers/infiniband/hw/mthca/mthca_memfree.c | 3 +-
drivers/infiniband/hw/qib/qib_user_pages.c | 8 +-
drivers/infiniband/hw/qib/qib_user_sdma.c | 2 +-
drivers/infiniband/hw/usnic/usnic_uiom.c | 9 +-
drivers/infiniband/sw/siw/siw_mem.c | 5 +-
drivers/media/v4l2-core/videobuf-dma-sg.c | 10 +-
drivers/platform/goldfish/goldfish_pipe.c | 35 +-
drivers/vfio/vfio_iommu_type1.c | 15 +-
fs/io_uring.c | 5 +-
include/linux/mm.h | 142 ++++-
include/linux/mmzone.h | 2 +
include/linux/page_ref.h | 10 +
mm/gup.c | 594 ++++++++++++++++----
mm/gup_benchmark.c | 81 ++-
mm/huge_memory.c | 32 +-
mm/hugetlb.c | 28 +-
mm/memremap.c | 4 +-
mm/process_vm_access.c | 28 +-
mm/vmstat.c | 2 +
net/xdp/xdp_umem.c | 4 +-
tools/testing/selftests/vm/gup_benchmark.c | 28 +-
tools/testing/selftests/vm/run_vmtests | 22 +
29 files changed, 1054 insertions(+), 264 deletions(-)
create mode 100644 Documentation/vm/pin_user_pages.rst
--
2.23.0
The current kunit execution model is to provide base kunit functionality
and tests built-in to the kernel. The aim of this series is to allow
building kunit itself and tests as modules. This in turn allows a
simple form of selective execution; load the module you wish to test.
In doing so, kunit itself (if also built as a module) will be loaded as
an implicit dependency.
Because this requires a core API modification - if a module delivers
multiple suites, they must be declared with the kunit_test_suites()
macro - we're proposing this patch set as a candidate to be applied to the
test tree before too many kunit consumers appear. We attempt to deal
with existing consumers in patch 4.
Changes since v2:
- moved string-stream.h header to lib/kunit/string-stream-impl.h (Brendan)
(patch 1)
- split out non-exported interfaces in try-catch-impl.h (Brendan)
(patch 2)
- added kunit_find_symbol() and KUNIT_INIT_*SYMBOL to lookup non-exported
symbols. KUNIT_INIT_*SYMBOL() is defined so that a mismatch between
local symbol definition and definition of symbol in target will trigger
a compilation error when the object is compiled built-in (Brendan)
(patches 3, 4)
- removed #ifdef MODULE around module licenses (Randy, Brendan, Andy)
(patch 4)
- replaced kunit_test_suite() with kunit_test_suites() rather than
supporting both (Brendan) (patch 4)
- lookup sysctl_hung_task_timeout_secs as kunit may be built as a module
and the symbol may not be available (patch 5)
- fixed whitespace issues in doc (patch 6)
Alan Maguire (6):
kunit: move string-stream.h to lib/kunit/string-stream-impl.h
kunit: hide unexported try-catch interface in try-catch-impl.h
kunit: add kunit_find_symbol() function for symbol lookup
kunit: allow kunit tests to be loaded as a module
kunit: allow kunit to be loaded as a module
kunit: update documentation to describe module-based build
Documentation/dev-tools/kunit/faq.rst | 3 +-
Documentation/dev-tools/kunit/index.rst | 3 +
Documentation/dev-tools/kunit/usage.rst | 16 +++++
include/kunit/assert.h | 3 +-
include/kunit/string-stream.h | 51 -------------
include/kunit/test.h | 123 +++++++++++++++++++++++++++++---
include/kunit/try-catch.h | 10 ---
kernel/sysctl-test.c | 4 +-
lib/Kconfig.debug | 2 +-
lib/kunit/Kconfig | 6 +-
lib/kunit/Makefile | 4 +-
lib/kunit/assert.c | 9 +++
lib/kunit/example-test.c | 4 +-
lib/kunit/string-stream-impl.h | 51 +++++++++++++
lib/kunit/string-stream-test.c | 46 ++++++++----
lib/kunit/string-stream.c | 3 +-
lib/kunit/test-test.c | 50 ++++++++++---
lib/kunit/test.c | 49 +++++++++++++
lib/kunit/try-catch-impl.h | 23 ++++++
lib/kunit/try-catch.c | 6 ++
20 files changed, 363 insertions(+), 103 deletions(-)
delete mode 100644 include/kunit/string-stream.h
create mode 100644 lib/kunit/string-stream-impl.h
create mode 100644 lib/kunit/try-catch-impl.h
--
1.8.3.1
When installing kselftests to its own directory and run the
test_lwt_ip_encap.sh it will complain that test_lwt_ip_encap.o can't be
found. Same with the test_tc_edt.sh test it will complain that
test_tc_edt.o can't be found.
$ ./test_lwt_ip_encap.sh
starting egress IPv4 encap test
Error opening object test_lwt_ip_encap.o: No such file or directory
Object hashing failed!
Cannot initialize ELF context!
Failed to parse eBPF program: Invalid argument
Rework to add test_lwt_ip_encap.o and test_tc_edt.o to TEST_FILES so the
object file gets installed when installing kselftest.
Fixes: 74b5a5968fe8 ("selftests/bpf: Replace test_progs and test_maps w/ general rule")
Signed-off-by: Anders Roxell <anders.roxell(a)linaro.org>
Acked-by: Song Liu <songliubraving(a)fb.com>
---
tools/testing/selftests/bpf/Makefile | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/tools/testing/selftests/bpf/Makefile b/tools/testing/selftests/bpf/Makefile
index b334a6db15c1..b03dc2298fea 100644
--- a/tools/testing/selftests/bpf/Makefile
+++ b/tools/testing/selftests/bpf/Makefile
@@ -38,7 +38,8 @@ TEST_GEN_PROGS += test_progs-bpf_gcc
endif
TEST_GEN_FILES =
-TEST_FILES =
+TEST_FILES = test_lwt_ip_encap.o \
+ test_tc_edt.o
# Order correspond to 'make run_tests' order
TEST_PROGS := test_kmod.sh \
--
2.20.1
When installing kselftests to its own directory and running the
test_lwt_ip_encap.sh it will complain that test_lwt_ip_encap.o can't be
find.
$ ./test_lwt_ip_encap.sh
starting egress IPv4 encap test
Error opening object test_lwt_ip_encap.o: No such file or directory
Object hashing failed!
Cannot initialize ELF context!
Failed to parse eBPF program: Invalid argument
Rework to add test_lwt_ip_encap.o to TEST_FILES so the object file gets
installed when installing kselftest.
Fixes: 74b5a5968fe8 ("selftests/bpf: Replace test_progs and test_maps w/ general rule")
Signed-off-by: Anders Roxell <anders.roxell(a)linaro.org>
Acked-by: Song Liu <songliubraving(a)fb.com>
---
tools/testing/selftests/bpf/Makefile | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/tools/testing/selftests/bpf/Makefile b/tools/testing/selftests/bpf/Makefile
index b334a6db15c1..cc09b5df9403 100644
--- a/tools/testing/selftests/bpf/Makefile
+++ b/tools/testing/selftests/bpf/Makefile
@@ -38,7 +38,7 @@ TEST_GEN_PROGS += test_progs-bpf_gcc
endif
TEST_GEN_FILES =
-TEST_FILES =
+TEST_FILES = test_lwt_ip_encap.o
# Order correspond to 'make run_tests' order
TEST_PROGS := test_kmod.sh \
--
2.20.1
When installing kselftests to its own directory and running the
test_lwt_ip_encap.sh it will complain that test_lwt_ip_encap.o can't be
find.
$ ./test_lwt_ip_encap.sh
starting egress IPv4 encap test
Error opening object test_lwt_ip_encap.o: No such file or directory
Object hashing failed!
Cannot initialize ELF context!
Failed to parse eBPF program: Invalid argument
Rework to add test_lwt_ip_encap.o to TEST_FILES so the object file gets
installed when installing kselftest.
Fixes: 74b5a5968fe8 ("selftests/bpf: Replace test_progs and test_maps w/ general rule")
Signed-off-by: Anders Roxell <anders.roxell(a)linaro.org>
---
tools/testing/selftests/bpf/Makefile | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/tools/testing/selftests/bpf/Makefile b/tools/testing/selftests/bpf/Makefile
index b334a6db15c1..cc09b5df9403 100644
--- a/tools/testing/selftests/bpf/Makefile
+++ b/tools/testing/selftests/bpf/Makefile
@@ -38,7 +38,7 @@ TEST_GEN_PROGS += test_progs-bpf_gcc
endif
TEST_GEN_FILES =
-TEST_FILES =
+TEST_FILES = test_lwt_ip_encap.o
# Order correspond to 'make run_tests' order
TEST_PROGS := test_kmod.sh \
--
2.20.1
From: Breno Leitao <leitao(a)debian.org>
[ Upstream commit 44d947eff19d64384efc06069509db7a0a1103b0 ]
There are cases where the test is not expecting to have the transaction
aborted, but, the test process might have been rescheduled, either in the
OS level or by KVM (if it is running on a KVM guest machine). The process
reschedule will cause a treclaim/recheckpoint which will cause the
transaction to doom, aborting the transaction as soon as the process is
rescheduled back to the CPU. This might cause the test to fail, but this is
not a failure in essence.
If that is the case, TEXASR[FC] is indicated with either
TM_CAUSE_RESCHEDULE or TM_CAUSE_KVM_RESCHEDULE for KVM interruptions.
In this scenario, ignore these two failures and avoid the whole test to
return failure.
Signed-off-by: Breno Leitao <leitao(a)debian.org>
Reviewed-by: Gustavo Romero <gromero(a)linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe(a)ellerman.id.au>
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
---
tools/testing/selftests/powerpc/tm/tm-unavailable.c | 9 ++++++---
tools/testing/selftests/powerpc/tm/tm.h | 9 +++++++++
2 files changed, 15 insertions(+), 3 deletions(-)
diff --git a/tools/testing/selftests/powerpc/tm/tm-unavailable.c b/tools/testing/selftests/powerpc/tm/tm-unavailable.c
index 156c8e750259b..09894f4ff62e6 100644
--- a/tools/testing/selftests/powerpc/tm/tm-unavailable.c
+++ b/tools/testing/selftests/powerpc/tm/tm-unavailable.c
@@ -236,7 +236,8 @@ void *tm_una_ping(void *input)
}
/* Check if we were not expecting a failure and a it occurred. */
- if (!expecting_failure() && is_failure(cr_)) {
+ if (!expecting_failure() && is_failure(cr_) &&
+ !failure_is_reschedule()) {
printf("\n\tUnexpected transaction failure 0x%02lx\n\t",
failure_code());
return (void *) -1;
@@ -244,9 +245,11 @@ void *tm_una_ping(void *input)
/*
* Check if TM failed due to the cause we were expecting. 0xda is a
- * TM_CAUSE_FAC_UNAV cause, otherwise it's an unexpected cause.
+ * TM_CAUSE_FAC_UNAV cause, otherwise it's an unexpected cause, unless
+ * it was caused by a reschedule.
*/
- if (is_failure(cr_) && !failure_is_unavailable()) {
+ if (is_failure(cr_) && !failure_is_unavailable() &&
+ !failure_is_reschedule()) {
printf("\n\tUnexpected failure cause 0x%02lx\n\t",
failure_code());
return (void *) -1;
diff --git a/tools/testing/selftests/powerpc/tm/tm.h b/tools/testing/selftests/powerpc/tm/tm.h
index df4204247d45c..5518b1d4ef8b2 100644
--- a/tools/testing/selftests/powerpc/tm/tm.h
+++ b/tools/testing/selftests/powerpc/tm/tm.h
@@ -52,6 +52,15 @@ static inline bool failure_is_unavailable(void)
return (failure_code() & TM_CAUSE_FAC_UNAV) == TM_CAUSE_FAC_UNAV;
}
+static inline bool failure_is_reschedule(void)
+{
+ if ((failure_code() & TM_CAUSE_RESCHED) == TM_CAUSE_RESCHED ||
+ (failure_code() & TM_CAUSE_KVM_RESCHED) == TM_CAUSE_KVM_RESCHED)
+ return true;
+
+ return false;
+}
+
static inline bool failure_is_nesting(void)
{
return (__builtin_get_texasru() & 0x400000);
--
2.20.1
hello,
i got a warning during kernel compilation.
net/core/skbuff.o: warning: objtool: skb_push.cold()+0x15: unreachable instruction
related clips...
--------------------x----------------x---------------------
(gdb) l skb_push.cold
1880 void *skb_push(struct sk_buff *skb, unsigned int len)
1881 {
1882 skb->data -= len;
1883 skb->len += len;
1884 if (unlikely(skb->data < skb->head))
1885 skb_under_panic(skb, len, __builtin_return_address(0));
1886 return skb->data;
1887 }
1888 EXPORT_SYMBOL(skb_push);
1889
(gdb) l *0xffffffff815ffc8e
0xffffffff815ffc8e is in skb_push (net/core/skbuff.c:1885).
1880 void *skb_push(struct sk_buff *skb, unsigned int len)
1881 {
1882 skb->data -= len;
1883 skb->len += len;
1884 if (unlikely(skb->data < skb->head))
1885 skb_under_panic(skb, len, __builtin_return_address(0));
1886 return skb->data;
1887 }
1888 EXPORT_SYMBOL(skb_push);
1889
(gdb)
------------------x-----------------------x------------------------------------
$uname -a
Linux debian 5.4.0-rc1+ #1 SMP Sat Nov 9 21:29:48 IST 2019 x86_64 GNU/Linux
$
this kernel is from linux-kselftest tree
---------------------------x-------------x------------------------------
$gcc --version
gcc (Debian 9.2.1-14) 9.2.1 20191025
Copyright (C) 2019 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
-------------------------x-------------------x-----------------------------
Linux debian 5.4.0-rc1+ #1 SMP Sat Nov 9 21:29:48 IST 2019 x86_64 GNU/Linux
GNU Make 4.2.1
Binutils 2.33.1
Util-linux 2.33.1
Mount 2.33.1
Linux C Library 2.29
Dynamic linker (ldd) 2.29
Procps 3.3.15
Kbd 2.0.4
Console-tools 2.0.4
Sh-utils 8.30
Udev 241
-------------------------------x-----------------x----------------------------
--
software engineer
rajagiri school of engineering and technology
hello,
i just got some mail deleted. so incase some of you have
send mail to me after https://lkml.org/lkml/2019/11/4/824
please kindly resend it to me.
I have also noticed the error in 5.4.0-rc1+ of linux-kselftest tree.
iam attaching two related files again.
--
software engineer
rajagiri school of engineering and technology
From: Nicolas Geoffray <ngeoffray(a)google.com>
F_SEAL_FUTURE_WRITE has unexpected behavior when used with MAP_PRIVATE:
A private mapping created after the memfd file that gets sealed with
F_SEAL_FUTURE_WRITE loses the copy-on-write at fork behavior, meaning
children and parent share the same memory, even though the mapping is
private.
The reason for this is due to the code below:
static int shmem_mmap(struct file *file, struct vm_area_struct *vma)
{
struct shmem_inode_info *info = SHMEM_I(file_inode(file));
if (info->seals & F_SEAL_FUTURE_WRITE) {
/*
* New PROT_WRITE and MAP_SHARED mmaps are not allowed when
* "future write" seal active.
*/
if ((vma->vm_flags & VM_SHARED) && (vma->vm_flags & VM_WRITE))
return -EPERM;
/*
* Since the F_SEAL_FUTURE_WRITE seals allow for a MAP_SHARED
* read-only mapping, take care to not allow mprotect to revert
* protections.
*/
vma->vm_flags &= ~(VM_MAYWRITE);
}
...
}
And for the mm to know if a mapping is copy-on-write:
static inline bool is_cow_mapping(vm_flags_t flags)
{
return (flags & (VM_SHARED | VM_MAYWRITE)) == VM_MAYWRITE;
}
The patch fixes the issue by making the mprotect revert protection
happen only for shared mappings. For private mappings, using mprotect
will have no effect on the seal behavior.
Cc: kernel-team(a)android.com
Signed-off-by: Nicolas Geoffray <ngeoffray(a)google.com>
Signed-off-by: Joel Fernandes (Google) <joel(a)joelfernandes.org>
---
Google bug: 143833776
mm/shmem.c | 11 +++++++----
1 file changed, 7 insertions(+), 4 deletions(-)
diff --git a/mm/shmem.c b/mm/shmem.c
index 447fd575587c..6ac5e867ef13 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -2214,11 +2214,14 @@ static int shmem_mmap(struct file *file, struct vm_area_struct *vma)
return -EPERM;
/*
- * Since the F_SEAL_FUTURE_WRITE seals allow for a MAP_SHARED
- * read-only mapping, take care to not allow mprotect to revert
- * protections.
+ * Since an F_SEAL_FUTURE_WRITE sealed memfd can be mapped as
+ * MAP_SHARED and read-only, take care to not allow mprotect to
+ * revert protections on such mappings. Do this only for shared
+ * mappings. For private mappings, don't need to mask VM_MAYWRITE
+ * as we still want them to be COW-writable.
*/
- vma->vm_flags &= ~(VM_MAYWRITE);
+ if (vma->vm_flags & VM_SHARED)
+ vma->vm_flags &= ~(VM_MAYWRITE);
}
file_accessed(file);
--
2.24.0.rc1.363.gb1bccd3e3d-goog
Hi
this patchset aims to add the initial arch-specific arm64 support to
kselftest starting with signals-related test-cases.
This series is based on v5.4-rc4.
A common internal test-case layout is proposed for signal tests and it is
wired-up to the toplevel kselftest Makefile, so that it should be possible
at the end to run it on an arm64 target in the usual way with KSFT.
~/linux# make TARGETS=arm64 kselftest
New KSFT arm64 testcases live inside tools/testing/selftests/arm64 grouped
by family inside subdirectories: arm64/signal is the first family proposed
with this series.
This series converts also to this subdirectory scheme the pre-existing
KSFT arm64 tags tests (already merged in v5.3), moving them into their own
arm64/tags subdirectory.
Thanks
Cristian
Notes:
-----
- further details in the included READMEs
- more tests still to be written (current strategy is going through the
related Kernel signal-handling code and write a test for each possible
and sensible code-path)
A few ideas for more TODO testcases:
- mangle_pstate_invalid_ssbs_regs: mess with SSBS bits on every
possible configured behavior
- fake_sigreturn_unmapped_sp: SP into unmapped addrs
- fake_sigreturn_kernelspace_sp: SP into kernel addrs
- fake_sigreturn_sve_bad_extra_context: SVE extra context badly formed
- fake_sigreturn_misaligned_sp_4: misaligned SP by 4
(i.e., __alignof__(struct _aarch64_ctx))
- fake_sigreturn_misaligned_sp_8: misaligned SP by 8
(i.e., sizeof(struct _aarch64_ctx))
- fake_sigreturn_bad_size_non_aligned: a size that doesn't overflow
__reserved[], but is not a multiple of 16
- fake_sigreturn_bad_size_tiny: a size that is less than 16
- fake_sigreturn_bad_size_overflow_tiny: a size that does overflow
__reserved[], but by less than 16 bytes?
- mangle_sve_invalid_extra_context: SVE extra_context invalid
- SVE signal testcases and special handling will be part of an additional
patch still to be released
- KSFT arm64 tags test patch
https://lore.kernel.org/linux-arm-kernel/c1e6aad230658bc175b42d92daeff2e300…
is relocated into its own directory under tools/testing/selftests/arm64/tags
Changes:
--------
v9 --> v10:
- rebased on v5.4-rc4
- removed some test_init stale code related to PAN/UAO
(not used nor needed and wrong)
v8-->v9:
- fixed a couple of misplaced .gitignore
v7-->v8:
- removed SSBS test case
- split remnants of SSBS patch (v7 05/11), containing some helpers,
into two distinct patches
v6-->v7:
- rebased on v5.4-rc2
- renamed SUBTARGETS arm64/ toplevel Makefile ENV to ARM64_SUBTARGETS
- fixed fake_sigreturn alignment routines (off by one)
- fixed SSBS test: avoid using MRS/MSR as whole and SKIP when SSBS not
supported
- reporting KSFT_SKIP when needed (usually if test_init(0 fails)
- using ID_AA64PFR1_EL1.SSBS to check SSBS support instead of HWCAP_SSBS
v5-->v6:
- added arm64 toplevel Makefile SUBTARGETS env var to be able to selectively
build only some arm64/ tests subdirectories
- removed unneed toplevel Makefile exports and fixed Copyright
- better checks for supported features and features names helpers
- converted some run-time critical assert() to abort() to avoid
issues when -NDEBUG is set
- default_handler() signal handler refactored and split
- using SIGTRAP for get_current_context()
- use volatile where proper
- refactor and relocate test_init() invocation
- review usage of MRS SSBS instructions depending on HW_SSBS
- cleanup fake_sigreturn trampoline
- cleanup get_starting_header helper
- avoiding timeout test failures wherever possible (fail immediately
if possible)
v4-->v5:
- rebased on arm64/for-next-core merging 01/11 with KSFT tags tests:
commit 9ce1263033cd ("selftests, arm64: add a selftest for passing tagged pointers to kernel")
- moved .gitignore up on elevel
- moved kernel header search mechanism into KSFT arm64 toplevel Makefile
so that it can be used easily also by each arm64 KSFT subsystem inside
subdirs of arm64
v3-->v4:
- rebased on v5.3-rc6
- added test descriptions
- fixed commit messages (imperative mood)
- added missing includes and removed unneeded ones
- added/used new get_starting_head() helper
- fixed/simplified signal.S::fakke_sigreturn()
- added set_regval() macro and .init initialization func
- better synchonization in get_current_context()
- macroization of mangle_pstate_invalid_mode_el
- split mangle_pstate_invalid_mode_el h/t
- removed standalone mode
- simplified CPU features checks
- fixed/refactored get_header() and validation routines
- simplfied docs
v2-->v3:
- rebased on v5.3-rc2
- better test result characterization looking for
SEGV_ACCERR in si_code on SIGSEGV
- using KSFT Framework macros for retvalues
- removed SAFE_WRITE()/dump_uc: buggy, un-needed and unused
- reviewed generation process of test_arm64_signals.sh runner script
- re-added a fixed fake_sigreturn_misaligned_sp testcase and a properly
extended fake_sigreturn() helper
- added tests' TODO notes
v1-->v2:
- rebased on 5.2-rc7
- various makefile's cleanups
- mixed READMEs fixes
- fixed test_arm64_signals.sh runner script
- cleaned up assembly code in signal.S
- improved get_current_context() logic
- fixed SAFE_WRITE()
- common support code split into more chunks, each one introduced when
needed by some new testcases
- fixed some headers validation routines in testcases.c
- removed some still broken/immature tests:
+ fake_sigreturn_misaligned
+ fake_sigreturn_overflow_reserved
+ mangle_pc_invalid
+ mangle_sp_misaligned
- fixed some other testcases:
+ mangle_pstate_ssbs_regs: better checks of SSBS bit when feature unsupported
+ mangle_pstate_invalid_compat_toggle: name fix
+ mangle_pstate_invalid_mode_el[1-3]: precautionary zeroing PSTATE.MODE
+ fake_sigreturn_bad_magic, fake_sigreturn_bad_size,
fake_sigreturn_bad_size_for_magic0:
- accounting for available space...dropping extra when needed
- keeping alignent
- new testcases on FPSMID context:
+ fake_sigreturn_missing_fpsimd
+ fake_sigreturn_duplicated_fpsimd
Cristian Marussi (12):
kselftest: arm64: extend toplevel skeleton Makefile
kselftest: arm64: mangle_pstate_invalid_compat_toggle and common utils
kselftest: arm64: mangle_pstate_invalid_daif_bits
kselftest: arm64: mangle_pstate_invalid_mode_el[123][ht]
kselftest: arm64: extend test_init functionalities
kselftest: arm64: add helper get_current_context
kselftest: arm64: fake_sigreturn_bad_magic
kselftest: arm64: fake_sigreturn_bad_size_for_magic0
kselftest: arm64: fake_sigreturn_missing_fpsimd
kselftest: arm64: fake_sigreturn_duplicated_fpsimd
kselftest: arm64: fake_sigreturn_bad_size
kselftest: arm64: fake_sigreturn_misaligned_sp
tools/testing/selftests/Makefile | 1 +
tools/testing/selftests/arm64/Makefile | 64 +++-
tools/testing/selftests/arm64/README | 25 ++
.../testing/selftests/arm64/signal/.gitignore | 3 +
tools/testing/selftests/arm64/signal/Makefile | 32 ++
tools/testing/selftests/arm64/signal/README | 59 ++++
.../testing/selftests/arm64/signal/signals.S | 64 ++++
.../selftests/arm64/signal/test_signals.c | 29 ++
.../selftests/arm64/signal/test_signals.h | 100 ++++++
.../arm64/signal/test_signals_utils.c | 328 ++++++++++++++++++
.../arm64/signal/test_signals_utils.h | 120 +++++++
.../testcases/fake_sigreturn_bad_magic.c | 52 +++
.../testcases/fake_sigreturn_bad_size.c | 77 ++++
.../fake_sigreturn_bad_size_for_magic0.c | 46 +++
.../fake_sigreturn_duplicated_fpsimd.c | 50 +++
.../testcases/fake_sigreturn_misaligned_sp.c | 37 ++
.../testcases/fake_sigreturn_missing_fpsimd.c | 50 +++
.../mangle_pstate_invalid_compat_toggle.c | 31 ++
.../mangle_pstate_invalid_daif_bits.c | 35 ++
.../mangle_pstate_invalid_mode_el1h.c | 15 +
.../mangle_pstate_invalid_mode_el1t.c | 15 +
.../mangle_pstate_invalid_mode_el2h.c | 15 +
.../mangle_pstate_invalid_mode_el2t.c | 15 +
.../mangle_pstate_invalid_mode_el3h.c | 15 +
.../mangle_pstate_invalid_mode_el3t.c | 15 +
.../mangle_pstate_invalid_mode_template.h | 28 ++
.../arm64/signal/testcases/testcases.c | 196 +++++++++++
.../arm64/signal/testcases/testcases.h | 104 ++++++
.../selftests/arm64/{ => tags}/.gitignore | 0
tools/testing/selftests/arm64/tags/Makefile | 7 +
.../arm64/{ => tags}/run_tags_test.sh | 0
.../selftests/arm64/{ => tags}/tags_test.c | 0
32 files changed, 1623 insertions(+), 5 deletions(-)
create mode 100644 tools/testing/selftests/arm64/README
create mode 100644 tools/testing/selftests/arm64/signal/.gitignore
create mode 100644 tools/testing/selftests/arm64/signal/Makefile
create mode 100644 tools/testing/selftests/arm64/signal/README
create mode 100644 tools/testing/selftests/arm64/signal/signals.S
create mode 100644 tools/testing/selftests/arm64/signal/test_signals.c
create mode 100644 tools/testing/selftests/arm64/signal/test_signals.h
create mode 100644 tools/testing/selftests/arm64/signal/test_signals_utils.c
create mode 100644 tools/testing/selftests/arm64/signal/test_signals_utils.h
create mode 100644 tools/testing/selftests/arm64/signal/testcases/fake_sigreturn_bad_magic.c
create mode 100644 tools/testing/selftests/arm64/signal/testcases/fake_sigreturn_bad_size.c
create mode 100644 tools/testing/selftests/arm64/signal/testcases/fake_sigreturn_bad_size_for_magic0.c
create mode 100644 tools/testing/selftests/arm64/signal/testcases/fake_sigreturn_duplicated_fpsimd.c
create mode 100644 tools/testing/selftests/arm64/signal/testcases/fake_sigreturn_misaligned_sp.c
create mode 100644 tools/testing/selftests/arm64/signal/testcases/fake_sigreturn_missing_fpsimd.c
create mode 100644 tools/testing/selftests/arm64/signal/testcases/mangle_pstate_invalid_compat_toggle.c
create mode 100644 tools/testing/selftests/arm64/signal/testcases/mangle_pstate_invalid_daif_bits.c
create mode 100644 tools/testing/selftests/arm64/signal/testcases/mangle_pstate_invalid_mode_el1h.c
create mode 100644 tools/testing/selftests/arm64/signal/testcases/mangle_pstate_invalid_mode_el1t.c
create mode 100644 tools/testing/selftests/arm64/signal/testcases/mangle_pstate_invalid_mode_el2h.c
create mode 100644 tools/testing/selftests/arm64/signal/testcases/mangle_pstate_invalid_mode_el2t.c
create mode 100644 tools/testing/selftests/arm64/signal/testcases/mangle_pstate_invalid_mode_el3h.c
create mode 100644 tools/testing/selftests/arm64/signal/testcases/mangle_pstate_invalid_mode_el3t.c
create mode 100644 tools/testing/selftests/arm64/signal/testcases/mangle_pstate_invalid_mode_template.h
create mode 100644 tools/testing/selftests/arm64/signal/testcases/testcases.c
create mode 100644 tools/testing/selftests/arm64/signal/testcases/testcases.h
rename tools/testing/selftests/arm64/{ => tags}/.gitignore (100%)
create mode 100644 tools/testing/selftests/arm64/tags/Makefile
rename tools/testing/selftests/arm64/{ => tags}/run_tags_test.sh (100%)
rename tools/testing/selftests/arm64/{ => tags}/tags_test.c (100%)
--
2.17.1
Hello,
Is there anything blocking this from getting merged?
This patch series fixes the following problem:
linux# make kselftest TARGETS=bpf O=/mnt/linux-build
# selftests: bpf: test_libbpf.sh
# ./test_libbpf.sh: line 23: ./test_libbpf_open: No such file or directory
# test_libbpf: failed at file test_l4lb.o
# selftests: test_libbpf [FAILED]
Patch 1 appends / to $(OUTPUT) in order to make it more uniform with the
rest of the tree.
Patch 2 fixes the problem by prepending $(OUTPUT) to all members of
$(TEST_PROGS).
v1->v2:
- Append / to $(OUTPUT).
- Use $(addprefix) instead of $(foreach).
v2->v3:
- Split the patch in two.
- Improve the commit message.
Ilya Leoshkevich (2):
selftests: append / to $(OUTPUT)
selftests: fix prepending $(OUTPUT) to $(TEST_PROGS)
tools/testing/selftests/Makefile | 16 ++++++++--------
tools/testing/selftests/lib.mk | 3 ++-
2 files changed, 10 insertions(+), 9 deletions(-)
--
2.23.0
The livepatch selftests compare expected dmesg output to verify kernel
behavior. They currently filter out "tainting kernel with
TAINT_LIVEPATCH" messages which may be logged when loading livepatch
modules.
Further filter the log to also drop "loading out-of-tree module taints
kernel" messages in case the klp_test modules have been build without
the in-tree module flag.
Signed-off-by: Joe Lawrence <joe.lawrence(a)redhat.com>
---
Note: I stumbled across this in a testing scenario and thought it might
be generally useful to extend this admittedly fragile mechanism. Since
there are no related livepatch-core changes, this can go through Shuah's
kselftest tree if she prefers. -- Joe
tools/testing/selftests/livepatch/functions.sh | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/tools/testing/selftests/livepatch/functions.sh b/tools/testing/selftests/livepatch/functions.sh
index 79b0affd21fb..57975c323542 100644
--- a/tools/testing/selftests/livepatch/functions.sh
+++ b/tools/testing/selftests/livepatch/functions.sh
@@ -221,7 +221,7 @@ function check_result {
local expect="$*"
local result
- result=$(dmesg | grep -v 'tainting' | grep -e 'livepatch:' -e 'test_klp' | sed 's/^\[[ 0-9.]*\] //')
+ result=$(dmesg | grep -ve '\<taints\>' -ve '\<tainting\>' | grep -e 'livepatch:' -e 'test_klp' | sed 's/^\[[ 0-9.]*\] //')
if [[ "$expect" == "$result" ]] ; then
echo "ok"
--
2.21.0
Hi,
This applies cleanly to linux-next and mmotm, and also to linux.git if
linux-next's commit 20cac10710c9 ("mm/gup_benchmark: fix MAP_HUGETLB
case") is first applied there.
This provides tracking of dma-pinned pages. This is a prerequisite to
solving the larger problem of proper interactions between file-backed
pages, and [R]DMA activities, as discussed in [1], [2], [3], and in
a remarkable number of email threads since about 2017. :)
A new internal gup flag, FOLL_PIN is introduced, and thoroughly
documented in the last patch's Documentation/vm/pin_user_pages.rst.
I believe that this will provide a good starting point for doing the
layout lease work that Ira Weiny has been working on. That's because
these new wrapper functions provide a clean, constrained, systematically
named set of functionality that, again, is required in order to even
know if a page is "dma-pinned".
In contrast to earlier approaches, the page tracking can be
incrementally applied to the kernel call sites that, until now, have
been simply calling get_user_pages() ("gup"). In other words, opt-in by
changing from this:
get_user_pages() (sets FOLL_GET)
put_page()
to this:
pin_user_pages() (sets FOLL_PIN)
put_user_page()
Because there are interdependencies with FOLL_LONGTERM, a similar
conversion as for FOLL_PIN, was applied. The change was from this:
get_user_pages(FOLL_LONGTERM) (also sets FOLL_GET)
put_page()
to this:
pin_longterm_pages() (sets FOLL_PIN | FOLL_LONGTERM)
put_user_page()
============================================================
Patch summary:
* Patches 1-4: refactoring and preparatory cleanup, independent fixes
(Patch 4: V4L2-core bug fix (can be separately applied))
* Patch 5: introduce pin_user_pages(), FOLL_PIN, but no functional
changes yet
* Patches 6-11: Convert existing put_user_page() callers, to use the
new pin*()
* Patch 12: Activate tracking of FOLL_PIN pages.
* Patches 13-15: convert FOLL_LONGTERM callers
* Patches: 16-17: gup_benchmark and run_vmtests support
* Patch 18: enforce FOLL_LONGTERM as a gup-internal (only) flag
* Patch 19: Documentation/vm/pin_user_pages.rst
============================================================
Testing:
* I've done some overall kernel testing (LTP, and a few other goodies),
and some directed testing to exercise some of the changes. And as you
can see, gup_benchmark is enhanced to exercise this. Basically, I've been
able to runtime test the core get_user_pages() and pin_user_pages() and
related routines, but not so much on several of the call sites--but those
are generally just a couple of lines changed, each.
Not much of the kernel is actually using this, which on one hand
reduces risk quite a lot. But on the other hand, testing coverage
is low. So I'd love it if, in particular, the Infiniband and PowerPC
folks could do a smoke test of this series for me.
Also, my runtime testing for the call sites so far is very weak:
* io_uring: Some directed tests from liburing exercise this, and they pass.
* process_vm_access.c: A small directed test passes.
* gup_benchmark: the enhanced version hits the new gup.c code, and passes.
* infiniband (still only have crude "IB pingpong" working, on a
good day: it's not exercising my conversions at runtime...)
* VFIO: compiles (I'm vowing to set up a run time test soon, but it's
not ready just yet)
* powerpc: it compiles...
* drm/via: compiles...
* goldfish: compiles...
* net/xdp: compiles...
* media/v4l2: compiles...
============================================================
Next:
* Get the block/bio_vec sites converted to use pin_user_pages().
* Work with Ira and Dave Chinner to weave this together with the
layout lease stuff.
============================================================
[1] Some slow progress on get_user_pages() (Apr 2, 2019): https://lwn.net/Articles/784574/
[2] DMA and get_user_pages() (LPC: Dec 12, 2018): https://lwn.net/Articles/774411/
[3] The trouble with get_user_pages() (Apr 30, 2018): https://lwn.net/Articles/753027/
John Hubbard (19):
mm/gup: pass flags arg to __gup_device_* functions
mm/gup: factor out duplicate code from four routines
goldish_pipe: rename local pin_user_pages() routine
media/v4l2-core: set pages dirty upon releasing DMA buffers
mm/gup: introduce pin_user_pages*() and FOLL_PIN
goldish_pipe: convert to pin_user_pages() and put_user_page()
infiniband: set FOLL_PIN, FOLL_LONGTERM via pin_longterm_pages*()
mm/process_vm_access: set FOLL_PIN via pin_user_pages_remote()
drm/via: set FOLL_PIN via pin_user_pages_fast()
fs/io_uring: set FOLL_PIN via pin_user_pages()
net/xdp: set FOLL_PIN via pin_user_pages()
mm/gup: track FOLL_PIN pages
media/v4l2-core: pin_longterm_pages (FOLL_PIN) and put_user_page()
conversion
vfio, mm: pin_longterm_pages (FOLL_PIN) and put_user_page() conversion
powerpc: book3s64: convert to pin_longterm_pages() and put_user_page()
mm/gup_benchmark: support pin_user_pages() and related calls
selftests/vm: run_vmtests: invoke gup_benchmark with basic FOLL_PIN
coverage
mm/gup: remove support for gup(FOLL_LONGTERM)
Documentation/vm: add pin_user_pages.rst
Documentation/vm/index.rst | 1 +
Documentation/vm/pin_user_pages.rst | 213 +++++++
arch/powerpc/mm/book3s64/iommu_api.c | 15 +-
drivers/gpu/drm/via/via_dmablit.c | 2 +-
drivers/infiniband/core/umem.c | 5 +-
drivers/infiniband/core/umem_odp.c | 10 +-
drivers/infiniband/hw/hfi1/user_pages.c | 4 +-
drivers/infiniband/hw/mthca/mthca_memfree.c | 3 +-
drivers/infiniband/hw/qib/qib_user_pages.c | 8 +-
drivers/infiniband/hw/qib/qib_user_sdma.c | 2 +-
drivers/infiniband/hw/usnic/usnic_uiom.c | 9 +-
drivers/infiniband/sw/siw/siw_mem.c | 5 +-
drivers/media/v4l2-core/videobuf-dma-sg.c | 10 +-
drivers/platform/goldfish/goldfish_pipe.c | 35 +-
drivers/vfio/vfio_iommu_type1.c | 15 +-
fs/io_uring.c | 5 +-
include/linux/mm.h | 133 ++++-
include/linux/mmzone.h | 2 +
include/linux/page_ref.h | 10 +
mm/gup.c | 622 ++++++++++++++++----
mm/gup_benchmark.c | 81 ++-
mm/huge_memory.c | 32 +-
mm/hugetlb.c | 28 +-
mm/memremap.c | 4 +-
mm/process_vm_access.c | 28 +-
mm/vmstat.c | 2 +
net/xdp/xdp_umem.c | 4 +-
tools/testing/selftests/vm/gup_benchmark.c | 28 +-
tools/testing/selftests/vm/run_vmtests | 22 +
29 files changed, 1066 insertions(+), 272 deletions(-)
create mode 100644 Documentation/vm/pin_user_pages.rst
--
2.23.0
Greetings,
Find the attached mail very confidential. reply for more details
Thanks.
Peter Wong
----------------------------------------------------
This email was sent by the shareware version of Postman Professional.
Verify that in this scenario
------------------------ N2
| |
------ ------ N3 ----
| R1 | | R2 |------|H2|
------ ------ ----
| |
------------------------ N1
|
----
|H1|
----
where H1's default route goes through R1 and R1's default route goes
through R2 over N2, traceroute6 from H1 to H2 reports R2's address
on N2 and not N1.
Signed-off-by: Francesco Ruggeri <fruggeri(a)arista.com>
---
tools/testing/selftests/net/Makefile | 1 +
.../testing/selftests/net/icmp6_reply_addr.sh | 159 ++++++++++++++++++
2 files changed, 160 insertions(+)
create mode 100755 tools/testing/selftests/net/icmp6_reply_addr.sh
diff --git a/tools/testing/selftests/net/Makefile b/tools/testing/selftests/net/Makefile
index 0bd6b23c97ef..daeaeb59d5ca 100644
--- a/tools/testing/selftests/net/Makefile
+++ b/tools/testing/selftests/net/Makefile
@@ -11,6 +11,7 @@ TEST_PROGS += udpgso_bench.sh fib_rule_tests.sh msg_zerocopy.sh psock_snd.sh
TEST_PROGS += udpgro_bench.sh udpgro.sh test_vxlan_under_vrf.sh reuseport_addr_any.sh
TEST_PROGS += test_vxlan_fdb_changelink.sh so_txtime.sh ipv6_flowlabel.sh
TEST_PROGS += tcp_fastopen_backup_key.sh fcnal-test.sh l2tp.sh
+TEST_PROGS += icmp6_reply_addr.sh
TEST_PROGS_EXTENDED := in_netns.sh
TEST_GEN_FILES = socket nettest
TEST_GEN_FILES += psock_fanout psock_tpacket msg_zerocopy reuseport_addr_any
diff --git a/tools/testing/selftests/net/icmp6_reply_addr.sh b/tools/testing/selftests/net/icmp6_reply_addr.sh
new file mode 100755
index 000000000000..551834cb9272
--- /dev/null
+++ b/tools/testing/selftests/net/icmp6_reply_addr.sh
@@ -0,0 +1,159 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+#
+# Verify that in this scenario
+#
+# ------------------------ N2
+# | |
+# ------ ------ N3 ----
+# | R1 | | R2 |------|H2|
+# ------ ------ ----
+# | |
+# ------------------------ N1
+# |
+# ----
+# |H1|
+# ----
+#
+# where H1's default route goes through R1 and R1's default route goes
+# through R2 over N2, traceroute6 from H1 to H2 reports R2's address
+# on N2 and not N1.
+#
+# Addresses are assigned as follows:
+#
+# N1: 2000:101::/64
+# N2: 2000:102::/64
+# N3: 2000:103::/64
+#
+# R1's host part of address: 1
+# R2's host part of address: 2
+# H1's host part of address: 3
+# H2's host part of address: 4
+#
+# For example:
+# the IPv6 address of R1's interface on N2 is 2000:102::1/64
+
+####################################################################
+# helpers
+#
+# Interface on network <net> in node <node> is called <node><net>
+#
+
+node()
+{
+ host=$1
+ shift
+ ip netns exec ${host} $*
+}
+
+create_nodes()
+{
+ for n in $*; do
+ ip netns add $n
+ node $n ip link set lo up
+ done
+}
+
+delete_nodes()
+{
+ for n in $*; do
+ ip netns del $n
+ done
+}
+
+create_veth_net()
+{
+ net=$1
+ h1=$2
+ h2=$3
+
+ ip link add ${h1}${net} type veth peer name ${h2}${net}
+ ip link set ${h1}${net} netns ${h1}
+ node ${h1} ip link set ${h1}${net} up
+ ip link set ${h2}${net} netns ${h2}
+ node ${h2} ip link set ${h2}${net} up
+}
+
+create_macvlan_net()
+{
+ net=$1
+ shift
+ nodes=$*
+
+ ip link add ${net} type dummy
+ ip link set ${net} up
+
+ for n in ${nodes}; do
+ ip link add link ${net} dev ${n}${net} type macvlan mode bridge
+ ip link set ${n}${net} netns $n
+ node ${n} ip link set ${n}${net} up
+ done
+}
+
+delete_macvlan_nets()
+{
+ nets=$*
+
+ for n in ${nets}; do
+ ip link del ${n}
+ done
+}
+
+# end helpers
+####################################################################
+
+if [ "$(id -u)" -ne 0 ]; then
+ echo "SKIP: Need root privileges"
+ exit 0
+fi
+
+if [ ! -x "$(command -v traceroute6)" ]; then
+ echo "SKIP: Could not run test without traceroute6"
+ exit 0
+fi
+
+create_nodes host1 host2 rtr1 rtr2
+
+create_macvlan_net net1 host1 rtr1 rtr2
+create_veth_net net2 rtr1 rtr2
+create_veth_net net3 rtr2 host2
+
+# Configure interfaces and routes in host1
+node host1 ip -6 addr add 2000:101::3/64 dev host1net1
+node host1 ip -6 route add default via 2000:101::1
+
+# Configure interfaces and routes in rtr1
+node rtr1 ip -6 addr add 2000:101::1/64 dev rtr1net1
+node rtr1 ip -6 addr add 2000:102::1/64 dev rtr1net2
+node rtr1 ip -6 route add default via 2000:102::2
+node rtr1 sysctl net.ipv6.conf.all.forwarding=1 >/dev/null
+
+# Configure interfaces and routes in rtr2
+node rtr2 ip -6 addr add 2000:101::2/64 dev rtr2net1
+node rtr2 ip -6 addr add 2000:102::2/64 dev rtr2net2
+node rtr2 ip -6 addr add 2000:103::2/64 dev rtr2net3
+node rtr2 sysctl net.ipv6.conf.all.forwarding=1 >/dev/null
+
+# Configure interfaces and routes in host2
+node host2 ip -6 addr add 2000:103::4/64 dev host2net3
+node host2 ip -6 route add default via 2000:103::2
+
+# Ping host2 from host1
+echo "Priming the network"
+node host1 ping6 -c5 2000:103::4 >/dev/null
+
+# Traceroute host2 from host1
+echo "Running traceroute6"
+if node host1 traceroute6 2000:103::4 | grep -q 2000:102::2; then
+ ret=0
+ echo "Found 2000:102::2. Test passed."
+else
+ ret=1
+ echo "Did not find 2000:102::2. Test failed."
+fi
+
+delete_macvlan_nets net1
+delete_nodes host1 host2 rtr1 rtr2
+
+exit ${ret}
+
--
2.19.1
I'm planning to add some kernel self tests which use a user level program
in tools/testing/selftests/vm/ and a kernel module. See:
https://lore.kernel.org/linux-mm/20191023195515.13168-1-rcampbell@nvidia.co…
The question is where to put the kernel module source code.
I see some test modules that are in lib/test_*.ko and my patch
initially placed the hmm-dmirror module in drivers/char/ since
it creates a character device.
Any advice?
These counters will track hugetlb reservations rather than hugetlb
memory faulted in. This patch only adds the counter, following patches
add the charging and uncharging of the counter.
Problem:
Currently tasks attempting to allocate more hugetlb memory than is available get
a failure at mmap/shmget time. This is thanks to Hugetlbfs Reservations [1].
However, if a task attempts to allocate hugetlb memory only more than its
hugetlb_cgroup limit allows, the kernel will allow the mmap/shmget call,
but will SIGBUS the task when it attempts to fault the memory in.
We have developers interested in using hugetlb_cgroups, and they have expressed
dissatisfaction regarding this behavior. We'd like to improve this
behavior such that tasks violating the hugetlb_cgroup limits get an error on
mmap/shmget time, rather than getting SIGBUS'd when they try to fault
the excess memory in.
The underlying problem is that today's hugetlb_cgroup accounting happens
at hugetlb memory *fault* time, rather than at *reservation* time.
Thus, enforcing the hugetlb_cgroup limit only happens at fault time, and
the offending task gets SIGBUS'd.
Proposed Solution:
A new page counter named hugetlb.xMB.reservation_[limit|usage]_in_bytes. This
counter has slightly different semantics than
hugetlb.xMB.[limit|usage]_in_bytes:
- While usage_in_bytes tracks all *faulted* hugetlb memory,
reservation_usage_in_bytes tracks all *reserved* hugetlb memory and
hugetlb memory faulted in without a prior reservation.
- If a task attempts to reserve more memory than limit_in_bytes allows,
the kernel will allow it to do so. But if a task attempts to reserve
more memory than reservation_limit_in_bytes, the kernel will fail this
reservation.
This proposal is implemented in this patch series, with tests to verify
functionality and show the usage. We also added cgroup-v2 support to
hugetlb_cgroup so that the new use cases can be extended to v2.
Alternatives considered:
1. A new cgroup, instead of only a new page_counter attached to
the existing hugetlb_cgroup. Adding a new cgroup seemed like a lot of code
duplication with hugetlb_cgroup. Keeping hugetlb related page counters under
hugetlb_cgroup seemed cleaner as well.
2. Instead of adding a new counter, we considered adding a sysctl that modifies
the behavior of hugetlb.xMB.[limit|usage]_in_bytes, to do accounting at
reservation time rather than fault time. Adding a new page_counter seems
better as userspace could, if it wants, choose to enforce different cgroups
differently: one via limit_in_bytes, and another via
reservation_limit_in_bytes. This could be very useful if you're
transitioning how hugetlb memory is partitioned on your system one
cgroup at a time, for example. Also, someone may find usage for both
limit_in_bytes and reservation_limit_in_bytes concurrently, and this
approach gives them the option to do so.
Testing:
- Added tests passing.
- libhugetlbfs tests mostly passing, but some tests have trouble with and
without this patch series. Seems environment issue rather than code:
- Overall results:
********** TEST SUMMARY
* 2M
* 32-bit 64-bit
* Total testcases: 84 0
* Skipped: 0 0
* PASS: 66 0
* FAIL: 14 0
* Killed by signal: 0 0
* Bad configuration: 4 0
* Expected FAIL: 0 0
* Unexpected PASS: 0 0
* Test not present: 0 0
* Strange test result: 0 0
**********
- Failing tests:
- elflink_rw_and_share_test("linkhuge_rw") segfaults with and without this
patch series.
- LD_PRELOAD=libhugetlbfs.so HUGETLB_MORECORE=yes malloc (2M: 32):
FAIL Address is not hugepage
- LD_PRELOAD=libhugetlbfs.so HUGETLB_RESTRICT_EXE=unknown:malloc
HUGETLB_MORECORE=yes malloc (2M: 32):
FAIL Address is not hugepage
- LD_PRELOAD=libhugetlbfs.so HUGETLB_MORECORE=yes malloc_manysmall (2M: 32):
FAIL Address is not hugepage
- GLIBC_TUNABLES=glibc.malloc.tcache_count=0 LD_PRELOAD=libhugetlbfs.so
HUGETLB_MORECORE=yes heapshrink (2M: 32):
FAIL Heap not on hugepages
- GLIBC_TUNABLES=glibc.malloc.tcache_count=0 LD_PRELOAD=libhugetlbfs.so
libheapshrink.so HUGETLB_MORECORE=yes heapshrink (2M: 32):
FAIL Heap not on hugepages
- HUGETLB_ELFMAP=RW linkhuge_rw (2M: 32): FAIL small_data is not hugepage
- HUGETLB_ELFMAP=RW HUGETLB_MINIMAL_COPY=no linkhuge_rw (2M: 32):
FAIL small_data is not hugepage
- alloc-instantiate-race shared (2M: 32):
Bad configuration: sched_setaffinity(cpu1): Invalid argument -
FAIL Child 1 killed by signal Killed
- shmoverride_linked (2M: 32):
FAIL shmget failed size 2097152 from line 176: Invalid argument
- HUGETLB_SHM=yes shmoverride_linked (2M: 32):
FAIL shmget failed size 2097152 from line 176: Invalid argument
- shmoverride_linked_static (2M: 32):
FAIL shmget failed size 2097152 from line 176: Invalid argument
- HUGETLB_SHM=yes shmoverride_linked_static (2M: 32):
FAIL shmget failed size 2097152 from line 176: Invalid argument
- LD_PRELOAD=libhugetlbfs.so shmoverride_unlinked (2M: 32):
FAIL shmget failed size 2097152 from line 176: Invalid argument
- LD_PRELOAD=libhugetlbfs.so HUGETLB_SHM=yes shmoverride_unlinked (2M: 32):
FAIL shmget failed size 2097152 from line 176: Invalid argument
[1]: https://www.kernel.org/doc/html/latest/vm/hugetlbfs_reserv.html
Signed-off-by: Mina Almasry <almasrymina(a)google.com>
Acked-by: Hillf Danton <hdanton(a)sina.com>
---
include/linux/hugetlb.h | 23 ++++++++-
mm/hugetlb_cgroup.c | 111 ++++++++++++++++++++++++++++++----------
2 files changed, 107 insertions(+), 27 deletions(-)
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 53fc34f930d08..9c49a0ba894d3 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -320,6 +320,27 @@ unsigned long hugetlb_get_unmapped_area(struct file *file, unsigned long addr,
#ifdef CONFIG_HUGETLB_PAGE
+enum {
+ /* Tracks hugetlb memory faulted in. */
+ HUGETLB_RES_USAGE,
+ /* Tracks hugetlb memory reserved. */
+ HUGETLB_RES_RESERVATION_USAGE,
+ /* Limit for hugetlb memory faulted in. */
+ HUGETLB_RES_LIMIT,
+ /* Limit for hugetlb memory reserved. */
+ HUGETLB_RES_RESERVATION_LIMIT,
+ /* Max usage for hugetlb memory faulted in. */
+ HUGETLB_RES_MAX_USAGE,
+ /* Max usage for hugetlb memory reserved. */
+ HUGETLB_RES_RESERVATION_MAX_USAGE,
+ /* Faulted memory accounting fail count. */
+ HUGETLB_RES_FAILCNT,
+ /* Reserved memory accounting fail count. */
+ HUGETLB_RES_RESERVATION_FAILCNT,
+ HUGETLB_RES_NULL,
+ HUGETLB_RES_MAX,
+};
+
#define HSTATE_NAME_LEN 32
/* Defines one hugetlb page size */
struct hstate {
@@ -340,7 +361,7 @@ struct hstate {
unsigned int surplus_huge_pages_node[MAX_NUMNODES];
#ifdef CONFIG_CGROUP_HUGETLB
/* cgroup control files */
- struct cftype cgroup_files[5];
+ struct cftype cgroup_files[HUGETLB_RES_MAX];
#endif
char name[HSTATE_NAME_LEN];
};
diff --git a/mm/hugetlb_cgroup.c b/mm/hugetlb_cgroup.c
index f1930fa0b445d..1ed4448ca41d3 100644
--- a/mm/hugetlb_cgroup.c
+++ b/mm/hugetlb_cgroup.c
@@ -25,6 +25,10 @@ struct hugetlb_cgroup {
* the counter to account for hugepages from hugetlb.
*/
struct page_counter hugepage[HUGE_MAX_HSTATE];
+ /*
+ * the counter to account for hugepage reservations from hugetlb.
+ */
+ struct page_counter reserved_hugepage[HUGE_MAX_HSTATE];
};
#define MEMFILE_PRIVATE(x, val) (((x) << 16) | (val))
@@ -33,6 +37,14 @@ struct hugetlb_cgroup {
static struct hugetlb_cgroup *root_h_cgroup __read_mostly;
+static inline struct page_counter *
+hugetlb_cgroup_get_counter(struct hugetlb_cgroup *h_cg, int idx, bool reserved)
+{
+ if (reserved)
+ return &h_cg->reserved_hugepage[idx];
+ return &h_cg->hugepage[idx];
+}
+
static inline
struct hugetlb_cgroup *hugetlb_cgroup_from_css(struct cgroup_subsys_state *s)
{
@@ -254,30 +266,33 @@ void hugetlb_cgroup_uncharge_cgroup(int idx, unsigned long nr_pages,
return;
}
-enum {
- RES_USAGE,
- RES_LIMIT,
- RES_MAX_USAGE,
- RES_FAILCNT,
-};
-
static u64 hugetlb_cgroup_read_u64(struct cgroup_subsys_state *css,
struct cftype *cft)
{
struct page_counter *counter;
+ struct page_counter *reserved_counter;
struct hugetlb_cgroup *h_cg = hugetlb_cgroup_from_css(css);
counter = &h_cg->hugepage[MEMFILE_IDX(cft->private)];
+ reserved_counter = &h_cg->reserved_hugepage[MEMFILE_IDX(cft->private)];
switch (MEMFILE_ATTR(cft->private)) {
- case RES_USAGE:
+ case HUGETLB_RES_USAGE:
return (u64)page_counter_read(counter) * PAGE_SIZE;
- case RES_LIMIT:
+ case HUGETLB_RES_RESERVATION_USAGE:
+ return (u64)page_counter_read(reserved_counter) * PAGE_SIZE;
+ case HUGETLB_RES_LIMIT:
return (u64)counter->max * PAGE_SIZE;
- case RES_MAX_USAGE:
+ case HUGETLB_RES_RESERVATION_LIMIT:
+ return (u64)reserved_counter->max * PAGE_SIZE;
+ case HUGETLB_RES_MAX_USAGE:
return (u64)counter->watermark * PAGE_SIZE;
- case RES_FAILCNT:
+ case HUGETLB_RES_RESERVATION_MAX_USAGE:
+ return (u64)reserved_counter->watermark * PAGE_SIZE;
+ case HUGETLB_RES_FAILCNT:
return counter->failcnt;
+ case HUGETLB_RES_RESERVATION_FAILCNT:
+ return reserved_counter->failcnt;
default:
BUG();
}
@@ -291,6 +306,7 @@ static ssize_t hugetlb_cgroup_write(struct kernfs_open_file *of,
int ret, idx;
unsigned long nr_pages;
struct hugetlb_cgroup *h_cg = hugetlb_cgroup_from_css(of_css(of));
+ bool reserved = false;
if (hugetlb_cgroup_is_root(h_cg)) /* Can't set limit on root */
return -EINVAL;
@@ -304,9 +320,14 @@ static ssize_t hugetlb_cgroup_write(struct kernfs_open_file *of,
nr_pages = round_down(nr_pages, 1 << huge_page_order(&hstates[idx]));
switch (MEMFILE_ATTR(of_cft(of)->private)) {
- case RES_LIMIT:
+ case HUGETLB_RES_RESERVATION_LIMIT:
+ reserved = true;
+ /* Fall through. */
+ case HUGETLB_RES_LIMIT:
mutex_lock(&hugetlb_limit_mutex);
- ret = page_counter_set_max(&h_cg->hugepage[idx], nr_pages);
+ ret = page_counter_set_max(hugetlb_cgroup_get_counter(h_cg, idx,
+ reserved),
+ nr_pages);
mutex_unlock(&hugetlb_limit_mutex);
break;
default:
@@ -320,18 +341,26 @@ static ssize_t hugetlb_cgroup_reset(struct kernfs_open_file *of,
char *buf, size_t nbytes, loff_t off)
{
int ret = 0;
- struct page_counter *counter;
+ struct page_counter *counter, *reserved_counter;
struct hugetlb_cgroup *h_cg = hugetlb_cgroup_from_css(of_css(of));
counter = &h_cg->hugepage[MEMFILE_IDX(of_cft(of)->private)];
+ reserved_counter =
+ &h_cg->reserved_hugepage[MEMFILE_IDX(of_cft(of)->private)];
switch (MEMFILE_ATTR(of_cft(of)->private)) {
- case RES_MAX_USAGE:
+ case HUGETLB_RES_MAX_USAGE:
page_counter_reset_watermark(counter);
break;
- case RES_FAILCNT:
+ case HUGETLB_RES_RESERVATION_MAX_USAGE:
+ page_counter_reset_watermark(reserved_counter);
+ break;
+ case HUGETLB_RES_FAILCNT:
counter->failcnt = 0;
break;
+ case HUGETLB_RES_RESERVATION_FAILCNT:
+ reserved_counter->failcnt = 0;
+ break;
default:
ret = -EINVAL;
break;
@@ -357,37 +386,67 @@ static void __init __hugetlb_cgroup_file_init(int idx)
struct hstate *h = &hstates[idx];
/* format the size */
- mem_fmt(buf, 32, huge_page_size(h));
+ mem_fmt(buf, sizeof(buf), huge_page_size(h));
/* Add the limit file */
- cft = &h->cgroup_files[0];
+ cft = &h->cgroup_files[HUGETLB_RES_LIMIT];
snprintf(cft->name, MAX_CFTYPE_NAME, "%s.limit_in_bytes", buf);
- cft->private = MEMFILE_PRIVATE(idx, RES_LIMIT);
+ cft->private = MEMFILE_PRIVATE(idx, HUGETLB_RES_LIMIT);
+ cft->read_u64 = hugetlb_cgroup_read_u64;
+ cft->write = hugetlb_cgroup_write;
+
+ /* Add the reservation limit file */
+ cft = &h->cgroup_files[HUGETLB_RES_RESERVATION_LIMIT];
+ snprintf(cft->name, MAX_CFTYPE_NAME, "%s.reservation_limit_in_bytes",
+ buf);
+ cft->private = MEMFILE_PRIVATE(idx, HUGETLB_RES_RESERVATION_LIMIT);
cft->read_u64 = hugetlb_cgroup_read_u64;
cft->write = hugetlb_cgroup_write;
/* Add the usage file */
- cft = &h->cgroup_files[1];
+ cft = &h->cgroup_files[HUGETLB_RES_USAGE];
snprintf(cft->name, MAX_CFTYPE_NAME, "%s.usage_in_bytes", buf);
- cft->private = MEMFILE_PRIVATE(idx, RES_USAGE);
+ cft->private = MEMFILE_PRIVATE(idx, HUGETLB_RES_USAGE);
+ cft->read_u64 = hugetlb_cgroup_read_u64;
+
+ /* Add the reservation usage file */
+ cft = &h->cgroup_files[HUGETLB_RES_RESERVATION_USAGE];
+ snprintf(cft->name, MAX_CFTYPE_NAME, "%s.reservation_usage_in_bytes",
+ buf);
+ cft->private = MEMFILE_PRIVATE(idx, HUGETLB_RES_RESERVATION_USAGE);
cft->read_u64 = hugetlb_cgroup_read_u64;
/* Add the MAX usage file */
- cft = &h->cgroup_files[2];
+ cft = &h->cgroup_files[HUGETLB_RES_MAX_USAGE];
snprintf(cft->name, MAX_CFTYPE_NAME, "%s.max_usage_in_bytes", buf);
- cft->private = MEMFILE_PRIVATE(idx, RES_MAX_USAGE);
+ cft->private = MEMFILE_PRIVATE(idx, HUGETLB_RES_MAX_USAGE);
+ cft->write = hugetlb_cgroup_reset;
+ cft->read_u64 = hugetlb_cgroup_read_u64;
+
+ /* Add the MAX reservation usage file */
+ cft = &h->cgroup_files[HUGETLB_RES_RESERVATION_MAX_USAGE];
+ snprintf(cft->name, MAX_CFTYPE_NAME,
+ "%s.reservation_max_usage_in_bytes", buf);
+ cft->private = MEMFILE_PRIVATE(idx, HUGETLB_RES_RESERVATION_MAX_USAGE);
cft->write = hugetlb_cgroup_reset;
cft->read_u64 = hugetlb_cgroup_read_u64;
/* Add the failcntfile */
- cft = &h->cgroup_files[3];
+ cft = &h->cgroup_files[HUGETLB_RES_FAILCNT];
snprintf(cft->name, MAX_CFTYPE_NAME, "%s.failcnt", buf);
- cft->private = MEMFILE_PRIVATE(idx, RES_FAILCNT);
+ cft->private = MEMFILE_PRIVATE(idx, HUGETLB_RES_FAILCNT);
+ cft->write = hugetlb_cgroup_reset;
+ cft->read_u64 = hugetlb_cgroup_read_u64;
+
+ /* Add the reservation failcntfile */
+ cft = &h->cgroup_files[HUGETLB_RES_RESERVATION_FAILCNT];
+ snprintf(cft->name, MAX_CFTYPE_NAME, "%s.reservation_failcnt", buf);
+ cft->private = MEMFILE_PRIVATE(idx, HUGETLB_RES_RESERVATION_FAILCNT);
cft->write = hugetlb_cgroup_reset;
cft->read_u64 = hugetlb_cgroup_read_u64;
/* NULL terminate the last cft */
- cft = &h->cgroup_files[4];
+ cft = &h->cgroup_files[HUGETLB_RES_NULL];
memset(cft, 0, sizeof(*cft));
WARN_ON(cgroup_add_legacy_cftypes(&hugetlb_cgrp_subsys,
--
2.24.0.rc0.303.g954a862665-goog
Since commit 5821ba969511 ("selftests: Add test plan API to kselftest.h
and adjust callers") accidentally introduced 'a' typo in the front of
run_test() function, breakpoint_test_arm64.c became not able to be
compiled.
Remove the 'a' from arun_test().
Fixes: 5821ba969511 ("selftests: Add test plan API to kselftest.h and adjust callers")
Reported-by: Jun Takahashi <takahashi.jun_s(a)aa.socionext.com>
Signed-off-by: Masami Hiramatsu <mhiramat(a)kernel.org>
Cc: Kees Cook <keescook(a)chromium.org>
---
.../selftests/breakpoints/breakpoint_test_arm64.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/tools/testing/selftests/breakpoints/breakpoint_test_arm64.c b/tools/testing/selftests/breakpoints/breakpoint_test_arm64.c
index 58ed5eeab709..ad41ea69001b 100644
--- a/tools/testing/selftests/breakpoints/breakpoint_test_arm64.c
+++ b/tools/testing/selftests/breakpoints/breakpoint_test_arm64.c
@@ -109,7 +109,7 @@ static bool set_watchpoint(pid_t pid, int size, int wp)
return false;
}
-static bool arun_test(int wr_size, int wp_size, int wr, int wp)
+static bool run_test(int wr_size, int wp_size, int wr, int wp)
{
int status;
siginfo_t siginfo;
The 'functions' directive is not only for functions, but also works for
structs/unions. So the name is misleading. This patch renames it to
'identifiers', which specific the functions/types to be included in
documentation. We keep the old name as an alias of the new one before
all documentation are updated.
Signed-off-by: Changbin Du <changbin.du(a)gmail.com>
---
v2:
o use 'identifiers' as the new directive name.
---
Documentation/doc-guide/kernel-doc.rst | 29 ++++++++++++++------------
Documentation/sphinx/kerneldoc.py | 19 ++++++++++-------
2 files changed, 28 insertions(+), 20 deletions(-)
diff --git a/Documentation/doc-guide/kernel-doc.rst b/Documentation/doc-guide/kernel-doc.rst
index 192c36af39e2..fff6604631ea 100644
--- a/Documentation/doc-guide/kernel-doc.rst
+++ b/Documentation/doc-guide/kernel-doc.rst
@@ -476,6 +476,22 @@ internal: *[source-pattern ...]*
.. kernel-doc:: drivers/gpu/drm/i915/intel_audio.c
:internal:
+identifiers: *[ function/type ...]*
+ Include documentation for each *function* and *type* in *source*.
+ If no *function* is specified, the documentation for all functions
+ and types in the *source* will be included.
+
+ Examples::
+
+ .. kernel-doc:: lib/bitmap.c
+ :identifiers: bitmap_parselist bitmap_parselist_user
+
+ .. kernel-doc:: lib/idr.c
+ :identifiers:
+
+functions: *[ function/type ...]*
+ This is an alias of the 'identifiers' directive and deprecated.
+
doc: *title*
Include documentation for the ``DOC:`` paragraph identified by *title* in
*source*. Spaces are allowed in *title*; do not quote the *title*. The *title*
@@ -488,19 +504,6 @@ doc: *title*
.. kernel-doc:: drivers/gpu/drm/i915/intel_audio.c
:doc: High Definition Audio over HDMI and Display Port
-functions: *[ function ...]*
- Include documentation for each *function* in *source*.
- If no *function* is specified, the documentation for all functions
- and types in the *source* will be included.
-
- Examples::
-
- .. kernel-doc:: lib/bitmap.c
- :functions: bitmap_parselist bitmap_parselist_user
-
- .. kernel-doc:: lib/idr.c
- :functions:
-
Without options, the kernel-doc directive includes all documentation comments
from the source file.
diff --git a/Documentation/sphinx/kerneldoc.py b/Documentation/sphinx/kerneldoc.py
index 1159405cb920..0689f9c37f1e 100644
--- a/Documentation/sphinx/kerneldoc.py
+++ b/Documentation/sphinx/kerneldoc.py
@@ -59,9 +59,10 @@ class KernelDocDirective(Directive):
optional_arguments = 4
option_spec = {
'doc': directives.unchanged_required,
- 'functions': directives.unchanged,
'export': directives.unchanged,
'internal': directives.unchanged,
+ 'identifiers': directives.unchanged,
+ 'functions': directives.unchanged, # alias of 'identifiers'
}
has_content = False
@@ -71,6 +72,7 @@ class KernelDocDirective(Directive):
filename = env.config.kerneldoc_srctree + '/' + self.arguments[0]
export_file_patterns = []
+ identifiers = None
# Tell sphinx of the dependency
env.note_dependency(os.path.abspath(filename))
@@ -86,19 +88,22 @@ class KernelDocDirective(Directive):
export_file_patterns = str(self.options.get('internal')).split()
elif 'doc' in self.options:
cmd += ['-function', str(self.options.get('doc'))]
+ elif 'identifiers' in self.options:
+ identifiers = self.options.get('identifiers').split()
elif 'functions' in self.options:
- functions = self.options.get('functions').split()
- if functions:
- for f in functions:
- cmd += ['-function', f]
- else:
- cmd += ['-no-doc-sections']
+ identifiers = self.options.get('functions').split()
for pattern in export_file_patterns:
for f in glob.glob(env.config.kerneldoc_srctree + '/' + pattern):
env.note_dependency(os.path.abspath(f))
cmd += ['-export-file', f]
+ if identifiers:
+ for i in identifiers:
+ cmd += ['-function', i]
+ elif identifiers is not None:
+ cmd += ['-no-doc-sections']
+
cmd += [filename]
try:
--
2.20.1
Hi,
Here are the 3rd version of kselftest fixes some on 32bit arch
(e.g. arm)
In this version, I updated [1/5] to make va_max 1MB unconditionally
according to Alexey's comment.
When I built the ksefltest on arm, I hit some 32bit related warnings.
Here are the patches to fix those issues.
- [1/5] va_max was set 2^32 even on 32bit arch. This can make
va_max == 0 and always fail. Make it 1GB unconditionally.
- [2/5] Some VM tests requires 64bit user space, which should
not run on 32bit arch.
- [3/5] For counting the size of large file, we should use
size_t instead of unsinged long.
- [4/5] Gcc warns printf format for size_t and int64_t on
32bit arch. Use %llu and cast it.
- [5/5] Gcc warns __u64 and pointer type castings. It should
once translated to unsigned long.
Thank you,
---
Masami Hiramatsu (5):
selftests: proc: Make va_max 1MB
selftests: vm: Build/Run 64bit tests only on 64bit arch
selftests: net: Use size_t and ssize_t for counting file size
selftests: net: Fix printf format warnings on arm
selftests: sync: Fix cast warnings on arm
tools/testing/selftests/net/so_txtime.c | 4 ++--
tools/testing/selftests/net/tcp_mmap.c | 8 ++++----
tools/testing/selftests/net/udpgso.c | 3 ++-
tools/testing/selftests/net/udpgso_bench_tx.c | 3 ++-
.../selftests/proc/proc-self-map-files-002.c | 6 +++++-
tools/testing/selftests/sync/sync.c | 6 +++---
tools/testing/selftests/vm/Makefile | 5 +++++
tools/testing/selftests/vm/run_vmtests | 10 ++++++++++
8 files changed, 33 insertions(+), 12 deletions(-)
--
Masami Hiramatsu (Linaro) <mhiramat(a)kernel.org>
Hi Shua,
Here is the set with cleanup as suggested by Kees on v3.
Configured, built, and tested all modules loaded by
tools/testing/selftests/lib/*.sh
>From previous cover letters ...
While doing the testing for strscpy_pad() it was noticed that there is
duplication in how test modules are being fed to kselftest and also in
the test modules themselves.
This set makes an attempt at adding a framework to kselftest for writing
kernel test modules. It also adds a script for use in creating script
test runners for kselftest. My macro-foo is not great, all criticism
and suggestions very much appreciated. The design is based on test
modules lib/test_printf.c, lib/test_bitmap.c, lib/test_xarray.c.
Changes since last version:
- Remove dependency on Bash (thanks Kees)
- Use oneliner to implement kselftest test runners (thanks Kees)
- Squash patch that adds kselftest script creator script with patch
that uses it.
- Fix typos (thanks Randy)
- Add Kees' Acked-by tags to all patches
thanks,
Tobin.
Tobin C. Harding (6):
lib/test_printf: Add empty module_exit function
kselftest: Add test runner creation script
kselftest: Add test module framework header
lib: Use new kselftest header
lib/string: Add strscpy_pad() function
lib: Add test module for strscpy_pad
Documentation/dev-tools/kselftest.rst | 94 +++++++++++-
include/linux/string.h | 4 +
lib/Kconfig.debug | 3 +
lib/Makefile | 1 +
lib/string.c | 47 +++++-
lib/test_bitmap.c | 20 +--
lib/test_printf.c | 17 +--
lib/test_strscpy.c | 150 +++++++++++++++++++
tools/testing/selftests/kselftest_module.h | 48 ++++++
tools/testing/selftests/kselftest_module.sh | 84 +++++++++++
tools/testing/selftests/lib/Makefile | 2 +-
tools/testing/selftests/lib/bitmap.sh | 18 +--
tools/testing/selftests/lib/config | 1 +
tools/testing/selftests/lib/prime_numbers.sh | 17 +--
tools/testing/selftests/lib/printf.sh | 19 +--
tools/testing/selftests/lib/strscpy.sh | 3 +
16 files changed, 440 insertions(+), 88 deletions(-)
create mode 100644 lib/test_strscpy.c
create mode 100644 tools/testing/selftests/kselftest_module.h
create mode 100755 tools/testing/selftests/kselftest_module.sh
create mode 100755 tools/testing/selftests/lib/strscpy.sh
--
2.21.0
This patchset is being developed here:
<https://github.com/cyphar/linux/tree/openat2/master>
Patch changelog:
v14: [<https://lore.kernel.org/lkml/20191010054140.8483-1-cyphar@cyphar.com/>]
* The magic-link changes (and O_EMPTYPATH) have been dropped from this series
-- they will be developed and sent separately. The main reason is that we
need to restrict things other than open(2) (examples include truncate(2) as
well as mount(MS_BIND)). This will require a fair amount of extra work, and
there's no point stalling openat2(2) for that work to be completed.
* Minor rework of 'struct open_how':
* To avoid future headaches, make it a non-const argument.
* Expand ->flags and ->resolve to 64-bit fields to allow for more flag
extensions without needing to add separate fields too early. This
requires adding a bit of explicit padding (32 bits) to avoid userspace
putting garbage in the alignment padding -- this can be repurposed for
future extensions.
* upgrade_mask is dropped (and will be a separate field when we add it
again in the future) to avoid userspace foot-guns.
* Expand -EINVAL checks in build_open_flags(). Rather than silently
ignoring silly flag combinations (such as O_TMPFILE|O_PATH or
O_PATH|<most flags>), give an -EINVAL. All of the silent ignore semantics
were added to open(2) because we couldn't return -EINVAL -- but we can
now!
* open(2) and openat(2) clean up their flags before passing them to
build_open_flags(), so all mixed flags will continue to work. There is
one exception which is (O_PATH|O_TMPFILE) -- this is no longer
permitted (as far as I can tell this appears to be a bug, and there are
no userspace users that I've hit after running this code for a few
days). If it turns out that userspace does depend on (O_PATH|O_TMPFILE)
working, we can only disallow it for openat2(2).
* Don't zero out nd->root in complete_walk() for RCU-walk if we're doing a
scoped-lookup (this prevents a needless REF-walk retry).
* Attempt all tests on kernels that don't have openat2(2), rather than just
skipping everything.
v13: <https://lore.kernel.org/lkml/20190930183316.10190-1-cyphar@cyphar.com/>
v12: <https://lore.kernel.org/lkml/20190904201933.10736-1-cyphar@cyphar.com/>
v11: <https://lore.kernel.org/lkml/20190820033406.29796-1-cyphar@cyphar.com/>
<https://lore.kernel.org/lkml/20190728010207.9781-1-cyphar@cyphar.com/>
v10: <https://lore.kernel.org/lkml/20190719164225.27083-1-cyphar@cyphar.com/>
v09: <https://lore.kernel.org/lkml/20190706145737.5299-1-cyphar@cyphar.com/>
v08: <https://lore.kernel.org/lkml/20190520133305.11925-1-cyphar@cyphar.com/>
v07: <https://lore.kernel.org/lkml/20190507164317.13562-1-cyphar@cyphar.com/>
v06: <https://lore.kernel.org/lkml/20190506165439.9155-1-cyphar@cyphar.com/>
v05: <https://lore.kernel.org/lkml/20190320143717.2523-1-cyphar@cyphar.com/>
v04: <https://lore.kernel.org/lkml/20181112142654.341-1-cyphar@cyphar.com/>
v03: <https://lore.kernel.org/lkml/20181009070230.12884-1-cyphar@cyphar.com/>
v02: <https://lore.kernel.org/lkml/20181009065300.11053-1-cyphar@cyphar.com/>
v01: <https://lore.kernel.org/lkml/20180929103453.12025-1-cyphar@cyphar.com/>
For a very long time, extending openat(2) with new features has been
incredibly frustrating. This stems from the fact that openat(2) is
possibly the most famous counter-example to the mantra "don't silently
accept garbage from userspace" -- it doesn't check whether unknown flags
are present[1].
This means that (generally) the addition of new flags to openat(2) has
been fraught with backwards-compatibility issues (O_TMPFILE has to be
defined as __O_TMPFILE|O_DIRECTORY|[O_RDWR or O_WRONLY] to ensure old
kernels gave errors, since it's insecure to silently ignore the
flag[2]). All new security-related flags therefore have a tough road to
being added to openat(2).
Furthermore, the need for some sort of control over VFS's path resolution (to
avoid malicious paths resulting in inadvertent breakouts) has been a very
long-standing desire of many userspace applications. This patchset is a revival
of Al Viro's old AT_NO_JUMPS[3] patchset (which was a variant of David
Drysdale's O_BENEATH patchset[4] which was a spin-off of the Capsicum
project[5]) with a few additions and changes made based on the previous
discussion within [6] as well as others I felt were useful.
In line with the conclusions of the original discussion of AT_NO_JUMPS, the
flag has been split up into separate flags. However, instead of being an
openat(2) flag it is provided through a new syscall openat2(2) which provides
several other improvements to the openat(2) interface (see the patch
description for more details). The following new LOOKUP_* flags are added:
* LOOKUP_NO_XDEV blocks all mountpoint crossings (upwards, downwards,
or through absolute links). Absolute pathnames alone in openat(2) do not
trigger this. Magic-link traversal which implies a vfsmount jump is also
blocked (though magic-link jumps on the same vfsmount are permitted).
* LOOKUP_NO_MAGICLINKS blocks resolution through /proc/$pid/fd-style
links. This is done by blocking the usage of nd_jump_link() during
resolution in a filesystem. The term "magic-links" is used to match
with the only reference to these links in Documentation/, but I'm
happy to change the name.
It should be noted that this is different to the scope of
~LOOKUP_FOLLOW in that it applies to all path components. However,
you can do openat2(NO_FOLLOW|NO_MAGICLINKS) on a magic-link and it
will *not* fail (assuming that no parent component was a
magic-link), and you will have an fd for the magic-link.
In order to correctly detect magic-links, the introduction of a new
LOOKUP_MAGICLINK_JUMPED state flag was required.
* LOOKUP_BENEATH disallows escapes to outside the starting dirfd's
tree, using techniques such as ".." or absolute links. Absolute
paths in openat(2) are also disallowed. Conceptually this flag is to
ensure you "stay below" a certain point in the filesystem tree --
but this requires some additional to protect against various races
that would allow escape using "..".
Currently LOOKUP_BENEATH implies LOOKUP_NO_MAGICLINKS, because it
can trivially beam you around the filesystem (breaking the
protection). In future, there might be similar safety checks done as
in LOOKUP_IN_ROOT, but that requires more discussion.
In addition, two new flags are added that expand on the above ideas:
* LOOKUP_NO_SYMLINKS does what it says on the tin. No symlink
resolution is allowed at all, including magic-links. Just as with
LOOKUP_NO_MAGICLINKS this can still be used with NOFOLLOW to open an
fd for the symlink as long as no parent path had a symlink
component.
* LOOKUP_IN_ROOT is an extension of LOOKUP_BENEATH that, rather than
blocking attempts to move past the root, forces all such movements
to be scoped to the starting point. This provides chroot(2)-like
protection but without the cost of a chroot(2) for each filesystem
operation, as well as being safe against race attacks that chroot(2)
is not.
If a race is detected (as with LOOKUP_BENEATH) then an error is
generated, and similar to LOOKUP_BENEATH it is not permitted to cross
magic-links with LOOKUP_IN_ROOT.
The primary need for this is from container runtimes, which
currently need to do symlink scoping in userspace[7] when opening
paths in a potentially malicious container. There is a long list of
CVEs that could have bene mitigated by having RESOLVE_THIS_ROOT
(such as CVE-2017-1002101, CVE-2017-1002102, CVE-2018-15664, and
CVE-2019-5736, just to name a few).
In order to make all of the above more usable, I'm working on
libpathrs[8] which is a C-friendly library for safe path resolution. It
features a userspace-emulated backend if the kernel doesn't support
openat2(2). Hopefully we can get userspace to switch to using it, and
thus get openat2(2) support for free once it's ready.
[1]: https://lwn.net/Articles/588444/
[2]: https://lore.kernel.org/lkml/CA+55aFyyxJL1LyXZeBsf2ypriraj5ut1XkNDsunRBqgVj…
[3]: https://lore.kernel.org/lkml/20170429220414.GT29622@ZenIV.linux.org.uk
[4]: https://lore.kernel.org/lkml/1415094884-18349-1-git-send-email-drysdale@goo…
[5]: https://lore.kernel.org/lkml/1404124096-21445-1-git-send-email-drysdale@goo…
[6]: https://lwn.net/Articles/723057/
[7]: https://github.com/cyphar/filepath-securejoin
[8]: https://github.com/openSUSE/libpathrs
The current draft of the openat2(2) man-page is included below.
--8<---------------------------------------------------------------------------
OPENAT2(2) Linux Programmer's Manual OPENAT2(2)
NAME
openat2 - open and possibly create a file (extended)
SYNOPSIS
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
int openat2(int dirfd, const char *pathname, struct open_how *how, size_t size);
Note: There is no glibc wrapper for this system call; see NOTES.
DESCRIPTION
The openat2() system call opens the file specified by pathname. If the specified file
does not exist, it may optionally (if O_CREAT is specified in how.flags) be created by
openat2().
As with openat(2), if pathname is relative, then it is interpreted relative to the direc-
tory referred to by the file descriptor dirfd (or the current working directory of the
calling process, if dirfd is the special value AT_FDCWD.) If pathname is absolute, then
dirfd is ignored (unless how.resolve contains RESOLVE_IN_ROOT, in which case pathname is
resolved relative to dirfd.)
The openat2() system call is an extension of openat(2) and provides a superset of its
functionality. Rather than taking a single flag argument, an extensible structure (how)
is passed instead to allow for future extensions. size must be set to sizeof(struct
open_how), to facilitate future extensions (see the "Extensibility" section of the NOTES
for more detail on how extensions are handled.)
The open_how structure
The following structure indicates how pathname should be opened, and acts as a superset of
the flag and mode arguments to openat(2).
struct open_how {
__aligned_u64 flags; /* O_* flags. */
__u16 mode; /* Mode for O_{CREAT,TMPFILE}. */
__u16 __padding[3]; /* Must be zeroed. */
__aligned_u64 resolve; /* RESOLVE_* flags. */
};
Any future extensions to openat2() will be implemented as new fields appended to the above
structure (or through reuse of pre-existing padding space), with the zero value of the new
fields acting as though the extension were not present.
The meaning of each field is as follows:
flags
The file creation and status flags to use for this operation. All of the
O_* flags defined for openat(2) are valid openat2() flag values.
Unlike openat(2), it is an error to provide openat2() unknown or conflicting
flags in flags.
mode
File mode for the new file, with identical semantics to the mode argument to
openat(2). However, unlike openat(2), it is an error to provide openat2()
with a mode which contains bits other than 0777.
It is an error to provide openat2() a non-zero mode if flags does not con-
tain O_CREAT or O_TMPFILE.
resolve
Change how the components of pathname will be resolved (see path_resolu-
tion(7) for background information.) The primary use case for these flags
is to allow trusted programs to restrict how untrusted paths (or paths in-
side untrusted directories) are resolved. The full list of resolve flags is
given below.
RESOLVE_NO_XDEV
Disallow traversal of mount points during path resolution (including
all bind mounts).
Users of this flag are encouraged to make its use configurable (un-
less it is used for a specific security purpose), as bind mounts are
very widely used by end-users. Setting this flag indiscrimnately for
all uses of openat2() may result in spurious errors on previously-
functional systems.
RESOLVE_NO_SYMLINKS
Disallow resolution of symbolic links during path resolution. This
option implies RESOLVE_NO_MAGICLINKS.
If the trailing component is a symbolic link, and flags contains both
O_PATH and O_NOFOLLOW, then an O_PATH file descriptor referencing the
symbolic link will be returned.
Users of this flag are encouraged to make its use configurable (un-
less it is used for a specific security purpose), as symbolic links
are very widely used by end-users. Setting this flag indiscrimnately
for all uses of openat2() may result in spurious errors on previ-
ously-functional systems.
RESOLVE_NO_MAGICLINKS
Disallow all magic link resolution during path resolution.
If the trailing component is a magic link, and flags contains both
O_PATH and O_NOFOLLOW, then an O_PATH file descriptor referencing the
magic link will be returned.
Magic-links are symbolic link-like objects that are most notably
found in proc(5) (examples include /proc/[pid]/exe and
/proc/[pid]/fd/*.) Due to the potential danger of unknowingly open-
ing these magic links, it may be preferable for users to disable
their resolution entirely (see symboliclink(7) for more details.)
RESOLVE_BENEATH
Do not permit the path resolution to succeed if any component of the
resolution is not a descendant of the directory indicated by dirfd.
This results in absolute symbolic links (and absolute values of path-
name) to be rejected.
Currently, this flag also disables magic link resolution. However,
this may change in the future. The caller should explicitly specify
RESOLVE_NO_MAGICLINKS to ensure that magic links are not resolved.
RESOLVE_IN_ROOT
Treat dirfd as the root directory while resolving pathname (as though
the user called chroot(2) with dirfd as the argument.) Absolute sym-
bolic links and ".." path components will be scoped to dirfd. If
pathname is an absolute path, it is also treated relative to dirfd.
However, unlike chroot(2) (which changes the filesystem root perma-
nently for a process), RESOLVE_IN_ROOT allows a program to effi-
ciently restrict path resolution for only certain operations. It
also has several hardening features (such detecting escape attempts
during .. resolution) which chroot(2) does not.
Currently, this flag also disables magic link resolution. However,
this may change in the future. The caller should explicitly specify
RESOLVE_NO_MAGICLINKS to ensure that magic links are not resolved.
It is an error to provide openat2() unknown flags in resolve.
RETURN VALUE
On success, a new file descriptor is returned. On error, -1 is returned, and errno is set
appropriately.
ERRORS
The set of errors returned by openat2() includes all of the errors returned by openat(2),
as well as the following additional errors:
EINVAL An unknown flag or invalid value was specified in how.
EINVAL mode is non-zero, but flags does not contain O_CREAT or O_TMPFILE.
EINVAL size was smaller than any known version of struct open_how.
E2BIG An extension was specified in how, which the current kernel does not support (see
the "Extensibility" section of the NOTES for more detail on how extensions are han-
dled.)
EAGAIN resolve contains either RESOLVE_IN_ROOT or RESOLVE_BENEATH, and the kernel could
not ensure that a ".." component didn't escape (due to a race condition or poten-
tial attack.) Callers may choose to retry the openat2() call.
EXDEV resolve contains either RESOLVE_IN_ROOT or RESOLVE_BENEATH, and an escape from the
root during path resolution was detected.
EXDEV resolve contains RESOLVE_NO_XDEV, and a path component attempted to cross a mount
point.
ELOOP resolve contains RESOLVE_NO_SYMLINKS, and one of the path components was a symbolic
link (or magic link).
ELOOP resolve contains RESOLVE_NO_MAGICLINKS, and one of the path components was a magic
link.
VERSIONS
openat2() was added to Linux in kernel 5.FOO.
CONFORMING TO
This system call is Linux-specific.
The semantics of RESOLVE_BENEATH were modelled after FreeBSD's O_BENEATH.
NOTES
Glibc does not provide a wrapper for this system call; call it using systemcall(2).
Extensibility
In order to allow for struct open_how to be extended in future kernel revisions, openat2()
requires userspace to specify the size of struct open_how structure they are passing. By
providing this information, it is possible for openat2() to provide both forwards- and
backwards-compatibility — with size acting as an implicit version number (because new ex-
tension fields will always be appended, the size will always increase.) This extensibil-
ity design is very similar to other system calls such as perf_setattr(2),
perf_event_open(2), and clone(3).
If we let usize be the size of the structure according to userspace and ksize be the size
of the structure which the kernel supports, then there are only three cases to consider:
* If ksize equals usize, then there is no version mismatch and how can be used
verbatim.
* If ksize is larger than usize, then there are some extensions the kernel sup-
ports which the userspace program is unaware of. Because all extensions must
have their zero values be a no-op, the kernel treats all of the extension fields
not set by userspace to have zero values. This provides backwards-compatibil-
ity.
* If ksize is smaller than usize, then there are some extensions which the
userspace program is aware of but the kernel does not support. Because all ex-
tensions must have their zero values be a no-op, the kernel can safely ignore
the unsupported extension fields if they are all-zero. If any unsupported ex-
tension fields are non-zero, then -1 is returned and errno is set to E2BIG.
This provides forwards-compatibility.
Therefore, most userspace programs will not need to have any special handling of exten-
sions. However, if a userspace program wishes to determine what extensions the running
kernel supports, they may conduct a binary search on size (to find the largest value which
doesn't produce an error of E2BIG.)
SEE ALSO
openat(2), path_resolution(7), symlink(7)
Linux 2019-10-27 OPENAT2(2)
--8<---------------------------------------------------------------------------
Aleksa Sarai (6):
namei: O_BENEATH-style resolution restriction flags
namei: LOOKUP_IN_ROOT: chroot-like path resolution
namei: permit ".." resolution with LOOKUP_{IN_ROOT,BENEATH}
open: introduce openat2(2) syscall
selftests: add openat2(2) selftests
Documentation: path-lookup: mention LOOKUP_MAGICLINK_JUMPED
CREDITS | 4 +-
Documentation/filesystems/path-lookup.rst | 18 +-
arch/alpha/kernel/syscalls/syscall.tbl | 1 +
arch/arm/tools/syscall.tbl | 1 +
arch/arm64/include/asm/unistd.h | 2 +-
arch/arm64/include/asm/unistd32.h | 2 +
arch/ia64/kernel/syscalls/syscall.tbl | 1 +
arch/m68k/kernel/syscalls/syscall.tbl | 1 +
arch/microblaze/kernel/syscalls/syscall.tbl | 1 +
arch/mips/kernel/syscalls/syscall_n32.tbl | 1 +
arch/mips/kernel/syscalls/syscall_n64.tbl | 1 +
arch/mips/kernel/syscalls/syscall_o32.tbl | 1 +
arch/parisc/kernel/syscalls/syscall.tbl | 1 +
arch/powerpc/kernel/syscalls/syscall.tbl | 1 +
arch/s390/kernel/syscalls/syscall.tbl | 1 +
arch/sh/kernel/syscalls/syscall.tbl | 1 +
arch/sparc/kernel/syscalls/syscall.tbl | 1 +
arch/x86/entry/syscalls/syscall_32.tbl | 1 +
arch/x86/entry/syscalls/syscall_64.tbl | 1 +
arch/xtensa/kernel/syscalls/syscall.tbl | 1 +
fs/namei.c | 167 +++++-
fs/open.c | 154 ++++--
include/linux/fcntl.h | 12 +-
include/linux/namei.h | 12 +
include/linux/syscalls.h | 3 +
include/uapi/asm-generic/unistd.h | 5 +-
include/uapi/linux/fcntl.h | 41 ++
tools/testing/selftests/Makefile | 1 +
tools/testing/selftests/openat2/.gitignore | 1 +
tools/testing/selftests/openat2/Makefile | 8 +
tools/testing/selftests/openat2/helpers.c | 109 ++++
tools/testing/selftests/openat2/helpers.h | 107 ++++
.../testing/selftests/openat2/openat2_test.c | 297 ++++++++++
.../selftests/openat2/rename_attack_test.c | 160 ++++++
.../testing/selftests/openat2/resolve_test.c | 523 ++++++++++++++++++
35 files changed, 1571 insertions(+), 71 deletions(-)
create mode 100644 tools/testing/selftests/openat2/.gitignore
create mode 100644 tools/testing/selftests/openat2/Makefile
create mode 100644 tools/testing/selftests/openat2/helpers.c
create mode 100644 tools/testing/selftests/openat2/helpers.h
create mode 100644 tools/testing/selftests/openat2/openat2_test.c
create mode 100644 tools/testing/selftests/openat2/rename_attack_test.c
create mode 100644 tools/testing/selftests/openat2/resolve_test.c
--
2.23.0
From: John Hubbard <jhubbard(a)nvidia.com>
[ Upstream commit 6f24c8d30d08f270b54f4c2cb9b08dfccbe59c57 ]
Even though gup_benchmark.c has code to handle the -w command-line option,
the "w" is not part of the getopt string. It looks as if it has been
missing the whole time.
On my machine, this leads naturally to the following predictable result:
$ sudo ./gup_benchmark -w
./gup_benchmark: invalid option -- 'w'
...which is fixed with this commit.
Link: http://lkml.kernel.org/r/20191014184639.1512873-2-jhubbard@nvidia.com
Signed-off-by: John Hubbard <jhubbard(a)nvidia.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov(a)linux.intel.com>
Cc: Keith Busch <keith.busch(a)intel.com>
Cc: Shuah Khan <shuah(a)kernel.org>
Cc: Christoph Hellwig <hch(a)infradead.org>
Cc: "Aneesh Kumar K . V" <aneesh.kumar(a)linux.ibm.com>
Cc: Ira Weiny <ira.weiny(a)intel.com>
Cc: Christoph Hellwig <hch(a)lst.de>
Cc: kbuild test robot <lkp(a)intel.com>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds(a)linux-foundation.org>
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
---
tools/testing/selftests/vm/gup_benchmark.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/tools/testing/selftests/vm/gup_benchmark.c b/tools/testing/selftests/vm/gup_benchmark.c
index c0534e298b512..cb3fc09645c48 100644
--- a/tools/testing/selftests/vm/gup_benchmark.c
+++ b/tools/testing/selftests/vm/gup_benchmark.c
@@ -37,7 +37,7 @@ int main(int argc, char **argv)
char *file = "/dev/zero";
char *p;
- while ((opt = getopt(argc, argv, "m:r:n:f:tTLUSH")) != -1) {
+ while ((opt = getopt(argc, argv, "m:r:n:f:tTLUwSH")) != -1) {
switch (opt) {
case 'm':
size = atoi(optarg) * MB;
--
2.20.1
From: Jiri Benc <jbenc(a)redhat.com>
[ Upstream commit fd418b01fe26c2430b1091675cceb3ab2b52e1e0 ]
Many distributions enable rp_filter. However, the flow dissector test
generates packets that have 1.1.1.1 set as (inner) source address without
this address being reachable. This causes the selftest to fail.
The selftests should not assume a particular initial configuration. Switch
off rp_filter.
Fixes: 50b3ed57dee9 ("selftests/bpf: test bpf flow dissection")
Signed-off-by: Jiri Benc <jbenc(a)redhat.com>
Signed-off-by: Daniel Borkmann <daniel(a)iogearbox.net>
Acked-by: Petar Penkov <ppenkov(a)google.com>
Link: https://lore.kernel.org/bpf/513a298f53e99561d2f70b2e60e2858ea6cda754.157053…
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
---
tools/testing/selftests/bpf/test_flow_dissector.sh | 3 +++
1 file changed, 3 insertions(+)
diff --git a/tools/testing/selftests/bpf/test_flow_dissector.sh b/tools/testing/selftests/bpf/test_flow_dissector.sh
index d23d4da66b834..e2d06191bd35c 100755
--- a/tools/testing/selftests/bpf/test_flow_dissector.sh
+++ b/tools/testing/selftests/bpf/test_flow_dissector.sh
@@ -63,6 +63,9 @@ fi
# Setup
tc qdisc add dev lo ingress
+echo 0 > /proc/sys/net/ipv4/conf/default/rp_filter
+echo 0 > /proc/sys/net/ipv4/conf/all/rp_filter
+echo 0 > /proc/sys/net/ipv4/conf/lo/rp_filter
echo "Testing IPv4..."
# Drops all IP/UDP packets coming from port 9
--
2.20.1
Commit 852c8cbf34d3 ("selftests/kselftest/runner.sh: Add 45 second
timeout per test") introduced a timeout per test. Livepatch tests could
run longer than 45 seconds, especially on slower machines. They do not
hang and they detect if something goes awry with internal accounting.
Better than looking for an arbitrary value, just disable the timeout for
livepatch selftests.
Signed-off-by: Miroslav Benes <mbenes(a)suse.cz>
---
tools/testing/selftests/livepatch/settings | 1 +
1 file changed, 1 insertion(+)
create mode 100644 tools/testing/selftests/livepatch/settings
diff --git a/tools/testing/selftests/livepatch/settings b/tools/testing/selftests/livepatch/settings
new file mode 100644
index 000000000000..e7b9417537fb
--- /dev/null
+++ b/tools/testing/selftests/livepatch/settings
@@ -0,0 +1 @@
+timeout=0
--
2.23.0
Hi
this patchset aims to add the initial arch-specific arm64 support to
kselftest starting with signals-related test-cases.
This series is based on v5.4-rc2.
A common internal test-case layout is proposed for signal tests and it is
wired-up to the toplevel kselftest Makefile, so that it should be possible
at the end to run it on an arm64 target in the usual way with KSFT.
~/linux# make TARGETS=arm64 kselftest
New KSFT arm64 testcases live inside tools/testing/selftests/arm64 grouped
by family inside subdirectories: arm64/signal is the first family proposed
with this series.
This series converts also to this subdirectory scheme the pre-existing
KSFT arm64 tags tests (already merged in v5.3), moving them into their own
arm64/tags subdirectory.
Thanks
Cristian
Notes:
-----
- further details in the included READMEs
- more tests still to be written (current strategy is going through the
related Kernel signal-handling code and write a test for each possible
and sensible code-path)
A few ideas for more TODO testcases:
- mangle_pstate_invalid_ssbs_regs: mess with SSBS bits on every
possible configured behavior
- fake_sigreturn_unmapped_sp: SP into unmapped addrs
- fake_sigreturn_kernelspace_sp: SP into kernel addrs
- fake_sigreturn_sve_bad_extra_context: SVE extra context badly formed
- fake_sigreturn_misaligned_sp_4: misaligned SP by 4
(i.e., __alignof__(struct _aarch64_ctx))
- fake_sigreturn_misaligned_sp_8: misaligned SP by 8
(i.e., sizeof(struct _aarch64_ctx))
- fake_sigreturn_bad_size_non_aligned: a size that doesn't overflow
__reserved[], but is not a multiple of 16
- fake_sigreturn_bad_size_tiny: a size that is less than 16
- fake_sigreturn_bad_size_overflow_tiny: a size that does overflow
__reserved[], but by less than 16 bytes?
- mangle_sve_invalid_extra_context: SVE extra_context invalid
- SVE signal testcases and special handling will be part of an additional patch
still to be released
- KSFT arm64 tags test patch
https://lore.kernel.org/linux-arm-kernel/c1e6aad230658bc175b42d92daeff2e300…
is relocated into its own directory under tools/testing/selftests/arm64/tags
Changes:
--------
v8-->v9:
- fixed a couple of misplaced .gitignore
v7-->v8:
- removed SSBS test case
- split remnants of SSBS patch (v7 05/11), containing some helpers,
into two distinct patches
v6-->v7:
- rebased on v5.4-rc2
- renamed SUBTARGETS arm64/ toplevel Makefile ENV to ARM64_SUBTARGETS
- fixed fake_sigreturn alignment routines (off by one)
- fixed SSBS test: avoid using MRS/MSR as whole and SKIP when SSBS not
supported
- reporting KSFT_SKIP when needed (usually if test_init(0 fails)
- using ID_AA64PFR1_EL1.SSBS to check SSBS support instead of HWCAP_SSBS
v5-->v6:
- added arm64 toplevel Makefile SUBTARGETS env var to be able to selectively
build only some arm64/ tests subdirectories
- removed unneed toplevel Makefile exports and fixed Copyright
- better checks for supported features and features names helpers
- converted some run-time critical assert() to abort() to avoid
issues when -NDEBUG is set
- default_handler() signal handler refactored and split
- using SIGTRAP for get_current_context()
- use volatile where proper
- refactor and relocate test_init() invocation
- review usage of MRS SSBS instructions depending on HW_SSBS
- cleanup fake_sigreturn trampoline
- cleanup get_starting_header helper
- avoiding timeout test failures wherever possible (fail immediately
if possible)
v4-->v5:
- rebased on arm64/for-next-core merging 01/11 with KSFT tags tests:
commit 9ce1263033cd ("selftests, arm64: add a selftest for passing tagged pointers to kernel")
- moved .gitignore up on elevel
- moved kernel header search mechanism into KSFT arm64 toplevel Makefile
so that it can be used easily also by each arm64 KSFT subsystem inside
subdirs of arm64
v3-->v4:
- rebased on v5.3-rc6
- added test descriptions
- fixed commit messages (imperative mood)
- added missing includes and removed unneeded ones
- added/used new get_starting_head() helper
- fixed/simplified signal.S::fakke_sigreturn()
- added set_regval() macro and .init initialization func
- better synchonization in get_current_context()
- macroization of mangle_pstate_invalid_mode_el
- split mangle_pstate_invalid_mode_el h/t
- removed standalone mode
- simplified CPU features checks
- fixed/refactored get_header() and validation routines
- simplfied docs
v2-->v3:
- rebased on v5.3-rc2
- better test result characterization looking for
SEGV_ACCERR in si_code on SIGSEGV
- using KSFT Framework macros for retvalues
- removed SAFE_WRITE()/dump_uc: buggy, un-needed and unused
- reviewed generation process of test_arm64_signals.sh runner script
- re-added a fixed fake_sigreturn_misaligned_sp testcase and a properly
extended fake_sigreturn() helper
- added tests' TODO notes
v1-->v2:
- rebased on 5.2-rc7
- various makefile's cleanups
- mixed READMEs fixes
- fixed test_arm64_signals.sh runner script
- cleaned up assembly code in signal.S
- improved get_current_context() logic
- fixed SAFE_WRITE()
- common support code split into more chunks, each one introduced when
needed by some new testcases
- fixed some headers validation routines in testcases.c
- removed some still broken/immature tests:
+ fake_sigreturn_misaligned
+ fake_sigreturn_overflow_reserved
+ mangle_pc_invalid
+ mangle_sp_misaligned
- fixed some other testcases:
+ mangle_pstate_ssbs_regs: better checks of SSBS bit when feature unsupported
+ mangle_pstate_invalid_compat_toggle: name fix
+ mangle_pstate_invalid_mode_el[1-3]: precautionary zeroing PSTATE.MODE
+ fake_sigreturn_bad_magic, fake_sigreturn_bad_size,
fake_sigreturn_bad_size_for_magic0:
- accounting for available space...dropping extra when needed
- keeping alignent
- new testcases on FPSMID context:
+ fake_sigreturn_missing_fpsimd
+ fake_sigreturn_duplicated_fpsimd
Cristian Marussi (12):
kselftest: arm64: extend toplevel skeleton Makefile
kselftest: arm64: mangle_pstate_invalid_compat_toggle and common utils
kselftest: arm64: mangle_pstate_invalid_daif_bits
kselftest: arm64: mangle_pstate_invalid_mode_el[123][ht]
kselftest: arm64: extend test_init functionalities
kselftest: arm64: add helper get_current_context
kselftest: arm64: fake_sigreturn_bad_magic
kselftest: arm64: fake_sigreturn_bad_size_for_magic0
kselftest: arm64: fake_sigreturn_missing_fpsimd
kselftest: arm64: fake_sigreturn_duplicated_fpsimd
kselftest: arm64: fake_sigreturn_bad_size
kselftest: arm64: fake_sigreturn_misaligned_sp
tools/testing/selftests/Makefile | 1 +
tools/testing/selftests/arm64/Makefile | 64 +++-
tools/testing/selftests/arm64/README | 25 ++
.../testing/selftests/arm64/signal/.gitignore | 3 +
tools/testing/selftests/arm64/signal/Makefile | 32 ++
tools/testing/selftests/arm64/signal/README | 59 +++
.../testing/selftests/arm64/signal/signals.S | 64 ++++
.../selftests/arm64/signal/test_signals.c | 29 ++
.../selftests/arm64/signal/test_signals.h | 116 ++++++
.../arm64/signal/test_signals_utils.c | 340 ++++++++++++++++++
.../arm64/signal/test_signals_utils.h | 120 +++++++
.../testcases/fake_sigreturn_bad_magic.c | 52 +++
.../testcases/fake_sigreturn_bad_size.c | 77 ++++
.../fake_sigreturn_bad_size_for_magic0.c | 46 +++
.../fake_sigreturn_duplicated_fpsimd.c | 50 +++
.../testcases/fake_sigreturn_misaligned_sp.c | 37 ++
.../testcases/fake_sigreturn_missing_fpsimd.c | 50 +++
.../mangle_pstate_invalid_compat_toggle.c | 31 ++
.../mangle_pstate_invalid_daif_bits.c | 35 ++
.../mangle_pstate_invalid_mode_el1h.c | 15 +
.../mangle_pstate_invalid_mode_el1t.c | 15 +
.../mangle_pstate_invalid_mode_el2h.c | 15 +
.../mangle_pstate_invalid_mode_el2t.c | 15 +
.../mangle_pstate_invalid_mode_el3h.c | 15 +
.../mangle_pstate_invalid_mode_el3t.c | 15 +
.../mangle_pstate_invalid_mode_template.h | 28 ++
.../arm64/signal/testcases/testcases.c | 196 ++++++++++
.../arm64/signal/testcases/testcases.h | 104 ++++++
.../selftests/arm64/{ => tags}/.gitignore | 0
tools/testing/selftests/arm64/tags/Makefile | 7 +
.../arm64/{ => tags}/run_tags_test.sh | 0
.../selftests/arm64/{ => tags}/tags_test.c | 0
32 files changed, 1651 insertions(+), 5 deletions(-)
create mode 100644 tools/testing/selftests/arm64/README
create mode 100644 tools/testing/selftests/arm64/signal/.gitignore
create mode 100644 tools/testing/selftests/arm64/signal/Makefile
create mode 100644 tools/testing/selftests/arm64/signal/README
create mode 100644 tools/testing/selftests/arm64/signal/signals.S
create mode 100644 tools/testing/selftests/arm64/signal/test_signals.c
create mode 100644 tools/testing/selftests/arm64/signal/test_signals.h
create mode 100644 tools/testing/selftests/arm64/signal/test_signals_utils.c
create mode 100644 tools/testing/selftests/arm64/signal/test_signals_utils.h
create mode 100644 tools/testing/selftests/arm64/signal/testcases/fake_sigreturn_bad_magic.c
create mode 100644 tools/testing/selftests/arm64/signal/testcases/fake_sigreturn_bad_size.c
create mode 100644 tools/testing/selftests/arm64/signal/testcases/fake_sigreturn_bad_size_for_magic0.c
create mode 100644 tools/testing/selftests/arm64/signal/testcases/fake_sigreturn_duplicated_fpsimd.c
create mode 100644 tools/testing/selftests/arm64/signal/testcases/fake_sigreturn_misaligned_sp.c
create mode 100644 tools/testing/selftests/arm64/signal/testcases/fake_sigreturn_missing_fpsimd.c
create mode 100644 tools/testing/selftests/arm64/signal/testcases/mangle_pstate_invalid_compat_toggle.c
create mode 100644 tools/testing/selftests/arm64/signal/testcases/mangle_pstate_invalid_daif_bits.c
create mode 100644 tools/testing/selftests/arm64/signal/testcases/mangle_pstate_invalid_mode_el1h.c
create mode 100644 tools/testing/selftests/arm64/signal/testcases/mangle_pstate_invalid_mode_el1t.c
create mode 100644 tools/testing/selftests/arm64/signal/testcases/mangle_pstate_invalid_mode_el2h.c
create mode 100644 tools/testing/selftests/arm64/signal/testcases/mangle_pstate_invalid_mode_el2t.c
create mode 100644 tools/testing/selftests/arm64/signal/testcases/mangle_pstate_invalid_mode_el3h.c
create mode 100644 tools/testing/selftests/arm64/signal/testcases/mangle_pstate_invalid_mode_el3t.c
create mode 100644 tools/testing/selftests/arm64/signal/testcases/mangle_pstate_invalid_mode_template.h
create mode 100644 tools/testing/selftests/arm64/signal/testcases/testcases.c
create mode 100644 tools/testing/selftests/arm64/signal/testcases/testcases.h
rename tools/testing/selftests/arm64/{ => tags}/.gitignore (100%)
create mode 100644 tools/testing/selftests/arm64/tags/Makefile
rename tools/testing/selftests/arm64/{ => tags}/run_tags_test.sh (100%)
rename tools/testing/selftests/arm64/{ => tags}/tags_test.c (100%)
--
2.17.1
This patchset is being developed here:
<https://github.com/cyphar/linux/tree/openat2/master>
Patch changelog:
v14:
* The magic-link changes (and O_EMPTYPATH) have been dropped from this series
-- they will be developed and sent separately. The main reason is that we
need to restrict things other than open(2) (examples include truncate(2) as
well as mount(MS_BIND)). This will require a fair amount of extra work, and
there's no point stalling openat2(2) for that work to be completed.
* Minor rework of 'struct open_how':
* To avoid future headaches, make it a non-const argument.
* Expand ->flags and ->resolve to 64-bit fields to allow for more flag
extensions without needing to add separate fields too early. This
requires adding a bit of explicit padding (32 bits) to avoid userspace
putting garbage in the alignment padding -- this can be repurposed for
future extensions.
* upgrade_mask is dropped (and will be a separate field when we add it
again in the future) to avoid userspace foot-guns.
* Expand -EINVAL checks in build_open_flags(). Rather than silently
ignoring silly flag combinations (such as O_TMPFILE|O_PATH or
O_PATH|<most flags>), give an -EINVAL. All of the silent ignore semantics
were added to open(2) because we couldn't return -EINVAL -- but we can
now!
* open(2) and openat(2) clean up their flags before passing them to
build_open_flags(), so all mixed flags will continue to work. There is
one exception which is (O_PATH|O_TMPFILE) -- this is no longer
permitted (as far as I can tell this appears to be a bug, and there are
no userspace users that I've hit after running this code for a few
days). If it turns out that userspace does depend on (O_PATH|O_TMPFILE)
working, we can only disallow it for openat2(2).
* Don't zero out nd->root in complete_walk() for RCU-walk if we're doing a
scoped-lookup (this prevents a needless REF-walk retry).
* Attempt all tests on kernels that don't have openat2(2), rather than just
skipping everything.
v13: <https://lore.kernel.org/lkml/20190930183316.10190-1-cyphar@cyphar.com/>
v12: <https://lore.kernel.org/lkml/20190904201933.10736-1-cyphar@cyphar.com/>
v11: <https://lore.kernel.org/lkml/20190820033406.29796-1-cyphar@cyphar.com/>
<https://lore.kernel.org/lkml/20190728010207.9781-1-cyphar@cyphar.com/>
v10: <https://lore.kernel.org/lkml/20190719164225.27083-1-cyphar@cyphar.com/>
v09: <https://lore.kernel.org/lkml/20190706145737.5299-1-cyphar@cyphar.com/>
v08: <https://lore.kernel.org/lkml/20190520133305.11925-1-cyphar@cyphar.com/>
v07: <https://lore.kernel.org/lkml/20190507164317.13562-1-cyphar@cyphar.com/>
v06: <https://lore.kernel.org/lkml/20190506165439.9155-1-cyphar@cyphar.com/>
v05: <https://lore.kernel.org/lkml/20190320143717.2523-1-cyphar@cyphar.com/>
v04: <https://lore.kernel.org/lkml/20181112142654.341-1-cyphar@cyphar.com/>
v03: <https://lore.kernel.org/lkml/20181009070230.12884-1-cyphar@cyphar.com/>
v02: <https://lore.kernel.org/lkml/20181009065300.11053-1-cyphar@cyphar.com/>
v01: <https://lore.kernel.org/lkml/20180929103453.12025-1-cyphar@cyphar.com/>
For a very long time, extending openat(2) with new features has been
incredibly frustrating. This stems from the fact that openat(2) is
possibly the most famous counter-example to the mantra "don't silently
accept garbage from userspace" -- it doesn't check whether unknown flags
are present[1].
This means that (generally) the addition of new flags to openat(2) has
been fraught with backwards-compatibility issues (O_TMPFILE has to be
defined as __O_TMPFILE|O_DIRECTORY|[O_RDWR or O_WRONLY] to ensure old
kernels gave errors, since it's insecure to silently ignore the
flag[2]). All new security-related flags therefore have a tough road to
being added to openat(2).
Furthermore, the need for some sort of control over VFS's path resolution (to
avoid malicious paths resulting in inadvertent breakouts) has been a very
long-standing desire of many userspace applications. This patchset is a revival
of Al Viro's old AT_NO_JUMPS[3] patchset (which was a variant of David
Drysdale's O_BENEATH patchset[4] which was a spin-off of the Capsicum
project[5]) with a few additions and changes made based on the previous
discussion within [6] as well as others I felt were useful.
In line with the conclusions of the original discussion of AT_NO_JUMPS, the
flag has been split up into separate flags. However, instead of being an
openat(2) flag it is provided through a new syscall openat2(2) which provides
several other improvements to the openat(2) interface (see the patch
description for more details). The following new LOOKUP_* flags are added:
* LOOKUP_NO_XDEV blocks all mountpoint crossings (upwards, downwards,
or through absolute links). Absolute pathnames alone in openat(2) do not
trigger this. Magic-link traversal which implies a vfsmount jump is also
blocked (though magic-link jumps on the same vfsmount are permitted).
* LOOKUP_NO_MAGICLINKS blocks resolution through /proc/$pid/fd-style
links. This is done by blocking the usage of nd_jump_link() during
resolution in a filesystem. The term "magic-links" is used to match
with the only reference to these links in Documentation/, but I'm
happy to change the name.
It should be noted that this is different to the scope of
~LOOKUP_FOLLOW in that it applies to all path components. However,
you can do openat2(NO_FOLLOW|NO_MAGICLINKS) on a magic-link and it
will *not* fail (assuming that no parent component was a
magic-link), and you will have an fd for the magic-link.
In order to correctly detect magic-links, the introduction of a new
LOOKUP_MAGICLINK_JUMPED state flag was required.
* LOOKUP_BENEATH disallows escapes to outside the starting dirfd's
tree, using techniques such as ".." or absolute links. Absolute
paths in openat(2) are also disallowed. Conceptually this flag is to
ensure you "stay below" a certain point in the filesystem tree --
but this requires some additional to protect against various races
that would allow escape using "..".
Currently LOOKUP_BENEATH implies LOOKUP_NO_MAGICLINKS, because it
can trivially beam you around the filesystem (breaking the
protection). In future, there might be similar safety checks done as
in LOOKUP_IN_ROOT, but that requires more discussion.
In addition, two new flags are added that expand on the above ideas:
* LOOKUP_NO_SYMLINKS does what it says on the tin. No symlink
resolution is allowed at all, including magic-links. Just as with
LOOKUP_NO_MAGICLINKS this can still be used with NOFOLLOW to open an
fd for the symlink as long as no parent path had a symlink
component.
* LOOKUP_IN_ROOT is an extension of LOOKUP_BENEATH that, rather than
blocking attempts to move past the root, forces all such movements
to be scoped to the starting point. This provides chroot(2)-like
protection but without the cost of a chroot(2) for each filesystem
operation, as well as being safe against race attacks that chroot(2)
is not.
If a race is detected (as with LOOKUP_BENEATH) then an error is
generated, and similar to LOOKUP_BENEATH it is not permitted to cross
magic-links with LOOKUP_IN_ROOT.
The primary need for this is from container runtimes, which
currently need to do symlink scoping in userspace[7] when opening
paths in a potentially malicious container. There is a long list of
CVEs that could have bene mitigated by having RESOLVE_THIS_ROOT
(such as CVE-2017-1002101, CVE-2017-1002102, CVE-2018-15664, and
CVE-2019-5736, just to name a few).
In order to make all of the above more usable, I'm working on
libpathrs[8] which is a C-friendly library for safe path resolution. It
features a userspace-emulated backend if the kernel doesn't support
openat2(2). Hopefully we can get userspace to switch to using it, and
thus get openat2(2) support for free once it's ready.
[1]: https://lwn.net/Articles/588444/
[2]: https://lore.kernel.org/lkml/CA+55aFyyxJL1LyXZeBsf2ypriraj5ut1XkNDsunRBqgVj…
[3]: https://lore.kernel.org/lkml/20170429220414.GT29622@ZenIV.linux.org.uk
[4]: https://lore.kernel.org/lkml/1415094884-18349-1-git-send-email-drysdale@goo…
[5]: https://lore.kernel.org/lkml/1404124096-21445-1-git-send-email-drysdale@goo…
[6]: https://lwn.net/Articles/723057/
[7]: https://github.com/cyphar/filepath-securejoin
[8]: https://github.com/openSUSE/libpathrs
The current draft of the openat2(2) man-page is included below.
--8<---------------------------------------------------------------------------
OPENAT2(2) Linux Programmer's Manual OPENAT2(2)
NAME
openat2 - open and possibly create a file (extended)
SYNOPSIS
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
int openat2(int dirfd, const char *pathname, struct open_how *how, size_t size);
Note: There is no glibc wrapper for this system call; see NOTES.
DESCRIPTION
The openat2() system call opens the file specified by pathname. If the specified file
does not exist, it may optionally (if O_CREAT is specified in how.flags) be created by
openat2().
As with openat(2), if pathname is relative, then it is interpreted relative to the
directory referred to by the file descriptor dirfd (or the current working directory of
the calling process, if dirfd is the special value AT_FDCWD.) If pathname is absolute,
then dirfd is ignored (unless how.resolve contains RESOLVE_IN_ROOT, in which case pathname
is resolved relative to dirfd.)
The openat2() system call is an extension of openat(2) and provides a superset of its
functionality. Rather than taking a single flag argument, an extensible structure (how)
is passed instead to allow for future extensions. size must be set to sizeof(struct
open_how), to facilitate future extensions (see the "Extensibility" section of the NOTES
for more detail on how extensions are handled.)
The open_how structure
The following structure indicates how pathname should be opened, and acts as a superset of
the flag and mode arguments to openat(2).
struct open_how {
__aligned_u64 flags; /* O_* flags. */
__u16 mode; /* Mode for O_{CREAT,TMPFILE}. */
__u16 __padding[3]; /* Must be zeroed. */
__aligned_u64 resolve; /* RESOLVE_* flags. */
};
Any future extensions to openat2() will be implemented as new fields appended to the above
structure (or through reuse of pre-existing padding space), with the zero value of the new
fields acting as though the extension were not present.
The meaning of each field is as follows:
flags
The file creation and status flags to use for this operation. All of the
O_* flags defined for openat(2) are valid openat2() flag values.
Unlike openat(2), it is an error to provide openat2() unknown or conflicting
flags in flags.
mode
File mode for the new file, with identical semantics to the mode argument to
openat(2). However, unlike openat(2), it is an error to provide openat2()
with a mode which contains bits other than 0777.
It is an error to provide openat2() a non-zero mode if flags does not
contain O_CREAT or O_TMPFILE.
resolve
Change how the components of pathname will be resolved (see
path_resolution(7) for background information.) The primary use case for
these flags is to allow trusted programs to restrict how untrusted paths (or
paths inside untrusted directories) are resolved. The full list of resolve
flags is given below.
RESOLVE_NO_XDEV
Disallow traversal of mount points during path resolution (including
all bind mounts).
Users of this flag are encouraged to make its use configurable
(unless it is used for a specific security purpose), as bind mounts
are very widely used by end-users. Setting this flag indiscrimnately
for all uses of openat2() may result in spurious errors on
previously-functional systems.
RESOLVE_NO_SYMLINKS
Disallow resolution of symbolic links during path resolution. This
option implies RESOLVE_NO_MAGICLINKS.
If the trailing component is a symbolic link, and flags contains both
O_PATH and O_NOFOLLOW, then an O_PATH file descriptor referencing the
symbolic link will be returned.
Users of this flag are encouraged to make its use configurable
(unless it is used for a specific security purpose), as symbolic
links are very widely used by end-users. Setting this flag
indiscrimnately for all uses of openat2() may result in spurious
errors on previously-functional systems.
RESOLVE_NO_MAGICLINKS
Disallow all magic link resolution during path resolution.
If the trailing component is a magic link, and flags contains both
O_PATH and O_NOFOLLOW, then an O_PATH file descriptor referencing the
magic link will be returned.
Magic-links are symbolic link-like objects that are most notably
found in proc(5) (examples include /proc/[pid]/exe and
/proc/[pid]/fd/*.) Due to the potential danger of unknowingly
opening these magic links, it may be preferable for users to disable
their resolution entirely (see symboliclink(7) for more details.)
RESOLVE_BENEATH
Do not permit the path resolution to succeed if any component of the
resolution is not a descendant of the directory indicated by dirfd.
This results in absolute symbolic links (and absolute values of
pathname) to be rejected.
Currently, this flag also disables magic link resolution. However,
this may change in the future. The caller should explicitly specify
RESOLVE_NO_MAGICLINKS to ensure that magic links are not resolved.
RESOLVE_IN_ROOT
Treat dirfd as the root directory while resolving pathname (as though
the user called chroot(2) with dirfd as the argument.) Absolute
symbolic links and ".." path components will be scoped to dirfd. If
pathname is an absolute path, it is also treated relative to dirfd.
However, unlike chroot(2) (which changes the filesystem root
permanently for a process), RESOLVE_IN_ROOT allows a program to
efficiently restrict path resolution for only certain operations. It
also has several hardening features (such detecting escape attempts
during .. resolution) which chroot(2) does not.
Currently, this flag also disables magic link resolution. However,
this may change in the future. The caller should explicitly specify
RESOLVE_NO_MAGICLINKS to ensure that magic links are not resolved.
It is an error to provide openat2() unknown flags in resolve.
RETURN VALUE
On success, a new file descriptor is returned. On error, -1 is returned, and errno is set
appropriately.
ERRORS
The set of errors returned by openat2() includes all of the errors returned by openat(2),
as well as the following additional errors:
EINVAL An unknown flag or invalid value was specified in how.
EINVAL mode is non-zero, but flags does not contain O_CREAT or O_TMPFILE.
EINVAL size was smaller than any known version of struct open_how.
E2BIG An extension was specified in how, which the current kernel does not support (see
the "Extensibility" section of the NOTES for more detail on how extensions are
handled.)
EAGAIN resolve contains either RESOLVE_IN_ROOT or RESOLVE_BENEATH, and the kernel could
not ensure that a ".." component didn't escape (due to a race condition or
potential attack.) Callers may choose to retry the openat2() call.
EXDEV resolve contains either RESOLVE_IN_ROOT or RESOLVE_BENEATH, and an escape from the
root during path resolution was detected.
EXDEV resolve contains RESOLVE_NO_XDEV, and a path component attempted to cross a mount
point.
ELOOP resolve contains RESOLVE_NO_SYMLINKS, and one of the path components was a symbolic
link (or magic link).
ELOOP resolve contains RESOLVE_NO_MAGICLINKS, and one of the path components was a magic
link.
VERSIONS
openat2() was added to Linux in kernel 5.FOO.
CONFORMING TO
This system call is Linux-specific.
The semantics of RESOLVE_BENEATH were modelled after FreeBSD's O_BENEATH.
NOTES
Glibc does not provide a wrapper for this system call; call it using systemcall(2).
Extensibility
In order to allow for struct open_how to be extended in future kernel revisions, openat2()
requires userspace to specify the size of struct open_how structure they are passing. By
providing this information, it is possible for openat2() to provide both forwards- and
backwards-compatibility — with size acting as an implicit version number (because new
extension fields will always be appended, the size will always increase.) This
extensibility design is very similar to other system calls such as perf_setattr(2),
perf_event_open(2), and clone(3).
If we let usize be the size of the structure according to userspace and ksize be the size
of the structure which the kernel supports, then there are only three cases to consider:
* If ksize equals usize, then there is no version mismatch and how can be used
verbatim.
* If ksize is larger than usize, then there are some extensions the kernel
supports which the userspace program is unaware of. Because all extensions must
have their zero values be a no-op, the kernel treats all of the extension fields
not set by userspace to have zero values. This provides backwards-
compatibility.
* If ksize is smaller than usize, then there are some extensions which the
userspace program is aware of but the kernel does not support. Because all
extensions must have their zero values be a no-op, the kernel can safely ignore
the unsupported extension fields if they are all-zero. If any unsupported
extension fields are non-zero, then -1 is returned and errno is set to E2BIG.
This provides forwards-compatibility.
Therefore, most userspace programs will not need to have any special handling of
extensions. However, if a userspace program wishes to determine what extensions the
running kernel supports, they may conduct a binary search on size (to find the largest
value which doesn't produce an error of E2BIG.)
SEE ALSO
openat(2), path_resolution(7), symboliclink(7)
Linux 2019-10-10 OPENAT2(2)
--8<---------------------------------------------------------------------------
Aleksa Sarai (6):
namei: O_BENEATH-style resolution restriction flags
namei: LOOKUP_IN_ROOT: chroot-like path resolution
namei: permit ".." resolution with LOOKUP_{IN_ROOT,BENEATH}
open: introduce openat2(2) syscall
selftests: add openat2(2) selftests
Documentation: path-lookup: mention LOOKUP_MAGICLINK_JUMPED
CREDITS | 4 +-
Documentation/filesystems/path-lookup.rst | 18 +-
arch/alpha/kernel/syscalls/syscall.tbl | 1 +
arch/arm/tools/syscall.tbl | 1 +
arch/arm64/include/asm/unistd.h | 2 +-
arch/arm64/include/asm/unistd32.h | 2 +
arch/ia64/kernel/syscalls/syscall.tbl | 1 +
arch/m68k/kernel/syscalls/syscall.tbl | 1 +
arch/microblaze/kernel/syscalls/syscall.tbl | 1 +
arch/mips/kernel/syscalls/syscall_n32.tbl | 1 +
arch/mips/kernel/syscalls/syscall_n64.tbl | 1 +
arch/mips/kernel/syscalls/syscall_o32.tbl | 1 +
arch/parisc/kernel/syscalls/syscall.tbl | 1 +
arch/powerpc/kernel/syscalls/syscall.tbl | 1 +
arch/s390/kernel/syscalls/syscall.tbl | 1 +
arch/sh/kernel/syscalls/syscall.tbl | 1 +
arch/sparc/kernel/syscalls/syscall.tbl | 1 +
arch/x86/entry/syscalls/syscall_32.tbl | 1 +
arch/x86/entry/syscalls/syscall_64.tbl | 1 +
arch/xtensa/kernel/syscalls/syscall.tbl | 1 +
fs/namei.c | 167 +++++-
fs/open.c | 154 ++++--
include/linux/fcntl.h | 12 +-
include/linux/namei.h | 12 +
include/linux/syscalls.h | 3 +
include/uapi/asm-generic/unistd.h | 5 +-
include/uapi/linux/fcntl.h | 41 ++
tools/testing/selftests/Makefile | 1 +
tools/testing/selftests/openat2/.gitignore | 1 +
tools/testing/selftests/openat2/Makefile | 8 +
tools/testing/selftests/openat2/helpers.c | 109 ++++
tools/testing/selftests/openat2/helpers.h | 107 ++++
.../testing/selftests/openat2/openat2_test.c | 297 ++++++++++
.../selftests/openat2/rename_attack_test.c | 160 ++++++
.../testing/selftests/openat2/resolve_test.c | 523 ++++++++++++++++++
35 files changed, 1571 insertions(+), 71 deletions(-)
create mode 100644 tools/testing/selftests/openat2/.gitignore
create mode 100644 tools/testing/selftests/openat2/Makefile
create mode 100644 tools/testing/selftests/openat2/helpers.c
create mode 100644 tools/testing/selftests/openat2/helpers.h
create mode 100644 tools/testing/selftests/openat2/openat2_test.c
create mode 100644 tools/testing/selftests/openat2/rename_attack_test.c
create mode 100644 tools/testing/selftests/openat2/resolve_test.c
--
2.23.0
The current kunit execution model is to provide base kunit functionality
and tests built-in to the kernel. The aim of this series is to allow
building kunit itself and tests as modules. This in turn allows a
simple form of selective execution; load the module you wish to test.
In doing so, kunit itself (if also built as a module) will be loaded as
an implicit dependency.
Because this requires a core API modification - if a module delivers
multiple suites, they must be declared with the kunit_test_suites()
macro - we're proposing this patch as a candidate to be applied to the
test tree before too many kunit consumers appear. We attempt to deal
with existing consumers in patch 1.
Changes since v1:
- sent correct patch set; apologies, previous patch set was built
prior to kunit move to lib/ and should be ignored.
Patch 1 consists changes needed to support loading tests as modules.
Patch 2 allows kunit itself to be loaded as a module.
Patch 3 documents module support.
Alan Maguire (3):
kunit: allow kunit tests to be loaded as a module
kunit: allow kunit to be loaded as a module
kunit: update documentation to describe module-based build
Documentation/dev-tools/kunit/faq.rst | 3 ++-
Documentation/dev-tools/kunit/index.rst | 3 +++
Documentation/dev-tools/kunit/usage.rst | 16 ++++++++++++++++
include/kunit/test.h | 30 +++++++++++++++++++++++-------
kernel/sysctl-test.c | 6 +++++-
lib/Kconfig.debug | 4 ++--
lib/kunit/Kconfig | 6 +++---
lib/kunit/Makefile | 4 +++-
lib/kunit/assert.c | 8 ++++++++
lib/kunit/example-test.c | 6 +++++-
lib/kunit/string-stream-test.c | 9 +++++++--
lib/kunit/string-stream.c | 7 +++++++
lib/kunit/test-test.c | 8 ++++++--
lib/kunit/test.c | 12 ++++++++++++
lib/kunit/try-catch.c | 8 ++++++--
15 files changed, 108 insertions(+), 22 deletions(-)
--
1.8.3.1