Some applications, especially tracing ones, benefit from avoiding the
syscall overhead for getcpu() so it is common for architectures to have
vDSO implementations. Add one for arm64, using TPIDRRO_EL0 to pass a
pointer to per-CPU data rather than just store the immediate value in
order to allow for future extensibility.
It is questionable if something TPIDRRO_EL0 based is worthwhile at all
on current kernels, since v4.18 we have had support for restartable
sequences which can be used to provide a sched_getcpu() implementation
with generally better performance than the vDSO approach on
architectures which have that[1]. Work is ongoing to implement this for
glibc:
https://lore.kernel.org/lkml/20200527185130.5604-3-mathieu.desnoyers@effici…
but is not yet merged and will need similar work for other userspaces.
The main advantages for the vDSO implementation are the node parameter
(though this is a static mapping to CPU number so could be looked up
separately when processing data if it's needed, it shouldn't need to be
in the hot path) and ease of implementation for users.
This is currently not compatible with KPTI due to the use of TPIDRRO_EL0
by the KPTI trampoline, this could be addressed by reinitializing that
system register in the return path but I have found it hard to justify
adding that overhead for all users for something that is essentially a
profiling optimization which is likely to get superceeded by a more
modern implementation - if there are other uses for the per-CPU data
then the balance might change here.
This builds on work done by Kristina Martsenko some time ago but is a
new implementation.
[1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?…
v3:
- Rebase on v5.9-rc1.
- Drop in progress portions of the series.
v2:
- Rebase on v5.8-rc3.
- Add further cleanup patches & a first draft of multi-page support.
Mark Brown (5):
arm64: vdso: Provide a define when building the vDSO
arm64: vdso: Add per-CPU data
arm64: vdso: Initialise the per-CPU vDSO data
arm64: vdso: Add getcpu() implementation
selftests: vdso: Support arm64 in getcpu() test
arch/arm64/include/asm/processor.h | 12 +----
arch/arm64/include/asm/vdso/datapage.h | 54 +++++++++++++++++++
arch/arm64/kernel/process.c | 26 ++++++++-
arch/arm64/kernel/vdso.c | 33 +++++++++++-
arch/arm64/kernel/vdso/Makefile | 4 +-
arch/arm64/kernel/vdso/vdso.lds.S | 1 +
arch/arm64/kernel/vdso/vgetcpu.c | 48 +++++++++++++++++
.../testing/selftests/vDSO/vdso_test_getcpu.c | 10 ++++
8 files changed, 172 insertions(+), 16 deletions(-)
create mode 100644 arch/arm64/include/asm/vdso/datapage.h
create mode 100644 arch/arm64/kernel/vdso/vgetcpu.c
--
2.20.1
Currently kunit_tool does not work correctly when executed from a path
outside of the kernel tree, so make sure that the current working
directory is correct and the kunit_dir is properly initialized before
running.
Signed-off-by: Brendan Higgins <brendanhiggins(a)google.com>
---
tools/testing/kunit/kunit.py | 13 +++++--------
1 file changed, 5 insertions(+), 8 deletions(-)
diff --git a/tools/testing/kunit/kunit.py b/tools/testing/kunit/kunit.py
index 425ef40067e7e..e2caf4e24ecb2 100755
--- a/tools/testing/kunit/kunit.py
+++ b/tools/testing/kunit/kunit.py
@@ -237,9 +237,13 @@ def main(argv, linux=None):
cli_args = parser.parse_args(argv)
+ if get_kernel_root_path():
+ os.chdir(get_kernel_root_path())
+
if cli_args.subcommand == 'run':
if not os.path.exists(cli_args.build_dir):
os.mkdir(cli_args.build_dir)
+ create_default_kunitconfig()
if not linux:
linux = kunit_kernel.LinuxSourceTree()
@@ -257,6 +261,7 @@ def main(argv, linux=None):
if cli_args.build_dir:
if not os.path.exists(cli_args.build_dir):
os.mkdir(cli_args.build_dir)
+ create_default_kunitconfig()
if not linux:
linux = kunit_kernel.LinuxSourceTree()
@@ -270,10 +275,6 @@ def main(argv, linux=None):
if result.status != KunitStatus.SUCCESS:
sys.exit(1)
elif cli_args.subcommand == 'build':
- if cli_args.build_dir:
- if not os.path.exists(cli_args.build_dir):
- os.mkdir(cli_args.build_dir)
-
if not linux:
linux = kunit_kernel.LinuxSourceTree()
@@ -288,10 +289,6 @@ def main(argv, linux=None):
if result.status != KunitStatus.SUCCESS:
sys.exit(1)
elif cli_args.subcommand == 'exec':
- if cli_args.build_dir:
- if not os.path.exists(cli_args.build_dir):
- os.mkdir(cli_args.build_dir)
-
if not linux:
linux = kunit_kernel.LinuxSourceTree()
base-commit: 30185b69a2d533c4ba6ca926b8390ce7de495e29
--
2.28.0.236.gb10cc79966-goog
Some tests might not be able to be run if resources like huge pages are
not available. Mark these tests as skipped instead of simply passing.
Signed-off-by: Ralph Campbell <rcampbell(a)nvidia.com>
---
This applies to linux-mm and is for Andrew Morton's tree.
tools/testing/selftests/vm/hmm-tests.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/tools/testing/selftests/vm/hmm-tests.c b/tools/testing/selftests/vm/hmm-tests.c
index 93fc5cadce61..0a28a6a29581 100644
--- a/tools/testing/selftests/vm/hmm-tests.c
+++ b/tools/testing/selftests/vm/hmm-tests.c
@@ -680,7 +680,7 @@ TEST_F(hmm, anon_write_hugetlbfs)
n = gethugepagesizes(pagesizes, 4);
if (n <= 0)
- return;
+ SKIP(return, "Huge page size could not be determined");
for (idx = 0; --n > 0; ) {
if (pagesizes[n] < pagesizes[idx])
idx = n;
@@ -694,7 +694,7 @@ TEST_F(hmm, anon_write_hugetlbfs)
buffer->ptr = get_hugepage_region(size, GHR_STRICT);
if (buffer->ptr == NULL) {
free(buffer);
- return;
+ SKIP(return, "Huge page could not be allocated");
}
buffer->fd = -1;
--
2.20.1
From: Mattias Nissler <mnissler(a)chromium.org>
For mounts that have the new "nosymfollow" option, don't follow symlinks
when resolving paths. The new option is similar in spirit to the
existing "nodev", "noexec", and "nosuid" options, as well as to the
LOOKUP_NO_SYMLINKS resolve flag in the openat2(2) syscall. Various BSD
variants have been supporting the "nosymfollow" mount option for a long
time with equivalent implementations.
Note that symlinks may still be created on file systems mounted with
the "nosymfollow" option present. readlink() remains functional, so
user space code that is aware of symlinks can still choose to follow
them explicitly.
Setting the "nosymfollow" mount option helps prevent privileged
writers from modifying files unintentionally in case there is an
unexpected link along the accessed path. The "nosymfollow" option is
thus useful as a defensive measure for systems that need to deal with
untrusted file systems in privileged contexts.
More information on the history and motivation for this patch can be
found here:
https://sites.google.com/a/chromium.org/dev/chromium-os/chromiumos-design-d…
Signed-off-by: Mattias Nissler <mnissler(a)chromium.org>
Signed-off-by: Ross Zwisler <zwisler(a)google.com>
Reviewed-by: Aleksa Sarai <cyphar(a)cyphar.com>
---
Changes since v7 [1]:
* Rebased onto v5.9-rc1.
* Added selftest in second patch.
* Added Aleska's Reviewed-By tag. Thank you for the review!
After this lands I will upstream changes to util-linux[2] and man-pages
[3].
[1]: https://lkml.org/lkml/2020/8/11/896
[2]: https://github.com/rzwisler/util-linux/commit/7f8771acd85edb70d97921c026c55…
[3]: https://github.com/rzwisler/man-pages/commit/b8fe8079f64b5068940c0144586e58…
---
fs/namei.c | 3 ++-
fs/namespace.c | 2 ++
fs/proc_namespace.c | 1 +
fs/statfs.c | 2 ++
include/linux/mount.h | 3 ++-
include/linux/statfs.h | 1 +
include/uapi/linux/mount.h | 1 +
7 files changed, 11 insertions(+), 2 deletions(-)
diff --git a/fs/namei.c b/fs/namei.c
index e99e2a9da0f7d..12d92af2e2ca7 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -1626,7 +1626,8 @@ static const char *pick_link(struct nameidata *nd, struct path *link,
return ERR_PTR(error);
}
- if (unlikely(nd->flags & LOOKUP_NO_SYMLINKS))
+ if (unlikely(nd->flags & LOOKUP_NO_SYMLINKS) ||
+ unlikely(nd->path.mnt->mnt_flags & MNT_NOSYMFOLLOW))
return ERR_PTR(-ELOOP);
if (!(nd->flags & LOOKUP_RCU)) {
diff --git a/fs/namespace.c b/fs/namespace.c
index bae0e95b3713a..6408788a649e1 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -3160,6 +3160,8 @@ int path_mount(const char *dev_name, struct path *path,
mnt_flags &= ~(MNT_RELATIME | MNT_NOATIME);
if (flags & MS_RDONLY)
mnt_flags |= MNT_READONLY;
+ if (flags & MS_NOSYMFOLLOW)
+ mnt_flags |= MNT_NOSYMFOLLOW;
/* The default atime for remount is preservation */
if ((flags & MS_REMOUNT) &&
diff --git a/fs/proc_namespace.c b/fs/proc_namespace.c
index 3059a9394c2d6..e59d4bb3a89e4 100644
--- a/fs/proc_namespace.c
+++ b/fs/proc_namespace.c
@@ -70,6 +70,7 @@ static void show_mnt_opts(struct seq_file *m, struct vfsmount *mnt)
{ MNT_NOATIME, ",noatime" },
{ MNT_NODIRATIME, ",nodiratime" },
{ MNT_RELATIME, ",relatime" },
+ { MNT_NOSYMFOLLOW, ",nosymfollow" },
{ 0, NULL }
};
const struct proc_fs_opts *fs_infop;
diff --git a/fs/statfs.c b/fs/statfs.c
index 2616424012ea7..59f33752c1311 100644
--- a/fs/statfs.c
+++ b/fs/statfs.c
@@ -29,6 +29,8 @@ static int flags_by_mnt(int mnt_flags)
flags |= ST_NODIRATIME;
if (mnt_flags & MNT_RELATIME)
flags |= ST_RELATIME;
+ if (mnt_flags & MNT_NOSYMFOLLOW)
+ flags |= ST_NOSYMFOLLOW;
return flags;
}
diff --git a/include/linux/mount.h b/include/linux/mount.h
index de657bd211fa6..aaf343b38671c 100644
--- a/include/linux/mount.h
+++ b/include/linux/mount.h
@@ -30,6 +30,7 @@ struct fs_context;
#define MNT_NODIRATIME 0x10
#define MNT_RELATIME 0x20
#define MNT_READONLY 0x40 /* does the user want this to be r/o? */
+#define MNT_NOSYMFOLLOW 0x80
#define MNT_SHRINKABLE 0x100
#define MNT_WRITE_HOLD 0x200
@@ -46,7 +47,7 @@ struct fs_context;
#define MNT_SHARED_MASK (MNT_UNBINDABLE)
#define MNT_USER_SETTABLE_MASK (MNT_NOSUID | MNT_NODEV | MNT_NOEXEC \
| MNT_NOATIME | MNT_NODIRATIME | MNT_RELATIME \
- | MNT_READONLY)
+ | MNT_READONLY | MNT_NOSYMFOLLOW)
#define MNT_ATIME_MASK (MNT_NOATIME | MNT_NODIRATIME | MNT_RELATIME )
#define MNT_INTERNAL_FLAGS (MNT_SHARED | MNT_WRITE_HOLD | MNT_INTERNAL | \
diff --git a/include/linux/statfs.h b/include/linux/statfs.h
index 9bc69edb8f188..fac4356ea1bfc 100644
--- a/include/linux/statfs.h
+++ b/include/linux/statfs.h
@@ -40,6 +40,7 @@ struct kstatfs {
#define ST_NOATIME 0x0400 /* do not update access times */
#define ST_NODIRATIME 0x0800 /* do not update directory access times */
#define ST_RELATIME 0x1000 /* update atime relative to mtime/ctime */
+#define ST_NOSYMFOLLOW 0x2000 /* do not follow symlinks */
struct dentry;
extern int vfs_get_fsid(struct dentry *dentry, __kernel_fsid_t *fsid);
diff --git a/include/uapi/linux/mount.h b/include/uapi/linux/mount.h
index 96a0240f23fed..dd8306ea336c1 100644
--- a/include/uapi/linux/mount.h
+++ b/include/uapi/linux/mount.h
@@ -16,6 +16,7 @@
#define MS_REMOUNT 32 /* Alter flags of a mounted FS */
#define MS_MANDLOCK 64 /* Allow mandatory locks on an FS */
#define MS_DIRSYNC 128 /* Directory modifications are synchronous */
+#define MS_NOSYMFOLLOW 256 /* Do not follow symlinks */
#define MS_NOATIME 1024 /* Do not update access times. */
#define MS_NODIRATIME 2048 /* Do not update directory access times */
#define MS_BIND 4096
--
2.28.0.220.ged08abb693-goog
check_result() uses "comm" to check expected results of selftests output
in dmesg. Everything works fine if timestamps in dmesg are unique. If
not, like in this example
[ 86.844422] test_klp_callbacks_demo: pre_unpatch_callback: test_klp_callbacks_mod -> [MODULE_STATE_LIVE] Normal state
[ 86.844422] livepatch: 'test_klp_callbacks_demo': starting unpatching transition
, "comm" fails with "comm: file 2 is not in sorted order". Suppress the
order checking with --nocheck-order option.
Signed-off-by: Miroslav Benes <mbenes(a)suse.cz>
---
The strange thing is, I can reproduce the issue easily and reliably on
older codestreams (4.12) but not on current upstream in my testing
environment. I think the change makes sense regardless though.
tools/testing/selftests/livepatch/functions.sh | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/tools/testing/selftests/livepatch/functions.sh b/tools/testing/selftests/livepatch/functions.sh
index 1aba83c87ad3..846c7ed71556 100644
--- a/tools/testing/selftests/livepatch/functions.sh
+++ b/tools/testing/selftests/livepatch/functions.sh
@@ -278,7 +278,7 @@ function check_result {
# help differentiate repeated testing runs. Remove them with a
# post-comparison sed filter.
- result=$(dmesg | comm -13 "$SAVED_DMESG" - | \
+ result=$(dmesg | comm --nocheck-order -13 "$SAVED_DMESG" - | \
grep -e 'livepatch:' -e 'test_klp' | \
grep -v '\(tainting\|taints\) kernel' | \
sed 's/^\[[ 0-9.]*\] //')
--
2.28.0
From: Leon He <leon.he(a)unisoc.com>
There are two errors in the dmabuf-heap selftest:
1. The 'char name[5]' was not initialized to zero, which will cause
strcmp(name, "vgem") failed in check_vgem().
2. The return value of test_alloc_errors() should be reversed, other-
wise the while loop in main() will be broken.
Signed-off-by: Leon He <leon.he(a)unisoc.com>
---
tools/testing/selftests/dmabuf-heaps/dmabuf-heap.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/tools/testing/selftests/dmabuf-heaps/dmabuf-heap.c b/tools/testing/selftests/dmabuf-heaps/dmabuf-heap.c
index cd5e1f6..836b185 100644
--- a/tools/testing/selftests/dmabuf-heaps/dmabuf-heap.c
+++ b/tools/testing/selftests/dmabuf-heaps/dmabuf-heap.c
@@ -22,7 +22,7 @@
static int check_vgem(int fd)
{
drm_version_t version = { 0 };
- char name[5];
+ char name[5] = { 0 };
int ret;
version.name_len = 4;
@@ -357,7 +357,7 @@ static int test_alloc_errors(char *heap_name)
if (heap_fd >= 0)
close(heap_fd);
- return ret;
+ return !ret;
}
int main(void)
--
2.7.4
From: David Ahern <dsahern(a)kernel.org>
[ Upstream commit bcf7ddb0186d366f761f86196b480ea6dd2dc18c ]
h1 is initially configured to reach h2 via r1 rather than the
more direct path through r2. If rp_filter is set and inherited
for r2, forwarding fails since the source address of h1 is
reachable from eth0 vs the packet coming to it via r1 and eth1.
Since rp_filter setting affects the test, explicitly reset it.
Signed-off-by: David Ahern <dsahern(a)kernel.org>
Signed-off-by: David S. Miller <davem(a)davemloft.net>
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
---
tools/testing/selftests/net/icmp_redirect.sh | 2 ++
1 file changed, 2 insertions(+)
diff --git a/tools/testing/selftests/net/icmp_redirect.sh b/tools/testing/selftests/net/icmp_redirect.sh
index 18c5de53558af..bf361f30d6ef9 100755
--- a/tools/testing/selftests/net/icmp_redirect.sh
+++ b/tools/testing/selftests/net/icmp_redirect.sh
@@ -180,6 +180,8 @@ setup()
;;
r[12]) ip netns exec $ns sysctl -q -w net.ipv4.ip_forward=1
ip netns exec $ns sysctl -q -w net.ipv4.conf.all.send_redirects=1
+ ip netns exec $ns sysctl -q -w net.ipv4.conf.default.rp_filter=0
+ ip netns exec $ns sysctl -q -w net.ipv4.conf.all.rp_filter=0
ip netns exec $ns sysctl -q -w net.ipv6.conf.all.forwarding=1
ip netns exec $ns sysctl -q -w net.ipv6.route.mtu_expires=10
--
2.25.1
From: Yauheni Kaliuta <yauheni.kaliuta(a)redhat.com>
[ Upstream commit c210773d6c6f595f5922d56b7391fe343bc7310e ]
The error path in libbpf.c:load_program() has calls to pr_warn()
which ends up for global_funcs tests to
test_global_funcs.c:libbpf_debug_print().
For the tests with no struct test_def::err_str initialized with a
string, it causes call of strstr() with NULL as the second argument
and it segfaults.
Fix it by calling strstr() only for non-NULL err_str.
Signed-off-by: Yauheni Kaliuta <yauheni.kaliuta(a)redhat.com>
Signed-off-by: Alexei Starovoitov <ast(a)kernel.org>
Acked-by: Yonghong Song <yhs(a)fb.com>
Link: https://lore.kernel.org/bpf/20200820115843.39454-1-yauheni.kaliuta@redhat.c…
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
---
tools/testing/selftests/bpf/prog_tests/test_global_funcs.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/tools/testing/selftests/bpf/prog_tests/test_global_funcs.c b/tools/testing/selftests/bpf/prog_tests/test_global_funcs.c
index 25b068591e9a4..193002b14d7f6 100644
--- a/tools/testing/selftests/bpf/prog_tests/test_global_funcs.c
+++ b/tools/testing/selftests/bpf/prog_tests/test_global_funcs.c
@@ -19,7 +19,7 @@ static int libbpf_debug_print(enum libbpf_print_level level,
log_buf = va_arg(args, char *);
if (!log_buf)
goto out;
- if (strstr(log_buf, err_str) == 0)
+ if (err_str && strstr(log_buf, err_str) == 0)
found = true;
out:
printf(format, log_buf);
--
2.25.1
From: David Ahern <dsahern(a)kernel.org>
[ Upstream commit bcf7ddb0186d366f761f86196b480ea6dd2dc18c ]
h1 is initially configured to reach h2 via r1 rather than the
more direct path through r2. If rp_filter is set and inherited
for r2, forwarding fails since the source address of h1 is
reachable from eth0 vs the packet coming to it via r1 and eth1.
Since rp_filter setting affects the test, explicitly reset it.
Signed-off-by: David Ahern <dsahern(a)kernel.org>
Signed-off-by: David S. Miller <davem(a)davemloft.net>
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
---
tools/testing/selftests/net/icmp_redirect.sh | 2 ++
1 file changed, 2 insertions(+)
diff --git a/tools/testing/selftests/net/icmp_redirect.sh b/tools/testing/selftests/net/icmp_redirect.sh
index 18c5de53558af..bf361f30d6ef9 100755
--- a/tools/testing/selftests/net/icmp_redirect.sh
+++ b/tools/testing/selftests/net/icmp_redirect.sh
@@ -180,6 +180,8 @@ setup()
;;
r[12]) ip netns exec $ns sysctl -q -w net.ipv4.ip_forward=1
ip netns exec $ns sysctl -q -w net.ipv4.conf.all.send_redirects=1
+ ip netns exec $ns sysctl -q -w net.ipv4.conf.default.rp_filter=0
+ ip netns exec $ns sysctl -q -w net.ipv4.conf.all.rp_filter=0
ip netns exec $ns sysctl -q -w net.ipv6.conf.all.forwarding=1
ip netns exec $ns sysctl -q -w net.ipv6.route.mtu_expires=10
--
2.25.1
From: Yauheni Kaliuta <yauheni.kaliuta(a)redhat.com>
[ Upstream commit c210773d6c6f595f5922d56b7391fe343bc7310e ]
The error path in libbpf.c:load_program() has calls to pr_warn()
which ends up for global_funcs tests to
test_global_funcs.c:libbpf_debug_print().
For the tests with no struct test_def::err_str initialized with a
string, it causes call of strstr() with NULL as the second argument
and it segfaults.
Fix it by calling strstr() only for non-NULL err_str.
Signed-off-by: Yauheni Kaliuta <yauheni.kaliuta(a)redhat.com>
Signed-off-by: Alexei Starovoitov <ast(a)kernel.org>
Acked-by: Yonghong Song <yhs(a)fb.com>
Link: https://lore.kernel.org/bpf/20200820115843.39454-1-yauheni.kaliuta@redhat.c…
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
---
tools/testing/selftests/bpf/prog_tests/test_global_funcs.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/tools/testing/selftests/bpf/prog_tests/test_global_funcs.c b/tools/testing/selftests/bpf/prog_tests/test_global_funcs.c
index 25b068591e9a4..193002b14d7f6 100644
--- a/tools/testing/selftests/bpf/prog_tests/test_global_funcs.c
+++ b/tools/testing/selftests/bpf/prog_tests/test_global_funcs.c
@@ -19,7 +19,7 @@ static int libbpf_debug_print(enum libbpf_print_level level,
log_buf = va_arg(args, char *);
if (!log_buf)
goto out;
- if (strstr(log_buf, err_str) == 0)
+ if (err_str && strstr(log_buf, err_str) == 0)
found = true;
out:
printf(format, log_buf);
--
2.25.1
From: David Ahern <dsahern(a)kernel.org>
[ Upstream commit bcf7ddb0186d366f761f86196b480ea6dd2dc18c ]
h1 is initially configured to reach h2 via r1 rather than the
more direct path through r2. If rp_filter is set and inherited
for r2, forwarding fails since the source address of h1 is
reachable from eth0 vs the packet coming to it via r1 and eth1.
Since rp_filter setting affects the test, explicitly reset it.
Signed-off-by: David Ahern <dsahern(a)kernel.org>
Signed-off-by: David S. Miller <davem(a)davemloft.net>
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
---
tools/testing/selftests/net/icmp_redirect.sh | 2 ++
1 file changed, 2 insertions(+)
diff --git a/tools/testing/selftests/net/icmp_redirect.sh b/tools/testing/selftests/net/icmp_redirect.sh
index 18c5de53558af..bf361f30d6ef9 100755
--- a/tools/testing/selftests/net/icmp_redirect.sh
+++ b/tools/testing/selftests/net/icmp_redirect.sh
@@ -180,6 +180,8 @@ setup()
;;
r[12]) ip netns exec $ns sysctl -q -w net.ipv4.ip_forward=1
ip netns exec $ns sysctl -q -w net.ipv4.conf.all.send_redirects=1
+ ip netns exec $ns sysctl -q -w net.ipv4.conf.default.rp_filter=0
+ ip netns exec $ns sysctl -q -w net.ipv4.conf.all.rp_filter=0
ip netns exec $ns sysctl -q -w net.ipv6.conf.all.forwarding=1
ip netns exec $ns sysctl -q -w net.ipv6.route.mtu_expires=10
--
2.25.1
In the beginning, mm/gup_benchmark.c supported get_user_pages_fast()
only, but right now, it supports the benchmarking of a couple of
get_user_pages() related calls like:
* get_user_pages_fast()
* get_user_pages()
* pin_user_pages_fast()
* pin_user_pages()
The documentation is confusing and needs update.
Cc: John Hubbard <jhubbard(a)nvidia.com>
Cc: Keith Busch <keith.busch(a)intel.com>
Cc: Ira Weiny <ira.weiny(a)intel.com>
Cc: Kirill A. Shutemov <kirill.shutemov(a)linux.intel.com>
Signed-off-by: Barry Song <song.bao.hua(a)hisilicon.com>
---
mm/Kconfig | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/mm/Kconfig b/mm/Kconfig
index 6c974888f86f..f7c9374da7b3 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -831,10 +831,10 @@ config PERCPU_STATS
be used to help understand percpu memory usage.
config GUP_BENCHMARK
- bool "Enable infrastructure for get_user_pages_fast() benchmarking"
+ bool "Enable infrastructure for get_user_pages() and related calls benchmarking"
help
Provides /sys/kernel/debug/gup_benchmark that helps with testing
- performance of get_user_pages_fast().
+ performance of get_user_pages() and related calls.
See tools/testing/selftests/vm/gup_benchmark.c
--
2.27.0
This series attempts to provide a simple way for BPF programs (and in
future other consumers) to utilize BPF Type Format (BTF) information
to display kernel data structures in-kernel. The use case this
functionality is applied to here is to support a bpf_trace_printk
trace event-based method of rendering type information.
There is already support in kernel/bpf/btf.c for "show" functionality;
the changes here generalize that support from seq-file specific
verifier display to the more generic case and add another specific
use case; rather than seq_printf()ing the show data, it is traced
using the bpf_trace_printk trace event. Other future uses of the
show functionality could include a bpf_printk_btf() function which
printk()ed the data instead. Oops messaging in particular would
be an interesting application for such functionality.
The above potential use case hints at a potential reply to
a reasonable objection that such typed display should be
solved by tracing programs, where the in-kernel tracing records
data and the userspace program prints it out. While this
is certainly the recommended approach for most cases, I
believe having an in-kernel mechanism would be valuable
also. Critically in BPF programs it greatly simplifies
debugging and tracing of such data to invoking a one-line
helper.
One challenge raised in an earlier iteration of this work -
where the BTF printing was implemented as a printk() format
specifier - was that the amount of data printed per
printk() was large, and other format specifiers were far
simpler. This patchset tackles this by instead displaying
data _as the data structure is traversed_, rathern than copying
it to a string for later display. The problem in doing this
however is that such output can be intermixed with other
bpf_trace_printk events. The solution pursued here is to
associate a trace ID with the bpf_trace_printk events. For
now, the bpf_trace_printk() helper sets this trace ID to 0,
and bpf_trace_btf() can specify non-zero values. This allows
a BPF program to coordinate with a user-space program which
creates a separate trace instance which filters trace events
based on trace ID, allowing for clean display without pollution
from other data sources. Future work could enhance
bpf_trace_printk() to do this too, either via a new helper
or by smuggling a 32-bit trace id value into the "fmt_size"
argument (the latter might be problematic for 32-bit platforms
though, so a new helper might be preferred).
To aid in display the bpf_trace_btf() helper is passed a
"struct btf_ptr *" which specifies the data to be traced
(struct sk_buff * say), the BTF id of the type (the BTF
id of "struct sk_buff") or a string representation of
the type ("struct sk_buff"). A flags field is also
present for future use.
Separately a number of flags control how the output is
rendered, see patch 3 for more details.
A snippet of output from printing "struct sk_buff *"
(see patch 3 for the full output) looks like this:
<idle>-0 [023] d.s. 1825.778400: bpf_trace_printk: (struct sk_buff){
<idle>-0 [023] d.s. 1825.778409: bpf_trace_printk: (union){
<idle>-0 [023] d.s. 1825.778410: bpf_trace_printk: (struct){
<idle>-0 [023] d.s. 1825.778412: bpf_trace_printk: .prev = (struct sk_buff *)0x00000000b2a3df7e,
<idle>-0 [023] d.s. 1825.778413: bpf_trace_printk: (union){
<idle>-0 [023] d.s. 1825.778414: bpf_trace_printk: .dev = (struct net_device *)0x000000001658808b,
<idle>-0 [023] d.s. 1825.778416: bpf_trace_printk: .dev_scratch = (long unsigned int)18446628460391432192,
<idle>-0 [023] d.s. 1825.778417: bpf_trace_printk: },
<idle>-0 [023] d.s. 1825.778417: bpf_trace_printk: },
<idle>-0 [023] d.s. 1825.778418: bpf_trace_printk: .rbnode = (struct rb_node){
<idle>-0 [023] d.s. 1825.778419: bpf_trace_printk: .rb_right = (struct rb_node *)0x00000000b2a3df7e,
<idle>-0 [023] d.s. 1825.778420: bpf_trace_printk: .rb_left = (struct rb_node *)0x000000001658808b,
<idle>-0 [023] d.s. 1825.778420: bpf_trace_printk: },
<idle>-0 [023] d.s. 1825.778421: bpf_trace_printk: .list = (struct list_head){
<idle>-0 [023] d.s. 1825.778422: bpf_trace_printk: .prev = (struct list_head *)0x00000000b2a3df7e,
<idle>-0 [023] d.s. 1825.778422: bpf_trace_printk: },
<idle>-0 [023] d.s. 1825.778422: bpf_trace_printk: },
<idle>-0 [023] d.s. 1825.778426: bpf_trace_printk: .len = (unsigned int)168,
<idle>-0 [023] d.s. 1825.778427: bpf_trace_printk: .mac_len = (__u16)14,
Changes since v3:
- Moved to RFC since the approach is different (and bpf-next is
closed)
- Rather than using a printk() format specifier as the means
of invoking BTF-enabled display, a dedicated BPF helper is
used. This solves the issue of printk() having to output
large amounts of data using a complex mechanism such as
BTF traversal, but still provides a way for the display of
such data to be achieved via BPF programs. Future work could
include a bpf_printk_btf() function to invoke display via
printk() where the elements of a data structure are printk()ed
one at a time. Thanks to Petr Mladek, Andy Shevchenko and
Rasmus Villemoes who took time to look at the earlier printk()
format-specifier-focused version of this and provided feedback
clarifying the problems with that approach.
- Added trace id to the bpf_trace_printk events as a means of
separating output from standard bpf_trace_printk() events,
ensuring it can be easily parsed by the reader.
- Added bpf_trace_btf() helper tests which do simple verification
of the various display options.
Changes since v2:
- Alexei and Yonghong suggested it would be good to use
probe_kernel_read() on to-be-shown data to ensure safety
during operation. Safe copy via probe_kernel_read() to a
buffer object in "struct btf_show" is used to support
this. A few different approaches were explored
including dynamic allocation and per-cpu buffers. The
downside of dynamic allocation is that it would be done
during BPF program execution for bpf_trace_printk()s using
%pT format specifiers. The problem with per-cpu buffers
is we'd have to manage preemption and since the display
of an object occurs over an extended period and in printk
context where we'd rather not change preemption status,
it seemed tricky to manage buffer safety while considering
preemption. The approach of utilizing stack buffer space
via the "struct btf_show" seemed like the simplest approach.
The stack size of the associated functions which have a
"struct btf_show" on their stack to support show operation
(btf_type_snprintf_show() and btf_type_seq_show()) stays
under 500 bytes. The compromise here is the safe buffer we
use is small - 256 bytes - and as a result multiple
probe_kernel_read()s are needed for larger objects. Most
objects of interest are smaller than this (e.g.
"struct sk_buff" is 224 bytes), and while task_struct is a
notable exception at ~8K, performance is not the priority for
BTF-based display. (Alexei and Yonghong, patch 2).
- safe buffer use is the default behaviour (and is mandatory
for BPF) but unsafe display - meaning no safe copy is done
and we operate on the object itself - is supported via a
'u' option.
- pointers are prefixed with 0x for clarity (Alexei, patch 2)
- added additional comments and explanations around BTF show
code, especially around determining whether objects such
zeroed. Also tried to comment safe object scheme used. (Yonghong,
patch 2)
- added late_initcall() to initialize vmlinux BTF so that it would
not have to be initialized during printk operation (Alexei,
patch 5)
- removed CONFIG_BTF_PRINTF config option as it is not needed;
CONFIG_DEBUG_INFO_BTF can be used to gate test behaviour and
determining behaviour of type-based printk can be done via
retrieval of BTF data; if it's not there BTF was unavailable
or broken (Alexei, patches 4,6)
- fix bpf_trace_printk test to use vmlinux.h and globals via
skeleton infrastructure, removing need for perf events
(Andrii, patch 8)
Changes since v1:
- changed format to be more drgn-like, rendering indented type info
along with type names by default (Alexei)
- zeroed values are omitted (Arnaldo) by default unless the '0'
modifier is specified (Alexei)
- added an option to print pointer values without obfuscation.
The reason to do this is the sysctls controlling pointer display
are likely to be irrelevant in many if not most tracing contexts.
Some questions on this in the outstanding questions section below...
- reworked printk format specifer so that we no longer rely on format
%pT<type> but instead use a struct * which contains type information
(Rasmus). This simplifies the printk parsing, makes use more dynamic
and also allows specification by BTF id as well as name.
- removed incorrect patch which tried to fix dereferencing of resolved
BTF info for vmlinux; instead we skip modifiers for the relevant
case (array element type determination) (Alexei).
- fixed issues with negative snprintf format length (Rasmus)
- added test cases for various data structure formats; base types,
typedefs, structs, etc.
- tests now iterate through all typedef, enum, struct and unions
defined for vmlinux BTF and render a version of the target dummy
value which is either all zeros or all 0xff values; the idea is this
exercises the "skip if zero" and "print everything" cases.
- added support in BPF for using the %pT format specifier in
bpf_trace_printk()
- added BPF tests which ensure %pT format specifier use works (Alexei).
Alan Maguire (4):
bpf: provide function to get vmlinux BTF information
bpf: make BTF show support generic, apply to seq
files/bpf_trace_printk
bpf: add bpf_trace_btf helper
selftests/bpf: add bpf_trace_btf helper tests
include/linux/bpf.h | 5 +
include/linux/btf.h | 38 +
include/uapi/linux/bpf.h | 63 ++
kernel/bpf/btf.c | 962 ++++++++++++++++++---
kernel/bpf/core.c | 5 +
kernel/bpf/helpers.c | 4 +
kernel/bpf/verifier.c | 18 +-
kernel/trace/bpf_trace.c | 121 ++-
kernel/trace/bpf_trace.h | 6 +-
scripts/bpf_helpers_doc.py | 2 +
tools/include/uapi/linux/bpf.h | 63 ++
tools/testing/selftests/bpf/prog_tests/trace_btf.c | 45 +
.../selftests/bpf/progs/netif_receive_skb.c | 43 +
13 files changed, 1257 insertions(+), 118 deletions(-)
create mode 100644 tools/testing/selftests/bpf/prog_tests/trace_btf.c
create mode 100644 tools/testing/selftests/bpf/progs/netif_receive_skb.c
--
1.8.3.1
KUnit test cases run on kthreads, and kthreads don't have an
adddress space (current->mm is NULL), but processes have mm.
The purpose of this patch is to allow to borrow mm to KUnit kthread
after userspace is brought up, because we know that there are processes
running, at least the process that loaded the module to borrow mm.
This allows, for example, tests such as user_copy_kunit, which uses
vm_mmap, which needs current->mm.
Signed-off-by: Vitor Massaru Iha <vitor(a)massaru.org>
---
v2:
* splitted patch in 3:
- Allows to install and load modules in root filesystem;
- Provides an userspace memory context when tests are compiled
as module;
- Convert test_user_copy to KUnit test;
* added documentation;
* added more explanation;
* tested a pointer;
* released mput();
---
Documentation/dev-tools/kunit/usage.rst | 14 ++++++++++++++
include/kunit/test.h | 12 ++++++++++++
lib/kunit/try-catch.c | 15 ++++++++++++++-
3 files changed, 40 insertions(+), 1 deletion(-)
diff --git a/Documentation/dev-tools/kunit/usage.rst b/Documentation/dev-tools/kunit/usage.rst
index 3c3fe8b5fecc..9f909157be34 100644
--- a/Documentation/dev-tools/kunit/usage.rst
+++ b/Documentation/dev-tools/kunit/usage.rst
@@ -448,6 +448,20 @@ We can now use it to test ``struct eeprom_buffer``:
.. _kunit-on-non-uml:
+User-space context
+------------------
+
+I case you need a user-space context, for now this is only possible through
+tests compiled as a module. And it will be necessary to use a root filesystem
+and uml_utilities.
+
+Example:
+
+.. code-block:: bash
+
+ ./tools/testing/kunit/kunit.py run --timeout=60 --uml_rootfs_dir=.uml_rootfs
+
+
KUnit on non-UML architectures
==============================
diff --git a/include/kunit/test.h b/include/kunit/test.h
index 59f3144f009a..ae3337139c65 100644
--- a/include/kunit/test.h
+++ b/include/kunit/test.h
@@ -222,6 +222,18 @@ struct kunit {
* protect it with some type of lock.
*/
struct list_head resources; /* Protected by lock. */
+ /*
+ * KUnit test cases run on kthreads, and kthreads don't have an
+ * adddress space (current->mm is NULL), but processes have mm.
+ *
+ * The purpose of this mm_struct is to allow to borrow mm to KUnit kthread
+ * after userspace is brought up, because we know that there are processes
+ * running, at least the process that loaded the module to borrow mm.
+ *
+ * This allows, for example, tests such as user_copy_kunit, which uses
+ * vm_mmap, which needs current->mm.
+ */
+ struct mm_struct *mm;
};
void kunit_init_test(struct kunit *test, const char *name, char *log);
diff --git a/lib/kunit/try-catch.c b/lib/kunit/try-catch.c
index 0dd434e40487..d03e2093985b 100644
--- a/lib/kunit/try-catch.c
+++ b/lib/kunit/try-catch.c
@@ -11,7 +11,8 @@
#include <linux/completion.h>
#include <linux/kernel.h>
#include <linux/kthread.h>
-
+#include <linux/sched/mm.h>
+#include <linux/sched/task.h>
#include "try-catch-impl.h"
void __noreturn kunit_try_catch_throw(struct kunit_try_catch *try_catch)
@@ -24,8 +25,17 @@ EXPORT_SYMBOL_GPL(kunit_try_catch_throw);
static int kunit_generic_run_threadfn_adapter(void *data)
{
struct kunit_try_catch *try_catch = data;
+ struct kunit *test = try_catch->test;
+
+ if (test != NULL && test->mm != NULL)
+ kthread_use_mm(test->mm);
try_catch->try(try_catch->context);
+ if (test != NULL && test->mm != NULL) {
+ kthread_unuse_mm(test->mm);
+ mmput(test->mm);
+ test->mm = NULL;
+ }
complete_and_exit(try_catch->try_completion, 0);
}
@@ -65,6 +75,9 @@ void kunit_try_catch_run(struct kunit_try_catch *try_catch, void *context)
try_catch->context = context;
try_catch->try_completion = &try_completion;
try_catch->try_result = 0;
+
+ test->mm = get_task_mm(current);
+
task_struct = kthread_run(kunit_generic_run_threadfn_adapter,
try_catch,
"kunit_try_catch_thread");
base-commit: 725aca9585956676687c4cb803e88f770b0df2b2
prerequisite-patch-id: 5e5f9a8a05c5680fda1b04c9ab1b95ce91dc88b2
prerequisite-patch-id: 4d997940f4a9f303424af9bac412de1af861f9d9
prerequisite-patch-id: 582b6d9d28ce4b71628890ec832df6522ca68de0
--
2.26.2
Hey everyone,
This is a follow-up to the do_fork() cleanup from last cycle based on a
short discussion this was merged.
Last cycle we removed copy_thread_tls() and the associated Kconfig
option for each architecture. Now we are only left with copy_thread().
Part of this work was removing the old do_fork() legacy clone()-style
calling convention in favor of the new struct kernel_clone args calling
convention.
The only remaining function callable outside of kernel/fork.c is
_do_fork(). It doesn't really follow the naming of kernel-internal
syscall helpers as Christoph righly pointed out. Switch all callers and
references to kernel_clone() and remove _do_fork() once and for all.
For all architectures I have done a full git rebase v5.9-rc1 -x "make
-j31". There were no built failures and the changes were fairly
mechanical.
The only helpers we have left now are kernel_thread() and kernel_clone()
where kernel_thread() just calls kernel_clone().
Thanks!
Christian
Christian Brauner (11):
fork: introduce kernel_clone()
h8300: switch to kernel_clone()
ia64: switch to kernel_clone()
m68k: switch to kernel_clone()
nios2: switch to kernel_clone()
sparc: switch to kernel_clone()
x86: switch to kernel_clone()
kprobes: switch to kernel_clone()
kgdbts: switch to kernel_clone()
tracing: switch to kernel_clone()
sched: remove _do_fork()
Documentation/trace/histogram.rst | 4 +-
arch/h8300/kernel/process.c | 2 +-
arch/ia64/kernel/process.c | 4 +-
arch/m68k/kernel/process.c | 10 ++--
arch/nios2/kernel/process.c | 2 +-
arch/sparc/kernel/process.c | 6 +--
arch/x86/kernel/sys_ia32.c | 2 +-
drivers/misc/kgdbts.c | 48 +++++++++----------
include/linux/sched/task.h | 2 +-
kernel/fork.c | 14 +++---
samples/kprobes/kprobe_example.c | 6 +--
samples/kprobes/kretprobe_example.c | 4 +-
.../test.d/dynevent/add_remove_kprobe.tc | 2 +-
.../test.d/dynevent/clear_select_events.tc | 2 +-
.../test.d/dynevent/generic_clear_event.tc | 2 +-
.../test.d/ftrace/func-filter-stacktrace.tc | 4 +-
.../ftrace/test.d/kprobe/add_and_remove.tc | 2 +-
.../ftrace/test.d/kprobe/busy_check.tc | 2 +-
.../ftrace/test.d/kprobe/kprobe_args.tc | 4 +-
.../ftrace/test.d/kprobe/kprobe_args_comm.tc | 2 +-
.../test.d/kprobe/kprobe_args_string.tc | 4 +-
.../test.d/kprobe/kprobe_args_symbol.tc | 10 ++--
.../ftrace/test.d/kprobe/kprobe_args_type.tc | 2 +-
.../ftrace/test.d/kprobe/kprobe_ftrace.tc | 14 +++---
.../ftrace/test.d/kprobe/kprobe_multiprobe.tc | 2 +-
.../test.d/kprobe/kprobe_syntax_errors.tc | 12 ++---
.../ftrace/test.d/kprobe/kretprobe_args.tc | 4 +-
.../selftests/ftrace/test.d/kprobe/profile.tc | 2 +-
28 files changed, 87 insertions(+), 87 deletions(-)
base-commit: 9123e3a74ec7b934a4a099e98af6a61c2f80bbf5
--
2.28.0
If debug_regs.c is built with newer gcc, e.g., 8.3.1 on my side, then the generated
binary looks like over-optimized by gcc:
asm volatile("ss_start: "
"xor %%rax,%%rax\n\t"
"cpuid\n\t"
"movl $0x1a0,%%ecx\n\t"
"rdmsr\n\t"
: : : "rax", "ecx");
is translated to :
000000000040194e <ss_start>:
40194e: 31 c0 xor %eax,%eax <----- rax->eax?
401950: 0f a2 cpuid
401952: b9 a0 01 00 00 mov $0x1a0,%ecx
401957: 0f 32 rdmsr
As you can see rax is replaced with eax in taret binary code.
But if I replace %%rax with %%r8 or any GPR from r8~15, then I get below
expected binary:
0000000000401950 <ss_start>:
401950: 45 31 ff xor %r15d,%r15d
401953: 0f a2 cpuid
401955: b9 a0 01 00 00 mov $0x1a0,%ecx
40195a: 0f 32 rdmsr
The difference is the length of xor instruction(2 Byte vs 3 Byte),
so this makes below hard-coded instruction length cannot pass runtime check:
/* Instruction lengths starting at ss_start */
int ss_size[4] = {
3, /* xor */ <-------- 2 or 3?
2, /* cpuid */
5, /* mov */
2, /* rdmsr */
};
Note:
Use 8.2.1 or older gcc, it generates expected 3 bytes xor target code.
I use the default Makefile to build the binaries, and I cannot figure out why this
happens, so it comes this patch, maybe you have better solution to resolve the
issue. If you know how things work in this way, please let me know, thanks!
Below is the capture from my environments:
========================================================================
gcc (GCC) 8.3.1 20190223 (Red Hat 8.3.1-2)
Copyright (C) 2018 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
0000000000401950 <ss_start>:
401950: 45 31 ff xor %r15d,%r15d
401953: 0f a2 cpuid
401955: b9 a0 01 00 00 mov $0x1a0,%ecx
40195a: 0f 32 rdmsr
000000000040194f <ss_start>:
40194f: 31 db xor %ebx,%ebx
401951: 0f a2 cpuid
401953: b9 a0 01 00 00 mov $0x1a0,%ecx
401958: 0f 32 rdmsr
000000000040194e <ss_start>:
40194e: 31 c0 xor %eax,%eax
401950: 0f a2 cpuid
401952: b9 a0 01 00 00 mov $0x1a0,%ecx
401957: 0f 32 rdmsr
==========================================================================
gcc (GCC) 8.2.1 20180905 (Red Hat 8.2.1-3)
Copyright (C) 2018 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
0000000000401750 <ss_start>:
401750: 48 31 c0 xor %rax,%rax
401753: 0f a2 cpuid
401755: b9 a0 01 00 00 mov $0x1a0,%ecx
40175a: 0f 32 rdmsr
Signed-off-by: Yang Weijiang <weijiang.yang(a)intel.com>
---
tools/testing/selftests/kvm/x86_64/debug_regs.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/tools/testing/selftests/kvm/x86_64/debug_regs.c b/tools/testing/selftests/kvm/x86_64/debug_regs.c
index 8162c58a1234..74641cfa8ace 100644
--- a/tools/testing/selftests/kvm/x86_64/debug_regs.c
+++ b/tools/testing/selftests/kvm/x86_64/debug_regs.c
@@ -40,11 +40,11 @@ static void guest_code(void)
/* Single step test, covers 2 basic instructions and 2 emulated */
asm volatile("ss_start: "
- "xor %%rax,%%rax\n\t"
+ "xor %%r15,%%r15\n\t"
"cpuid\n\t"
"movl $0x1a0,%%ecx\n\t"
"rdmsr\n\t"
- : : : "rax", "ecx");
+ : : : "r15", "ecx");
/* DR6.BD test */
asm volatile("bd_start: mov %%dr0, %%rax" : : : "rax");
--
2.17.2