Hi,
Passing a non-blocking pidfd to waitid() currently has no effect, i.e.
is not supported. There are users which would like to use waitid() on
pidfds that are O_NONBLOCK and mix it with pidfds that are blocking and
both pass them to waitid().
The expected behavior is to have waitid() return -EAGAIN for
non-blocking pidfds and to block for blocking pidfds without needing to
perform any additional checks for flags set on the pidfd before passing
it to waitid().
Non-blocking pidfds will return EAGAIN from waitid() when no child
process is ready yet. Returning -EAGAIN for non-blocking pidfds makes it
easier for event loops that handle EAGAIN specially.
It also makes the API more consistent and uniform. In essence, waitid()
is treated like a read on a non-blocking pidfd or a recvmsg() on a
non-blocking socket.
With the addition of support for non-blocking pidfds we support the same
functionality that sockets do. For sockets() recvmsg() supports
MSG_DONTWAIT for pidfds waitid() supports WNOHANG. Both flags are
per-call options. In contrast non-blocking pidfds and non-blocking
sockets are a setting on an open file description affecting all threads
in the calling process as well as other processes that hold file
descriptors referring to the same open file description. Both behaviors,
per call and per open file description, have genuine use-cases.
A concrete use-case that was brought on-list (see [1]) was Josh's async
pidfd library. Ever since the introduction of pidfds and more advanced
async io various programming languages such as Rust have grown support
for async event libraries. These libraries are created to help build
epoll-based event loops around file descriptors. A common pattern is to
automatically make all file descriptors they manage to O_NONBLOCK.
For such libraries the EAGAIN error code is treated specially. When a
function is called that returns EAGAIN the function isn't called again
until the event loop indicates the the file descriptor is ready.
Supporting EAGAIN when waiting on pidfds makes such libraries just work
with little effort.
Thanks!
Christian
[1]: https://lore.kernel.org/lkml/20200811181236.GA18763@localhost/
Christian Brauner (4):
pidfd: support PIDFD_NONBLOCK in pidfd_open()
exit: support non-blocking pidfds
tests: port pidfd_wait to kselftest harness
tests: add waitid() tests for non-blocking pidfds
include/uapi/linux/pidfd.h | 12 +
kernel/exit.c | 15 +-
kernel/pid.c | 12 +-
tools/testing/selftests/pidfd/pidfd.h | 4 +
tools/testing/selftests/pidfd/pidfd_wait.c | 298 +++++++++------------
5 files changed, 157 insertions(+), 184 deletions(-)
create mode 100644 include/uapi/linux/pidfd.h
base-commit: d012a7190fc1fd72ed48911e77ca97ba4521bccd
--
2.28.0
The returned value of bpf_object__open_file() should be checked with
libbpf_get_error() rather than NULL. This fix prevents test_progs from
crash when test_global_data.o is not present.
Signed-off-by: Hao Luo <haoluo(a)google.com>
---
tools/testing/selftests/bpf/prog_tests/global_data_init.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/tools/testing/selftests/bpf/prog_tests/global_data_init.c b/tools/testing/selftests/bpf/prog_tests/global_data_init.c
index 3bdaa5a40744..ee46b11f1f9a 100644
--- a/tools/testing/selftests/bpf/prog_tests/global_data_init.c
+++ b/tools/testing/selftests/bpf/prog_tests/global_data_init.c
@@ -12,7 +12,8 @@ void test_global_data_init(void)
size_t sz;
obj = bpf_object__open_file(file, NULL);
- if (CHECK_FAIL(!obj))
+ err = libbpf_get_error(obj);
+ if (CHECK_FAIL(err))
return;
map = bpf_object__find_map_by_name(obj, "test_glo.rodata");
--
2.28.0.526.ge36021eeef-goog
The returned value of bpf_object__open_file() should be checked with
IS_ERR() rather than NULL. This fix makes test_progs not crash when
test_global_data.o is not present.
Signed-off-by: Hao Luo <haoluo(a)google.com>
---
tools/testing/selftests/bpf/prog_tests/global_data_init.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/tools/testing/selftests/bpf/prog_tests/global_data_init.c b/tools/testing/selftests/bpf/prog_tests/global_data_init.c
index 3bdaa5a40744..1ece86d5c519 100644
--- a/tools/testing/selftests/bpf/prog_tests/global_data_init.c
+++ b/tools/testing/selftests/bpf/prog_tests/global_data_init.c
@@ -12,7 +12,7 @@ void test_global_data_init(void)
size_t sz;
obj = bpf_object__open_file(file, NULL);
- if (CHECK_FAIL(!obj))
+ if (CHECK_FAIL(IS_ERR(obj)))
return;
map = bpf_object__find_map_by_name(obj, "test_glo.rodata");
--
2.28.0.402.g5ffc5be6b7-goog
A migrating transparent huge page has to already be unmapped. Otherwise,
the page could be modified while it is being copied to a new page and
data could be lost. The function __split_huge_pmd() checks for a PMD
migration entry before calling __split_huge_pmd_locked() leading one to
think that __split_huge_pmd_locked() can handle splitting a migrating PMD.
However, the code always increments the page->_mapcount and adjusts the
memory control group accounting assuming the page is mapped.
Also, if the PMD entry is a migration PMD entry, the call to
is_huge_zero_pmd(*pmd) is incorrect because it calls pmd_pfn(pmd) instead
of migration_entry_to_pfn(pmd_to_swp_entry(pmd)).
Fix these problems by checking for a PMD migration entry.
Fixes: 84c3fc4e9c56 ("mm: thp: check pmd migration entry in common path")
cc: stable(a)vger.kernel.org # 4.14+
Signed-off-by: Ralph Campbell <rcampbell(a)nvidia.com>
Reviewed-by: Yang Shi <shy828301(a)gmail.com>
Reviewed-by: Zi Yan <ziy(a)nvidia.com>
---
No changes in v3 to this patch, just added reviewed-by and fixes to the
change log and sending this as a separate patch from the rest of the
series ("mm/hmm/nouveau: add THP migration to migrate_vma_*").
I'll hold off resending the series without this patch unless there are
changes needed.
mm/huge_memory.c | 42 +++++++++++++++++++++++-------------------
1 file changed, 23 insertions(+), 19 deletions(-)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 2a468a4acb0a..606d712d9505 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2023,7 +2023,7 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
put_page(page);
add_mm_counter(mm, mm_counter_file(page), -HPAGE_PMD_NR);
return;
- } else if (is_huge_zero_pmd(*pmd)) {
+ } else if (pmd_trans_huge(*pmd) && is_huge_zero_pmd(*pmd)) {
/*
* FIXME: Do we want to invalidate secondary mmu by calling
* mmu_notifier_invalidate_range() see comments below inside
@@ -2117,30 +2117,34 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
pte = pte_offset_map(&_pmd, addr);
BUG_ON(!pte_none(*pte));
set_pte_at(mm, addr, pte, entry);
- atomic_inc(&page[i]._mapcount);
- pte_unmap(pte);
- }
-
- /*
- * Set PG_double_map before dropping compound_mapcount to avoid
- * false-negative page_mapped().
- */
- if (compound_mapcount(page) > 1 && !TestSetPageDoubleMap(page)) {
- for (i = 0; i < HPAGE_PMD_NR; i++)
+ if (!pmd_migration)
atomic_inc(&page[i]._mapcount);
+ pte_unmap(pte);
}
- lock_page_memcg(page);
- if (atomic_add_negative(-1, compound_mapcount_ptr(page))) {
- /* Last compound_mapcount is gone. */
- __dec_lruvec_page_state(page, NR_ANON_THPS);
- if (TestClearPageDoubleMap(page)) {
- /* No need in mapcount reference anymore */
+ if (!pmd_migration) {
+ /*
+ * Set PG_double_map before dropping compound_mapcount to avoid
+ * false-negative page_mapped().
+ */
+ if (compound_mapcount(page) > 1 &&
+ !TestSetPageDoubleMap(page)) {
for (i = 0; i < HPAGE_PMD_NR; i++)
- atomic_dec(&page[i]._mapcount);
+ atomic_inc(&page[i]._mapcount);
+ }
+
+ lock_page_memcg(page);
+ if (atomic_add_negative(-1, compound_mapcount_ptr(page))) {
+ /* Last compound_mapcount is gone. */
+ __dec_lruvec_page_state(page, NR_ANON_THPS);
+ if (TestClearPageDoubleMap(page)) {
+ /* No need in mapcount reference anymore */
+ for (i = 0; i < HPAGE_PMD_NR; i++)
+ atomic_dec(&page[i]._mapcount);
+ }
}
+ unlock_page_memcg(page);
}
- unlock_page_memcg(page);
smp_wmb(); /* make pte visible before pmd */
pmd_populate(mm, pmd, pgtable);
--
2.20.1
This series adds support for transparent huge page migration to
migrate_vma_*() and adds nouveau SVM and HMM selftests as consumers.
An earlier version was posted previously [1]. This version now
supports splitting a THP midway in the migration process which
led to a number of changes.
The patches apply cleanly to the current linux-mm tree. Since there
are a couple of patches in linux-mm from Dan Williams that modify
lib/test_hmm.c and drivers/gpu/drm/nouveau/nouveau_dmem.c, it might
be easiest if Andrew could take these through the linux-mm tree
assuming that's OK with other maintainers like Ben Skeggs.
[1] https://lore.kernel.org/linux-mm/20200619215649.32297-1-rcampbell@nvidia.com
Ralph Campbell (7):
mm/thp: fix __split_huge_pmd_locked() for migration PMD
mm/migrate: move migrate_vma_collect_skip()
mm: support THP migration to device private memory
mm/thp: add prep_transhuge_device_private_page()
mm/thp: add THP allocation helper
mm/hmm/test: add self tests for THP migration
nouveau: support THP migration to private memory
drivers/gpu/drm/nouveau/nouveau_dmem.c | 289 +++++++++++-----
drivers/gpu/drm/nouveau/nouveau_svm.c | 11 +-
drivers/gpu/drm/nouveau/nouveau_svm.h | 3 +-
include/linux/gfp.h | 10 +
include/linux/huge_mm.h | 12 +
include/linux/memremap.h | 9 +
include/linux/migrate.h | 2 +
lib/test_hmm.c | 439 +++++++++++++++++++++----
lib/test_hmm_uapi.h | 3 +
mm/huge_memory.c | 177 +++++++---
mm/memory.c | 10 +-
mm/migrate.c | 429 +++++++++++++++++++-----
mm/rmap.c | 2 +-
tools/testing/selftests/vm/hmm-tests.c | 404 +++++++++++++++++++++++
14 files changed, 1519 insertions(+), 281 deletions(-)
--
2.20.1
Patch #1, #2 and #3 enables p10 2nd DAWR feature for Book3S kvm guest. DAWR
is a hypervisor resource and thus H_SET_MODE hcall is used to set/unset it.
A new case H_SET_MODE_RESOURCE_SET_DAWR1 is introduced in H_SET_MODE hcall
for setting/unsetting 2nd DAWR. Also, new capability KVM_CAP_PPC_DAWR1 has
been added to query 2nd DAWR support via kvm ioctl.
This feature also needs to be enabled in Qemu to really use it. I'll reply
link to qemu patches once I post them in qemu-devel mailing list.
Patch #4, #5, #6 and #7 adds selftests to test 2nd DAWR.
Dependency:
1: p10 kvm base enablement
https://lore.kernel.org/linuxppc-dev/20200602055325.6102-1-alistair@popple.…
2: 2nd DAWR powervm/baremetal enablement
https://lore.kernel.org/linuxppc-dev/20200723090813.303838-1-ravi.bangoria@…
3: ptrace PPC_DEBUG_FEATURE_DATA_BP_DAWR_ARCH_31 flag
https://lore.kernel.org/linuxppc-dev/20200723093330.306341-1-ravi.bangoria@…
Patches in this series applies fine on top of powerpc/next (9a77c4a0a125)
plus above dependency patches.
Ravi Bangoria (7):
powerpc/watchpoint/kvm: Rename current DAWR macros and variables
powerpc/watchpoint/kvm: Add infrastructure to support 2nd DAWR
powerpc/watchpoint/kvm: Introduce new capability for 2nd DAWR
powerpc/selftests/ptrace-hwbreak: Add testcases for 2nd DAWR
powerpc/selftests/perf-hwbreak: Coalesce event creation code
powerpc/selftests/perf-hwbreak: Add testcases for 2nd DAWR
powerpc/selftests: Add selftest to test concurrent perf/ptrace events
Documentation/virt/kvm/api.rst | 6 +-
arch/powerpc/include/asm/hvcall.h | 2 +
arch/powerpc/include/asm/kvm_host.h | 6 +-
arch/powerpc/include/uapi/asm/kvm.h | 8 +-
arch/powerpc/kernel/asm-offsets.c | 6 +-
arch/powerpc/kvm/book3s_hv.c | 73 +-
arch/powerpc/kvm/book3s_hv_nested.c | 15 +-
arch/powerpc/kvm/book3s_hv_rmhandlers.S | 43 +-
arch/powerpc/kvm/powerpc.c | 3 +
include/uapi/linux/kvm.h | 1 +
tools/arch/powerpc/include/uapi/asm/kvm.h | 8 +-
.../selftests/powerpc/ptrace/.gitignore | 1 +
.../testing/selftests/powerpc/ptrace/Makefile | 2 +-
.../selftests/powerpc/ptrace/perf-hwbreak.c | 646 +++++++++++++++--
.../selftests/powerpc/ptrace/ptrace-hwbreak.c | 79 +++
.../powerpc/ptrace/ptrace-perf-hwbreak.c | 659 ++++++++++++++++++
16 files changed, 1476 insertions(+), 82 deletions(-)
create mode 100644 tools/testing/selftests/powerpc/ptrace/ptrace-perf-hwbreak.c
--
2.26.2
This patch series extends the previously added __ksym externs with
btf support.
Right now the __ksym externs are treated as pure 64-bit scalar value.
Libbpf replaces ld_imm64 insn of __ksym by its kernel address at load
time. This patch series extend those externs with their btf info. Note
that btf support for __ksym must come with the kernel btf that has
VARs encoded to work properly. The corresponding chagnes in pahole
is available at [1].
The first 5 patches in this series add support for general kernel
global variables, which includes verifier checking (01/08), libbpf
type checking (03/08) and btf_id resolving (04/08).
The last 3 patches extends that capability further by introducing a
helper bpf_per_cpu_ptr(), which allows accessing kernel percpu vars
correctly (06/08).
The tests of this feature were performed against the extended pahole.
For kernel btf that does not have VARs encoded, the selftests will be
skipped.
[1] https://git.kernel.org/pub/scm/devel/pahole/pahole.git/commit/?id=f3d9054ba…
rfc -> v1:
- Encode VAR's btf_id for PSEUDO_BTF_ID.
- More checks in verifier. Checking the btf_id passed as
PSEUDO_BTF_ID is valid VAR, its name and type.
- Checks in libbpf on type compatibility of ksyms.
- Add bpf_per_cpu_ptr() to access kernel percpu vars. Introduced
new ARG and RET types for this helper.
Hao Luo (8):
bpf: Introduce pseudo_btf_id
bpf: Propagate BPF_PSEUDO_BTF_ID to uapi headers in /tools
bpf: Introduce help function to validate ksym's type.
bpf/libbpf: BTF support for typed ksyms
bpf/selftests: ksyms_btf to test typed ksyms
bpf: Introduce bpf_per_cpu_ptr()
bpf: Propagate bpf_per_cpu_ptr() to /tools
bpf/selftests: Test for bpf_per_cpu_ptr()
include/linux/bpf.h | 3 +
include/linux/btf.h | 26 +++
include/uapi/linux/bpf.h | 52 +++++-
kernel/bpf/btf.c | 25 ---
kernel/bpf/verifier.c | 128 ++++++++++++-
kernel/trace/bpf_trace.c | 18 ++
tools/include/uapi/linux/bpf.h | 53 +++++-
tools/lib/bpf/btf.c | 171 ++++++++++++++++++
tools/lib/bpf/btf.h | 2 +
tools/lib/bpf/libbpf.c | 130 +++++++++++--
.../selftests/bpf/prog_tests/ksyms_btf.c | 81 +++++++++
.../selftests/bpf/progs/test_ksyms_btf.c | 36 ++++
12 files changed, 665 insertions(+), 60 deletions(-)
create mode 100644 tools/testing/selftests/bpf/prog_tests/ksyms_btf.c
create mode 100644 tools/testing/selftests/bpf/progs/test_ksyms_btf.c
--
2.28.0.220.ged08abb693-goog
Hi,
Passing a non-blocking pidfd to waitid() currently has no effect, i.e.
is not supported. There are users which would like to use waitid() on
pidfds that are O_NONBLOCK and mix it with pidfds that are blocking and
both pass them to waitid().
The expected behavior is to have waitid() return -EAGAIN for
non-blocking pidfds and to block for blocking pidfds without needing to
perform any additional checks for flags set on the pidfd before passing
it to waitid().
Non-blocking pidfds will return EAGAIN from waitid() when no child
process is ready yet. Returning -EAGAIN for non-blocking pidfds makes it
easier for event loops that handle EAGAIN specially.
It also makes the API more consistent and uniform. In essence, waitid()
is treated like a read on a non-blocking pidfd or a recvmsg() on a
non-blocking socket.
With the addition of support for non-blocking pidfds we support the same
functionality that sockets do. For sockets() recvmsg() supports
MSG_DONTWAIT for pidfds waitid() supports WNOHANG. Both flags are
per-call options. In contrast non-blocking pidfds and non-blocking
sockets are a setting on an open file description affecting all threads
in the calling process as well as other processes that hold file
descriptors referring to the same open file description. Both behaviors,
per call and per open file description, have genuine use-cases.
A concrete use-case that was brought on-list (see [1]) was Josh's async
pidfd library. Ever since the introduction of pidfds and more advanced
async io various programming languages such as Rust have grown support
for async event libraries. These libraries are created to help build
epoll-based event loops around file descriptors. A common pattern is to
automatically make all file descriptors they manage to O_NONBLOCK.
For such libraries the EAGAIN error code is treated specially. When a
function is called that returns EAGAIN the function isn't called again
until the event loop indicates the the file descriptor is ready.
Supporting EAGAIN when waiting on pidfds makes such libraries just work
with little effort.
Thanks!
Christian
[1]: https://lore.kernel.org/lkml/20200811181236.GA18763@localhost/
Christian Brauner (4):
pidfd: support PIDFD_NONBLOCK in pidfd_open()
exit: support non-blocking pidfds
tests: port pidfd_wait to kselftest harness
tests: add waitid() tests for non-blocking pidfds
include/uapi/linux/pidfd.h | 12 +
kernel/exit.c | 19 +-
kernel/pid.c | 12 +-
tools/testing/selftests/pidfd/pidfd.h | 4 +
tools/testing/selftests/pidfd/pidfd_wait.c | 298 +++++++++------------
5 files changed, 161 insertions(+), 184 deletions(-)
create mode 100644 include/uapi/linux/pidfd.h
base-commit: d012a7190fc1fd72ed48911e77ca97ba4521bccd
--
2.28.0
Some applications, especially tracing ones, benefit from avoiding the
syscall overhead for getcpu() so it is common for architectures to have
vDSO implementations. Add one for arm64, using TPIDRRO_EL0 to pass a
pointer to per-CPU data rather than just store the immediate value in
order to allow for future extensibility.
It is questionable if something TPIDRRO_EL0 based is worthwhile at all
on current kernels, since v4.18 we have had support for restartable
sequences which can be used to provide a sched_getcpu() implementation
with generally better performance than the vDSO approach on
architectures which have that[1]. Work is ongoing to implement this for
glibc:
https://lore.kernel.org/lkml/20200527185130.5604-3-mathieu.desnoyers@effici…
but is not yet merged and will need similar work for other userspaces.
The main advantages for the vDSO implementation are the node parameter
(though this is a static mapping to CPU number so could be looked up
separately when processing data if it's needed, it shouldn't need to be
in the hot path) and ease of implementation for users.
This is currently not compatible with KPTI due to the use of TPIDRRO_EL0
by the KPTI trampoline, this could be addressed by reinitializing that
system register in the return path but I have found it hard to justify
adding that overhead for all users for something that is essentially a
profiling optimization which is likely to get superceeded by a more
modern implementation - if there are other uses for the per-CPU data
then the balance might change here.
This builds on work done by Kristina Martsenko some time ago but is a
new implementation.
[1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?…
v3:
- Rebase on v5.9-rc1.
- Drop in progress portions of the series.
v2:
- Rebase on v5.8-rc3.
- Add further cleanup patches & a first draft of multi-page support.
Mark Brown (5):
arm64: vdso: Provide a define when building the vDSO
arm64: vdso: Add per-CPU data
arm64: vdso: Initialise the per-CPU vDSO data
arm64: vdso: Add getcpu() implementation
selftests: vdso: Support arm64 in getcpu() test
arch/arm64/include/asm/processor.h | 12 +----
arch/arm64/include/asm/vdso/datapage.h | 54 +++++++++++++++++++
arch/arm64/kernel/process.c | 26 ++++++++-
arch/arm64/kernel/vdso.c | 33 +++++++++++-
arch/arm64/kernel/vdso/Makefile | 4 +-
arch/arm64/kernel/vdso/vdso.lds.S | 1 +
arch/arm64/kernel/vdso/vgetcpu.c | 48 +++++++++++++++++
.../testing/selftests/vDSO/vdso_test_getcpu.c | 10 ++++
8 files changed, 172 insertions(+), 16 deletions(-)
create mode 100644 arch/arm64/include/asm/vdso/datapage.h
create mode 100644 arch/arm64/kernel/vdso/vgetcpu.c
--
2.20.1
Currently kunit_tool does not work correctly when executed from a path
outside of the kernel tree, so make sure that the current working
directory is correct and the kunit_dir is properly initialized before
running.
Signed-off-by: Brendan Higgins <brendanhiggins(a)google.com>
---
tools/testing/kunit/kunit.py | 13 +++++--------
1 file changed, 5 insertions(+), 8 deletions(-)
diff --git a/tools/testing/kunit/kunit.py b/tools/testing/kunit/kunit.py
index 425ef40067e7e..e2caf4e24ecb2 100755
--- a/tools/testing/kunit/kunit.py
+++ b/tools/testing/kunit/kunit.py
@@ -237,9 +237,13 @@ def main(argv, linux=None):
cli_args = parser.parse_args(argv)
+ if get_kernel_root_path():
+ os.chdir(get_kernel_root_path())
+
if cli_args.subcommand == 'run':
if not os.path.exists(cli_args.build_dir):
os.mkdir(cli_args.build_dir)
+ create_default_kunitconfig()
if not linux:
linux = kunit_kernel.LinuxSourceTree()
@@ -257,6 +261,7 @@ def main(argv, linux=None):
if cli_args.build_dir:
if not os.path.exists(cli_args.build_dir):
os.mkdir(cli_args.build_dir)
+ create_default_kunitconfig()
if not linux:
linux = kunit_kernel.LinuxSourceTree()
@@ -270,10 +275,6 @@ def main(argv, linux=None):
if result.status != KunitStatus.SUCCESS:
sys.exit(1)
elif cli_args.subcommand == 'build':
- if cli_args.build_dir:
- if not os.path.exists(cli_args.build_dir):
- os.mkdir(cli_args.build_dir)
-
if not linux:
linux = kunit_kernel.LinuxSourceTree()
@@ -288,10 +289,6 @@ def main(argv, linux=None):
if result.status != KunitStatus.SUCCESS:
sys.exit(1)
elif cli_args.subcommand == 'exec':
- if cli_args.build_dir:
- if not os.path.exists(cli_args.build_dir):
- os.mkdir(cli_args.build_dir)
-
if not linux:
linux = kunit_kernel.LinuxSourceTree()
base-commit: 30185b69a2d533c4ba6ca926b8390ce7de495e29
--
2.28.0.236.gb10cc79966-goog
Some tests might not be able to be run if resources like huge pages are
not available. Mark these tests as skipped instead of simply passing.
Signed-off-by: Ralph Campbell <rcampbell(a)nvidia.com>
---
This applies to linux-mm and is for Andrew Morton's tree.
tools/testing/selftests/vm/hmm-tests.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/tools/testing/selftests/vm/hmm-tests.c b/tools/testing/selftests/vm/hmm-tests.c
index 93fc5cadce61..0a28a6a29581 100644
--- a/tools/testing/selftests/vm/hmm-tests.c
+++ b/tools/testing/selftests/vm/hmm-tests.c
@@ -680,7 +680,7 @@ TEST_F(hmm, anon_write_hugetlbfs)
n = gethugepagesizes(pagesizes, 4);
if (n <= 0)
- return;
+ SKIP(return, "Huge page size could not be determined");
for (idx = 0; --n > 0; ) {
if (pagesizes[n] < pagesizes[idx])
idx = n;
@@ -694,7 +694,7 @@ TEST_F(hmm, anon_write_hugetlbfs)
buffer->ptr = get_hugepage_region(size, GHR_STRICT);
if (buffer->ptr == NULL) {
free(buffer);
- return;
+ SKIP(return, "Huge page could not be allocated");
}
buffer->fd = -1;
--
2.20.1
From: Mattias Nissler <mnissler(a)chromium.org>
For mounts that have the new "nosymfollow" option, don't follow symlinks
when resolving paths. The new option is similar in spirit to the
existing "nodev", "noexec", and "nosuid" options, as well as to the
LOOKUP_NO_SYMLINKS resolve flag in the openat2(2) syscall. Various BSD
variants have been supporting the "nosymfollow" mount option for a long
time with equivalent implementations.
Note that symlinks may still be created on file systems mounted with
the "nosymfollow" option present. readlink() remains functional, so
user space code that is aware of symlinks can still choose to follow
them explicitly.
Setting the "nosymfollow" mount option helps prevent privileged
writers from modifying files unintentionally in case there is an
unexpected link along the accessed path. The "nosymfollow" option is
thus useful as a defensive measure for systems that need to deal with
untrusted file systems in privileged contexts.
More information on the history and motivation for this patch can be
found here:
https://sites.google.com/a/chromium.org/dev/chromium-os/chromiumos-design-d…
Signed-off-by: Mattias Nissler <mnissler(a)chromium.org>
Signed-off-by: Ross Zwisler <zwisler(a)google.com>
Reviewed-by: Aleksa Sarai <cyphar(a)cyphar.com>
---
Changes since v7 [1]:
* Rebased onto v5.9-rc1.
* Added selftest in second patch.
* Added Aleska's Reviewed-By tag. Thank you for the review!
After this lands I will upstream changes to util-linux[2] and man-pages
[3].
[1]: https://lkml.org/lkml/2020/8/11/896
[2]: https://github.com/rzwisler/util-linux/commit/7f8771acd85edb70d97921c026c55…
[3]: https://github.com/rzwisler/man-pages/commit/b8fe8079f64b5068940c0144586e58…
---
fs/namei.c | 3 ++-
fs/namespace.c | 2 ++
fs/proc_namespace.c | 1 +
fs/statfs.c | 2 ++
include/linux/mount.h | 3 ++-
include/linux/statfs.h | 1 +
include/uapi/linux/mount.h | 1 +
7 files changed, 11 insertions(+), 2 deletions(-)
diff --git a/fs/namei.c b/fs/namei.c
index e99e2a9da0f7d..12d92af2e2ca7 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -1626,7 +1626,8 @@ static const char *pick_link(struct nameidata *nd, struct path *link,
return ERR_PTR(error);
}
- if (unlikely(nd->flags & LOOKUP_NO_SYMLINKS))
+ if (unlikely(nd->flags & LOOKUP_NO_SYMLINKS) ||
+ unlikely(nd->path.mnt->mnt_flags & MNT_NOSYMFOLLOW))
return ERR_PTR(-ELOOP);
if (!(nd->flags & LOOKUP_RCU)) {
diff --git a/fs/namespace.c b/fs/namespace.c
index bae0e95b3713a..6408788a649e1 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -3160,6 +3160,8 @@ int path_mount(const char *dev_name, struct path *path,
mnt_flags &= ~(MNT_RELATIME | MNT_NOATIME);
if (flags & MS_RDONLY)
mnt_flags |= MNT_READONLY;
+ if (flags & MS_NOSYMFOLLOW)
+ mnt_flags |= MNT_NOSYMFOLLOW;
/* The default atime for remount is preservation */
if ((flags & MS_REMOUNT) &&
diff --git a/fs/proc_namespace.c b/fs/proc_namespace.c
index 3059a9394c2d6..e59d4bb3a89e4 100644
--- a/fs/proc_namespace.c
+++ b/fs/proc_namespace.c
@@ -70,6 +70,7 @@ static void show_mnt_opts(struct seq_file *m, struct vfsmount *mnt)
{ MNT_NOATIME, ",noatime" },
{ MNT_NODIRATIME, ",nodiratime" },
{ MNT_RELATIME, ",relatime" },
+ { MNT_NOSYMFOLLOW, ",nosymfollow" },
{ 0, NULL }
};
const struct proc_fs_opts *fs_infop;
diff --git a/fs/statfs.c b/fs/statfs.c
index 2616424012ea7..59f33752c1311 100644
--- a/fs/statfs.c
+++ b/fs/statfs.c
@@ -29,6 +29,8 @@ static int flags_by_mnt(int mnt_flags)
flags |= ST_NODIRATIME;
if (mnt_flags & MNT_RELATIME)
flags |= ST_RELATIME;
+ if (mnt_flags & MNT_NOSYMFOLLOW)
+ flags |= ST_NOSYMFOLLOW;
return flags;
}
diff --git a/include/linux/mount.h b/include/linux/mount.h
index de657bd211fa6..aaf343b38671c 100644
--- a/include/linux/mount.h
+++ b/include/linux/mount.h
@@ -30,6 +30,7 @@ struct fs_context;
#define MNT_NODIRATIME 0x10
#define MNT_RELATIME 0x20
#define MNT_READONLY 0x40 /* does the user want this to be r/o? */
+#define MNT_NOSYMFOLLOW 0x80
#define MNT_SHRINKABLE 0x100
#define MNT_WRITE_HOLD 0x200
@@ -46,7 +47,7 @@ struct fs_context;
#define MNT_SHARED_MASK (MNT_UNBINDABLE)
#define MNT_USER_SETTABLE_MASK (MNT_NOSUID | MNT_NODEV | MNT_NOEXEC \
| MNT_NOATIME | MNT_NODIRATIME | MNT_RELATIME \
- | MNT_READONLY)
+ | MNT_READONLY | MNT_NOSYMFOLLOW)
#define MNT_ATIME_MASK (MNT_NOATIME | MNT_NODIRATIME | MNT_RELATIME )
#define MNT_INTERNAL_FLAGS (MNT_SHARED | MNT_WRITE_HOLD | MNT_INTERNAL | \
diff --git a/include/linux/statfs.h b/include/linux/statfs.h
index 9bc69edb8f188..fac4356ea1bfc 100644
--- a/include/linux/statfs.h
+++ b/include/linux/statfs.h
@@ -40,6 +40,7 @@ struct kstatfs {
#define ST_NOATIME 0x0400 /* do not update access times */
#define ST_NODIRATIME 0x0800 /* do not update directory access times */
#define ST_RELATIME 0x1000 /* update atime relative to mtime/ctime */
+#define ST_NOSYMFOLLOW 0x2000 /* do not follow symlinks */
struct dentry;
extern int vfs_get_fsid(struct dentry *dentry, __kernel_fsid_t *fsid);
diff --git a/include/uapi/linux/mount.h b/include/uapi/linux/mount.h
index 96a0240f23fed..dd8306ea336c1 100644
--- a/include/uapi/linux/mount.h
+++ b/include/uapi/linux/mount.h
@@ -16,6 +16,7 @@
#define MS_REMOUNT 32 /* Alter flags of a mounted FS */
#define MS_MANDLOCK 64 /* Allow mandatory locks on an FS */
#define MS_DIRSYNC 128 /* Directory modifications are synchronous */
+#define MS_NOSYMFOLLOW 256 /* Do not follow symlinks */
#define MS_NOATIME 1024 /* Do not update access times. */
#define MS_NODIRATIME 2048 /* Do not update directory access times */
#define MS_BIND 4096
--
2.28.0.220.ged08abb693-goog
check_result() uses "comm" to check expected results of selftests output
in dmesg. Everything works fine if timestamps in dmesg are unique. If
not, like in this example
[ 86.844422] test_klp_callbacks_demo: pre_unpatch_callback: test_klp_callbacks_mod -> [MODULE_STATE_LIVE] Normal state
[ 86.844422] livepatch: 'test_klp_callbacks_demo': starting unpatching transition
, "comm" fails with "comm: file 2 is not in sorted order". Suppress the
order checking with --nocheck-order option.
Signed-off-by: Miroslav Benes <mbenes(a)suse.cz>
---
The strange thing is, I can reproduce the issue easily and reliably on
older codestreams (4.12) but not on current upstream in my testing
environment. I think the change makes sense regardless though.
tools/testing/selftests/livepatch/functions.sh | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/tools/testing/selftests/livepatch/functions.sh b/tools/testing/selftests/livepatch/functions.sh
index 1aba83c87ad3..846c7ed71556 100644
--- a/tools/testing/selftests/livepatch/functions.sh
+++ b/tools/testing/selftests/livepatch/functions.sh
@@ -278,7 +278,7 @@ function check_result {
# help differentiate repeated testing runs. Remove them with a
# post-comparison sed filter.
- result=$(dmesg | comm -13 "$SAVED_DMESG" - | \
+ result=$(dmesg | comm --nocheck-order -13 "$SAVED_DMESG" - | \
grep -e 'livepatch:' -e 'test_klp' | \
grep -v '\(tainting\|taints\) kernel' | \
sed 's/^\[[ 0-9.]*\] //')
--
2.28.0
From: Leon He <leon.he(a)unisoc.com>
There are two errors in the dmabuf-heap selftest:
1. The 'char name[5]' was not initialized to zero, which will cause
strcmp(name, "vgem") failed in check_vgem().
2. The return value of test_alloc_errors() should be reversed, other-
wise the while loop in main() will be broken.
Signed-off-by: Leon He <leon.he(a)unisoc.com>
---
tools/testing/selftests/dmabuf-heaps/dmabuf-heap.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/tools/testing/selftests/dmabuf-heaps/dmabuf-heap.c b/tools/testing/selftests/dmabuf-heaps/dmabuf-heap.c
index cd5e1f6..836b185 100644
--- a/tools/testing/selftests/dmabuf-heaps/dmabuf-heap.c
+++ b/tools/testing/selftests/dmabuf-heaps/dmabuf-heap.c
@@ -22,7 +22,7 @@
static int check_vgem(int fd)
{
drm_version_t version = { 0 };
- char name[5];
+ char name[5] = { 0 };
int ret;
version.name_len = 4;
@@ -357,7 +357,7 @@ static int test_alloc_errors(char *heap_name)
if (heap_fd >= 0)
close(heap_fd);
- return ret;
+ return !ret;
}
int main(void)
--
2.7.4
From: David Ahern <dsahern(a)kernel.org>
[ Upstream commit bcf7ddb0186d366f761f86196b480ea6dd2dc18c ]
h1 is initially configured to reach h2 via r1 rather than the
more direct path through r2. If rp_filter is set and inherited
for r2, forwarding fails since the source address of h1 is
reachable from eth0 vs the packet coming to it via r1 and eth1.
Since rp_filter setting affects the test, explicitly reset it.
Signed-off-by: David Ahern <dsahern(a)kernel.org>
Signed-off-by: David S. Miller <davem(a)davemloft.net>
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
---
tools/testing/selftests/net/icmp_redirect.sh | 2 ++
1 file changed, 2 insertions(+)
diff --git a/tools/testing/selftests/net/icmp_redirect.sh b/tools/testing/selftests/net/icmp_redirect.sh
index 18c5de53558af..bf361f30d6ef9 100755
--- a/tools/testing/selftests/net/icmp_redirect.sh
+++ b/tools/testing/selftests/net/icmp_redirect.sh
@@ -180,6 +180,8 @@ setup()
;;
r[12]) ip netns exec $ns sysctl -q -w net.ipv4.ip_forward=1
ip netns exec $ns sysctl -q -w net.ipv4.conf.all.send_redirects=1
+ ip netns exec $ns sysctl -q -w net.ipv4.conf.default.rp_filter=0
+ ip netns exec $ns sysctl -q -w net.ipv4.conf.all.rp_filter=0
ip netns exec $ns sysctl -q -w net.ipv6.conf.all.forwarding=1
ip netns exec $ns sysctl -q -w net.ipv6.route.mtu_expires=10
--
2.25.1
From: Yauheni Kaliuta <yauheni.kaliuta(a)redhat.com>
[ Upstream commit c210773d6c6f595f5922d56b7391fe343bc7310e ]
The error path in libbpf.c:load_program() has calls to pr_warn()
which ends up for global_funcs tests to
test_global_funcs.c:libbpf_debug_print().
For the tests with no struct test_def::err_str initialized with a
string, it causes call of strstr() with NULL as the second argument
and it segfaults.
Fix it by calling strstr() only for non-NULL err_str.
Signed-off-by: Yauheni Kaliuta <yauheni.kaliuta(a)redhat.com>
Signed-off-by: Alexei Starovoitov <ast(a)kernel.org>
Acked-by: Yonghong Song <yhs(a)fb.com>
Link: https://lore.kernel.org/bpf/20200820115843.39454-1-yauheni.kaliuta@redhat.c…
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
---
tools/testing/selftests/bpf/prog_tests/test_global_funcs.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/tools/testing/selftests/bpf/prog_tests/test_global_funcs.c b/tools/testing/selftests/bpf/prog_tests/test_global_funcs.c
index 25b068591e9a4..193002b14d7f6 100644
--- a/tools/testing/selftests/bpf/prog_tests/test_global_funcs.c
+++ b/tools/testing/selftests/bpf/prog_tests/test_global_funcs.c
@@ -19,7 +19,7 @@ static int libbpf_debug_print(enum libbpf_print_level level,
log_buf = va_arg(args, char *);
if (!log_buf)
goto out;
- if (strstr(log_buf, err_str) == 0)
+ if (err_str && strstr(log_buf, err_str) == 0)
found = true;
out:
printf(format, log_buf);
--
2.25.1
From: David Ahern <dsahern(a)kernel.org>
[ Upstream commit bcf7ddb0186d366f761f86196b480ea6dd2dc18c ]
h1 is initially configured to reach h2 via r1 rather than the
more direct path through r2. If rp_filter is set and inherited
for r2, forwarding fails since the source address of h1 is
reachable from eth0 vs the packet coming to it via r1 and eth1.
Since rp_filter setting affects the test, explicitly reset it.
Signed-off-by: David Ahern <dsahern(a)kernel.org>
Signed-off-by: David S. Miller <davem(a)davemloft.net>
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
---
tools/testing/selftests/net/icmp_redirect.sh | 2 ++
1 file changed, 2 insertions(+)
diff --git a/tools/testing/selftests/net/icmp_redirect.sh b/tools/testing/selftests/net/icmp_redirect.sh
index 18c5de53558af..bf361f30d6ef9 100755
--- a/tools/testing/selftests/net/icmp_redirect.sh
+++ b/tools/testing/selftests/net/icmp_redirect.sh
@@ -180,6 +180,8 @@ setup()
;;
r[12]) ip netns exec $ns sysctl -q -w net.ipv4.ip_forward=1
ip netns exec $ns sysctl -q -w net.ipv4.conf.all.send_redirects=1
+ ip netns exec $ns sysctl -q -w net.ipv4.conf.default.rp_filter=0
+ ip netns exec $ns sysctl -q -w net.ipv4.conf.all.rp_filter=0
ip netns exec $ns sysctl -q -w net.ipv6.conf.all.forwarding=1
ip netns exec $ns sysctl -q -w net.ipv6.route.mtu_expires=10
--
2.25.1
From: Yauheni Kaliuta <yauheni.kaliuta(a)redhat.com>
[ Upstream commit c210773d6c6f595f5922d56b7391fe343bc7310e ]
The error path in libbpf.c:load_program() has calls to pr_warn()
which ends up for global_funcs tests to
test_global_funcs.c:libbpf_debug_print().
For the tests with no struct test_def::err_str initialized with a
string, it causes call of strstr() with NULL as the second argument
and it segfaults.
Fix it by calling strstr() only for non-NULL err_str.
Signed-off-by: Yauheni Kaliuta <yauheni.kaliuta(a)redhat.com>
Signed-off-by: Alexei Starovoitov <ast(a)kernel.org>
Acked-by: Yonghong Song <yhs(a)fb.com>
Link: https://lore.kernel.org/bpf/20200820115843.39454-1-yauheni.kaliuta@redhat.c…
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
---
tools/testing/selftests/bpf/prog_tests/test_global_funcs.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/tools/testing/selftests/bpf/prog_tests/test_global_funcs.c b/tools/testing/selftests/bpf/prog_tests/test_global_funcs.c
index 25b068591e9a4..193002b14d7f6 100644
--- a/tools/testing/selftests/bpf/prog_tests/test_global_funcs.c
+++ b/tools/testing/selftests/bpf/prog_tests/test_global_funcs.c
@@ -19,7 +19,7 @@ static int libbpf_debug_print(enum libbpf_print_level level,
log_buf = va_arg(args, char *);
if (!log_buf)
goto out;
- if (strstr(log_buf, err_str) == 0)
+ if (err_str && strstr(log_buf, err_str) == 0)
found = true;
out:
printf(format, log_buf);
--
2.25.1
From: David Ahern <dsahern(a)kernel.org>
[ Upstream commit bcf7ddb0186d366f761f86196b480ea6dd2dc18c ]
h1 is initially configured to reach h2 via r1 rather than the
more direct path through r2. If rp_filter is set and inherited
for r2, forwarding fails since the source address of h1 is
reachable from eth0 vs the packet coming to it via r1 and eth1.
Since rp_filter setting affects the test, explicitly reset it.
Signed-off-by: David Ahern <dsahern(a)kernel.org>
Signed-off-by: David S. Miller <davem(a)davemloft.net>
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
---
tools/testing/selftests/net/icmp_redirect.sh | 2 ++
1 file changed, 2 insertions(+)
diff --git a/tools/testing/selftests/net/icmp_redirect.sh b/tools/testing/selftests/net/icmp_redirect.sh
index 18c5de53558af..bf361f30d6ef9 100755
--- a/tools/testing/selftests/net/icmp_redirect.sh
+++ b/tools/testing/selftests/net/icmp_redirect.sh
@@ -180,6 +180,8 @@ setup()
;;
r[12]) ip netns exec $ns sysctl -q -w net.ipv4.ip_forward=1
ip netns exec $ns sysctl -q -w net.ipv4.conf.all.send_redirects=1
+ ip netns exec $ns sysctl -q -w net.ipv4.conf.default.rp_filter=0
+ ip netns exec $ns sysctl -q -w net.ipv4.conf.all.rp_filter=0
ip netns exec $ns sysctl -q -w net.ipv6.conf.all.forwarding=1
ip netns exec $ns sysctl -q -w net.ipv6.route.mtu_expires=10
--
2.25.1
In the beginning, mm/gup_benchmark.c supported get_user_pages_fast()
only, but right now, it supports the benchmarking of a couple of
get_user_pages() related calls like:
* get_user_pages_fast()
* get_user_pages()
* pin_user_pages_fast()
* pin_user_pages()
The documentation is confusing and needs update.
Cc: John Hubbard <jhubbard(a)nvidia.com>
Cc: Keith Busch <keith.busch(a)intel.com>
Cc: Ira Weiny <ira.weiny(a)intel.com>
Cc: Kirill A. Shutemov <kirill.shutemov(a)linux.intel.com>
Signed-off-by: Barry Song <song.bao.hua(a)hisilicon.com>
---
mm/Kconfig | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/mm/Kconfig b/mm/Kconfig
index 6c974888f86f..f7c9374da7b3 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -831,10 +831,10 @@ config PERCPU_STATS
be used to help understand percpu memory usage.
config GUP_BENCHMARK
- bool "Enable infrastructure for get_user_pages_fast() benchmarking"
+ bool "Enable infrastructure for get_user_pages() and related calls benchmarking"
help
Provides /sys/kernel/debug/gup_benchmark that helps with testing
- performance of get_user_pages_fast().
+ performance of get_user_pages() and related calls.
See tools/testing/selftests/vm/gup_benchmark.c
--
2.27.0
This series attempts to provide a simple way for BPF programs (and in
future other consumers) to utilize BPF Type Format (BTF) information
to display kernel data structures in-kernel. The use case this
functionality is applied to here is to support a bpf_trace_printk
trace event-based method of rendering type information.
There is already support in kernel/bpf/btf.c for "show" functionality;
the changes here generalize that support from seq-file specific
verifier display to the more generic case and add another specific
use case; rather than seq_printf()ing the show data, it is traced
using the bpf_trace_printk trace event. Other future uses of the
show functionality could include a bpf_printk_btf() function which
printk()ed the data instead. Oops messaging in particular would
be an interesting application for such functionality.
The above potential use case hints at a potential reply to
a reasonable objection that such typed display should be
solved by tracing programs, where the in-kernel tracing records
data and the userspace program prints it out. While this
is certainly the recommended approach for most cases, I
believe having an in-kernel mechanism would be valuable
also. Critically in BPF programs it greatly simplifies
debugging and tracing of such data to invoking a one-line
helper.
One challenge raised in an earlier iteration of this work -
where the BTF printing was implemented as a printk() format
specifier - was that the amount of data printed per
printk() was large, and other format specifiers were far
simpler. This patchset tackles this by instead displaying
data _as the data structure is traversed_, rathern than copying
it to a string for later display. The problem in doing this
however is that such output can be intermixed with other
bpf_trace_printk events. The solution pursued here is to
associate a trace ID with the bpf_trace_printk events. For
now, the bpf_trace_printk() helper sets this trace ID to 0,
and bpf_trace_btf() can specify non-zero values. This allows
a BPF program to coordinate with a user-space program which
creates a separate trace instance which filters trace events
based on trace ID, allowing for clean display without pollution
from other data sources. Future work could enhance
bpf_trace_printk() to do this too, either via a new helper
or by smuggling a 32-bit trace id value into the "fmt_size"
argument (the latter might be problematic for 32-bit platforms
though, so a new helper might be preferred).
To aid in display the bpf_trace_btf() helper is passed a
"struct btf_ptr *" which specifies the data to be traced
(struct sk_buff * say), the BTF id of the type (the BTF
id of "struct sk_buff") or a string representation of
the type ("struct sk_buff"). A flags field is also
present for future use.
Separately a number of flags control how the output is
rendered, see patch 3 for more details.
A snippet of output from printing "struct sk_buff *"
(see patch 3 for the full output) looks like this:
<idle>-0 [023] d.s. 1825.778400: bpf_trace_printk: (struct sk_buff){
<idle>-0 [023] d.s. 1825.778409: bpf_trace_printk: (union){
<idle>-0 [023] d.s. 1825.778410: bpf_trace_printk: (struct){
<idle>-0 [023] d.s. 1825.778412: bpf_trace_printk: .prev = (struct sk_buff *)0x00000000b2a3df7e,
<idle>-0 [023] d.s. 1825.778413: bpf_trace_printk: (union){
<idle>-0 [023] d.s. 1825.778414: bpf_trace_printk: .dev = (struct net_device *)0x000000001658808b,
<idle>-0 [023] d.s. 1825.778416: bpf_trace_printk: .dev_scratch = (long unsigned int)18446628460391432192,
<idle>-0 [023] d.s. 1825.778417: bpf_trace_printk: },
<idle>-0 [023] d.s. 1825.778417: bpf_trace_printk: },
<idle>-0 [023] d.s. 1825.778418: bpf_trace_printk: .rbnode = (struct rb_node){
<idle>-0 [023] d.s. 1825.778419: bpf_trace_printk: .rb_right = (struct rb_node *)0x00000000b2a3df7e,
<idle>-0 [023] d.s. 1825.778420: bpf_trace_printk: .rb_left = (struct rb_node *)0x000000001658808b,
<idle>-0 [023] d.s. 1825.778420: bpf_trace_printk: },
<idle>-0 [023] d.s. 1825.778421: bpf_trace_printk: .list = (struct list_head){
<idle>-0 [023] d.s. 1825.778422: bpf_trace_printk: .prev = (struct list_head *)0x00000000b2a3df7e,
<idle>-0 [023] d.s. 1825.778422: bpf_trace_printk: },
<idle>-0 [023] d.s. 1825.778422: bpf_trace_printk: },
<idle>-0 [023] d.s. 1825.778426: bpf_trace_printk: .len = (unsigned int)168,
<idle>-0 [023] d.s. 1825.778427: bpf_trace_printk: .mac_len = (__u16)14,
Changes since v3:
- Moved to RFC since the approach is different (and bpf-next is
closed)
- Rather than using a printk() format specifier as the means
of invoking BTF-enabled display, a dedicated BPF helper is
used. This solves the issue of printk() having to output
large amounts of data using a complex mechanism such as
BTF traversal, but still provides a way for the display of
such data to be achieved via BPF programs. Future work could
include a bpf_printk_btf() function to invoke display via
printk() where the elements of a data structure are printk()ed
one at a time. Thanks to Petr Mladek, Andy Shevchenko and
Rasmus Villemoes who took time to look at the earlier printk()
format-specifier-focused version of this and provided feedback
clarifying the problems with that approach.
- Added trace id to the bpf_trace_printk events as a means of
separating output from standard bpf_trace_printk() events,
ensuring it can be easily parsed by the reader.
- Added bpf_trace_btf() helper tests which do simple verification
of the various display options.
Changes since v2:
- Alexei and Yonghong suggested it would be good to use
probe_kernel_read() on to-be-shown data to ensure safety
during operation. Safe copy via probe_kernel_read() to a
buffer object in "struct btf_show" is used to support
this. A few different approaches were explored
including dynamic allocation and per-cpu buffers. The
downside of dynamic allocation is that it would be done
during BPF program execution for bpf_trace_printk()s using
%pT format specifiers. The problem with per-cpu buffers
is we'd have to manage preemption and since the display
of an object occurs over an extended period and in printk
context where we'd rather not change preemption status,
it seemed tricky to manage buffer safety while considering
preemption. The approach of utilizing stack buffer space
via the "struct btf_show" seemed like the simplest approach.
The stack size of the associated functions which have a
"struct btf_show" on their stack to support show operation
(btf_type_snprintf_show() and btf_type_seq_show()) stays
under 500 bytes. The compromise here is the safe buffer we
use is small - 256 bytes - and as a result multiple
probe_kernel_read()s are needed for larger objects. Most
objects of interest are smaller than this (e.g.
"struct sk_buff" is 224 bytes), and while task_struct is a
notable exception at ~8K, performance is not the priority for
BTF-based display. (Alexei and Yonghong, patch 2).
- safe buffer use is the default behaviour (and is mandatory
for BPF) but unsafe display - meaning no safe copy is done
and we operate on the object itself - is supported via a
'u' option.
- pointers are prefixed with 0x for clarity (Alexei, patch 2)
- added additional comments and explanations around BTF show
code, especially around determining whether objects such
zeroed. Also tried to comment safe object scheme used. (Yonghong,
patch 2)
- added late_initcall() to initialize vmlinux BTF so that it would
not have to be initialized during printk operation (Alexei,
patch 5)
- removed CONFIG_BTF_PRINTF config option as it is not needed;
CONFIG_DEBUG_INFO_BTF can be used to gate test behaviour and
determining behaviour of type-based printk can be done via
retrieval of BTF data; if it's not there BTF was unavailable
or broken (Alexei, patches 4,6)
- fix bpf_trace_printk test to use vmlinux.h and globals via
skeleton infrastructure, removing need for perf events
(Andrii, patch 8)
Changes since v1:
- changed format to be more drgn-like, rendering indented type info
along with type names by default (Alexei)
- zeroed values are omitted (Arnaldo) by default unless the '0'
modifier is specified (Alexei)
- added an option to print pointer values without obfuscation.
The reason to do this is the sysctls controlling pointer display
are likely to be irrelevant in many if not most tracing contexts.
Some questions on this in the outstanding questions section below...
- reworked printk format specifer so that we no longer rely on format
%pT<type> but instead use a struct * which contains type information
(Rasmus). This simplifies the printk parsing, makes use more dynamic
and also allows specification by BTF id as well as name.
- removed incorrect patch which tried to fix dereferencing of resolved
BTF info for vmlinux; instead we skip modifiers for the relevant
case (array element type determination) (Alexei).
- fixed issues with negative snprintf format length (Rasmus)
- added test cases for various data structure formats; base types,
typedefs, structs, etc.
- tests now iterate through all typedef, enum, struct and unions
defined for vmlinux BTF and render a version of the target dummy
value which is either all zeros or all 0xff values; the idea is this
exercises the "skip if zero" and "print everything" cases.
- added support in BPF for using the %pT format specifier in
bpf_trace_printk()
- added BPF tests which ensure %pT format specifier use works (Alexei).
Alan Maguire (4):
bpf: provide function to get vmlinux BTF information
bpf: make BTF show support generic, apply to seq
files/bpf_trace_printk
bpf: add bpf_trace_btf helper
selftests/bpf: add bpf_trace_btf helper tests
include/linux/bpf.h | 5 +
include/linux/btf.h | 38 +
include/uapi/linux/bpf.h | 63 ++
kernel/bpf/btf.c | 962 ++++++++++++++++++---
kernel/bpf/core.c | 5 +
kernel/bpf/helpers.c | 4 +
kernel/bpf/verifier.c | 18 +-
kernel/trace/bpf_trace.c | 121 ++-
kernel/trace/bpf_trace.h | 6 +-
scripts/bpf_helpers_doc.py | 2 +
tools/include/uapi/linux/bpf.h | 63 ++
tools/testing/selftests/bpf/prog_tests/trace_btf.c | 45 +
.../selftests/bpf/progs/netif_receive_skb.c | 43 +
13 files changed, 1257 insertions(+), 118 deletions(-)
create mode 100644 tools/testing/selftests/bpf/prog_tests/trace_btf.c
create mode 100644 tools/testing/selftests/bpf/progs/netif_receive_skb.c
--
1.8.3.1