Hi Linus,
Please pull the following fixes update for Linux 6.13-rc3.
linux_kselftest-fixes-6.13-rc3
-- fixes the offset for kprobe syntax error test case when checking the
BTF arguments on 64-bit powerpc.
Note: This fix has been in linux-next since last week. I had to drop
a patch and rebase this morning.
diff is attached.
thanks,
-- Shuah
----------------------------------------------------------------
The following changes since commit 40384c840ea1944d7c5a392e8975ed088ecf0b37:
Linux 6.13-rc1 (2024-12-01 14:28:56 -0800)
are available in the Git repository at:
git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest tags/linux_kselftest-fixes-6.13-rc3
for you to fetch changes up to 777f290ab328de333b85558bb6807a69a59b36ba:
selftests/ftrace: adjust offset for kprobe syntax error test (2024-12-11 10:08:04 -0700)
----------------------------------------------------------------
linux_kselftest-fixes-6.13-rc3
-- fixes the offset for kprobe syntax error test case when checking the
BTF arguments on 64-bit powerpc.
----------------------------------------------------------------
Hari Bathini (1):
selftests/ftrace: adjust offset for kprobe syntax error test
tools/testing/selftests/ftrace/test.d/kprobe/kprobe_syntax_errors.tc | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
----------------------------------------------------------------
This series introduces a new ioctl KVM_TRANSLATE2, which expands on
KVM_TRANSLATE. It is required to implement Hyper-V's
HvTranslateVirtualAddress hyper-call as part of the ongoing effort to
emulate HyperV's Virtual Secure Mode (VSM) within KVM and QEMU. The hyper-
call requires several new KVM APIs, one of which is KVM_TRANSLATE2, which
implements the core functionality of the hyper-call. The rest of the
required functionality will be implemented in subsequent series.
Other than translating guest virtual addresses, the ioctl allows the
caller to control whether the access and dirty bits are set during the
page walk. It also allows specifying an access mode instead of returning
viable access modes, which enables setting the bits up to the level that
caused a failure. Additionally, the ioctl provides more information about
why the page walk failed, and which page table is responsible. This
functionality is not available within KVM_TRANSLATE, and can't be added
without breaking backwards compatiblity, thus a new ioctl is required.
The ioctl was designed to facilitate as many other use cases as possible
apart from VSM. The error codes were intentionally chosen to be broad
enough to avoid exposing architecture specific details. Even though
HvTranslateVirtualAddress only really needs one flag to set the accessed
and dirty bits whenever possible, that was split into several flags so
that future users can chose more gradually when these bits should be set.
Furthermore, as much information as possible is provided to the caller.
The patch series includes selftests for the ioctl, as well as fuzzy
testing on random garbage guest page table entries. All previously passing
KVM selftests and KVM unit tests still pass.
Series overview:
- 1: Document the new ioctl
- 2-11: Update the page walker in preparation
- 12-14: Implement the ioctl
- 15: Implement testing
This series, alongside the series by Nicolas Saenz Julienne [1]
introducing the core building blocks for VSM and the accompanying QEMU
implementation [2], is capable of booting Windows Server 2019.
Both series are also available on GitHub [3].
[1] https://lore.kernel.org/linux-hyperv/20240609154945.55332-1-nsaenz@amazon.c…
[2] https://github.com/vianpl/qemu/tree/vsm/next
[3] https://github.com/vianpl/linux/tree/vsm/next
Best,
Nikolas
Nikolas Wipper (15):
KVM: Add API documentation for KVM_TRANSLATE2
KVM: x86/mmu: Abort page walk if permission checks fail
KVM: x86/mmu: Introduce exception flag for unmapped GPAs
KVM: x86/mmu: Store GPA in exception if applicable
KVM: x86/mmu: Introduce flags parameter to page walker
KVM: x86/mmu: Implement PWALK_SET_ACCESSED in page walker
KVM: x86/mmu: Implement PWALK_SET_DIRTY in page walker
KVM: x86/mmu: Implement PWALK_FORCE_SET_ACCESSED in page walker
KVM: x86/mmu: Introduce status parameter to page walker
KVM: x86/mmu: Implement PWALK_STATUS_READ_ONLY_PTE_GPA in page walker
KVM: x86: Introduce generic gva to gpa translation function
KVM: Introduce KVM_TRANSLATE2
KVM: Add KVM_TRANSLATE2 stub
KVM: x86: Implement KVM_TRANSLATE2
KVM: selftests: Add test for KVM_TRANSLATE2
Documentation/virt/kvm/api.rst | 131 ++++++++
arch/x86/include/asm/kvm_host.h | 18 +-
arch/x86/kvm/hyperv.c | 3 +-
arch/x86/kvm/kvm_emulate.h | 8 +
arch/x86/kvm/mmu.h | 10 +-
arch/x86/kvm/mmu/mmu.c | 7 +-
arch/x86/kvm/mmu/paging_tmpl.h | 80 +++--
arch/x86/kvm/x86.c | 123 ++++++-
include/linux/kvm_host.h | 6 +
include/uapi/linux/kvm.h | 33 ++
tools/testing/selftests/kvm/Makefile | 1 +
.../selftests/kvm/x86_64/kvm_translate2.c | 310 ++++++++++++++++++
virt/kvm/kvm_main.c | 41 +++
13 files changed, 724 insertions(+), 47 deletions(-)
create mode 100644 tools/testing/selftests/kvm/x86_64/kvm_translate2.c
--
2.40.1
Amazon Web Services Development Center Germany GmbH
Krausenstr. 38
10117 Berlin
Geschaeftsfuehrung: Christian Schlaeger, Jonathan Weiss
Eingetragen am Amtsgericht Charlottenburg unter HRB 257764 B
Sitz: Berlin
Ust-ID: DE 365 538 597
This patch series provides workingset reporting of user pages in
lruvecs, of which coldness can be tracked by accessed bits and fd
references. However, the concept of workingset applies generically to
all types of memory, which could be kernel slab caches, discardable
userspace caches (databases), or CXL.mem. Therefore, data sources might
come from slab shrinkers, device drivers, or the userspace.
Another interesting idea might be hugepage workingset, so that we can
measure the proportion of hugepages backing cold memory. However, with
architectures like arm, there may be too many hugepage sizes leading to
a combinatorial explosion when exporting stats to the userspace.
Nonetheless, the kernel should provide a set of workingset interfaces
that is generic enough to accommodate the various use cases, and extensible
to potential future use cases.
Use cases
==========
Job scheduling
On overcommitted hosts, workingset information improves efficiency and
reliability by allowing the job scheduler to have better stats on the
exact memory requirements of each job. This can manifest in efficiency by
landing more jobs on the same host or NUMA node. On the other hand, the
job scheduler can also ensure each node has a sufficient amount of memory
and does not enter direct reclaim or the kernel OOM path. With workingset
information and job priority, the userspace OOM killing or proactive
reclaim policy can kick in before the system is under memory pressure.
If the job shape is very different from the machine shape, knowing the
workingset per-node can also help inform page allocation policies.
Proactive reclaim
Workingset information allows the a container manager to proactively
reclaim memory while not impacting a job's performance. While PSI may
provide a reactive measure of when a proactive reclaim has reclaimed too
much, workingset reporting allows the policy to be more accurate and
flexible.
Ballooning (similar to proactive reclaim)
The last patch of the series extends the virtio-balloon device to report
the guest workingset.
Balloon policies benefit from workingset to more precisely determine the
size of the memory balloon. On end-user devices where memory is scarce and
overcommitted, the balloon sizing in multiple VMs running on the same
device can be orchestrated with workingset reports from each one.
On the server side, workingset reporting allows the balloon controller to
inflate the balloon without causing too much file cache to be reclaimed in
the guest.
Promotion/Demotion
If different mechanisms are used for promition and demotion, workingset
information can help connect the two and avoid pages being migrated back
and forth.
For example, given a promotion hot page threshold defined in reaccess
distance of N seconds (promote pages accessed more often than every N
seconds). The threshold N should be set so that ~80% (e.g.) of pages on
the fast memory node passes the threshold. This calculation can be done
with workingset reports.
To be directly useful for promotion policies, the workingset report
interfaces need to be extended to report hotness and gather hotness
information from the devices[1].
[1]
https://www.opencompute.org/documents/ocp-cms-hotness-tracking-requirements…
Sysfs and Cgroup Interfaces
==========
The interfaces are detailed in the patches that introduce them. The main
idea here is we break down the workingset per-node per-memcg into time
intervals (ms), e.g.
1000 anon=137368 file=24530
20000 anon=34342 file=0
30000 anon=353232 file=333608
40000 anon=407198 file=206052
9223372036854775807 anon=4925624 file=892892
Implementation
==========
The reporting of user pages is based off of MGLRU, and therefore requires
CONFIG_LRU_GEN=y. We would benefit from more MGLRU generations for a more
fine-grained workingset report, but we can already gather a lot of data
with just four generations. The workingset reporting mechanism is gated
behind CONFIG_WORKINGSET_REPORT, and the aging thread is behind
CONFIG_WORKINGSET_REPORT_AGING.
Benchmarks
==========
Ghait Ouled Amar Ben Cheikh has implemented a simple policy and ran Linux
compile and redis benchmarks from openbenchmarking.org. The policy and
runner is referred to as WMO (Workload Memory Optimization).
The results were based on v3 of the series, but v4 doesn't change the core
of the working set reporting and just adds the ballooning counterpart.
The timed Linux kernel compilation benchmark shows improvements in peak
memory usage with a policy of "swap out all bytes colder than 10 seconds
every 40 seconds". A swapfile is configured on SSD.
--------------------------------------------
peak memory usage (with WMO): 4982.61328 MiB
peak memory usage (control): 9569.1367 MiB
peak memory reduction: 47.9%
--------------------------------------------
Benchmark | Experimental |Control | Experimental_Std_Dev | Control_Std_Dev
Timed Linux Kernel Compilation - allmodconfig (sec) | 708.486 (95.91%) | 679.499 (100%) | 0.6% | 0.1%
--------------------------------------------
Seconds, fewer is better
The redis benchmark shows employs the same policy:
--------------------------------------------
peak memory usage (with WMO): 375.9023 MiB
peak memory usage (control): 509.765 MiB
peak memory reduction: 26%
--------------------------------------------
Benchmark | Experimental | Control | Experimental_Std_Dev | Control_Std_Dev
Redis - LPOP (Reqs/sec) | 2023130 (98.22%) | 2059849 (100%) | 1.2% | 2%
Redis - SADD (Reqs/sec) | 2539662 (98.63%) | 2574811 (100%) | 2.3% | 1.4%
Redis - LPUSH (Reqs/sec)| 2024880 (100%) | 2000884 (98.81%) | 1.1% | 0.8%
Redis - GET (Reqs/sec) | 2835764 (100%) | 2763722 (97.46%) | 2.7% | 1.6%
Redis - SET (Reqs/sec) | 2340723 (100%) | 2327372 (99.43%) | 2.4% | 1.8%
--------------------------------------------
Reqs/sec, more is better
The detailed report and benchmarking results are in Ghait's repo:
https://github.com/miloudi98/WMO
Changelog
==========
Changes from PATCH v3 -> v4:
- Added documentation for cgroup-v2
(Waiman Long)
- Fixed types in documentation
(Randy Dunlap)
- Added implementation for the ballooning use case
- Added detailed description of benchmark results
(Andrew Morton)
Changes from PATCH v2 -> v3:
- Fixed typos in commit messages and documentation
(Lance Yang, Randy Dunlap)
- Split out the force_scan patch to be reviewed separately
- Added benchmarks from Ghait Ouled Amar Ben Cheikh
- Fixed reported compile error without CONFIG_MEMCG
Changes from PATCH v1 -> v2:
- Updated selftest to use ksft_test_result_code instead of switch-case
(Muhammad Usama Anjum)
- Included more use cases in the cover letter
(Huang, Ying)
- Added documentation for sysfs and memcg interfaces
- Added an aging-specific struct lru_gen_mm_walk in struct pglist_data
to avoid allocating for each lruvec.
[v1] https://lore.kernel.org/linux-mm/20240504073011.4000534-1-yuanchu@google.co…
[v2] https://lore.kernel.org/linux-mm/20240604020549.1017540-1-yuanchu@google.co…
[v3] https://lore.kernel.org/linux-mm/20240813165619.748102-1-yuanchu@google.com/
Yuanchu Xie (9):
mm: aggregate workingset information into histograms
mm: use refresh interval to rate-limit workingset report aggregation
mm: report workingset during memory pressure driven scanning
mm: extend workingset reporting to memcgs
mm: add kernel aging thread for workingset reporting
selftest: test system-wide workingset reporting
Docs/admin-guide/mm/workingset_report: document sysfs and memcg
interfaces
Docs/admin-guide/cgroup-v2: document workingset reporting
virtio-balloon: add workingset reporting
Documentation/admin-guide/cgroup-v2.rst | 35 +
Documentation/admin-guide/mm/index.rst | 1 +
.../admin-guide/mm/workingset_report.rst | 105 +++
drivers/base/node.c | 6 +
drivers/virtio/virtio_balloon.c | 390 ++++++++++-
include/linux/balloon_compaction.h | 1 +
include/linux/memcontrol.h | 21 +
include/linux/mmzone.h | 13 +
include/linux/workingset_report.h | 167 +++++
include/uapi/linux/virtio_balloon.h | 30 +
mm/Kconfig | 15 +
mm/Makefile | 2 +
mm/internal.h | 19 +
mm/memcontrol.c | 162 ++++-
mm/mm_init.c | 2 +
mm/mmzone.c | 2 +
mm/vmscan.c | 56 +-
mm/workingset_report.c | 653 ++++++++++++++++++
mm/workingset_report_aging.c | 127 ++++
tools/testing/selftests/mm/.gitignore | 1 +
tools/testing/selftests/mm/Makefile | 3 +
tools/testing/selftests/mm/run_vmtests.sh | 5 +
.../testing/selftests/mm/workingset_report.c | 306 ++++++++
.../testing/selftests/mm/workingset_report.h | 39 ++
.../selftests/mm/workingset_report_test.c | 330 +++++++++
25 files changed, 2482 insertions(+), 9 deletions(-)
create mode 100644 Documentation/admin-guide/mm/workingset_report.rst
create mode 100644 include/linux/workingset_report.h
create mode 100644 mm/workingset_report.c
create mode 100644 mm/workingset_report_aging.c
create mode 100644 tools/testing/selftests/mm/workingset_report.c
create mode 100644 tools/testing/selftests/mm/workingset_report.h
create mode 100644 tools/testing/selftests/mm/workingset_report_test.c
--
2.47.0.338.g60cca15819-goog
This series:
1. makes the behavior of_find_device_by_node(),
bus_find_device_by_of_node(), bus_find_device_by_fwnode(), etc., more
consistent when provided with a NULL node/handle;
2. adds kunit tests to validate the new NULL-argument behavior; and
3. makes some related improvements and refactoring for the drivers/base/
kunit tests.
This series aims to prevent problems like the ones resolved in commit
5c8418cf4025 ("PCI/pwrctrl: Unregister platform device only if one
actually exists").
Brian Norris (4):
drivers: base: Don't match devices with NULL of_node/fwnode/etc
drivers: base: test: Enable device model tests with KUNIT_ALL_TESTS
drivers: base: test: Drop "devm" from platform-device-test names
drivers: base: test: Add ...find_device_by...(... NULL) tests
drivers/base/core.c | 8 ++---
drivers/base/test/Kconfig | 1 +
drivers/base/test/platform-device-test.c | 42 ++++++++++++++++++++----
3 files changed, 40 insertions(+), 11 deletions(-)
--
2.47.0.338.g60cca15819-goog
From: Kumar Kartikeya Dwivedi <memxor(a)gmail.com>
[ Upstream commit bd74e238ae6944b462f57ce8752440a011ba4530 ]
Andrii spotted that process_dynptr_func's rejection of incorrect
argument register type will print an error string where argument numbers
are not zero-indexed, unlike elsewhere in the verifier. Fix this by
subtracting 1 from regno. The same scenario exists for iterator
messages. Fix selftest error strings that match on the exact argument
number while we're at it to ensure clean bisection.
Suggested-by: Andrii Nakryiko <andrii(a)kernel.org>
Signed-off-by: Kumar Kartikeya Dwivedi <memxor(a)gmail.com>
Link: https://lore.kernel.org/r/20241203002235.3776418-1-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast(a)kernel.org>
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
---
kernel/bpf/verifier.c | 12 +++++-----
.../testing/selftests/bpf/progs/dynptr_fail.c | 22 +++++++++----------
.../selftests/bpf/progs/iters_state_safety.c | 14 ++++++------
.../selftests/bpf/progs/iters_testmod_seq.c | 4 ++--
.../bpf/progs/test_kfunc_dynptr_param.c | 2 +-
.../selftests/bpf/progs/verifier_bits_iter.c | 4 ++--
6 files changed, 29 insertions(+), 29 deletions(-)
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 91317857ea3ee..436a83784b7d2 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -7903,7 +7903,7 @@ static int process_dynptr_func(struct bpf_verifier_env *env, int regno, int insn
if (reg->type != PTR_TO_STACK && reg->type != CONST_PTR_TO_DYNPTR) {
verbose(env,
"arg#%d expected pointer to stack or const struct bpf_dynptr\n",
- regno);
+ regno - 1);
return -EINVAL;
}
@@ -7957,7 +7957,7 @@ static int process_dynptr_func(struct bpf_verifier_env *env, int regno, int insn
if (!is_dynptr_reg_valid_init(env, reg)) {
verbose(env,
"Expected an initialized dynptr as arg #%d\n",
- regno);
+ regno - 1);
return -EINVAL;
}
@@ -7965,7 +7965,7 @@ static int process_dynptr_func(struct bpf_verifier_env *env, int regno, int insn
if (!is_dynptr_type_expected(env, reg, arg_type & ~MEM_RDONLY)) {
verbose(env,
"Expected a dynptr of type %s as arg #%d\n",
- dynptr_type_str(arg_to_dynptr_type(arg_type)), regno);
+ dynptr_type_str(arg_to_dynptr_type(arg_type)), regno - 1);
return -EINVAL;
}
@@ -8029,7 +8029,7 @@ static int process_iter_arg(struct bpf_verifier_env *env, int regno, int insn_id
*/
btf_id = btf_check_iter_arg(meta->btf, meta->func_proto, regno - 1);
if (btf_id < 0) {
- verbose(env, "expected valid iter pointer as arg #%d\n", regno);
+ verbose(env, "expected valid iter pointer as arg #%d\n", regno - 1);
return -EINVAL;
}
t = btf_type_by_id(meta->btf, btf_id);
@@ -8039,7 +8039,7 @@ static int process_iter_arg(struct bpf_verifier_env *env, int regno, int insn_id
/* bpf_iter_<type>_new() expects pointer to uninit iter state */
if (!is_iter_reg_valid_uninit(env, reg, nr_slots)) {
verbose(env, "expected uninitialized iter_%s as arg #%d\n",
- iter_type_str(meta->btf, btf_id), regno);
+ iter_type_str(meta->btf, btf_id), regno - 1);
return -EINVAL;
}
@@ -8063,7 +8063,7 @@ static int process_iter_arg(struct bpf_verifier_env *env, int regno, int insn_id
break;
case -EINVAL:
verbose(env, "expected an initialized iter_%s as arg #%d\n",
- iter_type_str(meta->btf, btf_id), regno);
+ iter_type_str(meta->btf, btf_id), regno - 1);
return err;
case -EPROTO:
verbose(env, "expected an RCU CS when using %s\n", meta->func_name);
diff --git a/tools/testing/selftests/bpf/progs/dynptr_fail.c b/tools/testing/selftests/bpf/progs/dynptr_fail.c
index 8f36c9de75915..dfd817d0348c4 100644
--- a/tools/testing/selftests/bpf/progs/dynptr_fail.c
+++ b/tools/testing/selftests/bpf/progs/dynptr_fail.c
@@ -149,7 +149,7 @@ int ringbuf_release_uninit_dynptr(void *ctx)
/* A dynptr can't be used after it has been invalidated */
SEC("?raw_tp")
-__failure __msg("Expected an initialized dynptr as arg #3")
+__failure __msg("Expected an initialized dynptr as arg #2")
int use_after_invalid(void *ctx)
{
struct bpf_dynptr ptr;
@@ -428,7 +428,7 @@ int invalid_helper2(void *ctx)
/* A bpf_dynptr is invalidated if it's been written into */
SEC("?raw_tp")
-__failure __msg("Expected an initialized dynptr as arg #1")
+__failure __msg("Expected an initialized dynptr as arg #0")
int invalid_write1(void *ctx)
{
struct bpf_dynptr ptr;
@@ -1407,7 +1407,7 @@ int invalid_slice_rdwr_rdonly(struct __sk_buff *skb)
/* bpf_dynptr_adjust can only be called on initialized dynptrs */
SEC("?raw_tp")
-__failure __msg("Expected an initialized dynptr as arg #1")
+__failure __msg("Expected an initialized dynptr as arg #0")
int dynptr_adjust_invalid(void *ctx)
{
struct bpf_dynptr ptr = {};
@@ -1420,7 +1420,7 @@ int dynptr_adjust_invalid(void *ctx)
/* bpf_dynptr_is_null can only be called on initialized dynptrs */
SEC("?raw_tp")
-__failure __msg("Expected an initialized dynptr as arg #1")
+__failure __msg("Expected an initialized dynptr as arg #0")
int dynptr_is_null_invalid(void *ctx)
{
struct bpf_dynptr ptr = {};
@@ -1433,7 +1433,7 @@ int dynptr_is_null_invalid(void *ctx)
/* bpf_dynptr_is_rdonly can only be called on initialized dynptrs */
SEC("?raw_tp")
-__failure __msg("Expected an initialized dynptr as arg #1")
+__failure __msg("Expected an initialized dynptr as arg #0")
int dynptr_is_rdonly_invalid(void *ctx)
{
struct bpf_dynptr ptr = {};
@@ -1446,7 +1446,7 @@ int dynptr_is_rdonly_invalid(void *ctx)
/* bpf_dynptr_size can only be called on initialized dynptrs */
SEC("?raw_tp")
-__failure __msg("Expected an initialized dynptr as arg #1")
+__failure __msg("Expected an initialized dynptr as arg #0")
int dynptr_size_invalid(void *ctx)
{
struct bpf_dynptr ptr = {};
@@ -1459,7 +1459,7 @@ int dynptr_size_invalid(void *ctx)
/* Only initialized dynptrs can be cloned */
SEC("?raw_tp")
-__failure __msg("Expected an initialized dynptr as arg #1")
+__failure __msg("Expected an initialized dynptr as arg #0")
int clone_invalid1(void *ctx)
{
struct bpf_dynptr ptr1 = {};
@@ -1493,7 +1493,7 @@ int clone_invalid2(struct xdp_md *xdp)
/* Invalidating a dynptr should invalidate its clones */
SEC("?raw_tp")
-__failure __msg("Expected an initialized dynptr as arg #3")
+__failure __msg("Expected an initialized dynptr as arg #2")
int clone_invalidate1(void *ctx)
{
struct bpf_dynptr clone;
@@ -1514,7 +1514,7 @@ int clone_invalidate1(void *ctx)
/* Invalidating a dynptr should invalidate its parent */
SEC("?raw_tp")
-__failure __msg("Expected an initialized dynptr as arg #3")
+__failure __msg("Expected an initialized dynptr as arg #2")
int clone_invalidate2(void *ctx)
{
struct bpf_dynptr ptr;
@@ -1535,7 +1535,7 @@ int clone_invalidate2(void *ctx)
/* Invalidating a dynptr should invalidate its siblings */
SEC("?raw_tp")
-__failure __msg("Expected an initialized dynptr as arg #3")
+__failure __msg("Expected an initialized dynptr as arg #2")
int clone_invalidate3(void *ctx)
{
struct bpf_dynptr ptr;
@@ -1723,7 +1723,7 @@ __noinline long global_call_bpf_dynptr(const struct bpf_dynptr *dynptr)
}
SEC("?raw_tp")
-__failure __msg("arg#1 expected pointer to stack or const struct bpf_dynptr")
+__failure __msg("arg#0 expected pointer to stack or const struct bpf_dynptr")
int test_dynptr_reg_type(void *ctx)
{
struct task_struct *current = NULL;
diff --git a/tools/testing/selftests/bpf/progs/iters_state_safety.c b/tools/testing/selftests/bpf/progs/iters_state_safety.c
index d47e59aba6de3..f41257eadbb25 100644
--- a/tools/testing/selftests/bpf/progs/iters_state_safety.c
+++ b/tools/testing/selftests/bpf/progs/iters_state_safety.c
@@ -73,7 +73,7 @@ int create_and_forget_to_destroy_fail(void *ctx)
}
SEC("?raw_tp")
-__failure __msg("expected an initialized iter_num as arg #1")
+__failure __msg("expected an initialized iter_num as arg #0")
int destroy_without_creating_fail(void *ctx)
{
/* init with zeros to stop verifier complaining about uninit stack */
@@ -91,7 +91,7 @@ int destroy_without_creating_fail(void *ctx)
}
SEC("?raw_tp")
-__failure __msg("expected an initialized iter_num as arg #1")
+__failure __msg("expected an initialized iter_num as arg #0")
int compromise_iter_w_direct_write_fail(void *ctx)
{
struct bpf_iter_num iter;
@@ -143,7 +143,7 @@ int compromise_iter_w_direct_write_and_skip_destroy_fail(void *ctx)
}
SEC("?raw_tp")
-__failure __msg("expected an initialized iter_num as arg #1")
+__failure __msg("expected an initialized iter_num as arg #0")
int compromise_iter_w_helper_write_fail(void *ctx)
{
struct bpf_iter_num iter;
@@ -230,7 +230,7 @@ int valid_stack_reuse(void *ctx)
}
SEC("?raw_tp")
-__failure __msg("expected uninitialized iter_num as arg #1")
+__failure __msg("expected uninitialized iter_num as arg #0")
int double_create_fail(void *ctx)
{
struct bpf_iter_num iter;
@@ -258,7 +258,7 @@ int double_create_fail(void *ctx)
}
SEC("?raw_tp")
-__failure __msg("expected an initialized iter_num as arg #1")
+__failure __msg("expected an initialized iter_num as arg #0")
int double_destroy_fail(void *ctx)
{
struct bpf_iter_num iter;
@@ -284,7 +284,7 @@ int double_destroy_fail(void *ctx)
}
SEC("?raw_tp")
-__failure __msg("expected an initialized iter_num as arg #1")
+__failure __msg("expected an initialized iter_num as arg #0")
int next_without_new_fail(void *ctx)
{
struct bpf_iter_num iter;
@@ -305,7 +305,7 @@ int next_without_new_fail(void *ctx)
}
SEC("?raw_tp")
-__failure __msg("expected an initialized iter_num as arg #1")
+__failure __msg("expected an initialized iter_num as arg #0")
int next_after_destroy_fail(void *ctx)
{
struct bpf_iter_num iter;
diff --git a/tools/testing/selftests/bpf/progs/iters_testmod_seq.c b/tools/testing/selftests/bpf/progs/iters_testmod_seq.c
index 4a176e6aede89..6543d5b6e0a97 100644
--- a/tools/testing/selftests/bpf/progs/iters_testmod_seq.c
+++ b/tools/testing/selftests/bpf/progs/iters_testmod_seq.c
@@ -79,7 +79,7 @@ int testmod_seq_truncated(const void *ctx)
SEC("?raw_tp")
__failure
-__msg("expected an initialized iter_testmod_seq as arg #2")
+__msg("expected an initialized iter_testmod_seq as arg #1")
int testmod_seq_getter_before_bad(const void *ctx)
{
struct bpf_iter_testmod_seq it;
@@ -89,7 +89,7 @@ int testmod_seq_getter_before_bad(const void *ctx)
SEC("?raw_tp")
__failure
-__msg("expected an initialized iter_testmod_seq as arg #2")
+__msg("expected an initialized iter_testmod_seq as arg #1")
int testmod_seq_getter_after_bad(const void *ctx)
{
struct bpf_iter_testmod_seq it;
diff --git a/tools/testing/selftests/bpf/progs/test_kfunc_dynptr_param.c b/tools/testing/selftests/bpf/progs/test_kfunc_dynptr_param.c
index e68667aec6a65..cd4d752bd089c 100644
--- a/tools/testing/selftests/bpf/progs/test_kfunc_dynptr_param.c
+++ b/tools/testing/selftests/bpf/progs/test_kfunc_dynptr_param.c
@@ -45,7 +45,7 @@ int BPF_PROG(not_valid_dynptr, int cmd, union bpf_attr *attr, unsigned int size)
}
SEC("?lsm.s/bpf")
-__failure __msg("arg#1 expected pointer to stack or const struct bpf_dynptr")
+__failure __msg("arg#0 expected pointer to stack or const struct bpf_dynptr")
int BPF_PROG(not_ptr_to_stack, int cmd, union bpf_attr *attr, unsigned int size)
{
unsigned long val = 0;
diff --git a/tools/testing/selftests/bpf/progs/verifier_bits_iter.c b/tools/testing/selftests/bpf/progs/verifier_bits_iter.c
index 7c881bca9af5c..497febf5c578d 100644
--- a/tools/testing/selftests/bpf/progs/verifier_bits_iter.c
+++ b/tools/testing/selftests/bpf/progs/verifier_bits_iter.c
@@ -32,7 +32,7 @@ int BPF_PROG(no_destroy, struct bpf_iter_meta *meta, struct cgroup *cgrp)
SEC("iter/cgroup")
__description("uninitialized iter in ->next()")
-__failure __msg("expected an initialized iter_bits as arg #1")
+__failure __msg("expected an initialized iter_bits as arg #0")
int BPF_PROG(next_uninit, struct bpf_iter_meta *meta, struct cgroup *cgrp)
{
struct bpf_iter_bits *it = NULL;
@@ -43,7 +43,7 @@ int BPF_PROG(next_uninit, struct bpf_iter_meta *meta, struct cgroup *cgrp)
SEC("iter/cgroup")
__description("uninitialized iter in ->destroy()")
-__failure __msg("expected an initialized iter_bits as arg #1")
+__failure __msg("expected an initialized iter_bits as arg #0")
int BPF_PROG(destroy_uninit, struct bpf_iter_meta *meta, struct cgroup *cgrp)
{
struct bpf_iter_bits it = {};
--
2.43.0
Isolated CPUs are not allowed to be used in a non-isolated partition.
The only exception is the top cpuset which is allowed to contain boot
time isolated CPUs.
Commit ccac8e8de99c ("cgroup/cpuset: Fix remote root partition creation
problem") introduces a simplified scheme of including only partition
roots in sched domain generation. However, it does not properly account
for this exception case. This can result in leakage of isolated CPUs
into a sched domain.
Fix it by making sure that isolated CPUs are excluded from the top
cpuset before generating sched domains.
Also update the way the boot time isolated CPUs are handled in
test_cpuset_prs.sh to make sure that those isolated CPUs are really
isolated instead of just skipping them in the tests.
Fixes: ccac8e8de99c ("cgroup/cpuset: Fix remote root partition creation problem")
Signed-off-by: Waiman Long <longman(a)redhat.com>
---
kernel/cgroup/cpuset.c | 10 +++++-
.../selftests/cgroup/test_cpuset_prs.sh | 33 +++++++++++--------
2 files changed, 28 insertions(+), 15 deletions(-)
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index f321ed515f3a..33b264c3e258 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -890,7 +890,15 @@ static int generate_sched_domains(cpumask_var_t **domains,
*/
if (cgrpv2) {
for (i = 0; i < ndoms; i++) {
- cpumask_copy(doms[i], csa[i]->effective_cpus);
+ /*
+ * The top cpuset may contain some boot time isolated
+ * CPUs that need to be excluded from the sched domain.
+ */
+ if (csa[i] == &top_cpuset)
+ cpumask_and(doms[i], csa[i]->effective_cpus,
+ housekeeping_cpumask(HK_TYPE_DOMAIN));
+ else
+ cpumask_copy(doms[i], csa[i]->effective_cpus);
if (dattr)
dattr[i] = SD_ATTR_INIT;
}
diff --git a/tools/testing/selftests/cgroup/test_cpuset_prs.sh b/tools/testing/selftests/cgroup/test_cpuset_prs.sh
index 03c1bdaed2c3..400a696a0d21 100755
--- a/tools/testing/selftests/cgroup/test_cpuset_prs.sh
+++ b/tools/testing/selftests/cgroup/test_cpuset_prs.sh
@@ -86,15 +86,15 @@ echo "" > test/cpuset.cpus
#
# If isolated CPUs have been reserved at boot time (as shown in
-# cpuset.cpus.isolated), these isolated CPUs should be outside of CPUs 0-7
+# cpuset.cpus.isolated), these isolated CPUs should be outside of CPUs 0-8
# that will be used by this script for testing purpose. If not, some of
-# the tests may fail incorrectly. These isolated CPUs will also be removed
-# before being compared with the expected results.
+# the tests may fail incorrectly. These pre-isolated CPUs should stay in
+# an isolated state throughout the testing process for now.
#
BOOT_ISOLCPUS=$(cat $CGROUP2/cpuset.cpus.isolated)
if [[ -n "$BOOT_ISOLCPUS" ]]
then
- [[ $(echo $BOOT_ISOLCPUS | sed -e "s/[,-].*//") -le 7 ]] &&
+ [[ $(echo $BOOT_ISOLCPUS | sed -e "s/[,-].*//") -le 8 ]] &&
skip_test "Pre-isolated CPUs ($BOOT_ISOLCPUS) overlap CPUs to be tested"
echo "Pre-isolated CPUs: $BOOT_ISOLCPUS"
fi
@@ -683,15 +683,19 @@ check_isolcpus()
EXPECT_VAL2=$EXPECT_VAL
fi
+ #
+ # Appending pre-isolated CPUs
+ # Even though CPU #8 isn't used for testing, it can't be pre-isolated
+ # to make appending those CPUs easier.
+ #
+ [[ -n "$BOOT_ISOLCPUS" ]] && {
+ EXPECT_VAL=${EXPECT_VAL:+${EXPECT_VAL},}${BOOT_ISOLCPUS}
+ EXPECT_VAL2=${EXPECT_VAL2:+${EXPECT_VAL2},}${BOOT_ISOLCPUS}
+ }
+
#
# Check cpuset.cpus.isolated cpumask
#
- if [[ -z "$BOOT_ISOLCPUS" ]]
- then
- ISOLCPUS=$(cat $ISCPUS)
- else
- ISOLCPUS=$(cat $ISCPUS | sed -e "s/,*$BOOT_ISOLCPUS//")
- fi
[[ "$EXPECT_VAL2" != "$ISOLCPUS" ]] && {
# Take a 50ms pause and try again
pause 0.05
@@ -731,8 +735,6 @@ check_isolcpus()
fi
done
[[ "$ISOLCPUS" = *- ]] && ISOLCPUS=${ISOLCPUS}$LASTISOLCPU
- [[ -n "BOOT_ISOLCPUS" ]] &&
- ISOLCPUS=$(echo $ISOLCPUS | sed -e "s/,*$BOOT_ISOLCPUS//")
[[ "$EXPECT_VAL" = "$ISOLCPUS" ]]
}
@@ -836,8 +838,11 @@ run_state_test()
# if available
[[ -n "$ICPUS" ]] && {
check_isolcpus $ICPUS
- [[ $? -ne 0 ]] && test_fail $I "isolated CPU" \
- "Expect $ICPUS, get $ISOLCPUS instead"
+ [[ $? -ne 0 ]] && {
+ [[ -n "$BOOT_ISOLCPUS" ]] && ICPUS=${ICPUS},${BOOT_ISOLCPUS}
+ test_fail $I "isolated CPU" \
+ "Expect $ICPUS, get $ISOLCPUS instead"
+ }
}
reset_cgroup_states
#
--
2.47.1
This is something that I've been thinking about for a while. We had a
discussion at LPC 2020 about this[1] but the proposals suggested there
never materialised.
In short, it is quite difficult for userspace to detect the feature
capability of syscalls at runtime. This is something a lot of programs
want to do, but they are forced to create elaborate scenarios to try to
figure out if a feature is supported without causing damage to the
system. For the vast majority of cases, each individual feature also
needs to be tested individually (because syscall results are
all-or-nothing), so testing even a single syscall's feature set can
easily inflate the startup time of programs.
This patchset implements the fairly minimal design I proposed in this
talk[2] and in some old LKML threads (though I can't find the exact
references ATM). The general flow looks like:
1. Userspace will indicate to the kernel that a syscall should a be
no-op by setting the top bit of the extensible struct size argument.
We will almost certainly never support exabyte sized structs, so the
top bits are free for us to use as makeshift flag bits. This is
preferable to using the per-syscall flag field inside the structure
because seccomp can easily detect the bit in the flag and allow the
probe or forcefully return -EEXTSYS_NOOP.
2. The kernel will then fill the provided structure with every valid
bit pattern that the current kernel understands.
For flags or other bitflag-like fields, this is the set of valid
flags or bits. For pointer fields or fields that take an arbitrary
value, the field has every bit set (0xFF... to fill the field) to
indicate that any value is valid in the field.
3. The syscall then returns -EEXTSYS_NOOP which is an errno that will
only ever be used for this purpose (so userspace can be sure that
the request succeeded).
On older kernels, the syscall will return a different error (usually
-E2BIG or -EFAULT) and userspace can do their old-fashioned checks.
4. Userspace can then check which flags and fields are supported by
looking at the fields in the returned structure. Flags are checked
by doing an AND with the flags field, and field support can checked
by comparing to 0. In principle you could just AND the entire
structure if you wanted to do this check generically without caring
about the structure contents (this is what libraries might consider
doing).
Userspace can even find out the internal kernel structure size by
passing a PAGE_SIZE buffer and seeing how many bytes are non-zero.
As with copy_struct_from_user(), this is designed to be forward- and
backwards- compatible.
This allows programas to get a one-shot understanding of what features a
syscall supports without having to do any elaborate setups or tricks to
detect support for destructive features. Flags can simply be ANDed to
check if they are in the supported set, and fields can just be checked
to see if they are non-zero.
This patchset is IMHO the simplest way we can add the ability to
introspect the feature set of extensible struct (copy_struct_from_user)
syscalls. It doesn't preclude the chance of a more generic mechanism
being added later.
The intended way of using this interface to get feature information
looks something like the following (imagine that openat2 has gained a
new field and a new flag in the future):
static bool openat2_no_automount_supported;
static bool openat2_cwd_fd_supported;
int check_openat2_support(void)
{
int err;
struct open_how how = {};
err = openat2(AT_FDCWD, ".", &how, CHECK_FIELDS | sizeof(how));
assert(err < 0);
switch (errno) {
case EFAULT: case E2BIG:
/* Old kernel... */
check_support_the_old_way();
break;
case EEXTSYS_NOOP:
openat2_no_automount_supported = (how.flags & RESOLVE_NO_AUTOMOUNT);
openat2_cwd_fd_supported = (how.cwd_fd != 0);
break;
}
}
This series adds CHECK_FIELDS support for the following extensible
struct syscalls, as they are quite likely to grow flags in the near
future:
* openat2
* clone3
* mount_setattr
[1]: https://lwn.net/Articles/830666/
[2]: https://youtu.be/ggD-eb3yPVs
Signed-off-by: Aleksa Sarai <cyphar(a)cyphar.com>
---
Changes in v3:
- Fix copy_struct_to_user() return values in case of clear_user() failure.
- v2: <https://lore.kernel.org/r/20240906-extensible-structs-check_fields-v2-0-0f4…>
Changes in v2:
- Add CHECK_FIELDS support to mount_setattr(2).
- Fix build failure on architectures with custom errno values.
- Rework selftests to use the tools/ uAPI headers rather than custom
defining EEXTSYS_NOOP.
- Make sure we return -EINVAL and -E2BIG for invalid sizes even if
CHECK_FIELDS is set, and add some tests for that.
- v1: <https://lore.kernel.org/r/20240902-extensible-structs-check_fields-v1-0-545…>
---
Aleksa Sarai (10):
uaccess: add copy_struct_to_user helper
sched_getattr: port to copy_struct_to_user
openat2: explicitly return -E2BIG for (usize > PAGE_SIZE)
openat2: add CHECK_FIELDS flag to usize argument
selftests: openat2: add 0xFF poisoned data after misaligned struct
selftests: openat2: add CHECK_FIELDS selftests
clone3: add CHECK_FIELDS flag to usize argument
selftests: clone3: add CHECK_FIELDS selftests
mount_setattr: add CHECK_FIELDS flag to usize argument
selftests: mount_setattr: add CHECK_FIELDS selftest
arch/alpha/include/uapi/asm/errno.h | 3 +
arch/mips/include/uapi/asm/errno.h | 3 +
arch/parisc/include/uapi/asm/errno.h | 3 +
arch/sparc/include/uapi/asm/errno.h | 3 +
fs/namespace.c | 17 ++
fs/open.c | 18 ++
include/linux/uaccess.h | 97 ++++++++
include/uapi/asm-generic/errno.h | 3 +
include/uapi/linux/openat2.h | 2 +
kernel/fork.c | 30 ++-
kernel/sched/syscalls.c | 42 +---
tools/arch/alpha/include/uapi/asm/errno.h | 3 +
tools/arch/mips/include/uapi/asm/errno.h | 3 +
tools/arch/parisc/include/uapi/asm/errno.h | 3 +
tools/arch/sparc/include/uapi/asm/errno.h | 3 +
tools/include/uapi/asm-generic/errno.h | 3 +
tools/include/uapi/asm-generic/posix_types.h | 101 ++++++++
tools/testing/selftests/clone3/.gitignore | 1 +
tools/testing/selftests/clone3/Makefile | 4 +-
.../testing/selftests/clone3/clone3_check_fields.c | 264 +++++++++++++++++++++
tools/testing/selftests/mount_setattr/Makefile | 2 +-
.../selftests/mount_setattr/mount_setattr_test.c | 53 ++++-
tools/testing/selftests/openat2/Makefile | 2 +
tools/testing/selftests/openat2/openat2_test.c | 165 ++++++++++++-
24 files changed, 777 insertions(+), 51 deletions(-)
---
base-commit: 98f7e32f20d28ec452afb208f9cffc08448a2652
change-id: 20240803-extensible-structs-check_fields-a47e94cef691
Best regards,
--
Aleksa Sarai <cyphar(a)cyphar.com>
Hi,
This series carries forward the effort to add Kselftest for PCI Endpoint
Subsystem started by Aman Gupta [1] a while ago. I reworked the initial version
based on another patch that fixes the return values of IOCTLs in
pci_endpoint_test driver and did many cleanups. Since the resulting work
modified the initial version substantially, I took over the authorship.
This series also incorporates the review comment by Shuah Khan [2] to move the
existing tests from 'tools/pci' to 'tools/testing/kselftest/pci_endpoint' before
migrating to Kselftest framework. I made sure that the tests are executable in
each commit and updated documentation accordingly.
NOTE: Patch 1 is strictly not related to this series, but necessary to execute
Kselftests with Qualcomm Endpoint devices. So this can be merged separately.
- Mani
[1] https://lore.kernel.org/linux-pci/20221007053934.5188-1-aman1.gupta@samsung…
[2] https://lore.kernel.org/linux-pci/b2a5db97-dc59-33ab-71cd-f591e0b1b34d@linu…
Changes in v2:
* Added a patch that fixes return values of IOCTL in pci_endpoint_test driver
* Moved the existing tests to new location before migrating
* Added a fix for BARs on Qcom devices
* Updated documentation and also added fixture variants for memcpy & DMA modes
Manivannan Sadhasivam (4):
PCI: qcom-ep: Mark BAR0/BAR2 as 64bit BARs and BAR1/BAR3 as RESERVED
misc: pci_endpoint_test: Fix the return value of IOCTL
selftests: Move PCI Endpoint tests from tools/pci to Kselftests
selftests: pci_endpoint: Migrate to Kselftest framework
Documentation/PCI/endpoint/pci-test-howto.rst | 144 +++-------
MAINTAINERS | 2 +-
drivers/misc/pci_endpoint_test.c | 236 ++++++++---------
drivers/pci/controller/dwc/pcie-qcom-ep.c | 4 +
tools/pci/Build | 1 -
tools/pci/Makefile | 58 ----
tools/pci/pcitest.c | 250 ------------------
tools/pci/pcitest.sh | 72 -----
tools/testing/selftests/Makefile | 1 +
.../testing/selftests/pci_endpoint/.gitignore | 2 +
tools/testing/selftests/pci_endpoint/Makefile | 7 +
tools/testing/selftests/pci_endpoint/config | 4 +
.../pci_endpoint/pci_endpoint_test.c | 186 +++++++++++++
13 files changed, 365 insertions(+), 602 deletions(-)
delete mode 100644 tools/pci/Build
delete mode 100644 tools/pci/Makefile
delete mode 100644 tools/pci/pcitest.c
delete mode 100644 tools/pci/pcitest.sh
create mode 100644 tools/testing/selftests/pci_endpoint/.gitignore
create mode 100644 tools/testing/selftests/pci_endpoint/Makefile
create mode 100644 tools/testing/selftests/pci_endpoint/config
create mode 100644 tools/testing/selftests/pci_endpoint/pci_endpoint_test.c
--
2.25.1