Hello.
When running the mm selftests from the kernel's root directory
on an x86_64 Debian machine using:
make defconfig
sudo make kselftest TARGETS=mm
the tests run normally until one stalls for 180 seconds and
times out, as shown in the following logs:
```
-----------------------------------------------
running ./charge_reserved_hugetlb.sh -cgroup-v2
-----------------------------------------------
CLEANUP DONE
CLEANUP DONE
Test normal case.
private=, populate=, method=0, reserve=
nr hugepages = 10
writing cgroup limit: 20971520
writing reseravation limit: 20971520
Starting:
hugetlb_usage=0
reserved_usage=0
expect_failure is 0
Putting task in cgroup 'hugetlb_cgroup_test'
Method is 0
>>> write_hugetlb_memory.sh: line 22: ./write_to_hugetlbfs: No such file or directory <<<
Waiting for hugetlb memory reservation to reach size 10485760.
0
Waiting for hugetlb memory reservation to reach size 10485760.
0
...
Waiting for hugetlb memory reservation to reach size 10485760.
0
Waiting for hugetlb memory reservation to reach size 10485760.
0
not ok 1 selftests: mm: run_vmtests.sh # TIMEOUT 180 seconds
make[3]: Leaving directory '/linux/tools/testing/selftests/mm'
```
The logs show that the executable "write_to_hugetlbfs" is missing, which
causes the test to hang waiting for hugepage reservations.
The executable is missing because the Make system never built it.
It is listed in Makefile:136-142 and is only built if ARCH is 64-bit:
```
ifneq (,$(filter $(ARCH),arm64 mips64 parisc64 powerpc riscv64 s390x sparc64 x86_64 s390))
TEST_GEN_FILES += va_high_addr_switch
ifneq ($(ARCH),riscv64)
TEST_GEN_FILES += virtual_address_range
endif
TEST_GEN_FILES += write_to_hugetlbfs
endif
```
So, for some reason, the top-level Makefile provides ARCH as x86.
My proposed solution is similar to the existing virtual_address_range check:
check for the binary and, if it is not found, skip these 2
test cases: charge_reserved_hugetlb.sh and hugetlb_reparenting_test.sh,
since they directly and indirectly depend on the write_to_hugetlbfs binary.
This is just a workaround; the root issue of different ARCH detection
when running tests from the kernel root directory should still be
addressed. I am not sure how to approach it and am open to your suggestions.
Note that this issue does not occur when the tests are run from selftests/mm using
something like
sudo make -C tools/testing/selftests/mm
because then mm/Makefile's ARCH detection runs correctly (x86_64).
Kindly review and share your thoughts.
Signed-off-by: Khaled Elnaggar <khaledelnaggarlinux(a)gmail.com>
---
tools/testing/selftests/mm/run_vmtests.sh | 5 +++++
1 file changed, 5 insertions(+)
diff --git a/tools/testing/selftests/mm/run_vmtests.sh b/tools/testing/selftests/mm/run_vmtests.sh
index dddd1dd8af14..cdbcfdb62f8a 100755
--- a/tools/testing/selftests/mm/run_vmtests.sh
+++ b/tools/testing/selftests/mm/run_vmtests.sh
@@ -375,8 +375,13 @@ CATEGORY="process_mrelease" run_test ./mrelease_test
CATEGORY="mremap" run_test ./mremap_test
CATEGORY="hugetlb" run_test ./thuge-gen
+
+# the following depend on write_to_hugetlbfs binary
+if [ -x ./write_to_hugetlbfs ]; then
CATEGORY="hugetlb" run_test ./charge_reserved_hugetlb.sh -cgroup-v2
CATEGORY="hugetlb" run_test ./hugetlb_reparenting_test.sh -cgroup-v2
+fi
+
if $RUN_DESTRUCTIVE; then
nr_hugepages_tmp=$(cat /proc/sys/vm/nr_hugepages)
enable_soft_offline=$(cat /proc/sys/vm/enable_soft_offline)
--
2.47.2
Overview:
This series implements a new PMU scheme on ARM, a partitioned PMU
that exists alongside the existing emulated PMU and may be enabled by
the kernel command line parameter kvm.reserved_host_counters or by the vcpu
ioctl KVM_ARM_PARTITION_PMU. This is a continuation of the RFC posted
earlier this year. [1]
The high level overview and reason for the name is that this
implementation takes advantage of recent CPU features to partition the
PMU counters into a host-reserved set and a guest-reserved set. Guests
are allowed untrapped hardware access to the most frequently used PMU
registers and features for the guest-reserved counters only.
This untrapped hardware access significantly reduces the overhead of
using performance monitoring capabilities such as the `perf` tool
inside a guest VM. Register accesses that aren't trapping to KVM mean
less time spent in the host kernel and more time on the workloads
guests care about. This optimization especially shines during high
`perf` sample rates or large numbers of events that require
multiplexing hardware counters.
Performance:
For example, the following tests were carried out on identical ARM
machines with 10 general-purpose counters, running identical guest images
on QEMU. The only difference was whether my PMU implementation or the
existing one was in use. Some arguments have been simplified here to clarify the
purpose of the test:
1) time perf record -e ${FIFTEEN_HW_EVENTS} -F 1000 -- \
gzip -c tmpfs/random.64M.img >/dev/null
On emulated PMU this command took 4.143s real time with 0.159s system
time. On partitioned PMU this command took 3.139s real time with
0.110s system time, runtime reductions of 24.23% and 30.82%.
2) time perf stat -dd -- \
automated_specint2017.sh
On emulated PMU this benchmark completed in 3789.16s real time with
224.45s system time and a final benchmark score of 4.28. On
partitioned PMU this benchmark completed in 3525.67s real time with
15.98s system time and a final benchmark score of 4.56. That is a
6.95% reduction in runtime, 92.88% reduction in system time, and
6.54% improvement in overall benchmark score.
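The percentages quoted for tests 1 and 2 follow directly from the raw timings and scores. As a quick sanity check, here is a minimal sketch that recomputes them; the helper names are hypothetical and not part of this series:

```c
#include <stdio.h>

/* Relative reduction going from `before` to `after`, in percent. */
double pct_reduction(double before, double after)
{
	return (before - after) / before * 100.0;
}

/* Relative improvement going from `before` to `after`, in percent. */
double pct_improvement(double before, double after)
{
	return (after - before) / before * 100.0;
}

/* Recompute the figures for tests 1 and 2 from the numbers quoted above. */
void print_perf_summary(void)
{
	printf("test 1 real time:   %.2f%%\n", pct_reduction(4.143, 3.139));
	printf("test 1 system time: %.2f%%\n", pct_reduction(0.159, 0.110));
	printf("test 2 real time:   %.2f%%\n", pct_reduction(3789.16, 3525.67));
	printf("test 2 system time: %.2f%%\n", pct_reduction(224.45, 15.98));
	printf("test 2 score:       %.2f%%\n", pct_improvement(4.28, 4.56));
}
```

Calling print_perf_summary() from a small main() prints the five derived percentages.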
Seeing these improvements on something as lightweight as perf stat is
remarkable and implies there would have been a much greater
improvement with perf record. I did not test that because I was not
confident it would even finish in a reasonable time on the emulated
PMU.
Test 3 was slightly different: I ran the workload in a VM with a
single VCPU pinned to a physical CPU and used mpstat on the host to
analyze where the physical CPU spent its time.
3) perf record -e ${FIFTEEN_HW_EVENTS} -F 4000 -- \
stress-ng --cpu 0 --timeout 30
Over a period of 30s, the CPU running with the emulated PMU spent
34.96% of the time in the host kernel and 55.85% of the time in the
guest. The CPU running the partitioned PMU spent 0.97% of its time in
the host kernel and 91.06% of its time in the guest.
Taken together, these tests represent a remarkable performance
improvement for anything perf related using this new PMU
implementation.
Caveats:
Because the most consistent and performant thing to do was untrap
PMCR_EL0, the number of counters visible to the guest via PMCR_EL0.N
is always equal to the value KVM sets for MDCR_EL2.HPMN. Previously
allowed writes to PMCR_EL0.N via {GET,SET}_ONE_REG no longer affect
the guest.
These improvements come at a cost of 7-35 new registers that must be
swapped at every vcpu_load and vcpu_put if the feature is enabled. I
have been informed KVM would like to avoid paying this cost when
possible.
One solution is to make the trapping changes and context swapping lazy,
so that they only take place after the guest has actually accessed the
PMU. Guests that never access the PMU then never pay the cost.
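A minimal userspace sketch of that lazy idea follows. It is not code from this series: every name in it (toy_vcpu, toy_hw_pmu, toy_first_pmu_trap and so on) is hypothetical, and plain arrays stand in for the hardware registers. It only illustrates that load/put become no-ops until the first trapped PMU access flips a per-vCPU flag:

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define TOY_NR_PMU_REGS 7	/* illustrative; the text above cites 7-35 registers */

/* Stand-in for the physical PMU registers. */
static uint64_t toy_hw_pmu[TOY_NR_PMU_REGS];

struct toy_vcpu {
	bool pmu_accessed;			/* set on the first trapped PMU access */
	uint64_t guest_pmu[TOY_NR_PMU_REGS];	/* saved guest PMU state */
	unsigned long swaps;			/* save/restore operations performed */
};

/* Restore guest PMU state, but only once the guest has used the PMU. */
void toy_vcpu_load(struct toy_vcpu *v)
{
	if (!v->pmu_accessed)
		return;
	memcpy(toy_hw_pmu, v->guest_pmu, sizeof(toy_hw_pmu));
	v->swaps++;
}

/* Save guest PMU state, but only once the guest has used the PMU. */
void toy_vcpu_put(struct toy_vcpu *v)
{
	if (!v->pmu_accessed)
		return;
	memcpy(v->guest_pmu, toy_hw_pmu, sizeof(toy_hw_pmu));
	v->swaps++;
}

/*
 * Called on the first trapped guest PMU access: from here on the vCPU
 * pays for context swapping (a real implementation would also relax
 * the traps at this point).
 */
void toy_first_pmu_trap(struct toy_vcpu *v)
{
	v->pmu_accessed = true;
	memcpy(toy_hw_pmu, v->guest_pmu, sizeof(toy_hw_pmu));
}
```

A vCPU that never calls toy_first_pmu_trap() keeps swaps at zero across any number of load/put cycles.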
This is not done here because it is not crucial to the primary
functionality and I thought review would be more productive as soon as
I had something complete enough for reviewers to easily play with.
However, this or any better ideas are on the table for inclusion in
future re-rolls.
[1] https://lore.kernel.org/kvmarm/20250213180317.3205285-1-coltonlewis@google.…
Colton Lewis (16):
arm64: cpufeature: Add cpucap for HPMN0
arm64: Generate sign macro for sysreg Enums
arm64: cpufeature: Add cpucap for PMICNTR
KVM: arm64: Reorganize PMU functions
KVM: arm64: Introduce method to partition the PMU
perf: arm_pmuv3: Generalize counter bitmasks
perf: arm_pmuv3: Keep out of guest counter partition
KVM: arm64: Set up FGT for Partitioned PMU
KVM: arm64: Writethrough trapped PMEVTYPER register
KVM: arm64: Use physical PMSELR for PMXEVTYPER if partitioned
KVM: arm64: Writethrough trapped PMOVS register
KVM: arm64: Context switch Partitioned PMU guest registers
perf: pmuv3: Handle IRQs for Partitioned PMU guest counters
KVM: arm64: Inject recorded guest interrupts
KVM: arm64: Add ioctl to partition the PMU when supported
KVM: arm64: selftests: Add test case for partitioned PMU
Marc Zyngier (1):
KVM: arm64: Cleanup PMU includes
Documentation/virt/kvm/api.rst | 16 +
arch/arm/include/asm/arm_pmuv3.h | 24 +
arch/arm64/include/asm/arm_pmuv3.h | 36 +-
arch/arm64/include/asm/kvm_host.h | 208 +++++-
arch/arm64/include/asm/kvm_pmu.h | 82 +++
arch/arm64/kernel/cpufeature.c | 15 +
arch/arm64/kvm/Makefile | 2 +-
arch/arm64/kvm/arm.c | 24 +-
arch/arm64/kvm/debug.c | 13 +-
arch/arm64/kvm/hyp/include/hyp/switch.h | 65 +-
arch/arm64/kvm/pmu-emul.c | 629 +----------------
arch/arm64/kvm/pmu-part.c | 358 ++++++++++
arch/arm64/kvm/pmu.c | 630 ++++++++++++++++++
arch/arm64/kvm/sys_regs.c | 54 +-
arch/arm64/tools/cpucaps | 2 +
arch/arm64/tools/gen-sysreg.awk | 1 +
arch/arm64/tools/sysreg | 6 +-
drivers/perf/arm_pmuv3.c | 55 +-
include/kvm/arm_pmu.h | 199 ------
include/linux/perf/arm_pmu.h | 15 +-
include/linux/perf/arm_pmuv3.h | 14 +-
include/uapi/linux/kvm.h | 4 +
tools/include/uapi/linux/kvm.h | 2 +
.../selftests/kvm/arm64/vpmu_counter_access.c | 40 +-
virt/kvm/kvm_main.c | 1 +
25 files changed, 1616 insertions(+), 879 deletions(-)
create mode 100644 arch/arm64/include/asm/kvm_pmu.h
create mode 100644 arch/arm64/kvm/pmu-part.c
delete mode 100644 include/kvm/arm_pmu.h
base-commit: 1b85d923ba8c9e6afaf19e26708411adde94fba8
--
2.49.0.1204.g71687c7c1d-goog
As David suggested, we currently don't have a high-level test case to
verify the behavior of rmap. This patch set introduces verification
of rmap by means of migration.
Patch 1 is a preparation that moves KSM-related operations into vm_util.
Patch 2 is the new test case.
It currently covers the following four scenarios:
* anonymous page
* shmem page
* pagecache page
* ksm page
Wei Yang (2):
selftests/mm: put general ksm operation into vm_util
selftests/mm: assert rmap behave as expected
MAINTAINERS | 1 +
tools/testing/selftests/mm/.gitignore | 1 +
tools/testing/selftests/mm/Makefile | 3 +
.../selftests/mm/ksm_functional_tests.c | 76 +--
tools/testing/selftests/mm/rmap.c | 466 ++++++++++++++++++
tools/testing/selftests/mm/run_vmtests.sh | 4 +
tools/testing/selftests/mm/vm_util.c | 71 +++
tools/testing/selftests/mm/vm_util.h | 7 +
8 files changed, 563 insertions(+), 66 deletions(-)
create mode 100644 tools/testing/selftests/mm/rmap.c
--
2.34.1
Problem
=======
When host APEI is unable to claim a synchronous external abort (SEA)
during a stage-2 guest abort, KVM today directly injects an async SError
into the VCPU and then resumes it. The injected SError usually results in
an unpleasant guest kernel panic.
One of the major causes of guest SEA is a VCPU consuming a recoverable
uncorrected memory error (UER), which is not uncommon at all in modern
datacenter servers with large amounts of physical memory. Although SError
and guest panic are sufficient to stop the propagation of corrupted memory,
there is room to recover from a UER in a more graceful manner.
Proposed Solution
=================
Alternatively, KVM can replay the SEA to the faulting VCPU via the existing
KVM_SET_VCPU_EVENTS API. If the memory poison consumption or the fault
that causes the SEA is not from the guest kernel, the blast radius can be limited
to the consuming or faulting guest userspace process, so the VM can keep
running.
In addition, instead of handling this under the hood without involving userspace,
there are benefits to redirecting the SEA to the VMM:
- VM customers care about the disruptions caused by memory errors, and
the VMM usually has the responsibility to start the process of notifying
the customers of memory error events in their VMs. For example, some
cloud providers emit a critical log in their observability UI [1] and
provide a playbook for customers on how to mitigate disruptions to
their workloads.
- The VMM can protect against future memory error consumption by unmapping the
poisoned pages from the stage-2 page table with KVM userfault, or by splitting the
memslot that contains the poisoned guest pages [2].
- VMM can keep track of SEA events in the VM. When VMM thinks the status
on the host or the VM is bad enough, e.g. number of distinct SEAs
exceeds a threshold, it can restart the VM on another healthy host.
- Behavior parity with x86 architecture. When machine check exception
(MCE) is caused by VCPU, kernel or KVM signals userspace SIGBUS to
let VMM either recover from the MCE, or terminate itself with VM.
The prior RFC proposes to implement SIGBUS on arm64 as well, but
Marc preferred VCPU exit over signal [3]. However, implementation
aside, returning SEA to VMM is on par with returning MCE to VMM.
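As an illustration of the SEA-tracking point in the list above, a VMM could count distinct poisoned guest pages and flag the VM once a threshold is exceeded. The sketch below is purely hypothetical (the sea_tracker type, its fields, the fixed 4K page shift and the table size are all assumptions for illustration) and uses no KVM API:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SEA_PAGE_SHIFT	12	/* assumes 4K guest pages */
#define SEA_MAX_TRACKED	64	/* assumed table size; keep threshold below it */

struct sea_tracker {
	uint64_t pfns[SEA_MAX_TRACKED];	/* distinct faulting page frames seen */
	size_t nr_distinct;
	size_t threshold;		/* tolerated number of distinct SEAs */
};

/*
 * Record an SEA at guest physical address `gpa`. Returns true once the
 * number of distinct faulting pages exceeds the threshold, i.e. when the
 * VMM may want to move the VM to a healthier host.
 */
bool sea_tracker_record(struct sea_tracker *t, uint64_t gpa)
{
	uint64_t pfn = gpa >> SEA_PAGE_SHIFT;
	size_t i;

	/* Repeated SEAs on an already-known page do not count again. */
	for (i = 0; i < t->nr_distinct; i++)
		if (t->pfns[i] == pfn)
			return t->nr_distinct > t->threshold;

	/* New page: remember it if there is room in the table. */
	if (t->nr_distinct < SEA_MAX_TRACKED)
		t->pfns[t->nr_distinct++] = pfn;

	return t->nr_distinct > t->threshold;
}
```

With a threshold of 2, the third distinct page is the first one to make sea_tracker_record() return true.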
Once the SEA is redirected to the VMM, among other actions, the VMM is encouraged
to inject external aborts into the faulting VCPU, which is already
supported by KVM on arm64. We noticed that injecting an instruction abort is not
fully supported by KVM_SET_VCPU_EVENTS; this patchset adds that support.
New UAPIs
=========
This patchset introduces the following userspace-visible changes to empower
the VMM to control what happens next for an SEA on guest memory:
- KVM_CAP_ARM_SEA_TO_USER. When taking an SEA, if userspace has enabled
this new capability at VM creation and the SEA is not caused by
memory allocated for the stage-2 translation table, KVM returns
KVM_EXIT_ARM_SEA to userspace instead of injecting an SError.
- KVM_EXIT_ARM_SEA. This is the VM exit reason the VMM gets. As many details
about the SEA as possible are provided in arm_sea, including the
sanitized ESR value at EL2, whether the guest virtual and physical addresses
(GVA and GPA) are available, and their values if so.
- KVM_CAP_ARM_INJECT_EXT_IABT. Today the VMM can inject an external data abort
into a VCPU via the KVM_SET_VCPU_EVENTS API, but it cannot inject an
external instruction abort that way.
KVM_CAP_ARM_INJECT_EXT_IABT is a natural extension of
KVM_CAP_ARM_INJECT_EXT_DABT that tells the VMM that KVM_SET_VCPU_EVENTS now
supports external instruction aborts.
* From v1 [4]:
- Rebased on commit 4d62121ce9b5 ("KVM: arm64: vgic-debug: Avoid
dereferencing NULL ITE pointer").
- Sanitize ESR_EL2 before reporting it to userspace.
- Do not do KVM_EXIT_ARM_SEA when SEA is caused by memory allocated to
stage-2 translation table.
[1] https://cloud.google.com/solutions/sap/docs/manage-host-errors
[2] https://lore.kernel.org/kvm/20250109204929.1106563-1-jthoughton@google.com
[3] https://lore.kernel.org/kvm/86pljbqqh0.wl-maz@kernel.org
[4] https://lore.kernel.org/kvm/20250505161412.1926643-1-jiaqiyan@google.com
Jiaqi Yan (5):
KVM: arm64: VM exit to userspace to handle SEA
KVM: arm64: Set FnV for VCPU when FAR_EL2 is invalid
KVM: selftests: Test for KVM_EXIT_ARM_SEA and KVM_CAP_ARM_SEA_TO_USER
KVM: selftests: Test for KVM_CAP_INJECT_EXT_IABT
Documentation: kvm: new uAPI for handling SEA
Raghavendra Rao Ananta (1):
KVM: arm64: Allow userspace to inject external instruction aborts
Documentation/virt/kvm/api.rst | 128 ++++++-
arch/arm64/include/asm/kvm_emulate.h | 67 ++++
arch/arm64/include/asm/kvm_host.h | 8 +
arch/arm64/include/asm/kvm_ras.h | 2 +-
arch/arm64/include/uapi/asm/kvm.h | 3 +-
arch/arm64/kvm/arm.c | 6 +
arch/arm64/kvm/guest.c | 13 +-
arch/arm64/kvm/inject_fault.c | 3 +
arch/arm64/kvm/mmu.c | 59 ++-
include/uapi/linux/kvm.h | 12 +
tools/arch/arm64/include/asm/esr.h | 2 +
tools/arch/arm64/include/uapi/asm/kvm.h | 3 +-
tools/testing/selftests/kvm/Makefile.kvm | 2 +
.../testing/selftests/kvm/arm64/inject_iabt.c | 98 +++++
.../testing/selftests/kvm/arm64/sea_to_user.c | 340 ++++++++++++++++++
tools/testing/selftests/kvm/lib/kvm_util.c | 1 +
16 files changed, 718 insertions(+), 29 deletions(-)
create mode 100644 tools/testing/selftests/kvm/arm64/inject_iabt.c
create mode 100644 tools/testing/selftests/kvm/arm64/sea_to_user.c
--
2.49.0.1266.g31b7d2e469-goog
Some failure modes are handled poorly by kublk. For example, if ublk_drv
is built as a module but not currently loaded into the kernel, ./kublk
add ... just hangs forever. This happens because in this case (and a few
others), the worker process does not notify its parent (via a write to
the shared eventfd) that it has tried and failed to initialize, so the
parent hangs forever. Fix this by ensuring that we always notify the
parent process of any initialization failure, and have the parent print
a (not very descriptive) log line when this happens.
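The underlying pattern can be sketched in isolation as follows. This is not kublk's code: worker_init(), spawn_and_wait() and the EVT_* values are hypothetical, and only the standard eventfd(2), fork(2) and waitpid(2) calls are assumed. The point is that the child writes to the eventfd on every path, so the parent's blocking read always returns:

```c
#include <stdint.h>
#include <stdlib.h>
#include <sys/eventfd.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define EVT_OK		1	/* hypothetical "initialized" event value */
#define EVT_FAIL	2	/* hypothetical "init failed" event value */

/* Hypothetical worker initialization that can be made to fail. */
static int worker_init(int should_fail)
{
	return should_fail ? -1 : 0;
}

/* Returns 0 if the child reported success, -1 otherwise. */
int spawn_and_wait(int should_fail)
{
	uint64_t val = 0;
	int efd = eventfd(0, 0);
	pid_t pid;

	if (efd < 0)
		return -1;

	pid = fork();
	if (pid < 0) {
		close(efd);
		return -1;
	}

	if (pid == 0) {
		/* Child: report the outcome on success AND on failure. */
		uint64_t msg = worker_init(should_fail) < 0 ? EVT_FAIL : EVT_OK;

		if (write(efd, &msg, sizeof(msg)) != sizeof(msg))
			_exit(EXIT_FAILURE);
		_exit(msg == EVT_OK ? EXIT_SUCCESS : EXIT_FAILURE);
	}

	/* Parent: blocks here until the child has written an event. */
	if (read(efd, &val, sizeof(val)) != sizeof(val))
		val = EVT_FAIL;
	waitpid(pid, NULL, 0);
	close(efd);

	return val == EVT_OK ? 0 : -1;
}
```

If the child returned early from a failed worker_init() without the write, the parent's read() would block forever, which is the hang described above.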
Signed-off-by: Uday Shankar <ushankar(a)purestorage.com>
---
tools/testing/selftests/ublk/kublk.c | 34 +++++++++++++++++++++++-----------
1 file changed, 23 insertions(+), 11 deletions(-)
diff --git a/tools/testing/selftests/ublk/kublk.c b/tools/testing/selftests/ublk/kublk.c
index a98e14e4c245965d817b93843ff9a4011291223b..e2d2042810d4bb472e48a0ed91317d2bdf6e2f2a 100644
--- a/tools/testing/selftests/ublk/kublk.c
+++ b/tools/testing/selftests/ublk/kublk.c
@@ -1112,7 +1112,7 @@ static int __cmd_dev_add(const struct dev_ctx *ctx)
__u64 features;
const struct ublk_tgt_ops *ops;
struct ublksrv_ctrl_dev_info *info;
- struct ublk_dev *dev;
+ struct ublk_dev *dev = NULL;
int dev_id = ctx->dev_id;
int ret, i;
@@ -1120,13 +1120,15 @@ static int __cmd_dev_add(const struct dev_ctx *ctx)
if (!ops) {
ublk_err("%s: no such tgt type, type %s\n",
__func__, tgt_type);
- return -ENODEV;
+ ret = -ENODEV;
+ goto fail;
}
if (nr_queues > UBLK_MAX_QUEUES || depth > UBLK_QUEUE_DEPTH) {
ublk_err("%s: invalid nr_queues or depth queues %u depth %u\n",
__func__, nr_queues, depth);
- return -EINVAL;
+ ret = -EINVAL;
+ goto fail;
}
/* default to 1:1 threads:queues if nthreads is unspecified */
@@ -1136,30 +1138,37 @@ static int __cmd_dev_add(const struct dev_ctx *ctx)
if (nthreads > UBLK_MAX_THREADS) {
ublk_err("%s: %u is too many threads (max %u)\n",
__func__, nthreads, UBLK_MAX_THREADS);
- return -EINVAL;
+ ret = -EINVAL;
+ goto fail;
}
if (nthreads != nr_queues && !ctx->per_io_tasks) {
ublk_err("%s: threads %u must be same as queues %u if "
"not using per_io_tasks\n",
__func__, nthreads, nr_queues);
- return -EINVAL;
+ ret = -EINVAL;
+ goto fail;
}
dev = ublk_ctrl_init();
if (!dev) {
ublk_err("%s: can't alloc dev id %d, type %s\n",
__func__, dev_id, tgt_type);
- return -ENOMEM;
+ ret = -ENOMEM;
+ goto fail;
}
/* kernel doesn't support get_features */
ret = ublk_ctrl_get_features(dev, &features);
- if (ret < 0)
- return -EINVAL;
+ if (ret < 0) {
+ ret = -EINVAL;
+ goto fail;
+ }
- if (!(features & UBLK_F_CMD_IOCTL_ENCODE))
- return -ENOTSUP;
+ if (!(features & UBLK_F_CMD_IOCTL_ENCODE)) {
+ ret = -ENOTSUP;
+ goto fail;
+ }
info = &dev->dev_info;
info->dev_id = ctx->dev_id;
@@ -1200,7 +1209,8 @@ static int __cmd_dev_add(const struct dev_ctx *ctx)
fail:
if (ret < 0)
ublk_send_dev_event(ctx, dev, -1);
- ublk_ctrl_deinit(dev);
+ if (dev)
+ ublk_ctrl_deinit(dev);
return ret;
}
@@ -1262,6 +1272,8 @@ static int cmd_dev_add(struct dev_ctx *ctx)
shmctl(ctx->_shmid, IPC_RMID, NULL);
/* wait for child and detach from it */
wait(NULL);
+ if (exit_code == EXIT_FAILURE)
+ ublk_err("%s: command failed\n", __func__);
exit(exit_code);
} else {
exit(EXIT_FAILURE);
---
base-commit: c09a8b00f850d3ca0af998bff1fac4a3f6d11768
change-id: 20250603-ublk_init_fail-b498905159eb
Best regards,
--
Uday Shankar <ushankar(a)purestorage.com>
I checked this with checkpatch.pl, and
it reports no warnings or errors,
so it is ready to be sent.
v2:
- fixed multiple trailing whitespace errors
- fixed the Signed-off-by mismatch
The test file for the IR decoder used single-line comments
at the top to document its purpose and licensing,
which is inconsistent with the style used throughout the
Linux kernel.
This patch converts the file header to
a proper multi-line comment block
(/* ... */) that aligns with standard kernel practices.
This improves readability, consistency across selftests,
and ensures the license and documentation are
clearly visible in a familiar format.
No functional changes have been made.
Signed-off-by: Abdelrahman Fekry <abdelrahmanfekry375(a)gmail.com>
---
tools/testing/selftests/ir/ir_loopback.c | 23 +++++++++++++----------
1 file changed, 13 insertions(+), 10 deletions(-)
diff --git a/tools/testing/selftests/ir/ir_loopback.c b/tools/testing/selftests/ir/ir_loopback.c
index f4a15cbdd5ea..c94faa975630 100644
--- a/tools/testing/selftests/ir/ir_loopback.c
+++ b/tools/testing/selftests/ir/ir_loopback.c
@@ -1,14 +1,17 @@
// SPDX-License-Identifier: GPL-2.0
-// test ir decoder
-//
-// Copyright (C) 2018 Sean Young <sean(a)mess.org>
-
-// When sending LIRC_MODE_SCANCODE, the IR will be encoded. rc-loopback
-// will send this IR to the receiver side, where we try to read the decoded
-// IR. Decoding happens in a separate kernel thread, so we will need to
-// wait until that is scheduled, hence we use poll to check for read
-// readiness.
-
+/* Copyright (C) 2018 Sean Young <sean(a)mess.org>
+ *
+ * Selftest for IR decoder
+ *
+ *
+ * When sending LIRC_MODE_SCANCODE, the IR will be encoded.
+ * rc-loopback will send this IR to the receiver side,
+ * where we try to read the decoded IR.
+ * Decoding happens in a separate kernel thread,
+ * so we will need to wait until that is scheduled,
+ * hence we use poll to check for read
+ * readiness.
+ */
#include <linux/lirc.h>
#include <errno.h>
#include <stdio.h>
--
2.25.1
This improves the expressiveness of unprivileged BPF by inserting
speculation barriers instead of rejecting the programs.
The approach was previously presented at LPC'24 [1] and RAID'24 [2].
To mitigate the Spectre v1 (PHT) vulnerability, the kernel rejects
potentially-dangerous unprivileged BPF programs as of
commit 9183671af6db ("bpf: Fix leakage under speculation on mispredicted
branches"). In [2], we have analyzed 364 object files from open source
projects (Linux Samples and Selftests, BCC, Loxilb, Cilium, libbpf
Examples, Parca, and Prevail) and found that this affects 31% to 54% of
programs.
To resolve this in the majority of cases, this patchset adds a fall-back
for mitigating Spectre v1 using speculation barriers. The kernel still
optimistically attempts to verify all speculative paths but uses
speculation barriers against v1 when unsafe behavior is detected. This
allows for more programs to be accepted without disabling the BPF
Spectre mitigations (e.g., by setting cpu_mitigations_off()).
For this, it relies on the fact that speculation barriers prevent all
later instructions from executing if the speculation was not correct:
* On x86_64, lfence acts as a full speculation barrier, not only as a
load fence [3]:
An LFENCE instruction or a serializing instruction will ensure that
no later instructions execute, even speculatively, until all prior
instructions complete locally. [...] Inserting an LFENCE instruction
after a bounds check prevents later operations from executing before
the bound check completes.
This was experimentally confirmed in [4].
* ARM's SB speculation barrier instruction also affects "any instruction
that appears later in the program order than the barrier" [5].
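The quoted principle (a barrier placed after a bounds check keeps the dependent access from executing before the check resolves) can be illustrated in plain C. This is only a sketch: the series itself works on BPF instructions in the verifier and JITs, not on C source, and read_checked() and speculation_barrier() are hypothetical names. On anything other than x86_64 the helper below falls back to a mere compiler barrier, which is NOT a real speculation barrier:

```c
#include <stddef.h>
#include <stdint.h>

/* Stop later instructions from executing until earlier ones complete. */
static inline void speculation_barrier(void)
{
#if defined(__x86_64__)
	__asm__ volatile("lfence" ::: "memory");
#else
	/* Placeholder only: a compiler barrier, not a speculation barrier. */
	__asm__ volatile("" ::: "memory");
#endif
}

/*
 * Bounds-checked read: returns buf[idx], or 0 if idx is out of range.
 * The barrier sits between the check and the load, so a mispredicted
 * branch cannot speculatively perform the out-of-bounds access.
 */
uint8_t read_checked(const uint8_t *buf, size_t len, size_t idx)
{
	if (idx >= len)
		return 0;
	speculation_barrier();
	return buf[idx];
}
```

The cost of the barrier is paid on every in-bounds access, which is why the overhead discussion matters.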
In [1] we have measured the overhead of this approach relative to having
mitigations off and including the upstream Spectre v4 mitigations. For
event tracing and stack-sampling profilers, we found that mitigations
increase BPF program execution time by 0% to 62%. For the Loxilb network
load balancer, we have measured a 14% slowdown in SCTP performance but
no significant slowdown for TCP. This overhead only applies to programs
that were previously rejected.
I reran the expressiveness-evaluation with v6.14 and made sure the main
results still match those from [1] and [2] (which used v6.5).
Main design decisions are:
* Do not use separate bytecode insns for v1 and v4 barriers (inspired by
Daniel Borkmann's question at LPC). This simplifies the verifier
significantly and has the only downside that performance on PowerPC is
not as high as it could be.
* Allow archs to still disable v1/v4 mitigations separately by setting
bpf_jit_bypass_spec_v1/v4(). This has the benefit that archs can
benefit from improved BPF expressiveness / performance if they are not
vulnerable (e.g., ARM64 for v4 in the kernel).
* Do not remove the empty BPF_NOSPEC implementation for backends for
which it is unknown whether they are vulnerable to Spectre v1.
[1] https://lpc.events/event/18/contributions/1954/ ("Mitigating
Spectre-PHT using Speculation Barriers in Linux eBPF")
[2] https://arxiv.org/pdf/2405.00078 ("VeriFence: Lightweight and
Precise Spectre Defenses for Untrusted Linux Kernel Extensions")
[3] https://www.intel.com/content/www/us/en/developer/articles/technical/softwa…
("Managed Runtime Speculative Execution Side Channel Mitigations")
[4] https://dl.acm.org/doi/pdf/10.1145/3359789.3359837 ("Speculator: a
tool to analyze speculative execution attacks and mitigations" -
Section 4.6 "Stopping Speculative Execution")
[5] https://developer.arm.com/documentation/ddi0597/2020-12/Base-Instructions/S…
("SB - Speculation Barrier - Arm Armv8-A A32/T32 Instruction Set Architecture (2020-12)")
Changes:
* v2 -> v3:
- Fix
https://lore.kernel.org/oe-kbuild-all/202504212030.IF1SLhz6-lkp@intel.com/
and similar by moving the bpf_jit_bypass_spec_v1/v4() prototypes out
of the #ifdef CONFIG_BPF_SYSCALL. Decided not to move them to
filter.h (where similar bpf_jit_*() prototypes live) as they would
still have to be duplicated in bpf.h to be usable to
bpf_bypass_spec_v1/v4() (unless including filter.h in bpf.h is an
option).
- Fix
https://lore.kernel.org/oe-kbuild-all/202504220035.SoGveGpj-lkp@intel.com/
by moving the variable declarations out of the switch-case.
- Build touched C files with W=2 and bpf config on x86 to check that
there are no other warnings introduced.
- Found 3 more checkpatch warnings that can be fixed without degrading
readability.
- Rebase to bpf-next 2025-05-01
- Link to v2: https://lore.kernel.org/bpf/20250421091802.3234859-1-luis.gerhorst@fau.de/
* v1 -> v2:
- Drop former commits 9 ("bpf: Return PTR_ERR from push_stack()") and 11
("bpf: Fall back to nospec for spec path verification") as suggested
by Alexei. This series therefore no longer changes push_stack() to
return PTR_ERR.
- Add detailed explanation of how lfence works internally and how it
affects the algorithm.
- Add tests checking that nospec instructions are inserted in expected
locations using __xlated_unpriv as suggested by Eduard (also,
include a fix for __xlated_unpriv)
- Add a test for the mitigations from the description of
commit 9183671af6db ("bpf: Fix leakage under speculation on
mispredicted branches")
- Remove unused variables from do_check[_insn]() as suggested by
Eduard.
- Remove INSN_IDX_MODIFIED to improve readability as suggested by
Eduard. This also causes the nospec_result-check to run (and fail)
for jumping-ops. Add a warning to assert that this check must never
succeed in that case.
- Add details on the safety of patch 10 ("bpf: Allow nospec-protected
var-offset stack access") based on the feedback on v1.
- Rebase to bpf-next-250420
- Link to v1: https://lore.kernel.org/all/20250313172127.1098195-1-luis.gerhorst@fau.de/
* RFC -> v1:
- rebase to bpf-next-250313
- tests: mark expected successes/new errors
- add bpf_jit_bypass_spec_v1/v4() to avoid #ifdef in
bpf_bypass_spec_v1/v4()
- ensure that nospec with v1-support is implemented for archs for
which GCC supports speculation barriers, except for MIPS
- arm64: emit speculation barrier
- powerpc: change nospec to include v1 barrier
- discuss potential security (archs that do not impl. BPF nospec) and
performance (only PowerPC) regressions
- Link to RFC: https://lore.kernel.org/bpf/20250224203619.594724-1-luis.gerhorst@fau.de/
Luis Gerhorst (11):
selftests/bpf: Fix caps for __xlated/jited_unpriv
bpf: Move insn if/else into do_check_insn()
bpf: Return -EFAULT on misconfigurations
bpf: Return -EFAULT on internal errors
bpf, arm64, powerpc: Add bpf_jit_bypass_spec_v1/v4()
bpf, arm64, powerpc: Change nospec to include v1 barrier
bpf: Rename sanitize_stack_spill to nospec_result
bpf: Fall back to nospec for Spectre v1
selftests/bpf: Add test for Spectre v1 mitigation
bpf: Allow nospec-protected var-offset stack access
bpf: Fall back to nospec for sanitization-failures
arch/arm64/net/bpf_jit.h | 5 +
arch/arm64/net/bpf_jit_comp.c | 28 +-
arch/powerpc/net/bpf_jit_comp64.c | 80 ++-
include/linux/bpf.h | 11 +-
include/linux/bpf_verifier.h | 3 +-
include/linux/filter.h | 2 +-
kernel/bpf/core.c | 32 +-
kernel/bpf/verifier.c | 653 ++++++++++--------
tools/testing/selftests/bpf/progs/bpf_misc.h | 4 +
.../selftests/bpf/progs/verifier_and.c | 8 +-
.../selftests/bpf/progs/verifier_bounds.c | 66 +-
.../bpf/progs/verifier_bounds_deduction.c | 45 +-
.../selftests/bpf/progs/verifier_map_ptr.c | 20 +-
.../selftests/bpf/progs/verifier_movsx.c | 16 +-
.../selftests/bpf/progs/verifier_unpriv.c | 65 +-
.../bpf/progs/verifier_value_ptr_arith.c | 101 ++-
tools/testing/selftests/bpf/test_loader.c | 14 +-
.../selftests/bpf/verifier/dead_code.c | 3 +-
tools/testing/selftests/bpf/verifier/jmp32.c | 33 +-
tools/testing/selftests/bpf/verifier/jset.c | 10 +-
20 files changed, 771 insertions(+), 428 deletions(-)
base-commit: 358b1c0f56ebb6996fcec7dcdcf6bae5dcbc8b6c
--
2.49.0