With /proc/pid/maps now being read under per-vma lock protection we can
reuse parts of that code to execute PROCMAP_QUERY ioctl also without
taking mmap_lock. The change is designed to reduce mmap_lock contention
and prevent PROCMAP_QUERY ioctl calls from blocking address space updates.
This patchset was split out of the original patchset [1] that introduced
per-vma lock usage for /proc/pid/maps reading. It contains PROCMAP_QUERY
tests, code refactoring patch to simplify the main change and the actual
transition to per-vma lock.
[1] https://lore.kernel.org/all/20250704060727.724817-1-surenb@google.com/
Suren Baghdasaryan (3):
selftests/proc: test PROCMAP_QUERY ioctl while vma is concurrently
modified
fs/proc/task_mmu: factor out proc_maps_private fields used by
PROCMAP_QUERY
fs/proc/task_mmu: execute PROCMAP_QUERY ioctl under per-vma locks
fs/proc/internal.h | 15 +-
fs/proc/task_mmu.c | 149 ++++++++++++------
tools/testing/selftests/proc/proc-maps-race.c | 65 ++++++++
3 files changed, 174 insertions(+), 55 deletions(-)
base-commit: 01da54f10fddf3b01c5a3b80f6b16bbad390c302
--
2.50.1.565.gc32cd1483b-goog
The step_after_suspend_test verifies that the system successfully
suspended and resumed by setting a timerfd and checking whether the
timer fully expired. However, this method is unreliable due to timing
races.
In practice, the system may take time to enter suspend, during which the
timer may expire just before or during the transition. As a result,
the remaining time after resume may show non-zero nanoseconds, even if
suspend/resume completed successfully. This leads to false test failures.
Replace the timer-based check with a read from
/sys/power/suspend_stats/success. This counter is incremented only
after a full suspend/resume cycle, providing a reliable and race-free
indicator.
Also remove the unused file descriptor for /sys/power/state, which
remained after switching to a system() call to trigger suspend [1].
[1] https://lore.kernel.org/all/20240930224025.2858767-1-yifei.l.liu@oracle.com/
Fixes: c66be905cda2 ("selftests: breakpoints: use remaining time to check if suspend succeed")
Signed-off-by: Moon Hee Lee <moonhee.lee.ca(a)gmail.com>
---
.../breakpoints/step_after_suspend_test.c | 41 ++++++++++++++-----
1 file changed, 31 insertions(+), 10 deletions(-)
diff --git a/tools/testing/selftests/breakpoints/step_after_suspend_test.c b/tools/testing/selftests/breakpoints/step_after_suspend_test.c
index 8d275f03e977..8d233ac95696 100644
--- a/tools/testing/selftests/breakpoints/step_after_suspend_test.c
+++ b/tools/testing/selftests/breakpoints/step_after_suspend_test.c
@@ -127,22 +127,42 @@ int run_test(int cpu)
return KSFT_PASS;
}
+/*
+ * Reads the suspend success count from sysfs.
+ * Returns the count on success or exits on failure.
+ */
+static int get_suspend_success_count_or_fail(void)
+{
+ FILE *fp;
+ int val;
+
+ fp = fopen("/sys/power/suspend_stats/success", "r");
+ if (!fp)
+ ksft_exit_fail_msg(
+ "Failed to open suspend_stats/success: %s\n",
+ strerror(errno));
+
+ if (fscanf(fp, "%d", &val) != 1) {
+ fclose(fp);
+ ksft_exit_fail_msg(
+ "Failed to read suspend success count\n");
+ }
+
+ fclose(fp);
+ return val;
+}
+
void suspend(void)
{
- int power_state_fd;
int timerfd;
int err;
+ int count_before;
+ int count_after;
struct itimerspec spec = {};
if (getuid() != 0)
ksft_exit_skip("Please run the test as root - Exiting.\n");
- power_state_fd = open("/sys/power/state", O_RDWR);
- if (power_state_fd < 0)
- ksft_exit_fail_msg(
- "open(\"/sys/power/state\") failed %s)\n",
- strerror(errno));
-
timerfd = timerfd_create(CLOCK_BOOTTIME_ALARM, 0);
if (timerfd < 0)
ksft_exit_fail_msg("timerfd_create() failed\n");
@@ -152,14 +172,15 @@ void suspend(void)
if (err < 0)
ksft_exit_fail_msg("timerfd_settime() failed\n");
+ count_before = get_suspend_success_count_or_fail();
+
system("(echo mem > /sys/power/state) 2> /dev/null");
- timerfd_gettime(timerfd, &spec);
- if (spec.it_value.tv_sec != 0 || spec.it_value.tv_nsec != 0)
+ count_after = get_suspend_success_count_or_fail();
+ if (count_after <= count_before)
ksft_exit_fail_msg("Failed to enter Suspend state\n");
close(timerfd);
- close(power_state_fd);
}
int main(int argc, char **argv)
--
2.43.0
This patch fixes unstable LACP negotiation when bonding is configured in
passive mode (`lacp_active=off`).
Previously, the actor would stop sending LACPDUs after initial negotiation
succeeded, leading to the partner timing out and restarting the negotiation
cycle. This resulted in continuous LACP state flapping.
The fix ensures the passive actor starts sending periodic LACPDUs after
receiving the first LACPDU from the partner, in accordance with IEEE
802.1AX-2020 section 6.4.1.
Out of topic:
Although this patch addresses a functional bug and could be considered for
`net`, I'm slightly concerned about potential regressions, as it changes
the current bonding LACP protocol behavior.
It might be safer to merge this through `net-next` first to allow broader
testing. Thoughts?
Hangbin Liu (2):
bonding: send LACPDUs periodically in passive mode after receiving
partner's LACPDU
selftests: bonding: add test for passive LACP mode
drivers/net/bonding/bond_3ad.c | 72 ++++++++++----
drivers/net/bonding/bond_options.c | 1 +
include/net/bond_3ad.h | 1 +
.../selftests/drivers/net/bonding/Makefile | 3 +-
.../drivers/net/bonding/bond_passive_lacp.sh | 93 +++++++++++++++++++
5 files changed, 151 insertions(+), 19 deletions(-)
create mode 100755 tools/testing/selftests/drivers/net/bonding/bond_passive_lacp.sh
--
2.46.0
There are several situations where VMM is involved when handling
synchronous external instruction or data aborts, and often VMM
needs to inject external aborts to guest. In addition to manipulating
individual registers with KVM_SET_ONE_REG API, an easier way is to
use the KVM_SET_VCPU_EVENTS API.
This patchset adds two new features to the KVM_SET_VCPU_EVENTS API.
1. Extend KVM_SET_VCPU_EVENTS to support external instruction abort.
2. Allow userspace to emulate ESR_ELx.ISS by supplying ESR_ELx.
In this way, we can also allow userspace to emulate ESR_ELx.ISS2
in future.
The UAPI change for #1 is straightforward. However, I would appreciate
some feedback on the ABI change for #2:
struct kvm_vcpu_events {
struct {
__u8 serror_pending;
__u8 serror_has_esr;
__u8 ext_dabt_pending;
__u8 ext_iabt_pending;
__u8 ext_abt_has_esr;
__u8 pad[3];
__u64 serror_esr;
__u64 ext_abt_esr; // <= +8 bytes
} exception;
__u32 reserved[10]; // <= -8 bytes
};
The offset to kvm_vcpu_events.reserved changes, and the size of
exception changes. I think we can't say userspace will never access
reserved, or they will never use sizeof(exception). Theoretically this
is an ABI break and I want to call it out and ask if a new ABI is needed
for feature #2. For example, is it worthy to introduce exception_v2
or kvm_vcpu_events_v2.
Based on commit 7b8346bd9fce6 ("KVM: arm64: Don't attempt vLPI mappings
when vPE allocation is disabled")
Jiaqi Yan (3):
KVM: arm64: Allow userspace to supply ESR when injecting SEA
KVM: selftests: Test injecting external abort with ISS
Documentation: kvm: update UAPI for injecting SEA
Raghavendra Rao Ananta (1):
KVM: arm64: Allow userspace to inject external instruction abort
Documentation/virt/kvm/api.rst | 48 +++--
arch/arm64/include/asm/kvm_emulate.h | 9 +-
arch/arm64/include/uapi/asm/kvm.h | 7 +-
arch/arm64/kvm/arm.c | 1 +
arch/arm64/kvm/emulate-nested.c | 6 +-
arch/arm64/kvm/guest.c | 42 ++--
arch/arm64/kvm/inject_fault.c | 16 +-
include/uapi/linux/kvm.h | 1 +
tools/arch/arm64/include/uapi/asm/kvm.h | 7 +-
.../selftests/kvm/arm64/external_aborts.c | 191 +++++++++++++++---
.../testing/selftests/kvm/arm64/inject_iabt.c | 98 +++++++++
11 files changed, 352 insertions(+), 74 deletions(-)
create mode 100644 tools/testing/selftests/kvm/arm64/inject_iabt.c
--
2.50.1.565.gc32cd1483b-goog
Problem
=======
When host APEI is unable to claim synchronous external abort (SEA)
during stage-2 guest abort, today KVM directly injects an async SError
into the VCPU then resumes it. The injected SError usually results in
unpleasant guest kernel panic.
One of the major situation of guest SEA is when VCPU consumes recoverable
uncorrected memory error (UER), which is not uncommon at all in modern
datacenter servers with large amounts of physical memory. Although SError
and guest panic is sufficient to stop the propagation of corrupted memory
there is room to recover from an UER in a more graceful manner.
Proposed Solution
=================
Alternatively KVM can replay the SEA to the faulting VCPU, via existing
KVM_SET_VCPU_EVENTS API. If the memory poison consumption or the fault
that cause SEA is not from guest kernel, the blast radius can be limited
to the consuming or faulting guest userspace process, so the VM can keep
running.
In addition, instead of doing under the hood without involving userspace,
there are benefits to redirect the SEA to VMM:
- VM customers care about the disruptions caused by memory errors, and
VMM usually has the responsibility to start the process of notifying
the customers of memory error events in their VMs. For example some
cloud provider emits a critical log in their observability UI [1], and
provides playbook for customers on how to mitigate disruptions to
their workloads.
- VMM can protect future memory error consumption by unmapping the poisoned
pages from stage-2 page table with KVM userfault, or by splitting the
memslot that contains the poisoned guest pages [2].
- VMM can keep track of SEA events in the VM. When VMM thinks the status
on the host or the VM is bad enough, e.g. number of distinct SEAs
exceeds a threshold, it can restart the VM on another healthy host.
- Behavior parity with x86 architecture. When machine check exception
(MCE) is caused by VCPU, kernel or KVM signals userspace SIGBUS to
let VMM either recover from the MCE, or terminate itself with VM.
The prior RFC proposes to implement SIGBUS on arm64 as well, but
Marc preferred VCPU exit over signal [3]. However, implementation
aside, returning SEA to VMM is on par with returning MCE to VMM.
Once SEA is redirected to VMM, among other actions, VMM is encouraged
to inject external aborts into the faulting VCPU, which is already
supported by KVM on arm64. We notice injecting instruction abort is not
fully supported by KVM_SET_VCPU_EVENTS. Complement it in the patchset.
New UAPIs
=========
This patchset introduces following userspace-visiable changes to empower
VMM to control what happens next for SEA on guest memory:
- KVM_CAP_ARM_SEA_TO_USER. While taking SEA, if userspace has enabled
this new capability at VM creation, and the SEA is not caused by
memory allocated for stage-2 translation table, instead of injecting
SError, return KVM_EXIT_ARM_SEA to userspace.
- KVM_EXIT_ARM_SEA. This is the VM exit reason VMM gets. The details
about the SEA is provided in arm_sea as much as possible, including
sanitized ESR value at EL2, if guest virtual and physical addresses
(GPA and GVA) are available and the values if available.
- KVM_CAP_ARM_INJECT_EXT_IABT. VMM today can inject external data abort
to VCPU via KVM_SET_VCPU_EVENTS API. However, in case of instruction
abort, VMM cannot inject it via KVM_SET_VCPU_EVENTS.
KVM_CAP_ARM_INJECT_EXT_IABT is just a natural extend to
KVM_CAP_ARM_INJECT_EXT_DABT that tells VMM KVM_SET_VCPU_EVENTS now
supports external instruction abort.
* From v1 [4]:
- Rebased on commit 4d62121ce9b5 ("KVM: arm64: vgic-debug: Avoid
dereferencing NULL ITE pointer").
- Sanitize ESR_EL2 before reporting it to userspace.
- Do not do KVM_EXIT_ARM_SEA when SEA is caused by memory allocated to
stage-2 translation table.
[1] https://cloud.google.com/solutions/sap/docs/manage-host-errors
[2] https://lore.kernel.org/kvm/20250109204929.1106563-1-jthoughton@google.com
[3] https://lore.kernel.org/kvm/86pljbqqh0.wl-maz@kernel.org
[4] https://lore.kernel.org/kvm/20250505161412.1926643-1-jiaqiyan@google.com
Jiaqi Yan (5):
KVM: arm64: VM exit to userspace to handle SEA
KVM: arm64: Set FnV for VCPU when FAR_EL2 is invalid
KVM: selftests: Test for KVM_EXIT_ARM_SEA and KVM_CAP_ARM_SEA_TO_USER
KVM: selftests: Test for KVM_CAP_INJECT_EXT_IABT
Documentation: kvm: new uAPI for handling SEA
Raghavendra Rao Ananta (1):
KVM: arm64: Allow userspace to inject external instruction aborts
Documentation/virt/kvm/api.rst | 128 ++++++-
arch/arm64/include/asm/kvm_emulate.h | 67 ++++
arch/arm64/include/asm/kvm_host.h | 8 +
arch/arm64/include/asm/kvm_ras.h | 2 +-
arch/arm64/include/uapi/asm/kvm.h | 3 +-
arch/arm64/kvm/arm.c | 6 +
arch/arm64/kvm/guest.c | 13 +-
arch/arm64/kvm/inject_fault.c | 3 +
arch/arm64/kvm/mmu.c | 59 ++-
include/uapi/linux/kvm.h | 12 +
tools/arch/arm64/include/asm/esr.h | 2 +
tools/arch/arm64/include/uapi/asm/kvm.h | 3 +-
tools/testing/selftests/kvm/Makefile.kvm | 2 +
.../testing/selftests/kvm/arm64/inject_iabt.c | 98 +++++
.../testing/selftests/kvm/arm64/sea_to_user.c | 340 ++++++++++++++++++
tools/testing/selftests/kvm/lib/kvm_util.c | 1 +
16 files changed, 718 insertions(+), 29 deletions(-)
create mode 100644 tools/testing/selftests/kvm/arm64/inject_iabt.c
create mode 100644 tools/testing/selftests/kvm/arm64/sea_to_user.c
--
2.49.0.1266.g31b7d2e469-goog
Vishal!
On Wed, Jul 30 2025 at 23:35, Vishal Parmar wrote:
Please do not top-post and trim your replies.
> The intent behind this change is to make output useful as is.
> for example, to provide a performance report in case of regression.
The point John was making:
>> So it might be worth looking into getting the output to be happy with
>> TAP while you're tweaking things here.
The kernel selftests are converting over to standardized TAP output
format, which is intended to aid automated testing.
So if we change the outpot format of this test, then we switch it over to
TAP format and do not invent yet another randomized output scheme.
> CSV format is also a good alternative if the maintainer prefers that.
The most important information is whether the test succeeded or not and
CSV format is not helping either to conform with the test output
standards.
For the success case, the actual numbers are uninteresting. In the
failure case it's sufficient to emit:
ksft_test_result_fail("Req: NNNN, Exp: $MMMM, Res: $LLLL\n", ...);
In case of regressions (fail), a report providing this output is good
enough for the relevant maintainer/developer to start investigating.
No?
Thanks,
tglx
Hi!
Does anyone have ideas about crediting test authors or tests for bugs
discovered? We increasingly see situations where someone adds a test
then our subsystem CI uncovers a (1 in a 100 runs) bug using that test.
Using reported-by doesn't feel right. But credit should go to the
person who wrote the test. Is anyone else having this dilemma?