Some small fixes for arch_timer_edge_cases that I stumbled upon while debugging failures for this selftest on ampere-one.
Changes since v1: modified patch 3 based on suggestions from Marc.
I've done some tests with this on various machines - seems to be all good, however on ampere-one I now hit this in 10% of the runs: ==== Test Assertion Failure ==== arm64/arch_timer_edge_cases.c:481: timer_get_cntct(timer) >= DEF_CNT + (timer_get_cntfrq() * (uint64_t)(delta_2_ms) / 1000) pid=166657 tid=166657 errno=4 - Interrupted system call 1 0x0000000000404db3: test_run at arch_timer_edge_cases.c:933 2 0x0000000000401f9f: main at arch_timer_edge_cases.c:1062 3 0x0000ffffaedd625b: ?? ??:0 4 0x0000ffffaedd633b: ?? ??:0 5 0x00000000004020af: _start at ??:? timer_get_cntct(timer) >= DEF_CNT + msec_to_cycles(delta_2_ms)
This is not new, it was just hidden behind the other failure. I'll try to figure out what this is about (seems to be independent of the wait time)..
Sebastian Ott (3): KVM: arm64: selftests: fix help text for arch_timer_edge_cases KVM: arm64: selftests: fix thread migration in arch_timer_edge_cases KVM: arm64: selftests: arch_timer_edge_cases - determine effective counter width
.../kvm/arm64/arch_timer_edge_cases.c | 37 ++++++++++++------- 1 file changed, 24 insertions(+), 13 deletions(-)
base-commit: 0ff41df1cb268fc69e703a08a57ee14ae967d0ca
Fix the help text for arch_timer_edge_cases to show the correct option for setting the wait time.
Signed-off-by: Sebastian Ott sebott@redhat.com --- tools/testing/selftests/kvm/arm64/arch_timer_edge_cases.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/tools/testing/selftests/kvm/arm64/arch_timer_edge_cases.c b/tools/testing/selftests/kvm/arm64/arch_timer_edge_cases.c index a36a7e2db434..c4716e0c1438 100644 --- a/tools/testing/selftests/kvm/arm64/arch_timer_edge_cases.c +++ b/tools/testing/selftests/kvm/arm64/arch_timer_edge_cases.c @@ -986,7 +986,7 @@ static void test_print_help(char *name) pr_info("\t-b: Test both physical and virtual timers (default: true)\n"); pr_info("\t-l: Delta (in ms) used for long wait time test (default: %u)\n", LONG_WAIT_TEST_MS); - pr_info("\t-l: Delta (in ms) used for wait times (default: %u)\n", + pr_info("\t-w: Delta (in ms) used for wait times (default: %u)\n", WAIT_TEST_MS); pr_info("\t-p: Test physical timer (default: true)\n"); pr_info("\t-v: Test virtual timer (default: true)\n");
arch_timer_edge_cases tries to migrate itself across host cpus. Before the first test it migrates to cpu 0 by setting up an affinity mask with only bit 0 set. After that it looks for the next possible cpu in the current affinity mask which still has only bit 0 set. So there is no migration at all.
Fix this by reading the default mask at start and use this to find the next cpu in each iteration.
Signed-off-by: Sebastian Ott sebott@redhat.com --- tools/testing/selftests/kvm/arm64/arch_timer_edge_cases.c | 8 +++++--- 1 file changed, 5 insertions(+), 3 deletions(-)
diff --git a/tools/testing/selftests/kvm/arm64/arch_timer_edge_cases.c b/tools/testing/selftests/kvm/arm64/arch_timer_edge_cases.c index c4716e0c1438..a813b4c6c817 100644 --- a/tools/testing/selftests/kvm/arm64/arch_timer_edge_cases.c +++ b/tools/testing/selftests/kvm/arm64/arch_timer_edge_cases.c @@ -849,17 +849,17 @@ static void guest_code(enum arch_timer timer) GUEST_DONE(); }
+static cpu_set_t default_cpuset; + static uint32_t next_pcpu(void) { uint32_t max = get_nprocs(); uint32_t cur = sched_getcpu(); uint32_t next = cur; - cpu_set_t cpuset; + cpu_set_t cpuset = default_cpuset;
TEST_ASSERT(max > 1, "Need at least two physical cpus");
- sched_getaffinity(0, sizeof(cpuset), &cpuset); - do { next = (next + 1) % CPU_SETSIZE; } while (!CPU_ISSET(next, &cpuset)); @@ -1046,6 +1046,8 @@ int main(int argc, char *argv[]) if (!parse_args(argc, argv)) exit(KSFT_SKIP);
+ sched_getaffinity(0, sizeof(default_cpuset), &default_cpuset); + if (test_args.test_virtual) { test_vm_create(&vm, &vcpu, VIRTUAL); test_run(vm, vcpu);
arch_timer_edge_cases uses ~0 as the maximum counter value, however there's no architectural guarantee that this is valid.
Figure out the effective counter width based on the effective frequency like it's done by the kernel.
Note that the following subtest only worked since the counter initialized with CVAL_MAX would instantly overflow (which is no longer the case):
test_set_cnt_after_cval_no_irq(timer, 0, DEF_CNT, CVAL_MAX, sm);
To fix this we could swap CVAL_MAX for 0 here but since that is already done by test_move_counters_behind_timers() let's remove that subtest.
This also serves as a workaround for AC03_CPU_14 that led to the following assertion failure on ampere-one machines:
==== Test Assertion Failure ==== arm64/arch_timer_edge_cases.c:169: timer_condition == istatus pid=11236 tid=11236 errno=4 - Interrupted system call 1 0x0000000000404ce7: test_run at arch_timer_edge_cases.c:938 2 0x0000000000401ebb: main at arch_timer_edge_cases.c:1053 3 0x0000ffff9fa8625b: ?? ??:0 4 0x0000ffff9fa8633b: ?? ??:0 5 0x0000000000401fef: _start at ??:? 0x1 != 0x0 (timer_condition != istatus)
Link: https://lore.kernel.org/kvmarm/ac1de1d2-ef2b-d439-dc48-8615e121b07b@redhat.c... Link: https://amperecomputing.com/assets/AmpereOne_Developer_ER_v0_80_20240823_289... Signed-off-by: Sebastian Ott sebott@redhat.com --- .../kvm/arm64/arch_timer_edge_cases.c | 27 ++++++++++++------- 1 file changed, 18 insertions(+), 9 deletions(-)
diff --git a/tools/testing/selftests/kvm/arm64/arch_timer_edge_cases.c b/tools/testing/selftests/kvm/arm64/arch_timer_edge_cases.c index a813b4c6c817..1e3e36d869d4 100644 --- a/tools/testing/selftests/kvm/arm64/arch_timer_edge_cases.c +++ b/tools/testing/selftests/kvm/arm64/arch_timer_edge_cases.c @@ -22,7 +22,8 @@ #include "gic.h" #include "vgic.h"
-static const uint64_t CVAL_MAX = ~0ULL; +/* Depends on counter width. */ +static uint64_t CVAL_MAX; /* tval is a signed 32-bit int. */ static const int32_t TVAL_MAX = INT32_MAX; static const int32_t TVAL_MIN = INT32_MIN; @@ -30,8 +31,8 @@ static const int32_t TVAL_MIN = INT32_MIN; /* After how much time we say there is no IRQ. */ static const uint32_t TIMEOUT_NO_IRQ_US = 50000;
-/* A nice counter value to use as the starting one for most tests. */ -static const uint64_t DEF_CNT = (CVAL_MAX / 2); +/* Counter value to use as the starting one for most tests. Set to CVAL_MAX/2 */ +static uint64_t DEF_CNT;
/* Number of runs. */ static const uint32_t NR_TEST_ITERS_DEF = 5; @@ -732,12 +733,6 @@ static void test_move_counters_ahead_of_timers(enum arch_timer timer) test_set_cnt_after_tval(timer, 0, tval, (uint64_t) tval + 1, wm); } - - for (i = 0; i < ARRAY_SIZE(sleep_method); i++) { - sleep_method_t sm = sleep_method[i]; - - test_set_cnt_after_cval_no_irq(timer, 0, DEF_CNT, CVAL_MAX, sm); - } }
/* @@ -975,6 +970,8 @@ static void test_vm_create(struct kvm_vm **vm, struct kvm_vcpu **vcpu, test_init_timer_irq(*vm, *vcpu); vgic_v3_setup(*vm, 1, 64); sync_global_to_guest(*vm, test_args); + sync_global_to_guest(*vm, CVAL_MAX); + sync_global_to_guest(*vm, DEF_CNT); }
static void test_print_help(char *name) @@ -1035,6 +1032,17 @@ static bool parse_args(int argc, char *argv[]) return false; }
+static void set_counter_defaults(void) +{ + const uint64_t MIN_ROLLOVER_SECS = 40ULL * 365 * 24 * 3600; + uint64_t freq = read_sysreg(CNTFRQ_EL0); + uint64_t width = ilog2(MIN_ROLLOVER_SECS * freq); + + width = clamp(width, 56, 64); + CVAL_MAX = GENMASK_ULL(width - 1, 0); + DEF_CNT = CVAL_MAX / 2; +} + int main(int argc, char *argv[]) { struct kvm_vcpu *vcpu; @@ -1047,6 +1055,7 @@ int main(int argc, char *argv[]) exit(KSFT_SKIP);
sched_getaffinity(0, sizeof(default_cpuset), &default_cpuset); + set_counter_defaults();
if (test_args.test_virtual) { test_vm_create(&vm, &vcpu, VIRTUAL);
Hi Sebastian,
On 2025/5/27 22:24, Sebastian Ott wrote:
Some small fixes for arch_timer_edge_cases that I stumbled upon while debugging failures for this selftest on ampere-one.
Changes since v1: modified patch 3 based on suggestions from Marc.
I've done some tests with this on various machines - seems to be all good, however on ampere-one I now hit this in 10% of the runs: ==== Test Assertion Failure ==== arm64/arch_timer_edge_cases.c:481: timer_get_cntct(timer) >= DEF_CNT + (timer_get_cntfrq() * (uint64_t)(delta_2_ms) / 1000) pid=166657 tid=166657 errno=4 - Interrupted system call 1 0x0000000000404db3: test_run at arch_timer_edge_cases.c:933 2 0x0000000000401f9f: main at arch_timer_edge_cases.c:1062 3 0x0000ffffaedd625b: ?? ??:0 4 0x0000ffffaedd633b: ?? ??:0 5 0x00000000004020af: _start at ??:? timer_get_cntct(timer) >= DEF_CNT + msec_to_cycles(delta_2_ms)
This is not new, it was just hidden behind the other failure. I'll try to figure out what this is about (seems to be independent of the wait time)..
Not sure if you have figured it out. I can easily reproduce it on my box and I *guess* it is that we have some random XVAL values when we enable the timer..
test_reprogramming_timer() { local_irq_disable(); reset_timer_state(timer, DEF_CNT);
/* Program the timer to DEF_CNT + delta_1_ms. */ set_tval_irq(timer, msec_to_cycles(delta_1_ms), CTL_ENABLE);
[...] }
set_tval_irq() { timer_set_ctl(timer, ctl);
// There is a window that we enable the timer with *random* XVAL // values and we may get the unexpected interrupt.. And it's // unlikely that KVM can be aware of TVAL's change (and // re-evaluate the interrupt's pending state) before hitting the // GUEST_ASSERT().
timer_set_tval(timer, tval_cycles); }
I'm not familiar with the test so I'm not 100% sure that this is the root cause. But I hope this helps with your analysis ;-) .
Thanks, Zenghui
Hi Zenghui,
On Tue, 3 Jun 2025, Zenghui Yu wrote:
On 2025/5/27 22:24, Sebastian Ott wrote:
Some small fixes for arch_timer_edge_cases that I stumbled upon while debugging failures for this selftest on ampere-one.
Changes since v1: modified patch 3 based on suggestions from Marc.
I've done some tests with this on various machines - seems to be all good, however on ampere-one I now hit this in 10% of the runs: ==== Test Assertion Failure ==== arm64/arch_timer_edge_cases.c:481: timer_get_cntct(timer) >= DEF_CNT + (timer_get_cntfrq() * (uint64_t)(delta_2_ms) / 1000) pid=166657 tid=166657 errno=4 - Interrupted system call 1 0x0000000000404db3: test_run at arch_timer_edge_cases.c:933 2 0x0000000000401f9f: main at arch_timer_edge_cases.c:1062 3 0x0000ffffaedd625b: ?? ??:0 4 0x0000ffffaedd633b: ?? ??:0 5 0x00000000004020af: _start at ??:? timer_get_cntct(timer) >= DEF_CNT + msec_to_cycles(delta_2_ms)
This is not new, it was just hidden behind the other failure. I'll try to figure out what this is about (seems to be independent of the wait time)..
Not sure if you have figured it out. I can easily reproduce it on my box and I *guess* it is that we have some random XVAL values when we enable the timer..
Yes, I think so, too.
test_reprogramming_timer() { local_irq_disable(); reset_timer_state(timer, DEF_CNT);
My first attempt was to also initialize cval here
/* Program the timer to DEF_CNT + delta_1_ms. */ set_tval_irq(timer, msec_to_cycles(delta_1_ms), CTL_ENABLE);
[...] }
set_tval_irq() { timer_set_ctl(timer, ctl);
// There is a window that we enable the timer with *random* XVAL // values and we may get the unexpected interrupt.. And it's // unlikely that KVM can be aware of TVAL's change (and // re-evaluate the interrupt's pending state) before hitting the // GUEST_ASSERT().
timer_set_tval(timer, tval_cycles);
Yes, I stumbled over this as well. I've always assumed that this order is becauase of this from the architecture "If CNTV_CTL_EL0.ENABLE is 0, the value returned is UNKNOWN." However re-reading that part today I realized that this only concerns register reads.
Maybe somone on cc knows why it's in that order?
I'm currently testing this with the above swapped and it's looking good, so far.
}
I'm not familiar with the test so I'm not 100% sure that this is the root cause. But I hope this helps with your analysis ;-) .
It did, thanks!
Sebastian
On Wed, 4 Jun 2025, Sebastian Ott wrote:
On Tue, 3 Jun 2025, Zenghui Yu wrote:
On 2025/5/27 22:24, Sebastian Ott wrote:
Some small fixes for arch_timer_edge_cases that I stumbled upon while debugging failures for this selftest on ampere-one.
Changes since v1: modified patch 3 based on suggestions from Marc.
I've done some tests with this on various machines - seems to be all good, however on ampere-one I now hit this in 10% of the runs: ==== Test Assertion Failure ==== arm64/arch_timer_edge_cases.c:481: timer_get_cntct(timer) >= DEF_CNT + (timer_get_cntfrq() * (uint64_t)(delta_2_ms) / 1000) pid=166657 tid=166657 errno=4 - Interrupted system call 1 0x0000000000404db3: test_run at arch_timer_edge_cases.c:933 2 0x0000000000401f9f: main at arch_timer_edge_cases.c:1062 3 0x0000ffffaedd625b: ?? ??:0 4 0x0000ffffaedd633b: ?? ??:0 5 0x00000000004020af: _start at ??:? timer_get_cntct(timer) >= DEF_CNT + msec_to_cycles(delta_2_ms)
This is not new, it was just hidden behind the other failure. I'll try to figure out what this is about (seems to be independent of the wait time)..
Not sure if you have figured it out. I can easily reproduce it on my box and I *guess* it is that we have some random XVAL values when we enable the timer..
Yes, I think so, too.
test_reprogramming_timer() { local_irq_disable(); reset_timer_state(timer, DEF_CNT);
My first attempt was to also initialize cval here
Forgot to mention that I did this because my tests have shown that the interrupt didn't only trigger early (like before the reprogrammed delta) but instantly. This seemed to work but I think the order in set_tval_irq() is the actual issue.
/* Program the timer to DEF_CNT + delta_1_ms. */ set_tval_irq(timer, msec_to_cycles(delta_1_ms), CTL_ENABLE);
[...] }
set_tval_irq() { timer_set_ctl(timer, ctl);
// There is a window that we enable the timer with *random* XVAL // values and we may get the unexpected interrupt.. And it's // unlikely that KVM can be aware of TVAL's change (and // re-evaluate the interrupt's pending state) before hitting the // GUEST_ASSERT().
timer_set_tval(timer, tval_cycles);
Yes, I stumbled over this as well. I've always assumed that this order is becauase of this from the architecture "If CNTV_CTL_EL0.ENABLE is 0, the value returned is UNKNOWN." However re-reading that part today I realized that this only concerns register reads.
Maybe somone on cc knows why it's in that order?
I'm currently testing this with the above swapped and it's looking good, so far.
}
I'm not familiar with the test so I'm not 100% sure that this is the root cause. But I hope this helps with your analysis ;-) .
It did, thanks!
Sebastian
On Wed, 04 Jun 2025 21:58:48 +0100, Sebastian Ott sebott@redhat.com wrote:
Hi Zenghui,
On Tue, 3 Jun 2025, Zenghui Yu wrote:
On 2025/5/27 22:24, Sebastian Ott wrote:
Some small fixes for arch_timer_edge_cases that I stumbled upon while debugging failures for this selftest on ampere-one.
Changes since v1: modified patch 3 based on suggestions from Marc.
I've done some tests with this on various machines - seems to be all good, however on ampere-one I now hit this in 10% of the runs: ==== Test Assertion Failure ==== arm64/arch_timer_edge_cases.c:481: timer_get_cntct(timer) >= DEF_CNT + (timer_get_cntfrq() * (uint64_t)(delta_2_ms) / 1000) pid=166657 tid=166657 errno=4 - Interrupted system call 1 0x0000000000404db3: test_run at arch_timer_edge_cases.c:933 2 0x0000000000401f9f: main at arch_timer_edge_cases.c:1062 3 0x0000ffffaedd625b: ?? ??:0 4 0x0000ffffaedd633b: ?? ??:0 5 0x00000000004020af: _start at ??:? timer_get_cntct(timer) >= DEF_CNT + msec_to_cycles(delta_2_ms)
This is not new, it was just hidden behind the other failure. I'll try to figure out what this is about (seems to be independent of the wait time)..
Not sure if you have figured it out. I can easily reproduce it on my box and I *guess* it is that we have some random XVAL values when we enable the timer..
Yes, I think so, too.
test_reprogramming_timer() { local_irq_disable(); reset_timer_state(timer, DEF_CNT);
My first attempt was to also initialize cval here
Note that CVAL and TVAL are two views of the same thing. It should be enough to initialise one of them (though TVAL is stupidly narrow).
/* Program the timer to DEF_CNT + delta_1_ms. */ set_tval_irq(timer, msec_to_cycles(delta_1_ms), CTL_ENABLE);
[...] }
set_tval_irq() { timer_set_ctl(timer, ctl);
// There is a window that we enable the timer with *random* XVAL // values and we may get the unexpected interrupt.. And it's // unlikely that KVM can be aware of TVAL's change (and // re-evaluate the interrupt's pending state) before hitting the // GUEST_ASSERT().
timer_set_tval(timer, tval_cycles);
Yes, I stumbled over this as well. I've always assumed that this order is becauase of this from the architecture "If CNTV_CTL_EL0.ENABLE is 0, the value returned is UNKNOWN." However re-reading that part today I realized that this only concerns register reads.
Maybe somone on cc knows why it's in that order?
I can't think of a valid reason. Enabling the timer without having set the deadline first is just silly.
I'm currently testing this with the above swapped and it's looking good, so far.
The swapped version (set xVAL, then enable the timer) is the only one that makes any sense.
Thanks,
M.
linux-kselftest-mirror@lists.linaro.org