The following kernel BUG was noticed while running the selftests arm64 fp-stress test on stable-rc kernel versions 6.1.29-rc1 and 6.3.3-rc1.
Reported-by: Linux Kernel Functional Testing <lkft@linaro.org>
# selftests: arm64: fp-stress
# TAP version 13
# 1..80
# # 8 CPUs, 3 SVE VLs, 3 SME VLs, SME2 absent
# # Will run for 10s
...
# # ZA-VL-32-4: PID: 1091
# # ZA-VL-64-4: Streaming mode vector length: 512 bits

[  263.834190] ==================================================================
[  263.834270] BUG: KFENCE: memory corruption in fpsimd_release_task+0x28/0x50
[  263.834270]
[  263.834419] Corrupted memory at 0x00000000d9c0a375 [ ! ! ! ! ! ! . . . . . . . . . . ] (in kfence-#158):
[  263.834929]  fpsimd_release_task+0x28/0x50
[  263.835074]  arch_release_task_struct+0x1c/0x30
[  263.835221]  __put_task_struct+0x164/0x220
[  263.835336]  delayed_put_task_struct+0x60/0x128
[  263.835484]  rcu_core+0x318/0x950
[  263.835632]  rcu_core_si+0x1c/0x30
[  263.835770]  __do_softirq+0x110/0x3d8
[  263.835874]  run_ksoftirqd+0x40/0xe0
[  263.835994]  smpboot_thread_fn+0x1d0/0x260
[  263.836105]  kthread+0xec/0x190
[  263.836221]  ret_from_fork+0x10/0x20
[  263.836342]
[  263.836393] kfence-#158: 0x00000000c8819329-0x000000009e00cc22, size=546, cache=kmalloc-1k
[  263.836393]
[  263.836527] allocated by task 1112 on cpu 5 at 252.422888s:
[  263.836697]  do_sme_acc+0xa8/0x230
[  263.836821]  el0_sme_acc+0x40/0xa0
[  263.836966]  el0t_64_sync_handler+0xa8/0xf0
[  263.837114]  el0t_64_sync+0x190/0x198
[  263.837224]
[  263.837275] freed by task 15 on cpu 0 at 263.833793s:
[  263.837500]  fpsimd_release_task+0x28/0x50
[  263.837629]  arch_release_task_struct+0x1c/0x30
[  263.837773]  __put_task_struct+0x164/0x220
[  263.837886]  delayed_put_task_struct+0x60/0x128
[  263.838032]  rcu_core+0x318/0x950
[  263.838176]  rcu_core_si+0x1c/0x30
[  263.838310]  __do_softirq+0x110/0x3d8
[  263.838417]  run_ksoftirqd+0x40/0xe0
[  263.838521]  smpboot_thread_fn+0x1d0/0x260
[  263.838626]  kthread+0xec/0x190
[  263.838742]  ret_from_fork+0x10/0x20
[  263.838861]
[  263.838913] CPU: 0 PID: 15 Comm: ksoftirqd/0 Not tainted 6.3.3-rc1 #1
[  263.839037] Hardware name: FVP Base RevC (DT)
[  263.839111] ==================================================================

# # ZA-VL-64-4: PID: 1089
# # SSVE-VL-64-4: Streaming mode Vector length: 512 bits
# # SSVE-VL-64-4: PID: 1088
# # ZA-VL-16-4: Streaming mode vector length: 128 bits
# # ZA-VL-16-4: PID: 1093
# # FPSIMD-5-0: Vector length: 128 bits
# # FPSIMD-5-0: PID: 1094
# # SVE-VL-32-5: Vector length: 256 bits
# # SVE-VL-32-5: PID: 1096
# # SSVE-VL-64-5: Streaming mode Vector length: 512 bits
# # SVE-VL-64-5: Vector length: 512 bits
# # SVE-VL-64-5: PID: 1095
# # SSVE-VL-64-5: PID: 1098
# # ZA-VL-64-5: Streaming mode vector length: 512 bits

[  263.905145] ==================================================================
[  263.905299] BUG: KFENCE: memory corruption in fpsimd_release_task+0x28/0x50
[  263.905299]
[  263.905444] Corrupted memory at 0x00000000e3d2342a [ ! ! ! ! ! ! . . . . . . . . . . ] (in kfence-#146):
[  263.905957]  fpsimd_release_task+0x28/0x50
[  263.906088]  arch_release_task_struct+0x1c/0x30
[  263.906236]  __put_task_struct+0x164/0x220
[  263.906348]  delayed_put_task_struct+0x60/0x128
[  263.906499]  rcu_core+0x318/0x950
[  263.906647]  rcu_core_si+0x1c/0x30
[  263.906786]  __do_softirq+0x110/0x3d8
[  263.906892]  ____do_softirq+0x1c/0x30
[  263.907015]  call_on_irq_stack+0x24/0x58
[  263.907139]  do_softirq_own_stack+0x28/0x40
[  263.907305]  __irq_exit_rcu+0x94/0xf8
[  263.907454]  irq_exit_rcu+0x1c/0x40
[  263.907599]  el0_interrupt+0x58/0x160
[  263.907765]  __el0_irq_handler_common+0x18/0x28
[  263.907879]  el0t_64_irq_handler+0x10/0x20
[  263.907989]  el0t_64_irq+0x190/0x198
[  263.908098]
[  263.908149] kfence-#146: 0x000000005a8569e6-0x00000000c704c501, size=546, cache=kmalloc-1k
[  263.908149]
[  263.908282] allocated by task 1102 on cpu 0 at 251.030980s:
[  263.908452]  do_sme_acc+0xa8/0x230
[  263.908576]  el0_sme_acc+0x40/0xa0
[  263.908725]  el0t_64_sync_handler+0xa8/0xf0
[  263.908879]  el0t_64_sync+0x190/0x198
[  263.908986]
[  263.909036] freed by task 1 on cpu 3 at 263.904989s:
[  263.909311]  fpsimd_release_task+0x28/0x50
[  263.909439]  arch_release_task_struct+0x1c/0x30
[  263.909584]  __put_task_struct+0x164/0x220
[  263.909696]  delayed_put_task_struct+0x60/0x128
[  263.909842]  rcu_core+0x318/0x950
[  263.909986]  rcu_core_si+0x1c/0x30
[  263.910175]  __do_softirq+0x110/0x3d8
[  263.910279]  ____do_softirq+0x1c/0x30
[  263.910399]  call_on_irq_stack+0x24/0x58
[  263.910520]  do_softirq_own_stack+0x28/0x40
[  263.910645]  __irq_exit_rcu+0x94/0xf8
[  263.910792]  irq_exit_rcu+0x1c/0x40
[  263.910937]  el0_interrupt+0x58/0x160
[  263.911043]  __el0_irq_handler_common+0x18/0x28
[  263.911154]  el0t_64_irq_handler+0x10/0x20
[  263.911261]  el0t_64_irq+0x190/0x198
[  263.911387]
[  263.911448] CPU: 3 PID: 1 Comm: systemd Tainted: G B 6.3.3-rc1 #1
[  263.911575] Hardware name: FVP Base RevC (DT)
[  263.911653] ==================================================================
..
# ok 80 ZA-VL-16-7
# # Totals: pass:80 fail:0 xfail:0 xpass:0 skip:0 error:0
ok 32 selftests: arm64: fp-stress
Steps to reproduce:
============
# To install tuxrun on your system globally:
# sudo pip3 install -U tuxrun==0.42.0
#
# See https://tuxrun.org/ for complete documentation.
tuxrun \
  --runtime podman \
  --device fvp-aemva \
  --boot-args rw \
  --kernel https://storage.tuxsuite.com/public/linaro/lkft/builds/2Pq5NvLiBcWRMuy6lXftD... \
  --modules https://storage.tuxsuite.com/public/linaro/lkft/builds/2Pq5NvLiBcWRMuy6lXftD... \
  --rootfs https://storage.tuxboot.com/debian/bookworm/arm64/rootfs.ext4.xz \
  --parameters SKIPFILE=skipfile-lkft.yaml \
  --parameters KSELFTEST=https://storage.tuxsuite.com/public/linaro/lkft/builds/2Pq5NvLiBcWRMuy6lXftD... \
  --image tuxrun:fvp \
  --tests kselftest-arm64 \
  --timeouts boot=60 kselftest-arm64=60
Test log links:
========
- https://qa-reports.linaro.org/lkft/linux-stable-rc-linux-6.1.y/build/v6.1.28...
- https://qa-reports.linaro.org/lkft/linux-stable-rc-linux-6.1.y/build/v6.1.28...
- https://qa-reports.linaro.org/lkft/linux-stable-rc-linux-6.1.y/build/v6.1.28...
- https://qa-reports.linaro.org/lkft/linux-stable-rc-linux-6.3.y/build/v6.3.2-...
- https://qa-reports.linaro.org/lkft/linux-stable-rc-linux-6.3.y/build/v6.3.2-...
- https://qa-reports.linaro.org/lkft/linux-stable-rc-linux-6.3.y/build/v6.3.2-...
--
Linaro LKFT
https://lkft.linaro.org
Hi Naresh,
On Tue, May 16, 2023 at 11:58:40AM +0530, Naresh Kamboju wrote:
The following kernel BUG was noticed while running the selftests arm64 fp-stress test on stable-rc kernel versions 6.1.29-rc1 and 6.3.3-rc1.
Is there a known-good build so that we could attempt a bisection?
Reported-by: Linux Kernel Functional Testing <lkft@linaro.org>
# selftests: arm64: fp-stress
# TAP version 13
# 1..80
# # 8 CPUs, 3 SVE VLs, 3 SME VLs, SME2 absent
# # Will run for 10s
...
# # ZA-VL-32-4: PID: 1091
[  263.834190] ==================================================================
[  263.834270] BUG: KFENCE: memory corruption in fpsimd_release_task+0x28/0x50
[  263.834270]
[  263.834419] Corrupted memory at 0x00000000d9c0a375 [ ! ! ! ! ! ! . . . . . . . . . . ] (in kfence-#158):
[  263.834929]  fpsimd_release_task+0x28/0x50
[  263.835074]  arch_release_task_struct+0x1c/0x30
[  263.835221]  __put_task_struct+0x164/0x220
[  263.835336]  delayed_put_task_struct+0x60/0x128
[  263.835484]  rcu_core+0x318/0x950
[  263.835632]  rcu_core_si+0x1c/0x30
[  263.835770]  __do_softirq+0x110/0x3d8
[  263.835874]  run_ksoftirqd+0x40/0xe0
[  263.835994]  smpboot_thread_fn+0x1d0/0x260
[  263.836105]  kthread+0xec/0x190
[  263.836221]  ret_from_fork+0x10/0x20
[  263.836342]
[  263.836393] kfence-#158: 0x00000000c8819329-0x000000009e00cc22, size=546, cache=kmalloc-1k
[  263.836393]
[  263.836527] allocated by task 1112 on cpu 5 at 252.422888s:
[  263.836697]  do_sme_acc+0xa8/0x230
[  263.836821]  el0_sme_acc+0x40/0xa0
[  263.836966]  el0t_64_sync_handler+0xa8/0xf0
[  263.837114]  el0t_64_sync+0x190/0x198
Mark -- given that this is an SME allocation, please can you take a look? I think the implication of the kfence report is that we're writing beyond the end of 'task->thread.sme_state' at some point and corrupting the redzone.
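To illustrate the shape of bug such a report points at (a sketch only, not the actual code path; the function name and the sizes here are invented):

/*
 * Illustrative sketch only -- not the actual kernel code path.  KFENCE
 * only notices the damaged canary bytes behind the object when it is
 * freed, which is why the splat fires in fpsimd_release_task() even
 * though the out-of-bounds write happened much earlier.
 */
static void kfence_overrun_shape(struct task_struct *task,
				 size_t old_size, size_t new_size)
{
	/* Object sized for the old vector length, tracked by KFENCE. */
	task->thread.sme_state = kzalloc(old_size, GFP_KERNEL);

	/*
	 * Some time later: a write sized for a larger vector length runs
	 * past the end of the object, into the KFENCE redzone.
	 */
	memset(task->thread.sme_state, 0, new_size);	/* new_size > old_size */

	/* The corruption is only detected and reported here, at free time. */
	kfree(task->thread.sme_state);
}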
There are two reports here, so hopefully it's not too hard to repro.
Will
On Tue, May 16, 2023 at 02:44:49PM +0100, Will Deacon wrote:
Mark -- given that this is an SME allocation, please can you take a look?
I'm on holiday.
I think the implication of the kfence report is that we're writing beyond the end of 'task->thread.sme_state' at some point and corrupting the redzone.
There are two reports here, so hopefully it's not too hard to repro.
I think I *once* saw something that might be this, but I've never reproduced it. Given that this suddenly came up with LKFT on stable kernels when there have been no relevant changes AFAIR, I suspect it's not showing up terribly reliably there either.
Naresh,
On Tue, May 16, 2023 at 02:44:49PM +0100, Will Deacon wrote:
On Tue, May 16, 2023 at 11:58:40AM +0530, Naresh Kamboju wrote:
The following kernel BUG was noticed while running the selftests arm64 fp-stress test on stable-rc kernel versions 6.1.29-rc1 and 6.3.3-rc1.
Is there a known-good build so that we could attempt a bisection?
FWIW, I've been trying (and failing) all day to reproduce this in QEMU. I matched the same VL configuration as you have in the fastmodel and tried enabling additional memory debugging options too, but I'm yet to see a kfence splat (or any other splat fwiw).
How often do you see this?
Will
On Mon, May 22, 2023 at 05:41:17PM +0100, Will Deacon wrote:
On Tue, May 16, 2023 at 02:44:49PM +0100, Will Deacon wrote:
On Tue, May 16, 2023 at 11:58:40AM +0530, Naresh Kamboju wrote:
The following kernel BUG was noticed while running the selftests arm64 fp-stress test on stable-rc kernel versions 6.1.29-rc1 and 6.3.3-rc1.
Is there a known-good build so that we could attempt a bisection?
FWIW, I've been trying (and failing) all day to reproduce this in QEMU. I matched the same VL configuration as you have in the fastmodel and tried enabling additional memory debugging options too, but I'm yet to see a kfence splat (or any other splat fwiw).
How often do you see this?
As I said in another mail, I've also been unable to reproduce this. FWIW I *suspect* that it might need to be run in the context of a full kselftest run to manifest, rather than just running fp-stress in isolation. That's mostly a guess, but the kfence trap appeared to be happening on free at a point where the test program shouldn't be exiting any tasks and shouldn't be changing vector lengths on tasks that have used either of the vector extensions.
On Mon, 22 May 2023 at 22:11, Will Deacon <will@kernel.org> wrote:
Naresh,
On Tue, May 16, 2023 at 02:44:49PM +0100, Will Deacon wrote:
On Tue, May 16, 2023 at 11:58:40AM +0530, Naresh Kamboju wrote:
The following kernel BUG was noticed while running the selftests arm64 fp-stress test on stable-rc kernel versions 6.1.29-rc1 and 6.3.3-rc1.
Is there a known-good build so that we could attempt a bisection?
FWIW, I've been trying (and failing) all day to reproduce this in QEMU. I matched the same VL configuration as you have in the fastmodel and tried enabling additional memory debugging options too, but I'm yet to see a kfence splat (or any other splat fwiw).
Thanks for trying it out. I have shared a log link below from Linux next-20230314, which is where we first started noticing this BUG; it is the raw FVP log with all the -C details.
How often do you see this?
Our CI system runs the selftests: arm64 subtests using ./run_kselftest.sh -c arm64
With the full selftests: arm64 run, the probability of occurrence is around 40%.
On Linux next, this fp-stress BUG has been happening *intermittently* since next-20230314 (March 14, 2023). On Linux stable-rc, it started happening on 6.3.2-rc1 and 6.1.28-rc2.
More details (in a previous email):
- https://lore.kernel.org/all/CA+G9fYtZjGomLjDi+Vf-hdcLpKPKbPmn4nwoPXvn24SG2hE...
- https://qa-reports.linaro.org/lkft/linux-next-master/build/next-20230314/tes... 9588df685892e898be8969def31c5aa074b2faada33f12ebc88fd7e7b52893cd/details/
Log link:
- https://qa-reports.linaro.org/lkft/linux-next-master/build/next-20230314/tes...
Will
Hi Will,
On Tue, 16 May 2023 at 19:14, Will Deacon <will@kernel.org> wrote:
Hi Naresh,
On Tue, May 16, 2023 at 11:58:40AM +0530, Naresh Kamboju wrote:
The following kernel BUG was noticed while running the selftests arm64 fp-stress test on stable-rc kernel versions 6.1.29-rc1 and 6.3.3-rc1.
Is there a known-good build so that we could attempt a bisection?
[ Sorry for the delay ]
Since this problem is intermittent, it is not easy to bisect.
On Linux next, this fp-stress BUG has been happening *intermittently* since next-20230314 (March 14, 2023).
On Linux stable-rc it started happening on 6.3.2-rc1 and 6.1.28-rc2.
Here is the proof showing the intermittent occurrence on Linux next:
- https://qa-reports.linaro.org/lkft/linux-next-master/build/next-20230518/tes...
- https://qa-reports.linaro.org/lkft/linux-next-master/build/next-20230518/tes...
- https://qa-reports.linaro.org/lkft/linux-next-master/build/next-20230518/tes...
Here is the proof showing the intermittent occurrence on stable-rc 6.3 and 6.1:
- https://qa-reports.linaro.org/lkft/linux-stable-rc-linux-6.3.y/build/v6.3-rc...
- https://qa-reports.linaro.org/lkft/linux-stable-rc-linux-6.1.y/build/v6.1.22...
- Naresh
On Thu, May 25, 2023 at 06:49:36PM +0530, Naresh Kamboju wrote:
On Linux next, this fp-stress BUG has been happening *intermittently* since next-20230314 (March 14, 2023).
A quick scan through shows no updates related to FP in the few weeks preceding that (very few arm64 updates at all, really). This all seems to match the pattern I mentioned when this was originally reported - it does happen sometimes, but there's some very thin race or something which means it only comes up *very* rarely. None of which is helpful for figuring out what's going on :/
I don't know this code at all so probably this is dumb... I don't understand how vec_set_vector_length() ensures that sme_state_size() stays in sync with the actual size allocated in sme_alloc().
arch/arm64/kernel/fpsimd.c
   847  int vec_set_vector_length(struct task_struct *task, enum vec_type type,
   848                            unsigned long vl, unsigned long flags)
                                                ^^^
"vl" comes from the user and is 0-u16max.
   849  {
   850          if (flags & ~(unsigned long)(PR_SVE_VL_INHERIT |
   851                                       PR_SVE_SET_VL_ONEXEC))
   852                  return -EINVAL;
   853
   854          if (!sve_vl_valid(vl))
valid values are '16-8192'
   855                  return -EINVAL;
   856
   857          /*
   858           * Clamp to the maximum vector length that VL-agnostic code
   859           * can work with.  A flag may be assigned in the future to
   860           * allow setting of larger vector lengths without confusing
   861           * older software.
   862           */
   863          if (vl > VL_ARCH_MAX)
   864                  vl = VL_ARCH_MAX;
Now vl is 16-256.
   865
   866          vl = find_supported_vector_length(type, vl);
type is ARM64_VEC_SVE. I've looked at this function for a while and I don't see anything which ensures that "vl" is less than the current value.
   867
   868          if (flags & (PR_SVE_VL_INHERIT |
   869                       PR_SVE_SET_VL_ONEXEC))
   870                  task_set_vl_onexec(task, type, vl);
   871          else
   872                  /* Reset VL to system default on next exec: */
   873                  task_set_vl_onexec(task, type, 0);
   874
   875          /* Only actually set the VL if not deferred: */
   876          if (flags & PR_SVE_SET_VL_ONEXEC)
Assume the PR_SVE_SET_VL_ONEXEC flag is not set.
   877                  goto out;
   878
   879          if (vl == task_get_vl(task, type))
This checks whether the requested VL equals the task's current VL and, if so, we are done.
   880                  goto out;
   881
   882          /*
   883           * To ensure the FPSIMD bits of the SVE vector registers are preserved,
   884           * write any live register state back to task_struct, and convert to a
   885           * regular FPSIMD thread.
   886           */
   887          if (task == current) {
   888                  get_cpu_fpsimd_context();
   889
   890                  fpsimd_save();
   891          }
   892
   893          fpsimd_flush_task_state(task);
   894          if (test_and_clear_tsk_thread_flag(task, TIF_SVE) ||
   895              thread_sm_enabled(&task->thread)) {
   896                  sve_to_fpsimd(task);
   897                  task->thread.fp_type = FP_STATE_FPSIMD;
   898          }
   899
   900          if (system_supports_sme() && type == ARM64_VEC_SME) {
   901                  task->thread.svcr &= ~(SVCR_SM_MASK |
   902                                         SVCR_ZA_MASK);
   903                  clear_thread_flag(TIF_SME);
   904          }
   905
   906          if (task == current)
   907                  put_cpu_fpsimd_context();
   908
   909          /*
   910           * Force reallocation of task SVE and SME state to the correct
   911           * size on next use:
   912           */
   913          sve_free(task);
   914          if (system_supports_sme() && type == ARM64_VEC_SME)
   915                  sme_free(task);
   916
   917          task_set_vl(task, type, vl);
"vl" is set here. This is fine if we are setting it to a smaller value, but if we are setting it to a larger value then I think we need to realloc the ->sme_state buffer.
When we call sme_alloc() it will say the buffer is already allocated and just zero out what we need for "vl", but the existing buffer is too small.
   918
   919  out:
   920          update_tsk_thread_flag(task, vec_vl_inherit_flag(type),
   921                                 flags & PR_SVE_VL_INHERIT);
   922
   923          return 0;
   924  }
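For reference, sme_alloc() at the time looked roughly like this (a reconstruction for illustration, not a verbatim copy of the kernel source). Note that an existing buffer is only zeroed, using a size computed from the task's *current* SME vector length, and is never grown:

void sme_alloc(struct task_struct *task)
{
	if (task->thread.sme_state) {
		/*
		 * Buffer already present: only zero sme_state_size(task)
		 * bytes, while the allocation itself keeps whatever size
		 * it had when it was first made.
		 */
		memset(task->thread.sme_state, 0, sme_state_size(task));
		return;
	}

	/* First use: allocate for the current SME vector length. */
	task->thread.sme_state =
		kzalloc(sme_state_size(task), GFP_KERNEL);
}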
regards,
dan carpenter
On Wed, May 17, 2023 at 10:30:29PM +0300, Dan Carpenter wrote:
I don't know this code at all so probably this is dumb... I don't understand how vec_set_vector_length() ensures that sme_state_size() stays in sync with the actual size allocated in sme_alloc().
866 vl = find_supported_vector_length(type, vl);
type is ARM64_VEC_SVE. I've looked at this function for a while and I don't see anything which ensures that "vl" is less than the current value.
It could be either ARM64_VEC_SVE or ARM64_VEC_SME.
917 task_set_vl(task, type, vl);
"vl" is set here. This is fine if we are setting it to a smaller value, but if we are setting it to a larger value then I think we need to realloc the ->sme_state buffer.
When we call sme_alloc() it will say the buffer is already allocated and just zero out what we need for "vl", but the existing buffer is too small.
If we are setting the SVE vector length we do not need to reallocate the SME state since the size of the data stored in the sme_state buffer is influenced only by the SME vector length, not the SVE vector length. We unconditionally free the SVE state (causing it to be reallocated when needed) since the size needed for it depends on both vector lengths.
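As a rough, standalone sketch of those size relationships (simplified; not the kernel's actual sve_state_size()/sme_state_size() implementation, and the struct and helper names here are invented):

#include <stdio.h>
#include <stddef.h>

/* Invented stand-in for a task's configured vector lengths, in bytes. */
struct vls {
	size_t sve_vl;
	size_t sme_vl;
};

/* ZA is an SVL x SVL byte array, so its size depends only on the SME VL. */
static size_t za_bytes(const struct vls *v)
{
	return v->sme_vl * v->sme_vl;
}

/*
 * The SVE register image has to be big enough for whichever vector length
 * is larger, because streaming mode reuses the same storage:
 * 32 Z regs of VL bytes, 16 P regs of VL/8 bytes, plus FFR (VL/8 bytes).
 */
static size_t sve_bytes(const struct vls *v)
{
	size_t vl = v->sve_vl > v->sme_vl ? v->sve_vl : v->sme_vl;

	return 32 * vl + 17 * (vl / 8);
}

int main(void)
{
	struct vls v = { .sve_vl = 32, .sme_vl = 64 };	/* 256-bit SVE, 512-bit SME */

	printf("ZA storage: %zu bytes, SVE register image: %zu bytes\n",
	       za_bytes(&v), sve_bytes(&v));
	return 0;
}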
On Thu, May 18, 2023 at 10:52:31AM +0900, Mark Brown wrote:
When we call sme_alloc() it will say the buffer is already allocated and just zero out what we need for "vl", but the existing buffer is too small.
If we are setting the SVE vector length we do not need to reallocate the SME state since the size of the data stored in the sme_state buffer is influenced only by the SME vector length, not the SVE vector length. We unconditionally free the SVE state (causing it to be reallocated when needed) since the size needed for it depends on both vector lengths.
arch/arm64/kernel/fpsimd.c
   909          /*
   910           * Force reallocation of task SVE and SME state to the correct
   911           * size on next use:
   912           */
   913          sve_free(task);
                ^^^^^^^^^^^^^^

Sure, this forces a reallocation. But what prevents it from happening before we reach the task_set_vl() line?
   914          if (system_supports_sme() && type == ARM64_VEC_SME)
   915                  sme_free(task);
   916
   917          task_set_vl(task, type, vl);
   918
   919  out:
   920          update_tsk_thread_flag(task, vec_vl_inherit_flag(type),
   921                                 flags & PR_SVE_VL_INHERIT);
   922
   923          return 0;
regards,
dan carpenter
On Mon, May 22, 2023 at 06:40:59PM +0300, Dan Carpenter wrote:
On Thu, May 18, 2023 at 10:52:31AM +0900, Mark Brown wrote:
When we call sme_alloc() it will say the buffer is already allocated and just zero out what we need for "vl", but the existing buffer is too small.
If we are setting the SVE vector length we do not need to reallocate the SME state since the size of the data stored in the sme_state buffer is influenced only by the SME vector length, not the SVE vector length. We unconditionally free the SVE state (causing it to be reallocated when needed) since the size needed for it depends on both vector lengths.
arch/arm64/kernel/fpsimd.c
   909          /*
   910           * Force reallocation of task SVE and SME state to the correct
   911           * size on next use:
   912           */
   913          sve_free(task);
Sure, this forces a reallocation. But what prevents it from happening before we reach the task_set_vl() line?
Reallocation is either triggered by a trap from userspace or via ptrace, as is a vector length configuration. The two cases should already be prevented from running simultaneously, and neither can itself perform two of these operations at once.
On Tue, May 16, 2023 at 11:58:40AM +0530, Naresh Kamboju wrote:
# To install tuxrun on your system globally:
# sudo pip3 install -U tuxrun==0.42.0
I'm not thrilled about the idea of installing some Python package outside of my distro package manager, especially not running as root, but I *do* have a checked out copy of tuxrun which normally seems to do something...
# # See https://tuxrun.org/ for complete documentation.
tuxrun \
  --runtime podman \
  --device fvp-aemva \
  --boot-args rw \
  --kernel https://storage.tuxsuite.com/public/linaro/lkft/builds/2Pq5NvLiBcWRMuy6lXftD... \
  --modules https://storage.tuxsuite.com/public/linaro/lkft/builds/2Pq5NvLiBcWRMuy6lXftD... \
  --rootfs https://storage.tuxboot.com/debian/bookworm/arm64/rootfs.ext4.xz \
  --parameters SKIPFILE=skipfile-lkft.yaml \
  --parameters KSELFTEST=https://storage.tuxsuite.com/public/linaro/lkft/builds/2Pq5NvLiBcWRMuy6lXftD... \
  --image tuxrun:fvp \
  --tests kselftest-arm64 \
  --timeouts boot=60 kselftest-arm64=60
This command does not work for me. After fixing up the fact that multiple lines have continuation characters that are nonfunctional due to being wrapped onto the next line, I get:
| Error: error getting default registries to try: short-name "tuxrun:fvp" did not resolve to an alias and no unqualified-search registries are defined in "/etc/containers/registries.conf"
Trying tip of tree tuxrun gives the same result. Grovelling around in the documentation I see there's a need to manually build some containers for the FVP, so I was able to get the above command to boot with the --image option removed and by switching to docker as the runtime. However, after faffing for a very large amount of time even by the standards of the model, it appeared to just shut down the model without starting kselftest, possibly due to having mounted some of the filesystems read-only:
2023-05-22T21:03:43 Using a character delay of 50 (ms)
2023-05-22T21:03:43 #⏎
2023-05-22T21:03:43 [?2004l[?2004hroot@runner-pqlayms-project-40964107-concurrent-5:~# #
2023-05-22T21:03:43 lava-test-shell: Wait for prompt ['root@(.*):[/~]#'] (timeout 01:00:00)
2023-05-22T21:03:43 #
2023-05-22T21:03:43 Using /lava-1
2023-05-22T21:03:43 Sending with 50 millisecond of delay
2023-05-22T21:03:43 export SHELL=/bin/sh⏎
2023-05-22T21:03:45 [?2004l[?2004hroot@runner-pqlayms-project-40964107-concurrent-5:~# export SHELL=/bin/sh
2023-05-22T21:03:45 export SHELL=/bin/sh
2023-05-22T21:03:45 Sending with 50 millisecond of delay
2023-05-22T21:03:45 . /lava-1/environment⏎
2023-05-22T21:03:47 [?2004l[?2004hroot@runner-pqlayms-project-40964107-concurrent-5:~# . /lava-1/environment
2023-05-22T21:03:47 . /lava-1/environment
2023-05-22T21:03:47 Will listen to feedbacks from 'terminal_1' for 1 second
2023-05-22T21:03:47 Will listen to feedbacks from 'terminal_2' for 1 second
2023-05-22T21:03:47 Will listen to feedbacks from 'terminal_3' for 1 second
2023-05-22T21:03:47 Sending with 50 millisecond of delay
2023-05-22T21:03:47 /lava-1/bin/lava-test-runner /lava-1/0⏎
2023-05-22T21:03:51 [?2004l[?2004hroot@runner-pqlayms-project-40964107-concurrent-5:~# /lava-1/bin/lava-test-runner /lava-1/0
2023-05-22T21:03:51 Test shell timeout: 10s (minimum of the action and connection timeout)
2023-05-22T21:03:51 /lava-1/bin/lava-test-runner /lava-1/0
2023-05-22T21:03:52 [?2004lmkdir: cannot create directory ‘/lava-1/0/results’: Read-only file system
2023-05-22T21:03:53 mv: cannot move '/lava-1/0/lava-test-runner.conf' to '/lava-1/0/lava-test-runner.conf-1684789015': Read-only file system
2023-05-22T21:03:54 cat: /lava-1/0/lava-test-runner.conf-1684789015: No such file or directory
2023-05-22T21:03:55 ok: lava_test_shell seems to have completed
2023-05-22T21:03:55 end: 3.1 lava-test-shell (duration 00:00:12) [common]
2023-05-22T21:03:55 end: 3 lava-test-retry (duration 00:00:12) [common]
2023-05-22T21:03:55 start: 4 finalize (timeout 00:10:00) [common]
2023-05-22T21:03:55 start: 4.1 power-off (timeout 00:01:00) [common]
2023-05-22T21:03:55 end: 4.1 power-off (duration 00:00:00) [common]
2023-05-22T21:03:55 start: 4.2 read-feedback (timeout 00:10:00) [common]
Attempting to use podman as the runtime as your command said had various problems:
2023-05-22T21:07:01 start: 2.1.1 check-fvp-version (timeout 01:00:00) [common]
2023-05-22T21:07:01 sh -c docker run --rm fvp:aemva-11.21.15 /opt/model/FVP_AEMvA/models/Linux64_GCC-9.3/FVP_Base_RevC-2xAEMvA --version
2023-05-22T21:07:01 Parsed command exited 1.
2023-05-22T21:07:01 action: check-fvp-version
  command: ['sh', '-c', 'docker run --rm fvp:aemva-11.21.15 /opt/model/FVP_AEMvA/models/Linux64_GCC-9.3/FVP_Base_RevC-2xAEMvA --version']
  message: Command '['sh', '-c', 'docker run --rm fvp:aemva-11.21.15 /opt/model/FVP_AEMvA/models/Linux64_GCC-9.3/FVP_Base_RevC-2xAEMvA --version']' returned non-zero exit status 1.
  output: Missing runtime '/usr/bin/podman'
  return code: 1
(I do have podman installed though I rarely use it, this looks to be in the LAVA container though)
Test log links:
https://qa-reports.linaro.org/lkft/linux-stable-rc-linux-6.1.y/build/v6.1.28...
https://qa-reports.linaro.org/lkft/linux-stable-rc-linux-6.1.y/build/v6.1.28...
https://qa-reports.linaro.org/lkft/linux-stable-rc-linux-6.1.y/build/v6.1.28...
https://qa-reports.linaro.org/lkft/linux-stable-rc-linux-6.3.y/build/v6.3.2-...
https://qa-reports.linaro.org/lkft/linux-stable-rc-linux-6.3.y/build/v6.3.2-...
https://qa-reports.linaro.org/lkft/linux-stable-rc-linux-6.3.y/build/v6.3.2-...
None of these seem to provide me with information like what kernel config was used but I did manage to find
https://storage.tuxsuite.com/public/linaro/lkft/builds/2Pq5NvLiBcWRMuy6lXftD...
which might be it? Or one of them? However even trying to use that I'm unable to reproduce issues with either the FVP or qemu.
On Tue, 23 May 2023 at 03:42, Mark Brown <broonie@kernel.org> wrote:
On Tue, May 16, 2023 at 11:58:40AM +0530, Naresh Kamboju wrote:
# To install tuxrun on your system globally:
# sudo pip3 install -U tuxrun==0.42.0
I'm not thrilled about the idea of installing some Python package outside of my distro package manager, especially not running as root, but I *do* have a checked out copy of tuxrun which normally seems to do something...
# # See https://tuxrun.org/ for complete documentation.
tuxrun \
  --runtime podman \
  --device fvp-aemva \
  --boot-args rw \
  --kernel https://storage.tuxsuite.com/public/linaro/lkft/builds/2Pq5NvLiBcWRMuy6lXftD... \
  --modules https://storage.tuxsuite.com/public/linaro/lkft/builds/2Pq5NvLiBcWRMuy6lXftD... \
  --rootfs https://storage.tuxboot.com/debian/bookworm/arm64/rootfs.ext4.xz \
  --parameters SKIPFILE=skipfile-lkft.yaml \
  --parameters KSELFTEST=https://storage.tuxsuite.com/public/linaro/lkft/builds/2Pq5NvLiBcWRMuy6lXftD... \
  --image tuxrun:fvp \
  --tests kselftest-arm64 \
  --timeouts boot=60 kselftest-arm64=60
This command does not work for me. After fixing up the fact that multiple lines have continuation characters that are nonfunctional due to being wrapped onto the next line, I get:
| Error: error getting default registries to try: short-name "tuxrun:fvp" did not resolve to an alias and no unqualified-search registries are defined in "/etc/containers/registries.conf"
Trying tip of tree tuxrun gives the same result. Grovelling around in the documentation I see there's a need to manually build some containers for the FVP, so I was able to get the above command to boot with the --image option removed and by switching to docker as the runtime. However, after faffing for a very large amount of time even by the standards of the model, it appeared to just shut down the model without starting kselftest, possibly due to having mounted some of the filesystems read-only:
2023-05-22T21:03:43 Using a character delay of 50 (ms)
2023-05-22T21:03:43 #⏎
2023-05-22T21:03:43 [?2004l[?2004hroot@runner-pqlayms-project-40964107-concurrent-5:~# #
2023-05-22T21:03:43 lava-test-shell: Wait for prompt ['root@(.*):[/~]#'] (timeout 01:00:00)
2023-05-22T21:03:43 #
2023-05-22T21:03:43 Using /lava-1
2023-05-22T21:03:43 Sending with 50 millisecond of delay
2023-05-22T21:03:43 export SHELL=/bin/sh⏎
2023-05-22T21:03:45 [?2004l[?2004hroot@runner-pqlayms-project-40964107-concurrent-5:~# export SHELL=/bin/sh
2023-05-22T21:03:45 export SHELL=/bin/sh
2023-05-22T21:03:45 Sending with 50 millisecond of delay
2023-05-22T21:03:45 . /lava-1/environment⏎
2023-05-22T21:03:47 [?2004l[?2004hroot@runner-pqlayms-project-40964107-concurrent-5:~# . /lava-1/environment
2023-05-22T21:03:47 . /lava-1/environment
2023-05-22T21:03:47 Will listen to feedbacks from 'terminal_1' for 1 second
2023-05-22T21:03:47 Will listen to feedbacks from 'terminal_2' for 1 second
2023-05-22T21:03:47 Will listen to feedbacks from 'terminal_3' for 1 second
2023-05-22T21:03:47 Sending with 50 millisecond of delay
2023-05-22T21:03:47 /lava-1/bin/lava-test-runner /lava-1/0⏎
2023-05-22T21:03:51 [?2004l[?2004hroot@runner-pqlayms-project-40964107-concurrent-5:~# /lava-1/bin/lava-test-runner /lava-1/0
2023-05-22T21:03:51 Test shell timeout: 10s (minimum of the action and connection timeout)
2023-05-22T21:03:51 /lava-1/bin/lava-test-runner /lava-1/0
2023-05-22T21:03:52 [?2004lmkdir: cannot create directory ‘/lava-1/0/results’: Read-only file system
2023-05-22T21:03:53 mv: cannot move '/lava-1/0/lava-test-runner.conf' to '/lava-1/0/lava-test-runner.conf-1684789015': Read-only file system
2023-05-22T21:03:54 cat: /lava-1/0/lava-test-runner.conf-1684789015: No such file or directory
2023-05-22T21:03:55 ok: lava_test_shell seems to have completed
2023-05-22T21:03:55 end: 3.1 lava-test-shell (duration 00:00:12) [common]
2023-05-22T21:03:55 end: 3 lava-test-retry (duration 00:00:12) [common]
2023-05-22T21:03:55 start: 4 finalize (timeout 00:10:00) [common]
2023-05-22T21:03:55 start: 4.1 power-off (timeout 00:01:00) [common]
2023-05-22T21:03:55 end: 4.1 power-off (duration 00:00:00) [common]
2023-05-22T21:03:55 start: 4.2 read-feedback (timeout 00:10:00) [common]
Attempting to use podman as the runtime as your command said had various problems:
2023-05-22T21:07:01 start: 2.1.1 check-fvp-version (timeout 01:00:00) [common]
2023-05-22T21:07:01 sh -c docker run --rm fvp:aemva-11.21.15 /opt/model/FVP_AEMvA/models/Linux64_GCC-9.3/FVP_Base_RevC-2xAEMvA --version
2023-05-22T21:07:01 Parsed command exited 1.
2023-05-22T21:07:01 action: check-fvp-version
  command: ['sh', '-c', 'docker run --rm fvp:aemva-11.21.15 /opt/model/FVP_AEMvA/models/Linux64_GCC-9.3/FVP_Base_RevC-2xAEMvA --version']
  message: Command '['sh', '-c', 'docker run --rm fvp:aemva-11.21.15 /opt/model/FVP_AEMvA/models/Linux64_GCC-9.3/FVP_Base_RevC-2xAEMvA --version']' returned non-zero exit status 1.
  output: Missing runtime '/usr/bin/podman'
  return code: 1
(I do have podman installed though I rarely use it, this looks to be in the LAVA container though)
Test log links:
https://qa-reports.linaro.org/lkft/linux-stable-rc-linux-6.1.y/build/v6.1.28...
https://qa-reports.linaro.org/lkft/linux-stable-rc-linux-6.1.y/build/v6.1.28...
https://qa-reports.linaro.org/lkft/linux-stable-rc-linux-6.1.y/build/v6.1.28...
https://qa-reports.linaro.org/lkft/linux-stable-rc-linux-6.3.y/build/v6.3.2-...
https://qa-reports.linaro.org/lkft/linux-stable-rc-linux-6.3.y/build/v6.3.2-...
https://qa-reports.linaro.org/lkft/linux-stable-rc-linux-6.3.y/build/v6.3.2-...
None of these seem to provide me with information like what kernel config was used but I did manage to find
https://storage.tuxsuite.com/public/linaro/lkft/builds/2Pq5NvLiBcWRMuy6lXftD...
which might be it? Or one of them? However even trying to use that I'm unable to reproduce issues with either the FVP or qemu.
You got the right config file which we are using for testing FVP selftests.
Since it is intermittent, it is not easy to reproduce reliably. You are right that you may have to try running the full subset:
./run_kselftest.sh -c arm64
- Naresh
On Thu, May 25, 2023 at 07:21:24PM +0530, Naresh Kamboju wrote:
On Tue, 23 May 2023 at 03:42, Mark Brown <broonie@kernel.org> wrote:
On Tue, May 16, 2023 at 11:58:40AM +0530, Naresh Kamboju wrote:
Test log links:
https://qa-reports.linaro.org/lkft/linux-stable-rc-linux-6.1.y/build/v6.1.28...
https://qa-reports.linaro.org/lkft/linux-stable-rc-linux-6.1.y/build/v6.1.28...
https://qa-reports.linaro.org/lkft/linux-stable-rc-linux-6.1.y/build/v6.1.28...
https://qa-reports.linaro.org/lkft/linux-stable-rc-linux-6.3.y/build/v6.3.2-...
https://qa-reports.linaro.org/lkft/linux-stable-rc-linux-6.3.y/build/v6.3.2-...
https://qa-reports.linaro.org/lkft/linux-stable-rc-linux-6.3.y/build/v6.3.2-...
None of these seem to provide me with information like what kernel config was used but I did manage to find
https://storage.tuxsuite.com/public/linaro/lkft/builds/2Pq5NvLiBcWRMuy6lXftD...
which might be it? Or one of them? However even trying to use that I'm unable to reproduce issues with either the FVP or qemu.
You got the right config file which we are using for testing FVP selftests.
Sadly, the config link above no longer works (404 file not found).
However, I notice that the failure still seems to occur with 6.4:
https://qa-reports.linaro.org/lkft/linux-mainline-master/build/v6.4/testrun/...
Please can you point me at the config file for that run? I can't figure out which one I need from the web interface.
Will
Hello!
On Mon, 26 Jun 2023 at 08:42, Will Deacon <will@kernel.org> wrote:
On Thu, May 25, 2023 at 07:21:24PM +0530, Naresh Kamboju wrote:
On Tue, 23 May 2023 at 03:42, Mark Brown <broonie@kernel.org> wrote:
On Tue, May 16, 2023 at 11:58:40AM +0530, Naresh Kamboju wrote:
Test log links:
https://qa-reports.linaro.org/lkft/linux-stable-rc-linux-6.1.y/build/v6.1.28...
https://qa-reports.linaro.org/lkft/linux-stable-rc-linux-6.1.y/build/v6.1.28...
https://qa-reports.linaro.org/lkft/linux-stable-rc-linux-6.1.y/build/v6.1.28...
https://qa-reports.linaro.org/lkft/linux-stable-rc-linux-6.3.y/build/v6.3.2-...
https://qa-reports.linaro.org/lkft/linux-stable-rc-linux-6.3.y/build/v6.3.2-...
https://qa-reports.linaro.org/lkft/linux-stable-rc-linux-6.3.y/build/v6.3.2-...
None of these seem to provide me with information like what kernel config was used but I did manage to find
https://storage.tuxsuite.com/public/linaro/lkft/builds/2Pq5NvLiBcWRMuy6lXftD...
which might be it? Or one of them? However even trying to use that I'm unable to reproduce issues with either the FVP or qemu.
You got the right config file which we are using for testing FVP selftests.
Sadly, the config link above no longer works (404 file not found).
However, I notice that the failure still seems to occur with 6.4:
https://qa-reports.linaro.org/lkft/linux-mainline-master/build/v6.4/testrun/...
Please can you point me at the config file for that run? I can't figure out which one I need from the web interface.
Kernel artifacts for that test run can be found here: https://storage.tuxsuite.com/public/linaro/lkft/builds/2Ricfkwzy9jwZnHXNOety...
(The labyrinth can be traversed this way: I went from that testrun link, then test details [clicking on "check-kernel-kfence — FAIL"], then "job_url", then "build", then "tuxbuild directory".)
Greetings!
Daniel Díaz <daniel.diaz@linaro.org>