Hi Daniel, Manu,
I was able to reproduce this issue on KVM and found the root cause of this hang! The other issue that we fixed is unrelated to this hang and doesn't occur on the self-hosted GitHub runners, as they use 48-bit VAs.
The userspace test code has:
#define STACK_SIZE (1024 * 1024)
static char child_stack[STACK_SIZE];
cpid = clone(do_sleep, child_stack + STACK_SIZE, CLONE_FILES | SIGCHLD, fexit_skel);
arm64 requires the stack pointer to be 16-byte aligned; otherwise an SPAlignmentFault occurs, which appears in userspace as a Bus error.
In this selftest, the stack passed to the clone system call is a static char array, which only has to be 1-byte aligned, so child_stack + STACK_SIZE is not guaranteed to be 16-byte aligned.
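To make the alignment requirement concrete, here is a minimal sketch of how a candidate stack top could be checked or rounded down to 16 bytes. The helpers sp_is_aligned() and sp_align_down() and the SP_ALIGN macro are hypothetical names used only for illustration; they are not part of the selftest, and this is not the fix proposed in this mail.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SP_ALIGN 16 /* arm64 stack pointer alignment discussed above */

/* Hypothetical helper: true if ptr is a multiple of SP_ALIGN. */
static bool sp_is_aligned(const void *ptr)
{
	return ((uintptr_t)ptr % SP_ALIGN) == 0;
}

/* Hypothetical helper: round a candidate stack top down to SP_ALIGN. */
static void *sp_align_down(void *ptr)
{
	void *aligned = (void *)((uintptr_t)ptr & ~(uintptr_t)(SP_ALIGN - 1));

	assert(sp_is_aligned(aligned)); /* sanity check on the mask arithmetic */
	return aligned;
}
```

With something like this, sp_is_aligned(child_stack + STACK_SIZE) would tell whether a given build happens to place the array suitably. Since STACK_SIZE is a multiple of 16, the result depends only on where the linker puts child_stack.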
The test hangs on the following line:
while (READ_ONCE(fexit_skel->bss->fentry_cnt) != 2);
Because the child process is killed by the SPAlignmentFault, fentry_cnt remains at 0 and the loop never exits!
According to the man page of the clone system call, the correct way to allocate the stack for this call is with mmap, like this:
stack = mmap(NULL, STACK_SIZE, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS | MAP_STACK, -1, 0);
This fixes the issue. I will send a patch that uses this and once again removes this test from the DENYLIST, and I hope this time it fixes it for good.
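For reference, a minimal standalone sketch of the mmap-based approach could look like the following. It is not the actual patch: child_fn() is a hypothetical stand-in for the selftest's do_sleep(), run_child_on_mmap_stack() is an illustrative name, and NULL is passed where the test passes fexit_skel. It relies on mmap returning page-aligned memory and on STACK_SIZE being a multiple of the page size, so the stack top is 16-byte aligned.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>
#include <signal.h>
#include <stdint.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>

#define STACK_SIZE (1024 * 1024)

/* Hypothetical stand-in for the selftest's do_sleep() child function. */
static int child_fn(void *arg)
{
	(void)arg;
	return 0;
}

/*
 * Run child_fn in a clone()d child on an mmap()ed stack.
 * Returns the child's exit status, or -1 on any failure
 * (including the child being killed by a signal).
 */
static int run_child_on_mmap_stack(void)
{
	int status, ret;
	pid_t cpid;
	char *stack;

	stack = mmap(NULL, STACK_SIZE, PROT_READ | PROT_WRITE,
		     MAP_PRIVATE | MAP_ANONYMOUS | MAP_STACK, -1, 0);
	if (stack == MAP_FAILED)
		return -1;

	/* mmap() memory is page aligned, so the top is 16-byte aligned. */
	assert(((uintptr_t)(stack + STACK_SIZE) % 16) == 0);

	/* The stack grows down, so pass the top of the mapping. */
	cpid = clone(child_fn, stack + STACK_SIZE, CLONE_FILES | SIGCHLD, NULL);
	if (cpid == -1) {
		munmap(stack, STACK_SIZE);
		return -1;
	}

	ret = waitpid(cpid, &status, 0);
	munmap(stack, STACK_SIZE);
	if (ret == -1 || !WIFEXITED(status))
		return -1;
	return WEXITSTATUS(status);
}
```

A caller would treat a return value of 0 as the child having run to completion. A child killed by a signal (as with the Bus error above) shows up as -1 here instead of leaving the parent waiting forever.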
It looks like there is still an issue left. A recent CI run on bpf-next is still hitting the same issue on arm64:
Base:
https://github.com/kernel-patches/bpf/commits/series/870746%3D%3Ebpf-next/
CI:
https://github.com/kernel-patches/bpf/actions/runs/9905842936/job/2736643543...
[...]
#89/11 fexit_bpf2bpf/func_replace_global_func:OK
#89/12 fexit_bpf2bpf/fentry_to_cgroup_bpf:OK
#89/13 fexit_bpf2bpf/func_replace_progmap:OK
#89 fexit_bpf2bpf:OK
Error: The operation was canceled.
Thanks, Puranjay