On Mon, Jul 15, 2024 at 9:32 AM Puranjay Mohan <puranjay@kernel.org> wrote:
Hi Daniel, Manu,

I was able to reproduce this issue on KVM and found the root cause of this hang! The other issue that we fixed is unrelated to this hang and doesn't occur on self-hosted GitHub runners, as they use 48-bit VAs.
The userspace test code has:
	#define STACK_SIZE (1024 * 1024)
	static char child_stack[STACK_SIZE];

	cpid = clone(do_sleep, child_stack + STACK_SIZE, CLONE_FILES | SIGCHLD, fexit_skel);
arm64 requires the stack pointer to be 16-byte aligned; otherwise an SPAlignmentFault occurs, which shows up as a "Bus error" in userspace.
The stack provided to the clone system call is not guaranteed to be aligned properly in this selftest.
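To make the alignment point concrete, here is a minimal sketch of how one could check whether the top of a static char array is a valid arm64 stack pointer. The helper names is_sp_aligned() and child_stack_top_is_aligned() are hypothetical and not part of the selftest; the array only mirrors the declaration quoted above.

```c
#include <stdbool.h>
#include <stdint.h>

#define STACK_SIZE (1024 * 1024)
#define SP_ALIGN 16 /* arm64 stack pointer alignment, as stated above */

/* A plain char array only has 1-byte alignment guaranteed by its type. */
static char child_stack[STACK_SIZE];

/* Hypothetical helper: true if sp is a multiple of SP_ALIGN. */
static bool is_sp_aligned(const void *sp)
{
	return ((uintptr_t)sp % SP_ALIGN) == 0;
}

/*
 * Hypothetical helper: STACK_SIZE is a multiple of 16, so the top of the
 * array is aligned only if the linker happened to place its base on a
 * 16-byte boundary.
 */
static bool child_stack_top_is_aligned(void)
{
	return is_sp_aligned(child_stack + STACK_SIZE);
}
```

Whether child_stack_top_is_aligned() returns true depends on where the linker places the array, which is why the failure can show up on one build or runner and not on another.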
The test hangs on the following line:

	while (READ_ONCE(fexit_skel->bss->fentry_cnt) != 2);
Because the child process is killed due to SPAlignmentFault, the fentry_cnt remains at 0!
Going by the man page of the clone system call, the correct way to allocate the stack for this call is with mmap, like this:

	stack = mmap(NULL, STACK_SIZE, PROT_READ | PROT_WRITE,
		     MAP_PRIVATE | MAP_ANONYMOUS | MAP_STACK, -1, 0);
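For reference, a minimal standalone sketch of the mmap-based approach, outside the selftest. The names child_fn() and run_child_on_mmap_stack() are hypothetical; child_fn() merely stands in for do_sleep(), and the skeleton argument is replaced with NULL. This is an illustration under those assumptions, not the patch itself.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE /* needed for clone() in <sched.h> */
#endif
#include <sched.h>
#include <signal.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>

#define STACK_SIZE (1024 * 1024)

/* Hypothetical child body; stands in for the selftest's do_sleep(). */
static int child_fn(void *arg)
{
	(void)arg;
	return 0;
}

/*
 * Hypothetical helper: clone a child on an mmap()ed stack and return
 * the child's exit status, or -1 on any failure.
 */
static int run_child_on_mmap_stack(void)
{
	char *stack;
	int status, ret = -1;
	pid_t cpid;

	stack = mmap(NULL, STACK_SIZE, PROT_READ | PROT_WRITE,
		     MAP_PRIVATE | MAP_ANONYMOUS | MAP_STACK, -1, 0);
	if (stack == MAP_FAILED)
		return -1;

	/*
	 * mmap() returns page-aligned memory and STACK_SIZE is a multiple
	 * of 16, so stack + STACK_SIZE is 16-byte aligned.
	 */
	cpid = clone(child_fn, stack + STACK_SIZE, CLONE_FILES | SIGCHLD, NULL);
	if (cpid > 0 && waitpid(cpid, &status, 0) == cpid && WIFEXITED(status))
		ret = WEXITSTATUS(status);

	munmap(stack, STACK_SIZE);
	return ret;
}
```

Calling run_child_on_mmap_stack() from main() should return the child's exit status. The stack grows downward on arm64, so the pointer passed to clone() is the top of the mapping, not its base.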
This fixes the issue. I will send a patch to use this and once again remove this test from the DENYLIST, and I hope this time it fixes it for good.
Wow. Great find. Good to know. prog_tests/ns_current_pid_tgid.c has the same issue probably.