On Thu, Jul 14, 2022, Paolo Bonzini wrote:
On 7/14/22 10:06, Gavin Shan wrote:
In rseq_test, two threads are created: 'main' and 'migration_thread'. We also assume that the non-migration status of the 'migration-worker' thread guarantees the same non-migration status for the 'main' thread. Unfortunately, that assumption doesn't hold: the 'main' thread can be migrated from one CPU to another between the calls to sched_getcpu() and READ_ONCE(__rseq.cpu_id). The following assert is eventually raised because of the mismatched CPU numbers.
The issue can occasionally be reproduced on arm64 systems.
Hmm, this does not seem to be a correct patch - the threads are already synchronizing using seq_cnt, like this:
    migration                   main
    ----------------------      --------------------------------
    seq_cnt = 1
    smp_wmb()
                                snapshot = 0
                                smp_rmb()
                                cpu = sched_getcpu() reads 23
    sched_setaffinity()
                                rseq_cpu = __rseq.cpuid reads 35
                                smp_rmb()
                                snapshot != seq_cnt -> retry
    smp_wmb()
    seq_cnt = 2
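For illustration only, the interleaving above can be replayed in a single thread to show why the reader throws away the mismatched 23/35 pair instead of asserting on it. This is a rough sketch, not the code from rseq_test: the helper names are hypothetical, the barriers are left out because nothing runs concurrently here, and masking off the low bit of the counter is an assumption inferred from "snapshot = 0" being taken while seq_cnt is 1.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: round the counter down to an even value, so a
 * snapshot taken while a migration is in flight (odd counter) can never
 * match the counter afterwards. */
static uint32_t seq_snapshot(uint32_t seq_cnt)
{
	return seq_cnt & ~1u;
}

/* Hypothetical helper: the two CPU reads are trusted only if the counter
 * still equals the even snapshot taken before them. */
static bool seq_read_is_stable(uint32_t snapshot, uint32_t seq_cnt)
{
	return snapshot == seq_cnt;
}

/* Replays the interleaving step by step. Returns true if the mismatch
 * would be reported as a failure, false if the reader would retry. */
static bool replay_reports_failure(void)
{
	uint32_t seq_cnt = 1;                      /* migration: seq_cnt = 1 */
	uint32_t snapshot = seq_snapshot(seq_cnt); /* main: snapshot = 0 */
	assert(snapshot == 0);

	int cpu = 23;      /* main: cpu = sched_getcpu() */
	/* migration: sched_setaffinity() moves 'main' at this point */
	int rseq_cpu = 35; /* main: rseq_cpu read from __rseq */

	/* main: compare the snapshot with the current counter */
	bool stable = seq_read_is_stable(snapshot, seq_cnt);

	seq_cnt = 2;       /* migration: seq_cnt = 2 */
	(void)seq_cnt;

	/* A CPU mismatch only counts when no migration was in flight. */
	return stable && cpu != rseq_cpu;
}
```

In this replay the snapshot (0) differs from the counter (1), so the reads are discarded and the mismatch is never reported, provided the barriers really order the accesses as drawn.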
sched_setaffinity() is guaranteed to block until the task is enqueued on an allowed CPU.
Yes, and retrying could suppress detection of kernel bugs that this test is intended to catch.
Can you check that smp_rmb() and smp_wmb() generate correct instructions on arm64?
That seems like the most likely scenario (or a kernel bug); I distinctly remember the barriers provided by tools/ being rather bizarre.