(reply out of order)
Naresh Kamboju wrote on Thu, Dec 01, 2022 at 01:13:25PM +0530:
(You might need to build with at least CONFIG_DEBUG_INFO_REDUCED (or not reduced), but that is on by default for aarch64)
Thanks for the suggestions. The Kconfig is enabled now. CONFIG_DEBUG_INFO_REDUCED=y
It looks enabled in your the config file you linked at, I don't understand this remark? Did you produce the trace the other day without it and rebuild the kernel with it? In this case you also have CONFIG_DEBUG_INFO_SPLIT set, so the vmlinux file does not contain enough informations to retrieve line numbers or types, and in particular addr2line cannot be used on the files you provided. I've never used split debug infos before, but digging old threads I'm not too hopeful unless that changed: https://lkml.iu.edu/hypermail/linux/kernel/1711.1/03393.html https://sourceware.org/bugzilla/show_bug.cgi?id=22434
(...a test build later, it's still mostly useless... normal build $ ./scripts/faddr2line vmlinux __schedule+0x314 __schedule+0x314/0x6c0: perf_fetch_caller_regs at include/linux/perf_event.h:1286 (inlined by) __perf_sw_event_sched at include/linux/perf_event.h:1307 (inlined by) perf_event_task_sched_out at include/linux/perf_event.h:1347 (inlined by) prepare_task_switch at kernel/sched/core.c:5053 (inlined by) context_switch at kernel/sched/core.c:5195 (inlined by) __schedule at kernel/sched/core.c:6561
split dwarf build $ ./scripts/faddr2line vmlinux __schedule+0x314 aarch64-linux-gnu-addr2line: DWARF error: could not find abbrev number 860923 __schedule+0x314/0x780: aarch64-linux-gnu-addr2line: DWARF error: could not find abbrev number 860923 __schedule at core.c:?
I'd tend to agree build time/space savings aren't worth the developer time. )
Anyway, address sanitizer used to have a kasan_symbolize.py script but it looks like it got removed as no longer maintained, and I'm not sure what's a good tool to just run these logs through nowadays, might want to ask other test projects folks what they use...
If you still have the vmlinux binary from that build (or if you can rebuild with the same options), running this text through addr2line should not take you too long.
Please find build artifacts in this link,
- config
- vmlinux
- System.map
https://people.linaro.org/~anders.roxell/next-20221130-allmodconfig-arm64-tu...
So from the disassembly...
- p9_client_cb+0x84 is right before the wake_up and after the wmb(), so I assume we're on writing req->status line 441:
--- p9_client_cb(...) { ... smp_wmb(); req->status = status;
wake_up(&req->wq); ---
report is about a write from 2 to 3, this makes sense we're going from REQ_STATUS_SENT (2) to REQ_STATUS_RCVD (3).
- p9_client_rpc+0x1d0 isn't as simple to pin down as I'm having a hard time making sense of the kcsan instrumentations... The report is talking about a READ of 4 bytes at the same address, so I'd expect to see an ccess to req->status (and we're likely spot on wait_event_killable which checks req->status), but this doesn't seem to match up with the assembly: here's the excerpt from disass around 0x1d0 = 464 (why doesn't gdb provide hex offsets..) --- 0xffff80000a46e9b8 <+440>: cmn w28, #0x200 0xffff80000a46e9bc <+444>: ccmn w28, #0xe, #0x4, ne // ne = any 0xffff80000a46e9c0 <+448>: b.eq 0xffff80000a46ecfc <p9_client_rpc+1276> // b.none 0xffff80000a46e9c4 <+452>: mov x0, x25 0xffff80000a46e9c8 <+456>: bl 0xffff800008543640 <__tsan_write4> 0xffff80000a46e9cc <+460>: mov w0, #0x2 // #2 0xffff80000a46e9d0 <+464>: str w0, [x21, #88] 0xffff80000a46e9d4 <+468>: b 0xffff80000a46ecfc <p9_client_rpc+1276> 0xffff80000a46e9d8 <+472>: mov w27, #0x1 // #1 0xffff80000a46e9dc <+476>: mov x0, x23 0xffff80000a46e9e0 <+480>: mov w1, #0x2bc // #700 0xffff80000a46e9e4 <+484>: bl 0xffff800008192d80 <__might_sleep> ---
+464 is a write to x21 (client 'c', from looking at how it is passed into x0 for other function calls) at offset 88 (status field according to dwarf infos from a rebuild with your config/same sources)
So, err, I'm a bit lost on this side. But I can't really find a problem with what KCSAN complains about -- we are indeed accessing status from two threads without any locks. Instead of a lock, we're using a barrier so that: - recv thread/cb: writes to req stuff || write to req status - p9_client_rpc: reads req status || reads other fields from req
Which has been working well enough (at least, without the barrier things blow up quite fast).
So can I'll just consider this a false positive, but if someone knows how much one can read into this that'd be appreciated.
Thanks, -- Dominique