On 03-Nov-22 6:03 PM, Peter Zijlstra wrote:
On Thu, Nov 03, 2022 at 05:15:30PM +0530, Ravi Bangoria wrote:
Sorry, I was distracted a bit. So, this seems to be happening because of a race between amd_pmu_enable_all() and the perf event NMI. Something like:
  amd_pmu_enable_all()
  {
        if (!test_bit(idx, cpuc->active_mask))
        --->/* perf NMI entry */
            ...
            x86_pmu_stop()
            {
                __clear_bit(hwc->idx, cpuc->active_mask);
                cpuc->events[hwc->idx] = NULL;
            }
            ...
        <---/* perf NMI exit */
        amd_pmu_enable_event(cpuc->events[idx]);
  }
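
To make the suspected interleaving above concrete, here is a minimal user-space model of it. This is only a sketch, not kernel code: struct fake_cpuc, struct fake_event, fake_pmu_stop(), count_stale_slots() and run_scenario() are hypothetical names invented for illustration, and the "NMI" is just a callback invoked between the bit test and the pointer use. It assumes the loop skips counters whose bit is clear, and it counts the slots where the bit was seen set but events[idx] had already become NULL.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NUM_COUNTERS 4

/* Hypothetical stand-ins for the kernel structures; not the real definitions. */
struct fake_event { int idx; };

struct fake_cpuc {
    unsigned long active_mask;               /* bit idx set => counter idx active */
    struct fake_event *events[NUM_COUNTERS]; /* event that owns each counter */
};

/* Models the x86_pmu_stop() step from the sketch: clear the bit, drop the pointer. */
static void fake_pmu_stop(struct fake_cpuc *cpuc, int idx)
{
    cpuc->active_mask &= ~(1UL << idx);
    cpuc->events[idx] = NULL;
}

/* Callback type standing in for "an NMI arrives at this point". */
typedef void (*nmi_hook)(struct fake_cpuc *cpuc, int idx);

static void nmi_stops_event(struct fake_cpuc *cpuc, int idx)
{
    fake_pmu_stop(cpuc, idx);
}

/*
 * Walks the counters in the order of the sketch: test the bit, let the
 * simulated NMI run, then look at events[idx]. Returns how many counters
 * had their bit set at test time but a NULL pointer at use time.
 */
static int count_stale_slots(struct fake_cpuc *cpuc, nmi_hook nmi, int nmi_idx)
{
    int stale = 0;

    assert(nmi_idx >= 0 && nmi_idx < NUM_COUNTERS);

    for (int idx = 0; idx < NUM_COUNTERS; idx++) {
        if (!(cpuc->active_mask & (1UL << idx)))
            continue;
        if (nmi && idx == nmi_idx)
            nmi(cpuc, idx);   /* simulated NMI between check and use */
        if (cpuc->events[idx] == NULL)
            stale++;          /* an enable call here would get a NULL event */
    }
    return stale;
}

/* Sets up counters 0 and 2 as active and optionally injects the NMI at 2. */
int run_scenario(bool inject_nmi)
{
    static struct fake_event ev0 = { 0 }, ev2 = { 2 };
    struct fake_cpuc cpuc = { 0 };

    cpuc.active_mask = (1UL << 0) | (1UL << 2);
    cpuc.events[0] = &ev0;
    cpuc.events[2] = &ev2;

    return count_stale_slots(&cpuc, inject_nmi ? nmi_stops_event : NULL, 2);
}
```

In this model run_scenario(false) finds no stale slot, while run_scenario(true) finds exactly one. It only shows what the interleaving would look like if it can happen; it says nothing about whether the NMI can really arrive at that point.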
Hmm, do you have more information?
I've extracted the function graph logs from the crash dump and uploaded them here: https://github.com/BangoriaRavi/function_graph/blob/main/trace.function_grap...
The crash was on CPU1.
Something like that would require calling amd_pmu_enable_all() while it is already active -- and that seems suspect at first glance.
That is, you shouldn't be getting an NMI for @idx before amd_pmu_enable_event().
I too was wondering about this. Will try to get some more data tomorrow.
Thanks, Ravi