Does anyone have the results of a build that they can share (vmlinux, vmlinuz/bzImage, System.map, .config)? That, plus a corresponding serial log with an oops, would be helpful.
I tried just adding MCORE2=y to my normal config but it didn't reproduce this.
If you can't send the entire build like that, just running scripts/faddr2line on __schedule+0x37f/0x7b0 would be very enlightening.
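For anyone unfamiliar with the frame notation: __schedule+0x37f/0x7b0 means "0x37f bytes into __schedule, a function that is 0x7b0 bytes long". The invocation is `scripts/faddr2line vmlinux __schedule+0x37f/0x7b0`, run against the exact vmlinux that produced the oops. The Python below is only a sketch of the first step such a lookup needs: turning the frame into an absolute address using System.map text. The real script reads symbols from vmlinux itself and hands the result to addr2line. The function names and the sample System.map address in the usage are hypothetical.

```python
import re

# Matches the "symbol+0xOFFSET/0xSIZE" form printed in oops call traces.
_FRAME_RE = re.compile(
    r"^(?P<sym>[\w.]+)\+0x(?P<off>[0-9a-fA-F]+)/0x(?P<size>[0-9a-fA-F]+)$"
)


def parse_frame(frame):
    """Split 'func+0xoff/0xsize' into (func, offset, size) as integers."""
    m = _FRAME_RE.match(frame.strip())
    if m is None:
        raise ValueError(f"not a func+off/size frame: {frame!r}")
    return m.group("sym"), int(m.group("off"), 16), int(m.group("size"), 16)


def frame_address(frame, system_map_text):
    """Return the absolute address a frame points at (hypothetical helper).

    System.map lines look like '<hex address> <type> <symbol>'. This sketch
    takes the first matching symbol and ignores duplicate static symbols,
    which the real faddr2line has to disambiguate.
    """
    sym, off, size = parse_frame(frame)
    if off >= size:
        raise ValueError("offset lies outside the function")
    for line in system_map_text.splitlines():
        parts = line.split()
        if len(parts) == 3 and parts[2] == sym:
            return int(parts[0], 16) + off
    raise KeyError(sym)
```

The resulting address is what you would then feed to `addr2line -e vmlinux` to get a file and line number.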
On 12/29/2017 06:41 AM, Alexander Tsoy wrote:
[ 0.775461] NMI backtrace for cpu 0
[ 0.775461] CPU: 0 PID: 114 Comm: modprobe Not tainted 4.15.0-rc5+
...
[ 0.775461] Call Trace:
[ 0.775461] <#DF>
[ 0.775461] ? double_fault+0xc/0x30
[ 0.775461] ? page_fault+0x36/0x60
[ 0.775461] do_double_fault+0xb/0x130
[ 0.775461] </#DF>
[ 0.775461] Code: 78 4c 89 7c 24 08 4c 89 74 24 10 4c 89 6c 24 18 4c 89 64 24 20 48 89 6c 24 28 48 89 5c 24 30 bb 01 00 00 00 b9 01 01 00 c0 0f 32 <85> d2 78 05 0f 01 f8 31 db c3 0f 1f 40 00 66 2e 0f 1f 84 00 00
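In an oops "Code:" line the kernel wraps the byte at the faulting instruction pointer in angle brackets, so everything before the marker is context leading up to it. The kernel's own scripts/decodecode does the full disassembly. The small Python sketch below (the helper name is hypothetical) only splits such a line into raw bytes and reports where the marked instruction starts, which is handy for checking that a pasted dump is intact before disassembling it.

```python
def parse_code_line(code_line):
    """Return (raw bytes, index of the marked byte) from an oops 'Code:' line.

    Anything before 'Code:' (such as the printk timestamp) is ignored.
    """
    text = code_line.split("Code:", 1)[-1]
    data = []
    marker = None
    for tok in text.split():
        if tok.startswith("<") and tok.endswith(">"):
            # This byte is where the faulting instruction begins.
            marker = len(data)
            tok = tok[1:-1]
        data.append(int(tok, 16))  # raises ValueError on garbled hex
    if marker is None:
        raise ValueError("no <..> marker for the faulting instruction")
    return bytes(data), marker
```

On the dump in this report it yields 64 bytes with the marker at index 43, and a token like "2t" would raise an error rather than silently decode to something wrong.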
From the various oopses, it looks like this happens when we take a double fault while trying to go idle. The CPU is probably trying to return from the double fault, but the fault handler didn't do anything useful, so it just keeps faulting. The NMI watchdog can still get an oops out of it, though.
It doesn't appear to be recursing *too* far, because it's not blowing through the stack and triple faulting.
Of the several traces, they all appear to be in paths that might call safe_halt() (including the kvm async page fault code). It makes me wonder if we've been taking double faults there for a long time, but the new trampoline stack somehow ends up being more fragile and can't recover from the double-fault.
Couple more things:
MCORE2 seems to get one oddball compiler flag (-march=core2):
cflags-$(CONFIG_MCORE2) += \
	$(call cc-option,-march=core2,$(call cc-option,-mtune=generic))
It would be interesting to see if replacing the whole "$(call ...)" expression above with:
$(call cc-option,-mtune=generic)
makes the problem go away the same way as changing the .config option.
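To spell out what the two variants select: Kbuild's cc-option uses its first argument if $(CC) accepts that flag and otherwise falls back to its second argument (empty if none). The Python below models only that selection logic. The `supported` set stands in for the compiler's real test compile and is an assumption of this sketch, as are the function names.

```python
def cc_option(supported, option, fallback=""):
    """Model of cc-option: `option` if the compiler accepts it, else `fallback`."""
    return option if option in supported else fallback


def mcore2_cflags(supported):
    # Current rule: -march=core2, falling back to -mtune=generic.
    return cc_option(supported, "-march=core2", cc_option(supported, "-mtune=generic"))


def mcore2_cflags_experiment(supported):
    # Proposed experiment: keep only the -mtune=generic part.
    return cc_option(supported, "-mtune=generic")
```

With any reasonably modern gcc both flags are accepted, so the current rule adds -march=core2 while the experiment adds only -mtune=generic. That isolates the -march=core2 code generation from the other effects of the config option.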
The MCORE2 config option also sets CONFIG_X86_P6_NOP, which overrides the normal X86_64 nops, if I'm reading that code correctly. But I think that's much less likely to be the cause since there