On Fri, Jan 12, 2018 at 1:51 PM, Thomas Gleixner tglx@linutronix.de wrote:
On Fri, 12 Jan 2018, Laura Abbott wrote:
Cc+ Andy
I'm almost crashed out by now. Andy might have an idea. I'll look again tomorrow with brain awake.
On 01/12/2018 10:51 AM, Thomas Gleixner wrote:
On Fri, 12 Jan 2018, Laura Abbott wrote:
Fedora got a bug report on 4.14.11 of a panic when booting a Fedora guest in a CentOS 6 VM, not reproducible with nopti. The issue is still present as of 4.14.13 as well. The only report is a panic screenshot https://bugzilla.redhat.com/show_bug.cgi?id=1532458
I've lost track of all the fixes that have been flying around, is this a new issue or has a fix not yet made it to stable?
Hmm. Looks kinda familiar, but that has been fixed I think even before 4.4.11. Could you please ask the reported to provide a full console log via the VM "serial console" ?
Thanks,
tglx
[ 2.910025] PANIC: double fault, error_code: 0x0
Probably a stack overflow
[ 2.910025] CPU: 1 PID: 56 Comm: modprobe Not tainted 4.14.13-300.fc27.x86_64 #1 [ 2.910025] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2007 [ 2.910025] task: ffff891f78dc3c00 task.stack: ffffa6eac0594000 [ 2.910025] RIP: 0010:vprintk_default+0x5/0x30 [ 2.910025] RSP: 0000:fffffe000002e000 EFLAGS: 00010046 [ 2.910025] RAX: 0000000000000000 RBX: fffffe000002e118 RCX: 0000000000000001 [ 2.910025] RDX: 0000000000000000 RSI: fffffe000002e018 RDI: ffffffffbe0715a0 [ 2.910025] RBP: fffffe000002e008 R08: ffffffffbe0bb565 R09: ffffffffbe07159b [ 2.910025] R10: fffffe000002e080 R11: 0000000000000000 R12: ffffffffbe070fdd [ 2.910025] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000 [ 2.910025] FS: 0000000000000000(0000) GS:ffff891f7fd00000(0000) knlGS:0000000000000000 [ 2.910025] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 2.910025] CR2: fffffe000002dff8 CR3: 0000000078da6000 CR4: 00000000000006e0
CR2 and RSP are consistent with stack overflow, but that's not terribly important, since...
[ 2.910025] Call Trace: [ 2.910025] <ENTRY_TRAMPOLINE> [ 2.910025] ? vprintk_func+0x27/0x60 [ 2.910025] printk+0x52/0x6e [ 2.910025] __die+0x6b/0xe0 [ 2.910025] die+0x2f/0x50 [ 2.910025] do_general_protection+0x149/0x160
This is the real problem here.
[ 2.910025] general_protection+0x2c/0x60 [ 2.910025] RIP: 0010:swapgs_restore_regs_and_return_to_usermode+0x6f/0x80 [ 2.910025] RSP: 0000:fffffe000002e1c8 EFLAGS: 00000006 [ 2.910025] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000 [ 2.910025] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000078da7800 [ 2.910025] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000 [ 2.910025] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000 [ 2.910025] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000 [ 2.910025] </ENTRY_TRAMPOLINE>
ffffffff81a00b3e: 65 48 0f b3 3c 25 8e btr %rdi,%gs:0x1e168e ffffffff81a00b45: 16 1e 00 ffffffff81a00b48: 48 89 c7 mov %rax,%rdi ffffffff81a00b4b: eb 08 jmp 0xffffffff81a00b55 ffffffff81a00b4d: 48 89 c7 mov %rax,%rdi ffffffff81a00b50: 48 0f ba ef 3f bts $0x3f,%rdi ffffffff81a00b55: 48 81 cf 00 18 00 00 or $0x1800,%rdi ffffffff81a00b5c: 0f 22 df mov %rdi,%cr3 <-- #GP here ffffffff81a00b5f: 58 pop %rax ffffffff81a00b60: 5f pop %rdi
The CPU didn't like writing 0x0000000078da7800 to %cr3.
I'm guessing that CR4.PCIDE=0.
Now this is quite a strange value to write to CR3. The 0x800 part means that we're using the "user" variant of the address space that would have ASID=0 and the 0x1000 bit being set corresponds to the user pgdir, but this is nonsense, since the kernel never uses PCID 0 for user mode. We always start at 1. The only exception is if X86_FEATURE_PCID is off. But, if X86_FEATURE_PCID is off, then we shouldn't be setting any PCID bits.
In fact, it looks like this code is totally bogus and has never been correct at all. Even in:
commit 4b1d5ae3b103eda43f9d0f85c355bb6995b03a30 Author: Peter Zijlstra peterz@infradead.org Date: Mon Dec 4 15:07:59 2017 +0100
x86/mm: Use/Fix PCID to optimize user/kernel switches
We have:
.macro SWITCH_TO_USER_CR3_NOSTACK scratch_reg:req scratch_reg2:req ALTERNATIVE "jmp .Lend_@", "", X86_FEATURE_PTI mov %cr3, \scratch_reg
ALTERNATIVE "jmp .Lwrcr3_@", "", X86_FEATURE_PCID
...
.Lwrcr3_@: /* Flip the PGD and ASID to the user version */ orq $(PTI_SWITCH_MASK), \scratch_reg mov \scratch_reg, %cr3 .Lend_@:
That's bogus. PTI_SWITCH_MASK is 0x1800, which has PCID = 0x800.
This should probably use an alternative to select between 0x1000 and 0x800 depending on X86_FEATURE_PCID or just use an entirely different label for the !PCID case.
FWIW, this bit in SAVE_AND_SWITCH_TO_KERNEL_CR3
testq $(PTI_SWITCH_MASK), \scratch_reg jz .Ldone_@
is a bit silly, too. It's *correct* (I think), but shouldn't that just be bt $(PTI_SWITCH_PGTABLES_BIT), \scratch_reg, with the obvious caveat that the headers don't actually define PTI_SWITCH_PGTABLES_BIT?
Was this code *ever* tested with nopcid?