On Thu, Nov 29, 2018 at 01:35:17PM +0000, Juergen Gross wrote:
On 29/11/2018 14:26, Kirill A. Shutemov wrote:
On Thu, Nov 29, 2018 at 09:41:25AM +0000, Juergen Gross wrote:
On 29/11/2018 02:22, Hans van Kranenburg wrote:
Hi,
As also seen at: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=914951
Attached there are two serial console output logs. One is starting with Xen 4.11 (from debian unstable) as dom0, and the other one without Xen.
[ 2.085543] BUG: unable to handle kernel paging request at ffff888d9fffc000
[ 2.085610] PGD 200c067 P4D 200c067 PUD 0
[ 2.085674] Oops: 0000 [#1] SMP NOPTI
[ 2.085736] CPU: 1 PID: 1 Comm: swapper/0 Not tainted 4.19.0-trunk-amd64 #1 Debian 4.19.5-1~exp1+pvh1
[ 2.085823] Hardware name: HP ProLiant DL360 G7, BIOS P68 05/21/2018
[ 2.085895] RIP: e030:ptdump_walk_pgd_level_core+0x1fd/0x490
[...]
The offending stable commit is 4074ca7d8a1832921c865d250bbd08f3441b3657 ("x86/mm: Move LDT remap out of KASLR region on 5-level paging"), which is commit d52888aa2753e3063a9d3a0c9f72f94aa9809c15 upstream.
The current upstream kernel boots fine under Xen, so in general the patch should be fine. An upstream kernel built from the above commit (with the then-needed Xen fixup patch 1457d8cf7664f34c4ba534) boots fine, too.
Kirill, are you aware of any prerequisite patch from 4.20 which could be missing in 4.19.5?
I'm not.
Let me look into this.
What makes me suspicious is that the failure happens just after the init memory is released. Maybe there is an access to the .init.data segment or similar? That the native kernel boots could be related to its use of 2M mappings, which are not available in a PV domain.
Sounds like a valid hypothesis.
[ 2.085616] Code: 00 00 00 00 40 00 00 49 83 c5 08 48 01 04 24 4c 3b 6c 24 48 0f 84 83 02 00 00 48 8b 04 24 48 c1 f8 10 48 89 84 24 88 00 00 00 <49> 8b 7d 00 48 f7 c7 9f ff ff ff 0f 85 36 ff ff ff 41 b8 03 00 00
All code
========
   0:   00 00                   add    %al,(%rax)
   2:   00 00                   add    %al,(%rax)
   4:   40 00 00                add    %al,(%rax)
   7:   49 83 c5 08             add    $0x8,%r13
   b:   48 01 04 24             add    %rax,(%rsp)
   f:   4c 3b 6c 24 48          cmp    0x48(%rsp),%r13
  14:   0f 84 83 02 00 00       je     0x29d
  1a:   48 8b 04 24             mov    (%rsp),%rax
  1e:   48 c1 f8 10             sar    $0x10,%rax
  22:   48 89 84 24 88 00 00    mov    %rax,0x88(%rsp)
  29:   00
  2a:*  49 8b 7d 00             mov    0x0(%r13),%rdi           <-- trapping instruction
  2e:   48 f7 c7 9f ff ff ff    test   $0xffffffffffffff9f,%rdi
  35:   0f 85 36 ff ff ff       jne    0xffffffffffffff71
  3b:   41                      rex.B
  3c:   b8                      .byte 0xb8
  3d:   03 00                   add    (%rax),%eax
        ...
Code starting with the faulting instruction
===========================================
   0:   49 8b 7d 00             mov    0x0(%r13),%rdi
   4:   48 f7 c7 9f ff ff ff    test   $0xffffffffffffff9f,%rdi
   b:   0f 85 36 ff ff ff       jne    0xffffffffffffff47
  11:   41                      rex.B
  12:   b8                      .byte 0xb8
  13:   03 00                   add    (%rax),%eax
        ...
Reading from %r13 causes the fault.
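Since the trapping instruction is a plain load from 0x0(%r13), %r13 should hold the reported fault address ffff888d9fffc000. As a rough aid for reading the "PGD ... P4D ... PUD 0" line, here is a minimal Python sketch that splits that address into 4-level page-table indices and mirrors the mask used by the following test instruction. The helper names and ASSUMED_DIRECT_MAP_BASE are illustrative assumptions, not kernel symbols; the base is my reading of the 4-level direct-map start after the LDT remap change and should be checked against the actual tree.

```python
# Sketch only: helper names are hypothetical, not taken from the kernel.

FAULT_ADDR = 0xFFFF888D9FFFC000  # from the "unable to handle kernel paging request" line

# Assumption: start of the direct map for 4-level paging after the LDT remap move.
ASSUMED_DIRECT_MAP_BASE = 0xFFFF888000000000

# Mask from `test $0xffffffffffffff9f,%rdi`: every bit except 0x60,
# i.e. except the x86 Accessed (bit 5) and Dirty (bit 6) bits.
NONE_CHECK_MASK = 0xFFFFFFFFFFFFFF9F


def split_4level(vaddr):
    """Return (pgd, pud, pmd, pte, offset) indices for x86-64 4-level paging."""
    return (
        (vaddr >> 39) & 0x1FF,  # bits 47..39
        (vaddr >> 30) & 0x1FF,  # bits 38..30
        (vaddr >> 21) & 0x1FF,  # bits 29..21
        (vaddr >> 12) & 0x1FF,  # bits 20..12
        vaddr & 0xFFF,          # byte offset within the 4K page
    )


def is_none_ignoring_ad(entry):
    """True if no bit outside Accessed|Dirty is set, as the test/jne pair checks."""
    return (entry & NONE_CHECK_MASK) == 0


if __name__ == "__main__":
    pgd, pud, pmd, pte, off = split_4level(FAULT_ADDR)
    print(f"pgd={pgd} pud={pud} pmd={pmd} pte={pte} offset={off:#x}")
    print(f"offset from assumed direct-map base: {FAULT_ADDR - ASSUMED_DIRECT_MAP_BASE:#x}")
```

Under that assumption the fault address would sit in the direct map, and the "PUD 0" in the oops would mean the PUD slot covering it is simply not populated.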
I don't have a setup to reproduce the issue myself and have a hard time correlating the code with the source.
What is ptdump_walk_pgd_level_core+0x1fd/0x490 for you?