The x86 mmap() code selects the mmap base for an allocation depending on the bitness of the syscall. For 64bit syscalls it selects mm->mmap_base and for 32bit mm->mmap_compat_base.
exec() calls mmap() which in turn uses in_compat_syscall() to check whether the mapping is for a 32bit or a 64bit task. The decision is made on the following criteria:
  ia32    child->thread.status & TS_COMPAT
   x32    child->pt_regs.orig_ax & __X32_SYSCALL_BIT
  ia64    !ia32 && !x32
__set_personality_x32() was dropping the TS_COMPAT flag, but set_personality_64bit() kept the compat syscall flag, making in_compat_syscall() return true during the first exec() syscall.
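To make the failure mode concrete, here is a small user-space model of the check described above. It is only a sketch: struct task_model and the *_model, *_before and *_after helpers are made up for illustration, the macros are stand-ins for the kernel ones, and their values are assumed to match the x86-64 headers. The real kernel code looks different.

```c
#include <stdbool.h>
#include <stdint.h>

/* Stand-ins for kernel constants; the values are assumptions. */
#define TS_COMPAT        0x0002u      /* 32-bit syscall active */
#define X32_SYSCALL_BIT  0x40000000u  /* stands in for __X32_SYSCALL_BIT */
#define NR_EXECVE_64     59u          /* stands in for the 64-bit __NR_execve */

/* Hypothetical model of the two pieces of task state the checks read. */
struct task_model {
	uint32_t status;   /* models thread.status */
	uint64_t orig_ax;  /* models pt_regs.orig_ax */
};

bool is_ia32_model(const struct task_model *t)
{
	return (t->status & TS_COMPAT) != 0;
}

bool is_x32_model(const struct task_model *t)
{
	return (t->orig_ax & X32_SYSCALL_BIT) != 0;
}

/* Models in_compat_syscall(): true for ia32 or x32, false for a 64-bit task. */
bool in_compat_syscall_model(const struct task_model *t)
{
	return is_ia32_model(t) || is_x32_model(t);
}

/* Behaviour before the patch: orig_ax is rewritten, TS_COMPAT is kept. */
void set_personality_64bit_before(struct task_model *t)
{
	t->orig_ax = NR_EXECVE_64;
}

/* Behaviour with the patch: TS_COMPAT is cleared as well. */
void set_personality_64bit_after(struct task_model *t)
{
	t->orig_ax = NR_EXECVE_64;
	t->status &= ~TS_COMPAT;
}
```

Starting from a task that entered execve through the 32-bit path (status has TS_COMPAT set), the "before" helper leaves in_compat_syscall_model() true, so the mappings made during exec would use the compat base. The "after" helper makes it false.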
This has user-visible effects, mentioned by Alexey:

1) It breaks ASAN

  $ gcc -fsanitize=address wrap.c -o wrap-asan
  $ ./wrap32 ./wrap-asan true
  ==1217==Shadow memory range interleaves with an existing memory mapping. ASan cannot proceed correctly. ABORTING.
  ==1217==ASan shadow was supposed to be located in the [0x00007fff7000-0x10007fff7fff] range.
  ==1217==Process memory map follows:
        0x000000400000-0x000000401000   /home/izbyshev/test/gcc/asan-exec-from-32bit/wrap-asan
        0x000000600000-0x000000601000   /home/izbyshev/test/gcc/asan-exec-from-32bit/wrap-asan
        0x000000601000-0x000000602000   /home/izbyshev/test/gcc/asan-exec-from-32bit/wrap-asan
        0x0000f7dbd000-0x0000f7de2000   /lib64/ld-2.27.so
        0x0000f7fe2000-0x0000f7fe3000   /lib64/ld-2.27.so
        0x0000f7fe3000-0x0000f7fe4000   /lib64/ld-2.27.so
        0x0000f7fe4000-0x0000f7fe5000
        0x7fed9abff000-0x7fed9af54000
        0x7fed9af54000-0x7fed9af6b000   /lib64/libgcc_s.so.1
  [snip]
2) It doesn't seem to be great for security if an attacker always knows that ld.so is going to be mapped into the first 4GB in this case (the same thing happens for PIEs as well).
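The ASan complaint in 1) boils down to an interval overlap. As a rough sketch (ranges_overlap() is a hypothetical helper, not something ASan provides), the check can be written with inclusive bounds, so a map line "start-end" becomes [start, end - 1]:

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Returns true if the inclusive ranges [a_start, a_last] and
 * [b_start, b_last] share at least one address.
 */
bool ranges_overlap(uint64_t a_start, uint64_t a_last,
		    uint64_t b_start, uint64_t b_last)
{
	return a_start <= b_last && b_start <= a_last;
}
```

With the numbers from the report, the shadow range [0x00007fff7000, 0x10007fff7fff] overlaps the ld-2.27.so mapping at 0x0000f7dbd000, but neither the binary at 0x000000400000 nor the mappings at 0x7fed9abff000 and above.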
The testcase:
$ cat wrap.c
#include <unistd.h>

int main(int argc, char *argv[]) {
  execvp(argv[1], &argv[1]);
  return 127;
}

$ gcc wrap.c -o wrap
$ LD_SHOW_AUXV=1 ./wrap ./wrap true |& grep AT_BASE
AT_BASE:         0x7f63b8309000
AT_BASE:         0x7faec143c000
AT_BASE:         0x7fbdb25fa000

$ gcc -m32 wrap.c -o wrap32
$ LD_SHOW_AUXV=1 ./wrap32 ./wrap true |& grep AT_BASE
AT_BASE:         0xf7eff000
AT_BASE:         0xf7cee000
AT_BASE:         0x7f8b9774e000
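Reading the AT_BASE lines by eye works, but the check can also be scripted. The sketch below parses one line of the LD_SHOW_AUXV output and tells whether the interpreter base is below 4GB. Both function names are hypothetical, and the line format is assumed to be "AT_BASE:" followed by whitespace and a hex value, as in the output above.

```c
#include <inttypes.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Parses "AT_BASE: 0x..." into *value; returns false for any other line. */
bool parse_at_base(const char *line, uint64_t *value)
{
	return sscanf(line, "AT_BASE: %" SCNx64, value) == 1;
}

/* True if the address lies in the first 4GB of the address space. */
bool below_4g(uint64_t addr)
{
	return addr < (UINT64_C(1) << 32);
}
```

In the second run above, the first two values (32-bit ld.so loaded for wrap32 and then the 64-bit ld.so loaded for ./wrap) are below 4GB; only the second one is the symptom, because that process is a 64-bit one.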
Fixes: commit 1b028f784e8c ("x86/mm: Introduce mmap_compat_base() for 32-bit mmap()")
commit ada26481dfe6 ("x86/mm: Make in_compat_syscall() work during exec")

Cc: Borislav Petkov bp@suse.de
Cc: Cyrill Gorcunov gorcunov@openvz.org
Cc: Dmitry Safonov 0x7f454c46@gmail.com
Cc: "H. Peter Anvin" hpa@zytor.com
Cc: Ingo Molnar mingo@redhat.com
Cc: "Kirill A. Shutemov" kirill.shutemov@linux.intel.com
Cc: Thomas Gleixner tglx@linutronix.de
Cc: linux-mm@kvack.org
Cc: x86@kernel.org
Cc: stable@vger.kernel.org # v4.12+
Reported-by: Alexey Izbyshev izbyshev@ispras.ru
Bisected-by: Alexander Monakov amonakov@ispras.ru
Investigated-by: Andy Lutomirski luto@kernel.org
Signed-off-by: Dmitry Safonov dima@arista.com
---
 arch/x86/kernel/process_64.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
index 4b100fe0f508..12bb445fb98d 100644
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -542,6 +542,7 @@ void set_personality_64bit(void)
 	clear_thread_flag(TIF_X32);
 	/* Pretend that this comes from a 64bit execve */
 	task_pt_regs(current)->orig_ax = __NR_execve;
+	current_thread_info()->status &= ~TS_COMPAT;
 
 	/* Ensure the corresponding mm is not marked. */
 	if (current->mm)
On Fri, 2018-05-18 at 00:35 +0100, Dmitry Safonov wrote:
I've tested it on master with:
- the reproducer
- x86 selftests
- criu

Some selftests are failing, but the same way as before the patch (ITOW, it's not regression):
[root@localhost self]# grep FAIL out
[FAIL] Reg 1 mismatch: requested 0x0; got 0x3
[FAIL] Reg 15 mismatch: requested 0x8badf00d5aadc0de; got 0xffffff425aadc0de
[FAIL] Reg 15 mismatch: requested 0x8badf00d5aadc0de; got 0xffffff425aadc0de
[FAIL] Reg 15 mismatch: requested 0x8badf00d5aadc0de; got 0xffffff425aadc0de
[FAIL] f[u]comi[p] errors: 1
[FAIL] fisttp errors: 1
[FAIL] R8 has changed:0000000000000000
[FAIL] R9 has changed:0000000000000000
[FAIL] R10 has changed:0000000000000000
[FAIL] R11 has changed:0000000000000000
[FAIL] R8 has changed:0000000000000000
[FAIL] R9 has changed:0000000000000000
[FAIL] R10 has changed:0000000000000000
[FAIL] R11 has changed:0000000000000000

I think, R8-R11 are not preserved yet in master? Not quite sure about register mismatches :-/
Also ia32-criu has a fail, which I need to look into (but not a regression).
On Thu, May 17, 2018 at 4:40 PM Dmitry Safonov dima@arista.com wrote:
Some selftests are failing, but the same way as before the patch (ITOW, it's not regression):
[root@localhost self]# grep FAIL out
[FAIL] Reg 1 mismatch: requested 0x0; got 0x3
[FAIL] Reg 15 mismatch: requested 0x8badf00d5aadc0de; got 0xffffff425aadc0de
[FAIL] Reg 15 mismatch: requested 0x8badf00d5aadc0de; got 0xffffff425aadc0de
[FAIL] Reg 15 mismatch: requested 0x8badf00d5aadc0de; got 0xffffff425aadc0de
Are you on AMD? Can you try this patch:
https://git.kernel.org/pub/scm/linux/kernel/git/luto/linux.git/commit/?h=x86...
and give me a Tested-by if it fixes it for you?
[FAIL] f[u]comi[p] errors: 1
[FAIL] fisttp errors: 1
I don't know about these.
[FAIL] R8 has changed:0000000000000000
[FAIL] R9 has changed:0000000000000000
[FAIL] R10 has changed:0000000000000000
[FAIL] R11 has changed:0000000000000000
[FAIL] R8 has changed:0000000000000000
[FAIL] R9 has changed:0000000000000000
[FAIL] R10 has changed:0000000000000000
[FAIL] R11 has changed:0000000000000000
The patch that added these test lines was the same patch that should have made them pass. Are you sure your tests match your running kernel? You need commit 8bb2610bc4967f19672444a7b0407367f1540028.
If you still have failures, can you send me the complete output from the test_syscall_vdso test?
--Andy
Hi Andy,
2018-05-18 23:03 GMT+01:00 Andy Lutomirski luto@kernel.org:
On Thu, May 17, 2018 at 4:40 PM Dmitry Safonov dima@arista.com wrote:
Some selftests are failing, but the same way as before the patch (ITOW, it's not regression):
[root@localhost self]# grep FAIL out
[FAIL] Reg 1 mismatch: requested 0x0; got 0x3
[FAIL] Reg 15 mismatch: requested 0x8badf00d5aadc0de; got 0xffffff425aadc0de
[FAIL] Reg 15 mismatch: requested 0x8badf00d5aadc0de; got 0xffffff425aadc0de
[FAIL] Reg 15 mismatch: requested 0x8badf00d5aadc0de; got 0xffffff425aadc0de
Are you on AMD? Can you try this patch:
https://git.kernel.org/pub/scm/linux/kernel/git/luto/linux.git/commit/?h=x86...
and give me a Tested-by if it fixes it for you?
Sure. I'm on Intel actually:
cpu family : 6
model : 142
model name : Intel(R) Core(TM) i7-7600U CPU @ 2.80GHz

But I usually test kernels in a VM. So, I use virt-manager as it's easier to manage multiple VMs. The thing is that I've chosen "Copy host CPU configuration" and, for some reason I don't quite follow, virt-manager sets the model to "Opteron_G4". I'm on Fedora 27, virt-manager 1.4.3, qemu 2.9.1 (qemu-2.9.1-2.fc26).
So, cpuinfo in the VM says:
cpu family : 21
model : 1
model name : AMD Opteron 62xx class CPU
What's worse than the register changes is that some selftests actually lead to Oopses. That is also the reason for the criu-ia32 failures. So far I've tested the v4.15 and v4.16 releases besides master (2c71d338bef2), so it doesn't look like a recent regression.
Full Oopses:
[ 189.100174] BUG: unable to handle kernel paging request at 00000000417bafe8
[ 189.100174] PGD 69ed4067 P4D 69ed4067 PUD 707fc067 PMD 6c535067 PTE 6991f067
[ 189.100174] Oops: 0001 [#3] SMP NOPTI
[ 189.100174] Modules linked in:
[ 189.100174] CPU: 0 PID: 2443 Comm: sysret_ss_attrs Tainted: G D 4.17.0-rc5+ #11
[ 189.103187] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1.fc26 04/01/2014
[ 189.103187] RIP: 0033:0x40085a
[ 189.103187] RSP: 002b:00000000417bafe8 EFLAGS: 00000206
[ 189.103187] RAX: 0000000000000000 RBX: 00000000000003e8 RCX: 0000000000000000
[ 189.103187] RDX: 0000000000000000 RSI: 0000000000400830 RDI: 00000000417baff8
[ 189.103187] RBP: 00000000417baff8 R08: 0000000000000000 R09: 0000000000000077
[ 189.103187] R10: 0000000000000006 R11: 0000000000000000 R12: 00000000417ba000
[ 189.103187] R13: 00007ffc05207840 R14: 0000000000000000 R15: 0000000000000000
[ 189.103187] FS: 00007f98566ecb40(0000) GS:ffff9740ffc00000(0000) knlGS:0000000000000000
[ 189.103187] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 189.103187] CR2: 00000000417bafe8 CR3: 0000000069dc4000 CR4: 00000000007406f0
[ 189.103187] PKRU: 55555554
[ 189.103187] RIP: 0x40085a RSP: 00000000417bafe8
[ 189.103187] CR2: 00000000417bafe8
[ 189.103187] ---[ end trace 8878c9a088d5f296 ]---
Killed
[ 219.366814] BUG: unable to handle kernel paging request at 00000000ffd2874c
[ 219.367040] PGD 69fbf067 P4D 69fbf067 PUD 69fa5067 PMD 69fa4067 PTE 6cb04067
[ 219.367040] Oops: 0001 [#4] SMP NOPTI
[ 219.367040] Modules linked in:
[ 219.367040] CPU: 1 PID: 2497 Comm: test_syscall_vd Tainted: G D 4.17.0-rc5+ #11
[ 219.367040] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1.fc26 04/01/2014
[ 219.367040] RIP: 0033:0x8048e9d
[ 219.367040] RSP: 002b:00000000ffd2874c EFLAGS: 00000202
[ 219.367040] RAX: 0000000008048778 RBX: 0000000000000000 RCX: 000000000000003f
[ 219.367040] RDX: 0000000000000001 RSI: 00000000f7ff7b80 RDI: 0000000000000000
[ 219.367040] RBP: 00000000ffd287c8 R08: 7f7f7f7f7f7f7f7f R09: 7f7f7f7f7f7f7f80
[ 219.367040] R10: 7f7f7f7f7f7f7f81 R11: 7f7f7f7f7f7f7f82 R12: 7f7f7f7f7f7f7f83
[ 219.367040] R13: 7f7f7f7f7f7f7f84 R14: 7f7f7f7f7f7f7f85 R15: 7f7f7f7f7f7f7f86
[ 219.367040] FS: 0000000000000000(0000) GS:ffff9740ffd00000(0063) knlGS:00000000f7fc6700
[ 219.367040] CS: 0010 DS: 002b ES: 002b CR0: 0000000080050033
[ 219.367040] CR2: 00000000ffd2874c CR3: 000000006c4ca000 CR4: 00000000007406e0
[ 219.367040] PKRU: 55555554
[ 219.367040] RIP: 0x8048e9d RSP: 00000000ffd2874c
[ 219.367040] CR2: 00000000ffd2874c
[ 219.367040] ---[ end trace 8878c9a088d5f297 ]---
Killed
When I choose kvm64 (or qemu64) as the CPU model, the Oopses are gone, but the tests still fail with register mismatches in the same way. Possibly the Oopses are qemu faults?
[FAIL] f[u]comi[p] errors: 1
[FAIL] fisttp errors: 1
I don't know about these.
[FAIL] R8 has changed:0000000000000000
[FAIL] R9 has changed:0000000000000000
[FAIL] R10 has changed:0000000000000000
[FAIL] R11 has changed:0000000000000000
[FAIL] R8 has changed:0000000000000000
[FAIL] R9 has changed:0000000000000000
[FAIL] R10 has changed:0000000000000000
[FAIL] R11 has changed:0000000000000000
The patch that added these test lines was the same patch that should have made them pass. Are you sure your tests match your running kernel? You need commit 8bb2610bc4967f19672444a7b0407367f1540028.
Yeah, it is already in the last master.
If you still have failures, can you send me the complete output from the test_syscall_vdso test?
So, with the possibly lousy qemu (mis-)configuration that I have, and with your patch applied on top of the latest master, the "Reg 15 mismatch" is fixed. I still see the following failures:
======./sigreturn_32========
[OK] set_thread_area refused 16-bit data
[OK] set_thread_area refused 16-bit data
[RUN] Valid sigreturn: 64-bit CS (33), 32-bit SS (2b, GDT)
[FAIL] Reg 1 mismatch: requested 0x0; got 0x3
 SP: 5aadc0de -> 5aadc0de
[RUN] Valid sigreturn: 32-bit CS (23), 32-bit SS (2b, GDT)
 SP: 5aadc0de -> 5aadc0de
[OK] all registers okay
[RUN] Valid sigreturn: 16-bit CS (37), 32-bit SS (2b, GDT)
 SP: 5aadc0de -> 5aadc0de
[OK] all registers okay
[RUN] Valid sigreturn: 64-bit CS (33), 16-bit SS (3f)
 SP: 5aadc0de -> 5aadc0de
[OK] all registers okay
--
[RUN] Testing fcmovCC instructions
[OK] fcmovCC
======./test_syscall_vdso_32========
[RUN] Executing 6-argument 32-bit syscall via VDSO
[OK] Arguments are preserved across syscall
[NOTE] R11 has changed:0000000000200ed7 - assuming clobbered by SYSRET insn
[OK] R8..R15 did not leak kernel data
[RUN] Executing 6-argument 32-bit syscall via INT 80
[OK] Arguments are preserved across syscall
[FAIL] R8 has changed:0000000000000000
[FAIL] R9 has changed:0000000000000000
[FAIL] R10 has changed:0000000000000000
[FAIL] R11 has changed:0000000000000000
[RUN] Executing 6-argument 32-bit syscall via VDSO
[OK] Arguments are preserved across syscall
[NOTE] R11 has changed:0000000000200ed7 - assuming clobbered by SYSRET insn
[OK] R8..R15 did not leak kernel data
[RUN] Executing 6-argument 32-bit syscall via INT 80
[OK] Arguments are preserved across syscall
[FAIL] R8 has changed:0000000000000000
[FAIL] R9 has changed:0000000000000000
[FAIL] R10 has changed:0000000000000000
[FAIL] R11 has changed:0000000000000000
[RUN] Running tests under ptrace
Thanks, Dmitry
2018-05-19 0:10 GMT+01:00 Dmitry Safonov 0x7f454c46@gmail.com:
Sure. I'm on Intel actually: cpu family : 6 model : 142 model name : Intel(R) Core(TM) i7-7600U CPU @ 2.80GHz
But I usually test kernels in VM. So, I use virt-manager as it's easier to manage multiple VMs. The thing is that I've chosen "Copy host CPU configuration" and for some reason, I don't quite follow virt-manager makes model "Opteron_G4". I'm on Fedora 27, virt-manager 1.4.3, qemu 2.9.1(qemu-2.9.1-2.fc26).
Hmm, the reason it chooses AMD emulation looks like a bug in virt-manager: When I try IvyBridge CPU, it gives the following error:
Error starting domain: the CPU is incompatible with host CPU: Host CPU does not provide required features: vme, x2apic, tsc-deadline, avx, f16c, rdrand
Which, to my naive mind, is because "tsc-deadline" is not written with a dash in cpuinfo:
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb invpcid_single pti tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx rdseed adx smap clflushopt intel_pt xsaveopt xsavec xgetbv1 xsaves ibpb ibrs stibp dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp

But that's just my naive guess.
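The name mismatch itself is easy to show with a whole-word lookup in the flags line. This is only an illustration of the guess above: has_flag() is a hypothetical helper, and nothing here claims that virt-manager or libvirt detect features by reading /proc/cpuinfo.

```c
#include <stdbool.h>
#include <string.h>

/*
 * Returns true if `name` appears in the space-separated `flags` string
 * as a complete token (so "avx" does not match inside "avx2").
 */
bool has_flag(const char *flags, const char *name)
{
	size_t n = strlen(name);
	const char *p = flags;

	if (n == 0)
		return false;

	while ((p = strstr(p, name)) != NULL) {
		bool starts = (p == flags) || (p[-1] == ' ');
		bool ends = (p[n] == '\0') || (p[n] == ' ');

		if (starts && ends)
			return true;
		p++;	/* keep searching after this partial match */
	}
	return false;
}
```

On the flags line above, a lookup for "tsc-deadline" finds nothing while "tsc_deadline_timer" is present, which is all the guess relies on.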
Thanks, Dmitry
2018-05-19 0:16 GMT+01:00 Dmitry Safonov 0x7f454c46@gmail.com:
Yeah, so they use cpuid there and I guess this one wasn't fixed for me: https://bugzilla.redhat.com/show_bug.cgi?id=1467599
Thanks, Dmitry
On May 18, 2018, at 4:10 PM, Dmitry Safonov 0x7f454c46@gmail.com wrote:
Hi Andy,
2018-05-18 23:03 GMT+01:00 Andy Lutomirski luto@kernel.org:
On Thu, May 17, 2018 at 4:40 PM Dmitry Safonov dima@arista.com wrote:
Some selftests are failing, but the same way as before the patch (ITOW, it's not regression):
[root@localhost self]# grep FAIL out
[FAIL] Reg 1 mismatch: requested 0x0; got 0x3
[FAIL] Reg 15 mismatch: requested 0x8badf00d5aadc0de; got 0xffffff425aadc0de
[FAIL] Reg 15 mismatch: requested 0x8badf00d5aadc0de; got 0xffffff425aadc0de
[FAIL] Reg 15 mismatch: requested 0x8badf00d5aadc0de; got 0xffffff425aadc0de
Are you on AMD? Can you try this patch:
https://git.kernel.org/pub/scm/linux/kernel/git/luto/linux.git/commit/?h=x86...
and give me a Tested-by if it fixes it for you?
Sure. I'm on Intel actually: cpu family : 6 model : 142 model name : Intel(R) Core(TM) i7-7600U CPU @ 2.80GHz
But I usually test kernels in VM. So, I use virt-manager as it's easier to manage multiple VMs. The thing is that I've chosen "Copy host CPU configuration" and for some reason, I don't quite follow virt-manager makes model "Opteron_G4". I'm on Fedora 27, virt-manager 1.4.3, qemu 2.9.1(qemu-2.9.1-2.fc26).
So, cpuinfo in VM says:
cpu family : 21
model : 1
model name : AMD Opteron 62xx class CPU
What does guest cpuinfo say for vendor_id?
There are multiple potential screwups here.
1. (What I *thought* was going on) AMD CPUs have screwy IRET behavior that’s different from Intel’s, and the test case was definitely wrong. But KVM has no way to influence it. Are you sure you’re using KVM and not QEMU TCG? Anyway, the IRET thing is minor compared to your other problems, so let’s try to fix them first.
2. Compat fast syscalls are wildly different on AMD and Intel. Because of this issue, QEMU with KVM is supposed to always report the real vendor_id no matter what -cpu asks for. If we get the wrong vendor_id, then we’re at the mercy of KVM’s emulation and performance will suck. On older kernels, this would cause hideous kernel crashes. On new kernels, I would expect it to merely crash 32-bit user programs or be slow.
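For the vendor_id question, a quick way to check inside the guest is to pull the field out of the cpuinfo text. A minimal sketch, assuming the usual "vendor_id : value" line layout of /proc/cpuinfo; both helper names are hypothetical.

```c
#include <stdbool.h>
#include <string.h>

/*
 * Copies the value of the first "vendor_id" line of a cpuinfo-style
 * text into `out`. Returns false if there is no such line or the
 * value does not fit into `outlen` bytes.
 */
bool cpuinfo_vendor_id(const char *text, char *out, size_t outlen)
{
	static const char key[] = "vendor_id";
	const char *line = text;

	while (*line != '\0') {
		const char *eol = strchr(line, '\n');
		size_t len = eol ? (size_t)(eol - line) : strlen(line);

		if (len > sizeof(key) - 1 &&
		    strncmp(line, key, sizeof(key) - 1) == 0) {
			const char *colon = memchr(line, ':', len);

			if (colon != NULL) {
				const char *val = colon + 1;
				const char *end = line + len;
				size_t vlen;

				/* Skip the blanks that follow the colon. */
				while (val < end && (*val == ' ' || *val == '\t'))
					val++;
				vlen = (size_t)(end - val);
				if (vlen == 0 || vlen >= outlen)
					return false;
				memcpy(out, val, vlen);
				out[vlen] = '\0';
				return true;
			}
		}
		if (eol == NULL)
			break;
		line = eol + 1;
	}
	return false;
}

/* True if the vendor string is the one AMD CPUs report. */
bool vendor_is_amd(const char *vendor)
{
	return strcmp(vendor, "AuthenticAMD") == 0;
}
```

A caller would read /proc/cpuinfo into a buffer and pass it in. On an Intel host running KVM, an "AuthenticAMD" result in the guest is the situation described in point 2.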
What's worse than registers changes is that some selftests actually lead to Oops's. The same reason for criu-ia32 fails. I've tested so far v4.15 and v4.16 releases besides master (2c71d338bef2), so it looks to be not a recent regression.
Full Oopses:
[ 189.100174] BUG: unable to handle kernel paging request at 00000000417bafe8
[ 189.100174] PGD 69ed4067 P4D 69ed4067 PUD 707fc067 PMD 6c535067 PTE 6991f067
[ 189.100174] Oops: 0001 [#3] SMP NOPTI
Whoa there! 0001 means a failed *kernel* access.
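For readers who don't have the error code layout memorised: the number after "Oops:" is the x86 page-fault error code, and the low bits can be decoded as below. This is a sketch for illustration; the PF_* names and struct pf_info are local to this example, and only the bit positions come from the architecture.

```c
#include <stdbool.h>
#include <stdint.h>

/* Architectural bits of the x86 page-fault error code. */
#define PF_PROT   (1u << 0)  /* 0: page not present, 1: protection violation */
#define PF_WRITE  (1u << 1)  /* 0: read access, 1: write access */
#define PF_USER   (1u << 2)  /* 0: supervisor access, 1: user access */
#define PF_RSVD   (1u << 3)  /* reserved bit set in a paging entry */
#define PF_INSTR  (1u << 4)  /* fault on instruction fetch */

struct pf_info {
	bool protection;
	bool write;
	bool user_mode;
	bool reserved_bit;
	bool instr_fetch;
};

/* Splits an error code such as 0x0001 into its individual flags. */
struct pf_info decode_pf_error(uint32_t code)
{
	struct pf_info info = {
		.protection   = (code & PF_PROT) != 0,
		.write        = (code & PF_WRITE) != 0,
		.user_mode    = (code & PF_USER) != 0,
		.reserved_bit = (code & PF_RSVD) != 0,
		.instr_fetch  = (code & PF_INSTR) != 0,
	};
	return info;
}
```

Decoding 0x0001 gives a protection violation on a read with the user bit clear, i.e. the access was made with supervisor privilege.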
[ 189.100174] Modules linked in:
[ 189.100174] CPU: 0 PID: 2443 Comm: sysret_ss_attrs Tainted: G
Was this sysret_ss_attrs_32 or sysret_ss_attrs_64?
D 4.17.0-rc5+ #11
[ 189.103187] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1.fc26 04/01/2014
[ 189.103187] RIP: 0033:0x40085a
The oops was caused from CPL 3 at what looks like a totally sensible user address. Can you disassemble the offending binary and tell me what the code at 0x40085a is?
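The privilege level can be read straight off the selector in the "RIP: 0033:..." line: the low two bits of a segment selector are the requested privilege level, bit 2 selects GDT or LDT, and the rest is the table index. A small sketch (struct selector and decode_selector() are names made up for this example):

```c
#include <stdbool.h>
#include <stdint.h>

struct selector {
	unsigned index;  /* descriptor table index */
	bool ldt;        /* table indicator: false = GDT, true = LDT */
	unsigned rpl;    /* requested privilege level, 0..3 */
};

/* Splits an x86 segment selector into its three fields. */
struct selector decode_selector(uint16_t sel)
{
	struct selector s = {
		.index = sel >> 3,
		.ldt   = ((sel >> 2) & 1) != 0,
		.rpl   = sel & 3,
	};
	return s;
}
```

For the values in this log, 0x33 and 0x2b decode to GDT entries with RPL 3 (user mode), while 0x10 decodes to a GDT entry with RPL 0 (kernel).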
[ 189.103187] RSP: 002b:00000000417bafe8 EFLAGS: 00000206
[ 189.103187] RAX: 0000000000000000 RBX: 00000000000003e8 RCX: 0000000000000000
[ 189.103187] RDX: 0000000000000000 RSI: 0000000000400830 RDI: 00000000417baff8
[ 189.103187] RBP: 00000000417baff8 R08: 0000000000000000 R09: 0000000000000077
[ 189.103187] R10: 0000000000000006 R11: 0000000000000000 R12: 00000000417ba000
[ 189.103187] R13: 00007ffc05207840 R14: 0000000000000000 R15: 0000000000000000
[ 189.103187] FS: 00007f98566ecb40(0000) GS:ffff9740ffc00000(0000) knlGS:0000000000000000
[ 189.103187] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CS here is the value of CS that the *kernel* has, so 0x10 is normal.
[ 189.103187] CR2: 00000000417bafe8 CR3: 0000000069dc4000 CR4: 00000000007406f0
CR2 is in user space.
So the big question is: what happened here? Why did the CPU (or emulated CPU) attempt a privileged access to a user address while running user code?
On Fri, 2018-05-18 at 19:05 -0700, Andy Lutomirski wrote:
On May 18, 2018, at 4:10 PM, Dmitry Safonov 0x7f454c46@gmail.com wrote:
cpu family : 6
model : 142
model name : Intel(R) Core(TM) i7-7600U CPU @ 2.80GHz
But I usually test kernels in VM. So, I use virt-manager as it's easier to manage multiple VMs. The thing is that I've chosen "Copy host CPU configuration" and for some reason, I don't quite follow virt-manager makes model "Opteron_G4". I'm on Fedora 27, virt-manager 1.4.3, qemu 2.9.1(qemu-2.9.1-2.fc26).
So, cpuinfo in VM says:
cpu family : 21
model : 1
model name : AMD Opteron 62xx class CPU
What does guest cpuinfo say for vendor_id?
There are multiple potential screwups here.
1. (What I *thought* was going on) AMD CPUs have screwy IRET behavior that’s different from Intel’s, and the test case was definitely wrong. But KVM has no way to influence it. Are you sure you’re using KVM and not QEMU TCG? Anyway, the IRET thing is minor compared to your other problems, so let’s try to fix them first.

2. Compat fast syscalls are wildly different on AMD and Intel. Because of this issue, QEMU with KVM is supposed to always report the real vendor_id no matter -cpu asks for. If we get the wrong vendor_id, then we’re at the mercy of KVM’s emulation and performance will suck. On older kernels, this would cause hideous kernel crashes. On new kernels, I would expect it to merely crash 32-bit user programs or be slow.
Heh, I didn't know those details, so it looks like it's (2), vendor_id : AuthenticAMD in guest.
What's worse than registers changes is that some selftests actually lead to Oops's. The same reason for criu-ia32 fails. I've tested so far v4.15 and v4.16 releases besides master (2c71d338bef2), so it looks to be not a recent regression.
Full Oopses:
[ 189.100174] BUG: unable to handle kernel paging request at 00000000417bafe8
[ 189.100174] PGD 69ed4067 P4D 69ed4067 PUD 707fc067 PMD 6c535067 PTE 6991f067
[ 189.100174] Oops: 0001 [#3] SMP NOPTI
Whoa there! 0001 means a failed *kernel* access.
[ 189.100174] Modules linked in:
[ 189.100174] CPU: 0 PID: 2443 Comm: sysret_ss_attrs Tainted: G
Was this sysret_ss_attrs_32 or sysret_ss_attrs_64?
sysret_ss_attrs_32 survives
D 4.17.0-rc5+ #11
[ 189.103187] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1.fc26 04/01/2014
[ 189.103187] RIP: 0033:0x40085a
The oops was caused from CPL 3 at what looks like a totally sensible user address. Can you disassemble the offending binary and tell me what the code at 0x40085a is?
Here is the function:
0000000000400842 <call32_from_64>:
  400842:  53                    push   %rbx
  400843:  55                    push   %rbp
  400844:  41 54                 push   %r12
  400846:  41 55                 push   %r13
  400848:  41 56                 push   %r14
  40084a:  41 57                 push   %r15
  40084c:  9c                    pushfq
  40084d:  48 89 27              mov    %rsp,(%rdi)
  400850:  48 89 fc              mov    %rdi,%rsp
  400853:  6a 23                 pushq  $0x23
  400855:  68 5c 08 40 00        pushq  $0x40085c
  40085a:  48 cb                 lretq
  40085c:  ff d6                 callq  *%rsi
  40085e:  ea                    (bad)
  40085f:  65 08 40 00           or     %al,%gs:0x0(%rax)
  400863:  33 00                 xor    (%rax),%eax
  400865:  48 8b 24 24           mov    (%rsp),%rsp
  400869:  9d                    popfq
  40086a:  41 5f                 pop    %r15
  40086c:  41 5e                 pop    %r14
  40086e:  41 5d                 pop    %r13
  400870:  41 5c                 pop    %r12
  400872:  5d                    pop    %rbp
  400873:  5b                    pop    %rbx
  400874:  c3                    retq
  400875:  66 2e 0f 1f 84 00 00  nopw   %cs:0x0(%rax,%rax,1)
  40087c:  00 00 00
  40087f:  90                    nop
Looks like mov between registers caused it? The hell.
[ 189.103187] RSP: 002b:00000000417bafe8 EFLAGS: 00000206
[ 189.103187] RAX: 0000000000000000 RBX: 00000000000003e8 RCX: 0000000000000000
[ 189.103187] RDX: 0000000000000000 RSI: 0000000000400830 RDI: 00000000417baff8
[ 189.103187] RBP: 00000000417baff8 R08: 0000000000000000 R09: 0000000000000077
[ 189.103187] R10: 0000000000000006 R11: 0000000000000000 R12: 00000000417ba000
[ 189.103187] R13: 00007ffc05207840 R14: 0000000000000000 R15: 0000000000000000
[ 189.103187] FS: 00007f98566ecb40(0000) GS:ffff9740ffc00000(0000) knlGS:0000000000000000
[ 189.103187] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CS here is the value of CS that the *kernel* has, so 0x10 is normal.
[ 189.103187] CR2: 00000000417bafe8 CR3: 0000000069dc4000 CR4: 00000000007406f0
CR2 is in user space.
So the big question is: what happened here? Why did the CPU (or emulated CPU) attempt a privileged access to a user address while running user code?
No idea, but looks like it's not a kernel fault.
2018-05-19 3:22 GMT+01:00 Dmitry Safonov dima@arista.com:
Oh, it's not 400850, I misread it; it's 40085a, so lretq might have caused it.
-- Thanks, Dmitry
2018-05-19 3:25 GMT+01:00 Dmitry Safonov 0x7f454c46@gmail.com:
Looks like mov between registers caused it? The hell.
Oh, it's not 400850, I misread it; it's 40085a, so lretq might have caused it.
But it's 002b:00000000417bafe8, i.e. USER_DS and a sensible address, so I still have no idea.
On Fri, May 18, 2018 at 12:35:10AM +0100, Dmitry Safonov wrote:
Reviewed-by: Cyrill Gorcunov gorcunov@openvz.org
Thanks a lot! (At first I had to scratch my head for a second to realize that the key point is executing a 64-bit application from inside a compat process :-)