On Tue, 2024-10-22 at 17:06 -0500, Steve Wahl wrote:
On Tue, Oct 22, 2024 at 07:51:38PM +0100, David Woodhouse wrote:
I spent all of Monday setting up a full GDT, IDT and exception handler for the relocate_kernel() environment¹, and I think these reports may have the same root cause as the problem I've been debugging.
David,
My original problem involved UV platform hardware catching a speculative access into the reserved areas, which caused a BIOS HALT. Reducing the use of gbpages in the page table kept the speculation from hitting those areas. I would believe this sort of thing might be unique to the UV platform.
The regression reports I got from Pavin and others were not processor speculation; they were due to my original patch trimming the page tables down to the point where they no longer included some memory that was actually referenced, because those regions were not explicitly included in the creation of the kexec page map. This was fixed by explicitly including those regions when creating the map.
Hm, I didn't see that part of the discussion. I saw that theory floated, but I haven't seen specific confirmation or fixes. And your original patch was reverted and still not reapplied, AFAICT.
I did note that the victims all seemed to be using AMD CPUs, so it seemed likely that at least *some* of them were suffering the same problem that I've found.
Do you have references please?
If anyone is still seeing such problems either with or without your patch, they can run with my exception handler and get an actual dump instead of a triple-fault.
(I'm also pushing CPU vendors to give us information from the triple-fault through the machine check architecture. It's awful having to do this blind. For VMs, I also had plans to register a crashdump kernel entry point with the hypervisor, so that on a triple fault the *hypervisor* could dump the state of all the vCPUs to the configured location, then restart one CPU in the crash kernel for it to do its own dump).
Can you dump the page tables to see if the address you're referencing is included in those tables (or maybe you already did)? Can you give symbols and code around the RIP when you hit the #PF? It looks like this is in the region mentioned as the "Control page", so it's probably trampoline code that has been copied from somewhere else. I'm looking at my own copy of the source, which is perhaps a different kernel than yours, given your exception handler modification.
Wait, I can't make sense of the dump. See more below.
What platform are you running on? And under what conditions (is this bare metal)? Is it really speculation that's causing your #PF? If so, you could cause it deterministically by, say, doing a quick checksum on that area you're not supposed to touch (0xc142000000 - 0xc1420fffff) and seeing if it faults every time. (As I said, I was thinking faults from speculation might be unique to the UV platform.)
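Something along these lines is what I have in mind -- just a rough sketch with a made-up name, assuming you can run it while the kexec identity map is live:

	/*
	 * Hypothetical probe: read-checksum the reserved range so that
	 * any fault from touching it becomes deterministic rather than
	 * depending on where the kimage pages happen to land.
	 */
	static unsigned long probe_reserved(volatile unsigned char *start,
					    volatile unsigned char *end)
	{
		unsigned long sum = 0;

		while (start < end)
			sum += *start++;	/* a fault here would repeat every run */
		return sum;
	}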
Yes, it's bare metal. AMD Genoa. No, it's not speculation. It's because we have a single 2MiB page which covers *both* the RMP table (1MiB reserved by BIOS in e820 as I showed), and a page that was allocated for the kimage. If I understand correctly, the hardware raises that fault (with bit 31 in the error code) when refusing to populate that TLB entry for writing.
According to the AMD manual we're allowed to *read* but not write.
We end up taking a #PF, usually on one of the 'rep mov's, one time on the 'pushq %r8' right before using it to 'ret' to identity_mapped. In each case it happens on the first *write* to a page.
Now that I can print %cr2 when it happens (instead of just going straight to triple-fault), I spot an interesting fact about the address. It's always *adjacent* to a region reserved by BIOS in the e820 data, and within the same 2MiB page.
I'm not at all certain, but this feels like a red herring. Be cautious.
It wouldn't be our first in this journey, but I'm actually fairly confident this time. :)
[    0.000000] BIOS-e820: [mem 0x000000bfbe000000-0x000000c1420fffff] reserved
[    0.000000] BIOS-e820: [mem 0x000000c142100000-0x000000fc7fffffff] usable
[   58.996257] kexec: Control page at c149431000
rip:000000c1494312f8
rsp:000000c149431f90
Exc:000000000000000e
Err:0000000080000003
rax:000000c142130000
rbx:000000010d4b8020
rcx:0000000000000200
rdx:000000000009c000
rsi:000000000009c000
rdi:000000c142130000
r8 :000000c149431000
r9 :000000c149430000
r10:000000010d4bc000
r11:0000000000000000
r12:0000000000000000
r13:0000000000770ef0
r14:ffff8c82c0000000
r15:0000000000000000
cr2:000000c142130000
And bit 31 in the error code is set, which means it's an RMP violation.
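Recent trees even have a name for that bit, so decoding it in the handler is trivial. A sketch only (the handler itself still needs tidying before I post it):

	/* Matches X86_PF_RMP in arch/x86/include/asm/trap_pf.h, IIRC. */
	#define X86_PF_RMP	(1UL << 31)

	static const char *pf_reason(unsigned long error_code)
	{
		if (error_code & X86_PF_RMP)
			return "RMP violation";	/* write refused by the RMP */
		return "ordinary #PF";
	}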
RMP is AMD SEV related, right? I'm not familiar with SEV operation, but I have an itchy feeling it's involved in this problem.
I am having a hard time with the RIP listed above. Maybe your exception handler has affected it? My disassembly seems to show this address should be in a sea of 0xCC / int3 bytes past the end of swap_pages.
You'd have to have access to my kernel binary to have a hope of knowing that, surely? I don't think I checked that particular one, but it's normally one of the 'rep mov's in relocate_kernel_64.S.
Looks like we set up a 2MiB page covering the whole range from 0xc142000000 to 0xc142200000, but we aren't allowed to touch the first half of that.
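To spell out the arithmetic (standalone, values taken from the dump above):

	#include <stdio.h>
	#include <stdint.h>

	int main(void)
	{
		const uint64_t pmd_mask     = ~((1ULL << 21) - 1); /* 2MiB pages */
		const uint64_t cr2          = 0xc142130000ULL;     /* faulting address */
		const uint64_t reserved_end = 0xc1420fffffULL;     /* end of reserved range */

		/* Both print 0xc142000000: the kimage page and the RMP
		 * table share one 2MiB mapping. */
		printf("faulting page's 2MiB block:  %#llx\n",
		       (unsigned long long)(cr2 & pmd_mask));
		printf("reserved range's 2MiB block: %#llx\n",
		       (unsigned long long)(reserved_end & pmd_mask));
		return 0;
	}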
Is it possible that, instead, some SEV tag is hanging around (TLB not completely cleared?) and a page that was otherwise free is causing the problem? Are you using SEV/SME on your system, and if you stop using it, does the problem go away? (Although I have a feeling the answer is no and I'm barking up the wrong tree.)
The target of the page copy above is c142130000. Have you checked to make sure that's a valid address in the page map?
Yeah, we dumped the page tables and it's present.
For me it happens either with or without Steve's last patch, *but* clearing direct_gbpages did seem to make it go away (or at least reduced the incident rate far below the 1-crash-in-1000-kexecs which I was seeing before).
I assume you're referring to the "nogbpages" kernel option?
Nah, I just commented out the lines in init_pgtable() which set info.direct_gbpages=true.
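That is, roughly this hunk in init_pgtable() in arch/x86/kernel/machine_kexec_64.c (quoting from memory of my tree, so treat it as approximate):

		/* The lines I commented out for the test: */
		if (direct_gbpages)
			info.direct_gbpages = true;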
My patch and the nogbpages option should have the exact same pages mapped in the page table. The difference is that my patch would still use gbpages where a whole gbpage-sized region is included in the map, while nogbpages would use 2M pages to fill out the region. This *would* allocate more pages to the page table (one extra 4KiB PMD page for each 1GiB region filled with 2M pages), which might be shifting things around on you.
Right. In fact the first trigger for this, in our case, was an innocuous change to the NMI watchdog period — which sent us on a *long* wild goose chase based on the assumption that it was a stray perf NMI causing the triple-faults, when in fact that was just shifting things around on us too, and causing pages in that dangerous 1MiB to be chosen for the kimage.
I think Steve's original patch was just moving things around a little, and because it allocated more pages for page tables, it just happened to leave pages in the offending range free to be allocated, and written to, for the unlucky victims.
I think the patch was actually along the right lines, although it needs to go all the way down to 4KiB PTEs in some cases. And it could probably map anything that the e820 calls 'usable RAM', rather than restricting itself to precisely the ranges it was requested to map.
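Something like this is the policy I have in mind -- invented names, not the actual ident-mapping code:

	#include <stdbool.h>
	#include <stdint.h>

	/*
	 * Only use a large page (2MiB or 1GiB) when the whole naturally
	 * aligned block sits inside a range we're happy to map; otherwise
	 * fall back to the next size down, ending at 4KiB PTEs when even
	 * a 2MiB block would straddle something like the RMP table.
	 */
	static bool whole_block_mappable(uint64_t addr, uint64_t block_size,
					 uint64_t ok_start, uint64_t ok_end)
	{
		uint64_t base = addr & ~(block_size - 1);

		return base >= ok_start && base + block_size <= ok_end;
	}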
¹ I'll post that exception handler at some point once I've tidied it up.
I hope this might be of some help. Good luck, I'll pitch in any way I can.
Thanks.