On Tue, Mar 12, 2024 at 07:04:10AM -0400, Eric Hagberg wrote:
On Thu, Mar 7, 2024 at 11:33 AM Steve Wahl <steve.wahl@hpe.com> wrote:
What Linux Distribution are you running on that machine? My guess would be that this is not distro related; if you are running something quite different from Pavin that would confirm this.
Distro in use is Rocky 8, so it's pretty clearly not distro-specific.
I found an AMD based system to try to reproduce this on.
Yeah, it probably requires either a specific CPU, or a set of devices plus CPU, to trigger… found that it also affects Dell R7625 servers in addition to the R6615s
I agree that it's likely the CPU or particular set of surrounding devices that trigger the problem.
I have not succeeded in reproducing the problem yet. I tried an AMD based system lent to me, but it's probably the wrong generation (AMD EPYC 7251) and I didn't see the problem. I have a line on a system that's more in line with the systems the bug was reported on that I should be able to try tomorrow.
I would love to have some direction from the community at large on this. The fact that nogbpages on the command line causes the same problem without my patch suggests it's not bad code directly in my patch, but something in the way kexec reacts to the resulting identity map. One quick solution would be a kernel command line parameter to select between the previous identity map creation behavior and the new behavior. E.g. in addition to "nogbpages", we could have "somegbpages" and "allgbpages" -- or gbpages=[all, some, none] with nogbpages a synonym for backwards compatibility.
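To make the proposal concrete, here is a rough user-space sketch of how a gbpages=[all|some|none] option might be parsed, with "nogbpages" kept as a synonym for gbpages=none. The enum and function names are hypothetical, not existing kernel code:

```c
#include <assert.h>
#include <string.h>

/* Hypothetical modes for the proposed gbpages= parameter. */
enum gbpages_mode { GBPAGES_ALL, GBPAGES_SOME, GBPAGES_NONE };

enum gbpages_mode parse_gbpages(const char *arg)
{
    if (!strcmp(arg, "all"))
        return GBPAGES_ALL;
    if (!strcmp(arg, "none"))
        return GBPAGES_NONE;
    /* "some" (the proposed new default) and unknown values fall through */
    return GBPAGES_SOME;
}

/* "nogbpages" kept as a backwards-compatible synonym for gbpages=none. */
enum gbpages_mode parse_cmdline_word(const char *word)
{
    if (!strcmp(word, "nogbpages"))
        return GBPAGES_NONE;
    if (!strncmp(word, "gbpages=", 8))
        return parse_gbpages(word + 8);
    return GBPAGES_SOME;
}
```

In the kernel this would hang off an early_param() handler; the sketch only shows the value mapping.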
But I don't want to introduce a new command line parameter if the actual problem can be understood and fixed. The question is how much time do I have to pursue a direct fix before some other action needs to be taken?
Thanks,
--> Steve Wahl
On 3/12/24 20:43, Steve Wahl wrote:
But I don't want to introduce a new command line parameter if the actual problem can be understood and fixed. The question is how much time do I have to pursue a direct fix before some other action needs to be taken?
Perhaps the kexec maintainers [0] can be made aware of this and you could coordinate with them on a potential fix?
Currently maintained by:
P: Simon Horman
M: horms@verge.net.au
L: kexec@lists.infradead.org
I hope the root cause can be fixed instead of patching it over with a flag to suppress the problem, but I don't know how regressions are handled here.
[0]: https://git.kernel.org/pub/scm/utils/kernel/kexec/kexec-tools.git/tree/AUTHO...
Pavin.
[Added kexec maintainers]
Full thread starts here: https://lore.kernel.org/all/3a1b9909-45ac-4f97-ad68-d16ef1ce99db@pavinjoseph...
On Wed, Mar 13, 2024 at 12:12:31AM +0530, Pavin Joseph wrote:
On 3/12/24 20:43, Steve Wahl wrote:
But I don't want to introduce a new command line parameter if the actual problem can be understood and fixed. The question is how much time do I have to pursue a direct fix before some other action needs to be taken?
Perhaps the kexec maintainers [0] can be made aware of this and you could coordinate with them on a potential fix?
Currently maintained by:
P: Simon Horman
M: horms@verge.net.au
L: kexec@lists.infradead.org
Probably a good idea to add kexec people to the list, so I've added them to this email.
Everyone, my recent patch to the kernel that changed identity mapping:
7143c5f4cf2073193 x86/mm/ident_map: Use gbpages only where full GB page should be mapped.
... has broken kexec on a few machines. The symptom is they do a full BIOS reboot instead of a kexec of the new kernel. Seems to be limited to AMD processors, but it's not all AMD processors, probably just some characteristic that they happen to share.
The same machines that are broken by my patch, are also broken in previous kernels if you add "nogbpages" to the kernel command line (which makes the identity map bigger, "nogbpages" doing for all parts of the identity map what my patch does only for some parts of it).
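Roughly why "nogbpages" makes the identity map bigger: each GiB mapped with a single gbpage is just one PUD entry, while the same GiB mapped with 2MiB pages needs an extra 4KiB PMD page holding 512 entries. A sketch in user-space arithmetic (illustrative only, not kernel code):

```c
#include <assert.h>
#include <stdint.h>

#define GB (1ULL << 30)

/* GiB spanned by `bytes` of address space, rounded up. */
uint64_t gib_spanned(uint64_t bytes) { return (bytes + GB - 1) / GB; }

/* With gbpages: each GiB is a single PUD entry, no PMD pages at all. */
uint64_t pmd_pages_gbpages(uint64_t bytes) { (void)bytes; return 0; }

/* With 2MiB pages only ("nogbpages"): one 4KiB PMD page
 * (512 entries x 2MiB = 1GiB) per GiB spanned. */
uint64_t pmd_pages_2m_only(uint64_t bytes) { return gib_spanned(bytes); }
```

So identity-mapping, say, 64GiB costs 64 extra page-table pages under nogbpages; the patch pays that cost only for partially-covered GiB regions.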
I'm still hoping to find a machine I can reproduce this on to try and debug it myself.
If any of you have any assistance or advice to offer, it would be most welcome!
I hope the root cause can be fixed instead of patching it over with a flag to suppress the problem, but I don't know how regressions are handled here.
That would be my preference as well.
Thanks,
--> Steve Wahl
[*really* added kexec maintainers this time.]
Full thread starts here: https://lore.kernel.org/all/3a1b9909-45ac-4f97-ad68-d16ef1ce99db@pavinjoseph...
On Wed, Mar 13, 2024 at 12:12:31AM +0530, Pavin Joseph wrote:
On 3/12/24 20:43, Steve Wahl wrote:
But I don't want to introduce a new command line parameter if the actual problem can be understood and fixed. The question is how much time do I have to pursue a direct fix before some other action needs to be taken?
Perhaps the kexec maintainers [0] can be made aware of this and you could coordinate with them on a potential fix?
Currently maintained by:
P: Simon Horman
M: horms@verge.net.au
L: kexec@lists.infradead.org
Probably a good idea to add kexec people to the list, so I've added them to this email.
Everyone, my recent patch to the kernel that changed identity mapping:
7143c5f4cf2073193 x86/mm/ident_map: Use gbpages only where full GB page should be mapped.
... has broken kexec on a few machines. The symptom is they do a full BIOS reboot instead of a kexec of the new kernel. Seems to be limited to AMD processors, but it's not all AMD processors, probably just some characteristic that they happen to share.
The same machines that are broken by my patch, are also broken in previous kernels if you add "nogbpages" to the kernel command line (which makes the identity map bigger, "nogbpages" doing for all parts of the identity map what my patch does only for some parts of it).
I'm still hoping to find a machine I can reproduce this on to try and debug it myself.
If any of you have any assistance or advice to offer, it would be most welcome!
I hope the root cause can be fixed instead of patching it over with a flag to suppress the problem, but I don't know how regressions are handled here.
That would be my preference as well.
Thanks,
--> Steve Wahl
Steve Wahl <steve.wahl@hpe.com> writes:
[*really* added kexec maintainers this time.]
Full thread starts here: https://lore.kernel.org/all/3a1b9909-45ac-4f97-ad68-d16ef1ce99db@pavinjoseph...
On Wed, Mar 13, 2024 at 12:12:31AM +0530, Pavin Joseph wrote:
On 3/12/24 20:43, Steve Wahl wrote:
But I don't want to introduce a new command line parameter if the actual problem can be understood and fixed. The question is how much time do I have to pursue a direct fix before some other action needs to be taken?
Perhaps the kexec maintainers [0] can be made aware of this and you could coordinate with them on a potential fix?
Currently maintained by:
P: Simon Horman
M: horms@verge.net.au
L: kexec@lists.infradead.org
Probably a good idea to add kexec people to the list, so I've added them to this email.
Everyone, my recent patch to the kernel that changed identity mapping:
7143c5f4cf2073193 x86/mm/ident_map: Use gbpages only where full GB page should be mapped.
... has broken kexec on a few machines. The symptom is they do a full BIOS reboot instead of a kexec of the new kernel. Seems to be limited to AMD processors, but it's not all AMD processors, probably just some characteristic that they happen to share.
The same machines that are broken by my patch, are also broken in previous kernels if you add "nogbpages" to the kernel command line (which makes the identity map bigger, "nogbpages" doing for all parts of the identity map what my patch does only for some parts of it).
I'm still hoping to find a machine I can reproduce this on to try and debug it myself.
If any of you have any assistance or advice to offer, it would be most welcome!
Kexec happens on identity mapped page tables.
The files of interest are machine_kexec_64.c and relocate_kernel_64.S
I suspect either the building of the identity mapped page table in machine_kexec_prepare, or the switching to the page table in identity_mapped in relocate_kernel_64.S is where something goes wrong.
Probably in kernel_ident_mapping_init as that code is directly used to build the identity mapped page tables.
Hmm.
Your change is commit d794734c9bbf ("x86/mm/ident_map: Use gbpages only where full GB page should be mapped.")
Given the simplicity of that change itself my guess is that somewhere in the first 1Gb there are pages that needed to be mapped like the idt at 0 that are not getting mapped.
Reading through the changelog:
x86/mm/ident_map: Use gbpages only where full GB page should be mapped.

When ident_pud_init() uses only gbpages to create identity maps, large
ranges of addresses not actually requested can be included in the
resulting table; a 4K request will map a full GB. On UV systems, this
ends up including regions that will cause hardware to halt the system
if accessed (these are marked "reserved" by BIOS). Even processor
speculation into these regions is enough to trigger the system halt.

Only use gbpages when map creation requests include the full GB page of
space. Fall back to using smaller 2M pages when only portions of a GB
page are included in the request.

No attempt is made to coalesce mapping requests. If a request requires
a map entry at the 2M (pmd) level, subsequent mapping requests within
the same 1G region will also be at the pmd level, even if adjacent or
overlapping such requests could have been combined to map a full
gbpage. Existing usage starts with larger regions and then adds
smaller regions, so this should not have any great consequence.

[ dhansen: fix up comment formatting, simplify changelog ]

Signed-off-by: Steve Wahl <steve.wahl@hpe.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/all/20240126164841.170866-1-steve.wahl%40hpe.com
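The two behaviors the changelog describes can be sketched in user-space arithmetic (illustrative only, not the actual ident_pud_init() code): the old rounding, where even a 4K request maps a full GB, and the patch's rule of using a gbpage only when the full GiB was actually requested.

```c
#include <assert.h>
#include <stdint.h>

#define GB (1ULL << 30)

/* Old behavior: requests satisfied at 1GiB granularity, so even a
 * 4KiB request maps the whole surrounding GiB. */
uint64_t gb_round_down(uint64_t addr) { return addr & ~(GB - 1); }
uint64_t gb_round_up(uint64_t addr)   { return (addr + GB - 1) & ~(GB - 1); }

/* Bytes actually mapped when [start, end) is mapped with gbpages only. */
uint64_t gbpage_mapped_bytes(uint64_t start, uint64_t end)
{
    return gb_round_up(end) - gb_round_down(start);
}

/* Patched rule: use a gbpage for the GiB starting at `pud_base` only
 * if the request [start, end) covers that entire GiB; otherwise fall
 * back to 2MiB mappings. */
int full_gb_requested(uint64_t start, uint64_t end, uint64_t pud_base)
{
    return start <= pud_base && end >= pud_base + GB;
}
```

A 4K request like [0x1000, 0x2000) maps a full GB under the old rule, and fails the full_gb_requested() test under the new one.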
I know that historically fixed MTRRs were used so that the first 1GiB could be covered with large page table mappings without causing problems.
I suspect that for whatever these UV systems are, a more targeted solution would be to use the fixed MTRRs to disable caching and speculation on the problematic ranges, rather than completely changing the fundamental logic of how pages are mapped.
Right now it looks like you get to play a game of whack-a-mole with firmware/BIOS tables that don't mention something important, and ensuring the kernel maps everything important in the first 1GiB.
It might be worth setting up early printk on some of these systems and seeing if the failure is in early boot up of the new kernel (that is using kexec supplied identity mapped pages) rather than in kexec per-se.
But that is just my guess at the moment.
Eric
I hope the root cause can be fixed instead of patching it over with a flag to suppress the problem, but I don't know how regressions are handled here.
That would be my preference as well.
Thanks,
--> Steve Wahl
On Wed, Mar 13, 2024 at 07:16:23AM -0500, Eric W. Biederman wrote:
Kexec happens on identity mapped page tables.
The files of interest are machine_kexec_64.c and relocate_kernel_64.S
I suspect either the building of the identity mapped page table in machine_kexec_prepare, or the switching to the page table in identity_mapped in relocate_kernel_64.S is where something goes wrong.
Probably in kernel_ident_mapping_init as that code is directly used to build the identity mapped page tables.
Hmm.
Your change is commit d794734c9bbf ("x86/mm/ident_map: Use gbpages only where full GB page should be mapped.")
Yeah, sorry, I accidentally used the stable cherry-pick commit id that Pavin Joseph found with his bisect results.
Given the simplicity of that change itself my guess is that somewhere in the first 1Gb there are pages that needed to be mapped like the idt at 0 that are not getting mapped.
...
It might be worth setting up early printk on some of these systems and seeing if the failure is in early boot up of the new kernel (that is using kexec supplied identity mapped pages) rather than in kexec per-se.
But that is just my guess at the moment.
Thanks for the input. I was thinking in terms of running out of memory somewhere because we're using more page table entries than we used to. But you've got me thinking that maybe some necessary region is not explicitly requested to be placed in the identity map, but is by luck included in the rounding errors when we use gbpages.
At any rate, since I am still unable to reproduce this for myself, I am going to contact Pavin Joseph off-list and see if he's willing to do a few debugging kernel steps for me and send me the results, to see if I can get this figured out. (I believe trimming the CC list and/or going private is usually frowned upon for the LKML, but I think this is appropriate as it only adds noise for the rest. Let me know if I'm wrong.)
Thank you.
--> Steve Wahl
On Thu, 14 Mar 2024 at 00:18, Steve Wahl <steve.wahl@hpe.com> wrote:
On Wed, Mar 13, 2024 at 07:16:23AM -0500, Eric W. Biederman wrote:
Kexec happens on identity mapped page tables.
The files of interest are machine_kexec_64.c and relocate_kernel_64.S
I suspect either the building of the identity mapped page table in machine_kexec_prepare, or the switching to the page table in identity_mapped in relocate_kernel_64.S is where something goes wrong.
Probably in kernel_ident_mapping_init as that code is directly used to build the identity mapped page tables.
Hmm.
Your change is commit d794734c9bbf ("x86/mm/ident_map: Use gbpages only where full GB page should be mapped.")
Yeah, sorry, I accidentally used the stable cherry-pick commit id that Pavin Joseph found with his bisect results.
Given the simplicity of that change itself my guess is that somewhere in the first 1Gb there are pages that needed to be mapped like the idt at 0 that are not getting mapped.
...
It might be worth setting up early printk on some of these systems and seeing if the failure is in early boot up of the new kernel (that is using kexec supplied identity mapped pages) rather than in kexec per-se.
But that is just my guess at the moment.
Thanks for the input. I was thinking in terms of running out of memory somewhere because we're using more page table entries than we used to. But you've got me thinking that maybe some necessary region is not explicitly requested to be placed in the identity map, but is by luck included in the rounding errors when we use gbpages.
Yes, it is possible. Here is an example case: http://lists.infradead.org/pipermail/kexec/2023-June/027301.html
The final change was to avoid doing AMD things on Intel platforms, but the mapping code is still not fixed in a good way.
At any rate, since I am still unable to reproduce this for myself, I am going to contact Pavin Joseph off-list and see if he's willing to do a few debugging kernel steps for me and send me the results, to see if I can get this figured out. (I believe trimming the CC list and/or going private is usually frowned upon for the LKML, but I think this is appropriate as it only adds noise for the rest. Let me know if I'm wrong.)
Thank you.
--> Steve Wahl
-- Steve Wahl, Hewlett Packard Enterprise
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec
On Thu, 2024-03-14 at 17:25 +0800, Dave Young wrote:
On Thu, 14 Mar 2024 at 00:18, Steve Wahl <steve.wahl@hpe.com> wrote:
On Wed, Mar 13, 2024 at 07:16:23AM -0500, Eric W. Biederman wrote:
Kexec happens on identity mapped page tables.
The files of interest are machine_kexec_64.c and relocate_kernel_64.S
I suspect either the building of the identity mapped page table in machine_kexec_prepare, or the switching to the page table in identity_mapped in relocate_kernel_64.S is where something goes wrong.
Probably in kernel_ident_mapping_init as that code is directly used to build the identity mapped page tables.
Hmm.
Your change is commit d794734c9bbf ("x86/mm/ident_map: Use gbpages only where full GB page should be mapped.")
Yeah, sorry, I accidentally used the stable cherry-pick commit id that Pavin Joseph found with his bisect results.
Given the simplicity of that change itself my guess is that somewhere in the first 1Gb there are pages that needed to be mapped like the idt at 0 that are not getting mapped.
...
It might be worth setting up early printk on some of these systems and seeing if the failure is in early boot up of the new kernel (that is using kexec supplied identity mapped pages) rather than in kexec per-se.
But that is just my guess at the moment.
Thanks for the input. I was thinking in terms of running out of memory somewhere because we're using more page table entries than we used to. But you've got me thinking that maybe some necessary region is not explicitly requested to be placed in the identity map, but is by luck included in the rounding errors when we use gbpages.
Yes, it is possible. Here is an example case: http://lists.infradead.org/pipermail/kexec/2023-June/027301.html
The final change was to avoid doing AMD things on Intel platforms, but the mapping code is still not fixed in a good way.
I spent all of Monday setting up a full GDT, IDT and exception handler for the relocate_kernel() environment¹, and I think these reports may have been the same as what I've been debugging.
We end up taking a #PF, usually on one of the 'rep mov's, one time on the 'pushq %r8' right before using it to 'ret' to identity_mapped. In each case it happens on the first *write* to a page.
Now that I can print %cr2 when it happens (instead of just going straight to triple-fault), I spot an interesting fact about the address. It's always *adjacent* to a region reserved by BIOS in the e820 data, and within the same 2MiB page.
[    0.000000] BIOS-e820: [mem 0x000000bfbe000000-0x000000c1420fffff] reserved
[    0.000000] BIOS-e820: [mem 0x000000c142100000-0x000000fc7fffffff] usable
2024-10-22 17:09:14.291000 kern NOTICE [   58.996257] kexec: Control page at c149431000
2024-10-22 17:09:14.291000 Y
2024-10-22 17:09:14.291000 rip:000000c1494312f8
2024-10-22 17:09:14.291000 rsp:000000c149431f90
2024-10-22 17:09:14.291000 Exc:000000000000000e
2024-10-22 17:09:14.291000 Err:0000000080000003
2024-10-22 17:09:14.291000 rax:000000c142130000
2024-10-22 17:09:14.291000 rbx:000000010d4b8020
2024-10-22 17:09:14.291000 rcx:0000000000000200
2024-10-22 17:09:14.291000 rdx:000000000009c000
2024-10-22 17:09:14.291000 rsi:000000000009c000
2024-10-22 17:09:14.291000 rdi:000000c142130000
2024-10-22 17:09:14.291000 r8 :000000c149431000
2024-10-22 17:09:14.291000 r9 :000000c149430000
2024-10-22 17:09:14.291000 r10:000000010d4bc000
2024-10-22 17:09:14.291000 r11:0000000000000000
2024-10-22 17:09:14.291000 r12:0000000000000000
2024-10-22 17:09:14.291000 r13:0000000000770ef0
2024-10-22 17:09:14.291000 r14:ffff8c82c0000000
2024-10-22 17:09:14.291000 r15:0000000000000000
2024-10-22 17:09:14.291000 cr2:000000c142130000
And bit 31 in the error code is set, which means it's an RMP violation.
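For reference, the error code above (Err:0000000080000003) decodes as present + write + RMP violation. A minimal decode of the relevant #PF error-code bits (bit meanings per the AMD APM; the macro names here are my own):

```c
#include <assert.h>
#include <stdint.h>

/* x86 #PF error code bits: bit 0 = page present (protection violation),
 * bit 1 = write access, bit 31 = RMP violation (AMD SEV-SNP). */
#define PF_PRESENT (1U << 0)
#define PF_WRITE   (1U << 1)
#define PF_RMP     (1U << 31)

/* True for a write blocked by the RMP, as seen in the dump above. */
int is_rmp_write_fault(uint32_t err)
{
    return (err & PF_RMP) && (err & PF_WRITE);
}
```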
Looks like we set up a 2MiB page covering the whole range from 0xc142000000 to 0xc142200000, but we aren't allowed to touch the first half of that.
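The adjacency claim is easy to check with alignment arithmetic: the faulting cr2 (0xc142130000) and the end of the BIOS-reserved region (0xc1420fffff) fall inside the same naturally-aligned 2MiB page. A small sketch (user-space, illustrative only):

```c
#include <assert.h>
#include <stdint.h>

#define PMD_2M (1ULL << 21)

/* Do two physical addresses fall inside the same naturally-aligned
 * 2MiB page, i.e. would a single 2MiB mapping cover both? */
int same_2m_page(uint64_t a, uint64_t b)
{
    return (a & ~(PMD_2M - 1)) == (b & ~(PMD_2M - 1));
}
```

Both addresses round down to 0xc142000000, so one 2MiB mapping spans the reserved tail and the usable pages the kimage was writing to.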
For me it happens either with or without Steve's last patch, *but* clearing direct_gbpages did seem to make it go away (or at least reduced the incident rate far below the 1-crash-in-1000-kexecs which I was seeing before).
I think Steve's original patch was just moving things around a little: because it allocated more pages for page tables, it just happened to leave pages in the offending range to be allocated for writing to, for the unlucky victims.
I think the patch was actually along the right lines though, although it needs to go all the way down to 4KiB PTEs in some cases. And it could probably map anything that the e820 calls 'usable RAM', rather than really restricting itself to precisely the ranges which it's requested to map.
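The "go all the way down to 4KiB PTEs" idea amounts to choosing, at each step, the largest page size that is both naturally aligned at the current address and fits entirely within the requested range. A sketch of that selection (not actual kernel code):

```c
#include <assert.h>
#include <stdint.h>

#define SZ_4K (1ULL << 12)
#define SZ_2M (1ULL << 21)
#define SZ_1G (1ULL << 30)

/* Largest page size usable at `addr` for a request ending at `end`:
 * the mapping must be naturally aligned and must not extend past the
 * requested range; otherwise drop to the next smaller size. */
uint64_t pick_page_size(uint64_t addr, uint64_t end)
{
    if (!(addr & (SZ_1G - 1)) && end - addr >= SZ_1G)
        return SZ_1G;
    if (!(addr & (SZ_2M - 1)) && end - addr >= SZ_2M)
        return SZ_2M;
    return SZ_4K;
}
```

With this rule, a request ending 1MiB into a 2MiB page would be finished with 4KiB PTEs instead of over-mapping into an RMP-protected neighbor.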
¹ I'll post that exception handler at some point once I've tidied it up.
On Tue, Oct 22, 2024 at 07:51:38PM +0100, David Woodhouse wrote:
On Thu, 2024-03-14 at 17:25 +0800, Dave Young wrote:
On Thu, 14 Mar 2024 at 00:18, Steve Wahl <steve.wahl@hpe.com> wrote:
On Wed, Mar 13, 2024 at 07:16:23AM -0500, Eric W. Biederman wrote:
Kexec happens on identity mapped page tables.
The files of interest are machine_kexec_64.c and relocate_kernel_64.S
I suspect either the building of the identity mapped page table in machine_kexec_prepare, or the switching to the page table in identity_mapped in relocate_kernel_64.S is where something goes wrong.
Probably in kernel_ident_mapping_init as that code is directly used to build the identity mapped page tables.
Hmm.
Your change is commit d794734c9bbf ("x86/mm/ident_map: Use gbpages only where full GB page should be mapped.")
Yeah, sorry, I accidentally used the stable cherry-pick commit id that Pavin Joseph found with his bisect results.
Given the simplicity of that change itself my guess is that somewhere in the first 1Gb there are pages that needed to be mapped like the idt at 0 that are not getting mapped.
...
It might be worth setting up early printk on some of these systems and seeing if the failure is in early boot up of the new kernel (that is using kexec supplied identity mapped pages) rather than in kexec per-se.
But that is just my guess at the moment.
Thanks for the input. I was thinking in terms of running out of memory somewhere because we're using more page table entries than we used to. But you've got me thinking that maybe some necessary region is not explicitly requested to be placed in the identity map, but is by luck included in the rounding errors when we use gbpages.
Yes, it is possible. Here is an example case: http://lists.infradead.org/pipermail/kexec/2023-June/027301.html Final change was to avoid doing AMD things on Intel platform, but the mapping code is still not fixed in a good way.
I spent all of Monday setting up a full GDT, IDT and exception handler for the relocate_kernel() environment¹, and I think these reports may have been the same as what I've been debugging.
David,
My original problem involved UV platform hardware catching a speculative access into the reserved areas, which caused a BIOS HALT. Reducing the use of gbpages in the page table kept the speculation from hitting those areas. I would believe this sort of thing might be unique to the UV platform.
The regression reports I got from Pavin and others were due to my original patch trimming down the page tables to the point where they didn't include some memory that was actually referenced, not processor speculation, because the regions were not explicitly included in the creation of the kexec page map. This was fixed by explicitly including those regions when creating the map.
Can you dump the page tables to see if the address you're referencing is included in those tables (or maybe you already did)? Can you give symbols and code around the RIP when you hit the #PF? It looks like this is in the region mentioned as the "Control page", so it's probably trampoline code that has been copied from somewhere else. I'm using my copy of perhaps-different kernel source than you have, given your exception handler modification.
Wait, I can't make sense of the dump. See more below.
What platform are you running on? And under what conditions (is this bare metal)? Is it really speculation that's causing your #PF? If so, you could cause it deterministically by, say, doing a quick checksum on that area you're not supposed to touch (0xc142000000 - 0xC1420fffff) and see if it faults every time. (As I said, I was thinking faults from speculation might be unique to the UV platform.)
We end up taking a #PF, usually on one of the 'rep mov's, one time on the 'pushq %r8' right before using it to 'ret' to identity_mapped. In each case it happens on the first *write* to a page.
Now that I can print %cr2 when it happens (instead of just going straight to triple-fault), I spot an interesting fact about the address. It's always *adjacent* to a region reserved by BIOS in the e820 data, and within the same 2MiB page.
I'm not at all certain, but this feels like a red herring. Be cautious.
[    0.000000] BIOS-e820: [mem 0x000000bfbe000000-0x000000c1420fffff] reserved
[    0.000000] BIOS-e820: [mem 0x000000c142100000-0x000000fc7fffffff] usable
2024-10-22 17:09:14.291000 kern NOTICE [   58.996257] kexec: Control page at c149431000
2024-10-22 17:09:14.291000 Y
2024-10-22 17:09:14.291000 rip:000000c1494312f8
2024-10-22 17:09:14.291000 rsp:000000c149431f90
2024-10-22 17:09:14.291000 Exc:000000000000000e
2024-10-22 17:09:14.291000 Err:0000000080000003
2024-10-22 17:09:14.291000 rax:000000c142130000
2024-10-22 17:09:14.291000 rbx:000000010d4b8020
2024-10-22 17:09:14.291000 rcx:0000000000000200
2024-10-22 17:09:14.291000 rdx:000000000009c000
2024-10-22 17:09:14.291000 rsi:000000000009c000
2024-10-22 17:09:14.291000 rdi:000000c142130000
2024-10-22 17:09:14.291000 r8 :000000c149431000
2024-10-22 17:09:14.291000 r9 :000000c149430000
2024-10-22 17:09:14.291000 r10:000000010d4bc000
2024-10-22 17:09:14.291000 r11:0000000000000000
2024-10-22 17:09:14.291000 r12:0000000000000000
2024-10-22 17:09:14.291000 r13:0000000000770ef0
2024-10-22 17:09:14.291000 r14:ffff8c82c0000000
2024-10-22 17:09:14.291000 r15:0000000000000000
2024-10-22 17:09:14.291000 cr2:000000c142130000
And bit 31 in the error code is set, which means it's an RMP violation.
RMP is AMD SEV related, right? I'm not familiar with SEV operation, but I have an itchy feeling it's involved in this problem.
I am having a hard time with the RIP listed above. Maybe your exception handler has affected it? My disassembly seems to show this address should be in a sea of 0xCC / int3 bytes past the end of swap pages.
Looks like we set up a 2MiB page covering the whole range from 0xc142000000 to 0xc142200000, but we aren't allowed to touch the first half of that.
Is it possible that, instead, some SEV tag is hanging around (TLB not completely cleared?) and a page that was otherwise free is causing the problem. Are you using SEV/SME in your system, and if you stop using it does it go away? (Although I have a feeling the answer is no and I'm barking up the wrong tree.)
The target of the pages above is c142130000. Have you checked to make sure that's a valid address in the page map?
For me it happens either with or without Steve's last patch, *but* clearing direct_gbpages did seem to make it go away (or at least reduced the incident rate far below the 1-crash-in-1000-kexecs which I was seeing before).
I assume you're referring to the "nogbpages" kernel option? My patch and the nogbpages option should have the exact same pages mapped in the page table. The difference being my patch would still use gbpages in places where a whole gbpage region is included in the map, nogbpages would use 2M pages to fill out the region. This *would* allocate more pages to the page table, which might be shifting things around on you.
I think Steve's original patch was just moving things around a little and because it allocate more pages for page tables, just happened to leave pages in the offending range to be allocated for writing to, for the unlucky victims.
I think the patch was actually along the right lines though, although it needs to go all the way down to 4KiB PTEs in some cases. And it could probably map anything that the e820 calls 'usable RAM', rather than really restricting itself to precisely the ranges which it's requested to map.
¹ I'll post that exception handler at some point once I've tidied it up.
I hope this might be of some help. Good luck, I'll pitch in any way I can.
--> Steve
On Tue, 2024-10-22 at 17:06 -0500, Steve Wahl wrote:
On Tue, Oct 22, 2024 at 07:51:38PM +0100, David Woodhouse wrote:
I spent all of Monday setting up a full GDT, IDT and exception handler for the relocate_kernel() environment¹, and I think these reports may have been the same as what I've been debugging.
David,
My original problem involved UV platform hardware catching a speculative access into the reserved areas, which caused a BIOS HALT. Reducing the use of gbpages in the page table kept the speculation from hitting those areas. I would believe this sort of thing might be unique to the UV platform.
The regression reports I got from Pavin and others were due to my original patch trimming down the page tables to the point where they didn't include some memory that was actually referenced, not processor speculation, because the regions were not explicitly included in the creation of the kexec page map. This was fixed by explicitly including those regions when creating the map.
Hm, I didn't see that part of the discussion. I saw that such was a theory, but haven't seen specific confirmation and fixes. And your original patch was reverted and still not reapplied, AFAICT.
I did note that the victims all seemed to be using AMD CPUs, so it seemed likely that at least *some* of them were suffering the same problem that I've found.
Do you have references please?
If anyone is still seeing such problems either with or without your patch, they can run with my exception handler and get an actual dump instead of a triple-fault.
(I'm also pushing CPU vendors to give us information from the triple- fault through the machine check architecture. It's awful having to do this blind. For VMs, I also had plans to register a crashdump kernel entry point with the hypervisor, so that on a triple fault the *hypervisor* could jump state of all the vCPUs to the configured location, then restart one CPU in the crash kernel for it to do its own dump).
Can you dump the page tables to see if the address you're referencing is included in those tables (or maybe you already did)? Can you give symbols and code around the RIP when you hit the #PF? It looks like this is in the region mentioned as the "Control page", so it's probably trampoline code that has been copied from somewhere else. I'm using my copy of perhaps-different kernel source than you have, given your exception handler modification.
Wait, I can't make sense of the dump. See more below.
What platform are you running on? And under what conditions (is this bare metal)? Is it really speculation that's causing your #PF? If so, you could cause it deterministically by, say, doing a quick checksum on that area you're not supposed to touch (0xc142000000 - 0xC1420fffff) and see if it faults every time. (As I said, I was thinking faults from speculation might be unique to the UV platform.)
Yes, it's bare metal. AMD Genoa. No, it's not speculation. It's because we have a single 2MiB page which covers *both* the RMP table (1MiB reserved by BIOS in e820 as I showed), and a page that was allocated for the kimage. If I understand correctly, the hardware raises that fault (with bit 31 in the error code) when refusing to populate that TLB entry for writing.
According to the AMD manual we're allowed to *read* but not write.
We end up taking a #PF, usually on one of the 'rep mov's, one time on the 'pushq %r8' right before using it to 'ret' to identity_mapped. In each case it happens on the first *write* to a page.
Now that I can print %cr2 when it happens (instead of just going straight to triple-fault), I spot an interesting fact about the address. It's always *adjacent* to a region reserved by BIOS in the e820 data, and within the same 2MiB page.
I'm not at all certain, but this feels like a red herring. Be cautious.
It wouldn't be our first in this journey, but I'm actually fairly confident this time. :)
[    0.000000] BIOS-e820: [mem 0x000000bfbe000000-0x000000c1420fffff] reserved
[    0.000000] BIOS-e820: [mem 0x000000c142100000-0x000000fc7fffffff] usable
2024-10-22 17:09:14.291000 kern NOTICE [   58.996257] kexec: Control page at c149431000
2024-10-22 17:09:14.291000 Y
2024-10-22 17:09:14.291000 rip:000000c1494312f8
2024-10-22 17:09:14.291000 rsp:000000c149431f90
2024-10-22 17:09:14.291000 Exc:000000000000000e
2024-10-22 17:09:14.291000 Err:0000000080000003
2024-10-22 17:09:14.291000 rax:000000c142130000
2024-10-22 17:09:14.291000 rbx:000000010d4b8020
2024-10-22 17:09:14.291000 rcx:0000000000000200
2024-10-22 17:09:14.291000 rdx:000000000009c000
2024-10-22 17:09:14.291000 rsi:000000000009c000
2024-10-22 17:09:14.291000 rdi:000000c142130000
2024-10-22 17:09:14.291000 r8 :000000c149431000
2024-10-22 17:09:14.291000 r9 :000000c149430000
2024-10-22 17:09:14.291000 r10:000000010d4bc000
2024-10-22 17:09:14.291000 r11:0000000000000000
2024-10-22 17:09:14.291000 r12:0000000000000000
2024-10-22 17:09:14.291000 r13:0000000000770ef0
2024-10-22 17:09:14.291000 r14:ffff8c82c0000000
2024-10-22 17:09:14.291000 r15:0000000000000000
2024-10-22 17:09:14.291000 cr2:000000c142130000
And bit 31 in the error code is set, which means it's an RMP violation.
RMP is AMD SEV related, right? I'm not familiar with SEV operation, but I have an itchy feeling it's involved in this problem.
I am having a hard time with the RIP listed above. Maybe your exception handler has affected it? My disassembly seems to show this address should be in a sea of 0xCC / int3 bytes past the end of swap pages.
You'd have to have access to my kernel binary to have a hope of knowing that, surely? I don't think I checked that particular one, but it's normally one of the 'rep mov's in relocate_kernel_64.S.
Looks like we set up a 2MiB page covering the whole range from 0xc142000000 to 0xc142200000, but we aren't allowed to touch the first half of that.
Is it possible that, instead, some SEV tag is hanging around (TLB not completely cleared?) and a page that was otherwise free is causing the problem? Are you using SEV/SME in your system, and if you stop using it does it go away? (Although I have a feeling the answer is no and I'm barking up the wrong tree.)
The target of the copies above is c142130000. Have you checked to make sure that's a valid address in the page map?
Yeah, we dumped the page tables and it's present.
For me it happens either with or without Steve's last patch, *but* clearing direct_gbpages did seem to make it go away (or at least reduced the incident rate far below the 1-crash-in-1000-kexecs which I was seeing before).
I assume you're referring to the "nogbpages" kernel option?
Nah, I just commented out the lines in init_pgtable() which set info.direct_gbpages=true.
My patch and the nogbpages option should have the exact same pages mapped in the page table. The difference being my patch would still use gbpages in places where a whole gbpage region is included in the map, nogbpages would use 2M pages to fill out the region. This *would* allocate more pages to the page table, which might be shifting things around on you.
Right. In fact the first trigger for this, in our case, was an innocuous change to the NMI watchdog period — which sent us on a *long* wild goose chase based on the assumption that it was a stray perf NMI causing the triple-faults, when in fact that was just shifting things around on us too, and causing pages in that dangerous 1MiB to be chosen for the kimage.
I think Steve's original patch was just moving things around a little: because it allocates more pages for page tables, it happened to leave pages in the offending range to be allocated for writing, for the unlucky victims.
I think the patch was actually along the right lines though, although it needs to go all the way down to 4KiB PTEs in some cases. And it could probably map anything that the e820 calls 'usable RAM', rather than really restricting itself to precisely the ranges which it's requested to map.
¹ I'll post that exception handler at some point once I've tidied it up.
I hope this might be of some help. Good luck, I'll pitch in any way I can.
Thanks.
On 10/23/2024 2:39 AM, David Woodhouse wrote:
On Tue, 2024-10-22 at 17:06 -0500, Steve Wahl wrote:
On Tue, Oct 22, 2024 at 07:51:38PM +0100, David Woodhouse wrote:
I spent all of Monday setting up a full GDT, IDT and exception handler for the relocate_kernel() environment¹, and I think these reports may have been the same as what I've been debugging.
David,
My original problem involved UV platform hardware catching a speculative access into the reserved areas, that caused a BIOS HALT. Reducing the use of gbpages in the page table kept the speculation from hitting those areas. I would believe this sort of thing might be unique to the UV platform.
The regression reports I got from Pavin and others were due to my original patch trimming down the page tables to the point where they didn't include some memory that was actually referenced, not processor speculation, because the regions were not explicitly included in the creation of the kexec page map. This was fixed by explicitly including those regions when creating the map.
Hm, I didn't see that part of the discussion. I saw that such was a theory, but haven't seen specific confirmation and fixes. And your original patch was reverted and still not reapplied, AFAICT.
I did note that the victims all seemed to be using AMD CPUs, so it seemed likely that at least *some* of them were suffering the same problem that I've found.
Do you have references please?
If anyone is still seeing such problems either with or without your patch, they can run with my exception handler and get an actual dump instead of a triple-fault.
(I'm also pushing CPU vendors to give us information from the triple- fault through the machine check architecture. It's awful having to do this blind. For VMs, I also had plans to register a crashdump kernel entry point with the hypervisor, so that on a triple fault the *hypervisor* could dump state of all the vCPUs to the configured location, then restart one CPU in the crash kernel for it to do its own dump).
Can you dump the page tables to see if the address you're referencing is included in those tables (or maybe you already did)? Can you give symbols and code around the RIP when you hit the #PF? It looks like this is in the region mentioned as the "Control page", so it's probably trampoline code that has been copied from somewhere else. I'm using my copy of perhaps different kernel source than you have, given your exception handler modification.
Wait, I can't make sense of the dump. See more below.
What platform are you running on? And under what conditions (is this bare metal)? Is it really speculation that's causing your #PF? If so, you could cause it deterministically by, say, doing a quick checksum on that area you're not supposed to touch (0xc142000000 - 0xC1420fffff) and see if it faults every time. (As I said, I was thinking faults from speculation might be unique to the UV platform.)
Yes, it's bare metal. AMD Genoa. No, it's not speculation. It's because we have a single 2MiB page which covers *both* the RMP table (1MiB reserved by BIOS in e820 as I showed), and a page that was allocated for the kimage. If I understand correctly, the hardware raises that fault (with bit 31 in the error code) when refusing to populate that TLB entry for writing.
As mentioned above, about the same 2MB page containing the end portion of the RMP table and a page allocated for kexec and looking at the e820 memory map dump here:
[ 0.000000] BIOS-e820: [mem 0x000000bfbe000000-0x000000c1420fffff] reserved
[ 0.000000] BIOS-e820: [mem 0x000000c142100000-0x000000fc7fffffff] usable
As seen here in the e820 memory map, the end range of the RMP table is not aligned to 2MB and not reserved but it is usable as RAM.
Subsequently, kexec-ed kernel could try to allocate from within that chunk which then causes a fatal RMP fault.
This issue has been fixed with the following patch: https://lore.kernel.org/lkml/171438476623.10875.16783275868264913579.tip-bot...
Thanks, Ashish
On Wed, 2024-10-23 at 06:07 -0500, Kalra, Ashish wrote:
As mentioned above, about the same 2MB page containing the end portion of the RMP table and a page allocated for kexec and looking at the e820 memory map dump here:
[ 0.000000] BIOS-e820: [mem 0x000000bfbe000000-0x000000c1420fffff] reserved
[ 0.000000] BIOS-e820: [mem 0x000000c142100000-0x000000fc7fffffff] usable
As seen here in the e820 memory map, the end range of the RMP table is not aligned to 2MB and not reserved but it is usable as RAM.
Subsequently, kexec-ed kernel could try to allocate from within that chunk which then causes a fatal RMP fault.
Well, allocating within that chunk would be just fine. It *is* usable as RAM, as the e820 table says. It works fine most of the time.
You've missed a step out of the story. The problem is that for kexec we map it with an "overreaching" 2MiB PTE which also covers the reserved regions, and *that* is what causes the RMP violation fault.
We could take two possible viewpoints here. I was taking the viewpoint that this is a kernel bug, that it *shouldn't* be setting up 2MiB pages which include a reserved region, and should break those down to 4KiB pages.
The alternative view would be to consider it a BIOS bug, and to say that the BIOS really *ought* to have reserved the whole 2MiB region to avoid the 'sharing'. Since the hardware apparently already breaks down 1GiB pages to 2MiB TLB entries in order to avoid triggering the problem on 1GiB mappings.
This issue has been fixed with the following patch: https://lore.kernel.org/lkml/171438476623.10875.16783275868264913579.tip-bot...
Thanks for pointing that patch out! Should it have been Cc:stable?
It seems to be taking the latter of the above two viewpoints, that this is a BIOS bug and that the BIOS *should* have reserved the whole 2MiB.
In that case are fixed BIOSes available already? This patch makes sense as a temporary workaround (we have ways to print warnings about BIOS bugs, btw), but I don't really like it as a longer-term "fix". What if the BIOS had put *other* things into that other 1MiB of address space? What if the bootloader had loaded something there?
I'm still inclined to suggest that kexec *shouldn't* use over-reaching large pages which cover anything that isn't marked as usable RAM.
On 10/23/2024 6:39 AM, David Woodhouse wrote:
On Wed, 2024-10-23 at 06:07 -0500, Kalra, Ashish wrote:
As mentioned above, about the same 2MB page containing the end portion of the RMP table and a page allocated for kexec and looking at the e820 memory map dump here:
[ 0.000000] BIOS-e820: [mem 0x000000bfbe000000-0x000000c1420fffff] reserved
[ 0.000000] BIOS-e820: [mem 0x000000c142100000-0x000000fc7fffffff] usable
As seen here in the e820 memory map, the end range of the RMP table is not aligned to 2MB and not reserved but it is usable as RAM.
Subsequently, kexec-ed kernel could try to allocate from within that chunk which then causes a fatal RMP fault.
Well, allocating within that chunk would be just fine. It *is* usable as RAM, as the e820 table says. It works fine most of the time.
You've missed a step out of the story. The problem is that for kexec we map it with an "overreaching" 2MiB PTE which also covers the reserved regions, and *that* is what causes the RMP violation fault.
Actually, the RMP entry covering the end range of the RMP table will be a 2MB/large entry which means that the whole 2MB including the usable 1MB memory range here will also be marked as reserved in the RMP table and hence any host writes into this memory range will trigger the RMP violation.
We could take two possible viewpoints here. I was taking the viewpoint that this is a kernel bug, that it *shouldn't* be setting up 2MiB pages which include a reserved region, and should break those down to 4KiB pages.
The alternative view would be to consider it a BIOS bug, and to say that the BIOS really *ought* to have reserved the whole 2MiB region to avoid the 'sharing'. Since the hardware apparently already breaks down 1GiB pages to 2MiB TLB entries in order to avoid triggering the problem on 1GiB mappings.
This issue has been fixed with the following patch: https://lore.kernel.org/lkml/171438476623.10875.16783275868264913579.tip-bot...
Thanks for pointing that patch out! Should it have been Cc:stable?
This thing can happen after SNP host support got merged in 6.11 and SNP support is enabled, therefore the patch does not mark it Cc:stable.
I am trying to understand the scenario here: you have SNP enabled in the BIOS and you also have SNP support added in the host kernel, which means that the following logs are seen: .. SEV-SNP: RMP table physical range [0x000000xxxxxxxxxx - 0x000000yyyyyyyyyy] ..
It seems to be taking the latter of the above two viewpoints, that this is a BIOS bug and that the BIOS *should* have reserved the whole 2MiB.
In that case are fixed BIOSes available already?
We have been of the view that it is easier to get it fixed in kernel, by fixing/aligning the e820 range mapping the start and end of RMP table to 2MB boundaries, rather than trusting a BIOS to do it correctly.
Here is a link to a discussion on the same: https://lore.kernel.org/all/2ab14f6f-2690-056b-cf9e-38a12dafd728@amd.com/
Thanks, Ashish
On Wed, 2024-10-23 at 08:29 -0500, Kalra, Ashish wrote:
On 10/23/2024 6:39 AM, David Woodhouse wrote:
On Wed, 2024-10-23 at 06:07 -0500, Kalra, Ashish wrote:
As mentioned above, about the same 2MB page containing the end portion of the RMP table and a page allocated for kexec and looking at the e820 memory map dump here:
[ 0.000000] BIOS-e820: [mem 0x000000bfbe000000-0x000000c1420fffff] reserved
[ 0.000000] BIOS-e820: [mem 0x000000c142100000-0x000000fc7fffffff] usable
As seen here in the e820 memory map, the end range of the RMP table is not aligned to 2MB and not reserved but it is usable as RAM.
Subsequently, kexec-ed kernel could try to allocate from within that chunk which then causes a fatal RMP fault.
Well, allocating within that chunk would be just fine. It *is* usable as RAM, as the e820 table says. It works fine most of the time.
You've missed a step out of the story. The problem is that for kexec we map it with an "overreaching" 2MiB PTE which also covers the reserved regions, and *that* is what causes the RMP violation fault.
Actually, the RMP entry covering the end range of the RMP table will be a 2MB/large entry which means that the whole 2MB including the usable 1MB memory range here will also be marked as reserved in the RMP table and hence any host writes into this memory range will trigger the RMP violation.
Hm, that does not match our testing. We tried writing to the "offending" area from the main kernel (which I assume was using 4KiB pages for it, but didn't verify), and that was fine.
It also doesn't match what Tom says in the email you linked to:
"There's no requirement from a hardware/RMP usage perspective that requires a 2MB alignment, so BIOS is not doing anything wrong. The problem occurs because kexec is initially using 2MB mappings that overlap the start and/or end of the RMP which then results in an RMP fault when memory within one of those 2MB mappings, that is not part of the RMP, is referenced."
Tom's words precisely match my understanding of the situation (with the exception that he keeps saying 2MB when he means 2MiB).
I believe we *can* use that extra 1MiB which is marked as 'usable RAM' as usable RAM if we want to, as *long* as we don't use a 2MiB (or larger) PTE for it which would overlap the RMP table.
And the only case where the kernel uses an "overreaching" 2MiB mapping is the kexec identmap code, so we should just fix that.
We could take two possible viewpoints here. I was taking the viewpoint that this is a kernel bug, that it *shouldn't* be setting up 2MiB pages which include a reserved region, and should break those down to 4KiB pages.
The alternative view would be to consider it a BIOS bug, and to say that the BIOS really *ought* to have reserved the whole 2MiB region to avoid the 'sharing'. Since the hardware apparently already breaks down 1GiB pages to 2MiB TLB entries in order to avoid triggering the problem on 1GiB mappings.
This issue has been fixed with the following patch: https://lore.kernel.org/lkml/171438476623.10875.16783275868264913579.tip-bot...
Thanks for pointing that patch out! Should it have been Cc:stable?
This thing can happen after SNP host support got merged in 6.11 and SNP support is enabled, therefore the patch does not mark it Cc:stable.
I am trying to understand the scenario here: you have SNP enabled in the BIOS and you also have SNP support added in the host kernel, which means that the following logs are seen: .. SEV-SNP: RMP table physical range [0x000000xxxxxxxxxx - 0x000000yyyyyyyyyy] ..
Ah yes. SEV-SNP isn't actually being *used* on these Genoa platforms at the moment, but I do think it's enabled in the kernel.
If this problem only happens when the kernel actually *enables* SEV- SNP, then it seems this fix was missed in our backporting of SEV-SNP support to, ahem, a slightly older kernel.
But I still don't like it :)
It seems to be taking the latter of the above two viewpoints, that this is a BIOS bug and that the BIOS *should* have reserved the whole 2MiB.
In that case are fixed BIOSes available already?
We have been of the view that it is easier to get it fixed in kernel, by fixing/aligning the e820 range mapping the start and end of RMP table to 2MB boundaries, rather than trusting a BIOS to do it correctly.
Here is a link to a discussion on the same: https://lore.kernel.org/all/2ab14f6f-2690-056b-cf9e-38a12dafd728@amd.com/
As noted above, that message clearly states that the BIOS isn't doing anything wrong, and the problem is the kernel using large page mappings that overlap reserved ranges.
In that case, shouldn't we fix the kernel *not* to do that?
I suppose we can be OK with "let's just avoid using that memory to workaround the kexec/identmap bug", but in that case let's not claim that we're working around a BIOS bug?
Hello David,
On 10/23/2024 8:50 AM, David Woodhouse wrote:
On Wed, 2024-10-23 at 08:29 -0500, Kalra, Ashish wrote:
On 10/23/2024 6:39 AM, David Woodhouse wrote:
On Wed, 2024-10-23 at 06:07 -0500, Kalra, Ashish wrote:
As mentioned above, about the same 2MB page containing the end portion of the RMP table and a page allocated for kexec and looking at the e820 memory map dump here:
[ 0.000000] BIOS-e820: [mem 0x000000bfbe000000-0x000000c1420fffff] reserved
[ 0.000000] BIOS-e820: [mem 0x000000c142100000-0x000000fc7fffffff] usable
As seen here in the e820 memory map, the end range of the RMP table is not aligned to 2MB and not reserved but it is usable as RAM.
Subsequently, kexec-ed kernel could try to allocate from within that chunk which then causes a fatal RMP fault.
Well, allocating within that chunk would be just fine. It *is* usable as RAM, as the e820 table says. It works fine most of the time.
You've missed a step out of the story. The problem is that for kexec we map it with an "overreaching" 2MiB PTE which also covers the reserved regions, and *that* is what causes the RMP violation fault.
Actually, the RMP entry covering the end range of the RMP table will be a 2MB/large entry which means that the whole 2MB including the usable 1MB memory range here will also be marked as reserved in the RMP table and hence any host writes into this memory range will trigger the RMP violation.
Hm, that does not match our testing. We tried writing to the "offending" area from the main kernel (which I assume was using 4KiB pages for it, but didn't verify), and that was fine.
It also doesn't match what Tom says in the email you linked to:
"There's no requirement from a hardware/RMP usage perspective that requires a 2MB alignment, so BIOS is not doing anything wrong. The problem occurs because kexec is initially using 2MB mappings that overlap the start and/or end of the RMP which then results in an RMP fault when memory within one of those 2MB mappings, that is not part of the RMP, is referenced."
Tom's words precisely match my understanding of the situation (with the exception that he keeps saying 2MB when he means 2MiB).
I believe we *can* use that extra 1MiB which is marked as 'usable RAM' as usable RAM if we want to, as *long* as we don't use a 2MiB (or larger) PTE for it which would overlap the RMP table.
And the only case where the kernel uses an "overreaching" 2MiB mapping is the kexec identmap code, so we should just fix that.
Here is a more *correct* explanation of the issue after discussing it with Tom:
The RMP entries for the RMP table memory are marked as firmware pages (meaning they are assigned and immutable - with the key point being the assigned bit is set). If the start or end of the RMP table is not 2MB aligned, then the RMP entries are broken down into 4k entries. Using a 2MB page table mapping (kexec identmap code using a 2MiB PTE), if the kernel tries to access the portion of the memory that is within the 2MB page but is not the RMP table, an RMP fault will be generated because the mappings don't match.
This is documented in AMD64 Architecture Programmer's Manual Volume 2, section 15.36.10 - RMP and VMPL Access Checks.
As this RMP entry covers both the RMP table and a usable memory range, it needs to be smashed into 4k entries.
So, the RMP fault is being generated here because of the page size mapping mismatch between the RMP entry and the page table entry.
We could take two possible viewpoints here. I was taking the viewpoint that this is a kernel bug, that it *shouldn't* be setting up 2MiB pages which include a reserved region, and should break those down to 4KiB pages.
The alternative view would be to consider it a BIOS bug, and to say that the BIOS really *ought* to have reserved the whole 2MiB region to avoid the 'sharing'. Since the hardware apparently already breaks down 1GiB pages to 2MiB TLB entries in order to avoid triggering the problem on 1GiB mappings.
This issue has been fixed with the following patch: https://lore.kernel.org/lkml/171438476623.10875.16783275868264913579.tip-bot...
Thanks for pointing that patch out! Should it have been Cc:stable?
This thing can happen after SNP host support got merged in 6.11 and SNP support is enabled, therefore the patch does not mark it Cc:stable.
I am trying to understand the scenario here: you have SNP enabled in the BIOS and you also have SNP support added in the host kernel, which means that the following logs are seen: .. SEV-SNP: RMP table physical range [0x000000xxxxxxxxxx - 0x000000yyyyyyyyyy] ..
Ah yes. SEV-SNP isn't actually being *used* on these Genoa platforms at the moment, but I do think it's enabled in the kernel.
If this problem only happens when the kernel actually *enables* SEV- SNP, then it seems this fix was missed in our backporting of SEV-SNP support to, ahem, a slightly older kernel.
Yes, this problem only happens when kernel enables SEV-SNP support and SNP_INIT_EX has been done by the CCP driver to initialize SEV-SNP support.
But I still don't like it :)
It seems to be taking the latter of the above two viewpoints, that this is a BIOS bug and that the BIOS *should* have reserved the whole 2MiB.
In that case are fixed BIOSes available already?
We have been of the view that it is easier to get it fixed in kernel, by fixing/aligning the e820 range mapping the start and end of RMP table to 2MB boundaries, rather than trusting a BIOS to do it correctly.
Here is a link to a discussion on the same: https://lore.kernel.org/all/2ab14f6f-2690-056b-cf9e-38a12dafd728@amd.com/
As noted above, that message clearly states that the BIOS isn't doing anything wrong, and the problem is the kernel using large page mappings that overlap reserved ranges.
In that case, shouldn't we fix the kernel *not* to do that?
I suppose we can be OK with "let's just avoid using that memory to workaround the kexec/identmap bug", but in that case let's not claim that we're working around a BIOS bug?
Yes, this is the approach we are taking currently to workaround the kexec/identmap bug with the above patch.
Do note, we *need* to do the e820 memory map fixups and additionally do memblock_reserve() to ensure that this usable part of memory adjacent to the RMP table does not get allocated to guests, otherwise it causes RMPUPDATE on this range of memory to fail, fixed with the following patch:
https://lore.kernel.org/lkml/172968164814.1442.8035313578482871705.tip-bot2@...
Thanks, Ashish
On Wed, Oct 23, 2024 at 08:39:40AM +0100, David Woodhouse wrote:
On Tue, 2024-10-22 at 17:06 -0500, Steve Wahl wrote:
On Tue, Oct 22, 2024 at 07:51:38PM +0100, David Woodhouse wrote:
I spent all of Monday setting up a full GDT, IDT and exception handler for the relocate_kernel() environment¹, and I think these reports may have been the same as what I've been debugging.
David,
My original problem involved UV platform hardware catching a speculative access into the reserved areas, that caused a BIOS HALT. Reducing the use of gbpages in the page table kept the speculation from hitting those areas. I would believe this sort of thing might be unique to the UV platform.
The regression reports I got from Pavin and others were due to my original patch trimming down the page tables to the point where they didn't include some memory that was actually referenced, not processor speculation, because the regions were not explicitly included in the creation of the kexec page map. This was fixed by explicitly including those regions when creating the map.
Hm, I didn't see that part of the discussion. I saw that such was a theory, but haven't seen specific confirmation and fixes. And your original patch was reverted and still not reapplied, AFAICT.
It has been reapplied. I think the cover letter explains the whole thing.
https://lore.kernel.org/all/20240717213121.3064030-1-steve.wahl@hpe.com/
I did note that the victims all seemed to be using AMD CPUs, so it seemed likely that at least *some* of them were suffering the same problem that I've found.
Do you have references please?
If anyone is still seeing such problems either with or without your patch, they can run with my exception handler and get an actual dump instead of a triple-fault.
(I'm also pushing CPU vendors to give us information from the triple- fault through the machine check architecture. It's awful having to do this blind. For VMs, I also had plans to register a crashdump kernel entry point with the hypervisor, so that on a triple fault the *hypervisor* could dump state of all the vCPUs to the configured location, then restart one CPU in the crash kernel for it to do its own dump).
Can you dump the page tables to see if the address you're referencing is included in those tables (or maybe you already did)? Can you give symbols and code around the RIP when you hit the #PF? It looks like this is in the region mentioned as the "Control page", so it's probably trampoline code that has been copied from somewhere else. I'm using my copy of perhaps different kernel source than you have, given your exception handler modification.
Wait, I can't make sense of the dump. See more below.
What platform are you running on? And under what conditions (is this bare metal)? Is it really speculation that's causing your #PF? If so, you could cause it deterministically by, say, doing a quick checksum on that area you're not supposed to touch (0xc142000000 - 0xC1420fffff) and see if it faults every time. (As I said, I was thinking faults from speculation might be unique to the UV platform.)
Yes, it's bare metal. AMD Genoa. No, it's not speculation. It's because we have a single 2MiB page which covers *both* the RMP table (1MiB reserved by BIOS in e820 as I showed), and a page that was allocated for the kimage. If I understand correctly, the hardware raises that fault (with bit 31 in the error code) when refusing to populate that TLB entry for writing.
According to the AMD manual we're allowed to *read* but not write.
Ah, I get it, thanks for the explanation.
That "feature" of not allowing a writable TLB entry to span reserved and non-reserved areas is quite a departure from what has previously been allowed in x86_64, if you asked me. But nobody did ask me, nor should they have. :-)
We end up taking a #PF, usually on one of the 'rep mov's, one time on the 'pushq %r8' right before using it to 'ret' to identity_mapped. In each case it happens on the first *write* to a page.
Now that I can print %cr2 when it happens (instead of just going straight to triple-fault), I spot an interesting fact about the address. It's always *adjacent* to a region reserved by BIOS in the e820 data, and within the same 2MiB page.
I'm not at all certain, but this feels like a red herring. Be cautious.
It wouldn't be our first in this journey, but I'm actually fairly confident this time. :)
[    0.000000] BIOS-e820: [mem 0x000000bfbe000000-0x000000c1420fffff] reserved
[    0.000000] BIOS-e820: [mem 0x000000c142100000-0x000000fc7fffffff] usable
2024-10-22 17:09:14.291000 kern NOTICE [   58.996257] kexec: Control page at c149431000
rip:000000c1494312f8  rsp:000000c149431f90
Exc:000000000000000e  Err:0000000080000003
rax:000000c142130000  rbx:000000010d4b8020
rcx:0000000000000200  rdx:000000000009c000
rsi:000000000009c000  rdi:000000c142130000
r8 :000000c149431000  r9 :000000c149430000
r10:000000010d4bc000  r11:0000000000000000
r12:0000000000000000  r13:0000000000770ef0
r14:ffff8c82c0000000  r15:0000000000000000
cr2:000000c142130000
And bit 31 in the error code is set, which means it's an RMP violation.
RMP is AMD SEV related, right? I'm not familiar with SEV operation, but I have an itchy feeling it's involved in this problem.
I am having a hard time with the RIP listed above. Maybe your exception handler has affected it? My disassembly seems to show this address should be in a sea of 0xCC / int3 bytes past the end of swap pages.
You'd have to have access to my kernel binary to have a hope of knowing that, surely? I don't think I checked that particular one, but it's normally one of the 'rep mov's in relocate_kernel_64.S.
It all depends on how much your kernel source differs from mine, actually. The above info includes the address of the control page, and that's where the code from relocate_kernel_64.S gets copied to.
The RIP of c1494312f8 is 0x2f8 into the control page; after machine_kexec_64.c copies the relocate kernel code there, that address should represent <relocate_kernel> + 0x2f8. This code comes from assembly source, so it isn't very compiler dependent, and there aren't many kernel-config-dependent macros in there either.
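As a quick sanity check of that arithmetic (both values taken from the dump above):

```shell
# Control page reported at 0xc149431000, faulting RIP at 0xc1494312f8;
# the offset into the copied relocate_kernel code is therefore:
printf '%#x\n' $(( 0xc1494312f8 - 0xc149431000 ))
```

With a vmlinux built from the same source and config, `objdump -d` on the `relocate_kernel` symbol plus that offset should identify the faulting instruction.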
I'm thinking the reason yours differs from mine is probably debug you added (your exception handler maybe).
It was way back yesterday, but I think I reasoned since rcx was 512, rax == rdi, and rdx == rsi, we're probably at the "middle" rep ; movsq line, under the comment "copy destination page to source page". But RIP did not make sense.
Looks like we set up a 2MiB page covering the whole range from 0xc142000000 to 0xc142200000, but we aren't allowed to touch the first half of that.
Is it possible that, instead, some SEV tag is hanging around (a TLB entry not completely cleared?) and a page that was otherwise free is causing the problem? Are you using SEV/SME in your system, and if you stop using it does it go away? (Although I have a feeling the answer is no and I'm barking up the wrong tree.)
The target of the pages above is c142130000. Have you checked to make sure that's a valid address in the page map?
Yeah, we dumped the page tables and it's present.
For me it happens either with or without Steve's last patch, *but* clearing direct_gbpages did seem to make it go away (or at least reduced the incident rate far below the 1-crash-in-1000-kexecs which I was seeing before).
I assume you're referring to the "nogbpages" kernel option?
Nah, I just commented out the lines in init_pgtable() which set info.direct_gbpages=true.
Ack.
My patch and the nogbpages option should have the exact same pages mapped in the page table. The difference is that my patch would still use gbpages in places where a whole gbpage region is included in the map, while nogbpages would use 2M pages to fill out the region. This *would* allocate more pages to the page table, which might be shifting things around on you.
Right. In fact the first trigger for this, in our case, was an innocuous change to the NMI watchdog period — which sent us on a *long* wild goose chase based on the assumption that it was a stray perf NMI causing the triple-faults, when in fact that was just shifting things around on us too, and causing pages in that dangerous 1MiB to be chosen for the kimage.
I think Steve's original patch was just moving things around a little: because it allocates more pages for page tables, it just happened to leave pages in the offending range free to be allocated, and written to, for the unlucky victims.
I think the patch was actually along the right lines, though it needs to go all the way down to 4KiB PTEs in some cases. And it could probably map anything that the e820 calls 'usable RAM', rather than restricting itself to precisely the ranges which it's requested to map.
¹ I'll post that exception handler at some point once I've tidied it up.
I hope this might be of some help. Good luck, I'll pitch in any way I can.
Thanks.
Sounds like the rest of the community have got you much further than I have!
--> Steve