On Wed, Mar 13, 2024 at 07:16:23AM -0500, Eric W. Biederman wrote:
Kexec happens on identity mapped page tables.
The files of interest are machine_kexec_64.c and relocate_kernel_64.S.
I suspect either the building of the identity mapped page table in machine_kexec_prepare, or the switching to the page table in identity_mapped in relocate_kernel_64.S is where something goes wrong.
Probably in kernel_ident_mapping_init as that code is directly used to build the identity mapped page tables.
Hmm.
Your change is commit d794734c9bbf ("x86/mm/ident_map: Use gbpages only where full GB page should be mapped.")
Yeah, sorry, I accidentally used the stable cherry-pick commit id that Pavin Joseph found with his bisect results.
Given the simplicity of that change itself, my guess is that somewhere in the first 1GB there are pages that needed to be mapped, like the IDT at 0, that are not getting mapped.
...
It might be worth setting up early printk on some of these systems and seeing if the failure is in early boot up of the new kernel (that is using kexec supplied identity mapped pages) rather than in kexec per se.
But that is just my guess at the moment.
Thanks for the input. I was thinking in terms of running out of memory somewhere because we're using more page table entries than we used to. But you've got me thinking that maybe some necessary region is not explicitly requested to be placed in the identity map, but is by luck included in the rounding errors when we use gbpages.
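
As a rough illustration of that rounding effect, here is a simplified Python model (not kernel code). It assumes a mapping simply rounds the requested range out to the page size in use; the addresses and the helper names covered_range and is_covered are made up for the example.

```python
# Simplified model of identity-map coverage; NOT kernel code.
# All addresses below are hypothetical and chosen only for illustration.
GB = 1 << 30          # 1 GiB page size (gbpages)
MB2 = 2 << 20         # 2 MiB page size

def covered_range(start, end, page_size):
    """Return the [lo, hi) range actually mapped when [start, end) is
    mapped with pages of page_size: start rounds down, end rounds up."""
    lo = start & ~(page_size - 1)
    hi = (end + page_size - 1) & ~(page_size - 1)
    return lo, hi

def is_covered(addr, length, mapped):
    """True if [addr, addr + length) lies entirely inside the mapped range."""
    lo, hi = mapped
    return lo <= addr and addr + length <= hi

# Hypothetical region that is explicitly requested in the identity map.
requested = (0x40200000, 0x40A00000)
# Hypothetical nearby region that is never requested but is still accessed.
neighbor_addr, neighbor_len = 0x7F000000, 0x1000

with_gb = covered_range(requested[0], requested[1], GB)
with_2m = covered_range(requested[0], requested[1], MB2)

print("gbpages :", hex(with_gb[0]), hex(with_gb[1]),
      is_covered(neighbor_addr, neighbor_len, with_gb))
print("2M pages:", hex(with_2m[0]), hex(with_2m[1]),
      is_covered(neighbor_addr, neighbor_len, with_2m))
```

In this model the gbpage mapping spans the whole surrounding gigabyte, so the unrequested neighbor happens to be reachable, while the 2M mapping covers only the requested range and leaves the neighbor unmapped.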
At any rate, since I am still unable to reproduce this for myself, I am going to contact Pavin Joseph off-list and see if he's willing to do a few debugging kernel steps for me and send me the results, to see if I can get this figured out. (I believe trimming the CC list and/or going private is usually frowned upon for the LKML, but I think this is appropriate as it only adds noise for the rest. Let me know if I'm wrong.)
Thank you.
--> Steve Wahl