Steve Wahl <steve.wahl@hpe.com> writes:
[*really* added kexec maintainers this time.]
Full thread starts here: https://lore.kernel.org/all/3a1b9909-45ac-4f97-ad68-d16ef1ce99db@pavinjoseph...
On Wed, Mar 13, 2024 at 12:12:31AM +0530, Pavin Joseph wrote:
On 3/12/24 20:43, Steve Wahl wrote:
But I don't want to introduce a new command line parameter if the actual problem can be understood and fixed. The question is how much time do I have to pursue a direct fix before some other action needs to be taken?
Perhaps the kexec maintainers [0] can be made aware of this and you could coordinate with them on a potential fix?
[0] Currently maintained by:
P: Simon Horman
M: horms@verge.net.au
L: kexec@lists.infradead.org
Probably a good idea to add kexec people to the list, so I've added them to this email.
Everyone, my recent patch to the kernel that changed identity mapping:
7143c5f4cf2073193 x86/mm/ident_map: Use gbpages only where full GB page should be mapped.
... has broken kexec on a few machines. The symptom is they do a full BIOS reboot instead of a kexec of the new kernel. Seems to be limited to AMD processors, but it's not all AMD processors, probably just some characteristic that they happen to share.
The same machines that are broken by my patch are also broken in previous kernels if you add "nogbpages" to the kernel command line (which makes the identity map bigger, since "nogbpages" does for all parts of the identity map what my patch does only for some parts of it).
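To put a rough number on "bigger": with 4-level paging, one gbpage entry in a PUD table covers 1 GiB directly, while covering the same 1 GiB with 2M pages needs a separate 4 KiB PMD table (512 entries of 2 MiB each). The sketch below is plain user-space C, not kernel code, and the function names are hypothetical. It assumes every 1 GiB region touched by the range falls back to 2M pages, which is the "nogbpages" case.

```c
#include <assert.h>
#include <stdint.h>

#define GB_SIZE          (1ULL << 30)  /* span of one PUD-level (gbpage) entry */
#define PAGE_TABLE_BYTES 4096ULL       /* one PMD table: 512 entries * 8 bytes */

/* Hypothetical helper: number of 1 GiB-aligned regions touched by [start, end). */
uint64_t gb_regions_touched(uint64_t start, uint64_t end)
{
    assert(start < end);
    uint64_t first = start / GB_SIZE;
    uint64_t last = (end - 1) / GB_SIZE;
    return last - first + 1;
}

/*
 * Hypothetical helper: extra page-table memory needed when every touched
 * 1 GiB region is mapped with 2M pages instead of a single gbpage entry.
 */
uint64_t extra_pmd_bytes_without_gbpages(uint64_t start, uint64_t end)
{
    return gb_regions_touched(start, end) * PAGE_TABLE_BYTES;
}
```

For an identity map of the first 64 GiB this comes to 64 extra PMD tables, i.e. 256 KiB of additional page-table memory that has to be allocated and left intact across the kexec.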
I'm still hoping to find a machine I can reproduce this on to try and debug it myself.
If any of you have any assistance or advice to offer, it would be most welcome!
Kexec happens on identity mapped page tables.
The files of interest are machine_kexec_64.c and relocate_kernel_64.S
I suspect either the building of the identity mapped page table in machine_kexec_prepare, or the switching to the page table in identity_mapped in relocate_kernel_64.S is where something goes wrong.
Probably in kernel_ident_mapping_init as that code is directly used to build the identity mapped page tables.
Hmm.
Your change is commit d794734c9bbf ("x86/mm/ident_map: Use gbpages only where full GB page should be mapped.")
Given the simplicity of that change itself, my guess is that somewhere in the first 1GiB there are pages that need to be mapped, like the idt at 0, that are not getting mapped.
Reading through the changelog:
x86/mm/ident_map: Use gbpages only where full GB page should be mapped.

When ident_pud_init() uses only gbpages to create identity maps, large ranges of addresses not actually requested can be included in the resulting table; a 4K request will map a full GB. On UV systems, this ends up including regions that will cause hardware to halt the system if accessed (these are marked "reserved" by BIOS). Even processor speculation into these regions is enough to trigger the system halt.

Only use gbpages when map creation requests include the full GB page of space. Fall back to using smaller 2M pages when only portions of a GB page are included in the request.

No attempt is made to coalesce mapping requests. If a request requires a map entry at the 2M (pmd) level, subsequent mapping requests within the same 1G region will also be at the pmd level, even if adjacent or overlapping such requests could have been combined to map a full gbpage. Existing usage starts with larger regions and then adds smaller regions, so this should not have any great consequence.

[ dhansen: fix up comment formatting, simplify changelog ]

Signed-off-by: Steve Wahl <steve.wahl@hpe.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/all/20240126164841.170866-1-steve.wahl%40hpe.com
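As a rough illustration of the rule the changelog describes ("only use gbpages when map creation requests include the full GB page of space"), here is a minimal user-space sketch. It is not the code from the patch; the helper name is made up, and it ignores wrap-around at the very top of the address space.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define GB_SIZE (1ULL << 30)  /* region covered by one gbpage (PUD-level) entry */

/*
 * Hypothetical helper, not a kernel function: returns true when the
 * 1 GiB-aligned region containing addr lies entirely inside the
 * requested range [start, end). Only then would a gbpage map nothing
 * beyond what was asked for; otherwise the mapping would fall back
 * to 2M pages.
 */
bool request_covers_full_gb(uint64_t addr, uint64_t start, uint64_t end)
{
    assert(start < end);
    uint64_t gb_start = addr & ~(GB_SIZE - 1);  /* round down to 1 GiB */
    uint64_t gb_end = gb_start + GB_SIZE;
    return start <= gb_start && gb_end <= end;
}
```

With this check, a 4K request at address 0 (start = 0, end = 4096) does not qualify for a gbpage, so only the 2M page containing it gets mapped and the rest of the first 1GiB stays unmapped unless some other request covers it.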
I know historically that fixed mtrrs were used so that the first 1GiB could be covered with page tables and cause problems.
I suspect that, whatever those UV systems are, a more targeted solution would be to use the fixed mtrrs to disable caching and speculation on the problematic ranges, rather than completely changing the fundamental logic of how pages are mapped.
Right now it looks like you get to play a game of whack-a-mole with firmware/BIOS tables that don't mention something important, and ensuring the kernel maps everything important in the first 1GiB.
It might be worth setting up early printk on some of these systems and seeing if the failure is in early boot up of the new kernel (that is, using the kexec-supplied identity mapped pages) rather than in kexec per se.
But that is just my guess at the moment.
Eric
I hope the root cause can be fixed instead of patching it over with a flag to suppress the problem, but I don't know how regressions are handled here.
That would be my preference as well.
Thanks,
--> Steve Wahl