On Tue, Mar 12, 2024 at 07:04:10AM -0400, Eric Hagberg wrote:
On Thu, Mar 7, 2024 at 11:33 AM Steve Wahl <steve.wahl@hpe.com> wrote:
What Linux Distribution are you running on that machine? My guess would be that this is not distro related; if you are running something quite different from Pavin that would confirm this.
Distro in use is Rocky 8, so it’s pretty clearly not distro-specific.
I found an AMD based system to try to reproduce this on.
yeah, it probably requires either a specific cpu or a set of devices plus cpu to trigger… found that it also affects Dell R7625 servers in addition to the R6615s
I agree that it's likely the CPU or particular set of surrounding devices that trigger the problem.
I have not succeeded in reproducing the problem yet. I tried an AMD based system lent to me, but it's probably the wrong generation (AMD EPYC 7251) and I didn't see the problem. I have a line on a system that's more in line with the systems the bug was reported on that I should be able to try tomorrow.
I would love to have some direction from the community at large on this. The fact that nogbpages on the command line causes the same problem without my patch suggests it's not bad code directly in my patch, but something in the way kexec reacts to the resulting identity map. One quick solution would be a kernel command line parameter to select between the previous identity map creation behavior and the new behavior. E.g. in addition to "nogbpages", we could have "somegbpages" and "allgbpages" -- or gbpages=[all, some, none] with nogbpages a synonym for backwards compatibility.
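To make the gbpages=[all, some, none] idea concrete, here is a rough userspace sketch of the option semantics only. It is not kernel code and not part of any patch: the enum, the function names, and the "last option wins" rule are all hypothetical, and the mapping of "all" to the previous behavior and "some" to the new behavior is an assumption about how the values would be assigned.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical policy values; none of these names exist in the kernel. */
enum gbpages_mode {
    GBPAGES_ALL,   /* assumed: previous identity map behavior */
    GBPAGES_SOME,  /* assumed: new behavior from the patch */
    GBPAGES_NONE,  /* same effect as the existing "nogbpages" */
};

/* Compare a token of length len against a NUL-terminated literal. */
int tok_eq(const char *tok, size_t len, const char *lit)
{
    return strlen(lit) == len && strncmp(tok, lit, len) == 0;
}

/*
 * Update *mode if the token is "nogbpages" or "gbpages=<value>".
 * Returns 1 if recognized, 0 if the token is unrelated, and -1 for an
 * unknown gbpages= value. *mode is left untouched unless 1 is returned.
 */
int gbpages_parse_token(const char *tok, size_t len, enum gbpages_mode *mode)
{
    static const char prefix[] = "gbpages=";
    const size_t plen = sizeof(prefix) - 1;

    /* Backwards-compatible synonym for gbpages=none. */
    if (tok_eq(tok, len, "nogbpages")) {
        *mode = GBPAGES_NONE;
        return 1;
    }
    if (len < plen || strncmp(tok, prefix, plen) != 0)
        return 0;

    tok += plen;
    len -= plen;
    if (tok_eq(tok, len, "all"))
        *mode = GBPAGES_ALL;
    else if (tok_eq(tok, len, "some"))
        *mode = GBPAGES_SOME;
    else if (tok_eq(tok, len, "none"))
        *mode = GBPAGES_NONE;
    else
        return -1;
    return 1;
}

/* Scan a space-separated command line; the last recognized option wins. */
enum gbpages_mode gbpages_parse_cmdline(const char *cmdline,
                                        enum gbpages_mode def)
{
    enum gbpages_mode mode = def;
    const char *p = cmdline;

    assert(cmdline != NULL);
    while (*p) {
        size_t len;

        while (*p == ' ')
            p++;
        len = strcspn(p, " ");
        if (len)
            gbpages_parse_token(p, len, &mode);
        p += len;
    }
    return mode;
}
```

In this sketch, calling gbpages_parse_cmdline("ro quiet gbpages=all", GBPAGES_SOME) yields GBPAGES_ALL, a bare "nogbpages" yields GBPAGES_NONE, and an unrecognized value such as "gbpages=bogus" leaves the caller's default in place. Which value should be the default is exactly the open question, so the sketch takes it as an argument.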
But I don't want to introduce a new command line parameter if the actual problem can be understood and fixed. The question is: how much time do I have to pursue a direct fix before some other action needs to be taken?
Thanks,
--> Steve Wahl