On 7/13/2022 3:22 PM, Chuck Zmudzinski wrote:
On 7/13/2022 3:07 PM, Chuck Zmudzinski wrote:
On 7/13/2022 9:45 AM, Juergen Gross wrote:
On 7/13/2022 6:36 AM, Chuck Zmudzinski wrote:
And in addition, if we are going to backport this patch to all current stable branches, we had better have a really, really good reason for changing the behavior of "nopat" on Xen.
Does such a reason exist?
Well, the simple reason is: It doesn't work the same way under Xen and non-Xen (in turn because, before my patch or whatever equivalent work, things don't work properly anyway, PAT-wise). Yet it definitely ought to behave the same everywhere, imo.
There is Documentation/x86/pat.rst, which rather clearly states how "nopat" is meant to work. It should not change the contents of the PAT MSR and should keep it just as it was set at boot time (the doc talks about the "BIOS" setting of the MSR, and I guess in the Xen case the hypervisor is kind of acting as the BIOS).
The question is whether "nopat" needs to be translated to pat_enabled() returning "false".
When I started working on a refactoring of the logic surrounding pat_enabled(), I noticed that the current code has five different reasons for setting pat_disabled to true. IMO pat_disabled should be a redundant variable that is always equal to !pat_enabled() and !pat_bp_enabled, but unfortunately that is not the case. The five reasons for setting pat_disabled to true are given as message strings:
- "MTRRs disabled, skipping PAT initialization too."
- "PAT support disabled because CONFIG_MTRR is disabled in the kernel."
- "PAT support disabled via boot option."
- "PAT not supported by the CPU."
- "PAT support disabled by the firmware."
The only effect of setting pat_disabled to true is to inhibit the execution of pat_init(). It does not inhibit the execution of init_cache_modes(), which exists to handle all the cases in which pat_init() was skipped. The Xen case is one of those cases, so on Xen pat_disabled will be true, yet the only way to fix the current regression and the five-year-old commit is to set pat_bp_enabled to true so that pat_enabled() returns true. So to fix the five-year-old commit, we must have
pat_enabled() != pat_disabled
Something is wrong with this logic. That is why I wanted to precede my fix with some refactoring that will change some variable and function names and modify some comments before trying to fix the five-year-old commit, so that we will never have a situation when pat_enabled() != pat_disabled.
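The control flow described above can be modeled in a few lines of user-space C. This is only a sketch under stated assumptions, not the code in arch/x86/mm/pat/memtype.c: the struct, the model_ prefixed names and the *_ran fields are invented for illustration, and only pat_disabled, pat_init() and init_cache_modes() are names taken from the discussion.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of the boot-time flow; NOT kernel code. */
struct model_boot {
    bool pat_disabled;          /* latched by any of the five reasons */
    const char *reason;         /* message string of the first reason, or NULL */
    bool pat_init_ran;          /* did the model "run" pat_init()? */
    bool init_cache_modes_ran;  /* did the model "run" init_cache_modes()? */
};

/* Latch pat_disabled; in this model the first reason given wins. */
static void model_pat_disable(struct model_boot *b, const char *reason)
{
    assert(b != NULL);
    if (b->pat_disabled)
        return;
    b->pat_disabled = true;
    b->reason = reason;
}

/*
 * In this model pat_disabled gates only pat_init();
 * init_cache_modes() is reached either way.
 */
static void model_boot_sequence(struct model_boot *b)
{
    assert(b != NULL);
    if (!b->pat_disabled)
        b->pat_init_ran = true;
    b->init_cache_modes_ran = true;
}
```

Calling model_pat_disable() with any of the five message strings and then model_boot_sequence() leaves pat_init_ran false and init_cache_modes_ran true, which is the situation the Xen case ends up in.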
Chuck
Sorry, I meant to say,
To fix the five-year-old commit, we must have
pat_enabled() != !pat_disabled or pat_enabled() == pat_disabled,
and there is something wrong with that logic.
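To make the contradiction concrete, here is a small user-space sketch of the two flags. It assumes, as stated earlier in this thread, that pat_enabled() simply reports pat_bp_enabled. The struct, the model_ names and the two example states are hypothetical and are not taken from the kernel source.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical snapshot of the two flags under discussion; NOT kernel code. */
struct model_pat_flags {
    bool pat_disabled;
    bool pat_bp_enabled;
};

/* Assumption from the thread: pat_enabled() reports pat_bp_enabled. */
static bool model_pat_enabled(const struct model_pat_flags *f)
{
    assert(f != NULL);
    return f->pat_bp_enabled;
}

/* The invariant one would expect: pat_enabled() == !pat_disabled. */
static bool model_flags_agree(const struct model_pat_flags *f)
{
    return model_pat_enabled(f) == !f->pat_disabled;
}

/* "nopat" on bare metal: disabled and reported as disabled. */
static const struct model_pat_flags model_nopat_bare_metal = {
    .pat_disabled = true, .pat_bp_enabled = false,
};

/* Xen with the proposed fix: pat_disabled stays true, pat_bp_enabled is forced true. */
static const struct model_pat_flags model_xen_with_fix = {
    .pat_disabled = true, .pat_bp_enabled = true,
};
```

model_flags_agree() holds for model_nopat_bare_metal but not for model_xen_with_fix, where pat_enabled() == pat_disabled. That second state is exactly the one the fix has to create.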
Chuck
So to summarize, I think this means that to be comfortable fixing the five-year-old commit and the current regression by artificially setting pat_bp_enabled to true so that pat_enabled() returns true, something which both my patch and Jan's patch do, we need to come to a new understanding of what the static boolean variable pat_disabled in arch/x86/mm/pat/memtype.c really means.
The fact is, we have a regression, and the only fix we can find is to make pat_enabled() == pat_disabled.
I need to stop thinking about this for a while. It is time for those who have authority to fix this regression to make some comments about how they think this should be fixed.
Chuck