On Apr 7, 2020, at 1:20 PM, Thomas Gleixner tglx@linutronix.de wrote:
Andy Lutomirski luto@amacapital.net writes:
On Apr 7, 2020, at 10:21 AM, Vivek Goyal vgoyal@redhat.com wrote:
The interrupts-enabled check happens only before we decide whether the async pf protocol should be followed at all. Once we decide to send PAGE_NOT_PRESENT, the later PAGE_READY notification does not check whether interrupts are enabled. That kind of makes sense; otherwise the guest process would wait forever for PAGE_READY.
I modified the code a bit to disable interrupts and wait 10 seconds (after getting the PAGE_NOT_PRESENT message). I noticed that the error async pf got delivered 10 seconds later, after interrupts were enabled again. So the error async pf was not lost while interrupts were disabled, just delayed.
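[A minimal sketch of that experiment, for illustration only; this is not Vivek's actual patch, and the function name is invented:

#include <linux/delay.h>
#include <linux/irqflags.h>

/*
 * Illustration only: after the guest sees PAGE_NOT_PRESENT, keep
 * interrupts off for ~10s. Because async PF events are injected with
 * interrupt semantics, PAGE_READY cannot arrive in this window.
 */
static void apf_irqs_off_experiment(void)
{
	local_irq_disable();
	mdelay(10 * 1000);	/* nothing can be injected here */
	local_irq_enable();	/* PAGE_READY arrives only after this */
}
]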
Async PF is not the same as a real #PF. It just hijacked the #PF vector because someone thought this is a brilliant idea.
Having said that, I thought disabling interrupts does not mask exceptions, so the page fault exception should have been delivered even with interrupts disabled. Is that correct? Maybe there was no VM exit/entry during those 10 seconds and that's why.
No. Async PF is not a real exception. It has interrupt semantics and it can only be injected when the guest has interrupts enabled. It's bad design.
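[For reference, a condensed sketch of the guest-side dispatch being criticized here, modeled on do_async_page_fault() in arch/x86/kernel/kvm.c around v5.6; simplified, and details may differ from any particular kernel version:

/*
 * One vector carries three different things: real page faults,
 * "page not present" events and "page ready" events.
 */
dotraplinkage void
do_async_page_fault(struct pt_regs *regs, unsigned long error_code,
		    unsigned long address)
{
	switch (kvm_read_and_reset_pf_reason()) {
	default:
		/* An ordinary #PF, handled architecturally. */
		do_page_fault(regs, error_code, address);
		break;
	case KVM_PV_REASON_PAGE_NOT_PRESENT:
		/* Host is paging the page in; put the task to sleep. */
		kvm_async_pf_task_wait((u32)address, !user_mode(regs));
		break;
	case KVM_PV_REASON_PAGE_READY:
		/* An "interrupt" in #PF clothing: wake the waiting task. */
		kvm_async_pf_task_wake((u32)address);
		break;
	}
}
]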
My point is that the entire async pf is nonsense. There are two types of events right now:
“Page not ready”: normally this isn’t even visible to the guest — the guest just waits. With async pf, the idea is to try to tell the guest that a particular instruction would block and the guest should do something else instead. Sending a normal exception is a poor design, though: the guest may not expect this instruction to cause an exception. I think KVM should try to deliver an *interrupt* and, if it can’t, then just block the guest.
That's pretty much what it does, except that it runs this through #PF and wraps it in the 'are interrupts disabled, i.e. can't notify right now' checks. If it can't, then KVM schedules the guest out until the situation has been resolved.
“Page ready”: this is a regular asynchronous notification just like, say, a virtio completion. It should be an ordinary interrupt. Some in-memory data structure should indicate which pages are ready.
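[One hypothetical shape for that, where every name below is invented for illustration and is not an actual KVM ABI: the host posts completion tokens into shared memory and raises an ordinary interrupt, and the handler drains the tokens and wakes the matching tasks.

#include <linux/interrupt.h>
#include <linux/types.h>

/*
 * Hypothetical guest/host shared structure for "page ready"
 * notifications (invented layout): a single-producer ring of
 * tokens, one per completed page-in.
 */
struct apf_ready_ring {
	u32 head;		/* written by host */
	u32 tail;		/* written by guest */
	u32 token[64];		/* identifies the waiting task */
};

/* An ordinary interrupt handler: no #PF semantics involved. */
static irqreturn_t apf_ready_irq(int irq, void *data)
{
	struct apf_ready_ring *ring = data;

	while (ring->tail != READ_ONCE(ring->head)) {
		u32 token = ring->token[ring->tail % 64];

		apf_wake_task(token);	/* invented helper */
		WRITE_ONCE(ring->tail, ring->tail + 1);
	}
	return IRQ_HANDLED;
}
]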
“Page is malfunctioning” is tricky because you *must* deliver the event. x86’s #MC is not exactly a masterpiece, but it does kind of work.
Nooooo. This does not need #MC at all. Don't even think about it.
Yessssssssssss. Please do think about it. :)
The point is that the access to such a page is either happening in user space or in kernel space with a proper exception table fixup.
That means a real #PF is perfectly fine. That can be injected any time and does not have the interrupt semantics of async PF.
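[The "proper exception table fixup" referred to here is the standard kernel pattern; a generic sketch in the pre-5.8 .fixup style, as an illustration rather than code from this thread:

#include <linux/errno.h>
#include <asm/asm.h>

/*
 * Illustration of an exception-table-protected access: if the mov at
 * label 1 takes a real #PF that cannot be resolved, the fault handler
 * jumps to the fixup at label 3 instead of oopsing.
 */
static inline int read_guarded(const unsigned long *addr, unsigned long *val)
{
	int err = 0;

	asm volatile("1:	mov %2, %1\n"
		     "2:\n"
		     ".section .fixup,\"ax\"\n"
		     "3:	mov %3, %0\n"
		     "	jmp 2b\n"
		     ".previous\n"
		     _ASM_EXTABLE(1b, 3b)
		     : "+r" (err), "=r" (*val)
		     : "m" (*addr), "i" (-EFAULT));
	return err;
}
]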
The hypervisor has no way to distinguish between MOV-and-has-valid-stack-and-extable-entry and MOV-definitely-can’t-fault-here. Or, for that matter, MOV-in-do_page_fault()-will-recurse-if-it-faults.
So now let's assume we disentangled async PF from #PF and made it a regular interrupt; then the following situation still needs to be dealt with:
guest -> access faults
host -> injects async fault
guest -> handles and blocks the task
host -> figures out that the page does not exist anymore and now needs to fix up the situation.
host -> injects async wakeup
guest -> returns from the async PF interrupt and retries the instruction, which faults again.
host -> knows by now that this is a real fault and injects a proper #PF
guest -> #PF runs and either sends signal to user space or runs the exception table fixup for a kernel fault.
Or guest blows up because the fault could not be recovered using #PF.
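[To make the sequence concrete, a hypothetical host-side decision path; all helper and type names below are invented for illustration, this is not KVM's actual code:

/*
 * First fault: page-in is pending, send the interrupt-style
 * "not present" event. Retried fault: the page is known to be gone,
 * so inject an architectural #PF which the guest resolves via a
 * signal or an exception table fixup.
 */
static void handle_guest_fault(struct vcpu *vcpu, gpa_t gpa)
{
	if (page_in_pending(gpa)) {
		inject_async_not_present(vcpu, gpa);	/* invented */
		return;
	}
	if (page_is_gone(gpa)) {
		inject_real_page_fault(vcpu, gpa);	/* real #PF semantics */
		return;
	}
	map_page_and_resume(vcpu, gpa);			/* normal resolution */
}
]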
I can see two somewhat sane ways to make this work.
1. Access to bad memory results in an async-page-not-present, except that, if it’s not deliverable, the guest is killed. Either that async-page-not-present has a special flag saying “memory failure” or the eventual wakeup says “memory failure”.
2. Access to bad memory results in #MC. Sure, #MC is a turd, but it’s an *architectural* turd. By all means, have a nice simple PV mechanism to tell the #MC code exactly what went wrong, but keep the overall flow the same as in the native case.
I think I like #2 much better. It has another nice effect: a good implementation will serve as a way to exercise the #MC code without needing to muck with EINJ or with whatever magic Tony uses. The average kernel developer does not have access to a box with testable memory failure reporting.
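[As a sketch of what option 2 could look like, purely hypothetical, with the PV hook and handler names invented: keep the architectural #MC delivery, but let a paravirt channel tell the handler exactly which page failed.

/*
 * Purely hypothetical PV-assisted #MC (invented names): the guest #MC
 * handler first asks a paravirt source whether the host reported a
 * specific failed guest page, and only falls back to decoding the
 * architectural MCi_STATUS banks if not.
 */
void guest_machine_check(struct pt_regs *regs)
{
	u64 gpa;

	if (pv_mce_get_bad_page(&gpa)) {		/* invented PV hook */
		memory_failure(gpa >> PAGE_SHIFT, 0);	/* existing kernel helper */
		return;
	}
	do_architectural_machine_check(regs);		/* invented fallback */
}
]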
Thanks,
tglx