Andy Lutomirski <luto@amacapital.net> writes:
On Apr 7, 2020, at 3:48 PM, Thomas Gleixner <tglx@linutronix.de> wrote:
Inject #MC
No, not what I meant. Host has two sane choices here IMO:
- Tell the guest that the page is gone as part of the wakeup. No #PF or #MC.
- Tell the guest that it’s resolved and inject #MC when the guest
retries. The #MC is a real fault, RIP points to the right place, etc.
Ok, that makes sense.
- Access to bad memory results in an async-page-not-present, except
that, if it’s not deliverable, the guest is killed.
That's incorrect. The proper reaction is a real #PF, simply because this is part of the contract of sharing some file-backed stuff between host and guest in a well-defined "virtio" scenario, and not a random access to memory which might be there or not.
The problem is that the host doesn’t know when #PF is safe. It’s sort of the same problem that async pf has now. The guest kernel could access the problematic page in the middle of an NMI, under pagefault_disable(), etc — getting #PF as a result of CPL0 access to a page with a valid guest PTE is simply not part of the x86 architecture.
Fair enough.
Replace copy_to_user() with some access to a gup-ed mapping with no extable handler and it doesn’t look so good any more.
In this case the guest needs to die.
Of course, the guest will oops if this happens, but the guest needs to be able to oops cleanly. #PF is too fragile for this because it’s not IST, and #PF is the wrong thing anyway — #PF is all about guest-virtual-to-guest-physical mappings. Heck, what would CR2 be? The host might not even know the guest virtual address.
It knows, but I can see your point.
- Access to bad memory results in #MC. Sure, #MC is a turd, but it’s
an *architectural* turd. By all means, have a nice simple PV mechanism to tell the #MC code exactly what went wrong, but keep the overall flow the same as in the native case.
It's a completely different flow as you evaluate PV turd instead of analysing the MCE banks and the other error reporting facilities.
I’m fine with the flow being different. do_machine_check() could have entirely different logic to decide the error in PV. But I think we should reuse the overall flow: kernel gets #MC with RIP pointing to the offending instruction. If there’s an extable entry that can handle memory failure, handle it. If it’s a user access, handle it. If it’s an unrecoverable error because it was a non-extable kernel access, oops or panic.
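To make that flow concrete, here is a minimal sketch of the decision order described above. It is not the actual do_machine_check() logic; the enum, the struct and pv_mce_classify() are hypothetical names invented for illustration, and the context is reduced to two booleans.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical outcomes for a PV-reported memory-failure #MC. */
enum pv_mce_action {
	PV_MCE_FIXUP,       /* extable entry that can handle memory failure */
	PV_MCE_USER_ACCESS, /* user access: handled without taking the kernel down */
	PV_MCE_OOPS,        /* non-extable kernel access: oops or panic */
};

/* Hypothetical, heavily simplified view of the context at the faulting RIP. */
struct pv_mce_ctx {
	bool has_extable_fixup; /* RIP has a fixup that tolerates memory failure */
	bool user_access;       /* the faulting access was a user access */
};

/* Apply the checks in the order given in the text above. */
enum pv_mce_action pv_mce_classify(const struct pv_mce_ctx *ctx)
{
	assert(ctx); /* sketch only: real kernel code would not use assert() */

	if (ctx->has_extable_fixup)
		return PV_MCE_FIXUP;
	if (ctx->user_access)
		return PV_MCE_USER_ACCESS;
	return PV_MCE_OOPS;
}
```

The point of the sketch is only that the recovery decision depends on where RIP points, not on how the error was reported, so the PV and native cases can share it.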
The actual PV part could be extremely simple: the host just needs to tell the guest “this #MC is due to memory failure at this guest physical address”. No banks, no DIMM slot, no rendezvous crap (LMCE), no other nonsense. It would be nifty if the host also told the guest what the guest virtual address was if the host knows it.
It does. The EPT violations store:
- guest-linear address
- guest-physical address
That's also part of the #VE exception to which Paolo was referring.
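For illustration only, the "extremely simple" PV report discussed above might carry no more than the two addresses. This is a sketch under assumptions, not an existing KVM ABI: struct pv_mce_info, pv_mce_info_from_ept() and pv_mce_info_pfn() are hypothetical names, and 4 KiB pages are assumed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PV_MCE_PAGE_SHIFT 12 /* assumption: 4 KiB pages */

/* Hypothetical PV record: "this #MC is due to memory failure at this GPA". */
struct pv_mce_info {
	uint64_t gpa;       /* guest-physical address of the failed memory */
	uint64_t gva;       /* guest-linear address, meaningful only if gva_valid */
	bool     gva_valid; /* set when the host knows the guest-linear address */
};

/* Build the record from the two addresses an EPT violation provides. */
struct pv_mce_info pv_mce_info_from_ept(uint64_t gpa, uint64_t gla, bool gla_known)
{
	struct pv_mce_info info = {
		.gpa = gpa,
		.gva = gla_known ? gla : 0,
		.gva_valid = gla_known,
	};
	return info;
}

/* Guest side: page frame number of the failed page. */
uint64_t pv_mce_info_pfn(const struct pv_mce_info *info)
{
	assert(info); /* sketch only */
	return info->gpa >> PV_MCE_PAGE_SHIFT;
}
```

No banks and no DIMM information appear here; the guest-linear address is optional so the record still works when the host cannot supply it.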
Thanks,
tglx