On Sat, Oct 07, 2023 at 03:28:16PM +0800, Shuai Xue wrote:
However, this trick is not always be effective
So far so good.
What's missing here is why "this trick" is not always effective.
Basically to explain what exactly the problem is.
For example, hwpoison-aware user-space processes use the si_code: BUS_MCEERR_AO for 'action optional' early notifications, and BUS_MCEERR_AR for 'action required' synchronous/late notifications. Specifically, when a signal with SIGBUS_MCEERR_AR is delivered to QEMU, it will inject a vSEA to Guest kernel. In contrast, a signal with SIGBUS_MCEERR_AO will be ignored by QEMU.[1]
Fix it by seting memory failure flags as MF_ACTION_REQUIRED on synchronous events. (PATCH 1)
So you're fixing qemu by "fixing" the kernel?
This doesn't make any sense.
Make errors which are ACPI_HEST_NOTIFY_SEA type return MF_ACTION_REQUIRED so that it *happens* to fix your use case.
Sounds like a lot of nonsense to me.
What is the issue here you're trying to solve?
- Handle memory_failure() abnormal fails to avoid a unnecessary reboot
If process mapping fault page, but memory_failure() abnormal return before try_to_unmap(), for example, the fault page process mapping is KSM page. In this case, arm64 cannot use the page fault process to terminate the synchronous exception loop.[4]
This loop can potentially exceed the platform firmware threshold or even trigger a kernel hard lockup, leading to a system reboot. However, kernel has the capability to recover from this error.
Fix it by performing a force kill when memory_failure() abnormal fails or when other abnormal synchronous errors occur.
Just like that?
Without giving the process the opportunity to even save its other data?
So this all is still very confusing, patches definitely need splitting and this whole thing needs restraint.
You go and do this: you split *each* issue you're addressing into a separate patch and explain it like this:
--- 1. Prepare the context for the explanation briefly.
2. Explain the problem at hand.
3. "It happens because of <...>"
4. "Fix it by doing X"
5. "(Potentially do Y)." ---
and each patch explains *exactly* *one* issue, what happens, why it happens and just the fix for it and *why* it is needed.
Otherwise, this is unreviewable.
Thx.