Hi Shuai,
On 07/10/2023 08:28, Shuai Xue wrote:
There are two major types of uncorrected recoverable (UCR) errors:
Is UCR a well-known x86 acronym? It's best to just spell this out each time; there is enough jargon in this area already.
Action Required (AR): The error is detected and the processor has already consumed the memory. The OS is required to take action (for example, offline the failing page or kill the failing thread) to recover from this uncorrectable error.
Action Optional (AO): The error is detected outside of processor execution context. Some data in memory is corrupted, but the data has not been consumed. The OS may optionally take action to recover from this uncorrectable error.
As elsewhere, please don't think of errors as 'action required'; this is how things get reported to user-space. Action-required for one thread may be action-optional for another that has the same page mapped. It's really not a property of the error. It would be better to describe this as synchronous and asynchronous, or in-band and out-of-band.
The essential difference between AR and AO errors is that AR is a synchronous event, while AO is an asynchronous event. The hardware will signal a synchronous exception (Machine Check Exception on X86 and Synchronous External Abort on Arm64) when an error is detected and the memory access has been architecturally executed.
When APEI firmware-first is enabled, a platform may describe one error source for the handling of synchronous errors (e.g. MCE or SEA notification), or for handling asynchronous errors (e.g. SCI or External Interrupt notification). In other words, we can distinguish synchronous errors by APEI notification. For AR errors, the kernel will kill the current process accessing the poisoned page by sending SIGBUS with BUS_MCEERR_AR. For AO errors, the kernel will notify the process that owns the poisoned page by sending SIGBUS with BUS_MCEERR_AO in early kill mode. However, the GHES driver always sets mf_flags to 0, so all UCR errors are handled as AO errors in memory failure.
To make this easier to read:
  UCR and AR -> synchronous
  AO -> asynchronous
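For reference, this is roughly how the two cases look from user-space. A minimal sketch, not taken from the patch: the helper name and the enum are made up for illustration, and the fallback values for the si_code constants are assumed to match Linux's siginfo definitions.

```c
#include <signal.h>

/*
 * Fallbacks for libcs that do not expose these constants by default.
 * The values are assumed to match Linux's asm-generic/siginfo.h.
 */
#ifndef BUS_MCEERR_AR
#define BUS_MCEERR_AR 4
#endif
#ifndef BUS_MCEERR_AO
#define BUS_MCEERR_AO 5
#endif

/* Hypothetical classification used only by this sketch. */
enum sketch_report_kind {
	SKETCH_REPORT_OTHER,	/* SIGBUS unrelated to memory failure */
	SKETCH_REPORT_SYNC,	/* this thread consumed the poison */
	SKETCH_REPORT_ASYNC,	/* poison found, not consumed by this thread */
};

/* Map the si_code of a SIGBUS to how the error was reported. */
static enum sketch_report_kind sketch_classify_sigbus(int si_code)
{
	switch (si_code) {
	case BUS_MCEERR_AR:
		return SKETCH_REPORT_SYNC;
	case BUS_MCEERR_AO:
		return SKETCH_REPORT_ASYNC;
	default:
		return SKETCH_REPORT_OTHER;
	}
}
```

A SIGBUS handler installed with SA_SIGINFO would pass info->si_code to this helper. The same poisoned page can show up as the synchronous case in one thread and the asynchronous case in another, which is why the distinction belongs to the report rather than to the error.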
To this end, set memory failure flags as MF_ACTION_REQUIRED on synchronous events.
Fixes: ba61ca4aab47 ("ACPI, APEI, GHES: Add hardware memory error recovery support")
Erm, this predates arm64 support, and what you have here doesn't change the behaviour on x86.
You can blame 7f17b4a121d0d50 ("ACPI: APEI: Kick the memory_failure() queue for synchronous errors"), which should have covered this.
diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
index ef59d6ea16da..88178aa6222d 100644
--- a/drivers/acpi/apei/ghes.c
+++ b/drivers/acpi/apei/ghes.c
@@ -101,6 +101,20 @@ static inline bool is_hest_type_generic_v2(struct ghes *ghes)
 	return ghes->generic->header.type == ACPI_HEST_TYPE_GENERIC_ERROR_V2;
 }

+/*
+ * A platform may describe one error source for the handling of synchronous
+ * errors (e.g. MCE or SEA), or for handling asynchronous errors (e.g. SCI
+ * or External Interrupt). On x86, the HEST notifications are always
+ * asynchronous, so only SEA on ARM is delivered as a synchronous
+ * notification.
+ */
+static inline bool is_hest_sync_notify(struct ghes *ghes)
+{
+	u8 notify_type = ghes->generic->notify.type;
+	return notify_type == ACPI_HEST_NOTIFY_SEA;
+}
and as you had in earlier versions, sometimes SDEI. SDEI can report both synchronous and asynchronous errors, and I wouldn't be too surprised if the hardware NMI can be used for the same. It would be good to chase up getting a hint of this into the CPER records, and passing that in here.
Unfortunately, it's not safe to assume either way for SDEI.
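To illustrate the shape of this, here is a user-space sketch that keeps "unknown" as a distinct answer for SDEI instead of folding it into either case. Everything prefixed SKETCH_ or sketch_ is hypothetical. The numeric values are stand-ins believed to mirror ACPI_HEST_NOTIFY_* and MF_ACTION_REQUIRED, but treat them as assumptions; kernel code would use the real definitions.

```c
/*
 * Stand-ins so this compiles outside the kernel. Assumed to mirror
 * ACPI_HEST_NOTIFY_SCI, _SEA and _SOFTWARE_DELEGATED, and the kernel's
 * MF_ACTION_REQUIRED flag.
 */
enum {
	SKETCH_NOTIFY_SCI  = 3,
	SKETCH_NOTIFY_SEA  = 8,
	SKETCH_NOTIFY_SDEI = 11,
};
#define SKETCH_MF_ACTION_REQUIRED (1 << 1)

/* Hypothetical tri-state: a notification type alone may not be enough. */
enum sketch_sync_hint {
	SKETCH_HINT_ASYNC,
	SKETCH_HINT_SYNC,
	SKETCH_HINT_UNKNOWN,	/* would need a hint from the CPER record */
};

static enum sketch_sync_hint sketch_classify_notify(unsigned char notify_type)
{
	switch (notify_type) {
	case SKETCH_NOTIFY_SEA:
		return SKETCH_HINT_SYNC;
	case SKETCH_NOTIFY_SDEI:
		/* SDEI can carry either kind, so do not guess. */
		return SKETCH_HINT_UNKNOWN;
	default:
		/* Everything else is treated as asynchronous in this sketch. */
		return SKETCH_HINT_ASYNC;
	}
}

/* Only a known-synchronous notification asks for action-required handling. */
static int sketch_mf_flags(unsigned char notify_type)
{
	if (sketch_classify_notify(notify_type) == SKETCH_HINT_SYNC)
		return SKETCH_MF_ACTION_REQUIRED;
	return 0;
}
```

With no extra information the unknown case falls back to flags of 0, which matches what the patch does for SDEI today. A hint from the CPER records would be the place to refine it.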
Reviewed-by: James Morse <james.morse@arm.com>
Thanks,
James