Re: [PATCH v9 1/2] ACPI: APEI: set memory failure flags as MF_ACTION_REQUIRED on synchronous events

30 Nov 2023

Hi Shuai,
On 07/10/2023 08:28, Shuai Xue wrote:
...
There are two major types of uncorrected recoverable (UCR) errors :
Is UCR a well-known x86 acronym? It's best to just spell this out each time,
there is enough jargon in this area already.
...

Action Required (AR): The error is detected and the processor already
consumes the memory. OS requires to take action (for example, offline
failure page/kill failure thread) to recover this uncorrectable error.

Action Optional (AO): The error is detected out of processor execution
context. Some data in the memory are corrupted. But the data have not
been consumed. OS is optional to take action to recover this
uncorrectable error.


As elsewhere, please don't think of errors as 'action required', this is how
things get reported to user-space. Action-required for one thread may be
action-optional for another that has the same page mapped - it's really not a
property of the error.
It would be better to describe this as synchronous and asynchronous, or in-band
and out-of-band.
...
The essential difference between AR and AO errors is that AR is a
synchronous event, while AO is an asynchronous event. The hardware will
signal a synchronous exception (Machine Check Exception on X86 and
Synchronous External Abort on Arm64) when an error is detected and the
memory access has been architecturally executed.
...
When APEI firmware first is enabled, a platform may describe one error
source for the handling of synchronous errors (e.g. MCE or SEA notification),
or for handling asynchronous errors (e.g. SCI or External Interrupt
notification). In other words, we can distinguish synchronous errors by
APEI notification. For AR errors, kernel will kill current process
accessing the poisoned page by sending SIGBUS with BUS_MCEERR_AR. In
addition, for AO errors, kernel will notify the process who owns the
poisoned page by sending SIGBUS with BUS_MCEERR_AO in early kill mode.
However, the GHES driver always sets mf_flags to 0 so that all UCR errors
are handled as AO errors in memory failure.
To make this easier to read:
 UCR and AR -> synchronous
 AO -> asynchronous
...
To this end, set memory failure flags as MF_ACTION_REQUIRED on synchronous
events.
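
The intent of that sentence can be sketched in isolation as below. This is a stand-alone illustration and not the patch itself: the sketch_ names and their numeric values are placeholders invented here so that the snippet compiles outside the kernel, and only the shape of the decision (synchronous notification implies MF_ACTION_REQUIRED, anything else keeps flags at 0) comes from the commit message.

```c
#include <stdbool.h>

/* Placeholder notification types; the kernel's ACPI headers define the real ones. */
enum sketch_notify {
	SKETCH_NOTIFY_SCI,
	SKETCH_NOTIFY_EXTERNAL,
	SKETCH_NOTIFY_SEA,
};

/* Placeholder for the kernel's MF_ACTION_REQUIRED flag bit. */
#define SKETCH_MF_ACTION_REQUIRED (1 << 1)

/* Only SEA is treated as a synchronous notification in this sketch. */
bool sketch_is_sync_notify(enum sketch_notify type)
{
	return type == SKETCH_NOTIFY_SEA;
}

/* Flags handed to memory failure handling: 0 means "treat as asynchronous". */
int sketch_mf_flags(enum sketch_notify type)
{
	return sketch_is_sync_notify(type) ? SKETCH_MF_ACTION_REQUIRED : 0;
}
```

With these placeholders, sketch_mf_flags(SKETCH_NOTIFY_SEA) yields the action-required bit, while the SCI and external-interrupt cases stay at 0, which matches the behaviour before the change.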
...
Fixes: ba61ca4aab47 ("ACPI, APEI, GHES: Add hardware memory error recovery support")
Erm, this predates arm64 support, and what you have here doesn't change the behaviour on x86.
You can blame 7f17b4a121d0d50 ("ACPI: APEI: Kick the memory_failure() queue for
synchronous errors"), which should have covered this.
...

diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
index ef59d6ea16da..88178aa6222d 100644
--- a/drivers/acpi/apei/ghes.c
+++ b/drivers/acpi/apei/ghes.c
@@ -101,6 +101,20 @@ static inline bool is_hest_type_generic_v2(struct ghes *ghes)
   return ghes->generic->header.type == ACPI_HEST_TYPE_GENERIC_ERROR_V2;
 }
 
+/*
+ * A platform may describe one error source for the handling of synchronous
+ * errors (e.g. MCE or SEA), or for handling asynchronous errors (e.g. SCI
+ * or External Interrupt). On x86, the HEST notifications are always
+ * asynchronous, so only SEA on ARM is delivered as a synchronous
+ * notification.
+ */
+static inline bool is_hest_sync_notify(struct ghes *ghes)
+{
+	u8 notify_type = ghes->generic->notify.type;
+
+	return notify_type == ACPI_HEST_NOTIFY_SEA;
+}
... and, as you had in earlier versions, sometimes SDEI.
SDEI can report both synchronous and asynchronous errors; I wouldn't be too surprised if the
hardware NMI can be used for the same. It would be good to chase up having a hint of this
in the CPER records and pass that in here as a hint.
Unfortunately, it's not safe to assume either way for SDEI.
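
To make the suggestion concrete, one possible shape is a three-state answer where the notification type decides when it can, and a per-record hint decides otherwise. This is purely a sketch under assumptions: no such hint exists in the CPER records today, and every name below (the hint_ enums and hint_classify()) is hypothetical.

```c
/* Hypothetical notification types, standing in for the HEST ones. */
enum hint_notify {
	HINT_NOTIFY_SCI,
	HINT_NOTIFY_SEA,
	HINT_NOTIFY_SDEI,
};

/* Three states, because for SDEI neither answer can be assumed. */
enum hint_sync {
	HINT_SYNC_UNKNOWN,
	HINT_SYNC_NO,
	HINT_SYNC_YES,
};

/*
 * record_hint is an assumed per-record indication of synchronicity;
 * pass HINT_SYNC_UNKNOWN when the record carries no such information.
 */
enum hint_sync hint_classify(enum hint_notify type, enum hint_sync record_hint)
{
	switch (type) {
	case HINT_NOTIFY_SEA:
		return HINT_SYNC_YES;   /* always synchronous */
	case HINT_NOTIFY_SDEI:
		return record_hint;     /* notification type alone is not decisive */
	default:
		return HINT_SYNC_NO;    /* interrupt-style notifications */
	}
}
```

What a caller should do with HINT_SYNC_UNKNOWN is deliberately left open here, since that is exactly the case where assuming either way is not safe.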
Reviewed-by: James Morse <james.morse@arm.com>
Thanks,
James

    
