On Mon, Dec 18, 2023 at 02:45:21PM +0800, Shuai Xue wrote:
Hardware errors can be signaled by an asynchronous interrupt, e.g. when an error is detected by a background scrubber, or by a synchronous exception, e.g. when a CPU tries to access a poisoned cache line. Both synchronous and asynchronous errors are queued as memory_failure() work and handled by a dedicated kthread in a workqueue.
However, memory failure recovery sends SIGBUS with the wrong BUS_MCEERR_AO si_code for synchronous errors in early kill mode, even when MF_ACTION_REQUIRED is set. The main problem is that the memory failure work is handled in kthread context rather than in the context of the user-space process that is accessing the corrupt memory location, so kill_proc() sends SIGBUS with the BUS_MCEERR_AO si_code to the user-space process instead of BUS_MCEERR_AR.
To this end, queue memory_failure() as a task_work so that the current context in memory_failure() belongs exactly to the process consuming the poisoned data, and it will send SIGBUS with the proper si_code.
Signed-off-by: Shuai Xue <xueshuai@linux.alibaba.com>
Tested-by: Ma Wupeng <mawupeng1@huawei.com>
Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Reviewed-by: Xiaofei Tan <tanxiaofei@huawei.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
 drivers/acpi/apei/ghes.c | 77 +++++++++++++++++++++++----------------
 include/acpi/ghes.h      |  3 --
 mm/memory-failure.c      | 13 -------
 3 files changed, 44 insertions(+), 49 deletions(-)
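For readers unfamiliar with why the si_code matters to user space, here is a minimal sketch of a SIGBUS handler that tells BUS_MCEERR_AR (poison was consumed synchronously, action required) apart from BUS_MCEERR_AO (poison found asynchronously, action optional). It is not part of the quoted patch. The names classify_sigbus(), sigbus_handler(), install_sigbus_handler() and the poison_kind enum are hypothetical, and the fallback values 4 and 5 are only used if the libc headers do not already provide the constants.

```c
#define _GNU_SOURCE
#include <signal.h>
#include <stddef.h>
#include <string.h>
#include <unistd.h>

/* Fallbacks matching the Linux uapi values, used only if libc lacks them. */
#ifndef BUS_MCEERR_AR
#define BUS_MCEERR_AR 4
#endif
#ifndef BUS_MCEERR_AO
#define BUS_MCEERR_AO 5
#endif

/* Hypothetical classification of a SIGBUS si_code. */
enum poison_kind {
    POISON_NONE,            /* SIGBUS unrelated to memory failure */
    POISON_ACTION_REQUIRED, /* BUS_MCEERR_AR: this task consumed poison */
    POISON_ACTION_OPTIONAL  /* BUS_MCEERR_AO: poison detected elsewhere */
};

static enum poison_kind classify_sigbus(int si_code)
{
    switch (si_code) {
    case BUS_MCEERR_AR:
        return POISON_ACTION_REQUIRED;
    case BUS_MCEERR_AO:
        return POISON_ACTION_OPTIONAL;
    default:
        return POISON_NONE;
    }
}

/* Only async-signal-safe calls (write, _exit) are used in the handler. */
static void sigbus_handler(int sig, siginfo_t *info, void *ucontext)
{
    static const char ar[] = "SIGBUS: BUS_MCEERR_AR, poison consumed\n";
    static const char ao[] = "SIGBUS: BUS_MCEERR_AO, poison reported\n";
    ssize_t n;

    (void)sig;
    (void)ucontext;

    switch (classify_sigbus(info->si_code)) {
    case POISON_ACTION_REQUIRED:
        /* Returning would re-execute the faulting access, so give up. */
        n = write(STDERR_FILENO, ar, sizeof(ar) - 1);
        (void)n;
        _exit(1);
    case POISON_ACTION_OPTIONAL:
        /* The task has not touched the page yet; it may keep running. */
        n = write(STDERR_FILENO, ao, sizeof(ao) - 1);
        (void)n;
        return;
    default:
        _exit(1);
    }
}

/* Installs the handler with SA_SIGINFO so that si_code is delivered. */
static int install_sigbus_handler(void)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof(sa));
    sa.sa_sigaction = sigbus_handler;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    return sigaction(SIGBUS, &sa, NULL);
}
```

An application that receives BUS_MCEERR_AO for an access it actually performed would take the "may keep running" path above and fault again on the same address, which is the kind of misbehavior the si_code mix-up described in the quoted patch can cause.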
<formletter>
This is not the correct way to submit patches for inclusion in the stable kernel tree. Please read: https://www.kernel.org/doc/html/latest/process/stable-kernel-rules.html for how to do this properly.
</formletter>