On 2023/12/1 01:39, James Morse wrote:
Hi Boris, Shuai,
On 29/11/2023 18:54, Borislav Petkov wrote:
On Sun, Nov 26, 2023 at 08:25:38PM +0800, Shuai Xue wrote:
On Sat, Nov 25, 2023 at 02:44:52PM +0800, Shuai Xue wrote:
- an AR error consumed by the current process is deferred to a dedicated kernel thread for handling, but memory_failure() assumes that it runs in the current context
On x86? ARM?
Please point to the exact code flow.
On the ARM platform, an AR error consumed by the current process is deferred to a dedicated kernel thread for handling. The AR error is handled in the flow below:
Please don't think of errors as "action required" - that's a user-space signal code. If the page could be fixed by memory-failure(), you may never get a signal. (all this was the fix for always sending an action-required signal)
I assume you mean the CPU accessed a poisoned location and took a synchronous error.
Yes, I mean that CPU accessed a poisoned location and took a synchronous error.
[usr space task einj_mem_uc consumes data poison, CPU 3]          STEP 0

[ghes_sdei_critical_callback: current einj_mem_uc, CPU 3]         STEP 1
  ghes_sdei_critical_callback
  => __ghes_sdei_callback
     => ghes_in_nmi_queue_one_entry       // peek and read estatus
  => irq_work_queue(&ghes_proc_irq_work)  <=> ghes_proc_in_irq    // irq_work
[ghes_sdei_critical_callback: return]

[ghes_proc_in_irq: current einj_mem_uc, CPU 3]                    STEP 2
  => ghes_do_proc
     => ghes_handle_memory_failure
        => ghes_do_memory_failure
           => memory_failure_queue        // put work task on current CPU
              => if (kfifo_put(&mf_cpu->fifo, entry))
                     schedule_work_on(smp_processor_id(), &mf_cpu->work);
  => task_work_add(current, &estatus_node->task_work, TWA_RESUME);
[ghes_proc_in_irq: return]

// kworker preempts einj_mem_uc on CPU 3 due to RESCHED flag       STEP 3
[memory_failure_work_func: current kworker, CPU 3]
  => memory_failure_work_func(&mf_cpu->work)
     => while kfifo_get(&mf_cpu->fifo, &entry)    // until no work is left
        => memory_failure(entry.pfn, entry.flags);
From the comment above that function:
- The function is primarily of use for corruptions that
- happen outside the current execution context (e.g. when
- detected by a background scrubber)
- Must run in process context (e.g. a work queue) with interrupts
- enabled and no spinlocks held.
[ghes_kick_task_work: current einj_mem_uc, other CPU]             STEP 4
  => memory_failure_queue_kick
     => cancel_work_sync      // wait for memory_failure_work_func to finish
        => memory_failure_work_func(&mf_cpu->work)
           => kfifo_get(&mf_cpu->fifo, &entry);   // no work

[einj_mem_uc resumes at the same PC, triggers a page fault]       STEP 5
STEP 0: A user-space task named einj_mem_uc consumes a poison. The firmware notifies the kernel of the hardware error through SDEI (ACPI_HEST_NOTIFY_SOFTWARE_DELEGATED).
STEP 1: The task einj_mem_uc running on CPU 3 is interrupted. irq_work_queue() raises an irq_work to handle the hardware error in IRQ context.
STEP 2: In IRQ context, ghes_proc_in_irq() queues memory-failure work on the current CPU's workqueue and adds a task_work to synchronize with that workqueue.
STEP 3: The kworker preempts the currently running thread and gets CPU 3. memory_failure() is then processed in the kworker (see the sketch after STEP 5 below).
See above.
STEP 4: ghes_kick_task_work() is called as task_work to ensure any queued memory-failure work has completed before returning to user-space.
STEP 5: Upon returning to user-space, the task einj_mem_uc resumes at the same instruction. Because the poisoned page was unmapped by memory_failure() in STEP 3, a page fault is triggered.
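To make STEPs 2 and 3 concrete, below is a minimal sketch of the deferral described by the trace, paraphrased from mm/memory-failure.c. It is not the exact upstream code: locking, INIT_KFIFO()/INIT_WORK() setup and error paths are omitted, and the names mf_cpu_state, memory_failure_queue_sketch() and memory_failure_work_func_sketch() are placeholders standing in for memory_failure_queue() and memory_failure_work_func().

#include <linux/kfifo.h>
#include <linux/mm.h>
#include <linux/percpu.h>
#include <linux/smp.h>
#include <linux/workqueue.h>

struct memory_failure_entry {
	unsigned long pfn;
	int flags;
};

struct memory_failure_cpu {
	DECLARE_KFIFO(fifo, struct memory_failure_entry, 16);
	struct work_struct work;
};

static DEFINE_PER_CPU(struct memory_failure_cpu, mf_cpu_state);

/* STEP 2: called via ghes_do_memory_failure() in IRQ context, still in
 * the context of the task that consumed the poison (einj_mem_uc). */
void memory_failure_queue_sketch(unsigned long pfn, int flags)
{
	struct memory_failure_cpu *mf_cpu = this_cpu_ptr(&mf_cpu_state);
	struct memory_failure_entry entry = { .pfn = pfn, .flags = flags };

	if (kfifo_put(&mf_cpu->fifo, entry))
		/* defer the real handling to a kworker pinned to this CPU */
		schedule_work_on(smp_processor_id(), &mf_cpu->work);
}

/* STEP 3: runs later in a kworker, so "current" is the kworker here,
 * not einj_mem_uc - which is the point the rest of this thread is about. */
static void memory_failure_work_func_sketch(struct work_struct *work)
{
	struct memory_failure_cpu *mf_cpu =
		container_of(work, struct memory_failure_cpu, work);
	struct memory_failure_entry entry;

	while (kfifo_get(&mf_cpu->fifo, &entry))
		memory_failure(entry.pfn, entry.flags);
}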
memory_failure() assumes that it runs in the current context on both x86 and ARM platforms.
For example, in memory_failure() in mm/memory-failure.c:
	if (flags & MF_ACTION_REQUIRED) {
		folio = page_folio(p);
		res = kill_accessing_process(current, folio_pfn(folio), flags);
	}
And?
Do you see the check above it?
if (TestSetPageHWPoison(p)) {
test_and_set_bit() returns true only when the page was poisoned already.
- This function is intended to handle "Action Required" MCEs on already
- hardware poisoned pages. They could happen, for example, when
- memory_failure() failed to unmap the error page at the first call, or
- when multiple local machine checks happened on different CPUs.
And that's kill_accessing_process().
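For reference, here is roughly how those two excerpts fit together inside memory_failure() - a condensed paraphrase of mm/memory-failure.c with most of the function elided; the exact structure differs between kernel versions:

int memory_failure(unsigned long pfn, int flags)
{
	struct page *p = pfn_to_page(pfn);
	struct folio *folio;
	int res = 0;

	/* ... */

	if (TestSetPageHWPoison(p)) {
		/*
		 * The poison bit is already set - e.g. the kworker handled
		 * this pfn first. Only on this later call, and only with
		 * MF_ACTION_REQUIRED, is the faulting task signalled here.
		 */
		res = -EHWPOISON;
		if (flags & MF_ACTION_REQUIRED) {
			folio = page_folio(p);
			res = kill_accessing_process(current, folio_pfn(folio),
						     flags);
		}
		return res;
	}

	/*
	 * First call for this pfn: mark it poisoned and unmap it from every
	 * process that maps it; signalling on this path goes through
	 * hwpoison_user_mappings() -> kill_procs().
	 */
	/* ... */
	return res;
}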
So AFAIU, the kworker running memory_failure() would only mark the page as poisoned.
The killing happens when memory_failure() runs again and the process touches the page again.
But I'd let James confirm here.
Yes, this is what is expected to happen with the existing code.
The first pass will remove the page from all processes that have it mapped before this user-space task can restart. Restarting the task will make it access a poisoned page, kicking off the second pass, which delivers the signal.
The reason for two passes is send_sig_mceerr() likes to clear_siginfo(), so even if you queued action-required before leaving GHES, memory-failure() would stomp on it.
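For context, the clear_siginfo() mentioned here is in send_sig_mceerr() in kernel/signal.c, which always builds a fresh siginfo rather than reusing anything queued earlier (lightly trimmed; check your tree for the exact version):

int send_sig_mceerr(int code, void __user *addr, short lsb,
		    struct task_struct *t)
{
	struct kernel_siginfo info;

	WARN_ON((code != BUS_MCEERR_AO) && (code != BUS_MCEERR_AR));
	clear_siginfo(&info);		/* start from a blank siginfo */
	info.si_signo = SIGBUS;
	info.si_errno = 0;
	info.si_code = code;
	info.si_addr = addr;
	info.si_addr_lsb = lsb;
	return send_sig_info(info.si_signo, &info, t);
}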
I still don't know what you're fixing here.
The problem is if the user-space process registered for early messages, it gets a signal on the first pass. If it returns from that signal, it will access the poisoned page and get the action-required signal.
How is this making Qemu go wrong?
The problem here is that this relies on the first pass of memory_failure() handling and unmapping the poisoned page successfully.
- If it does, the second-pass action-required signal may work, because the task accesses an unmapped page. But IMHO we can improve on this by sending a single signal in one pass, so that the guest will vmexit only once, right?
- If it does not, there is no second-pass signal. The existing code does not handle the error code returned by memory_failure() (see the fragment quoted below), so an exception loop occurs, resulting in a hard-lockup panic.
Besides, in a production environment, a second access to an already-known poisoned page introduces more risk of error propagation.
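To illustrate the "error code is not handled" point: the drain loop in memory_failure_work_func() (shown in condensed form in the earlier sketch) discards memory_failure()'s return value, roughly like this (paraphrased, version-dependent):

	while (kfifo_get(&mf_cpu->fifo, &entry)) {
		if (entry.flags & MF_SOFT_OFFLINE)
			soft_offline_page(entry.pfn, entry.flags);
		else
			memory_failure(entry.pfn, entry.flags);	/* result dropped */
	}

So a failed unmap in the kworker is not reported back to the GHES path or to the task that hit the poison.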
As to how this works for you given Boris' comments above: kill_procs() is also called from hwpoison_user_mappings(), which takes the flags given to memory-failure(). This is where the action-optional signals come from.
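Concretely, the per-task choice between an action-required and an action-optional signal is made in kill_proc(), called from kill_procs(). A condensed paraphrase (struct to_kill is internal to mm/memory-failure.c and field names vary by version):

static int kill_proc(struct to_kill *tk, unsigned long pfn, int flags)
{
	struct task_struct *t = tk->tsk;
	int ret = 0;

	if ((flags & MF_ACTION_REQUIRED) && (t == current))
		/* the task that synchronously consumed the poison */
		ret = force_sig_mceerr(BUS_MCEERR_AR,
				       (void __user *)tk->addr, tk->size_shift);
	else
		/* other tasks mapping the page get an action-optional signal */
		ret = send_sig_mceerr(BUS_MCEERR_AO, (void __user *)tk->addr,
				      tk->size_shift, t);
	return ret;
}

Note the t == current check: when memory_failure() runs from a kworker, the consuming task can never match it, which is why the action-required signal only shows up on the second pass.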
Thank you very much for taking the time to review and comment.
Best Regards, Shuai