On 2023/11/25 20:10, Borislav Petkov wrote:
Hi, Borislav,
Thank you for your reply, and sorry for the confusion I made. Please see my rely inline.
Best Regards, Shuai
On Sat, Nov 25, 2023 at 02:44:52PM +0800, Shuai Xue wrote:
- an AR error consumed by current process is deferred to handle in a dedicated kernel thread, but memory_failure() assumes that it runs in the current context
On x86? ARM?
Pease point to the exact code flow.
An AR error consumed by current process is deferred to handle in a dedicated kernel thread on ARM platform. The AR error is handled in bellow flow:
----------------------------------------------------------------------------- [usr space task einj_mem_uc consumd data poison, CPU 3] STEP 0
----------------------------------------------------------------------------- [ghes_sdei_critical_callback: current einj_mem_uc, CPU 3] STEP 1 ghes_sdei_critical_callback => __ghes_sdei_callback => ghes_in_nmi_queue_one_entry // peak and read estatus => irq_work_queue(&ghes_proc_irq_work) <=> ghes_proc_in_irq // irq_work [ghes_sdei_critical_callback: return] ----------------------------------------------------------------------------- [ghes_proc_in_irq: current einj_mem_uc, CPU 3] STEP 2 => ghes_do_proc => ghes_handle_memory_failure => ghes_do_memory_failure => memory_failure_queue // put work task on current CPU => if (kfifo_put(&mf_cpu->fifo, entry)) schedule_work_on(smp_processor_id(), &mf_cpu->work); => task_work_add(current, &estatus_node->task_work, TWA_RESUME); [ghes_proc_in_irq: return] ----------------------------------------------------------------------------- // kworker preempts einj_mem_uc on CPU 3 due to RESCHED flag STEP 3 [memory_failure_work_func: current kworker, CPU 3] => memory_failure_work_func(&mf_cpu->work) => while kfifo_get(&mf_cpu->fifo, &entry); // until get no work => memory_failure(entry.pfn, entry.flags); ----------------------------------------------------------------------------- [ghes_kick_task_work: current einj_mem_uc, other cpu] STEP 4 => memory_failure_queue_kick => cancel_work_sync - waiting memory_failure_work_func finish => memory_failure_work_func(&mf_cpu->work) => kfifo_get(&mf_cpu->fifo, &entry); // no work ----------------------------------------------------------------------------- [einj_mem_uc resume at the same PC, trigger a page fault STEP 5
STEP 0: A user space task, named einj_mem_uc consume a poison. The firmware notifies hardware error to kernel through is SDEI (ACPI_HEST_NOTIFY_SOFTWARE_DELEGATED).
STEP 1: The swapper running on CPU 3 is interrupted. irq_work_queue() rasie a irq_work to handle hardware errors in IRQ context
STEP2: In IRQ context, ghes_proc_in_irq() queues memory failure work on current CPU in workqueue and add task work to sync with the workqueue.
STEP3: The kworker preempts the current running thread and get CPU 3. Then memory_failure() is processed in kworker.
STEP4: ghes_kick_task_work() is called as task_work to ensure any queued workqueue has been done before returning to user-space.
STEP5: Upon returning to user-space, the task einj_mem_uc resumes at the current instruction, because the poison page is unmapped by memory_failure() in step 3, so a page fault will be triggered.
memory_failure() assumes that it runs in the current context on both x86 and ARM platform.
for example: memory_failure() in mm/memory-failure.c:
if (flags & MF_ACTION_REQUIRED) { folio = page_folio(p); res = kill_accessing_process(current, folio_pfn(folio), flags); }
- another page fault is not unnecessary, we can send sigbus to current process in the first Synchronous External Abort SEA on arm64 (analogy Machine Check Exception on x86)
I have no clue what that means. What page fault?
I mean page fault in step 5. We can simplify the above flow by queuing memory_failure() as a task work for AR errors in step 3 directly.
I just give an example that the user space process *really* relys on the si_code of signal to handle hardware errors
No, don't give examples.
Explain what the exact problem is you're seeing, in your use case, point to the code and then state how you think it should be fixed and why.
Right now your text is "all over the place" and I have no clue what you even want.
Ok, got it. Thank you.
The SIGBUS si_codes defined in include/uapi/asm-generic/siginfo.h says:
/* hardware memory error consumed on a machine check: action required */ #define BUS_MCEERR_AR 4 /* hardware memory error detected in process but not consumed: action optional*/ #define BUS_MCEERR_AO 5
When a synchronous error is consumed by Guest, the kernel should send a signal with BUS_MCEERR_AR instead of BUS_MCEERR_AO.
Can you drop this "synchronous" bla and concentrate on the error *severity*?
I think you want to say that there are some types of errors for which error handling needs to happen immediately and for some reason that doesn't happen.
Which errors are those? Types?
Why do you need them to be handled immediately?
Well, the severity defined on x86 and ARM platform is quite different. I guess you mean taxonomy of producer error types.
- X86: Software recoverable action required (SRAR)
A UCR error that *requires* system software to take a recovery action on this processor *before scheduling another stream of execution on this processor*. (15.6.3 UCR Error Classification in Intel® 64 and IA-32 Architectures Software Developer’s Manual Volume 3)
- ARM: Recoverable state (UER)
The PE determines that software *must* take action to locate and repair the error to successfully recover execution. This might be because the exception was taken before the error was architecturally consumed by the PE, at the point when the PE was not be able to make correct progress without either consuming the error or *otherwise making the state of the PE unrecoverable*. (2.3.2 PE error state classification in Arm RAS Supplement https://documentation-service.arm.com/static/63185614f72fad1903828eda)
I think above two types of error need to be handled immediately.
Exactly.
No, not exactly. Why is it ok to do that? What are the implications of this?
Is immediate killing the right decision?
Is this ok for *every* possible kernel running out there - not only for your use case?
And so on and so on...
I don't have a clear answer here. I guess the poison data only effects the user space task which triggers exception. A panic is not necessary.
On x86 platform, the current error handling of memory_failure() in kill_me_maybe() is just send a sigbus forcely.
kill_me_maybe():
ret = memory_failure(pfn, flags); if (ret == -EHWPOISON || ret == -EOPNOTSUPP) return;
pr_err("Memory error not recovered"); kill_me_now(cb);
Do you have any comments or suggestion about this? I don't change x86 behavior.
For arm64 platform, step 3 in above flow, memory_failure_work_func(), the call site of memory_failure(), does not handle the return code of memory_failure(). I just add the same behavior.