If an error is detected as a result of a user-space process accessing a corrupt memory location, the CPU may take an abort. The platform firmware then reports it to the kernel via NMI-like notifications, e.g. NOTIFY_SEA, NOTIFY_SOFTWARE_DELEGATED, etc.
For NMI-like notifications, commit 7f17b4a121d0 ("ACPI: APEI: Kick the memory_failure() queue for synchronous errors") keeps track of whether memory_failure() work was queued and adds a pending task_work to flush out the queue, so that the work is processed before returning to user space.
The code uses init_mm to check whether the error occurred in user space:
if (current->mm != &init_mm)
The condition is always true, because _nobody_ ever has "init_mm" as a real VM any more.
In addition to aborts, errors can also be signaled as asynchronous exceptions, such as interrupts and SErrors. In such cases, the interrupted current process could be any kind of thread. When a kernel thread is interrupted, the ghes_kick_task_work deferred as task_work will never be processed because the entry handler returns through ret_to_kernel() instead of ret_to_user(). Consequently, the estatus_node allocated from ghes_estatus_pool in ghes_in_nmi_queue_one_entry() will not be freed. After around 200 allocations on our platform, the ghes_estatus_pool runs out of memory and ghes_in_nmi_queue_one_entry() returns -ENOMEM. As a result, the event fails to be processed:
sdei: event 805 on CPU 113 failed with error: -2
Finally, a lot of unhandled events may cause the platform firmware to exceed some threshold and reboot.
The condition should generally just do
if (current->mm)
as described in active_mm.rst documentation.
Then, if an asynchronous error is detected while a kernel thread is running (e.g. detected by a background scrubber), no task_work is added to it, which is what the original patch intended.
Fixes: 7f17b4a121d0 ("ACPI: APEI: Kick the memory_failure() queue for synchronous errors")
Signed-off-by: Shuai Xue <xueshuai@linux.alibaba.com>
---
changes since v1:
- add description of the side effect and give more details
 drivers/acpi/apei/ghes.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
index d91ad378c00d..80ad530583c9 100644
--- a/drivers/acpi/apei/ghes.c
+++ b/drivers/acpi/apei/ghes.c
@@ -985,7 +985,7 @@ static void ghes_proc_in_irq(struct irq_work *irq_work)
 				ghes_estatus_cache_add(generic, estatus);
 		}
 
-		if (task_work_pending && current->mm != &init_mm) {
+		if (task_work_pending && current->mm) {
 			estatus_node->task_work.func = ghes_kick_task_work;
 			estatus_node->task_work_cpu = smp_processor_id();
 			ret = task_work_add(current, &estatus_node->task_work,
Hi,
Thanks for your patch.
FYI: kernel test robot notices the stable kernel rule is not satisfied.
Rule: 'Cc: stable@vger.kernel.org' or 'commit <sha1> upstream.'
Subject: [PATCH v2] ACPI: APEI: do not add task_work to kernel thread to avoid memory leak
Link: https://lore.kernel.org/stable/20220924074953.83064-1-xueshuai%40linux.aliba...
The check is based on https://www.kernel.org/doc/html/latest/process/stable-kernel-rules.html
On Sat, Sep 24, 2022 at 9:50 AM Shuai Xue <xueshuai@linux.alibaba.com> wrote:
If an error is detected as a result of a user-space process accessing a corrupt memory location, the CPU may take an abort. The platform firmware then reports it to the kernel via NMI-like notifications, e.g. NOTIFY_SEA, NOTIFY_SOFTWARE_DELEGATED, etc.
For NMI-like notifications, commit 7f17b4a121d0 ("ACPI: APEI: Kick the memory_failure() queue for synchronous errors") keeps track of whether memory_failure() work was queued and adds a pending task_work to flush out the queue, so that the work is processed before returning to user space.
The code uses init_mm to check whether the error occurred in user space:
if (current->mm != &init_mm)
The condition is always true, because _nobody_ ever has "init_mm" as a real VM any more.
In addition to aborts, errors can also be signaled as asynchronous exceptions, such as interrupts and SErrors. In such cases, the interrupted current process could be any kind of thread. When a kernel thread is interrupted, the ghes_kick_task_work deferred as task_work will never be processed because the entry handler returns through ret_to_kernel() instead of ret_to_user(). Consequently, the estatus_node allocated from ghes_estatus_pool in ghes_in_nmi_queue_one_entry() will not be freed. After around 200 allocations on our platform, the ghes_estatus_pool runs out of memory and ghes_in_nmi_queue_one_entry() returns -ENOMEM. As a result, the event fails to be processed:
sdei: event 805 on CPU 113 failed with error: -2
Finally, a lot of unhandled events may cause the platform firmware to exceed some threshold and reboot.
The condition should generally just do
if (current->mm)
as described in active_mm.rst documentation.
Then, if an asynchronous error is detected while a kernel thread is running (e.g. detected by a background scrubber), no task_work is added to it, which is what the original patch intended.
Fixes: 7f17b4a121d0 ("ACPI: APEI: Kick the memory_failure() queue for synchronous errors")
Signed-off-by: Shuai Xue <xueshuai@linux.alibaba.com>
I need the APEI code reviewers to tell me that this is all OK.
changes since v1:
- add description of the side effect and give more details
 drivers/acpi/apei/ghes.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
index d91ad378c00d..80ad530583c9 100644
--- a/drivers/acpi/apei/ghes.c
+++ b/drivers/acpi/apei/ghes.c
@@ -985,7 +985,7 @@ static void ghes_proc_in_irq(struct irq_work *irq_work)
 				ghes_estatus_cache_add(generic, estatus);
 		}
 
-		if (task_work_pending && current->mm != &init_mm) {
+		if (task_work_pending && current->mm) {
 			estatus_node->task_work.func = ghes_kick_task_work;
 			estatus_node->task_work_cpu = smp_processor_id();
 			ret = task_work_add(current, &estatus_node->task_work,
-- 2.20.1.12.g72788fdb
On 2022/9/25 AM1:17, Rafael J. Wysocki wrote:
On Sat, Sep 24, 2022 at 9:50 AM Shuai Xue <xueshuai@linux.alibaba.com> wrote:
If an error is detected as a result of a user-space process accessing a corrupt memory location, the CPU may take an abort. The platform firmware then reports it to the kernel via NMI-like notifications, e.g. NOTIFY_SEA, NOTIFY_SOFTWARE_DELEGATED, etc.
For NMI-like notifications, commit 7f17b4a121d0 ("ACPI: APEI: Kick the memory_failure() queue for synchronous errors") keeps track of whether memory_failure() work was queued and adds a pending task_work to flush out the queue, so that the work is processed before returning to user space.
The code uses init_mm to check whether the error occurred in user space:
if (current->mm != &init_mm)
The condition is always true, because _nobody_ ever has "init_mm" as a real VM any more.
In addition to aborts, errors can also be signaled as asynchronous exceptions, such as interrupts and SErrors. In such cases, the interrupted current process could be any kind of thread. When a kernel thread is interrupted, the ghes_kick_task_work deferred as task_work will never be processed because the entry handler returns through ret_to_kernel() instead of ret_to_user(). Consequently, the estatus_node allocated from ghes_estatus_pool in ghes_in_nmi_queue_one_entry() will not be freed. After around 200 allocations on our platform, the ghes_estatus_pool runs out of memory and ghes_in_nmi_queue_one_entry() returns -ENOMEM. As a result, the event fails to be processed:
sdei: event 805 on CPU 113 failed with error: -2
Finally, a lot of unhandled events may cause the platform firmware to exceed some threshold and reboot.
The condition should generally just do
if (current->mm)
as described in active_mm.rst documentation.
Then, if an asynchronous error is detected while a kernel thread is running (e.g. detected by a background scrubber), no task_work is added to it, which is what the original patch intended.
Fixes: 7f17b4a121d0 ("ACPI: APEI: Kick the memory_failure() queue for synchronous errors")
Signed-off-by: Shuai Xue <xueshuai@linux.alibaba.com>
I need the APEI code reviewers to tell me that this is all OK.
Thank you for your reply. OK, let's wait for the reviewers' comments.
Best Regards, Shuai
changes since v1:
- add description of the side effect and give more details
 drivers/acpi/apei/ghes.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
index d91ad378c00d..80ad530583c9 100644
--- a/drivers/acpi/apei/ghes.c
+++ b/drivers/acpi/apei/ghes.c
@@ -985,7 +985,7 @@ static void ghes_proc_in_irq(struct irq_work *irq_work)
 				ghes_estatus_cache_add(generic, estatus);
 		}
 
-		if (task_work_pending && current->mm != &init_mm) {
+		if (task_work_pending && current->mm) {
 			estatus_node->task_work.func = ghes_kick_task_work;
 			estatus_node->task_work_cpu = smp_processor_id();
 			ret = task_work_add(current, &estatus_node->task_work,
-- 2.20.1.12.g72788fdb
-		if (task_work_pending && current->mm != &init_mm) {
+		if (task_work_pending && current->mm) {
 			estatus_node->task_work.func = ghes_kick_task_work;
 			estatus_node->task_work_cpu = smp_processor_id();
 			ret = task_work_add(current, &estatus_node->task_work,
It seems that you are getting errors reported while running kernel threads. This fix avoids pointlessly adding "task_work" that will never be processed because kernel threads never return to user mode.
But maybe something else needs to be done? The code was, and with this fix still is, taking no action for the error. That doesn't seem right.
Are you injecting errors to create this scenario? What does your test do?
-Tony
On 2022/9/26 PM11:20, Luck, Tony wrote:
-		if (task_work_pending && current->mm != &init_mm) {
+		if (task_work_pending && current->mm) {
 			estatus_node->task_work.func = ghes_kick_task_work;
 			estatus_node->task_work_cpu = smp_processor_id();
 			ret = task_work_add(current, &estatus_node->task_work,
It seems that you are getting errors reported while running kernel threads. This fix avoids pointlessly adding "task_work" that will never be processed because kernel threads never return to user mode.
Yes, you are right.
But maybe something else needs to be done? The code was, and with this fix still is, taking no action for the error. That doesn't seem right.
Sorry, I don't think so. As far as I know, on the Arm platform, a hardware error can signal exceptions including:
- Synchronous External Abort (SEA), e.g. data abort or instruction abort
- Asynchronous External Abort (signalled as SError), e.g. the L2 can generate an SError for error responses from the interconnect for a Device or Non-cacheable store
- Fault Handling and Error Recovery interrupts: DDR mainline/demand/scrubber error interrupt
When the error is signaled as an asynchronous exception (SError or interrupt), any kind of thread can be interrupted, including a kernel thread. Because asynchronous exceptions are signaled in the background, the errors are detected outside of the current execution context.
The GHES driver always queues work to handle memory failure of a page in memory_failure_queue(). If a kernel thread is interrupted:
- Without this fix, the added task_work will never be processed, so the work will not be canceled.
- With this fix, the task_work will not be added.
In conclusion, the error will be handled in a kworker with or without this fix.
The point of the fix is that:
- The condition is useless because it is always true, and I think that is not what the original patch intended. (See the sketch below.)
- The current code leads to memory leaks: the estatus_node will not be freed when task_work is added to a kernel thread.
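To make the difference concrete, here is a minimal sketch of the two checks (illustrative only, not taken from the patch). Kernel threads never have a user address space, so their current->mm is NULL, as active_mm.rst explains, and no task ever runs with current->mm == &init_mm:

	if (task_work_pending && current->mm != &init_mm) {
		/* true for user tasks AND kernel threads -> task_work may
		 * be added to a kthread and the estatus_node leaks */
	}

	if (task_work_pending && current->mm) {
		/* true only for user tasks, which do return to user space
		 * and run the task_work */
	}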
Are you injecting errors to create this scenario?
Yes, I am injecting errors to create such a scenario. After about 200 injections, the ghes_estatus_pool runs out of memory and ghes_in_nmi_queue_one_entry() returns -ENOMEM. Finally, a lot of unhandled events may cause the platform firmware to exceed some threshold and reboot.
What does your test do?
My injection includes two steps:
- mmap a device memory region for user space in a driver
- inject a UC (uncorrected error) and trigger a write, like ras-tools does
I have open sourced the code and you can find it here[1]. It's forked from your repo and mainly based on your code :)
By the way, do you have any plans to extend ras-tools to the Arm platform? And is there a mailing list for ras-tools? I sent a mail to add test cases to ras-tools several weeks ago, but got no response. Maybe your mailbox regarded it as spam.
I follow and agree with everything up until:
In conclusion, the error will be handled in a kworker with or without this fix.
Who handles the error if the interrupt happens during the execution of a kthread?
It isn't handled during the interrupt (it can't be).
Can't use the task_work_add() trick to handle it (because this thread never returns to user mode).
So how is the error handled?
-Tony
On 2022/9/28 AM1:47, Luck, Tony wrote:
I follow and agree with everything up until:
In conclusion, the error will be handled in a kworker with or without this fix.
It isn't handled during the interrupt (it can't be).
Yes, it is not handled during the interrupt, and it does not have to be.
Who handles the error if the interrupt happens during the execution of a kthread?
As I mentioned, the GHES driver always queues work into the workqueue to handle memory failure of a page in memory_failure_queue(), so the **worker will be scheduled and handle the memory failure later**.
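To make this concrete, memory_failure_queue() roughly does the following (simplified from mm/memory-failure.c around this kernel version, so details may differ):

void memory_failure_queue(unsigned long pfn, int flags)
{
	struct memory_failure_cpu *mf_cpu;
	unsigned long proc_flags;
	struct memory_failure_entry entry = {
		.pfn	= pfn,
		.flags	= flags,
	};

	mf_cpu = &get_cpu_var(memory_failure_cpu);
	spin_lock_irqsave(&mf_cpu->lock, proc_flags);
	/* Remember the bad pfn and kick a kworker on this CPU to handle it. */
	if (kfifo_put(&mf_cpu->fifo, entry))
		schedule_work_on(smp_processor_id(), &mf_cpu->work);
	else
		pr_err("Memory failure: buffer overflow when queuing memory failure at %#lx\n",
		       pfn);
	spin_unlock_irqrestore(&mf_cpu->lock, proc_flags);
	put_cpu_var(memory_failure_cpu);
}

So the actual recovery (memory_failure_work_func() -> memory_failure()) runs later in a kworker, independent of which task happened to be interrupted.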
Can't use the task_work_add() trick to handle it (because this thread never returns to user mode).
Yes, it cannot. And this is the key point of the fix.
So how is the error handled?
The workflow to handle a hardware error is summarized below:
-----------------------------------------------------------------------------
[ghes_sdei_critical_callback: current swapper/3, CPU 3]
ghes_sdei_critical_callback
  => __ghes_sdei_callback
    => ghes_in_nmi_queue_one_entry // peek and read estatus
  => irq_work_queue(&ghes_proc_irq_work) <=> ghes_proc_in_irq // irq_work
[ghes_sdei_critical_callback: return]
-----------------------------------------------------------------------------
[ghes_proc_in_irq: current swapper/3, CPU 3]
=> ghes_do_proc
  => ghes_handle_memory_failure
    => ghes_do_memory_failure
      => memory_failure_queue // put work task on current CPU
        => if (kfifo_put(&mf_cpu->fifo, entry))
               schedule_work_on(smp_processor_id(), &mf_cpu->work);
=> task_work_add(current, &estatus_node->task_work, TWA_RESUME); // fix here, always added to current
[ghes_proc_in_irq: return]
-----------------------------------------------------------------------------
// kworker preempts swapper/3 on CPU 3 due to RESCHED flag
[memory_failure_work_func: current kworker, CPU 3]
=> memory_failure_work_func(&mf_cpu->work)
  => while kfifo_get(&mf_cpu->fifo, &entry); // until there is no work left
    => soft/hard offline
-----------------------------------------------------------------------------
STEP 0: The firmware notifies the kernel of the hardware error through SDEI (ACPI_HEST_NOTIFY_SOFTWARE_DELEGATED).
STEP 1: In the SDEI callback (or any NMI-like handler), memory from ghes_estatus_pool is used to save the estatus, which is then added to ghes_estatus_llist. The swapper running on CPU 3 is interrupted. irq_work_queue() causes ghes_proc_in_irq() to run in IRQ context, where each estatus in ghes_estatus_llist is processed.
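(The estatus_node saved in STEP 1 is exactly the allocation that can leak. Very roughly, and simplified from ghes_in_nmi_queue_one_entry() in drivers/acpi/apei/ghes.c of this kernel version, the NMI-path queueing looks like this:)

	len = cper_estatus_len(&tmp_header);
	node_len = GHES_ESTATUS_NODE_LEN(len);
	/* One allocation from the limited ghes_estatus_pool per event. */
	estatus_node = (void *)gen_pool_alloc(ghes_estatus_pool, node_len);
	if (!estatus_node)
		return -ENOMEM;	/* pool exhausted after enough leaked nodes */

	estatus_node->ghes = ghes;
	estatus_node->generic = ghes->generic;
	estatus_node->task_work.func = NULL;
	/* ... copy the estatus reported by firmware into the node ... */

	llist_add(&estatus_node->llnode, &ghes_estatus_llist);

The node is supposed to be returned to the pool by gen_pool_free() later, either directly in ghes_proc_in_irq() or in ghes_kick_task_work().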
STEP 2: In IRQ context, ghes_proc_in_irq() queues memory failure work on the current CPU in the workqueue and adds task work to sync with the workqueue.
STEP 3: The kworker preempts the currently running thread and gets CPU 3. Then the memory failure is processed in the kworker.
(STEP 4, for a user thread: ghes_kick_task_work() is called as task_work to ensure any queued work has been done before returning to user space. The estatus_node is freed.)
If the task work is not added, estatus_node->task_work.func will be NULL, and estatus_node is freed in STEP 2.
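(For context, the tail of ghes_proc_in_irq() with this patch applied looks roughly like the following, lightly abridged from drivers/acpi/apei/ghes.c; the final gen_pool_free() is the STEP 2 free path mentioned above:)

		if (task_work_pending && current->mm) {
			estatus_node->task_work.func = ghes_kick_task_work;
			estatus_node->task_work_cpu = smp_processor_id();
			ret = task_work_add(current, &estatus_node->task_work,
					    TWA_RESUME);
			if (ret)
				estatus_node->task_work.func = NULL;
		}

		/*
		 * No task_work queued (kernel thread, or task_work_add()
		 * failed): free the node right here instead of relying on
		 * ghes_kick_task_work() to do it.
		 */
		if (!estatus_node->task_work.func)
			gen_pool_free(ghes_estatus_pool,
				      (unsigned long)estatus_node, node_len);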
Hope it helps make the problem clearer. You can also check the stacks dumped in the key functions of the above flow.
Best Regards, Shuai
---------------------------------------------------------------------------------------
dump_stack() is added in:
- __ghes_sdei_callback()
- ghes_proc_in_irq()
- memory_failure_queue_kick()
- memory_failure_work_func()
- memory_failure()

[ 485.457761] CPU: 3 PID: 0 Comm: swapper/3 Tainted: G E 6.0.0-rc5+ #33
[ 485.457769] Hardware name: xxxx
[ 485.457771] Call trace:
[ 485.457772]  dump_backtrace+0xe8/0x12c
[ 485.457779]  show_stack+0x20/0x50
[ 485.457781]  dump_stack_lvl+0x68/0x84
[ 485.457785]  dump_stack+0x18/0x34
[ 485.457787]  __ghes_sdei_callback+0x24/0x64
[ 485.457789]  ghes_sdei_critical_callback+0x5c/0x94
[ 485.457792]  sdei_event_handler+0x28/0x90
[ 485.457795]  do_sdei_event+0x74/0x160
[ 485.457797]  __sdei_handler+0x60/0xf0
[ 485.457799]  __sdei_asm_handler+0xbc/0x18c
[ 485.457801]  cpu_do_idle+0x14/0x80
[ 485.457802]  default_idle_call+0x50/0x114
[ 485.457804]  cpuidle_idle_call+0x16c/0x1c0
[ 485.457806]  do_idle+0xb8/0x110
[ 485.457808]  cpu_startup_entry+0x2c/0x34
[ 485.457809]  secondary_start_kernel+0xf0/0x144
[ 485.457812]  __secondary_switched+0xb0/0xb4

[ 485.459513] EDAC MC0: 1 UE multi-symbol chipkill ECC on unknown memory (node:0 card:3 module:0 rank:0 bank_group:0 bank_address:0 device:0 row:624 column:384 chip_id:0 page:0x89c033 offset:0x400 grain:1 - APEI location: node:0 card:3 module:0 rank:0 bank_group:0 bank_address:0 device:0 row:624 column:384 chip_id:0 status(0x0000000000000400): Storage error in DRAM memory)
[ 485.459523] {2}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 2
[ 485.470607] {2}[Hardware Error]: event severity: recoverable
[ 485.476252] {2}[Hardware Error]: precise tstamp: 2022-09-29 09:31:27
[ 485.482678] {2}[Hardware Error]: Error 0, type: recoverable
[ 485.488322] {2}[Hardware Error]: section_type: memory error
[ 485.494052] {2}[Hardware Error]: error_status: Storage error in DRAM memory (0x0000000000000400)
[ 485.503081] {2}[Hardware Error]: physical_address: 0x000000089c033400
[ 485.509680] {2}[Hardware Error]: node:0 card:3 module:0 rank:0 bank_group:0 bank_address:0 device:0 row:624 column:384 chip_id:0
[ 485.521487] {2}[Hardware Error]: error_type: 5, multi-symbol chipkill ECC

[ 485.528439] CPU: 3 PID: 0 Comm: swapper/3 Tainted: G E 6.0.0-rc5+ #33
[ 485.528440] Hardware name: AlibabaCloud AliServer-Xuanwu2.0AM-02-2UC1P-5B/M, BIOS 1.2.M1.AL.E.132.01 08/23/2022
[ 485.528441] Call trace:
[ 485.528441]  dump_backtrace+0xe8/0x12c
[ 485.528443]  show_stack+0x20/0x50
[ 485.528444]  dump_stack_lvl+0x68/0x84
[ 485.528446]  dump_stack+0x18/0x34
[ 485.528448]  ghes_proc_in_irq+0x220/0x250
[ 485.528450]  irq_work_single+0x30/0x80
[ 485.528453]  irq_work_run_list+0x4c/0x70
[ 485.528455]  irq_work_run+0x28/0x44
[ 485.528457]  do_handle_IPI+0x2b4/0x2f0
[ 485.528459]  ipi_handler+0x24/0x34
[ 485.528461]  handle_percpu_devid_irq+0x90/0x1c4
[ 485.528463]  generic_handle_domain_irq+0x34/0x50
[ 485.528465]  __gic_handle_irq_from_irqson.isra.0+0x130/0x230
[ 485.528468]  gic_handle_irq+0x2c/0x60
[ 485.528469]  call_on_irq_stack+0x2c/0x38
[ 485.528471]  do_interrupt_handler+0x88/0x90
[ 485.528472]  el1_interrupt+0x48/0xb0
[ 485.528475]  el1h_64_irq_handler+0x18/0x24
[ 485.528476]  el1h_64_irq+0x74/0x78
[ 485.528477]  __do_softirq+0xa4/0x358
[ 485.528478]  __irq_exit_rcu+0x110/0x13c
[ 485.528479]  irq_exit_rcu+0x18/0x24
[ 485.528480]  el1_interrupt+0x4c/0xb0
[ 485.528482]  el1h_64_irq_handler+0x18/0x24
[ 485.528483]  el1h_64_irq+0x74/0x78
[ 485.528484]  arch_cpu_idle+0x18/0x40
[ 485.528485]  default_idle_call+0x50/0x114
[ 485.528487]  cpuidle_idle_call+0x16c/0x1c0
[ 485.528488]  do_idle+0xb8/0x110
[ 485.528489]  cpu_startup_entry+0x2c/0x34
[ 485.528491]  secondary_start_kernel+0xf0/0x144
[ 485.528493]  __secondary_switched+0xb0/0xb4

[ 485.528511] CPU: 3 PID: 12696 Comm: kworker/3:0 Tainted: G E 6.0.0-rc5+ #33
[ 485.528513] Hardware name: AlibabaCloud AliServer-Xuanwu2.0AM-02-2UC1P-5B/M, BIOS 1.2.M1.AL.E.132.01 08/23/2022
[ 485.528514] Workqueue: events memory_failure_work_func
[ 485.528518] Call trace:
[ 485.528519]  dump_backtrace+0xe8/0x12c
[ 485.528520]  show_stack+0x20/0x50
[ 485.528521]  dump_stack_lvl+0x68/0x84
[ 485.528523]  dump_stack+0x18/0x34
[ 485.528525]  memory_failure_work_func+0xec/0x180
[ 485.528527]  process_one_work+0x1f4/0x460
[ 485.528528]  worker_thread+0x188/0x3e4
[ 485.528530]  kthread+0xd0/0xd4
[ 485.528532]  ret_from_fork+0x10/0x20

[ 485.528533] CPU: 3 PID: 12696 Comm: kworker/3:0 Tainted: G E 6.0.0-rc5+ #33
[ 485.528534] Hardware name: AlibabaCloud AliServer-Xuanwu2.0AM-02-2UC1P-5B/M, BIOS 1.2.M1.AL.E.132.01 08/23/2022
[ 485.528535] Workqueue: events memory_failure_work_func
[ 485.528537] Call trace:
[ 485.528538]  dump_backtrace+0xe8/0x12c
[ 485.528539]  show_stack+0x20/0x50
[ 485.528540]  dump_stack_lvl+0x68/0x84
[ 485.528541]  dump_stack+0x18/0x34
[ 485.528543]  memory_failure+0x50/0x438
[ 485.528544]  memory_failure_work_func+0x174/0x180
[ 485.528546]  process_one_work+0x1f4/0x460
[ 485.528547]  worker_thread+0x188/0x3e4
[ 485.528548]  kthread+0xd0/0xd4
[ 485.528550]  ret_from_fork+0x10/0x20
[ 485.530622] Memory failure: 0x89c033: recovery action for dirty LRU page: Recovered
Thanks for your patient explanations.
STEP 2: In IRQ context, ghes_proc_in_irq() queues memory failure work on the current CPU in the workqueue and adds task work to sync with the workqueue.
Why is there a difference if the interrupted task was a user task vs. a kernel thread?
It seems arbitrary. If the error can be handled in the kernel thread case without a task_work_add() to the current process, can't all errors be handled this way?
The current thread likely has nothing to do with the error. Just a matter of chance on what is running when the NMI is delivered, right?
-Tony
On 2022/9/30 AM4:52, Luck, Tony wrote:
Thanks for your patient explanations.
You are welcome :)
STEP 2: In IRQ context, ghes_proc_in_irq() queues memory failure work on the current CPU in the workqueue and adds task work to sync with the workqueue.
Why is there a difference if the interrupted task was a user task vs. a kernel thread?
It seems arbitrary. If the error can be handled in the kernel thread case without a task_work_add() to the current process, can't all errors be handled this way?
I'm afraid not. The kworker in the workqueue is asynchronous with respect to ret_to_user() of the interrupted task. If we return to user space before the queued memory_failure() work is processed, we will take the fault again when the error is signaled by a synchronous external abort. This loop may cause the platform firmware to exceed some threshold and reboot.
When a user task consumes poison data, a synchronous external abort will be signaled, for example by "einj_mem_uc single" in ras-tools. In such a case, the handling flow will be like below:
----------------------------------STEP 0-------------------------------------------
[ghes_sdei_critical_callback: current einj_mem_uc, local cpu]
ghes_sdei_critical_callback
  => __ghes_sdei_callback
    => ghes_in_nmi_queue_one_entry: peek and read estatus
  => irq_work_queue(&ghes_proc_irq_work) // ghes_proc_in_irq - irq_work
[ghes_sdei_critical_callback: return]
-----------------------------------STEP 1------------------------------------------
[ghes_proc_in_irq: current einj_mem_uc, local cpu]
=> ghes_do_proc
  => ghes_handle_memory_failure
    => ghes_do_memory_failure
      => memory_failure_queue // put work task on a specific cpu
        => if (kfifo_put(&mf_cpu->fifo, entry))
               schedule_work_on(smp_processor_id(), &mf_cpu->work);
=> task_work_add(current, &estatus_node->task_work, TWA_RESUME);
[ghes_proc_in_irq: return]
-----------------------------------STEP 3------------------------------------------
// kworker preempts einj_mem_uc on local cpu due to RESCHED flag
[memory_failure_work_func: current kworker, local cpu]
=> memory_failure_work_func(&mf_cpu->work)
  => while kfifo_get(&mf_cpu->fifo, &entry); // until there is no work left
    => soft/hard offline
------------------------------------STEP 4-----------------------------------------
[ghes_kick_task_work: current einj_mem_uc, other cpu]
=> memory_failure_queue_kick
  => cancel_work_sync // wait for memory_failure_work_func to finish
  => memory_failure_work_func(&mf_cpu->work)
    => kfifo_get(&mf_cpu->fifo, &entry); // no work here
------------------------------------STEP 5-----------------------------------------
[current einj_mem_uc returned to userspace]
=> Killed by SIGBUS
STEP 4 adds a task work to ensure the queued memory_failure() work is processed before returning to user space. The interrupted user task will then be killed by a SIGBUS signal.
If we delete STEP 4, the interrupted user task will return to user space synchronously and consume the poison data again.
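For reference, the STEP 4 kick works roughly like this (simplified from drivers/acpi/apei/ghes.c and mm/memory-failure.c of this kernel version, so details may differ):

/* drivers/acpi/apei/ghes.c: runs as task_work before returning to user space */
static void ghes_kick_task_work(struct callback_head *head)
{
	struct acpi_hest_generic_status *estatus;
	struct ghes_estatus_node *estatus_node;
	u32 node_len;

	estatus_node = container_of(head, struct ghes_estatus_node, task_work);
	if (IS_ENABLED(CONFIG_ACPI_APEI_MEMORY_FAILURE))
		memory_failure_queue_kick(estatus_node->task_work_cpu);

	estatus = GHES_ESTATUS_FROM_NODE(estatus_node);
	node_len = GHES_ESTATUS_NODE_LEN(cper_estatus_len(estatus));
	gen_pool_free(ghes_estatus_pool, (unsigned long)estatus_node, node_len);
}

/* mm/memory-failure.c: drain the queued work synchronously */
void memory_failure_queue_kick(int cpu)
{
	struct memory_failure_cpu *mf_cpu;

	mf_cpu = &per_cpu(memory_failure_cpu, cpu);
	cancel_work_sync(&mf_cpu->work);
	memory_failure_work_func(&mf_cpu->work);
}

So for a user task the poison is guaranteed to be handled (and the estatus_node freed) before ret_to_user(), while a kernel thread never runs this callback, which is why the check matters.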
The current thread likely has nothing to do with the error. Just a matter of chance on what is running when the NMI is delivered, right?
Yes, the error is actually handled in the workqueue. I think the point is that a synchronous exception signaled by a synchronous external abort must be handled synchronously; otherwise, it will be signaled again.
Best Regards, Shuai
Yes, the error is actually handled in the workqueue. I think the point is that a synchronous exception signaled by a synchronous external abort must be handled synchronously; otherwise, it will be signaled again.
Ok. Got it now. Thanks.
For Rafael:
Reviewed-by: Tony Luck <tony.luck@intel.com>
-Tony
On Fri, Sep 30, 2022 at 5:52 PM Luck, Tony <tony.luck@intel.com> wrote:
Yes, the error is actually handled in the workqueue. I think the point is that a synchronous exception signaled by a synchronous external abort must be handled synchronously; otherwise, it will be signaled again.
Ok. Got it now. Thanks.
For Rafael:
Reviewed-by: Tony Luck <tony.luck@intel.com>
Applied as 6.1-rc material, thanks!
Hi, Tony,
On 2022/9/27 AM11:50, Shuai Xue wrote:
On 2022/9/26 PM11:20, Luck, Tony wrote:
-		if (task_work_pending && current->mm != &init_mm) {
+		if (task_work_pending && current->mm) {
 			estatus_node->task_work.func = ghes_kick_task_work;
 			estatus_node->task_work_cpu = smp_processor_id();
 			ret = task_work_add(current, &estatus_node->task_work,
It seems that you are getting errors reported while running kernel threads. This fix avoids pointlessly adding "task_work" that will never be processed because kernel threads never return to user mode.
Yes, you are right.
But maybe something else needs to be done? The code was, and with this fix still is, taking no action for the error. That doesn't seem right.
Sorry, I don't think so. As far as I know, on the Arm platform, a hardware error can signal exceptions including:
- Synchronous External Abort (SEA), e.g. data abort or instruction abort
- Asynchronous External Abort (signalled as SError), e.g. the L2 can generate an SError for error responses from the interconnect for a Device or Non-cacheable store
- Fault Handling and Error Recovery interrupts: DDR mainline/demand/scrubber error interrupt
When the error is signaled as an asynchronous exception (SError or interrupt), any kind of thread can be interrupted, including a kernel thread. Because asynchronous exceptions are signaled in the background, the errors are detected outside of the current execution context.
The GHES driver always queues work to handle memory failure of a page in memory_failure_queue(). If a kernel thread is interrupted:
- Without this fix, the added task_work will never be processed, so the work will not be canceled.
- With this fix, the task_work will not be added.
In conclusion, the error will be handled in a kworker with or without this fix.
The point of the fix is that:
- The condition is useless because it is always true, and I think that is not what the original patch intended.
- The current code leads to memory leaks: the estatus_node will not be freed when task_work is added to a kernel thread.
Are you injecting errors to create this scenario?
Yes, I am injecting errors to create such a scenario. After about 200 injections, the ghes_estatus_pool runs out of memory and ghes_in_nmi_queue_one_entry() returns -ENOMEM. Finally, a lot of unhandled events may cause the platform firmware to exceed some threshold and reboot.
What does your test do?
My injection includes two steps:
- mmap a device memory region for user space in a driver
- inject a UC (uncorrected error) and trigger a write, like ras-tools does
I have open sourced the code and you can find it here[1]. It's forked from your repo and mainly based on your code :)
By the way, do you have any plans to extend ras-tools to the Arm platform? And is there a mailing list for ras-tools? I sent a mail to add test cases to ras-tools several weeks ago, but got no response. Maybe your mailbox regarded it as spam.
Thank you for your review, but I still have an open question.
Do you have any interest in extending ras-tools to the Arm platform? I forked an arm-devel branch[1] from your repo:
- port x86 arch-specific cases to the Arm platform
- add some common cases like hugetlb and thread, and Arm-specific cases like prefetch, strb, etc., which are helpful to test hardware and firmware RAS problems we encountered.
I would be pleased to contribute this code to your upstream repo, and I look forward to seeing a more powerful and cross-platform tool to inject and debug RAS capabilities on both x86 and Arm platforms.
I really appreciate your great work, and look forward to your reply. Thank you.
Best Regards, Shuai
Thank you for your review, but I still have an open question.
Do you have any interest in extending ras-tools to the Arm platform? I forked an arm-devel branch[1] from your repo:
- port x86 arch-specific cases to the Arm platform
- add some common cases like hugetlb and thread, and Arm-specific cases like prefetch, strb, etc., which are helpful to test hardware and firmware RAS problems we encountered.
I would be pleased to contribute this code to your upstream repo, and I look forward to seeing a more powerful and cross-platform tool to inject and debug RAS capabilities on both x86 and Arm platforms.
I really appreciate your great work, and look forward to your reply. Thank you.
Best Regards, Shuai
Shuai,
Sorry I didn't follow up on this part. Yes, I'm happy to take your (excellent) changes.
I did a "git fetch" from your repo and the a "git merge" of all except the README and LICENSE commits at the tip of your repo. Those I manually cherry-picked (since the last section of your README said "clone of Tony's repo" ... which makes no sense in my repo).
Resulting tree has been pushed out to git://git.kernel.org/pub/scm/linux/kernel/git/aegl/ras-tools.git
Please check that I didn't break anything.
-Tony
On 2022/10/14 AM1:18, Luck, Tony wrote:
Thank you for your review, but I still have an open question.
Do you have any interest in extending ras-tools to the Arm platform? I forked an arm-devel branch[1] from your repo:
- port x86 arch-specific cases to the Arm platform
- add some common cases like hugetlb and thread, and Arm-specific cases like prefetch, strb, etc., which are helpful to test hardware and firmware RAS problems we encountered.
I would be pleased to contribute this code to your upstream repo, and I look forward to seeing a more powerful and cross-platform tool to inject and debug RAS capabilities on both x86 and Arm platforms.
I really appreciate your great work, and look forward to your reply. Thank you.
Best Regards, Shuai
Shuai,
Sorry I didn't follow up on this part. Yes, I'm happy to take your (excellent) changes.
I did a "git fetch" from your repo and the a "git merge" of all except the README and LICENSE commits at the tip of your repo. Those I manually cherry-picked (since the last section of your README said "clone of Tony's repo" ... which makes no sense in my repo).
Resulting tree has been pushed out to git://git.kernel.org/pub/scm/linux/kernel/git/aegl/ras-tools.git
Thank you. I am glad to see this really happen :)
Please check that I didn't break anything.
-Tony
I rechecked it and everything looks good except hornet.
hornet uses PTRACE_GETREGS to read the tracee's general-purpose registers, but that does not work on Arm. I will send a new patch to extend it to the Arm platform.
Cheers, Shuai