Re: [PATCH v2] ACPI: APEI: do not add task_work to kernel thread to avoid memory leak

29 Sep 2022


      On 2022/9/28 1:47 AM, Luck, Tony wrote:
...
I follow and agree with everything up until:
...
In conclusion, the error will be handled in a kworker with or without this fix.
...
It isn't handled during the interrupt (it can't be).
Yes, it is not handled during the interrupt, and it does not have to be.
...
Who handles the error if the interrupt happens during the execution of a kthread?
As I mentioned, the GHES driver always queues work onto a workqueue in
memory_failure_queue() to handle the memory failure of a page, so the **worker will be
scheduled and handle the memory failure later**.
...
Can't use the task_work_add() trick to handle it (because this thread never returns to user mode).
Yes, it cannot. And this is the key point that the patch fixes.
...
So how is the error handled?
The workflow for handling a hardware error is summarized below:
-----------------------------------------------------------------------------
[ghes_sdei_critical_callback: current swapper/3, CPU 3]
ghes_sdei_critical_callback
    => __ghes_sdei_callback
        => ghes_in_nmi_queue_one_entry 		// peek and read estatus
        => irq_work_queue(&ghes_proc_irq_work) <=> ghes_proc_in_irq // irq_work
[ghes_sdei_critical_callback: return]
-----------------------------------------------------------------------------
[ghes_proc_in_irq: current swapper/3, CPU 3]
            => ghes_do_proc
                => ghes_handle_memory_failure
                    => ghes_do_memory_failure
                        => memory_failure_queue	 // put work task on current CPU
                            => if (kfifo_put(&mf_cpu->fifo, entry))
                                  schedule_work_on(smp_processor_id(), &mf_cpu->work);
            => task_work_add(current, &estatus_node->task_work, TWA_RESUME); // fix here, always added to current
[ghes_proc_in_irq: return]
-----------------------------------------------------------------------------
// kworker preempts swapper/3 on CPU 3 due to RESCHED flag
[memory_failure_work_func: current kworker, CPU 3]	
     => memory_failure_work_func(&mf_cpu->work)
        => while kfifo_get(&mf_cpu->fifo, &entry);	// until the fifo is empty
            => soft/hard offline
-----------------------------------------------------------------------------
STEP 0: The firmware notifies the kernel of the hardware error through SDEI
(ACPI_HEST_NOTIFY_SOFTWARE_DELEGATED).
STEP 1: In the SDEI callback (or any NMI-like handler), memory from ghes_estatus_pool is
used to save the estatus, which is added to ghes_estatus_llist. The swapper running on
CPU 3 is interrupted. irq_work_queue() causes ghes_proc_in_irq() to run in IRQ
context, where each estatus in ghes_estatus_llist is processed.
STEP 2: In IRQ context, ghes_proc_in_irq() queues memory failure work on the current
CPU's workqueue and adds task work to sync with the workqueue.
STEP 3: The kworker preempts the currently running thread and gets CPU 3. The memory
failure is then processed in the kworker.
(STEP 4, for a user thread: ghes_kick_task_work() is called as task_work to ensure any
queued work has completed before returning to user space. The estatus_node is freed.)
If the task work is not added, estatus_node->task_work.func will be NULL, and the
estatus_node is freed in STEP 2.
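To make the lifetime of estatus_node in STEP 2 and STEP 4 concrete, here is a rough
user-space model of it. This is only a sketch, not the kernel code: every model_* name
is hypothetical, and telling a kernel thread from a user thread by whether the task has
a user address space (has_mm) is an assumption made for this illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical user-space model; these are not kernel definitions. */
struct model_node {
    void (*task_work_func)(struct model_node *node); /* NULL: no task work added */
    bool freed;
};

struct model_task {
    bool has_mm;                /* assumption: false models a kernel thread */
    struct model_node *pending; /* task work to run on return to user space */
};

/* Stand-in for ghes_kick_task_work(): runs before returning to user space. */
static void model_kick_task_work(struct model_node *node)
{
    node->freed = true;
}

/*
 * Stand-in for the tail of ghes_proc_in_irq() (STEP 2).
 * skip_kthreads == false models "always added to current";
 * skip_kthreads == true models not adding task work to a kernel thread.
 */
static void model_proc_in_irq(struct model_task *cur, struct model_node *node,
                              bool skip_kthreads)
{
    if (!skip_kthreads || cur->has_mm) {
        node->task_work_func = model_kick_task_work;
        cur->pending = node;
    }
    /* No task work was added: free the node right away, as in STEP 2. */
    if (node->task_work_func == NULL)
        node->freed = true;
}

/* Stand-in for STEP 4; a kernel thread never gets here. */
static void model_return_to_user(struct model_task *cur)
{
    if (cur->pending != NULL) {
        cur->pending->task_work_func(cur->pending);
        cur->pending = NULL;
    }
}
```

In this model, a kernel thread with skip_kthreads == false is left with a node that is
never freed, because model_return_to_user() is never called for it. With
skip_kthreads == true the node is freed immediately in model_proc_in_irq(), while a
user thread still frees it in model_return_to_user().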
I hope this helps make the problem clearer. You can also check the stacks dumped in the
key functions of the above flow.
Best Regards,
Shuai
---------------------------------------------------------------------------------------
dump_stack() is added in:
- __ghes_sdei_callback()
- ghes_proc_in_irq()
- memory_failure_queue_kick()
- memory_failure_work_func()
- memory_failure()
[  485.457761] CPU: 3 PID: 0 Comm: swapper/3 Tainted: G            E      6.0.0-rc5+ #33
[  485.457769] Hardware name: xxxx
[  485.457771] Call trace:
[  485.457772]  dump_backtrace+0xe8/0x12c
[  485.457779]  show_stack+0x20/0x50
[  485.457781]  dump_stack_lvl+0x68/0x84
[  485.457785]  dump_stack+0x18/0x34
[  485.457787]  __ghes_sdei_callback+0x24/0x64
[  485.457789]  ghes_sdei_critical_callback+0x5c/0x94
[  485.457792]  sdei_event_handler+0x28/0x90
[  485.457795]  do_sdei_event+0x74/0x160
[  485.457797]  __sdei_handler+0x60/0xf0
[  485.457799]  __sdei_asm_handler+0xbc/0x18c
[  485.457801]  cpu_do_idle+0x14/0x80
[  485.457802]  default_idle_call+0x50/0x114
[  485.457804]  cpuidle_idle_call+0x16c/0x1c0
[  485.457806]  do_idle+0xb8/0x110
[  485.457808]  cpu_startup_entry+0x2c/0x34
[  485.457809]  secondary_start_kernel+0xf0/0x144
[  485.457812]  __secondary_switched+0xb0/0xb4
[  485.459513] EDAC MC0: 1 UE multi-symbol chipkill ECC on unknown memory (node:0 card:3 module:0 rank:0 bank_group:0 bank_address:0 device:0 row:624 column:384 chip_id:0 page:0x89c033 offset:0x400 grain:1 - APEI location: node:0 card:3 module:0 rank:0 bank_group:0 bank_address:0 device:0 row:624 column:384 chip_id:0 status(0x0000000000000400): Storage error in DRAM memory)
[  485.459523] {2}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 2
[  485.470607] {2}[Hardware Error]: event severity: recoverable
[  485.476252] {2}[Hardware Error]:  precise tstamp: 2022-09-29 09:31:27
[  485.482678] {2}[Hardware Error]:  Error 0, type: recoverable
[  485.488322] {2}[Hardware Error]:   section_type: memory error
[  485.494052] {2}[Hardware Error]:    error_status: Storage error in DRAM memory (0x0000000000000400)
[  485.503081] {2}[Hardware Error]:   physical_address: 0x000000089c033400
[  485.509680] {2}[Hardware Error]:   node:0 card:3 module:0 rank:0 bank_group:0 bank_address:0 device:0 row:624 column:384 chip_id:0
[  485.521487] {2}[Hardware Error]:   error_type: 5, multi-symbol chipkill ECC
[  485.528439] CPU: 3 PID: 0 Comm: swapper/3 Tainted: G            E      6.0.0-rc5+ #33
[  485.528440] Hardware name: AlibabaCloud AliServer-Xuanwu2.0AM-02-2UC1P-5B/M, BIOS 1.2.M1.AL.E.132.01 08/23/2022
[  485.528441] Call trace:
[  485.528441]  dump_backtrace+0xe8/0x12c
[  485.528443]  show_stack+0x20/0x50
[  485.528444]  dump_stack_lvl+0x68/0x84
[  485.528446]  dump_stack+0x18/0x34
[  485.528448]  ghes_proc_in_irq+0x220/0x250
[  485.528450]  irq_work_single+0x30/0x80
[  485.528453]  irq_work_run_list+0x4c/0x70
[  485.528455]  irq_work_run+0x28/0x44
[  485.528457]  do_handle_IPI+0x2b4/0x2f0
[  485.528459]  ipi_handler+0x24/0x34
[  485.528461]  handle_percpu_devid_irq+0x90/0x1c4
[  485.528463]  generic_handle_domain_irq+0x34/0x50
[  485.528465]  __gic_handle_irq_from_irqson.isra.0+0x130/0x230
[  485.528468]  gic_handle_irq+0x2c/0x60
[  485.528469]  call_on_irq_stack+0x2c/0x38
[  485.528471]  do_interrupt_handler+0x88/0x90
[  485.528472]  el1_interrupt+0x48/0xb0
[  485.528475]  el1h_64_irq_handler+0x18/0x24
[  485.528476]  el1h_64_irq+0x74/0x78
[  485.528477]  __do_softirq+0xa4/0x358
[  485.528478]  __irq_exit_rcu+0x110/0x13c
[  485.528479]  irq_exit_rcu+0x18/0x24
[  485.528480]  el1_interrupt+0x4c/0xb0
[  485.528482]  el1h_64_irq_handler+0x18/0x24
[  485.528483]  el1h_64_irq+0x74/0x78
[  485.528484]  arch_cpu_idle+0x18/0x40
[  485.528485]  default_idle_call+0x50/0x114
[  485.528487]  cpuidle_idle_call+0x16c/0x1c0
[  485.528488]  do_idle+0xb8/0x110
[  485.528489]  cpu_startup_entry+0x2c/0x34
[  485.528491]  secondary_start_kernel+0xf0/0x144
[  485.528493]  __secondary_switched+0xb0/0xb4
[  485.528511] CPU: 3 PID: 12696 Comm: kworker/3:0 Tainted: G            E      6.0.0-rc5+ #33
[  485.528513] Hardware name: AlibabaCloud AliServer-Xuanwu2.0AM-02-2UC1P-5B/M, BIOS 1.2.M1.AL.E.132.01 08/23/2022
[  485.528514] Workqueue: events memory_failure_work_func
[  485.528518] Call trace:
[  485.528519]  dump_backtrace+0xe8/0x12c
[  485.528520]  show_stack+0x20/0x50
[  485.528521]  dump_stack_lvl+0x68/0x84
[  485.528523]  dump_stack+0x18/0x34
[  485.528525]  memory_failure_work_func+0xec/0x180
[  485.528527]  process_one_work+0x1f4/0x460
[  485.528528]  worker_thread+0x188/0x3e4
[  485.528530]  kthread+0xd0/0xd4
[  485.528532]  ret_from_fork+0x10/0x20
[  485.528533] CPU: 3 PID: 12696 Comm: kworker/3:0 Tainted: G            E      6.0.0-rc5+ #33
[  485.528534] Hardware name: AlibabaCloud AliServer-Xuanwu2.0AM-02-2UC1P-5B/M, BIOS 1.2.M1.AL.E.132.01 08/23/2022
[  485.528535] Workqueue: events memory_failure_work_func
[  485.528537] Call trace:
[  485.528538]  dump_backtrace+0xe8/0x12c
[  485.528539]  show_stack+0x20/0x50
[  485.528540]  dump_stack_lvl+0x68/0x84
[  485.528541]  dump_stack+0x18/0x34
[  485.528543]  memory_failure+0x50/0x438
[  485.528544]  memory_failure_work_func+0x174/0x180
[  485.528546]  process_one_work+0x1f4/0x460
[  485.528547]  worker_thread+0x188/0x3e4
[  485.528548]  kthread+0xd0/0xd4
[  485.528550]  ret_from_fork+0x10/0x20
[  485.530622] Memory failure: 0x89c033: recovery action for dirty LRU page: Recovered
