Recently we encounter multi #MC on the same task when it's task_work_run() has not been called, current->mce_kill_me was added to task_works list more than once, that make a circular linked task_works, so task_work_run() will do a endless loop.
More seriously, the SIGBUS signal can not be delivered to the userspace task which tigger the #MC and I met #MC flood.
I borrowed mce_kill_me.func to check whether current->mce_kill_me has been added to task_works, prevent duplicate addition and override current->mce_xxx. When work function be called, the task_works must has been taken, so it is safe to be cleared in callback.
Fixes: commit 5567d11c21a1 ("x86/mce: Send #MC singal from task work") Cc: stable@vger.kernel.org #v5.8+ Signed-off-by: Ding Hui dinghui@sangfor.com.cn
--- v1->v2: do not call kill_me_now in kill_me_maybe, to avoid mce_kill_me.func race condition do not override current->mce_xxx before callback
arch/x86/kernel/cpu/mce/core.c | 38 ++++++++++++++++++++++------------ 1 file changed, 25 insertions(+), 13 deletions(-)
diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c index 22791aadc085..95d244a95486 100644 --- a/arch/x86/kernel/cpu/mce/core.c +++ b/arch/x86/kernel/cpu/mce/core.c @@ -1250,6 +1250,7 @@ static void __mc_scan_banks(struct mce *m, struct pt_regs *regs, struct mce *fin
static void kill_me_now(struct callback_head *ch) { + WRITE_ONCE(ch->func, NULL); force_sig(SIGBUS); }
@@ -1258,15 +1259,22 @@ static void kill_me_maybe(struct callback_head *cb) struct task_struct *p = container_of(cb, struct task_struct, mce_kill_me); int flags = MF_ACTION_REQUIRED; int ret; + u64 mce_addr = p->mce_addr; + __u64 mce_kflags = p->mce_kflags; + bool mce_ripv = p->mce_ripv; + bool mce_whole_page = p->mce_whole_page;
- pr_err("Uncorrected hardware memory error in user-access at %llx", p->mce_addr); + /* reset func after save p->mce_xxx */ + WRITE_ONCE(cb->func, NULL);
- if (!p->mce_ripv) + pr_err("Uncorrected hardware memory error in user-access at %llx", mce_addr); + + if (!mce_ripv) flags |= MF_MUST_KILL;
- ret = memory_failure(p->mce_addr >> PAGE_SHIFT, flags); - if (!ret && !(p->mce_kflags & MCE_IN_KERNEL_COPYIN)) { - set_mce_nospec(p->mce_addr >> PAGE_SHIFT, p->mce_whole_page); + ret = memory_failure(mce_addr >> PAGE_SHIFT, flags); + if (!ret && !(mce_kflags & MCE_IN_KERNEL_COPYIN)) { + set_mce_nospec(mce_addr >> PAGE_SHIFT, mce_whole_page); sync_core(); return; } @@ -1283,23 +1291,27 @@ static void kill_me_maybe(struct callback_head *cb) force_sig_mceerr(BUS_MCEERR_AR, p->mce_vaddr, PAGE_SHIFT); } else { pr_err("Memory error not recovered"); - kill_me_now(cb); + force_sig(SIGBUS); } }
static void queue_task_work(struct mce *m, int kill_current_task) { - current->mce_addr = m->addr; - current->mce_kflags = m->kflags; - current->mce_ripv = !!(m->mcgstatus & MCG_STATUS_RIPV); - current->mce_whole_page = whole_page(m); + struct callback_head ch;
if (kill_current_task) - current->mce_kill_me.func = kill_me_now; + ch.func = kill_me_now; else - current->mce_kill_me.func = kill_me_maybe; + ch.func = kill_me_maybe; + + if (!cmpxchg(¤t->mce_kill_me.func, NULL, ch.func)) { + current->mce_addr = m->addr; + current->mce_kflags = m->kflags; + current->mce_ripv = !!(m->mcgstatus & MCG_STATUS_RIPV); + current->mce_whole_page = whole_page(m);
- task_work_add(current, ¤t->mce_kill_me, TWA_RESUME); + task_work_add(current, ¤t->mce_kill_me, TWA_RESUME); + } }
/*
On Tue, Jul 06, 2021 at 08:16:06PM +0800, Ding Hui wrote:
Recently we encounter multi #MC on the same task when it's task_work_run() has not been called, current->mce_kill_me was added to task_works list more than once, that make a circular linked task_works, so task_work_run() will do a endless loop.
I saw the same and posted a similar fix a while back:
https://www.spinics.net/lists/linux-mm/msg251006.html
It didn't get merged because some validation tests began failing around the same time. I'm now pretty sure I understand what happened with those other tests.
I'll post my updated version (second patch in a three part series) later today.
diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
- if (!cmpxchg(¤t->mce_kill_me.func, NULL, ch.func)) {
current->mce_addr = m->addr;
current->mce_kflags = m->kflags;
current->mce_ripv = !!(m->mcgstatus & MCG_STATUS_RIPV);
current->mce_whole_page = whole_page(m);
You don't need an atomic cmpxchg here (nor the WRITE_ONCE() to clear it). The task is operating on its own task_struct. Nobody else should touch the mce_kill_me field.
-Tony
On 2021/7/7 0:44, Luck, Tony wrote:
On Tue, Jul 06, 2021 at 08:16:06PM +0800, Ding Hui wrote:
Recently we encounter multi #MC on the same task when it's task_work_run() has not been called, current->mce_kill_me was added to task_works list more than once, that make a circular linked task_works, so task_work_run() will do a endless loop.
I saw the same and posted a similar fix a while back:
https://www.spinics.net/lists/linux-mm/msg251006.html
It didn't get merged because some validation tests began failing around the same time. I'm now pretty sure I understand what happened with those other tests.
I'll post my updated version (second patch in a three part series) later today.
Thanks for your fixes.
After digging my original problem, maybe I find out why I met #MC flood.
My test case: 1. run qemu-kvm guest VM, OS is memtest86+.iso 2. inject SRAR UE to VM memory and wait #MC When VM trigger #MC, I expect that qemu will receive SIGBUS signal ASAP, and with the modifed qemu, I will kill VM.
In this case, do_machine_check() maybe called by kvm_machine_check() in vmx.c.
Before [1], memory_failure() is called in do_machine_check(), so TIF_SIGPENDING is set on due to SIGBUS signal, vcpu_run() checked the pending singal, so return to qemu to handle SIGBUS.
After [1], do_machine_check() only add task work but not send SIGBUS directly, vcpu_run() will not break the for-loop because vcpu_enter_guest() return 1 and not set TIF_SIGPENDING on, task works never executed until sth else happen. So the kvm enter guest repeatedly and the #MC is triggered repeatedly.
Can you consider to fix cases like this?
And do you mind to give me some advice for my temporary workaround about this #MC flood: I want to check the context of do_machine_check() is exception or kvm, and fallback to call kill_me_xxx directly when in kvm context. (I already tested simply and met my expection)
[1]: commit 5567d11c21a1 ("x86/mce: Send #MC singal from task work")
On 2021/7/7 11:39, Ding Hui wrote:
On 2021/7/7 0:44, Luck, Tony wrote:
On Tue, Jul 06, 2021 at 08:16:06PM +0800, Ding Hui wrote:
Recently we encounter multi #MC on the same task when it's task_work_run() has not been called, current->mce_kill_me was added to task_works list more than once, that make a circular linked task_works, so task_work_run() will do a endless loop.
I saw the same and posted a similar fix a while back:
https://www.spinics.net/lists/linux-mm/msg251006.html
It didn't get merged because some validation tests began failing around the same time. I'm now pretty sure I understand what happened with those other tests.
I'll post my updated version (second patch in a three part series) later today.
Thanks for your fixes.
After digging my original problem, maybe I find out why I met #MC flood.
My test case:
- run qemu-kvm guest VM, OS is memtest86+.iso
- inject SRAR UE to VM memory and wait #MC
When VM trigger #MC, I expect that qemu will receive SIGBUS signal ASAP, and with the modifed qemu, I will kill VM.
In this case, do_machine_check() maybe called by kvm_machine_check() in vmx.c.
Before [1], memory_failure() is called in do_machine_check(), so TIF_SIGPENDING is set on due to SIGBUS signal, vcpu_run() checked the pending singal, so return to qemu to handle SIGBUS.
After [1], do_machine_check() only add task work but not send SIGBUS directly, vcpu_run() will not break the for-loop because vcpu_enter_guest() return 1 and not set TIF_SIGPENDING on, task works never executed until sth else happen. So the kvm enter guest repeatedly and the #MC is triggered repeatedly.
Sorry for my incorrect description.
I figure out that my test kernel is not the lastest, it's without [2] commit 72c3c0fe54a3 ("x86/kvm: Use generic xfer to guest work function"), so vcpu_run() only care about signal_pending but not TIF_NOTIFY_RESUME which set on in task_work_add().
After [2], #MC flood should not exist.
Also thank Thomas Gleixner.
Can you consider to fix cases like this?
And do you mind to give me some advice for my temporary workaround about this #MC flood: I want to check the context of do_machine_check() is exception or kvm, and fallback to call kill_me_xxx directly when in kvm context. (I already tested simply and met my expection)
So ignore my ask, please.
[1]: commit 5567d11c21a1 ("x86/mce: Send #MC singal from task work")
linux-stable-mirror@lists.linaro.org