FTR, this is starting to make sense, thanks for explaining.
Replying only to this one for now:
On Thu, Nov 30, 2023 at 10:58:53AM +0800, Shuai Xue wrote:
To reproduce this problem:
# STEP1: enable early kill mode
#sysctl -w vm.memory_failure_early_kill=1
vm.memory_failure_early_kill = 1
# STEP2: inject an UCE error and consume it to trigger a synchronous error
So this is for ARM folks to deal with, BUT:
A consumed uncorrectable error on x86 means panic. On some hw, like AMD, that error isn't even seen by the OS; the hw does something called syncflood to prevent further error propagation. So no action is required - the hw handles it.
But I'd like to hear from ARM folks whether consuming an uncorrectable error even lets software run. Dunno.
Thx.