On Thu, May 30, 2019, Tony W Wang-oc wrote:
Hi Ashok, I have two questions about this patch; could you help check them?
1. For broadcast #MC exceptions, this patch seems to require that the error sets MCG_STATUS_RIPV = 1. But on Intel CPUs, some #MC errors set MCG_STATUS_RIPV = 0 (like "Recoverable-not-continuable SRAR Type" errors). The patch doesn't seem to work for these errors; is that okay?
2. For LMCE exceptions, this patch seems to require that the error sets MCG_STATUS_RIPV = 0 so that the LMCE is handled normally even on an offline CPU. For LMCE errors that set MCG_STATUS_RIPV = 1, the patch prevents the offline CPU from handling them; is that okay?
More specifically, this patch seems to require that #MC exceptions meet the condition "MCG_STATUS_RIPV ^ MCG_STATUS_LMCES == 1". But on a Xeon X5650 machine (SMP), "Data CACHE Level-2 Generic Error" does not meet this condition.
I got the message below from: https://www.centos.org/forums/viewtopic.php?p=292742
Hardware event. This is not a software error.
MCE 0
CPU 4 BANK 6 TSC b7065eeaa18b0
TIME 1545643603 Mon Dec 24 10:26:43 2018
MCG status:MCIP
MCi status:
Uncorrected error
Error enabled
Processor context corrupt
MCA: Data CACHE Level-2 Generic Error
STATUS b200000080000106 MCGSTATUS 4
MCGCAP 1c09 APICID 4 SOCKETID 0
CPUID Vendor Intel Family 6 Model 44
Thanks
Tony W Wang-oc
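For anyone following along, the register values in the log above can be decoded with a minimal sketch like the one below. The bit positions follow the architectural IA32_MCG_STATUS and IA32_MCi_STATUS layouts, and the macro names mirror the kernel's; the helper functions (bit_set, ripv_xor_lmces, mca_error_code, check_x5650_log) are hypothetical names used only for illustration and are not part of the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* IA32_MCG_STATUS bits (architectural layout). */
#define MCG_STATUS_RIPV  (1ULL << 0)  /* restart IP valid */
#define MCG_STATUS_EIPV  (1ULL << 1)  /* error IP valid */
#define MCG_STATUS_MCIP  (1ULL << 2)  /* machine check in progress */
#define MCG_STATUS_LMCES (1ULL << 3)  /* local machine check signaled */

/* IA32_MCi_STATUS bits used in the log above. */
#define MCI_STATUS_VAL   (1ULL << 63) /* register contents valid */
#define MCI_STATUS_UC    (1ULL << 61) /* uncorrected error */
#define MCI_STATUS_EN    (1ULL << 60) /* error reporting enabled */
#define MCI_STATUS_PCC   (1ULL << 57) /* processor context corrupt */

/* Hypothetical helper: true if any bit in mask is set in reg. */
bool bit_set(uint64_t reg, uint64_t mask)
{
    return (reg & mask) != 0;
}

/* The condition from the mail: RIPV ^ LMCES == 1. */
bool ripv_xor_lmces(uint64_t mcgstatus)
{
    return bit_set(mcgstatus, MCG_STATUS_RIPV) !=
           bit_set(mcgstatus, MCG_STATUS_LMCES);
}

/* The MCA error code is the low 16 bits of MCi_STATUS. */
uint16_t mca_error_code(uint64_t status)
{
    return (uint16_t)(status & 0xffff);
}

/* Checks the values quoted in the mcelog output (MCGSTATUS 4, STATUS b2...0106). */
void check_x5650_log(void)
{
    const uint64_t mcgstatus = 0x4ULL;
    const uint64_t status = 0xb200000080000106ULL;

    assert(bit_set(mcgstatus, MCG_STATUS_MCIP));   /* "MCG status:MCIP" */
    assert(!ripv_xor_lmces(mcgstatus));            /* RIPV=0, LMCES=0 */
    assert(bit_set(status, MCI_STATUS_UC));        /* "Uncorrected error" */
    assert(bit_set(status, MCI_STATUS_EN));        /* "Error enabled" */
    assert(bit_set(status, MCI_STATUS_PCC));       /* "Processor context corrupt" */
    assert(mca_error_code(status) == 0x0106);      /* cache error: data, L2, generic */
}
```

With MCGSTATUS = 4 only MCIP is set, so RIPV and LMCES are both 0 and the XOR condition is not met. The STATUS value has PCC set.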
On Thu, May 30, 2019 at 09:13:39AM +0000, Tony W Wang-oc wrote:
> On Thu, May 30, 2019, Tony W Wang-oc wrote:
> > Hi Ashok, I have two questions about this patch; could you help check them?
> > 1. For broadcast #MC exceptions, this patch seems to require that the error sets MCG_STATUS_RIPV = 1. But on Intel CPUs, some #MC errors set MCG_STATUS_RIPV = 0 (like "Recoverable-not-continuable SRAR Type" errors). The patch doesn't seem to work for these errors; is that okay?
> > 2. For LMCE exceptions, this patch seems to require that the error sets MCG_STATUS_RIPV = 0 so that the LMCE is handled normally even on an offline CPU. For LMCE errors that set MCG_STATUS_RIPV = 1, the patch prevents the offline CPU from handling them; is that okay?
> > More specifically, this patch seems to require that #MC exceptions meet the condition "MCG_STATUS_RIPV ^ MCG_STATUS_LMCES == 1". But on a Xeon X5650 machine (SMP),
The offline CPU will never get an LMCE=1, since those only happen on the CPU that's doing active work. Offline CPUs are just sitting in idle.
The specific error here is a PCC=1, so irrespective of what happens, we do capture the errors in the per-cpu log, and the kernel would panic.
What this patch specifically tries to avoid is leaving an error sitting with MCG_STATUS.MCIP=1, in which case another recoverable error would shut the system down.
I don't see anything wrong with what this patch does.
> > "Data CACHE Level-2 Generic Error" does not meet this condition.
> > I got the message below from: https://www.centos.org/forums/viewtopic.php?p=292742
> > Hardware event. This is not a software error.
> > MCE 0
> > CPU 4 BANK 6 TSC b7065eeaa18b0
> > TIME 1545643603 Mon Dec 24 10:26:43 2018
> > MCG status:MCIP
> > MCi status:
> > Uncorrected error
> > Error enabled
> > Processor context corrupt
> > MCA: Data CACHE Level-2 Generic Error
> > STATUS b200000080000106 MCGSTATUS 4
> > MCGCAP 1c09 APICID 4 SOCKETID 0
> > CPUID Vendor Intel Family 6 Model 44
> > Thanks
> > Tony W Wang-oc
-----Original Mail-----
Sender: Raj, Ashok <ashok.raj@intel.com>
Time: 2019.05.31 1:11
To: Tony W Wang-oc <TonyWWang-oc@zhaoxin.com>
CC: tipbot@zytor.com; bp@suse.de; hpa@zytor.com; linux-edac@vger.kernel.org; linux-kernel@vger.kernel.org; linux-tip-commits@vger.kernel.org; mingo@kernel.org; peterz@infradead.org; stable@vger.kernel.org; tglx@linutronix.de; tony.luck@intel.com; torvalds@linux-foundation.org; David Wang <DavidWang@zhaoxin.com>; Ashok Raj <ashok.raj@intel.com>
Topic: Re: Re: Re: [tip:x86/urgent] x86/mce: Ensure offline CPUs don't participate in rendezvous process
> On Thu, May 30, 2019 at 09:13:39AM +0000, Tony W Wang-oc wrote:
> > On Thu, May 30, 2019, Tony W Wang-oc wrote:
> > > Hi Ashok, I have two questions about this patch; could you help check them?
> > > 1. For broadcast #MC exceptions, this patch seems to require that the error sets MCG_STATUS_RIPV = 1. But on Intel CPUs, some #MC errors set MCG_STATUS_RIPV = 0 (like "Recoverable-not-continuable SRAR Type" errors). The patch doesn't seem to work for these errors; is that okay?
> > > 2. For LMCE exceptions, this patch seems to require that the error sets MCG_STATUS_RIPV = 0 so that the LMCE is handled normally even on an offline CPU. For LMCE errors that set MCG_STATUS_RIPV = 1, the patch prevents the offline CPU from handling them; is that okay?
> > > More specifically, this patch seems to require that #MC exceptions meet the condition "MCG_STATUS_RIPV ^ MCG_STATUS_LMCES == 1". But on a Xeon X5650 machine (SMP),
> The offline CPU will never get an LMCE=1, since those only happen on the CPU that's doing active work. Offline CPUs are just sitting in idle.
So, for Intel CPUs, is LMCE only for thread-level (or core-level) errors? If not, suppose 2 threads share a level-2 cache, thread 0 is active, and thread 1 was offlined by SW. When an MCE for this level-2 cache occurs, thread 1 will be active. When thread 1 reads mcgstatus.lmce, will the result always be 0?
Thanks.
> The specific error here is a PCC=1, so irrespective of what happens, we do capture the errors in the per-cpu log, and the kernel would panic.
> What this patch specifically tries to avoid is leaving an error sitting with MCG_STATUS.MCIP=1, in which case another recoverable error would shut the system down.
> I don't see anything wrong with what this patch does.
> > > "Data CACHE Level-2 Generic Error" does not meet this condition.
> > > I got the message below from: https://www.centos.org/forums/viewtopic.php?p=292742
> > > Hardware event. This is not a software error.
> > > MCE 0
> > > CPU 4 BANK 6 TSC b7065eeaa18b0
> > > TIME 1545643603 Mon Dec 24 10:26:43 2018
> > > MCG status:MCIP
> > > MCi status:
> > > Uncorrected error
> > > Error enabled
> > > Processor context corrupt
> > > MCA: Data CACHE Level-2 Generic Error
> > > STATUS b200000080000106 MCGSTATUS 4
> > > MCGCAP 1c09 APICID 4 SOCKETID 0
> > > CPUID Vendor Intel Family 6 Model 44
> > > Thanks
> > > Tony W Wang-oc
On Fri, May 31, 2019, Raj, Ashok wrote:
> On Thu, May 30, 2019 at 09:13:39AM +0000, Tony W Wang-oc wrote:
> > On Thu, May 30, 2019, Tony W Wang-oc wrote:
> > > Hi Ashok, I have two questions about this patch; could you help check them?
> > > 1. For broadcast #MC exceptions, this patch seems to require that the error sets MCG_STATUS_RIPV = 1. But on Intel CPUs, some #MC errors set MCG_STATUS_RIPV = 0 (like "Recoverable-not-continuable SRAR Type" errors). The patch doesn't seem to work for these errors; is that okay?
> > > 2. For LMCE exceptions, this patch seems to require that the error sets MCG_STATUS_RIPV = 0 so that the LMCE is handled normally even on an offline CPU. For LMCE errors that set MCG_STATUS_RIPV = 1, the patch prevents the offline CPU from handling them; is that okay?
> > > More specifically, this patch seems to require that #MC exceptions meet the condition "MCG_STATUS_RIPV ^ MCG_STATUS_LMCES == 1". But on a Xeon X5650 machine (SMP),
> The offline CPU will never get an LMCE=1, since those only happen on the CPU that's doing active work. Offline CPUs are just sitting in idle.
> The specific error here is a PCC=1, so irrespective of what happens, we do capture the errors in the per-cpu log, and the kernel would panic.
> What this patch specifically tries to avoid is leaving an error sitting with MCG_STATUS.MCIP=1, in which case another recoverable error would shut the system down.
Yes, I agree with you on this point.
But for question 1: when some #MC errors are broadcast to an offline CPU, like "Recoverable-not-continuable SRAR Type" errors, which set MCG_STATUS_RIPV = 0 and PCC = 0, is there also the problem "Kernel panic - not syncing: Timeout: Not all CPUs entered broadcast exception handler"?
Thanks
> I don't see anything wrong with what this patch does.
> > > "Data CACHE Level-2 Generic Error" does not meet this condition.
> > > I got the message below from: https://www.centos.org/forums/viewtopic.php?p=292742
> > > Hardware event. This is not a software error.
> > > MCE 0
> > > CPU 4 BANK 6 TSC b7065eeaa18b0
> > > TIME 1545643603 Mon Dec 24 10:26:43 2018
> > > MCG status:MCIP
> > > MCi status:
> > > Uncorrected error
> > > Error enabled
> > > Processor context corrupt
> > > MCA: Data CACHE Level-2 Generic Error
> > > STATUS b200000080000106 MCGSTATUS 4
> > > MCGCAP 1c09 APICID 4 SOCKETID 0
> > > CPUID Vendor Intel Family 6 Model 44
> > > Thanks
> > > Tony W Wang-oc
linux-stable-mirror@lists.linaro.org