From: Smita Koralahalli <Smita.KoralahalliChannabasappa@amd.com>
Extend the logic of handling CMCI storms to AMD threshold interrupts.
Use an approach similar to Intel's CMCI handling to mitigate storms per CPU and per bank. Unlike CMCI, however, do not raise thresholds to reduce the interrupt rate during a storm. Instead, disable the interrupt on the corresponding CPU and bank. Re-enable the interrupt once enough consecutive polls of the bank show no corrected errors (30, as programmed by Intel).
Turning off the threshold interrupts is the better solution on AMD systems because other error severities are still handled even while the threshold interrupts are disabled.
Also, AMD systems currently allow banks to be managed by both polling and interrupts. So don't modify the polling banks set after a storm ends.
[Tony: Small tweak because mce_handle_storm() isn't a pointer now]
[Yazen: Rebase and simplify]
Stable backport notes:
1. Currently, after a machine check interrupt storm is detected and later subsides, the bank's corresponding bit in the mce_poll_banks per-CPU variable is cleared by cmci_storm_end(). As a result, on AMD's SMCA systems, errors injected or encountered after the storm subsides are not logged, since polling on that bank has been disabled. The polling banks set on AMD systems should not be modified when a storm subsides.
2. This patch is a snippet from the CMCI storm handling patch (link below) that has been accepted into tip for v6.19. While backporting the full patch would have been preferred, that is not possible since it's part of a larger set. As such, this fix is temporary. When the original patch and its set are integrated into stable, this patch should be reverted.
Signed-off-by: Smita Koralahalli <Smita.KoralahalliChannabasappa@amd.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Link: https://lore.kernel.org/20251104-wip-mca-updates-v8-0-66c8eacf67b9@amd.com
Signed-off-by: Avadhut Naik <avadhut.naik@amd.com>
---
This is somewhat of a new scenario for me. Not really sure about the procedure. Hence, haven't modified the commit message or removed the tags. If required, will rework both. Also, while this issue can be encountered on AMD systems using v6.8 and later stable kernels, we would specifically prefer for this fix to be backported to v6.12 since it's LTS.
---
 arch/x86/kernel/cpu/mce/threshold.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/arch/x86/kernel/cpu/mce/threshold.c b/arch/x86/kernel/cpu/mce/threshold.c
index f4a007616468..61eaa1774931 100644
--- a/arch/x86/kernel/cpu/mce/threshold.c
+++ b/arch/x86/kernel/cpu/mce/threshold.c
@@ -85,7 +85,8 @@ void cmci_storm_end(unsigned int bank)
 {
 	struct mca_storm_desc *storm = this_cpu_ptr(&storm_desc);
 
-	__clear_bit(bank, this_cpu_ptr(mce_poll_banks));
+	if (!mce_flags.amd_threshold)
+		__clear_bit(bank, this_cpu_ptr(mce_poll_banks));
 	storm->banks[bank].history = 0;
 	storm->banks[bank].in_storm_mode = false;
base-commit: 8b690556d8fe074b4f9835075050fba3fb180e93
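For readers who want to see the effect of the hunk in isolation, here is a minimal user-space sketch in C. It is a model under stated assumptions, not the kernel code: `struct cpu_state`, `track_poll()`, `storm_begin()`, `storm_end()` and `bank_is_polled()` are hypothetical stand-ins, the 64-bit bitmap and the shift-register history are simplifications, and the `amd_threshold` parameter stands in for `mce_flags.amd_threshold`. It only illustrates the two points from the commit message: a storm ends after 30 consecutive clean polls, and on AMD the bank's polling bit is left set when that happens.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MAX_BANKS        64   /* hypothetical bank count for this model */
#define STORM_END_POLLS  30   /* consecutive clean polls that end a storm */

struct bank_state {
	uint64_t history;     /* one bit per poll, 1 = corrected error seen */
	bool in_storm_mode;
};

struct cpu_state {
	uint64_t poll_banks;  /* bit N set => bank N is polled on this CPU */
	struct bank_state banks[MAX_BANKS];
};

static bool bank_is_polled(const struct cpu_state *cpu, unsigned int bank)
{
	assert(bank < MAX_BANKS);
	return (cpu->poll_banks >> bank) & 1u;
}

/* Enter storm mode: the bank is polled while its interrupt is off. */
static void storm_begin(struct cpu_state *cpu, unsigned int bank)
{
	assert(bank < MAX_BANKS);
	cpu->poll_banks |= UINT64_C(1) << bank;
	cpu->banks[bank].in_storm_mode = true;
}

/*
 * Leave storm mode. Mirrors the shape of the patched hunk: the polling
 * bit is cleared only when the (assumed) amd_threshold flag is false.
 */
static void storm_end(struct cpu_state *cpu, unsigned int bank,
		      bool amd_threshold)
{
	assert(bank < MAX_BANKS);
	if (!amd_threshold)
		cpu->poll_banks &= ~(UINT64_C(1) << bank);
	cpu->banks[bank].history = 0;
	cpu->banks[bank].in_storm_mode = false;
}

/* Record one poll result and end the storm after enough clean polls. */
static void track_poll(struct cpu_state *cpu, unsigned int bank,
		       bool error_seen, bool amd_threshold)
{
	struct bank_state *b;
	uint64_t recent = (UINT64_C(1) << STORM_END_POLLS) - 1;

	assert(bank < MAX_BANKS);
	b = &cpu->banks[bank];
	b->history = (b->history << 1) | (error_seen ? 1u : 0u);

	/* The low STORM_END_POLLS bits are zero => that many clean polls. */
	if (b->in_storm_mode && (b->history & recent) == 0)
		storm_end(cpu, bank, amd_threshold);
}
```

In this model, a bank that was already polled before the storm stays polled afterwards when `amd_threshold` is true, which is the behaviour the fix restores; with the flag false the bit is cleared, matching the unpatched path.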
Hi,
Thanks for your patch.
FYI: kernel test robot notices the stable kernel rule is not satisfied.
The check is based on https://www.kernel.org/doc/html/latest/process/stable-kernel-rules.html#opti...
Rule: add the tag "Cc: stable@vger.kernel.org" in the sign-off area to have the patch automatically included in the stable tree.
Subject: [PATCH] x86/mce: Handle AMD threshold interrupt storms
Link: https://lore.kernel.org/stable/20251120214139.1721338-1-avadhut.naik%40amd.c...
On Thu, Nov 20, 2025 at 09:41:24PM +0000, Avadhut Naik wrote:
> From: Smita Koralahalli <Smita.KoralahalliChannabasappa@amd.com>
You need to put here
"Commit <sha1> upstream."
> Extend the logic of handling CMCI storms to AMD threshold interrupts.
...
On 11/20/2025 15:53, Borislav Petkov wrote:
> On Thu, Nov 20, 2025 at 09:41:24PM +0000, Avadhut Naik wrote:
>> From: Smita Koralahalli <Smita.KoralahalliChannabasappa@amd.com>
> You need to put here
> "Commit <sha1> upstream."
Will add that.
Also, does this need to have a Fixes tag?
Didn't add one here as the original patch committed to tip didn't have one.
>> Extend the logic of handling CMCI storms to AMD threshold interrupts.
...
On Thu, Nov 20, 2025 at 07:59:57PM -0600, Naik, Avadhut wrote:
> On 11/20/2025 15:53, Borislav Petkov wrote:
>> On Thu, Nov 20, 2025 at 09:41:24PM +0000, Avadhut Naik wrote:
>>> From: Smita Koralahalli <Smita.KoralahalliChannabasappa@amd.com>
>> You need to put here
>> "Commit <sha1> upstream."
> Will add that.
> Also, does this need to have a Fixes tag?
> Didn't add one here as the original patch committed to tip didn't have one.
Then there's no need.
On Thu, Nov 20, 2025 at 09:41:24PM +0000, Avadhut Naik wrote:
> This is somewhat of a new scenario for me. Not really sure about the procedure. Hence, haven't modified the commit message and removed the tags. If required, will rework both. Also, while this issue can be encountered on AMD systems using v6.8 and later stable kernels, we would specifically prefer for this fix to be backported to v6.12 since it's LTS.
What is the git commit id of this change in Linus's tree?
On 11/21/2025 00:53, Greg KH wrote:
> On Thu, Nov 20, 2025 at 09:41:24PM +0000, Avadhut Naik wrote:
>> This is somewhat of a new scenario for me. Not really sure about the procedure. Hence, haven't modified the commit message and removed the tags. If required, will rework both. Also, while this issue can be encountered on AMD systems using v6.8 and later stable kernels, we would specifically prefer for this fix to be backported to v6.12 since it's LTS.
> What is the git commit id of this change in Linus's tree?
I think it has not yet been merged into mainline's master branch. This commit was recently accepted into the tip (5th November).
Following is its commit ID:
a5834a5458aa004866e7da402c6bc2dfe2f3737e
Link: https://lore.kernel.org/all/176243356968.2601451.11559805061162819633.tip-bo...
Do I need to send another version with this commit ID mentioned in the commit message?
On Fri, Nov 21, 2025 at 01:04:47AM -0600, Naik, Avadhut wrote:
> On 11/21/2025 00:53, Greg KH wrote:
>> On Thu, Nov 20, 2025 at 09:41:24PM +0000, Avadhut Naik wrote:
>>> This is somewhat of a new scenario for me. Not really sure about the procedure. Hence, haven't modified the commit message and removed the tags. If required, will rework both. Also, while this issue can be encountered on AMD systems using v6.8 and later stable kernels, we would specifically prefer for this fix to be backported to v6.12 since it's LTS.
>> What is the git commit id of this change in Linus's tree?
> I think it has not yet been merged into mainline's master branch. This commit was recently accepted into the tip (5th November).
Then there's nothing we can do about this in the stable tree, please read: https://www.kernel.org/doc/html/latest/process/stable-kernel-rules.html for all about this.
thanks,
greg k-h
On Fri, Nov 21, 2025 at 08:09:21AM +0100, Greg KH wrote:
>> I think it has not yet been merged into mainline's master branch. This commit was recently accepted into the tip (5th November).
> Then there's nothing we can do about this in the stable tree, please
Yeah, it took me a while to understand what the issue is when Avadhut was explaining it to me offlist:
So the hunk at the beginning of this thread is needed as a fix for stable because when they inject a lot of errors back-to-back, after the error storm detection recovers, they cannot log any errors anymore - see the explanation in the first patch.
So what we'll do here:
@Avadhut, you take that hunk, pls, and create a separate patch with commit message explaining everything, blablabla, cc:stable, the whole shebang.
That patch goes upstream and to stable.
The rest of the original
a5834a5458aa ("x86/mce: Handle AMD threshold interrupt storms")
you then redo on top of this one and send it too.
I'll zap a5834a5458aa from the lineup for now so that you can split it.
Thx.
linux-stable-mirror@lists.linaro.org