On Tue, 11 Jun 2024, Jiaqi Yan wrote:
@@ -267,6 +268,20 @@ used::
 These are informational only. They do not mean that anything is wrong with your system. To disable them, echo 4 (bit 2) into drop_caches.
+enable_soft_offline
+===================
+Control whether to soft offline memory pages that have (excessive) correctable
+memory errors. It is your call to choose between reliability (stay away from
+fragile physical memory) vs performance (brought by HugeTLB or transparent
+hugepages).
Could you expand upon the relevance of HugeTLB or THP in this documentation? I understand the need in some cases to soft offline memory after a number of correctable memory errors, but it's not clear how the performance implications play into this. The paragraph below describes a difference in the splitting behavior. Are hugepage users the only ones who should be concerned with this?
+When setting to 1, kernel attempts to soft offline the page when it thinks
+needed. For in-use page, page content will be migrated to a new page. If
+the oringinal hugepage is a HugeTLB hugepage, regardless of in-use or free,
s/oringinal/original/
+it will be dissolved into raw pages, and the capacity of the HugeTLB pool
+will reduce by 1. If the original hugepage is a transparent hugepage, it
+will be split into raw pages. When setting to 0, kernel won't attempt to
+soft offline the page. Its default value is 1.
Is this behavior the same for all architectures?
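
For anyone following along, here is a minimal sketch of how the knob described in the hunk would be read and toggled from user space. The path /proc/sys/vm/enable_soft_offline is an assumption, inferred from the hunk sitting next to drop_caches in the vm sysctl documentation; the quoted patch text does not spell it out. The helper names are made up for illustration.

```python
from pathlib import Path

# Assumed location: the hunk is in the vm sysctl documentation, so the knob
# is expected under /proc/sys/vm. Not confirmed by the quoted text itself.
SYSCTL_PATH = Path("/proc/sys/vm/enable_soft_offline")


def read_soft_offline(path: Path = SYSCTL_PATH):
    """Return the current value (0 or 1), or None if the knob is absent."""
    try:
        return int(path.read_text().strip())
    except FileNotFoundError:
        return None


def write_soft_offline(enabled: bool, path: Path = SYSCTL_PATH) -> None:
    """Write 1 to allow soft offlining, 0 to leave pages alone (needs root)."""
    path.write_text("1\n" if enabled else "0\n")


if __name__ == "__main__":
    value = read_soft_offline()
    if value is None:
        print("enable_soft_offline is not available on this kernel")
    else:
        print(f"enable_soft_offline = {value}")
```

Per the quoted text, writing 0 would keep HugeTLB pages from being dissolved and THPs from being split on correctable errors, while 1 (the stated default) keeps the existing soft offline behavior.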