The resctrl selftests for Memory Bandwidth Allocation (MBA) and Memory Bandwidth Monitoring (MBM) are failing on some (for example [1]) Emerald Rapids systems. The test failures result from the following two properties of these systems:

1) Emerald Rapids systems can have up to 320MB L3 cache. The resctrl MBA and MBM selftests measure memory traffic for which a hardcoded 250MB buffer has been sufficient so far. On platforms with L3 cache larger than the buffer, the buffer fits in the L3 cache and thus no/very little memory traffic is generated during the "memory bandwidth" tests.

2) Some platform features, for example RAS features or memory performance features that generate memory traffic, may drive accesses that are counted differently by performance counters and MBM respectively, for instance generating "overhead" traffic which is not counted against any specific RMID. Until now these counting differences have always been "in the noise". On Emerald Rapids systems the maximum MBA throttling (10% memory bandwidth) throttles memory bandwidth to the point where memory accesses by these other platform features push the memory bandwidth difference between memory controller performance counters and resctrl (MBM) beyond the tests' hardcoded tolerance.

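To illustrate property 2), here is a minimal sketch of how a fixed amount of uncounted traffic turns into a large relative difference once bandwidth is throttled. The helper name diff_percent() and all numbers used with it are hypothetical and chosen only for illustration; they are not taken from the selftests or from measurements.

```c
/*
 * Hypothetical helper, not taken from the selftests: relative difference,
 * in percent, between the memory controller counter reading and the
 * resctrl (MBM) reading. Both readings are assumed to be in MiB/s.
 * Integer division truncates, so results are rounded down.
 */
static unsigned long diff_percent(unsigned long imc_mibps,
				  unsigned long mbm_mibps)
{
	unsigned long diff;

	/* Avoid dividing by zero when no traffic was counted at all. */
	if (!imc_mibps)
		return 0;

	diff = imc_mibps > mbm_mibps ? imc_mibps - mbm_mibps
				     : mbm_mibps - imc_mibps;

	return diff * 100 / imc_mibps;
}
```

Assuming 60 MiB/s of traffic that is not attributed to any RMID, diff_percent(10060, 10000) evaluates to 0 while diff_percent(560, 500) evaluates to 10. The same absolute overhead is invisible at high bandwidth but dominates the comparison once the benchmark is throttled.
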
Make the tests more robust against platform variations:

1) Let the buffer used by memory bandwidth tests be guided by the size of the L3 cache.

2) Larger buffers require longer initialization time before the buffer can be used for measurement. Rework the tests to ensure that buffer initialization is complete before measurements start.

3) Do not compare performance counters and MBM measurements at low bandwidth. The value of "low" is hardcoded to 750MiB based on measurements on Emerald Rapids, Sapphire Rapids, and Ice Lake systems. This limit is not applicable to AMD systems since it only applies to the MBA and MBM tests that are isolated to Intel.

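Changes 1) and 3) can be sketched roughly as below. This is an illustration under stated assumptions, not the code from the patches: the names pick_buf_size(), should_compare(), MIN_BUF_SIZE and LOW_BW_MIBPS are hypothetical, the factor of two applied to the L3 size is an assumption, and the 750MiB limit is treated as MiB/s.

```c
#include <stdbool.h>
#include <stddef.h>

#define MIB		(1024UL * 1024UL)
/* Assumption: keep the previously hardcoded buffer size as a lower bound. */
#define MIN_BUF_SIZE	(250 * MIB)
/* The "low" bandwidth limit from the cover letter, assumed to be in MiB/s. */
#define LOW_BW_MIBPS	750UL

/*
 * Hypothetical: choose a buffer that cannot fit in the L3 cache. The
 * factor of two is an assumption made for this sketch.
 */
static size_t pick_buf_size(size_t l3_size)
{
	size_t scaled = 2 * l3_size;

	return scaled > MIN_BUF_SIZE ? scaled : MIN_BUF_SIZE;
}

/*
 * Hypothetical: only compare the two measurements when both are at or
 * above the low bandwidth limit.
 */
static bool should_compare(unsigned long imc_mibps, unsigned long mbm_mibps)
{
	return imc_mibps >= LOW_BW_MIBPS && mbm_mibps >= LOW_BW_MIBPS;
}
```

With these assumptions a 320 MiB L3 cache yields a 640 MiB buffer, a 32 MiB L3 cache keeps the 250 MiB buffer, and a run measuring 600 MiB/s and 500 MiB/s is not compared against the tolerance.
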
[1] https://ark.intel.com/content/www/us/en/ark/products/237261/intel-xeon-plati...

Reinette Chatre (6):
  selftests/resctrl: Fix sparse warnings
  selftests/resctrl: Ensure measurements skip initialization of default benchmark
  selftests/resctrl: Simplify benchmark parameter passing
  selftests/resctrl: Use cache size to determine "fill_buf" buffer size
  selftests/resctrl: Do not compare performance counters and resctrl at low bandwidth
  selftests/resctrl: Keep results from first test run

 tools/testing/selftests/resctrl/cmt_test.c    |  33 +--
 tools/testing/selftests/resctrl/fill_buf.c    |  19 +-
 tools/testing/selftests/resctrl/mba_test.c    |  26 +-
 tools/testing/selftests/resctrl/mbm_test.c    |  25 +-
 tools/testing/selftests/resctrl/resctrl.h     |  57 +++--
 .../testing/selftests/resctrl/resctrl_tests.c |  15 +-
 tools/testing/selftests/resctrl/resctrl_val.c | 223 +++++------------
 7 files changed, 152 insertions(+), 246 deletions(-)