On Tue, 12 Sep 2023, Reinette Chatre wrote:
On 9/11/2023 4:19 AM, Ilpo Järvinen wrote:
5% difference upper bound for success is a bit on the low side for the
"a bit on the low side" is very vague.
The commit that introduced that 5% bound plainly admitted it's "randomly chosen value". At least that wasn't vague, I guess. :-)
So what I'm trying to do here is to have "randomly chosen value" replaced with a value that seems to work well enough based on measurements on a large set of platforms.
Personally, I don't care much about this, I can just ignore the failures due to outliers (and also reports about failing MBA/MBM tests if somebody ever sends one to me), but if I were the one running automated tests, it would be annoying to have a problem like this left unaddressed.
MBA and MBM tests. Some platforms produce outliers that are slightly above that, typically 6-7%.
Relaxing the MBA/MBM success bound to 8% removes most of the failures due to those frequent outliers.
This description needs more context on what issue is being solved here. What does the % difference represent? How was new percentage determined?
Did you investigate why there are differences between platforms? From what I understand these tests measure memory bandwidth using perf and resctrl and then compare the difference. Are there interesting things about the platforms on which the difference is higher than 5%?
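For readers following along, the comparison being discussed can be sketched roughly as below. This is a minimal illustration, not the selftest's code: the function names and the MAX_DIFF_PERCENT macro are hypothetical, and it assumes the difference is expressed relative to the perf (iMC) figure and truncated to whole percent.

```c
#include <stdlib.h>

/* Hypothetical bound: the patch under discussion proposes 8 instead of 5. */
#define MAX_DIFF_PERCENT 8

/*
 * Difference between the two bandwidth figures in whole percent, taking
 * the perf (iMC) value as the reference (an assumption of this sketch).
 * Returns -1 when the reference is 0 and no percentage can be computed.
 */
static int bw_diff_percent(long bw_imc, long bw_resctrl)
{
	if (bw_imc == 0)
		return -1;
	return (int)(labs(bw_resctrl - bw_imc) * 100 / labs(bw_imc));
}

/* Nonzero when the two measurements agree within MAX_DIFF_PERCENT. */
static int bw_within_bound(long bw_imc, long bw_resctrl)
{
	int diff = bw_diff_percent(bw_imc, bw_resctrl);

	return diff >= 0 && diff <= MAX_DIFF_PERCENT;
}
```

With these assumptions, a run where perf reports 1000 and resctrl reports 1065 gives a 6% difference: a failure under a 5% bound, a pass under 8%.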
Not really, I think. The number just isn't stable enough to always remain below 5% (even if it usually does).
The only systematic thing I've come across is that if I play with the read pattern to defeat the hw prefetcher (you've seen a patch earlier and it will be among the series I'll send after this one), it has an impact that looks more systematic across all MBM/MBA tests. But that's not what I'm trying to address with this patch.
Could those be systems with multiple sockets (and thus multiple PMUs that need to be set up, reset, and read)? Can the reading of the counters be improved instead of relaxing the success criteria? A quick comparison between get_mem_bw_imc() and get_mem_bw_resctrl() makes me think that a difference is not surprising ... note how the PMU counters are started and reset (potentially on multiple sockets) at every iteration while the resctrl counters keep rolling and new values are just subtracted from previous.
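The two counting styles contrasted above can be modelled with a toy example. This is only a sketch under stated assumptions, not the code of get_mem_bw_imc() or get_mem_bw_resctrl(): the struct, the helpers, and the "gap" traffic that arrives while a counter is being reset are all made up for illustration, and the thread does not establish that such gaps are what causes the outliers.

```c
#include <stddef.h>

/* Toy counter: only accumulates traffic while enabled. */
struct counter {
	unsigned long value;
	int enabled;
};

static void traffic(struct counter *c, unsigned long bytes)
{
	if (c->enabled)
		c->value += bytes;
}

/*
 * Reset style: the counter is zeroed and restarted every iteration, so
 * traffic arriving in the gap between stop and restart is never counted.
 */
static unsigned long measure_reset_style(const unsigned long *work,
					 const unsigned long *gap, size_t n)
{
	struct counter c = { 0, 0 };
	unsigned long total = 0;

	for (size_t i = 0; i < n; i++) {
		traffic(&c, gap[i]);	/* counter disabled: lost */
		c.value = 0;
		c.enabled = 1;
		traffic(&c, work[i]);
		c.enabled = 0;
		total += c.value;
	}
	return total;
}

/*
 * Rolling style: the counter never stops; each iteration subtracts the
 * previous reading from the new one, so gap traffic is included.
 */
static unsigned long measure_rolling_style(const unsigned long *work,
					   const unsigned long *gap, size_t n)
{
	struct counter c = { 0, 1 };
	unsigned long prev = 0, total = 0;

	for (size_t i = 0; i < n; i++) {
		traffic(&c, gap[i]);
		traffic(&c, work[i]);
		total += c.value - prev;
		prev = c.value;
	}
	return total;
}
```

In this model, three iterations of 1000 units of work with 20 units arriving in each gap yield 3000 for the reset style and 3060 for the rolling style, so the two methods disagree even though both counters are exact.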
Perhaps, I can try to look into it (add to my todo list so I won't forget). But in the meantime, this new value is picked using a criterion that looks better than "randomly chosen value". If I ever manage to address the outliers, the bound could be lowered again.
I'll update the changelog to explain things better.