Hello,
On Thu, 24 Oct 2024, Lorenzo Stoakes wrote:
benchmark seems to create many mappings of 4632kB, which would have merged to a large THP-backed area before commit efa7df3e3bb5 and now they are fragmented to multiple areas each aligned to PMD boundary with gaps between. The regression then seems to be caused mainly due to the benchmark's memory access pattern suffering from TLB or cache aliasing due to the aligned boundaries of the individual areas.
Any more details on precisely why?
Anything we found out and theorized about is in the SUSE bug report. I think the best theory is TLB aliasing when the mixing^Whash function in the given hardware uses too few bits, most of them in the low bits 21-12 of an address. Of course that then still depends on the particular access pattern. cactuBSSN has about 20 memory streams in the hot loops, and the accesses are fairly regular from step to step (plus/minus certain strides in 3D arrays). When their start addresses all differ only in the upper bits, you will hit TLB aliasing from time to time. When the dimensions/strides are just right it occurs often, the N-way associativity doesn't save you anymore, and you will hit it very very hard.
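To make the aliasing theory concrete, here is a toy model (not a description of any real CPU). It assumes a hypothetical TLB whose set index is simply the low 7 bits of the virtual page number, i.e. address bits 12-18. The names tlb_set and max_streams_per_set, the 7-bit index and the placement of one mapping every 3 PMDs are all made up for illustration. Only the 4632kB mapping size and the roughly 20 streams come from the discussion above.

```python
PAGE_SHIFT = 12          # 4 KiB base pages
PMD_SIZE = 1 << 21       # 2 MiB
MAPPING = 4632 * 1024    # mapping size seen in the benchmark
N_STREAMS = 20           # roughly the number of streams in the hot loops


def tlb_set(addr, index_bits=7):
    """Hypothetical set index: low index_bits bits of the virtual page number."""
    return (addr >> PAGE_SHIFT) & ((1 << index_bits) - 1)


def max_streams_per_set(bases, offset=0, index_bits=7):
    """Largest number of streams whose current page lands in one TLB set."""
    counts = {}
    for base in bases:
        s = tlb_set(base + offset, index_bits)
        counts[s] = counts.get(s, 0) + 1
    return max(counts.values())


# Each mapping starts on its own PMD boundary (assumed: 3 PMDs per mapping,
# since 4632kB needs more than 2 but less than 3 of them).
aligned = [i * 3 * PMD_SIZE for i in range(N_STREAMS)]

# Mappings placed back to back, as in one merged area.
packed = [i * MAPPING for i in range(N_STREAMS)]

aligned_worst = max_streams_per_set(aligned)
packed_worst = max_streams_per_set(packed)
print(aligned_worst, packed_worst)
```

In this model all 20 PMD-aligned streams compete for a single set at the same step, which would overwhelm e.g. a 4-way associative structure, while the back-to-back layout spreads them over 20 different sets. Real hardware hashes more cleverly than this, so take it only as an illustration of the effect.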
It was interesting to see how broad the range of CPUs and vendors was that exhibited the problem (to various degrees of severity, from 50% to 600% slowdown), and how more recent CPUs don't show the symptom anymore. I guess the micro-arch guys eventually convinced P&R management that hashing another bit or two is worth the silicon :-)
Ciao, Michael.