On 3/15/2024 11:02 AM, James Morse wrote:
On 08/03/2024 18:42, Tony Luck wrote:
On Fri, Mar 08, 2024 at 06:06:45PM +0000, James Morse wrote:
Hi guys,
On 07/03/2024 23:16, Tony Luck wrote:
On Thu, Mar 07, 2024 at 02:39:08PM -0800, Reinette Chatre wrote:
Thank you for the example. I find that significantly easier to understand than a single number in a generic "nodes_per_l3_cache". Especially with potential confusion surrounding inconsistent "nodes" between allocation and monitoring.
How about domain_cpu_list and domain_cpu_map ?
Like this (my test system doesn't have SNC, so all domains are the same):
$ cd /sys/fs/resctrl/info/
$ grep . */domain*
L3/domain_cpu_list:0: 0-35,72-107
L3/domain_cpu_list:1: 36-71,108-143
L3/domain_cpu_map:0: 0000,00000fff,ffffff00,0000000f,ffffffff
L3/domain_cpu_map:1: ffff,fffff000,000000ff,fffffff0,00000000
L3_MON/domain_cpu_list:0: 0-35,72-107
L3_MON/domain_cpu_list:1: 36-71,108-143
L3_MON/domain_cpu_map:0: 0000,00000fff,ffffff00,0000000f,ffffffff
L3_MON/domain_cpu_map:1: ffff,fffff000,000000ff,fffffff0,00000000
MB/domain_cpu_list:0: 0-35,72-107
MB/domain_cpu_list:1: 36-71,108-143
MB/domain_cpu_map:0: 0000,00000fff,ffffff00,0000000f,ffffffff
MB/domain_cpu_map:1: ffff,fffff000,000000ff,fffffff0,00000000
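For readers comparing the two proposed files, here is a minimal sketch of how a domain_cpu_map value corresponds to the matching domain_cpu_list value. It assumes the map uses the usual kernel cpumask text format (comma-separated 32-bit hex words, most significant word first); the function names `cpumask_to_cpus` and `cpus_to_list` are made up for this illustration and are not part of resctrl.

```python
def cpumask_to_cpus(mask):
    """Decode a comma-separated hex CPU mask (most significant word first)."""
    value = 0
    for word in mask.strip().split(","):
        # Each word after the first carries 32 bits of the mask.
        value = (value << 32) | int(word, 16)
    cpus = []
    cpu = 0
    while value:
        if value & 1:
            cpus.append(cpu)
        value >>= 1
        cpu += 1
    return cpus


def cpus_to_list(cpus):
    """Format CPU numbers as a range list such as '0-35,72-107'."""
    ranges = []
    start = prev = None
    for cpu in sorted(cpus):
        if start is None:
            start = prev = cpu
        elif cpu == prev + 1:
            prev = cpu
        else:
            ranges.append((start, prev))
            start = prev = cpu
    if start is not None:
        ranges.append((start, prev))
    return ",".join(str(a) if a == b else f"{a}-{b}" for a, b in ranges)


if __name__ == "__main__":
    # Domain 0 map from the example output above.
    print(cpus_to_list(cpumask_to_cpus("0000,00000fff,ffffff00,0000000f,ffffffff")))
```

Applied to the two map values in the example output, this yields the same CPU ranges that the corresponding domain_cpu_list lines show.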
This duplicates the information in /sys/devices/system/cpu/cpuX/cache/indexY ... is this really because that information is, er, wrong on SNC systems? Is it possible to fix that?
On an SNC system the resctrl domain for L3_MON becomes the SNC node instead of the L3 cache instance, with 2, 3, or 4 SNC nodes per L3.
Even without the SNC issue this duplication may be a useful convenience. On Intel, getting from a resctrl domain to its CPUs is a multi-step process: first find which of the indexY directories has level=3, and then look for the "id" that matches the domain.
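The multi-step lookup can be sketched roughly as follows. This is an illustration only: it assumes the standard sysfs cache files `level`, `id` and `shared_cpu_list` under each indexY directory, and that the resctrl domain id equals the L3 cache id (which, per this thread, does not hold for L3_MON on SNC systems). The helper names `read_file` and `l3_domain_cpus`, and the `sysfs_root` parameter (there so the code can be pointed at a test tree), are hypothetical.

```python
import glob
import os


def read_file(path):
    """Return the stripped contents of a small sysfs-style text file."""
    with open(path) as f:
        return f.read().strip()


def l3_domain_cpus(domain_id, sysfs_root="/sys/devices/system/cpu"):
    """Return shared_cpu_list of the level-3 cache whose id is domain_id, else None."""
    pattern = os.path.join(sysfs_root, "cpu[0-9]*", "cache", "index[0-9]*")
    for index_dir in sorted(glob.glob(pattern)):
        try:
            # Step 1: only consider the indexY directory that describes L3.
            if read_file(os.path.join(index_dir, "level")) != "3":
                continue
            # Step 2: match the cache "id" against the resctrl domain id.
            if int(read_file(os.path.join(index_dir, "id"))) != domain_id:
                continue
            return read_file(os.path.join(index_dir, "shared_cpu_list"))
        except (OSError, ValueError):
            # Missing or unreadable attribute: skip this directory.
            continue
    return None


if __name__ == "__main__":
    print(l3_domain_cpus(0))
```

With a domain_cpu_list file as proposed above, the same answer would come from a single read in the resctrl info directory.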
From Tony's earlier description of how SNC changes things, the MB controls remain per-socket. To me it feels less invasive to fix the definition of L3 on these platforms to describe how it behaves (assuming that is possible), and define a new 'MB' that is NUMA scoped. This direction of redefining L3 means /sys/fs/resctrl and /sys/devices have different views of 'the' cache hierarchy.
I almost went partly in that direction when I started this epic voyage. The "almost" part was to change the names of the monitoring directories under mon_data from (legacy non-SNC system):
$ ls -l mon_data
total 0
dr-xr-xr-x. 2 root root 0 Mar 8 10:31 mon_L3_00
dr-xr-xr-x. 2 root root 0 Mar 8 10:31 mon_L3_01
to (2 socket, SNC=2 system):
$ ls -l mon_data
total 0
dr-xr-xr-x. 2 root root 0 Mar 8 10:31 mon_NODE_00
dr-xr-xr-x. 2 root root 0 Mar 8 10:31 mon_NODE_01
dr-xr-xr-x. 2 root root 0 Mar 8 10:31 mon_NODE_02
dr-xr-xr-x. 2 root root 0 Mar 8 10:31 mon_NODE_03
This would be useful for MPAM. I've seen a couple of MPAM systems that have per-NUMA MPAM controls on the 'L3', but describe it as a single global L3. The MPAM driver currently hides this by summing the NUMA node counters and reporting it as the global L3's value.
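As a rough illustration of the summing described here (not the MPAM driver's actual code, which is kernel C), the sketch below folds per-node counter values into one value per L3. The function name `sum_nodes_per_l3`, the node-to-L3 mapping, and the sample numbers are all invented for the example.

```python
from collections import defaultdict


def sum_nodes_per_l3(node_counts, node_to_l3):
    """Sum per-node counter values into one total per L3 cache id."""
    totals = defaultdict(int)
    for node, count in node_counts.items():
        totals[node_to_l3[node]] += count
    return dict(totals)


if __name__ == "__main__":
    # Hypothetical layout: nodes 0-1 share L3 0, nodes 2-3 share L3 1.
    node_to_l3 = {0: 0, 1: 0, 2: 1, 3: 1}
    node_counts = {0: 100, 1: 50, 2: 7, 3: 3}
    print(sum_nodes_per_l3(node_counts, node_to_l3))
```

Reporting only the summed value keeps the legacy per-L3 view, at the cost of hiding the per-node breakdown.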
While that is in some ways a more accurate view, it breaks a lot of legacy monitoring applications that expect the "L3" names.
True - but the behaviour is different from a non-SNC system. If this software can read the file but goes wrong because the contents of the file represent something different, it's still broken.
This is a good point. There is also /sys/fs/resctrl/info/L3_MON to consider, and trying to think what to do about that makes me go in circles about when user space may expect resctrl to indicate the resource and when user space may expect resctrl to indicate the scope. For example, /sys/fs/resctrl/mon_data/mon_L3_00 contains files with data that monitor the "L3" _resource_, no? If we change that to /sys/fs/resctrl/mon_data/mon_NODE_00 then it switches the meaning of the middle term to be "scope" while it still contains the monitoring data of the "L3" resource. So does that mean user space would need to rely on /sys/fs/resctrl/info/L3_MON to learn which monitoring files (listed in /sys/fs/resctrl/info/L3_MON/mon_features) relate to the particular resource, then match those filenames against the filenames in /sys/fs/resctrl/mon_data/mon_NODE_00 to know which resource they apply to, and learn from the directory name what scope the measurement is at?
Reinette