Hi Tony,
On Mon, Mar 18, 2024 at 3:05 PM Luck, Tony <tony.luck@intel.com> wrote:
> > Could you please help me understand the details by answering my first question: What is the use case for needing to expose the individual cluster counts?
> > This is a model specific feature so if this is something needed for just a couple of systems I think we should be less inclined to make changes to resctrl interface. I am starting to be concerned about something similar becoming architectural later and then we need to wrangle this model specific resctrl support (which has then become ABI) again to support whatever that may look like.
> Reinette,
> Model specific. But present in multiple consecutive generations (Sapphire Rapids, Emerald Rapids, Granite Rapids, Sierra Forest).
> Adding Peter Newman for a resctrl user perspective on SNC, rather than me continue to speculate on possible ways this might be used.
> Peter: You will need to dig back a few messages on lore.kernel.org to get context.
Our main concern with supporting SNC in resctrl is that all of the monitoring groups successfully record memory bandwidth from all CPUs, regardless of the RMIDs they're assigned.
I would prefer that we don't complicate the model of resctrl monitoring domains for this feature. On ARM SoCs there will be a plethora of technologies influencing the layout of resources, so we shouldn't start cluttering the model with special cases for each.
I think it's valid for the number of domains in the L3 resource to increase or stay the same when the system is configured for SNC. I don't think the details of how the domains came about are relevant at the resctrl interface level, so long as the user has enough information to deduce what each domain refers to based on knowledge of their system configuration.
I would prefer per-cluster as more information could prove useful in some future investigation, but if you feel the data is misleading, providing the clusters combined is also fine. I would prefer that the choice remains consistent from this point forward on any particular implementation to avoid breaking existing controller software developed for that implementation.
In our main use case, we sum mon_data/*/mbm_total_bytes to determine a group's total bandwidth, so please don't cause this logic to produce the wrong answer.
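For illustration, a minimal sketch of that summation, assuming the usual resctrl layout of one directory per monitoring domain under a group's mon_data. The function name group_total_bytes is hypothetical and this is not our actual controller code:

```python
from pathlib import Path


def group_total_bytes(group_dir):
    """Sum mbm_total_bytes over every monitoring domain of one resctrl group.

    group_dir is the group's directory in the resctrl filesystem
    (hypothetical helper, written only to illustrate the summation).
    """
    total = 0
    # One mbm_total_bytes file is expected per domain directory under mon_data.
    for counter in sorted(Path(group_dir).glob("mon_data/*/mbm_total_bytes")):
        text = counter.read_text().strip()
        # A counter file may hold a non-numeric status such as "Unavailable";
        # this sketch skips those domains instead of guessing a value.
        if text.isdigit():
            total += int(text)
    return total
```

The glob picks up whatever domain directories exist, whether per-cluster or combined, so the result is only right if the per-domain counts together cover the group's traffic without gaps or double counting.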
Thanks!
-Peter