Hi Peter,
On 12/6/2023 10:38 AM, Peter Newman wrote:
Hi Reinette,
On Tue, Dec 5, 2023 at 5:47 PM Reinette Chatre <reinette.chatre@intel.com> wrote:
On 12/5/2023 4:33 PM, Peter Newman wrote:
On Tue, Dec 5, 2023 at 1:57 PM Reinette Chatre <reinette.chatre@intel.com> wrote:
On 12/1/2023 12:56 PM, Peter Newman wrote:
Ignoring any present-day resctrl interfaces, what we minimally need is...
1. global "start measurement", which enables a read-counters-on-context-switch flag, and broadcasts an IPI to all CPUs to read their current count
2. wait 5 seconds
3. global "end measurement", to IPI all CPUs again for final counts and clear the flag from step 1
Then the user could read at their leisure all the (frozen) event counts from memory until the next measurement begins.
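The three steps above can be sketched as a toy user-space model. This is only an illustration of the proposed flow, not kernel code: the `Cpu` and `Monitor` classes, their fields, and the loops standing in for IPIs are all hypothetical names invented for this sketch, and discarding the pre-window count stands in for recording a baseline.

```python
from dataclasses import dataclass, field


@dataclass
class Cpu:
    """One CPU with a hardware count for whichever RMID is current (toy model)."""
    rmid: int
    hw_count: int = 0  # bytes counted since the last software read


@dataclass
class Monitor:
    """Hypothetical model of the start/wait/end measurement window."""
    cpus: list
    measuring: bool = False                      # read-counters-on-context-switch flag
    totals: dict = field(default_factory=dict)   # rmid -> bytes seen in the window

    def _read(self, cpu):
        # Fold the CPU's pending count into its current RMID, then reset it.
        self.totals[cpu.rmid] = self.totals.get(cpu.rmid, 0) + cpu.hw_count
        cpu.hw_count = 0

    def start_measurement(self):
        self.totals = {}
        for cpu in self.cpus:        # stands in for the first broadcast IPI
            cpu.hw_count = 0         # drop traffic from before the window
        self.measuring = True

    def context_switch(self, cpu, new_rmid):
        if self.measuring and new_rmid != cpu.rmid:
            self._read(cpu)          # extra cost is paid only inside the window
        cpu.rmid = new_rmid

    def end_measurement(self):
        for cpu in self.cpus:        # stands in for the second broadcast IPI
            self._read(cpu)
        self.measuring = False       # totals stay frozen until the next start


cpus = [Cpu(rmid=1), Cpu(rmid=2)]
mon = Monitor(cpus)
mon.start_measurement()
cpus[0].hw_count += 100              # traffic while RMID 1 runs on CPU 0
mon.context_switch(cpus[0], 3)
cpus[0].hw_count += 50               # traffic while RMID 3 runs on CPU 0
mon.end_measurement()
print(mon.totals)
```

After `end_measurement()` the `totals` dictionary is no longer touched by context switches, which is the "frozen" property that lets user space read the counts at its leisure.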
In our case, if we're measuring as often as 5 seconds for every minute, that will already be a 12x aggregate reduction in overhead, which would be worthwhile enough.
The "con" here would be that during those 5 seconds (which I assume would be controlled via user space, so potentially shorter or longer) all tasks in the system are expected to see a significant (but yet to be measured) impact on context switch delay.
Yes, of course. In the worst case I've measured, Zen2, it's roughly a 1700-cycle context switch penalty (~20%) for tasks in different monitoring groups. Bad, but the benefit we gain from the per-RMID MBM data makes up for it several times over if we only pay the cost during a measurement.
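As a rough worked version of the numbers in this exchange (a 5-second window per minute, and the 1700-cycle worst case on Zen2), the following sketch computes the aggregate reduction and the penalty averaged over the whole period. The function name is made up for illustration, and the averaging assumes context switches are spread evenly over time, which the thread does not claim.

```python
def overhead_summary(window_s, period_s, penalty_cycles):
    """Amortize a per-context-switch penalty that is only paid inside the window."""
    duty_cycle = window_s / period_s
    return {
        # How much less often the penalty is paid versus always-on measurement.
        "reduction_factor": period_s / window_s,
        # Average extra cycles per context switch over the full period,
        # ASSUMING switches are uniformly distributed in time.
        "amortized_penalty_cycles": penalty_cycles * duty_cycle,
    }


summary = overhead_summary(window_s=5, period_s=60, penalty_cycles=1700)
print(summary)
```

With these inputs the reduction factor is 12, matching the "12x" figure quoted earlier, and the averaged penalty comes to roughly 142 cycles per context switch.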
I see.
I expect the overflow handler should only be run during the measurement timeframe, to not defeat the "at their leisure" reading of counters.
Yes, correct. We wouldn't be interested in overflows of the hardware counter when not actively measuring bandwidth.
The second involves avoiding the situation where a hardware counter could be deallocated: Determine the number of simultaneous RMIDs supported, reduce the effective number of RMIDs available to that number. Use the default RMID (0) for all "unassigned" monitoring
hmmm ... so on the one side there is "only the RMID within the PQR register can be guaranteed to be tracked by hardware" and on the other side there is "A given implementation may have insufficient hardware to simultaneously track the bandwidth for all RMID values that the hardware supports."
From the above there seems to be something in the middle where some subset of the RMID values supported by hardware can be used to simultaneously track bandwidth? How can it be determined what this number of RMID values is?
In the context of AMD, we could use the smallest number of CPUs in any L3 domain as a lower bound of the number of counters.
Could you please elaborate on this? (With the numbers of CPUs nowadays this may be many RMIDs, perhaps even more than what ABMC supports.)
I think the "In the context of AMD" part is key. This feature would only be applicable to the AMD implementations we have today which do not implement ABMC. I believe the difficulties are unique to the topologies of these systems: many small L3 domains per node with a relatively small number of CPUs in each. If the L3 domains were large and few, simply restricting the number of RMIDs and allocating on group creation as we do today would probably be fine.
I am missing something here since it is not obvious to me how this lower bound is determined. Let's assume that there are as many monitor groups (and thus as many assigned RMIDs) as there are CPUs in a L3 domain. Each monitor group may have many tasks. It can be expected that at any moment in time only a subset of assigned RMIDs are assigned to CPUs via the CPUs' PQR registers. Of those RMIDs that are not assigned to CPUs, how can it be certain that they continue to be tracked by hardware?
Are you asking whether the counters will ever be reclaimed proactively? The behavior I've observed is that writing a new RMID into a PQR_ASSOC register when all hardware counters in the domain are allocated will trigger the reallocation.
"When all hardware counters in the domain are allocated" sounds like the ideal scenario, with the kernel knowing how many counters there are and each counter being associated with a unique RMID. As long as the kernel does not attempt to monitor another RMID, this would accurately monitor the monitor groups with an "assigned" RMID.
Adding support for hardware without specification and guaranteed behavior can potentially run into unexpected scenarios.
For example, there is no guarantee on how the counters are assigned. The OS and hardware may thus have a different view of which hardware counter is "free". The OS may write a new RMID to PQR_ASSOC believing that there is a counter available, while hardware has its own mechanism of allocation and may reallocate a counter that is in use by an RMID that the OS believes to be "assigned". I do not think anything prevents hardware from doing this.
However, I admit the wording in the PQoS spec[1] is only written to support the permanent-assignment workaround in the current patch series:
"All RMIDs which are currently in use by one or more processors in the QOS domain will be tracked. The hardware will always begin tracking a new RMID value when it gets written to the PQR_ASSOC register of any of the processors in the QOS domain and it is not already being tracked. When the hardware begins tracking an RMID that it was not previously tracking, it will clear the QM_CTR for all events in the new RMID."
I would need to confirm whether this is the case and request the documentation be clarified if it is.
Indeed. Once an RMID is "assigned", the expectation is that a counter will be dedicated to it, but a PQR_ASSOC register may not see that RMID for potentially long intervals. With the above guarantees, hardware will be within its rights to reallocate that RMID's counter even if there are other counters that are "free" from the OS's perspective.
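To make the concern concrete, here is a toy simulation of a domain with a fixed pool of counters. The eviction policy is purely an assumption for illustration (least recently written to PQR_ASSOC); the quoted spec text does not say how hardware picks a counter to reallocate, which is exactly the problem. The `Domain` class and its methods are hypothetical.

```python
from collections import OrderedDict


class Domain:
    """Toy L3 domain with a fixed number of hardware counters.

    ASSUMPTION: when no counter is free, hardware evicts the RMID that was
    written to PQR_ASSOC least recently. The real policy is not documented.
    """

    def __init__(self, num_counters):
        self.num_counters = num_counters
        self.tracked = OrderedDict()   # rmid -> count, oldest PQR_ASSOC write first

    def write_pqr_assoc(self, rmid):
        if rmid in self.tracked:
            self.tracked.move_to_end(rmid)     # already tracked, just refresh
            return
        if len(self.tracked) >= self.num_counters:
            self.tracked.popitem(last=False)   # evict; that RMID's count is lost
        self.tracked[rmid] = 0                 # newly tracked RMID starts from zero

    def add_traffic(self, rmid, nbytes):
        if rmid in self.tracked:
            self.tracked[rmid] += nbytes

    def read_counter(self, rmid):
        return self.tracked.get(rmid)          # None models "not currently tracked"


domain = Domain(num_counters=2)
domain.write_pqr_assoc(1)          # OS "assigns" RMID 1
domain.add_traffic(1, 100)
domain.write_pqr_assoc(2)          # OS "assigns" RMID 2; RMID 1's tasks go idle
domain.write_pqr_assoc(3)          # OS believes a counter is free, hardware disagrees
lost = domain.read_counter(1)      # RMID 1 is no longer tracked
domain.write_pqr_assoc(1)          # RMID 1 runs again and restarts from zero
restarted = domain.read_counter(1)
print(lost, restarted)
```

Under this assumed policy the 100 bytes counted for RMID 1 are silently dropped even though the OS considered that RMID "assigned"; a different hardware policy would fail in different cases, but the OS cannot tell which.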
While the second feature is a lot more disruptive at the filesystem layer, it does eliminate the added context switch overhead. Also, it
Which changes to filesystem layer are you anticipating?
Roughly speaking...
- The proposed "assign" interface would have to become more indirect to avoid understanding how assign could be implemented on various platforms.
It is almost starting to sound like we could learn from the tracing interface where individual events can be enabled/disabled ... with several events potentially enabled with an "enable" done higher in hierarchy, perhaps even globally to support the first approach ...
Sorry, can you clarify the part about the tracing interface? Tracing to support dynamic autoconfiguration of events?
I do not believe we are attempting to do anything revolutionary here, so I would like to consider other interfaces that user space may be familiar and comfortable with. The first that came to mind was the tracefs interface and how user space interacts with it to enable trace events. tracefs uses the "enable" file, which is present at different levels of the hierarchy and which user space can use to enable tracing of all events below that level. There is also the global "tracing_on" that user space can use to dynamically start/stop tracing without needing to frequently enable/disable events of interest.
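For readers less familiar with tracefs, the hierarchy of "enable" files described above can be driven roughly as follows. This is a minimal sketch: the helper names are invented here, the `root` parameter is an assumption so the code can be pointed at a scratch directory, and on a real system tracefs is usually mounted at /sys/kernel/tracing and needs root privileges to write.

```python
from pathlib import Path


def enable_path(root, subsystem=None, event=None):
    """Return the "enable" file for one event, one subsystem, or all events."""
    if event is not None and subsystem is None:
        raise ValueError("an event must be given together with its subsystem")
    path = Path(root) / "events"
    if subsystem is not None:
        path = path / subsystem
        if event is not None:
            path = path / event
    return path / "enable"


def set_enabled(root, on, subsystem=None, event=None):
    """Enable or disable events at the chosen level of the hierarchy."""
    enable_path(root, subsystem, event).write_text("1" if on else "0")


def set_tracing_on(root, on):
    """Start or stop tracing globally without touching per-event settings."""
    (Path(root) / "tracing_on").write_text("1" if on else "0")
```

For example, `set_enabled("/sys/kernel/tracing", True, "sched", "sched_switch")` would enable a single event, `set_enabled("/sys/kernel/tracing", True, "sched")` every event in the sched subsystem, and `set_tracing_on("/sys/kernel/tracing", False)` would pause all tracing while leaving those selections in place.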
I do see some parallels with the discussions we have been having. I am not proposing that we adapt the tracefs interface, but instead that we can perhaps learn from it.
Reinette