Hi Peter,
On 12/5/2023 4:33 PM, Peter Newman wrote:
On Tue, Dec 5, 2023 at 1:57 PM Reinette Chatre <reinette.chatre@intel.com> wrote:
On 12/1/2023 12:56 PM, Peter Newman wrote:
On Tue, May 16, 2023 at 5:06 PM Reinette Chatre wrote:
I think it may be optimistic to view this as a replacement of a PQR write. As you point out, that requires that a CPU switches between tasks with the same CLOSID. You demonstrate that resctrl already contributes a significant delay to __switch_to; this work will increase that much more, so it has to be clear about this impact and motivate why it is acceptable.
We were operating under the assumption that if the overhead wasn't acceptable, we would have heard complaints about it by now, but we ultimately learned that this feature wasn't deployed on AMD hardware as widely as we had originally thought, and that the overhead does need to be addressed.
I am interested in your opinion on two options I'm exploring to mitigate the overhead, both of which depend on an API like the one Babu recently proposed for the AMD ABMC feature [1], where a new file interface will allow the user to indicate which mon_groups are actively being measured. I will refer to this as "assigned" for now, as that's the current proposal.
The first is likely the simpler approach: only read MBM event counters which have been marked as "assigned" in the filesystem to avoid paying the context switch cost on tasks in groups which are not actively being measured. In our use case, we calculate memory bandwidth on every group every few minutes by reading the counters twice, 5 seconds apart. We would just need counters read during this 5-second window.
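(For illustration, the per-group calculation over such a window is just a byte-count delta divided by the window length; a minimal sketch in C, where read_group_bytes() is a made-up stand-in for reading a group's mbm_total_bytes file:)

  #include <stdint.h>
  #include <unistd.h>

  /* Hypothetical helper: returns a group's current mbm_total_bytes count. */
  extern uint64_t read_group_bytes(void);

  /* Bandwidth over the measurement window described above. */
  static uint64_t group_bw_bytes_per_sec(unsigned int window_secs)
  {
          uint64_t start = read_group_bytes();

          sleep(window_secs);              /* e.g. the 5-second window */
          return (read_group_bytes() - start) / window_secs;
  }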
I assume that tasks within a monitoring group can be scheduled on any CPU and from the cover letter of this work I understand that only an RMID assigned to a processor can be guaranteed to be tracked by hardware.
Are you proposing for this option that you keep this "soft RMID" approach with CPUs permanently assigned a "hard RMID" but only update the counts for a "soft RMID" that is "assigned"?
Yes
I think that means that the context switch cost for the monitored group would increase even more than with the implementation in this series since the counters need to be read on context switch in as well as context switch out.
If I understand correctly then only one monitoring group can be measured at a time. If such a measurement takes 5 seconds then theoretically 12 groups can be measured in one minute. It may be possible to create many more monitoring groups than this. Would it be possible to reach monitoring goals in your environment?
We actually measure all of the groups at the same time, so thinking about this more, the proposed ABMC fix isn't actually a great fit: the user would have to assign all groups individually when a global setting would have been fine.
Ignoring any present-day resctrl interfaces, what we minimally need is...
- global "start measurement", which enables a
read-counters-on-context switch flag, and broadcasts an IPI to all CPUs to read their current count 2. wait 5 seconds 3. global "end measurement", to IPI all CPUs again for final counts and clear the flag from step 1
Then the user could read at their leisure all the (frozen) event counts from memory until the next measurement begins.
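(A minimal kernel-side sketch of that sequence, with invented names throughout; none of these symbols exist in resctrl today, and the flag would be tested in the context switch path:)

  #include <linux/smp.h>
  #include <linux/jump_label.h>

  /* Tested by the context switch path to decide whether to read counters. */
  DEFINE_STATIC_KEY_FALSE(mbm_measurement_active);

  /* Stand-in for whatever per-CPU counter read the architecture needs. */
  static void mbm_read_counters_ipi(void *unused)
  {
          /* read this CPU's current MBM counts into memory */
  }

  static void mbm_start_measurement(void)
  {
          static_branch_enable(&mbm_measurement_active);
          on_each_cpu(mbm_read_counters_ipi, NULL, 1);   /* initial counts */
  }

  static void mbm_end_measurement(void)
  {
          on_each_cpu(mbm_read_counters_ipi, NULL, 1);   /* final, "frozen" counts */
          static_branch_disable(&mbm_measurement_active);
  }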
In our case, if we're measuring as often as 5 seconds for every minute, that will already be a 12x aggregate reduction in overhead, which would be worthwhile enough.
The "con" here would be that during those 5 seconds (which I assume would be controlled via user space so potentially shorter or longer) all tasks in the system is expected to have significant (but yet to be measured) impact on context switch delay. I expect the overflow handler should only be run during the measurement timeframe, to not defeat the "at their leisure" reading of counters.
The second involves avoiding the situation where a hardware counter could be deallocated: Determine the number of simultaneous RMIDs supported, reduce the effective number of RMIDs available to that number. Use the default RMID (0) for all "unassigned" monitoring
hmmm ... so on the one side there is "only the RMID within the PQR register can be guaranteed to be tracked by hardware" and on the other side there is "A given implementation may have insufficient hardware to simultaneously track the bandwidth for all RMID values that the hardware supports."
From the above there seems to be something in the middle where some subset of the RMID values supported by hardware can be used to simultaneously track bandwidth? How can it be determined what this number of RMID values is?
In the context of AMD, we could use the smallest number of CPUs in any L3 domain as a lower bound of the number of counters.
Could you please elaborate on this? (With the numbers of CPUs nowadays this may be many RMIDs, perhaps even more than what ABMC supports.)
I am missing something here since it is not obvious to me how this lower bound is determined. Let's assume that there are as many monitor groups (and thus as many assigned RMIDs) as there are CPUs in an L3 domain. Each monitor group may have many tasks. It can be expected that at any moment in time only a subset of assigned RMIDs are assigned to CPUs via the CPUs' PQR registers. Of those RMIDs that are not assigned to CPUs, how can it be certain that they continue to be tracked by hardware?
If the number is actually higher, it's not too difficult to probe at runtime. The technique used by the test script[1] reliably identifies the number of counters, but some experimentation would be needed to see how quickly the hardware will repurpose a counter, as the workload the script uses today runs far too long for the kernel to be invoking it.
Maybe a reasonable compromise would be to initialize the HW counter estimate at the CPUs-per-domain value and add a file node to let the user increase it if they have better information. The worst that can happen is the present-day behavior.
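(To make that lower bound concrete, a sketch against the current resctrl domain structures; untested pseudo-patch code, and the user-facing override knob is only hypothetical:)

  /* Estimate hardware counters as the smallest CPU count in any L3
   * domain; a (made-up) file node could let the user raise this.
   */
  static unsigned int mbm_hw_counter_estimate(struct rdt_resource *r)
  {
          struct rdt_domain *d;
          unsigned int min_cpus = UINT_MAX;

          list_for_each_entry(d, &r->domains, list)
                  min_cpus = min(min_cpus, cpumask_weight(&d->cpu_mask));

          return min_cpus;
  }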
groups and report "Unavailable" on all counter reads (and address the default monitoring group's counts being unreliable). When assigned, attempt to allocate one of the remaining, usable RMIDs to that group. It would only be possible to assign all event counters (local, total, occupancy) at the same time. Using this approach, we would no longer be able to measure all groups at the same time, but this is something we would already be accepting when using the AMD ABMC feature.
It may be possible to turn this into a "fake"/"software" ABMC feature, which I expect needs to be renamed to move it away from a hardware-specific feature to something that better reflects how the user interacts with the system and how the system responds.
Given the similarities in monitoring with ABMC and MPAM, I would want to see the interface generalized anyway.
While the second feature is a lot more disruptive at the filesystem layer, it does eliminate the added context switch overhead. Also, it ...
Which changes to filesystem layer are you anticipating?
Roughly speaking...
- The proposed "assign" interface would have to become more indirect
to avoid understanding how assign could be implemented on various platforms.
It is almost starting to sound like we could learn from the tracing interface where individual events can be enabled/disabled ... with several events potentially enabled with an "enable" done higher in hierarchy, perhaps even globally to support the first approach ...
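(Purely to illustrate that analogy, a hypothetical layout; neither of these files exists today:)

  /sys/fs/resctrl/mon_groups/g1/enable    # per-group, like a tracefs event's "enable"
  /sys/fs/resctrl/enable                  # higher in the hierarchy: everything at
                                          # once, covering the first approach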
- RMID management would have to change, because this would introduce the option where creating monitoring groups no longer allocates an RMID. It may be cleaner for the filesystem to just track whether a group has allocated monitoring resources or not and let a lower layer understand what the resources actually are (and in the default mode, groups can only be created with pre-allocated resources); see the sketch below.
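(A rough sketch of that split, with invented names; today the filesystem's mongroup stores an RMID directly:)

  /* The filesystem only knows whether monitoring resources are held;
   * a lower layer knows what they are (an RMID, an ABMC counter
   * assignment, an MPAM monitor, ...).
   */
  struct mon_resources;                   /* opaque to the filesystem layer */

  struct mongroup_sketch {
          struct mon_resources *res;      /* NULL when nothing is allocated */
  };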
This matches my understanding of what MPAM would need.
If I get the impression that this is the better approach, I'll build a prototype on top of the ABMC patches to see how it would go.
So far it seems only the second approach (software ABMC) really ties in with Babu's work.
Thanks! -Peter
[1] https://lore.kernel.org/all/20230421141723.2405942-2-peternewman@google.com/
Reinette