Hi Reinette,
On Fri, May 12, 2023 at 3:25 PM Peter Newman peternewman@google.com wrote:
On Thu, May 11, 2023 at 11:37 PM Reinette Chatre reinette.chatre@intel.com wrote:
On 4/21/2023 7:17 AM, Peter Newman wrote:
if (evtid == QOS_L3_MBM_LOCAL_EVENT_ID) {
counter = &state->local;
} else {
WARN_ON(evtid != QOS_L3_MBM_TOTAL_EVENT_ID);
counter = &state->total;
}
/*
* Propagate the value read from the hw_rmid assigned to the current CPU
* into the "soft" rmid associated with the current task or CPU.
*/
m = get_mbm_state(d, soft_rmid, evtid);
if (!m)
return;
if (resctrl_arch_rmid_read(r, d, hw_rmid, evtid, &val))
return;
This all seems unsafe to run without protection. The code relies on the rdt_domain but a CPU hotplug event could result in the domain disappearing underneath this code. The accesses to the data structures also appear unsafe to me. Note that resctrl_arch_rmid_read() updates the architectural MBM state and this same state can be updated concurrently in other code paths without appropriate locking.
The domain is supposed to always be the current one, but I see that even a get_domain_from_cpu(smp_processor_id(), ...) call needs to walk a resource's domain list to find a matching entry, which could be concurrently modified when other domains are added/removed.
Similarly, when soft RMIDs are enabled, it should not be possible to call resctrl_arch_rmid_read() outside of on the current CPU's HW RMID.
I'll need to confirm whether it's safe to access the current CPU's rdt_domain in an atomic context. If it isn't, I assume I would have to arrange all of the state used during flush to be per-CPU.
I expect the constraints on what data can be safely accessed where is going to constrain how the state is ultimately arranged, so I will need to settle this before I can come back to the other questions about mbm_state.
According to cpu_hotplug.rst, the startup callbacks are called before a CPU is started and the teardown callbacks are called after the CPU has become dysfunctional, so it should always be safe for a CPU to access its own data, so all I need to do here is avoid walking domain lists in resctrl_mbm_flush_cpu().
However, this also means that resctrl_{on,off}line_cpu() call clear_closid_rmid() on a different CPU, so whichever CPU executes these will zap its own pqr_state struct and PQR_ASSOC MSR.
-Peter