Hi Reinette, Fenghua,
This series introduces a new mount option enabling an alternate mode for MBM to work around an issue on current AMD implementations, and on any other resctrl implementation where there are more RMIDs (or equivalent) than hardware counters.
The L3 External Bandwidth Monitoring feature of the AMD PQoS extension[1] only guarantees that RMIDs currently assigned to a processor will be tracked by hardware. The counters of any other RMIDs which are no longer being tracked will be reset to zero. The MBM event counters return "Unavailable" to indicate when this has happened.
An interval for effectively measuring memory bandwidth typically needs to be multiple seconds long. In Google's workloads, it is not feasible to bound the number of jobs with different RMIDs which will run in a cache domain over any period of time. Consequently, on a fully-committed system where all RMIDs are allocated, few groups' counters return non-zero values.
To demonstrate the underlying issue, the first patch provides a test case in tools/testing/selftests/resctrl/test_rmids.sh.
On an AMD EPYC 7B12 64-Core Processor with the default behavior:
# ./test_rmids.sh
Created 255 monitoring groups.
g1: mbm_total_bytes: Unavailable -> Unavailable (FAIL)
g2: mbm_total_bytes: Unavailable -> Unavailable (FAIL)
g3: mbm_total_bytes: Unavailable -> Unavailable (FAIL)
[..]
g238: mbm_total_bytes: Unavailable -> Unavailable (FAIL)
g239: mbm_total_bytes: Unavailable -> Unavailable (FAIL)
g240: mbm_total_bytes: Unavailable -> Unavailable (FAIL)
g241: mbm_total_bytes: Unavailable -> 660497472
g242: mbm_total_bytes: Unavailable -> 660793344
g243: mbm_total_bytes: Unavailable -> 660477312
g244: mbm_total_bytes: Unavailable -> 660495360
g245: mbm_total_bytes: Unavailable -> 660775360
g246: mbm_total_bytes: Unavailable -> 660645504
g247: mbm_total_bytes: Unavailable -> 660696128
g248: mbm_total_bytes: Unavailable -> 660605248
g249: mbm_total_bytes: Unavailable -> 660681280
g250: mbm_total_bytes: Unavailable -> 660834240
g251: mbm_total_bytes: Unavailable -> 660440064
g252: mbm_total_bytes: Unavailable -> 660501504
g253: mbm_total_bytes: Unavailable -> 660590720
g254: mbm_total_bytes: Unavailable -> 660548352
g255: mbm_total_bytes: Unavailable -> 660607296
255 groups, 0 returned counts in first pass, 15 in second
successfully measured bandwidth from 15/255 groups
To compare, here is the output from an Intel(R) Xeon(R) Platinum 8173M CPU:
# ./test_rmids.sh
Created 223 monitoring groups.
g1: mbm_total_bytes: 0 -> 606126080
g2: mbm_total_bytes: 0 -> 613236736
g3: mbm_total_bytes: 0 -> 610254848
[..]
g221: mbm_total_bytes: 0 -> 584679424
g222: mbm_total_bytes: 0 -> 588808192
g223: mbm_total_bytes: 0 -> 587317248
223 groups, 223 returned counts in first pass, 223 in second
successfully measured bandwidth from 223/223 groups
To make better use of the hardware in such a use case, this patchset introduces a "soft" RMID implementation, where each CPU is permanently assigned a "hard" RMID. On context switches which change the current soft RMID, the difference between each CPU's current event counts and most recent counts is added to the totals for the current or outgoing soft RMID.
This technique does not work for cache occupancy counters, so this patch series disables cache occupancy events when soft RMIDs are enabled.
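To illustrate the approach, here is a minimal sketch of the per-CPU accounting (the names below are made up for illustration only; the real implementation is in the patches):

/* Hypothetical placeholders: NR_SOFT_RMIDS, read_hw_mbm_bytes(). */
static atomic64_t soft_rmid_totals[NR_SOFT_RMIDS];
static DEFINE_PER_CPU(u64, prev_hw_bytes);

/* Fold this CPU's byte delta into the outgoing soft RMID's running total. */
static void flush_cpu_counts_to_soft_rmid(u32 hw_rmid, u32 soft_rmid)
{
	u64 now = read_hw_mbm_bytes(hw_rmid);	/* stands in for a QM_CTR read */
	u64 delta = now - this_cpu_read(prev_hw_bytes);

	this_cpu_write(prev_hw_bytes, now);
	atomic64_add(delta, &soft_rmid_totals[soft_rmid]);
}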
This series adds the "mbm_soft_rmid" mount option to allow users to opt in to the functionality when they deem it helpful.
When the same system from the earlier AMD example enables the mbm_soft_rmid mount option:
# ./test_rmids.sh
Created 255 monitoring groups.
g1: mbm_total_bytes: 0 -> 686560576
g2: mbm_total_bytes: 0 -> 668204416
[..]
g252: mbm_total_bytes: 0 -> 672651200
g253: mbm_total_bytes: 0 -> 666956800
g254: mbm_total_bytes: 0 -> 665917056
g255: mbm_total_bytes: 0 -> 671049600
255 groups, 255 returned counts in first pass, 255 in second
successfully measured bandwidth from 255/255 groups
(patches are based on tip/master)
[1] https://www.amd.com/system/files/TechDocs/56375_1.03_PUB.pdf
Peter Newman (8):
  selftests/resctrl: Verify all RMIDs count together
  x86/resctrl: Add resctrl_mbm_flush_cpu() to collect CPUs' MBM events
  x86/resctrl: Flush MBM event counts on soft RMID change
  x86/resctrl: Call mon_event_count() directly for soft RMIDs
  x86/resctrl: Create soft RMID version of __mon_event_count()
  x86/resctrl: Assign HW RMIDs to CPUs for soft RMID
  x86/resctrl: Use mbm_update() to push soft RMID counts
  x86/resctrl: Add mount option to enable soft RMID

Stephane Eranian (1):
  x86/resctrl: Hold a spinlock in __rmid_read() on AMD

 arch/x86/include/asm/resctrl.h                |  29 +++-
 arch/x86/kernel/cpu/resctrl/core.c            |  80 ++++++++-
 arch/x86/kernel/cpu/resctrl/ctrlmondata.c     |   9 +-
 arch/x86/kernel/cpu/resctrl/internal.h        |  19 ++-
 arch/x86/kernel/cpu/resctrl/monitor.c         | 158 +++++++++++++++++-
 arch/x86/kernel/cpu/resctrl/rdtgroup.c        |  52 ++++++
 tools/testing/selftests/resctrl/test_rmids.sh |  93 +++++++++++
 7 files changed, 425 insertions(+), 15 deletions(-)
 create mode 100755 tools/testing/selftests/resctrl/test_rmids.sh
base-commit: dd806e2f030e57dd5bac973372aa252b6c175b73
AMD CPUs in particular implement fewer monitors than RMIDs, so add a test case to see if a large number of monitoring groups can be measured together.
Signed-off-by: Peter Newman <peternewman@google.com>
---
 tools/testing/selftests/resctrl/test_rmids.sh | 93 +++++++++++++++++++
 1 file changed, 93 insertions(+)
 create mode 100755 tools/testing/selftests/resctrl/test_rmids.sh

diff --git a/tools/testing/selftests/resctrl/test_rmids.sh b/tools/testing/selftests/resctrl/test_rmids.sh
new file mode 100755
index 000000000000..475e69c0217e
--- /dev/null
+++ b/tools/testing/selftests/resctrl/test_rmids.sh
@@ -0,0 +1,93 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+
+cd /sys/fs/resctrl
+
+grep -q mbm_total_bytes info/L3_MON/mon_features || {
+	echo "MBM required"
+	exit 4
+}
+
+which perf > /dev/null || {
+	echo "perf tool required"
+	exit 4
+}
+
+num_rmids=$(cat info/L3_MON/num_rmids)
+
+count=0
+
+result=0
+
+# use as many RMIDs as possible, up to the number of RMIDs
+for i in `seq $num_rmids`; do
+	mkdir mon_groups/_test_m$((count+1)) 2> /dev/null || break
+	if [[ -d mon_groups/_test_m$((count+1))/mon_data ]]; then
+		count=$((count+1))
+	else
+		break;
+	fi
+done
+
+echo "Created $count monitoring groups."
+
+if [[ $count -eq 0 ]]; then
+	echo "need monitoring groups to continue."
+	exit 4
+fi
+
+declare -a bytes_array
+
+unavailable=0
+unavailable0=0
+
+for i in `seq $count`; do
+	bytes_array[$i]=$(cat mon_groups/_test_m${i}/mon_data/mon_L3_00/mbm_total_bytes)
+
+	if [[ "${bytes_array[$i]}" = "Unavailable" ]]; then
+		unavailable0=$((unavailable0 + 1))
+	fi
+done
+
+for i in `seq $count`; do
+	echo $$ > mon_groups/_test_m${i}/tasks
+	taskset 0x1 perf bench mem memcpy -s 100MB -f default > /dev/null
+done
+echo $$ > tasks
+
+# zero non-integer values
+declare -i bytes bytes0
+
+success_count=0
+
+for i in `seq $count`; do
+	raw_bytes=$(cat mon_groups/_test_m${i}/mon_data/mon_L3_00/mbm_total_bytes)
+	raw_bytes0=${bytes_array[$i]}
+
+	# coerce the value to an integer for math
+	bytes=$raw_bytes
+	bytes0=$raw_bytes0
+
+	echo -n "g${i}: mbm_total_bytes: $raw_bytes0 -> $raw_bytes"
+
+	if [[ "$raw_bytes" = "Unavailable" ]]; then
+		unavailable=$((unavailable + 1))
+	fi
+
+	if [[ $bytes -gt $bytes0 ]]; then
+		success_count=$((success_count+1))
+		echo ""
+	else
+		echo " (FAIL)"
+		result=1
+	fi
+done
+
+first=$((count-unavailable0))
+second=$((count-unavailable))
+echo "$count groups, $first returned counts in first pass, $second in second"
+echo "successfully measured bandwidth from ${success_count}/${count} groups"
+
+rmdir mon_groups/_test_m*
+
+exit $result
From: Stephane Eranian eranian@google.com
In AMD PQoS Versions 1.0 and 2.0, the IA32_QM_EVTSEL MSR is shared by all processors in a QOS domain. When two processors read counters concurrently, one can therefore end up reading the count for a different event than the one it selected. Add a spinlock to prevent this race.
Co-developed-by: Peter Newman <peternewman@google.com>
Signed-off-by: Peter Newman <peternewman@google.com>
Signed-off-by: Stephane Eranian <eranian@google.com>
---
 arch/x86/kernel/cpu/resctrl/core.c     | 41 ++++++++++++++++++++++++++
 arch/x86/kernel/cpu/resctrl/internal.h |  5 ++++
 arch/x86/kernel/cpu/resctrl/monitor.c  | 14 +++++++--
 3 files changed, 57 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index 030d3b409768..47b1c37a81f8 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -25,6 +25,8 @@
 #include <asm/resctrl.h>
 #include "internal.h"
 
+DEFINE_STATIC_KEY_FALSE(rmid_read_locked);
+
 /* Mutex to protect rdtgroup access. */
 DEFINE_MUTEX(rdtgroup_mutex);
 
@@ -529,6 +531,8 @@ static void domain_add_cpu(int cpu, struct rdt_resource *r)
 	d->id = id;
 	cpumask_set_cpu(cpu, &d->cpu_mask);
 
+	raw_spin_lock_init(&hw_dom->evtsel_lock);
+
 	rdt_domain_reconfigure_cdp(r);
 
 	if (r->alloc_capable && domain_setup_ctrlval(r, d)) {
@@ -829,6 +833,41 @@ static __init bool get_rdt_mon_resources(void)
 	return !rdt_get_mon_l3_config(r);
 }
 
+static __init bool amd_shared_qm_evtsel(void)
+{
+	/*
+	 * From AMD64 Technology Platform Quality of Service Extensions,
+	 * Revision 1.03:
+	 *
+	 * "For PQoS Version 1.0 and 2.0, as identified by Family/Model, the
+	 * QM_EVTSEL register is shared by all the processors in a QOS domain."
+	 *
+	 * Check the inclusive Family/Model ranges for PQoS Extension versions
+	 * 1.0 and 2.0 from the PQoS Extension Versions table.
+	 */
+	if (boot_cpu_data.x86 == 0x17)
+		/* V1.0 */
+		return boot_cpu_data.x86_model >= 0x30 &&
+		       boot_cpu_data.x86_model <= 0x9f;
+
+	if (boot_cpu_data.x86 == 0x19)
+		/* V2.0 */
+		return (boot_cpu_data.x86_model <= 0xf) ||
+		       ((boot_cpu_data.x86_model >= 0x20) &&
+			(boot_cpu_data.x86_model <= 0x5f));
+
+	return false;
+}
+
+static __init void __check_quirks_amd(void)
+{
+	if (rdt_cpu_has(X86_FEATURE_CQM_MBM_TOTAL) ||
+	    rdt_cpu_has(X86_FEATURE_CQM_MBM_LOCAL)) {
+		if (amd_shared_qm_evtsel())
+			static_branch_enable(&rmid_read_locked);
+	}
+}
+
 static __init void __check_quirks_intel(void)
 {
 	switch (boot_cpu_data.x86_model) {
@@ -852,6 +891,8 @@ static __init void check_quirks(void)
 {
 	if (boot_cpu_data.x86_vendor == X86_VENDOR_INTEL)
 		__check_quirks_intel();
+	else if (boot_cpu_data.x86_vendor == X86_VENDOR_AMD)
+		__check_quirks_amd();
 }
 
 static __init bool get_rdt_resources(void)
diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index 85ceaf9a31ac..02a062558c67 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -325,6 +325,7 @@ struct arch_mbm_state {
  * @ctrl_val:	array of cache or mem ctrl values (indexed by CLOSID)
  * @arch_mbm_total:	arch private state for MBM total bandwidth
  * @arch_mbm_local:	arch private state for MBM local bandwidth
+ * @lock:	serializes counter reads when QM_EVTSEL MSR is shared per-domain
  *
  * Members of this structure are accessed via helpers that provide abstraction.
  */
@@ -333,6 +334,7 @@ struct rdt_hw_domain {
 	u32			*ctrl_val;
 	struct arch_mbm_state	*arch_mbm_total;
 	struct arch_mbm_state	*arch_mbm_local;
+	raw_spinlock_t		evtsel_lock;
 };
 
 static inline struct rdt_hw_domain *resctrl_to_arch_dom(struct rdt_domain *r)
@@ -428,6 +430,9 @@ extern struct rdt_hw_resource rdt_resources_all[];
 extern struct rdtgroup rdtgroup_default;
 DECLARE_STATIC_KEY_FALSE(rdt_alloc_enable_key);
 
+/* Serialization required in resctrl_arch_rmid_read(). */
+DECLARE_STATIC_KEY_FALSE(rmid_read_locked);
+
 extern struct dentry *debugfs_resctrl;
 
 enum resctrl_res_level {
diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index 20952419be75..2de8397f91cd 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -146,10 +146,15 @@ static inline struct rmid_entry *__rmid_entry(u32 rmid)
 	return entry;
 }
 
-static int __rmid_read(u32 rmid, enum resctrl_event_id eventid, u64 *val)
+static int __rmid_read(struct rdt_hw_domain *hw_dom, u32 rmid,
+		       enum resctrl_event_id eventid, u64 *val)
 {
+	unsigned long flags;
 	u64 msr_val;
 
+	if (static_branch_likely(&rmid_read_locked))
+		raw_spin_lock_irqsave(&hw_dom->evtsel_lock, flags);
+
 	/*
 	 * As per the SDM, when IA32_QM_EVTSEL.EvtID (bits 7:0) is configured
 	 * with a valid event code for supported resource type and the bits
@@ -161,6 +166,9 @@ static int __rmid_read(u32 rmid, enum resctrl_event_id eventid, u64 *val)
 	wrmsr(MSR_IA32_QM_EVTSEL, eventid, rmid);
 	rdmsrl(MSR_IA32_QM_CTR, msr_val);
 
+	if (static_branch_likely(&rmid_read_locked))
+		raw_spin_unlock_irqrestore(&hw_dom->evtsel_lock, flags);
+
 	if (msr_val & RMID_VAL_ERROR)
 		return -EIO;
 	if (msr_val & RMID_VAL_UNAVAIL)
@@ -200,7 +208,7 @@ void resctrl_arch_reset_rmid(struct rdt_resource *r, struct rdt_domain *d,
 		memset(am, 0, sizeof(*am));
 
 		/* Record any initial, non-zero count value. */
-		__rmid_read(rmid, eventid, &am->prev_msr);
+		__rmid_read(hw_dom, rmid, eventid, &am->prev_msr);
 	}
 }
 
@@ -241,7 +249,7 @@ int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_domain *d,
 	if (!cpumask_test_cpu(smp_processor_id(), &d->cpu_mask))
 		return -EINVAL;
 
-	ret = __rmid_read(rmid, eventid, &msr_val);
+	ret = __rmid_read(hw_dom, rmid, eventid, &msr_val);
 	if (ret)
 		return ret;
Hi Peter,
On 4/21/2023 7:17 AM, Peter Newman wrote:
From: Stephane Eranian eranian@google.com
In AMD PQoS Versions 1.0 and 2.0, IA32_QM_EVTSEL MSR is shared by all processors in a QOS domain. So there's a chance it can read a different event when two processors are reading the counter concurrently. Add a spinlock to prevent this race.
This is unclear to me. As I understand it this changelog is written as though there is a race that is being fixed. I believe that rdtgroup_mutex currently protects against such races. I thus at first thought that this is a prep patch for the introduction of the new soft RMID feature, but instead this new spinlock is used independent of the soft RMID feature.
I think the spinlock is unnecessary when the soft RMID feature is disabled.
Co-developed-by: Peter Newman <peternewman@google.com>
Signed-off-by: Peter Newman <peternewman@google.com>
Signed-off-by: Stephane Eranian <eranian@google.com>

 arch/x86/kernel/cpu/resctrl/core.c     | 41 ++++++++++++++++++++++++++
 arch/x86/kernel/cpu/resctrl/internal.h |  5 ++++
 arch/x86/kernel/cpu/resctrl/monitor.c  | 14 +++++++--
 3 files changed, 57 insertions(+), 3 deletions(-)
...
diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index 85ceaf9a31ac..02a062558c67 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -325,6 +325,7 @@ struct arch_mbm_state {
  * @ctrl_val:	array of cache or mem ctrl values (indexed by CLOSID)
  * @arch_mbm_total:	arch private state for MBM total bandwidth
  * @arch_mbm_local:	arch private state for MBM local bandwidth
+ * @lock:	serializes counter reads when QM_EVTSEL MSR is shared per-domain
  *
  * Members of this structure are accessed via helpers that provide abstraction.
  */
@@ -333,6 +334,7 @@ struct rdt_hw_domain {
 	u32			*ctrl_val;
 	struct arch_mbm_state	*arch_mbm_total;
 	struct arch_mbm_state	*arch_mbm_local;
+	raw_spinlock_t		evtsel_lock;
 };
Please note the difference between the member name in the struct ("evtsel_lock") and its description ("lock").
 static inline struct rdt_hw_domain *resctrl_to_arch_dom(struct rdt_domain *r)
@@ -428,6 +430,9 @@ extern struct rdt_hw_resource rdt_resources_all[];
 extern struct rdtgroup rdtgroup_default;
 DECLARE_STATIC_KEY_FALSE(rdt_alloc_enable_key);
 
+/* Serialization required in resctrl_arch_rmid_read(). */
+DECLARE_STATIC_KEY_FALSE(rmid_read_locked);
+
 extern struct dentry *debugfs_resctrl;
 
 enum resctrl_res_level {
diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index 20952419be75..2de8397f91cd 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -146,10 +146,15 @@ static inline struct rmid_entry *__rmid_entry(u32 rmid)
 	return entry;
 }
 
-static int __rmid_read(u32 rmid, enum resctrl_event_id eventid, u64 *val)
+static int __rmid_read(struct rdt_hw_domain *hw_dom, u32 rmid,
+		       enum resctrl_event_id eventid, u64 *val)
 {
+	unsigned long flags;
 	u64 msr_val;
 
+	if (static_branch_likely(&rmid_read_locked))
Why static_branch_likely() as opposed to static_branch_unlikely()?
+		raw_spin_lock_irqsave(&hw_dom->evtsel_lock, flags);
+
 	/*
 	 * As per the SDM, when IA32_QM_EVTSEL.EvtID (bits 7:0) is configured
 	 * with a valid event code for supported resource type and the bits
@@ -161,6 +166,9 @@ static int __rmid_read(u32 rmid, enum resctrl_event_id eventid, u64 *val)
 	wrmsr(MSR_IA32_QM_EVTSEL, eventid, rmid);
 	rdmsrl(MSR_IA32_QM_CTR, msr_val);
 
+	if (static_branch_likely(&rmid_read_locked))
+		raw_spin_unlock_irqrestore(&hw_dom->evtsel_lock, flags);
If the first "if (static_branch_likely(&rmid_read_locked))" was taken then the second if branch _has_ to be taken. It should not be optional to release a lock if it was taken. I think it would be more robust if a single test of the static key decides whether the spinlock should be used.
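For example, something along these lines (a rough, untested sketch within __rmid_read(); whether the key should be tested with likely or unlikely is a separate question):

	bool lock_required = static_branch_unlikely(&rmid_read_locked);

	if (lock_required)
		raw_spin_lock_irqsave(&hw_dom->evtsel_lock, flags);

	wrmsr(MSR_IA32_QM_EVTSEL, eventid, rmid);
	rdmsrl(MSR_IA32_QM_CTR, msr_val);

	if (lock_required)
		raw_spin_unlock_irqrestore(&hw_dom->evtsel_lock, flags);

That way the unlock always pairs with the lock, even if the key's value were to change in between.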
 	if (msr_val & RMID_VAL_ERROR)
 		return -EIO;
 	if (msr_val & RMID_VAL_UNAVAIL)
@@ -200,7 +208,7 @@ void resctrl_arch_reset_rmid(struct rdt_resource *r, struct rdt_domain *d,
 		memset(am, 0, sizeof(*am));
 
 		/* Record any initial, non-zero count value. */
-		__rmid_read(rmid, eventid, &am->prev_msr);
+		__rmid_read(hw_dom, rmid, eventid, &am->prev_msr);
 	}
 }
 
@@ -241,7 +249,7 @@ int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_domain *d,
 	if (!cpumask_test_cpu(smp_processor_id(), &d->cpu_mask))
 		return -EINVAL;
 
-	ret = __rmid_read(rmid, eventid, &msr_val);
+	ret = __rmid_read(hw_dom, rmid, eventid, &msr_val);
 	if (ret)
 		return ret;
Reinette
Hi Reinette,
On Thu, May 11, 2023 at 11:36 PM Reinette Chatre reinette.chatre@intel.com wrote:
On 4/21/2023 7:17 AM, Peter Newman wrote:
From: Stephane Eranian eranian@google.com
In AMD PQoS Versions 1.0 and 2.0, IA32_QM_EVTSEL MSR is shared by all processors in a QOS domain. So there's a chance it can read a different event when two processors are reading the counter concurrently. Add a spinlock to prevent this race.
This is unclear to me. As I understand it this changelog is written as though there is a race that is being fixed. I believe that rdtgroup_mutex currently protects against such races. I thus at first thought that this is a prep patch for the introduction of the new soft RMID feature, but instead this new spinlock is used independent of the soft RMID feature.
I think the spinlock is unnecessary when the soft RMID feature is disabled.
My understanding was that the race would happen a lot more when simultaneously IPI'ing all CPUs in a domain, but I had apparently overlooked that all of the counter reads were already serialized.
*/
- @lock: serializes counter reads when QM_EVTSEL MSR is shared per-domain
- Members of this structure are accessed via helpers that provide abstraction.
@@ -333,6 +334,7 @@ struct rdt_hw_domain { u32 *ctrl_val; struct arch_mbm_state *arch_mbm_total; struct arch_mbm_state *arch_mbm_local;
raw_spinlock_t evtsel_lock;
};
Please note the difference between the member name in the struct ("evtsel_lock") and its description ("lock").
Will fix, thanks.
-static int __rmid_read(u32 rmid, enum resctrl_event_id eventid, u64 *val) +static int __rmid_read(struct rdt_hw_domain *hw_dom, u32 rmid,
enum resctrl_event_id eventid, u64 *val)
{
unsigned long flags; u64 msr_val;
if (static_branch_likely(&rmid_read_locked))
Why static_branch_likely() as opposed to static_branch_unlikely()?
I read the documentation for static branches and I agree that unlikely would make more sense so that the non-locked case is less impacted.
This instance apparently confused my understanding of static branches and I will need to re-visit all uses of them in this patch series.
raw_spin_lock_irqsave(&hw_dom->evtsel_lock, flags);
/* * As per the SDM, when IA32_QM_EVTSEL.EvtID (bits 7:0) is configured * with a valid event code for supported resource type and the bits
@@ -161,6 +166,9 @@ static int __rmid_read(u32 rmid, enum resctrl_event_id eventid, u64 *val) wrmsr(MSR_IA32_QM_EVTSEL, eventid, rmid); rdmsrl(MSR_IA32_QM_CTR, msr_val);
if (static_branch_likely(&rmid_read_locked))
raw_spin_unlock_irqrestore(&hw_dom->evtsel_lock, flags);
If the first "if (static_branch_likely(&rmid_read_locked))" was taken then the second if branch _has_ to be taken. It should not be optional to release a lock if it was taken. I think it would be more robust if a single test of the static key decides whether the spinlock should be used.
Is the concern that the branch value could change concurrently and result in a deadlock?
I'm curious as to whether this case is performance critical enough to justify using a static branch. It's clear that we should be using them in the context switch path, but I'm confused about other places they're used when there are also memory flags.
-Peter
Hi Peter,
On 5/12/2023 6:23 AM, Peter Newman wrote:
Hi Reinette,
On Thu, May 11, 2023 at 11:36 PM Reinette Chatre reinette.chatre@intel.com wrote:
On 4/21/2023 7:17 AM, Peter Newman wrote:
raw_spin_lock_irqsave(&hw_dom->evtsel_lock, flags);
/* * As per the SDM, when IA32_QM_EVTSEL.EvtID (bits 7:0) is configured * with a valid event code for supported resource type and the bits
@@ -161,6 +166,9 @@ static int __rmid_read(u32 rmid, enum resctrl_event_id eventid, u64 *val) wrmsr(MSR_IA32_QM_EVTSEL, eventid, rmid); rdmsrl(MSR_IA32_QM_CTR, msr_val);
if (static_branch_likely(&rmid_read_locked))
raw_spin_unlock_irqrestore(&hw_dom->evtsel_lock, flags);
If the first "if (static_branch_likely(&rmid_read_locked))" was taken then the second if branch _has_ to be taken. It should not be optional to release a lock if it was taken. I think it would be more robust if a single test of the static key decides whether the spinlock should be used.
Is the concern that the branch value could change concurrently and result in a deadlock?
Possibly ... it may be that the static key cannot change value during this call but that thus requires deeper understanding of various flows for this logic to be trusted. I think this should be explicit: if a lock is taken then releasing it should not be optional at all.
I'm curious as to whether this case is performance critical enough to justify using a static branch. It's clear that we should be using them in the context switch path, but I'm confused about other places they're used when there are also memory flags.
Alternatively, there could be a, (for example) __rmid_read_lock() that is called from context switch and it always takes a spin lock.
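Roughly like the following untested sketch, assuming the static key and the locking are then dropped from __rmid_read() itself:

/* Always-locked counter read for the soft-RMID context switch path. */
static int __rmid_read_lock(struct rdt_hw_domain *hw_dom, u32 rmid,
			    enum resctrl_event_id eventid, u64 *val)
{
	unsigned long flags;
	int ret;

	raw_spin_lock_irqsave(&hw_dom->evtsel_lock, flags);
	ret = __rmid_read(hw_dom, rmid, eventid, val);
	raw_spin_unlock_irqrestore(&hw_dom->evtsel_lock, flags);

	return ret;
}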
Reinette
AMD implementations so far are only guaranteed to provide MBM event counts for RMIDs which are currently assigned in CPUs' PQR_ASSOC MSRs. Hardware can reallocate the counter resources of RMIDs which are not currently assigned to any CPU to RMIDs which are, zeroing the event counts of the unassigned RMIDs.
In practice, this makes it impossible to simultaneously calculate the memory bandwidth speed of all RMIDs on a busy system where all RMIDs are in use. Over a multiple-second measurement window, the RMID would need to remain assigned in all of the L3 cache domains where it has been assigned for the duration of the measurement, otherwise portions of the final count will be zero. In general, it is not possible to bound the number of RMIDs which will be assigned in an L3 domain over any interval of time.
To provide reliable MBM counts on such systems, introduce "soft" RMIDs: when enabled, each CPU is permanently assigned a hardware RMID whose event counts are flushed to the current soft RMID during context switches which result in a change in soft RMID as well as whenever userspace requests the current event count for a domain.
Implement resctrl_mbm_flush_cpu(), which collects a domain's current MBM event counts into its current software RMID. The delta for each CPU is determined by tracking the previous event counts in per-CPU data. The software byte counts reside in the arch-independent mbm_state structures.
Co-developed-by: Stephane Eranian <eranian@google.com>
Signed-off-by: Stephane Eranian <eranian@google.com>
Signed-off-by: Peter Newman <peternewman@google.com>
---
 arch/x86/include/asm/resctrl.h         |  2 +
 arch/x86/kernel/cpu/resctrl/internal.h | 10 ++--
 arch/x86/kernel/cpu/resctrl/monitor.c  | 78 ++++++++++++++++++++++++++
 3 files changed, 86 insertions(+), 4 deletions(-)

diff --git a/arch/x86/include/asm/resctrl.h b/arch/x86/include/asm/resctrl.h
index 255a78d9d906..e7acf118d770 100644
--- a/arch/x86/include/asm/resctrl.h
+++ b/arch/x86/include/asm/resctrl.h
@@ -13,6 +13,7 @@
  * @cur_closid:	The cached Class Of Service ID
  * @default_rmid:	The user assigned Resource Monitoring ID
  * @default_closid:	The user assigned cached Class Of Service ID
+ * @hw_rmid:	The permanently-assigned RMID when soft RMIDs are in use
  *
  * The upper 32 bits of MSR_IA32_PQR_ASSOC contain closid and the
  * lower 10 bits rmid. The update to MSR_IA32_PQR_ASSOC always
@@ -27,6 +28,7 @@ struct resctrl_pqr_state {
 	u32			cur_closid;
 	u32			default_rmid;
 	u32			default_closid;
+	u32			hw_rmid;
 };
 
 DECLARE_PER_CPU(struct resctrl_pqr_state, pqr_state);
diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index 02a062558c67..256eee05d447 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -298,12 +298,14 @@ struct rftype {
  * @prev_bw:	The most recent bandwidth in MBps
  * @delta_bw:	Difference between the current and previous bandwidth
  * @delta_comp:	Indicates whether to compute the delta_bw
+ * @soft_rmid_bytes: Recent bandwidth count in bytes when using soft RMIDs
  */
 struct mbm_state {
-	u64	prev_bw_bytes;
-	u32	prev_bw;
-	u32	delta_bw;
-	bool	delta_comp;
+	u64		prev_bw_bytes;
+	u32		prev_bw;
+	u32		delta_bw;
+	bool		delta_comp;
+	atomic64_t	soft_rmid_bytes;
 };
 
 /**
diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index 2de8397f91cd..3671100d3cc7 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -404,6 +404,84 @@ static struct mbm_state *get_mbm_state(struct rdt_domain *d, u32 rmid,
 	}
 }
 
+struct mbm_soft_counter {
+	u64	prev_bytes;
+	bool	initialized;
+};
+
+struct mbm_flush_state {
+	struct mbm_soft_counter local;
+	struct mbm_soft_counter total;
+};
+
+DEFINE_PER_CPU(struct mbm_flush_state, flush_state);
+
+/*
+ * flushes the value of the cpu_rmid to the current soft rmid
+ */
+static void __mbm_flush(int evtid, struct rdt_resource *r, struct rdt_domain *d)
+{
+	struct mbm_flush_state *state = this_cpu_ptr(&flush_state);
+	u32 soft_rmid = this_cpu_ptr(&pqr_state)->cur_rmid;
+	u32 hw_rmid = this_cpu_ptr(&pqr_state)->hw_rmid;
+	struct mbm_soft_counter *counter;
+	struct mbm_state *m;
+	u64 val;
+
+	/* cache occupancy events are disabled in this mode */
+	WARN_ON(!is_mbm_event(evtid));
+
+	if (evtid == QOS_L3_MBM_LOCAL_EVENT_ID) {
+		counter = &state->local;
+	} else {
+		WARN_ON(evtid != QOS_L3_MBM_TOTAL_EVENT_ID);
+		counter = &state->total;
+	}
+
+	/*
+	 * Propagate the value read from the hw_rmid assigned to the current CPU
+	 * into the "soft" rmid associated with the current task or CPU.
+	 */
+	m = get_mbm_state(d, soft_rmid, evtid);
+	if (!m)
+		return;
+
+	if (resctrl_arch_rmid_read(r, d, hw_rmid, evtid, &val))
+		return;
+
+	/* Count bandwidth after the first successful counter read. */
+	if (counter->initialized) {
+		/* Assume that mbm_update() will prevent double-overflows. */
+		if (val != counter->prev_bytes)
+			atomic64_add(val - counter->prev_bytes,
+				     &m->soft_rmid_bytes);
+	} else {
+		counter->initialized = true;
+	}
+
+	counter->prev_bytes = val;
+}
+
+/*
+ * Called from context switch code __resctrl_sched_in() when the current soft
+ * RMID is changing or before reporting event counts to user space.
+ */
+void resctrl_mbm_flush_cpu(void)
+{
+	struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
+	int cpu = smp_processor_id();
+	struct rdt_domain *d;
+
+	d = get_domain_from_cpu(cpu, r);
+	if (!d)
+		return;
+
+	if (is_mbm_local_enabled())
+		__mbm_flush(QOS_L3_MBM_LOCAL_EVENT_ID, r, d);
+	if (is_mbm_total_enabled())
+		__mbm_flush(QOS_L3_MBM_TOTAL_EVENT_ID, r, d);
+}
+
 static int __mon_event_count(u32 rmid, struct rmid_read *rr)
 {
 	struct mbm_state *m;
Hi Peter,
On 4/21/2023 7:17 AM, Peter Newman wrote:
AMD implementations so far are only guaranteed to provide MBM event counts for RMIDs which are currently assigned in CPUs' PQR_ASSOC MSRs. Hardware can reallocate the counter resources for all other RMIDs' which are not currently assigned to those which are, zeroing the event counts of the unassigned RMIDs.
In practice, this makes it impossible to simultaneously calculate the memory bandwidth speed of all RMIDs on a busy system where all RMIDs are in use. Over a multiple-second measurement window, the RMID would need to remain assigned in all of the L3 cache domains where it has been assigned for the duration of the measurement, otherwise portions of the final count will be zero. In general, it is not possible to bound the number of RMIDs which will be assigned in an L3 domain over any interval of time.
To provide reliable MBM counts on such systems, introduce "soft" RMIDs: when enabled, each CPU is permanently assigned a hardware RMID whose event counts are flushed to the current soft RMID during context switches which result in a change in soft RMID as well as whenever userspace requests the current event count for a domain.
Implement resctrl_mbm_flush_cpu(), which collects a domain's current MBM event counts into its current software RMID. The delta for each CPU is determined by tracking the previous event counts in per-CPU data. The software byte counts reside in the arch-independent mbm_state structures.
Could you elaborate why the arch-independent mbm_state was chosen?
Co-developed-by: Stephane Eranian eranian@google.com Signed-off-by: Stephane Eranian eranian@google.com Signed-off-by: Peter Newman peternewman@google.com
arch/x86/include/asm/resctrl.h | 2 + arch/x86/kernel/cpu/resctrl/internal.h | 10 ++-- arch/x86/kernel/cpu/resctrl/monitor.c | 78 ++++++++++++++++++++++++++ 3 files changed, 86 insertions(+), 4 deletions(-)
diff --git a/arch/x86/include/asm/resctrl.h b/arch/x86/include/asm/resctrl.h index 255a78d9d906..e7acf118d770 100644 --- a/arch/x86/include/asm/resctrl.h +++ b/arch/x86/include/asm/resctrl.h @@ -13,6 +13,7 @@
- @cur_closid: The cached Class Of Service ID
- @default_rmid: The user assigned Resource Monitoring ID
- @default_closid: The user assigned cached Class Of Service ID
- @hw_rmid: The permanently-assigned RMID when soft RMIDs are in use
- The upper 32 bits of MSR_IA32_PQR_ASSOC contain closid and the
- lower 10 bits rmid. The update to MSR_IA32_PQR_ASSOC always
@@ -27,6 +28,7 @@ struct resctrl_pqr_state { u32 cur_closid; u32 default_rmid; u32 default_closid;
- u32 hw_rmid;
}; DECLARE_PER_CPU(struct resctrl_pqr_state, pqr_state); diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h index 02a062558c67..256eee05d447 100644 --- a/arch/x86/kernel/cpu/resctrl/internal.h +++ b/arch/x86/kernel/cpu/resctrl/internal.h @@ -298,12 +298,14 @@ struct rftype {
- @prev_bw: The most recent bandwidth in MBps
- @delta_bw: Difference between the current and previous bandwidth
- @delta_comp: Indicates whether to compute the delta_bw
*/
- @soft_rmid_bytes: Recent bandwidth count in bytes when using soft RMIDs
struct mbm_state {
- u64 prev_bw_bytes;
- u32 prev_bw;
- u32 delta_bw;
- bool delta_comp;
- u64 prev_bw_bytes;
- u32 prev_bw;
- u32 delta_bw;
- bool delta_comp;
- atomic64_t soft_rmid_bytes;
}; /** diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c index 2de8397f91cd..3671100d3cc7 100644 --- a/arch/x86/kernel/cpu/resctrl/monitor.c +++ b/arch/x86/kernel/cpu/resctrl/monitor.c @@ -404,6 +404,84 @@ static struct mbm_state *get_mbm_state(struct rdt_domain *d, u32 rmid, } } +struct mbm_soft_counter {
- u64 prev_bytes;
- bool initialized;
+};
+struct mbm_flush_state {
- struct mbm_soft_counter local;
- struct mbm_soft_counter total;
+};
+DEFINE_PER_CPU(struct mbm_flush_state, flush_state);
Why not use the existing MBM state?
+/*
- flushes the value of the cpu_rmid to the current soft rmid
- */
+static void __mbm_flush(int evtid, struct rdt_resource *r, struct rdt_domain *d) +{
- struct mbm_flush_state *state = this_cpu_ptr(&flush_state);
- u32 soft_rmid = this_cpu_ptr(&pqr_state)->cur_rmid;
- u32 hw_rmid = this_cpu_ptr(&pqr_state)->hw_rmid;
- struct mbm_soft_counter *counter;
- struct mbm_state *m;
- u64 val;
- /* cache occupancy events are disabled in this mode */
- WARN_ON(!is_mbm_event(evtid));
If this is hit it would trigger a lot, perhaps WARN_ON_ONCE?
- if (evtid == QOS_L3_MBM_LOCAL_EVENT_ID) {
counter = &state->local;
- } else {
WARN_ON(evtid != QOS_L3_MBM_TOTAL_EVENT_ID);
counter = &state->total;
- }
- /*
* Propagate the value read from the hw_rmid assigned to the current CPU
* into the "soft" rmid associated with the current task or CPU.
*/
- m = get_mbm_state(d, soft_rmid, evtid);
- if (!m)
return;
- if (resctrl_arch_rmid_read(r, d, hw_rmid, evtid, &val))
return;
This all seems unsafe to run without protection. The code relies on the rdt_domain but a CPU hotplug event could result in the domain disappearing underneath this code. The accesses to the data structures also appear unsafe to me. Note that resctrl_arch_rmid_read() updates the architectural MBM state and this same state can be updated concurrently in other code paths without appropriate locking.
- /* Count bandwidth after the first successful counter read. */
- if (counter->initialized) {
/* Assume that mbm_update() will prevent double-overflows. */
if (val != counter->prev_bytes)
atomic64_add(val - counter->prev_bytes,
&m->soft_rmid_bytes);
- } else {
counter->initialized = true;
- }
- counter->prev_bytes = val;
I notice a lot of similarities between the above and the software controller, see mbm_bw_count().
+}
+/*
- Called from context switch code __resctrl_sched_in() when the current soft
- RMID is changing or before reporting event counts to user space.
- */
+void resctrl_mbm_flush_cpu(void) +{
- struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
- int cpu = smp_processor_id();
- struct rdt_domain *d;
- d = get_domain_from_cpu(cpu, r);
- if (!d)
return;
- if (is_mbm_local_enabled())
__mbm_flush(QOS_L3_MBM_LOCAL_EVENT_ID, r, d);
- if (is_mbm_total_enabled())
__mbm_flush(QOS_L3_MBM_TOTAL_EVENT_ID, r, d);
+}
This (potentially) adds two MSR writes and two MSR reads to what could possibly be quite slow MSRs if it was not designed to be used in context switch. Do you perhaps have data on how long these MSR reads/writes take on these systems to get an idea about the impact on context switch? I think this data should feature prominently in the changelog.
static int __mon_event_count(u32 rmid, struct rmid_read *rr) { struct mbm_state *m;
Reinette
Hi Reinette,
On Thu, May 11, 2023 at 11:37 PM Reinette Chatre reinette.chatre@intel.com wrote:
On 4/21/2023 7:17 AM, Peter Newman wrote:
Implement resctrl_mbm_flush_cpu(), which collects a domain's current MBM event counts into its current software RMID. The delta for each CPU is determined by tracking the previous event counts in per-CPU data. The software byte counts reside in the arch-independent mbm_state structures.
Could you elaborate why the arch-independent mbm_state was chosen?
It largely had to do with how many soft RMIDs to implement. For our own needs, we were mainly concerned with getting back to the number of monitoring groups the hardware claimed to support, so there wasn't much internal motivation to support an unbounded number of soft RMIDs.
However, breaking this artificial connection between supported HW and SW RMIDs to support arbitrarily-many monitoring groups could make the implementation conceptually cleaner. If you agree, I would be happy to give it a try in the next series.
/* cache occupancy events are disabled in this mode */
WARN_ON(!is_mbm_event(evtid));
If this is hit it would trigger a lot, perhaps WARN_ON_ONCE?
Ok
if (evtid == QOS_L3_MBM_LOCAL_EVENT_ID) {
counter = &state->local;
} else {
WARN_ON(evtid != QOS_L3_MBM_TOTAL_EVENT_ID);
counter = &state->total;
}
/*
* Propagate the value read from the hw_rmid assigned to the current CPU
* into the "soft" rmid associated with the current task or CPU.
*/
m = get_mbm_state(d, soft_rmid, evtid);
if (!m)
return;
if (resctrl_arch_rmid_read(r, d, hw_rmid, evtid, &val))
return;
This all seems unsafe to run without protection. The code relies on the rdt_domain but a CPU hotplug event could result in the domain disappearing underneath this code. The accesses to the data structures also appear unsafe to me. Note that resctrl_arch_rmid_read() updates the architectural MBM state and this same state can be updated concurrently in other code paths without appropriate locking.
The domain is supposed to always be the current one, but I see that even a get_domain_from_cpu(smp_processor_id(), ...) call needs to walk a resource's domain list to find a matching entry, which could be concurrently modified when other domains are added/removed.
Similarly, when soft RMIDs are enabled, it should not be possible to call resctrl_arch_rmid_read() outside of on the current CPU's HW RMID.
I'll need to confirm whether it's safe to access the current CPU's rdt_domain in an atomic context. If it isn't, I assume I would have to arrange all of the state used during flush to be per-CPU.
I expect the constraints on what data can be safely accessed where is going to constrain how the state is ultimately arranged, so I will need to settle this before I can come back to the other questions about mbm_state.
/* Count bandwidth after the first successful counter read. */
if (counter->initialized) {
/* Assume that mbm_update() will prevent double-overflows. */
if (val != counter->prev_bytes)
atomic64_add(val - counter->prev_bytes,
&m->soft_rmid_bytes);
} else {
counter->initialized = true;
}
counter->prev_bytes = val;
I notice a lot of similarities between the above and the software controller, see mbm_bw_count().
Thanks for pointing this out, I'll take a look.
+}
+/*
- Called from context switch code __resctrl_sched_in() when the current soft
- RMID is changing or before reporting event counts to user space.
- */
+void resctrl_mbm_flush_cpu(void) +{
struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
int cpu = smp_processor_id();
struct rdt_domain *d;
d = get_domain_from_cpu(cpu, r);
if (!d)
return;
if (is_mbm_local_enabled())
__mbm_flush(QOS_L3_MBM_LOCAL_EVENT_ID, r, d);
if (is_mbm_total_enabled())
__mbm_flush(QOS_L3_MBM_TOTAL_EVENT_ID, r, d);
+}
This (potentially) adds two MSR writes and two MSR reads to what could possibly be quite slow MSRs if it was not designed to be used in context switch. Do you perhaps have data on how long these MSR reads/writes take on these systems to get an idea about the impact on context switch? I think this data should feature prominently in the changelog.
I can probably use ftrace to determine the cost of an __rmid_read() call on a few implementations.
To understand the overall impact to context switch, I can put together a scenario where I can control whether the context switches being measured result in change of soft RMID to prevent the data from being diluted by non-flushing switches.
Thank you for looking over these changes!
-Peter
Hi Peter,
On 5/12/2023 6:25 AM, Peter Newman wrote:
Hi Reinette,
On Thu, May 11, 2023 at 11:37 PM Reinette Chatre reinette.chatre@intel.com wrote:
On 4/21/2023 7:17 AM, Peter Newman wrote:
Implement resctrl_mbm_flush_cpu(), which collects a domain's current MBM event counts into its current software RMID. The delta for each CPU is determined by tracking the previous event counts in per-CPU data. The software byte counts reside in the arch-independent mbm_state structures.
Could you elaborate why the arch-independent mbm_state was chosen?
It largely had to do with how many soft RMIDs to implement. For our own needs, we were mainly concerned with getting back to the number of monitoring groups the hardware claimed to support, so there wasn't much internal motivation to support an unbounded number of soft RMIDs.
Apologies for not being explicit, I was actually curious why the arch-independent mbm_state, as opposed to the arch-dependent state, was chosen.
I think the lines are getting a bit blurry here with the software RMID feature added as a resctrl filesystem feature (and thus non architectural), but it is specific to AMD architecture.
However, breaking this artificial connection between supported HW and SW RMIDs to support arbitrarily-many monitoring groups could make the implementation conceptually cleaner. If you agree, I would be happy to give it a try in the next series.
I have not actually considered this. At first glance I think this would add more tentacles into the core where currently the number of RMIDs supported are queried from the device and supporting an arbitrary number would impact that. At this time the RMID state is also pre-allocated and thus not possible to support an "arbitrarily-many".
+/*
- Called from context switch code __resctrl_sched_in() when the current soft
- RMID is changing or before reporting event counts to user space.
- */
+void resctrl_mbm_flush_cpu(void) +{
struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
int cpu = smp_processor_id();
struct rdt_domain *d;
d = get_domain_from_cpu(cpu, r);
if (!d)
return;
if (is_mbm_local_enabled())
__mbm_flush(QOS_L3_MBM_LOCAL_EVENT_ID, r, d);
if (is_mbm_total_enabled())
__mbm_flush(QOS_L3_MBM_TOTAL_EVENT_ID, r, d);
+}
This (potentially) adds two MSR writes and two MSR reads to what could possibly be quite slow MSRs if it was not designed to be used in context switch. Do you perhaps have data on how long these MSR reads/writes take on these systems to get an idea about the impact on context switch? I think this data should feature prominently in the changelog.
I can probably use ftrace to determine the cost of an __rmid_read() call on a few implementations.
On a lower level I think it may be interesting to measure more closely just how long a wrmsr and rdmsr take on these registers. It may be interesting if you, for example, use rdtsc_ordered() before and after these calls, and then compare it to how long it takes to write the PQR register that has been designed to be used in context switch.
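For example, a sketch of what I have in mind (the same measurement can be taken around the PQR_ASSOC write for comparison):

/* Sketch only: TSC cycles for one QM_EVTSEL write plus QM_CTR read. */
static u64 time_one_rmid_read(u32 rmid, enum resctrl_event_id eventid)
{
	u64 start, end, msr_val;

	start = rdtsc_ordered();
	wrmsr(MSR_IA32_QM_EVTSEL, eventid, rmid);
	rdmsrl(MSR_IA32_QM_CTR, msr_val);
	end = rdtsc_ordered();

	pr_info("rmid read: %llu cycles (val=0x%llx)\n", end - start, msr_val);

	return end - start;
}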
To understand the overall impact to context switch, I can put together a scenario where I can control whether the context switches being measured result in change of soft RMID to prevent the data from being diluted by non-flushing switches.
This sounds great. Thank you very much.
Reinette
Hi Reinette,
On Fri, May 12, 2023 at 5:26 PM Reinette Chatre reinette.chatre@intel.com wrote:
On 5/12/2023 6:25 AM, Peter Newman wrote:
On Thu, May 11, 2023 at 11:37 PM Reinette Chatre reinette.chatre@intel.com wrote:
On 4/21/2023 7:17 AM, Peter Newman wrote:
Implement resctrl_mbm_flush_cpu(), which collects a domain's current MBM event counts into its current software RMID. The delta for each CPU is determined by tracking the previous event counts in per-CPU data. The software byte counts reside in the arch-independent mbm_state structures.
Could you elaborate why the arch-independent mbm_state was chosen?
It largely had to do with how many soft RMIDs to implement. For our own needs, we were mainly concerned with getting back to the number of monitoring groups the hardware claimed to support, so there wasn't much internal motivation to support an unbounded number of soft RMIDs.
Apologies for not being explicit, I was actually curious why the arch-independent mbm_state, as opposed to the arch-dependent state, was chosen.
I think the lines are getting a bit blurry here with the software RMID feature added as a resctrl filesystem feature (and thus non architectural), but it is specific to AMD architecture.
The soft RMID solution applies conceptually to any system where the number of hardware counters is smaller than the number of desired monitoring groups, but at least as large as the number of CPUs. It's a solution we may need to rely on more in the future as it's easier for monitoring hardware to scale to the number of CPUs than (CPUs * mbm_domains). I believed the counts in bytes would apply to the user interface universally.
However, I did recently rebase these changes onto one of James's MPAM snapshot branches and __mbm_flush() did end up fitting better on the arch-dependent side, so I was forced to move the counters over to arch_mbm_state because on the snapshot branch the arch-dependent code cannot see the arch-independent mbm_state structure. I then created resctrl_arch-() helpers for __mon_event_count() to read the counts from the arch_mbm_state.
In hindsight, despite generic-looking code being able to retrieve the CPU counts with resctrl_arch_rmid_read(), the permanent assignment of a HW RMID to a CPU is an implementation-detail specific to the RDT/PQoS interface and would probably not align to a theoretical MPAM implementation.
However, breaking this artificial connection between supported HW and SW RMIDs to support arbitrarily-many monitoring groups could make the implementation conceptually cleaner. If you agree, I would be happy to give it a try in the next series.
I have not actually considered this. At first glance I think this would add more tentacles into the core where currently the number of RMIDs supported are queried from the device and supporting an arbitrary number would impact that. At this time the RMID state is also pre-allocated and thus not possible to support an "arbitrarily-many".
Yes, this was the part that made me want to just leave the RMID count alone.
+/*
- Called from context switch code __resctrl_sched_in() when the current soft
- RMID is changing or before reporting event counts to user space.
- */
+void resctrl_mbm_flush_cpu(void) +{
struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
int cpu = smp_processor_id();
struct rdt_domain *d;
d = get_domain_from_cpu(cpu, r);
if (!d)
return;
if (is_mbm_local_enabled())
__mbm_flush(QOS_L3_MBM_LOCAL_EVENT_ID, r, d);
if (is_mbm_total_enabled())
__mbm_flush(QOS_L3_MBM_TOTAL_EVENT_ID, r, d);
+}
This (potentially) adds two MSR writes and two MSR reads to what could possibly be quite slow MSRs if it was not designed to be used in context switch. Do you perhaps have data on how long these MSR reads/writes take on these systems to get an idea about the impact on context switch? I think this data should feature prominently in the changelog.
I can probably use ftrace to determine the cost of an __rmid_read() call on a few implementations.
On a lower level I think it may be interesting to measure more closely just how long a wrmsr and rdmsr take on these registers. It may be interesting if you, for example, use rdtsc_ordered() before and after these calls, and then compare it to how long it takes to write the PQR register that has been designed to be used in context switch.
To understand the overall impact to context switch, I can put together a scenario where I can control whether the context switches being measured result in change of soft RMID to prevent the data from being diluted by non-flushing switches.
This sounds great. Thank you very much.
I used a simple parent-child pipe loop benchmark with the parent in one monitoring group and the child in another to trigger 2M context-switches on the same CPU and compared the sample-based profiles on an AMD and Intel implementation. I used perf diff to compare the samples between hard and soft RMID switches.
Intel(R) Xeon(R) Platinum 8173M CPU @ 2.00GHz:
+44.80% [kernel.kallsyms] [k] __rmid_read 10.43% -9.52% [kernel.kallsyms] [k] __switch_to
AMD EPYC 7B12 64-Core Processor:
+28.27% [kernel.kallsyms] [k] __rmid_read 13.45% -13.44% [kernel.kallsyms] [k] __switch_to
Note that a soft RMID switch that doesn't change CLOSID skips the PQR_ASSOC write completely, so from this data I can roughly say that __rmid_read() is a little over 2x the length of a PQR_ASSOC write that changes the current RMID on the AMD implementation and about 4.5x longer on Intel.
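For reference, the benchmark loop is essentially the following sketch (userspace; pinning to one CPU and placing parent and child into different monitoring groups are done externally with taskset and the resctrl tasks files):

#include <unistd.h>

int main(void)
{
	int p2c[2], c2p[2];
	char buf = 0;
	long i, iters = 1000000;	/* two switches per iteration ~= 2M switches */

	if (pipe(p2c) || pipe(c2p))
		return 1;

	if (fork() == 0) {		/* child: echo every byte back */
		for (i = 0; i < iters; i++) {
			read(p2c[0], &buf, 1);
			write(c2p[1], &buf, 1);
		}
		return 0;
	}

	for (i = 0; i < iters; i++) {	/* parent: ping, then wait for pong */
		write(p2c[1], &buf, 1);
		read(c2p[0], &buf, 1);
	}
	return 0;
}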
Let me know if this clarifies the cost enough or if you'd like to also see instrumented measurements on the individual WRMSR/RDMSR instructions.
Thanks! -Peter
Hi Peter,
On 5/15/2023 7:42 AM, Peter Newman wrote:
Hi Reinette,
On Fri, May 12, 2023 at 5:26 PM Reinette Chatre reinette.chatre@intel.com wrote:
On 5/12/2023 6:25 AM, Peter Newman wrote:
On Thu, May 11, 2023 at 11:37 PM Reinette Chatre reinette.chatre@intel.com wrote:
On 4/21/2023 7:17 AM, Peter Newman wrote:
Implement resctrl_mbm_flush_cpu(), which collects a domain's current MBM event counts into its current software RMID. The delta for each CPU is determined by tracking the previous event counts in per-CPU data. The software byte counts reside in the arch-independent mbm_state structures.
Could you elaborate why the arch-independent mbm_state was chosen?
It largely had to do with how many soft RMIDs to implement. For our own needs, we were mainly concerned with getting back to the number of monitoring groups the hardware claimed to support, so there wasn't much internal motivation to support an unbounded number of soft RMIDs.
Apologies for not being explicit, I was actually curious why the arch-independent mbm_state, as opposed to the arch-dependent state, was chosen.
I think the lines are getting a bit blurry here with the software RMID feature added as a resctrl filesystem feature (and thus non architectural), but it is specific to AMD architecture.
The soft RMID solution applies conceptually to any system where the number of hardware counters is smaller than the number of desired monitoring groups, but at least as large as the number of CPUs. It's a solution we may need to rely on more in the future as it's easier for monitoring hardware to scale to the number of CPUs than (CPUs * mbm_domains). I believed the counts in bytes would apply to the user interface universally.
However, I did recently rebase these changes onto one of James's MPAM snapshot branches and __mbm_flush() did end up fitting better on the arch-dependent side, so I was forced to move the counters over to arch_mbm_state because on the snapshot branch the arch-dependent code cannot see the arch-independent mbm_state structure. I then created resctrl_arch-() helpers for __mon_event_count() to read the counts from the arch_mbm_state.
In hindsight, despite generic-looking code being able to retrieve the CPU counts with resctrl_arch_rmid_read(), the permanent assignment of a HW RMID to a CPU is an implementation-detail specific to the RDT/PQoS interface and would probably not align to a theoretical MPAM implementation.
Indeed. There are a couple of points in this work that blurs the clear boundary that the MPAM enabling is trying to establish. We should keep this in mind when looking how this should be solved so that it does not create another item that needs to be untangled. A small but significant start may be to start referring to this as "soft mon id" or something else that is generic. Especially since this is proposed as a mount option and thus tied to the filesystem.
+/*
- Called from context switch code __resctrl_sched_in() when the current soft
- RMID is changing or before reporting event counts to user space.
- */
+void resctrl_mbm_flush_cpu(void) +{
struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
int cpu = smp_processor_id();
struct rdt_domain *d;
d = get_domain_from_cpu(cpu, r);
if (!d)
return;
if (is_mbm_local_enabled())
__mbm_flush(QOS_L3_MBM_LOCAL_EVENT_ID, r, d);
if (is_mbm_total_enabled())
__mbm_flush(QOS_L3_MBM_TOTAL_EVENT_ID, r, d);
+}
This (potentially) adds two MSR writes and two MSR reads to what could possibly be quite slow MSRs if it was not designed to be used in context switch. Do you perhaps have data on how long these MSR reads/writes take on these systems to get an idea about the impact on context switch? I think this data should feature prominently in the changelog.
I can probably use ftrace to determine the cost of an __rmid_read() call on a few implementations.
On a lower level I think it may be interesting to measure more closely just how long a wrmsr and rdmsr take on these registers. It may be interesting if you, for example, use rdtsc_ordered() before and after these calls, and then compare it to how long it takes to write the PQR register that has been designed to be used in context switch.
To understand the overall impact to context switch, I can put together a scenario where I can control whether the context switches being measured result in change of soft RMID to prevent the data from being diluted by non-flushing switches.
This sounds great. Thank you very much.
I used a simple parent-child pipe loop benchmark with the parent in one monitoring group and the child in another to trigger 2M context-switches on the same CPU and compared the sample-based profiles on an AMD and Intel implementation. I used perf diff to compare the samples between hard and soft RMID switches.
Intel(R) Xeon(R) Platinum 8173M CPU @ 2.00GHz:
+44.80% [kernel.kallsyms] [k] __rmid_read 10.43% -9.52% [kernel.kallsyms] [k] __switch_to
AMD EPYC 7B12 64-Core Processor:
+28.27% [kernel.kallsyms] [k] __rmid_read 13.45% -13.44% [kernel.kallsyms] [k] __switch_to
Note that a soft RMID switch that doesn't change CLOSID skips the PQR_ASSOC write completely, so from this data I can roughly say that __rmid_read() is a little over 2x the length of a PQR_ASSOC write that changes the current RMID on the AMD implementation and about 4.5x longer on Intel.
Let me know if this clarifies the cost enough or if you'd like to also see instrumented measurements on the individual WRMSR/RDMSR instructions.
I can see from the data the portion of total time spent in __rmid_read(). It is not clear to me what the impact on a context switch is. Is it possible to say with this data that: this solution makes a context switch x% slower?
I think it may be optimistic to view this as a replacement of a PQR write. As you point out, that requires that a CPU switches between tasks with the same CLOSID. You demonstrate that resctrl already contributes a significant delay to __switch_to - this work will increase that much more, it has to be clear about this impact and motivate that it is acceptable.
Reinette
Hi Reinette,
On Tue, May 16, 2023 at 5:06 PM Reinette Chatre reinette.chatre@intel.com wrote:
On 5/15/2023 7:42 AM, Peter Newman wrote:
I used a simple parent-child pipe loop benchmark with the parent in one monitoring group and the child in another to trigger 2M context-switches on the same CPU and compared the sample-based profiles on an AMD and Intel implementation. I used perf diff to compare the samples between hard and soft RMID switches.
Intel(R) Xeon(R) Platinum 8173M CPU @ 2.00GHz:
+44.80% [kernel.kallsyms] [k] __rmid_read 10.43% -9.52% [kernel.kallsyms] [k] __switch_to
AMD EPYC 7B12 64-Core Processor:
+28.27% [kernel.kallsyms] [k] __rmid_read 13.45% -13.44% [kernel.kallsyms] [k] __switch_to
Note that a soft RMID switch that doesn't change CLOSID skips the PQR_ASSOC write completely, so from this data I can roughly say that __rmid_read() is a little over 2x the length of a PQR_ASSOC write that changes the current RMID on the AMD implementation and about 4.5x longer on Intel.
Let me know if this clarifies the cost enough or if you'd like to also see instrumented measurements on the individual WRMSR/RDMSR instructions.
I can see from the data the portion of total time spent in __rmid_read(). It is not clear to me what the impact on a context switch is. Is it possible to say with this data that: this solution makes a context switch x% slower?
I think it may be optimistic to view this as a replacement of a PQR write. As you point out, that requires that a CPU switches between tasks with the same CLOSID. You demonstrate that resctrl already contributes a significant delay to __switch_to - this work will increase that much more, it has to be clear about this impact and motivate that it is acceptable.
We had been operating under the assumption that if the overhead weren't acceptable, we would have heard complaints about it by now. We ultimately learned, however, that this feature wasn't deployed on AMD hardware as widely as we had originally thought, and that the overhead does need to be addressed.
I am interested in your opinion on two options I'm exploring to mitigate the overhead, both of which depend on an API like the one Babu recently proposed for the AMD ABMC feature [1], where a new file interface will allow the user to indicate which mon_groups are actively being measured. I will refer to this as "assigned" for now, as that's the current proposal.
The first is likely the simpler approach: only read MBM event counters which have been marked as "assigned" in the filesystem to avoid paying the context switch cost on tasks in groups which are not actively being measured. In our use case, we calculate memory bandwidth on every group every few minutes by reading the counters twice, 5 seconds apart. We would just need counters read during this 5-second window.
The second involves avoiding the situation where a hardware counter could be deallocated: Determine the number of simultaneous RMIDs supported, reduce the effective number of RMIDs available to that number. Use the default RMID (0) for all "unassigned" monitoring groups and report "Unavailable" on all counter reads (and address the default monitoring group's counts being unreliable). When assigned, attempt to allocate one of the remaining, usable RMIDs to that group. It would only be possible to assign all event counters (local, total, occupancy) at the same time. Using this approach, we would no longer be able to measure all groups at the same time, but this is something we would already be accepting when using the AMD ABMC feature.
While the second feature is a lot more disruptive at the filesystem layer, it does eliminate the added context switch overhead. Also, it may be helpful in the long run for the filesystem code to start taking a more abstract view of hardware monitoring resources, given that few implementations can afford to assign hardware to all monitoring IDs all the time. In both cases, the meaning of "assigned" could vary greatly, even among AMD implementations.
Thanks! -Peter
[1] https://lore.kernel.org/lkml/20231201005720.235639-1-babu.moger@amd.com/
Hi Peter,
On 12/1/2023 12:56 PM, Peter Newman wrote:
Hi Reinette,
On Tue, May 16, 2023 at 5:06 PM Reinette Chatre reinette.chatre@intel.com wrote:
On 5/15/2023 7:42 AM, Peter Newman wrote:
I used a simple parent-child pipe loop benchmark with the parent in one monitoring group and the child in another to trigger 2M context-switches on the same CPU and compared the sample-based profiles on an AMD and Intel implementation. I used perf diff to compare the samples between hard and soft RMID switches.
Intel(R) Xeon(R) Platinum 8173M CPU @ 2.00GHz:
         +44.80%  [kernel.kallsyms]  [k] __rmid_read
10.43%   -9.52%   [kernel.kallsyms]  [k] __switch_to
AMD EPYC 7B12 64-Core Processor:
         +28.27%  [kernel.kallsyms]  [k] __rmid_read
13.45%  -13.44%   [kernel.kallsyms]  [k] __switch_to
Note that a soft RMID switch that doesn't change the CLOSID skips the PQR_ASSOC write completely, so from this data I can roughly say that __rmid_read() takes a little over 2x as long as a PQR_ASSOC write that changes the current RMID on the AMD implementation, and about 4.5x as long on Intel.
Let me know if this clarifies the cost enough or if you'd like to also see instrumented measurements on the individual WRMSR/RDMSR instructions.
I can see from the data the portion of total time spent in __rmid_read(). It is not clear to me what the impact on a context switch is. Is it possible to say with this data that: this solution makes a context switch x% slower?
I think it may be optimistic to view this as a replacement for a PQR write. As you point out, that requires that a CPU switch between tasks with the same CLOSID. You demonstrate that resctrl already contributes a significant delay to __switch_to - this work will increase that delay much more; the work has to be clear about this impact and motivate why it is acceptable.
We were operating under the assumption that if the overhead wasn't acceptable, we would have heard complaints about it by now, but we ultimately learned that this feature wasn't deployed as much as we had originally thought on AMD hardware and that the overhead does need to be addressed.
I am interested in your opinion on two options I'm exploring to mitigate the overhead, both of which depend on an API like the one Babu recently proposed for the AMD ABMC feature [1], where a new file interface will allow the user to indicate which mon_groups are actively being measured. I will refer to this as "assigned" for now, as that's the current proposal.
The first is likely the simpler approach: only read MBM event counters which have been marked as "assigned" in the filesystem to avoid paying the context switch cost on tasks in groups which are not actively being measured. In our use case, we calculate memory bandwidth on every group every few minutes by reading the counters twice, 5 seconds apart. We would just need counters read during this 5-second window.
I assume that tasks within a monitoring group can be scheduled on any CPU and from the cover letter of this work I understand that only an RMID assigned to a processor can be guaranteed to be tracked by hardware.
Are you proposing for this option that you keep this "soft RMID" approach with CPUs permanently assigned a "hard RMID" but only update the counts for a "soft RMID" that is "assigned"? I think that means that the context switch cost for the monitored group would increase even more than with the implementation in this series since the counters need to be read on context switch in as well as context switch out.
If I understand correctly then only one monitoring group can be measured at a time. If such a measurement takes 5 seconds then theoretically 12 groups can be measured in one minute. It may be possible to create many more monitoring groups than this. Would it be possible to reach monitoring goals in your environment?
The second involves avoiding the situation where a hardware counter could be deallocated: Determine the number of simultaneous RMIDs supported, reduce the effective number of RMIDs available to that number. Use the default RMID (0) for all "unassigned" monitoring
hmmm ... so on the one side there is "only the RMID within the PQR register can be guaranteed to be tracked by hardware" and on the other side there is "A given implementation may have insufficient hardware to simultaneously track the bandwidth for all RMID values that the hardware supports."
From the above there seems to be something in the middle where some subset of the RMID values supported by hardware can be used to simultaneously track bandwidth? How can it be determined what this number of RMID values is?
groups and report "Unavailable" on all counter reads (and address the default monitoring group's counts being unreliable). When assigned, attempt to allocate one of the remaining, usable RMIDs to that group. It would only be possible to assign all event counters (local, total, occupancy) at the same time. Using this approach, we would no longer be able to measure all groups at the same time, but this is something we would already be accepting when using the AMD ABMC feature.
It may be possible to turn this into a "fake"/"software" ABMC feature, which I expect needs to be renamed to move it away from a hardware-specific feature to something that better reflects how the user interacts with the system and how the system responds.
While the second feature is a lot more disruptive at the filesystem layer, it does eliminate the added context switch overhead. Also, it
Which changes to filesystem layer are you anticipating?
may be helpful in the long run for the filesystem code to start taking a more abstract view of hardware monitoring resources, given that few implementations can afford to assign hardware to all monitoring IDs all the time. In both cases, the meaning of "assigned" could vary greatly, even among AMD implementations.
Thanks! -Peter
[1] https://lore.kernel.org/lkml/20231201005720.235639-1-babu.moger@amd.com/
Reinette
Hi Reinette,
On Tue, Dec 5, 2023 at 1:57 PM Reinette Chatre reinette.chatre@intel.com wrote:
On 12/1/2023 12:56 PM, Peter Newman wrote:
On Tue, May 16, 2023 at 5:06 PM Reinette Chatre
I think it may be optimistic to view this as a replacement for a PQR write. As you point out, that requires that a CPU switch between tasks with the same CLOSID. You demonstrate that resctrl already contributes a significant delay to __switch_to - this work will increase that delay much more; the work has to be clear about this impact and motivate why it is acceptable.
We were operating under the assumption that if the overhead wasn't acceptable, we would have heard complaints about it by now, but we ultimately learned that this feature wasn't deployed as much as we had originally thought on AMD hardware and that the overhead does need to be addressed.
I am interested in your opinion on two options I'm exploring to mitigate the overhead, both of which depend on an API like the one Babu recently proposed for the AMD ABMC feature [1], where a new file interface will allow the user to indicate which mon_groups are actively being measured. I will refer to this as "assigned" for now, as that's the current proposal.
The first is likely the simpler approach: only read MBM event counters which have been marked as "assigned" in the filesystem to avoid paying the context switch cost on tasks in groups which are not actively being measured. In our use case, we calculate memory bandwidth on every group every few minutes by reading the counters twice, 5 seconds apart. We would just need counters read during this 5-second window.
I assume that tasks within a monitoring group can be scheduled on any CPU and from the cover letter of this work I understand that only an RMID assigned to a processor can be guaranteed to be tracked by hardware.
Are you proposing for this option that you keep this "soft RMID" approach with CPUs permanently assigned a "hard RMID" but only update the counts for a "soft RMID" that is "assigned"?
Yes
I think that means that the context switch cost for the monitored group would increase even more than with the implementation in this series since the counters need to be read on context switch in as well as context switch out.
If I understand correctly then only one monitoring group can be measured at a time. If such a measurement takes 5 seconds then theoretically 12 groups can be measured in one minute. It may be possible to create many more monitoring groups than this. Would it be possible to reach monitoring goals in your environment?
We actually measure all of the groups at the same time, so thinking about this more, the proposed ABMC fix isn't actually a great fit: the user would have to assign all groups individually when a global setting would have been fine.
Ignoring any present-day resctrl interfaces, what we minimally need is...
1. global "start measurement", which enables a read-counters-on-context switch flag, and broadcasts an IPI to all CPUs to read their current count 2. wait 5 seconds 3. global "end measurement", to IPI all CPUs again for final counts and clear the flag from step 1
Then the user could read at their leisure all the (frozen) event counts from memory until the next measurement begins.
In our case, if we're measuring as often as 5 seconds for every minute, that will already be a 12x aggregate reduction in overhead, which would be worthwhile enough.
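To make that more concrete, here is a rough sketch of what the kernel side of such a toggle could look like. All of the names below (the flag and both helpers) are hypothetical and only illustrate the start/stop-plus-IPI idea; resctrl_mbm_flush_cpu() is the only piece taken from the existing series:

/*
 * Sketch only: rdt_mon_measurement_active and both helpers are invented
 * names, not proposed interfaces.
 */
static bool rdt_mon_measurement_active;

static void snapshot_cpu_counters(void *unused)
{
	/* Fold this CPU's hardware counts into the soft RMID totals. */
	resctrl_mbm_flush_cpu();
}

/* Step 1: start counting on context switch, then take baseline snapshots. */
static void resctrl_start_measurement(void)
{
	WRITE_ONCE(rdt_mon_measurement_active, true);
	on_each_cpu(snapshot_cpu_counters, NULL, 1);
}

/* Step 3: take final snapshots, then stop paying the context switch cost. */
static void resctrl_end_measurement(void)
{
	on_each_cpu(snapshot_cpu_counters, NULL, 1);
	WRITE_ONCE(rdt_mon_measurement_active, false);
}

The context switch path would then only call resctrl_mbm_flush_cpu() while the flag is set, and the ordering between the flag update and the IPIs would need more care than shown here.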
The second involves avoiding the situation where a hardware counter could be deallocated: Determine the number of simultaneous RMIDs supported, reduce the effective number of RMIDs available to that number. Use the default RMID (0) for all "unassigned" monitoring
hmmm ... so on the one side there is "only the RMID within the PQR register can be guaranteed to be tracked by hardware" and on the other side there is "A given implementation may have insufficient hardware to simultaneously track the bandwidth for all RMID values that the hardware supports."
From the above there seems to be something in the middle where some subset of the RMID values supported by hardware can be used to simultaneously track bandwidth? How can it be determined what this number of RMID values is?
In the context of AMD, we could use the smallest number of CPUs in any L3 domain as a lower bound of the number of counters.
If the number is actually higher, it's not too difficult to probe at runtime. The technique used by the test script[1] reliably identifies the number of counters, but some experimentation would be needed to see how quickly the hardware will repurpose a counter, as the script today is using way too long of a workload for the kernel to be invoking.
Maybe a reasonable compromise would be to initialize the HW counter estimate at the CPUs-per-domain value and add a file node to let the user increase it if they have better information. The worst that can happen is the present-day behavior.
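To sketch what such a probe might look like (untested, and only an illustration: MAX_PROBE_RMIDS and both function names are made up, the caller is assumed to be pinned to a CPU in the domain, and it glosses over how quickly the hardware repurposes counters):

#define MAX_PROBE_RMIDS	64	/* hypothetical upper bound for the probe */

static u64 probe_read_mbm_total(u32 rmid)
{
	u64 val;

	/* QM_EVTSEL takes the event ID in the low half, the RMID in the high half. */
	wrmsr(MSR_IA32_QM_EVTSEL, QOS_L3_MBM_TOTAL_EVENT_ID, rmid);
	rdmsrl(MSR_IA32_QM_CTR, val);
	return val;
}

static int probe_hw_counter_count(void)
{
	int rmid, nr = 0;

	/* Make the hardware start tracking RMIDs 1..N in this domain. */
	for (rmid = 1; rmid < MAX_PROBE_RMIDS; rmid++)
		wrmsr(MSR_IA32_PQR_ASSOC, rmid, 0);

	/* RMIDs still returning valid counts were not repurposed. */
	for (rmid = 1; rmid < MAX_PROBE_RMIDS; rmid++)
		if (!(probe_read_mbm_total(rmid) & (RMID_VAL_ERROR | RMID_VAL_UNAVAIL)))
			nr++;

	/* Restore the default RMID/CLOSID before returning. */
	wrmsr(MSR_IA32_PQR_ASSOC, 0, 0);

	return nr;
}

The probe itself disturbs which RMIDs the hardware is tracking, so it would only make sense at mount time, before any groups are being measured.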
groups and report "Unavailable" on all counter reads (and address the default monitoring group's counts being unreliable). When assigned, attempt to allocate one of the remaining, usable RMIDs to that group. It would only be possible to assign all event counters (local, total, occupancy) at the same time. Using this approach, we would no longer be able to measure all groups at the same time, but this is something we would already be accepting when using the AMD ABMC feature.
It may be possible to turn this into a "fake"/"software" ABMC feature, which I expect needs to be renamed to move it away from a hardware-specific feature to something that better reflects how the user interacts with the system and how the system responds.
Given the similarities in monitoring with ABMC and MPAM, I would want to see the interface generalized anyways.
While the second feature is a lot more disruptive at the filesystem layer, it does eliminate the added context switch overhead. Also, it
Which changes to filesystem layer are you anticipating?
Roughly speaking...
1. The proposed "assign" interface would have to become more indirect to avoid understanding how assign could be implemented on various platforms.

2. RMID management would have to change, because this would introduce the option where creating monitoring groups no longer allocates an RMID. It may be cleaner for the filesystem to just track whether a group has allocated monitoring resources or not and let a lower layer understand what the resources actually are. (and in the default mode, groups can only be created with pre-allocated resources)
If I get the impression that this is the better approach, I'll build a prototype on top of the ABMC patches to see how it would go.
So far it seems only the second approach (software ABMC) really ties in with Babu's work.
Thanks! -Peter
[1] https://lore.kernel.org/all/20230421141723.2405942-2-peternewman@google.com/
Hi Peter,
On 12/5/2023 4:33 PM, Peter Newman wrote:
On Tue, Dec 5, 2023 at 1:57 PM Reinette Chatre reinette.chatre@intel.com wrote:
On 12/1/2023 12:56 PM, Peter Newman wrote:
On Tue, May 16, 2023 at 5:06 PM Reinette Chatre
I think it may be optimistic to view this as a replacement for a PQR write. As you point out, that requires that a CPU switch between tasks with the same CLOSID. You demonstrate that resctrl already contributes a significant delay to __switch_to - this work will increase that delay much more; the work has to be clear about this impact and motivate why it is acceptable.
We were operating under the assumption that if the overhead wasn't acceptable, we would have heard complaints about it by now, but we ultimately learned that this feature wasn't deployed as much as we had originally thought on AMD hardware and that the overhead does need to be addressed.
I am interested in your opinion on two options I'm exploring to mitigate the overhead, both of which depend on an API like the one Babu recently proposed for the AMD ABMC feature [1], where a new file interface will allow the user to indicate which mon_groups are actively being measured. I will refer to this as "assigned" for now, as that's the current proposal.
The first is likely the simpler approach: only read MBM event counters which have been marked as "assigned" in the filesystem to avoid paying the context switch cost on tasks in groups which are not actively being measured. In our use case, we calculate memory bandwidth on every group every few minutes by reading the counters twice, 5 seconds apart. We would just need counters read during this 5-second window.
I assume that tasks within a monitoring group can be scheduled on any CPU and from the cover letter of this work I understand that only an RMID assigned to a processor can be guaranteed to be tracked by hardware.
Are you proposing for this option that you keep this "soft RMID" approach with CPUs permanently assigned a "hard RMID" but only update the counts for a "soft RMID" that is "assigned"?
Yes
I think that means that the context switch cost for the monitored group would increase even more than with the implementation in this series since the counters need to be read on context switch in as well as context switch out.
If I understand correctly then only one monitoring group can be measured at a time. If such a measurement takes 5 seconds then theoretically 12 groups can be measured in one minute. It may be possible to create many more monitoring groups than this. Would it be possible to reach monitoring goals in your environment?
We actually measure all of the groups at the same time, so thinking about this more, the proposed ABMC fix isn't actually a great fit: the user would have to assign all groups individually when a global setting would have been fine.
Ignoring any present-day resctrl interfaces, what we minimally need is...
- global "start measurement", which enables a
read-counters-on-context switch flag, and broadcasts an IPI to all CPUs to read their current count 2. wait 5 seconds 3. global "end measurement", to IPI all CPUs again for final counts and clear the flag from step 1
Then the user could read at their leisure all the (frozen) event counts from memory until the next measurement begins.
In our case, if we're measuring as often as 5 seconds for every minute, that will already be a 12x aggregate reduction in overhead, which would be worthwhile enough.
The "con" here would be that during those 5 seconds (which I assume would be controlled via user space so potentially shorter or longer) all tasks in the system is expected to have significant (but yet to be measured) impact on context switch delay. I expect the overflow handler should only be run during the measurement timeframe, to not defeat the "at their leisure" reading of counters.
The second involves avoiding the situation where a hardware counter could be deallocated: Determine the number of simultaneous RMIDs supported, reduce the effective number of RMIDs available to that number. Use the default RMID (0) for all "unassigned" monitoring
hmmm ... so on the one side there is "only the RMID within the PQR register can be guaranteed to be tracked by hardware" and on the other side there is "A given implementation may have insufficient hardware to simultaneously track the bandwidth for all RMID values that the hardware supports."
From the above there seems to be something in the middle where some subset of the RMID values supported by hardware can be used to simultaneously track bandwidth? How can it be determined what this number of RMID values is?
In the context of AMD, we could use the smallest number of CPUs in any L3 domain as a lower bound of the number of counters.
Could you please elaborate on this? (With the numbers of CPUs nowadays this may be many RMIDs, perhaps even more than what ABMC supports.)
I am missing something here since it is not obvious to me how this lower bound is determined. Let's assume that there are as many monitor groups (and thus as many assigned RMIDs) as there are CPUs in a L3 domain. Each monitor group may have many tasks. It can be expected that at any moment in time only a subset of assigned RMIDs are assigned to CPUs via the CPUs' PQR registers. Of those RMIDs that are not assigned to CPUs, how can it be certain that they continue to be tracked by hardware?
If the number is actually higher, it's not too difficult to probe at runtime. The technique used by the test script[1] reliably identifies the number of counters, but some experimentation would be needed to see how quickly the hardware will repurpose a counter, as the script today is using way too long of a workload for the kernel to be invoking.
Maybe a reasonable compromise would be to initialize the HW counter estimate at the CPUs-per-domain value and add a file node to let the user increase it if they have better information. The worst that can happen is the present-day behavior.
groups and report "Unavailable" on all counter reads (and address the default monitoring group's counts being unreliable). When assigned, attempt to allocate one of the remaining, usable RMIDs to that group. It would only be possible to assign all event counters (local, total, occupancy) at the same time. Using this approach, we would no longer be able to measure all groups at the same time, but this is something we would already be accepting when using the AMD ABMC feature.
It may be possible to turn this into a "fake"/"software" ABMC feature, which I expect needs to be renamed to move it away from a hardware-specific feature to something that better reflects how the user interacts with the system and how the system responds.
Given the similarities in monitoring with ABMC and MPAM, I would want to see the interface generalized anyways.
While the second feature is a lot more disruptive at the filesystem layer, it does eliminate the added context switch overhead. Also, it
Which changes to filesystem layer are you anticipating?
Roughly speaking...
1. The proposed "assign" interface would have to become more indirect to avoid understanding how assign could be implemented on various platforms.
It is almost starting to sound like we could learn from the tracing interface where individual events can be enabled/disabled ... with several events potentially enabled with an "enable" done higher in hierarchy, perhaps even globally to support the first approach ...
2. RMID management would have to change, because this would introduce the option where creating monitoring groups no longer allocates an RMID. It may be cleaner for the filesystem to just track whether a group has allocated monitoring resources or not and let a lower layer understand what the resources actually are. (and in the default mode, groups can only be created with pre-allocated resources)
This matches my understanding of what MPAM would need.
If I get the impression that this is the better approach, I'll build a prototype on top of the ABMC patches to see how it would go.
So far it seems only the second approach (software ABMC) really ties in with Babu's work.
Thanks! -Peter
[1] https://lore.kernel.org/all/20230421141723.2405942-2-peternewman@google.com/
Reinette
Hi Reinette,
On Tue, Dec 5, 2023 at 5:47 PM Reinette Chatre reinette.chatre@intel.com wrote:
On 12/5/2023 4:33 PM, Peter Newman wrote:
On Tue, Dec 5, 2023 at 1:57 PM Reinette Chatre reinette.chatre@intel.com wrote:
On 12/1/2023 12:56 PM, Peter Newman wrote:
Ignoring any present-day resctrl interfaces, what we minimally need is...
- global "start measurement", which enables a
read-counters-on-context switch flag, and broadcasts an IPI to all CPUs to read their current count 2. wait 5 seconds 3. global "end measurement", to IPI all CPUs again for final counts and clear the flag from step 1
Then the user could read at their leisure all the (frozen) event counts from memory until the next measurement begins.
In our case, if we're measuring as often as 5 seconds for every minute, that will already be a 12x aggregate reduction in overhead, which would be worthwhile enough.
The "con" here would be that during those 5 seconds (which I assume would be controlled via user space so potentially shorter or longer) all tasks in the system is expected to have significant (but yet to be measured) impact on context switch delay.
Yes, of course. In the worst case I've measured, Zen2, it's roughly a 1700-cycle context switch penalty (~20%) for tasks in different monitoring groups. Bad, but the benefit we gain from the per-RMID MBM data makes up for it several times over if we only pay the cost during a measurement.
I expect the overflow handler should only be run during the measurement timeframe, to not defeat the "at their leisure" reading of counters.
Yes, correct. We wouldn't be interested in overflows of the hardware counter when not actively measuring bandwidth.
The second involves avoiding the situation where a hardware counter could be deallocated: Determine the number of simultaneous RMIDs supported, reduce the effective number of RMIDs available to that number. Use the default RMID (0) for all "unassigned" monitoring
hmmm ... so on the one side there is "only the RMID within the PQR register can be guaranteed to be tracked by hardware" and on the other side there is "A given implementation may have insufficient hardware to simultaneously track the bandwidth for all RMID values that the hardware supports."
From the above there seems to be something in the middle where some subset of the RMID values supported by hardware can be used to simultaneously track bandwidth? How can it be determined what this number of RMID values is?
In the context of AMD, we could use the smallest number of CPUs in any L3 domain as a lower bound of the number of counters.
Could you please elaborate on this? (With the numbers of CPUs nowadays this may be many RMIDs, perhaps even more than what ABMC supports.)
I think the "In the context of AMD" part is key. This feature would only be applicable to the AMD implementations we have today which do not implement ABMC. I believe the difficulties are unique to the topologies of these systems: many small L3 domains per node with a relatively small number of CPUs in each. If the L3 domains were large and few, simply restricting the number of RMIDs and allocating on group creation as we do today would probably be fine.
I am missing something here since it is not obvious to me how this lower bound is determined. Let's assume that there are as many monitor groups (and thus as many assigned RMIDs) as there are CPUs in a L3 domain. Each monitor group may have many tasks. It can be expected that at any moment in time only a subset of assigned RMIDs are assigned to CPUs via the CPUs' PQR registers. Of those RMIDs that are not assigned to CPUs, how can it be certain that they continue to be tracked by hardware?
Are you asking whether the counters will ever be reclaimed proactively? The behavior I've observed is that writing a new RMID into a PQR_ASSOC register when all hardware counters in the domain are allocated will trigger the reallocation.
However, I admit the wording in the PQoS spec[1] is only written to support the permanent-assignment workaround in the current patch series:
"All RMIDs which are currently in use by one or more processors in the QOS domain will be tracked. The hardware will always begin tracking a new RMID value when it gets written to the PQR_ASSOC register of any of the processors in the QOS domain and it is not already being tracked. When the hardware begins tracking an RMID that it was not previously tracking, it will clear the QM_CTR for all events in the new RMID."
I would need to confirm whether this is the case and request the documentation be clarified if it is.
While the second feature is a lot more disruptive at the filesystem layer, it does eliminate the added context switch overhead. Also, it
Which changes to filesystem layer are you anticipating?
Roughly speaking...
1. The proposed "assign" interface would have to become more indirect to avoid understanding how assign could be implemented on various platforms.
It is almost starting to sound like we could learn from the tracing interface where individual events can be enabled/disabled ... with several events potentially enabled with an "enable" done higher in hierarchy, perhaps even globally to support the first approach ...
Sorry, can you clarify the part about the tracing interface? Tracing to support dynamic autoconfiguration of events?
Thanks! -Peter
[1] AMD64 Technology Platform Quality of Service Extensions, Revision: 1.03: https://bugzilla.kernel.org/attachment.cgi?id=301365
Hi Peter,
On 12/6/2023 10:38 AM, Peter Newman wrote:
Hi Reinette,
On Tue, Dec 5, 2023 at 5:47 PM Reinette Chatre reinette.chatre@intel.com wrote:
On 12/5/2023 4:33 PM, Peter Newman wrote:
On Tue, Dec 5, 2023 at 1:57 PM Reinette Chatre reinette.chatre@intel.com wrote:
On 12/1/2023 12:56 PM, Peter Newman wrote:
Ignoring any present-day resctrl interfaces, what we minimally need is...
- global "start measurement", which enables a
read-counters-on-context switch flag, and broadcasts an IPI to all CPUs to read their current count 2. wait 5 seconds 3. global "end measurement", to IPI all CPUs again for final counts and clear the flag from step 1
Then the user could read at their leisure all the (frozen) event counts from memory until the next measurement begins.
In our case, if we're measuring as often as 5 seconds for every minute, that will already be a 12x aggregate reduction in overhead, which would be worthwhile enough.
The "con" here would be that during those 5 seconds (which I assume would be controlled via user space so potentially shorter or longer) all tasks in the system is expected to have significant (but yet to be measured) impact on context switch delay.
Yes, of course. In the worst case I've measured, Zen2, it's roughly a 1700-cycle context switch penalty (~20%) for tasks in different monitoring groups. Bad, but the benefit we gain from the per-RMID MBM data makes up for it several times over if we only pay the cost during a measurement.
I see.
I expect the overflow handler should only be run during the measurement timeframe, to not defeat the "at their leisure" reading of counters.
Yes, correct. We wouldn't be interested in overflows of the hardware counter when not actively measuring bandwidth.
The second involves avoiding the situation where a hardware counter could be deallocated: Determine the number of simultaneous RMIDs supported, reduce the effective number of RMIDs available to that number. Use the default RMID (0) for all "unassigned" monitoring
hmmm ... so on the one side there is "only the RMID within the PQR register can be guaranteed to be tracked by hardware" and on the other side there is "A given implementation may have insufficient hardware to simultaneously track the bandwidth for all RMID values that the hardware supports."
From the above there seems to be something in the middle where some subset of the RMID values supported by hardware can be used to simultaneously track bandwidth? How can it be determined what this number of RMID values is?
In the context of AMD, we could use the smallest number of CPUs in any L3 domain as a lower bound of the number of counters.
Could you please elaborate on this? (With the numbers of CPUs nowadays this may be many RMIDs, perhaps even more than what ABMC supports.)
I think the "In the context of AMD" part is key. This feature would only be applicable to the AMD implementations we have today which do not implement ABMC. I believe the difficulties are unique to the topologies of these systems: many small L3 domains per node with a relatively small number of CPUs in each. If the L3 domains were large and few, simply restricting the number of RMIDs and allocating on group creation as we do today would probably be fine.
I am missing something here since it is not obvious to me how this lower bound is determined. Let's assume that there are as many monitor groups (and thus as many assigned RMIDs) as there are CPUs in a L3 domain. Each monitor group may have many tasks. It can be expected that at any moment in time only a subset of assigned RMIDs are assigned to CPUs via the CPUs' PQR registers. Of those RMIDs that are not assigned to CPUs, how can it be certain that they continue to be tracked by hardware?
Are you asking whether the counters will ever be reclaimed proactively? The behavior I've observed is that writing a new RMID into a PQR_ASSOC register when all hardware counters in the domain are allocated will trigger the reallocation.
"When all hardware counters in the domain are allocated" sounds like the ideal scenario with the kernel knowing how many counters there are and each counter is associated with a unique RMID. As long as kernel does not attempt to monitor another RMID this would accurately monitor the monitor groups with "assigned" RMID.
Adding support for hardware without specification and guaranteed behavior can potentially run into unexpected scenarios.
For example, there is no guarantee on how the counters are assigned. The OS and hardware may thus have different view of which hardware counter is "free". OS may write a new RMID to PQR_ASSOC believing that there is a counter available while hardware has its own mechanism of allocation and may reallocate a counter that is in use by an RMID that the OS believes to be "assigned". I do not think anything prevents hardware from doing this.
However, I admit the wording in the PQoS spec[1] is only written to support the permanent-assignment workaround in the current patch series:
"All RMIDs which are currently in use by one or more processors in the QOS domain will be tracked. The hardware will always begin tracking a new RMID value when it gets written to the PQR_ASSOC register of any of the processors in the QOS domain and it is not already being tracked. When the hardware begins tracking an RMID that it was not previously tracking, it will clear the QM_CTR for all events in the new RMID."
I would need to confirm whether this is the case and request the documentation be clarified if it is.
Indeed. Once an RMID is "assigned" then the expectation is that a counter will be dedicated to it but a PQR_ASSOC register may not see that RMID for potentially long intervals. With the above guarantees hardware will be within its rights to reallocate that RMID's counter even if there are other counters that are "free" from OS perspective.
While the second feature is a lot more disruptive at the filesystem layer, it does eliminate the added context switch overhead. Also, it
Which changes to filesystem layer are you anticipating?
Roughly speaking...
1. The proposed "assign" interface would have to become more indirect to avoid understanding how assign could be implemented on various platforms.
It is almost starting to sound like we could learn from the tracing interface where individual events can be enabled/disabled ... with several events potentially enabled with an "enable" done higher in hierarchy, perhaps even globally to support the first approach ...
Sorry, can you clarify the part about the tracing interface? Tracing to support dynamic autoconfiguration of events?
I do not believe we are attempting to do anything revolutionary here, so I would like to consider other interfaces that user space may be familiar and comfortable with. The first that came to mind was the tracefs interface and how user space interacts with it to enable trace events. tracefs uses the "enable" file that is present at different levels of the hierarchy, which user space can use to enable tracing of all events in that hierarchy. There is also the global "tracing_on" that user space can use to dynamically start/stop tracing without needing to frequently enable/disable events of interest.
I do see some parallels with the discussions we have been having. I am not proposing that we adapt the tracefs interface, but instead that we can perhaps learn from it.
Reinette
Hi Reinette,
On Fri, May 12, 2023 at 3:25 PM Peter Newman peternewman@google.com wrote:
On Thu, May 11, 2023 at 11:37 PM Reinette Chatre reinette.chatre@intel.com wrote:
On 4/21/2023 7:17 AM, Peter Newman wrote:
	if (evtid == QOS_L3_MBM_LOCAL_EVENT_ID) {
		counter = &state->local;
	} else {
		WARN_ON(evtid != QOS_L3_MBM_TOTAL_EVENT_ID);
		counter = &state->total;
	}

	/*
	 * Propagate the value read from the hw_rmid assigned to the current CPU
	 * into the "soft" rmid associated with the current task or CPU.
	 */
	m = get_mbm_state(d, soft_rmid, evtid);
	if (!m)
		return;

	if (resctrl_arch_rmid_read(r, d, hw_rmid, evtid, &val))
		return;
This all seems unsafe to run without protection. The code relies on the rdt_domain but a CPU hotplug event could result in the domain disappearing underneath this code. The accesses to the data structures also appear unsafe to me. Note that resctrl_arch_rmid_read() updates the architectural MBM state and this same state can be updated concurrently in other code paths without appropriate locking.
The domain is supposed to always be the current one, but I see that even a get_domain_from_cpu(smp_processor_id(), ...) call needs to walk a resource's domain list to find a matching entry, which could be concurrently modified when other domains are added/removed.
Similarly, when soft RMIDs are enabled, it should not be possible to call resctrl_arch_rmid_read() on anything other than the current CPU's HW RMID.
I'll need to confirm whether it's safe to access the current CPU's rdt_domain in an atomic context. If it isn't, I assume I would have to arrange all of the state used during flush to be per-CPU.
I expect the constraints on what data can be safely accessed where is going to constrain how the state is ultimately arranged, so I will need to settle this before I can come back to the other questions about mbm_state.
According to cpu_hotplug.rst, the startup callbacks are called before a CPU is started and the teardown callbacks are called after the CPU has become dysfunctional. It should therefore always be safe for a CPU to access its own data, so all I need to do here is avoid walking domain lists in resctrl_mbm_flush_cpu().
However, this also means that resctrl_{on,off}line_cpu() call clear_closid_rmid() on a different CPU, so whichever CPU executes these will zap its own pqr_state struct and PQR_ASSOC MSR.
-Peter
On Tue, May 16, 2023 at 4:18 PM Peter Newman peternewman@google.com wrote:
According to cpu_hotplug.rst, the startup callbacks are called before a CPU is started and the teardown callbacks are called after the CPU has become dysfunctional. It should therefore always be safe for a CPU to access its own data, so all I need to do here is avoid walking domain lists in resctrl_mbm_flush_cpu().
However, this also means that resctrl_{on,off}line_cpu() call clear_closid_rmid() on a different CPU, so whichever CPU executes these will zap its own pqr_state struct and PQR_ASSOC MSR.
Sorry, I read the wrong section. I was looking at PREPARE section callbacks. ONLINE callbacks are called on the CPU, so calling clear_closid_rmid() is fine.
It says the offline callback is called from the per-cpu hotplug thread, so I'm not sure if that means another context switch is possible after the teardown handler has run. To be safe, I can make resctrl_mbm_flush_cpu() check to see if it still has its domain state.
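Roughly along these lines (a sketch only: the per-CPU pointer name is invented, and it assumes the teardown path clears it before the domain state goes away):

/* Hypothetical per-CPU cache of this CPU's L3 domain, cleared on teardown. */
static DEFINE_PER_CPU(struct rdt_domain *, mbm_flush_domain);

void resctrl_mbm_flush_cpu(void)
{
	struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
	struct rdt_domain *d = this_cpu_read(mbm_flush_domain);

	/* The teardown callback already ran, so there is nothing to flush into. */
	if (!d)
		return;

	__mbm_flush(QOS_L3_MBM_LOCAL_EVENT_ID, r, d);
	__mbm_flush(QOS_L3_MBM_TOTAL_EVENT_ID, r, d);
}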
-Peter
Hi Reinette,
On Thu, May 11, 2023 at 11:37 PM Reinette Chatre reinette.chatre@intel.com wrote:
On 4/21/2023 7:17 AM, Peter Newman wrote:
	/* Count bandwidth after the first successful counter read. */
	if (counter->initialized) {
		/* Assume that mbm_update() will prevent double-overflows. */
		if (val != counter->prev_bytes)
			atomic64_add(val - counter->prev_bytes,
				     &m->soft_rmid_bytes);
	} else {
		counter->initialized = true;
	}

	counter->prev_bytes = val;
I notice a lot of similarities between the above and the software controller, see mbm_bw_count().
I see the "a=now(); a-b; b=a;" and the not handling overflow parts being similar, but the use of the initialized flag seems quite different from delta_comp.
Also mbm_state is on the arch-independent side and the new code is going to the arch-dependent side, so it wouldn't be convenient to try to use the mbm_bw structures for this.
From this, I don't think trying to reuse this is worth it unless you have other suggestions.
-Peter
Hi Peter,
On 6/1/2023 7:45 AM, Peter Newman wrote:
Hi Reinette,
On Thu, May 11, 2023 at 11:37 PM Reinette Chatre reinette.chatre@intel.com wrote:
On 4/21/2023 7:17 AM, Peter Newman wrote:
	/* Count bandwidth after the first successful counter read. */
	if (counter->initialized) {
		/* Assume that mbm_update() will prevent double-overflows. */
		if (val != counter->prev_bytes)
			atomic64_add(val - counter->prev_bytes,
				     &m->soft_rmid_bytes);
	} else {
		counter->initialized = true;
	}

	counter->prev_bytes = val;
I notice a lot of similarities between the above and the software controller, see mbm_bw_count().
I see the "a=now(); a-b; b=a;" and the not handling overflow parts being similar, but the use of the initialized flag seems quite different from delta_comp.
Also mbm_state is on the arch-independent side and the new code is going to the arch-dependent side, so it wouldn't be convenient to try to use the mbm_bw structures for this.
From this, I don't think trying to reuse this is worth it unless you have other suggestions.
At this time I am staring at mbm_state->prev_bw_bytes and mbm_soft_counter->prev_bytes and concerned about how much confusion this would generate. Considering the pending changes to data structures I hope this would be clear then.
Reinette
To implement soft RMIDs, context switch must detect when the current soft RMID is changing and if so, flush the CPU's MBM event counts to the outgoing soft RMID.
To avoid impacting context switch performance in the non-soft RMID case, protect the new logic with a static branch.
Co-developed-by: Stephane Eranian eranian@google.com
Signed-off-by: Stephane Eranian eranian@google.com
Signed-off-by: Peter Newman peternewman@google.com
---
 arch/x86/include/asm/resctrl.h         | 27 +++++++++++++++++++++++++-
 arch/x86/kernel/cpu/resctrl/rdtgroup.c |  1 +
 2 files changed, 27 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/resctrl.h b/arch/x86/include/asm/resctrl.h
index e7acf118d770..50d05e883dbb 100644
--- a/arch/x86/include/asm/resctrl.h
+++ b/arch/x86/include/asm/resctrl.h
@@ -36,6 +36,9 @@ DECLARE_PER_CPU(struct resctrl_pqr_state, pqr_state);
 DECLARE_STATIC_KEY_FALSE(rdt_enable_key);
 DECLARE_STATIC_KEY_FALSE(rdt_alloc_enable_key);
 DECLARE_STATIC_KEY_FALSE(rdt_mon_enable_key);
+DECLARE_STATIC_KEY_FALSE(rdt_soft_rmid_enable_key);
+
+void resctrl_mbm_flush_cpu(void);

 /*
  * __resctrl_sched_in() - Writes the task's CLOSid/RMID to IA32_PQR_MSR
@@ -75,9 +78,31 @@ static inline void __resctrl_sched_in(struct task_struct *tsk)
 	}

 	if (closid != state->cur_closid || rmid != state->cur_rmid) {
+		if (static_branch_likely(&rdt_soft_rmid_enable_key)) {
+			/*
+			 * Flush current event counts to outgoing soft rmid
+			 * when it changes.
+			 */
+			if (rmid != state->cur_rmid)
+				resctrl_mbm_flush_cpu();
+
+			/*
+			 * rmid never changes in this mode, so skip wrmsr if the
+			 * closid is not changing.
+			 */
+			if (closid != state->cur_closid)
+				wrmsr(MSR_IA32_PQR_ASSOC, state->hw_rmid,
+				      closid);
+		} else {
+			wrmsr(MSR_IA32_PQR_ASSOC, rmid, closid);
+		}
+
+		/*
+		 * Record new closid/rmid last so soft rmid case can detect
+		 * changes.
+		 */
 		state->cur_closid = closid;
 		state->cur_rmid = rmid;
-		wrmsr(MSR_IA32_PQR_ASSOC, rmid, closid);
 	}
 }

diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index 6ad33f355861..c10f4798156a 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -35,6 +35,7 @@
 DEFINE_STATIC_KEY_FALSE(rdt_enable_key);
 DEFINE_STATIC_KEY_FALSE(rdt_mon_enable_key);
 DEFINE_STATIC_KEY_FALSE(rdt_alloc_enable_key);
+DEFINE_STATIC_KEY_FALSE(rdt_soft_rmid_enable_key);
 static struct kernfs_root *rdt_root;
 struct rdtgroup rdtgroup_default;
 LIST_HEAD(rdt_all_groups);
Hi Peter,
On 4/21/2023 7:17 AM, Peter Newman wrote:
To implement soft RMIDs, context switch must detect when the current soft RMID is changing and if so, flush the CPU's MBM event counts to the outgoing soft RMID.
To avoid impacting context switch performance in the non-soft RMID case, protect the new logic with a static branch.
Co-developed-by: Stephane Eranian eranian@google.com Signed-off-by: Stephane Eranian eranian@google.com Signed-off-by: Peter Newman peternewman@google.com
 arch/x86/include/asm/resctrl.h         | 27 +++++++++++++++++++++++++-
 arch/x86/kernel/cpu/resctrl/rdtgroup.c |  1 +
 2 files changed, 27 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/resctrl.h b/arch/x86/include/asm/resctrl.h
index e7acf118d770..50d05e883dbb 100644
--- a/arch/x86/include/asm/resctrl.h
+++ b/arch/x86/include/asm/resctrl.h
@@ -36,6 +36,9 @@ DECLARE_PER_CPU(struct resctrl_pqr_state, pqr_state);
 DECLARE_STATIC_KEY_FALSE(rdt_enable_key);
 DECLARE_STATIC_KEY_FALSE(rdt_alloc_enable_key);
 DECLARE_STATIC_KEY_FALSE(rdt_mon_enable_key);
+DECLARE_STATIC_KEY_FALSE(rdt_soft_rmid_enable_key);
+
+void resctrl_mbm_flush_cpu(void);

 /*
  * __resctrl_sched_in() - Writes the task's CLOSid/RMID to IA32_PQR_MSR
@@ -75,9 +78,31 @@ static inline void __resctrl_sched_in(struct task_struct *tsk)
 	}

 	if (closid != state->cur_closid || rmid != state->cur_rmid) {
+		if (static_branch_likely(&rdt_soft_rmid_enable_key)) {
Could you please elaborate on the choice of static_branch_likely() (as opposed to static_branch_unlikely())?
Reinette
There is no point in using IPIs to call mon_event_count() when it is only reading software counters from memory.
When RMIDs are soft, mon_event_read() just calls mon_event_count() directly.
Signed-off-by: Peter Newman peternewman@google.com
---
 arch/x86/kernel/cpu/resctrl/ctrlmondata.c | 9 ++++++++-
 arch/x86/kernel/cpu/resctrl/internal.h    | 1 +
 arch/x86/kernel/cpu/resctrl/monitor.c     | 5 +++++
 3 files changed, 14 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
index b44c487727d4..b2ed25a08f6f 100644
--- a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
+++ b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
@@ -534,7 +534,14 @@ void mon_event_read(struct rmid_read *rr, struct rdt_resource *r,
 	rr->val = 0;
 	rr->first = first;

-	smp_call_function_any(&d->cpu_mask, mon_event_count, rr, 1);
+	if (rdt_mon_soft_rmid)
+		/*
+		 * Soft RMID counters reside in memory, so they can be read from
+		 * anywhere.
+		 */
+		mon_event_count(rr);
+	else
+		smp_call_function_any(&d->cpu_mask, mon_event_count, rr, 1);
 }

 int rdtgroup_mondata_show(struct seq_file *m, void *arg)
diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index 256eee05d447..e6ff31a4dbc4 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -115,6 +115,7 @@ struct rmid_read {

 extern bool rdt_alloc_capable;
 extern bool rdt_mon_capable;
+extern bool rdt_mon_soft_rmid;
 extern unsigned int rdt_mon_features;
 extern struct list_head resctrl_schema_all;

diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index 3671100d3cc7..bb857eefa3b0 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -57,6 +57,11 @@ static struct rmid_entry *rmid_ptrs;
  */
 bool rdt_mon_capable;

+/*
+ * Global boolean to indicate when RMIDs are implemented in software.
+ */
+bool rdt_mon_soft_rmid;
+
 /*
  * Global to indicate which monitoring events are enabled.
  */
Hi Peter,
On 4/21/2023 7:17 AM, Peter Newman wrote:
There is no point in using IPIs to call mon_event_count() when it is only reading software counters from memory.
When RMIDs are soft, mon_event_read() just calls mon_event_count() directly.
From this patch forward the patch ordering is a bit confusing. At this time mon_event_count() does not read software counters from memory so I think this change should move to later.
Also, note that rdt_mon_soft_rmid is introduced here but the reader is left wondering how it is set until the final patch in this series.
Reinette
When RMIDs are soft, __mon_event_count() only needs to report the current byte count in memory and should not touch the hardware RMIDs.
Create a parallel version for the soft RMID configuration and update __mon_event_count() to choose between it and the original depending on whether the soft RMID static key is enabled.
Signed-off-by: Peter Newman peternewman@google.com
---
 arch/x86/kernel/cpu/resctrl/monitor.c | 33 ++++++++++++++++++++++++++-
 1 file changed, 32 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index bb857eefa3b0..3d54a634471a 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -487,7 +487,30 @@ void resctrl_mbm_flush_cpu(void)
 	__mbm_flush(QOS_L3_MBM_TOTAL_EVENT_ID, r, d);
 }

-static int __mon_event_count(u32 rmid, struct rmid_read *rr)
+static int __mon_event_count_soft_rmid(u32 rmid, struct rmid_read *rr)
+{
+	struct mbm_state *m;
+
+	WARN_ON(!is_mbm_event(rr->evtid));
+	m = get_mbm_state(rr->d, rmid, rr->evtid);
+	if (!m)
+		/* implies !is_mbm_event(...) */
+		return -1;
+
+	rr->val += atomic64_read(&m->soft_rmid_bytes);
+
+	if (rr->first) {
+		/*
+		 * Discard any bandwidth resulting from the initial HW counter
+		 * reads.
+		 */
+		atomic64_set(&m->soft_rmid_bytes, 0);
+	}
+
+	return 0;
+}
+
+static int __mon_event_count_default(u32 rmid, struct rmid_read *rr)
 {
 	struct mbm_state *m;
 	u64 tval = 0;
@@ -509,6 +532,14 @@ static int __mon_event_count(u32 rmid, struct rmid_read *rr)
 	return 0;
 }

+static int __mon_event_count(u32 rmid, struct rmid_read *rr)
+{
+	if (rdt_mon_soft_rmid)
+		return __mon_event_count_soft_rmid(rmid, rr);
+	else
+		return __mon_event_count_default(rmid, rr);
+}
+
 /*
  * mbm_bw_count() - Update bw count from values previously read by
  * __mon_event_count().
Hi Peter,
On 4/21/2023 7:17 AM, Peter Newman wrote:
When RMIDs are soft, __mon_event_count() only needs to report the current byte count in memory and should not touch the hardware RMIDs.
Create a parallel version for the soft RMID configuration and update __mon_event_count() to choose between it and the original depending on whether the soft RMID static key is enabled.
Please note that the changelog refers to "whether the soft RMID static key is enabled" but the patch uses a bool instead of a static key.
Signed-off-by: Peter Newman peternewman@google.com
arch/x86/kernel/cpu/resctrl/monitor.c | 33 ++++++++++++++++++++++++++- 1 file changed, 32 insertions(+), 1 deletion(-)
diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index bb857eefa3b0..3d54a634471a 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -487,7 +487,30 @@ void resctrl_mbm_flush_cpu(void)
 	__mbm_flush(QOS_L3_MBM_TOTAL_EVENT_ID, r, d);
 }

-static int __mon_event_count(u32 rmid, struct rmid_read *rr)
+static int __mon_event_count_soft_rmid(u32 rmid, struct rmid_read *rr)
+{
+	struct mbm_state *m;
+
+	WARN_ON(!is_mbm_event(rr->evtid));
+	m = get_mbm_state(rr->d, rmid, rr->evtid);
+	if (!m)
+		/* implies !is_mbm_event(...) */
+		return -1;
+
+	rr->val += atomic64_read(&m->soft_rmid_bytes);
+
+	if (rr->first) {
+		/*
+		 * Discard any bandwidth resulting from the initial HW counter
+		 * reads.
+		 */
+		atomic64_set(&m->soft_rmid_bytes, 0);
+	}
The above is not clear to me. If rr->first is true then it would read soft_rmid_bytes and then immediately reset it?
Reinette
To implement soft RMIDs, each CPU needs a HW RMID that is unique within its L3 cache domain. This is the minimum number of RMIDs needed to monitor all CPUs.
This is accomplished by determining the rank of each CPU's mask bit within its L3 shared_cpu_mask in resctrl_online_cpu().
Signed-off-by: Peter Newman peternewman@google.com
---
 arch/x86/kernel/cpu/resctrl/core.c | 39 +++++++++++++++++++++++++++++-
 1 file changed, 38 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index 47b1c37a81f8..b0d873231b1e 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -596,6 +596,38 @@ static void domain_remove_cpu(int cpu, struct rdt_resource *r)
 	}
 }

+/* Assign each CPU an RMID that is unique within its cache domain. */
+static u32 determine_hw_rmid_for_cpu(int cpu)
+{
+	struct cpu_cacheinfo *ci = get_cpu_cacheinfo(cpu);
+	struct cacheinfo *l3ci = NULL;
+	u32 rmid;
+	int i;
+
+	/* Locate the cacheinfo for this CPU's L3 cache. */
+	for (i = 0; i < ci->num_leaves; i++) {
+		if (ci->info_list[i].level == 3 &&
+		    (ci->info_list[i].attributes & CACHE_ID)) {
+			l3ci = &ci->info_list[i];
+			break;
+		}
+	}
+	WARN_ON(!l3ci);
+
+	if (!l3ci)
+		return 0;
+
+	/* Use the position of cpu in its shared_cpu_mask as its RMID. */
+	rmid = 0;
+	for_each_cpu(i, &l3ci->shared_cpu_map) {
+		if (i == cpu)
+			break;
+		rmid++;
+	}
+
+	return rmid;
+}
+
 static void clear_closid_rmid(int cpu)
 {
 	struct resctrl_pqr_state *state = this_cpu_ptr(&pqr_state);
@@ -604,7 +636,12 @@ static void clear_closid_rmid(int cpu)
 	state->default_rmid = 0;
 	state->cur_closid = 0;
 	state->cur_rmid = 0;
-	wrmsr(MSR_IA32_PQR_ASSOC, 0, 0);
+	state->hw_rmid = 0;
+
+	if (static_branch_likely(&rdt_soft_rmid_enable_key))
+		state->hw_rmid = determine_hw_rmid_for_cpu(cpu);
+
+	wrmsr(MSR_IA32_PQR_ASSOC, state->hw_rmid, 0);
 }

 static int resctrl_online_cpu(unsigned int cpu)
Hi Peter,
On 4/21/2023 7:17 AM, Peter Newman wrote:
To implement soft RMIDs, each CPU needs a HW RMID that is unique within its L3 cache domain. This is the minimum number of RMIDs needed to monitor all CPUs.
This is accomplished by determining the rank of each CPU's mask bit within its L3 shared_cpu_mask in resctrl_online_cpu().
Signed-off-by: Peter Newman peternewman@google.com
arch/x86/kernel/cpu/resctrl/core.c | 39 +++++++++++++++++++++++++++++- 1 file changed, 38 insertions(+), 1 deletion(-)
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index 47b1c37a81f8..b0d873231b1e 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -596,6 +596,38 @@ static void domain_remove_cpu(int cpu, struct rdt_resource *r)
 	}
 }

+/* Assign each CPU an RMID that is unique within its cache domain. */
+static u32 determine_hw_rmid_for_cpu(int cpu)
This code tends to use the verb "get", something like "get_hw_rmid()" could work.
+{
+	struct cpu_cacheinfo *ci = get_cpu_cacheinfo(cpu);
+	struct cacheinfo *l3ci = NULL;
+	u32 rmid;
+	int i;
+
+	/* Locate the cacheinfo for this CPU's L3 cache. */
+	for (i = 0; i < ci->num_leaves; i++) {
+		if (ci->info_list[i].level == 3 &&
+		    (ci->info_list[i].attributes & CACHE_ID)) {
+			l3ci = &ci->info_list[i];
+			break;
+		}
+	}
+	WARN_ON(!l3ci);
+
+	if (!l3ci)
+		return 0;
You can use "if (WARN_ON(..))"
+	/* Use the position of cpu in its shared_cpu_mask as its RMID. */
(please use "CPU" instead of "cpu" in comments and changelogs)
+	rmid = 0;
+	for_each_cpu(i, &l3ci->shared_cpu_map) {
+		if (i == cpu)
+			break;
+		rmid++;
+	}
+
+	return rmid;
+}
I do not see any impact to the (soft) RMIDs that can be assigned to monitor groups, yet from what I understand a generic "RMID" is used as index to MBM state. Is this correct? A hardware RMID and software RMID would thus share the same MBM state. If this is correct I think we need to work on making the boundaries between hard and soft RMID more clear.
 static void clear_closid_rmid(int cpu)
 {
 	struct resctrl_pqr_state *state = this_cpu_ptr(&pqr_state);
@@ -604,7 +636,12 @@ static void clear_closid_rmid(int cpu)
 	state->default_rmid = 0;
 	state->cur_closid = 0;
 	state->cur_rmid = 0;
-	wrmsr(MSR_IA32_PQR_ASSOC, 0, 0);
+	state->hw_rmid = 0;
+
+	if (static_branch_likely(&rdt_soft_rmid_enable_key))
+		state->hw_rmid = determine_hw_rmid_for_cpu(cpu);
+
+	wrmsr(MSR_IA32_PQR_ASSOC, state->hw_rmid, 0);
 }

 static int resctrl_online_cpu(unsigned int cpu)
Reinette
Hi Reinette,
On Thu, May 11, 2023 at 11:40 PM Reinette Chatre reinette.chatre@intel.com wrote:
On 4/21/2023 7:17 AM, Peter Newman wrote:
	/* Locate the cacheinfo for this CPU's L3 cache. */
	for (i = 0; i < ci->num_leaves; i++) {
		if (ci->info_list[i].level == 3 &&
		    (ci->info_list[i].attributes & CACHE_ID)) {
			l3ci = &ci->info_list[i];
			break;
		}
	}
	WARN_ON(!l3ci);

	if (!l3ci)
		return 0;
You can use "if (WARN_ON(..))"
Thanks, I'll look for the other changes in the series which would benefit from this.
	rmid = 0;
	for_each_cpu(i, &l3ci->shared_cpu_map) {
		if (i == cpu)
			break;
		rmid++;
	}

	return rmid;
}
I do not see any impact to the (soft) RMIDs that can be assigned to monitor groups, yet from what I understand a generic "RMID" is used as index to MBM state. Is this correct? A hardware RMID and software RMID would thus share the same MBM state. If this is correct I think we need to work on making the boundaries between hard and soft RMID more clear.
The only RMID-indexed state used by soft RMIDs right now is mbm_state::soft_rmid_bytes. The other aspect of the boundary is ensuring that nothing will access the hard RMID-specific state for a soft RMID.
The remainder of the mbm_state is only accessed by the software controller, which you suggested that I disable.
The arch_mbm_state is accessed only through resctrl_arch_rmid_read() and resctrl_arch_reset_rmid(), which are called by __mon_event_count() or the limbo handler.
__mon_event_count() is aware of soft RMIDs, so I would just need to ensure the software controller is disabled and never put any RMIDs on the limbo list. To be safe, I can also add WARN_ON_ONCE(rdt_mon_soft_rmid) to the rmid-indexing of the mbm_state arrays in the software controller and before the resctrl_arch_rmid_read() call in the limbo handler to catch if they're ever using soft RMIDs.
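As a rough illustration of the kind of guard described here (not part of the posted series; the helper name and the simplified limbo logic are hypothetical, while rdt_mon_soft_rmid comes from this series and resctrl_arch_rmid_read()/resctrl_rmid_realloc_threshold are the existing resctrl names):

static bool limbo_rmid_dirty(struct rdt_resource *r, struct rdt_domain *d,
			     u32 rmid)
{
	u64 val = 0;

	/* Soft RMIDs must never be placed on the limbo list. */
	if (WARN_ON_ONCE(rdt_mon_soft_rmid))
		return false;

	/* Hard-RMID occupancy read, as the limbo handler does today. */
	if (resctrl_arch_rmid_read(r, d, rmid, QOS_L3_OCCUP_EVENT_ID, &val))
		return false;

	return val > resctrl_rmid_realloc_threshold;
}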
-Peter
Hi Peter,
On 5/16/2023 7:49 AM, Peter Newman wrote:
On Thu, May 11, 2023 at 11:40 PM Reinette Chatre reinette.chatre@intel.com wrote:
On 4/21/2023 7:17 AM, Peter Newman wrote:
	rmid = 0;
	for_each_cpu(i, &l3ci->shared_cpu_map) {
		if (i == cpu)
			break;
		rmid++;
	}

	return rmid;
}
I do not see any impact to the (soft) RMIDs that can be assigned to monitor groups, yet from what I understand a generic "RMID" is used as index to MBM state. Is this correct? A hardware RMID and software RMID would thus share the same MBM state. If this is correct I think we need to work on making the boundaries between hard and soft RMID more clear.
The only RMID-indexed state used by soft RMIDs right now is mbm_state::soft_rmid_bytes. The other aspect of the boundary is ensuring that nothing will access the hard RMID-specific state for a soft RMID.
The remainder of the mbm_state is only accessed by the software controller, which you suggested that I disable.
The arch_mbm_state is accessed only through resctrl_arch_rmid_read() and resctrl_arch_reset_rmid(), which are called by __mon_event_count() or the limbo handler.
__mon_event_count() is aware of soft RMIDs, so I would just need to ensure the software controller is disabled and never put any RMIDs on the limbo list. To be safe, I can also add WARN_ON_ONCE(rdt_mon_soft_rmid) to the rmid-indexing of the mbm_state arrays in the software controller and before the resctrl_arch_rmid_read() call in the limbo handler to catch if they're ever using soft RMIDs.
I understand and trust that you can ensure that this implementation is done safely. Please also consider how future changes to resctrl may stumble if there are not clear boundaries. You may be able to "ensure the software controller is disabled and never put any RMIDs on the limbo list", but consider if these rules will be clear to somebody who comes along in a year or more.
Documenting the data structures with these unique usages will help. Specific accessors can sometimes be useful to make it obvious in which state the data is being accessed and what data can be accessed. Using WARN as you suggest is a useful tool.
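One shape such an accessor could take, purely for illustration (the helper name is hypothetical and not from the posted patches; it assumes the rdt_mon_soft_rmid flag from this series and the existing per-domain mbm_total/mbm_local arrays):

static struct mbm_state *get_hard_rmid_mbm_state(struct rdt_domain *d,
						 u32 rmid,
						 enum resctrl_event_id evtid)
{
	/* This per-RMID state is only meaningful for hard RMIDs. */
	if (WARN_ON_ONCE(rdt_mon_soft_rmid))
		return NULL;

	switch (evtid) {
	case QOS_L3_MBM_TOTAL_EVENT_ID:
		return &d->mbm_total[rmid];
	case QOS_L3_MBM_LOCAL_EVENT_ID:
		return &d->mbm_local[rmid];
	default:
		return NULL;
	}
}

Callers that today index mbm_total[]/mbm_local[] directly would then fail loudly, and visibly in review, if they were ever reached with soft RMIDs enabled.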
Reinette
Hi Reinette,
On Wed, May 17, 2023 at 2:06 AM Reinette Chatre reinette.chatre@intel.com wrote:
On 5/16/2023 7:49 AM, Peter Newman wrote:
On Thu, May 11, 2023 at 11:40 PM Reinette Chatre reinette.chatre@intel.com wrote:
I do not see any impact to the (soft) RMIDs that can be assigned to monitor groups, yet from what I understand a generic "RMID" is used as index to MBM state. Is this correct? A hardware RMID and software RMID would thus share the same MBM state. If this is correct I think we need to work on making the boundaries between hard and soft RMID more clear.
The only RMID-indexed state used by soft RMIDs right now is mbm_state::soft_rmid_bytes. The other aspect of the boundary is ensuring that nothing will access the hard RMID-specific state for a soft RMID.
The remainder of the mbm_state is only accessed by the software controller, which you suggested that I disable.
The arch_mbm_state is accessed only through resctrl_arch_rmid_read() and resctrl_arch_reset_rmid(), which are called by __mon_event_count() or the limbo handler.
__mon_event_count() is aware of soft RMIDs, so I would just need to ensure the software controller is disabled and never put any RMIDs on the limbo list. To be safe, I can also add WARN_ON_ONCE(rdt_mon_soft_rmid) to the rmid-indexing of the mbm_state arrays in the software controller and before the resctrl_arch_rmid_read() call in the limbo handler to catch if they're ever using soft RMIDs.
I understand and trust that you can ensure that this implementation is done safely. Please also consider how future changes to resctrl may stumble if there are not clear boundaries. You may be able to "ensure the software controller is disabled and never put any RMIDs on the limbo list", but consider if these rules will be clear to somebody who comes along in a year or more.
Documenting the data structures with these unique usages will help. Specific accessors can sometimes be useful to make it obvious in which state the data is being accessed and what data can be accessed. Using WARN as you suggest is a useful tool.
After studying the present usage of RMID values some more, I've concluded that I can cleanly move all knowledge of the soft RMID implementation to be within resctrl_arch_rmid_read() and that none of the FS-layer code should need to be aware of them. However, doing this would require James's patch to allow resctrl_arch_rmid_read() to block[1], since resctrl_arch_rmid_read() would be the first opportunity architecture-dependent code has to IPI the other CPUs in the domain.
The alternative to blocking in resctrl_arch_rmid_read() would be introducing an arch hook to mon_event_read(), where blocking can be done today without James's patches, so that architecture-dependent code can IPI all CPUs in the target domain to flush their event counts to memory before calling mon_event_count() to total their MBM event counts.
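A rough sketch of what that alternative might look like (entirely hypothetical: the hook name resctrl_arch_pre_event_read() and the wrapper are made up, while resctrl_mbm_flush_cpu() and rdt_mon_soft_rmid come from earlier in this series):

static void soft_rmid_flush_handler(void *unused)
{
	/* Fold this CPU's hard-RMID counts into the soft-RMID totals. */
	resctrl_mbm_flush_cpu();
}

/* Hypothetical hook, called from mon_event_read() before counts are summed. */
void resctrl_arch_pre_event_read(struct rdt_domain *d)
{
	if (!rdt_mon_soft_rmid)
		return;

	/*
	 * IPIs can be sent from here because mon_event_read() runs in
	 * process context, unlike the per-CPU counter-read path.
	 */
	on_each_cpu_mask(&d->cpu_mask, soft_rmid_flush_handler, NULL, true);
}

mon_event_read() would invoke the hook once per read request, so the subsequent per-RMID reads only need to look at the already-flushed soft totals.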
The remaining special case for soft RMIDs would be knowing that they should never go on the limbo list. Right now I've hard-coded the soft RMID read to always return 0 bytes for occupancy events, but this answer is only correct in the context of deciding whether RMIDs are dirty, so I have to prevent the events from being presented to the user. If returning an error wasn't considered "dirty", maybe that would work too.
Maybe the cleanest approach would be for enabling soft RMIDs to somehow cause is_llc_occupancy_enabled() to return false, but this is difficult as long as soft RMIDs are configured at mount time and rdt_mon_features is set at boot time. If soft RMIDs move completely into the arch layer, is it preferable to configure them with an rdt boot option instead of adding an architecture-dependent mount option? I recall James being opposed to adding a boot option for this.
Thanks! -Peter
[1] https://lore.kernel.org/lkml/20230525180209.19497-15-james.morse@arm.com/
On Fri, Apr 21, 2023 at 4:18 PM Peter Newman peternewman@google.com wrote:
 static void clear_closid_rmid(int cpu)
 {
 	struct resctrl_pqr_state *state = this_cpu_ptr(&pqr_state);

@@ -604,7 +636,12 @@ static void clear_closid_rmid(int cpu)
 	state->default_rmid = 0;
 	state->cur_closid = 0;
 	state->cur_rmid = 0;
-	wrmsr(MSR_IA32_PQR_ASSOC, 0, 0);
+	state->hw_rmid = 0;
+
+	if (static_branch_likely(&rdt_soft_rmid_enable_key))
+		state->hw_rmid = determine_hw_rmid_for_cpu(cpu);
clear_closid_rmid() isn't run at mount time, so hw_rmid will be uninitialized on any CPUs which were already enabled. The static key was originally set at boot.
(the consequence was that domain bandwidth was the amount recorded on the first CPU in the domain multiplied by the number of CPUs in the domain)
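One possible shape for a fix, sketched only to make the bug concrete (the helper names are hypothetical; determine_hw_rmid_for_cpu() and the per-CPU pqr_state come from this patch, and the assumption is that this would run under cpus_read_lock() from the mount path):

static void init_hw_rmid_handler(void *unused)
{
	struct resctrl_pqr_state *state = this_cpu_ptr(&pqr_state);

	state->hw_rmid = determine_hw_rmid_for_cpu(smp_processor_id());
	wrmsr(MSR_IA32_PQR_ASSOC, state->hw_rmid, state->cur_closid);
}

/* Initialize hw_rmid on CPUs that are already online when soft RMIDs are enabled. */
static void init_hw_rmids(void)
{
	on_each_cpu(init_hw_rmid_handler, NULL, 1);
}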
__mon_event_count() only reads the current software count and does not cause CPUs in the domain to flush. For mbm_update() to be effective in preventing overflow in hardware counters with soft RMIDs, it needs to flush the domain CPUs so that all of the HW RMIDs are read.
When RMIDs are soft, mbm_update() is intended to push bandwidth counts to the software counters rather than pulling the counts from hardware when userspace reads event counts, as this is a lot more efficient when the number of HW RMIDs is fixed.
When RMIDs are soft, mbm_update() only calls mbm_flush_cpu_handler() on each CPU in the domain rather than reading all RMIDs.
Signed-off-by: Peter Newman peternewman@google.com
---
 arch/x86/kernel/cpu/resctrl/monitor.c | 28 +++++++++++++++++++++++----
 1 file changed, 24 insertions(+), 4 deletions(-)
diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index 3d54a634471a..9575cb79b8ee 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -487,6 +487,11 @@ void resctrl_mbm_flush_cpu(void)
 		__mbm_flush(QOS_L3_MBM_TOTAL_EVENT_ID, r, d);
 }
+static void mbm_flush_cpu_handler(void *p)
+{
+	resctrl_mbm_flush_cpu();
+}
+
 static int __mon_event_count_soft_rmid(u32 rmid, struct rmid_read *rr)
 {
 	struct mbm_state *m;
@@ -806,12 +811,27 @@ void mbm_handle_overflow(struct work_struct *work)
 	r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
 	d = container_of(work, struct rdt_domain, mbm_over.work);
+	if (rdt_mon_soft_rmid) {
+		/*
+		 * HW RMIDs are permanently assigned to CPUs, so only a per-CPU
+		 * flush is needed.
+		 */
+		on_each_cpu_mask(&d->cpu_mask, mbm_flush_cpu_handler, NULL,
+				 false);
+	}
+
 	list_for_each_entry(prgrp, &rdt_all_groups, rdtgroup_list) {
-		mbm_update(r, d, prgrp->mon.rmid);
+		/*
+		 * mbm_update() on every RMID would result in excessive IPIs
+		 * when RMIDs are soft.
+		 */
+		if (!rdt_mon_soft_rmid) {
+			mbm_update(r, d, prgrp->mon.rmid);
-		head = &prgrp->mon.crdtgrp_list;
-		list_for_each_entry(crgrp, head, mon.crdtgrp_list)
-			mbm_update(r, d, crgrp->mon.rmid);
+			head = &prgrp->mon.crdtgrp_list;
+			list_for_each_entry(crgrp, head, mon.crdtgrp_list)
+				mbm_update(r, d, crgrp->mon.rmid);
+		}
 		if (is_mba_sc(NULL))
 			update_mba_bw(prgrp, d);
Hi Peter,
On 4/21/2023 7:17 AM, Peter Newman wrote:
@@ -806,12 +811,27 @@ void mbm_handle_overflow(struct work_struct *work)
 	r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
 	d = container_of(work, struct rdt_domain, mbm_over.work);
+	if (rdt_mon_soft_rmid) {
+		/*
+		 * HW RMIDs are permanently assigned to CPUs, so only a per-CPU
+		 * flush is needed.
+		 */
+		on_each_cpu_mask(&d->cpu_mask, mbm_flush_cpu_handler, NULL,
+				 false);
+	}
+
 	list_for_each_entry(prgrp, &rdt_all_groups, rdtgroup_list) {
-		mbm_update(r, d, prgrp->mon.rmid);
+		/*
+		 * mbm_update() on every RMID would result in excessive IPIs
+		 * when RMIDs are soft.
+		 */
+		if (!rdt_mon_soft_rmid) {
+			mbm_update(r, d, prgrp->mon.rmid);

-		head = &prgrp->mon.crdtgrp_list;
-		list_for_each_entry(crgrp, head, mon.crdtgrp_list)
-			mbm_update(r, d, crgrp->mon.rmid);
+			head = &prgrp->mon.crdtgrp_list;
+			list_for_each_entry(crgrp, head, mon.crdtgrp_list)
+				mbm_update(r, d, crgrp->mon.rmid);
+		}

 		if (is_mba_sc(NULL))
 			update_mba_bw(prgrp, d);
hmmm ... I think that update_mba_bw() relies on mbm_update() to call mbm_bw_count() to update the data it depends on. Keeping update_mba_bw() while dropping mbm_update() thus seems problematic. AMD does not support the software controller though so it may make things simpler if support for software RMIDs disables support for software controller (in a clear way).
Reinette
Hi Reinette,
On Thu, May 11, 2023 at 11:40 PM Reinette Chatre reinette.chatre@intel.com wrote:
On 4/21/2023 7:17 AM, Peter Newman wrote:
 	list_for_each_entry(prgrp, &rdt_all_groups, rdtgroup_list) {
-		mbm_update(r, d, prgrp->mon.rmid);
+		/*
+		 * mbm_update() on every RMID would result in excessive IPIs
+		 * when RMIDs are soft.
+		 */
+		if (!rdt_mon_soft_rmid) {
+			mbm_update(r, d, prgrp->mon.rmid);

-		head = &prgrp->mon.crdtgrp_list;
-		list_for_each_entry(crgrp, head, mon.crdtgrp_list)
-			mbm_update(r, d, crgrp->mon.rmid);
+			head = &prgrp->mon.crdtgrp_list;
+			list_for_each_entry(crgrp, head, mon.crdtgrp_list)
+				mbm_update(r, d, crgrp->mon.rmid);
+		}

 		if (is_mba_sc(NULL))
 			update_mba_bw(prgrp, d);
hmmm ... I think that update_mba_bw() relies on mbm_update() to call mbm_bw_count() to update the data it depends on. Keeping update_mba_bw() while dropping mbm_update() thus seems problematic. AMD does not support the software controller though so it may make things simpler if support for software RMIDs disables support for software controller (in a clear way).
I looked over this again and realized that the rationale for skipping mbm_update() in this patch is incorrect. __mon_event_count_soft_rmid() does not send any IPIs, so it's perfectly fine to call mbm_update() and update_mba_bw() when using soft RMIDs.
Even if we don't use the software controller on AMD, it seems conceptually cleaner to just allow soft and hard RMIDs to be used interchangeably wherever they work.
-Peter
On Fri, Apr 21, 2023 at 4:18 PM Peter Newman peternewman@google.com wrote:
__mon_event_count() only reads the current software count and does not cause CPUs in the domain to flush. For mbm_update() to be effective in preventing overflow in hardware counters with soft RMIDs, it needs to flush the domain CPUs so that all of the HW RMIDs are read.
When RMIDs are soft, mbm_update() is intended to push bandwidth counts to the software counters rather than pulling the counts from hardware when userspace reads event counts, as this is a lot more efficient when the number of HW RMIDs is fixed.
The low frequency at which the overflow handler runs would introduce too much error into bandwidth calculations, and running it more frequently, regardless of whether the user is requesting event count reads, would not be a good use of CPU time.
mon_event_read() needs to pull fresh event count values from hardware.
When RMIDs are soft, mbm_update() only calls mbm_flush_cpu_handler() on each CPU in the domain rather than reading all RMIDs.
I'll try going back to Stephane's original approach of rate-limiting how often domain CPUs need to be flushed and allowing the user to configure the time threshold. This will allow mbm_update() to read all of the RMIDs without triggering lots of redundant IPIs. (redundant because only the current RMID on each CPU can change when RMIDs are soft)
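A minimal sketch of that rate-limiting idea (hypothetical: the threshold variable, the helper, and the per-domain timestamp are made up; mbm_flush_cpu_handler() is the handler added in this patch):

static unsigned int mbm_flush_interval_ms = 100;	/* user-configurable */

static void maybe_flush_domain_cpus(struct rdt_domain *d)
{
	static unsigned long last_flush;	/* would be per-domain for real */

	/* Skip the IPIs if the domain was flushed recently enough. */
	if (time_before(jiffies,
			last_flush + msecs_to_jiffies(mbm_flush_interval_ms)))
		return;

	last_flush = jiffies;
	on_each_cpu_mask(&d->cpu_mask, mbm_flush_cpu_handler, NULL, true);
}

mbm_update() could then call this once per domain before reading RMIDs, so a burst of event reads only costs one round of IPIs.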
Add the 'mbm_soft_rmid' mount option to enable soft RMIDs.
This requires adding a mechanism for disabling a monitoring event at mount time to prevent the llc_occupancy event from being presented to the user.
Signed-off-by: Peter Newman peternewman@google.com
---
 arch/x86/kernel/cpu/resctrl/internal.h |  3 ++
 arch/x86/kernel/cpu/resctrl/rdtgroup.c | 51 ++++++++++++++++++++++++++
 2 files changed, 54 insertions(+)
diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index e6ff31a4dbc4..604e3d550601 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -59,6 +59,7 @@ struct rdt_fs_context {
 	bool enable_cdpl2;
 	bool enable_cdpl3;
 	bool enable_mba_mbps;
+	bool enable_mbm_soft_rmid;
 };
 static inline struct rdt_fs_context *rdt_fc2context(struct fs_context *fc)
@@ -76,12 +77,14 @@ DECLARE_STATIC_KEY_FALSE(rdt_mon_enable_key);
  * @evtid:		event id
  * @name:		name of the event
  * @configurable:	true if the event is configurable
+ * @enabled:		true if event is disabled
  * @list:		entry in &rdt_resource->evt_list
  */
 struct mon_evt {
 	enum resctrl_event_id	evtid;
 	char			*name;
 	bool			configurable;
+	bool			disabled;
 	struct list_head	list;
 };
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index c10f4798156a..c2abf69c2dcf 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -1013,6 +1013,8 @@ static int rdt_mon_features_show(struct kernfs_open_file *of,
 	struct mon_evt *mevt;
 	list_for_each_entry(mevt, &r->evt_list, list) {
+		if (mevt->disabled)
+			continue;
 		seq_printf(seq, "%s\n", mevt->name);
 		if (mevt->configurable)
 			seq_printf(seq, "%s_config\n", mevt->name);
@@ -2204,6 +2206,37 @@ static bool supports_mba_mbps(void)
 		r->alloc_capable && is_mba_linear());
 }
+static bool supports_mbm_soft_rmid(void)
+{
+	return is_mbm_enabled();
+}
+
+int set_mbm_soft_rmid(bool mbm_soft_rmid)
+{
+	struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
+	struct mon_evt *mevt = NULL;
+
+	/*
+	 * is_llc_occupancy_enabled() will always return false when disabling,
+	 * so search for the llc_occupancy event unconditionally.
+	 */
+	list_for_each_entry(mevt, &r->evt_list, list) {
+		if (strcmp(mevt->name, "llc_occupancy") == 0) {
+			mevt->disabled = mbm_soft_rmid;
+			break;
+		}
+	}
+
+	rdt_mon_soft_rmid = mbm_soft_rmid;
+
+	if (mbm_soft_rmid)
+		static_branch_enable_cpuslocked(&rdt_soft_rmid_enable_key);
+	else
+		static_branch_disable_cpuslocked(&rdt_soft_rmid_enable_key);
+
+	return 0;
+}
+
 /*
  * Enable or disable the MBA software controller
  * which helps user specify bandwidth in MBps.
@@ -2359,6 +2392,9 @@ static int rdt_enable_ctx(struct rdt_fs_context *ctx)
 	if (!ret && ctx->enable_mba_mbps)
 		ret = set_mba_sc(true);
+	if (!ret && ctx->enable_mbm_soft_rmid)
+		ret = set_mbm_soft_rmid(true);
+
 	return ret;
 }
@@ -2534,6 +2570,8 @@ static int rdt_get_tree(struct fs_context *fc)
 out_mba:
 	if (ctx->enable_mba_mbps)
 		set_mba_sc(false);
+	if (ctx->enable_mbm_soft_rmid)
+		set_mbm_soft_rmid(false);
 out_cdp:
 	cdp_disable_all();
 out:
@@ -2547,6 +2585,7 @@ enum rdt_param {
 	Opt_cdp,
 	Opt_cdpl2,
 	Opt_mba_mbps,
+	Opt_mbm_soft_rmid,
 	nr__rdt_params
 };
@@ -2554,6 +2593,7 @@ static const struct fs_parameter_spec rdt_fs_parameters[] = {
 	fsparam_flag("cdp", Opt_cdp),
 	fsparam_flag("cdpl2", Opt_cdpl2),
 	fsparam_flag("mba_MBps", Opt_mba_mbps),
+	fsparam_flag("mbm_soft_rmid", Opt_mbm_soft_rmid),
 	{}
 };
@@ -2579,6 +2619,11 @@ static int rdt_parse_param(struct fs_context *fc, struct fs_parameter *param)
 			return -EINVAL;
 		ctx->enable_mba_mbps = true;
 		return 0;
+	case Opt_mbm_soft_rmid:
+		if (!supports_mbm_soft_rmid())
+			return -EINVAL;
+		ctx->enable_mbm_soft_rmid = true;
+		return 0;
 	}
 	return -EINVAL;
@@ -2767,6 +2812,7 @@ static void rdt_kill_sb(struct super_block *sb)
 	cpus_read_lock();
 	mutex_lock(&rdtgroup_mutex);
+	set_mbm_soft_rmid(false);
 	set_mba_sc(false);
 	/*Put everything back to default values. */
@@ -2861,6 +2907,8 @@ static int mkdir_mondata_subdir(struct kernfs_node *parent_kn,
 	priv.u.rid = r->rid;
 	priv.u.domid = d->id;
 	list_for_each_entry(mevt, &r->evt_list, list) {
+		if (mevt->disabled)
+			continue;
 		priv.u.evtid = mevt->evtid;
 		ret = mon_addfile(kn, mevt->name, priv.priv);
 		if (ret)
@@ -3517,6 +3565,9 @@ static int rdtgroup_show_options(struct seq_file *seq, struct kernfs_root *kf)
 	if (is_mba_sc(&rdt_resources_all[RDT_RESOURCE_MBA].r_resctrl))
 		seq_puts(seq, ",mba_MBps");
+	if (static_branch_likely(&rdt_soft_rmid_enable_key))
+		seq_puts(seq, ",mbm_soft_rmid");
+
 	return 0;
 }
Hi Peter,
On 4/21/2023 7:17 AM, Peter Newman wrote:
Add the 'mbm_soft_rmid' mount option to enable soft RMIDs.
This requires adding a mechanism for disabling a monitoring event at mount time to prevent the llc_occupancy event from being presented to the user.
Signed-off-by: Peter Newman peternewman@google.com
 arch/x86/kernel/cpu/resctrl/internal.h |  3 ++
 arch/x86/kernel/cpu/resctrl/rdtgroup.c | 51 ++++++++++++++++++++++++++
 2 files changed, 54 insertions(+)
diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index e6ff31a4dbc4..604e3d550601 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -59,6 +59,7 @@ struct rdt_fs_context {
 	bool enable_cdpl2;
 	bool enable_cdpl3;
 	bool enable_mba_mbps;
+	bool enable_mbm_soft_rmid;
 };

 static inline struct rdt_fs_context *rdt_fc2context(struct fs_context *fc)
@@ -76,12 +77,14 @@ DECLARE_STATIC_KEY_FALSE(rdt_mon_enable_key);
  * @evtid:		event id
  * @name:		name of the event
  * @configurable:	true if the event is configurable
+ * @enabled:		true if event is disabled
  * @list:		entry in &rdt_resource->evt_list
  */
 struct mon_evt {
 	enum resctrl_event_id	evtid;
 	char			*name;
 	bool			configurable;
+	bool			disabled;
 	struct list_head	list;
 };
Note that the description documents a member named "enabled" while the member added is named "disabled".
Reinette