Expose resctrl monitoring data via a lightweight perf PMU.
Background: The kernel's initial cache-monitoring interface shipped via perf (commit 4afbb24ce5e7, 2015). That approach tied monitoring to tasks and cgroups. Later, cache control was designed around the resctrl filesystem to better match hardware semantics, and the incompatible perf CQM code was removed (commit c39a0e2c8850, 2017). This series implements a thin, generic perf PMU that _is_ compatible with resctrl.
Motivation: perf support enables measuring cache occupancy and memory bandwidth metrics on hrtimer (high resolution timer) interrupts via eBPF. Compared with polling from userspace, hrtimer-based reads remove scheduling jitter and context-switch overhead. Further, PMU reads can proceed in parallel, since the PMU read path need not take resctrl's rdtgroup_mutex. Parallelism and reduced jitter enable more accurate snapshots of cache occupancy and memory bandwidth. [1] has more details on the motivation and design.
Design: The "resctrl" PMU is a small adapter on top of resctrl's monitoring path:
- Event selection uses `attr.config` to pass an open `mon_data` fd (e.g. `mon_L3_00/llc_occupancy`).
- Events must be CPU-bound within the file's domain. Perf is responsible for ensuring the read executes on the bound CPU.
- Event init resolves and pins the rdtgroup, prepares struct rmid_read via mon_event_setup_read(), and validates that the bound CPU is in the file's domain CPU mask.
- Sampling is not supported; reads match the `mon_data` file contents.
- If the rdtgroup is deleted, reads return 0.
Includes a new selftest (tools/testing/selftests/resctrl/pmu_test.c) to validate the PMU event init path, and adds PMU testing to existing CMT tests.
Example usage (see Documentation/filesystems/resctrl.rst): Open a monitoring file and pass its fd in `perf_event_attr.config`, with `attr.type` set to the `resctrl` PMU type.
The patches are based on top of v6.18-rc1 (commit 3a8660878839).
[1] https://www.youtube.com/watch?v=4BGhAMJdZTc
Jonathan Perry (8):
  resctrl: Pin rdtgroup for mon_data file lifetime
  resctrl/mon: Split RMID read init from execution
  resctrl/mon: Select cpumask before invoking mon_event_read()
  resctrl/mon: Create mon_event_setup_read() helper
  resctrl: Propagate CPU mask validation error via rr->err
  resctrl/pmu: Introduce skeleton PMU and selftests
  resctrl/pmu: Use mon_event_setup_read() and validate CPU
  resctrl/pmu: Implement .read via direct RMID read; add LLC selftest
 Documentation/filesystems/resctrl.rst         |  64 ++++
 fs/resctrl/Makefile                           |   2 +-
 fs/resctrl/ctrlmondata.c                      | 118 ++++---
 fs/resctrl/internal.h                         |  24 +-
 fs/resctrl/monitor.c                          |   8 +-
 fs/resctrl/pmu.c                              | 217 +++++++++++++
 fs/resctrl/rdtgroup.c                         | 131 +++++++-
 tools/testing/selftests/resctrl/cache.c       |  94 +++++-
 tools/testing/selftests/resctrl/cmt_test.c    |  17 +-
 tools/testing/selftests/resctrl/pmu_test.c    | 292 ++++++++++++++++++
 tools/testing/selftests/resctrl/pmu_utils.c   |  32 ++
 tools/testing/selftests/resctrl/resctrl.h     |   4 +
 .../testing/selftests/resctrl/resctrl_tests.c |   1 +
 13 files changed, 948 insertions(+), 56 deletions(-)
 create mode 100644 fs/resctrl/pmu.c
 create mode 100644 tools/testing/selftests/resctrl/pmu_test.c
 create mode 100644 tools/testing/selftests/resctrl/pmu_utils.c
Add .open and .release handlers to mon_data kernfs files so a monitoring file holds a reference to its rdtgroup for the file's lifetime. Store the rdtgroup in of->priv on open and drop the reference on release. Provide rdtgroup_get()/rdtgroup_put() helpers.
This lets code that only has an open monitoring fd (e.g. the resctrl PMU event_init path) safely resolve the rdtgroup without having a kernfs active reference.
No functional change intended.
Signed-off-by: Jonathan Perry <yonch@yonch.com>
---
 fs/resctrl/internal.h |  2 ++
 fs/resctrl/rdtgroup.c | 62 +++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 64 insertions(+)
diff --git a/fs/resctrl/internal.h b/fs/resctrl/internal.h index cf1fd82dc5a9..63fb4d6c21a7 100644 --- a/fs/resctrl/internal.h +++ b/fs/resctrl/internal.h @@ -360,6 +360,8 @@ void resctrl_mon_resource_exit(void); void mon_event_count(void *info);
int rdtgroup_mondata_show(struct seq_file *m, void *arg); +int rdtgroup_mondata_open(struct kernfs_open_file *of); +void rdtgroup_mondata_release(struct kernfs_open_file *of);
void mon_event_read(struct rmid_read *rr, struct rdt_resource *r, struct rdt_mon_domain *d, struct rdtgroup *rdtgrp, diff --git a/fs/resctrl/rdtgroup.c b/fs/resctrl/rdtgroup.c index 0320360cd7a6..17b61dcfad07 100644 --- a/fs/resctrl/rdtgroup.c +++ b/fs/resctrl/rdtgroup.c @@ -332,6 +332,8 @@ static const struct kernfs_ops rdtgroup_kf_single_ops = { static const struct kernfs_ops kf_mondata_ops = { .atomic_write_len = PAGE_SIZE, .seq_show = rdtgroup_mondata_show, + .open = rdtgroup_mondata_open, + .release = rdtgroup_mondata_release, };
static bool is_cpu_list(struct kernfs_open_file *of) @@ -2512,12 +2514,26 @@ static struct rdtgroup *kernfs_to_rdtgroup(struct kernfs_node *kn) } }
+/* + * Convert a kernfs active reference to an rdtgroup reference. + */ static void rdtgroup_kn_get(struct rdtgroup *rdtgrp, struct kernfs_node *kn) { atomic_inc(&rdtgrp->waitcount); kernfs_break_active_protection(kn); }
+/* + * Take an additional rdtgroup reference; the caller must already hold one + */ +void rdtgroup_get(struct rdtgroup *rdtgrp) +{ + atomic_inc(&rdtgrp->waitcount); +} + +/* + * Drop an rdtgroup reference that was converted from a kernfs active reference + */ static void rdtgroup_kn_put(struct rdtgroup *rdtgrp, struct kernfs_node *kn) { if (atomic_dec_and_test(&rdtgrp->waitcount) && @@ -2532,6 +2548,20 @@ static void rdtgroup_kn_put(struct rdtgroup *rdtgrp, struct kernfs_node *kn) } }
+/* + * Decrement rdtgroup reference count + */ +void rdtgroup_put(struct rdtgroup *rdtgrp) +{ + if (atomic_dec_and_test(&rdtgrp->waitcount) && + (rdtgrp->flags & RDT_DELETED)) { + if (rdtgrp->mode == RDT_MODE_PSEUDO_LOCKSETUP || + rdtgrp->mode == RDT_MODE_PSEUDO_LOCKED) + rdtgroup_pseudo_lock_remove(rdtgrp); + rdtgroup_remove(rdtgrp); + } +} + struct rdtgroup *rdtgroup_kn_lock_live(struct kernfs_node *kn) { struct rdtgroup *rdtgrp = kernfs_to_rdtgroup(kn); @@ -3364,6 +3394,38 @@ static int mkdir_mondata_all(struct kernfs_node *parent_kn, return ret; }
+int rdtgroup_mondata_open(struct kernfs_open_file *of) +{ + struct rdtgroup *rdtgrp; + + rdtgrp = rdtgroup_kn_lock_live(of->kn); + if (!rdtgrp) { + rdtgroup_kn_unlock(of->kn); + return -ENOENT; + } + + /* + * resctrl relies on kernfs active references to guard access to struct + * rdtgroup from kernfs_open_file. Hold a reference in the file + * descriptor so perf_event_open() can retrieve the rdtgroup. + */ + rdtgroup_get(rdtgrp); + of->priv = rdtgrp; + + rdtgroup_kn_unlock(of->kn); + return 0; +} + +void rdtgroup_mondata_release(struct kernfs_open_file *of) +{ + struct rdtgroup *rdtgrp = of->priv; + + if (rdtgrp) { + rdtgroup_put(rdtgrp); + of->priv = NULL; + } +} + /** * cbm_ensure_valid - Enforce validity on provided CBM * @_val: Candidate CBM
Introduce rmid_read_init() to fill struct rmid_read (resource, domain, rdtgroup, event id, flags, ci). Change mon_event_read() to accept a prepared rmid_read and a CPU mask.
Update callers to use rmid_read_init() + mon_event_read().
This prepares reuse from contexts that pre-select the CPU (e.g. the perf PMU) without duplicating initialization logic.
No functional change intended.
Signed-off-by: Jonathan Perry <yonch@yonch.com>
---
 fs/resctrl/ctrlmondata.c | 40 ++++++++++++++++++++++------------------
 fs/resctrl/internal.h    |  5 +++--
 fs/resctrl/rdtgroup.c    |  6 ++++--
 3 files changed, 29 insertions(+), 22 deletions(-)
diff --git a/fs/resctrl/ctrlmondata.c b/fs/resctrl/ctrlmondata.c index 0d0ef54fc4de..82f8ad2b3053 100644 --- a/fs/resctrl/ctrlmondata.c +++ b/fs/resctrl/ctrlmondata.c @@ -546,28 +546,31 @@ struct rdt_domain_hdr *resctrl_find_domain(struct list_head *h, int id, return NULL; }
-void mon_event_read(struct rmid_read *rr, struct rdt_resource *r, +void rmid_read_init(struct rmid_read *rr, struct rdt_resource *r, struct rdt_mon_domain *d, struct rdtgroup *rdtgrp, - cpumask_t *cpumask, int evtid, int first) + int evtid, int first, struct cacheinfo *ci) { - int cpu; - - /* When picking a CPU from cpu_mask, ensure it can't race with cpuhp */ - lockdep_assert_cpus_held(); - - /* - * Setup the parameters to pass to mon_event_count() to read the data. - */ + memset(rr, 0, sizeof(*rr)); rr->rgrp = rdtgrp; rr->evtid = evtid; rr->r = r; rr->d = d; rr->first = first; + rr->ci = ci; if (resctrl_arch_mbm_cntr_assign_enabled(r) && - resctrl_is_mbm_event(evtid)) { + resctrl_is_mbm_event(evtid)) rr->is_mbm_cntr = true; - } else { - rr->arch_mon_ctx = resctrl_arch_mon_ctx_alloc(r, evtid); +} + +void mon_event_read(struct rmid_read *rr, cpumask_t *cpumask) +{ + int cpu; + + /* When picking a CPU from cpu_mask, ensure it can't race with cpuhp */ + lockdep_assert_cpus_held(); + + if (!rr->is_mbm_cntr) { + rr->arch_mon_ctx = resctrl_arch_mon_ctx_alloc(rr->r, rr->evtid); if (IS_ERR(rr->arch_mon_ctx)) { rr->err = -EINVAL; return; @@ -588,7 +591,7 @@ void mon_event_read(struct rmid_read *rr, struct rdt_resource *r, smp_call_on_cpu(cpu, smp_mon_event_count, rr, false);
if (rr->arch_mon_ctx) - resctrl_arch_mon_ctx_free(r, evtid, rr->arch_mon_ctx); + resctrl_arch_mon_ctx_free(rr->r, rr->evtid, rr->arch_mon_ctx); }
int rdtgroup_mondata_show(struct seq_file *m, void *arg) @@ -635,9 +638,9 @@ int rdtgroup_mondata_show(struct seq_file *m, void *arg) ci = get_cpu_cacheinfo_level(cpu, RESCTRL_L3_CACHE); if (!ci) continue; - rr.ci = ci; - mon_event_read(&rr, r, NULL, rdtgrp, - &ci->shared_cpu_map, evtid, false); + rmid_read_init(&rr, r, NULL, rdtgrp, + evtid, false, ci); + mon_event_read(&rr, &ci->shared_cpu_map); goto checkresult; } } @@ -654,7 +657,8 @@ int rdtgroup_mondata_show(struct seq_file *m, void *arg) goto out; } d = container_of(hdr, struct rdt_mon_domain, hdr); - mon_event_read(&rr, r, d, rdtgrp, &d->hdr.cpu_mask, evtid, false); + rmid_read_init(&rr, r, d, rdtgrp, evtid, false, NULL); + mon_event_read(&rr, &d->hdr.cpu_mask); }
checkresult: diff --git a/fs/resctrl/internal.h b/fs/resctrl/internal.h index 63fb4d6c21a7..dcc0b7bea3ac 100644 --- a/fs/resctrl/internal.h +++ b/fs/resctrl/internal.h @@ -363,9 +363,10 @@ int rdtgroup_mondata_show(struct seq_file *m, void *arg); int rdtgroup_mondata_open(struct kernfs_open_file *of); void rdtgroup_mondata_release(struct kernfs_open_file *of);
-void mon_event_read(struct rmid_read *rr, struct rdt_resource *r, +void rmid_read_init(struct rmid_read *rr, struct rdt_resource *r, struct rdt_mon_domain *d, struct rdtgroup *rdtgrp, - cpumask_t *cpumask, int evtid, int first); + int evtid, int first, struct cacheinfo *ci); +void mon_event_read(struct rmid_read *rr, cpumask_t *cpumask);
int resctrl_mon_resource_init(void);
diff --git a/fs/resctrl/rdtgroup.c b/fs/resctrl/rdtgroup.c index 17b61dcfad07..34337abe5345 100644 --- a/fs/resctrl/rdtgroup.c +++ b/fs/resctrl/rdtgroup.c @@ -3235,8 +3235,10 @@ static int mon_add_all_files(struct kernfs_node *kn, struct rdt_mon_domain *d, if (ret) return ret;
- if (!do_sum && resctrl_is_mbm_event(mevt->evtid)) - mon_event_read(&rr, r, d, prgrp, &d->hdr.cpu_mask, mevt->evtid, true); + if (!do_sum && resctrl_is_mbm_event(mevt->evtid)) { + rmid_read_init(&rr, r, d, prgrp, mevt->evtid, true, NULL); + mon_event_read(&rr, &d->hdr.cpu_mask); + } }
return 0;
Refactor rdtgroup_mondata_show() to pick the appropriate CPU mask first and then call mon_event_read() once.
No functional change intended.
Signed-off-by: Jonathan Perry <yonch@yonch.com>
---
 fs/resctrl/ctrlmondata.c | 12 +++++++-----
 1 file changed, 7 insertions(+), 5 deletions(-)
diff --git a/fs/resctrl/ctrlmondata.c b/fs/resctrl/ctrlmondata.c index 82f8ad2b3053..f28328c49479 100644 --- a/fs/resctrl/ctrlmondata.c +++ b/fs/resctrl/ctrlmondata.c @@ -607,6 +607,7 @@ int rdtgroup_mondata_show(struct seq_file *m, void *arg) struct rdt_resource *r; struct cacheinfo *ci; struct mon_data *md; + cpumask_t *cpumask;
rdtgrp = rdtgroup_kn_lock_live(of->kn); if (!rdtgrp) { @@ -639,9 +640,9 @@ int rdtgroup_mondata_show(struct seq_file *m, void *arg) if (!ci) continue; rmid_read_init(&rr, r, NULL, rdtgrp, - evtid, false, ci); - mon_event_read(&rr, &ci->shared_cpu_map); - goto checkresult; + evtid, false, ci); + cpumask = &ci->shared_cpu_map; + goto perform; } } ret = -ENOENT; @@ -658,10 +659,11 @@ int rdtgroup_mondata_show(struct seq_file *m, void *arg) } d = container_of(hdr, struct rdt_mon_domain, hdr); rmid_read_init(&rr, r, d, rdtgrp, evtid, false, NULL); - mon_event_read(&rr, &d->hdr.cpu_mask); + cpumask = &d->hdr.cpu_mask; }
-checkresult: +perform: + mon_event_read(&rr, cpumask);
/* * -ENOENT is a special case, set only when "mbm_event" counter assignment
Refactor the selection of the monitored event out of the kernfs seq_show handler into a helper function. This provides a single setup path that the resctrl PMU will reuse.
Add mon_event_setup_read() to encapsulate domain lookup, rmid_read_init(), and selection of the valid CPU mask for the read. Rework rdtgroup_mondata_show() to call the helper before reading.
No functional change intended.
Signed-off-by: Jonathan Perry <yonch@yonch.com>
---
 fs/resctrl/ctrlmondata.c | 71 ++++++++++++++++++++++------------------
 fs/resctrl/internal.h    |  2 ++
 2 files changed, 41 insertions(+), 32 deletions(-)
diff --git a/fs/resctrl/ctrlmondata.c b/fs/resctrl/ctrlmondata.c index f28328c49479..d1e4cf6f2128 100644 --- a/fs/resctrl/ctrlmondata.c +++ b/fs/resctrl/ctrlmondata.c @@ -594,32 +594,16 @@ void mon_event_read(struct rmid_read *rr, cpumask_t *cpumask) resctrl_arch_mon_ctx_free(rr->r, rr->evtid, rr->arch_mon_ctx); }
-int rdtgroup_mondata_show(struct seq_file *m, void *arg) +int mon_event_setup_read(struct rmid_read *rr, cpumask_t **cpumask, + struct mon_data *md, struct rdtgroup *rdtgrp) { - struct kernfs_open_file *of = m->private; enum resctrl_res_level resid; enum resctrl_event_id evtid; struct rdt_domain_hdr *hdr; - struct rmid_read rr = {0}; struct rdt_mon_domain *d; - struct rdtgroup *rdtgrp; - int domid, cpu, ret = 0; struct rdt_resource *r; struct cacheinfo *ci; - struct mon_data *md; - cpumask_t *cpumask; - - rdtgrp = rdtgroup_kn_lock_live(of->kn); - if (!rdtgrp) { - ret = -ENOENT; - goto out; - } - - md = of->kn->priv; - if (WARN_ON_ONCE(!md)) { - ret = -EIO; - goto out; - } + int domid, cpu;
resid = md->rid; domid = md->domid; @@ -639,30 +623,53 @@ int rdtgroup_mondata_show(struct seq_file *m, void *arg) ci = get_cpu_cacheinfo_level(cpu, RESCTRL_L3_CACHE); if (!ci) continue; - rmid_read_init(&rr, r, NULL, rdtgrp, - evtid, false, ci); - cpumask = &ci->shared_cpu_map; - goto perform; + rmid_read_init(rr, r, NULL, rdtgrp, + evtid, false, ci); + *cpumask = &ci->shared_cpu_map; + return 0; } } - ret = -ENOENT; - goto out; + return -ENOENT; } else { /* * This file provides data from a single domain. Search * the resource to find the domain with "domid". */ hdr = resctrl_find_domain(&r->mon_domains, domid, NULL); - if (!hdr || WARN_ON_ONCE(hdr->type != RESCTRL_MON_DOMAIN)) { - ret = -ENOENT; - goto out; - } + if (!hdr || WARN_ON_ONCE(hdr->type != RESCTRL_MON_DOMAIN)) + return -ENOENT; + d = container_of(hdr, struct rdt_mon_domain, hdr); - rmid_read_init(&rr, r, d, rdtgrp, evtid, false, NULL); - cpumask = &d->hdr.cpu_mask; + rmid_read_init(rr, r, d, rdtgrp, evtid, false, NULL); + *cpumask = &d->hdr.cpu_mask; + return 0; } +}
-perform: +int rdtgroup_mondata_show(struct seq_file *m, void *arg) +{ + struct kernfs_open_file *of = m->private; + struct rmid_read rr = {0}; + struct rdtgroup *rdtgrp; + int ret = 0; + struct mon_data *md; + cpumask_t *cpumask; + + rdtgrp = rdtgroup_kn_lock_live(of->kn); + if (!rdtgrp) { + ret = -ENOENT; + goto out; + } + + md = of->kn->priv; + if (WARN_ON_ONCE(!md)) { + ret = -EIO; + goto out; + } + + ret = mon_event_setup_read(&rr, &cpumask, md, rdtgrp); + if (ret) + goto out; mon_event_read(&rr, cpumask);
/* diff --git a/fs/resctrl/internal.h b/fs/resctrl/internal.h index dcc0b7bea3ac..486cbca8d0ec 100644 --- a/fs/resctrl/internal.h +++ b/fs/resctrl/internal.h @@ -366,6 +366,8 @@ void rdtgroup_mondata_release(struct kernfs_open_file *of); void rmid_read_init(struct rmid_read *rr, struct rdt_resource *r, struct rdt_mon_domain *d, struct rdtgroup *rdtgrp, int evtid, int first, struct cacheinfo *ci); +int mon_event_setup_read(struct rmid_read *rr, cpumask_t **cpumask, + struct mon_data *md, struct rdtgroup *rdtgrp); void mon_event_read(struct rmid_read *rr, cpumask_t *cpumask);
int resctrl_mon_resource_init(void);
When __mon_event_count() rejected a CPU because it was not in the domain's CPU mask (or did not match the cacheinfo domain), it returned -EINVAL but did not set rr->err. mon_event_count() then discarded the return value, which made such failures harder to diagnose.
Set rr->err = -EINVAL before returning in both validation checks so the error is visible to callers and can trigger WARN_ONCE() in the PMU .read path.
Signed-off-by: Jonathan Perry <yonch@yonch.com>
---
 fs/resctrl/monitor.c | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)
diff --git a/fs/resctrl/monitor.c b/fs/resctrl/monitor.c index 4076336fbba6..4d19c2ec823f 100644 --- a/fs/resctrl/monitor.c +++ b/fs/resctrl/monitor.c @@ -445,8 +445,10 @@ static int __mon_event_count(struct rdtgroup *rdtgrp, struct rmid_read *rr)
if (rr->d) { /* Reading a single domain, must be on a CPU in that domain. */ - if (!cpumask_test_cpu(cpu, &rr->d->hdr.cpu_mask)) + if (!cpumask_test_cpu(cpu, &rr->d->hdr.cpu_mask)) { + rr->err = -EINVAL; return -EINVAL; + } if (rr->is_mbm_cntr) rr->err = resctrl_arch_cntr_read(rr->r, rr->d, closid, rmid, cntr_id, rr->evtid, &tval); @@ -462,8 +464,10 @@ static int __mon_event_count(struct rdtgroup *rdtgrp, struct rmid_read *rr) }
/* Summing domains that share a cache, must be on a CPU for that cache. */ - if (!cpumask_test_cpu(cpu, &rr->ci->shared_cpu_map)) + if (!cpumask_test_cpu(cpu, &rr->ci->shared_cpu_map)) { + rr->err = -EINVAL; return -EINVAL; + }
/* * Legacy files must report the sum of an event across all
Register a read-only "resctrl" PMU and implement minimal perf hooks (event_init, add, del, start, stop, read, destroy). The PMU accepts a resctrl monitoring file descriptor via attr.config, resolves the rdtgroup, and pins it for the event's lifetime.
Call PMU init/exit in resctrl_init()/resctrl_exit().
Add a selftest to exercise PMU registration and verify that only allowed monitoring files can be opened via perf.
Signed-off-by: Jonathan Perry <yonch@yonch.com>
---
 fs/resctrl/Makefile                           |   2 +-
 fs/resctrl/internal.h                         |  12 ++
 fs/resctrl/pmu.c                              | 139 ++++++++++++
 fs/resctrl/rdtgroup.c                         |  53 +++++
 tools/testing/selftests/resctrl/pmu_test.c    | 202 ++++++++++++++++++
 tools/testing/selftests/resctrl/resctrl.h     |   1 +
 .../testing/selftests/resctrl/resctrl_tests.c |   1 +
 7 files changed, 409 insertions(+), 1 deletion(-)
 create mode 100644 fs/resctrl/pmu.c
 create mode 100644 tools/testing/selftests/resctrl/pmu_test.c
diff --git a/fs/resctrl/Makefile b/fs/resctrl/Makefile index e67f34d2236a..f738b0165ccc 100644 --- a/fs/resctrl/Makefile +++ b/fs/resctrl/Makefile @@ -1,5 +1,5 @@ # SPDX-License-Identifier: GPL-2.0 -obj-$(CONFIG_RESCTRL_FS) += rdtgroup.o ctrlmondata.o monitor.o +obj-$(CONFIG_RESCTRL_FS) += rdtgroup.o ctrlmondata.o monitor.o pmu.o obj-$(CONFIG_RESCTRL_FS_PSEUDO_LOCK) += pseudo_lock.o
# To allow define_trace.h's recursive include: diff --git a/fs/resctrl/internal.h b/fs/resctrl/internal.h index 486cbca8d0ec..b42c625569a8 100644 --- a/fs/resctrl/internal.h +++ b/fs/resctrl/internal.h @@ -4,6 +4,7 @@
#include <linux/resctrl.h> #include <linux/kernfs.h> +#include <linux/fs.h> #include <linux/fs_context.h> #include <linux/tick.h>
@@ -362,6 +363,17 @@ void mon_event_count(void *info); int rdtgroup_mondata_show(struct seq_file *m, void *arg); int rdtgroup_mondata_open(struct kernfs_open_file *of); void rdtgroup_mondata_release(struct kernfs_open_file *of); +void rdtgroup_get(struct rdtgroup *rdtgrp); +void rdtgroup_put(struct rdtgroup *rdtgrp); + +/* PMU support */ +/* + * Get rdtgroup from a resctrl monitoring file and take a reference. + * Returns a valid pointer with an extra reference on success, or ERR_PTR on failure. + */ +struct rdtgroup *rdtgroup_get_from_file(struct file *file); +int resctrl_pmu_init(void); +void resctrl_pmu_exit(void);
void rmid_read_init(struct rmid_read *rr, struct rdt_resource *r, struct rdt_mon_domain *d, struct rdtgroup *rdtgrp, diff --git a/fs/resctrl/pmu.c b/fs/resctrl/pmu.c new file mode 100644 index 000000000000..e7915a0a3520 --- /dev/null +++ b/fs/resctrl/pmu.c @@ -0,0 +1,139 @@ +// SPDX-License-Identifier: GPL-2.0-only +/* + * Perf event access to resctrl monitoring (cache occupancy, memory bandwidth) + */ + +#define pr_fmt(fmt) "resctrl_pmu: " fmt + +#include <linux/kernel.h> +#include <linux/perf_event.h> +#include <linux/errno.h> +#include <linux/file.h> +#include <linux/slab.h> +#include <linux/err.h> +#include <linux/seq_file.h> +#include "internal.h" + +static struct pmu resctrl_pmu; + +/* + * Event private data - stores information about the monitored resctrl group + */ +struct resctrl_pmu_event { + struct rdtgroup *rdtgrp; /* Reference to rdtgroup being monitored */ +}; + +static void resctrl_event_destroy(struct perf_event *event); + +/* + * Initialize a new resctrl perf event + * The config field contains the file descriptor of the monitoring file + */ +static int resctrl_event_init(struct perf_event *event) +{ + struct resctrl_pmu_event *resctrl_event; + struct file *file; + struct rdtgroup *rdtgrp; + int fd; + int ret; + + fd = (int)event->attr.config; + if (fd < 0) + return -EINVAL; + + file = fget(fd); + if (!file) + return -EBADF; + + /* Resolve rdtgroup from the monitoring file and take a reference */ + rdtgrp = rdtgroup_get_from_file(file); + fput(file); + if (IS_ERR(rdtgrp)) + return PTR_ERR(rdtgrp); + + resctrl_event = kzalloc(sizeof(*resctrl_event), GFP_KERNEL); + if (!resctrl_event) { + rdtgroup_put(rdtgrp); + return -ENOMEM; + } + + resctrl_event->rdtgrp = rdtgrp; + event->pmu_private = resctrl_event; + event->destroy = resctrl_event_destroy; + + return 0; +} + +static void resctrl_event_destroy(struct perf_event *event) +{ + struct resctrl_pmu_event *resctrl_event = event->pmu_private; + + if (resctrl_event) { + struct rdtgroup *rdtgrp = 
resctrl_event->rdtgrp; + + if (rdtgrp) + rdtgroup_put(rdtgrp); + + kfree(resctrl_event); + event->pmu_private = NULL; + } +} + +static void resctrl_event_update(struct perf_event *event) +{ + /* Currently just a stub - would read actual cache occupancy here */ + local64_set(&event->hw.prev_count, 0); +} + +static void resctrl_event_start(struct perf_event *event, int flags) +{ + resctrl_event_update(event); +} + +static void resctrl_event_stop(struct perf_event *event, int flags) +{ + if (flags & PERF_EF_UPDATE) + resctrl_event_update(event); +} + +static int resctrl_event_add(struct perf_event *event, int flags) +{ + if (flags & PERF_EF_START) + resctrl_event_start(event, flags); + + return 0; +} + +static void resctrl_event_del(struct perf_event *event, int flags) +{ + resctrl_event_stop(event, PERF_EF_UPDATE); +} + +static struct pmu resctrl_pmu = { + .task_ctx_nr = perf_invalid_context, + .event_init = resctrl_event_init, + .add = resctrl_event_add, + .del = resctrl_event_del, + .start = resctrl_event_start, + .stop = resctrl_event_stop, + .read = resctrl_event_update, + .capabilities = PERF_PMU_CAP_NO_INTERRUPT | PERF_PMU_CAP_NO_EXCLUDE, +}; + +int resctrl_pmu_init(void) +{ + int ret; + + ret = perf_pmu_register(&resctrl_pmu, "resctrl", -1); + if (ret) { + pr_err("Failed to register resctrl PMU: %d\n", ret); + return ret; + } + + return 0; +} + +void resctrl_pmu_exit(void) +{ + perf_pmu_unregister(&resctrl_pmu); +} diff --git a/fs/resctrl/rdtgroup.c b/fs/resctrl/rdtgroup.c index 34337abe5345..4f4139edafbf 100644 --- a/fs/resctrl/rdtgroup.c +++ b/fs/resctrl/rdtgroup.c @@ -3428,6 +3428,53 @@ void rdtgroup_mondata_release(struct kernfs_open_file *of) } }
+/* + * rdtgroup_get_from_file - Resolve rdtgroup from a resctrl mon data file + * @file: struct file opened on a resctrl monitoring data file + * + * Validate that @file belongs to resctrl and refers to a monitoring data + * file (kf_mondata_ops). Then, using the kernfs_open_file stored in the + * seq_file, safely fetch the rdtgroup that was pinned at open time and take + * an additional rdtgroup reference for the caller under rdtgroup_mutex. + * + * Returns: rdtgroup* with an extra reference on success; ERR_PTR on failure. + */ +struct rdtgroup *rdtgroup_get_from_file(struct file *file) +{ + struct rdtgroup *rdtgrp = NULL; + struct kernfs_open_file *of; + struct seq_file *seq; + struct inode *inode; + + if (!file) + return ERR_PTR(-EBADF); + + inode = file_inode(file); + /* Check the file is part of the resctrl filesystem */ + if (!inode || !inode->i_sb || inode->i_sb->s_type != &rdt_fs_type) + return ERR_PTR(-EINVAL); + + /* kernfs monitoring files use seq_file; seq_file->private is kernfs_open_file */ + seq = (struct seq_file *)file->private_data; + if (!seq) + return ERR_PTR(-EINVAL); + + of = (struct kernfs_open_file *)seq->private; + /* Check this is a monitoring file */ + if (!of || !of->kn || of->kn->attr.ops != &kf_mondata_ops) + return ERR_PTR(-EINVAL); + + /* Hold rdtgroup_mutex to prevent race with release callback */ + guard(mutex)(&rdtgroup_mutex); + + rdtgrp = of->priv; + if (!rdtgrp || (rdtgrp->flags & RDT_DELETED)) + return ERR_PTR(-ENOENT); + + rdtgroup_get(rdtgrp); + return rdtgrp; +} + /** * cbm_ensure_valid - Enforce validity on provided CBM * @_val: Candidate CBM @@ -4509,6 +4556,10 @@ int resctrl_init(void) */ debugfs_resctrl = debugfs_create_dir("resctrl", NULL);
+ ret = resctrl_pmu_init(); + if (ret) + pr_warn("Failed to initialize resctrl PMU: %d\n", ret); + return 0;
cleanup_mountpoint: @@ -4558,6 +4609,8 @@ static bool resctrl_online_domains_exist(void) */ void resctrl_exit(void) { + resctrl_pmu_exit(); + cpus_read_lock(); WARN_ON_ONCE(resctrl_online_domains_exist());
diff --git a/tools/testing/selftests/resctrl/pmu_test.c b/tools/testing/selftests/resctrl/pmu_test.c new file mode 100644 index 000000000000..29a0ac329619 --- /dev/null +++ b/tools/testing/selftests/resctrl/pmu_test.c @@ -0,0 +1,202 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * Resctrl PMU test + * + * Test program to verify the resctrl PMU functionality. + * Walks resctrl filesystem and verifies only allowed files can be + * used with the resctrl PMU via perf_event_open. + */ + +#include "resctrl.h" +#include <fcntl.h> +#include <dirent.h> + +#define RESCTRL_PMU_NAME "resctrl" + +static int find_pmu_type(const char *pmu_name) +{ + char path[256]; + FILE *file; + int type; + + snprintf(path, sizeof(path), "/sys/bus/event_source/devices/%s/type", + pmu_name); + + file = fopen(path, "r"); + if (!file) { + ksft_print_msg("Failed to open %s: %s\n", path, + strerror(errno)); + return -1; + } + + if (fscanf(file, "%d", &type) != 1) { + ksft_print_msg("Failed to read PMU type from %s\n", path); + fclose(file); + return -1; + } + + fclose(file); + return type; +} + +static bool is_allowed_file(const char *filename) +{ + const char *base; + + /* Only exact llc_occupancy and mbm files (no *_config) are allowed */ + base = strrchr(filename, '/'); + base = base ? 
base + 1 : filename; + + return (!strcmp(base, "llc_occupancy") || + !strcmp(base, "mbm_total_bytes") || + !strcmp(base, "mbm_local_bytes")); +} + +static int test_file_safety(int pmu_type, const char *filepath) +{ + struct perf_event_attr pe = { 0 }; + int fd, perf_fd; + bool should_succeed; + + /* Try to open the file */ + fd = open(filepath, O_RDONLY); + if (fd < 0) { + /* File couldn't be opened, skip it */ + return 0; + } + + should_succeed = is_allowed_file(filepath); + + /* Setup perf event attributes */ + pe.type = pmu_type; + pe.config = fd; + pe.size = sizeof(pe); + pe.disabled = 1; + pe.exclude_kernel = 0; + pe.exclude_hv = 0; + + /* Try to open the perf event */ + perf_fd = perf_event_open(&pe, -1, 0, -1, 0); + + if (should_succeed) { + if (perf_fd < 0) { + ksft_print_msg("FAIL: unexpected - perf_event_open failed for %s: %s\n", + filepath, strerror(errno)); + close(fd); + return -1; + } + ksft_print_msg("PASS: Allowed file %s successfully opened perf event\n", + filepath); + close(perf_fd); + } else { + if (perf_fd >= 0) { + ksft_print_msg("FAIL: unexpected - perf_event_open succeeded for %s\n", + filepath); + close(perf_fd); + close(fd); + return -1; + } + ksft_print_msg("PASS: Blocked file %s correctly failed perf_event_open: %s\n", + filepath, strerror(errno)); + } + +out: + close(fd); + return 0; +} + +static int walk_directory_recursive(int pmu_type, const char *dir_path) +{ + DIR *dir; + struct dirent *entry; + char full_path[1024]; + struct stat statbuf; + int ret = 0; + + dir = opendir(dir_path); + if (!dir) { + ksft_print_msg("Failed to open directory %s: %s\n", dir_path, + strerror(errno)); + return -1; + } + + while ((entry = readdir(dir)) != NULL) { + /* Skip . and .. 
*/ + if (strcmp(entry->d_name, ".") == 0 || + strcmp(entry->d_name, "..") == 0) + continue; + + snprintf(full_path, sizeof(full_path), "%s/%s", dir_path, + entry->d_name); + + if (stat(full_path, &statbuf) != 0) { + ksft_print_msg("Failed to stat %s: %s\n", full_path, + strerror(errno)); + continue; + } + + if (S_ISDIR(statbuf.st_mode)) { + /* Recursively walk subdirectories */ + if (walk_directory_recursive(pmu_type, full_path) != 0) + ret = -1; + } else if (S_ISREG(statbuf.st_mode)) { + /* Test regular files */ + if (test_file_safety(pmu_type, full_path) != 0) + ret = -1; + } + } + + closedir(dir); + return ret; +} + +static int test_resctrl_pmu_safety(int pmu_type) +{ + ksft_print_msg("Testing resctrl PMU safety - walking all files in %s\n", + RESCTRL_PATH); + + /* Walk through all files and directories in /sys/fs/resctrl */ + return walk_directory_recursive(pmu_type, RESCTRL_PATH); +} + +static bool pmu_feature_check(const struct resctrl_test *test) +{ + return resctrl_mon_feature_exists("L3_MON", "llc_occupancy"); +} + +static int pmu_run_test(const struct resctrl_test *test, + const struct user_params *uparams) +{ + int pmu_type, ret; + + ksft_print_msg("Testing resctrl PMU file access safety\n"); + + /* Find the resctrl PMU type */ + pmu_type = find_pmu_type(RESCTRL_PMU_NAME); + if (pmu_type < 0) { + ksft_print_msg("Resctrl PMU not found - PMU is not registered?\n"); + return -1; + } + + ksft_print_msg("Found resctrl PMU with type: %d\n", pmu_type); + + /* Run the safety test to ensure only appropriate files work */ + ret = test_resctrl_pmu_safety(pmu_type); + + if (ret == 0) + ksft_print_msg("Resctrl PMU safety test completed successfully\n"); + else + ksft_print_msg("Resctrl PMU safety test failed\n"); + + return ret; +} + +struct resctrl_test pmu_test = { + .name = "PMU", + .group = "pmu", + .resource = "L3", + .vendor_specific = 0, + .feature_check = pmu_feature_check, + .run_test = pmu_run_test, + .cleanup = NULL, +}; diff --git 
a/tools/testing/selftests/resctrl/resctrl.h b/tools/testing/selftests/resctrl/resctrl.h index cd3adfc14969..5b0e6074eaba 100644 --- a/tools/testing/selftests/resctrl/resctrl.h +++ b/tools/testing/selftests/resctrl/resctrl.h @@ -244,5 +244,6 @@ extern struct resctrl_test cmt_test; extern struct resctrl_test l3_cat_test; extern struct resctrl_test l3_noncont_cat_test; extern struct resctrl_test l2_noncont_cat_test; +extern struct resctrl_test pmu_test;
#endif /* RESCTRL_H */ diff --git a/tools/testing/selftests/resctrl/resctrl_tests.c b/tools/testing/selftests/resctrl/resctrl_tests.c index 5154ffd821c4..11ba9000e015 100644 --- a/tools/testing/selftests/resctrl/resctrl_tests.c +++ b/tools/testing/selftests/resctrl/resctrl_tests.c @@ -21,6 +21,7 @@ static struct resctrl_test *resctrl_tests[] = { &l3_cat_test, &l3_noncont_cat_test, &l2_noncont_cat_test, + &pmu_test, };
static int detect_vendor(void)
During event_init, extract mon_data from the monitoring file and call mon_event_setup_read() to prepare rmid_read and the valid CPU mask for that file. Require a CPU-bound event and verify the bound CPU is in the mask.
Store the prepared rmid_read and CPU mask in the event private data along with the pinned rdtgroup.
Split the helper that gets the pinned rdtgroup into two, so event_init can get the mon_data from the kernfs_open_file: - rdtgroup_get_mondata_open_file() gets the kernfs_open_file from the file - rdtgroup_get_from_mondata_file() gets the pinned rdtgroup from the kernfs_open_file
Extend the selftest to test CPU validation and verify that pid-bound events are rejected.
Signed-off-by: Jonathan Perry <yonch@yonch.com> --- fs/resctrl/internal.h | 2 + fs/resctrl/pmu.c | 59 +++++++- fs/resctrl/rdtgroup.c | 24 ++- tools/testing/selftests/resctrl/pmu_test.c | 164 ++++++++++++++++++--- 4 files changed, 214 insertions(+), 35 deletions(-)
diff --git a/fs/resctrl/internal.h b/fs/resctrl/internal.h index b42c625569a8..8cc3a3747c2f 100644 --- a/fs/resctrl/internal.h +++ b/fs/resctrl/internal.h @@ -365,6 +365,8 @@ int rdtgroup_mondata_open(struct kernfs_open_file *of); void rdtgroup_mondata_release(struct kernfs_open_file *of); void rdtgroup_get(struct rdtgroup *rdtgrp); void rdtgroup_put(struct rdtgroup *rdtgrp); +struct kernfs_open_file *rdtgroup_get_mondata_open_file(struct file *file); +struct rdtgroup *rdtgroup_get_from_mondata_file(struct kernfs_open_file *of);
/* PMU support */ /* diff --git a/fs/resctrl/pmu.c b/fs/resctrl/pmu.c index e7915a0a3520..bdca0b3a5b0b 100644 --- a/fs/resctrl/pmu.c +++ b/fs/resctrl/pmu.c @@ -12,6 +12,7 @@ #include <linux/slab.h> #include <linux/err.h> #include <linux/seq_file.h> +#include <linux/cpu.h> #include "internal.h"
static struct pmu resctrl_pmu; @@ -21,6 +22,8 @@ static struct pmu resctrl_pmu; */ struct resctrl_pmu_event { struct rdtgroup *rdtgrp; /* Reference to rdtgroup being monitored */ + struct rmid_read rr; /* RMID read setup for monitoring */ + cpumask_t *cpumask; /* Valid CPUs for this monitoring file */ };
static void resctrl_event_destroy(struct perf_event *event); @@ -34,9 +37,16 @@ static int resctrl_event_init(struct perf_event *event) struct resctrl_pmu_event *resctrl_event; struct file *file; struct rdtgroup *rdtgrp; + struct kernfs_open_file *of; + struct mon_data *md; + struct rmid_read rr = {0}; + cpumask_t *cpumask; int fd; int ret;
+ if (event->cpu < 0) + return -EINVAL; + fd = (int)event->attr.config; if (fd < 0) return -EINVAL; @@ -45,11 +55,46 @@ static int resctrl_event_init(struct perf_event *event) if (!file) return -EBADF;
- /* Resolve rdtgroup from the monitoring file and take a reference */ - rdtgrp = rdtgroup_get_from_file(file); + of = rdtgroup_get_mondata_open_file(file); + if (IS_ERR(of)) { + ret = PTR_ERR(of); + goto out_fput; + } + + /* Extract mon_data which specifies which resource to measure */ + if (!of->kn || !of->kn->priv) { + ret = -EIO; + goto out_fput; + } + md = of->kn->priv; + + rdtgrp = rdtgroup_get_from_mondata_file(of); + if (IS_ERR(rdtgrp)) { + ret = PTR_ERR(rdtgrp); + goto out_fput; + } + fput(file); - if (IS_ERR(rdtgrp)) - return PTR_ERR(rdtgrp); + file = NULL; + + cpus_read_lock(); + + ret = mon_event_setup_read(&rr, &cpumask, md, rdtgrp); + if (ret) { + cpus_read_unlock(); + rdtgroup_put(rdtgrp); + return ret; + } + + /* Validate that the requested CPU is in the valid CPU mask for this monitoring file */ + if (!cpumask_test_cpu(event->cpu, cpumask)) { + ret = -EINVAL; + cpus_read_unlock(); + rdtgroup_put(rdtgrp); + return ret; + } + + cpus_read_unlock();
resctrl_event = kzalloc(sizeof(*resctrl_event), GFP_KERNEL); if (!resctrl_event) { @@ -58,10 +103,16 @@ static int resctrl_event_init(struct perf_event *event) }
resctrl_event->rdtgrp = rdtgrp; + resctrl_event->rr = rr; + resctrl_event->cpumask = cpumask; event->pmu_private = resctrl_event; event->destroy = resctrl_event_destroy;
return 0; + +out_fput: + fput(file); + return ret; }
static void resctrl_event_destroy(struct perf_event *event) diff --git a/fs/resctrl/rdtgroup.c b/fs/resctrl/rdtgroup.c index 4f4139edafbf..ed7d9feccd94 100644 --- a/fs/resctrl/rdtgroup.c +++ b/fs/resctrl/rdtgroup.c @@ -3429,19 +3429,16 @@ void rdtgroup_mondata_release(struct kernfs_open_file *of) }
/* - * rdtgroup_get_from_file - Resolve rdtgroup from a resctrl mon data file + * Resolve kernfs_open_file from a resctrl mon data file. * @file: struct file opened on a resctrl monitoring data file * * Validate that @file belongs to resctrl and refers to a monitoring data - * file (kf_mondata_ops). Then, using the kernfs_open_file stored in the - * seq_file, safely fetch the rdtgroup that was pinned at open time and take - * an additional rdtgroup reference for the caller under rdtgroup_mutex. + * file (kf_mondata_ops). * - * Returns: rdtgroup* with an extra reference on success; ERR_PTR on failure. + * Returns: kernfs_open_file* on success; ERR_PTR on failure. */ -struct rdtgroup *rdtgroup_get_from_file(struct file *file) +struct kernfs_open_file *rdtgroup_get_mondata_open_file(struct file *file) { - struct rdtgroup *rdtgrp = NULL; struct kernfs_open_file *of; struct seq_file *seq; struct inode *inode; @@ -3464,6 +3461,19 @@ struct rdtgroup *rdtgroup_get_from_file(struct file *file) if (!of || !of->kn || of->kn->attr.ops != &kf_mondata_ops) return ERR_PTR(-EINVAL);
+ return of; +} + +/* + * Get rdtgroup from a resctrl mon data open file. + * @of: kernfs_open_file opened on a resctrl monitoring data file + * + * Returns: rdtgroup* with an extra reference on success; ERR_PTR on failure. + */ +struct rdtgroup *rdtgroup_get_from_mondata_file(struct kernfs_open_file *of) +{ + struct rdtgroup *rdtgrp = NULL; + /* Hold rdtgroup_mutex to prevent race with release callback */ guard(mutex)(&rdtgroup_mutex);
diff --git a/tools/testing/selftests/resctrl/pmu_test.c b/tools/testing/selftests/resctrl/pmu_test.c index 29a0ac329619..fb3eec721e43 100644 --- a/tools/testing/selftests/resctrl/pmu_test.c +++ b/tools/testing/selftests/resctrl/pmu_test.c @@ -3,13 +3,16 @@ * Resctrl PMU test * * Test program to verify the resctrl PMU functionality. - * Walks resctrl filesystem and verifies only allowed files can be - * used with the resctrl PMU via perf_event_open. + * Walks resctrl filesystem and verifies only allowed monitoring files + * can be used with the resctrl PMU via perf_event_open when pinned to + * CPUs in the correct L3 domain. Also validates that PID-bound events + * are rejected for all files. */
#include "resctrl.h" #include <fcntl.h> #include <dirent.h> +#include <unistd.h>
#define RESCTRL_PMU_NAME "resctrl"
@@ -52,11 +55,51 @@ static bool is_allowed_file(const char *filename) !strcmp(base, "mbm_local_bytes")); }
+/* Extract base filename from a path */ +static const char *base_name(const char *path) +{ + const char *slash = strrchr(path, '/'); + + return slash ? slash + 1 : path; +} + +/* Parse mon_L3_XX ID from a monitoring path. Returns true on success. */ +static bool parse_l3_id_from_path(const char *path, int *l3_id) +{ + const char *needle = "mon_data/mon_L3_"; + const char *p = strstr(path, needle); + char *endptr; + long id; + + if (!p) + return false; + + p += strlen(needle); + + if (!isdigit((unsigned char)*p)) + return false; + + errno = 0; + id = strtol(p, &endptr, 10); + if (errno || endptr == p) + return false; + + /* Accept only non-negative IDs */ + if (id < 0) + return false; + + *l3_id = (int)id; + return true; +} + static int test_file_safety(int pmu_type, const char *filepath) { struct perf_event_attr pe = { 0 }; int fd, perf_fd; - bool should_succeed; + bool is_monitoring = false; + int file_l3_id = -1; + int ret = 0; + const char *fname = base_name(filepath);
/* Try to open the file */ fd = open(filepath, O_RDONLY); @@ -65,7 +108,8 @@ static int test_file_safety(int pmu_type, const char *filepath) return 0; }
- should_succeed = is_allowed_file(filepath); + /* Determine if this is a monitoring file under mon_L3_XX and allowed */ + is_monitoring = (is_allowed_file(fname) && parse_l3_id_from_path(filepath, &file_l3_id));
/* Setup perf event attributes */ pe.type = pmu_type; @@ -75,34 +119,106 @@ static int test_file_safety(int pmu_type, const char *filepath) pe.exclude_kernel = 0; pe.exclude_hv = 0;
- /* Try to open the perf event */ - perf_fd = perf_event_open(&pe, -1, 0, -1, 0); + /* PID-bound negative attempt: should fail for all files */ + perf_fd = perf_event_open(&pe, getpid(), -1, -1, 0); + if (perf_fd >= 0) { + ksft_print_msg("FAIL: pid-bound perf_event_open unexpectedly succeeded for %s\n", + filepath); + close(perf_fd); + close(fd); + return -1; + } + + int success_count = 0; + cpu_set_t mask; + int max_cpus, nconf; + + CPU_ZERO(&mask); + if (sched_getaffinity(0, sizeof(mask), &mask)) { + ksft_perror("sched_getaffinity failed"); + goto out; + } + + nconf = (int)sysconf(_SC_NPROCESSORS_CONF); + max_cpus = (nconf > 0 && nconf < CPU_SETSIZE) ? nconf : CPU_SETSIZE; + + for (int cpu = 0; cpu < max_cpus; cpu++) { + int cpu_l3; + + if (!CPU_ISSET(cpu, &mask)) + continue; + + if (get_domain_id("L3", cpu, &cpu_l3) < 0) { + ksft_print_msg("Failed to get L3 domain ID for CPU %d\n", cpu); + ret = -1; + break; + }
- if (should_succeed) { - if (perf_fd < 0) { - ksft_print_msg("FAIL: unexpected - perf_event_open failed for %s: %s\n", - filepath, strerror(errno)); - close(fd); - return -1; + perf_fd = perf_event_open(&pe, -1, cpu, -1, 0); + + if (is_monitoring) { + bool expected_ok = (cpu_l3 == file_l3_id); + + if (expected_ok) { + if (perf_fd < 0) { + ksft_print_msg("FAIL: %s CPU %d (L3=%d) expected success, got %s\n", + filepath, cpu, cpu_l3, strerror(errno)); + ret = -1; + break; + } + success_count++; + close(perf_fd); + } else { + if (perf_fd >= 0) { + ksft_print_msg("FAIL: %s CPU %d (L3=%d) expected EINVAL fail, but opened\n", + filepath, cpu, cpu_l3); + close(perf_fd); + ret = -1; + break; + } + if (errno != EINVAL) { + ksft_print_msg("FAIL: %s CPU %d expected errno=EINVAL, got %d (%s)\n", + filepath, cpu, errno, strerror(errno)); + ret = -1; + break; + } + } + } else { + /* Non-monitoring files must fail on all CPUs with EINVAL */ + if (perf_fd >= 0) { + ksft_print_msg("FAIL: non-monitoring file %s CPU %d unexpectedly opened\n", + filepath, cpu); + close(perf_fd); + ret = -1; + break; + } + if (errno != EINVAL) { + ksft_print_msg("FAIL: non-monitoring file %s CPU %d expected errno=EINVAL, got %d (%s)\n", + filepath, cpu, errno, strerror(errno)); + ret = -1; + break; + } } - ksft_print_msg("PASS: Allowed file %s successfully opened perf event\n", + } + + if (!ret && is_monitoring && success_count < 1) { + ksft_print_msg("FAIL: monitoring file %s had no successful CPU opens\n", filepath); - close(perf_fd); - } else { - if (perf_fd >= 0) { - ksft_print_msg("FAIL: unexpected - perf_event_open succeeded for %s\n", + ret = -1; + } + + if (!ret) { + if (is_monitoring) + ksft_print_msg("PASS: monitoring %s: %d CPU(s) opened in-domain, others rejected\n", + filepath, success_count); + else + ksft_print_msg("PASS: non-monitoring %s: all CPU-bound opens rejected with EINVAL\n", filepath); - close(perf_fd); - close(fd); - return -1; - } - ksft_print_msg("PASS: Blocked file %s 
correctly failed perf_event_open: %s\n", - filepath, strerror(errno)); }
out: close(fd); - return 0; + return ret; }
static int walk_directory_recursive(int pmu_type, const char *dir_path)
Implement reads of monitored values in resctrl_event_update().
If the rdtgroup was deleted, read zero. On RMID read errors, emit a WARN_ONCE and read zero.
Introduce mon_event_read_this_cpu() to call mon_event_count() directly, with no IPI, since the perf infrastructure ensures .read runs on the bound CPU.
Augment the existing LLC occupancy selftest to also test the PMU.
Document PMU usage in resctrl.rst.
Signed-off-by: Jonathan Perry <yonch@yonch.com> --- Documentation/filesystems/resctrl.rst | 64 ++++++++++++ fs/resctrl/ctrlmondata.c | 17 ++++ fs/resctrl/internal.h | 1 + fs/resctrl/pmu.c | 31 ++++++- tools/testing/selftests/resctrl/cache.c | 94 ++++++++++++++++++++- tools/testing/selftests/resctrl/cmt_test.c | 17 +++- tools/testing/selftests/resctrl/pmu_test.c | 28 +----- tools/testing/selftests/resctrl/pmu_utils.c | 32 +++++++ tools/testing/selftests/resctrl/resctrl.h | 3 + 9 files changed, 253 insertions(+), 34 deletions(-) create mode 100644 tools/testing/selftests/resctrl/pmu_utils.c
diff --git a/Documentation/filesystems/resctrl.rst b/Documentation/filesystems/resctrl.rst index b7f35b07876a..8f91ba7d622b 100644 --- a/Documentation/filesystems/resctrl.rst +++ b/Documentation/filesystems/resctrl.rst @@ -628,6 +628,70 @@ Resource monitoring rules "mon_data" group.
+Perf PMU access (resctrl PMU) +============================= + +resctrl registers a perf PMU named "resctrl", which provides read-only access +to the same monitoring values exposed via the "mon_data" files. The PMU +enables access from eBPF and allows parallelizing reads as it does not lock +the system-wide `rdtgroup_mutex`. + +Selection and usage +------------------- +- Event selection is performed by passing the file descriptor of a + resctrl monitoring file, for example + "mon_data/mon_L3_00/llc_occupancy", in the `perf_event_attr.config` + field when calling perf_event_open(). +- The perf event must be CPU-bound (pid = -1 and cpu >= 0). +- The chosen CPU must be valid for the domain represented by the + monitoring file: + + - For a domain-specific file such as "mon_L3_00/...", choose a CPU that + belongs to that domain. + - For files that provide a sum across domains that share the same L3 + cache instance (for example on SNC systems), choose a CPU that + shares that L3 cache instance. See the "Cache IDs" section for the + concepts and mapping. +- Exclude flags must be zero. perf_event_open() fails if any exclude + flags are set. + +Semantics +--------- +- The values from the resctrl PMU match the values that would be + read from the corresponding "mon_data" file at the time of the read. +- Sampling is not supported. The PMU provides counts that can be read + on demand; there are no periodic interrupts or per-context filtering + semantics. +- It is safe to read a perf event whose underlying resctrl group has been + deleted. However, the returned values are unspecified: the current + implementation returns zeros, but this may change in the future. + +Discovering the PMU and example +------------------------------- +- The PMU type is exposed at + "/sys/bus/event_source/devices/resctrl/type" and must be placed in + `perf_event_attr.type`.
+- A minimal example of opening a resctrl PMU event by passing a + monitoring file descriptor in `config`:: + + int pmu_type = read_int("/sys/bus/event_source/devices/resctrl/type"); + int mon_fd = open("/sys/fs/resctrl/mon_data/mon_L3_00/llc_occupancy", O_RDONLY); + struct perf_event_attr pe = { 0 }; + + pe.type = pmu_type; + pe.size = sizeof(pe); + pe.config = mon_fd; /* select event via resctrl file descriptor */ + pe.disabled = 1; + + int cpu = /* a CPU in the L3_00 domain (see Cache IDs) */; + int fd = perf_event_open(&pe, -1 /* pid */, cpu /* cpu */, -1, 0); + + ioctl(fd, PERF_EVENT_IOC_ENABLE, 0); + uint64_t val; + read(fd, &val, sizeof(val)); + ioctl(fd, PERF_EVENT_IOC_DISABLE, 0); + + Notes on cache occupancy monitoring and control =============================================== When moving a task from one group to another you should remember that diff --git a/fs/resctrl/ctrlmondata.c b/fs/resctrl/ctrlmondata.c index d1e4cf6f2128..02f8f256c680 100644 --- a/fs/resctrl/ctrlmondata.c +++ b/fs/resctrl/ctrlmondata.c @@ -594,6 +594,23 @@ void mon_event_read(struct rmid_read *rr, cpumask_t *cpumask) resctrl_arch_mon_ctx_free(rr->r, rr->evtid, rr->arch_mon_ctx); }
+void mon_event_read_this_cpu(struct rmid_read *rr) +{ + /* Ensure we're not in a CPU hotplug race */ + lockdep_assert_cpus_held(); + + rr->arch_mon_ctx = resctrl_arch_mon_ctx_alloc(rr->r, rr->evtid); + if (IS_ERR(rr->arch_mon_ctx)) { + rr->err = -EINVAL; + return; + } + + /* Direct call on current CPU - no IPI needed */ + mon_event_count(rr); + + resctrl_arch_mon_ctx_free(rr->r, rr->evtid, rr->arch_mon_ctx); +} + int mon_event_setup_read(struct rmid_read *rr, cpumask_t **cpumask, struct mon_data *md, struct rdtgroup *rdtgrp) { diff --git a/fs/resctrl/internal.h b/fs/resctrl/internal.h index 8cc3a3747c2f..c4adb6eaf101 100644 --- a/fs/resctrl/internal.h +++ b/fs/resctrl/internal.h @@ -383,6 +383,7 @@ void rmid_read_init(struct rmid_read *rr, struct rdt_resource *r, int mon_event_setup_read(struct rmid_read *rr, cpumask_t **cpumask, struct mon_data *md, struct rdtgroup *rdtgrp); void mon_event_read(struct rmid_read *rr, cpumask_t *cpumask); +void mon_event_read_this_cpu(struct rmid_read *rr);
int resctrl_mon_resource_init(void);
diff --git a/fs/resctrl/pmu.c b/fs/resctrl/pmu.c index bdca0b3a5b0b..e6f5f13f29d2 100644 --- a/fs/resctrl/pmu.c +++ b/fs/resctrl/pmu.c @@ -132,8 +132,35 @@ static void resctrl_event_destroy(struct perf_event *event)
static void resctrl_event_update(struct perf_event *event) { - /* Currently just a stub - would read actual cache occupancy here */ - local64_set(&event->hw.prev_count, 0); + struct resctrl_pmu_event *resctrl_event = event->pmu_private; + struct rdtgroup *rdtgrp = resctrl_event->rdtgrp; + struct rmid_read rr; + u64 value = 0; + + /* Check if rdtgroup has been deleted */ + if (rdtgrp->flags & RDT_DELETED) { + local64_set(&event->count, 0); + return; + } + + /* Setup rmid_read structure with current parameters */ + rr = resctrl_event->rr; + rr.val = 0; + rr.err = 0; + + /* Take cpus read lock only around the actual RMID read */ + cpus_read_lock(); + mon_event_read_this_cpu(&rr); + cpus_read_unlock(); + + /* Update counter value based on read result */ + if (!rr.err) + value = rr.val; + else + WARN_ONCE(1, "resctrl PMU: RMID read error (err=%d) for closid=%u, rmid=%u, evtid=%d\n", + rr.err, rdtgrp->closid, rdtgrp->mon.rmid, rr.evtid); + + local64_set(&event->count, value); }
static void resctrl_event_start(struct perf_event *event, int flags) diff --git a/tools/testing/selftests/resctrl/cache.c b/tools/testing/selftests/resctrl/cache.c index 1ff1104e6575..826c11589d34 100644 --- a/tools/testing/selftests/resctrl/cache.c +++ b/tools/testing/selftests/resctrl/cache.c @@ -1,6 +1,7 @@ // SPDX-License-Identifier: GPL-2.0
#include <stdint.h> +#include <fcntl.h> #include "resctrl.h"
char llc_occup_path[1024]; @@ -92,12 +93,62 @@ static int get_llc_occu_resctrl(unsigned long *llc_occupancy) return 0; }
+/* + * get_llc_occu_resctrl_pmu - Read LLC occupancy via resctrl PMU + * + * Uses the PMU to read LLC occupancy from the monitored resource given by the + * global variable llc_occup_path. + * + * Return: =0 on success. <0 on failure. + */ +static int get_llc_occu_resctrl_pmu(unsigned long *llc_occupancy_pmu) +{ + int pmu_type, mon_fd = -1, perf_fd = -1; + struct perf_event_attr pe = { 0 }; + __u64 value = 0; + + if (!llc_occup_path[0]) + return -1; + + pmu_type = resctrl_find_pmu_type("resctrl"); + if (pmu_type < 0) + return -1; + + mon_fd = open(llc_occup_path, O_RDONLY); + if (mon_fd < 0) + return -1; + + memset(&pe, 0, sizeof(pe)); + pe.type = pmu_type; + pe.config = mon_fd; /* Pass the monitoring fd */ + pe.size = sizeof(pe); + pe.disabled = 0; /* Start enabled */ + + perf_fd = perf_event_open(&pe, -1, 0, -1, 0); + if (perf_fd < 0) + goto out_close_mon; + + if (read(perf_fd, &value, sizeof(value)) != sizeof(value)) + goto out_close_all; + + *llc_occupancy_pmu = (unsigned long)value; + + close(perf_fd); + close(mon_fd); + return 0; + +out_close_all: + close(perf_fd); +out_close_mon: + close(mon_fd); + return -1; +} + /* * print_results_cache: the cache results are stored in a file * @filename: file that stores the results * @bm_pid: child pid that runs benchmark - * @llc_value: perf miss value / - * llc occupancy value reported by resctrl FS + * @llc_value: perf miss value * * Return: 0 on success, < 0 on error. */ @@ -121,6 +172,37 @@ static int print_results_cache(const char *filename, pid_t bm_pid, __u64 llc_val return 0; }
+/* + * print_results_llc: prints LLC measurements to a file + * @filename: file that stores the results + * @bm_pid: child pid that runs benchmark + * @fs_value: llc occupancy value reported by resctrl FS + * @pmu_value: llc occupancy value reported by resctrl PMU + * + * Return: 0 on success, < 0 on error. + */ +static int print_results_llc(const char *filename, pid_t bm_pid, + unsigned long fs_value, unsigned long pmu_value) +{ + FILE *fp; + + if (strcmp(filename, "stdio") == 0 || strcmp(filename, "stderr") == 0) { + printf("Pid: %d \t llc_value: %lu\t pmu_value: %lu\n", + (int)bm_pid, fs_value, pmu_value); + } else { + fp = fopen(filename, "a"); + if (!fp) { + ksft_perror("Cannot open results file"); + return -1; + } + fprintf(fp, "Pid: %d \t llc_value: %lu\t pmu_value: %lu\n", + (int)bm_pid, fs_value, pmu_value); + fclose(fp); + } + + return 0; +} + /* * perf_event_measure - Measure perf events * @filename: Filename for writing the results @@ -164,13 +246,19 @@ int perf_event_measure(int pe_fd, struct perf_event_read *pe_read, int measure_llc_resctrl(const char *filename, pid_t bm_pid) { unsigned long llc_occu_resc = 0; + unsigned long llc_occu_pmu = 0; int ret;
ret = get_llc_occu_resctrl(&llc_occu_resc); if (ret < 0) return ret;
- return print_results_cache(filename, bm_pid, llc_occu_resc); + /* Try to get PMU value as well */ + ret = get_llc_occu_resctrl_pmu(&llc_occu_pmu); + if (ret < 0) + return ret; + + return print_results_llc(filename, bm_pid, llc_occu_resc, llc_occu_pmu); }
/* diff --git a/tools/testing/selftests/resctrl/cmt_test.c b/tools/testing/selftests/resctrl/cmt_test.c index d09e693dc739..28250903bbf0 100644 --- a/tools/testing/selftests/resctrl/cmt_test.c +++ b/tools/testing/selftests/resctrl/cmt_test.c @@ -78,6 +78,7 @@ static int check_results(struct resctrl_val_param *param, size_t span, int no_of { char *token_array[8], temp[512]; unsigned long sum_llc_occu_resc = 0; + unsigned long sum_llc_occu_pmu = 0; int runs = 0; FILE *fp;
@@ -100,12 +101,24 @@ static int check_results(struct resctrl_val_param *param, size_t span, int no_of
/* Field 3 is llc occ resc value */ sum_llc_occu_resc += strtoul(token_array[3], NULL, 0); + + /* Field 5: llc occupancy from PMU */ + sum_llc_occu_pmu += strtoul(token_array[5], NULL, 0); runs++; } fclose(fp);
- return show_results_info(sum_llc_occu_resc, no_of_bits, span, - MAX_DIFF, MAX_DIFF_PERCENT, runs, true); + /* Filesystem-based results */ + ksft_print_msg("CMT (resctrl fs):\n"); + int ret_fs = show_results_info(sum_llc_occu_resc, no_of_bits, span, + MAX_DIFF, MAX_DIFF_PERCENT, runs, true); + + /* PMU-based results */ + ksft_print_msg("CMT (PMU):\n"); + int ret_pmu = show_results_info(sum_llc_occu_pmu, no_of_bits, span, + MAX_DIFF, MAX_DIFF_PERCENT, runs, true); + + return ret_fs || ret_pmu; }
static void cmt_test_cleanup(void) diff --git a/tools/testing/selftests/resctrl/pmu_test.c b/tools/testing/selftests/resctrl/pmu_test.c index fb3eec721e43..e4d75a8c0a6c 100644 --- a/tools/testing/selftests/resctrl/pmu_test.c +++ b/tools/testing/selftests/resctrl/pmu_test.c @@ -16,32 +16,6 @@
#define RESCTRL_PMU_NAME "resctrl"
-static int find_pmu_type(const char *pmu_name) -{ - char path[256]; - FILE *file; - int type; - - snprintf(path, sizeof(path), "/sys/bus/event_source/devices/%s/type", - pmu_name); - - file = fopen(path, "r"); - if (!file) { - ksft_print_msg("Failed to open %s: %s\n", path, - strerror(errno)); - return -1; - } - - if (fscanf(file, "%d", &type) != 1) { - ksft_print_msg("Failed to read PMU type from %s\n", path); - fclose(file); - return -1; - } - - fclose(file); - return type; -} - static bool is_allowed_file(const char *filename) { const char *base; @@ -288,7 +262,7 @@ static int pmu_run_test(const struct resctrl_test *test, ksft_print_msg("Testing resctrl PMU file access safety\n");
/* Find the resctrl PMU type */ - pmu_type = find_pmu_type(RESCTRL_PMU_NAME); + pmu_type = resctrl_find_pmu_type(RESCTRL_PMU_NAME); if (pmu_type < 0) { ksft_print_msg("Resctrl PMU not found - PMU is not registered?\n"); return -1; diff --git a/tools/testing/selftests/resctrl/pmu_utils.c b/tools/testing/selftests/resctrl/pmu_utils.c new file mode 100644 index 000000000000..2d65d8b6e9e3 --- /dev/null +++ b/tools/testing/selftests/resctrl/pmu_utils.c @@ -0,0 +1,32 @@ +// SPDX-License-Identifier: GPL-2.0 + +#include "resctrl.h" +#include <stdio.h> + +int resctrl_find_pmu_type(const char *pmu_name) +{ + char path[256]; + FILE *file; + int type; + + if (!pmu_name) + return -1; + + snprintf(path, sizeof(path), "/sys/bus/event_source/devices/%s/type", + pmu_name); + + file = fopen(path, "r"); + if (!file) { + ksft_print_msg("Failed to open %s: %s\n", path, strerror(errno)); + return -1; + } + + if (fscanf(file, "%d", &type) != 1) { + ksft_print_msg("Failed to read PMU type from %s\n", path); + fclose(file); + return -1; + } + + fclose(file); + return type; +} diff --git a/tools/testing/selftests/resctrl/resctrl.h b/tools/testing/selftests/resctrl/resctrl.h index 5b0e6074eaba..d1d2891081cf 100644 --- a/tools/testing/selftests/resctrl/resctrl.h +++ b/tools/testing/selftests/resctrl/resctrl.h @@ -205,6 +205,9 @@ void signal_handler_unregister(void); unsigned int count_bits(unsigned long n); int snc_kernel_support(void);
+/* PMU utilities */ +int resctrl_find_pmu_type(const char *pmu_name); + void perf_event_attr_initialize(struct perf_event_attr *pea, __u64 config); void perf_event_initialize_read_format(struct perf_event_read *pe_read); int perf_open(struct perf_event_attr *pea, pid_t pid, int cpu_no);
On Thu, Oct 16, 2025 at 09:46:48AM -0500, Jonathan Perry wrote:
> Motivation: perf support enables measuring cache occupancy and memory bandwidth metrics on hrtimer (high resolution timer) interrupts via eBPF. Compared with polling from userspace, hrtimer-based reads remove scheduling jitter and context switch overhead. Further, PMU reads can be parallel, since the PMU read path need not lock resctrl's rdtgroup_mutex. Parallelization and reduced jitter enable more accurate snapshots of cache occupancy and memory bandwidth. [1] has more details on the motivation and design.
This parallel read without rdtgroup_mutex looks worrying.
The h/w counters have limited width (24-bits on older Intel CPUs, 32-bits on AMD and Intel >= Icelake). So resctrl takes the raw value and in get_corrected_val() figures the increment since the previous read of the MSR to figure out how much to add to the running per-RMID count of "chunks".
That's all inherently full of races. If perf does this at the same time that resctrl does, then things will be corrupted sooner or later.
You might fix it with a per-RMID spinlock in "struct arch_mbm_state"?
-Tony
That might be too fine a locking granularity. You'd probably be fine with little contention with a lock in "struct rdt_mon_domain".
-Tony
Good catch. Thank you Tony!
We might be able to get the same protection a per-RMID spinlock in "struct arch_mbm_state" would provide using only a memory barrier (no spinlock). I'll look further into it.
-Jonathan