On 2025/08/07 11:03, Oliver Upton wrote:
On Thu, Aug 07, 2025 at 03:24:43AM +0900, Akihiko Odaki wrote:
On 2025/08/07 2:20, Oliver Upton wrote:
Hi Akihiko,
This is an unreasonably large patch that needs to be broken down into smaller patches, ideally one functional change per patch. We need this even for an RFC for the sake of reviews.
On Wed, Aug 06, 2025 at 06:09:54PM +0900, Akihiko Odaki wrote:
+static u64 kvm_pmu_get_pmc_value(struct kvm_vcpu *vcpu, u8 idx)
 {
-	struct kvm_vcpu *vcpu = kvm_pmc_to_vcpu(pmc);
+	struct kvm_pmc *pmc = *kvm_vcpu_idx_to_pmc(vcpu, idx);
 	u64 counter, reg, enabled, running;
+	unsigned int i;

-	reg = counter_index_to_reg(pmc->idx);
+	reg = counter_index_to_reg(idx);
 	counter = __vcpu_sys_reg(vcpu, reg);

 	/*
 	 * The real counter value is equal to the value of counter register plus
 	 * the value perf event counts.
 	 */
-	if (pmc->perf_event)
-		counter += perf_event_read_value(pmc->perf_event, &enabled,
-						 &running);
+	if (pmc)
+		for (i = 0; i < pmc->nr_perf_events; i++)
+			counter += perf_event_read_value(pmc->perf_events[i],
+							 &enabled, &running);
I'm concerned that this array of events concept you're introducing is going to be error-prone. An approach that reallocates a new PMU event in the case of a vCPU migrating to a new PMU implementation would be desirable.
I avoided allocation at migration because I was worried about disabled preemption: perf_event_create_kernel_counter() internally takes a mutex, so it cannot be used while preemption is disabled.
I wonder if it is guaranteed that kvm_arch_vcpu_load() executes with preemption enabled. If so, I can add a hook there to call perf_event_create_kernel_counter().
You don't need to allocate the event at vcpu_load(). The only guarantee we need is that an event has been attached to the PMC prior to entering the guest, for which the KVM_REQ_* infrastructure is quite useful.
I see that kvm_arch_vcpu_load() makes a KVM_REQ_RECORD_STEAL request. Indeed I can add another request for PMC handling.
I'm a bit concerned about the overhead of allocation, but migration between heterogeneous cores is probably rare.
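For illustration, a rough sketch of what I have in mind, assuming a new request bit; the name KVM_REQ_RELOAD_PMU_EVENTS, its number, and kvm_pmu_reload_events() are made up for this example and do not exist in the tree:

/* Illustrative only: these names are hypothetical. */
#define KVM_REQ_RELOAD_PMU_EVENTS	KVM_ARCH_REQ(16)	/* number picked arbitrarily */

/*
 * Called wherever a PMU change is noticed (e.g. from kvm_arch_vcpu_load());
 * this only raises the request, no allocation happens here.
 */
static void kvm_pmu_request_reload(struct kvm_vcpu *vcpu)
{
	kvm_make_request(KVM_REQ_RELOAD_PMU_EVENTS, vcpu);
}

/*
 * Handled from the request-processing path before entering the guest,
 * where sleeping is allowed, so perf_event_create_kernel_counter() can
 * take its mutex safely.
 */
static void kvm_pmu_handle_reload(struct kvm_vcpu *vcpu)
{
	if (kvm_check_request(KVM_REQ_RELOAD_PMU_EVENTS, vcpu))
		kvm_pmu_reload_events(vcpu);	/* hypothetical: reattach perf events */
}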
+static void reset_sample_period(struct perf_event *perf_event)
+{
+	struct kvm_pmc **pmc = perf_event->overflow_handler_context;
+	struct kvm_vcpu *vcpu = kvm_pmc_to_vcpu(pmc);
+	struct arm_pmu *cpu_pmu = to_arm_pmu(perf_event->pmu);
+	u64 period;
+
+	cpu_pmu->pmu.stop(perf_event, PERF_EF_UPDATE);
+
+	/*
+	 * Reset the sample period to the architectural limit,
+	 * i.e. the point where the counter overflows.
+	 */
+	period = compute_period(pmc, kvm_pmu_get_pmc_value(vcpu, (*pmc)->idx));
+
+	local64_set(&perf_event->hw.period_left, 0);
+	perf_event->attr.sample_period = period;
+	perf_event->hw.sample_period = period;
+
+	cpu_pmu->pmu.start(perf_event, PERF_EF_RELOAD);
+}
No, we can't start calling into the internal driver interfaces. The fact that we have a pointer to the PMU is an ugly hack and shouldn't be used like this.
This function was extracted from kvm_pmu_perf_overflow() and is not a new addition (which would have been clear if I had split the patch as you noted).
Yuck, then forgive my whining.
I thought of replacing it with perf_event_period(), but there is a catch: it returns -EINVAL if the 63rd bit of the sample period value is set. Perhaps we could just use ((1ULL << 63) - 1) when the sample period would be that long, but I conservatively avoided that. I can change it if you prefer that approach or have an alternative idea.
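For illustration only, the alternative I considered would look roughly like this; the clamp and the helper name reset_sample_period_alt() are assumptions, not part of the patch:

static void reset_sample_period_alt(struct kvm_vcpu *vcpu, struct kvm_pmc **pmc,
				    struct perf_event *perf_event)
{
	u64 period = compute_period(pmc, kvm_pmu_get_pmc_value(vcpu, (*pmc)->idx));

	/* perf_event_period() returns -EINVAL when bit 63 of the period is set. */
	period = min(period, (1ULL << 63) - 1);

	perf_event_period(perf_event, period);
}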
Let's keep it as-is then. Sorry about the noise.
Can we just special-case the fixed CPU cycle counter to use PERF_TYPE_HARDWARE / PERF_COUNT_HW_CPU_CYCLES? That _should_ have the intended effect of opening an event on the PMU for this CPU.
I have the impression that this emulation could be made more generic by converting eventsel to PERF_COUNT_HW_* numbers with armv8_pmuv3_perf_map in drivers/perf/arm_pmuv3.c, but that is out of scope for this change. The current code is sufficient for now.
The only cross-PMU events we will support are the fixed counters, my strong preference is that we do not reverse-map architectural events to generic perf events for all counters.
I wonder if there is a benefit to special-casing PERF_COUNT_HW_CPU_CYCLES then; the current logic of kvm_map_pmu_event() looks sufficient to me.
I'm not sure there's much value in preventing the user from configuring the PMU event filter. Even in the case of the fixed CPU cycle counter we allow userspace to filter the event.
It is possible to configure the PMU event filter, but it needs to be done after setting the attribute. This behavior is aligned with KVM_ARM_VCPU_PMU_V3_SET_PMU.
That's fine then, I missed that detail.
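For illustration, the ordering looks roughly like this from userspace; COMPOSITION is the attribute proposed here, the filter contents and vcpu_fd are just example values:

	struct kvm_pmu_event_filter filter = {
		.base_event = 0x11,	/* CPU_CYCLES, as an example */
		.nevents = 1,
		.action = KVM_PMU_EVENT_ALLOW,
	};
	struct kvm_device_attr attr = {
		.group = KVM_ARM_VCPU_PMU_V3_CTRL,
		.attr  = KVM_ARM_VCPU_PMU_V3_COMPOSITION,
	};

	/* Select the PMU first... */
	ioctl(vcpu_fd, KVM_SET_DEVICE_ATTR, &attr);

	/* ...then install the event filter. */
	attr.attr = KVM_ARM_VCPU_PMU_V3_FILTER;
	attr.addr = (__u64)(unsigned long)&filter;
	ioctl(vcpu_fd, KVM_SET_DEVICE_ATTR, &attr);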
They are mutually exclusive and the latest configuration takes effect.
If you set SET_PMU after COMPOSITION, SET_PMU will take effect. If you set COMPOSITION after SET_PMU, COMPOSITION will take effect.
I'll note that in the documentation.
This isn't what I meant. What I mean is that userspace either can use the SET_PMU ioctl or the COMPOSITION ioctl. Once one of them has been used the other ioctl returns an error.
We're really bad at getting ioctl ordering / interleaving right and syzkaller has a habit of finding these mistakes. There's zero practical value in using both of these ioctls on the same VM, let's prevent it.
The corresponding RFC series for QEMU uses KVM_ARM_VCPU_PMU_V3_SET_PMU to probe host PMUs, and falls back to KVM_ARM_VCPU_PMU_V3_COMPOSITION if none covers all CPUs. Switching between SET_PMU and COMPOSITION is useful during such probing.
COMPOSITION is designed to behave like just another host PMU set with SET_PMU. SET_PMU allows setting a different host PMU even if SET_PMU has already been invoked, so it should likewise be allowed to set a host PMU even after COMPOSITION has been invoked, maintaining consistency with non-composed PMUs.
You can find the QEMU patch at: https://lore.kernel.org/qemu-devel/20250806-kvm-v1-1-d1d50b7058cd@rsg.ci.i.u...
(look up KVM_ARM_VCPU_PMU_V3_SET_PMU for the probing code)
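Roughly, the probing goes like the following sketch (simplified and hypothetical, not the actual QEMU code; host_pmu_ids would come from sysfs, error handling is omitted):

static int select_vcpu_pmu(int vcpu_fd, const int *host_pmu_ids, int nr_pmus)
{
	struct kvm_device_attr attr = {
		.group = KVM_ARM_VCPU_PMU_V3_CTRL,
		.attr  = KVM_ARM_VCPU_PMU_V3_SET_PMU,
	};
	int i;

	/* Probe: try each host PMU until one is accepted for this vCPU. */
	for (i = 0; i < nr_pmus; i++) {
		attr.addr = (__u64)(unsigned long)&host_pmu_ids[i];
		if (!ioctl(vcpu_fd, KVM_SET_DEVICE_ATTR, &attr))
			return 0;
	}

	/* Fall back to the composed PMU built from the fixed counters. */
	attr.attr = KVM_ARM_VCPU_PMU_V3_COMPOSITION;
	attr.addr = 0;
	return ioctl(vcpu_fd, KVM_SET_DEVICE_ATTR, &attr);
}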
+	case KVM_ARM_VCPU_PMU_V3_COMPOSITION:
+		return kvm_arm_pmu_v3_set_pmu_composition(vcpu);
I'd prefer naming this something like 'KVM_ARM_VCPU_PMU_V3_FIXED_COUNTERS_ONLY'. We will have the fixed instruction counter eventually which is another event we could potentially provide system-wide.
The design decision to expose the cycle counter is driven by the motivation to satisfy FEAT_PMUv3 for the guest rather than by the host capability.
When we implement FEAT_PMUv3_ICNTR in the future, I think we will need to add KVM_ARM_VCPU_PMU_V3_COMPOSITION_ICNTR or something. The resulting emulated PMU will work only on host CPUs that have instruction counters; a host CPU may lack one because it doesn't implement FEAT_PMUv3_ICNTR or because it has a third-party PMU like apple-m1-cpu-pmu (fortunately, this particular PMU does have an instruction counter).
No, this is at odds with the way we handle CPU features in KVM/arm64.
Ignoring this ioctl, the goal is that after KVM_ARM_VCPU_INIT the vCPU is initialized to the maximum possible feature set. If userspace wants to de-feature the VM it can do so by modifying the ID registers.
OTOH, your ioctl is dealing with something structural in how we support PMUs on the system. It should be defined as exposing all fixed counters that we support on a given piece of hardware. On systems without FEAT_PMUv3_ICNTR, the result should be a vPMU that only has the CPU cycle counter.
On a system that has FEAT_PMUv3_ICNTR, userspace can still use this ioctl and explicitly de-feature ICNTR by writing to the ID register after initialization.
Now I understand better.
Currently, KVM_ARM_VCPU_PMU_V3_COMPOSITION sets supported_cpus to the CPUs whose cycle counters are compatible with PMU emulation.
If FEAT_PMUv3_ICNTR is set in the ID register, I guess KVM_ARM_VCPU_PMU_V3_COMPOSITION will set supported_cpus to the CPUs that have compatible cycle and instruction counters. If so, the naming KVM_ARM_VCPU_PMU_V3_FIXED_COUNTERS_ONLY indeed makes sense.
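As a rough sketch of what I mean by the current behavior (pmu_has_cycle_counter() is a placeholder for the actual compatibility check, and locking of the arm_pmus list is omitted):

static void composition_update_supported_cpus(struct kvm *kvm)
{
	struct arm_pmu_entry *entry;

	cpumask_clear(&kvm->arch.supported_cpus);

	/* Allow the vCPU to run on every CPU whose PMU can count CPU cycles. */
	list_for_each_entry(entry, &arm_pmus, entry)
		if (pmu_has_cycle_counter(entry->arm_pmu))	/* placeholder check */
			cpumask_or(&kvm->arch.supported_cpus,
				   &kvm->arch.supported_cpus,
				   &entry->arm_pmu->supported_cpus);
}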
Regards,
Akihiko Odaki