On 2025/08/07 11:03, Oliver Upton wrote:
On Thu, Aug 07, 2025 at 03:24:43AM +0900, Akihiko Odaki wrote:
On 2025/08/07 2:20, Oliver Upton wrote:
Hi Akihiko,
This is an unreasonably large patch that needs to be broken down into smaller patches, ideally one functional change per patch. We need this even for an RFC for the sake of reviews.
On Wed, Aug 06, 2025 at 06:09:54PM +0900, Akihiko Odaki wrote:
+static u64 kvm_pmu_get_pmc_value(struct kvm_vcpu *vcpu, u8 idx)
 {
-	struct kvm_vcpu *vcpu = kvm_pmc_to_vcpu(pmc);
+	struct kvm_pmc *pmc = *kvm_vcpu_idx_to_pmc(vcpu, idx);
 	u64 counter, reg, enabled, running;
+	unsigned int i;

-	reg = counter_index_to_reg(pmc->idx);
+	reg = counter_index_to_reg(idx);
 	counter = __vcpu_sys_reg(vcpu, reg);

 	/*
 	 * The real counter value is equal to the value of counter register plus
 	 * the value perf event counts.
 	 */
-	if (pmc->perf_event)
-		counter += perf_event_read_value(pmc->perf_event, &enabled,
-						 &running);
+	if (pmc)
+		for (i = 0; i < pmc->nr_perf_events; i++)
+			counter += perf_event_read_value(pmc->perf_events[i],
+							 &enabled, &running);
I'm concerned that this array of events concept you're introducing is going to be error-prone. An approach that reallocates a new PMU event in the case of a vCPU migrating to a new PMU implementation would be desirable.
I avoided allocation at migration because I was worried about disabled preemption: perf_event_create_kernel_counter() internally takes a mutex, so it cannot be used while preemption is disabled.
I wonder if it is guaranteed that kvm_arch_vcpu_load() executes with preemption enabled. If so, I can add a hook there to call perf_event_create_kernel_counter().
You don't need to allocate the event at vcpu_load(). The only guarantee we need is that an event has been attached to the PMC prior to entering the guest, for which the KVM_REQ_* infrastructure is quite useful.
I see that kvm_arch_vcpu_load() makes a KVM_REQ_RECORD_STEAL request. Indeed I can add another request for PMC handling.
I'm a bit concerned about the overhead of allocation, but migration between heterogeneous cores is probably rare.
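For illustration, a rough sketch of what I have in mind, assuming a new request bit; the name KVM_REQ_RELOAD_PMU_EVENTS, its number, and kvm_pmu_reload_events() are made up for this example and do not exist in the tree:

/* Illustrative only: these names are hypothetical. */
#define KVM_REQ_RELOAD_PMU_EVENTS	KVM_ARCH_REQ(16)	/* number picked arbitrarily */

/*
 * Called wherever a PMU change is noticed (e.g. from kvm_arch_vcpu_load());
 * this only raises the request, no allocation happens here.
 */
static void kvm_pmu_request_reload(struct kvm_vcpu *vcpu)
{
	kvm_make_request(KVM_REQ_RELOAD_PMU_EVENTS, vcpu);
}

/*
 * Handled from the request-processing path before entering the guest,
 * where sleeping is allowed, so perf_event_create_kernel_counter() can
 * take its mutex safely.
 */
static void kvm_pmu_handle_reload(struct kvm_vcpu *vcpu)
{
	if (kvm_check_request(KVM_REQ_RELOAD_PMU_EVENTS, vcpu))
		kvm_pmu_reload_events(vcpu);	/* hypothetical: reattach perf events */
}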
+static void reset_sample_period(struct perf_event *perf_event)
+{
+	struct kvm_pmc **pmc = perf_event->overflow_handler_context;
+	struct kvm_vcpu *vcpu = kvm_pmc_to_vcpu(pmc);
+	struct arm_pmu *cpu_pmu = to_arm_pmu(perf_event->pmu);
+	u64 period;
+
+	cpu_pmu->pmu.stop(perf_event, PERF_EF_UPDATE);
+
+	/*
+	 * Reset the sample period to the architectural limit,
+	 * i.e. the point where the counter overflows.
+	 */
+	period = compute_period(pmc, kvm_pmu_get_pmc_value(vcpu, (*pmc)->idx));
+
+	local64_set(&perf_event->hw.period_left, 0);
+	perf_event->attr.sample_period = period;
+	perf_event->hw.sample_period = period;
+
+	cpu_pmu->pmu.start(perf_event, PERF_EF_RELOAD);
+}
No, we can't start calling into the internal driver interfaces. The fact that we have a pointer to the PMU is an ugly hack and shouldn't be used like this.
This function was extracted from kvm_pmu_perf_overflow() and is not a new addition (which would have been clear if I had split the patch as you noted).
Yuck, then forgive my whining.
I thought of replacing it with perf_event_period(), but there is a catch: it returns -EINVAL if the 63rd bit of the sample period value is set. Perhaps we could just use ((1ULL << 63) - 1) when the sample period would be that long, but I conservatively avoided that. I can change it if you prefer that approach or have an alternative idea.
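For illustration only, the alternative I considered would look roughly like this; the clamp and the helper name reset_sample_period_alt() are assumptions, not part of the patch:

static void reset_sample_period_alt(struct kvm_vcpu *vcpu, struct kvm_pmc **pmc,
				    struct perf_event *perf_event)
{
	u64 period = compute_period(pmc, kvm_pmu_get_pmc_value(vcpu, (*pmc)->idx));

	/* perf_event_period() returns -EINVAL when bit 63 of the period is set. */
	period = min(period, (1ULL << 63) - 1);

	perf_event_period(perf_event, period);
}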
Let's keep it as-is then. Sorry about the noise.
Can we just special-case the fixed CPU cycle counter to use PERF_TYPE_HARDWARE / PERF_COUNT_HW_CPU_CYCLES? That _should_ have the intended effect of opening an event on the PMU for this CPU.
I have the impression that this emulation could be made more generic by converting eventsel to PERF_COUNT_HW_* numbers with armv8_pmuv3_perf_map in drivers/perf/arm_pmuv3.c, but that is out of scope for this change. The current code is sufficient for now.
The only cross-PMU events we will support are the fixed counters, my strong preference is that we do not reverse-map architectural events to generic perf events for all counters.
I wonder if there is a benefit to special-casing PERF_COUNT_HW_CPU_CYCLES then; the current logic of kvm_map_pmu_event() looks sufficient to me.
I'm not sure there's much value in preventing the user from configuring the PMU event filter. Even in the case of the fixed CPU cycle counter we allow userspace to filter the event.
It is possible to configure the PMU event filter, but it needs to be done after setting the attribute. This behavior is aligned with KVM_ARM_VCPU_PMU_V3_SET_PMU.
That's fine then, I missed that detail.
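For illustration, the ordering looks roughly like this from userspace; COMPOSITION is the attribute proposed here, the filter contents and vcpu_fd are just example values:

	struct kvm_pmu_event_filter filter = {
		.base_event = 0x11,	/* CPU_CYCLES, as an example */
		.nevents = 1,
		.action = KVM_PMU_EVENT_ALLOW,
	};
	struct kvm_device_attr attr = {
		.group = KVM_ARM_VCPU_PMU_V3_CTRL,
		.attr  = KVM_ARM_VCPU_PMU_V3_COMPOSITION,
	};

	/* Select the PMU first... */
	ioctl(vcpu_fd, KVM_SET_DEVICE_ATTR, &attr);

	/* ...then install the event filter. */
	attr.attr = KVM_ARM_VCPU_PMU_V3_FILTER;
	attr.addr = (__u64)(unsigned long)&filter;
	ioctl(vcpu_fd, KVM_SET_DEVICE_ATTR, &attr);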
They are mutually exclusive and the latest configuration takes effect.
If you set SET_PMU after COMPOSITION, SET_PMU will take effect. If you set COMPOSITION after SET_PMU, COMPOSITION will take effect.
I'll note that in the documentation.
This isn't what I meant. What I mean is that userspace either can use the SET_PMU ioctl or the COMPOSITION ioctl. Once one of them has been used the other ioctl returns an error.
We're really bad at getting ioctl ordering / interleaving right and syzkaller has a habit of finding these mistakes. There's zero practical value in using both of these ioctls on the same VM, let's prevent it.
The corresponding RFC series for QEMU uses KVM_ARM_VCPU_PMU_V3_SET_PMU to probe host PMUs, and falls back to KVM_ARM_VCPU_PMU_V3_COMPOSITION if none covers all CPUs. Switching between SET_PMU and COMPOSITION is useful during such probing.
COMPOSITION is designed to behave like just another host PMU set with SET_PMU. SET_PMU allows setting a different host PMU even if SET_PMU has already been invoked, so it should likewise be allowed to set a host PMU even after COMPOSITION has been invoked, maintaining consistency with non-composed PMUs.
You can find the QEMU patch at: https://lore.kernel.org/qemu-devel/20250806-kvm-v1-1-d1d50b7058cd@rsg.ci.i.u...
(look up KVM_ARM_VCPU_PMU_V3_SET_PMU for the probing code)
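Roughly, the probing goes like the following sketch (simplified and hypothetical, not the actual QEMU code; host_pmu_ids would come from sysfs, error handling is omitted):

static int select_vcpu_pmu(int vcpu_fd, const int *host_pmu_ids, int nr_pmus)
{
	struct kvm_device_attr attr = {
		.group = KVM_ARM_VCPU_PMU_V3_CTRL,
		.attr  = KVM_ARM_VCPU_PMU_V3_SET_PMU,
	};
	int i;

	/* Probe: try each host PMU until one is accepted for this vCPU. */
	for (i = 0; i < nr_pmus; i++) {
		attr.addr = (__u64)(unsigned long)&host_pmu_ids[i];
		if (!ioctl(vcpu_fd, KVM_SET_DEVICE_ATTR, &attr))
			return 0;
	}

	/* Fall back to the composed PMU built from the fixed counters. */
	attr.attr = KVM_ARM_VCPU_PMU_V3_COMPOSITION;
	attr.addr = 0;
	return ioctl(vcpu_fd, KVM_SET_DEVICE_ATTR, &attr);
}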
+	case KVM_ARM_VCPU_PMU_V3_COMPOSITION:
+		return kvm_arm_pmu_v3_set_pmu_composition(vcpu);
I'd prefer naming this something like 'KVM_ARM_VCPU_PMU_V3_FIXED_COUNTERS_ONLY'. We will have the fixed instruction counter eventually which is another event we could potentially provide system-wide.
The design decision to expose the cycle counter is driven by the motivation to satisfy FEAT_PMUv3 for the guest rather than by the host capability.
When we implement FEAT_PMUv3_ICNTR in the future, I think we will need to add KVM_ARM_VCPU_PMU_V3_COMPOSITION_ICNTR or something. The resulting emulated PMU will work only on host CPUs that have instruction counters; a host CPU may lack one because it doesn't implement FEAT_PMUv3_ICNTR or because it has a third-party PMU like apple-m1-cpu-pmu (fortunately, this particular PMU does have an instruction counter).
No, this is at odds with the way we handle CPU features in KVM/arm64.
Ignoring this ioctl, the goal is that after KVM_ARM_VCPU_INIT the vCPU is initialized to the maximum possible feature set. If userspace wants to de-feature the VM it can do so by modifying the ID registers.
OTOH, your ioctl is dealing with something structural in how we support PMUs on the system. It should be defined as exposing all fixed counters that we support on a given piece of hardware. On systems without FEAT_PMUv3_ICNTR, the result should be a vPMU that only has the CPU cycle counter.
On a system that has FEAT_PMUv3_ICNTR, userspace can still use this ioctl and explicitly de-feature ICNTR by writing to the ID register after initialization.
Now I understand better.
Currently, KVM_ARM_VCPU_PMU_V3_COMPOSITION sets supported_cpus to the CPUs whose cycle counters are compatible with PMU emulation.
If FEAT_PMUv3_ICNTR is set in the ID register, I guess KVM_ARM_VCPU_PMU_V3_COMPOSITION will set supported_cpus to the CPUs that have compatible cycle and instruction counters. If so, the naming KVM_ARM_VCPU_PMU_V3_FIXED_COUNTERS_ONLY indeed makes sense.
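As a rough sketch of what I mean by the current behavior (pmu_has_cycle_counter() is a placeholder for the actual compatibility check, and locking of the arm_pmus list is omitted):

static void composition_update_supported_cpus(struct kvm *kvm)
{
	struct arm_pmu_entry *entry;

	cpumask_clear(&kvm->arch.supported_cpus);

	/* Allow the vCPU to run on every CPU whose PMU can count CPU cycles. */
	list_for_each_entry(entry, &arm_pmus, entry)
		if (pmu_has_cycle_counter(entry->arm_pmu))	/* placeholder check */
			cpumask_or(&kvm->arch.supported_cpus,
				   &kvm->arch.supported_cpus,
				   &entry->arm_pmu->supported_cpus);
}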
Regards,
Akihiko Odaki