On 2025/08/09 8:08, Oliver Upton wrote:
On Thu, Aug 07, 2025 at 11:06:21PM +0900, Akihiko Odaki wrote:
The only cross-PMU events we will support are the fixed counters; my strong preference is that we do not reverse-map architectural events to generic perf events for all counters.
I wonder if there is a benefit to special-casing PERF_COUNT_HW_CPU_CYCLES then; the current logic of kvm_map_pmu_event() looks sufficient to me.
I'd rather we just use the generic perf events and let the driver remap things on our behalf. These are fixed counters, so using constant events feels like the right way to go about that.
kvm_map_pmu_event() is trying to solve a slightly different problem where we need to map programmable PMUv3 events into a non-PMUv3 event space, like on the M1 PMU.
It is currently also used to map non-programmable PMUv3 events.
I want to understand the motivation better. The current procedure to determine the config value is as follows:

1) If the register is PMCCFILTR_EL0:
   a) eventsel = ARMV8_PMUV3_PERFCTR_CPU_CYCLES.
2) If the register is not PMCCFILTR_EL0:
   a) Derive eventsel by masking the register value.
3) If map_pmuv3_event() exists:
   a) The config value is map_pmuv3_event(eventsel).
4) If map_pmuv3_event() does not exist:
   a) The config value is eventsel.
If we use PERF_TYPE_HARDWARE / PERF_COUNT_HW_CPU_CYCLES, the procedure will look like the following:

1) If the register is PMCCFILTR_EL0:
   a) The config value is PERF_TYPE_HARDWARE / PERF_COUNT_HW_CPU_CYCLES.
2) If the register is not PMCCFILTR_EL0:
   a) Derive eventsel by masking the register value.
   b) If map_pmuv3_event() exists:
      i) The config value is map_pmuv3_event(eventsel).
   c) If map_pmuv3_event() does not exist:
      i) The config value is eventsel.
It does not seem that using PERF_TYPE_HARDWARE / PERF_COUNT_HW_CPU_CYCLES simplifies the procedure.
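To make the comparison concrete, here is a minimal standalone sketch of both procedures in C. It is not the KVM code: every sketch_ name, the 16-bit event mask, the pmu_type parameter, and the example mapping function are assumptions made for illustration. Only the numeric values of the PMUv3 cycle event (0x11) and of the generic perf constants mirror the kernel headers.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Values mirror the kernel headers; the names are local to this sketch. */
#define SKETCH_PMUV3_PERFCTR_CPU_CYCLES 0x11
#define SKETCH_PERF_TYPE_HARDWARE       0
#define SKETCH_PERF_COUNT_HW_CPU_CYCLES 0
/* Assumption: a 16-bit event field; the real mask depends on the PMU version. */
#define SKETCH_EVENT_MASK               0xffffu

/* Stand-in for the optional map_pmuv3_event() callback (may be NULL). */
typedef int (*sketch_map_fn)(unsigned int eventsel);

struct sketch_event {
	uint32_t type;   /* perf event type */
	uint64_t config; /* perf event config */
};

/* Hypothetical mapping into a non-PMUv3 event space. */
static int sketch_example_map(unsigned int eventsel)
{
	return (int)(eventsel + 0x100);
}

/* Current procedure: the cycle counter goes through the same mapping path. */
static struct sketch_event sketch_current(bool is_pmccfiltr, uint64_t reg,
					  uint32_t pmu_type, sketch_map_fn map)
{
	uint64_t eventsel = is_pmccfiltr ? SKETCH_PMUV3_PERFCTR_CPU_CYCLES
					 : (reg & SKETCH_EVENT_MASK);
	struct sketch_event ev = { .type = pmu_type, .config = eventsel };

	if (map)
		ev.config = (uint64_t)map((unsigned int)eventsel);
	return ev;
}

/* Proposed procedure: the cycle counter uses the generic perf event. */
static struct sketch_event sketch_proposed(bool is_pmccfiltr, uint64_t reg,
					   uint32_t pmu_type, sketch_map_fn map)
{
	struct sketch_event ev = { .type = pmu_type, .config = 0 };
	uint64_t eventsel;

	if (is_pmccfiltr) {
		ev.type = SKETCH_PERF_TYPE_HARDWARE;
		ev.config = SKETCH_PERF_COUNT_HW_CPU_CYCLES;
		return ev;
	}

	eventsel = reg & SKETCH_EVENT_MASK;
	ev.config = map ? (uint64_t)map((unsigned int)eventsel) : eventsel;
	return ev;
}
```

Both functions keep the same masking and mapping steps for programmable counters; the proposed variant only adds an early return for PMCCFILTR_EL0, which is why the two look about equally complex.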
This isn't what I meant. What I mean is that userspace either can use the SET_PMU ioctl or the COMPOSITION ioctl. Once one of them has been used the other ioctl returns an error.
We're really bad at getting ioctl ordering / interleaving right and syzkaller has a habit of finding these mistakes. There's zero practical value in using both of these ioctls on the same VM, so let's prevent it.
The corresponding RFC series for QEMU uses KVM_ARM_VCPU_PMU_V3_SET_PMU to probe host PMUs, and falls back to KVM_ARM_VCPU_PMU_V3_COMPOSITION if none covers all CPUs. Switching between SET_PMU and COMPOSITION is useful during such probing.
COMPOSITION is designed to behave like just another host PMU that is set with SET_PMU. SET_PMU allows setting a different host PMU even if SET_PMU has already been invoked, so setting a host PMU is also allowed after COMPOSITION has been invoked. This keeps the behavior consistent with non-composed PMUs.
You can find the QEMU patch at: https://lore.kernel.org/qemu-devel/20250806-kvmq-v1-1-d1d50b7058cd@rsg.ci.i....
(look up KVM_ARM_VCPU_PMU_V3_SET_PMU for the probing code)
Having both of these attributes return success when probed with KVM_HAS_DEVICE_ATTR is fine; what I mean is that once KVM_SET_DEVICE_ATTR has been called on an attribute the other fails.
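As an illustration of the rule described here, the following standalone C sketch models a VM that remembers which attribute was set first. It is not kernel code: the sketch_ names are hypothetical, the choice of -EBUSY as the error is an assumption (no error code is named in this thread), and allowing the same attribute to be set repeatedly is also an assumption.

```c
#include <errno.h>

enum sketch_pmu_attr {
	SKETCH_ATTR_NONE,
	SKETCH_ATTR_SET_PMU,
	SKETCH_ATTR_COMPOSITION,
};

struct sketch_vm {
	enum sketch_pmu_attr chosen; /* attribute set first, if any */
};

/* Models KVM_HAS_DEVICE_ATTR: both attributes always probe successfully. */
static int sketch_has_attr(const struct sketch_vm *vm, enum sketch_pmu_attr attr)
{
	(void)vm;
	return attr == SKETCH_ATTR_NONE ? -ENXIO : 0;
}

/* Models KVM_SET_DEVICE_ATTR: setting one attribute locks out the other. */
static int sketch_set_attr(struct sketch_vm *vm, enum sketch_pmu_attr attr)
{
	if (attr == SKETCH_ATTR_NONE)
		return -EINVAL;
	if (vm->chosen != SKETCH_ATTR_NONE && vm->chosen != attr)
		return -EBUSY; /* assumed error code */
	vm->chosen = attr;
	return 0;
}

/* Usage example: SET_PMU first, then COMPOSITION on the same VM. */
static int sketch_example_interleave(void)
{
	struct sketch_vm vm = { SKETCH_ATTR_NONE };

	if (sketch_set_attr(&vm, SKETCH_ATTR_SET_PMU) != 0)
		return 1;
	if (sketch_set_attr(&vm, SKETCH_ATTR_SET_PMU) != 0)
		return 2;
	if (sketch_has_attr(&vm, SKETCH_ATTR_COMPOSITION) != 0)
		return 3;
	return sketch_set_attr(&vm, SKETCH_ATTR_COMPOSITION);
}
```

Under this model, a userspace that sets SET_PMU while probing can no longer switch the same VM to COMPOSITION afterwards.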
By probing, I meant checking if a host PMU is compatible with KVM.
More concretely, QEMU implements the following procedure to detect a PMU backend compatible with all host CPUs:
1) Traverse /sys/bus/event_source/devices:
   a) Check if the device has the cpus and type attributes. If it doesn't, skip it.
   b) Try to set the device's type with KVM_ARM_VCPU_PMU_V3_SET_PMU. If successful, the device is compatible with KVM.
   c) Check if the device's cpus cover all host CPUs. If they do, use the device with KVM_ARM_VCPU_PMU_V3_SET_PMU.
2) Check if the union of the cpus attributes of compatible devices covers all CPUs. If it does, use KVM_ARM_VCPU_PMU_V3_COMPOSITION.
3) If no usable backend has been found by this step, there is no PMU backend compatible with all host CPUs.
Here, 1b) calls KVM_SET_DEVICE_ATTR with KVM_ARM_VCPU_PMU_V3_SET_PMU during probing.
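The selection logic of steps 1) to 3) can be sketched in standalone C as follows. This is not the QEMU patch: the sketch_ names are hypothetical, CPU sets are simplified to a 64-bit mask, and the set_pmu_ok flag stands in for the result of the KVM_SET_DEVICE_ATTR call made in step 1b).

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum sketch_backend {
	SKETCH_BACKEND_NONE,
	SKETCH_BACKEND_SET_PMU,
	SKETCH_BACKEND_COMPOSITION,
};

struct sketch_pmu_dev {
	bool has_cpus_and_type; /* step 1a: sysfs attributes present */
	bool set_pmu_ok;        /* step 1b: SET_PMU succeeded during probing */
	uint64_t cpus;          /* bitmask parsed from the cpus attribute */
};

static enum sketch_backend
sketch_pick_backend(const struct sketch_pmu_dev *devs, size_t n,
		    uint64_t host_cpus, size_t *chosen)
{
	uint64_t covered = 0;

	for (size_t i = 0; i < n; i++) {
		if (!devs[i].has_cpus_and_type)
			continue; /* 1a: skip devices without the attributes */
		if (!devs[i].set_pmu_ok)
			continue; /* 1b: not compatible with KVM */
		if ((devs[i].cpus & host_cpus) == host_cpus) {
			*chosen = i; /* 1c: a single PMU covers all CPUs */
			return SKETCH_BACKEND_SET_PMU;
		}
		covered |= devs[i].cpus;
	}

	/* 2: the compatible PMUs together cover all CPUs */
	if ((covered & host_cpus) == host_cpus)
		return SKETCH_BACKEND_COMPOSITION;

	/* 3: nothing usable */
	return SKETCH_BACKEND_NONE;
}

/* Usage example: a host with two compatible PMU devices. */
static enum sketch_backend sketch_example(uint64_t cpus_a, uint64_t cpus_b,
					  uint64_t host_cpus)
{
	struct sketch_pmu_dev devs[] = {
		{ .has_cpus_and_type = true, .set_pmu_ok = true, .cpus = cpus_a },
		{ .has_cpus_and_type = true, .set_pmu_ok = true, .cpus = cpus_b },
	};
	size_t chosen = 0;

	return sketch_pick_backend(devs, 2, host_cpus, &chosen);
}
```

For a host with eight CPUs split into two clusters, sketch_example(0x0f, 0xf0, 0xff) selects COMPOSITION, because neither PMU covers every CPU on its own but their union does. Note that the set_pmu_ok flag can only be obtained by actually calling KVM_SET_DEVICE_ATTR, which is the point made above.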
On a system that has FEAT_PMUv3_ICNTR, userspace can still use this ioctl and explicitly de-feature ICNTR by writing to the ID register after initialization.
Now I understand better.
Currently, KVM_ARM_VCPU_PMU_V3_COMPOSITION sets supported_cpus to ones that have cycle counters compatible with PMU emulation.
If FEAT_PMUv3_ICNTR is set in the ID register, I guess KVM_ARM_VCPU_PMU_V3_COMPOSITION will set supported_cpus to ones that have compatible cycle and instruction counters. If so, the name KVM_ARM_VCPU_PMU_V3_FIXED_COUNTERS_ONLY indeed makes sense.
Perfect. Ideally SOC vendors do the sensible thing and ensure that FEAT_PMUv3_ICNTR is consistent on all implementations in a machine. We will hide the feature in KVM if it is not.
M1 PMU also implements a fixed instruction counter, fortunately on all CPUs. I hope they continue to do so (and ideally they implement FEAT_PMUv3 and FEAT_PMUv3_ICNTR).
Regards, Akihiko Odaki