From: Jan H. Schönherr <jschoenh@amazon.de>
It is possible to degrade host performance by manipulating performance counters from a VM and tricking the host hypervisor into enabling branch tracing. When the guest programs a performance counter to count branch instructions and deliver an interrupt after exactly one branch instruction, the value one is passed on to the host KVM/perf subsystems, where it is incorrectly treated as a special value that enables the branch trace store (BTS) facility. It should not be possible to enable BTS from a guest. When BTS is enabled, it leads to general performance degradation for both the VMs and the host.
Perf considers the combination of PERF_COUNT_HW_BRANCH_INSTRUCTIONS with a sample_period of 1 a special case and handles it as a BTS event (see intel_pmu_has_bts_period()) -- a deviation from the usual semantics, where the sample_period is the number of branch instructions to count before the overflow handler is invoked.
Nothing prevents a guest from programming its vPMU with the above settings (count branches, interrupt after one branch), which causes KVM to erroneously instruct perf to create a BTS event within pmc_reprogram_counter(), which does not have the desired semantics.
The guest could also take more benign actions and request an interrupt after a more reasonable number of branch instructions via its vPMU. In that case counting works initially. However, KVM occasionally pauses and resumes the created performance counters. If the number of branch instructions remaining until the interrupt happens to be exactly 1, pmc_resume_counter() fails to resume the counter and a BTS event is created instead, with its incorrect semantics.
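For context, here is an illustrative sketch (simplified from arch/x86/kvm/pmu.c's pmc_reprogram_counter(); exact fields, helper names and error handling vary by kernel version, so treat it as an illustration rather than the upstream code) of how the guest-programmed counter value becomes the perf sample_period -- a counter preloaded to overflow after a single branch ends up as sample_period == 1 on the host side:

/*
 * Illustrative sketch only -- not the exact upstream code, error handling
 * omitted. KVM builds a host perf event for a guest PMC roughly like this;
 * the value the guest programmed into the counter determines sample_period.
 */
static void sketch_reprogram_counter(struct kvm_pmc *pmc, u32 type, u64 config)
{
	struct perf_event_attr attr = {
		.type		= type,
		.size		= sizeof(attr),
		.config		= config,
		.pinned		= true,
		.exclude_idle	= true,
		.exclude_host	= 1,		/* guest-only event */
	};

	/* events remaining until overflow, as programmed by the guest */
	attr.sample_period = get_sample_period(pmc, pmc->counter);

	pmc->perf_event = perf_event_create_kernel_counter(&attr, -1, current,
							   kvm_perf_overflow, pmc);
}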
Fix this behavior by not passing the special value "1" as sample_period to perf. Instead, perform the same quirk that happens later in x86_perf_event_set_period() anyway, when the performance counter is transferred to the actual PMU: bump the sample_period to 2.
Testing: from the guest:

`./wrmsr -p 12 0x186 0x1100c4`
`./wrmsr -p 12 0xc1 0xffffffffffff`
`./wrmsr -p 12 0x186 0x5100c4`
This sequence programs branch instruction counting with the overflow interrupt enabled but the counter itself still disabled, preloads the counter so that it overflows after a single branch (0xffffffffffff), and then enables the counter (bit 22).
./wrmsr -p 12 0x186 0x1100c4
Writes IA32_PERFEVTSEL0 (0x186). The value 0x1100c4 breaks down as:
  Event = 0xC4 (branch instructions)
  Bits 16-17: 0x1 (user mode only)
  Bit 20: 1 (APIC interrupt on overflow)
  Bit 22: 0 (counter not yet enabled)
./wrmsr -p 12 0xc1 0xffffffffffff
Writes IA32_PMC0 (0xC1) and sets the counter to its maximum value (0xffffffffffff). This effectively sets up the counter to overflow on the next branch.
./wrmsr -p 12 0x186 0x5100c4
Updates IA32_PERFEVTSEL0 again. Same as the first write, but with bit 22 additionally set (the 0x1 nibble becomes 0x5), which enables the counter.
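To double-check the decode above, a tiny standalone program (illustrative only, not part of the patch; bit positions follow the Intel SDM IA32_PERFEVTSELx layout) can print the relevant fields of the two values and shows that the only difference between the first and third writes is bit 22 (EN):

#include <stdint.h>
#include <stdio.h>

/* IA32_PERFEVTSELx bit layout, per the Intel SDM */
#define EVT_EVENT(v)  ((unsigned)((v) & 0xff))          /* bits 0-7:  event select   */
#define EVT_UMASK(v)  ((unsigned)(((v) >> 8) & 0xff))   /* bits 8-15: unit mask      */
#define EVT_USR(v)    ((unsigned)(((v) >> 16) & 1))     /* bit 16: count user mode   */
#define EVT_OS(v)     ((unsigned)(((v) >> 17) & 1))     /* bit 17: count kernel mode */
#define EVT_EDGE(v)   ((unsigned)(((v) >> 18) & 1))     /* bit 18: edge detect       */
#define EVT_INT(v)    ((unsigned)(((v) >> 20) & 1))     /* bit 20: APIC interrupt    */
#define EVT_EN(v)     ((unsigned)(((v) >> 22) & 1))     /* bit 22: counter enable    */

static void decode(uint64_t v)
{
	printf("0x%06llx: event=0x%02x umask=0x%02x usr=%u os=%u edge=%u int=%u en=%u\n",
	       (unsigned long long)v, EVT_EVENT(v), EVT_UMASK(v), EVT_USR(v),
	       EVT_OS(v), EVT_EDGE(v), EVT_INT(v), EVT_EN(v));
}

int main(void)
{
	decode(0x1100c4);	/* first write: interrupt on overflow, counter disabled */
	decode(0x5100c4);	/* third write: same, but with the counter enabled      */
	return 0;
}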
These MSR writes are trapped by KVM and forwarded to the perf subsystem, which creates the corresponding monitoring events.
It is possible to repro this problem in a more realistic guest scenario:
`perf record -e branches:u -c 2 -a &`
`perf record -e branches:u -c 2 -a &`
This presumably triggers the issue by KVM pausing and resuming the performance counter at the wrong moment, when its value is about to overflow.
Signed-off-by: Jan H. Schönherr <jschoenh@amazon.de>
Signed-off-by: Fernand Sieber <sieberf@amazon.com>
Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>
Reviewed-by: Hendrik Borghorst <hborghor@amazon.de>
Link: https://lore.kernel.org/r/20251124100220.238177-1-sieberf@amazon.com
---
 arch/x86/kvm/pmu.c | 13 +++++++++++++
 1 file changed, 13 insertions(+)
diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
index 487ad19a236e..547512028e24 100644
--- a/arch/x86/kvm/pmu.c
+++ b/arch/x86/kvm/pmu.c
@@ -225,6 +225,19 @@ static u64 get_sample_period(struct kvm_pmc *pmc, u64 counter_value)
 {
 	u64 sample_period = (-counter_value) & pmc_bitmask(pmc);
 
+	/*
+	 * A sample_period of 1 might get mistaken by perf for a BTS event, see
+	 * intel_pmu_has_bts_period(). This would prevent re-arming the counter
+	 * via pmc_resume_counter(), followed by the accidental creation of an
+	 * actual BTS event, which we do not want.
+	 *
+	 * Avoid this by bumping the sampling period. Note, that we do not lose
+	 * any precision, because the same quirk happens later anyway (for
+	 * different reasons) in x86_perf_event_set_period().
+	 */
+	if (sample_period == 1)
+		sample_period = 2;
+
 	if (!sample_period)
 		sample_period = pmc_bitmask(pmc) + 1;
 	return sample_period;
Hi,
Thanks for your patch.
FYI: kernel test robot notices the stable kernel rule is not satisfied.
The check is based on https://www.kernel.org/doc/html/latest/process/stable-kernel-rules.html#opti...
Rule: add the tag "Cc: stable@vger.kernel.org" in the sign-off area to have the patch automatically included in the stable tree.
Subject: [PATCH] KVM: x86/pmu: Do not accidentally create BTS events
Link: https://lore.kernel.org/stable/20251201142359.344741-1-sieberf%40amazon.com
On Mon, 2025-12-01 at 16:23 +0200, Fernand Sieber wrote:
Perf considers the combination of PERF_COUNT_HW_BRANCH_INSTRUCTIONS with a sample_period of 1 a special case and handles it as a BTS event (see intel_pmu_has_bts_period()) -- a deviation from the usual semantics, where the sample_period is the number of branch instructions to count before the overflow handler is invoked.
That's kind of awful, and seems to be the real underlying cause of the KVM issue. Can we kill it with fire? Peter?
On Mon, Dec 01, 2025 at 02:45:01PM +0000, Woodhouse, David wrote:
On Mon, 2025-12-01 at 16:23 +0200, Fernand Sieber wrote:
Perf considers the combination of PERF_COUNT_HW_BRANCH_INSTRUCTIONS with a sample_period of 1 a special case and handles it as a BTS event (see intel_pmu_has_bts_period()) -- a deviation from the usual semantics, where the sample_period is the number of branch instructions to count before the overflow handler is invoked.
That's kind of awful, and seems to be the real underlying cause of the KVM issue. Can we kill it with fire? Peter?
Well, IIRC it gives the same information and was actually less expensive to run, seeing how BTS can buffer the data rather than having to take an interrupt on every event.
But it's been ages since this was done.
Now arguably it should not be done for this kvm stuff, because the data-store buffers don't virtualize (just like old PEBS).
On Mon, Dec 02, 2025 at 10:35:34AM +0100, Peter Zijlstra wrote:
On Mon, Dec 01, 2025 at 02:45:01PM +0000, Woodhouse, David wrote:
On Mon, 2025-12-01 at 16:23 +0200, Fernand Sieber wrote:
Perf considers the combination of PERF_COUNT_HW_BRANCH_INSTRUCTIONS with a sample_period of 1 a special case and handles it as a BTS event (see intel_pmu_has_bts_period()) -- a deviation from the usual semantics, where the sample_period is the number of branch instructions to count before the overflow handler is invoked.
That's kind of awful, and seems to be the real underlying cause of the KVM issue. Can we kill it with fire? Peter?
Well, IIRC it gives the same information and was actually less expensive to run, seeing how BTS can buffer the data rather than having to take an interrupt on every event.
But it's been ages since this was done.
Now arguably it should not be done for this kvm stuff, because the data-store buffers don't virtualize (just like old PEBS).
This. The current logic bypasses the restrictions on what the guest should actually be allowed to do. See `vmx_get_supported_debugctl`; specifically, the guest should not be allowed to enable BTS.
Also, semi-related to this thread: the auto-enablement of BTS for sample_period = 1 seems to yield undesirable behavior in the guest OS as well. The guest OS will try to enable BTS and get a wrmsr failure because the host KVM rejects it, which leads to incorrect behavior (no tracing happening at all).
On Mon, Dec 01, 2025 at 04:23:57PM +0200, Fernand Sieber wrote:
 arch/x86/kvm/pmu.c | 13 +++++++++++++
 1 file changed, 13 insertions(+)

diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
index 487ad19a236e..547512028e24 100644
--- a/arch/x86/kvm/pmu.c
+++ b/arch/x86/kvm/pmu.c
@@ -225,6 +225,19 @@ static u64 get_sample_period(struct kvm_pmc *pmc, u64 counter_value)
 {
 	u64 sample_period = (-counter_value) & pmc_bitmask(pmc);
 
+	/*
+	 * A sample_period of 1 might get mistaken by perf for a BTS event, see
+	 * intel_pmu_has_bts_period(). This would prevent re-arming the counter
+	 * via pmc_resume_counter(), followed by the accidental creation of an
+	 * actual BTS event, which we do not want.
+	 *
+	 * Avoid this by bumping the sampling period. Note, that we do not lose
+	 * any precision, because the same quirk happens later anyway (for
+	 * different reasons) in x86_perf_event_set_period().
+	 */
+	if (sample_period == 1)
+		sample_period = 2;
+
 	if (!sample_period)
 		sample_period = pmc_bitmask(pmc) + 1;
 	return sample_period;
Oh gawd, I so hate this kvm code. It is so ludicrously bad. The way it keeps recreating counters is just stupid. And then they complain it sucks, it does :-(
Anyway, yes this is terrible. Let me try and untangle all this, see if there's a saner solution.
On Tue, Dec 02, 2025 at 11:03:11AM +0100, Peter Zijlstra wrote:
On Mon, Dec 01, 2025 at 04:23:57PM +0200, Fernand Sieber wrote:
 arch/x86/kvm/pmu.c | 13 +++++++++++++
 1 file changed, 13 insertions(+)

diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
index 487ad19a236e..547512028e24 100644
--- a/arch/x86/kvm/pmu.c
+++ b/arch/x86/kvm/pmu.c
@@ -225,6 +225,19 @@ static u64 get_sample_period(struct kvm_pmc *pmc, u64 counter_value)
 {
 	u64 sample_period = (-counter_value) & pmc_bitmask(pmc);
 
+	/*
+	 * A sample_period of 1 might get mistaken by perf for a BTS event, see
+	 * intel_pmu_has_bts_period(). This would prevent re-arming the counter
+	 * via pmc_resume_counter(), followed by the accidental creation of an
+	 * actual BTS event, which we do not want.
+	 *
+	 * Avoid this by bumping the sampling period. Note, that we do not lose
+	 * any precision, because the same quirk happens later anyway (for
+	 * different reasons) in x86_perf_event_set_period().
+	 */
+	if (sample_period == 1)
+		sample_period = 2;
+
 	if (!sample_period)
 		sample_period = pmc_bitmask(pmc) + 1;
 	return sample_period;
Oh gawd, I so hate this kvm code. It is so ludicrously bad. The way it keeps recreating counters is just stupid. And then they complain it sucks, it does :-(
Anyway, yes this is terrible. Let me try and untangle all this, see if there's a saner solution.
Does something like so work? It is still terrible, but perhaps slightly less so.
diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h
index 2b969386dcdd..493e6ba51e06 100644
--- a/arch/x86/events/perf_event.h
+++ b/arch/x86/events/perf_event.h
@@ -1558,13 +1558,22 @@ static inline bool intel_pmu_has_bts_period(struct perf_event *event, u64 period)
 	struct hw_perf_event *hwc = &event->hw;
 	unsigned int hw_event, bts_event;
 
-	if (event->attr.freq)
+	/*
+	 * Only use BTS for fixed rate period==1 events.
+	 */
+	if (event->attr.freq || period != 1)
+		return false;
+
+	/*
+	 * BTS doesn't virtualize.
+	 */
+	if (event->attr.exclude_host)
 		return false;
 
 	hw_event = hwc->config & INTEL_ARCH_EVENT_MASK;
 	bts_event = x86_pmu.event_map(PERF_COUNT_HW_BRANCH_INSTRUCTIONS);
 
-	return hw_event == bts_event && period == 1;
+	return hw_event == bts_event;
 }
 
 static inline bool intel_pmu_has_bts(struct perf_event *event)
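(For readability, intel_pmu_has_bts_period() with the hunk above applied would read as follows; this is just the diff re-assembled, with the surrounding declarations as they appear in arch/x86/events/perf_event.h, no new logic.)

static inline bool intel_pmu_has_bts_period(struct perf_event *event, u64 period)
{
	struct hw_perf_event *hwc = &event->hw;
	unsigned int hw_event, bts_event;

	/*
	 * Only use BTS for fixed rate period==1 events.
	 */
	if (event->attr.freq || period != 1)
		return false;

	/*
	 * BTS doesn't virtualize.
	 */
	if (event->attr.exclude_host)
		return false;

	hw_event = hwc->config & INTEL_ARCH_EVENT_MASK;
	bts_event = x86_pmu.event_map(PERF_COUNT_HW_BRANCH_INSTRUCTIONS);

	return hw_event == bts_event;
}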
On Tue, Dec 02, 2025, Peter Zijlstra wrote:
Does something like so work? It is still terrible, but perhaps slightly less so.
diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h
index 2b969386dcdd..493e6ba51e06 100644
--- a/arch/x86/events/perf_event.h
+++ b/arch/x86/events/perf_event.h
@@ -1558,13 +1558,22 @@ static inline bool intel_pmu_has_bts_period(struct perf_event *event, u64 period)
 	struct hw_perf_event *hwc = &event->hw;
 	unsigned int hw_event, bts_event;
 
-	if (event->attr.freq)
+	/*
+	 * Only use BTS for fixed rate period==1 events.
+	 */
+	if (event->attr.freq || period != 1)
+		return false;
+
+	/*
+	 * BTS doesn't virtualize.
+	 */
+	if (event->attr.exclude_host)
Ya, this seems like the right fix. Pulling in the original bug report:
When BTS is enabled, it leads to general performance degradation for both the VMs and the host.
I assume the underlying problem is that intel_pmu_enable_bts() is called even when the event isn't enabled in the host, and BTS doesn't discriminate once it's enabled and bogs down the host (but presumably not the guest, at least not directly, since KVM should prevent setting BTS in vmcs.GUEST_IA32_DEBUGCTL). Enabling BTS for exclude-host events simply can't work, even if the event came from host userspace.
On Tue, Dec 02, 2025 at 01:44:23PM +0100, Peter Zijlstra wrote:
On Tue, Dec 02, 2025 at 11:03:11AM +0100, Peter Zijlstra wrote:
On Mon, Dec 01, 2025 at 04:23:57PM +0200, Fernand Sieber wrote:
 arch/x86/kvm/pmu.c | 13 +++++++++++++
 1 file changed, 13 insertions(+)

diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
index 487ad19a236e..547512028e24 100644
--- a/arch/x86/kvm/pmu.c
+++ b/arch/x86/kvm/pmu.c
@@ -225,6 +225,19 @@ static u64 get_sample_period(struct kvm_pmc *pmc, u64 counter_value)
 {
 	u64 sample_period = (-counter_value) & pmc_bitmask(pmc);
 
+	/*
+	 * A sample_period of 1 might get mistaken by perf for a BTS event, see
+	 * intel_pmu_has_bts_period(). This would prevent re-arming the counter
+	 * via pmc_resume_counter(), followed by the accidental creation of an
+	 * actual BTS event, which we do not want.
+	 *
+	 * Avoid this by bumping the sampling period. Note, that we do not lose
+	 * any precision, because the same quirk happens later anyway (for
+	 * different reasons) in x86_perf_event_set_period().
+	 */
+	if (sample_period == 1)
+		sample_period = 2;
+
 	if (!sample_period)
 		sample_period = pmc_bitmask(pmc) + 1;
 	return sample_period;
Oh gawd, I so hate this kvm code. It is so ludicrously bad. The way it keeps recreating counters is just stupid. And then they complain it sucks, it does :-(
Anyway, yes this is terrible. Let me try and untangle all this, see if there's a saner solution.
Does something like so work? It is still terrible, but perhaps slightly less so.
diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h
index 2b969386dcdd..493e6ba51e06 100644
--- a/arch/x86/events/perf_event.h
+++ b/arch/x86/events/perf_event.h
@@ -1558,13 +1558,22 @@ static inline bool intel_pmu_has_bts_period(struct perf_event *event, u64 period)
 	struct hw_perf_event *hwc = &event->hw;
 	unsigned int hw_event, bts_event;
 
-	if (event->attr.freq)
+	/*
+	 * Only use BTS for fixed rate period==1 events.
+	 */
+	if (event->attr.freq || period != 1)
+		return false;
+
+	/*
+	 * BTS doesn't virtualize.
+	 */
+	if (event->attr.exclude_host)
 		return false;
 
 	hw_event = hwc->config & INTEL_ARCH_EVENT_MASK;
 	bts_event = x86_pmu.event_map(PERF_COUNT_HW_BRANCH_INSTRUCTIONS);
 
-	return hw_event == bts_event && period == 1;
+	return hw_event == bts_event;
 }
 
 static inline bool intel_pmu_has_bts(struct perf_event *event)
Hi Peter,
I've pulled your changes and confirmed that they address the original bug report.
The repro I use runs on the host, with a guest running:

`perf record -e branches:u -c 2 -a &`
`perf record -e branches:u -c 2 -a &`

Then I monitor the enablement of BTS on the host and verify that without the change BTS is enabled, and with the change it is not.
This looks good to me; should we go ahead with your changes, then?
--Fernand
On Wed, Dec 10, 2025 at 12:11:47PM +0200, Fernand Sieber wrote:
Does something like so work? It is still terrible, but perhaps slightly less so.
diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h
index 2b969386dcdd..493e6ba51e06 100644
--- a/arch/x86/events/perf_event.h
+++ b/arch/x86/events/perf_event.h
@@ -1558,13 +1558,22 @@ static inline bool intel_pmu_has_bts_period(struct perf_event *event, u64 period)
 	struct hw_perf_event *hwc = &event->hw;
 	unsigned int hw_event, bts_event;
 
-	if (event->attr.freq)
+	/*
+	 * Only use BTS for fixed rate period==1 events.
+	 */
+	if (event->attr.freq || period != 1)
+		return false;
+
+	/*
+	 * BTS doesn't virtualize.
+	 */
+	if (event->attr.exclude_host)
 		return false;
 
 	hw_event = hwc->config & INTEL_ARCH_EVENT_MASK;
 	bts_event = x86_pmu.event_map(PERF_COUNT_HW_BRANCH_INSTRUCTIONS);
 
-	return hw_event == bts_event && period == 1;
+	return hw_event == bts_event;
 }
 
 static inline bool intel_pmu_has_bts(struct perf_event *event)
Hi Peter,
I've pulled your changes and confirmed that they address the original bug report.
The repro I use runs on the host, with a guest running:

`perf record -e branches:u -c 2 -a &`
`perf record -e branches:u -c 2 -a &`

Then I monitor the enablement of BTS on the host and verify that without the change BTS is enabled, and with the change it is not.
This looks good to me; should we go ahead with your changes, then?
Yeah, I suppose. Please stick a coherent changelog on and repost.
By default, when users program perf to sample branch instructions (PERF_COUNT_HW_BRANCH_INSTRUCTIONS) with a sample period of 1, perf interprets this as a special case and enables BTS (Branch Trace Store) as an optimization to avoid taking an interrupt on every branch.
Since BTS doesn't virtualize, this optimization doesn't make sense when the request originates from a guest. Add an additional check that prevents this optimization for virtualized events (exclude_host).
Reported-by: Jan H. Schönherr <jschoenh@amazon.de>
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Fernand Sieber <sieberf@amazon.com>
---
 arch/x86/events/perf_event.h | 13 +++++++++++--
 1 file changed, 11 insertions(+), 2 deletions(-)
diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h
index 3161ec0a3416..f2e2d9b03367 100644
--- a/arch/x86/events/perf_event.h
+++ b/arch/x86/events/perf_event.h
@@ -1574,13 +1574,22 @@ static inline bool intel_pmu_has_bts_period(struct perf_event *event, u64 period)
 	struct hw_perf_event *hwc = &event->hw;
 	unsigned int hw_event, bts_event;
 
-	if (event->attr.freq)
+	/*
+	 * Only use BTS for fixed rate period==1 events.
+	 */
+	if (event->attr.freq || period != 1)
+		return false;
+
+	/*
+	 * BTS doesn't virtualize.
+	 */
+	if (event->attr.exclude_host)
 		return false;
 
 	hw_event = hwc->config & INTEL_ARCH_EVENT_MASK;
 	bts_event = x86_pmu.event_map(PERF_COUNT_HW_BRANCH_INSTRUCTIONS);
 
-	return hw_event == bts_event && period == 1;
+	return hw_event == bts_event;
 }
 
 static inline bool intel_pmu_has_bts(struct perf_event *event)
Hi,
Thanks for your patch.
FYI: kernel test robot notices the stable kernel rule is not satisfied.
The check is based on https://www.kernel.org/doc/html/latest/process/stable-kernel-rules.html#opti...
Rule: add the tag "Cc: stable@vger.kernel.org" in the sign-off area to have the patch automatically included in the stable tree.
Subject: [PATCH v2] perf/x86/intel: Do not enable BTS for guests
Link: https://lore.kernel.org/stable/20251211183604.868641-1-sieberf%40amazon.com