Hi all,
This backport wires up AMD perfmon v2 so BPF and other software clients can snapshot LBR stacks on demand, similar to the Intel support upstream. The series keeps the LBR-freeze path branchless, adds the perf_snapshot_branch_stack callback for AMD, and drops the sampling-only restriction now that snapshots can be taken from software contexts.
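For anyone wanting to exercise the result from BPF, the consumer side is the bpf_get_branch_snapshot() helper. Below is a minimal libbpf-style sketch; the attach point, buffer size, and program name are illustrative only and not part of this series:

#include <vmlinux.h>
#include <bpf/bpf_helpers.h>

/* global buffer: the 512-byte BPF stack cannot hold a full LBR stack */
struct perf_branch_entry lbr_entries[32];

SEC("kprobe/do_nanosleep")	/* hypothetical attach point */
int snapshot_lbr(struct pt_regs *ctx)
{
	/* an LBR-enabled perf event must be active on this CPU (see the
	 * last patch); returns bytes written, or a negative error */
	long sz = bpf_get_branch_snapshot(lbr_entries, sizeof(lbr_entries), 0);

	if (sz > 0)
		bpf_printk("captured %ld bytes of LBR data", sz);
	return 0;
}

char LICENSE[] SEC("license") = "GPL";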
Leon Hwang (4):
  perf/x86/amd: Ensure amd_pmu_core_disable_all() is always inlined
  perf/x86/amd: Avoid taking branches before disabling LBR
  perf/x86/amd: Support capturing LBR from software events
  perf/x86/amd: Don't reject non-sampling events with configured LBR
 arch/x86/events/amd/core.c   | 37 +++++++++++++++++++++++++++++++++++-
 arch/x86/events/amd/lbr.c    | 13 +------------
 arch/x86/events/perf_event.h | 13 +++++++++++++
 3 files changed, 50 insertions(+), 13 deletions(-)
--
2.52.0
[ Upstream commit 0dbf66fa7e80024629f816c2ec7a9f3d39637822 ]
In the following patches we will enable LBR capture on AMD CPUs at an arbitrary point in time, which means that LBR recording won't be frozen by hardware automatically as part of a hardware overflow event. So we need to take care to minimize the number of branches and function calls/returns on the path to freezing LBR, altering the LBR snapshot as little as possible.
amd_pmu_core_disable_all() is one of the functions on this path, and is already marked as __always_inline. But it calls amd_pmu_set_global_ctl(), which is marked as just inline. So, to guarantee that no function call is generated throughout, mark amd_pmu_set_global_ctl() as __always_inline as well.
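For reference, plain inline in kernel code is only a hint to the compiler, whereas __always_inline carries the always_inline function attribute and forces expansion at every call site. A quick sketch of the distinction (the function names here are illustrative):

/* may still be emitted as an out-of-line call, e.g. when the compiler
 * optimizes for size, polluting the LBR with a call/return pair */
static inline void set_global_ctl_hint(u64 ctl)
{
	wrmsrl(MSR_AMD64_PERF_CNTR_GLOBAL_CTL, ctl);
}

/* guaranteed to be expanded at the call site, so no extra branches are
 * recorded before the LBR is frozen */
static __always_inline void set_global_ctl_forced(u64 ctl)
{
	wrmsrl(MSR_AMD64_PERF_CNTR_GLOBAL_CTL, ctl);
}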
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Sandipan Das <sandipan.das@amd.com>
Link: https://lore.kernel.org/r/20240402022118.1046049-2-andrii@kernel.org
Signed-off-by: Leon Hwang <leon.hwang@linux.dev>
---
 arch/x86/events/amd/core.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/arch/x86/events/amd/core.c b/arch/x86/events/amd/core.c
index aa8fc2cf1bde..3b09ec4f2b0d 100644
--- a/arch/x86/events/amd/core.c
+++ b/arch/x86/events/amd/core.c
@@ -632,7 +632,7 @@ static void amd_pmu_cpu_dead(int cpu)
 	}
 }
 
-static inline void amd_pmu_set_global_ctl(u64 ctl)
+static __always_inline void amd_pmu_set_global_ctl(u64 ctl)
 {
 	wrmsrl(MSR_AMD64_PERF_CNTR_GLOBAL_CTL, ctl);
 }
[ Upstream commit 1eddf187e5d087de4560ec7c3baa2f8283920710 ]
In the following patches we will enable LBR capture on AMD CPUs at an arbitrary point in time, which means that LBR recording won't be frozen by hardware automatically as part of a hardware overflow event. So we need to take care to minimize the number of branches and function calls/returns on the path to freezing LBR, altering the LBR snapshot as little as possible.
As such, split out the LBR-disabling logic from the sanity-checking logic inside amd_pmu_lbr_disable_all(). This ensures that no branches are taken before LBR is frozen in the functionality added in the next patch. Use __always_inline to also eliminate any possible function calls.
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Sandipan Das <sandipan.das@amd.com>
Link: https://lore.kernel.org/r/20240402022118.1046049-3-andrii@kernel.org
Signed-off-by: Leon Hwang <leon.hwang@linux.dev>
---
 arch/x86/events/amd/lbr.c    |  9 +--------
 arch/x86/events/perf_event.h | 13 +++++++++++++
 2 files changed, 14 insertions(+), 8 deletions(-)
diff --git a/arch/x86/events/amd/lbr.c b/arch/x86/events/amd/lbr.c
index 5149830c7c4f..33d0a45c0cd3 100644
--- a/arch/x86/events/amd/lbr.c
+++ b/arch/x86/events/amd/lbr.c
@@ -414,18 +414,11 @@ void amd_pmu_lbr_enable_all(void)
 void amd_pmu_lbr_disable_all(void)
 {
 	struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
-	u64 dbg_ctl, dbg_extn_cfg;
 
 	if (!cpuc->lbr_users || !x86_pmu.lbr_nr)
 		return;
 
-	rdmsrl(MSR_AMD_DBG_EXTN_CFG, dbg_extn_cfg);
-	wrmsrl(MSR_AMD_DBG_EXTN_CFG, dbg_extn_cfg & ~DBG_EXTN_CFG_LBRV2EN);
-
-	if (cpu_feature_enabled(X86_FEATURE_AMD_LBR_PMC_FREEZE)) {
-		rdmsrl(MSR_IA32_DEBUGCTLMSR, dbg_ctl);
-		wrmsrl(MSR_IA32_DEBUGCTLMSR, dbg_ctl & ~DEBUGCTLMSR_FREEZE_LBRS_ON_PMI);
-	}
+	__amd_pmu_lbr_disable();
 }
 
 __init int amd_pmu_lbr_init(void)
diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h
index 4564521296ac..f1dda8ee0e29 100644
--- a/arch/x86/events/perf_event.h
+++ b/arch/x86/events/perf_event.h
@@ -1315,6 +1315,19 @@ void amd_pmu_lbr_enable_all(void);
 void amd_pmu_lbr_disable_all(void);
 int amd_pmu_lbr_hw_config(struct perf_event *event);
 
+static __always_inline void __amd_pmu_lbr_disable(void)
+{
+	u64 dbg_ctl, dbg_extn_cfg;
+
+	rdmsrl(MSR_AMD_DBG_EXTN_CFG, dbg_extn_cfg);
+	wrmsrl(MSR_AMD_DBG_EXTN_CFG, dbg_extn_cfg & ~DBG_EXTN_CFG_LBRV2EN);
+
+	if (cpu_feature_enabled(X86_FEATURE_AMD_LBR_PMC_FREEZE)) {
+		rdmsrl(MSR_IA32_DEBUGCTLMSR, dbg_ctl);
+		wrmsrl(MSR_IA32_DEBUGCTLMSR, dbg_ctl & ~DEBUGCTLMSR_FREEZE_LBRS_ON_PMI);
+	}
+}
+
 #ifdef CONFIG_PERF_EVENTS_AMD_BRS
 
 #define AMD_FAM19H_BRS_EVENT 0xc4 /* RETIRED_TAKEN_BRANCH_INSTRUCTIONS */
[ Upstream commit a4d18112e5317c120bcadeb486fbe950f749bb5e ]
Upstream commit c22ac2a3d4bd ("perf: Enable branch record for software events") added the ability to capture LBR (Last Branch Records) on Intel CPUs from inside a BPF program at pretty much any arbitrary point. This is an extremely useful capability that helps to figure out otherwise hard-to-debug problems, because LBR snapshots can now be triggered by application-defined conditions, not just hardware-supported events.
'retsnoop' is one such tool that takes full advantage of this functionality and has proved extremely useful in practice:
https://github.com/anakryiko/retsnoop
Now, AMD Zen4 CPUs support similar LBR functionality, but the necessary wiring inside the kernel is not yet set up. This patch rectifies that, following a similar approach to the original patch for Intel CPUs: we implement an AMD-specific callback that is wired up to be invoked through the perf_snapshot_branch_stack static call.
Previous preparatory patches ensured that amd_pmu_core_disable_all() and __amd_pmu_lbr_disable() will be completely inlined and will have no branches, so LBR snapshot contamination will be minimized.
This was tested on AMD Bergamo CPU and worked well when utilized from the aforementioned retsnoop tool.
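For reference, here is a hedged sketch of how a kernel-side consumer reaches this callback once it is registered; this mirrors what the bpf_get_branch_snapshot() helper does in kernel/trace/bpf_trace.c, and the wrapper name is illustrative:

#include <linux/perf_event.h>

/* dispatches to amd_pmu_v2_snapshot_branch_stack() once the
 * static_call_update() in this patch has run; returns the number of
 * perf_branch_entry records copied into buf */
static int capture_lbr_snapshot(struct perf_branch_entry *buf, unsigned int cnt)
{
	return static_call(perf_snapshot_branch_stack)(buf, cnt);
}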
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Sandipan Das <sandipan.das@amd.com>
Link: https://lore.kernel.org/r/20240402022118.1046049-4-andrii@kernel.org
Signed-off-by: Leon Hwang <leon.hwang@linux.dev>
---
 arch/x86/events/amd/core.c | 35 +++++++++++++++++++++++++++++++++++
 1 file changed, 35 insertions(+)
diff --git a/arch/x86/events/amd/core.c b/arch/x86/events/amd/core.c
index 3b09ec4f2b0d..002661418830 100644
--- a/arch/x86/events/amd/core.c
+++ b/arch/x86/events/amd/core.c
@@ -892,6 +892,37 @@ static int amd_pmu_handle_irq(struct pt_regs *regs)
 	return amd_pmu_adjust_nmi_window(handled);
 }
 
+/*
+ * AMD-specific callback invoked through perf_snapshot_branch_stack static
+ * call, defined in include/linux/perf_event.h. See its definition for API
+ * details. It's up to caller to provide enough space in *entries* to fit all
+ * LBR records, otherwise returned result will be truncated to *cnt* entries.
+ */
+static int amd_pmu_v2_snapshot_branch_stack(struct perf_branch_entry *entries, unsigned int cnt)
+{
+	struct cpu_hw_events *cpuc;
+	unsigned long flags;
+
+	/*
+	 * The sequence of steps to freeze LBR should be completely inlined
+	 * and contain no branches to minimize contamination of LBR snapshot
+	 */
+	local_irq_save(flags);
+	amd_pmu_core_disable_all();
+	__amd_pmu_lbr_disable();
+
+	cpuc = this_cpu_ptr(&cpu_hw_events);
+
+	amd_pmu_lbr_read();
+	cnt = min(cnt, x86_pmu.lbr_nr);
+	memcpy(entries, cpuc->lbr_entries, sizeof(struct perf_branch_entry) * cnt);
+
+	amd_pmu_v2_enable_all(0);
+	local_irq_restore(flags);
+
+	return cnt;
+}
+
 static int amd_pmu_v2_handle_irq(struct pt_regs *regs)
 {
 	struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
@@ -1434,6 +1465,10 @@ static int __init amd_core_pmu_init(void)
 		static_call_update(amd_pmu_branch_reset, amd_pmu_lbr_reset);
 		static_call_update(amd_pmu_branch_add, amd_pmu_lbr_add);
 		static_call_update(amd_pmu_branch_del, amd_pmu_lbr_del);
+
+		/* Only support branch_stack snapshot on perfmon v2 */
+		if (x86_pmu.handle_irq == amd_pmu_v2_handle_irq)
+			static_call_update(perf_snapshot_branch_stack, amd_pmu_v2_snapshot_branch_stack);
 	} else if (!amd_brs_init()) {
 		/*
 		 * BRS requires special event constraints and flushing on ctxsw.
[ Upstream commit 9794563d4d053b1b46a0cc91901f0a11d8678c19 ]
Now that it's possible to capture LBR on AMD CPUs from BPF at an arbitrary point, there is no reason to artificially limit this feature to sampling events only, so the corresponding check is removed. AFAIU, there are no correctness implications of doing this (and it was possible to bypass this check by just setting the perf_event's sample_period to 1 anyway, so it didn't guard all that much).
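To illustrate what is now allowed, here is a userspace sketch of opening a counting-mode (non-sampling) event with an LBR configuration, which this check used to reject; the exact config values are only an example:

#include <linux/perf_event.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

int open_counting_lbr_event(void)
{
	struct perf_event_attr attr;

	memset(&attr, 0, sizeof(attr));
	attr.size = sizeof(attr);
	attr.type = PERF_TYPE_HARDWARE;
	attr.config = PERF_COUNT_HW_CPU_CYCLES;
	attr.sample_type = PERF_SAMPLE_BRANCH_STACK;
	attr.branch_sample_type = PERF_SAMPLE_BRANCH_ANY | PERF_SAMPLE_BRANCH_KERNEL;
	/* no sample_period/sample_freq: a pure counting event, which
	 * amd_pmu_lbr_hw_config() rejected with -EINVAL before this patch */

	return syscall(__NR_perf_event_open, &attr, 0 /* this task */,
		       -1 /* any cpu */, -1 /* no group */, 0);
}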
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Sandipan Das <sandipan.das@amd.com>
Link: https://lore.kernel.org/r/20240402022118.1046049-5-andrii@kernel.org
Signed-off-by: Leon Hwang <leon.hwang@linux.dev>
---
 arch/x86/events/amd/lbr.c | 4 ----
 1 file changed, 4 deletions(-)
diff --git a/arch/x86/events/amd/lbr.c b/arch/x86/events/amd/lbr.c
index 33d0a45c0cd3..19c7b76e21bc 100644
--- a/arch/x86/events/amd/lbr.c
+++ b/arch/x86/events/amd/lbr.c
@@ -310,10 +310,6 @@ int amd_pmu_lbr_hw_config(struct perf_event *event)
 {
 	int ret = 0;
 
-	/* LBR is not recommended in counting mode */
-	if (!is_sampling_event(event))
-		return -EINVAL;
-
 	ret = amd_pmu_lbr_setup_filter(event);
 	if (!ret)
 		event->attach_state |= PERF_ATTACH_SCHED_CB;