On 01/07/2025 3:53 pm, Leo Yan wrote:
> From: Yabin Cui <yabinc@google.com>
>
> Similar to ETE, TRBE may lose its context when a CPU enters a low power state. To make things worse, if ETE is restored without TRBE being restored, an enabled source device with no enabled sink devices can cause a CPU hang on some devices (e.g., Pixel 9).
>
> The save and restore flows are described in section K5.5 "Context switching" of the Arm ARM (ARM DDI 0487 L.a). This commit adds save and restore callbacks following the software usage defined in the architecture manual.
>
> Signed-off-by: Yabin Cui <yabinc@google.com>
> Co-developed-by: Leo Yan <leo.yan@arm.com>
> Signed-off-by: Leo Yan <leo.yan@arm.com>
Hi Leo,
I tested this commit to try to avoid hitting any issues with the last 3 hotplug changes, but ran into two issues. They were hit when running the CPU online/offline/enable_source stress test and then, after that, running the Perf "Check Arm CoreSight trace data recording and synthesized samples" test.

The issues hit when running the tests in either order, but not when running only one of them after a reboot.

The first one appears just from running one of the tests:
=====================================================
WARNING: HARDIRQ-safe -> HARDIRQ-unsafe lock order detected
6.16.0-rc3+ #475 Not tainted
-----------------------------------------------------
perf-exec/709 [HC0[0]:SC0[0]:HE0:SE1] is trying to acquire:
ffff000804002cd0 (&drvdata->spinlock){+.+.}-{2:2}, at: cti_enable+0x40/0x130 [coresight_cti]
and this task is already holding:
ffff00080ab67e18 (&ctx->lock){....}-{2:2}, at: perf_event_exec+0xc4/0x6b8
which would create a new lock dependency:
 (&ctx->lock){....}-{2:2} -> (&drvdata->spinlock){+.+.}-{2:2}
but this new dependency connects a HARDIRQ-irq-safe lock:
 (&cpuctx_lock){-...}-{2:2}
... which became HARDIRQ-irq-safe at:
  lock_acquire+0x130/0x2c0
  _raw_spin_lock+0x60/0xa8
  __perf_install_in_context+0x5c/0x2f0
  remote_function+0x58/0x78
  __flush_smp_call_function_queue+0x1d8/0x9c0
  generic_smp_call_function_single_interrupt+0x20/0x38
  ipi_handler+0x118/0x338
  handle_percpu_devid_irq+0xb0/0x180
  generic_handle_domain_irq+0x4c/0x78
  gic_handle_irq+0x68/0xf0
  call_on_irq_stack+0x24/0x30
  do_interrupt_handler+0x88/0xd0
  el1_interrupt+0x34/0x68
  el1h_64_irq_handler+0x18/0x28
  el1h_64_irq+0x6c/0x70
  arch_local_irq_enable+0x8/0x10
  cpuidle_enter+0x44/0x68
  do_idle+0x1b0/0x2b8
  cpu_startup_entry+0x40/0x50
  rest_init+0x1c4/0x1d0
  start_kernel+0x394/0x458
  __primary_switched+0x88/0x98
to a HARDIRQ-irq-unsafe lock:
 (&drvdata->spinlock){+.+.}-{2:2}
... which became HARDIRQ-irq-unsafe at:
...
  lock_acquire+0x130/0x2c0
  _raw_spin_lock+0x60/0xa8
  cti_disable+0x38/0xe8 [coresight_cti]
  coresight_disable_source+0x88/0xa8 [coresight]
  coresight_disable_sysfs+0xd0/0x1f0 [coresight]
  enable_source_store+0x78/0xb0 [coresight]
  dev_attr_store+0x24/0x40
  sysfs_kf_write+0xa8/0xd0
  kernfs_fop_write_iter+0x114/0x1c0
  vfs_write+0x2d8/0x310
  ksys_write+0x80/0xf8
  __arm64_sys_write+0x28/0x40
  invoke_syscall+0x4c/0x110
  el0_svc_common+0xb8/0xf0
  do_el0_svc+0x28/0x40
  el0_svc+0x4c/0xe8
  el0t_64_sync_handler+0x84/0x108
  el0t_64_sync+0x198/0x1a0
other info that might help us debug this:
Chain exists of:
  &cpuctx_lock --> &ctx->lock --> &drvdata->spinlock
Possible interrupt unsafe locking scenario:
       CPU0                    CPU1
       ----                    ----
  lock(&drvdata->spinlock);
                               local_irq_disable();
                               lock(&cpuctx_lock);
                               lock(&ctx->lock);
  <Interrupt>
    lock(&cpuctx_lock);
*** DEADLOCK ***
4 locks held by perf-exec/709:
 #0: ffff0008066b66f8 (&sig->cred_guard_mutex){+.+.}-{4:4}, at: bprm_execve+0x54/0x690
 #1: ffff0008066b67a0 (&sig->exec_update_lock){++++}-{4:4}, at: exec_mmap+0x48/0x2b0
 #2: ffff000976a467f0 (&cpuctx_lock){-...}-{2:2}, at: perf_event_exec+0xb4/0x6b8
 #3: ffff00080ab67e18 (&ctx->lock){....}-{2:2}, at: perf_event_exec+0xc4/0x6b8
the dependencies between HARDIRQ-irq-safe lock and the holding lock:
-> (&cpuctx_lock){-...}-{2:2} {
   IN-HARDIRQ-W at:
     lock_acquire+0x130/0x2c0
     _raw_spin_lock+0x60/0xa8
     __perf_install_in_context+0x5c/0x2f0
     remote_function+0x58/0x78
     __flush_smp_call_function_queue+0x1d8/0x9c0
     generic_smp_call_function_single_interrupt+0x20/0x38
     ipi_handler+0x118/0x338
     handle_percpu_devid_irq+0xb0/0x180
     generic_handle_domain_irq+0x4c/0x78
     gic_handle_irq+0x68/0xf0
     call_on_irq_stack+0x24/0x30
     do_interrupt_handler+0x88/0xd0
     el1_interrupt+0x34/0x68
     el1h_64_irq_handler+0x18/0x28
     el1h_64_irq+0x6c/0x70
     arch_local_irq_enable+0x8/0x10
     cpuidle_enter+0x44/0x68
     do_idle+0x1b0/0x2b8
     cpu_startup_entry+0x40/0x50
     rest_init+0x1c4/0x1d0
     start_kernel+0x394/0x458
     __primary_switched+0x88/0x98
   INITIAL USE at:
     lock_acquire+0x130/0x2c0
     _raw_spin_lock+0x60/0xa8
     __perf_event_exit_context+0x3c/0xb0
     generic_exec_single+0xb0/0x3a8
     smp_call_function_single+0x180/0xa98
     perf_event_exit_cpu+0x344/0x3d8
     cpuhp_invoke_callback+0x120/0x2a0
     cpuhp_thread_fun+0x170/0x1d8
     smpboot_thread_fn+0x1c0/0x328
     kthread+0x148/0x250
     ret_from_fork+0x10/0x20
 }
 ... key at: [<ffff800082bbe238>] cpuctx_lock+0x0/0x10
-> (&ctx->lock){....}-{2:2} {
   INITIAL USE at:
     lock_acquire+0x130/0x2c0
     _raw_spin_lock_irq+0x70/0xb8
     find_get_pmu_context+0x88/0x238
     __arm64_sys_perf_event_open+0x794/0x1150
     invoke_syscall+0x4c/0x110
     el0_svc_common+0xb8/0xf0
     do_el0_svc+0x28/0x40
     el0_svc+0x4c/0xe8
     el0t_64_sync_handler+0x84/0x108
     el0t_64_sync+0x198/0x1a0
 }
 ... key at: [<ffff800082bbe1d0>] __perf_event_init_context.__key+0x0/0x10
 ... acquired at:
   _raw_spin_lock+0x60/0xa8
   __perf_install_in_context+0x6c/0x2f0
   remote_function+0x58/0x78
   generic_exec_single+0xb0/0x3a8
   smp_call_function_single+0x180/0xa98
   perf_install_in_context+0x1a0/0x290
   __arm64_sys_perf_event_open+0x103c/0x1150
   invoke_syscall+0x4c/0x110
   el0_svc_common+0xb8/0xf0
   do_el0_svc+0x28/0x40
   el0_svc+0x4c/0xe8
   el0t_64_sync_handler+0x84/0x108
   el0t_64_sync+0x198/0x1a0
the dependencies between the lock to be acquired and HARDIRQ-irq-unsafe lock:
-> (&drvdata->spinlock){+.+.}-{2:2} {
   HARDIRQ-ON-W at:
     lock_acquire+0x130/0x2c0
     _raw_spin_lock+0x60/0xa8
     cti_disable+0x38/0xe8 [coresight_cti]
     coresight_disable_source+0x88/0xa8 [coresight]
     coresight_disable_sysfs+0xd0/0x1f0 [coresight]
     enable_source_store+0x78/0xb0 [coresight]
     dev_attr_store+0x24/0x40
     sysfs_kf_write+0xa8/0xd0
     kernfs_fop_write_iter+0x114/0x1c0
     vfs_write+0x2d8/0x310
     ksys_write+0x80/0xf8
     __arm64_sys_write+0x28/0x40
     invoke_syscall+0x4c/0x110
     el0_svc_common+0xb8/0xf0
     do_el0_svc+0x28/0x40
     el0_svc+0x4c/0xe8
     el0t_64_sync_handler+0x84/0x108
     el0t_64_sync+0x198/0x1a0
   SOFTIRQ-ON-W at:
     lock_acquire+0x130/0x2c0
     _raw_spin_lock+0x60/0xa8
     cti_disable+0x38/0xe8 [coresight_cti]
     coresight_disable_source+0x88/0xa8 [coresight]
     coresight_disable_sysfs+0xd0/0x1f0 [coresight]
     enable_source_store+0x78/0xb0 [coresight]
     dev_attr_store+0x24/0x40
     sysfs_kf_write+0xa8/0xd0
     kernfs_fop_write_iter+0x114/0x1c0
     vfs_write+0x2d8/0x310
     ksys_write+0x80/0xf8
     __arm64_sys_write+0x28/0x40
     invoke_syscall+0x4c/0x110
     el0_svc_common+0xb8/0xf0
     do_el0_svc+0x28/0x40
     el0_svc+0x4c/0xe8
     el0t_64_sync_handler+0x84/0x108
     el0t_64_sync+0x198/0x1a0
   INITIAL USE at:
     lock_acquire+0x130/0x2c0
     _raw_spin_lock+0x60/0xa8
     cti_cpu_pm_notify+0x54/0x160 [coresight_cti]
     notifier_call_chain+0xb8/0x1b8
     raw_notifier_call_chain_robust+0x50/0xb0
     cpu_pm_enter+0x50/0x90
     psci_enter_idle_state+0x3c/0x80
     cpuidle_enter_state+0x158/0x340
     cpuidle_enter+0x44/0x68
     do_idle+0x1b0/0x2b8
     cpu_startup_entry+0x40/0x50
     secondary_start_kernel+0x120/0x150
     __secondary_switched+0xc0/0xc8
 }
 ... key at: [<ffff80007b10d2a8>] cti_probe.__key+0x0/0xffffffffffffdd58 [coresight_cti]
 ...
 acquired at:
   _raw_spin_lock_irqsave+0x70/0xc0
   cti_enable+0x40/0x130 [coresight_cti]
   _coresight_enable_path+0x134/0x3c0 [coresight]
   coresight_enable_path+0x28/0x88 [coresight]
   etm_event_start+0xe0/0x228 [coresight]
   etm_event_add+0x40/0x68 [coresight]
   event_sched_in+0x270/0x418
   visit_groups_merge+0x428/0xcd0
   __pmu_ctx_sched_in+0xa0/0xe0
   ctx_sched_in+0x110/0x188
   ctx_resched+0x1c0/0x2b8
   perf_event_exec+0x29c/0x6b8
   begin_new_exec+0x378/0x558
   load_elf_binary+0x2b0/0xb00
   bprm_execve+0x394/0x690
   do_execveat_common+0x2a0/0x300
   __arm64_sys_execve+0x50/0x70
   invoke_syscall+0x4c/0x110
   el0_svc_common+0xb8/0xf0
   do_el0_svc+0x28/0x40
   el0_svc+0x4c/0xe8
   el0t_64_sync_handler+0x84/0x108
   el0t_64_sync+0x198/0x1a0
===============================================
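If it helps with the analysis: the splat reads like the usual HARDIRQ inversion pattern. Per the traces above, cti_enable() already takes drvdata->spinlock with the irqsave variant, but the cti_disable() sysfs path takes it with a plain spin_lock(), which is what marks the lock HARDIRQ-unsafe; meanwhile the perf path now acquires it with cpuctx_lock/ctx->lock held and interrupts off. A minimal sketch of the conventional fix, i.e. making every acquisition IRQ-safe (hypothetical code, not the actual coresight-cti functions):

```c
/*
 * Hypothetical sketch only, not the real driver code: once a lock can
 * be taken while IRQ-disabled perf locks are held, every other taker
 * must disable IRQs too, or lockdep reports the inversion above.
 */
static int cti_disable_sketch(struct cti_drvdata *drvdata)
{
	unsigned long flags;

	spin_lock_irqsave(&drvdata->spinlock, flags);	/* was spin_lock() */
	/* ... hardware disable sequence elided ... */
	spin_unlock_irqrestore(&drvdata->spinlock, flags);

	return 0;
}
```

The same would apply to any other plain spin_lock() taker of drvdata->spinlock in the CTI driver, if that is indeed the cause here.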
And the second one is when reloading the modules:
$ sudo rmmod coresight_stm coresight_funnel stm_core coresight_replicator \
    coresight_tpiu coresight_etm4x coresight_tmc coresight_cti \
    coresight_cpu_debug coresight_trbe coresight
$ sudo modprobe coresight; sudo modprobe coresight_stm; \
  sudo modprobe coresight_funnel; sudo modprobe stm_core; \
  sudo modprobe coresight_replicator; sudo modprobe coresight_cpu_debug; \
  sudo modprobe coresight_tpiu; sudo modprobe coresight_etm4x; \
  sudo modprobe coresight_tmc; sudo modprobe coresight_trbe; \
  sudo modprobe coresight_cti
Unable to handle kernel NULL pointer dereference at virtual address 00000000000004f0
pc : cti_cpu_pm_notify+0x74/0x160 [coresight_cti]
lr : cti_cpu_pm_notify+0x54/0x160 [coresight_cti]
Call trace:
 cti_cpu_pm_notify+0x74/0x160 [coresight_cti] (P)
 notifier_call_chain+0xb8/0x1b8
 raw_notifier_call_chain_robust+0x50/0xb0
 cpu_pm_enter+0x50/0x90
 psci_enter_idle_state+0x3c/0x80
 cpuidle_enter_state+0x158/0x340
 cpuidle_enter+0x44/0x68
 do_idle+0x1b0/0x2b8
 cpu_startup_entry+0x40/0x50
 secondary_start_kernel+0x120/0x150
 __secondary_switched+0xc0/0xc8