On Wed, Oct 25, 2023 at 11:26 PM Geert Uytterhoeven geert@linux-m68k.org wrote:
On Wed, Oct 25, 2023 at 9:53 PM Geert Uytterhoeven geert@linux-m68k.org wrote:
On Wed, Oct 25, 2023 at 8:39 PM Guenter Roeck linux@roeck-us.net wrote:
On 10/25/23 10:05, Geert Uytterhoeven wrote:
On Wed, Oct 25, 2023 at 2:35 PM Geert Uytterhoeven geert@linux-m68k.org wrote:
On Wed, Oct 25, 2023 at 12:53 PM Geert Uytterhoeven geert@linux-m68k.org wrote:
On Wed, Oct 25, 2023 at 12:47 PM Geert Uytterhoeven geert@linux-m68k.org wrote: > On Tue, Oct 24, 2023 at 9:22 PM Pavel Machek pavel@denx.de wrote: >> But we still have failures on Renesas with 5.10.199-rc2: >> >> https://gitlab.com/cip-project/cip-testing/linux-stable-rc-ci/-/pipelines/10... >> >> And they still happed during MMC init: >> >> 2.638013] renesas_sdhi_internal_dmac ee100000.mmc: Got CD GPIO >> [ 2.638846] INFO: trying to register non-static key. >> [ 2.644192] ledtrig-cpu: registered to indicate activity on CPUs >> [ 2.649066] The code is fine but needs lockdep annotation, or maybe >> [ 2.649069] you didn't initialize this object before use? >> [ 2.649071] turning off the locking correctness validator. >> [ 2.649080] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.10.199-rc2-arm64-renesas-ge31b6513c43d #1 >> [ 2.649082] Hardware name: HopeRun HiHope RZ/G2M with sub board (DT) >> [ 2.649086] Call trace: >> [ 2.655106] SMCCC: SOC_ID: ARCH_SOC_ID not implemented, skipping .... >> [ 2.661354] dump_backtrace+0x0/0x194 >> [ 2.661361] show_stack+0x14/0x20 >> [ 2.667430] usbcore: registered new interface driver usbhid >> [ 2.672230] dump_stack+0xe8/0x130 >> [ 2.672238] register_lock_class+0x480/0x514 >> [ 2.672244] __lock_acquire+0x74/0x20ec >> [ 2.681113] usbhid: USB HID core driver >> [ 2.687450] lock_acquire+0x218/0x350 >> [ 2.687456] _raw_spin_lock+0x58/0x80 >> [ 2.687464] tmio_mmc_irq+0x410/0x9ac >> [ 2.688556] renesas_sdhi_internal_dmac ee160000.mmc: mmc0 base at 0x00000000ee160000, max clock rate 200 MHz >> [ 2.744936] __handle_irq_event_percpu+0xbc/0x340 >> [ 2.749635] handle_irq_event+0x60/0x100 >> [ 2.753553] handle_fasteoi_irq+0xa0/0x1ec >> [ 2.757644] __handle_domain_irq+0x7c/0xdc >> [ 2.761736] efi_header_end+0x4c/0xd0 >> [ 2.765393] el1_irq+0xcc/0x180 >> [ 2.768530] arch_cpu_idle+0x14/0x2c >> [ 2.772100] default_idle_call+0x58/0xe4 >> [ 2.776019] do_idle+0x244/0x2c0 >> [ 2.779242] cpu_startup_entry+0x20/0x6c >> [ 2.783160] rest_init+0x164/0x28c >> [ 2.786561] arch_call_rest_init+0xc/0x14 >> [ 2.790565] start_kernel+0x4c4/0x4f8 >> [ 2.794233] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000014 >> [ 2.803011] Mem abort info: >> >> from https://lava.ciplatform.org/scheduler/job/1025535 >> from >> https://gitlab.com/cip-project/cip-testing/linux-stable-rc-ci/-/jobs/5360973... . >> >> Is there something else missing?
It seems to be an intermittent issue. Investigating...
After spending too much time on bisecting, the bad guy turns out to be commit 6d3745bbc3341d3b ("mmc: renesas_sdhi: register irqs before registering controller") in v5.10.198.
Adding debug information shows the lock is mmc_host.lock.
It is definitely initialized:
renesas_sdhi_probe() { ... tmio_mmc_host_alloc() mmc_alloc_host spin_lock_init(&host->lock);
Initializing mmc_host.lock.
... devm_request_irq() -> tmio_mmc_irq tmio_mmc_cmd_irq() spin_lock(&host->lock);
Locking tmio_mmc_host.lock, but ...
... }
That leaves us with a missing lockdep annotation?
Is it possible that the lock initialization is overwritten ? I seem to recall a recent case where this happens.
Also, there is spin_lock_init(&_host->lock); in tmio_mmc_host_probe(), and tmio_mmc_host_probe() is called after devm_request_irq().
Unless I am missing something, that is initializing tmio_mmc_host.lock, which is a different lock than mmc_host.lock?
... tmio_mmc_host.lock is initialized only here.
Now the question remains why this is not triggered in mainline. More investigation to do tomorrow...
| --- a/drivers/mmc/host/renesas_sdhi_core.c | +++ b/drivers/mmc/host/renesas_sdhi_core.c | @@ -1011,6 +1011,8 @@ int renesas_sdhi_probe(struct platform_device *pdev, | renesas_sdhi_start_signal_voltage_switch; | host->sdcard_irq_setbit_mask = TMIO_STAT_ALWAYS_SET_27; | host->reset = renesas_sdhi_reset;
host->sdcard_irq_mask_all is not initialized in this branch
| + } else { | + host->sdcard_irq_mask_all = TMIO_MASK_ALL; | }
| /* Orginally registers were 16 bit apart, could be 32 or 64 nowadays */ | @@ -1098,9 +1100,7 @@ int renesas_sdhi_probe(struct platform_device *pdev, | host->ops.hs400_complete = renesas_sdhi_hs400_complete; | }
| - ret = tmio_mmc_host_probe(host); | - if (ret < 0) | - goto edisclk; | + sd_ctrl_write32_as_16_and_16(host, CTL_IRQ_MASK, host->sdcard_irq_mask_all);
Fails to disable interrupts for real as host->sdcard_irq_mask_all is still zero.
| num_irqs = platform_irq_count(pdev); | if (num_irqs < 0) { | @@ -1127,6 +1127,10 @@ int renesas_sdhi_probe(struct platform_device *pdev, | goto eirq; | }
| + ret = tmio_mmc_host_probe(host);
Initializes host->sdcard_irq_mask_all when needed and disables interrupts:
if (!_host->sdcard_irq_mask_all) _host->sdcard_irq_mask_all = TMIO_MASK_ALL; tmio_mmc_disable_mmc_irqs(_host, _host->sdcard_irq_mask_all);
If the interrupt came in before, we have an issue.
| + if (ret < 0) | + goto edisclk; | + | dev_info(&pdev->dev, "%s base at %pa, max clock rate %u MHz\n", | mmc_hostname(host->mmc), &res->start, host->mmc->f_max / 1000000);
The solution is to backport commit 9f12cac1bb88e329 ("mmc: renesas_sdhi: use custom mask for TMIO_MASK_ALL") in v5.13. As this doesn't backport cleanly, I'll submit a (tested) patch.
Gr{oetje,eeting}s,
Geert