On Mon, Oct 28, 2019 at 01:02:19PM -0700, Bjorn Andersson wrote:
On Mon 28 Oct 12:11 PDT 2019, Mark Brown wrote:
On Mon, Oct 28, 2019 at 11:40:19AM -0700, Bjorn Andersson wrote:
On Mon 28 Oct 10:48 PDT 2019, Mark Brown wrote:
On Mon, Oct 28, 2019 at 08:03:08AM -0700, kernelci.org bot wrote:
Today's -next (and Friday's) fails to boot on db820c:
defconfig: gcc-8: apq8096-db820c: 1 failed lab
It looks like it deadlocks somewhere, the last things in the log are a failure to start ufshcd-qcom and then an RCU stall some time later:
db820c has been failing intermittently for a while now; it seems that booting with kpti enabled causes something to go wrong. There is nothing strange in the kernel logs and ftrace seems to indicate that all the CPUs are idling nicely.
Oh dear. Adding Catalin and Will. Is it definitely KPTI that's triggering stuff? It did turn up some bugs on other systems, though it's a bit strange it's only manifesting in KernelCI...
I did a test recently where I booted my db820c 100 times with kpti=yes and 100 times with kpti=no on the kernel command line, and the result was 90% failure to reach console with kpti=yes vs 0% with kpti=no. Going back and looking at the logs for the 10% that did reach console indicated that the boot CPU was fine, but I had stalls reported on other CPUs.
In an effort to rule out driver bugs I reduced the DT to CPUs, the core clocks, gic, timers and serial driver, and I still saw the problem.
I have not looked at this with jtag and hence do not know what secure world is doing.
Hmm. Is this a recent thing? Neither kpti nor the snapdragon 820 are particularly new. Might be worth checking that CONFIG_QCOM_FALKOR_ERRATUM_1003 is enabled and getting patched in at runtime -- we had hardware issues during kpti development with this CPU.
Will