next/master boot: 265 boots: 17 failed, 184 passed with 64 offline (next-20190730)
Full Boot Summary: https://kernelci.org/boot/all/job/next/branch/master/kernel/next-20190730/ Full Build Summary: https://kernelci.org/build/next/branch/master/kernel/next-20190730/
Tree: next Branch: master Git Describe: next-20190730 Git Commit: 70f4b4ac1655dffd33cb38f369b31bdecc588daa Git URL: git://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git Tested: 84 unique boards, 28 SoC families, 22 builds out of 230
Boot Failures Detected:
arm64: defconfig: gcc-8: qcom-qdf2400: 1 failed lab
defconfig+CONFIG_CPU_BIG_ENDIAN=y: gcc-8: meson-gxm-khadas-vim2: 1 failed lab
defconfig+CONFIG_RANDOMIZE_BASE=y: gcc-8: meson-gxm-khadas-vim2: 1 failed lab qcom-qdf2400: 1 failed lab
arm: oxnas_v6_defconfig: gcc-8: ox820-cloudengines-pogoplug-series-3: 1 failed lab
qcom_defconfig: gcc-8: qcom-apq8064-cm-qs600: 1 failed lab
multi_v7_defconfig+CONFIG_CPU_BIG_ENDIAN=y: gcc-8: armada-xp-openblocks-ax3-4: 1 failed lab
Offline Platforms:
riscv:
defconfig: gcc-8 sifive_fu540: 1 offline lab
arm64:
defconfig: gcc-8 meson-axg-s400: 1 offline lab meson-g12a-u200: 1 offline lab meson-g12a-x96-max: 1 offline lab meson-gxbb-odroidc2: 1 offline lab meson-gxl-s905d-p230: 1 offline lab meson-gxl-s905x-libretech-cc: 1 offline lab meson-gxl-s905x-nexbox-a95x: 1 offline lab meson-gxl-s905x-p212: 1 offline lab meson-gxm-nexbox-a1: 1 offline lab rk3399-firefly: 1 offline lab sun50i-a64-pine64-plus: 1 offline lab
defconfig+CONFIG_CPU_BIG_ENDIAN=y: gcc-8 meson-axg-s400: 1 offline lab meson-g12a-u200: 1 offline lab meson-g12a-x96-max: 1 offline lab meson-gxbb-odroidc2: 1 offline lab meson-gxl-s905d-p230: 1 offline lab meson-gxl-s905x-libretech-cc: 1 offline lab meson-gxl-s905x-nexbox-a95x: 1 offline lab meson-gxl-s905x-p212: 1 offline lab
defconfig+CONFIG_RANDOMIZE_BASE=y: gcc-8 meson-axg-s400: 1 offline lab meson-g12a-u200: 1 offline lab meson-g12a-x96-max: 1 offline lab meson-gxbb-odroidc2: 1 offline lab meson-gxl-s905d-p230: 1 offline lab meson-gxl-s905x-libretech-cc: 1 offline lab meson-gxl-s905x-nexbox-a95x: 1 offline lab meson-gxl-s905x-p212: 1 offline lab meson-gxm-nexbox-a1: 1 offline lab rk3399-firefly: 1 offline lab sun50i-a64-pine64-plus: 1 offline lab
mips:
pistachio_defconfig: gcc-8 pistachio_marduk: 1 offline lab
arm:
exynos_defconfig: gcc-8 exynos5250-arndale: 1 offline lab exynos5420-arndale-octa: 1 offline lab exynos5800-peach-pi: 1 offline lab
multi_v7_defconfig: gcc-8 bcm72521-bcm97252sffe: 1 offline lab bcm7445-bcm97445c: 1 offline lab exynos5250-arndale: 1 offline lab exynos5420-arndale-octa: 1 offline lab exynos5800-peach-pi: 1 offline lab imx6dl-wandboard_dual: 1 offline lab imx6dl-wandboard_solo: 1 offline lab imx6q-wandboard: 1 offline lab imx7s-warp: 1 offline lab meson8b-odroidc1: 1 offline lab omap3-beagle: 1 offline lab omap4-panda: 1 offline lab qcom-apq8064-ifc6410: 1 offline lab stih410-b2120: 1 offline lab sun4i-a10-cubieboard: 1 offline lab sun7i-a20-bananapi: 1 offline lab vf610-colibri-eval-v3: 1 offline lab
omap2plus_defconfig: gcc-8 omap3-beagle: 1 offline lab omap4-panda: 1 offline lab
qcom_defconfig: gcc-8 qcom-apq8064-ifc6410: 1 offline lab
davinci_all_defconfig: gcc-8 da850-evm: 1 offline lab dm365evm,legacy: 1 offline lab
imx_v6_v7_defconfig: gcc-8 imx6dl-wandboard_dual: 1 offline lab imx6dl-wandboard_solo: 1 offline lab imx6q-wandboard: 1 offline lab imx7s-warp: 1 offline lab vf610-colibri-eval-v3: 1 offline lab
sunxi_defconfig: gcc-8 sun4i-a10-cubieboard: 1 offline lab sun7i-a20-bananapi: 1 offline lab
--- For more info write to info@kernelci.org
On Tue, Jul 30, 2019 at 05:17:56AM -0700, kernelci.org bot wrote:
Boot Failures Detected:
arm: oxnas_v6_defconfig: gcc-8: ox820-cloudengines-pogoplug-series-3: 1 failed lab
For some time now -next and mainline have been failing to boot on Pogoplug 3 with the oxnas_v6_defconfig, the kernel seems to start fine but fails to parse the ramdisk it's passed:
08:50:02.086589 <6>[ 7.719854] IP-Config: Complete: 08:50:02.087213 <6>[ 7.723330] device=eth0, hwaddr=0a:a2:89:27:10:1b, ipaddr=10.201.4.144, mask=255.255.0.0, gw=10.201.0.1 08:50:02.087413 <6>[ 7.733409] host=10.201.4.144, domain=, nis-domain=(none) 08:50:02.088056 <6>[ 7.739499] bootserver=10.201.1.1, rootserver=10.201.1.1, rootpath= 08:50:02.088248 <6>[ 7.739504] nameserver0=10.201.1.1 08:50:02.129966 <5>[ 7.752025] RAMDISK: Couldn't find valid RAM disk image starting at 0. 08:50:02.130381 <4>[ 7.759616] List of all partitions: 08:50:02.131333 <4>[ 7.763363] 0100 65536 ram0
Possibly an issue with the ramdisk getting overwritten or something?
Full details for today's -next can be seen here:
Hi Mark,
On 30/07/2019 14:34, Mark Brown wrote:
On Tue, Jul 30, 2019 at 05:17:56AM -0700, kernelci.org bot wrote:
Boot Failures Detected:
arm: oxnas_v6_defconfig: gcc-8: ox820-cloudengines-pogoplug-series-3: 1 failed lab
For some time now -next and mainline have been failing to boot on Pogoplug 3 with the oxnas_v6_defconfig, the kernel seems to start fine but fails to parse the ramdisk it's passed:
08:50:02.086589 <6>[ 7.719854] IP-Config: Complete: 08:50:02.087213 <6>[ 7.723330] device=eth0, hwaddr=0a:a2:89:27:10:1b, ipaddr=10.201.4.144, mask=255.255.0.0, gw=10.201.0.1 08:50:02.087413 <6>[ 7.733409] host=10.201.4.144, domain=, nis-domain=(none) 08:50:02.088056 <6>[ 7.739499] bootserver=10.201.1.1, rootserver=10.201.1.1, rootpath= 08:50:02.088248 <6>[ 7.739504] nameserver0=10.201.1.1 08:50:02.129966 <5>[ 7.752025] RAMDISK: Couldn't find valid RAM disk image starting at 0. 08:50:02.130381 <4>[ 7.759616] List of all partitions: 08:50:02.131333 <4>[ 7.763363] 0100 65536 ram0
Possibly an issue with the ramdisk getting overwritten or something?
Thanks for reporting, it's my suspicion since my multiple bisect runs all point to this merge commit : a318423b61e8 Merge tag 'upstream-5.3-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rw/ubifs
This merge doesn't introduce notable changes for the oxnas_v6_defconfig, but disabling UBI entirely makes it work again.
Continuing my investigations...
Neil
Full details for today's -next can be seen here:
On Tue, Jul 30, 2019 at 05:17:56AM -0700, kernelci.org bot wrote:
The previously reported issues with booting -next on meson-gxm-khadas-vim2 are still present today, though seemingly only manifesting with CONFIG_RANDOMIZE_BASE and not defconfig (there are failures with big endian too but they don't look device specific):
arm64:
defconfig+CONFIG_RANDOMIZE_BASE=y: gcc-8: meson-gxm-khadas-vim2: 1 failed lab
It looks like it gets to userspace and then hangs (end of the log below). More details at:
https://kernelci.org/boot/id/5d40069859b5148b3931b2bf/
The last message in the log indicates it was initializing the Panfrost driver:
08:53:47.332143 <6>[ 15.172833] panfrost d00c0000.gpu: clock rate = 666666666 08:55:40.299880 ShellCommand command timed out.: Sending # in case of corruption. Connection timeout 00:04:14, retry in 00:02:07
On Tue, Jul 30, 2019 at 05:17:56AM -0700, kernelci.org bot wrote:
next/master boot: 265 boots: 17 failed, 184 passed with 64 offline (next-20190730) Full Boot Summary: https://kernelci.org/boot/all/job/next/branch/master/kernel/next-20190730/ Full Build Summary: https://kernelci.org/build/next/branch/master/kernel/next-20190730/
For a while now all arm64 big endian clang built kernels have been failing, the kernel mounts the root filesystem but is unable to execute init due to an inability to understand the executable format:
08:55:25.999629 <6>[ 226.077194] Run /init as init process 08:55:31.066490 <4>[ 226.086518] request_module: kmod_concurrent_max (0) close to 0 (max_modprobes: 50), for module binfmt-464c, throttling... 08:55:31.085167 <4>[ 231.135458] request_module: modprobe binfmt-464c cannot be processed, kmod busy with 50 threads for more than 5 seconds now 08:55:35.745340 ShellCommand command timed out.: Sending # in case of corruption. Connection timeout 00:01:54, retry in 00:00:11 08:55:35.846536 # 08:55:35.849523 # 08:55:36.185339 <4>[ 231.154208] request_module: kmod_concurrent_max (0) close to 0 (max_modprobes: 50), for module binfmt-464c, throttling... 08:55:36.208673 <4>[ 236.255449] request_module: modprobe binfmt-464c cannot be processed, kmod busy with 50 threads for more than 5 seconds now 08:55:36.209013 <3>[ 236.269366] Failed to execute /init (error -8) 08:55:36.210161 <6>[ 236.285459] Run /sbin/init as init process 08:55:41.306737 <4>[ 236.294490] request_module: kmod_concurrent_max (0) close to 0 (max_modprobes: 50), for module binfmt-464c, throttling... 08:55:41.331547 <4>[ 241.375455] request_module: modprobe binfmt-464c cannot be processed, kmod busy with 50 threads for more than 5 seconds now 08:55:41.331837 <3>[ 241.389316] Starting init: /sbin/init exists but couldn't execute it (error -8)
(binfmt-464c is binfmt-misc, the fallback for unknown executable formats). The same kernel version built with GCC boots fine.
You can see a bunch of reports here (all the big endian failures):
https://kernelci.org/boot/all/job/next/branch/master/kernel/next-20190730/
It's possible that there's some infrastructure error that's causing the wrong ramdisk to be sent to the boards only for clang but I'd be a bit surprised.
On Tue, Jul 30, 2019 at 05:17:56AM -0700, kernelci.org bot wrote:
Today's -next fails to boot on QDF2400 systems:
arm64: defconfig: gcc-8: qcom-qdf2400: 1 failed lab
defconfig+CONFIG_RANDOMIZE_BASE=y: gcc-8: qcom-qdf2400: 1 failed lab
It crashes trying to load the pinctrl driver, looking at the diff I suspect due to 0ce242ad2ec17dd (pinctrl: qcom: Pass irqchip when adding gpiochip) but haven't bisected or anything.
More info including full logs at:
https://kernelci.org/boot/id/5d40064459b5148b2631b2a6/ https://kernelci.org/boot/id/5d40073a59b5148bc631b28d/ https://kernelci.org/boot/id/5d40054959b5148a5e31b29b/ https://kernelci.org/boot/id/5d40082a59b5148c2931b28d/
(both configs, built with GCC and clang), interesting bit of one of the backtraces is:
08:56:35.942217 [ 4.264434] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000000 08:56:35.942520 [ 4.272262] Mem abort info: 08:56:35.942781 [ 4.275032] ESR = 0x96000044 08:56:35.943081 [ 4.278074] Exception class = DABT (current EL), IL = 32 bits 08:56:35.943330 [ 4.283976] SET = 0, FnV = 0 08:56:35.943569 [ 4.287011] EA = 0, S1PTW = 0 08:56:35.943809 [ 4.290139] Data abort info: 08:56:35.982983 [ 4.293001] ISV = 0, ISS = 0x00000044 08:56:35.983330 [ 4.296823] CM = 0, WnR = 1 08:56:35.983608 [ 4.299772] [0000000000000000] user address but active_mm is swapper 08:56:35.985342 [ 4.306113] Internal error: Oops: 96000044 [#1] PREEMPT SMP 08:56:35.985640 [ 4.311664] Modules linked in: 08:56:35.985901 [ 4.314704] CPU: 3 PID: 1 Comm: swapper/0 Not tainted 5.3.0-rc2-next-20190730 #1 08:56:35.986154 [ 4.322081] Hardware name: WIWYNN QDF2400 Reference Evaluation Platform CV90-LA115-P900/QDF2400 Customer Reference Board, BIOS 0ACJA551 06/12/2018 08:56:36.026209 [ 4.335189] pstate: 80400005 (Nzcv daif +PAN -UAO) 08:56:36.026587 [ 4.339966] pc : __memset+0x80/0x188 08:56:36.026867 [ 4.343524] lr : msm_gpio_init_valid_mask+0x134/0x1a4 08:56:36.028625 [ 4.348556] sp : ffff00001015b480 08:56:36.028923 [ 4.351855] x29: ffff00001015b4c0 x28: ffffbc73ef19a768 08:56:36.029191 [ 4.357150] x27: ffffbc73ef19a400 x26: 000000000000016a 08:56:36.029441 [ 4.362445] x25: 0000000000000096 x24: 0000000000000020 08:56:36.029688 [ 4.367740] x23: 0000000000000020 x22: ffffbc73ef7ba000 08:56:36.029927 [ 4.373035] x21: 0000000000000000 x20: ffffbc73ef7cb880 08:56:36.069212 [ 4.378330] x19: ffffbc73ef7cb890 x18: 0000000000000002 08:56:36.069560 [ 4.383626] x17: 0000000000000000 x16: 0000000000002680 08:56:36.071378 [ 4.388921] x15: ffff524de59c5827 x14: 0000000000000000 08:56:36.071694 [ 4.394216] x13: 0000000000000000 x12: 0000000000000083 08:56:36.071963 [ 4.399511] x11: 0000000000000018 x10: 0000000000000021 08:56:36.072213 [ 4.404806] x9 : 0000000000000020 x8 : 0000000000000000 08:56:36.072462 [ 4.410101] x7 : 0000000000000000 x6 : 0000000000804661 08:56:36.072701 [ 4.415396] x5 : 6146000000000000 x4 : 0000000000000000 08:56:36.112335 [ 4.420692] x3 : 0000000000000010 x2 : 0000000000000018 08:56:36.112682 [ 4.425986] x1 : 0000000000000000 x0 : 0000000000000000 08:56:36.112962 [ 4.431282] Call trace: 08:56:36.114798 [ 4.433713] __memset+0x80/0x188 08:56:36.115105 [ 4.436927] gpiochip_add_data_with_key+0x624/0xa58 08:56:36.115373 [ 4.441786] msm_pinctrl_probe+0x34c/0x458 08:56:36.115623 [ 4.445866] qdf2xxx_pinctrl_probe+0x290/0x334 08:56:36.115870 [ 4.450296] platform_drv_probe+0x8c/0xb4 08:56:36.116109 [ 4.454286] really_probe+0x214/0x490 08:56:36.116344 [ 4.457932] driver_probe_device+0x60/0xf8 08:56:36.116581 [ 4.462012] __device_attach_driver+0xfc/0x114 08:56:36.155329 [ 4.466439] bus_for_each_drv+0x7c/0xc4 08:56:36.155676 [ 4.470258] __device_attach+0xbc/0x15c
On Tue, Jul 30, 2019 at 3:41 PM Mark Brown broonie@kernel.org wrote:
Today's -next fails to boot on QDF2400 systems:
Is this a devicetree or ACPI system? Which devicetree in that case? If it is ACPI I assume one had to look into DSDTs?
But I looked into this!
08:56:36.026587 [ 4.339966] pc : __memset+0x80/0x188 08:56:36.026867 [ 4.343524] lr : msm_gpio_init_valid_mask+0x134/0x1a4
Aha. I think this only worked out of chance before.
What the Qualcomm driver does is exploit that .init_valid_mask() gets called even if .need_valid_mask in gpio_chip is not set to true, and this is why the bug appears in msm_gpio_init_valid_mask(), I'm pretty much sure it is the bitmap_zero(chip->valid_mask, max_gpios); call towards the end of the function that gets turned into: 08:56:36.114798 [ 4.433713] __memset+0x80/0x188
And this causes the crash.
This is then because chip->valid_mask has not been allocated, and that is because .need_valid_mask is not set. This is set like so:
static bool msm_gpio_needs_valid_mask(struct msm_pinctrl *pctrl) { if (pctrl->soc->reserved_gpios) return true;
return device_property_read_u16_array(pctrl->dev, "gpios", NULL, 0) > 0; }
First comes from the driver, second is for ACPI I think. It looks like a bit dangerous way to do it for ACPI, what if an OF pin controller has some "gpios" property? Oh well.
Apparently this only happens on ACPI systems because I tested it myself on a DT system.
Another cause may be this from the call site inside gpiolib:
static int gpiochip_alloc_valid_mask(struct gpio_chip *gc) { if (IS_ENABLED(CONFIG_OF_GPIO)) gc->need_valid_mask = of_gpio_need_valid_mask(gc); if (!gc->need_valid_mask) return 0;
gc->valid_mask = gpiochip_allocate_mask(gc); if (!gc->valid_mask) return -ENOMEM; return 0; }
So if OF and ACPI is activated at the same time (can that happen?) then the OF code will bail out I suppose and this happens when the ACPI side of things try to use the mask it didn't allocate. The ACPI codepath in msm_gpio_init_valid_mask() seems to try to set the mask even if there is zero GPIOs as well... dereferencing the NULL pointer in chip->valid_mask.
Could it be that this is a ACPI system with zero protected GPIOs? It doesn't seem like the code will survive that. It seems written under the assumption that
It's a bit of a mess.
Can some qcom ACPI people take linux-next for a ride and tell me what I should do?
Lee: do you know if linux-next boots fine on your ACPI machine?
Timur/Stephen: any ideas?
If noone has a better idea I guess I just give up and unconditionally set .needs_valid_mask to true because we honestly don't know when it's needed or not.
Yours, Linus Walleij
On Wed, Jul 31, 2019 at 10:48:38AM +0200, Linus Walleij wrote:
On Tue, Jul 30, 2019 at 3:41 PM Mark Brown broonie@kernel.org wrote:
Today's -next fails to boot on QDF2400 systems:
Is this a devicetree or ACPI system? Which devicetree in that case? If it is ACPI I assume one had to look into DSDTs?
You can see from the linked logs it's an ACPI system, the ACPI code announces it during boot.
Aha. I think this only worked out of chance before.
What the Qualcomm driver does is exploit that .init_valid_mask() gets called even if .need_valid_mask in gpio_chip is not set to true, and this is why the bug appears in msm_gpio_init_valid_mask(), I'm pretty much sure it is the bitmap_zero(chip->valid_mask, max_gpios); call towards the end of the function that gets turned into: 08:56:36.114798 [ 4.433713] __memset+0x80/0x188
And this causes the crash.
Should init_valid_mask() be being called if need_valid_mask() is false?
Apparently this only happens on ACPI systems because I tested it myself on a DT system.
Might also depend on the particular DT the system has?
So if OF and ACPI is activated at the same time (can that happen?)
Not really. There is a stub DT used to pass ACPI to the kernel.
So if OF and ACPI is activated at the same time (can that happen?)
Not really. There is a stub DT used to pass ACPI to the kernel.
Right. I think it's a choice the kernel makes on boot. Both can be provided, but the kernel will pick to run with one or the other.
Quoting Linus Walleij (2019-07-31 01:48:38)
On Tue, Jul 30, 2019 at 3:41 PM Mark Brown broonie@kernel.org wrote:
Today's -next fails to boot on QDF2400 systems:
Is this a devicetree or ACPI system? Which devicetree in that case? If it is ACPI I assume one had to look into DSDTs?
But I looked into this!
08:56:36.026587 [ 4.339966] pc : __memset+0x80/0x188 08:56:36.026867 [ 4.343524] lr : msm_gpio_init_valid_mask+0x134/0x1a4
Aha. I think this only worked out of chance before.
What the Qualcomm driver does is exploit that .init_valid_mask() gets called even if .need_valid_mask in gpio_chip is not set to true, and this is why the bug appears in msm_gpio_init_valid_mask(), I'm pretty much sure it is the bitmap_zero(chip->valid_mask, max_gpios); call towards the end of the function that gets turned into: 08:56:36.114798 [ 4.433713] __memset+0x80/0x188
And this causes the crash.
This is then because chip->valid_mask has not been allocated, and that is because .need_valid_mask is not set. This is set like so:
static bool msm_gpio_needs_valid_mask(struct msm_pinctrl *pctrl) { if (pctrl->soc->reserved_gpios) return true;
return device_property_read_u16_array(pctrl->dev, "gpios", NULL, 0) > 0;
}
Some of the code here is new. I guess the arm64 laptop stuff is making changes.
First comes from the driver, second is for ACPI I think. It looks like a bit dangerous way to do it for ACPI, what if an OF pin controller has some "gpios" property? Oh well.
Apparently this only happens on ACPI systems because I tested it myself on a DT system.
Another cause may be this from the call site inside gpiolib:
static int gpiochip_alloc_valid_mask(struct gpio_chip *gc) { if (IS_ENABLED(CONFIG_OF_GPIO)) gc->need_valid_mask = of_gpio_need_valid_mask(gc); if (!gc->need_valid_mask) return 0;
gc->valid_mask = gpiochip_allocate_mask(gc); if (!gc->valid_mask) return -ENOMEM; return 0;
}
So if OF and ACPI is activated at the same time (can that happen?)
Yes, OF and ACPI can be compiled into the same kernel. Also, ACPI does some interesting things when CONFIG_OF is enabled by looking for some ACPI magic that basically says "match the DT compatible string embedded in this ACPI thing". Quite scary!
then the OF code will bail out I suppose and this happens when the ACPI side of things try to use the mask it didn't allocate. The ACPI codepath in msm_gpio_init_valid_mask() seems to try to set the mask even if there is zero GPIOs as well... dereferencing the NULL pointer in chip->valid_mask.
Could it be that this is a ACPI system with zero protected GPIOs? It doesn't seem like the code will survive that. It seems written under the assumption that
It's a bit of a mess.
Can some qcom ACPI people take linux-next for a ride and tell me what I should do?
Lee: do you know if linux-next boots fine on your ACPI machine?
Timur/Stephen: any ideas?
I think the code changed in commit f626d6dfb709 ("gpio: of: Break out OF-only code"). Now it unconditionally overwrites the chip's need_valid_mask member when CONFIG_OF is enabled. How about only overwriting it to 'true' when it really needs it? That way, the gpio provider can have a say. I originally wrote this by having of_gpio_need_valid_mask() modify the chip directly, but I like this approach instead where we mark it as const in that function and then only set it to true if it's actually needed.
-----8<---- diff --git a/drivers/gpio/gpiolib-of.c b/drivers/gpio/gpiolib-of.c index b10d04dd9296..e39b4290b80c 100644 --- a/drivers/gpio/gpiolib-of.c +++ b/drivers/gpio/gpiolib-of.c @@ -87,7 +87,7 @@ static struct gpio_desc *of_xlate_and_get_gpiod_flags(struct gpio_chip *chip, * @dev: the device for the GPIO provider * @return: true if the valid mask needs to be set */ -bool of_gpio_need_valid_mask(struct gpio_chip *gc) +bool of_gpio_need_valid_mask(const struct gpio_chip *gc) { int size; struct device_node *np = gc->of_node; diff --git a/drivers/gpio/gpiolib-of.h b/drivers/gpio/gpiolib-of.h index 34954921d96e..454d1658ee2d 100644 --- a/drivers/gpio/gpiolib-of.h +++ b/drivers/gpio/gpiolib-of.h @@ -16,7 +16,7 @@ struct gpio_desc *of_get_named_gpiod_flags(struct device_node *np, int of_gpiochip_add(struct gpio_chip *gc); void of_gpiochip_remove(struct gpio_chip *gc); int of_gpio_get_count(struct device *dev, const char *con_id); -bool of_gpio_need_valid_mask(struct gpio_chip *gc); +bool of_gpio_need_valid_mask(const struct gpio_chip *gc); #else static inline struct gpio_desc *of_find_gpio(struct device *dev, const char *con_id, @@ -36,7 +36,7 @@ static inline int of_gpio_get_count(struct device *dev, const char *con_id) { return 0; } -static inline bool of_gpio_need_valid_mask(struct gpio_chip *gc) +static inline bool of_gpio_need_valid_mask(const struct gpio_chip *gc) { return false; } diff --git a/drivers/gpio/gpiolib.c b/drivers/gpio/gpiolib.c index d45c9a2285f0..e7153c81fdaa 100644 --- a/drivers/gpio/gpiolib.c +++ b/drivers/gpio/gpiolib.c @@ -362,8 +362,8 @@ static unsigned long *gpiochip_allocate_mask(struct gpio_chip *chip)
static int gpiochip_alloc_valid_mask(struct gpio_chip *gc) { - if (IS_ENABLED(CONFIG_OF_GPIO)) - gc->need_valid_mask = of_gpio_need_valid_mask(gc); + if (of_gpio_need_valid_mask(gc)) + gc->need_valid_mask = true; if (!gc->need_valid_mask) return 0;
On 7/31/2019 8:13 AM, Stephen Boyd wrote:
Quoting Linus Walleij (2019-07-31 01:48:38)
On Tue, Jul 30, 2019 at 3:41 PM Mark Brown broonie@kernel.org wrote:
Today's -next fails to boot on QDF2400 systems:
Is this a devicetree or ACPI system? Which devicetree in that case? If it is ACPI I assume one had to look into DSDTs?
But I looked into this!
08:56:36.026587 [ 4.339966] pc : __memset+0x80/0x188 08:56:36.026867 [ 4.343524] lr : msm_gpio_init_valid_mask+0x134/0x1a4
Aha. I think this only worked out of chance before.
What the Qualcomm driver does is exploit that .init_valid_mask() gets called even if .need_valid_mask in gpio_chip is not set to true, and this is why the bug appears in msm_gpio_init_valid_mask(), I'm pretty much sure it is the bitmap_zero(chip->valid_mask, max_gpios); call towards the end of the function that gets turned into: 08:56:36.114798 [ 4.433713] __memset+0x80/0x188
And this causes the crash.
This is then because chip->valid_mask has not been allocated, and that is because .need_valid_mask is not set. This is set like so:
static bool msm_gpio_needs_valid_mask(struct msm_pinctrl *pctrl) { if (pctrl->soc->reserved_gpios) return true;
return device_property_read_u16_array(pctrl->dev, "gpios", NULL, 0) > 0;
}
Some of the code here is new. I guess the arm64 laptop stuff is making changes.
First comes from the driver, second is for ACPI I think. It looks like a bit dangerous way to do it for ACPI, what if an OF pin controller has some "gpios" property? Oh well.
Apparently this only happens on ACPI systems because I tested it myself on a DT system.
Another cause may be this from the call site inside gpiolib:
static int gpiochip_alloc_valid_mask(struct gpio_chip *gc) { if (IS_ENABLED(CONFIG_OF_GPIO)) gc->need_valid_mask = of_gpio_need_valid_mask(gc); if (!gc->need_valid_mask) return 0;
gc->valid_mask = gpiochip_allocate_mask(gc); if (!gc->valid_mask) return -ENOMEM; return 0;
}
So if OF and ACPI is activated at the same time (can that happen?)
Yes, OF and ACPI can be compiled into the same kernel. Also, ACPI does some interesting things when CONFIG_OF is enabled by looking for some ACPI magic that basically says "match the DT compatible string embedded in this ACPI thing". Quite scary!
then the OF code will bail out I suppose and this happens when the ACPI side of things try to use the mask it didn't allocate. The ACPI codepath in msm_gpio_init_valid_mask() seems to try to set the mask even if there is zero GPIOs as well... dereferencing the NULL pointer in chip->valid_mask.
Could it be that this is a ACPI system with zero protected GPIOs?
QDF2400 is an ACPI only system, but it has protected GPIOs
It doesn't seem like the code will survive that. It seems written under the assumption that
It's a bit of a mess.
Can some qcom ACPI people take linux-next for a ride and tell me what I should do?
Lee: do you know if linux-next boots fine on your ACPI machine?
Timur/Stephen: any ideas?
Timur's CAF account is no longer valid, I added his @kernel one.
I think the code changed in commit f626d6dfb709 ("gpio: of: Break out OF-only code"). Now it unconditionally overwrites the chip's need_valid_mask member when CONFIG_OF is enabled. How about only overwriting it to 'true' when it really needs it? That way, the gpio provider can have a say. I originally wrote this by having of_gpio_need_valid_mask() modify the chip directly, but I like this approach instead where we mark it as const in that function and then only set it to true if it's actually needed.
-----8<---- diff --git a/drivers/gpio/gpiolib-of.c b/drivers/gpio/gpiolib-of.c index b10d04dd9296..e39b4290b80c 100644 --- a/drivers/gpio/gpiolib-of.c +++ b/drivers/gpio/gpiolib-of.c @@ -87,7 +87,7 @@ static struct gpio_desc *of_xlate_and_get_gpiod_flags(struct gpio_chip *chip,
- @dev: the device for the GPIO provider
- @return: true if the valid mask needs to be set
*/ -bool of_gpio_need_valid_mask(struct gpio_chip *gc) +bool of_gpio_need_valid_mask(const struct gpio_chip *gc) { int size; struct device_node *np = gc->of_node; diff --git a/drivers/gpio/gpiolib-of.h b/drivers/gpio/gpiolib-of.h index 34954921d96e..454d1658ee2d 100644 --- a/drivers/gpio/gpiolib-of.h +++ b/drivers/gpio/gpiolib-of.h @@ -16,7 +16,7 @@ struct gpio_desc *of_get_named_gpiod_flags(struct device_node *np, int of_gpiochip_add(struct gpio_chip *gc); void of_gpiochip_remove(struct gpio_chip *gc); int of_gpio_get_count(struct device *dev, const char *con_id); -bool of_gpio_need_valid_mask(struct gpio_chip *gc); +bool of_gpio_need_valid_mask(const struct gpio_chip *gc); #else static inline struct gpio_desc *of_find_gpio(struct device *dev, const char *con_id, @@ -36,7 +36,7 @@ static inline int of_gpio_get_count(struct device *dev, const char *con_id) { return 0; } -static inline bool of_gpio_need_valid_mask(struct gpio_chip *gc) +static inline bool of_gpio_need_valid_mask(const struct gpio_chip *gc) { return false; } diff --git a/drivers/gpio/gpiolib.c b/drivers/gpio/gpiolib.c index d45c9a2285f0..e7153c81fdaa 100644 --- a/drivers/gpio/gpiolib.c +++ b/drivers/gpio/gpiolib.c @@ -362,8 +362,8 @@ static unsigned long *gpiochip_allocate_mask(struct gpio_chip *chip) static int gpiochip_alloc_valid_mask(struct gpio_chip *gc) {
- if (IS_ENABLED(CONFIG_OF_GPIO))
gc->need_valid_mask = of_gpio_need_valid_mask(gc);
- if (of_gpio_need_valid_mask(gc))
if (!gc->need_valid_mask) return 0;gc->need_valid_mask = true;
linux-arm-kernel mailing list linux-arm-kernel@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
On 7/31/19 12:58 PM, Jeffrey Hugo wrote:
Timur/Stephen: any ideas?
Timur's CAF account is no longer valid, I added his @kernel one.
Delete everything specific to the QDF2400.
Qualcomm has made it very clear that they have no interest in developing ARM server chips. No QDF2400 system ever made it to official production.
On Thu, Aug 1, 2019 at 5:49 AM Timur Tabi timur@kernel.org wrote:
On 7/31/19 12:58 PM, Jeffrey Hugo wrote:
Timur/Stephen: any ideas?
Timur's CAF account is no longer valid, I added his @kernel one.
Delete everything specific to the QDF2400.
It appears Mark is still using it in his test farm, and now its sole role is finding bugs in my code. Which it did! With so much elegance that we could fix it up quickly.
Qualcomm has made it very clear that they have no interest in developing ARM server chips. No QDF2400 system ever made it to official production.
That's sad. I remember we had lots of discussions around this at the time. The ACPI code base and quirks we added is however used in other Qualcomm-based machines now (what Lee is doing), so the effort is not wasted at all.
Yours, Linus Walleij
On 8/1/2019 1:09 AM, Linus Walleij wrote:
On Thu, Aug 1, 2019 at 5:49 AM Timur Tabi timur@kernel.org wrote:
On 7/31/19 12:58 PM, Jeffrey Hugo wrote:
Timur/Stephen: any ideas?
Timur's CAF account is no longer valid, I added his @kernel one.
Delete everything specific to the QDF2400.
It appears Mark is still using it in his test farm, and now its sole role is finding bugs in my code. Which it did! With so much elegance that we could fix it up quickly.
Qualcomm has made it very clear that they have no interest in developing ARM server chips. No QDF2400 system ever made it to official production.
That's sad. I remember we had lots of discussions around this at the time. The ACPI code base and quirks we added is however used in other Qualcomm-based machines now (what Lee is doing), so the effort is not wasted at all.
There are a few QDF2400 systems around that are being supported. Certainly not to the scale that was intended/envisioned, but they do exist.
Feel free to poke me for questions/issues, although its not my primary job so responses may be a bit slow.
On 7/31/19 12:58 PM, Jeffrey Hugo wrote:
static int gpiochip_alloc_valid_mask(struct gpio_chip *gc) { if (IS_ENABLED(CONFIG_OF_GPIO)) gc->need_valid_mask = of_gpio_need_valid_mask(gc); if (!gc->need_valid_mask) return 0;
So this seems wrong on a system with OF and ACPI. It assumes that OF takes priority over ACPI if both are enabled, and that's not true in general. If anything, it's the other way around.
IS_ENABLED(CONFIG_OF_GPIO) is not the correct test to see if OF should be used. I think this should be replaced with the OF equivalent of has_acpi_companion(), but even that might not be enough. Basically, of_gpio_need_valid_mask() should return three values, 0 = don't need it, 1 = does need it, -1 = gpio info is not in OF.
On Fri, Aug 2, 2019 at 4:51 AM Timur Tabi timur@kernel.org wrote:
On 7/31/19 12:58 PM, Jeffrey Hugo wrote:
static int gpiochip_alloc_valid_mask(struct gpio_chip *gc) { if (IS_ENABLED(CONFIG_OF_GPIO)) gc->need_valid_mask = of_gpio_need_valid_mask(gc); if (!gc->need_valid_mask) return 0;
So this seems wrong on a system with OF and ACPI. It assumes that OF takes priority over ACPI if both are enabled, and that's not true in general. If anything, it's the other way around.
IS_ENABLED(CONFIG_OF_GPIO) is not the correct test to see if OF should be used. I think this should be replaced with the OF equivalent of has_acpi_companion(), but even that might not be enough. Basically, of_gpio_need_valid_mask() should return three values, 0 = don't need it, 1 = does need it, -1 = gpio info is not in OF.
You're absolutely right.
Sboyd hacked up a patch to that effect and I applied it.
I haven't heard if QDF2400 is working again but I'd love to know!
Yours, Linus Walleij
On 8/3/2019 2:42 AM, Linus Walleij wrote:
On Fri, Aug 2, 2019 at 4:51 AM Timur Tabi timur@kernel.org wrote:
On 7/31/19 12:58 PM, Jeffrey Hugo wrote:
static int gpiochip_alloc_valid_mask(struct gpio_chip *gc) { if (IS_ENABLED(CONFIG_OF_GPIO)) gc->need_valid_mask = of_gpio_need_valid_mask(gc); if (!gc->need_valid_mask) return 0;
So this seems wrong on a system with OF and ACPI. It assumes that OF takes priority over ACPI if both are enabled, and that's not true in general. If anything, it's the other way around.
IS_ENABLED(CONFIG_OF_GPIO) is not the correct test to see if OF should be used. I think this should be replaced with the OF equivalent of has_acpi_companion(), but even that might not be enough. Basically, of_gpio_need_valid_mask() should return three values, 0 = don't need it, 1 = does need it, -1 = gpio info is not in OF.
You're absolutely right.
Sboyd hacked up a patch to that effect and I applied it.
I haven't heard if QDF2400 is working again but I'd love to know!
Sorry, was on vacation. Per kernelci[1], looks like things are working.
[1] https://kernelci.org/boot/qcom-qdf2400/
On Mon, Aug 12, 2019 at 4:08 PM Jeffrey Hugo jhugo@codeaurora.org wrote:
On 8/3/2019 2:42 AM, Linus Walleij wrote:
Sboyd hacked up a patch to that effect and I applied it.
I haven't heard if QDF2400 is working again but I'd love to know!
Sorry, was on vacation. Per kernelci[1], looks like things are working.
Awesome, thanks!
Linus Walleij
On Thu, 01 Aug 2019, Timur Tabi wrote:
On 7/31/19 12:58 PM, Jeffrey Hugo wrote:
static int gpiochip_alloc_valid_mask(struct gpio_chip *gc) { if (IS_ENABLED(CONFIG_OF_GPIO)) gc->need_valid_mask = of_gpio_need_valid_mask(gc); if (!gc->need_valid_mask) return 0;
So this seems wrong on a system with OF and ACPI. It assumes that OF takes priority over ACPI if both are enabled, and that's not true in general. If anything, it's the other way around.
Device Tree has priority in the Linux Kernel. If the bootloader provides both (which breaks SBBR and EBBR compliance by the way!), then the kernel will choose DT.
I guess this is because DT is more fluid (much to the originator's dismay I should imagine) and can be used to fix broken ACPI tables which tend to be more of a permanent fixture.
On Wed, Jul 31, 2019 at 5:13 PM Stephen Boyd swboyd@chromium.org wrote:
if (IS_ENABLED(CONFIG_OF_GPIO))
gc->need_valid_mask = of_gpio_need_valid_mask(gc);
if (of_gpio_need_valid_mask(gc))
gc->need_valid_mask = true;
This looks like the silver bullet, thanks for your sharp eyes for this!
I'll send this out with your authorship and apply to next so Mark can see if it fixes the issue.
Yours, Linus Walleij
kernel-build-reports@lists.linaro.org