On Tue, Jul 30, 2019 at 3:41 PM Mark Brown broonie@kernel.org wrote:
Today's -next fails to boot on QDF2400 systems:
Is this a devicetree or ACPI system? Which devicetree in that case? If it is ACPI I assume one had to look into DSDTs?
But I looked into this!
08:56:36.026587 [ 4.339966] pc : __memset+0x80/0x188 08:56:36.026867 [ 4.343524] lr : msm_gpio_init_valid_mask+0x134/0x1a4
Aha. I think this only worked out of chance before.
What the Qualcomm driver does is exploit that .init_valid_mask() gets called even if .need_valid_mask in gpio_chip is not set to true, and this is why the bug appears in msm_gpio_init_valid_mask(), I'm pretty much sure it is the bitmap_zero(chip->valid_mask, max_gpios); call towards the end of the function that gets turned into: 08:56:36.114798 [ 4.433713] __memset+0x80/0x188
And this causes the crash.
This is then because chip->valid_mask has not been allocated, and that is because .need_valid_mask is not set. This is set like so:
static bool msm_gpio_needs_valid_mask(struct msm_pinctrl *pctrl) { if (pctrl->soc->reserved_gpios) return true;
return device_property_read_u16_array(pctrl->dev, "gpios", NULL, 0) > 0; }
First comes from the driver, second is for ACPI I think. It looks like a bit dangerous way to do it for ACPI, what if an OF pin controller has some "gpios" property? Oh well.
Apparently this only happens on ACPI systems because I tested it myself on a DT system.
Another cause may be this from the call site inside gpiolib:
static int gpiochip_alloc_valid_mask(struct gpio_chip *gc) { if (IS_ENABLED(CONFIG_OF_GPIO)) gc->need_valid_mask = of_gpio_need_valid_mask(gc); if (!gc->need_valid_mask) return 0;
gc->valid_mask = gpiochip_allocate_mask(gc); if (!gc->valid_mask) return -ENOMEM; return 0; }
So if OF and ACPI is activated at the same time (can that happen?) then the OF code will bail out I suppose and this happens when the ACPI side of things try to use the mask it didn't allocate. The ACPI codepath in msm_gpio_init_valid_mask() seems to try to set the mask even if there is zero GPIOs as well... dereferencing the NULL pointer in chip->valid_mask.
Could it be that this is a ACPI system with zero protected GPIOs? It doesn't seem like the code will survive that. It seems written under the assumption that
It's a bit of a mess.
Can some qcom ACPI people take linux-next for a ride and tell me what I should do?
Lee: do you know if linux-next boots fine on your ACPI machine?
Timur/Stephen: any ideas?
If noone has a better idea I guess I just give up and unconditionally set .needs_valid_mask to true because we honestly don't know when it's needed or not.
Yours, Linus Walleij