On Mon, May 6, 2024 at 7:37 AM Ahmad Fatoum <a.fatoum@pengutronix.de> wrote:
Hello Jonathan, Hello Rob,
Thanks for bringing this topic to discussion. I think the status quo is an obvious shortcoming that needs to be addressed.
On 02.05.24 16:00, Rob Herring wrote:
On Wed, May 1, 2024 at 4:18 PM Humphreys, Jonathan <j-humphreys@ti.com> wrote:
Problem statement:
Device trees are in theory a pure description of the hardware, and since the hardware doesn't change, the device tree describing the hardware likewise never changes. With this, a device tree could then be burned into the hardware's ROM to be queried by software for hardware discovery. In practice, though, device trees evolve over time. They evolve for many reasons, including
- support for previously unsupported hardware
- device driver improvements that require additional hardware information
- bug fixes
I really would like specific cases of these where compatibility is broken highlighted.
Screening for backwards-compatibility of new kernels (or their bindings) with old DTs is not enough. When an A/B system fails to boot and does a fallback, you can run into the inverse situation, namely an old kernel being presented with a new device tree, as bootloader updates are often not rolled back.
I'm interested in both cases. Indeed, I think this problem is harder than backwards compatibility. There's not much we can do if, say, a platform was missing some provider (e.g. a clock controller) and then you add one. The old OS is not going to know what to do with the clock controller and all its 'clocks' property references, as it lacks a driver. We mitigated this somewhat in Linux by making some dependencies optional or letting them time out. For example, if a platform booted without pinctrl before, then any pinctrl added should be optional (unless the firmware also stopped doing pin setup when it moved into the DT).
This seems unavoidable and the solution we have for that is to ship device trees along with kernel updates and load both together.
The tooling and reviewing to identify these cases has gotten much better.
barebox has been pulling in kernel device trees for many years and it's a frequent cause of regressions. Here are some recent fixes found with $(git log --grep="^Fixes:.*dts: update"):
Thanks for the pointers. Some analysis below...
- "aiodev: imx_thermal: fix breakage after device tree sync" https://github.com/barebox/barebox/commit/451c25b60e
Removing a required property is something I check. It would not have helped here, as the property got moved from required to deprecated in 2017. Required and deprecated should be orthogonal. For the DTS, what should have happened here is that the 'fsl,tempmon-data' property should have been kept alongside the newly added nvmem properties.
- "pinctrl: stm32: Remove check for pins-are-numbered" https://github.com/barebox/barebox/commit/38ff8dad11
Another case of removing a required property.
- "ARM: dts: i.MX8MP: snps,dis-u2-freeclk-exists-quirk" https://github.com/barebox/barebox/commit/db01bf84cf
Should have kept the existing property and made the new property incremental, meaning it applies on top of the existing one. Yet another reason for me to dislike the dozens of quirk properties we have on this binding. We push back on them a lot more now and instead push for quirks to be implied from SoC-specific compatible strings unless they are board-specific.
I don't think the binding comparison I'm working on could catch this. For this we probably need to compare DTBs looking for removed properties. I think dtx_diff already can do that.
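To illustrate the kind of DTB comparison meant here, the following is a minimal Python sketch that reports properties present in an old tree but missing from a new one. It is not how dtx_diff works; the function name `removed_properties`, the flattened path-to-properties representation, the node path, and the `example,new-quirk` property are all assumptions made up for illustration.

```python
def removed_properties(old_tree, new_tree):
    """Return sorted (node_path, property) pairs that exist in old_tree
    but are missing from new_tree.

    Both trees are assumed to be dicts mapping a node path to the set of
    property names on that node (a hypothetical, flattened view of a DTB).
    """
    removed = []
    for path, old_props in old_tree.items():
        # A node that vanished entirely counts as losing all its properties.
        new_props = new_tree.get(path, set())
        for prop in old_props - new_props:
            removed.append((path, prop))
    return sorted(removed)


# Illustrative data only: node path and replacement property are made up.
old = {"/usb@0": {"compatible", "reg", "snps,dis-u2-freeclk-exists-quirk"}}
new = {"/usb@0": {"compatible", "reg", "example,new-quirk"}}

for path, prop in removed_properties(old, new):
    print(f"{path}: property '{prop}' was removed")
```

A check like this only flags candidates; whether a removal actually breaks an older consumer still depends on whether that consumer relied on the property.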
- "clk: imx8mp: add USB suspend clock" https://github.com/barebox/barebox/commit/d86bbaed71
Changing the number of entries is something I check.
- "ARM: i.MX8MN: assume USBOTG power domains to be powered" https://github.com/barebox/barebox/commit/7b62fbc632
That's the problem of new providers being added. I don't know about barebox's design, but I think you'd be able to handle that better than the Linux kernel can. The biggest problem in Linux is we never know when all drivers have been loaded or dependencies have probed. A module could be loaded a week after boot to provide a dependency. I'd think bootloaders generally don't have that problem.
All of these bugs would have broken a newer Linux kernel being booted with an old device tree.
I'm not sure we can conclude that. Depends if Linux handles the old binding.
In practice, they didn't, because normally barebox-built device trees are used for barebox and Linux-built device trees are shipped along with Linux, even if they might have been identical at some point.
I've been prototyping a tool which will compare 2 versions of binding schemas and spit out incompatible changes for example. Those aren't the only types of changes as you point out, but if we can eliminate a whole class of issues I think the situation would be much better.
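As a rough illustration of what such a comparison could look like (this is not the prototype mentioned above), here is a minimal Python sketch covering the two checks named earlier in this thread: a required property being removed and the number of entries changing. The function name `incompatible_changes`, the simplified schema dicts, and the sample data are assumptions for illustration; real binding schemas are considerably richer.

```python
def incompatible_changes(old, new):
    """Compare two simplified binding schemas and list suspicious changes.

    Each schema is assumed to be a dict with a "properties" mapping and a
    "required" list, loosely modelled on JSON-schema style bindings.
    """
    issues = []
    old_props = old.get("properties", {})
    new_props = new.get("properties", {})

    # Check 1: a property that old DTs had to provide no longer exists.
    for name in sorted(old.get("required", [])):
        if name not in new_props:
            issues.append(f"required property removed: {name}")

    # Check 2: the allowed number of entries changed for a kept property.
    for name in sorted(old_props.keys() & new_props.keys()):
        for key in ("minItems", "maxItems"):
            before = old_props[name].get(key)
            after = new_props[name].get(key)
            if before != after:
                issues.append(f"{name}: {key} changed from {before} to {after}")
    return issues


# Illustrative schemas only; they do not correspond to any real binding.
old_schema = {
    "properties": {"clocks": {"maxItems": 2}, "pins-are-numbered": {}},
    "required": ["clocks", "pins-are-numbered"],
}
new_schema = {
    "properties": {"clocks": {"maxItems": 3}},
    "required": ["clocks"],
}

for issue in incompatible_changes(old_schema, new_schema):
    print(issue)
```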
I look forward to this. Would your tooling have detected any of the above regressions?
Fortunately, most of these issues are caught before a barebox release (features, unlike bug fixes, sit in master a month before making it into a monthly release), but some slip through and it introduces a lot of churn.
I'm sure you'd rather have a tool to find cases rather than having to find them testing every SoC/board. :)
Linux's device tree source is maintained with the kernel source, and kernel builds include building the device trees too. This ensures that the device tree matching the kernel's usage is always kept in sync. Often, embedded distros will include the matching device tree blobs.
The EBBR mandates that the device tree blob is provided by the firmware.
Thus it is likely that the device tree provided by the firmware and given to the operating system is not the matching device tree blob for that kernel. This can cause hardware to be missing, buggy, or non-functional.
Yes. My first experience with EBBR was AFAIR a system that didn't boot, because an up-to-date Debian kernel failed to handle the old device tree provided by the firmware. At least updating the EFI firmware with a USB stick worked well.
Was that SystemReady IR compliant too? Unfortunately, IR 1.x doesn't do much for DT checking, but 2.x does and should help a bit. It's testing with schemas from relatively recent kernel trees and OS kernels of various versions have to boot. So there's at least some implicit mismatching.
This proposal then has the firmware choose the device tree by name, or some other identifier that can be used to match the device tree for the board [1]. It has the OS-provided OS loader select the location of the matching versions of DTBs for it.
The firmware would pass the device tree filename/id to the OS loader, instead of the DTB itself.
If the firmware can't know which version of the DTB the OS expects, how can it know whether to pass a DTB vs. an identifier? The OS might be perfectly fine with the firmware's DTB.
I think it's a fair assumption that if the kernel ships with a matching DTB, it would be fine booting with it instead of the firmware provided DTB.
Yes.
If we had a way to express this "shipped-with" relationship, we could thus have the EFI firmware just select the matching device tree and pass it along the exact way it's done now.
Some ways to describe this "shipped-with" relationship:
- a section in the image as UKIs do, see Jan's mail
- a fixed naming scheme in the EFI partition, e.g. \EFI\Debian\BOOTAA64.EFI -> \EFI\Debian\DTS-BOOTAA64.EFI/
- an EFI variable or protocol?
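The fixed naming scheme option can be sketched as a simple path mapping. The Python below is only an illustration under stated assumptions: the helper names `dtb_dir_for_loader` and `dtb_path_for` are hypothetical, the "DTS-" prefix is taken from the example above, and the DTB file name is a made-up placeholder. It is not an existing firmware or bootloader API.

```python
def dtb_dir_for_loader(loader_path):
    """Map an EFI loader path to its sibling DTB directory by prefixing
    the loader's file name with "DTS-" (assumed naming scheme)."""
    directory, sep, name = loader_path.rpartition("\\")
    return f"{directory}{sep}DTS-{name}"


def dtb_path_for(loader_path, dtb_name):
    """Build the full path of the DTB that shipped with the given loader.
    dtb_name is whatever identifier the firmware uses for the board."""
    return dtb_dir_for_loader(loader_path) + "\\" + dtb_name


loader = r"\EFI\Debian\BOOTAA64.EFI"
print(dtb_dir_for_loader(loader))
# "example-board.dtb" is a placeholder name, not a real board file.
print(dtb_path_for(loader, "example-board.dtb"))
```

With a convention like this, the firmware needs no extra metadata beyond the loader path and its own board identifier, at the cost of hard-coding the layout of the EFI partition.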
This is the EFI loader doing the selection directly, rather than grub or some next stage? That's quite a different problem to solve, I think.
Rob