Coming late to the party because Leif called my attention to this thread...
On Mon, Jan 29, 2018 at 11:16:21AM +0000, Leif Lindholm wrote:
This type of system behaviour has been seen multiple times to break installations in the real world.
I can't agree more; that's why there's a pile of language in 2.6 and later that says how to do a better job of this.
Note: my main objections here are really with regards to:
- the expectation that variable store is erased on fw update
- automatically rewriting boot variables
If (1) was resolved, then I could potentially see a use for a last-ditch fallback option (but even then, I don't think it should be enabled by default).
I certainly agree resolving #1 is the thing to do here, but there is is actually useful functionality to have in general on the fallback path - though I don't think this patch implements it in a correct or preferred way. In particular, it would be better to follow the advice in 3.1.1 of the UEFI 2.7 spec, where we say that while yes, the firmware is allowed to do boot order maintenance, it shouldn't remove anything or change the order itself except in the most dire of circumstances. In particular:
| The firmware should not, under normal operation, automatically remove | any correctly formed Boot#### variable currently referenced by the | BootOrder or BootNext variables. Such removal should be limited to | scenarios where the firmware is guided by direct user interaction.
The right thing to do here is to publish some PlatformRecovery#### variables as specified in 3.4.2, which can be a static list that's the same every time you boot up, and when there's a failure here BootNext and BootOrder have been exhausted, proceed according to 3.4.1 and 3.4.2, which I'll attempt to summarize below.
The PlatformRecovery#### variables here should have device paths that are something like:
PlatformRecovery0000: File(\EFI\BOOT\BOOTAA64.EFI) PlatformRecovery0001: File(\EFI\centos\grubaa64.efi)/EndInstance/File(\EFI\debian\grubaa64.efi)/EndInstance/File(\EFI\GRUB2\GRUBAA64.EFI)/EndInstance/File(\EFI\Microsoft\Boot\bootmgfw.efi)/EndInstance/File(\EFI\redhat\grub.efi)/EndInstance/... etc .../EndEntire
The EFI_OS_INDICATIONS_START_OS_RECOVERY and EFI_OS_INDICATIONS_START_PLATFORM_RECOVERY bits should be set in OsIndicationsSupported.
(As an aside, I don't think we've explicitly said that multi-instance device paths do or don't work in boot variables, so that may require some work, or if you have a stronger preference about order you could just make each one its own PlatformRecovery#### variable. For what it's worth, I don't think that the list in this patch is great - at the least the arch suffixes should be generated according to what the build target is, and what you've got currently doesn't correctly match several shipping OSes - many of which *do* provide a file at \EFI\BOOT\BOOT${ARCH}.EFI which will fix the BootOrder for you.)
If you supply those static variables, then in the normal boot path, if you've exhausted BootNext and all of BootOrder, continue according to chapter 3.4.2 and then 3.4.3 of the spec. That basically says there's a list of things to try as if they're Boot#### options:
- If OsRecoveryOrder exists, it's a list of GUIDs under which there may be OsRecovery#### variables. The GUIDs are processed in the order they're listed, and the OsRecovery#### variables under each GUID are processed in hexadecimal numerical order. - PlatformRecovery#### variables are processed in hexadecimal numerical order if OsRecovery variables are exhausted without successfully booting anything.
For everything in that list, the variables basically get treated exactly like Boot#### variables. For each #### variable: - parse the variable like you'd parse Boot####, and use the normal discovery method to iterate across any files that match it. - if so, see if there's a Boot#### that matches that (it isn't necessarily in BootOrder); if not create one. - Try to boot it like any other boot entry: set BootCurrent to the Boot#### number, do LoadImage() and StartImage(), etc. There's no modification of BootOrder here. - If the binary there can't be found, loaded, or returns an error, continue iterating the normal way to see if there are more matching files, and if there aren't, then proceed to the next variable.
If you exhaust this list, it's time for an error message and a menu or something ;)
There are some important characteristics here that we need to maintain: 1) We haven't really talked enough about when it's really okay to Boot#### entries, because in general removing them is undoing something that was done on purpose (even if that purpose has been obviated now.) A pretty good rule is: don't remove Boot#### variables unless there are dire circumstances, like you've nearly completely run out of flash. The user or the OS set these for a reason. If you absolutely must prune them, start with the ones that aren't in BootOrder or BootNext, and then proceed to the high-numbered ones that are duplicates of other ones. And even then, only remove until you're under some known safe storage threshold. 2) Don't change BootOrder ever. The user set that for a reason. If you absolutely have to change it, append to the end. But you really don't need to change it unless the user told you to. 3) Don't probe the disk except to try to match a boot entry you're attempting to boot from. If you need to read a partition table to match an HD(), that's fine. If you need to read a disk to match a File(), that's fine. But only when you're trying to boot those device paths and they require iterating the disks. All probing the disk accomplishes in other cases is slowing things down and powering up devices unnecessarily.