Ping - I haven't heard any feedback yet on my followup below. I could imagine a number of alternate approaches but I'd like to try to get some kind of consensus here first. Or, the approach we initially chose might actually turn out to be reasonable on reflection.
Thanks in advance!
On 10/10/2017 1:36 PM, Chris Metcalf wrote:
On 10/10/2017 7:58 AM, Leif Lindholm wrote:
On Tue, Oct 10, 2017 at 11:23:32AM +0100, Mark Rutland wrote:
On Tue, Oct 10, 2017 at 11:15:39AM +0100, Sudeep Holla wrote:
(+Mark, Grant)
On 09/10/17 18:16, Chris Metcalf wrote:
The Mellanox BlueField SoC firmware supports a safe upgrade mode as part of the flow where users put new firmware on the secondary eMMC boot partition (the one not currently in use), tell the eMMC to make the secondary boot partition primary, and reset.
When you say "firmware", are you referring to actual firmware, or a platform-specific OS image?
For the former, the preferred update mechanism would be UpdateCapsule().
This sounds to me very much like something we'd want to keep out of the kernel. Assuming ACPI means UEFI, there should be less invasive solutions to this problem.
I've added linaro-uefi to cc. Chris - if you'd be willing to have a side discussion on the overall approach, please drop the kernel lists from cc and let's have a chat there.
I've taken linux-kernel off the Cc and added linaro-uefi. For the linaro-uefi folks, here is the original patch and commit message:
https://patchwork.kernel.org/patch/9993965/
The proposed solution is intended to be a way to update ATF + UEFI. It's possible that in practice this would also include a UEFI boot image path change (passed as a parameter contained in the firmware image on the eMMC boot partition) that would cause us to boot a different OS image and/or use a different root filesystem as well. The intent is to be able to deploy arbitrary updates to a device in the field, and then if needed, safely roll it back.
But the issue here really isn't about how we install the new firmware itself. We have a simple model now that works: we use Linux to copy the new firmware to the eMMC alternate boot partition, then we issue an ioctl to the mmc driver to tell it to switch boot partitions. Now when we reboot we will get the desired new firmware.
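For readers unfamiliar with the partition switch, here is a rough sketch of what such a request typically encodes. It assumes (the thread does not confirm this) that the switch is done the way mmc-utils does it: a CMD6 SWITCH that rewrites the PARTITION_CONFIG byte (EXT_CSD index 179), whose bits 5:3 select the boot partition. The helper names are hypothetical; only the bit layout comes from the eMMC specification.

```python
# Sketch only: computes the values a CMD6 SWITCH would carry; it does not
# touch any hardware. Function names are illustrative, not from the patch.

EXT_CSD_PART_CONFIG = 179          # PARTITION_CONFIG byte index in EXT_CSD
MMC_SWITCH_MODE_WRITE_BYTE = 0x03  # CMD6 access mode: write byte
EXT_CSD_CMD_SET_NORMAL = 0x01      # CMD6 command set field

BOOT_PART_SHIFT = 3                # BOOT_PARTITION_ENABLE is bits 5:3
BOOT_PART_MASK = 0x7 << BOOT_PART_SHIFT


def part_config_with_boot(current: int, boot_part: int) -> int:
    """Return PARTITION_CONFIG with boot partition 1 or 2 enabled.

    All other bits (BOOT_ACK, PARTITION_ACCESS) are preserved.
    """
    if boot_part not in (1, 2):
        raise ValueError("boot_part must be 1 or 2")
    return (current & ~BOOT_PART_MASK & 0xFF) | (boot_part << BOOT_PART_SHIFT)


def switch_arg(index: int, value: int) -> int:
    """Build the 32-bit CMD6 argument for a single-byte EXT_CSD write."""
    return (
        (MMC_SWITCH_MODE_WRITE_BYTE << 24)
        | (index << 16)
        | (value << 8)
        | EXT_CSD_CMD_SET_NORMAL
    )


def other_boot_part(current: int) -> int:
    """Given PARTITION_CONFIG, return the boot partition not currently enabled."""
    active = (current & BOOT_PART_MASK) >> BOOT_PART_SHIFT
    if active not in (1, 2):
        raise ValueError("no boot partition currently enabled")
    return 2 if active == 1 else 1
```

In a real tool the resulting argument would be placed in a `struct mmc_ioc_cmd` and submitted with the `MMC_IOC_CMD` ioctl on the block device; that part needs hardware and is left out here.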
What this driver is intended to support is access via SMC to code in ATF that lets us be clever about how we can roll back bad firmware. Specifically, we notify ATF that, after the next reset, we want to turn on the ARM watchdog, and at the same time arrange that after the NEXT reset, it should switch the eMMC boot partition and reset. This means that we reset, boot the new firmware, and then if it doesn't work well, eventually the ARM watchdog causes a reset, at which point ATF switches us back to the original boot partition and the device comes up fully functional running the old firmware. If things go well once the new image is booted up, we tell ATF to relax and not do anything special at reset time, and we disarm the ARM watchdog.
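The sequence above is easier to follow as a small state machine. The following is a toy model, not the ATF code: the mode names, the `Board` class, and the assumption that a reset clears the watchdog while the mode and boot partition survive in persistent registers are all mine, chosen to match the description.

```python
from enum import Enum


class Mode(Enum):
    NORMAL = 0           # nothing special happens at reset
    UPGRADE_PENDING = 1  # next reset: arm the watchdog and enter TRIAL
    TRIAL = 2            # next reset: swap boot partition back and reset again


class Board:
    """Toy model of the state that survives a reset (hypothetical names)."""

    def __init__(self) -> None:
        self.boot_part = 0           # active eMMC boot partition (0 or 1)
        self.mode = Mode.NORMAL
        self.watchdog_armed = False

    def install_and_arm(self) -> None:
        # Linux has already written new firmware to the inactive partition.
        # Make it primary, then ask ATF (via SMC) for the fail-safe behavior.
        self.boot_part ^= 1
        self.mode = Mode.UPGRADE_PENDING

    def reset(self) -> None:
        # Assumption: the watchdog itself does not survive a reset.
        self.watchdog_armed = False
        if self.mode is Mode.UPGRADE_PENDING:
            # First boot of the new firmware: run it under the watchdog.
            self.watchdog_armed = True
            self.mode = Mode.TRIAL
        elif self.mode is Mode.TRIAL:
            # The trial boot was never committed: roll back and reset again.
            self.boot_part ^= 1
            self.mode = Mode.NORMAL
            self.reset()

    def commit(self) -> None:
        # New firmware is healthy: tell ATF to relax and disarm the watchdog.
        self.mode = Mode.NORMAL
        self.watchdog_armed = False
```

In this model a good upgrade is `install_and_arm()`, `reset()`, `commit()`, after which further resets stay on the new partition. A bad upgrade is `install_and_arm()`, `reset()`, then a watchdog-triggered `reset()`, which lands back on the original partition in `NORMAL` mode.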
So the UpdateCapsule mechanism seems orthogonal to this.
It's true that using UpdateCapsule would be a nice improvement on our current "write to eMMC from Linux" model. But our concern is that UEFI and Linux don't have mutual exclusion when accessing the eMMC device registers, so in our multicore system it's not safe to have UEFI write the new firmware to eMMC at the time of the UpdateCapsule call; either Linux or UEFI might end up with eMMC corruption. And we don't have substantial persistent memory across reset; the DDR memory is reset along with the chip, and beyond that we just have the UEFI variable EEPROM plus a few on-chip persistent registers, so we can't pass a megabyte or so of firmware data across reset and do the installation after reset. So we didn't really pursue this approach when we were looking at how to actually do the firmware update.
Maybe there is some way to request the kinds of reset behavior that we want within the existing EFI framework that we are not aware of. Or maybe there is some way to think about incorporating some of this into the EFI reset semantics. It seems reasonable to incorporate something into UpdateCapsule to provide these semantics, although then we'd also have to fix the issues that made us not adopt the UpdateCapsule approach in the first place.
What other approaches do folks think would work to achieve this fail-safe firmware upgrade model?