Murphy-compatible embedded systems updater

September 23, 2022
Software
linux, embedded systems

“If builders built buildings the way programmers wrote programs, then the first woodpecker that came along would destroy civilisation.”

Murphy’s Law of Technology #5

Introduction #

If you’ve ever used a modern computer, whether it runs Windows or macOS, there’s a good chance you have come across a message like the following:

“Whatever happens, do not pull the plug or reboot this system until the update has been completed”.

Ignoring this instruction will, in a lot of cases, turn your device into a brick. It may no longer boot (the infamous “your PC has a problem” blue screen) or simply exhibit strange bugs from this point on. Now imagine that all the digital devices you possess, from your Amazon Smart-Fridge to your WiFi router, phone, or smartwatch, deal with this issue, and not all are immunised against it.

As Ben-Yossef notes in his paper “Building Murphy-compatible embedded Linux systems,” this issue is rare for two reasons: QA engineers test updates extensively, and firmware updates are fast and can be scheduled for when the device is not in use, which lowers the chance of the user interrupting them. However, it does happen, and it is essential to consider it at the design stage of any embedded system that requires updates.1

In considering the importance of robust updates, we must recognise the operational cost of a Return Material Authorization (RMA) for each device where this occurs. If an embedded system is fully bricked, it must be returned to its manufacturer, replaced, and sent back to the customer, often at the company’s expense.

To avoid RMAs, the material and carbon cost of product replacement, and interruption of customer use, we need Murphy-compatible – customer-proof – updates. This post will discuss the different tools used to achieve a robust update system compatible with multiple target devices, the way these tools are linked in the software stack, and the progressive erosion of my will to live due to poor documentation of these tools.

Context (the project) #

Recently I have spent time refactoring a Linux distribution with Buildroot, a tool for tailoring embedded operating systems for specific use cases. After an overhaul required to repay some of the (significant) technical debt the project had accrued, some minor changes made to the update system to ensure multi-target compatibility prompted me to have a look at how it worked. This led me from Raspberry Pi compatibility forums, to discussions on dual-copy strategies, and all the way down to yak shaving.

To be clear, technical debt refers to “the implied cost of additional reworks caused by choosing an easy (limited) solution” instead of using a more thorough approach that would take longer to implement. The refactor I had to perform (reworking an existing OS for an updated architecture) was complicated by this debt: errors appeared for no clear reason after small changes, which made feature development, testing, and debugging difficult and lengthened the whole process.2

When the build system (Buildroot) was updated, both the bootloader (Das U-Boot) and the updater (SWUpdate) had advanced several years (and major versions), so the techniques and syntax used to link them and implement the remote updates had to be modified in line with their changes. In this post, I will clarify the mechanisms used for integrating these two tools, in the hope that someone will one day spend less time than I did implementing this.

In addition, I’d like to point out that I have used a variety of articles, papers, and existing blog posts (thank you George Hilliard) to write this, and have referenced them where I used them most. 3

Update Mechanisms #

Remote firmware and application updates are the new norm for deployed embedded systems. Modern systems have become increasingly complex, requiring bug fixes and security patches after the device has shipped. Moreover, remote updates allow new features to be deployed to devices already in use, adding utility over time and granting greater versatility post-deployment.1

I am currently working on an SMC (system management controller) which has two primary update mechanisms:

  1. An SD card can be flashed with an OS image and inserted into the microcontroller.

  2. The update can be packaged and uploaded through a web interface managed by SWUpdate.4

SWUpdate is a tool installed on the target device, which can receive an update image (.swu file) from either local media or a remote server and use it to update various parts of the system. Typically, this will be used to update the Linux kernel and the root filesystem, but it can also be used to update additional partitions, or the bootloader (though this is pretty risky). SWUpdate serves a simple web interface at a local IP address (normally on port 8080) where the user uploads the new image, which is then automatically written to the correct partition.

Dual Copy Strategy #

The ability to update partitions separately is a huge benefit and the primary reason for choosing SWUpdate, as it solves many potential issues when updating embedded systems remotely. It checks the board’s hardware compatibility and deploys the root filesystem to the partition we want. When monitoring SWUpdate, we can see it parses the update manifest, fetches artifacts, and deploys the update correctly. But this raises several questions: how can we have a rollback mechanism if the update itself contains issues or breaks the hardware? Can this be done automatically, or will the customer need to dismantle the device to flash an SD card to restore the system?5

The solution to these issues is to have two root filesystems, stored on separate partitions: one is active and executing (MAIN) and the other (ALT) is used for the update. When an update arrives, it is deployed to ALT while MAIN is active. After a reboot, ALT becomes the active filesystem, and MAIN can be used for the next update. If there are issues with the update to ALT, MAIN will still be able to load and work as before.

[Figure: dual-copy layout]
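The role swap described above can be sketched as a small shell helper. This is only an illustration of the idea: the function name `other_copy` is made up for this post and is not part of SWUpdate or U-Boot.

```shell
# Hypothetical helper: given the copy that is currently active,
# print the copy that should receive the next update.
other_copy() {
    case "$1" in
        MAIN) echo ALT ;;    # MAIN is running, so the update goes to ALT
        ALT)  echo MAIN ;;   # ALT is running, so the update goes to MAIN
        *)    echo "unknown copy: $1" >&2; return 1 ;;
    esac
}

# Example:
#   target="$(other_copy MAIN)"
```

The important property is that the running copy is never the one being written to, so an interrupted write can only damage the copy that was not in use anyway.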

U-Boot Partition Management #

Buildroot provides multiple update options: SWUpdate, Mender.io, OSTree, and RAUC. This blog post targets SWUpdate, which is enabled as a package in the Buildroot menuconfig. Once the package is selected there, SWUpdate’s own configuration menu can be opened and configured through the same kind of menu interface.

cd project/buildroot/
make menuconfig            # enable the swupdate package
make swupdate-menuconfig   # configure SWUpdate itself
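As a quick sanity check, it can help to confirm that the package really ended up in the Buildroot configuration before starting a long build. The sketch below assumes the standard `.config` file in the Buildroot directory and the `BR2_PACKAGE_SWUPDATE` symbol; the helper name `br_symbol_enabled` is my own.

```shell
# Hypothetical helper: succeed if a Buildroot .config enables the given symbol.
# $1: path to the .config file, $2: symbol name
br_symbol_enabled() {
    config_file="$1"
    symbol="$2"
    grep -q "^${symbol}=y\$" "$config_file"
}

# Example (run from the Buildroot directory):
#   br_symbol_enabled .config BR2_PACKAGE_SWUPDATE && echo "SWUpdate enabled"
```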

Following this, the default web interface must be selected, and SWUpdate must be configured for your system. If you wish to better understand the SWUpdate setup, please see this excellent blog post from Bootlin.

We will focus on the link between U-Boot and SWUpdate, which is what makes a reliable dual-copy strategy possible. The bootloader tells the updater which filesystem is currently active, and therefore which partition to write to and which to boot into next.

Aside: The original reason for this deep-dive into SWUpdate and U-Boot integration was an issue where the updater would select the correct partition to write to, but the bootloader would not recognise that an update had occurred, and so would consistently boot from the same partition. It is perhaps proof of the complexity of designing robust embedded systems that fixing this issue required a multitude of resources beyond the official documentation, just to understand how environment variables are passed.

U-Boot uses environment variables to select the partition, and passes the selection on to the kernel so that it mounts the right root filesystem. These variables can be set from userspace by SWUpdate, which is how the dual-copy strategy is implemented.

The first step is creating the 2 root filesystems we wish to use. My target has the following partitions:

  1. /dev/mmcblk0p1: boot (vfat)

  2. /dev/mmcblk0p2: rootfs1 MAIN (ext2)

  3. /dev/mmcblk0p3: rootfs2 ALT (ext2)

  4. /dev/mmcblk0p4: data (ext4)

Next, U-Boot needs support for changing the active root filesystem. This is determined by bootcmd, which U-Boot executes at boot, before the kernel starts. The choice is passed to the kernel through the bootargs variable, where the root partition is assigned as root=/dev/mmcblk0p${bootpart} and ${bootpart} is the partition number (2 for MAIN, 3 for ALT). For example, the following sets MAIN (2) as the active root filesystem.

setenv bootargs "earlyprintk console=tty0 console=ttyAMA0,115200 rootfstype=ext4 noinitrd fsck.repair=yes root=/dev/mmcblk0p2"

Because the root partition changes with every update, it is easier to store the partition number in its own variable. That way, changing the partition does not discard any other modifications to the kernel arguments.

Our variable is called bootpart, and we use an if statement in the U-Boot boot script to determine which partition to append to the end of the bootargs:

setenv RFS1 'root=/dev/mmcblk0p2'
setenv RFS2 'root=/dev/mmcblk0p3'

if test "${bootpart}" = "2"; then
    setenv bootargs "${args} ${RFS1}";
elif test "${bootpart}" = "3"; then
    setenv bootargs "${args} ${RFS2}";
fi;
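One thing to watch in the script above is that bootargs ends up without a root= entry if bootpart is unset or holds an unexpected value. The sketch below models the same choice in plain POSIX shell so it can be tried on a development machine; it is not U-Boot code, the function name `build_bootargs` is made up, and falling back to MAIN when bootpart is empty is my assumption about a sensible default.

```shell
# Host-side model of the boot script's choice; not U-Boot code.
# $1: common kernel arguments, $2: value of bootpart (may be empty)
build_bootargs() {
    args="$1"
    case "${2:-2}" in    # assumption: default to MAIN (2) when bootpart is unset
        2) echo "$args root=/dev/mmcblk0p2" ;;
        3) echo "$args root=/dev/mmcblk0p3" ;;
        *) echo "invalid bootpart: $2" >&2; return 1 ;;
    esac
}

# Example:
#   build_bootargs "console=ttyAMA0,115200 rootfstype=ext4" 3
```

In the real boot script, the equivalent would be an extra else branch that picks a known-good partition.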

In the U-Boot boot script, the setenv command stores the resulting arguments in U-Boot’s environment. Note that setenv only changes the in-memory copy; from the U-Boot prompt, saveenv is needed to make a change persistent. For debugging from Linux, it is also useful to use the fw_setenv command to manually assign the boot partition, or the fw_printenv command to print bootpart and other U-Boot environment variables.5
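For manual testing, the two userspace tools can be combined into a small toggle. This is a sketch under some assumptions: fw_printenv and fw_setenv are installed on the target, /etc/fw_env.config points at the right environment location, and bootpart holds a bare partition number. The function name `toggle_bootpart` is hypothetical.

```shell
# Hypothetical helper: flip bootpart between 2 (MAIN) and 3 (ALT) from Linux.
# Prints the new value on success.
toggle_bootpart() {
    # -n prints only the value, without the "bootpart=" prefix
    current="$(fw_printenv -n bootpart)" || return 1
    case "$current" in
        2) next=3 ;;
        3) next=2 ;;
        *) echo "unexpected bootpart: $current" >&2; return 1 ;;
    esac
    fw_setenv bootpart "$next" && echo "$next"
}

# Example on the target:
#   toggle_bootpart && reboot
```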

The value of bootpart at boot depends on whether SWUpdate has installed an update. If there has been no update, the OS simply reboots into its previous partition. If SWUpdate performs an update, it reads the current bootpart and switches it to the other partition (MAIN -> ALT, or ALT -> MAIN).6

For example, if our active filesystem is MAIN (2), SWUpdate will read this and set the bootpart variable to the non-active partition ALT (3), so that root=/dev/mmcblk0p3 is passed to the kernel on the next boot. The update is written to the newly assigned partition, ALT (3) in this case.
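One way to tell SWUpdate which copy to write is its -e option, which selects a software collection and mode from the manifest. The sketch below derives that selection from the kernel command line. It assumes the collection is called "stable" with modes "main" and "alt", as in the sw-description used in this post; the helper name and the update.swu file name are placeholders of mine.

```shell
# Hypothetical helper: choose the SWUpdate selection from a kernel command line.
# $1: the kernel command line, e.g. the contents of /proc/cmdline
selection_for_cmdline() {
    case " $1 " in
        *" root=/dev/mmcblk0p2 "*) echo "stable,alt" ;;    # running MAIN, update ALT
        *" root=/dev/mmcblk0p3 "*) echo "stable,main" ;;   # running ALT, update MAIN
        *) echo "cannot determine active root" >&2; return 1 ;;
    esac
}

# Example on the target:
#   swupdate -e "$(selection_for_cmdline "$(cat /proc/cmdline)")" -i update.swu
```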

The upgrade procedure can be treated as a transaction: the new software is only marked as bootable after the upgrade has succeeded. Seen this way, an upgrade with this strategy is safe: the system is always guaranteed to boot and to be ready to receive new software, even if the old one is corrupted or cannot run. With U-Boot as the bootloader, SWUpdate can set variables in U-Boot’s environment to mark the start and end of a transaction and to indicate that the storage contains valid software.

Integration with SWUpdate #

For all this to happen correctly, our partitions need to be set in sw-description (mentioned earlier as the update manifest, shown below) for our board, and the logic for determining which partition to boot needs to live in uboot-boot.sh (as of U-Boot 2022.02).

software =
{
    version = "X.Y";
    hardware-compatibility = [ "revZ" ]; 
    
    files: (
        {
            filename = "boot.scr";
            path = "/boot.scr";
            device = "/dev/mmcblk0p1"; # Setting the boot partition
            filesystem = "vfat";
            sha256 = "BOOTSCRSHA";
        }
        );
    
    scripts: (
        {
            filename = "update.sh"; # Name of the update script
            type = "shellscript";
            sha256 = "UPDATESHA";
        }
        );
    
    stable: 
    {
        main: 
        {
            images: (
            {
                filename = "rootfs.ext2.gz"; 
                device = "/dev/mmcblk0p2"; # Address of the MAIN partition
                compressed = true;
                installed-directly = true;
                sha256 = "IMAGESHA";
            }
            );
            
            bootenv: (
            {
                name = "bootpart";
                value = "2"; # Assigning MAIN to bootpart 2
            },
            {
                name = "resetBootEnv";
                value = "true";
            }
            );
        };
        
        alt:
        {
            images: (
            {
                filename = "rootfs.ext2.gz";
                device = "/dev/mmcblk0p3"; # Address of the ALT partition
                compressed = true;
                installed-directly = true;
                sha256 = "IMAGESHA";
            }
            );
            bootenv: (
            {
                name = "bootpart";
                value = "3"; # Assigning ALT to bootpart 3
            },
            {
                name = "resetBootEnv";
                value = "true";
            }
            );  
        };
    };
}

This describes the software layout and is what allows SWUpdate to update separate parts of the system. In our case, it defines a ‘software collection’ for Revision Z of our current board. It has two sub-collections, one each for the MAIN and ALT partitions. Each contains an images section, which handles the actual copying of the compressed root filesystem into the target partition, and a bootenv section, which handles the bootloader integration: it updates the U-Boot environment with the correct bootpart variable (the equivalent of fw_setenv). When running SWUpdate from the web interface, a message dialog will show the partition to be updated (the opposite of the active root filesystem).7
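The BOOTSCRSHA, UPDATESHA and IMAGESHA strings in the manifest are placeholders that have to be replaced with real SHA-256 hashes before packaging. A minimal sketch of how that could be done at build time is below. It assumes GNU sed and sha256sum on the build host; the helper name `fill_sha` and the output name update.swu are mine, not something the project or SWUpdate prescribes.

```shell
# Hypothetical build-time helper: replace a quoted placeholder in the manifest
# with the SHA-256 hash of an artifact.
# $1: placeholder, $2: artifact file, $3: manifest file
fill_sha() {
    placeholder="$1"; artifact="$2"; manifest="$3"
    hash="$(sha256sum "$artifact" | cut -d' ' -f1)"
    [ -n "$hash" ] || return 1
    sed -i "s/\"$placeholder\"/\"$hash\"/" "$manifest"
}

# Example, assuming all files sit in the current directory:
#   fill_sha IMAGESHA   rootfs.ext2.gz sw-description
#   fill_sha BOOTSCRSHA boot.scr       sw-description
#   fill_sha UPDATESHA  update.sh      sw-description
#   # sw-description must be the first entry in the cpio archive
#   for f in sw-description boot.scr update.sh rootfs.ext2.gz; do echo "$f"; done |
#       cpio -ov -H crc > update.swu
```

Because IMAGESHA appears in both the main and alt sections, a single call fills in both occurrences.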

Verification (SWUpdate Log) #

[parse_cfg] : Parsing config file /tmp/sw-description
[get_common_fields] : Version 1.3.2
[parse_hw_compatibility] : Accepted Hw Revision : rev2
[_parse_files] : Found File: boot.scr --> /boot.scr (/dev/mmcblk0p1)
[_parse_images] : Found compressed Image: rootfs.ext2.gz in device : 
                  /dev/mmcblk0p3 for handler raw (installed from stream)
[_parse_scripts] : Found Script: update.sh
[_parse_bootloader] : Bootloader var: resetBootEnv = true
[_parse_bootloader] : Bootloader var: bootpart = 3
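When reading a log like this, the thing to check is that the image device and the bootpart variable point at the same partition (3 in both cases here). A rough way to automate that check is sketched below. It assumes the log format shown above and a bare partition number in bootpart; the function name `log_targets_match` is hypothetical.

```shell
# Hypothetical check: read an SWUpdate log on stdin and succeed only if the
# image device and the bootpart variable refer to the same partition number.
log_targets_match() {
    awk '
        # Image line: keep the partition number that follows /dev/mmcblk0p
        /\/dev\/mmcblk0p[0-9]+ for handler/ { sub(/.*\/dev\/mmcblk0p/, ""); dev = $1 }
        # Bootloader line: the value is the last field
        /Bootloader var: bootpart = /       { part = $NF }
        END { exit !(dev != "" && dev == part) }
    '
}

# Example (swupdate.log is a placeholder file name):
#   log_targets_match < swupdate.log && echo "update and boot target agree"
```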

Potential Improvements #

This system works well for most projects and allows for both a clear user experience and a robust system with rollback capabilities. However, it could be better:

  • Using a dual U-Boot environment. If there is only one, losing power while the bootloader environment is being written could brick the board. This could also allow updates to multiple architectures, adding a lot of versatility. This is not a major issue, as the bootloader is rarely updated, so our system prevents unplugging from being a problem in the large majority of user update cases.

  • Having a dual boot partition to safely update the kernel in the same way that the root filesystem is updated.

  • Setting a watchdog or daemon that would reset the device when issues with boot occur (such as loading the incorrect filesystem).
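For the last point, U-Boot has a boot counter feature (the bootcount, bootlimit and altbootcmd variables) that could serve as a starting point. The sketch below only models the decision in plain POSIX shell; it is not U-Boot code, the function name `choose_bootpart` is made up, and wiring it into altbootcmd and resetting the counter from Linux after a good boot would need checking against the U-Boot documentation for the board.

```shell
# Host-side model of a boot-counter fallback; not U-Boot code.
# $1: bootpart, $2: bootcount (boot attempts so far), $3: bootlimit
choose_bootpart() {
    if [ "$2" -gt "$3" ]; then
        # Too many failed attempts: fall back to the other copy
        case "$1" in
            2) echo 3 ;;
            3) echo 2 ;;
            *) return 1 ;;
        esac
    else
        echo "$1"
    fi
}

# Example:
#   choose_bootpart 3 4 3   # limit exceeded, falls back to partition 2
```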

Whatever improvements we consider, such as over-the-air updates via cellular or the ability to upgrade different architectures with the same image, it is essential that the update strategy be robust, production-ready, and efficient.