The moment Container Linux was updated and almost broke our fleet

The value proposition of Container Linux by CoreOS (Red Hat / IBM) is its ability to perform automated operating system updates, thanks to its read-only /usr mount point backed by active/passive partitions, the update_engine, and Omaha. This philosophy (“secure the Internet”) allows system administrators to stop worrying about low-level security patches and helps define a clear separation of concerns between operations and applications.

In practice, the update_engine periodically pings the configured CoreUpdate server and, upon finding a newly available version of Container Linux, downloads and installs it to the passive partition, which is then marked as the active partition for the next boot. Note that CoreUpdate does not send updates to all the online nodes at once, but spreads their release over several hours or days. From there, a few strategies exist to actually apply the update, the most common being locksmith’s etcd-lock, which allows at most n servers to reboot simultaneously. The second most frequently used mechanism, which ships by default in Tectonic, is the container-linux-update-operator: it orchestrates the necessary reboots at the Kubernetes layer, draining the workloads present on the nodes one by one.
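For the curious, locksmith’s etcd-lock strategy boils down to a distributed semaphore stored in etcd: a node takes a slot before rebooting and releases it once it is back up. Here is a minimal conceptual sketch of that idea using go.etcd.io/etcd/client/v3 (the key name, slot count, and endpoint are made up for illustration; this is not locksmith’s actual code):

```go
// Conceptual sketch of an etcd-backed reboot semaphore, in the spirit of
// locksmith's etcd-lock strategy (not its actual implementation).
package main

import (
	"context"
	"fmt"
	"strconv"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

const (
	semaphoreKey = "/reboot-lock/holders" // hypothetical key
	maxHolders   = 1                      // at most n nodes rebooting at once
)

// tryAcquire attempts to take one slot of the reboot semaphore. It returns
// true if a slot was acquired, false if all slots are currently held.
func tryAcquire(ctx context.Context, cli *clientv3.Client) (bool, error) {
	resp, err := cli.Get(ctx, semaphoreKey)
	if err != nil {
		return false, err
	}

	holders, modRev := 0, int64(0)
	if len(resp.Kvs) > 0 {
		holders, _ = strconv.Atoi(string(resp.Kvs[0].Value))
		modRev = resp.Kvs[0].ModRevision
	}
	if holders >= maxHolders {
		return false, nil // someone else is already rebooting
	}

	// Compare-and-swap: only increment the counter if the key has not
	// changed since we read it.
	txn, err := cli.Txn(ctx).
		If(clientv3.Compare(clientv3.ModRevision(semaphoreKey), "=", modRev)).
		Then(clientv3.OpPut(semaphoreKey, strconv.Itoa(holders+1))).
		Commit()
	if err != nil {
		return false, err
	}
	return txn.Succeeded, nil
}

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"http://127.0.0.1:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		panic(err)
	}
	defer cli.Close()

	ok, err := tryAcquire(context.Background(), cli)
	if err != nil {
		panic(err)
	}
	fmt.Println("reboot slot acquired:", ok)
	// A real agent would reboot here, then decrement the counter once back up.
}
```

Releasing the slot once the node is healthy again is simply the mirror operation: another compare-and-swap that decrements the counter.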

Because quite a few of our applications do not play well with restarts (e.g. slow warm-ups, costly re-initialization of data streams, and required manual algorithm restarts), we had disabled locksmith and never installed the container-linux-update-operator. However, we didn’t want to completely forgo occasional updates, for instance when spinning up new instances or executing rare manual reboots, and therefore kept the update_engine enabled (wcgw?).

On October 24th, CoreOS released a new version (1855.5.0) on their Stable branch, which our nodes downloaded and installed to the passive partition, waiting for manual reboots to switch over. As we were working on other projects at the time and were not making any active changes to our Kubernetes infrastructure, we hadn’t started any new clusters/instances for a little while. A few days later, one of our engineers started the process of upgrading to Kubernetes 1.12, and noticed that some of the test nodes were rebooting unexpectedly from time to time – weird, but okay; it might be related to some other changes. The following night, another of our engineers got paged because one of our production nodes started to flip between Ready / NotReady states. Just a single node out of the entire pool – we can simply take it out for now and investigate during working hours.

We’ve had instances restart on us in the past, either because the underlying cloud provider’s host died, or because of some irritating panics related to the VM’s network stack implementation. But this time, the node was continuously restarting. Obviously, the logs didn’t show anything, and the pstore was empty. We provisioned a new server: same symptoms. We then noticed it was the very first node we were running with this Container Linux version, and therefore started a server with 1855.4.0 instead – problem “solved”. The CHANGELOG looks pretty slim: just a git and a kernel update. We don’t use git, and the restarts hella look like panics.

After a little bit of digging, we got our hands on a panic dump. From there, a basic Google search answered our questions: Crash with latest v4.14.73 in netif_skb_features. Of course, our Kubernetes nodes use sch_netem in some instances, for very specific packets. We only seldom noticed the issue on our in-development 1.12 cluster, as those specific packets rarely flow there. On production, however, those packets are rapidly transmitted as soon as Kubernetes starts our workloads, causing the nodes to panic immediately. This would seem to indicate that sch_netem may not be tested as part of Linux kernel releases. Since our misadventure, a kernel patch has been submitted upstream, but has yet to be released in Container Linux.
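For readers who haven’t crossed paths with it, sch_netem is the kernel’s network-emulation queueing discipline, used to inject artificial latency, loss, or reordering on an interface. As a rough sketch of how such a qdisc might be attached from Go, assuming the github.com/vishvananda/netlink package (the interface name and parameters are invented, and this is not our production setup):

```go
// Attach a netem qdisc to an interface, roughly equivalent in spirit to:
//   tc qdisc add dev eth0 root netem delay 20ms loss 0.1%
// Requires root privileges and a Linux host.
package main

import (
	"log"

	"github.com/vishvananda/netlink"
)

func main() {
	// Look up the interface to shape (hypothetical name).
	link, err := netlink.LinkByName("eth0")
	if err != nil {
		log.Fatal(err)
	}

	// sch_netem parameters: a little artificial delay and loss.
	netem := netlink.NewNetem(
		netlink.QdiscAttrs{
			LinkIndex: link.Attrs().Index,
			Handle:    netlink.MakeHandle(1, 0),
			Parent:    netlink.HANDLE_ROOT,
		},
		netlink.NetemQdiscAttrs{
			Latency: 20000, // added delay, expressed in microseconds here
			Loss:    0.1,   // 0.1% packet loss
		},
	)

	if err := netlink.QdiscAdd(netem); err != nil {
		log.Fatal(err)
	}
}
```

Any packet flowing through a qdisc configured like this would exercise the code path that panicked on 4.14.73.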

Pinning new nodes to Container Linux 1855.4.0, disabling the update_engine everywhere, and forcing all existing nodes to keep booting on their currently active partitions prevented the issue from spreading any further. A good way to catch potential issues early at this layer would be to run different node pools subscribed to different channels (Stable, Beta, Alpha).

Now, here is the thing: had we been using locksmith as the update coordinator (as we would with the etcd-cloud-operator, if it were enabled), the update would certainly have been marked as successful, since the failure only appeared about a minute after the node started, and locksmith would have proceeded to update the other nodes, catastrophically leading to cluster-wide panic hell in no time. By the time we diagnosed the root cause, every single node had the new kernel version installed and ready to go, pending a reboot. The issue might not have spread as far with the container-linux-update-operator though, as the local update-agent container has to run for at least a few seconds in order to report a successful update.
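To make the Kubernetes-layer part concrete, the cordon-and-drain step a reboot coordinator performs before restarting a node looks roughly like the sketch below, built with k8s.io/client-go (the node name is hypothetical, and the real container-linux-update-operator does quite a bit more than this):

```go
// Minimal sketch of cordoning a node and evicting its pods before a reboot.
package main

import (
	"context"
	"log"

	policyv1 "k8s.io/api/policy/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

// cordonAndDrain marks a node unschedulable, then asks the eviction API to
// move its pods elsewhere, roughly what a reboot coordinator does first.
func cordonAndDrain(ctx context.Context, cs *kubernetes.Clientset, node string) error {
	// Cordon: prevent new pods from being scheduled on the node.
	patch := []byte(`{"spec":{"unschedulable":true}}`)
	if _, err := cs.CoreV1().Nodes().Patch(ctx, node, types.StrategicMergePatchType, patch, metav1.PatchOptions{}); err != nil {
		return err
	}

	// List the pods currently running on the node.
	pods, err := cs.CoreV1().Pods("").List(ctx, metav1.ListOptions{
		FieldSelector: "spec.nodeName=" + node,
	})
	if err != nil {
		return err
	}

	// Evict them one by one, which respects PodDisruptionBudgets.
	for _, pod := range pods.Items {
		eviction := &policyv1.Eviction{
			ObjectMeta: metav1.ObjectMeta{Name: pod.Name, Namespace: pod.Namespace},
		}
		if err := cs.PolicyV1().Evictions(pod.Namespace).Evict(ctx, eviction); err != nil {
			log.Printf("evicting %s/%s: %v", pod.Namespace, pod.Name, err)
		}
	}
	return nil
}

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		log.Fatal(err)
	}
	cs, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		log.Fatal(err)
	}
	if err := cordonAndDrain(context.Background(), cs, "node-to-reboot"); err != nil {
		log.Fatal(err)
	}
}
```

The important property for our story is that this whole dance only helps if the node stays healthy long enough after the reboot for failures to be noticed before the next node is drained.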

This is not the first time unexpected issues have come out of Container Linux upgrades. As minimal as Container Linux is, as well tested as the Linux kernel is, and as exciting as automated updates are, an untested change to a single line of code can cause serious havoc. Tectonic, CoreOS’ distribution of Kubernetes, shares the same philosophy, but with a much more complex system to update – and I was once at CoreOS, responsible for shipping downstream the updates developed by dozens of very talented engineers. Red Hat / IBM’s successors to those products will likely follow the same principle, which is awesome, but requires great care from everyone – nothing is magic.

(Proofreading: Evan Ricketts)
