Our breakup with Weave Net

In 2017, when BitMEX started using Kubernetes, we picked Weave Net as our overlay network for its apparent simplicity (150 lines of YAML, one DaemonSet, no CRD) and its transparent encryption via IPsec ESP. As our clusters grew, with more and more tenants running real-time financial applications in production, the illusion faded: we have suffered several Weave Net-related network outages. Read on for our story and a no-downtime migration procedure to Calico!

The moment Container Linux was updated and almost broke our fleet

The value proposition of Container Linux by CoreOS (Red Hat / IBM) is its ability to perform automated operating system updates, thanks to its read-only active/passive /usr mount point, the update_engine, and Omaha. This philosophy (“secure the Internet”) allows system administrators to stop worrying about low-level security patches and helps define a clear separation of concerns between operations and applications. In practice, the update_engine periodically polls the configured CoreUpdate server and, upon finding a new version of Container Linux, downloads and installs it to the passive partition, which is then marked as the active partition for the next boot. Note that CoreUpdate does not offer an update to all online nodes at once, but spreads the rollout over several hours or days. After that, a few strategies exist to apply the update, the most common being locksmith's etcd-lock, which restarts up to n servers simultaneously. The second most frequently […]