Nvidia & Docker – Failed to initialize NVML

This one is not going to be crazy technical, but I haven’t written in a while and thought it might be helpful to many out there. After containerizing my GPU workloads in my home lab, I noticed that, seemingly at random, my GPU-enabled containers would start throwing a “Failed to initialize NVML” error (e.g. when running nvidia-smi). As it turns out, dozens of users have reported the issue over the past couple of years – but no solid solution had emerged. It was pointed out that it wasn’t “random” after all: running systemctl daemon-reload would instantly trigger the issue, and the container(s) had to be restarted before the GPU(s) could be used within them once again. It all started with Docker 20.10, which began using cgroup v2 (aka the unified cgroup hierarchy) instead of cgroupfs (if enabled on the operating system), and as distros started enabling it by default (e.g. […]
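As a hedged sketch of the kind of workaround commonly reported for this issue (not necessarily the one this post lands on), reverting Docker to the cgroupfs driver – and, if needed, disabling the unified cgroup hierarchy at boot – keeps systemd from re-evaluating device cgroups on daemon-reload. Paths and values below are the usual defaults; your distro may differ:

```
# /etc/docker/daemon.json (JSON itself does not allow comments)
{
  "exec-opts": ["native.cgroupdriver=cgroupfs"]
}

# Optional: fall back to cgroup v1 entirely via the kernel command line
# (e.g. appended to GRUB_CMDLINE_LINUX), then rebuild the GRUB config and reboot:
systemd.unified_cgroup_hierarchy=0
```

Restart the Docker daemon after changing daemon.json, and keep in mind that switching cgroup drivers also requires the kubelet (if any) to be configured to match.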

Our breakup with Weave Net

In 2017, when BitMEX started using Kubernetes, we picked Weave Net as our overlay network for its obvious simplicity (150 lines of YAML, one DaemonSet, no CRDs) and its transparent encryption via IPsec ESP. As our clusters grew bigger, with more and more tenants running real-time financial applications in production, the illusion faded. In fact, we have suffered several Weave Net-related network outages. Read more about our story & a no-downtime migration procedure to Calico!

Kube-proxy IPVS – simple, efficient, unstable

After we started in-place upgrades from Kubernetes 1.10 to 1.12, everything seemed fine. Eight hours of sleep and a few pod OOMKills later, we noticed that some of the AWS ELBs fronting our Kubernetes services were reporting a few unhealthy target nodes. While one of them was marking every node as unhealthy – technically taking the service down – most of the affected ELBs were only missing a few nodes, the same set across all services. Fortunately, troubleshooting Kubernetes services and ingresses is a well-known process, as tenants commonly misconfigure them. The first step involves verifying that the endpoints for the service in question are listed by Kubernetes, which reveals the most common configuration issues, such as a missing port declaration on the container specification, a missing network policy, or a pod label/port name typo. Regarding the service that was completely unhealthy – it turned out that a network policy was missing. Unauthorized traffic is […]
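For illustration, a minimal NetworkPolicy of the kind whose absence can silently drop health-check traffic might look like the following. All names, labels, and ports here are hypothetical – they must match your own Service's backing pods:

```yaml
# Hypothetical policy: allow ingress on the pods' service port so that
# load-balancer health checks (and client traffic) can reach the endpoints.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-web-ingress    # hypothetical name
  namespace: my-tenant       # hypothetical namespace
spec:
  podSelector:
    matchLabels:
      app: web               # must match the pods backing the Service
  policyTypes:
    - Ingress
  ingress:
    - ports:
        - protocol: TCP
          port: 8080         # must match the containerPort declared on the pod
```

In namespaces where a default-deny policy is in place, forgetting a policy like this is exactly the kind of misconfiguration that `kubectl get endpoints` plus a failing health check will surface.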

The moment Container Linux was updated and almost broke our fleet

The value proposition offered by Container Linux by CoreOS (Red Hat / IBM) is its ability to perform automated operating system updates, thanks to its read-only active/passive /usr mount points, the update_engine, and Omaha. This philosophy (“secure the Internet”) allows system administrators to stop worrying about low-level security patches and helps define a clear separation of concerns between operations and applications. In practice, the update_engine periodically pings the configured CoreUpdate server and, upon finding a newer version of Container Linux, downloads and installs it to the passive partition, which is then marked as the active partition for the next boot. Note that CoreUpdate does not push updates to all online nodes at once, but spreads their release over several hours or days. After that, a few strategies exist to apply the update, the most common being locksmith’s etcd-lock, which reboots up to n servers simultaneously. The second most frequently […]
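As a sketch of how the etcd-lock strategy was typically wired up on Container Linux (details hedged from memory of the CoreOS docs, so double-check against your version):

```
# /etc/coreos/update.conf — pick the release channel and reboot strategy
GROUP=stable
REBOOT_STRATEGY=etcd-lock
```

With this in place, locksmithd takes a semaphore in etcd before rebooting, so no more than n machines go down at once; the semaphore size could be adjusted with `locksmithctl set-max <n>`.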

5 – 15s DNS lookups on Kubernetes?

Back in April, we noticed that several of our applications – but not all – were quite frequently timing out when querying either internal or external services, regardless of port or protocol. Reproducing the issue was as simple as running cURL from any of our containers, to any destination: the majority of the queries would stall for durations close to multiples of five seconds. Five seconds, you say? That is generally the red flag for DNS issues. Let’s find out…
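Why five seconds is the red flag: it is glibc's default resolv.conf retry timeout, so each lost DNS packet adds roughly one five-second stall. Without pre-empting where this post's investigation goes, one commonly reported Kubernetes-side mitigation is to stop the resolver from sending A and AAAA queries in parallel over the same socket, via the pod's dnsConfig:

```yaml
# Hypothetical pod spec fragment: pass the glibc resolver option
# single-request-reopen down into the container's /etc/resolv.conf.
spec:
  dnsConfig:
    options:
      - name: single-request-reopen
```

This only applies to glibc-based images (musl-based images such as Alpine ignore the option), and is a workaround rather than a root-cause fix.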