In 2017, when 100x / BitMEX started using Kubernetes, we picked Weave Net as our overlay network for its obvious simplicity (150 lines of YAML, one DaemonSet, no CRD) and transparent encryption via IPSEC ESP. As our clusters grew bigger, with more and more tenants running real-time financial applications in production, the delusion has faded. In fact, we have suffered several Weave Net related network outages. Read more about our story & a no-downtime migration procedure to Calico!
After we started in-place updating Kubernetes 1.10 to 1.12, everything seemed fine. Eight hours of sleep, and a few pods OOMKills later – we noticed that some of the AWS ELBs that front our Kubernetes services were reporting a few unhealthy target nodes. While one of them was marking every node as unhealthy – technically taking the service down, most of the affected ELBs were only missing a few nodes, the same set across all services. Fortunately, troubleshooting Kubernetes services and ingresses is a well-known process as tenants commonly misconfigure them. The first step involves verifying that endpoints for the service considered are listed by Kubernetes, which reveals the most common configuration issues such as a missing port declaration on the containers specification, a missing network policy or a pod label/port name typo. Regarding the service that was completely unhealthy – it turned out that a network policy was missing. Unauthorized traffic is […]