5 – 15s DNS lookups on Kubernetes?

Back in April, we noticed that several of our applications, but not all, were quite frequently timing out querying either internal or external services, regardless of the ports or protocols. Reproducing the issue was as simple as using cURL in any of our containers, to any destination, where the majority of the queries would stall for durations close to multiples of five seconds. Five seconds, you say? That is generally the red flag for DNS issues. Let’s figure out…

An initial look at the problem

With our Kubernetes stacks using recent and fairly trusted components (AWS, Kubernetes 1.10.2, Weave, CoreDNS), and given the experience I acquired while architecting and developing Tectonic at CoreOS, I was fairly confident our clusters were well configured overall. Still, Kubernetes has many moving pieces, and subtle issues may be introduced at any of those layers. Where might it be this time? Unlike in the OpenStack world, swapping one component for another in Kubernetes is fairly easy thanks to the amazingly simple interfaces the system is built on (e.g. CNI), and the incredible implementation diversity that the community offers. I therefore decided to invest an hour using iptables rather than IPVS, then replacing Weave with Calico, and CoreDNS with the older KubeDNS, in order to rule them out. No luck, the issue was still there. I noticed however that using Weave without fastdp made the issue disappear, but there is no such thing as using Weave without fastdp.

I then remembered various networking issues we had at CoreOS while working on Tectonic for Azure, due to a TX checksum offloading malfunction in their hypervisors / LB implementation, but that was unlikely to be relevant here. The clusters and their nodes were mostly idle, and the ARP tables looked totally fine: no stale entries (as had occurred with kube-proxy before), and not full. Back to square one.

Going deeper, and discovering a time-saving feature of CoreDNS

Looking at the facts again, the issue occurs with some applications, but not all of them, most of the time, but not always. The base image of the containers does not seem to have an effect on the numbers. There were a few issues opened here and there during the past few months about DNS latency, some of them totally unrelated (e.g. scalability, misconfiguration, ARP tables being full).

What is the difference between an affected application, like cURL, and an application that worked totally fine? I opened a few tcpdump sessions on the different nodes and containers in the path of cURL in an attempt to answer this question, and understand the problem better.

Reading the container’s tcpdump capture, two lookups were made by libc, spaced by only a few CPU cycles, for A and AAAA records. While the responses for the A queries were coming back quickly, the AAAA queries did not seem to be answered in a timely manner and were repeated after five seconds. IPv6 is disabled everywhere across our clusters at the kernel level, and our network interfaces do not even have link-local addresses. Why the applications or libc would make AAAA lookups at all left me somewhat confused, but I could imagine potential use-cases, so I moved on. Reading the DNS server’s capture, IPv6 turned out to be irrelevant, as the server would not even receive most of the packets containing the AAAA queries, which are transported over UDP with IPv4. When it did receive them, it would query the upstream server in a similar fashion. So, the only thing IPv6 about those AAAA lookups is the fact that they ask for IPv6 addresses, and nothing else.

Because the resolv.conf file of Kubernetes’ containers has numerous search domains and ndots:5, libc generally has to look up several composed names before getting a positive result, unless the requested domain is fully-qualified and has a trailing dot, which most applications do not use. For example, to resolve google.com, google.com.kube-system.svc.cluster.local., google.com.svc.cluster.local., google.com.cluster.local., google.com.ec2.internal. and finally google.com. must be looked up, for both A and AAAA records. That’s a lot of roundtrips, especially when most of the AAAA requests time out after five seconds and must be retried. I discovered that CoreDNS can actually limit the number of roundtrips required, thanks to its autopath feature. It automatically detects queries made with a known Kubernetes suffix, iterates server-side through the usual search domains, and leverages its own knowledge/cache of the available Kubernetes services to find a valid name (or falls back to querying the upstream server). It finally returns both a CNAME containing the domain name that was found to be valid, and an A/AAAA response with the actual IP address for that domain name (or NXDOMAIN if the record does not exist, obviously). I was baffled to see how smart and convenient that was, such an easy win.
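To make the cost of the search list concrete, here is a minimal Python sketch of the expansion described above. It is an approximation of what a glibc-style resolver does, not its actual code; the function name candidate_names is made up for this illustration, and the search domains are the ones from the example.

```python
def candidate_names(name, search_domains, ndots):
    """Return the fully-qualified names a glibc-style resolver would try, in order."""
    if name.endswith("."):
        # A trailing dot marks the name as fully qualified: the search list is skipped.
        return [name]
    searched = [f"{name}.{domain}." for domain in search_domains]
    absolute = f"{name}."
    if name.count(".") >= ndots:
        # Enough dots: the name is tried as-is first, then the search list.
        return [absolute] + searched
    # Too few dots: the search list is walked first, the name as-is comes last.
    return searched + [absolute]


# Search domains taken from the example in the text.
SEARCH = [
    "kube-system.svc.cluster.local",
    "svc.cluster.local",
    "cluster.local",
    "ec2.internal",
]

names = candidate_names("google.com", SEARCH, ndots=5)
for candidate in names:
    print(candidate)
# Every candidate is looked up for both A and AAAA records.
print("worst-case queries:", 2 * len(names))
```

With ndots:5, google.com only has one dot, so all four search domains are tried before the real name: ten queries in the worst case, which is what autopath collapses server-side.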

An initial workaround

This did not solve the root cause though, as we are still seeing AAAA lookups taking up to five seconds.

After a bit of digging, I read in the resolv.conf(5) man page that two options relevant to the parallel lookup mechanism used by glibc are available: single-request and single-request-reopen, which both enable sequential lookups. After specifying either of those options, using the relatively new dnsConfig configuration block (Alpha in Kubernetes 1.9), I could finally see nothing but sub-second queries, and got immediately excited about the fact that I would simply be able to add this to our templates and call it a day. I applied the changes and happily went home, too late anyway.
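For reference, here is roughly what such a pod manifest could look like, built as a Python dictionary and printed as JSON (which kubectl accepts in place of YAML). This is only a sketch: the pod name, container name and image are placeholders, and the only part that matters is the dnsConfig block, whose options are appended to the options line of the container’s resolv.conf.

```python
import json

# Minimal pod manifest; everything except dnsConfig is a placeholder.
pod = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "dns-example"},  # hypothetical pod name
    "spec": {
        "containers": [
            {"name": "app", "image": "example/app:latest"},  # hypothetical image
        ],
        "dnsConfig": {
            # Each entry has a "name" and an optional "value".
            # single-request would be configured the same way.
            "options": [{"name": "single-request-reopen"}],
        },
    },
}

print(json.dumps(pod, indent=2))
```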

Setback & netfilter race conditions

This was until I discovered that the workaround had no effect on Alpine containers. It was at this moment that I knew musl was going to give me a hard time again; I should have known better. Their resolver only supports ndots, attempts, and timeout, awesome. I went to talk to Rich Felker on #musl only to learn that no change would be made, as sequential lookups are against their architecture, and because, according to other users on the IRC channel, Kubernetes’ use of ndots is a heresy anyway. Wherever the actual issue is (may it be the general concept of Kubernetes’ networking), it should be fixed there.

Sequential queries work, parallel ones do not, sometimes, but not always. That has got to be a race condition. With the amount of networking trickery that Kubernetes does to get packets from one end to the other, it would not be too surprising after all. After some additional research, I found some existing literature about netfilter race conditions, such as this one or that one. Looking at conntrack -S, we had thousands of insert_failed. This is it. It turns out that a few engineers have noticed the issue and have gone through the troubleshooting process as well, identifying a SNAT race condition, ironically briefly documented in netfilter’s code. The solution would be to add --random-fully to all masquerading rules, which are set by several components in Kubernetes: kubelet, kube-proxy, weave and docker itself. There is only one little problem here… This is an early feature and is not available on Container Linux, nor in Alpine’s iptables package, nor in the Go wrapper of iptables. Regardless, it seems generally accepted that this would be the solution to the issue, and some developers are now implementing the missing flag support. But behold, this does not stop here.
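To keep an eye on the insert_failed counter mentioned above, the per-CPU lines printed by conntrack -S can be summed up. The Python sketch below assumes the usual key=value layout of that output (the exact set of fields varies between versions), and the sample numbers are made up for illustration.

```python
def total_insert_failed(conntrack_output):
    """Sum the insert_failed counters over every per-CPU line of `conntrack -S`."""
    total = 0
    for line in conntrack_output.splitlines():
        for token in line.split():
            key, sep, value = token.partition("=")
            if sep and key == "insert_failed" and value.isdigit():
                total += int(value)
    return total


# Made-up sample, shaped like the output of `conntrack -S` on a two-core node.
SAMPLE = """\
cpu=0 found=0 invalid=12 insert=0 insert_failed=1532 drop=1532 early_drop=0 error=0 search_restart=4
cpu=1 found=0 invalid=9 insert=0 insert_failed=1210 drop=1210 early_drop=0 error=0 search_restart=2
"""

print(total_insert_failed(SAMPLE))  # 2742
```

On a real node, the input would come from running conntrack -S as root (for instance through subprocess.run), and sampling the total periodically gives the rate at which packets are being lost to the race.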

Based on various traces, Martynas Pumputis discovered that there also was a race with DNAT, as the DNS server is reached via a virtual IP. Due to UDP being a connectionless protocol, connect(2) does not send any packet and therefore no entry is created in the conntrack hash table. During the translation, the following netfilter hooks are called in order: nf_conntrack_in (creates the conntrack hash object, adds it to the unconfirmed entries list), nf_nat_ipv4_fn (does the translation, updates the conntrack tuple), and nf_conntrack_confirm (confirms the entry, adds it to the hash table). The two parallel UDP requests race for the entry confirmation and end up using different DNS endpoints, as there are multiple DNS server replicas available. Therefore, insert_failed is incremented, and the request is dropped. This means that adding --random-fully does not mitigate the packet loss, as the flag would only help mitigate the SNAT race! The only reliable fix would be to patch netfilter directly, which Martynas Pumputis is currently attempting to do.
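The sequence of hooks above can be mimicked with a toy model in Python. To be clear, this is not the kernel’s logic: the class and method names are invented, the addresses are placeholders, and the real clash handling in netfilter is more nuanced. It only illustrates why two packets sharing the same tuple are fine when sent one after the other, and why one is lost when both pass the first hook before either is confirmed.

```python
class ToyConntrack:
    """Toy model of the three hooks; not the kernel's actual implementation."""

    def __init__(self):
        self.confirmed = {}     # original tuple -> translated destination
        self.insert_failed = 0

    def conntrack_in(self, flow):
        # Reuse a confirmed entry if there is one, else create an unconfirmed one.
        if flow in self.confirmed:
            return {"tuple": flow, "dst": self.confirmed[flow], "new": False}
        return {"tuple": flow, "dst": None, "new": True}

    def nat(self, entry, endpoint):
        # Only new entries get a DNAT destination picked for them.
        if entry["new"]:
            entry["dst"] = endpoint

    def confirm(self, entry):
        # Returns True if the packet goes through, False if it is dropped.
        if not entry["new"]:
            return True
        if entry["tuple"] in self.confirmed:
            self.insert_failed += 1
            return False
        self.confirmed[entry["tuple"]] = entry["dst"]
        return True


# Both lookups leave from the same socket, hence the same tuple (placeholder values).
FLOW = ("10.32.0.5", 41234, "10.3.0.10", 53, "udp")
ENDPOINTS = ("10.32.0.2", "10.40.0.3")  # two hypothetical DNS replicas


def sequential(ct):
    """The second packet enters only after the first one is confirmed."""
    delivered = []
    for endpoint in ENDPOINTS:
        entry = ct.conntrack_in(FLOW)
        ct.nat(entry, endpoint)
        delivered.append(ct.confirm(entry))
    return delivered


def parallel(ct):
    """Both packets pass the first hook before either one is confirmed."""
    a, b = ct.conntrack_in(FLOW), ct.conntrack_in(FLOW)
    ct.nat(a, ENDPOINTS[0])
    ct.nat(b, ENDPOINTS[1])
    return [ct.confirm(a), ct.confirm(b)]


for scenario in (sequential, parallel):
    ct = ToyConntrack()
    print(scenario.__name__, scenario(ct), "insert_failed =", ct.insert_failed)
```

In the sequential run both packets are delivered and the counter stays at zero; in the parallel run the second confirmation fails, the counter is incremented, and the client is left waiting for its five-second timeout.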

A short and efficient workaround

Getting a patch into the kernel, and having it released, is not something that happens overnight. I therefore started writing my own workaround, based on all the knowledge gathered while troubleshooting the issue. Fortunately, I learned how to use tc(8) back when I was administrating a large infrastructure of containers for my startup Harmony Hosting, in order to provide bandwidth guarantees to our customers and help mitigate DDoS attacks. Coping with such a race condition requires nothing but introducing a small amount of artificial latency to every AAAA packet. Using iptables, we can mark UDP traffic that is destined to the port exposing our DNS server, that has the DNS query bits set (an inexpensive check), and that contains at least one question with QTYPE=AAAA. We need to be cautious about the existing marks, and use a proper mask. With tc, we can route the marked traffic, using a two-band priomap, to a netem qdisc that introduces a few milliseconds of latency, and the rest to a standard fq_codel. Additionally, we need to do our DPI and traffic shaping on the right interface, as Weave will encapsulate and encrypt traffic using IPSec (ESP), obfuscating everything. The good news though is that the Weave interface is a virtual interface and is therefore set to noqueue by default, so we won’t need to worry about mq or about grafting qdiscs to specific TX/RX queues or CPU cores, which makes the script extremely simple.
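The inspection described above is cheap because everything it needs sits at predictable places in the DNS message. The Python sketch below shows the same decision on a raw UDP payload; it is an illustration of the check, not the iptables rule itself, the helper names are invented, and for simplicity it only looks at the first question (real-world queries carry exactly one).

```python
import struct

QTYPE_A = 1
QTYPE_AAAA = 28


def build_query(name, qtype, query_id=0x1234):
    """Build a minimal DNS query, only used to exercise the check below."""
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)  # RD set, 1 question
    labels = name.rstrip(".").split(".")
    qname = b"".join(bytes([len(label)]) + label.encode("ascii") for label in labels)
    return header + qname + b"\x00" + struct.pack("!HH", qtype, 1)  # class IN


def is_aaaa_query(payload):
    """Return True if the UDP payload is a standard DNS query asking for AAAA."""
    if len(payload) < 12:
        return False
    _id, flags, qdcount, _an, _ns, _ar = struct.unpack("!HHHHHH", payload[:12])
    if flags & 0x8000:            # QR bit set: this is a response
        return False
    if (flags >> 11) & 0xF != 0:  # opcode other than standard query
        return False
    if qdcount < 1:
        return False
    offset = 12
    while True:                   # skip over the length-prefixed labels of QNAME
        if offset >= len(payload):
            return False
        length = payload[offset]
        offset += 1
        if length == 0:
            break
        if length & 0xC0:         # compression pointers are not expected in a question
            return False
        offset += length
    if offset + 4 > len(payload):
        return False
    qtype, _qclass = struct.unpack("!HH", payload[offset:offset + 4])
    return qtype == QTYPE_AAAA


print(is_aaaa_query(build_query("google.com", QTYPE_AAAA)))  # True
print(is_aaaa_query(build_query("google.com", QTYPE_A)))     # False
```

Only the packets for which this kind of check matches get marked and delayed, so A lookups and all other traffic keep their normal latency.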

Finally, we can build a very simple container image with the iproute2 package only, and run it alongside Weave’s containers in its DaemonSet.


All in all, given the current adoption of Kubernetes, it is quite surprising that only a few Kubernetes engineers noticed this omnipresent and highly disruptive issue. This may be because networking conditions are not equally favorable to the race everywhere, or it may be a symptom of a general lack of monitoring.

However, I am thrilled to see that we ended up with a workaround that consists of 10 lines of bash and 10 lines of YAML, that does not require maintaining patches anywhere or pushing any changes down to our users, and that reduces the likelihood of the races happening to far less than a percent. And along the way, we also picked up a change that cuts the number of DNS roundtrips dramatically!

Edit: As mentioned by Duffie Cooley, it would also be possible to run the DNS server on every node using a DaemonSet, and specify the node’s IP as the clusterDNS in kubelet’s configuration. This solution is unfortunately unusable for us, as containers with cluster-wide permissions (even read-only) are unable to run on our worker nodes, and as containers do not have direct network access to any of our nodes for security reasons.

25 thoughts on “5 – 15s DNS lookups on Kubernetes?”

  1. Thanks for this write up, it really helps clarify these DNS timeouts we’re experiencing. We’re testing out the coredns config and its definitely reduced our lookups. I’m trying the weave script in the weave daemonset and I’m still seeing insert_failed increase. Not sure if it’s increasing slower since I don’t have metrics on this. What rate of inserts_failed should I be seeing? Or what percentage did it drop for you in your environment? How are you getting metrics on insert_failed?

  2. Oh man, I’ve been wrestling with this issue for the past six weeks and I am so glad that I found this blog. This line echoes exactly how I feel:

    “it is quite surprising that only a few Kubernetes engineers noticed this omnipresent and highly disruptive issue”

    As I dug in further and found this issue going through dnsmasq and dropped queries, I was going crazy thinking this exact same thing. I’m going to try your workaround and see where I can get resolving this issue. Thank you for your write up!

    1. Totally fine solution if you accept making the sacrifice of running custom Alpine everywhere. In multi-tenant clusters, this is harder to communicate. Upstream Alpine likely won’t merge the request as they’ve told me on IRC back then.

  3. We are on the same boat here. We implemented just this change:

    - name: single-request-reopen

    and the timeouts seem to have improved.

    We are on Kubernetes version 1.10 using kube-dns. We will upgrade to CoreDNS after upgrading to Kubernetes 1.11.

    1. Absolutely, this is the first possible workaround. However, this will not work on Alpine containers as they use musl instead of glibc – which does not implement support for the option.

  4. I believe we’re running into similar issues, but I wanted to get clear on a couple of things first
    1) The DNS lookup issues you ran into applied to ALL domain names or just K8s internal ones?
    2) Our docker images are using Ubuntu not Alpine, so we have more resolv.conf options. I’ve seen use-vc suggested which would do DNS lookup over TCP rather than UDP, would that help in this situation, or is the option you mention in the blog post single-request-reopen be sufficient?

    1. 1/ All of them
      2/ You may use single-request or single-request-reopen! use-vc will indeed work as the issue is limited to UDP, but will slow down your DNS requests.

  5. Thank you very much for such a useful writeup. We experienced exactly the same issues, and even more, with DNS (we are running on GKE and we don’t use any additional network meshes as of now). Reading through this, I understood the various options for the TCP race condition and SNAT, and they helped for sure; what seems to be the hardest part is the UDP/DNAT solution. Did I understand correctly that there is currently no simpler solution than modifying iptables and introducing that artificial delay? Also, as I mentioned, we don’t use Weave or another network mesh right now, so I am not quite sure if this may result in a simpler or a more complex solution.

    1. That’s right. Several patches have been submitted to the kernel, but one issue remains and should only be fixed in a 5.x release.

  6. I set the single-request-reopen but it looks like not work,

    resolv.conf in container

    search production.svc.cluster.local svc.cluster.local cluster.local
    options ndots:5 single-request-reopen

    what i missed ? this confused me so many days, thank you !

    1. Have you confirmed the problem is related to the conntrack race? If so, are you using Alpine? single-request-reopen is not supported on Alpine.

  7. Hi Quentin,
    we are using centos 7.6 and this problem has occured since few weeks. Applying your first workaround seems improving performance for us.
    I would like to know if you have an idea on when the problem has started ? From a specific version of glibc ?
    It seems to be a regression, not a problem that always exists.
    Thank you in advance for your help and thank you for your article, really helpful.

    1. Hi, sorry for the super late answer.. Not familiar with CentOS specifically but as far as I understand it, this has been a race condition for a very long time within conntrack. It only gets triggered very frequently if your libc library queries A/AAAA at the exact same time (which may not always have been the case?).

  8. Hi, thanks for the awesome blog post.

    Does the traffic shaping/tc solution also work with non-alpine containers?

    We’re having this issue on Kubernetes 1.10.3, using Weave and KubeDNS, will this fix work with all of those as well?

    Also, since the blog post is a little old, is the issue fixed in the latest versions of Kubernetes or the Linux Kernel? If so, which versions?

    1. Absolutely, that works with any base image, and without having to ask your Kubernetes tenants to do anything really. That’s the key advantage of this solution!

  9. I don’t fully understand the tc solution, are you running the weave-tc pod as a second container in the weave pod? Do you already have the bash above that baked into your sidecar? Is it really as simple as it sounds?

    1. Correct, but the weave-tc container really just runs a simple bash script, it can be executed in any other way, and on any overlay network really!

  10. Urgh….
    Finally I’ve found this Article describing exactly the issue I’ve been seeing!
    In the meantime, I have already been digging really deep and even came up with a nice and super-easy fix you all can apply until the underlying issue has actually been patched:
    Kubernetes supports SessionAffinity for Services for quite some time (PR was merged in 2014).
    Configuring the kube-dns Service with SessionAffinity:ClientIP triggers all DNS request packets from one pod to be delivered to the same kube-dns pod, thus eliminating the problem that the race condition causes (although the race condition still exists, it now doesn’t have any effect anymore).
    So: for the kube-dns service in the kube-system namespace, you want to change the
    service.spec.sessionAffinity from None to ClientIP.
    This will probably be overwritten by Kubernetes updates being run; but it solved the problem in out case and was easy to apply without any other modifications to all the apps that are running on the cluster.

  11. Hi
    We are running on alpine and we dont use weave, as you said “single-request-reopen” wont work for alpine it is ruled out.So wt workaround we should use.

    1. I, unfortunately, made it pretty confusing, but the weave-tc script works regardless of your overlay network. You simply have to change the interface via the provided env var!

  12. Hi,
    we are using alpine and we not using weave and in alpine as said “single-request-reopen” wont work.what best possible workaround for this?

    1. The kernel workaround provided here works regardless of the overlay network, you simply have to replace the network interface it affects via the env var!
