5 – 15s DNS lookups on Kubernetes?

June 24, 2018 / June 25, 2018 · Quentin Machu · Kubernetes

Back in April, we noticed that several of our applications, but not all, were frequently timing out when querying internal or external services, regardless of the port or protocol. Reproducing the issue was as simple as using cURL in any of our containers, to any destination: the majority of the queries would stall for durations close to multiples of five seconds. Five seconds, you say? That is generally the red flag for DNS issues. Let’s figure it out…

An initial look at the problem

With our Kubernetes stacks using recent and fairly trusted components (AWS, Kubernetes 1.10.2, Weave, CoreDNS), and given the experience I acquired while architecting and developing Tectonic at CoreOS, I was pretty confident our clusters were well configured overall. Still, Kubernetes has various moving pieces, and subtle issues may be introduced at any of those layers. Where might it be this time? Unlike in the OpenStack world, swapping one component for another in Kubernetes is fairly easy thanks to the amazingly simple interfaces the system is built on (e.g. CNI) and the incredible diversity of implementations the community offers. I therefore decided to invest an hour in using iptables rather than IPVS, then replacing Weave with Calico, and CoreDNS with the older KubeDNS, in order to rule them out. No luck, the issue was still there. I did notice that using Weave without fastdp made the issue disappear, but there is no such thing as using Weave without fastdp.

I then remembered various networking issues we had at CoreOS while working on Tectonic for Azure, due to a TX checksum offloading malfunction in their hypervisors / LB implementation, but those were unlikely to be relevant here. The clusters and their nodes were mostly idle, and the ARP tables looked totally fine: no stale entries (as had happened with kube-proxy before), and not full. Back to square one.

Going deeper, and discovering a time-saving feature of CoreDNS

Looking at the facts again, the issue occurs with some applications, but not all of them, most of the time, but not always. The base image of the containers does not seem to have an effect on the numbers. There were a few issues opened here and there during the past few months about DNS latency, some of them totally unrelated (e.g. scalability, misconfiguration, ARP tables being full).

What is the difference between an affected application, like cURL, and an application that worked totally fine? I opened a few tcpdump sessions on the different nodes and containers in the path of cURL in an attempt to answer this question, and understand the problem better.

Reading the container’s tcpdump capture, libc made two lookups only a few CPU cycles apart, one for A and one for AAAA records. While the responses to the A queries came back quickly, the AAAA queries did not seem to be answered in a timely manner and were repeated after five seconds. IPv6 is disabled everywhere across our clusters at the kernel level, and our network interfaces do not even have link-local addresses. Why the applications or libc would make AAAA lookups at all left me somewhat confused, but I could imagine potential use cases, so I moved on. Reading the DNS server’s capture, IPv6 turned out to be irrelevant, as the server would not even receive most of the packets containing the AAAA queries, which are transported over UDP with IPv4. When it did receive them, it would query the upstream server in a similar fashion. So the only thing IPv6 about those AAAA lookups is that they ask for IPv6 addresses, and nothing else.

Because the resolv.conf file of Kubernetes’ containers has numerous search domains and ndots:5, libc generally has to look up several composed names before getting a positive result, unless the requested domain is fully qualified and has a trailing dot, which most applications do not use. For example, to resolve google.com from a pod in the kube-system namespace, google.com.kube-system.svc.cluster.local., google.com.svc.cluster.local., google.com.cluster.local., google.com.ec2.internal. and finally google.com. must be looked up, for both A and AAAA records. That’s a lot of hops, especially when most of the AAAA requests time out after five seconds and must be retried.

I discovered that CoreDNS can actually limit the number of roundtrips required, thanks to its autopath feature. It automatically detects queries being made with a known Kubernetes suffix, iterates server-side through the usual search domains, and leverages its own knowledge/cache of the available Kubernetes services to find a valid name (or falls back to querying the upstream server). It finally returns both a CNAME containing the domain name found to be valid and an A/AAAA response with the actual IP address for that domain name (or NXDOMAIN if the record does not exist, obviously). I was baffled to see how smart and convenient that was. Such an easy win.

nameserver 172.17.0.10
search default.svc.cluster.local svc.cluster.local cluster.local ec2.internal
options ndots:5
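
To make that expansion concrete, here is a minimal shell sketch of the lookup order documented in resolv.conf(5): a name with fewer dots than ndots goes through the search list first and is only tried as-is last, while a name with a trailing dot skips the search list altogether. The function name dns_candidates is made up for illustration, and the sketch only prints names, it does not send any query.

```shell
#!/usr/bin/env bash
# Print, one per line, the names a resolver would try for NAME, in order.
# Usage: dns_candidates NAME NDOTS SEARCH_DOMAIN...
# (dns_candidates is a hypothetical helper, not part of any tool.)
dns_candidates() {
  local name=$1 ndots=$2
  shift 2

  # A trailing dot marks the name as fully qualified: no search list is used.
  if [[ $name == *. ]]; then
    printf '%s\n' "$name"
    return
  fi

  # Count the dots in the name by deleting every other character.
  local dots=${name//[^.]/}
  local domain

  # Enough dots: the name is tried as-is before the search list.
  if (( ${#dots} >= ndots )); then
    printf '%s.\n' "$name"
  fi
  for domain in "$@"; do
    printf '%s.%s.\n' "$name" "$domain"
  done
  # Too few dots: the name is only tried as-is after the search list.
  if (( ${#dots} < ndots )); then
    printf '%s.\n' "$name"
  fi
}

# Example, using the search list from the resolv.conf above:
#   dns_candidates google.com 5 default.svc.cluster.local svc.cluster.local \
#     cluster.local ec2.internal
```

With ndots:5 and four search domains, google.com (one dot) yields five candidate names, each queried for both A and AAAA records, whereas google.com. yields exactly one.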

.:5353 {
    errors
    log
    health
    reload

    kubernetes cluster.local 172.16.0.0/16 172.17.0.0/16 {
        pods verified
        resyncperiod 1m
        fallthrough
    }
    cache 10 cluster.local 172.16.0.0/16 172.17.0.0/16
    autopath @kubernetes

    proxy . /etc/resolv.conf
    prometheus 0.0.0.0:9153
}

19:27:05.990180 IP ip-172-16-0-6.us-east-2.compute.internal.52031 > ip-172-17-0-10.us-east-2.compute.internal.domain: 3888+ A? google.com.kube-system.svc.cluster.local. (58)
19:27:05.990253 IP ip-172-16-0-6.us-east-2.compute.internal.52031 > ip-172-17-0-10.us-east-2.compute.internal.domain: 58213+ AAAA? google.com.kube-system.svc.cluster.local. (58)
19:27:05.990258 IP ip-172-16-0-6.us-east-2.compute.internal.52031 > ip-172-17-0-10.us-east-2.compute.internal.domain: 58213+ AAAA? google.com.kube-system.svc.cluster.local. (58)
19:27:06.103767 IP ip-172-17-0-10.us-east-2.compute.internal.domain > ip-172-16-0-6.us-east-2.compute.internal.52031: 3888 2/0/0 CNAME google.com., A 172.217.15.110 (98)
19:27:10.994773 IP ip-172-16-0-6.us-east-2.compute.internal.52031 > ip-172-17-0-10.us-east-2.compute.internal.domain: 3888+ A? google.com.kube-system.svc.cluster.local. (58)
19:27:10.994791 IP ip-172-16-0-6.us-east-2.compute.internal.52031 > ip-172-17-0-10.us-east-2.compute.internal.domain: 3888+ A? google.com.kube-system.svc.cluster.local. (58)
19:27:10.995299 IP ip-172-17-0-10.us-east-2.compute.internal.domain > ip-172-16-0-6.us-east-2.compute.internal.52031: 3888 2/0/0 CNAME google.com., A 172.217.15.110 (98)
19:27:10.995330 IP ip-172-16-0-6.us-east-2.compute.internal.52031 > ip-172-17-0-10.us-east-2.compute.internal.domain: 58213+ AAAA? google.com.kube-system.svc.cluster.local. (58)
19:27:10.995337 IP ip-172-16-0-6.us-east-2.compute.internal.52031 > ip-172-17-0-10.us-east-2.compute.internal.domain: 58213+ AAAA? google.com.kube-system.svc.cluster.local. (58)
19:27:11.100456 IP ip-172-17-0-10.us-east-2.compute.internal.domain > ip-172-16-0-6.us-east-2.compute.internal.52031: 58213 2/0/0 CNAME google.com., AAAA 2a00:1450:8003::69 (110)

An initial workaround

Enabling autopath did not solve the root cause though, as we were still seeing AAAA lookups taking up to five seconds.

After a bit of digging, I read in the resolv.conf(5) man page that two options relevant to the parallel lookup mechanism used by glibc are available: single-request and single-request-reopen, both of which enable sequential lookups. After specifying either of those options, using the relatively new dnsConfig configuration block (Alpha in Kubernetes 1.9), I finally saw only sub-second queries, and got immediately excited about the fact that I would simply be able to add this to our templates and call it a day. I applied the changes and happily went home, too late anyway.

single-request (since glibc 2.10)
       Sets RES_SNGLKUP in _res.options.  By default, glibc
       performs IPv4 and IPv6 lookups in parallel since
       version 2.9.  Some appliance DNS servers cannot handle
       these queries properly and make the requests time out.
       This option disables the behavior and makes glibc
       perform the IPv6 and IPv4 requests sequentially (at the
       cost of some slowdown of the resolving process).

single-request-reopen (since glibc 2.9)
       Sets RES_SNGLKUPREOP in _res.options.  The resolver
       uses the same socket for the A and AAAA requests.  Some
       hardware mistakenly sends back only one reply.  When
       that happens the client system will sit and wait for
       the second reply.  Turning this option on changes this
       behavior so that if two requests from the same port are
       not handled correctly it will close the socket and open
       a new one before sending the second request.

dnsConfig:
  options:
    - name: single-request-reopen
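
To check whether a change like this actually removes the stalls, one option is to run the same request many times and count the slow runs. The sketch below is an assumption about how one might do that, not the method used at the time: count_slow_runs is a hypothetical helper, and it relies on bash's SECONDS counter, whose one-second resolution is enough to tell a five-second stall from a normal lookup.

```shell
#!/usr/bin/env bash
# Run a command COUNT times and print how many runs took at least
# THRESHOLD seconds. Timing uses bash's SECONDS counter (1 s resolution).
# Usage: count_slow_runs THRESHOLD COUNT COMMAND [ARG...]
# (count_slow_runs is a hypothetical helper name.)
count_slow_runs() {
  local threshold=$1 count=$2
  shift 2
  local slow=0 i start

  for (( i = 0; i < count; i++ )); do
    start=$SECONDS
    # The command's output and exit status are ignored: only time matters.
    "$@" > /dev/null 2>&1 || true
    if (( SECONDS - start >= threshold )); then
      slow=$(( slow + 1 ))
    fi
  done
  echo "$slow"
}

# Example, run from inside a container (the target URL is arbitrary):
#   count_slow_runs 2 50 curl -s -o /dev/null http://google.com
```

A threshold of two seconds leaves room for the rounding of SECONDS: a sub-second request measures as 0 or 1, a stalled one as 5 or more.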

Setback & netfilter race conditions

That was until I discovered that the workaround had no effect on Alpine containers. At that moment I knew musl was going to give me a hard time, again; I should have known better. Its resolver only supports ndots, attempts, and timeout. Awesome. I went to talk to Rich Felker on #musl, only to learn that no change would be made, as sequential lookups go against their architecture and because, according to other users on the IRC channel, Kubernetes’ use of ndots is heresy anyway. Wherever the actual issue is (be it the general concept of Kubernetes’ networking), it should be fixed there.

Sequential queries work, parallel ones do not, sometimes, but not always. That’s got to be a race condition; with the amount of networking trickery that Kubernetes does to get packets from one end to the other, it would not be too surprising after all. After some additional research, I found some existing literature about netfilter race conditions, such as this one or that one. Looking at conntrack -S, we had thousands of insert_failed. This is it. It turns out that a few engineers have noticed the issue and have gone through the troubleshooting process as well, identifying a SNAT race condition, ironically briefly documented in netfilter’s code. The solution would be to add --random-fully to all masquerading rules, which are set by several components in Kubernetes: kubelet, kube-proxy, Weave and Docker itself. There is only one little problem here… This is an early feature, and it is not available on Container Linux, in Alpine’s iptables package, or in the Go wrapper of iptables. Regardless, it seems generally accepted that this would be the solution to the issue, and some developers are now implementing the missing flag support. But behold, it does not stop here.

Based on various traces, Martynas Pumputis discovered that there was also a race with DNAT, as the DNS server is reached via a virtual IP. Due to UDP being a connectionless protocol, connect(2) does not send any packet and therefore no entry is created in the conntrack hash table. During the translation, the following netfilter hooks are called in order: nf_conntrack_in (creates the conntrack hash object, adds it to the unconfirmed entries list), nf_nat_ipv4_fn (does the translation, updates the conntrack tuple), and nf_conntrack_confirm (confirms the entry, adds it to the hash table). The two parallel UDP requests race for the entry confirmation and end up using different DNS endpoints, as there are multiple DNS server replicas available. Therefore, insert_failed is incremented, and the request is dropped. This means that adding --random-fully does not mitigate the packet loss, as the flag would only help mitigate the SNAT race! The only reliable fix would be to patch netfilter directly, which Martynas Pumputis is currently attempting to do.
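
For anyone who wants to keep an eye on that counter, here is a small sketch that sums insert_failed across the per-CPU lines printed by conntrack -S. It assumes the key=value output format of the conntrack CLI; the function names are hypothetical, and the second helper needs the conntrack tool plus enough privileges to read the statistics.

```shell
#!/usr/bin/env bash
# Sum the insert_failed counters found on stdin, across all CPU lines.
# Usage: conntrack -S | sum_insert_failed
# (Assumes whitespace-separated key=value fields, one line per CPU.)
sum_insert_failed() {
  awk '{
    for (i = 1; i <= NF; i++) {
      if ($i ~ /^insert_failed=/) {
        split($i, kv, "=")
        total += kv[2]
      }
    }
  } END { print total + 0 }'
}

# Print how much the total grew over INTERVAL seconds (default 10).
# Requires the conntrack CLI; not called automatically by this sketch.
insert_failed_delta() {
  local interval=${1:-10} before after
  before=$(conntrack -S | sum_insert_failed)
  sleep "$interval"
  after=$(conntrack -S | sum_insert_failed)
  echo $(( after - before ))
}
```

Sampling the delta on each node before and after applying a workaround gives a rough rate of dropped insertions, which is more telling than the absolute total since the counters only ever grow.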

A short and efficient workaround

Getting a patch into the kernel, and having it released, is not something that happens overnight. I therefore started writing my own workaround, based on all the knowledge gathered while troubleshooting the issue. Fortunately, I learned how to use tc(8) back when I was administering a large infrastructure of containers for my startup Harmony Hosting, in order to provide bandwidth guarantees to our customers and help mitigate DDoS attacks. Coping with such a race condition requires nothing but introducing a small amount of artificial latency to every AAAA packet.

Using iptables, we can mark UDP traffic destined to the port exposing our DNS server, that has the DNS query bits set (inexpensive check) and that contains at least one question with QTYPE=AAAA. We need to be cautious because of existing marks, and use a proper mask. With tc, we can route the marked traffic using a two-band priomap to a netem that will introduce a few milliseconds' worth of latency, and the rest to a standard fq_codel. Additionally, we need to do our DPI and traffic shaping on the right interface, as Weave will encapsulate and encrypt traffic using IPSec (ESP), obfuscating everything. The good news though is that the Weave interface is a virtual interface and is therefore set to noqueue by default, so we won’t need to worry about mq or about grafting qdiscs to specific TX/RX queues or CPU cores, which makes the script extremely simple.
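
To see why a fixed byte pattern is enough to spot an AAAA question, it helps to look at how a DNS question is laid out on the wire: the name as length-prefixed labels, a zero byte that ends the name, a two-byte QTYPE (28, or 0x001C, for AAAA) and a two-byte QCLASS (1 for IN). The sketch below encodes a question as hex; the function names are hypothetical and it only builds the question section, not a full packet. Assuming a 20-byte IPv4 header without options, an 8-byte UDP header and the 12-byte DNS header, the question starts 40 bytes into the packet.

```shell
#!/usr/bin/env bash
# Print the hex encoding of a DNS question section for NAME and QTYPE.
# Usage: dns_question_hex NAME QTYPE   (QTYPE as a number: 1 = A, 28 = AAAA)
# (Hypothetical helper: question section only, class fixed to IN.)
dns_question_hex() {
  local name=${1%.} qtype=$2 label hex="" i
  local IFS=.

  # Each label is written as one length byte followed by its ASCII bytes.
  for label in $name; do
    hex+=$(printf '%02x' "${#label}")
    for (( i = 0; i < ${#label}; i++ )); do
      # A leading quote makes printf output the character's numeric code.
      hex+=$(printf '%02x' "'${label:i:1}")
    done
  done

  # Zero byte ending the name, then QTYPE and QCLASS=IN, two bytes each.
  printf '%s00%04x0001\n' "$hex" "$qtype"
}

# Succeed when a hex-encoded question ends with: end of name, AAAA, IN.
ends_like_aaaa_question() {
  [[ $1 == *00001c0001 ]]
}

# Example:
#   dns_question_hex google.com 28
```

The trailing bytes 00 00 1c 00 01 of an AAAA question are the same whatever the name is, which is what makes a plain string match on them practical.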

# Force the kernel to re-create the dummy mq scheduler on the default interface,
# - as the child qdiscs may have been set to pfifo_fast at boot even if the default
# appears to be 'fq_codel' (we also set the default to fq_codel regardless, for older
# systems)
# - as the qdiscs are using a quantum based on the boot MTU, which may have changed
# after DHCP has gotten the proper MTU.
#
# Setting mq will only work if the NIC supports multiple TX/RX queues, therefore
# creating and grafting each class/qdiscs to specific CPU cores. In case the NIC
# does not support that, we simply ignore the error.
sysctl -w net.core.default_qdisc=fq_codel
tc qdisc del dev $(route | grep '^default' | grep -o '[^ ]*$') root 2>/dev/null || true
tc qdisc add dev $(route | grep '^default' | grep -o '[^ ]*$') root handle 0: mq || true

# Traffic leaving the weave interface onto the default interface will be encapsulated
# and encrypted in IPSec (ESP), therefore, we may only do traffic shaping work on this
# interface.
#
# The weave interface is a virtual interface, which is set to noqueue by default and does
# not support mq nor multiq. Therefore, we go directly to the point and create a 2-band
# priomap that sends all traffic (regardless of the TOS octet) to the 2nd band, a simple
# fq_codel. We then define the 1st band as a netem with a small delay, which appears to
# avoid the race in a statistically satisfying manner and is controlled by a pareto
# distribution (k=4ms, a=1ms), and route traffic marked by 0x100/0x100 to it.
#
# Using iptables, we mark with 0x100/0x100 the UDP traffic destined to port 5353 that has the
# DNS query bits set (fast check) and that contains at least one question with QTYPE=AAAA.
while ! ip link | grep "weave:" > /dev/null; do sleep 1; done
tc qdisc del dev weave root 2>/dev/null || true
tc qdisc add dev weave root handle 1: prio bands 2 priomap 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

tc qdisc add dev weave parent 1:2 handle 12: fq_codel

tc qdisc add dev weave parent 1:1 handle 11: netem delay 4ms 1ms distribution pareto
tc filter add dev weave protocol all parent 1: prio 1 handle 0x100/0x100 fw flowid 1:1
iptables -A POSTROUTING -t mangle -p udp --dport 5353 -m string -m u32 --u32 "28 & 0xF8 = 0" --hex-string "|00001C0001|" --algo bm --from 40 -j MARK --set-mark 0x100/0x100

while sleep 3600; do :; done
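
The 0x100/0x100 notation appears twice in the script, once in the iptables rule and once in the tc filter, and in both places it reads value/mask. As documented for the MARK target, --set-mark clears the bits selected by the mask and then ORs the value in, so marks set by other components on other bits survive. The fw filter compares only the masked bits against the value. The sketch below reproduces that arithmetic; the function names are hypothetical, and 0x4000 merely stands in for a mark already set by some other component.

```shell
#!/usr/bin/env bash
# What "-j MARK --set-mark VALUE/MASK" does to an existing packet mark:
# clear the bits selected by MASK, then OR in VALUE.
# (apply_set_mark and fw_filter_matches are hypothetical helper names.)
apply_set_mark() {
  local mark=$1 value=$2 mask=$3
  printf '0x%x\n' $(( (mark & ~mask) | value ))
}

# What a tc "fw" filter with "handle VALUE/MASK" tests:
# only the bits selected by MASK are compared against VALUE.
fw_filter_matches() {
  local mark=$1 value=$2 mask=$3
  (( (mark & mask) == value ))
}

# Example: a packet already carrying mark 0x4000 keeps that bit.
#   apply_set_mark 0x4000 0x100 0x100          # prints 0x4100
#   fw_filter_matches 0x4100 0x100 0x100       # succeeds
#   fw_filter_matches 0x4000 0x100 0x100       # fails
```

Using a single bit as both value and mask keeps the AAAA marking independent from whatever else the mark carries, in both directions.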

Finally, we can build a very simple container image containing only the iproute2 package, and run it alongside Weave’s containers in its DaemonSet.

- name: weave-tc
  image: 'qmachu/weave-tc:0.0.1'
  securityContext:
    privileged: true
  volumeMounts:
    - name: xtables-lock
      mountPath: /run/xtables.lock
    - name: lib-tc
      mountPath: /lib/tc

Conclusion

All in all, given the current adoption of Kubernetes, it is quite surprising that only a few Kubernetes engineers noticed this omnipresent and highly disruptive issue. That may be because networking conditions are not as favorable to the race everywhere, or it may be a symptom of a lack of monitoring overall.

However, I am thrilled that we ended up with a workaround consisting of 10 lines of bash and 10 lines of YAML, which does not require maintaining patches anywhere or pushing any changes down to our users, and which reduces the likelihood of the races happening to far less than a percent. And along the way, we also picked up a change that cuts the number of DNS roundtrips dramatically!

Edit: As mentioned by Duffie Cooley, it would also be possible to run the DNS server on every node using a DaemonSet, and specify the node’s IP as the clusterDNS in kubelet’s configuration. This solution is unfortunately unusable for us, as containers with cluster-wide permissions (even read-only) are not allowed to run on our worker nodes, and as containers do not have direct network access to any of our nodes for security reasons.

25 thoughts on “5 – 15s DNS lookups on Kubernetes?”

Steven says:

June 25, 2018 at 11:31 am

Thanks for this write up, it really helps clarify these DNS timeouts we’re experiencing. We’re testing out the coredns config and its definitely reduced our lookups. I’m trying the weave script in the weave daemonset and I’m still seeing insert_failed increase. Not sure if it’s increasing slower since I don’t have metrics on this. What rate of inserts_failed should I be seeing? Or what percentage did it drop for you in your environment? How are you getting metrics on insert_failed?

James McShane says:

July 11, 2018 at 3:32 pm

Oh man, I’ve been wrestling with this issue for the past six weeks and I am so glad that I found this blog. This line echoes exactly how I feel:

“it is quite surprising that only a few Kubernetes engineers noticed this omnipresent and highly disruptive issue”

As I dug in further and found this issue going through dnsmasq and dropped queries, I was going crazy thinking this exact same thing. I’m going to try your workaround and see where I can get resolving this issue. Thank you for your write up!

Harper says:

August 1, 2018 at 11:33 am

Hi, I coded a solution to skip AAAA query by default for lib musl on Alpine, described here: https://github.com/kubernetes/kubernetes/issues/56903#issuecomment-409603030
It worked for me.
Could you pls try it in your clusters?

Thanks
Harper

1. Quentin Machu says:
  
  December 14, 2018 at 9:00 pm
  
  Totally fine solution if you accept making the sacrifice of running custom Alpine everywhere. In multi-tenant clusters, this is harder to communicate. Upstream Alpine likely won’t merge the request as they’ve told me on IRC back then.
  
Victor says:

September 17, 2018 at 5:03 pm

We are on the same boat here. We implemented just this change:

dnsConfig:
  options:
    - name: single-request-reopen

and the timeouts seem to have improved.

We are on Kubernetes version 1.10 using kube-dns. We will upgrade to CoreDNS after upgrading to Kubernetes 1.11.

1. Quentin Machu says:
  
  December 14, 2018 at 8:59 pm
  
  Absolutely, this is the first possible workaround. However, this will not work on Alpine containers as they use musl instead of glibc – which does not implement support for the option.
  
Ben Wilson says:

October 10, 2018 at 9:18 am

I believe we’re running into similar issues, but I wanted to get clear on a couple of things first
1) The DNS lookup issues you ran into applied to ALL domain names or just K8s internal ones?
2) Our docker images are using Ubuntu not Alpine, so we have more resolv.conf options. I’ve seen use-vc suggested, which would do DNS lookups over TCP rather than UDP. Would that help in this situation, or would the single-request-reopen option you mention in the blog post be sufficient?

1. Quentin Machu says:
  
  December 14, 2018 at 8:57 pm
  
  1/ All of them
  2/ You may use single-request or single-request-reopen! use-vc will indeed work as the issue is limited to UDP, but will slow down your DNS requests.
  
Petro says:

December 20, 2018 at 7:35 am

Thank you very much for such a useful writeup. We experienced exactly the same issues, and even more, with DNS (we are running on GKE and we don’t use any additional network meshes as of now). Reading through this, I understood the various options for the TCP race condition and SNAT, and they helped for sure; what seems to be the hardest part is the UDP/DNAT solution. Did I understand correctly that there is currently no simpler solution than modifying iptables and introducing that artificial delay? Also, as I mentioned, we don’t use Weave or another network mesh right now, so I’m not quite sure if this may result in a simpler or more complex solution.

1. Quentin Machu says:
  
  June 20, 2019 at 2:19 pm
  
  That’s right. Several patches have been submitted to the kernel, but one issue remains and should only be fixed in a 5.x release.
  
Zane says:

June 20, 2019 at 11:54 am

Could we use the ‘tc’ solution with Flannel?

1. Quentin Machu says:
  
  June 20, 2019 at 2:18 pm
  
  Yeah, pretty sure it’d work no problem. Gotta adjust the interface.
  
insoz says:

October 15, 2019 at 6:10 am

I set single-request-reopen but it does not seem to work,

resolv.conf in container

nameserver 10.96.0.10
search production.svc.cluster.local svc.cluster.local cluster.local
options ndots:5 single-request-reopen

What did I miss? This has confused me for many days, thank you!

1. Quentin Machu says:
  
  August 13, 2020 at 9:45 pm
  
  Have you confirmed the problem is related to the conntrack race? If so, are you using Alpine? single-request-reopen is not supported on Alpine.
  
Jubis says:

October 30, 2019 at 7:27 am

Hi Quentin,
we are using CentOS 7.6 and this problem has been occurring for a few weeks. Applying your first workaround seems to improve performance for us.
I would like to know if you have an idea of when the problem started? From a specific version of glibc?
It seems to be a regression, not a problem that has always existed.
Thank you in advance for your help and thank you for your article, really helpful.
BR

1. Quentin Machu says:
  
  August 13, 2020 at 9:44 pm
  
  Hi, sorry for the super late answer.. Not familiar with CentOS specifically but as far as I understand it, this has been a race condition for a very long time within conntrack. It only gets triggered very frequently if your libc library queries A/AAAA at the exact same time (which may not always have been the case?).
  
SF says:

January 8, 2020 at 3:02 am

Hi, thanks for the awesome blog post.

Does the traffic shaping/tc solution also work with non-alpine containers?

We’re having this issue on Kubernetes 1.10.3, using Weave and KubeDNS. Will this fix work with all of those as well?

Also, since the blog post is a little old, is the issue fixed in the latest versions of Kubernetes or the Linux Kernel? If so, which versions?

1. Quentin Machu says:
  
  August 13, 2020 at 9:37 pm
  
  Absolutely, that works with any base image, and without having to ask your Kubernetes tenants to do anything really. That’s the key advantage of this solution!
  
D Trimble says:

January 17, 2020 at 9:37 am

I don’t fully understand the tc solution, are you running the weave-tc pod as a second container in the weave pod? Do you already have the bash above that baked into your sidecar? Is it really as simple as it sounds?

1. Quentin Machu says:
  
  August 13, 2020 at 9:36 pm
  
  Correct, but the weave-tc container really just runs a simple bash script, it can be executed in any other way, and on any overlay network really!
  
Daniel says:

April 11, 2020 at 5:47 pm

Urgh….
Finally I’ve found this Article describing exactly the issue I’ve been seeing!
In the meantime, I have already been digging really deep and even came up with a nice and super-easy fix you all can apply until the underlying issue has actually been patched:
Kubernetes supports SessionAffinity for Services for quite some time (PR was merged in 2014).
Configuring the kube-dns Service with SessionAffinity:ClientIP triggers all DNS request packets from one pod to be delivered to the same kube-dns pod, thus eliminating the problem that the race condition causes (although the race condition still exists, it now doesn’t have any effect anymore).
So: for the kube-dns service in the kube-system namespace, you want to change the
service.spec.sessionAffinity from None to ClientIP.
This will probably be overwritten when Kubernetes updates are run, but it solved the problem in our case and was easy to apply without any other modifications to the apps running on the cluster.

shan says:

April 22, 2020 at 3:01 am

Hi
We are running on Alpine and we don’t use Weave. As you said, “single-request-reopen” won’t work on Alpine, so it is ruled out. So what workaround should we use?
Thanks

1. Quentin Machu says:
  
  August 13, 2020 at 9:45 pm
  
  I, unfortunately, made it pretty confusing, but the weave-tc script works regardless of your overlay network. You simply have to change the interface via the provided env var!
  
shan says:

April 22, 2020 at 3:03 am

Hi,
we are using Alpine and we are not using Weave, and on Alpine, as said, “single-request-reopen” won’t work. What is the best possible workaround for this?
Thanks

1. Quentin Machu says:
  
  August 13, 2020 at 9:35 pm
  
  The kernel workaround provided here works regardless of the overlay network, you simply have to replace the network interface it affects via the env var!
  