DNS Caching Issues

Hello, we are experiencing a significant issue in the DNS system in Rancher.

We are running a 3-tier system like so:

Rancher LBs -> Service A -> Service B

We are currently running 3 instances of Service A and 2 instances of Service B.
We are running the Rancher LB on each host, which routes traffic to Service A.

When we hit the load balancer, we see each Service A instance pair up with a single copy of Service B and never round-robin.

When I jump into a shell in Service A and perform a dig serviceB, I do see the results rotate appropriately. So the DNS server seems to be round-robining the results.
However, if I install curl and do a curl serviceB, it will only ever hit a single copy of Service B.

This is an enormous issue as it creates a totally imbalanced network load, and it’s completely non-deterministic which container it will link to.

Hey,

Regarding your architecture, I’m slightly confused. Traffic to A will be load balanced but why are you expecting traffic from A to B to be load balanced? Wouldn’t you need a second LB service? LBA > A > LBB > B?

My understanding is that the DNS system will round-robin the lookups, effectively providing client-side load balancing. If that’s not the case, then that is not clear from the documentation.

Yes, it will do simple round-robin…

There is an issue with the TTL; however, I believe Vincent submitted a PR to fix this so local resolution uses a TTL of 1s, so it will round-robin properly… Are you on the latest version?

There is no actual “cache” in Rancher. The DNS server on each host provides a randomized answer every time it is asked.

I believe the problem you’re seeing is that getaddrinfo() returns answers in a deterministic order (more info: http://daniel.haxx.se/blog/2012/01/03/getaddrinfo-with-round-robin-dns-and-happy-eyeballs/) and curl uses that order unless it is compiled with c-ares.

There is a similar but unrelated issue that the default TTL returned on records by the server is 5 minutes (dig should show 300 in the response row). So a client that respects that would keep using the same value for that period (curl does not cache across requests by default). I’m not sure if this was a regression or if it never got set to begin with, but it was intended to be 1 second and will be that in the next release (https://github.com/rancher/rancher-dns/pull/13/).

Hey John,

Of course, forgive me.

Is Service A caching in some way?

Also note that curl (more correctly libcurl) maintains its own DNS cache I believe (with a 60s timeout).

We are currently using 0.56.

These are simple Java applications, and we’ve disabled the DNS cache in the Java security file.
We’ve also run long-running load tests, and we never see a hop or change in the Service B instance that a Service A is talking to.
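
For reference, the programmatic equivalent of that java.security edit looks roughly like this (just a sketch; the service name is a placeholder):

import java.net.InetAddress;
import java.security.Security;

// Programmatic equivalent of editing java.security: turn off the JVM's
// InetAddress cache so every lookup goes back to the resolver. These
// properties must be set before the first name lookup in the process.
public class DisableDnsCache {
    public static void main(String[] args) throws Exception {
        Security.setProperty("networkaddress.cache.ttl", "0");
        Security.setProperty("networkaddress.cache.negative.ttl", "0");

        String host = args.length > 0 ? args[0] : "serviceB"; // placeholder name
        System.out.println(InetAddress.getByName(host));
    }
}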

@vincent - that would make some sense, but wouldn’t that mean every instance of Service A would be talking to a single instance of Service B? That’s not what we see. There is some crossover, but once we see one connection, we never see it change unless we scale Service B down (a restart doesn’t seem to affect it).

Our containers are running on Ubuntu 15, but I haven’t found anything that says the OS is caching the lookups. But that’s what appears to be happening, since both the application and curl are impacted.

We’ll run some more tests tomorrow to make sure.
Would it be worth upgrading to 0.56.1?

Also note that curl (more correctly libcurl) maintains its own DNS cache I believe (with a 60s timeout).

Yes, but it’s just in-process memory, so it doesn’t ever get a cache hit for a single request from the command line:

vincent@host:~$ curl -v http://apple.com 2>&1 | grep -i cache
* Hostname was NOT found in DNS cache
vincent@host:~$ curl -v http://apple.com 2>&1 | grep -i cache
* Hostname was NOT found in DNS cache
vincent@host:~$ curl -v http://apple.com http://apple.com 2>&1 | grep -i cache
* Hostname was NOT found in DNS cache
* Hostname was found in DNS cache

OK, understood, thanks. Probably best I get my nose out of this one. Cheers

Running some more tests this morning and this is what we are seeing:

Host 1
  Service A2
  Service B1
  Service B2

Host 2
  Service A1
  Service A3
  Service B3

What we see for connections are:

A1 -> B1, B3
A2 -> B2
A3 -> B2

We ran this test (~10,000 requests), then waited 10 minutes and ran it again with the same results.

Still having trouble tracking this down.
It seems as though something inside the container is caching the lookup and not releasing it. It appears to be at the container level, since both the applications and curl are affected.
I’ve verified that the Service A containers can properly route to any of the 3 Service B containers, but it appears that name resolution only ever resolves one entry.

nslookup and dig both return rotating values.
Not quite sure how else to test this, or where else to dig, but this seems like a fairly significant problem.

In 0.56 the TTL is still 300 instead of 1, which exacerbates the problem when testing. As I said before, the DNS server returns the list of IPs in a random order every time it is asked, and there is no caching on our side.

The problem is that getaddrinfo(3) sorts the response from the server in non-obvious ways that you can’t control, according to a list of rules in RFC 3484, which results in it returning the same IP over and over for the same combination of host + answers returned by the server. This was later corrected in RFC 6724, but standard glibc has not changed. Some languages offer both, and others fix the problem for you… but presumably Java is using getaddrinfo() under the JRE hood and you have no choice in the matter.
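
For example, a minimal lookup loop in Java (just a sketch; run it with -Dsun.net.inetaddr.ttl=0 so the JVM’s own cache doesn’t hide anything, and substitute your real service name) will keep printing the same first address even though the server rotates its answer:

import java.net.InetAddress;

// Minimal lookup loop. On glibc systems the JVM presumably resolves through
// getaddrinfo(), so even though the DNS server rotates its answer, the
// RFC 3484-sorted order (and therefore the first address) tends to stay fixed.
public class LookupLoop {
    public static void main(String[] args) throws Exception {
        String host = args.length > 0 ? args[0] : "serviceB"; // placeholder name
        for (int i = 0; i < 10; i++) {
            StringBuilder line = new StringBuilder(host + ":");
            for (InetAddress addr : InetAddress.getAllByName(host)) {
                line.append(' ').append(addr.getHostAddress());
            }
            System.out.println(line);
        }
    }
}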

So one thing you can do is use a load balancer between services; you may end up pinned to a balancer, but HAProxy knows how to rotate between the actual backends correctly.

The only other thing I can think of is changing the DNS server to return a single randomly selected address in each response for an A record, and moving the full list to an SRV record or something if you actually wanted them all. This would likely break some applications that expect the full list in the A record response.
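
If the list did move to SRV records, clients that actually want every instance could still enumerate them. A sketch using the JDK’s built-in JNDI DNS provider (the record name below is hypothetical):

import java.util.Hashtable;
import javax.naming.Context;
import javax.naming.directory.Attribute;
import javax.naming.directory.Attributes;
import javax.naming.directory.InitialDirContext;

// Sketch: enumerate all backends from an SRV record using the JDK's
// built-in JNDI DNS provider. The record name below is hypothetical.
public class SrvLookup {
    public static void main(String[] args) throws Exception {
        Hashtable<String, String> env = new Hashtable<>();
        env.put(Context.INITIAL_CONTEXT_FACTORY, "com.sun.jndi.dns.DnsContextFactory");
        InitialDirContext ctx = new InitialDirContext(env);
        try {
            String name = args.length > 0 ? args[0] : "_serviceB._tcp.rancher.internal";
            Attributes attrs = ctx.getAttributes(name, new String[] { "SRV" });
            Attribute srv = attrs.get("SRV");
            if (srv != null) {
                for (int i = 0; i < srv.size(); i++) {
                    // Each value is "priority weight port target"
                    System.out.println(srv.get(i));
                }
            }
        } finally {
            ctx.close();
        }
    }
}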

Sources in https://gist.github.com/vincent99/32e029c995028a13328e:

root@Default_hostname_1:/dns-test# ll
total 60
drwxr-xr-x  3 root root 4096 Feb  1 20:02 ./
drwxr-xr-x 41 root root 4096 Feb  1 20:00 ../
drwxr-xr-x  8 root root 4096 Feb  1 20:00 .git/
-rw-r--r--  1 root root 1051 Feb  1 20:00 DNS.class
-rw-r--r--  1 root root  668 Feb  1 20:00 DNS.java
-rw-r--r--  1 root root  148 Feb  1 20:00 dns.php
-rw-r--r--  1 root root  162 Feb  1 20:00 dns2.php
-rwxr-xr-x  1 root root 9040 Feb  1 20:02 getaddrinfo*
-rw-r--r--  1 root root 1365 Feb  1 20:00 getaddrinfo.c
-rwxr-xr-x  1 root root 8928 Feb  1 20:02 gethostbyname*
-rw-r--r--  1 root root  920 Feb  1 20:00 gethostbyname.c

root@Default_hostname_1:/dns-test# ./gethostbyname nginx
nginx.rancher.internal: 10.42.170.246 10.42.223.48 10.42.118.217
nginx.rancher.internal: 10.42.223.48 10.42.118.217 10.42.170.246
nginx.rancher.internal: 10.42.223.48 10.42.118.217 10.42.170.246
nginx.rancher.internal: 10.42.170.246 10.42.223.48 10.42.118.217
nginx.rancher.internal: 10.42.118.217 10.42.223.48 10.42.170.246
nginx.rancher.internal: 10.42.118.217 10.42.170.246 10.42.223.48

root@Default_hostname_1:/dns-test# ./getaddrinfo nginx
nginx: 10.42.118.217 10.42.223.48 10.42.170.246
nginx: 10.42.118.217 10.42.223.48 10.42.170.246
nginx: 10.42.118.217 10.42.170.246 10.42.223.48
nginx: 10.42.118.217 10.42.223.48 10.42.170.246
nginx: 10.42.118.217 10.42.170.246 10.42.223.48
nginx: 10.42.118.217 10.42.223.48 10.42.170.246
nginx: 10.42.118.217 10.42.223.48 10.42.170.246
nginx: 10.42.118.217 10.42.223.48 10.42.170.246
nginx: 10.42.118.217 10.42.170.246 10.42.223.48
nginx: 10.42.118.217 10.42.170.246 10.42.223.48
nginx: 10.42.118.217 10.42.223.48 10.42.170.246
nginx: 10.42.118.217 10.42.170.246 10.42.223.48
nginx: 10.42.118.217 10.42.223.48 10.42.170.246
nginx: 10.42.118.217 10.42.170.246 10.42.223.48
nginx: 10.42.118.217 10.42.223.48 10.42.170.246
nginx: 10.42.118.217 10.42.223.48 10.42.170.246
nginx: 10.42.118.217 10.42.223.48 10.42.170.246
nginx: 10.42.118.217 10.42.170.246 10.42.223.48
nginx: 10.42.118.217 10.42.170.246 10.42.223.48
nginx: 10.42.118.217 10.42.170.246 10.42.223.48
nginx: 10.42.118.217 10.42.223.48 10.42.170.246
nginx: 10.42.118.217 10.42.170.246 10.42.223.48

root@Default_hostname_1:/dns-test# php dns.php nginx
10.42.223.48
10.42.118.217
10.42.223.48
10.42.223.48
10.42.118.217
10.42.170.246
10.42.118.217
10.42.118.217
10.42.118.217
10.42.223.48

root@Default_hostname_1:/dns-test# php dns2.php nginx
10.42.223.48 10.42.170.246 10.42.118.217
10.42.170.246 10.42.118.217 10.42.223.48
10.42.118.217 10.42.170.246 10.42.223.48
10.42.118.217 10.42.170.246 10.42.223.48
10.42.118.217 10.42.223.48 10.42.170.246
10.42.223.48 10.42.118.217 10.42.170.246
10.42.170.246 10.42.118.217 10.42.223.48
10.42.118.217 10.42.223.48 10.42.170.246
10.42.118.217 10.42.170.246 10.42.223.48
10.42.223.48 10.42.170.246 10.42.118.217
10.42.118.217 10.42.223.48 10.42.170.246

root@Default_hostname_1:/dns-test# ./go-dns nginx
nginx: 10.42.118.217
nginx: 10.42.118.217
nginx: 10.42.118.217
nginx: 10.42.118.217
nginx: 10.42.118.217
nginx: 10.42.118.217

root@Default_hostname_1:/dns-test# java -Dsun.net.inetaddr.ttl=0 -Dsun.net.inetaddr.negative.ttl=0 DNS nginx
10.42.118.217
10.42.118.217
10.42.118.217
10.42.118.217
10.42.118.217
10.42.118.217
10.42.118.217
10.42.118.217
10.42.118.217

Another (better) option @ibuildthecloud mentioned is returning a single virtual IP that never changes for (in my example) nginx, and using iptables in the agent to send each request to that IP randomly to one of the 3 actual addresses. This has the advantage of being random on every connection without having to do repeated DNS queries on a short TTL.

Some people would still need a way to get the list of actual IPs, which could be done with SRV records or a separate namespace of A records.

So the only thing that’s curious is that my Java app is not returning the same IP over and over on every instance. Some instances lock to a single IP, others will rotate between 2 of the 3 addresses.
Otherwise everything makes sense from what you’re saying.

Which version of java did you run this test with?

Ok, I’ve now discovered why the sorting is behaving the way it is.

My Service B containers are:
10.42.14.79
10.42.202.286
10.42.220.149

And my Service A containers are:
10.42.249.231
10.42.239.168

So the IPv4 longest matching prefix with the source addresses is 10.42.2, which effectively eliminates the 10.42.14.79 container from service.

There seem to be ways to disable this sorting, but I haven’t found one.

But anyway, it’s not a Rancher problem. Thanks for the help in figuring it out.

So it appears there is no way to fix this without using an internal load balancer or implementing DNS resolution myself in the Java apps.

Because the Rancher SDN is effectively one subnet, glibc will always apply rule 9 sorting (longest matching prefix). So client-side load balancing will only ever use the subset of destination containers with the longest common prefix with the client’s own address (any destination containers with a shorter common prefix will never be addressed).
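
If we did end up implementing it in the app, the per-request workaround would be roughly: resolve everything, throw away the resolver’s ordering, and pick at random (a sketch; the host name is a placeholder):

import java.net.InetAddress;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Sketch of the app-side workaround: resolve all A records, discard the
// RFC 3484 ordering by shuffling, and connect to the first entry.
public class ShuffledResolve {
    static InetAddress pick(String host) throws Exception {
        List<InetAddress> addrs = new ArrayList<>(Arrays.asList(InetAddress.getAllByName(host)));
        Collections.shuffle(addrs); // ignore getaddrinfo's deterministic sort
        return addrs.get(0);
    }

    public static void main(String[] args) throws Exception {
        String host = args.length > 0 ? args[0] : "serviceB"; // placeholder name
        for (int i = 0; i < 5; i++) {
            System.out.println(pick(host)); // should rotate across instances
        }
    }
}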

Yeah… We’ll fix it with one of the two forms of returning only one answer so the client has nothing to sort.

Sounds great. Thanks a lot!
Just curious, but will there be a method for still getting an A record with multiple values (like foo.all or something)? Just thinking about things that can already handle this situation (like how a Consul server will try to connect to all records returned during a join).

Yes, separate A records in a different namespace or SRV records (which could also include the published ports).

GitHub issue: https://github.com/rancher/rancher/issues/3495