Hello, we are experiencing a significant issue in the DNS system in Rancher.
We are running a 3-tier system like so:
Rancher LBs -> service A -> Service B
We are currently running 3 instances of Service A and 2 instances of Service B.
We are running the Rancher LB on each host, which routes traffic to Service A.
When we hit the load balancer, we see each Service A instance pair up with a single copy of Service B and never round-robin.
When I jump into a shell in a Service A container and perform a dig serviceB, I do see the results rotate appropriately, so the DNS server does seem to be round-robining the results.
However, if I install curl and do a curl serviceB, it will only ever hit a single copy of Service B.
This is an enormous issue, as it creates a completely imbalanced network load, and it is non-deterministic which container a given instance will end up linked to.
Regarding your architecture, I’m slightly confused. Traffic to A will be load balanced but why are you expecting traffic from A to B to be load balanced? Wouldn’t you need a second LB service? LBA > A > LBB > B?
My understanding is that the DNS system will round-robin the lookups, effectively providing client-side load balancing. If that's not the case, then it is not clear from the documentation.
There is an issue with the TTL; however, I believe @vincent submitted a PR to fix local resolving to use TTL=1s, so it will RR properly… Are you on the latest version?
There is a similar but unrelated issue: the default TTL returned on records by the server is 5 minutes (dig should show 300 in the response row). So a client that respects that would keep using the same value for that period (curl does not cache across requests by default). I’m not sure if this was a regression or if it never got set to begin with, but it was intended to be 1 second and will be that in the next release (https://github.com/rancher/rancher-dns/pull/13/).
These are simple Java applications and we’ve disabled the DNS cache in the Java security file.
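For completeness, the setting we mean is the JVM-level address cache; the programmatic equivalent of the java.security file entries is roughly this (standard JDK property names; it has to run before the first lookup in the JVM):

```java
// Equivalent to networkaddress.cache.ttl=0 in $JAVA_HOME/jre/lib/security/java.security.
// Must be set before the JVM performs its first hostname lookup.
java.security.Security.setProperty("networkaddress.cache.ttl", "0");
java.security.Security.setProperty("networkaddress.cache.negative.ttl", "0");
```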
We've also run long-running load tests, and we never see a hop or change in the Service B instance that a given Service A instance is talking to.
@vincent - that would make some sense, but wouldn't that mean every instance of Service A would be talking to a single instance of Service B? That's not what we see. There is some crossover, but once we see a connection, we never see it change unless we scale Service B down (a restart doesn't seem to affect it).
Our containers are running on Ubuntu 15, but I haven't found anything that says the OS is caching the lookups. That's what appears to be happening, though, since both the application and curl are impacted.
We’ll run some more tests tomorrow to make sure.
Would it be worth upgrading to 0.56.1?
Also note that curl (more correctly libcurl) maintains its own DNS cache I believe (with a 60s timeout).
Yes, but it's just in-process memory, so it never gets a cache hit for a single request from the command line:
vincent@host:~$ curl -v http://apple.com 2>&1 | grep -i cache
* Hostname was NOT found in DNS cache
vincent@host:~$ curl -v http://apple.com 2>&1 | grep -i cache
* Hostname was NOT found in DNS cache
vincent@host:~$ curl -v http://apple.com http://apple.com 2>&1 | grep -i cache
* Hostname was NOT found in DNS cache
* Hostname was found in DNS cache
Still having trouble tracking this down.
It seems as though something inside the container is caching the lookup and not releasing it. It appears to be at the container level since both applications and curl are affected.
I've verified that the Service A containers can properly route to any of the 3 Service B containers, but it appears that name resolution only ever resolves to 1 entry.
nslookup and dig both return rotating values.
Not quite sure how else to test this. Or where else to dig. But this seems like a fairly significant problem.
In 0.56 the TTL is still 300 instead of 1, so that exacerbates the problem when testing. As I said before, the DNS server returns the list of IPs in a random order every time it is requested, and there is no caching on our side.
The problem is that getaddrinfo(3) sorts the response from the server in non-obvious ways that you can't control, according to a list of rules in RFC 3484, which results in returning the same IP over and over for the same combination of host + answers returned by the server. This is corrected in RFC 6724, but standard glibc has not changed. Some languages offer both, and others fix the problem for you… but presumably Java is using getaddrinfo() under the JRE hood and you have no choice in the matter.
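An easy way to see this from a JVM is to dump what getAllByName() returns in a loop; if the glibc sorting is the culprit, the order stays stable even though dig shows the server rotating. A sketch only, with serviceB standing in for the real service name:

```java
import java.net.InetAddress;
import java.util.Arrays;

public class LookupLoop {
    public static void main(String[] args) throws Exception {
        // Assumes the JVM address cache is disabled as described above, so each
        // iteration is a fresh lookup. Most client libraries just take the first
        // entry, so if the order is stable you always connect to the same backend.
        for (int i = 0; i < 5; i++) {
            InetAddress[] addrs = InetAddress.getAllByName("serviceB");
            System.out.println(Arrays.toString(addrs));
            Thread.sleep(2000); // wait out the record TTL between lookups
        }
    }
}
```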
So one thing you can do is use a load balancer between the services; you may end up pinned to one balancer, but HAProxy knows how to rotate between the actual backends correctly.
The only other thing I can think of is changing the DNS server to return a single randomly selected address in each response for an A record, and move the full list to an SRV record or something if you actually want them all. This would likely break some applications that expect the full list in the A record response.
Another (better) option @ibuildthecloud mentioned is returning a single virtual IP that never changes for (in my example) nginx, and using iptables in the agent to send each request to that IP to one of the 3 actual addresses at random. This has the advantage of being random on every connection without having to do repeated DNS queries on a short TTL.
Some people would still need a way to get the list of actual IPs, which could be done with SRV records or a separate namespace of A records.
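If the full list does move to SRV records, JVM apps could read those through the built-in JNDI DNS provider without any extra libraries. A rough sketch; the record name here is just a placeholder for whatever the server ends up publishing:

```java
import java.util.Hashtable;
import javax.naming.Context;
import javax.naming.NamingEnumeration;
import javax.naming.directory.Attribute;
import javax.naming.directory.InitialDirContext;

public class SrvLookup {
    public static void main(String[] args) throws Exception {
        Hashtable<String, String> env = new Hashtable<>();
        env.put(Context.INITIAL_CONTEXT_FACTORY, "com.sun.jndi.dns.DnsContextFactory");
        env.put(Context.PROVIDER_URL, "dns:");
        // "_serviceB._tcp" is a hypothetical record name for illustration only.
        Attribute srv = new InitialDirContext(env)
                .getAttributes("_serviceB._tcp", new String[] {"SRV"})
                .get("SRV");
        NamingEnumeration<?> entries = srv.getAll();
        while (entries.hasMore()) {
            // Each entry is "priority weight port target"
            System.out.println(entries.next());
        }
    }
}
```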
So the only thing that's curious is that my Java app is not returning the same IP over and over on every instance. Some instances lock to a single IP; others will rotate between 2 of the 3 addresses.
Otherwise everything makes sense from what you’re saying.
So it appears there is no way to fix this without using an internal load balancer or implementing the DNS handling myself in the Java apps.
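For what it's worth, "implementing it myself" wouldn't have to mean a full resolver: shuffling what getAllByName() returns before picking a target is enough to undo the stable getaddrinfo ordering. A minimal sketch, assuming the JVM cache is already disabled:

```java
import java.net.InetAddress;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class ShuffledResolver {
    // Resolve the service name fresh on each call and pick a random address,
    // sidestepping the stable getaddrinfo/RFC 3484 ordering.
    static InetAddress pick(String host) throws Exception {
        List<InetAddress> addrs = Arrays.asList(InetAddress.getAllByName(host));
        Collections.shuffle(addrs);
        return addrs.get(0);
    }
}
```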
Because the Rancher SDN is effectively one subnet, glibc will always apply rule 9 sorting (longest matching prefix). So client-side load balancing will only ever use the subset of destination containers whose addresses share the longest common prefix with the client (any destination containers with a shorter common prefix will never be addressed).
Sounds great. Thanks a lot!
Just curious, but will there still be a way to get an A record with multiple values (like foo.all or something)? Just thinking about things that can already handle this situation (e.g. a Consul server will try to connect to all records returned during a join).