Rancher DNS resulting in poor performance

Currently evaluating Rancher for a migration from our VM-based environment. Both environments use identical hardware and identical stacks.

In the current environment it is possible to achieve 40k requests/minute with an average response time of 80-90ms.

Moving into the Rancher environment, it was eventually possible to reach 40k requests/minute (this could not be achieved with the built-in LB solution). However, the response times are 475 - 500ms. This was run multiple times on both host networking and managed networking with the same results.

At the peak of the load tests, it appears that rancher-dns has multiple processes running, each utilizing 40 - 50% of the CPU at any given time. To validate this, the same containers were launched on the same machines as standalone containers (not managed by Rancher). Running the same load tests against those achieved an acceptable < 100ms response time.
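For anyone wanting to reproduce this kind of comparison, a minimal sketch for timing lookups through whatever resolver the container is configured with (the hostname and iteration count here are placeholders, not values from this thread):

```python
# Rough sketch: time repeated DNS lookups through the container's configured
# resolver (rancher-dns vs. the host default). Resolution goes through
# /etc/resolv.conf, so run it inside each environment being compared.
import socket
import time

def time_lookups(hostname, n=100):
    """Resolve `hostname` n times and return (avg_ms, max_ms)."""
    timings = []
    for _ in range(n):
        start = time.perf_counter()
        socket.getaddrinfo(hostname, None)
        timings.append((time.perf_counter() - start) * 1000.0)
    return sum(timings) / len(timings), max(timings)

# "localhost" is just a stand-in; point this at a real service hostname.
avg_ms, max_ms = time_lookups("localhost", n=10)
print(f"avg {avg_ms:.2f} ms, max {max_ms:.2f} ms")
```

Running the same script inside a Rancher-managed container and a native one would separate resolver latency from application latency.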

This was tested using both v0.59.1 and v0.61.0-rc3.

Is there a way to increase rancher-dns's performance?
Is this a known issue?
Has anyone else experienced similar behavior?


Yikes. When things are not under load, what kind of response times are you seeing? Also, what type of DNS requests are you testing? Are they for DNS names managed by Rancher (ex: myservice) or handled by an upstream DNS server (ex: www.google.com)?

@mschroeder Can you share how you load tested? We are currently doing a lot of performance testing on our side and would like to see if we can reproduce your tests.

Also the dns-server is one process, so multiple on a single host sounds very odd.

Edit: apparently this is default Go 1.5 behavior (the runtime's threads can show up as multiple processes).

@ibuildthecloud - When things are not under load, response times are in the 40 - 60ms range when using both a container managed through rancher (with rancher-dns) and running natively (without rancher-dns).

In this particular instance the container only talks to services with upstream DNS (AWS). I turned on debug output in rancher-dns, ran the load tests, and noticed it was doing a lot of external lookups. I created "External Services" that pointed directly to the RDS IP and ElastiCache IP. This resolved a lot of the lookups, but it seems that DNS is still handing those off to Amazon's DNS for resolution, which is not resolving them. It also made no difference when running a small load test.

This screenshot is at the peak of the load when the containers are created through Rancher, with rancher-dns consuming the most CPU.

This is what the rancher process list looks like on the host

Below are the graphs from New Relic when running the 40k rpm test with and without rancher-dns

-With Rancher-DNS-

-Without Rancher-DNS-

We use jmeter with 16 slaves in the same VPC on AWS to generate the load on the environment.

@ibuildthecloud I can also provide archives of the rancher-dns debug logs, answers.json and additional screenshots of the load testing and running processes for both large and small tests.

The debug log and answers file would be helpful, you can mail both of us at firstname at rancher.com if you don’t want to post publicly.

How many containers are running on how many hosts to accept the traffic from jmeter?

What application is accepting the connections? (if custom/in-house, what language/framework?) It seems like it may be ignoring TTLs if that many requests are happening, which is not helpful.

What networking mode are the containers? (host, managed, etc)?

How many DNS resolutions does each incoming request need? Are they all internal (to us, *.rancher.internal), all external (anything else) or mixed? Do they go to A records directly or return CNAMEs and get re-resolved one or more times?
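One way to check whether a name goes through a CNAME chain is to ask the resolver for its canonical name; if it differs from the queried name, the client is re-resolving through at least one CNAME. A minimal stdlib-only sketch (the hostname is a placeholder):

```python
# Sketch: detect a CNAME chain by requesting the canonical name.
# With AI_CANONNAME set, the resolver fills in the canonical name on the
# first entry of the getaddrinfo result.
import socket

def canonical_name(hostname):
    info = socket.getaddrinfo(hostname, None, flags=socket.AI_CANONNAME)
    return info[0][3]  # canonname is only populated on the first entry

# Replace "localhost" with e.g. an RDS or ElastiCache endpoint to see
# whether the record chain ends somewhere other than the name you queried.
name = "localhost"
canon = canonical_name(name)
print(f"{name} -> {canon or name}")
```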

If there are external records, what are the configured recursors?

Are any containers being started/stopped on the host (perhaps due to health checks failing while the test is being run)? Changes will trigger a reload of the answers config.

You said you made external services that points to IP addresses and it’s still hitting the recursive server, but that doesn’t make a lot of sense… For what name would it be asking? If it’s an external service that points to a hostname, then yeah it still has to recurse to give back the IPs.

Am I missing something, or how is the throughput the same(ish, hard to tell on different scales) if the latency is 6x? Shouldn’t the throughput plummet as the transaction time rises?

The debug log and answers file would be helpful, you can mail both of us at firstname at rancher.com if you don’t want to post publicly.

Attached in an email is a spreadsheet outlining the tests that were run and the corresponding log and answer files, as well as rpms, response times and New Relic graphs. There is also a diagram of the current AWS infrastructure that I'll describe below.

How many containers are running on how many hosts to accept the traffic from jmeter?

In all cases there are two servers, each with 1 haproxy container running, and 4 backend servers each with 1 app container running. The app containers consist of Apache and mod_php as well as the packaged code.

What application is accepting the connections? (if custom/in-house, what language/framework?) It seems like it may be ignoring TTLs if that many requests are happening, which is not helpful.

If it was an application side issue, then I would be seeing the same thing when I run the containers natively and not only through rancher.

What networking mode are the containers? (host, managed, etc)?

The HAProxy containers are launched natively and are using host networking. The app containers perform poorly when launched through Rancher using either host or managed networking. When the same containers are launched with the same configuration natively (docker CLI), they perform almost on par with our VM instances.

How many DNS resolutions does each incoming request need? Are they all internal (to us, *.rancher.internal), all external (anything else) or mixed? Do they go to A records directly or return CNAMEs and get re-resolved one or more times?

Each app server connects to two separate RDS instances and two separate ElastiCache instances, as well as New Relic. When the containers are launched natively, they are configured to use an address that CNAMEs to the hostname given by Amazon for those services.

I have tried three different things when launching the containers through rancher.

  1. The same configuration as the native containers (CNAME -> Amazon Hostname)
  2. Creating an external service in the stack that points to our hostname (Rancher Service -> CNAME -> Amazon Hostname)
  3. Creating an external service in the stack that points directly to the IP the Amazon hostname resolves to (Rancher Service -> IP)

If there are external records, what are the configured recursors?

Containers that come up on the managed network under Rancher appear to use 169.254.169.250.

Containers that come up on the host network default to using 10.20.0.2.
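To confirm which resolver a given container actually uses, it's enough to parse the nameserver lines out of its resolv.conf. A small sketch (the sample file contents below are illustrative, matching the managed-network resolver mentioned above):

```python
# Parse nameserver entries from a resolv.conf-style file. In a real
# container this would be pointed at /etc/resolv.conf.
import os
import tempfile

def nameservers(path="/etc/resolv.conf"):
    servers = []
    with open(path) as fh:
        for line in fh:
            parts = line.split()
            if len(parts) >= 2 and parts[0] == "nameserver":
                servers.append(parts[1])
    return servers

# Quick self-check against a sample file instead of the live system file.
fd, sample = tempfile.mkstemp()
with os.fdopen(fd, "w") as fh:
    fh.write("search rancher.internal\nnameserver 169.254.169.250\n")
found = nameservers(sample)
os.remove(sample)
print(found)  # ['169.254.169.250']
```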

Are any containers being started/stopped on the host (perhaps due to health checks failing while the test is being run)? Changes will trigger a reload of the answers config.

The load balancers handle the traffic with no issues, and watching both HAProxy instances neither removes any of the backend servers from rotation because of health checks in any of the scenarios. The containers in rancher have health checks set to take no action, and the containers are set to only start once, as restarts were originally thought to be the issue.

You said you made external services that points to IP addresses and it’s still hitting the recursive server, but that doesn’t make a lot of sense… For what name would it be asking? If it’s an external service that points to a hostname, then yeah it still has to recurse to give back the IPs.

Yes, I made external services in the stack that pointed directly to the IPs of those services, but it seems like DNS is still doing a lot of work. I have attached individual log files for each scenario.

Am I missing something, or how is the throughput the same(ish, hard to tell on different scales) if the latency is 6x? Shouldn’t the throughput plummet as the transaction time rises?

I believe a high timeout on the load-test side is playing into that. When tests start to fail, as you'll see in the documents I attached, throughput is reduced.

@mschroeder Can you re-test this with v1.0.0-rc1, we’ve made a bunch of changes to rancher-dns that should help with this including caching of recursive responses. We see it working well now for ~2500 unique (uncached) requests/second for recursive answers and ~8000 for internal ones on a smallish GCE instance, with the load-test generation running on the same host.

Regardless of how much faster it is or isn’t when skipping our extra DNS hop, I still think your application has a serious problem abusing AWS’s resolver if you do no caching and generate thousands of requests/second in the first place. No matter how fast their resolvers are that has to be measurable time you’re wasting at that volume.