The debug log and answers file would be helpful; you can mail both of us at firstname at rancher.com if you don’t want to post publicly.
Attached in an email is a spreadsheet outlining the tests that were run, along with the corresponding log and answer files, RPMs, response times, and New Relic graphs. There is also a diagram of the current AWS infrastructure, which I’ll describe below.
How many containers are running on how many hosts to accept the traffic from JMeter?
In all cases there are two servers, each with 1 HAProxy container running, and 4 backend servers, each with 1 app container running. The app containers consist of Apache with mod_php plus the packaged code.
What application is accepting the connections? (if custom/in-house, what language/framework?) It seems like it may be ignoring TTLs if that many requests are happening, which is not helpful.
If it were an application-side issue, I would be seeing the same thing when I run the containers natively, not only through Rancher.
What networking mode are the containers using (host, managed, etc.)?
The HAProxy containers are launched natively and use host networking. The app containers perform poorly when launched through Rancher using either host or managed networking. When the same containers are launched natively (Docker CLI) with the same configuration, they perform almost on par with our VM instances.
How many DNS resolutions does each incoming request need? Are they all internal (to us, *.rancher.internal), all external (anything else) or mixed? Do they go to A records directly or return CNAMEs and get re-resolved one or more times?
Each app server connects to two separate RDS instances and two separate ElastiCache instances, as well as New Relic. When the containers are launched natively, they are configured to use an address that is a CNAME for the hostname Amazon provides for those services.
I have tried three different things when launching the containers through Rancher; a quick resolution check is sketched after the list.
- The same configuration as the native containers (CNAME → Amazon Hostname)
- Creating an external service in the stack that points to our hostname (Rancher Service → CNAME → Amazon Hostname)
- Creating an external service in the stack that points directly to the IP the Amazon hostname resolves to (Rancher Service → IP)
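For reference, a quick way to check what each of these setups actually hands back to the app (a direct A record, or a CNAME chain that still has to be chased) is something like the sketch below. It uses dnspython, and the hostnames are just placeholders:

```python
# Sketch: inspect what a name resolves to and how many CNAME hops are involved.
# Requires dnspython (pip install dnspython); the hostnames below are placeholders.
import dns.resolver

names = [
    "db.example.internal",                      # our CNAME -> Amazon hostname (placeholder)
    "db.mystack.rancher.internal",              # Rancher external service name (placeholder)
    "mydb.abc123.us-east-1.rds.amazonaws.com",  # the Amazon hostname itself (placeholder)
]

resolver = dns.resolver.Resolver()  # uses the container's /etc/resolv.conf

for name in names:
    answer = resolver.resolve(name, "A")
    # The answer section shows every CNAME hop plus the final A record(s);
    # more hops generally means more work for the recursor on a cold cache.
    for rrset in answer.response.answer:
        print(rrset)
    print("-" * 40)
```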
If there are external records, what are the configured recursors?
Containers that come up on the managed network under Rancher appear to use 169.254.169.250. Containers that come up on the host network default to using 10.20.0.2.
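For what it’s worth, one way to see how much of the latency comes from the resolver itself is to time the same lookup through each of those servers from inside a container. A rough sketch with dnspython (the hostname and sample count are placeholders):

```python
# Sketch: time the same lookup through the Rancher DNS (169.254.169.250)
# and the VPC resolver (10.20.0.2) to see where the latency comes from.
# Requires dnspython; the hostname is a placeholder.
import time
import dns.resolver

HOSTNAME = "mydb.abc123.us-east-1.rds.amazonaws.com"  # placeholder
RESOLVERS = ["169.254.169.250", "10.20.0.2"]
SAMPLES = 20

for server in RESOLVERS:
    resolver = dns.resolver.Resolver(configure=False)  # don't read /etc/resolv.conf
    resolver.nameservers = [server]
    timings = []
    for _ in range(SAMPLES):
        start = time.monotonic()
        resolver.resolve(HOSTNAME, "A")
        timings.append((time.monotonic() - start) * 1000.0)
    print(f"{server}: avg {sum(timings) / len(timings):.1f} ms, max {max(timings):.1f} ms")
```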
Are any containers being started/stopped on the host (perhaps due to health checks failing while the test is being run)? Changes will trigger a reload of the answers config.
The load balancers handle the traffic with no issues; watching both HAProxy instances, neither removes any of the backend servers from rotation because of health checks in any of the scenarios. The containers in Rancher have health checks set to take no action, and the containers are set to start only once, as restarts were originally thought to be the issue.
You said you made external services that point to IP addresses and it’s still hitting the recursive server, but that doesn’t make a lot of sense… For what name would it be asking? If it’s an external service that points to a hostname, then yeah, it still has to recurse to give back the IPs.
Yes, I made external services in the stack that pointed directly to the IPs of those services, but it seems like DNS is still doing a lot of work. I have attached individual log files for each scenario.
Am I missing something, or how is the throughput the same-ish (hard to tell on different scales) if the latency is 6x? Shouldn’t the throughput plummet as the transaction time rises?
I believe there is a high timeout on the load-test side that is playing into that. When tests start to fail, as you’ll see in the documents I attached, throughput does drop.
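For anyone following along, the back-of-the-envelope relationship here is Little’s law: with a closed workload (a fixed number of JMeter threads and no think time), throughput is roughly threads divided by response time, so 6x the response time should mean roughly 6x less throughput unless requests are piling up behind a long timeout. A tiny sketch with made-up numbers:

```python
# Rough closed-workload arithmetic (Little's law); all numbers are made up.
threads = 200        # concurrent JMeter threads (assumed)
base_latency = 0.05  # native/VM response time in seconds (assumed)
slow_latency = 0.30  # ~6x response time under managed networking (assumed)

for label, latency in [("baseline", base_latency), ("6x latency", slow_latency)]:
    # Closed workload, no think time: throughput ~= threads / response time.
    print(f"{label}: ~{threads / latency:.0f} req/s")

# If the graphs show similar throughput despite 6x latency, either the effective
# concurrency grew (requests queueing behind a long timeout) or the test behaves
# more like an open-loop workload (requests fired at a fixed rate) than a closed one.
```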