Load Balancer performance

I’ve set up a new Rancher cluster: 3 t2.small RancherOS nodes in AWS, with an ELB directing traffic to port 80 across the 3 nodes, where the Rancher HAProxy LB is configured to direct requests to the application containers (WordPress, in my case).

I did some testing using SwitchyOmega in Chrome with a PAC file that routes traffic for the same site to the ELB, the HAProxy LB, or the container directly. With the Chrome developer tools open to the timeline, I manually refresh after adjusting the PAC file to switch targets so I can see the load time. I’m seeing some pretty unhappy results:
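For anyone who wants to reproduce this, a PAC file for this kind of A/B test is just a JavaScript function. This is a minimal sketch; the hostname and target addresses are placeholders, not my real endpoints:

```javascript
// Hypothetical PAC sketch: send the test site through one of the three
// targets by uncommenting the line you want. Host/addresses are placeholders.
function FindProxyForURL(url, host) {
  if (host === "blog.example.com") {
    return "PROXY elb-placeholder.us-east-1.elb.amazonaws.com:80"; // via ELB
    // return "PROXY 10.0.0.11:80"; // via the Rancher HAProxy LB host
    // return "PROXY 10.0.0.12:80"; // direct to the container host
  }
  return "DIRECT"; // everything else bypasses the proxy
}
```

Editing one line and refreshing is what lets the same URL hit each tier in turn.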

ELB - 4.1 min
HAProxy - 1.8 min
Container - 1.97 s

I can accept that the ELB will underperform in a testing setup, since I know it scales behind the scenes to match the traffic it’s actually getting. But a 54x speed difference between going through the HAProxy container and hitting the Nginx WordPress container directly? That’s amazingly bad. I had thought the slow performance was due to downsizing the new cluster to t2.small nodes, but I discussed it with vincent99 on IRC and confirmed that my CPU credit balance on the t2 nodes was fine, so CPU shouldn’t be the issue. On my other stack, I have a virtual F5 EC2 instance that directs traffic to the containers directly, bypassing both the ELB and HAProxy components of the new setup.

Any suggestions are welcome, as I would really like to shut down that F5 instance and not have it show up on next month’s bill. For now, though, I think I need to keep it in place so I can shut down the old m3.medium node on a Rancher 1.2.0 setup that seems to have choked during the upgrade. Several days later, I still see the “Upgrading environment” banner and other oddities.

I’m checking out Traefik now.

I’m a bit surprised that this thread hasn’t gotten any replies yet. I could really use some help on this front, please.

As I said on IRC, there’s clearly something wrong, but I don’t know what it is. I would run tcpdump/wireshark on the haproxy container to see what’s coming in from the client, going out to the target, and coming back. You can also try running both the balancer and the target service at scale=1 and scheduling them on the same host to take the overlay network out of the picture.
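The co-scheduling idea above can be sketched as a compose fragment, assuming Rancher 1.x Cattle scheduling via host labels (the `lbtest=true` label and the service names here are hypothetical placeholders for whatever your stack uses):

```yaml
# Hypothetical docker-compose fragment: pin both services to one host
# (a host carrying the label lbtest=true) so traffic between the LB and
# the target never crosses the overlay network.
wordpress:
  image: wordpress
  labels:
    io.rancher.scheduler.affinity:host_label: lbtest=true
lb:
  image: rancher/lb-service-haproxy
  ports:
    - 80:80
  labels:
    io.rancher.scheduler.affinity:host_label: lbtest=true
```

With both services at scale=1 on the same host, a slow response would point at HAProxy itself or its config rather than the overlay network.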

At 5000%+ overhead it has to be something like packet loss, fragmentation, an MTU mismatch, rate-limiting gone mad, connections never being closed and leaving no workers free, etc.
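The MTU-mismatch case is easy to probe with ping. The arithmetic below is a quick sketch of how to pick the payload size; the 1450-byte overlay MTU is an illustrative assumption, not a measured value from this cluster:

```python
# A standard Ethernet MTU of 1500 bytes leaves
#   1500 - 20 (IPv4 header) - 8 (ICMP header) = 1472
# bytes of ping payload. Overlay networks add their own encapsulation
# overhead, so a payload that fits host-to-host may fragment or silently
# drop container-to-container.

def max_ping_payload(mtu, ip_header=20, icmp_header=8):
    """Largest ICMP payload that fits in one unfragmented IPv4 packet."""
    return mtu - ip_header - icmp_header

print(max_ping_payload(1500))  # then try: ping -M do -s <this value> <target>
print(max_ping_payload(1450))  # assumed, lower overlay-network MTU
```

If `ping -M do` (don’t-fragment) succeeds between hosts at the full size but fails between containers, the overlay MTU is the prime suspect.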