Load balancer issue

Hello all. Yesterday I was working with a load balancer and ran into an issue where I was getting 503s from the service even though it was up and running normally. I checked the haproxy config and everything looked good, nothing seemed out of place, and I was able to ping the box’s ip/port normally from the balancer container. Curling through the host (curl -H), though, was returning the 503.

Simply restarting the balancer service resolved whatever the issue was, but for a time it definitely seemed not to be respecting the configuration. Has anyone seen this before? Thanks.

Pinging @rancher on this. We’re seeing it fairly often (the HAProxy config gets updated, but HAProxy needs a restart before the hosts become available).

@tobowers assuming it’s still happening on your setup, could you please provide:

  • the output of the following two mysql queries:

select * from config_item_status where name='haproxy';
select * from instance where agent_id in (select agent_id from config_item_status where name='haproxy');

  • the agent log on the host where your lb instance runs

The info should be captured at the moment the problem occurs (lb stuck with the old config).

@alena I will look into doing that too. To be clear though… the load balancer container’s haproxy.cfg is correct (it has all the frontend/backend entries). However, haproxy does not actually serve the frontend until a restart.

@tobowers thank you for the clarification. Hopefully the log will have some info on the haproxy reload failure.

Thanks @alena. We emailed over the logs. I’ll also note that, upon further inspection, we’re seeing some old haproxy instances in the load balancer containers as well.

Screenshot attached.

@tobowers thank you for the info. Need to confirm a couple of things:

  • that no manual haproxy reload was performed inside the lb instance (service haproxy restart)? A manual restart always results in multiple haproxy processes. Rancher does the haproxy reload this way:


and if /var/run/haproxy.pid gets overridden by a restart performed by the user, Rancher will try to locate the old process using the "-sf $(cat /var/run/haproxy.pid)" argument, and if it is not found, it will start a new process.

There is more info on how the restart process works in section 2.4.1 of this document: http://www.haproxy.org/download/1.2/doc/haproxy-en.txt

  • If no manual restart was performed (and based on the first comment, that looks like the case), what were the steps to reproduce it? Did it happen (a) after Rancher had been up for about a day, or (b) after a certain number of changes to the haproxy config (services going up and down)?

There might be a bug in the haproxy restart logic described above, and we might have to revise the way we do the haproxy “safe reload”. To fix it, we would first have to reproduce the issue in house.
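For anyone following along, the PID handoff that the reload relies on can be sketched as below. This is only an illustration: it uses sleep as a stand-in for the haproxy binary (haproxy is assumed not to be installed here), and the real reload command shown in the comment is based on the -sf usage described above, so treat paths and flags as assumptions.

```shell
#!/bin/sh
# Sketch of the "safe reload" PID handoff, with `sleep` standing in for
# haproxy. The real reload is along the lines of:
#
#   haproxy -f /etc/haproxy/haproxy.cfg -p /var/run/haproxy.pid \
#           -sf $(cat /var/run/haproxy.pid)
#
# where -sf makes the new process tell the old PID to finish its
# connections and exit.

PIDFILE=$(mktemp)

# "Start" the service and record its PID, as -p would.
sleep 60 &
echo $! > "$PIDFILE"
OLD_PID=$(cat "$PIDFILE")

# "Reload": start the replacement first, then gracefully stop the old
# PID, mimicking what -sf does inside haproxy itself.
sleep 60 &
NEW_PID=$!
kill -TERM "$OLD_PID"
wait "$OLD_PID" 2>/dev/null      # reap the old process
echo "$NEW_PID" > "$PIDFILE"

echo "old pid $OLD_PID replaced by $NEW_PID"
kill "$NEW_PID" 2>/dev/null      # clean up the demo process
rm -f "$PIDFILE"
```

The point of starting the replacement before stopping the old process is that there is only a tiny window where neither is accepting connections, and the pid file always points at the live process.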

We are able to reproduce it by deploying new services (sometimes it takes a few tries). Keep in mind that we are updating these load balancers every 10-30 seconds as well. Definitely no manual reload. We might have restarted the container manually in order to work around this bug.


To clarify: we have a service that runs a global load balancer (one on every host, a rancher load balancer). We then have another service that looks at all the projects in the environment and updates that load balancer through the API every 30 seconds or so.

That’s quite interesting @tobowers… Is the script that updates the local LBs something you can share? Is it in a container itself, or how do you run it?

When running with multiple LBs or IPs, it would probably be best to have the global LBs serve only their local containers… Is that what you are doing?

It’s run as a container, but it has a small bit of proprietary code. I’d be happy to add you to the github repo. We use the setservicelinks API on a rancher loadbalancer.
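For the curious, the call is along these lines. This is only a sketch: the endpoint path, IDs, keys, and payload shape are assumptions based on the Rancher v1 API, so check them against your own setup. The curl command is echoed rather than executed so the sketch is safe to run anywhere.

```shell
#!/bin/sh
# Hypothetical sketch of updating a rancher load balancer via the
# setservicelinks action. URL, keys, IDs, and the port mapping syntax
# are all made-up placeholders.
RANCHER_URL="http://rancher.example.com:8080/v1"   # assumption
LB_ID="1s42"                                       # hypothetical LB service id
PAYLOAD='{"serviceLinks": [{"serviceId": "1s10", "ports": ["app.example.com:80"]}]}'

# Echoed instead of executed; drop the leading `echo` to actually call it.
echo curl -s -u "ACCESS_KEY:SECRET_KEY" \
  -X POST -H "Content-Type: application/json" \
  -d "$PAYLOAD" \
  "$RANCHER_URL/loadbalancerservices/$LB_ID/?action=setservicelinks"
```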

Just to confirm: you are basically adding the “local” containers to that load balancer so that requests don’t go “all over” when they come in, right? If so, I’d like to look at it if you don’t mind; I will probably be building something similar…

Since we have our own colocation, we can’t use ELB or something like it to target a load balancer. And since Rancher doesn’t currently support announcing a certain IP in a high-availability way (such as floating IPs / virtual IPs), my plan is to use our existing load balancers to send requests into rancher; these would go into a rancher LB or a pool of LBs.

Running global load balancers within rancher makes more sense if we can ensure that the load balancers only use the services available on their own host, or ideally use those “first” (and since we can run a healthcheck on the LB itself, it isn’t a problem to serve only local containers)…

Otherwise we end up with a scenario like this: traffic goes into the outer LB, gets directed to the RancherLB on host X, and the RancherLB on host X then sends the traffic to a container on host Y, despite there being a container on host X…

If the rancher LB only (or primarily) routes to local containers, we avoid a bit of back and forth between the hosts… that is my idea… Hopefully we will eventually have a way to support virtual IPs and multiple public IPs per host in rancher, which would mean we could do away with an ELB (or, in our case, external LBs) without risking availability…

If I’m thinking along the lines of what you have developed, it would be great to see what your script looks like, as I’d like to build something similar… I’m RVN-BR on github.

@tobowers debugged it; it looks like it’s caused by a bug in our haproxy service monitoring script:

This script is supposed to monitor the haproxy process and bring it back up if the process is not running. The process might be stopped because:

  • it dies for some reason
  • when rancher does an haproxy reload due to config changes, the service is inactive for a split second.

The bug is that the monit script should bring the haproxy service back up the same way Rancher does on a haproxy.config reload, not by simply calling /etc/init.d/haproxy start.

I’m going to work on the fix.
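For reference, the corrected check might look something like the sketch below. This is a guess at the shape of the fix, not the actual patch: the paths, the service name, and the /bin/sh -c wrapper (monit does not expand $(...) itself) are all assumptions.

```
# /etc/monit/conf.d/haproxy -- sketch only; paths and names are assumptions
check process haproxy with pidfile /var/run/haproxy.pid
  # Bring haproxy back the same way Rancher reloads it, handing off the
  # old PID with -sf instead of starting a second, independent process:
  start program = "/bin/sh -c 'haproxy -f /etc/haproxy/haproxy.cfg -p /var/run/haproxy.pid -sf $(cat /var/run/haproxy.pid)'"
  stop program  = "/bin/sh -c 'kill $(cat /var/run/haproxy.pid)'"
```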

@alena awesome! Thanks! @RVN_BR we actually don’t do “local only” but we got rid of ELBs… instead:

We have a process, “rancher_syncer”, which updates route53 with the IP addresses of all of its load balancers (so we’re using round-robin DNS; if a whole host goes down, there will be an outage on some requests until DNS propagates). Then we have the “global rancher loadbalancers”, which are set up to send traffic from various DNS entries to rancher services (using their API).

I added you as a read-only to https://github.com/mdx-dev/rancher-syncer though it sounds like we don’t do exactly what you’d want.
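The round-robin DNS piece can be sketched roughly as below. The hosted-zone ID, record name, and IPs are made-up placeholders, and the aws CLI call is left commented out since it needs credentials; the real syncer would pull the LB host IPs from the Rancher API first.

```shell
#!/bin/sh
# Sketch of a round-robin DNS update for Route 53. A single A record
# with multiple values is served round-robin by Route 53, which is the
# behavior described above. All identifiers are hypothetical.
ZONE_ID="Z123EXAMPLE"                 # hypothetical hosted-zone ID
NAME="lb.example.com"                 # hypothetical record name
LB_IPS="203.0.113.10 203.0.113.11"    # IPs of the rancher LB hosts

# Build the ResourceRecords JSON array from the IP list.
RECORDS=""
for ip in $LB_IPS; do
  RECORDS="$RECORDS{\"Value\": \"$ip\"},"
done
RECORDS=${RECORDS%,}    # drop the trailing comma

BATCH=$(cat <<EOF
{"Changes": [{"Action": "UPSERT", "ResourceRecordSet":
  {"Name": "$NAME", "Type": "A", "TTL": 60,
   "ResourceRecords": [$RECORDS]}}]}
EOF
)
echo "$BATCH"
# To apply it (requires AWS credentials; not run here):
#   aws route53 change-resource-record-sets \
#     --hosted-zone-id "$ZONE_ID" --change-batch "$BATCH"
```

A low TTL (60s here) keeps the window short when a host drops out of the record, though, as noted above, it is never instant.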

Thanks @tobowers, I took a look. It may not be exactly what we want, but it can help as a starting point…

(I may look into doing something like this too; however, it will have to wait for cloudflare support in rancher, or we’ll build it ourselves… I think it’s relatively easy to set a low TTL, and although it won’t be instant, like you said, it may be ok… although for our services I think it may cause havoc if we have anything over a few seconds of outage, so VIPs are still on my mind.)

I’m still a bit unclear on why you need this to update the rancher loadbalancers, or are you using it only to update R53?

@RVN_BR the rancher loadbalancers need to know which hostnames map to which rancher services

@tobowers right, but where does this info come from? Is it dynamic? We have mostly wildcard domains and some path redirection to specific services… but they don’t change…

In your case you are basically “load balancing” (round-robin) from DNS into any of the global load balancers, and then load balancing again from those to any of the hosts, right? If all the hosts are receiving external traffic, I was looking to reduce the inter-cluster shuffling, like LB_1 sending to SERVICE_2 while LB_2 is sending to SERVICE_1… In a single datacenter it wouldn’t be much of an issue under normal circumstances, but for applications that transfer large volumes of data we’d be introducing some unnecessary overhead/network traffic…

Hi @alena! Is there a github issue or somewhere else I can track this? Was it fixed with the major networking changes in the last release? I’d like to hold off on upgrading until this is fixed and then start moving some mission-critical services onto Rancher.

@tobowers Here’s the issue tracking it. It will be in the next release.