Load balancer stopped working

I run an app on a couple of servers, with 3 load balancer instances on 3 different servers in front of it, but all of a sudden the load balancer stopped working. It just gets stuck in an “Initializing” state, most of the time all three of the containers, though sometimes one of them manages to reach a “Running” state. After a while the “Initializing” instances restart.

It worked just fine until today, and I haven’t done anything to it today.

From the logs of the “Initializing” containers I can’t get any good info, just a couple of warnings:

2017-10-17 10:37:29time="2017-10-17T08:37:29Z" level=error msg="Failed to initialize Kubernetes controller: KUBERNETES_URL is not set"
2017-10-17 10:37:29time="2017-10-17T08:37:29Z" level=info msg="Starting Rancher LB service"
2017-10-17 10:37:30time="2017-10-17T08:37:30Z" level=info msg="LB controller: rancher"
2017-10-17 10:37:30time="2017-10-17T08:37:30Z" level=info msg="LB provider: haproxy"
2017-10-17 10:37:30time="2017-10-17T08:37:30Z" level=info msg="starting rancher controller"
2017-10-17 10:37:30time="2017-10-17T08:37:30Z" level=info msg="Healthcheck handler is listening on :10241"
2017-10-17 10:37:31time="2017-10-17T08:37:31Z" level=info msg=" -- starting haproxy\n * Starting haproxy haproxy\n   ...done.\n"
2017-10-17 10:37:32time="2017-10-17T08:37:32Z" level=info msg=" -- reloading haproxy config with the new config changes\n * Reloading haproxy haproxy\n[WARNING] 289/083732 (57) : parsing [/etc/haproxy/haproxy.cfg:55]: domain 'www.hidden.com' contains no embedded dots nor does not start with a dot. RFC forbids it, this configuration may not work properly.\n[WARNING] 289/083732 (57) : parsing [/etc/haproxy/haproxy.cfg:66]: domain 'www.hidden.com' contains no embedded dots nor does not start with a dot. RFC forbids it, this configuration may not work properly.\n[WARNING] 289/083732 (57) : config : 'option forwardfor' ignored for proxy 'default' as it requires HTTP mode.\n[WARNING] 289/083732 (59) : parsing [/etc/haproxy/haproxy.cfg:55]: domain 'www.hidden.com' contains no embedded dots nor does not start with a dot. RFC forbids it, this configuration may not work properly.\n[WARNING] 289/083732 (59) : parsing [/etc/haproxy/haproxy.cfg:66]: domain 'www.hidden.com' contains no embedded dots nor does not start with a dot. RFC forbids it, this configuration may not work properly.\n[WARNING] 289/083732 (59) : config : 'option forwardfor' ignored for proxy 'default' as it requires HTTP mode.\n   ...done.\n"

(I’ve omitted the real domain, but it follows the same format, i.e. www.domain.com.)
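For reference, those warnings point at the HAProxy config the LB generates inside its container (/etc/haproxy/haproxy.cfg, lines 55 and 66 here), which can be inspected directly if needed. The container name below is only a placeholder; check docker ps on the host for the real one:

    # placeholder name; find the real LB container with: docker ps --format '{{.Names}}'
    docker exec <lb-container-name> cat /etc/haproxy/haproxy.cfg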

Here are the balancer rules:

#  Access  Protocol  Request Host    Source Port  Path  Target                            Target Port  Backend
1  Public  HTTPS     www.hidden.com  443          None  hidden-production/hidden-web-app  80           None
2  Public  HTTP      www.hidden.com  80           None  hidden-production/hidden-web-app  80           None

(Once again, I’ve omitted the real name/domain.)

I also use a Let’s Encrypt cert created with the Let’s Encrypt Certificate Manager, which has worked fine and currently reports no errors.

I run Rancher 1.6.10.

Any idea what might be the issue? Thanks!

…and when I did a full restart of all running services (I’m desperate), load balancers for other apps also started getting stuck in the Initializing state.

I then restarted one of the servers running the apps, and a couple of minutes after the restart the load balancers started working again. Not sure if that was related, though.

We plan to launch this app to the public this week, and this whole issue doesn’t feel great to be honest!

Initializing means the health check has not started passing yet, which typically means cross-host networking is misconfigured or currently unreachable between hosts.

Thanks, that was probably it!
The scheduler and the healthcheck stacks actually had some issues (restarting both solved it).

So I’m getting more issues with the healthcheck services: they get stuck in an Initializing state with the following stderr output (which doesn’t seem to be very helpful):

2017-10-18 14:49:20time="2017-10-18T12:49:20Z" level=info msg="Scheduling apply config"
2017-10-18 14:49:20time="2017-10-18T12:49:20Z" level=info msg="healthCheck -- no changes in haproxy config\n"
2017-10-18 14:49:21time="2017-10-18T12:49:21Z" level=info msg="Scheduling apply config"
2017-10-18 14:49:21time="2017-10-18T12:49:21Z" level=info msg="healthCheck -- no changes in haproxy config\n"
2017-10-18 14:49:22time="2017-10-18T12:49:22Z" level=info msg="Scheduling apply config"
2017-10-18 14:49:22time="2017-10-18T12:49:22Z" level=info msg="healthCheck -- no changes in haproxy config\n"
2017-10-18 14:49:22time="2017-10-18T12:49:22Z" level=info msg="Scheduling apply config"
2017-10-18 14:49:22time="2017-10-18T12:49:22Z" level=info msg="healthCheck -- no changes in haproxy config\n"
2017-10-18 14:49:25time="2017-10-18T12:49:25Z" level=info msg="Scheduling apply config"
2017-10-18 14:49:25time="2017-10-18T12:49:25Z" level=info msg="healthCheck -- no changes in haproxy config\n"
2017-10-18 14:49:25time="2017-10-18T12:49:25Z" level=info msg="Scheduling apply config"
2017-10-18 14:49:25time="2017-10-18T12:49:25Z" level=info msg="healthCheck -- no changes in haproxy config\n"
2017-10-18 14:49:26time="2017-10-18T12:49:26Z" level=info msg="Scheduling apply config"
2017-10-18 14:49:26time="2017-10-18T12:49:26Z" level=info msg="healthCheck -- no changes in haproxy config\n"
2017-10-18 14:49:27time="2017-10-18T12:49:27Z" level=info msg="Scheduling apply config"
2017-10-18 14:49:27time="2017-10-18T12:49:27Z" level=info msg="healthCheck -- no changes in haproxy config\n"
2017-10-18 14:49:28time="2017-10-18T12:49:28Z" level=info msg="Scheduling apply config"
2017-10-18 14:49:28time="2017-10-18T12:49:28Z" level=info msg="healthCheck -- no changes in haproxy config\n"
2017-10-18 14:49:28time="2017-10-18T12:49:28Z" level=info msg="Scheduling apply config"
2017-10-18 14:49:28time="2017-10-18T12:49:28Z" level=info msg="healthCheck -- no changes in haproxy config\n"

Any ideas?

I’m experiencing the same thing with healthcheck and the load balancers (although mine don’t actually work while in Initializing mode), running version 1.6.10. I find that if I reboot the servers involved they work… but for less than 24 hours… not exactly sure when they die, though. @krstffr - were you able to figure anything out?

@jdh I haven’t found anything out, unfortunately. I guess they won’t do anything about this since they’re moving over to Kubernetes in 2.0. :frowning:

Are the infrastructure stack services all working properly when you are experiencing this? Is cross-host networking working properly, i.e. can you ping across hosts using the overlay network addresses (10.42.*)? 1.6 is still going to be LTS, so we won’t drop support for it.
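If you want a concrete way to check that, something along these lines works (the container name and the 10.42.x.x address are placeholders; pick any container on the managed network from docker ps, and an overlay IP of a container running on a different host, e.g. from the Rancher UI):

    # placeholder name and address; ping a 10.42.* IP that lives on ANOTHER host
    docker exec -it <any-managed-network-container> ping -c 3 10.42.x.x

If those pings fail, the IPSec overlay between the hosts is the problem.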


There were some issues with the scheduler and the healthcheck stacks, not sure why. Restarting the whole server that had the problematic stacks seemed to solve it. I actually removed the LB after that, because it all felt too unpredictable to use in production (and I can live without an LB right now), so I’m not quite sure whether the restart actually solved this or not.

My healthcheck stacks are all in an Initializing state and rapidly cycle the following log:

11/3/2017 6:35:18 PMtime="2017-11-03T23:35:18Z" level=info msg="Scheduling apply config"
11/3/2017 6:35:19 PMtime="2017-11-03T23:35:19Z" level=info msg="healthCheck -- no changes in haproxy config\n"

I did a shell into the IPSec containers and couldn’t ping the 10.42 addresses of other containers… but I could ping the host’s gateway.

Restarting the hosts fixes it for a little bit.

Just noticed that one of my hosts had detected one of the Docker network addresses (172.*) instead of the host’s public network IP. Went and re-added all my hosts… will see how long they last.

So I re-added each of my hosts with the host’s main IP specified as a parameter (instead of letting it autodiscover), and now all hosts have the correct LAN IP (instead of the Docker network IP). I’m not sure whether it was the IP itself or re-adding the hosts after clearing all the containers on those machines, but at this point my LBs have been stable for over 72 hours.
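For anyone wanting to do the same, the parameter in question is the CATTLE_AGENT_IP environment variable on the agent registration command. This is only a sketch with placeholder values (IP, image tag, server URL, token); copy the exact command from the “Add Host” screen in your own Rancher UI and just add the variable:

    # placeholders throughout; take the real image tag, URL and token from Rancher's "Add Host" screen
    sudo docker run --rm --privileged \
      -e CATTLE_AGENT_IP=192.168.1.50 \
      -v /var/run/docker.sock:/var/run/docker.sock \
      -v /var/lib/rancher:/var/lib/rancher \
      rancher/agent:v1.2.x https://<rancher-server>/v1/scripts/<registration-token>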

I had experienced the same issue and had to restart the server to fix the healthcheck problem, but I found an alternative way that doesn’t require restarting the server (rough commands after the list):

  1. Stop all Rancher infrastructure containers along with rancher-agent; make sure none of them are left running.
  2. Then start rancher-agent alone; it will bring all the other containers back up.
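Roughly, on each affected host (the infrastructure container names below are only examples; list the real r-* containers with docker ps):

    # stop the agent first so it doesn't immediately restart the others
    docker stop rancher-agent
    # stop the infrastructure containers (example names; see: docker ps --filter "name=r-")
    docker stop r-healthcheck-healthcheck-1 r-scheduler-scheduler-1 r-network-services-network-manager-1
    # start only the agent; it should bring the infrastructure containers back up on its own
    docker start rancher-agent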

See if it works next time :slight_smile: