Load Balancer stays in "Initializing" state

Hello,

When I start a new Load Balancer in Rancher 0.46 through Application / Stack, the LB remains in “Initializing” state and the Stack stays in “Degraded” state.

In the LB logs, nothing seems to be in error:

19 November 2015 14:40:52 UTC+1 INFO: ROOT -> ./
19 November 2015 14:40:52 UTC+1 INFO: ROOT -> ./etc/
19 November 2015 14:40:52 UTC+1 INFO: ROOT -> ./etc/monit/
19 November 2015 14:40:52 UTC+1 INFO: ROOT -> ./etc/monit/monitrc
19 November 2015 14:40:52 UTC+1 INFO: Sending monit applied 1-d166a713486fe4c0f039e152939d8d3b8ab8e6c2518e13d88f0b7ec68da46109
19 November 2015 14:40:52 UTC+1 INFO: HOME -> ./
19 November 2015 14:40:52 UTC+1 INFO: HOME -> ./etc/
19 November 2015 14:40:52 UTC+1 INFO: HOME -> ./etc/cattle/
19 November 2015 14:40:52 UTC+1 INFO: HOME -> ./etc/cattle/startup-env
19 November 2015 14:40:52 UTC+1 INFO: ROOT -> ./
19 November 2015 14:40:52 UTC+1 INFO: ROOT -> ./etc/
19 November 2015 14:40:52 UTC+1 INFO: ROOT -> ./etc/init.d/
19 November 2015 14:40:52 UTC+1 INFO: ROOT -> ./etc/init.d/agent-instance-startup
19 November 2015 14:40:52 UTC+1 INFO: Sending agent-instance-startup applied 1-7d7ae49c4512939662575e5f88cac07c309a10dd3ad5d6d44c72045ecf461af2
19 November 2015 14:40:52 UTC+1 monit: generated unique Monit id 47838382cd8218274b20fcc4773dd4d9 and stored to '/var/lib/monit/id'
19 November 2015 14:40:52 UTC+1 Starting monit daemon with http interface at [localhost:2812]

I was unable to reproduce this on v0.46.0. Did you try refreshing the UI? It did appear to stay in “Initializing” a little longer than normal, but it eventually became active.

I’m having this same problem. I left it in the Initializing state overnight and it’s still stuck.

Hi Denise, Same issue with v0.46.0

I am having the same problem here.
In my case the LB works fine on the same machine as rancher/server, but not on different machines.

Other observations:

  • the overlay network is working fine
  • disabling the firewall does not help

Same issue with v0.46.0.

I just spoke with the engineer. In v0.46.0, we enabled health checks on load balancers. If it’s stuck in “Initializing”, there are cross-host communication issues between your hosts.

Even though the load balancer is stuck in “Initializing”, it should still work, but it will not go “Active” until the cross-host communication issues are resolved.

So I did a little more debugging here. After executing a shell in the load balancer agent container, I ran ps aux and found that haproxy was not running. I tried starting it manually via haproxy -f /etc/haproxy/haproxy.cfg and got the following error:

unable to load SSL certificate from PEM file '/etc/haproxy/certs/server2.pem'

This causes haproxy to exit immediately. Examining the file mentioned there, I see that it contains only a private key, not a complete PEM certificate, which is definitely an issue. Removing the offending file and then running haproxy -f /etc/haproxy/haproxy.cfg allows haproxy to start and the load balancer to enter the running state.

It looks like something in the load balancer container is mistakenly writing out key-only files alongside the certs, which is causing this issue.
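In case it helps anyone else debugging this, here is a quick way to check what a file in that certs directory actually contains (plain grep/openssl from a shell inside the agent container; server2.pem is just the file from my error above):

# count certificate and private-key blocks in the PEM haproxy complained about
grep -c 'BEGIN CERTIFICATE' /etc/haproxy/certs/server2.pem
grep -c 'BEGIN.*PRIVATE KEY' /etc/haproxy/certs/server2.pem

# a usable haproxy PEM needs at least one cert block; this prints the cert subject,
# but fails with "unable to load certificate" on a key-only file like mine
openssl x509 -noout -subject -in /etc/haproxy/certs/server2.pem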

We have different problems then; my haproxy is running properly (and the LB is working),
yet the state still shows as degraded.

I observe the same symptoms. I’m rather new to Rancher; can anyone suggest some debugging strategies to identify which of the problems described above I’m hitting? I’ll then be able to report back.

@nlf The .pem on the haproxy side should contain the private key followed by the cert/cert chain. The fact that only the private key is present without the cert following it sounds like a bug - could you please file it?
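For reference, such a file is just the private key and the cert chain concatenated; something along these lines (file names are placeholders) produces a PEM haproxy will accept:

# assemble key + server cert + intermediates into a single PEM
cat server.key server.crt intermediate.crt > /etc/haproxy/certs/server.pem

# validate the config, including the certs it references, without starting haproxy
haproxy -c -f /etc/haproxy/haproxy.cfg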

Removing the offending file, then running haproxy -f /etc/haproxy/haproxy.cfg causes haproxy to run and the load balancer to enter the running state.

That would be a valid workaround, but only until the LB gets an update (a new instance registers to it, the LB service instance restarts, etc.), as we would reapply the LB config and your change would be overridden. Instead, you should disassociate the certificate from the LB on the Rancher side.

@Osso @blaggacao A service appearing as degraded while its instances are running is an indication that the health check has failed for that service. We are planning to add more APIs for health check reporting in a future release:

https://github.com/rancher/rancher/issues/2128

To debug this issue in the current release, could you please do the following:

  1. Execute the following DB query against the cattle database (one way to run it is sketched in the note below):
select reported_health, host_id from service_event where instance_id=[id of your LB container]

For every host reporting a DOWN state, do the following:
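As a note on the query in step 1: one way to run it, assuming the default setup where the rancher/server container runs its embedded MySQL (the container name and the cattle/cattle credentials below are assumptions; adjust them for an external database):

# open a MySQL shell against the cattle database inside the rancher/server container
docker exec -it rancher-server mysql -u cattle -pcattle cattle

-- then, at the mysql prompt (42 is only a stand-in for the id of your LB container):
select reported_health, host_id from service_event where instance_id=42;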

I updated to 0.47 the other day and my LB services now work properly. So the good news is that the bug that affected my LBs was fixed between 0.46 and 0.47.
The bad news is that I can no longer check why it was broken on 0.46.

Kind of old, but I am having a similar issue with an LB running agent-instance 0.6.0.

The LB container logs show:

2/3/2016 5:23:12 PM INFO: ROOT -> ./
2/3/2016 5:23:12 PM INFO: ROOT -> ./etc/
2/3/2016 5:23:12 PM INFO: ROOT -> ./etc/default/
2/3/2016 5:23:12 PM INFO: ROOT -> ./etc/default/haproxy
2/3/2016 5:23:12 PM INFO: ROOT -> ./etc/monit/
2/3/2016 5:23:12 PM INFO: ROOT -> ./etc/monit/conf.d/
2/3/2016 5:23:12 PM INFO: ROOT -> ./etc/monit/conf.d/haproxy
2/3/2016 5:23:12 PM INFO: ROOT -> ./etc/haproxy/
2/3/2016 5:23:12 PM INFO: ROOT -> ./etc/haproxy/haproxy.cfg
2/3/2016 5:23:12 PM INFO: ROOT -> ./etc/haproxy/certs/
2/3/2016 5:23:12 PM INFO: ROOT -> ./etc/haproxy/certs/certs.pem
2/3/2016 5:23:12 PM INFO: ROOT -> ./etc/haproxy/certs/default.pem
2/3/2016 5:23:12 PM INFO: Sending haproxy applied 2-3e0d61c5cd2851e80416ea7bdd7aa59b417a3f11fa7852d0f0f91a26a6cdadb9
2/3/2016 5:23:12 PM INFO: HOME -> ./
2/3/2016 5:23:12 PM INFO: HOME -> ./etc/
2/3/2016 5:23:12 PM INFO: HOME -> ./etc/cattle/
2/3/2016 5:23:12 PM INFO: HOME -> ./etc/cattle/startup-env
2/3/2016 5:23:12 PM INFO: ROOT -> ./
2/3/2016 5:23:12 PM INFO: ROOT -> ./etc/
2/3/2016 5:23:12 PM INFO: ROOT -> ./etc/init.d/
2/3/2016 5:23:12 PM INFO: ROOT -> ./etc/init.d/agent-instance-startup
2/3/2016 5:23:12 PM INFO: Sending agent-instance-startup applied 1-c496253a7bbd2374be9fc021f1a3c515030fa7b0f8f7bb16b86da996b7e3f0e7
2/3/2016 5:23:12 PM monit: generated unique Monit id 0540aec365d19c8303f39fcacacd6182 and stored to '/var/lib/monit/id'
2/3/2016 5:23:12 PM Starting monit daemon with http interface at [localhost:2812]

Don’t know if it’s relevant, but syslog (on an Ubuntu 14.04.3 server) shows:

Feb  3 17:26:06 docker651-1 kernel: [437985.377956] audit: type=1400 audit(1454538366.320:1231): apparmor="DENIED" operation="ptrace" profile="docker-default" pid=13947 comm="monit" requested_mask="trace" denied_mask="trace" peer="docker-default"
Feb  3 17:26:06 docker651-1 kernel: [437985.378313] audit: type=1400 audit(1454538366.320:1232): apparmor="DENIED" operation="ptrace" profile="docker-default" pid=13947 comm="monit" requested_mask="trace" denied_mask="trace" peer="docker-default"

Even though the container remains in the “Initializing” state, it is operational.

@etlweather What version of Rancher are you running? If you’ve recently upgraded to v0.56.1, have you confirmed that your cross-host networking is working? Exec into a network agent and ping the IPs of the other network agents.

Typically, it stays in “Initializing” if the health check of the load balancer isn’t working, which implies cross-host networking isn’t working.
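For anyone unsure how to run that check, something along these lines works (the container id and the 10.42.x address are examples; use the actual Network Agent container on one host and the managed-network IP of a network agent or LB container on another host):

# find the Network Agent container on this host (it runs the rancher/agent-instance image)
docker ps | grep agent-instance

# exec into it and ping the 10.42.x.x address of an agent on another host;
# 10.42.100.7 is just an example address read from the other host's container
docker exec -it <network-agent-container-id> ping -c 3 10.42.100.7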

You’re right - somehow I missed that. The network agent wasn’t able to establish a network connection with the other agents because they were on a different network, with a network ACL blocking the traffic.

I also ran into this problem on Rancher v1.0.1 using rancher/agent-instance:v0.8.1

This is how it happened to me:

I added a new stack from a docker-compose/rancher-compose file. The rancher-compose file referenced an SSL certificate, which was no longer present in my SSL certificates list.

Adding the certificate using the name referenced by the load balancer, and then restarting/recreating the load balancer referencing the valid certificate name, also did not work.

Solution:

  • Add certificate using a new name
  • Create a new stack referencing the certificate name.

I just encountered a similar issue on server v1.0.1 / agent-instance v0.8.1.

I had an issue with a new certificate not being used for different domains (I was getting a different certificate I had added before). At the time I thought there was an issue with SNI.
I also got the error unable to load SSL certificate from PEM file '...'.

I tried:

  • restarting container
  • recreating stack with loadbalancer
  • creating identical stack under different name
  • removing certificates {default,server*}.pem from loadbalancer
  • recreating all certificates
  • restarting some more
  • deleting loadbalancer container

After all this it somehow started working again. So my guess is there is an issue with how the load balancer gets its certificates (or somewhere along the way UI -> rancher -> cattle -> load balancer).

Also, while I was trying all of that, the UI sometimes displayed in the load balancer service information (main view, bottom right corner) used certificates: A; default: A, when in the config I had certs: A; default_cert: B.

I just noticed that certificates I had recently added were missing the last line of the chain certificates:

-----END CERTIFICATE-----

I have noticed this in two environments already. Both load balancers were showing OK, but on restart they were yellow. When I fixed it (added the missing line) they changed to green and started serving the correct SSL certificate. But they didn’t report an error on the wrong certificate input, and neither did the UI.
Could it be possible that it was somehow removed? It’s possible I made the same mistake twice, but I was always copy-pasting keys from the terminal, on two different days.
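In case it helps anyone else watching for this, a quick sanity check on a chain before pasting it into the UI (chain.pem is a placeholder for the local file I copy from):

# every BEGIN must have a matching END; these two counts should be equal
grep -c 'BEGIN CERTIFICATE' chain.pem
grep -c 'END CERTIFICATE' chain.pem

# prints the subject/issuer of every cert in the bundle and fails if a block is truncated
openssl crl2pkcs7 -nocrl -certfile chain.pem | openssl pkcs7 -print_certs -noout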

@mishak If you see it happening again, please file an issue on GitHub and we can take a look at trying to reproduce it.