I was unable to reproduce this on v0.46.0. Did you try refreshing the UI? It did stay in “Initializing” a little longer than normal, but eventually became active.
I just spoke with the engineer. In v0.46.0, we enabled health checks on load balancers. If it’s stuck in “Initializing”, there is a cross-host communication issue between your hosts.
Even though the load balancer is stuck in “Initializing”, it should still work, but it will not show as “Active” until the cross-host communication issues are resolved.
So I did a little more debugging here. After executing a shell in the load balancer agent container, I ran ps aux and found that haproxy was not running. When I tried running it manually via haproxy -f /etc/haproxy/haproxy.cfg, I got the following error:
unable to load SSL certificate from PEM file '/etc/haproxy/certs/server2.pem'
This causes haproxy to exit immediately. Examining the file mentioned there, I see that its contents are only a private key, not a complete PEM certificate, which is definitely an issue. Removing the offending file and then running haproxy -f /etc/haproxy/haproxy.cfg causes haproxy to run and the load balancer to enter the running state.
It looks like something in the load balancer container is mistakenly writing the private keys out as separate files alongside the certs, which is causing this issue.
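A quick way to spot this broken state from inside the agent container is to check whether the file haproxy complains about actually contains a certificate block and not just a key. This is a sketch only; the heredoc below is an illustrative stand-in for the key-only server2.pem described above, not Rancher's actual output:

```shell
# Illustrative stand-in for the key-only file haproxy rejected;
# in a real container you would inspect /etc/haproxy/certs/server2.pem.
pem=server2.pem
cat > "$pem" <<'EOF'
-----BEGIN RSA PRIVATE KEY-----
...key material only...
-----END RSA PRIVATE KEY-----
EOF

# haproxy needs at least one certificate block in the PEM file;
# a key-only file triggers "unable to load SSL certificate".
if grep -q 'BEGIN CERTIFICATE' "$pem"; then
  result="complete"
else
  result="key only"
fi
echo "$result"
```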
I observe the same symptoms. I’m rather new to Rancher; can anyone suggest some debugging strategies to identify which of the problems described above I’m hitting? I’ll then report back.
@nlf On the haproxy side, the .pem should contain the private key followed by the cert/cert chain. The fact that only the private key is present, with no cert following it, sounds like a bug - could you please file it?
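For reference, the ordering haproxy expects can be sketched like this (the file names and placeholder contents are illustrative, not anything Rancher writes):

```shell
# Placeholder inputs standing in for a real key, server cert, and chain.
printf '%s\n' '-----BEGIN PRIVATE KEY-----' '...' '-----END PRIVATE KEY-----' > server.key
printf '%s\n' '-----BEGIN CERTIFICATE-----' '...' '-----END CERTIFICATE-----' > server.crt
printf '%s\n' '-----BEGIN CERTIFICATE-----' '...' '-----END CERTIFICATE-----' > chain.crt

# haproxy-style PEM: private key first, then the cert, then the chain.
cat server.key server.crt chain.crt > server.pem

# Sanity check the ordering: the key must appear before the first cert.
key_line=$(grep -n 'BEGIN PRIVATE KEY' server.pem | cut -d: -f1)
cert_line=$(grep -n -m1 'BEGIN CERTIFICATE' server.pem | cut -d: -f1)
```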
Removing the offending file, then running haproxy -f /etc/haproxy/haproxy.cfg causes haproxy to run and the load balancer to enter the running state.
That would be a valid workaround, but only until the LB gets an update (a new instance registers with it, the LB service instance restarts, etc.), as we would reapply the lb config and your change would be overridden. Instead, you should disassociate the certificate from the LB on the Rancher side.
@Osso @blaggacao A service appearing as degraded while its instances are running is an indication that the health check has failed for this service. We are planning to add more APIs for health check reporting in a future release.
To debug this issue in the current release, could you please do the following:
Execute the following query against the cattle database:
select reported_health, host_id from service_event where instance_id=[id of your LB container]
For every host reporting a DOWN state, do the following:
log in to the network agent running on that host
try to telnet to LB_instance_IP:42 (42 is the port haproxy opens for the health check validation). If the telnet fails, it means the connectivity between hosts isn’t working for some reason. In most cases this is caused by a firewall blocking traffic on the ipsec ports 500/4500; these have to be enabled as per http://docs.rancher.com/rancher/rancher-ui/infrastructure/hosts/custom/#security-groups-firewalls
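The connectivity check in the last step can be scripted from inside the network agent. This is a sketch assuming bash (for its /dev/tcp redirection) and coreutils timeout are available; 127.0.0.1 is a placeholder for the LB instance IP of a host reported DOWN:

```shell
lb_ip=127.0.0.1   # placeholder: substitute the IP of the LB instance reported DOWN
port=42           # the haproxy health-check port mentioned above

# bash's /dev/tcp gives a telnet-style TCP probe without extra tooling.
if timeout 2 bash -c "exec 3<>/dev/tcp/$lb_ip/$port" 2>/dev/null; then
  result="reachable"
else
  result="unreachable - check ipsec ports 500/4500 in the firewall"
fi
echo "port $port: $result"
```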
I updated to 0.47 the other day and my LB services now work properly. So the good news is that the bug that affected my LBs has been fixed between 0.46 and 0.47.
The bad news is that I can no longer check why it was broken on 0.46.
@etlweather What version of Rancher are you running? If you’ve recently upgraded to v0.56.1, have you confirmed that your cross-host networking is working? Exec into a network agent and ping the IPs of the other network agents.
Typically, it stays in “Initializing” if the health check of the load balancer isn’t working, which implies that cross-host networking isn’t working.
You’re right - somehow I missed that. The network agent wasn’t able to establish a network connection with the other agents because they were on a different network with a network ACL blocking the traffic.
I also ran into this problem on rancher v1.0.1 using rancher/agent-instance:v0.8.1
This is how it happened to me:
I added a new stack from a docker-compose/rancher-compose file. The rancher-compose file referenced an SSL certificate which was no longer present in my SSL certificate list.
Adding the certificate back under the name referenced by the load balancer, then restarting/recreating the load balancer with that valid certificate name, also did not work.
Solution:
Add the certificate under a new name
Create a new stack referencing that certificate name.
I just encountered a similar issue on server v1.0.1 with agent-instance v0.8.1.
I had an issue with a new certificate not being used for certain domains (I was getting a different certificate I had added before). At the time I thought it was an issue with SNI.
I also got the error unable to load SSL certificate from PEM file '...'.
I tried:
restarting container
recreating stack with loadbalancer
creating identical stack under different name
removing certificates {default,server*}.pem from loadbalancer
recreating all certificates
restarting some more
deleting loadbalancer container
After all this, it somehow started working again. So my guess is there is an issue with how the load balancer gets its certificates (or somewhere along the way: UI -> rancher -> cattle -> loadbalancer).
Also, while I was trying all of that, the UI sometimes displayed in the load balancer service information (main view, bottom right corner) used certificates: A; default: A, when the config had certs: A; default_cert: B.
I just noticed that the certificates I had recently added were missing the last line of the chain certificates:
-----END CERTIFICATE-----
I’ve noticed this in two environments already. Both load balancers showed as OK, but on restart they turned yellow. When I fixed it (added the missing line), they changed to green and started serving the correct SSL certificate. But neither they nor the UI reported an error on the invalid certificate input.
Could it be possible that the line was somehow removed? I might have made the same mistake twice, but I was copy-pasting the keys from a terminal on two different days.
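One way to catch this kind of truncation before pasting a cert into the UI is to compare the BEGIN/END counts in the chain. A sketch, where the heredoc is an illustrative stand-in for a pasted chain missing its final END line:

```shell
# Stand-in for a pasted chain whose last -----END CERTIFICATE----- was lost.
pem=chain.pem
cat > "$pem" <<'EOF'
-----BEGIN CERTIFICATE-----
...server cert...
-----END CERTIFICATE-----
-----BEGIN CERTIFICATE-----
...intermediate cert, truncated...
EOF

# Every BEGIN CERTIFICATE line should have a matching END CERTIFICATE line.
begins=$(grep -c -- 'BEGIN CERTIFICATE' "$pem")
ends=$(grep -c -- 'END CERTIFICATE' "$pem")
if [ "$begins" -ne "$ends" ]; then
  echo "truncated chain: $begins BEGIN vs $ends END"
fi
```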