I’m trying to solve an issue with a few containers not going active after a new Rancher HA deployment with Rancher 1.1.14. The containers show this message in their logs:
time="2016-09-30T15:14:08Z" level=fatal msg="Error 503 accessing /version path"
I’ve checked firewall settings, HAProxy configs, and redeployed the Rancher cluster, but I have not been able to solve the issue. All 3 Rancher nodes show in the cluster and I’m able to access the UI. Four containers give the same message as above: go-machine-service, rancher-compose-executor, websocket-proxy, and websocket-proxy-ssl all fail with the 503 error and never reach an active state.
Are you using a load balancer to front your servers, and is the registration URL for rancher-agent pointing to the load balancer?
Are your servers showing as healthy in your load balancer?
I’ve experienced this and needed to set the health check on the load balancer to check against HTTP:18080/ping instead of HTTP:80/ping. The service that handles port 80 doesn’t come up until it can talk to something through the load balancer, but the load balancer has the servers marked unhealthy because they aren’t yet listening on 80 (chicken v. egg).
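Roughly what I mean on the HAProxy side (server names and IPs below are just placeholders for your own nodes):

```
backend rancher_servers
    # send the health check to the /ping endpoint...
    option httpchk GET /ping
    # ...but on port 18080, while real traffic still goes to port 80
    server rancher1 10.0.0.11:80 check port 18080
    server rancher2 10.0.0.12:80 check port 18080
    server rancher3 10.0.0.13:80 check port 18080
```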
John,
Thanks very much for your reply. I have an HAProxy cluster using keepalived for the VIP in front of the Rancher cluster. All servers respond on HTTP:18080/ping with a pong response. My backends in HAProxy are okay, with the exception of the websockets backend, which responds with a status 200 instead of the expected status 101. The registration URL points to the VIP on the HAProxy, and I’m able to view the Rancher UI and cluster status on the VIP:80. I have not configured HAProxy for any services on 18080, only 80.
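For reference, I verified the ping endpoint on each node directly (hostnames here are placeholders for my actual nodes):

```
# each Rancher node answers "pong" on the health endpoint
curl -s http://rancher-node1:18080/ping
curl -s http://rancher-node2:18080/ping
curl -s http://rancher-node3:18080/ping
```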
If you’re available I would also love to chat in the #rancher IRC channel on Freenode to better explain the issue in realtime and answer questions.
Hey @caseyrichins, is your HAProxy explicitly configured to enable websocket connections? Here’s an example of a working config on the Rancher docs site. Hope that helps.
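The parts that usually matter for the websocket traffic (the names and addresses below are placeholders, not the exact config from the docs) are running in HTTP mode and giving long-lived connections a generous tunnel timeout, roughly:

```
defaults
    mode http
    timeout connect 5s
    timeout client  60s
    timeout server  60s
    # websocket connections stay open for a long time
    timeout tunnel  1h

frontend rancher
    bind *:80
    default_backend rancher_servers

backend rancher_servers
    option forwardfor
    server rancher1 10.0.0.11:80 check port 18080
    server rancher2 10.0.0.12:80 check port 18080
    server rancher3 10.0.0.13:80 check port 18080
```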
After much troubleshooting and debugging I figured out why my setup wouldn’t deploy properly. This setup was being done not for me but for a client, and it turned out the client had set up a wildcard record pointing to an Amazon EC2 instance in their CloudFlare DNS records. That wildcard record caused DNS lookups to return the EC2 instance whenever Rancher appended the search domain, because the client’s domain was higher in the search order than rancher.local.
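For anyone who runs into something similar, the quickest way to confirm a wildcard record like this (example.com below stands in for the client’s real domain) is to query a name that shouldn’t exist and see it resolve anyway:

```
# a hostname that should return NXDOMAIN still resolves, because the
# wildcard record answers for it (example.com is a placeholder domain)
dig +short does-not-exist.example.com
# 54.x.x.x   <- the EC2 instance's address comes back instead of NXDOMAIN
```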