I have been using Rancher heavily since 0.59, moving up to 0.63 and now 1.0. I have come across an issue where some of the Rancher service containers are stuck in the Initializing state.
They all seem to function correctly but never move to Running.
I thought it was only the Load Balancer and Convoy Gluster containers having issues, but I configured a Route53 service from the Rancher Catalog and got the same result.
Please forgive me if there is already a topic open for this or a GitHub issue submitted. I have searched around and cannot find the same issue I am having; most reports show the container not working at all.
Where should I look for debugging info to post here?
Any help or guidance would be greatly appreciated.
Have you confirmed that your cross-host networking is still working? You can check by exec-ing into a network agent and pinging another network agent’s IP.
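Something like this, for example (if I remember right, the Network Agent containers run the rancher/agent-instance image; the container ID and IP are placeholders):

```
# on host A, find the Network Agent container
docker ps | grep agent-instance

# exec into it and ping another host's Network Agent 10.42.x.x IP
docker exec -it <network-agent-container-id> ping -c 3 10.42.183.201
```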
Yep, I exec-ed into the three Rancher agents across our three hosts and they can all ping each other.
I assume the “network agent’s IP” refers to the host IP shown in the Infrastructure -> Hosts list in the Rancher UI?
They can also ping the Network Agent’s 10.42.x.x address shown under “Standalone Containers” as well.
Looking further into it: if I create a Route53 stack from the catalog in a new environment, it works without any issues. (Note that the new environment only had one host in it.)
So I tried re-creating the stack/service with the same docker/rancher compose files, but this time modified them to force the service onto a different host via the label “io.rancher.scheduler.affinity:host_label” (see the sketch just below). It rendered the same result: Initializing.
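For reference, the change was along these lines, a hedged sketch assuming compose v1 syntax (service name, image tag and host label are placeholders):

```
# pin the service to hosts labelled route53=true; the label value must
# match a label set on the target host in the Rancher UI
cat > docker-compose.yml <<'EOF'
route53:
  image: rancher/external-dns:v0.1.0
  labels:
    io.rancher.scheduler.affinity:host_label: route53=true
EOF
```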
My upgrade path from version 0.59.1 wasn’t the greatest.
Rough outline of the path:
0.59.1 ended up running out of disk space due to bloated logging data.
Exported the database to a fresh MySQL server on a new host (roughly the dump/restore sketched after this list).
Loaded up 0.63 on a new host pointing at the new MySQL database.
Removed and re-created the Rancher agents/config to point at the new Rancher server.
The Rancher agents ended up in a re-create loop.
Managed to fix that and purge all the old containers I had; there were a few left over from trying out GlusterFS.
The Load Balancer service and Convoy Gluster were in an Active state at version 0.63.
Upgraded to Rancher 1.0.
The Load Balancer / Convoy Gluster are now showing Initializing.
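In case the detail helps, the database move was roughly this (host names and passwords are placeholders; cattle is Rancher’s default database name, and the --db-* flags are how rancher/server is pointed at an external MySQL):

```
# dump the cattle database from the old, full host
mysqldump -u cattle -p --databases cattle > cattle.sql

# load it into the fresh MySQL server
mysql -h new-db-host.example.com -u cattle -p < cattle.sql

# start the new server against the external database
docker run -d --restart=always -p 8080:8080 rancher/server \
  --db-host new-db-host.example.com --db-port 3306 \
  --db-user cattle --db-pass changeme --db-name cattle
```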
One thing to note: as of 0.63 I could successfully use the API to add new service links to the Load Balancer.
I can still do so now, but the config is only applied if I go into the LB’s edit screen and click Save.
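For reference, the call I mean is (if memory serves) the setservicelinks action on the load balancer service in the v1 API; a hedged sketch with placeholder keys and IDs:

```
# replaces the LB's full set of service links in one call
curl -s -u "$RANCHER_ACCESS_KEY:$RANCHER_SECRET_KEY" \
  -X POST -H 'Content-Type: application/json' \
  -d '{"serviceLinks": [{"serviceId": "1s42", "ports": ["app.example.com:80"]}]}' \
  "http://rancher-server:8080/v1/loadbalancerservices/1s10/?action=setservicelinks"
```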
I recently added Route53 and found the same issue.
It seems all the other containers I create are fine; it only affects Rancher services like the Load Balancer, Convoy and Route53.
Is there a log or data table that shows what it’s waiting for in “Initializing”? For reference, the Route53 container’s own logs look healthy:
time="2016-04-11T02:40:02Z" level=info msg="CLOUDFLARE_EMAIL is not set, skipping init of CloudFlare provider"
time="2016-04-11T02:40:02Z" level=info msg="DNSIMPLE_TOKEN is not set, skipping init of DNSimple provider"
time="2016-04-11T02:40:02Z" level=info msg="GANDI_APIKEY is not set, skipping init of Gandi provider"
time="2016-04-11T02:40:02Z" level=info msg="POINTHQ_TOKEN is not set, skipping init of PointHQ provider"
time="2016-04-11T02:40:04Z" level=info msg="Configured Route53 with hosted zone \"myhosted.\" in region \"us-west-2\""
time="2016-04-11T02:40:04Z" level=info msg="Starting Rancher External DNS service"
time="2016-04-11T02:40:04Z" level=info msg="Powered by Route53"
time="2016-04-11T02:40:04Z" level=info msg="Healthcheck handler is listening on :1000"
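For what it’s worth, the closest I’ve found to an answer myself: the v1 API exposes healthState, transitioning and transitioningMessage fields on each service, which should show what the state machine is blocked on. A hedged sketch (server URL, keys and service ID are placeholders):

```
# dump the service resource and pull out the health/transition fields
curl -s -u "$RANCHER_ACCESS_KEY:$RANCHER_SECRET_KEY" \
  "http://rancher-server:8080/v1/services/1s10" \
  | python -m json.tool | grep -iE 'healthState|transitioning'
```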
I’m sure you’ve got this figured out by now, but I just encountered this, so for those who hit this problem in the future:
We are using AWS as well. I decided to tighten up my ports, so in the Security Groups I restricted communication to only between the Rancher server and the Rancher agent on each host within AWS. That left things stuck in Initializing until I opened the Rancher agent ports UDP 500 and UDP 4500 back up to 0.0.0.0/0. This makes sense for agents that need to talk across hosts, since those are the IPsec ports.
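If it saves anyone a trip to the console, reopening those two ports with the AWS CLI looks roughly like this (the security group ID is a placeholder):

```
# re-open the IPsec ports Rancher's overlay network uses (UDP 500 and 4500)
aws ec2 authorize-security-group-ingress --group-id sg-0123456789abcdef0 \
  --protocol udp --port 500 --cidr 0.0.0.0/0
aws ec2 authorize-security-group-ingress --group-id sg-0123456789abcdef0 \
  --protocol udp --port 4500 --cidr 0.0.0.0/0
```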