Rancher Hosts Continually Disconnect

wcbzero · October 27, 2015, 4:34pm

Hello, I am having some trouble with my rancher hosts. I have two environments:

Dev: 3 Hosts
Prod: 4 Hosts + Host for rancher server

In both environments the hosts disconnected overnight. I can reconnect them by selecting Add Host->Custom, and they work fine until they disconnect again.

We are currently running the following versions:
Rancher v0.42.0
Cattle v0.102.0
User Interface v0.57.0
Rancher Compose v0.4.3
On Ubuntu 14.04.3 LTS with Docker version 1.8.3, build f4bf5c7

All hosts are on the same subnet with no firewalls between them.

Where should I be looking to troubleshoot this issue?

denise · October 28, 2015, 4:53am

Can you first upgrade to v0.43.1? We recently fixed a websocket timeout issue in the agent, which might resolve the disconnects that users have been seeing.

Rucknar · October 28, 2015, 11:46am

Is there a new agent with 0.43.1? I’ve upgraded but my agents didn’t cycle, still showing 0.8.2

kaos · October 28, 2015, 12:10pm

No, it’s still 0.8.2, per the release notes:

rancher/server:v0.43.1
rancher/agent:v0.8.2
– Rancher Release - v0.43.1

wcbzero · October 28, 2015, 2:08pm

Hi Denise, I will get our install upgraded today and report back tomorrow.

wcbzero · October 28, 2015, 9:21pm

I upgraded the rancher server container this morning, and my hosts have already disconnected again. What do you recommend I do next?

One detail I left out that may be important: I am running the MySQL server outside of the Docker/Rancher environment, and that database has persisted through several upgrades.

I use this command to launch my rancher instance:

docker run -d --restart=always -p 127.0.0.1:8080:8080 \
  --name="rancher_server" \
  -e CATTLE_DB_CATTLE_MYSQL_HOST=[external db host] \
  -e CATTLE_DB_CATTLE_MYSQL_PORT=3306 \
  -e CATTLE_DB_CATTLE_MYSQL_NAME=rancher \
  -e CATTLE_DB_CATTLE_USERNAME=rancher \
  -e CATTLE_DB_CATTLE_PASSWORD=[password] \
  -v [path to cert] \
  rancher/server:latest

I am also proxying the rancher connections over ssl via nginx on the rancher server host.

denise · November 2, 2015, 10:48pm

Can you check the logs for the rancher-server container to see if anything stands out? Please also check the rancher-agent logs.

Were you doing a lot of activity on your hosts through a script? There is a known issue where a lot of activity on a host (probably at large scale, in the 100s) causes Docker to hang on that host, which in turn causes the reconnecting state. We are investigating that issue.

Not sure if that’s relevant to your setup or not, but wanted to throw it out there.
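
For the log check suggested above, a small filter can help surface lines worth reading. This is only a sketch under stated assumptions: the keyword list is a guess rather than anything Rancher documents, the helper names `suspicious_lines` and `container_logs` are made up for illustration, and the container names (`rancher_server` from the launch command earlier in the thread, `rancher-agent` on the hosts) may differ in your setup.

```python
import subprocess

# Assumed keywords; adjust them to whatever your logs actually contain.
KEYWORDS = ("error", "exception", "timeout", "disconnect", "reconnect")


def suspicious_lines(log_text, keywords=KEYWORDS):
    """Return the log lines that contain any keyword (case-insensitive)."""
    return [
        line
        for line in log_text.splitlines()
        if any(keyword in line.lower() for keyword in keywords)
    ]


def container_logs(name, tail=500):
    """Fetch the last `tail` log lines of a container through the docker CLI.

    stdout and stderr are merged because containers may log to either stream.
    """
    result = subprocess.run(
        ["docker", "logs", "--tail", str(tail), name],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        universal_newlines=True,
    )
    return result.stdout
```

On the server host, `suspicious_lines(container_logs("rancher_server"))` would return the matching lines. The same call with the agent container's name can be run on each host.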

ebishop · November 9, 2015, 9:52pm

I’m seeing this same problem with a rancher server running on vmware. I have not seen this problem with the server instance running on openstack.

I had two servers in this condition running Ubuntu 14, Docker 1.8.2. Just to be sure I was starting clean, I removed the hosts from the UI, then got on the servers and removed all images and containers, even the exited ones. I even gave the Docker engine a restart for good measure.

I copied the docker command from the add host “Other” page, and started the container. One server showed up in the UI quickly, but on the other, rancher-agent is spewing something like the following, and the host never registers in the UI.

INFO: Port:
INFO: Required Image: rancher/agent:v0.8.2
INFO: Current Image:rancher/agent:v0.8.2
INFO: Using image rancher/agent:v0.8.2
INFO: Downloading agent http://rancher.myzone.com:8080/v1/configcontent/configscripts
{"id":"6887ce28-88de-4083-91f8-xx191cc077b7","type":"error","links":{},"actions":{},"status":401,"code":"Unauthorized","message":"Unauthorized","detail":null}
gzip: stdin: not in gzip format
tar: Child returned status 1
tar: Error is not recoverable: exiting now

Is there something more I can cleanup or look at?
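
One reading of the log above is that the download returned a JSON error body (status 401) instead of an archive, and that body was then handed to gzip/tar, which would explain the "not in gzip format" message. The sketch below shows how a downloaded body could be classified before unpacking. The function name `describe_payload` is hypothetical and not part of Rancher. The error shape it checks for is simply the one shown in the log.

```python
import json

# Every gzip stream starts with these two magic bytes.
GZIP_MAGIC = b"\x1f\x8b"


def describe_payload(body):
    """Classify a downloaded body as a gzip archive, an API error, or other."""
    if body[:2] == GZIP_MAGIC:
        return "gzip archive"
    try:
        doc = json.loads(body.decode("utf-8"))
    except (UnicodeDecodeError, ValueError):
        return "unknown payload"
    # Error shape taken from the log above: {"type": "error", "status": 401, ...}
    if isinstance(doc, dict) and doc.get("type") == "error":
        return "API error {} {}".format(doc.get("status"), doc.get("code"))
    return "JSON document"
```

If the body is reported as an API error, the gzip and tar messages are a consequence of the failed download rather than a separate problem, so the 401 is the thing to chase.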

NarunasK · March 6, 2017, 10:48am

I second that. I’m testing 2 worker instances on EC2. Overnight, both hosts get disconnected. They reconnect without a problem if they are rebooted from the EC2 console. I also noticed that I can no longer ssh into my workers, though port 22 is still open. Furthermore, both instances show a steady 15% CPU utilization although there are no real jobs scheduled, just two idle containers (nginx and base debian) running.

Rancher	v1.4.1
Cattle	v0.176.9
User Interface	v1.4.6
Rancher CLI	v0.4.1
Rancher Compose	v0.12.2

auxiLum_Support · August 14, 2017, 9:58pm

I’m having similar issues. Can anyone help?
Thanks

Listener_me · August 16, 2017, 7:46am

The issue is probably related to Docker. Check whether “docker ps” works or hangs. I had the same issue when I tried to pull many images at the same time with the docker client.
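
A hung Docker daemon usually shows up as a command that never returns, so the check above can be wrapped in a timeout. This is a minimal sketch: the helper name `command_state` and the 10-second default are arbitrary choices for illustration, not values taken from Docker or Rancher.

```python
import subprocess


def command_state(cmd, timeout=10.0):
    """Run cmd and report 'ok', 'failed', 'hung' (timed out), or 'missing'."""
    try:
        result = subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout,  # the child is killed if it runs longer than this
        )
    except subprocess.TimeoutExpired:
        return "hung"
    except FileNotFoundError:
        return "missing"
    return "ok" if result.returncode == 0 else "failed"
```

On a host, `command_state(["docker", "ps"])` returning `"hung"` would point at the Docker daemon rather than at the Rancher agent.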
