Rancher Hosts Continually Disconnect

Hello, I am having some trouble with my rancher hosts. I have two environments:

Dev: 3 Hosts
Prod: 4 Hosts + Host for rancher server

In both environments the hosts disconnected overnight. I can reconnect them by selecting Add Host -> Custom, and they work fine until they disconnect again.

We are currently running the following versions:
Rancher v0.42.0
Cattle v0.102.0
User Interface v0.57.0
Rancher Compose v0.4.3
On Ubuntu 14.04.3 LTS with Docker version 1.8.3, build f4bf5c7

All hosts are on the same subnet with no firewalls between them.

Where should I be looking to troubleshoot this issue?

Can you first upgrade to v0.43.1? We recently fixed a websocket timeout issue in the agent, which might resolve the disconnects that users have been seeing.

Is there a new agent with 0.43.1? I’ve upgraded but my agents didn’t cycle, still showing 0.8.2

No, it’s still 0.8.2, per the release notes:

rancher/server:v0.43.1
rancher/agent:v0.8.2
– Rancher Release - v0.43.1

Hi Denise, I will get our install upgraded today and report back tomorrow.

I upgraded the rancher server container this morning, and my hosts have already disconnected again. What do you recommend I do next?

One detail I left out that may be important: I am running the MySQL server outside of the Docker/Rancher environment, and that database has persisted through several upgrades.

I use this command to launch my rancher instance:

docker run -d --restart=always -p 127.0.0.1:8080:8080 \
  --name="rancher_server" \
  -e CATTLE_DB_CATTLE_MYSQL_HOST=[external db host] \
  -e CATTLE_DB_CATTLE_MYSQL_PORT=3306 \
  -e CATTLE_DB_CATTLE_MYSQL_NAME=rancher \
  -e CATTLE_DB_CATTLE_USERNAME=rancher \
  -e CATTLE_DB_CATTLE_PASSWORD=[password] \
  -v [path to cert] \
  rancher/server:latest

I am also proxying the rancher connections over ssl via nginx on the rancher server host.

Can you check the logs for the rancher-server container to see if anything sticks out? Also, can you check the rancher-agent logs?
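A minimal sketch of that log check, in case it helps. It assumes the server container is named rancher_server (as in the docker run command above) and that the agent container on each host is named rancher-agent; adjust both names if yours differ. The keyword list and the 500-line tail are arbitrary starting points, not an official filter.

```shell
#!/bin/sh
# Keep only log lines that usually deserve a closer look.
# The keyword list is a guess; extend it as needed.
suspicious_lines() {
    grep -iE 'error|exception|timeout|disconnect' || true
}

# On the Rancher server host (container name taken from the docker run command).
docker logs --tail 500 rancher_server 2>&1 | suspicious_lines

# On each agent host (container name is an assumption).
docker logs --tail 500 rancher-agent 2>&1 | suspicious_lines
```

If nothing matches, reading the unfiltered tail around the time of the disconnect is probably more useful than widening the filter.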

Were you doing a lot of activity on your hosts through a script? There is a known issue where a lot of activity on a host (probably at large scale, in the hundreds) causes Docker to hang on that host, which in turn causes reconnecting. We are investigating that issue.

Not sure if that’s relevant to your setup or not, but wanted to throw it out there.

I’m seeing this same problem with a Rancher server running on VMware. I have not seen it with the server instance running on OpenStack.

I had two servers in this condition running Ubuntu 14, Docker 1.8.2. Just to be sure I was starting clean, I removed the hosts from the UI, then got on the servers and removed all images and containers, even the exited ones. I even gave the Docker engine a restart for good measure.

I copied the docker command from the add host “Other” page, and started the container. One server showed up in the UI quickly, but on the other, rancher-agent is spewing something like the following, and the host never registers in the UI.

INFO: Port:
INFO: Required Image: rancher/agent:v0.8.2
INFO: Current Image:rancher/agent:v0.8.2
INFO: Using image rancher/agent:v0.8.2
INFO: Downloading agent http://rancher.myzone.com:8080/v1/configcontent/configscripts
{"id":"6887ce28-88de-4083-91f8-xx191cc077b7","type":"error","links":{},"actions":{},"status":401,"code":"Unauthorized","message":"Unauthorized","detail":null}
gzip: stdin: not in gzip format
tar: Child returned status 1
tar: Error is not recoverable: exiting now

Is there something more I can cleanup or look at?
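Reading the log above, the gzip and tar errors look like a side effect rather than a separate problem: the download returned a JSON 401 body instead of an archive, so the extraction step had nothing valid to unpack. A rough sketch of how to tell the two cases apart for any saved response follows. The helper names is_gzip and describe_download are made up for illustration, and the demo uses a shortened stand-in for the error body, not real server output.

```shell
#!/bin/sh
# is_gzip FILE: succeeds if FILE starts with the gzip magic bytes 1f 8b.
is_gzip() {
    magic=$(head -c 2 "$1" | od -An -tx1 | tr -d ' \n')
    [ "$magic" = "1f8b" ]
}

# describe_download FILE: say whether FILE looks like an archive or an error body.
describe_download() {
    if is_gzip "$1"; then
        echo "gzip archive"
    else
        echo "not gzip, first bytes: $(head -c 80 "$1")"
    fi
}

# Demo with a stand-in error body (hypothetical, shortened).
tmp=$(mktemp)
printf '{"type":"error","status":401}' > "$tmp"
describe_download "$tmp"
rm -f "$tmp"
```

If the saved response turns out to be an error body, the thing to chase is the 401 itself, not gzip or tar.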

I second that. I’m testing 2 worker instances on EC2. Overnight both hosts get disconnected. They reconnect without problems if they are rebooted from the EC2 console. I also noticed that I can no longer SSH into my workers, though port 22 is still open. Furthermore, both instances show a steady 15% CPU utilization although there are no real jobs scheduled, just two idle containers (nginx and base debian) running.

Rancher	v1.4.1
Cattle	v0.176.9
User Interface	v1.4.6
Rancher CLI	v0.4.1
Rancher Compose	v0.12.2

I’m having similar issues. Can anyone help?
Thanks

The issue is probably related to Docker. Check whether “docker ps” responds or hangs. I had the same issue when I tried to pull too many images at the same time with the Docker client.
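One way to run that check without tying up a terminal is to put a time limit on the command. This is only a sketch: it assumes GNU timeout from coreutils is available on the host, the function name responds_within is made up for illustration, and the 10-second limit is an arbitrary choice.

```shell
#!/bin/sh
# responds_within SECONDS COMMAND [ARGS...]
# Prints "responsive", "hung", or "failed" depending on how COMMAND exits.
responds_within() {
    limit="$1"
    shift
    status=0
    # GNU timeout exits with 124 when it has to kill the command.
    timeout "$limit" "$@" >/dev/null 2>&1 || status=$?
    case "$status" in
        0)   echo "responsive" ;;
        124) echo "hung" ;;
        *)   echo "failed" ;;
    esac
}

# The 10-second limit is an arbitrary choice for illustration.
echo "docker ps: $(responds_within 10 docker ps)"
```

A result of "hung" on a disconnected host would point at the Docker daemon rather than at Rancher; "failed" means the command exited with an error on its own and needs a look at its actual output.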