Rancher network-agent restarting over and over

(I tried to find a similar error in the forum, but none of them seem to be the same.)

The Rancher network agent keeps restarting, and services being deployed get stuck on "starting networking".

The rancher-agent logs show the following, and I have no idea how to troubleshoot this further:

3/16/2016 2:29:25 PM INFO: Downloading agent http://rancher-server.production.subnet:8000/v1/configcontent/configscripts
3/16/2016 2:29:56 PM The system is going down NOW!
3/16/2016 2:29:56 PM Sent SIGTERM to all processes
3/16/2016 2:29:57 PM Sent SIGKILL to all processes
3/16/2016 2:29:57 PM Requesting system reboot
3/16/2016 2:30:10 PM INFO: Downloading agent http://rancher-server.production.subnet:8000/v1/configcontent/configscripts
3/16/2016 2:30:41 PM The system is going down NOW!
3/16/2016 2:30:41 PM Sent SIGTERM to all processes
3/16/2016 2:30:42 PM Sent SIGKILL to all processes
3/16/2016 2:30:42 PM Requesting system reboot

Setup:
Production subnet (this is also where the Rancher server runs, on an "admin server"):
Server 0: Rancher Server 0.63.0
Servers 1-4: app hosts, all successfully connected to the Rancher server's Production environment.

Test subnet:
Servers 1-4, all successfully connected to the Rancher server in the prod subnet, registered in the Test environment.

In the production subnet everything is fine, but when I try to start a service in Test, the network agent keeps terminating and restarting. Initially this was a firewall problem, but the firewalls are now open for UDP 500 and 4500 (and to the prod-subnet admin server on port 8000).

Can you provide the OS and Docker versions?

Are any of your hosts using 10.42.x.x? If so, there would be a conflict with Rancher's managed overlay network.
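
A quick way to check on each host might be something like this (a sketch; the grep pattern just looks for any interface address inside Rancher's default 10.42.0.0/16 range):

# flag any interface address inside 10.42.0.0/16
ip -4 addr show | grep 'inet 10\.42\.'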

I am guessing it is a cross-network issue, since it works for production, which is on the same subnet as the rancher-server, but not for the hosts in the test subnet.

I just deleted all Rancher containers and volumes and reinstalled and reconfigured everything from scratch, but with the same result: the four hosts in production work fine, and I can deploy a MongoDB stack (works like a charm), but on the hosts in the test subnet everything gets stuck because the networking agent restarts over and over.
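
For reference, the teardown on each host was roughly this (a sketch; it assumes nothing else on the host has "rancher" in its name or image):

# remove every container whose image or name mentions rancher
sudo docker ps -a | grep rancher | awk '{print $1}' | xargs -r sudo docker rm -f

# remove the agent state kept on the host
sudo rm -rf /var/lib/rancher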

The servers are all on the same Docker and RHEL versions:
Prod subnet: (where also rancher server is) 10.143.182.0/24
Test subnet: 10.243.4.0/24
OS: RHEL 7.2 (Kernel 3.10)
Docker 1.10.2
Rancher: 0.63.1

Regarding networking requirements, is there anything else that needs to be opened in the firewalls besides the following?

On the subnets:

  • UDP 500 and 4500 to and from all hosts (verified using netcat; see the sketch after this list)
  • the hosts can ping each other on the IPs registered in the Rancher UI
  • TCP 8000 to the rancher-server host in the prod subnet (it runs on 8000 instead of 8080 because Jenkins had already nicked 8080)
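
The netcat verification was roughly like this (a sketch; flags vary slightly between netcat flavors, and the IPs are example hosts from the two subnets):

# on a receiving host: listen on UDP 500 (repeat for 4500)
nc -u -l 500

# on a sending host in the other subnet: send a test datagram
echo ping | nc -u 10.243.4.5 500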

I went looking in our issues and it seems like you are facing this one.

Thanks, but I don't think so, since I set CATTLE_AGENT_IP explicitly to an IP address (replacing the real hostname with rancher-server.prod-subnet in the example below):
sudo docker run -d --privileged \
  --add-host="rancher-server.prod-subnet:10.143.182.165" \
  -e CATTLE_AGENT_IP=10.243.4.5 \
  -v /var/run/docker.sock:/var/run/docker.sock \
  -v /var/lib/rancher:/var/lib/rancher \
  rancher/agent:v0.10.0 \
  http://rancher-server.prod-subnet:8000/v1/scripts/69BA6E3D141E5C62528C:1458147600000:7v0NFKBQEKBrTHmVIwwuGpcBM

One more thing that might be obvious: do UDP 500 and 4500 also need to be open between the hosts and the rancher-server?
That is not the case here; test-subnet hosts can only reach the rancher-server in the prod subnet on TCP 8000, and even pings (ICMP) are blocked.
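
For reference, reachability from a test-subnet host currently looks like this (a sketch; hostname and IP as above):

# the Rancher API port is reachable over TCP
curl -sI http://rancher-server.prod-subnet:8000/ | head -n 1

# ICMP toward the server is blocked, so this times out
ping -c 1 10.143.182.165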

I guess the obvious resolution at the moment is to deploy a rancher-server in the test subnet. I would really like to have just one on our admin server, but I guess this is sufficient to get started, since I really like what Rancher is trying to solve.

I started over from scratch today using Rancher v1.0.0-rc1, and it worked out of the box, including from the test subnet.

I am facing the same issue after I upgraded from 0.63.1 to 1.0.0.

I am using DigitalOcean, with hosts provisioned by Rancher.

I am experiencing network-agent restarts as well on a completely clean installation of Rancher on three machines. (The first runs the server and acts as a host.)

I have followed the instructions here to proxy Rancher behind Nginx (with SSL). I also made sure that “Host Registration URL” was set properly (to include the FQDN of the server and “https://” prefix). The hosts connect without issue and everything seems fine. However, as soon as I launch the GlusterFS stack, I run into problems:

The network agent starts on one of the hosts and immediately gets stuck in a loop:

INFO: Downloading agent https://[domain]/v1/configcontent/configscripts
The system is going down NOW!
Sent SIGTERM to all processes
Sent SIGKILL to all processes
Requesting system reboot

This continues indefinitely. I am using the latest version of everything:

  • Ubuntu 14.04.4 (amd64)
  • Docker 1.10.3
  • Rancher 1.0.0

I did set up GitHub authentication. None of the servers have a firewall enabled. Let me know if you need any further information.
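
In case it is relevant: since the agents talk to the server over WebSockets, this is roughly how I sanity-checked that Nginx forwards upgrade requests ([domain] is a placeholder as above, and the Sec-WebSocket-Key is a dummy value):

# expect "HTTP/1.1 101 Switching Protocols" (or a Rancher 401),
# not an Nginx error, if the upgrade headers are forwarded
curl -i -N \
  -H "Connection: Upgrade" \
  -H "Upgrade: websocket" \
  -H "Sec-WebSocket-Version: 13" \
  -H "Sec-WebSocket-Key: MDEyMzQ1Njc4OWFiY2RlZg==" \
  https://[domain]/v1/subscribe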

@nathan-osman Are you running IPv6? If so, Rancher does not support it, and it will cause network agents to restart. That was @nlhkh's issue.

@nlhkh’s issue:

https://github.com/rancher/rancher/issues/4237

Feature request for IPv6:

https://github.com/rancher/rancher/issues/1403
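
For anyone checking their own hosts, a quick sketch (disabling IPv6 this way is temporary; making it persistent depends on the distribution):

# any global "inet6" addresses mean IPv6 is active
ip -6 addr show

# disable IPv6 at runtime on all interfaces
sudo sysctl -w net.ipv6.conf.all.disable_ipv6=1
sudo sysctl -w net.ipv6.conf.default.disable_ipv6=1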

Good catch - IPv6 was enabled on the hosts and I was able to get everything working once I rebuilt the hosts without IPv6 enabled. Thanks!

Hello, I have the same issue.

INFO: Downloading agent http://rancherapp/v1/configcontent/configscripts
The system is going down NOW!
Sent SIGTERM to all processes
Sent SIGKILL to all processes
Requesting system reboot

Ubuntu 14.04.4 (amd64)
Docker 1.10.3
Rancher 1.0.0

I tried to curl this URL:

curl http://rancherapp/v1/configcontent/configscripts

{"id":"508fb68a-a54d-4ed7-8455-761cc9038966","type":"error","links":{},"actions":{},"status":401,"code":"Unauthorized","message":"Unauthorized","detail":null}

Seeing the same problem after updating Proxmox on the hosting machine (the Rancher stack runs inside CoreOS guests under Proxmox). Also, yesterday I changed the DNS server in my network (the one resolving forward/reverse queries). Hope this helps us deduce the cause :wink:

I've fixed this (at my homelab, at least).

To make a long story short: the Rancher daemons need DNS to work properly and flawlessly. Now, the walkthrough.

  1. Make sure that the DNS server used by your Rancher server's host works properly and resolves all the client hosts (in my case it had unnecessary host/IP access rules). Also turn on forwarding of recursive requests: your Rancher cluster should be able to resolve everything it needs, whether inside your (virtual) LAN or outside. If you don't administer your own DNS, pick a good, tested one. All Rancher server and client hosts (the ones running Docker) must be configured to use this DNS (see the sketch after this list).
  2. Recreate your Rancher server container (see the Rancher docs for how to preserve the server's data).
  3. Recreate the rancher-agent containers (in my case I removed them from the hosts and re-added the hosts).
  4. Remove the agent state and instance containers; they will be recreated automatically.
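
A minimal sketch of the checks and cleanup from steps 1, 3 and 4 (hostnames and IPs are examples, and the agent container names may differ by version, so check docker ps -a first):

# step 1: verify forward and reverse resolution from every host
dig +short rancher-server.example.lan
dig +short -x 192.168.1.10

# steps 3-4: remove the agent and state containers, then re-add the
# host using the registration command from the Rancher UI
sudo docker rm -f rancher-agent rancher-agent-state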

After these steps I saw everything working.

NB: I believe the root of the whole problem is a 401/Unauthorized response that the agent instance receives from the server. I don't know exactly what goes wrong there, but it was fixed by the proper DNS setup.

Having this problem, I deleted my Cattle environment and created a new one. Now it just works…