(I tried to find a similar error in the forum, but the existing threads don't seem to describe the same issue.)
The Rancher network agent keeps restarting, and services being deployed get stuck on starting networking.
The rancher-agent log shows the following, and I have no idea how to troubleshoot it further:
3/16/2016 2:29:25 PM  INFO: Downloading agent http://rancher-server.production.subnet:8000/v1/configcontent/configscripts
3/16/2016 2:29:56 PM  The system is going down NOW!
3/16/2016 2:29:56 PM  Sent SIGTERM to all processes
3/16/2016 2:29:57 PM  Sent SIGKILL to all processes
3/16/2016 2:29:57 PM  Requesting system reboot
3/16/2016 2:30:10 PM  INFO: Downloading agent http://rancher-server.production.subnet:8000/v1/configcontent/configscripts
3/16/2016 2:30:41 PM  The system is going down NOW!
3/16/2016 2:30:41 PM  Sent SIGTERM to all processes
3/16/2016 2:30:42 PM  Sent SIGKILL to all processes
3/16/2016 2:30:42 PM  Requesting system reboot
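For what it's worth, basic HTTP reachability from the test-subnet hosts to the URL in the log seems fine; I checked with something like the following (the hostname is from my setup, and I'm not sure what status code to expect without agent credentials, but a timeout or connection refused here would have pointed at the network path):

curl -v -o /dev/null \
  http://rancher-server.production.subnet:8000/v1/configcontent/configscripts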
Setup:
Production subnet (this is also where the Rancher server is located, on an "admin server"):
Server 0 - Rancher Server 0.63.0
Servers 1-4 - app containers, all connected successfully to the Rancher server, environment Production
Test subnet:
Servers 1-4 - all connected successfully to the Rancher server on the prod subnet, environment Test
In the production subnet everything is fine, but when I try to start a service in test, the network agent keeps terminating and restarting. Initially it was a firewall problem, but the firewalls are now open for UDP 500 and 4500 (and toward the prod-subnet admin server on port 8000).
I am guessing it is a cross-network issue, since it works for production, which is on the same subnet as the rancher-server, but it doesn't work for the hosts in the test subnet.
I just deleted all Rancher containers and volumes and reinstalled and reconfigured everything from scratch, but with the same result: the four hosts in production work fine and I can deploy a MongoDB stack (works like a charm), but on the hosts in the test subnet everything gets stuck because the networking agent restarts over and over.
The servers are all on the same Docker and RHEL versions:
Prod subnet (where the rancher-server also is): 10.143.182.0/24
Test subnet: 10.243.4.0/24
OS: RHEL 7.2 (kernel 3.10)
Docker: 1.10.2
Rancher: 0.63.1
Thanks, but I don't think so, since I set CATTLE_AGENT_IP explicitly to an IP address (replacing the real hostname with rancher-server.prod-subnet in the example below):

sudo docker run -d --privileged \
  --add-host="rancher-server.prod-subnet:10.143.182.165" \
  -e CATTLE_AGENT_IP=10.243.4.5 \
  -v /var/run/docker.sock:/var/run/docker.sock \
  -v /var/lib/rancher:/var/lib/rancher \
  rancher/agent:v0.10.0 \
  http://rancher-server.prod-subnet:8000/v1/scripts/69BA6E3D141E5C62528C:1458147600000:7v0NFKBQEKBrTHmVIwwuGpcBM
One more thing that might be obvious: do UDP 500 and 4500 also need to be open between the hosts and the rancher-server?
That is not the case today: the test-subnet hosts can only reach the rancher-server on the prod subnet on TCP 8000, and even pings (ICMP) are blocked.
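For reference, if those ports do turn out to be needed toward the server as well, opening them directly on a RHEL 7 host would look roughly like this (assuming firewalld is the active firewall on the host):

# Hedged example: allow the IPsec ports used by Rancher's overlay network
# on a RHEL 7 host running firewalld.
sudo firewall-cmd --permanent --add-port=500/udp
sudo firewall-cmd --permanent --add-port=4500/udp
sudo firewall-cmd --reload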
I guess the obvious workaround at the moment is to deploy a rancher-server in the test subnet. I would really like to have just one on our admin server, but I guess this is enough to get started with the rest, since I really like what Rancher tries to solve.
I am experiencing network-agent restarts as well on a completely clean installation of Rancher on three machines. (The first runs the server and acts as a host.)
I have followed the instructions here to proxy Rancher behind Nginx (with SSL). I also made sure that “Host Registration URL” was set properly (to include the FQDN of the server and “https://” prefix). The hosts connect without issue and everything seems fine. However, as soon as I launch the GlusterFS stack, I run into problems:
The network agent starts on one of the hosts and immediately gets stuck in a loop:
INFO: Downloading agent https://[domain]/v1/configcontent/configscripts
The system is going down NOW!
Sent SIGTERM to all processes
Sent SIGKILL to all processes
Requesting system reboot
This continues indefinitely. I am using the latest version of everything:
Ubuntu 14.04.4 (amd64)
Docker 1.10.3
Rancher 1.0.0
I did set up GitHub authentication. None of the servers have a firewall enabled. Let me know if you need any further information.
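In case it matters, the WebSocket-related part of my Nginx server block looks roughly like the following (the domain, upstream port, and certificate paths are placeholders; the proxy_http_version and Upgrade/Connection headers are the directives the Rancher proxying docs call out):

upstream rancher {
    server 127.0.0.1:8080;    # placeholder: wherever rancher/server listens
}

server {
    listen 443 ssl;
    server_name rancher.example.com;              # placeholder domain
    ssl_certificate     /etc/nginx/ssl/cert.pem;  # placeholder path
    ssl_certificate_key /etc/nginx/ssl/key.pem;   # placeholder path

    location / {
        proxy_set_header Host $host;
        proxy_set_header X-Forwarded-Proto $scheme;
        # WebSocket support, needed for the server/agent connection
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
        proxy_pass http://rancher;
        proxy_read_timeout 900s;
    }
}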
Seeing the same problem after updating Proxmox on the hosting machine (the Rancher stuff runs inside CoreOS guests under Proxmox). Also, yesterday I changed the DNS server in my network (the one resolving forward/reverse queries). Hope this may help us deduce the cause.
To make a long story short: the Rancher daemons really want DNS to be working well and flawlessly. Now, the walkthrough.
Make sure that the DNS you are using on your Rancher server's host is working properly and resolves all the client hosts (in my case it had some unnecessary host/IP access rules). Also turn on forwarding of recursive requests: your Rancher cluster should be able to resolve everything it needs, whether inside your (virtual) LAN or outside. If you don't administer your own DNS, then check/choose/use a good, tested one. Your Rancher server and client hosts (those running Docker, of course) should all be properly set up to use this good DNS. (A quick way to verify resolution is sketched after these steps.)
Recreate your Rancher server container (use Rancher docs to see how to save your server’s data).
Recreate rancher-agent containers (in my case I removed them from the hosts and re-added the hosts).
Remove agent state and instance containers; they should be recreated automatically.
After these steps I saw everything working.
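To make the DNS check in step 1 concrete, this is roughly what I ran from the server host and from every Docker host (the names and the IP are placeholders for my own):

# Forward lookups for every host in the setup should resolve...
dig +short rancher-server.example.internal
dig +short docker-host-1.example.internal
# ...reverse lookups should resolve too...
dig +short -x 10.0.0.11
# ...and recursive forwarding should let external names resolve as well.
dig +short rancher.com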
NB: I believe the whole problem is due to a 401/Unauthorized received by an agent instance from the server. I don't know exactly what goes wrong there, but it was fixed by the proper DNS setup.