Custom host won't connect and once it does it goes disconnected all the time

mpartan · December 17, 2018, 1:38pm

Hello,

I have a few virtual machines running and want a simplified setup with one host (10.100.10.1) for Rancher and one host (10.100.10.4) for running containers. I have installed Rancher server 1.6.25 on the management machine and Docker CE 18.06.1~ce~3-0~ubuntu on both machines. Both run Ubuntu 18.04 LTS.

On the management machine I have nginx running with the following setup: https://pastebin.com/KgCxQdfH, so it forwards traffic on port 80 to 8080. Rancher was started with sudo docker run -d -v <host_vol>:/var/lib/mysql --restart=unless-stopped -p 8080:8080 rancher/server. I have also run sudo ufw allow 500/udp and sudo ufw allow 4500/udp on both machines, and had to follow the “specify-dns-servers-for-docker” step in the Docker installation guide (I can’t have more than two links in a post), as Docker was giving an error without it.

The problem is that I have trouble registering the host, and even after it manages to connect, Rancher struggles to keep the connection active. When I register an agent, at first it logs this:

time="2018-12-17T13:23:28Z" level=info msg="Host not registered yet. Sleeping 1 second and trying again. reportedUuid=a0ca6f30-a804-4227-5532-8c2692673e56 Attempt=12"
time="2018-12-17T13:23:29Z" level=info msg="Host not registered yet. Sleeping 1 second and trying again. reportedUuid=a0ca6f30-a804-4227-5532-8c2692673e56 Attempt=13"
time="2018-12-17T13:23:30Z" level=info msg="Host not registered yet. Sleeping 1 second and trying again. reportedUuid=a0ca6f30-a804-4227-5532-8c2692673e56 Attempt=14"
…
time="2018-12-17T12:28:57Z" level=error msg="Failed to get connection token for host-api startup: Reached max retry attempts for getting token"

Then after a while it connects:

time="2018-12-17T13:23:31Z" level=info msg="Connecting to proxy." url="ws://10.100.10.1/v1/connectbackend?token=token"

This takes longer than I’m used to, and a few times it failed completely, at which point I started getting 401 messages (maybe the token expired?) from 10.100.10.1. But even after I managed to get it connected, the host keeps cycling between the Disconnected and Reconnecting states in the UI. In the rancher-server logs I’m getting the following:

2018-12-17 13:24:06,050 ERROR [3a6531c0-b638-4494-bcad-2ee79553901e:3725] [instance:111] [instance.start->(InstanceStart)] [] [ecutorService-4] [i.c.p.process.instance.InstanceStart] Failed [Dependencies readiness error instance is not running] for instance [111]
2018-12-17 13:24:07,047 ERROR [7c3e0b91-7037-4df2-96bd-634aba7eca39:3732] [instance:112] [instance.start->(InstanceStart)] [] [ecutorService-3] [i.c.p.process.instance.InstanceStart] Failed [Dependencies readiness error instance is not running] for instance [112]
2018-12-17 13:24:07,048 ERROR [c995c17c-6e33-4308-b0e3-f4ded72ca0dc:3736] [instance:113] [instance.start->(InstanceStart)] [] [ecutorService-5] [i.c.p.process.instance.InstanceStart] Failed [Dependencies readiness error instance is not running] for instance [113]
2018-12-17 13:24:11,644 ERROR [:] [] [] [] [TaskScheduler-1] [i.c.p.a.s.ping.impl.PingMonitorImpl ] Failed to get ping from agent [43] count [3]
2018-12-17 13:24:16,645 ERROR [:] [] [] [] [TaskScheduler-1] [i.c.p.a.s.ping.impl.PingMonitorImpl ] Failed to get ping from agent [43] count [4]
2018-12-17 13:24:21,645 ERROR [:] [] [] [] [TaskScheduler-1] [i.c.p.a.s.ping.impl.PingMonitorImpl ] Failed to get ping from agent [43] count [5]
2018-12-17 13:24:26,646 ERROR [:] [] [] [] [TaskScheduler-1] [i.c.p.a.s.ping.impl.PingMonitorImpl ] Failed to get ping from agent [43] count [6]
2018-12-17 13:24:26,648 ERROR [:] [] [] [] [TaskScheduler-1] [i.c.p.a.s.ping.impl.PingMonitorImpl ] Scheduling reconnect for agent [43] host [8] count [6]
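
To see how often this happens over a longer stretch of the rancher-server log, the PingMonitorImpl lines above could be tallied per agent. This is only a rough sketch: the regular expressions are written against the exact line format quoted in this post, and the function name summarize and the result keys are made up for illustration.

```python
import re

# Patterns based only on the log lines quoted in this post (an assumption:
# other Rancher versions may word or lay out these lines differently).
PING_FAIL = re.compile(
    r"^(\S+ \S+) ERROR .*Failed to get ping from agent \[(\d+)\] count \[(\d+)\]"
)
RECONNECT = re.compile(
    r"^(\S+ \S+) ERROR .*Scheduling reconnect for agent \[(\d+)\]"
)


def summarize(lines):
    """Return, per agent id, the highest missed-ping count, the number of
    scheduled reconnects, and the first/last timestamp seen."""
    agents = {}
    for line in lines:
        line = line.strip()
        fail = PING_FAIL.match(line)
        reconnect = RECONNECT.match(line)
        hit = fail or reconnect
        if not hit:
            continue  # not a ping-monitor line, ignore it
        timestamp, agent_id = hit.group(1), hit.group(2)
        entry = agents.setdefault(
            agent_id,
            {
                "max_missed": 0,
                "reconnects": 0,
                "first_seen": timestamp,
                "last_seen": timestamp,
            },
        )
        entry["last_seen"] = timestamp
        if fail:
            entry["max_missed"] = max(entry["max_missed"], int(fail.group(3)))
        else:
            entry["reconnects"] += 1
    return agents
```

Saving the server log to a file and calling summarize(open(path)) on it gives one entry per agent, which makes it easier to tell whether the reconnects come at a steady interval or in bursts.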

So the virtual machine I added keeps getting disconnected. This is a simplified explanation of the issue, and I can provide more information if required, but what could be wrong in the setup given the following observations:

A) Trouble registering the host with Rancher. It can keep failing for so long that it starts returning 401: Failed to get rancher client for host-api startup: Bad response statusCode [401]. Status [401 Unauthorized]. Body: [code=Unauthorized, baseType=error, message=Unauthorized] from [http://10.100.10.1/v1]
B) Trouble keeping the host in the Active state once it is registered: it goes to Disconnected/Reconnecting all the time, with Active showing up only once in a while.
C) If I ping, curl etc. between the hosts, the traffic seems to go through fine.

Thanks for any clues on how to investigate this.

mpartan · December 18, 2018, 1:30pm

I’ve been trying a few things:

UDP 500/4500 test goes fine from one host to the other
Switching to Rancher 1.6.17 and Docker 18.03 CE didn’t help
Disabling nginx and exposing Rancher directly on port 80 (-p 80:8080) didn’t help
Disabling authentication in Rancher made no difference
Even when I registered the machine running rancher-server itself as a host, I still get the same issue: “Host not registered yet.” either ultimately fails or succeeds, only for the rancher-server logs to show “Failed to get ping from agent” shortly afterwards and the Rancher UI to show “Reconnecting”.
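
For anyone wanting to repeat the UDP 500/4500 check from the list above, a minimal sketch of one way to do it follows. It is not necessarily how the test was run here; the function names udp_echo_once and udp_probe and the probe payload are made up for illustration. Binding ports below 1024 normally needs root, and the port has to be free on the listening host, so this would have to run while nothing else is bound to it.

```python
import socket


def udp_echo_once(bind_addr, port, timeout=5.0):
    """Wait for a single datagram on (bind_addr, port) and send it back.

    Returns True if a datagram arrived and was echoed, False on timeout.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.bind((bind_addr, port))
        try:
            data, peer = sock.recvfrom(1024)
        except socket.timeout:
            return False
        sock.sendto(data, peer)
        return True


def udp_probe(host, port, payload=b"udp-probe", timeout=2.0):
    """Send one datagram to (host, port) and wait for the same bytes back.

    Returns True only if the echoed payload matches. A timeout or an ICMP
    "port unreachable" (surfacing as a connection error) counts as failure.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(payload, (host, port))
            data, _ = sock.recvfrom(1024)
        except (socket.timeout, ConnectionError):
            return False
        return data == payload
```

With the addresses from this thread, that would mean running udp_echo_once("0.0.0.0", 500) on 10.100.10.4 and then udp_probe("10.100.10.4", 500) on 10.100.10.1, then repeating for port 4500 and in the other direction. A True result only shows that a single datagram made the round trip at that moment; it says nothing about sustained traffic.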
