Custom host won't connect, and once it does it keeps disconnecting


#1

Hello,

I have a few virtual machines running, and I would like a (simplified) setup with one host (10.100.10.1) for Rancher and one host (10.100.10.4) for running containers. I have installed Rancher server 1.6.25 on the management machine and Docker CE 18.06.1~ce~3-0~ubuntu on both machines. Both run Ubuntu 18.04 LTS.

On the management machine, nginx is running with the following setup: https://pastebin.com/KgCxQdfH, so it forwards port 80 traffic to 8080. Rancher was started with sudo docker run -d -v <host_vol>:/var/lib/mysql --restart=unless-stopped -p 8080:8080 rancher/server. I have also run sudo ufw allow 500/udp and sudo ufw allow 4500/udp on both machines, and I had to follow the “specify-dns-servers-for-docker” step in the Docker installation guide (I can’t have more than two links in a post), because I was getting an error without it.

The problem is that I have trouble registering a host, and even after it manages to connect, Rancher struggles to keep the connection active. When I register an agent, it first logs this:

time="2018-12-17T13:23:28Z" level=info msg="Host not registered yet. Sleeping 1 second and trying again. reportedUuid=a0ca6f30-a804-4227-5532-8c2692673e56 Attempt=12"
time="2018-12-17T13:23:29Z" level=info msg="Host not registered yet. Sleeping 1 second and trying again. reportedUuid=a0ca6f30-a804-4227-5532-8c2692673e56 Attempt=13"
time="2018-12-17T13:23:30Z" level=info msg="Host not registered yet. Sleeping 1 second and trying again. reportedUuid=a0ca6f30-a804-4227-5532-8c2692673e56 Attempt=14"

time="2018-12-17T12:28:57Z" level=error msg="Failed to get connection token for host-api startup: Reached max retry attempts for getting token"

Then after a while it connects:

time="2018-12-17T13:23:31Z" level=info msg="Connecting to proxy." url="ws://10.100.10.1/v1/connectbackend?token=token"

This takes longer than I’m used to, and a few times it failed completely: I started getting 401 responses (maybe the token expired?) from 10.100.10.1. But even after I managed to get it connected, the host keeps going into the Disconnected => Reconnecting state in the UI. Meanwhile, the rancher-server logs show the following:

2018-12-17 13:24:06,050 ERROR [3a6531c0-b638-4494-bcad-2ee79553901e:3725] [instance:111] [instance.start->(InstanceStart)] [] [ecutorService-4] [i.c.p.process.instance.InstanceStart] Failed [Dependencies readiness error instance is not running] for instance [111]
2018-12-17 13:24:07,047 ERROR [7c3e0b91-7037-4df2-96bd-634aba7eca39:3732] [instance:112] [instance.start->(InstanceStart)] [] [ecutorService-3] [i.c.p.process.instance.InstanceStart] Failed [Dependencies readiness error instance is not running] for instance [112]
2018-12-17 13:24:07,048 ERROR [c995c17c-6e33-4308-b0e3-f4ded72ca0dc:3736] [instance:113] [instance.start->(InstanceStart)] [] [ecutorService-5] [i.c.p.process.instance.InstanceStart] Failed [Dependencies readiness error instance is not running] for instance [113]
2018-12-17 13:24:11,644 ERROR [:] [] [] [] [TaskScheduler-1] [i.c.p.a.s.ping.impl.PingMonitorImpl ] Failed to get ping from agent [43] count [3]
2018-12-17 13:24:16,645 ERROR [:] [] [] [] [TaskScheduler-1] [i.c.p.a.s.ping.impl.PingMonitorImpl ] Failed to get ping from agent [43] count [4]
2018-12-17 13:24:21,645 ERROR [:] [] [] [] [TaskScheduler-1] [i.c.p.a.s.ping.impl.PingMonitorImpl ] Failed to get ping from agent [43] count [5]
2018-12-17 13:24:26,646 ERROR [:] [] [] [] [TaskScheduler-1] [i.c.p.a.s.ping.impl.PingMonitorImpl ] Failed to get ping from agent [43] count [6]
2018-12-17 13:24:26,648 ERROR [:] [] [] [] [TaskScheduler-1] [i.c.p.a.s.ping.impl.PingMonitorImpl ] Scheduling reconnect for agent [43] host [8] count [6]
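
To see how often the ping monitor trips, the lines above could be tallied per agent with something like the following. This is only a rough sketch: count_ping_failures is a made-up helper name, and it assumes the rancher-server log format shown above.

```shell
# Hypothetical helper: tally "Failed to get ping from agent" lines per agent ID.
# Reads rancher-server log text on stdin; assumes the log format shown above.
count_ping_failures() {
  grep 'Failed to get ping from agent' \
    | sed -E 's/.*from agent \[([0-9]+)\].*/\1/' \
    | sort \
    | uniq -c \
    | awk '{ print "agent " $2 ": " $1 " failed pings" }'
}

# Usage (the container name is a placeholder):
#   sudo docker logs <rancher-server-container> 2>&1 | count_ping_failures
```

It only counts the "Failed to get ping" lines, so the "Scheduling reconnect" lines are ignored.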

So the virtual machine I added keeps disconnecting. This is a simplified explanation of the issue, and I can provide more information if required, but what exactly could be wrong in the setup, given the following observations:

A) Trouble registering the host with Rancher. It can fail for so long that it starts returning 401: Failed to get rancher client for host-api startup: Bad response statusCode [401]. Status [401 Unauthorized]. Body: [code=Unauthorized, baseType=error, message=Unauthorized] from [http://10.100.10.1/v1]
B) Trouble keeping the host in the Active state once it is registered: it goes to Disconnected/Reconnecting all the time, with Active popping up once in a while.
C) If I ping, curl, etc. between the hosts, the traffic seems to go through fine.

Thanks for any clues on how to investigate this.


#2

I have been trying a few things:

  • A UDP 500/4500 test from one host to the other works fine
  • Switching to Rancher 1.6.17 and Docker 18.03 CE didn’t help
  • Disabling nginx and exposing Rancher directly on port 80 (-p 80:8080) didn’t help
  • Disabling authentication in Rancher made no difference
  • Even when I used the machine running rancher-server itself as a host, I still get the same issue: “Host not registered yet.” either ultimately fails, or succeeds only for the rancher-server logs to say “Failed to get ping from agent” shortly afterwards and the Rancher UI to show “Reconnecting”.