Rancher host behind a roadwarrior-type VPN

Hello,

I have a setup where I run a number of Rancher hosts, including the Rancher server, in a private network, plus one host that connects to this private network over a roadwarrior-type (RW) OpenVPN bridge.
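For reference, the RW-host's OpenVPN client config looks roughly like this (the endpoint and file paths below are placeholders, not my actual values):

client
dev tap0
proto udp
remote vpn.example.com 1194   # placeholder: the internal network's VPN gateway
resolv-retry infinite
nobind
persist-key
persist-tun
ca /etc/openvpn/ca.crt
cert /etc/openvpn/rw-host.crt
key /etc/openvpn/rw-host.key
verb 3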

Installing the agent on the internal hosts completes successfully, but when I registered the RW-host (using its IP address on the internal network, which it also sees on its tap0 interface), the ipsec, network-services, and healthcheck containers all come up red in the Rancher server UI.
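For what it's worth, I registered the RW-host with the standard command from the UI, pinning the agent to the tap0 address via CATTLE_AGENT_IP (the IP, agent version, and token below are placeholders):

sudo docker run --rm --privileged \
  -e CATTLE_AGENT_IP=10.0.8.10 \
  -v /var/run/docker.sock:/var/run/docker.sock \
  -v /var/lib/rancher:/var/lib/rancher \
  rancher/agent:v1.2.6 \
  http://rancher.private:49002/v1/scripts/<registration-token>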

Specifically, I see the following in the rancher/agent log:
time="2017-10-03T15:19:57Z" level=info msg="rancher id [44]: Container with docker id [7c810a44afb12153412f21556724a93f78b4a0c35208af7ec206d5566d59dbed] has been started"
time="2017-10-03T15:19:57Z" level=info msg="Reply: d1974989-cdc7-4c7c-9775-651b1cf44f83, compute.instance.activate, 1ihm44:instanceHostMap"
time="2017-10-03T15:19:58Z" level=info msg="Received event: Name: storage.volume.activate, Event Id: 9f88abf3-6158-4a26-a65e-6366f3e81699, Resource Id: 1vspm100"
time="2017-10-03T15:19:58Z" level=info msg="Reply: 9f88abf3-6158-4a26-a65e-6366f3e81699, storage.volume.activate, 1vspm100:volumeStoragePoolMap"
time="2017-10-03T15:19:58Z" level=info msg="Received event: Name: compute.instance.activate, Event Id: 4df6e0e9-29e4-4c7b-bf6a-ed9e9d1bf2fe, Resource Id: 1ihm45"
time="2017-10-03T15:19:58Z" level=error msg="Error processing event" err="Error response from daemon: cannot join network of a non running container: 7c810a44afb12153412f21556724a93f78b4a0c35208af7ec206d5566d59dbed" eventId=4df6e0e9-29e4-4c7b-bf6a-ed9e9d1bf2fe eventName=compute.instance.activate resourceId=1ihm45
time="2017-10-03T15:19:59Z" level=info msg="Received event: Name: compute.instance.activate, Event Id: 24960339-0f18-4448-a7f8-9983021e4476, Resource Id: 1ihm45"
time="2017-10-03T15:19:59Z" level=error msg="Error processing event" err="Error response from daemon: cannot join network of a non running container: 7c810a44afb12153412f21556724a93f78b4a0c35208af7ec206d5566d59dbed" eventId=24960339-0f18-4448-a7f8-9983021e4476 eventName=compute.instance.activate resourceId=1ihm45

Meanwhile, the logs of both rancher/healthcheck and rancher/net are empty.

In the rancher/network-manager log I have:
time="2017-10-03T15:25:42Z" level=info msg="routesync: starting monitoring on bridge: docker0, for metadataIP: 169.254.169.250 every 60 seconds"
time="2017-10-03T15:25:42Z" level=info msg="Waiting for metadata"
Creating metadata client: Get http://169.254.169.250/2016-07-29/version: dial tcp 169.254.169.250:80: getsockopt: no route to host

This container restarts constantly.

The RW-host does have a route for 169.254.169.250 via the docker0 interface, and AFAICT the iptables rules look OK too (comparing them to the rules on the internal hosts). The RW-host also has a route for the internal network (over the tap0 interface) and is able to reach the Rancher server.
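These are the checks I ran on the RW-host (the CATTLE grep is just my way of narrowing the iptables output to Rancher's chains):

ip route get 169.254.169.250    # should resolve via docker0
ip route show                   # internal network should go over tap0
iptables-save | grep CATTLE     # Rancher's chains, compared against an internal host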

In the network-services-metadata log I have the following:

time="2017-10-03T15:44:42Z" level=info msg="Starting rancher-metadata v0.9.4"
time="2017-10-03T15:44:42Z" level=info msg="Subscribing to events"
time="2017-10-03T15:44:42Z" level=fatal msg="Failed to subscribeGet http://rancher.private:49002/v2-beta: dial tcp: lookup rancher.private on 127.0.0.1:53: read udp 127.0.0.1:49022->127.0.0.1:53: read: connection refused"

Since DNS is set up on the RW-host so that it can resolve the hostnames of the internal network where the rest of the Docker hosts are, including the Rancher server, I guess this 127.0.0.1 refers to the container's localhost?
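One way to check which resolver the metadata container actually uses (assuming the image ships cat; the container ID comes from docker ps):

docker ps | grep metadata                        # find the metadata container's ID
docker exec <container-id> cat /etc/resolv.conf  # nameserver as seen inside the container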

My goal with this setup is to run dev and staging containers on the hosts located in the internal network, and then push the containers to production, i.e. the RW-host.

Is this kind of setup even feasible, and if so, where should I start investigating? I'm pretty new to Rancher (and containers), so any advice is welcome.

General background:
Rancher v1.6.10
Host OS: CentOS 7.4, updated to latest
Docker 17.09.0-ce

Poltsi

Replying to myself with the solution.

The issue was that I had configured the RW-host as a slave DNS server for the internal network (over the VPN). When I switched the RW-host's resolv.conf over to use only the internal network's DNS server as nameserver, the metadata container was able to find its address and all the other failing containers came up properly.
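In resolv.conf terms the fix amounted to this (10.0.0.2 is a placeholder for the internal DNS server's address):

# /etc/resolv.conf on the RW-host, before: the local slave DNS
nameserver 127.0.0.1

# after: only the internal network's DNS server, reachable over tap0
nameserver 10.0.0.2

As far as I understand it, the containers inherit the host's resolv.conf, and 127.0.0.1 inside a container points at the container itself, where nothing listens on port 53, which explains the "connection refused" in the metadata log.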

Poltsi