OpenVPN connection failing

Hi there,

Within my stack that I deploy via Rancher I have an OpenVPN service running within one of the containers. It runs on 1194/UDP.
When running this stack locally on my machine, the service runs normally and I can connect clients to it.
However, when I deploy it via Rancher, it fails to connect to clients properly.
The host is an AWS EC2 instance with both TCP and UDP ports open in the security group, which I’ve tested using both a manual deployment (docker-compose up) and a Rancher deployment on this instance.
The VPN subnet is set up as the 10.20.x.x range.

Are there any known issues with OpenVPN running on 1194 with Rancher agents? Or any other way I can debug this issue?

Kind regards,
Nathan

You mean clients fail to connect, right?

I assume you are exposing the port and you’ve confirmed this on the host (something like ss -lun should do it, plus you’ll get decent output from docker ps in the PORTS column).
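
For example, on the host itself (the container name below is just a placeholder):

  # confirm something is listening on 1194/udp on the host
  ss -lun | grep 1194
  # confirm Docker actually published the port (look at the PORTS column)
  docker ps
  docker port <container-name> 1194/udp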

Where are the clients located? Presumably ‘outside’ and thus you have some sort of public IP NATted to the host?

Rancher does not in any way use this port so it shouldn’t be a problem.

Yeah, clients are failing to connect.
I’m exposing 1194/udp to the host via docker-compose.

The Docker container logs only output the following:

TLS: Initial packet from [AF_INET]<removed>:45171, sid=e2e2627b fa68cce2
Mon Sep 26 05:19:55 2016 <removed>:57357 TLS Error: TLS key negotiation failed to occur within 60 seconds (check your network connectivity)
Mon Sep 26 05:19:55 2016 <removed>:57357 TLS Error: TLS handshake failed
Mon Sep 26 05:19:55 2016 <removed>:57357 SIGUSR1[soft,tls-error] received, client-instance restarting

It’s similar on the client end.
The AWS instance has an Elastic IP that the client is pointing at on 1194/udp. The client is just a simple device on my desk with no network restrictions that I’m aware of (it works without being Rancherised anyway, so the port can’t be blocked).

I found this link yesterday as well, which I thought might be interfering, but changing it to 11.42.x.x didn’t appear to fix anything either.

Any other suggestions?
Thanks heaps

So, it’s rather hard to make any judgement with so little information.

Have you actually confirmed that the port is exposed? It may be in your compose file but that proves nothing.

I assume you are using the ports directive and not expose in your compose file?

How do you know the OpenVPN service has even started correctly?

Have you set up the necessary TLS certificates?

Do you see those log messages only when you try to connect?

Is whatever is necessary in place to allow the service to proxy the traffic onward to wherever it needs to go?

Is your outbound routing set up correctly?
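
For a few of these, quick checks from inside the container might look like the following (tun0, /etc/openvpn and the container name are just typical OpenVPN/compose defaults, so adjust to your setup):

  # did OpenVPN start and bring up its tunnel interface?
  docker exec <container-name> ip addr show tun0
  # what does the routing table look like inside the container?
  docker exec <container-name> ip route
  # are the certs/keys the config references actually present?
  docker exec <container-name> ls -l /etc/openvpn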

Hey, thanks for the reply. Sorry for the vagueness. I’m not too familiar with OpenVPN; I just picked it up from a colleague.

The port is most certainly exposed. Running ss -lun shows it open and waiting, and docker ps shows it exposed.

The service is certainly running within the container, as the container itself becomes 10.20.0.1 and I can see the interface within it. The container becomes the ‘hq’ which all the clients connect to, allowing us to gain SSH access to them. This is the only way to access the devices, as they may not be publicly exposed otherwise (excluding the test device I have on my desk).

I’m not sure exactly what’s involved in the TLS certs (the entire process is hidden behind some scripts for me), but the same process is used when connecting to a non-Rancher-deployed container and it works normally.

I’m not sure what those last two questions entail, unfortunately. The clients all connect to the single point; it’s not forwarding anywhere else. It’s basically just used for SSH access. Does Rancher do anything that might change this routing?

Is there any way to trace the packet, or perhaps dump it to analyse? Not sure how useful it might be.

Thanks in advance!

OK, sorry but I’m still lost. You’re using a TLS based VPN to SSH into a docker container? That sounds all kinds of wrong but hey, it also sounds like a novel idea that might just work.

I think you’ve a basic architectural issue though. How exactly do you get from the container to the hosts? I’ve never given that much thought. Perhaps you need the container to be running host mode networking?

What image are you using? Perhaps I can check the debug options.

Can you post the compose details? For both the Rancher and non-rancher deployments. That might provide some clues.

I don’t consider packet capture useful (except for checking general connectivity) as everything will be encrypted.

To just check connectivity, you could run something like this: https://hub.docker.com/r/rucknar/sys-tools/ using host mode networking (and priv mode) and then use tcpdump: tcpdump -i any -vv -nn port 1194
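
Something along these lines should do it (assuming that image drops you into a shell; adjust if its entrypoint differs):

  docker run --rm -it --net=host --privileged rucknar/sys-tools
  # then, inside that container:
  tcpdump -i any -vv -nn port 1194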

Just wanted to jump in here and clarify (I’m helping @nathanhyman with this issue):

The exact same image, in the exact same docker-compose.yml, deployed on the exact same EC2 instance does work if brought up using docker-compose but does not work if brought up using a Rancher stack.

This rules out any network or configuration concerns between the clients, the aws instance, and within the container itself.

The only thing that’s different is that Rancher is bringing up the container instead of docker-compose…

Hard to do anything but speculate without your compose files to try to reproduce…

Do you have any customization of the upstairs rules on the host?

Rancher-compose is obviously just ultimately starting a container too, so the most interesting areas to look at are going to be related to things like managed networking and service discovery (DNS in the container will be pointed to 169.254.169.250 to talk to the local network agent).
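
A quick way to see that particular difference is to compare DNS inside the two deployments (the container name is a placeholder):

  # in the Rancher-launched container this will typically point at 169.254.169.250
  docker exec <container-name> cat /etc/resolv.conf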

Hi there,

I’ve been caught up with things and never got back to this. Below are the relevant parts of our docker-compose stack.

Is there any more debugging I could try? I’ve been doing some packet dumping to see if perhaps the ports are being messed up on the return (UDP get erroneous Src Port on answers from services in rancher · Issue #6494 · rancher/rancher · GitHub) but from what I understand from the dumps they appear correct.

hq:
  volumes:
  - /etc/openvpn/:/etc/openvpn
  cap_add:
  - NET_ADMIN
  - SYS_ADMIN
  devices:
  - /dev/net/tun:/dev/net/tun
  ports:
  - 80:80         # http
  - 1194:1194/udp # openvpn

@vincent

Do you have any customization of the upstairs rules on the host?

Not sure what this means. Anywhere I can look?

Again, to reiterate, this entire process works when it’s not done via Rancher.

I think that was phone autocorrect for iptables rules…

Hi Vincent

No, we haven’t set any custom iptables rules.

I did some more digging with tcpdump and think I found the same issue as in that post above.
The source port on the response from the internal IP (last line) for some reason is 1025 instead of 1194.

Working non-Rancher Docker stack
02:41:05.825715 IP [redacted public ip].56865 > 10.0.2.77.1194: UDP, length 42 Internal eth0 IP
02:41:05.825749 IP [redacted public ip].56865 > 172.17.0.3.1194: UDP, length 42 Docker container receiving
02:41:05.825760 IP [redacted public ip].56865 > 172.17.0.3.1194: UDP, length 42 Docker container receiving
02:41:05.826056 IP 172.17.0.3.1194 > [redacted public ip].56865: UDP, length 54 Docker container replying
02:41:05.826056 IP 172.17.0.3.1194 > [redacted public ip].56865: UDP, length 54 Docker container replying
02:41:05.826074 IP 10.0.2.77.1194 > [redacted public ip].56865: UDP, length 54


Rancher Docker stack
02:44:32.705069 IP [redacted public ip].43816 > 10.7.128.251.1194: UDP, length 42 Internal eth0 IP
02:44:32.705105 IP [redacted public ip].43816 > 10.42.181.144.1194: UDP, length 42 Docker container receiving
02:44:32.705109 IP [redacted public ip].43816 > 10.42.181.144.1194: UDP, length 42 Docker container receiving
02:44:32.705508 IP 172.17.0.8.1194 > [redacted public ip].43816: UDP, length 54 Docker container replying
02:44:32.705508 IP 172.17.0.8.1194 > [redacted public ip].43816: UDP, length 54 Docker container replying
02:44:32.705540 IP 10.7.128.251.1025 > [redacted public ip].43816: UDP, length 54 Internal eth0 IP
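
To cross-check what tcpdump shows, the host’s conntrack table can be inspected as well; something like this (assuming conntrack-tools is installed on the host) should show whether the reply tuple still carries source port 1194 or the rewritten one:

  # list UDP conntrack entries involving the OpenVPN port
  sudo conntrack -L -p udp | grep 1194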

Does this seem like the cause of the issue? And do you know of any solution or workaround in the meantime?

Kind Regards

@nathanhyman

From the last message, since I see packets with source IP address 172.17.x.x, I am assuming you are using an older release. Could you give the latest version (v1.2.x) of Rancher a try? The networking infrastructure has changed a lot in that release.

  • The 10.20.x.x/16 subnet shouldn’t interfere with Rancher’s 10.42.x.x/16 subnet, so you are fine there.
  • There is no restriction on using port 1194 from the Rancher side. If you have another service/stack using the same port then that would cause a conflict, but in your case I don’t think it’s applicable.

If you get a chance to try the new version and are still facing issues, some things to check:

  • On the host where the container is launched, check for the exposed port in the iptables rules (see the sketch below).
  • Check if there are any errors in the ‘network-manager’ stack containers and the ipsec containers.
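
As a rough sketch of the first check (chain and rule names differ between Docker/Rancher versions, so treat this as a starting point only):

  # look for the NAT rule that publishes the container's 1194/udp port
  sudo iptables -t nat -L -n -v | grep 1194
  # and check the filter table in case something is dropping the traffic
  sudo iptables -L -n -v | grep 1194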