Within the stack that I deploy via Rancher I have an OpenVPN service running in one of the containers. It listens on 1194/UDP.
When I run this stack locally on my machine the service works normally and I can connect clients to it.
However, when I deploy it via Rancher, clients fail to connect properly.
The host is an AWS EC2 instance with both the TCP and UDP ports open in the security group. I've tested both a manual deployment (docker-compose up) and a Rancher deployment on this instance.
The VPN subnet is set up in the 10.20.x.x range.
Are there any known issues with OpenVPN running on 1194 alongside Rancher agents? Or is there any other way I can debug this issue?
I assume you are exposing the port and you’ve confirmed this on the host (something like ss -lun should do it, plus you’ll get decent output from docker ps in the PORTS column).
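For reference, something along these lines is what I'd run on the host to confirm it:

# Is anything listening on 1194/udp on the host?
ss -lun | grep 1194

# Has Docker published the port? Look at the PORTS column.
docker ps --format 'table {{.Names}}\t{{.Ports}}'

# Docker's NAT rule for a published UDP port should also show up here
sudo iptables -t nat -L DOCKER -n | grep 1194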
Where are the clients located? Presumably ‘outside’ and thus you have some sort of public IP NATted to the host?
Rancher does not in any way use this port so it shouldn’t be a problem.
Similar on the client end.
The AWS instance has an Elastic IP that the client points at on 1194/udp. The client is just a simple device on my desk with no network restrictions that I'm aware of (it works when the stack isn't Rancherised anyway, so the port can't be blocked).
I found this link yesterday as well, which I thought might be interfering; changing the subnet to 11.42.x.x didn't appear to fix anything either.
Hey, thanks for the reply. Sorry for the vagueness; I'm not too familiar with OpenVPN and just picked it up from a colleague.
The port is most certainly exposed. Running ss -lun shows it open and waiting, and docker ps shows it published.
The service is certainly running within the container, as the container itself becomes 10.20.0.1 and I can see the VPN interface within it. The container acts as the 'hq' which all the clients connect to, giving us SSH access to them. This is the only way to access the devices, as they may not be publicly exposed otherwise (excluding the test device I have on my desk).
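For what it's worth, this is roughly how I'm verifying that from the host (assuming the image ships iproute2; the container name is just a placeholder):

# The VPN interface and its 10.20.0.1 address, seen from inside the container
docker exec <openvpn-container> ip addr show tun0

# The routes the OpenVPN server has set up for the 10.20.x.x subnet
docker exec <openvpn-container> ip route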
Not sure exactly what is involved in the TLS certs (the entire process is hidden behind some scripts for me), but the same process is used when connecting to a non-Rancher-deployed container and it works normally.
Not sure what those last two questions entail, unfortunately. The clients all connect to that single point; it's not forwarding anywhere else and is basically just used for SSH access. Does Rancher do anything that might change this routing?
Is there any way to trace the packets, or perhaps dump them to analyse? Not sure how useful that might be.
OK, sorry, but I'm still lost. You're using a TLS-based VPN to SSH into a Docker container? That sounds all kinds of wrong, but hey, it also sounds like a novel idea that might just work.
I think you've got a basic architectural issue though. How exactly do you get from the container to the hosts? I've never given that much thought. Perhaps you need the container to be running host-mode networking?
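To be clear about what I mean by host-mode networking, a rough sketch in plain docker run form (the image name, config path and capabilities are guesses at what your OpenVPN image needs):

# With --net=host, 1194/udp binds directly on the host's network stack,
# so Docker's NAT / userland proxy never touches the traffic.
docker run -d \
  --name openvpn \
  --net=host \
  --cap-add=NET_ADMIN \
  --device /dev/net/tun \
  -v /path/to/openvpn-data:/etc/openvpn \
  <your-openvpn-image>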
What image are you using? Perhaps I can check the debug options.
Can you post the compose details for both the Rancher and non-Rancher deployments? That might provide some clues.
I don’t consider packet capture useful (except for checking general connectivity) as everything will be encrypted.
To just check connectivity, you could run something like this: https://hub.docker.com/r/rucknar/sys-tools/ using host mode networking (and priv mode) and then use tcpdump: tcpdump -i any -vv -nn port 1194
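i.e. something like this (I'm assuming that image drops you into a shell):

docker run -it --rm --net=host --privileged rucknar/sys-tools

# then, from the shell inside that container:
tcpdump -i any -vv -nn port 1194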
Just wanted to jump in here and clarify (I’m helping @nathanhyman with this issue):
The exact same image, in the exact same docker-compose.yml, deployed on the exact same EC2 instance does work if brought up using docker-compose but does not work if brought up using a Rancher stack.
This rules out any network or configuration concerns between the clients, the AWS instance, and within the container itself.
The only thing that's different is that Rancher is bringing up the container instead of docker-compose…
Hard to do anything but speculate without your compose files to try to reproduce…
Do you have any customization of the iptables rules on the host?
Rancher-compose is obviously just ultimately starting a container too, so the most interesting areas to look at are going to be related to things like managed networking and service discovery (DNS in the container will be pointed at 169.254.169.250 to talk to the local network agent).
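A quick way to confirm the managed network and the DNS override are actually in play for your container (the container name is a placeholder):

# Rancher-managed containers have their DNS pointed at the local network agent
docker exec <openvpn-container> cat /etc/resolv.conf   # expect: nameserver 169.254.169.250

# They also get a 10.42.x.x address on the managed network alongside the docker0 one
docker exec <openvpn-container> ip addr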
I did some more digging with tcpdump and think I found the same issue as in that post above.
The source port on the response from the internal IP (last line of the Rancher capture) is for some reason 1025 instead of 1194.
Working non-Rancher Docker stack
02:41:05.825715 IP [redacted public ip].56865 > 10.0.2.77.1194: UDP, length 42 Internal eth0 IP
02:41:05.825749 IP [redacted public ip].56865 > 172.17.0.3.1194: UDP, length 42 Docker container receiving
02:41:05.825760 IP [redacted public ip].56865 > 172.17.0.3.1194: UDP, length 42 Docker container receiving
02:41:05.826056 IP 172.17.0.3.1194 > [redacted public ip].56865: UDP, length 54 Docker container replying
02:41:05.826056 IP 172.17.0.3.1194 > [redacted public ip].56865: UDP, length 54 Docker container replying
02:41:05.826074 IP 10.0.2.77.1194 > [redacted public ip].56865: UDP, length 54
Rancher Docker stack
02:44:32.705069 IP [redacted public ip].43816 > 10.7.128.251.1194: UDP, length 42 Internal eth0 IP
02:44:32.705105 IP [redacted public ip].43816 > 10.42.181.144.1194: UDP, length 42 Docker container receiving
02:44:32.705109 IP [redacted public ip].43816 > 10.42.181.144.1194: UDP, length 42 Docker container receiving
02:44:32.705508 IP 172.17.0.8.1194 > [redacted public ip].43816: UDP, length 54 Docker container replying
02:44:32.705508 IP 172.17.0.8.1194 > [redacted public ip].43816: UDP, length 54 Docker container replying
02:44:32.705540 IP 10.7.128.251.1025 > [redacted public ip].43816: UDP, length 54 Internal eth0 IP
Does this seem like the cause of the issue? And do you know of any solution or workaround in the meantime?
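In case it helps narrow things down, this is what I was planning to check next on the host (just my guess at which tables matter; conntrack-tools may need installing):

# Any NAT rules touching 1194, e.g. a MASQUERADE/SNAT that could rewrite the reply's source port
sudo iptables -t nat -L -n -v | grep 1194

# The conntrack entry for the UDP flow shows the reply tuple the kernel is actually using
sudo conntrack -L -p udp 2>/dev/null | grep 1194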
From the last message, since I see packets with source IP address 172.17.x.x, I am assuming you are using an older release. Could you give the latest version (v1.2.x) of Rancher a try? The networking infrastructure has changed a lot in that release.
The 10.20.x.x/16 subnet shouldn't interfere with Rancher's 10.42.0.0/16 subnet, so you are fine there.
There is no restriction on using port 1194 from Rancher's side. If you had another service/stack using the same port that would cause a conflict, but in your case I don't think that's applicable.
If you get a chance to try the new version and are still facing issues, some things to check (a sketch of both checks follows below):
On the host where the container is launched, check for the exposed port in the iptables rules.
Check if there are any errors in the 'network-manager' stack containers and the ipsec containers.
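For reference, roughly what I mean by those two checks (the container names come from a default Rancher setup, adjust if yours differ):

# 1. The exposed port should show up in the NAT rules on the host running the container
sudo iptables -t nat -L -n -v | grep 1194

# 2. Look for errors from the Rancher networking containers
docker ps | grep -E 'network-manager|ipsec'
docker logs <network-manager-container> 2>&1 | tail -n 50
docker logs <ipsec-container> 2>&1 | tail -n 50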