Rancher agent not communicating across public<->private subnets on AWS

Hi,

We are trying to set up a web application with the load balancers in a public subnet and the application servers in a private subnet, both within the same VPC.

I have opened ports 500 and 4500 on both security groups so that the IPsec tunnels can be established.

The problem I am seeing is that Rancher agents within the same subnet can communicate with each other, but agents in different subnets cannot. The rancher-net logs show that it added a forwarding policy for the container on the host in the other subnet (subnet 2), but that container is not pingable from subnet 1.

Has anybody encountered similar issues with a Rancher setup on AWS?

Looks like one (or more) of the hosts is registered using the docker0 IP (172.17.x.y) instead of a public IP. All the hosts need to be able to reach each other using the IP shown on the hosts.
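
To check the suggestion above quickly, the registered IPs can be classified with a small script. This is only a sketch: it assumes docker0 is on Docker's default bridge range (172.17.0.0/16), and `classify_host_ip` is a made-up helper name, not part of Rancher. The 10.0.1.25 address in the usage comment is purely illustrative.

```python
import ipaddress

# Assumption: docker0 uses Docker's default bridge network.
# Adjust this if the daemon was started with a custom bridge range.
DOCKER0_DEFAULT = ipaddress.ip_network("172.17.0.0/16")


def classify_host_ip(ip: str) -> str:
    """Return 'docker0', 'private', or 'public' for a registered host IP."""
    addr = ipaddress.ip_address(ip)
    if addr in DOCKER0_DEFAULT:
        return "docker0"
    # is_private also covers loopback and link-local ranges,
    # not only the RFC 1918 blocks used inside a VPC.
    if addr.is_private:
        return "private"
    return "public"


# Usage:
#   classify_host_ip("172.17.0.1")      -> "docker0"
#   classify_host_ip("10.0.1.25")       -> "private"
#   classify_host_ip("54.149.206.152")  -> "public"
```

Any host that comes back as "docker0" cannot be reached by the other hosts on that address and would need to be re-registered with a reachable IP.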

All the hosts are registered with the correct AWS IP. We use a Docker container to restart the Rancher agent, similar to the reset-agent workflow, and that resets CATTLE_AGENT_IP on each host.

The problem is specific to the case where the load balancer is in a public subnet and registered with its public IP, while the app servers are in a private subnet.

The following three scenarios work fine, even though they use the same security groups.

  1. Both load balancer and application server in PRIVATE subnets
  2. Both load balancer and application server in PUBLIC subnets
  3. Load balancer in PUBLIC subnet but registered with its private IP, and application server in PRIVATE subnet

I tried this with the rancher/server:master Docker image (v1.2.0-pre4-rc8 is not stable) and got the error below on the load balancer container. I'm not sure whether it is related, but it is one more data point.

11/18/2016 3:30:37 PM time="2016-11-18T23:30:37Z" level=info msg="KUBERNETES_URL is not set, skipping init of kubernetes controller"
11/18/2016 3:30:37 PM time="2016-11-18T23:30:37Z" level=info msg="Starting Rancher LB service"
11/18/2016 3:30:37 PM time="2016-11-18T23:30:37Z" level=info msg="LB controller: rancher"
11/18/2016 3:30:37 PM time="2016-11-18T23:30:37Z" level=info msg="LB provider: haproxy"
11/18/2016 3:30:37 PM time="2016-11-18T23:30:37Z" level=info msg="starting rancher controller"
11/18/2016 3:30:37 PM time="2016-11-18T23:30:37Z" level=info msg="Healthcheck handler is listening on :10241"
11/18/2016 3:30:38 PM time="2016-11-18T23:30:38Z" level=info msg=" -- starting haproxy\n[ALERT] 322/233037 (26) : Starting frontend GLOBAL: cannot bind UNIX socket [/run/haproxy/admin.sock]\n"
11/18/2016 3:30:39 PM time="2016-11-18T23:30:39Z" level=info msg=" -- reloading haproxy config with the new config changes\n[WARNING] 322/233039 (37) : config : 'option forwardfor' ignored for proxy 'default' as it requires HTTP mode.\n"

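I checked ports 4500/500 on the public subnet machine from the private subnet. The host is up, but note that nmap reports both ports as filtered for TCP, while port 80 on the same host accepts connections.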

ubuntu@Public-Private-Leader:~$ nmap -A 54.149.206.152 -p 4500

Starting Nmap 6.47 ( http://nmap.org ) at 2016-11-18 23:46 UTC
Nmap scan report for ec2-54-149-206-152.us-west-2.compute.amazonaws.com (54.149.206.152)
Host is up (0.0012s latency).
PORT     STATE    SERVICE VERSION
4500/tcp filtered sae-urn

Service detection performed. Please report any incorrect results at http://nmap.org/submit/ .
Nmap done: 1 IP address (1 host up) scanned in 0.40 seconds
ubuntu@Public-Private-Leader:~$ nmap -A 54.149.206.152 -p 500

Starting Nmap 6.47 ( http://nmap.org ) at 2016-11-18 23:46 UTC
Nmap scan report for ec2-54-149-206-152.us-west-2.compute.amazonaws.com (54.149.206.152)
Host is up (0.0013s latency).
PORT    STATE    SERVICE VERSION
500/tcp filtered isakmp

Service detection performed. Please report any incorrect results at http://nmap.org/submit/ .
Nmap done: 1 IP address (1 host up) scanned in 0.39 seconds
ubuntu@Public-Private-Leader:~$ telnet 54.149.206.152 80
Trying 54.149.206.152...
Connected to 54.149.206.152.
Escape character is '^]'.
^CConnection closed by foreign host.
ubuntu@Public-Private-Leader:~$ curl http://54.149.206.152:80/WebAccess/login.html
<html><body><h1>503 Service Unavailable</h1>
No server is available to handle this request.
</body></html>
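
One caveat about the nmap runs above: they probe 500/tcp and 4500/tcp, whereas IPsec key exchange (IKE) and NAT traversal use UDP 500 and UDP 4500, so a TCP scan does not show whether the tunnel ports are open. A rough UDP check could look like the sketch below. `udp_probe` is a hypothetical helper, and the one-byte payload is not a valid IKE message, so the IPsec daemon will most likely ignore it. Treat "no-reply" as inconclusive; only "refused" (an ICMP port unreachable came back) tells you something definite.

```python
import socket


def udp_probe(host: str, port: int, payload: bytes = b"\x00",
              timeout: float = 2.0) -> str:
    """Send one UDP datagram and report what came back.

    Returns "reply" if any datagram is received, "refused" if the OS
    reports an ICMP port unreachable, or "no-reply" on timeout.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        # Connecting the UDP socket lets ICMP errors surface as exceptions.
        sock.connect((host, port))
        try:
            sock.send(payload)
            sock.recv(2048)
            return "reply"
        except ConnectionRefusedError:
            return "refused"
        except socket.timeout:
            # Either a filter dropped the packet or the listener ignored it.
            return "no-reply"


# Usage, run from the private subnet host:
#   for port in (500, 4500):
#       print(port, udp_probe("54.149.206.152", port))
```

Since a timeout is ambiguous, it is worth watching for the incoming packets on the public subnet host at the same time, and checking that the security group rules allow UDP (not just TCP) on both ports.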