Rancher HA: Can not launch agent right now: Server not available at http://172.19.10.2:18080/ping

I have followed the instructions for an HA setup, but I’m stuck at starting the rancher/server containers.
The logs from the rancher-ha container keep showing the following line:

level=info msg="Can not launch agent right now: Server not available at http://172.19.10.2:18080/ping:" component=service

Looking at the iptables configuration, it looks like port 18080 is meant to be DNAT’ed to port 8080.
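For reference, the relevant rule can be listed directly (this assumes the default DOCKER chain in the nat table, as in the dump further down):

iptables -t nat -S DOCKER | grep 18080
# expected output:
# -A DOCKER ! -i docker0 -p tcp -m tcp --dport 18080 -j DNAT --to-destination 172.19.10.2:8080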

The following works from my host:

curl 172.19.10.2:8080/ping
pong

From inside the rancher-ha container, neither request works:

docker exec -it rancher-ha bash
root@rbsu1082:/# curl http://172.19.10.2:18080/ping
curl: (7) Failed to connect to 172.19.10.2 port 18080: Connection refused

Could it be that Docker’s iptables rules are not permitting the flow correctly?
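If it is an iptables issue, one possibility I can think of is hairpin NAT: the DNAT rule below only matches traffic that does not arrive on docker0 (! -i docker0), so a request made from inside the container to 172.19.10.2:18080 would never be rewritten to port 8080. A rough way to narrow it down (the host IP is a placeholder for whichever node this is):

# from inside the container, hit the published port via the host's own address
docker exec -it rancher-ha curl -v http://<host-ip>:18080/ping
# and hit the server port directly on the container address
docker exec -it rancher-ha curl -v http://172.19.10.2:8080/ping

If the second request answers “pong” but the first does not, the server itself is fine and the problem is in how the published port is reached from inside the bridge network.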

I’m running rancher/server:stable on docker 1.10.3.
iptables-save output below:

# Generated by iptables-save v1.4.21 on Tue Jun 14 12:52:57 2016
*nat
:PREROUTING ACCEPT [6105:275858]
:INPUT ACCEPT [2966:140188]
:OUTPUT ACCEPT [945:62650]
:POSTROUTING ACCEPT [947:62770]
:DOCKER - [0:0]
-A PREROUTING -m addrtype --dst-type LOCAL -j DOCKER
-A OUTPUT ! -d 127.0.0.0/8 -m addrtype --dst-type LOCAL -j DOCKER
-A POSTROUTING -s 172.19.10.0/24 ! -o docker0 -j MASQUERADE
-A POSTROUTING -s 172.19.10.2/32 -d 172.19.10.2/32 -p tcp -m tcp --dport 16379 -j MASQUERADE
-A POSTROUTING -s 172.19.10.2/32 -d 172.19.10.2/32 -p tcp -m tcp --dport 13888 -j MASQUERADE
-A POSTROUTING -s 172.19.10.2/32 -d 172.19.10.2/32 -p tcp -m tcp --dport 12888 -j MASQUERADE
-A POSTROUTING -s 172.19.10.2/32 -d 172.19.10.2/32 -p tcp -m tcp --dport 12181 -j MASQUERADE
-A POSTROUTING -s 172.19.10.2/32 -d 172.19.10.2/32 -p tcp -m tcp --dport 8080 -j MASQUERADE
-A DOCKER -i docker0 -j RETURN
-A DOCKER ! -i docker0 -p tcp -m tcp --dport 6379 -j DNAT --to-destination 172.19.10.2:16379
-A DOCKER ! -i docker0 -p tcp -m tcp --dport 3888 -j DNAT --to-destination 172.19.10.2:13888
-A DOCKER ! -i docker0 -p tcp -m tcp --dport 2888 -j DNAT --to-destination 172.19.10.2:12888
-A DOCKER ! -i docker0 -p tcp -m tcp --dport 2181 -j DNAT --to-destination 172.19.10.2:12181
-A DOCKER ! -i docker0 -p tcp -m tcp --dport 18080 -j DNAT --to-destination 172.19.10.2:8080
COMMIT
# Completed on Tue Jun 14 12:52:57 2016
# Generated by iptables-save v1.4.21 on Tue Jun 14 12:52:57 2016
*filter
:INPUT DROP [3092:132930]
:FORWARD ACCEPT [0:0]
:OUTPUT ACCEPT [153359:38761977]
:DOCKER - [0:0]
:DOCKER-ISOLATION - [0:0]
-A INPUT -m state --state RELATED,ESTABLISHED -j ACCEPT
-A INPUT -m state --state INVALID -j DROP
-A INPUT -i lo -j ACCEPT
-A INPUT -p icmp -j ACCEPT
-A INPUT -p tcp -m tcp --dport 22 -j ACCEPT
-A INPUT -p tcp -m tcp --dport 6556 -j ACCEPT
-A INPUT -p tcp -m multiport --dports 80,443,18080 -j ACCEPT
-A INPUT -s 172.20.13.38/32 -p tcp -m multiport --dports 2181,2376,2888,3888,6379 -j ACCEPT
-A INPUT -s 172.20.13.38/32 -p udp -m multiport --dports 500,4500 -j ACCEPT
-A INPUT -s 172.20.13.39/32 -p tcp -m multiport --dports 2181,2376,2888,3888,6379 -j ACCEPT
-A INPUT -s 172.20.13.39/32 -p udp -m multiport --dports 500,4500 -j ACCEPT
-A INPUT -s 172.20.13.40/32 -p tcp -m multiport --dports 2181,2376,2888,3888,6379 -j ACCEPT
-A INPUT -s 172.20.13.40/32 -p udp -m multiport --dports 500,4500 -j ACCEPT
-A INPUT -s 172.20.13.38/32 -p vrrp -j ACCEPT
-A INPUT -s 172.20.13.39/32 -p vrrp -j ACCEPT
-A INPUT -s 172.20.13.40/32 -p vrrp -j ACCEPT
-A INPUT -s 172.19.10.0/24 -p tcp -m tcp --dport 3306 -j ACCEPT
-A INPUT -s 172.20.13.38/32 -p tcp -m multiport --dports 3306,4444,4567,4568 -j ACCEPT
-A INPUT -s 172.20.13.39/32 -p tcp -m multiport --dports 3306,4444,4567,4568 -j ACCEPT
-A INPUT -s 172.20.13.40/32 -p tcp -m multiport --dports 3306,4444,4567,4568 -j ACCEPT
-A FORWARD -j DOCKER-ISOLATION
-A FORWARD -o docker0 -j DOCKER
-A FORWARD -o docker0 -m conntrack --ctstate RELATED,ESTABLISHED -j ACCEPT
-A FORWARD -i docker0 ! -o docker0 -j ACCEPT
-A FORWARD -i docker0 -o docker0 -j ACCEPT
-A DOCKER -d 172.19.10.2/32 ! -i docker0 -o docker0 -p tcp -m tcp --dport 16379 -j ACCEPT
-A DOCKER -d 172.19.10.2/32 ! -i docker0 -o docker0 -p tcp -m tcp --dport 13888 -j ACCEPT
-A DOCKER -d 172.19.10.2/32 ! -i docker0 -o docker0 -p tcp -m tcp --dport 12888 -j ACCEPT
-A DOCKER -d 172.19.10.2/32 ! -i docker0 -o docker0 -p tcp -m tcp --dport 12181 -j ACCEPT
-A DOCKER -d 172.19.10.2/32 ! -i docker0 -o docker0 -p tcp -m tcp --dport 8080 -j ACCEPT
-A DOCKER-ISOLATION -j RETURN
COMMIT
# Completed on Tue Jun 14 12:52:57 2016

Just tried again with the userland proxy disabled (--userland-proxy=false), but still the same.
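In case it matters, this is roughly how I disabled it (a systemd drop-in; the exact ExecStart line should mirror whatever your distro’s unit uses, and on non-systemd hosts DOCKER_OPTS in /etc/default/docker would be the equivalent):

# /etc/systemd/system/docker.service.d/userland-proxy.conf
[Service]
ExecStart=
ExecStart=/usr/bin/docker daemon -H fd:// --userland-proxy=false

# then reload and restart the daemon
systemctl daemon-reload && systemctl restart docker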

It’s a bug!

I ran into the same issue and my HA nodes are completely broken. The cost of building HA is so high, and the benefit is now questionable.

I couldn’t agree more. Every time I get my HA environment running, I feel like I have to treat it with kid gloves. Upgrade the version? Nope, you’re screwed: some service on one of the nodes won’t come back up afterwards, with no logging anywhere to help figure out why. Resize a machine to add more memory? You’d better pray that things come back cleanly when you reboot the server… it probably won’t happen. Start the instances in the wrong order? Hilarity ensues, again with no useful logging anywhere accessible to determine why. Sure, there might be a message in one of the twenty or so containers that are brought up, but good luck finding it.

I LOVE the application and the concept of Rancher, and our initial proof of concept using the simple one-container launch was brilliant, but trying to get the HA environment set up and then keep it running has been a month-long struggle. I can’t even get my team to start using it, because I tried to upgrade from 1.1.0.dev-something to the final 1.1.0 last Wednesday and the application has been down ever since. I’m wondering if it’s worth diving back in to try to figure out what went wrong “this time”. I’ll probably do it, because there’s a reason we chose Rancher over the alternatives, but seriously, it shouldn’t be this hard.

It turns out that I don’t really need the Rancher UI to be highly available. So my current solution is HA for the database plus regular backups, with Rancher switched back to single-node mode. It’s working perfectly, and I don’t need to worry about upgrades or recovery any more. Hope this helps, and I wish Rancher keeps getting better.

BTW, recovery is more important than HA in my use case.
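For the “regular backup” part, a nightly mysqldump of the cattle schema is enough for my purposes; something along these lines works (host, credentials, and paths are placeholders for my setup):

# /etc/cron.d/rancher-db-backup  (hypothetical example)
# dump the cattle database every night at 02:30 and keep dated, compressed copies
30 2 * * * root mysqldump -h db.example.com -u cattle -p'secret' cattle | gzip > /var/backups/cattle-$(date +\%F).sql.gz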

Is that the case? We’re not just talking about the UI but all of the management services as well. What happens to your running applications when the management services become unavailable?

Containers running on hosts that are still up continue running. The server container is not involved in the datapath; networking between containers is done directly between hosts, DNS and metadata run on each host and serve only the containers on that host, etc. So generally, everything continues running as it was.

The biggest point of concern is that healthchecks [can’t] report their state to the [non-existent] server, so there is nobody watching to reschedule replacement containers or update DNS/metadata config if a service or host fails while the server is also down.

All the persistent state is stored in the database, which runs inside the rancher/server container by default but can be pointed at an external instance. So for many use cases, a reliable external database (e.g. Amazon RDS) plus a single server container is sufficient without getting into the much more involved multi-master, load-balanced HA setup.
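For reference, pointing the server container at an external database looks roughly like this, if I remember the 1.x flags correctly (hostname and credentials are placeholders):

docker run -d --restart=unless-stopped -p 8080:8080 rancher/server \
  --db-host mysql.example.com --db-port 3306 \
  --db-user cattle --db-pass changeme --db-name cattle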

In that case, is there a significant functional difference between running an HA setup and running a blue-green deployment with two single-node servers connected to the same external database?

Same error here:
time="2016-10-19T08:21:50Z" level=info msg="Waiting for server to be available" component=cert
time="2016-10-19T08:21:50Z" level=info msg="Can not launch agent right now: Server not available at http://192.168.169.2:18080/ping:" component=service

Note that it tries to reach the server on IP 192.168.169.2, when it should be using 192.168.169.1.
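For reference, this is how I checked which addresses are in play (the container name is taken from the earlier posts, and the inspect format assumes the default bridge; adjust both for your setup):

# IP assigned to the rancher-ha container
docker inspect -f '{{ .NetworkSettings.IPAddress }}' rancher-ha
# IP of the bridge itself
ip addr show docker0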

Any ideas?