Problems moving from a single-node to a multi-node setup

Hi There,

We built a system around a single-node Rancher server, version 0.59.1. As we expanded and added hosts we ran into performance issues, so we upgraded the server to 1.0.1, which stabilized things.

Here is the environment:

Rancher version: v1.0.1 (16 vCPU, 32 GB RAM)
Docker version: 1.11.1
RDS: 4 vCPU, 15 GB RAM

We kept adding hosts (currently 70, running 1200 stacks), and Rancher now takes around 20 minutes to spin up a simple stack, so we tried to move to a multi-node setup. I've seen a post on GitHub saying that Rancher HA does not support Docker 1.11.1, so below is the process I followed.

1) Uploaded certificates and generated the script for a 3-host HA setup.
2) Downloaded the script and ran it on 2 servers; I could see the hosts being added in Rancher (on /admin/ha I see 2/3 hosts).
3) Stopped the single-node Rancher server and downgraded Docker from 1.11.1 to 1.10.3.
4) Ran the script on the existing single-node server and attached the servers to an AWS ELB with TCP listeners.
5) Could access Rancher on ports 18080, 80 and 443.
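For reference, the ELB listeners in step 4 are plain TCP passthrough on all three ports; as far as I know, an HTTP/HTTPS listener would terminate the websocket connections that websocket-proxy relies on:

```
80    (TCP) -> 80    (TCP)
443   (TCP) -> 443   (TCP)
18080 (TCP) -> 18080 (TCP)
```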

Problems faced with HA:

1) "go-machine-service", "rancher-compose-executor", "websocket-proxy" and "websocket-proxy-ssl" are always in an unhealthy or initializing state.
2) The rancher-ha-cattle logs show different errors every time: waiting for ZooKeeper, connection reset by peer, or agent initialization failed.
3) The rancher-ha logs say a container could not be deleted because a volume could not be unmounted ("device or resource busy").
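For problem 3, the only way I know to track down a "device or resource busy" unmount failure is to look for processes that still hold the mount. The path below is a placeholder; substitute the volume path from the rancher-ha log line:

```shell
# Placeholder -- substitute the volume path reported in the rancher-ha log.
MNT=/var/lib/docker/volumes
# Any PID listed here still has the path mounted in its namespace,
# which is what makes Docker's unmount fail with "device or resource busy".
grep -l "$MNT" /proc/[0-9]*/mounts 2>/dev/null || true
```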

I moved back to the single node but did not revert Docker from 1.10.3 to 1.11.1, and Rancher broke badly.

Last night we upgraded Docker from 1.10.3 back to 1.11.1 and things are normal again, but stack/service creation takes at least 20 minutes and RDS sits at 100% CPU. I'm unsure what is causing this. How can I move my existing setup to HA?
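On the RDS side, the first thing I would check is what the server is actually executing while the CPU is pinned. The hostname below is a placeholder; `cattle` is the default database/user name from the Rancher external-database docs:

```shell
# Placeholder endpoint -- substitute your RDS hostname.
RDS_HOST=example.rds.amazonaws.com
# Show the statements currently running and the number of active threads;
# long-running cattle queries here usually explain a pinned CPU.
command -v mysql >/dev/null && \
  mysql -h "$RDS_HOST" -u cattle -p cattle \
    -e 'SHOW FULL PROCESSLIST; SHOW GLOBAL STATUS LIKE "Threads_running";' \
  || true
```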

I have attempted this around 4 times so far with no success. I have a dev environment for Rancher HA that works fine, but it is not working as expected in production.