HA-Cluster not reliable [solved]

Using the current stable 2.0.6 (I also tried 2.0.7 this morning), either I set up the cluster wrong or clusters are generally not reliable in Rancher 2.x.

Rancher version: 2.0.6
OS: Ubuntu 18.04
Docker: 17.12.1-ce

Setup: HA (three Rancher nodes building the Kubernetes cluster)
Created a new cluster inside the Rancher 2.x UI.
Added four nodes to this new cluster.
3 Nodes have role “All”
1 Node has role “Worker”

When I shut down one of the “All” nodes, the whole cluster fails. In the UI an error is shown:

This cluster is currently Unavailable ; areas that interact directly with it will not be available until the API is ready.
Failed to communicate with API server: Get https://192.168.254.32:6443/api/v1/componentstatuses: net/http: request canceled (Client.Timeout exceeded while awaiting headers)

According to the information I found in the thread “[SOLVED] HA failover not working”, three nodes are required for etcd and controlplane, hence the three nodes with role “All”.
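To spell out why three nodes should be enough: etcd needs a majority of its members to stay available, so a cluster of n members has a quorum of n/2 + 1 (rounded down) and survives the loss of the rest. A minimal sketch of that arithmetic follows; the function names are made up for illustration and are not part of Rancher or etcd.

```python
def quorum(members: int) -> int:
    """Smallest majority of an etcd cluster with the given member count."""
    if members < 1:
        raise ValueError("an etcd cluster needs at least one member")
    return members // 2 + 1


def tolerated_failures(members: int) -> int:
    """How many members can be lost while a majority is still available."""
    return members - quorum(members)


if __name__ == "__main__":
    for n in (1, 2, 3, 5):
        print(f"{n} members: quorum {quorum(n)}, "
              f"tolerates {tolerated_failures(n)} failure(s)")
```

With three etcd members the quorum is two, so losing a single “All” node should not take the cluster down.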

So when one of the cluster nodes goes down, the whole cluster goes down… Please tell me there’s something I forgot in the cluster setup, because that’s a no-go for production.

Woah… it really turns out to be an issue with the newer Docker version 17.12.1-ce that ships with Ubuntu 18.04! (I read through the docs again and found the specific Docker version requirement in https://rancher.com/docs/rancher/v2.5/en/cluster-provisioning/rke-clusters/custom-nodes/.) Here is what I did:

  • Deleted the cluster
  • Reset the cluster nodes and Rancher nodes
  • Installed Docker 17.03.2~ce from the download.docker.com repository on the cluster nodes (the Rancher nodes already had 17.03.2~ce installed, which is how I spotted the difference)
  • Reset the Rancher 2 environment (created a new Kubernetes deployment of Rancher with rke 1.9)
  • Created the new cluster within the Rancher UI
  • Added three nodes with the “All” role to the cluster
  • Added an additional node with the “Worker” role to the cluster
  • Shut down the first of the “All” role nodes

Result: The cluster recognizes that one node is down but continues to function.
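For anyone who wants to catch this kind of mismatch before adding nodes, here is a rough sketch of a version check. It assumes you feed it the version string a node reports (for example the output of `docker version --format '{{.Server.Version}}'`) and that 17.03.2 is the version you want, as it was in my case. The helper names are hypothetical, and the expected version should come from the docs for your Rancher release.

```python
import re


def base_version(reported: str) -> str:
    """Strip suffixes such as '-ce' or '~ce' and keep the numeric part."""
    match = re.match(r"\d+(?:\.\d+)*", reported.strip())
    if match is None:
        raise ValueError(f"not a recognizable Docker version: {reported!r}")
    return match.group(0)


def matches_expected(reported: str, expected: str = "17.03.2") -> bool:
    """True if the reported Docker version equals the expected one.

    The default of 17.03.2 is an assumption taken from this thread,
    not a general requirement.
    """
    return base_version(reported) == base_version(expected)


if __name__ == "__main__":
    for version in ("17.03.2~ce", "17.12.1-ce"):
        status = "ok" if matches_expected(version) else "mismatch"
        print(f"{version}: {status}")
```

Running this against every node before registering it would have flagged the Ubuntu 18.04 nodes with 17.12.1-ce right away.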