HA-Cluster not reliable [solved]

Napsty · August 17, 2018, 7:59am

Using the current stable 2.0.6 (I also tried 2.0.7 this morning), either I set up the cluster wrong or clusters are generally not reliable in Rancher 2.x.

Rancher version: 2.0.6
OS: Ubuntu 18.04
Docker: 17.12.1-ce

Setup: HA (three Rancher nodes building the Kubernetes cluster)
Created a new cluster inside the Rancher 2.x UI.
Added four nodes to this new cluster.
3 Nodes have role “All”
1 Node has role “Worker”

When I shut down one of the “All” nodes, the whole cluster fails. In the UI an error is shown:

This cluster is currently Unavailable; areas that interact directly with it will not be available until the API is ready.
Failed to communicate with API server: Get https://192.168.254.32:6443/api/v1/componentstatuses: net/http: request canceled (Client.Timeout exceeded while awaiting headers)

According to the information I found in [SOLVED] HA failover not working - #3 by nick76, three nodes are required for etcd and controlplane. Hence the three nodes with the “All” role.

So one of the cluster nodes goes down, the whole cluster goes down… Please tell me there’s something I forgot in the cluster setup, because that’s a no-go for production.
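For context on why three etcd nodes should be enough here: etcd stays available as long as a majority of its members is up. A minimal sketch of that arithmetic follows. It is generic majority math, not anything specific to Rancher, and the function names are made up for illustration.

```python
def quorum(members: int) -> int:
    """Smallest majority of an etcd cluster with `members` voting members."""
    return members // 2 + 1


def tolerated_failures(members: int) -> int:
    """How many members can be down while a majority is still reachable."""
    return members - quorum(members)


# Compare a few cluster sizes; three members should survive one node going down.
for n in (1, 2, 3, 4, 5):
    print(f"{n} etcd members: quorum {quorum(n)}, tolerates {tolerated_failures(n)} down")
```

With three “All” nodes the quorum is two, so losing a single node should not take the cluster down, which is why the behaviour above points at a setup problem rather than at the design.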

Napsty · August 17, 2018, 1:24pm

Woah… it really turns out to be an issue with the newer Docker version 17.12.1-ce, coming with Ubuntu 18.04!! (I read through the docs again and found this specific Docker version requirement in https://rancher.com/docs/rancher/v2.5/en/cluster-provisioning/rke-clusters/custom-nodes/).

Deleted the cluster
Reset the cluster nodes and Rancher nodes
Installed Docker 17.03.2~ce from the download.docker.com repository on the cluster nodes (the Rancher nodes already had 17.03.2~ce installed, which is how I spotted the difference)
Reset the Rancher 2 environment (created a new Kubernetes deployment of Rancher with rke 1.9)
Created the new cluster within the Rancher UI
Added three nodes with the “All” role to the cluster
Added an additional node with the “Worker” role to the cluster
Shut down the first of the “All” role nodes

Result: The cluster recognizes that one node is down but continues to function.
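A version mismatch like this could be caught before adding a node. Below is a rough sketch, assuming Python 3 is available on the node. The `KNOWN_GOOD` set holds only the version that worked in this thread (check the Rancher docs for the full list for your release), and the helper names are hypothetical.

```python
import subprocess

# Assumption: only the version that worked in this thread is listed here.
KNOWN_GOOD = {"17.03.2-ce"}


def normalize(version: str) -> str:
    """Trim whitespace and turn package-style '17.03.2~ce' into '17.03.2-ce'."""
    return version.strip().replace("~", "-")


def is_known_good(version: str, allowed=KNOWN_GOOD) -> bool:
    """True if the (normalized) Docker version is in the allowed set."""
    return normalize(version) in allowed


def docker_server_version() -> str:
    """Ask the local Docker daemon for its server version string."""
    result = subprocess.run(
        ["docker", "version", "--format", "{{.Server.Version}}"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


# The two versions mentioned in this thread:
print(is_known_good("17.03.2~ce"))   # True
print(is_known_good("17.12.1-ce"))   # False
```

On a node with Docker running, `is_known_good(docker_server_version())` would perform the same check against the live daemon.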
