I have a three-node cluster, and all three nodes run every role (control plane, etcd, worker). All are up and running on VMware (1 GB of memory apiece).
Over the weekend, without any change, I received the following error:
This cluster is currently Unavailable; areas that interact directly with it will not be available until the API is ready.
Failed to communicate with API server: Get https://10.10.10.172:6443/api/v1/componentstatuses: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
It turns out the node ran out of memory and the API server was killed; I have since bumped the memory up to 4 GB.
However, the cluster did go down… So with three etcd nodes, why is the cluster down? Shouldn't this be HA?
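For reference, these are the sort of checks that can confirm what happened on the node - a sketch only; the container names assume a stock RKE install, and etcdctl will likely also need the etcd TLS cert flags:

```
# Rough checks on the affected node (sketch; container names assume a stock RKE install,
# and etcdctl may need --cacert/--cert/--key pointing at the etcd certificates)
dmesg | grep -i "killed process"            # confirm the OOM killer fired
docker ps -a --filter name=kube-apiserver   # did the kube-apiserver container die or restart?
docker exec etcd etcdctl endpoint health    # health of the local etcd member
kubectl get componentstatuses               # overall view once the API answers again
```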
Thanks in advance!
Short answer - the cluster shouldn’t go down…
Same or similar issues:
GitHub issue (opened 03 Jun 2018, closed 06 Sep 2018; labels: kind/bug, priority/1, status/more-info, area/rke):
Hi,
what is the reason for the rancher 2 UI to decide when a cluster becomes unavailable?
I created a test cluster with rke consisting of three nodes: one with etcd and control, two nodes which are workers. Rancher 2 latest runs on a different server (not on this cluster).
Then I tried to simulate destruction of nodes by rebooting them.
If I reboot the 1st worker, the rancher UI is fine and tells me after a few seconds that this node is not available.
If I reboot the 2nd worker, the rancher UI tells me - in a big wide square - that my cluster is not available. I can't even launch a kubectl console in the UI because it's blocked.
I assumed that rebooting workers shouldn't make the whole cluster unavailable.
It's interesting that during both reboots I can access the cluster's API from a Linux shell with kubectl. Only the rancher UI acts strangely.
Greetz,
Josef
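That report matches the symptom here: the API itself stays reachable even while the UI calls the cluster unavailable. A quick way to confirm it outside of Rancher - a sketch, assuming a working kubeconfig for the cluster:

```
# If these answer, the cluster API is fine and only the Rancher UI health check is unhappy
kubectl get nodes
kubectl get componentstatuses
kubectl get --raw /healthz     # plain health probe against the API server
```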
GitHub issue (opened 24 May 2018, closed 06 Aug 2018; label: kind/bug):
**Rancher versions:** v2.0.2
**Steps to Reproduce:**
Create a cluster with the following node configuration:
1 control (n1)
1 etcd (n2)
1 worker (n3)
Add 1 more control node (n4)
Power down control node - n1.
Wait for the node to be marked "unavailable".
Try to create a daemon set.
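A rough kubectl sketch of that reproduction (the manifest and pod names are placeholders, not from the issue itself):

```
# Sketch of the reproduction above; hellotest-ds.yaml and <pod-name> are placeholders
kubectl get nodes                      # wait for n1 to show NotReady after the power-off
kubectl create -f hellotest-ds.yaml    # create the daemon set
kubectl get pods -o wide               # see which nodes the 3 pods land on and their status
kubectl describe pod <pod-name>        # events show the sandbox/CNI errors quoted below
```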
3 pods get created, of which only 1 pod was able to start successfully, and that one is on the new control node.
There is an attempt to start a pod on the worker node that fails with the following error:
```
Normal SuccessfulMountVolume 5m kubelet, ip-172-31-3-155 MountVolume.SetUp succeeded for volume "default-token-mqlnn"
Normal SandboxChanged 4m (x12 over 5m) kubelet, ip-172-31-3-155 Pod sandbox changed, it will be killed and re-created.
Warning FailedCreatePodSandBox 14s (x100 over 5m) kubelet, ip-172-31-3-155 Failed create pod sandbox: rpc error: code = Unknown desc = NetworkPlugin cni failed to set up pod "hellotest-qjr5t_default" network: error getting ClusterInformation: Get https://10.43.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default: dial tcp 10.43.0.1:443: getsockopt: no route to host
```
The other pod is attempted on the control node that is unavailable.
Deploying a pod with some scale results in all the pods getting deployed on the worker node, where they get stuck in the "ContainerCreating" state:
```
NAME                      READY   STATUS              RESTARTS   AGE
hello1-74f74757b9-5swtv   0/1     ContainerCreating   0          39m
hello1-74f74757b9-8rdp2   0/1     ContainerCreating   0          39m
hello1-74f74757b9-gz72c   0/1     ContainerCreating   0          39m
hello1-74f74757b9-xql8l   0/1     ContainerCreating   0          39m
```
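The "no route to host" against 10.43.0.1 suggests the cluster-internal API service has no healthy control-plane endpoint behind it. One way to check that - a sketch, with `<stuck-pod>` as a placeholder for one of the pods above:

```
# 10.43.0.1 is the cluster-internal service IP fronting the API server(s)
kubectl get svc kubernetes -n default        # the virtual service IP (10.43.0.1 here)
kubectl get endpoints kubernetes -n default  # should list a reachable control-plane address
kubectl describe pod <stuck-pod>             # shows the FailedCreatePodSandBox events again
```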
**Results:**
Well, I’m finding some very strange behavior in some very simple tests.
I have:
1 x VM running Rancher 2.0.1
3 x VM nodes, running all services (worker, etcd, control)
I have a few workloads deployed - mostly NGINX web servers. All those workloads are running on node1.
Everything is working well, the Web UI is responsive, kubectl is responding nicely, the workloads are working. kubectl and the web UI are going to the API on the VM hosting the Rancher server.
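To verify which endpoint kubectl is really talking to - a sketch, assuming the kubeconfig that Rancher generates for the cluster:

```
# Both print the API endpoint in use; with a Rancher-generated kubeconfig it points at the Rancher server
kubectl cluster-info
kubectl config view --minify -o jsonpath='{.clusters[0].cluster.server}'
```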
I then disconnect the netw…