Can't access the Rancher dashboard in an HA install. Pods are failing to start with errors. Please help.

Steps to reproduce (least amount of steps as possible):

I am stumped and have no idea why this happened. The Rancher HA install had been running fine for more than a year. All pods failed on 29th Oct without any manual intervention, and the Rancher dashboard started returning 502 errors. I stopped the Docker service on all 3 Rancher nodes in the HA setup and then restarted it. Later, while going through the nginx ingress logs, I noticed that most of the pods had stopped, except for a few core components.

Services which are up:

  • kube-proxy

  • kubelet (with error)

  • kube-scheduler

  • kube-controller-manager

  • kube-apiserver

  • etcd

The rest fail to start on all 3 Rancher nodes.

I can provide direct access to the Rancher nodes if you need to look further. Also, please let me know if you need any other details. I will be really grateful if you can help me get Rancher up again. My livelihood depends on it. This is a humble request.

Other details that may be helpful:

The Rancher dashboard and kubectl are both inaccessible. I can't reach any of the Rancher components. I still have access to the hosts, but that's it. The error logs don't indicate anything that would explain why this happened. Maybe you can help me out here. I used the official tutorial to deploy Rancher in HA mode a year ago and never had to dig into the details after that. It had been reliable so far. I don't know what happened.

Environment information

  • Rancher version (rancher/rancher or rancher/server image tag, or shown bottom left in the UI):

  • Installation option (single install/HA):
    HA

Cluster information

  • Cluster type (Hosted/Infrastructure Provider/Custom/Imported):
    Infrastructure Provider

  • Machine type (cloud/VM/metal) and specifications (CPU/memory):

    Cloud

    CPU: 4 cores + 4 cores + 4 cores

    RAM: 8GB + 8GB + 8GB

  • Kubernetes version (use kubectl version):
    Kubectl not working. I can't access it.

  • Docker version (use docker version):

    Client:
      Version: 17.03.2-ce
      API version: 1.27
      Go version: go1.7.5
      Git commit: f5ec1e2
      Built: Tue Jun 27 03:35:14 2017
      OS/Arch: linux/amd64

    Server:
      Version: 17.03.2-ce
      API version: 1.27 (minimum version 1.12)
      Go version: go1.7.5
      Git commit: f5ec1e2
      Built: Tue Jun 27 03:35:14 2017
      OS/Arch: linux/amd64
      Experimental: false

What are the errors that kubelet is reporting?

I1031 22:00:42.346330    5723 kubelet_node_status.go:269] Setting node annotation to enable volume controller attach/detach
I1031 22:00:42.346585    5723 kubelet_node_status.go:453] Using node IP: "10.130.88.18"
I1031 22:00:42.349766    5723 kubelet_node_status.go:441] Recording NodeHasSufficientDisk event message for node 168.128.105.20
I1031 22:00:42.349794    5723 kubelet_node_status.go:441] Recording NodeHasSufficientMemory event message for node 168.128.105.20
I1031 22:00:42.349806    5723 kubelet_node_status.go:441] Recording NodeHasNoDiskPressure event message for node 168.128.105.20
I1031 22:00:42.349817    5723 kubelet_node_status.go:441] Recording NodeHasSufficientPID event message for node 168.128.105.20
I1031 22:00:42.349836    5723 kubelet_node_status.go:79] Attempting to register node 168.128.105.20
E1031 22:00:50.101387    5723 eviction_manager.go:243] eviction manager: failed to get get summary stats: failed to get node info: node "168.128.105.20" not found
E1031 22:00:51.137338    5723 reflector.go:205] k8s.io/kubernetes/pkg/kubelet/kubelet.go:464: Failed to list *v1.Node: Get https://127.0.0.1:6443/api/v1/nodes?fieldSelector=metadata.name%3D168.128.105.20&limit=500&resourceVersion=0: net/http: TLS handshake timeout
E1031 22:00:51.137451    5723 reflector.go:205] k8s.io/kubernetes/pkg/kubelet/kubelet.go:455: Failed to list *v1.Service: Get https://127.0.0.1:6443/api/v1/services?limit=500&resourceVersion=0: net/http: TLS handshake timeout
E1031 22:00:51.137602    5723 reflector.go:205] k8s.io/kubernetes/pkg/kubelet/config/apiserver.go:47: Failed to list *v1.Pod: Get https://127.0.0.1:6443/api/v1/pods?fieldSelector=spec.nodeName%3D168.128.105.20&limit=500&resourceVersion=0: net/http: TLS handshake timeout
E1031 22:00:52.350578    5723 kubelet_node_status.go:103] Unable to register node "168.128.105.20" with API server: Post https://127.0.0.1:6443/api/v1/nodes: net/http: TLS handshake timeout
I1031 22:00:59.350822    5723 kubelet_node_status.go:269] Setting node annotation to enable volume controller attach/detach
I1031 22:00:59.351088    5723 kubelet_node_status.go:453] Using node IP: "10.130.88.18"
I1031 22:00:59.352980    5723 kubelet_node_status.go:441] Recording NodeHasSufficientDisk event message for node 168.128.105.20
I1031 22:00:59.353010    5723 kubelet_node_status.go:441] Recording NodeHasSufficientMemory event message for node 168.128.105.20
I1031 22:00:59.353023    5723 kubelet_node_status.go:441] Recording NodeHasNoDiskPressure event message for node 168.128.105.20
I1031 22:00:59.353035    5723 kubelet_node_status.go:441] Recording NodeHasSufficientPID event message for node 168.128.105.20
I1031 22:00:59.353110    5723 kubelet_node_status.go:79] Attempting to register node 168.128.105.20
E1031 22:01:00.101685    5723 eviction_manager.go:243] eviction manager: failed to get get summary stats: failed to get node info: node "168.128.105.20" not found
E1031 22:01:00.137495    5723 event.go:212] Unable to write event: 'Patch https://127.0.0.1:6443/api/v1/namespaces/default/events/168.128.105.20.15d2c0159a08134c: net/http: TLS handshake timeout' (may retry after sleeping)
E1031 22:01:00.911905    5723 cni.go:280] Error deleting network: error getting ClusterInformation: Get https://10.43.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default: dial tcp 10.43.0.1:443: i/o timeout
E1031 22:01:00.913467    5723 remote_runtime.go:115] StopPodSandbox "29ea517091f861277a42c0ef4942ff44e9e7c935d1f0710567793bb350df3567" from runtime service failed: rpc error: code = Unknown desc = NetworkPlugin cni failed to teardown pod "metrics-server-97bc649d5-ssmxt_kube-system" network: error getting ClusterInformation: Get https://10.43.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default: dial tcp 10.43.0.1:443: i/o timeout
E1031 22:01:00.913510    5723 kuberuntime_gc.go:153] Failed to stop sandbox "29ea517091f861277a42c0ef4942ff44e9e7c935d1f0710567793bb350df3567" before removing: rpc error: code = Unknown desc = NetworkPlugin cni failed to teardown pod "metrics-server-97bc649d5-ssmxt_kube-system" network: error getting ClusterInformation: Get https://10.43.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default: dial tcp 10.43.0.1:443: i/o timeout
W1031 22:01:00.917619    5723 cni.go:243] CNI failed to retrieve network namespace path: cannot find network namespace for the terminated container "bf437df9913a1c32d17548f44c4b051f9256f0c0cc620865cef766f5e3952e3e"
E1031 22:01:01.624481    5723 kubelet_node_status.go:103] Unable to register node "168.128.105.20" with API server: Post https://127.0.0.1:6443/api/v1/nodes: read tcp 127.0.0.1:57934->127.0.0.1:6443: read: connection reset by peer
I1031 22:01:08.625269    5723 kubelet_node_status.go:269] Setting node annotation to enable volume controller attach/detach
I1031 22:01:08.626695    5723 kubelet_node_status.go:453] Using node IP: "10.130.88.18"
I1031 22:01:08.631777    5723 kubelet_node_status.go:441] Recording NodeHasSufficientDisk event message for node 168.128.105.20
I1031 22:01:08.631832    5723 kubelet_node_status.go:441] Recording NodeHasSufficientMemory event message for node 168.128.105.20
I1031 22:01:08.631854    5723 kubelet_node_status.go:441] Recording NodeHasNoDiskPressure event message for node 168.128.105.20
I1031 22:01:08.631876    5723 kubelet_node_status.go:441] Recording NodeHasSufficientPID event message for node 168.128.105.20
I1031 22:01:08.632005    5723 kubelet_node_status.go:79] Attempting to register node 168.128.105.20
E1031 22:01:10.101930    5723 eviction_manager.go:243] eviction manager: failed to get get summary stats: failed to get node info: node "168.128.105.20" not found
E1031 22:01:12.624129    5723 reflector.go:205] k8s.io/kubernetes/pkg/kubelet/kubelet.go:464: Failed to list *v1.Node: Get https://127.0.0.1:6443/api/v1/nodes?fieldSelector=metadata.name%3D168.128.105.20&limit=500&resourceVersion=0: net/http: TLS handshake timeout
E1031 22:01:12.624387    5723 reflector.go:205] k8s.io/kubernetes/pkg/kubelet/kubelet.go:455: Failed to list *v1.Service: Get https://127.0.0.1:6443/api/v1/services?limit=500&resourceVersion=0: net/http: TLS handshake timeout
E1031 22:01:12.624515    5723 reflector.go:205] k8s.io/kubernetes/pkg/kubelet/config/apiserver.go:47: Failed to list *v1.Pod: Get https://127.0.0.1:6443/api/v1/pods?fieldSelector=spec.nodeName%3D168.128.105.20&limit=500&resourceVersion=0: net/http: TLS handshake timeout
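
The log above repeats the same few failures every ten seconds or so, which makes the pattern hard to see. As a rough aid, here is a minimal Python sketch that tallies the error (E) lines by source file and by trailing failure reason. The names `KLOG_LINE`, `summarize`, and `SAMPLE` are made up for illustration, the regex assumes the usual klog header layout (severity letter, MMDD, time, PID, file:line), and the "reason" is simply the text after the last ": " in the message, which is a heuristic rather than anything the kubelet defines.

```python
import re
from collections import Counter

# Assumed klog header layout: severity letter, MMDD, time, PID, file:line, "] message".
KLOG_LINE = re.compile(
    r"^(?P<sev>[IWEF])(?P<date>\d{4}) (?P<time>\d{2}:\d{2}:\d{2}\.\d+)\s+"
    r"(?P<pid>\d+) (?P<src>[^\s:]+):(?P<line>\d+)\] (?P<msg>.*)$"
)


def summarize(log_text):
    """Count error (E) lines per source file and per trailing failure reason."""
    by_source = Counter()
    by_reason = Counter()
    for raw in log_text.splitlines():
        match = KLOG_LINE.match(raw.strip())
        if not match or match.group("sev") != "E":
            continue  # skip info/warning lines and anything that is not klog output
        by_source[match.group("src")] += 1
        # Heuristic: the fragment after the last ": " is usually the underlying cause.
        by_reason[match.group("msg").rsplit(": ", 1)[-1]] += 1
    return by_source, by_reason


# A few lines copied from the kubelet log above, used as sample input.
SAMPLE = """\
I1031 22:00:42.349836    5723 kubelet_node_status.go:79] Attempting to register node 168.128.105.20
E1031 22:00:51.137451    5723 reflector.go:205] k8s.io/kubernetes/pkg/kubelet/kubelet.go:455: Failed to list *v1.Service: Get https://127.0.0.1:6443/api/v1/services?limit=500&resourceVersion=0: net/http: TLS handshake timeout
E1031 22:00:52.350578    5723 kubelet_node_status.go:103] Unable to register node "168.128.105.20" with API server: Post https://127.0.0.1:6443/api/v1/nodes: net/http: TLS handshake timeout
E1031 22:01:00.911905    5723 cni.go:280] Error deleting network: error getting ClusterInformation: Get https://10.43.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default: dial tcp 10.43.0.1:443: i/o timeout
"""

if __name__ == "__main__":
    sources, reasons = summarize(SAMPLE)
    for reason, count in reasons.most_common():
        print(count, reason)
    for source, count in sources.most_common():
        print(count, source)
```

Run against the full kubelet log, a tally like this should make it obvious whether nearly every error traces back to the API server endpoint (127.0.0.1:6443 and the 10.43.0.1 service IP) rather than to the kubelet itself.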

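This looks like a problem with the kube-apiserver: the kubelet's requests to https://127.0.0.1:6443 are timing out during the TLS handshake. What do the kube-apiserver logs say?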
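
While collecting those logs, it may also be worth checking from the node itself whether port 6443 accepts TCP connections but stalls during the TLS handshake, since that is what the kubelet errors suggest. Below is a minimal sketch using only the Python standard library. The function name `probe_tls` and the 5 second default timeout are my own choices, and certificate verification is deliberately turned off because this only measures whether the handshake completes, not whether the certificate is valid.

```python
import socket
import ssl
import time


def probe_tls(host="127.0.0.1", port=6443, timeout=5.0):
    """Time the TCP connect and the TLS handshake separately.

    Returns a dict with 'tcp_connect' and 'tls_handshake' in seconds
    (None if that stage was not reached) and 'error' (None on success).
    """
    result = {"tcp_connect": None, "tls_handshake": None, "error": None}

    # Probe only: skip certificate and hostname verification (assumption: we
    # care about whether the handshake finishes, not about trust).
    ctx = ssl.create_default_context()
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE

    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            result["tcp_connect"] = time.monotonic() - start
            start = time.monotonic()
            # The handshake runs inside wrap_socket and inherits the socket timeout.
            with ctx.wrap_socket(sock, server_hostname=host):
                result["tls_handshake"] = time.monotonic() - start
    except OSError as exc:  # covers refused connections, timeouts, and ssl.SSLError
        result["error"] = f"{type(exc).__name__}: {exc}"
    return result


if __name__ == "__main__":
    print(probe_tls())
```

If `tcp_connect` is filled in but `tls_handshake` stays `None` with a timeout error, the apiserver process is listening but too slow or too stuck to answer, which would point toward the apiserver or its etcd backend rather than the network path. A refused connection would instead suggest the apiserver container is not running at all.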