Steps to reproduce (fewest steps possible):
I am stumped and have no idea why this happened. The Rancher HA setup had been running fine for more than a year. On 29 Oct, all pods failed without any manual intervention, and accessing the Rancher dashboard returned 502 errors. I stopped the Docker service on all 3 Rancher nodes in the HA cluster and restarted it. Later, while going through the nginx ingress logs, I noticed that most of the pods had stopped, except for a few core components.
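For reference, this is roughly what I ran on each node (assuming systemd manages Docker; the grep pattern for the ingress container name is from memory and may need adjusting):

```sh
# Restart Docker on each of the 3 Rancher nodes
systemctl stop docker
systemctl start docker

# Tail the nginx ingress controller logs; the container names include a
# pod hash, so grep for them (-a also lists stopped containers)
docker ps -a --format '{{.Names}}' | grep nginx-ingress-controller \
  | xargs -r -n1 docker logs --tail 50
```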
Services which are up:
kube-proxy
kubelet (with error)
kube-scheduler
kube-controller-manager
kube-apiserver
etcd
The rest of them fail to start across all 3 Rancher nodes.
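Since the core components run as plain Docker containers on RKE-provisioned nodes, their state can be checked from the host even with kubectl down. This is roughly how I verified which ones are up (container names assume the standard RKE naming):

```sh
# Show the state of each RKE core component, including stopped ones
docker ps -a --format 'table {{.Names}}\t{{.Status}}' \
  --filter name=kube-apiserver --filter name=kube-controller-manager \
  --filter name=kube-scheduler --filter name=kubelet \
  --filter name=kube-proxy --filter name=etcd
```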
I can provide direct access to the Rancher nodes if you need to look further. Also, please let me know if you need any other details. I will be really grateful if you can help me get Rancher up again; my livelihood depends on it. This is a humble request.
Other details that may be helpful:
The Rancher dashboard and kubectl are inaccessible. I can't access any of the Rancher components; I still have access to the hosts, but that's it. The error logs don't indicate anything that would point to the reason this happened. Maybe you can help me out here. I used the official tutorial to deploy Rancher in HA mode a year ago and never had to dig into its internals after that. It was reliable until now; I don't know what happened.
Environment information
Cluster information
Cluster type (Hosted/Infrastructure Provider/Custom/Imported):
Infrastructure Provider
Machine type (cloud/VM/metal) and specifications (CPU/memory):
Cloud
CPU: 4 cores + 4 cores + 4 cores
RAM: 8 GB + 8 GB + 8 GB
Kubernetes version (use kubectl version):
kubectl is not working; I can't access the cluster, so I can't retrieve this.
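For what it's worth, I tried roughly the following from the machine where RKE was originally run (the kubeconfig file name assumes the HA tutorial's rancher-cluster.yml config), and it cannot reach the API server:

```sh
kubectl --kubeconfig kube_config_rancher-cluster.yml version
kubectl --kubeconfig kube_config_rancher-cluster.yml get nodes
```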
Docker version (use docker version):
Client:
Version: 17.03.2-ce
API version: 1.27
Go version: go1.7.5
Git commit: f5ec1e2
Built: Tue Jun 27 03:35:14 2017
OS/Arch: linux/amd64
Server:
Version: 17.03.2-ce
API version: 1.27 (minimum version 1.12)
Go version: go1.7.5
Git commit: f5ec1e2
Built: Tue Jun 27 03:35:14 2017
OS/Arch: linux/amd64
Experimental: false
lovedigit:
> kubelet (with error)

What are the errors that kubelet is reporting?
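On an RKE node the kubelet runs as a Docker container named kubelet, so something like this should pull the recent errors and warnings (a sketch, assuming the standard RKE container name):

```sh
# Grab the last 200 kubelet log lines and keep errors (E...) and warnings (W...)
docker logs --tail 200 kubelet 2>&1 | grep -E '^[EW]'
```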
I1031 22:00:42.346330 5723 kubelet_node_status.go:269] Setting node annotation to enable volume controller attach/detach
I1031 22:00:42.346585 5723 kubelet_node_status.go:453] Using node IP: "10.130.88.18"
I1031 22:00:42.349766 5723 kubelet_node_status.go:441] Recording NodeHasSufficientDisk event message for node 168.128.105.20
I1031 22:00:42.349794 5723 kubelet_node_status.go:441] Recording NodeHasSufficientMemory event message for node 168.128.105.20
I1031 22:00:42.349806 5723 kubelet_node_status.go:441] Recording NodeHasNoDiskPressure event message for node 168.128.105.20
I1031 22:00:42.349817 5723 kubelet_node_status.go:441] Recording NodeHasSufficientPID event message for node 168.128.105.20
I1031 22:00:42.349836 5723 kubelet_node_status.go:79] Attempting to register node 168.128.105.20
E1031 22:00:50.101387 5723 eviction_manager.go:243] eviction manager: failed to get get summary stats: failed to get node info: node "168.128.105.20" not found
E1031 22:00:51.137338 5723 reflector.go:205] k8s.io/kubernetes/pkg/kubelet/kubelet.go:464: Failed to list *v1.Node: Get https://127.0.0.1:6443/api/v1/nodes?fieldSelector=metadata.name%3D168.128.105.20&limit=500&resourceVersion=0: net/http: TLS handshake timeout
E1031 22:00:51.137451 5723 reflector.go:205] k8s.io/kubernetes/pkg/kubelet/kubelet.go:455: Failed to list *v1.Service: Get https://127.0.0.1:6443/api/v1/services?limit=500&resourceVersion=0: net/http: TLS handshake timeout
E1031 22:00:51.137602 5723 reflector.go:205] k8s.io/kubernetes/pkg/kubelet/config/apiserver.go:47: Failed to list *v1.Pod: Get https://127.0.0.1:6443/api/v1/pods?fieldSelector=spec.nodeName%3D168.128.105.20&limit=500&resourceVersion=0: net/http: TLS handshake timeout
E1031 22:00:52.350578 5723 kubelet_node_status.go:103] Unable to register node "168.128.105.20" with API server: Post https://127.0.0.1:6443/api/v1/nodes: net/http: TLS handshake timeout
I1031 22:00:59.350822 5723 kubelet_node_status.go:269] Setting node annotation to enable volume controller attach/detach
I1031 22:00:59.351088 5723 kubelet_node_status.go:453] Using node IP: "10.130.88.18"
I1031 22:00:59.352980 5723 kubelet_node_status.go:441] Recording NodeHasSufficientDisk event message for node 168.128.105.20
I1031 22:00:59.353010 5723 kubelet_node_status.go:441] Recording NodeHasSufficientMemory event message for node 168.128.105.20
I1031 22:00:59.353023 5723 kubelet_node_status.go:441] Recording NodeHasNoDiskPressure event message for node 168.128.105.20
I1031 22:00:59.353035 5723 kubelet_node_status.go:441] Recording NodeHasSufficientPID event message for node 168.128.105.20
I1031 22:00:59.353110 5723 kubelet_node_status.go:79] Attempting to register node 168.128.105.20
E1031 22:01:00.101685 5723 eviction_manager.go:243] eviction manager: failed to get get summary stats: failed to get node info: node "168.128.105.20" not found
E1031 22:01:00.137495 5723 event.go:212] Unable to write event: 'Patch https://127.0.0.1:6443/api/v1/namespaces/default/events/168.128.105.20.15d2c0159a08134c: net/http: TLS handshake timeout' (may retry after sleeping)
E1031 22:01:00.911905 5723 cni.go:280] Error deleting network: error getting ClusterInformation: Get https://10.43.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default: dial tcp 10.43.0.1:443: i/o timeout
E1031 22:01:00.913467 5723 remote_runtime.go:115] StopPodSandbox "29ea517091f861277a42c0ef4942ff44e9e7c935d1f0710567793bb350df3567" from runtime service failed: rpc error: code = Unknown desc = NetworkPlugin cni failed to teardown pod "metrics-server-97bc649d5-ssmxt_kube-system" network: error getting ClusterInformation: Get https://10.43.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default: dial tcp 10.43.0.1:443: i/o timeout
E1031 22:01:00.913510 5723 kuberuntime_gc.go:153] Failed to stop sandbox "29ea517091f861277a42c0ef4942ff44e9e7c935d1f0710567793bb350df3567" before removing: rpc error: code = Unknown desc = NetworkPlugin cni failed to teardown pod "metrics-server-97bc649d5-ssmxt_kube-system" network: error getting ClusterInformation: Get https://10.43.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default: dial tcp 10.43.0.1:443: i/o timeout
W1031 22:01:00.917619 5723 cni.go:243] CNI failed to retrieve network namespace path: cannot find network namespace for the terminated container "bf437df9913a1c32d17548f44c4b051f9256f0c0cc620865cef766f5e3952e3e"
E1031 22:01:01.624481 5723 kubelet_node_status.go:103] Unable to register node "168.128.105.20" with API server: Post https://127.0.0.1:6443/api/v1/nodes: read tcp 127.0.0.1:57934->127.0.0.1:6443: read: connection reset by peer
I1031 22:01:08.625269 5723 kubelet_node_status.go:269] Setting node annotation to enable volume controller attach/detach
I1031 22:01:08.626695 5723 kubelet_node_status.go:453] Using node IP: "10.130.88.18"
I1031 22:01:08.631777 5723 kubelet_node_status.go:441] Recording NodeHasSufficientDisk event message for node 168.128.105.20
I1031 22:01:08.631832 5723 kubelet_node_status.go:441] Recording NodeHasSufficientMemory event message for node 168.128.105.20
I1031 22:01:08.631854 5723 kubelet_node_status.go:441] Recording NodeHasNoDiskPressure event message for node 168.128.105.20
I1031 22:01:08.631876 5723 kubelet_node_status.go:441] Recording NodeHasSufficientPID event message for node 168.128.105.20
I1031 22:01:08.632005 5723 kubelet_node_status.go:79] Attempting to register node 168.128.105.20
E1031 22:01:10.101930 5723 eviction_manager.go:243] eviction manager: failed to get get summary stats: failed to get node info: node "168.128.105.20" not found
E1031 22:01:12.624129 5723 reflector.go:205] k8s.io/kubernetes/pkg/kubelet/kubelet.go:464: Failed to list *v1.Node: Get https://127.0.0.1:6443/api/v1/nodes?fieldSelector=metadata.name%3D168.128.105.20&limit=500&resourceVersion=0: net/http: TLS handshake timeout
E1031 22:01:12.624387 5723 reflector.go:205] k8s.io/kubernetes/pkg/kubelet/kubelet.go:455: Failed to list *v1.Service: Get https://127.0.0.1:6443/api/v1/services?limit=500&resourceVersion=0: net/http: TLS handshake timeout
E1031 22:01:12.624515 5723 reflector.go:205] k8s.io/kubernetes/pkg/kubelet/config/apiserver.go:47: Failed to list *v1.Pod: Get https://127.0.0.1:6443/api/v1/pods?fieldSelector=spec.nodeName%3D168.128.105.20&limit=500&resourceVersion=0: net/http: TLS handshake timeout
Looks like a problem in the api-server. What do those logs say?
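On RKE nodes those logs can be pulled straight from the host, e.g. (a sketch; the container names and certificate path assume RKE defaults, and older RKE releases issued certificates valid for one year, which would line up with TLS handshake timeouts appearing after about a year of uptime):

```sh
# Recent control-plane logs on each node
docker logs --tail 100 kube-apiserver
docker logs --tail 100 etcd

# Check whether the apiserver serving certificate has expired
# (path assumes RKE's default certificate location)
openssl x509 -noout -enddate -in /etc/kubernetes/ssl/kube-apiserver.pem

# Or exercise the TLS handshake directly against the local apiserver
openssl s_client -connect 127.0.0.1:6443 </dev/null 2>/dev/null \
  | openssl x509 -noout -dates
```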