Rancher Cluster Issue

Configuration:

docker-kube-server - 10.0.0.40 - runs the web services for the Rancher admin console (the "cluster controller")
docker-kube01 - 10.0.0.50 - Node Server
docker-kube02 - 10.0.0.51 - Node Server
docker-kube03 - 10.0.0.52 - Node Server
docker-kube04 - 10.0.0.53 - Node Server

The cluster has been running for about two months, and I've rebooted each "server" in the cluster at least once during that time with no issues. One of my pods seemed to be down, so I logged into the "cluster controller" to see what was up. I got past the login screen to the home page showing my two clusters; the cluster with the four nodes above was complaining about docker-kube03 and not being able to connect to its services. I rebooted docker-kube03 and tried to get further into the cluster to see what else I could find, but everything I clicked on gave me a 500 Internal Server Error, so I decided to reboot docker-kube-server as well. The server came back up, and Docker is running on it, but the Rancher server container keeps restarting.
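
For reference, this is roughly how I'm confirming the restart loop and grabbing the logs below (the `<rancher-container>` name is just a placeholder for whatever docker ps shows for the rancher/rancher image on this host):

```bash
# List all containers (including restarting ones) with their current status
sudo docker ps -a --format 'table {{.Names}}\t{{.Image}}\t{{.Status}}'

# Restart count and last exit code for the Rancher container
sudo docker inspect --format 'restarts={{.RestartCount}} exit={{.State.ExitCode}}' <rancher-container>

# Follow the logs while it cycles
sudo docker logs -f --tail 100 <rancher-container>
```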

The last log messages from the Rancher container show:

> 2022/06/09 00:50:57 [INFO] Watching metadata for rke-machine-config.cattle.io/v1, Kind=VmwarevsphereConfig
> 2022/06/09 00:50:57 [INFO] Watching metadata for /v1, Kind=Pod
> 2022/06/09 00:50:57 [INFO] Watching metadata for fleet.cattle.io/v1alpha1, Kind=BundleDeployment
> 2022/06/09 00:50:57 [INFO] Watching metadata for fleet.cattle.io/v1alpha1, Kind=ClusterRegistrationToken
> 2022/06/09 00:50:57 [INFO] Watching metadata for networking.k8s.io/v1, Kind=NetworkPolicy
> 2022/06/09 00:50:57 [INFO] Watching metadata for management.cattle.io/v3, Kind=DynamicSchema
> 2022/06/09 00:50:57 [INFO] Watching metadata for management.cattle.io/v3, Kind=GlobalDnsProvider
> 2022/06/09 00:50:57 [INFO] Watching metadata for rbac.authorization.k8s.io/v1, Kind=RoleBinding
> 2022/06/09 00:50:57 [INFO] Watching metadata for management.cattle.io/v3, Kind=TemplateContent
> 2022/06/09 00:50:57 [INFO] Watching metadata for monitoring.coreos.com/v1, Kind=Alertmanager
> 2022/06/09 00:50:57 [INFO] Watching metadata for management.cattle.io/v3, Kind=NodeDriver
> 2022/06/09 00:50:57 [INFO] Watching metadata for rbac.authorization.k8s.io/v1, Kind=Role
> 2022/06/09 00:50:57 [INFO] Watching metadata for fleet.cattle.io/v1alpha1, Kind=ClusterRegistration
> 2022/06/09 00:50:58 [INFO] Handling backend connection request [stv-cluster-c-2dvhf]
> 2022/06/09 00:50:59 [INFO] Starting catalog controller
> 2022/06/09 00:50:59 [INFO] Starting project-level catalog controller
> 2022/06/09 00:50:59 [INFO] Starting cluster-level catalog controller
> 2022/06/09 00:50:59 [ERROR] error parsing azure-group-cache-size, skipping update strconv.Atoi: parsing "": invalid syntax
> 2022/06/09 00:50:59 [INFO] Watching metadata for cluster.x-k8s.io/v1alpha3, Kind=Machine
> 2022/06/09 00:50:59 [INFO] Refreshing driverMetadata in 1440 minutes
> 2022/06/09 00:50:59 [INFO] Starting provisioning.cattle.io/v1, Kind=Cluster controller
> 2022/06/09 00:50:59 [INFO] Starting management.cattle.io/v3, Kind=ClusterAlertRule controller
> 2022/06/09 00:50:59 [INFO] Starting management.cattle.io/v3, Kind=GlobalDnsProvider controller
> 2022/06/09 00:50:59 [INFO] Starting management.cattle.io/v3, Kind=ClusterAlertGroup controller
> 2022/06/09 00:50:59 [INFO] Starting management.cattle.io/v3, Kind=EtcdBackup controller
> 2022/06/09 00:50:59 [INFO] update kontainerdriver rancherkubernetesengine
> 2022/06/09 00:50:59 [INFO] update kontainerdriver googlekubernetesengine
> 2022/06/09 00:50:59 [INFO] update kontainerdriver huaweicontainercloudengine
> 2022/06/09 00:50:59 [INFO] update kontainerdriver azurekubernetesservice
> 2022/06/09 00:50:59 [INFO] update kontainerdriver baiducloudcontainerengine
> 2022/06/09 00:50:59 [INFO] update kontainerdriver linodekubernetesengine
> 2022/06/09 00:50:59 [INFO] update kontainerdriver opentelekomcloudcontainerengine
> 2022/06/09 00:50:59 [INFO] update kontainerdriver oraclecontainerengine
> 2022/06/09 00:50:59 [INFO] update kontainerdriver tencentkubernetesengine
> 2022/06/09 00:50:59 [INFO] update kontainerdriver aliyunkubernetescontainerservice
> 2022/06/09 00:50:59 [INFO] update kontainerdriver amazonelasticcontainerservice
> 2022/06/09 00:50:59 [INFO] Starting management.cattle.io/v3, Kind=ComposeConfig controller
> 2022/06/09 00:50:59 [INFO] Rancher startup complete
> 2022/06/09 00:50:59 [INFO] checking configmap cattle-system/admincreated to determine if orphan bindings cleanup needs to run
> 2022/06/09 00:50:59 [INFO] checking configmap cattle-system/admincreated to determine if duplicate bindings cleanup needs to run
> 2022/06/09 00:50:59 [INFO] orphan bindings cleanup has already run, skipping
> 2022/06/09 00:50:59 [INFO] duplicate bindings cleanup has already run, skipping
> 2022/06/09 00:50:59 [INFO] Starting cluster.x-k8s.io/v1alpha3, Kind=Machine controller
> 2022/06/09 00:50:59 [INFO] Watching metadata for cluster.x-k8s.io/v1alpha3, Kind=MachineDeployment
> 2022/06/09 00:50:59 [INFO] Starting cluster.x-k8s.io/v1alpha3, Kind=MachineDeployment controller
> 2022/06/09 00:50:59 [INFO] Watching metadata for cluster.x-k8s.io/v1alpha3, Kind=Cluster
> 2022/06/09 00:50:59 [INFO] Starting cluster.x-k8s.io/v1alpha3, Kind=Cluster controller
> 2022/06/09 00:50:59 [INFO] Watching metadata for cluster.x-k8s.io/v1alpha3, Kind=MachineHealthCheck
> 2022/06/09 00:51:00 [INFO] Starting cluster.x-k8s.io/v1alpha3, Kind=MachineHealthCheck controller
> 2022/06/09 00:51:00 [INFO] Watching metadata for cluster.x-k8s.io/v1alpha3, Kind=MachineSet
> 2022/06/09 00:51:00 [INFO] Starting cluster.x-k8s.io/v1alpha3, Kind=MachineSet controller
> 2022/06/09 00:51:00 [INFO] Starting rke.cattle.io/v1, Kind=CustomMachine controller
> 2022/06/09 00:51:00 [INFO] Starting rke.cattle.io/v1, Kind=RKEControlPlane controller
> 2022/06/09 00:51:00 [INFO] Starting rke.cattle.io/v1, Kind=RKEBootstrap controller
> 2022/06/09 00:51:00 [INFO] Starting rke.cattle.io/v1, Kind=RKEBootstrapTemplate controller
> 2022/06/09 00:51:00 [INFO] Starting rke.cattle.io/v1, Kind=RKECluster controller
> 2022/06/09 00:51:00 [INFO] driverMetadata: refreshing data from upstream https://releases.rancher.com/kontainer-driver-metadata/release-v2.6/data.json
> 2022/06/09 00:51:00 [INFO] Retrieve data.json from local path /var/lib/rancher-data/driver-metadata/data.json
> 2022/06/09 00:51:02 [INFO] Handling backend connection request [c-2dvhf]
> 2022/06/09 00:51:02 [INFO] Handling backend connection request [c-2dvhf:m-769b9344cd70]
> 2022/06/09 00:51:03 [INFO] Handling backend connection request [c-2dvhf:m-3af711d2046e]
> 2022/06/09 00:51:04 [INFO] Creating token for user user-w7wcr
> 2022/06/09 00:51:04 [INFO] kontainerdriver googlekubernetesengine listening on address 127.0.0.1:34229
> 2022/06/09 00:51:04 [INFO] kontainerdriver azurekubernetesservice listening on address 127.0.0.1:38083
> 2022/06/09 00:51:04 [INFO] kontainerdriver amazonelasticcontainerservice listening on address 127.0.0.1:43537
> 2022/06/09 00:51:04 [INFO] kontainerdriver amazonelasticcontainerservice stopped
> 2022/06/09 00:51:04 [INFO] dynamic schema for kontainerdriver amazonelasticcontainerservice updating
> 2022/06/09 00:51:04 [INFO] kontainerdriver azurekubernetesservice stopped
> 2022/06/09 00:51:04 [INFO] dynamic schema for kontainerdriver azurekubernetesservice updating
> 2022/06/09 00:51:04 [INFO] kontainerdriver googlekubernetesengine stopped
> 2022/06/09 00:51:04 [INFO] dynamic schema for kontainerdriver googlekubernetesengine updating
> time="2022-06-09 00:51:04" level=info msg="Telemetry Client v0.5.16"
> time="2022-06-09 00:51:04" level=info msg="Listening on 0.0.0.0:8114"
> 2022/06/09 00:51:07 [ERROR] error syncing 'c-2dvhf': handler cluster-deploy: Get "https://10.0.0.52:6443/apis/apps/v1/namespaces/cattle-system/daemonsets/cattle-node-agent": dial tcp 10.0.0.53:6443: connect: connection refused, requeuing
> 2022/06/09 00:51:07 [ERROR] error syncing 'c-2dvhf': handler cluster-deploy: Get "https://10.0.0.52:6443/apis/apps/v1/namespaces/cattle-system/daemonsets/cattle-node-agent": dial tcp 10.0.0.53:6443: connect: connection refused, requeuing
> 2022/06/09 00:51:19 [ERROR] error syncing 'c-2dvhf': handler cluster-deploy: Get "https://10.0.0.52:6443/apis/apps/v1/namespaces/cattle-system/daemonsets/cattle-node-agent": dial tcp 10.0.0.50:6443: connect: no route to host, requeuing
> 2022/06/09 00:51:21 [INFO] Stopping cluster agent for c-2dvhf
> 2022/06/09 00:51:21 [ERROR] failed to start cluster controllers c-2dvhf: context canceled
> 2022/06/09 00:51:34 [INFO] error in remotedialer server [400]: websocket: close 1006 (abnormal closure): unexpected EOF
> 2022/06/09 00:51:34 [INFO] error in remotedialer server [400]: websocket: close 1006 (abnormal closure): unexpected EOF
> 2022/06/09 00:52:40 [INFO] Handling backend connection request [c-2dvhf:m-8243c0ae7c45]
> 2022/06/09 00:52:40 [INFO] error in remotedialer server [400]: websocket: close 1006 (abnormal closure): unexpected EOF
> 2022/06/09 00:52:59 [ERROR] error syncing 'c-2dvhf': handler cluster-deploy: the server was unable to return a response in the time allotted, but may still be processing the request (get daemonsets.meta.k8s.io cattle-node-agent), requeuing
> 2022/06/09 00:54:09 [INFO] Stopping cluster agent for c-2dvhf
> 2022/06/09 00:54:09 [ERROR] failed to start cluster controllers c-2dvhf: context canceled
> 2022/06/09 00:54:14 [ERROR] error syncing 'c-2dvhf': handler cluster-deploy: the server was unable to return a response in the time allotted, but may still be processing the request (get daemonsets.meta.k8s.io cattle-node-agent), requeuing
> 2022/06/09 00:55:29 [ERROR] error syncing 'c-2dvhf': handler cluster-deploy: the server was unable to return a response in the time allotted, but may still be processing the request (get daemonsets.meta.k8s.io cattle-node-agent), requeuing
> 2022/06/09 00:55:58 [INFO] Shutting down management.cattle.io/v3, Kind=ClusterAlert workers
> 2022/06/09 00:55:58 [INFO] Shutting down management.cattle.io/v3, Kind=ClusterLogging workers
> 2022/06/09 00:55:58 [INFO] Shutting down management.cattle.io/v3, Kind=Token workers
> 2022/06/09 00:55:58 [INFO] Shutting down management.cattle.io/v3, Kind=User workers
> 2022/06/09 00:55:58 [INFO] Shutting down rbac.authorization.k8s.io/v1, Kind=ClusterRoleBinding workers
> 2022/06/09 00:55:58 [INFO] Shutting down management.cattle.io/v3, Kind=GlobalRoleBinding workers
> 2022/06/09 00:55:58 [INFO] Shutting down management.cattle.io/v3, Kind=ProjectRoleTemplateBinding workers
> 2022/06/09 00:55:58 [INFO] Shutting down management.cattle.io/v3, Kind=Cluster workers
> 2022/06/09 00:55:58 [INFO] Shutting down /v1, Kind=Secret workers
> 2022/06/09 00:55:58 [INFO] Shutting down management.cattle.io/v3, Kind=GroupMember workers
> 2022/06/09 00:55:58 [INFO] Shutting down /v1, Kind=Secret workers
> 2022/06/09 00:55:58 [INFO] Shutting down /v1, Kind=Secret workers
> 2022/06/09 00:55:58 [FATAL] context canceled

So I decided to reboot the nodes, since Rancher didn't seem to be able to connect to them. The node servers came back up, but running sudo docker ps on them only shows the following:

> CONTAINER ID   IMAGE                                COMMAND                  CREATED        STATUS                          PORTS     NAMES
> a3d197fd7be9   rancher/hyperkube:v1.21.9-rancher1   "/opt/rke-tools/entr…"   2 months ago   Up 6 minutes                              kube-proxy
> 054bbed54668   rancher/hyperkube:v1.21.9-rancher1   "/opt/rke-tools/entr…"   2 months ago   Up 6 minutes                              kubelet
> 442c07906148   rancher/hyperkube:v1.21.9-rancher1   "/opt/rke-tools/entr…"   2 months ago   Up 6 minutes                              kube-scheduler
> 36b961471eb8   rancher/hyperkube:v1.21.9-rancher1   "/opt/rke-tools/entr…"   2 months ago   Up 6 minutes                              kube-controller-manager
> 34f17bca4d73   rancher/hyperkube:v1.21.9-rancher1   "/opt/rke-tools/entr…"   2 months ago   Up 9 seconds                              kube-apiserver
> 0cb76e6aa22b   rancher/rancher-agent:v2.6.3         "run.sh --no-registe…"   2 months ago   Restarting (1) 12 seconds ago             share-mnt
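
Worth noting: sudo docker ps only lists running containers, so anything that exited and never came back (the etcd container that RKE normally runs on each node, for example) wouldn't show up above. A rough way to check for that on each node would be something like this (treat "etcd" as the assumed default RKE container name):

```bash
# Include exited/stopped containers, not just running ones
sudo docker ps -a --format 'table {{.Names}}\t{{.Status}}\t{{.Image}}'

# If an etcd container exists but isn't running, its last log lines usually say why
sudo docker logs --tail 50 etcd
```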

The share-mnt container restarting is fine; it's the one that mounts the NFS location for the pods.
kube-apiserver, however, appears to be restarting over and over again. Its logs show:

> W0609 04:38:41.030291       1 clientconn.go:1223] grpc: addrConn.createTransport failed to connect to {https://10.0.0.53:2379  <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 10.0.0.53:2379: connect: no route to host". Reconnecting...
> W0609 04:38:42.435283       1 clientconn.go:1223] grpc: addrConn.createTransport failed to connect to {https://10.0.0.52:2379  <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 10.0.0.52:2379: connect: connection refused". Reconnecting...
> W0609 04:38:44.102208       1 clientconn.go:1223] grpc: addrConn.createTransport failed to connect to {https://10.0.0.53:2379  <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 10.0.0.53:2379: connect: no route to host". Reconnecting...
> W0609 04:38:44.102290       1 clientconn.go:1223] grpc: addrConn.createTransport failed to connect to {https://10.0.0.53:2379  <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 10.0.0.53:2379: connect: no route to host". Reconnecting...
> W0609 04:38:44.362627       1 clientconn.go:1223] grpc: addrConn.createTransport failed to connect to {https://10.0.0.52:2379  <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 10.0.0.52:2379: connect: connection refused". Reconnecting...
> W0609 04:38:48.064241       1 clientconn.go:1223] grpc: addrConn.createTransport failed to connect to {https://10.0.0.52:2379  <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 10.0.0.52:2379: connect: connection refused". Reconnecting...
> W0609 04:38:50.250202       1 clientconn.go:1223] grpc: addrConn.createTransport failed to connect to {https://10.0.0.53:2379  <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 10.0.0.53:2379: connect: no route to host". Reconnecting...
> W0609 04:38:50.250202       1 clientconn.go:1223] grpc: addrConn.createTransport failed to connect to {https://10.0.0.53:2379  <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 10.0.0.53:2379: connect: no route to host". Reconnecting...
> W0609 04:38:51.508829       1 clientconn.go:1223] grpc: addrConn.createTransport failed to connect to {https://10.0.0.52:2379  <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 10.0.0.52:2379: connect: connection refused". Reconnecting...
> Error: context deadline exceeded
> + grep -q cloud-provider=azure
> + echo kube-apiserver --cloud-provider= --authentication-token-webhook-config-file=/etc/kubernetes/kube-api-authn-webhook.yaml --bind-address=0.0.0.0 --kubelet-preferred-address-types=InternalIP,ExternalIP,Hostname --profiling=false --service-account-signing-key-file=/etc/kubernetes/ssl/kube-service-account-token-key.pem --insecure-port=0 --tls-cipher-suites=TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305,TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305 --etcd-certfile=/etc/kubernetes/ssl/kube-node.pem --etcd-prefix=/registry --proxy-client-cert-file=/etc/kubernetes/ssl/kube-apiserver-proxy-client.pem --proxy-client-key-file=/etc/kubernetes/ssl/kube-apiserver-proxy-client-key.pem --tls-cert-file=/etc/kubernetes/ssl/kube-apiserver.pem --enable-admission-plugins=NamespaceLifecycle,LimitRanger,ServiceAccount,DefaultStorageClass,DefaultTolerationSeconds,MutatingAdmissionWebhook,ValidatingAdmissionWebhook,ResourceQuota,NodeRestriction,Priority,TaintNodesByCondition,PersistentVolumeClaimResize --storage-backend=etcd3 --runtime-config=authorization.k8s.io/v1beta1=true --advertise-address=10.0.0.50 --etcd-servers=https://10.0.0.50:2379,https://10.0.0.52:2379,https://10.0.0.51:2379,https://10.0.0.53:2379 --kubelet-client-key=/etc/kubernetes/ssl/kube-apiserver-key.pem --api-audiences=unknown --service-account-issuer=rke --kubelet-client-certificate=/etc/kubernetes/ssl/kube-apiserver.pem --requestheader-client-ca-file=/etc/kubernetes/ssl/kube-apiserver-requestheader-ca.pem --allow-privileged=true --requestheader-group-headers=X-Remote-Group --authorization-mode=Node,RBAC --service-account-key-file=/etc/kubernetes/ssl/kube-service-account-token-key.pem --authentication-token-webhook-cache-ttl=5s --requestheader-username-headers=X-Remote-User --anonymous-auth=false --requestheader-allowed-names=kube-apiserver-proxy-client --etcd-keyfile=/etc/kubernetes/ssl/kube-node-key.pem --tls-private-key-file=/etc/kubernetes/ssl/kube-apiserver-key.pem --requestheader-extra-headers-prefix=X-Remote-Extra- --audit-log-path=/var/log/kube-audit/audit-log.json --audit-log-maxage=30 --client-ca-file=/etc/kubernetes/ssl/kube-ca.pem --etcd-cafile=/etc/kubernetes/ssl/kube-ca.pem --service-cluster-ip-range=10.43.0.0/16 --service-account-lookup=true --secure-port=6443 --audit-log-maxbackup=10 --audit-log-format=json --service-node-port-range=30000-32767 --audit-log-maxsize=100 --audit-policy-file=/etc/kubernetes/audit-policy.yaml
> + '[' kube-apiserver = kubelet ']'
> + exec kube-apiserver --cloud-provider= --authentication-token-webhook-config-file=/etc/kubernetes/kube-api-authn-webhook.yaml --bind-address=0.0.0.0 --kubelet-preferred-address-types=InternalIP,ExternalIP,Hostname --profiling=false --service-account-signing-key-file=/etc/kubernetes/ssl/kube-service-account-token-key.pem --insecure-port=0 --tls-cipher-suites=TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305,TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305 --etcd-certfile=/etc/kubernetes/ssl/kube-node.pem --etcd-prefix=/registry --proxy-client-cert-file=/etc/kubernetes/ssl/kube-apiserver-proxy-client.pem --proxy-client-key-file=/etc/kubernetes/ssl/kube-apiserver-proxy-client-key.pem --tls-cert-file=/etc/kubernetes/ssl/kube-apiserver.pem --enable-admission-plugins=NamespaceLifecycle,LimitRanger,ServiceAccount,DefaultStorageClass,DefaultTolerationSeconds,MutatingAdmissionWebhook,ValidatingAdmissionWebhook,ResourceQuota,NodeRestriction,Priority,TaintNodesByCondition,PersistentVolumeClaimResize --storage-backend=etcd3 --runtime-config=authorization.k8s.io/v1beta1=true --advertise-address=10.0.0.50 --etcd-servers=https://10.0.0.50:2379,https://10.0.0.52:2379,https://10.0.0.51:2379,https://10.0.0.53:2379 --kubelet-client-key=/etc/kubernetes/ssl/kube-apiserver-key.pem --api-audiences=unknown --service-account-issuer=rke --kubelet-client-certificate=/etc/kubernetes/ssl/kube-apiserver.pem --requestheader-client-ca-file=/etc/kubernetes/ssl/kube-apiserver-requestheader-ca.pem --allow-privileged=true --requestheader-group-headers=X-Remote-Group --authorization-mode=Node,RBAC --service-account-key-file=/etc/kubernetes/ssl/kube-service-account-token-key.pem --authentication-token-webhook-cache-ttl=5s --requestheader-username-headers=X-Remote-User --anonymous-auth=false --requestheader-allowed-names=kube-apiserver-proxy-client --etcd-keyfile=/etc/kubernetes/ssl/kube-node-key.pem --tls-private-key-file=/etc/kubernetes/ssl/kube-apiserver-key.pem --requestheader-extra-headers-prefix=X-Remote-Extra- --audit-log-path=/var/log/kube-audit/audit-log.json --audit-log-maxage=30 --client-ca-file=/etc/kubernetes/ssl/kube-ca.pem --etcd-cafile=/etc/kubernetes/ssl/kube-ca.pem --service-cluster-ip-range=10.43.0.0/16 --service-account-lookup=true --secure-port=6443 --audit-log-maxbackup=10 --audit-log-format=json --service-node-port-range=30000-32767 --audit-log-maxsize=100 --audit-policy-file=/etc/kubernetes/audit-policy.yaml
> Flag --insecure-port has been deprecated, This flag has no effect now and will be removed in v1.24.
> I0609 04:38:54.696461       1 server.go:629] external host was not specified, using 10.0.0.50
> I0609 04:38:54.696878       1 server.go:181] Version: v1.21.9
> W0609 04:38:55.122645       1 authentication.go:429] the webhook cache ttl of 5s is shorter than the overall cache ttl of 10s for successful token authentication attempts.
> I0609 04:38:55.125990       1 shared_informer.go:240] Waiting for caches to sync for node_authorizer
> I0609 04:38:55.127440       1 plugins.go:158] Loaded 12 mutating admission controller(s) successfully in the following order: NamespaceLifecycle,LimitRanger,ServiceAccount,NodeRestriction,TaintNodesByCondition,Priority,DefaultTolerationSeconds,DefaultStorageClass,StorageObjectInUseProtection,RuntimeClass,DefaultIngressClass,MutatingAdmissionWebhook.
> I0609 04:38:55.127462       1 plugins.go:161] Loaded 10 validating admission controller(s) successfully in the following order: LimitRanger,ServiceAccount,Priority,PersistentVolumeClaimResize,RuntimeClass,CertificateApproval,CertificateSigning,CertificateSubjectRestriction,ValidatingAdmissionWebhook,ResourceQuota.
> I0609 04:38:55.128571       1 plugins.go:158] Loaded 12 mutating admission controller(s) successfully in the following order: NamespaceLifecycle,LimitRanger,ServiceAccount,NodeRestriction,TaintNodesByCondition,Priority,DefaultTolerationSeconds,DefaultStorageClass,StorageObjectInUseProtection,RuntimeClass,DefaultIngressClass,MutatingAdmissionWebhook.
> I0609 04:38:55.128586       1 plugins.go:161] Loaded 10 validating admission controller(s) successfully in the following order: LimitRanger,ServiceAccount,Priority,PersistentVolumeClaimResize,RuntimeClass,CertificateApproval,CertificateSigning,CertificateSubjectRestriction,ValidatingAdmissionWebhook,ResourceQuota.
> I0609 04:38:55.130116       1 client.go:360] parsed scheme: "endpoint"
> I0609 04:38:55.130159       1 endpoint.go:68] ccResolverWrapper: sending new addresses to cc: [{https://10.0.0.50:2379  <nil> 0 <nil>} {https://10.0.0.52:2379  <nil> 0 <nil>} {https://10.0.0.51:2379  <nil> 0 <nil>} {https://10.0.0.53:2379  <nil> 0 <nil>}]
> W0609 04:38:55.130671       1 clientconn.go:1223] grpc: addrConn.createTransport failed to connect to {https://10.0.0.52:2379  <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 10.0.0.52:2379: connect: connection refused". Reconnecting...
> I0609 04:38:56.129962       1 client.go:360] parsed scheme: "endpoint"
> I0609 04:38:56.130063       1 endpoint.go:68] ccResolverWrapper: sending new addresses to cc: [{https://10.0.0.50:2379  <nil> 0 <nil>} {https://10.0.0.52:2379  <nil> 0 <nil>} {https://10.0.0.51:2379  <nil> 0 <nil>} {https://10.0.0.53:2379  <nil> 0 <nil>}]
> W0609 04:38:56.131107       1 clientconn.go:1223] grpc: addrConn.createTransport failed to connect to {https://10.0.0.52:2379  <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 10.0.0.52:2379: connect: connection refused". Reconnecting...
> W0609 04:38:56.131219       1 clientconn.go:1223] grpc: addrConn.createTransport failed to connect to {https://10.0.0.52:2379  <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 10.0.0.52:2379: connect: connection refused". Reconnecting...
> W0609 04:38:56.390809       1 clientconn.go:1223] grpc: addrConn.createTransport failed to connect to {https://10.0.0.53:2379  <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 10.0.0.53:2379: connect: no route to host". Reconnecting...
> W0609 04:38:56.391303       1 clientconn.go:1223] grpc: addrConn.createTransport failed to connect to {https://10.0.0.53:2379  <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 10.0.0.53:2379: connect: no route to host". Reconnecting...
> W0609 04:38:57.132620       1 clientconn.go:1223] grpc: addrConn.createTransport failed to connect to {https://10.0.0.52:2379  <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 10.0.0.52:2379: connect: connection refused". Reconnecting...
> W0609 04:38:57.863212       1 clientconn.go:1223] grpc: addrConn.createTransport failed to connect to {https://10.0.0.52:2379  <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 10.0.0.52:2379: connect: connection refused". Reconnecting...
> W0609 04:38:58.642941       1 clientconn.go:1223] grpc: addrConn.createTransport failed to connect to {https://10.0.0.52:2379  <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 10.0.0.52:2379: connect: connection refused". Reconnecting...
> W0609 04:38:59.462681       1 clientconn.go:1223] grpc: addrConn.createTransport failed to connect to {https://10.0.0.53:2379  <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 10.0.0.53:2379: connect: no route to host". Reconnecting...
> W0609 04:38:59.463422       1 clientconn.go:1223] grpc: addrConn.createTransport failed to connect to {https://10.0.0.53:2379  <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 10.0.0.53:2379: connect: no route to host". Reconnecting...
> W0609 04:38:59.987440       1 clientconn.go:1223] grpc: addrConn.createTransport failed to connect to {https://10.0.0.52:2379  <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 10.0.0.52:2379: connect: connection refused". Reconnecting...
> W0609 04:39:01.658268       1 clientconn.go:1223] grpc: addrConn.createTransport failed to connect to {https://10.0.0.52:2379  <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 10.0.0.52:2379: connect: connection refused". Reconnecting...
> W0609 04:39:02.534097       1 clientconn.go:1223] grpc: addrConn.createTransport failed to connect to {https://10.0.0.53:2379  <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 10.0.0.53:2379: connect: no route to host". Reconnecting...
> W0609 04:39:02.534131       1 clientconn.go:1223] grpc: addrConn.createTransport failed to connect to {https://10.0.0.53:2379  <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 10.0.0.53:2379: connect: no route to host". Reconnecting...
> W0609 04:39:04.791908       1 clientconn.go:1223] grpc: addrConn.createTransport failed to connect to {https://10.0.0.52:2379  <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 10.0.0.52:2379: connect: connection refused". Reconnecting...
> W0609 04:39:05.477358       1 clientconn.go:1223] grpc: addrConn.createTransport failed to connect to {https://10.0.0.52:2379  <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 10.0.0.52:2379: connect: connection refused". Reconnecting...
> W0609 04:39:05.606475       1 clientconn.go:1223] grpc: addrConn.createTransport failed to connect to {https://10.0.0.53:2379  <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 10.0.0.53:2379: connect: no route to host". Reconnecting...
> W0609 04:39:05.606509       1 clientconn.go:1223] grpc: addrConn.createTransport failed to connect to {https://10.0.0.53:2379  <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 10.0.0.53:2379: connect: no route to host". Reconnecting...

From everything I've been able to find out there, port 2379 is the etcd client port (not the Docker daemon, as I first assumed), so I'm guessing this is the API server trying to reach etcd on the other nodes rather than trying to start the other k8s_ containers like k8s_kube-api-auth, k8s_coredns, etc.

I've tested with a basic telnet and the ports don't seem to be blocked, but nothing appears to be listening on the other end. Am I looking at an etcd problem, a Docker daemon problem, or something else that has broken my cluster? I'm not even sure what direction to take at the moment. The controller web interface won't start, so I can't get deeper into the system to see what is going on, and the nodes won't start the API server piece, which I assume is what is breaking the controller.
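
For what it's worth, this is the kind of check I'd try next from the controller and between the nodes, assuming the RKE etcd container is named etcd (I believe that's the default, so treat the exact commands as a sketch):

```bash
# Is anything actually listening on the etcd client port on each node?
nc -zv 10.0.0.52 2379
nc -zv 10.0.0.53 2379

# On a node where the etcd container is running, ask etcd itself about the cluster
sudo docker exec etcd etcdctl member list
sudo docker exec etcd etcdctl endpoint health
```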

Just to complicate the issue, it looks like my fourth node, docker-kube04, has had a hard drive failure. Once I get the cluster back up, I'll have to find a way to forcibly remove the existing node information from the cluster, then rebuild the node and re-add it.
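
My rough plan for that part, once the API server is reachable again (node name as above, and assuming I still have a working kubeconfig for the downstream cluster):

```bash
# Remove the dead node object from the cluster
kubectl delete node docker-kube04

# After rebuilding the machine, wipe any old RKE state before re-adding it
# (abbreviated; Rancher's node cleanup docs list the full set of directories)
sudo rm -rf /etc/kubernetes /var/lib/etcd /var/lib/rancher /opt/rke
```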