Restore Snapshot - etcd nodes are unavailable and calico-node not running

vyhnguyen · October 16, 2019, 8:04pm

Steps to reproduce (as few steps as possible):
In Rancher UI, select Restore Snapshot

Result:

Some or all of the three etcd nodes show state UNAVAILABLE with the error “Run-time network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: cni config uninitialized”. This does not happen every time a snapshot is restored.
In other cases, the etcd nodes show state AVAILABLE, but the calico-node pods (in fact, all pods) are not running, so the application is not running either.
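The “cni config uninitialized” message generally means the kubelet found no CNI network configuration on the node (calico-node normally writes one when it starts). A minimal sketch for checking this on an affected node, assuming the default CNI config directory /etc/cni/net.d (an RKE/Rancher setup may mount or override this path) and the file extensions the kubelet's CNI loader scans for:

```python
import os

# Assumption: the kubelet's default CNI config directory; adjust if your
# cluster passes a different --cni-conf-dir.
DEFAULT_CNI_CONF_DIR = "/etc/cni/net.d"

# Extensions the kubelet's CNI loader considers when looking for a network config.
CNI_CONF_EXTENSIONS = (".conf", ".conflist", ".json")


def cni_config_files(conf_dir=DEFAULT_CNI_CONF_DIR):
    """Return the sorted CNI config file names found in conf_dir.

    An empty list is consistent with the kubelet reporting
    "cni config uninitialized" (NetworkPluginNotReady).
    """
    if not os.path.isdir(conf_dir):
        # A missing directory means there is certainly no CNI config.
        return []
    return sorted(
        name
        for name in os.listdir(conf_dir)
        if name.endswith(CNI_CONF_EXTENSIONS)
        and os.path.isfile(os.path.join(conf_dir, name))
    )
```

Running `print(cni_config_files())` on an UNAVAILABLE node right after the restore would show whether calico-node ever got far enough to write its config.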

Other details that may be helpful:
When the calico-node pods are not Running, redeploying the calico-node workload seems to fix it. However, the pods can also flap in and out of the Running state. Ultimately, I have to deploy the calico-node workload and recreate the calico-node DaemonSet.

Sometimes the cattle-cluster-agent, coredns, coredns-autoscaler, and metrics-server workloads show Active status, but their pods are not running. I’m not sure whether this is caused by calico-node not working.
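Because a workload can show Active while its pods are not running, it can help to look at the pods directly. A minimal sketch that reads the parsed JSON of `kubectl get pods --all-namespaces -o json` and lists the pods of the workloads mentioned above that are not fully running; the name prefixes are a hypothetical default and should be adjusted to the actual pod names in the cluster:

```python
# Hypothetical default: workload name prefixes mentioned in the report.
WATCHED_PREFIXES = (
    "calico-node",
    "cattle-cluster-agent",
    "coredns",  # also matches coredns-autoscaler pods
    "metrics-server",
)


def unhealthy_pods(pod_list, prefixes=WATCHED_PREFIXES):
    """Return (namespace, name, reason) for watched pods that are not fully running.

    pod_list is the parsed JSON of `kubectl get pods --all-namespaces -o json`.
    A pod counts as unhealthy if its phase is not "Running" or if any of its
    containers reports ready == False.
    """
    result = []
    for pod in pod_list.get("items", []):
        meta = pod.get("metadata", {})
        name = meta.get("name", "")
        if not name.startswith(prefixes):
            continue  # not one of the watched workloads
        status = pod.get("status", {})
        phase = status.get("phase", "Unknown")
        not_ready = [
            c.get("name", "?")
            for c in status.get("containerStatuses", [])
            if not c.get("ready", False)
        ]
        namespace = meta.get("namespace", "")
        if phase != "Running":
            result.append((namespace, name, "phase=" + phase))
        elif not_ready:
            result.append((namespace, name, "not ready: " + ",".join(not_ready)))
    return result
```

Comparing this list before and after redeploying calico-node would show whether the other workloads recover together with it.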

Environment information

Rancher version ( rancher/rancher / rancher/server image tag or shown bottom left in the UI):
v2.2.6
Installation option (single install/HA):
HA AirGap

Cluster information

Cluster type (Hosted/Infrastructure Provider/Custom/Imported):
Import (initializing local rancher cluster)
Machine type (cloud/VM/metal) and specifications (CPU/memory):
t2.large AWS ec2-instance
Kubernetes version (use kubectl version):
v1.14.3
Docker version (use docker version):
1.13.1
