Hi Team,
We have Rancher 2.4.4 deployed.
It supports 3 environments: SIT, UAT & PROD.
SIT & UAT env: 3 machines each, every machine acting as both master and worker.
Prod env: 5 masters + 3 workers.
ISSUE:
We were able to add new worker nodes easily on the SIT & UAT envs using rke up.
But the same scaling activity on production has failed multiple times.
Every time we run rke up --config <config file>, we get this error:
Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?
INFO[0088] Starting container [cert-deployer] on host [10.124.56.172], try #2
(The container name / service name usually changes, but the error remains the same.)
Corrective action: Of the 5 masters, at least 3 have hit this error to date. We end up restarting the Docker service on the affected master. The restart itself also has to be forced (we need to force-stop some Docker processes). After the Docker restart the affected node returns to normal, and then the next node starts throwing a similar error.
Please let us know if we are missing something.
Version
Docker version 19.03.8, build afacb8b
rke version v1.0.8
Kubernetes Version: v1.17.5
OS : CentOS Linux release 7.7.1908 (Core)
Env: AWS EC2 (m5.large)
RKE: the server running rke is able to SSH to all the k8s nodes.
error snippet
INFO[0046] Removing container [cert-deployer] on host [10.124.56.78], try #1
INFO[0046] Checking if container [cert-deployer] is running on host [10.124.56.51], try #1
INFO[0046] Removing container [cert-deployer] on host [10.124.56.51], try #1
INFO[0046] Checking if container [cert-deployer] is running on host [10.124.56.10], try #1
INFO[0046] Removing container [cert-deployer] on host [10.124.56.10], try #1
INFO[0047] Checking if container [cert-deployer] is running on host [10.124.56.169], try #1
INFO[0047] Removing container [cert-deployer] on host [10.124.56.169], try #1
INFO[0047] Checking if container [cert-deployer] is running on host [10.124.56.19], try #1
INFO[0047] Removing container [cert-deployer] on host [10.124.56.19], try #1
WARN[0088] Can't start Docker container [cert-deployer] on host [10.124.56.172]: Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?
INFO[0088] Starting container [cert-deployer] on host [10.124.56.172], try #2
WARN[0138] Can't start Docker container [cert-deployer] on host [10.124.56.172]: Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?
INFO[0138] Starting container [cert-deployer] on host [10.124.56.172], try #3
WARN[0188] Can't start Docker container [cert-deployer] on host [10.124.56.172]: Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?
FATA[0188] [Failed to start Certificates deployer container on host [10.124.56.172]: Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?]
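
Since the failing node changes from run to run, a small script that counts the Docker daemon errors per host in the rke output may help when comparing runs. This is only a rough sketch: the helper name failing_hosts is made up for illustration, and it assumes the WARN/FATA lines keep the exact format shown in the snippet above.

```python
import re
from collections import Counter

# Matches WARN/FATA lines that name a host and carry the Docker daemon
# connection error, in the format seen in the rke output above (assumption).
PATTERN = re.compile(
    r"^(?:WARN|FATA)\[\d+\].*?on host \[([0-9.]+)\]"
    r".*Cannot connect to the Docker daemon"
)


def failing_hosts(log_text):
    """Hypothetical helper: count Docker daemon connection errors per host."""
    counts = Counter()
    for line in log_text.splitlines():
        match = PATTERN.search(line.strip())
        if match:
            counts[match.group(1)] += 1
    return counts


# Two lines copied from the error snippet above, used as sample input.
sample = (
    "INFO[0046] Removing container [cert-deployer] on host [10.124.56.78], try #1\n"
    "WARN[0088] Can't start Docker container [cert-deployer] on host "
    "[10.124.56.172]: Cannot connect to the Docker daemon at "
    "unix:///var/run/docker.sock. Is the docker daemon running?\n"
)

for host, count in failing_hosts(sample).most_common():
    print(host, count)
```

Feeding it the saved output of each failed run would show whether the error keeps landing on the same masters or moves between them.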