Unable to add new worker nodes to an existing Rancher cluster

Hi Team,
We have Rancher 2.4.4 deployed.
It supports three environments: SIT, UAT & PROD.

SIT & UAT env: 3 machines each, with both master & worker roles.

Prod env: 5 masters + 3 workers.

ISSUE:
We were able to add new worker nodes easily on the SIT & UAT environments using rke up.
But the production scaling activity has failed multiple times.

Every time we trigger rke up --config <config file>, we get this error:
Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?
INFO[0088] Starting container [cert-deployer] on host [10.124.56.172], try #2

(the container / service name usually changes, but the error remains the same)

Corrective action: Of the 5 masters, we have already faced this error on at least 3 masters to date. We end up restarting the Docker service on the affected master. The Docker restart on the affected system also needs to be forced (we have to force-stop some Docker processes). After the restart the affected node becomes normal, and then the next node starts throwing a similar error.
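
For reference, the recovery we run on an affected master looks roughly like this (a sketch; the exact processes that have to be force-killed differ each time, so the pkill target below is only illustrative):

# check whether the daemon actually answers on the socket
systemctl status docker
docker info    # hangs or errors out when the daemon is wedged

# forced restart (illustrative)
systemctl stop docker
pkill -9 dockerd
systemctl start docker

# confirm the node is healthy again before re-running rke up
docker info
docker ps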

Please suggest in case we are missing something.

Version
Docker version 19.03.8, build afacb8b
rke version v1.0.8
Kubernetes Version: v1.17.5
OS: CentOS Linux release 7.7.1908 (Core)
Env: AWS EC2 (m5.large)
The RKE server is able to SSH to ALL the k8s nodes.

Error snippet:

INFO[0046] Removing container [cert-deployer] on host [10.124.56.78], try #1
INFO[0046] Checking if container [cert-deployer] is running on host [10.124.56.51], try #1
INFO[0046] Removing container [cert-deployer] on host [10.124.56.51], try #1
INFO[0046] Checking if container [cert-deployer] is running on host [10.124.56.10], try #1
INFO[0046] Removing container [cert-deployer] on host [10.124.56.10], try #1
INFO[0047] Checking if container [cert-deployer] is running on host [10.124.56.169], try #1
INFO[0047] Removing container [cert-deployer] on host [10.124.56.169], try #1
INFO[0047] Checking if container [cert-deployer] is running on host [10.124.56.19], try #1
INFO[0047] Removing container [cert-deployer] on host [10.124.56.19], try #1
WARN[0088] Can't start Docker container [cert-deployer] on host [10.124.56.172]: Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?
INFO[0088] Starting container [cert-deployer] on host [10.124.56.172], try #2
WARN[0138] Can't start Docker container [cert-deployer] on host [10.124.56.172]: Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?
INFO[0138] Starting container [cert-deployer] on host [10.124.56.172], try #3
WARN[0188] Can't start Docker container [cert-deployer] on host [10.124.56.172]: Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?
FATA[0188] [Failed to start Certificates deployer container on host [10.124.56.172]: Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?]

Hi All,

We need to scale the Kubernetes cluster urgently, but we are stuck on this step with our production cluster. Please note that the Docker service is running on all the related systems.
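
For what it's worth, here is roughly how one could double-check that from the rke host (a sketch; the SSH user is a placeholder and the node list should be your own). A systemd unit showing as active does not necessarily mean the daemon answers on the socket, so the check runs docker info on each node:

# replace <ssh-user> and the node list with your own values
for ip in 10.124.56.172 10.124.56.78 10.124.56.51; do
  echo "--- $ip"
  ssh <ssh-user>@$ip 'timeout 15 docker info > /dev/null && echo OK || echo NOT RESPONDING'
done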

Please advise in case we are missing something.

Checking if /var/run/docker.sock exists, what its permissions are, and that you don't have SELinux on and blocking it, or things like that, would be my next step.
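
Something along these lines on the node that fails (just a sketch of the checks I mean; ausearch assumes the audit package is installed):

# does the socket exist, and who owns it?
ls -l /var/run/docker.sock
# does the daemon actually answer on it?
docker version
# SELinux mode, plus any recent denials mentioning docker
getenforce
ausearch -m avc -ts recent | grep -i docker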

Another thing might be the Docker version. I think the version in EPEL is too old for a lot of things, and you need to go grab docker-ce instead.
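
To see what you actually have installed, and to switch to docker-ce if needed, something like this (a sketch based on the standard docker-ce repo for CentOS; pin whatever Docker version your RKE release supports):

# which docker packages are installed, and from which repo
yum list installed | grep -i docker

# move to docker-ce (sketch)
yum remove -y docker docker-common docker-engine
yum install -y yum-utils
yum-config-manager --add-repo https://download.docker.com/linux/centos/docker-ce.repo
yum install -y docker-ce docker-ce-cli containerd.io
systemctl enable docker
systemctl start docker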