Unable to add new worker nodes in existing rancher cluster

mail_swap_003 · November 9, 2021, 10:12am

Hi Team,
We have Rancher 2.4.4 deployed.
It supports 3 environments: SIT, UAT and PROD.

SIT and UAT envs: 3 machines each, with every machine acting as both master and worker.

Prod env: 5 masters + 3 workers.

ISSUE:
We were able to add new worker nodes easily on the SIT and UAT envs using rke up.
But the production scaling activity has failed multiple times.

Every time we run rke up --config <config file>,
we get this error:
Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?
INFO[0088] Starting container [cert-deployer] on host [10.124.56.172], try #2

(The container/service name usually changes, but the error remains the same.)

Corrective action: Of the 5 masters, we have already faced this error on at least 3 to date. We end up restarting the docker service on the affected master. The restart on the affected system also has to be forced (we need to force-stop some docker processes). After the docker restart the affected node returns to normal, and then the next node starts throwing a similar error.

Please suggest in case we are missing something.

Version
Docker version 19.03.8, build afacb8b
rke version v1.0.8
Kubernetes Version: v1.17.5
OS : CentOS Linux release 7.7.1908 (Core)
Env : AWS ec2 ( m5.large)
RKE: the RKE server is able to ssh to all the k8s nodes.

Error snippet:

INFO[0046] Removing container [cert-deployer] on host [10.124.56.78], try #1
INFO[0046] Checking if container [cert-deployer] is running on host [10.124.56.51], try #1
INFO[0046] Removing container [cert-deployer] on host [10.124.56.51], try #1
INFO[0046] Checking if container [cert-deployer] is running on host [10.124.56.10], try #1
INFO[0046] Removing container [cert-deployer] on host [10.124.56.10], try #1
INFO[0047] Checking if container [cert-deployer] is running on host [10.124.56.169], try #1
INFO[0047] Removing container [cert-deployer] on host [10.124.56.169], try #1
INFO[0047] Checking if container [cert-deployer] is running on host [10.124.56.19], try #1
INFO[0047] Removing container [cert-deployer] on host [10.124.56.19], try #1
WARN[0088] Can't start Docker container [cert-deployer] on host [10.124.56.172]: Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?
INFO[0088] Starting container [cert-deployer] on host [10.124.56.172], try #2
WARN[0138] Can't start Docker container [cert-deployer] on host [10.124.56.172]: Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?
INFO[0138] Starting container [cert-deployer] on host [10.124.56.172], try #3
WARN[0188] Can't start Docker container [cert-deployer] on host [10.124.56.172]: Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?
FATA[0188] [Failed to start Certificates deployer container on host [10.124.56.172]: Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?]
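
Since the failing host moves from one master to another between runs, it may help to tally the failures per host across several rke up logs. A minimal sketch, assuming the WARN lines keep the format shown in the snippet above (the function name failures_by_host is made up for illustration and is not part of rke):

```python
import re
import sys
from collections import Counter

# Matches rke WARN lines of the form shown in the snippet above, e.g.
# WARN[0088] Can't start Docker container [cert-deployer] on host [10.124.56.172]: ...
# Both straight and curly apostrophes are accepted (assumption: logs may be copy-pasted).
WARN_RE = re.compile(
    r"^WARN\[\d+\] Can['\u2019]t start Docker container \[(?P<container>[^\]]+)\] "
    r"on host \[(?P<host>[^\]]+)\]: (?P<reason>.*)$"
)


def failures_by_host(log_text):
    """Count failed container starts per (host, container) pair."""
    counts = Counter()
    for line in log_text.splitlines():
        match = WARN_RE.match(line.strip())
        if match:
            counts[(match.group("host"), match.group("container"))] += 1
    return counts


if __name__ == "__main__":
    # Usage: pipe a saved rke up log into this script on stdin.
    for (host, container), n in sorted(failures_by_host(sys.stdin.read()).items()):
        print(f"{host}\t{container}\t{n} failed start(s)")
```

Running it over the logs of several attempts shows whether the failures always land on the same few masters or rotate through all of them.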

mail_swap_003 · November 17, 2021, 8:11am

Hi All,

We need to scale the Kubernetes cluster urgently. But we are stuck on this step with our production cluster. Please note that the docker service is running on all the related systems.

Please advise in case we are missing something.

wcoateRR · November 17, 2021, 5:24pm

My next step would be checking whether /var/run/docker.sock exists, what its permissions are, whether SELinux is on and blocking it, and things like that.
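
Those checks could be scripted roughly like this and run on each affected node. This is only a sketch: the function names are made up, it assumes Docker's /_ping API endpoint answers on the socket, and it assumes SELinux exposes its mode at /sys/fs/selinux/enforce (the file is absent when SELinux is disabled). It also tells you whether the daemon actually answers on the socket, which is not the same thing as the service showing as running.

```python
import os
import socket
import stat


def inspect_docker_socket(path="/var/run/docker.sock", timeout=5.0):
    """Report whether the socket exists, its permissions, and whether the daemon replies."""
    report = {"path": path, "exists": False, "is_socket": False, "mode": None,
              "uid": None, "gid": None, "responds": False, "error": None}
    try:
        st = os.stat(path)
    except OSError as exc:
        report["error"] = str(exc)
        return report
    report["exists"] = True
    report["is_socket"] = stat.S_ISSOCK(st.st_mode)
    report["mode"] = oct(stat.S_IMODE(st.st_mode))
    report["uid"], report["gid"] = st.st_uid, st.st_gid
    if not report["is_socket"]:
        report["error"] = "path exists but is not a Unix socket"
        return report
    try:
        with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as sock:
            sock.settimeout(timeout)
            sock.connect(path)
            # Plain HTTP over the Unix socket; /_ping is Docker's health endpoint.
            sock.sendall(b"GET /_ping HTTP/1.0\r\nHost: docker\r\n\r\n")
            reply = sock.recv(1024)
        status_line = reply.split(b"\r\n", 1)[0]
        report["responds"] = status_line.startswith(b"HTTP/1.") and b" 200 " in status_line
    except OSError as exc:  # includes timeouts and permission errors
        report["error"] = str(exc)
    return report


def selinux_mode(enforce_path="/sys/fs/selinux/enforce"):
    """Return 'enforcing', 'permissive', or a note that SELinux is not active."""
    try:
        with open(enforce_path) as fh:
            return "enforcing" if fh.read().strip() == "1" else "permissive"
    except OSError:
        return "disabled or not available"


if __name__ == "__main__":
    print(inspect_docker_socket())
    print("SELinux:", selinux_mode())
```

If exists is True but responds is False while the service shows as active, the daemon is probably hung rather than stopped, which would match having to force the restart.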

Another thing might be docker version. I think the version in EPEL is too old for a lot of things and you need to go grab docker-ce instead.
