Issue Creating New Cluster on v2.5.8

Hi Everyone,

Long-time user, first time experiencing an issue with Rancher. I’m trying to provision a new cluster and get the following error:

Rancher Version: 2.5.8 (Pulled today)
Kubernetes Version: 1.20.6-rancher1-1
Node template creation method: Install from boot2docker ISO (Legacy)
OS ISO URL: https://github.com/rancher/os/releases/download/v1.5.8/rancheros.iso

It connects to vSphere and creates the VMs without issue.

After the above error, it starts a health check on the same node, which passes. It then starts etcd without issue, hits the same error, and restarts the process. It’s been looping like this for a few hours, and the only advice I’ve found for this issue is “use 2.5.8” (because of a recent patch in that version), which I’m already on.

I can PM logs to whoever needs them as well.

Please advise as to what I may be able to try to fix this.

Thanks!

docker logs kubelet from the node that is in the error state is probably most helpful, plus the logs from the Rancher container (at least the part that shows the provisioning steps, from the beginning through to the error).
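
Something like this, assuming the Rancher server container is simply named rancher (check docker ps on each host for the actual names):

# on the failing node
docker logs kubelet 2>&1 | tail -n 200
# on the host running the Rancher server container (name assumed)
docker logs rancher 2>&1 | grep -i provision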

Thanks for your reply!

I cannot get into the node. I SSH in with id_rsa, and it prompts for a password but won’t accept my home lab password. Is there a way to recover it?

Sorry, I have never had to SSH into a node before!

Thanks Again!

What username are you using? I think it’s either docker or rancher, and you need to use the SSH key downloaded from Rancher for the node.
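
Assuming you downloaded the keys from the node’s menu in the Rancher UI, it would look something like this (the zip is named after the node, so sg-master1 here is just an example):

# the downloaded archive contains id_rsa for that machine
unzip sg-master1.zip
# try both usernames with the extracted key
ssh -i id_rsa rancher@<node-ip>
ssh -i id_rsa docker@<node-ip>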

I am using “Rancher” with the id_rsa key. Which SSH key is used? Is it “key.pem”?

Thanks!

I’m now getting the following error:

Not too sure what to do with this. I rebuilt the cluster because I no longer had the private SSH key from my previous cloud-config file. Now I have it, but it still asks for a password. I’m not sure what that password would be.

Edit: I also built a new cluster on 19.10, and that gave the same error. It’s able to pull the RancherOS image and my cloud-config file, so I don’t believe internet access is the issue.

Thanks!

See Rancher Docs: Technical for accessing nodes. I’m not really sure what is going on with all the different errors you are presenting. The best way to get an idea of the issue is to get on the node and check the logs of the rancher/rancher-agent container and the various k8s containers. In the case of the last error, a docker pull would be a start.
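
For example, once on a node (container names come from docker ps -a; the hyperkube tag is taken from the Kubernetes version above):

# list all containers to find the agent/k8s container names
docker ps -a
# agent logs (exact name may differ)
docker logs --tail 100 rancher-agent
# test whether the node can pull the image from the last error
docker pull rancher/hyperkube:v1.20.6-rancher1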

Let me know if you are unable to SSH into the nodes still.

So each node is having the exact same problem. For each, I SSH’d in and ran the requested commands. They all returned identical information:


When I ran the following command:

docker pull rancher/hyperkube:v1.20.6-rancher1

Each node tells me it’s out of space, which is odd, as I’ve never had to provision space before. I checked the node template, and it allocates a little under 20 GB. I can increase it, but I’d prefer to find out why 20 GB is filled. So I tried running the following from the root of the drive:

sudo su -
cd
du -sh *

This produced the following for all nodes (identical):

[root@sg-master1 /]# du -sh *
0 bin
0 dev
7.5M etc
8.0K home
0 host
0 lib
0 lib64
0 media
0 mnt
122.2M opt
0 proc
0 root
948.0K run
0 sbin
0 sys
4.0K tmp
214.9M usr
1.7G var
[root@sg-master1 /]#

As you can see, there is nowhere near 20 GB there. What’s weird is that the docker pull command gets to the last megabyte of the download and then fails.

I extended my node template to 25 GB just to test, changed to a single node, and let it build. Same issue:

So I’ve done some additional troubleshooting:

  1. Ran a prune on Docker images and containers, which reclaimed 159 MB of space. This got me through pulling all but the last two lines of the above screenshot; extraction failed.
  2. I did further research, thinking it was a Docker problem rather than a Rancher problem. I found some things to try with the files in the Docker directory, but when I went to work on them, they weren’t there. Another set of steps was to change the allotted file size in daemon.json, but that file was missing too. I confirmed that the node itself is at 25% space used on 25 GB. Since none of the config files are where the Docker documentation says they should be, I figured it may be related to RancherOS specifically (see the sketch after this list).
  3. My previous setup (before it blew up) was using the RancherOS 1.5.6 image, while my new cluster was not, so I made node templates using 1.5.6 and spun up a cluster. Unfortunately, I reached the same abrupt conclusion, with the exact same behavior described above.
  4. I ran the following command:

docker system prune --all --force --volumes

This allowed me to get farther along in the pull:

Here is the current storage capacity system-wide (output of df -h):

Filesystem                Size      Used Available Use% Mounted on
overlay                   1.9G    489.5M      1.4G  25% /
tmpfs                     1.9G         0      1.9G   0% /dev
tmpfs                     1.9G         0      1.9G   0% /sys/fs/cgroup
tmpfs                     1.9G         0      1.9G   0% /media
none                      1.9G    928.0K      1.9G   0% /run
tmpfs                     1.9G         0      1.9G   0% /mnt
none                      1.9G    928.0K      1.9G   0% /var/run
devtmpfs                  1.9G         0      1.9G   0% /host/dev
shm                      64.0M         0     64.0M   0% /host/dev/shm
tmpfs                     1.9G    489.5M      1.4G  25% /etc/hostname
shm                      64.0M         0     64.0M   0% /dev/shm
devtmpfs                  1.9G         0      1.9G   0% /dev
shm                      64.0M         0     64.0M   0% /dev/shm
overlay                   1.9G      1.2G    676.7M  65% /var/lib/docker/overlay2/b1388a705ae818c5993ae98af360d815a22185e1d48c21bab4b64f58bdbaa243/merged
overlay                   1.9G      1.2G    676.7M  65% /var/lib/docker/overlay2/1b5eb117e063cd17898b7d990818af49a5a2ffb5a4732107ff8a6d956db7c0c3/merged
shm                      64.0M         0     64.0M   0% /var/lib/docker/containers/661c72bedf38d29f4f2e0d9574d61448131c9a395c6c4ec4a1013ec9684a528b/mounts/shm
  5. (Ultimately grasping at straws…) I then spun up a new Linux VM and created a new Rancher instance on the last version my previous setup was using before it blew up on me (2.5.5). The results of that test were identical to the above.
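
(For reference, regarding item 2 above: on RancherOS, the Docker daemon is configured through the ros CLI and cloud-config rather than /etc/docker/daemon.json, which would explain the missing files. A sketch based on the RancherOS docs, untested in this setup:)

# set a daemon option via RancherOS config instead of daemon.json
sudo ros config set rancher.docker.storage_driver overlay2
# user-level Docker runs as a system container, so restart it via system-docker
sudo system-docker restart docker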

All of the above makes me think it has something to do with a config somewhere. I’m using the same version of Rancher, RancherOS image, credentials to my hosting infrastructure, etc.

Please let me know if there is anything further I can do on my end as far as information or testing goes. I’m at a complete loss for what I am missing!

Is it possible that you’re just booting from the ISO and not actually performing an ROS install to disk? The Googles have lots of posts about super low volume sizes for /var/lib/docker when running ROS from memory.
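
A quick way to check, assuming a standard ROS setup: an install to disk leaves a partition labeled RANCHER_STATE, while a RAM boot shows / as an overlay roughly half the size of the VM’s memory:

# no output here suggests ROS was never installed to disk
sudo blkid | grep RANCHER_STATE
# a ~2G root on a 4G VM also points to running from memory
df -h /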

Excellent question!

I honestly do not know. Here are the settings of my node template:

Beyond that, I haven’t changed anything. I choose the template when I build the cluster, and it does the rest. Is there a way to check whether it’s running from memory instead of disk?

**Edit:** I think you are right.

How to change that is now the question…
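
From the RancherOS docs, installing to disk looks roughly like this (the device name and cloud-config path are assumptions for my setup):

# write ROS to the VM's disk using the same cloud-config the ISO boot used
sudo ros install -c cloud-config.yml -d /dev/sda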

So I tried this, and it installed. I then tried to SSH into the host afterward to check it out. The host is still stuck at the same error in the Rancher GUI. I tried to SSH in with the following commands:

ssh -i id_rsa docker@ip_of_node
ssh -i id_rsa rancher@ip_of_node

Both ask for a password. Previously, using the docker command above worked without issue.

Previously, these steps were not required. I would configure the cluster and allow it to come online in about 10 minutes (for 3 control plane/etcd nodes and 3 worker nodes).

Is it possible there is an issue with my cluster or node template configuration causing this extra work?

Thanks!

I rebuilt my cluster from scratch and got past that part. All I changed was the RancherOS image, to the latest (1.5.8). I now get the following error: