Recover kubeconfig from rancher

In short, is it possible to recreate/restore a kubeconfig file for an imported cluster from rancher?

I have a Kubernetes cluster running on AWS that was installed with rke up. Rancher is installed on this cluster, and the cluster was also imported into Rancher. All of this setup was done by someone who has already left, and I don’t have either the original cluster.yaml file or the original kubeconfig. I only have a kubeconfig file downloaded from the Rancher UI.

Now I am trying to upgrade Rancher. According to the documentation, I have to reinstall Rancher because I am running a very old version of cert-manager.

But I lost my connection to the Kubernetes cluster as soon as I ran helm delete, because the kubeconfig file I used was downloaded from the UI and its server address points to Rancher.

I finally managed to restore the cluster from VM snapshots.
Any ideas on how I can do the upgrade smoothly?

It’s actually the local cluster for rancher.

Hi Iaocius,

If rke can manage the cluster, you can recover the kubeconfig using this command:

rke up --config ./rancher-cluster.yml --update-only

This command is intended to update only your worker nodes while ignoring the controlplane and etcd hosts. If it doesn’t find a kubeconfig in the current directory, it also happens to retrieve a copy and store it there.
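
As a rough sketch of what to do once the command has finished: RKE names the kubeconfig it writes after the config file (kube_config_<config file name>) and puts it next to that file, so for the command above it should be kube_config_rancher-cluster.yml. The kubectl check at the end is only an assumption about how you might confirm that the recovered file works.

```shell
set -eu

# Config file name taken from the rke command above.
config="./rancher-cluster.yml"

# RKE writes the kubeconfig next to the config file as kube_config_<config file name>.
kubeconfig="$(dirname "$config")/kube_config_$(basename "$config")"
echo "expected kubeconfig: $kubeconfig"

# Only use the file if rke has actually produced it.
if [ -f "$kubeconfig" ]; then
  export KUBECONFIG="$kubeconfig"
  # Assumed check: this kubeconfig should point at the cluster's own API
  # server rather than at the Rancher URL.
  kubectl get nodes
fi
```

Keep a copy of the recovered file somewhere safe, since it keeps working even when Rancher itself is down.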

@Stefan_Lasiewski
The problem is that I don’t have the cluster.yml either.
But I do have the keypairs for the instances.
Can I just create one and run rke up? Is there any risk?

cluster.yml is often a simple file that contains a short list of the nodes, the Kubernetes version, the SSH method used to access the hosts, and the etcd backup schedule. But it can be more elaborate.

You can create this by hand using the information that you can find in the GUI. The risk is that your hand-created cluster.yml will be different from what’s actually in production, and you might unintentionally modify the cluster. rke does not have a --dry-run option either, so you can’t compare what’s in your file with the actual state.
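
To give an idea of the shape, here is a minimal sketch that writes a hand-made cluster.yml into a scratch directory. Every value in it is a placeholder, not taken from your cluster: the node address, SSH user, key path, role assignment and backup schedule are all assumptions that you would replace with what the GUI shows. The backup_config block is also an assumption about your RKE version; very old releases configure etcd snapshots differently.

```shell
set -eu

# Write into a scratch directory so nothing existing gets overwritten.
workdir="$(mktemp -d)"

cat > "$workdir/cluster.yml" <<'EOF'
# All values below are placeholders; replace them before running rke.
nodes:
  - address: 203.0.113.10            # hypothetical node IP
    user: ubuntu                     # hypothetical SSH user
    ssh_key_path: ~/.ssh/id_rsa      # hypothetical path to the instance keypair
    role: [controlplane, etcd, worker]
# Pin this to the version the cluster is running right now.
kubernetes_version: REPLACE_WITH_RUNNING_VERSION
services:
  etcd:
    backup_config:
      enabled: true
      interval_hours: 12             # hypothetical schedule
      retention: 6                   # hypothetical number of snapshots to keep
EOF

echo "wrote $workdir/cluster.yml"
```

Compare each field against the running cluster before you point rke at the file, since anything that differs may be applied as a change.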

Instead of looking at the configuration in the GUI, are there some commands I can use to see the configuration?
Another question: do I have to run it with exactly the same rke version?

Instead of looking at the configuration in the GUI, are there some commands I can use to see the configuration?

It sounds like you are hoping to export the RKE configuration from the cluster using the rke cli. Unfortunately, I don’t know how to do that, sorry. You might want to have a look through superseb’s gists to see if he has a solution: https://github.com/superseb/ranchertools

At most places though, I imagine that the cluster.yml file shouldn’t be too difficult to reconstruct by hand. Check out the example at https://rancher.com/docs/rancher/v2.x/en/installation/k8s-install/kubernetes-rke/#1-create-the-cluster-configuration-file to see what one would look like.

You might also want to look on your RKE controlplane nodes to verify there are backups under /opt/rke/etcd-snapshots/.

Another question: do I have to run it with exactly the same rke version?

Yes, but you do need to be careful because new versions of RKE will by default upgrade a cluster to a new version of Kubernetes. You can hardcode the Kubernetes version into your cluster.yml file using the kubernetes_version: parameter, which is shown in this full cluster.yml example:

https://rancher.com/docs/rke/latest/en/example-yamls/#full-cluster-yml-example
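
One way to guard against an accidental upgrade is to check that the file pins a version before running rke at all. The sketch below is only an illustration: has_pinned_version is a hypothetical helper, not part of rke, and the two demo files and the version string in them are made up.

```shell
set -eu

# Hypothetical helper: succeeds only if the given cluster.yml sets a
# non-empty top-level kubernetes_version.
has_pinned_version() {
  grep -Eq '^kubernetes_version:[[:space:]]*[^[:space:]#]' "$1"
}

# Demo against two throwaway files with made-up contents.
tmp="$(mktemp -d)"
printf 'nodes: []\nkubernetes_version: v1.2.3-example\n' > "$tmp/pinned.yml"
printf 'nodes: []\n' > "$tmp/unpinned.yml"

for f in "$tmp/pinned.yml" "$tmp/unpinned.yml"; do
  if has_pinned_version "$f"; then
    echo "$(basename "$f"): version pinned"
  else
    echo "$(basename "$f"): no kubernetes_version, rke would use its own default"
  fi
done
```

In practice you would run the check against your real cluster.yml and only call rke up if it passes.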

I got the following error when running the rke up command with the --update-only option.

FATA[0768] [controlPlane] Failed to bring up Control Plane: Failed to verify healthcheck: Failed to check https://localhost:6443/healthz for service [kube-apiserver] on host [172.93.1.191]: Get https://localhost:6443/healthz: Unable to access the service on localhost:6443. The service might be still starting up. Error: ssh: rejected: connect failed (Connection refused), log: I0508 04:33:17.288269       1 plugins.go:161] Loaded 6 validating admission controller(s) successfully in the following order: LimitRanger,ServiceAccount,Priority,PersistentVolumeClaimResize,ValidatingAdmissionWebhook,ResourceQuota.

I couldn’t recover the cluster from the etcd backup either; I got the same error message.
I ended up restoring the cluster from VM snapshots again.