Upgrade from Single Node to HA

We’re running a single-node deployment of Rancher 2.0.6 in GCP, which manages several bare-metal k8s clusters. We’re hitting failures that I’m guessing stem from the limitations of managing multiple clusters with a single-node installation. We’d like to move to running an HA setup.

There doesn’t seem to be any documentation for how to move the existing configs and cluster data over. Is there any way to do this without completely starting over from scratch?

Possibly related: Rancher 2.0 Single Node to HA


Hello, I’m interested in what sort of limits you’re running into. I believe you should be able to launch your HA cluster: back up your single-node cluster and import its etcd data to each node of the HA cluster.
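For the backup half of that, a minimal sketch of the volume-archive approach from the Rancher single-node docs (the container name `rancher-server` and the image tag are assumptions; match them to your deployment):

```bash
# Stop the running single-node Rancher container (name is assumed).
docker stop rancher-server

# Create a data container that shares the Rancher container's volumes.
docker create --volumes-from rancher-server --name rancher-data rancher/rancher:v2.0.6

# Archive /var/lib/rancher (which includes the embedded etcd data) to the host.
docker run --rm --volumes-from rancher-data -v "$PWD:/backup" alpine \
    tar zcvf /backup/rancher-data-backup.tar.gz /var/lib/rancher
```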

@david We run into a number of limitations with the “single-node” Docker container approach. The highlights include:

  • Adding more than 3 nodes at a time to a new cluster causes all nodes in the cluster to get stuck provisioning and the container to hard-crash to a Go trace
  • Making changes to a running cluster that update more than one node at a time triggers the same crash
  • Node deletes don’t work and get stuck in the “removing” state

Essentially, any action that requires significant downstream API calls will put the cluster into an unrecoverable state.

As for your migration idea: is that all it would take? I’d expect there to be more state data than just what’s in etcd.

In essence, what I’m suggesting is to follow the instructions for instantiating a new three-node HA cluster. Once it’s set up, shut it down. Then back up the etcd of your single-node cluster and shut that down too. Restore the etcd state to each of the three nodes of the HA cluster, one by one; all nodes should then be in the same state. Finally, bring up the HA cluster. I think that should do it. By the way, what did you mean about hitting failure limits? How did the failure manifest itself?
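One way the snapshot-and-restore steps could look with etcdctl v3 (a sketch only; all names, IPs, endpoints, and paths below are placeholders, and the embedded etcd in a single-node Rancher install may require client certs or a different endpoint):

```bash
# On the single-node host: take a snapshot of the etcd keyspace.
ETCDCTL_API=3 etcdctl --endpoints=https://127.0.0.1:2379 \
    snapshot save /tmp/rancher-etcd.db

# On each HA node: restore the snapshot into a fresh data dir.
# --name and the advertised peer URL must be unique per node; the
# --initial-cluster string lists all three members identically.
ETCDCTL_API=3 etcdctl snapshot restore /tmp/rancher-etcd.db \
    --name etcd-node1 \
    --initial-cluster etcd-node1=https://10.0.0.1:2380,etcd-node2=https://10.0.0.2:2380,etcd-node3=https://10.0.0.3:2380 \
    --initial-advertise-peer-urls https://10.0.0.1:2380 \
    --data-dir /var/lib/etcd
```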

Any crash of the container should be filed as a GitHub issue so we can investigate. I ran a single-node install today and added 15 nodes simultaneously without issues; if you can provide exact steps to reproduce, we can look into it (basically, fill out all the details requested in the issue template: https://rancher.com/docs/rancher/v2.x/en/contributing/#bugs-issues-or-questions)
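When filing, the container logs and version details cover most of what the issue template asks for. For example (the container name is an assumption):

```bash
# Version and environment details for the issue template.
docker version
docker ps --filter name=rancher-server

# Capture the crash output (the Go trace) from the Rancher container;
# docker logs writes to stdout/stderr, so redirect both.
docker logs --tail 500 rancher-server > rancher-crash.log 2>&1
```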


I’ve filed a few (and found a few that were already filed), but beyond the container crashing, we also get errors from API components when updating unrelated parts of the cluster.

As an example, I just added some kubelet config options to the first cluster (shown here as chicago-edens), and as soon as the change was submitted, I got an API error from the other cluster. I’ve attached a screenshot below.
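For context, the change was an edit to the cluster YAML along these lines (a hypothetical snippet; `max-pods` is just an example flag, and the exact keys depend on your RKE cluster config):

```yaml
# RKE cluster config: pass extra flags to the kubelet on every node.
services:
  kubelet:
    extra_args:
      max-pods: "250"
```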