I’m using Rancher 2.0.6 to manage K8s clusters for services.
Whilst our use is currently in its infancy, I am looking at the long-term management of our Rancher service - and that includes upgrades to the Rancher service itself.
If I upgrade a standard single-node instance of Rancher, then all clusters go offline briefly before coming back up - which means all services running on those clusters become unavailable.
This is a hefty down-time penalty to consider.
I’ve tried a small HA install, and the K8s clusters it controlled also became unavailable.
Are there any plans to look at some form of rolling update to clusters, so services can remain available 24/7?
First, if you want to avoid an outage, you really have to go with an HA setup. For a variety of compliance reasons, we refresh our entire set of clusters regularly, so automating this process is absolutely critical. We have found that we can cycle our worker nodes, and the pods are successfully relocated to other healthy nodes as part of a rolling update. There is no loss of service and, depending on how aggressively you want to replace nodes and on your cost considerations, you can also mitigate performance degradation. The same is the case for Control Plane nodes. Etcd nodes are a slightly different proposition, and we are working on that one right now. Of course, the architecture for Rancher in this configuration means that your management plane could suffer a loss of service, but this shouldn’t impact your application workloads (although, until v2.2 … I think, RBAC has a dependency on your HA cluster, so that might be an issue depending on how you have that configured). So you might only lose the ability to manage your workloads via the UI. You can still do so via kubectl though, or via your CI/CD pipelines (preferred).
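To make the worker-node cycling a bit more concrete, here is a rough sketch of the loop in Python. It is not our actual tooling, just an illustration: it shells out to `kubectl cordon` and `kubectl drain`, one node at a time, so pods are rescheduled onto the remaining healthy nodes before the node is replaced. The names `drain_commands`, `rolling_replace` and the `replace_node` callback are made up for this example; how you actually provision the replacement node depends entirely on your infrastructure, and the timeout value is an arbitrary assumption.

```python
import subprocess
from typing import Callable, Iterable, List


def drain_commands(node: str, timeout_s: int = 300) -> List[List[str]]:
    """Return the kubectl commands that take one node out of service."""
    return [
        # Stop new pods from being scheduled onto the node.
        ["kubectl", "cordon", node],
        # Evict running pods; DaemonSet pods are skipped because they
        # are tied to the node and cannot be rescheduled elsewhere.
        ["kubectl", "drain", node, "--ignore-daemonsets",
         f"--timeout={timeout_s}s"],
    ]


def run_checked(cmd: List[str]) -> None:
    """Run a command and raise if it exits non-zero."""
    subprocess.run(cmd, check=True)


def rolling_replace(
    nodes: Iterable[str],
    replace_node: Callable[[str], None],
    run: Callable[[List[str]], None] = run_checked,
) -> None:
    """Drain and replace nodes strictly one at a time.

    replace_node is a hypothetical hook: it should provision the
    replacement and return only once the new node is Ready.
    """
    for node in nodes:
        for cmd in drain_commands(node):
            run(cmd)
        replace_node(node)


if __name__ == "__main__":
    # Dry run: print what would be executed instead of calling kubectl.
    rolling_replace(
        ["worker-1", "worker-2"],
        replace_node=lambda n: print(f"(replace {n} here)"),
        run=lambda cmd: print(" ".join(cmd)),
    )
```

Going one node at a time is the conservative choice; replacing several in parallel is faster but leaves less spare capacity while pods are being moved, which is the aggressiveness versus cost trade-off mentioned above. Depending on your workloads and kubectl version, the drain may need extra flags (for example for pods using local storage), so check `kubectl drain --help` before relying on this.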