Agree with most of what has been said. Unless your business workloads have a high tolerance for potential outages (or you can arrange an upgrade window with all your business application owners), you need to exercise some caution with upgrades of all types. It's quite easy to be seduced by statements like "K8s is completely self-healing", "rolling upgrades are completely safe and fool-proof", "upgrading individual components should be OK". IME all of those are highly situation dependent.

Upgrading a Rancher minor version is mostly trouble-free, but there are always a few edge cases, many of which have nothing to do with Rancher itself. For example, upgrades can create incompatibilities with the external applications that integrate with Rancher and/or K8s: think logging/monitoring/alerting; security products (how do you manage your image and container run-time policies: OPA, Twistlock, Aqua?); operators or CRDs that you have created; your deployment software (we use Terraform and the Rancher2 provider and have noted several areas of incompatibility or upgrade lag); Helm; … and what about your etcd backups, are they compatible? … and so on.

On the whole, Rancher provides some protection from many of these things, especially if you follow the recipes that they publish and (up until v2.3) upgraded Rancher server, K8s and RKE as a unit. v2.3 allows independent upgrade of compatible K8s versions, which is certainly welcome as a way of reducing risk from what was otherwise somewhat of an all-or-nothing approach. If you have decided to make use of non-Rancher-specific models (and many do want to do that, or their company policy may mandate it), then again you have to reconcile and properly understand any ripple effects.
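On the "compatible K8s versions" point: Kubernetes only supports upgrading one minor version at a time, so a simple guard in your own upgrade tooling can catch an accidental skip before it hits a cluster. A minimal sketch (the `can_upgrade` helper and the version numbers are illustrative, not part of Rancher's tooling):

```shell
#!/bin/sh
# Extract the minor component from a "major.minor.patch" version string.
minor() { echo "$1" | cut -d. -f2; }

# Succeed only if the target version is the same minor, or exactly one
# minor ahead, of the current version (downgrades are rejected too).
can_upgrade() {
  cur=$(minor "$1"); tgt=$(minor "$2")
  [ "$tgt" -ge "$cur" ] && [ $((tgt - cur)) -le 1 ]
}

can_upgrade "1.16.3" "1.17.4" && echo "ok: one minor step"
can_upgrade "1.15.5" "1.17.2" || echo "blocked: skips a minor"
```

Trivial, but wiring a check like this into CI in front of your Terraform apply is the kind of thing that saves you from a bad afternoon.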
Don’t get me wrong, we license Rancher as an Enterprise support customer and naturally we want to leverage the platform as much as we can. But at the same time we are always cautious about patching and upgrades; it’s inherently risky to do, … and risky not to. Our expectations are grounded in the reality that no tech stack has been designed specifically for our environment, our business continuity model, or our internal CISO and engineering policies. So long as you are aware of where all of those diverge, you will know how much you can lean on a vendor-provided process and where you can’t.
We run regular ‘drills’ for DR and have a number of platform test clusters that we use to test changes to the platform, such as upgrades, before we make a decision about rolling upgrades and patches out to higher environments. For some that will be completely OTT; … for our internal change management process for Prod, it’s mandatory.
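As a concrete example of one such drill: on RKE-provisioned clusters, the etcd snapshot/restore cycle can be rehearsed end-to-end on a test cluster with the RKE CLI. A sketch of the shape of it (the snapshot name is illustrative, and flags have changed between RKE releases, so check the docs for your version):

```shell
# Name the drill snapshot after today's date (illustrative convention).
SNAP="drill-$(date +%Y%m%d)"

# Take a named, one-off etcd snapshot of the test cluster.
rke etcd snapshot-save --config cluster.yml --name "$SNAP"

# Later in the drill: restore cluster state from that snapshot.
rke etcd snapshot-restore --config cluster.yml --name "$SNAP"

# Sanity-check the restored cluster before signing off the drill.
kubectl get nodes
kubectl get pods --all-namespaces
```

Doing this on a disposable cluster is also how you answer the "are my etcd backups compatible?" question before an upgrade, rather than after.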
We recycle all of our clusters, including the Rancher management plane (in HA mode), daily. The more you practice the stuff that makes you anxious, the more comfortable you get with it.