Best way to achieve zero downtime during Rancher version upgrades (with Cattle)

We’d like to keep up with new (stable) Rancher releases on a regular basis. Since the recent switch to a higher release cadence (which is great!), this means more maintenance slots.

We are running a Cattle environment with multiple hosts, at least 2 API gateway (Kong) instances as main entry points, and at least 2 instances of each microservice (the upstream services behind the API gateway). With this basic setup, HA, load balancing and scaling work quite well.

However, whenever certain infrastructure stacks need to be updated after a newer Rancher server release, we experience major container-to-container connectivity losses across the whole cluster during the upgrade process (usually for a few minutes). Examples of this were the upgrade of the network service stack from 0.0.8 to 0.0.14 and the upgrade of the IPSec stack from 0.0.2 to 0.0.4 (when upgrading from Rancher 1.3.x to 1.4.x). So far, we have tried pinning our external LBs to a specific host that we deactivate within Rancher. However, this is almost worse, since the upgrade process stops the relevant stacks on that host without rescheduling them until the host is reactivated.
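
To put a number on the "few minutes" of lost connectivity, a small probe along these lines could be run from one container against another while the infrastructure stacks upgrade. This is only a sketch in Python: the function names, the probe target and the one-second interval are illustrative assumptions, not anything provided by Rancher.

```python
import socket
import time


def can_connect(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and DNS lookup failures.
        return False


def outage_windows(samples):
    """Collapse (timestamp, ok) samples into a list of (start, end) outages.

    An outage starts at the first failed sample and ends at the next
    successful one. An outage still open after the final sample is
    closed at that sample's timestamp.
    """
    windows = []
    start = None
    last_ts = None
    for ts, ok in samples:
        if not ok and start is None:
            start = ts
        elif ok and start is not None:
            windows.append((start, ts))
            start = None
        last_ts = ts
    if start is not None:
        windows.append((start, last_ts))
    return windows


def probe(host, port, duration=600.0, interval=1.0):
    """Sample connectivity for `duration` seconds and return outage windows."""
    samples = []
    deadline = time.monotonic() + duration
    while time.monotonic() < deadline:
        samples.append((time.time(), can_connect(host, port)))
        time.sleep(interval)
    return outage_windows(samples)
```

Calling something like `probe("my-service.my-stack", 8080)` (a placeholder service name and port) from inside a container for the duration of the upgrade would return the start and end timestamps of each connectivity gap, which makes it easier to compare different upgrade approaches.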

We love Rancher, its active development and the steady addition of new features, but we would like to be able to do smoother updates in an active environment. Are there better/smarter ways to avoid downtime during such upgrades, even if that means rolling these updates out manually, e.g. on a per-host basis?