Upgrade services that require quorums

We are trying to upgrade a RabbitMQ cluster that is built as a single service in Rancher (1.0.1). We have an appropriate health check in place and are using a batch size of 1 and an “in place” upgrade.

The behavior we’d like/expect to see is that each container gets a healthy replacement before continuing.

The behavior we are seeing is that only a single healthy container is kept around, and the upgrade continues on while the replacement containers are still in the “initializing” state. As a result, we lose quorum, because the cluster goes down to one node during the upgrade.
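To make the quorum problem concrete, here is a minimal sketch of the arithmetic, assuming the cluster needs a simple majority of its nodes to be healthy. The function names are hypothetical and are not part of Rancher or RabbitMQ.

```python
def quorum_size(cluster_size: int) -> int:
    """Smallest strict majority of the cluster (assumed quorum rule)."""
    return cluster_size // 2 + 1


def max_unavailable(cluster_size: int) -> int:
    """How many nodes can be down at once without losing quorum."""
    return cluster_size - quorum_size(cluster_size)


def keeps_quorum(cluster_size: int, healthy: int) -> bool:
    """True if the number of healthy nodes still forms a majority."""
    return healthy >= quorum_size(cluster_size)


# A 3-node cluster tolerates one node being down, but not two.
print(quorum_size(3), max_unavailable(3), keeps_quorum(3, 1))
```

Under that assumption, a 3-node cluster that drops to a single healthy node is below quorum, which matches what we see during the upgrade.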

I spoke to @vincent on IRC yesterday and apparently upgrade does not support health checks… only interval and batchSize.

I’d love the ability to do batch upgrades based on health status + timeout rather than an interval. I think I’ve raised this before. But I’m wondering if this solves both the quorum and the “zero-downtime” deploy idea. I’ve been noodling on how I might like my upgrade procedure to go. Something like:

  • Start an upgrade with batch size of 2 and timeout of 300 seconds.
  • Start two new containers.
  • If the timeout is hit, go to the next batch (or maybe fail the upgrade?)
  • Once batch is healthy, proceed.
  • Continue batch (stop containers from first batch and start new batch).
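The steps above could be sketched roughly as follows. This is only an illustration of the idea, not how Rancher works today: `start_replacement`, `is_healthy` and `stop` are hypothetical callbacks supplied by the caller, and the sketch picks the "fail the upgrade" option when the timeout is hit.

```python
import time
from typing import Callable, List, Sequence


class UpgradeFailed(Exception):
    """Raised when a batch does not become healthy before the timeout."""


def wait_until_healthy(names: Sequence[str], is_healthy: Callable[[str], bool],
                       timeout_s: float, poll_s: float = 1.0,
                       clock: Callable[[], float] = time.monotonic,
                       sleep: Callable[[float], None] = time.sleep) -> bool:
    """Poll until every container is healthy, or give up at the deadline."""
    deadline = clock() + timeout_s
    while True:
        if all(is_healthy(name) for name in names):
            return True
        if clock() >= deadline:
            return False
        sleep(poll_s)


def batched_upgrade(old: Sequence[str], batch_size: int, timeout_s: float,
                    start_replacement: Callable[[str], str],
                    is_healthy: Callable[[str], bool],
                    stop: Callable[[str], None],
                    **wait_kwargs) -> List[str]:
    """Upgrade in batches, stopping old containers only once replacements are healthy."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    upgraded: List[str] = []
    for i in range(0, len(old), batch_size):
        batch = old[i:i + batch_size]
        # Start the replacements for this batch first.
        new = [start_replacement(name) for name in batch]
        # Gate on health status + timeout instead of a fixed interval.
        if not wait_until_healthy(new, is_healthy, timeout_s, **wait_kwargs):
            raise UpgradeFailed(f"batch {batch} not healthy after {timeout_s}s")
        # Only now stop the old containers, then move on to the next batch.
        for name in batch:
            stop(name)
        upgraded.extend(new)
    return upgraded
```

Because the old containers in a batch are stopped only after their replacements report healthy, a replacement that never comes up leaves every remaining old container running.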

@andyshinn I don’t believe there is a GitHub issue for this enhancement, so please feel free to make one.

As @topper said, it doesn’t exist today, but I basically agree…

  • I would definitely fail/stop if one didn’t come up, given that you’re trying to maintain quorum.
  • And service healthchecks already have an initializing timeout, so I would just use that one rather than having a separate batch timeout…

This thread is a couple years old. Any changes in the status of this feature since then? I have a very similar requirement to OP (service that requires quorum) and so being able to sequence upgrades based on health check status would be a huge win.

As it stands, I’m stuck with a rather crappy option of using a huge --interval or a huge initializing timeout, which makes upgrades needlessly slow and still doesn’t guarantee quorum: if the first container fails to come up, Rancher will still happily shut down/restart all the remaining containers.