Random cluster "unavailable" downtimes

I’ve been using Rancher for a few months now (after Docker Cloud closed) and mostly it’s going fine, but there is one really annoying issue, and I wonder if anybody else here has seen the same thing.

Every once in a while, after upgrading one of the running services (pulling a new Docker image), the whole cluster goes down for no apparent reason - it just transitions to the “unavailable” state. Sometimes it comes back on its own after 3-5 minutes; in other cases I have to reboot one or several nodes.

This is obviously far from ideal. Why would a single container even be able to bring down a cluster?

I have been having this problem too. I don’t think the cluster is actually down: all the other containers keep working, but I get all these errors from Rancher (websocket errors) and can’t log in or open the cluster. It fixes itself after 5-10 minutes. Last night it happened again, so I opened Firefox and it worked fine. It seems Chrome loses trust in the invalid cert. When it fixes itself, Chrome prompts me to continue through the untrusted cert again.

I’ve never had this with any other app that uses a self-signed cert.

Yep, sounds very similar!

When I’m lucky, it just goes back online after 5-10 minutes (not nice to have no cluster control for this long, though).

Once or twice the whole Rancher web app went down, and I had to restart the instance to bring it back. I couldn’t even SSH into it - the web interface hung the whole VPS.

Can you supply more information about your cluster? What version, what install method, how many nodes with what roles, what specifications?

Issues that show up when pulling an image suggest that resources are insufficient, but that depends entirely on how the cluster is set up and what is running where.
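
If it helps with gathering those details, here is a minimal Python sketch that condenses the output of `kubectl get nodes -o json` into name, roles, kubelet version, capacity and readiness per node. It assumes a Kubernetes-based cluster (Rancher 2.x) and a working `kubectl` on your machine, neither of which is confirmed in this thread; the function names `summarize_nodes` and `fetch_nodes` are made up for illustration.

```python
import json
import subprocess

# Standard Kubernetes label prefix used to mark node roles.
ROLE_PREFIX = "node-role.kubernetes.io/"


def summarize_nodes(node_list):
    """Reduce a parsed `kubectl get nodes -o json` document to one dict per node."""
    summary = []
    for node in node_list.get("items", []):
        meta = node.get("metadata", {})
        status = node.get("status", {})
        labels = meta.get("labels", {})
        # Roles are encoded as label keys such as node-role.kubernetes.io/etcd.
        roles = sorted(k[len(ROLE_PREFIX):] for k in labels if k.startswith(ROLE_PREFIX))
        # A node is ready when its "Ready" condition has status "True".
        ready = any(
            c.get("type") == "Ready" and c.get("status") == "True"
            for c in status.get("conditions", [])
        )
        capacity = status.get("capacity", {})
        summary.append({
            "name": meta.get("name"),
            "roles": roles,
            "kubelet": status.get("nodeInfo", {}).get("kubeletVersion"),
            "cpu": capacity.get("cpu"),
            "memory": capacity.get("memory"),
            "ready": ready,
        })
    return summary


def fetch_nodes():
    """Assumption: kubectl is installed and configured for the affected cluster."""
    result = subprocess.run(
        ["kubectl", "get", "nodes", "-o", "json"],
        check=True, capture_output=True, text=True,
    )
    return json.loads(result.stdout)
```

Calling `summarize_nodes(fetch_nodes())` and pasting the result here would answer most of the questions above. Running it again while the cluster shows as “unavailable” might also show whether any node actually drops out of the ready state.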