Random cluster "unavailable" downtimes

dusterio · September 25, 2018, 6:43am

I’ve been using Rancher for a few months now (after Docker Cloud closed) and mostly it’s going fine, but there is one really annoying bit and I wonder if anybody else experiences the same around here.

Every once in a while, after upgrading one of the running services (pulling a new Docker image), the whole cluster goes down for no apparent reason - it just transitions to the “unavailable” state. Sometimes it comes back on its own after 3-5 minutes; in other cases I have to reboot one or several nodes.

This is obviously far from ideal. Why should a single container even be able to bring down a cluster?

SergeantHindsight · September 26, 2018, 4:52pm

So I have been having this problem too. I don’t think it’s actually down: all the other containers work, but I get all these errors for Rancher (websocket errors, can’t log in or go to the cluster). It fixes itself after 5-10 minutes. Last night it happened again, and when I opened Firefox it worked fine. It seems Chrome loses trust in the invalid cert. When it fixes itself, it prompts me to continue through the untrusted cert again.

Never had this with any other apps that have a self-signed cert.

dusterio · September 27, 2018, 2:19am

Yep, sounds very similar!

When I’m lucky, it just goes back online after 5-10 minutes (not nice to have no cluster control for this long though).

Once or twice the whole Rancher web app went down - I had to restart the instance to bring it back. Couldn’t even SSH into it - the web interface hung the whole VPS.

superseb · September 27, 2018, 4:03pm

Can you supply more information about your cluster? What version, what install method, how many nodes with what roles, what specifications?

Issues when pulling an image suggest that resources are insufficient, but that depends entirely on the setup of the cluster and what is running where.
