We have been running Rancher for about two years. We just upgraded to v1.6.17 and it almost crashed everything.
Currently Rancher runs alone on a server with 8 CPUs and 32 GB of RAM (yes, I know that's way too much), and we've set the JVM heap to 4 GB (about 1.5 GB is actually used, sometimes up to 2 GB, but no more). Rancher manages about 80 hosts.
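For context, this is roughly how the heap is set: Rancher 1.6 takes the JVM flags through the `JAVA_OPTS` environment variable on the `rancher/server` container. A sketch of our launch command; the DB host and credentials below are placeholders, not our real values:

```shell
# Sketch: rancher/server v1.6.17 with a 4 GB JVM heap and an external DB.
# DB hostname/credentials are placeholders.
docker run -d --restart=unless-stopped -p 8080:8080 \
  -e JAVA_OPTS="-Xms2g -Xmx4g" \
  rancher/server:v1.6.17 \
  --db-host mysql.example.com --db-port 3306 \
  --db-user cattle --db-pass changeme --db-name cattle
```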
During the upgrade we had to update the infrastructure services (ipsec, scheduler, …). That's when everything went wrong. We first updated some small environments with at most 5 hosts: no problem. Then we upgraded an environment with 60 hosts, and Rancher went absolutely crazy. The load average on the Rancher server was between 14 and 26 (we have 8 cores and the DB is external, so that makes no sense), CPU was maxed out, and the UI was barely responding.
All our hosts started to disconnect and reconnect; the stacks themselves stayed mostly fine.
There was nothing in the Rancher logs except ping errors to the agents (which explains the disconnects/reconnects).
To stop it we deactivated about 25 hosts. Rancher finally made it through and went back to normal, and we then re-added our hosts one by one.
Currently it is barely working: any big action sends Rancher into the same spiral, everything disconnects and reconnects, and we have to deactivate hosts until Rancher calms down.
At this point we still have no idea what happened.
We upgraded Docker from 1.12.6 to 17.03.2-ce: no change. We moved the database to a server with NVMe storage, 8 CPUs, and far more RAM than necessary: no change.
Has anyone encountered this before?
Maybe a single Rancher server is not enough? We will try a 3-node HA setup, but we are not confident it will solve the issue.
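For anyone following along, the 3-node setup we plan to test follows the Rancher 1.6 HA docs: each node runs `rancher/server` against the same external DB, exposes port 9345 for the cluster, and advertises its own IP. A sketch per node (IPs and DB credentials are placeholders):

```shell
# Sketch of one node in a Rancher v1.6 HA cluster (run on each of the 3 nodes).
# Port 9345 must be reachable between the nodes; <node_ip> is this node's IP.
docker run -d --restart=unless-stopped -p 8080:8080 -p 9345:9345 \
  rancher/server:v1.6.17 \
  --db-host mysql.example.com --db-port 3306 \
  --db-user cattle --db-pass changeme --db-name cattle \
  --advertise-address <node_ip>
```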
Thank you for your help,