[HELP!] Rancher broke down, no idea how to fix it

Our Rancher instance broke down completely 1h ago, and we are absolutely clueless.

The logs show a lot of this crap:

[...] [service:1] [service.upgrade] [] [ecutorService-1] [c.p.e.p.i.DefaultProcessInstanceImpl] Unknown exception org.jooq.exception.DataChangedException: Database record has been changed
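
From what we can tell, `DataChangedException` is jOOQ's optimistic-locking failure: a record is written back, but the row no longer matches what was originally read, because another writer got there first. Roughly this pattern (just a sketch; the connection details and the table and column names are made up, not Rancher's actual schema):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class OptimisticLockSketch {
    public static void main(String[] args) throws Exception {
        // Hypothetical connection details, not our real setup.
        try (Connection conn = DriverManager.getConnection(
                "jdbc:mysql://localhost/cattle", "user", "pass")) {

            // jOOQ guards the UPDATE with the values it originally read
            // (or a version column). If another writer changed the row
            // first, zero rows match and jOOQ throws DataChangedException:
            // "Database record has been changed".
            PreparedStatement ps = conn.prepareStatement(
                    "UPDATE host SET state = ? WHERE id = ? AND state = ?");
            ps.setString(1, "reconnecting");
            ps.setLong(2, 42L);          // hypothetical host id
            ps.setString(3, "active");   // the state we read earlier
            if (ps.executeUpdate() == 0) {
                System.out.println("record changed by another writer");
            }
        }
    }
}
```

Seeing lots of these at once suggests lots of writers racing on the same rows.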

We also see a lot of this:

[...][agent:16122] [agent.remove->(MetadataProcessHandler)] [] [ecutorService-3] [i.c.p.p.m.MetadataProcessHandler    ] Failed to find account id for agent:16122
[...][agent:16125] [agent.deactivate->(MetadataProcessHandler)] [] [ecutorService-1] [i.c.p.p.m.MetadataProcessHandler    ] Failed to find account id for agent:16125
[...][agent:16124] [agent.remove->(MetadataProcessHandler)] [] [ecutorService-5] [i.c.p.p.m.MetadataProcessHandler    ] Failed to find account id for agent:16124
[...][agent:16125] [agent.remove->(MetadataProcessHandler)] [] [ecutorService-1] [i.c.p.p.m.MetadataProcessHandler    ] Failed to find account id for agent:16125

and this:

[...] Failed to get ping from agent [1824] count [8]
[...] Failed to get ping from agent [11989] count [8]
[...] Failed to get ping from agent [13448] count [8]
[...] Failed to get ping from agent [14732] count [8]
[...] Failed to get ping from agent [14790] count [8]
[...] Failed to get ping from agent [14790] count [9]
[...] Failed to get ping from agent [15860] count [9]

The hosts are fine right after Rancher restarts, but soon fall into a “Reconnecting” state, even though they are healthy and not under much load.
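
The rising `count` in the ping lines looks like a consecutive-missed-ping counter, and presumably a host gets flipped to “Reconnecting” once it crosses some threshold. Here is a sketch of that pattern (purely illustrative, not Rancher's actual code, and the threshold is a guess):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative only: a typical missed-ping counter, not Rancher's code.
public class PingMonitor {
    // Hypothetical threshold; Rancher's real limit may differ.
    private static final int MAX_MISSED_PINGS = 10;

    private final Map<Long, Integer> missedPings = new ConcurrentHashMap<>();

    // Called whenever a ping to an agent times out.
    public void onPingFailed(long agentId) {
        int count = missedPings.merge(agentId, 1, Integer::sum);
        System.out.printf("Failed to get ping from agent [%d] count [%d]%n",
                agentId, count);
        if (count >= MAX_MISSED_PINGS) {
            markReconnecting(agentId);
        }
    }

    // A successful ping resets the counter.
    public void onPingSucceeded(long agentId) {
        missedPings.remove(agentId);
    }

    private void markReconnecting(long agentId) {
        // The state transition would be persisted to the DB here: one
        // more write racing with everything else when the DB is behind.
    }
}
```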

We already tried restoring a DB snapshot from last night, to no avail.

Can anybody help here? Please? :confused:

Thanks!

What does the database server look like? “Database record has been changed” usually means the DB can’t keep up, which makes hosts disconnect, which causes more requests to the DB, which makes it fall even further behind…
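
If you are stuck in that loop, the usual way out is to stop every agent from retrying in lockstep: reconnect with jittered exponential backoff so the DB gets room to catch up. A generic sketch (not Rancher's reconnect logic):

```java
import java.util.Random;
import java.util.concurrent.TimeUnit;

// Generic reconnect-with-backoff sketch, not Rancher's agent code.
public class ReconnectLoop {
    private static final Random RND = new Random();

    public static void main(String[] args) throws InterruptedException {
        long delayMs = 1_000;           // start at 1s
        final long maxDelayMs = 60_000; // cap at 60s

        while (!tryConnect()) {
            // Full jitter: sleep a random fraction of the current delay so
            // thousands of agents don't all retry at the same instant and
            // pile onto a database that is already behind.
            long sleep = (long) (RND.nextDouble() * delayMs);
            TimeUnit.MILLISECONDS.sleep(sleep);
            delayMs = Math.min(delayMs * 2, maxDelayMs);
        }
    }

    private static boolean tryConnect() {
        // Placeholder for the real connection attempt.
        return false;
    }
}
```

The jitter matters more than the exponential part: without it, every agent retries at the same moment and recreates the spike.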

Someone is saturating your network.

Well, here’s how we fixed it: we terminated all of the agent VMs and rebuilt them.

Rancher stopped pegging the CPU at 100% after that and has worked flawlessly since. We still have no clue what happened to the setup, and we couldn’t find any root cause. Right now we suspect some underlying AWS issue, which is really far-fetched, but hey.

As for the database: at one point we restored it onto a much smaller RDS instance type, which worked just fine after we rebuilt the agent hosts. We then went back to the original RDS instance, also with no further issues.