Our Rancher instance broke down completely an hour ago, and we have no idea why.
The logs show a lot of entries like this:
[...] [service:1] [service.upgrade] [] [ecutorService-1] [c.p.e.p.i.DefaultProcessInstanceImpl] Unknown exception org.jooq.exception.DataChangedException: Database record has been changed
and this:
[...][agent:16122] [agent.remove->(MetadataProcessHandler)] [] [ecutorService-3] [i.c.p.p.m.MetadataProcessHandler ] Failed to find account id for agent:16122
[...][agent:16125] [agent.deactivate->(MetadataProcessHandler)] [] [ecutorService-1] [i.c.p.p.m.MetadataProcessHandler ] Failed to find account id for agent:16125
[...][agent:16124] [agent.remove->(MetadataProcessHandler)] [] [ecutorService-5] [i.c.p.p.m.MetadataProcessHandler ] Failed to find account id for agent:16124
[...][agent:16125] [agent.remove->(MetadataProcessHandler)] [] [ecutorService-1] [i.c.p.p.m.MetadataProcessHandler ] Failed to find account id for agent:16125
and this:
[...] Failed to get ping from agent [1824] count [8]
[...] Failed to get ping from agent [11989] count [8]
[...] Failed to get ping from agent [13448] count [8]
[...] Failed to get ping from agent [14732] count [8]
[...] Failed to get ping from agent [14790] count [8]
[...] Failed to get ping from agent [14790] count [9]
[...] Failed to get ping from agent [15860] count [9]
The hosts look fine right after Rancher restarts, but they very soon fall into a “Reconnecting” state, even though the hosts themselves are healthy and not under much load.
We already tried restoring a DB snapshot from last night, but to no avail.
Can anybody help here? Please?
Thanks!