[HELP!] rancher broke down, no idea how to fix it

flypenguin · March 7, 2018, 4:08pm

Our Rancher instance broke down completely 1h ago, and we are absolutely clueless.

The logs show a lot of this crap:

[...] [service:1] [service.upgrade] [] [ecutorService-1] [c.p.e.p.i.DefaultProcessInstanceImpl] Unknown exception org.jooq.exception.DataChangedException: Database record has been changed

and this:

[...][agent:16122] [agent.remove->(MetadataProcessHandler)] [] [ecutorService-3] [i.c.p.p.m.MetadataProcessHandler    ] Failed to find account id for agent:16122
[...][agent:16125] [agent.deactivate->(MetadataProcessHandler)] [] [ecutorService-1] [i.c.p.p.m.MetadataProcessHandler    ] Failed to find account id for agent:16125
[...][agent:16124] [agent.remove->(MetadataProcessHandler)] [] [ecutorService-5] [i.c.p.p.m.MetadataProcessHandler    ] Failed to find account id for agent:16124
[...][agent:16125] [agent.remove->(MetadataProcessHandler)] [] [ecutorService-1] [i.c.p.p.m.MetadataProcessHandler    ] Failed to find account id for agent:16125

and this:

[...] Failed to get ping from agent [1824] count [8]
[...] Failed to get ping from agent [11989] count [8]
[...] Failed to get ping from agent [13448] count [8]
[...] Failed to get ping from agent [14732] count [8]
[...] Failed to get ping from agent [14790] count [8]
[...] Failed to get ping from agent [14790] count [9]
[...] Failed to get ping from agent [15860] count [9]

The hosts are fine right after Rancher restarts, but very soon fall into a “Reconnecting” state, although they are healthy and not under a lot of load.

We already tried restoring last night's DB snapshot, but to no avail.

Can anybody help here? Please?

Thanks!

vincent · March 8, 2018, 9:23am

What does the database server look like? “Database record has been changed” usually means the DB can’t keep up, which makes hosts disconnect, which causes more requests to the DB, which makes it continue to not be able to keep up…
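
To get a feel for how widespread the disconnects are, it can help to tally the ping-failure lines per agent. This is only a rough sketch: it assumes the log lines look like the “Failed to get ping from agent [N] count [M]” lines quoted above, and the function name `max_ping_failures` is made up for illustration, not part of Rancher.

```python
import re

# Matches lines such as:
#   [...] Failed to get ping from agent [14790] count [9]
# The pattern is an assumption based on the log excerpt in this thread.
PING_RE = re.compile(r"Failed to get ping from agent \[(\d+)\] count \[(\d+)\]")


def max_ping_failures(lines):
    """Return {agent_id: highest failure count seen} for the given log lines."""
    worst = {}
    for line in lines:
        match = PING_RE.search(line)
        if match is None:
            continue  # unrelated log line
        agent, count = int(match.group(1)), int(match.group(2))
        worst[agent] = max(worst.get(agent, 0), count)
    return worst


if __name__ == "__main__":
    sample = [
        "[...] Failed to get ping from agent [1824] count [8]",
        "[...] Failed to get ping from agent [14790] count [8]",
        "[...] Failed to get ping from agent [14790] count [9]",
        "[...] some unrelated line",
    ]
    for agent, count in sorted(max_ping_failures(sample).items()):
        print(f"agent {agent}: up to {count} missed pings")
```

If nearly every agent shows up with a climbing count at the same time, that points at the server or DB side rather than at individual hosts.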

kucerarichard · March 14, 2018, 2:11pm

someone is saturating your network.

flypenguin · March 15, 2018, 7:28am

well. here’s how we fixed it: we terminated all VMs of the agents, and rebuilt them.

Rancher did stop taking 100% CPU after that, and worked flawlessly. We still have no clue what happened to the setup, and we couldn’t find any reason. Right now we think AWS might be to blame because of some underlying issue, which is really far-fetched but hey.

As for the database: At one point we restored the database on a much smaller RDS type, which worked just fine after we rebuilt the agent hosts. Then we went back to using the original RDS, also no issues any more.
