Rancher down - help with troubleshooting

We just had out rancher server go down and are having trouble getting it to come back up. Could I get some guidance on troubleshooting steps.

Currently in the server logs I’m just seeing a whole lot of this:

2015-10-15 20:33:06,766 ERROR [:] [] [] [] [cutorService-10] [i.c.p.a.s.ping.impl.PingMonitorImpl ] Scheduling reconnect for [40]
2015-10-15 20:33:06,973 ERROR [:] [] [] [] [ecutorService-6] [i.c.p.a.s.ping.impl.PingMonitorImpl ] Scheduling reconnect for [72]
2015-10-15 20:33:06,977 ERROR [:] [] [] [] [ecutorService-5] [i.c.p.a.s.ping.impl.PingMonitorImpl ] Scheduling reconnect for [79]
2015-10-15 20:33:07,011 ERROR [:] [] [] [] [ecutorService-9] [i.c.p.a.s.ping.impl.PingMonitorImpl ] Scheduling reconnect for [74]
2015-10-15 20:33:07,012 ERROR [:] [] [] [] [ecutorService-4] [i.c.p.a.s.ping.impl.PingMonitorImpl ] Scheduling reconnect for [73]
2015-10-15 20:33:07,015 ERROR [:] [] [] [] [ecutorService-1] [i.c.p.a.s.ping.impl.PingMonitorImpl ] Scheduling reconnect for [39]
2015-10-15 20:33:07,017 ERROR [:] [] [] [] [ecutorService-7] [i.c.p.a.s.ping.impl.PingMonitorImpl ] Scheduling reconnect for [78]
2015-10-15 20:33:07,027 ERROR [:] [] [] [] [ecutorService-8] [i.c.p.a.s.ping.impl.PingMonitorImpl ] Scheduling reconnect for [86]
2015-10-15 20:33:07,719 ERROR [:] [] [] [] [ecutorService-2] [i.c.p.a.s.ping.impl.PingMonitorImpl ] Scheduling reconnect for [38]
2015-10-15 20:33:11,722 ERROR [:] [] [] [] [ecutorService-6] [i.c.p.a.s.ping.impl.PingMonitorImpl ] Failed to get ping from agent [38] count [7]
2015-10-15 20:33:15,939 ERROR [:] [] [] [] [ecutorService-7] [i.c.p.a.s.ping.impl.PingMonitorImpl ] Failed to get ping from agent [40] count [7]
2015-10-15 20:33:15,940 ERROR [:] [] [] [] [ecutorService-5] [i.c.p.a.s.ping.impl.PingMonitorImpl ] Failed to get ping from agent [72] count [7]
2015-10-15 20:33:15,941 ERROR [:] [] [] [] [ecutorService-4] [i.c.p.a.s.ping.impl.PingMonitorImpl ] Failed to get ping from agent [74] count [7]
2015-10-15 20:33:15,941 ERROR [:] [] [] [] [ecutorService-6] [i.c.p.a.s.ping.impl.PingMonitorImpl ] Failed to get ping from agent [73] count [7]
2015-10-15 20:33:15,942 ERROR [:] [] [] [] [cutorService-10] [i.c.p.a.s.ping.impl.PingMonitorImpl ] Failed to get ping from agent [79] count [7]
2015-10-15 20:33:15,942 ERROR [:] [] [] [] [ecutorService-1] [i.c.p.a.s.ping.impl.PingMonitorImpl ] Failed to get ping from agent [78] count [7]
2015-10-15 20:33:15,943 ERROR [:] [] [] [] [ecutorService-2] [i.c.p.a.s.ping.impl.PingMonitorImpl ] Failed to get ping from agent [86] count [7]
2015-10-15 20:33:15,944 ERROR [:] [] [] [] [ecutorService-3] [i.c.p.a.s.ping.impl.PingMonitorImpl ] Failed to get ping from agent [39] count [7]
2015/10/15 20:33:21 http: proxy error: net/http: request canceled
time="2015-10-15T20:33:29Z" level=info msg="Connection established"
time="2015-10-15T20:33:29Z" level=info msg="Connection established"
2015/10/15 20:33:38 http: proxy error: net/http: request canceled

From the browser this seems to be the call triggering the request canceled messages:

v1/schemas?limit=1000

Also seeing this stacktrace.

ERROR [:] [] [] [] [ecutorService-9] [o.a.c.m.context.NoExceptionRunnable ] Uncaught exception org.jooq.exception.DataChangedException: Database record has been changed
	at org.jooq.impl.UpdatableRecordImpl.checkIfChanged(UpdatableRecordImpl.java:550) ~[jooq-3.3.0.jar:na]
	at org.jooq.impl.UpdatableRecordImpl.storeUpdate0(UpdatableRecordImpl.java:291) ~[jooq-3.3.0.jar:na]
	at org.jooq.impl.UpdatableRecordImpl.access$200(UpdatableRecordImpl.java:90) ~[jooq-3.3.0.jar:na]
	at org.jooq.impl.UpdatableRecordImpl$3.operate(UpdatableRecordImpl.java:260) ~[jooq-3.3.0.jar:na]
	at org.jooq.impl.RecordDelegate.operate(RecordDelegate.java:123) ~[jooq-3.3.0.jar:na]
	at org.jooq.impl.UpdatableRecordImpl.storeUpdate(UpdatableRecordImpl.java:255) ~[jooq-3.3.0.jar:na]
	at org.jooq.impl.UpdatableRecordImpl.update(UpdatableRecordImpl.java:149) ~[jooq-3.3.0.jar:na]
	at io.cattle.platform.object.impl.JooqObjectManager.persistRecord(JooqObjectManager.java:216) ~[cattle-framework-object-0.5.0-SNAPSHOT.jar:na]
	at io.cattle.platform.object.impl.JooqObjectManager.setFieldsInternal(JooqObjectManager.java:125) ~[cattle-framework-object-0.5.0-SNAPSHOT.jar:na]
	at io.cattle.platform.object.impl.JooqObjectManager$3.execute(JooqObjectManager.java:113) ~[cattle-framework-object-0.5.0-SNAPSHOT.jar:na]
	at io.cattle.platform.engine.idempotent.Idempotent.change(Idempotent.java:88) ~[cattle-framework-engine-0.5.0-SNAPSHOT.jar:na]
	at io.cattle.platform.object.impl.JooqObjectManager.setFields(JooqObjectManager.java:110) ~[cattle-framework-object-0.5.0-SNAPSHOT.jar:na]
	at io.cattle.platform.agent.server.resource.impl.AgentResourcesMonitorImpl.setHosts(AgentResourcesMonitorImpl.java:205) ~[cattle-iaas-agent-server-0.5.0-SNAPSHOT.jar:na]
	at io.cattle.platform.agent.server.resource.impl.AgentResourcesMonitorImpl$2.doWithLockNoResult(AgentResourcesMonitorImpl.java:125) ~[cattle-iaas-agent-server-0.5.0-SNAPSHOT.jar:na]
	at io.cattle.platform.lock.LockCallbackNoReturn.doWithLock(LockCallbackNoReturn.java:7) ~[cattle-framework-lock-0.5.0-SNAPSHOT.jar:na]
	at io.cattle.platform.lock.LockCallbackNoReturn.doWithLock(LockCallbackNoReturn.java:3) ~[cattle-framework-lock-0.5.0-SNAPSHOT.jar:na]
	at io.cattle.platform.lock.impl.AbstractLockManagerImpl$3.doWithLock(AbstractLockManagerImpl.java:40) ~[cattle-framework-lock-0.5.0-SNAPSHOT.jar:na]
	at io.cattle.platform.lock.impl.LockManagerImpl.doLock(LockManagerImpl.java:33) ~[cattle-framework-lock-0.5.0-SNAPSHOT.jar:na]
	at io.cattle.platform.lock.impl.AbstractLockManagerImpl.lock(AbstractLockManagerImpl.java:13) ~[cattle-framework-lock-0.5.0-SNAPSHOT.jar:na]
	at io.cattle.platform.lock.impl.AbstractLockManagerImpl.lock(AbstractLockManagerImpl.java:37) ~[cattle-framework-lock-0.5.0-SNAPSHOT.jar:na]
	at io.cattle.platform.agent.server.resource.impl.AgentResourcesMonitorImpl.pingReply(AgentResourcesMonitorImpl.java:116) ~[cattle-iaas-agent-server-0.5.0-SNAPSHOT.jar:na]
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[na:1.7.0_79]
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) ~[na:1.7.0_79]
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[na:1.7.0_79]
	at java.lang.reflect.Method.invoke(Method.java:606) ~[na:1.7.0_79]
	at io.cattle.platform.eventing.annotation.MethodInvokingListener$1.doWithLockNoResult(MethodInvokingListener.java:76) ~[cattle-framework-eventing-0.5.0-SNAPSHOT.jar:na]
	at io.cattle.platform.lock.LockCallbackNoReturn.doWithLock(LockCallbackNoReturn.java:7) ~[cattle-framework-lock-0.5.0-SNAPSHOT.jar:na]
	at io.cattle.platform.lock.LockCallbackNoReturn.doWithLock(LockCallbackNoReturn.java:3) ~[cattle-framework-lock-0.5.0-SNAPSHOT.jar:na]
	at io.cattle.platform.lock.impl.AbstractLockManagerImpl$3.doWithLock(AbstractLockManagerImpl.java:40) ~[cattle-framework-lock-0.5.0-SNAPSHOT.jar:na]
	at io.cattle.platform.lock.impl.LockManagerImpl.doLock(LockManagerImpl.java:33) ~[cattle-framework-lock-0.5.0-SNAPSHOT.jar:na]
	at io.cattle.platform.lock.impl.AbstractLockManagerImpl.lock(AbstractLockManagerImpl.java:13) ~[cattle-framework-lock-0.5.0-SNAPSHOT.jar:na]
	at io.cattle.platform.lock.impl.AbstractLockManagerImpl.lock(AbstractLockManagerImpl.java:37) ~[cattle-framework-lock-0.5.0-SNAPSHOT.jar:na]
	at io.cattle.platform.eventing.annotation.MethodInvokingListener.onEvent(MethodInvokingListener.java:72) ~[cattle-framework-eventing-0.5.0-SNAPSHOT.jar:na]
	at io.cattle.platform.eventing.impl.AbstractThreadPoolingEventService$2.doRun(AbstractThreadPoolingEventService.java:135) ~[cattle-framework-eventing-0.5.0-SNAPSHOT.jar:na]
	at org.apache.cloudstack.managed.context.NoExceptionRunnable.runInContext(NoExceptionRunnable.java:13) ~[cattle-framework-managed-context-0.5.0-SNAPSHOT.jar:na]
	at org.apache.cloudstack.managed.context.ManagedContextRunnable$1.run(ManagedContextRunnable.java:49) [cattle-framework-managed-context-0.5.0-SNAPSHOT.jar:na]
	at org.apache.cloudstack.managed.context.impl.DefaultManagedContext$1.call(DefaultManagedContext.java:55) [cattle-framework-managed-context-0.5.0-SNAPSHOT.jar:na]
	at org.apache.cloudstack.managed.context.impl.DefaultManagedContext.callWithContext(DefaultManagedContext.java:108) [cattle-framework-managed-context-0.5.0-SNAPSHOT.jar:na]
	at org.apache.cloudstack.managed.context.impl.DefaultManagedContext.runWithContext(DefaultManagedContext.java:52) [cattle-framework-managed-context-0.5.0-SNAPSHOT.jar:na]
	at org.apache.cloudstack.managed.context.ManagedContextRunnable.run(ManagedContextRunnable.java:46) [cattle-framework-managed-context-0.5.0-SNAPSHOT.jar:na]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [na:1.7.0_79]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [na:1.7.0_79]
	at java.lang.Thread.run(Thread.java:745) [na:1.7.0_79]

Possible bad state in the DB?

@aburnett Can you provide some details about your system?

  • What version of rancher are you running?
  • Are you running a single rancher host or multiple?
  • Is rancher behind any kind of proxy?
  • It seems like the UI is loading. Is that the case?
  • If you curl the api with something like curl http://<your server>/v1 what kind of response do you get? Does that seem to correlate with any stacktrace in the logs?

The UI won’t load much if it can’t even get a schema response…

@aburnett email me at craig at rancher.com if you’d like to debug further.

Ok, sent an email. Thanks!

Any updates on what was going on?