Repeated 'Unknown exception running process [agent.deactivate:XXXXX]' in Rancher server logs - v1.0.1

Hi,

I’ve noticed an issue in my Rancher server logs where I’m seeing a repeat of this error every minute or less:

Unknown exception running process [agent.deactivate:211308]

I’m not sure what has happened to trigger this, and everything is ‘green’ in terms of infrastructure and stacks (we’re running v1.0.1 in production here). One thing we are doing however is, using Spotinst, we schedule the addition and removal of servers twice a day to scale a particular stack, and I’m wondering if perhaps something is struggling as a result of this.

Could someone take a look please and let me know if there’s a known issue here and what might fix the problem?

Full stack trace on a couple of samples from the log:

2016-04-28 07:51:19,005 ERROR [:] [] [] [] [erviceReplay-22] [i.c.p.e.e.i.ProcessEventListenerImpl] Unknown exception running process [agent.deactivate:211308] on [1032] io.cattle.platform.eventing.exception.AgentRemovedException: Agent is removed
  at io.cattle.platform.agent.impl.WrappedEventService.call(WrappedEventService.java:93) ~[cattle-iaas-agent-0.5.0-SNAPSHOT.jar:na]
  at io.cattle.platform.agent.impl.EventCallProgressHelper.call(EventCallProgressHelper.java:57) ~[cattle-iaas-agent-0.5.0-SNAPSHOT.jar:na]
  at io.cattle.platform.agent.impl.RemoteAgentImpl.call(RemoteAgentImpl.java:99) ~[cattle-iaas-agent-0.5.0-SNAPSHOT.jar:na]
  at io.cattle.platform.agent.impl.RemoteAgentImpl.callSync(RemoteAgentImpl.java:72) ~[cattle-iaas-agent-0.5.0-SNAPSHOT.jar:na]
  at io.cattle.platform.agent.impl.RemoteAgentImpl.callSync(RemoteAgentImpl.java:135) ~[cattle-iaas-agent-0.5.0-SNAPSHOT.jar:na]
  at io.cattle.platform.process.common.handler.AgentBasedProcessHandler.callSync(AgentBasedProcessHandler.java:180) ~[cattle-iaas-logic-common-0.5.0-SNAPSHOT.jar:na]
  at io.cattle.platform.process.common.handler.AgentBasedProcessHandler.handleEvent(AgentBasedProcessHandler.java:166) ~[cattle-iaas-logic-common-0.5.0-SNAPSHOT.jar:na]
  at io.cattle.platform.process.common.handler.AgentBasedProcessHandler.handle(AgentBasedProcessHandler.java:104) ~[cattle-iaas-logic-common-0.5.0-SNAPSHOT.jar:na]
  at io.cattle.platform.engine.process.impl.DefaultProcessInstanceImpl.runHandler(DefaultProcessInstanceImpl.java:446) ~[cattle-framework-engine-0.5.0-SNAPSHOT.jar:na]
  at io.cattle.platform.engine.process.impl.DefaultProcessInstanceImpl$4.execute(DefaultProcessInstanceImpl.java:393) ~[cattle-framework-engine-0.5.0-SNAPSHOT.jar:na]
  at io.cattle.platform.engine.process.impl.DefaultProcessInstanceImpl$4.execute(DefaultProcessInstanceImpl.java:387) ~[cattle-framework-engine-0.5.0-SNAPSHOT.jar:na]
  at io.cattle.platform.engine.idempotent.Idempotent.execute(Idempotent.java:42) ~[cattle-framework-engine-0.5.0-SNAPSHOT.jar:na]
  at io.cattle.platform.engine.process.impl.DefaultProcessInstanceImpl.runHandlers(DefaultProcessInstanceImpl.java:387) ~[cattle-framework-engine-0.5.0-SNAPSHOT.jar:na]
  at io.cattle.platform.engine.process.impl.DefaultProcessInstanceImpl.runLogic(DefaultProcessInstanceImpl.java:493) ~[cattle-framework-engine-0.5.0-SNAPSHOT.jar:na]
  at io.cattle.platform.engine.process.impl.DefaultProcessInstanceImpl.runWithProcessLock(DefaultProcessInstanceImpl.java:320) ~[cattle-framework-engine-0.5.0-SNAPSHOT.jar:na]
  at io.cattle.platform.engine.process.impl.DefaultProcessInstanceImpl$2.doWithLockNoResult(DefaultProcessInstanceImpl.java:260) ~[cattle-framework-engine-0.5.0-SNAPSHOT.jar:na]
  at io.cattle.platform.lock.LockCallbackNoReturn.doWithLock(LockCallbackNoReturn.java:7) ~[cattle-framework-lock-0.5.0-SNAPSHOT.jar:na]
  at io.cattle.platform.lock.LockCallbackNoReturn.doWithLock(LockCallbackNoReturn.java:3) ~[cattle-framework-lock-0.5.0-SNAPSHOT.jar:na]
  at io.cattle.platform.lock.impl.AbstractLockManagerImpl$3.doWithLock(AbstractLockManagerImpl.java:40) ~[cattle-framework-lock-0.5.0-SNAPSHOT.jar:na]
  at io.cattle.platform.lock.impl.LockManagerImpl.doLock(LockManagerImpl.java:33) ~[cattle-framework-lock-0.5.0-SNAPSHOT.jar:na]
  at io.cattle.platform.lock.impl.AbstractLockManagerImpl.lock(AbstractLockManagerImpl.java:13) ~[cattle-framework-lock-0.5.0-SNAPSHOT.jar:na]
  at io.cattle.platform.lock.impl.AbstractLockManagerImpl.lock(AbstractLockManagerImpl.java:37) ~[cattle-framework-lock-0.5.0-SNAPSHOT.jar:na]
  at io.cattle.platform.engine.process.impl.DefaultProcessInstanceImpl.acquireLockAndRun(DefaultProcessInstanceImpl.java:257) ~[cattle-framework-engine-0.5.0-SNAPSHOT.jar:na]
  at io.cattle.platform.engine.process.impl.DefaultProcessInstanceImpl.runDelegateLoop(DefaultProcessInstanceImpl.java:185) ~[cattle-framework-engine-0.5.0-SNAPSHOT.jar:na]
  at io.cattle.platform.engine.process.impl.DefaultProcessInstanceImpl.executeWithProcessInstanceLock(DefaultProcessInstanceImpl.java:158) ~[cattle-framework-engine-0.5.0-SNAPSHOT.jar:na]
  at io.cattle.platform.engine.process.impl.DefaultProcessInstanceImpl$1.doWithLock(DefaultProcessInstanceImpl.java:108) ~[cattle-framework-engine-0.5.0-SNAPSHOT.jar:na]
  at io.cattle.platform.engine.process.impl.DefaultProcessInstanceImpl$1.doWithLock(DefaultProcessInstanceImpl.java:105) ~[cattle-framework-engine-0.5.0-SNAPSHOT.jar:na]
  at io.cattle.platform.lock.impl.AbstractLockManagerImpl$3.doWithLock(AbstractLockManagerImpl.java:40) ~[cattle-framework-lock-0.5.0-SNAPSHOT.jar:na]
  at io.cattle.platform.lock.impl.LockManagerImpl.doLock(LockManagerImpl.java:33) ~[cattle-framework-lock-0.5.0-SNAPSHOT.jar:na]
  at io.cattle.platform.lock.impl.AbstractLockManagerImpl.lock(AbstractLockManagerImpl.java:13) ~[cattle-framework-lock-0.5.0-SNAPSHOT.jar:na]
  at io.cattle.platform.lock.impl.AbstractLockManagerImpl.lock(AbstractLockManagerImpl.java:37) ~[cattle-framework-lock-0.5.0-SNAPSHOT.jar:na]
  at io.cattle.platform.engine.process.impl.DefaultProcessInstanceImpl.execute(DefaultProcessInstanceImpl.java:105) ~[cattle-framework-engine-0.5.0-SNAPSHOT.jar:na]
  at io.cattle.platform.engine.eventing.impl.ProcessEventListenerImpl.processExecute(ProcessEventListenerImpl.java:74) ~[cattle-framework-engine-0.5.0-SNAPSHOT.jar:na]
  at io.cattle.platform.engine.server.impl.ProcessInstanceParallelDispatcher$1.runInContext(ProcessInstanceParallelDispatcher.java:27) [cattle-framework-engine-0.5.0-SNAPSHOT.jar:na]
  at org.apache.cloudstack.managed.context.ManagedContextRunnable$1.run(ManagedContextRunnable.java:49) [cattle-framework-managed-context-0.5.0-SNAPSHOT.jar:na]
  at org.apache.cloudstack.managed.context.impl.DefaultManagedContext$1.call(DefaultManagedContext.java:55) [cattle-framework-managed-context-0.5.0-SNAPSHOT.jar:na]
  at org.apache.cloudstack.managed.context.impl.DefaultManagedContext.callWithContext(DefaultManagedContext.java:108) [cattle-framework-managed-context-0.5.0-SNAPSHOT.jar:na]
  at org.apache.cloudstack.managed.context.impl.DefaultManagedContext.runWithContext(DefaultManagedContext.java:52) [cattle-framework-managed-context-0.5.0-SNAPSHOT.jar:na]
  at org.apache.cloudstack.managed.context.ManagedContextRunnable.run(ManagedContextRunnable.java:46) [cattle-framework-managed-context-0.5.0-SNAPSHOT.jar:na]
  at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) [na:1.7.0_95]
  at java.util.concurrent.FutureTask.run(FutureTask.java:262) [na:1.7.0_95]
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [na:1.7.0_95]
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [na:1.7.0_95]
  at java.lang.Thread.run(Thread.java:745) [na:1.7.0_95]
2016-04-28 08:01:34,173 ERROR [:] [] [] [] [erviceReplay-19] [i.c.p.e.e.i.ProcessEventListenerImpl] Unknown exception running process [agent.deactivate:165907] on [609] io.cattle.platform.eventing.exception.AgentRemovedException: Agent is removed
  at io.cattle.platform.agent.impl.WrappedEventService.call(WrappedEventService.java:93) ~[cattle-iaas-agent-0.5.0-SNAPSHOT.jar:na]
  at io.cattle.platform.agent.impl.EventCallProgressHelper.call(EventCallProgressHelper.java:57) ~[cattle-iaas-agent-0.5.0-SNAPSHOT.jar:na]
  at io.cattle.platform.agent.impl.RemoteAgentImpl.call(RemoteAgentImpl.java:99) ~[cattle-iaas-agent-0.5.0-SNAPSHOT.jar:na]
  at io.cattle.platform.agent.impl.RemoteAgentImpl.callSync(RemoteAgentImpl.java:72) ~[cattle-iaas-agent-0.5.0-SNAPSHOT.jar:na]
  at io.cattle.platform.agent.impl.RemoteAgentImpl.callSync(RemoteAgentImpl.java:135) ~[cattle-iaas-agent-0.5.0-SNAPSHOT.jar:na]
  at io.cattle.platform.process.common.handler.AgentBasedProcessHandler.callSync(AgentBasedProcessHandler.java:180) ~[cattle-iaas-logic-common-0.5.0-SNAPSHOT.jar:na]
  at io.cattle.platform.process.common.handler.AgentBasedProcessHandler.handleEvent(AgentBasedProcessHandler.java:166) ~[cattle-iaas-logic-common-0.5.0-SNAPSHOT.jar:na]
  at io.cattle.platform.process.common.handler.AgentBasedProcessHandler.handle(AgentBasedProcessHandler.java:104) ~[cattle-iaas-logic-common-0.5.0-SNAPSHOT.jar:na]
  at io.cattle.platform.engine.process.impl.DefaultProcessInstanceImpl.runHandler(DefaultProcessInstanceImpl.java:446) ~[cattle-framework-engine-0.5.0-SNAPSHOT.jar:na]
  at io.cattle.platform.engine.process.impl.DefaultProcessInstanceImpl$4.execute(DefaultProcessInstanceImpl.java:393) ~[cattle-framework-engine-0.5.0-SNAPSHOT.jar:na]
  at io.cattle.platform.engine.process.impl.DefaultProcessInstanceImpl$4.execute(DefaultProcessInstanceImpl.java:387) ~[cattle-framework-engine-0.5.0-SNAPSHOT.jar:na]
  at io.cattle.platform.engine.idempotent.Idempotent.execute(Idempotent.java:42) ~[cattle-framework-engine-0.5.0-SNAPSHOT.jar:na]
  at io.cattle.platform.engine.process.impl.DefaultProcessInstanceImpl.runHandlers(DefaultProcessInstanceImpl.java:387) ~[cattle-framework-engine-0.5.0-SNAPSHOT.jar:na]
  at io.cattle.platform.engine.process.impl.DefaultProcessInstanceImpl.runLogic(DefaultProcessInstanceImpl.java:493) ~[cattle-framework-engine-0.5.0-SNAPSHOT.jar:na]
  at io.cattle.platform.engine.process.impl.DefaultProcessInstanceImpl.runWithProcessLock(DefaultProcessInstanceImpl.java:320) ~[cattle-framework-engine-0.5.0-SNAPSHOT.jar:na]
  at io.cattle.platform.engine.process.impl.DefaultProcessInstanceImpl$2.doWithLockNoResult(DefaultProcessInstanceImpl.java:260) ~[cattle-framework-engine-0.5.0-SNAPSHOT.jar:na]
  at io.cattle.platform.lock.LockCallbackNoReturn.doWithLock(LockCallbackNoReturn.java:7) ~[cattle-framework-lock-0.5.0-SNAPSHOT.jar:na]
  at io.cattle.platform.lock.LockCallbackNoReturn.doWithLock(LockCallbackNoReturn.java:3) ~[cattle-framework-lock-0.5.0-SNAPSHOT.jar:na]
  at io.cattle.platform.lock.impl.AbstractLockManagerImpl$3.doWithLock(AbstractLockManagerImpl.java:40) ~[cattle-framework-lock-0.5.0-SNAPSHOT.jar:na]
  at io.cattle.platform.lock.impl.LockManagerImpl.doLock(LockManagerImpl.java:33) ~[cattle-framework-lock-0.5.0-SNAPSHOT.jar:na]
  at io.cattle.platform.lock.impl.AbstractLockManagerImpl.lock(AbstractLockManagerImpl.java:13) ~[cattle-framework-lock-0.5.0-SNAPSHOT.jar:na]
  at io.cattle.platform.lock.impl.AbstractLockManagerImpl.lock(AbstractLockManagerImpl.java:37) ~[cattle-framework-lock-0.5.0-SNAPSHOT.jar:na]
  at io.cattle.platform.engine.process.impl.DefaultProcessInstanceImpl.acquireLockAndRun(DefaultProcessInstanceImpl.java:257) ~[cattle-framework-engine-0.5.0-SNAPSHOT.jar:na]
  at io.cattle.platform.engine.process.impl.DefaultProcessInstanceImpl.runDelegateLoop(DefaultProcessInstanceImpl.java:185) ~[cattle-framework-engine-0.5.0-SNAPSHOT.jar:na]
  at io.cattle.platform.engine.process.impl.DefaultProcessInstanceImpl.executeWithProcessInstanceLock(DefaultProcessInstanceImpl.java:158) ~[cattle-framework-engine-0.5.0-SNAPSHOT.jar:na]
  at io.cattle.platform.engine.process.impl.DefaultProcessInstanceImpl$1.doWithLock(DefaultProcessInstanceImpl.java:108) ~[cattle-framework-engine-0.5.0-SNAPSHOT.jar:na]
  at io.cattle.platform.engine.process.impl.DefaultProcessInstanceImpl$1.doWithLock(DefaultProcessInstanceImpl.java:105) ~[cattle-framework-engine-0.5.0-SNAPSHOT.jar:na]
  at io.cattle.platform.lock.impl.AbstractLockManagerImpl$3.doWithLock(AbstractLockManagerImpl.java:40) ~[cattle-framework-lock-0.5.0-SNAPSHOT.jar:na]
  at io.cattle.platform.lock.impl.LockManagerImpl.doLock(LockManagerImpl.java:33) ~[cattle-framework-lock-0.5.0-SNAPSHOT.jar:na]
  at io.cattle.platform.lock.impl.AbstractLockManagerImpl.lock(AbstractLockManagerImpl.java:13) ~[cattle-framework-lock-0.5.0-SNAPSHOT.jar:na]
  at io.cattle.platform.lock.impl.AbstractLockManagerImpl.lock(AbstractLockManagerImpl.java:37) ~[cattle-framework-lock-0.5.0-SNAPSHOT.jar:na]
  at io.cattle.platform.engine.process.impl.DefaultProcessInstanceImpl.execute(DefaultProcessInstanceImpl.java:105) ~[cattle-framework-engine-0.5.0-SNAPSHOT.jar:na]
  at io.cattle.platform.engine.eventing.impl.ProcessEventListenerImpl.processExecute(ProcessEventListenerImpl.java:74) ~[cattle-framework-engine-0.5.0-SNAPSHOT.jar:na]
  at io.cattle.platform.engine.server.impl.ProcessInstanceParallelDispatcher$1.runInContext(ProcessInstanceParallelDispatcher.java:27) [cattle-framework-engine-0.5.0-SNAPSHOT.jar:na]
  at org.apache.cloudstack.managed.context.ManagedContextRunnable$1.run(ManagedContextRunnable.java:49) [cattle-framework-managed-context-0.5.0-SNAPSHOT.jar:na]
  at org.apache.cloudstack.managed.context.impl.DefaultManagedContext$1.call(DefaultManagedContext.java:55) [cattle-framework-managed-context-0.5.0-SNAPSHOT.jar:na]
  at org.apache.cloudstack.managed.context.impl.DefaultManagedContext.callWithContext(DefaultManagedContext.java:108) [cattle-framework-managed-context-0.5.0-SNAPSHOT.jar:na]
  at org.apache.cloudstack.managed.context.impl.DefaultManagedContext.runWithContext(DefaultManagedContext.java:52) [cattle-framework-managed-context-0.5.0-SNAPSHOT.jar:na]
  at org.apache.cloudstack.managed.context.ManagedContextRunnable.run(ManagedContextRunnable.java:46) [cattle-framework-managed-context-0.5.0-SNAPSHOT.jar:na]
  at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) [na:1.7.0_95]
  at java.util.concurrent.FutureTask.run(FutureTask.java:262) [na:1.7.0_95]
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [na:1.7.0_95]
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [na:1.7.0_95]
  at java.lang.Thread.run(Thread.java:745) [na:1.7.0_95]

Thanks,
John

Hi,

Looks like I’ve got a number of processes that are causing issues:

Any help to fix these greatly appreciated!

Thanks,
John

Hi,

So having looked in to this a bit further I see that there’s a process_instance table in MySQL and that my long running processes are listed there with a NULL end time. I’ve realised that these relate to entries in the agent table, and these show a state of deactivating/reconnecting etc. I assume that I should be able to clear these issues by either just removing the entries from the process_instance table and/or removing related entries in a number of other tables - can somebody please provide some guidance on this?

Thanks,
John

When were the processes created? We typically rows in the process_instance that’s been created a day ago, but maybe it needs to be deleted quicker?

http://docs.rancher.com/rancher/faqs/server/#rancher-server-database-is-growing-too-quickly

Hi Denise,

Most of these processes are more than 6 days old - the problem is that they’re never completed due to exceptions or timeouts, so I’m really just after checking if it’s ok to delete the rows (as I’m pretty sure the processes are relating to hosts/agents/instances that have been deleted or purged) or whether I need to do something else? I’ve attached a result of the query to the process_instance table where end_time is null - these are the processes that the server keeps trying (and failing) to process.

Thanks,
John

I spoke with the team and there is a bug that we are going to fix. I’m assuming this Github issue is what you’re seeing.

Unfortunately, we wouldn’t recommend removing items from teh DB as we don’t want it to get into a bad state.

Hi Denise,

Thanks for the update and, yes, I filed that bug yesterday and linked it to a couple of others that seem to mention the same thing. Do you have any idea when the release that fixes this bug might appear, and will it ‘autofix’ these existing processes?

I’ve also got several agent.reconnect processes running, all around a month old (as you can see from the screen shots) that are in a TIMEOUT state. I guess this is a different issue and not one that seems to be repeating (as yet), the ‘data’ column in the process_instance table is this:

{"io.cattle.platform.configitem.request.ConfigUpdateRequest":{"io.cattle.platform.process.agent.AgentScriptsApply":{"migration":false,"deferredTrigger":false,"client":{"resourceType":"io.cattle.platform.core.model.Agent","resourceId":305,"eventName":"config.update"},"items":[{"name":"host-config","requestedVersion":1,"apply":true,"increment":false,"checkInSyncOnly":true},{"name":"pyagent","requestedVersion":1,"apply":true,"increment":false,"checkInSyncOnly":true},{"name":"configscripts","requestedVersion":1,"apply":true,"increment":false,"checkInSyncOnly":true}],"attributes":{}}},"io.cattle.platform.process.agent.AgentTimedConnectionClose":{"lastDisconnect":1462520463886}}

The resourceId and lastDisconnect are unique for each entry. Are these safely deleteable or repairable?

Thanks for your help,
John

Hi John,

The fix should be coming in our v1.1.0-dev2 and it should clean up the old existing processes.

Also, we’ve seen the fact that restarting rancher/server would sometimes clean it up.

Denise

Hi Denise, thanks for letting me know. I’ll try restarting the Rancher server in the first place to see if it helps, and then wait for 1.1.0. I’ll let you know here if the restart has any effect.

Thanks,
John

Hi,

Just to let you know that the restart also helped my situation. I still have 4 ‘agent.reconnect’ processes that are more than a month old, however they’re unrelated to this specific issue and will look at fixing those separately.

Thanks,
John