Hung processes and services are not stopping

Rancher 1.1.2 on a separate host
3 x custom VPS hosts (Ubuntu 16.04, Docker 1.11)

  • I’m trying to stop a service, but it keeps timing out in the “processes” list.
  • There are 3 Drone containers (on hosts that no longer exist) that are hung in the “Removing” state.

This is the testing environment, but I have to be very careful not to break anything, because there is a prod environment that is actually in use and cannot have downtime at the moment.

I’m a bit at a loss on how to proceed.

The network agent log on one of the hosts:

2016-08-05 07:12:59,458 ERROR agent [140082703404240] [event.py:112] Error in request : 486232ea-d028-4a9a-bddd-ad2eb4e4f3d4
Traceback (most recent call last):
  File "/var/lib/cattle/pyagent/cattle/agent/event.py", line 95, in _worker_main
    resp = agent.execute(req)
  File "/var/lib/cattle/pyagent/cattle/agent/__init__.py", line 15, in execute
    return self._router.route(req)
  File "/var/lib/cattle/pyagent/cattle/plugins/core/event_router.py", line 13, in route
    resp = handler.execute(req)
  File "/var/lib/cattle/pyagent/cattle/plugins/core/event_handlers.py", line 32, in execute
    type.on_ping(event, resp)
  File "/var/lib/cattle/pyagent/cattle/plugins/docker/compute.py", line 126, in on_ping
    self._add_instances(ping, pong)
  File "/var/lib/cattle/pyagent/cattle/plugins/docker/compute.py", line 138, in _add_instances
    running, nonrunning = self._get_all_containers_by_state()
  File "/var/lib/cattle/pyagent/cattle/plugins/docker/compute.py", line 171, in _get_all_containers_by_state
    for c in client.containers(all=True):
  File "/var/lib/cattle/pyagent/dist/docker/api/container.py", line 69, in containers
    res = self._result(self._get(u, params=params), True)
  File "/var/lib/cattle/pyagent/dist/docker/utils/decorators.py", line 47, in inner
    return f(self, *args, **kwargs)
  File "/var/lib/cattle/pyagent/dist/docker/client.py", line 112, in _get
    return self.get(url, **self._set_request_timeout(kwargs))
  File "/var/lib/cattle/pyagent/dist/requests/sessions.py", line 487, in get
    return self.request('GET', url, **kwargs)
  File "/var/lib/cattle/pyagent/dist/requests/sessions.py", line 475, in request
    resp = self.send(prep, **send_kwargs)
  File "/var/lib/cattle/pyagent/dist/requests/sessions.py", line 585, in send
    r = adapter.send(request, **kwargs)
  File "/var/lib/cattle/pyagent/dist/requests/adapters.py", line 479, in send
    raise ReadTimeout(e, request=request)
ReadTimeout: UnixHTTPConnectionPool(host='localhost', port=None): Read timed out. (read timeout=2)

The logs of the Rancher server container:

2016-08-04 20:09:01,452 ERROR [:] [] [] [] [ServiceReplay-2] [i.c.p.e.e.i.ProcessEventListenerImpl] Unknown exception running process [volume.purge:3483026] on [16379]
io.cattle.platform.eventing.exception.EventExecutionException: ('Connection aborted.', error(111, 'ECONNREFUSED'))
    at io.cattle.platform.eventing.exception.EventExecutionException.fromEvent(EventExecutionException.java:53) ~[cattle-framework-eventing-0.5.0-SNAPSHOT.jar:na]
    at io.cattle.platform.agent.impl.RemoteAgentImpl.callSync(RemoteAgentImpl.java:87) ~[cattle-iaas-agent-0.5.0-SNAPSHOT.jar:na]
    at io.cattle.platform.agent.impl.RemoteAgentImpl.callSync(RemoteAgentImpl.java:135) ~[cattle-iaas-agent-0.5.0-SNAPSHOT.jar:na]
    at io.cattle.platform.process.common.handler.AgentBasedProcessHandler.callSync(AgentBasedProcessHandler.java:180) ~[cattle-iaas-logic-common-0.5.0-SNAPSHOT.jar:na]
    at io.cattle.platform.process.common.handler.AgentBasedProcessHandler.handleEvent(AgentBasedProcessHandler.java:166) ~[cattle-iaas-logic-common-0.5.0-SNAPSHOT.jar:na]
    at io.cattle.platform.process.common.handler.AgentBasedProcessHandler.handle(AgentBasedProcessHandler.java:104) ~[cattle-iaas-logic-common-0.5.0-SNAPSHOT.jar:na]
    at io.cattle.platform.engine.process.impl.DefaultProcessInstanceImpl.runHandler(DefaultProcessInstanceImpl.java:424) ~[cattle-framework-engine-0.5.0-SNAPSHOT.jar:na]
    at io.cattle.platform.engine.process.impl.DefaultProcessInstanceImpl$4.execute(DefaultProcessInstanceImpl.java:375) ~[cattle-framework-engine-0.5.0-SNAPSHOT.jar:na]
    at io.cattle.platform.engine.process.impl.DefaultProcessInstanceImpl$4.execute(DefaultProcessInstanceImpl.java:369) ~[cattle-framework-engine-0.5.0-SNAPSHOT.jar:na]
    at io.cattle.platform.engine.idempotent.Idempotent.execute(Idempotent.java:42) ~[cattle-framework-engine-0.5.0-SNAPSHOT.jar:na]
    at io.cattle.platform.engine.process.impl.DefaultProcessInstanceImpl.runHandlers(DefaultProcessInstanceImpl.java:369) ~[cattle-framework-engine-0.5.0-SNAPSHOT.jar:na]
    at io.cattle.platform.engine.process.impl.DefaultProcessInstanceImpl.runLogic(DefaultProcessInstanceImpl.java:471) ~[cattle-framework-engine-0.5.0-SNAPSHOT.jar:na]
    at io.cattle.platform.engine.process.impl.DefaultProcessInstanceImpl.runWithProcessLock(DefaultProcessInstanceImpl.java:305) ~[cattle-framework-engine-0.5.0-SNAPSHOT.jar:na]
    at io.cattle.platform.engine.process.impl.DefaultProcessInstanceImpl$2.doWithLockNoResult(DefaultProcessInstanceImpl.java:245) ~[cattle-framework-engine-0.5.0-SNAPSHOT.jar:na]
    at io.cattle.platform.lock.LockCallbackNoReturn.doWithLock(LockCallbackNoReturn.java:7) ~[cattle-framework-lock-0.5.0-SNAPSHOT.jar:na]

Sorry, that was the wrong Rancher server log, though my convoy-nfs troubles may be connected.

Yesterday I migrated to new hosts, but had to upgrade convoy-nfs from 0.7 to 0.9. It all went to hell, and I had to relaunch all containers using new Docker volume names. That worked for getting them to start.

2016-08-05 07:27:38,555 ERROR [:] [] [] [] [erviceReplay-13] [c.p.e.p.i.DefaultProcessInstanceImpl] final ExitReason is null, should not be
2016-08-05 07:27:38,555 ERROR [:] [] [] [] [erviceReplay-13] [i.c.p.e.e.i.ProcessEventListenerImpl] Unknown exception running process [instance.remove:3364418] on [12744]
java.lang.IllegalStateException: Attempt to cancel when process is still transitioning
    at io.cattle.platform.engine.process.impl.DefaultProcessInstanceImpl.runDelegateLoop(DefaultProcessInstanceImpl.java:190) ~[cattle-framework-engine-0.5.0-SNAPSHOT.jar:na]
    at io.cattle.platform.engine.process.impl.DefaultProcessInstanceImpl.executeWithProcessInstanceLock(DefaultProcessInstanceImpl.java:157) ~[cattle-framework-engine-0.5.0-SNAPSHOT.jar:na]
    at io.cattle.platform.engine.process.impl.DefaultProcessInstanceImpl$1.doWithLock(DefaultProcessInstanceImpl.java:107) ~[cattle-framework-engine-0.5.0-SNAPSHOT.jar:na]
    at io.cattle.platform.engine.process.impl.DefaultProcessInstanceImpl$1.doWithLock(DefaultProcessInstanceImpl.java:104) ~[cattle-framework-engine-0.5.0-SNAPSHOT.jar:na]
    at io.cattle.platform.lock.impl.AbstractLockManagerImpl$3.doWithLock(AbstractLockManagerImpl.java:40) ~[cattle-framework-lock-0.5.0-SNAPSHOT.jar:na]
    at io.cattle.platform.lock.impl.LockManagerImpl.doLock(LockManagerImpl.java:33) ~[cattle-framework-lock-0.5.0-SNAPSHOT.jar:na]
    at io.cattle.platform.lock.impl.AbstractLockManagerImpl.lock(AbstractLockManagerImpl.java:13) ~[cattle-framework-lock-0.5.0-SNAPSHOT.jar:na]
    at io.cattle.platform.lock.impl.AbstractLockManagerImpl.lock(AbstractLockManagerImpl.java:37) ~[cattle-framework-lock-0.5.0-SNAPSHOT.jar:na]
    at io.cattle.platform.engine.process.impl.DefaultProcessInstanceImpl.execute(DefaultProcessInstanceImpl.java:104) ~[cattle-framework-engine-0.5.0-SNAPSHOT.jar:na]
    at io.cattle.platform.engine.eventing.impl.ProcessEventListenerImpl.processExecute(ProcessEventListenerImpl.java:74) ~[cattle-framework-engine-0.5.0-SNAPSHOT.jar:na]
    at io.cattle.platform.engine.server.impl.ProcessInstanceParallelDispatcher$1.runInContext(ProcessInstanceParallelDispatcher.java:27) [cattle-framework-engine-0.5.0-SNAPSHOT.jar:na]
    at org.apache.cloudstack.managed.context.ManagedContextRunnable$1.run(ManagedContextRunnable.java:49) [cattle-framework-managed-context-0.5.0-SNAPSHOT.jar:na]
    at org.apache.cloudstack.managed.context.impl.DefaultManagedContext$1.call(DefaultManagedContext.java:55) [cattle-framework-managed-context-0.5.0-SNAPSHOT.jar:na]
    at org.apache.cloudstack.managed.context.impl.DefaultManagedContext.callWithContext(DefaultManagedContext.java:108) [cattle-framework-managed-context-0.5.0-SNAPSHOT.jar:na]
    at org.apache.cloudstack.managed.context.impl.DefaultManagedContext.runWithContext(DefaultManagedContext.java:52) [cattle-framework-managed-context-0.5.0-SNAPSHOT.jar:na]
    at org.apache.cloudstack.managed.context.ManagedContextRunnable.run(ManagedContextRunnable.java:46) [cattle-framework-managed-context-0.5.0-SNAPSHOT.jar:na]
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) [na:1.7.0_101]
    at java.util.concurrent.FutureTask.run(FutureTask.java:262) [na:1.7.0_101]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [na:1.7.0_101]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [na:1.7.0_101]
    at java.lang.Thread.run(Thread.java:745) [na:1.7.0_101]
Caused by: io.cattle.platform.engine.process.impl.ProcessCancelException: State [active] is not valid
    at io.cattle.platform.engine.process.impl.DefaultProcessInstanceImpl.preRunStateCheck(DefaultProcessInstanceImpl.java:267) ~[cattle-framework-engine-0.5.0-SNAPSHOT.jar:na]
    at io.cattle.platform.engine.process.impl.DefaultProcessInstanceImpl.runDelegateLoop(DefaultProcessInstanceImpl.java:182) ~[cattle-framework-engine-0.5.0-SNAPSHOT.jar:na]
    at io.cattle.platform.engine.process.impl.DefaultProcessInstanceImpl.executeWithProcessInstanceLock(DefaultProcessInstanceImpl.java:157) ~[cattle-framework-engine-0.5.0-SNAPSHOT.jar:na]
    at io.cattle.platform.engine.process.impl.DefaultProcessInstanceImpl$1.doWithLock(DefaultProcessInstanceImpl.java:107) ~[cattle-framework-engine-0.5.0-SNAPSHOT.jar:na]
    at io.cattle.platform.engine.process.impl.DefaultProcessInstanceImpl$1.doWithLock(DefaultProcessInstanceImpl.java:104) ~[cattle-framework-engine-0.5.0-SNAPSHOT.jar:na]
    at io.cattle.platform.lock.impl.AbstractLockManagerImpl$3.doWithLock(AbstractLockManagerImpl.java:40) ~[cattle-framework-lock-0.5.0-SNAPSHOT.jar:na]
    at io.cattle.platform.lock.impl.LockManagerImpl.doLock(LockManagerImpl.java:33) ~[cattle-framework-lock-0.5.0-SNAPSHOT.jar:na]
    at io.cattle.platform.lock.impl.AbstractLockManagerImpl.lock(AbstractLockManagerImpl.java:13) ~[cattle-framework-lock-0.5.0-SNAPSHOT.jar:na]
    at io.cattle.platform.lock.impl.AbstractLockManagerImpl.lock(AbstractLockManagerImpl.java:37) ~[cattle-framework-lock-0.5.0-SNAPSHOT.jar:na]
    at io.cattle.platform.engine.process.impl.DefaultProcessInstanceImpl.execute(DefaultProcessInstanceImpl.java:104) ~[cattle-framework-engine-0.5.0-SNAPSHOT.jar:na]
    at io.cattle.platform.object.process.impl.DefaultObjectProcessManager.executeStandardProcess(DefaultObjectProcessManager.java:29) ~[cattle-framework-object-0.5.0-SNAPSHOT.jar:na]
    at io.cattle.platform.process.common.handler.AbstractObjectProcessLogic.remove(AbstractObjectProcessLogic.java:101) ~[cattle-iaas-logic-common-0.5.0-SNAPSHOT.jar:na]
    at io.cattle.platform.process.instance.InstanceRemove.network(InstanceRemove.java:69) ~[cattle-iaas-logic-0.5.0-SNAPSHOT.jar:na]
    at io.cattle.platform.process.instance.InstanceRemove.handle(InstanceRemove.java:37) ~[cattle-iaas-logic-0.5.0-SNAPSHOT.jar:na]
    at io.cattle.platform.engine.process.impl.DefaultProcessInstanceImpl.runHandler(DefaultProcessInstanceImpl.java:424) ~[cattle-framework-engine-0.5.0-SNAPSHOT.jar:na]
    at io.cattle.platform.engine.process.impl.DefaultProcessInstanceImpl$4.execute(DefaultProcessInstanceImpl.java:375) ~[cattle-framework-engine-0.5.0-SNAPSHOT.jar:na]
    at io.cattle.platform.engine.process.impl.DefaultProcessInstanceImpl$4.execute(DefaultProcessInstanceImpl.java:369) ~[cattle-framework-engine-0.5.0-SNAPSHOT.jar:na]
    at io.cattle.platform.engine.idempotent.Idempotent.execute(Idempotent.java:42) ~[cattle-framework-engine-0.5.0-SNAPSHOT.jar:na]
    at io.cattle.platform.engine.process.impl.DefaultProcessInstanceImpl.runHandlers(DefaultProcessInstanceImpl.java:369) ~[cattle-framework-engine-0.5.0-SNAPSHOT.jar:na]
    at io.cattle.platform.engine.process.impl.DefaultProcessInstanceImpl.runLogic(DefaultProcessInstanceImpl.java:471) ~[cattle-framework-engine-0.5.0-SNAPSHOT.jar:na]
    at io.cattle.platform.engine.process.impl.DefaultProcessInstanceImpl.runWithProcessLock(DefaultProcessInstanceImpl.java:305) ~[cattle-framework-engine-0.5.0-SNAPSHOT.jar:na]
    at io.cattle.platform.engine.process.impl.DefaultProcessInstanceImpl$2.doWithLockNoResult(DefaultProcessInstanceImpl.java:245) ~[cattle-framework-engine-0.5.0-SNAPSHOT.jar:na]
    at io.cattle.platform.lock.LockCallbackNoReturn.doWithLock(LockCallbackNoReturn.java:7) ~[cattle-framework-lock-0.5.0-SNAPSHOT.jar:na]
    at io.cattle.platform.lock.LockCallbackNoReturn.doWithLock(LockCallbackNoReturn.java:3) ~[cattle-framework-lock-0.5.0-SNAPSHOT.jar:na]
    at io.cattle.platform.lock.impl.AbstractLockManagerImpl$3.doWithLock(AbstractLockManagerImpl.java:40) ~[cattle-framework-lock-0.5.0-SNAPSHOT.jar:na]
    at io.cattle.platform.lock.impl.LockManagerImpl.doLock(LockManagerImpl.java:33) ~[cattle-framework-lock-0.5.0-SNAPSHOT.jar:na]
    at io.cattle.platform.lock.impl.AbstractLockManagerImpl.lock(AbstractLockManagerImpl.java:13) ~[cattle-framework-lock-0.5.0-SNAPSHOT.jar:na]
    at io.cattle.platform.lock.impl.AbstractLockManagerImpl.lock(AbstractLockManagerImpl.java:37) ~[cattle-framework-lock-0.5.0-SNAPSHOT.jar:na]
    at io.cattle.platform.engine.process.impl.DefaultProcessInstanceImpl.acquireLockAndRun(DefaultProcessInstanceImpl.java:242) ~[cattle-framework-engine-0.5.0-SNAPSHOT.jar:na]
    at io.cattle.platform.engine.process.impl.DefaultProcessInstanceImpl.runDelegateLoop(DefaultProcessInstanceImpl.java:184) ~[cattle-framework-engine-0.5.0-SNAPSHOT.jar:na]
    ... 20 common frames omitted

Any update on this? I am experiencing similar issues. Right now I have nothing extremely critical in production, but I am looking to get there very soon.