Healthcheck stuck in Rancher

Hi Team,

We are using Rancher 1.3 to manage k8s in a production environment.

Currently, three applications keep restarting continuously. Please see the errors and the logs from Rancher below.

Errors in the Rancher GUI:
(Failed to acquire lock [SERVICE.XXX])
(Expected state running but got error: io.cattle.platform.async.utils.TimeoutException: Timeout)

Logs:

2017-11-23 07:11:52,343 ERROR [e530a028-665d-42e4-9952-08e018e00bb9:2235008] [instance:45433] [instance.start->(InstanceStart)] [utorService-290] [i.c.p.process.instance.InstanceStart] Failed [3/3] to Starting for instance [45433]
2017-11-23 07:11:54,139 ERROR [:] [cutorService-16] [o.a.c.m.context.NoExceptionRunnable ] Expected state running but got error: io.cattle.platform.async.utils.TimeoutException: Timeout
2017-11-23 07:12:04,346 ERROR [3ca5ae21-94e3-475f-97b2-3da9205a11ac:2235053] [instance:45434->instanceHostMap:18891] [instance.start->(InstanceStart)->instancehostmap.activate] [utorService-269] [c.p.e.p.i.DefaultProcessInstanceImpl] Unknown exception io.cattle.platform.util.exception.ExecutionException: io.cattle.platform.async.utils.TimeoutException: Timeout
at io.cattle.platform.process.common.handler.AgentBasedProcessLogic.handleEvent(AgentBasedProcessLogic.java:193) ~[cattle-iaas-logic-common-0.5.0-SNAPSHOT.jar:na]
at io.cattle.platform.process.common.handler.AgentBasedProcessLogic.handle(AgentBasedProcessLogic.java:111) ~[cattle-iaas-logic-common-0.5.0-SNAPSHOT.jar:na]
at io.cattle.platform.engine.process.impl.DefaultProcessInstanceImpl.runHandler(DefaultProcessInstanceImpl.java:448) [cattle-framework-engine-0.5.0-SNAPSHOT.jar:na]
at io.cattle.platform.engine.process.impl.DefaultProcessInstanceImpl$4.execute(DefaultProcessInstanceImpl.java:399) [cattle-framework-engine-0.5.0-SNAPSHOT.jar:na]
at io.cattle.platform.engine.process.impl.DefaultProcessInstanceImpl$4.execute(DefaultProcessInstanceImpl.java:393) [cattle-framework-engine-0.5.0-SNAPSHOT.jar:na]
at io.cattle.platform.engine.idempotent.Idempotent.execute(Idempotent.java:42) [cattle-framework-engine-0.5.0-SNAPSHOT.jar:na]
at io.cattle.platform.engine.process.impl.DefaultProcessInstanceImpl.runHandlers(DefaultProcessInstanceImpl.java:393) [cattle-framework-engine-0.5.0-SNAPSHOT.jar:na]
at io.cattle.platform.engine.process.impl.DefaultProcessInstanceImpl.runLogic(DefaultProcessInstanceImpl.java:498) [cattle-framework-engine-0.5.0-SNAPSHOT.jar:na]
at io.cattle.platform.engine.process.impl.DefaultProcessInstanceImpl.runWithProcessLock(DefaultProcessInstanceImpl.java:326) [cattle-framework-engine-0.5.0-SNAPSHOT.jar:na]
at io.cattle.platform.engine.process.impl.DefaultProcessInstanceImpl$2.doWithLockNoResult(DefaultProcessInstanceImpl.java:243) [cattle-framework-engine-0.5.0-SNAPSHOT.jar:na]
at io.cattle.platform.lock.LockCallbackNoReturn.doWithLock(LockCallbackNoReturn.java:7) [cattle-framework-lock-0.5.0-SNAPSHOT.jar:na]
at io.cattle.platform.lock.LockCallbackNoReturn.doWithLock(LockCallbackNoReturn.java:3) [cattle-framework-lock-0.5.0-SNAPSHOT.jar:na]
at io.cattle.platform.lock.impl.AbstractLockManagerImpl$3.doWithLock(AbstractLockManagerImpl.java:40) [cattle-framework-lock-0.5.0-SNAPSHOT.jar:na]
at io.cattle.platform.lock.impl.LockManagerImpl.doLock(LockManagerImpl.java:33) [cattle-framework-lock-0.5.0-SNAPSHOT.jar:na]
at io.cattle.platform.lock.impl.AbstractLockManagerImpl.lock(AbstractLockManagerImpl.java:13) [cattle-framework-lock-0.5.0-SNAPSHOT.jar:na]
at io.cattle.platform.lock.impl.AbstractLockManagerImpl.lock(AbstractLockManagerImpl.java:37) [cattle-framework-lock-0.5.0-SNAPSHOT.jar:na]
at io.cattle.platform.engine.process.impl.DefaultProcessInstanceImpl.acquireLockAndRun(DefaultProcessInstanceImpl.java:240) [cattle-framework-engine-0.5.0-SNAPSHOT.jar:na]
at io.cattle.platform.engine.process.impl.DefaultProcessInstanceImpl.runDelegateLoop(DefaultProcessInstanceImpl.java:182) [cattle-framework-engine-0.5.0-SNAPSHOT.jar:na]
at io.cattle.platform.engine.process.impl.DefaultProcessInstanceImpl.executeWithProcessInstanceLock(DefaultProcessInstanceImpl.java:155) [cattle-framework-engine-0.5.0-SNAPSHOT.jar:na]
at io.cattle.platform.engine.process.impl.DefaultProcessInstanceImpl$1.doWithLock(DefaultProcessInstanceImpl.java:114) [cattle-framework-engine-0.5.0-SNAPSHOT.jar:na]
at io.cattle.platform.engine.process.impl.DefaultProcessInstanceImpl$1.doWithLock(DefaultProcessInstanceImpl.java:111) [cattle-framework-engine-0.5.0-SNAPSHOT.jar:na]
at io.cattle.platform.lock.impl.AbstractLockManagerImpl$3.doWithLock(AbstractLockManagerImpl.java:40) [cattle-framework-lock-0.5.0-SNAPSHOT.jar:na]
at io.cattle.platform.lock.impl.LockManagerImpl.doLock(LockManagerImpl.java:33) [cattle-framework-lock-0.5.0-SNAPSHOT.jar:na]
at io.cattle.platform.lock.impl.AbstractLockManagerImpl.lock(AbstractLockManagerImpl.java:13) [cattle-framework-lock-0.5.0-SNAPSHOT.jar:na]
at io.cattle.platform.lock.impl.AbstractLockManagerImpl.lock(AbstractLockManagerImpl.java:37) [cattle-framework-lock-0.5.0-SNAPSHOT.jar:na]
at io.cattle.platform.engine.process.impl.DefaultProcessInstanceImpl.execute(DefaultProcessInstanceImpl.java:111) [cattle-framework-engine-0.5.0-SNAPSHOT.jar:na]
at io.cattle.platform.object.process.impl.DefaultObjectProcessManager.executeStandardProcess(DefaultObjectProcessManager.java:32) [cattle-framework-object-0.5.0-SNAPSHOT.jar:na]
at io.cattle.platform.process.common.handler.AbstractObjectProcessLogic.activate(AbstractObjectProcessLogic.java:31) [cattle-iaas-logic-common-0.5.0-SNAPSHOT.jar:na]
at io.cattle.platform.process.instance.InstanceStart.compute(InstanceStart.java:384) [cattle-iaas-logic-0.5.0-SNAPSHOT.jar:na]
at io.cattle.platform.process.instance.InstanceStart.handle(InstanceStart.java:140) [cattle-iaas-logic-0.5.0-SNAPSHOT.jar:na]
at io.cattle.platform.engine.process.impl.DefaultProcessInstanceImpl.runHandler(DefaultProcessInstanceImpl.java:448) [cattle-framework-engine-0.5.0-SNAPSHOT.jar:na]
at io.cattle.platform.engine.process.impl.DefaultProcessInstanceImpl$4.execute(DefaultProcessInstanceImpl.java:399) [cattle-framework-engine-0.5.0-SNAPSHOT.jar:na]
at io.cattle.platform.engine.process.impl.DefaultProcessInstanceImpl$4.execute(DefaultProcessInstanceImpl.java:393) [cattle-framework-engine-0.5.0-SNAPSHOT.jar:na]
at io.cattle.platform.engine.idempotent.Idempotent.execute(Idempotent.java:42) [cattle-framework-engine-0.5.0-SNAPSHOT.jar:na]
at io.cattle.platform.engine.process.impl.DefaultProcessInstanceImpl.runHandlers(DefaultProcessInstanceImpl.java:393) [cattle-framework-engine-0.5.0-SNAPSHOT.jar:na]
at io.cattle.platform.engine.process.impl.DefaultProcessInstanceImpl.runLogic(DefaultProcessInstanceImpl.java:495) [cattle-framework-engine-0.5.0-SNAPSHOT.jar:na]
at io.cattle.platform.engine.process.impl.DefaultProcessInstanceImpl.runWithProcessLock(DefaultProcessInstanceImpl.java:326) [cattle-framework-engine-0.5.0-SNAPSHOT.jar:na]
at io.cattle.platform.engine.process.impl.DefaultProcessInstanceImpl$2.doWithLockNoResult(DefaultProcessInstanceImpl.java:243) [cattle-framework-engine-0.5.0-SNAPSHOT.jar:na]
at io.cattle.platform.lock.LockCallbackNoReturn.doWithLock(LockCallbackNoReturn.java:7) [cattle-framework-lock-0.5.0-SNAPSHOT.jar:na]
at io.cattle.platform.lock.LockCallbackNoReturn.doWithLock(LockCallbackNoReturn.java:3) [cattle-framework-lock-0.5.0-SNAPSHOT.jar:na]
at io.cattle.platform.lock.impl.AbstractLockManagerImpl$3.doWithLock(AbstractLockManagerImpl.java:40) [cattle-framework-lock-0.5.0-SNAPSHOT.jar:na]
at io.cattle.platform.lock.impl.LockManagerImpl.doLock(LockManagerImpl.java:33) [cattle-framework-lock-0.5.0-SNAPSHOT.jar:na]
at io.cattle.platform.lock.impl.AbstractLockManagerImpl.lock(AbstractLockManagerImpl.java:13) [cattle-framework-lock-0.5.0-SNAPSHOT.jar:na]
at io.cattle.platform.lock.impl.AbstractLockManagerImpl.lock(AbstractLockManagerImpl.java:37) [cattle-framework-lock-0.5.0-SNAPSHOT.jar:na]
at io.cattle.platform.engine.process.impl.DefaultProcessInstanceImpl.acquireLockAndRun(DefaultProcessInstanceImpl.java:240) [cattle-framework-engine-0.5.0-SNAPSHOT.jar:na]
at io.cattle.platform.engine.process.impl.DefaultProcessInstanceImpl.runDelegateLoop(DefaultProcessInstanceImpl.java:182) [cattle-framework-engine-0.5.0-SNAPSHOT.jar:na]
at io.cattle.platform.engine.process.impl.DefaultProcessInstanceImpl.executeWithProcessInstanceLock(DefaultProcessInstanceImpl.java:155) [cattle-framework-engine-0.5.0-SNAPSHOT.jar:na]
at io.cattle.platform.engine.process.impl.DefaultProcessInstanceImpl$1.doWithLock(DefaultProcessInstanceImpl.java:114) [cattle-framework-engine-0.5.0-SNAPSHOT.jar:na]
at io.cattle.platform.engine.process.impl.DefaultProcessInstanceImpl$1.doWithLock(DefaultProcessInstanceImpl.java:111) [cattle-framework-engine-0.5.0-SNAPSHOT.jar:na]
at io.cattle.platform.lock.impl.AbstractLockManagerImpl$3.doWithLock(AbstractLockManagerImpl.java:40) [cattle-framework-lock-0.5.0-SNAPSHOT.jar:na]
at io.cattle.platform.lock.impl.LockManagerImpl.doLock(LockManagerImpl.java:33) [cattle-framework-lock-0.5.0-SNAPSHOT.jar:na]
at io.cattle.platform.lock.impl.AbstractLockManagerImpl.lock(AbstractLockManagerImpl.java:13) [cattle-framework-lock-0.5.0-SNAPSHOT.jar:na]
at io.cattle.platform.lock.impl.AbstractLockManagerImpl.lock(AbstractLockManagerImpl.java:37) [cattle-framework-lock-0.5.0-SNAPSHOT.jar:na]
at io.cattle.platform.engine.process.impl.DefaultProcessInstanceImpl.execute(DefaultProcessInstanceImpl.java:111) [cattle-framework-engine-0.5.0-SNAPSHOT.jar:na]
at io.cattle.platform.engine.server.impl.ProcessInstanceDispatcherImpl.processExecuteWithLock(ProcessInstanceDispatcherImpl.java:98) [cattle-framework-engine-0.5.0-SNAPSHOT.jar:na]
at io.cattle.platform.engine.server.impl.ProcessInstanceDispatcherImpl$1$1.doWithLockNoResult(ProcessInstanceDispatcherImpl.java:71) [cattle-framework-engine-0.5.0-SNAPSHOT.jar:na]
at io.cattle.platform.lock.LockCallbackNoReturn.doWithLock(LockCallbackNoReturn.java:7) [cattle-framework-lock-0.5.0-SNAPSHOT.jar:na]
at io.cattle.platform.lock.LockCallbackNoReturn.doWithLock(LockCallbackNoReturn.java:3) [cattle-framework-lock-0.5.0-SNAPSHOT.jar:na]
at io.cattle.platform.lock.impl.AbstractLockManagerImpl$4.doWithLock(AbstractLockManagerImpl.java:50) [cattle-framework-lock-0.5.0-SNAPSHOT.jar:na]
at io.cattle.platform.lock.impl.LockManagerImpl.doLock(LockManagerImpl.java:33) [cattle-framework-lock-0.5.0-SNAPSHOT.jar:na]
at io.cattle.platform.lock.impl.AbstractLockManagerImpl.tryLock(AbstractLockManagerImpl.java:25) [cattle-framework-lock-0.5.0-SNAPSHOT.jar:na]
at io.cattle.platform.lock.impl.AbstractLockManagerImpl.tryLock(AbstractLockManagerImpl.java:47) [cattle-framework-lock-0.5.0-SNAPSHOT.jar:na]
at io.cattle.platform.engine.server.impl.ProcessInstanceDispatcherImpl$1.doRun(ProcessInstanceDispatcherImpl.java:68) [cattle-framework-engine-0.5.0-SNAPSHOT.jar:na]
at org.apache.cloudstack.managed.context.NoExceptionRunnable.runInContext(NoExceptionRunnable.java:15) [cattle-framework-managed-context-0.5.0-SNAPSHOT.jar:na]
at org.apache.cloudstack.managed.context.ManagedContextRunnable$1.run(ManagedContextRunnable.java:49) [cattle-framework-managed-context-0.5.0-SNAPSHOT.jar:na]
at org.apache.cloudstack.managed.context.impl.DefaultManagedContext$1.call(DefaultManagedContext.java:55) [cattle-framework-managed-context-0.5.0-SNAPSHOT.jar:na]
at org.apache.cloudstack.managed.context.impl.DefaultManagedContext.callWithContext(DefaultManagedContext.java:108) [cattle-framework-managed-context-0.5.0-SNAPSHOT.jar:na]
at org.apache.cloudstack.managed.context.impl.DefaultManagedContext.runWithContext(DefaultManagedContext.java:52) [cattle-framework-managed-context-0.5.0-SNAPSHOT.jar:na]
at org.apache.cloudstack.managed.context.ManagedContextRunnable.run(ManagedContextRunnable.java:46) [cattle-framework-managed-context-0.5.0-SNAPSHOT.jar:na]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [na:1.8.0_72]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [na:1.8.0_72]
at java.lang.Thread.run(Thread.java:745) [na:1.8.0_72]
Caused by: io.cattle.platform.async.utils.TimeoutException: Timeout
at io.cattle.platform.async.retry.impl.RetryTimeoutServiceImpl.retry(RetryTimeoutServiceImpl.java:41) ~[cattle-framework-async-0.5.0-SNAPSHOT.jar:na]
at sun.reflect.GeneratedMethodAccessor356.invoke(Unknown Source) ~[na:na]
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[na:1.8.0_72]
at java.lang.reflect.Method.invoke(Method.java:498) ~[na:1.8.0_72]
at org.springframework.scheduling.support.ScheduledMethodRunnable.run(ScheduledMethodRunnable.java:65) ~[spring-context-4.3.2.RELEASE.jar:4.3.2.RELEASE]
at org.springframework.scheduling.support.DelegatingErrorHandlingRunnable.run(DelegatingErrorHandlingRunnable.java:54) ~[spring-context-4.3.2.RELEASE.jar:4.3.2.RELEASE]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) ~[na:1.8.0_72]
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) ~[na:1.8.0_72]
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180) ~[na:1.8.0_72]
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294) ~[na:1.8.0_72]
... 3 common frames omitted

The three applications are healthcheck (on one node), kubectld, and rancher-ingress-controller. Here are the records found in the database:

mysql> select id, name, description, state from instance order by id desc limit 10;
+-------+-----------------------------------------+-------------+----------+
| id    | name                                    | description | state    |
+-------+-----------------------------------------+-------------+----------+
| 45480 | kubernetes-rancher-ingress-controller-1 | NULL        | starting |
| 45479 | healthcheck-healthcheck-2               | NULL        | starting |
| 45478 | kubernetes-kubectld-1                   | NULL        | starting |
| 45477 | kubernetes-rancher-ingress-controller-1 | NULL        | removed  |
| 45476 | healthcheck-healthcheck-2               | NULL        | removed  |
| 45475 | kubernetes-kubectld-1                   | NULL        | removed  |
| 45474 | kubernetes-rancher-ingress-controller-1 | NULL        | removed  |
| 45473 | healthcheck-healthcheck-2               | NULL        | removed  |
| 45472 | kubernetes-kubectld-1                   | NULL        | removed  |
| 45471 | kubernetes-rancher-ingress-controller-1 | NULL        | removed  |
+-------+-----------------------------------------+-------------+----------+
10 rows in set (0.00 sec)
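
To make the churn easier to see, the rows above can be grouped by name and state. The following is only a rough sketch: it loads the ten rows shown above into an in-memory SQLite table (a stand-in for the real Rancher MySQL database, whose full schema is not shown here) and runs a GROUP BY. The function name summarize is made up for illustration.

```python
import sqlite3

# Rows copied from the query output above: (id, name, state).
ROWS = [
    (45480, "kubernetes-rancher-ingress-controller-1", "starting"),
    (45479, "healthcheck-healthcheck-2", "starting"),
    (45478, "kubernetes-kubectld-1", "starting"),
    (45477, "kubernetes-rancher-ingress-controller-1", "removed"),
    (45476, "healthcheck-healthcheck-2", "removed"),
    (45475, "kubernetes-kubectld-1", "removed"),
    (45474, "kubernetes-rancher-ingress-controller-1", "removed"),
    (45473, "healthcheck-healthcheck-2", "removed"),
    (45472, "kubernetes-kubectld-1", "removed"),
    (45471, "kubernetes-rancher-ingress-controller-1", "removed"),
]


def summarize(rows):
    """Count instances per (name, state) using an in-memory stand-in table."""
    conn = sqlite3.connect(":memory:")
    # Only the three columns used here; the real instance table has more.
    conn.execute(
        "CREATE TABLE instance (id INTEGER PRIMARY KEY, name TEXT, state TEXT)"
    )
    conn.executemany("INSERT INTO instance VALUES (?, ?, ?)", rows)
    result = conn.execute(
        "SELECT name, state, COUNT(*) FROM instance "
        "GROUP BY name, state ORDER BY name, state"
    ).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    for name, state, count in summarize(ROWS):
        print(f"{name:40} {state:9} {count}")
```

Each of the three names shows one instance in starting and several in removed, which matches the start/stop loop described here. I would expect the same SELECT ... GROUP BY to work in the mysql client against the real instance table, but I have only tried it on the sample rows.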

I believe these applications are managed by Rancher. k8s itself and the applications running in k8s are working normally.
These three applications keep starting and stopping continuously on all the nodes.

How can I recover these applications, or is there any workaround for this issue?

I appreciate your help :)