Containers stuck at "scheduling"

Hey folks. The last few containers I try to start from rancher-compose seem to be getting stuck at “scheduling.” If I drill in on a service from a stack, I see a status for the service of “Activating (Waiting for [instance:clair_clair_1]. Instance status: Scheduling)”

Looking at rancher-server logs, I see the following stack trace being repeated:

2016-04-14 01:44:43,920 ERROR [:] [] [] [] [ServiceReplay-2] [i.c.p.e.e.i.ProcessEventListenerImpl] Unknown exception running process [instance.start:15109] on [230] java.lang.IllegalStateException: Attempt to cancel when process is still transitioning
        at io.cattle.platform.engine.process.impl.DefaultProcessInstanceImpl.runDelegateLoop(DefaultProcessInstanceImpl.java:191) ~[cattle-framework-engine-0.5.0-SNAPSHOT.jar:na]
        at io.cattle.platform.engine.process.impl.DefaultProcessInstanceImpl.executeWithProcessInstanceLock(DefaultProcessInstanceImpl.java:158) ~[cattle-framework-engine-0.5.0-SNAPSHOT.jar:na]
        at io.cattle.platform.engine.process.impl.DefaultProcessInstanceImpl$1.doWithLock(DefaultProcessInstanceImpl.java:108) ~[cattle-framework-engine-0.5.0-SNAPSHOT.jar:na]
        at io.cattle.platform.engine.process.impl.DefaultProcessInstanceImpl$1.doWithLock(DefaultProcessInstanceImpl.java:105) ~[cattle-framework-engine-0.5.0-SNAPSHOT.jar:na]
        at io.cattle.platform.lock.impl.AbstractLockManagerImpl$3.doWithLock(AbstractLockManagerImpl.java:40) ~[cattle-framework-lock-0.5.0-SNAPSHOT.jar:na]
        at io.cattle.platform.lock.impl.LockManagerImpl.doLock(LockManagerImpl.java:33) ~[cattle-framework-lock-0.5.0-SNAPSHOT.jar:na]
        at io.cattle.platform.lock.impl.AbstractLockManagerImpl.lock(AbstractLockManagerImpl.java:13) ~[cattle-framework-lock-0.5.0-SNAPSHOT.jar:na]
        at io.cattle.platform.lock.impl.AbstractLockManagerImpl.lock(AbstractLockManagerImpl.java:37) ~[cattle-framework-lock-0.5.0-SNAPSHOT.jar:na]
        at io.cattle.platform.engine.process.impl.DefaultProcessInstanceImpl.execute(DefaultProcessInstanceImpl.java:105) ~[cattle-framework-engine-0.5.0-SNAPSHOT.jar:na]
        at io.cattle.platform.process.common.handler.AbstractObjectProcessLogic.execute(AbstractObjectProcessLogic.java:131) ~[cattle-iaas-logic-common-0.5.0-SNAPSHOT.jar:na]
        at io.cattle.platform.process.instance.InstanceStart.allocate(InstanceStart.java:217) ~[cattle-iaas-logic-0.5.0-SNAPSHOT.jar:na]
        at io.cattle.platform.process.instance.InstanceStart.handle(InstanceStart.java:75) ~[cattle-iaas-logic-0.5.0-SNAPSHOT.jar:na]
        at io.cattle.platform.engine.process.impl.DefaultProcessInstanceImpl.runHandler(DefaultProcessInstanceImpl.java:446) ~[cattle-framework-engine-0.5.0-SNAPSHOT.jar:na]
        at io.cattle.platform.engine.process.impl.DefaultProcessInstanceImpl$4.execute(DefaultProcessInstanceImpl.java:393) ~[cattle-framework-engine-0.5.0-SNAPSHOT.jar:na]
        at io.cattle.platform.engine.process.impl.DefaultProcessInstanceImpl$4.execute(DefaultProcessInstanceImpl.java:387) ~[cattle-framework-engine-0.5.0-SNAPSHOT.jar:na]
        at io.cattle.platform.engine.idempotent.Idempotent.execute(Idempotent.java:42) ~[cattle-framework-engine-0.5.0-SNAPSHOT.jar:na]
        at io.cattle.platform.engine.process.impl.DefaultProcessInstanceImpl.runHandlers(DefaultProcessInstanceImpl.java:387) ~[cattle-framework-engine-0.5.0-SNAPSHOT.jar:na]
        at io.cattle.platform.engine.process.impl.DefaultProcessInstanceImpl.runLogic(DefaultProcessInstanceImpl.java:493) ~[cattle-framework-engine-0.5.0-SNAPSHOT.jar:na]
        at io.cattle.platform.engine.process.impl.DefaultProcessInstanceImpl.runWithProcessLock(DefaultProcessInstanceImpl.java:320) ~[cattle-framework-engine-0.5.0-SNAPSHOT.jar:na]
        at io.cattle.platform.engine.process.impl.DefaultProcessInstanceImpl$2.doWithLockNoResult(DefaultProcessInstanceImpl.java:260) ~[cattle-framework-engine-0.5.0-SNAPSHOT.jar:na]
        at io.cattle.platform.lock.LockCallbackNoReturn.doWithLock(LockCallbackNoReturn.java:7) ~[cattle-framework-lock-0.5.0-SNAPSHOT.jar:na]
        at io.cattle.platform.lock.LockCallbackNoReturn.doWithLock(LockCallbackNoReturn.java:3) ~[cattle-framework-lock-0.5.0-SNAPSHOT.jar:na]
        at io.cattle.platform.lock.impl.AbstractLockManagerImpl$3.doWithLock(AbstractLockManagerImpl.java:40) ~[cattle-framework-lock-0.5.0-SNAPSHOT.jar:na]
        at io.cattle.platform.lock.impl.LockManagerImpl.doLock(LockManagerImpl.java:33) ~[cattle-framework-lock-0.5.0-SNAPSHOT.jar:na]
        at io.cattle.platform.lock.impl.AbstractLockManagerImpl.lock(AbstractLockManagerImpl.java:13) ~[cattle-framework-lock-0.5.0-SNAPSHOT.jar:na]
        at io.cattle.platform.lock.impl.AbstractLockManagerImpl.lock(AbstractLockManagerImpl.java:37) ~[cattle-framework-lock-0.5.0-SNAPSHOT.jar:na]
        at io.cattle.platform.engine.process.impl.DefaultProcessInstanceImpl.acquireLockAndRun(DefaultProcessInstanceImpl.java:257) ~[cattle-framework-engine-0.5.0-SNAPSHOT.jar:na]
        at io.cattle.platform.engine.process.impl.DefaultProcessInstanceImpl.runDelegateLoop(DefaultProcessInstanceImpl.java:185) ~[cattle-framework-engine-0.5.0-SNAPSHOT.jar:na]
        at io.cattle.platform.engine.process.impl.DefaultProcessInstanceImpl.executeWithProcessInstanceLock(DefaultProcessInstanceImpl.java:158) ~[cattle-framework-engine-0.5.0-SNAPSHOT.jar:na]
        at io.cattle.platform.engine.process.impl.DefaultProcessInstanceImpl$1.doWithLock(DefaultProcessInstanceImpl.java:108) ~[cattle-framework-engine-0.5.0-SNAPSHOT.jar:na]
        at io.cattle.platform.engine.process.impl.DefaultProcessInstanceImpl$1.doWithLock(DefaultProcessInstanceImpl.java:105) ~[cattle-framework-engine-0.5.0-SNAPSHOT.jar:na]
        at io.cattle.platform.lock.impl.AbstractLockManagerImpl$3.doWithLock(AbstractLockManagerImpl.java:40) ~[cattle-framework-lock-0.5.0-SNAPSHOT.jar:na]
        at io.cattle.platform.lock.impl.LockManagerImpl.doLock(LockManagerImpl.java:33) ~[cattle-framework-lock-0.5.0-SNAPSHOT.jar:na]
        at io.cattle.platform.lock.impl.AbstractLockManagerImpl.lock(AbstractLockManagerImpl.java:13) ~[cattle-framework-lock-0.5.0-SNAPSHOT.jar:na]
        at io.cattle.platform.lock.impl.AbstractLockManagerImpl.lock(AbstractLockManagerImpl.java:37) ~[cattle-framework-lock-0.5.0-SNAPSHOT.jar:na]
        at io.cattle.platform.engine.process.impl.DefaultProcessInstanceImpl.execute(DefaultProcessInstanceImpl.java:105) ~[cattle-framework-engine-0.5.0-SNAPSHOT.jar:na]
        at io.cattle.platform.engine.eventing.impl.ProcessEventListenerImpl.processExecute(ProcessEventListenerImpl.java:69) ~[cattle-framework-engine-0.5.0-SNAPSHOT.jar:na]
        at io.cattle.platform.engine.server.impl.ProcessInstanceParallelDispatcher$1.runInContext(ProcessInstanceParallelDispatcher.java:27) [cattle-framework-engine-0.5.0-SNAPSHOT.jar:na]
        at org.apache.cloudstack.managed.context.ManagedContextRunnable$1.run(ManagedContextRunnable.java:49) [cattle-framework-managed-context-0.5.0-SNAPSHOT.jar:na]
        at org.apache.cloudstack.managed.context.impl.DefaultManagedContext$1.call(DefaultManagedContext.java:55) [cattle-framework-managed-context-0.5.0-SNAPSHOT.jar:na]
        at org.apache.cloudstack.managed.context.impl.DefaultManagedContext.callWithContext(DefaultManagedContext.java:108) [cattle-framework-managed-context-0.5.0-SNAPSHOT.jar:na]
        at org.apache.cloudstack.managed.context.impl.DefaultManagedContext.runWithContext(DefaultManagedContext.java:52) [cattle-framework-managed-context-0.5.0-SNAPSHOT.jar:na]
        at org.apache.cloudstack.managed.context.ManagedContextRunnable.run(ManagedContextRunnable.java:46) [cattle-framework-managed-context-0.5.0-SNAPSHOT.jar:na]
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) [na:1.7.0_95]
        at java.util.concurrent.FutureTask.run(FutureTask.java:262) [na:1.7.0_95]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [na:1.7.0_95]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [na:1.7.0_95]
        at java.lang.Thread.run(Thread.java:745) [na:1.7.0_95]

Looking under admin/processes, I see instance.create tasks failing with “UNKNOWN EXCEPTION”

Anybody seen similar? I haven’t dug into the rancher source yet, but guessing a job got stuck somewhere and it’s screwing up anything that comes after it…

Happy to add a docker-compose spec or whatever, but this doesn’t seem to be tied to the actual container definition itself

Update…

Backed up the db and did a little poking around. Good amount of carnage in process_instance table. Tried updating records that had null end_time values, then records that had null results…ended up just truncating the process_instance table. That seemed to help a little. Was able to launch wordpress from the catalog and that came up. So went back and tried firing up my service with rancher-compose, service now goes to “Activating (Reconciling)” and then the same Exception fires.

I’ve stared at this enough for one night - I suspect might be issues in the instance table, will dig further in the am.

@jlk can you share some more details about your environment? Rancher version, docker version, number of hosts, and the compose you are trying to run.

Also, can you share a bigger chunk of your logs from before and after the exception? You can put those in a gist if you like instead of pasting directly into this thread.

Sure

Rancher v1.0
On the Rancher server, docker 1.10.3 running on Ubuntu 14.04
ec2 hosts are provisioned by Rancher - 1.10.3 on Ubuntu 15.10, m4.large instances. 2 of those in this case. Plenty of free cpu/mem/disk.

Gist of a few thousand lines at https://gist.github.com/jlk/3ca9206a6391f9abf895631d252c6c09. Not trying to overload, but it mostly looks repetitive…

For the compose file, I’m down to trying something fairly simplistic:

  postgres:
    image: postgres:latest
    volumes:
      - postgres:/var/lib/postgresql/data
    volume_driver: convoy-nfs
    environment:
      POSTGRES_PASSWORD: password

I nuked and installed v1.0.1…if I see this again, will pick it back up…

@jlk, have you resolved the problem? I met the same problem.

after updating the instance in stack, the stack and service always being activating.
Stack(Environment): activating.
Serivce: activating
instance: running.
they have inconsistent status.

I’m unfortunately not currently using Rancher, so can’t say I’ve addressed it.