Hi.
We’re having trouble with our Rancher setup. One symptom is load balancers never getting past the initializing state, although they do actually seem to work, at least some of the time. We are on v0.59.0, and I can ping other network agents from the agent on the machine in question, so communication over the managed network seems to be OK.
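For reference, this is roughly how we check that from the affected host (a sketch using the Docker SDK for Python; the 10.42.x.x target address is a placeholder for another host's network agent):

import docker

# Talk to the local Docker daemon on the host we're debugging.
client = docker.from_env()

# The network agent runs from the rancher/agent-instance image
# (it shows up in the agent logs below); grab the first matching container.
agents = [c for c in client.containers.list()
          if any(t.startswith("rancher/agent-instance") for t in (c.image.tags or []))]

# Ping the managed-network address of a network agent on another host.
# 10.42.0.1 is a placeholder -- substitute the real address from the UI.
result = agents[0].exec_run("ping -c 3 10.42.0.1")
print(result)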
On the instance we’re having trouble with we see this in the agent logs:
time="2016-02-22T08:56:06Z" level="info" msg="Processing event: &docker.APIEvents{Status:\"destroy\", ID:\"a783d5b5312334040fe5ec1df7151171cdb6881036b43e66d035f0ce7843ed7f\", From:\"rancher/agent-instance:v0.6.0\", Time:1456131366}"
time="2016-02-22T08:57:39Z" level="info" msg="Processing event: &docker.APIEvents{Status:\"start\", ID:\"0164336d5b9c997411aa69076e5fa218ce8abeaeb7284346005cafdefd59804e\", From:\"rancher/agent-instance:v0.8.0\", Time:1456131459}"
time="2016-02-22T08:57:39Z" level="info" msg="Assigning IP [10.42.156.127/16], ContainerId [0164336d5b9c997411aa69076e5fa218ce8abeaeb7284346005cafdefd59804e], Pid [23211]"
time="2016-02-22T08:57:39Z" level="info" msg="Processing event: &docker.APIEvents{Status:\"start\", ID:\"0164336d5b9c997411aa69076e5fa218ce8abeaeb7284346005cafdefd59804e\", From:\"-simulated-\", Time:0}"
time="2016-02-22T08:57:39Z" level="info" msg="Container locked. Can't run StartHandler. ID: [0164336d5b9c997411aa69076e5fa218ce8abeaeb7284346005cafdefd59804e]"
time="2016-02-22T08:57:53Z" level="info" msg="Processing event: &docker.APIEvents{Status:\"start\", ID:\"f5a6fefa550d0845cd86296c7b266075c1ca1548a235dc7981905aaa62f97504\", From:\"uberresearch/s3upload:test\", Time:1456131473}"
time="2016-02-22T08:57:53Z" level="info" msg="Processing event: &docker.APIEvents{Status:\"start\", ID:\"0f87c7eef46d1d1b97969a65a7e9b95d2d3ca1d368fde47028e25eabd3ebf711\", From:\"uberresearch/s3upload:latest\", Time:1456131473}"
time="2016-02-22T08:57:53Z" level="info" msg="Assigning IP [10.42.96.69/16], ContainerId [f5a6fefa550d0845cd86296c7b266075c1ca1548a235dc7981905aaa62f97504], Pid [24929]"
time="2016-02-22T08:57:53Z" level="info" msg="Processing event: &docker.APIEvents{Status:\"start\", ID:\"f5a6fefa550d0845cd86296c7b266075c1ca1548a235dc7981905aaa62f97504\", From:\"-simulated-\", Time:0}"
time="2016-02-22T08:57:53Z" level="info" msg="Container locked. Can't run StartHandler. ID: [f5a6fefa550d0845cd86296c7b266075c1ca1548a235dc7981905aaa62f97504]"
time="2016-02-22T08:57:53Z" level="info" msg="Assigning IP [10.42.196.153/16], ContainerId [0f87c7eef46d1d1b97969a65a7e9b95d2d3ca1d368fde47028e25eabd3ebf711], Pid [24951]"
time="2016-02-22T08:57:53Z" level="info" msg="Processing event: &docker.APIEvents{Status:\"start\", ID:\"0f87c7eef46d1d1b97969a65a7e9b95d2d3ca1d368fde47028e25eabd3ebf711\", From:\"-simulated-\", Time:0}"
time="2016-02-22T08:57:53Z" level="info" msg="Container locked. Can't run StartHandler. ID: [0f87c7eef46d1d1b97969a65a7e9b95d2d3ca1d368fde47028e25eabd3ebf711]"
time="2016-02-22T08:58:05Z" level="info" msg="Processing event: &docker.APIEvents{Status:\"start\", ID:\"87cb1e016265b660c749ca4826b5532d6cfaa4e410476d11532bad9669e3eb66\", From:\"rancher/agent-instance:v0.8.0\", Time:1456131485}"
time="2016-02-22T08:58:05Z" level="info" msg="Assigning IP [10.42.119.127/16], ContainerId [87cb1e016265b660c749ca4826b5532d6cfaa4e410476d11532bad9669e3eb66], Pid [26243]"
time="2016-02-22T08:58:05Z" level="info" msg="Processing event: &docker.APIEvents{Status:\"start\", ID:\"87cb1e016265b660c749ca4826b5532d6cfaa4e410476d11532bad9669e3eb66\", From:\"-simulated-\", Time:0}"
time="2016-02-22T08:58:05Z" level="info" msg="Container locked. Can't run StartHandler. ID: [87cb1e016265b660c749ca4826b5532d6cfaa4e410476d11532bad9669e3eb66]"
We have this situation on just about every machine. Killing off the network agent sometimes helps. Removing a machine from Rancher and purging /var/lib/rancher sometimes helps too, but it seems we’re stuck.
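For the record, by "killing off the network agent" we mean bouncing the rancher/agent-instance containers on the host, roughly like this (a sketch using the Docker SDK for Python; as far as we can tell this also hits load balancer agent containers, which run from the same image):

import docker

client = docker.from_env()

# Restart every container created from rancher/agent-instance
# and let Rancher reconcile afterwards.
for c in client.containers.list(all=True):
    tags = c.image.tags or []
    if any(t.startswith("rancher/agent-instance") for t in tags):
        print("restarting", c.name, c.short_id)
        c.restart()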
We’re seeing a lot of this in the master logs:
2016-02-22 09:15:48,471 ERROR [:] [erviceReplay-22] [i.c.p.e.e.i.ProcessEventListenerImpl] Unknown exception running process [volume.deallocate:1310797] on [45778] java.lang.IllegalStateException: Attempt to cancel when process is still transitioning
at io.cattle.platform.engine.process.impl.DefaultProcessInstanceImpl.runDelegateLoop(DefaultProcessInstanceImpl.java:189) ~[cattle-framework-engine-0.5.0-SNAPSHOT.jar:na]
at io.cattle.platform.engine.process.impl.DefaultProcessInstanceImpl.executeWithProcessInstanceLock(DefaultProcessInstanceImpl.java:156) ~[cattle-framework-engine-0.5.0-SNAPSHOT.jar:na]
at io.cattle.platform.engine.process.impl.DefaultProcessInstanceImpl$1.doWithLock(DefaultProcessInstanceImpl.java:106) ~[cattle-framework-engine-0.5.0-SNAPSHOT.jar:na]
at io.cattle.platform.engine.process.impl.DefaultProcessInstanceImpl$1.doWithLock(DefaultProcessInstanceImpl.java:103) ~[cattle-framework-engine-0.5.0-SNAPSHOT.jar:na]
at io.cattle.platform.lock.impl.AbstractLockManagerImpl$3.doWithLock(AbstractLockManagerImpl.java:40) ~[cattle-framework-lock-0.5.0-SNAPSHOT.jar:na]
at io.cattle.platform.lock.impl.LockManagerImpl.doLock(LockManagerImpl.java:33) ~[cattle-framework-lock-0.5.0-SNAPSHOT.jar:na]
at io.cattle.platform.lock.impl.AbstractLockManagerImpl.lock(AbstractLockManagerImpl.java:13) ~[cattle-framework-lock-0.5.0-SNAPSHOT.jar:na]
at io.cattle.platform.lock.impl.AbstractLockManagerImpl.lock(AbstractLockManagerImpl.java:37) ~[cattle-framework-lock-0.5.0-SNAPSHOT.jar:na]
at io.cattle.platform.engine.process.impl.DefaultProcessInstanceImpl.execute(DefaultProcessInstanceImpl.java:103) ~[cattle-framework-engine-0.5.0-SNAPSHOT.jar:na]
at io.cattle.platform.engine.eventing.impl.ProcessEventListenerImpl.processExecute(ProcessEventListenerImpl.java:68) ~[cattle-framework-engine-0.5.0-SNAPSHOT.jar:na]
at io.cattle.platform.engine.server.impl.ProcessInstanceParallelDispatcher$1.runInContext(ProcessInstanceParallelDispatcher.java:27) [cattle-framework-engine-0.5.0-SNAPSHOT.jar:na]
at org.apache.cloudstack.managed.context.ManagedContextRunnable$1.run(ManagedContextRunnable.java:49) [cattle-framework-managed-context-0.5.0-SNAPSHOT.jar:na]
at org.apache.cloudstack.managed.context.impl.DefaultManagedContext$1.call(DefaultManagedContext.java:55) [cattle-framework-managed-context-0.5.0-SNAPSHOT.jar:na]
at org.apache.cloudstack.managed.context.impl.DefaultManagedContext.callWithContext(DefaultManagedContext.java:108) [cattle-framework-managed-context-0.5.0-SNAPSHOT.jar:na]
at org.apache.cloudstack.managed.context.impl.DefaultManagedContext.runWithContext(DefaultManagedContext.java:52) [cattle-framework-managed-context-0.5.0-SNAPSHOT.jar:na]
at org.apache.cloudstack.managed.context.ManagedContextRunnable.run(ManagedContextRunnable.java:46) [cattle-framework-managed-context-0.5.0-SNAPSHOT.jar:na]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) [na:1.7.0_91]
at java.util.concurrent.FutureTask.run(FutureTask.java:262) [na:1.7.0_91]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [na:1.7.0_91]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [na:1.7.0_91]
at java.lang.Thread.run(Thread.java:745) [na:1.7.0_91]
This is odd, since we no longer run any storage setups with Rancher.
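To see what those volume.deallocate processes are still working on, we’ve been listing volumes through the API, roughly like this (a sketch; the server URL and API keys are placeholders, and it assumes the v1 API exposes a volumes collection):

import requests

RANCHER = "http://rancher-server:8080/v1"   # placeholder URL
AUTH = ("ACCESS_KEY", "SECRET_KEY")          # placeholder API key pair

# List volumes Rancher still knows about and print anything that
# isn't already removed/purged, to see what deallocate is stuck on.
volumes = requests.get(RANCHER + "/volumes",
                       params={"limit": "1000"}, auth=AUTH).json()["data"]
for v in volumes:
    if v.get("state") not in ("removed", "purged"):
        print(v.get("id"), v.get("state"), v.get("name"))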
and some of these as well:
2016-02-22 09:15:52,557 ERROR [:] [ecutorService-2] [.p.c.v.i.ConfigItemStatusManagerImpl] Failed null, exit code [1] output [ERROR: Agent instance has not started
]
2016-02-22 09:16:08,287 ERROR [:] [ecutorService-9] [.p.c.v.i.ConfigItemStatusManagerImpl] Failed null, exit code [1] output [ERROR: Agent instance has not started
]
2016-02-22 09:16:24,290 ERROR [:] [ecutorService-9] [.p.c.v.i.ConfigItemStatusManagerImpl] Failed null, exit code [1] output [ERROR: Agent instance has not started
]
2016-02-22 09:16:53,220 ERROR [:] [cutorService-10] [.p.c.v.i.ConfigItemStatusManagerImpl] Failed null, exit code [1] output [ERROR: Agent instance has not started
]
It’s hard to tell from a log message like that which agent instance hasn’t started.
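The closest we’ve got to identifying the culprit is querying the API for agent-instance containers that aren’t actually running; a rough sketch (again, the server URL and API keys are placeholders):

import requests

RANCHER = "http://rancher-server:8080/v1"   # placeholder URL
AUTH = ("ACCESS_KEY", "SECRET_KEY")          # placeholder API key pair

# Look for agent-instance containers that are not in the "running" state;
# presumably these are what "Agent instance has not started" refers to.
containers = requests.get(RANCHER + "/containers",
                          params={"limit": "1000"}, auth=AUTH).json()["data"]
for c in containers:
    image = c.get("imageUuid") or ""
    if "agent-instance" in image and c.get("state") != "running":
        print(c.get("id"), c.get("name"), c.get("state"), image)

That hasn’t got us unstuck yet. How can we proceed?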