Environment becomes unstable

Hi.

We're having trouble with our Rancher setup. One symptom is load balancers never getting past the initializing state. They actually seem to work, at least sometimes. We are on v0.59.0, and I can ping other network agents from the agent on the machine in question, so cross-host networking seems to be OK.
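
For reference, this is roughly how we check connectivity between the network agents (the container ID and the 10.42.x.x address below are placeholders; substitute the Network Agent container on your host and the overlay IP of a container on another host):

# find the Network Agent container on this host
docker ps | grep agent-instance

# ping the overlay IP of a container running on a different host
docker exec -it <network-agent-container-id> ping -c 3 10.42.x.x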

On the instance we’re having trouble with we see this in the agent logs:

time="2016-02-22T08:56:06Z" level="info" msg="Processing event: &docker.APIEvents{Status:"destroy", ID:"a783d5b5312334040fe5ec1df7151171cdb6881036b43e66d035f0ce7843ed7f", From:"rancher/agent-instance:v0.6.0", Time:1456131366}"
time="2016-02-22T08:57:39Z" level="info" msg="Processing event: &docker.APIEvents{Status:"start", ID:"0164336d5b9c997411aa69076e5fa218ce8abeaeb7284346005cafdefd59804e", From:"rancher/agent-instance:v0.8.0", Time:1456131459}"
time="2016-02-22T08:57:39Z" level="info" msg="Assigning IP [10.42.156.127/16], ContainerId [0164336d5b9c997411aa69076e5fa218ce8abeaeb7284346005cafdefd59804e], Pid [23211]"
time="2016-02-22T08:57:39Z" level="info" msg="Processing event: &docker.APIEvents{Status:"start", ID:"0164336d5b9c997411aa69076e5fa218ce8abeaeb7284346005cafdefd59804e", From:"-simulated-", Time:0}"
time="2016-02-22T08:57:39Z" level="info" msg="Container locked. Can't run StartHandler. ID: [0164336d5b9c997411aa69076e5fa218ce8abeaeb7284346005cafdefd59804e]"
time="2016-02-22T08:57:53Z" level="info" msg="Processing event: &docker.APIEvents{Status:"start", ID:"f5a6fefa550d0845cd86296c7b266075c1ca1548a235dc7981905aaa62f97504", From:"uberresearch/s3upload:test", Time:1456131473}"
time="2016-02-22T08:57:53Z" level="info" msg="Processing event: &docker.APIEvents{Status:"start", ID:"0f87c7eef46d1d1b97969a65a7e9b95d2d3ca1d368fde47028e25eabd3ebf711", From:"uberresearch/s3upload:latest", Time:1456131473}"
time="2016-02-22T08:57:53Z" level="info" msg="Assigning IP [10.42.96.69/16], ContainerId [f5a6fefa550d0845cd86296c7b266075c1ca1548a235dc7981905aaa62f97504], Pid [24929]"
time="2016-02-22T08:57:53Z" level="info" msg="Processing event: &docker.APIEvents{Status:"start", ID:"f5a6fefa550d0845cd86296c7b266075c1ca1548a235dc7981905aaa62f97504", From:"-simulated-", Time:0}"
time="2016-02-22T08:57:53Z" level="info" msg="Container locked. Can't run StartHandler. ID: [f5a6fefa550d0845cd86296c7b266075c1ca1548a235dc7981905aaa62f97504]"
time="2016-02-22T08:57:53Z" level="info" msg="Assigning IP [10.42.196.153/16], ContainerId [0f87c7eef46d1d1b97969a65a7e9b95d2d3ca1d368fde47028e25eabd3ebf711], Pid [24951]"
time="2016-02-22T08:57:53Z" level="info" msg="Processing event: &docker.APIEvents{Status:"start", ID:"0f87c7eef46d1d1b97969a65a7e9b95d2d3ca1d368fde47028e25eabd3ebf711", From:"-simulated-", Time:0}"
time="2016-02-22T08:57:53Z" level="info" msg="Container locked. Can't run StartHandler. ID: [0f87c7eef46d1d1b97969a65a7e9b95d2d3ca1d368fde47028e25eabd3ebf711]"
time="2016-02-22T08:58:05Z" level="info" msg="Processing event: &docker.APIEvents{Status:"start", ID:"87cb1e016265b660c749ca4826b5532d6cfaa4e410476d11532bad9669e3eb66", From:"rancher/agent-instance:v0.8.0", Time:1456131485}"
time="2016-02-22T08:58:05Z" level="info" msg="Assigning IP [10.42.119.127/16], ContainerId [87cb1e016265b660c749ca4826b5532d6cfaa4e410476d11532bad9669e3eb66], Pid [26243]"
time="2016-02-22T08:58:05Z" level="info" msg="Processing event: &docker.APIEvents{Status:"start", ID:"87cb1e016265b660c749ca4826b5532d6cfaa4e410476d11532bad9669e3eb66", From:"-simulated-", Time:0}"
time="2016-02-22T08:58:05Z" level="info" msg="Container locked. Can't run StartHandler. ID: [87cb1e016265b660c749ca4826b5532d6cfaa4e410476d11532bad9669e3eb66]"

We have this situation on just about every machine. Killing off the network agent sometimes helps. Removing a machine from Rancher and purging /var/lib/rancher sometimes helps, but it seems we're stuck.
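
In case it is relevant, the per-host cleanup we have been doing looks roughly like this (container names and the registration command are placeholders; use the exact docker run line from your own environment's Add Host screen):

# stop and remove the Rancher agent and the Network Agent on the host
docker rm -f rancher-agent
docker ps | grep agent-instance        # note the Network Agent container ID
docker rm -f <network-agent-container-id>

# wipe local Rancher state
sudo rm -rf /var/lib/rancher

# re-register the host (placeholder server URL, token and agent version)
sudo docker run -d --privileged -v /var/run/docker.sock:/var/run/docker.sock rancher/agent:<version> http://<rancher-server>:8080/v1/scripts/<registration-token>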

We’re seeing a lot of this in the master logs:

2016-02-22 09:15:48,471 ERROR [:] [erviceReplay-22] [i.c.p.e.e.i.ProcessEventListenerImpl] Unknown exception running process [volume.deallocate:1310797] on [45778] java.lang.IllegalStateException: Attempt to cancel when process is still transitioning
at io.cattle.platform.engine.process.impl.DefaultProcessInstanceImpl.runDelegateLoop(DefaultProcessInstanceImpl.java:189) ~[cattle-framework-engine-0.5.0-SNAPSHOT.jar:na]
at io.cattle.platform.engine.process.impl.DefaultProcessInstanceImpl.executeWithProcessInstanceLock(DefaultProcessInstanceImpl.java:156) ~[cattle-framework-engine-0.5.0-SNAPSHOT.jar:na]
at io.cattle.platform.engine.process.impl.DefaultProcessInstanceImpl$1.doWithLock(DefaultProcessInstanceImpl.java:106) ~[cattle-framework-engine-0.5.0-SNAPSHOT.jar:na]
at io.cattle.platform.engine.process.impl.DefaultProcessInstanceImpl$1.doWithLock(DefaultProcessInstanceImpl.java:103) ~[cattle-framework-engine-0.5.0-SNAPSHOT.jar:na]
at io.cattle.platform.lock.impl.AbstractLockManagerImpl$3.doWithLock(AbstractLockManagerImpl.java:40) ~[cattle-framework-lock-0.5.0-SNAPSHOT.jar:na]
at io.cattle.platform.lock.impl.LockManagerImpl.doLock(LockManagerImpl.java:33) ~[cattle-framework-lock-0.5.0-SNAPSHOT.jar:na]
at io.cattle.platform.lock.impl.AbstractLockManagerImpl.lock(AbstractLockManagerImpl.java:13) ~[cattle-framework-lock-0.5.0-SNAPSHOT.jar:na]
at io.cattle.platform.lock.impl.AbstractLockManagerImpl.lock(AbstractLockManagerImpl.java:37) ~[cattle-framework-lock-0.5.0-SNAPSHOT.jar:na]
at io.cattle.platform.engine.process.impl.DefaultProcessInstanceImpl.execute(DefaultProcessInstanceImpl.java:103) ~[cattle-framework-engine-0.5.0-SNAPSHOT.jar:na]
at io.cattle.platform.engine.eventing.impl.ProcessEventListenerImpl.processExecute(ProcessEventListenerImpl.java:68) ~[cattle-framework-engine-0.5.0-SNAPSHOT.jar:na]
at io.cattle.platform.engine.server.impl.ProcessInstanceParallelDispatcher$1.runInContext(ProcessInstanceParallelDispatcher.java:27) [cattle-framework-engine-0.5.0-SNAPSHOT.jar:na]
at org.apache.cloudstack.managed.context.ManagedContextRunnable$1.run(ManagedContextRunnable.java:49) [cattle-framework-managed-context-0.5.0-SNAPSHOT.jar:na]
at org.apache.cloudstack.managed.context.impl.DefaultManagedContext$1.call(DefaultManagedContext.java:55) [cattle-framework-managed-context-0.5.0-SNAPSHOT.jar:na]
at org.apache.cloudstack.managed.context.impl.DefaultManagedContext.callWithContext(DefaultManagedContext.java:108) [cattle-framework-managed-context-0.5.0-SNAPSHOT.jar:na]
at org.apache.cloudstack.managed.context.impl.DefaultManagedContext.runWithContext(DefaultManagedContext.java:52) [cattle-framework-managed-context-0.5.0-SNAPSHOT.jar:na]
at org.apache.cloudstack.managed.context.ManagedContextRunnable.run(ManagedContextRunnable.java:46) [cattle-framework-managed-context-0.5.0-SNAPSHOT.jar:na]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) [na:1.7.0_91]
at java.util.concurrent.FutureTask.run(FutureTask.java:262) [na:1.7.0_91]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [na:1.7.0_91]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [na:1.7.0_91]
at java.lang.Thread.run(Thread.java:745) [na:1.7.0_91]

That is weird, since we no longer run any storage setups with Rancher.

We also see some of these:

2016-02-22 09:15:52,557 ERROR [:] [ecutorService-2] [.p.c.v.i.ConfigItemStatusManagerImpl] Failed null, exit code [1] output [ERROR: Agent instance has not started
]
2016-02-22 09:16:08,287 ERROR [:] [ecutorService-9] [.p.c.v.i.ConfigItemStatusManagerImpl] Failed null, exit code [1] output [ERROR: Agent instance has not started
]
2016-02-22 09:16:24,290 ERROR [:] [ecutorService-9] [.p.c.v.i.ConfigItemStatusManagerImpl] Failed null, exit code [1] output [ERROR: Agent instance has not started
]
2016-02-22 09:16:53,220 ERROR [:] [cutorService-10] [.p.c.v.i.ConfigItemStatusManagerImpl] Failed null, exit code [1] output [ERROR: Agent instance has not started
]

It's hard to tell which agent instance has not started from a log message like that. Now, how can we proceed?
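
For what it's worth, this is how we try to narrow down which Network Agent the message refers to, by checking the agent-instance containers directly on each host (plain docker CLI, nothing Rancher-specific):

# list agent-instance containers and their state on a host
docker ps -a | grep agent-instance

# look at the recent logs of one that appears stuck
docker logs --tail 50 <agent-instance-container-id>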

Currently, we don’t automatically upgrade your load balancers if a new rancher/agent-instance version is pushed out.

Can you try upgrading your load balancers so they are using rancher/agent-instance:v0.8.0?

http://docs.rancher.com/rancher/upgrading/#rancher-agents
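
After the upgrade, a quick way to confirm on each host which image the load balancer containers are actually running is the plain docker CLI (the container ID is a placeholder):

# show the image each running agent-instance container was started from
docker ps | grep agent-instance

# or inspect a specific load balancer container
docker inspect --format '{{.Config.Image}}' <lb-container-id>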

Hi Denise,

I've upgraded the load balancers to no avail (removed all images from the instance, removed it from Rancher, removed /var/lib/rancher and added it again).

What did make a difference was fixing a node in a different environment. That one was running an old rancher-agent and was stuck in a reconnecting state. I'm not absolutely positive that was the cause, but it's the only thing that changed between an unsuccessful load balancer deployment and a successful one, both with v0.8.0.

I'm still puzzled about why one node would influence the network on another node, and even more so when they are in different environments.

For now, this seems to have fixed my issues, except for the storage exceptions still being logged. I guess I'd have to figure out what to delete from the database to get rid of those.
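
If anyone knows the proper way to clean that up, I'm all ears. For now I'm only inspecting, not deleting; something like the query below against the cattle MySQL database should show the stuck deallocate processes (the user/database name and the table/column names are my best guess at the default cattle setup, so please verify them against your own database first):

# read-only look at volume.deallocate processes that never finished
# (assumes the default 'cattle' user/database and a process_instance table)
mysql -u cattle -p cattle -e "SELECT id, process_name, resource_id, start_time FROM process_instance WHERE process_name = 'volume.deallocate' AND end_time IS NULL;"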