Environment becomes unstable

sdlarsen · February 22, 2016, 9:32am

Hi.

We’re having trouble with our Rancher setup. One symptom is that load balancers never get past the Initializing state, although they do seem to work, at least sometimes. We are on v0.59.0, and I can ping other network agents from the agent on the machine in question, so cross-host networking seems to be OK.

On the instance we’re having trouble with we see this in the agent logs:

time=“2016-02-22T08:56:06Z” level=“info” msg=“Processing event: &docker.APIEvents{Status:"destroy", ID:"a783d5b5312334040fe5ec1df7151171cdb6881036b43e66d035f0ce7843ed7f", From:"rancher/agent-instance:v0.6.0", Time:1456131366}”
time=“2016-02-22T08:57:39Z” level=“info” msg=“Processing event: &docker.APIEvents{Status:"start", ID:"0164336d5b9c997411aa69076e5fa218ce8abeaeb7284346005cafdefd59804e", From:"rancher/agent-instance:v0.8.0", Time:1456131459}”
time=“2016-02-22T08:57:39Z” level=“info” msg=“Assigning IP [10.42.156.127/16], ContainerId [0164336d5b9c997411aa69076e5fa218ce8abeaeb7284346005cafdefd59804e], Pid [23211]”
time=“2016-02-22T08:57:39Z” level=“info” msg=“Processing event: &docker.APIEvents{Status:"start", ID:"0164336d5b9c997411aa69076e5fa218ce8abeaeb7284346005cafdefd59804e", From:"-simulated-", Time:0}”
time=“2016-02-22T08:57:39Z” level=“info” msg=“Container locked. Can’t run StartHandler. ID: [0164336d5b9c997411aa69076e5fa218ce8abeaeb7284346005cafdefd59804e]”
time=“2016-02-22T08:57:53Z” level=“info” msg=“Processing event: &docker.APIEvents{Status:"start", ID:"f5a6fefa550d0845cd86296c7b266075c1ca1548a235dc7981905aaa62f97504", From:"uberresearch/s3upload:test", Time:1456131473}”
time=“2016-02-22T08:57:53Z” level=“info” msg=“Processing event: &docker.APIEvents{Status:"start", ID:"0f87c7eef46d1d1b97969a65a7e9b95d2d3ca1d368fde47028e25eabd3ebf711", From:"uberresearch/s3upload:latest", Time:1456131473}”
time=“2016-02-22T08:57:53Z” level=“info” msg=“Assigning IP [10.42.96.69/16], ContainerId [f5a6fefa550d0845cd86296c7b266075c1ca1548a235dc7981905aaa62f97504], Pid [24929]”
time=“2016-02-22T08:57:53Z” level=“info” msg=“Processing event: &docker.APIEvents{Status:"start", ID:"f5a6fefa550d0845cd86296c7b266075c1ca1548a235dc7981905aaa62f97504", From:"-simulated-", Time:0}”
time=“2016-02-22T08:57:53Z” level=“info” msg=“Container locked. Can’t run StartHandler. ID: [f5a6fefa550d0845cd86296c7b266075c1ca1548a235dc7981905aaa62f97504]”
time=“2016-02-22T08:57:53Z” level=“info” msg=“Assigning IP [10.42.196.153/16], ContainerId [0f87c7eef46d1d1b97969a65a7e9b95d2d3ca1d368fde47028e25eabd3ebf711], Pid [24951]”
time=“2016-02-22T08:57:53Z” level=“info” msg=“Processing event: &docker.APIEvents{Status:"start", ID:"0f87c7eef46d1d1b97969a65a7e9b95d2d3ca1d368fde47028e25eabd3ebf711", From:"-simulated-", Time:0}”
time=“2016-02-22T08:57:53Z” level=“info” msg=“Container locked. Can’t run StartHandler. ID: [0f87c7eef46d1d1b97969a65a7e9b95d2d3ca1d368fde47028e25eabd3ebf711]”
time=“2016-02-22T08:58:05Z” level=“info” msg=“Processing event: &docker.APIEvents{Status:"start", ID:"87cb1e016265b660c749ca4826b5532d6cfaa4e410476d11532bad9669e3eb66", From:"rancher/agent-instance:v0.8.0", Time:1456131485}”
time=“2016-02-22T08:58:05Z” level=“info” msg=“Assigning IP [10.42.119.127/16], ContainerId [87cb1e016265b660c749ca4826b5532d6cfaa4e410476d11532bad9669e3eb66], Pid [26243]”
time=“2016-02-22T08:58:05Z” level=“info” msg=“Processing event: &docker.APIEvents{Status:"start", ID:"87cb1e016265b660c749ca4826b5532d6cfaa4e410476d11532bad9669e3eb66", From:"-simulated-", Time:0}”
time=“2016-02-22T08:58:05Z” level=“info” msg=“Container locked. Can’t run StartHandler. ID: [87cb1e016265b660c749ca4826b5532d6cfaa4e410476d11532bad9669e3eb66]”
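
For anyone reading a longer agent log of this kind, a minimal Python sketch along these lines can list which containers hit "Container locked" and which image each was started from. It is not part of Rancher: the function name and both regular expressions are assumptions based only on the log lines quoted above, and they accept the inner quotes with or without a preceding backslash.

```python
import re

# Both patterns are assumptions derived from the agent log lines quoted above,
# not an official Rancher log format.
EVENT_RE = re.compile(
    r'Status:\\?"(?P<status>\w+)\\?", ID:\\?"(?P<id>[0-9a-f]+)\\?", '
    r'From:\\?"(?P<image>[^"\\]+)'
)
LOCKED_RE = re.compile(r'Container locked\..*ID: \[(?P<id>[0-9a-f]+)\]')


def locked_containers(lines):
    """Map each container ID reported as locked to the image it was started from."""
    images = {}  # container ID -> image from its first non-simulated event
    locked = []  # container IDs in the order they were reported as locked
    for line in lines:
        event = EVENT_RE.search(line)
        if event and event.group("image") != "-simulated-":
            images.setdefault(event.group("id"), event.group("image"))
        hit = LOCKED_RE.search(line)
        if hit and hit.group("id") not in locked:
            locked.append(hit.group("id"))
    # Containers with no matching event in the input are reported as "unknown".
    return {cid: images.get(cid, "unknown") for cid in locked}
```

Feeding it the lines above would pair each locked ID with its image, which makes it easier to see whether only the agent-instance containers are affected or user containers as well.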

We have this situation on just about every machine. Killing off the network agent sometimes helps. Removing a machine from Rancher and purging /var/lib/rancher sometimes helps too, but it seems we’re stuck.

We’re seeing a lot of this in the master logs:

2016-02-22 09:15:48,471 ERROR [:] [erviceReplay-22] [i.c.p.e.e.i.ProcessEventListenerImpl] Unknown exception running process [volume.deallocate:1310797] on [45778] java.lang.IllegalStateException: Attempt to cancel when process is still transitioning
at io.cattle.platform.engine.process.impl.DefaultProcessInstanceImpl.runDelegateLoop(DefaultProcessInstanceImpl.java:189) ~[cattle-framework-engine-0.5.0-SNAPSHOT.jar:na]
at io.cattle.platform.engine.process.impl.DefaultProcessInstanceImpl.executeWithProcessInstanceLock(DefaultProcessInstanceImpl.java:156) ~[cattle-framework-engine-0.5.0-SNAPSHOT.jar:na]
at io.cattle.platform.engine.process.impl.DefaultProcessInstanceImpl$1.doWithLock(DefaultProcessInstanceImpl.java:106) ~[cattle-framework-engine-0.5.0-SNAPSHOT.jar:na]
at io.cattle.platform.engine.process.impl.DefaultProcessInstanceImpl$1.doWithLock(DefaultProcessInstanceImpl.java:103) ~[cattle-framework-engine-0.5.0-SNAPSHOT.jar:na]
at io.cattle.platform.lock.impl.AbstractLockManagerImpl$3.doWithLock(AbstractLockManagerImpl.java:40) ~[cattle-framework-lock-0.5.0-SNAPSHOT.jar:na]
at io.cattle.platform.lock.impl.LockManagerImpl.doLock(LockManagerImpl.java:33) ~[cattle-framework-lock-0.5.0-SNAPSHOT.jar:na]
at io.cattle.platform.lock.impl.AbstractLockManagerImpl.lock(AbstractLockManagerImpl.java:13) ~[cattle-framework-lock-0.5.0-SNAPSHOT.jar:na]
at io.cattle.platform.lock.impl.AbstractLockManagerImpl.lock(AbstractLockManagerImpl.java:37) ~[cattle-framework-lock-0.5.0-SNAPSHOT.jar:na]
at io.cattle.platform.engine.process.impl.DefaultProcessInstanceImpl.execute(DefaultProcessInstanceImpl.java:103) ~[cattle-framework-engine-0.5.0-SNAPSHOT.jar:na]
at io.cattle.platform.engine.eventing.impl.ProcessEventListenerImpl.processExecute(ProcessEventListenerImpl.java:68) ~[cattle-framework-engine-0.5.0-SNAPSHOT.jar:na]
at io.cattle.platform.engine.server.impl.ProcessInstanceParallelDispatcher$1.runInContext(ProcessInstanceParallelDispatcher.java:27) [cattle-framework-engine-0.5.0-SNAPSHOT.jar:na]
at org.apache.cloudstack.managed.context.ManagedContextRunnable$1.run(ManagedContextRunnable.java:49) [cattle-framework-managed-context-0.5.0-SNAPSHOT.jar:na]
at org.apache.cloudstack.managed.context.impl.DefaultManagedContext$1.call(DefaultManagedContext.java:55) [cattle-framework-managed-context-0.5.0-SNAPSHOT.jar:na]
at org.apache.cloudstack.managed.context.impl.DefaultManagedContext.callWithContext(DefaultManagedContext.java:108) [cattle-framework-managed-context-0.5.0-SNAPSHOT.jar:na]
at org.apache.cloudstack.managed.context.impl.DefaultManagedContext.runWithContext(DefaultManagedContext.java:52) [cattle-framework-managed-context-0.5.0-SNAPSHOT.jar:na]
at org.apache.cloudstack.managed.context.ManagedContextRunnable.run(ManagedContextRunnable.java:46) [cattle-framework-managed-context-0.5.0-SNAPSHOT.jar:na]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) [na:1.7.0_91]
at java.util.concurrent.FutureTask.run(FutureTask.java:262) [na:1.7.0_91]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [na:1.7.0_91]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [na:1.7.0_91]
at java.lang.Thread.run(Thread.java:745) [na:1.7.0_91]

Which is weird, since we no longer run any storage setups with Rancher.

And some of these as well:

2016-02-22 09:15:52,557 ERROR [:] [ecutorService-2] [.p.c.v.i.ConfigItemStatusManagerImpl] Failed null, exit code [1] output [ERROR: Agent instance has not started
]
2016-02-22 09:16:08,287 ERROR [:] [ecutorService-9] [.p.c.v.i.ConfigItemStatusManagerImpl] Failed null, exit code [1] output [ERROR: Agent instance has not started
]
2016-02-22 09:16:24,290 ERROR [:] [ecutorService-9] [.p.c.v.i.ConfigItemStatusManagerImpl] Failed null, exit code [1] output [ERROR: Agent instance has not started
]
2016-02-22 09:16:53,220 ERROR [:] [cutorService-10] [.p.c.v.i.ConfigItemStatusManagerImpl] Failed null, exit code [1] output [ERROR: Agent instance has not started

It’s hard to tell which agent has not started from a log message like that. Now, how can we proceed?
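
As a rough way to see how often each kind of error shows up in the master log, the following Python sketch counts ERROR entries per logger and message. The line layout in the regular expression is an assumption inferred from the entries quoted above, and the function name is made up for illustration. It does not identify which agent instance has failed to start; it only shows which component is complaining and how frequently.

```python
import re
from collections import Counter

# Assumed layout, inferred from the master log lines quoted above:
# "<date> <time>,<ms> ERROR [:] [<thread>] [<logger>] <message>"
ERROR_RE = re.compile(
    r'^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d+ ERROR \[[^\]]*\] '
    r'\[(?P<thread>[^\]]*)\] \[(?P<logger>[^\]]*)\] (?P<msg>.*)$'
)


def summarize_errors(lines, width=80):
    """Count ERROR entries per (logger, normalized message)."""
    counts = Counter()
    for line in lines:
        match = ERROR_RE.match(line.strip())
        if not match:
            continue  # stack-trace frames and continuation lines are skipped
        # Replace digit runs so entries differing only in IDs are grouped together.
        message = re.sub(r'\d+', 'N', match.group("msg"))[:width]
        counts[(match.group("logger"), message)] += 1
    return counts
```

Calling `summarize_errors(open(path))` on a saved copy of the log (where `path` is wherever you stored it) and printing `most_common()` gives a quick ranking of the recurring errors.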

denise · February 22, 2016, 6:08pm

Currently, we don’t automatically upgrade your load balancers if a new rancher/agent-instance version is pushed out.

Can you try upgrading your load balancers so that they use rancher/agent-instance:v0.8.0?

http://docs.rancher.com/rancher/upgrading/#rancher-agents
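
To check which containers on a host are still on an older agent-instance image, a small Python sketch like the one below may help. It assumes you have already collected the image names yourself, one per line, for example from `docker ps --format '{{.Image}}'`. The function name is hypothetical, and the default tag `v0.8.0` is simply the one that appears in the agent log earlier in this thread.

```python
def outdated_agent_instances(images, repo="rancher/agent-instance", wanted="v0.8.0"):
    """Return image references for `repo` whose tag differs from `wanted`.

    `images` is any iterable of "name:tag" strings. References without an
    explicit tag are not matched and therefore never reported.
    """
    stale = []
    for ref in images:
        ref = ref.strip()
        name, _, tag = ref.rpartition(":")
        if name == repo and tag != wanted:
            stale.append(ref)
    return stale
```

Any entry it returns points at a load balancer or network agent that was not recreated after the newer image was pushed out.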

sdlarsen · February 22, 2016, 9:42pm

Hi Denise,

I’ve upgraded the load balancers, to no avail (I removed all images from the instance, removed it from Rancher, removed /var/lib/rancher and added it again).

What did make a difference was fixing a node in a different environment. That node was running an old rancher-agent and was stuck in the reconnecting state. I’m not absolutely positive that was the cause, but it’s the only thing that changed between an unsuccessful deployment of a load balancer and a successful one, both with v0.8.0.

I’m still puzzled about why one node would influence the network on another node, even more so when they are in different environments.

For now, this seems to have fixed my issues, except for the storage exceptions being logged. I guess I’ll have to figure out what to delete from the database to get rid of those.
