100% failure rate trying to upgrade stacks

Hi,

I’m incredibly frustrated with Rancher. We’ve been trying to use Rancher for 2+ months, so I’d say I"m not a newbie.

But, 100% of the time, our deployments fail. Every deployment of a new software release burns 8+ hours, and we’re a small engineering team, so it’s very materially impacting.

We have 15 or so stacks, each with 1 to 10 services in it. Each of these services are essentially running the same core “app” image–working off different queues.

In all cases, an attempt to use rancher -compose to push an upgrade fails 100% of the time. This failure leaves Rancher server in a totally busted state, and we literally have to go to each host, clean it, go to rancher server blow it away and rebuild it.

Every. Single. Time.

I’ve literally burned 100+ hours ONLY working on rancher related configuration. We’ve written custom scripts to build our rancher-compose files and do do the deploy. But, rancher fails all the time.

This time, by the way, it seems the deploy was stuck. The web front end states it was downloading for 2+ hours a FS layer. And, when we went to the hosts (which it says were up and showed it was generating traffic/cpu/graphics), the docker images were NOT running (though rancher server showed them as running).

I need to know if this (Rancher) is really something real, or a waste of our time. We’re considering reverting to a 10 to 20 line bash script to do deploys… as that would do it more reliably than Rancher at this point.

Chat rooms (IRC) are unresponsive…

I’m at amy wits end, and could use some guidance here. I’m frankly tired of burning ALL my time working on deploying my code using rancher, versus writing and developing code which makes me money.

Anyone?

Again: 1.2.0pre3 Rancher, Dedicated hosts and hardware (64GB RAM 8 core machines with SSDs), GB ethernet local switch connectivity. Bare metal. It doesn’t get any cleaner/tighter.

And, related… our stack never heals. Before, we were on AWS and doing spot. If we were outbid, the service would go away… but when the hosts came back, Rancher NEVER recovers. If a host goes down, the services are not moved to a new host. Everything seems literally like it’s stuck.

@William_Flanagan whats the output from docker info on your rancher server/agent hosts? Do you get any interesting docker logs from the rancher server/agent when ‘upgrading’. I’ve never seen a more painful rancher-compose experience, most people have minor issues around syntax… but not this bad. Whats the version of rancher-compose that youre using?

@aemneina Filling here for @William_Flanagan.

Our Setup:

  • HOST OS: Ubuntu 16.04 LTS 64 Bit.
  • Each Host is 8 Core, 64 GB Machines [BareMetal] and has Docker version 1.12.3, build 6b644ec
  • The Rancher Server runs on a host that is not part of the Rancher Nodes Cluster.
  • The version is v1.2.0-pre3 with embedded MySQL persisted on a volume
  • Rancher Compose version: v0.10.0
  • There are 12 Rancher Nodes/Instances running the agent:v1.0.2 and agent-instance:v0.8.3

docker info on the host running rancher server

Containers: 2
Running: 2
Paused: 0
Stopped: 0
Images: 15
Server Version: 1.12.3
Storage Driver: aufs
Root Dir: /var/lib/docker/aufs
Backing Filesystem: extfs
Dirs: 42
Dirperm1 Supported: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
Volume: local
Network: null host bridge overlay
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Security Options: apparmor seccomp
Kernel Version: 4.4.0-47-generic
Operating System: Ubuntu 16.04.1 LTS
OSType: linux
Architecture: x86_64
CPUs: 8
Total Memory: 62.75 GiB
Name: bastion
ID: KAUN:JYV6:GJK5:3OYR:VYOG:OHVY:M2H2:CJZB:P5S6:VY3H:6Z42:U263
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Username: XXXXXXXXXX
Registry: https://index.docker.io/v1/
WARNING: No swap limit support
Insecure Registries:
127.0.0.0/8

docker info on the hosts which are rancher nodes:

Containers: 16
Running: 16
Paused: 0
Stopped: 0
Images: 8
Server Version: 1.12.3
Storage Driver: aufs
Root Dir: /var/lib/docker/aufs
Backing Filesystem: extfs
Dirs: 88
Dirperm1 Supported: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
Volume: local
Network: host null bridge overlay
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Security Options: apparmor seccomp
Kernel Version: 4.4.0-45-generic
Operating System: Ubuntu 16.04.1 LTS
OSType: linux
Architecture: x86_64
CPUs: 8
Total Memory: 62.75 GiB
Name: ranchernode01
ID: 2S6X:ZGL7:P2JD:L7GZ:XN47:5FMA:PYWV:QWJR:ALZT:PBKD:M23V:O6XR
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
WARNING: No swap limit support
Insecure Registries:
127.0.0.0/8

Not necessarily but we see these in the rancher server logs.

PLEASE NOTE THAT THIS IS SEEN EVEN IF WE USE EXTERNAL MARIADB DATABASE OR EMBEDDED ONE VIA VOLUME PERSISTENCE.

Also it is worth noting that when we go ahead with the stack/service upgrade, we see a massive load on the embedded MySQL (both I/O as well as CPU). Also we see that with services with many containers, the upgrade starts doing it 2-3 container changes at a time… thus delaying the whole deployment of that service.

2016-11-15 18:43:54,111 ERROR [d6c6f7d2-69cf-4548-b0c8-016848c33cd0:144143] [service:193] [service.upgrade] [erviceReplay-21] [c.p.e.p.i.DefaultProcessInstanceImpl] Unknown exception org.jooq.exception.DataChangedException: Database record has been changed
at org.jooq.impl.UpdatableRecordImpl.checkIfChanged(UpdatableRecordImpl.java:550) ~[jooq-3.3.0.jar:na]
at org.jooq.impl.UpdatableRecordImpl.storeUpdate0(UpdatableRecordImpl.java:291) ~[jooq-3.3.0.jar:na]
at org.jooq.impl.UpdatableRecordImpl.access$200(UpdatableRecordImpl.java:90) ~[jooq-3.3.0.jar:na]
at org.jooq.impl.UpdatableRecordImpl$3.operate(UpdatableRecordImpl.java:260) ~[jooq-3.3.0.jar:na]
at org.jooq.impl.RecordDelegate.operate(RecordDelegate.java:123) ~[jooq-3.3.0.jar:na]
at org.jooq.impl.UpdatableRecordImpl.storeUpdate(UpdatableRecordImpl.java:255) ~[jooq-3.3.0.jar:na]
at org.jooq.impl.UpdatableRecordImpl.update(UpdatableRecordImpl.java:149) ~[jooq-3.3.0.jar:na]
at io.cattle.platform.object.impl.JooqObjectManager.persistRecord(JooqObjectManager.java:223) ~[cattle-framework-object-0.5.0-SNAPSHOT.jar:na]
at io.cattle.platform.object.impl.JooqObjectManager.setFieldsInternal(JooqObjectManager.java:130) ~[cattle-framework-object-0.5.0-SNAPSHOT.jar:na]
at io.cattle.platform.object.impl.JooqObjectManager$3.execute(JooqObjectManager.java:118) ~[cattle-framework-object-0.5.0-SNAPSHOT.jar:na]
at io.cattle.platform.engine.idempotent.Idempotent.change(Idempotent.java:88) ~[cattle-framework-engine-0.5.0-SNAPSHOT.jar:na]
at io.cattle.platform.object.impl.JooqObjectManager.setFields(JooqObjectManager.java:115) ~[cattle-framework-object-0.5.0-SNAPSHOT.jar:na]
at io.cattle.platform.object.impl.JooqObjectManager.setFields(JooqObjectManager.java:110) ~[cattle-framework-object-0.5.0-SNAPSHOT.jar:na]
at io.cattle.platform.object.impl.AbstractObjectManager.setFields(AbstractObjectManager.java:135) ~[cattle-framework-object-0.5.0-SNAPSHOT.jar:na]
at io.cattle.platform.activity.impl.ActivityLogImpl.close(ActivityLogImpl.java:82) ~[cattle-activity-log-0.5.0-SNAPSHOT.jar:na]
at io.cattle.platform.activity.impl.EntryImpl.close(EntryImpl.java:23) ~[cattle-activity-log-0.5.0-SNAPSHOT.jar:na]
at io.cattle.platform.activity.ActivityService.run(ActivityService.java:48) ~[cattle-activity-log-0.5.0-SNAPSHOT.jar:na]
at io.cattle.platform.servicediscovery.process.ServiceUpgrade.handle(ServiceUpgrade.java:34) ~[cattle-iaas-service-discovery-server-0.5.0-SNAPSHOT.jar:na]
at io.cattle.platform.engine.process.impl.DefaultProcessInstanceImpl.runHandler(DefaultProcessInstanceImpl.java:427) [cattle-framework-engine-0.5.0-SNAPSHOT.jar:na]
at io.cattle.platform.engine.process.impl.DefaultProcessInstanceImpl$4.execute(DefaultProcessInstanceImpl.java:378) ~[cattle-framework-engine-0.5.0-SNAPSHOT.jar:na]
at io.cattle.platform.engine.process.impl.DefaultProcessInstanceImpl$4.execute(DefaultProcessInstanceImpl.java:372) ~[cattle-framework-engine-0.5.0-SNAPSHOT.jar:na]
at io.cattle.platform.engine.idempotent.Idempotent.execute(Idempotent.java:42) ~[cattle-framework-engine-0.5.0-SNAPSHOT.jar:na]
at io.cattle.platform.engine.process.impl.DefaultProcessInstanceImpl.runHandlers(DefaultProcessInstanceImpl.java:372) [cattle-framework-engine-0.5.0-SNAPSHOT.jar:na]
at io.cattle.platform.engine.process.impl.DefaultProcessInstanceImpl.runLogic(DefaultProcessInstanceImpl.java:474) [cattle-framework-engine-0.5.0-SNAPSHOT.jar:na]
at io.cattle.platform.engine.process.impl.DefaultProcessInstanceImpl.runWithProcessLock(DefaultProcessInstanceImpl.java:308) [cattle-framework-engine-0.5.0-SNAPSHOT.jar:na]
at io.cattle.platform.engine.process.impl.DefaultProcessInstanceImpl$2.doWithLockNoResult(DefaultProcessInstanceImpl.java:247) ~[cattle-framework-engine-0.5.0-SNAPSHOT.jar:na]
at io.cattle.platform.lock.LockCallbackNoReturn.doWithLock(LockCallbackNoReturn.java:7) ~[cattle-framework-lock-0.5.0-SNAPSHOT.jar:na]
at io.cattle.platform.lock.LockCallbackNoReturn.doWithLock(LockCallbackNoReturn.java:3) ~[cattle-framework-lock-0.5.0-SNAPSHOT.jar:na]
at io.cattle.platform.lock.impl.AbstractLockManagerImpl$3.doWithLock(AbstractLockManagerImpl.java:40) [cattle-framework-lock-0.5.0-SNAPSHOT.jar:na]
at io.cattle.platform.lock.impl.LockManagerImpl.doLock(LockManagerImpl.java:33) [cattle-framework-lock-0.5.0-SNAPSHOT.jar:na]
at io.cattle.platform.lock.impl.AbstractLockManagerImpl.lock(AbstractLockManagerImpl.java:13) [cattle-framework-lock-0.5.0-SNAPSHOT.jar:na]
at io.cattle.platform.lock.impl.AbstractLockManagerImpl.lock(AbstractLockManagerImpl.java:37) [cattle-framework-lock-0.5.0-SNAPSHOT.jar:na]
at io.cattle.platform.engine.process.impl.DefaultProcessInstanceImpl.execute(DefaultProcessInstanceImpl.java:105) [cattle-framework-engine-0.5.0-SNAPSHOT.jar:na]
at io.cattle.platform.engine.eventing.impl.ProcessEventListenerImpl.processExecute(ProcessEventListenerImpl.java:75) [cattle-framework-engine-0.5.0-SNAPSHOT.jar:na]
at io.cattle.platform.engine.server.impl.ProcessInstanceParallelDispatcher$1.runInContext(ProcessInstanceParallelDispatcher.java:27) [cattle-framework-engine-0.5.0-SNAPSHOT.jar:na]
at org.apache.cloudstack.managed.context.ManagedContextRunnable$1.run(ManagedContextRunnable.java:49) [cattle-framework-managed-context-0.5.0-SNAPSHOT.jar:na]
at org.apache.cloudstack.managed.context.impl.DefaultManagedContext$1.call(DefaultManagedContext.java:55) [cattle-framework-managed-context-0.5.0-SNAPSHOT.jar:na]
at org.apache.cloudstack.managed.context.impl.DefaultManagedContext.callWithContext(DefaultManagedContext.java:108) [cattle-framework-managed-context-0.5.0-SNAPSHOT.jar:na]
at org.apache.cloudstack.managed.context.impl.DefaultManagedContext.runWithContext(DefaultManagedContext.java:52) [cattle-framework-managed-context-0.5.0-SNAPSHOT.jar:na]
at org.apache.cloudstack.managed.context.ManagedContextRunnable.run(ManagedContextRunnable.java:46) [cattle-framework-managed-context-0.5.0-SNAPSHOT.jar:na]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) [na:1.7.0_111]
at java.util.concurrent.FutureTask.run(FutureTask.java:262) [na:1.7.0_111]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [na:1.7.0_111]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [na:1.7.0_111]
at java.lang.Thread.run(Thread.java:745) [na:1.7.0_111]

And

2016-11-15 17:59:52,254 ERROR [5469f7ea-5ffc-4ad4-a3ae-9ecc60cfd138:143727] [instance:519->nic:531] [instance.stop->(InstanceStop)->nic.deactivate] [torService-1124] [c.p.e.p.i.DefaultProcessInstanceImpl] Unknown exception org.jooq.exception.DataAccessException: SQL [update config_item_status set config_item_status.requested_version = (config_item_status.requested_version + ?), config_item_status.requested_updated = ? where (config_item_status.name = ? and config_item_status.resource_type = ? and config_item_status.resource_id = ?)]; Lock wait timeout exceeded; try restarting transaction
at org.jooq.impl.Utils.translate(Utils.java:1287) ~[jooq-3.3.0.jar:na]
at org.jooq.impl.DefaultExecuteContext.sqlException(DefaultExecuteContext.java:495) ~[jooq-3.3.0.jar:na]
at org.jooq.impl.AbstractQuery.execute(AbstractQuery.java:326) ~[jooq-3.3.0.jar:na]
at org.jooq.impl.AbstractDelegatingQuery.execute(AbstractDelegatingQuery.java:140) ~[jooq-3.3.0.jar:na]
at io.cattle.platform.configitem.version.dao.impl.ConfigItemStatusDaoImpl.increment(ConfigItemStatusDaoImpl.java:100) ~[cattle-config-item-common-0.5.0-SNAPSHOT.jar:na]
at io.cattle.platform.configitem.version.dao.impl.ConfigItemStatusDaoImpl.incrementOrApply(ConfigItemStatusDaoImpl.java:74) ~[cattle-config-item-common-0.5.0-SNAPSHOT.jar:na]
at io.cattle.platform.configitem.version.impl.ConfigItemStatusManagerImpl.updateConfig(ConfigItemStatusManagerImpl.java:141) ~[cattle-config-item-common-0.5.0-SNAPSHOT.jar:na]
at io.cattle.platform.agent.instance.process.AbstractApplyItems.assignServiceItems(AbstractApplyItems.java:92) ~[cattle-iaas-agent-instance-0.5.0-SNAPSHOT.jar:na]
at io.cattle.platform.agent.instance.process.AbstractApplyItems.assignItems(AbstractApplyItems.java:63) ~[cattle-iaas-agent-instance-0.5.0-SNAPSHOT.jar:na]
at io.cattle.platform.agent.instance.process.AgentInstanceApplyItems.handle(AgentInstanceApplyItems.java:56) ~[cattle-iaas-agent-instance-0.5.0-SNAPSHOT.jar:na]
at io.cattle.platform.engine.process.impl.DefaultProcessInstanceImpl.runHandler(DefaultProcessInstanceImpl.java:427) [cattle-framework-engine-0.5.0-SNAPSHOT.jar:na]
at io.cattle.platform.engine.process.impl.DefaultProcessInstanceImpl$4.execute(DefaultProcessInstanceImpl.java:378) [cattle-framework-engine-0.5.0-SNAPSHOT.jar:na]
at io.cattle.platform.engine.process.impl.DefaultProcessInstanceImpl$4.execute(DefaultProcessInstanceImpl.java:372) [cattle-framework-engine-0.5.0-SNAPSHOT.jar:na]
at io.cattle.platform.engine.idempotent.Idempotent.execute(Idempotent.java:42) [cattle-framework-engine-0.5.0-SNAPSHOT.jar:na]
at io.cattle.platform.engine.process.impl.DefaultProcessInstanceImpl.runHandlers(DefaultProcessInstanceImpl.java:372) [cattle-framework-engine-0.5.0-SNAPSHOT.jar:na]
at io.cattle.platform.engine.process.impl.DefaultProcessInstanceImpl.runLogic(DefaultProcessInstanceImpl.java:477) [cattle-framework-engine-0.5.0-SNAPSHOT.jar:na]
at io.cattle.platform.engine.process.impl.DefaultProcessInstanceImpl.runWithProcessLock(DefaultProcessInstanceImpl.java:308) [cattle-framework-engine-0.5.0-SNAPSHOT.jar:na]
at io.cattle.platform.engine.process.impl.DefaultProcessInstanceImpl$2.doWithLockNoResult(DefaultProcessInstanceImpl.java:247) [cattle-framework-engine-0.5.0-SNAPSHOT.jar:na]
at io.cattle.platform.lock.LockCallbackNoReturn.doWithLock(LockCallbackNoReturn.java:7) [cattle-framework-lock-0.5.0-SNAPSHOT.jar:na]
at io.cattle.platform.lock.LockCallbackNoReturn.doWithLock(LockCallbackNoReturn.java:3) [cattle-framework-lock-0.5.0-SNAPSHOT.jar:na]
at io.cattle.platform.lock.impl.AbstractLockManagerImpl$3.doWithLock(AbstractLockManagerImpl.java:40) [cattle-framework-lock-0.5.0-SNAPSHOT.jar:na]
at io.cattle.platform.lock.impl.LockManagerImpl.doLock(LockManagerImpl.java:33) [cattle-framework-lock-0.5.0-SNAPSHOT.jar:na]
at io.cattle.platform.lock.impl.AbstractLockManagerImpl.lock(AbstractLockManagerImpl.java:13) [cattle-framework-lock-0.5.0-SNAPSHOT.jar:na]
at io.cattle.platform.lock.impl.AbstractLockManagerImpl.lock(AbstractLockManagerImpl.java:37) [cattle-framework-lock-0.5.0-SNAPSHOT.jar:na]
at io.cattle.platform.engine.process.impl.DefaultProcessInstanceImpl.acquireLockAndRun(DefaultProcessInstanceImpl.java:244) [cattle-framework-engine-0.5.0-SNAPSHOT.jar:na]
at io.cattle.platform.engine.process.impl.DefaultProcessInstanceImpl.runDelegateLoop(DefaultProcessInstanceImpl.java:186) [cattle-framework-engine-0.5.0-SNAPSHOT.jar:na]
at io.cattle.platform.engine.process.impl.DefaultProcessInstanceImpl.executeWithProcessInstanceLock(DefaultProcessInstanceImpl.java:159) [cattle-framework-engine-0.5.0-SNAPSHOT.jar:na]
at io.cattle.platform.engine.process.impl.DefaultProcessInstanceImpl$1.doWithLock(DefaultProcessInstanceImpl.java:108) [cattle-framework-engine-0.5.0-SNAPSHOT.jar:na]
at io.cattle.platform.engine.process.impl.DefaultProcessInstanceImpl$1.doWithLock(DefaultProcessInstanceImpl.java:105) [cattle-framework-engine-0.5.0-SNAPSHOT.jar:na]
at io.cattle.platform.lock.impl.AbstractLockManagerImpl$3.doWithLock(AbstractLockManagerImpl.java:40) [cattle-framework-lock-0.5.0-SNAPSHOT.jar:na]
at io.cattle.platform.lock.impl.LockManagerImpl.doLock(LockManagerImpl.java:33) [cattle-framework-lock-0.5.0-SNAPSHOT.jar:na]
at io.cattle.platform.lock.impl.AbstractLockManagerImpl.lock(AbstractLockManagerImpl.java:13) [cattle-framework-lock-0.5.0-SNAPSHOT.jar:na]
at io.cattle.platform.lock.impl.AbstractLockManagerImpl.lock(AbstractLockManagerImpl.java:37) [cattle-framework-lock-0.5.0-SNAPSHOT.jar:na]
at io.cattle.platform.engine.process.impl.DefaultProcessInstanceImpl.execute(DefaultProcessInstanceImpl.java:105) [cattle-framework-engine-0.5.0-SNAPSHOT.jar:na]
at io.cattle.platform.object.process.impl.DefaultObjectProcessManager.executeStandardProcess(DefaultObjectProcessManager.java:29) [cattle-framework-object-0.5.0-SNAPSHOT.jar:na]
at io.cattle.platform.process.common.handler.AbstractObjectProcessLogic.deactivate(AbstractObjectProcessLogic.java:35) [cattle-iaas-logic-common-0.5.0-SNAPSHOT.jar:na]
at io.cattle.platform.process.instance.InstanceStop.network(InstanceStop.java:65) [cattle-iaas-logic-0.5.0-SNAPSHOT.jar:na]
at io.cattle.platform.process.instance.InstanceStop.handle(InstanceStop.java:39) [cattle-iaas-logic-0.5.0-SNAPSHOT.jar:na]
at io.cattle.platform.engine.process.impl.DefaultProcessInstanceImpl.runHandler(DefaultProcessInstanceImpl.java:427) [cattle-framework-engine-0.5.0-SNAPSHOT.jar:na]
at io.cattle.platform.engine.process.impl.DefaultProcessInstanceImpl$4.execute(DefaultProcessInstanceImpl.java:378) [cattle-framework-engine-0.5.0-SNAPSHOT.jar:na]
at io.cattle.platform.engine.process.impl.DefaultProcessInstanceImpl$4.execute(DefaultProcessInstanceImpl.java:372) [cattle-framework-engine-0.5.0-SNAPSHOT.jar:na]
at io.cattle.platform.engine.idempotent.Idempotent.execute(Idempotent.java:42) [cattle-framework-engine-0.5.0-SNAPSHOT.jar:na]
at io.cattle.platform.engine.process.impl.DefaultProcessInstanceImpl.runHandlers(DefaultProcessInstanceImpl.java:372) [cattle-framework-engine-0.5.0-SNAPSHOT.jar:na]
at io.cattle.platform.engine.process.impl.DefaultProcessInstanceImpl.runLogic(DefaultProcessInstanceImpl.java:474) [cattle-framework-engine-0.5.0-SNAPSHOT.jar:na]
at io.cattle.platform.engine.process.impl.DefaultProcessInstanceImpl.runWithProcessLock(DefaultProcessInstanceImpl.java:308) [cattle-framework-engine-0.5.0-SNAPSHOT.jar:na]
at io.cattle.platform.engine.process.impl.DefaultProcessInstanceImpl$2.doWithLockNoResult(DefaultProcessInstanceImpl.java:247) [cattle-framework-engine-0.5.0-SNAPSHOT.jar:na]
at io.cattle.platform.lock.LockCallbackNoReturn.doWithLock(LockCallbackNoReturn.java:7) [cattle-framework-lock-0.5.0-SNAPSHOT.jar:na]
at io.cattle.platform.lock.LockCallbackNoReturn.doWithLock(LockCallbackNoReturn.java:3) [cattle-framework-lock-0.5.0-SNAPSHOT.jar:na]
at io.cattle.platform.lock.impl.AbstractLockManagerImpl$3.doWithLock(AbstractLockManagerImpl.java:40) [cattle-framework-lock-0.5.0-SNAPSHOT.jar:na]
at io.cattle.platform.lock.impl.LockManagerImpl.doLock(LockManagerImpl.java:33) [cattle-framework-lock-0.5.0-SNAPSHOT.jar:na]
at io.cattle.platform.lock.impl.AbstractLockManagerImpl.lock(AbstractLockManagerImpl.java:13) [cattle-framework-lock-0.5.0-SNAPSHOT.jar:na]
at io.cattle.platform.lock.impl.AbstractLockManagerImpl.lock(AbstractLockManagerImpl.java:37) [cattle-framework-lock-0.5.0-SNAPSHOT.jar:na]
at io.cattle.platform.engine.process.impl.DefaultProcessInstanceImpl.acquireLockAndRun(DefaultProcessInstanceImpl.java:244) [cattle-framework-engine-0.5.0-SNAPSHOT.jar:na]
at io.cattle.platform.engine.process.impl.DefaultProcessInstanceImpl.runDelegateLoop(DefaultProcessInstanceImpl.java:186) [cattle-framework-engine-0.5.0-SNAPSHOT.jar:na]
at io.cattle.platform.engine.process.impl.DefaultProcessInstanceImpl.executeWithProcessInstanceLock(DefaultProcessInstanceImpl.java:159) [cattle-framework-engine-0.5.0-SNAPSHOT.jar:na]
at io.cattle.platform.engine.process.impl.DefaultProcessInstanceImpl$1.doWithLock(DefaultProcessInstanceImpl.java:108) [cattle-framework-engine-0.5.0-SNAPSHOT.jar:na]
at io.cattle.platform.engine.process.impl.DefaultProcessInstanceImpl$1.doWithLock(DefaultProcessInstanceImpl.java:105) [cattle-framework-engine-0.5.0-SNAPSHOT.jar:na]
at io.cattle.platform.lock.impl.AbstractLockManagerImpl$3.doWithLock(AbstractLockManagerImpl.java:40) [cattle-framework-lock-0.5.0-SNAPSHOT.jar:na]
at io.cattle.platform.lock.impl.LockManagerImpl.doLock(LockManagerImpl.java:33) [cattle-framework-lock-0.5.0-SNAPSHOT.jar:na]
at io.cattle.platform.lock.impl.AbstractLockManagerImpl.lock(AbstractLockManagerImpl.java:13) [cattle-framework-lock-0.5.0-SNAPSHOT.jar:na]
at io.cattle.platform.lock.impl.AbstractLockManagerImpl.lock(AbstractLockManagerImpl.java:37) [cattle-framework-lock-0.5.0-SNAPSHOT.jar:na]
at io.cattle.platform.engine.process.impl.DefaultProcessInstanceImpl.execute(DefaultProcessInstanceImpl.java:105) [cattle-framework-engine-0.5.0-SNAPSHOT.jar:na]
at io.cattle.platform.engine.eventing.impl.ProcessEventListenerImpl.processExecute(ProcessEventListenerImpl.java:75) [cattle-framework-engine-0.5.0-SNAPSHOT.jar:na]
at io.cattle.platform.engine.eventing.impl.ProcessEventListenerImpl.processExecute(ProcessEventListenerImpl.java:57) [cattle-framework-engine-0.5.0-SNAPSHOT.jar:na]
at sun.reflect.GeneratedMethodAccessor782.invoke(Unknown Source) ~[na:na]
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[na:1.7.0_111]
at java.lang.reflect.Method.invoke(Method.java:606) ~[na:1.7.0_111]
at io.cattle.platform.eventing.annotation.MethodInvokingListener$1.doWithLockNoResult(MethodInvokingListener.java:76) [cattle-framework-eventing-0.5.0-SNAPSHOT.jar:na]
at io.cattle.platform.lock.LockCallbackNoReturn.doWithLock(LockCallbackNoReturn.java:7) [cattle-framework-lock-0.5.0-SNAPSHOT.jar:na]
at io.cattle.platform.lock.LockCallbackNoReturn.doWithLock(LockCallbackNoReturn.java:3) [cattle-framework-lock-0.5.0-SNAPSHOT.jar:na]
at io.cattle.platform.lock.impl.AbstractLockManagerImpl$3.doWithLock(AbstractLockManagerImpl.java:40) [cattle-framework-lock-0.5.0-SNAPSHOT.jar:na]
at io.cattle.platform.lock.impl.LockManagerImpl.doLock(LockManagerImpl.java:33) [cattle-framework-lock-0.5.0-SNAPSHOT.jar:na]
at io.cattle.platform.lock.impl.AbstractLockManagerImpl.lock(AbstractLockManagerImpl.java:13) [cattle-framework-lock-0.5.0-SNAPSHOT.jar:na]
at io.cattle.platform.lock.impl.AbstractLockManagerImpl.lock(AbstractLockManagerImpl.java:37) [cattle-framework-lock-0.5.0-SNAPSHOT.jar:na]
at io.cattle.platform.eventing.annotation.MethodInvokingListener.onEvent(MethodInvokingListener.java:72) [cattle-framework-eventing-0.5.0-SNAPSHOT.jar:na]
at io.cattle.platform.eventing.impl.AbstractThreadPoolingEventService$2.doRun(AbstractThreadPoolingEventService.java:135) [cattle-framework-eventing-0.5.0-SNAPSHOT.jar:na]
at org.apache.cloudstack.managed.context.NoExceptionRunnable.runInContext(NoExceptionRunnable.java:15) [cattle-framework-managed-context-0.5.0-SNAPSHOT.jar:na]
at org.apache.cloudstack.managed.context.ManagedContextRunnable$1.run(ManagedContextRunnable.java:49) [cattle-framework-managed-context-0.5.0-SNAPSHOT.jar:na]
at org.apache.cloudstack.managed.context.impl.DefaultManagedContext$1.call(DefaultManagedContext.java:55) [cattle-framework-managed-context-0.5.0-SNAPSHOT.jar:na]
at org.apache.cloudstack.managed.context.impl.DefaultManagedContext.callWithContext(DefaultManagedContext.java:108) [cattle-framework-managed-context-0.5.0-SNAPSHOT.jar:na]
at org.apache.cloudstack.managed.context.impl.DefaultManagedContext.runWithContext(DefaultManagedContext.java:52) [cattle-framework-managed-context-0.5.0-SNAPSHOT.jar:na]
at org.apache.cloudstack.managed.context.ManagedContextRunnable.run(ManagedContextRunnable.java:46) [cattle-framework-managed-context-0.5.0-SNAPSHOT.jar:na]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [na:1.7.0_111]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [na:1.7.0_111]
at java.lang.Thread.run(Thread.java:745) [na:1.7.0_111]
Caused by: java.sql.BatchUpdateException: Lock wait timeout exceeded; try restarting transaction
at org.mariadb.jdbc.MariaDbServerPreparedStatement.execute(MariaDbServerPreparedStatement.java:376) ~[mariadb-java-client-1.3.4.jar:na]
at org.apache.commons.dbcp.DelegatingPreparedStatement.execute(DelegatingPreparedStatement.java:172) ~[commons-dbcp-1.4.jar:1.4]
at org.apache.commons.dbcp.DelegatingPreparedStatement.execute(DelegatingPreparedStatement.java:172) ~[commons-dbcp-1.4.jar:1.4]
at org.jooq.tools.jdbc.DefaultPreparedStatement.execute(DefaultPreparedStatement.java:194) ~[jooq-3.3.0.jar:na]
at org.jooq.impl.AbstractQuery.execute(AbstractQuery.java:376) ~[jooq-3.3.0.jar:na]
at org.jooq.impl.AbstractStoreQuery.execute(AbstractStoreQuery.java:289) ~[jooq-3.3.0.jar:na]
at org.jooq.impl.AbstractQuery.execute(AbstractQuery.java:322) ~[jooq-3.3.0.jar:na]
… 82 common frames omitted
Caused by: java.sql.SQLException: Lock wait timeout exceeded; try restarting transaction
at org.mariadb.jdbc.internal.util.ExceptionMapper.get(ExceptionMapper.java:136) ~[mariadb-java-client-1.3.4.jar:na]
at org.mariadb.jdbc.internal.util.ExceptionMapper.throwException(ExceptionMapper.java:69) ~[mariadb-java-client-1.3.4.jar:na]
at org.mariadb.jdbc.MariaDbServerPreparedStatement.executeQueryEpilog(MariaDbServerPreparedStatement.java:338) ~[mariadb-java-client-1.3.4.jar:na]
at org.mariadb.jdbc.MariaDbServerPreparedStatement.executeInternal(MariaDbServerPreparedStatement.java:293) ~[mariadb-java-client-1.3.4.jar:na]
at org.mariadb.jdbc.MariaDbServerPreparedStatement.execute(MariaDbServerPreparedStatement.java:371) ~[mariadb-java-client-1.3.4.jar:na]
… 88 common frames omitted
Caused by: org.mariadb.jdbc.internal.util.dao.QueryException: Lock wait timeout exceeded; try restarting transaction
at org.mariadb.jdbc.internal.protocol.AbstractQueryProtocol.getResult(AbstractQueryProtocol.java:475) ~[mariadb-java-client-1.3.4.jar:na]
at org.mariadb.jdbc.internal.protocol.AbstractQueryProtocol.executePreparedQuery(AbstractQueryProtocol.java:588) ~[mariadb-java-client-1.3.4.jar:na]
at org.mariadb.jdbc.MariaDbServerPreparedStatement.executeInternal(MariaDbServerPreparedStatement.java:281) ~[mariadb-java-client-1.3.4.jar:na]
… 89 common frames omitted

Hey, Raghav added some details.

Here’s a simpler example. This is launching a stock ubuntu image, and the logs from the rancher server.

Basically, lots of “timeouts”.

We’d appreciate any feedback.

Continuing on this… we’re 10+ minutes into this…and its following that same sequence shown in the script… nothing completed. I think it will complete… it will take a LONG time… but it will eventually finish.

Here is the rancher server data from the rancher server side.

the machine is not under much load:

it just came up… 15 minutes+ in.

https://gist.github.com/wflanagan/fc49beeb23203ccee33eb8f0b455f98c is the last 100 log entries.

Any thoughts?

rancher-compose is 0.10.0 on Macintosh. We’ve also run it on Linux with identical behavior.

Outside of rancher, does the docker pull/run get that image in a reasonable amount of time, and launch it?

@aemneina I recently tested it again on all our nodes. The docker pull of images and docker run outside rancher happens extremely fast as compared to rancher.

So, we’ve upgraded to 1.2. It seems to work if there is NOTHING on it. When you push much of anything, things go belly up. This is a BRAND NEW stack, on a BRAND NEW server. So, there has to be something we’re doing wrong… I"m just not sure what it is.

Could you share your docker-compose.yml and rancher-compose.yml as well as the rancher-compose commands that you’re using to start up your services/stacks?

There are others who are doing your use case, so I"m not sure what is causing your issues.

we have a script that generates a compose file for each “stack”. Then, it loops over them, entering the directory that has the docker and rancher compose files, and executes:

rancher-compose up --force-upgrade --confirm-upgrade -d

docker-compose and rancher-compose.yml examples on this gist. This is the biggest “stack” we have, deploymentwise.

I noticed in your docker info that you have “apparmor” enabled, have you tested any scenarios with apparmor disabled?

Hi Phillip. I have this other thread going… but I will check on this too…