1.2 Upgrade problems -- LB, Service upgrade failure, others

I just upgraded from 1.1.4 to 1.2.0 on Debian 8 (a GCE base image with a few config changes).

My load balancers are currently failing to upgrade, with this message in the UI:
lb (Expected state running but got error: Timeout getting IP address)

One of my services also failed to upgrade?

Generally speaking, this has changed my expectations around Rancher upgrades compared to the smooth upgrades of the past. It's one thing for the Rancher server to have problems, but this impacted the availability of my services. I really love Rancher and what you're all building, but this kinda sucks :-(.

During the launch of the new Rancher container I saw:

time="2016-12-02T17:24:55Z" level=info msg="Updating machine jsons for  [azure packet amazonec2 azure digitalocean google]"
time="2016-12-02T17:24:57Z" level=info msg="Creating schema machine, roles [service]" id=1ds39
2016-12-02 17:24:57,977 ERROR [:] [] [] [] [TaskScheduler-1] [i.c.p.core.cleanup.TableCleanup     ] SQL [delete from `instance` where `instance`.`id` in (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)]; Cannot delete or update a parent row: a foreign key constraint fails (`cattle`.`network_service_provider_instance_map`, CONSTRAINT `fk_network_service_provider_instance_map__instance_id` FOREIGN KEY (`instance_id`) REFERENCES `instance` (`id`) ON DELETE NO ACTIO)
time="2016-12-02T17:24:58Z" level=info msg="Creating schema host, roles [service]" id=1ds40
2016-12-02 17:24:58,552 WARN  [:] [] [] [] [TaskScheduler-1] [i.c.p.core.cleanup.TableCleanup     ] [Rows Skipped] volume=73 host=15 physical_host=15 service_index=100 service=100 environment=2 credential=2 storage_pool=16 agent=100 account=100
time="2016-12-02T17:25:00Z" level=info msg="Creating schema machine, roles [project member owner]" id=1ds41

After launch, I've seen these types of log errors:

time="2016-12-02T17:27:24Z" level=info msg="Stack Create Event Done" eventId=3c3f91e2-1cc8-4dc0-98bd-8cb5aba6e94e resourceId=1st54
2016-12-02 17:27:25,055 ERROR [:] [] [] [] [ecutorService-1] [o.a.c.m.context.NoExceptionRunnable ] Uncaught exception org.jooq.exception.DataChangedException: Database record has been changed
  at org.jooq.impl.UpdatableRecordImpl.checkIfChanged(UpdatableRecordImpl.java:550) ~[jooq-3.3.0.jar:na]
  at org.jooq.impl.UpdatableRecordImpl.storeUpdate0(UpdatableRecordImpl.java:291) ~[jooq-3.3.0.jar:na]
  at org.jooq.impl.UpdatableRecordImpl.access$200(UpdatableRecordImpl.java:90) ~[jooq-3.3.0.jar:na]
  at org.jooq.impl.UpdatableRecordImpl$3.operate(UpdatableRecordImpl.java:260) ~[jooq-3.3.0.jar:na]
  at org.jooq.impl.RecordDelegate.operate(RecordDelegate.java:123) ~[jooq-3.3.0.jar:na]
  at org.jooq.impl.UpdatableRecordImpl.storeUpdate(UpdatableRecordImpl.java:255) ~[jooq-3.3.0.jar:na]
  at org.jooq.impl.UpdatableRecordImpl.update(UpdatableRecordImpl.java:149) ~[jooq-3.3.0.jar:na]
  at io.cattle.platform.object.impl.JooqObjectManager.persistRecord(JooqObjectManager.java:223) ~[cattle-framework-object-0.5.0-SNAPSHOT.jar:na]
  at io.cattle.platform.object.impl.JooqObjectManager.setFieldsInternal(JooqObjectManager.java:130) ~[cattle-framework-object-0.5.0-SNAPSHOT.jar:na]
  at io.cattle.platform.object.impl.JooqObjectManager$3.execute(JooqObjectManager.java:118) ~[cattle-framework-object-0.5.0-SNAPSHOT.jar:na]
  at io.cattle.platform.engine.idempotent.Idempotent.change(Idempotent.java:88) ~[cattle-framework-engine-0.5.0-SNAPSHOT.jar:na]
  at io.cattle.platform.object.impl.JooqObjectManager.setFields(JooqObjectManager.java:115) ~[cattle-framework-object-0.5.0-SNAPSHOT.jar:na]
  at io.cattle.platform.object.impl.JooqObjectManager.setFields(JooqObjectManager.java:110) ~[cattle-framework-object-0.5.0-SNAPSHOT.jar:na]
  at io.cattle.platform.object.impl.AbstractObjectManager.setFields(AbstractObjectManager.java:135) ~[cattle-framework-object-0.5.0-SNAPSHOT.jar:na]
  at io.cattle.platform.systemstack.listener.SystemStackUpdate.startStacks(SystemStackUpdate.java:142) ~[cattle-system-stack-0.5.0-SNAPSHOT.jar:na]
  at io.cattle.platform.systemstack.listener.SystemStackUpdate.process(SystemStackUpdate.java:175) ~[cattle-system-stack-0.5.0-SNAPSHOT.jar:na]
  at io.cattle.platform.systemstack.listener.SystemStackUpdate$1.run(SystemStackUpdate.java:102) ~[cattle-system-stack-0.5.0-SNAPSHOT.jar:na]
  at io.cattle.platform.configitem.version.impl.ConfigItemStatusManagerImpl$1.doWithLock(ConfigItemStatusManagerImpl.java:92) ~[cattle-config-item-common-0.5.0-SNAPSHOT.jar:na]
  at io.cattle.platform.lock.impl.AbstractLockManagerImpl$4.doWithLock(AbstractLockManagerImpl.java:50) ~[cattle-framework-lock-0.5.0-SNAPSHOT.jar:na]
  at io.cattle.platform.lock.impl.LockManagerImpl.doLock(LockManagerImpl.java:33) ~[cattle-framework-lock-0.5.0-SNAPSHOT.jar:na]
  at io.cattle.platform.lock.impl.AbstractLockManagerImpl.tryLock(AbstractLockManagerImpl.java:25) ~[cattle-framework-lock-0.5.0-SNAPSHOT.jar:na]
  at io.cattle.platform.lock.impl.AbstractLockManagerImpl.tryLock(AbstractLockManagerImpl.java:47) ~[cattle-framework-lock-0.5.0-SNAPSHOT.jar:na]
  at io.cattle.platform.configitem.version.impl.ConfigItemStatusManagerImpl.runUpdateForEvent(ConfigItemStatusManagerImpl.java:85) ~[cattle-config-item-common-0.5.0-SNAPSHOT.jar:na]
  at io.cattle.platform.systemstack.listener.SystemStackUpdate.globalServiceUpdate(SystemStackUpdate.java:98) ~[cattle-system-stack-0.5.0-SNAPSHOT.jar:na]
  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[na:1.8.0_72]
  at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[na:1.8.0_72]
  at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[na:1.8.0_72]
  at java.lang.reflect.Method.invoke(Method.java:498) ~[na:1.8.0_72]
  at io.cattle.platform.eventing.annotation.MethodInvokingListener$1.doWithLockNoResult(MethodInvokingListener.java:76) ~[cattle-framework-eventing-0.5.0-SNAPSHOT.jar:na]
  at io.cattle.platform.lock.LockCallbackNoReturn.doWithLock(LockCallbackNoReturn.java:7) ~[cattle-framework-lock-0.5.0-SNAPSHOT.jar:na]
  at io.cattle.platform.lock.LockCallbackNoReturn.doWithLock(LockCallbackNoReturn.java:3) ~[cattle-framework-lock-0.5.0-SNAPSHOT.jar:na]
  at io.cattle.platform.lock.impl.AbstractLockManagerImpl$3.doWithLock(AbstractLockManagerImpl.java:40) ~[cattle-framework-lock-0.5.0-SNAPSHOT.jar:na]
  at io.cattle.platform.lock.impl.LockManagerImpl.doLock(LockManagerImpl.java:33) ~[cattle-framework-lock-0.5.0-SNAPSHOT.jar:na]
  at io.cattle.platform.lock.impl.AbstractLockManagerImpl.lock(AbstractLockManagerImpl.java:13) ~[cattle-framework-lock-0.5.0-SNAPSHOT.jar:na]
  at io.cattle.platform.lock.impl.AbstractLockManagerImpl.lock(AbstractLockManagerImpl.java:37) ~[cattle-framework-lock-0.5.0-SNAPSHOT.jar:na]
  at io.cattle.platform.eventing.annotation.MethodInvokingListener.onEvent(MethodInvokingListener.java:72) ~[cattle-framework-eventing-0.5.0-SNAPSHOT.jar:na]
  at io.cattle.platform.eventing.impl.AbstractThreadPoolingEventService$2.doRun(AbstractThreadPoolingEventService.java:140) ~[cattle-framework-eventing-0.5.0-SNAPSHOT.jar:na]
  at org.apache.cloudstack.managed.context.NoExceptionRunnable.runInContext(NoExceptionRunnable.java:15) ~[cattle-framework-managed-context-0.5.0-SNAPSHOT.jar:na]
  at org.apache.cloudstack.managed.context.ManagedContextRunnable$1.run(ManagedContextRunnable.java:49) [cattle-framework-managed-context-0.5.0-SNAPSHOT.jar:na]
  at org.apache.cloudstack.managed.context.impl.DefaultManagedContext$1.call(DefaultManagedContext.java:55) [cattle-framework-managed-context-0.5.0-SNAPSHOT.jar:na]
  at org.apache.cloudstack.managed.context.impl.DefaultManagedContext.callWithContext(DefaultManagedContext.java:108) [cattle-framework-managed-context-0.5.0-SNAPSHOT.jar:na]
  at org.apache.cloudstack.managed.context.impl.DefaultManagedContext.runWithContext(DefaultManagedContext.java:52) [cattle-framework-managed-context-0.5.0-SNAPSHOT.jar:na]
  at org.apache.cloudstack.managed.context.ManagedContextRunnable.run(ManagedContextRunnable.java:46) [cattle-framework-managed-context-0.5.0-SNAPSHOT.jar:na]
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [na:1.8.0_72]
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [na:1.8.0_72]
  at java.lang.Thread.run(Thread.java:745) [na:1.8.0_72]
2016-12-02 17:28:09,717 ERROR [9722e583-271a-4372-9559-9ba25f519ee3:1541878] [instance:3949] [instance.start->(InstanceStart)] [] [ecutorService-2] [i.c.p.process.instance.InstanceStart] Failed to Waiting for dependencies for instance [3949]
2016-12-02 17:28:10,228 ERROR [8f78d6b0-393d-4e94-b241-10bf4960ecd1:1541893] [instance:3952] [instance.start->(InstanceStart)] [] [cutorService-13] [i.c.p.process.instance.InstanceStart] Failed to Waiting for deployment unit instances to create for instance [3952]
2016-12-02 17:28:11,633 ERROR [8b6b5fef-9ceb-44cf-aab8-7d9c0ac58e48:1541900] [instance:3957] [instance.start->(InstanceStart)] [] [ecutorService-5] [i.c.p.process.instance.InstanceStart] Failed to Waiting for dependencies for instance [3957]
2016-12-02 17:28:16,209 ERROR [6fb2cdee-5cb4-4bf4-ab44-6dbc5f6e74d7:1541912] [instance:3960] [instance.start->(InstanceStart)] [] [ecutorService-7] [i.c.p.process.instance.InstanceStart] Failed to Waiting for deployment unit instances to create for instance [3960]
2016-12-02 17:28:16,425 ERROR [4789ca65-5af7-4dc2-92ac-ee969164fc14:1541917] [instance:3965] [instance.start->(InstanceStart)] [] [cutorService-12] [i.c.p.process.instance.InstanceStart] Failed to Waiting for dependencies for instance [3965]
2016-12-02 17:28:17,887 ERROR [32518592-3846-4df8-930e-f80352e6f068:1541864] [instance:3948->instanceHostMap:3215] [instance.start->(InstanceStart)->instancehostmap.activate] [] [ecutorService-9] [c.p.e.p.i.DefaultProcessInstanceImpl] Agent error for [compute.instance.activate.reply;agent=340]: Timeout getting IP address
2016-12-02 17:28:17,887 ERROR [32518592-3846-4df8-930e-f80352e6f068:1541864] [instance:3948] [instance.start->(InstanceStart)] [] [ecutorService-9] [i.c.p.process.instance.InstanceStart] Failed [1/2] to Starting for instance [3948]
2016-12-02 17:28:17,984 ERROR [32518592-3846-4df8-930e-f80352e6f068:1541864] [instance:3948->instanceHostMap:3215] [instance.start->(InstanceStart)->instancehostmap.activate] [] [ecutorService-9] [c.p.e.p.i.DefaultProcessInstanceImpl] Agent error for [compute.instance.activate.reply;agent=340]: Timeout getting IP address
2016-12-02 17:28:17,984 ERROR [32518592-3846-4df8-930e-f80352e6f068:1541864] [instance:3948] [instance.start->(InstanceStart)] [] [ecutorService-9] [i.c.p.process.instance.InstanceStart] Failed [2/2] to Starting for instance [3948]
2016-12-02 17:28:23,141 ERROR [3a514ddd-7033-4d0f-bc5e-086c3f790510:1541856] [instance:3945->instanceHostMap:3213] [instance.start->(InstanceStart)->instancehostmap.activate] [] [ecutorService-1] [c.p.e.p.i.DefaultProcessInstanceImpl] Agent error for [compute.instance.activate.reply;agent=340]: Timeout getting IP address
2016-12-02 17:28:23,141 ERROR [3a514ddd-7033-4d0f-bc5e-086c3f790510:1541856] [instance:3945] [instance.start->(InstanceStart)] [] [ecutorService-1] [i.c.p.process.instance.InstanceStart] Failed [1/2] to Starting for instance [3945]
2016-12-02 17:28:23,191 ERROR [3a514ddd-7033-4d0f-bc5e-086c3f790510:1541856] [instance:3945->instanceHostMap:3213] [instance.start->(InstanceStart)->instancehostmap.activate] [] [ecutorService-1] [c.p.e.p.i.DefaultProcessInstanceImpl] Agent error for [compute.instance.activate.reply;agent=340]: Timeout getting IP address
2016-12-02 17:28:23,191 ERROR [3a514ddd-7033-4d0f-bc5e-086c3f790510:1541856] [instance:3945] [instance.start->(InstanceStart)] [] [ecutorService-1] [i.c.p.process.instance.InstanceStart] Failed [2/2] to Starting for instance [3945]
2016-12-02 17:28:23,642 ERROR [1e371feb-51c6-4e05-abbe-abf884833c7e:1541853] [instance:3944->instanceHostMap:3212] [instance.start->(InstanceStart)->instancehostmap.activate] [] [cutorService-10] [c.p.e.p.i.DefaultProcessInstanceImpl] Agent error for [compute.instance.activate.reply;agent=394]: Timeout getting IP address
2016-12-02 17:28:23,642 ERROR [1e371feb-51c6-4e05-abbe-abf884833c7e:1541853] [instance:3944] [instance.start->(InstanceStart)] [] [cutorService-10] [i.c.p.process.instance.InstanceStart] Failed [1/2] to Starting for instance [3944]
2016-12-02 17:28:23,688 ERROR [1e371feb-51c6-4e05-abbe-abf884833c7e:1541853] [instance:3944->instanceHostMap:3212] [instance.start->(InstanceStart)->instancehostmap.activate] [] [cutorService-10] [c.p.e.p.i.DefaultProcessInstanceImpl] Agent error for [compute.instance.activate.reply;agent=394]: Timeout getting IP address
2016-12-02 17:28:23,690 ERROR [1e371feb-51c6-4e05-abbe-abf884833c7e:1541853] [instance:3944] [instance.start->(InstanceStart)] [] [cutorService-10] [i.c.p.process.instance.InstanceStart] Failed [2/2] to Starting for instance [3944]
2016-12-02 17:28:42,442 ERROR [f1f112b9-9c46-4e87-8334-3cc137e22a32:1541924] [instance:3968] [instance.start->(InstanceStart)] [] [ecutorService-2] [i.c.p.process.instance.InstanceStart] Failed to Waiting for deployment unit instances to create for instance [3968]
2016-12-02 17:28:46,937 ERROR [e7dc6f58-a2e1-4680-b2a8-74f1ce76dcce:1541927] [instance:3971] [instance.start->(InstanceStart)] [] [ecutorService-7] [i.c.p.process.instance.InstanceStart] Failed to Waiting for dependencies for instance [3971]
2016-12-02 17:28:52,430 ERROR [29583744-1f9f-4881-aa06-6c6b00387b15:1541933] [instance:3972] [instance.start->(InstanceStart)] [] [ecutorService-7] [i.c.p.process.instance.InstanceStart] Failed to Waiting for deployment unit instances to create for instance [3972]
2016-12-02 17:28:53,942 ERROR [e6464d27-2019-4ca4-a801-077bb448a043:1541935] [instance:3975] [instance.start->(InstanceStart)] [] [ecutorService-1] [i.c.p.process.instance.InstanceStart] Failed to Waiting for dependencies for instance [3975]
2016-12-02 17:28:56,525 ERROR [ed4e7cd1-4e6b-405f-acb8-2d2450769386:1541848] [service:252] [service.activate] [] [ecutorService-4] [c.p.e.p.i.DefaultProcessInstanceImpl] Expected state running but got error: Timeout getting IP address
2016-12-02 17:28:56,567 ERROR [f9839d9c-ec08-467e-a4b3-8d6def44c7e1:1541852] [service:251] [service.activate] [] [ecutorService-3] [c.p.e.p.i.DefaultProcessInstanceImpl] Expected state running but got error: Timeout getting IP address
2016-12-02 17:28:56,598 ERROR [:] [] [] [] [ecutorService-4] [.e.s.i.ProcessInstanceDispatcherImpl] Expected state running but got error: Timeout getting IP address
2016-12-02 17:28:56,685 ERROR [:] [] [] [] [ecutorService-3] [.e.s.i.ProcessInstanceDispatcherImpl] Expected state running but got error: Timeout getting IP address
2016-12-02 17:28:57,861 ERROR [0e88e987-6ab5-4206-9ba4-60404cd5d4f2:1541851] [service:253] [service.activate] [] [ecutorService-6] [c.p.e.p.i.DefaultProcessInstanceImpl] Expected state running but got error: Timeout getting IP address
2016-12-02 17:28:58,036 ERROR [:] [] [] [] [ecutorService-6] [.e.s.i.ProcessInstanceDispatcherImpl] Expected state running but got error: Timeout getting IP address

The services eventually managed to fix themselves? I'm really not sure what happened behind the scenes, but this upgrade was kind of rough for 5-10 minutes while it resolved its issues. It would be good if the documentation indicated the potential for this type of downtime.

The release notes specifically outlined this kind of outage for the 1.2.0 upgrade.

That was pretty deeply buried for being such a significant note... more feedback than anything else at this point.

Also, I've been having a number of problems with my environment since the upgrade, e.g.:

"Upgrading (Failed to allocate instance [container:1i4093]: Bad instance [container:1i4093] in state [stopping]: Scheduling failed: needs host with label bloom.general: true)" where I have the following scheduling rules:

  labels:
    bloom.pulsepilot.web: true
    io.rancher.scheduler.affinity:host_label: bloom.active=true,bloom.general=true
    io.rancher.scheduler.affinity:container_label_ne: bloom.pulsepilot.web=true

when upgrading a service, and there are no other containers on the host that would prevent the scheduling from happening. (This worked in the past.)
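
(For reference, the host labels themselves were applied when registering the agents via CATTLE_HOST_LABELS, roughly like the sketch below; the agent version, server URL, and token are placeholders, not my real values.)

  # Sketch of how the hosts were registered with their labels; everything in
  # angle brackets is a placeholder, not the actual command used.
  sudo docker run -d --privileged \
    -e CATTLE_HOST_LABELS='bloom.active=true&bloom.general=true' \
    -v /var/run/docker.sock:/var/run/docker.sock \
    rancher/agent:<version> https://<rancher-server>/v1/scripts/<registration-token>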

Additionally, there's a ton of stopped containers floating around, named with GUIDs or after services like "r-healthcheck-healthcheck-2-ccd16e1d", that previously weren't there. If they are service containers, the filter checkbox doesn't remove them.

This is a pretty rough start for a GA release...

Just to confirm, you have a host that has both bloom.active=true AND bloom.general=true?

Can you share a screenshot of your hosts?

Yes, we knew that this GA release would be rougher than normal due to the number of internal changes, but the improvements we've made should make it easier in the future to upgrade just the microservices rather than a whole Rancher setup.

Yes -- I think I have some more clues though...

The containers with GUIDs as names are actually old containers for the same service, with the same labels as the original service. It seems old containers aren't being entirely removed after an upgrade, which is causing issues with scheduling new containers. See the screenshot for more. Let me know if anything else would help debug this.

Just to add some more: after any upgrade, my environment seems to end up with a number of stopped containers, such as ones named with GUIDs, others named 'r-_____' where the blank is an existing container for another service, r-healthcheck-healthcheck, and others.

I also see a lot of log entries on the Rancher server like:

2016-12-03 05:09:24,393 ERROR [fe48384b-2870-4436-b6ea-d36c8a4e3fea:1544570] [volumeStoragePoolMap:3745] [volumestoragepoolmap.remove] [] [ecutorService-3] [c.p.e.p.i.DefaultProcessInstanceImpl] Agent error for [storage.volume.remove.reply;agent=385]: Error response from daemon: Unable to remove filesystem for efcdb6c4f8d5ec0d9404caac98a09f655d61a6dfdd6d2336b00e5c82e906a179: remove /var/lib/docker/containers/efcdb6c4f8d5ec0d9404caac98a09f655d61a6dfdd6d2336b00e5c82e906a179/shm: device or resource busy
2016-12-03 05:09:24,394 ERROR [4fe2fbc0-3ef6-44d7-9c37-00dbff57dbaf:1545164] [volumeStoragePoolMap:3768] [volumestoragepoolmap.remove] [] [cutorService-18] [c.p.e.p.i.DefaultProcessInstanceImpl] Agent error for [storage.volume.remove.reply;agent=341]: Error response from daemon: Unable to remove filesystem for 6b05004e9c4f0407d1820a880a81b6738453766c91b9ab4a9cd48253ad699f25: remove /var/lib/docker/containers/6b05004e9c4f0407d1820a880a81b6738453766c91b9ab4a9cd48253ad699f25/shm: device or resource busy
2016-12-03 05:09:24,394 ERROR [71b3d152-d8f1-444d-b638-0ca416bc73ef:1545308] [volumeStoragePoolMap:3498] [volumestoragepoolmap.remove] [] [ecutorService-7] [c.p.e.p.i.DefaultProcessInstanceImpl] Agent error for [storage.volume.remove.reply;agent=341]: Error response from daemon: Unable to remove filesystem for 7d5ed39cca591e3e5fc5e6599a570c525584de84c313882595e04ba8d3a14f0b: remove /var/lib/docker/containers/7d5ed39cca591e3e5fc5e6599a570c525584de84c313882595e04ba8d3a14f0b/shm: device or resource busy
2016-12-03 05:09:24,412 ERROR [98866f82-6b2f-475b-ba23-4eaa9d6580f9:1545173] [volumeStoragePoolMap:3777] [volumestoragepoolmap.remove] [] [ecutorService-2] [c.p.e.p.i.DefaultProcessInstanceImpl] Agent error for [storage.volume.remove.reply;agent=357]: Error response from daemon: Unable to remove filesystem for 9f1523b69ac0b27611965ca162c5407d382e395d81ae64774758a3fa69b40d42: remove /var/lib/docker/containers/9f1523b69ac0b27611965ca162c5407d382e395d81ae64774758a3fa69b40d42/shm: device or resource busy
2016-12-03 05:09:24,418 ERROR [:] [] [] [] [ecutorService-7] [.e.s.i.ProcessInstanceDispatcherImpl] Agent error for [storage.volume.remove.reply;agent=341]: Error response from daemon: Unable to remove filesystem for 7d5ed39cca591e3e5fc5e6599a570c525584de84c313882595e04ba8d3a14f0b: remove /var/lib/docker/containers/7d5ed39cca591e3e5fc5e6599a570c525584de84c313882595e04ba8d3a14f0b/shm: device or resource busy
2016-12-03 05:09:24,420 ERROR [fbd08b41-8840-48de-b83a-e4df5f97161e:1545207] [volumeStoragePoolMap:3767] [volumestoragepoolmap.remove] [] [cutorService-19] [c.p.e.p.i.DefaultProcessInstanceImpl] Agent error for [storage.volume.remove.reply;agent=385]: Error response from daemon: Unable to remove filesystem for efcdb6c4f8d5ec0d9404caac98a09f655d61a6dfdd6d2336b00e5c82e906a179: remove /var/lib/docker/containers/efcdb6c4f8d5ec0d9404caac98a09f655d61a6dfdd6d2336b00e5c82e906a179/shm: device or resource busy
2016-12-03 05:09:24,422 ERROR [:] [] [] [] [ecutorService-1] [.e.s.i.ProcessInstanceDispatcherImpl] Agent error for [storage.volume.remove.reply;agent=357]: Error response from daemon: Unable to remove filesystem for 5f5d7bc4bd29a949a12090d303f8c14d6f1712cc319d705230a2b6f72169ad53: remove /var/lib/docker/containers/5f5d7bc4bd29a949a12090d303f8c14d6f1712cc319d705230a2b6f72169ad53/shm: device or resource busy
2016-12-03 05:09:24,425 ERROR [:] [] [] [] [ecutorService-3] [.e.s.i.ProcessInstanceDispatcherImpl] Agent error for [storage.volume.remove.reply;agent=385]: Error response from daemon: Unable to remove filesystem for efcdb6c4f8d5ec0d9404caac98a09f655d61a6dfdd6d2336b00e5c82e906a179: remove /var/lib/docker/containers/efcdb6c4f8d5ec0d9404caac98a09f655d61a6dfdd6d2336b00e5c82e906a179/shm: device or resource busy
2016-12-03 05:09:24,426 ERROR [68d10112-0e08-420d-baa9-12036103bfc2:1545797] [volumeStoragePoolMap:3787] [volumestoragepoolmap.remove] [] [ecutorService-4] [c.p.e.p.i.DefaultProcessInstanceImpl] Agent error for [storage.volume.remove.reply;agent=340]: Error response from daemon: Unable to remove filesystem for 2033e6f277ecf007578d634913c5335e68f13277d0154c53a0d8f1f2b9f2dd0b: remove /var/lib/docker/containers/2033e6f277ecf007578d634913c5335e68f13277d0154c53a0d8f1f2b9f2dd0b/shm: device or resource busy
2016-12-03 05:09:24,429 ERROR [f35b68a1-a14b-4a19-960e-0ac03b92ea4d:1545803] [volumeStoragePoolMap:3783] [volumestoragepoolmap.remove] [] [cutorService-16] [c.p.e.p.i.DefaultProcessInstanceImpl] Agent error for [storage.volume.remove.reply;agent=340]: Error response from daemon: Unable to remove filesystem for 89a9f5a9be687722fb10836d5d6e9f26ce3fd359365ce6034a15b23b1863637d: remove /var/lib/docker/containers/89a9f5a9be687722fb10836d5d6e9f26ce3fd359365ce6034a15b23b1863637d/shm: device or resource busy
2016-12-03 05:09:24,433 ERROR [0e2901be-1878-4437-9d3a-ba0932209e63:1545195] [volumeStoragePoolMap:3765] [volumestoragepoolmap.remove] [] [cutorService-20] [c.p.e.p.i.DefaultProcessInstanceImpl] Agent error for [storage.volume.remove.reply;agent=340]: Error response from daemon: Unable to remove filesystem for 581cb06ddec3953a9418906e79c5bf49abf3219179ade475c268485751be771d: remove /var/lib/docker/containers/581cb06ddec3953a9418906e79c5bf49abf3219179ade475c268485751be771d/shm: device or resource busy
2016-12-03 05:09:24,437 ERROR [:] [] [] [] [ecutorService-6] [.e.s.i.ProcessInstanceDispatcherImpl] Agent error for [storage.volume.remove.reply;agent=341]: Error response from daemon: Unable to remove filesystem for b887edd615f64a20ca4f4bccde605438fff18f2d0495b61ed608e3107a04597c: remove /var/lib/docker/containers/b887edd615f64a20ca4f4bccde605438fff18f2d0495b61ed608e3107a04597c/shm: device or resource busy
2016-12-03 05:09:24,440 ERROR [6efbada2-7d8c-4e49-bfae-b1a9c432d3fb:1545800] [volumeStoragePoolMap:3784] [volumestoragepoolmap.remove] [] [cutorService-13] [c.p.e.p.i.DefaultProcessInstanceImpl] Agent error for [storage.volume.remove.reply;agent=340]: Error response from daemon: Unable to remove filesystem for 957a6be1784a864772225bb83b43072f03de524247c5b63c57e93e59f671710f: remove /var/lib/docker/containers/957a6be1784a864772225bb83b43072f03de524247c5b63c57e93e59f671710f/shm: device or resource busy
2016-12-03 05:09:24,445 ERROR [:] [] [] [] [cutorService-18] [.e.s.i.ProcessInstanceDispatcherImpl] Agent error for [storage.volume.remove.reply;agent=341]: Error response from daemon: Unable to remove filesystem for 6b05004e9c4f0407d1820a880a81b6738453766c91b9ab4a9cd48253ad699f25: remove /var/lib/docker/containers/6b05004e9c4f0407d1820a880a81b6738453766c91b9ab4a9cd48253ad699f25/shm: device or resource busy
2016-12-03 05:09:24,446 ERROR [:] [] [] [] [cutorService-11] [.e.s.i.ProcessInstanceDispatcherImpl] Agent error for [storage.volume.remove.reply;agent=342]: Error response from daemon: Unable to remove filesystem for ff024eba01e9dd29afcd4de57bd38d690c73a04c93baff1e3237910d5c7c7111: remove /var/lib/docker/containers/ff024eba01e9dd29afcd4de57bd38d690c73a04c93baff1e3237910d5c7c7111/shm: device or resource busy
2016-12-03 05:09:24,479 ERROR [:] [] [] [] [ecutorService-9] [.e.s.i.ProcessInstanceDispatcherImpl] Agent error for [storage.volume.remove.reply;agent=342]: Error response from daemon: Unable to remove filesystem for 8ba4cfcd35132f4f8847def33a964504b138ca6f7a3b8f7f75366acee792411a: remove /var/lib/docker/containers/8ba4cfcd35132f4f8847def33a964504b138ca6f7a3b8f7f75366acee792411a/shm: device or resource busy
2016-12-03 05:09:24,487 ERROR [:] [] [] [] [ecutorService-2] [.e.s.i.ProcessInstanceDispatcherImpl] Agent error for [storage.volume.remove.reply;agent=357]: Error response from daemon: Unable to remove filesystem for 9f1523b69ac0b27611965ca162c5407d382e395d81ae64774758a3fa69b40d42: remove /var/lib/docker/containers/9f1523b69ac0b27611965ca162c5407d382e395d81ae64774758a3fa69b40d42/shm: device or resource busy
2016-12-03 05:09:24,492 ERROR [:] [] [] [] [cutorService-16] [.e.s.i.ProcessInstanceDispatcherImpl] Agent error for [storage.volume.remove.reply;agent=340]: Error response from daemon: Unable to remove filesystem for 89a9f5a9be687722fb10836d5d6e9f26ce3fd359365ce6034a15b23b1863637d: remove /var/lib/docker/containers/89a9f5a9be687722fb10836d5d6e9f26ce3fd359365ce6034a15b23b1863637d/shm: device or resource busy
2016-12-03 05:09:24,493 ERROR [:] [] [] [] [cutorService-13] [.e.s.i.ProcessInstanceDispatcherImpl] Agent error for [storage.volume.remove.reply;agent=340]: Error response from daemon: Unable to remove filesystem for 957a6be1784a864772225bb83b43072f03de524247c5b63c57e93e59f671710f: remove /var/lib/docker/containers/957a6be1784a864772225bb83b43072f03de524247c5b63c57e93e59f671710f/shm: device or resource busy
2016-12-03 05:09:24,493 ERROR [:] [] [] [] [cutorService-20] [.e.s.i.ProcessInstanceDispatcherImpl] Agent error for [storage.volume.remove.reply;agent=340]: Error response from daemon: Unable to remove filesystem for 581cb06ddec3953a9418906e79c5bf49abf3219179ade475c268485751be771d: remove /var/lib/docker/containers/581cb06ddec3953a9418906e79c5bf49abf3219179ade475c268485751be771d/shm: device or resource busy
2016-12-03 05:09:24,494 ERROR [:] [] [] [] [ecutorService-4] [.e.s.i.ProcessInstanceDispatcherImpl] Agent error for [storage.volume.remove.reply;agent=340]: Error response from daemon: Unable to remove filesystem for 2033e6f277ecf007578d634913c5335e68f13277d0154c53a0d8f1f2b9f2dd0b: remove /var/lib/docker/containers/2033e6f277ecf007578d634913c5335e68f13277d0154c53a0d8f1f2b9f2dd0b/shm: device or resource busy
2016-12-03 05:09:24,494 ERROR [:] [] [] [] [cutorService-19] [.e.s.i.ProcessInstanceDispatcherImpl] Agent error for [storage.volume.remove.reply;agent=385]: Error response from daemon: Unable to remove filesystem for efcdb6c4f8d5ec0d9404caac98a09f655d61a6dfdd6d2336b00e5c82e906a179: remove /var/lib/docker/containers/efcdb6c4f8d5ec0d9404caac98a09f655d61a6dfdd6d2336b00e5c82e906a179/shm: device or resource busy
2016-12-03 05:09:24,495 ERROR [:] [] [] [] [cutorService-12] [.e.s.i.ProcessInstanceDispatcherImpl] Agent error for [storage.volume.remove.reply;agent=394]: Error response from daemon: Unable to remove filesystem for 9cb1312a5fa8110d580bbdef454169010bf56ac610434d72f0a7354971f67409: remove /var/lib/docker/containers/9cb1312a5fa8110d580bbdef454169010bf56ac610434d72f0a7354971f67409/shm: device or resource busy
2016-12-03 05:09:24,592 ERROR [089cd133-b662-43c2-bb24-555b5552f374:1545906] [volumeStoragePoolMap:3748] [volumestoragepoolmap.remove] [] [cutorService-10] [c.p.e.p.i.DefaultProcessInstanceImpl] Agent error for [storage.volume.remove.reply;agent=340]: Error response from daemon: Unable to remove filesystem for bd428d00a651e322ef6bacc2fc1edd4d93d1809d449ac51b0b28b9f922205669: remove /var/lib/docker/containers/bd428d00a651e322ef6bacc2fc1edd4d93d1809d449ac51b0b28b9f922205669/shm: device or resource busy
2016-12-03 05:09:24,603 ERROR [90bce392-7603-4b5d-8c5c-72314562af23:1545913] [volumeStoragePoolMap:3752] [volumestoragepoolmap.remove] [] [ecutorService-5] [c.p.e.p.i.DefaultProcessInstanceImpl] Agent error for [storage.volume.remove.reply;agent=341]: Error response from daemon: Unable to remove filesystem for 0c23ef1f81cede0da2c90bce3c97b66a08985f19dd3496605e327cf7a8910604: remove /var/lib/docker/containers/0c23ef1f81cede0da2c90bce3c97b66a08985f19dd3496605e327cf7a8910604/shm: device or resource busy
2016-12-03 05:09:24,639 ERROR [79518222-526a-4072-8190-24f6ecfac0c7:1545899] [volumeStoragePoolMap:3736] [volumestoragepoolmap.remove] [] [ecutorService-8] [c.p.e.p.i.DefaultProcessInstanceImpl] Agent error for [storage.volume.remove.reply;agent=340]: Error response from daemon: Unable to remove filesystem for 1c56d1546e1b0f168a15b054e61cd570a58836a984ec1709ef0cecbb669e9f50: remove /var/lib/docker/containers/1c56d1546e1b0f168a15b054e61cd570a58836a984ec1709ef0cecbb669e9f50/shm: device or resource busy
2016-12-03 05:09:24,642 ERROR [:] [] [] [] [cutorService-10] [.e.s.i.ProcessInstanceDispatcherImpl] Agent error for [storage.volume.remove.reply;agent=340]: Error response from daemon: Unable to remove filesystem for bd428d00a651e322ef6bacc2fc1edd4d93d1809d449ac51b0b28b9f922205669: remove /var/lib/docker/containers/bd428d00a651e322ef6bacc2fc1edd4d93d1809d449ac51b0b28b9f922205669/shm: device or resource busy
2016-12-03 05:09:24,642 ERROR [:] [] [] [] [ecutorService-5] [.e.s.i.ProcessInstanceDispatcherImpl] Agent error for [storage.volume.remove.reply;agent=341]: Error response from daemon: Unable to remove filesystem for 0c23ef1f81cede0da2c90bce3c97b66a08985f19dd3496605e327cf7a8910604: remove /var/lib/docker/containers/0c23ef1f81cede0da2c90bce3c97b66a08985f19dd3496605e327cf7a8910604/shm: device or resource busy
2016-12-03 05:09:24,644 ERROR [748267e1-4da2-4147-ac3f-e1bda2fad5a9:1546102] [volumeStoragePoolMap:3799] [volumestoragepoolmap.remove] [] [ecutorService-1] [c.p.e.p.i.DefaultProcessInstanceImpl] Agent error for [storage.volume.remove.reply;agent=342]: Error response from daemon: Unable to remove filesystem for ff024eba01e9dd29afcd4de57bd38d690c73a04c93baff1e3237910d5c7c7111: remove /var/lib/docker/containers/ff024eba01e9dd29afcd4de57bd38d690c73a04c93baff1e3237910d5c7c7111/shm: device or resource busy
2016-12-03 05:09:24,656 ERROR [8beb43e3-e83b-4c6d-842b-08166ed2f64a:1546097] [volumeStoragePoolMap:3800] [volumestoragepoolmap.remove] [] [ecutorService-7] [c.p.e.p.i.DefaultProcessInstanceImpl] Agent error for [storage.volume.remove.reply;agent=342]: Error response from daemon: Unable to remove filesystem for 8ba4cfcd35132f4f8847def33a964504b138ca6f7a3b8f7f75366acee792411a: remove /var/lib/docker/containers/8ba4cfcd35132f4f8847def33a964504b138ca6f7a3b8f7f75366acee792411a/shm: device or resource busy
2016-12-03 05:09:24,658 ERROR [1f16d427-ff60-473e-b42b-96c8d2e01454:1545921] [volumeStoragePoolMap:3755] [volumestoragepoolmap.remove] [] [cutorService-15] [c.p.e.p.i.DefaultProcessInstanceImpl] Agent error for [storage.volume.remove.reply;agent=341]: Error response from daemon: Unable to remove filesystem for 37b69759385e175622c2f06345db8e53cad042e9ffb8471ee8b1887707e2f13a: remove /var/lib/docker/containers/37b69759385e175622c2f06345db8e53cad042e9ffb8471ee8b1887707e2f13a/shm: device or resource busy
2016-12-03 05:09:24,660 ERROR [:] [] [] [] [ecutorService-8] [.e.s.i.ProcessInstanceDispatcherImpl] Agent error for [storage.volume.remove.reply;agent=340]: Error response from daemon: Unable to remove filesystem for 1c56d1546e1b0f168a15b054e61cd570a58836a984ec1709ef0cecbb669e9f50: remove /var/lib/docker/containers/1c56d1546e1b0f168a15b054e61cd570a58836a984ec1709ef0cecbb669e9f50/shm: device or resource busy
2016-12-03 05:09:24,665 ERROR [0dc0a2b6-05ed-4f5d-a397-bc44f7b80aa8:1545936] [volumeStoragePoolMap:3753] [volumestoragepoolmap.remove] [] [cutorService-14] [c.p.e.p.i.DefaultProcessInstanceImpl] Agent error for [storage.volume.remove.reply;agent=342]: Error response from daemon: Unable to remove filesystem for 90a1d094269dab2adb97221afa19fa178bb7b1834bd6641ddf4f4a18bbada30c: remove /var/lib/docker/containers/90a1d094269dab2adb97221afa19fa178bb7b1834bd6641ddf4f4a18bbada30c/shm: device or resource busy
2016-12-03 05:09:24,672 ERROR [b8f8b65b-333d-4371-a929-486f179652fc:1545928] [volumeStoragePoolMap:3749] [volumestoragepoolmap.remove] [] [cutorService-17] [c.p.e.p.i.DefaultProcessInstanceImpl] Agent error for [storage.volume.remove.reply;agent=342]: Error response from daemon: Unable to remove filesystem for 09d724312e24ab2a8b386db785046672eecba3e115b12056a45104fe9ac4a201: remove /var/lib/docker/containers/09d724312e24ab2a8b386db785046672eecba3e115b12056a45104fe9ac4a201/shm: device or resource busy
2016-12-03 05:09:24,680 ERROR [:] [] [] [] [cutorService-14] [.e.s.i.ProcessInstanceDispatcherImpl] Agent error for [storage.volume.remove.reply;agent=342]: Error response from daemon: Unable to remove filesystem for 90a1d094269dab2adb97221afa19fa178bb7b1834bd6641ddf4f4a18bbada30c: remove /var/lib/docker/containers/90a1d094269dab2adb97221afa19fa178bb7b1834bd6641ddf4f4a18bbada30c/shm: device or resource busy
2016-12-03 05:09:24,682 ERROR [:] [] [] [] [ecutorService-1] [.e.s.i.ProcessInstanceDispatcherImpl] Agent error for [storage.volume.remove.reply;agent=342]: Error response from daemon: Unable to remove filesystem for ff024eba01e9dd29afcd4de57bd38d690c73a04c93baff1e3237910d5c7c7111: remove /var/lib/docker/containers/ff024eba01e9dd29afcd4de57bd38d690c73a04c93baff1e3237910d5c7c7111/shm: device or resource busy
2016-12-03 05:09:24,683 ERROR [:] [] [] [] [ecutorService-7] [.e.s.i.ProcessInstanceDispatcherImpl] Agent error for [storage.volume.remove.reply;agent=342]: Error response from daemon: Unable to remove filesystem for 8ba4cfcd35132f4f8847def33a964504b138ca6f7a3b8f7f75366acee792411a: remove /var/lib/docker/containers/8ba4cfcd35132f4f8847def33a964504b138ca6f7a3b8f7f75366acee792411a/shm: device or resource busy
2016-12-03 05:09:24,683 ERROR [:] [] [] [] [cutorService-15] [.e.s.i.ProcessInstanceDispatcherImpl] Agent error for [storage.volume.remove.reply;agent=341]: Error response from daemon: Unable to remove filesystem for 37b69759385e175622c2f06345db8e53cad042e9ffb8471ee8b1887707e2f13a: remove /var/lib/docker/containers/37b69759385e175622c2f06345db8e53cad042e9ffb8471ee8b1887707e2f13a/shm: device or resource busy
2016-12-03 05:09:24,685 ERROR [:] [] [] [] [cutorService-17] [.e.s.i.ProcessInstanceDispatcherImpl] Agent error for [storage.volume.remove.reply;agent=342]: Error response from daemon: Unable to remove filesystem for 09d724312e24ab2a8b386db785046672eecba3e115b12056a45104fe9ac4a201: remove /var/lib/docker/containers/09d724312e24ab2a8b386db785046672eecba3e115b12056a45104fe9ac4a201/shm: device or resource busy

Here's another one... I don't know if any of this is related to anything else, but Rancher services are generating 70-80 log entries per second, mostly error messages. For example, in addition to agents generating messages similar to my note above, ipsec is generating a large number of messages like "querying policy failed: No such file or directory (2)".

Another problem...

A service is stuck with (Failed to acquire lock [SERVICE.194] ...). I'm on the verge of spinning up a separate 1.1.4 environment to get things stable...

I had a debug session with a customer and they had the same "Unable to remove filesystem" logs. We fixed them by rebooting the hosts in the Rancher server.

As for upgrading on a fresh install: in 1.2.0, I was able to upgrade a fresh service with the same labels (with only 1 container and 1 host) and it was fine.

I still have to test the post-upgrade scenario, but will try at some point.

I really appreciate your quick replies late in the day.

My environment has been having so many issues / more than I've written here -- I restarted all my hosts and then eventually the server as well. I am still seeing the unable-to-remove-filesystem issue; however, most of the services have recovered. Some of them are still having issues though. I'm also really nervous about touching anything at this point now that most services are back online... might try to repro the issues in a clean environment tomorrow.

I'm starting to suspect a lot of these issues are an interaction between Rancher and my usage of https://github.com/mcuadros/gce-docker -- though I can verify that volumes are being correctly mounted and unmounted by the driver / the only error messages I see are within Rancher. I don't know why these issues would pop up with the upgrade to 1.2 when I hadn't seen them previously, though.
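
(For what it's worth, my "verification" of the driver side is just looking at the hosts directly, along these lines:)

  # Rough sanity checks on a host: list the volumes Docker knows about and
  # check whether the GCE persistent disks are actually mounted.
  docker volume ls
  mount | grep sd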

More issues with deploying new services: containers are stuck activating for images that haven't been pulled before:

Activating (Expected state running but got error: Image [______] failed to pull: Pulling repository ______)

Tested this pretty extensively / pulling the images in other contexts works fine. Nothing wrong with the machines themselves (plenty of disk, etc.).

The current workaround is logging into the machine and manually pulling the image. The hosted registry is gcr.io.
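
(i.e., on each affected host, something like the following; the image path is intentionally a placeholder:)

  # Manual workaround: pull the image directly on the host so Rancher can
  # start the container. Replace the placeholders with the real image path.
  docker pull gcr.io/<project-id>/<image>:<tag>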

untoldone: Were you ever able to resolve the "Error response from daemon: Unable to remove filesystem for ... device or resource busy" errors?

I'm getting the same thing, but running CentOS 7 in VMs.

No, I am still seeing them.

Okay. In my case, I was able to remove the Dead containers with docker rm -f $CONTAINER, as I show at "Can't remove dead Rancher Infra containers due to Unable to remove filesystem for ... device or resource busy errors".
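
(Roughly this -- the general idea, not necessarily my exact command:)

  # Force-remove every container Docker reports as dead; narrow the filter
  # if you only want to target specific containers.
  docker rm -f $(docker ps -aq --filter status=dead)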

So, I fixed the problem in my environment.

Hopefully docker rm -f didn't leave any stale data around.

@Stefan_Lasiewski Thanks! That worked for me as well. Still have a few other issues from earlier in this thread, but this definitely resolved the floating containers / the errors related to disks.

For those who are curious, I've decided to reinstall Rancher 1.1.4 and set up all my old containers from scratch. I will likely have a test environment as well to test upgrades before upgrading production in the future. I'll also probably wait a few weeks after a version release to watch the forums + issues first. I will definitely start doing regular backups of the Rancher server DB... all stuff a good IT team should do, but hard to do as a small startup.
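
(By "backups" I just mean periodically dumping the cattle database out of the server container. Assuming the default embedded MySQL, something along these lines; the container name and credentials are placeholders for whatever your setup uses:)

  # Hypothetical backup of the Rancher server's cattle DB; container name and
  # credentials are placeholders.
  docker exec <rancher-server-container> \
    mysqldump -u <user> -p<password> cattle > rancher-cattle-backup.sql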

For the record, a few other issues I hit in addition to what is listed here:

  • Some containers are always stuck initializing -- even though the container logs show everything running properly, Rancher never seemed to complete setting up network access for the container?
  • I hit this bug: https://github.com/rancher/rancher/issues/6956 -- though the bug lists it as a new feature, this is a regression that prevents me from deploying new containers without a lot of manual work.
  • The issue @Stefan_Lasiewski mentioned seems to be an upstream Docker bug that recurs in my environment. That said, I didn't see this issue in 1.1.4 (maybe 1.1.4 force-removed on its own whereas this doesn't?).

I literally took the exact same steps. I couldn't get my test 1.2 env to run for more than 3 hours without the entire cluster / UI locking up from a scheduler bug. I am going to continue in production with 1.1.4 until things stabilize.

@bacheson We just pushed out rancher/server:v1.2.0-1, which fixed a major UI bug (https://github.com/rancher/rancher/issues/6995). We are looking to cut a v1.2.1 later this week to address the bigger bugs.