Hi!
We are evaluating Rancher and have added it to our existing deploy process. It is very useful for providing visibility into the health of our cloud environments. Unfortunately, the rancher/agent container seems to interact with Docker in some way that causes container removal to fail during deploys.
Our deploy consists of using Ansible to provision EC2 machines, making sure Docker etc. is installed, and then deploying/upgrading the containers we need. This works fine, but when we added the Rancher Agent (more or less as outlined in "Using Ansible with Docker to deploy a wordpress service on Rancher" on your site), deploys started failing with a Docker API error: Driver devicemapper failed to remove root filesystem b916c5f77...: Device is Busy
This happens when a container needs to be replaced with a newer version and removal of the old container fails. Stopping the agent makes deploys work as expected again.
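For reference, the replacement step that triggers the error looks roughly like the task below. This is a minimal sketch rather than our actual playbook; "our-app" is a placeholder name, and java:8u72-jre is just one of the images for which replacement fails:

- name: Ensure our application container runs the latest image
  become: yes
  docker:
    # "our-app" is a placeholder container name for illustration
    name: our-app
    image: java:8u72-jre
    pull: always
    # reloaded stops, removes and recreates the container when its image or config is out of date;
    # it is this removal step that fails with "Device is Busy" while the agent is running
    state: reloaded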
Some facts:
- Rancher Server 1.0.1
- Rancher Agent 1.0.1
- We created environments in Rancher, and Ansible interacts with Rancher using API keys for the particular environment
- Ansible 2.0
- Docker 1.9.1
We provision m4.large machines, and consistently get the issue on two different Amazon Linux versions and at least six different machine instances, so it's completely reproducible.
- Amazon Linux AMI 2015.03 (3.14.35-28.38.amzn1.x86_64)
- Amazon Linux AMI 2015.09 (4.1.17-22.30.amzn1.x86_64)
We start the Rancher Agent via this Ansible snippet, where registration_data is 'data[0]' from the reply of the our-rancher-server/v1/registrationtokens API.
- name: Ensure the Rancher Agent is started
  become: yes
  docker:
    # must not be named, as this unnamed container starts other containers, including one named rancher-agent, and then exits
    image: "{{ registration_data['image'] }}"
    privileged: yes
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
    command: "{{ registration_data['registrationUrl'] }}"
    state: started
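For completeness, registration_data is fetched with something like the tasks below. This is a sketch; our-rancher-server and the rancher_api_key/rancher_api_secret variables are placeholders for the environment's API key pair:

- name: Fetch the Rancher registration tokens for the environment
  uri:
    url: "https://our-rancher-server/v1/registrationtokens"
    method: GET
    user: "{{ rancher_api_key }}"         # placeholder: environment API access key
    password: "{{ rancher_api_secret }}"  # placeholder: environment API secret key
    force_basic_auth: yes
    return_content: yes
  register: registrationtokens

- name: Use the first registration token for the agent
  set_fact:
    registration_data: "{{ registrationtokens.json['data'][0] }}"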
I can provide you with a minimal repro Ansible playbook if that helps, but the issue likely has nothing to do with Ansible, as just ssh'ing to the machines and manually trying to remove a container gets us the Device is Busy error. It also has nothing to do with our particular containers, as we can reproduce the issue with any container, such as the public elasticsearch container. Nor does it have anything to do with setting up 'odd' container shares, as one of our failing containers is simply based on the java:8u72-jre image and does not use any volumes.
As far as we can tell, Rancher is working fine besides this, and we get useful information on the health and status of our hosts and containers. But for now we have been forced to include a step in the deploy to stop the rancher-agent container, then do our normal deploy, and finally restart the agent again. That effectively makes us blind for a few minutes during each deploy, and leaves us vulnerable to a blackout if the deploy exits halfway due to errors. Also, the stopped agent containers are not shown in the Rancher UI (I understand the catch-22 of getting info from an agent that is stopped, but the agent could potentially notify the server that it is being stopped, so the server could show that).
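The workaround currently looks roughly like this in the playbook (a sketch; rancher-agent is the name the long-running agent container gets by default):

- name: Stop the Rancher Agent before replacing containers
  become: yes
  command: docker stop rancher-agent

# ... our normal container deploy/upgrade tasks run here ...

- name: Start the Rancher Agent again after the deploy
  become: yes
  command: docker start rancher-agent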
I have noticed a similar issue already in the forum, but it offers no help and just points the finger at Docker. It also mentions aufs, whereas our error mentions devicemapper.
I've also been looking at corresponding Docker issues, and elsewhere, and this comment from a Docker contributor who fixed very similar issues seems relevant, as the agent does run privileged and the errors are very similar. But the fix he is referring to, 'unshare the mount namespace of the docker daemon' (docker/commit/6bb65864589), seems to already be present in the Amazon Linux startup scripts (/etc/rc0.d/K05docker).
Any advice?