Agent blocks manual container removal on AWS

Hi!

We are evaluating Rancher and have added it to our existing deploy process. It is very useful for providing visibility into the health of our cloud environments. Unfortunately, it seems that the rancher/agent container interacts with Docker in a way that causes container removals to fail during deploys.

Our deploy uses Ansible to provision EC2 machines, make sure Docker etc. is installed, and then deploy/upgrade the containers we need. This works fine, but when we added the Rancher Agent (more or less as outlined in “Using Ansible with Docker to deploy a wordpress service on Rancher” on your site), deploys started failing with:

Docker API Error: Driver devicemapper failed to remove root filesystem b916c5f77...: Device is Busy

This happens when a container needs to be replaced by a newer version and removal of the old one fails. Stopping the agent makes deploys work as expected again.

Some facts:

  • Rancher Server 1.0.1
  • Rancher Agent 1.0.1
  • Created environments in Rancher; Ansible interacts with Rancher using API keys for the particular environment
  • Ansible 2.0
  • Docker 1.9.1

We provision m4.large machines and consistently get the issue on two different Amazon Linux versions and at least six different instances, so it’s completely reproducible.

  • Amazon Linux AMI 2015.03 (3.14.35-28.38.amzn1.x86_64)
  • Amazon Linux AMI 2015.09 (4.1.17-22.30.amzn1.x86_64)

We start the Rancher Agent via this Ansible snippet, where registration_data is “data[0]” from the reply of the our-rancher-server/v1/registrationtokens API.

- name: Ensure the Rancher Agent is started
  become: yes
  docker:
    # must not be named, as this unnamed container starts other containers, including one named rancher-agent, and then exits
    image: "{{ registration_data['image'] }}"
    privileged: yes
    volumes: /var/run/docker.sock:/var/run/docker.sock
    command: "{{ registration_data['registrationUrl'] }}"
    state: started

I can provide you with a minimal repro Ansible playbook if that helps, but the issue likely has nothing to do with Ansible: just ssh’ing to the machines and manually trying to remove a container gets us the Device is Busy error. It also has nothing to do with our particular container, as we can reproduce the issue with any container, for example the public elasticsearch image. Nor does it have anything to do with setting up ‘odd’ container shares, as one of our failing containers is simply based on the java:8u72-jre image and does not use any volumes.
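
For illustration, this is roughly what the manual attempt looks like when ssh’ed into one of the hosts while the agent is running (the container name ‘searchengine’ and the truncated ID are just examples from our setup; the exact error wording may differ slightly between the CLI and the Docker API):

$ docker stop searchengine
$ docker rm searchengine
Error response from daemon: Driver devicemapper failed to remove root filesystem b916c5f77...: Device is Busy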

As far as we can tell, Rancher works fine apart from this, and we get useful information on the health and status of our hosts and containers. But for now we have been forced to add a step to the deploy that stops the rancher-agent container, then does the normal deploy, and finally restarts the agent (sketched below). This effectively makes us blind for a few minutes during each deploy, and leaves us vulnerable to a longer blackout if the deploy exits halfway due to errors. Also, the agent containers are not shown in the Rancher UI (I understand the catch-22 of getting info about the agent while it is stopped, but the agent could potentially notify the server that it is being stopped, so the server could show that).
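
In shell terms, the extra stop/restart step boils down to something like this on each host (a sketch; in practice we do it via Ansible, and we restart the agent by re-running the registration container rather than a plain docker start, which should be equivalent on an already-registered host):

$ docker stop rancher-agent      # from here on we are blind in Rancher
  ... run the normal deploy; container removals now succeed ...
$ docker start rancher-agent     # visibility restored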

I have noticed a similar issue already in the forum, but it offers no help and just points the finger at Docker. Also, it mentions aufs, whereas our error mentions devicemapper.

I’ve also been looking at corresponding Docker issues and elsewhere, and this comment from a Docker contributor who fixed very similar issues seems relevant, as the agent does run privileged and the errors are very similar. But the patch he is referring to, “unshare the mount namespace of the docker daemon” (docker/commit/6bb65864589), seems to already be present in the Amazon Linux startup scripts (/etc/rc0.d/K05docker).
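
For context, the unshare-based daemon startup that this fix introduced looks roughly like the following in a sysvinit-style docker script (paraphrased from memory, not quoted verbatim from the Amazon Linux script):

# start() excerpt, paraphrased: the daemon is launched in its own mount
# namespace so its devicemapper mounts don't leak into other processes
unshare -m -- $exec daemon $other_args >> $logfile 2>&1 &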

Any advice?


Hey @marhel, it would be good to share the Ansible playbooks you used. I’m not familiar with Ansible, but I don’t mind trying to troubleshoot this with you… FWIW, I went through a manual install using one of the AMIs you cited (specifically: amzn-ami-minimal-hvm-2015.09.2.x86_64-ebs [ami-14ae4e74]) and the registrationUrl provided by rancherServer/v1/registrationtokens… and didn’t experience any issues spinning containers up or down, or removing them.

Thanks for looking into the issue!

I’ve uploaded a playbook that reproduces our issue. Here is the information needed for it to run. I use ansible-playbook 2.0.2.

This is a self-contained example, but you must do some configuration to match your environment: set the external IP of your EC2 host in hosts.ini, put the Rancher server address and environment credentials in rancher-credentials.yml, and point ansible.cfg at the correct key file.

This playbook deploys an ElasticSearch container, with a parameter for the version to deploy. If a different version is specified on a subsequent run, the old version must be removed, which is the part that fails when the agent is running.

The playbook is set up to run the tasks in rancher-register.yml before and after deploying ElasticSearch. This part is parameterized so you can deactivate it from the command line.

With stop_rancher=true, it will first try to stop a container named rancher-agent if one is running. After deploying ElasticSearch, it will start an unnamed container based on rancher/agent, which should start (or create) the container named rancher-agent again. With stop_rancher=false, it will not touch the agent container.

Run ansible-playbook on deploy.yml to kick things off. I prepared this on a Macbook, but we also get the error when running from another AWS EC2 machine.

This is a sequence that works:

ansible-playbook -i hosts.ini deploy.yml --extra-vars "stop_rancher=true elastic_version=1.6" -vv
ansible-playbook -i hosts.ini deploy.yml --extra-vars "stop_rancher=true elastic_version=1.7" -vv

But this sequence gives us ‘Docker API Error: … Device is Busy’ errors.

ansible-playbook -i hosts.ini deploy.yml --extra-vars "stop_rancher=false elastic_version=1.6" -vv
ansible-playbook -i hosts.ini deploy.yml --extra-vars "stop_rancher=false elastic_version=1.7" -vv

You can adjust the verbosity with the number of -v flags (-v, -vv, -vvv, or remove them altogether).

I wasn’t allowed to upload the playbook files, so they’ll be inserted inline :frowning:

file: hosts.ini, put the external IP of your EC2 host here:

[aws_machines]
54.229.99.99

file: ansible.cfg should point to the .pem file used to create the ec2 host (for ssh access)

[defaults]
host_key_checking = False
private_key_file=./your-deploy-key.pem

file: rancher-credentials.yml, contains rancher server address + environment credentials (API Key)

rancher_server: "rancher.server.somewhere.com"
rancher_api_key: "123abc"
rancher_api_secret: "1234zxcv"

file: rancher-register.yml contains communication with rancher API and agent container

- name: Install httplib2 needed by the uri-module
  # dependency no longer needed for uri-module in ansible 2.1 (http://docs.ansible.com/ansible/uri_module.html)
  pip: name=httplib2 # executable="/usr/local/bin/pip"
  become: yes

- include_vars: "rancher-credentials.yml"
  no_log: True

- name: Return the registration token for Rancher environment
  uri:
    method: GET
    status_code: 200
    user: "{{ rancher_api_key }}"
    password: "{{ rancher_api_secret }}"
    url: "http://{{ rancher_server }}/v1/registrationtokens"
    return_content: yes
  register: rancher_registration
  when: rancher_api_key is defined

- name: Trigger creation of a new registration token for Rancher environment
  uri:
    method: POST
    status_code: 201
    user: "{{ rancher_api_key }}"
    password: "{{ rancher_api_secret }}"
    url: "http://{{ rancher_server }}/v1/registrationtokens"
    return_content: yes # returns links, not the token
  when: rancher_api_key is defined and rancher_registration.json['data'] is undefined

- name: Return the new registration token for Rancher environment
  uri:
    method: GET
    status_code: 200
    user: "{{ rancher_api_key }}"
    password: "{{ rancher_api_secret }}"
    url: "http://{{ rancher_server }}/v1/registrationtokens"
    return_content: yes
  register: new_rancher_registration
  when: rancher_api_key is defined and rancher_registration.json['data'] is undefined

- name: set registration_data from new or existing registration token
  set_fact:
    registration_data: "{{ (new_rancher_registration if not new_rancher_registration.skipped else rancher_registration).json['data'][0] }}"
  when: rancher_api_key is defined

- name: Fail if registration_data is empty
  fail: msg="Cannot register. The registrationtokens data node is empty! Possibly manually click 'Add Host' for this environment in Rancher once."
  when: rancher_api_key is defined and registration_data is undefined

# (to work around 'Driver devicemapper failed to remove root filesystem')
# we stop the agent before the deploy, and start it again afterwards
- name: Ensure the Rancher Agent is stopped
  become: yes
  docker:
    name: rancher-agent
    image: "{{ registration_data['image'] }}"
    privileged: yes
    volumes: /var/run/docker.sock:/var/run/docker.sock
    command: "{{ registration_data['registrationUrl'] }}"
    state: stopped
  when: rancher_api_key is defined and registration_data is defined and agent_state == 'stopped'

- name: Ensure the Rancher Agent is started
  become: yes
  docker:
    # must not be named, as this unnamed container starts other containers, including one named rancher-agent, and then exits
    image: "{{ registration_data['image'] }}"
    privileged: yes
    volumes: /var/run/docker.sock:/var/run/docker.sock
    command: "{{ registration_data['registrationUrl'] }}"
    state: started
  when: rancher_api_key is defined and registration_data is defined and agent_state == 'started'

file: deploy.yml contains the main playbook.

- name: environment-global deploy opening play
  hosts: "aws_machines"
  remote_user: ec2-user
  become: yes
  gather_facts: true
  vars:
    ip: "{{ ec2_private_ip_address }}"
  tasks:
    - include: rancher-register.yml agent_state=stopped
      when: stop_rancher == "true"

- name: search install play
  hosts: "aws_machines"
  remote_user: ec2-user
  become: yes
  gather_facts: false
  tasks:
    - name: elastic search container
      docker:
        name: searchengine
        image: "library/elasticsearch:{{ elastic_version }}"
        net: bridge
        log_driver: "json-file"
        log_opt:
          max-size: "10m"
          max-file: "3"
        env:
          SERVICE_9200_NAME: "searchengine"
        ports:
        - "9200"
        - "9300"
        state: reloaded

- name: environment-global deploy finale play
  hosts: "aws_machines"
  remote_user: ec2-user
  become: yes
  gather_facts: false
  tasks:
    - name: install docker-gc
      docker:
        name: docker-gc
        image: spotify/docker-gc
        volumes:
          - /var/run/docker.sock:/var/run/docker.sock
          - /etc:/etc
        state: restarted
    - include: rancher-register.yml agent_state=started
      when: stop_rancher == "true"

@marhel I wasn’t able to reproduce this. I did run into other issues, but managed to overcome them so far. Hosts were failing to register/re-register with Rancher:
ERROR: Please re-register this agent

I had to clean up hosts I was reusing as agents, using:
rm -rf /var/lib/rancher/state; docker rm -fv rancher-agent; docker rm -fv rancher-agent-state

I have questions around how you’re registering the hosts. As an Ansible neophyte, I set the registration data to:

registration_data: "{{ (new_rancher_registration if not new_rancher_registration.skipped else rancher_registration).json['data'][0]['command'] }}"

and ran that as a command. It wasn’t obvious to me how you extracted the registration command from the Rancher API and ran it.
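
That is, roughly something like this (a sketch of what I did; registration_data here holds the ‘command’ string from the API response, which already includes the full ‘sudo docker run …’ invocation):

- name: run the registration command returned by the Rancher API
  # the command string already starts with 'sudo docker run', so no extra become is needed
  shell: "{{ registration_data }}"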

I wonder if the issue you’re seeing is related to reusing hosts, and whether running the state cleanup in between helps your situation?

Thanks for taking the time to look further into this. I’m sorry it seems hard to describe in a way that makes it easier to reproduce; with our existing setup it’s easily reproduced.

I haven’t seen the re-register error you mention. Is that from the logs of the rancher-agent container?

We register the hosts in the task Ensure the Rancher Agent is started in the file rancher-register.yml shown above. We fetch the registrationtokens URL from the Rancher server, which among other things contains image (“rancher/agent:1.0.1” I think) and registrationUrl (a URL of the form http://ourserver.com/v1/scripts/ED79155DFB0354B3240D:1465966800000:I3Af6ehWaxqRfCLUlnOJI0lj8). We then simply ask Ansible to start an unnamed, privileged Docker container from that image, with the registrationUrl as a parameter. This container exits after ensuring that the rancher-agent and rancher-agent-state containers are started/present. We start this container regardless of whether this is the first registration or a subsequent run.

Also, we neither remove state nor the agent/state containers; we just stop rancher-agent during the deploy if it is running. We don’t want to clean up state, and ideally we want to keep the agent running at all times. Are you thinking that we might be running into this issue due to not cleaning up state properly?

I will have to do a few new tests where I deploy (without the start/stop workaround) to a set of hosts that Rancher has created (where we haven’t had a chance to mess up any state).


The reason we aren’t using the ‘command’ property (which contains a ready-made “docker run” command for launching the registration container) is that the command itself is simple to recreate using Ansible’s docker support. That keeps this part consistent with all the other Docker actions in Ansible, so we never need to drop down to executing docker commands directly.

However, now that I take a second look, I notice that the command in the API is the following:

sudo docker run -d --privileged -v /var/run/docker.sock:/var/run/docker.sock -v /var/lib/rancher:/var/lib/rancher rancher/agent:v1.0.1 http://ourserver.com/v1/scripts/ED79155DFB0354B3240D:1465984800000:l8yjtoUsfZ5yaGdVFBSEGDjac

However, our Ansible recreation of that command does not map /var/lib/rancher as a volume! I’ll re-test with that added to see if it makes any difference. If so, my mistake!
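
Concretely, the re-test just means switching the agent start task’s volumes to a list that also maps /var/lib/rancher, roughly like this (a sketch of the modified task from rancher-register.yml):

- name: Ensure the Rancher Agent is started
  become: yes
  docker:
    # still unnamed; the registration container starts rancher-agent and then exits
    image: "{{ registration_data['image'] }}"
    privileged: yes
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
      - /var/lib/rancher:/var/lib/rancher
    command: "{{ registration_data['registrationUrl'] }}"
    state: started
  when: rancher_api_key is defined and registration_data is defined and agent_state == 'started'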


It seems to make no difference whether I include /var/lib/rancher as a volume or not. Before adding that, I also tried removing /var/lib/rancher/state and the rancher-agent and rancher-agent-state containers. This has actually caused a different problem: now the Rancher server does not display any of the existing containers on that host, even tens of minutes later, and I’ve restarted the agent container as well. Only new containers show up.