Rancher upgrade takes down production hosts by creating thousands of inspect-host agent containers

Summary of the issue

During the Rancher upgrade from 1.0.2 to 1.1.0, the Rancher agent goes through different states (an inspect state and an upgrade state) in order to redeploy the new agent on the hosts. This agent upgrade process is not safe: it fails under certain conditions, and in our case it made many hosts unusable and took services down. We did not expect something this serious from a Rancher upgrade, especially from a stable release.

Issue details

The problem consisted of two linked issues.

  1. First, the Rancher agent could not stop and finish its job, and for this reason it kept starting new agent containers (with the command “/run.sh inspect-host”), which filled up all the space on the Docker storage metadata volume.

On some servers, such as dev-mil-03 and dev-mil-04, we discovered around 6,000 Rancher agent containers.

We can see that on swarm17-eeasites there are still 924 Rancher agent containers:

[root@swarm17-eeasites local]# docker ps -a | grep 'inspect-host' | wc -l
924

Looking into eionet-cph-13, I found 430 containers like the ones below. They cannot be removed. I think the reason it stopped after 3 hours was that the container pool device became full or corrupted.

9757f7e53f1f        rancher/agent:v1.0.2                      "/run.sh inspect-host"   13 hours ago        Exited (0) 13 hours ago                         ecstatic_noyce
f71d04397ece        rancher/agent:v1.0.2                      "/run.sh inspect-host"   16 hours ago        Exited (0) 16 hours ago                         big_kirch
efaf8a0c9409        rancher/agent:v1.0.2                      "/run.sh inspect-host"   16 hours ago        Exited (0) 16 hours ago                         condescending_l
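To gauge the scale of the leftovers, the same 'inspect-host' filter used above can also extract the container IDs for batch removal. A minimal sketch, run here against sample `docker ps -a` lines instead of a live daemon (the commented command at the end is the live-host equivalent):

```shell
# Sample lines standing in for live `docker ps -a` output on an affected host.
ps_output='9757f7e53f1f rancher/agent:v1.0.2 "/run.sh inspect-host" Exited
f71d04397ece rancher/agent:v1.0.2 "/run.sh inspect-host" Exited
0011aabbccdd nginx:latest "nginx -g daemon off;" Up'

# Count the leftover inspect-host agents and extract their IDs.
count=$(printf '%s\n' "$ps_output" | grep -c 'inspect-host')
ids=$(printf '%s\n' "$ps_output" | awk '/inspect-host/ {print $1}')
echo "$count"   # 2 in this sample; 924 on swarm17-eeasites

# On a live host the same pipeline feeds docker rm, batched so that one
# undeletable container does not abort the whole cleanup:
#   docker ps -a | awk '/inspect-host/ {print $1}' | xargs -r -n 50 docker rm
```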
  2. Second, Docker could not write to the metadata volume, yet somehow did not detect that the volume was full, and after stopping it remained in an unknown state.
    When trying to start again, Docker looked for files it expected to find on the metadata volume and, since they were missing, could not start.

The result of the above problems was corruption of the metadata volume.
In some cases the resolution was repairing the metadata; in others we had to scratch the Docker storage volumes together with the corresponding files (images and devicemapper data).

Rancher, Host and docker info

We ran upgrade from Rancher 1.0.2 to Rancher 1.1.0 (current).

Our hosts are CentOS 7, kernel 3.10.0-327.13.1.el7.x86_64, running on OpenStack.

Docker 1.10.3.

Below is the docker info output for one of the affected hosts:

[marinis@swarm17-eeasites deploy]$ docker info
Containers: 939
 Running: 8
 Paused: 0
 Stopped: 931
Images: 11
Server Version: 1.10.3
Storage Driver: devicemapper
 Pool Name: vg_docker-docker--pool
 Pool Blocksize: 524.3 kB
 Base Device Size: 10.74 GB
 Backing Filesystem: xfs
 Data file: 
 Metadata file: 
 Data Space Used: 16.2 GB
 Data Space Total: 33.97 GB
 Data Space Available: 17.78 GB
 Metadata Space Used: 40.19 MB
 Metadata Space Total: 75.5 MB
 Metadata Space Available: 35.3 MB
 Udev Sync Supported: true
 Deferred Removal Enabled: false
 Deferred Deletion Enabled: false
 Deferred Deleted Device Count: 0
 Library Version: 1.02.107-RHEL7 (2015-12-01)
Execution Driver: native-0.2
Logging Driver: json-file
Plugins:
 Volume: local
 Network: bridge null host
Kernel Version: 3.10.0-327.13.1.el7.x86_64
Operating System: CentOS Linux 7 (Core)
OSType: linux
Architecture: x86_64
CPUs: 4
Total Memory: 3.702 GiB
Name: swarm17-eeasites.eea.europa.eu
WARNING: bridge-nf-call-iptables is disabled
WARNING: bridge-nf-call-ip6tables is disabled
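A back-of-envelope check on the numbers above shows why thousands of agent containers exhaust the metadata volume. The per-container figure below is a rough estimate derived from this host's own counters, not a documented Docker constant:

```shell
# Metadata Space Used: 40.19 MB across 939 containers (plus 11 images),
# per the docker info output above.
used_kb=40190
containers=939
per_container_kb=$((used_kb / containers))   # roughly 42 kB of pool metadata each
echo "$per_container_kb"

# At that rate, the ~6000 inspect-host agents seen on dev-mil-03/04 would
# need roughly 6000 * 42 kB = ~250 MB of metadata, far beyond the 75.5 MB
# metadata volume, so the pool fills up and writes start failing.
```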

My questions to Rancher team

Dear Rancher team,

  • Have you seen similar issues reported by others? Are you aware of what could cause the Rancher agent to continuously create “run.sh inspect-host” containers? What could make this line of code https://github.com/rancher/rancher/blob/master/agent/run.sh#L144 be called continuously?
  • Is it possible to make the agent run.sh script a bit more robust, or at least make it fail gracefully? In some cases, when it cannot delete a Rancher container, it simply gets stuck and the user does not realize why certain hosts no longer work. In other cases it retries continuously.
  • Could these issues be avoided or mitigated if we had the Docker engine options “Deferred Removal Enabled” and “Deferred Deletion Enabled” set to true? Right now they are set to false on the affected hosts. Is this related to a known Docker engine issue?
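For reference, the deferred options mentioned in the last question are devicemapper storage options of the Docker daemon. On our CentOS 7 hosts, Docker 1.10.3 takes them as daemon startup flags, e.g. via /etc/sysconfig/docker. This is a sketch of what we believe the configuration would look like (the thin pool name is taken from the docker info output above); we have not yet verified it on the affected hosts:

```shell
# /etc/sysconfig/docker -- add the deferred devicemapper options
# (requires a kernel and libdevmapper with deferred removal support):
OPTIONS='--storage-driver devicemapper \
         --storage-opt dm.thinpooldev=/dev/mapper/vg_docker-docker--pool \
         --storage-opt dm.use_deferred_removal=true \
         --storage-opt dm.use_deferred_deletion=true'

# Then restart the daemon and confirm the flags took effect:
#   systemctl restart docker
#   docker info | grep Deferred
```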

We are now afraid of upgrading the Rancher server, knowing that it can make the connected hosts unstable.
Any suggestions or help is much appreciated!