Our operations team has investigated further and found issues with the rancher-agent related to SELinux, as I suspected before.
I asked our operations support for help because I could no longer remove the rancher-agent container to re-register the host. This is what I got:
[marinis@swarm07-eeasites ~]$ docker rm -fv rancher-agent-broken
Error response from daemon: Cannot destroy container rancher-agent-broken: Could not kill running container, cannot remove - active container for 111da76b54c39d1f08951b5c2ee88b69708701f6b0ca045393374867d752faba does not exist
Error: failed to remove containers: [rancher-agent-broken]
Below I quote what our operations team found during their investigation:
The process in question runs in a container (rancher-agent) but without a container context. In fact it runs completely unlabeled:
system_u:object_r:unlabeled_t:s0 root 19387 71.8 0.0 18636 2200 ? Ss Dec09 7303:44 /bin/bash /run.sh run
Other rancher processes run in docker privileged context. For example:
system_u:system_r:docker_t:s0 root 29308 0.0 0.0 111988 2680 ? Sl 00:13 0:03 /var/lib/cattle/bin/rancher-dns -log /var/log/rancher-dns.log -answers /var/lib/cattle/etc/cattle/dns/answers.json
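For anyone who wants to check their own host for the same symptom, here is a minimal sketch. It assumes a procps `ps` built with SELinux support (the `label` output field); the function name `unlabeled_procs` is made up for illustration and is not part of Rancher or Docker.

```shell
#!/bin/sh
# Print every input line whose SELinux type is unlabeled_t.
# Input: lines whose first field is an SELinux label of the form
# user:role:type:level, e.g. the output of `ps -eo label,pid,args`.
unlabeled_procs() {
    # The type is the third colon-separated part of the label.
    awk '{ split($1, ctx, ":"); if (ctx[3] == "unlabeled_t") print }'
}

# On an SELinux-enabled host (assumption: ps supports the "label" field):
#   ps -eo label,pid,args | unlabeled_procs
```

Any line this prints is a process in the same state as the `/run.sh run` process quoted above.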
Although the server still runs in permissive mode, filtering and reporting all these would-be SELinux denials might prove overwhelming (or at least taxing), thus contributing to the high CPU load.
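A rough way to gauge how many would-be denials are being logged is to count AVC records per command. This is only a sketch: it assumes the default auditd log location `/var/log/audit/audit.log` and the usual `avc: denied ... comm="..."` record layout, and `denials_by_comm` is a hypothetical helper name. It shows volume only and says nothing definite about CPU cost.

```shell
#!/bin/sh
# Count AVC denial records per command name, most frequent first.
# Input: audit log lines on stdin.
denials_by_comm() {
    grep 'avc:.*denied' |
        # Keep only the value of the comm="..." field.
        sed -n 's/.*comm="\([^"]*\)".*/\1/p' |
        sort | uniq -c | sort -rn
}

# On the host (assumption: auditd writes to its default path):
#   denials_by_comm < /var/log/audit/audit.log
```

If the audit tools are installed, `ausearch -m avc` gives a more detailed view of the same records.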
The production containers themselves run in the correct context (svirt_lxc_net_t), but some still show SELinux-related problems. We should concentrate on fixing those and switch to enforcing as soon as possible.
On this production server, may I suggest turning Rancher off until everything else runs fine with SELinux enforcing, then enforcing SELinux, and only then trying Rancher again?
…
I’ve killed the ghost rancher-agent process (the renamed one). It still shows up in the process list because it cannot be reaped by its adoptive parent, but it no longer affects the server (in Unix terms it is a zombie).
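To confirm that the leftover entry really is a zombie, one option is to filter the process list by state. A small sketch, assuming procps `ps` output with the `stat` field in the third column; `zombies` is an illustrative name:

```shell
#!/bin/sh
# Print "PID PPID COMMAND" for every process in zombie state.
# Input: output of `ps -eo pid,ppid,stat,comm` (header line is ignored
# because its third field, "STAT", does not start with Z).
zombies() {
    awk '$3 ~ /^Z/ { print $1, $2, $4 }'
}

# On the host:
#   ps -eo pid,ppid,stat,comm | zombies
```

The PPID column shows which parent would have to reap the process for the entry to disappear.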
Our custom Host setup
CentOS 7 kernel 3.10.0-229.14.1.el7.x86_64
$ docker version
Client version: 1.7.1
Client API version: 1.19
Package Version (client): docker-1.7.1-115.el7.x86_64
Go version (client): go1.4.2
Git commit (client): 446ad9b/1.7.1
OS/Arch (client): linux/amd64
Server version: 1.7.1
Server API version: 1.19
Package Version (server): docker-1.7.1-115.el7.x86_64
Go version (server): go1.4.2
Git commit (server): 446ad9b/1.7.1
OS/Arch (server): linux/amd64
$ sestatus
SELinux status: enabled
SELinuxfs mount: /sys/fs/selinux
SELinux root directory: /etc/selinux
Loaded policy name: targeted
Current mode: enforcing
Mode from config file: enforcing
Policy MLS status: enabled
Policy deny_unknown status: allowed
Max kernel policy version: 28
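Since the mode was switched at runtime, it may be worth checking that the running mode and the configured mode agree. The sketch below only parses `sestatus` output like the one above; `selinux_mode_check` is a hypothetical helper name, not an existing tool.

```shell
#!/bin/sh
# Compare the runtime SELinux mode with the mode from the config file.
# Input: `sestatus` output on stdin.
selinux_mode_check() {
    # Split each line on the colon and any spaces that follow it.
    awk -F': *' '
        $1 == "Current mode"          { cur = $2 }
        $1 == "Mode from config file" { cfg = $2 }
        END {
            if (cur == cfg) print "match: " cur
            else            print "mismatch: current=" cur " config=" cfg
        }'
}

# On the host:
#   sestatus | selinux_mode_check
```

A mismatch would typically mean the mode was changed with `setenforce` without editing the config file, so it would revert on the next reboot.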
@ibuildthecloud are there any specific SELinux settings that need to be applied on the host before running the rancher-agent? We switched SELinux from enforcing to permissive while the rancher-agent was running, and we are now back to enforcing. Could this have caused the issue? Is this scenario being tested? We need enforcing mode on our production servers.