Rancher agent loses connection to server, active/reconnecting, SELinux issues

Rancher v0.49.0
Cattle v0.119.0
User Interface v0.70.0
Rancher Compose v0.6.0

We have encountered an issue with a couple of hosts that continuously flip between "Active" and, after a few seconds, "Reconnecting": "Active", "Reconnecting", and so on.

I looked at /var/log/rancher/agent.log and found the following connection error, which shows up all the time:

Error: [('SSL routines', 'SSL3_WRITE_PENDING', 'bad write retry')]

See the full traceback in the gist.

Other hosts on the same network as the one above (same Docker engine version 1.7.1 and VM image, CentOS 7, kernel 3.10.0-229.14.1.el7.x86_64) do not have this issue.

We tried to unregister and re-register the host by following the troubleshooting instructions, with both a public and a private IP for CATTLE_AGENT_IP… still the same issue.
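For reference, the re-registration we attempted looked roughly like this (a sketch only: the server address, registration token, and agent version tag are placeholders, not our real values):

```shell
# Remove the old agent containers first (names as shown by `docker ps -a`)
sudo docker rm -fv rancher-agent rancher-agent-state

# Re-register the host, pinning the IP the Rancher server should use.
# CATTLE_AGENT_IP can be a public or a private IP; <HOST_IP>,
# <RANCHER_SERVER>, <TOKEN>, and <VERSION> are placeholders for
# environment-specific values.
sudo docker run -d --privileged \
  -e CATTLE_AGENT_IP=<HOST_IP> \
  -v /var/run/docker.sock:/var/run/docker.sock \
  rancher/agent:<VERSION> \
  http://<RANCHER_SERVER>:8080/v1/scripts/<TOKEN>
```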

Any ideas what it could be? Do you need more info from us?

Could it be that these hosts have SELinux set to "enforcing" and the Rancher agent cannot write to the filesystem?

Our operations team has done further investigation and found issues with the rancher-agent regarding SELinux, as I suspected.

I asked our operations support because I could no longer remove the rancher-agent container to re-register the host. This is what I got:

[marinis@swarm07-eeasites ~]$ docker rm -fv rancher-agent-broken
Error response from daemon: Cannot destroy container rancher-agent-broken: Could not kill running container, cannot remove - active container for 111da76b54c39d1f08951b5c2ee88b69708701f6b0ca045393374867d752faba does not exist
Error: failed to remove containers: [rancher-agent-broken]
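A workaround that sometimes clears containers stuck in this state on Docker 1.7 (a general suggestion, not Rancher-specific) is restarting the Docker daemon before retrying the removal:

```shell
# Restart the daemon so it re-reads container state, then retry
sudo systemctl restart docker
sudo docker rm -fv rancher-agent-broken
```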

Below I quote what our operations team found out after their investigation:

The process in question runs in a container (rancher-agent) but without a container context. In fact it runs completely unlabeled:

system_u:object_r:unlabeled_t:s0 root    19387 71.8  0.0  18636  2200 ?        Ss   Dec09 7303:44 /bin/bash /run.sh run
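That label can be checked directly on the host; a quick sketch (PID 19387 is the one from the listing above):

```shell
# Show the SELinux label of each process and filter for the agent's
# entrypoint; an unconfined agent appears as unlabeled_t instead of a
# container domain.
ps -eo label,pid,user,args | grep run.sh

# Or query a single PID:
ps -o label= -p 19387
```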

Other rancher processes run in docker privileged context. For example:

system_u:system_r:docker_t:s0   root     29308  0.0  0.0 111988  2680 ?        Sl   00:13   0:03 /var/lib/cattle/bin/rancher-dns -log /var/log/rancher-dns.log -answers /var/lib/cattle/etc/cattle/dns/answers.json

Although the server still runs in permissive mode, it might prove overwhelming (or at least taxing) to filter and report all these would-be SELinux denials, thus contributing to the high CPU load.

The production containers themselves run in the correct context (svirt_lxc_net_t), but some still show SELinux-related problems. We should concentrate on fixing those and switch to enforcing as soon as possible.
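To see which SELinux domain each running container actually gets, something like this works on the host (a sketch; requires root and a running Docker daemon):

```shell
# Print the SELinux label and command line of every container's main
# process; correctly confined ones run as svirt_lxc_net_t here.
for pid in $(sudo docker ps -q | xargs sudo docker inspect --format '{{ .State.Pid }}'); do
  ps -o label=,args= -p "$pid"
done
```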

On this production server, may I suggest turning Rancher off until everything else runs fine with SELinux enforcing, then enforcing SELinux, and eventually trying Rancher again?

The ghost rancher-agent process, the renamed one, I have killed. It still shows up, as it cannot be reaped by its adoptive parent, but it no longer affects the server (in Unix terms, it is a zombie).

Our custom host setup
CentOS 7 kernel 3.10.0-229.14.1.el7.x86_64

$ docker version
Client version: 1.7.1
Client API version: 1.19
Package Version (client): docker-1.7.1-115.el7.x86_64
Go version (client): go1.4.2
Git commit (client): 446ad9b/1.7.1
OS/Arch (client): linux/amd64
Server version: 1.7.1
Server API version: 1.19
Package Version (server): docker-1.7.1-115.el7.x86_64
Go version (server): go1.4.2
Git commit (server): 446ad9b/1.7.1
OS/Arch (server): linux/amd64

$ sestatus
SELinux status: enabled
SELinuxfs mount: /sys/fs/selinux
SELinux root directory: /etc/selinux
Loaded policy name: targeted
Current mode: enforcing
Mode from config file: enforcing
Policy MLS status: enabled
Policy deny_unknown status: allowed
Max kernel policy version: 28

@ibuildthecloud are there any specific settings to be done on the host regarding SELinux before running the rancher-agent? We switched SELinux from enforcing to permissive while the rancher-agent was running, and now we are back to enforcing. Maybe this created the issue? Is this being tested? We need enforcing on our production servers.
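For the record, the mode switching was done at runtime with setenforce (our understanding: this changes the running mode only, and processes keep the label they were started with, so an agent started under one mode is not relabeled when switching back):

```shell
getenforce           # show the current mode (Enforcing/Permissive)
sudo setenforce 0    # switch to permissive at runtime
sudo setenforce 1    # switch back to enforcing
# The mode at boot comes from SELINUX=... in /etc/selinux/config;
# setenforce does not change that file.
```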

The issues are gone. It seems upgrading to Rancher v0.51.0 solved the networking issues we had with those hosts.