Rancher agent loses connection to server, active/reconnecting, SELinux issues

Rancher v0.49.0
Cattle v0.119.0
User Interface v0.70.0
Rancher Compose v0.6.0

We have encountered an issue with a couple of hosts that continuously flip between "Active" and, after a few seconds, "Reconnecting": "Active", "Reconnecting", and so on.

I looked at /var/log/rancher/agent.log and found the following connection error, which shows up all the time:

Error: [('SSL routines', 'SSL3_WRITE_PENDING', 'bad write retry')]

See the full traceback in the gist.

Other hosts on the same network as the one above (same Docker engine version 1.7.1 and VM image, CentOS 7, kernel 3.10.0-229.14.1.el7.x86_64) do not have this issue.

We tried to unregister and re-register the host by following the troubleshooting instructions, with both a public and a private IP for CATTLE_AGENT_IP… still the same issue.
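For reference, the re-registration we attempted looked roughly like this (a sketch only: the server address, registration token, and agent version tag are placeholders, not our real values):

```shell
# Remove the old agent containers first (names as shown by `docker ps -a`)
sudo docker rm -fv rancher-agent rancher-agent-state

# Re-register the host, pinning the IP the Rancher server should use.
# CATTLE_AGENT_IP can be a public or a private IP; <HOST_IP>,
# <RANCHER_SERVER>, <TOKEN>, and <VERSION> are placeholders for
# environment-specific values.
sudo docker run -d --privileged \
  -e CATTLE_AGENT_IP=<HOST_IP> \
  -v /var/run/docker.sock:/var/run/docker.sock \
  rancher/agent:<VERSION> \
  http://<RANCHER_SERVER>:8080/v1/scripts/<TOKEN>
```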

Any ideas what it could be? Do you need more info from us?

Could it be that these hosts have SELinux set to "enforcing" and the Rancher agent cannot write to the filesystem?

Our operations team has done further investigation and found issues with the rancher-agent regarding SELinux, as I suspected.

I asked our operations support because I could no longer remove the rancher-agent container to re-register the host. This is what I got:

[marinis@swarm07-eeasites ~]$ docker rm -fv rancher-agent-broken
Error response from daemon: Cannot destroy container rancher-agent-broken: Could not kill running container, cannot remove - active container for 111da76b54c39d1f08951b5c2ee88b69708701f6b0ca045393374867d752faba does not exist
Error: failed to remove containers: [rancher-agent-broken]
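A workaround that sometimes clears containers stuck in this state on Docker 1.7 (a general suggestion, not Rancher-specific) is restarting the Docker daemon before retrying the removal:

```shell
# Restart the daemon so it re-reads container state, then retry
sudo systemctl restart docker
sudo docker rm -fv rancher-agent-broken
```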

Below I quote what our operations team found out after their investigation:

The process in question runs in a container (rancher-agent) but without a container context. In fact it runs completely unlabeled:

system_u:object_r:unlabeled_t:s0 root    19387 71.8  0.0  18636  2200 ?        Ss   Dec09 7303:44 /bin/bash /run.sh run
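That label can be checked directly on the host; a quick sketch (PID 19387 is the one from the listing above):

```shell
# Show the SELinux label of each process and filter for the agent's
# entrypoint; an unconfined agent appears as unlabeled_t instead of a
# container domain.
ps -eo label,pid,user,args | grep run.sh

# Or query a single PID:
ps -o label= -p 19387
```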

Other rancher processes run in docker privileged context. For example:

system_u:system_r:docker_t:s0   root     29308  0.0  0.0 111988  2680 ?        Sl   00:13   0:03 /var/lib/cattle/bin/rancher-dns -log /var/log/rancher-dns.log -answers /var/lib/cattle/etc/cattle/dns/answers.json

Although the server still runs in permissive mode, it might prove overwhelming (or at least taxing) to filter and report all these would-be SELinux denials, thus contributing to the high CPU load.

The production containers themselves run in the correct context (svirt_lxc_net_t), but some still show SELinux-related problems. We should concentrate on fixing those and switch to enforcing as soon as possible.
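To see which SELinux domain each running container actually gets, something like this works on the host (a sketch; requires root and a running Docker daemon):

```shell
# Print the SELinux label and command line of every container's main
# process; correctly confined ones run as svirt_lxc_net_t here.
for pid in $(sudo docker ps -q | xargs sudo docker inspect --format '{{ .State.Pid }}'); do
  ps -o label=,args= -p "$pid"
done
```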

On this production server, may I suggest turning Rancher off until everything else runs fine with SELinux enforcing, then enforcing SELinux, and eventually trying Rancher again?

The ghost rancher-agent process, the renamed one, I have killed. It still shows up, as it cannot be reaped by its adoptive parent, but it no longer affects the server (in Unix terms, it is a zombie).

Our custom host setup
CentOS 7 kernel 3.10.0-229.14.1.el7.x86_64

$ docker version
Client version: 1.7.1
Client API version: 1.19
Package Version (client): docker-1.7.1-115.el7.x86_64
Go version (client): go1.4.2
Git commit (client): 446ad9b/1.7.1
OS/Arch (client): linux/amd64
Server version: 1.7.1
Server API version: 1.19
Package Version (server): docker-1.7.1-115.el7.x86_64
Go version (server): go1.4.2
Git commit (server): 446ad9b/1.7.1
OS/Arch (server): linux/amd64

$ sestatus
SELinux status: enabled
SELinuxfs mount: /sys/fs/selinux
SELinux root directory: /etc/selinux
Loaded policy name: targeted
Current mode: enforcing
Mode from config file: enforcing
Policy MLS status: enabled
Policy deny_unknown status: allowed
Max kernel policy version: 28

@ibuildthecloud are there any specific settings to be done on the host regarding SELinux before running the rancher-agent? We switched SELinux from enforcing to permissive while the rancher-agent was running, and now we are back to enforcing. Maybe this created the issue? Is this being tested? We need enforcing on our production servers.
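For the record, the mode switching was done at runtime with setenforce (our understanding: this changes the running mode only, and processes keep the label they were started with, so an agent started under one mode is not relabeled when switching back):

```shell
getenforce           # show the current mode (Enforcing/Permissive)
sudo setenforce 0    # switch to permissive at runtime
sudo setenforce 1    # switch back to enforcing
# The mode at boot comes from SELINUX=... in /etc/selinux/config;
# setenforce does not change that file.
```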

The issues are gone. It seems upgrading to Rancher v0.51.0 solved the networking issues we had with those hosts.