Agent reconnecting state after Rancher server 1.1.0 upgrade

Hi,

I upgraded Rancher server to the latest version, 1.1.0, and logged in successfully. But all of my existing agents were in a “Reconnecting” state. I tried re-adding the existing host, which should upgrade the agent to the latest version, but it is still in the Reconnecting state. Finally, I reverted my server to the previous version (1.0.1) and everything seems to be fine. How do I get it to upgrade now? Do we need to upgrade the agents manually every time? What are the steps to do that?

Thanks
Gokul

Same issue here.

We upgraded from v1.0.2 to v1.1.0 and some hosts lost their connection to the Rancher server and are now in a “reconnecting” state.

I went in to see if there were networking issues. It is not a networking issue: I can ping the rancher-server both from the host itself and from inside the rancher-agent container.
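For reference, the checks were roughly these (the server hostname below is a placeholder for our real one):

$ ping -c 3 rancher-server.example.org                                 # from the host itself
$ docker exec -it rancher-agent ping -c 3 rancher-server.example.org   # from inside the agent container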

$ docker exec -it rancher-agent bash
$ env

Note the environment variables. I ran the following command to see if the agent can connect to the Rancher server API.

$ curl -i -u "${CATTLE_ACCESS_KEY}:${CATTLE_SECRET_KEY}" ${CATTLE_URL}

And I got a 200 OK from the Rancher server with a proper JSON response. This means the agent can connect to the Rancher server.

However, I found what the issue is in my case, and it could be the same issue for you, @garjunan. The rancher-agent is still at v1.0.1 and did not upgrade automatically to v1.0.2; I can see that the temporary rancher-agent-upgrade container is still running in the list:

root@prod-mil-06:/# docker ps | grep rancher
918d38a8a212        rancher/agent:v1.0.2            "/run.sh upgrade"      2 days ago          Up 2 days                                                          rancher-agent-upgrade
521b1ecd3671        rancher/agent-instance:v0.8.1   "/etc/init.d/agent-i   7 weeks ago         Up 7 weeks          0.0.0.0:500->500/udp, 0.0.0.0:4500->4500/udp   44e673fc-bca6-496d-ae03-7e04bfa6dcf0
be8d4b365dd2        rancher/agent:v1.0.1            "/run.sh run"          9 weeks ago         Up 9 weeks                                                         rancher-agent

The rancher-agent logs show that it could not delete the container from the filesystem:

root@prod-mil-06:/# docker logs --tail=200 rancher-agent
...
INFO: Upgrading to image rancher/agent:v1.0.2
Error response from daemon: Unable to remove filesystem for 5856066ef58ab70a7f07c19e69b8816026da848397e46c34b0f6089d962c1414: remove /var/lib/docker/containers/5856066ef58ab70a7f07c19e69b8816026da848397e46c34b0f6089d962c1414/shm: device or resource busy
time="2016-07-01T19:24:54Z" level=fatal msg="Error: failed to remove one or more containers"

Look at the Error response above: this is why the host is in reconnecting. It has not been able to upgrade the rancher-agent to v1.0.2 because the rancher-agent was not able to delete the previous containers.

I am investigating what to do in these cases. Should I force-delete the container and re-run the rancher-agent? Or should I restart the Rancher server? Should I restart the Docker engine on the host? What are the steps to solve this in the cleanest way?

We are on OpenStack; hosts are CentOS 7 (Linux prod-mil-06.pecs.eea 3.10.0-327.13.1.el7.x86_64 #1 SMP Thu Mar 31 16:04:38 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux).

Our docker info:

root@prod-mil-06:/# docker info
Containers: 14
Images: 12
Storage Driver: devicemapper
 Pool Name: vg_docker-docker--pool
 Pool Blocksize: 524.3 kB
 Base Device Size: 10.74 GB
 Backing Filesystem: xfs
 Data file:
 Metadata file:
 Data Space Used: 9.787 GB
 Data Space Total: 33.97 GB
 Data Space Available: 24.19 GB
 Metadata Space Used: 3.105 MB
 Metadata Space Total: 37.75 MB
 Metadata Space Available: 34.64 MB
 Udev Sync Supported: true
 Deferred Removal Enabled: true
 Deferred Deletion Enabled: false
 Deferred Deleted Device Count: 0
 Library Version: 1.02.107-RHEL7 (2015-12-01)
Execution Driver: native-0.2
Kernel Version: 3.10.0-327.13.1.el7.x86_64
Operating System: CentOS Linux 7 (Core)
CPUs: 4
Total Memory: 7.64 GiB
Name: prod-mil-06.pecs.eea
ID: YF4E:KR4V:SAIO:PH46:J7UO:Q23O:OE7O:2AYA:N7TP:LS36:DKPV:KNSP
Http Proxy:
Https Proxy:
No Proxy:

Hi Team,

Can anyone suggest the right way to upgrade the Rancher agent ASAP?

Thanks
Gokul

If the upgrade failed, can you re-run the rancher/agent command on the same host? That should fix your issue. Please let me know if it doesn’t!

How do I get the same rancher-agent command? Or should I get a new one from the Rancher Hosts UI?

I also noticed that the previous rancher-agent cannot be deleted (see the commands below). Even when I do a docker rm -f, docker ps still shows the rancher-agent in the list, and it cannot be destroyed:

[marinis@prod-mil-06 ~]$ docker rm -f rancher-agent
rancher-agent
[marinis@prod-mil-06 ~]$ docker ps | grep rancher
918d38a8a212        rancher/agent:v1.0.2            "/run.sh upgrade"        5 days ago          Up 5 days                                                          rancher-agent-upgrade
521b1ecd3671        rancher/agent-instance:v0.8.1   "/etc/init.d/agent-in"   8 weeks ago         Up 8 weeks          0.0.0.0:500->500/udp, 0.0.0.0:4500->4500/udp   44e673fc-bca6-496d-ae03-7e04bfa6dcf0
be8d4b365dd2        rancher/agent:v1.0.1            "/run.sh run"            9 weeks ago         Up 9 weeks                                                         rancher-agent
[marinis@prod-mil-06 ~]$ docker kill rancher-agent
Failed to kill container (rancher-agent): Error response from daemon: Cannot kill container rancher-agent: [2] Container does not exist: container destroyed

I stopped and restarted rancher-agent-upgrade and, of course, it goes into an infinite loop trying to delete the rancher-agent, so docker logs rancher-agent-upgrade shows an endless stream of

INFO: Deleting container rancher-agent

It is stuck in the while loop at https://github.com/rancher/rancher/blob/master/agent/run.sh#L216
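That loop does roughly the following (my paraphrase, not the actual run.sh source): it keeps trying to remove the old agent container until it is gone, which is why the log repeats forever when the container can never actually be deleted.

# paraphrase of the upgrade loop, not the actual run.sh code
while docker inspect rancher-agent >/dev/null 2>&1; do
    echo "INFO: Deleting container rancher-agent"
    docker rm -f rancher-agent >/dev/null 2>&1
done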

We finally managed to delete the rancher-agent container that was in a weird limbo state, along with all the other Rancher containers. It is still not clear what made rancher-agent-upgrade get stuck; it must be a docker-engine issue with deleting containers.

I re-ran the rancher/agent command (taken from the Rancher Hosts UI) and now the host is connected again.
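For anyone else hitting this, the kind of cleanup involved looks roughly like the following. This is a sketch rather than the exact commands we ran, and restarting the Docker engine may or may not be needed on your host:

$ docker rm -f rancher-agent-upgrade rancher-agent   # force-remove the stuck agent containers
$ sudo systemctl restart docker                      # if removal keeps failing with "device or resource busy", restarting the engine can release it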

Hi,

I’m about to do an upgrade and was interested to see your post. How did you manage to delete the rancher-agent that was stuck in the limbo state in the end? Has everything been OK since?

Thanks,
John

I have this issue too, when I upgraded from Rancher server 1.0.2 to 1.1.1.

On CentOS 7.

I think I will stay on Rancher Server v1.0.2 until this bug gets fixed.

@xlight It sounds like @demarant hit issues when docker-engine couldn’t delete containers. There is no specific Rancher bug to fix for this, as it’s a Docker issue.

After your upgrade, if you have any agents that are stuck in reconnecting, all you need to do is re-run the host registration command (found in the Infrastructure -> Add Hosts -> Custom section).

Once the agent is reconnected to rancher/server, everything should behave normally, as your host will have connected back to rancher/server and synced with the DB.
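For reference, the command on that page looks something like the following; the server URL, agent version, and registration token are placeholders that will differ in your setup:

sudo docker run -d --privileged -v /var/run/docker.sock:/var/run/docker.sock rancher/agent:v1.0.2 http://your-rancher-server:8080/v1/scripts/REGISTRATION_TOKEN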

I just ran into this issue as well. Relevant info:

OS: CoreOS stable
Rancher server: 1.0.1 -> 1.1.1
Docker: 1.10.3

Manually re-running the add custom host script doesn’t seem like a great solution in environments with lots of servers.

Is there any other way to upgrade the Rancher agents to the correct version automatically?
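In the meantime, the only workaround I can think of is to script it over SSH, roughly like this (the hosts.txt file, SSH access, and the registration command are all assumptions on my side):

# rough sketch: re-register every host listed in hosts.txt
# REG_CMD is whatever Infrastructure -> Add Hosts -> Custom shows for your setup
REG_CMD='sudo docker run -d --privileged -v /var/run/docker.sock:/var/run/docker.sock rancher/agent:v1.0.2 http://rancher-server:8080/v1/scripts/REGISTRATION_TOKEN'
while read -r host; do
  ssh "$host" "sudo docker rm -f rancher-agent rancher-agent-upgrade 2>/dev/null; $REG_CMD"
done < hosts.txt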

Here’s what we’re seeing with CoreOS. CoreOS auto-upgrades and reboots, and after the reboot Docker isn’t running anymore. For your hosts in the reconnecting state, you can log in and run ps -ef | grep docker and you’ll see Docker isn’t running. It won’t start up until you try to use it once, so running docker ps will launch Docker. This is because of socket activation in systemd.

The solution is to create a systemd unit file that pings Docker on boot, or to add docker ps to /etc/rc.local.
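A minimal sketch of such a unit, assuming the docker binary lives at /usr/bin/docker and using a name I made up (docker-poke.service):

[Unit]
Description=Run docker ps once on boot so socket activation starts the Docker daemon
After=docker.socket
Requires=docker.socket

[Service]
Type=oneshot
ExecStart=/usr/bin/docker ps

[Install]
WantedBy=multi-user.target

Drop it in /etc/systemd/system/docker-poke.service and run systemctl enable docker-poke.service; adjust the name and path to taste.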