I upgraded Rancher server to the latest version, 1.1.0, and logged in successfully, but all of my existing agents were in “Reconnecting” state. I tried re-adding the existing host, which should upgrade the agent to the latest version, but it stayed in Reconnecting. Finally I reverted the server to the previous version (1.0.1) and everything seems fine again. How do I get the upgrade to work now? Do we need to upgrade the agents manually every time, and if so, what are the steps?
We upgraded from v1.0.2 to v1.1.0 and some hosts lost the connection to the Rancher server; they are now in “Reconnecting” state.
I went in to check whether it was a networking issue. It is not: I can ping the rancher-server both from the host itself and from inside the rancher-agent container.
$ docker exec -it rancher-agent bash
$ env
Note the environment variables. I then queried the Rancher server API from inside the agent container to check whether it could connect.
I got a 200 OK from the Rancher server with a proper JSON response, which means the agent can reach the Rancher server API.
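For reference, the check looks roughly like this (a sketch, assuming the CATTLE_URL, CATTLE_ACCESS_KEY and CATTLE_SECRET_KEY variables show up in the env output above; run it inside the rancher-agent container):

# print the HTTP status returned by the Rancher API root
curl -s -o /dev/null -w "%{http_code}\n" -u "$CATTLE_ACCESS_KEY:$CATTLE_SECRET_KEY" "$CATTLE_URL"

A 200 printed here means the agent container can reach the server API over the network and its credentials are still accepted.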
However, I found what the issue is in my case, and it could be the same issue for you, @garjunan. The rancher-agent is still at v1.0.1 and was not upgraded automatically to v1.0.2; I can see there is still a temporary rancher-agent-upgrade container running in the list:
root@prod-mil-06:/# docker ps | grep rancher
918d38a8a212 rancher/agent:v1.0.2 "/run.sh upgrade" 2 days ago Up 2 days rancher-agent-upgrade
521b1ecd3671 rancher/agent-instance:v0.8.1 "/etc/init.d/agent-i 7 weeks ago Up 7 weeks 0.0.0.0:500->500/udp, 0.0.0.0:4500->4500/udp 44e673fc-bca6-496d-ae03-7e04bfa6dcf0
be8d4b365dd2 rancher/agent:v1.0.1 "/run.sh run" 9 weeks ago Up 9 weeks rancher-agent
The rancher-agent logs show that it could not delete the container from the filesystem:
root@prod-mil-06:/# docker logs --tail=200 rancher-agent
...
INFO: Upgrading to image rancher/agent:v1.0.2
Error response from daemon: Unable to remove filesystem for 5856066ef58ab70a7f07c19e69b8816026da848397e46c34b0f6089d962c1414: remove /var/lib/docker/containers/5856066ef58ab70a7f07c19e69b8816026da848397e46c34b0f6089d962c1414/shm: device or resource busy
time="2016-07-01T19:24:54Z" level=fatal msg="Error: failed to remove one or more containers"
Look at the error response above: this is why the host is stuck in Reconnecting. It has not been able to upgrade the rancher-agent to v1.0.2 because the old rancher-agent containers could not be removed.
I am investigating what to do in these cases. Should I force-delete the container and re-run the rancher-agent? Should I restart the Rancher server, or restart the Docker engine on the host? What are the cleanest steps to solve this?
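In case it helps anyone debugging the same thing, one way to check whether the container’s shm mount is still held would be something like this (a sketch, using the container ID from the error message above; it only inspects state and changes nothing):

# is the shm mount from the error still present on the host?
grep 5856066ef58ab70a7f07c19e69b8816026da848397e46c34b0f6089d962c1414 /proc/mounts
# which processes still reference it in their mount namespace?
grep -l 5856066ef58ab70a7f07c19e69b8816026da848397e46c34b0f6089d962c1414 /proc/*/mountinfo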
We are on OpenStack; the hosts are CentOS 7 (Linux prod-mil-06.pecs.eea 3.10.0-327.13.1.el7.x86_64 #1 SMP Thu Mar 31 16:04:38 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux).
Our docker info:
root@prod-mil-06:/# docker info
Containers: 14
Images: 12
Storage Driver: devicemapper
Pool Name: vg_docker-docker--pool
Pool Blocksize: 524.3 kB
Base Device Size: 10.74 GB
Backing Filesystem: xfs
Data file:
Metadata file:
Data Space Used: 9.787 GB
Data Space Total: 33.97 GB
Data Space Available: 24.19 GB
Metadata Space Used: 3.105 MB
Metadata Space Total: 37.75 MB
Metadata Space Available: 34.64 MB
Udev Sync Supported: true
Deferred Removal Enabled: true
Deferred Deletion Enabled: false
Deferred Deleted Device Count: 0
Library Version: 1.02.107-RHEL7 (2015-12-01)
Execution Driver: native-0.2
Kernel Version: 3.10.0-327.13.1.el7.x86_64
Operating System: CentOS Linux 7 (Core)
CPUs: 4
Total Memory: 7.64 GiB
Name: prod-mil-06.pecs.eea
ID: YF4E:KR4V:SAIO:PH46:J7UO:Q23O:OE7O:2AYA:N7TP:LS36:DKPV:KNSP
Http Proxy:
Https Proxy:
No Proxy:
I also noticed that the previous rancher-agent cannot be deleted (see the commands below): even after a docker rm -f, docker ps still shows the rancher-agent in the list, and it cannot be destroyed:
[marinis@prod-mil-06 ~]$ docker rm -f rancher-agent
rancher-agent
[marinis@prod-mil-06 ~]$ docker ps | grep rancher
918d38a8a212 rancher/agent:v1.0.2 "/run.sh upgrade" 5 days ago Up 5 days rancher-agent-upgrade
521b1ecd3671 rancher/agent-instance:v0.8.1 "/etc/init.d/agent-in" 8 weeks ago Up 8 weeks 0.0.0.0:500->500/udp, 0.0.0.0:4500->4500/udp 44e673fc-bca6-496d-ae03-7e04bfa6dcf0
be8d4b365dd2 rancher/agent:v1.0.1 "/run.sh run" 9 weeks ago Up 9 weeks rancher-agent
[marinis@prod-mil-06 ~]$ docker kill rancher-agent
Failed to kill container (rancher-agent): Error response from daemon: Cannot kill container rancher-agent: [2] Container does not exist: container destroyed
I stopped and restarted the rancher-agent-upgrade container and, of course, it goes into an infinite loop trying to delete the rancher-agent, so docker logs rancher-agent-upgrade shows the same removal errors repeated endlessly.
Finally we managed to delete the rancher-agent container that was in a weird limbo state, along with all the other Rancher containers. It is still not clear what made the rancher-agent-upgrade get stuck; it must be a docker-engine issue with deleting containers.
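For anyone hitting the same “device or resource busy” error, a cleanup along these lines is one option (a sketch only, not the exact commands we ran; substitute your own container ID and double-check before forcing anything):

# unmount the leftover shm mount that blocked the removal (ID from the error above)
umount /var/lib/docker/containers/5856066ef58ab70a7f07c19e69b8816026da848397e46c34b0f6089d962c1414/shm
# restart the Docker engine so it drops the half-removed container state
systemctl restart docker
# then remove the stuck agent containers
docker rm -f rancher-agent rancher-agent-upgrade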
I re-ran the rancher/agent registration command (taken from the Rancher hosts UI) and now the host is connected again.
I’m about to do an upgrade and was interested to see your post. How did you manage to delete the rancher-agent that was stuck in the limbo state in the end? Has everything been OK since?
@xlight It sounds like @demarant hit issues where docker-engine couldn’t delete containers. There is no specific Rancher fix planned for this, as it’s a Docker issue.
After your upgrade, if you have any agents that are stuck in Reconnecting, all you need to do is re-run the host registration command (found in the Infrastructure -> Add Hosts -> Custom section).
Once the agent is reconnected to rancher/server, everything should act as normal, since your host will have connected back to rancher/server and synced with the DB.
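For reference, the registration command has roughly this shape (a sketch; the agent image tag, server URL and token below are placeholders, so always copy the exact command shown in your own Add Hosts screen):

sudo docker run -d --privileged \
  -v /var/run/docker.sock:/var/run/docker.sock \
  -v /var/lib/rancher:/var/lib/rancher \
  rancher/agent:v1.0.2 http://<your-rancher-server>:8080/v1/scripts/<registration-token>

Re-running it on an already-registered host re-launches the agent and reconnects it to the same host entry in rancher/server.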
Here’s what we’re seeing with CoreOS. CoreOS auto-upgrades and reboots, and after the reboot Docker isn’t running anymore. For hosts in the Reconnecting state, you can log in and run ps -ef | grep docker and you’ll see that Docker isn’t running. It won’t start until you try to use it once, so a docker ps will launch Docker. This is because of socket activation in systemd.
The solution is to create a systemd unit file that pings Docker on boot, or to add docker ps to /etc/rc.local.
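A minimal sketch of such a unit (the unit name and the /usr/bin/docker path are assumptions; adjust for your image):

[Unit]
Description=Poke Docker once on boot so socket activation starts the daemon
After=docker.socket
Requires=docker.socket

[Service]
Type=oneshot
ExecStart=/usr/bin/docker ps

[Install]
WantedBy=multi-user.target

Drop it in as e.g. /etc/systemd/system/docker-poke.service and enable it with systemctl enable docker-poke.service.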