RancherOS EC2 machines get stuck after a while

Hello,

I have provisioned a two-machine RancherOS (v0.4.5) cluster on EC2 using the Rancher UI's AWS provisioning. After a while, generally less than 24 hours, one machine or the other seems to get stuck. In Rancher, the host is shown as “RECONNECTING”. While this is happening, it is impossible to connect to the failing host over ssh. The only solution is to shut it down through AWS, or even “force shutdown” it, and then restart it.

What size is your EC2 instance and what do you have running on it? Is the one that’s reconnecting also hosting the rancher/server container?

I have a Rancher setup with RancherOS hosts and have never had that issue.

They are t2.medium with custom applications running using few resources. The rancher server is on a separate host.
Two particular things happened, though…
The first was a misconfigured security group on one of the hosts. It made the rancher-ecr-credentials service fail, which caused other containers to repeatedly fail to start. At first I thought this was the cause of the problem, but the hang happened again after the misconfiguration was corrected.
The second is that I tried the glusterfs + glustr-convoy services. I have since removed them, but their volumes are still visible in the Rancher UI.
Apart from these two things, I do not see anything special about my setup. I monitor the EC2 CPU credits to check that they are not running out, and the balance stays positive.
Is there any log to check after the reboot, apart from the agent log?

There are two locations for agent logs: the Docker logs, as well as more detailed logs.

http://docs.rancher.com/rancher/latest/en/faqs/agents/#agent-logs
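
If it helps, here is a rough Python sketch for grabbing both kinds of log once the host is back up. The container name `rancher-agent` and the path `/var/log/rancher/agent.log` are assumptions on my part, and the helper names are made up for illustration, so please check them against the FAQ linked above before relying on this.

```python
import subprocess

# Assumed defaults; verify them against the linked FAQ for your Rancher version.
AGENT_CONTAINER = "rancher-agent"
DETAILED_LOG = "/var/log/rancher/agent.log"


def agent_log_commands(container=AGENT_CONTAINER, log_path=DETAILED_LOG, tail=200):
    """Return the two docker commands as argv lists, without running them."""
    return [
        # 1. The container's stdout/stderr, as kept by the Docker daemon.
        ["docker", "logs", "--tail", str(tail), container],
        # 2. The more detailed log file kept inside the agent container.
        ["docker", "exec", container, "tail", "-n", str(tail), log_path],
    ]


def collect_agent_logs(**kwargs):
    """Run both commands on the host and return one output string per command."""
    outputs = []
    for argv in agent_log_commands(**kwargs):
        result = subprocess.run(
            argv,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,  # docker logs can write to either stream
            text=True,
        )
        outputs.append(result.stdout)
    return outputs
```

Calling `collect_agent_logs(tail=500)` on the affected host returns the last 500 lines from each source. Keep in mind that the agent container may have been restarted by the reboot, so the file inside the container is not guaranteed to still hold entries from before the hang.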

We just released v0.5.0 of RancherOS and would be curious whether you still see this issue after upgrading. That release included major refactoring changes.