Rancher OS EC2 machines get stuck after a while

Julien_FAISSOLLE · June 24, 2016, 7:00am

Hello,

I have provisioned a two-machine RancherOS (v0.4.5) cluster on EC2 using the Rancher UI's AWS provisioning. After a while, generally less than 24 hours, one machine or the other seems to get stuck. In Rancher, the host is shown as “RECONNECTING”. While this is happening, it is impossible to connect to the failing host over ssh. The only solution is to shut it down through AWS, or even “force shutdown” it, and then restart it.

denise · June 24, 2016, 5:27pm

What size is your EC2 instance and what do you have running on it? Is the one that’s reconnecting also hosting the rancher/server container?

I have a Rancher setup with RancherOS hosts and have never had that issue.

Julien_FAISSOLLE · June 25, 2016, 3:43pm

They are t2.medium instances running custom applications that use few resources. The Rancher server is on a separate host.
Two particular things happened, though…
The first one was a misconfiguration of the security group of one of the hosts. The rancher-ecr-credentials service was failing, which caused other containers to fail repeatedly at startup. At first I thought this was the cause of the problem, but it happened again after the misconfiguration was corrected.
The second is that I tried to use the glusterfs + glustr-convoy services. I then removed them, but the volumes are still visible in the Rancher UI.
Apart from these two particularities, I do not see anything special about my setup. I monitor the EC2 CPU credits to check that they are not going down. The balance is positive.
Is there any log to check after the reboot, apart from the agent log?
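
For anyone who wants to script the CPU credit check mentioned above, here is a minimal sketch. It assumes boto3 is installed and AWS credentials are configured. The function names, the 24-hour window, and the `low_threshold` value are illustrative assumptions, not something taken from this setup.

```python
from datetime import datetime, timedelta, timezone


def fetch_credit_balance(instance_id, hours=24, region=None):
    """Fetch CPUCreditBalance datapoints from CloudWatch for one instance.

    Needs boto3 and valid AWS credentials; imported lazily so that
    summarize_credits() below can be used without boto3 installed.
    """
    import boto3

    cloudwatch = boto3.client("cloudwatch", region_name=region)
    end = datetime.now(timezone.utc)
    response = cloudwatch.get_metric_statistics(
        Namespace="AWS/EC2",
        MetricName="CPUCreditBalance",
        Dimensions=[{"Name": "InstanceId", "Value": instance_id}],
        StartTime=end - timedelta(hours=hours),
        EndTime=end,
        Period=300,  # one datapoint per 5 minutes
        Statistics=["Average"],
    )
    return response["Datapoints"]


def summarize_credits(datapoints, low_threshold=10.0):
    """Summarize datapoints shaped like CloudWatch output.

    Each datapoint is a dict with "Timestamp" and "Average" keys.
    low_threshold is an arbitrary, hypothetical cut-off for "running low".
    Returns None when there are no datapoints.
    """
    if not datapoints:
        return None
    # CloudWatch does not guarantee ordering, so sort by time first.
    ordered = sorted(datapoints, key=lambda p: p["Timestamp"])
    balances = [p["Average"] for p in ordered]
    return {
        "first": balances[0],
        "last": balances[-1],
        "minimum": min(balances),
        "ran_low": min(balances) < low_threshold,
    }
```

Calling `summarize_credits(fetch_credit_balance("i-0123456789abcdef0"))` (the instance ID is a placeholder) would show whether the balance dipped at any point in the window, which a single look at the current balance can miss.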

denise · July 2, 2016, 11:08pm

There are two locations for agent logs: the Docker logs, and a more detailed log.

http://docs.rancher.com/rancher/latest/en/faqs/agents/#agent-logs

We just released v0.5.0 of RancherOS and would be curious whether you still see this issue after upgrading. That release included major refactoring changes.
