Last night we had a single Cattle worker node slowly go bad: its containers began failing over the course of an hour. I took the node out of the ASG and replaced it, which restored availability.
But I have zero idea how to troubleshoot this and determine the root cause. We are not using Rancher in production, so fortunately the outage didn't have serious consequences. I was unhappy to see, though, that the container health checks didn't cause Rancher to reschedule the containers onto other working nodes the way they do when I make an instance unavailable.
Suggestions for where to look for log files etc. in Rancher when this happens next time? Unfortunately I did NOT capture the required info this time, so I won't be able to answer any questions beyond config (RHEL 7.4, in AWS on m5.larges).
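For next time, here's a rough sketch of what I'm planning to capture before terminating a bad node, so I at least have something to post. It assumes the agent container is named rancher-agent and that the Docker daemon logs to journald on RHEL 7.4; please correct me if there are better places to look.

```python
#!/usr/bin/env python3
"""Rough triage collector to run on a misbehaving worker before replacing it.

Assumptions (corrections welcome): the Rancher agent container is named
'rancher-agent', and the Docker daemon logs to journald on RHEL 7.4.
"""
import subprocess
from datetime import datetime

# Output I'm guessing would be worth attaching to a post like this one.
COMMANDS = {
    "docker_ps": ["docker", "ps", "-a"],
    "rancher_agent_logs": ["docker", "logs", "--tail", "500", "rancher-agent"],
    "docker_daemon_journal": ["journalctl", "-u", "docker",
                              "--since", "2 hours ago", "--no-pager"],
    "system_messages": ["tail", "-n", "500", "/var/log/messages"],
}

def collect(outfile="node-triage.txt"):
    with open(outfile, "w") as f:
        f.write(f"collected {datetime.utcnow().isoformat()}Z\n")
        for name, cmd in COMMANDS.items():
            f.write(f"\n===== {name}: {' '.join(cmd)} =====\n")
            try:
                # Merge stderr into stdout so failures are captured in the same file.
                result = subprocess.run(cmd, stdout=subprocess.PIPE,
                                        stderr=subprocess.STDOUT, timeout=60)
                f.write(result.stdout.decode(errors="replace"))
            except Exception as exc:  # keep going even if one command fails
                f.write(f"FAILED: {exc}\n")

if __name__ == "__main__":
    collect()
```

Is there a server-side log (on the rancher/server host) that would show why the health checks didn't trigger rescheduling? If so I'd add that to this list.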
Second question: when I brought in a new replacement worker, I couldn't find a way to redistribute the containers across it. All containers stayed on the remaining nodes, and the new one went unused. Suggestions? Surprisingly, increasing the scale and then decreasing it consistently removed the container on the new worker, which was the opposite of the behavior I was hoping for.
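For reference, this is roughly how I was checking the spread of containers across hosts, instead of eyeballing the UI. It's a quick sketch against the v2-beta API; the endpoint paths, the hostId field, and the environment API-key auth are my assumptions from browsing the API UI, so correct me if I'm reading it wrong.

```python
#!/usr/bin/env python3
"""Count running containers per host via the Rancher 1.x API.

Assumptions: a 'v2-beta' project URL, an environment API key pair used as
HTTP basic auth, and a 'hostId' field on container records -- all taken
from poking around the API browser, so treat the details as unverified.
"""
import os
from collections import Counter

import requests

# e.g. http://rancher.example.com:8080/v2-beta/projects/1a5  (hypothetical values)
RANCHER_URL = os.environ["RANCHER_URL"]
ACCESS_KEY = os.environ["RANCHER_ACCESS_KEY"]
SECRET_KEY = os.environ["RANCHER_SECRET_KEY"]

def get(resource):
    """Fetch one collection under the project and return its 'data' list."""
    resp = requests.get(f"{RANCHER_URL}/{resource}",
                        auth=(ACCESS_KEY, SECRET_KEY),
                        params={"limit": "1000"})
    resp.raise_for_status()
    return resp.json()["data"]

def main():
    hosts = {h["id"]: h.get("hostname", h["id"]) for h in get("hosts")}
    running = [c for c in get("containers") if c.get("state") == "running"]
    per_host = Counter(hosts.get(c.get("hostId"), "unknown") for c in running)
    for hostname, count in per_host.most_common():
        print(f"{hostname}: {count} containers")

if __name__ == "__main__":
    main()
```

With the new worker in place, it showed zero containers on that host until I scaled services up, and scaling back down took the new host's container away again.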
Thanks!
Stu