Last night we had a single Cattle worker node slowly go bad: its containers began failing over the course of an hour. I took the node out of the ASG and replaced it, which restored availability.
But I have zero idea how to troubleshoot this and determine the root cause. We are not using Rancher in production, so fortunately the outage didn't have serious consequences. I was unhappy to see, though, that the container health checks didn't cause Rancher to reschedule the containers onto other working nodes the way they do when I make an instance unavailable.
Suggestions for where to look for log files etc. in Rancher when this happens next time? Unfortunately I did NOT capture the required info this time, so I won't be able to answer any questions beyond config (RHEL 7.4, in AWS on m5.larges).
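For next time, here's a rough sketch of what I'm planning to capture before terminating a bad node, so I at least have something to post. It assumes the agent container is named rancher-agent and that the Docker daemon logs to journald on RHEL 7.4; please correct me if there are better places to look.

```python
#!/usr/bin/env python3
"""Rough triage collector to run on a misbehaving worker before replacing it.

Assumptions (corrections welcome): the Rancher agent container is named
'rancher-agent', and the Docker daemon logs to journald on RHEL 7.4.
"""
import subprocess
from datetime import datetime

# Output I'm guessing would be worth attaching to a post like this one.
COMMANDS = {
    "docker_ps": ["docker", "ps", "-a"],
    "rancher_agent_logs": ["docker", "logs", "--tail", "500", "rancher-agent"],
    "docker_daemon_journal": ["journalctl", "-u", "docker",
                              "--since", "2 hours ago", "--no-pager"],
    "system_messages": ["tail", "-n", "500", "/var/log/messages"],
}

def collect(outfile="node-triage.txt"):
    with open(outfile, "w") as f:
        f.write(f"collected {datetime.utcnow().isoformat()}Z\n")
        for name, cmd in COMMANDS.items():
            f.write(f"\n===== {name}: {' '.join(cmd)} =====\n")
            try:
                # Merge stderr into stdout so failures are captured in the same file.
                result = subprocess.run(cmd, stdout=subprocess.PIPE,
                                        stderr=subprocess.STDOUT, timeout=60)
                f.write(result.stdout.decode(errors="replace"))
            except Exception as exc:  # keep going even if one command fails
                f.write(f"FAILED: {exc}\n")

if __name__ == "__main__":
    collect()
```

Is there a server-side log (on the rancher/server host) that would show why the health checks didn't trigger rescheduling? If so I'd add that to this list.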
Second question: when I brought in a new replacement worker, I couldn't find a way to redistribute the containers across it. All containers stayed on the remaining nodes, and the new one went unused. Suggestions? Surprisingly, increasing the scale and then decreasing it consistently removed the container on the new worker, which was the opposite of the behavior I was hoping for.
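For reference, this is roughly how I was checking the spread of containers across hosts, instead of eyeballing the UI. It's a quick sketch against the v2-beta API; the endpoint paths, the hostId field, and the environment API-key auth are my assumptions from browsing the API UI, so correct me if I'm reading it wrong.

```python
#!/usr/bin/env python3
"""Count running containers per host via the Rancher 1.x API.

Assumptions: a 'v2-beta' project URL, an environment API key pair used as
HTTP basic auth, and a 'hostId' field on container records -- all taken
from poking around the API browser, so treat the details as unverified.
"""
import os
from collections import Counter

import requests

# e.g. http://rancher.example.com:8080/v2-beta/projects/1a5  (hypothetical values)
RANCHER_URL = os.environ["RANCHER_URL"]
ACCESS_KEY = os.environ["RANCHER_ACCESS_KEY"]
SECRET_KEY = os.environ["RANCHER_SECRET_KEY"]

def get(resource):
    """Fetch one collection under the project and return its 'data' list."""
    resp = requests.get(f"{RANCHER_URL}/{resource}",
                        auth=(ACCESS_KEY, SECRET_KEY),
                        params={"limit": "1000"})
    resp.raise_for_status()
    return resp.json()["data"]

def main():
    hosts = {h["id"]: h.get("hostname", h["id"]) for h in get("hosts")}
    running = [c for c in get("containers") if c.get("state") == "running"]
    per_host = Counter(hosts.get(c.get("hostId"), "unknown") for c in running)
    for hostname, count in per_host.most_common():
        print(f"{hostname}: {count} containers")

if __name__ == "__main__":
    main()
```

With the new worker in place, it showed zero containers on that host until I scaled services up, and scaling back down took the new host's container away again.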
Thanks!
Stu