I am not sure whether Rancher or RancherOS is the culprit, but the load average on my RancherOS hosts spikes when there is no external reason for it. I was able to do some basic investigating (busybox does not provide many tools), and top showed a handful of du -s /var/lib/docker/overlay/xxxxx commands with a STAT of D, which means uninterruptible sleep (usually I/O). This is usually a bad thing. I killed the du processes, the load average dropped below one, and the host started to behave as it should. What is triggering du against the container volumes and, more importantly, why is it hanging?
As you can imagine, having this issue in a production environment is quite troubling, as hosts can become unresponsive for no apparent reason.
The RancherOS version is 0.7.1 and the Rancher version is 1.1.0.
cAdvisor provides the data for the graphs you see in the UI in Rancher < 1.2. It runs even when you’re not looking at the graphs and forks du every refresh interval.
In 1.2+ we use Docker’s built-in stats and gather them only while you’re looking, and that path doesn’t call du.
The issue is that the du never finishes and craters the host. I don’t suppose there is a way to configure cAdvisor not to run? System stability is more critical than the performance graphs, and I have other tools providing some of that information. I guess another option would be to write a script that looks for processes in uninterruptible sleep and kills them?
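For what it's worth, that last idea could look roughly like the sketch below. It assumes Python 3 is available somewhere that can see the host's /proc (the stock busybox console may not have it), and the helper names parse_stat and find_stuck are made up for illustration. It only lists du processes in state D; the kill is left commented out, since a process in uninterruptible sleep generally does not act on a signal until the blocking I/O returns, so this is a workaround at best and not a fix.

```python
import os


def parse_stat(text):
    """Parse the contents of /proc/<pid>/stat into (pid, comm, state)."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split on the first '(' and the last ')'.
    open_idx = text.index("(")
    close_idx = text.rindex(")")
    pid = int(text[:open_idx].strip())
    comm = text[open_idx + 1:close_idx]
    state = text[close_idx + 1:].split()[0]
    return pid, comm, state


def find_stuck(name="du", proc_root="/proc"):
    """Return PIDs of processes named `name` that are in state D."""
    stuck = []
    try:
        entries = os.listdir(proc_root)
    except OSError:
        return stuck  # no procfs here (for example, not running on Linux)
    for entry in entries:
        if not entry.isdigit():
            continue  # skip non-process entries such as "meminfo"
        try:
            with open(os.path.join(proc_root, entry, "stat")) as fh:
                pid, comm, state = parse_stat(fh.read())
        except (OSError, ValueError, IndexError):
            continue  # process exited while scanning, or stat was unreadable
        if comm == name and state == "D":
            stuck.append(pid)
    return stuck


if __name__ == "__main__":
    for pid in find_stuck():
        print("du in uninterruptible sleep: pid", pid)
        # To actually kill it (assumption: you accept the risk):
        # import signal; os.kill(pid, signal.SIGKILL)
```

Run from cron or a loop, this would at least show how often the du processes pile up before deciding whether to kill them automatically.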