Rancher server 100% CPU every 2 hours

wdk · October 3, 2018, 11:55am

The Problem
Ever since the upgrade from 2.0.0 to 2.0.8, our Rancher server periodically goes to full CPU for approximately 50 minutes and then backs down again. In effect, it maxes out its CPU for 50 minutes every other hour.

Right when the CPU maxes out, we see the following entries in the Rancher server log:

2018/10/03 09:09:50 [INFO] Running cluster events cleanup
2018/10/03 09:09:50 [INFO] Done running cluster events cleanup

It then starts logging etcd warnings about slow applies and large range queries:

2018-10-03 09:38:15.901724 W | etcdserver: avoid queries with large range/delete range!
2018-10-03 09:38:39.649305 I | mvcc: store.index: compact 22465467
2018-10-03 09:38:40.298757 W | etcdserver: apply entries took too long [652.48661ms for 1 entries]
2018-10-03 09:38:40.298814 W | etcdserver: avoid queries with large range/delete range!
2018-10-03 09:38:42.744426 I | mvcc: finished scheduled compaction at 22465467 (took 2.203545667s)
2018-10-03 09:41:27.890672 W | etcdserver: apply entries took too long [749.934134ms for 1 entries]
2018-10-03 09:41:27.891901 W | etcdserver: avoid queries with large range/delete range!

According to the etcd FAQ (https://github.com/etcd-io/etcd/blob/master/Documentation/faq.md#what-does-the-etcd-warning-apply-entries-took-too-long-mean), this warning is caused either by a slow disk (not the issue here) or by high CPU load, which is what we see.
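To see how the slow applies line up with the CPU peaks, the warnings can be pulled out of the log with a short script. This is only a minimal sketch: the regex is written against the log lines quoted above and assumes the duration is printed in `ms` or `s`, and the names `SLOW_APPLY`, `slow_applies` and `SAMPLE` are made up for illustration.

```python
import re
from datetime import datetime

# Pattern written against the etcd warnings quoted in this post, e.g.
# 2018-10-03 09:38:40.298757 W | etcdserver: apply entries took too long [652.48661ms for 1 entries]
# Assumption: durations appear as plain "<number>ms" or "<number>s".
SLOW_APPLY = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+) W \| etcdserver: "
    r"apply entries took too long "
    r"\[(?P<dur>[\d.]+)(?P<unit>ms|s) for (?P<n>\d+) entries\]"
)


def slow_applies(lines):
    """Yield (timestamp, duration in milliseconds) for each slow-apply warning."""
    for line in lines:
        match = SLOW_APPLY.match(line.strip())
        if match is None:
            continue  # not a slow-apply warning, skip it
        ts = datetime.strptime(match.group("ts"), "%Y-%m-%d %H:%M:%S.%f")
        duration_ms = float(match.group("dur"))
        if match.group("unit") == "s":
            duration_ms *= 1000.0  # normalise seconds to milliseconds
        yield ts, duration_ms


# Sample input: the etcd lines from the log excerpt above.
SAMPLE = """\
2018-10-03 09:38:15.901724 W | etcdserver: avoid queries with large range/delete range!
2018-10-03 09:38:39.649305 I | mvcc: store.index: compact 22465467
2018-10-03 09:38:40.298757 W | etcdserver: apply entries took too long [652.48661ms for 1 entries]
2018-10-03 09:38:40.298814 W | etcdserver: avoid queries with large range/delete range!
2018-10-03 09:41:27.890672 W | etcdserver: apply entries took too long [749.934134ms for 1 entries]
"""

if __name__ == "__main__":
    for ts, duration_ms in slow_applies(SAMPLE.splitlines()):
        print(ts.isoformat(), f"{duration_ms:.1f} ms")
```

Fed a full log file instead of `SAMPLE`, the timestamps could then be compared against the CPU graph to check whether the warnings only appear during the 50-minute peaks.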

Eventually the CPU backs down like clockwork after approximately 50 minutes. The next hour, however, the CPU stays low when the cluster events cleanup runs. The hour after that, it goes back to max CPU. This results in the following mountainscape in checkmk:
(image: rancher_cpu)

On the VM side, we even see a drop in IO which corresponds to the high CPU, so it looks like the container is not pumping data around:

The peak is an anomaly

Inside the container, top shows:

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
5 root 20 0 14.175g 3.340g 161988 S 199.0 58.9 9720:28 rancher

What we’ve tried

Restarting the rancher container
Total shutdown of the VM running the rancher container and restart

Since it’s running production, we don’t want to fiddle with it too much without some pointers, as we can’t risk breaking Rancher.
