Rancher server 100% CPU every 2 hours

The Problem
Ever since upgrading from 2.0.0 to 2.0.8, our Rancher server periodically goes to full CPU for approximately 50 minutes and then backs down again. It’s basically maxing out its CPU for 50 minutes every other hour.

Right when the CPU maxes out, we see the following entry in the rancher server log:

2018/10/03 09:09:50 [INFO] Running cluster events cleanup
2018/10/03 09:09:50 [INFO] Done running cluster events cleanup

It then starts logging etcd warnings about slow applies and large range queries:

2018-10-03 09:38:15.901724 W | etcdserver: avoid queries with large range/delete range!
2018-10-03 09:38:39.649305 I | mvcc: store.index: compact 22465467
2018-10-03 09:38:40.298757 W | etcdserver: apply entries took too long [652.48661ms for 1 entries]
2018-10-03 09:38:40.298814 W | etcdserver: avoid queries with large range/delete range!
2018-10-03 09:38:42.744426 I | mvcc: finished scheduled compaction at 22465467 (took 2.203545667s)
2018-10-03 09:41:27.890672 W | etcdserver: apply entries took too long [749.934134ms for 1 entries]
2018-10-03 09:41:27.891901 W | etcdserver: avoid queries with large range/delete range!

According to https://github.com/etcd-io/etcd/blob/master/Documentation/faq.md#what-does-the-etcd-warning-apply-entries-took-too-long-mean, this is caused either by a slow disk (not the issue here) or by high CPU load, which is what we see here.
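To line these warnings up against the CPU graph, the slow-apply durations can be pulled out of the log. This is only a minimal sketch: the function name `slow_applies` and the regular expression are our own assumptions, written against the log lines quoted above, and it only handles durations printed in `ms` or `s`.

```python
import re

# Matches etcd warnings of the form quoted above, e.g.
#   ... W | etcdserver: apply entries took too long [652.48661ms for 1 entries]
# Assumption: the duration is printed with an "ms" or "s" suffix only.
SLOW_APPLY = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+) W \| etcdserver: "
    r"apply entries took too long "
    r"\[(?P<value>[\d.]+)(?P<unit>ms|s) for (?P<count>\d+) entries\]"
)


def slow_applies(lines):
    """Return a list of (timestamp, duration in ms) for each slow-apply warning."""
    results = []
    for line in lines:
        match = SLOW_APPLY.match(line.strip())
        if match is None:
            continue  # some other log line, e.g. the large-range warning
        value = float(match.group("value"))
        if match.group("unit") == "s":
            value *= 1000.0  # normalise seconds to milliseconds
        results.append((match.group("ts"), value))
    return results


# Sample lines copied from the log excerpt above.
sample = [
    "2018-10-03 09:38:15.901724 W | etcdserver: avoid queries with large range/delete range!",
    "2018-10-03 09:38:40.298757 W | etcdserver: apply entries took too long [652.48661ms for 1 entries]",
    "2018-10-03 09:41:27.890672 W | etcdserver: apply entries took too long [749.934134ms for 1 entries]",
]

for ts, ms in slow_applies(sample):
    print(f"{ts}  {ms:.1f} ms")
```

Feeding the full container log through this should show whether the slow applies only occur inside the 50-minute high-CPU windows.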

Eventually the CPU backs down like clockwork after approx. 50 minutes. The next hour, however, the CPU stays low when the cluster events cleanup runs. The hour after that, it goes back to max CPU. This results in the following mountainscape in checkmk:
(image: rancher_cpu)

On the VM side, we even see a drop in IO that corresponds to the high CPU, so it looks like the container is not pumping data around:


The peak in that graph is an anomaly.

Inside the container, top shows:

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
5 root 20 0 14.175g 3.340g 161988 S 199.0 58.9 9720:28 rancher
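For reading that row: a rough sketch that splits it into named fields, assuming the default top column order shown in the header and that a TIME+ value this large is printed as minutes:seconds. The helper name `parse_top_row` is made up for illustration.

```python
def parse_top_row(row):
    """Split one top process row into a few named fields.

    Assumes the default column order:
    PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
    and a TIME+ value in minutes:seconds form.
    """
    fields = row.split()
    minutes, seconds = fields[10].split(":")
    return {
        "pid": int(fields[0]),
        "cpu_percent": float(fields[8]),   # above 100 means more than one core busy
        "mem_percent": float(fields[9]),
        "cpu_time_hours": (int(minutes) * 60 + float(seconds)) / 3600,
        "command": fields[11],
    }


row = "5 root 20 0 14.175g 3.340g 161988 S 199.0 58.9 9720:28 rancher"
info = parse_top_row(row)
print(info)
```

Under those assumptions, 199.0 %CPU corresponds to roughly two cores fully busy, and the accumulated CPU time works out to about 162 hours.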

What we’ve tried

  • Restarting the rancher container
  • Total shutdown of the VM running the rancher container and restart

Since this is running in production, we don’t want to fiddle with it too much without some pointers, so that we don’t break Rancher.