High CPU load on network-services/metadata after upgrade to 1.4.1

Hi, I’m new to both Rancher and Docker, but I noticed a high load in Zabbix on my 3 nodes.

After debugging for two days I came across the network-services/metadata stack (new one in Infrastructure) and saw that the CPU load on all metadata services is around 200% or more.

This is the log:

16/02/2017 15:31:36 time="2017-02-16T14:31:36Z" level=info msg="Downloaded in 8.4477361s"
16/02/2017 15:31:38 time="2017-02-16T14:31:38Z" level=info msg="Loading answers"
16/02/2017 15:31:40 time="2017-02-16T14:31:40Z" level=info msg="Loaded answers"

And that happens every second.

The load is around 15-25 per server, and higher when something is upgraded or started in Rancher.

Anyone know where I should look next? Googling didn’t yield any answers.

Thanks in advance.

Furthermore, if I stop rancher-server, the load drops to 1, maybe 2.

I noticed that my database table Instance is over 600 MB, and more than 95% of its rows are in the purged state.

Could the problem be that the metadata service is trying to get info on instances that don’t exist any more?

Hi!

This seems really old, but I’m facing a similar issue using Rancher v1.5.10. When I restart or upgrade the Route53 DNS service, the metadata service CPU usage goes up (the network-services-metadata-dns container).

This is the log I see during the restart process:

time="2017-09-01T16:23:54Z" level=info msg="Reloading answers"
time="2017-09-01T16:23:54Z" level=info msg="Reloaded answers"
time="2017-09-01T16:23:58Z" level=info msg="Reloading answers"
time="2017-09-01T16:23:58Z" level=info msg="Reloaded answers"
time="2017-09-01T16:24:09Z" level=info msg="Reloading answers"
time="2017-09-01T16:24:10Z" level=info msg="Reloaded answers"
time="2017-09-01T16:24:11Z" level=info msg="Reloading answers"
time="2017-09-01T16:24:11Z" level=info msg="Reloaded answers"
time="2017-09-01T16:24:12Z" level=info msg="Reloading answers"
time="2017-09-01T16:24:12Z" level=info msg="Reloaded answers"
time="2017-09-01T16:24:13Z" level=info msg="Reloading answers"
time="2017-09-01T16:24:13Z" level=info msg="Reloaded answers"
time="2017-09-01T16:24:13Z" level=info msg="Reloading answers"
time="2017-09-01T16:24:13Z" level=info msg="Reloaded answers"
time="2017-09-01T16:24:14Z" level=info msg="Reloading answers"
time="2017-09-01T16:24:14Z" level=info msg="Reloaded answers"
time="2017-09-01T16:24:16Z" level=info msg="Reloading answers"
time="2017-09-01T16:24:16Z" level=info msg="Reloaded answers"
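To put a number on how often these reloads happen, log lines in this format can be summarized with a short script. This is only a minimal sketch: the function names `reload_times` and `summarize` are made up for illustration, and it assumes every line follows the `time="..." level=... msg="..."` layout shown above with second-resolution UTC timestamps.

```python
import re
from collections import Counter
from datetime import datetime

# Matches lines such as:
# time="2017-09-01T16:23:54Z" level=info msg="Reloading answers"
LINE_RE = re.compile(
    r'time="(?P<ts>[^"]+)"\s+level=(?P<level>\w+)\s+msg="(?P<msg>[^"]*)"'
)


def reload_times(lines, marker="Reloading answers"):
    """Return the timestamp of every line whose msg equals `marker`."""
    times = []
    for line in lines:
        match = LINE_RE.search(line)
        if match and match.group("msg") == marker:
            times.append(
                datetime.strptime(match.group("ts"), "%Y-%m-%dT%H:%M:%SZ")
            )
    return times


def summarize(times):
    """Return (reload count, seconds from first to last, max reloads in one second)."""
    if not times:
        return 0, 0.0, 0
    span = (max(times) - min(times)).total_seconds()
    busiest = max(Counter(times).values())
    return len(times), span, busiest


# A few lines copied from the log above.
SAMPLE = [
    'time="2017-09-01T16:24:12Z" level=info msg="Reloading answers"',
    'time="2017-09-01T16:24:12Z" level=info msg="Reloaded answers"',
    'time="2017-09-01T16:24:13Z" level=info msg="Reloading answers"',
    'time="2017-09-01T16:24:13Z" level=info msg="Reloaded answers"',
]

count, span, busiest = summarize(reload_times(SAMPLE))
print(f"{count} reloads over {span:.0f}s, at most {busiest} in a single second")
```

Feeding it the full container log instead of `SAMPLE` shows whether the reloads cluster around service restarts or run continuously.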

Is there anything we can do to improve that behavior?

Thanks,
Vinicius

Short term, no. When things change, metadata is updated and the complete YAML file is sent to the host and parsed. Longer term (2.0), yes: diffs will be/are sent incrementally instead of the complete file.
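The difference between the two approaches can be sketched roughly as follows. This is not Rancher's actual code or wire format, just an illustration of the idea: the snapshot layout, the helper names (`make_snapshot`, `diff`, `apply_diff`) and the use of JSON instead of YAML (to stay within the standard library) are all assumptions.

```python
import json

_MISSING = object()  # sentinel for "key not present in the old snapshot"


def make_snapshot(n_services):
    """Build a made-up metadata snapshot with n_services entries."""
    return {
        f"service-{i}": {
            "state": "active",
            "scale": 1,
            "containers": [f"service-{i}-1"],
        }
        for i in range(n_services)
    }


def diff(old, new):
    """Return only what changed: keys to set and keys to remove."""
    changed = {k: v for k, v in new.items() if old.get(k, _MISSING) != v}
    removed = [k for k in old if k not in new]
    return {"set": changed, "remove": removed}


def apply_diff(old, delta):
    """Rebuild the new snapshot on the receiving side from old + delta."""
    result = dict(old)
    result.update(delta["set"])
    for key in delta["remove"]:
        result.pop(key, None)
    return result


# 70 services is an illustrative count; one of them changes state.
old = make_snapshot(70)
new = dict(old)
new["service-3"] = {**old["service-3"], "state": "upgrading"}

full_bytes = len(json.dumps(new))             # full-file approach
diff_bytes = len(json.dumps(diff(old, new)))  # incremental approach

print(f"full snapshot: {full_bytes} bytes, diff: {diff_bytes} bytes")
assert apply_diff(old, diff(old, new)) == new
```

With the full-file approach every change costs a transfer and a parse proportional to the whole snapshot on every host. With diffs the cost is proportional to what actually changed.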

Thanks @vincent for your answer.

It really does seem that some unnecessary work is being done. I would not expect the complete YAML file to be large enough to cause 150-200 Mbit/s of network usage for a few seconds. I'm also looking forward to news about 2.0.

We have 70 services right now, and we are growing every day. For the moment, we need to change the architecture so this won't happen.