Not sure what is going on, or how to debug this.
A bunch of containers keep going “Updating-Active” every minute or so for no apparent reason. This causes the external DNS service to remove and recreate the entries, and I suspect Rancher is overutilizing CPU as well.
Here is an example for one of the containers. Just to be clear, the containers aren’t restarted; it’s only the status that bounces.
It turns out there were a couple of containers that were unhealthy.
These containers were not visible under the “Stacks” menu, even with Infrastructure services displayed and all frames open.
But when I went to the “Infrastructure” main menu, under “Containers” (the flat list of all the containers in Rancher), I saw that a newer version of dnsupdate-rfc2136 was Unhealthy, as was a second instance of scheduler-scheduler-1 (same version).
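For anyone wanting to do the same check without scrolling the UI, here is a minimal sketch. It assumes you have already exported the flat container list into plain dicts; the `name` and `health` keys and the sample values are hypothetical, not Rancher's actual schema.

```python
from collections import Counter


def suspicious_containers(containers):
    """Return (unhealthy, duplicated) container names from a flat list.

    Each item is a dict with the hypothetical keys "name" and "health".
    """
    # Anything not reporting "healthy" is worth a look.
    unhealthy = sorted({c["name"] for c in containers if c["health"] != "healthy"})
    # A name that appears more than once hints at a stray second instance.
    counts = Counter(c["name"] for c in containers)
    duplicated = sorted(name for name, n in counts.items() if n > 1)
    return unhealthy, duplicated


# Hypothetical sample data shaped like the situation described above.
sample = [
    {"name": "scheduler-scheduler-1", "health": "healthy"},
    {"name": "scheduler-scheduler-1", "health": "unhealthy"},
    {"name": "dnsupdate-rfc2136", "health": "unhealthy"},
    {"name": "web-1", "health": "healthy"},
]
unhealthy, duplicated = suspicious_containers(sample)
print("unhealthy:", unhealthy)
print("duplicated:", duplicated)
```

Flagging duplicate names as well as unhealthy ones is deliberate: the second scheduler instance would have been easy to miss otherwise.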
After deleting these two containers everything was back to normal.
I take that back: it started again, with nothing to be seen in any logs… As happens very often in Rancher, some random issue appears out of nowhere with no way to troubleshoot it.
So it turns out the issue was very complex, and Rancher really doesn’t make it easy to troubleshoot.
It came down to the fact that my external DNS service was trying to update a name that had been defined manually in the DNS server. The name in question was the load balancer’s.
The broken DNS update probably caused a service metadata change on the load balancer, which in turn caused service updates on the containers it was in front of.
Short story: if you have something similar happening to you, and you use both DNS update and a load balancer, you might want to look there.
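A minimal sketch of that check, assuming you can get two plain lists of names: the records defined by hand on the DNS server and the names the DNS update service tries to manage. The function names and the example hostnames are hypothetical. How you obtain the lists depends on your setup and is not shown.

```python
def normalize(name):
    """Lowercase a DNS name and drop the trailing dot so both forms compare equal."""
    return name.strip().rstrip(".").lower()


def conflicting_names(static_names, managed_names):
    """Return names that are both manually defined and managed by the DNS updater."""
    static = {normalize(n) for n in static_names}
    managed = {normalize(n) for n in managed_names}
    return sorted(static & managed)


# Hypothetical example: the load balancer name exists in both lists.
manual = ["lb.example.com.", "mail.example.com"]
dynamic = ["LB.example.com", "web.example.com"]
print(conflicting_names(manual, dynamic))
```

Any name in the result is a candidate for the kind of conflict described above: either remove the manual record or stop the DNS update service from managing that name.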