We have a Rancher deployment with about 160 containers across 20 hosts running v1.1.0, and I have noticed our server is getting slower and slower. I have increased the heap reservation of the service to 4 GB, which did help for a while. I'm currently using an m4.large instance, and we are seeing massive CPU contention (load average around 17).
I was curious whether there are any guidelines on what the appropriate server size should be based on deployment size. I understand it's not an exact science, but any insight would help.
As for the health of my installation, there aren't really any errors in the logs. I have two long-running processes, both of which are agent reconnect jobs that timed out a month ago; there doesn't seem to be an easy way to kill those either.
We are working on some best-practice guides along with some sizing recommendations. In the meantime, I'm guessing you are using the embedded MySQL database inside the container; Rancher performance is largely driven by the DB performance.
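If the DB is still embedded, splitting it out is usually the first win. As a rough sketch (the hostname, credentials, and database name below are placeholders for your own setup), the Rancher 1.x `rancher/server` image takes the external-DB settings as environment variables:

```shell
# Sketch: point rancher/server at an external MySQL instance instead of
# the embedded one. mysql.example.com, cattle, and changeme are
# placeholders -- substitute your own host and credentials.
docker run -d --restart=always -p 8080:8080 \
  -e CATTLE_DB_CATTLE_MYSQL_HOST=mysql.example.com \
  -e CATTLE_DB_CATTLE_MYSQL_PORT=3306 \
  -e CATTLE_DB_CATTLE_MYSQL_NAME=cattle \
  -e CATTLE_DB_CATTLE_USERNAME=cattle \
  -e CATTLE_DB_CATTLE_PASSWORD=changeme \
  rancher/server
```

With the DB on its own machine you can also size and tune MySQL independently of the Rancher server itself.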
OK, if the DB is separated out already, that's good.
A 2.4 GB database isn't unusual; typically it gets larger because of a hung process that keeps logging.
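One way to see whether a hung, logging process is what's growing the database is to list the largest tables in the `cattle` schema. A minimal sketch (the query is standard `information_schema`; the `mysql` connection details in the comment are placeholders):

```shell
# Sketch: find the biggest tables in the cattle schema so you can see
# what is actually growing.
SQL="
SELECT table_name,
       ROUND((data_length + index_length) / 1024 / 1024) AS size_mb
FROM information_schema.tables
WHERE table_schema = 'cattle'
ORDER BY (data_length + index_length) DESC
LIMIT 10;"

# Run it with your own connection details, e.g.:
#   mysql -h mysql.example.com -u cattle -p -e "$SQL"
echo "$SQL"
```

If a log- or audit-style table dominates, that points back at a stuck process rather than genuine growth.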
The processes that get stuck in the process view are usually trying to act on a resource that is in an irreconcilable state somewhere. If you expand the process and see which step it gets to, and which resource it is acting on, you can click through to the API and disable/delete the resource to free it up. It is, unfortunately, somewhat of a pain.
Also, since Rancher keeps trying to reconcile those resources, it chews up threads and can make Rancher seem slow, even though it is trying really hard to be good.
So for that one, the agent:id in the first image is a link that should take you to the API, where there should be a Deactivate button. For whatever reason, it's expecting the agent to reconnect. Do you have an environment where a host is trying to reconnect?
For your current install, I would think an 8 GB machine would be sufficient, as long as the DB is pretty snappy.
However, we still experience the same performance issues; at times the Rancher server is unresponsive, and eventually, after many seconds (up to a minute), it comes back. This makes the UI totally unusable during that time.
We have 8 GB of RAM and 4 CPUs. The host is large enough for the single-node setup, I believe. We just see temporary big spikes in CPU usage, as @Daniel_Jensen describes above.
Our Rancher is serving a total of 60-80 hosts spread over 6-8 environments, with possibly hundreds of containers in total. We run all environments in Cattle. Docker Engine 1.10.3 on CentOS 7.
What is the total number of hosts and containers Cattle can manage? Are any scalability tests available?
@cloudnautique Thanks! So I hit the Deactivate button and most of the tasks have disappeared; however, a few of them are still stuck in the deactivating state. Is there anything else that can be done to remove them from the worker queue?
Also, as soon as 1.2.0 is GA, I'll upgrade to it and see if it makes a difference.