We have a Rancher deployment (v1.1.0) with about 160 containers across 20 hosts, and I have noticed our server is getting slower and slower. I increased the heap reservation of the service to 4GB, which helped for a while. I'm currently using an m4.large instance, and we are seeing massive CPU contention (around a 17 load average).
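For context, the heap increase was done by relaunching the server container with a larger JVM heap, roughly like this (a sketch of my setup; as far as I know the stock rancher/server image reads JAVA_OPTS from the environment, but verify against the image you run):

```bash
# Relaunch rancher/server with a larger JVM heap.
# JAVA_OPTS is assumed to be honoured by the stock rancher/server entrypoint.
docker run -d --restart=always -p 8080:8080 \
  -e JAVA_OPTS="-Xms1g -Xmx4g" \
  rancher/server:v1.1.0
```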
I was curious whether there are any guidelines on the appropriate server size for a given deployment size. I understand it's not an exact science, but any insight would help.
As for the health of my installation, there aren't really any errors in the logs. I do have two long-running processes, both of which are agent reconnect jobs that timed out a month ago; there doesn't seem to be an easy way to kill those either.
We are working on some best-practice guides along with some sizing recommendations. In the meantime, I'm guessing you are using the embedded MySQL database inside the container. Rancher performance is largely driven by DB performance.
I would recommend breaking the database out.
You can stop Rancher Server, launch a MySQL container with --volumes-from rancher-server, and do a mysqldump. Import the dump into a separate DB, then relaunch the Rancher server container pointed at the external DB: http://docs.rancher.com/rancher/latest/en/installing-rancher/installing-server/
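Roughly, the procedure looks like this (a sketch only; the container names, MySQL image tag, credentials, and external DB endpoint are placeholders, and the --db-* flags are the ones documented on the install page above, so verify them there):

```bash
# 1. Stop Rancher server so the embedded MySQL data files are quiescent.
docker stop rancher-server

# 2. Start a MySQL container against the same data volume (the embedded DB
#    keeps its files under /var/lib/mysql inside the rancher-server container).
#    Pick a MySQL image version matching the embedded one.
docker run -d --name cattle-dump --volumes-from rancher-server mysql:5.7

# 3. Dump the cattle schema out to the host (credentials are illustrative).
docker exec cattle-dump mysqldump -uroot cattle > cattle.sql

# 4. Import the dump into the external DB (create the empty cattle schema and
#    grants there first), then relaunch Rancher server pointing at it.
mysql -h mydb.example.com -u cattle -p cattle < cattle.sql
docker run -d --restart=always -p 8080:8080 rancher/server:v1.1.0 \
  --db-host mydb.example.com --db-port 3306 \
  --db-user cattle --db-pass changeme --db-name cattle
```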
@cloudnautique Thanks for your reply. I'm actually using an external RDS instance; from the performance metrics I get there, it appears there is plenty of headroom for additional connections, memory, and CPU.
Just a few questions:
1.) My DB is about 2.4GB in size; is that unusual?
2.) Currently I'm using an m4.xlarge (16GB RAM, 4 vCPUs), and performance seems to be normal.
3.) I do have a fair number of inactive storage pools that were created via convoy-nfs; most of them won't purge properly via the GUI. Do you believe this could be the reason for the high CPU?
4.) Are there any plans to give admins the ability to kill off long-running admin processes? I currently have two that have been running for over a month.
OK, if the DB is already separated out, that's good.
A 2.4GB database isn't unusual; typically it grows because a process is hung and keeps logging.
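If you want to confirm where the space is actually going, checking the largest tables in the cattle schema usually tells the story (a standard information_schema query; 'cattle' is the default schema name, and the RDS endpoint/user here are placeholders):

```bash
mysql -h my-rds-endpoint.rds.amazonaws.com -u cattle -p -e "
  SELECT table_name,
         ROUND((data_length + index_length) / 1024 / 1024) AS size_mb
  FROM   information_schema.tables
  WHERE  table_schema = 'cattle'
  ORDER  BY data_length + index_length DESC
  LIMIT  10;"
```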
The processes that get stuck in the process view are usually trying to act on a resource that is in an irreconcilable state somewhere. If you expand the process to see which step it gets stuck on and which resource it's acting on, you can click into the API and deactivate/delete/etc. the resource to free it up. It is, unfortunately, somewhat of a pain.
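If you'd rather script it than click through the API browser, the same thing can be done with curl against the v1 API (a sketch; the server URL, resource type/ID, and API keys are placeholders, and the valid ?action= names are listed in each resource's JSON under "actions"):

```bash
# Inspect the resource the stuck process is pointing at
# (replace 'agents/1a123' with the real type/ID from the process view).
curl -s -u "$CATTLE_ACCESS_KEY:$CATTLE_SECRET_KEY" \
  http://rancher.example.com:8080/v1/agents/1a123

# Fire one of the actions the resource advertises, e.g. deactivate:
curl -s -X POST -u "$CATTLE_ACCESS_KEY:$CATTLE_SECRET_KEY" \
  "http://rancher.example.com:8080/v1/agents/1a123/?action=deactivate"
```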
Also, since Rancher keeps trying to reconcile those resources, it does chew up threads and can make the server seem slow, even though it is trying really hard to be good.
@cloudnautique Makes sense regarding the use of threads. Just one further question about removing the old tasks; I'll attempt to illustrate with screenshots:
These are the long-running tasks (at least the bottom three):
When I click on the agent reconnect link, I see that it's the "connecting to agent" child process that's failing, which makes sense since the host it's trying to connect to is no longer there:
So for that one, the agent:id in the first image is a link that should take you to the API, where there should be a deactivate button. For whatever reason, it's expecting the agent to reconnect. Do you have an environment where there's a host trying to reconnect?
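You can check that quickly from the API as well (again a sketch; the URL and keys are placeholders, and the state filter relies on the v1 API's standard query-parameter filtering):

```bash
# List hosts that are stuck trying to reconnect across the install.
curl -s -u "$CATTLE_ACCESS_KEY:$CATTLE_SECRET_KEY" \
  "http://rancher.example.com:8080/v1/hosts?state=reconnecting"
```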
For your current install, I would think an 8GB machine would be sufficient, as long as the DB is pretty snappy.
However, we still experience the same performance issues: at times the Rancher server is unresponsive, and it only comes back after many seconds, up to a minute. This makes the UI totally unusable during that time.
We have 8GB of RAM and 4 CPUs, which I believe is large enough for a single-node setup. We just see big temporary spikes in CPU usage, as @Daniel_Jensen describes above.
Our Rancher install is serving a total of 60-80 hosts spread over 6-8 environments, with possibly hundreds of containers in total. We run all environments on Cattle, with Docker Engine 1.10.3 on CentOS 7.
What is the total number of hosts and containers Cattle can manage? Are any scalability tests available?
@cloudnautique Thanks! So I hit the deactivate button and most of the tasks have disappeared; however, a few of them are still stuck in the deactivating state. Is there anything else that can be done to remove them from the worker queue?
Also, as soon as 1.2.0 is GA, I'll upgrade to it and see if it makes a difference.