As Rancher admins, we have access to the Processes information.
I can see a lot of processes (around 20-30) that have either timed out or failed with other exceptions.
Unfortunately, there is no documentation at all on docs.rancher explaining how to interpret these issues.
Will Rancher at some point give us some documentation about it? Or maybe a meetup / live demo on how to debug and use this info? It is very cryptic right now.
Below is an example of our Processes…
We have a lot of TIMEOUT and RESOURCE_BUSY entries.
Any documentation to help understand these would be helpful. We have had huge issues with Rancher stability recently and are not able to upgrade stacks via Rancher; they get stuck in one state or another. We are close to giving up on Rancher unless we find out what is causing these TIMEOUT and RESOURCE_BUSY issues.
We found out that the TIMEOUT and RESOURCE_BUSY errors are mainly caused by an overloaded Rancher server. The VM where Rancher is running has 8 vCPUs, but the load average was between 20 and 40, so we will try a multi-node setup to offload Rancher. We have several hundred containers and around 80 hosts managed by Rancher.
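For anyone who wants to check the same thing on their own server: a rough way to judge overload is to divide the load average by the number of CPUs. The snippet below is only a sketch of that arithmetic. The function names and the 1.0 threshold are my own assumptions, not anything Rancher provides, and `os.getloadavg()` is only available on Unix-like systems.

```python
import os


def load_per_cpu(load_average: float, cpu_count: int) -> float:
    """Return the load average divided by the number of CPUs."""
    if cpu_count <= 0:
        raise ValueError("cpu_count must be positive")
    return load_average / cpu_count


def is_overloaded(load_average: float, cpu_count: int, threshold: float = 1.0) -> bool:
    """Treat the host as overloaded when load per CPU exceeds the threshold.

    The default threshold of 1.0 is a common rule of thumb, assumed here
    for illustration only.
    """
    return load_per_cpu(load_average, cpu_count) > threshold


if __name__ == "__main__":
    # Figures from the post above: 8 vCPUs, load average between 20 and 40.
    for load in (20, 40):
        print(f"load {load} on 8 vCPUs -> {load_per_cpu(load, 8):.1f} per CPU")

    # Live check on the current machine (Unix-like systems only).
    one_min, five_min, fifteen_min = os.getloadavg()
    cpus = os.cpu_count() or 1
    print(f"15-min load per CPU here: {load_per_cpu(fifteen_min, cpus):.2f}")
    print("overloaded" if is_overloaded(fifteen_min, cpus) else "within capacity")
```

With the numbers from our VM, that works out to roughly 2.5 to 5 runnable tasks per vCPU, which fits the sluggishness we saw.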
@demarant do you have Rancher agents in the reconnecting state? That's usually a cause of thread exhaustion on the Rancher servers, resulting in sluggishness.
Yes, we did have a lot of agent.reconnect TIMEOUTs. We moved MySQL to another server and gave the Rancher Java process more memory (8GB), and we still see a lot of processes stuck…see the picture below. I think it is related to the issue "Agent.reconnect process stuck forever for the agents not linked to any host" (Issue #5349, rancher/rancher on GitHub).
Restarting the server helps only temporarily; the stuck processes come back, or new ones appear.
I am trying to figure out how to get rid of all those stuck processes…any tips?
@demarant did you get to see whether introducing an HA setup helped with the problem? I am going through the same issues as you. Also, have you tried upgrading to 1.2 to see if that helps?
Hi, in the end what helped most was giving the Rancher MySQL much more RAM. In our setup (80 hosts, several hundred containers) we had to give it 32GB RAM and 16 cores. Moving to Rancher 1.2 also helped a bit…but new issues came up…we even had to move some stacks back to Rancher 1.1.4. Rancher and Docker are both moving so fast that it is very complex and frustrating to keep up…in any case, overall it is OK, as we gain speed through consistent deployments.
@demarant was your Rancher server actually resource constrained? I was running an 8GB server with 2 cores. It was not using up all of its resources, but I increased it to 16GB and 4 cores and did not notice any significant difference other than the UI being slightly more snappy. So I am guessing my issues are not necessarily related to yours. Yes, the instability is very frustrating. I was never able to get anything running on Rancher 1.2, even starting with a brand new setup, which is making us look at other orchestration solutions for the future.