Rancher server memory, crash after a day or two

The Rancher server is unusable: it keeps crashing after a few days.

It could be the kernel randomly killing processes to free memory, or Java killing itself on the Rancher server.

I can't even stop the container:
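One way to tell the two cases apart is to look for OOM-killer messages in the kernel log. The filter below is only a rough sketch: the function name `oom_events` is made up for illustration, and the exact wording of the kernel messages varies between kernel versions, so the patterns are an assumption rather than an exhaustive list.

```shell
#!/bin/sh
# Hypothetical helper: keep only lines that look like OOM-killer activity.
# Reads kernel log text on stdin; the patterns are a best guess and may
# need adjusting for a given kernel version.
oom_events() {
    grep -iE 'out of memory|oom-killer|killed process'
}
```

Typical use would be `dmesg | oom_events`. If nothing shows up around the time of the crash, the kernel probably did not kill the process and the JVM itself is the more likely suspect.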

rancher@container-01:~$ docker stop 10a91b7e2104
Error response from daemon: Cannot stop container 10a91b7e2104: Cannot kill container 10a91b7e21047913dbdafa6145a07c5f9def021f33afd56b1b2f5298be015405: rpc error: code = 2 desc = containerd: process not found for container

I posted this in the Rancher forums, but it might be a RancherOS issue.

Rancher server 1.6.10
Rancher OS 1.1.0
docker-17.03.2-ce

Rancher server trace:
https://pastebin.com/nTSpt5dS

I'm investigating a memory ballooning issue, similar to this:

I will update accordingly.

Hi

You don't say how much RAM the server instance has. I used to get this too when running with insufficient RAM. I currently use a t2.medium (4 GB RAM) for my rancher-server and it is solid.

The server has 32 GB. That said, it hosts many other VMs and is actually under resource constraints (it's a home lab environment), so that might be the explanation right there.

No problems since I disabled the ballooning driver in the RancherOS VMware tools. Time for a lab upgrade, I guess!

That said, it's interesting how conditions that look unrelated at first sight turn out to be the cause of the problem. In my naive mind, Java and ballooning would just play along like good kids, but they don't. My lab does overallocate RAM to some extent, but that's often the case with virtualization, and ballooning, though indicative of memory pressure, should not cause this kind of issue.

I wonder if the official VMware tools would have the same effect.

Reference: https://support.azul.com/hc/en-us/articles/115001559526-VMware-Balloon-Driver

Quick steps:

Find the module name: /sbin/lsmod | grep balloon
Remove the module: sudo modprobe -r vmw_balloon
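
The two steps above can be combined into a small script. This is a minimal sketch, not something tested on RancherOS itself: the function names `find_balloon_module` and `unload_balloon` are made up for illustration, and it assumes `lsmod`, `awk`, `sudo` and `modprobe` are available on the host.

```shell
#!/bin/sh
# Hypothetical helper: print the first module whose name contains "balloon",
# reading lsmod-style output on stdin (the header line is skipped).
find_balloon_module() {
    awk 'NR > 1 && $1 ~ /balloon/ { print $1; exit }'
}

# Hypothetical helper: unload the balloon module if one is loaded.
# Needs root; assumes lsmod and modprobe exist on this system.
unload_balloon() {
    mod=$(lsmod | find_balloon_module)
    if [ -n "$mod" ]; then
        sudo modprobe -r "$mod"
    else
        echo "no balloon module loaded"
    fi
}
```

Call `unload_balloon` to run it. Keep in mind that `modprobe -r` only unloads the module from the running kernel, so it may well be loaded again on the next boot.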