Memory/CPU leak?

For the last few months I’ve been battling some issues in my Rancher setup on AWS. I have 4 m4.large hosts, one of which acts primarily as a management server where rancher-server and some non-scalable container services run (WFM, which has problems with user sessions if there’s more than one container spun up; phpMyAdmin, which I only access personally from my whitelisted home IP; stuff like that), so that resource constraints on rancher-server don’t impact everything else. Every few weeks I have to restart all of the hosts to get things working again. Nodes will spiral up to loads of around 60; I haven’t really determined whether that’s CPU load or running out of memory, because most of the time it’s faster to just stop all four servers (or at least the three normal hosts) from the AWS console, boot them back up, mark them as management hosts until the infrastructure containers go green, and then mark them as regular hosts again so the real services launch.
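For the record, the CLI equivalent of that cold-restart cycle is roughly the following; the instance IDs are placeholders, and I normally just do this from the console:

```bash
# Stop the worker hosts (instance IDs are placeholders).
aws ec2 stop-instances --instance-ids i-0aaa1111 i-0bbb2222 i-0ccc3333

# Wait until they are fully stopped, then boot them back up.
aws ec2 wait instance-stopped --instance-ids i-0aaa1111 i-0bbb2222 i-0ccc3333
aws ec2 start-instances --instance-ids i-0aaa1111 i-0bbb2222 i-0ccc3333
```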

During this time I’ve upgraded Rancher server and RancherOS versions probably a half dozen times. When I’ve tried less drastic measures, like killing my user containers, I’ve gotten errors killing them, something about the overlay or aufs filesystem (it’s been a while and I didn’t screenshot it at the time). Has anyone else encountered this kind of problem before, or have any suggestions as to what to look into?

I really don’t want to give up on the promise of Rancher, but even though the setup I moved to with Rancher has more redundancy, it also has much more frequent downtime. If I can’t get things stabilized, I think I’m just going to go back to Ubuntu and plain nginx virtual hosts.

There have been some fixes for aggressive CPU/memory use, but it’s hard to tell without a lot more info. We would be in a world of hurt if running a few containers got Rancher hosts hanging/crashing. If you can share exactly what you are running (stack definitions), I can try to reproduce it and see if we can fix something that is broken. m4.large can be sufficient, but if you are running more CPU-hungry containers, it might not be. If you have a setup where it happens, you can ping me on Slack and I can take a look at it. As for troubleshooting, keeping one host aside after it happens so you can check what the root cause was would help; it’s pretty hard to determine what’s going on without that information.

They’re mostly globally scaled instances of stormerider/rancher-wordpress-nginx-trusty and stormerider/rancher-wpmu-nginx-trusty, scheduled against a host label called “role” which I set to “host” for those services. On the management host, I have a few stormerider/docker-extplorer containers and phpmyadmin/phpmyadmin. All the WordPress stuff and WFM use a bind mount to Elastic File System (EFS) via a mount I do manually when I reboot the host (not ideal, but I never had luck with convoy/rancher-nfs… I want all the sites on the same EFS volume to avoid running out of IO burst credits, rather than an individual EFS per site).
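The EFS mount itself looks roughly like this (the filesystem ID, region, and mount point here are placeholders, with the mount options from the AWS EFS docs):

```bash
# Mount the shared EFS filesystem over NFSv4.1 so all sites share one volume.
sudo mkdir -p /mnt/efs
sudo mount -t nfs4 -o nfsvers=4.1,rsize=1048576,wsize=1048576,hard,timeo=600,retrans=2 \
    fs-12345678.efs.us-east-1.amazonaws.com:/ /mnt/efs
```

The containers then bind-mount their site directories from under that mount point.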

Last weekend when this happened, I couldn’t keep a stable connection to the affected VM to isolate things further. Are there any logs I should gather, or steps I should take, when it happens again? From the Rancher server the hosts usually show as disconnected until after the cold reboot.
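For next time, this is roughly what I’m planning to try to capture if I can keep an SSH session alive long enough (container and service names are assumptions from my setup and may differ):

```bash
# Kernel ring buffer -- an OOM kill or an overlay/aufs error should show up here.
dmesg | tail -n 200 > /tmp/dmesg.txt

# The Rancher agent's own logs (assuming the agent container is named rancher-agent).
docker logs --tail 500 rancher-agent > /tmp/rancher-agent.log 2>&1

# RancherOS system containers and the syslog service (names may vary by RancherOS version).
sudo system-docker ps -a > /tmp/system-docker-ps.txt
sudo system-docker logs --tail 200 syslog > /tmp/syslog.txt 2>&1
```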

I haven’t customized any of the default RAM/CPU settings on the services. Oh, and I have New Relic Infrastructure containers reporting back to New Relic for broader visibility.

I’ll post the sanitized compose files later on; I haven’t moved to secrets yet either. What’s the best way to provide them: just paste them into a code block, or is there an upload location I can drop them at?

Code block, GitHub gist, similar services…

So I started getting some alerts today and was able to SSH in and run ps auxw and free: https://gist.github.com/9d5fdda65203b19f89a13d9ecb09272b – I also have a screenshot from htop: https://www.dropbox.com/s/w80zmt0u06r6mgf/Screenshot%202018-02-07%2020.11.01.png?dl=0 – it looks like it’s more memory-bound than CPU-bound.
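For anyone following along, the snapshot in the gist came from commands along these lines; the docker stats line is an extra I’d add next time to tie the heavy processes back to specific containers:

```bash
# Overall memory picture.
free -m

# Top processes by %MEM (column 4 of ps aux output; works with busybox ps too).
ps auxw | sort -rnk 4 | head -n 20

# One-shot per-container CPU/memory usage.
docker stats --no-stream
```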

I also noticed that the instance was flagged as having failed one of the two EC2 health checks. I couldn’t leave it alone for a long period of time right now (I had another instance that had just locked up completely, to the point of not responding to ping), so I took advantage of the situation to upgrade both nodes to RancherOS 1.2.0 and get them operational again via the reboots that come with the upgrade. (Granted, the node that couldn’t ping had to be forcefully rebooted via the AWS console first, then upgraded and rebooted again. I checked the EC2 instance console log before I stopped it, and it only showed the RancherOS boot screen, no kernel panic or anything like that.)
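The same health-check and console information can be pulled with the AWS CLI if that’s easier to grab next time (the instance ID is a placeholder):

```bash
# Shows which of the two EC2 health checks failed (system status vs. instance status).
aws ec2 describe-instance-status --instance-ids i-0aaa1111

# Fetch the instance console output, which is where a kernel panic would normally appear.
aws ec2 get-console-output --instance-id i-0aaa1111 --output text
```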

Finally had a chance to sanitize my compose files as well: https://github.com/stormerider/rancher-compose-files

If this happens again, what log files should I attempt to capture for the investigation?
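One more angle that might be worth capturing, assuming the management host runs the stock rancher/server container: the server’s own logs from around the time a host drops to disconnected.

```bash
# On the management host: find the rancher-server container, then grab its recent logs.
docker ps --filter ancestor=rancher/server
docker logs --since 1h <rancher-server-container-id> > /tmp/rancher-server.log 2>&1
```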

There’s also one site in that environment that does ecommerce using an external merchant account (the customer handles the cart and shipping info; the merchant handles anything that touches transactions). I upgraded my New Relic licenses, bringing it to two hosts covered (of four; one of the others is the management host, and the remaining one is just a regular host without New Relic for now). I then configured those nodes with the proper license key, so whether that site’s traffic comes from the browser or the CloudFlare cache, it should now land on a node running the New Relic agent. One node is in us-east-1a and the other is in us-east-1c. Now 100% of that customer’s traffic is going through New Relic, as opposed to the ~33% from before. I’m still not seeing anything jump out at me, though. I need to take a look at the New Relic Infrastructure monitoring, in case it caught something going on when a host spiraled out of control.
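For reference, the infrastructure agent on those nodes is just the stock New Relic container; a sketch of the run command is below, with the license key redacted and the flags written from memory of New Relic’s container install docs, so double-check them against the current docs:

```bash
# New Relic Infrastructure agent in a container; host networking/PID so it can see the host.
docker run -d --name newrelic-infra \
    --network=host --pid=host --privileged --cap-add=SYS_PTRACE \
    -v /:/host:ro \
    -v /var/run/docker.sock:/var/run/docker.sock \
    -e NRIA_LICENSE_KEY=REDACTED \
    newrelic/infrastructure:latest
```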