Agent's memory usage too high: see use case

Hi.

I understand that this may not be Rancher’s main use case, but I would like to know if there are any plans to work on memory usage in the future.

Note that I am not referring to rancher/server as I expect it to be quite demanding, since it works as a centralized provisioning and monitoring server.

Use case: lease a bunch of VMs around the world, each with 512MB to 1GB of memory. These will be used to deploy Consul for consensus-based monitoring.

Issue:

  • After installing Docker, sar -r shows a memory commit of roughly 11%, and sar -B shows low pgpgin/pgpgout (the exact invocations are sketched right after this list)
    -> This is totally fine
  • After starting Rancher (rancher/agent), memory commit jumps to 350-400%, and pgpgin/pgpgout climb into the tens of thousands.
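
For reference, these are roughly the sar invocations behind those numbers (the 1-second interval and 5 samples are arbitrary):

sar -r 1 5   # memory utilization, including %commit
sar -B 1 5   # paging activity: pgpgin/s, pgpgout/s, page faults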

Using sysdig to investigate the memory usage was not that helpful, since rancher/agent (or was it rancher/agent-instance?) apparently shares its process context with the host.

  • Any idea what makes the agent so greedy?
  • Any plans to tune the agent’s memory usage in future releases?

Thanks

PS: I am only asking because Rancher has become one of my favorite tools, and it is too bad I cannot use it on these remote VMs.

Hello?

Am I the only one with this kind of question?

There is certainly not a ton of RAM to go around on a 512MB instance, but the agent doesn’t need anywhere near 2GB as you seem to be suggesting.

This sounds like a misunderstanding of one or more of:

  • The meaning of “commit” in SAR:
    • I’m not sure what it really means, and googling without a Red Hat subscription is not very helpful (there is a quick way to check the underlying numbers yourself; see the commands right after this list):

      Percentage of memory needed for current workload in relation to the total amount of memory (RAM+swap). This number may be greater than 100% because the kernel usually overcommits memory.

  • The meaning of virtual address space vs resident memory:
    • Resident (RES in top) is what is actually being used. Address space (VIRT) is largely meaningless.

  • The meaning of “used” memory vs cache & buffers:
    • Memory in Linux is generally always all “used”, because using it for cache is free until it’s needed.

  • The meaning of pgpgin/out:
    • These count essentially all disk I/O, not what you would think of as swapping blocks of memory in and out because there isn’t enough to go around.
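
If you want to check these directly, here is a quick sketch (as far as I can tell, sar’s %commit comes from Committed_AS relative to RAM+swap; 7023 is charon’s PID in the top output below, substitute whatever process you’re curious about):

grep -E 'MemTotal|SwapTotal|Committed_AS' /proc/meminfo   # the inputs behind %commit
grep -E 'VmSize|VmRSS' /proc/7023/status                  # address space vs resident memory for one process
free -m                                                   # how much of "used" is really cache & buffers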

Here’s top running on a fresh 512MB DigitalOcean Ubuntu 14.04 droplet created with docker-machine. I started a busybox container, which also starts the Network Agent (rancher/agent-instance) to provide the overlay network… Note that there is virtually no RAM “free” (see above), but also no swap being used.

root@vjf-512:~# top
top - 15:47:00 up 18 min,  1 user,  load average: 0.11, 0.12, 0.12
Tasks:  87 total,   2 running,  85 sleeping,   0 stopped,   0 zombie
%Cpu(s):  0.3 us,  1.0 sy,  0.0 ni, 98.7 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem:    501772 total,   487456 used,    14316 free,    11708 buffers
KiB Swap:  1048572 total,        0 used,  1048572 free.   289296 cached Mem

  PID USER      PR  NI    VIRT    RES    SHR S %CPU %MEM     TIME+ COMMAND
 4588 root      20   0  468472  51736  13556 S  0.0 10.3   0:51.55 docker
 5140 root      20   0  119492  29756   4184 S  0.3  5.9   0:06.64 python
 5498 root      20   0  374808  17648   3660 S  0.3  3.5   0:06.57 .cadvisor
 6128 root      20   0  249908  14876   7132 S  0.0  3.0   0:00.07 exe
 6120 root      20   0  175120  14436   7032 S  0.0  2.9   0:00.09 exe
 7049 root      20   0  192088   7500   2376 S  0.0  1.5   0:00.52 rancher-net
 6063 root      20   0   23320   4620   1856 S  0.0  0.9   0:00.12 bash
 5469 root      20   0  214708   4340   2824 S  0.0  0.9   0:02.19 host-api
 6013 root      20   0  105636   4320   3324 S  0.0  0.9   0:00.19 sshd
 7002 root      20   0  189032   3200   2152 S  0.0  0.6   0:00.14 host-api
 6485 root      20   0  100492   3132   1540 S  0.0  0.6   0:00.04 rancher-metadat
  914 root      20   0   61372   3072   2392 S  0.0  0.6   0:00.03 sshd
    1 root      20   0   33492   2736   1408 S  0.0  0.5   0:01.74 init
 7023 root      20   0  749536   2520   1724 S  0.3  0.5   0:00.33 charon
 5097 root      20   0   18208   1860   1360 S  0.0  0.4   0:00.05 run.sh
 6922 root      20   0  104592   1788   1004 S  0.0  0.4   0:00.16 monit
  794 root      20   0   43448   1780   1388 S  0.0  0.4   0:00.06 systemd-logind
  719 message+  20   0   39624   1676    836 S  0.0  0.3   0:00.13 dbus-daemon
 6553 root      20   0  183364   1620    580 S  0.0  0.3   0:00.06 rancher-dns
 7437 root      20   0   24948   1616   1116 R  0.3  0.3   0:00.02 top

If you force the kernel to drop its caches (echo 3 > /proc/sys/vm/drop_caches), all that space becomes “free”:

top - 15:50:11 up 21 min,  1 user,  load average: 0.01, 0.07, 0.11
Tasks:  86 total,   2 running,  84 sleeping,   0 stopped,   0 zombie
%Cpu(s):  1.7 us,  0.3 sy,  0.0 ni, 98.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem:    501772 total,   197672 used,   304100 free,     3712 buffers
KiB Swap:  1048572 total,        0 used,  1048572 free.    40612 cached Mem

  PID USER      PR  NI    VIRT    RES    SHR S %CPU %MEM     TIME+ COMMAND
...

Vincent, thanks for replying. Note that you are not directly addressing my question:

  • I do not believe that these numbers come from heavy swap usage. In fact, these VMs come without swap space :wink:
  • I did not make any reference to “free/used” memory either: rather, I am trying to figure out where these numbers (commit & pages out) come from.

400% committed memory is surprising because all this VM is running is Docker + the Rancher agent.

By elimination, Docker alone does not cause these high numbers, although it is by no means free!
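
To make the “by elimination” part concrete, the comparison I have in mind looks roughly like this (the agent run command is from memory, and the server URL and token are placeholders for whatever the Rancher UI hands out):

grep Committed_AS /proc/meminfo      # baseline, with only Docker running

sudo docker run -d --privileged -v /var/run/docker.sock:/var/run/docker.sock \
  rancher/agent http://<rancher-server>:8080/v1/scripts/<registration-token>

grep Committed_AS /proc/meminfo      # again once the agent and its helper containers are up
sar -B 1 5                           # plus a quick look at paging activity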

I understand that virtual memory and pgout are not joined at the hip. I even put together a small script to parse charon’s /proc/<pid>/maps file to understand what it is doing with 350 times as much virtual memory as RSS (spoiler: it’s not file mappings or libraries; it looks like it is simply asking the kernel for gobs of space without writing to it. Pre-allocations?)
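
It was nothing fancy; the idea was roughly this (a bash sketch rather than the exact script, with charon’s PID hard-coded as an example):

PID=7023                              # charon on my box; use whatever pidof charon returns
file=0; anon=0
while read range perms offset dev inode path; do
  start=$((16#${range%-*})); end=$((16#${range#*-}))
  size=$(( end - start ))
  case "$path" in
    /*) file=$(( file + size ));;     # backed by a real file (binaries, libraries, mmapped files)
    *)  anon=$(( anon + size ));;     # anonymous mappings, [heap], [stack], ...
  esac
done < /proc/$PID/maps
echo "file-backed: $(( file / 1024 )) KiB, anonymous: $(( anon / 1024 )) KiB"
grep -E 'VmSize|VmRSS' /proc/$PID/status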

So, this leaves me with the mystery of the high pgouts. Like you, after starting from scratch, I have a VM that is currently not displaying this behavior: while its commit% is about 200, it does not seem to be paging out much. Of course, this is not the same scenario as what I was originally referring to, where the agent had been running for weeks. I will let it age again and see if this behavior resurfaces.
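
If sysstat’s cron collection is enabled (ENABLED="true" in /etc/default/sysstat on Ubuntu), checking back later is just a matter of reading the daily data files, e.g.:

sar -B -f /var/log/sysstat/sa15   # paging history for the 15th of the month
sar -r -f /var/log/sysstat/sa15   # memory/commit history for the same day

Debian/Ubuntu path shown; RHEL-style systems keep these under /var/log/sa instead.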