Hi,
we have the following problem. We have 30 identical nodes in a queue, all with 64GB RAM. After some
time, the nodes show different amounts of free memory (between 48GB and 61GB) WITHOUT running any jobs.
The system processes are the same, no zombies, …
Please keep in mind that the concept of “free memory” may be different from what you expect:
MemFree is memory that is actually unused. From a kernel developer's point of view, that’s wasted memory.
Buffers and Cached are memory that the memory management system uses dynamically to (generally speaking) cache data, i.e. the results of “disk reads”.
If processes request memory, but no unused memory is available, then the memory management will dynamically reduce the amount of memory used for buffers and caches.
If your system interacts with file systems / block devices (read and write operations) and unused memory is available, the buffers / caches will dynamically grow, to avoid future slow block device interaction.
The node you call “bad” has had a lot of block device interaction, so its memory is in use for buffers and caches. The other node hasn't (yet).
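To make the distinction concrete, here is a minimal Python sketch that reads /proc/meminfo-style text and adds MemFree, Buffers and Cached together. The function names and the sample numbers are made up for illustration. The sum is only a rough upper bound, since not everything counted under Cached can be dropped; newer kernels also report a MemAvailable field, which is the kernel's own estimate.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of field name -> value in kB."""
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        # Keep only lines of the form "Name:   12345 kB"
        if sep and parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def reclaimable_estimate_kb(fields):
    """Rough estimate: unused memory plus buffers and page cache (upper bound)."""
    return fields["MemFree"] + fields.get("Buffers", 0) + fields.get("Cached", 0)


# Made-up sample values, NOT taken from the nodes discussed in this thread.
SAMPLE = """\
MemTotal:       65843076 kB
MemFree:        50331648 kB
Buffers:          524288 kB
Cached:         10485760 kB
"""

if __name__ == "__main__":
    try:
        with open("/proc/meminfo") as f:
            text = f.read()
    except OSError:
        text = SAMPLE  # fall back to the sample on non-Linux systems
    info = parse_meminfo(text)
    print("MemFree (kB):             ", info["MemFree"])
    print("Free + buffers/cache (kB):", reclaimable_estimate_kb(info))
```

Running this on the “bad” and the “good” node should give a much smaller difference for the second number than for the first, if the explanation above is right.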
If you monitor your system long-term, you’ll see that right after boot you have a lot of “free” memory, which gradually decreases over time. It gets used either for applications or for buffers/caches, and that ratio is likely to change over time if you have different workloads. If you see significant amounts of “MemFree” all the time, you should reduce the available amount of physical (or configured virtual) memory; it’s a waste of resources. Coming from the other side, you have insufficient memory once you see significant swap activity or insufficient I/O throughput caused by buffer/cache allocations that are too small.
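One way to check for “significant swap operations” is to compare the pswpin and pswpout counters from /proc/vmstat at two points in time. The sketch below assumes the usual “name value” layout of that file. The helper names are hypothetical, and what counts as “significant” is left to you.

```python
def parse_vmstat(text):
    """Parse /proc/vmstat-style text ("name value" per line) into a dict."""
    counters = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            counters[parts[0]] = int(parts[1])
    return counters


def swap_pages_delta(before, after):
    """Return (pages swapped in, pages swapped out) between two snapshots."""
    swapped_in = after.get("pswpin", 0) - before.get("pswpin", 0)
    swapped_out = after.get("pswpout", 0) - before.get("pswpout", 0)
    return swapped_in, swapped_out


if __name__ == "__main__":
    import time

    with open("/proc/vmstat") as f:
        first = parse_vmstat(f.read())
    time.sleep(10)  # sampling interval, chosen arbitrarily
    with open("/proc/vmstat") as f:
        second = parse_vmstat(f.read())
    print("pages swapped in/out:", swap_pages_delta(first, second))
```

Counters that stay at (or near) zero over a longer period suggest the node is not short on memory, whatever MemFree says.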
Jobs looking for >50GB will not start
That’s interesting. Who’s reporting this, the OS? Or is some starter process looking at the memory stats and complaining because MemFree seems too low?
In order to clear buffers/caches, you could issue “sync; echo 3 > /proc/sys/vm/drop_caches” as root, which is a one-shot operation. Running “cat /proc/meminfo” right after that command should show that most of the memory is reported as “free” again.
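If you would rather do this from a script, e.g. to test whether the jobs start afterwards, the same operation could look roughly like this in Python. The function name is made up, and the path parameter is only there so the function can be tried against an ordinary file. Writing to the real file needs root.

```python
import os

DROP_CACHES = "/proc/sys/vm/drop_caches"


def drop_caches(path=DROP_CACHES, mode="3"):
    """Flush dirty pages to disk, then ask the kernel to drop clean caches.

    mode "1" drops the page cache, "2" drops dentries and inodes, "3" both.
    """
    os.sync()  # same purpose as the "sync" in the shell one-liner
    with open(path, "w") as f:
        f.write(mode + "\n")


if __name__ == "__main__":
    drop_caches()  # raises PermissionError when not run as root
```

This is a diagnostic aid rather than a fix: the caches will fill up again as soon as the node does I/O.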