[QUOTE=ab;39171]On 08/18/2017 07:14 PM, alpha754293 wrote:[color=blue]
Code:
aes@aes3:~> free
             total       used       free     shared    buffers     cached
Mem:     132066080  122653184    9412896     155376         24   84712440
-/+ buffers/cache:   37940720   94125360
Swap:    268437500      35108  268402392
--------------------[/color]
This is showing what we would hope: that swap is basically unused. Sure,
35 MiB are in use, but that is next to nothing, and it is probably only
data that belongs in swap, like libraries loaded once at startup and never
needed again. You could tune swappiness further, but I can hardly imagine
it will make a big difference, since the system is not short of memory:
9 GiB is completely free, and another 94 GiB is used by cache and freeable.
[color=blue]
(I think that you meant /proc/sys/vm/swappiness and that is still at the
default value of 60.)[/color]
Change that if you want; sixty (60) is the default I have as well on my
boxes that I have not tuned, but again I doubt it matters much, since
the system is currently using almost no swap now that xorg is not
trying to use all of the virtual memory the system has available.[/quote]
Xorg isn’t using it, but the cache (pagecache and slab objects) is: 81.74 GiB of it, to be precise.
So when an application requests, say, ca. 70 GiB of RAM on a system with only 128 GB installed, the kernel pushes the new demand into swap, and this is where it becomes a problem.
See below for further commentary re: swappiness.
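(For anyone who wants to watch this happen live, something like the following should work; vmstat, free and watch ship with a stock install as far as I know. The si/so columns are swap-in/swap-out activity:)
Code:
# Print memory/swap counters every 5 seconds while the job ramps up;
# watch the si/so (swap-in/swap-out) and cache columns:
vmstat 5
# Or a coarser, human-readable view refreshed every 2 seconds:
watch -n 2 free -m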
[quote=ab][color=blue]
These were screenshots that I took of the terminal window (ssh) earlier.
You can see that on one system, it was caching 80.77 GiB, and on the
other, 94.83 GiB.
This is confirmed because when I run:
Code:
echo 3 > /proc/sys/vm/drop_caches
it clears the cache right away.[/color]
Yes, that makes sense, but I do not understand why there is a perceived
problem considering the system state now that xorg is stopped. The system
is not in need of memory, at least not at the time of the snapshot you took.[/quote]
Again, the root cause of the issue isn’t the swap in and of itself. It first manifested as such, especially with X running, but in runlevel 3 I was able to determine that the root cause is the kernel’s virtual memory subsystem caching pagecache and slab objects.
That is the heart of the issue.
If the comment re: perceived problem was about X running, then okay, sure. But if the comment is that this is a perceived problem at all, then no: it is a real and legitimate problem, and again, the root cause is the virtual memory management portion of the kernel that manages pagecache and slab objects.
Linux marks the RAM that pagecache and slab objects are cached into as used (which is TECHNICALLY true). What it DOESN’T do when an application demands that RAM, though, is release the cache (a la # echo 3 > /proc/sys/vm/drop_caches) so that the cached pagecache and slab objects return to the free memory pool and can be used by a USER application.
THAT is the part that it DOESN’T seem to be doing.
And that is, to be blunt and frank, stupid.
If you have user applications that require RAM, they should take precedence over the OS’ need/desire to cache pagecache and slab objects.
If there is an underlying performance issue such that it is SIGNIFICANTLY slower for the OS to load those objects WITHOUT them having been cached in RAM first, then you should be fixing THAT as the root cause of the issue, and NOT “masking” it by giving the caching of pagecache and slab objects an apparently HIGHER priority than user apps.
That is just dumb.
WHYYYY would you architect a system like that?
Yes, I realise that to Linux, cached objects in RAM = RAM in USE, but it should be intelligent enough to know what is TRULY being used vs. what is only cached, so that the cache can be cleared and the memory/RAM released back into the free/available pool for user apps to use.
THAT is the root cause of the underlying issue.
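(For reference, the manual workaround mentioned above looks like this; run as root. It is a blunt instrument, not a fix, and the kernel documentation suggests syncing first so dirty pages get written out and become droppable:)
Code:
# Flush dirty pages to disk so they can be dropped:
sync
# Drop pagecache, dentries and inodes (slab) in one shot:
echo 3 > /proc/sys/vm/drop_caches
# echo 1 drops pagecache only; echo 2 drops dentries and inodes (slab).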
[quote=ab][color=blue]
So, yes, Xorg was for some reason taking up a lot of RAM (still not
really sure why) so the technical support team from SuSE suggested that
I switch over to run level 3 to see if this is still happening, and it
is.[/color]
I think it may be useful to go back to runlevel five (5) again but without
running anything in it; I suspect it will not show the symptom unless you
run your program in there again. Something about that program is probably
causing the memory leak and asking the system for all possible memory, to
the detriment of everything else. Eventually, I would guess, the kernel
would use the Out Of Memory (OOM) killer to kill xorg in order to free up
what is obviously the biggest memory consumer.[/quote]
I disagree.
The console output of “free” actually tells you that one of the nodes has cached 81.74 GiB of objects, and the other has cached, well…it WAS 94.83 GiB; now it is 116.31 GiB.
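(A quick sketch of where those numbers come from, assuming the older free layout shown above, where the cached column is the seventh field of the Mem: line, in KiB:)
Code:
# Convert the cached column of free(1) from KiB to GiB:
free | awk '$1 == "Mem:" { printf "%.2f GiB cached\n", $7 / 1048576 }'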
Here is the output of ps aux for that node:
Code:
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 1 0.0 0.0 116160 4960 ? Ss Aug15 0:13 /usr/lib/systemd/systemd --switched-root --system --deserialize 21
root 2 0.0 0.0 0 0 ? S Aug15 0:00 [kthreadd]
root 3 0.0 0.0 0 0 ? S Aug15 0:02 [ksoftirqd/0]
root 5 0.0 0.0 0 0 ? S< Aug15 0:00 [kworker/0:0H]
root 7 0.0 0.0 0 0 ? S Aug15 0:30 [kworker/u33:0]
root 8 0.0 0.0 0 0 ? S Aug15 0:03 [migration/0]
root 9 0.0 0.0 0 0 ? S Aug15 0:00 [rcu_bh]
root 10 0.0 0.0 0 0 ? S Aug15 4:47 [rcu_sched]
root 11 0.0 0.0 0 0 ? S Aug15 0:00 [watchdog/0]
root 12 0.0 0.0 0 0 ? S Aug15 0:00 [watchdog/1]
root 13 0.0 0.0 0 0 ? S Aug15 0:00 [migration/1]
root 14 0.0 0.0 0 0 ? S Aug15 0:00 [ksoftirqd/1]
root 16 0.0 0.0 0 0 ? S< Aug15 0:00 [kworker/1:0H]
root 17 0.0 0.0 0 0 ? S Aug15 0:00 [watchdog/2]
root 18 0.0 0.0 0 0 ? S Aug15 0:00 [migration/2]
root 19 0.0 0.0 0 0 ? S Aug15 0:00 [ksoftirqd/2]
root 21 0.0 0.0 0 0 ? S< Aug15 0:00 [kworker/2:0H]
root 22 0.0 0.0 0 0 ? S Aug15 0:00 [watchdog/3]
root 23 0.0 0.0 0 0 ? S Aug15 0:00 [migration/3]
root 24 0.0 0.0 0 0 ? S Aug15 0:00 [ksoftirqd/3]
root 26 0.0 0.0 0 0 ? S< Aug15 0:00 [kworker/3:0H]
root 27 0.0 0.0 0 0 ? S Aug15 0:00 [watchdog/4]
root 28 0.0 0.0 0 0 ? S Aug15 0:00 [migration/4]
root 29 0.0 0.0 0 0 ? S Aug15 0:00 [ksoftirqd/4]
root 31 0.0 0.0 0 0 ? S< Aug15 0:00 [kworker/4:0H]
root 32 0.0 0.0 0 0 ? S Aug15 0:00 [watchdog/5]
root 33 0.0 0.0 0 0 ? S Aug15 0:00 [migration/5]
root 34 0.0 0.0 0 0 ? S Aug15 0:00 [ksoftirqd/5]
root 36 0.0 0.0 0 0 ? S< Aug15 0:00 [kworker/5:0H]
root 37 0.0 0.0 0 0 ? S Aug15 0:00 [watchdog/6]
root 38 0.0 0.0 0 0 ? S Aug15 0:00 [migration/6]
root 39 0.0 0.0 0 0 ? S Aug15 0:00 [ksoftirqd/6]
root 41 0.0 0.0 0 0 ? S< Aug15 0:00 [kworker/6:0H]
root 42 0.0 0.0 0 0 ? S Aug15 0:00 [watchdog/7]
root 43 0.0 0.0 0 0 ? S Aug15 0:00 [migration/7]
root 44 0.0 0.0 0 0 ? S Aug15 0:00 [ksoftirqd/7]
root 46 0.0 0.0 0 0 ? S< Aug15 0:00 [kworker/7:0H]
root 47 0.0 0.0 0 0 ? S Aug15 0:00 [watchdog/8]
root 48 0.0 0.0 0 0 ? S Aug15 0:04 [migration/8]
root 49 0.0 0.0 0 0 ? S Aug15 0:02 [ksoftirqd/8]
root 51 0.0 0.0 0 0 ? S< Aug15 0:00 [kworker/8:0H]
root 53 0.0 0.0 0 0 ? S Aug15 0:00 [watchdog/9]
root 54 0.0 0.0 0 0 ? S Aug15 0:00 [migration/9]
root 55 0.0 0.0 0 0 ? S Aug15 0:01 [ksoftirqd/9]
root 57 0.0 0.0 0 0 ? S< Aug15 0:00 [kworker/9:0H]
root 58 0.0 0.0 0 0 ? S Aug15 0:00 [watchdog/10]
root 59 0.0 0.0 0 0 ? S Aug15 0:00 [migration/10]
root 60 0.0 0.0 0 0 ? S Aug15 0:00 [ksoftirqd/10]
root 62 0.0 0.0 0 0 ? S< Aug15 0:00 [kworker/10:0H]
root 63 0.0 0.0 0 0 ? S Aug15 0:00 [watchdog/11]
root 64 0.0 0.0 0 0 ? S Aug15 0:00 [migration/11]
root 65 0.0 0.0 0 0 ? S Aug15 0:00 [ksoftirqd/11]
root 67 0.0 0.0 0 0 ? S< Aug15 0:00 [kworker/11:0H]
root 68 0.0 0.0 0 0 ? S Aug15 0:00 [watchdog/12]
root 69 0.0 0.0 0 0 ? S Aug15 0:00 [migration/12]
root 70 0.0 0.0 0 0 ? S Aug15 0:00 [ksoftirqd/12]
root 72 0.0 0.0 0 0 ? S< Aug15 0:00 [kworker/12:0H]
root 73 0.0 0.0 0 0 ? S Aug15 0:00 [watchdog/13]
root 74 0.0 0.0 0 0 ? S Aug15 0:00 [migration/13]
root 75 0.0 0.0 0 0 ? S Aug15 0:00 [ksoftirqd/13]
root 77 0.0 0.0 0 0 ? S< Aug15 0:00 [kworker/13:0H]
root 78 0.0 0.0 0 0 ? S Aug15 0:00 [watchdog/14]
root 79 0.0 0.0 0 0 ? S Aug15 0:00 [migration/14]
root 80 0.0 0.0 0 0 ? S Aug15 0:00 [ksoftirqd/14]
root 82 0.0 0.0 0 0 ? S< Aug15 0:00 [kworker/14:0H]
root 83 0.0 0.0 0 0 ? S Aug15 0:00 [watchdog/15]
root 84 0.0 0.0 0 0 ? S Aug15 0:00 [migration/15]
root 85 0.0 0.0 0 0 ? S Aug15 0:00 [ksoftirqd/15]
root 87 0.0 0.0 0 0 ? S< Aug15 0:00 [kworker/15:0H]
root 88 0.0 0.0 0 0 ? S< Aug15 0:00 [khelper]
root 89 0.0 0.0 0 0 ? S Aug15 0:00 [kdevtmpfs]
root 90 0.0 0.0 0 0 ? S< Aug15 0:00 [netns]
root 91 0.0 0.0 0 0 ? S< Aug15 0:00 [perf]
root 92 0.0 0.0 0 0 ? S< Aug15 0:00 [writeback]
root 93 0.0 0.0 0 0 ? S< Aug15 0:00 [kintegrityd]
root 94 0.0 0.0 0 0 ? S< Aug15 0:00 [bioset]
root 95 0.0 0.0 0 0 ? S< Aug15 0:00 [crypto]
root 96 0.0 0.0 0 0 ? S< Aug15 0:00 [kblockd]
root 101 0.0 0.0 0 0 ? S< Aug15 0:00 [kgraft]
root 102 0.0 0.0 0 0 ? S Aug15 0:00 [khungtaskd]
root 104 0.0 0.0 0 0 ? S Aug15 0:06 [kswapd0]
root 105 0.0 0.0 0 0 ? S Aug15 0:02 [kswapd1]
root 106 0.0 0.0 0 0 ? SN Aug15 0:00 [ksmd]
root 107 0.0 0.0 0 0 ? SN Aug15 0:00 [khugepaged]
root 108 0.0 0.0 0 0 ? S Aug15 0:00 [fsnotify_mark]
root 118 0.0 0.0 0 0 ? S< Aug15 0:00 [kthrotld]
root 128 0.0 0.0 0 0 ? S< Aug15 0:00 [kpsmoused]
root 129 0.0 0.0 0 0 ? S Aug15 0:00 [print/0]
root 130 0.0 0.0 0 0 ? S Aug15 0:00 [print/1]
root 150 0.0 0.0 0 0 ? S< Aug15 0:00 [deferwq]
root 151 0.0 0.0 0 0 ? S Aug15 0:21 [kworker/14:1]
root 188 0.0 0.0 0 0 ? S Aug15 0:00 [kauditd]
root 301 0.0 0.0 0 0 ? S< Aug15 0:00 [ata_sff]
root 304 0.0 0.0 0 0 ? S Aug15 0:00 [khubd]
root 310 0.0 0.0 0 0 ? S< Aug15 0:00 [ttm_swap]
root 348 0.0 0.0 0 0 ? S Aug15 0:00 [scsi_eh_0]
root 349 0.0 0.0 0 0 ? S< Aug15 0:00 [scsi_tmf_0]
root 350 0.0 0.0 0 0 ? S< Aug15 0:00 [scsi_wq_0]
root 352 0.0 0.0 0 0 ? S Aug15 0:00 [scsi_eh_1]
root 353 0.0 0.0 0 0 ? S< Aug15 0:00 [scsi_tmf_1]
root 354 0.0 0.0 0 0 ? S Aug15 0:00 [scsi_eh_2]
root 355 0.0 0.0 0 0 ? S< Aug15 0:00 [scsi_tmf_2]
root 356 0.0 0.0 0 0 ? S Aug15 0:00 [scsi_eh_3]
root 357 0.0 0.0 0 0 ? S< Aug15 0:00 [scsi_tmf_3]
root 358 0.0 0.0 0 0 ? S Aug15 0:00 [scsi_eh_4]
root 359 0.0 0.0 0 0 ? S< Aug15 0:00 [scsi_tmf_4]
root 360 0.0 0.0 0 0 ? S Aug15 0:00 [scsi_eh_5]
root 361 0.0 0.0 0 0 ? S< Aug15 0:00 [scsi_tmf_5]
root 362 0.0 0.0 0 0 ? S Aug15 0:00 [scsi_eh_6]
root 363 0.0 0.0 0 0 ? S< Aug15 0:00 [scsi_tmf_6]
root 388 0.0 0.0 0 0 ? S< Aug15 0:00 [kworker/0:1H]
root 389 0.0 0.0 0 0 ? S< Aug15 0:00 [kworker/11:1H]
root 390 0.0 0.0 0 0 ? S< Aug15 0:00 [kworker/14:1H]
root 391 0.0 0.0 0 0 ? S< Aug15 0:00 [kworker/15:1H]
root 392 0.0 0.0 0 0 ? S< Aug15 0:00 [kworker/8:1H]
root 397 0.0 0.0 0 0 ? S< Aug15 0:00 [bioset]
root 398 0.0 0.0 0 0 ? S< Aug15 0:00 [kworker/9:1H]
root 410 0.0 0.0 0 0 ? S< Aug15 0:00 [kworker/10:1H]
root 417 0.0 0.0 0 0 ? S Aug15 0:00 [btrfs-genwork-1]
root 418 0.0 0.0 0 0 ? S Aug15 0:00 [btrfs-submit-1]
root 419 0.0 0.0 0 0 ? S Aug15 0:00 [btrfs-delalloc-]
root 420 0.0 0.0 0 0 ? S Aug15 0:00 [btrfs-fixup-1]
root 422 0.0 0.0 0 0 ? S Aug15 0:00 [btrfs-endio-met]
root 423 0.0 0.0 0 0 ? S Aug15 0:00 [btrfs-rmw-1]
root 424 0.0 0.0 0 0 ? S Aug15 0:00 [btrfs-endio-rai]
root 425 0.0 0.0 0 0 ? S Aug15 0:00 [btrfs-endio-met]
root 427 0.0 0.0 0 0 ? S Aug15 0:01 [btrfs-freespace]
root 428 0.0 0.0 0 0 ? S Aug15 0:00 [btrfs-delayed-m]
root 429 0.0 0.0 0 0 ? S Aug15 0:00 [btrfs-cache-1]
root 430 0.0 0.0 0 0 ? S Aug15 0:00 [btrfs-readahead]
root 431 0.0 0.0 0 0 ? S Aug15 0:00 [btrfs-flush_del]
root 432 0.0 0.0 0 0 ? S Aug15 0:00 [btrfs-qgroup-re]
root 433 0.0 0.0 0 0 ? S< Aug15 0:00 [kworker/12:1H]
root 434 0.0 0.0 0 0 ? S< Aug15 0:00 [kworker/13:1H]
root 435 0.0 0.0 0 0 ? S Aug15 0:00 [btrfs-cleaner]
root 436 0.0 0.0 0 0 ? S Aug15 0:35 [btrfs-transacti]
root 525 0.0 0.0 43400 9568 ? SLs Aug15 0:02 /usr/lib/systemd/systemd-journald
root 533 0.0 0.0 21732 960 ? Ss Aug15 0:14 /sbin/dmeventd -f
root 539 0.0 0.0 0 0 ? S Aug15 0:00 [btrfs-endio-2]
root 545 0.0 0.0 0 0 ? S Aug18 0:05 [kworker/10:2]
root 557 0.0 0.0 42940 2436 ? Ss Aug15 0:00 /usr/lib/systemd/systemd-udevd
root 759 0.0 0.0 0 0 ? S< Aug15 0:00 [edac-poller]
root 760 0.0 0.0 12032 3792 ? Ss Aug15 0:39 /usr/sbin/haveged -w 1024 -v 0 -F
root 925 0.0 0.0 0 0 ? S< Aug15 0:00 [kvm-irqfd-clean]
root 1805 0.0 0.0 0 0 ? SN Aug15 0:00 [kipmi0]
root 1843 0.0 0.0 0 0 ? S Aug15 0:00 [btrfs-genwork-1]
root 1844 0.0 0.0 0 0 ? S Aug15 0:02 [btrfs-submit-1]
root 1845 0.0 0.0 0 0 ? S Aug15 0:00 [btrfs-delalloc-]
root 1846 0.0 0.0 0 0 ? S Aug15 0:00 [btrfs-fixup-1]
root 1847 0.0 0.0 0 0 ? S Aug15 0:00 [btrfs-endio-1]
root 1848 0.0 0.0 0 0 ? S Aug15 0:00 [btrfs-endio-met]
root 1849 0.0 0.0 0 0 ? S Aug15 0:00 [btrfs-rmw-1]
root 1850 0.0 0.0 0 0 ? S Aug15 0:00 [btrfs-endio-rai]
root 1851 0.0 0.0 0 0 ? S Aug15 0:00 [btrfs-endio-met]
root 1852 0.0 0.0 0 0 ? S Aug15 0:04 [btrfs-endio-wri]
root 1853 0.0 0.0 0 0 ? S Aug15 0:01 [btrfs-freespace]
root 1854 0.0 0.0 0 0 ? S Aug15 0:00 [btrfs-delayed-m]
root 1855 0.0 0.0 0 0 ? S Aug15 0:00 [btrfs-cache-1]
root 1856 0.0 0.0 0 0 ? S Aug15 0:00 [btrfs-readahead]
root 1857 0.0 0.0 0 0 ? S Aug15 0:00 [btrfs-flush_del]
root 1858 0.0 0.0 0 0 ? S Aug15 0:00 [btrfs-qgroup-re]
root 1866 0.0 0.0 0 0 ? S Aug15 0:00 [btrfs-cleaner]
root 1867 0.0 0.0 0 0 ? S Aug15 0:33 [btrfs-transacti]
message+ 2073 0.0 0.0 42920 2924 ? SLs Aug15 0:03 /bin/dbus-daemon --system --address=systemd: --nofork --nopidfile --systemd-activation
avahi 2076 0.0 0.0 20076 1724 ? Ss Aug15 0:24 avahi-daemon: running [aes2.local]
root 2078 0.0 0.0 24964 2708 ? Ss Aug15 0:00 /usr/sbin/smartd -n
root 2081 0.0 0.0 19304 1232 ? Ss Aug15 0:42 /usr/sbin/irqbalance --foreground
nscd 2084 0.0 0.0 802360 1456 ? Ssl Aug15 0:02 /usr/sbin/nscd
root 2085 0.0 0.0 29488 3184 ? SLs Aug15 0:00 /usr/lib/wicked/bin/wickedd-dhcp6 --systemd --foreground
root 2088 0.0 0.0 29488 3424 ? SLs Aug15 0:00 /usr/lib/wicked/bin/wickedd-dhcp4 --systemd --foreground
root 2099 0.0 0.0 29488 3188 ? SLs Aug15 0:00 /usr/lib/wicked/bin/wickedd-auto4 --systemd --foreground
root 2121 0.0 0.0 20096 1584 ? Ss Aug15 0:02 /usr/lib/systemd/systemd-logind
root 2122 0.0 0.0 4440 768 tty1 Ss+ Aug15 0:00 /sbin/agetty --noclear tty1 linux
root 2126 0.0 0.0 0 0 ? S 17:45 0:00 [kworker/0:0]
root 2149 0.0 0.0 337852 2484 ? SLsl Aug15 0:00 /usr/sbin/rsyslogd -n
root 2151 0.0 0.0 0 0 ? S< Aug15 0:00 [kworker/4:1H]
root 2152 0.0 0.0 0 0 ? S< Aug15 0:00 [kworker/6:1H]
root 2156 0.0 0.0 0 0 ? S< Aug15 0:00 [kworker/5:1H]
root 2158 0.0 0.0 0 0 ? S< Aug15 0:00 [kworker/2:1H]
root 2159 0.0 0.0 29612 3568 ? SLs Aug15 0:00 /usr/sbin/wickedd --systemd --foreground
root 2169 0.0 0.0 29516 3280 ? SLs Aug15 0:00 /usr/sbin/wickedd-nanny --systemd --foreground
root 2207 0.0 0.0 0 0 ? S Aug15 0:00 [kworker/u34:1]
root 2322 0.0 0.0 0 0 ? S< Aug15 0:00 [kworker/3:1H]
root 2991 0.0 0.0 437608 15460 ? SLsl Aug15 0:00 /usr/sbin/libvirtd --listen
root 3022 0.0 0.0 46896 3136 ? Ss Aug15 0:00 /usr/sbin/sshd -D
root 3100 0.0 0.0 19608 1212 ? Ss Aug15 0:01 /usr/lib/postfix/master -w
postfix 3102 0.0 0.0 21856 2268 ? S Aug15 0:00 qmgr -l -t fifo -u
root 3129 0.0 0.0 18820 1492 ? Ss Aug15 0:00 /usr/sbin/cron -n
root 3472 0.0 0.0 0 0 ? S Aug19 0:00 [kworker/u32:1]
root 3538 0.6 0.1 812836 141660 ? Ssl Aug15 47:40 /usr/ansys_inc/shared_files/licensing/linx64/ansysli_server
root 3550 0.0 0.0 19064 3592 ? Ss Aug15 1:58 /usr/ansys_inc/shared_files/licensing/linx64/ansysli_monitor -monitor 3538 -restart_port_timeout 15
root 3584 0.0 0.0 16780 2692 ? S Aug15 0:04 /usr/ansys_inc/shared_files/licensing/linx64/lmgrd -c /usr/ansys_inc/shared_files/licensing/license_files -l /usr/ansys_inc/shared_files/licensing/license.log
root 3585 0.0 0.0 127816 7376 ? Ssl Aug15 0:32 ansyslmd -T aes2 11.13 3 -c :/usr/ansys_inc/shared_files/licensing/license_files: -srv LgFV2wwMa2iyCVChj6LuclIeIH7uSthmAgsCjVUTJXsEnEgIdOrnsb832BA3Cnw --lmgrd_start 5993c175 -vdrestart 0
root 4180 0.0 0.0 0 0 ? S 18:00 0:00 [kworker/7:0]
root 5142 0.0 0.0 0 0 ? S< Aug16 0:00 [kworker/1:1H]
root 5601 0.0 0.0 0 0 ? S 10:00 0:00 [kworker/1:0]
root 5943 0.0 0.0 0 0 ? S Aug15 0:00 [btrfs-worker-4]
root 6095 0.0 0.0 0 0 ? S< Aug16 0:00 [kworker/7:1H]
root 6123 0.0 0.0 0 0 ? S 18:15 0:00 [kworker/9:1]
root 6960 0.0 0.0 0 0 ? S 14:15 0:00 [kworker/6:0]
root 6961 0.0 0.0 0 0 ? S Aug17 0:11 [kworker/15:2]
root 7459 0.0 0.0 0 0 ? S Aug15 0:05 [kworker/u34:2]
root 7532 0.0 0.0 87676 4088 ? Ss Aug15 0:00 sshd: ewen [priv]
ewen 7541 0.0 0.0 87676 1780 ? S Aug15 0:00 sshd: ewen@pts/0
ewen 7542 0.0 0.0 14316 3204 pts/0 Ss Aug15 0:00 -bash
root 7848 0.0 0.0 87676 4088 ? Ss Aug16 0:00 sshd: ewen [priv]
ewen 7857 0.0 0.0 87676 1780 ? S Aug16 0:00 sshd: ewen@pts/1
ewen 7858 0.0 0.0 14316 3228 pts/1 Ss Aug16 0:00 -bash
root 8117 0.0 0.0 0 0 ? S 18:30 0:00 [kworker/11:1]
root 8120 0.0 0.0 0 0 ? S 18:30 0:00 [kworker/2:1]
root 8904 0.0 0.0 0 0 ? S 14:30 0:01 [kworker/1:1]
root 9949 0.0 0.0 0 0 ? S Aug16 0:00 [kworker/14:2]
root 10053 0.0 0.0 0 0 ? S 18:45 0:00 [kworker/8:0]
postfix 10362 0.0 0.0 21464 1376 ? S 18:47 0:00 pickup -l -t fifo -u
root 10405 0.0 0.0 0 0 ? S Aug19 0:04 [kworker/15:0]
root 10968 0.0 0.0 0 0 ? S 18:51 0:00 [btrfs-endio-wri]
root 12052 0.0 0.0 0 0 ? S 19:00 0:00 [kworker/9:2]
ewen 12054 0.0 0.0 118380 80776 pts/0 S+ Aug16 0:02 /usr/ansys_inc/v180/CFX/bin/../tools/perl-5.8.0-1/bin/Linux-x86_64/perl -Sx /usr/ansys_inc/v180/CFX/bin/cfx5solve -batch -par-local -part 16 -def Transient.def
ewen 12190 0.3 0.0 14368 1328 pts/0 S+ Aug16 21:15 /usr/ansys_inc/v180/commonfiles/MPI/IBM/9.1.4.2/linx64/bin/mpirun -f /export/home/work/Aerosmart/International 9200i 2006 (solid model) flow regime (wo bubble) CFX mesh2a_files/dp0/CFX-1/CFX/Transient_001.dir/appfile
ewen 12193 0.2 0.0 48240 1692 pts/0 S+ Aug16 18:12 /usr/ansys_inc/v180/commonfiles/MPI/IBM/9.1.4.2/linx64/bin/mpid 0 0 151061506 192.168.1.157 30519 12190 /usr/ansys_inc/v180/commonfiles/MPI/IBM/9.1.4.2/linx64
ewen 12288 99.6 3.0 4422964 4068788 pts/0 Rl+ Aug16 6892:24 /usr/ansys_inc/v180/CFX/bin/linux-amd64/ifort/solver-mpi.exe -par -pri 2 -outopt 0 -nojob
ewen 12289 99.8 0.2 474100 366268 pts/0 R+ Aug16 6904:11 /usr/ansys_inc/v180/CFX/bin/linux-amd64/ifort/solver-mpi.exe -par -pri 2 -outopt 0 -nojob
ewen 12290 99.8 0.2 464608 356372 pts/0 R+ Aug16 6904:49 /usr/ansys_inc/v180/CFX/bin/linux-amd64/ifort/solver-mpi.exe -par -pri 2 -outopt 0 -nojob
ewen 12291 99.7 0.2 456024 345484 pts/0 R+ Aug16 6897:07 /usr/ansys_inc/v180/CFX/bin/linux-amd64/ifort/solver-mpi.exe -par -pri 2 -outopt 0 -nojob
ewen 12292 99.7 0.2 473756 363616 pts/0 R+ Aug16 6898:48 /usr/ansys_inc/v180/CFX/bin/linux-amd64/ifort/solver-mpi.exe -par -pri 2 -outopt 0 -nojob
ewen 12293 99.7 0.2 473668 358092 pts/0 R+ Aug16 6898:28 /usr/ansys_inc/v180/CFX/bin/linux-amd64/ifort/solver-mpi.exe -par -pri 2 -outopt 0 -nojob
ewen 12294 99.7 0.2 463792 352588 pts/0 R+ Aug16 6899:14 /usr/ansys_inc/v180/CFX/bin/linux-amd64/ifort/solver-mpi.exe -par -pri 2 -outopt 0 -nojob
ewen 12295 99.8 0.2 454192 346016 pts/0 R+ Aug16 6904:35 /usr/ansys_inc/v180/CFX/bin/linux-amd64/ifort/solver-mpi.exe -par -pri 2 -outopt 0 -nojob
ewen 12296 99.7 0.2 458876 349376 pts/0 R+ Aug16 6897:50 /usr/ansys_inc/v180/CFX/bin/linux-amd64/ifort/solver-mpi.exe -par -pri 2 -outopt 0 -nojob
ewen 12297 99.7 0.2 460728 351572 pts/0 R+ Aug16 6898:57 /usr/ansys_inc/v180/CFX/bin/linux-amd64/ifort/solver-mpi.exe -par -pri 2 -outopt 0 -nojob
ewen 12298 99.8 0.2 461312 353512 pts/0 R+ Aug16 6904:47 /usr/ansys_inc/v180/CFX/bin/linux-amd64/ifort/solver-mpi.exe -par -pri 2 -outopt 0 -nojob
ewen 12299 99.8 0.2 458340 347260 pts/0 R+ Aug16 6903:56 /usr/ansys_inc/v180/CFX/bin/linux-amd64/ifort/solver-mpi.exe -par -pri 2 -outopt 0 -nojob
ewen 12300 99.8 0.2 475320 362708 pts/0 R+ Aug16 6903:35 /usr/ansys_inc/v180/CFX/bin/linux-amd64/ifort/solver-mpi.exe -par -pri 2 -outopt 0 -nojob
ewen 12301 99.8 0.2 476804 367708 pts/0 R+ Aug16 6902:37 /usr/ansys_inc/v180/CFX/bin/linux-amd64/ifort/solver-mpi.exe -par -pri 2 -outopt 0 -nojob
ewen 12302 99.8 0.2 473872 364996 pts/0 R+ Aug16 6903:42 /usr/ansys_inc/v180/CFX/bin/linux-amd64/ifort/solver-mpi.exe -par -pri 2 -outopt 0 -nojob
ewen 12303 99.7 0.2 478276 368140 pts/0 R+ Aug16 6900:52 /usr/ansys_inc/v180/CFX/bin/linux-amd64/ifort/solver-mpi.exe -par -pri 2 -outopt 0 -nojob
root 14379 0.0 0.0 0 0 ? S 19:17 0:00 [btrfs-worker-2]
root 15671 0.0 0.0 0 0 ? S Aug19 0:04 [kworker/10:1]
root 16017 0.0 0.0 0 0 ? S 19:30 0:00 [kworker/8:2]
root 16864 0.0 0.0 0 0 ? S 15:30 0:00 [kworker/u32:2]
root 17085 0.0 0.0 0 0 ? S Aug18 0:02 [kworker/11:2]
root 17146 0.0 0.0 0 0 ? S Aug19 0:05 [kworker/12:2]
root 17956 0.0 0.0 0 0 ? S 19:45 0:00 [kworker/5:2]
root 17959 0.0 0.0 0 0 ? S 19:45 0:00 [kworker/4:2]
root 18147 0.2 0.0 0 0 ? S 19:46 0:00 [btrfs-worker-2]
root 18148 0.2 0.0 0 0 ? S 19:46 0:00 [btrfs-worker-3]
root 18149 0.1 0.0 0 0 ? S 19:46 0:00 [btrfs-worker-4]
root 18150 0.1 0.0 0 0 ? S 19:46 0:00 [btrfs-worker-5]
root 18151 0.1 0.0 0 0 ? S 19:46 0:00 [btrfs-worker-6]
root 18152 0.0 0.0 0 0 ? S 19:46 0:00 [btrfs-worker-7]
root 18153 0.0 0.0 0 0 ? S 19:46 0:00 [btrfs-worker-8]
ewen 18249 57.1 0.0 26696 1864 pts/1 RL+ 19:46 0:00 ps aux
root 18790 0.0 0.0 0 0 ? S 15:45 0:00 [kworker/2:0]
root 20277 0.0 0.0 0 0 ? S Aug19 0:00 [kworker/12:0]
root 22717 0.0 0.0 0 0 ? S 16:15 0:00 [kworker/0:1]
root 23515 0.0 0.0 0 0 ? S 12:15 0:01 [kworker/7:2]
root 23790 0.0 0.0 0 0 ? S Aug18 0:08 [kworker/13:1]
root 24125 0.0 0.0 0 0 ? S 08:15 0:01 [kworker/5:0]
root 26351 0.0 0.0 0 0 ? S Aug18 0:00 [kworker/u33:2]
root 26967 0.0 0.0 0 0 ? S Aug19 0:01 [kworker/5:1]
root 27446 0.0 0.0 0 0 ? S 12:45 0:00 [kworker/3:2]
root 27935 0.0 0.0 0 0 ? S Aug16 0:00 [kworker/13:2]
root 28047 0.0 0.0 0 0 ? S 08:45 0:02 [kworker/4:0]
root 28649 0.0 0.0 0 0 ? S 17:00 0:00 [kworker/6:1]
root 30856 0.0 0.0 0 0 ? S 05:00 0:00 [kworker/4:1]
root 31394 0.0 0.0 0 0 ? S 13:15 0:01 [kworker/3:0]
Code:
$ cat /proc/sys/vm/vfs_cache_pressure
200
$ cat /proc/sys/vm/swappiness
60
I highly doubt 116.31 GiB of cached objects is a “perceived” problem.
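(For reference, a minimal sketch of making these two tunables persistent across reboots; the values shown are simply the current ones from above, not a recommendation:)
Code:
# Append the tunables to /etc/sysctl.conf (as root):
cat >> /etc/sysctl.conf <<'EOF'
vm.vfs_cache_pressure = 200
vm.swappiness = 60
EOF
# Apply without rebooting:
sysctl -p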
[quote=ab][color=blue]
I found this site:
http://www.linuxatemyram.com/
The table near the bottom shows cached objects in RAM counted as used
RAM. For some strange reason, someone, in their infinite wisdom, decided
to denote cached objects (pagecache and slab objects) as such, and so the
memory manager (vm) WON’T automatically release that RAM via a cache
purge, which means that large-memory programs like scientific/engineering
analysis programs end up needing to swap because not enough RAM is
free/available.
Your last statement is correct.
I think that even when you install SLES, if you try to install it
without swap, it gives you a warning against doing something like that.[/color]
Yes, it does, and I think it does because nobody has bothered to remove it,
and also because, in general, it is good to have a little swap partition
(2 GiB is standard, I think) just in case.
[color=blue]
Being that I also come from a predominantly Windows background (more so
than Linux or Solaris), I know that Windows REALLY hates it if it
doesn’t have swap available, even with large memory systems.
I’m not sure if Linux will behave “appropriately” in the absence of a
swap partition.[/color]
Every box I’ve built in the past several years (ten or so?) has had
minimal or no swap at all, except for my laptop, where I have swap simply
because I hibernate (suspend to disk) often. Probably half of the boxes
I’ve set up in that time have had no swap at all, and all of them are
either still running or have been retired and replaced because of
hardware dying. I stopped using swap, as much as possible, years ago
because of the issues I’ve mentioned before, where performance suffered
so badly when some program (xorg in your case) got out of control that
using the system, even to kill the problematic process, was too much of a
problem to stand. Anyway, Linux seems to do fine, particularly if you
tune the swappiness down to one (1) or zero (0), since that should mean
swap is only used as a last resort; my laptop has it set to zero (0).
[color=blue]
Conversely, nowadays with PCI Express SSDs, swapping isn’t quite as big
of a deal as it once was. Still sucks, but it’s MUCH better than the
days of asynchronous swapping on a mechanically rotating hard drive at
3-5 MB/s. (I don’t have my PCIe SSD installed in it right now. I wanted
to see how well it would do without it in case I end up installing an
Infiniband card instead.)[/color]
Sure, but 250+ GiB of it? If your system needs that much swap, even if
you are using SSDs, something is amiss, and writing at one GB/s is still
not that fast compared to what RAM can do. I just did a test on a REALLY
old box that was never server-class hardware, and even it can write to
RAM at 3 GiB/s; it would probably take striped SSDs today to keep up with
that, and modern RAM should be able to go much faster, maybe an order of
magnitude or more.[/quote]
So…yes and no.
I keep the swap around only because, while I know how much memory the cases I am currently running take, that’s generally NOT the case (i.e. I usually don’t have a good estimate of how much memory an analysis will need before I submit the run/job).
The system has been spec’d in anticipation of larger memory runs, but there is also a possibility that even 128 GB will be insufficient. Yay engineering?
So I upgraded the RAM to a cost-effective point: it isn’t the most RAM the system can take, but it isn’t the lowest-cost option either.
In my case, swap exists in the event of an analysis requiring more memory than is physically available.
[quote=ab][color=blue]
To me, /proc/sys/vm/swappiness tells the system how “often” it swaps.
What I really want it to do isn’t to swap; it is to clear the cache so
that memory is freed up and swap wouldn’t even be an issue. That’s what
I was really going for by writing to /proc/sys/vm/vfs_cache_pressure
(since then, I’ve also edited sysctl so that the change will be
permanent, but for now I am writing to it manually to test which setting
works the way I would want/like it to).[/color]
Set swappiness to zero (0) and it will only be used if really needed; the
system may still use cache, and in my testing it frees that at the drop of
a hat for any user process that needs it (‘root’-owned or otherwise), but
at least swap will never be used until that time. My laptop runs
VMs, monster Java processes, this Thunderbird tool (subscribed to a
hundred groups and a half-dozen e-mail accounts), Firefox (with a dozen
tabs), Tomboy with a few thousand notes, Pidgin, konsole with a dozen tabs
and infinite scrollback, and as much else as I can throw at it, and it
never uses swap unless a process runs away, at which time I still wish it
were dead despite a decent, but not new/modern, SSD.
I need to read up more on the vfs_cache_pressure stuff, but ultimately I
would take your system, cut back the swap as much as possible (you’re
into high performance, so use high-performance (non-swap) memory only, by
sizing for it as you have done), and then keep xorg from building up
memory by not running that program within it, or by not running xorg at
all (it is a server, after all; you do not need a GUI full time), but
that is just me.
It is interesting hearing about your experience and what you are seeing,
particularly in this little HPC environment, so no matter which route you
choose, thank you for sharing what you have so far.
–
Good luck.[/QUOTE]
You’re welcome.
Yeah, I can try changing the swappiness to zero as well.
It was my understanding that it only controlled how likely it was going to swap, not whether or not it was going to swap.
Put it this way: even with 128 GB of RAM, SLES has been able to consume ALL of it in one form or another. I just wish it were more biased towards user apps rather than OS/kernel vm caching.
I understand, at a very high level, why this problem exists in the Linux kernel (it doesn’t distinguish between RAM used by user apps and RAM used for caching; it just sees the RAM as used), but it should have been architected more intelligently, such that when an application requests more RAM than is currently available, one of the first things the OS does is clear out the cache, which it currently doesn’t seem to do automatically.
(It was also my interpretation that vfs_cache_pressure was supposed to do exactly that, but with a value of ‘200’ set in /proc/sys/vm/vfs_cache_pressure it is STILL caching like mad, so something tells me there is more testing to be done. The downside is that my suite of tests (via batch processing/shell script) takes about 4.5 days per pass, so I am reluctant to test system settings one slow run at a time; I need something that loads the system up quickly and just as quickly gets the OS caching. See the sketch below.)
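Something along these lines is what I have in mind; it is only a rough sketch with a made-up scratch path, and the sizes would need adjusting (keep the allocation below installed RAM or the OOM killer may step in, and use a disk-backed path rather than tmpfs, or step 1 consumes RAM directly instead of filling the pagecache):
Code:
# 1. Fill the pagecache quickly by writing and re-reading a large
#    scratch file (64 GiB here; adjust count to taste):
dd if=/dev/zero of=/var/tmp/cachefill bs=1M count=65536
cat /var/tmp/cachefill > /dev/null
# 2. In a second terminal, watch cache vs. swap once per second:
vmstat 1
# 3. Grab a big block of anonymous memory; tail has to buffer its
#    whole input because /dev/zero contains no newlines, so this
#    allocates roughly 80 GiB before exiting:
head -c 80G /dev/zero | tail > /dev/null
# 4. Clean up:
rm /var/tmp/cachefill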