[QUOTE=ab;39178]On 08/20/2017 06:34 PM, alpha754293 wrote:[color=blue]
ab;39171 Wrote:[color=green]
On 08/18/2017 07:14 PM, alpha754293 wrote:[color=darkred]
Code:
aes@aes3:~> free
             total       used       free     shared    buffers     cached
Mem:     132066080  122653184    9412896     155376         24   84712440
-/+ buffers/cache:   37940720   94125360
Swap:    268437500      35108  268402392[/color]
This is showing what we would hope, that swap is basically unused. Sure, 35 MiB are being used, but that's just about nothing, and it is probably only data which should be swapped, like libraries loaded once and never needed again but still needing to be loaded. You could tune swappiness further, but I can hardly imagine it will make a big difference since the system does not need the memory that is even completely free (9 GiB of it) or that is used and freeable by cache (94 GiB).
[color=darkred](I think that you meant /proc/sys/vm/swappiness and that is still at the default value of 60.)[/color]
Change that if you want; sixty (60) is the default I have as well on my boxes that I have not tuned, but again I doubt it matters too much since the system is barely using any swap now that xorg is not trying to use all of the virtual memory the system has available.[/color]
Xorg isn't using it, but cache is (pagecache and slab objects) - 81.74 GiB of it, to be precise.[/color]
I think we were talking about different “it” things here; I was talking
about swap, and on this current system almost nothing is using swap.
Sure, when you were running your analysis program under xorg it was xorg
taking up both RAM and swap, but that was not usual at all because xorg, a
user program, thought it needed (NEEDED) 333977404 KB of RAM, meaning 333
GB all by itself. That’s not normal, and only indicative (usually) of a
memory leak. In that case your kernel was not holding onto either RAM or
swap to the chagrin of xorg, but rather, as I would expect, it had
probably freed up as much RAM as possible from cache and given it to xorg,
which had happily used it terribly. There’s no fix for this other than to
fix the xorg bug, but it shows that the system did not hold onto cache
while keeping memory (RAM) from a user application, and this is how it
should work. Things are cached when nothing else needs the RAM, but the
system will free it at the drop of a hat when something important
(basically anything) needs it.[/quote]
As you recall though, the thread title is “memory leak issue?”.
If an OS uses swap because it ran out of memory, then the issue isn’t about the swap, but RAM.
Swap is just the consequence of what’s going on with the RAM usage. Fix that, and I would say you’d have a 99% chance of fixing the swap issue.
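For what it's worth, a quick way to see which processes actually own whatever swap IS in use (assuming a kernel new enough to report VmSwap in /proc/<pid>/status) is something along these lines:
Code:
for f in /proc/[0-9]*/status; do
    awk '/^Name:/ {name=$2} /^VmSwap:/ {if ($2 > 0) print $2, "kB", name}' "$f"
done | sort -rn | head
That at least separates "the kernel parked a few idle pages in swap" from "a process is genuinely living in swap".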
[quote=ab][color=blue]
So when an application is going to make a request for ca. 70 GiB of RAM,
let’s say, and since the system only has 128 GB installed, it’s going to
push any new demands on the RAM into swap and this is where it becomes a
problem.[/color]
I would agree that it would be a problem, but I think your own xorg
example shows that is not the case. Yes, you were using lots of swap on
the system at that time, but you were also using all of your RAM (or
nearly so). If xorg had been denied RAM because the system just had to
cache things, xorg would have crashed like any application that needed RAM
and was denied it from the OS. More below to prove this, though.[/quote]
[quote=ab][color=blue]
ab Wrote:[color=green]
[color=darkred]These were screenshots that I took of the terminal window (ssh) earlier. You can see that on one system, it was caching 80.77 GiB and on the other, 94.83 GiB. This is confirmed because when I run:
Code:
echo 3 > /proc/sys/vm/drop_caches
it clears the cache up right away.[/color]
Yes, that makes sense, but I do not understand why there is a perceived problem considering the system state now that xorg is stopped. The system is not in need of memory, at least not at the time of the snapshot you took.[/color]
Again, the root cause of the issue actually isn't the swap in and of itself. It first manifested as such, especially with X running, but in run level 3, I was able to find out that the root cause of the issue is the OS kernel's vm caching of pagecache and slab objects.[/color]
I do not see how you reached this conclusion. I see RAM being used, and
by cache, but I also do not see anything in your last bit of output that
shows anything wanting to use all of the RAM, so everybody is running in
RAM, and that’s good (I’m ignoring those tiny 35 MiB of swap because it’s
almost nothing). System performance went down when swap was heavily used
by xorg, yes, but that only happened because xorg wanted nearly 3x your
system RAM, so it was given a lot of RAM and way more swap, because your
swap is (in my opinion) way too big. If you were to start a new HPC job
that needed that much memory, you’d have similar results, but worse
because your programs probably actually use the RAM heavily, rather than
just filling it, and swap, once.[/quote]
So, I’m re-running the tests now (back in run level 5) to see if changing the vfs_cache_pressure has made any difference or to recreate the condition that caused X to want so much RAM in the first place.
It’ll likely be a couple of days before I will be able to report back.
If X was truly the culprit and NOT the caching, then I will rescind and retract my statement. However, right now, with X running on one node and NOT running on another node, it is difficult to tell.
I'll likely have to do a 2x2 test: set vfs_cache_pressure back to 100 and test with X running and with X not running, then set vfs_cache_pressure=200 and run both tests again, in order to collect enough data to confirm.
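The rough shape of one cell of that matrix would be something like the following (run_job.sh is just a stand-in for my actual batch/solver script, and the runlevel switch covers the "with X"/"without X" half of the matrix):
Code:
echo 100 > /proc/sys/vm/vfs_cache_pressure    # or 200 for the other half of the matrix
init 3                                        # or init 5 for the "X running" cells
free > free-before.txt
./run_job.sh                                  # stand-in for the actual batch analysis run
free > free-after.txt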
Thanks.
[quote=ab][color=blue]
Linux marks the RAM that pagecache and slab objects are cached into as being RAM that is used (which is TECHNICALLY true). What it DOESN'T do when an application demands the RAM, though, is release that cache (a la # echo 3 > /proc/sys/vm/drop_caches) so that the cached pagecache and slab objects go back to the free memory pool and can then be used for/by a USER application.
THAT is the part that it DOESN'T seem to do/be doing.
And that is, to be blunt and frank - stupid.[/color]
If true, it would be a terrible thing for sure, but I have never seen
Linux do this, and I’ve tested it many times; as mentioned above, I think
your xorg example also shows this, but you disagree so I would like to
figure out if my conceptions are all wrong, or if you are, perhaps,
interpreting differently than I am and perhaps we can find some agreement.[/quote]
So far, it appears that changing the vfs_cache_pressure to 200 (from the default 100) seems to be helping.
Swap is still at a measly 36 MiB, but the cached mem (read from top) is 63.029 GiB, down from 80+ GiB earlier (when I started this analysis run).
I’ll know more when I run the 2x2 permutative matrix of tests.
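To make the runs comparable, I'm also planning to leave a simple logger like this running during each test (the interval and file name are arbitrary), since Cached, Slab, and SwapFree in /proc/meminfo are the numbers in question:
Code:
while true; do
    date '+%F %T' >> meminfo.log
    grep -E '^(MemFree|Cached|Slab|SwapFree):' /proc/meminfo >> meminfo.log
    sleep 60
done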
[quote=ab][color=blue]
If you have user applications that require RAM, it should take
precedence over the OS’ need/desire to cache pagecache and slab
objects.[/color]
This is what I have always seen over the years; let’s test it.[/quote]
Stay tuned for the results from my 2x2 tests. It'll take a while to run (I'll probably have to bring the other two nodes online in SLES to help speed up the 2x2 matrix; otherwise the two nodes that are currently running SLES will each have to run the cycle twice, which, for the purposes of this discussion, will take twice as much time).
(And I want to stick to my batch processing/shell script only, because it is representative of the conditions that I am actually going to be using the system under, rather than trying to create something new to test it with.)
So, if it takes quite a bit of time, I am okay with that because I really want to have a firm understanding of what’s going on here.
[quote=ab][color=blue]
Yes, I realise that to Linux, cached objects in RAM = RAM in USE, but it should be intelligent enough to know what is TRULY being used vs. what is only cached, so that the cache can be cleared and the corresponding memory/RAM released back into the free/available pool for user apps to use.
THAT is the root cause of the underlying issue.[/color]
No, Linux definitely sees the difference between cached objects in RAM and
everything else in RAM; if that were not the case, the ‘free’ command
could never show you, as any old user, how much of RAM is being used for
mere caching/buffers, and of course it does.[/quote]
cf. my comments about running the 2x2 permutative matrix of tests above.
[quote=ab][color=blue]
The console output of “free” actually tells you that on one of the
nodes, it has cached 81.74 GiB of objects and the other has cached
well…it WAS 94.83 GiB, now it is 116.31 GiB.[/color]
I cannot see this picture for whatever reason; maybe host it on another
site, unless you can just paste it as text (if it was already text).[/quote]
The output of top is difficult to capture as text (or I just don't know how to do it).
The attachment is hosted here natively (no different than some of the other pictures/attachments).
Check the forum (not the email) for details.
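That said, top can write plain text when run in batch mode, so next time I should be able to grab a snapshot directly with something like (file name is arbitrary):
Code:
top -b -n 1 > top-snapshot.txt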
[quote=ab][color=blue]
Here is the output of ps aux for that node:[/color]
The node from which you posted the ps output looks fine to me. As far as
I can tell, the ‘ps’ output shows that this node is using either 15212268
KB (15 GiB) of VSZ, or 9759828 (9 GiB) Resident memory. If that 9 GiB is
added on top of 116 GiB cached data, you are still not using all of your
RAM. Seeing the ‘free’ output would probably show this as well. Sure,
maybe SOME swap was being used, but I would be very surprised if it was
using swap heavily, but still we need to test this.
[color=blue]
Code:
$ cat /proc/sys/vm/swappiness
60[/color]
Yes, if concerned, at least set this to one (1). Unless you are using a
lot of swap it will not matter, but at least it will have the system
prefer RAM more-heavily.
[color=blue]
I highly doubt 116.31 GiB of cached objects is a “perceived” problem.[/color]
It definitely is only a perceived problem if your user processes are able
to take back the RAM when they need it.
In this last bit of ‘ps aux’ output the majority of your solver processes
are only using something like 300 MiB RAM, so much smaller than before.
You have one using around 4 GiB, but it’s definitely the biggest thing on
there. As a result, while your system shows a lot of cache, that is
because nothing else needs the RAM. Get something to use that RAM and
watch it free up as if you had told the system to drop caches with the
echo statement, only just for the amount needed by the program.
[color=blue]
In my case, swap exists in the event of an analysis requiring more
memory than is physically available.[/color]
Fair enough, as that is a decent purpose for having it, so long as
everybody understands that it will perform terribly (compared to RAM) once
needed. Linux, I am arguing, should delay that time as much as possible,
preferring RAM until then, with your swappiness value at the default of
sixty (60).
Ways to test this are varied, but here are a couple. First, on your box
you should have a /dev/shm (shared memory) mountpoint in which you can
write anything you want, and normally it is about half the size of your
system’s RAM (by default), meaning on your box it will be 64 GiB. If your
system is actually using 10 GiB memory for your processes, but it is
caching 116 GiB of stuff, your memory is all filled up and anything you do
will, per my theory, require freeing up RAM. Per your theory, it will
require using swap. Run the following command to request 25 GiB of RAM
for a file in that “ramdisk” area and see where it is used:
free
dd if=/dev/zero of=/dev/shm/25gib bs=1048576 count=25000
free
If my theory is correct, your swap amounts will not change much, and your
used amount will not change much, but your system will have a lot less
cached suddenly. Delete the file and then check ‘free’ again:
rm /dev/shm/25gib
free
At this point you probably have 25 GiB RAM (or a bit more) free and swap
should still be minimally used. Testing this on my system (which has a
lot less RAM than yours) shows these exact results, and they’re the
results I’ve seen for years, and come to expect.
Of course, you’re not foolish and realize that writing a file to a ramdisk
is maybe not exactly the same as any other user process wanting memory.
The easy test there, of course, is to have something gobble RAM.
Thankfully, folks have written programs that will do just that for us.
The original site is gone, but I can paste the code here, you can drop it
into a file, and then compile it see the results; I just tested it on my
server and it still works as hoped; warning, running code from weirdos
online is slightly scary unless you trust them:
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(int argc, char *argv[])
{
    char *p;
    long i;
    size_t n;

    /* I'm too bored to do proper cmdline parsing */
    if (argc != 2 || atol(argv[1]) <= 0) {
        fprintf(stderr, "I'm bored... Give me the size of the memory chunk in KB\n");
        return 1;
    }

    n = 1024 * atol(argv[1]);
    if (! (p = malloc(n))) {
        perror("malloc failed");
        return 2;
    }

    /* Temp, just want to check malloc */
    printf("Malloc was successful\n");
    //return 0;

    /* Touch all of the buffer, to make sure it gets allocated */
    for (i = 0; i < n; i++)
        p[i] = 'A';

    printf("Allocated and touched buffer, sleeping for 60 sec...\n");
    sleep(60);

    printf("Done!\n");
    return 0;
}
Drop that into something like mem-alloc-test.c and then compile it:
gcc mem-alloc-test.c
The resulting executable will be named ‘a.out’ by default, so now run it
and have it allocate 10 GiB RAM, which in theory you do not have free
other than if cache is freed up:
./a.out 10000000   # 10 million KB = 10 GB or so
While it is running, run ‘free’ in another shell and watch the cache get
freed to make room for this memory-gobbling monster. When it finishes
(after sixty seconds, or when you hit Ctrl+c), see that you have free
space, and less cached than when you started.
I think this proves, at least on my systems, that things work as I have
described. Cache is treated separately, it is a second-class consumer
of RAM, and it is freed up nice and quickly, without using swap.
It is entirely possible your system behaves differently; I have low
swappiness values, and my boxes are probably older and running older
versions of SLES than yours, but if that is the case I would like to
understand why since, as you have noted well, this is a big deal.
Either way, I look forward to better-understanding the memory management here.
Good luck.[/QUOTE]
Stand by for the results from the 2x2 matrix testing.
I think that will address pretty much the rest of your points once the results come in.
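For the allocator test specifically, my plan is roughly this (10000000 KB is about 10 GB, and the five-second interval is arbitrary): run your a.out on the loaded node in one shell and watch the cache shrink from another:
Code:
# shell 1: allocate and touch ~10 GB with the test program from your post
./a.out 10000000
# shell 2: watch used/free/cached/swap while it runs
watch -n 5 free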
Thanks.