We have two identical servers running SLES. They were not acquired at the same time, with the result that one runs a 2.6 kernel and the other a 3.0 kernel.
The servers are used for heavy computations; each has four 8-core Xeon CPUs and 256 GB of memory.
On the server with the 2.6 kernel we never have performance problems; on the newer server with the 3.0 kernel we do. Freeing cached memory when a program needs to allocate it takes very long.
For example, we started five computations using the same program at once. Each computation allocated and filled 20 GB of memory at the start. At the time of starting, about 200 GB of cached memory was available and the load was not very high, about 16. Just allocating and filling the memory took six hours. We checked, and swap was not used during the allocation. After these computations finished, we immediately restarted the same computations (with all required memory now free), and allocating and filling the memory took only a few minutes, as we are used to from the server with the 2.6 kernel.
We wrote a small test program in Fortran that just allocates and fills 20 GB of memory, and we saw the same poor performance whenever all available memory was cached. Changing the compiler from Ifort to G95 did not solve the problem either.
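For anyone wanting to reproduce this, the effect can be observed from the shell around any such test program (a sketch only; ./memtest is a hypothetical stand-in for the 20 GB Fortran allocator described above):
# Check how much memory is free and how much sits in the page cache.
grep -E '^(MemFree|Cached):' /proc/meminfo
# Time the allocate-and-fill test while the cache is full; on the 3.0
# kernel this is where the hours are lost.
time ./memtest
# Run it again immediately: the memory from the first run is now really
# free, and the same allocation completes in minutes.
time ./memtest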
Therefore, we are quite sure that the performance problem is caused by freeing the cached memory. Running ‘sync’ before starting the computation doesn’t help. As we don’t have access to the root account, we cannot free the cached memory ourselves
(sync; echo 3 > /proc/sys/vm/drop_caches).
A solution could be to ask our system administrator to provide us with a sudo script for freeing the cached memory, which we would then run before starting any computation. But is there a more elegant solution to this problem? Are there settings our system administrator could change so that freeing the memory performs as well as on the server with the 2.6 kernel?
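For reference, such a sudo script could be as small as the following (a sketch only; the script path and the sudoers entry are examples an administrator would adapt):
#!/bin/sh
# drop_caches.sh - flush dirty pages, then drop the page cache.
# Must run as root; it could be whitelisted in sudoers with a line like:
#   jotuitman ALL=(root) NOPASSWD: /usr/local/sbin/drop_caches.sh
sync                               # write dirty pages to disk first
echo 3 > /proc/sys/vm/drop_caches  # 3 = page cache + dentries + inodes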
Regarding the ‘Writeback Parameters’ section of that article, I had some similar thoughts.
Something semi-related I wondered about is how quickly a forced cache drop takes effect. We know that after you run your programs and use up all of the memory once (leaving the caches mostly empty), re-allocation of memory is quick. But what if we emptied the caches not with a user program but by writing to drop_caches (you mentioned this in your original post, noting that you would need somebody with root privileges to do it)? If you have not tested this, I’d be interested to see whether it has an immediate benefit, or whether clearing the caches that way takes a similar amount of time to forcing them empty with one of your programs.
First server, without the problem:
SUSE Linux Enterprise Server 11 (x86_64)
VERSION = 11
PATCHLEVEL = 1
Linux app-reken01 2.6.32.54-0.3-default #1 SMP 2012-01-27 17:38:56 +0100 x86_64 x86_64 x86_64 GNU/Linux
Second server with the problem:
SUSE Linux Enterprise Server 11 (x86_64)
VERSION = 11
PATCHLEVEL = 2
Linux app-reken03 3.0.42-0.7-default #1 SMP Tue Oct 9 11:58:45 UTC 2012 (a8dc443) x86_64 x86_64 x86_64 GNU/Linux
Our admin tried dropping the caches and it worked very well: allocating and filling 30 GB took a few seconds instead of minutes or hours.
We compared
/proc/sys/vm/*
on both servers. The settings are very similar; the only difference is oom_dump_tasks, which is set to 1 on the problem server and 0 on the other.
The servers are constantly in use, so there is little room to experiment with settings when we are not sure they will help. I don’t think something that takes up to six hours, where ten minutes or less is needed on the 2.6 kernel (or even less when the memory is free), will be solved by tuning these settings. We can only try such things if we are sure they can solve the problem. We don’t have comparable servers (256 GB of memory, etc.) where we can test things in advance.
[QUOTE=jotuitman]I don’t think something that takes up to six hours, where ten minutes or less is needed on the 2.6 kernel (or even less when the memory is free), will be solved by tuning these settings.[/QUOTE]
Actually, that kind of issue can be resolved with the right tuning. I would welcome a default setting that prevents a runaway app from instantly consuming all RAM, but in your situation you would need such a speed bump removed.
So now we need to figure out whether this issue is a bug or a speed bump, and if it is a speed bump, whether it is admin-configurable or hard-coded.
[QUOTE=jotuitman;10999]Our admin tried dropping the caches and it worked very well. [...] We don’t have comparable servers (256 GB of memory, etc.) where we can test things in advance.[/QUOTE]
Are you running with Transparent Huge Pages (THP) enabled on the server? If so, could you please re-test with transparent huge pages disabled?
In addition, periodically gathering /proc/vmstat from the time the problem starts to appear could help with the diagnosis.
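Something as simple as the following loop, started in the background just before a slow allocation, should be enough (a sketch; the interval and log file name are arbitrary):
# Snapshot /proc/vmstat every 10 seconds while the slow allocation runs.
while true; do
    date >> vmstat.log
    cat /proc/vmstat >> vmstat.log
    sleep 10
done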
This could even be a regression in the 3.0 kernel you are running.
To get our kernel hackers involved in looking at this, would you be able to log a service request for this problem?
The SLES 11 SP2 release notes refer to Transparent Huge Pages:
5.1.1. Transparent Huge Pages (THP) Support
On systems with large memory, frequent access to the Translation Lookaside Buffer (TLB) may slow down the system significantly.
Transparent huge pages thus are of most use on systems with very large (128GB or more) memory, and help to drive performance. In SUSE Linux Enterprise, THP is enabled by default where it is expected to give a performance boost to a large number of workloads.
There are cases where THP may regress performance, particularly when under memory pressure due to pages being reclaimed in an effort to promote to huge pages. It is also possible that performance will suffer on CPUs with a limited number of huge page TLB entries for workloads that sparsely reference large amounts of memory. If necessary, THP can be disabled via the sysfs file “/sys/kernel/mm/transparent_hugepage/enabled”, which accepts one of the values “always”, “madvise”, or “never”.
To check the current setting, disable THP via sysfs, and confirm it is disabled, do the following as root:
cat /sys/kernel/mm/transparent_hugepage/enabled
[always] madvise never
echo never > /sys/kernel/mm/transparent_hugepage/enabled
cat /sys/kernel/mm/transparent_hugepage/enabled
always madvise [never]
Freeing normal-size pages to assemble a huge page seems a plausible explanation for the problem. On the older server there is no THP in the kernel.
I can do some manual logging on the system. However, as the server is used for confidential computations, it is not directly connected to the internet and we are not allowed to run any automated logging system that transfers data outside.
Below is the output of vmstat. At the moment only 20 GB of the 256 GB is available, so I cannot do any testing without risking swapping. I will discuss with our admin whether we can turn THP off for testing when the load on the system is lower, but I’m not sure this will be allowed.
Apologies for not responding sooner.
I’m looking forward to your feedback and to hearing whether you have been able to test.
As for the sensitivity of gathering data for analysis outside your premises, if need be, we can probably work something out.
I guess it would be best if we could then have a phone call about this.
To arrange that, or if there’s anything you want to share on the topic in private, it would be best if you could send me a direct email at [hvdheuvel] at [novell] dot [com], and we can exchange phone numbers or information.
I’m based out of the Novell office in Rotterdam, the Netherlands, and if I’m right, you’re probably not located too far from there.
So I think we can easily work out a time that suits both of us.
I sent you an email last Wednesday. Hopefully the server will be available this week to test with/without THP.
Best regards,
Johan
Looking forward to the outcome of that test.
Unfortunately, I must have missed your email, as I can’t find anything in my mailbox; sorry for that.
Could you please resend it once more to: hvdheuvel [at] novell [dot] com?
In the meantime, I will be at BrainShare in Salt Lake City this week, so my response may be a little delayed.
We tested with THP=always and THP=never on the server with no load and a full cache. Allocating about 23 GB took more than 6000 seconds with THP=always and about 10 seconds with THP=never. Note that we made sure the cache was refilled between the tests.
Otherwise, this server with the newer kernel shows a performance increase of about 10-15% compared with the server with the older kernel, so it seems THP does provide an advantage when large amounts of memory are in use.
Next week I will discuss with Hans what to log, and how, so the developers can work on solving the problem.