Very long time required to allocate memory

Hi all,

We have two identical servers running SLES. They were not acquired at
the same time, with the result that one of them runs a 2.6 kernel and
the other a 3.0 kernel.

The servers are used for heavy computations and each has four 8-core
Xeon CPUs and 256 GB of memory.

On the server with the 2.6 kernel we never have performance problems;
on the newer server with the 3.0 kernel we do. Freeing cached memory
when it is required for a program's allocation takes very long.

For example, we started five computations using the same program at
once. Each computation allocated and filled 20 GB of memory at the
start. At the time of starting, about 200 GB of cached memory was
available and the load was not very high, about 16. Just allocating
and filling the memory took 6 hours. We checked, and swap was not used
during the allocation. After these computations finished, we directly
restarted the same computations (so all required memory was now free),
and allocating and filling the memory took only a few minutes, as we
are used to from the server with the 2.6 kernel.

We wrote a small test program in Fortran that just allocates and fills
20 GB of memory, and we observed the same bad performance whenever all
available memory is cached. Changing the compiler from ifort to g95
did not solve the problem either.
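
For anyone who wants to reproduce this: a rough timing harness along
the following lines (just a sketch; ./allocfill stands in for our
Fortran test binary, which is not shown here) makes the effect visible:

#!/bin/bash
# Free and cached memory before the run (values in kB)
grep -E '^(MemFree|Cached):' /proc/meminfo

# Time the allocate-and-fill step; ./allocfill is a placeholder for
# the Fortran test binary that allocates and fills 20 GB
time ./allocfill

# Same counters afterwards, to see how much cache was reclaimed
grep -E '^(MemFree|Cached):' /proc/meminfo

With plenty of cached memory the first run is the slow one; running it
again immediately afterwards is fast, exactly as described above.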

Therefore, we are quite sure that the performance problem is due to
freeing the cached memory. Running the command ‘sync’ before starting
the computation doesn’t help. As we don’t have access to the root
account, we cannot free the cached memory ourselves
(sync; echo 3 > /proc/sys/vm/drop_caches).

A solution could be to ask our system administrator to provide us with
a sudo script for freeing the cached memory; we would then run this
script before starting any computation (a sketch is below). But is
there a more elegant solution to this problem? Are there settings our
system administrator could change, so that freeing the memory performs
the same as on the server with the 2.6 kernel?
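
For what it's worth, a minimal sketch of what we have in mind (the
script name, path, and user are just examples our admin would adapt):

#!/bin/sh
# /usr/local/sbin/dropcaches.sh - flush dirty pages to disk, then
# drop the page cache, dentries and inodes; must be run as root
sync
echo 3 > /proc/sys/vm/drop_caches

together with a sudoers entry (added via visudo) that allows exactly
this one script:

johan ALL=(root) NOPASSWD: /usr/local/sbin/dropcaches.sh

so that we could run ‘sudo /usr/local/sbin/dropcaches.sh’ before
starting a computation.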

Best regards,

Johan

Hi Johan
I’ve asked my SUSE contacts for some pointers on the issue :)


Cheers Malcolm °¿° (Linux Counter #276890)
openSUSE 12.2 (x86_64) Kernel 3.4.11-2.16-desktop
up 0:54, 3 users, load average: 0.03, 0.06, 0.06
CPU Intel® i5 CPU M520@2.40GHz | GPU Intel® Ironlake Mobile

Hi
What patch level and kernel are the two machines running?

cat /etc/SuSE-release
uname -a

Also, have you had a read through this document:
http://doc.opensuse.org/products/draft/SLES/SLES-tuning_sd_draft/cha.tuning.memory.html


Cheers Malcolm °¿° (Linux Counter #276890)
openSUSE 12.2 (x86_64) Kernel 3.4.11-2.16-desktop
up 9:29, 3 users, load average: 0.02, 0.07, 0.06
CPU Intel® i5 CPU M520@2.40GHz | GPU Intel® Ironlake Mobile

[QUOTE]Also, have you had a read through this document:
http://doc.opensuse.org/products/draft/SLES/SLES-tuning_sd_draft/cha.tuning.memory.html[/QUOTE]

Relating to the ‘Writeback Parameters’ section of that article, I had some
similar thoughts.

Something semi-related I wondered about is how quickly forcing the
caches to drop actually takes place. We know that after you run your
programs and take up all of the memory once (forcing the caches to be
empty-ish), re-allocation of memory is quick. But what if we empty
them without using any non-kernel program, by writing to drop_caches
instead (you mentioned this in your original post, but noted that you
needed somebody with ‘root’ privileges to do it)? If you have not
tested this, I’d be interested to see whether it has an immediate
benefit, or whether cleaning out the caches takes a similar amount of
time as forcing them empty with one of your programs.

Good luck.

First server, without the problem:
SUSE Linux Enterprise Server 11 (x86_64)
VERSION = 11
PATCHLEVEL = 1
Linux app-reken01 2.6.32.54-0.3-default #1 SMP 2012-01-27 17:38:56 +0100 x86_64 x86_64 x86_64 GNU/Linux

Second server with the problem:
SUSE Linux Enterprise Server 11 (x86_64)
VERSION = 11
PATCHLEVEL = 2
Linux app-reken03 3.0.42-0.7-default #1 SMP Tue Oct 9 11:58:45 UTC 2012 (a8dc443) x86_64 x86_64 x86_64 GNU/Linux

Our admin tried dropping the caches and it worked very well: allocating and filling 30 GB took a few seconds instead of minutes/hours.

We compared
/proc/sys/vm/*
on both servers. The settings are very comparable; the only difference is that oom_dump_tasks is 1 on the problem server and 0 on the other.
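
For anyone repeating the comparison, something along these lines works (the file names are just examples):

for f in /proc/sys/vm/*; do echo "$f = $(cat "$f" 2>/dev/null)"; done > vm-$(hostname).txt
# copy both files to one machine, then:
diff vm-app-reken01.txt vm-app-reken03.txt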

The servers are constantly in use, so there is little room to try changing settings when we are not sure they will help. I think that something that takes up to six hours here, where the 2.6 kernel needs 10 minutes (or even less if the memory is free), will not be solved by tuning these settings. We can only try such things if we are sure they can solve the problem. We don’t have comparable servers (256 GB of memory, etc.) where we can test things in advance.

In article jotuitman.5onewo@no-mx.forums.suse.com, Jotuitman wrote:
[QUOTE]I think that something that takes up to six hours here, where the 2.6 kernel needs 10 minutes (or even less if the memory is free), will not be solved by tuning these settings.[/QUOTE]
Actually, that kind of issue can be resolved with the right settings
tuning. I would welcome a default setting that prevents a runaway app
from sucking up all RAM instantly, but in your situation you would
need such a speed bump removed.
So now to figure out whether this issue is a bug or a speed bump, and
if it is a bump, whether it is admin-configurable or hard-coded.

Andy Konecny
KonecnyConsulting.ca in Toronto

Andy’s Profiles: http://forums.novell.com/member.php?userid=75037
https://forums.suse.com/member.php?2959-konecnya

Hi jotuitman

[QUOTE=jotuitman;10999]Our admin tried dropping the caches and it worked very well: allocating and filling 30 GB took a few seconds instead of minutes/hours.

We compared
/proc/sys/vm/*
on both servers. The settings are very comparable; the only difference is that oom_dump_tasks is 1 on the problem server and 0 on the other.

The servers are constantly in use, so there is little room to try changing settings when we are not sure they will help. I think that something that takes up to six hours here, where the 2.6 kernel needs 10 minutes (or even less if the memory is free), will not be solved by tuning these settings. We can only try such things if we are sure they can solve the problem. We don’t have comparable servers (256 GB of memory, etc.) where we can test things in advance.[/QUOTE]

Are you running with Transparent Huge Pages (THP) enabled on the server?
If so, could you please re-test with transparent huge pages disabled?

In addition, periodically gathering /proc/vmstat around the time the problem starts to appear could help with the diagnosis.
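
Something as simple as the following would already give useful snapshots (the interval and log path are only suggestions):

# Append a timestamped copy of /proc/vmstat every 10 seconds
while true; do
    echo "=== $(date) ==="
    cat /proc/vmstat
    sleep 10
done >> /tmp/vmstat.log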

This could even be a regression in the 3.0 kernel that you are running into.
To get our kernel hackers involved in looking at this, would you be able to log a service request for this problem?

Thank you
Hans

Just replying to myself quickly.

The SLES 11 SP2 release notes refer to Transparent Huge Pages:

5.1.1. Transparent Huge Pages (THP) Support

On systems with large memory, frequent access to the Translation Lookaside Buffer (TLB) may slow down the system significantly.

Transparent huge pages thus are of most use on systems with very large (128GB or more) memory, and help to drive performance. In SUSE Linux Enterprise, THP is enabled by default where it is expected to give a performance boost to a large number of workloads.

There are cases where THP may regress performance, particularly when under memory pressure due to pages being reclaimed in an effort to promote to huge pages. It is also possible that performance will suffer on CPUs with a limited number of huge page TLB entries for workloads that sparsely reference large amounts of memory. If necessary, THP can be disabled via the sysfs file “/sys/kernel/mm/transparent_hugepage/enabled”, which accepts one of the values “always”, “madvise”, or “never”.

To disable THP via sysfs and confirm it is disabled, do the following as root:

echo never > /sys/kernel/mm/transparent_hugepage/enabled
cat /sys/kernel/mm/transparent_hugepage/enabled
always madvise [never]
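
Note that the sysfs setting does not survive a reboot. If disabling
THP turns out to help, it can also be set permanently via the kernel
command line (on SLES 11 that would be the kernel line in
/boot/grub/menu.lst, if I’m not mistaken) by appending:

transparent_hugepage=never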

Regards
Hans

Hi Hans,

Thanks very much. Transparent huge pages are indeed enabled:

cat /sys/kernel/mm/transparent_hugepage/enabled
[always] madvise never

Freeing normal-size pages to create huge pages seems a plausible explanation for the problem. On the older server there is no THP in the kernel.

I can do some manual logging on the system. However, as the server is used for confidential computations, it is not directly connected to the internet and we are not allowed to run any automated logging system that saves data to be transferred outside.

Below is the content of /proc/vmstat. At this moment only 20 GB of the 256 GB is available, so I cannot do any testing without risking swapping. I will discuss with our admin whether we can turn off THP for testing when the load on the system is lower, but I’m not sure this will be allowed.

Best regards,

Johan

nr_free_pages 122319
nr_inactive_anon 2424250
nr_active_anon 57600574
nr_inactive_file 4506677
nr_active_file 523013
nr_immediate 0
nr_unevictable 2
nr_mlock 2
nr_anon_pages 1356641
nr_mapped 9164
nr_file_pages 5051891
nr_dirty 2
nr_writeback 0
nr_slab_reclaimable 163212
nr_slab_unreclaimable 16617
nr_page_table_pages 120798
nr_kernel_stack 463
nr_unstable 0
nr_bounce 0
nr_vmscan_write 2513939
nr_vmscan_immediate_reclaim 0
nr_writeback_temp 0
nr_isolated_anon 0
nr_isolated_file 0
nr_shmem 41
nr_dirtied 2383220152
nr_written 2256824336
numa_hit 15352677217
numa_miss 5127720849
numa_foreign 5127720849
numa_interleave 110188
numa_local 15352517127
numa_other 5127880939
nr_anon_transparent_hugepages 114545
nr_dirty_threshold 26070733
nr_dirty_background_threshold 6517683
pgpgin 7273429772
pgpgout 9210120580
pswpin 620502
pswpout 1594396
pgalloc_dma 1
pgalloc_dma32 1391442502
pgalloc_normal 42636719949
pgalloc_movable 0
pgfree 44795557740
pgactivate 1008297520
pgdeactivate 360793997
pgfault 13118667365
pgmajfault 128964
pgrefill_dma 0
pgrefill_dma32 1373472
pgrefill_normal 292275042
pgrefill_movable 0
pgsteal_dma 0
pgsteal_dma32 14410337
pgsteal_normal 2217907791
pgsteal_movable 0
pgscan_kswapd_dma 0
pgscan_kswapd_dma32 14557721
pgscan_kswapd_normal 2224551754
pgscan_kswapd_movable 0
pgscan_direct_dma 0
pgscan_direct_dma32 1095
pgscan_direct_normal 109985
pgscan_direct_movable 0
pgscan_direct_throttle 0
zone_reclaim_failed 0
pginodesteal 6944570
slabs_scanned 44211968
kswapd_steal 2232208172
kswapd_inodesteal 302537066
kswapd_low_wmark_hit_quickly 64721
kswapd_high_wmark_hit_quickly 711373
kswapd_skip_congestion_wait 19
pageoutrun 7935328
allocstall 73
pgrotated 16037166
pgrescued 0
compact_blocks_moved 1297150698
compact_pages_moved 598142092
compact_pagemigrate_failed 20562307
compact_stall 713529
compact_fail 409345
compact_success 304152
htlb_buddy_alloc_success 0
htlb_buddy_alloc_fail 0
unevictable_pgs_culled 11093
unevictable_pgs_scanned 0
unevictable_pgs_rescued 12189
unevictable_pgs_mlocked 19297
unevictable_pgs_munlocked 19295
unevictable_pgs_cleared 0
unevictable_pgs_stranded 0
unevictable_pgs_mlockfreed 0
thp_fault_alloc 46035119
thp_fault_fallback 403798
thp_collapse_alloc 41897
thp_collapse_alloc_failed 5576
thp_split 40209

Hi Johan,

Apologies for not responding sooner.
I’m looking forward to your feedback and to hearing whether you have been able to test.

As for the sensitivity of gathering data for analysis outside your premises, if need be we can probably work something out.
I guess it would be best if we have a phone call about this.

To arrange for that, or if there’s anything you want to share on the topic in private, it would be best if you could send me a direct email at [hvdheuvel] at [novell] dot [com] and we can exchange phone numbers or information.

I’m based out of the Novell office in Rotterdam, the Netherlands, and if I’m right, you’re probably not located too far away from there.
So I think we can easily work out a time that works well for both of us.

Many thanks
Hans

Hi Hans,

I sent you an email last Wednesday. Hopefully the server will be available this week to test with and without THP.

Best regards,

Johan

Hi Johan

[QUOTE=jotuitman;11598]Hi Hans,

I sent you an email last Wednesday. Hopefully the server will be available this week to test with and without THP.

Best regards,

Johan[/QUOTE]

Looking forward to the outcome of that test.
Unfortunately, I must have missed your email, as I can’t find anything in my mailbox; sorry for that.
Could you please resend it once more to: hvdheuvel [at] novell [dot] com?

In the meantime, I will be in Salt Lake City for BrainShare this week, so my response may be a little delayed.

Hi all,

We tested with THP=always and THP=never on the server, without any other load and with a full cache. Allocating about 23 GB took more than 6000 seconds with THP=always and about 10 seconds with THP=never. Note that we made sure the cache was refilled between the tests.
This server with the newer kernel shows a performance increase of about 10-15% compared with the server with the older kernel, so it seems that THP does indeed provide an advantage when using large amounts of memory.
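
As a side note for the diagnosis: during such a test the THP and compaction activity can be watched in /proc/vmstat, for example with something like:

grep -E '^(thp_|compact_)' /proc/vmstat

(the thp_* and compact_* counters are the ones visible in the output I posted earlier).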

Next week I will discuss with Hans what to log, and how, so that the developers can solve the problem.

Best regards,

Johan