XFS: Possible memory allocation deadlock

We have an HP ProLiant DL180 G6 server with 12 GB RAM and a 600 GB XFS filesystem. The server runs Lotus Domino, and the Domino data lives on this XFS filesystem. Today our largest and most heavily used Domino database (54 GB, roughly 600 simultaneous users) experienced problems which, from the Domino side, looked like database access deadlocks. At the same time, the following message repeatedly appeared in /var/log/messages:

XFS: possible memory allocation deadlock in kmem_alloc (mode:0x250)

The server doesn’t seem to be low on memory: monitoring shows that the amount of cached memory never fell below 7 GB while the problem was occurring. Beyond that, I am frankly at a loss as to how to troubleshoot this. The OS is SLES 11 SP3 x86_64, fully patched, running kernel:

Linux data3 3.0.101-0.29-default #1 SMP Tue May 13 08:40:57 UTC 2014 (9ec28a0) x86_64 x86_64 x86_64 GNU/Linux
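
For what it’s worth, this is roughly how I’ve been checking the memory situation whenever the messages show up (just standard tools, nothing XFS-specific):

Code:

free -m                                                       # overall memory and cache usage
grep -E 'MemFree|Cached|Slab' /proc/meminfo                   # finer-grained breakdown
grep 'memory allocation deadlock' /var/log/messages | tail    # how often the message fires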

Is it possible that this is a kernel bug?

On 05/30/2014 05:04 AM, vatson wrote:

XFS: possible memory allocation deadlock in kmem_alloc (mode:0x250)

Google turns up some good info on this:

https://bugzilla.kernel.org/show_bug.cgi?id=73831

I have not done any checking to see if this has been backported into the
SLES kernel, but can you confirm some bits from there? Specifically, are
you using 64k blocks? Also, does clearing your caches by running the
following (as ‘root’) fix the issue for the moment?

Code:

echo 1 > /proc/sys/vm/drop_caches
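
(For what it’s worth, the drop_caches values are described in the kernel documentation, Documentation/sysctl/vm.txt: 1 drops the page cache, 2 drops reclaimable slab objects such as dentries and inodes, and 3 drops both. Running sync first writes out dirty pages so more of the cache is actually freeable. A slightly fuller variant, if you want to be thorough:)

Code:

sync                               # write out dirty pages first so more cache is freeable
echo 1 > /proc/sys/vm/drop_caches  # 1 = page cache, 2 = dentries/inodes, 3 = both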


Good luck.


I have a 4k block size for both data and directory blocks:

Code:

# xfs_info /data
meta-data=/dev/mapper/datavg-datavol isize=256    agcount=32, agsize=4915136 blks
         =                           sectsz=512   attr=2
data     =                           bsize=4096   blocks=157284352, imaxpct=25
         =                           sunit=64     swidth=128 blks
naming   =version 2                  bsize=4096   ascii-ci=0
log      =internal                   bsize=4096   blocks=76800, version=2
         =                           sectsz=512   sunit=64 blks, lazy-count=1
realtime =none                       extsz=4096   blocks=0, rtextents=0

I’ll try this next time the problem happens.

I discovered that the filesystem in question is quite heavily fragmented.

Code:

# xfs_db -c frag -r /dev/mapper/datavg-datavol
actual 794468, ideal 52247, fragmentation factor 93.42%
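
To get a feel for which files are responsible, counting extents on the largest Domino databases with xfs_bmap seems like a reasonable next check (the path below is just an example; xfs_bmap prints one line per extent):

Code:

xfs_bmap /data/notesdata/bigdb.nsf | wc -l    # one line per extent, so this approximates the extent count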

The bug report has a comment which mentions that “These large allocations are often the result of extent maps for very badly fragmented files”. It looks like I should try running xfs_fsr on this filesystem.
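
For the service window I have something like the following in mind (just a sketch; the one-hour limit and the file path are examples):

Code:

xfs_fsr -v -t 3600 /data               # reorganise the mounted filesystem, stop after an hour
xfs_fsr -v /data/notesdata/bigdb.nsf   # or target a single badly fragmented file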

I haven’t yet managed to obtain a service window to defragment the filesystem, but I can confirm that clearing the cache by ‘echo 1 > /proc/sys/vm/drop_caches’ helps to work around the issue.
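
Until the defrag window (or a backported kernel) materialises, I’m considering scheduling the cache drop as a crude stop-gap, e.g. with a root cron entry like the one below (the hourly interval is arbitrary, and I realise this only hides the symptom):

Code:

# hypothetical /etc/cron.d/drop-caches-workaround entry, runs hourly as root
0 * * * *  root  /bin/sync && /bin/echo 1 > /proc/sys/vm/drop_caches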

K, that’s good information.

With that in mind, your next best step is likely to be opening a Service Request (SR) with SUSE and asking for this fix to be backported into the current kernel. I’ve already sent a note to SUSE asking them to evaluate it, but until I hear back, keep in mind that they are usually more motivated by customer SRs than by random people like me asking.

An alternative, if you can wait it out, is to upgrade to SLES 12 once it is available. I’m guessing that’s still a few months out, but it should ship with an updated kernel, likely with this fix included (disclaimer: I have no first-hand knowledge of the new kernel version or whether this fix is in it).


Good luck.
