SLES 11 kdumptool - error loading shared libraries

Hi,
We’ve been having issues with SLES 11 servers freezing/crashing/hanging intermittently. This has happened for almost a year now, and continues even though we’ve upgraded to SP1 and now to SP2.

Anyway, I’ve enabled Magic SysRq on all SLES 11 servers and configured kdump. The goal is to capture crash dumps via the console on unresponsive systems for further analysis.
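
For readers following along, a minimal sketch of checking the Magic SysRq setting from a shell. The function name `sysrq_status` and its optional file argument are made up for illustration; only `/proc/sys/kernel/sysrq` comes from the kernel itself.

```shell
#!/bin/sh
# Interpret the kernel's Magic SysRq setting.
# The optional argument is a file to read instead of /proc/sys/kernel/sysrq
# (a hypothetical convenience so the function can be tried on saved values).
sysrq_status() {
    sysrq_file=${1:-/proc/sys/kernel/sysrq}
    value=$(cat "$sysrq_file") || return 1
    case "$value" in
        0) echo "sysrq disabled" ;;
        1) echo "sysrq fully enabled" ;;
        *) echo "sysrq partially enabled (bitmask $value)" ;;
    esac
}
```

Calling `sysrq_status` with no argument reads the live setting. As root, `echo c > /proc/sysrq-trigger` should trigger the same crash as Alt+SysRq+c, which can be handy when the keyboard combination is awkward to send to a VM console.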

I’m currently testing the crash dump combo on a SLES 11 SP2 VM. The console responds to Alt+SysRq+c, but then stops at a bash prompt - see screenshot at bottom of my post. The main errors I can see are:

/sbin/resume: error while loading shared libraries: libgcrypt.so.11: cannot open shared object file: No such file or directory
...
kdumptool: error while loading shared libraries: libelf.so.0: cannot open shared object file: No such file or directory

However, both shared libs exist in library directories on the root partition, so I don’t know why resume/kdumptool can’t find them. It must be something about the “resume” kernel option that I do not understand.

[CODE]## libgcrypt
myhost:~ # whereis libgcrypt.so.11
libgcrypt.so: /lib/libgcrypt.so.11 /lib64/libgcrypt.so.11 /usr/lib64/libgcrypt.so /usr/local/lib/libgcrypt.so.11 /usr/local/lib/libgcrypt.so

myhost:~ # dir /lib/libgcrypt.so*
lrwxrwxrwx 1 root root 19 Feb 29 11:17 /lib/libgcrypt.so.11 -> libgcrypt.so.11.7.0
-rwxr-xr-x 1 root root 545124 Jan 13 13:00 /lib/libgcrypt.so.11.7.0

## libelf

myhost:~ # whereis libelf.so.0
libelf.so: /usr/lib/libelf.so.1 /usr/lib/libelf.so.0 /usr/lib64/libelf.so.1 /usr/lib64/libelf.so.0 /usr/local/lib/libelf.so.0 /usr/local/lib/libelf.so

myhost:~ # dir /usr/lib/libelf.so*
lrwxrwxrwx 1 root root 16 Mar 9 2011 /usr/lib/libelf.so.0 -> libelf.so.0.8.12
-rwxr-xr-x 1 root root 88312 May 5 2010 /usr/lib/libelf.so.0.8.12
lrwxrwxrwx 1 root root 25 Mar 9 2011 /usr/lib/libelf.so.1 -> /usr/lib/libelf.so.0.8.12[/CODE]

Any idea what the issue is here? There may be more informative messages higher up in the console, but I can’t see them and I’m not sure how to slow down the messages to capture them, or how to dump them to a file. My VM is running on VMware.
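
If the console output can be captured to a file, the loader errors can be pulled out of it afterwards. The sketch below assumes such a capture exists (for example a serial console that the hypervisor writes to a file, which is not something set up in this thread); the function name `missing_libs` and the log path in the usage note are hypothetical.

```shell
#!/bin/sh
# List the library names mentioned in dynamic-loader errors in a saved
# console log, one per line, without duplicates.
# Assumes each error message sits on its own line in the log.
missing_libs() {
    sed -n 's/.*error while loading shared libraries: \([^:]*\):.*/\1/p' "$1" | sort -u
}
```

For example, `missing_libs /tmp/console.log` would print each library the loader complained about in that log.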

Screenshot:

Don’t know about those library errors… could be you need to check and fix the paths in /etc/ld.so.conf and rerun ldconfig. I have not needed to mess with that on SLES 11, though.

Some questions to get a better feel of the environment SLES is running in/as:

Which VMware version are you running there (version/build), and which VMware Tools version? Also, was this a tar install or an rpm?

I’m also curious to know how these SLES servers have been set up (which version is it? Is it the VMware release, and is it 32- or 64-bit?), and did you add any special parameters to the boot options?

Lastly, what are these servers doing? Which software is running, and has any custom tuning been done?

-Willem

One more thing to add to this: run ldconfig with the -v switch and check if the libraries kdump is fussing about appear in that list.
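
A minimal sketch of that check, assuming the linker cache listing has been saved to a file first. The function name `libs_in_cache` and the temporary file path are made up for illustration; `ldconfig -p` prints the current cache.

```shell
#!/bin/sh
# Report whether each named library appears in a saved listing of the
# dynamic linker cache. The listing is read from a file so the check can
# also be run against output copied from another machine.
libs_in_cache() {
    cache_listing=$1
    shift
    for lib in "$@"; do
        if grep -qF "$lib" "$cache_listing"; then
            echo "$lib: found"
        else
            echo "$lib: MISSING"
        fi
    done
}

# Typical use on the running system (requires ldconfig in PATH):
#   ldconfig -p > /tmp/ldcache.txt
#   libs_in_cache /tmp/ldcache.txt libgcrypt.so.11 libelf.so.0
```

Note that this only describes the normally booted system; the environment that runs after a crash may see a different set of libraries.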

It could also be worth a try to see if running “SuSEconfig --verbose” mentions anything particular.

Cheers,
Willem

Hi Willem,

The SLES 11 VM is a guest on a VMware host running ESXi 5.0.0, 623860.

SLES 11 SP2 64-bit. The ISO I used to install it was from the Novell downloads, and not VMware-specific.

myhost:~ # uname -a
Linux goat 3.0.13-0.27-default #1 SMP Wed Feb 15 13:33:49 UTC 2012 (d73692b) x86_64 x86_64 x86_64 GNU/Linux

Actually, the server was installed fresh with SLES 11 SP1, then an in-place upgrade to SP2 was done via YaST > Patch CD Upgrade. Here are my boot options:

myhost:~ # cat /boot/grub/menu.lst
# Modified by YaST2. Last modification on Wed Feb 29 11:31:23 EST 2012
default 0
timeout 8
##YaST - generic_mbr
gfxmenu (hd0,0)/boot/message
##YaST - activate

###Don't change this comment - YaST2 identifier: Original name: linux###
title SUSE Linux Enterprise Server 11 SP2 - 3.0.13-0.27 (default)
    root (hd0,0)
    kernel /boot/vmlinuz-3.0.13-0.27-default root=/dev/sda1 insmod=qla4xxx resume=/dev/sdb1 splash=silent crashkernel=256M-:128M showopts
    initrd /boot/initrd-3.0.13-0.27-default

###Don't change this comment - YaST2 identifier: Original name: failsafe###
title Failsafe -- SUSE Linux Enterprise Server 11 SP2 - 3.0.13-0.27
    root (hd0,0)
    kernel /boot/vmlinuz-3.0.13-0.27-default root=/dev/sda1 showopts ide=nodma apm=off noresume edd=off powersaved=off nohz=off highres=off processor.max_cstate=1 nomodeset x11failsafe
    initrd /boot/initrd-3.0.13-0.27-default

###Don't change this comment - YaST2 identifier: Original name: linux###
title Trace -- SUSE Linux Enterprise Server 11 SP2 - 3.0.13-0.27
    root (hd0,0)
    kernel /boot/vmlinuz-3.0.13-0.27-trace root=/dev/sda1 insmod=qla4xxx resume=/dev/sdb1 splash=silent crashkernel=256M-:128M showopts
    initrd /boot/initrd-3.0.13-0.27-trace

###Don't change this comment - YaST2 identifier: Original name: floppy###
title Floppy
    rootnoverify (fd0)
    chainloader +1

This VM is used as a testing ground for proposed software in our SLES 11 environment, but it is currently not running any “extra” software. That said, I’ve seen two other VMs crash that are actively used by our developers, and the same issue occurs on them when attempting to do a dump via Magic SysRq. The other two VMs run OpenLDAP and Shibboleth, but nothing else.

Hi
I would guess the filesystem containing the libraries has not been
mounted at that point hence it can’t find them (/dev/sda1)…

When you’re at that prompt, what is mounted? (just run the mount command).
The dmesg command may offer further information. In your VM, if you can
get to tty10 (ctrl+alt+F10), it has kernel messages.

I would guess that /dev/sdb1 is your swap partition (and used for resume), so
why would it be resuming? Are you sure the systems are not going into a
power management mode (and only seem to be frozen)? I would have thought
anything related to power saving would be disabled.


Cheers Malcolm °¿° (Linux Counter #276890)
openSUSE 12.1 (x86_64) Kernel 3.1.10-1.9-desktop
up 5 days 18:26, 4 users, load average: 0.05, 0.04, 0.05
CPU Intel i5 CPU M520@2.40GHz | Intel Arrandale GPU

This is what the mount command returns at that bash prompt (after Alt+SysRq+c is pressed):

bash-3.2# mount
proc on /proc type proc (rw)
sysfs on /sys type sysfs (rw)
udev on /dev type tmpfs (rw,mode=0755,nr_inodes=0)
tmpfs on /dev/shm type tmpfs (rw,mode=1777)
devpts on /dev/pts type devpts (rw,mode=0620,gid=5)
/dev/sda1 on /root type ext3 (rw,acl,user_xattr)
bash-3.2#

Cool, I did not know about tty10. I think it’s just Alt+F# in VMware to switch terminals, though. I was unable to get to tty10 while at the bash prompt. After doing “exit” it seems to finish booting OK and then I’m at the normal login prompt. I’m not sure whether this is a kexec/kdump kernel it puts me in, or which kernel the login prompt belongs to. Anyway, I’m able to get to tty10 at that point, but it’s past the shared lib error messages I was interested in.

Correct, /dev/sdb1 is swap and the setting for resume. Ya know, I’ve never questioned the “resume” kernel option. If it’s for hibernating or suspending a system for power-saving reasons, I guess I don’t need it? It’s a server, hence the reason we run SLES.

Hi
So /dev/sda1 is mounted as /root (the user) not / hence it can’t find
the libraries. Unless there are directories under /root?

What does the mount command say when all back and running?

Maybe a browse through the system BIOS may add some additional
information on any power saving features that are enabled?

AFAIK anything power related should be disabled, check down in /etc/pm
and /etc/pm-profiler if they exist.


Cheers Malcolm °¿° (Linux Counter #276890)

[QUOTE=malcolmlewis;4698]Hi
So /dev/sda1 is mounted as /root (the user) not / hence it can’t find
the libraries. Unless there are directories under /root?

What does the mount command say when all back and running?
[/QUOTE]

OK I didn’t notice that it was mounted as /root, but now that makes sense that it can’t find the libs.
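
A small sketch of how to confirm where the root device ends up in each environment. The function name `mountpoint_of` is made up for illustration; it only parses the usual `mount` output format of "device on mountpoint type fstype (options)".

```shell
#!/bin/sh
# Print the mount point(s) of a device, reading `mount`-style output on stdin.
mountpoint_of() {
    awk -v dev="$1" '$1 == dev && $2 == "on" { print $3 }'
}

# Typical use:
#   mount | mountpoint_of /dev/sda1
```

At the bash prompt shown earlier this would print `/root`, which suggests the libraries would be reachable under paths such as /root/lib and /root/usr/lib there, rather than at /lib and /usr/lib where the loader looks. Whether pointing the loader at those paths or using `chroot /root` is workable in that minimal environment is an assumption that has not been verified here.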

After issuing “exit” at the bash prompt, it boots to what looks like a healthy state and this is what mount says:

myhost:~ # mount
/dev/sda1 on / type ext3 (rw,acl,user_xattr)
proc on /proc type proc (rw)
sysfs on /sys type sysfs (rw)
devtmpfs on /dev type devtmpfs (rw,mode=0755)
tmpfs on /dev/shm type tmpfs (rw,mode=1777)
devpts on /dev/pts type devpts (rw,mode=0620,gid=5)
fusectl on /sys/fs/fuse/connections type fusectl (rw)
securityfs on /sys/kernel/security type securityfs (rw)
nfs1:/vol/local.sles11 on /usr/local type nfs (rw,nolock,addr=172.xxx.xxx.xxx)
nfs1:/vol/source.sles11 on /usr/local/src type nfs (rw,nolock,addr=172.xxx.xxx.xxx)
nfs1:/vol/misc on /usr/misc type nfs (rw,nolock,addr=172.xxx.xxx.xxx)
rpc_pipefs on /var/lib/nfs/rpc_pipefs type rpc_pipefs (rw)
none on /var/lib/ntp/proc type proc (ro,nosuid,nodev)

(I masked out my IPs in the output). I’m not concerned about the power-saving modules at this point, unless you think they’re affecting the ability to invoke Magic SysRq commands and kdump.

On a related note, I can’t seem to get the proper crashkernel setting for kdump.

myhost:~ # /etc/init.d/boot.kdump restart
Loading kdump
Then try loading kdump kernel
Memory for crashkernel is not reserved
Please reserve memory by passing "crashkernel=X@Y" parameter to the kernel
failed

You can see my menu.lst in an earlier post. I’m using the default, which is crashkernel=256M-:128M, and I have also tried various other settings with and without the @YM offset. I’m clueless as to what this should be set to.
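
For reference, in the kernel's range syntax `crashkernel=256M-:128M` means "reserve 128M if the system has at least 256M of RAM". The reservation is made at boot, so a change in menu.lst only takes effect after a reboot. The sketch below shows one way to check what the running kernel was actually given; the function names are made up for illustration, and it assumes the kernel exposes /sys/kernel/kexec_crash_size.

```shell
#!/bin/sh
# Print the crashkernel= value from a kernel command line file
# (defaults to the running kernel's /proc/cmdline); prints nothing if absent.
crashkernel_param() {
    cmdline_file=${1:-/proc/cmdline}
    tr ' ' '\n' < "$cmdline_file" | sed -n 's/^crashkernel=//p'
}

# Report how much memory the running kernel reserved for the crash kernel.
# Assumes the size file holds a byte count, as /sys/kernel/kexec_crash_size does.
crash_reservation() {
    size_file=${1:-/sys/kernel/kexec_crash_size}
    [ -r "$size_file" ] || { echo "unknown (no $size_file)"; return 1; }
    bytes=$(cat "$size_file")
    if [ "$bytes" -gt 0 ]; then
        echo "reserved $((bytes / 1048576)) MiB"
    else
        echo "nothing reserved"
    fi
}
```

If `crashkernel_param` prints nothing, the running kernel was booted without the option (for example from a different menu entry). If it prints a value but `crash_reservation` reports nothing reserved, the kernel saw the option and still did not set memory aside.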

I’m curious why the VM has that insmod entry for the qla4xxx… are you using iSCSI within the VM? Unless it is needed for the OS disk, that entry should not be needed in the grub section AFAIK.

Cheers,
Willem

[QUOTE=Magic31;4712]I’m curious why the VM has that insmod entry for the qla4xxx… are you using iSCSI within the VM? And unless it would also be for the OS disk, that line is not needed in the grub section AFAIK.

Cheers,
Willem[/QUOTE]

True, that shouldn’t be in there. This may have carried over from a PXE network install, where we have this set in the kernel options for our physical servers that use iSCSI. I’m going to remove it on my VMs.

[QUOTE=ashbyj;4699]OK I didn’t notice that it was mounted as /root, but now that makes sense that it can’t find the libs.
[/QUOTE]

Any idea why this would be mounting as /root instead of /?

Hi
I would guess because it’s in single user mode as root user it’s
creating a temporary /root for maintenance mode.

So there is no filesystem under /root, e.g. /root/etc?

It’s almost as though it’s adding an init=/bin/bash when booting…


Cheers Malcolm °¿° (Linux Counter #276890)

Yes, there appears to be some sort of filesystem under /root. Here are some screenshots of file listings under root and etc/, and also the help command, although you can’t see everything and the “more” command is not available.