Issue with dasd timeout

This morning I resolved a problem that I had been chasing, on and off, for several weeks. Please bear with me as I describe the problem and what I determined.

I have four SLES 11 SP4 servers that host DB2 under z/VM 6.4 on a z13s. To minimize the application outage for the upgrade from SLES 11 to SLES 12 SP2, I used the following process for each of the four servers:

  • Cloned the boot volume (a 3390 mod 9) for each server to a new dasd volume.
  • Used a test server to upgrade each boot volume, in place, to SLES 12 SP2. This included copying or merging info from the various .rpmnew files into the production files. I used a non-production IP address so that I could play with the SLES 12 system without affecting the production server.

I booted each SLES 12 volume several times during my testing. Note that the VM guest I was doing the upgrades on had only the boot volume defined.

At implementation time I shut down the running server and backed up its SLES 11 boot volume and its SLES 12 boot volume to tape from a z/OS system. I then restored each backup copy to the other disk device (the disk that was SLES 11 became SLES 12…) so that I didn’t have to modify the z/VM guest definition or the production backup/restore jobs.

I successfully upgraded three of the servers to SLES 12 with no issues. The fourth one would only boot into recovery mode.

The fourth server is an exact clone of one of the other servers (same number and types of 3390 physical volumes). There are over 500 dasd volumes that make up about 20 filesystems, each of which is an LV. This SLES 12 server would only boot into recovery mode, even when I modified /etc/fstab to remove all filesystems except the boot filesystem.

While in recovery mode, I was able to run a filesystem check against each filesystem (most are ext3, while several are reiserfs). I was then able to manually mount each filesystem. When I rebooted, the system again went into recovery mode.
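In recovery mode that was just the usual check-then-mount loop for each logical volume; roughly like this (the VG/LV name and mount point below are placeholders, not my actual names):

    # Check one filesystem, then mount it manually (placeholder names)
    fsck -y /dev/vgdb2/lv_data01
    mount /dev/vgdb2/lv_data01 /db2/data01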

I ran the suggested ‘journalctl’ command and parsed through the boot messages. At a point in the boot process, after all of the dasd volumes were detected, I saw that SLES 12 was doing some sort of PV interrogation of the 527 or so volumes. It issued a message for roughly the 28th volume (the exact volume changed on each boot) saying that whatever it was doing had timed out. It then started issuing [ERROR] messages and booted into recovery mode.
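For anyone chasing the same symptom, these are the sort of standard journalctl invocations that get you to the relevant messages (the exact error text will differ on your system):

    # Full log for the current boot
    journalctl -b

    # Only error-priority (and worse) messages from the current boot
    journalctl -b -p err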

I spent hours and hours searching the forums and Google to no avail.

It dawned on me late yesterday that this is a ‘clone’ of a production database server. When I checked the VM directory I found that the production server has access to five IFL engines while this server was restricted to one IFL (I thought they were set up the same at the z/VM level).

This morning I modified the VM directory to give my troubled server access to five IFLs, and it booted properly. Performance Toolkit showed that this server was using over 350% of the z/VM system (i.e. more than 3.5 IFLs) just to boot.

It appears that SLES 12 does far more interrogation of the PVs attached to the server than SLES 11 did. The SLES 11 copy of this server boots in less than 2 minutes. The SLES 12 copy takes at least 4 minutes before the prompt appears on the console and several more minutes before you can log in via PuTTY.

I believe that this problem is somehow related to systemd being used instead of SysVinit. It should not take that many more CPU cycles to boot the system in SLES 12 than it does with SLES 11.

Is there a parameter to make SLES 12 act like SLES 11 when interrogating the attached dasd devices?

Harley

x0500hl wrote:
> Is there a parameter to make SLES 12 act like SLES 11 when
> interrogating the attached dasd devices?

Hi Harley,

SLES 12 does have its own personality! :wink:

I’m not aware of any such parameter but perhaps someone else will jump
in with an idea. If not, you should open a Service Request.

Tech Support should be able to confirm whether this is a bug or
“working as designed”. If it is the latter, they may also be able to
provide a workaround.


Kevin Boyle - Knowledge Partner

RESOLVED.

I opened an SR with Novell to report this as a bug but I don’t think that they are going to address it as such.

During the investigation of the issue, Novell found that there were 76 SLES 11 packages still installed after I performed the in-place upgrade to SLES 12 SP2. They said that my server was in an ‘unsupported state’ and provided an rpm command to display the full package name (for each of the 76) and an rpm command to de-install each package. I processed the list in alphabetical order and found that some of the packages could not be de-installed without also specifying the names of packages that were further down in the list.
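I no longer have the exact commands from the SR handy, but they were along these lines (the package names below are placeholders):

    # Display the full name-version-release of an installed package
    rpm -q some-old-package

    # Remove several inter-dependent packages in one pass, since some of
    # them could not be de-installed individually
    rpm -e package-one package-two package-three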

To obtain the info on the SLES 11 packages installed on the SLES 12 system (example commands follow the list):

  • run a supportconfig
  • extract rpm.txt from the tarball and view it.
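
Roughly like this (supportconfig prints the actual tarball path when it finishes; the path below is just an example):

    # Gather the support data (run as root)
    supportconfig

    # Unpack the tarball it produced and view the package report
    tar -xjf /var/log/nts_myhost_171001_1200.tbz -C /tmp
    less /tmp/nts_myhost_171001_1200/rpm.txt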

I used the following process to bring the server up to the old runlevel 3 and get it out of emergency mode (a condensed command sketch follows the list):

  • Enter the root password.
  • Enter command ‘lvscan’. This will list all of the Logical Volumes and the status of each.
  • Enter command ‘lvchange -ay xxxxx’ for each Logical Volume that is not “Active”. Linux will mount the filesystem as part of the lvchange command.
  • Enter ‘mount -a’ to verify that all of the filesystems are mounted. If any are not, issue the lvscan and lvchange commands again.
  • Enter ‘systemctl default’ to switch to the old runlevel 3. Note that it may take a couple of minutes for this command to complete. Note that on a couple of occasions I received the normal Linux prompt but the server was not functional (no network). I had to reboot and go thru the above process again.

The resolution to the dasd interrogation timeout problem is to disable the lvmetad service. Prior to disabling this service my server used 100% of one IFL engine (on a z13s) for over 20 minutes and then ended up in emergency mode. After the change, the server booted in under 1 minute and used 100% of the IFL for maybe half that time. To disable lvmetad, edit /etc/lvm/lvm.conf and change the line to use_lvmetad = 0.
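For reference, the setting lives in the global section of /etc/lvm/lvm.conf; after the edit it should look like the excerpt below. Depending on the setup, the initrd may carry its own copy of lvm.conf, so rebuilding it (e.g. with ‘mkinitrd’) may also be needed for the change to take effect during early boot.

    # /etc/lvm/lvm.conf (excerpt)
    global {
        # 0 = do not use the lvmetad caching daemon; LVM scans devices directly
        use_lvmetad = 0
    }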

Early in the debugging process Novell had me change the ‘filter’ parameter in the same file. I was advised to change it to filter = [ "a|/dev/dasd.*|", "r/.*/" ]. I left this change in place but don’t think it really needs to stay now that lvmetad is disabled.
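For completeness, the filter line sits in the devices section of the same file; it tells LVM to scan only the dasd devices and ignore everything else:

    # /etc/lvm/lvm.conf (excerpt)
    devices {
        # Accept /dev/dasd* devices, reject all other block devices
        filter = [ "a|/dev/dasd.*|", "r/.*/" ]
    }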

At first I thought that this was only a problem on a Linux server that was limited in CPU power and had a large number of dasd volumes defined to it (in z/VM). This past weekend, the production copy of this server encountered the same problem, and it has access to five IFLs. Both servers have access to over 500 dasd volumes (each with the same number of 3390 mod 3s and mod 9s).

Harley

x0500hl wrote:
> This past weekend, the production copy of this
> server encountered the same problem

Hi Harley,

Having gone through the process already, hopefully you’ll have the issue
resolved in no time.

The feedback you provided is not the kind of information that is
readily available, and it will surely help others who experience
similar issues. It is very much appreciated.

Good luck.


Kevin Boyle - Knowledge Partner