Server will not boot in XEN after SLES11SP1 to SLES11SP2 upg

One of my servers, an IBM xSeries x366, has been running without issue on SLES11 SP1 / OES2 SP3 as a XEN host for multiple VMs. I just upgraded the OS to SLES11 SP2 / OES 11 SP1, and now the system reboots whenever I try to boot the XEN kernel. It starts to boot and I see the normal loading text on the screen, but then it switches the graphics mode, the screen goes black, and the server reboots.

It works fine with the standard kernel, and when I boot with the XEN Trace kernel it hangs looking for “Part2” of the system disk. The disks are assigned by ID, so that isn’t the issue.

If I knew how to set up ipmitool then I could capture the output, but I haven’t found a quick-start guide for that.

Any thoughts?

Thanks,
Brent

Hi Brent,

anything in the boot log? When, after the failing Xen boot, you start up the standard kernel, the previous boot’s log file should be available as “/var/log/boot.omsg”. If it came up to the point where syslog was already up, there might be entries in /var/log/messages, too.
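If the log is long, a small script can pull out the lines worth reading first. This is only a rough sketch: the keyword list and the helper names `suspicious_lines` and `scan_log` are made up for illustration, and the pattern will also match harmless lines (for example anything containing "debug").

```python
import re

# Keywords are an assumption, not an official list; adjust as needed.
PATTERN = re.compile(r"panic|bug|error|fail|apic", re.IGNORECASE)

def suspicious_lines(lines):
    """Return (line_number, text) pairs for lines matching PATTERN."""
    return [
        (number, line.rstrip("\n"))
        for number, line in enumerate(lines, start=1)
        if PATTERN.search(line)
    ]

def scan_log(path="/var/log/boot.omsg"):
    """Scan a log file; undecodable bytes are replaced rather than raising."""
    with open(path, errors="replace") as handle:
        return suspicious_lines(handle)
```

Calling `scan_log()` checks the previous boot's log, and `scan_log("/var/log/messages")` does the same for the syslog file.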

How about disabling graphics mode, at least while figuring out the current problems? That way you can follow the system startup on the console up to the failing point. If you require X11, you can always start it manually afterwards. And maybe the graphics driver is the cause; you’d verify that, too :).

Regards,
Jens

Hi Jens,

Thanks for the post. As you suggested, I have disabled the graphics mode and it appears I now get a little further. I haven’t reviewed the logs yet. Before, it died too soon for logging.

Now I am at the point where it hangs waiting for the system drive to appear, although it works fine in non-XEN mode. From some research I found that there was a known issue with this server’s RAID controller and multi-path devices in a prior kernel (SLES10 SP4), which hung waiting for the drive to appear, so just in case I have removed my Fibre card, its drivers, and the kernel modules relating to multi-path devices. At least I believe I have done so.

I’ll check the logs and post it.

Thanks,
Brent

Unfortunately it still doesn’t get to a point where logging occurs. I guess I am going to have to bite the bullet and set up a serial console to log, or research how to start logging earlier.

Have you checked to see if all the kernel-xen and xen specific modules have been updated correctly? Could be something along the lines of a mismatch of kernel and modules causing a driver load failure.
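One quick way to spot such a mismatch is to compare the Xen kernel images in /boot with the module directories under /lib/modules. The sketch below assumes the images are named `vmlinuz-<version>-xen` and that each has a matching `/lib/modules/<version>-xen` directory; the function names are hypothetical, and this does not replace checking the installed packages themselves.

```python
import os

PREFIX = "vmlinuz-"

def xen_kernel_versions(boot_entries):
    """Extract versions such as '3.0.13-0.27-xen' from kernel image names."""
    versions = set()
    for name in boot_entries:
        if name.startswith(PREFIX):
            version = name[len(PREFIX):]
            # Skips unversioned names such as the 'vmlinuz-xen' symlink.
            if version.endswith("-xen"):
                versions.add(version)
    return versions

def missing_module_dirs(boot_entries, module_dirs):
    """Xen kernel versions that have no matching module directory."""
    return sorted(xen_kernel_versions(boot_entries) - set(module_dirs))

def check_system(boot="/boot", modules="/lib/modules"):
    """Run the comparison against the live filesystem."""
    return missing_module_dirs(os.listdir(boot), os.listdir(modules))
```

An empty list from `check_system()` means every Xen kernel image found has a module directory; it says nothing about whether the modules inside are complete.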

-Willem

Thanks for your reply.

Yes, I did remove some old kernel modules, and I found I had to update the kernel config and regenerate the initrd. That didn’t fix the issue, but it allowed the server to boot further, and a serial console log revealed what appears to be a bug, which is being looked into by Novell/SUSE.

The server now stops on “Panic on CPU 6: Xen Bug IO-APIC.c : 129” .

Thanks,
Brent

Hi Brent,

I think you do indeed need SUSE to assist here. To see if you can get the system booted, have you tried adding “noapic” as an option at the boot loader screen?

Updating the machine’s BIOS could be another thing to try. After applying the latest BIOS, also check for any APIC-related settings in the BIOS that might have been disabled.
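If the option turns out to help and needs to go into the boot configuration permanently, the edit amounts to appending one word to the relevant command line without duplicating it. A minimal sketch, assuming the command line is available as a plain string; `with_option` is a made-up helper, and whether the option belongs on the hypervisor line or the kernel line is not settled here.

```python
def with_option(cmdline, option):
    """Return cmdline with option appended, unless it is already present."""
    words = cmdline.split()
    if option not in words:
        words.append(option)
    return " ".join(words)
```

For example, `with_option("/boot/xen.gz", "noapic")` yields `/boot/xen.gz noapic`, and applying it a second time changes nothing.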

-Willem

Where are you at with this? I have a whack of HS22s that were being upgraded from SLES11 GA to 11 SP2. The online upgrades went fine, but one machine needed a reinstall. SLES 11 SP2 detects it as a UEFI platform and forces elilo instead of grub. elilo insists that my elilo.conf XEN entry is incomplete, which prevents a XEN boot. When I do get a valid XEN configuration via a fresh installation, I get the failure to find sda that you mentioned previously.

Hi micajc,

It appears they may have a fix. Yesterday I tried some test code with them and the server did boot (finally). Now I am waiting for the fix to be implemented. Hopefully that is soon. It seems xen 4.2.x may be the answer, but I don’t know for sure.

I will post as soon as I get the final word.

Regards,
Brent

Hi Micajc,

It appears that my issue is a BIOS bug relating to the APIC timer, so I am not holding out for a “clean” solution, since I don’t think IBM will be patching x3850s, at least not the x366 (8863) variant. The last BIOS update they published was in 2008. There may be another option, but some testing is required.

Regards,
Brent

On 11/02/2012 03:24 PM, rbalcorn wrote:

Hi Micajc,

It appears that my issue is a BIOS bug relating to APIC timer. So I am not holding out for a “clean” solution since I don’t think IBM will be patching x3850’s, at least not the x366 (8863) variant. The last BIOS updated they published was in 2008. There may be another option, but, some testing is required.

Regards,
Brent

Have you tried booting with noapic?

Yes, I had tried that.

The issue was resolved with a fix in XEN; using the latest XEN 4.x kernels solves the problem.

Regards,
Brent