SLES12 server intermittent hanging problems with KVM

I have been using KVM for virtualization since SLES10 and have moved up to SLES11SP4 with no problems. Over a period of time I have experimented with moving to SLES12, service packs 1, 2, and now 3. Each time I do so, I find that the server may hang at times for periods of up to one minute before everything starts working again. I have tried this on SLES12 SP1, SP2 and now SP3, but always get the same results. I also tried this on a second server, with exactly the same thing happening. The servers are running an 8-core AMD processor with 32 GB RAM, and will work for months under SLES11 with no problems or any need to reboot.

Does anyone have any ideas or pointers on the problems I am getting with SLES12, given that SLES10 and 11 both work fine?

Regards

ChasR.

Hi
Are the systems running Btrfs? Is a Btrfs maintenance routine perhaps running?

No desktop environment running on the server, just console (tty)?

Hi,
the Btrfs maintenance is inactive, and I am using a desktop (GNOME) environment as I have always done on all the other implementations of SLES (10 and 11). There has never been a problem until SLES12, and I have migrated the same VMs over from 11 to 12, which now give the hanging problems.

I am actually going to drop back to 11 to give a usable system again.

Regards

ChasR.

I may have missed it, but as you are using virtualization (KVM), is this
SLES 12 system the host or the virtual machine? I ask because, perhaps
related to Btrfs, it is not a good idea to put certain workloads within a
Btrfs filesystem. There are limitations with using a Copy-on-Write
filesystem for huge files that change a lot, like virtual machine disk
files or transactional system files (relational databases, directories,
etc.). That does not mean you cannot use Btrfs for the system; those files
just need to be moved to a separate filesystem (e.g. XFS). One alternative
is to disable CoW on the directory structure where those files live, which
Btrfs supports if you set it on the directory before files are placed
into it.
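
As a rough illustration of the CoW point: the usual way to set the attribute is `chattr +C <directory>` on an empty directory, and `lsattr -d <directory>` shows whether it is set. The sketch below only checks the flag; the path /var/lib/libvirt/images is an assumed example location and the helper names are made up for illustration.

```python
import subprocess


def has_nocow(lsattr_line: str) -> bool:
    """Return True if one line of `lsattr -d` output carries the 'C' (No_COW) flag.

    The first whitespace-separated field is the flag string; the rest is the path.
    """
    if not lsattr_line.strip():
        return False
    flags = lsattr_line.split(maxsplit=1)[0]
    return "C" in flags  # uppercase C = No_COW (lowercase c is compression)


def directory_is_nocow(path: str) -> bool:
    """Run `lsattr -d` on a directory and report whether No_COW is set."""
    result = subprocess.run(
        ["lsattr", "-d", path], capture_output=True, text=True, check=True
    )
    return has_nocow(result.stdout)


if __name__ == "__main__":
    # Assumed example path; replace with wherever the VM disk files live.
    print(directory_is_nocow("/var/lib/libvirt/images"))
```

Note that the attribute only applies to files created after it is set, which is why the directory should be prepared before the disk images are copied in.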


Good luck.

If you find this post helpful and are logged into the web interface,
show your appreciation and click on the star below.

If you want to send me a private message, please let me know in the
forum as I do not use the web interface often.

Thanks for the reply. The system is the host and uses the Btrfs file system. I did not get this problem with earlier versions (10 & 11) of SLES, but cannot remember what the file system was in those cases. I may try mounting another drive with a different file system to see if this makes a difference. I have noticed that everything works fine until the KVM hypervisor is installed, and then I still get the hanging even when no VMs are running.

Cheers

ChasR.

SLES 10 would not have had Btrfs at all, and SLES 11 would never install
it by default but had it available should you elect to use it.

If you see the problem just because the hypervisor, or its service or
whatever, is installed or loaded, then that does not sound like Btrfs at
all, but it’s probably premature to rule anything in or out at this point.
SLES 12 changes a lot of other things, like the switch everybody is
making from Sys-V to systemd.

For what it is worth, I have SLES 12 boxes with KVM running and I have not
noticed this. Perhaps you can help us reproduce it, or provide an
autoinst.xml file, or a supportconfig. If this happens at certain times
(hours, or minutes of hours) that could be useful and we could look into
cron or snapper (which runs via cron among other ways). You could rebuild
with SLES 12 and use only XFS instead of Btrfs to rule out anything
Btrfs-related.
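
One way to see whether hangs cluster at particular minutes of the hour (which would point towards cron or snapper) is to note down when each hang starts and count them per minute. This is only a sketch: the function name is made up, and the timestamps are invented sample data, not taken from this thread.

```python
from collections import Counter
from datetime import datetime


def minute_histogram(timestamps):
    """Count hang start times per minute-of-hour (0-59)."""
    return Counter(datetime.fromisoformat(ts).minute for ts in timestamps)


if __name__ == "__main__":
    # Invented sample data for illustration only.
    hangs = [
        "2018-01-10 09:30:12",
        "2018-01-10 11:30:05",
        "2018-01-11 14:30:40",
        "2018-01-11 16:02:00",
    ]
    for minute, count in minute_histogram(hangs).most_common():
        print(f"minute {minute:02d}: {count} hang(s)")
```

If most hangs land on the same minute, comparing that minute against the crontab entries would be the next step; an even spread makes a scheduled job less likely.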

Could you describe how you experience the “hang”? Are you SSH’d in, or at
the terminal? If the latter, do other ttys respond at all? What do you
see from ‘top’ at the time, or from ‘uptime’ both before and right after
it returns?
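
If catching the hang by hand is difficult, a small watcher left running on the host could record it. The sketch below is an assumption about how one might do that, not an established tool: it sleeps for a fixed interval and, if far more time has passed than expected, prints the length of the stall and the load averages right after it returns. The 5 second tolerance is an arbitrary choice.

```python
import os
import time


def find_stalls(ticks, interval=1.0, tolerance=5.0):
    """Return (tick_before, gap_seconds) for every pair of consecutive
    ticks that are further apart than interval + tolerance."""
    return [
        (before, after - before)
        for before, after in zip(ticks, ticks[1:])
        if after - before > interval + tolerance
    ]


def watch(interval=1.0, tolerance=5.0):
    """Loop forever; report whenever a sleep took much longer than asked."""
    last = time.monotonic()
    while True:
        time.sleep(interval)
        now = time.monotonic()
        if now - last > interval + tolerance:
            stamp = time.strftime("%Y-%m-%d %H:%M:%S")
            # os.getloadavg() gives the 1, 5 and 15 minute load averages,
            # the same figures 'uptime' prints.
            print(f"{stamp} stall of {now - last:.1f}s, loadavg {os.getloadavg()}")
        last = now


if __name__ == "__main__":
    watch()
```

Redirecting the output to a file on a filesystem other than the one under suspicion keeps the log itself from being caught up in the problem.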


Good luck.


Hi,
I have given some more thought to the problem. I have re-installed SLES12 SP3 in graphical mode to set things up, and have now run systemctl to move it to multi-user mode. Also, I had a Windows Server VM mounted that was running an old driver pack, so I have updated the driver pack to the latest version. At the moment things are working OK.

Will keep you posted on the result.

Thanks for the advice,

ChasR

It has now been 8 days since I made the changes described in my earlier post, and so far there have been no problems at all (fingers crossed). I think that setting up the server using the GNOME graphical interface and then changing to multi-user as the default for normal operation has been a contributing factor in curing my problems. If I need to make any changes in future, I will just swap back to the graphical interface, do the business and go back to multi-user.
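
For anyone wanting to follow the same routine, the switching can be sketched as below. The post does not say which systemctl invocations were used, so `set-default` and `isolate` with the standard `multi-user.target` and `graphical.target` are an assumption; the helper names are made up, and the script only prints the commands unless `dry_run` is turned off.

```python
import subprocess

MULTI_USER = "multi-user.target"
GRAPHICAL = "graphical.target"


def set_default_cmd(target):
    """Command that makes `target` the default at the next boot."""
    return ["systemctl", "set-default", target]


def isolate_cmd(target):
    """Command that switches the running system to `target` now."""
    return ["systemctl", "isolate", target]


def run(cmd, dry_run=True):
    """Print the command, or execute it (needs root) when dry_run is False."""
    if dry_run:
        print("would run:", " ".join(cmd))
        return
    subprocess.run(cmd, check=True)


if __name__ == "__main__":
    run(set_default_cmd(MULTI_USER))  # boot to console by default
    run(isolate_cmd(GRAPHICAL))       # temporarily bring up the desktop
    run(isolate_cmd(MULTI_USER))      # drop back to console afterwards
```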

I will post if there are any significant problems in the forthcoming weeks.

Cheers

ChasR

Found where the problem lies: it is the GNOME desktop environment when you have a number of VMs running. It works fine in multi-user mode, but if you then switch to graphical mode while VMs are running, everything grinds to a halt after a couple of minutes. This happened today, and is exactly what has happened in the past with SLES12. I will only use graphical mode to make setup changes on a fresh reboot of the server with no VMs loaded.

Cheers

ChasR.

Thank you for posting back your results; if you have Service Requests (SR)
available you may want to report this directly to SUSE as well, as it does
not sound like something that should happen.

Good luck.


Update to previous post:

I have finally given up on SLES12 as the KVM server and moved everything back to SLES11SP4. The reason was that, even in multi-user mode, the system gradually slowed down, and it took me 5 hours to successfully unload all the loaded VMs before I could reboot the server. When it came back up in multi-user mode, the response was dreadful (no updates had been installed since the initial install). After loading one VM and then unloading it again (which took nearly 2 hours), I re-installed SLES11 SP4 in graphical mode, set it all up and then ran it in runlevel 3 to serve the VMs. The response time of all the VMs was back to normal, and much improved on the SLES12 installation. This has been running for a week, and no degradation has been noticed anywhere in the system.

I actually have 2 identical servers: one which I used for SLES12 testing, and the other (the main server) which I have moved from 11 to 12 and now back to 11 again. I have tried all incarnations of 12 (SP1, SP2 and SP3) and each time have experienced the “hanging” effect, though not so immediately on SLES12 SP3. So for now I will stick with SLES11 to host my KVM VMs and wait and see if SLES12 SP4 will be an improvement.

By the way, there are no problems with SLES12 SP3 until the KVM hypervisor is installed and you try running VMs (in my experience), though this may be due to the hardware I am using (which is fine on SLES11 and all earlier incarnations). SLES12 SP3 ran fine until I tried KVM virtualization.

Hope this is of interest

Regards

ChasR.

Hi ChasR,

As this is persistent across the whole SLES12 range, I’d really like to see you get in touch with SUSE engineering. Are you in a position to open a service request?

Regards,
J