Xen live migration for fully virtualized domUs hangs

Version Info:

drbd-8.4.4-0.22.9
xen-4.2.4_02-0.7.1
lvm2-2.02.98-0.29.1 <---- updated recently
lvm2-clvm-2.02.98-0.29.1 <---- updated recently

I have a two-node cluster running SLES 11 SP3 with the HA Extension.
It has worked fine with 14 domUs, including four Windows Server 2008 VMs.
We apply the online updates once a month.

For the past two or three weeks, live migration of the fully virtualized Windows domUs has been failing.
The Linux VMs have no problem.

I tried to migrate manually, independently of Pacemaker, with the command:

               migrate winsrv2008 ha1infra -live

and hit the same problem: the VM moves from node ha2infra to ha1infra and then hangs.

It shows the Windows screen but is not reachable.
The xend log on the 'migrate_to' node ends with:

[2014-08-13 11:00:17 10651] DEBUG (image:981) args: boot, val: c
[2014-08-13 11:00:17 10651] DEBUG (image:981) args: fda, val: None
[2014-08-13 11:00:17 10651] DEBUG (image:981) args: fdb, val: None
[2014-08-13 11:00:17 10651] DEBUG (image:981) args: soundhw, val: None
[2014-08-13 11:00:17 10651] DEBUG (image:981) args: localtime, val: 1
[2014-08-13 11:00:17 10651] DEBUG (image:981) args: serial, val: ['pty']
[2014-08-13 11:00:17 10651] DEBUG (image:981) args: std-vga, val: 0
[2014-08-13 11:00:17 10651] DEBUG (image:981) args: isa, val: 0
[2014-08-13 11:00:17 10651] DEBUG (image:981) args: acpi, val: 1
[2014-08-13 11:00:17 10651] DEBUG (image:981) args: usb, val: 1
[2014-08-13 11:00:17 10651] DEBUG (image:981) args: usbdevice, val: tablet
[2014-08-13 11:00:17 10651] DEBUG (image:981) args: gfx_passthru, val: None
[2014-08-13 11:00:17 10651] DEBUG (image:981) args: watchdog, val: None
[2014-08-13 11:00:17 10651] DEBUG (image:981) args: watchdog-action, val: reset
[2014-08-13 11:00:17 10651] INFO (image:909) Need to create platform device.[domid:15]
[2014-08-13 11:00:17 10651] INFO (image:505) spawning device models: /usr/lib/xen/bin/qemu-dm ['/usr/lib/xen/bin/qemu-dm', '-d', '15', '-domain-name', 'winsrv2008', '-videoram', '4', '-k', 'de', '-vnc', '127.0.0.1:0', '-vncunused', '-vcpus', '2', '-vcpu_avail', '0x3L', '-boot', 'c', '-localtime', '-serial', 'pty', '-acpi', '-usb', '-usbdevice', 'tablet', '-watchdog-action', 'reset', '-net', 'none', '-M', 'xenfv', '-loadvm', '/var/lib/xen/qemu-resume.15']
[2014-08-13 11:00:17 10651] INFO (image:554) device model pid: 1277
[2014-08-13 11:00:17 10651] DEBUG (XendDomainInfo:1908) Storing domain details: {'console/port': '5', 'description': 'None', 'console/limit': '1048576', 'vm': '/vm/95ae0edb-feaf-e439-535c-b9b6a463fd30-2', 'domid': '15', 'store/port': '4', 'console/type': 'ioemu', 'cpu/0/availability': 'online', 'memory/target': '4194304', 'control/platform-feature-multiprocessor-suspend': '1', 'store/ring-ref': '1044476', 'cpu/1/availability': 'online', 'control/platform-feature-xs_reset_watches': '1', 'image/suspend-cancel': '1', 'name': 'winsrv2008'}

[2014-08-13 11:00:17 10651] INFO (image:677) waiting for sentinel_fifo
[2014-08-13 11:00:17 10651] DEBUG (XendDomainInfo:3165) XendDomainInfo.completeRestore done
[2014-08-13 11:00:17 10651] DEBUG (XendDomainInfo:1995) XendDomainInfo.handleShutdownWatch
[2014-08-13 11:00:17 10651] DEBUG (DevController:139) Waiting for devices tap2.
[2014-08-13 11:00:17 10651] DEBUG (DevController:139) Waiting for devices vif.
[2014-08-13 11:00:17 10651] DEBUG (DevController:144) Waiting for 0.
[2014-08-13 11:00:17 10651] DEBUG (DevController:671) hotplugStatusCallback /local/domain/0/backend/vif/15/0/hotplug-status.
[2014-08-13 11:00:17 10651] DEBUG (DevController:685) hotplugStatusCallback 1.
[2014-08-13 11:00:17 10651] DEBUG (DevController:139) Waiting for devices vkbd.
[2014-08-13 11:00:17 10651] DEBUG (DevController:139) Waiting for devices ioports.
[2014-08-13 11:00:17 10651] DEBUG (DevController:139) Waiting for devices tap.
[2014-08-13 11:00:17 10651] DEBUG (DevController:139) Waiting for devices vif2.
[2014-08-13 11:00:17 10651] DEBUG (DevController:139) Waiting for devices console.
[2014-08-13 11:00:17 10651] DEBUG (DevController:144) Waiting for 0.
[2014-08-13 11:00:17 10651] DEBUG (DevController:139) Waiting for devices vscsi.
[2014-08-13 11:00:17 10651] DEBUG (DevController:139) Waiting for devices vbd.
[2014-08-13 11:00:17 10651] DEBUG (DevController:144) Waiting for 768.
[2014-08-13 11:00:17 10651] DEBUG (DevController:671) hotplugStatusCallback /local/domain/0/backend/vbd/15/768/hotplug-status.
[2014-08-13 11:00:17 10651] DEBUG (DevController:685) hotplugStatusCallback 1.
[2014-08-13 11:00:17 10651] DEBUG (DevController:139) Waiting for devices irq.
[2014-08-13 11:00:17 10651] DEBUG (DevController:139) Waiting for devices vfb.
[2014-08-13 11:00:17 10651] DEBUG (DevController:139) Waiting for devices pci.
[2014-08-13 11:00:17 10651] DEBUG (DevController:139) Waiting for devices vusb.
[2014-08-13 11:00:17 10651] DEBUG (DevController:139) Waiting for devices vtpm.

Any clue? Any idea?
Live migration works for paravirtualized Linux domUs;
for fully virtualized Windows domUs it fails.

[QUOTE=karo80;23138]…since two or three weeks the live migration for the fully virtualized Windows domUs fails.
No problem for the linux VMs…[/QUOTE]

Which version of the VMDP pack do you have installed on those Windows VMs?

This reminds me of an older issue I've come across (with VMDP 1.5/1.7, IIRC) where a freeze/hang of the domU happens during live migration of Windows domUs that have only one vCPU. So I'm curious: how many vCPUs do the Windows domUs have?
If this hang/freeze also happens with Windows VMs that already have 2 or more vCPUs, it's probably not that.

I have one Xen site whose hosts were also patched a week or three ago, and I haven't seen issues there. I don't use the HAE extension on my setups, though. I'll check which lvm package versions I have running there, but I can't do that right now (no access to that site at the moment).

Cheers,
Willem

My Windows domUs have 2 vCPUs; the dom0 host has 20.

Here are some lines from the /etc/xen/vm/… file:

name="winsrv2008"
description="None"
uuid="95ae0edb-feaf-e439-535c-b9b6a463fd30"
memory=4096
maxmem=4096
vcpus=2 <----- the domU has two vCPUs
cpus="2-19" <----- cpu0 and cpu1 are reserved for dom0
on_poweroff=“destroy”

karl

[QUOTE=karo80;23144]my Windows domUs have 2 CPUs, the dom0 has 20

here are some lines of the /etc/xen/vm/… file:

name="winsrv2008"
description="None"
uuid="95ae0edb-feaf-e439-535c-b9b6a463fd30"
memory=4096
maxmem=4096
vcpus=2 <----- the domU has two cpus
cpus="2-19" <----- cpu0 and cpu1 are reserved for dom0
on_poweroff=“destroy”

karl[/QUOTE]

Ok, so that seems to be different to what I was seeing.

And which VMDP version do you have running on those Windows domUs?

-Willem

I have VMDP-WIN-2.1

karl

Some news.

I moved the Windows domU to a test cluster at the same software level,
but with a shared SAN instead of DRBD.
The migration works!
So the problem could lie between DRBD and cLVM.

I'm appending my DRBD configuration.

The software stack:

  • the DRBD resource r0 is the PV for the clustered VG, and
  • the LVs hold the Xen images.

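Not from the original post, but a minimal sketch of how one could double-check that the VG on top of r0 really is clustered (in lvm2, the sixth character of the vgs attribute string is 'c' for a clustered VG). The VG name vg_xen is hypothetical; a sample attribute string stands in for the live value:

```shell
# Hedged sketch: inspect the clustered bit of a VG's attribute string.
# On the live cluster you would fetch it with something like:
#   attr=$(vgs --noheadings -o vg_attr vg_xen | tr -d ' ')
# ("vg_xen" is a hypothetical name; use the real VG.)
attr='wz--nc'     # sample value; 6th character 'c' => clustered (cLVM)
case "$attr" in
  *c) echo "VG is clustered" ;;
  *)  echo "VG is NOT clustered" ;;
esac
```

If the attribute ended in '-' instead, the VG would not be cluster-aware, and activating its LVs from both nodes around a live migration would be unsafe.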
resource r0 {
    startup {
        become-primary-on both;
    }
    net {
        allow-two-primaries;
        after-sb-0pri discard-zero-changes;
        after-sb-1pri discard-secondary;
        after-sb-2pri disconnect;
        verify-alg md5;
        max-buffers 8192;
        max-epoch-size 8192;
        sndbuf-size 512k;
        ko-count 4;
    }
    device /dev/drbd_r0 minor 0;
    meta-disk internal;
    on ha1infra {
        address 172.17.232.11:7788;
        disk /dev/disk/by-id/dm-uuid-part1-mpath-360080e50001c150e00000bee52eb1a88;
    }
    on ha2infra {
        address 172.17.232.12:7788;
        disk /dev/disk/by-id/scsi-360080e500036de180000035151d0f3e5-part1;
    }
    syncer {
        rate 50M;
    }
}
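One thing worth ruling out on this setup (my addition, not from the thread): with the allow-two-primaries config above, live migration requires r0 to actually be in the Primary/Primary role on both nodes while the domU moves. A small sketch of checking the ro: field of a /proc/drbd status line; the sample line below is illustrative, and on a real node you would grep the live file (or run `drbdadm role r0`):

```shell
# Hedged sketch: parse the ro: (roles) field of a /proc/drbd status line.
# On a live node:  line=$(grep 'cs:' /proc/drbd)
# Sample line used here for illustration:
line=' 0: cs:Connected ro:Primary/Primary ds:UpToDate/UpToDate C r-----'
role=${line#*ro:}          # strip everything up to and including "ro:"
role=${role%% *}           # keep the first field, e.g. Primary/Primary
if [ "$role" = "Primary/Primary" ]; then
  echo "dual-primary OK"
else
  echo "NOT dual-primary: $role"
fi
```

If the role shows Primary/Secondary at migration time, the target node cannot open the backing LV read-write and the incoming domU would stall in exactly this way.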

My problem appeared two weeks ago, around the time
these two lvm updates came in:

lvm2-2.02.98-0.29.1
lvm2-clvm-2.02.98-0.29.1

karl