Qemu-system-i386 segfaults

Hello everyone,

We have 5 HP BL460 Gen9 servers running as XEN hypervisors under SLES 12 SP1; each server hosts 4-5 fullvirt SLES 12 SP1 guests.
About once every two months a random DomU guest becomes unresponsive: in xl list the state shows just "------", and while I can ping the guest, SSH/VNC does not respond. The only way to bring the system back is to power it off and restart it from virt-manager.

In the system logs at that time I can see qemu-system-i386 segfaults:

[16312686.295207] IPv6: udp checksum is 0
[18570901.441606] qemu-system-i38[3619]: segfault at 0 ip 00007fcc3b3e1fae sp 00007ffeed8f5068 error 4 in libc-2.19.so[7fcc3b352000+19e000]
[18570901.527129] br0: port 3(vif1.0-emu) entered disabled state

This happens on all XEN hypervisors.
xl info:

host                   : MSK-HVX05
release                : 3.12.49-11-xen
version                : #1 SMP Wed Nov 11 20:52:43 UTC 2015 (8d714a0)
machine                : x86_64
nr_cpus                : 40
max_cpu_id             : 39
nr_nodes               : 2
cores_per_socket       : 10
threads_per_core       : 2
cpu_mhz                : 2297
hw_caps                : bfebfbff:2c100800:00000000:00007f00:77fefbff:00000000:00000021:000037ab
virt_caps              : hvm hvm_directio
total_memory           : 262015
free_memory            : 159893
sharing_freed_memory   : 0
sharing_used_memory    : 0
outstanding_claims     : 0
free_cpus              : 0
xen_major              : 4
xen_minor              : 5
xen_extra              : .1_12-2
xen_version            : 4.5.1_12-2
xen_caps               : xen-3.0-x86_64 xen-3.0-x86_32p hvm-3.0-x86_32 hvm-3.0-x86_32p hvm-3.0-x86_64
xen_scheduler          : credit
xen_pagesize           : 4096
platform_params        : virt_start=0xffff800000000000
xen_changeset          :
xen_commandline        : dom0_mem=5785M,max:5785M
cc_compiler            : gcc (SUSE Linux) 4.8.5
cc_compile_by          : abuild
cc_compile_domain      : suse.de
cc_compile_date        : Thu Nov 5 14:42:08 UTC 2015
xend_config_format     : 4

I am trying to capture a core dump, but I don't know which of these I need to get:

  1. Core dump of domU kernel
  2. Core dump of crashed qemu-system-i386 process on hypervisor

Please give me some advice on what I need to capture.

Hi Cernishov,

I strongly suggest opening a service call and having an engineer look into this. You'll receive the corresponding requests for collecting details during the SR process.

Regarding your question, I believe that enabling core dumps for the qemu process would help (see ulimit -c), but this would have to be done before starting the VM. Please note that if you set it to "unlimited", you may get a pretty large dump file on the host file system.
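
As a minimal sketch of what I mean (the guest config path and the /var/crash core location here are just assumptions for illustration):

# Run these in the shell that will start the VM, so qemu inherits the limit:
ulimit -c unlimited
# Direct cores somewhere with enough free space (path is an assumption):
mkdir -p /var/crash
sysctl -w kernel.core_pattern=/var/crash/core.%e.%p
# Start the guest from this same shell (hypothetical config path):
xl create /etc/xen/guest.cfg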

Regards,
J

Yeah, we do plan to open an SR, but I feel that without core dumps it would be impossible to pin down the problem, and since it could take up to 3 months just to catch the crash, I thought I would ask in the forums first.

I guess I should follow these instructions - https://www.novell.com/support/kb/doc.php?id=3054866

But how can I test that the dumps will really be created? For example, would it be enough to just send SIGSEGV to the qemu process to check whether a core gets dumped?
I did some tests on SLES 11 SP3, configured ulimit and core_pattern according to the instructions (in sysctl.conf too), and restarted the system several times, but dumps were generated only for simple examples, like

top &
kill -6 $!    # $! is the PID of the backgrounded top; SIGABRT (-6) should produce a core
fg %1

When I tried to send SIGABRT or SIGSEGV to the qemu process, I saw no dump generated.

Hi Cernishov,

"restarted the system several times"

The invocation of "ulimit -c unlimited" only affects the current shell and its children and is not persistent across reboots. The same holds true for direct invocation of "sysctl -w"… so if you ran these steps, then rebooted the machine and did not rerun them, you'd see no effect. The same is true if you did not start the VM as a child of the session in which you raised the core size limit.
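
If you want the settings to survive a reboot, something along these lines should work (a sketch; the core_pattern value just mirrors the example above):

# Persist the core pattern via sysctl.conf and re-apply it without rebooting:
echo 'kernel.core_pattern=/var/crash/core.%e.%p' >> /etc/sysctl.conf
sysctl -p
# The core size limit still has to be raised in the shell that starts the VM:
ulimit -c unlimited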

By looking at the content of /proc/<pid>/limits you can see whether your changes are effective for your process.
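
For example (the pgrep pattern is an assumption; pick the right PID yourself if several qemu processes are running):

# Find a qemu PID and check its effective core file size limit:
pid=$(pgrep -f qemu-system-i386 | head -n1)
grep 'Max core file size' /proc/$pid/limits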

Sending a kill -SEGV to a process should be sufficient to simulate the effect you see with qemu. OTOH, the programmers may have decided to catch that signal within the process and react in their own special way, so there's a slight chance (though not awfully likely) that no core is generated for that reason.
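
Something like this would do as a test (again assuming the /var/crash core_pattern from the example above):

# Send SIGSEGV to the qemu process found above and look for the resulting core:
kill -SEGV $pid
ls -l /var/crash/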

Regards,
J