Hi to all,
We have 4 UCS-B200-M3 blades running SLES 11 SP3 as KVM hosts supporting SAP.
This morning one of our servers had a crash on 4 of its 13 vNICs.
The dump in /var/log/messages is:
Aug 13 04:47:10 hv-1 kernel: [3511310.839912] ------------[ cut here ]------------
Aug 13 04:47:10 hv-1 kernel: [3511310.839922] WARNING: at /usr/src/packages/BUILD/kernel-default-3.0.101/linux-3.0/net/sched/sch_generic.c:255 dev_watchdog+0x23e/0x250()
Aug 13 04:47:10 hv-1 kernel: [3511310.839925] Hardware name: UCSB-B200-M3
Aug 13 04:47:10 hv-1 kernel: [3511310.839927] NETDEV WATCHDOG: kvm-cluster-pec (enic): transmit queue 0 timed out
Aug 13 04:47:10 hv-1 kernel: [3511310.839928] Modules linked in: af_packet ip6table_filter ip6_tables iptable_filter ip_tables ebtable_nat ebtables x_tables edd bridge stp llc cpufreq_conservative cpufreq_userspace cpufreq_powersave acpi_cpufreq mperf fuse loop vhost_net macvtap macvlan tun kvm_intel kvm ipv6 ipv6_lib joydev pcspkr iTCO_wdt iTCO_vendor_support i2c_i801 button acpi_power_meter enic container ac sg wmi rtc_cmos ext3 jbd mbcache usbhid hid sd_mod crc_t10dif ttm drm_kms_helper drm i2c_algo_bit sysimgblt sysfillrect i2c_core syscopyarea ehci_hcd usbcore usb_common processor thermal_sys hwmon dm_service_time dm_least_pending dm_queue_length dm_round_robin dm_multipath scsi_dh_hp_sw scsi_dh_emc scsi_dh_alua scsi_dh_rdac scsi_dh dm_snapshot dm_mod fnic libfcoe libfc scsi_transport_fc scsi_tgt megaraid_sas scsi_mod
Aug 13 04:47:10 hv-1 kernel: [3511310.839984] Supported: Yes
Aug 13 04:47:10 hv-1 kernel: [3511310.839987] Pid: 0, comm: swapper Not tainted 3.0.101-0.21-default #1
Aug 13 04:47:10 hv-1 kernel: [3511310.839989] Call Trace:
Aug 13 04:47:10 hv-1 kernel: [3511310.840020] [] dump_trace+0x75/0x310
Aug 13 04:47:10 hv-1 kernel: [3511310.840032] [] dump_stack+0x69/0x6f
Aug 13 04:47:10 hv-1 kernel: [3511310.840041] [] warn_slowpath_common+0x7b/0xc0
Aug 13 04:47:10 hv-1 kernel: [3511310.840049] [] warn_slowpath_fmt+0x45/0x50
Aug 13 04:47:10 hv-1 kernel: [3511310.840057] [] dev_watchdog+0x23e/0x250
Aug 13 04:47:10 hv-1 kernel: [3511310.840069] [] call_timer_fn+0x6b/0x120
Aug 13 04:47:10 hv-1 kernel: [3511310.840077] [] run_timer_softirq+0x173/0x240
Aug 13 04:47:10 hv-1 kernel: [3511310.840087] [] __do_softirq+0x11f/0x260
Aug 13 04:47:10 hv-1 kernel: [3511310.840096] [] call_softirq+0x1c/0x30
Aug 13 04:47:10 hv-1 kernel: [3511310.840107] [] do_softirq+0x65/0xa0
Aug 13 04:47:10 hv-1 kernel: [3511310.840114] [] irq_exit+0xc5/0xe0
Aug 13 04:47:10 hv-1 kernel: [3511310.840122] [] smp_apic_timer_interrupt+0x68/0xa0
Aug 13 04:47:10 hv-1 kernel: [3511310.840130] [] apic_timer_interrupt+0x13/0x20
Aug 13 04:47:10 hv-1 kernel: [3511310.840142] [] intel_idle+0xa1/0x130
Aug 13 04:47:10 hv-1 kernel: [3511310.840152] [] cpuidle_idle_call+0x11b/0x280
Aug 13 04:47:10 hv-1 kernel: [3511310.840161] [] cpu_idle+0x66/0xb0
Aug 13 04:47:10 hv-1 kernel: [3511310.840172] [] start_kernel+0x376/0x447
Aug 13 04:47:10 hv-1 kernel: [3511310.840180] [] x86_64_start_kernel+0x123/0x13d
Aug 13 04:47:10 hv-1 kernel: [3511310.840186] ---[ end trace f0165b8680ad586b ]---
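For anyone who wants to check their own hosts for the same symptom, here is a minimal Python sketch that counts the "NETDEV WATCHDOG ... transmit queue N timed out" messages per interface. It assumes the message format is exactly as in the dump above; the function names (count_watchdog_timeouts, report) are just illustrative and not part of any tool we run.

```python
import re
from collections import Counter

# Assumption: the kernel message looks like the one in the dump above, e.g.
# "NETDEV WATCHDOG: kvm-cluster-pec (enic): transmit queue 0 timed out"
WATCHDOG_RE = re.compile(
    r"NETDEV WATCHDOG: (?P<iface>\S+) \((?P<driver>[^)]+)\): "
    r"transmit queue (?P<queue>\d+) timed out"
)


def count_watchdog_timeouts(lines):
    """Count TX watchdog timeouts per (interface, driver, queue)."""
    counts = Counter()
    for line in lines:
        match = WATCHDOG_RE.search(line)
        if match:
            key = (
                match.group("iface"),
                match.group("driver"),
                int(match.group("queue")),
            )
            counts[key] += 1
    return counts


def report(path="/var/log/messages"):
    """Print one summary line per affected interface/queue found in the log."""
    # errors="replace" keeps the scan going if the log has odd bytes in it.
    with open(path, errors="replace") as log:
        counts = count_watchdog_timeouts(log)
    for (iface, driver, queue), n in sorted(counts.items()):
        print(f"{iface} ({driver}) queue {queue}: {n} timeout(s)")
```

Running report() on each hypervisor should show quickly whether the other three hosts ever logged the same timeout, and which interfaces and queues were hit on the affected one.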
I cannot recover from this without shutting down the server.
So far, after several tests and some debugging, I have made no progress on solving this issue or finding its root cause.
On the UCS side there are no logs or errors regarding the NICs.
So I am asking here whether anyone has seen this type of error on this type of system and can suggest a more insightful way of solving it.
NOTE - I have not rebooted the server, as we are still trying to figure out the root cause, since this is not happening on the other 3 servers, which have the same configuration. All VMs were moved to the other hypervisors.
Thank you for your support.
Jorge Gomes