BUG: soft lockup on CPU during heavy load

Hi,

I just upgraded an HP DL580 G7 with 40 CPUs (80 HT) and 512 GB memory to SLES 11 SP2. Running jobs that cause heavy CPU load results in several errors. dmesg gives me the following report.


[124560.864252] BUG: soft lockup - CPU#67 stuck for 22s! [exe:17646]
[124560.864257] Modules linked in: af_packet st sd_mod crc_t10dif ide_cd_mod ide_core lp parport_pc ppdev parport autofs4 edd xt_tcpudp xt_pkttype ipt_LOG xt_limit nfs lockd fscache auth_rpcgss nfs_acl sunrpc cpufreq_conservative cpufreq_userspace cpufreq_powersave pcc_cpufreq mperf microcode xt_NOTRACK ipt_REJECT xt_state iptable_raw iptable_filter nf_conntrack_netbios_ns nf_conntrack_broadcast nf_conntrack_ipv4 nf_conntrack nf_defrag_ipv4 ip_tables ip6_tables x_tables fuse loop dm_mod ipv6_lib qlcnic i7core_edac netxen_nic sg edac_core sr_mod cdrom hpilo iTCO_wdt hpwdt iTCO_vendor_support joydev pcspkr serio_raw container rtc_cmos acpi_power_meter button ext3 jbd mbcache usbhid hid uhci_hcd ehci_hcd usbcore usb_common thermal processor thermal_sys hwmon scsi_dh_hp_sw scsi_dh_alua scsi_dh_rdac scsi_dh_emc scsi_dh ata_generic ata_piix libata hpsa cciss scsi_mod [last unloaded: parport_pc]
[124560.864350] Supported: Yes
[124560.864353] CPU 67 
[124560.864355] Modules linked in: af_packet st sd_mod crc_t10dif ide_cd_mod ide_core lp parport_pc ppdev parport autofs4 edd xt_tcpudp xt_pkttype ipt_LOG xt_limit nfs lockd fscache auth_rpcgss nfs_acl sunrpc cpufreq_conservative cpufreq_userspace cpufreq_powersave pcc_cpufreq mperf microcode xt_NOTRACK ipt_REJECT xt_state iptable_raw iptable_filter nf_conntrack_netbios_ns nf_conntrack_broadcast nf_conntrack_ipv4 nf_conntrack nf_defrag_ipv4 ip_tables ip6_tables x_tables fuse loop dm_mod ipv6_lib qlcnic i7core_edac netxen_nic sg edac_core sr_mod cdrom hpilo iTCO_wdt hpwdt iTCO_vendor_support joydev pcspkr serio_raw container rtc_cmos acpi_power_meter button ext3 jbd mbcache usbhid hid uhci_hcd ehci_hcd usbcore usb_common thermal processor thermal_sys hwmon scsi_dh_hp_sw scsi_dh_alua scsi_dh_rdac scsi_dh_emc scsi_dh ata_generic ata_piix libata hpsa cciss scsi_mod [last unloaded: parport_pc]
[124560.864416] Supported: Yes
[124560.864419] 
[124560.864423] Pid: 17646, comm: exe Not tainted 3.0.13-0.27-default #1 HP ProLiant DL580 G7
[124560.864428] RIP: 0010:[<ffffffff81441a08>]  [<ffffffff81441a08>] _raw_spin_unlock_irqrestore+0x8/0x10
[124560.864445] RSP: 0000:ffff8832fa663970  EFLAGS: 00000246
[124560.864448] RAX: 0000000000000000 RBX: ffffea0008943900 RCX: ffffea00846f300c
[124560.864451] RDX: 0000000000000002 RSI: 0000000000000246 RDI: 0000000000000246
[124560.864454] RBP: ffff8832fa663b18 R08: 0000000000000200 R09: ffff88403ffd9e80
[124560.864458] R10: 00000000025d6c00 R11: ffff88403ffda3b0 R12: ffffffff8144a06e
[124560.864461] R13: ffffea000896b600 R14: 0000000000000297 R15: 000000000000000c
[124560.864465] FS:  00007f08256fc700(0000) GS:ffff88603fc40000(0000) knlGS:0000000000000000
[124560.864469] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[124560.864472] CR2: 00007eff20800000 CR3: 000000183c053000 CR4: 00000000000006e0
[124560.864476] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[124560.864479] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[124560.864483] Process exe (pid: 17646, threadinfo ffff8832fa662000, task ffff8800941d2540)
[124560.864486] Stack:
[124560.864499]  ffffffff81136259 ffff88603fc4e0b0 0000000000000206 ffff88403ffd9e80
[124560.864516]  0000000000000000 00000000020645b9 0000000000000246 0000000008943af8
[124560.864527]  ffff88403ffd9ee0 ffffea00846f300c 0000000000000000 0000000000000200
[124560.864538] Call Trace:
[124560.864563]  [<ffffffff81136259>] isolate_freepages+0x359/0x3b0
[124560.864573]  [<ffffffff811362fe>] compaction_alloc+0x4e/0x60
[124560.864584]  [<ffffffff811404a9>] unmap_and_move+0x49/0x180
[124560.864593]  [<ffffffff8114067e>] migrate_pages+0x9e/0x1b0
[124560.864601]  [<ffffffff81136ae3>] compact_zone+0x1f3/0x2f0
[124560.864609]  [<ffffffff81136e42>] compact_zone_order+0xa2/0xe0
[124560.864617]  [<ffffffff81136f5f>] try_to_compact_pages+0xdf/0x110
[124560.864628]  [<ffffffff810f867e>] __alloc_pages_direct_compact+0xee/0x1c0
[124560.864638]  [<ffffffff810f8ab2>] __alloc_pages_slowpath+0x362/0x7f0
[124560.864646]  [<ffffffff810f90f1>] __alloc_pages_nodemask+0x1b1/0x1c0
[124560.864655]  [<ffffffff811354cb>] alloc_pages_vma+0x9b/0x160
[124560.864666]  [<ffffffff81145170>] do_huge_pmd_anonymous_page+0x160/0x270
[124560.864677]  [<ffffffff81445327>] do_page_fault+0x207/0x4c0
[124560.864686]  [<ffffffff81442065>] page_fault+0x25/0x30
[124560.866515] DWARF2 unwinder stuck at page_fault+0x25/0x30
[124560.866518] 
[124560.866520] Leftover inexact backtrace:
[124560.866521] 
[124560.866529] Code: 0f c1 07 0f b7 d0 c1 e8 10 39 c2 74 07 f3 90 0f b7 17 eb f5 c3 66 66 66 66 66 2e 0f 1f 84 00 00 00 00 00 66 ff 07 48 89 f7 57 9d 
[124560.866561]  66 90 66 90 c3 66 90 b8 ff ff ff ff f0 0f c1 07 83 e8 01 ba 
[124560.866574] Call Trace:
[124560.866582]  [<ffffffff81136259>] isolate_freepages+0x359/0x3b0
[124560.866591]  [<ffffffff811362fe>] compaction_alloc+0x4e/0x60
[124560.866598]  [<ffffffff811404a9>] unmap_and_move+0x49/0x180
[124560.866605]  [<ffffffff8114067e>] migrate_pages+0x9e/0x1b0
[124560.866613]  [<ffffffff81136ae3>] compact_zone+0x1f3/0x2f0
[124560.866621]  [<ffffffff81136e42>] compact_zone_order+0xa2/0xe0
[124560.866628]  [<ffffffff81136f5f>] try_to_compact_pages+0xdf/0x110
[124560.866637]  [<ffffffff810f867e>] __alloc_pages_direct_compact+0xee/0x1c0
[124560.866644]  [<ffffffff810f8ab2>] __alloc_pages_slowpath+0x362/0x7f0
[124560.866652]  [<ffffffff810f90f1>] __alloc_pages_nodemask+0x1b1/0x1c0
[124560.866660]  [<ffffffff811354cb>] alloc_pages_vma+0x9b/0x160
[124560.866668]  [<ffffffff81145170>] do_huge_pmd_anonymous_page+0x160/0x270
[124560.866677]  [<ffffffff81445327>] do_page_fault+0x207/0x4c0
[124560.866684]  [<ffffffff81442065>] page_fault+0x25/0x30
[124560.868451] DWARF2 unwinder stuck at page_fault+0x25/0x30
[124560.868453] 
[124560.868455] Leftover inexact backtrace:
[124560.868456]

The program that is experiencing the soft lockup should not be the problem, since it worked on SP1. I have been searching for a while now and found different ideas on where to start looking for the problem:

[LIST=1]
[*]IIRC System
[*]CPU Damage
[*]APIC System
[/LIST]

But I am really not sure what can cause this kind of problem. While the errors are occurring, most of the watchdogs go up to far more than 100 % CPU load.

Any ideas?

Best regards
fbemm

Can you try disabling THP?
http://www.suse.com/releasenotes/x86_64/SUSE-SLES/11-SP2/#fate-311931
See if that helps.
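
For what it's worth, the call trace above enters compaction from do_huge_pmd_anonymous_page, so THP is a reasonable thing to rule out. Here is a rough sketch for checking the current THP mode and switching it off at runtime. It assumes the mainline sysfs path /sys/kernel/mm/transparent_hugepage/enabled is what this kernel exposes (please verify on your system); the helper names active_thp_mode and disable_thp are made up for illustration.

```python
import re
from pathlib import Path

# Assumption: mainline sysfs location of the THP switch; verify it exists on your kernel.
THP_ENABLED = Path("/sys/kernel/mm/transparent_hugepage/enabled")


def active_thp_mode(text):
    """Return the bracketed mode from a string such as '[always] madvise never'."""
    match = re.search(r"\[(\w+)\]", text)
    return match.group(1) if match else None


def disable_thp(path=THP_ENABLED):
    """Write 'never' to the THP switch. Needs root; the setting is lost on reboot."""
    path.write_text("never")


if __name__ == "__main__":
    if THP_ENABLED.exists():
        print("THP mode:", active_thp_mode(THP_ENABLED.read_text()))
    else:
        print("THP sysfs file not found at", THP_ENABLED)
```

Calling disable_thp() as root only changes the running system. To keep THP off across reboots, the mainline kernel also accepts transparent_hugepage=never on the boot command line; check the release notes linked above for what SUSE recommends.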

fbemm,

Any news about this problem? What kernel version do you have installed? I’m having the same problem on similar hardware (same number of cores and amount of RAM, but it’s a Cisco UCS blade).

I'm using SUSE Linux Enterprise 11 SP2, kernel version 3.0.34-0.7-default.

Thanks!

[QUOTE=enovaklbank;3129]Can you try disabling THP?
http://www.suse.com/releasenotes/x86_64/SUSE-SLES/11-SP2/#fate-311931
See if that helps.[/QUOTE]