Hi,
I just upgraded an HP DL580 G7 with 40 CPUs (80 with HT) and 512 GB of memory to SLES 11 SP2. Jobs that cause heavy CPU load now result in several errors. dmesg gives me the following report:
[124560.864252] BUG: soft lockup - CPU#67 stuck for 22s! [exe:17646]
[124560.864257] Modules linked in: af_packet st sd_mod crc_t10dif ide_cd_mod ide_core lp parport_pc ppdev parport autofs4 edd xt_tcpudp xt_pkttype ipt_LOG xt_limit nfs lockd fscache auth_rpcgss nfs_acl sunrpc cpufreq_conservative cpufreq_userspace cpufreq_powersave pcc_cpufreq mperf microcode xt_NOTRACK ipt_REJECT xt_state iptable_raw iptable_filter nf_conntrack_netbios_ns nf_conntrack_broadcast nf_conntrack_ipv4 nf_conntrack nf_defrag_ipv4 ip_tables ip6_tables x_tables fuse loop dm_mod ipv6_lib qlcnic i7core_edac netxen_nic sg edac_core sr_mod cdrom hpilo iTCO_wdt hpwdt iTCO_vendor_support joydev pcspkr serio_raw container rtc_cmos acpi_power_meter button ext3 jbd mbcache usbhid hid uhci_hcd ehci_hcd usbcore usb_common thermal processor thermal_sys hwmon scsi_dh_hp_sw scsi_dh_alua scsi_dh_rdac scsi_dh_emc scsi_dh ata_generic ata_piix libata hpsa cciss scsi_mod [last unloaded: parport_pc]
[124560.864350] Supported: Yes
[124560.864353] CPU 67
[124560.864355] Modules linked in: af_packet st sd_mod crc_t10dif ide_cd_mod ide_core lp parport_pc ppdev parport autofs4 edd xt_tcpudp xt_pkttype ipt_LOG xt_limit nfs lockd fscache auth_rpcgss nfs_acl sunrpc cpufreq_conservative cpufreq_userspace cpufreq_powersave pcc_cpufreq mperf microcode xt_NOTRACK ipt_REJECT xt_state iptable_raw iptable_filter nf_conntrack_netbios_ns nf_conntrack_broadcast nf_conntrack_ipv4 nf_conntrack nf_defrag_ipv4 ip_tables ip6_tables x_tables fuse loop dm_mod ipv6_lib qlcnic i7core_edac netxen_nic sg edac_core sr_mod cdrom hpilo iTCO_wdt hpwdt iTCO_vendor_support joydev pcspkr serio_raw container rtc_cmos acpi_power_meter button ext3 jbd mbcache usbhid hid uhci_hcd ehci_hcd usbcore usb_common thermal processor thermal_sys hwmon scsi_dh_hp_sw scsi_dh_alua scsi_dh_rdac scsi_dh_emc scsi_dh ata_generic ata_piix libata hpsa cciss scsi_mod [last unloaded: parport_pc]
[124560.864416] Supported: Yes
[124560.864419]
[124560.864423] Pid: 17646, comm: exe Not tainted 3.0.13-0.27-default #1 HP ProLiant DL580 G7
[124560.864428] RIP: 0010:[<ffffffff81441a08>] [<ffffffff81441a08>] _raw_spin_unlock_irqrestore+0x8/0x10
[124560.864445] RSP: 0000:ffff8832fa663970 EFLAGS: 00000246
[124560.864448] RAX: 0000000000000000 RBX: ffffea0008943900 RCX: ffffea00846f300c
[124560.864451] RDX: 0000000000000002 RSI: 0000000000000246 RDI: 0000000000000246
[124560.864454] RBP: ffff8832fa663b18 R08: 0000000000000200 R09: ffff88403ffd9e80
[124560.864458] R10: 00000000025d6c00 R11: ffff88403ffda3b0 R12: ffffffff8144a06e
[124560.864461] R13: ffffea000896b600 R14: 0000000000000297 R15: 000000000000000c
[124560.864465] FS: 00007f08256fc700(0000) GS:ffff88603fc40000(0000) knlGS:0000000000000000
[124560.864469] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[124560.864472] CR2: 00007eff20800000 CR3: 000000183c053000 CR4: 00000000000006e0
[124560.864476] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[124560.864479] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[124560.864483] Process exe (pid: 17646, threadinfo ffff8832fa662000, task ffff8800941d2540)
[124560.864486] Stack:
[124560.864499] ffffffff81136259 ffff88603fc4e0b0 0000000000000206 ffff88403ffd9e80
[124560.864516] 0000000000000000 00000000020645b9 0000000000000246 0000000008943af8
[124560.864527] ffff88403ffd9ee0 ffffea00846f300c 0000000000000000 0000000000000200
[124560.864538] Call Trace:
[124560.864563] [<ffffffff81136259>] isolate_freepages+0x359/0x3b0
[124560.864573] [<ffffffff811362fe>] compaction_alloc+0x4e/0x60
[124560.864584] [<ffffffff811404a9>] unmap_and_move+0x49/0x180
[124560.864593] [<ffffffff8114067e>] migrate_pages+0x9e/0x1b0
[124560.864601] [<ffffffff81136ae3>] compact_zone+0x1f3/0x2f0
[124560.864609] [<ffffffff81136e42>] compact_zone_order+0xa2/0xe0
[124560.864617] [<ffffffff81136f5f>] try_to_compact_pages+0xdf/0x110
[124560.864628] [<ffffffff810f867e>] __alloc_pages_direct_compact+0xee/0x1c0
[124560.864638] [<ffffffff810f8ab2>] __alloc_pages_slowpath+0x362/0x7f0
[124560.864646] [<ffffffff810f90f1>] __alloc_pages_nodemask+0x1b1/0x1c0
[124560.864655] [<ffffffff811354cb>] alloc_pages_vma+0x9b/0x160
[124560.864666] [<ffffffff81145170>] do_huge_pmd_anonymous_page+0x160/0x270
[124560.864677] [<ffffffff81445327>] do_page_fault+0x207/0x4c0
[124560.864686] [<ffffffff81442065>] page_fault+0x25/0x30
[124560.866515] DWARF2 unwinder stuck at page_fault+0x25/0x30
[124560.866518]
[124560.866520] Leftover inexact backtrace:
[124560.866521]
[124560.866529] Code: 0f c1 07 0f b7 d0 c1 e8 10 39 c2 74 07 f3 90 0f b7 17 eb f5 c3 66 66 66 66 66 2e 0f 1f 84 00 00 00 00 00 66 ff 07 48 89 f7 57 9d
[124560.866561] 66 90 66 90 c3 66 90 b8 ff ff ff ff f0 0f c1 07 83 e8 01 ba
[124560.866574] Call Trace:
[124560.866582] [<ffffffff81136259>] isolate_freepages+0x359/0x3b0
[124560.866591] [<ffffffff811362fe>] compaction_alloc+0x4e/0x60
[124560.866598] [<ffffffff811404a9>] unmap_and_move+0x49/0x180
[124560.866605] [<ffffffff8114067e>] migrate_pages+0x9e/0x1b0
[124560.866613] [<ffffffff81136ae3>] compact_zone+0x1f3/0x2f0
[124560.866621] [<ffffffff81136e42>] compact_zone_order+0xa2/0xe0
[124560.866628] [<ffffffff81136f5f>] try_to_compact_pages+0xdf/0x110
[124560.866637] [<ffffffff810f867e>] __alloc_pages_direct_compact+0xee/0x1c0
[124560.866644] [<ffffffff810f8ab2>] __alloc_pages_slowpath+0x362/0x7f0
[124560.866652] [<ffffffff810f90f1>] __alloc_pages_nodemask+0x1b1/0x1c0
[124560.866660] [<ffffffff811354cb>] alloc_pages_vma+0x9b/0x160
[124560.866668] [<ffffffff81145170>] do_huge_pmd_anonymous_page+0x160/0x270
[124560.866677] [<ffffffff81445327>] do_page_fault+0x207/0x4c0
[124560.866684] [<ffffffff81442065>] page_fault+0x25/0x30
[124560.868451] DWARF2 unwinder stuck at page_fault+0x25/0x30
[124560.868453]
[124560.868455] Leftover inexact backtrace:
[124560.868456]
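In case it helps to compare several of these reports, here is a rough sketch of a script that pulls the CPU number, stuck time, process, and the first call trace out of dmesg text. The function name `parse_soft_lockups` and the regular expressions are my own assumptions, written against the report format shown above; other kernel versions may print it differently.

```python
import re
import sys

# Matches e.g. "BUG: soft lockup - CPU#67 stuck for 22s! [exe:17646]"
LOCKUP_RE = re.compile(
    r"BUG: soft lockup - CPU#(\d+) stuck for (\d+)s! \[([^:\]]+):(\d+)\]"
)
# Matches e.g. "[<ffffffff81136259>] isolate_freepages+0x359/0x3b0"
FRAME_RE = re.compile(r"\[<[0-9a-f]+>\]\s+([\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+")


def parse_soft_lockups(text):
    """Return one dict per soft lockup report found in dmesg text."""
    reports = []
    current = None
    in_trace = False
    for line in text.splitlines():
        header = LOCKUP_RE.search(line)
        if header:
            current = {
                "cpu": int(header.group(1)),
                "seconds": int(header.group(2)),
                "comm": header.group(3),
                "pid": int(header.group(4)),
                "frames": [],
            }
            reports.append(current)
            in_trace = False
            continue
        if current is None:
            continue
        if "Call Trace:" in line:
            # The trace is printed twice per report; keep only the first copy.
            in_trace = not current["frames"]
            continue
        if in_trace:
            frame = FRAME_RE.search(line)
            if frame:
                current["frames"].append(frame.group(1))
            else:
                in_trace = False  # first non-frame line ends the trace
    return reports


if __name__ == "__main__":
    # Usage: dmesg | python parse_lockups.py
    for report in parse_soft_lockups(sys.stdin.read()):
        print(report["cpu"], report["seconds"], report["comm"], report["pid"])
        print("  " + " <- ".join(report["frames"]))
```

For the report above, the extracted frames run from `page_fault` through `do_huge_pmd_anonymous_page` and `__alloc_pages_direct_compact` up to `isolate_freepages`, which makes it easier to see whether every lockup ends in the same path.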
The program that is experiencing the soft lockup should not be the problem, since it worked on SP1. I have been searching for a while now and found several ideas about where to start looking for the problem:
[LIST=1]
[*]IIRC System
[*]CPU Damage
[*]APIC System
[/LIST]
But I am really not sure what can cause this kind of problem. While the errors are happening, most of the watchdogs climb to well over 100% CPU load.
Any ideas?
Best regards
fbemm