mcelog hardware error - is it my memory or CPU failing?

Hi,
I’ve been seeing kernel “[Hardware Error]: Machine check events logged” messages in /var/log/messages. These seem to be from the mcelog daemon, and the corresponding logs (I posted an example below) are in /var/log/mcelog.

  • Is a RAM chip on its way out? Or is it the CPU or CPU cache that's having issues?
  • If it is RAM, how do I determine which chip(s) are having issues?

/var/log/mcelog:

Hardware event. This is not a software error.
MCE 0
CPU 0 4 northbridge
MISC c0090fff01000000 ADDR 757580490
TIME 1335182555 Mon Apr 23 08:02:35 2012
  Northbridge RAM Chipkill ECC error
  Chipkill ECC syndrome = 4857
       bit46 = corrected ecc error
       bit59 = misc error valid
       bit62 = error overflow (multiple errors)
  bus error 'local node response, request didn't time out
             generic read mem transaction
             memory access, level generic'
STATUS dc2bc00048080a13 MCGSTATUS 0
MCGCAP 106 APICID 0 SOCKETID 0
CPUID Vendor AMD Family 16 Model 4
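For anyone wanting to check what the STATUS value in the log above says, here is a rough Python sketch that decodes it. The bit positions are assumptions: they follow the generic x86 machine-check status layout plus the AMD family 10h northbridge ECC bits ("Family 16" in the log is 0x10), and the names `FLAGS` and `decode_status` are made up for this example. Check the processor documentation before relying on it.

```python
# Assumed bit layout (generic MCi_STATUS plus AMD northbridge ECC bits).
# The labels for bits 46, 59 and 62 are copied from the mcelog output above.
FLAGS = {
    63: "valid",
    62: "error overflow (multiple errors)",
    61: "uncorrected error",
    59: "misc error valid",
    58: "address valid",
    57: "processor context corrupt",
    46: "corrected ecc error",
    45: "uncorrected ecc error",
}


def decode_status(status):
    """Return (list of set flag names, 16-bit ECC syndrome) for a STATUS value."""
    flags = [name for bit, name in sorted(FLAGS.items(), reverse=True)
             if (status >> bit) & 1]
    # Assumption: syndrome[7:0] sits in bits 54:47, syndrome[15:8] in bits 31:24.
    syndrome = (((status >> 24) & 0xFF) << 8) | ((status >> 47) & 0xFF)
    return flags, syndrome


flags, syndrome = decode_status(0xdc2bc00048080a13)
for name in flags:
    print(name)
print(f"syndrome = {syndrome:x}")  # prints "syndrome = 4857", matching the log
```

With this layout, the value from the log has the corrected-ECC bit set and the uncorrected-error bits clear, which is consistent with mcelog reporting a corrected memory error rather than a CPU or cache fault.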

(I’ve never used mcelog before, but since I upgraded from SLES 11 SP1 to SP2, it seems to be configured to start on boot.)

Thanks,
J

Hi J,

Sounds like a RAM chip giving up… Have you had a look at the SEL (system event log)? Maybe that can give you more details, as the system behind it ought to know about the hardware layout of your machine…

Regards,
Jens

Hi Jens,

Thanks for the reply. In the System event log, I see several of these messages that occur during boot:

ID = 6eb : 04/22/2012 : 00:27:29 : Memory : BIOS : Configuration Error

Is it possible that there is a strange setting in the BIOS that does not play well with mcelog? The machine in question is a Sun Fire x4140.

Either way, we plan on taking the server down one evening and running memtest86 overnight.
Thanks,
J

Hi J,

My guess is that it’s actually something your machine’s BIOS has been complaining about independently of mcelog. mcelog is merely the messenger, so don’t shoot it for that ;)

I don’t have any experience with Sun hardware, so I cannot tell for sure. The folks at Sun (or do we have to call them “Oracles” by now?) ought to be more helpful concerning the actual cause of that message. Probably it’s something that simply puts your hardware slightly out of spec and has caused no harm so far…

With regards,

Jens

Here is an update. We replaced the entire Sun Fire x4140 with another x4140. It is completely different hardware, except for the iSCSI HBA card, which we kept. I’m still seeing errors in /var/log/mcelog, but they seem to correspond to different DIMMs. So either this server’s memory also has issues by coincidence, or the AMD-based x4140 gives mcelog trouble. We have several x4150s (Intel-based) that are fine.

New output:

Hardware event. This is not a software error.
MCE 0
Hardware event. This is not a software error.
CPU 4 BANK 4
STATUS 0 MCGSTATUS 0
CPU 4 4 northbridge
MISC c0090fff01000000 ADDR edc79c1c0
Hardware event. This is not a software error.
CPU 0 BANK 0
TIME 1335884912 Tue May 1 11:08:32 2012
STATUS 0 MCGSTATUS 0
DDR2 DIMM 333 Mhz Synchronous Width 72 Data Width 64 Size 4 GB
Device Locator: DIMM14
Bank Locator: BANK14
Manufacturer: Qimonda
Serial Number: FFFFFFFF
Asset Tag: N/A
Part Number:
TIME 1335884912 Tue May 1 11:08:32 2012
  Northbridge RAM Chipkill ECC error
  Chipkill ECC syndrome = 5cac
       bit46 = corrected ecc error
       bit59 = misc error valid
       bit62 = error overflow (multiple errors)
  bus error 'local node response, request didn't time out
             generic read mem transaction
             memory access, level generic'
STATUS dc5640005c080a13 MCGSTATUS 0
MCGCAP 106 APICID 4 SOCKETID 1

I disabled mcelog on this particular server. The service processor should give me a heads up on any hardware issues.

You could swap the memory modules and see if the mcelog message changes.
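To make the before/after comparison easier, something like the following Python sketch could tally which DIMM locators, ECC syndromes and addresses show up in the log. It only relies on the field labels visible in the output posted in this thread ("Device Locator:", "syndrome =", "ADDR"); the function name `summarize_mcelog` and the sample text are just for illustration, and other mcelog versions may format things differently.

```python
import re
from collections import Counter


def summarize_mcelog(text):
    """Count DIMM locators, ECC syndromes and addresses found in mcelog text."""
    return {
        "dimms": Counter(re.findall(r"Device Locator:\s*(\S+)", text)),
        "syndromes": Counter(re.findall(r"syndrome = ([0-9a-fA-F]+)", text)),
        "addresses": Counter(re.findall(r"\bADDR ([0-9a-fA-F]+)", text)),
    }


# Sample text trimmed from the output posted earlier in this thread.
sample = (
    "MISC c0090fff01000000 ADDR edc79c1c0\n"
    "Device Locator: DIMM14\n"
    "Bank Locator: BANK14\n"
    "Chipkill ECC syndrome = 5cac\n"
)

for field, counts in summarize_mcelog(sample).items():
    print(field, dict(counts))
```

Run it on the contents of /var/log/mcelog before and after swapping modules. If the reported locator stays with the slot, that points at the slot or board; if it follows the module, that points at the DIMM.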