Hi,
I’ve been seeing kernel “[Hardware Error]: Machine check events logged” messages in /var/log/messages. These seem to be from the mcelog daemon, and the corresponding logs (I posted an example below) are in /var/log/mcelog.
Is a RAM chip on its way out? Or is it the CPU or CPU cache that’s having issues?
If it is RAM, how do I determine which chip(s) are having issues?
/var/log/mcelog:
Hardware event. This is not a software error.
MCE 0
CPU 0 4 northbridge
MISC c0090fff01000000 ADDR 757580490
TIME 1335182555 Mon Apr 23 08:02:35 2012
Northbridge RAM Chipkill ECC error
Chipkill ECC syndrome = 4857
bit46 = corrected ecc error
bit59 = misc error valid
bit62 = error overflow (multiple errors)
bus error 'local node response, request didn't time out
generic read mem transaction
memory access, level generic'
STATUS dc2bc00048080a13 MCGSTATUS 0
MCGCAP 106 APICID 0 SOCKETID 0
CPUID Vendor AMD Family 16 Model 4
(I’ve never used mcelog before, but since I upgraded from SLES 11 SP1 to SP2, it seems to be configured to start on boot.)
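For anyone wanting to check a STATUS value by hand, here is a minimal Python sketch that pulls out the three flag bits mcelog names in the log above (bit46, bit59, bit62) and the Chipkill syndrome. The flag labels are taken from the log itself. The syndrome layout (low byte from STATUS bits 54:47, high byte from bits 31:24) is an assumption; it does reproduce the syndromes 4857 and 5cac that mcelog printed for the two STATUS values in this thread. `decode_status` is a hypothetical helper name, not part of mcelog.

```python
# Decode selected fields of an AMD northbridge STATUS value as printed by mcelog.
# Flag labels come from the mcelog output in this thread.
FLAGS = {
    46: "corrected ecc error",
    59: "misc error valid",
    62: "error overflow (multiple errors)",
}


def decode_status(status: int) -> dict:
    """Return the flags set in `status` and the ECC syndrome as 4 hex digits."""
    flags = [label for bit, label in sorted(FLAGS.items()) if (status >> bit) & 1]
    # Assumed layout: syndrome[7:0] = STATUS[54:47], syndrome[15:8] = STATUS[31:24]
    syndrome = ((status >> 47) & 0xFF) | ((status >> 16) & 0xFF00)
    return {"flags": flags, "syndrome": f"{syndrome:04x}"}


if __name__ == "__main__":
    # The two STATUS values posted in this thread
    for raw in ("dc2bc00048080a13", "dc5640005c080a13"):
        print(raw, decode_status(int(raw, 16)))
```

Both values have bit46 set, so both records describe errors that ECC corrected rather than uncorrected data loss.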
Sounds like a RAM chip giving up. Have you had a look at the SEL (the service processor’s System Event Log)? Maybe that can give you more details, as the system behind it ought to know about the hardware layout of your machine.
Is it possible that there is a strange setting in BIOS that would not play well with mcelog? The machine in question is a Sun Fire x4140.
Either way, we plan on taking the server down one evening and running memtest86 overnight.
Thanks,
J
My guess is that it’s actually something your machine’s BIOS has been complaining about independently of mcelog. mcelog is merely the messenger, so don’t shoot it for that.
I don’t have any experience with Sun hardware, so I cannot tell for sure. The folks at Sun (or do we have to call them “Oracles” by now?) ought to be more helpful concerning the actual cause of that message. Probably it’s something that simply puts your hardware slightly out of spec and has caused no harm so far.
Here is an update. We replaced the entire Sun Fire x4140 with another x4140. The hardware is completely different, except for the iSCSI HBA card, which we kept. I’m still seeing errors in /var/log/mcelog, but they seem to correspond to different DIMMs. So either this server’s memory also has issues by coincidence, or something about the AMD-based x4140 causes problems for mcelog. We have several x4150s (Intel-based) that are fine.
New output:
Hardware event. This is not a software error.
MCE 0
Hardware event. This is not a software error.
CPU 4 BANK 4
STATUS 0 MCGSTATUS 0
CPU 4 4 northbridge
MISC c0090fff01000000 ADDR edc79c1c0
Hardware event. This is not a software error.
CPU 0 BANK 0
TIME 1335884912 Tue May 1 11:08:32 2012
STATUS 0 MCGSTATUS 0
DDR2 DIMM 333 Mhz Synchronous Width 72 Data Width 64 Size 4 GB
Device Locator: DIMM14
Bank Locator: BANK14
Manufacturer: Qimonda
Serial Number: FFFFFFFF
Asset Tag: N/A
Part Number:
TIME 1335884912 Tue May 1 11:08:32 2012
Northbridge RAM Chipkill ECC error
Chipkill ECC syndrome = 5cac
bit46 = corrected ecc error
bit59 = misc error valid
bit62 = error overflow (multiple errors)
bus error 'local node response, request didn't time out
generic read mem transaction
memory access, level generic'
STATUS dc5640005c080a13 MCGSTATUS 0
MCGCAP 106 APICID 4 SOCKETID 1
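To see whether the errors cluster on particular DIMMs or addresses over time, the log can be tallied with a short script. This is a rough Python sketch that assumes the field names appear exactly as in the output above (`ADDR`, `Device Locator:`, `SOCKETID`); `summarize_mcelog` is a hypothetical helper name, and the sample text is a few lines copied from the log in this post.

```python
import re
from collections import Counter


def summarize_mcelog(text: str) -> dict:
    """Count reported addresses, DIMM locators and socket IDs in mcelog text."""
    addrs = Counter(re.findall(r"\bADDR ([0-9a-f]+)", text))
    dimms = Counter(
        m.strip() for m in re.findall(r"^Device Locator:\s*(.+)$", text, re.M)
    )
    sockets = Counter(re.findall(r"\bSOCKETID (\d+)", text))
    return {"addr": addrs, "dimm": dimms, "socket": sockets}


# A few lines copied from the output above, used as sample input
SAMPLE = """\
CPU 4 4 northbridge
MISC c0090fff01000000 ADDR edc79c1c0
Device Locator: DIMM14
Bank Locator: BANK14
MCGCAP 106 APICID 4 SOCKETID 1
"""

if __name__ == "__main__":
    for field, counts in summarize_mcelog(SAMPLE).items():
        print(field, dict(counts))
```

In practice the function would be fed the contents of /var/log/mcelog instead of `SAMPLE`. A locator that keeps recurring points at a specific DIMM, while errors spread across many locators make a single bad module less likely.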