I’ve been seeing kernel “[Hardware Error]: Machine check events logged” messages in /var/log/messages. These seem to be from the mcelog daemon, and the corresponding logs (I posted an example below) are in /var/log/mcelog.
- Is a RAM chip on its way out? Or is it the CPU or CPU cache that's having issues?
- If it is RAM, how do I determine which chip(s) are having issues?
Hardware event. This is not a software error.
CPU 0 4 northbridge
MISC c0090fff01000000 ADDR 757580490
TIME 1335182555 Mon Apr 23 08:02:35 2012
Northbridge RAM Chipkill ECC error
Chipkill ECC syndrome = 4857
bit46 = corrected ecc error
bit59 = misc error valid
bit62 = error overflow (multiple errors)
bus error 'local node response, request didn't time out
generic read mem transaction
memory access, level generic'
STATUS dc2bc00048080a13 MCGSTATUS 0
MCGCAP 106 APICID 0 SOCKETID 0
CPUID Vendor AMD Family 16 Model 4
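To sanity-check what mcelog printed, the STATUS value can be decoded by hand. The following is a minimal Python sketch, not mcelog's actual code. The bit positions are my reading of the AMD family 10h machine-check status layout and should be treated as assumptions, and the names `FLAGS` and `decode_status` are made up for illustration. It reports which flag bits are set and reassembles the Chipkill syndrome from the two status byte fields.

```python
# STATUS value copied from the mcelog record above.
STATUS = 0xdc2bc00048080a13

# Assumed bit meanings (AMD family 10h northbridge MC4_STATUS layout).
FLAGS = {
    63: "valid",
    62: "error overflow (multiple errors)",
    61: "uncorrected error",
    60: "error reporting enabled",
    59: "misc error valid",
    58: "address valid",
    57: "processor context corrupt",
    46: "corrected ecc error",
    45: "uncorrected ecc error",
}


def decode_status(status):
    """Return (list of set flag names, Chipkill syndrome) for a status word."""
    flags = [
        name
        for bit, name in sorted(FLAGS.items(), reverse=True)
        if (status >> bit) & 1
    ]
    # Assumption: syndrome[15:8] sits in status[31:24],
    # and syndrome[7:0] sits in status[54:47].
    syndrome = (((status >> 24) & 0xFF) << 8) | ((status >> 47) & 0xFF)
    return flags, syndrome


flags, syndrome = decode_status(STATUS)
for name in flags:
    print(name)
print("Chipkill ECC syndrome = %x" % syndrome)
```

Under these assumptions the syndrome comes out as 4857, matching the log, and the corrected-ECC bit is set while both "uncorrected" bits are clear, which agrees with mcelog's description of a corrected memory error.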
(I’ve never used mcelog before, but since I upgraded from SLES 11 SP1 to SP2, it seems to be configured to start on boot.)