Valentine memory

From PDP/Grid Wiki
Revision as of 12:48, 18 December 2008 by Tsuerink@nikhef.nl (talk | contribs)

Affected valentines: 2, 21, 59, 72, 77

These are the kinds of errors:

Dec 12 09:56:51 wn-val-021 kernel: EDAC MC0: CE row 3, channel 3, label "": (Branch=1 DRAM-Bank=2 RDWR=Read RAS=11275 CAS=968, CE Err=0x2000)
Dec 12 09:58:28 wn-val-021 sshd(pam_unix)[4147]: session opened for user root by (uid=0)
Dec 12 10:03:37 wn-val-021 kernel: EDAC i5000 MC0: NON-FATAL ERRORS Found!!! 1st NON-FATAL Err Reg= 0x2000
Dec 12 10:03:37 wn-val-021 kernel: EDAC MC0: CE row 2, channel 2, label "": (Branch=1 DRAM-Bank=6 RDWR=Write RAS=9948 CAS=804, CE Err=0x2000)
Dec 12 10:17:23 wn-val-021 kernel: EDAC i5000 MC0: NON-FATAL ERRORS Found!!! 1st NON-FATAL Err Reg= 0x2000
Dec 12 10:17:24 wn-val-021 kernel: EDAC MC0: CE row 1, channel 2, label "": (Branch=1 DRAM-Bank=7 RDWR=Write RAS=2144 CAS=208, CE Err=0x2000)
Dec 12 10:17:33 wn-val-021 kernel: EDAC i5000 MC0: NON-FATAL ERRORS Found!!! 1st NON-FATAL Err Reg= 0x2000
Dec 12 10:17:33 wn-val-021 kernel: EDAC MC0: CE row 2, channel 3, label "": (Branch=1 DRAM-Bank=0 RDWR=Read RAS=5070 CAS=36, CE Err=0x2000)
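
To see how the correctable errors (CE) are spread over rows and channels, log lines in the format shown above can be summarized with a small filter. This is only a sketch: the function name summarize_edac_ce is made up for illustration, and it assumes the message layout is exactly as in the excerpt above.

```shell
#!/bin/sh
# Hypothetical helper: summarize EDAC "CE row N, channel M" messages
# read from stdin. Assumes the syslog format shown in the excerpt above.
summarize_edac_ce() {
    # Keep only the CE lines and reduce them to "row=N channel=M".
    sed -n 's/.*EDAC MC[0-9]*: CE row \([0-9]*\), channel \([0-9]*\),.*/row=\1 channel=\2/p' \
        | sort \
        | uniq -c \
        | awk '{ print $2, $3, "count=" $1 }'
}

# Example (the log path is an assumption, adjust for your system):
#   summarize_edac_ce < /var/log/messages
```

The "NON-FATAL ERRORS Found" lines are deliberately ignored, since each of them is paired with a CE line that carries the row and channel.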

Solution: blacklist these modules: k8_edac, edac_mc, i5000_edac
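
As a rough sketch of what the blacklisting could look like, the helper below prints one "blacklist" line per module. The function name edac_blacklist_lines is hypothetical, and the target file in the usage comment is an assumption: the location of the modprobe blacklist differs between distributions and releases, so check the node before appending anything.

```shell
#!/bin/sh
# Hypothetical helper: print modprobe "blacklist" lines for the
# EDAC modules listed above.
edac_blacklist_lines() {
    for module in k8_edac edac_mc i5000_edac; do
        printf 'blacklist %s\n' "$module"
    done
}

# Example, as root (the file path is an assumption):
#   edac_blacklist_lines >> /etc/modprobe.d/blacklist
```

A blacklist entry only stops the module from being loaded automatically; a module that is already loaded stays in place until it is removed with rmmod or the node is rebooted.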

Research information from Jan Just:

hi all,

I've played around with one of the "broken" Valentine WNs; the problem has been reported by others as well, see e.g.

http://www.redhat.com/docs/en-US/Red_Hat_Enterprise_Linux/5.3/html/Release_Notes/sect-Release_Notes-Known_Issues.html
http://ubuntuforums.org/archive/index.php/t-777637.html
http://faq.aslab.com/index.php?sid=8574&lang=en&action=artikel&cat=93&id=154&artlang=en

and it seems related to the i5000_edac memory controller driver that popped up in RHEL 4.7/5.3; it looks like a driver error. The i5000_edac driver tries to get access to the Intel i5000 memory controller in order to display ECC errors. The IPMI card/driver *also* wants to access the memory controller in order to display the exact same errors. This can cause a (temporary) deadlock between the IPMI card and the i5000_edac driver. The result is that
- hdparm -T /dev/sda slows to a crawl
- only in very rare cases does anything useful show up in /sys/devices/system/edac/mc/mc0/*count

All links above state that you should blacklist the i5000_edac driver in order to get rid of these warnings/errors. This driver seems to be for reporting purposes only anyway, i.e. it does not do anything magic to the memory itself; it only wants to query the memory controller for statistics (which all show up in /sys/devices/system/edac/mc/mc0/*). It is a pity that these *correctable* ECC errors pop up on (only?) 5 of the WNs, but memtest shows nothing wrong with the memory. Hopefully the driver will be updated in the near future to report meaningful errors/warnings and not cause deadlocks with the IPMI controller.

share and enjoy,

JJK
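
While the driver is still loaded, the counters mentioned in the message above can be read from sysfs. A minimal sketch, assuming the default directory from the message; the function name show_edac_counts is made up for illustration, and the optional argument exists only so the function can be pointed at another controller directory.

```shell
#!/bin/sh
# Hypothetical helper: print every "*count" file of one EDAC memory
# controller as "<name> <value>". The default directory is the one
# quoted in the message above.
show_edac_counts() {
    mc_dir=${1:-/sys/devices/system/edac/mc/mc0}
    for f in "$mc_dir"/*count; do
        # Skip the unexpanded glob if the driver is not loaded.
        [ -r "$f" ] || continue
        printf '%s %s\n' "$(basename "$f")" "$(cat "$f")"
    done
}

# Example:
#   show_edac_counts
```

If the EDAC modules are blacklisted, the directory does not exist and the function prints nothing.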

To kill the event log errors in the IPMI interface:

/etc/init.d/ipmi start
ipmitool bmc setenables recv_msg_intr=off
ipmitool bmc setenables event_msg_intr=off
ipmitool bmc setenables event_msg=off
ipmitool bmc setenables system_event_log=off
ipmitool sel clear