Difference between revisions of "Valentine memory"
Revision as of 15:15, 15 December 2008
Affected valentines: 2, 21, 59, 72, 77
These are the kinds of errors that show up in the system log:
Dec 12 09:56:51 wn-val-021 kernel: EDAC MC0: CE row 3, channel 3, label "": (Branch=1 DRAM-Bank=2 RDWR=Read RAS=11275 CAS=968, CE Err=0x2000)
Dec 12 09:58:28 wn-val-021 sshd(pam_unix)[4147]: session opened for user root by (uid=0)
Dec 12 10:03:37 wn-val-021 kernel: EDAC i5000 MC0: NON-FATAL ERRORS Found!!! 1st NON-FATAL Err Reg= 0x2000
Dec 12 10:03:37 wn-val-021 kernel: EDAC MC0: CE row 2, channel 2, label "": (Branch=1 DRAM-Bank=6 RDWR=Write RAS=9948 CAS=804, CE Err=0x2000)
Dec 12 10:17:23 wn-val-021 kernel: EDAC i5000 MC0: NON-FATAL ERRORS Found!!! 1st NON-FATAL Err Reg= 0x2000
Dec 12 10:17:24 wn-val-021 kernel: EDAC MC0: CE row 1, channel 2, label "": (Branch=1 DRAM-Bank=7 RDWR=Write RAS=2144 CAS=208, CE Err=0x2000)
Dec 12 10:17:33 wn-val-021 kernel: EDAC i5000 MC0: NON-FATAL ERRORS Found!!! 1st NON-FATAL Err Reg= 0x2000
Dec 12 10:17:33 wn-val-021 kernel: EDAC MC0: CE row 2, channel 3, label "": (Branch=1 DRAM-Bank=0 RDWR=Read RAS=5070 CAS=36, CE Err=0x2000)
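To see whether the correctable errors (CE) cluster on a particular row or channel, the log lines can be tallied. The following is a minimal sketch, assuming the log format is exactly as in the excerpt above; the function name `count_correctable_errors` and the regular expression are illustrative and not part of any existing tool.

```python
import re
from collections import Counter

# Matches only the "EDAC MC<n>: CE row <r>, channel <c>" lines from the excerpt;
# the "EDAC i5000 MC0: NON-FATAL ERRORS" lines are deliberately not matched.
CE_PATTERN = re.compile(
    r"EDAC MC(?P<mc>\d+): CE row (?P<row>\d+), channel (?P<channel>\d+)"
)


def count_correctable_errors(log_lines):
    """Count CE reports per (memory controller, row, channel)."""
    counts = Counter()
    for line in log_lines:
        match = CE_PATTERN.search(line)
        if match:
            key = (
                int(match.group("mc")),
                int(match.group("row")),
                int(match.group("channel")),
            )
            counts[key] += 1
    return counts


# Three lines copied from the log excerpt above.
SAMPLE_LINES = [
    'Dec 12 09:56:51 wn-val-021 kernel: EDAC MC0: CE row 3, channel 3, label "": '
    "(Branch=1 DRAM-Bank=2 RDWR=Read RAS=11275 CAS=968, CE Err=0x2000)",
    "Dec 12 10:03:37 wn-val-021 kernel: EDAC i5000 MC0: NON-FATAL ERRORS Found!!! "
    "1st NON-FATAL Err Reg= 0x2000",
    'Dec 12 10:03:37 wn-val-021 kernel: EDAC MC0: CE row 2, channel 2, label "": '
    "(Branch=1 DRAM-Bank=6 RDWR=Write RAS=9948 CAS=804, CE Err=0x2000)",
]

if __name__ == "__main__":
    for key, count in sorted(count_correctable_errors(SAMPLE_LINES).items()):
        print(key, count)
```

In practice the input would be the lines of the node's system log rather than the hard-coded sample.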
Solution: blacklist these modules: k8_edac, edac_mc, i5000_edac
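A small sketch of what the blacklist entries could look like, generated with Python so the module list stays in one place. The `blacklist <module>` line is the standard modprobe configuration directive; where the lines go (for example a file under /etc/modprobe.d/) is an assumption that depends on the distribution, and the helper name `render_blacklist` is hypothetical.

```python
# Modules named in the solution above.
EDAC_MODULES = ["k8_edac", "edac_mc", "i5000_edac"]


def render_blacklist(modules):
    """Return one 'blacklist <module>' line per module, newline-terminated."""
    return "".join(f"blacklist {name}\n" for name in modules)


if __name__ == "__main__":
    # Print the lines; append them to the modprobe configuration by hand
    # after checking which file the distribution actually reads.
    print(render_blacklist(EDAC_MODULES), end="")
```

Modules that are already loaded stay loaded until they are removed or the node is rebooted, so the blacklist only takes effect from the next module load.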
Research information from Jan Just:
hi all,
I've played around with one of the "broken" Valentine WNs; the problem has been reported by others as well, see e.g.
http://www.redhat.com/docs/en-US/Red_Hat_Enterprise_Linux/5.3/html/Release_Notes/sect-Release_Notes-Known_Issues.html
http://ubuntuforums.org/archive/index.php/t-777637.html
http://faq.aslab.com/index.php?sid=8574&lang=en&action=artikel&cat=93&id=154&artlang=en
and it seems related to the i5000_edac memory controller driver that popped up in RHEL 4.7/5.3; it looks like a driver error. The i5000_edac driver tries to get access to the Intel i5000 memory controller in order to display ECC errors. The IPMI card/driver *also* wants to access the memory controller in order to display the exact same errors. This can cause a (temporary) deadlock between the IPMI card and the i5000_edac driver. The result is that
- hdparm -T /dev/sda slows to a crawl
- only in very rare cases does anything useful show up in /sys/devices/system/edac/mc/mc0/*count
All links above state that you should blacklist the i5000_edac driver in order to get rid of these warnings/errors. This driver seems to be for reporting purposes only anyways, i.e. it does not do anything magic to the memory itself; it only wants to query the memory controller for statistics (which all show up in /sys/devices/system/edac/mc/mc0/*).

It is a pity that these *correctable* ECC errors pop up on (only?) 5 of the WNs but memtest shows nothing wrong with the memory. Hopefully the driver will be updated in the near future to report meaningful errors/warnings and not cause deadlocks with the IPMI controller.
share and enjoy,
JJK
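The statistics mentioned in the mail can be inspected before the driver is blacklisted. The following is a minimal sketch, assuming the counter files under /sys/devices/system/edac/mc/mc0/ each hold a single integer; the function name `read_edac_counters` is hypothetical, and the directory only exists while an EDAC driver is loaded.

```python
from pathlib import Path


def read_edac_counters(mc_dir="/sys/devices/system/edac/mc/mc0"):
    """Read every file matching '*count' in mc_dir into a {name: int} dict.

    Files that cannot be read or do not contain an integer are skipped.
    A missing directory yields an empty dict.
    """
    counters = {}
    for path in sorted(Path(mc_dir).glob("*count")):
        try:
            counters[path.name] = int(path.read_text().strip())
        except (OSError, ValueError):
            continue
    return counters


if __name__ == "__main__":
    for name, value in read_edac_counters().items():
        print(f"{name}: {value}")
```

An empty result after blacklisting is expected, since the sysfs entries disappear together with the driver.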