Difference between revisions of "Strijker node type details"

From PDP/Grid Wiki

Revision as of 13:40, 12 October 2015

The strijker nodes are 2U Dell R515 servers with 12 front-loading disks each.

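They have a PERC H700 Integrated RAID controller (see http://www.dell.com/downloads/global/products/pvaul/en/perc-technical-guidebook.pdf), which can be managed with the storcli64 tool from the MegaRAID Storage Manager software: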

/usr/local/MegaRAID\ Storage\ Manager/StorCLI/storcli64 /c0 show all

The Nagios sensor in /usr/local/lib/nagios/plugins/check_lsi_raid checks the state of the controller with a somewhat poorly documented command:

/opt/MegaRAID/storcli/storcli64 adpallinfo a0

Here 'a0' stands for controller 0.

The output of this command contains a summary of the controller status, of which one block is particularly interesting:

                Device Present
                ================
Virtual Drives    : 2 
  Degraded        : 0 
  Offline         : 0 
Physical Devices  : 16 
  Disks           : 14 
  Critical Disks  : 1 
  Failed Disks    : 0 
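
The counters in this block lend themselves to a simple automated check. The following is a minimal sketch, not the actual check_lsi_raid plugin: the function name check_device_present is hypothetical, and the choice of which counters map to WARNING or CRITICAL is an assumption. It reads the summary text on standard input and uses the usual Nagios exit codes (0 OK, 1 WARNING, 2 CRITICAL).

```shell
#!/bin/sh
# Sketch only: NOT the real check_lsi_raid plugin.
# Reads the controller summary on stdin and prints a Nagios-style status line.
# Assumed severity mapping: offline/failed -> CRITICAL, degraded/critical -> WARNING.
check_device_present() {
    awk -F: '
        # Trim spaces around the label and remove them from the value
        { gsub(/^[ \t]+|[ \t]+$/, "", $1); gsub(/[ \t]/, "", $2) }
        $1 == "Degraded"       { degraded = $2 + 0 }
        $1 == "Offline"        { offline  = $2 + 0 }
        $1 == "Critical Disks" { critical = $2 + 0 }
        $1 == "Failed Disks"   { failed   = $2 + 0 }
        END {
            summary = sprintf("degraded=%d offline=%d critical=%d failed=%d",
                              degraded, offline, critical, failed)
            if (offline > 0 || failed > 0)    { print "CRITICAL: " summary; exit 2 }
            if (degraded > 0 || critical > 0) { print "WARNING: " summary;  exit 1 }
            print "OK: " summary
        }
    '
}
```

It could be fed from the command above, for example: /opt/MegaRAID/storcli/storcli64 adpallinfo a0 | check_device_present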

The 'Critical Disks' count of 1 in the Device Present block shows that one disk is failing or about to fail. Information on the individual disks can be retrieved with, e.g.:

storcli64 /c0/e32/sall show all

This output does not state directly that a disk is critical; one has to infer it from the error counts, e.g.:

Drive /c0/e32/s5 State :
======================
Shield Counter = 0
Media Error Count = 46
Other Error Count = 137
Drive Temperature =  36C (96.80 F)
Predictive Failure Count = 34
S.M.A.R.T alert flagged by drive = Yes
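
Picking the suspect drive out of the per-disk output can be scripted. The sketch below assumes the output has the layout shown above (a "Drive /cX/eY/sZ ..." header followed by "name = value" lines); the function name suspect_drives is hypothetical, and the choice of fields (media errors, predictive failures, S.M.A.R.T alert) is an assumption that can be extended.

```shell
#!/bin/sh
# Sketch only: reads per-drive output on stdin and prints the lines
# that suggest a failing drive, prefixed with the drive path.
suspect_drives() {
    awk '
        # Tolerate indented output by stripping leading whitespace
        { sub(/^[ \t]+/, "") }
        # Remember which drive the following lines belong to
        /^Drive \/c[0-9]+\/e[0-9]+\/s[0-9]+/ { drive = $2 }
        # Report non-zero error counters
        /^(Media Error Count|Predictive Failure Count) = / {
            if ($NF + 0 > 0) print drive ": " $0
        }
        # Report drives that raised a S.M.A.R.T alert
        /^S\.M\.A\.R\.T alert flagged by drive = Yes/ { print drive ": " $0 }
    '
}
```

For example: storcli64 /c0/e32/sall show all | suspect_drives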

Visual inspection of the machine will show a blinking LED on the failing disk. The disks are numbered top to bottom, then left to right, starting with 0 at the top-left position.
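
The numbering scheme can be turned into a position with a little arithmetic. This is a sketch under assumptions: the function name slot_position is hypothetical, the number of rows is not stated on this page (the default of 3 assumes the 12 bays are arranged as 3 rows of 4 columns), and the slot number in the drive path is assumed to match the physical numbering. Rows and columns are counted from 0, with row 0 at the top and column 0 on the left.

```shell
#!/bin/sh
# Sketch only: maps a disk number to its front-panel position.
# Usage: slot_position SLOT [ROWS]   (ROWS defaults to an ASSUMED 3)
slot_position() {
    slot=$1
    rows=${2:-3}
    # Numbering runs down a column first, then moves one column to the right
    echo "row $((slot % rows)), column $((slot / rows))"
}
```

Under these assumptions, slot_position 5 prints "row 2, column 1", i.e. the bottom disk in the second column from the left.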