Managing RAID Controllers

From PDP/Grid Wiki

Latest revision as of 14:03, 2 February 2016

Most systems come equipped with a hardware RAID controller, which can be controlled from the OS when the right software is installed. Several flavours of the software are available, and it is not always obvious which one you need. The software can be used to destroy the RAID set, but more usefully it can be used to:

  • manage the audible alarm (e.g. in case the backup battery unit is failing)
  • blink the LED of a disk that needs replacement
  • enable/disable a disk

See also the specific pages for oliebol and strijker type nodes.

Installing the software

If it is not already installed, the software is available through the nikhef-external repo with yum. Use one of:

yum install MegaCli
yum install storcli
yum install MegaRAID_Software_Manager

MegaCli is a command-line tool with a rather painful syntax. Storcli is nicer.

/opt/MegaRAID/MegaCli/MegaCli64 -CfgDsply -aALL
/opt/MegaRAID/storcli/storcli64 show all

In case these tools report no RAID controllers, try the StorCLI binary that ships with the last package, MegaRAID_Software_Manager:

/usr/local/MegaRAID\ Storage\ Manager/StorCLI/storcli64 show all

One of these should work.
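
Since the storcli64 binary can live in either of the two locations above, a small helper can pick whichever is present. This is only a sketch: the function name find_storcli is made up for this page, and the default paths are simply the two mentioned above.

```shell
#!/bin/sh
# Print the first storcli64 binary that exists and is executable.
# With no arguments, the two storcli64 locations mentioned on this page are tried.
# find_storcli is a hypothetical helper name, not part of any package.
find_storcli() {
    if [ "$#" -eq 0 ]; then
        set -- \
            "/opt/MegaRAID/storcli/storcli64" \
            "/usr/local/MegaRAID Storage Manager/StorCLI/storcli64"
    fi
    for candidate in "$@"; do
        if [ -x "$candidate" ]; then
            printf '%s\n' "$candidate"
            return 0
        fi
    done
    echo "no storcli64 binary found" >&2
    return 1
}
```

The result can be stored in a variable, for example STORCLI=$(find_storcli), and then used as "$STORCLI" show all. The quotes matter because one of the paths contains spaces.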

Using smartctl on the carnaval nodes

The carnaval blades are equipped with Dell PERC H200 cards, having two 500 GB disks in a RAID0 setup. Reading the SMART status of the virtual disk /dev/sda won't work, but what does work is using the generic SCSI devices exposed to the system:

smartctl /dev/sg1 -a
smartctl /dev/sg2 -a

This helps to reveal which of the two disks caused the problem.

Using smartctl on the sint maarten nodes

The smrt blades are HP blades, and they can be read with the following commands:

smartctl -a -d cciss,0 /dev/sg0
smartctl -a -d cciss,1 /dev/sg0
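
To check several disks behind the same controller in one go, the commands above can be wrapped in a loop. The sketch below uses smartctl -H (overall health only) instead of -a to keep the output short; the function name cciss_health and the SMARTCTL override variable are inventions for this example.

```shell
#!/bin/sh
# SMARTCTL can be overridden (for example SMARTCTL=echo) to preview the commands.
SMARTCTL=${SMARTCTL:-smartctl}

# Query the SMART health of each physical disk behind an HP cciss controller.
# $1 = SCSI generic device, remaining arguments = disk indices behind it.
# cciss_health is a hypothetical helper name.
cciss_health() {
    dev=$1
    shift
    for idx in "$@"; do
        echo "== $dev, cciss disk $idx =="
        "$SMARTCTL" -H -d "cciss,$idx" "$dev"
    done
}
```

For the smrt blades this would be called as cciss_health /dev/sg0 0 1, matching the two commands above.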

Replacing failed discs on a degraded (but not failed) raid array

For LSI RAID arrays where a disc has failed but the array is not set to rebuild automatically, a rebuild can be started manually after physically replacing the disc: first join the new disc to the degraded array, then set it into a rebuild state.

First check which disc has been replaced (only relevant output shown):

# /opt/MegaRAID/storcli/storcli64 /c0 show all
TOPOLOGY :
========  
--------------------------------------------------------------------------
DG Arr Row EID:Slot DID Type  State BT       Size PDC  PI SED DS3  FSpace 
--------------------------------------------------------------------------
 0 -   -   -        -   RAID1 Dgrd  N  278.875 GB enbl N  N   dflt N      
 0 0   -   -        -   RAID1 Dgrd  N  278.875 GB enbl N  N   dflt N      
 0 0   0   -        -   DRIVE Msng  -  278.875 GB -    -  -   -    -      
 0 0   1   32:13    13  DRIVE Onln  N  278.875 GB enbl N  N   dflt -      
 1 -   -   -        -   RAID6 Pdgd  N   27.285 TB dflt N  N   dflt N      
 1 0   -   -        -   RAID6 Dgrd  N   27.285 TB dflt N  N   dflt N      
 1 0   0   32:0     0   DRIVE Onln  N    2.728 TB dflt N  N   dflt -      
 1 0   1   32:1     1   DRIVE Onln  N    2.728 TB dflt N  N   dflt -      
 1 0   2   32:2     2   DRIVE Onln  N    2.728 TB dflt N  N   dflt -      
 1 0   3   32:3     3   DRIVE Onln  N    2.728 TB dflt N  N   dflt -      
 1 0   4   32:4     4   DRIVE Onln  N    2.728 TB dflt N  N   dflt -      
 1 0   5   32:5     5   DRIVE Onln  N    2.728 TB dflt N  N   dflt -      
 1 0   6   32:6     6   DRIVE Onln  N    2.728 TB dflt N  N   dflt -      
 1 0   7   32:7     7   DRIVE Onln  N    2.728 TB dflt N  N   dflt -      
 1 0   8   32:8     8   DRIVE Onln  N    2.728 TB dflt N  N   dflt -      
 1 0   9   32:9     9   DRIVE Onln  N    2.728 TB dflt N  N   dflt -      
 1 0   10  32:10    10  DRIVE Onln  N    2.728 TB dflt N  N   dflt -      
 1 0   11  32:11    11  DRIVE Rbld  Y    2.728 TB dflt N  N   dflt -      
--------------------------------------------------------------------------

Here there are two drive groups, 0 and 1. In drive group 1, array 0, row 11 (disc 11) is already in a rebuild state. In drive group 0, array 0, row 0 is missing. Further down in the output of the show all command is the list of all discs present in the machine:

PD LIST :
=======  
-------------------------------------------------------------------------
EID:Slt DID State DG       Size Intf Med SED PI SeSz Model            Sp 
-------------------------------------------------------------------------
32:0      0 Onln   1   2.728 TB SAS  HDD N   N  512B ST33000650SS     U  
32:1      1 Onln   1   2.728 TB SAS  HDD N   N  512B ST3000NM0023     U  
32:2      2 Onln   1   2.728 TB SAS  HDD N   N  512B ST33000650SS     U  
32:3      3 Onln   1   2.728 TB SAS  HDD N   N  512B ST33000650SS     U  
32:4      4 Onln   1   2.728 TB SAS  HDD N   N  512B ST33000650SS     U  
32:5      5 Onln   1   2.728 TB SAS  HDD N   N  512B ST33000650SS     U  
32:6      6 Onln   1   2.728 TB SAS  HDD N   N  512B ST3000NM0023     U  
32:7      7 Onln   1   2.728 TB SAS  HDD N   N  512B MG03SCA300       U  
32:8      8 Onln   1   2.728 TB SAS  HDD N   N  512B ST33000650SS     U  
32:9      9 Onln   1   2.728 TB SAS  HDD N   N  512B ST3000NM0023     U  
32:10    10 Onln   1   2.728 TB SAS  HDD N   N  512B ST33000650SS     U  
32:11    11 Rbld   1   2.728 TB SAS  HDD N   N  512B MG03SCA300       U  
32:12    12 UGood  - 278.875 GB SAS  HDD N   N  512B ST300MM0006      U  
32:13    13 Onln   0 278.875 GB SAS  HDD N   N  512B HUC106030CSS600  U  
-------------------------------------------------------------------------

Here disc 12 is in the UGood (Unconfigured Good) state, i.e. the disc is present but not part of any array.
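
Picking the missing rows and the unconfigured discs out of the long show all output by eye is error-prone, so the two tables can be filtered with awk. This is a sketch that assumes the column layout shown above; the function names missing_rows and ugood_drives are made up here, and both read storcli output on stdin.

```shell
#!/bin/sh
# List TOPOLOGY rows whose drive is missing, printed as "dg array row".
# Assumes the column order DG Arr Row EID:Slot DID Type State ... shown above.
missing_rows() {
    awk '$6 == "DRIVE" && $7 == "Msng" { print $1, $2, $3 }'
}

# List PD LIST entries in the UGood state, printed as "enclosure slot size unit".
# Assumes the column order EID:Slt DID State DG Size ... shown above.
ugood_drives() {
    awk '$1 ~ /^[0-9]+:[0-9]+$/ && $3 == "UGood" {
        split($1, loc, ":")
        print loc[1], loc[2], $5, $6
    }'
}
```

Piping the show all output through each function should give "0 0 0" for the missing row and "32 12 278.875 GB" for the spare disc in the example above. It is still up to the operator to check that the sizes match before inserting a disc.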

Step 1 is to add the disc into the degraded array:

# /opt/MegaRAID/storcli/storcli64 /c0/e32/s12 insert dg=0 array=0 row=0

where /c0/e32/s12 specifies the disc the insert command acts on, and dg=0 array=0 row=0 specifies the drive group, array and row (see the TOPOLOGY table above) into which the drive will be inserted. This leaves the drive in the array but offline:

TOPOLOGY :
========  
--------------------------------------------------------------------------
DG Arr Row EID:Slot DID Type  State BT       Size PDC  PI SED DS3  FSpace 
--------------------------------------------------------------------------
 0 -   -   -        -   RAID1 Dgrd  N  278.875 GB enbl N  N   dflt N      
 0 0   -   -        -   RAID1 Dgrd  N  278.875 GB enbl N  N   dflt N      
 0 0   0   32:12    12  DRIVE Offln N  278.875 GB enbl N  N   dflt -      
 0 0   1   32:13    13  DRIVE Onln  N  278.875 GB enbl N  N   dflt -      
 1 -   -   -        -   RAID6 Pdgd  N   27.285 TB dflt N  N   dflt N      
 1 0   -   -        -   RAID6 Dgrd  N   27.285 TB dflt N  N   dflt N      
 1 0   0   32:0     0   DRIVE Onln  N    2.728 TB dflt N  N   dflt -      
 1 0   1   32:1     1   DRIVE Onln  N    2.728 TB dflt N  N   dflt -      
 1 0   2   32:2     2   DRIVE Onln  N    2.728 TB dflt N  N   dflt -      
 1 0   3   32:3     3   DRIVE Onln  N    2.728 TB dflt N  N   dflt -      
 1 0   4   32:4     4   DRIVE Onln  N    2.728 TB dflt N  N   dflt -      
 1 0   5   32:5     5   DRIVE Onln  N    2.728 TB dflt N  N   dflt -      
 1 0   6   32:6     6   DRIVE Onln  N    2.728 TB dflt N  N   dflt -      
 1 0   7   32:7     7   DRIVE Onln  N    2.728 TB dflt N  N   dflt -      
 1 0   8   32:8     8   DRIVE Onln  N    2.728 TB dflt N  N   dflt -      
 1 0   9   32:9     9   DRIVE Onln  N    2.728 TB dflt N  N   dflt -      
 1 0   10  32:10    10  DRIVE Onln  N    2.728 TB dflt N  N   dflt -      
 1 0   11  32:11    11  DRIVE Rbld  Y    2.728 TB dflt N  N   dflt -      
--------------------------------------------------------------------------

Step 2 is to start the rebuild:

# /opt/MegaRAID/storcli/storcli64 /c0/e32/s12 start rebuild

If successful, the TOPOLOGY table will show an in-progress rebuild:

# /opt/MegaRAID/storcli/storcli64 /c0 show
TOPOLOGY :
========  
--------------------------------------------------------------------------
DG Arr Row EID:Slot DID Type  State BT       Size PDC  PI SED DS3  FSpace 
--------------------------------------------------------------------------
 0 -   -   -        -   RAID1 Dgrd  N  278.875 GB enbl N  N   dflt N      
 0 0   -   -        -   RAID1 Dgrd  N  278.875 GB enbl N  N   dflt N      
 0 0   0   32:12    12  DRIVE Rbld  Y  278.875 GB enbl N  N   dflt -      
 0 0   1   32:13    13  DRIVE Onln  N  278.875 GB enbl N  N   dflt -      
 1 -   -   -        -   RAID6 Pdgd  N   27.285 TB dflt N  N   dflt N      
 1 0   -   -        -   RAID6 Dgrd  N   27.285 TB dflt N  N   dflt N      
 1 0   0   32:0     0   DRIVE Onln  N    2.728 TB dflt N  N   dflt -      
 1 0   1   32:1     1   DRIVE Onln  N    2.728 TB dflt N  N   dflt -      
 1 0   2   32:2     2   DRIVE Onln  N    2.728 TB dflt N  N   dflt -      
 1 0   3   32:3     3   DRIVE Onln  N    2.728 TB dflt N  N   dflt -      
 1 0   4   32:4     4   DRIVE Onln  N    2.728 TB dflt N  N   dflt -      
 1 0   5   32:5     5   DRIVE Onln  N    2.728 TB dflt N  N   dflt -      
 1 0   6   32:6     6   DRIVE Onln  N    2.728 TB dflt N  N   dflt -      
 1 0   7   32:7     7   DRIVE Onln  N    2.728 TB dflt N  N   dflt -      
 1 0   8   32:8     8   DRIVE Onln  N    2.728 TB dflt N  N   dflt -      
 1 0   9   32:9     9   DRIVE Onln  N    2.728 TB dflt N  N   dflt -      
 1 0   10  32:10    10  DRIVE Onln  N    2.728 TB dflt N  N   dflt -      
 1 0   11  32:11    11  DRIVE Rbld  Y    2.728 TB dflt N  N   dflt -      
--------------------------------------------------------------------------

The rebuild progress can be checked with:

# /opt/MegaRAID/storcli/storcli64 /c0/e32/s12 show rebuild
Controller = 0
Status = Success
Description = Show Drive Rebuild Status Succeeded.
----------------------------------
Drive-ID    Progress% Status
----------------------------------
/c0/e32/s12      3.35 In progress
----------------------------------
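
A rebuild can take hours, so it may be convenient to poll the progress from a script. The sketch below parses the Progress% column of the output shown above. It assumes that storcli prints something other than a number in that column once the rebuild is no longer running; the function names, the STORCLI variable and the 60 second default interval are all choices made for this example.

```shell
#!/bin/sh
# Extract the rebuild percentage for one drive from "show rebuild" output on stdin.
# $1 = drive path as printed in the Drive-ID column, e.g. /c0/e32/s12
# Prints nothing when the Progress% column does not hold a number.
rebuild_progress() {
    awk -v drive="$1" '$1 == drive && $2 ~ /^[0-9.]+$/ { print $2 }'
}

# Poll until no percentage is reported any more.
# STORCLI must hold the path of the storcli64 binary; INTERVAL is in seconds.
wait_for_rebuild() {
    drive=$1
    while pct=$("$STORCLI" "$drive" show rebuild | rebuild_progress "$drive") && [ -n "$pct" ]; do
        echo "$drive rebuild at $pct%"
        sleep "${INTERVAL:-60}"
    done
}
```

With STORCLI=/opt/MegaRAID/storcli/storcli64 set, wait_for_rebuild /c0/e32/s12 prints one line per interval and returns once the percentage disappears. The final state should still be confirmed in the TOPOLOGY table.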

Some storcli64-fu for adding disks to a failed RAID6

On October 21st, strijker-21 had two failed disks and one critical disk in a RAID6 set. As soon as the first failed disk was replaced and the rebuild started, the critical disk also failed, leaving the RAID set degraded. No data was lost because there was no data on the disks at the time. The remaining failed disks were replaced, but the array was still degraded afterwards. The following steps restored normal operation; some output has been edited for brevity.

# /opt/MegaRAID/storcli/storcli64 /c0/v1 show all
/c0/v1 :
======
----------------------------------------------------------
DG/VD TYPE  State Access Consist Cache sCC      Size Name 
----------------------------------------------------------
1/1   RAID6 OfLn  RW     No      RWBD  -   27.285 TB VD_1 
----------------------------------------------------------

PDs for VD 1 :
============

-----------------------------------------------------------------------
EID:Slt DID State DG     Size Intf Med SED PI SeSz Model            Sp 
-----------------------------------------------------------------------
32:0      0 Onln   1 2.728 TB SAS  HDD N   N  512B ST33000650SS     U  
32:1      1 Offln  1 2.728 TB SAS  HDD N   N  512B ST3000NM0023     U  
32:2      2 Onln   1 2.728 TB SAS  HDD N   N  512B ST33000650SS     U  
32:3      3 Onln   1 2.728 TB SAS  HDD N   N  512B ST33000650SS     U  
32:4      4 Onln   1 2.728 TB SAS  HDD N   N  512B ST33000650SS     U  
32:5      5 Onln   1 2.728 TB SAS  HDD N   N  512B ST33000650SS     U  
32:7      7 Onln   1 2.728 TB SAS  HDD N   N  512B ST33000650SS     U  
32:8      8 Onln   1 2.728 TB SAS  HDD N   N  512B ST33000650SS     U  
32:10    10 Onln   1 2.728 TB SAS  HDD N   N  512B ST33000650SS     U  
32:11    11 Onln   1 2.728 TB SAS  HDD N   N  512B ST33000650SS     U  
-----------------------------------------------------------------------

As can be seen, disks 6 and 9 are missing, while disk 1 is off-line. That last problem can be remedied easily.

# /opt/MegaRAID/storcli/storcli64 /c0/e32/s1 set online
Controller = 0
Status = Success
Description = Set Drive Online Succeeded.
# /opt/MegaRAID/storcli/storcli64 /c0/v1 show all
/c0/v1 :
======
----------------------------------------------------------
DG/VD TYPE  State Access Consist Cache sCC      Size Name 
----------------------------------------------------------
1/1   RAID6 Dgrd  RW     No      RWBD  -   27.285 TB VD_1 
----------------------------------------------------------

PDs for VD 1 :
============

-----------------------------------------------------------------------
EID:Slt DID State DG     Size Intf Med SED PI SeSz Model            Sp 
-----------------------------------------------------------------------
32:0      0 Onln   1 2.728 TB SAS  HDD N   N  512B ST33000650SS     U  
32:1      1 Onln   1 2.728 TB SAS  HDD N   N  512B ST3000NM0023     U  
32:2      2 Onln   1 2.728 TB SAS  HDD N   N  512B ST33000650SS     U  
32:3      3 Onln   1 2.728 TB SAS  HDD N   N  512B ST33000650SS     U  
32:4      4 Onln   1 2.728 TB SAS  HDD N   N  512B ST33000650SS     U  
32:5      5 Onln   1 2.728 TB SAS  HDD N   N  512B ST33000650SS     U  
32:7      7 Onln   1 2.728 TB SAS  HDD N   N  512B ST33000650SS     U  
32:8      8 Onln   1 2.728 TB SAS  HDD N   N  512B ST33000650SS     U  
32:10    10 Onln   1 2.728 TB SAS  HDD N   N  512B ST33000650SS     U  
32:11    11 Onln   1 2.728 TB SAS  HDD N   N  512B ST33000650SS     U  
-----------------------------------------------------------------------

What about the missing drives?

# /opt/MegaRAID/storcli/storcli64 /c0 show all | less
...
Drive Groups = 2

TOPOLOGY :
========

--------------------------------------------------------------------------
DG Arr Row EID:Slot DID Type  State BT       Size PDC  PI SED DS3  FSpace 
--------------------------------------------------------------------------
 0 -   -   -        -   RAID1 Optl  N  278.875 GB enbl N  N   dflt N      
 0 0   -   -        -   RAID1 Optl  N  278.875 GB enbl N  N   dflt N      
 0 0   0   32:12    12  DRIVE Onln  N  278.875 GB enbl N  N   dflt -      
 0 0   1   32:13    13  DRIVE Onln  N  278.875 GB enbl N  N   dflt -      
 1 -   -   -        -   RAID6 Dgrd  N   27.285 TB dflt N  N   dflt N      
 1 0   -   -        -   RAID6 Dgrd  N   27.285 TB dflt N  N   dflt N      
 1 0   0   32:0     0   DRIVE Onln  N    2.728 TB dflt N  N   dflt -      
 1 0   1   32:1     1   DRIVE Onln  N    2.728 TB dflt N  N   dflt -      
 1 0   2   32:2     2   DRIVE Onln  N    2.728 TB dflt N  N   dflt -      
 1 0   3   32:3     3   DRIVE Onln  N    2.728 TB dflt N  N   dflt -      
 1 0   4   32:4     4   DRIVE Onln  N    2.728 TB dflt N  N   dflt -      
 1 0   5   32:5     5   DRIVE Onln  N    2.728 TB dflt N  N   dflt -      
 1 0   6   -        -   DRIVE Msng  -    2.728 TB -    -  -   -    -      
 1 0   7   32:7     7   DRIVE Onln  N    2.728 TB dflt N  N   dflt -      
 1 0   8   32:8     8   DRIVE Onln  N    2.728 TB dflt N  N   dflt -      
 1 0   9   -        -   DRIVE Msng  -    2.728 TB -    -  -   -    -      
 1 0   10  32:10    10  DRIVE Onln  N    2.728 TB dflt N  N   dflt -      
 1 0   11  32:11    11  DRIVE Onln  N    2.728 TB dflt N  N   dflt -      
--------------------------------------------------------------------------

Physical Drives = 14 

PD LIST :
======= 

-------------------------------------------------------------------------
EID:Slt DID State DG       Size Intf Med SED PI SeSz Model            Sp 
-------------------------------------------------------------------------
32:0      0 Onln   1   2.728 TB SAS  HDD N   N  512B ST33000650SS     U  
32:1      1 Onln   1   2.728 TB SAS  HDD N   N  512B ST3000NM0023     U  
32:2      2 Onln   1   2.728 TB SAS  HDD N   N  512B ST33000650SS     U  
32:3      3 Onln   1   2.728 TB SAS  HDD N   N  512B ST33000650SS     U  
32:4      4 Onln   1   2.728 TB SAS  HDD N   N  512B ST33000650SS     U  
32:5      5 Onln   1   2.728 TB SAS  HDD N   N  512B ST33000650SS     U  
32:6      6 UGood  -   2.728 TB SAS  HDD N   N  512B ST3000NM0023     U  
32:9      9 UGood  -   2.728 TB SAS  HDD N   N  512B ST3000NM0023     U  
32:12    12 Onln   0 278.875 GB SAS  HDD N   N  512B HUC106030CSS600  U  
32:13    13 Onln   0 278.875 GB SAS  HDD N   N  512B HUC106030CSS600  U  
32:7      7 Onln   1   2.728 TB SAS  HDD N   N  512B ST33000650SS     U  
32:8      8 Onln   1   2.728 TB SAS  HDD N   N  512B ST33000650SS     U  
32:10    10 Onln   1   2.728 TB SAS  HDD N   N  512B ST33000650SS     U  
32:11    11 Onln   1   2.728 TB SAS  HDD N   N  512B ST33000650SS     U  
-------------------------------------------------------------------------

So the disks in slots 6 and 9 are unconfigured good and not part of a drive group, while drive group 1 is missing drives in array 0, rows 6 and 9. Let's try to insert disks 6 and 9 back into their respective rows.

# /opt/MegaRAID/storcli/storcli64 /c0/e32/s6 insert dg=1 array=0 row=6
Controller = 0
Status = Success
Description = Insert Drive Succeeded.

# /opt/MegaRAID/storcli/storcli64 /c0/e32/s9 insert dg=1 array=0 row=9
Controller = 0
Status = Success
Description = Insert Drive Succeeded.

# /opt/MegaRAID/storcli/storcli64 /c0/d1 show all

TOPOLOGY :
========

-------------------------------------------------------------------------
DG Arr Row EID:Slot DID Type  State BT      Size PDC  PI SED DS3  FSpace 
-------------------------------------------------------------------------
 1 -   -   -        -   RAID6 Dgrd  N  27.285 TB dflt N  N   dflt N      
 1 0   -   -        -   RAID6 Dgrd  N  27.285 TB dflt N  N   dflt N      
 1 0   0   32:0     0   DRIVE Onln  N   2.728 TB dflt N  N   dflt -      
 1 0   1   32:1     1   DRIVE Onln  N   2.728 TB dflt N  N   dflt -      
 1 0   2   32:2     2   DRIVE Onln  N   2.728 TB dflt N  N   dflt -      
 1 0   3   32:3     3   DRIVE Onln  N   2.728 TB dflt N  N   dflt -      
 1 0   4   32:4     4   DRIVE Onln  N   2.728 TB dflt N  N   dflt -      
 1 0   5   32:5     5   DRIVE Onln  N   2.728 TB dflt N  N   dflt -      
 1 0   6   32:6     6   DRIVE Offln N   2.728 TB dflt N  N   dflt -      
 1 0   7   32:7     7   DRIVE Onln  N   2.728 TB dflt N  N   dflt -      
 1 0   8   32:8     8   DRIVE Onln  N   2.728 TB dflt N  N   dflt -      
 1 0   9   32:9     9   DRIVE Offln N   2.728 TB dflt N  N   dflt -      
 1 0   10  32:10    10  DRIVE Onln  N   2.728 TB dflt N  N   dflt -      
 1 0   11  32:11    11  DRIVE Onln  N   2.728 TB dflt N  N   dflt -      
-------------------------------------------------------------------------

That is looking better, but the drives are still off-line.

# /opt/MegaRAID/storcli/storcli64 /c0/e32/s6 set online
Controller = 0
Status = Success
Description = Set Drive Online Succeeded.
# /opt/MegaRAID/storcli/storcli64 /c0/e32/s9 set online
Controller = 0
Status = Success
Description = Set Drive Online Succeeded.
# /opt/MegaRAID/storcli/storcli64 /c0/d1 show all

-------------------------------------------------------------------------
DG Arr Row EID:Slot DID Type  State BT      Size PDC  PI SED DS3  FSpace 
-------------------------------------------------------------------------
 1 -   -   -        -   RAID6 Optl  N  27.285 TB dflt N  N   dflt N      
 1 0   -   -        -   RAID6 Optl  N  27.285 TB dflt N  N   dflt N      
 1 0   0   32:0     0   DRIVE Onln  N   2.728 TB dflt N  N   dflt -      
 1 0   1   32:1     1   DRIVE Onln  N   2.728 TB dflt N  N   dflt -      
 1 0   2   32:2     2   DRIVE Onln  N   2.728 TB dflt N  N   dflt -      
 1 0   3   32:3     3   DRIVE Onln  N   2.728 TB dflt N  N   dflt -      
 1 0   4   32:4     4   DRIVE Onln  N   2.728 TB dflt N  N   dflt -      
 1 0   5   32:5     5   DRIVE Onln  N   2.728 TB dflt N  N   dflt -      
 1 0   6   32:6     6   DRIVE Onln  N   2.728 TB dflt N  N   dflt -      
 1 0   7   32:7     7   DRIVE Onln  N   2.728 TB dflt N  N   dflt -      
 1 0   8   32:8     8   DRIVE Onln  N   2.728 TB dflt N  N   dflt -      
 1 0   9   32:9     9   DRIVE Onln  N   2.728 TB dflt N  N   dflt -      
 1 0   10  32:10    10  DRIVE Onln  N   2.728 TB dflt N  N   dflt -      
 1 0   11  32:11    11  DRIVE Onln  N   2.728 TB dflt N  N   dflt -      
-------------------------------------------------------------------------

----------------------------------------------------------
DG/VD TYPE  State Access Consist Cache sCC      Size Name 
----------------------------------------------------------
1/1   RAID6 Optl  RW     No      RWBD  -   27.285 TB VD_1 
----------------------------------------------------------

Now it is 'optimal' but not consistent? This is odd, but at least we can write data to it.

# pvscan
  PV /dev/sdb1   VG data     lvm2 [27,29 TiB / 27,29 TiB free]
  PV /dev/sda2   VG system   lvm2 [278,34 GiB / 193,84 GiB free]
  Total: 2 [27,56 TiB] / in use: 2 [27,56 TiB] / in no VG: 0 [0   ]
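
Regarding the "Consist No" column in the virtual drive table above: it most likely means that the controller has not yet verified the parity of the re-assembled set. To keep an eye on this across virtual drives, the State and Consist columns can be extracted with awk. This is a sketch that assumes the DG/VD table layout shown above; the function name vd_consistency is made up. As far as we know storcli also has a consistency check command of the form /c0/v1 start cc, but that should be verified against the storcli help output before relying on it.

```shell
#!/bin/sh
# Print "DG/VD state consist" for every virtual drive row in storcli output on stdin.
# Assumes the column order DG/VD TYPE State Access Consist ... shown above.
vd_consistency() {
    awk '$1 ~ /^[0-9]+\/[0-9]+$/ && $2 ~ /^RAID/ { print $1, $3, $5 }'
}
```

For the last table above this should print "1/1 Optl No".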