Managing RAID Controllers
Most systems come equipped with a hardware RAID controller, which can be controlled from the OS when the right software is installed. There are a couple of flavours available, and the one you need is not always obvious. The software can be used to destroy the RAID set, but more usefully it can be used to:
- manage the audible alarm (e.g. in case the backup battery unit is failing)
- blink the led of a disk that needs replacement
- enable/disable a disk
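As an illustration, with StorCLI the first two of these look as follows. This is a minimal sketch; the controller, enclosure, and slot numbers (/c0, e32, s4) are placeholders to adapt, and the exact syntax may differ between StorCLI versions:

 /opt/MegaRAID/storcli/storcli64 /c0 set alarm=silence     # silence the audible alarm
 /opt/MegaRAID/storcli/storcli64 /c0/e32/s4 start locate   # blink the locate LED of the drive in slot 4
 /opt/MegaRAID/storcli/storcli64 /c0/e32/s4 stop locate    # stop blinking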
See also the specific pages for oliebol and strijker type nodes.
Installing the software
If it is not already installed, the software is available through the nikhef-external repo with yum. Use one of:
 yum install MegaCli
 yum install storcli
 yum install MegaRAID_Software_Manager
MegaCli is a command-line tool with a rather painful syntax. Storcli is nicer.
 /opt/MegaRAID/MegaCli/MegaCli64 -CfgDsply -aALL
 /opt/MegaRAID/storcli/storcli64 show all
In case these tools report no RAID controllers, try the StorCLI binary bundled with the MegaRAID_Software_Manager package:
/usr/local/MegaRAID\ Storage\ Manager/StorCLI/storcli64 show all
One of these should work.
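A small convenience sketch that simply runs whichever storcli64 binary happens to be installed; note that storcli64 may exit with status 0 even when it finds no controller, so check the output rather than the exit code:

 for cli in /opt/MegaRAID/storcli/storcli64 \
            '/usr/local/MegaRAID Storage Manager/StorCLI/storcli64'; do
     [ -x "$cli" ] && "$cli" show all   # run every storcli64 that is present
 done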
The carnaval blades are equipped with Dell PERC H200 cards, with two 500 GB disks in a RAID0 setup. Reading the SMART status of the virtual disk /dev/sda won't work, but querying the generic SCSI devices exposed to the system does:
 smartctl /dev/sg1 -a
 smartctl /dev/sg2 -a
This helps to reveal which of the two disks caused the problem.
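To get a quick verdict for both discs at once, the relevant lines can be filtered out of the output. A sketch, assuming smartctl reports the usual SAS/SCSI health line and grown defect list for these discs:

 for dev in /dev/sg1 /dev/sg2; do
     echo "== $dev =="
     smartctl -H "$dev"                   # overall health verdict
     smartctl -a "$dev" | grep -i defect  # grown defect list, if reported
 done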
Using smartctl on the sint maarten nodes
The smrt blades are HP blades; the SMART status of their discs can be read with the following commands:
 smartctl -a -d cciss,0 /dev/sg0
 smartctl -a -d cciss,1 /dev/sg0
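If a blade exposes more discs behind the controller, the cciss index is simply incremented. A small sketch, where the number of discs (two) is an assumption:

 for i in 0 1; do
     smartctl -a -d cciss,$i /dev/sg0
 done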
Replacing failed discs on a degraded (but not failed) RAID array
On LSI RAID arrays where a disc has failed but the array is not set to rebuild automatically, a rebuild can be started manually after physically replacing the disc: first join the disc to the degraded array, then set the disc into a rebuild state.
First check which disc has been replaced (only relevant output shown):
# /opt/MegaRAID/storcli/storcli64 /c0 show all
TOPOLOGY :
========

--------------------------------------------------------------------------
DG Arr Row EID:Slot DID Type  State BT       Size PDC  PI SED DS3  FSpace
--------------------------------------------------------------------------
 0 -   -   -        -   RAID1 Dgrd  N  278.875 GB enbl N  N   dflt N
 0 0   -   -        -   RAID1 Dgrd  N  278.875 GB enbl N  N   dflt N
 0 0   0   -        -   DRIVE Msng  -  278.875 GB -    -  -   -    -
 0 0   1   32:13    13  DRIVE Onln  N  278.875 GB enbl N  N   dflt -
 1 -   -   -        -   RAID6 Pdgd  N  27.285 TB  dflt N  N   dflt N
 1 0   -   -        -   RAID6 Dgrd  N  27.285 TB  dflt N  N   dflt N
 1 0   0   32:0     0   DRIVE Onln  N  2.728 TB   dflt N  N   dflt -
 1 0   1   32:1     1   DRIVE Onln  N  2.728 TB   dflt N  N   dflt -
 1 0   2   32:2     2   DRIVE Onln  N  2.728 TB   dflt N  N   dflt -
 1 0   3   32:3     3   DRIVE Onln  N  2.728 TB   dflt N  N   dflt -
 1 0   4   32:4     4   DRIVE Onln  N  2.728 TB   dflt N  N   dflt -
 1 0   5   32:5     5   DRIVE Onln  N  2.728 TB   dflt N  N   dflt -
 1 0   6   32:6     6   DRIVE Onln  N  2.728 TB   dflt N  N   dflt -
 1 0   7   32:7     7   DRIVE Onln  N  2.728 TB   dflt N  N   dflt -
 1 0   8   32:8     8   DRIVE Onln  N  2.728 TB   dflt N  N   dflt -
 1 0   9   32:9     9   DRIVE Onln  N  2.728 TB   dflt N  N   dflt -
 1 0   10  32:10    10  DRIVE Onln  N  2.728 TB   dflt N  N   dflt -
 1 0   11  32:11    11  DRIVE Rbld  Y  2.728 TB   dflt N  N   dflt -
--------------------------------------------------------------------------
Here there are two drive groups, 0 and 1. In drive group 1, array 0, row 11 (disc 11) is already in a rebuild state. In drive group 0, array 0, row 0 is missing. Further down in the output of the show all command is the list of all discs present in the machine:
PD LIST :
=======

-------------------------------------------------------------------------
EID:Slt DID State DG       Size Intf Med SED PI SeSz Model           Sp
-------------------------------------------------------------------------
32:0     0  Onln   0   2.728 TB SAS  HDD N   N  512B ST33000650SS    U
32:1     1  Onln   1   2.728 TB SAS  HDD N   N  512B ST3000NM0023    U
32:2     2  Onln   1   2.728 TB SAS  HDD N   N  512B ST33000650SS    U
32:3     3  Onln   1   2.728 TB SAS  HDD N   N  512B ST33000650SS    U
32:4     4  Onln   1   2.728 TB SAS  HDD N   N  512B ST33000650SS    U
32:5     5  Onln   1   2.728 TB SAS  HDD N   N  512B ST33000650SS    U
32:6     6  Onln   1   2.728 TB SAS  HDD N   N  512B ST3000NM0023    U
32:7     7  Onln   1   2.728 TB SAS  HDD N   N  512B MG03SCA300      U
32:8     8  Onln   1   2.728 TB SAS  HDD N   N  512B ST33000650SS    U
32:9     9  Onln   1   2.728 TB SAS  HDD N   N  512B ST3000NM0023    U
32:10   10  Onln   1   2.728 TB SAS  HDD N   N  512B ST33000650SS    U
32:11   11  Rbld   1   2.728 TB SAS  HDD N   N  512B MG03SCA300      U
32:12   12  UGood  - 278.875 GB SAS  HDD N   N  512B ST300MM0006     U
32:13   13  Onln   0 278.875 GB SAS  HDD N   N  512B HUC106030CSS600 U
-------------------------------------------------------------------------
Here disc 12 is in the UGood (Unconfigured Good) state, i.e. the disc is not doing anything.
Step 1 is to add the disc into the degraded array:
# /opt/MegaRAID/storcli/storcli64 /c0/e32/s12 insert dg=0 array=0 row=0
where /c0/e32/s12 specifies the disc the insert command will act on, and dg=0 array=0 row=0 specifies the drive group, array, and row (see the TOPOLOGY table above) into which the drive will be inserted. This leaves the drive in the array, but offline:
TOPOLOGY :
========

--------------------------------------------------------------------------
DG Arr Row EID:Slot DID Type  State BT       Size PDC  PI SED DS3  FSpace
--------------------------------------------------------------------------
 0 -   -   -        -   RAID1 Dgrd  N  278.875 GB enbl N  N   dflt N
 0 0   -   -        -   RAID1 Dgrd  N  278.875 GB enbl N  N   dflt N
 0 0   0   32:12    12  DRIVE Offln N  278.875 GB enbl N  N   dflt -
 0 0   1   32:13    13  DRIVE Onln  N  278.875 GB enbl N  N   dflt -
 1 -   -   -        -   RAID6 Pdgd  N  27.285 TB  dflt N  N   dflt N
 1 0   -   -        -   RAID6 Dgrd  N  27.285 TB  dflt N  N   dflt N
 1 0   0   32:0     0   DRIVE Onln  N  2.728 TB   dflt N  N   dflt -
 1 0   1   32:1     1   DRIVE Onln  N  2.728 TB   dflt N  N   dflt -
 1 0   2   32:2     2   DRIVE Onln  N  2.728 TB   dflt N  N   dflt -
 1 0   3   32:3     3   DRIVE Onln  N  2.728 TB   dflt N  N   dflt -
 1 0   4   32:4     4   DRIVE Onln  N  2.728 TB   dflt N  N   dflt -
 1 0   5   32:5     5   DRIVE Onln  N  2.728 TB   dflt N  N   dflt -
 1 0   6   32:6     6   DRIVE Onln  N  2.728 TB   dflt N  N   dflt -
 1 0   7   32:7     7   DRIVE Onln  N  2.728 TB   dflt N  N   dflt -
 1 0   8   32:8     8   DRIVE Onln  N  2.728 TB   dflt N  N   dflt -
 1 0   9   32:9     9   DRIVE Onln  N  2.728 TB   dflt N  N   dflt -
 1 0   10  32:10    10  DRIVE Onln  N  2.728 TB   dflt N  N   dflt -
 1 0   11  32:11    11  DRIVE Rbld  Y  2.728 TB   dflt N  N   dflt -
--------------------------------------------------------------------------
The second step is to then start the rebuild:
# /opt/MegaRAID/storcli/storcli64 /c0/e32/s12 start rebuild
If successful, the TOPOLOGY table will show the rebuild in progress:
# /opt/MegaRAID/storcli/storcli64 /c0 show
TOPOLOGY :
========

--------------------------------------------------------------------------
DG Arr Row EID:Slot DID Type  State BT       Size PDC  PI SED DS3  FSpace
--------------------------------------------------------------------------
 0 -   -   -        -   RAID1 Dgrd  N  278.875 GB enbl N  N   dflt N
 0 0   -   -        -   RAID1 Dgrd  N  278.875 GB enbl N  N   dflt N
 0 0   0   32:12    12  DRIVE Rbld  Y  278.875 GB enbl N  N   dflt -
 0 0   1   32:13    13  DRIVE Onln  N  278.875 GB enbl N  N   dflt -
 1 -   -   -        -   RAID6 Pdgd  N  27.285 TB  dflt N  N   dflt N
 1 0   -   -        -   RAID6 Dgrd  N  27.285 TB  dflt N  N   dflt N
 1 0   0   32:0     0   DRIVE Onln  N  2.728 TB   dflt N  N   dflt -
 1 0   1   32:1     1   DRIVE Onln  N  2.728 TB   dflt N  N   dflt -
 1 0   2   32:2     2   DRIVE Onln  N  2.728 TB   dflt N  N   dflt -
 1 0   3   32:3     3   DRIVE Onln  N  2.728 TB   dflt N  N   dflt -
 1 0   4   32:4     4   DRIVE Onln  N  2.728 TB   dflt N  N   dflt -
 1 0   5   32:5     5   DRIVE Onln  N  2.728 TB   dflt N  N   dflt -
 1 0   6   32:6     6   DRIVE Onln  N  2.728 TB   dflt N  N   dflt -
 1 0   7   32:7     7   DRIVE Onln  N  2.728 TB   dflt N  N   dflt -
 1 0   8   32:8     8   DRIVE Onln  N  2.728 TB   dflt N  N   dflt -
 1 0   9   32:9     9   DRIVE Onln  N  2.728 TB   dflt N  N   dflt -
 1 0   10  32:10    10  DRIVE Onln  N  2.728 TB   dflt N  N   dflt -
 1 0   11  32:11    11  DRIVE Rbld  Y  2.728 TB   dflt N  N   dflt -
--------------------------------------------------------------------------
And the rebuild progress can be checked with:
# /opt/MegaRAID/storcli/storcli64 /c0/e32/s12 show rebuild
Controller = 0
Status = Success
Description = Show Drive Rebuild Status Succeeded.

----------------------------------
Drive-ID    Progress% Status
----------------------------------
/c0/e32/s12      3.35 In progress
----------------------------------
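Rather than re-running the command by hand, the progress can be followed with watch(1); a small convenience sketch:

 watch -n 60 '/opt/MegaRAID/storcli/storcli64 /c0/e32/s12 show rebuild'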
Some storcli64-fu for adding disks to a failed RAID6
On October 21st, strijker-21 had two failed and one critical disk in a RAID6 set. As soon as the first failed disk was replaced and the rebuild started, the critical disk also failed, leaving the RAID set degraded. No data was lost because there was no data on disk at the time. The remaining failed disks were replaced, but this left the array still degraded. Here is what it took to restore normal operations; the output below has been edited for brevity.
# /opt/MegaRAID/storcli/storcli64 /c0/v1 show all
/c0/v1 :
======

----------------------------------------------------------
DG/VD TYPE  State Access Consist Cache sCC       Size Name
----------------------------------------------------------
1/1   RAID6 OfLn  RW     No      RWBD  -    27.285 TB VD_1
----------------------------------------------------------

PDs for VD 1 :
============

-----------------------------------------------------------------------
EID:Slt DID State DG     Size Intf Med SED PI SeSz Model           Sp
-----------------------------------------------------------------------
32:0     0  Onln   1 2.728 TB SAS  HDD N   N  512B ST33000650SS    U
32:1     1  Offln  1 2.728 TB SAS  HDD N   N  512B ST3000NM0023    U
32:2     2  Onln   1 2.728 TB SAS  HDD N   N  512B ST33000650SS    U
32:3     3  Onln   1 2.728 TB SAS  HDD N   N  512B ST33000650SS    U
32:4     4  Onln   1 2.728 TB SAS  HDD N   N  512B ST33000650SS    U
32:5     5  Onln   1 2.728 TB SAS  HDD N   N  512B ST33000650SS    U
32:7     7  Onln   1 2.728 TB SAS  HDD N   N  512B ST33000650SS    U
32:8     8  Onln   1 2.728 TB SAS  HDD N   N  512B ST33000650SS    U
32:10   10  Onln   1 2.728 TB SAS  HDD N   N  512B ST33000650SS    U
32:11   11  Onln   1 2.728 TB SAS  HDD N   N  512B ST33000650SS    U
-----------------------------------------------------------------------
As can be seen, disks 6 and 9 are missing, while disk 1 is off-line. That last problem can be remedied easily.
# /opt/MegaRAID/storcli/storcli64 /c0/e32/s1 set online

Controller = 0
Status = Success
Description = Set Drive Online Succeeded.
# /opt/MegaRAID/storcli/storcli64 /c0/v1 show all

/c0/v1 :
======

----------------------------------------------------------
DG/VD TYPE  State Access Consist Cache sCC       Size Name
----------------------------------------------------------
1/1   RAID6 Dgrd  RW     No      RWBD  -    27.285 TB VD_1
----------------------------------------------------------

PDs for VD 1 :
============

-----------------------------------------------------------------------
EID:Slt DID State DG     Size Intf Med SED PI SeSz Model           Sp
-----------------------------------------------------------------------
32:0     0  Onln   1 2.728 TB SAS  HDD N   N  512B ST33000650SS    U
32:1     1  Onln   1 2.728 TB SAS  HDD N   N  512B ST3000NM0023    U
32:2     2  Onln   1 2.728 TB SAS  HDD N   N  512B ST33000650SS    U
32:3     3  Onln   1 2.728 TB SAS  HDD N   N  512B ST33000650SS    U
32:4     4  Onln   1 2.728 TB SAS  HDD N   N  512B ST33000650SS    U
32:5     5  Onln   1 2.728 TB SAS  HDD N   N  512B ST33000650SS    U
32:7     7  Onln   1 2.728 TB SAS  HDD N   N  512B ST33000650SS    U
32:8     8  Onln   1 2.728 TB SAS  HDD N   N  512B ST33000650SS    U
32:10   10  Onln   1 2.728 TB SAS  HDD N   N  512B ST33000650SS    U
32:11   11  Onln   1 2.728 TB SAS  HDD N   N  512B ST33000650SS    U
-----------------------------------------------------------------------
What about the missing drives?
# /opt/MegaRAID/storcli/storcli64 /c0 show all | less
...
Drive Groups = 2

TOPOLOGY :
========

--------------------------------------------------------------------------
DG Arr Row EID:Slot DID Type  State BT       Size PDC  PI SED DS3  FSpace
--------------------------------------------------------------------------
 0 -   -   -        -   RAID1 Optl  N  278.875 GB enbl N  N   dflt N
 0 0   -   -        -   RAID1 Optl  N  278.875 GB enbl N  N   dflt N
 0 0   0   32:12    12  DRIVE Onln  N  278.875 GB enbl N  N   dflt -
 0 0   1   32:13    13  DRIVE Onln  N  278.875 GB enbl N  N   dflt -
 1 -   -   -        -   RAID6 Dgrd  N  27.285 TB  dflt N  N   dflt N
 1 0   -   -        -   RAID6 Dgrd  N  27.285 TB  dflt N  N   dflt N
 1 0   0   32:0     0   DRIVE Onln  N  2.728 TB   dflt N  N   dflt -
 1 0   1   32:1     1   DRIVE Onln  N  2.728 TB   dflt N  N   dflt -
 1 0   2   32:2     2   DRIVE Onln  N  2.728 TB   dflt N  N   dflt -
 1 0   3   32:3     3   DRIVE Onln  N  2.728 TB   dflt N  N   dflt -
 1 0   4   32:4     4   DRIVE Onln  N  2.728 TB   dflt N  N   dflt -
 1 0   5   32:5     5   DRIVE Onln  N  2.728 TB   dflt N  N   dflt -
 1 0   6   -        -   DRIVE Msng  -  2.728 TB   -    -  -   -    -
 1 0   7   32:7     7   DRIVE Onln  N  2.728 TB   dflt N  N   dflt -
 1 0   8   32:8     8   DRIVE Onln  N  2.728 TB   dflt N  N   dflt -
 1 0   9   -        -   DRIVE Msng  -  2.728 TB   -    -  -   -    -
 1 0   10  32:10    10  DRIVE Onln  N  2.728 TB   dflt N  N   dflt -
 1 0   11  32:11    11  DRIVE Onln  N  2.728 TB   dflt N  N   dflt -
--------------------------------------------------------------------------

Physical Drives = 14

PD LIST :
=======

-------------------------------------------------------------------------
EID:Slt DID State DG       Size Intf Med SED PI SeSz Model           Sp
-------------------------------------------------------------------------
32:0     0  Onln   1   2.728 TB SAS  HDD N   N  512B ST33000650SS    U
32:1     1  Onln   1   2.728 TB SAS  HDD N   N  512B ST3000NM0023    U
32:2     2  Onln   1   2.728 TB SAS  HDD N   N  512B ST33000650SS    U
32:3     3  Onln   1   2.728 TB SAS  HDD N   N  512B ST33000650SS    U
32:4     4  Onln   1   2.728 TB SAS  HDD N   N  512B ST33000650SS    U
32:5     5  Onln   1   2.728 TB SAS  HDD N   N  512B ST33000650SS    U
32:6     6  UGood  -   2.728 TB SAS  HDD N   N  512B ST3000NM0023    U
32:9     9  UGood  -   2.728 TB SAS  HDD N   N  512B ST3000NM0023    U
32:12   12  Onln   0 278.875 GB SAS  HDD N   N  512B HUC106030CSS600 U
32:13   13  Onln   0 278.875 GB SAS  HDD N   N  512B HUC106030CSS600 U
32:7     7  Onln   1   2.728 TB SAS  HDD N   N  512B ST33000650SS    U
32:8     8  Onln   1   2.728 TB SAS  HDD N   N  512B ST33000650SS    U
32:10   10  Onln   1   2.728 TB SAS  HDD N   N  512B ST33000650SS    U
32:11   11  Onln   1   2.728 TB SAS  HDD N   N  512B ST33000650SS    U
-------------------------------------------------------------------------
So the disks in slots 6 and 9 are Unconfigured Good (UGood) and not part of a drive group, while drive group 1 is missing drives in array 0, rows 6 and 9. Let's insert disks 6 and 9 back into their respective rows.
# /opt/MegaRAID/storcli/storcli64 /c0/e32/s6 insert dg=1 array=0 row=6

Controller = 0
Status = Success
Description = Insert Drive Succeeded.

# /opt/MegaRAID/storcli/storcli64 /c0/e32/s9 insert dg=1 array=0 row=9

Controller = 0
Status = Success
Description = Insert Drive Succeeded.

# /opt/MegaRAID/storcli/storcli64 /c0/d1 show all

TOPOLOGY :
========

-------------------------------------------------------------------------
DG Arr Row EID:Slot DID Type  State BT      Size PDC  PI SED DS3  FSpace
-------------------------------------------------------------------------
 1 -   -   -        -   RAID6 Dgrd  N 27.285 TB dflt N  N   dflt N
 1 0   -   -        -   RAID6 Dgrd  N 27.285 TB dflt N  N   dflt N
 1 0   0   32:0     0   DRIVE Onln  N 2.728 TB  dflt N  N   dflt -
 1 0   1   32:1     1   DRIVE Onln  N 2.728 TB  dflt N  N   dflt -
 1 0   2   32:2     2   DRIVE Onln  N 2.728 TB  dflt N  N   dflt -
 1 0   3   32:3     3   DRIVE Onln  N 2.728 TB  dflt N  N   dflt -
 1 0   4   32:4     4   DRIVE Onln  N 2.728 TB  dflt N  N   dflt -
 1 0   5   32:5     5   DRIVE Onln  N 2.728 TB  dflt N  N   dflt -
 1 0   6   32:6     6   DRIVE Offln N 2.728 TB  dflt N  N   dflt -
 1 0   7   32:7     7   DRIVE Onln  N 2.728 TB  dflt N  N   dflt -
 1 0   8   32:8     8   DRIVE Onln  N 2.728 TB  dflt N  N   dflt -
 1 0   9   32:9     9   DRIVE Offln N 2.728 TB  dflt N  N   dflt -
 1 0   10  32:10    10  DRIVE Onln  N 2.728 TB  dflt N  N   dflt -
 1 0   11  32:11    11  DRIVE Onln  N 2.728 TB  dflt N  N   dflt -
-------------------------------------------------------------------------
That is looking better, but the drives are still off-line.
# /opt/MegaRAID/storcli/storcli64 /c0/e32/s6 set online

Controller = 0
Status = Success
Description = Set Drive Online Succeeded.
# /opt/MegaRAID/storcli/storcli64 /c0/e32/s9 set online

Controller = 0
Status = Success
Description = Set Drive Online Succeeded.
# /opt/MegaRAID/storcli/storcli64 /c0/d1 show all

-------------------------------------------------------------------------
DG Arr Row EID:Slot DID Type  State BT      Size PDC  PI SED DS3  FSpace
-------------------------------------------------------------------------
 1 -   -   -        -   RAID6 Optl  N 27.285 TB dflt N  N   dflt N
 1 0   -   -        -   RAID6 Optl  N 27.285 TB dflt N  N   dflt N
 1 0   0   32:0     0   DRIVE Onln  N 2.728 TB  dflt N  N   dflt -
 1 0   1   32:1     1   DRIVE Onln  N 2.728 TB  dflt N  N   dflt -
 1 0   2   32:2     2   DRIVE Onln  N 2.728 TB  dflt N  N   dflt -
 1 0   3   32:3     3   DRIVE Onln  N 2.728 TB  dflt N  N   dflt -
 1 0   4   32:4     4   DRIVE Onln  N 2.728 TB  dflt N  N   dflt -
 1 0   5   32:5     5   DRIVE Onln  N 2.728 TB  dflt N  N   dflt -
 1 0   6   32:6     6   DRIVE Onln  N 2.728 TB  dflt N  N   dflt -
 1 0   7   32:7     7   DRIVE Onln  N 2.728 TB  dflt N  N   dflt -
 1 0   8   32:8     8   DRIVE Onln  N 2.728 TB  dflt N  N   dflt -
 1 0   9   32:9     9   DRIVE Onln  N 2.728 TB  dflt N  N   dflt -
 1 0   10  32:10    10  DRIVE Onln  N 2.728 TB  dflt N  N   dflt -
 1 0   11  32:11    11  DRIVE Onln  N 2.728 TB  dflt N  N   dflt -
-------------------------------------------------------------------------

----------------------------------------------------------
DG/VD TYPE  State Access Consist Cache sCC       Size Name
----------------------------------------------------------
1/1   RAID6 Optl  RW     No      RWBD  -    27.285 TB VD_1
----------------------------------------------------------
Now it is 'optimal' but not consistent? This is odd, but at least we can write data to it.
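If the missing consistency is a concern, StorCLI can run a consistency check on the virtual drive. A sketch, not verified on this particular machine; a VD that was never initialized may require the force option:

 # /opt/MegaRAID/storcli/storcli64 /c0/v1 start cc
 # /opt/MegaRAID/storcli/storcli64 /c0/v1 show cc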
# pvscan
  PV /dev/sdb1   VG data     lvm2 [27,29 TiB / 27,29 TiB free]
  PV /dev/sda2   VG system   lvm2 [278,34 GiB / 193,84 GiB free]
  Total: 2 [27,56 TiB] / in use: 2 [27,56 TiB] / in no VG: 0 [0   ]