RAID-5 drive failure
Back in December 2004, I wrote about implementing hardware RAID-5.
Yesterday, one of the drives in that cluster failed.
Here is the output which shows a failed drive:
$ raidutil -L all
RAIDUTIL Version: 3.04 Date: 9/27/2000 FreeBSD CLI Configuration Utility
Adaptec ENGINE Version: 3.04 Date: 9/27/2000 Adaptec FreeBSD SCSI Engine
# b0 b1 b2 Controller Cache FW NVRAM Serial Status
---------------------------------------------------------------------------
d0 -- -- -- ADAP2400A 16MB 3A0L CHNL 1.1 BF0B111Z0B4Optimal
Physical View
Address Type Manufacturer/Model Capacity Status
---------------------------------------------------------------------------
d0b0t0d0 Disk Drive (DASD) ST380011 A 76319MB Optimal
d0b1t0d0 Disk Drive (DASD) ST380011 A 76319MB Optimal
d0b2t0d0 Disk Drive (DASD) ST380011 A 76319MB Optimal
d0b3t0d0 Disk Drive (DASD) ST380011 A 76319MB Failed drive
Logical View
Address Type Manufacturer/Model Capacity Status
---------------------------------------------------------------------------
d0b0t0d0 RAID 5 (Redundant ADAPTEC RAID-5 228957MB Degraded
d0b1t0d0 Disk Drive (DASD) ST380011 A 76319MB Optimal
d0b0t0d0 Disk Drive (DASD) ST380011 A 76319MB Optimal
d0b2t0d0 Disk Drive (DASD) ST380011 A 76319MB Optimal
d0b3t0d0 Disk Drive (DASD) ST380011 A 76319MB Failed drive
Address Max Speed Actual Rate / Width
---------------------------------------------------------------------------
d0b0t0d0 50 MHz 100 MB/sec wide
d0b1t0d0 50 MHz 100 MB/sec wide
d0b2t0d0 50 MHz 100 MB/sec wide
d0b3t0d0 10 MHz 100 MB/sec wide
Address Manufacturer/Model Write Cache Mode (HBA/Device)
---------------------------------------------------------------------------
d0b0t0d0 ADAPTEC RAID-5 Write Back / --
d0b1t0d0 ST380011 A -- / Write Back
d0b0t0d0 ST380011 A -- / Write Back
d0b2t0d0 ST380011 A -- / Write Back
d0b3t0d0 ST380011 A -- / Write Back
# Controller Cache FW NVRAM BIOS SMOR Serial
---------------------------------------------------------------------------
d0 ADAP2400A 16MB 3A0L CHNL 1.1 1.62 1.12/79I BF0B111Z0B4
# Controller Status Voltage Current Full Cap Rem Cap Rem Time
---------------------------------------------------------------------------
d0 ADAP2400A No battery
Address Manufacturer/Model FW Serial 123456789012
---------------------------------------------------------------------------
d0b0t0d0 ST380011 A 3.06 5JVAYH4G -X-XX--X-O--
d0b1t0d0 ST380011 A 3.06 5JVB4AY9 -X-XX--X-O--
d0b2t0d0 ST380011 A 3.06 3JV8XK0N -X-XX--X-O--
d0b3t0d0 ST380011 A 3.06 3JV8VS5K -X-XX--X-O--
Capabilities Map: Column 1 = Soft Reset
Column 2 = Cmd Queuing
Column 3 = Linked Cmds
Column 4 = Synchronous
Column 5 = Wide 16
Column 6 = Wide 32
Column 7 = Relative Addr
Column 8 = SCSI II
Column 9 = S.M.A.R.T.
Column 0 = SCAM
Column 1 = SCSI-3
Column 2 = SAF-TE
X = Capability Exists, - = Capability does not exist, O = Not Supported
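A listing this long makes it easy to miss the one row that matters. As a rough sketch, not anything raidutil itself provides, the status rows can be picked out with a small Python filter. The regular expression and the function name `non_optimal` are my own assumptions, based only on the row layout shown above, where a capacity such as 76319MB is followed by the status text.

```python
import re

# Assumed row layout, taken from the listing above, e.g.:
#   d0b3t0d0 Disk Drive (DASD) ST380011 A 76319MB Failed drive
# Rows from the speed, cache and firmware tables have no "<digits>MB" field
# and are skipped.
ROW = re.compile(r"^(d\d+b\d+t\d+d\d+)\s+(.+?)\s+(\d+)MB\s+(.+)$")


def non_optimal(listing: str) -> list[tuple[str, str]]:
    """Return unique (address, status) pairs whose status is not 'Optimal'."""
    found = []
    for line in listing.splitlines():
        match = ROW.match(line.strip())
        if match and match.group(4) != "Optimal":
            pair = (match.group(1), match.group(4))
            if pair not in found:  # the same drive appears in both views
                found.append(pair)
    return found


sample = """\
d0b0t0d0 Disk Drive (DASD) ST380011 A 76319MB Optimal
d0b3t0d0 Disk Drive (DASD) ST380011 A 76319MB Failed drive
d0b0t0d0 RAID 5 (Redundant ADAPTEC RAID-5 228957MB Degraded
d0b0t0d0 50 MHz 100 MB/sec wide
"""
print(non_optimal(sample))
```

Fed the full listing, this should report only the failed drive and the degraded array.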
The key point is the ‘Failed drive’ status on d0b3t0d0. I happen to have an identical drive here,
kept on hand for just this event.
But before I use it, let us try a rebuild
Let’s try a rebuild first:
raidutil -a rebuild d0 d0b0t0d0
As I type this, the system is now in this state (I will show only a small extract from
the full output):
$ sudo raidutil -L all
Password:
RAIDUTIL Version: 3.04 Date: 9/27/2000 FreeBSD CLI Configuration Utility
Adaptec ENGINE Version: 3.04 Date: 9/27/2000 Adaptec FreeBSD SCSI Engine
# b0 b1 b2 Controller Cache FW NVRAM Serial Status
---------------------------------------------------------------------------
d0 -- -- -- ADAP2400A 16MB 3A0L CHNL 1.1 BF0B111Z0B4Optimal
Physical View
Address Type Manufacturer/Model Capacity Status
---------------------------------------------------------------------------
d0b0t0d0 Disk Drive (DASD) ST380011 A 76319MB Optimal
d0b1t0d0 Disk Drive (DASD) ST380011 A 76319MB Optimal
d0b2t0d0 Disk Drive (DASD) ST380011 A 76319MB Optimal
d0b3t0d0 Disk Drive (DASD) ST380011 A 76319MB Replaced Drive
Logical View
Address Type Manufacturer/Model Capacity Status
---------------------------------------------------------------------------
d0b0t0d0 RAID 5 (Redundant ADAPTEC RAID-5 228957MB Reconstruct 3%
d0b1t0d0 Disk Drive (DASD) ST380011 A 76319MB Optimal
d0b0t0d0 Disk Drive (DASD) ST380011 A 76319MB Optimal
d0b2t0d0 Disk Drive (DASD) ST380011 A 76319MB Optimal
d0b3t0d0 Disk Drive (DASD) ST380011 A 76319MB Replaced Drive
...
As you can see, we are at about 3% on the rebuild. Time to wait and see. It is now 15:03 EST.
12 hours later…
Twelve hours later, NetSaint sent me this email:
***** NetSaint 0.0.7 *****
Notification Type: RECOVERY
Service: RAID
Host: polo
Address: polo.unixathome.org
State: OK
Date/Time: Tue Jul 25 03:49:49 EDT 2006
Additional Info: Optimal
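The notification above carries the array status ("Optimal") as its additional info. As an illustration only, here is roughly how a check could turn the array status string into a plugin result. This is not the check that actually runs on polo. The exit codes follow the Nagios-style plugin convention (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN), and the choice of which status maps to which severity is my assumption.

```python
# Nagios-style plugin return codes (assumed convention, see the text above).
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def classify(array_status: str) -> tuple[int, str]:
    """Map a raidutil status string to (exit code, one-line message)."""
    status = array_status.strip()
    if status == "Optimal":
        return OK, f"OK: {status}"
    if status.startswith("Reconstruct"):
        # e.g. "Reconstruct 3%": redundancy is still being restored
        return WARNING, f"WARNING: {status}"
    if status in ("Degraded", "Failed drive"):
        return CRITICAL, f"CRITICAL: {status}"
    return UNKNOWN, f"UNKNOWN: {status}"


for seen in ("Degraded", "Reconstruct 3%", "Optimal"):
    code, message = classify(seen)
    print(code, message)
```

A real plugin would print the message and finish with `sys.exit(code)` so the monitoring system can pick up the state.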
The rebuild took about 13 hours all up… I’m glad the machine was online during
that time. The machine in question runs the
FreshPorts BETA site and is my main
development server.
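For the curious, the numbers in the listings hang together. The quick check below assumes the rebuild rewrites one full drive's worth of data and uses the rounded figure of 13 hours, so treat the rate as a ballpark, not a measurement.

```python
DRIVE_MB = 76319     # per-drive capacity from the Physical View
DRIVES = 4           # d0b0t0d0 through d0b3t0d0
ARRAY_MB = 228957    # RAID-5 capacity from the Logical View
REBUILD_HOURS = 13   # "about 13 hours", rounded

# RAID-5 gives up one drive's worth of space to parity.
usable_mb = (DRIVES - 1) * DRIVE_MB
print(usable_mb == ARRAY_MB)  # True

# Assumption: the rebuild rewrites the whole replaced drive.
rate_mb_per_s = DRIVE_MB / (REBUILD_HOURS * 3600)
print(f"{rate_mb_per_s:.2f} MB/s")  # 1.63 MB/s
```

That rate is far below what the drives can sustain, which fits with the array staying in service during the rebuild.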
What caused the problem?
I don’t know what caused the problem. I know some of the symptoms.
- The box did not respond to pings
- telnet to port 22 gave a standard SSH banner
- attempts to ssh were unsuccessful with no login prompt being provided
- Console was sluggish
- When pressing ALT-F3 to go to another vtty, nothing happened. When I returned
to the console some minutes later, I noticed I was now on the other vtty (sluggish).
- Attempts to login via that tty showed no response. It may have just been sluggish
- There are no entries in the log
After about 20 or 30 minutes of trying to get the system going, I rebooted it. I knew
a reboot could degrade the RAID array, and I wanted to avoid that, but I saw no other option.
It was suggested that one drive may have been experiencing an error. Hard drives try to
recover from errors on their own and can take a long time doing so. The RAID card sees this
and just waits; no I/O occurs during this time. Western Digital has drives which
are designed for RAID and feature TLER (Time Limited Error Recovery). Such features
have been available on SCSI drives for quite some time. For what it’s worth, the
drives I’m planning to buy for the Dual Opteron server will have TLER.
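To make the idea concrete, here is a toy model of the stall described above. It is not how any particular drive or controller is implemented. The function name `io_stall_seconds` and both numbers (a 120 second recovery attempt and a 7 second limit) are made up for illustration.

```python
from typing import Optional


def io_stall_seconds(recovery_needed: float,
                     recovery_limit: Optional[float] = None) -> float:
    """How long the array waits on one drive while it retries a bad sector.

    recovery_needed: time the drive would spend if left to retry freely.
    recovery_limit:  cap on that time (the TLER idea), or None for no cap.
    """
    if recovery_limit is None:
        return recovery_needed
    return min(recovery_needed, recovery_limit)


# Hypothetical numbers, for illustration only.
print(io_stall_seconds(120))     # no limit: the array waits the full 120 s
print(io_stall_seconds(120, 7))  # with a limit: the drive gives up after 7 s
```

With a limit in place the drive reports the error quickly, and the controller can fall back on the redundancy in the array rather than sitting idle.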
Ideas? Suggestions? Comments? Please use the comments link to the right.
FWIW, we use mostly consumer-grade drives at work. When we have one drop out of an array, we put it through the manufacturer’s diagnostics and, if everything checks out, label it "for desktop use only" and stick it in our small pile of spare drives on the assumption that the drive just went into deep-recovery mode on one bad sector and is otherwise in reasonable shape for non-RAID use.