Jul 25 2006
RAID-5 drive failure
Back in December 2004, I wrote about implementing hardware RAID-5. Yesterday, one of the drives in that cluster failed. Here is the output which shows a failed drive:

    $ raidutil -L all
    RAIDUTIL  Version: 3.04  Date: 9/27/2000  FreeBSD CLI Configuration Utility
    Adaptec ENGINE  Version: 3.04  Date: 9/27/2000  Adaptec FreeBSD SCSI Engine

    #  b0 b1 b2  Controller     Cache  FW    NVRAM     Serial     Status
    ---------------------------------------------------------------------------
    d0 -- -- --  ADAP2400A      16MB   3A0L  CHNL 1.1  BF0B111Z0B4Optimal

    Physical View
    Address    Type              Manufacturer/Model         Capacity  Status
    ---------------------------------------------------------------------------
    d0b0t0d0   Disk Drive (DASD) ST380011 A                 76319MB   Optimal
    d0b1t0d0   Disk Drive (DASD) ST380011 A                 76319MB   Optimal
    d0b2t0d0   Disk Drive (DASD) ST380011 A                 76319MB   Optimal
    d0b3t0d0   Disk Drive (DASD) ST380011 A                 76319MB   Failed drive

    Logical View
    Address    Type              Manufacturer/Model         Capacity  Status
    ---------------------------------------------------------------------------
    d0b0t0d0   RAID 5 (Redundant ADAPTEC RAID-5             228957MB  Degraded
    d0b1t0d0   Disk Drive (DASD) ST380011 A                 76319MB   Optimal
    d0b0t0d0   Disk Drive (DASD) ST380011 A                 76319MB   Optimal
    d0b2t0d0   Disk Drive (DASD) ST380011 A                 76319MB   Optimal
    d0b3t0d0   Disk Drive (DASD) ST380011 A                 76319MB   Failed drive

    Address    Max Speed  Actual Rate / Width
    ---------------------------------------------------------------------------
    d0b0t0d0   50 MHz     100 MB/sec    wide
    d0b1t0d0   50 MHz     100 MB/sec    wide
    d0b2t0d0   50 MHz     100 MB/sec    wide
    d0b3t0d0   10 MHz     100 MB/sec    wide

    Address    Manufacturer/Model        Write Cache Mode (HBA/Device)
    ---------------------------------------------------------------------------
    d0b0t0d0   ADAPTEC RAID-5            Write Back / --
    d0b1t0d0   ST380011 A                -- / Write Back
    d0b0t0d0   ST380011 A                -- / Write Back
    d0b2t0d0   ST380011 A                -- / Write Back
    d0b3t0d0   ST380011 A                -- / Write Back

    #  Controller     Cache  FW    NVRAM     BIOS   SMOR      Serial
    ---------------------------------------------------------------------------
    d0 ADAP2400A      16MB   3A0L  CHNL 1.1  1.62   1.12/79I  BF0B111Z0B4

    #  Controller     Status      Voltage  Current  Full Cap  Rem Cap  Rem Time
    ---------------------------------------------------------------------------
    d0 ADAP2400A      No battery

    Address    Manufacturer/Model        FW    Serial    123456789012
    ---------------------------------------------------------------------------
    d0b0t0d0   ST380011 A                3.06  5JVAYH4G  -X-XX--X-O--
    d0b1t0d0   ST380011 A                3.06  5JVB4AY9  -X-XX--X-O--
    d0b2t0d0   ST380011 A                3.06  3JV8XK0N  -X-XX--X-O--
    d0b3t0d0   ST380011 A                3.06  3JV8VS5K  -X-XX--X-O--

    Capabilities Map:
      Column 1 = Soft Reset     Column 2 = Cmd Queuing  Column 3 = Linked Cmds
      Column 4 = Synchronous    Column 5 = Wide 16      Column 6 = Wide 32
      Column 7 = Relative Addr  Column 8 = SCSI II      Column 9 = S.M.A.R.T.
      Column 0 = SCAM           Column 1 = SCSI-3       Column 2 = SAF-TE
      X = Capability Exists, - = Capability does not exist, O = Not Supported

The key point is the ‘Failed drive’. I happen to have an identical drive here, just sitting around, for just this event.
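That listing is long, and only two things in it really matter: the state of the array and which drive failed. Here is a minimal sketch of pulling those out with awk. The function names are made up for illustration, and the patterns ("RAID 5 (Redundant", "Failed drive") are assumptions taken from the sample output in this post, so other raidutil versions or array types may need different ones.

```shell
#!/bin/sh
# Sketch only: summarise `raidutil -L all` output read from stdin.
# Function names are hypothetical; patterns are assumed from the sample
# output shown in this post.

# Print the status text of the RAID-5 logical array (e.g. "Degraded").
# Everything up to and including the capacity field ("228957MB") is dropped.
raid_array_status() {
    awk '/RAID 5 \(Redundant/ {
        sub(/.*[0-9]MB[ \t]*/, "")   # strip address, type, model, capacity
        sub(/[ \t]*$/, "")           # strip trailing whitespace
        print
    }'
}

# Print the address of every drive marked "Failed drive", once per drive
# (the same drive appears in both the physical and the logical view).
raid_failed_drives() {
    awk '/Failed drive/ { print $1 }' | sort -u
}

# Usage (not run here):
#   raidutil -L all | raid_array_status
#   raidutil -L all | raid_failed_drives
```

Both functions read stdin rather than calling raidutil themselves, so they can be tried against a saved copy of the output.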
But before I do that, let us try a rebuild
Let’s try a rebuild first:

    raidutil -a rebuild d0 d0b0t0d0

As I type this, the system is now in this state (I will show only a small extract from the full output):
    $ sudo raidutil -L all
    Password:
    RAIDUTIL  Version: 3.04  Date: 9/27/2000  FreeBSD CLI Configuration Utility
    Adaptec ENGINE  Version: 3.04  Date: 9/27/2000  Adaptec FreeBSD SCSI Engine

    #  b0 b1 b2  Controller     Cache  FW    NVRAM     Serial     Status
    ---------------------------------------------------------------------------
    d0 -- -- --  ADAP2400A      16MB   3A0L  CHNL 1.1  BF0B111Z0B4Optimal

    Physical View
    Address    Type              Manufacturer/Model         Capacity  Status
    ---------------------------------------------------------------------------
    d0b0t0d0   Disk Drive (DASD) ST380011 A                 76319MB   Optimal
    d0b1t0d0   Disk Drive (DASD) ST380011 A                 76319MB   Optimal
    d0b2t0d0   Disk Drive (DASD) ST380011 A                 76319MB   Optimal
    d0b3t0d0   Disk Drive (DASD) ST380011 A                 76319MB   Replaced Drive

    Logical View
    Address    Type              Manufacturer/Model         Capacity  Status
    ---------------------------------------------------------------------------
    d0b0t0d0   RAID 5 (Redundant ADAPTEC RAID-5             228957MB  Reconstruct 3%
    d0b1t0d0   Disk Drive (DASD) ST380011 A                 76319MB   Optimal
    d0b0t0d0   Disk Drive (DASD) ST380011 A                 76319MB   Optimal
    d0b2t0d0   Disk Drive (DASD) ST380011 A                 76319MB   Optimal
    d0b3t0d0   Disk Drive (DASD) ST380011 A                 76319MB   Replaced Drive
    ...

As you can see, we are at about 3% on the rebuild. Time to wait and see. It is now 15:03 EST.
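Rather than rerunning the listing by hand, the "Reconstruct N%" figure could be polled. The following is a rough sketch, not something I ran during this rebuild: `rebuild_percent` and `watch_rebuild` are hypothetical names, the "Reconstruct" pattern is assumed from the output above, and the ten-minute default interval is arbitrary.

```shell
#!/bin/sh
# Sketch only: track rebuild progress from `raidutil -L all` output.
# Function names and the "Reconstruct N%" pattern are assumptions based on
# the output shown in this post.

# Read raidutil output on stdin; print the rebuild percentage (digits only),
# or nothing if no array is reconstructing.
rebuild_percent() {
    sed -n 's/.*Reconstruct[[:space:]]*\([0-9][0-9]*\)%.*/\1/p' | head -n 1
}

# Poll until no rebuild is reported. The interval is in seconds.
# raidutil may need root, as the sudo in the listing above suggests.
watch_rebuild() {
    interval="${1:-600}"
    while pct="$(raidutil -L all | rebuild_percent)" && [ -n "$pct" ]; do
        echo "$(date '+%H:%M') rebuild at ${pct}%"
        sleep "$interval"
    done
    echo "$(date '+%H:%M') no rebuild in progress"
}

# Usage (not run here):
#   watch_rebuild 300
```

The loop only reports that no rebuild is in progress. It does not tell a finished rebuild from a failed one, so the array status still needs a look at the end.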
12 hours later…
Twelve hours later, NetSaint sent me this email:
    ***** NetSaint 0.0.7 *****

    Notification Type: RECOVERY

    Service: RAID
    Host: polo
    Address: polo.unixathome.org
    State: OK

    Date/Time: Tue Jul 25 03:49:49 EDT 2006

    Additional Info:

    Optimal

The rebuild took about 13 hours all up… I’m glad the machine was online during that time. The machine in question runs the FreshPorts BETA site and is my main development server.
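For anyone wanting a similar notification, a check along these lines would produce the "Optimal" text seen in that email. This is a sketch and not the plugin actually running on polo. The function name is hypothetical, the status pattern is assumed from the raidutil output in this post, and the exit codes follow the usual plugin convention of 0 = OK, 1 = WARNING, 2 = CRITICAL. The UNKNOWN value of 3 is the Nagios-style convention and should be checked against your NetSaint version.

```shell
#!/bin/sh
# Sketch only: a NetSaint-style RAID check fed by `raidutil -L all`.
# Hypothetical function; pattern and severity mapping are assumptions.

# Read raidutil output on stdin, print the array status, and return a
# plugin-style exit code.
check_raid() {
    status="$(awk '/RAID 5 \(Redundant/ {
        sub(/.*[0-9]MB[ \t]*/, "")   # drop everything up to the capacity
        sub(/[ \t]*$/, "")           # drop trailing whitespace
        print
        exit                         # first matching array only
    }')"
    case "$status" in
        Optimal)      echo "$status"; return 0 ;;   # OK
        Reconstruct*) echo "$status"; return 1 ;;   # WARNING: rebuilding
        "")           echo "no RAID status found"; return 3 ;;  # UNKNOWN (assumed code)
        *)            echo "$status"; return 2 ;;   # CRITICAL: e.g. Degraded
    esac
}

# Usage (not run here), as the body of a plugin script:
#   raidutil -L all | check_raid
#   exit $?
```

Treating a rebuild as a warning rather than a critical state is a judgment call. The array still has no redundancy until the rebuild finishes.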
What caused the problem?
I don’t know what caused the problem. I know some of the symptoms.
- The box did not respond to pings
- telnet to port 22 gave a standard SSH banner
- attempts to ssh were unsuccessful with no login prompt being provided
- Console was sluggish
- When pressing ALT-F3 to go to another vtty, nothing happened. When I returned to the console some minutes later, I noticed I was now on the other vtty (sluggish).
- Attempts to log in via that tty showed no response. It may have just been sluggish.
- There are no entries in the log
FWIW, we use mostly consumer-grade drives at work. When we have one drop out of an array, we put it through the manufacturer’s diagnostics and, if everything checks out, label it "for desktop use only" and stick it in our small pile of spare drives on the assumption that the drive just went into deep-recovery mode on one bad sector and is otherwise in reasonable shape for non-RAID use.