Aug 18, 2006
NetSaint plugin for 3Ware RAID card
I recently set up a RAID10 array on a 9550SX-8LP controller donated by 3Ware. RAID is not a panacea. It is not a solve-all-your-problems, I-can-ignore-the-machine, backups-are-no-longer-required solution. You must monitor your RAID array[s] just like any other service on your machine. I’ve been using NetSaint since 2001. Development of NetSaint is being continued under a new name – Nagios. I have no reason to move to Nagios, so I continue with NetSaint. A previous article shows how I created a plug-in for another RAID card. I’ll be using a similar approach for this plug-in. I have previously written about the 3Ware CLI and will be using the CLI as the foundation for this plug-in.
The plug-in components
I am assuming you already have NetSaint installed, configured, and operational. I documented my installation and that will help you get started. During the construction of this plug-in, we will deal with three main components of the NetSaint system. I have described the changes required in [brackets].
- netsaint_statd – which provides remote monitoring of hosts [patch this code so it knows about the 3Ware script]
- commands.cfg – specifies what services should be monitored by NetSaint [add the RAID service]
- a new script, for use by netsaint_statd, which pulls data from the 3Ware CLI [write the script]
Here is what the controller currently reports:

    # tw_cli info c0

    Unit  UnitType  Status         %Cmpl  Stripe  Size(GB)  Cache  AVerify  IgnECC
    ------------------------------------------------------------------------------
    u0    SPARE     OK             -      -       69.2404   -      OFF      -
    u1    SPARE     OK             -      -       69.2404   -      OFF      -
    u2    RAID-10   OK             -      64K     195.548   ON     OFF      OFF

    Port   Status           Unit   Size        Blocks        Serial
    ---------------------------------------------------------------
    p0     OK               u2     69.25 GB    145226112     WD-WMAKE23790
    p1     OK               u2     69.25 GB    145226112     WD-WMAKE23790
    p2     OK               u2     69.25 GB    145226112     WD-WMAKE23943
    p3     OK               u2     69.25 GB    145226112     WD-WMAKE23790
    p4     OK               u2     69.25 GB    145226112     WD-WMAKE23790
    p5     OK               u2     69.25 GB    145226112     WD-WMAKE23792
    p6     OK               u0     69.25 GB    145226112     WD-WMAKE23790
    p7     OK               u1     69.25 GB    145226112     WD-WMAKE23786

    Name  OnlineState  BBUReady  Status    Volt     Temp     Hours  LastCapTest
    ---------------------------------------------------------------------------
    bbu   On           Yes       OK        OK       OK       0      xx-xxx-xxxx

What I’d like to do is monitor the status of those three units. We could get all fancy and create a generic plug-in that would work with any number of units, ports, and drives. But I’m not going to do that here. I’m going to create a script that will monitor u0, u1, and u2.
Getting the information we need
There is a command that will produce very concise status output:

    # tw_cli info c0 u0 status
    /c0/u0 status = OK
    #

All we need to do is capture the fourth field in this output. That can be done easily with awk:

    # tw_cli info c0 u0 status | awk '{print $4}'
    OK
    #

The above outputs the bare minimum. Perhaps we want more. Such as this:

    # tw_cli info c0 u0

    Unit     UnitType  Status         %Cmpl  Port  Stripe  Size(GB)  Blocks
    -----------------------------------------------------------------------
    u0       SPARE     OK             -      p6    -       69.2404   145207680
    #

A slightly different approach will give better results. Consider this output:

    # tw_cli info c0 unitstatus

    Unit  UnitType  Status         %Cmpl  Stripe  Size(GB)  Cache  AVerify  IgnECC
    ------------------------------------------------------------------------------
    u0    SPARE     OK             -      -       69.2404   -      OFF      -
    u1    SPARE     OK             -      -       69.2404   -      OFF      -
    u2    RAID-10   OK             -      64K     195.548   ON     OFF      OFF

With this command, we have all the information we need for all units. This is a better approach.
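As a rough sketch of how the unitstatus listing could be reduced to just the fields of interest, the shell function below keeps the unit name, type, and status of each unit row. The function name unit_tuples and the (unit,type,status) output shape are chosen here for illustration; they are not part of tw_cli.

```shell
#!/bin/sh
# Sketch: read `tw_cli info c0 unitstatus` output on stdin and emit one
# "(unit,type,status)" tuple per unit, all on a single line.
# unit_tuples is a hypothetical helper name, not a tw_cli feature.
unit_tuples() {
    # Only rows whose first field looks like u0, u1, ... are unit rows;
    # the header, the dashed rule, and blank lines are skipped.
    awk '$1 ~ /^u[0-9]+$/ { printf "(%s,%s,%s)", $1, $2, $3 }
         END { print "" }'
}
```

On the RAID server this would be used as `tw_cli info c0 unitstatus | unit_tuples`. Relying on awk's whitespace field splitting keeps the parsing independent of the exact column widths.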
What about drive removal?
The above examples are from the normal situation. What happens if we remove a drive? Here is the output if I remove drive 6 (u0 - SPARE):

    # tw_cli info c0 unitstatus

    Unit  UnitType  Status         %Cmpl  Stripe  Size(GB)  Cache  AVerify  IgnECC
    ------------------------------------------------------------------------------
    u1    SPARE     OK             -      -       69.2404   -      OFF      -
    u2    RAID-10   OK             -      64K     195.548   ON     OFF      OFF

Unit u0 is gone. If we query the controller, we get more information:

    # tw_cli info c0

    Unit  UnitType  Status         %Cmpl  Stripe  Size(GB)  Cache  AVerify  IgnECC
    ------------------------------------------------------------------------------
    u1    SPARE     OK             -      -       69.2404   -      OFF      -
    u2    RAID-10   OK             -      64K     195.548   ON     OFF      OFF

    Port   Status           Unit   Size        Blocks        Serial
    ---------------------------------------------------------------
    p0     OK               u2     69.25 GB    145226112     WD-WMAKE23790
    p1     OK               u2     69.25 GB    145226112     WD-WMAKE23790
    p2     OK               u2     69.25 GB    145226112     WD-WMAKE23943
    p3     OK               u2     69.25 GB    145226112     WD-WMAKE23790
    p4     OK               u2     69.25 GB    145226112     WD-WMAKE23790
    p5     OK               u2     69.25 GB    145226112     WD-WMAKE23792
    p6     DRIVE-REMOVED    -      -           -             -
    p7     OK               u1     69.25 GB    145226112     WD-WMAKE23786

    Name  OnlineState  BBUReady  Status    Volt     Temp     Hours  LastCapTest
    ---------------------------------------------------------------------------
    bbu   On           Yes       OK        OK       OK       0      xx-xxx-xxxx

We can see that drive 6 has been removed. We must code our script accordingly. If I replace the drive, the status returns to normal. In addition, I found these entries in /var/log/messages:

    twa0: WARNING: (0x04: 0x0019): Drive removed: port=6
    twa0: INFO: (0x04: 0x001A): Drive inserted: port=6

That’s OK for a spare. What about a drive from the RAID array? Let’s try drive 4.

    # tw_cli info c0 unitstatus

    Unit  UnitType  Status         %Cmpl  Stripe  Size(GB)  Cache  AVerify  IgnECC
    ------------------------------------------------------------------------------
    u0    SPARE     OK             -      -       69.2404   -      OFF      -
    u1    SPARE     OK             -      -       69.2404   -      OFF      -
    u2    RAID-10   DEGRADED       -      64K     195.548   OFF    OFF      OFF

Good. That’s what one would expect. Further inquiries show that one of the hot spares has been taken into production:

    # tw_cli info c0

    Unit  UnitType  Status         %Cmpl  Stripe  Size(GB)  Cache  AVerify  IgnECC
    ------------------------------------------------------------------------------
    u1    SPARE     OK             -      -       69.2404   -      OFF      -
    u2    RAID-10   REBUILD-PAUSED 66     64K     195.548   OFF    OFF      OFF

    Port   Status           Unit   Size        Blocks        Serial
    ---------------------------------------------------------------
    p0     OK               u2     69.25 GB    145226112     WD-WMAKE23790
    p1     OK               u2     69.25 GB    145226112     WD-WMAKE23790
    p2     OK               u2     69.25 GB    145226112     WD-WMAKE23943
    p3     OK               u2     69.25 GB    145226112     WD-WMAKE23790
    p4     DRIVE-REMOVED    -      -           -             -
    p5     OK               u2     69.25 GB    145226112     WD-WMAKE23792
    p6     DEGRADED         u2     69.25 GB    145226112     WD-WMAKE23790
    p7     OK               u1     69.25 GB    145226112     WD-WMAKE23786

    Name  OnlineState  BBUReady  Status    Volt     Temp     Hours  LastCapTest
    ---------------------------------------------------------------------------
    bbu   On           Yes       OK        OK       OK       0      xx-xxx-xxxx

Checking a few minutes later, you can see that it’s rebuilding:

    # tw_cli info c0 unitstatus

    Unit  UnitType  Status         %Cmpl  Stripe  Size(GB)  Cache  AVerify  IgnECC
    ------------------------------------------------------------------------------
    u1    SPARE     OK             -      -       69.2404   -      OFF      -
    u2    RAID-10   REBUILDING     69     64K     195.548   OFF    OFF      OFF

I’ll just go plug drive 4 back in…. Now we see this in /var/log/messages:

    Aug 14 11:50:23 opti kernel: twa0: WARNING: (0x04: 0x0019): Drive removed: port=4
    Aug 14 11:50:23 opti kernel: twa0: ERROR: (0x04: 0x0002): Degraded unit: unit=2, port=4
    Aug 14 11:53:01 opti kernel: twa0: INFO: (0x04: 0x000B): Rebuild started: unit=2
    Aug 14 11:56:11 opti kernel: twa0: INFO: (0x04: 0x001A): Drive inserted: port=4

After the unit has finished rebuilding, the drive status looked like this:

    # tw_cli info c0 drivestatus

    Port   Status           Unit   Size        Blocks        Serial
    ---------------------------------------------------------------
    p0     OK               u2     69.25 GB    145226112     WD-WMAKE23790
    p1     OK               u2     69.25 GB    145226112     WD-WMAKE23790
    p2     OK               u2     69.25 GB    145226112     WD-WMAKE23943
    p3     OK               u2     69.25 GB    145226112     WD-WMAKE23790
    p4     OK               u?     69.25 GB    145226112     WD-WMAKE23790
    p5     OK               u2     69.25 GB    145226112     WD-WMAKE23792
    p6     OK               u2     69.25 GB    145226112     WD-WMAKE23790
    p7     OK               u1     69.25 GB    145226112     WD-WMAKE23786

Note that drive 4 is part of unit u?. I issued a rescan and then saw this:

    # tw_cli info c0 drivestatus

    Port   Status           Unit   Size        Blocks        Serial
    ---------------------------------------------------------------
    p0     OK               u2     69.25 GB    145226112     WD-WMAKE23790
    p1     OK               u2     69.25 GB    145226112     WD-WMAKE23790
    p2     OK               u2     69.25 GB    145226112     WD-WMAKE23943
    p3     OK               u2     69.25 GB    145226112     WD-WMAKE23790
    p4     OK               u0     69.25 GB    145226112     WD-WMAKE23790
    p5     OK               u2     69.25 GB    145226112     WD-WMAKE23792
    p6     OK               u2     69.25 GB    145226112     WD-WMAKE23790
    p7     OK               u1     69.25 GB    145226112     WD-WMAKE23786

Good, now it’s back on u0. Let’s look at the unit status:

    # tw_cli info c0 unitstatus

    Unit  UnitType  Status         %Cmpl  Stripe  Size(GB)  Cache  AVerify  IgnECC
    ------------------------------------------------------------------------------
    u0    RAID-10   INOPERABLE     -      64K     195.548   OFF    OFF      OFF
    u1    SPARE     OK             -      -       69.2404   -      OFF      -
    u2    RAID-10   OK             -      64K     195.548   ON     OFF      OFF

OK, well, that’s not ideal, but it does reflect what was going on. What we need to do is delete that unit and re-add it as a hot spare:

    # tw_cli
    //opti> /c0/u0 delete
    Error: (CLI:038) Invalid unit command.

    //opti> /c0/u0 del
    Deleting /c0/u0 will cause the data on the unit permanently loss.
    Do you want to continue ? Y|N [N]: y
    Deleting unit c0/u0 ...Done.

    //opti> /c0 add type=spare disk=4
    Creating new unit on controller /c0 ... Failed.
    Error: (API:0012) Disk is member of un-exported unit.

    //opti> /c0/p4 export
    Exporting /c0/p4 will take the disk offline.
    Do you want to continue ? Y|N [N]: yes
    Exporting port /c0/p4 ... Done.

    //opti>

Ummm, what? This was a problem. For several hours. Eventually, I upgraded the firmware. Then things started working as expected:

    # //opti/c0> /c0 add type=spare disk=3
    Creating new unit on controller /c0 ... Done. The new unit is /c0/u1.

    //opti/c0> /c0 add type=spare disk=6
    Creating new unit on controller /c0 ... Done. The new unit is /c0/u2.

    //opti/c0>

You will note that in the above I used disks 3 and 6, and not 4 as in a previous example. This is because I did several disk removals and adds during the diagnosis of this problem. The full text of everything I did is available here (41KB). It is interesting to note that as hot spares were taken up, units were renumbered. The RAID-10 array started as u2. It is now u0.
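One consequence of the outputs above is that a pulled spare does not report a bad status; its unit simply disappears from the listing, while the port shows DRIVE-REMOVED. A minimal sketch of checking for both conditions follows. The helper names missing_units and removed_ports are invented for this example, and the list of expected units is supplied by the caller.

```shell
#!/bin/sh
# Sketch: list expected units that are absent from unitstatus output (stdin).
# Expected unit names are given as arguments, e.g. u0 u1 u2.
# missing_units is a hypothetical helper name.
missing_units() {
    awk -v want="$*" '
        $1 ~ /^u[0-9]+$/ { seen[$1] = 1 }      # remember every unit row
        END {
            n = split(want, w, " ")
            for (i = 1; i <= n; i++)
                if (!(w[i] in seen))
                    printf "%s%s", (out++ ? " " : ""), w[i]
            print ""
        }'
}

# Sketch: list ports reported as DRIVE-REMOVED in `tw_cli info c0` or
# `tw_cli info c0 drivestatus` output (stdin). Hypothetical helper name.
removed_ports() {
    awk '$1 ~ /^p[0-9]+$/ && $2 == "DRIVE-REMOVED" { print $1 }'
}
```

For example, `tw_cli info c0 unitstatus | missing_units u0 u1 u2` prints the units that have vanished, and an empty line when all three are present. Because units are renumbered as spares are taken up, the expected list is only meaningful for the current layout.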
How to process the data
What use is it having the information if you don’t know what to do with it? I wasn’t sure how to use all this data. I looked at how the disk checking routine handled it. If you look at the output, it’s pretty simple:

    sub raid3ware {
        my $controller = shift;
        my $unitlisting;

        my $command = "$commandlist{$os}{raid3ware}";
        $command =~ s/XXX/$controller/g;

        open(PROCOUT, "$command |") || die;
        $_ = <PROCOUT>;    # discard the first line of output
        while($_ = <PROCOUT>) {
            if (/^(u\S+)\s+(\S+)\s*(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s*/) {
                $unitlisting .= '(' . $1 . ',' . $2 . ',' . $3 .')';
            }
        }

        if (defined($unitlisting)) {
            print Client $unitlisting;
        } else {
            print Client "no units?";
        }

        $unitlisting = undef;
        close(PROCOUT);
    }

That fancy regex, which I spent considerable time playing with, can probably be replaced with a call to split(). If you want the patch for the above, it is based upon netsaint_statd_v2.15 and is available here. This function is designed to sit on the RAID server (the one with the 3Ware card). It will be invoked by the netsaint_statd daemon. The script accepts one parameter: the controller id (usually c0 if you have just one controller). In the next section, I’ll show you how I pass that parameter from NetSaint to the script on the server. I wrote the above function while sitting in a Second Cup cafe, waiting for a couple of women to finish their massages next door at The Spa. What’s unusual about that? Nothing in particular, except that I had no access to the server from that location (at least not without paying for WiFi, which I was not going to do). So I faked the call to tw_cli by creating this Perl script:

    $ cat /home/dan/src/netsaint_statd/test.pl
    #!/usr/bin/perl

    $ans = "
    Unit  UnitType  Status         %Cmpl  Stripe  Size(GB)  Cache  AVerify  IgnECC
    ------------------------------------------------------------------------------
    u0    SPARE     OK             -      -       69.2404   -      OFF      -
    u1    SPARE     OK             -      -       69.2404   -      OFF      -
    u2    RAID-10   OK             -      64K     195.548   ON     OFF      OFF
    ";

    print $ans;

With this handy little script, I was able to develop the NetSaint side of the script quickly and easily. I was also able to modify the Status fields without actually having to affect the server. Useful! I will show you how to modify netsaint_statd later. Next, how do we modify NetSaint?
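The same trick of faking the controller output works from a shell prompt too. The sketch below is a shell stand-in (not the Perl script used here) that prints a canned unitstatus listing. The function name fake_unitstatus and its optional argument, which overrides the status shown for u2, are assumptions made for this example.

```shell
#!/bin/sh
# Sketch: canned replacement for `tw_cli info c0 unitstatus`, for developing
# the parsing side without access to the controller.
# fake_unitstatus and its argument are hypothetical, for illustration only.
fake_unitstatus() {
    status=${1:-OK}    # status to report for u2; defaults to OK
    cat <<EOF

Unit  UnitType  Status         %Cmpl  Stripe  Size(GB)  Cache  AVerify  IgnECC
------------------------------------------------------------------------------
u0    SPARE     OK             -      -       69.2404   -      OFF      -
u1    SPARE     OK             -      -       69.2404   -      OFF      -
u2    RAID-10   $status        -      64K     195.548   ON     OFF      OFF

EOF
}
```

Running `fake_unitstatus DEGRADED` or `fake_unitstatus REBUILDING` makes it possible to exercise each status branch without pulling a drive.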
Getting the data from the server into netsaint_statd
netsaint_statd is the daemon that can be installed on remote systems. NetSaint talks to this daemon to extract information from the remote system. In this context, remote can mean on the same LAN/WAN, etc. We need a script that will query the server and grab the 3Ware information from it. This is it:

    #!/usr/bin/perl
    #
    # See LICENSE for copyright information
    #
    # check_3wareraid.pl <host>
    #
    # NetSaint host script to get the 3ware RAID status from a client that is running
    # netsaint_statd.
    #

    require 5.003;
    BEGIN { $ENV{PATH} = '/bin' }
    use Socket;
    use POSIX;

    sub usage;

    my $TIMEOUT = 15;

    my %ERRORS = ('UNKNOWN', '-1',
                  'OK', '0',
                  'WARNING', '1',
                  'CRITICAL', '2');

    my $remote     = shift || &usage(%ERRORS);
    my $controller = shift || &usage(%ERRORS);
    my $unitarg    = shift || &usage(%ERRORS);
    my $port       = shift || 1040;

    my $remoteaddr = inet_aton("$remote");
    my $paddr = sockaddr_in($port, $remoteaddr) || die "Can't create info for connection: $!\n";
    my $proto = getprotobyname('tcp');
    socket(Server, PF_INET, SOCK_STREAM, $proto) || die "Can't create socket: $!";
    setsockopt(Server, SOL_SOCKET, SO_REUSEADDR, 1);
    connect(Server, $paddr) || die "Can't connect to server: $!";

    my $state = "OK";
    my $answer = undef;

    # Just in case of problems, let's not hang NetSaint
    $SIG{'ALRM'} = sub {
        close(Server);
        select(STDOUT);
        print "No Answer from Client\n";
        exit $ERRORS{"UNKNOWN"};
    };
    alarm($TIMEOUT);

    #print "invoking Server with:raid3wareunits $controller\n";

    select(Server);
    $| = 1;
    print Server "raid3wareunits $controller\n";
    my ($servanswer) = <Server>;
    alarm(0);
    close(Server);
    select(STDOUT);

    chomp($servanswer);
    #print "REPLY: '$servanswer'\n";

    # The reply looks like (u0,SPARE,OK)(u1,SPARE,OK)...; strip the opening
    # parentheses and split on the closing ones
    $servanswer =~ s/\(//g;
    my @servanswer = split(/\)/,$servanswer);

    # Unless the requested unit is found below, report it as CRITICAL
    $answer = 'not found';
    $state = 'CRITICAL';
    foreach $line (@servanswer) {
        my ($unit, $name, $status) = split(/,/, $line);
        if ($unit eq $unitarg) {
            if ($status =~ m%^REBUILDING%) {
                $state = "WARNING";
                $answer = $status;
            } else {
                if ($status =~ m%^DEGRADED%) {
                    $state = "CRITICAL";
                    $answer = $status;
                } else {
                    if ($status =~ m%^OK%) {
                        $state = "OK";
                        $answer = $status;
                    } else {
                        $answer = $status;
                        $state = "CRITICAL";
                    }
                }
            }
        }
    }

    print $answer;
    exit $ERRORS{$state};

    sub usage {
        print "Minimum arguments not supplied!\n";
        print "\n";
        print "Perl Check Users plugin for NetSaint\n";
        print "Copyright (c) 1999 Charlie Cook & Nick Reinking\n";
        print "Copyright (c) 2006 Dan Langille\n";
        print "\n";
        print "Usage: $0 <host> <controller> <unit>\n";
        print "\n";
        exit $ERRORS{"UNKNOWN"};
    }

Look for this line:

    print Server "raid3ware $controller\n";

That is the line that tells netsaint_statd to invoke the raid3ware command and pass it the $controller parameter. This script is based upon one I found included with the NetSaint plug-ins. I used it as a base and went from there. I can’t recall which script I started with, but they all have a very similar structure. I placed this script at /usr/local/libexec/netsaint/netsaint_statd/ on my NetSaint server.
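The decision the check script makes boils down to a small mapping from a unit's status to a NetSaint state. Here is that mapping as a rough sketch in shell rather than Perl, using the exit codes from the script's %ERRORS table (0 for OK, 1 for WARNING, 2 for CRITICAL). The function name unit_state is made up for this example.

```shell
#!/bin/sh
# Sketch: given the tuple string returned by the daemon and a unit name,
# print the unit's status and return the matching NetSaint exit code.
# unit_state is a hypothetical name; the mapping mirrors the Perl script:
#   OK -> 0, REBUILDING -> 1 (WARNING), anything else or not found -> 2.
unit_state() {
    # $1 = "(u0,SPARE,OK)(u1,SPARE,OK)(u2,RAID-10,OK)", $2 = unit, e.g. u2
    status=$(printf '%s\n' "$1" | tr ')' '\n' | tr -d '(' |
             awk -F, -v unit="$2" '$1 == unit { print $3; exit }')
    case "$status" in
        OK*)         echo "$status";   return 0 ;;
        REBUILDING*) echo "$status";   return 1 ;;
        "")          echo "not found"; return 2 ;;
        *)           echo "$status";   return 2 ;;   # DEGRADED, INOPERABLE, ...
    esac
}
```

Treating every unrecognised status as CRITICAL, as the Perl script does, means a state nobody anticipated (REBUILD-PAUSED, for instance) raises an alert instead of passing silently.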
Configuring NetSaint server to use the plug-in
In this section, I’ll show you how I modified my NetSaint server installation to add monitoring support for the 3Ware plug-in. The files to be modified are:
- /usr/local/etc/netsaint/commands.cfg – add the new commands
- /usr/local/etc/netsaint/hosts.cfg – add the new host and services to be monitored
The new entry in commands.cfg:

    command[check_raid3ware.pl]=$USER1$/netsaint_statd/check_3wareraidunits.pl \
        $HOSTADDRESS$ $ARG1$ $ARG2$ $ARG3$ $ARG4$

In hosts.cfg, here are the entries that relate to monitoring the 3Ware RAID on the dual Opteron server:

    service[opti]=RAID spare 1;0;24x7;3;2;1;raid-admins;120;24x7;1;1;1;;check_raid3ware.pl! c0 u1
    service[opti]=RAID spare 2;0;24x7;3;2;1;raid-admins;120;24x7;1;1;1;;check_raid3ware.pl! c0 u2
    service[opti]=RAID array ;0;24x7;3;2;1;raid-admins;120;24x7;1;1;1;;check_raid3ware.pl! c0 u0

After issuing a /usr/local/etc/rc.d/netsaint.sh reload command, and waiting a short while for NetSaint to run its queries, I found the following in my NetSaint monitoring website:

[Screenshot: NetSaint status page]
Testing the plug-in
It is one thing to write a script, test it, and put it into production. It is an entirely different thing to test it in production. I tested this in production by removing drives from the server. This server has hot-swappable drives. By removing one, I can verify that the plug-in is working as expected. Here is NetSaint after I removed a spare drive:

[Screenshot: NetSaint status page]

    # tw_cli rescan
    Rescanning controller /c0 for units and drives ...Done.
    Found the following unit(s): [/c0/u2].
    Found the following drive(s): [none].

A short time later, NetSaint reported all was well. Next, I removed a drive from the RAID cluster. NetSaint then displayed this:

[Screenshot: NetSaint status page]

    # tw_cli rescan
    Rescanning controller /c0 for units and drives ...Done.
    Found the following unit(s): [/c0/u1].
    Found the following drive(s): [none].

I can see what is rebuilding:

    # tw_cli info c0 u0

    Unit     UnitType  Status         %Cmpl  Port  Stripe  Size(GB)  Blocks
    -----------------------------------------------------------------------
    u0       RAID-10   REBUILDING     87     -     64K     195.548   410093568
    u0-0     RAID-1    REBUILDING     61     -     -       -         -
    u0-0-0   DISK      OK             -      p0    -       65.1826   136697856
    u0-0-1   DISK      DEGRADED       -      p6    -       65.1826   136697856
    u0-1     RAID-1    OK             -      -     -       -         -
    u0-1-0   DISK      OK             -      p2    -       65.1826   136697856
    u0-1-1   DISK      OK             -      p4    -       65.1826   136697856
    u0-2     RAID-1    OK             -      -     -       -         -
    u0-2-0   DISK      OK             -      p3    -       65.1826   136697856
    u0-2-1   DISK      OK             -      p5    -       65.1826   136697856

Drive 6 (p6 as shown above) has been pulled into the cluster. By issuing a unitstatus command, I can confirm that u1 is inoperable. That would be the drive I just removed and replaced.

    # tw_cli info c0 unitstatus

    Unit  UnitType  Status         %Cmpl  Stripe  Size(GB)  Cache  AVerify  IgnECC
    ------------------------------------------------------------------------------
    u0    RAID-10   REBUILDING     90     64K     195.548   OFF    OFF      OFF
    u1    RAID-10   INOPERABLE     -      64K     195.548   OFF    OFF      OFF
    u2    SPARE     OK             -      -       69.2404   -      OFF      -

Since this drive came from the array, it contains metadata that identifies it as part of the array. That metadata needs to be erased before the 3Ware controller will accept it as a hot spare. With the following commands, I delete the inoperable unit, add the drive (from that unit) back as a hot spare, and then list the unitstatus.

    # tw_cli /c0/u1 del
    Deleting /c0/u1 will cause the data on the unit permanently loss.
    Do you want to continue ? Y|N [N]: y
    Deleting unit c0/u1 ...Done.

    # tw_cli /c0 add type=spare disk=1
    Creating new unit on controller /c0 ... Done. The new unit is /c0/u1.

    # tw_cli info c0 unitstatus

    Unit  UnitType  Status         %Cmpl  Stripe  Size(GB)  Cache  AVerify  IgnECC
    ------------------------------------------------------------------------------
    u0    RAID-10   OK             -      64K     195.548   ON     OFF      OFF
    u1    SPARE     OK             -      -       69.2404   -      OFF      -
    u2    SPARE     OK             -      -       69.2404   -      OFF

NetSaint then displayed this status:

[Screenshot: NetSaint status page]
