Sep 03 2010
 

3Ware Nagios plugin

I use Nagios to monitor my servers and workstations. If something goes wrong, I usually get told by Nagios before I notice the problem myself. A week or so back, I noticed a rather odd RAID problem. Eventually, the problem was solved by upgrading the firmware on the controller. In the meantime, I had located and installed a Nagios 3ware plugin. I like it and I’m using it on more than one server. However, now that I have turned on AUTO-VERIFY, I’ve found a spot where I can improve the plugin.

Verifying…!

Earlier today, I turned on AUTO-VERIFY for this controller. Tonight, Nagios is reporting:
Status: UNKNOWN
Status Information: UNKNOWN: 
/c0/u0 RAID-10 VERIFYING - 56% 64K 195.548 ON ON - 
/c0/u1 SPARE VERIFYING - 0% - 69.2404 - ON - 
/c0/u2 SPARE VERIFYING - 0% - 69.2404 - ON - 
If I look at the status output, I see:
$ sudo /usr/local/sbin/tw_cli info c0 u0
Password:

Unit     UnitType  Status         %RCmpl  %V/I/M  Port  Stripe  Size(GB)
------------------------------------------------------------------------
u0       RAID-10   VERIFYING      -       62%     -     64K     195.548
u0-0     RAID-1    VERIFYING      62%     -       -     -       -
u0-0-0   DISK      OK             -       -       p0    -       65.1826
u0-0-1   DISK      OK             -       -       p2    -       65.1826
u0-1     RAID-1    VERIFYING      62%     -       -     -       -
u0-1-0   DISK      OK             -       -       p6    -       65.1826
u0-1-1   DISK      OK             -       -       p5    -       65.1826
u0-2     RAID-1    VERIFYING      63%     -       -     -       -
u0-2-0   DISK      OK             -       -       p3    -       65.1826
u0-2-1   DISK      OK             -       -       p4    -       65.1826
u0/v0    Volume    -              -       -       -     -       195.548
Now I’d rather have something other than UNKNOWN. Fortunately, I have the source.

The patch!

This is the patch:
--- /usr/local/libexec/nagios/check_3ware.sh	2010-08-27 02:34:55.000000000 +0100
+++ /home/dan/bin/check_3ware.sh	2010-09-02 01:08:39.000000000 +0100
@@ -66,6 +66,12 @@
 				MSG="$MSG $STATUS -"
 				PREEXITCODE=1
 				;;
+			VERIFYING)
+				CHECKUNIT=`$TWCLI info $i unitstatus | ${GREP} -E "${UNIT[$COUNT]}" | ${AWK} '{print $1,$3,$5}'`
+				STATUS="/$i/$CHECKUNIT"
+				MSG="$MSG $STATUS -"
+				PREEXITCODE=1
+				;;
 			DEGRADED)
 				CHECKUNIT=`$TWCLI info $i unitstatus | ${GREP} -E "${UNIT[$COUNT]}" | ${AWK} '{print $1,$3}'`
 				STATUS="/$i/$CHECKUNIT"

This is what it outputs:
$ sudo ~/bin/check_3ware.sh
WARNING:  /c0/u0 VERIFYING 89% - /c0/u1 VERIFYING 0% - /c0/u2 VERIFYING 0% -
After replacing the original script, I get this output when testing it from the command line on the Nagios server:
$ /usr/local/libexec/nagios/check_nrpe2 -H supernews-vpn -c check_3ware.sh
WARNING:  /c0/u0 VERIFYING 99% - /c0/u1 VERIFYING 1% - /c0/u2 VERIFYING 0% -
I now see this on my Nagios webpage:
Status: WARNING
Status Information: WARNING:
/c0/u0 VERIFYING 99% - 
/c0/u1 VERIFYING 1% - 
/c0/u2 VERIFYING 0% - 
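
The VERIFYING case works by picking fields 1, 3 and 5 out of each unitstatus line. Here is a minimal sketch of that selection, run against a line copied from the Nagios output above instead of a live controller. The function name summarize_unit is hypothetical and is not part of the plugin:

```sh
#!/bin/sh
# Hypothetical helper, not part of check_3ware.sh: reduce one unit line
# to "unit status percent", the same three fields the VERIFYING case prints.
summarize_unit() {
    # $1 = unit name, $3 = status, $5 = %V/I/M (verify progress)
    awk '{print $1, $3, $5}'
}

# Sample line copied from the status shown earlier (controller prefix removed).
echo "u0 RAID-10 VERIFYING - 56% 64K 195.548 ON ON -" | summarize_unit
# prints: u0 VERIFYING 56%
```

This makes it easy to check the field numbers against saved tw_cli output before touching the script that Nagios actually runs.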





Other ideas

Tonight I started a battery test. The status immediately went to CRITICAL. That got me thinking about this patch:
$ diff -ruN /usr/local/libexec/nagios/check_3ware.sh ~/bin/check_3ware.sh
--- /usr/local/libexec/nagios/check_3ware.sh    2010-09-02 01:08:39.000000000 +0100
+++ /home/dan/bin/check_3ware.sh        2010-09-02 02:52:39.000000000 +0100
@@ -100,7 +100,7 @@
        # Check BBU's
        BBU=(`$TWCLI info $i |${GREP} -E "^bbu"|${AWK} '{print $1,$2,$3,$4,$5}'`)
        if [ "${BBU[0]}" = "bbu" ]; then
-               if [ "${BBU[1]}" != "On" ] || [ "${BBU[2]}" != "Yes" ] || [ "${BBU[3]}" != "OK" ] || [ "${BBU[4]}" != "OK" ]; then
+               if [ "${BBU[1]}" != "On" ] || [ "${BBU[2]}" != "Yes" ] || { [ "${BBU[3]}" != "OK" ] && [ "${BBU[3]}" != "Testing" ]; } || [ "${BBU[4]}" != "OK" ]; then
                     BBUEXITCODE=2
                     BBUERROR="BBU on $i failed"
                fi
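
The intent of that condition is easier to test when pulled out on its own. What follows is only a sketch: bbu_ok is a hypothetical function, its four arguments stand in for BBU[1] through BBU[4], and "Error" is an arbitrary placeholder value, not something tw_cli is known to print.

```sh
#!/bin/sh
# Hypothetical wrapper, not part of check_3ware.sh. Arguments mirror
# BBU[1]..BBU[4] from the plugin. Returns 0 when the BBU looks healthy.
bbu_ok() {
    [ "$1" = "On" ]  || return 1
    [ "$2" = "Yes" ] || return 1
    # Third field: accept OK, or Testing while a battery test is running.
    case "$3" in
        OK|Testing) ;;
        *) return 1 ;;
    esac
    [ "$4" = "OK" ] || return 1
    return 0
}

bbu_ok On Yes Testing OK && echo "BBU fine"
bbu_ok On Yes Error OK   || echo "BBU failed"
```

Using case for the third field avoids combining two tests inside a single pair of brackets, which is easy to get wrong in sh.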
I also think I may change the status for VERIFYING from WARNING to OK, because really, everything IS OK. The controller is merely running VERIFY. FYI: I sent an email to the plugin author before I published this.
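
Here is a rough sketch of how that change could be made switchable, using the standard Nagios plugin exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN). Both unit_exit_code and the VERIFY_EXITCODE variable are made up for illustration; the plugin has no such option as far as I know.

```sh
#!/bin/sh
# Standard Nagios plugin exit codes: 0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN.
# VERIFY_EXITCODE is an assumed knob, not an existing plugin option.
# Default to 0 so a running VERIFY is reported as OK.
VERIFY_EXITCODE=${VERIFY_EXITCODE:-0}

# Hypothetical helper: map a unit status word to an exit code.
unit_exit_code() {
    case "$1" in
        OK)        echo 0 ;;
        VERIFYING) echo "$VERIFY_EXITCODE" ;;
        *)         echo 3 ;;   # unhandled states stay UNKNOWN
    esac
}

unit_exit_code VERIFYING
```

Setting VERIFY_EXITCODE=1 in the environment would bring back the WARNING behaviour from the patch above.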

  One Response to “3Ware Nagios plugin”

  1. I also needed a more flexible alert system for that plugin, and I also emailed the author. My solution lets you specify which states should generate warnings: R (Rebuilding), I (Initializing), V (Verifying) and P (Verify-Paused). The default is not to warn during these states.

    Patch is long, but you can get the code here:

    http://kf-compute1.path.utah.edu/public/code/nagios/

    Also there is a plugin there that checks all the sata disks in a 3ware array for bad sectors. It needs some work, but I use it to monitor about 60 hard drives.

    -Kael

    p.s. I’ve been reading the FreeBSD diary since you started it. Thanks for the great resource.


    Post Edited (09-09-10 23:48)