Aug 11, 2006
NetSaint – creating a plug-in for RAID monitoring
This article originally appeared in OnLAMP. In a previous article I talked about my RAID-5 installation. It has been up and running for a few days now, and I'm pleased with the result. However, RAID can fail. When it fails, you need to take action before the next failure: two failures close together, no matter how rare that may be, will mean a complete reinstall. I have been using NetSaint since first writing about it back in 2001. You will notice that NetSaint development has continued under a new name, Nagios. For me, I continue to use NetSaint; it does what I need. I'm guessing my tools will also work with Nagios. The monitoring consists of three main components:
- NetSaint (which I will assume you have installed and configured)
- netsaint_statd – provides remote monitoring of hosts, as patched with my change.
- check_adptraid.pl – the plug-in which monitors the RAID status
Monitoring the array
Monitoring the health of your RAID array is vital to the health of your system. Fortunately, Adaptec has a tool for this, available in the FreeBSD sysutils/asr-utils port. After installing the port, it took me a while to figure out what to use and how to use it. The problem was compounded by a run-time error which took me down a little tangent before I could get it running. I will show you how to integrate this utility into your NetSaint configuration.

My first few attempts at running the monitoring tool failed with this result:

# /usr/local/sbin/raidutil -L all
Engine connect failed: Open

After some Googling, I found this reference. The problem was shared memory. It seems that with PostgreSQL running, raidutil could not get what it needed. I hunted around, asked questions, and found a few knobs and switches:
# grep SHM /usr/src/sys/i386/conf/LINT
options         SYSVSHM                 # include support for shared memory
options         SHMMAXPGS=1025          # max amount of shared memory pages (4k on i386)
options         SHMALL=1025             # max number of shared memory pages system wide
options         SHMMAX="(SHMMAXPGS*PAGE_SIZE+1)"
options         SHMMIN=2                # min shared memory segment size (bytes)
options         SHMMNI=33               # max number of shared memory identifiers
options         SHMSEG=9                # max shared memory segments per process

These kernel options are also available as sysctl values:

$ sysctl -a | grep shm
kern.ipc.shmmax: 33554432
kern.ipc.shmmin: 1
kern.ipc.shmmni: 192
kern.ipc.shmseg: 128
kern.ipc.shmall: 8192
kern.ipc.shm_use_phys: 0
kern.ipc.shm_allow_removed: 0

I started playing with kern.ipc.shmmax but failed to get anything useful, even going up to some very large values. I suspect someone will suggest appropriate values. I found the solution by modifying the number of PostgreSQL connections: I edited /usr/local/pgsql/data/postgresql.conf and reduced the value of max_connections from 40 to 30. Issuing the following command invoked the changes by restarting the PostgreSQL postmaster:
kill -HUP `cat /usr/local/pgsql/data/postmaster.pid`

Now that raidutil is able to run, here is what the output looks like:
$ sudo raidutil -L all
RAIDUTIL Version: 3.04  Date: 9/27/2000
FreeBSD CLI Configuration Utility
Adaptec ENGINE Version: 3.04  Date: 9/27/2000
Adaptec FreeBSD SCSI Engine

#  b0 b1 b2  Controller  Cache  FW    NVRAM     Serial       Status
---------------------------------------------------------------------------
d0 -- -- --  ADAP2400A   16MB   3A0L  CHNL 1.1  BF0B111Z0B4  Optimal

Physical View
Address    Type               Manufacturer/Model  Capacity  Status
---------------------------------------------------------------------------
d0b0t0d0   Disk Drive (DASD)  ST380011 A          76319MB   Optimal
d0b1t0d0   Disk Drive (DASD)  ST380011 A          76319MB   Optimal
d0b2t0d0   Disk Drive (DASD)  ST380011 A          76319MB   Replaced Drive
d0b3t0d0   Disk Drive (DASD)  ST380011 A          76319MB   Optimal

Logical View
Address    Type               Manufacturer/Model  Capacity  Status
---------------------------------------------------------------------------
d0b0t0d0   RAID 5 (Redundant  ADAPTEC RAID-5      228957MB  Reconstruct 94%
d0b0t0d0   Disk Drive (DASD)  ST380011 A          76319MB   Optimal
d0b1t0d0   Disk Drive (DASD)  ST380011 A          76319MB   Optimal
d0b2t0d0   Disk Drive (DASD)  ST380011 A          76319MB   Replaced Drive
d0b3t0d0   Disk Drive (DASD)  ST380011 A          76319MB   Optimal

Address    Max Speed  Actual Rate / Width
---------------------------------------------------------------------------
d0b0t0d0   50 MHz     100 MB/sec wide
d0b1t0d0   50 MHz     100 MB/sec wide
d0b2t0d0   50 MHz     100 MB/sec wide
d0b3t0d0   10 MHz     100 MB/sec wide

Address    Manufacturer/Model  Write Cache Mode (HBA/Device)
---------------------------------------------------------------------------
d0b0t0d0   ADAPTEC RAID-5      Write Back / --
d0b0t0d0   ST380011 A          -- / Write Back
d0b1t0d0   ST380011 A          -- / Write Back
d0b2t0d0   ST380011 A          -- / Write Back
d0b3t0d0   ST380011 A          -- / Write Back

#  Controller  Cache  FW    NVRAM     BIOS  SMOR      Serial
---------------------------------------------------------------------------
d0 ADAP2400A   16MB   3A0L  CHNL 1.1  1.62  1.12/79I  BF0B111Z0B4

#  Controller  Status      Voltage  Current  Full Cap  Rem Cap  Rem Time
---------------------------------------------------------------------------
d0 ADAP2400A   No battery

Address    Manufacturer/Model  FW    Serial    123456789012
---------------------------------------------------------------------------
d0b0t0d0   ST380011 A          3.06  1ABW6AY1  -X-XX--X-O--
d0b1t0d0   ST380011 A          3.06  1ABEYH4P  -X-XX--X-O--
d0b2t0d0   ST380011 A          3.06  1ABRWK0E  -X-XX--X-O--
d0b3t0d0   ST380011 A          3.06  1ABRDS5E  -X-XX--X-O--

Capabilities Map:
Column 1 = Soft Reset      Column 2 = Cmd Queuing
Column 3 = Linked Cmds     Column 4 = Synchronous
Column 5 = Wide 16         Column 6 = Wide 32
Column 7 = Relative Addr   Column 8 = SCSI II
Column 9 = S.M.A.R.T.      Column 0 = SCAM
Column 1 = SCSI-3          Column 2 = SAF-TE
X = Capability Exists, - = Capability does not exist, O = Not Supported

The output shows:
- I'm using an Adaptec 2400A (ADAP2400A).
- I have four drives, all ST380011 and 80GB (76319MB).
- I'm running RAID-5, giving me 228957MB of space.
- The array is rebuilding and is 94% through the reconstruction.
- The drive on channel 2 (d0b2t0d0) was replaced.
Below, you will see what raidutil reports when the array is in different states.
Note: d0b2t0d0 was not actually replaced, as the output above indicates. As part of my RAID testing, I had shut down the system, disconnected the power to one drive, started the system, verified that it still ran, shut down again, reconnected the drive, powered up again, and started to rebuild the array.
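The Status column in the logical view is what a monitoring script needs to key on. Here is a rough sketch of pulling that field out with sed; the pattern is inferred from the sample output above, not from any documented raidutil format:

```shell
# Extract the array status from `raidutil -L logical` style output.
# The sample mimics the output shown above; the sed pattern (strip
# everything up to and including the capacity column) is an assumption
# based on that sample.
sample='Address    Type               Manufacturer/Model  Capacity  Status
---------------------------------------------------------------------------
d0b0t0d0   RAID 5 (Redundant  ADAPTEC RAID-5      228957MB  Reconstruct 94%'

status=$(printf '%s\n' "$sample" | sed -n 's/^d[0-9]b[0-9]t[0-9]d[0-9].*MB  *//p')
echo "$status"   # prints: Reconstruct 94%
```

On a live system, you would feed the real `raidutil -L logical` output into the pipeline instead of the sample variable.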
Know your RAID
I'm sure that each RAID utility will respond differently to different situations. I am about to investigate what raidutil reports about my Adaptec 2400A. I will do that by disconnecting a drive from the array, booting, and then rebuilding the array. The conditions reported will allow us to customize our scripts.
Normal
Here is what raidutil reports when all is well:

# /usr/local/bin/raidutil -L logical
Address    Type               Manufacturer/Model  Capacity  Status
---------------------------------------------------------------------------
d0b0t0d0   RAID 5 (Redundant  ADAPTEC RAID-5      228957MB  Optimal
Degraded
I shut down the system, removed the power from one drive, then rebooted. Here is what raidutil reports then:

# /usr/local/bin/raidutil -L logical
Address    Type               Manufacturer/Model  Capacity  Status
---------------------------------------------------------------------------
d0b0t0d0   RAID 5 (Redundant  ADAPTE RAID-5       228957MB  Degraded

This is the normal situation when a disk has died or, in this case, been removed from the array. After I add the disk back in, raidutil will report the same status. To recover the array, we must rebuild.
Reconstruction
You can also use raidutil to start the rebuilding process. This will sync up the degraded drive with the rest of the array. This can be a lengthy process, but it is vital.

The rebuilding can be started with this command:

/usr/local/bin/raidutil -a rebuild d0 d0b0t0d0

Where d0b0t0d0 is the address supplied in the above raidutil output.
After rebuilding has started, this is what raidutil reports:

# /usr/local/bin/raidutil -L logical
Address    Type               Manufacturer/Model  Capacity  Status
---------------------------------------------------------------------------
d0b0t0d0   RAID 5 (Redundant  ADAPTE RAID-5       228957MB  Reconstruct 0%

The percentage will slowly creep up until all disks are resynced.
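Since a rebuild can take hours, it can be handy to wait for it to finish from a script. A minimal sketch, assuming the output format shown above (the five-minute interval is an arbitrary choice):

```shell
# Return success (0) while the logical view still reports a rebuild.
still_rebuilding() {
    case "$1" in
        *Reconstruct*) return 0 ;;
        *)             return 1 ;;
    esac
}

# Poll every five minutes until reconstruction completes.
# Guarded so the sketch is harmless on systems without raidutil.
if [ -x /usr/local/bin/raidutil ]; then
    while still_rebuilding "$(/usr/local/bin/raidutil -L logical)"; do
        sleep 300
    done
fi
```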
Using netsaint_statd
The scripts supplied with netsaint_statd can be divided into two types:
- scripts that get information from a remote machine
- a daemon (netsaint_statd) that processes incoming requests and supplies the information

The daemon should be installed on every machine you wish to monitor. I downloaded the netsaint_statd tarball and untarred it into the directory /usr/local/libexec/netsaint/netsaint_statd on my RAID machine. Strictly speaking, the check_*.pl scripts do not need to be on the RAID machine, only the netsaint_statd daemon; you can remove them if you want. I have them only on the NetSaint machine.
I use the following script to start it up at boot time:

$ less /usr/local/etc/rc.d/netsaint_statd.sh
#!/bin/sh
case "$1" in
start)
        /usr/local/libexec/netsaint/netsaint_statd/netsaint_statd
        ;;
esac
exit 0

Then I started up the script:

# /usr/local/etc/rc.d/netsaint_statd.sh start

The RAID machine now has the netsaint_statd script running as a daemon, waiting for incoming requests. Now we will move our attention to the NetSaint machine.
This post by RevDigger is the basis for what I did to set up netsaint_statd. I installed the netsaint_statd tarball into the same directory on the NetSaint machine. You will need the check_*.pl scripts this time.

Now that NetSaint has the tools, you need to tell it about them. I added this to the end of my /usr/local/etc/netsaint/commands.cfg file:
# netsaint_statd remote commands
command[check_rload]=$USER1$/netsaint_statd/check_load.pl $HOSTADDRESS$ $ARG1$ $ARG2$ $ARG3$
command[check_rprocs]=$USER1$/netsaint_statd/check_procs.pl $HOSTADDRESS$ $ARG1$ $ARG2$ $ARG3$ $ARG4$
command[check_rusers]=$USER1$/netsaint_statd/check_users.pl $HOSTADDRESS$ $ARG1$ $ARG2$ $ARG3$
command[check_rdisk]=$USER1$/netsaint_statd/check_disk.pl $HOSTADDRESS$ $ARG1$ $ARG2$ $ARG3$ $ARG4$
command[check_rall_disks]=$USER1$/netsaint_statd/check_all_disks.pl $HOSTADDRESS$ $ARG1$ $ARG2$ $ARG3$ $ARG4$
command[check_adptraid.pl]=$USER1$/netsaint_statd/check_adptraid.pl $HOSTADDRESS$ $ARG1$ $ARG2$ $ARG3$ $ARG4$

Here are the entries I added to /usr/local/etc/netsaint/hosts.cfg to add monitoring for the machine named polo. Specifically, we will be monitoring load, number of processes, number of users, and disk space.
service[polo]=LOAD;0;24x7;3;2;1;freebsd-admins;120;24x7;1;1;1;;check_rload!3
service[polo]=PROCS;0;24x7;3;2;1;freebsd-admins;120;24x7;1;1;1;;check_rprocs!
service[polo]=USERS;0;24x7;3;2;1;freebsd-admins;120;24x7;1;1;1;;check_rusers!4
service[polo]=DISKSALL;0;24x7;3;2;1;freebsd-admins;120;24x7;1;1;1;;check_rall_disks

Then I restarted NetSaint:

/usr/local/etc/rc.d/netsaint.sh restart

After the restart, I started to see those services in my NetSaint website. This is great!
RAID Notification overview
Getting NetSaint to monitor my RAID array was not as simple as getting it to monitor a regular disk. I was already using netsaint_statd to monitor remote machines; I have them all set up so I can see load, process count, users, and disk space usage. I will extend netsaint_statd to monitor RAID status.
This additional feature involves several distinct steps:
- Create a Perl script for use by netsaint_statd to monitor the RAID
- Extend netsaint_statd to use that script
- Add RAID to the services monitored by NetSaint
RAID Perl script
As the basis for the Perl script, I used check_users.pl as supplied with netsaint_statd, and created check_adptraid.pl. I installed that script into the same directory as all the other netsaint_statd scripts (/usr/local/libexec/netsaint/netsaint_statd).
If you look at this script, you'll see that we're looking for the three major status values:

if ($servanswer =~ m%^Reconstruct%) {
    $state = "WARNING";
    $answer = $servanswer;
} else {
    if ($servanswer =~ m%^Degraded%) {
        $state = "CRITICAL";
        $answer = $servanswer;
    } else {
        if ($servanswer =~ m%^Optimal%) {
            $state = "OK";
            $answer = $servanswer;
        } else {
            $answer = $servanswer;
            $state = "CRITICAL";
        }
    }
}

I have decided that Degraded and unknown results will be CRITICAL, Optimal will be OK, and Reconstruct will be a WARNING. The next step is to modify netsaint_statd to use this newly added script.
netsaint_statd patch
The patch for netsaint_statd is available from here. Apply the patch like this:

cd /usr/local/libexec/netsaint/netsaint_statd
patch < path.to.patch.you.downloaded

Now that you have modified the daemon, you need to kill it and restart it:

# ps auwx | grep netsaint_statd
root 28778  0.0  0.5  3052  2460  ??  Ss  6:56PM  0:00.32 /usr/bin/perl /usr/local/libexec/netsaint/netsaint_statd/netsaint_statd
# kill -TERM 28778
# /usr/local/etc/rc.d/netsaint_statd.sh start
#
Add RAID to the services monitored by NetSaint
Now we have the remote RAID box ready to tell us all about the RAID status. It's time to test it:

# cd /usr/local/libexec/netsaint/netsaint_statd
# perl check_adptraid.pl polo
Reconstruct 85%

That looks right to me! Now I'll show you what I added to NetSaint to use this new tool. First, I'll add the service definition to /usr/local/etc/netsaint/hosts.cfg:
service[polo]=RAID;0;24x7;3;2;1;raid-admins;120;24x7;1;1;1;;check_adptraid.pl

I have set up a new notification group (raid-admins) because I want to be notified via text message to my cellphone when the RAID array has a problem. The contact group I created was:
contactgroup[raid-admins]=RAID Administrators;danphone,dan

In this case, I want contacts danphone and dan to be notified. Here are the contacts which relate to the above contact group (the lines below may be wrapped, but in NetSaint there should only be two lines):
contact[dan]=Dan Langille;24x7;24x7;1;1;0;1;1;0;notify-by-email;host-notify-by-email;dan;
contact[danphone]=Dan Langille;24x7;24x7;1;1;0;1;1;0;notify-xtrashort;notify-xtrashort;dan;6135551212@pcs.example.com;

This shows that I will be notified by email and by a message to my cellphone. After restarting NetSaint, I was able to see this on my webpage:
