Monitoring your HDD using SMART and Nagios
Monitoring your computer systems is a good idea. Many tools let you
verify that specified services are running and available to clients.
I use Nagios. With it, you can check that Apache is still running, that
Postfix is still accepting mail, and various other things. If you can
write a test, Nagios can monitor it.
Typically, people monitor network connections, applications, and bandwidth
consumption. Until recently, I did not monitor disk health. That
has now changed.
I started using three new tools: smartmontools, nagios-check_smartmon, and munin.
In this article I’ll show you how I added SMART monitoring to my Nagios
installation. Munin is straightforward to install, but it is outside the scope
of this article. That is for another time.
This article also assumes you have Nagios installed and nrpe running
on the host you are monitoring. I am using Fruity for my Nagios
configuration, so I will be glossing over that too.
SMART
Disks die. Usually, they die predictably. Tools exist for monitoring
your HDD. Many modern disks contain SMART support. From
http://en.wikipedia.org/wiki/S.M.A.R.T.:
Self-Monitoring, Analysis, and Reporting Technology, or S.M.A.R.T.
(sometimes written as SMART), is a monitoring system for computer hard
disks to detect and report on various indicators of reliability, in the
hope of anticipating failures.
My first real introduction to SMART came from reading "Watching a hard
drive die" by Greg Smith. Greg is active on the PostgreSQL Performance
mailing list. He knows a lot about hardware and how to get the best out of it.
As I was setting up a 10TB file server, I wanted to start monitoring the
health of those disks.
smartmontools
To install smartmontools:
cd /usr/ports/sysutils/smartmontools/
make install clean
To have smartd start at boot:
echo 'smartd_enable="YES"' >> /etc/rc.conf
I used the default configuration file, but you could get more specific if you
wanted:
cp -i /usr/local/etc/smartd.conf.sample /usr/local/etc/smartd.conf
To start smartd now:
# /usr/local/etc/rc.d/smartd start
Starting smartd.
I have two HDDs, so I added this to /etc/periodic.conf to include
drive health information in my daily status reports:
daily_status_smart_devices="/dev/ad0 /dev/ad2"
nagios-check_smartmon
nagios-check_smartmon is a Nagios plugin that allows you to access
smartmontools from within Nagios. To install it:
# cd /usr/ports/net-mgmt/nagios-check_smartmon
# make install clean
Let’s see if we can run it:
# /usr/local/libexec/nagios/check_smartmon -d /dev/ad2
OK: device is functional and stable (temperature: 43)
That’s what we need.
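Nagios decides the state of a check from the plugin's exit status, not from the text it prints. By the usual Nagios plugin convention, 0 means OK, 1 WARNING, 2 CRITICAL, and 3 UNKNOWN. The following is a minimal sketch for turning that status into a name when testing by hand. state_name is a hypothetical helper of my own naming, not something shipped with check_smartmon.

```shell
#!/bin/sh
# Map a Nagios plugin exit status to its conventional state name.
# state_name is a hypothetical helper, not part of check_smartmon.
state_name() {
    case "$1" in
        0) echo "OK" ;;
        1) echo "WARNING" ;;
        2) echo "CRITICAL" ;;
        *) echo "UNKNOWN" ;;
    esac
}

# Example use on the monitored host (commented out here because it
# needs the plugin installed and permission to read the device):
#   /usr/local/libexec/nagios/check_smartmon -d /dev/ad2
#   state_name $?
```

Any status other than 0, 1, or 2 is treated as UNKNOWN here, which is a simplification.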
nrpe changes
check_smartmon must be run with sufficient permission to access the disk
device. The command runs as the nagios user, via net-mgmt/nrpe, so I invoke it
through sudo. The following are the entries I added to /usr/local/etc/nrpe.cfg
to monitor the two HDDs in this system:
command[check_smartmon_ad2]=sudo /usr/local/libexec/nagios/check_smartmon -d /dev/ad2
command[check_smartmon_ad4]=sudo /usr/local/libexec/nagios/check_smartmon -d /dev/ad4
After changing the above configuration file, remember to restart nrpe:
# /usr/local/etc/rc.d/nrpe2 restart
Stopping nrpe2.
Starting nrpe2.
In order to allow the nagios user to run this command via sudo, I add the following
via the visudo command:
nagios ALL=(ALL) NOPASSWD:/usr/local/libexec/nagios/check_smartmon -d /dev/ad2
nagios ALL=(ALL) NOPASSWD:/usr/local/libexec/nagios/check_smartmon -d /dev/ad4
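When a sudoers entry lists arguments after the command, sudo only permits the command with exactly those arguments. The nrpe.cfg line and the sudoers line for a device therefore have to agree. The sketch below prints both from the same input so they cannot drift apart. nrpe_entry and sudoers_entry are hypothetical helper names of mine. The output is meant to be reviewed and pasted by hand (the sudoers line via visudo), not written to the files automatically.

```shell
#!/bin/sh
# Path to the plugin as installed by the port used in this article.
PLUGIN=/usr/local/libexec/nagios/check_smartmon

# nrpe_entry DEVICE [EXTRA ARGS...]
# Print an nrpe.cfg command line for one device, e.g. /dev/ad2.
# Hypothetical helper for illustration only.
nrpe_entry() {
    dev=$1; shift
    name=${dev##*/}                 # /dev/ad2 -> ad2
    args="-d $dev"
    if [ $# -gt 0 ]; then args="$args $*"; fi
    echo "command[check_smartmon_${name}]=sudo ${PLUGIN} ${args}"
}

# sudoers_entry DEVICE [EXTRA ARGS...]
# Print the matching sudoers line for the nagios user.
# Hypothetical helper for illustration only.
sudoers_entry() {
    dev=$1; shift
    args="-d $dev"
    if [ $# -gt 0 ]; then args="$args $*"; fi
    echo "nagios ALL=(ALL) NOPASSWD:${PLUGIN} ${args}"
}

# Example (commented out): print both lines for two disks.
#   for d in /dev/ad2 /dev/ad4; do nrpe_entry "$d"; sudoers_entry "$d"; done
```

Extra arguments, such as threshold options, are passed through to both lines unchanged.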
From the Nagios system, I ran this command to verify that nrpe would return the
expected results:
$ /usr/local/libexec/nagios/check_nrpe2 -H bast -c check_smartmon_ad2
OK: device is functional and stable (temperature: 42)
Good. So we know NRPE will perform the command and return the expected results.
Now it’s a simple matter of configuring Nagios to run the above command.
Guess what. I found news:
WARNING: device temperature (57) exceeds warning temperature threshold (55)
I started a long self test:
# smartctl -t long /dev/ad6
smartctl version 5.38 [i386-portbld-freebsd8.0] Copyright (C) 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/
=== START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ===
Sending command: "Execute SMART Extended self-test routine immediately in off-line mode".
Drive command "Execute SMART Extended self-test routine immediately in off-line mode" successful.
Testing has begun.
Please wait 54 minutes for test to complete.
Test will complete after Sat Mar 13 20:38:33 2010
Use smartctl -X to abort test.
And soon after that:
CRITICAL: device temperature (61) exceeds critical temperature threshold (60)
Nice.
After manually checking the temperature, by putting my hand on each HDD, I
determined that all were at a similar temperature. I concluded that the SMART
reading was wrong, which is not unheard of. I raised the thresholds in nrpe.cfg
to allow for the higher reading:
command[check_smartmon_ad6]=sudo /usr/local/libexec/nagios/check_smartmon -d /dev/ad6 -w 65 -c 70
I also ran visudo and updated the ad6 entry to allow nagios to run the amended command.
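To see how the -w and -c values interact with a reading, here is a small sketch of the threshold logic as I understand it from the plugin's messages. The defaults of 55 and 60 are taken from the WARNING and CRITICAL messages quoted earlier. It assumes that a reading must be strictly greater than a threshold to trip it, which is what "exceeds" suggests but which I have not checked against the plugin source. classify_temp is a hypothetical name.

```shell
#!/bin/sh
# classify_temp TEMP [WARN] [CRIT]
# Print the Nagios state for a temperature and return the matching
# exit status (0 OK, 1 WARNING, 2 CRITICAL).
# Defaults of 55 and 60 come from the plugin messages in this article.
# Assumption: a reading must be strictly greater than a threshold
# to trip it.
classify_temp() {
    temp=$1
    warn=${2:-55}
    crit=${3:-60}
    if [ "$temp" -gt "$crit" ]; then
        echo "CRITICAL"
        return 2
    elif [ "$temp" -gt "$warn" ]; then
        echo "WARNING"
        return 1
    fi
    echo "OK"
    return 0
}

# Example (commented out): the 61 degree reading with the raised thresholds.
#   classify_temp 61 65 70
```

With the defaults, 57 is a warning and 61 is critical. With -w 65 -c 70, the same 61 counts as OK. That silences the alert but also hides any real temperature rise below 65.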
Today I noticed this in /var/log/messages:
Mar 14 01:02:29 ngaio smartd[49539]: Device: /dev/ad3, 1 Currently unreadable (pending) sectors
Mar 14 01:32:30 ngaio smartd[49539]: Device: /dev/ad3, 1 Currently unreadable (pending) sectors
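smartd repeats this message on every polling cycle, so it helps to reduce the log to one line per device. The following is a rough sketch that pulls the device name and sector count out of such lines. The pattern is based only on the two log lines shown above, so other smartd message formats will not match. pending_sectors is a hypothetical helper name.

```shell
#!/bin/sh
# pending_sectors: read syslog lines on stdin and print "DEVICE COUNT"
# for each smartd report of currently unreadable (pending) sectors.
# The pattern is derived only from the log lines shown in this article.
pending_sectors() {
    sed -n 's|.*smartd\[[0-9]*]: Device: \([^,]*\), \([0-9][0-9]*\) Currently unreadable (pending) sectors.*|\1 \2|p'
}

# Example (commented out; needs read access to the log):
#   pending_sectors < /var/log/messages | sort -u
```

For a closer look at a drive reported this way, smartctl -A /dev/ad3 prints the drive's SMART attribute table.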
—
The Man Behind The Curtain