Monitoring your HDD using SMART and Nagios

Monitoring your computer systems is a good idea. There are many tools
that let you verify that specified services are running and available to
clients. I use Nagios. With it, you can check that Apache is still running,
Postfix is still accepting mail, and various other things. If you can
write a test, Nagios can monitor it.

Typically, people monitor network connections, applications, and bandwidth
consumption. Until recently, I did not monitor disk health. That has now
changed.

I started using three new tools:

In this article I’ll show you how I added SMART monitoring to my Nagios
installation. munin is straightforward to install, but is outside the scope
of this article. It is for another time.

This article also assumes you have Nagios installed and nrpe running
on the host you are monitoring. I am using Fruity for my Nagios
configuration, so I will be glossing over that too.


Disks die. Usually, they die predictably. Tools exist for monitoring
your HDD. Many modern disks contain SMART support. From

Self-Monitoring, Analysis, and Reporting Technology, or S.M.A.R.T.
(sometimes written as SMART), is a monitoring system for computer hard
disks to detect and report on various indicators of reliability, in the
hope of anticipating failures.

My first real introduction to SMART came from reading
Watching a hard drive die
by Greg Smith. Greg is present on the PostgreSQL Performance mailing list.
He knows a lot about hardware and how to get the best out of it. As I was
setting up a 10TB file
, I wanted to start monitoring the health of those disks.


To install smartmontools:

cd /usr/ports/sysutils/smartmontools/
make install clean

To have smartd start at boot:

echo 'smartd_enable="YES"' >> /etc/rc.conf

I used the default configuration file, but you could get more specific if you
wish:

cp -i /usr/local/etc/smartd.conf.sample /usr/local/etc/smartd.conf

To start smartd now:

# /usr/local/etc/rc.d/smartd start
Starting smartd.

I know I have two HDDs, so I added this to /etc/periodic.conf so that my
daily status reports include drive health information:

daily_status_smart_devices="/dev/ad0 /dev/ad2"
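
If you don't want to wait for the daily report, you can ask each drive for its overall health assessment by hand with smartctl -H. What follows is only a sketch: health_all is a helper name made up for this example, and the SMARTCTL variable is an assumption added so the command can be swapped out when testing.

```sh
#!/bin/sh
# Ask each listed device for its overall SMART health assessment.
# health_all and SMARTCTL are hypothetical names used for illustration only.
SMARTCTL=${SMARTCTL:-smartctl}

health_all() {
    rc=0
    for dev in "$@"; do
        echo "== $dev"
        # smartctl exits non-zero when it could not run or found a problem.
        "$SMARTCTL" -H "$dev" || rc=1
    done
    return "$rc"
}

# Usage (as root), with the same devices as in /etc/periodic.conf:
#   health_all /dev/ad0 /dev/ad2
```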


nagios-check_smartmon is a Nagios plugin that allows you to access
smartmontools from within Nagios. To install it:

# cd /usr/ports/net-mgmt/nagios-check_smartmon
# make install clean

Let’s see if we can run it:

# /usr/local/libexec/nagios/check_smartmon -d /dev/ad2
OK: device is functional and stable (temperature: 43)

That’s what we need.
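
Nagios decides the state of a check from the plugin's exit status, not from the text it prints. By convention, 0 is OK, 1 is WARNING, 2 is CRITICAL, and 3 is UNKNOWN. Here is a minimal sketch for seeing which state a manual run would report. state_name is a helper name made up for this example; it is not part of Nagios or check_smartmon, and it treats any other exit status as UNKNOWN.

```sh
#!/bin/sh
# Map a Nagios plugin exit status to its state name.
# state_name is a hypothetical helper used for illustration only.
state_name() {
    case "$1" in
        0) echo "OK" ;;
        1) echo "WARNING" ;;
        2) echo "CRITICAL" ;;
        *) echo "UNKNOWN" ;;   # 3 by convention; anything else lands here too
    esac
}

# Usage (as root, so the plugin can read the device):
#   /usr/local/libexec/nagios/check_smartmon -d /dev/ad2
#   state_name $?
```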

nrpe changes

check_smartmon must be run with sufficient permission to access the device.
Under net-mgmt/nrpe, the command runs as the nagios user, which does not have
that permission, so I invoke it via sudo.
The following is the entry I add to /usr/local/etc/nrpe.cfg to monitor the
two HDDs in this system:

command[check_smartmon_ad2]=sudo /usr/local/libexec/nagios/check_smartmon -d /dev/ad2
command[check_smartmon_ad4]=sudo /usr/local/libexec/nagios/check_smartmon -d /dev/ad4

After changing the above configuration file, remember to restart nrpe:

# /usr/local/etc/rc.d/nrpe2 restart
Stopping nrpe2.
Starting nrpe2.

In order to allow the nagios user to run this command via sudo, I add the following
via the visudo command:

nagios   ALL=(ALL) NOPASSWD:/usr/local/libexec/nagios/check_smartmon -d /dev/ad2
nagios   ALL=(ALL) NOPASSWD:/usr/local/libexec/nagios/check_smartmon -d /dev/ad4
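
The nrpe.cfg entry and the sudoers entry for a device have to agree exactly, arguments included, or sudo will refuse the command. With more than a couple of disks, it may be easier to generate both lines from the device name. This is only a sketch: nrpe_line and sudoers_line are helper names made up for this example, and PLUGIN is the path used in this article.

```sh
#!/bin/sh
# Print matching nrpe.cfg and sudoers lines for a device.
# nrpe_line and sudoers_line are hypothetical helpers, not part of nrpe or sudo.
PLUGIN=/usr/local/libexec/nagios/check_smartmon

nrpe_line() {
    name=${1##*/}    # /dev/ad2 -> ad2
    printf '%s\n' "command[check_smartmon_${name}]=sudo ${PLUGIN} -d $1"
}

sudoers_line() {
    printf '%s\n' "nagios   ALL=(ALL) NOPASSWD:${PLUGIN} -d $1"
}

for dev in /dev/ad2 /dev/ad4; do
    nrpe_line "$dev"       # paste into /usr/local/etc/nrpe.cfg
    sudoers_line "$dev"    # paste in via visudo
done
```

The output is meant to be reviewed and pasted in by hand; the sudoers file should still only be edited through visudo.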

From the Nagios system, I ran these commands to verify that nrpe would return the
expected results:

$ /usr/local/libexec/nagios/check_nrpe2 -H bast -c check_smartmon_ad2
OK: device is functional and stable (temperature: 42)

Good. So we know nrpe will perform the command and return the expected results.
Now it’s a simple matter of configuring Nagios to run the above command.

Guess what. I found news:

WARNING: device temperature (57) exceeds warning temperature threshold (55) 

I started a long self test:

# smartctl -t long /dev/ad6
smartctl version 5.38 [i386-portbld-freebsd8.0] Copyright (C) 2002-8 Bruce Allen
Home page is

Sending command: "Execute SMART Extended self-test routine immediately in off-line mode".
Drive command "Execute SMART Extended self-test routine immediately in off-line mode" successful.
Testing has begun.
Please wait 54 minutes for test to complete.
Test will complete after Sat Mar 13 20:38:33 2010

Use smartctl -X to abort test.

And soon after that:

CRITICAL: device temperature (61) exceeds critical temperature threshold (60) 


After manually checking the HDD temperature by putting my hand on each HDD, I
determined that all were at a similar temperature. I concluded that the SMART
reading for this drive was wrong, which is not unknown. I changed nrpe.cfg to
allow for the higher reading:

command[check_smartmon_ad6]=sudo /usr/local/libexec/nagios/check_smartmon -d /dev/ad6 -w 65 -c 70

I also ran visudo and updated the ad6 entry to allow nagios to run the amended command.
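
To see what the new thresholds do, here is a small sketch that imitates the messages shown above. classify_temp is a helper name made up for this example; it is not how check_smartmon is implemented, and it assumes that "exceeds" means strictly greater than the threshold.

```sh
#!/bin/sh
# classify_temp TEMP WARN CRIT
# Hypothetical helper that imitates the plugin messages quoted in this article.
# Assumption: a threshold is exceeded only when TEMP is strictly greater.
classify_temp() {
    temp=$1
    warn=$2
    crit=$3
    if [ "$temp" -gt "$crit" ]; then
        echo "CRITICAL: device temperature ($temp) exceeds critical temperature threshold ($crit)"
        return 2
    elif [ "$temp" -gt "$warn" ]; then
        echo "WARNING: device temperature ($temp) exceeds warning temperature threshold ($warn)"
        return 1
    fi
    echo "OK: device is functional and stable (temperature: $temp)"
    return 0
}

# The reading of 61 that was CRITICAL against 55/60 is fine against 65/70.
classify_temp 61 65 70
```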

1 thought on “Monitoring your HDD using SMART and Nagios”

  1. Today I noticed this in /var/log/messages:

    Mar 14 01:02:29 ngaio smartd[49539]: Device: /dev/ad3, 1 Currently unreadable (pending) sectors
    Mar 14 01:32:30 ngaio smartd[49539]: Device: /dev/ad3, 1 Currently unreadable (pending) sectors

    The Man Behind The Curtain
