How I test tapes and tape drives
I’ve written a number of articles about backups and about Bacula. As with anything, you must
test your setup to ensure it works properly.
Do you test your tape drives? Do you test your tapes? Do you test the backup process
by also doing a restore?
In this article, I’ll show you a script that pulls statistics from my DLT drive, and a
script I use for testing tapes in a tape library.
Why? If you just bought a tape drive, you want to know that it works. If you just
took delivery of 10 used DLT tapes, you want to know they work. You also want to know
that when you do a backup, you can also do a restore.
A practical example
When I first started using Bacula, I had a DDS drive with a 4-tape magazine. It worked well.
I started testing Bacula to make sure it did what I expected. The first thing I did was test
backups that spanned two or more tapes. I created a large file, slightly larger than would fit
on one tape. Then I told Bacula to back it up. The backup worked fine. Then I restored.
That’s when I found the problem: the restored file was smaller than the original.
This led us to find a bug in FreeBSD (since fixed) related to pthreads.
The moral: always test.
Statistics from the tape drive
I am a fan of DLT drives. I’ve been using them for over a year. What is DLT?
Ask Google about Digital Linear Tape.
DLT has been around for quite some time. It is quite robust.
The tape drive mechanism is what impressed me. The recording surface is touched only by the
recording head. This results in very little tape wear.
When I first started to use DLT, I hooked the drive up to a SCSI chain and wanted to know how much
throughput I should expect. I was asking questions on the FreeBSD SCSI mailing list when someone
sent me a very interesting script. It seems there are many factors that can affect throughput,
two of which are tape quality and drive quality. If there are many errors, the drive must stop
and start, moving the tape back and forth until it gets the right results. This can dramatically
affect throughput. What you want to hear when writing is a constant feed of tape going through
the drive. You don’t want to hear stop, rewind, start, repeat. Ideally, you only hear a stop/start
when you get to the end of the tape.
The script in question allows me to query the
tape drive and obtain a number of interesting statistics, one of which is “corrected errors”.
Here is some sample output from my production tape drive:
# ~/bin/dlt sa0
The tape is 'sa0'
READING
Corrected errors with substantial delay: 0
Corrected errors with possible delay   : 0
Total errors                           : 73
Total errors corrected                 : 73
Total times correction algorithm used  : 0
Total bytes processed                  : 1955930720
Total corrected errors / GB            : 40
Total uncorrected errors               : 0
Read compression ratio                 : 191%
On tape Mbytes read                    : 5
On tape kbytes read residual           : 971386
WRITING
Corrected errors with substantial delay: 0
Corrected errors with possible delay   : 0
Total errors                           : 147
Total errors corrected                 : 147
Total times correction algorithm used  : 0
Total bytes processed                  : 7736171760
Total corrected errors / GB            : 20
Total uncorrected errors               : 0
Write compression ratio                : 228%
Host requested Mbytes written          : 13342
Host requested kbytes written residual : 487424
On tape Mbytes written                 : 5852
On tape kbytes written residual        : 0
In the above you can see the tape drive is correcting 20 errors for every GB of data written to the tape.
This is a relatively good value. Other drives I’ve tested were getting upwards of 4000 corrected errors.
I use this script as a relative indicator of good versus poor tapes and drives. I’ve found that the same
tape used on two different drives can give very different error rates.
This script was written for FreeBSD and makes use of
camcontrol(8).
I’m also quite sure that it interrogates the DLT drive using commands specific to DLT.
Therefore, I doubt this script will work on non-FreeBSD systems or with non-DLT drives.
No doubt a similar script could be written for other operating systems. If you know of such a script,
please let me know and I’ll happily include it here.
You can download the script here. Use at your own risk, may cause cancer,
your mileage may vary, etc.
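To give you a flavour of how such a script works: on FreeBSD, camcontrol cmd lets you issue raw SCSI
commands to the drive. Here is a minimal sketch, assuming the drive supports the standard SCSI error
counter log pages; it only dumps the raw page as hex, whereas the actual script decodes the individual
counters:

#!/bin/sh
# Sketch only, not the actual script. LOG SENSE is SCSI opcode 4D;
# byte 2 selects page control + page code (0x43 = cumulative read
# error counters, 0x42 = write error counters); bytes 7-8 are the
# allocation length (0x60 bytes here).
DRIVE=${1:-sa0}
camcontrol cmd $DRIVE -c "4D 00 43 00 00 00 00 00 60 00" -i 0x60 - | hexdump -C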
When all I had were single tape drives, I would pop in a tape, tar up some data, and check the stats.
Now that I have a tape library, I can automate the changing of tapes
and test 10 tapes at a time. This is a great time saver. This script dumps a bunch of files onto each
tape, extracts the stats, and logs them. Then it does the same thing to the next tape. No
read verification is done. All I’m interested in is corrected errors.
You can get the script here.
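The loop itself is not complicated. Here is a minimal sketch of the idea, assuming FreeBSD’s chio(1)
and mt(1), and the stats script from above in ~/bin; the slot numbers and the data being written are
placeholders, and the real script does a little more (such as logging the barcode):

#!/bin/sh
# Sketch of the tape-library test loop; illustrative, not the actual script.
CHANGER=/dev/$1                               # e.g. ch1
DRIVE=/dev/$2                                 # e.g. sa1
for SLOT in 2 3 4 5; do
    echo "loading $SLOT"
    chio -f $CHANGER move slot $SLOT drive 0  # load the tape into the drive
    mt -f $DRIVE rewind
    tar cf $DRIVE /usr/src                    # write some data; no read verification
    ~/bin/dlt $2 | logger -t TapeTesting      # log the error counters
    mt -f $DRIVE offline                      # rewind and eject
    echo "unloading $SLOT"
    chio -f $CHANGER move drive 0 slot $SLOT  # return the tape to its slot
done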
I’ll show you what the script’s output looks like, and what it logs to /var/log/messages, below.
By running the same tests multiple times, and perhaps on different drives, you can compare the results. For
example, here are the results for one of my tapes. I have abbreviated the results so they fit better in your
browser. In this case, JYN260
is the barcode on the tape.
# grep JYN260 /var/log/messages
Feb 21 10:12:16 TapeTesting: /dev/ch0 : JYN260S2 - corrected errors / GB : 69 - uncorrected errors : 0
Feb 21 15:12:55 TapeTesting: /dev/ch0 : JYN260S2 - corrected errors / GB : 58 - uncorrected errors : 0
Feb 21 18:19:11 TapeTesting: /dev/ch1 : JYN260S2 - corrected errors / GB : 23 - uncorrected errors : 0
Feb 21 21:34:45 TapeTesting: /dev/ch0 : JYN260S2 - corrected errors / GB : 36 - uncorrected errors : 0
Feb 21 23:39:33 TapeTesting: /dev/ch0 : JYN260S2 - corrected errors / GB : 32 - uncorrected errors : 0
Feb 22 10:14:17 TapeTesting: /dev/ch0 : JYN260S2 - corrected errors / GB : 23 - uncorrected errors : 0
Feb 22 12:03:11 TapeTesting: /dev/ch0 : JYN260S2 - corrected errors / GB : 26 - uncorrected errors : 0
Feb 22 13:31:39 TapeTesting: /dev/ch1 : JYN260S2 - corrected errors / GB : 12 - uncorrected errors : 0
Feb 22 16:15:47 TapeTesting: /dev/ch0 : JYN260S2 - corrected errors / GB : 26 - uncorrected errors : 0
Feb 24 11:00:03 TapeTesting: /dev/ch1 : JYN260S2 - corrected errors / GB : 12 - uncorrected errors : 0
For this test, I had the following tapes in the magazine:
# ~dan/rc-chio-changer /dev/ch1 list
2:JYN257S2
3:JYN249S2
4:001320
5:JYN265S2
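By the way, rc-chio-changer is a little wrapper script of my own. On a stock FreeBSD system, chio(1)
can show a similar inventory, assuming your library reports barcodes as volume tags:

# chio -f /dev/ch1 status -v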
Here is what the script looks like when you run it:
# ./tape-testing.sh ch1 sa1
loading 2
tar: Removing leading '/' from member names
unloading 2
loading 3
tar: Removing leading '/' from member names
unloading 3
loading 4
tar: Removing leading '/' from member names
unloading 4
loading 5
tar: Removing leading '/' from member names
unloading 5
where I want to test the changer ch1 (i.e. /dev/ch1) and the drive sa1.
Looking in /var/log/messages, we can see these results:
Feb 26 11:07:11 TapeTesting: /dev/ch1 : JYN257S2 - corrected errors / GB : 3 - uncorrected errors : 0
Feb 26 11:29:32 TapeTesting: /dev/ch1 : JYN249S2 - corrected errors / GB : 35 - uncorrected errors : 0
Feb 26 11:51:01 TapeTesting: /dev/ch1 : 001320 - corrected errors / GB : 9 - uncorrected errors : 0
Feb 26 12:13:29 TapeTesting: /dev/ch1 : JYN265S2 - corrected errors / GB : 95 - uncorrected errors : 0
As you can see, JYN257S2 has the fewest errors. I repeated this test several times for
each tape, writing a large amount of data each time, so I could get a good impression
of quality. I actually didn’t find very many tapes that had over 50 corrected errors per GB.
I also found that the error rate would decrease the more times I wrote to the tape.
If you use the same tapes on different drives, you can then get an idea of the relative quality of each
drive. For example, I could get corrected error rates of 6 or 8 per GB on one drive, but the same tapes
on another drive gave 600+ corrected errors per GB. That is clearly an issue with the drive, not
the tape.
This isn’t strictly related to tape testing, but it is based upon some recent observations. The following
is a set of items which I think will affect throughput (i.e. KB/s) when backing up data.
Lots of small files
If you are backing up lots of small files, you’ll need to spool more attributes to the catalog. While
this won’t make a difference for a few hundred files, it will under certain circumstances. For example,
if you’re backing up 10,000 files of 10KB each, you’ll get lower throughput than if you are backing up
10 files of 10,000KB, despite both jobs containing the same amount of data.
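If you want to see the small-file effect for yourself, create two test sets of the same total size and
time a write of each. A rough sketch, with arbitrary paths and sizes; tar won’t show the catalog overhead
Bacula adds on top, only the per-file overhead of the filesystem and drive:

#!/bin/sh
# Create ~100MB as many small files, and again as a few large files.
mkdir -p /tmp/many /tmp/few
for i in $(jot 10000); do                     # 10,000 files of 10KB each
    dd if=/dev/urandom of=/tmp/many/f$i bs=10k count=1 2>/dev/null
done
for i in $(jot 10); do                        # 10 files of 10,000KB each
    dd if=/dev/urandom of=/tmp/few/f$i bs=1000k count=10 2>/dev/null
done
# Write each set to tape and compare the elapsed times.
time tar cf /dev/sa1 /tmp/many
time tar cf /dev/sa1 /tmp/few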
Seeking forward on tape
If the storage daemon must seek forward on the tape, that will take some time. This will occur if the
tape has already been used, but is not positioned at the right place. For example, if you have restarted
Bacula, and the tape has been used, the SD will need to seek.
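You can watch this positioning from the command line with mt(1):

mt -f /dev/sa1 status    # report the current file and block number
mt -f /dev/sa1 fsf 3     # skip forward 3 filemarks; this takes real time on tape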
Spooling attributes
If you are not spooling attributes, file attributes are sent by the Storage daemon to the Director as
each file is saved to the Volume. If you spool the attributes, the Storage daemon instead buffers the
file attributes and storage coordinates in a temporary file in the Working Directory; once the job data
has been written to the tape, the attributes and storage coordinates are sent to the Director. This avoids
the possibility that database updates will slow down writing to the tape.
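Attribute spooling is turned on per Job in bacula-dir.conf. A minimal excerpt (the Job name is
hypothetical; only the Spool Attributes line matters here):

Job {
  Name = "NightlyBackup"      # hypothetical job name
  # ... other Job directives ...
  Spool Attributes = yes      # buffer attributes until tape writing completes
}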
Data spooling
Data spooling, as opposed to the attribute spooling discussed above, ensures that all the data is
available before the tape writing starts. This avoids situations where the tape must stop and start
because data is unavailable. This is especially obvious when your network is slower than your tape drive.
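Data spooling is also enabled per Job, and the Storage daemon controls where and how much gets spooled.
A sketch of the relevant directives (the directive names are real, the values are placeholders):

# in bacula-dir.conf:
Job {
  Name = "NightlyBackup"              # hypothetical job name
  # ...
  Spool Data = yes                    # spool job data to disk before writing to tape
}

# in bacula-sd.conf:
Device {
  Name = "DLT-Drive"                  # hypothetical device name
  # ...
  Spool Directory = /var/spool/bacula
  Maximum Spool Size = 10gb           # cap on disk space used for spooling
}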
One thing I’d like to do with this tape script is update some statistics somewhere and keep track of error
rates per volume.
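In the meantime, a rough version can be pulled straight from the log. A sketch, assuming the TapeTesting
log format shown above (the awk field positions will shift if your syslog lines include a hostname):

# average corrected errors / GB per volume, with run counts
awk '/TapeTesting/ { sum[$7] += $14; n[$7]++ }
     END { for (v in sum) printf "%s: %.1f avg corrected errors/GB over %d runs\n", v, sum[v]/n[v], n[v] }' /var/log/messages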
Remember to do a restore and compare the result to the original. That is important.
Store your tapes safely. Follow the manufacturer’s recommendations. Not too humid. Not too hot. Not too cold.
Lastly, have a good time with Bacula and DLT.