Feb 27 2006

How I test tapes and tape drives

I’ve written a number of articles about backups and about Bacula. As with anything, you must test your backups to ensure they work properly. Do you test your tape drives? Do you test your tapes? Do you test the backup process by also doing a restore? In this article, I’ll show you a script that pulls statistics from my DLT drive, and a script I use for testing tapes in a tape library. Why? If you just bought a tape drive, you want to know that it works. If you just took delivery of 10 used DLT tapes, you want to know they work. You also want to know that if you’re doing a backup, you can also do a restore.

A practical example

When I first started using Bacula, I had a DDS drive with a 4-tape magazine. It worked well. I started testing Bacula to make sure it did what I expected. The first thing I did was test backups that spanned two or more tapes. I created a large file, slightly larger than would fit on one tape. Then I told Bacula to back it up. The backup worked fine. Then I restored. That’s when I found the problem. The restored file was smaller than the original file. This led us to find a bug in FreeBSD (since fixed) related to pthreads. The moral: always test.

Statistics from the tape drive

I am a fan of DLT drives. I’ve been using them for over a year. What is DLT? Ask Google about Digital Linear Tape. DLT has been around for quite some time. It is quite robust. The tape drive mechanism is what impressed me. The recording surface is touched only by the recording head. This results in very little tape wear.

When I first started to use DLT, I hooked the drive up to a SCSI chain and wanted to know how much throughput I should expect. I was asking questions on the FreeBSD SCSI mailing list when someone sent me a very interesting script. It seems there are many factors that can affect throughput, two of which are tape quality and drive quality. If there are many errors, the drive must stop and start, moving the tape back and forth until it gets the right results. This can dramatically affect throughput. What you want to hear when writing is a constant feed of tape going through the drive. You don’t want to hear stop, rewind, start, repeat. Ideally, you only hear a stop/start when you get to the end of the tape.

The script in question allows me to query the tape drive and obtain a number of interesting statistics, one of which is “corrected errors”. Here is some sample output from my production tape drive:
# ~/bin/dlt sa0
The tape is 'sa0'
Corrected errors with substantial delay: 0
Corrected errors with possible delay   : 0
Total errors                           : 73
Total errors corrected                 : 73
Total times correction algorithm used  : 0
Total bytes processed                  : 1955930720
Total corrected errors / GB            : 40
Total uncorrected errors               : 0
Read compression ratio                 : 191%
On tape Mbytes read                    : 5
On tape kbytes read residual           : 971386

Corrected errors with substantial delay: 0
Corrected errors with possible delay   : 0
Total errors                           : 147
Total errors corrected                 : 147
Total times correction algorithm used  : 0
Total bytes processed                  : 7736171760
Total corrected errors / GB            : 20
Total uncorrected errors               : 0
Write compression ratio                : 228%
Host requested Mbytes written          : 13342
Host requested kbytes written residual : 487424
On tape Mbytes written                 : 5852
On tape kbytes written residual        : 0
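The “Total corrected errors / GB” figures in the sample output appear to follow from the other counters. The sketch below is only a check of that arithmetic, not the script itself: it assumes that “GB” means 2^30 bytes and that the result is truncated to a whole number, which is consistent with both blocks of output. The function name is made up for illustration.

```python
GIB = 2 ** 30  # assumption: the script's "GB" is 2**30 bytes


def corrected_per_gb(corrected_errors: int, bytes_processed: int) -> int:
    """Corrected errors per GB, truncated to a whole number."""
    return corrected_errors * GIB // bytes_processed


# Counters copied from the two blocks of sample output (read, then write).
read_rate = corrected_per_gb(73, 1_955_930_720)
write_rate = corrected_per_gb(147, 7_736_171_760)
print(read_rate, write_rate)  # prints: 40 20
```

Using 10^9 bytes per GB instead would give 37 and 19, which does not match the output, so the 2^30 reading seems the more likely one.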
In the above you can see the tape drive is correcting 20 errors for every GB of data written to the tape. This is a relatively good value. Other drives I’ve tested were getting upwards of 4000 corrected errors. I use this script as a relative indicator of good versus poor tapes and drives. I’ve found that the same tape used on two different drives can give very different error rates.

The script

This script was written for FreeBSD and makes use of camcontrol(8). I’m also quite sure that it interrogates the DLT drive using commands specific to DLT. Therefore, I doubt this script will work on non-FreeBSD systems or with non-DLT drives. I’m quite sure a similar script could be written for other operating systems. If you know of such a script, please let me know and I’ll happily include it here. You can download the script here. Use at your own risk, may cause cancer, your mileage may vary, etc.

Testing tapes

When all I had were single tape drives, I would pop in a tape, tar up some data, and check the stats. Now that I have a tape library, I can automate the changing of tapes and test 10 tapes at a time. This is a great time saver. This script dumps a bunch of files onto each tape, extracts the stats, and logs them. Then it does the same thing to the next tape. No read verification is done. All I’m interested in is corrected errors. You can get the script here. The output looks like this:
The information logged in /var/log/messages is:
By running the same tests multiple times, and perhaps on different drives, you can compare the results. For example, here are the results for one of my tapes. I have abbreviated the results so they fit better in your browser. In this case, JYN260 is the barcode on the tape.
# grep JYN260 /var/log/messages
Feb 21 10:12:16 TapeTesting: /dev/ch0 : JYN260S2 - corrected errors / GB : 69 - uncorrected errors : 0
Feb 21 15:12:55 TapeTesting: /dev/ch0 : JYN260S2 - corrected errors / GB : 58 - uncorrected errors : 0
Feb 21 18:19:11 TapeTesting: /dev/ch1 : JYN260S2 - corrected errors / GB : 23 - uncorrected errors : 0
Feb 21 21:34:45 TapeTesting: /dev/ch0 : JYN260S2 - corrected errors / GB : 36 - uncorrected errors : 0
Feb 21 23:39:33 TapeTesting: /dev/ch0 : JYN260S2 - corrected errors / GB : 32 - uncorrected errors : 0
Feb 22 10:14:17 TapeTesting: /dev/ch0 : JYN260S2 - corrected errors / GB : 23 - uncorrected errors : 0
Feb 22 12:03:11 TapeTesting: /dev/ch0 : JYN260S2 - corrected errors / GB : 26 - uncorrected errors : 0
Feb 22 13:31:39 TapeTesting: /dev/ch1 : JYN260S2 - corrected errors / GB : 12 - uncorrected errors : 0
Feb 22 16:15:47 TapeTesting: /dev/ch0 : JYN260S2 - corrected errors / GB : 26 - uncorrected errors : 0
Feb 24 11:00:03 TapeTesting: /dev/ch1 : JYN260S2 - corrected errors / GB : 12 - uncorrected errors : 0
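With more than a handful of runs, it can help to average the logged rates instead of reading them by eye. The following is a minimal sketch that assumes the log lines have exactly the format shown above; the regular expression and the function name `summarize` are illustrative and are not part of the tape-testing script.

```python
import re
from collections import defaultdict
from statistics import mean

# Assumed line format, taken from the grep output above.
LINE_RE = re.compile(
    r"TapeTesting: (?P<changer>\S+) : (?P<tape>\S+) - "
    r"corrected errors / GB : (?P<corrected>\d+) - "
    r"uncorrected errors : (?P<uncorrected>\d+)"
)


def summarize(lines):
    """Return {(tape, changer): mean corrected errors per GB}."""
    rates = defaultdict(list)
    for line in lines:
        match = LINE_RE.search(line)
        if match:  # silently skip lines that are not TapeTesting entries
            rates[(match["tape"], match["changer"])].append(int(match["corrected"]))
    return {key: mean(values) for key, values in rates.items()}


# (changer, corrected errors / GB) pairs copied from the ten log lines above.
runs = [("ch0", 69), ("ch0", 58), ("ch1", 23), ("ch0", 36), ("ch0", 32),
        ("ch0", 23), ("ch0", 26), ("ch1", 12), ("ch0", 26), ("ch1", 12)]
sample = [
    f"TapeTesting: /dev/{changer} : JYN260S2 - corrected errors / GB : {rate}"
    " - uncorrected errors : 0"
    for changer, rate in runs
]

for (tape, changer), average in sorted(summarize(sample).items()):
    print(f"{tape} on {changer}: {average:.1f} corrected errors / GB on average")
```

To run it against the real log, pass `open("/var/log/messages")` to `summarize` instead of the sample list. Grouping by changer device follows what the log records; it only stands in for the drive if each changer feeds a single drive.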
For this test, I had the following tapes in the magazine:
# ~dan/rc-chio-changer /dev/ch1 list
Here is what the script looks like when you run it:
# ./tape-testing.sh ch1 sa1
loading 2
tar: Removing leading '/' from member names
unloading 2
loading 3
tar: Removing leading '/' from member names
unloading 3
loading 4
tar: Removing leading '/' from member names
unloading 4
loading 5
tar: Removing leading '/' from member names
unloading 5

In this invocation, ch1 is the changer I want to test (identified by /dev/ch1) and sa1 is the tape drive. Looking in /var/log/messages we can see these results:

Feb 26 11:07:11 TapeTesting: /dev/ch1 : JYN257S2 - corrected errors / GB : 3 - uncorrected errors : 0
Feb 26 11:29:32 TapeTesting: /dev/ch1 : JYN249S2 - corrected errors / GB : 35 - uncorrected errors : 0
Feb 26 11:51:01 TapeTesting: /dev/ch1 : 001320 - corrected errors / GB : 9 - uncorrected errors : 0
Feb 26 12:13:29 TapeTesting: /dev/ch1 : JYN265S2 - corrected errors / GB : 95 - uncorrected errors : 0
As you can see, JYN257S2 has the fewest errors. I repeated this test several times for each tape, trying to write a large amount of data to each tape, so I could get a good impression of quality. I actually didn’t find very many tapes that had over 50 corrected errors per GB. I also found that the error rate would decrease the more times I wrote to the tape.

Testing drives

If you use the same tapes on different drives, you can then get an idea of the relative quality of each drive. For example, I could get corrected error rates of 6 or 8 / GB on one drive, but the same tapes on another drive gave 600+ corrected errors per GB. That is clearly an issue with the drive, not the tape.

Things that can affect throughput

This isn’t strictly related to tape testing, but it is based upon some recent observations. The following is just a set of items which I think will affect throughput (i.e. KB/s) when backing up data.

Lots of small files

If you are backing up lots of small files, you’ll need to spool more attributes to the catalog. While this won’t make a difference for a few hundred files, it will under certain circumstances. For example, if you’re backing up 10,000 files of 10KB each, you’ll have a lower throughput than if you are backing up 10 files of 10,000KB each, despite both having the same amount of data.
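To make the arithmetic concrete, here is a toy model of that comparison. It is only an illustration: it assumes each file adds a fixed overhead on top of the time spent streaming data, and both the streaming rate and the per-file overhead are made-up numbers, not measurements from Bacula or from any drive.

```python
def effective_throughput_kb_s(n_files, file_kb, stream_kb_s, per_file_s):
    """Throughput in KB/s when every file adds a fixed overhead in seconds."""
    total_kb = n_files * file_kb
    seconds = total_kb / stream_kb_s + n_files * per_file_s
    return total_kb / seconds


STREAM_KB_S = 5000   # hypothetical sustained rate of the drive, in KB/s
PER_FILE_S = 0.005   # hypothetical per-file overhead (catalog attributes etc.)

# Both jobs carry the same 100,000 KB of data.
small = effective_throughput_kb_s(10_000, 10, STREAM_KB_S, PER_FILE_S)
large = effective_throughput_kb_s(10, 10_000, STREAM_KB_S, PER_FILE_S)
print(f"10,000 files of 10KB: {small:.0f} KB/s")
print(f"10 files of 10,000KB: {large:.0f} KB/s")
```

With these assumed numbers the many-small-files job spends more time on per-file overhead than on moving data, which is the effect described above. Real overheads will differ, so treat the output as a shape, not a prediction.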

Seeking forward on tape

If the storage daemon must seek forward on the tape, that will take some time. This will occur if the tape has already been used, but is not positioned at the right place. For example, if you have restarted Bacula, and the tape has been used, the SD will need to seek.

Spooling attributes

If you are not spooling attributes, file attributes are sent by the Storage daemon to the Director as each file is saved to the Volume. If you spool the attributes, the Storage daemon will buffer the file attributes and storage coordinates to a temporary file in the Working Directory. Then, when writing the Job data to the tape is completed, the attributes and storage coordinates will be sent to the Director. This avoids the possibility that database updates will slow down writing to the tape.
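Attribute spooling is turned on per Job in the Director configuration. The fragment below is a sketch rather than a complete resource: the job name is hypothetical and the other directives a Job requires are omitted. Check the manual for your Bacula version before relying on it.

```
# bacula-dir.conf (fragment)
Job {
  Name = "ExampleBackup"      # hypothetical job name
  # ... other required Job directives omitted ...
  Spool Attributes = yes      # buffer attributes in the SD until the job data is on tape
}
```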

Data spooling

Data spooling, as opposed to attribute spooling discussed above, ensures that all the data is available before the tape writing starts. This avoids situations where the tape must stop/start when data is unavailable. This is especially obvious when your network is slower than your tape drive.

A few final points on testing tapes/drives

One thing I’d like to do with this tape script is update some statistics somewhere and keep track of error rates per volume. Remember to do a restore and compare the result to the original. That is important. Store your tapes safely. Follow the manufacturer’s recommendations. Not too humid. Not too hot. Not too cold. Lastly, have a good time with Bacula and DLT.