Filesystem data checksumming, pt. II

After my last post on filesystem data checksumming it took me a while until I could convince myself to actually set up regular checks of all the (important) files on my filesystems. The "fileserver" is a somewhat older machine, and checksumming ~1.5TB of data takes almost 4 (!) days. Admittedly, the fact that I chose SHA-256 as the hashing algorithm seems to contribute to this long runtime. As this is a private file server, MD5 would probably have been more than enough.

But I wanted to know if this would really make a difference, so I wrote a small benchmark script that tests different programs with different digests on a particular machine. As always, the results will differ greatly from machine to machine - the following results are from this PowerBook of mine:

$ time ./test-digests.sh test.img 30 2>&1 | tee out.log
[....]
=> This took 3.5 hours to complete!
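The script itself isn't shown here, so here's a minimal sketch of what test-digests.sh might look like - a reconstruction that matches the output format below, not the original script:

#!/bin/sh
# Sketch of test-digests.sh - a reconstruction, not the original.
# Usage: ./test-digests.sh <file> <runs>
FILE=$1
RUNS=$2

run_test() {
    # run_test <name> <digest> <command...>: time <runs> invocations
    NAME=$1 DIGEST=$2; shift 2
    command -v "$1" > /dev/null || { echo "TEST: $NAME / DIGEST: $DIGEST / SKIPPED"; return; }
    START=$(date +%s)
    i=0
    while [ "$i" -lt "$RUNS" ]; do
        "$@" "$FILE" > /dev/null
        i=$((i + 1))
    done
    END=$(date +%s)
    echo "TEST: $NAME / DIGEST: $DIGEST / $((END - START)) seconds over $RUNS runs"
}

for d in md5 sha1 sha224 sha256 sha384 sha512; do
    run_test coreutils "$d" "${d}sum"            # md5sum, sha1sum, ...
    run_test openssl   "$d" openssl dgst "-$d"
    run_test rhash     "$d" rhash "--$d"
done
run_test openssl ripemd160 openssl dgst -ripemd160
run_test rhash   ripemd160 rhash --ripemd160
# (The perl tests, presumably via Digest::MD5 / Digest::SHA, are omitted in this sketch.)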

$ grep ^TEST out.log | egrep -v 'rhash_benchmark|SKIPPED' | sort -nk7
TEST: coreutils / DIGEST: md5 / 58 seconds over 30 runs
TEST: openssl / DIGEST: sha1 / 64 seconds over 30 runs
TEST: rhash / DIGEST: sha1 / 64 seconds over 30 runs
TEST: openssl / DIGEST: md5 / 75 seconds over 30 runs
TEST: rhash / DIGEST: md5 / 84 seconds over 30 runs
TEST: perl / DIGEST: sha1 / 121 seconds over 30 runs
TEST: rhash / DIGEST: sha224 / 140 seconds over 30 runs
TEST: openssl / DIGEST: sha224 / 141 seconds over 30 runs
TEST: rhash / DIGEST: sha256 / 141 seconds over 30 runs
TEST: openssl / DIGEST: sha256 / 169 seconds over 30 runs
TEST: coreutils / DIGEST: sha1 / 177 seconds over 30 runs
TEST: rhash / DIGEST: ripemd160 / 305 seconds over 30 runs
TEST: openssl / DIGEST: ripemd160 / 447 seconds over 30 runs
TEST: perl / DIGEST: sha256 / 637 seconds over 30 runs
TEST: perl / DIGEST: sha224 / 641 seconds over 30 runs
TEST: coreutils / DIGEST: sha256 / 653 seconds over 30 runs
TEST: coreutils / DIGEST: sha224 / 657 seconds over 30 runs
TEST: perl / DIGEST: sha384 / 660 seconds over 30 runs
TEST: perl / DIGEST: sha512 / 661 seconds over 30 runs
TEST: rhash / DIGEST: sha512 / 693 seconds over 30 runs
TEST: openssl / DIGEST: sha384 / 694 seconds over 30 runs
TEST: rhash / DIGEST: sha384 / 695 seconds over 30 runs
TEST: openssl / DIGEST: sha512 / 696 seconds over 30 runs
TEST: coreutils / DIGEST: sha512 / 1513 seconds over 30 runs
TEST: coreutils / DIGEST: sha384 / 1515 seconds over 30 runs
Two entries stand out here:
  • Originally I used coreutils to calculate a SHA-256 checksum of each file. In the test run above this takes 11 times longer to complete (653 vs. 58 seconds) than MD5 would have taken.
  • Even if I decide against MD5 and choose SHA-1 instead, I'd have to switch to openssl, because for some reason coreutils takes almost 3 times as long (177 vs. 64 seconds).
Given these results, I'll probably switch to MD5 for my data checksums - which also means that I have to 1) re-generate an MD5 checksum for all files and 2) remove the now-obsolete SHA-256 checksum from all files :-\
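Assuming the checksums are attached to the files as extended attributes, the swap boils down to something like this per file - a sketch only, with attribute names that are purely illustrative:

f=somefile
md5=$(md5sum "$f" | awk '{print $1}')
setfattr -n user.checksum.md5 -v "$md5" "$f"   # 1) attach the new MD5 checksum
setfattr -x user.checksum.sha256 "$f"          # 2) remove the obsolete SHA-256
getfattr -d -m user.checksum "$f"              # show what's attached now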

Update 1: I omitted cksum and sum from the tests above, as they're not necessarily faster than other checksum tools (the md5sum result from the first run is included below for comparison):
$ n=30
$ for t in sum cksum openssl\ {md4,md5}; do
    START=$(date +%s)
    for a in `seq 1 $n`; do
        $t test.img > /dev/null
    done
    END=$(date +%s)
    echo "TEST: $t / $(echo $END - $START | bc -l) seconds over $n runs"
done | sed 's/ md/_md/' | sort -nk4
TEST: openssl_md4 / 56 seconds over 30 runs
TEST: md5sum / 58 seconds over 30 runs
TEST: sum / 75 seconds over 30 runs
TEST: openssl_md5 / 76 seconds over 30 runs
TEST: cksum / 78 seconds over 30 runs
But again: these tests would have to be repeated on different systems - it could very well be that cksum really is faster than everything else on another machine. Or maybe not :-)

Update 2: And indeed it helped: removing the SHA-256 checksum and calculating & attaching the MD5 checksum for 1.5TB of data (88k files) took "only" 31 hours. That's still a lot, but much shorter than the "almost 4 days" we had with SHA-256 :-) Also, the next run won't have to remove the old checksum - it only has to do the verification step. What skewed this number even more was the fact that backups were running on the machine while it was recalculating the checksums, so hopefully the next run will be even shorter.
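The verification step is conceptually much simpler - again assuming xattr-stored checksums and illustrative attribute names, it's something like this per file:

f=somefile
stored=$(getfattr --only-values -n user.checksum.md5 "$f" 2>/dev/null)
current=$(md5sum "$f" | awk '{print $1}')
[ "$stored" = "$current" ] || echo "CHECKSUM MISMATCH: $f"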

Update 3: Just to document the cronjob running these regular checks from now on:
0 4 1 */2 *  root  /usr/local/sbin/cron-checksum_file.sh all
This will run at 4am on the first day of every second month.
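cron-checksum_file.sh itself isn't reproduced here; a rough sketch of what such a wrapper might do - the directory list and the per-file command are pure assumptions:

#!/bin/sh
# Hypothetical wrapper: verify the stored checksum of every file.
# "all" means every configured directory; the list below is made up.
case "$1" in
    all) DIRS="/data /export /home" ;;
    *)   DIRS="$1" ;;
esac
for d in $DIRS; do
    find "$d" -type f -print0 | xargs -0 -n1 checksum_file.sh verify
done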

Update 4: I just had to verify the checksums of two ISO images and did another comparison on the same PowerBook G4 machine:
$ ls -goh *.iso
-rw-r--r-- 1 3.2G Jul 24 16:20 file1.iso
-rw-r--r-- 1  440 Jul 24 19:58 file1.iso.sum
-rw-r--r-- 1 4.2G Jul 24 16:51 file2.iso
-rw-r--r-- 1  440 Jul 24 19:58 file2.iso.sum
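The *.iso.sum files themselves aren't shown here, but for "$a"sum -c to work with all four digests, each presumably contains one checksum line per digest - generated with something like this (an assumption; the original files may differ):

$ for f in file1.iso file2.iso; do
    for a in md5 sha1 sha256 sha512; do "${a}sum" "$f"; done > "$f".sum
done

Each tool then verifies the line it understands; GNU coreutils merely warns about lines it can't parse as "improperly formatted".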

$ for a in md5 sha1 sha256 sha512; do echo "$a"; time "$a"sum -c *.iso.sum; echo; done
md5
real    8m17.404s
user    0m56.588s
sys     0m28.444s

sha1
real    11m12.638s
user    3m20.220s
sys     0m28.044s

sha256
real    21m12.057s
user    12m47.092s
sys     0m37.156s

sha512
real    40m56.836s
user    29m55.444s
sys     0m39.684s
So, from SHA-1 onwards, each "stronger" algorithm basically doubles the execution time of the "weaker" one before it. Again, MD5 is more than enough for our use case here.