Filesystem data checksumming, pt. II
After my last post on filesystem data checksumming it took me a while until I could convince myself to actually set up regular checks of all the (important) files on my filesystems. The "fileserver" is a somewhat older machine and checksumming ~1.5TB of data takes almost 4 (!) days. Admittedly, the fact that I chose SHA-256 as a hashing algorithm seems to contribute to this long runtime. This being a private file server, MD5 would probably have been more than enough.
But I wanted to know if this would really make a difference, so I wrote a small benchmark script testing different programs and different digests on a particular machine.
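The script itself isn't reproduced here, but the idea is simple: time N runs of every tool/digest combination over the same test file. The following is only a sketch of that idea, showing just the coreutils and openssl variants (the rhash and Perl ones work the same way); the exact invocations are my assumption, not the actual test-digests.sh:

#!/bin/sh
# Sketch only: time N runs of each tool/digest combination over one file.
FILE=$1
N=${2:-30}

run() {
    name=$1; shift
    START=$(date +%s)
    i=1
    while [ "$i" -le "$N" ]; do
        "$@" "$FILE" > /dev/null
        i=$((i + 1))
    done
    END=$(date +%s)
    echo "TEST: $name / $((END - START)) seconds over $N runs"
}

for d in md5 sha1 sha224 sha256 sha384 sha512; do
    run "coreutils / DIGEST: $d" "${d}sum"
    run "openssl / DIGEST: $d" openssl dgst "-$d"
done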
As always, the results will differ greatly from machine to machine - the following results are for this PowerBook of mine:

$ time ./test-digests.sh test.img 30 2>&1 | tee out.log
[....]
=> This took 3.5 hours to complete!

$ grep ^TEST out.log | egrep -v 'rhash_benchmark|SKIPPED' | sort -nk7
TEST: coreutils / DIGEST: md5 / 58 seconds over 30 runs
TEST: openssl / DIGEST: sha1 / 64 seconds over 30 runs
TEST: rhash / DIGEST: sha1 / 64 seconds over 30 runs
TEST: openssl / DIGEST: md5 / 75 seconds over 30 runs
TEST: rhash / DIGEST: md5 / 84 seconds over 30 runs
TEST: perl / DIGEST: sha1 / 121 seconds over 30 runs
TEST: rhash / DIGEST: sha224 / 140 seconds over 30 runs
TEST: openssl / DIGEST: sha224 / 141 seconds over 30 runs
TEST: rhash / DIGEST: sha256 / 141 seconds over 30 runs
TEST: openssl / DIGEST: sha256 / 169 seconds over 30 runs
TEST: coreutils / DIGEST: sha1 / 177 seconds over 30 runs
TEST: rhash / DIGEST: ripemd160 / 305 seconds over 30 runs
TEST: openssl / DIGEST: ripemd160 / 447 seconds over 30 runs
TEST: perl / DIGEST: sha256 / 637 seconds over 30 runs
TEST: perl / DIGEST: sha224 / 641 seconds over 30 runs
TEST: coreutils / DIGEST: sha256 / 653 seconds over 30 runs
TEST: coreutils / DIGEST: sha224 / 657 seconds over 30 runs
TEST: perl / DIGEST: sha384 / 660 seconds over 30 runs
TEST: perl / DIGEST: sha512 / 661 seconds over 30 runs
TEST: rhash / DIGEST: sha512 / 693 seconds over 30 runs
TEST: openssl / DIGEST: sha384 / 694 seconds over 30 runs
TEST: rhash / DIGEST: sha384 / 695 seconds over 30 runs
TEST: openssl / DIGEST: sha512 / 696 seconds over 30 runs
TEST: coreutils / DIGEST: sha512 / 1513 seconds over 30 runs
TEST: coreutils / DIGEST: sha384 / 1515 seconds over 30 runs

Two entries stand out here:
- Originally I used coreutils to calculate a SHA-256 checksum of each file. In the test run above this takes 11 times longer to complete than MD5 would have taken.
- Even if I decide against MD5 and choose SHA-1 instead, I'd have to switch to openssl, because for some reason coreutils takes almost 3 times longer to complete.
So I'll settle for MD5 for my data checksums - this also means that I have to 1) re-generate an MD5 checksum for all files and 2) remove the now-obsolete SHA-256 checksums from all files :-\
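Since the checksums are attached to the files themselves, that migration boils down to a loop over all files. The following is only a sketch, under the assumption that the checksums live in extended attributes; the attribute names (user.checksum.sha256, user.checksum.md5) and the /data path are illustrative, not necessarily what my actual script uses:

#!/bin/bash
# Sketch only: drop the old SHA-256 attribute, attach a fresh MD5 one.
# Attribute names and the /data path are illustrative assumptions.
find /data -type f -print0 | while IFS= read -r -d '' f; do
    setfattr -x user.checksum.sha256 "$f" 2>/dev/null   # may not exist on every file
    setfattr -n user.checksum.md5 -v "$(md5sum "$f" | awk '{print $1}')" "$f"
done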
Update 1: I omitted cksum and sum from the tests above, as they're not necessarily faster than other checksum tools:
$ n=30
$ for t in sum cksum openssl\ {md4,md5}; do
    START=$(date +%s)
    for a in `seq 1 $n`; do
      $t test.img > /dev/null
    done
    END=$(date +%s)
    echo "TEST: $t / $(echo $END - $START | bc -l) seconds over $n runs"
  done | sed 's/ md/_md/' | sort -nk4
TEST: openssl_md4 / 56 seconds over 30 runs
TEST: md5sum / 58 seconds over 30 runs
TEST: sum / 75 seconds over 30 runs
TEST: openssl_md5 / 76 seconds over 30 runs
TEST: cksum / 78 seconds over 30 runs

But again: these tests will have to be repeated on different systems, it could very well be that
cksum comes out ahead on another machine.
Update 2: And it helped indeed: removing the SHA-256 checksum and calculating & attaching the MD5 checksum on 1.5TB of data (88k files) took "only" 31 hours. Which is still a lot, but a lot shorter than the "almost 4 days" we had with SHA-256 :-) Also, the next run won't have to remove the old checksums - it only has to do the verification step. What skewed this number even more was the fact that backups were running on the machine while it was re-calculating everything, so hopefully the next run will be even shorter.
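Under the same extended-attribute assumption as in the sketch above, that verification step could look roughly like this (again illustrative, not my actual script):

#!/bin/bash
# Sketch only: compare the stored MD5 attribute against a freshly
# computed checksum and report any mismatches.
find /data -type f -print0 | while IFS= read -r -d '' f; do
    stored=$(getfattr --only-values -n user.checksum.md5 "$f" 2>/dev/null) || continue
    [ "$stored" = "$(md5sum "$f" | awk '{print $1}')" ] || echo "MISMATCH: $f"
done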
Update 3: Just to document the cronjob running these regular checks from now on:
0 4 1 */2 * root /usr/local/sbin/cron-checksum_file.sh all

This will be run on the first day of every second month, at 4am.
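The wrapper script itself isn't shown in this post. Purely as an illustration, a hypothetical cron-checksum_file.sh could take care of locking (a run may take more than a day, see above) and of staying quiet unless something fails, so that cron only mails root when there's a problem. Everything in here - the paths, the lock file, the inner loop - is an assumption on my part:

#!/bin/bash
# Hypothetical sketch of a cron wrapper, not the actual script.
LOCK=/var/run/cron-checksum_file.lock
LOG=/var/log/checksum_file.log

[ -e "$LOCK" ] && exit 0              # a previous (long) run is still active
echo $$ > "$LOCK"
trap 'rm -f "$LOCK"' EXIT

# "$1" would select what to check ("all" in the crontab above); this
# sketch simply re-uses the verification loop from Update 2 on /data.
find /data -type f -print0 | while IFS= read -r -d '' f; do
    stored=$(getfattr --only-values -n user.checksum.md5 "$f" 2>/dev/null) || continue
    [ "$stored" = "$(md5sum "$f" | awk '{print $1}')" ] || echo "MISMATCH: $f"
done > "$LOG" 2>&1

[ -s "$LOG" ] && cat "$LOG"           # non-empty output makes cron send mail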
Update 4: I just had to verify the checksums of two ISO images and did another comparison on the same PowerBook G4 machine:
$ ls -goh *.iso*
-rw-r--r-- 1 3.2G Jul 24 16:20 file1.iso
-rw-r--r-- 1  440 Jul 24 19:58 file1.iso.sum
-rw-r--r-- 1 4.2G Jul 24 16:51 file2.iso
-rw-r--r-- 1  440 Jul 24 19:58 file2.iso.sum

$ for a in md5 sha1 sha256 sha512; do echo "$a"; time "$a"sum -c *.iso.sum; echo; done
md5
real    8m17.404s
user    0m56.588s
sys     0m28.444s

sha1
real    11m12.638s
user    3m20.220s
sys     0m28.044s

sha256
real    21m12.057s
user    12m47.092s
sys     0m37.156s

sha512
real    40m56.836s
user    29m55.444s
sys     0m39.684s

So, each of the chosen "stronger" algorithms basically doubles the execution time of the next "weaker" one. The user times are telling, too: the MD5 run is mostly I/O-bound (under a minute of CPU time against 8 minutes of wall clock), while SHA-512 spends most of its wall clock time computing. Again, md5 is more than enough for our use case here.
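A side note on those *.iso.sum files: at 440 bytes each they presumably hold one checksum line per digest, which is why every *sum tool can be pointed at the same file - each one verifies its own lines and warns about the rest. Just as an illustration, and not necessarily how these particular files were created, such a combined .sum file could be generated like this:

#!/bin/bash
# Sketch only: one checksum line per digest, all in FILE.sum, so that
# md5sum -c, sha1sum -c etc. can each verify "their" line.
for f in *.iso; do
    for a in md5 sha1 sha256 sha512; do
        "${a}sum" "$f"
    done > "$f.sum"
done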