Filesystem data checksumming

2013-05-18 11:05

While Linux has many filesystems to offer, almost none of them features data checksumming.

Btrfs is the exception, as it offers CRC-32C to checksum data and metadata. However, it is left to the reader to decide if btrfs is ready yet.

Ext4 features metadata checksums, but this is still very experimental and needs a development version of e2fsprogs.

XFS is adding metadata checksums too, but this is also a work in progress.

Of course, everybody is looking to ZFS: Solaris has had it since 2005, FreeBSD introduced it in 2007. What about Linux? The ZFSonLinux project looks quite promising, while ZFS-Fuse seems to be more of a proof of concept.

On MacOS X we have a similar picture: there's MacZFS, which I haven't looked into in a long time. But apparently it's supported for 64-bit kernels now, so maybe that's something to try again. And then there's Zevo, a commercial port of ZFS for Mac.

All in all, I wouldn't use these experiments for actual data storage just yet. Thus I decided to have data checksums in userspace - that is, generating a checksum for a file, storing it and checking it once in a while. Of course, this has various implications and drawbacks:

The filesystem is already in place and lots of files are already stored. By generating a checksum now we cannot say for sure if the file is still "correct" and we may generate a "correct" checksum for an already corrupt file.
Generating a checksum for new files doesn't happen on-the-fly. Instead it has to be done regularly (and possibly automatically too) whenever new files are added to the filesystem. While not generating checksums on-the-fly could mean better write performance compared to filesystems with data checksumming enabled, there's serious I/O to be expected once our "automatic" checksum generating scripts kick in.
After generating checksums we also have to implement a regular (and possibly automatic) verification and some kind of remediation process for what to do if a checksum doesn't match. Is there a backup available? Does the checksum of the backup file validate? What if there are two backup locations and both of them have different (but validating) checksums? Mayhem!
Where do we store the checksums? In a separate file on the same filesystem? On a different filesystem, on some offsite storage? Where do we store the checksum for the checksum file?

Except for the last question there are probably no good answers, and each of these may be a major issue depending on the setup. However, for me this was the only viable way to go for now: there's no ZFS port for this 12" PowerBook G4 and I didn't trust btrfs enough to hold my data.

In light of all these obstacles I wrote a small shell script that generates a checksum for a file and stores it as an extended attribute. Most filesystems support extended attributes, and the script tries to accommodate MacOS X, Linux and Solaris (just in case UFS is in use).
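To illustrate the idea (this is not the actual script), here is a minimal sketch of storing and checking a checksum in an extended attribute on Linux. It assumes SHA-256 as the algorithm, the setfattr/getfattr tools from the attr package, and an attribute name (user.checksum.sha256) that is made up for this example; the function names are made up as well.

```shell
#!/bin/sh
# Hypothetical attribute name; on Linux, user attributes live in the "user." namespace.
ATTR="user.checksum.sha256"

# Print the SHA-256 digest of a file (digest only, without the file name).
file_digest() {
    if command -v sha256sum >/dev/null 2>&1; then
        sha256sum < "$1" | cut -d' ' -f1
    else
        # Fallback for systems that ship shasum instead of sha256sum.
        shasum -a 256 < "$1" | cut -d' ' -f1
    fi
}

# Store the current digest as an extended attribute (needs setfattr).
set_digest() {
    setfattr -n "$ATTR" -v "$(file_digest "$1")" -- "$1"
}

# Compare the stored digest with a freshly computed one (needs getfattr).
check_digest() {
    stored=$(getfattr --only-values -n "$ATTR" -- "$1" 2>/dev/null) || {
        echo "NO CHECKSUM: $1"
        return 2
    }
    if [ "$stored" = "$(file_digest "$1")" ]; then
        echo "OK: $1"
    else
        echo "MISMATCH: $1"
        return 1
    fi
}
```

On MacOS X the same could probably be done with "xattr -w" and "xattr -p" instead of setfattr and getfattr.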

The script needs to be run once for the full filesystem:

 find . -type f | while IFS= read -r a; do checksum_file.sh set "$a"; done

...and regularly when new files are added:

 find . -type f | while IFS= read -r a; do checksum_file.sh check-set "$a"; done

...and again to validate already stored checksums:

 find . -type f | while IFS= read -r a; do checksum_file.sh check "$a"; done
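The loops above read file names line by line, so a name containing a newline would be split in two. A possible alternative, sketched here under the assumption that checksum_file.sh is in the PATH, is to let find start the script itself. The helper name for_each_file is made up for this example.

```shell
#!/bin/sh
# Run a command once for every regular file below a directory.
# find passes each path as a single argument, so spaces, backslashes
# and even newlines in file names are handled correctly.
for_each_file() {
    dir=$1
    shift
    find "$dir" -type f -exec "$@" {} \;
}

# Usage (assuming checksum_file.sh is in the PATH):
#   for_each_file . checksum_file.sh set
#   for_each_file . checksum_file.sh check-set
#   for_each_file . checksum_file.sh check
```

The price is the same as with the loops: one process is started per file, which should hardly matter next to the time spent reading the data.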

Enjoy! :-)

Notes:

On XFS, one needs to pay attention when EAs are used. Usually the attributes are stored in the inode - but when the attribute is too large for the inode, it has to be stored somewhere else and performance suffers :-\ Better use a bigger inode size when creating the XFS filesystem. This might or might not be true for other filesystems.

For JFS, the inode is fixed at 512 bytes and space for inline xattr data is limited to 128 bytes. Anything bigger will require more data blocks for the extended attributes to be stored.

While checksums for files may be important, this won't address corruption in other places of your (and my) machine. #tinfoilhat

Update: FWIW, on this particular machine of mine (throttled to 750 MHz), with an external SATA disk attached via FW-400, a full initialization or verification of 800 GB of data takes about one day. Yes, a full day. The major bottleneck seems to be the CPU though: the disk delivers around 30 MB/s, but the dm-crypt layer slows this down to ~8 MB/s. With a newer machine this should be much faster.
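As a rough sanity check of these numbers, here is a small sketch that estimates the duration of a full pass. It assumes decimal units (1 GB = 1000 MB) and a constant transfer rate, and it rounds down to whole hours; the function name estimate_hours is made up for this example.

```shell
#!/bin/sh
# Estimate the whole hours needed to read a given amount of data.
# $1 = data size in GB, $2 = sustained throughput in MB/s
estimate_hours() {
    seconds=$(( $1 * 1000 / $2 ))
    echo $(( seconds / 3600 ))
}

estimate_hours 800 8    # prints 27: roughly the "full day" seen through dm-crypt
estimate_hours 800 30   # prints 7: what the raw disk speed would allow
```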