Checking the integrity of a large DreamObjects file archive

dreamobjects

#1

Hi everyone,

I have several DreamObjects buckets which correspond to top-level directories of a large archive of personal files. I’d like to start relying on DO buckets as a permanent cloud backup, and as a way to refer back to files from old projects. The archive is not only fairly large, but its directory tree is also wide and deep, so syncing a bucket takes some churning by s3cmd or DragonDisk. Sometimes an error or something pops up and then I’m left to wonder, ‘Well, are my buckets a complete and accurate backup of my local files?’ Is there a way to confirm that the local folders match the buckets?

Cheers,
Ryan


#2

One approach would be to use s3cmd to recursively list all the files inside the buckets along with their MD5 signatures. Then you could compare the list of files and their MD5s with your local files to see if anything is either missing or different.

$ s3cmd ls --list-md5 -r [s3://BUCKET[/PREFIX]]

and locally you’d need to do something like

$ find . -type f -exec md5sum {} +

and you can compare the two lists.


#3

Thanks, smaffulli! That worked very well. I executed the commands you gave. One note: the two commands format their output differently. ‘s3cmd ls --list-md5’ places the md5 signatures in column 3, or rather,

array[timestamp, size, md5, s3://bucket/object]

and your find command puts the md5 signature in column 1, followed by the path of each file.

Someone with better knowledge of shell scripting may know how to quickly compare the md5s in each; I pulled them both into OpenOffice Calc, took out the columns I didn’t need, sorted them, and then, after consulting this thread on StackExchange, was able to confirm that one of my buckets is absolutely the same as the local folder. For now that is fine, and I know where to look if I want to do a complete test.

Cheers,
Ryan


#4

Indeed, Ryan, the outputs of the two commands are not immediately comparable, but with a bit of awk or perl that should be easily fixed.
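For example, here is a rough sketch of the awk route, not a tested recipe for every setup. It assumes each line of the s3cmd listing splits on whitespace into date, time, size, md5 and the s3:// URI (so the md5 is the fourth whitespace-separated field), and that md5sum prints the 32-character hash, two separator characters, then the path. The function names normalize_remote and normalize_local and the file names in the usage comments are made up for illustration.

```shell
#!/bin/sh
# Turn `s3cmd ls --list-md5 -r` output into sorted "md5<TAB>relative/path" lines.
# Assumed input layout: date, time, size, md5, s3://bucket/key.
# $1 is the URI prefix to strip, e.g. s3://BUCKET/ (placeholder).
normalize_remote() {
    awk -v prefix="$1" '
        {
            start = index($0, "s3://")      # locate the URI; keys may contain spaces
            if (start == 0) next            # skip lines that carry no object URI
            key = substr($0, start)
            if (index(key, prefix) == 1) key = substr(key, length(prefix) + 1)
            print $4 "\t" key
        }' | LC_ALL=C sort
}

# Turn `md5sum` output ("md5  ./path") into the same shape.
normalize_local() {
    awk '
        {
            path = substr($0, 35)           # skip 32 hex digits plus 2 separator characters
            sub(/^\.\//, "", path)          # drop the leading "./" that find adds
            print $1 "\t" path
        }' | LC_ALL=C sort
}

# Possible usage (BUCKET and /path/to/local are placeholders):
#   s3cmd ls --list-md5 -r s3://BUCKET/ | normalize_remote s3://BUCKET/ > remote.txt
#   (cd /path/to/local && find . -type f -exec md5sum {} +) | normalize_local > local.txt
#   diff local.txt remote.txt && echo "bucket and local folder match"
```

If diff prints nothing, the two lists agree. Any line it does print points to a file that is missing on one side or whose md5 differs.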

Anyway, we have just published a new wiki page describing how to use duplicity on Linux with DreamObjects. Maybe that’s a more solid way to manage your offsite backups? Check it out: http://wiki.dreamhost.com/How_to_Use_duplicity_with_DreamObjects