
Trying to create a backup script that verifies files w MD5.


All times are UTC - 6 hours


 PostPosted: Wed Sep 20, 2006 7:00 pm   

Joined: Wed Sep 20, 2006 6:46 pm
Posts: 4
I'm trying to write a script to automate the backup of a large collection of digital photos. I'd like the script to do several things:

1. Compare the md5 checksums of the current folder and the backup folder, and notify me if one doesn't match the other. That way, I can check the photos/videos and determine which one has been corrupted. If my source file has gone bad, I don't want to simply write over the good backup file with my latest, corrupted copy! This seems to be what many typical backup applications do. (I did have a good experience with an application called SyncBack that did do this in my Windows days, but there doesn't seem to be an equivalent in Linux. Hence the need to make it myself!)

I've had some pretty good luck using md5deep, and its recursive directory abilities.

Code:
#!/bin/bash
echo "Generating hashes of the pictures."
# -r recurse, -b bare (pathless) filenames, -o f regular files only
md5deep -rbo f /home/boelcke/pictures > orig.txt
echo "Generating hashes on the copy already in backup."
md5deep -rbo f /backup/boelcke/pictures > bak.txt
echo "Now making a differences file called differ.txt."
diff orig.txt bak.txt > differ.txt
echo "Done!"


2. If a file doesn't exist in the backup, but does in the source, I'll want to copy the file over.

Here's where I'm having trouble. When I run md5deep on my directories, it can generate the hashes for all the files, and even show when two are different. What it doesn't do is indicate when one file is simply missing. What command can I use to determine what files exist in one place, but not the other?
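One way to get exactly those missing-file lists is comm(1) on sorted lists of relative paths. A self-contained sketch (the directories and filenames below are throwaway demo paths, not the poster's real picture folders):

```shell
#!/bin/bash
# Sketch: list files present in one tree but not the other by comparing
# sorted lists of relative paths with comm(1). All paths here are
# throwaway demo data created on the fly.
demo=$(mktemp -d)
mkdir -p "$demo/src/2004" "$demo/bak/2004"
echo photo > "$demo/src/2004/a.jpg"   # only in the source
echo photo > "$demo/src/2004/b.jpg"   # in both
echo photo > "$demo/bak/2004/b.jpg"
echo photo > "$demo/bak/old.jpg"      # only in the backup

(cd "$demo/src" && find . -type f | sort) > "$demo/src.list"
(cd "$demo/bak" && find . -type f | sort) > "$demo/bak.list"

echo "In source but not backup:"
comm -23 "$demo/src.list" "$demo/bak.list"   # ./2004/a.jpg
echo "In backup but not source:"
comm -13 "$demo/src.list" "$demo/bak.list"   # ./old.jpg
rm -rf "$demo"
```

comm requires sorted input, hence the sort; -23 suppresses lines common to both files and lines unique to the second, leaving only what's unique to the first, and -13 does the reverse.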

3. If a file doesn't exist in the source, but does in the backup, prompt me for what to do about it.

This should be fairly easy, once I've got a solution for #2!

Thanks in advance for any help or pointers in the right direction. Once I have this working, I'd also like to use the same structure to maintain a backup of my /home directory on a separate drive.


 PostPosted: Thu Sep 21, 2006 9:22 am   
Site Admin
User avatar

Joined: Sun May 15, 2005 9:36 pm
Posts: 669
Location: Des Moines, Iowa
Moved to the sandbox...

FWIW, I think you can do everything you're looking to do with rsync:

Quote:
rsync version 2.6.6 protocol version 29
Copyright (C) 1996-2005 by Andrew Tridgell and others
<http://rsync.samba.org/>
Capabilities: 64-bit files, socketpairs, hard links, ACLs, symlinks, batchfiles,
inplace, IPv6, 64-bit system inums, 64-bit internal inums, SLP

rsync comes with ABSOLUTELY NO WARRANTY. This is free software, and you
are welcome to redistribute it under certain conditions. See the GNU
General Public Licence for details.

rsync is a file transfer program capable of efficient remote update
via a fast differencing algorithm.

Usage: rsync [OPTION]... SRC [SRC]... [USER@]HOST:DEST
or rsync [OPTION]... [USER@]HOST:SRC [DEST]
or rsync [OPTION]... SRC [SRC]... DEST
or rsync [OPTION]... [USER@]HOST::SRC [DEST]
or rsync [OPTION]... SRC [SRC]... [USER@]HOST::DEST
or rsync [OPTION]... rsync://[USER@]HOST[:PORT]/SRC [DEST]
or rsync [OPTION]... SRC [SRC]... rsync://[USER@]HOST[:PORT]/DEST
SRC on single-colon remote HOST will be expanded by remote shell
SRC on server remote HOST may contain shell wildcards or multiple
sources separated by space as long as they have same top-level

Options
-v, --verbose increase verbosity
-q, --quiet suppress non-error messages
-c, --checksum skip based on checksum, not mod-time & size
-a, --archive archive mode; same as -rlptgoD (no -H)
-r, --recursive recurse into directories
-R, --relative use relative path names
--no-relative turn off --relative
--no-implied-dirs don't send implied dirs with -R
-b, --backup make backups (see --suffix & --backup-dir)
--backup-dir=DIR make backups into hierarchy based in DIR
--suffix=SUFFIX set backup suffix (default ~ w/o --backup-dir)
-u, --update skip files that are newer on the receiver
--inplace update destination files in-place (SEE MAN PAGE)
-d, --dirs transfer directories without recursing
-l, --links copy symlinks as symlinks
-L, --copy-links transform symlink into referent file/dir
--copy-unsafe-links only "unsafe" symlinks are transformed
--safe-links ignore symlinks that point outside the source tree
-H, --hard-links preserve hard links
-K, --keep-dirlinks treat symlinked dir on receiver as dir
-p, --perms preserve permissions
-A, --acls preserve ACLs (implies --perms)
-o, --owner preserve owner (root only)
-g, --group preserve group
-D, --devices preserve devices (root only)
-t, --times preserve times
-O, --omit-dir-times omit directories when preserving times
-S, --sparse handle sparse files efficiently
-n, --dry-run show what would have been transferred
-W, --whole-file copy files whole (without rsync algorithm)
--no-whole-file always use incremental rsync algorithm
-x, --one-file-system don't cross filesystem boundaries
-B, --block-size=SIZE force a fixed checksum block-size
-e, --rsh=COMMAND specify the remote shell to use
--rsync-path=PROGRAM specify the rsync to run on the remote machine
--existing only update files that already exist on receiver
--ignore-existing ignore files that already exist on receiving side
--remove-sent-files sent files/symlinks are removed from sending side
--del an alias for --delete-during
--delete delete files that don't exist on the sending side
--delete-before receiver deletes before transfer (default)
--delete-during receiver deletes during transfer, not before
--delete-after receiver deletes after transfer, not before
--delete-excluded also delete excluded files on the receiving side
--ignore-errors delete even if there are I/O errors
--force force deletion of directories even if not empty
--max-delete=NUM don't delete more than NUM files
--max-size=SIZE don't transfer any file larger than SIZE
--partial keep partially transferred files
--partial-dir=DIR put a partially transferred file into DIR
--delay-updates put all updated files into place at transfer's end
--numeric-ids don't map uid/gid values by user/group name
--timeout=TIME set I/O timeout in seconds
-I, --ignore-times don't skip files that match in size and mod-time
--size-only skip files that match in size
--modify-window=NUM compare mod-times with reduced accuracy
-T, --temp-dir=DIR create temporary files in directory DIR
-y, --fuzzy find similar file for basis if no dest file
--compare-dest=DIR also compare destination files relative to DIR
--copy-dest=DIR ... and include copies of unchanged files
--link-dest=DIR hardlink to files in DIR when unchanged
-z, --compress compress file data during the transfer
-C, --cvs-exclude auto-ignore files the same way CVS does
-f, --filter=RULE add a file-filtering RULE
-F same as --filter='dir-merge /.rsync-filter'
repeated: --filter='- .rsync-filter'
--exclude=PATTERN exclude files matching PATTERN
--exclude-from=FILE read exclude patterns from FILE
--include=PATTERN don't exclude files matching PATTERN
--include-from=FILE read include patterns from FILE
--files-from=FILE read list of source-file names from FILE
-0, --from0 all *-from/filter files are delimited by 0s
--address=ADDRESS bind address for outgoing socket to daemon
--port=PORT specify double-colon alternate port number
--blocking-io use blocking I/O for the remote shell
--no-blocking-io turn off blocking I/O when it is the default
--stats give some file-transfer stats
--progress show progress during transfer
-P same as --partial --progress
-i, --itemize-changes output a change-summary for all updates
--log-format=FORMAT output filenames using the specified format
--password-file=FILE read password from FILE
--list-only list the files instead of copying them
--bwlimit=KBPS limit I/O bandwidth; KBytes per second
--write-batch=FILE write a batched update to FILE
--only-write-batch=FILE like --write-batch but w/o updating destination
--read-batch=FILE read a batched update from FILE
--protocol=NUM force an older protocol version to be used
-4, --ipv4 prefer IPv4
-6, --ipv6 prefer IPv6
--version print version number
-h, --help show this help screen

Use "rsync --daemon --help" to see the daemon-mode command-line options.
Please see the rsync(1) and rsyncd.conf(5) man pages for full documentation.
See http://rsync.samba.org/ for updates, bug reports, and answers




 PostPosted: Thu Sep 21, 2006 6:43 pm   

Joined: Wed Sep 06, 2006 12:19 pm
Posts: 54
Location: Covington, WA
Boelcke wrote:
...

I've had some pretty good luck using md5deep, and its recursive directory abilities.

Code:
#!/bin/bash
echo "Generating hashes of the pictures."
md5deep -rbo f /home/boelcke/pictures > orig.txt
echo "Generating hashes on the copy already in backup."
md5deep -rbo f /backup/boelcke/pictures > bak.txt
echo "Now making a differences file called differ.txt."
diff orig.txt bak.txt > differ.txt
echo "Done!"

How about this: instead of using the bare filename option (-b), use the relative path option (-l), which will print out the hashes with relative names, and save it to a list in the current directory:
Code:
cd <source_dir>; md5deep -lr * > md5list
Then, when you do a check, you can use 'md5sum -c md5list', which will verify if any files are corrupt or missing:
Code:
cd <source_dir>; md5sum -c md5list  2>&1 | egrep '(No such|FAILED$)'
The above output will only tell you which files are missing ("No such file or directory") or corrupt ("FAILED").

Quote:
2. If a file doesn't exist in the backup, but does in the source, I'll want to copy the file over.

Here's where I'm having trouble. When I run md5deep on my directories, it can generate the hashes for all the files, and even show when two are different. What it doesn't to is indicate when one file is simply missing. What command can I use to determine what files exist in one place, but not the other?
Here you can use 'rsync' as crouse mentioned. When using rsync, you will need to be aware that it is sensitive to trailing slashes in source directory name(s): without a trailing slash it copies the directory and all its contents, while a trailing slash means to copy just the contents of the directory and not the directory itself. This is a fairly common mistake for those new to rsync.

So the backup can look like this:
Code:
rsync -a --del --exclude=md5list source/dir destination

  -OR-

rsync -a --del --exclude=md5list source/dir/ destination/dir
The above will copy everything except the md5list and delete any destination files not found in the source; in other words, an exact mirror minus the md5list. (Note it's '--del', the alias for --delete-during; a single-dash '-del' would be parsed as the separate options -d -e l.)

For the second part of this question, see #1 above.....

Quote:
3. If a file doesn't exist in the source, but does in the backup, prompt me for what to do about it.
This one is a bit tricky with rsync, since rsync doesn't have an interactive option (prompt what to do). The best strategy is to do a rotating backup: rename the current backup directory, then use rsync's '--link-dest=DIR' option to create hardlinks to unchanged files in the previous backup directory you just renamed. This will allow you to keep any files that were deleted from the source for a period of time in one of the old backup directories, while producing an exact mirror in the current backup directory:
Code:
mv destination/dir destination/dir.old
rsync -a --del --exclude=md5list --link-dest=destination/dir.old source/dir destination
For details on this type of rotating backup scheme, take a look at Easy Automated Snapshot-Style Backups with Rsync, which could be considered the authoritative source on rotating (or snapshot) backup schemes using rsync. It has helpful suggestions on saving snapshots of a user's home directory, so an accidentally erased file can be retrieved from one of the backups. It also has the advantage of not using up tons of space to keep multiple backups of the same directory, thanks to the hardlinks.
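As a side note (my sketch, not from that article): the space saving in the snapshot scheme comes from hardlinks, and you can see the same effect with plain GNU 'cp -al' in place of rsync's --link-dest. All paths below are throwaway demo data:

```shell
#!/bin/bash
# Sketch of the hardlink-snapshot idea using cp -al (GNU coreutils)
# instead of rsync --link-dest; paths are throwaway demo data.
backup=$(mktemp -d)
mkdir "$backup/current"
echo "jpeg bytes" > "$backup/current/a.jpg"

stamp=$(date +%Y-%m-%d)
cp -al "$backup/current" "$backup/snap-$stamp"   # hardlink copy: no file data duplicated

# Both directory entries now point at the same inode
[ "$backup/current/a.jpg" -ef "$backup/snap-$stamp/a.jpg" ] && echo "hardlinked"
rm -rf "$backup"
```

rsync --link-dest does the equivalent per file, hardlinking only the unchanged ones and transferring the rest.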

---thegeekster


 PostPosted: Fri Sep 22, 2006 7:46 pm   

Joined: Wed Sep 20, 2006 6:46 pm
Posts: 4
Thanks for both of your replies. I think I may be heading in a more manual direction here.

First, I've looked into rsync, and it doesn't do what I'm asking. It doesn't do a checksum of both files to verify that one hasn't become corrupted; it uses checksums (if you select that option) to determine whether your source has changed, and whether to copy it over the old one.

Geekster, I like your 'md5sum -c md5list' idea, though I think it will only work on a flat directory structure. I ended up thinking of md5deep because it will go recursively through the directories, which md5sum will not do. My photos have many layers of directories organizing them.

I understand all about rotating backups, and have used that method for other data backup. However, in this case, it will not positively keep a good, uncorrupted picture. If some jpeg from 2004 that I don't look at much gets corrupted, having 4 weeks of rotating backups won't do me any good. The rotations only help if I have a chance to discover the bad file in that timeframe.

I think I'm about to look into how to read information from text files into script variables. Here's my new tentative plan:

1. Run md5deep on the source, and have it generate a listing of hashes with the relative filenames.

2. Run it on the backup the same way.

3. Loop line by line through the source-md5.txt file, set a variable equal to the path/filename, and another variable equal to the hash. Search through the backup-md5.txt file for the filename, and then compare the hashes.

a. If they match, all is good, move on to the next iteration of the loop.

b. If they don't match, I can display the files (if pictures), and seek user input for what to do (keep one or the other, or keep both for now).

c. If it doesn't find a matching filename in the backup-md5.txt file, copy the file from the source to the backup. Then, run a md5sum checksum to verify the copy, and then append text to the backup-md5.txt file to keep it up-to-date with the new file.

4. Loop line by line through the backup-md5.txt file, set a variable equal to the path/filename, and search the source-md5.txt file to make sure it's still in the source. If it isn't, alert the user for the proper action.

It isn't fast, but it will be thorough.
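For what it's worth, steps 3 and 4 of that plan can also be done in a single awk pass over the two hash lists, instead of a grep per file. A sketch using made-up list contents in md5deep's "32-char hash, two spaces, relative path" format:

```shell
#!/bin/bash
# Sketch (not the poster's script): compare two md5deep-style hash lists
# in one pass with awk. The list contents below are invented demo data.
work=$(mktemp -d)
cat > "$work/orig.txt" <<'EOF'
d41d8cd98f00b204e9800998ecf8427e  2004/a.jpg
5d41402abc4b2a76b9719d911017c592  2004/b.jpg
EOF
cat > "$work/dupe.txt" <<'EOF'
d41d8cd98f00b204e9800998ecf8427e  2004/a.jpg
ffffffffffffffffffffffffffffffff  2004/b.jpg
EOF

# First pass (NR==FNR) loads the backup list; second pass walks the source
# list. substr() slices on fixed columns, so spacey filenames survive.
awk 'NR==FNR { bak[substr($0,35)] = substr($0,1,32); next }
     {
       path = substr($0,35); hash = substr($0,1,32)
       if (!(path in bak))          print "MISSING from backup: " path
       else if (bak[path] != hash)  print "HASH MISMATCH: " path
       delete bak[path]
     }
     END { for (p in bak) print "ONLY in backup: " p }' \
    "$work/dupe.txt" "$work/orig.txt"
# prints: HASH MISMATCH: 2004/b.jpg
rm -rf "$work"
```

The same three cases the plan describes (mismatch, source-only, backup-only) each get their own line, ready to feed a prompt or a log.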


 PostPosted: Sat Sep 23, 2006 12:06 am   

Joined: Wed Sep 06, 2006 12:19 pm
Posts: 54
Location: Covington, WA
Boelcke wrote:
...

Geekster, I like your 'md5sum -c md5list' idea, though I think it will only work on a flat directory structure. I ended up thinking of md5deep because it will go recursively through the directories, which md5sum will not do. My photos have many layers of directories organizing them.

...

I guess I didn't explain it well enough, but creating the list with md5deep using relative paths, then using md5sum on the list to verify checksums _will_ work for subdirectories, because of the relative pathnames. You could even use absolute pathnames, but that wouldn't be very portable across partitions unless you have an identical absolute path at the destination.

Therefore, your alternate plan of looping over subdirectories (step 3) can be revised: you can still do the substeps by saving the results of the "md5sum -c md5list 2>&1 | egrep '(No such|FAILED$)'" check in a variable and looping through its lines for user interaction on what to do with each bad (or missing) file. Or, even better, pipe the results of the check directly into a 'while read' loop and forgo the variable. This way the looping will be much shorter than arbitrarily looping through all the files looking for the faulty or missing ones. ;-)

There will still be a bit of a time lag while md5sum does its checking, but if nothing is found, the 'while read' loop will be skipped entirely.
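A concrete sketch of that pipe-into-'while read' idea (the demo files are created on the fly here; a real run would instead cd into the source directory and use the md5list generated by md5deep -lr):

```shell
#!/bin/bash
# Sketch: pipe the md5sum -c failures straight into a while-read loop.
# Demo files are invented; a real run would use the actual source dir.
demo=$(mktemp -d); cd "$demo" || exit 1
mkdir 2004
echo good > 2004/a.jpg
echo good > 2004/b.jpg
md5sum 2004/*.jpg > md5list     # same "hash  path" shape as md5deep -lr

echo corrupted > 2004/b.jpg     # simulate bit rot
rm 2004/a.jpg                   # simulate a missing file

md5sum -c md5list 2>&1 | egrep '(No such|FAILED$)' |
while IFS= read -r line; do
    # one line per missing or corrupt file; prompt or log here
    echo "needs attention: $line"
done
cd / && rm -rf "$demo"
```

With nothing missing or corrupt, egrep emits nothing and the loop body never runs, which is exactly the "skipped entirely" behavior described above.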

HTH :-)
---thegeekster


 PostPosted: Tue Sep 26, 2006 6:17 pm   

Joined: Wed Sep 20, 2006 6:46 pm
Posts: 4
Yes, as I'm new to bash scripting, I believe I'm probably doing this with way too many steps, and it could probably be slicker. I do have a fair amount of it running, though I have a few more basic steps before the core of it is working. Then, I suppose it'll be 3x more work to get all the niceties working like I want.

I guess I'm not sure why using md5sum to go through the text file is better than, say, just grepping it.

As I'll paste below, I'm actually creating 2 text files with md5deep, one of the source, and one of the backup. It's probably extra work, but once it grinds through it, manipulating text files is quick. Plus, having them there lets me run other tasks later. Recall, I've got three main tasks:
1. Compare checksums of files that exist in both places, & resolve differences
2. Copy over files that are in the source but not the backup
3. Question the user about files that are in the backup, but not the source

Thanks for all your advice. Sometimes getting the right direction can be even more valuable than getting the syntax of a command correct!

I left off last night with a successful run on my test directory, but when I modified it to run on my /home directory (just for kicks), it choked. It seems I'm not set up to deal with filenames with spaces in them! Oh, those legacy things left over from my Windoze days...


Code:
#!/bin/bash
# Initialize the log file
date=$(date)
echo "This is a log for the backup run on $date" > $HOME/test/log.txt
echo "Generating hashes on the original."
cd $HOME/test/original
md5deep -rlo f * > $HOME/test/orig.txt
echo "Generating hashes on the backup."
cd $HOME/test/duplicate
md5deep -rlo f * > $HOME/test/dupe.txt
loop1=$(sed -n '$=' $HOME/test/orig.txt)
X=1
# This loop will iterate through each line of the source text file
while [ $X -le $loop1 ]
do
  sourceline=$(sed ''"$X"'q;d' $HOME/test/orig.txt)
  sourcehash=${sourceline:0:32}
  sourcefilepath=${sourceline:34:255}
# Now we try and find the corresponding line in the dupe.txt file
  dupehash=$(grep $sourcefilepath $HOME/test/dupe.txt)
# Error check - if file not found
  if [ -n "$dupehash" ]; then
    dupehash=${dupehash:0:32}
# Now compare these suckers
    if [ $dupehash != $sourcehash ]; then
      echo "ALERT! The source does not match the backup."
      echo "File: "$sourcefilepath
      echo "Would you like to (V)iew the files,"
      echo "copy the (S)ource over the backup,"
      echo "copy the (B)ackup over the source,"
      echo "or (L)og the discrepancy and move on?"
# For now, don't implement all the choices, just log it and move on.
      echo " " >> $HOME/test/log.txt
      echo "ALERT! The source does not match the backup." >> $HOME/test/log.txt
      echo "File: "$sourcefilepath >> $HOME/test/log.txt
      echo "Source: "$sourcehash >> $HOME/test/log.txt
      echo "Backup: "$dupehash >> $HOME/test/log.txt
      echo " " >> $HOME/test/log.txt
    fi
  else
#    echo "The dupe.txt file doesn't seem to contain "$sourcefilepath
    echo " " >> $HOME/test/log.txt
    echo "The dupe.txt file doesn't seem to contain "$sourcefilepath >> $HOME/test/log.txt
    echo " " >> $HOME/test/log.txt
  fi

  echo
  X=$((X+1))
done

# Then, come back later and add the loop that checks that there aren't any files listed in the duplicate that aren't in the source.


echo


 PostPosted: Tue Sep 26, 2006 11:37 pm   

Joined: Wed Sep 06, 2006 12:19 pm
Posts: 54
Location: Covington, WA
Boelcke wrote:
...I guess I'm not sure why using md5sum to go through the text file is better than, say, just grepping it...
Grep doesn't do checksums, and the '-c' option for md5sum is exactly for that purpose........

Quote:
...As I'll paste below, I'm actually creating 2 text files with md5deep, one of the source, and one of the backup. It's probably extra work, but once it grinds through it, manipulating text files is quick...
Not extra work...........This is the reason my example code using rsync excludes copying the md5list from the source..... :-)

Quote:
...I left off last night with a successful run on my test directory, but when I modified it to run on my /home directory (just for kicks), it choked. It seems I'm not set up to deal with filenames with spaces in them! Oh, those legacy things left over from my Windoze days...
......... :lol:
You can deal with "spacey" filenames by making sure all your variables assigned to the filenames are surrounded by double-quotes.....

Or, rename them by replacing the spaces with underscores.......like so:
Code:
find <homedir> -type f | while read -r; do FILE="${REPLY##*/}"; mv "${REPLY%/*}/$FILE" "${REPLY%/*}/${FILE// /_}"; done
Replace <homedir> with the proper path and you're good to go........

:-)
---thegeekster


PS: Forgot to add the quotes in the rename code :roll: .......(late night)


 PostPosted: Wed Sep 27, 2006 6:22 pm   

Joined: Wed Sep 20, 2006 6:46 pm
Posts: 4
Yeah, I like both ways, but I'll probably choose to do your first suggestion, and put double-quotes around the variables I'm using.

Thanks for discussing this one with me! I'll be sure to post the finished work once I get it completely running...


 PostPosted: Wed Sep 27, 2006 10:49 pm   

Joined: Wed Sep 06, 2006 12:19 pm
Posts: 54
Location: Covington, WA
N/P......Glad to help out.......

:-)
---thegeekster


 PostPosted: Fri Sep 29, 2006 11:13 am   
User avatar

Joined: Mon Jul 03, 2006 8:58 pm
Posts: 52
Location: Rochester, NY
Beware of other "bad" characters in your filenames like quotes or apostrophes...those will cause headaches as well.
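True, and the usual bullet-proof fix (a sketch of my own, not from the thread; the demo filename is invented) is NUL-delimited names via 'find -print0' and 'read -d '''', which survive spaces, quotes, apostrophes, and even embedded newlines:

```shell
#!/bin/bash
# Sketch: NUL-delimited filename handling copes with any character a
# filename can legally contain. The demo file below is invented.
demo=$(mktemp -d)
touch "$demo/it's a \"test\" file.jpg"

find "$demo" -type f -print0 |
while IFS= read -r -d '' f; do
    echo "found: ${f##*/}"   # strip the directory part, keep the name
done
rm -rf "$demo"
```

Since NUL is the one byte a pathname can never contain, this loop needs no renaming step at all.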


Powered by phpBB © 2011 phpBB Group
© 2003 - 2011 USA LINUX USERS GROUP