06th May 2005
Commands, Scripting, and Wading Through 10,000 Files
This grew out of a thread from the FlashCodersNY.org group.
Ever had to wade through tens of thousands of files, trying to decide which ones are duplicates? One scenario is comparing backups of the same thing (think “large dev project”, “email archives”, or “500 photos named DC_something.jpg”) made at different times. Inspired by an email from Jean-Charles Carelli, I’ve written up a Perl-based approach. (This extends the recent “command line” thread on the list.)
The Problem
You want to compare lots of files. Perhaps backups of a directory structure containing hundreds or thousands of files of differing types, such as text, code, jpegs, and so on. One way is to compare dates, or file length. Comparing by date is problematic: you can have the exact same file, where one version was accidentally “touched” (perhaps saved from an editor, with nothing changed except the “last modified” date). Comparing by file length is also problematic: perhaps you are dealing with program code, where a constant is no longer ‘1’ for true… now it’s ‘0’ for false! The length is the same, but the content is not.
And the clincher: you didn’t use a version control system - you simply have copies of copies of copies …
Checksums
A checksum is a sort of fingerprint computed from the exact sequence of bytes in a file. Change one byte in the file, and you will, for all practical purposes, get a different checksum.
Let’s create a simple file with the date command (*nix or Mac, or Windows running Cygwin, etc.):
>> date > rightnow
At the moment, I get:
Fri May 6 11:00:53 EDT 2005
The command “md5sum” will show a unique id for this file:
>> md5sum rightnow
2ee6978bf7ecb473cdb9aa3a46380932 rightnow
Update the file (in csh, “>!” overwrites an existing file even when noclobber is set)…
>> date >! rightnow ; cat rightnow
Fri May 6 11:04:16 EDT 2005
Same length, right? But we can’t fool md5sum!
>> md5sum rightnow
d4b92e7c532c89a7967fa0303367205b rightnow
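The two pitfalls from “The Problem” can be checked the same way. Here is a minimal sh sketch, assuming GNU md5sum is installed. The file names and the sum_of helper are made up for the demo and are not part of the scripts in this post.

```shell
#!/bin/sh
# Sketch: dates and lengths can mislead, checksums follow the bytes.
# Assumes GNU md5sum; file names and sum_of are invented for this demo.
tmp=$(mktemp -d)

printf 'debug = 1\n' > "$tmp/original"
cp "$tmp/original" "$tmp/touched"
touch "$tmp/touched"                    # new "last modified" date, same bytes
printf 'debug = 0\n' > "$tmp/edited"    # same length, one byte differs

# md5sum prints "<checksum>  <filename>"; keep only the checksum.
sum_of() { md5sum "$1" | cut -d' ' -f1; }

sum_original=$(sum_of "$tmp/original")
sum_touched=$(sum_of "$tmp/touched")
sum_edited=$(sum_of "$tmp/edited")

if [ "$sum_original" = "$sum_touched" ]; then
    echo "touched copy: same checksum"
fi
if [ "$sum_original" != "$sum_edited" ]; then
    echo "edited copy: different checksum"
fi
```

The touched copy keeps its checksum even though its date changed, and the edited copy gets a new checksum even though its length did not.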
A Good Solution - Part 1, Gathering Info
Armed with this knowledge, it would be great to generate a list of checksums for everything in a massive directory structure.
Here’s a simple csh script, “mksum”, that does just that:
#!/bin/csh -f
find * -type f -exec md5sum {} \; |& tee $1
Let’s run it on something small, as a test… cd to the top level of the directory you wish to gather checksums for, and run:
>> mksum AllSums-May6
The argument (shown here as “AllSums-May6”) is a file that stores the result. The lines are in “<checksum> <filename>” format:
51ae1d06a34513aff649f1a080f169f6 dist-my_photos/include/config.php
0cd41486cccd38baf68911c78962b972 dist-my_photos/include/get-language.php
b767beb555b95482e7bb13c35c44dc33 dist-my_photos/include/global/d-col-edit.res
I deliberately keep a copy of the checksum file for two reasons:
- it’s crucial to the next step, of finding duplicates
- it can be used by other tools, including the ability to diff files containing lists of checksums (see what’s changed over time)
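To illustrate the second point, here is a rough sh sketch of diffing two checksum lists taken at different times. It assumes GNU md5sum. The directory, the list names, and the snapshot helper are invented for the demo; snapshot does the same job as mksum, with the list sorted by file name so the two runs line up.

```shell
#!/bin/sh
# Sketch: see what changed between two checksum lists.
# Assumes GNU md5sum; all names below are invented for this demo.
work=$(mktemp -d)
mkdir "$work/project"
echo 'first draft' > "$work/project/notes.txt"
echo 'unchanged'   > "$work/project/readme.txt"

# Same idea as mksum: one "<checksum>  <filename>" line per file,
# sorted by file name. The list is written outside the scanned directory.
snapshot() {
    (cd "$work/project" && find * -type f -exec md5sum {} \; | sort -k2) > "$1"
}

snapshot "$work/sums-before"
echo 'second draft' > "$work/project/notes.txt"
snapshot "$work/sums-after"

# diff exits non-zero when the lists differ, so "|| true" keeps that
# from being treated as a failure.
changes=$(diff "$work/sums-before" "$work/sums-after" || true)
echo "$changes"
```

Only the file whose bytes changed shows up in the diff, once with its old checksum and once with its new one.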
A Good Solution - Part 2, Comparing Checksums
Let’s feed our checksum file to a Perl script. I’ll go over the script (“showdups”) in a bit:
>> showdups AllSums-May6
dups:
b19035b188e97126ae769fadc81f8e17 php/templates/ui-csspos/en/pick-collection.php
b19035b188e97126ae769fadc81f8e17 php/templates/ui-example/en/pick-collection.php
dups:
5f20bc80e71cda366985af8a1931bbc4 dist-my_photos/php/templates/ui-csspos/en/search-photo.php
5f20bc80e71cda366985af8a1931bbc4 dist-my_photos/php/templates/ui-default/en/search-photo.php
5f20bc80e71cda366985af8a1931bbc4 dist-my_photos/php/templates/ui-example/en/search-photo.php
As you can see, the script catches multiple duplicates of the same file(s), regardless of where they reside in a directory structure.
Here’s showdups:
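#!/usr/bin/perl
# showdups: list files whose checksums occur more than once.
# Input: lines of "<checksum> <filename>", as written by mksum.
use strict;
use warnings;

print STDERR "starting...\n";   # progress note, kept out of the results

my %compares;       # checksum => "<checksum> <filename>\n" for each match
my %num_compares;   # checksum => number of files seen with it

while (<>) {
    chomp;   # remove only the trailing newline
    # A limit of 2 keeps file names that contain spaces in one piece.
    my ($cksum, $filename) = split ' ', $_, 2;
    next unless defined $filename;   # skip blank or malformed lines
    $compares{$cksum} .= "$cksum $filename\n";
    $num_compares{$cksum}++;
}

# Report every checksum that was seen more than once.
foreach my $i (keys %num_compares) {
    if ($num_compares{$i} > 1) {
        print "dups:\n$compares{$i}\n";
    }
}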
The script works in two phases:
- Build up an associative array, where the keys are checksums and each value is the list of “<checksum> <filename>” lines that share that checksum. As we go along, we build up another associative array that simply keeps a hit count for each unique checksum.
- Now we iterate over the array that keeps the hit count (num_compares). If a checksum has more than one hit, we use it as the key into the first array (compares) and print the matching filenames.
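As a cross-check on showdups, the same grouping can be sketched with sort and uniq. This is only a sketch: it assumes GNU uniq (the -w and --all-repeated options are GNU extensions) and relies on an MD5 checksum being 32 hex characters wide. The sample files and the list name are made up for the demo.

```shell
#!/bin/sh
# Sketch: group duplicate checksums with sort and uniq instead of Perl.
# Assumes GNU md5sum and GNU uniq; sample files are invented for this demo.
work=$(mktemp -d)
list=$(mktemp)          # the checksum list lives outside the scanned tree
mkdir "$work/a" "$work/b"
echo 'same bytes'  > "$work/a/one.txt"
echo 'same bytes'  > "$work/b/two.txt"
echo 'other bytes' > "$work/a/three.txt"

(cd "$work" && find * -type f -exec md5sum {} \;) > "$list"

# Sorting puts equal checksums on adjacent lines. uniq then compares only
# the first 32 characters (the MD5 checksum) and prints every line that
# belongs to a repeated group, with a blank line between groups.
dups=$(sort "$list" | uniq -w32 --all-repeated=separate)
echo "$dups"
```

The Perl script remains the more portable option, since it does not depend on GNU-only flags.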
Summary
The approach helps narrow down possible duplicate files. It is possible, in principle, for two different files (even files of differing lengths) to have the exact same checksum. To me this seems like an acceptable, and very rarely seen, false positive. The approach quickly directs you to the files worth comparing manually when you are faced with tens of thousands.
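For the manual comparison, one option is cmp, which compares two files byte for byte. A minimal sh sketch follows; the same_bytes helper and the sample file names are hypothetical, chosen only for illustration.

```shell
#!/bin/sh
# Sketch: confirm a suspected duplicate byte for byte.
# same_bytes and the sample files are hypothetical names for this demo.
same_bytes() {
    # cmp -s prints nothing; its exit status is 0 only if the files match.
    cmp -s "$1" "$2"
}

work=$(mktemp -d)
echo 'holiday photo' > "$work/DC_0001.jpg"
cp "$work/DC_0001.jpg" "$work/copy-of-DC_0001.jpg"
echo 'holiday phot0' > "$work/DC_0002.jpg"    # same length, last letter differs

for candidate in "$work/copy-of-DC_0001.jpg" "$work/DC_0002.jpg"; do
    if same_bytes "$work/DC_0001.jpg" "$candidate"; then
        echo "identical: $candidate"
    else
        echo "differs:   $candidate"
    fi
done
```

Running cmp only on the pairs that showdups reports keeps the slow byte-by-byte pass limited to a handful of files.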
PostScript
This is an old-school Unix approach. I will note that in Tiger (Mac OS X 10.4) it is possible to set attributes on files (such as “scene: city”), which are stored as metadata along with more traditional attributes [1]. I think [2] this means that two files with the same content could have the same checksum and yet, in a sense, still be “different”, because the ability to set/get arbitrary attribute data on a file is very meaningful.
- [1] http://arstechnica.com/reviews/os/macosx-10.4.ars/7
- [2] I’m not running Tiger at the moment, so I’m not able to test this
Update: see Comparing Sets Of Files for the comparesnaps.pl script…