Ubuntu Manpage: simhash - file similarity hash tool

Provided by: simhash_0.0.20110213-1_amd64

NAME

       simhash - file similarity hash tool

SYNOPSIS

       simhash [ -s nshingles ] [ -f nfeatures ] [ file ]
       simhash [ -s nshingles ] [ -f nfeatures ] -w file ...
       simhash [ -s nshingles ] [ -f nfeatures ] -m file ...
       simhash -c hashfile hashfile

DESCRIPTION

       This  program is used to compute and compare similarity hashes of files.  A similarity hash is a chunk of
       data that has the property that some distance metric between  files  is  proportional  to  some  distance
       metric between the hashes.  Typically the similarity hash will be much smaller than the file itself.

       The  algorithm  used by simhash is Manassas' "shingleprinting" algorithm (see BIBLIOGRAPHY below): take a
       hash of every m-byte subsequence of the file, and retain the n  of  these  hashes  that  are  numerically
       smallest.  The size of the intersection of the hash sets of two files gives a statistically good estimate
       of the similarity of the files as a whole.

       In  its  default mode, simhash will compute the similarity hash of its file argument (or stdin) and write
       this hash to its standard output.  When invoked with the -w argument (see below),  simhash  will  compute
       similarity  hashes  of all of its file arguments in "batch mode".  When invoked with the -m argument (see
       below), simhash will compare all the given files using similarity hashes in "match mode".  Finally,  when
       invoked  with  the  -c  argument  (see  below),  simhash will report the degree of similarity between two
       hashes.

OPTIONS

       -f feature-count
              When computing a similarity hash, retain at most feature-count significant hashes from the  target
              file.   The  default  is  128  features.   Larger  feature  counts  will give higher resolution in
              differences between files, will increase the size of the similarity  hash  proportionally  to  the
              feature count, and will increase similarity hash computation time slightly.

       -s shingle-size
              When  computing  a  similarity  hash, use hashes of samples consisting of shingle-size consecutive
              bytes drawn from the target file.  The default is 8 bytes, the minimum is 4 bytes.  Larger shingle
              sizes will emphasize the differences  between  files  more  and  will  slow  the  similarity  hash
              computation proportionally to the shingle size.

       -c hashfile1 hashfile2
              Display  the  distance  (normalized  to  the  range  0..1)  between  the similarity hash stored in
              hashfile1 and the similarity hash stored in hashfile2.

       -w file ...
              Write the similarity hash of each of the file arguments to file.sim.

       -m file ...
              Compute the similarity hash of each of the file arguments, and  output  a  similarity  matrix  for
              those files.

AUTHOR

       Bart Massey <bart@cs.pdx.edu>

BUGS

       This  currently  uses  CRC32 for the hashing.  A Rabin Fingerprint should be offered as a slightly slower
       but more reliable alternative.

       The shingleprinting algorithm works for text files and fairly well for other  sequential  filetypes,  but
       does not work well for image files.   The latter both are 2D and often undergo odd transformations.

BIBLIOGRAPHY

       Mark  Manasse,  Microsoft  Research Silicon Valley.  Finding similar things quickly in large collections.
       http://research.microsoft.com/research/sv/PageTurner/similarity.htm

       Andrei Z. Broder.  On the resemblance and containment of documents.  In  Compression  and  Complexity  of
       Sequences       (SEQUENCES'97),       pages       21-29.      IEEE      Computer      Society,      1998.
       ftp://ftp.digital.com/pub/DEC/SRC/publications/broder/positano-final-wpnums.pdf

       Andrei Z. Broder.  Some applications of Rabin's fingerprinting method.  Published in R. Capocelli, A.  De
       Santis,  U.  Vaccaro  eds.,  Sequences  II:  Methods  in  Communications, Security, and Computer Science,
       Springer-Verlag, 1993.  http://athos.rutgers.edu/~muthu/broder.ps

                                                 3 January 2007                                       SIMHASH(1)