Provided by: simhash_0.0.20161225-2_amd64

NAME

       simhash - file similarity hash tool

SYNOPSIS

       simhash [ -s shingle-size ] [ -f feature-count ] [ file ]
       simhash [ -s shingle-size ] [ -f feature-count ] -w file ...
       simhash [ -s shingle-size ] [ -f feature-count ] -m file ...
       simhash -c hashfile1 hashfile2

DESCRIPTION

       This program is used to compute and compare similarity hashes of files.  A similarity hash
       is a chunk of data that has the property  that  some  distance  metric  between  files  is
       proportional  to  some  distance metric between the hashes.  Typically the similarity hash
       will be much smaller than the file itself.

       The algorithm used by simhash is Manasse's "shingleprinting" algorithm (see BIBLIOGRAPHY
       below): take a hash of every m-byte subsequence ("shingle") of the file, and retain the n
       of these hashes that are numerically smallest.  Here m is the shingle size set by -s and
       n is the feature count set by -f.  The size of the intersection of the hash sets of two
       files gives a statistically good estimate of the similarity of the files as a whole.
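
       The procedure can be sketched in Python.  This is an illustration only, not the simhash
       source: the names shingleprint and overlap are hypothetical, zlib.crc32 stands in for
       the hash function (BUGS notes that simhash currently uses CRC32), duplicate hashes are
       assumed to be collapsed, and the defaults mirror the documented 8-byte shingles and 128
       features.

```python
import heapq
import zlib


def shingleprint(data: bytes, shingle_size: int = 8, nfeatures: int = 128) -> set:
    """Return the nfeatures numerically smallest distinct shingle hashes of data."""
    # Hash every shingle_size-byte window; a set collapses duplicate hashes.
    hashes = {
        zlib.crc32(data[i:i + shingle_size])
        for i in range(len(data) - shingle_size + 1)
    }
    # Keep only the numerically smallest hashes as the similarity hash.
    return set(heapq.nsmallest(nfeatures, hashes))


def overlap(a: set, b: set) -> int:
    """Size of the intersection of two hash sets."""
    return len(a & b)


text = b"the quick brown fox jumps over the lazy dog " * 20
variant = text.replace(b"lazy", b"idle", 1)  # change one word, once

fp_text = shingleprint(text)
fp_variant = shingleprint(variant)
print(len(fp_text), len(fp_variant), overlap(fp_text, fp_variant))
```

       An input shorter than one shingle yields an empty hash set in this sketch; how simhash
       itself treats such files is not covered here.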

       In  its  default  mode,  simhash will compute the similarity hash of its file argument (or
       stdin) and write this hash to its standard output.  When invoked with the -w argument (see
       below),  simhash  will  compute  similarity  hashes of all of its file arguments in "batch
       mode".  When invoked with the -m argument (see below), simhash will compare all the  given
       files using similarity hashes in "match mode".  Finally, when invoked with the -c argument
       (see below), simhash will report the degree of similarity between two hashes.

OPTIONS

       -f feature-count
              When computing a similarity hash, retain at most feature-count  significant  hashes
              from  the  target  file.   The default is 128 features.  Larger feature counts will
              give higher resolution in differences between files, will increase the size of  the
              similarity  hash  proportionally to the feature count, and will increase similarity
              hash computation time slightly.

       -s shingle-size
              When computing a similarity hash, use hashes of samples consisting of  shingle-size
              consecutive bytes drawn from the target file.  The default is 8 bytes; the minimum
              is 4 bytes.  Larger shingle sizes will emphasize the differences between files more
              and will slow the similarity hash computation proportionally to the shingle size.
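
       One way to see why larger shingles emphasize differences: a single changed byte alters
       every shingle that overlaps it, so up to shingle-size shingles change per edit.  The
       sketch below counts the affected windows directly; changed_shingles is a hypothetical
       helper written for this illustration and is not part of simhash.

```python
def changed_shingles(a: bytes, b: bytes, shingle_size: int) -> int:
    """Count window positions whose shingles differ between equal-length inputs."""
    if len(a) != len(b):
        raise ValueError("inputs must have the same length")
    return sum(
        a[i:i + shingle_size] != b[i:i + shingle_size]
        for i in range(len(a) - shingle_size + 1)
    )


original = bytes(range(64))
edited = bytearray(original)
edited[32] ^= 0xFF  # flip a single byte in the middle of the data
edited = bytes(edited)

for size in (4, 8, 16):
    # Each line shows the shingle size and the number of shingles it disturbs.
    print(size, changed_shingles(original, edited, size))
```

       For an edit well away from either end of the data, the count equals the shingle size.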

       -c hashfile1 hashfile2
              Display  the  distance  (normalized  to the range 0..1) between the similarity hash
              stored in hashfile1 and the similarity hash stored in hashfile2.

       -w file ...
              Write the similarity hash of each of the file arguments to file.sim.

       -m file ...
              Compute the similarity hash of each of the file arguments, and output a  similarity
              matrix for those files.
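
       A pairwise comparison of this kind can be sketched as follows.  The scoring rule is an
       assumption made for illustration (shared hashes divided by the size of the larger hash
       set); the exact score and layout that simhash prints are not specified here.  The
       names similarity and similarity_matrix and the sample hash sets are hypothetical.

```python
def similarity(a: set, b: set) -> float:
    """Assumed score in 0..1: shared hashes over the larger set's size."""
    if not a and not b:
        return 1.0  # two empty hash sets are treated as identical
    return len(a & b) / max(len(a), len(b))


def similarity_matrix(features: dict) -> list:
    """Return rows of pairwise scores, in the insertion order of features."""
    names = list(features)
    return [
        [similarity(features[row], features[col]) for col in names]
        for row in names
    ]


# Tiny hand-written hash sets standing in for real similarity hashes.
features = {
    "a.txt": {1, 2, 3, 4},
    "b.txt": {3, 4, 5, 6},
    "c.txt": {7, 8, 9, 10},
}

for name, row in zip(features, similarity_matrix(features)):
    print(name, " ".join(f"{score:.2f}" for score in row))
```

       The matrix is symmetric with 1.00 on the diagonal, so only the entries above the
       diagonal carry information.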

AUTHOR

       Bart Massey <bart@cs.pdx.edu>

BUGS

       This  currently  uses  CRC32  for the hashing.  A Rabin Fingerprint should be offered as a
       slightly slower but more reliable alternative.

       The shingleprinting algorithm works well for text files and fairly well for other
       sequential filetypes, but does not work well for image files.  Images are both
       two-dimensional and often undergo odd transformations.

BIBLIOGRAPHY

       Mark Manasse, Microsoft Research Silicon Valley.  Finding similar things quickly in  large
       collections.  http://research.microsoft.com/research/sv/PageTurner/similarity.htm

       Andrei  Z.  Broder.   On the resemblance and containment of documents.  In Compression and
       Complexity  of  Sequences  (SEQUENCES'97),  pages  21-29.  IEEE  Computer  Society,  1998.
       ftp://ftp.digital.com/pub/DEC/SRC/publications/broder/positano-final-wpnums.pdf

       Andrei  Z.  Broder.   Some applications of Rabin's fingerprinting method.  Published in R.
       Capocelli, A. De Santis,  U.  Vaccaro  eds.,  Sequences  II:  Methods  in  Communications,
       Security,         and        Computer        Science,        Springer-Verlag,        1993.
       http://athos.rutgers.edu/~muthu/broder.ps

                                          3 January 2007                               SIMHASH(1)