bionic (1) simhash.1.gz

Provided by: simhash_0.0.20150404-1_amd64 bug

NAME

       simhash - file similarity hash tool

SYNOPSIS

       simhash [ -s nshingles ] [ -f nfeatures ] [ file ]
       simhash [ -s nshingles ] [ -f nfeatures ] -w file ...
       simhash [ -s nshingles ] [ -f nfeatures ] -m file ...
       simhash -c hashfile hashfile

DESCRIPTION

       This  program is used to compute and compare similarity hashes of files.  A similarity hash is a chunk of
       data that has the property that some distance metric between  files  is  proportional  to  some  distance
       metric between the hashes.  Typically the similarity hash will be much smaller than the file itself.

       The  algorithm  used by simhash is Manassas' "shingleprinting" algorithm (see BIBLIOGRAPHY below): take a
       hash of every m-byte subsequence of the file, and retain the n  of  these  hashes  that  are  numerically
       smallest.  The size of the intersection of the hash sets of two files gives a statistically good estimate
       of the similarity of the files as a whole.

       In its default mode, simhash will compute the similarity hash of its file argument (or stdin)  and  write
       this  hash  to  its standard output.  When invoked with the -w argument (see below), simhash will compute
       similarity hashes of all of its file arguments in "batch mode".  When invoked with the -m  argument  (see
       below),  simhash will compare all the given files using similarity hashes in "match mode".  Finally, when
       invoked with the -c argument (see below), simhash will  report  the  degree  of  similarity  between  two
       hashes.

OPTIONS

       -f feature-count
              When  computing a similarity hash, retain at most feature-count significant hashes from the target
              file.  The default is 128  features.   Larger  feature  counts  will  give  higher  resolution  in
              differences  between  files,  will  increase the size of the similarity hash proportionally to the
              feature count, and will increase similarity hash computation time slightly.

       -s shingle-size
              When computing a similarity hash, use hashes of samples  consisting  of  shingle-size  consecutive
              bytes drawn from the target file.  The default is 8 bytes, the minimum is 4 bytes.  Larger shingle
              sizes will emphasize the differences  between  files  more  and  will  slow  the  similarity  hash
              computation proportionally to the shingle size.

       -c hashfile1 hashfile2
              Display  the  distance  (normalized  to  the  range  0..1)  between  the similarity hash stored in
              hashfile1 and the similarity hash stored in hashfile2.

       -w file ...
              Write the similarity hash of each of the file arguments to file.sim.

       -m file ...
              Compute the similarity hash of each of the file arguments, and  output  a  similarity  matrix  for
              those files.

AUTHOR

       Bart Massey <bart@cs.pdx.edu>

BUGS

       This  currently  uses  CRC32 for the hashing.  A Rabin Fingerprint should be offered as a slightly slower
       but more reliable alternative.

       The shingleprinting algorithm works for text files and fairly well for other  sequential  filetypes,  but
       does not work well for image files.   The latter both are 2D and often undergo odd transformations.

BIBLIOGRAPHY

       Mark  Manasse,  Microsoft  Research Silicon Valley.  Finding similar things quickly in large collections.
       http://research.microsoft.com/research/sv/PageTurner/similarity.htm

       Andrei Z. Broder.  On the resemblance and containment of documents.  In  Compression  and  Complexity  of
       Sequences       (SEQUENCES'97),       pages       21-29.      IEEE      Computer      Society,      1998.
       ftp://ftp.digital.com/pub/DEC/SRC/publications/broder/positano-final-wpnums.pdf

       Andrei Z. Broder.  Some applications of Rabin's fingerprinting method.  Published in R. Capocelli, A.  De
       Santis,  U.  Vaccaro  eds.,  Sequences  II:  Methods  in  Communications, Security, and Computer Science,
       Springer-Verlag, 1993.  http://athos.rutgers.edu/~muthu/broder.ps

                                                 3 January 2007                                       SIMHASH(1)