Provided by: similarity-tester_3.0.2-1_amd64 bug

NAME

       sim  -  find similarities in C, Java, Pascal, Modula-2, Lisp, Miranda, 8086 assembler code
       or in text files

SYNOPSIS

       sim_c [ -[adefFiMnOpPRsSTuv] -r N -t N -w N -o F ] file ... [ [ / | ] file ... ]
       sim_text ...
       sim_c++ ...
       sim_java ...
       sim_pasc ...
       sim_m2 ...
       sim_lisp ...
       sim_mira ...
       sim_8086 ...

DESCRIPTION

       Sim_c reads the C files file ...  and looks for segments of text  that  are  similar;  two
       segments  of program text are similar if they only differ in layout, comment, identifiers,
       and the contents of numbers, strings and characters.  If any runs of sufficient length are
       found,  they  are reported on standard output; the number of significant tokens in the run
       is given between square brackets.

       Sim_c++ does the same for  C++,  sim_java  for  Java,  sim_pasc  for  Pascal,  sim_m2  for
       Modula-2,  sim_mira  for Miranda, sim_lisp for Lisp, and sim_8086 for 8086 assembler code.
       Sim_text works on arbitrary text and it is occasionally useful on shell scripts.

       The program can be used for  finding  copied  pieces  of  code  in  purportedly  unrelated
       programs  (with  -s or -S), or for finding accidentally duplicated code in larger projects
       (with -f or -F).

       If a separator / or | is present in the list of input files, the files are divided into  a
       group  of  "new" files (before the / or |) and a group of "old" files; if there is no / or
       |, all files are "new".  Old files are never  compared  to  other  files.   See  also  the
       description of the -s and -S options below.

       Since  the similarity tester needs file names to pinpoint the similarities, it cannot read
       from standard input.

       The similarity tester takes ASCII or UTF-8 text as input, and produces a  sorted  list  of
       runs  in  text form (default or with the -d or -n options) or in percentage form (with the
       -p option).  Input in other formats, e.g.  .pdf or .doc needs to be converted to ASCII  or
       UTF-8   by  preprocessing.   Aggregated  similarity  results  can  be  obtained  by  doing
       postprocessing on one of the forms of output.

       There are the following options:

       -a     All new files are compared to all files.  See the section `Calculating Percentages'
              below.

       -d     The  output  is  in  a  diff(1)-like format instead of the default 2-column format.
              Recommended for text in languages with non-Latin alphabets.

       -e     Each file is compared to each file in isolation. This will  find  all  similarities
              between  all  texts  involved,  regardless  of repetitive text, but may be slow for
              large numbers of files.  See also `Calculating Percentages' below.

       -f     Runs are restricted to segments with balancing parentheses,  to  isolate  potential
              routine bodies (not in sim_text).

       -F     The names of routines in calls are required to match exactly (not in sim_text).

       -i     The  names  of  the  files to be compared are read from standard input, including a
              possible separator / or |; the file names must be  one  to  a  line.   This  option
              allows  a  very  large  number of file names to be specified; it differs from the @
              facility provided by some compilers in that it handles file names  only,  and  does
              not recognize option arguments.

       -M     Memory usage information is displayed on standard error output.

       -n     Similarities  found  are  summarized  by  file name, position and size, rather than
              displayed in full.

       -o F   The output is written to the file named F.

       -O     The option settings used are shown at the beginning of the output.

       -p     The output is given in similarity percentages; see `Calculating Percentages' below;
              implies -s.

       -P     When reporting percentages, only the main contributor for each file is shown.

       -r N   The  minimum  run  length  is  set  to N units; the default is 24 tokens, except in
              sim_text, where it is 8 words.

       -R     Directories in the input list are entered recursively, and all files  they  contain
              are involved in the comparison.

       -s     The contents of a file are not compared to itself (-s for "not self").

       -S     The  contents  of  the  new  files are compared to the old files only - not between
              themselves.

       -t N   In combination with the -p option, sets the threshold  (in  percents)  below  which
              similarities  will  not be reported; the default is 1, except in sim_text, where it
              is 20.

       -T     Suppresses the printing of information about the input files.

       -u     The output is not buffered and not sorted (only when reporting percentages).

       -v     Prints the version number and compilation date on standard output, then stops.

       -w N   The page width used is set to N columns; the default is 80.

       --     (A secret option, which prints the input as the similarity  checker  sees  it,  and
              then stops.)

       The -p option results in lines of the form
               F consists for x % of G material
       meaning  that  x  %  of  F's  text can also be found in G.  Note that this relation is not
       symmetric; it is quite possible for one file to consist for 100 %  of  text  from  another
       file,  while  the  other  file  consists  for only 1 % of text of the first file, if their
       lengths differ enough.  The -P (capital P) option shows the main contributor for each file
       only.   This  simplifies  the  identification  of  a set of files A[1] ... A[n], where the
       concatenation of these files is also present.  A threshold can be set using the -t option.
       Note that the granularity of the recognized text is still governed by the -r option or its
       default.

       The -r option controls the number of "units" that constitute a run.  For the programs that
       compare  programming  language  code, a unit is a lexical token in the pertinent language;
       comment and standard preamble material (file inclusion, etc.) is ignored and  all  strings
       are considered equal.  For sim_text a unit is a "word" which is defined as any sequence of
       one or more letters, digits, or characters over  127  (177  octal),  to  accommodate  full
       Unicode (UTF-8).

       The programs can handle Unicode (UTF-16) file names under Windows.

       Sim_text accepts  s p a c e d   t e x t  as normal text.

       Once sim has read, stored and preprocessed the input, it will no longer run out of memory.
       If memory is short it will change automatically  to  unbuffered,  unsorted  output  (while
       issuing a warning message).

HOW SIMILARITY IS RECOGNIZED

       Since  computers  cannot  test  for  similarity, only for equality, all units in the input
       files are replaced by 16-bit tokens, such that all units that are regarded as similar  are
       reduced to the same token.  For example in sim_c all identifiers are replaced by the token
       IDF and all strings are replaced by the token STR. The secret option -- can be used to see
       the resulting token sequence.

       In  sim_text  each  word  is  reduced  to a 16-bit token using a hash function. There is a
       chance of 1 : 65536 that two different  words  get  the  same  token  value,  but  because
       recognized  runs  of  tokens  are  usually several tokens long, the chances for accidental
       similarities are very low.

       The sequence of tokens obtained this way is then processed as follows.

       The default operation cycle of sim starts at the beginning of the token  sequence  of  the
       first  input  file  or  at a position $X$ in file $F$ at which the previous cycle has left
       off.  Sim then finds the longest segment $S$ such that 1) $S$  is  equal  to  the  segment
       starting  at  $X$; 2) $S$ is situated somewhere between position $X$ in $F$ and the end of
       all files; 3) $S$ does not overlap with the segment starting at $X$.  If the segment is at
       least  of  minimum  run  size,  it  is recorded, and the cycle starts again just after the
       segment at $X$; otherwise it starts again at $X+1$ .

       So if the token sequence at $X$ reads abcabcadefabdabcz, the cycle finds $S$ to be the abc
       just  before  the end; abca at $X+3$ would be longer but overlaps with the abca at $X+0$ .
       The cycle then starts at $X+3$, and will find another match with the  abc  near  the  end.
       Finally  the  ab  after  the  f  will  be  matched with the ab just before the cz.  So the
       following matches are found:

       $ X[0:2] = X[13:15] =$ abc
       $ X[3:5] = X[13:15] =$ abc
       $ X[9:10] = X[13:14] =$ ab

       This way best matches for the text in a file are found in material to  the  right  of  it,
       until  the  end  of all files.  The results are asymmetric: given files F1, F2, F3, F4, no
       matches for F3 are reported from  F1  or  F2,  for  example.   As  explained  below  under
       "Limitations", this avoids duplicate reports of similarity and helps to keep sim fast.

WHAT IS COMPARED TO WHAT

       The area that is searched by sim's cycle is called the range.  The default range, which as
       we have seen above runs from the file under observation  to  the  end  of  all  files,  is
       excellent  for  finding  similarities  in  program files, and, when doing percentages, for
       getting an impression of which files are  related  to  which  files,  but  sometimes  more
       control is needed.  The following modifications to the range are available:

       The  -a option includes all text in the range by not stopping the search at the end of the
       files but rather looping back to the beginning of the files and continuing  to  the  point
       where  the search started.  Now matches are also found in files before the present one and
       the results are symmetric: given files F1, F2,  F3,  F4,  matches  for  F3  will  also  be
       reported  from  F1 or F2, if present.  But matches may be reported twice, once for file Fa
       versus file Fb, and once for file Fb versus file Fa.  The -a option allows a more accurate
       determination of similarity percentages.

       The  -a  option is the only way to obtain symmetrical results, with information about both
       F1 vs. F2 and F2 vs. F1.

       The -S option removes the new files from the range, so files are only compared to the  old
       files.

       The  -s  option  removes the file itself from the range, so a file will not be compared to
       itself. This is the default when reporting percentages.

       In normal operation the whole range is searched as one unit. The -e option divides up  the
       range into the separate files, and causes sim to compare a file to each of the other files
       separately.  This produces the most detailed information when reporting text similarities,
       and  the  best  possible  results  when reporting similarity percentages, but can be quite
       slow.

   A Tabular Representation
       Input files are divided into two groups, new and old.  In the absence of  control  options
       sim compares the files thus (for 4 new files and 6 old ones):
                                 n e w    /     o l d       <- second file
                               1  2  3  4 / 5  6  7  8  9 10
                             |------------/------------
                        n  1 | c  c  c  c / c  c  c  c  c  c
                        e  2 |    c  c  c / c  c  c  c  c  c
                        w  3 |       c  c / c  c  c  c  c  c
                           4 |          c / c  c  c  c  c  c
              first        / / /  /  /  / / /  /  /  /  /  /
              file  ->     5 |            /
                        o  6 |            /
                        l  7 |            /
                        d  8 |            /
                           9 |            /
                          10 |            /
       where  a  c  indicates  that  the  first  file  is  compared to the second file, and the /
       represents the demarcation between new and old files.  The comparison range of  the  first
       files is clearly visible.

       Using the -a option extends this to
                                 n e w    /     o l d       <- second file
                               1  2  3  4 / 5  6  7  8  9 10
                             |------------/------------
                        n  1 | c  c  c  c / c  c  c  c  c  c
                        e  2 | c  c  c  c / c  c  c  c  c  c
                        w  3 | c  c  c  c / c  c  c  c  c  c
                           4 | c  c  c  c / c  c  c  c  c  c
              first        / / /  /  /  / / /  /  /  /  /  /
              file  ->     5 |            /
                        o  6 |            /
                        l  7 |            /
                        d  8 |            /
                           9 |            /
                          10 |            /

       Using the -S option instead reduces this to
                                 n e w    /     o l d       <- second file
                               1  2  3  4 / 5  6  7  8  9 10
                             |------------/------------
                        n  1 |            / c  c  c  c  c  c
                        e  2 |            / c  c  c  c  c  c
                        w  3 |            / c  c  c  c  c  c
                           4 |            / c  c  c  c  c  c
              first        / / /  /  /  / / /  /  /  /  /  /
              file  ->     5 |            /
                        o  6 |            /
                        l  7 |            /
                        d  8 |            /
                           9 |            /
                          10 |            /

       Finally, using the -s option changes the default ranges to
                                 n e w    /     o l d       <- second file
                               1  2  3  4 / 5  6  7  8  9 10
                             |------------/------------
                        n  1 |    c  c  c / c  c  c  c  c  c
                        e  2 |       c  c / c  c  c  c  c  c
                        w  3 |          c / c  c  c  c  c  c
                           4 |            / c  c  c  c  c  c
              first        / / /  /  /  / / /  /  /  /  /  /
              file  ->     5 |            /
                        o  6 |            /
                        l  7 |            /
                        d  8 |            /
                           9 |            /
                          10 |            /
       and the -a-extended ranges to
                                 n e w    /     o l d       <- second file
                               1  2  3  4 / 5  6  7  8  9 10
                             |------------/------------
                        n  1 |    c  c  c / c  c  c  c  c  c
                        e  2 | c     c  c / c  c  c  c  c  c
                        w  3 | c  c     c / c  c  c  c  c  c
                           4 | c  c  c    / c  c  c  c  c  c
              first        / / /  /  /  / / /  /  /  /  /  /
              file  ->     5 |            /
                        o  6 |            /
                        l  7 |            /
                        d  8 |            /
                           9 |            /
                          10 |            /

LIMITATIONS

       Repetitive  input  is  the  bane  of  similarity checking.  If we have a file containing 4
       copies of identical text,
           A1 A2 A3 A4
       where the numbers serve only to  distinguish  the  identical  copies,  there  are  7  non-
       overlapping  identities:  A1=A2,  A1=A3,  A1=A4,  A2=A3,  A2=A4, A3=A4, and A1A2=A3A4.  Of
       these, only 3 are meaningful: A1=A2, A2=A3, and A3=A4.  And for  a  table  with  20  lines
       identical  to  each  other,  not  unusual in a program text, there are 715 non-overlapping
       identities, of which at most 19 are meaningful.  Reporting all  715  of  them  is  clearly
       unacceptable.

       This is remedied by sim's search cycle: for each position in the text, the largest segment
       is found of which a non-overlapping copy occurs in the text following  it.   That  segment
       and  its  copy  are  then  reported  and  scanning  resumes at the position just after the
       segment.  For the above example this results in the two identities  A1A2=A3A4  and  A3=A4,
       which  is  quite  satisfactory,  and for N identical segments roughly 2 log N messages are
       given.

       This also works out well when the four identical segments are in different files:
           File1: A1
           File2: A2
           File3: A3
           File4: A4
       Now combined segments like A1A2 do not occur, and the  algorithm  finds  the  runs  A1=A2,
       A2=A3, and A3=A4, for a total of N-1 runs, all informative.

   Calculating Percentages
       The  above  approach  is unsuitable for obtaining the exact percentage of a file's content
       that can be found in another file, although indicative results can be obtained.  Obtaining
       exact  percentages  requires  comparing  each file pair in isolation; this is what the -ae
       options do.  Under the -ae options a segment File3:A3, recognized in File4, will again  be
       recognized in File1 and File2.  In the example above it produces the runs
           File1:A1=File2:A2
           File1:A1=File3:A3
           File1:A1=File4:A4
           File2:A2=File3:A3
           File2:A2=File4:A4
           File2:A2=File1:A1
           File3:A3=File4:A4
           File3:A3=File1:A1
           File3:A3=File2:A2
           File4:A4=File1:A1
           File4:A4=File2:A2
           File4:A4=File3:A3
       for a total of N(N-1) runs.

       When the -e option is used alone.  sim will find the following runs:
           File1:A1=File2:A2
           File1:A1=File3:A3
           File1:A1=File4:A4
           File2:A2=File3:A3
           File2:A2=File4:A4
           File3:A3=File4:A4
       for  a  total  of  ½N(N-1)  runs, thus missing half the percentage contributions; in fact,
       File4 is found to have 0% in common with the other files.

       If, however, the -a option is used alone.  sim finds the following runs:
           File1:A1=File2:A2
           File2:A2=File3:A3
           File3:A3=File4:A4
           File4:A4=File1:A1
       for a total of N runs. This setting misses many of the percentage contributions, but finds
       something for every file.

TIME AND SPACE REQUIREMENTS

       Care  has  been  taken  to  keep  the time requirements of all internal processes (almost)
       linear in the lengths of the input files, by using various tables.

       The time requirements are quadratic in the number of files.  This means that, for example,
       one 64 MB file processes much faster than 8000 8 kB files.

       The  program  requires  6 bytes of memory for each token in the input; 2 bytes per newline
       (not when doing percentages); and 80 bytes for each run found.

EXAMPLES

       The call
               sim_c *.c
       highlights duplicate C code in the directory.  (It is useful  to  remove  generated  files
       first.)  A call of
               sim_c -f -F *.c
       can pinpoint the duplicate code further.

       A call
               sim_text -peu -S new/* "|" old/*
       compares  each  file  in new/* to each file in old/*, and if any pair has more that 20% in
       common, that fact is reported.  Usually a similarity of 30% or more is significant;  lower
       than 20% is probably coincidence; and in between is doubtful.

       The  u  in  -peu  causes  the output to be unbuffered (and unsorted), so if the program is
       stopped for running out of time, any results already found are not lost.

       For large data sets, using -pu rather than -peu may do the job much more quickly, but less
       accurately.

       The  |  can  be  used as a separator instead of / on systems where the / as a command-line
       parameter gets mangled by the command interpreter.

       These calls are good for plagiarism detection.

BUGS

       Unbuffered, unsorted output is not available for text output, only for percentage output.

AUTHOR

       Dick Grune, Vrije Universiteit, Amsterdam; dick@dickgrune.com.

                                            2017/11/23                                     SIM(1)