Provided by: similarity-tester_2.70-1_amd64

NAME

       sim - find similarities in C, Java, Pascal, Modula-2, Lisp, Miranda, or text files

SYNOPSIS

       sim_c [ -[defFiMnpPRsST] -r N -t N -w N -o F ] file ... [ / [ file ... ] ]
       sim_c ...
       sim_java ...
       sim_pasc ...
       sim_m2 ...
       sim_lisp ...
       sim_mira ...
       sim_text ...

DESCRIPTION

       Sim_c  reads  the  C  files file ...  and looks for segments of text that are similar; two
       segments of program text are similar if they differ only in layout, comments, identifiers
       and the contents of numbers, strings and characters.  If any runs of sufficient length are
       found, they are reported on standard output; the number of significant tokens in  the  run
       is given between square brackets.
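
       The notion of similarity described above can be illustrated with a small Python sketch.
       It is not sim's own lexer (the BUGS section mentions lex(1)); the KEYWORDS set, the
       regular expression and the placeholder names ID, NUM, STR and CHR are assumptions made
       for this example only.  Two fragments count as similar here when their normalized token
       lists are equal.

```python
import re

# Hypothetical, deliberately tiny keyword list; a real C lexer knows many more.
KEYWORDS = {"if", "else", "while", "for", "return", "int", "char", "void"}

# One alternative per token class; comments are matched so they can be dropped.
TOKEN_RE = re.compile(r"""
    (?P<comment>/\*.*?\*/|//[^\n]*)
  | (?P<string>"(?:\\.|[^"\\])*")
  | (?P<char>'(?:\\.|[^'\\])*')
  | (?P<number>\d+(?:\.\d+)?)
  | (?P<word>[A-Za-z_]\w*)
  | (?P<punct>[^\s\w])
""", re.VERBOSE | re.DOTALL)


def normalize(source):
    """Return the token list with layout, comments, identifiers and literals abstracted."""
    tokens = []
    for match in TOKEN_RE.finditer(source):
        kind, text = match.lastgroup, match.group()
        if kind == "comment":
            continue                      # comments do not count
        if kind == "string":
            tokens.append("STR")          # contents of strings are ignored
        elif kind == "char":
            tokens.append("CHR")          # contents of characters are ignored
        elif kind == "number":
            tokens.append("NUM")          # contents of numbers are ignored
        elif kind == "word":
            tokens.append(text if text in KEYWORDS else "ID")
        else:
            tokens.append(text)           # punctuation is kept as is
    return tokens


a = "int total = 0; /* sum */ while (i < 10) total += i;"
b = "int  s=42;\nwhile(k<7)   s+=k; // loop"
print(normalize(a) == normalize(b))       # True
```

       The two fragments differ in layout, comments, identifier names and number values, yet
       normalize to the same token list.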

       Sim_java  does  the  same for Java, sim_pasc for Pascal, sim_m2 for Modula-2, sim_mira for
       Miranda, and sim_lisp for Lisp.  Sim_text works on  arbitrary  text;  it  is  occasionally
       useful on shell scripts.

       The  program  can  be  used  for  finding  copied  pieces of code in purportedly unrelated
       programs (with -s or -S), or for finding accidentally duplicated code in  larger  projects
       (with -f).

       If  a  /  is present between the input files, the latter are divided into a group of "new"
       files (before the /) and a group of "old" files; if there is no /, all  files  are  "new".
       Old files are never compared to each other.
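
       The grouping rule can be sketched at the level of file pairs.  The Python function below
       is only an illustration under stated assumptions: comparison_pairs and its parameter
       new_vs_old_only are hypothetical names, self-comparison is left out (as under -s), and
       the real program compares runs of tokens rather than whole files.

```python
def comparison_pairs(args, new_vs_old_only=False):
    """List the (file, file) pairs compared for an argument list with an optional "/"."""
    if "/" in args:
        split = args.index("/")
        new, old = args[:split], args[split + 1:]
    else:
        new, old = list(args), []         # no "/": all files are "new"

    pairs = []
    for index, name in enumerate(new):
        if not new_vs_old_only:
            # each new file against each subsequent new file
            pairs += [(name, other) for other in new[index + 1:]]
        # each new file against every old file
        pairs += [(name, other) for other in old]
    return pairs                          # old files are never paired with each other


print(comparison_pairs(["a.c", "b.c", "/", "old1.c", "old2.c"]))
```

       Passing new_vs_old_only=True models the effect of the -S option described below: pairs
       within the "new" group disappear and only new/old pairs remain.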

       Since  the  similarity  tester reads the files several times, it cannot read from standard
       input.

       There are the following options:

       -d     The output is in a diff(1)-like format instead of the default 2-column format.

       -e     Each file is compared to each file in isolation; this will  find  all  similarities
              between all texts involved, regardless of duplicates.

       -f     Runs  are  restricted  to segments with balancing parentheses, to isolate potential
              routine bodies (not in text).

       -F     The names of routines in calls are required to match exactly (not in text).

       -i     The names of the files to be compared are read from  standard  input,  including  a
              possible  /; the file names must be one to a line.  This option allows a very large
              number of file names to be specified; it differs from the @  facility  provided  by
              some  compilers  in  that it handles file names only, and does not recognize option
              arguments.

       -M     Memory usage information is displayed on standard error output.

       -n     Similarities found are only summarized, not displayed.

       -o F   The output is written to the file named F.

       -p     The output is given in similarity percentages; see below; implies -e and -s.

       -P     As -p but more extensive; implies -e and -s.

       -r N   The minimum run length is set to N units; the  default  is  24  tokens,  except  in
              sim_text, where it is 8 words.

       -R     Directories  in  the input list are entered recursively, and all files they contain
              are involved in the comparison.

       -s     The contents of a file are not compared to itself (-s for "not self").

       -S     The contents of the new files are compared to the old  files  only  -  not  between
              themselves.

       -t N   In combination with the -p option, sets the threshold (in percent) below which
              similarities will not be reported; the default is 1, except in sim_text,  where  it
              is 20.

       -T     A more terse and uniform form of output is produced, which may be more suitable for
              postprocessing.

       -w N   The page width used is set to N columns; the default is 80.

       --     (A secret option, which prints the input as the similarity  checker  sees  it,  and
              then stops.)

       The -p option results in lines of the form
               F consists for x % of G material
       meaning that x % of F's text can also be found in G.  Note that this relation is not
       symmetric; it is quite possible for 100 % of one file to be text from another file
       while only 1 % of that other file is text from the first, if their lengths differ
       enough.  Each file is reported only once in the position of the F in
       the above line.  This simplifies the identification of a set of files A[1] ... A[n], where
       the concatenation of these files is also present.  This restriction can be lifted by using
       the -P option instead.  A threshold can be set using the -t option; this option is ignored
       under -P.  Note that the granularity of the recognized text is still governed  by  the  -r
       option or its default.
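
       The asymmetry of the percentages can be made concrete with a short Python sketch.  It is
       not how sim computes its figures; percent_of and min_run are names invented here, and
       the function simply counts the tokens of one file that lie in some run of at least
       min_run tokens that also occurs in the other file.

```python
def percent_of(f_tokens, g_tokens, min_run=3):
    """Percentage of f_tokens covered by runs of >= min_run tokens also present in g_tokens."""
    # All windows of min_run consecutive tokens that occur in G.
    g_windows = {
        tuple(g_tokens[i:i + min_run])
        for i in range(len(g_tokens) - min_run + 1)
    }
    covered = [False] * len(f_tokens)
    for i in range(len(f_tokens) - min_run + 1):
        if tuple(f_tokens[i:i + min_run]) in g_windows:
            for k in range(i, i + min_run):
                covered[k] = True         # token k of F also occurs in G within a run
    return 100 * sum(covered) // len(f_tokens) if f_tokens else 0


short = list("abcd")                                  # 4 tokens
long = list("abcd") + [str(i) for i in range(36)]     # the same 4 tokens plus 36 others

print(percent_of(short, long))   # 100
print(percent_of(long, short))   # 10
```

       The short file consists entirely of material from the long one, while only a tenth of
       the long file comes from the short one; min_run plays the role that the -r option plays
       for the granularity of the recognized text.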

       Sim_text accepts  s p a c e d   t e x t  as normal text.

       The  program can handle UNICODE file names under Windows.  This is relevant only under the
       -R option, since there is no way to give UNICODE file names from the command line.

       Care has been taken to keep all internal processes linear in the length of the input, with
       the  exception of the matching process which is almost linear, using a hash table; various
       other tables are used for speed-up.  If, however, there  is  not  enough  memory  for  the
       tables, they are discarded in order of unimportance, under which conditions the algorithms
       revert to their quadratic nature.

EXAMPLES

       The call
               sim_c *.c
       highlights duplicate code in the directory.  (It  is  useful  to  remove  generated  files
       first.)  A call
               sim_c -f -F *.c
       can pinpoint the duplicates further.

       A call
               sim_text -e -p -s new/* / old/*
       compares  each  file  in new/* to each subsequent file in new/* and old/*, and if any pair
       has more than 20% in common, that fact is reported.  Usually a similarity of 30%  or  more
       is significant; lower than 20% is probably coincidence; and in between is doubtful.

       A call
               sim_text -e -n -s -r100 new/* / old/*
       compares  the same files, and reports large common segments.  Both approaches are good for
       plagiarism detection.

LIMITATIONS

       Repetitive input is the bane of similarity checking.  If  we  have  a  file  containing  4
       copies of similar text,
           A1 A2 A3 A4
       where  the numbers serve only to distinguish the similar copies, there are 7 similarities:
       A1=A2, A1=A3, A1=A4, A2=A3, A2=A4, A3=A4, and A1A2=A3A4, even discarding  the  overlapping
       A1A2A3=A2A3A4.  Of these, only 3 are meaningful: A1=A2, A2=A3, and A3=A4.  And for a table
       with 20 lines similar to each other, not unusual in a program, there are 715 similarities,
       of which at most 19 are meaningful.  Reporting all 715 of them is clearly unacceptable.
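
       The figures 7 and 715 can be reproduced with a few lines of Python.  The counting rule
       used here is an assumption inferred from the example above: a similarity is a pair of
       equal-length, non-overlapping runs of whole copies, with the first run to the left of
       the second.

```python
def count_similarities(n):
    """Count pairs of equal-length, non-overlapping runs among n mutually similar copies."""
    total = 0
    for length in range(1, n // 2 + 1):               # run length in copies
        for first in range(0, n - 2 * length + 1):    # start of the first run
            # the second run starts at or after the end of the first and must fit
            for second in range(first + length, n - length + 1):
                total += 1
    return total


print(count_similarities(4))    # 7
print(count_similarities(20))   # 715
```

       Under this rule the count grows roughly with the cube of n, whereas the number of
       meaningful similarities between neighbouring copies is only n - 1.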

       To remedy this, similarities are found as follows: for each position in the text, the
       largest segment starting there that has a non-overlapping copy in the text following it
       is found.  That segment and its copy are reported and scanning resumes at the position
       just after the segment.  For the above example this results in the similarities  A1A2=A3A4
       and  A3=A4, which is quite satisfactory, and for N similar segments roughly log N messages
       are given.
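
       The scanning heuristic can be sketched in Python as follows.  This is a brute-force
       illustration, quadratic or worse, and not the hash-table implementation mentioned in the
       DESCRIPTION; find_runs and min_run are hypothetical names, and each similar copy is
       modelled as the three tokens a, b, c.

```python
def find_runs(tokens, min_run=3):
    """Report (position, copy_position, length) triples using the scanning heuristic."""
    reports = []
    n = len(tokens)
    pos = 0
    while pos < n:
        best_len, best_copy = 0, -1
        for copy in range(pos + 1, n):
            # Extend the match, but never past the start of the copy,
            # so that segment and copy do not overlap.
            length = 0
            while (copy + length < n
                   and length < copy - pos
                   and tokens[pos + length] == tokens[copy + length]):
                length += 1
            if length > best_len:
                best_len, best_copy = length, copy
        if best_len >= min_run:
            reports.append((pos, best_copy, best_len))
            pos += best_len               # resume just after the reported segment
        else:
            pos += 1
    return reports


# Four similar copies A1 A2 A3 A4, three tokens each.
print(find_runs(list("abcabcabcabc")))    # [(0, 6, 6), (6, 9, 3)]
```

       The two reports correspond to A1A2=A3A4 (six tokens at positions 0 and 6) and A3=A4
       (three tokens at positions 6 and 9), matching the result described above.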

       A drawback of this heuristic is that the output is sensitive to the  order  of  the  input
       files.  If we have two files
           file1 = A1, file2 = A2A3
       then  the order "file1 file2" gives "A1=A2, A2=A3" and "file2 file1" gives "A2=A3, A3=A1";
       but both reports convey the same information.

BUGS

       Since it uses lex(1) on some  systems,  it  may  crash  on  any  weird  construction  that
       overflows lex's internal buffers.

AUTHOR

       Dick Grune, Vrije Universiteit, Amsterdam; dick@dickgrune.com.

                                            2012/05/02                                     SIM(1)