
NAME

       sim - find similarities in C, Java, Pascal, Modula-2, Lisp, Miranda, or text files

SYNOPSIS

       sim_c [ -[defFiMnpPRsST] -r N -t N -w N -o F ] file ... [ / [ file ... ] ]
       sim_java ...
       sim_pasc ...
       sim_m2 ...
       sim_lisp ...
       sim_mira ...
       sim_text ...

DESCRIPTION

       Sim_c reads the C files file ...  and looks for segments of text that are similar; two segments of
       program text are similar if they differ only in layout, comments, identifiers, and the contents of
       numbers, strings, and characters.  If any runs of sufficient length are found, they are reported on
       standard output; the number of significant tokens in the run is given between square brackets.

       Sim_java does the same for Java, sim_pasc for Pascal, sim_m2 for  Modula-2,  sim_mira  for  Miranda,  and
       sim_lisp for Lisp.  Sim_text works on arbitrary text; it is occasionally useful on shell scripts.

       The  program  can be used for finding copied pieces of code in purportedly unrelated programs (with -s or
       -S), or for finding accidentally duplicated code in larger projects (with -f).

       If a / is present between the input files, the input files are divided into a group of "new" files
       (before the /) and a group of "old" files (after it); if there is no /, all files are "new".  Old files
       are never compared to each other.
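
       For example, to check a set of new submissions against an existing archive only (a sketch; the directory
       names submissions and archive are hypothetical), one could call
                sim_c -S submissions/*.c / archive/*.c
       which compares each new file to the old files but not to the other new files.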

       Since the similarity tester reads the files several times, it cannot read from standard input.

       There are the following options:

       -d     The output is in a diff(1)-like format instead of the default 2-column format.

       -e     Each file is compared to each file in isolation; this will find all similarities between all texts
              involved, regardless of duplicates.

       -f     Runs are restricted to segments with balanced parentheses, to isolate potential routine bodies
              (not in text).

       -F     The names of routines in calls are required to match exactly (not in text).

       -i     The names of the files to be compared are read from standard input, including a possible /; the
              file names must be one to a line.  This option allows a very large number of file names to be
              specified; it differs from the @ facility provided by some compilers in that it handles file
              names only, and does not recognize option arguments.  An example is given after this option list.

       -M     Memory usage information is displayed on standard error output.

       -n     Similarities found are only summarized, not displayed.

       -o F   The output is written to the file named F.

       -p     The output is given in similarity percentages; see below; implies -e and -s.

       -P     As -p but more extensive; implies -e and -s.

       -r N   The  minimum  run length is set to N units; the default is 24 tokens, except in sim_text, where it
              is 8 words.

       -R     Directories in the input list are entered recursively, and all files they contain are involved  in
              the comparison.

       -s     The contents of a file are not compared to the file itself (-s for "not self").

       -S     The contents of the new files are compared to the old files only - not between themselves.

       -t N   In combination with the -p option, sets the threshold (in percent) below which similarities will
              not be reported; the default is 1, except in sim_text, where it is 20.

       -T     A  more  terse  and  uniform  form  of  output  is  produced,  which  may  be  more  suitable  for
              postprocessing.

       -w N   The page width used is set to N columns; the default is 80.

       --     (A secret option, which prints the input as the similarity checker sees it, and then stops.)
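
       As an illustration of the -i option (a sketch; the source tree src and the output file report.txt are
       hypothetical), a list of file names produced by find(1) can be fed to the program:
                find src -name '*.c' -print | sim_c -i -f -o report.txt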

       The -p option results in lines of the form
               F consists for x % of G material
       meaning that x % of F's text can also be found in G.  Note that this relation is not symmetric; it is in
       fact quite possible for one file to consist for 100 % of text from another file, while the other file
       consists for only 1 % of text from the first file, if their lengths differ enough.  Each file is reported
       only once in the position of the F in the above line.  This simplifies the identification  of  a  set  of
       files  A[1]  ...  A[n],  where the concatenation of these files is also present.  This restriction can be
       lifted by using the -P option instead.  A threshold can be set  using  the  -t  option;  this  option  is
       ignored under -P.  Note that the granularity of the recognized text is still governed by the -r option or
       its default.
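
       For example (a sketch; the file names and the percentage are invented), a call
                sim_text -p -t 30 essay1.txt essay2.txt
       might produce a line such as
                essay1.txt consists for 45 % of essay2.txt material
       meaning that 45 % of the text of essay1.txt also occurs in essay2.txt; with -t 30, percentages below 30
       are not reported.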

       Sim_text accepts  s p a c e d   t e x t  as normal text.

       The program can handle UNICODE file names under Windows.  This is relevant  only  under  the  -R  option,
       since there is no way to give UNICODE file names from the command line.

       Care  has been taken to keep all internal processes linear in the length of the input, with the exception
       of the matching process which is almost linear, using a hash table; various other  tables  are  used  for
       speed-up.   If,  however,  there  is  not  enough  memory  for the tables, they are discarded in order of
       unimportance, under which conditions the algorithms revert to their quadratic nature.

EXAMPLES

       The call
               sim_c *.c
       highlights duplicate code in the directory.  (It is useful to remove generated files first.)  A call
               sim_c -f -F *.c
       can pinpoint them further.
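
       To cover a whole source tree, directories can be passed with the -R option (a sketch; the directory name
       src is hypothetical).  The call
                sim_c -f -R src
       enters src recursively and compares all the files found there.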

       A call
               sim_text -e -p -s new/* / old/*
       compares each file in new/* to each subsequent file in new/* and old/*, and if any pair has more than 20%
       in common, that fact is reported.  Usually a similarity of 30% or more is significant; lower than 20% is
       probably coincidence; and in between is doubtful.

       A call
               sim_text -e -n -s -r100 new/* / old/*
       compares the same files, and reports large common segments.  Both  approaches  are  good  for  plagiarism
       detection.
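
       When the results are to be processed further by other tools, the -n, -T and -o options are useful (a
       sketch; the file name results.txt is hypothetical).  The call
                sim_java -f -n -T -o results.txt *.java
       writes a terse summary of the similarities found in the Java files to results.txt.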

LIMITATIONS

       Repetitive  input  is  the bane of similarity checking.  If we have a file containing 4 copies of similar
       text,
           A1 A2 A3 A4
       where the numbers serve only to distinguish the similar copies, there are 7 similarities:  A1=A2,  A1=A3,
       A1=A4, A2=A3, A2=A4, A3=A4, and A1A2=A3A4, even discarding the overlapping A1A2A3=A2A3A4.  Of these, only
       3 are meaningful: A1=A2, A2=A3, and A3=A4.  And for a table with 20 lines  similar  to  each  other,  not
       unusual  in a program, there are 715 similarities, of which at most 19 are meaningful.  Reporting all 715
       of them is clearly unacceptable.

       To remedy this, finding the similarities is performed as follows: For each  position  in  the  text,  the
       largest  segment is found, of which a non-overlapping copy occurs in the text following it.  That segment
       and its copy are reported and scanning resumes at the position just after the  segment.   For  the  above
       example  this  results  in  the  similarities A1A2=A3A4 and A3=A4, which is quite satisfactory, and for N
       similar segments roughly log N messages are given.

       A drawback of this heuristic is that the output is sensitive to the order of the input files.  If we have
       two files
           file1 = A1, file2 = A2A3
       then  the  order  "file1  file2"  gives  "A1=A2,  A2=A3" and "file2 file1" gives "A2=A3, A3=A1"; but both
       reports convey the same information.

BUGS

       Since it uses lex(1) on some systems, it may  crash  on  any  weird  construction  that  overflows  lex's
       internal buffers.

AUTHOR

       Dick Grune, Vrije Universiteit, Amsterdam; dick@dickgrune.com.

                                                   2012/05/02                                             SIM(1)