Provided by: sim4_0.0.20121010-8_amd64 bug

NAME

       sim4 - align an expressed DNA sequence with a genomic sequence

SYNOPSIS

       sim4 seqfile1 seqfile2 {[WXKCRDAPNB]=value}

DESCRIPTION

       sim4  is  a similarity-based tool for aligning an expressed DNA sequence (EST, cDNA, mRNA)
       with a genomic sequence for the gene. It also detects  end  matches  when  the  two  input
       sequences  overlap  at  one  end  (i.e., the start of one sequence overlaps the end of the
       other). If seqfile2 is a database of sequences, the sequence in seqfile1 will  be  aligned
       with each of the sequences in seqfile2.

       sim4  employs  a  blast-based  technique  to  first  determine  the  basic matching blocks
       representing the "exon cores". In this first stage, it detects all possible exact  matches
       of  W-mers  (i.e.,  DNA  words  of  size  W) between the two sequences and extends them to
       maximal scoring gap-free segments. In the second stage, the exon cores are  extended  into
       the  adjacent as-yet-unmatched fragments using greedy alignment algorithms, and heuristics
       are used to favor configurations that conform to the splice-site recognition signals  (GT-
       AG,  CT-AC).  If  necessary, the process is repeated with less stringent parameters on the
       unmatched fragments.

       By default, sim4 searches both strands and reports the best match, measured by the  number
       of  matching  nucleotides found in the alignment. The R command line option can be used to
       restrict the search to one orientation (strand) only.

       Currently, five major alignment display options are supported, controlled by the A option.
       By  default  (A=0), only the endpoints, overall similarity, and orientation of the introns
       are reported. An arrow sign (`->' or `<-') indicates the orientation of the intron (`+' or
       `-' strand), when the signals flanking the intron have three or more position matches with
       either the GT-AG or the CT-AC splice recognition signals. When the same number of  matches
       is  found  for  both orientations, the intron is reported as ambiguous, and represented by
       `--'. The sign `==' marks the absence from the alignment of a cDNA  fragment  starting  at
       that position. Alternative formats (lav-block format, text, PipMaker-type `exons file', or
       certain combinations of these options) can be requested by specifying  a  different  value
       for A.

       If  the  P  option  is specified with a non-zero value, sim4 will remove any 3'-end poly-A
       tails that it detects in the alignment.

       Occasionally, sim4 may miss an internal  exon  when  surrounded  by  very  large  introns,
       typically  longer  than  100 Kb. When this is suspected, the H option can be used to reset
       the exons' weight to compensate for the intron gap penalty.

       Ambiguity codes are by default allowed  in  sequence  data,  but  sim4  treats  them  non-
       differentially.  If  desired,  the  B  command  option  can restrict the set of acceptable
       characters to A,C,G,T,N and X only.

       sim4 compares the lengths of the input sequences to distinguish between the cDNA (`short')
       and the genomic (`long') components in the comparison. When seqfile2 contains a collection
       of sequences, the first entry in the file will be used to determine the type of  this  and
       all subsequent comparisons.

       In  the description below, the term MSP denotes a Maximal Segment Pair, that is, a pair of
       highly similar fragments in the two sequences, obtained during the blast-like procedure by
       extending a W-mer hit by matches and perhaps a few mismatches.

OPTIONS

       The  algorithm  parameters  (included  in  the first two sections below) have already been
       tuned and do not normally require adjustment by the user.

       Parameters internal to the blast-like procedure:

       W      Sets the word size for blast hits in the first stage of the algorithm. The  default
              value  is  12,  but it can be increased for a more stringent search or decreased to
              find weaker matches.

       X      Controls the limits for terminating word extensions in the blast-like stage of  the
              algorithm. The default value is 12.

       K      Sets  the  threshold  for  the  MSP scores when determining the basic `exon cores',
              during the first stage of the algorithm. (If this  option  is  not  specified,  the
              threshold  is  computed  from  the  lengths  of  the  sequences,  using statistical
              criteria.) For example, a good value for genomic sequences in the range  of  a  few
              hundred  Kb is 16. To avoid spurious matches, however, a larger value may be needed
              for longer sequences.

       C      Sets the threshold for the MSP scores when aligning the as-yet-unmatched fragments,
              during  the  second stage of the algorithm. By default, the smaller of the constant
              12 and a statistics-based threshold is chosen.

       Additional algorithm parameters:

       D      Sets the bound for the "diagonal" distance within consecutive MSPs in an exon.  The
              default value is 10.

       Context parameters:

       R      Specifies  the  direction  of  the  search. If R=0, only the "+" (direct) strand is
              searched. If R=1, only the "-" (reverse complement) matches are sought. By  default
              (R=2),  sim4  searches  both  strands  and  reports the best match, measured by the
              number of matching pairs in the alignment.

       A      Specifies the format of the output: exon endpoints only (A=0), exon  endpoints  and
              boundaries  of  the coding region (CDS) in the genomic sequence, when specified for
              the input mRNA (A=5), alignment text (A=1), alignment in lav-block format (A=2), or
              both  exon endpoints and alignment text (A=3 or A=4). If a reverse complement match
              is found, A=0,1,2,3,5 will give its position  in  the  "+"  strand  of  the  longer
              sequence  and the "-" strand of the shorter sequence. A=4 will give its position in
              the "+" strand of the first sequence (seqfile1) and the "-" strand  of  the  second
              sequence  (seqfile2), regardless of which sequence is longer. The A=5 option can be
              used with the S command line option to specify the endpoints  of  the  CDS  in  the
              mRNA, and produces output in the `exons file' format required by PipMaker.

       P      Specifies  whether  or  not the program should report the fragment of the alignment
              containing the poly-A tail (if found). By default (P=0) the alignment is  displayed
              as computed, but specifying a non-zero value will request sim4 to remove the poly-A
              tails. When this feature is enabled, all display  options  produce  additional  lav
              alignment headers.

       H      Resets  the MSPs' weight to compensate for very large introns. The default value is
              H=500, but some introns larger than 100 Kb may  require  higher  values,  typically
              between  1000  and  2500. This option should be used cautiously, generally in cases
              where an unmatched internal portion of the cDNA may disguise a missed exon within a
              very  large intron. It is not recommended for ESTs, where they may produce spurious
              exons.

       N      Requests an additional search for small marginal exons (N=1) guided by the  splice-
              site  recognition  signals.  This  option can be used when a high accuracy match is
              expected. The default value is N=0, specifying no additional search.

       B      Controls the set of characters allowed in the input sequences.  By  default  (B=1),
              ambiguity  characters (ABCDGHKMNRSTVWXY) are allowed. By specifying B=0, the set of
              acceptable characters is restricted to A,C,G,T,N and X only.

       S      Allows the user to specify the endpoints of the CDS in the  input  mRNA,  with  the
              syntax:  S=n1..n2.  This option is only available with the A=5 flag, which produces
              output in the format required by PipMaker. Alternatively, the CDS coordinates could
              appear in a construct CDS=n1..n2 in the FastA header of the mRNA sequence. When the
              second file is an mRNA database, the command line specification for  the  CDS  will
              apply to the first sequence in the file only.

EXAMPLES

       sim4 est genomic

       sim4 genomic estdb

       sim4 est genomic A=1 P=1

       sim4 est1 est2 R=1

       sim4 mRNA genomic A=5 S=123..1020

       sim4 mouse_cDNA human_genomic K=15 C=11 A=3 W=10

AUTHORS

       sim4 was written by Liliana Florea <florea@gwu.edu> and Scott Schwartz.

       This  manual  page  was  written by Nelson A. de Oliveira <naoliv@gmail.com>, based on the
       online documentation  at  http://globin.cse.psu.edu/html/docs/sim4.html,  for  the  Debian
       project (but may be used by others).

                                 Wed, 03 Aug 2005 18:40:58 -0300                          SIM4(1)