Provided by: exonerate_2.4.0-5_amd64

NAME

       exonerate - a generic tool for sequence comparison

SYNOPSIS

       exonerate [ options ] <query path> <target path>

DESCRIPTION

       exonerate is a general tool for sequence comparison.

       It  uses  the C4 dynamic programming library.  It is designed to be both general and fast.
       It can produce either gapped or ungapped alignments, according to a variety  of  different
       alignment  models.   The  C4  library allows sequence alignment using a reduced space full
       dynamic programming implementation, but also allows  automated  generation  of  heuristics
       from  the  alignment  models,  using  bounded  sparse  dynamic  programming, so that these
       alignments may also be rapidly generated.  Alignments  generated  using  these  heuristics
       will  represent  a  valid  path  through  the  alignment model, yet (unlike the exhaustive
       alignments), the results are not guaranteed to be optimal.

CONVENTIONS

       A number of conventions (and idiosyncrasies) are used within exonerate.  An  understanding
       of them facilitates interpretation of the output.

       Coordinates
              An  in-between  coordinate  system is used, where the positions are counted between
              the symbols, rather than on the symbols.  This numbering scheme starts  from  zero.
              This numbering is shown below for the sequence "ACGT":

               A C G T
              0 1 2 3 4

              Hence  the  subsequence  "CG"  would  have  start=1,  end=3,  and  length=2.   This
              coordinate system is used internally in exonerate, and for all the  output  formats
              produced  with  the exception of the "human readable" alignment display and the GFF
              output where convention and standards dictate otherwise.
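
              As a minimal sketch of this convention: Python slices happen to use the same
              in-between scheme, so the "ACGT" example can be checked directly.  The helper
              to_one_based is a hypothetical name introduced here for illustration; it shows
              the usual shift to the 1-based inclusive coordinates that formats such as GFF use.

```python
def to_one_based(start, end):
    """Convert an in-between interval to 1-based inclusive coordinates."""
    # The first covered symbol sits just after boundary `start`;
    # the last covered symbol sits just before boundary `end`.
    return start + 1, end


seq = "ACGT"
start, end = 1, 3                 # the subsequence "CG" from the text

print(seq[start:end])             # Python slices use in-between positions
print(end - start)                # length of the subsequence
print(to_one_based(start, end))   # 1-based inclusive equivalent
```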

       Reverse Complements
              When an alignment is  reported  on  the  reverse  complement  of  a  sequence,  the
              coordinates are simply given on the reverse complement copy of the sequence.  Hence
              positions on the sequences are never negative.  Generally, the  forward  strand  is
              indicated  by  '+',  the  reverse  strand  by '-', and an unknown or not-applicable
              strand (as in the case of a protein sequence) is indicated by '.'

       Alignment Scores
              Currently, only the raw alignment scores are displayed.  This score is simply the sum
              of the transition scores used in the dynamic programming.  For example, in the case of
              a Smith-Waterman alignment, this will be the sum of the substitution matrix  scores
              and the gap penalties.

GENERAL OPTIONS

       Most arguments have short and long forms.  The long forms are more likely to be stable over
       time, and hence should be used in scripts which call exonerate.

       -h | --shorthelp <boolean>
              Show help.  This will display a concise summary of the available options,  defaults
              and values currently set.

       --help <boolean>
              This  shows  all  the help options including the defaults, the value currently set,
              and the environment variable which may be used to set each parameter.   There  will
              be  an  indication  of  which  options  are  mandatory.   Mandatory options have no
              default, and must have a value supplied for exonerate to run.  If mandatory options
              are  used  in order, their flags may be skipped from the command line (see examples
              below).  Unlike this man page, the information from this option will always  be  up
              to date with the latest version of the program.

       -v | --version <boolean>
              Display the version number.  Also displays other information such as the build date
              and glib version used.

SEQUENCE INPUT OPTIONS

       Pairwise comparisons will  be  performed  between  all  query  sequences  and  all  target
       sequences.   Generally,  for  the  best  performance, shorter sequences (eg. ESTs, shotgun
       reads, proteins) should be used as the query sequences, and longer sequences (eg.  genomic
       sequences) should be used as the target sequences.

       -q | --query  <paths>
              Specify the query sequences required.  These must be in a FASTA format file.
              Single or multiple query sequences may be supplied.  Additionally, multiple FASTA
              files may be supplied following a single --query flag, or by using multiple --query
              flags.

       -t | --target <paths>
              Specify the target sequences required.  These must also be in a FASTA format file.  As
              with the query sequences, single or multiple target sequences and files may be
              supplied.  The target filename may be replaced by a server name and port number in
              the  form  of  hostname:port  when  using  exonerate-server.   See the man page for
              exonerate-server for more information on running exonerate in  client:server  mode.
              NEW(v2.4.0):  multiple  servers may now be used.  These will be queried in parallel
              if you have set the --cores option.  NEW(v2.4.0): If an input file is not  a  FASTA
              format  file,  it is assumed to contain a list of other fasta files, directories or
              servers (one per line).

       -Q | --querytype <dna | protein>
              Specify the query type to use.  If this is not supplied, the query type is  assumed
              to be DNA when the first sequence in the file contains more than 85% [ACGTN] bases.
              Otherwise, it is assumed to be peptide.  This option forces the query type as  some
              nucleotide and peptide sequences can fall either side of this threshold.
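
              The threshold rule above can be sketched as follows.  This is an illustration
              of the rule as described, not exonerate's own code: the function name and the
              case-insensitive counting are assumptions.

```python
def guess_sequence_type(sequence, threshold=0.85):
    """Return "dna" if more than `threshold` of the symbols are ACGTN."""
    if not sequence:
        raise ValueError("empty sequence")
    # Assumption: symbols are counted case-insensitively.
    dna_like = sum(1 for symbol in sequence.upper() if symbol in "ACGTN")
    return "dna" if dna_like / len(sequence) > threshold else "protein"


print(guess_sequence_type("ACGTACGTNN"))   # nucleotide sequence
print(guess_sequence_type("MKVLAAGIVG"))   # peptide sequence
print(guess_sequence_type("GATTACA"))      # a peptide that looks like DNA
```

              The last call shows why forcing the type can be necessary: a short peptide
              made up only of the residues G, A, T and C passes the DNA test.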

       -T | --targettype <dna | protein>
              Specify  the  target  type to use.  The same as --querytype (above), except that it
              applies to the target.  Specifying the sequence type will  avoid  the  overhead  of
              having  to  read the first sequence in the database twice (which may be significant
              with chromosome-sized sequences).

       --querychunkid <id>

       --querychunktotal <total>

       --targetchunkid <id>

       --targetchunktotal <total>
              These options facilitate running exonerate on compute farms, and avoid having to
              split  up  sequence databases into small chunks to run on different nodes.  If, for
              example, you wished to split the target database into three parts,  you  would  run
              three exonerate jobs on different nodes including the options:

              --targetchunkid 1 --targetchunktotal 3
              --targetchunkid 2 --targetchunktotal 3
              --targetchunkid 3 --targetchunktotal 3

              NB.  The granularity offered by this option only goes down to a single sequence, so
              when there are more chunks than sequences in the database, some processes  will  do
              nothing.
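
              When the jobs are generated by a script, the per-chunk options can be built
              programmatically.  The following is a minimal sketch that only builds the
              argument lists; the function name and the file names ests.fa and genome.fa
              are hypothetical, and how the jobs are submitted to each node is left out.

```python
def chunk_commands(query, target, total):
    """Build one exonerate argument list per target chunk (ids start at 1)."""
    return [
        ["exonerate", "--query", query, "--target", target,
         "--targetchunkid", str(chunk_id),
         "--targetchunktotal", str(total)]
        for chunk_id in range(1, total + 1)
    ]


# Hypothetical input files, split into three chunks as in the example above.
commands = chunk_commands("ests.fa", "genome.fa", 3)
for command in commands:
    print(" ".join(command))
```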

       -V | --verbose <int>
              Be  verbose  -  show  information  about what is going on during the analysis.  The
              default  is  1  (little  information),  the  higher  the  number  given,  the  more
              information  is  printed.   To  silence  all the default output from exonerate, use
              --verbose 0 --showalignment no --showvulgar no

ANALYSIS OPTIONS

       -E | --exhaustive <boolean>
              Specify whether or not exhaustive alignment should be used.  By  default,  this  is
              FALSE,  and alignment heuristics will be used.  If it is set to TRUE, an exhaustive
              alignment will be calculated.  This requires quadratic time, and will be much, much
              slower, but will provide the optimal result for the given model.
       -B | --bigseq <int>
              Perform  alignment  of  large  (multi-megabase)  sequences.   This  is  very memory
              efficient and fast when both sequences are chromosome-sized, but does not currently
              permit the use of a word neighbourhood (ie. exactly matching seeds only).
       --revcomp <boolean>
              Include  comparison  of  the  reverse  complement  of  the  query  and target where
              possible.  By default, this option is enabled,  but  when  you  know  the  gene  is
              definitely on the forward strand of the query and target, this option can halve the
              time taken to compute alignments.
       --forcescan <none | query | target>
              Force the FSM to scan the query sequence rather than the target.   This  option  is
              useful, for example, if you have a single piece of genomic sequence and you wish to
              compare it to the whole of dbEST.  By scanning the database, rather than the query,
              the  analysis  will  be  completed  much more quickly, as the overheads of multiple
              query FSM construction, multiple target reading and splice site predictions will be
              removed.   By  default, exonerate will guess the optimal strategy based on database
              sequence sizes.
       --saturatethreshold <number>
              When set to zero, this option does nothing.  Otherwise, once more than this  number
              of  words  (in  addition  to the expected number of words by chance) have matched a
              position on the query, the position on the query will be 'numbed' (further matches
              are ignored) for the current pairwise comparison.
       --customserver <command>
              When using exonerate in client:server mode with a non-standard server, this command
              allows you to send a custom command to the server.  This command  is  sent  by  the
              client  (exonerate)  before any other commands, and is provided as a way of passing
              parameters or other commands specific to the custom  server.   See  the  exonerate-
              server man page for more information on running exonerate in client:server mode.
       --cores <number>
              The number of cores/CPUs/threads that should be used.  On a multi-core or multi-CPU
              machine, increasing this amount allows alignment computations to run  in  parallel
              on  separate  CPUs/cores.  NB.  Generally, it is better to parallelise the analysis
              by splitting it up into separate  jobs,  but  this  option  may  prove  useful  for
              problems such as interactive single-gene queries.

FASTA DATABASE OPTIONS

       --fastasuffix <extension>
              If any of the inputs given with --query or --target are directories, then exonerate
              will recursively descend these directories, reading  all  files  ending  with  this
              suffix as fasta format input.

GAPPED ALIGNMENT OPTIONS

       -m | --model <alignment model>
              Specify the alignment model to use.  The models currently supported are:
              ungapped
                     The  simplest  type of model, used by default.  An appropriate model will be
                     selected automatically for the type of input sequences provided.
              ungapped:trans
                     This ungapped model includes translation of all frames of both the query and
                     target sequences.  This is similar to an ungapped tblastx type search.
              affine:global
                     This  performs  gapped  global  alignment,  similar  to the Needleman-Wunsch
                     algorithm, except with affine gaps.  Global alignment requires that both the
                     sequences in their entirety are included in the alignment.
              affine:bestfit
                     This  performs  a  best fit or best location alignment of the query onto the
                     target sequence.   The  entire  query  sequence  will  be  included  in  the
                     alignment,  but  only  the  best  location  for  its alignment on the target
                     sequence.
              affine:local
                     This is local alignment with affine gaps,  similar  to  the  Smith-Waterman-
                     Gotoh  algorithm.   A general-purpose alignment algorithm.  As this is local
                     alignment, any subsequence of the query and target sequence  may  appear  in
                     the alignment.
              affine:overlap
                     This  type of alignment finds the best overlap between the query and target.
                     The overlap alignment must include the start of the query or target and  the
                     end of the query or the target sequence, to align sequences which overlap at
                     the ends, or in the mid-section of a longer sequence.  This is the type  of
                     alignment frequently used in assembly algorithms.
              est2genome
                     This model is similar to the affine:local model, but it also includes intron
                     modelling on the target sequence to allow alignment of spliced to  unspliced
                     coding  sequences  for  both forward and reversed genes.  This is similar to
                     the alignment models used in programs such as EST_GENOME and sim4.
              ner    NERs are non-equivalenced regions - large regions  in  both  the  query  and
                     target which are not aligned.  This model can be used for protein alignments
                     where strongly conserved helix regions will be aligned, but weakly conserved
                     loop  regions  are not.  Similarly, this model could be used to look for co-
                     linearly conserved regions in comparison of genomic sequences.
              protein2dna
                     This model compares a protein sequence to a DNA sequence, incorporating  all
                     the appropriate gaps and frameshifts.
              protein2dna:bestfit
                     This  is  a  bestfit version of the protein2dna model, with which the entire
                     protein is included in the alignment.  It is currently only  available  when
                     using exhaustive alignment.
              protein2genome
                     This  model allows alignment of a protein sequence to genomic DNA.   This is
                     similar to the protein2dna model, with the addition of modelling of  introns
                     and intron phases.  This model is similar to those used by genewise.
              protein2genome:bestfit
                     This is a bestfit version of the protein2genome model, with which the entire
                     protein is included in the alignment.  It is currently only  available  when
                     using exhaustive alignment.
              coding2coding
                     This  model  is  similar  to  the ungapped:trans model, except that gaps and
                     frameshifts are allowed.  It is similar to a gapped tblastx search.
              coding2genome
                     This is similar to the est2genome model, except that the query  sequence  is
                     translated during comparison, allowing a more sensitive comparison.
              cdna2genome
                     This  combines  properties  of  the  est2genome and coding2genome models, to
                     allow modelling of a whole cDNA where a central coding region can be flanked
                     by non-coding UTRs.  When the CDS start and end are known they may be specified
                     using the --annotation option (see below) to permit only the correct coding
                     region to appear in the alignment.
              genome2genome
                     This  model  is  similar  to  the  coding2coding  model,  except introns are
                     modelled on both sequences.  (not working well yet)

       The short names u, u:t, a:g, a:b, a:l, a:o, e2g, ner, p2d, p2d:b, p2g, p2g:b, c2c, c2g,
       cd2g and g2g can also be used for specifying models.

       -s | --score <threshold>
              This  is  the  overall score threshold.  Alignments will not be reported below this
              threshold.  For heuristic alignments, the higher this threshold, the less time  the
              analysis will take.

       --percent <percentage>
              Report  only  alignments  scoring at least this percentage of the maximal score for
              each query.  For example, use --percent 90 to report alignments with 90% of the maximal
              score obtainable for that query.  This option is useful not only because it reduces
              the spurious matches in the output, but because it generates query-specific
              thresholds (unlike --score) for a set of queries of differing lengths, and will
              also speed up the search considerably.  NB.  with this option, it  is  possible  to
              have  a  cDNA match its corresponding gene exactly, yet still score less than 100%,
              due to the addition of the intron penalty scores, hence this option  must  be  used
              with caution.

       --showalignment <boolean>
              Show the alignments in a human-readable form.

       --showsugar <boolean>
              Display "sugar" output for ungapped alignments.  Sugar is Simple UnGapped Alignment
              Report, which displays ungapped alignments one-per-line.   The  sugar  line  starts
              with  the  string  "sugar:" for easy extraction from the output, and is followed by
              the following 9 fields in the order below:

              query_id        Query identifier
              query_start     Query position at alignment start
              query_end       Query position at alignment end
              query_strand    Strand of query matched
              target_id       |
              target_start    | the same 4 fields
              target_end      | for the target sequence
              target_strand   |
              score           The raw alignment score
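
              A sugar line can be split into these 9 fields with a few lines of code.  The
              sketch below assumes whitespace-separated fields and an integer score; the
              names Sugar and parse_sugar are hypothetical, and the sample line is made up
              for illustration rather than taken from real output.

```python
from collections import namedtuple

Sugar = namedtuple("Sugar", [
    "query_id", "query_start", "query_end", "query_strand",
    "target_id", "target_start", "target_end", "target_strand",
    "score",
])


def parse_sugar(line):
    """Parse one "sugar:" line into a Sugar record."""
    prefix = "sugar:"
    if not line.startswith(prefix):
        raise ValueError("not a sugar line")
    f = line[len(prefix):].split()
    if len(f) != 9:
        raise ValueError("expected 9 fields, found %d" % len(f))
    # Positions and the score are converted to integers (an assumption).
    return Sugar(f[0], int(f[1]), int(f[2]), f[3],
                 f[4], int(f[5]), int(f[6]), f[7], int(f[8]))


# Made-up line for illustration, not real exonerate output.
hit = parse_sugar("sugar: est1 0 120 + chr1 5000 5120 + 600")
print(hit.query_id, hit.query_end - hit.query_start, hit.score)
```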

       --showcigar <boolean>
              Show the alignments in "cigar" format.  Cigar is  a  Compact  Idiosyncratic  Gapped
              Alignment Report, which displays gapped alignments one-per-line.  The format starts
              with the same 9 fields as sugar output (see above), and is followed by a series  of
              <operation,  length>  pairs  where operation is one of match, insert or delete, and
              the length describes the number of times this operation is repeated.

       --showvulgar <boolean>
              Shows the alignments in "vulgar" format.  Vulgar is Verbose Useful Labelled Gapped
              Alignment Report.  This format also starts with the same 9 fields as sugar output
              (see above), and is followed by a series of  <label,  query_length,  target_length>
              triplets.  The label may be one of the following:

              M      Match
              C      Codon
              G      Gap
              N      Non-equivalenced region
              5      5' splice site
              3      3' splice site
              I      Intron
              S      Split codon
              F      Frameshift
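
              The triplet structure can be unpacked as sketched below.  This assumes the
              line starts with the string "vulgar:" (by analogy with sugar output) and that
              fields are separated by whitespace; parse_vulgar is a hypothetical name and
              the sample line is made up for illustration.

```python
VULGAR_LABELS = {
    "M": "Match", "C": "Codon", "G": "Gap", "N": "Non-equivalenced region",
    "5": "5' splice site", "3": "3' splice site", "I": "Intron",
    "S": "Split codon", "F": "Frameshift",
}


def parse_vulgar(line):
    """Split a vulgar line into its 9 leading fields and its triplets."""
    prefix = "vulgar:"
    if not line.startswith(prefix):
        raise ValueError("not a vulgar line")
    fields = line[len(prefix):].split()
    head, tail = fields[:9], fields[9:]
    if len(head) != 9 or len(tail) % 3 != 0:
        raise ValueError("malformed vulgar line")
    triplets = []
    for i in range(0, len(tail), 3):
        label = tail[i]
        if label not in VULGAR_LABELS:
            raise ValueError("unknown label: " + label)
        triplets.append((label, int(tail[i + 1]), int(tail[i + 2])))
    return head, triplets


# Made-up line for illustration, not real exonerate output.
line = ("vulgar: est1 0 200 + chr1 1000 1700 + 950 "
        "M 100 100 5 0 2 I 0 496 3 0 2 M 100 100")
head, triplets = parse_vulgar(line)
query_span = sum(q for _, q, _ in triplets)
target_span = sum(t for _, _, t in triplets)
print(query_span, target_span)
```

              In this sample the query lengths add up to query_end - query_start and the
              target lengths to target_end - target_start, which is a convenient sanity
              check when parsing.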

       --showquerygff <boolean>
              Report GFF output for features on the query sequence.  See
              http://www.sanger.ac.uk/Software/formats/GFF for more information.

       --showtargetgff <boolean>
              Report GFF output for features on the target sequence.

       --ryo <format>
              Roll-your-own output format.  This allows specification of  a  printf-esque  format
              line  which  is used to specify which information to include in the output, and how
              it is to be shown.  The format field may contain the following fields:

              %[qt][idlsSt]
                     For either {query,target}, report the
                     {id,definition,length,sequence,Strand,type}.  Sequences are reported in a
                     fasta-format like block (no headers).
              %[qt]a[bels]
                     For either {query,target} region which occurs in the alignment,  report  the
                     {begin,end,length,sequence}
              %[qt]c[bels]
                     For  either {query,target} region which occurs in the coding sequence in the
                     alignment, report the {begin,end,length,sequence}
              %s     The raw score
              %r     The rank (in results from a bestn search)
              %m     Model name
              %e[tism]
                     Equivalenced {total,id,similarity,mismatches} (ie. %em == (%et - %ei))
              %p[isS]
                     Percent  {id,similarity,Self}  over  the  equivalenced   portions   of   the
                     alignment.   (ie.  %pi  == 100*(%ei / %et)).  Percent Self is the score over
                     the equivalenced portions of the alignment  as  a  percentage  of  the  self
                     comparison score of the query sequence.
              %g     Gene orientation ('+' = forward, '-' = reverse, '.' = unknown)
              %S     Sugar block (the 9 fields used in sugar output, see above)
              %C     Cigar block (the fields of a cigar line after the sugar portion)
              %V     Vulgar block (the fields of a vulgar line after the sugar portion)
              %%     Expands to a percentage sign (%)
              \n     Newline
              \t     Tab
              \\     Expands to a backslash (\)
              \{     Open curly brace
              \}     Close curly brace
              {      Begin per-transition output section
              }      End per-transition output section
              %P[qt][sabe]
                     Per-transition output for {query,target} {sequence,advance,begin,end}
              %P[nsl]
                     Per-transition output for {name,score,label}

       This option is very useful and flexible.  For example, to report all the sections of query
       sequences which feature in alignments in fasta format, use:

       --ryo ">%qi %qd\n%qas\n"

       To output all the symbols and scores in an alignment, try something like:

       --ryo "%V{%Pqs %Pts %Ps\n}"
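
       When exonerate is called from a script, the format string has to reach it with the
       \t and \n sequences unexpanded.  The sketch below builds a tab-delimited format from
       fields listed above and parses one line of the resulting output.  The file names, the
       function name and the sample line (including the way the percentage is printed) are
       assumptions for illustration; the command is only built, not run.

```python
# Fields from the list above: query id, target id, raw score, percent id.
RYO_FIELDS = ["%qi", "%ti", "%s", "%pi"]

# Raw strings keep \t and \n as two-character sequences, so that
# exonerate, not Python, expands them.
ryo_format = r"\t".join(RYO_FIELDS) + r"\n"

# Hypothetical input files; the default alignment and vulgar output are
# switched off so that only the roll-your-own lines remain.
command = ["exonerate", "--showalignment", "no", "--showvulgar", "no",
           "--ryo", ryo_format, "query.fa", "target.fa"]


def parse_ryo_line(line):
    """Parse one output line produced with the format built above."""
    query_id, target_id, score, percent_id = line.rstrip("\n").split("\t")
    return {"query": query_id, "target": target_id,
            "score": int(score), "percent_id": float(percent_id)}


# Made-up output line for illustration.
sample = "est1\tchr1\t950\t98.50\n"
print(parse_ryo_line(sample))
```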

       -n | --bestn <number>
              Report the best N results for each query.  (Only results scoring better than the
              score threshold will be reported).  The option reduces the amount of output
              generated, and also allows exonerate to speed up the search.

       -S | --subopt <boolean>
              This  option  allows  for  the  reporting  of  (Waterman-Eggert  style)  suboptimal
              alignments.   (It  is  on  by  default.)   All  suboptimal  (ie.  non-intersecting)
              alignments will be reported for  each  pair  of  sequences  scoring  at  least  the
              threshold provided by --score.

              When  this  option  is used with exhaustive alignments, several full quadratic time
              passes will be required, so the running time will be considerably increased.

       -g | --gappedextension <boolean>
              Causes a gapped extension stage to be performed ie. dynamic programming is  applied
              in  arbitrarily  shaped  and  dynamically sized regions surrounding HSP seeds.  The
              extension threshold is controlled by the --extensionthreshold option.

              Although sometimes slower than BSDP, gapped  extension  improves  sensitivity  with
              weak, gap-rich alignments such as during cross-species comparison.

              NB. This option is now the default. Set it to false to revert to the old BSDP type
              alignments.  This option may be slower than BSDP for some large scale analyses with
              simple alignment models.

       --refine <strategy>
              Force  exonerate  to  refine  alignments  generated  by  heuristics  using  dynamic
              programming over larger regions.  This takes more time, but improves the quality of
              the final alignments.

              The strategies available for refinement are:

              none   The default - no refinement is used.
              full   An  exhaustive  alignment  is calculated from the pair of sequences in their
                     entirety.
              region DP is applied just to the region of the sequences covered by  the  heuristic
                     alignment.

       --refineboundary <size>
              Specify  an extra boundary to be included in the region subject to alignment during
              refinement by region.

VITERBI ALGORITHM OPTIONS

       -D | --dpmemory <Mb>
              The exhaustive alignment traceback  routines  use  a  Hughey-style  reduced  memory
              technique.   This  option  specifies  how  much  memory  will  be  used  for  this.
              Generally, the more memory is permitted here, the faster  the  alignments  will  be
              produced.

CODE GENERATION OPTIONS

       -C | --compiled <boolean>
              This  option  allows  disabling  of  generated code for dynamic programming.  It is
              mainly used during development of exonerate.  When set to FALSE,  an  "interpreted"
              version of the dynamic programming implementation is used, which is much slower.

HEURISTIC OPTIONS

       --terminalrangeint
       --terminalrangeext
       --joinrangeint
       --joinrangeext
       --spanrangeint
       --spanrangeext
              These options are used to specify the size of the sub-alignment regions to which DP
              is applied around the ends of the HSPs.  This can be  at  the  HSP  ends  (terminal
              range),  between  HSPs  (join  range),  or between HSPs which may be connected by a
              large region such as an intron or  non-equivalenced  region  (span  range).   These
              ranges  can be specified for a number of matches back onto the HSP (internal range)
              or out from the HSP (external range).

SEEDED DYNAMIC PROGRAMMING OPTIONS

       -x | --extensionthreshold <score>
              This is the amount by which the score will be allowed to degrade during SDP.   This
              is  the equivalent of the hspdropoff penalties, except it is applied during dynamic
              programming, not HSP extension.  Decreasing this parameter will increase the  speed
              of the SDP, and increasing it will increase the sensitivity.

       --singlepass  <boolean>
              By  default  the  suboptimal SDP alignments are reported by a singlepass algorithm,
              but may miss some suboptimal alignments that are close together.  This  option  can
              be  used  to  force  the use of a multipass suboptimal alignment algorithm for SDP,
              resulting in higher quality suboptimal alignments.

BSDP OPTIONS

       --joinfilter <limit>
              (experimental)

              Only consider this number of SARs for joining HSPs together.  The SARs with
              the highest potential for appearing in a high-scoring alignment are considered.
              This option is useful for limiting time and memory usage when searching unmasked data
              with  repetitive  sequences, but should not be set too low, as valid matches may be
              ignored.  Something like --joinfilter 32 seems to work well.

SEQUENCE OPTIONS

       --annotation <path>
              Specify basic sequence annotation  information.   This  is  most  useful  with  the
              cdna2genome  model,  but will work with other models.  The annotation file contains
              four fields per line:

              <id> <strand> <cds_start> <cds_length>

              Here is a simple example of such a file for 4 cDNAs:

              dhh.human.cdna + 308 1191
              dhh.mouse.cdna + 250 1191
              csn7a.human.cdna + 178 828
              csn7a.mouse.cdna + 126 828

              These annotation lines will also work when only the  first  two  fields  are  used.
              This  can  be  used  when  specifying which strand of a specific sequence should be
              included in a comparison.
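
              A minimal sketch of reading such a file, accepting both the four-field and the
              two-field forms described above.  The function name parse_annotation and the
              choice of None for missing CDS fields are assumptions for illustration.

```python
def parse_annotation(text):
    """Map each sequence id to (strand, cds_start, cds_length)."""
    records = {}
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue                      # skip blank lines
        if len(fields) == 4:
            seq_id, strand, cds_start, cds_length = fields
            records[seq_id] = (strand, int(cds_start), int(cds_length))
        elif len(fields) == 2:
            seq_id, strand = fields       # strand-only annotation
            records[seq_id] = (strand, None, None)
        else:
            raise ValueError("expected 2 or 4 fields: " + line)
    return records


example = """\
dhh.human.cdna + 308 1191
dhh.mouse.cdna + 250 1191
csn7a.human.cdna + 178 828
csn7a.mouse.cdna + 126 828
"""
annotation = parse_annotation(example)
print(annotation["dhh.human.cdna"])
```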

SYMBOL COMPARISON OPTIONS

       --softmaskquery <boolean>
              Indicate that the query is softmasked.  See the description below for --softmasktarget.
       --softmasktarget <boolean>
              Indicate that the target is softmasked.  In a softmasked sequence file, instead  of
              masking  regions by Ns or Xs they are masked by putting those regions in lower case
              (and with unmasked regions in upper case).  This option allows the  masking  to  be
              ignored  by some parts of the program, combining the speed of searching masked data
              with the sensitivity of searching unmasked data.  The fastasoftmask utility, which
              is supplied with exonerate, can be used for producing softmasked sequence from
              conventionally masked sequence.
       -d | --dnasubmat <name>
              Specify the substitution matrix to be used for DNA comparison.  This should be
              a path to a substitution matrix in the same format as that which is used by blast.
       -p | --proteinsubmat <name>
              Specify the substitution matrix to be used for protein comparison.  (Both DNA
              and protein substitution matrices are required for some types  of  analysis).   The
              use  of  the  special names, nucleic, blosum62, pam250, edit or identity will cause
              built-in substitution matrices to be used.

ALIGNMENT SEEDING OPTIONS

       -M | --fsmmemory <Mb>
              Specify the amount of memory to use for the FSM in heuristic  analyses.   exonerate
              multiplexes the query to accelerate large-throughput database queries.  This figure
              should always be less than the physical memory on the machine, but  when  searching
              large  databases,  generally,  the  more memory it is allowed to use, the faster it
              will go.
       --forcefsm <none | normal | compact>
              Force the use of more compact finite state  machines  for  analyses  involving  big
              sequences  and  large  word  neighbourhoods.   By  default,  exonerate  will pick a
              sensible strategy, so this option will rarely need to be set.
       --wordjump <int>
              The jump between query words used to yield the word neighbourhood.  If  set  to  1,
              every  word  is  used,  if  set  to  2, every other word is used, and if set to the
              wordlength, only non-overlapping words will  be  used.   This  option  reduces  the
              memory requirements when using very large query sequences, and makes the search run
              faster, but it also damages search sensitivity when high values are set.
       --wordambiguity <limit>
              This option may be  used  to  allow  alignment  seeds  containing  IUPAC  ambiguity
              symbols.   The  limit  is the maximum number of ambiguous words allowed at a single
              position.  If this limit is reached then the position is  not  used  for  alignment
              seeding.   Using  this  option  may  slow down a search.  For large datasets, it is
              recommended to use esd2esi --wordambiguity instead, as then the speed  overhead  is
              only  incurred  during  indexing, rather than during the database searching itself.
              NB. This option only works for IUPAC symbols in the target sequence.   Query  words
              containing IUPAC symbols are (currently) excluded from seeding.
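
              The effect of the limit can be sketched in Python as follows.  This is a sketch
              under assumptions: the name expand_word is hypothetical, and a position is taken to
              be skipped when the number of concrete words exceeds the limit, which may differ
              from how exonerate counts.

```python
from itertools import product

# Standard IUPAC nucleotide ambiguity codes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def expand_word(word, limit):
    """Return the concrete words for an ambiguous word, or [] if over the limit."""
    word = word.upper()
    count = 1
    for symbol in word:
        count *= len(IUPAC[symbol])
    # Assumption: the position is dropped when the expansion exceeds the limit.
    if count > limit:
        return []
    return ["".join(p) for p in product(*(IUPAC[s] for s in word))]


print(expand_word("ACRT", 4))   # R expands to A or G
print(expand_word("NNNN", 16))  # 256 concrete words, so the position is skipped
```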

AFFINE MODEL OPTIONS

       -o | --gapopen <penalty>
              This is the gap open penalty.
       -e | --gapextend <penalty>
              This is the gap extension penalty.
       --codongapopen <penalty>
              This is the codon gap open penalty.
       --codongapextend <penalty>
              This is the codon gap extension penalty.

NER OPTIONS

       --minner <length>
              Minimum NER length allowed.
       --maxner <length>
              Maximum NER length allowed.  NB. this option only affects heuristic alignments.
       --neropen <penalty>
              Penalty for opening a non-equivalenced region.

INTRON MODELLING OPTIONS

       --minintron <length>
              Minimum  intron  length  limit.  NB. this option only affects heuristic alignments.
              This is not a hard limit - it only affects size of introns which are sought  during
              heuristic alignment.
       --maxintron <length>
              Maximum intron length limit.  See notes above for --minintron.
       -i | --intronpenalty <penalty>
              Penalty for introduction of an intron.

FRAMESHIFT MODELLING OPTIONS

       -f | --frameshift <penalty>
              The penalty for the inclusion of a frameshift in an alignment.

ALPHABET OPTIONS

       --useaatla <boolean>
              Use  three-letter  abbreviations for AA names.  ie. when displaying alignment "Met"
              is used instead of " M "

TRANSLATION OPTIONS

       --geneticcode <code>
              Specify an alternative genetic code.  The default code (1) is the standard  genetic
              code.  Other genetic codes may be specified in shorthand or longhand form.

              In  shorthand form, a number between 1 and 23 is used to specify one of 17 built-in
              genetic code variants.  These are genetic code variants taken from:

              http://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi

              These are:
              1      The Standard Code
              2      The Vertebrate Mitochondrial Code
              3      The Yeast Mitochondrial Code
              4      The  Mold,  Protozoan,  and  Coelenterate   Mitochondrial   Code   and   the
                     Mycoplasma/Spiroplasma Code
              5      The Invertebrate Mitochondrial Code
              6      The Ciliate, Dasycladacean and Hexamita Nuclear Code
              9      The Echinoderm and Flatworm Mitochondrial Code
              10     The Euplotid Nuclear Code
              11     The Bacterial and Plant Plastid Code
              12     The Alternative Yeast Nuclear Code
              13     The Ascidian Mitochondrial Code
              14     The Alternative Flatworm Mitochondrial Code
              15     Blepharisma Nuclear Code
              16     Chlorophycean Mitochondrial Code
              21     Trematode Mitochondrial Code
              22     Scenedesmus obliquus mitochondrial Code
              23     Thraustochytrium Mitochondrial Code

              In  longhand  form,  a  genetic code variant may be provided as a 64 byte string in
              TCAG order, eg. the standard genetic code in this form would be:

              FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG
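
              A longhand string can be turned into a codon table with a few lines of Python.
              This sketch assumes the usual NCBI layout, in which the first base varies slowest
              and the third base fastest; the names codon_table and translate are hypothetical.

```python
from itertools import product

BASES = "TCAG"
STANDARD = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"


def codon_table(code):
    """Map each codon to its amino acid for a 64-character TCAG-order string."""
    if len(code) != 64:
        raise ValueError("a longhand genetic code must have exactly 64 symbols")
    # product() varies the last position fastest: TTT, TTC, TTA, TTG, TCT, ...
    codons = ("".join(c) for c in product(BASES, repeat=3))
    return dict(zip(codons, code))


def translate(dna, table):
    """Translate complete codons in frame 0; a trailing partial codon is ignored."""
    dna = dna.upper()
    usable = len(dna) - len(dna) % 3
    return "".join(table[dna[i:i + 3]] for i in range(0, usable, 3))


table = codon_table(STANDARD)
print(table["ATG"], table["TGG"], table["TAA"])
print(translate("ATGGCCTAA", table))
```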

HSP CREATION OPTIONS

       --hspfilter <threshold>
              Use aggressive HSP  filtering  to  speed  up  heuristic  searches.   The  threshold
              specifies  the  number  of  HSPs  centred  about a point in the query which will be
              stored.  Any lower scoring HSPs will be discarded.  This is an experimental  option
              to  handle  speed problems caused by some sequences.  A value of about 100 seems to
              work well.
       --useworddropoff <boolean>
              When this  is  TRUE,  the  score  threshold  for  admitting  words  into  the  word
              neighbourhood  is  set  to  be the initial word score minus the word threshold (see
              below).  This strategy is designed to prevent restricting the word neighbourhood.
              When this is FALSE, the word threshold is taken to be an absolute value.
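
              The two interpretations of the word threshold can be written out as a small Python
              sketch.  The function name and the example scores are hypothetical; only the
              arithmetic follows the description above.

```python
def neighbourhood_threshold(initial_word_score, word_threshold, use_word_dropoff):
    """Score a neighbouring word must reach to be admitted (illustrative)."""
    if use_word_dropoff:
        # Relative: allow the score to fall word_threshold below the initial word's score.
        return initial_word_score - word_threshold
    # Absolute: the threshold is used as given, whatever the initial word scores.
    return word_threshold


# Hypothetical numbers: an initial word scoring 20 with a word threshold of 5.
print(neighbourhood_threshold(20, 5, True))
print(neighbourhood_threshold(20, 5, False))
```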
       --seedrepeat <count>
              The seedrepeat parameter sets the number of seeds which must be found on  the  same
              diagonal  or  reading  frame before HSP extension will occur.  Increasing the value
              for --seedrepeat will speed up searches, and is usually a better option than  using
              longer  word lengths, particularly when using the exonerate-server where increasing
              word lengths requires recomputing the index, and greatly increases memory
              requirements.
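
              The idea of requiring repeated seeds on a diagonal can be sketched in Python.  This
              is a simplification under assumptions: seeds are (query_pos, target_pos) pairs, a
              diagonal is target_pos minus query_pos, and reading frames and any distance limits
              between seeds are ignored.  The name diagonals_to_extend is hypothetical.

```python
from collections import Counter


def diagonals_to_extend(seeds, seed_repeat):
    """Diagonals carrying at least seed_repeat seeds (illustrative)."""
    counts = Counter(target_pos - query_pos for query_pos, target_pos in seeds)
    return sorted(d for d, n in counts.items() if n >= seed_repeat)


# Hypothetical seed hits: three on diagonal 10, one each on diagonals 28 and 2.
seeds = [(0, 10), (4, 14), (8, 18), (2, 30), (5, 7)]

for repeat in (1, 2, 4):
    print(repeat, diagonals_to_extend(seeds, repeat))
```

              Raising the repeat count discards isolated seeds before any extension work is done,
              which is where the speed gain comes from.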
       -w | --dnawordlen <bases>
       -W | --proteinwordlen <residues>
       -W | --codonwordlen <bases>
              The  word  length  used  for  DNA,  protein or codon words.  When performing DNA vs
              protein comparisons, the DNA wordlength will always (automatically) be triple the
              protein wordlength.
       --dnahspdropoff <score>
       --proteinhspdropoff <score>
       --codonhspdropoff <score>
              The  amount  by which an HSP score will be allowed to degrade during HSP extension.
              Separate thresholds can be set for DNA, protein or codon comparisons.
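
              A drop-off rule of this kind can be sketched as a generic ungapped extension in
              Python.  This is not exonerate's code: the function name, the match and mismatch
              scores, and the example sequences are all hypothetical.

```python
def extend_right(query, target, q_start, t_start, match, mismatch, dropoff):
    """Extend rightwards without gaps; return (best_score, best_length)."""
    score = best = 0
    best_len = 0
    i = 0
    while q_start + i < len(query) and t_start + i < len(target):
        score += match if query[q_start + i] == target[t_start + i] else mismatch
        i += 1
        if score > best:
            best, best_len = score, i
        elif best - score > dropoff:
            # The score has degraded by more than the permitted amount: stop.
            break
    return best, best_len


query = "ACGTAACGT"
target = "ACGTCACGT"  # differs from the query at one position

print(extend_right(query, target, 0, 0, match=5, mismatch=-4, dropoff=3))
print(extend_right(query, target, 0, 0, match=5, mismatch=-4, dropoff=10))
```

              With the small drop-off the extension stops at the mismatch; with the larger one it
              continues through the mismatch and picks up the matching bases beyond it.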
       --dnahspthreshold <score>
       --proteinhspthreshold <score>
       --codonhspthreshold <score>
              The HSP score thresholds.  An HSP must score at least this much before it  will  be
              reported or be used in preparation of a heuristic alignment.
       --dnawordlimit <score>
       --proteinwordlimit <score>
       --codonwordlimit <score>
              The  threshold for admitting DNA or protein words into the word neighbourhood.  The
              behaviour of this option is altered by the --useworddropoff option (see above).

       --geneseed <threshold>
              Exclude HSPs from gapped alignment computation which cannot feature in an alignment
              containing at least one HSP scoring at least this threshold.

              This  option  provides  considerable speed up for gapped alignment computation, but
              may cause some very gap-rich alignments to be missed.

              It is useful when aligning similar sequences back onto a genome quickly,  eg.  try
              --geneseed 250
       --geneseedrepeat <count>
              The  geneseedrepeat parameter is like the seedrepeat parameter, but is only applied
              when looking for the geneseed HSPs.  Using a larger value for --geneseedrepeat will
              speed  up  searches  when  the  --geneseed  parameter is also used.  (experimental,
              implementation incomplete)

ALIGNMENT OPTIONS

       --alignmentwidth <width>
              Width of alignment display.  The default is 80.
       --forwardcoordinates <boolean>
              By default, all coordinates are reported  on  the  forward  strand.   Setting  this
              option  to false reverts to the old behaviour (pre-0.8.3) whereby alignments on the
              reverse complement of a sequence are reported  using  coordinates  on  the  reverse
              complement.
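
              Under the in-between coordinate system described in CONVENTIONS, an interval on the
              reverse complement maps onto the forward strand by subtracting from the sequence
              length.  The Python sketch below shows that arithmetic only; the function names are
              hypothetical, and how exonerate orders start and end in its output for
              reverse-strand hits is not covered here.

```python
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Reverse complement of an unambiguous DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def revcomp_to_forward(start, end, seq_len):
    """Map an in-between interval on the reverse complement to the forward strand."""
    return seq_len - end, seq_len - start


seq = "AAGCT"
rc = reverse_complement(seq)

# Interval [0, 2) on the reverse complement and its forward-strand equivalent.
fwd_start, fwd_end = revcomp_to_forward(0, 2, len(seq))
print(rc[0:2], seq[fwd_start:fwd_end], (fwd_start, fwd_end))
```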

SUB-ALIGNMENT REGION OPTIONS

       --quality <percent>
              This  option excludes HSPs from BSDP when their components outside of the SARs fall
              below this quality threshold.

SPLICE SITE PREDICTION OPTIONS

       --splice3 <path>
       --splice5 <path>
              Provide a file containing a  custom  PSSM  (position  specific  score  matrix)  for
              prediction of the intron splice sites.

              The file format for splice data is simple: lines beginning with '#' are comments, a
              line containing just the word 'splice' denotes the position of the splice site, and
              the  other  lines  show the observed relative frequencies of the bases flanking the
              splice sites in the chosen organism (in ACGT order).

              Example 5' splice data file:

               # start of example 5' splice data
               # A C G T
               28 40  17  14
               59 14  13  14
                8  5  81   6
               splice
                0  0 100   0
                0  0   0 100
               54  2  42   2
               74  8  11   8
                5  6  85   4
               16 18  21  45
               # end of example 5' splice data

              Example 3' splice data file:

               # start of example 3' splice data
               # A C G T
                10  31  14  44
                 8  36  14  43
                 6  34  12  48
                 6  34   8  52
                 9  37   9  45
                 9  38  10  44
                 8  44   9  40
                 9  41   8  41
                 6  44   6  45
                 6  40   6  48
                23  28  26  23
                 2  79   1  18
               100   0   0   0
                 0   0 100   0
               splice
                28  14  47  11
               # end of example 3' splice data
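
              The file format described above is simple enough to read with a short Python
              function.  This is a sketch, not the parser used by exonerate: the name
              parse_splice_data is hypothetical, and skipping blank lines is an assumption.

```python
EXAMPLE_5_PRIME = """\
# start of example 5' splice data
# A C G T
28 40  17  14
59 14  13  14
 8  5  81   6
splice
 0  0 100   0
 0  0   0 100
54  2  42   2
74  8  11   8
 5  6  85   4
16 18  21  45
# end of example 5' splice data
"""


def parse_splice_data(text):
    """Return (rows, splice_index); each row is [A, C, G, T] frequencies."""
    rows = []
    splice_index = None
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # comments (and, by assumption, blank lines) are skipped
        if line == "splice":
            splice_index = len(rows)  # the splice site lies before this row
            continue
        fields = [float(x) for x in line.split()]
        if len(fields) != 4:
            raise ValueError("expected four frequencies (A C G T): " + line)
        rows.append(fields)
    if splice_index is None:
        raise ValueError("no 'splice' line found")
    return rows, splice_index


rows, splice_index = parse_splice_data(EXAMPLE_5_PRIME)
# Most frequent base at the two positions immediately after the splice site.
consensus = "".join("ACGT"[r.index(max(r))] for r in rows[splice_index:splice_index + 2])
print(splice_index, len(rows), consensus)
```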

       --forcegtag <boolean>
              Only allow splice sites at gt....ag sites (or  ct....ac  sites  when  the  gene  is
              reversed).  With this restriction in place, the splice site prediction scores are
              still used and allow tie breaking when there is more than one possible splice site.
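
              The ct....ac form is the reverse complement of gt....ag, as the following Python
              sketch shows.  The function names and the example intron are hypothetical.

```python
COMPLEMENT = str.maketrans("acgt", "tgca")


def reverse_complement(seq):
    """Reverse complement of an unambiguous DNA sequence (returned in lower case)."""
    return seq.lower().translate(COMPLEMENT)[::-1]


def is_gtag_intron(intron, reversed_gene=False):
    """True if the intron has gt....ag ends, or ct....ac ends on a reversed gene."""
    intron = intron.lower()
    if len(intron) < 4:
        return False
    if reversed_gene:
        return intron.startswith("ct") and intron.endswith("ac")
    return intron.startswith("gt") and intron.endswith("ag")


intron = "gtaagttttcag"  # made-up intron with canonical ends
print(is_gtag_intron(intron))
print(reverse_complement(intron), is_gtag_intron(reverse_complement(intron), reversed_gene=True))
```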

STRATEGIES FOR SPEED

       Keep all data on local disks.

       Apply the highest acceptable score thresholds using a combination  of  --score,  --percent
       and --bestn.

       Repeat  mask  and  dust  the genomic (target) sequence.  (Softmask these sequences and use
       --softmasktarget).

       Increase the --fsmmemory option to allow more query multiplexing.

       Increase the value for --seedrepeat.

       When using an alignment model containing introns, set --geneseed as high as possible.

       If you are compiling exonerate yourself, see the README file supplied with the source code
       for details of compile-time optimisations.

STRATEGIES FOR SENSITIVITY

       Not documented yet.

       Increase  the  word  neighbourhood.  Decrease the HSP threshold.  Increase the SAR ranges.
       Run exhaustively.

ENVIRONMENT

       Not documented yet.

EXAMPLES

       exonerate cdna.fasta genomic.fasta
              This is the simplest way in which exonerate may be used.  By default, an ungapped
              alignment model will be used.

       exonerate --exhaustive y --model est2genome cdna.fasta genomic.masked.fasta
              Exhaustively  align cDNAs to genomic sequence.  This will be much, much slower, but
              more accurate.  This option causes exonerate to behave like EST_GENOME.

       exonerate --exhaustive --model affine:local query.fasta target.fasta
              If the affine:local model is used with exhaustive alignment, you  have  the  Smith-
              Waterman algorithm.

       exonerate --exhaustive --model affine:global protein.fasta protein.fasta
              Switch to a global model, and you have Needleman-Wunsch.

       exonerate --wordthreshold 1 --gapped no --showhsp yes protein.fasta genome.fasta
              Generate ungapped Protein:DNA alignments.

       exonerate  --model  coding2coding  --score  1000  --bigseq  yes  --proteinhspthreshold  90
       chr21.fa chr22.fa
              Perform quick-and-dirty  translated  pairwise  alignment  of  two  very  large  DNA
              sequences.

       Many similar combinations should work.  Try them out.

VERSION

       This documentation accompanies version 2.2.0 of the exonerate package.

AUTHOR

       Guy St.C. Slater.  <guy@ebi.ac.uk>.
       See the AUTHORS file accompanying the source code for a list of contributors.

AVAILABILITY

       The source code for the exonerate package is available under the terms of the GNU General
       Public Licence.

       Please see the file COPYING which was distributed with this package, or
       http://www.gnu.org/licenses/gpl.txt for details.

       This package has been developed as part of the Ensembl project.  Please see
       http://www.ensembl.org/ for more information.

SEE ALSO

       exonerate-server(1), ipcress(1), blast(1L).