Provided by: exonerate_2.4.0-4_amd64

NAME

       exonerate - a generic tool for sequence comparison

SYNOPSIS

       exonerate [ options ] <query path> <target path>

DESCRIPTION

       exonerate is a general tool for sequence comparison.

       It  uses the C4 dynamic programming library.  It is designed to be both general and fast.  It can produce
       either gapped or ungapped alignments, according to a variety  of  different  alignment  models.   The  C4
       library allows sequence alignment using a reduced space full dynamic programming implementation, but also
       allows automated generation of heuristics  from  the  alignment  models,  using  bounded  sparse  dynamic
       programming,  so  that  these alignments may also be rapidly generated.  Alignments generated using these
       heuristics will represent  a  valid  path  through  the  alignment  model,  yet  (unlike  the  exhaustive
       alignments), the results are not guaranteed to be optimal.

CONVENTIONS

       A  number  of  conventions  (and  idiosyncrasies)  are  used  within exonerate.  An understanding of them
       facilitates interpretation of the output.

       Coordinates
              An in-between coordinate system is used, where the positions  are  counted  between  the  symbols,
              rather  than  on  the  symbols.   This numbering scheme starts from zero.  This numbering is shown
              below for the sequence "ACGT":

               A C G T
              0 1 2 3 4

              Hence the subsequence "CG" would have start=1, end=3, and length=2.   This  coordinate  system  is
              used  internally  in  exonerate, and for all the output formats produced with the exception of the
              "human readable" alignment display and the GFF  output  where  convention  and  standards  dictate
              otherwise.
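
              The in-between scheme described above behaves like a zero-based, half-open interval.  The
              following is a minimal sketch (not part of exonerate) that checks the "CG" example and converts
              it to a one-based, inclusive interval of the kind GFF uses.  The function name
              to_one_based_inclusive is hypothetical.

```python
from typing import Tuple


def to_one_based_inclusive(start: int, end: int) -> Tuple[int, int]:
    """Convert an in-between interval to one-based, inclusive positions.

    In-between coordinates count the gaps between symbols, so only the
    start moves: the symbol after gap `start` is symbol number start + 1.
    """
    if not 0 <= start < end:
        raise ValueError("expected 0 <= start < end")
    return start + 1, end


sequence = "ACGT"
start, end = 1, 3  # the subsequence "CG" from the example above

# In-between coordinates map directly onto Python slices.
print(sequence[start:end])                 # CG
print(end - start)                         # 2
print(to_one_based_inclusive(start, end))  # (2, 3)
```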

       Reverse Complements
              When  an alignment is reported on the reverse complement of a sequence, the coordinates are simply
              given on the reverse complement copy of the sequence.  Hence positions on the sequences are  never
              negative.   Generally,  the  forward strand is indicated by '+', the reverse strand by '-', and an
              unknown or not-applicable strand (as in the case of a protein sequence) is indicated by '.'

       Alignment Scores
              Currently, only the raw alignment scores are displayed.  This score is simply the sum of the transition
              scores  used  in the dynamic programming.  For example, in the case of a Smith-Waterman alignment,
              this will be the sum of the substitution matrix scores and the gap penalties.

GENERAL OPTIONS

       Most arguments have short and long forms.  The long forms are more likely to be stable over time, and
       hence should be used in scripts which call exonerate.

       -h | --shorthelp <boolean>
              Show help.  This will display a concise summary of the  available  options,  defaults  and  values
              currently set.

       --help <boolean>
              This  shows  all  the  help  options  including  the  defaults,  the  value currently set, and the
              environment variable which may be used to set each parameter.  There  will  be  an  indication  of
              which  options  are  mandatory.  Mandatory options have no default, and must have a value supplied
              for exonerate to run.  If mandatory options are used in order, their flags may be skipped from the
              command  line  (see  examples below).  Unlike this man page, the information from this option will
              always be up to date with the latest version of the program.

       -v | --version <boolean>
              Display the version number.  Also displays other information such  as  the  build  date  and  glib
              version used.

SEQUENCE INPUT OPTIONS

       Pairwise  comparisons will be performed between all query sequences and all target sequences.  Generally,
       for the best performance, shorter sequences (eg. ESTs, shotgun reads, proteins) should  be  used  as  the
       query sequences, and longer sequences (eg. genomic sequences) should be used as the target sequences.

       -q | --query  <paths>
              Specify the query sequences required.  These must be in a FASTA format file.  Single or multiple
              query sequences may be supplied.  Additionally, multiple FASTA files may be supplied following
              a single --query flag, or by using multiple --query flags.

       -t | --target <paths>
              Specify the target sequences required.  These must also be in a FASTA format file.  As with the
              query sequences, single or multiple target sequences and files may be supplied.  The target
              filename may be replaced by a server name and port number in the form hostname:port when using
              exonerate-server.  See the man page for exonerate-server for more information on running
              exonerate in client:server mode.  NEW(v2.4.0): multiple servers may now be used.  These will be
              queried in parallel if you have set the --cores option.  NEW(v2.4.0): If an input file is not a
              FASTA format file, it is assumed to contain a list of other FASTA files, directories or servers
              (one per line).

       -Q | --querytype <dna | protein>
              Specify  the query type to use.  If this is not supplied, the query type is assumed to be DNA when
              the first sequence in the file contains more than 85% [ACGTN] bases.  Otherwise, it is assumed  to
              be  peptide.   This option forces the query type as some nucleotide and peptide sequences can fall
              either side of this threshold.
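
              The type-guessing rule above can be sketched as follows.  This is an illustration only, not
              exonerate's implementation: the function name guess_sequence_type is hypothetical, and folding
              lower-case symbols to upper case before counting is an assumption.

```python
def guess_sequence_type(first_sequence: str, threshold: float = 0.85) -> str:
    """Guess 'dna' or 'protein' from the fraction of [ACGTN] symbols.

    Assumption: lower-case symbols are counted like upper-case ones.
    """
    if not first_sequence:
        raise ValueError("cannot guess the type of an empty sequence")
    nucleotide_like = sum(1 for symbol in first_sequence.upper() if symbol in "ACGTN")
    fraction = nucleotide_like / len(first_sequence)
    # "more than 85%" is read as a strict inequality.
    return "dna" if fraction > threshold else "protein"


print(guess_sequence_type("ACGTACGTNN"))  # dna
print(guess_sequence_type("MKVLAAGIVG"))  # protein
```

              A peptide that happens to be rich in A, C, G and T residues shows why forcing the type with
              --querytype can be necessary.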

       -T | --targettype <dna | protein>
              Specify the target type to use.  The same as --querytype (above), except that it  applies  to  the
              target.  Specifying the sequence type will avoid the overhead of having to read the first sequence
              in the database twice (which may be significant with chromosome-sized sequences).

       --querychunkid <id>

       --querychunktotal <total>

       --targetchunkid <id>

       --targetchunktotal <total>
              These options facilitate running exonerate on compute farms, and avoid having to split up
              sequence  databases  into  small chunks to run on different nodes.  If, for example, you wished to
              split the target database into three parts, you would run three exonerate jobs on different  nodes
              including the options:

              --targetchunkid 1 --targetchunktotal 3
              --targetchunkid 2 --targetchunktotal 3
              --targetchunkid 3 --targetchunktotal 3
              NB.  The granularity offered by this option only goes down to a single sequence, so when there are
              more chunks than sequences in the database, some processes will do nothing.
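
              When submitting many jobs, the per-job options shown above can be generated rather than typed.
              The sketch below only builds argument lists; the helper name chunk_arguments is hypothetical,
              and how the arguments are handed to a job scheduler is left open.

```python
from typing import List


def chunk_arguments(total: int, side: str = "target") -> List[List[str]]:
    """Return one list of chunk options per job, with ids running from 1 to total."""
    if total < 1:
        raise ValueError("total must be at least 1")
    if side not in ("query", "target"):
        raise ValueError("side must be 'query' or 'target'")
    return [
        [f"--{side}chunkid", str(chunk_id), f"--{side}chunktotal", str(total)]
        for chunk_id in range(1, total + 1)
    ]


for arguments in chunk_arguments(3):
    print(" ".join(arguments))
```

              Each printed line is meant to be appended to an otherwise identical exonerate command line, one
              per node.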

       -V | --verbose <int>
              Be verbose - show information about what is going on during the analysis.  The default is 1
              (little information); the higher the number given, the more information is printed.  To silence
              all the default output from exonerate, use --verbose 0 --showalignment no --showvulgar no

ANALYSIS OPTIONS

       -E | --exhaustive <boolean>
              Specify whether or not exhaustive alignment should be  used.   By  default,  this  is  FALSE,  and
              alignment  heuristics  will  be  used.   If  it  is  set  to TRUE, an exhaustive alignment will be
              calculated.  This requires quadratic time, and will be much, much slower,  but  will  provide  the
              optimal result for the given model.
       -B | --bigseq <int>
              Perform  alignment  of  large  (multi-megabase) sequences.  This is very memory efficient and fast
              when both sequences are chromosome-sized, but does not currently permit the use of a
              word neighbourhood (ie. exactly matching seeds only).
       --revcomp <boolean>
              Include  comparison of the reverse complement of the query and target where possible.  By default,
              this option is enabled, but when you know the gene is definitely on  the  forward  strand  of  the
              query and target, this option can halve the time taken to compute alignments.
       --forcescan <none | query | target>
              Force  the  FSM  to  scan  the  query sequence rather than the target.  This option is useful, for
              example, if you have a single piece of genomic sequence and you wish to compare it to the whole of
              dbEST.   By scanning the database, rather than the query, the analysis will be completed much more
              quickly, as the overheads of multiple query FSM construction, multiple target reading  and  splice
              site  predictions will be removed.  By default, exonerate will guess the optimal strategy based on
              database sequence sizes.
       --saturatethreshold <number>
              When set to zero, this option does nothing.  Otherwise, once more than this number  of  words  (in
              addition  to  the  expected  number  of words by chance) have matched a position on the query, the
              position on the query  will  be  'numbed'  (ignore  further  matches)  for  the  current  pairwise
              comparison.
       --customserver <command>
              When  using exonerate in client:server mode with a non-standard server, this command allows you to
              send a custom command to the server.  This command is sent by the client  (exonerate)  before  any
              other  commands,  and is provided as a way of passing parameters or other commands specific to the
              custom server.  See the exonerate-server man page for more information  on  running  exonerate  in
              client:server mode.
       --cores <number>
              The  number  of  cores/CPUs/threads  that  should  be used.  On a multi-core or multi-CPU machine,
              increasing this amount allows alignment computations to run in parallel on separate CPUs/cores.
              NB.   Generally,  it  is better to parallelise the analysis by splitting it up into separate jobs,
              but this option may prove useful for problems such as interactive single-gene queries.

FASTA DATABASE OPTIONS

       --fastasuffix <extension>
              If any of the inputs  given  with  --query  or  --target  are  directories,  then  exonerate  will
              recursively descend these directories, reading all files ending with this suffix as fasta format
              input.

GAPPED ALIGNMENT OPTIONS

       -m | --model <alignment model>
              Specify the alignment model to use.  The models currently supported are:
              ungapped
                     The simplest type of model, used by default.  An appropriate model will be selected
                     automatically for the type of input sequences provided.
              ungapped:trans
                     This  ungapped  model  includes  translation  of  all  frames  of both the query and target
                     sequences.  This is similar to an ungapped tblastx type search.
              affine:global
                     This performs gapped global alignment, similar to the  Needleman-Wunsch  algorithm,  except
                     with  affine gaps.  Global alignment requires that both the sequences in their entirety are
                     included in the alignment.
              affine:bestfit
                     This performs a best fit or best location alignment of the query onto the target  sequence.
                     The entire query sequence will be included in the alignment, but only the best location for
                     its alignment on the target sequence.
              affine:local
                     This is local alignment with affine gaps, similar to the Smith-Waterman-Gotoh algorithm.  A
                     general-purpose  alignment  algorithm.   As this is local alignment, any subsequence of the
                     query and target sequence may appear in the alignment.
              affine:overlap
                     This type of alignment finds the best overlap between the query and  target.   The  overlap
                     alignment  must  include  the  start of the query or target and the end of the query or the
                     target sequence, to align sequences which overlap at the ends, or in the mid-section  of  a
                     longer sequence.  This is the type of alignment frequently used in assembly algorithms.
              est2genome
                     This  model  is similar to the affine:local model, but it also includes intron modelling on
                     the target sequence to allow alignment of spliced to unspliced coding  sequences  for  both
                     forward  and reversed genes.  This is similar to the alignment models used in programs such
                     as EST_GENOME and sim4.
              ner    NERs are non-equivalenced regions - large regions in both the query and  target  which  are
                     not  aligned.  This model can be used for protein alignments where strongly conserved helix
                     regions will be aligned, but weakly conserved loop regions are not.  Similarly, this  model
                     could be used to look for co-linearly conserved regions in comparison of genomic sequences.
              protein2dna
                     This model compares a protein sequence to a DNA sequence, incorporating all the appropriate
                     gaps and frameshifts.
              protein2dna:bestfit
                     This is a bestfit version of the protein2dna  model,  with  which  the  entire  protein  is
                     included in the alignment.  It is currently only available when using exhaustive alignment.
              protein2genome
                     This  model allows alignment of a protein sequence to genomic DNA.   This is similar to the
                     protein2dna model, with the addition of modelling of introns and intron phases.  This model
                     is similar to those used by genewise.
              protein2genome:bestfit
                     This  is  a  bestfit  version of the protein2genome model, with which the entire protein is
                     included in the alignment.  It is currently only available when using exhaustive alignment.
              coding2coding
                     This model is similar to the ungapped:trans model, except that  gaps  and  frameshifts  are
                     allowed.  It is similar to a gapped tblastx search.
              coding2genome
                     This  is  similar  to  the  est2genome  model, except that the query sequence is translated
                     during comparison, allowing a more sensitive comparison.
              cdna2genome
                     This combines properties of the est2genome and coding2genome models, to allow modelling of
                     a whole cDNA where a central coding region can be flanked by non-coding UTRs.  When the
                     CDS start and end are known, they may be specified using the --annotation option (see below)
                     to permit only the correct coding region to appear in the alignment.
              genome2genome
                     This  model  is  similar  to  the  coding2coding model, except introns are modelled on both
                     sequences.  (not working well yet)

       The short names u, u:t, a:g, a:b, a:l, a:o, e2g, ner, p2d, p2d:b, p2g, p2g:b, c2c, c2g, cd2g and g2g can
       also be used for specifying models.

       -s | --score <threshold>
              This is the overall score threshold.  Alignments will not be reported below this  threshold.   For
              heuristic alignments, the higher this threshold, the less time the analysis will take.

       --percent <percentage>
              Report  only alignments scoring at least this percentage of the maximal score for each query.  eg.
              use --percent 90 to report alignments with 90% of the maximal score obtainable for that query.
              This  option is useful not only because it reduces the spurious matches in the output, but because
              it generates query-specific thresholds (unlike --score  )  for  a  set  of  queries  of  differing
              lengths, and will also speed up the search considerably.  NB.  with this option, it is possible to
              have a cDNA match its corresponding gene exactly, yet still score  less  than  100%,  due  to  the
              addition of the intron penalty scores, hence this option must be used with caution.

       --showalignment <boolean>
              Show the alignments in a human-readable form.

       --showsugar <boolean>
              Display  "sugar" output for ungapped alignments.  Sugar is Simple UnGapped Alignment Report, which
              displays ungapped alignments one-per-line.  The sugar line starts with  the  string  "sugar:"  for
              easy extraction from the output, and is followed by the following 9 fields in the order below:

              query_id        Query identifier
              query_start     Query position at alignment start
              query_end       Query position at alignment end
              query_strand    Strand of query matched
              target_id       |
              target_start    | the same 4 fields
              target_end      | for the target sequence
              target_strand   |
              score           The raw alignment score
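
              Because each sugar line carries a fixed prefix and nine fields, it is straightforward to parse.
              The sketch below assumes that the fields are separated by whitespace, that identifiers contain
              no whitespace, and that the raw score is an integer.  The example line and the names Sugar and
              parse_sugar are hypothetical.

```python
from typing import NamedTuple


class Sugar(NamedTuple):
    query_id: str
    query_start: int
    query_end: int
    query_strand: str
    target_id: str
    target_start: int
    target_end: int
    target_strand: str
    score: int


def parse_sugar(line: str) -> Sugar:
    """Parse one 'sugar:' line into its nine fields."""
    prefix = "sugar:"
    if not line.startswith(prefix):
        raise ValueError("not a sugar line")
    fields = line[len(prefix):].split()
    if len(fields) != 9:
        raise ValueError(f"expected 9 fields, found {len(fields)}")
    q_id, q_start, q_end, q_strand, t_id, t_start, t_end, t_strand, score = fields
    return Sugar(q_id, int(q_start), int(q_end), q_strand,
                 t_id, int(t_start), int(t_end), t_strand, int(score))


# Hypothetical line, made up for illustration.
record = parse_sugar("sugar: est1 0 100 + chr1 500 600 + 450")
print(record.target_id, record.target_start, record.score)  # chr1 500 450
```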

       --showcigar <boolean>
              Show  the alignments in "cigar" format.  Cigar is a Compact Idiosyncratic Gapped Alignment Report,
              which displays gapped alignments one-per-line.  The format starts with the same 9 fields as  sugar
              output  (see  above),  and is followed by a series of <operation, length> pairs where operation is
              one of match, insert or delete, and the length describes the number of  times  this  operation  is
              repeated.
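
              A cigar line can be split into its operation pairs once the nine leading fields are skipped.
              The sketch below assumes whitespace-separated fields and assumes that the operations are written
              as the single letters M, I and D (for match, insert and delete).  The example line and the
              function name parse_cigar_operations are hypothetical.

```python
from typing import List, Tuple

SUGAR_FIELD_COUNT = 9  # the fields shared with sugar output


def parse_cigar_operations(line: str) -> List[Tuple[str, int]]:
    """Return the (operation, length) pairs that follow the sugar fields."""
    fields = line.split()
    if not fields or fields[0] != "cigar:":
        raise ValueError("not a cigar line")
    tokens = fields[1 + SUGAR_FIELD_COUNT:]
    if len(tokens) % 2 != 0:
        raise ValueError("operations and lengths must come in pairs")
    operations = []
    for index in range(0, len(tokens), 2):
        operation, length = tokens[index], int(tokens[index + 1])
        if operation not in ("M", "I", "D"):  # assumed single-letter codes
            raise ValueError(f"unknown operation: {operation}")
        operations.append((operation, length))
    return operations


# Hypothetical line, made up for illustration.
example = "cigar: query1 0 10 + target1 0 12 + 40 M 4 D 2 M 6"
print(parse_cigar_operations(example))  # [('M', 4), ('D', 2), ('M', 6)]
```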

       --showvulgar <boolean>
              Shows the alignments in "vulgar" format.  Vulgar is Verbose Useful Labelled Gapped Alignment
              Report.  This format also starts with the same 9 fields as sugar output (see above), and is
              followed  by  a  series of <label, query_length, target_length> triplets.  The label may be one of
              the following:

              M      Match
              C      Codon
              G      Gap
              N      Non-equivalenced region
              5      5' splice site
              3      3' splice site
              I      Intron
              S      Split codon
              F      Frameshift
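
              The triplets of a vulgar line can be read in the same way.  In the sketch below the summed
              query and target lengths are compared with the spans given in the leading fields, as a simple
              consistency check.  It assumes whitespace-separated fields; the example line and the function
              name parse_vulgar_triplets are hypothetical.

```python
from typing import List, Tuple

VULGAR_LABELS = {
    "M": "Match", "C": "Codon", "G": "Gap", "N": "Non-equivalenced region",
    "5": "5' splice site", "3": "3' splice site", "I": "Intron",
    "S": "Split codon", "F": "Frameshift",
}


def parse_vulgar_triplets(line: str) -> List[Tuple[str, int, int]]:
    """Return the (label, query_length, target_length) triplets of a vulgar line."""
    fields = line.split()
    if not fields or fields[0] != "vulgar:":
        raise ValueError("not a vulgar line")
    tokens = fields[10:]  # skip the prefix and the nine sugar fields
    if len(tokens) % 3 != 0:
        raise ValueError("labels and lengths must come in triplets")
    triplets = []
    for index in range(0, len(tokens), 3):
        label = tokens[index]
        if label not in VULGAR_LABELS:
            raise ValueError(f"unknown label: {label}")
        triplets.append((label, int(tokens[index + 1]), int(tokens[index + 2])))
    return triplets


# Hypothetical line, made up for illustration.
example = "vulgar: query1 0 10 + target1 0 12 + 40 M 4 4 G 0 2 M 6 6"
triplets = parse_vulgar_triplets(example)
print(triplets)                                 # [('M', 4, 4), ('G', 0, 2), ('M', 6, 6)]
print(sum(query for _, query, _ in triplets))   # 10
print(sum(target for _, _, target in triplets)) # 12
```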

       --showquerygff <boolean>
              Report GFF output for features on the query sequence.  See
              http://www.sanger.ac.uk/Software/formats/GFF for more information.

       --showtargetgff <boolean>
              Report GFF output for features on the target sequence.

       --ryo <format>
              Roll-your-own  output  format.   This  allows specification of a printf-esque format line which is
              used to specify which information to include in the output, and how it is to be shown.  The format
              field may contain the following fields:

              %[qt][idlsSt]
                     For either {query,target}, report the {id,definition,length,sequence,Strand,type}.
                     Sequences are reported in a fasta-format like block (no headers).
              %[qt]a[bels]
                     For  either  {query,target}  region   which   occurs   in   the   alignment,   report   the
                     {begin,end,length,sequence}
              %[qt]c[bels]
                     For  either  {query,target}  region  which  occurs in the coding sequence in the alignment,
                     report the {begin,end,length,sequence}
              %s     The raw score
              %r     The rank (in results from a bestn search)
              %m     Model name
              %e[tism]
                     Equivalenced {total,id,similarity,mismatches} (ie. %em == (%et - %ei))
              %p[isS]
                     Percent {id,similarity,Self} over the equivalenced portions of the alignment.  (ie. %pi  ==
                     100*(%ei  /  %et)).   Percent  Self  is  the  score  over  the equivalenced portions of the
                     alignment as a percentage of the self comparison score of the query sequence.
              %g     Gene orientation ('+' = forward, '-' = reverse, '.' = unknown)
              %S     Sugar block (the 9 fields used in sugar output; see above)
              %C     Cigar block (the fields of a cigar line after the sugar portion)
              %V     Vulgar block (the fields of a vulgar line after the sugar portion)
              %%     Expands to a percentage sign (%)
              \n     Newline
              \t     Tab
              \\     Expands to a backslash (\)
              \{     Open curly brace
              \}     Close curly brace
              {      Begin per-transition output section
              }      End per-transition output section
              %P[qt][sabe]
                     Per-transition output for {query,target} {sequence,advance,begin,end}
              %P[nsl]
                     Per-transition output for {name,score,label}

       This option is very useful and flexible.  For example, to report all  the  sections  of  query  sequences
       which feature in alignments in fasta format, use:

       --ryo ">%qi %qd\n%qas\n"

       To output all the symbols and scores in an alignment, try something like:

       --ryo "%V{%Pqs %Pts %Ps\n}"
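
       A tab-separated --ryo format is convenient for downstream scripts.  The sketch below uses a
       hypothetical format built only from fields listed above, and parses one line of the resulting
       output.  It assumes that identifiers contain no tab characters and that the score and counts are
       integers; the sample line is made up for illustration.

```python
from typing import Dict, Union

# Raw string: the \t and \n escapes are left for exonerate to expand.
RYO_FORMAT = r"%qi\t%ti\t%s\t%ei\t%et\n"


def parse_ryo_line(line: str) -> Dict[str, Union[str, int, float]]:
    """Parse one output line produced with RYO_FORMAT."""
    query_id, target_id, score, identical, total = line.rstrip("\n").split("\t")
    identical_count, total_count = int(identical), int(total)
    # Mirrors the relation given above: %pi == 100*(%ei / %et).
    percent_id = 100.0 * identical_count / total_count if total_count else 0.0
    return {
        "query_id": query_id,
        "target_id": target_id,
        "score": int(score),
        "percent_id": percent_id,
    }


row = parse_ryo_line("est1\tchr1\t450\t95\t100\n")
print(row["percent_id"])  # 95.0
```

       The format string would be passed as a single argument after --ryo, typically together with
       --showalignment no --showvulgar no so that only the custom lines are printed.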

       -n | --bestn <number>
              Report the best N results for each query.  (Only results scoring better than the score threshold
              will be reported.)  This option reduces the amount of output generated, and also allows exonerate
              to speed up the search.

       -S | --subopt <boolean>
              This option allows for the reporting of (Waterman-Eggert style) suboptimal alignments.  (It is  on
              by  default.)   All suboptimal (ie. non-intersecting) alignments will be reported for each pair of
              sequences scoring at least the threshold provided by --score.

              When this option is used with exhaustive alignments, several full quadratic time  passes  will  be
              required, so the running time will be considerably increased.

       -g | --gappedextension <boolean>
              Causes  a gapped extension stage to be performed ie. dynamic programming is applied in arbitrarily
              shaped and dynamically sized regions surrounding HSP seeds.  The extension threshold is controlled
              by the --extensionthreshold option.

              Although  sometimes  slower  than  BSDP, gapped extension improves sensitivity with weak, gap-rich
              alignments such as during cross-species comparison.

              NB. This option is now the default. Set it to false to revert to the old BSDP type alignments.
              This option may be slower than BSDP for some large scale analyses with simple alignment models.

       --refine <strategy>
              Force exonerate to refine alignments generated by heuristics using dynamic programming over larger
              regions.  This takes more time, but improves the quality of the final alignments.

              The strategies available for refinement are:

              none   The default - no refinement is used.
              full   An exhaustive alignment is calculated from the pair of sequences in their entirety.
              region DP is applied just to the region of the sequences covered by the heuristic alignment.

       --refineboundary <size>
              Specify an extra boundary to be included in the region subject to alignment during  refinement  by
              region.

VITERBI ALGORITHM OPTIONS

       -D | --dpmemory <Mb>
              The  exhaustive  alignment  traceback  routines use a Hughey-style reduced memory technique.  This
              option specifies how much memory will be used for this.  Generally, the more memory  is  permitted
              here, the faster the alignments will be produced.

CODE GENERATION OPTIONS

       -C | --compiled <boolean>
              This  option allows disabling of generated code for dynamic programming.  It is mainly used during
              development of exonerate.  When set to FALSE, an "interpreted" version of the dynamic  programming
              implementation is used, which is much slower.

HEURISTIC OPTIONS

       --terminalrangeint
       --terminalrangeext
       --joinrangeint
       --joinrangeext
       --spanrangeint
       --spanrangeext
              These  options  are  used  to specify the size of the sub-alignment regions to which DP is applied
              around the ends of the HSPs.  This can be at the HSP ends (terminal  range),  between  HSPs  (join
              range),  or  between  HSPs  which  may  be  connected  by a large region such as an intron or non-
              equivalenced region (span range).  These ranges can be specified for a number of matches back onto
              the HSP (internal range) or out from the HSP (external range).

SEEDED DYNAMIC PROGRAMMING OPTIONS

       -x | --extensionthreshold <score>
              This  is  the  amount  by  which  the  score  will  be allowed to degrade during SDP.  This is the
              equivalent of the hspdropoff penalties, except it is applied during dynamic programming,  not  HSP
              extension.   Decreasing  this parameter will increase the speed of the SDP, and increasing it will
              increase the sensitivity.

       --singlepass  <boolean>
              By default the suboptimal SDP alignments are reported by a singlepass algorithm, but may miss some
              suboptimal  alignments  that  are  close  together.  This option can be used to force the use of a
              multipass  suboptimal  alignment  algorithm  for  SDP,  resulting  in  higher  quality  suboptimal
              alignments.

BSDP OPTIONS

       --joinfilter <limit>
              (experimental)

              Only consider this number of SARs for joining HSPs together.  The SARs with the highest
              potential for appearing in a high-scoring alignment are considered.  This option is useful for
              limiting  time and memory usage when searching unmasked data with repetitive sequences, but should
              not be set too low, as valid matches may be ignored.  Something like --joinfilter 32 seems to work
              well.

SEQUENCE OPTIONS

       --annotation <path>
              Specify  basic  sequence  annotation information.  This is most useful with the cdna2genome model,
              but will work with other models.  The annotation file contains four fields per line:

              <id> <strand> <cds_start> <cds_length>

              Here is a simple example of such a file for 4 cDNAs:

              dhh.human.cdna + 308 1191
              dhh.mouse.cdna + 250 1191
              csn7a.human.cdna + 178 828
              csn7a.mouse.cdna + 126 828
              These annotation lines will also work when only the first two fields are used.  This can  be  used
              when specifying which strand of a specific sequence should be included in a comparison.
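
              An annotation file in this layout can be checked before it is passed to exonerate.  The sketch
              below accepts both the four-field and the two-field form described above.  It assumes
              whitespace-separated fields; the function name parse_annotation is hypothetical.

```python
from typing import Dict, Optional, Tuple

Annotation = Tuple[str, Optional[int], Optional[int]]


def parse_annotation(text: str) -> Dict[str, Annotation]:
    """Map each sequence id to (strand, cds_start, cds_length).

    Two-field lines carry only the id and strand, so the CDS values are None.
    """
    records: Dict[str, Annotation] = {}
    for line_number, line in enumerate(text.splitlines(), start=1):
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if len(fields) == 2:
            seq_id, strand = fields
            cds_start = cds_length = None
        elif len(fields) == 4:
            seq_id, strand = fields[0], fields[1]
            cds_start, cds_length = int(fields[2]), int(fields[3])
        else:
            raise ValueError(f"line {line_number}: expected 2 or 4 fields")
        records[seq_id] = (strand, cds_start, cds_length)
    return records


example = """\
dhh.human.cdna + 308 1191
dhh.mouse.cdna + 250 1191
csn7a.human.cdna + 178 828
csn7a.mouse.cdna + 126 828
"""
annotations = parse_annotation(example)
print(annotations["dhh.human.cdna"])  # ('+', 308, 1191)
```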

SYMBOL COMPARISON OPTIONS

       --softmaskquery <boolean>
              Indicate that the query is softmasked.  See description below for --softmasktarget
       --softmasktarget <boolean>
              Indicate that the target is softmasked.  In a softmasked sequence file, instead of masking regions
              by Ns or Xs they are masked by putting those regions in lower case (and with unmasked  regions  in
              upper case).  This option allows the masking to be ignored by some parts of the program, combining
              the speed of searching masked data with sensitivity  of  searching  unmasked  data.   The  utility
              fastasoftmask, which is supplied with exonerate, can be used for producing softmasked
              sequence from conventionally masked sequence.
       -d | --dnasubmat <name>
              Specify the substitution matrix to be used for DNA comparison.  This should be a path to a
              substitution matrix in the same format as that which is used by blast.
       -p | --proteinsubmat <name>
              Specify the substitution matrix to be used for protein comparison.  (Both DNA and protein
              substitution matrices are required for some types of analysis).  The use  of  the  special  names,
              nucleic, blosum62, pam250, edit or identity will cause built-in substitution matrices to be used.

ALIGNMENT SEEDING OPTIONS

       -M | --fsmmemory <Mb>
              Specify  the amount of memory to use for the FSM in heuristic analyses.  exonerate multiplexes the
              query to accelerate large-throughput database queries.  This figure should always be less than the
              physical  memory on the machine, but when searching large databases, generally, the more memory it
              is allowed to use, the faster it will go.
       --forcefsm <none | normal | compact>
              Force the use of more compact finite state machines for analyses involving big sequences and large
              word  neighbourhoods.   By  default,  exonerate will pick a sensible strategy, so this option will
              rarely need to be set.
       --wordjump <int>
              The jump between query words used to yield the word neighbourhood.  If set to  1,  every  word  is
              used,  if  set  to 2, every other word is used, and if set to the wordlength, only non-overlapping
              words will be used.  This option reduces the memory  requirements  when  using  very  large  query
              sequences,  and  makes  the  search  run  faster, but it also damages search sensitivity when high
              values are set.
       --wordambiguity <limit>
              This option may be used to allow alignment seeds containing IUPAC ambiguity symbols.  The limit is
              the maximum number of ambiguous words allowed at a single position.  If this limit is reached then
              the position is not used for alignment seeding.  Using this option may slow down  a  search.   For
              large  datasets,  it  is  recommended  to  use  esd2esi --wordambiguity instead, as then the speed
              overhead is only incurred during indexing, rather than during the database searching itself.   NB.
              This  option  only  works  for IUPAC symbols in the target sequence.  Query words containing IUPAC
              symbols are (currently) excluded from seeding.
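
              The effect of --wordjump described above can be pictured by listing the query words taken at
              each jump size.  This is an illustration only, not exonerate's seeding code: the function name
              query_words is hypothetical, and starting the first word at position 0 is an assumption.

```python
from typing import List


def query_words(sequence: str, word_length: int, word_jump: int = 1) -> List[str]:
    """List the words of a query taken every word_jump positions."""
    if word_length < 1 or word_jump < 1:
        raise ValueError("word_length and word_jump must be positive")
    last_start = len(sequence) - word_length
    return [
        sequence[start:start + word_length]
        for start in range(0, last_start + 1, word_jump)
    ]


query = "ACGTACGT"
print(query_words(query, 4, 1))  # ['ACGT', 'CGTA', 'GTAC', 'TACG', 'ACGT']
print(query_words(query, 4, 2))  # ['ACGT', 'GTAC', 'ACGT']
print(query_words(query, 4, 4))  # ['ACGT', 'ACGT']
```

              Fewer words mean less memory and less work, but also fewer chances to seed an alignment, which
              is the sensitivity cost noted for --wordjump.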

AFFINE MODEL OPTIONS

       -o | --gapopen <penalty>
              This is the gap open penalty.
       -e | --gapextend <penalty>
              This is the gap extension penalty.
       --codongapopen <penalty>
              This is the codon gap open penalty.
       --codongapextend <penalty>
              This is the codon gap extension penalty.

NER OPTIONS

       --minner <length>
              Minimum NER length allowed.
       --maxner <length>
              Maximum NER length allowed.  NB. this option only affects heuristic alignments.
       --neropen <penalty>
              Penalty for opening a non-equivalenced region.

INTRON MODELLING OPTIONS

       --minintron <length>
              Minimum intron length limit.  NB. this option only affects heuristic alignments.  This  is  not  a
              hard limit - it only affects size of introns which are sought during heuristic alignment.
       --maxintron <length>
              Maximum intron length limit.  See the notes above for --minintron.
       -i | --intronpenalty <penalty>
              Penalty for introduction of an intron.

FRAMESHIFT MODELLING OPTIONS

       -f | --frameshift <penalty>
              The penalty for the inclusion of a frameshift in an alignment.

ALPHABET OPTIONS

       --useaatla <boolean>
              Use three-letter abbreviations for amino acid names, ie. when displaying alignments, "Met" is
              used instead of " M ".

TRANSLATION OPTIONS

       --geneticcode <code>
              Specify an alternative genetic code.  The default code (1) is the standard genetic code.  Other
              genetic codes may be specified in shorthand or longhand form.

              In  shorthand  form,  a number between 1 and 23 is used to specify one of 17 built-in genetic code
              variants.  These are genetic code variants taken from:

              http://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi

              These are:
              1      The Standard Code
              2      The Vertebrate Mitochondrial Code
              3      The Yeast Mitochondrial Code
              4      The Mold, Protozoan, and Coelenterate Mitochondrial  Code  and  the  Mycoplasma/Spiroplasma
                     Code
              5      The Invertebrate Mitochondrial Code
              6      The Ciliate, Dasycladacean and Hexamita Nuclear Code
              9      The Echinoderm and Flatworm Mitochondrial Code
              10     The Euplotid Nuclear Code
              11     The Bacterial and Plant Plastid Code
              12     The Alternative Yeast Nuclear Code
              13     The Ascidian Mitochondrial Code
              14     The Alternative Flatworm Mitochondrial Code
              15     Blepharisma Nuclear Code
              16     Chlorophycean Mitochondrial Code
              21     Trematode Mitochondrial Code
              22     Scenedesmus obliquus mitochondrial Code
              23     Thraustochytrium Mitochondrial Code

              In longhand form, a genetic code variant may be provided as a 64 byte string in TCAG order, eg.
              the standard genetic code in this form would be:

              FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG
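
              The TCAG ordering means that the first base of the codon varies slowest and the third
              base fastest, so the string starts with TTT, TTC, TTA, TTG, TCT and so on.  A minimal
              sketch of decoding such a string into a codon table follows; the function names are
              hypothetical, and unknown or ambiguous codons are shown as "X" purely as a choice for
              this example.

```python
from itertools import product

BASES = "TCAG"
STANDARD = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"


def codon_table(code):
    """Map each codon to its amino acid for a 64-symbol code in TCAG order."""
    if len(code) != 64:
        raise ValueError("a longhand genetic code must have exactly 64 symbols")
    codons = ("".join(c) for c in product(BASES, repeat=3))
    return dict(zip(codons, code))


def translate(dna, code=STANDARD):
    """Translate DNA in frame 0; any trailing partial codon is ignored."""
    table = codon_table(code)
    dna = dna.upper()
    return "".join(table.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3))


if __name__ == "__main__":
    print(translate("ATGGCCTAA"))  # prints MA*
```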

HSP CREATION OPTIONS

       --hspfilter <threshold>
              Use aggressive HSP filtering to speed up heuristic searches.  The threshold specifies  the  number
              of  HSPs  centred about a point in the query which will be stored.  Any lower scoring HSPs will be
              discarded.  This is an experimental option to handle speed problems caused by some  sequences.   A
              value of about 100 seems to work well.
       --useworddropoff <boolean>
              When this is TRUE, the score threshold for admitting words into the word neighbourhood is set to
              be the initial word score minus the word threshold (see below).  This strategy is designed to
              prevent the word neighbourhood from being overly restricted.  When this is FALSE, the word
              threshold is taken to be an absolute value.
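
              The two interpretations of the word threshold can be written out as a small sketch.  The
              function and parameter names are hypothetical and only restate the rule described above.

```python
def word_admission_threshold(initial_word_score, word_threshold, use_word_dropoff):
    """Return the minimum score a word needs to enter the neighbourhood.

    With dropoff enabled the threshold is relative to the initial word's own
    score; otherwise the word threshold is used as an absolute value.
    """
    if use_word_dropoff:
        return initial_word_score - word_threshold
    return word_threshold


if __name__ == "__main__":
    # The same word threshold of 5 gives very different cut-offs.
    print(word_admission_threshold(30, 5, True))
    print(word_admission_threshold(30, 5, False))
```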
       --seedrepeat <count>
              The seedrepeat parameter sets the number of seeds which must be found  on  the  same  diagonal  or
              reading  frame  before HSP extension will occur.  Increasing the value for --seedrepeat will speed
              up searches, and is usually a better option than using  longer  word  lengths,  particularly  when
              using the exonerate-server, where increasing word lengths requires recomputing the index and
              greatly increases memory requirements.
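
              The idea of requiring several seeds on one diagonal can be sketched as below.  This is a
              simplified illustration under stated assumptions: seeds are given as hypothetical
              (query position, target position) pairs, the diagonal is taken as their difference, and
              no account is taken of reading frames or of how close together the seeds are.

```python
from collections import Counter


def diagonals_to_extend(seeds, seed_repeat):
    """Return the diagonals that hold at least `seed_repeat` seeds.

    `seeds` is an iterable of (query_pos, target_pos) word hits.
    """
    counts = Counter(target_pos - query_pos for query_pos, target_pos in seeds)
    return sorted(d for d, n in counts.items() if n >= seed_repeat)


if __name__ == "__main__":
    seeds = [(0, 10), (5, 15), (9, 19), (3, 40), (7, 2)]
    print(diagonals_to_extend(seeds, 1))  # every diagonal with a seed
    print(diagonals_to_extend(seeds, 2))  # only the diagonal with repeated seeds
```

              Raising the count discards isolated hits before any extension work is done, which is
              why it speeds up the search without changing the word length.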
       -w --dnawordlen <bases>
       -W --proteinwordlen <residues>
       -W --codonwordlen <bases>
              The word length used for DNA, protein or codon words.  When performing DNA vs protein comparisons,
              the DNA wordlength will always (automatically) be triple the protein wordlength.
       --dnahspdropoff <score>
       --proteinhspdropoff <score>
       --codonhspdropoff <score>
              The amount by which an HSP score will be allowed to degrade during HSP extension.  Separate
              thresholds can be set for DNA, protein or codon comparisons.
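
              A dropoff of this kind is commonly applied as follows: extension continues while the
              running score stays within the dropoff of the best score seen so far.  The sketch below
              extends an ungapped match to the right only.  The match and mismatch scores, and all
              names, are assumptions chosen for the example and are not exonerate's defaults.

```python
def extend_right(query, target, q_start, t_start, dropoff, match=5, mismatch=-4):
    """Extend an ungapped match rightwards; return (best_length, best_score).

    Extension stops once the running score has fallen more than `dropoff`
    below the best score seen so far, or when either sequence ends.
    """
    score = best_score = 0
    best_length = 0
    i = 0
    while q_start + i < len(query) and t_start + i < len(target):
        score += match if query[q_start + i] == target[t_start + i] else mismatch
        i += 1
        if score > best_score:
            best_score, best_length = score, i
        elif best_score - score > dropoff:
            break
    return best_length, best_score


if __name__ == "__main__":
    # A larger dropoff lets the extension cross the single mismatch.
    print(extend_right("ACGTAACGT", "ACGTTACGT", 0, 0, dropoff=10))
    print(extend_right("ACGTAACGT", "ACGTTACGT", 0, 0, dropoff=3))
```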
       --dnahspthreshold <score>
       --proteinhspthreshold <score>
       --codonhspthreshold <score>
              The HSP score thresholds.  An HSP must score at least this much before it will be reported  or  be
              used in preparation of a heuristic alignment.
       --dnawordlimit <score>
       --proteinwordlimit <score>
       --codonwordlimit <score>
              The threshold for admitting DNA, protein or codon words into the word neighbourhood.  The
              behaviour of this option is altered by the --useworddropoff option (see above).

       --geneseed <threshold>
              Exclude HSPs from gapped alignment computation which cannot feature in an alignment containing at
              least one HSP scoring at least this threshold.

              This  option  provides  considerable speed up for gapped alignment computation, but may cause some
              very gap-rich alignments to be missed.

              It is useful when aligning similar sequences back onto a genome quickly, eg. try --geneseed 250.
       --geneseedrepeat <count>
              The geneseedrepeat parameter is like the seedrepeat parameter, but is only applied when looking
              for the geneseed HSPs.  Using a larger value for --geneseedrepeat will speed up searches when the
              --geneseed parameter is also used.  (experimental, implementation incomplete)

ALIGNMENT OPTIONS

       --alignmentwidth <width>
              Width of alignment display.  The default is 80.
       --forwardcoordinates <boolean>
              By default, all coordinates are reported on the forward strand.   Setting  this  option  to  false
              reverts  to  the  old  behaviour  (pre-0.8.3)  whereby  alignments  on the reverse complement of a
              sequence are reported using coordinates on the reverse complement.

SUB-ALIGNMENT REGION OPTIONS

       --quality <percent>
              This option excludes HSPs from BSDP when their components outside of  the  SARs  fall  below  this
              quality threshold.

SPLICE SITE PREDICTION OPTIONS

       --splice3 <path>
       --splice5 <path>
              Provide  a  file  containing  a custom PSSM (position specific score matrix) for prediction of the
              intron splice sites.

              The file format for splice data is simple: lines beginning with '#' are comments, a line
              containing just the word 'splice' denotes the position of the splice site, and the other lines
              show the observed relative frequencies of the bases flanking the splice sites in the chosen
              organism (in ACGT order).

              Example 5' splice data file:

               # start of example 5' splice data
               # A C G T
               28 40  17  14
               59 14  13  14
                8  5  81   6
               splice
                0  0 100   0
                0  0   0 100
               54  2  42   2
               74  8  11   8
                5  6  85   4
               16 18  21  45
               # end of example 5' splice data

              Example 3' splice data file:

               # start of example 3' splice data
               # A C G T
                10  31  14  44
                 8  36  14  43
                 6  34  12  48
                 6  34   8  52
                 9  37   9  45
                 9  38  10  44
                 8  44   9  40
                 9  41   8  41
                 6  44   6  45
                 6  40   6  48
                23  28  26  23
                 2  79   1  18
               100   0   0   0
                 0   0 100   0
               splice
                28  14  47  11
               # end of example 3' splice data
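
              The format is simple enough to check with a few lines of code before handing a file to
              exonerate.  The reader below is a sketch based only on the description above: the
              function name is hypothetical, blank lines are skipped as a lenient assumption, and
              exonerate's own parser may be stricter or looser.

```python
EXAMPLE_5PRIME = """\
# start of example 5' splice data
# A C G T
28 40  17  14
59 14  13  14
 8  5  81   6
splice
 0  0 100   0
 0  0   0 100
54  2  42   2
74  8  11   8
 5  6  85   4
16 18  21  45
# end of example 5' splice data
"""


def parse_splice_data(text):
    """Return (rows, splice_index) for splice data text.

    Each row is a list of four frequencies in A, C, G, T order, and
    splice_index is the number of rows that precede the splice site.
    """
    rows = []
    splice_index = None
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # comment or blank line
        if line == "splice":
            if splice_index is not None:
                raise ValueError("more than one 'splice' line")
            splice_index = len(rows)
            continue
        fields = line.split()
        if len(fields) != 4:
            raise ValueError(f"expected 4 values (A C G T), got: {line!r}")
        rows.append([float(f) for f in fields])
    if splice_index is None:
        raise ValueError("no 'splice' line found")
    return rows, splice_index


if __name__ == "__main__":
    rows, splice_index = parse_splice_data(EXAMPLE_5PRIME)
    print(len(rows), splice_index)
```

              In the 5' example, the two rows directly after the splice line are 100% G and 100% T,
              which corresponds to the gt at the start of the intron.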

       --forcegtag <boolean>
              Only allow splice sites at gt....ag sites (or ct....ac sites when the gene is reversed).  With
              this restriction in place, the splice site prediction scores are still used and allow tie
              breaking when there is more than one possible splice site.
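
              The restriction itself amounts to a check on the first and last two bases of a candidate
              intron.  A minimal sketch follows, assuming the intron is described by in-between
              coordinates on the forward strand of the target (see CONVENTIONS); the function name
              and arguments are hypothetical.

```python
def is_gtag_intron(target, start, end, reverse=False):
    """Check a candidate intron for gt....ag ends (ct....ac when reversed).

    `start` and `end` are in-between coordinates on the forward strand,
    so the intron sequence is target[start:end].
    """
    intron = target[start:end].lower()
    if len(intron) < 4:
        return False
    if reverse:
        return intron.startswith("ct") and intron.endswith("ac")
    return intron.startswith("gt") and intron.endswith("ag")


if __name__ == "__main__":
    print(is_gtag_intron("AAAGTCCCCAGTTT", 3, 11))
    print(is_gtag_intron("AAACTCCCCACTTT", 3, 11, reverse=True))
```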

STRATEGIES FOR SPEED

       Keep all data on local disks.

       Apply the highest acceptable score thresholds using a combination of --score, --percent and --bestn.

       Repeat mask and dust the genomic (target) sequence.  (Softmask these sequences and use --softmasktarget).

       Increase the --fsmmemory option to allow more query multiplexing.

       Increase the value for --seedrepeat.

       When using an alignment model containing introns, set --geneseed as high as possible.

       If you are compiling exonerate yourself, see the README file supplied with the source code for details of
       compile-time optimisations.

STRATEGIES FOR SENSITIVITY

       Not documented yet.

       Increase  the  word  neighbourhood.   Decrease  the  HSP  threshold.   Increase  the  SAR  ranges.    Run
       exhaustively.

ENVIRONMENT

       Not documented yet.

EXAMPLES

       exonerate cdna.fasta genomic.fasta
              This is the simplest way in which exonerate may be used.  By default, an ungapped alignment model
              will be used.

       exonerate --exhaustive y --model est2genome cdna.fasta genomic.masked.fasta
              Exhaustively align cDNAs to genomic sequence.  This will be much, much slower, but more accurate.
              This option causes exonerate to behave like EST_GENOME.

       exonerate --exhaustive --model affine:local query.fasta target.fasta
              If  the  affine:local  model  is  used  with  exhaustive  alignment,  you  have the Smith-Waterman
              algorithm.

       exonerate --exhaustive --model affine:global protein.fasta protein.fasta
              Switch to a global model, and you have Needleman-Wunsch.

       exonerate --wordthreshold 1 --gapped no --showhsp yes protein.fasta genome.fasta
              Generate ungapped Protein:DNA alignments.

       exonerate --model coding2coding --score 1000 --bigseq yes --proteinhspthreshold 90 chr21.fa chr22.fa
              Perform quick-and-dirty translated pairwise alignment of two very large DNA sequences.

       Many similar combinations should work.  Try them out.
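
       When exonerate is driven from a script, the command lines above can be assembled
       programmatically.  The helper below is a hypothetical convenience, not part of exonerate; it
       only builds the argument list, using option names taken from this page, and leaves running it
       (for example with subprocess.run) to the caller.

```python
def exonerate_command(query, target, **options):
    """Build an exonerate argument list from long option names and values.

    Hypothetical helper: each keyword becomes "--name value", followed by
    the query path and then the target path.
    """
    argv = ["exonerate"]
    for name, value in options.items():
        argv += [f"--{name}", str(value)]
    argv += [query, target]
    return argv


if __name__ == "__main__":
    cmd = exonerate_command(
        "cdna.fasta", "genomic.masked.fasta", exhaustive="y", model="est2genome"
    )
    print(" ".join(cmd))
```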

VERSION

       This documentation accompanies version 2.2.0 of the exonerate package.

AUTHOR

       Guy St.C. Slater.  <guy@ebi.ac.uk>.
       See the AUTHORS file accompanying the source code for a list of contributors.

AVAILABILITY

       The source code for the exonerate package is available under the terms of the GNU General Public
       Licence.

       Please see the file COPYING which was distributed with this package, or
       http://www.gnu.org/licenses/gpl.txt for details.

       This package has been developed as part of the ensembl project.  Please see  http://www.ensembl.org/  for
       more information.

SEE ALSO

       exonerate-server(1), ipcress(1), blast(1L).