
Provided by: ncbi-tools-bin_6.1.20170106+dfsg1-10_amd64

NAME

       spidey - align mRNA sequences to a genome

SYNOPSIS

       spidey  [-]  [-F N]  [-G] [-L N] [-M filename] [-N filename] [-R filename] [-S p/m] [-T N]
       [-X] [-a filename] [-c N] [-d] [-e X] [-f X] [-g X] -i filename [-j] [-k filename]  [-l N]
       -m filename [-n N] [-o str] [-p N] [-r c/d/m/p/v] [-s] [-t filename] [-u] [-w]

DESCRIPTION

       spidey  is  a  tool  for  aligning one or more mRNA sequences to a given genomic sequence.
       spidey was written with two main goals in mind: find good alignments regardless of  intron
       size;  and  avoid  getting confused by nearby pseudogenes and paralogs.  Towards the first
       goal,  spidey  uses  BLAST  and  DotView  (another  local  alignment  tool)  to  find  its
       alignments;  since  these  are  both  local alignment tools, spidey does not intrinsically
       favor shorter or longer introns and has no  maximum  intron  size.   To  avoid  mistakenly
       including exons from paralogs and pseudogenes, spidey first defines windows on the genomic
       sequence and then performs the mRNA-to-genomic alignment separately  within  each  window.
       Because of the way the windows are constructed, neighboring paralogs or pseudogenes should
       be in separate windows and should not be included in the final spliced alignment.

   Initial alignments and construction of genomic windows
       spidey takes as input a single genomic sequence and a set  of  mRNA  accessions  or  FASTA
       sequences.   All  processing is done one mRNA sequence at a time.  The first step for each
       mRNA sequence is a high-stringency BLAST against the genomic sequence.  The resulting hits
       are analyzed to find the genomic windows.

       The  BLAST  alignments  are  sorted by score and then assigned into windows by a recursive
       function which takes the first alignment and then goes down the alignment list to find all
       alignments  that  are  consistent  with  the first (same strand of mRNA, both the mRNA and
       genomic coordinates are nonoverlapping and linearly consistent).   On  subsequent  passes,
       the  remaining  alignments  are  examined  and  are  put  into  their  own nonoverlapping,
       consistent windows, until no alignments are left.  Depending on how many gene  models  are
       desired  (set  with  -n), the top n windows are chosen to go on to the next step and the
       others are discarded.
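
       The window assignment described above can be sketched roughly as follows.  This  is  an
       illustration  only, not spidey's code: the Hit record, its 0-based inclusive coordinates,
       the minus-strand ordering rule and the function names  are  assumptions,  and  a  loop
       stands in for the recursive function mentioned in the text.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Hit:
    """One BLAST hit (hypothetical layout; 0-based, inclusive coordinates)."""
    m_start: int   # start on the mRNA
    m_stop: int    # stop on the mRNA
    g_start: int   # start on the genomic sequence
    g_stop: int    # stop on the genomic sequence
    strand: str    # "+" or "-"
    score: float

def consistent(a: Hit, b: Hit) -> bool:
    """True if a and b are on the same strand, do not overlap, and keep the same order."""
    if a.strand != b.strand:
        return False
    first, second = (a, b) if a.m_start <= b.m_start else (b, a)
    if first.m_stop >= second.m_start:          # mRNA ranges overlap
        return False
    if a.strand == "+":
        return first.g_stop < second.g_start    # genomic order follows mRNA order
    return second.g_stop < first.g_start        # assumed: order is reversed on minus

def build_windows(hits, n_models=1):
    """Group hits into windows, best-scoring seed first; keep the first n_models windows."""
    remaining = sorted(hits, key=lambda h: h.score, reverse=True)
    windows = []
    while remaining:
        window, leftover = [remaining[0]], []
        for hit in remaining[1:]:
            if all(consistent(member, hit) for member in window):
                window.append(hit)
            else:
                leftover.append(hit)
        windows.append(window)
        remaining = leftover                    # next pass works on what is left
    return windows[:n_models]
```

       In this sketch a second copy of the gene further along the genomic sequence ends up in
       its  own  window,  because its hits overlap the first window's hits on the mRNA and are
       therefore not consistent with them.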

   Aligning in each window
       Once the genomic windows are constructed, the  initial  BLAST  alignments  are  freed  and
       another  BLAST  search  is  performed,  this time with the entire mRNA against the genomic
       region defined by the window, and at a lower stringency than the initial  search.   spidey
       then  uses  a  greedy  algorithm  to generate a high-scoring, nonoverlapping subset of the
       alignments from the second BLAST search.  This consistent set  is  analyzed  carefully  to
       make sure that the entire mRNA sequence is covered by the alignments.  When gaps are found
       between the alignments, the appropriate region of genomic sequence is searched against the
       missing  mRNA,  first  using a very low-stringency BLAST and, if the BLAST fails to find a
       hit, using DotView functions to locate the alignment.  When gaps are found at the ends  of
       the  alignments,  the  BLAST  and  DotView  searches  are  allowed  to  extend  past  the
       boundaries of the window.  If the 3' end of the mRNA does  not  align  completely,  it  is
       first  examined  for  the  presence  of  a  poly(A) tail.  No attempt is made to align the
       portion of the mRNA that appears to be a poly(A) tail.  When a poly(A) tail does align  to
       the  genomic  sequence,  this  is  noted  because  it  indicates  the  possibility  of  a
       pseudogene.
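
       The greedy selection and the coverage check can  be  pictured  with  the  small  sketch
       below.   It  assumes  alignments are given as (m_start, m_stop, score) tuples in 0-based
       inclusive mRNA coordinates and that the poly(A) tail is simply the run of trailing  A's;
       spidey's actual data structures and poly(A) test are not described here and may differ.

```python
def greedy_subset(alignments):
    """Keep the highest-scoring alignments that do not overlap on the mRNA."""
    chosen = []
    for aln in sorted(alignments, key=lambda a: a[2], reverse=True):
        # accept aln only if it lies entirely before or after every chosen alignment
        if all(aln[1] < c[0] or aln[0] > c[1] for c in chosen):
            chosen.append(aln)
    return sorted(chosen)                       # back into mRNA order

def uncovered(chosen, mrna_len):
    """Return the mRNA ranges (start, stop) not covered by the chosen alignments."""
    gaps, pos = [], 0
    for start, stop, _score in chosen:
        if start > pos:
            gaps.append((pos, start - 1))
        pos = max(pos, stop + 1)
    if pos < mrna_len:
        gaps.append((pos, mrna_len - 1))        # unaligned 3' end
    return gaps

def polya_tail_length(mrna):
    """Length of the run of A's at the 3' end (simplified: no mismatches allowed)."""
    seq = mrna.upper()
    return len(seq) - len(seq.rstrip("A"))
```

       Each range returned by uncovered() corresponds to a region that would be  searched  again
       at lower stringency; a gap at the 3' end would first be compared with the poly(A) length.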

       Now that the mRNA is completely covered by the set of alignments, the  boundaries  of  the
       alignments  (there  should  be  one  alignment  per  exon  now)  are  adjusted so that the
       alignments abut each other precisely and so that they are adjacent to  good  splice  donor
       and  acceptor  sites.  Most commonly, two adjacent exons' alignments overlap by as much as
       20 or 30 base pairs on the mRNA sequence.  The true exon boundary may lie anywhere  within
       this  overlap, or (as we have seen empirically) even a few base pairs outside the overlap.
       To position the exon boundaries, the overlap plus  a  few  base  pairs  on  each  side  is
       examined  for  splice  donor  sites,  using  functions that have different splice matrices
       depending on the organism chosen.  The top few splice donor  sites  (by  score)  are  then
       evaluated  as  to  how  much they affect the original alignment boundaries.  The site that
       affects the boundaries the least is chosen, and is evaluated as  to  the  presence  of  an
       acceptor  site.   The  alignments  are  truncated  or  extended  as necessary so that they
       terminate at the splice donor site and so that they do not overlap.
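
       A toy version of the boundary choice is sketched  below.   The  scoring  function  is  a
       deliberate  simplification  (a  site  scores 1.0 if the intron would begin with GT, else
       0.0) standing in for the organism-specific splice matrices; the function names  and  the
       coordinate convention (0-based positions on the genomic sequence) are assumptions.

```python
def donor_score(genomic, pos):
    """Toy stand-in for a splice matrix: 1.0 if an intron starting at pos begins with GT."""
    return 1.0 if genomic[pos:pos + 2] == "GT" else 0.0

def choose_donor(genomic, boundary, lo, hi, top_n=3):
    """Pick the intron start in [lo, hi] among the top-scoring donor sites.

    boundary is the intron start implied by the original alignment; of the top_n
    candidates, the one that moves the boundary least is returned.
    """
    scored = [(donor_score(genomic, p), p) for p in range(lo, hi + 1)]
    scored = [sp for sp in scored if sp[0] > 0]
    if not scored:
        return boundary                         # nothing better found; leave as is
    scored.sort(key=lambda sp: sp[0], reverse=True)
    top = scored[:top_n]
    return min(top, key=lambda sp: abs(sp[1] - boundary))[1]

def has_acceptor(genomic, next_exon_start):
    """True if the two bases just before the next exon are AG."""
    return next_exon_start >= 2 and genomic[next_exon_start - 2:next_exon_start] == "AG"
```

       With a real matrix the scores would differ between sites, so the top_n cut  would  matter
       more than it does with this all-or-nothing score.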

   Final result
       The windows are examined carefully to get the percent identity per  exon,  the  number  of
       gaps per exon, the overall percent identity, the percent coverage of the mRNA, presence of
       an aligning or non-aligning poly(A) tail, number of splice donor sites and the presence or
       absence  of  splice  donor and acceptor sites for each exon, and the occurrence of an mRNA
       that has a 5' or 3' end (or both) that does not align to the  genomic  sequence.   If  the
       overall percent identity and percent length coverage are above the user-defined cutoffs, a
       summary report is printed, and, if requested, a  text  alignment  showing  identities  and
       mismatches is also printed.
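
       The cutoff logic can be illustrated as below.  The exon tuple  layout,  the  definition
       of  identity  as  matches  divided  by  aligned  mRNA  bases, and the default cutoffs of
       0.0 are assumptions made for the example; the cutoffs correspond to the -c and -l options.

```python
def summarize(exons, mrna_len, min_identity=0.0, min_coverage=0.0):
    """Summarize one gene model.

    exons: list of (m_start, m_stop, matches, gaps) with 0-based inclusive,
    nonoverlapping mRNA coordinates.
    """
    lengths = [stop - start + 1 for start, stop, _m, _g in exons]
    aligned = sum(lengths)
    matches = sum(m for _s, _e, m, _g in exons)
    identity = 100.0 * matches / aligned if aligned else 0.0
    coverage = 100.0 * aligned / mrna_len if mrna_len else 0.0
    return {
        "exon_identity": [100.0 * e[2] / n for e, n in zip(exons, lengths)],
        "exon_gaps": [e[3] for e in exons],
        "identity": identity,
        "coverage": coverage,
        # a report is printed only when both cutoffs are met
        "report": identity >= min_identity and coverage >= min_coverage,
    }
```

       For two 100-base exons with 98 and 100 matches on a  250-base  mRNA,  this  gives  99%
       overall identity and 80% coverage, so a 90% coverage cutoff would suppress the report.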

   Interspecies alignments
       spidey  is  capable  of  performing  interspecies  alignments.   The  major  difference in
       interspecies alignments is that the mRNA-genomic identity will not be close to 100% as  it
       is  in  intraspecies  alignments; also, the alignments have numerous and lengthy gaps.  If
       spidey is used in its normal mode to do interspecies alignments, it produces  gene  models
       with  a  large  number of short exons.  When the interspecies flag (-s) is set, spidey uses
       different BLAST parameters that allow more and longer gaps and  penalize  mismatches  less
       heavily.   This  way,  the  alignments  for  the  exons are much longer and more closely
       approximate the actual gene structure.

   Extracting CDS alignments
       When spidey is run in network-aware mode or  when  ASN.1  files  are  used  for  the  mRNA
       records,  it  is capable of extracting a CDS alignment from an mRNA alignment and printing
       the CDS information also.  Since the CDS alignment is just a subset of the mRNA alignment,
       it  is  relatively  straightforward  to  truncate  the exon alignments as necessary and to
       generate a CDS alignment.  Furthermore, the untranslated regions are now defined,  so  the
       percent identity for the 5' and 3' untranslated regions is also calculated.
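
       Truncating exon alignments to the coding region amounts to  an  interval  intersection,
       sketched  below  on  mRNA  coordinates  only.   The  function  names  and  the  0-based
       inclusive coordinates are assumptions; the matching genomic coordinates would have to be
       trimmed by the same amounts.

```python
def clip_to_cds(exons, cds_start, cds_stop):
    """Return the parts of each exon (m_start, m_stop) that fall inside the CDS."""
    clipped = []
    for start, stop in exons:
        lo, hi = max(start, cds_start), min(stop, cds_stop)
        if lo <= hi:                            # exon lies at least partly in the CDS
            clipped.append((lo, hi))
    return clipped

def utr_ranges(mrna_len, cds_start, cds_stop):
    """Return the 5' and 3' untranslated ranges, or None where there is no UTR."""
    five = (0, cds_start - 1) if cds_start > 0 else None
    three = (cds_stop + 1, mrna_len - 1) if cds_stop < mrna_len - 1 else None
    return five, three
```

       Exons that lie entirely in an untranslated region simply drop out of the CDS alignment.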

OPTIONS

       A summary of options is included below.

       -      Print usage message.

       -F N   Start of genomic interval desired (from; 0-based).

       -G     Input file is a GI list.

       -L N   The extra-large intron size to use (default = 220000).

       -M filename
              File with donor splice matrix.

       -N filename
              File with acceptor splice matrix.

       -R filename
              File (including path) to repeat blast database for filtering.

       -S p/m Restrict to plus (p) or minus (m) strand of genomic sequence.

       -T N   Stop of genomic interval desired (to; 0-based).

       -X     Use  extra-large intron sizes (increases the limit for initial and terminal introns
              from 100kb to 240kb and  for  all  others  from  35kb  to  120kb);  may  result  in
              significantly longer compute times.

       -a filename
              Output  file  for  alignments when directed to a separate file with -p 3 (default =
              spidey.aln).

       -c N   Identity cutoff, in percent, for quality control purposes.

       -d     Also try to align coding sequences corresponding to the  given  mRNA  records  (may
              require network access).

       -e X   First-pass  e-value  (default = 1.0e-10).  Higher values increase speed at the cost
              of sensitivity.

       -f X   Second-pass e-value (default = 0.001).

       -g X   Third-pass e-value (default = 10).

       -i filename
              Input file containing the genomic sequence in  ASN.1  or  FASTA  format.   If  your
              computer  is  running  on a network that can access GenBank, you can substitute the
              desired accession number for the filename.

       -j     Also print the alignment as ASN.1 (to the file given by -k).

       -k filename
              File for ASN.1 output with -j (default = spidey.asn).

       -l N   Length coverage cutoff, in percent.

       -m filename
              Input file containing the mRNA sequence(s) in ASN.1 or FASTA format, or a  list  of
              their  accessions  (with  -G).   If  your computer is running on a network that can
              access GenBank, you can substitute a single accession number for the filename.

       -n N   Number of gene models to return per input mRNA (default = 1).

       -o str Main output file (default = stdout; contents controlled by -p).

       -p N   Output mode for the summary and alignments:
              0      summary and alignments together (default)
              1      just the summary
              2      just the alignments
              3      summary and alignments in different files

       -r c/d/m/p/v
              Organism of genomic sequence, used to determine splice matrices.
              c      C. elegans
              d      Drosophila
              m      Dictyostelium discoideum
              p      plant
              v      vertebrate (default)

       -s     Tune for interspecies alignments.

       -t filename
              File with feature table, in 4 tab-delimited columns:
              seqid  (e.g., NM_04377.1)
              name   (only repetitive_region is currently supported)
              start  (0-based)
              stop   (0-based)

       -u     Make a multiple alignment of all input mRNAs (which must  overlap  on  the  genomic
              sequence).

       -w     Consider lowercase characters in input FASTA sequences to be masked.
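
       As an illustration of how the options combine, the sketch below formats a feature table
       for  -t  and  assembles  an  argument list for a run.  Only options documented above are
       used; the helper names and the file names genomic.fa and  mrna.fa  are  hypothetical,  and
       the script does not run spidey itself.

```python
import shlex

ORGANISMS = {"c", "d", "m", "p", "v"}           # values accepted by -r

def format_feature_table(rows):
    """Format (seqid, start, stop) rows as the 4-column, tab-delimited table for -t."""
    return "".join(
        f"{seqid}\trepetitive_region\t{start}\t{stop}\n"   # 0-based start and stop
        for seqid, start, stop in rows
    )

def spidey_command(genomic, mrna, *, organism="v", models=1, print_mode=0,
                   interspecies=False, feature_table=None, output=None):
    """Build an argv list for spidey from a few of the documented options."""
    if organism not in ORGANISMS:
        raise ValueError(f"unknown organism code: {organism!r}")
    if print_mode not in (0, 1, 2, 3):
        raise ValueError(f"unknown print mode: {print_mode!r}")
    argv = ["spidey", "-i", genomic, "-m", mrna,
            "-r", organism, "-n", str(models), "-p", str(print_mode)]
    if interspecies:
        argv.append("-s")
    if feature_table:
        argv += ["-t", feature_table]
    if output:
        argv += ["-o", output]
    return argv

if __name__ == "__main__":
    # prints: spidey -i genomic.fa -m mrna.fa -r v -n 2 -p 0
    print(shlex.join(spidey_command("genomic.fa", "mrna.fa", models=2)))
```

       The resulting list can be passed to subprocess.run() on a system where spidey is installed.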

AUTHOR

       Sarah  Wheelan  and  others  at the National Center for Biotechnology Information; Steffen
       Moeller contributed to this documentation.

SEE ALSO

       <http://www.ncbi.nlm.nih.gov/spidey>