Provided by: fasta3_36.3.8i.14-Nov-2020-1_amd64 bug

NAME

       fasta36 - scan a protein or DNA sequence library for similar sequences

       fastx36   -  compare a DNA sequence to a protein sequence database, comparing the translated DNA sequence
       in forward and reverse frames.

       tfastx36  - compare a protein  sequence  to  a  DNA  sequence  database,  calculating  similarities  with
       frameshifts to the forward and reverse orientations.

       fasty36   -  compare a DNA sequence to a protein sequence database, comparing the translated DNA sequence
       in forward and reverse frames.

       tfasty36  - compare a protein  sequence  to  a  DNA  sequence  database,  calculating  similarities  with
       frameshifts to the forward and reverse orientations.

       fasts36 - compare unordered peptides to a protein sequence database

       fastm36 - compare ordered peptides (or short DNA sequences) to a protein (DNA) sequence database

       tfasts36 - compare unordered peptides to a translated DNA sequence database

       fastf36 - compare mixed peptides to a protein sequence database

       tfastf36 - compare mixed peptides to a translated DNA sequence database

       ssearch36 - compare a protein or DNA sequence to a sequence database using the Smith-Waterman algorithm.

       ggsearch36  -  compare  a  protein  or  DNA  sequence  to  a  sequence  database using a global alignment
       (Needleman-Wunsch)

       glsearch36 - compare a protein or DNA sequence to a sequence database with alignments that are global  in
       the query and local in the database sequence (global-local).

       lalign36  - produce multiple non-overlapping alignments for protein and DNA sequences using the Huang and
       Miller sim algorithm for the Waterman-Eggert algorithm.

       prss36, prfx36 - discontinued; all the FASTA programs will estimate statistical  significance  using  500
       shuffled sequence scores if two sequences are compared.

DESCRIPTION

       Release  3.6  of the FASTA package provides a modular set of sequence comparison programs that can run on
       conventional single processor computers or in parallel on multiprocessor computers.  More  than  a  dozen
       programs  -  fasta36,  fastx36/tfastx36,  fasty36/tfasty36,  fasts36/tfasts36, fastm36, fastf36/tfastf36,
       ssearch36, ggsearch36, and glsearch36 - are currently available.

       All the comparison programs share a set of basic command line options; additional options  are  available
       for individual comparison functions.

       Threaded  versions  of  the  FASTA  programs (built by default under Unix/Linux/MacOX) run in parallel on
       modern Linux and Unix multi-core or  multi-processor  computers.   Accelerated  versions  of  the  Smith-
       Waterman  algorithm are available for architectures with the Intel SSE2 or Altivec PowerPC architectures,
       which can speed-up Smith-Waterman calculations 10 - 20-fold.

       In addition to the serial and threaded  versions  of  the  FASTA  programs,  MPI  parallel  versions  are
       available as fasta36_mpi, ssearch36_mpi, fastx36_mpi, etc. The MPI parallel versions use the same command
       line options as the serial and threaded versions.

Running the FASTA programs

       By default, the FASTA programs are no  longer  interactive;  they  are  run  from  the  command  line  by
       specifying  the  program,  query.file, and library.file.  Program options must preceed the query.file and
       library.file arguments:

     fasta36 -option1 -option2 -option3 query.file library.file > fasta.output

       The "classic" interactive mode, which prompts for a query.file and library.file, is available with the -I
       option.   Typing  a  program  name  without  any  arguments  (ssearch36)  provides  a short help message;
       program_name -help provides a complete set of program options.

       Program options MUST preceed the query.file and library.file arguments.

FASTA program options

       The default scoring matrix and gap penalties used by each of the programs have  been  selected  for  high
       sensitivity  searches  with  the  various  algorithms.   The  default program behavior can be modified by
       providing command line options before the query.file and library.file arguments.   Command  line  options
       can also be used in interactive mode.

       Command line arguments come in several classes.

       (1)  Commands  that specify the comparison type. FASTA, FASTS, FASTM, SSEARCH, GGSEARCH, and GLSEARCH can
       compare either protein or DNA sequences, and attempt to recognize the  comparison  type  by  looking  the
       residue  composition.  -n,  -p specify DNA (nucleotide) or protein comparison, respectively. -U specifies
       RNA comparison.

       (2) Commands that limit the set of sequences compared: -1, -3, -M.

       (3) Commands that modify the scoring parameters: -f gap-open penaltyP, -g gap-extend penalty,  -j  inter-
       codon frame-shift, within-codon frameshift, -s scoring-matrix, -r match/mismatch score, -x X:X score.

       (4)  Commands that modify the algorithm (mostly FASTA and [T]FASTX/Y): -c, -w, -y, -o. The -S can be used
       to ignore lower-case (low complexity) residues during the initial score calculation.

       (5) Commands that modify the output: -A, -b number, -C width, -d number, -L, -m 0-11,B, -w line-width, -W
       context-width, -o offset1,ofset2

       (6) Commands that affect statistical estimates: -Z, -k.

Option summary:

       -1     Sort by "init1" score (obsolete)

       -3     ([t]fast[x,y] only) use only forward frame translations

       -a     Displays  the  full length (included unaligned regions) of both sequences with fasta36, ssearch36,
              glsearch36, and fasts36.

       -A (fasta36 only) For DNA:DNA, force Smith-Waterman alignment for
              output.  Smith-Waterman is the default for FASTA protein alignment and [t]fast[x,y], but  not  for
              DNA comparisons with FASTA.  For protein:protein, use band-alignment algorithm.

       -b #   number  of  best  scores/descriptions  to  show (must be < expectation cutoff if -E is given).  By
              default, this option is no longer used; all scores better than the expectation  (E())  cutoff  are
              listed.  To  guarantee  the display of # descriptions/scores, use -b =#, i.e. -b =100 ensures that
              100 descriptions/scores will be displayed.  To guarantee at least 1 description, but possibly many
              more (limited by -E e_cut), use -b >1.

       -c "E-opt E-join"
              threshold for gap joining (E-join) and band optimization (E-opt) in FASTA and [T]FASTX/Y.  FASTA36
              now uses BLAST-like statistical  thresholds  for  joining  and  band  optimization.   The  default
              statistical  thresholds for protein and translated comparisons are E-opt=0.2, E-join=0.5; for DNA,
              E-join = 0.1 and E-opt= 0.02. The actual number of joins and optimizations is reported  after  the
              E-join  and  E-opt  scoring  parameters.  Statistical thresholds improves search speed 2 - 3X, and
              provides much more accurate statistical estimates for matrices other than BLOSUM50. The  "classic"
              joining/optimization  thresholds  that  were  the  default  in  fasta35  and  earlier programs are
              available using -c O (upper  case  O),  possibly  followed  a  value  >  1.0  to  set  the  optcut
              optimization threshold.

       -C #   length of name abbreviation in alignments, default = 6.  Must be less than 20.

       -d #   number  of  best  alignments to show ( must be < expectation (-E) cutoff and <= the -b description
              limit).

       -D     turn on debugging mode.  Enables checks on sequence alphabet that cause  problems  with  tfastx36,
              tfasty36  (only  available  after  compile  time  option).   Also  preserves  temp  files  with -e
              expand_script.sh option.

       -e expand_script.sh
              Run a script to expand the set of sequences displayed/aligned based on the results of the  initial
              search.   When  the  -e  expand_script.sh  option  is  used, after the initial scan and statistics
              calculation, but before the "Best scores" are shown, expand_script.sh with a single argument,  the
              name  of  a  file  that contains the accession information (the text on the fasta description line
              between > and the first space) and the E()-value for the  sequence.   expand_script.sh  then  uses
              this  information  to send a library of additional sequences to stdout. These additional sequences
              are included in the list of high-scoring sequences (if their scores are significant) and  aligned.
              The additional sequences do not change the statistics or database size.

       -E e_cut e_cut_r
              expectation  value upper limit for score and alignment display.  Defaults are 10.0 for FASTA36 and
              SSEARCH36 protein searches, 5.0 for  translated  DNA/protein  comparisons,  and  2.0  for  DNA/DNA
              searches.  FASTA  version  36  now reports additional alignments between the query and the library
              sequence, the second value sets the threshold for the subsequent alignments.  If  not  given,  the
              threshold  is  e_cut/10.0.   If  given  and value > 1.0, e_cut_r = e_cut / value; for value < 1.0,
              e_cut_r = value;  If e_cut_r < 0, then the additional alignment option is disabled.

       -f #   penalty for opening a gap.

       -F #   expectation value lower limit for score and alignment display.  -F 1e-6 prevents library sequences
              with E()-values lower than 1e-6 from being displayed. This allows the use to focus on more distant
              relationships.

       -g #   penalty for additional residues in a gap

       -h     Show short help message.

       -help  Show long help message, with all options.

       -H     show histogram (with fasta-36.3.4, the histogram is not shown by default).

       -i     (fasta DNA, [t]fastx[x,y]) compare against only the reverse complement of the library sequence.

       -I     interactive mode; prompt for query filename, library.

       -j # # ([t]fast[x,y] only) penalty for a frameshift between two codons, ([t]fasty  only)  penalty  for  a
              frameshift within a codon.

       -J     (lalign36 only) show identity alignment.

       -k     specify number of shuffles for statistical parameter estimation (default=500).

       -l str specify FASTLIBS file

       -L     report long sequence description in alignments (up to 200 characters).

       -m 0,1,2,3,4,5,6,8,9,10,11,B,BB,"F# out.file" alignment display
              options.   -m  0, 1, 2, 3 display different types of alignments.  -m 4 provides an alignment "map"
              on the query. -m 5 combines the alignment map and a -m 0 alignment.  -m 6 provides an HTML output.

       -m 8 seeks to mimic BLAST -m 8 tabular output.  Only query and
              library sequence names, and identity, mismatch,  starts/stops,  E()-values,  and  bit  scores  are
              displayed.   -m  8C  mimics  BLAST  tabular  format  with comment lines.  -m 8 formats do not show
              alignments.

       -m 9 does not change the alignment output, but provides
              alignment coordinate and percent identity information with the best scores  report.   -m  9c  adds
              encoded  alignment  information  to  the -m 9; -m 9C adds encoded alignment information as a CIGAR
              formatted string. To accomodate frameshifts,  the  CIGAR  format  has  been  supplemented  with  F
              (forward)  and R (reverse).  -m 9i provides only percent identity and alignment length information
              with the best scores.  With current versions of the FASTA programs, independent -m options can  be
              combined; e.g. -m 1 -m 9c -m 6.

       -m 11 provides lav format output from lalign36.  It does not
              currently  affect  other  alignment  algorithms.   The  lav2ps and lav2svg programs can be used to
              convert lav format output to postscript/SVG alignment "dot-plots".

       -m B provides BLAST-like alignments.  Alignments are labeled as
              "Query" and "Sbjct", with coordinates on the same line as the sequences,  and  BLAST-like  symbols
              for  matches and mismatches. -m BB extends BLAST similarity to all the output, providing an output
              that closely mimics BLAST output.

       -m "F# out.file" allows one search to write different alignment
              formats to different files.  The 'F' indicates separate file output; the '#' is the output  format
              (1-6,8,9,10,11,B,BB, multiple compatible formats can be combined separated by commas -',').

       -M #-# molecular weight (residue) cutoffs.  -M "101-200" examines only library sequences that are 101-200
              residues long.

       -n     force query to nucleotide sequence

       -N #   break long library sequences into blocks of # residues.  Useful for bacterial genomes, which  have
              only  one  sequence  entry.   -N  2000 works well for well for bacterial genomes. (This option was
              required when FASTA only provided one alignment between the query and library sequence.  It is not
              as useful, now that multiple alignments are available.)

       -o "#,#"
              offsets query, library sequence for numbering alignments

       -O file
              send output to file.

       -p     force query to protein alphabet.

       -P pssm_file
              (ssearch36,  ggsearch36,  glsearch36  only).   Provide  blastpgp  checkpoint  file as the PSSM for
              searching. Two PSSM file formats  are  available,  which  must  be  provided  with  the  filename.
              'pssm_file 0' uses a binary format that is machine specific; 'pssm_file 1' uses the "blastpgp -u 1
              -C pssm_file" ASN.1 binary format (preferred).

       -q/-Q  quiet option; do not prompt for input (on by default)

       -r "+n/-m"
              (DNA only) values for match/mismatch for DNA comparisons. +n is  used  for  the  maximum  positive
              value and -m is used for the maximum negative value. Values between max and min, are rescaled, but
              residue pairs having the value -1 continue to be -1.

       -R file
              save all scores to statistics file (previously -r file)

       -s name
              specify substitution matrix.  BLOSUM50 is used by default; PAM250, PAM120,  and  BLOSUM62  can  be
              specified  by  setting  -s  P120,  P250,  or  BL62.  Additional scoring matrices include: BLOSUM80
              (BL80), and MDM10, MDM20, MDM40 (Jones, Taylor, and Thornton, 1992 CABIOS 8:275-282; specified  as
              -s MD10, -s MD20, -s MD40), OPTIMA5 (-s OPT5, Kann and Goldstein, (2002) Proteins 48:367-376), and
              VTML160 (-s VT160, Mueller and Vingron (2002) J. Comp. Biol. 19:8-13).  Each  scoring  matrix  has
              associated  default  gap  penalties.   The BLOSUM62 scoring matrix and -11/-1 gap penalties can be
              specified with -s BP62.

              Alternatively, a BLASTP format scoring matrix file can be specified, e.g. -s matrix.filename.  DNA
              scoring matrices can also be specified with the "-r" option.

              With  fasta36.3,  variable  scoring  matrices  can  be  specified by preceeding the scoring matrix
              abbreviation with '?', e.g. -s '?BP62'. Variable scoring matrices  allow  the  FASTA  programs  to
              choose  an  alternative  scoring  matrix with higher information content (bit score/position) when
              short queries are used.  For example, a 90 nucleotide FASTX query can produce only a 30 amino-acid
              alignment,  so a scoring matrix with 1.33 bits/position is required to produce a 40 bit score. The
              FASTA programs include BLOSUM50 (0.49 bits/pos) and BLOSUM62 (0.58 bits/pos) but can range to MD10
              (3.44  bits/position).  The  variable  scoring  matrix  option  searches  down the list of scoring
              matrices to find one with information content high enough to produce a 40 bit alignment score.

       -S     treat lower case letters in the query or database as low complexity regions that are equivalent to
              'X'  during  the initial database scan, but are treated as normal residues for the final alignment
              display.  Statistical estimates are based on the  'X'ed  out  sequence  used  during  the  initial
              search.  Protein  databases (and query sequences) can be generated in the appropriate format using
              John Wooton's "pseg" program, available from ftp://ftp.ncbi.nih.gov/pub/seg/pseg.  Once  you  have
              compiled the "pseg" program, use the command:

              pseg database.fasta -z 1 -q  > database.lc_seg

       -t #   Translation  table  -  [t]fastx36  and  [t]fasty36  support  the  BLAST  tranlation  tables.   See
              http://www.ncbi.nih.gov/htbin-post/Taxonomy/wprintgc?mode=c/.

       -T #   (threaded, parallel only) number of threads or workers to use (on Linux/MacOS/Unix, the default is
              to use as many processors as are available; on Windows systems, 2 processors are used).

       -U     Do RNA sequence comparisons: treat 'T' as 'U', allow G:U base pairs (by scoring "G-A" and "T-C" as
              score(G:G)-3).  Search only one strand.

       -V "?$%*"
              Allow special annotation characters in query sequence.  These characters will be displayed in  the
              alignments on the coordinate number line.

       -w # line width for similarity score, sequence alignment, output.

       -W # context length (default is 1/2 of line width -w) for alignment,
              like fasta and ssearch, that provide additional sequence context.

       -X extended options.  Less used options. Other options include
              -XB, -XM4G, -Xo, -Xx, and -Xy; see fasta_guide.pdf.

       -z 1, 2, 3, 4, 5, 6
              Specify  the  statistical  calculation.  Default is -z 1 for local similarity searches, which uses
              regression against the length of the library sequence. -z -1 disables statistics.  -z 0  estimates
              significance  without  normalizing for sequence length. -z 2 provides maximum likelihood estimates
              for lambda and K, censoring the 250 lowest and 250 highest scores. -z 3 uses Altschul  and  Gish's
              statistical  estimates  for specific protein BLOSUM scoring matrices and gap penalties. -z 4,5: an
              alternate regression method.  -z 6 uses a composition based maximum likelihood estimate  based  on
              the method of Mott (1992) Bull. Math. Biol. 54:59-75.

       -z 11,12,14,15,16
              compute the regression against scores of randomly shuffled copies of the library sequences.  Twice
              as many comparisons are performed, but accurate estimates  can  be  generated  from  databases  of
              related sequences. -z 11 uses the -z 1 regression strategy, etc.

       -z 21, 22, 24, 25, 26
              compute  two E()-values.  The standard (library-based) E()-value is calculated in the standard way
              (-z 1, 2, etc), but a second E2() value is calculated  by  shuffling  the  high-scoring  sequences
              (those  with  E()-values  less than the threshold).  For "average" composition proteins, these two
              estimates will be similar (though the best-shuffle estimates are always more  conservative).   For
              biased  composition  proteins,  the  two  estimates  may  differ by 100-fold or more.  A second -z
              option, e.g. -z "21 2", specifies the estimation method for the  best-shuffle  E2()-values.  Best-
              shuffle E2()-values approximate the estimates given by PRSS (or in a pairwise SSEARCH).

       -Z db_size
              Set  the  apparent database size used for expectation value calculations (used for protein/protein
              FASTA and SSEARCH, and for [T]FASTX/Y).

Reading sequences from STDIN

       The FASTA programs can accept a query sequence from the unix "stdin" data stream.   This  makes  it  much
       easier  to use fasta36 and its relatives as part of a WWW page. To indicate that stdin is to be used, use
       "@" as the query sequence file name.  "@" can also be used to specify a subset of the query  sequence  to
       be used, e.g:

     cat query.aa | fasta36 @:50-150 s

       would  search  the  's' database with residues 50-150 of query.aa.  FASTA cannot automatically detect the
       sequence type (protein vs DNA) when "stdin" is used and assumes protein comparisons by default; the  '-n'
       option is required for DNA for STDIN queries.

Environment variables:

       FASTLIBS
              location of library choice file (-l FASTLIBS)

       SRCH_URL1, SRCH_URL2
              format strings used to define options to re-search the database.

       REF_URL
              the  format  string  used  to  define the option to lookup the library sequence in entrez, or some
              other database.

AUTHOR

       Bill Pearson
       wrp@virginia.EDU

       Version: $ Id: $ Revision: $Revision: 210 $

                                                           fasta36/ssearch36/[t]fast[x,y]36/lalign36    1(local)