Provided by: fasta3_36.3.8h.2020-02-11-4_amd64

NAME

       prss - test a protein sequence similarity for significance

SYNOPSIS

       prss34  [-Q  -A  -f  # -g # -H -O file -s SMATRIX -w # -Z # -k # -v # ] sequence-file-1 sequence-file-2 [
       #-of-shuffles ]

       prfx34 [-Q -A -f # -g # -H -O file -s SMATRIX -w # -z 1,3 -Z # -k #  -v  #  ]  sequence-file-1  sequence-
       file-2 [ ktup ] [ #-of-shuffles ]

       prss34(_t)/prfx34(_t) [-AfghksvwzZ] - interactive mode

DESCRIPTION

       prss34  and  prfx34  are  used  to evaluate the significance of a protein:protein, DNA:DNA ( prss34 ), or
       translated-DNA:protein ( prfx34 ) sequence similarity score by comparing two  sequences  and  calculating
       optimal  similarity  scores,  and  then repeatedly shuffling the second sequence, and calculating optimal
       similarity scores using the Smith-Waterman algorithm. An extreme value distribution is then  fit  to  the
       shuffled-sequence  scores.  The characteristic parameters of the extreme value distribution are then used
       to estimate the probability that each of the unshuffled sequence scores would be obtained  by  chance  in
       one  sequence, or in a number of sequences equal to the number of shuffles.  This program is derived from
       rdf2, described by Pearson and Lipman, PNAS (1988) 85:2444-2448, and  Pearson  (Meth.  Enz.   183:63-98).
       Use of the extreme value distribution for estimating the probabilities of similarity scores was described
       by Karlin and Altschul, PNAS (1990) 87:2264-2268. The 'z-values' calculated by rdf2 are not as
       informative as the P-values and expectations calculated by prdf. prss34
       calculates  optimal  scores using the same rigorous Smith-Waterman algorithm (Smith and Waterman, J. Mol.
       Biol. (1981) 147:195-197) used by the ssearch34 program. prfx34 calculates scores using the FASTX
       algorithm (Pearson et al. (1997) Genomics 46:24-36).
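
       The shuffle-and-fit procedure described above can be sketched in Python. This is an illustration
       only, not the prss34 source: the function names, the +5/-4 match/mismatch scores, the linear gap
       penalty of -6, and the method-of-moments fit are all assumptions chosen to keep the sketch short.

```python
import math
import random
import statistics

EULER_GAMMA = 0.5772156649


def smith_waterman_score(a, b, match=5, mismatch=-4, gap=-6):
    """Best local alignment score; simplified to a linear gap penalty."""
    prev = [0] * (len(b) + 1)
    best = 0
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ca == cb else mismatch)
            score = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best


def fit_gumbel_by_moments(scores):
    """Estimate extreme value location (mu) and scale (beta) from mean and stdev."""
    beta = statistics.stdev(scores) * math.sqrt(6) / math.pi
    if beta == 0:
        raise ValueError("shuffled scores have zero variance; cannot fit")
    mu = statistics.mean(scores) - EULER_GAMMA * beta
    return mu, beta


def shuffle_significance(seq1, seq2, n_shuffles=200, seed=0):
    """Return (unshuffled score, P in one sequence, P in n_shuffles sequences)."""
    rng = random.Random(seed)
    observed = smith_waterman_score(seq1, seq2)
    residues = list(seq2)
    shuffled_scores = []
    for _ in range(n_shuffles):
        rng.shuffle(residues)  # uniform shuffle of the second sequence
        shuffled_scores.append(smith_waterman_score(seq1, "".join(residues)))
    mu, beta = fit_gumbel_by_moments(shuffled_scores)
    z = (observed - mu) / beta
    # P(S >= observed) = 1 - exp(-exp(-z)); expm1 keeps small values accurate.
    p_one = -math.expm1(-math.exp(min(-z, 700.0)))
    if p_one >= 1.0:
        p_many = 1.0
    else:
        # Probability of at least one such score among n_shuffles sequences.
        p_many = -math.expm1(n_shuffles * math.log1p(-p_one))
    return observed, p_one, p_many
```

       With identical input sequences the unshuffled score should lie far above the shuffled scores, so
       both probabilities should be very small.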

       prss34  and  prfx34  also  allow a more sophisticated shuffling method: residues can be shuffled within a
       local window, so that the order of residues 1-10, 11-20, etc, is destroyed but a residue in the first  10
       is never swapped with a residue outside the first ten, and so on for each local window.
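
       A minimal sketch of the local window shuffle, assuming consecutive non-overlapping windows as in the
       description above. The function name window_shuffle and its parameters are hypothetical and are not
       taken from the prss34 source.

```python
import random


def window_shuffle(sequence, window, rng=None):
    """Shuffle residues only within consecutive windows of the given size."""
    if window < 1:
        raise ValueError("window must be a positive integer")
    if rng is None:
        rng = random.Random()
    shuffled = []
    for start in range(0, len(sequence), window):
        block = list(sequence[start:start + window])
        rng.shuffle(block)  # residues never leave their own window
        shuffled.extend(block)
    return "".join(shuffled)
```

       Because residues stay inside their window, local composition is preserved while residue order
       within each window is destroyed.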

EXAMPLES

       (1)    prss34  -v 10 musplfm.aa lcbo.aa

       Compare  the  amino  acid  sequence in the file musplfm.aa with that in lcbo.aa, then shuffle lcbo.aa 200
       times using a local shuffle with a window of 10.  Report the significance of the unshuffled  musplfm/lcbo
       comparison scores with respect to the shuffled scores.

       (2)    prss34 musplfm.aa lcbo.aa 1000

       Compare  the amino acid sequence in the file musplfm.aa with the sequences in the file lcbo.aa, shuffling
       lcbo.aa 1000 times.  Shuffles can also be specified with the -k # option.

       (3)    prfx34 mgstm1.esq xurt8c.aa 2 1000

       Translate the DNA sequence in the mgstm1.esq file in all six frames and compare  it  to  the  amino  acid
       sequence  in  the  file  xurt8c.aa,  using  ktup=2  and  shuffling xurt8c.aa 1000 times.  Each comparison
       considers the best forward or reverse alignment with frameshifts, using the FASTX algorithm (Pearson et
       al. (1997) Genomics 46:24-36).

       (4)    prss34/prfx34

       Run prss in interactive mode. The program will prompt for the file names of the two query sequence
       files and the number of shuffles to be used.

OPTIONS

       prss34/prfx34 can be directed to change the scoring matrix, gap penalties, and shuffle parameters by
       entering options on the command line (preceded by a `-'). All of the options should precede the file
       names and number of shuffles.

       -A     Show unshuffled alignment.

       -f #   Penalty for opening a gap (-10 by default for proteins).

       -g #   Penalty for additional residues in a gap (-2 by default for proteins).

       -H     Do not display histogram of similarity scores.

       -k #   Number of shuffles (200 is the default).

       -Q -q  "quiet" - do not prompt for filename.

       -O filename
              send copy of results to "filename."

       -s str specify the scoring matrix.  BLOSUM50 is used by default for proteins; +5/-4 is used by default
              for DNA.  prss34 recognizes the same scoring matrices as fasta34, ssearch34, fastx34, etc.; e.g.
              BL50, P250, BL62, BL80, MD10, MD20, and other matrices in BLAST1.4 matrix format.

       -v #   Use a local window shuffle with a window size of #.

       -z #   Calculate  statistical  significance  using  the  mean/variance   (moments)   approach   used   by
              fasta34/ssearch or from maximum likelihood estimates of lambda and K.

       -Z #   Present  statistical  significance  as  if a '#' entry database had been searched (e.g. "-Z 50000"
              presents statistical significance as if 50,000 sequences had been compared).
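
       The scaling performed by -Z can be sketched as follows, assuming the usual relation in which the
       expectation is the single-comparison probability multiplied by the number of sequences compared. The
       function name is hypothetical, and the exact formula used by prss34 is not stated on this page.

```python
def expectation(p_value, database_size):
    """Expected number of chance scores this good among database_size comparisons."""
    if not 0.0 <= p_value <= 1.0:
        raise ValueError("p_value must lie in [0, 1]")
    if database_size < 1:
        raise ValueError("database_size must be at least 1")
    return p_value * database_size
```

       Under this assumption, the same probability gives a 250-fold larger expectation with "-Z 50000" than
       with the default of 200 shuffles.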

ENVIRONMENT VARIABLES

       (SMATRIX) the filename of an alternative scoring matrix file.  For protein sequences, BLOSUM50 is used by
       default; PAM250 can be used with the command line option -s P250 (or with -s pam250.mat).  BLOSUM62 (-s
       BL62) and PAM120 (-s P120) are also available.

SEE ALSO

       ssearch3(1), fasta3(1).

AUTHOR

       Bill Pearson
       wrp@virginia.EDU

                                                      local                                             PRSS3(1)