Provided by: fasta3_36.3.8h-2_amd64

NAME

       prss - test a protein sequence similarity for significance

SYNOPSIS

       prss34  [-Q  -A  -f  #  -g  #  -H -O file -s SMATRIX -w # -Z # -k # -v # ] sequence-file-1
       sequence-file-2 [ #-of-shuffles ]

       prfx34 [-Q -A -f # -g # -H -O file -s SMATRIX -w # -z 1,3 -Z # -k # -v # ] sequence-file-1
       sequence-file-2 [ ktup ] [ #-of-shuffles ]

       prss34(_t)/prfx34(_t) [-AfghksvwzZ] - interactive mode

DESCRIPTION

       prss34 and prfx34 are used to evaluate the significance of a protein:protein, DNA:DNA
       (prss34), or translated-DNA:protein (prfx34) sequence similarity score by comparing two
       sequences  and  calculating  optimal  similarity scores, and then repeatedly shuffling the
       second sequence, and  calculating  optimal  similarity  scores  using  the  Smith-Waterman
       algorithm. An extreme value distribution is then fit to the shuffled-sequence scores.  The
       characteristic parameters of the extreme value distribution are then used to estimate  the
       probability that each of the unshuffled sequence scores would be obtained by chance in one
       sequence, or in a number of sequences equal to the number of shuffles.   This  program  is
       derived  from rdf2, described by Pearson and Lipman, PNAS (1988) 85:2444-2448, and Pearson
       (Meth. Enz.  183:63-98).  Use  of  the  extreme  value  distribution  for  estimating  the
       probabilities  of  similarity  scores  was  described  by  Karlin and Altschul, PNAS (1990)
       87:2264-2268.  The and expectations calculated by prdf.  prss34 calculates optimal  scores
       using the same rigorous Smith-Waterman algorithm (Smith and Waterman, J. Mol. Biol. (1981)
       147:195-197) used by the ssearch34 program.  prfx34  calculates  scores  using  the  FASTX
       algorithm (Pearson et al. (1997) Genomics 46:24-36).
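
       The shuffle-and-fit procedure described above can be sketched in a few lines of Python.
       This is only an illustration, not the prss34 implementation: the function names are
       hypothetical, the scoring is a plain +5/-4 match/mismatch with a linear gap penalty
       rather than BLOSUM50 with separate open and extend penalties, and the extreme value
       distribution is fit by the method of moments, which may differ from the estimates
       prss34 actually uses.

```python
import math
import random


def sw_score(a, b, match=5, mismatch=-4, gap=-10):
    """Best local alignment score (Smith-Waterman, simplified linear gap penalty)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for ca in a:
        cur = [0]
        for j, cb in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ca == cb else mismatch)
            cell = max(0, diag, prev[j] + gap, cur[j - 1] + gap)
            cur.append(cell)
            best = max(best, cell)
        prev = cur
    return best


def fit_gumbel(scores):
    """Method-of-moments fit of an extreme value (Gumbel) distribution."""
    n = len(scores)
    mean = sum(scores) / n
    var = sum((s - mean) ** 2 for s in scores) / (n - 1)
    beta = math.sqrt(6.0 * var) / math.pi   # scale parameter
    mu = mean - 0.5772156649 * beta         # location (Euler-Mascheroni constant)
    return mu, beta


def shuffle_significance(seq1, seq2, shuffles=200, seed=0):
    """Score seq1 vs seq2, then vs shuffled copies of seq2, and estimate P and E."""
    rng = random.Random(seed)
    observed = sw_score(seq1, seq2)
    residues = list(seq2)
    shuffled = []
    for _ in range(shuffles):
        rng.shuffle(residues)               # uniform shuffle of the second sequence
        shuffled.append(sw_score(seq1, "".join(residues)))
    mu, beta = fit_gumbel(shuffled)
    if beta == 0:
        # Degenerate case: every shuffled score was identical.
        p = 1.0 if observed <= mu else 0.0
    else:
        z = max((observed - mu) / beta, -700.0)   # clamp to avoid overflow in exp
        p = -math.expm1(-math.exp(-z))            # P(score >= observed) for one sequence
    return {"observed": observed, "p": p, "expect": shuffles * p}
```

       Here "p" is the estimated probability of the unshuffled score in one sequence and
       "expect" is the expected number of such scores in as many sequences as there were
       shuffles, for example shuffle_significance(seq1, seq2, shuffles=200).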

       prss34  and  prfx34  also  allow  a  more  sophisticated shuffling method: residues can be
       shuffled within a local window, so that  the  order  of  residues  1-10,  11-20,  etc,  is
       destroyed  but a residue in the first 10 is never swapped with a residue outside the first
       ten, and so on for each local window.
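
       A minimal Python sketch of the local window shuffle just described, assuming
       consecutive, non-overlapping windows starting at the first residue. The function name
       and parameters are hypothetical and are not taken from the prss34 source.

```python
import random


def window_shuffle(seq, window=10, rng=None):
    """Shuffle residues within consecutive windows; residues never leave their window."""
    if rng is None:
        rng = random.Random()
    out = []
    for start in range(0, len(seq), window):
        block = list(seq[start:start + window])   # the last block may be shorter
        rng.shuffle(block)
        out.extend(block)
    return "".join(out)
```

       Because each block is shuffled independently, the local composition of the sequence is
       preserved while the residue order inside each window is destroyed.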

EXAMPLES

       (1)    prss34  -v 10 musplfm.aa lcbo.aa

       Compare the amino acid sequence in the file musplfm.aa with that in lcbo.aa, then  shuffle
       lcbo.aa  200  times using a local shuffle with a window of 10.  Report the significance of
       the unshuffled musplfm/lcbo comparison scores with respect to the shuffled scores.

       (2)    prss34 musplfm.aa lcbo.aa 1000

       Compare the amino acid sequence in the file musplfm.aa with the sequence in the file
       lcbo.aa, shuffling lcbo.aa 1000 times.  The number of shuffles can also be specified with
       the -k # option.

       (3)    prfx34 mgstm1.esq xurt8c.aa 2 1000

       Translate the DNA sequence in the mgstm1.esq file in all six frames and compare it to  the
       amino  acid  sequence  in  the  file  xurt8c.aa, using ktup=2 and shuffling xurt8c.aa 1000
       times.  Each comparison considers the best forward or reverse alignment with  frameshifts,
       using the fastx algorithm (Pearson et al (1997) Genomics 46:24-36).

       (4)    prss34/prfx34

       Run prss34 or prfx34 in interactive mode.  The program will prompt for the names of the
       two sequence files and the number of shuffles to be used.

OPTIONS

       prss34/prfx34 can be directed to change the scoring matrix,  gap  penalties,  and  shuffle
       parameters  by  entering  options  on  the  command  line (preceded by a `-'). All of the
       options should precede the file names and the number of shuffles.

       -A     Show unshuffled alignment.

       -f #   Penalty for opening a gap (-10 by default for proteins).

       -g #   Penalty for additional residues in a gap (-2 by default for proteins).

       -H     Do not display histogram of similarity scores.

       -k #   Number of shuffles (200 is the default).

       -Q -q  "quiet" - do not prompt for filename.

       -O filename
              Send a copy of the results to "filename".

       -s str Specify the scoring matrix.  BLOSUM50 is used by default  for  proteins;  +5/-4  is
              used  by  default for DNA.  prss34 recognizes the same scoring matrices as fasta34,
              ssearch34, fastx34, etc; e.g.  BL50,  P250,  BL62,  BL80,  MD10,  MD20,  and  other
              matrices in BLAST1.4 matrix format.

       -v #   Use a local window shuffle with a window size of #.

       -z #   Calculate  statistical significance using the mean/variance (moments) approach used
              by fasta34/ssearch or from maximum likelihood estimates of lambda and K.

       -Z #   Present statistical significance as if a '#' entry database had been searched (e.g.
              "-Z  50000"  presents  statistical  significance  as  if  50,000 sequences had been
              compared).
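
              As a rough illustration of this kind of scaling, the sketch below converts a
              single-comparison probability into an expectation for a database of a given
              size. It assumes the common convention that the expectation is the probability
              multiplied by the number of entries; the function name is hypothetical and the
              exact calculation in prss34 may differ.

```python
import math


def scale_to_database(p_single, db_entries):
    """Return (expected count, probability of at least one) over db_entries comparisons."""
    expect = db_entries * p_single     # expected number of scores this high by chance
    p_any = -math.expm1(-expect)       # Poisson approximation: 1 - exp(-expect)
    return expect, p_any
```

              For example, scale_to_database(p, 50000) corresponds to presenting a
              probability p as if 50,000 sequences had been compared.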

ENVIRONMENT VARIABLES

       (SMATRIX) the filename of an alternative scoring  matrix  file.   For  protein  sequences,
       BLOSUM50  is  used  by default; PAM250 can be used with the command line option -s P250 (or
       with -s pam250.mat).  BLOSUM62 (-s BL62) and PAM120 (-s P120) can be selected the same way.

SEE ALSO

       ssearch3(1), fasta3(1).

AUTHOR

       Bill Pearson
       wrp@virginia.EDU

                                              local                                      PRSS3(1)