Provided by: biosquid_1.9g+cvs20050121-5_amd64

NAME

       shuffle - randomize the sequences in a sequence file

SYNOPSIS

       shuffle [options] seqfile

DESCRIPTION

       shuffle reads a sequence file seqfile, randomizes each sequence, and prints the randomized
       sequences in FASTA format on standard output.  The  sequence  names  are  unchanged;  this
       allows you to track down the source of each randomized sequence if necessary.

       The  default  is  to simply shuffle each input sequence, preserving monosymbol composition
       exactly. To shuffle each sequence  while  preserving  both  its  monosymbol  and  disymbol
       composition exactly, use the -d option.
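
       The difference between the default shuffle and a doublet-preserving shuffle can be
       sketched in Python. This is an illustrative sketch only, not the squid source: the
       function names (mono_shuffle, di_shuffle, doublets) are hypothetical, and di_shuffle
       is a simplified random Eulerian walk in the spirit of the Altschul and Erickson
       approach used by -d.

```python
import random
from collections import Counter


def mono_shuffle(seq, rng):
    """Plain shuffle: permute the symbols, so single-symbol counts are unchanged."""
    symbols = list(seq)
    rng.shuffle(symbols)
    return "".join(symbols)


def _reaches(v, target, exit_of):
    """Follow the chosen exits from v; report whether the walk arrives at target."""
    seen = set()
    while v != target and v not in seen:
        seen.add(v)
        v = exit_of[v]
    return v == target


def di_shuffle(seq, rng):
    """Doublet-preserving shuffle via a random Eulerian walk (illustrative only)."""
    if len(seq) < 3:
        return seq
    last = seq[-1]
    # Successor list for every symbol that is followed by something.
    succ = {}
    for a, b in zip(seq, seq[1:]):
        succ.setdefault(a, []).append(b)
    # Choose one "final exit" per symbol (except the last symbol of the sequence),
    # retrying until following those exits from any symbol leads to `last`.
    while True:
        exit_of = {v: rng.choice(nxt) for v, nxt in succ.items() if v != last}
        if all(_reaches(v, last, exit_of) for v in exit_of):
            break
    # Shuffle the remaining successors and place the chosen exit at the end.
    for v, nxt in succ.items():
        if v in exit_of:
            nxt.remove(exit_of[v])
            rng.shuffle(nxt)
            nxt.append(exit_of[v])
        else:
            rng.shuffle(nxt)
    # Walk from the first symbol, consuming each symbol's successors in order.
    used = Counter()
    out = [seq[0]]
    for _ in range(len(seq) - 1):
        cur = out[-1]
        out.append(succ[cur][used[cur]])
        used[cur] += 1
    return "".join(out)


def doublets(seq):
    """Count adjacent symbol pairs (disymbol composition)."""
    return Counter(zip(seq, seq[1:]))
```

       For any input s, Counter(mono_shuffle(s, rng)) equals Counter(s), but its doublet
       counts generally differ from those of s; doublets(di_shuffle(s, rng)) equals
       doublets(s), and the first and last symbols stay in place.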

       The -0 and -1 options generate new sequences with the same Markov properties as each
       input sequence. With -0, 0th order Markov statistics (i.e. symbol composition) are
       collected for each input sequence, and a new sequence is generated from those
       frequencies. With -1, the generated sequence has the same 1st order Markov properties
       as the input sequence (i.e. the same expected disymbol frequencies).

       Note that the default and -0 are similar, as are -d and -1. The difference is that the
       shuffling algorithms preserve composition exactly, while the Markov algorithms generate
       sequences whose composition matches the input only on average.
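
       A minimal Python sketch of Markov-style generation may make the contrast with
       shuffling concrete. The function names (markov0, markov1) are hypothetical and this is
       not the squid implementation. In particular, the fallback to overall composition when
       a symbol has no observed successor is an assumption made only for this sketch.

```python
import random
from collections import Counter, defaultdict


def markov0(seq, rng):
    """Draw len(seq) symbols independently from the observed symbol frequencies."""
    if not seq:
        return ""
    counts = Counter(seq)
    symbols = list(counts)
    weights = [counts[s] for s in symbols]
    return "".join(rng.choices(symbols, weights=weights, k=len(seq)))


def markov1(seq, rng):
    """Generate a sequence from the observed first-order transition frequencies."""
    if not seq:
        return ""
    trans = defaultdict(Counter)
    for a, b in zip(seq, seq[1:]):
        trans[a][b] += 1
    overall = Counter(seq)
    out = [seq[0]]  # -1 is documented to keep the first residue of the input
    while len(out) < len(seq):
        # Assumption: if the current symbol was never followed by anything in
        # the input, fall back to the overall symbol composition.
        counts = trans.get(out[-1]) or overall
        symbols = list(counts)
        weights = [counts[s] for s in symbols]
        out.append(rng.choices(symbols, weights=weights)[0])
    return "".join(out)
```

       Because each symbol is drawn at random, the output of either function has the input's
       composition only in expectation; individual outputs usually differ in their exact
       counts.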

       Other shuffling algorithms are also available, as documented below in the options.

OPTIONS

       -0     Calculate 0th order Markov frequencies of each input sequence (i.e. residue
              composition); generate output sequence using the same 0th order Markov frequencies.

       -1     Calculate 1st order Markov frequencies for each input sequence (i.e. diresidue
              composition); generate output sequence using the same 1st order Markov frequencies.
              The first residue of the output sequence is always the same as the first residue of
              the input sequence.

       -d     Shuffle   the   input  sequence  while  preserving  both  monosymbol  and  disymbol
              composition exactly. Uses  an  algorithm  published  by   S.F.  Altschul  and  B.W.
              Erickson, Mol. Biol. Evol. 2:526-538, 1985.

       -h     Print  brief  help;  includes  version number and summary of all options, including
              expert options.

       -l     Look only at the length of each input sequence; generate an i.i.d.  output  protein
              sequence  of  that length, using monoresidue frequencies typical of proteins (taken
              from Swissprot 35).

       -n <n> Make <n> different randomizations of each input sequence in  seqfile,  rather  than
              the default of one.

       -r     Generate  the  output sequence by reversing the input sequence. (Therefore only one
              "randomization" per input sequence is possible, so it's not worth using -n  if  you
              use reversal.)

       -t <n> Truncate  each  input  sequence  to  a fixed length of exactly <n> residues. If the
              input sequence is shorter than <n> it is discarded (therefore the output  file  may
              contain fewer sequences than the input file).  If the input sequence is longer than
              <n> a contiguous subsequence is randomly chosen.

       -w <n> Regionally shuffle each input sequence in windows of <n> residues, preserving
              local residue composition in each window. This is probably a better shuffling
              algorithm for biosequences with nonstationary residue composition (i.e.
              composition that varies along the sequence, such as between different isochores
              in human genome sequence).

       -B     (Babelfish). Autodetect and read a sequence file  format  other  than  the  default
              (FASTA).  Almost  any common sequence file format is recognized (including Genbank,
              EMBL, SWISS-PROT, PIR, and GCG unaligned sequence formats, and Stockholm, GCG  MSF,
              and  Clustal  alignment formats). See the printed documentation for a complete list
              of supported formats.
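
       The -w and -t options described above can be illustrated with a short Python sketch.
       The function names (window_shuffle, truncate) are hypothetical. The use of consecutive
       non-overlapping windows starting at the first residue, with a shorter final window,
       and a uniformly chosen start position for truncation are assumptions of this sketch;
       the actual program may place windows or choose subsequences differently.

```python
import random


def window_shuffle(seq, w, rng):
    """Shuffle each consecutive block of w symbols independently."""
    if w < 1:
        raise ValueError("window size must be at least 1")
    out = []
    for start in range(0, len(seq), w):
        block = list(seq[start:start + w])
        rng.shuffle(block)  # composition within the block is unchanged
        out.extend(block)
    return "".join(out)


def truncate(seq, n, rng):
    """Return a random contiguous subsequence of length n, or None if seq is too short."""
    if len(seq) < n:
        return None  # the caller would discard this sequence
    start = rng.randint(0, len(seq) - n)  # inclusive on both ends
    return seq[start:start + n]
```

       With a window of 4, for example, every block of 4 residues in the output contains the
       same residues as the corresponding block of the input, only in a different order.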

EXPERT OPTIONS

       --informat <s>
              Specify that the sequence file is in format <s>,  rather  than  the  default  FASTA
              format.   Common examples include Genbank, EMBL, GCG, PIR, Stockholm, Clustal, MSF,
              or PHYLIP; see the printed documentation for a complete  list  of  accepted  format
              names.   This  option  overrides  the  default  expected  format (FASTA) and the -B
              Babelfish autodetection option.

       --nodesc
              Do not output any sequence description in the output file, only the sequence names.

       --seed <s>
              Set the random number seed to <s>.  If you want reproducible results, use the same
              seed each time.  By default, shuffle uses a different seed each time, so it does
              not generate the same output in subsequent runs with the same input.
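
       The effect of a fixed seed can be sketched in Python. Here Python's random module
       stands in for shuffle's own random number generator, which is an assumption of this
       sketch; the function name (shuffled) is hypothetical.

```python
import random


def shuffled(seq, seed=None):
    """Shuffle seq with a generator seeded by `seed` (None means an arbitrary seed)."""
    rng = random.Random(seed)
    symbols = list(seq)
    rng.shuffle(symbols)
    return "".join(symbols)
```

       Two calls such as shuffled(s, 42) return the same string, while calls with seed=None
       are free to differ from run to run.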

SEE ALSO

       afetch(1), alistat(1), compalign(1), compstruct(1), revcomp(1),  seqsplit(1),  seqstat(1),
       sfetch(1), sindex(1), sreformat(1), stranslate(1), weight(1).

AUTHOR

       Biosquid and its documentation are Copyright (C) 1992-2003 HHMI/Washington University
       School of Medicine. Freely distributed under the GNU General Public License (GPL). See
       COPYING in the source code distribution for more details, or contact me.

       Sean Eddy
       HHMI/Department of Genetics
       Washington University School of Medicine
       4444 Forest Park Blvd., Box 8510
       St Louis, MO 63108 USA
       Phone: 1-314-362-7666
       FAX  : 1-314-362-2157
       Email: eddy@genetics.wustl.edu