Provided by: biosquid_1.9g+cvs20050121-5_amd64

NAME

       shuffle - randomize the sequences in a sequence file

SYNOPSIS

       shuffle [options] seqfile

DESCRIPTION

       shuffle  reads  a sequence file seqfile, randomizes each sequence, and prints the randomized sequences in
       FASTA format on standard output. The sequence names are unchanged; this allows  you  to  track  down  the
       source of each randomized sequence if necessary.

       The  default  is  to  simply  shuffle  each input sequence, preserving monosymbol composition exactly. To
       shuffle each sequence while preserving both its monosymbol and disymbol composition exactly, use  the  -d
       option.
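
       As an illustration of what "preserving monosymbol composition exactly" means, here is a minimal
       Python sketch of a plain shuffle. It is not the code shuffle itself uses; the function name
       shuffle_sequence and the example sequence are made up for this illustration.

```python
import random
from collections import Counter


def shuffle_sequence(seq, rng=None):
    """Return a random permutation of the symbols in seq."""
    if rng is None:
        rng = random.Random()
    symbols = list(seq)
    rng.shuffle(symbols)  # in-place uniform shuffle of the symbol list
    return "".join(symbols)


original = "ACGTTGCAACGGT"  # hypothetical input sequence
shuffled = shuffle_sequence(original, random.Random(42))

# A permutation keeps every symbol, so the symbol counts match exactly.
assert Counter(shuffled) == Counter(original)
```

       Because the output is a permutation of the input, the symbol counts are identical by
       construction. Disymbol counts are generally not preserved by this kind of shuffle.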

       The  -0  and  -1  options  allow  you to generate sequences with the same Markov properties as each input
       sequence. With -0, for each input sequence, 0th  order  Markov  statistics  are  collected  (e.g.  symbol
       composition), and a new sequence is generated with the same composition.  With -1, the generated sequence
       has the same 1st order Markov properties as the input sequence (e.g.  the same disymbol frequencies).
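
       The two Markov modes described above can be sketched roughly as follows. This is an
       illustrative Python sketch under stated assumptions, not the implementation in shuffle: the
       names markov0 and markov1 are hypothetical, and the fallback used when a symbol has no
       observed successor is an assumption made only to keep the example well defined.

```python
import random
from collections import Counter, defaultdict


def markov0(seq, rng):
    """Draw len(seq) symbols independently from the symbol frequencies of seq."""
    counts = Counter(seq)
    symbols = list(counts)
    weights = [counts[s] for s in symbols]
    return "".join(rng.choices(symbols, weights=weights, k=len(seq)))


def markov1(seq, rng):
    """Generate a sequence from the first-order transition counts of seq.

    The output starts with the first symbol of seq, as documented for -1.
    """
    if not seq:
        return ""
    transitions = defaultdict(Counter)
    for a, b in zip(seq, seq[1:]):
        transitions[a][b] += 1
    overall = Counter(seq)
    out = [seq[0]]
    while len(out) < len(seq):
        table = transitions.get(out[-1])
        # Assumption: a symbol that is never followed by anything in the input
        # (it occurs only at the very end) falls back to overall frequencies.
        if not table:
            table = overall
        symbols = list(table)
        weights = [table[s] for s in symbols]
        out.append(rng.choices(symbols, weights=weights)[0])
    return "".join(out)


rng = random.Random(0)
source = "ACGTACGGTTAC"  # hypothetical input sequence
print(markov0(source, rng))
print(markov1(source, rng))
```

       Both functions sample symbols rather than permuting them, so the composition of any single
       output can differ from that of the input.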

       Note that the default shuffle and -0 are similar, as are -d and -1. The difference is that the
       shuffling algorithms preserve composition exactly, while the Markov algorithms generate a sequence
       whose composition matches that of the input only on average.

       Other shuffling algorithms are also available, as documented below in the options.

OPTIONS

       -0     Calculate 0th order Markov frequencies of each input sequence (e.g. residue composition); generate
              output sequence using the same 0th order Markov frequencies.

       -1     Calculate  1st  order  Markov  frequencies  for  each input sequence (e.g. diresidue composition);
              generate output sequence using the same 1st order Markov frequencies.  The first  residue  of  the
              output sequence is always the same as the first residue of the input sequence.

       -d     Shuffle the input sequence while preserving both monosymbol and disymbol composition exactly. Uses
              an algorithm published by  S.F. Altschul and B.W. Erickson, Mol. Biol. Evol. 2:526-538, 1985.

       -h     Print brief help; includes version number and summary of all options, including expert options.

       -l     Look only at the length of each input sequence; generate an i.i.d. output protein sequence of that
              length, using monoresidue frequencies typical of proteins (taken from Swissprot 35).

       -n <n> Make  <n>  different  randomizations of each input sequence in seqfile, rather than the default of
              one.

       -r     Generate the output sequence by reversing the input sequence. (Therefore only one  "randomization"
              per input sequence is possible, so it's not worth using -n if you use reversal.)

       -t <n> Truncate  each  input sequence to a fixed length of exactly <n> residues. If the input sequence is
              shorter than <n> it is discarded (therefore the output file may contain fewer sequences  than  the
              input  file).   If  the  input  sequence  is  longer than <n> a contiguous subsequence is randomly
              chosen.

       -w <n> Regionally shuffle each input sequence in windows of <n> residues, preserving local residue
              composition in each window. Probably a better shuffling algorithm for biosequences with
              nonstationary residue composition (e.g. composition that varies along the sequence, such as
              between different isochores in human genome sequence).

       -B     (Babelfish). Autodetect and read a sequence file format other than the default (FASTA). Almost any
              common sequence file format is recognized (including  Genbank,  EMBL,  SWISS-PROT,  PIR,  and  GCG
              unaligned  sequence  formats,  and  Stockholm,  GCG  MSF,  and Clustal alignment formats). See the
              printed documentation for a complete list of supported formats.
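
       The behavior described for -t and -w can be sketched in Python as follows. This is a rough
       illustration, not the code shuffle uses. The function names truncate and regional_shuffle are
       hypothetical, and the sketch assumes that -w works on consecutive, non-overlapping windows,
       which the text above does not state explicitly.

```python
import random


def truncate(seq, n, rng):
    """Return a random contiguous subsequence of exactly n symbols.

    Returns None when seq is shorter than n, mirroring the documented
    behavior of discarding such sequences.
    """
    if len(seq) < n:
        return None
    start = rng.randrange(len(seq) - n + 1)
    return seq[start:start + n]


def regional_shuffle(seq, window, rng):
    """Shuffle seq separately within each consecutive window of the given size.

    Assumption: windows do not overlap; the last one may be shorter.
    """
    if window < 1:
        raise ValueError("window must be at least 1")
    pieces = []
    for start in range(0, len(seq), window):
        chunk = list(seq[start:start + window])
        rng.shuffle(chunk)
        pieces.append("".join(chunk))
    return "".join(pieces)


rng = random.Random(3)
# Each window here holds a single symbol type, so the output equals the input.
assert regional_shuffle("AAAAACCCCCGGGGG", 5, rng) == "AAAAACCCCCGGGGG"
```

       Shuffling within windows keeps each region's composition intact, so a sequence whose
       composition drifts along its length keeps that drift in the output.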

EXPERT OPTIONS

       --informat <s>
              Specify that the sequence file is in format <s>, rather than the  default  FASTA  format.   Common
              examples  include  Genbank,  EMBL,  GCG,  PIR, Stockholm, Clustal, MSF, or PHYLIP; see the printed
              documentation for a complete list of accepted format names.  This  option  overrides  the  default
              expected format (FASTA) and the -B Babelfish autodetection option.

       --nodesc
              Do not output any sequence description in the output file, only the sequence names.

       --seed <s>
              Set the random number seed to <s>.  If you want reproducible results, use the same seed each time.
              By default, shuffle uses a different seed each time, so it does not generate the same output
              in subsequent runs with the same input.
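
       The effect of a fixed seed can be illustrated with a short Python sketch. It uses Python's own
       random number generator, not the one in shuffle, and the function name shuffled is
       hypothetical; it is meant only to show why a fixed seed gives repeatable output.

```python
import random


def shuffled(seq, seed=None):
    """Shuffle seq with a generator seeded by seed (None lets Python pick)."""
    rng = random.Random(seed)
    symbols = list(seq)
    rng.shuffle(symbols)
    return "".join(symbols)


# The same seed and the same input give the same output on every run.
assert shuffled("ACGTACGTAC", seed=7) == shuffled("ACGTACGTAC", seed=7)
```

       With seed left unset, the generator is initialized differently on each run, so repeated calls
       are not expected to agree.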

SEE ALSO

       afetch(1),  alistat(1),  compalign(1),  compstruct(1),  revcomp(1),  seqsplit(1),  seqstat(1), sfetch(1),
       sindex(1), sreformat(1), stranslate(1), weight(1).

AUTHOR

       Biosquid and its documentation are Copyright (C) 1992-2003 HHMI/Washington University School of Medicine.
       Freely distributed under the GNU General Public License (GPL). See COPYING in the source code
       distribution for more details, or contact me.

       Sean Eddy
       HHMI/Department of Genetics
       Washington University School of Medicine
       4444 Forest Park Blvd., Box 8510
       St Louis, MO 63108 USA
       Phone: 1-314-362-7666
       FAX  : 1-314-362-2157
       Email: eddy@genetics.wustl.edu