Provided by: ncbi-seg_0.0.20000620-3_amd64 bug

NAME

       ncbi-seg - segment sequence(s) by local complexity

SYNOPSIS

       ncbi-seg sequence [ W ] [ K(1) ] [ K(2) ] [ -x ] [ options ]

DESCRIPTION

       ncbi-seg divides sequences into contrasting segments of low-complexity and high-
       complexity.  Low-complexity segments defined by the algorithm represent "simple sequences"
       or "compositionally-biased regions".

       Locally-optimized low-complexity segments are produced at defined levels of stringency,
       based on formal definitions of local compositional complexity (Wootton & Federhen, 1993).
       The segment lengths and the number of segments per sequence are determined automatically
       by the algorithm.

       The input is a FASTA-formatted sequence file, or a database file containing many FASTA-
       formatted  sequences.  ncbi-seg is tuned for amino acid sequences.  For nucleotide
       sequences, see EXAMPLES OF PARAMETER SETS below.

       The stringency of the search for low-complexity segments is determined by three user-
       defined parameters, trigger window length [ W ], trigger complexity [ K(1) ] and extension
       complexity [ K(2)] (see below under PARAMETERS ).  The defaults provided are suitable for
       low-complexity masking of database search query sequences [ -x option required, see
       below].

OUTPUTS AND APPLICATIONS

       (1) Readable segmented sequence [Default].  Regions of contrasting complexity are
       displayed in "tree format".  See EXAMPLES.

       (2) Low-complexity masking (see Altschul et al, 1994).  Produce a masked FASTA-formatted
       file, ready for  input as a query sequence for database search programs such as BLAST or
       FASTA.  The amino acids in low-complexity regions are replaced with "x" characters [-x
       option].  See EXAMPLES.

       (3) Database construction.  Produce FASTA-formatted files containing low-complexity
       segments [-l  option], or high-complexity segments [-h option], or both [-a option].  Each
       segment is a separate sequence entry with an informative header line.

ALGORITHM

       The SEG algorithm has two stages.  First, identification of approximate raw segments of
       low- complexity; second local optimization.

       At the first stage, the stringency and resolution of the search for low-complexity
       segments is determined  by the W, K(1) and K(2) parameters.  All trigger windows are
       defined, including overlapping windows, of length W and complexity less than or equal to
       K(1).  "Complexity" here is defined by equation  (3) of Wootton & Federhen (1993).  Each
       trigger window is then extended into a contig in both directions by merging with extension
       windows, which are overlapping windows of length W and complexity  less than or equal to
       K(2).  Each contig is a raw segment.

       At the second stage, each raw segment is reduced to a single optimal low-complexity
       segment, which  may be the entire raw segment but is usually a subsequence.  The optimal
       subsequence has the lowest  value of the probability P(0) (equation (5) of Wootton &
       Federhen, 1993).

PARAMETERS

       These three numeric parameters are in obligatory order after the sequence file name.

       Trigger window length [ W ].  An integer greater than zero [ Default 12 ].

       Trigger complexity. [ K1 ].  The maximum complexity of a trigger window in units of bits.
       K1 must  be equal to or greater than zero.  The maximum value is 4.322 (log[base 2]20) for
       amino acid sequences [ Default 2.2 ].

       Extension complexity [ K2 ].  The maximum complexity of an extension window in units of
       bits.  Only values greater than K1 are effective in extending triggered windows.  Range of
       possible values is as for K1 [ Default 2.5 ].

OPTIONS

       The following options may be placed in any order in the command line after the W, K1 and
       K2 parameters:

       -a  Output both low-complexity and high-complexity segments in a FASTA-formatted file, as
           a set of  separate entries with header lines.

       -c  [characters-per-line]
           Number of sequence characters per line of output [Default 60].  Other characters, such
           as residue numbers, are additional.

       -h  Output only the high-complexity segments in a FASTA-formatted file, as a set of
           separate entries  with header lines.

       -l  Output only the low-complexity segments in a FASTA-formatted file, as a set of
           separate entries with  header lines.

       -m  [length]
           Minimum length in residues for a high-complexity segment [default 0].  Shorter
           segments are merged with adjacent low-complexity segments.

       -o  Show all overlapping, independently-triggered low-complexity segments [these are
           merged by default].

       -q  Produce an output format with the sequence in a numbered block with markings to assist
           residue counting.  The low-complexity and high-complexity segments are in lower- and
           upper-case characters respectively.

       -t  [length]
           "Maximum trim length" parameter [default 100]. This controls the search space (and
           search time) during the optimization of raw segments (see ALGORITHM above).  By
           default, subsequences 100 or more residues shorter than the raw segment are omitted
           from the search. This parameter may be increased to give a more extensive search if
           raw segments are longer than 100 residues.

       -x  The masking option for amino acid sequences.  Each input sequence is represented by a
           single output sequence in FASTA-format with low-complexity regions replaced by strings
           of "x" characters.

EXAMPLES OF PARAMETER SETS

       Default parameters are given by 'ncbi-seg sequence' (equivalent to 'ncbi-seg sequence 12
       2.2 2.5').  These  parameters are appropriate for low- complexity masking of many amino
       acid sequences [with -x option  ].

   Database-database comparisons:
       More stringent (lower) complexity parameters are suitable when masked sequences are
       compared with masked sequences.  For example, for BLAST or FASTA searches that compare two
       amino acid sequence databases, the following masking may be applied to both databases:

         ncbi-seg database 12 1.8 2.0 -x

   Homopolymer analysis:
       To examine all homopolymeric subsequences of length (for example) 7 or greater:

         ncbi-seg sequence 7 0 0

   Non-globular regions of protein sequences:
       Many long non-globular domains may be diagnosed at longer window lengths, typically:

         ncbi-seg sequence 45 3.4 3.75

       For some shorter non-globular domains, the following set is appropriate:

         ncbi-seg sequence 25 3.0 3.3

   Nucleotide sequences:
       The maximum value of the complexity parameters is 2 (log[base 2]4).  For masking, the
       following is approximately equivalent in effect to the default parameters for amino acid
       sequences:

         ncbi-seg sequence.na 21 1.4 1.6

EXAMPLES

       The following is a file named 'prion' in FASTA format:

        >PRIO_HUMAN MAJOR PRION PROTEIN PRECURSOR
        MANLGCWMLVLFVATWSDLGLCKKRPKPGGWNTGGSRYPGQGSPGGNRYPPQGGGGWGQP
        HGGGWGQPHGGGWGQPHGGGWGQPHGGGWGQGGGTHSQWNKPSKPKTNMKHMAGAAAAGA
        VVGGLGGYMLGSAMSRPIIHFGSDYEDRYYRENMHRYPNQVYYRPMDEYSNQNNFVHDCV
        NITIKQHTVTTTTKGENFTETDVKMMERVVEQMCITQYERESQAYYQRGSSMVLFSSPPV
        ILLISFLIFLIVG

       The command line:

        ncbi-seg /usr/share/doc/ncbi-seg/examples/prion.fa

       gives the standard output below

        >PRIO_HUMAN MAJOR PRION PROTEIN PRECURSOR

                                          1-49   MANLGCWMLVLFVATWSDLGLCKKRPKPGG
                                                 WNTGGSRYPGQGSPGGNRY
        ppqggggwgqphgggwgqphgggwgqphgg   50-94
                       gwgqphgggwgqggg
                                         95-112  THSQWNKPSKPKTNMKHM
               agaaaagavvgglggymlgsams  113-135
                                        136-187  RPIIHFGSDYEDRYYRENMHRYPNQVYYRP
                                                 MDEYSNQNNFVHDCVNITIKQH
                        tvttttkgenftet  188-201
                                        202-236  DVKMMERVVEQMCITQYERESQAYYQRGSS
                                                 MVLFS
                      sppvillisflifliv  237-252
                                        253-253  G

       The low-complexity sequences are on the left (lower case) and high-complexity sequences
       are on the right (upper case).  All sequence segments read from left to right and their
       order in the sequence is from top to bottom, as shown by the central column of residue
       numbers.

       The command line:

         ncbi-seg /usr/share/doc/ncbi-seg/examples/prion.fa -x

       gives the following FASTA-formatted file:-

        >PRIO_HUMAN MAJOR PRION PROTEIN PRECURSOR
        MANLGCWMLVLFVATWSDLGLCKKRPKPGGWNTGGSRYPGQGSPGGNRYxxxxxxxxxxx
        xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxTHSQWNKPSKPKTNMKHMxxxxxxxx
        xxxxxxxxxxxxxxxRPIIHFGSDYEDRYYRENMHRYPNQVYYRPMDEYSNQNNFVHDCV
        NITIKQHxxxxxxxxxxxxxxDVKMMERVVEQMCITQYERESQAYYQRGSSMVLFSxxxx
        xxxxxxxxxxxxG

SEE ALSO

       segn(1), blast(1), saps(1), xnu(1)

AUTHORS

       John Wootton:     wootton@ncbi.nlm.nih.gov

       Scott Federhen:   federhen@ncbi.nlm.nih.gov

        National Center for Biotechnology Information
        Building 38A, Room 8N805
        National Library of Medicine
        National Institutes of Health
        Bethesda, Maryland, MD 20894
        U.S.A.

PRIMARY REFERENCE

       Wootton, J.C., Federhen, S. (1993)  Statistics of local complexity in amino acid sequences
       and sequence  databases.  Computers & Chemistry 17: 149-163.

OTHER REFERENCES

       Wootton, J.C. (1994)  Non-globular domains in protein sequences: automated segmentation
       using complexity measures.  Computers & Chemistry 18: (in press).

       Altschul, S.F., Boguski, M., Gish, W., Wootton, J.C. (1994)  Issues in searching molecular
       sequence  databases.  Nature Genetics 6: 119-129.

       Wootton, J.C. (1994)  Simple sequences of protein and DNA. In: Nucleic Acid and Protein
       Sequence  Analysis: A Practical Approach.  (Second Edition, Chapter 8, Bishop, M.J. and
       Rawlings, C.R. Eds.  IRL  Press, Oxford) (In press).