
Provided by: phast_1.6+dfsg-3_amd64

NAME

       pbsTrain - Estimate a discrete encoding scheme for probabilistic biological sequences

DESCRIPTION

       Estimate a discrete encoding scheme for probabilistic biological sequences (PBSs) based on
       training data.  Input file should be a table of probability vectors, with a row  for  each
       distinct  vector, and a column of counts (positive integers) followed by d columns for the
       elements of the d-dimensional probability vectors (see example below).  It may be produced
       with 'prequel' using the --suff-stats option.  Output is a code file that can be used with
       pbsEncode, pbsDecode, etc.  By default, a code of size 255 is  created,  so  that  encoded
       PBSs  can  be  represented  with  one  byte  per position (the 256th letter in the code is
       reserved for gaps).  The --nbytes option allows larger codes to be created, if desired.
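
       The relationship between bytes per position and code size can be sketched as follows.
       This is an illustration only; the function name code_size is hypothetical and is not
       part of pbsTrain.

```python
def code_size(nbytes: int = 1) -> int:
    """Number of usable code letters when each position takes `nbytes` bytes.

    Hypothetical helper for illustration; mirrors the 256^b - 1 rule in the text.
    """
    if not 1 <= nbytes <= 4:
        raise ValueError("nbytes must be between 1 and 4")
    # nbytes bytes can hold 256**nbytes distinct values; one is reserved for gaps.
    return 256 ** nbytes - 1


for b in (1, 2, 3, 4):
    print(b, code_size(b))
```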

       The code is estimated by a two-part procedure designed to minimize  the  "training  error"
       (defined  as  the  total  KL  divergence) of the encoded training data with respect to the
       original training  data.   First,  a  "grid"  is  defined  for  the  probability  simplex,
       partitioning  it  into  regions  that  intersect  "cells"  (hypercubes)  in  a  matrix  in
       d-dimensional space.  This grid has n "rows" per dimension.  By default, n  is  given  the
       largest  possible  value  such  that  the  number of simplex regions is no larger than the
       target code size, but smaller values of n can be specified using  --nrows.   Each  simplex
       region  is  assigned a letter in the code, and the representative point for that letter is
       set equal to the mean (weighted by the counts) of all vectors in the  training  data  that
       fall  in  that  region.  This can be shown to minimize the training error for this initial
       encoding scheme.  (If no vectors fall in a region, then the representative  point  is  set
       equal  to  the  centroid  of  the  region,  which can be shown to minimize the expected KL
       divergence of points uniformly distributed in the region.)
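
       A minimal sketch of the first step for a single region follows.  It assumes that the
       error of encoding a training vector p by a representative point q is the KL divergence
       D(p || q), measured in bits, which is the direction for which the weighted mean is the
       minimizer.  The function names are hypothetical, and the two vectors (taken from the
       example table) are merely assumed to fall in the same region.

```python
import math


def kl_bits(p, q):
    """KL divergence D(p || q) in bits; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def weighted_mean(vectors, counts):
    """Count-weighted mean of the probability vectors."""
    total = sum(counts)
    dim = len(vectors[0])
    return [sum(c * v[i] for v, c in zip(vectors, counts)) / total
            for i in range(dim)]


def training_error(vectors, counts, rep):
    """Total (count-weighted) KL divergence when every vector is encoded as rep."""
    return sum(c * kl_bits(v, rep) for v, c in zip(vectors, counts))


# Assumption: these two training vectors share one simplex region.
vectors = [[0.043485, 0.797886, 0.029534, 0.129096],
           [0.067254, 0.697947, 0.045959, 0.188840]]
counts = [170085, 159472]

rep = weighted_mean(vectors, counts)
unweighted = [(a + b) / 2 for a, b in zip(*vectors)]

# The weighted mean should do no worse than any other candidate point.
print(training_error(vectors, counts, rep))
print(training_error(vectors, counts, unweighted))
```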

       In the second part of the estimation procedure, the remaining  letters  in  the  code  are
       defined  by  a  greedy  algorithm,  which attempts to further minimize the training error.
       Briefly, on each step, the simplex region with the largest contribution to the total error
       is  identified,  and  the next letter in the code is assigned to that region.  In this new
       encoding, there are multiple letters, hence multiple representative  points,  per  region;
       the  representative  point  for  a given vector is taken to be the closest, in terms of KL
       divergence, of the representative points associated with the simplex region in which  that
       vector  falls.   When  a new representative point is added to a region, all representative
       points for that region are reoptimized using a k-means type algorithm.  This procedure  is
       repeated, letter by letter, until the number of code letters equals the target code size.
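
       The greedy step can be sketched as below.  This is not the pbsTrain implementation: the
       text does not say how a newly added representative point is initialized, so the sketch
       assumes it is seeded at the worst-encoded training vector in the region.  It also
       assumes every region contains at least one training vector, and all names are
       hypothetical.

```python
import math


def kl_bits(p, q):
    """KL divergence D(p || q) in bits; infinite if q lacks support that p has."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi > 0:
            if qi <= 0:
                return math.inf
            total += pi * math.log2(pi / qi)
    return total


def weighted_mean(points):
    """Count-weighted mean of a list of (count, vector) pairs."""
    total = sum(c for c, _ in points)
    dim = len(points[0][1])
    return [sum(c * v[i] for c, v in points) / total for i in range(dim)]


def nearest_error(vec, reps):
    """Error of encoding vec with the closest representative point."""
    return min(kl_bits(vec, r) for r in reps)


def region_error(points, reps):
    """Contribution of one region to the total training error."""
    return sum(c * nearest_error(v, reps) for c, v in points)


def reoptimize(points, reps, iterations=20):
    """k-means style loop: assign vectors to the closest point, then re-average."""
    for _ in range(iterations):
        groups = [[] for _ in reps]
        for c, v in points:
            k = min(range(len(reps)), key=lambda i: kl_bits(v, reps[i]))
            groups[k].append((c, v))
        reps = [weighted_mean(g) if g else r for g, r in zip(groups, reps)]
    return reps


def greedy(regions, target_size):
    """regions: list of lists of (count, vector). Returns the points per region."""
    reps = [[weighted_mean(pts)] for pts in regions]
    while sum(len(r) for r in reps) < target_size:
        errors = [region_error(pts, r) for pts, r in zip(regions, reps)]
        worst = max(range(len(regions)), key=errors.__getitem__)
        if errors[worst] <= 0:
            break  # nothing left to improve
        # Assumption: seed the new letter at the worst-encoded training vector.
        _, seed = max(regions[worst],
                      key=lambda cv: cv[0] * nearest_error(cv[1], reps[worst]))
        reps[worst] = reoptimize(regions[worst], reps[worst] + [list(seed)])
    return reps


# Two hypothetical regions: the first is tight, the second holds two clusters.
regions = [
    [(10, [0.70, 0.10, 0.10, 0.10]), (10, [0.72, 0.10, 0.09, 0.09])],
    [(10, [0.10, 0.70, 0.10, 0.10]), (10, [0.10, 0.40, 0.40, 0.10])],
]
result = greedy(regions, target_size=3)
print([len(r) for r in result])
```

       In this toy input the third letter goes to the second region, because that region
       contributes most of the error after the initial one-point-per-region encoding.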

EXAMPLE

       Generate training data using prequel:

              prequel --suff-stats mammals.fa mytree.mod training

       A file called "training.stats" will be generated.  It will look something like this:

              #count  p(A)            p(C)            p(G)            p(T)
              170085  0.043485        0.797886        0.029534        0.129096
              158006  0.191119        0.046081        0.695205        0.067595
              221937  0.047309        0.122834        0.043852        0.786004
              221585  0.781156        0.044520        0.126179        0.048146
              159472  0.067254        0.697947        0.045959        0.188840
              ...
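
       For readers who want to inspect such a table programmatically, the following is a
       minimal sketch of a reader.  It assumes one whitespace-separated row per vector (a
       count followed by d probabilities) and that lines starting with "#" are headers.  The
       function name read_training_table is hypothetical.

```python
def read_training_table(lines):
    """Parse rows of 'count p1 ... pd' into (count, [p1, ..., pd]) tuples."""
    rows = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and the header
        fields = line.split()
        rows.append((int(fields[0]), [float(x) for x in fields[1:]]))
    return rows


sample = """#count  p(A)    p(C)    p(G)    p(T)
170085  0.043485        0.797886        0.029534        0.129096
158006  0.191119        0.046081        0.695205        0.067595
"""
rows = read_training_table(sample.splitlines())
total_count = sum(count for count, _ in rows)
print(len(rows), total_count)
```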

       Now estimate a code from the training data:

              pbsTrain training.stats > mammals.code

       The code file contains some metadata followed by a list of code indices and representative
       points, e.g.,

              ##NROWS = 7
              ##DIMENSION = 4
              ##NBYTES = 1
              ##CODESIZE = 255
              # Code generated by pbsTrain, with argument(s) "training.stats"
              # acs, Mon Jul 18 23:29:07 2005
              # Average training error = 0.001298 bits

       Each index of the code is shown below with its representative probability vector (p1,  p2,
       ..., pd).

              #code_index   p1   p2   ...
              0       0.107143        0.107143        0.107143        0.678571
              1       0.033226        0.093854        0.031987        0.840933
              2       0.000059        0.001645        0.000111        0.998185
              3       0.139270        0.021059        0.278993        0.560678
              ...
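
       A code file of this shape could be read with a sketch like the one below.  The layout
       is inferred only from the example above ("##KEY = value" metadata, "#" comments, then
       an index followed by d probabilities on each row); the function name read_code_file is
       hypothetical.

```python
def read_code_file(lines):
    """Return (metadata dict, {code_index: representative vector})."""
    meta, code = {}, {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith("##"):
            # Metadata line such as "##NROWS = 7"
            key, _, value = line[2:].partition("=")
            meta[key.strip()] = int(value)
        elif line.startswith("#"):
            continue  # ordinary comment or column header
        else:
            fields = line.split()
            code[int(fields[0])] = [float(x) for x in fields[1:]]
    return meta, code


sample = """##NROWS = 7
##DIMENSION = 4
##NBYTES = 1
##CODESIZE = 255
# Average training error = 0.001298 bits
#code_index   p1   p2   ...
0       0.107143        0.107143        0.107143        0.678571
2       0.000059        0.001645        0.000111        0.998185
"""
meta, code = read_code_file(sample.splitlines())
print(meta["CODESIZE"], code[2])
```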

       The reported "average training error" is the training error divided by the number of  data
       points (the sum of the counts).
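
       The average can be sketched as follows.  As a simplification, each vector is encoded
       by the closest representative point in the whole code rather than only those in its
       own simplex region, and the error is assumed to be D(p || q) in bits.  All names and
       the toy data are hypothetical.

```python
import math


def kl_bits(p, q):
    """KL divergence D(p || q) in bits; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def average_training_error(rows, code):
    """Total count-weighted error divided by the sum of the counts."""
    total_error = 0.0
    total_count = 0
    for count, vec in rows:
        # Simplification: search the whole code for the closest point.
        total_error += count * min(kl_bits(vec, rep) for rep in code)
        total_count += count
    return total_error / total_count


# Toy two-dimensional data: three points encoded exactly, one with some error.
rows = [(3, [0.5, 0.5]), (1, [0.25, 0.75])]
code = [[0.5, 0.5]]
avg = average_training_error(rows, code)
print(avg)
```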

OPTIONS

       --nrows, -n <n>

              Number of "rows" per dimension in the simplex grid.  Default is maximum
              possible for code size.

       --nbytes, -b <b>

              Number of bytes per encoded probabilistic base (default 1).  The size of  the  code
              will  be  256^b - 1 (one letter in the code is reserved for gaps).  Values as large
              as  4  are  allowed  for  b,  but  in  the  current   implementation,   performance
              considerations effectively limit it to 2 or 3.

       --no-greedy, -G

              Skip greedy optimization -- just assign a single representative point to
              each region of the probability simplex, equal to the (weighted) mean of all vectors
              from the training data that fall in that region.

       --no-train, -x <dim>

              Ignore  the  data  entirely;  just use the centroid of each simplex partition.  The
              dimension of the simplex must be given (<dim>) but no data file is required.

       --log, -l <file>

              Write log of the optimization procedure to the specified file.

       --help, -h

              Print this help message.