Provided by: theseus_2.0.6-1_amd64 bug

NAME

       theseus  -  Maximum  likelihood,  multiple  simultaneous  superpositions  with statistical
       analysis

SYNOPSIS

       theseus [-aAbBcCdDeEfFgGhHiIjklLmMnNoOpPqQrRsStTuvVwWxXyYZ] pdbfile1 [pdbfile2 ...]

       and

       theseus_align [-aAbBcCdDeEfFgGhHiIjklLmMnNoOpPqQrRsStTuvVwWxXyYZ]  -f  pdbfile1  [pdbfile2
       ...]

       Default usage is equivalent to:

       theseus -a0 -e2 -g1 -i200 -k-1 -p1e-7 -r theseus -v -P0 your.pdb

DESCRIPTION

       Theseus  superpositions a set of macromolecular structures simultaneously using the method
       of maximum likelihood (ML), rather than the conventional least-squares criterion.  Theseus
       assumes  that  the  structures are distributed according to a matrix Gaussian distribution
       and that the eigenvalues of the atomic covariance matrix  are  hierarchically  distributed
       according  to  an inverse gamma distribution. This ML superpositioning model produces much
       more accurate results by essentially downweighting variable regions of the structures  and
       by correcting for correlations among atoms.

       Theseus  operates  in  two  main modes, a mode for superimposing structures with identical
       sequences and a mode for structures with different sequences but similar structures:

              (1) A mode for superpositioning macromolecules with identical sequences and numbers
              of  residues, for instance, multiple models in an NMR family or multiple structures
              from different crystal forms of the same protein. In this mode, Theseus  will  read
              every model in every file on the command line and superposition them.

              Example:

              theseus 1s40.pdb

              In the above example, 1s40.pdb is a pdb file of 10 NMR models.

              (2)  An  "alignment" mode for superpositioning structures with different sequences,
              for example, multiple structures of the cytochrome c protein from different species
              or  multiple  mutated structures of hen egg white lysozyme.  This mode requires the
              user to supply a sequence alignment file of the  structures  being  superpositioned
              (see  option  -A  and  "FILE FORMATS" below).  Additionally, it may be necessary to
              supply a mapfile that tells theseus which PDB structure files correspond  to  which
              sequences  in  the  alignment  (see  option  -M  and  "FILE  FORMATS"  below). When
              superpositioning based on a  seqeunce  alignment,  theseus  uses  a  novel  maximum
              likelihood   algorithm   for  superpositioning  multiple  structures  that  include
              arbitrary gaps and insertions relative to each other.  Unlike other algorithms  for
              simultaneous  superpositioning of multiple structures, our Expectation-Maximization
              algorithm uses all available data by including all residues aligned  with  gaps  in
              the  calculations.   In this mode, if there are multiple structural models in a PDB
              file, theseus only reads the first model in each file on the command line. In other
              words,  theseus  treats  the  files  on  the command line as if there were only one
              structure per file.

              Example 1:

              theseus -A cytc.aln -M cytc.filemap d1cih__.pdb d1csu__.pdb d1kyow_.pdb

              In the above example, d1cih__.pdb, d1csu__.pdb, and d1kyow_.pdb are  pdb  files  of
              cytochrome c domains from the SCOP database.

              Example 2:

              theseus_align -f d1cih__.pdb d1csu__.pdb d1kyow_.pdb

              In  this  example,  the theseus_align script is called to do the hard work for you.
              It will  calculate  a  sequence  alignment  and  then  superimpose  based  on  that
              alignment.  The script theseus_align takes the same options as the theseus program.
              Note, the first few lines of this script must be modified for your system, since it
              calls an external multiple sequence alignment program to do the alignment.  See the
              examples/ directory for more details, including example files.

OPTIONS

   Algorithmic options, defaults in {brackets}:
       --amber
              Do special processing for AMBER8 formatted PDB files

              Most people will never need to use this long option, unless you are  processing  MD
              traces from AMBER.  AMBER puts the atom names in the wrong column in the PDB file.

       -a [selection]
              Atoms  to  include in the superposition.  This option takes two types of arguments,
              either (1) a number specifying a preselected set of atom types, or (2)  an  explict
              PDB-style, colon-delimited list of the atoms to include.

              For the preselected atom type subsets, the following integer options are available:

               • 0, alpha carbons for proteins, C1´ atoms for nucleic acids
               • 1, backbone
               • 2, all
               • 3, alpha and beta carbons
               • 4, all heavy atoms (no hydrogens)

              Note,  only  the  -a0  option  is  available  when superpositioning structures with
              different sequences.

              To custom select an explicit set of atom types, the atom types  must  be  specified
              exactly  as  given in the PDB file field, including spaces, and the atom-types must
              encapsulated in quotation marks.  Multiple atom types must be delimited by a colon.
              For example,

              -a' N  : CA : C  : O  '

              would specify the atom types in the peptide backbone.

       -c     Use ML atomic covariance weighting (fit correlations, much slower)

              Unless   you  have  many  different  structures  with  few  residues,  fitting  the
              correlation matrix is  likely  unwarranted  statistically  due  to  a  plethora  of
              parameters and a paucity of data.

       -e [n] Embedding algorithm for initializing the average structure
               • 0 = none; use randomly chosen model
               • {2} = {ML embedded structure}

       -f     Only read the first model of a multi-model PDB file

       -g [n] Hierarchical model for variances
               • 0 = none (may not converge)
               • {1} = inverse gamma distribution

       -h     Help/usage

       -i [nnn]
              Maximum iterations, {200}

       -k [n] constant  minimum  variance {-1} {if set to negative value, the minimum variance is
              determined empirically}

       -p [precision]
              Requested relative precision for convergence, {1e-7}

       -r [root name]
              Root name to be used in naming the output files, {theseus}

       -s [n-n:...]
              Residue selection (e.g. -s15-45:50-55), {all}

       -S [n-n:...]
              Residues to exclude (e.g. -S15-45:50-55) {none}

              The previous two options have the same format. Residue (or alignment column) ranges
              are  indicated  by  beginning and end separated by a dash.  Multiple ranges, in any
              arbitrary order, are separated by a colon.  Chains may also be selected  by  giving
              the  chain ID immediately preceding the residue range.  For example, -sA1-20:A40-71
              will only include residues 1 through 20 and 40 through 70 in chain A. Chains cannot
              be specified when superpositioning structures with different sequences.

       -v     use ML variance weighting (no correlations) {default}

   Input/output options:
       -A [sequence alignment file]
              Sequence alignment file to use as a guide (CLUSTAL or A2M format)

              For  use  when  superpositioning  structures  with  different sequences.  See "FILE
              FORMATS" below.

       -E     Print expert options

       -F     Print FASTA files of the sequences in PDB files and quit

              A useful option when superpositioning structures  with  different  sequences.   The
              files  output  with  this  option can be aligned with a multiple sequence alignment
              program such as CLUSTAL or MUSCLE, and the resulting output alignment file used  as
              theseus input with the -A option.

       -h     Help/usage

       -I     Just calculate statistics for input file; don't superposition

       -M [mapfile]
              File that maps PDB files to sequences in the alignment.

              A simple two-column formatted file; see "FILE FORMATS" below. Used with mode 2.

       -n     Don't write transformed pdb file

       -o [reference structure]
              Reference  file  to superposition on, all rotations are relative to the first model
              in this file

              For  example,  'theseus  -o   cytc1.pdb   cytc1.pdb   cytc2.pdb   cytc3.pdb'   will
              superposition  the structures and rotate the entire final superposition so that the
              structure from cytc1.pdb is in  the  same  orientation  as  the  structure  in  the
              original cytc1.pdb PDB file.

       -O     Olve's segID file

              Useful  output  when superpositioning structures with different sequences (mode 2).
              In 'theseus_sup.pdb', the main output superposition PDB file, the segID  field  now
              holds the number of the sequence alignment column that it belongs to.  This number,
              divided by 100, is also echoed in the B-factor field.  When using O (or  any  other
              capable molecular visualization program), one can then color by B-factor ranges and
              immediately see in the superposition which regions of the structure are aligned  in
              the   sequence   alignment  file.   An  additional  file  is  also  output,  called
              'theseus_olve.pdb' which only contains the very atoms that were included in the  ML
              superposition  calculation.   That  is,  it  will  only  contain  alpha  carbons or
              phosphorous atoms, and it will only contain atoms from the  columns  selected  with
              the -s or "-S" options.  Requested by Olve Peersen of Colorado State University.

       -V     Version

   Principal components analysis:
       -C     Use covariance matrix for PCA (correlation matrix is default)

       -P [nnn]
              Number of principal components to calculate {0}

              In  both  of  the above, the corresponding principal component is written in the B-
              factor field of the output PDB file. Usually only the first  few  PCs  are  of  any
              interest (maybe up to six).

               EXAMPLES theseus 2sdf.pdb

       theseus -l -r new2sdf 2sdf.pdb

       theseus -s15-45 -P3 2sdf.pdb

       theseus  -A  cytc.aln  -M  cytc.mapfile  -o cytc1.pdb -s1-40 cytc1.pdb cytc2.pdb cytc3.pdb
       cytc4.pdb

ENVIRONMENT

       You can set the environment variable 'PDBDIR' to your PDB file directory and theseus  will
       look  there  after  the  present  working directory.  For example, in the C shell (tcsh or
       csh), you can put something akin to this in your .cshrc file:

       setenv PDBDIR '/usr/share/pdbs/'

FILE FORMATS

       Theseus will read standard PDB formatted files  (see  <http://www.rcsb.org/pdb/>).   Every
       effort  has  been  made  for the program to accept nonstandard CNS and X-PLOR file formats
       also.

       Two other files deserve mention, a sequence alignment file and a mapfile.

   Sequence alignment file
       When superpositioning structures with different residue identities (where the  lengths  of
       each  the  macromolecules  in  terms  of  residues  are not necessarily equal), a sequence
       alignment file must be included for theseus to  use  as  a  guide  (specified  by  the  -A
       option).   Theseus  accepts  both  CLUSTAL  and  A2M  (FASTA)  formatted multiple sequence
       alignment files.

       NOTE 1: The residue sequence in the alignment must  match  exactly  the  residue  sequence
       given  in  the  coordinates  of  the  PDB  file. That is, there can be no missing or extra
       residues that do not correspond to the sequence in the PDB file. An  easy  way  to  ensure
       that  your  sequences  exactly  match  the  PDB  files  is to generate the sequences using
       theseus' -F option, which writes out a FASTA formatted sequence file of  the  chain(s)  in
       the  PDB  files.  The  files  output  with this option can then be aligned with a multiple
       sequence alignment program such as CLUSTAL or MUSCLE, and the resulting  output  alignment
       file used as theseus input with the -A option.

       NOTE  2: Every PDB file must have a corresponding sequence in the alignment.  However, not
       every sequence in the alignment needs to have a corresponding PDB file. That is, there can
       be extra sequences in the alignment that are not used for guiding the superposition.

   PDB -> Sequence mapfile
       If  the  names  of  the  PDB  files  and  the  names of the corresponding sequences in the
       alignemnt are identical, the mapfile may be omitted.  Otherwise,  Theseus  needs  to  know
       which  sequences  in  the  alignment  file  correspond  to which PDB structure files. This
       information is included in a mapfile with a very simple  format  (specified  with  the  -M
       option).  There  are  only two columns separated by whitespace: the first column lists the
       names of the PDB structure files, while the second column lists the corresponding sequence
       names exactly as given in the multiple sequence alignment file.

       An example of the mapfile:

       cytc1.pdb    seq1
       cytc2.pdb    seq2
       cytc3.pdb    seq3

SCREEN OUTPUT

       Theseus  provides  output describing both the progress of the superpositioning and several
       statistics for the final result:

       Least-squares <sigma>:
              The standard deviation for the superposition, based on the conventional  assumption
              of no correlation and equal variances. Basically equal to the RMSD from the average
              structure.

       Classical LS pairwise <RMSD>:
              The conventional RMSD for the superposition, the  average  RMSD  for  all  pairwise
              combinations of structures in the ensemble.

       Maximum Likelihood <sigma>:
              The  ML  analog of the standard deviation for the superposition. When assuming that
              the correlations are zero (a diagonal covariance matrix),  this  is  equal  to  the
              square  root  of  the harmonic average of the variances for each atom. In contrast,
              the 'Least-squares <sigma>' given above reports the square root of  the  arithmetic
              average  of the variances.  The harmonic average is always less than the arithmetic
              average, and the harmonic average downweights large values  proportional  to  their
              magnitude. This makes sense statistically, because when combining values one should
              weight them by the reciprocal of their variance (which  is  in  fact  what  the  ML
              superpositioning method does).

       Log Likelihood:
              The  final  log  likelihood  of  the  superposition,  assuming  the matrix Gaussian
              distribution of the structures and the hierarchical inverse gamma  distribution  of
              the eigenvalues of the covariance matrix.

       AIC:   The  Akaike Information Criterion for the final superposition. This is an important
              statistic in likelihood analysis and model selection theory. It allows an objective
              comparison  of multiple theoretical models with different numbers of parameters. In
              this case, the higher the number the better. There is a tradeoff between fit to the
              data  and  the number of parameters being fit.  Increasing the number of parameters
              in a model will always give a better fit to the data, but  it  also  increases  the
              uncertainty  of the estimated values.  The AIC criterion finds the best combination
              by (1) maximizing the fit to the data while (2) minimizing the uncertainty  due  to
              the  number  of  parameters.  In  the superposition case, one can compare the least
              squares superposition to the  maximum  likelihood  superposition.  The  method  (or
              model)  with  the  higher AIC is preferred. A difference in the AIC of 2 or more is
              considered strong statistical evidence for the better model.

       BIC:   The Bayesian Information Criterion.  Similar  to  the  AIC,  but  with  a  Bayesian
              emphasis.

       Rotational, translational, covar chi^2:
              The reduced chi-squared statistic for the fit of the structures to the model.  With
              a good fit it should be close to 1.0, which indicates a perfect fit of the data  to
              the statistical model.  In the case of least-squares, the assumed model is a matrix
              Gaussian distribution of the structures with equal variances and  no  correlations.
              For  the  ML  fits,  the  assumed models can either be (1) unequal variances and no
              correlations, as calculated with the -v option [default] or (2)  unequal  variances
              and  correlations,  as  calculated  with  the -c option.  This statistic is for the
              superposition only,  and  does  not  include  the  fit  of  the  covariance  matrix
              eigenvalues to an inverse gamma distribution.  See 'Omnibus chi^2' below.

       Hierarchical minimum var:
              The  hierarchical fit of the inverse gamma distribution constrains the variances of
              the atoms by making large ones smaller  and  small  ones  larger.   This  statistic
              reports the minimum possible variance given the inferred inverse gamma parameters.

       Hierarchical var (alpha, gamma) chi^2:
              The  reduced  chi-squared  for  the  inverse  gamma  fit  of  the covariance matrix
              eigenvalues. As before, it should ideally be close to 1.0.  The two values  in  the
              parentheses  are  the ML estimates of the scale and shape parameters, respectively,
              for the inverse gamma distribtuion.

       Omnibus chi^2:
              The overall reduced  chi-squared  statistic  for  the  entire  fit,  including  the
              rotations,  translations,  covariances,  and  the inverse gamma parameters. This is
              probably the most important statistic for the superposition.  In  some  cases,  the
              inverse  gamma  fit  may be poor, yet the overall fit is still very good. Again, it
              should ideally be close to 1.0, which would indicate a perfect fit. However, if you
              think  it  is too large, make sure to compare it to the chi^2 for the least-squares
              fit; it's probably not that bad  after  all.   A  large  chi^2  often  indicates  a
              violation  of  the  assumptions  of  the  model.  The most common violation is when
              superpositioning two or more independent domains that can rotate relative  to  each
              other.  If  this  is  the  case,  then  there  will likely be not just one Gaussian
              distribution, but several mixed Gaussians, one for each domain.  Then, it would  be
              better to superposition each domain independently.

       skewness, skewness Z-value, kurtosis & kurtosis Z-value:
              The skewness and kurtosis of the residuals. Both should be 0.0 if the residuals fit
              a Gaussian distribution perfectly.  They  are  followed  by  the  P-value  for  the
              statistics.  This  is a very stringent test; residuals can be very non-Gaussian and
              yet the estimated rotations, translations,  and  covariance  matrix  may  still  be
              rather accurate.

       FP error in transformed coordinates:
              The  empirically  determined floating point error in the coordinates after rotation
              and translation.

       Minimum RMSD error per atom:
              The empirically determined minimum RMSD error per atom, based on the floating point
              error of the computer.

       Data pts, Free params, D/P:
              The  total  number  of  data  points  given  all observed structures, the number of
              parameters being fit in the model, and the data-to-parameter ratio.

       Median structure:
              The structure that is overall most similar to the average structure.  This  can  be
              considered to be the most "typical" structure in the ensemble.

       Total rounds:
              The number of iterations that the algorithm took to converge.

       Fractional precision:
              The actual precision that the algorithm converged to.

OUTPUT FILES

       Theseus writes out the following files:

       theseus_sup.pdb
              The final superposition, rotated to the principle axes of the mean structure.

       theseus_ave.pdb
              The estimate of the mean structure.

       theseus_cor.mat, theseus_cov.mat
              The  atomic  correlation  matrix  and  covariance  matrices,  based  on  the  final
              superposition. The format is suitable for input to GNU's  octave.   These  are  the
              matrices used in the Principal Components Analysis.

       theseus_embed_ave.pdb
              The  average structure as calculated by S. Lele's EDMA embedding algorithm, used as
              the starting point for the maximum likelihood iterations.

       theseus_residuals.txt
              The normalized residuals of the superposition. These can be analyzed for deviations
              from  normality  (whether  they  fit  a  standard Gaussian distribution). E.g., the
              chi^2, skewness, and kurtosis statistics are based on these values.

       theseus_transf.txt
              The final transformation rotation matrices and translation vectors.

       theseus_variances.txt
              The vector of estimated variances for each atom.

       When Principal Components are calculated (with the -P option),  the  following  files  are
       also produced:

       theseus_pcvecs.txt
              The principal component vectors.

       theseus_pcstats.txt
              Simple  statistics  for  each  principle  component  (loadings, variance explained,
              etc.).

       theseus_pcN_ave.pdb
              The average structure with the Nth principal component written in  the  temperature
              factor field.

       theseus_pcN.pdb
              The final superposition with the Nth principal component written in the temperature
              factor field.  This file is omitted when superpositioning molecules with  different
              residue sequences (mode 2).

BUGS

       Please send me (DLT) reports of all problems.

RESTRICTIONS

       Theseus  is  not a structural alignment program.  The structure-based alignment problem is
       completely different from  the  structural  superposition  problem.   In  order  to  do  a
       structural  superposition, there must be a 1-to-1 mapping that associates the atoms in one
       structure with the atoms in the other structures.  In the simplest case, this  means  that
       structures  must  have equivalent numbers of atoms, such as the models in an NMR PDB file.
       For structures with different numbers of residues/atoms, superpositioning is only possible
       when  the  sequences  have  been  aligned previously.  Finding the best sequence alignment
       based on only structural information is a difficult problem, and one for  which  there  is
       currently  no  maximum  likelihood  approach.  Extending theseus to address the structural
       alignment problem is an ongoing research project.

AUTHOR

       Douglas L. Theobald
       dtheobald@brandeis.edu

CITATION

       When using theseus in publications please cite:

       Douglas L. Theobaldand Phillip A. Steindel (2012)
       "Optimal simultaneous superpositioning of multiple structures with missing data."
       Bioinformatics 28(15):1972-1979

       The following papers also report theseus developments:

       Douglas L. Theobald and Deborah S. Wuttke (2006)
       "Empirical Bayes models for regularizing  maximum  likelihood  estimation  in  the  matrix
       Gaussian Procrustes problem."
       PNAS 103(49):18521-18527

       Douglas L. Theobald and Deborah S. Wuttke (2006)
       "THESEUS: Maximum likelihood superpositioning and analysis of macromolecular structures."
       Bioinformatics 22(17):2171-2172

       Douglas L. Theobald and Deborah S. Wuttke (2008)
       "Accurate structural correlations from maximum likelihood superpositions."
       PLoS Computational Biology 4(2):e43

HISTORY

       Long, tedious, and sordid.