Provided by: infernal_1.1.1-3_amd64

NAME

       cmcalibrate - fit exponential tails for covariance model E-value determination

SYNOPSIS

       cmcalibrate [options] cmfile

DESCRIPTION

       cmcalibrate  determines  exponential  tail  parameters  for  E-value  determination  by generating random
       sequences, searching them with the CM and collecting the scores of the resulting hits. A histogram of the
       bit  scores of the hits is fit to an exponential tail, and the parameters of the fitted tail are saved to
       the CM file. The exponential tail parameters are then used to estimate the  statistical  significance  of
       hits found in cmsearch and cmscan.
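
       The fit described above can be illustrated with a minimal sketch. It is not Infernal's implementation:
       it assumes the tail is the <n> highest scores, takes mu as the lowest score in that tail, and uses the
       textbook maximum-likelihood estimate of lambda (one over the mean excess above mu). The function names
       are hypothetical and the scores are simulated.

```python
import math
import random


def fit_exponential_tail(scores, tail_size):
    """Fit an exponential tail to the `tail_size` highest scores.

    Returns (mu, lam). mu is the lowest score in the tail; lam is the
    maximum-likelihood rate, 1 / mean(score - mu). Illustrative only.
    """
    if not 2 <= tail_size <= len(scores):
        raise ValueError("tail_size must be between 2 and len(scores)")
    tail = sorted(scores, reverse=True)[:tail_size]
    mu = tail[-1]
    mean_excess = sum(s - mu for s in tail) / len(tail)
    if mean_excess <= 0:
        raise ValueError("all tail scores are identical")
    return mu, 1.0 / mean_excess


def expected_hits(score, mu, lam, tail_size):
    """Expected number of hits scoring >= `score` (for score >= mu) in a
    random search of the same size as the one used for the fit."""
    return tail_size * math.exp(-lam * (score - mu))


if __name__ == "__main__":
    rng = random.Random(181)  # fixed seed so the simulation is repeatable
    simulated = [rng.expovariate(0.7) for _ in range(20000)]
    mu, lam = fit_exponential_tail(simulated, 1000)
    print(f"mu={mu:.3f} lambda={lam:.3f}")
    print(f"expected hits at mu+5: {expected_hits(mu + 5, mu, lam, 1000):.2f}")
```

       With simulated scores drawn at rate 0.7, the fitted lambda should land close to 0.7. Real hit scores
       are only approximately exponential in the tail, which is why the size of the tail matters.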

       A  CM file must be calibrated with cmcalibrate before it can be used in cmsearch or cmscan, with a single
       exception: it is not necessary to calibrate CM files that include only models with zero basepairs  before
       running cmsearch.

       cmcalibrate is very slow. It takes a couple of hours to calibrate a single average-sized CM on a single
       CPU. cmcalibrate will run in parallel on all available cores if Infernal was built on a system that
       supports POSIX threading (see the Installation section of the user guide for more information). Using <n>
       cores will result in roughly <n>-fold acceleration versus a single CPU. MPI (Message Passing Interface)
       can also be used for parallelization with the --mpi option if Infernal was built with MPI enabled, but
       using more than 161 processors is not recommended because increasing past 161 won't accelerate the
       calibration. See the Installation section of the user guide for more information.

       The --forecast option can be used to estimate how long the program will take to run for a given cmfile on
       the current machine.  To predict the running time on  <n>  processors  with  MPI,  additionally  use  the
       --nforecast <n> option.
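
       As a hedged illustration, the forecast invocation can be assembled from a script. The sketch below only
       builds the argument list from the options named above; the file name my.cm and the helper name are
       hypothetical, and nothing is executed.

```python
import shlex


def forecast_command(cmfile, nprocs=None):
    """Build the argument list for a cmcalibrate timing forecast.

    `nprocs`, if given, is passed via --nforecast to predict an MPI run.
    """
    argv = ["cmcalibrate", "--forecast"]
    if nprocs is not None:
        if nprocs < 1:
            raise ValueError("nprocs must be >= 1")
        argv += ["--nforecast", str(nprocs)]
    argv.append(cmfile)  # the CM file comes last, per the SYNOPSIS
    return argv


if __name__ == "__main__":
    # "my.cm" is a placeholder file name.
    print(shlex.join(forecast_command("my.cm", nprocs=32)))
```

       The returned list can be handed to subprocess.run(argv, check=True) on a machine where Infernal is
       installed.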

       The  random  sequences  searched  in cmcalibrate are generated by an HMM that was trained on real genomic
       sequences with various GC contents. The goal is to have the GC distributions in the random  sequences  be
       similar to those in actual genomic sequences.

       Four  rounds  of  searches  and  subsequent  exponential  tail  fits are performed, one each for the four
       different CM algorithms that can be used in cmsearch and cmscan: glocal CYK, glocal Inside, local CYK and
       local Inside.

       The E-value parameters determined by cmcalibrate are only used by the cmsearch and cmscan programs. If
       you are not going to use these programs then do not waste time calibrating your models.

OPTIONS

       -h     Help; print a brief reminder of command line usage and available options.

       -L <x> Set the total length of random sequences to search to <x> megabases (Mb). By default, <x>  is  1.6
              Mb.  Increasing  <x>  will make the exponential tail fits more precise and E-values more accurate,
              but will take longer (doubling <x> will roughly double the running time).  Decreasing <x>  is  not
              recommended as it will make the fits less precise and the E-values less accurate.
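
              The scaling rules stated in this page (time roughly proportional to <x>, roughly <n>-fold
              acceleration on <n> cores, no gain past 161 MPI processors) can be combined into a
              back-of-the-envelope estimate. This is a sketch under those stated assumptions only; the function
              name and the baseline figure are hypothetical, and --forecast remains the proper way to predict
              running time.

```python
DEFAULT_MB = 1.6        # default total sequence length for -L, in Mb
MAX_USEFUL_PROCS = 161  # beyond this, MPI runs are said not to speed up


def rough_hours(single_cpu_hours, total_mb=DEFAULT_MB, workers=1):
    """Rough running time in hours.

    `single_cpu_hours` is the time for the default 1.6 Mb on one CPU.
    Assumes linear scaling in `total_mb` and in `workers` (capped at 161).
    """
    if single_cpu_hours <= 0 or total_mb <= 0 or workers < 1:
        raise ValueError("arguments must be positive")
    effective = min(workers, MAX_USEFUL_PROCS)
    return single_cpu_hours * (total_mb / DEFAULT_MB) / effective


if __name__ == "__main__":
    # Assumed baseline: 2 hours for one model on one CPU.
    print(rough_hours(2.0, total_mb=3.2, workers=8))
```

              Applying the 161 cap to thread counts as well as MPI processors is a simplifying assumption of
              this sketch.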

OPTIONS FOR PREDICTING REQUIRED TIME AND MEMORY

       --forecast
              Predict  the  running  time  of  the  calibration of cmfile (with provided options) on the current
              machine and exit. The calibration is not performed.  The predictions should  be  considered  rough
              estimates.  If multithreading is enabled (see Installation section of user guide), the timing will
              take into account the number of available cores.

       --nforecast <n>
              With --forecast, specify that <n> processors will be used for  the  calibration.   This  might  be
              useful for predicting the running time of an MPI run with <n> processors.

       --memreq
              Predict  the  amount  of  required  memory  for  calibrating cmfile (with provided options) on the
              current machine and exit. The calibration is not performed.

OPTIONS CONTROLLING EXPONENTIAL TAIL FITS

       --gtailn <x>
              Fit the exponential tail for glocal Inside and glocal CYK to the <n> highest scores in the
              histogram tail, where <n> is <x> times the number of Mb searched. The default value of <x> is 250.
              The value 250 was chosen because it works well empirically relative to other values.

       --ltailn <x>
              Fit the exponential tail for local Inside and local CYK to the <n> highest scores in the histogram
              tail,  where  <n>  is  <x>  times the number of Mb searched. The default value of <x> is 750.  The
              value 750 was chosen because it works well empirically relative to other values.

       --tailp <x>
              Ignore the --gtailn and --ltailn options and instead fit an exponential tail to the
              highest-scoring fraction <x> of the histogram, for all search modes.
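
              As a worked example of the --gtailn and --ltailn arithmetic, the sketch below computes <n> as <x>
              times the number of Mb searched. Rounding to the nearest integer is an assumption of the sketch
              (this page does not say how non-integer products are handled), and the function name is
              hypothetical.

```python
DEFAULT_MB = 1.6      # default for -L
DEFAULT_GTAILN = 250  # default for --gtailn (glocal modes)
DEFAULT_LTAILN = 750  # default for --ltailn (local modes)


def tail_hits(hits_per_mb, total_mb=DEFAULT_MB):
    """Number of highest-scoring hits used in the fit: <x> * Mb searched."""
    if hits_per_mb <= 0 or total_mb <= 0:
        raise ValueError("arguments must be positive")
    return round(hits_per_mb * total_mb)  # rounding is an assumption


if __name__ == "__main__":
    print("glocal tail size:", tail_hits(DEFAULT_GTAILN))
    print("local tail size: ", tail_hits(DEFAULT_LTAILN))
```

              With the defaults, the glocal fits use 400 scores and the local fits use 1200 scores.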

OPTIONAL OUTPUT FILES

       --hfile <f>
              Save  the  histograms fit to file <f>.  The format of this file is two space delimited columns per
              line. The first column is the x-axis values of bit scores of each bin. The second column is the y-
              axis  values of number of hits per bin. Each series is delimited by a line with a single character
              "&". The file will contain one series for each of the four exponential tail fits in the  following
              order: glocal CYK, glocal Inside, local CYK, and local Inside.
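
              A file in this layout can be read back with a few lines of code. The sketch below is a minimal
              reader, assuming two whitespace-separated numeric columns and a lone "&" closing (or separating)
              each series; the function name and the sample data are hypothetical.

```python
MODES = ("glocal CYK", "glocal Inside", "local CYK", "local Inside")


def parse_series(text):
    """Split "&"-delimited text into a list of series of (x, y) floats."""
    series, current = [], []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line == "&":
            # A delimiter closes the current series, even an empty one,
            # so that the series order is preserved.
            series.append(current)
            current = []
            continue
        x, y = line.split()
        current.append((float(x), float(y)))
    if current:  # last series without a trailing "&"
        series.append(current)
    return series


if __name__ == "__main__":
    # Made-up sample data: four tiny series in the documented order.
    sample = "1.0 5\n2.0 3\n&\n1.5 4\n&\n0.5 9\n1.5 2\n&\n0.0 7\n&\n"
    for mode, points in zip(MODES, parse_series(sample)):
        print(mode, points)
```

              The same reader applies to the other "&"-delimited two-column output files described in this
              section.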

       --sfile <f>
              Save  survival  plot  information  to  file  <f>.   The format of this file is two space delimited
              columns per line. The first column is the x-axis values of bit scores  of  each  bin.  The  second
              column  is  the y-axis values of fraction of hits that meet or exceed the score for each bin. Each
              series is delimited by a line with a single character "&".  The file will contain three series  of
              data for each of the four CM search modes in the following order: glocal CYK, glocal Inside, local
              CYK, and local Inside.  The first series is the empirical survival plot from the histogram of hits
              to  the  random  sequence.  The  second  series  is  the  exponential  tail  fit  to the empirical
              distribution. The third series is the exponential tail fit if lambda were fixed and set as the
              natural log of 2 (0.69314718).
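
              The quantities plotted here can be sketched as follows. This is an illustration, not Infernal's
              code: the empirical curve is computed from (score, count) histogram bins, and the fitted curve
              assumes the exponential form tail_fraction * exp(-lambda * (score - mu)). All names are
              hypothetical; fixing lambda at the natural log of 2 corresponds to math.log(2).

```python
import math


def empirical_survival(bins):
    """Fraction of hits that meet or exceed each bin's score.

    `bins` is a list of (score, count) pairs sorted by ascending score.
    """
    total = sum(count for _, count in bins)
    if total <= 0:
        raise ValueError("histogram is empty")
    remaining = total
    curve = []
    for score, count in bins:
        curve.append((score, remaining / total))
        remaining -= count
    return curve


def exponential_survival(score, mu, lam, tail_fraction=1.0):
    """Survival fraction predicted by an exponential tail (score >= mu)."""
    return tail_fraction * math.exp(-lam * (score - mu))


if __name__ == "__main__":
    bins = [(0.0, 5), (1.0, 3), (2.0, 2)]  # made-up histogram
    print(empirical_survival(bins))
    # With lambda fixed at ln 2, each extra bit of score halves the fraction.
    print(exponential_survival(1.0, mu=0.0, lam=math.log(2)))
```

              Comparing the fitted lambda against ln 2 on the same plot shows how far the fit departs from the
              fixed-lambda curve.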

       --qqfile <f>
              Save  quantile-quantile  plot  information  to  file  <f>.   The  format of this file is two space
              delimited columns per line. The first column is the x-axis values, and the second column is the
              y-axis values. The distance of the points from the identity line (y=x) is a measure of how good
              the exponential tail fit is: the closer the points are to the identity line, the better the fit.
              Each  series is delimited by a line with a single character "&".  The file will contain one series
              of empirical data for each of the four exponential tail fits in the following order:  glocal  CYK,
              glocal Inside, local CYK and local Inside.

       --ffile <f>
              Save  space  delimited  statistics  of different exponential tail fits to file <f>.  The file will
              contain the lambda and mu values for exponential tails fit to histogram tails of different  sizes.
              The fields in the file are labelled informatively.

       --xfile <f>
              Save a list of the scores in each fit histogram tail to file <f>. Each line of this file holds
              one score, indicating that one hit with that score existed in the tail. Each series is
              delimited  by a line with a single character "&". The file will contain one series for each of the
              four exponential tail fits in the following order: glocal CYK, glocal Inside, local CYK, and local
              Inside.

OTHER OPTIONS

       --seed <n>
              Seed  the  random  number  generator  with  <n>,  an  integer >= 0.  If <n> is nonzero, stochastic
              simulations will be reproducible; the same command will give the same results.  If <n> is  0,  the
              random  number  generator  is seeded arbitrarily, and stochastic simulations will vary from run to
              run of the same command.  The default seed is 181.

       --beta <x>
              By default, query-dependent banding (QDB) is used to accelerate the CM search algorithms with a
              beta tail loss probability of 1E-15. This option changes beta to <x>. Beta is the amount of
              probability mass excluded during band calculation; higher values of beta give greater speedups
              but sacrifice more accuracy than lower values. (For more information on QDB see Nawrocki and
              Eddy, PLoS Computational Biology 3(3): e56.)

       --nonbanded
              Turn off QDB during E-value calibration. This will slow down calibration.

       --nonull3
              Turn  off  the  null3  post  hoc additional null model. This is not recommended unless you plan on
              using the same option to cmsearch and/or cmscan.

       --random
              Use the background null model of the CM to generate the random  sequences,  instead  of  the  more
              realistic  HMM.  Unless  the  CM was built using the --null option to cmbuild, the background null
              model will be 25% each A, C, G and U.

       --gc <f>
              Generate the random sequences using the nucleotide distribution from the sequence file <f>.

       --cpu <n>
              Specify that <n> parallel CPU workers be used. If <n> is set to 0, the program will be run
              in serial mode, without using threads. You can also control this number by setting an environment
              variable, INFERNAL_NCPU.  This option will only be available if the machine on which Infernal  was
              built is capable of using POSIX threading (see the Installation section of the user guide for more
              information).

       --mpi  Run as an MPI parallel program. This option will only be available if Infernal has been configured
              and  built  with  the "--enable-mpi" flag (see the Installation section of the user guide for more
              information).

SEE ALSO

       See infernal(1) for a master man page with a list of all the individual man pages  for  programs  in  the
       Infernal package.

       For complete documentation, see the user guide that came with your Infernal distribution (Userguide.pdf);
       or see the Infernal web page.

       Copyright (C) 2014 Howard Hughes Medical Institute.
       Freely distributed under the GNU General Public License (GPLv3).

       For additional information on copyright and licensing, see the file called COPYRIGHT in your Infernal
       source distribution, or see the Infernal web page.

AUTHOR

       The Eddy/Rivas Laboratory
       Janelia Farm Research Campus
       19700 Helix Drive
       Ashburn VA 20147 USA
       http://eddylab.org