Provided by: infernal_1.1.3-4_amd64

NAME

       cmcalibrate - fit exponential tails for covariance model E-value determination

SYNOPSIS

       cmcalibrate [options] cmfile

DESCRIPTION

       cmcalibrate determines exponential tail parameters for E-value determination by generating
       random sequences, searching them with the CM and collecting the scores  of  the  resulting
       hits.  A  histogram  of  the bit scores of the hits is fit to an exponential tail, and the
       parameters of the fitted tail are saved to the CM file. The  exponential  tail  parameters
       are  then  used  to  estimate  the  statistical significance of hits found in cmsearch and
       cmscan.
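
       As a rough illustration of the procedure above, the following Python sketch fits an
       exponential tail to the highest-scoring hits by maximum likelihood and scales it into an
       expected hit count. It is not Infernal's implementation: the function names, the choice
       of the lowest retained score as mu, and the per-megabase scaling are assumptions made
       for illustration only.

```python
import math


def fit_exponential_tail(scores, n_tail):
    """Fit P(S > x | S >= mu) = exp(-lam * (x - mu)) to the n_tail highest scores."""
    if not 2 <= n_tail <= len(scores):
        raise ValueError("n_tail must be between 2 and len(scores)")
    tail = sorted(scores, reverse=True)[:n_tail]
    mu = tail[-1]  # assumption: the lowest score kept in the tail is the location mu
    mean_excess = sum(s - mu for s in tail) / n_tail
    if mean_excess <= 0:
        raise ValueError("all tail scores are identical")
    lam = 1.0 / mean_excess  # maximum-likelihood rate of a shifted exponential
    return mu, lam


def expected_hits(score, mu, lam, n_tail, searched_mb, target_mb):
    """Expected number of random hits scoring >= score in target_mb of sequence."""
    hits_per_mb_at_mu = n_tail / searched_mb  # tail hits observed per Mb searched
    return hits_per_mb_at_mu * target_mb * math.exp(-lam * (score - mu))
```

       A larger lambda means the tail decays faster, so a given bit score above mu corresponds
       to fewer expected random hits.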

       A CM file must be calibrated with cmcalibrate before it can be used in cmsearch or cmscan,
       with  a  single  exception:  it  is  not necessary to calibrate CM files that include only
       models with zero basepairs before running cmsearch.

       cmcalibrate is very slow. It takes a couple of hours to calibrate a single average-sized
       CM on a single CPU.  cmcalibrate will run in parallel on all available cores if Infernal
       was built on a system that supports POSIX threading (see the Installation section of the
       user guide for more information).  Using <n> cores will result in roughly <n>-fold
       acceleration versus a single CPU.  MPI (Message Passing Interface) can also be used for
       parallelization  with  the  --mpi option if Infernal was built with MPI enabled, but using
       more than 161 processors is not recommended because increasing past 161  won't  accelerate
       the calibration.  See the Installation section of the user guide for more information.
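
       The scaling described above can be turned into a back-of-the-envelope estimate. The
       Python sketch below assumes ideal n-fold acceleration and treats 161 as a hard ceiling
       for MPI; both are simplifications, and the function name is hypothetical.

```python
MPI_USEFUL_LIMIT = 161  # processors beyond this are not expected to help with --mpi


def estimated_hours(single_cpu_hours, n_workers, mpi=False):
    """Rough wall-clock estimate assuming ideal n-fold acceleration."""
    if n_workers < 1:
        raise ValueError("n_workers must be >= 1")
    effective = min(n_workers, MPI_USEFUL_LIMIT) if mpi else n_workers
    return single_cpu_hours / effective
```

       For a measured rather than idealized figure, use the --forecast option.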

       The  --forecast option can be used to estimate how long the program will take to run for a
       given cmfile on the current machine.  To predict the running time on <n>  processors  with
       MPI, additionally use the --nforecast <n> option.
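
       When calibrations are scripted, the forecast command line can be assembled
       programmatically. The Python sketch below only builds the argument list and does not
       run anything; the helper name and the file name my.cm are hypothetical.

```python
import shlex


def forecast_command(cmfile, n_processors=None):
    """Build the argv list for a timing forecast; nothing is executed here."""
    argv = ["cmcalibrate", "--forecast"]
    if n_processors is not None:
        if n_processors < 1:
            raise ValueError("n_processors must be >= 1")
        argv += ["--nforecast", str(n_processors)]
    argv.append(cmfile)  # the CM file comes last, as in the SYNOPSIS
    return argv


print(shlex.join(forecast_command("my.cm", n_processors=16)))
# cmcalibrate --forecast --nforecast 16 my.cm
```

       The returned list can be passed to subprocess.run() on a machine where cmcalibrate is
       installed.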

       The  random  sequences searched in cmcalibrate are generated by an HMM that was trained on
       real genomic sequences with various GC contents. The goal is to have the GC  distributions
       in the random sequences be similar to those in actual genomic sequences.

       Four  rounds  of searches and subsequent exponential tail fits are performed, one each for
       the four different CM algorithms that can be used in  cmsearch  and  cmscan:  glocal  CYK,
       glocal Inside, local CYK and local Inside.

       The E-value parameters determined by cmcalibrate are only used by the cmsearch and cmscan
       programs.  If you are not going to use these programs then do not waste  time  calibrating
       your models.

OPTIONS

       -h     Help; print a brief reminder of command line usage and available options.

       -L <x> Set  the  total  length  of  random  sequences  to search to <x> megabases (Mb). By
              default, <x> is 1.6 Mb. Increasing <x> will make the  exponential  tail  fits  more
              precise and E-values more accurate, but will take longer (doubling <x> will roughly
              double the running time).  Decreasing <x> is not recommended as it  will  make  the
              fits less precise and the E-values less accurate.

OPTIONS FOR PREDICTING REQUIRED TIME AND MEMORY

       --forecast
              Predict  the  running  time of the calibration of cmfile (with provided options) on
              the current machine and exit. The calibration is not  performed.   The  predictions
              should be considered rough estimates. If multithreading is enabled (see the
              Installation section of the user guide), the timing will take into account the number
              of available cores.

       --nforecast <n>
              With  --forecast,  specify  that  <n>  processors will be used for the calibration.
              This might be useful for predicting the  running  time  of  an  MPI  run  with  <n>
              processors.

       --memreq
              Predict  the  amount  of  required  memory  for  calibrating  cmfile (with provided
              options) on the current machine and exit. The calibration is not performed.

OPTIONS CONTROLLING EXPONENTIAL TAIL FITS

       --gtailn <x>
              Fit the exponential tail for glocal Inside and glocal CYK to the <n> highest scores
              in  the  histogram  tail,  where  <n>  is  <x> times the number of Mb searched. The
              default value of <x> is 250.  The value  250  was  chosen  because  it  works  well
              empirically relative to other values.

       --ltailn <x>
              Fit the exponential tail for local Inside and local CYK to the <n> highest scores
              in the histogram tail, where <n> is <x>  times  the  number  of  Mb  searched.  The
              default  value  of  <x>  is  750.   The  value 750 was chosen because it works well
              empirically relative to other values.

       --tailp <x>
              Ignore the --gtailn and --ltailn prefixed options and fit the <x> fraction tail  of
              the histogram to an exponential tail, for all search modes.
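
              As a worked example of how --gtailn and --ltailn interact with -L, the Python
              sketch below computes the number of top-scoring hits used for each fit. The
              function name is hypothetical, and rounding to the nearest integer is an
              assumption; the program's own rounding may differ.

```python
def tail_size(mode, searched_mb=1.6, gtailn=250, ltailn=750):
    """Number of highest-scoring hits fit for a 'glocal' or 'local' search mode."""
    per_mb = {"glocal": gtailn, "local": ltailn}[mode]
    return round(per_mb * searched_mb)  # <n> = <x> times the number of Mb searched
```

              With the defaults (1.6 Mb searched), this gives 400 hits for the glocal modes and
              1200 hits for the local modes.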

OPTIONAL OUTPUT FILES

       --hfile <f>
              Save  the  histograms  fit  to  file  <f>.   The  format  of this file is two space
              delimited columns per line. The first column is the x-axis values of bit scores  of
              each  bin.  The  second column is the y-axis values of number of hits per bin. Each
              series is delimited by a line with a single character "&". The  file  will  contain
              one  series  for  each  of  the  four exponential tail fits in the following order:
              glocal CYK, glocal Inside, local CYK, and local Inside.
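
              A file in this layout can be read with a few lines of Python. The sketch below
              is based only on the description above: it assumes there are no header or
              comment lines, accepts a "&" line either after every series or only between
              series, and uses hypothetical function names.

```python
MODES = ["glocal CYK", "glocal Inside", "local CYK", "local Inside"]


def parse_series(text):
    """Split '&'-delimited text into a list of series of (x, y) float pairs."""
    series, current = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line == "&":  # delimiter line closes the current series
            series.append(current)
            current = []
            continue
        x, y = line.split()[:2]  # bit score of the bin, number of hits in the bin
        current.append((float(x), float(y)))
    if current:  # tolerate a file without a final '&'
        series.append(current)
    return series


def histograms_by_mode(text):
    """Label the four histograms in the documented order."""
    return dict(zip(MODES, parse_series(text)))
```

              For example, histograms_by_mode(open("hist.txt").read()) would return a dictionary
              keyed by search mode, where hist.txt is a hypothetical file written with --hfile.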

       --sfile <f>
              Save survival plot information to file <f>.  The format of this file is  two  space
              delimited  columns per line. The first column is the x-axis values of bit scores of
              each bin. The second column is the y-axis values of fraction of hits that  meet  or
              exceed  the  score  for  each bin. Each series is delimited by a line with a single
              character "&".  The file will contain three series of data for each of the four  CM
              search  modes  in  the  following  order: glocal CYK, glocal Inside, local CYK, and
              local Inside.  The first series is the empirical survival plot from  the  histogram
              of  hits  to  the random sequence. The second series is the exponential tail fit to
              the empirical distribution. The third series is the exponential tail fit if  lambda
              were fixed and set to the natural log of 2 (0.69314718).
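
              The Python sketch below shows one way to label the twelve series once they have
              been parsed into lists of (x, y) pairs, and what fixing lambda at the natural log
              of 2 implies. It assumes the three series for a mode appear consecutively, which
              the description above suggests but does not state outright; all names are
              hypothetical.

```python
import math

MODES = ["glocal CYK", "glocal Inside", "local CYK", "local Inside"]
KINDS = ["empirical", "fitted", "fixed_lambda"]


def group_survival_series(series):
    """Map 12 parsed series to {mode: {kind: series}}, three per search mode."""
    if len(series) != len(MODES) * len(KINDS):
        raise ValueError("expected 12 series, got %d" % len(series))
    grouped = {}
    for i, mode in enumerate(MODES):
        chunk = series[3 * i:3 * i + 3]  # assumption: series are grouped by mode
        grouped[mode] = dict(zip(KINDS, chunk))
    return grouped


def fixed_lambda_survival(score, mu):
    """Tail survival relative to mu when lambda is fixed at ln 2."""
    return math.exp(-math.log(2) * (score - mu))
```

              With lambda fixed at ln 2, the surviving fraction halves for every additional bit
              of score above mu.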

       --qqfile <f>
              Save  quantile-quantile  plot  information to file <f>.  The format of this file is
              two space delimited columns per line. The first column is the  x-axis  values,  and
              the second column is the y-axis values. The distance of the points from the
              identity line (y=x) is a measure of how good the exponential tail fit is: the
              closer the points are to the identity line, the better the fit.  Each series is
              delimited by a line with a single character "&".  The file will contain one  series
              of  empirical  data  for  each  of  the four exponential tail fits in the following
              order: glocal CYK, glocal Inside, local CYK and local Inside.
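
              The distance from the identity line can be summarized numerically. The Python
              sketch below takes one series already parsed into (x, y) pairs and reports the
              mean and maximum vertical distance from y=x; this particular summary and the
              function name are illustrative choices, not something cmcalibrate reports.

```python
def identity_line_deviation(points):
    """Return (mean, max) of |y - x| over a series of (x, y) points."""
    if not points:
        raise ValueError("series is empty")
    gaps = [abs(y - x) for x, y in points]
    return sum(gaps) / len(gaps), max(gaps)
```

              Smaller values indicate points lying closer to the identity line, and therefore
              a better fit.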

       --ffile <f>
              Save space delimited statistics of different exponential tail  fits  to  file  <f>.
              The  file  will  contain  the  lambda  and  mu  values for exponential tails fit to
              histogram  tails  of  different  sizes.  The  fields  in  the  file  are   labelled
              informatively.

       --xfile <f>
              Save a list of the scores in each fit histogram tail to file <f>.  Each line of
              this file contains one score, corresponding to one hit in the tail.
              Each series is delimited by a line with a single character "&". The
              file will contain one series for each of the four  exponential  tail  fits  in  the
              following order: glocal CYK, glocal Inside, local CYK, and local Inside.

OTHER OPTIONS

       --seed <n>
              Seed  the  random  number  generator with <n>, an integer >= 0.  If <n> is nonzero,
              stochastic simulations will be reproducible; the same command will  give  the  same
              results.   If  <n>  is  0,  the  random number generator is seeded arbitrarily, and
              stochastic simulations will vary from run to run of the same command.  The  default
              seed is 181.

       --beta <x>
              By default, query-dependent banding (QDB) is used to accelerate the CM search
              algorithms with a beta tail loss probability of 1E-15.  This beta value can be
              changed to <x> with --beta <x>.  The beta parameter is the amount of probability
              mass excluded during band calculation; higher values of beta give greater speedups
              but sacrifice more accuracy than lower values.
              (For more information on QDB see Nawrocki  and  Eddy,  PLoS  Computational  Biology
              3(3): e56.)

       --nonbanded
              Turn off QDB during E-value calibration. This will slow down calibration.

       --nonull3
              Turn  off  the null3 post hoc additional null model. This is not recommended unless
              you plan on using the same option to cmsearch and/or cmscan.

       --random
              Use the background null model of the CM to generate the random  sequences,  instead
              of  the  more  realistic  HMM.  Unless  the CM was built using the --null option to
              cmbuild, the background null model will be 25% each A, C, G and U.
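
              To make the idea of a background null model concrete, the Python sketch below
              draws an independent, identically distributed RNA sequence from per-nucleotide
              frequencies. It uses Python's own random number generator, not Infernal's, so it
              will not reproduce the sequences cmcalibrate generates even with the same seed;
              the function name is hypothetical.

```python
import random


def random_rna(length, seed=181, freqs=None):
    """Draw an i.i.d. sequence; the default is 25% each of A, C, G and U."""
    if freqs is None:
        freqs = {"A": 0.25, "C": 0.25, "G": 0.25, "U": 0.25}
    rng = random.Random(seed)  # private generator so runs are reproducible
    letters = list(freqs)
    weights = [freqs[c] for c in letters]
    return "".join(rng.choices(letters, weights=weights, k=length))
```

              Passing different frequencies mimics a CM built with a non-default null model.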

       --gc <f>
              Generate the random sequences using the nucleotide distribution from  the  sequence
              file <f>.

       --cpu <n>
              Specify  that  <n>  parallel  CPU  workers  be used. If <n> is set as "0", then the
              program will be run in serial mode, without using threads.  You  can  also  control
              this  number  by  setting an environment variable, INFERNAL_NCPU.  This option will
              only be available if the machine on which Infernal was built is  capable  of  using
              POSIX  threading  (see  the  Installation  section  of  the  user  guide  for  more
              information).

       --mpi  Run as an MPI parallel program. This option will only be available if Infernal  has
              been  configured  and  built  with  the  "--enable-mpi"  flag (see the Installation
              section of the user guide for more information).

SEE ALSO

       See infernal(1) for a master man page with a list of all  the  individual  man  pages  for
       programs in the Infernal package.

       For complete documentation, see the user guide that came with your Infernal distribution
       (Userguide.pdf), or see the Infernal web page.

COPYRIGHT

       Copyright (C) 2019 Howard Hughes Medical Institute.
       Freely distributed under the BSD open source license.

       For additional information on copyright and licensing, see the file  called  COPYRIGHT  in
       your Infernal source distribution, or see the Infernal web page.

AUTHOR

       The Eddy/Rivas Laboratory
       Janelia Farm Research Campus
       19700 Helix Drive
       Ashburn VA 20147 USA
       http://eddylab.org