Ubuntu Manpage: cmcalibrate - fit exponential tails for covariance model E-value determination

NAME

       cmcalibrate - fit exponential tails for covariance model E-value determination

SYNOPSIS

       cmcalibrate [options] cmfile

DESCRIPTION

cmcalibrate determines exponential tail parameters for E-value determination by generating
random sequences, searching them with the CM and collecting the scores of the resulting
hits. A histogram of the bit scores of the hits is fit to an exponential tail, and the
parameters of the fitted tail are saved to the CM file. The exponential tail parameters
are then used to estimate the statistical significance of hits found in cmsearch and
cmscan.
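
The fit described above can be illustrated with a short Python sketch. It shows a
generic maximum-likelihood fit of an exponential tail to the highest-scoring hits and is
not necessarily the exact procedure cmcalibrate implements. The function name
fit_exponential_tail and the toy scores are hypothetical.

```python
def fit_exponential_tail(scores, tail_n):
    """Fit an exponential tail to the tail_n highest bit scores.

    Returns (mu, lam): mu is the lowest score in the tail (where the tail
    starts) and lam is the maximum-likelihood decay rate, 1 / mean excess.
    """
    if tail_n < 2 or tail_n > len(scores):
        raise ValueError("tail_n must be between 2 and len(scores)")
    tail = sorted(scores, reverse=True)[:tail_n]
    mu = tail[-1]
    mean_excess = sum(s - mu for s in tail) / tail_n
    if mean_excess <= 0:
        raise ValueError("all tail scores are identical; cannot fit")
    return mu, 1.0 / mean_excess


# Toy bit scores standing in for hits to random sequence (made-up values).
toy_scores = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
mu, lam = fit_exponential_tail(toy_scores, tail_n=3)
print(mu, lam)  # prints "4.0 1.0"
```

Under this model, the probability that a hit in the tail scores at least x is
exp(-lam * (x - mu)) for x >= mu.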

A CM file must be calibrated with cmcalibrate before it can be used in cmsearch or cmscan,
with a single exception: it is not necessary to calibrate CM files that include only
models with zero basepairs before running cmsearch.

cmcalibrate is very slow. It takes a couple of hours to calibrate a single average-sized
CM on a single CPU. By default, cmcalibrate will run in parallel on four cores if Infernal
was built on a system that supports POSIX threading (see the Installation section of the
user guide for more information) and that system has at least four cores. Using <n> cores
will result in roughly <n>-fold acceleration versus a single CPU. You can set the number
of cores to use to <n> with the --cpu <n> option. MPI (Message Passing Interface) can also
be used for parallelization with the --mpi option if Infernal was built with MPI enabled, but
using more than 161 processors is not recommended because increasing past 161 won't
accelerate the calibration. See the Installation section of the user guide for more
information.

The --forecast option can be used to estimate how long the program will take to run for a
given cmfile on the current machine. To predict the running time on <n> processors with
MPI, additionally use the --nforecast <n> option.

Some large models require a lot of memory to calibrate. You can determine how much memory
is required with the --memreq option. For these models, you may be limited by the
available RAM on your system. Another strategy for parallelization that can be useful when
a lot of memory is required per core is to split the calibration into <n> separate
computations or partitions, each of which can be performed separately, potentially in
parallel if you have access to a computer cluster. The results from each computation can
then be merged together for the final calibration. To do this, first run cmcalibrate with
the --split, --ptot <n> and --cfile <f> options, which will save the <n> separate
partition commands into the file <f>. After all of these commands have been executed, you
can combine the results and create a calibrated model file by calling cmcalibrate again
with the --merge and --ptot <n> options. See the "Parallelizing calibration of large
models by splitting into partitions" subsection of the tutorial in the user's guide for
more information.
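
The first and last steps of this workflow can be sketched in Python as follows. The
sketch only assembles the command lines from the options named above and does not run
anything. The helper names and the file names my.cm and partitions.cmd are hypothetical.

```python
import shlex


def split_command(cmfile, n_partitions, cfile):
    """argv for the first step: write the partition commands to cfile."""
    return ["cmcalibrate", "--split", "--ptot", str(n_partitions),
            "--cfile", cfile, cmfile]


def merge_command(cmfile, n_partitions):
    """argv for the final step: merge partition scores and calibrate."""
    return ["cmcalibrate", "--merge", "--ptot", str(n_partitions), cmfile]


# Hypothetical file names, for illustration only.
print(shlex.join(split_command("my.cm", 4, "partitions.cmd")))
print(shlex.join(merge_command("my.cm", 4)))
```

In practice, the --split run prints the full --merge command to standard output, so it
is safer to copy that command than to rebuild it by hand.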

The random sequences searched in cmcalibrate are generated by an HMM that was trained on
real genomic sequences with various GC contents. The goal is to have the GC distributions
in the random sequences be similar to those in actual genomic sequences.

Four rounds of searches and subsequent exponential tail fits are performed, one each for
the four different CM algorithms that can be used in cmsearch and cmscan: glocal CYK,
glocal Inside, local CYK and local Inside.

The E-value parameters determined by cmcalibrate are only used by the cmsearch and cmscan
programs. If you are not going to use these programs, then do not waste time calibrating
your models.

OPTIONS

       -h     Help; print a brief reminder of command line usage and available options.

       -L <x> Set  the  total  length  of  random  sequences  to search to <x> megabases (Mb). By
              default, <x> is 1.6 Mb. Increasing <x> will make the  exponential  tail  fits  more
              precise and E-values more accurate, but will take longer (doubling <x> will roughly
              double the running time).  Decreasing <x> is not recommended as it  will  make  the
              fits less precise and the E-values less accurate.
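
The scaling rules of thumb stated in this page (running time roughly proportional to
<x>, roughly <n>-fold acceleration on <n> cores, and no gain past 161 processors) can be
combined into a back-of-the-envelope estimate. This is only a sketch built from those
statements; the function name rough_hours is hypothetical, and --forecast should be
preferred for an actual prediction.

```python
DEFAULT_MB = 1.6             # default value of -L, in megabases
MAX_USEFUL_PROCESSORS = 161  # from DESCRIPTION: no acceleration past this


def rough_hours(single_cpu_hours, total_mb=DEFAULT_MB, n_cores=1):
    """Very rough running-time estimate from the rules of thumb above.

    single_cpu_hours is the time for the default 1.6 Mb on one CPU.
    """
    if total_mb <= 0 or n_cores < 1:
        raise ValueError("total_mb must be > 0 and n_cores >= 1")
    effective_cores = min(n_cores, MAX_USEFUL_PROCESSORS)
    return single_cpu_hours * (total_mb / DEFAULT_MB) / effective_cores


print(rough_hours(2.0))                # 2.0: the single-CPU baseline
print(rough_hours(2.0, n_cores=4))     # 0.5: four cores
print(rough_hours(2.0, total_mb=3.2))  # 4.0: doubling -L doubles the time
```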

OPTIONS FOR PREDICTING REQUIRED TIME AND MEMORY

       --forecast
              Predict  the  running  time of the calibration of cmfile (with provided options) on
              the current machine and exit. The calibration is not  performed.   The  predictions
              should   be   considered   rough  estimates.  If  multithreading  is  enabled  (see
              Installation section of user guide), the timing will take into account  the  number
              of available cores.

       --nforecast <n>
              With  --forecast,  specify  that  <n>  processors will be used for the calibration.
              This might be useful for predicting the  running  time  of  an  MPI  run  with  <n>
              processors.

       --memreq
              Predict  the  amount  of  required  memory  for  calibrating  cmfile (with provided
              options) on the current machine and exit. The calibration is not performed.

OPTIONS CONTROLLING EXPONENTIAL TAIL FITS

       --gtailn <x>
              Fit the exponential tail for glocal Inside and glocal CYK to the <n> highest scores
              in  the  histogram  tail,  where  <n>  is  <x> times the number of Mb searched. The
              default value of <x> is 250.  The value  250  was  chosen  because  it  works  well
              empirically relative to other values.

       --ltailn <x>
              Fit  the  exponential tail for local Inside and local CYK to the <n> highest scores
              in the histogram tail, where <n> is <x>  times  the  number  of  Mb  searched.  The
              default  value  of  <x>  is  750.   The  value 750 was chosen because it works well
              empirically relative to other values.

       --tailp <x>
              Ignore the --gtailn and --ltailn options and instead fit the highest-scoring
              fraction <x> of the histogram to an exponential tail, for all search modes.
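
As a worked example of the --gtailn and --ltailn arithmetic, the number of scores fit
is <x> times the number of Mb searched. The sketch below applies this to the default
values; the function name tail_size is hypothetical, and rounding to the nearest integer
is an assumption.

```python
def tail_size(x, total_mb=1.6):
    """Number of highest scores fit: <n> = <x> times the Mb searched."""
    return round(x * total_mb)


print(tail_size(250))  # glocal default with 1.6 Mb: 400 scores
print(tail_size(750))  # local default with 1.6 Mb: 1200 scores
```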

OPTIONAL OUTPUT FILES

       --hfile <f>
              Save the histograms fit to file <f>. The format of this file is two space
              delimited columns per line. The first column is the x-axis values of bit scores of
              each bin. The second column is the y-axis values of number of hits per bin. Each
              series is delimited by a line with a single character "&". The file will contain
              one series for each of the four exponential tail fits in the following order:
              glocal CYK, glocal Inside, local CYK, and local Inside.

       --sfile <f>
              Save survival plot information to file <f>. The format of this file is two space
              delimited columns per line. The first column is the x-axis values of bit scores of
              each bin. The second column is the y-axis values of fraction of hits that meet or
              exceed the score for each bin. Each series is delimited by a line with a single
              character "&". The file will contain three series of data for each of the four CM
              search modes in the following order: glocal CYK, glocal Inside, local CYK, and
              local Inside. The first series is the empirical survival plot from the histogram
              of hits to the random sequence. The second series is the exponential tail fit to
              the empirical distribution. The third series is the exponential tail fit if lambda
              were fixed and set as the natural log of 2 (0.693147181).

       --qqfile <f>
              Save quantile-quantile plot information to file <f>. The format of this file is
              two space delimited columns per line. The first column is the x-axis values, and
              the second column is the y-axis values. The distance of the points from the
              identity line (y=x) is a measure of how good the exponential tail fit is; the
              closer the points are to the identity line, the better the fit is. Each series is
              delimited by a line with a single character "&". The file will contain one series
              of empirical data for each of the four exponential tail fits in the following
              order: glocal CYK, glocal Inside, local CYK and local Inside.

       --ffile <f>
              Save space delimited statistics of different exponential tail fits to file <f>.
              The file will contain the lambda and mu values for exponential tails fit to
              histogram tails of different sizes. The fields in the file are labelled
              informatively.

       --xfile <f>
              Save a list of the scores in each fit histogram tail to file <f>. Each line of
              this file holds one score, indicating that one hit with that score exists in the
              tail. Each series is delimited by a line with a single character "&". The
              file will contain one series for each of the four exponential tail fits in the
              following order: glocal CYK, glocal Inside, local CYK, and local Inside.
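
The two-column, "&"-delimited layout shared by several of these files can be read with
a few lines of Python. The sketch below parses such text into series and derives
survival fractions from a histogram series, mirroring the relationship between the
--hfile and --sfile data. The function names and the toy data are hypothetical, and
skipping blank lines and lines starting with "#" is an assumption about the file
contents.

```python
MODES = ["glocal CYK", "glocal Inside", "local CYK", "local Inside"]


def parse_series(text):
    """Split two-column text into series; a line holding only "&" ends a series."""
    series, current = [], []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):  # assumption: skip blanks/comments
            continue
        if line == "&":
            series.append(current)
            current = []
            continue
        x, y = line.split()
        current.append((float(x), float(y)))
    if current:  # tolerate a final series with no trailing "&"
        series.append(current)
    return series


def survival(histogram):
    """Fraction of hits scoring at or above each bin, from (score, count) pairs."""
    total = sum(count for _, count in histogram)
    if total == 0:
        return []
    remaining = total
    points = []
    for score, count in sorted(histogram):
        points.append((score, remaining / total))
        remaining -= count
    return points


# Made-up histogram data with two series, for illustration only.
toy = "1.0 2\n2.0 1\n3.0 1\n&\n5.0 3\n6.0 1\n&\n"
for mode, hist in zip(MODES, parse_series(toy)):
    print(mode, survival(hist))
```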

OPTIONS CONTROLLING SPLIT, PARTITION AND MERGE MODES

       --split
              Prepare  a  partitioned calibration. This option only works in combination with the
              --ptot <n> and --cfile <f> options, and will prepare a calibration split  into  <n>
              separate  partitions. The commands to run all of the partitions will be in the file
              <f>.

       --cfile <f>
              With --split, save the commands for all partitions to file <f>.

       --proot <s>
              With --split, specify that the per-partition scores files be  named  <s>.<n>  where
              <n>  is the partition index.  By default they will be named <s>.calib.<n> where <s>
              is the name of the CM file to be calibrated (including path).

       --part <n>
              Specify that this is partition <n> out of <n2> from --ptot <n2>. Must be used in
              combination with --ptot and --pfile.

       --ptot <n>
              With --split, --part or --merge, specify that there are <n> total partitions.

       --pfile <f>
              With --part, specify that scores for this partition be saved to file <f>.

       --merge
              Merge scores from multiple previously executed partitions and calibrate CMs. If you
              used the option --proot <s> with cmcalibrate when you ran it with --split to set up
              the  partitions,  use --proot <s> again with --merge.  The full cmcalibrate --merge
              command to  use  will  have  been  output  to  standard  output  when  the  initial
              cmcalibrate --split command was executed.
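
Between the --split and --merge steps, every command saved by --cfile has to be
executed. A minimal Python sketch of that middle step follows. It assumes the file
holds one shell command per line, which this page does not state explicitly; the
function names are hypothetical. On a cluster, each line would more typically be
submitted as a separate job.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def read_partition_commands(text):
    """Return the non-empty, non-comment lines, one command string each."""
    commands = []
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            commands.append(line)
    return commands


def run_partitions(cfile_path, max_parallel=1):
    """Run every partition command, at most max_parallel at a time.

    Keep max_parallel small when each partition needs a lot of memory.
    Returns the number of commands that were run.
    """
    with open(cfile_path) as handle:
        commands = read_partition_commands(handle.read())

    def run_one(command):
        # check=True raises CalledProcessError if a partition fails.
        return subprocess.run(command, shell=True, check=True)

    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        results = list(pool.map(run_one, commands))
    return len(results)
```

Once run_partitions returns without raising, the --merge step can be run.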

OTHER OPTIONS

       --seed <n>
              Seed the random number generator with <n>, an integer >= 0. If <n> is nonzero,
              stochastic simulations will be reproducible; the same command will give the same
              results. If <n> is 0, the random number generator is seeded arbitrarily, and
              stochastic simulations will vary from run to run of the same command. The default
              seed is 181.

       --beta <x>
              By default, query-dependent banding (QDB) is used to accelerate the CM search
              algorithms with a beta tail loss probability of 1E-15. This beta value can be
              changed to <x> with --beta <x>. The beta parameter is the amount of probability
              mass excluded during band calculation; higher values of beta give greater speedups
              but sacrifice more accuracy than lower values. (For more information on QDB see
              Nawrocki and Eddy, PLoS Computational Biology 3(3): e56.)

       --nonbanded
              Turn off QDB during E-value calibration. This will slow down calibration.

       --nonull3
              Turn off the null3 post hoc additional null model. This is not recommended unless
              you plan on using the same option with cmsearch and/or cmscan.

       --random
              Use the background null model of the CM to generate the random sequences, instead
              of the more realistic HMM. Unless the CM was built using the --null option to
              cmbuild, the background null model will be 25% each A, C, G and U.

       --gc <f>
              Generate the random sequences using the nucleotide distribution from the sequence
              file <f>.

       --cpu <n>
              Set the number of parallel worker threads to <n>. On multicore machines, the
              default is 4. You can also control this number by setting an environment variable,
              INFERNAL_NCPU. There is also a master thread, so the actual number of threads that
              Infernal spawns is <n>+1. This option is not available if Infernal was compiled
              with POSIX threads support turned off.

       --mpi  Run as an MPI parallel program. This option will only be available if Infernal has
              been configured and built with the "--enable-mpi" flag (see the Installation
              section of the user guide for more information).

COPYRIGHT

       Copyright (C) 2023 Howard Hughes Medical Institute.
       Freely distributed under the BSD open source license.

       For  additional  information  on copyright and licensing, see the file called COPYRIGHT in
       your    Infernal    source    distribution,    or    see    the    Infernal    web    page
       (http://eddylab.org/infernal/).

AUTHOR

       http://eddylab.org