NAME

       cmemit - sample sequences from a covariance model

SYNOPSIS

       cmemit [options] <cmfile>

DESCRIPTION

       The cmemit program samples (emits) sequences from the covariance model(s) in <cmfile>, and writes them to
       output.  Sampling sequences may be useful for a variety of purposes, including  creating  synthetic  true
       positives for benchmarks or tests.

       The default is to sample ten unaligned sequences from each CM. Alternatively, with the -c option, you can
       emit a single majority-rule consensus sequence; or with the -a option, you can emit an alignment.

       The <cmfile> may contain a library of CMs, in which case each CM will be used in turn.

       <cmfile> may be '-' (dash), which means reading this input from stdin rather than a file.

       For models with zero basepairs, sequences are sampled from the profile HMM  filter  instead  of  the  CM.
       However,  since  these  models  will  be nearly identical (unless special options were used in cmbuild to
       prevent this), using the HMM instead of the CM will not change the output in a  significant  way,  unless
       the  -l  option  is  used.  With  -l,  the  HMM  will  be configured for equiprobable model begin and end
       positions, while the CM will not. You can force cmemit to always sample from the CM with the  --nohmmonly
       option.

OPTIONS

       -h     Help; print a brief reminder of command line usage and available options.

       -o <f> Save the synthetic sequences to file <f> rather than writing them to stdout.

       -N <n> Generate <n> sequences. The default value for <n> is 10.

       -u     Write the generated sequences in unaligned format (FASTA). This is the default behavior.

       -a     Write the generated sequences in an aligned format (STOCKHOLM) with consensus structure annotation
              rather than FASTA. Other output formats are possible with the --outformat option.

       -c     Predict a single majority-rule consensus sequence instead of sampling sequences from the CM's
              probability distribution. Highly conserved residues (base paired residues that score higher than
              3.0 bits, or single stranded residues that score higher than 1.0 bits) are shown in upper case;
              others are shown in lower case.
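
       The casing rule described for -c can be illustrated with a minimal Python sketch. This is not
       Infernal's code: the function names and the (residue, score, paired) input tuples are hypothetical,
       and only the two thresholds come from the text above.

```python
# Illustrative sketch only; not taken from the Infernal source.
PAIRED_BITS = 3.0  # threshold for base paired residues (from the -c description)
SINGLE_BITS = 1.0  # threshold for single stranded residues (from the -c description)


def consensus_case(residue: str, score_bits: float, is_paired: bool) -> str:
    """Upper-case a residue that scores strictly above its threshold, else lower-case it."""
    threshold = PAIRED_BITS if is_paired else SINGLE_BITS
    return residue.upper() if score_bits > threshold else residue.lower()


def render_consensus(positions) -> str:
    """positions: iterable of (residue, score_bits, is_paired) tuples (hypothetical input)."""
    return "".join(consensus_case(r, s, p) for r, s, p in positions)
```

       Note that a score exactly equal to the threshold is rendered in lower case here, following the
       "higher than" wording.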

       -e <n> Embed each CM emitted sequence in a larger random sequence of length <n>, generated from an HMM
              that was trained on real genomic sequences with various GC contents (the same HMM used by
              cmcalibrate). You can use the --iid option to generate the larger sequence as 25% each A, C, G,
              and U instead. The CM emitted sequence will begin at a random position within the larger
              sequence and will be included in its entirety unless the --u5p or --u3p options are used. When
              -e is used in combination with --u5p, the CM emitted sequence will always begin at position 1
              of the larger sequence and will be truncated 5'. When used in combination with --u3p, the CM
              emitted sequence will always end at position <n> of the larger sequence and will be truncated
              3'.
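
       As a rough illustration of the default embedding behavior (with an --iid style background), the
       following Python sketch places an emitted sequence whole at a random position in a longer random
       sequence. It is a sketch under stated assumptions, not Infernal's implementation: the name
       embed_iid is hypothetical, the start position is assumed to be chosen uniformly, and the error
       raised when <n> is shorter than the emitted sequence is this sketch's own choice.

```python
import random

ALPHABET = "ACGU"


def embed_iid(emitted: str, n: int, rng: random.Random):
    """Embed `emitted` whole at a random position in an iid background of length n.

    Returns (larger_sequence, start), where start is the 1-based position of the
    first emitted residue. Assumes n >= len(emitted).
    """
    if n < len(emitted):
        raise ValueError("n must be at least the emitted sequence length")
    # Background: each residue drawn independently, 25% each A, C, G, U.
    background = [rng.choice(ALPHABET) for _ in range(n)]
    # Assumption: the start is uniform over all positions that fit the whole sequence.
    start0 = rng.randint(0, n - len(emitted))
    background[start0:start0 + len(emitted)] = emitted
    return "".join(background), start0 + 1
```

       For example, embed_iid("GGGAAACCC", 50, random.Random(1)) returns a 50-residue sequence that
       contains GGGAAACCC intact at the reported start position.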

       -l     Configure the CMs into local mode before emitting sequences. By default the model will be in
              global mode. In local mode, large insertions and deletions are more common than in global mode.

OPTIONS FOR TRUNCATING EMITTED SEQUENCES

       --u5p  Truncate  all  emitted  sequences  at  a  randomly  chosen  start position <n>, by only outputting
              residues beginning at <n>.  A different start point is randomly chosen for each sequence.

       --u3p  Truncate all emitted sequences at a randomly chosen end position <n>, by only outputting  residues
              up to position <n>.  A different end point is randomly chosen for each sequence.
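
       The two truncation modes above can be sketched in Python as follows. The function names are
       hypothetical, and the uniform choice of the cut position is an assumption; the text only says
       that the position is chosen randomly and independently for each sequence.

```python
import random


def truncate_5p(seq: str, rng: random.Random) -> str:
    """Keep residues from a random 1-based start position to the end (like --u5p)."""
    start = rng.randint(1, len(seq))  # assumption: uniform over 1..len(seq)
    return seq[start - 1:]


def truncate_3p(seq: str, rng: random.Random) -> str:
    """Keep residues from position 1 up to a random 1-based end position (like --u3p)."""
    end = rng.randint(1, len(seq))  # assumption: uniform over 1..len(seq)
    return seq[:end]
```

       Because the random generator is consulted once per call, each sequence gets its own cut point.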

       --a5p <n>
              In combination with the -a option, truncate the emitted alignment at start match position <n>,
              by only outputting alignment columns for positions after match state <n> - 1. <n> must be an
              integer between 0 and the consensus length of the model (which can be determined using the
              cmstat program). As a special case, using 0 as <n> will result in a randomly chosen start
              position.

       --a3p <n>
              In combination with the -a option, truncate the emitted alignment at end match position <n>, by
              only outputting alignment columns for positions before match state <n> + 1. <n> must be an
              integer between 1 and the consensus length of the model (which can be determined using the
              cmstat program). As a special case, using 0 as <n> will result in a randomly chosen end
              position.
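
       One literal reading of the column rules for --a5p and --a3p is sketched below in Python. This is
       an interpretation, not Infernal's code: the function names and the is_match mask (True for
       consensus/match columns, False for insert columns) are hypothetical, insert columns adjacent to
       the cut are assumed to be kept, and the special value 0 (random position) is not modeled.

```python
def match_columns(is_match):
    """Return the 0-based column indices of the match (consensus) columns."""
    return [i for i, m in enumerate(is_match) if m]


def truncate_alignment_5p(rows, is_match, n):
    """Keep only the columns after match position n - 1 (1 <= n <= consensus length)."""
    cols = match_columns(is_match)
    if not 1 <= n <= len(cols):
        raise ValueError("n must be between 1 and the consensus length")
    first = 0 if n == 1 else cols[n - 2] + 1  # first column after match n - 1
    return [row[first:] for row in rows]


def truncate_alignment_3p(rows, is_match, n):
    """Keep only the columns before match position n + 1 (1 <= n <= consensus length)."""
    cols = match_columns(is_match)
    if not 1 <= n <= len(cols):
        raise ValueError("n must be between 1 and the consensus length")
    last = len(is_match) if n == len(cols) else cols[n]  # column of match n + 1
    return [row[:last] for row in rows]
```

       With rows ["GA.CU", "GAaCU"] and the third column marked as an insert column, cutting at start
       match position 3 keeps ".CU" and "aCU" under this reading.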

OTHER OPTIONS

       --seed <n>
              Seed the random number generator with <n>, an integer >= 0. If <n> is nonzero, stochastic sampling
              of sequences will be reproducible; the same command will give the same results.  If <n> is 0,  the
              random  number generator is seeded arbitrarily, and stochastic samplings will vary from run to run
              of the same command.  The default seed is 0.
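
       The seeding convention described for --seed (nonzero means reproducible, 0 means arbitrary) can
       be mimicked in Python as shown below. The helper make_rng is hypothetical and uses Python's own
       generator, so it will not reproduce the sequences cmemit emits for a given seed.

```python
import random


def make_rng(seed: int) -> random.Random:
    """Return a generator that is reproducible for seed > 0 and arbitrary for seed == 0."""
    if seed < 0:
        raise ValueError("seed must be an integer >= 0")
    if seed == 0:
        # No explicit seed: Python seeds from an OS entropy source or the clock.
        return random.Random()
    return random.Random(seed)
```

       Two generators built with the same nonzero seed produce the same stream of random numbers, which
       is what makes repeated runs of the same command give the same results.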

       --iid  With -e, generate the larger sequences as 25% each A, C, G and U.

       --rna  Specify that the emitted sequences be output as RNA sequences. This is true by default.

       --dna  Specify that the emitted sequences be output as DNA sequences. By default, the output alphabet  is
              RNA.

       --idx <n>
              Specify that the emitted sequences be named starting with <modelname>.<n>.  By default <n> is 1.

       --outformat <s>
              With  -a,  specify  the  output  alignment format as <s>.  Acceptable formats are: Pfam, AFA, A2M,
              Clustal, and Phylip.  AFA is aligned fasta. Only Pfam and Stockholm alignment formats will include
              consensus structure annotation.

       --tfile <f>
              Dump  tabular  sequence  parsetrees (tracebacks) for each emitted sequence to file <f>.  Primarily
              useful for debugging.

       --exp <x>
              Exponentiate the emission and transition probabilities of the CM by <x> and then renormalize those
              distributions  before  emitting  sequences. This option changes the CM probability distribution of
              parsetrees relative to default. With <x> less than 1.0 the emitted sequences  will  tend  to  have
              lower  bit scores upon alignment to the CM.  With <x> greater than 1.0, the emitted sequences will
              tend to have higher bit scores upon alignment to the CM. This bit score difference  will  increase
              as  <x>  moves  further  away from 1.0 in either direction.  If <x> equals 1.0, this option has no
              effect relative to default.  This option is useful for generating sequences that are  either  more
              difficult (<x> < 1.0) or easier (<x> > 1.0) for the CM to distinguish as homologous from
              background, random sequence.
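
       The effect of exponentiating and renormalizing can be seen on a single probability distribution.
       The Python sketch below applies the operation to one hypothetical four-residue emission
       distribution; the function name is an assumption, and cmemit applies the same idea to every
       emission and transition distribution in the CM.

```python
def exponentiate(probs, x):
    """Raise each probability to the power x, then renormalize so the result sums to 1."""
    powered = [p ** x for p in probs]
    total = sum(powered)
    return [p / total for p in powered]


# Hypothetical emission distribution over A, C, G, U.
dist = [0.7, 0.1, 0.1, 0.1]
sharper = exponentiate(dist, 2.0)  # <x> > 1.0: the most probable residue gains probability
flatter = exponentiate(dist, 0.5)  # <x> < 1.0: the distribution moves toward uniform
```

       With <x> greater than 1.0 the likeliest choices dominate even more, so sampled sequences look
       more like the consensus; with <x> less than 1.0 unlikely choices are sampled more often.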

       --hmmonly
              Emit from the filter profile HMM instead of the CM.

       --nohmmonly
              Never emit from the filter profile HMM, always use the CM, even for models with zero basepairs.

COPYRIGHT

       Copyright (C) 2019 Howard Hughes Medical Institute.
       Freely distributed under the BSD open source license.

       For additional information on copyright and licensing, see the file called  COPYRIGHT  in  your  Infernal
       source distribution, or see the Infernal web page.

AUTHOR

       The Eddy/Rivas Laboratory
       Janelia Farm Research Campus
       19700 Helix Drive
       Ashburn VA 20147 USA
       http://eddylab.org