
Provided by: infernal_1.1.4-1_amd64

NAME

       cmemit - sample sequences from a covariance model

SYNOPSIS

       cmemit [options] <cmfile>

DESCRIPTION

       The cmemit program samples (emits) sequences from the covariance model(s) in <cmfile>, and
       writes them to output.  Sampling sequences may  be  useful  for  a  variety  of  purposes,
       including creating synthetic true positives for benchmarks or tests.

       The default is to sample ten unaligned sequences from each CM. Alternatively, with the -c
       option, you can emit a single majority-rule consensus sequence; or with the -a option, you
       can emit an alignment.

       The <cmfile> may contain a library of CMs, in which case each CM will be used in turn.

       <cmfile> may be '-' (dash), in which case the input is read from stdin rather than a file.
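
       The invocation patterns described above can be scripted. The following is a minimal
       sketch in Python, assuming cmemit is installed and on the PATH. The helper names
       build_cmemit_command and run_cmemit, and the file names in the usage note, are
       hypothetical and not part of Infernal; only the flags -N, --seed, -a and -o and the
       '-' convention come from this page.

```python
import subprocess


def build_cmemit_command(cmfile, n=None, seed=None, aligned=False, outfile=None):
    """Assemble an argument list for cmemit.

    Hypothetical helper for illustration; it only uses options
    documented on this page (-N, --seed, -a, -o).
    """
    cmd = ["cmemit"]
    if n is not None:
        cmd += ["-N", str(n)]        # number of sequences to generate
    if seed is not None:
        cmd += ["--seed", str(seed)]  # nonzero seed gives reproducible output
    if aligned:
        cmd.append("-a")             # emit an alignment instead of FASTA
    if outfile is not None:
        cmd += ["-o", outfile]       # write to a file instead of stdout
    cmd.append(cmfile)               # '-' means: read the CM from stdin
    return cmd


def run_cmemit(cmd, cm_text=None):
    """Run the command; requires cmemit to be installed.

    cm_text is passed on stdin, which is only meaningful when the
    <cmfile> argument is '-'.
    """
    return subprocess.run(cmd, input=cm_text, capture_output=True,
                          text=True, check=True)
```

       For example, build_cmemit_command("my.cm", n=5, seed=42) yields the argument list for
       "cmemit -N 5 --seed 42 my.cm", where my.cm is a placeholder file name.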

       For  models with zero basepairs, sequences are sampled from the profile HMM filter instead
       of the CM. However, since these models will be nearly identical  (unless  special  options
       were used in cmbuild to prevent this), using the HMM instead of the CM will not change the
       output in a significant way, unless the -l option is  used.  With  -l,  the  HMM  will  be
       configured  for equiprobable model begin and end positions, while the CM will not. You can
       force cmemit to always sample from the CM with the --nohmmonly option.

OPTIONS

       -h     Help; print a brief reminder of command line usage and available options.

       -o <f> Save the synthetic sequences to file <f> rather than writing them to stdout.

       -N <n> Generate <n> sequences. The default value for <n> is 10.

       -u     Write the generated sequences in unaligned format  (FASTA).  This  is  the  default
              behavior.

       -a     Write  the  generated  sequences  in  an  aligned format (STOCKHOLM) with consensus
              structure annotation rather than FASTA. Other output formats are possible with  the
              --outformat option.

       -c     Predict  a  single  majority-rule  consensus sequence instead of sampling sequences
              from the CM's probability distribution.  Highly  conserved  residues  (base  paired
              residues  that  score  higher than 3.0 bits, or single stranded residues that score
              higher than 1.0 bits) are shown in upper case; others are shown in lower case.
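
              The upper/lower case rule can be written out as a small sketch. The function
              name consensus_case and its arguments are hypothetical; this illustrates only
              the thresholds stated above, not how cmemit computes the scores.

```python
def consensus_case(residue, score_bits, base_paired):
    """Return the residue in the case the -c rule above would give it.

    Base-paired residues need a score above 3.0 bits to count as highly
    conserved; single-stranded residues need a score above 1.0 bits.
    """
    threshold = 3.0 if base_paired else 1.0
    if score_bits > threshold:
        return residue.upper()   # highly conserved
    return residue.lower()       # everything else
```

              Under this rule, a base-paired G scoring 2.0 bits is shown as "g", while a
              single-stranded A scoring 1.5 bits is shown as "A".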

       -e <n> Embed the CM emitted sequences in a larger randomly generated sequence of length
              <n>, generated from an HMM that was trained on real genomic sequences with various
              GC contents (the same HMM used by cmcalibrate). With the --iid option, the larger
              sequence is instead generated with A, C, G, and U each at 25% frequency. The CM
              emitted sequence will begin at a random position within the larger sequence and
              will be included in its entirety unless the --u5p or --u3p options are used. When
              -e is used in combination with --u5p, the CM emitted sequence will always begin at
              position 1 of the larger sequence and will be truncated 5'. When used in
              combination with --u3p, the CM emitted sequence will always end at position <n> of
              the larger sequence and will be truncated 3'.
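
              The placement rules can be illustrated with a short sketch. It assumes the
              --iid style background (each of A, C, G, U equally likely), a non-empty emitted
              sequence no longer than <n>, and one truncation mode at a time; none of these
              restrictions are stated by this page. The function embed is hypothetical and
              is not Infernal's implementation.

```python
import random

ALPHABET = "ACGU"


def embed(emitted, n, rng, u5p=False, u3p=False):
    """Embed `emitted` in a random background of total length n.

    Returns (full_sequence, offset, kept), where offset is the 0-based
    start of the kept part (position 1 in the text above is offset 0).
    """
    if u5p and u3p:
        raise ValueError("this sketch handles one truncation mode at a time")
    if not 0 < len(emitted) <= n:
        raise ValueError("sketch assumes 0 < len(emitted) <= n")
    if u5p:
        # Truncated 5': drop a random-length prefix, place at the very start.
        kept = emitted[rng.randrange(len(emitted)):]
        offset = 0
    elif u3p:
        # Truncated 3': drop a random-length suffix, place at the very end.
        kept = emitted[:rng.randrange(1, len(emitted) + 1)]
        offset = n - len(kept)
    else:
        # Whole sequence, at a random position that still fits.
        kept = emitted
        offset = rng.randrange(n - len(kept) + 1)
    left = "".join(rng.choice(ALPHABET) for _ in range(offset))
    right = "".join(rng.choice(ALPHABET) for _ in range(n - offset - len(kept)))
    return left + kept + right, offset, kept
```

              Passing the emitted sequence in lower case makes it easy to spot inside the
              upper-case background when inspecting the result.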

       -l     Configure  the  CMs into local mode before emitting sequences. By default the model
              will be in global mode. In local mode, large  insertions  and  deletions  are  more
              common than in global mode.

OPTIONS FOR TRUNCATING EMITTED SEQUENCES

       --u5p  Truncate  all  emitted  sequences  at a randomly chosen start position <n>, by only
              outputting residues beginning at <n>.  A different start point is  randomly  chosen
              for each sequence.

       --u3p  Truncate  all  emitted  sequences  at  a  randomly chosen end position <n>, by only
              outputting residues up to position <n>.  A different end point is  randomly  chosen
              for each sequence.

       --a5p <n>
              In  combination  with  the  -a option, truncate the emitted alignment at a randomly
              chosen start match position <n>, by only outputting alignment columns for positions
              after  match  state  <n>  -  1.  <n> must be an integer between 0 and the consensus
              length of the model (which can be determined using the cmstat program). As a special
              case, using 0 as <n> will result in a randomly chosen start position.

       --a3p <n>
              In  combination  with  the  -a option, truncate the emitted alignment at a randomly
              chosen end match position <n>, by only outputting alignment columns  for  positions
              before  match  state  <n>  + 1.  <n> must be an integer between 1 and the consensus
              length of the model (which can be  determined  using  the  cmstat  program).  As  a
              special case, using 0 as <n> will result in a randomly chosen end position.

OTHER OPTIONS

       --seed <n>
              Seed  the  random  number  generator  with <n>, an integer >= 0. If <n> is nonzero,
              stochastic sampling of sequences will be reproducible; the same command  will  give
              the  same results.  If <n> is 0, the random number generator is seeded arbitrarily,
              and stochastic samplings will vary from run  to  run  of  the  same  command.   The
              default seed is 0.
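
              The seeding convention can be mimicked in a few lines. This sketch uses
              Python's own random number generator, not the one in Infernal, so it
              reproduces the behavior described above (nonzero seed: reproducible; zero:
              arbitrary) but not cmemit's actual output. The names make_rng and sample are
              hypothetical.

```python
import random


def make_rng(seed=0):
    """Return a generator following the --seed convention described above."""
    if seed < 0:
        raise ValueError("seed must be an integer >= 0")
    if seed == 0:
        return random.Random()   # arbitrary seed: results vary run to run
    return random.Random(seed)   # fixed seed: results are reproducible


def sample(rng, length=20):
    """Draw a toy RNA sequence, standing in for a stochastic sampling run."""
    return "".join(rng.choice("ACGU") for _ in range(length))
```

              Two generators created with the same nonzero seed, such as make_rng(42),
              produce identical samples; generators created with seed 0 generally do not.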

       --iid  With -e, generate the larger sequences as 25% each A, C, G and U.

       --rna  Specify  that  the  emitted  sequences  be output as RNA sequences. This is true by
              default.

       --dna  Specify that the emitted sequences be output as  DNA  sequences.  By  default,  the
              output alphabet is RNA.

       --idx <n>
              Specify  that  the  emitted  sequences  be named starting with <modelname>.<n>.  By
              default <n> is 1.

       --outformat <s>
              With -a, specify the output alignment format as <s>.  Acceptable formats are: Pfam,
              AFA,  A2M,  Clustal,  and  Phylip.   AFA  is aligned fasta. Only Pfam and Stockholm
              alignment formats will include consensus structure annotation.

       --tfile <f>
              Dump tabular sequence parsetrees (tracebacks) for each  emitted  sequence  to  file
              <f>.  Primarily useful for debugging.

       --exp <x>
              Exponentiate  the  emission  and transition probabilities of the CM by <x> and then
              renormalize those distributions before emitting sequences. This option changes  the
              CM  probability  distribution of parsetrees relative to default. With <x> less than
              1.0 the emitted sequences will tend to have lower bit scores upon alignment to  the
              CM.   With <x> greater than 1.0, the emitted sequences will tend to have higher bit
              scores upon alignment to the CM. This bit score difference  will  increase  as  <x>
              moves  further  away  from 1.0 in either direction.  If <x> equals 1.0, this option
              has no effect relative to default.  This option is useful for generating  sequences
              that are either more difficult (<x> < 1.0) or easier (<x> > 1.0) for the CM to
              distinguish as homologous from background, random sequence.
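
              The exponentiate-and-renormalize step can be shown on a single toy
              distribution. This is a sketch of the arithmetic only, assuming one
              probability vector over A, C, G, U; the function name exponentiate and the
              example values are hypothetical, and cmemit applies the operation to every
              emission and transition distribution in the model.

```python
def exponentiate(probs, x):
    """Raise each probability to the power x, then renormalize to sum to 1."""
    powered = [p ** x for p in probs]
    total = sum(powered)
    return [p / total for p in powered]


# Toy emission distribution with one strongly preferred residue.
dist = [0.7, 0.1, 0.1, 0.1]
sharper = exponentiate(dist, 2.0)   # x > 1.0: mass concentrates on the favorite
flatter = exponentiate(dist, 0.5)   # x < 1.0: distribution moves toward uniform
```

              With x greater than 1.0 the most probable outcome becomes even more probable,
              which is why sampled sequences tend to score higher; with x less than 1.0 the
              distribution flattens and sampled sequences tend to score lower.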

       --hmmonly
              Emit from the filter profile HMM instead of the CM.

       --nohmmonly
              Never emit from the filter profile HMM, always use the CM,  even  for  models  with
              zero basepairs.

SEE ALSO

       See  infernal(1)  for  a  master  man page with a list of all the individual man pages for
       programs in the Infernal package.

       For complete documentation, see the user guide that came with your  Infernal  distribution
       (Userguide.pdf); or see the Infernal web page (http://eddylab.org/infernal/).

       Copyright (C) 2020 Howard Hughes Medical Institute.
       Freely distributed under the BSD open source license.

       For additional information on copyright and licensing, see the file called COPYRIGHT in
       your Infernal source distribution, or see the Infernal web page
       (http://eddylab.org/infernal/).

AUTHOR

       http://eddylab.org