Provided by: sim4db_0~20150903+r2013-8build3_amd64

NAME

       sim4db - batch spliced alignment of cDNA sequences to a target genome

SYNOPSIS

       A simple command line invocation:

       sim4db -genomic g.fasta -cdna c.fasta -script script -output o.sim4db

       where:
          - 'c.fasta' and 'g.fasta' are the multi-fasta cDNA and genome sequence files, respectively
          - 'script' is a script file indicating individual alignments to be computed
          - output in sim4db format will be sent to the file 'o.sim4db' ('-' for standard output)

       A more complex invocation:

       sim4db -genomic g.fasta -cdna c.fasta -output o.sim4db [options]

DESCRIPTION

       sim4db  performs  fast batch alignment of large cDNA (EST, mRNA) sequence sets to a set of
       eukaryotic genomic regions. It uses the  sim4  and  sim4cc  algorithms  to  determine  the
       alignments, but incorporates a fast sequence indexing and retrieval mechanism, implemented
       in the sister package leaff(1), to speedily process large volumes of sequences.

       While sim4db produces alignments in the same way as sim4  or  sim4cc,  it  has  additional
       features to make it more amenable for use with whole-genome annotation pipelines. A script
       file can be used to group pairings between cDNAs and their corresponding genomic  regions,
       to  be  aligned  as  one  run and using the same set of parameters. Sim4db also optionally
       reports more than one alignment for the same cDNA within a genomic region, as long as they
       meet  user-defined  criteria  such  as  minimum  length,  percentage  sequence identity or
       coverage. This feature is instrumental in finding all alignments of a gene family  at  one
       locus.  Lastly, the output is presented either as custom sim4db alignments or as GFF3 gene
       features.

OPTIONS

       Salient options:
              -cdna         use these cDNA sequences (multi-fasta file)
              -genomic      use these genomic sequences (multi-fasta file)
              -script       use this script file
              -pairwise     sequentially align pairs of sequences

                            If neither the '-script' nor the '-pairwise' option
                            is specified, sim4db performs all-against-all
                            alignments between pairs of cDNA and genomic sequences.

              -output       write output to this file
              -gff3         report output in GFF3 format
              -interspecies use sim4cc for inter-species alignments (default sim4)
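
       As an illustration of combining the salient options above, the following minimal
       sketch assembles a pairwise run that writes GFF3 to standard output. The file names
       'g.fasta' and 'c.fasta' and the redirect target 'pairwise.gff3' are placeholders, not
       part of sim4db; the command is only executed when sim4db and both inputs are present.

```shell
# Placeholder input files (assumptions); substitute your own multi-fasta files.
genomic=g.fasta
cdna=c.fasta

# Pairwise mode, GFF3 output, written to standard output ('-').
cmd="sim4db -genomic $genomic -cdna $cdna -pairwise -gff3 -output -"

# Always show the command; run it only if sim4db and the inputs exist.
echo "$cmd"
if command -v sim4db >/dev/null 2>&1 && [ -f "$genomic" ] && [ -f "$cdna" ]; then
    $cmd > pairwise.gff3
fi
```

       Dropping '-pairwise' from the command line would request the all-against-all
       behaviour described above instead.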

       Filter options:
              -mincoverage  iteratively find all exon models with the specified
                            minimum PERCENT COVERAGE
              -minidentity  iteratively find all exon models with the specified
                            minimum PERCENT EXON IDENTITY
              -minlength    iteratively find all exon models with the specified
                            minimum ABSOLUTE COVERAGE (number of bp matched)
                            (default 0)
              -alwaysreport always report <number> exon models, even if they
                            are below the quality thresholds

                If none of mincoverage, minidentity and minlength is given, only
                the best exon model is returned. This is the DEFAULT operation.

                You will probably want to specify ALL THREE of mincoverage,
                minidentity and minlength!  Don't assume the default values
                are what you want!

                You will DEFINITELY want to specify at least one of mincoverage,
                minidentity and minlength with alwaysreport!  If you don't,
                mincoverage will be set to 90 and minidentity to 95 -- to reduce
                the number of spurious matches when a good match is found.
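
       Following the advice above to set all three thresholds explicitly, a run might be
       sketched as below. The threshold values (50, 90, 100) and the file names are
       illustrative assumptions only, not recommended settings; choose values suited to
       your data.

```shell
# Placeholder input files (assumptions); substitute your own.
genomic=g.fasta
cdna=c.fasta

# Illustrative thresholds only: percent coverage, percent identity, matched bp.
min_cov=50
min_id=90
min_len=100

cmd="sim4db -genomic $genomic -cdna $cdna -mincoverage $min_cov -minidentity $min_id -minlength $min_len -output o.sim4db"

# Always show the command; run it only if sim4db and the inputs exist.
echo "$cmd"
if command -v sim4db >/dev/null 2>&1 && [ -f "$genomic" ] && [ -f "$cdna" ]; then
    $cmd
fi
```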

       Auxiliary options:
              -nodeflines   don't include the defline in the sim4db output
              -alignments   print alignments

              -polytails    DON'T mask poly-A and poly-T tails
              -cut          trim marginal exons if A/T % > x (poly-AT tails)

              -noncanonical don't force canonical splice sites
              -splicemodel  use the following splice model: 0 - original sim4;
                            1 - GeneSplicer; 2 - Glimmer;  options 1 and 2 are
                            only available with '-interspecies'.
                            Default for sim4 is 0, and for sim4cc is 1.

              -forcestrand  force the strand prediction to always be
                            one of 'forward' or 'reverse'
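
       Because splice models 1 and 2 are only available with '-interspecies', the two
       options are given together. A minimal sketch, with placeholder file names and an
       arbitrary choice of model 2, that also prints the alignments:

```shell
# Placeholder input files (assumptions); substitute your own.
genomic=g.fasta
cdna=c.fasta

# Inter-species alignment (sim4cc) with splice model 2 and alignments printed.
cmd="sim4db -genomic $genomic -cdna $cdna -interspecies -splicemodel 2 -alignments -output o.sim4db"

# Always show the command; run it only if sim4db and the inputs exist.
echo "$cmd"
if command -v sim4db >/dev/null 2>&1 && [ -f "$genomic" ] && [ -f "$cdna" ]; then
    $cmd
fi
```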

       Execution options:
              -threads      use n threads
              -touch        create this file when the program finishes execution
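
       In a pipeline, the '-touch' file can serve as a completion marker. The sketch below
       assumes four threads and a marker file named 'sim4db.done'; both, like the input
       file names, are arbitrary placeholders.

```shell
# Placeholder names (assumptions); substitute your own.
genomic=g.fasta
cdna=c.fasta
done_flag=sim4db.done

cmd="sim4db -genomic $genomic -cdna $cdna -threads 4 -touch $done_flag -output o.sim4db"

# Always show the command; run it only if sim4db and the inputs exist.
echo "$cmd"
if command -v sim4db >/dev/null 2>&1 && [ -f "$genomic" ] && [ -f "$cdna" ]; then
    rm -f "$done_flag"        # clear any stale marker from a previous run
    $cmd
    if [ -f "$done_flag" ]; then
        echo "sim4db finished; marker file $done_flag is present"
    fi
fi
```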

       Debugging options:
              -v            print status to stderr while running
              -V            print script lines (stderr) as they are being processed

       Developer options:
              -Z            set the spaced seed pattern
              -H            set the relink weight factor (H=1000 recommended for mRNAs)
              -K            set the first MSP threshold
              -C            set the second MSP threshold
              -Ma           set the limit of the number of MSPs allowed
              -Mp           same, as percentage of bases in cDNA
                            NOTE:  If used, both -Ma and -Mp must be specified!

SEE ALSO

       README.sim4db
       http://kmer.sourceforge.net/wiki/index.php/Getting_Started_with_Sim4db

       The sim4dbutils(1) package contains a range of utilities  to  work  with  sim4db-generated
       alignment   files,   of   particular  note  being  convertPolishes(1),  filterPolishes(1),
       mergePolishes(1), and sortPolishes(1).

                                           January 2016                                 SIM4DB(1)