Ubuntu Manpage: ecotag - description of ecotag

Provided by: obitools_1.2.12+dfsg-2_amd64

NAME

       ecotag - description of ecotag

       ecotag  is  the  tool  that assigns sequences to a taxon based on sequence similarity. The
       program first searches the reference database for  the  reference  sequence(s)  (hereafter
       referred  to  as  ‘primary reference sequence(s)’) showing the highest similarity with the
       query sequence. Then it looks for all other reference sequences (hereafter referred to  as
       ‘secondary  reference  sequences’) whose similarity with the primary reference sequence(s)
       is equal to or higher than the similarity between the primary reference and the query
       sequences.  Finally,  it  assigns the query sequence to the most recent common ancestor of
       the primary and secondary reference sequences.
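
       The three steps above can be sketched in Python as follows. This is a minimal
       illustration and not ecotag's implementation: the identity function (fraction of
       identical positions between equal-length strings), the toy taxonomy stored as a
       child-to-parent dictionary, and every name used here are assumptions.

```python
def identity(a, b):
    """Toy similarity: fraction of identical positions (assumes equal lengths)."""
    return sum(x == y for x, y in zip(a, b)) / max(len(a), len(b))


def lineage(taxid, parent):
    """Path from taxid up to the root (the root is its own parent)."""
    path = [taxid]
    while parent[path[-1]] != path[-1]:
        path.append(parent[path[-1]])
    return path


def common_ancestor(taxids, parent):
    """Most recent common ancestor of a non-empty collection of taxids."""
    taxids = list(taxids)
    shared = set(lineage(taxids[0], parent))
    for taxid in taxids[1:]:
        shared &= set(lineage(taxid, parent))
    # Climbing from any taxid, the first shared node is the common ancestor.
    return next(t for t in lineage(taxids[0], parent) if t in shared)


def assign(query, references, parent):
    """references: list of (sequence, taxid). Returns (assigned taxid, best identity)."""
    scores = [identity(query, seq) for seq, _ in references]
    best = max(scores)
    # Step 1: primary references are the best matches of the query.
    primary = [ref for ref, score in zip(references, scores) if score == best]
    selected = {taxid for _, taxid in primary}
    # Step 2: secondary references are at least as similar to a primary
    # reference as that primary reference is to the query.
    for primary_seq, _ in primary:
        for seq, taxid in references:
            if identity(primary_seq, seq) >= best:
                selected.add(taxid)
    # Step 3: assign to the common ancestor of all selected references.
    return common_ancestor(selected, parent), best


# Hypothetical taxonomy: 1 is the root, 10 a genus holding species 11 and 12.
parent = {1: 1, 10: 1, 11: 10, 12: 10, 20: 1}
references = [("ACGTACGTAG", 11), ("ACGTACGTAC", 12), ("TTTTTTTTTT", 20)]
taxid, best = assign("ACGTACGTGG", references, parent)
```

       In this toy data set the only primary reference belongs to species 11, but the
       reference of species 12 is as close to it as the query is, so the query is
       assigned to their shared genus (taxid 10) rather than to species 11.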

       As input, ecotag requires the sequences to be assigned,  a  reference  database  in  fasta
       format, where each sequence is associated with a taxon identified by a unique taxid, and a
       taxonomy database where taxonomic information is stored for each taxid.
          Example:

                 > ecotag -d embl_r113  -R ReferenceDB.fasta \
                   --sort=count -m 0.95 -r seq.fasta > seq_tag.fasta

              The above command specifies that each sequence stored in seq.fasta is  compared  to
              those  in the reference database called ReferenceDB.fasta for taxonomic assignment.
              In the output file seq_tag.fasta, the sequences are sorted from highest  to  lowest
              counts.  When no reference sequence reaches a similarity of at least 0.95 for a
              given sequence, no taxonomic information is provided for this sequence in
              seq_tag.fasta.

ECOTAG SPECIFIC OPTIONS

       -R <FILENAME>, --ref-database=<FILENAME>
              <FILENAME> is the fasta file containing the reference sequences.

       -m FLOAT, --minimum-identity=FLOAT
              When the best match with the reference database presents an identity level below
              FLOAT, the taxonomic assignment for the sequence record is not computed. The
              sequence record is nevertheless included in the output file. FLOAT must lie in the
              interval [0,1].

       --minimum-circle=FLOAT
              Minimum identity considered for the assignment circle. FLOAT must lie in the
              interval [0,1].

       -x RANK, --explain=RANK

       -u, --uniq
              When this option is specified, the program first dereplicates the sequence records
              to work on unique sequences only. This option greatly improves the program’s speed,
              especially for highly redundant datasets.
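
              The idea behind dereplication can be sketched as follows. This is an
              illustration only: the record layout, the function name and the
              case-insensitive comparison are assumptions, and the way ecotag itself
              merges identical records is not described here.

```python
def dereplicate(records):
    """Group (identifier, sequence) records by sequence, in first-seen order."""
    groups = {}
    for identifier, sequence in records:
        # Assumption: sequences differing only by letter case are identical.
        groups.setdefault(sequence.lower(), []).append(identifier)
    return groups


records = [("r1", "acgt"), ("r2", "ttga"), ("r3", "ACGT"), ("r4", "acgt")]
unique = dereplicate(records)
# Only len(unique) similarity searches are needed instead of len(records).
```

              With four input records and two distinct sequences, the costly search
              against the reference database would run twice instead of four times.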

       --sort=<KEY>
              The output is sorted on the values of the attribute <KEY>.

       -r, --reverse
              The output is sorted in reverse order (should be used with the --sort option).
              The option is also accepted when --sort is not set, but the key on which the
              output is then sorted is not documented.

       -E FLOAT, --errors=FLOAT
              FLOAT is the fraction of reference sequences that will be ignored when looking for
              the lowest common ancestor. This option is useful when a non-negligible proportion
              of reference sequences is expected to be assigned to the wrong taxon, for example
              because of taxonomic misidentification. FLOAT must lie in the interval [0,1].
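
              One possible reading of this option is sketched below: pick the deepest
              taxon that still covers at least a fraction (1 - FLOAT) of the selected
              reference sequences. The rule, the function names and the toy taxonomy are
              assumptions made for illustration; ecotag's actual algorithm, including
              its interaction with -M, may differ.

```python
import math
from collections import Counter


def lineage(taxid, parent):
    """Path from taxid up to the root (the root is its own parent)."""
    path = [taxid]
    while parent[path[-1]] != path[-1]:
        path.append(parent[path[-1]])
    return path


def tolerant_ancestor(taxids, parent, errors=0.0):
    """Deepest taxon covering at least (1 - errors) of the given taxids."""
    needed = max(1, math.ceil((1.0 - errors) * len(taxids)))
    covered = Counter()  # number of references found under each taxon
    depth = {}           # distance of each taxon from the root
    for taxid in taxids:
        for level, node in enumerate(reversed(lineage(taxid, parent))):
            covered[node] += 1
            depth[node] = level
    candidates = [node for node, count in covered.items() if count >= needed]
    # Prefer the deepest candidate; break ties by coverage.
    return max(candidates, key=lambda node: (depth[node], covered[node]))


# Hypothetical taxonomy: 1 is the root, 10 a genus holding species 11 and 12.
parent = {1: 1, 10: 1, 11: 10, 12: 10, 20: 1}
# Three references agree on species 11, one is annotated with the unrelated taxon 20.
taxids = [11, 11, 11, 20]
strict = tolerant_ancestor(taxids, parent, errors=0.0)
tolerant = tolerant_ancestor(taxids, parent, errors=0.25)
```

              Without tolerance the single discordant reference pushes the result up to
              the root; ignoring a quarter of the references keeps it at species 11.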

       -M INTEGER, --min-matches=INTEGER
              Defines the minimum number of congruent assignments. If this minimum is reached and
              the -E option is activated, the lowest common ancestor algorithm tolerates that
              some sequences do not provide the same taxonomic annotation (see the -E option).

       --cache-size=INTEGER
              ecotag maintains a cache of computed similarities. The default size of this cache
              is 1,000,000 scores. This option changes the cache size.
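
              The effect of such a cache can be illustrated with Python's
              functools.lru_cache. This is a sketch under assumptions: the toy scoring
              function, the names and the least-recently-used eviction policy are not
              taken from ecotag, whose cache policy is not described here.

```python
from functools import lru_cache


@lru_cache(maxsize=1_000_000)  # same order of magnitude as the documented default
def _score(a, b):
    """Toy similarity: fraction of identical positions."""
    return sum(x == y for x, y in zip(a, b)) / max(len(a), len(b))


def similarity(a, b):
    # Order the pair so that (a, b) and (b, a) share a single cache entry.
    return _score(a, b) if a <= b else _score(b, a)


first = similarity("acgt", "acga")   # computed
second = similarity("acga", "acgt")  # served from the cache
info = _score.cache_info()
```

              A larger cache avoids recomputing alignments between reference sequences
              that are compared repeatedly, at the cost of more memory.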

TAXONOMY RELATED OPTIONS

       -d <FILENAME>, --database=<FILENAME>
              ecoPCR taxonomy Database name

       -t <FILENAME>, --taxonomy-dump=<FILENAME>
              NCBI Taxonomy dump repository name

OPTIONS TO SPECIFY INPUT FORMAT

   Restrict the analysis to a sub-part of the input file
       --skip <N>
              The first N sequence records of the file are discarded from the analysis and not
              reported to the output file.

       --only <N>
              Only the next N sequence records of the file are analyzed. The following sequences
              in the file are neither analyzed nor reported to the output file.  This option
              can be used together with the --skip option.
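
              The combined effect of the two options corresponds to a simple slice over
              the stream of records. The sketch below uses itertools.islice; the function
              name and the record values are hypothetical.

```python
from itertools import islice


def restrict(records, skip=0, only=None):
    """Drop the first `skip` records, then keep at most `only` records."""
    stop = None if only is None else skip + only
    return islice(records, skip, stop)


records = ["seq%d" % i for i in range(1, 11)]
selected = list(restrict(records, skip=2, only=3))
```

              Here the first two records are discarded and the next three (seq3, seq4,
              seq5) are the only ones analyzed and reported.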

   Sequence annotated format
       --genbank
              Input file is in genbank format.

       --embl Input file is in embl format.

   fasta related format
       --fasta
              Input file is in fasta format (including OBITools fasta extensions).
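
              In the OBITools fasta extension, annotations such as the taxid are written
              in the header line as key=value; pairs placed between the identifier and
              the free-text description. The parser below is a simplified sketch of that
              layout: the function name and the sample header are hypothetical, and
              values are kept as plain strings.

```python
import re

# One "key=value;" pair, with optional surrounding white space.
ATTRIBUTE = re.compile(r"\s*(\w+)\s*=\s*([^;]*);")


def parse_header(line):
    """Split '>id key=value; key=value; description' into its three parts."""
    identifier, _, rest = line[1:].strip().partition(" ")
    attributes = {}
    position = 0
    while True:
        match = ATTRIBUTE.match(rest, position)
        if match is None:
            break
        attributes[match.group(1)] = match.group(2).strip()
        position = match.end()
    # Whatever follows the last attribute is the description.
    return identifier, attributes, rest[position:].strip()


header = ">ref_0001 taxid=9606; count=12; example reference sequence"
identifier, attributes, description = parse_header(header)
```

              The taxid attribute recovered this way is what links a reference sequence
              to the taxonomy database.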

   fastq related format
       --sanger
              Input  file  is  in  Sanger  fastq  format  (standard  fastq  used  by  HiSeq/MiSeq
              sequencers).

       --solexa
              Input file is in fastq format produced by Solexa (Ga IIx) sequencers.

   ecoPCR related format
       --ecopcr
              Input file is in ecoPCR format.

       --ecopcrdb
              Input is an ecoPCR database.

   Specifying the sequence type
       --nuc  Input file contains nucleic sequences.

       --prot Input file contains protein sequences.

OPTIONS TO SPECIFY OUTPUT FORMAT

   Standard output format
       --fasta-output
              Output sequences in OBITools fasta format

       --fastq-output
              Output sequences in Sanger fastq format

   Generating an ecoPCR database
       --ecopcrdb-output=<PREFIX_FILENAME>
              Creates an ecoPCR database from sequence records results

   Miscellaneous option
       --uppercase
              Print sequences in upper case (default is lower case)

COMMON OPTIONS

       -h, --help
              Shows this help message and exits.

       --DEBUG
              Sets logging in debug mode.

ECOTAG ADDED SEQUENCE ATTRIBUTES

            · best_identity

            · best_match

            · family

            · family_name

            · genus

            · genus_name

            · id_status

            · order

            · order_name

            · rank

            · scientific_name

            · species

            · species_list

            · species_name

            · taxid

AUTHOR

       The OBITools Development Team - LECA

COPYRIGHT

       2019 - 2015, OBITools Development Team

 1.02 12                                   Jan 28, 2019                                 ECOTAG(1)