Ubuntu Manpage: obiannotate - description of obiannotate

Provided by: obitools_1.2.12+dfsg-2_amd64

NAME

       obiannotate - description of obiannotate

       obiannotate  is  the  command  that allows adding/modifying/removing annotation attributes
       attached to sequence records.

       Once such attributes are added, they can be used by the other OBITools commands for
       filtering or for computing statistics.

       Example 1:

              > obiannotate -S short:'len(sequence)<100' seq1.fasta > seq2.fasta

          The  above  command  adds an attribute named short which has a boolean value indicating
          whether the sequence length is less than 100bp.

       Example 2:

              > obiannotate --seq-rank seq1.fasta | \
                obiannotate -C --set-identifier '"'FungA'_%05d" % seq_rank' \
                > seq2.fasta

          The above command adds a new attribute whose value is the sequence record entry  number
          in  the file. Then it clears all the sequence record attributes and sets the identifier
          to a string beginning with FungA_ followed by a suffix with  5  digits  containing  the
          sequence entry number.
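
          As a rough illustration, the expression given to --set-identifier is ordinary Python
          string formatting. The sketch below evaluates it outside OBITools, with a plain
          integer standing in for the seq_rank attribute. The function name make_identifier and
          its prefix parameter are hypothetical and are not part of OBITools.

```python
def make_identifier(seq_rank, prefix="FungA"):
    """Build an identifier such as FungA_00007 from a record's entry number.

    Hypothetical helper: it only reproduces the string formatting used in
    the example, not the way obiannotate handles records.
    """
    # %05d pads the entry number with zeros to a width of five digits.
    return "%s_%05d" % (prefix, seq_rank)


if __name__ == "__main__":
    for rank in (1, 42, 12345):
        print(make_identifier(rank))
```

          Entry numbers with more than five digits are not truncated; the suffix simply grows.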

       Example 3:

              > obiannotate -d my_ecopcr_database \
                --with-taxon-at-rank=genus seq1.fasta > seq2.fasta

          The above command adds taxonomic information at the genus rank to the sequence records.

       Example 4:

              > obiannotate -S 'new_seq:str(sequence).replace("a","t")' \
                seq1.fasta | obiannotate --set-sequence new_seq > seq2.fasta

          The  overall  aim  of  the  above  command  is  to  edit the sequence object itself, by
          replacing all nucleotides a by nucleotides t. First, a new attribute named  new_seq  is
          created, which contains the modified sequence, and then the former sequence is replaced
          by the modified one.
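
          The Python expression used in this example can be tried on its own. The sketch below
          applies it to a plain string standing in for the sequence record; the function name
          substitute_bases is hypothetical and the code does not call OBITools.

```python
def substitute_bases(sequence, old="a", new="t"):
    """Return a copy of the sequence with every `old` base replaced by `new`.

    Hypothetical helper mirroring the expression str(sequence).replace("a", "t").
    """
    # str() mirrors the example, where the record is first converted to text.
    return str(sequence).replace(old, new)


if __name__ == "__main__":
    print(substitute_bases("gattaca"))
```

          Note that str.replace is case sensitive, so an upper case A is left untouched by a
          replacement written for a.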

SEQUENCE RECORD EDITING OPTIONS

       --seq-rank
              Adds a new attribute named seq_rank to the sequence  record  indicating  its  entry
              number in the sequence file.

       -R <OLD_NAME>:<NEW_NAME>, --rename-tag=<OLD_NAME>:<NEW_NAME>
              Renames the attribute <OLD_NAME> to <NEW_NAME>. When the attribute named
              <OLD_NAME> is missing, the sequence record is skipped and the next one is
              examined.

       --delete-tag=<KEY>
              Deletes the attribute named <KEY>. When this attribute is missing, the
              sequence record is skipped and the next one is examined.

       -S <KEY>:<PYTHON_EXPRESSION>, --set-tag=<KEY>:<PYTHON_EXPRESSION>
              Creates a new attribute with key <KEY> and a value computed from
              <PYTHON_EXPRESSION>.

       --tag-list=<FILENAME>
              <FILENAME> points to a file containing attribute names and  values  to  modify  for
              specified sequence records.

       --set-identifier=<PYTHON_EXPRESSION>
              Sets sequence record identifier with a value computed from <PYTHON_EXPRESSION>.

       --run=<PYTHON_EXPRESSION>
              Runs a Python expression on each selected sequence.

       --set-sequence=<PYTHON_EXPRESSION>
              Changes the sequence itself with a value computed from <PYTHON_EXPRESSION>.

       -T, --set-definition=<PYTHON_EXPRESSION>
              Sets sequence definition with a value computed from <PYTHON_EXPRESSION>.

       -O, --only-valid-python
              Allows only valid Python expressions.

       -C, --clear
              Clears all attributes associated with the sequence records.

       -k <KEY>, --keep=<KEY>
              Keeps only the attribute with key <KEY>. Several -k options can be combined.
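
              The effect of combining several -k options can be pictured with a small sketch.
              It models the attributes of one record as a plain Python dict, which is an
              assumption made for illustration only; the function name keep_attributes is
              hypothetical.

```python
def keep_attributes(attributes, keys):
    """Return a new dict holding only the attributes whose key is in `keys`.

    Hypothetical model: `attributes` is a plain dict, `keys` lists the
    values given to the repeated -k options.
    """
    wanted = set(keys)
    return {key: value for key, value in attributes.items() if key in wanted}


if __name__ == "__main__":
    record_attributes = {"count": 3, "taxid": 9606, "rank": "species"}
    print(keep_attributes(record_attributes, ["taxid", "count"]))
```

              Keys that are requested but absent from the record are simply ignored in this
              sketch.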

       --length
              Adds an attribute with key seq_length whose value is the sequence length.

       --with-taxon-at-rank=<RANK_NAME>
              Adds taxonomic annotation at taxonomic rank <RANK_NAME>.

       -m <MCLFILE>, --mcl=<MCLFILE>
              Creates  a  new  attribute containing the number of the cluster the sequence record
              was assigned to, as indicated in file <MCLFILE>.

       --uniq-id
              Forces sequence record ids to be unique.

SEQUENCE RECORD SELECTION OPTIONS

       -s <REGULAR_PATTERN>, --sequence=<REGULAR_PATTERN>
                 Regular expression pattern to be tested against the sequence itself. The pattern
                 is case insensitive.

              Examples:

                     > obigrep -s 'GAATTC' seq1.fasta > seq2.fasta

                 Selects only the sequence records that contain an EcoRI restriction site.

                     > obigrep -s 'A{10,}' seq1.fasta > seq2.fasta

                 Selects only the sequence records that contain a stretch of at least 10 A.

                     > obigrep -s '^[ACGT]+$' seq1.fasta > seq2.fasta

                 Selects only the sequence records that do not contain ambiguous nucleotides.
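
                 The three patterns above can be tested with Python's re module. This is a
                 minimal sketch that assumes the pattern is searched anywhere in the sequence
                 and ignores case, as the option description states; the function name
                 sequence_matches is hypothetical and the code does not call OBITools.

```python
import re


def sequence_matches(pattern, sequence):
    """Case-insensitive search of `pattern` anywhere in `sequence`."""
    return re.search(pattern, sequence, re.IGNORECASE) is not None


if __name__ == "__main__":
    sequence = "ccgaattcgg"
    for pattern in ("GAATTC", "A{10,}", "^[ACGT]+$"):
        print(pattern, sequence_matches(pattern, sequence))
```

                 The anchors ^ and $ in the last pattern are what force the whole sequence,
                 rather than a part of it, to consist of unambiguous nucleotides.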

       -D <REGULAR_PATTERN>, --definition=<REGULAR_PATTERN>
                 Regular  expression  pattern to be tested against the definition of the sequence
                 record. The pattern is case sensitive.

              Example:

                     > obigrep -D '[Cc]hloroplast' seq1.fasta > seq2.fasta

                 Selects only the sequence  records  whose  definition  contains  chloroplast  or
                 Chloroplast.

       -I <REGULAR_PATTERN>, --identifier=<REGULAR_PATTERN>
                 Regular  expression  pattern to be tested against the identifier of the sequence
                 record. The pattern is case sensitive.

              Example:

                     > obigrep -I '^GH' seq1.fasta > seq2.fasta

                 Selects only the sequence records whose identifier begins with GH.

       --id-list=<FILENAME>
                 <FILENAME> points to  a  text  file  containing  the  list  of  sequence  record
                 identifiers to be selected. The file format consists of a single identifier per
                 line.

              Example:

                     > obigrep --id-list=my_id_list.txt seq1.fasta > seq2.fasta

                 Selects  only  the  sequence  records  whose  identifier  is  present   in   the
                 my_id_list.txt file.
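
                 A sketch of the same selection in plain Python follows. It assumes records are
                 (identifier, sequence) pairs and that blank lines in the list are ignored;
                 both are assumptions for illustration. The names read_id_list and select_by_id
                 are hypothetical.

```python
import io


def read_id_list(handle):
    """Collect identifiers from a file-like object, one identifier per line."""
    # Surrounding whitespace is stripped and empty lines are skipped (assumption).
    return {line.strip() for line in handle if line.strip()}


def select_by_id(records, wanted_ids):
    """Yield the (identifier, sequence) pairs whose identifier is listed."""
    for identifier, sequence in records:
        if identifier in wanted_ids:
            yield identifier, sequence


if __name__ == "__main__":
    id_file = io.StringIO("GH0001\nGH0003\n")  # stands in for my_id_list.txt
    records = [("GH0001", "acgt"), ("GH0002", "ttga"), ("GH0003", "ccat")]
    for identifier, sequence in select_by_id(records, read_id_list(id_file)):
        print(identifier, sequence)
```

                 Storing the identifiers in a set keeps each membership test cheap even when
                 the list is long.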

       -a <KEY>:<REGULAR_PATTERN>, --attribute=<KEY>:<REGULAR_PATTERN>
                 Regular expression pattern matched against the attributes of the sequence
                 record. The option value has the form key:regular_pattern. The pattern is
                 case sensitive. Several -a options can be used on the same command line; in
                 that case, the selected sequence records match all constraints.

              Example:

                     > obigrep -a 'family_name:Asteraceae' seq1.fasta > seq2.fasta

                 Selects  the  sequence  records containing an attribute whose key is family_name
                 and value is Asteraceae.
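
                 How several key:regular_pattern constraints combine can be sketched as
                 follows. The sketch models attributes as a plain dict, splits the option value
                 at the first colon, and treats a missing key as a failed match; these are all
                 assumptions for illustration. The names parse_constraint and matches_all are
                 hypothetical.

```python
import re


def parse_constraint(option_value):
    """Split 'key:regular_pattern' at the first colon (assumption)."""
    key, pattern = option_value.split(":", 1)
    return key, re.compile(pattern)  # no IGNORECASE: matching is case sensitive


def matches_all(attributes, option_values):
    """Return True only when every constraint matches its attribute."""
    for option_value in option_values:
        key, pattern = parse_constraint(option_value)
        if key not in attributes:
            return False  # assumption: a missing attribute cannot match
        if pattern.search(str(attributes[key])) is None:
            return False
    return True


if __name__ == "__main__":
    attributes = {"family_name": "Asteraceae", "genus_name": "Bellis"}
    print(matches_all(attributes, ["family_name:Asteraceae"]))
    print(matches_all(attributes, ["family_name:Asteraceae", "genus_name:Taraxacum"]))
```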

       -A <KEY>, --has-attribute=<KEY>
                 Selects sequence records having an attribute whose key is <KEY>.

              Example:

                     > obigrep -A taxid seq1.fasta > seq2.fasta

                 Selects only the sequence records having a taxid attribute defined.

       -p <PYTHON_EXPRESSION>, --predicat=<PYTHON_EXPRESSION>
                 Python boolean expression to be evaluated for each sequence record. The
                 attribute keys defined for each sequence record can be used in the expression as
                 variable names. An extra variable named ‘sequence’ refers to the sequence
                 record itself. Several -p options can be used on the same command line; in
                 that case, the selected sequence records match all constraints.

              Example:

                     >  obigrep -p '(forward_error<2) and (reverse_error<2)' \
                        seq1.fasta > seq2.fasta

                 Selects  only  the  sequence  records  whose  forward_error  and   reverse_error
                 attributes have a value smaller than two.
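
                 The way attribute keys become variable names can be pictured with Python's
                 built-in eval. The sketch below is not the OBITools implementation: it models
                 attributes as a plain dict, uses a string in place of the sequence record, and
                 the function name record_passes is hypothetical.

```python
def record_passes(expressions, attributes, sequence):
    """Return True when every expression evaluates to a true value."""
    # Attribute keys become variable names; 'sequence' is added as an extra one.
    namespace = dict(attributes, sequence=sequence)
    return all(eval(expression, {}, namespace) for expression in expressions)


if __name__ == "__main__":
    attributes = {"forward_error": 1, "reverse_error": 0}
    predicate = "(forward_error<2) and (reverse_error<2)"
    print(record_passes([predicate], attributes, "acgtacgt"))
    print(record_passes([predicate, "len(sequence)>=100"], attributes, "acgtacgt"))
```

                 Because eval runs arbitrary code, a sketch like this is only suitable for
                 expressions written by the person running the command.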

       -L <##>, --lmax=<##>
                 Selects sequence records whose sequence length is equal to or shorter than lmax.

              Example:

                     > obigrep -L 100 seq1.fasta > seq2.fasta

                 Selects only the sequence records that have a sequence length equal to or
                 shorter than 100bp.

       -l <##>, --lmin=<##>
                 Selects sequence records whose sequence length is equal to or longer than lmin.

              Example:

                     > obigrep -l 100 seq1.fasta > seq2.fasta

                 Selects only the sequence records that have a sequence length equal to or longer
                 than 100bp.

       -v, --inverse-match
                 Inverts the sequence record selection.

              Example:

                     > obigrep -v -l 100 seq1.fasta > seq2.fasta

                 Selects  only  the  sequence  records  that  have a sequence length shorter than
                 100bp.
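
                 The interplay of the length bounds and the inverted match can be sketched in a
                 few lines. The sketch assumes that -l and -L, when both given, must both hold,
                 and that -v negates the combined result; the function name length_selected and
                 its parameters are hypothetical.

```python
def length_selected(sequence, lmin=None, lmax=None, inverse=False):
    """Decide whether a sequence is selected by its length.

    lmin and lmax are inclusive bounds; None means the bound is not set.
    """
    length = len(sequence)
    selected = (lmin is None or length >= lmin) and (lmax is None or length <= lmax)
    # An inverted match keeps exactly the records the plain match would drop.
    return not selected if inverse else selected


if __name__ == "__main__":
    for size in (99, 100, 101):
        sequence = "a" * size
        print(size,
              length_selected(sequence, lmin=100),
              length_selected(sequence, lmin=100, inverse=True))
```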

TAXONOMY RELATED OPTIONS

       -d <FILENAME>, --database=<FILENAME>
              Name of the ecoPCR taxonomy database.

       -t <FILENAME>, --taxonomy-dump=<FILENAME>
              NCBI Taxonomy dump repository name

       --require-rank=<RANK_NAME>
              Selects sequence records whose taxid attribute has a parent of rank <RANK_NAME>.

       -r <TAXID>, --required=<TAXID>
              required taxid

       -i <TAXID>, --ignore=<TAXID>
              ignored taxid

OPTIONS TO SPECIFY INPUT FORMAT

   Restrict the analysis to a sub-part of the input file
       --skip <N>
              The first N sequence records of the file are discarded from the analysis and are
              not reported to the output file.

       --only <N>
              Only the next N sequence records of the file are analyzed. The following sequences
              in the file are neither analyzed nor reported to the output file. This option
              can be used together with the --skip option.
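
              Combining the two options amounts to taking a window of the input. The sketch
              below uses itertools.islice on any iterable of records and assumes that the
              --only count starts after the skipped records; the function name restrict is
              hypothetical.

```python
from itertools import islice


def restrict(records, skip=0, only=None):
    """Yield at most `only` records after discarding the first `skip` ones."""
    # With only=None the window extends to the end of the input.
    stop = None if only is None else skip + only
    return islice(records, skip, stop)


if __name__ == "__main__":
    records = ["seq%d" % number for number in range(1, 11)]
    print(list(restrict(records, skip=2, only=3)))
```

              Because islice is lazy, records beyond the window are never read from the input.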

   Sequence annotated format
       --genbank
              Input file is in genbank format.

       --embl Input file is in embl format.

   fasta related format
       --fasta
              Input file is in fasta format (including OBITools fasta extensions).

   fastq related format
       --sanger
              Input  file  is  in  Sanger  fastq  format  (standard  fastq  used  by  HiSeq/MiSeq
              sequencers).

       --solexa
              Input file is in fastq format produced by Solexa (Ga IIx) sequencers.

   ecoPCR related format
       --ecopcr
              Input file is in ecoPCR format.

       --ecopcrdb
              Input is an ecoPCR database.

   Specifying the sequence type
       --nuc  Input file contains nucleic sequences.

       --prot Input file contains protein sequences.

OPTIONS TO SPECIFY OUTPUT FORMAT

   Standard output format
       --fasta-output
              Output sequences in OBITools fasta format

       --fastq-output
              Output sequences in Sanger fastq format

   Generating an ecoPCR database
       --ecopcrdb-output=<PREFIX_FILENAME>
              Creates an ecoPCR database from sequence records results

   Miscellaneous option
       --uppercase
              Print sequences in upper case (default is lower case)

COMMON OPTIONS

       -h, --help
              Shows this help message and exits.

       --DEBUG
              Sets logging in debug mode.

OBIANNOTATE ADDED SEQUENCE ATTRIBUTES

            · seq_length

            · seq_rank

            · cluster

            · scientific_name

            · taxid

            · rank

            · family

            · family_name

            · genus

            · genus_name

            · order

            · order_name

            · species

            · species_name

AUTHOR

       The OBITools Development Team - LECA

COPYRIGHT

       2015 - 2019, OBITools Development Team

 1.02 12                                   Jan 28, 2019                            OBIANNOTATE(1)