Ubuntu Manpage: obiuniq - description of obiuniq

Provided by: obitools_1.2.12+dfsg-2_amd64

NAME

       obiuniq - description of obiuniq

       The obiuniq command is in some ways analogous to the standard Unix uniq -c command.

       Instead of working line by line on text, as the standard Unix tool does, it processes
       sequence records.

       A sequence record is a complex object composed of  an  identifier,  a  set  of  attributes
       (key=value), a definition, and the sequence itself.

       The  obiuniq  command  groups  together sequence records. Then, for each group, a sequence
       record is printed.

       A group is defined by the sequence and optionally by the values of  a  set  of  attributes
       specified with the -c option.

       Because the identifiers, the sets of attributes (key=value) and the definitions of the
       sequence records grouped together may differ, two options (-m and -i) refine how these
       parts of the records are reported.

          · By  default, only attributes with identical values within a group of sequence records
            are kept.

          · A count attribute is set to the total number of sequence records for each group.

          · For each attribute specified by the -m option, a new attribute whose key is  prefixed
            by  merged_  is  created. These new attributes contain the number of times each value
            occurs within the group of sequence records.
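
       The three rules above can be sketched in Python. This is only an illustration of the
       described behaviour, not the OBITools implementation: the record layout (a dict with
       id, sequence and attributes keys), the dereplicate function and the choice of
       reporting the first identifier of each group are assumptions made for the example.

```python
from collections import Counter, defaultdict

def dereplicate(records, merge_keys=()):
    """Group records by sequence and return one summary record per group."""
    groups = defaultdict(list)
    for rec in records:
        groups[rec["sequence"]].append(rec)

    result = []
    for seq, members in groups.items():
        first = members[0]["attributes"]
        # Default rule: keep only attributes with identical values in the whole group.
        kept = {
            key: value for key, value in first.items()
            if all(key in m["attributes"] and m["attributes"][key] == value
                   for m in members)
        }
        # count: total number of sequence records in the group.
        kept["count"] = len(members)
        # -m KEY: number of times each value of KEY occurs in the group.
        for key in merge_keys:
            kept["merged_" + key] = dict(Counter(
                m["attributes"][key] for m in members if key in m["attributes"]))
        # Assumption: the identifier of the first record stands for the group.
        result.append({"id": members[0]["id"], "sequence": seq, "attributes": kept})
    return result

# Hypothetical input records, made up for this example.
records = [
    {"id": "r1", "sequence": "ACGT", "attributes": {"sample": "A", "primer": "p1"}},
    {"id": "r2", "sequence": "ACGT", "attributes": {"sample": "B", "primer": "p1"}},
    {"id": "r3", "sequence": "ACGT", "attributes": {"sample": "A", "primer": "p1"}},
    {"id": "r4", "sequence": "TTGA", "attributes": {"sample": "B", "primer": "p1"}},
]
uniq = dereplicate(records, merge_keys=["sample"])
print(uniq[0]["attributes"])
# {'primer': 'p1', 'count': 3, 'merged_sample': {'A': 2, 'B': 1}}
```

       In the first group the sample attribute is dropped because its values differ, while
       merged_sample keeps their distribution.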

OBIUNIQ AND TAXONOMIC INFORMATION

       When a taxonomy is loaded (-d or -t options), the merged_taxid attribute is created and
       records the number of times each taxid has been found in the group (it may be empty if
       no sequence record in the group has a taxid attribute). In addition, a set of
       taxonomy-related attributes is generated for each group having at least one sequence
       record with a taxid attribute. The taxid attribute of the sequence group is set to the
       last common ancestor of the taxids of the group. All other taxonomy-related attributes
       created (species, genus, family, species_name, genus_name, family_name, rank,
       scientific_name) give information on the last common ancestor.
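
       The idea of a last common ancestor can be illustrated with a toy taxonomy. This is a
       minimal sketch, not the OBITools code: the parent table, the integer taxids (which are
       invented and are not NCBI identifiers) and the function names are assumptions.

```python
def lineage(taxid, parent):
    """Return the path from taxid up to the root, both included."""
    path = [taxid]
    while parent[taxid] != taxid:  # in this toy table the root is its own parent
        taxid = parent[taxid]
        path.append(taxid)
    return path

def last_common_ancestor(taxids, parent):
    """Return the deepest node shared by the lineages of all taxids."""
    taxids = list(taxids)
    common = lineage(taxids[0], parent)  # ordered from the taxid up to the root
    for taxid in taxids[1:]:
        other = set(lineage(taxid, parent))
        common = [node for node in common if node in other]
    return common[0]

# Invented taxonomy: 1 = root, 10 = family, 100 and 101 = genera,
# 1000 and 1001 = species of genus 100, 1010 = species of genus 101.
parent = {1: 1, 10: 1, 100: 10, 101: 10, 1000: 100, 1001: 100, 1010: 101}

print(last_common_ancestor([1000, 1001], parent))  # 100 (same genus)
print(last_common_ancestor([1000, 1010], parent))  # 10  (same family only)
```

       A group whose records carry taxids of two species from different genera would thus be
       annotated at the family level in this toy example.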

OBIUNIQ SPECIFIC OPTIONS

       -m <KEY>, --merge=<KEY>
              Attribute to merge.

              Example:

                     > obiuniq -m sample seq1.fasta > seq2.fasta

                 Dereplicates  sequences and keeps the value distribution of the sample attribute
                 in the new attribute merged_sample.

       -i, --merge-ids
              Adds a merged attribute containing the list of sequence record  ids  merged  within
              this group.

       -c <KEY>, --category-attribute=<KEY>
              Adds  one  attribute to the list of attributes used to define sequence groups (this
              option can be used several times).

              Example:

                     > obiuniq -c sample seq1.fasta > seq2.fasta

                 Dereplicates sequences within each sample.
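
              The effect of -c on grouping can be sketched as a change of the grouping key.
              The record layout and the function names below are assumptions made for the
              illustration, as is the handling of records lacking the attribute (they get
              None here); none of this is taken from the OBITools source.

```python
from collections import defaultdict

def group_key(rec, categories=()):
    """Grouping key: the sequence, plus the value of each category attribute."""
    return (rec["sequence"],) + tuple(rec["attributes"].get(k) for k in categories)

def count_groups(records, categories=()):
    """Return the number of records falling into each group."""
    counts = defaultdict(int)
    for rec in records:
        counts[group_key(rec, categories)] += 1
    return dict(counts)

# Hypothetical input records, made up for this example.
records = [
    {"id": "r1", "sequence": "ACGT", "attributes": {"sample": "A"}},
    {"id": "r2", "sequence": "ACGT", "attributes": {"sample": "B"}},
    {"id": "r3", "sequence": "ACGT", "attributes": {"sample": "A"}},
    {"id": "r4", "sequence": "TTGA", "attributes": {"sample": "B"}},
]

print(count_groups(records))
# {('ACGT',): 3, ('TTGA',): 1}
print(count_groups(records, categories=["sample"]))
# {('ACGT', 'A'): 2, ('ACGT', 'B'): 1, ('TTGA', 'B'): 1}
```

              With the sample attribute in the key, identical sequences from different
              samples stay in separate groups.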

       -p, --prefix
              Dereplication is done based on prefix matching:

                 1. The shortest sequence of each group is a prefix of every sequence in that
                    group.

                 2. The shortest sequence of a group is a prefix only of sequences belonging
                    to that group.
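
              One way to build groups satisfying both conditions is to process sequences
              from shortest to longest and attach each one to the group whose shortest
              sequence is a prefix of it. The sketch below is an assumption about how such
              grouping could be done, not a description of the OBITools algorithm; it works
              on plain strings and is quadratic in the number of groups.

```python
def prefix_groups(sequences):
    """Map each group's shortest sequence to the list of its members."""
    groups = {}
    for seq in sorted(sequences, key=len):
        # Shorter sequences were seen first, so any existing representative
        # that is a prefix of seq is the shortest sequence of its group.
        rep = next((r for r in groups if seq.startswith(r)), None)
        if rep is None:
            groups[seq] = [seq]      # seq starts a new group
        else:
            groups[rep].append(seq)  # seq extends an existing representative
    return groups

# Made-up sequences for the illustration.
seqs = ["ACGTAC", "ACG", "ACGT", "TTG", "TTGAA", "ACG"]
print(prefix_groups(seqs))
# {'ACG': ['ACG', 'ACG', 'ACGT', 'ACGTAC'], 'TTG': ['TTG', 'TTGAA']}
```

              No representative can be a prefix of another one, since the longer of the two
              would have been attached to the shorter when it was processed.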

TAXONOMY RELATED OPTIONS

       -d <FILENAME>, --database=<FILENAME>
              ecoPCR taxonomy Database name

       -t <FILENAME>, --taxonomy-dump=<FILENAME>
              NCBI Taxonomy dump repository name

OPTIONS TO SPECIFY INPUT FORMAT

   Restrict the analysis to a sub-part of the input file
       --skip <N>
              The first N sequence records of the file are discarded from the analysis and
              are not reported to the output file.

       --only <N>
              Only the next N sequence records of the file are analyzed. The following
              sequences in the file are neither analyzed nor reported to the output file.
              This option can be used together with the --skip option.
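
              The combined effect of --skip and --only amounts to taking a slice of the
              input stream. A minimal sketch, assuming records arrive as any Python
              iterable; the select_records helper is hypothetical and not part of OBITools.

```python
from itertools import islice

def select_records(records, skip=0, only=None):
    """Discard the first `skip` records, then keep at most `only` records."""
    stop = None if only is None else skip + only
    return islice(records, skip, stop)

# Made-up record identifiers standing in for sequence records.
ids = ["r1", "r2", "r3", "r4", "r5", "r6"]
print(list(select_records(ids, skip=2, only=3)))  # ['r3', 'r4', 'r5']
```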

   Sequence annotated format
       --genbank
              Input file is in genbank format.

       --embl Input file is in embl format.

   fasta related format
       --fasta
              Input file is in fasta format (including OBITools fasta extensions).

   fastq related format
       --sanger
              Input  file  is  in  Sanger  fastq  format  (standard  fastq  used  by  HiSeq/MiSeq
              sequencers).

       --solexa
              Input file is in fastq format produced by Solexa (Ga IIx) sequencers.

   ecoPCR related format
       --ecopcr
              Input file is in ecoPCR format.

       --ecopcrdb
              Input is an ecoPCR database.

   Specifying the sequence type
       --nuc  Input file contains nucleic sequences.

       --prot Input file contains protein sequences.

COMMON OPTIONS

       -h, --help
              Shows this help message and exits.

       --DEBUG
              Sets logging in debug mode.

OBIUNIQ ADDED SEQUENCE ATTRIBUTES

            · count

            · merged_*

            · merged

            · scientific_name

            · rank

            · family

            · family_name

            · genus

            · genus_name

            · order

            · order_name

            · species

            · species_name

OBIUNIQ USED SEQUENCE ATTRIBUTE

          · taxid

AUTHOR

       The OBITools Development Team - LECA

COPYRIGHT

       2019 - 2015, OBITools Development Team

 1.02 12                                   Jan 28, 2019                                OBIUNIQ(1)