Ubuntu Manpage: obiclean - description of obiclean

Provided by: obitools_1.2.12+dfsg-2_amd64

NAME

       obiclean - description of obiclean

       obiclean  is  a  command  that  classifies  sequence  records  either as head, internal or
       singleton.

       For that purpose, two pieces of information are used:

              · sequence record counts

              · sequence similarities

       A sequence record S1 is considered a variant of another sequence record S2 if and only
       if:

              · the count of S1 divided by the count of S2 is less than the ratio R. R defaults
                to 1 and can be adjusted between 0 and 1 with the -r option.

              · both sequences are related to one another (they can align with some  differences,
                the maximum number of differences can be specified by the -d option).
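
       The two conditions above can be sketched as a small predicate. This is only an
       illustration, not the obiclean implementation: the function names are hypothetical,
       and counting differences with a plain edit distance is an assumption, since this
       page does not specify the alignment method.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance (assumed stand-in for the number of alignment differences)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def is_variant(seq1: str, count1: int, seq2: str, count2: int,
               ratio: float = 1.0, max_diff: int = 1) -> bool:
    """True if (seq1, count1) would be a variant of (seq2, count2).

    ratio mirrors the -r option and max_diff mirrors the -d option.
    """
    if count2 <= 0:
        return False
    rare_enough = count1 / count2 < ratio
    related = edit_distance(seq1, seq2) <= max_diff
    return rare_enough and related
```

       With the default values, a record with count 2 that differs by one base from a
       record with count 10 satisfies both conditions, while the reverse direction does
       not, because 10 / 2 is not less than 1.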

       For a sequence record S, the following properties hold depending on the tag S receives:

              · head:

                       · there  exists  at  least  one  sequence  record in the dataset that is a
                         variant of S

                       · there exists no sequence record in the dataset such that S is a  variant
                         of this sequence record

              · internal:

                       · there  exists at least one sequence record in the dataset such that S is
                         a variant of this sequence record

              · singleton:

                       · there exists no sequence record in the dataset that is a variant of S

                       · there exists no sequence record in the dataset such that S is a  variant
                         of this sequence record
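
       The three definitions above reduce to two questions per record: does it have a
       variant, and is it a variant of something? A minimal sketch, assuming the variant
       relation has already been computed and is passed in as (variant, parent) pairs; the
       function name and the data layout are hypothetical:

```python
def classify(records, variant_pairs):
    """Tag each record id as 'head', 'internal' or 'singleton'.

    variant_pairs is a collection of (v, p) tuples meaning "v is a variant of p".
    """
    is_variant_of_something = {v for v, _ in variant_pairs}
    has_a_variant = {p for _, p in variant_pairs}

    status = {}
    for rec in records:
        if rec in is_variant_of_something:
            status[rec] = "internal"   # S is a variant of at least one record
        elif rec in has_a_variant:
            status[rec] = "head"       # S has variants and is a variant of none
        else:
            status[rec] = "singleton"  # no relation in either direction
    return status
```

       For example, with records a, b, c, d where b is a variant of a and c is a variant of
       b, the sketch tags a as head, b and c as internal, and d as singleton.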

       By  default, tagging is done once for the whole dataset, but it can also be done sample by
       sample by specifying the -s option. In such a case, the  counts  are  extracted  from  the
       sample information.

       Finally, each sequence record is annotated with three new attributes: head, internal and
       singleton. The attribute values are the numbers of samples in which  the  sequence  record
       has been classified in this manner.
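
       As an illustration of how such per-sample counts could be derived, the sketch below
       tallies the status a single record received in each sample. The function name and
       the input mapping are assumptions made for the example, not part of obiclean:

```python
from collections import Counter


def status_counts(status_by_sample):
    """Count in how many samples a record was tagged head, internal or singleton.

    status_by_sample maps a sample name to the status of the record in that sample.
    """
    counts = Counter(status_by_sample.values())
    return {name: counts.get(name, 0)
            for name in ("head", "internal", "singleton")}
```

       A record tagged head in two samples and internal in a third would get the values
       head = 2, internal = 1 and singleton = 0.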

OBICLEAN SPECIFIC OPTIONS

       -d <INTEGER>, --distance=<INTEGER>
              Maximum number of differences between two variant sequences (default: 1).

       -s <KEY>, --sample=<KEY>
              Attribute containing sample descriptions.

       -r <FLOAT>, --ratio=<FLOAT>
              Threshold  ratio  between  counts (rare/abundant counts) of two sequence records so
              that the less abundant one is a variant of the more abundant (default: 1, i.e.  all
              less abundant sequences are variants).

       -C, --cluster
              Switch  obiclean  into  its clustering mode. This adds information to each sequence
              about the true.

       -H, --head
              Select only sequences with the head status in at least one sample.

       -g, --graph
              Creates a file containing the set of DAGs (directed acyclic graphs) used by the
              obiclean clustering algorithm. The graph file follows the dot format.
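
       For readers unfamiliar with the dot format (the Graphviz graph description
       language), the sketch below shows what a directed graph looks like in that syntax.
       The node names, the graph name and the edge direction (parent to variant) are
       assumptions made for illustration; the exact content of the file written by
       obiclean is not described on this page.

```python
def to_dot(edges, name="obiclean"):
    """Render (variant, parent) pairs as a Graphviz digraph in dot syntax."""
    lines = [f"digraph {name} {{"]
    for variant, parent in edges:
        # One edge per relation, drawn from the parent to its variant (assumed direction).
        lines.append(f'    "{parent}" -> "{variant}";')
    lines.append("}")
    return "\n".join(lines)
```

       Calling to_dot([("seq_b", "seq_a")]) yields a three-line digraph with the single
       edge "seq_a" -> "seq_b".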

OPTIONS TO SPECIFY INPUT FORMAT

   Restrict the analysis to a sub-part of the input file
       --skip <N>
              The first N sequence records of the file are discarded from the analysis and not
              reported to the output file.

       --only <N>
              Only the next N sequence records of the file are analyzed. The following sequences
              in the file are neither analyzed nor reported to the output file. This option
              can be used together with the --skip option.

   Sequence annotated format
       --genbank
              Input file is in genbank format.

       --embl Input file is in embl format.

   fasta related format
       --fasta
              Input file is in fasta format (including OBITools fasta extensions).

   fastq related format
       --sanger
              Input  file  is  in  Sanger  fastq  format  (standard  fastq  used  by  HiSeq/MiSeq
              sequencers).

       --solexa
              Input file is in fastq format produced by Solexa (Ga IIx) sequencers.

   ecoPCR related format
       --ecopcr
              Input file is in ecoPCR format.

       --ecopcrdb
              Input is an ecoPCR database.

   Specifying the sequence type
       --nuc  Input file contains nucleic sequences.

       --prot Input file contains protein sequences.

OPTIONS TO SPECIFY OUTPUT FORMAT

   Standard output format
       --fasta-output
              Output sequences in OBITools fasta format

       --fastq-output
              Output sequences in Sanger fastq format

   Generating an ecoPCR database
       --ecopcrdb-output=<PREFIX_FILENAME>
              Creates an ecoPCR database from sequence records results

   Miscellaneous option
       --uppercase
              Print sequences in upper case (default is lower case)

COMMON OPTIONS

       -h, --help
              Shows this help message and exits.

       --DEBUG
              Sets logging in debug mode.

OBICLEAN USED SEQUENCE ATTRIBUTES

            · count

OBICLEAN ADDED SEQUENCE ATTRIBUTES

            · obiclean_cluster

            · obiclean_count

            · obiclean_head

            · obiclean_headcount

            · obiclean_internalcount

            · obiclean_samplecount

            · obiclean_singletoncount

            · obiclean_status

AUTHOR

       The OBITools Development Team - LECA

COPYRIGHT

       2019 - 2015, OBITools Development Team

 1.02 12                                   Jan 28, 2019                               OBICLEAN(1)