Ubuntu Manpage: obiselect - description of obiselect

Provided by: obitools_1.2.12+dfsg-2_amd64

NAME

       obiselect - description of obiselect

       The obiselect command selects a subset of sequence records from a sequence file by
       describing groups of sequence records and defining how many and which sequence records
       from each group must be retrieved.

       In each group, as defined by a set of -c options, sequence records are ordered according
       to a score function. The first N sequence records (N is set using the -n option) are kept
       in the resulting subset of sequence records.

       By default, the score function is a random function and one sequence record is retrieved
       per group, so one sequence record is selected at random from each group.
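
       The grouping and selection described above can be pictured with a short Python sketch.
       It is not the OBITools implementation: the function name select_per_group, the
       dictionary-based records and the assumption that lower scores come first are all
       illustrative.

```python
import random
from collections import defaultdict


def select_per_group(records, keys, n=1, score=None, rng=None):
    """Keep the n first records of each group once ordered by score."""
    rng = rng or random.Random()
    # Default score: a random number, so the pick within a group is random.
    score = score or (lambda record: rng.random())
    groups = defaultdict(list)
    for record in records:
        # One group per distinct combination of the category values.
        groups[tuple(record[k] for k in keys)].append(record)
    selected = []
    for members in groups.values():
        # A group smaller than n is returned whole.
        selected.extend(sorted(members, key=score)[:n])
    return selected


# Hypothetical records standing in for the content of a sequence file.
records = [
    {"id": "s1", "sample": "A", "seq_length": 100},
    {"id": "s2", "sample": "A", "seq_length": 100},
    {"id": "s3", "sample": "A", "seq_length": 120},
    {"id": "s4", "sample": "B", "seq_length": 100},
]
picked = select_per_group(records, ["sample", "seq_length"], rng=random.Random(0))
print([r["id"] for r in picked])
```

       With these four records there are three groups, so three records are kept; which of
       s1 and s2 is kept depends on the random draw.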

OBISELECT SPECIFIC OPTIONS

       -c <KEY>, --category-attribute=<KEY>
                 Attribute used to categorize the sequence records. Several  -c  options  can  be
                 combined.

                 TIP:
                     The  <KEY>  can  be  simply  the key of an attribute, or a Python expression
                     similarly to the -p option of obigrep.

              Example:

                        > obiselect -c sample -c seq_length seq.fasta

                 This command randomly selects one sequence record per combination of sample and
                 sequence length from the sequence records in the seq.fasta file. The selected
                 sequence records are printed to the screen.

       -n <INTEGER>, --number=<INTEGER>
                 Indicates how many sequence records per group must be retrieved. If the group
                 contains fewer sequence records than this number, the whole group is retrieved.

              Example:

                        > obiselect -n 2 -c sample -c seq_length seq.fasta

                 This command has the same effect as the previous example, except that two
                 sequence records are retrieved per sample/length class.

       --merge=<KEY>
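              Attribute to merge. The number of times each value of <KEY> was observed within
              the group is recorded in a new attribute named merged_<KEY>.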

              Example:

                     > obiselect -c seq_length -n 2 --merge=sample seq1.fasta > seq2.fasta

                 This command keeps two sequences per sequence length, and records how many times
                 they were observed for each sample in the new attribute merged_sample.

       --merge-ids
              Adds  a  merged  attribute containing the list of sequence record ids merged within
              this group.
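
              As a rough illustration of the two merge options above, the sketch below counts
              the values of one attribute within a group and collects the record ids. The
              helper names merge_attribute and merge_ids are hypothetical, and counting each
              record once is an assumption, not a statement about how obiselect computes the
              merged attributes.

```python
from collections import Counter


def merge_attribute(group, key):
    """Count how often each value of `key` occurs among the records of a group."""
    return dict(Counter(record[key] for record in group if key in record))


def merge_ids(group):
    """List the ids of all records that belong to the group."""
    return [record["id"] for record in group]


# Hypothetical group: three records sharing the same sequence length.
group = [
    {"id": "s1", "seq_length": 100, "sample": "A"},
    {"id": "s2", "seq_length": 100, "sample": "A"},
    {"id": "s3", "seq_length": 100, "sample": "B"},
]
merged_sample = merge_attribute(group, "sample")
merged_ids = merge_ids(group)
print(merged_sample, merged_ids)
```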

       -m, --min
              Sets the function used for scoring sequence records within a group to the minimum
              function.   The minimum function is applied to the values used to define categories
              (see option -c).  Sequences will be ordered according  to  the  distance  of  their
              values to the minimum value.

       -M, --max
              Sets the function used for scoring sequence records within a group to the maximum
              function.  The maximum function is applied to the values used to define  categories
              (see  option  -c).   Sequences  will  be ordered according to the distance of their
              values to the maximum value.

       -a, --mean
              Sets the function used for scoring sequence records within a group to the mean
              function.   The  mean  function  is applied to the values used to define categories
              (see option -c).  Sequences will be ordered according  to  the  distance  of  their
              values to the mean value.

       --median
              Sets the function used for scoring sequence records within a group to the median
              function.  The median function is applied to the values used to  define  categories
              (see  option  -c).   Sequences  will  be ordered according to the distance of their
              values to the median value.
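
              The ordering shared by the four options above (distance to the minimum, maximum,
              mean or median) can be sketched as follows. This is only an illustration: the
              function name order_by_distance is hypothetical, a plain list of numbers stands
              in for the values obiselect works on, and the absolute difference is assumed as
              the distance.

```python
import statistics

# Reference functions matching the four scoring options.
REFERENCES = {
    "min": min,
    "max": max,
    "mean": statistics.mean,
    "median": statistics.median,
}


def order_by_distance(values, reference):
    """Order values by their absolute distance to the chosen reference value."""
    target = REFERENCES[reference](values)
    # sorted() is stable, so values at equal distance keep their input order.
    return sorted(values, key=lambda v: abs(v - target))


lengths = [98, 100, 101, 120]
for name in REFERENCES:
    print(name, order_by_distance(lengths, name))
```

              Combined with -n, the first entries of each ordering are the ones that would be
              kept under this reading.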

       -f FUNCTION, --function=FUNCTION
              Sets the function used for scoring sequence records within a group to a user-defined
              function. The user-defined function is declared using Python syntax. Attribute keys
              can be used as variables. An extra sequence variable representing the full
              sequence record is available. If an option for loading a taxonomy database is
              provided, a taxonomy variable is also available. The function is evaluated for
              each sequence record, and its minimum value within each group is determined.
              Sequence records are then ordered within each group according to the distance
              between their function value and the minimum value of their group.
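
              A minimal sketch of this behaviour, assuming records are plain dictionaries and
              the expression is evaluated with Python's eval. The function name
              order_by_user_function and the example expression are illustrative, and the
              taxonomy variable is left out.

```python
def order_by_user_function(group, expression):
    """Order a group by distance between each record's score and the group minimum."""
    code = compile(expression, "<user function>", "eval")
    scored = []
    for record in group:
        # Attribute keys become variables; `sequence` exposes the full record.
        variables = dict(record, sequence=record)
        scored.append((eval(code, {}, variables), record))
    lowest = min(score for score, _ in scored)
    ordered = sorted(scored, key=lambda pair: pair[0] - lowest)
    return [record for _, record in ordered]


# Hypothetical group of records with a seq_length attribute.
group = [
    {"id": "s1", "seq_length": 98},
    {"id": "s2", "seq_length": 100},
    {"id": "s3", "seq_length": 105},
]
ordered = order_by_user_function(group, "abs(seq_length - 100)")
print([record["id"] for record in ordered])
```

              Here the expression favours records whose length is closest to 100, so s2 comes
              first, followed by s1 and s3.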

OPTIONS TO SPECIFY INPUT FORMAT

   Restrict the analysis to a sub-part of the input file
       --skip <N>
              The first N sequence records of the file are discarded from the analysis and not
              reported to the output file.

       --only <N>
              Only the next N sequence records of the file are analyzed. The following sequence
              records in the file are neither analyzed nor reported to the output file. This
              option can be used together with the --skip option.
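
              The combined effect of the two options can be pictured as a slice over the
              stream of records. The sketch below uses itertools.islice; the function name
              restrict is hypothetical and this is not a description of how OBITools reads
              its input.

```python
from itertools import islice


def restrict(records, skip=0, only=None):
    """Drop the first `skip` records, then keep at most the next `only` records."""
    stop = None if only is None else skip + only
    return islice(records, skip, stop)


ids = ["s1", "s2", "s3", "s4", "s5", "s6"]
window = list(restrict(ids, skip=2, only=3))
print(window)
```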

   Sequence annotated format
       --genbank
              Input file is in genbank format.

       --embl Input file is in embl format.

   fasta related format
       --fasta
              Input file is in fasta format (including OBITools fasta extensions).

   fastq related format
       --sanger
              Input  file  is  in  Sanger  fastq  format  (standard  fastq  used  by  HiSeq/MiSeq
              sequencers).

       --solexa
              Input file is in fastq format produced by Solexa (Ga IIx) sequencers.

   ecoPCR related format
       --ecopcr
              Input file is in ecoPCR format.

       --ecopcrdb
              Input is an ecoPCR database.

   Specifying the sequence type
       --nuc  Input file contains nucleic sequences.

       --prot Input file contains protein sequences.

TAXONOMY RELATED OPTIONS

       -d <FILENAME>, --database=<FILENAME>
              ecoPCR taxonomy Database name

       -t <FILENAME>, --taxonomy-dump=<FILENAME>
              NCBI Taxonomy dump repository name

OPTIONS TO SPECIFY OUTPUT FORMAT

   Standard output format
       --fasta-output
              Output sequences in OBITools fasta format

       --fastq-output
              Output sequences in Sanger fastq format

   Generating an ecoPCR database
       --ecopcrdb-output=<PREFIX_FILENAME>
              Creates an ecoPCR database from sequence records results

   Miscellaneous option
       --uppercase
              Print sequences in upper case (default is lower case)

COMMON OPTIONS

       -h, --help
              Shows this help message and exits.

       --DEBUG
              Sets logging in debug mode.

OBISELECT ADDED SEQUENCE ATTRIBUTES

          · class

          · distance

          · merged

          · merged_*

          · select

OBISELECT USED SEQUENCE ATTRIBUTE

          · taxid

AUTHOR

       The OBITools Development Team - LECA

COPYRIGHT

       2019 - 2015, OBITools Development Team

 1.02 12                                   Jan 28, 2019                              OBISELECT(1)