Ubuntu Manpage: obigrep - description of obigrep

Provided by: obitools_1.2.12+dfsg-2_amd64

NAME

       obigrep - description of obigrep

       The obigrep command is in some ways analogous to the standard Unix grep command.  It
       selects a subset of sequence records from a sequence file.

       A sequence record is a complex object composed of  an  identifier,  a  set  of  attributes
       (key=value), a definition, and the sequence itself.

       Instead of working line by line like the standard Unix tool, obigrep performs its
       selection sequence record by sequence record.  A large set of options allows the
       selection to be refined on any of the sequence record elements.

       Moreover, obigrep allows several conditions (each evaluating to TRUE or FALSE) to be
       specified simultaneously; only the sequence records that fulfill all of the conditions
       (all conditions are TRUE) are selected.

SEQUENCE RECORD SELECTION OPTIONS

       -s <REGULAR_PATTERN>, --sequence=<REGULAR_PATTERN>
                 Regular expression pattern to be tested against the sequence itself. The pattern
                 is case insensitive.

              Examples:

                     > obigrep -s 'GAATTC' seq1.fasta > seq2.fasta

                 Selects only the sequence records that contain an EcoRI restriction site.

                     > obigrep -s 'A{10,}' seq1.fasta > seq2.fasta

                 Selects only the sequence records that contain a stretch of at least 10 A.

                     > obigrep -s '^[ACGT]+$' seq1.fasta > seq2.fasta

                 Selects only the sequence records that do not contain ambiguous nucleotides.
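
                 The three patterns above can be tried outside obigrep. The following is a
                 minimal sketch, assuming the patterns behave like Python regular expressions
                 searched anywhere in the sequence, as the examples suggest. The records
                 dictionary and the select_by_sequence helper are hypothetical and are not
                 part of OBITools.

```python
import re

# Hypothetical in-memory records: identifier -> sequence (not the OBITools data model).
records = {
    "seq_a": "ttgaattcag",               # contains an EcoRI site (gaattc)
    "seq_b": "acgtnnacgt",               # contains ambiguous nucleotides (n)
    "seq_c": "cc" + "a" * 11 + "gg",     # contains a stretch of 11 A
}


def select_by_sequence(records, pattern):
    """Return the identifiers whose sequence matches the pattern, ignoring case."""
    regex = re.compile(pattern, re.IGNORECASE)
    return [ident for ident, seq in records.items() if regex.search(seq)]


ecori = select_by_sequence(records, "GAATTC")
poly_a = select_by_sequence(records, "A{10,}")
unambiguous = select_by_sequence(records, "^[ACGT]+$")
print(ecori, poly_a, unambiguous)
```

                 Because matching is case insensitive, the upper-case patterns also select
                 the lower-case sequences used here.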

       -D <REGULAR_PATTERN>, --definition=<REGULAR_PATTERN>
                 Regular expression pattern to be tested against the definition of  the  sequence
                 record. The pattern is case sensitive.

              Example:

                     > obigrep -D '[Cc]hloroplast' seq1.fasta > seq2.fasta

                 Selects  only  the  sequence  records  whose  definition contains chloroplast or
                 Chloroplast.

       -I <REGULAR_PATTERN>, --identifier=<REGULAR_PATTERN>
                 Regular expression pattern to be tested against the identifier of  the  sequence
                 record. The pattern is case sensitive.

              Example:

                     > obigrep -I '^GH' seq1.fasta > seq2.fasta

                 Selects only the sequence records whose identifier begins with GH.

       --id-list=<FILENAME>
                 <FILENAME> points to a text file containing the list of sequence record
                 identifiers to be selected.  The file format consists of a single identifier per
                 line.

              Example:

                     > obigrep --id-list=my_id_list.txt seq1.fasta > seq2.fasta

                 Selects   only   the  sequence  records  whose  identifier  is  present  in  the
                 my_id_list.txt file.
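
                 The identifier-list selection can be pictured as a set lookup. This is only a
                 sketch under stated assumptions: blank lines and surrounding whitespace in the
                 list are ignored here, which the manual does not specify, and read_id_list and
                 select_by_id are hypothetical helper names.

```python
import io


def read_id_list(handle):
    """Read one identifier per line; skipping blank lines is an assumption."""
    return {line.strip() for line in handle if line.strip()}


def select_by_id(records, wanted):
    """Keep the (identifier, sequence) pairs whose identifier is in the wanted set."""
    return [(ident, seq) for ident, seq in records if ident in wanted]


# An in-memory stand-in for a file such as my_id_list.txt.
id_file = io.StringIO("GH0001\nGH0003\n")
records = [("GH0001", "acgt"), ("GH0002", "ttga"), ("GH0003", "ccgg")]

selected = select_by_id(records, read_id_list(id_file))
print(selected)
```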

       -a <KEY>:<REGULAR_PATTERN>, --attribute=<KEY>:<REGULAR_PATTERN>
                 Regular expression pattern matched against the attributes of the sequence
                 record. The value of this option has the form key:regular_pattern. The
                 pattern is case sensitive. Several -a options can be used on the same command
                 line, in which case the selected sequence records must match all of the
                 constraints.

              Example:

                     > obigrep -a 'family_name:Asteraceae' seq1.fasta > seq2.fasta

                 Selects the sequence records containing an attribute whose  key  is  family_name
                 and value is Asteraceae.
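
                 How several key:regular_pattern constraints combine can be sketched as
                 follows. The sketch assumes that the option value is split at the first
                 colon, that attribute values are compared as strings, and that a record
                 lacking the key does not match; none of these details is stated in this
                 manual. The rank attribute and both helper names are hypothetical.

```python
import re


def parse_constraint(option_value):
    """Split 'key:regular_pattern' at the first colon (assumed behaviour)."""
    key, pattern = option_value.split(":", 1)
    return key, re.compile(pattern)  # case sensitive by default


def matches_all(attributes, constraints):
    """True only if every constraint matches the corresponding attribute."""
    for key, regex in constraints:
        if key not in attributes or not regex.search(str(attributes[key])):
            return False
    return True


constraints = [
    parse_constraint("family_name:Asteraceae"),
    parse_constraint("rank:^species$"),  # 'rank' is a made-up attribute for illustration
]

rec1 = {"family_name": "Asteraceae", "rank": "species"}
rec2 = {"family_name": "Asteraceae", "rank": "genus"}
rec3 = {"family_name": "asteraceae", "rank": "species"}

results = [matches_all(rec, constraints) for rec in (rec1, rec2, rec3)]
print(results)
```

                 The third record is rejected only because of letter case, which mirrors the
                 case-sensitive matching described above.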

       -A <KEY>, --has-attribute=<KEY>
                 Selects sequence records having an attribute whose key is <KEY>.

              Example:

                     > obigrep -A taxid seq1.fasta > seq2.fasta

                 Selects only the sequence records having a taxid attribute defined.

       -p <PYTHON_EXPRESSION>, --predicat=<PYTHON_EXPRESSION>
                 Python boolean expression to be evaluated for each sequence record. The
                 attribute keys defined for each sequence record can be used in the expression as
                 variable names.  An extra variable named ‘sequence’ refers to the sequence
                 record itself.  Several -p options can be used on the same command line, in
                 which case the selected sequence records must match all of the constraints.

              Example:

                     >  obigrep -p '(forward_error<2) and (reverse_error<2)' \
                        seq1.fasta > seq2.fasta

                 Selects   only  the  sequence  records  whose  forward_error  and  reverse_error
                 attributes have a value smaller than two.
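
                 The idea of using attribute keys as variable names can be sketched with
                 Python's built-in eval. This is an illustration, not the obigrep
                 implementation: evaluate_predicate is a hypothetical helper, and treating a
                 record that lacks one of the attributes as not selected is an assumption.

```python
def evaluate_predicate(expression, attributes, record=None):
    """Evaluate a boolean expression with attribute keys bound as variables.

    eval() runs arbitrary code, so it should only be given trusted expressions.
    """
    namespace = dict(attributes)
    namespace["sequence"] = record  # extra variable referring to the record itself
    try:
        return bool(eval(expression, {}, namespace))
    except NameError:
        # Assumption: an expression naming a missing attribute selects nothing.
        return False


expr = "(forward_error<2) and (reverse_error<2)"

good = evaluate_predicate(expr, {"forward_error": 0, "reverse_error": 1})
bad = evaluate_predicate(expr, {"forward_error": 3, "reverse_error": 0})
missing = evaluate_predicate(expr, {"forward_error": 0})
print(good, bad, missing)
```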

       -L <##>, --lmax=<##>
                 Keeps sequence records whose sequence length is equal to or shorter than lmax.

              Example:

                     > obigrep -L 100 seq1.fasta > seq2.fasta

                 Selects only the sequence records that have a sequence length equal to or
                 shorter than 100bp.

       -l <##>, --lmin=<##>
                 Selects sequence records whose sequence length is equal to or longer than lmin.

              Example:

                     > obigrep -l 100 seq1.fasta > seq2.fasta

                 Selects only the sequence records that have a sequence length equal to or
                 longer than 100bp.

       -v, --inverse-match
                 Inverts the sequence record selection.

              Example:

                     > obigrep -v -l 100 seq1.fasta > seq2.fasta

                 Selects only the sequence records that  have  a  sequence  length  shorter  than
                 100bp.
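
                 The interaction between the length bounds and the inverted selection can be
                 sketched as below. The sketch assumes that the inversion applies to the
                 combined result of all conditions, which matches the example above but is not
                 spelled out in this manual. The keep function is a hypothetical name.

```python
def keep(seq, lmin=None, lmax=None, invert=False):
    """Apply inclusive length bounds, then optionally invert the decision."""
    selected = True
    if lmin is not None:
        selected = selected and len(seq) >= lmin   # equal to or longer than lmin
    if lmax is not None:
        selected = selected and len(seq) <= lmax   # equal to or shorter than lmax
    return not selected if invert else selected


exactly_100 = "a" * 100
just_under = "a" * 99

at_bound = keep(exactly_100, lmin=100)                   # bound is inclusive
below_bound = keep(just_under, lmin=100)                 # too short
inverted = keep(just_under, lmin=100, invert=True)       # mirrors: -v -l 100
over_max = keep("a" * 101, lmax=100)                     # longer than lmax
print(at_bound, below_bound, inverted, over_max)
```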

TAXONOMY RELATED OPTIONS

       -d <FILENAME>, --database=<FILENAME>
              ecoPCR taxonomy Database name

       -t <FILENAME>, --taxonomy-dump=<FILENAME>
              NCBI Taxonomy dump repository name

       --require-rank=<RANK_NAME>
              Selects sequence records whose taxid attribute has a parent of rank <RANK_NAME>.

       -r <TAXID>, --required=<TAXID>
              required taxid

       -i <TAXID>, --ignore=<TAXID>
              ignored taxid

OPTIONS TO SPECIFY INPUT FORMAT

   Restrict the analysis to a sub-part of the input file
       --skip <N>
              The first N sequence records of the file are discarded from the analysis and are
              not reported to the output file.

       --only <N>
              Only the next N sequence records of the file are analyzed. The following sequences
              in the file are neither analyzed nor reported to the output file.  This option
              can be used together with the --skip option.
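
              Used together, the two options select a window of records. A minimal sketch,
              assuming that --only counts records after those discarded by --skip (as "the
              next N" suggests); the restrict function is a hypothetical name.

```python
from itertools import islice


def restrict(records, skip=0, only=None):
    """Discard the first `skip` records, then yield at most `only` records."""
    stop = None if only is None else skip + only
    return islice(records, skip, stop)


# Ten stand-in records numbered 0..9; skip the first 2 and analyze the next 3.
window = list(restrict(range(10), skip=2, only=3))
print(window)
```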

   Sequence annotated format
       --genbank
              Input file is in genbank format.

       --embl Input file is in embl format.

   fasta related format
       --fasta
              Input file is in fasta format (including OBITools fasta extensions).

   fastq related format
       --sanger
              Input  file  is  in  Sanger  fastq  format  (standard  fastq  used  by  HiSeq/MiSeq
              sequencers).

       --solexa
              Input file is in fastq format produced by Solexa (Ga IIx) sequencers.

   ecoPCR related format
       --ecopcr
              Input file is in ecoPCR format.

       --ecopcrdb
              Input is an ecoPCR database.

   Specifying the sequence type
       --nuc  Input file contains nucleic sequences.

       --prot Input file contains protein sequences.

OPTIONS TO SPECIFY OUTPUT FORMAT

   Standard output format
       --fasta-output
              Output sequences in OBITools fasta format

       --fastq-output
              Output sequences in Sanger fastq format

   Generating an ecoPCR database
       --ecopcrdb-output=<PREFIX_FILENAME>
              Creates an ecoPCR database from sequence records results

   Miscellaneous option
       --uppercase
              Print sequences in upper case (default is lower case)

COMMON OPTIONS

       -h, --help
              Shows this help message and exits.

       --DEBUG
              Sets logging in debug mode.

AUTHOR

       The OBITools Development Team - LECA

COPYRIGHT

       2019 - 2015, OBITool Development Team

 1.02 12                                   Jan 28, 2019                                OBIGREP(1)