Provided by: libbio-eutilities-perl_1.75-3_all

NAME

       bp_genbank_ref_extractor - Retrieves all related sequences for a list of searches on Entrez Gene

VERSION

       version 1.75

SYNOPSIS

       bp_genbank_ref_extractor [options] [Entrez Gene Queries]

DESCRIPTION

       This script searches the Entrez Gene database and retrieves not only the gene sequence but also the
       related transcript and protein sequences.

       The gene UIDs from all searches are collected before any sequence is retrieved, so each gene is
       analyzed only once even if it appears in the results of more than one search.
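
       The collection step described above can be pictured with a minimal Python sketch. It is not the
       script's own (Perl) code; the function name and the UIDs are placeholders chosen for illustration.

```python
def collect_unique_uids(results_per_search):
    """Merge the UID lists of several searches, keeping first-seen order."""
    seen = set()
    unique = []
    for uids in results_per_search:
        for uid in uids:
            if uid not in seen:
                seen.add(uid)
                unique.append(uid)
    return unique


# Placeholder UIDs, not real Entrez Gene records.
search_a = ["uid1", "uid2", "uid3"]
search_b = ["uid3", "uid4", "uid1"]
unique_uids = collect_unique_uids([search_a, search_b])
print(unique_uids)
```

       A gene returned by both searches appears only once in the merged list, so it would be retrieved
       only once.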

       Note that by default no sequences are saved (see options and examples).

OPTIONS

       Several options can be used to fine-tune the script's behaviour. It is possible to obtain extra base
       pairs upstream and downstream of the gene, and to control the naming of files and the genome assembly
       to use.

       See the KNOWN BUGS section for problems when using the default values of options.

       --assembly
           When retrieving the sequence, a specific assembly can be defined. The value expected is a regex that
           will be matched case-insensitively. If it matches more than one assembly, the first match is used.
           It defaults to "(primary|reference) assembly".
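
           A rough Python sketch of this selection rule follows. It assumes the regex is applied
           unanchored (a substring match); the function name and all assembly labels except 'Alternate
           HuRef' are hypothetical, and the script itself is written in Perl.

```python
import re

DEFAULT_ASSEMBLY = r"(primary|reference) assembly"


def pick_assembly(names, pattern=DEFAULT_ASSEMBLY):
    """Return the first name matching pattern case-insensitively, else None."""
    regex = re.compile(pattern, re.IGNORECASE)
    for name in names:
        # Assumption: unanchored search, so the pattern may match anywhere.
        if regex.search(name):
            return name
    return None


# Hypothetical assembly labels, for illustration only.
labels = ["Alternate HuRef", "Primary Assembly", "Reference assembly"]
print(pick_assembly(labels))
print(pick_assembly(labels, "alternate huref"))
```

           With the default pattern both 'Primary Assembly' and 'Reference assembly' match, and the first
           one in the list is returned.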

       --debug
           If set, even more output will be printed, which may help with debugging. Unlike the messages from
           --verbose and --very-verbose, these will not appear in the log file unless this option is selected.
           This option also sets --very-verbose.

       --downstream, --down
           Specifies the number of extra base pairs to be retrieved downstream of the gene. These extra base
           pairs will only affect the gene sequence, not the transcripts or proteins.

       --email
           A valid email used to connect to the NCBI servers. This may be used by NCBI to contact users in  case
           of problems and before blocking access in case of heavy usage.

       --format
           Specifies the format in which the sequences will be saved. Defaults to genbank format. Valid formats
           are 'genbank' or 'fasta'.

       --genes
           Specifies how gene files are named. By default, genes are not saved. If the option is given without
           a value, files are named after the gene UID. Possible values are 'uid', 'name', 'symbol' (the
           official symbol or nomenclature).

       --help
           Display the documentation (this text).

       --limit
           When making a query, limit the results to the first ones returned, up to this number. This is to
           prevent the use of especially unspecific queries, and a warning will be given if a query returns
           more results than the limit. The default value is 200. Note that this limit applies to each search.
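
           The per-search limit can be sketched in Python as below. This only illustrates the behaviour
           described here; the function name and the warning text are assumptions, not taken from the
           script.

```python
import warnings

DEFAULT_LIMIT = 200


def apply_limit(uids, limit=DEFAULT_LIMIT):
    """Keep the first `limit` results of one search, warning if any are dropped."""
    if len(uids) > limit:
        warnings.warn(
            f"query returned {len(uids)} results; keeping only the first {limit}"
        )
    return uids[:limit]


# Placeholder UIDs; a small limit is used to trigger the warning.
kept = apply_limit(["uid1", "uid2", "uid3"], limit=2)
print(kept)
```

           Because the limit is applied to each search separately, two searches can together yield up to
           twice the limit.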

       --non-coding, --nonon-coding
           Some protein coding genes have transcripts that are non-coding. By default, these sequences are saved
           as well. --nonon-coding can be used to ignore those transcripts.

       --proteins
           Specifies how protein files are named. By default, proteins are not saved. If the option is given
           without a value, files are named after the protein accession. Possible values are 'accession',
           'description', 'gene' (the corresponding gene ID) and 'transcript' (the corresponding transcript
           accession).

           Note that when not using 'accession', it is possible for files to be overwritten, since the same
           gene can encode more than one protein and different proteins can have the same description.
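
           To see why names other than the accession can collide, consider this small Python sketch. The
           records and the helper name are placeholders made up for illustration; they are not real
           RefSeq entries or part of the script.

```python
from collections import defaultdict


def find_name_collisions(proteins, key):
    """Group accessions by the field that would be used as the file name."""
    by_name = defaultdict(list)
    for protein in proteins:
        by_name[protein[key]].append(protein["accession"])
    # Only names shared by more than one protein would overwrite each other.
    return {name: accs for name, accs in by_name.items() if len(accs) > 1}


# Placeholder records, for illustration only.
proteins = [
    {"accession": "ACC_1", "gene": "GENE_A", "description": "protein x"},
    {"accession": "ACC_2", "gene": "GENE_A", "description": "protein y"},
    {"accession": "ACC_3", "gene": "GENE_B", "description": "protein x"},
]
print(find_name_collisions(proteins, "gene"))
print(find_name_collisions(proteins, "description"))
print(find_name_collisions(proteins, "accession"))
```

           Naming by 'gene' or 'description' maps two proteins to the same file name in this example,
           while naming by 'accession' produces no collision.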

       --pseudo, --nopseudo
           By default, sequences of pseudo genes will be saved. --nopseudo can be used to ignore those genes.

       --save
           Specifies the path for the directory where the sequence and log files will be saved. If the directory
           does not exist it will be created, although the path to it must exist. Files in the directory may be
           overwritten if necessary. If unspecified, a directory named 'extracted sequences' in the current
           directory will be used.

       --save-data
           This option saves the data (gene UIDs, description, product accessions, etc) to a file. As an
           optional value, the file format can be specified. Defaults to CSV.

           Currently only CSV is supported.

           Saving the data structure as a CSV file requires the installation of the Text::CSV module.

       --transcripts, --mrna
           Specifies how transcript files are named. By default, transcripts are not saved. If the option is
           given without a value, files are named after the transcript accession. Possible values are
           'accession', 'description', 'gene' (the corresponding gene ID) and 'protein' (the protein the
           transcript encodes).

           Note that when not using 'accession', it is possible for files to be overwritten, since the same
           gene can have more than one transcript and different transcripts can have the same description.
           Also, non-coding transcripts will create problems if using 'protein'.

       --upstream, --up
           Specifies the number of extra base pairs to be extracted upstream of the gene. These extra base
           pairs will only affect the gene sequence, not the transcripts or proteins.

       --verbose, --v
           If set, program becomes verbose. For an extremely verbose program, use --very-verbose instead.

       --very-verbose, --vv
           If set, program becomes extremely verbose. Setting this option automatically sets --verbose as well.
           For help in debugging, consider using --debug.

EXAMPLES

       "bp_genbank_ref_extractor --transcripts=accession '"homo sapiens"[organism] AND H2B'"
           Search Entrez Gene with the query '"homo sapiens"[organism] AND H2B', and save the transcript
           sequences of the hits. Note that the default value of --limit may cause only some of the hits to be
           extracted.

       "bp_genbank_ref_extractor --transcripts=accession --proteins=accession --format=fasta '"homo
       sapiens"[organism] AND H2B' '"homo sapiens"[organism] AND MCPH1'"
           Same as the first example, but also searches for '"homo sapiens"[organism] AND MCPH1', also saves
           the protein sequences, and saves everything in the fasta format.

       "bp_genbank_ref_extractor --genes --up=100 --down=500 '"homo sapiens"[organism] AND H2B'"
           Same search as the first example, but saves the genomic sequences instead, including 100 bp
           upstream and 500 bp downstream.

       "bp_genbank_ref_extractor --genes --assembly='Alternate HuRef' '"homo sapiens"[organism] AND H2B'"
           Same search as the first example, but saves the genomic sequences, taken from the Alternate HuRef
           genome assembly instead.

       "bp_genbank_ref_extractor --save-data=CSV '"homo sapiens"[organism] AND H2B'"
           Same search as the first example, but saves no sequences; only the results are saved, in a CSV file.

       "bp_genbank_ref_extractor --save='search results' --genes=name --upstream=200 --downstream=500 --nopseudo
       --nonon-coding --transcripts --proteins --format=fasta --save-data=CSV '"homo sapiens"[organism] AND
       H2B' '"homo sapiens"[organism] AND MCPH1'"
           Searches Entrez Gene for both '"homo sapiens"[organism] AND H2B' and '"homo sapiens"[organism] AND
           MCPH1' and saves the gene sequences of all hits (up to the default limit and ignoring
           pseudogenes), plus 200 bp upstream and 500 bp downstream of them. It will also save the sequences
           of all transcripts and proteins of each gene (but ignoring non-coding transcripts). It will save the
           sequences in the fasta format, inside a directory "search results", and save the results in a CSV
           file.

KNOWN BUGS

       •   When supplying options, it's possible to not supply a value and use their default. However, when the
           expected value is a string, the next argument may be mistaken for the option's value. For example,
           when using the following command:

           "bp_genbank_ref_extractor --transcripts 'H2A AND homo sapiens'"

           we mean to search for 'H2A AND homo sapiens' saving only the transcripts and  using  the  default  as
           base  for  the  filename. However, the search terms will be interpreted as the base for the filenames
           (but since it's not a valid identifier, it will return an error). To prevent  this,  you  can  either
           specify the values:

           "bp_genbank_ref_extractor --transcripts 'accession' 'H2A AND homo sapiens'"

           "bp_genbank_ref_extractor --transcripts='accession' 'H2A AND homo sapiens'"

           or you can use the double dash to stop processing options. Note that this should only be used after
           the last option. All arguments supplied after the double dash will be interpreted as search terms:

           "bp_genbank_ref_extractor --transcripts -- 'H2A AND homo sapiens'"

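       The same ambiguity exists in other option parsers. The Python sketch below reproduces it with the
       standard argparse module, purely as an analogy: it is not the script's own option handling, and the
       parser layout (one option with an optional value, followed by free-form queries) is an assumption
       made for illustration.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(prog="sketch")
    # nargs="?" makes the value optional; const is used when it is omitted.
    parser.add_argument("--transcripts", nargs="?", const="accession", default=None)
    # Everything that is not an option is treated as a search query.
    parser.add_argument("queries", nargs="*", default=[])
    return parser


parser = build_parser()
query = "H2A AND homo sapiens"

# The query is taken as the option's value, so no search term is left.
swallowed = parser.parse_args(["--transcripts", query])
# Giving the value explicitly removes the ambiguity.
explicit = parser.parse_args(["--transcripts=accession", query])
# The double dash ends option processing; the option falls back to its default.
separated = parser.parse_args(["--transcripts", "--", query])

for label, args in (("swallowed", swallowed), ("explicit", explicit), ("separated", separated)):
    print(label, args.transcripts, args.queries)
```

       In the first call the query ends up as the option value, while the other two calls keep it as a
       search term, mirroring the two workarounds described above.
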
NOTES ON USAGE

       •   Genes that are marked as 'live' and 'protein-coding' should have at least one transcript. However,
           this is not always true due to annotation mistakes. Such cases will throw a warning. When faced
           with this, be nice and write to the Entrez RefSeq maintainers
           <http://www.ncbi.nlm.nih.gov/RefSeq/update.cgi>.

       •   When creating the directories to save the files, if the directory already exists it will be used and
           no error or warning will be issued unless --debug has been set. If a non-directory file already
           exists with that name, bp_genbank_ref_extractor exits with an error.

       •   On the subject of verbosity, all messages are saved in the log file. The options --verbose and
           --very-verbose only affect their printing to standard output. Debug messages are different, as they
           will only show up (and be logged) if requested with --debug.

       •   When saving a file, to avoid problems with limited  filesystems  such  as  NTFS  or  FAT,  only  some
           characters  are  allowed.  All other characters will be replaced by an underscore. Allowed characters
           are:

           a-z 0-9 - +  . , () {} []'

       •   bp_genbank_ref_extractor tries to use the same file extensions that bioperl would expect when saving
           the file. If unable, it will use the '.seq' extension.
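
       The character replacement described above for file names can be sketched in Python as follows.
       Two points are assumptions not stated in this page: that letters are matched case-insensitively,
       and that a space counts as an allowed character. The function name is hypothetical.

```python
import re

# Allowed set from the list above: a-z 0-9 - + . , () {} [] '
# Assumptions: case-insensitive letters, and spaces are kept.
_DISALLOWED = re.compile(r"[^a-z0-9\-+.,(){}\[\]' ]", re.IGNORECASE)


def sanitize_filename(name):
    """Replace every character outside the allowed set with an underscore."""
    return _DISALLOWED.sub("_", name)


print(sanitize_filename("H2B/histone: type 1"))
print(sanitize_filename("a*b?c"))
```

       Characters such as '/', ':', '*' and '?' are each replaced by a single underscore, while the
       allowed characters pass through unchanged.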

FEEDBACK

   Mailing lists
       User  feedback is an integral part of the evolution of this and other Bioperl modules. Send your comments
       and suggestions preferably to the Bioperl mailing list.  Your participation is much appreciated.

         bioperl-l@bioperl.org                  - General discussion
         http://bioperl.org/wiki/Mailing_lists  - About the mailing lists

   Support
       Please direct usage questions or support issues to the mailing list: bioperl-l@bioperl.org

       rather than to the module maintainer directly. Many experienced and responsive experts will be able to
       look at the problem and quickly address it. Please include a thorough description of the problem with
       code and data examples if at all possible.

   Reporting bugs
       Report  bugs  to  the Bioperl bug tracking system to help us keep track of the bugs and their resolution.
       Bug reports can be submitted via the web:

         https://github.com/bioperl/%%7Bdist%7D

AUTHOR

       Carnë Draug <carandraug+dev@gmail.com>

COPYRIGHT

       This software is copyright (c) 2011-2015 by Carnë Draug.

       This software is available under the GNU General Public License, Version 3, June 2007.

perl v5.24.1                                       2017-04-07                       BP_GENBANK_REF_EXTRACTOR(1p)