Provided by: libboulder-perl_1.30-5.1_all

NAME

       Boulder::Genbank - Fetch Genbank data records as parsed Boulder Stones

SYNOPSIS

         use Boulder::Genbank;

         # network access via Entrez
         $gb = Boulder::Genbank->newFh( qw(M57939 M28274 L36028) );

         while ($data = <$gb>) {
             print $data->Accession;

             @introns = $data->features->Intron;
             print "There are ",scalar(@introns)," introns.\n";
             $dna = $data->Sequence;
             print "The dna is ",length($dna)," bp long.\n";

             my @features = $data->features(-type=>[ qw(Exon Source Satellite) ],
                                            -pos=>[90,310] );
             foreach (@features) {
                 print $_->Type,"\n";
                 print $_->Position,"\n";
                 print $_->Gene,"\n";
             }
         }

         # another syntax
         $gb = new Boulder::Genbank(-accessor=>'Entrez',
                                    -fetch => [qw/M57939 M28274 L36028/]);

         # local access via Yank
         $gb = new Boulder::Genbank(-accessor=>'Yank',
                                    -fetch=>[qw/M57939 M28274 L36028/]);
         while (my $s = $gb->get) {
            # etc.
         }

         # parse a file of Genbank records
         $gb = new Boulder::Genbank(-accessor=>'File',
                                    -fetch => '/usr/local/db/gbpri3.seq');
         while (my $s = $gb->get) {
            # etc.
         }

         # parse flatfile records yourself
         open (GB,"/usr/local/db/gbpri3.seq") or die "Cannot open file: $!";
         local $/ = "//\n";
         while (<GB>) {
            my $s = Boulder::Genbank->parse($_);
            # etc.
         }

DESCRIPTION

       Boulder::Genbank provides retrieval and parsing services for NCBI Genbank-format records.
       It returns Genbank entries in Stone format, allowing easy access to the various fields and
       values.  Boulder::Genbank is a descendant of Boulder::Stream, and provides a stream-like
       interface to a series of Stone objects.

       >> IMPORTANT NOTE <<

       As of January 2002, NCBI has changed their Batch Entrez interface.  I have modified
       Boulder::Genbank so as to use a "demo" interface, which fixes things, but this isn't
       guaranteed in the long run.

       I have written to NCBI, and they may fix this -- or they may not.

       >> IMPORTANT NOTE <<

       Access to Genbank is provided by three different accessors, which together give access to
       remote and local Genbank databases.  When you create a new Boulder::Genbank stream, you
       provide one of the three accessors, along with accessor-specific parameters that control
       what entries to fetch.  The three accessors are:

       Entrez
           This provides access to NetEntrez, accessing the most recent Genbank information
           directly from NCBI's Web site.  The parameters passed to this accessor are either a
           series of Genbank accession numbers, or an Entrez query (see
           http://www.ncbi.nlm.nih.gov/Entrez/linking.html).  If you provide a list of accession
           numbers, the stream will return a series of stones corresponding to the numbers.
           Otherwise, if you provide an Entrez query, the entries returned will be in the order
           returned by Entrez.

       File
           This provides access to local Genbank entries by reading from a flat file (typically
           one of the .seq files downloadable from NCBI's Web site).  The stream will return a
           Stone corresponding to each of the entries in the file, starting from the top of the
           file and working downward.  The parameter in this case is the path to the local file.

       Yank
           This provides access to local Genbank entries using Will Fitzhugh's Yank program.
           Yank provides fast indexed access to a Genbank flat file using the accession number as
           the key.  The parameter passed to the Yank accessor is a list of accession numbers.
           Stones will be returned in the requested order.  By default the yank binary lives in
           /usr/local/bin/yank.  To support other locations, you may define the environment
           variable YANK to contain the full path.
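
       The lookup rule for the yank binary described above can be sketched as follows.  The
       helper name yank_path is hypothetical and is not part of Boulder::Genbank; the
       environment is passed in as a hash reference only to make the logic easy to check.

```perl
use strict;
use warnings;

# Hypothetical helper: pick the yank binary as the text describes.
# Use $env->{YANK} if it is set and non-empty, else the documented default.
sub yank_path {
    my ($env) = @_;
    my $path = $env->{YANK};
    return (defined $path && length $path) ? $path : '/usr/local/bin/yank';
}

print yank_path(\%ENV), "\n";
```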

       It is also possible to parse a single Genbank entry from a text string stored in a scalar
       variable, returning a Stone object.
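
       As a minimal sketch of preparing such strings, the following splits a chunk of flat-file
       text held in a scalar into individual records, using the "//" terminator line shown in
       the SYNOPSIS.  The helper name split_records and the two-record sample text are
       illustrative assumptions; with Boulder installed, each record could then be passed to
       Boulder::Genbank->parse().

```perl
use strict;
use warnings;

# Split flat-file text into records.  Each record ends with a "//" line.
sub split_records {
    my ($text) = @_;
    open(my $fh, '<', \$text) or die "Cannot open in-memory filehandle: $!";
    local $/ = "//\n";                 # one record per read
    my @records;
    while (my $record = <$fh>) {
        next unless $record =~ /\S/;   # skip trailing whitespace
        push @records, $record;
    }
    close $fh;
    return @records;
}

# Made-up sample text, for illustration only.
my $text = "LOCUS       AAA\nORIGIN\n//\nLOCUS       BBB\nORIGIN\n//\n";
my @records = split_records($text);
print scalar(@records), " records\n";
```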

   Boulder::Genbank methods
       This section lists the public methods that the Boulder::Genbank class makes available.

       new()
              # Network fetch via Entrez, with accession numbers
              $gb=new Boulder::Genbank(-accessor  =>  'Entrez',
                                       -fetch     =>  [qw/M57939 M28274 L36028/]);

              # Same, but shorter and uses -> operator
              $gb = Boulder::Genbank->new(qw(M57939 M28274 L36028));

              # Network fetch via Entrez, with a query
              $query = 'Homo sapiens[Organism] AND EST[Keyword]';
              $gb=new Boulder::Genbank(-accessor  =>  'Entrez',
                                       -fetch     =>  $query);

              # Local fetch via Yank, with accession numbers
              $gb=new Boulder::Genbank(-accessor  =>  'Yank',
                                       -fetch     =>  [qw/M57939 M28274 L36028/]);

              # Local fetch via File
              $gb=new Boulder::Genbank(-accessor  =>  'File',
                                       -fetch     =>  '/usr/local/genbank/gbpri3.seq');

           The new() method creates a new Boulder::Genbank stream on the accessor provided.  The
           three possible accessors are Entrez, Yank and File.  If successful, the method returns
           the stream object.  Otherwise it returns undef.

           new() takes the following arguments:

                   -accessor       Name of the accessor to use
                   -fetch          Parameters to pass to the accessor
                   -proxy          URL of an HTTP proxy, for use when the
                                    Entrez accessor must cross a firewall.

           Specify the accessor to use with the -accessor argument.  If not specified, it
           defaults to Entrez.

           -fetch is an accessor-specific argument.  The possibilities are:

           For Entrez, the -fetch argument may point to a scalar, in which case it is interpreted
           as an Entrez query string.  See http://www.ncbi.nlm.nih.gov/Entrez/linking.html for a
           description of the query syntax.  Alternatively, -fetch may point to an array
           reference, in which case it is interpreted as a list of accession numbers to retrieve.
           If -fetch points to a hash reference, it is interpreted as extended information.  See
           "Extended Entrez Parameters" below.

           For Yank, the -fetch argument must point to an array reference containing the
           accession numbers to retrieve.

           For File, the -fetch argument must point to a string-valued scalar, which will be
           interpreted as the path to the file to read Genbank entries from.
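
           The three shapes that -fetch can take are distinguished by reference type.  The
           following is a rough sketch of that distinction, not the module's actual code; the
           helper name classify_fetch and its return labels are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: report how a -fetch value would be interpreted.
sub classify_fetch {
    my ($fetch) = @_;
    return 'missing' unless defined $fetch;
    my $type = ref($fetch);
    return 'scalar'     if $type eq '';        # Entrez query or file path
    return 'accessions' if $type eq 'ARRAY';   # list of accession numbers
    return 'extended'   if $type eq 'HASH';    # extended Entrez parameters
    return 'unsupported';
}

print classify_fetch('Homo sapiens[Organism] AND EST[Keyword]'), "\n";
print classify_fetch([qw/M57939 M28274 L36028/]), "\n";
print classify_fetch({ -db => 'n' }), "\n";
```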

           For Entrez (and Entrez only) Boulder::Genbank allows you to use a shortcut syntax in
           which you provide new() with a list of accession numbers:

             $gb = new Boulder::Genbank('M57939','M28274','L36028');

       newFh()
           This works like new(), but returns a filehandle.  To recover each GenBank record,
           read from the filehandle with the <> operator:

             $fh = Boulder::Genbank->newFh('M57939','M28274','L36028');
             while ($record = <$fh>) {
                print $record->asString;
             }

       get()
           The get() method is inherited from Boulder::Stream, and simply returns the next parsed
           Genbank Stone, or undef if there is nothing more to fetch.  It has the same semantics
           as the parent class, including the ability to restrict access to certain top-level
           tags.

           The object returned is a Stone::GB_Sequence object, which is a descendant of Stone.

       put()
           The put() method is inherited from the parent Boulder::Stream class, and will write
           the passed Stone to standard output in Boulder format.  This means that it is
           currently not possible to write a Boulder::Genbank object back into Genbank flatfile
           form.

   Extended Entrez Parameters
       The Entrez accessor recognizes extended parameters that allow you to customize
       the search.  Instead of passing a query string scalar or a list of accession numbers as
       the -fetch argument, pass a hash reference.  The hashref should contain one or more of the
       following keys:

       -query
           The Entrez query to process.

       -accession
           The list of accession numbers to fetch, as an array ref.

       -db The database to search.  This is a single-letter database code selected from the
           following list:

             m  MEDLINE
             p  Protein
             n  Nucleotide
             s  Popset

       -proxy
           An HTTP proxy to use.  For example:

              -proxy => 'http://www.firewall.com:9000'

           If you think you need this, get the correct URL from your system administrator.

       As an example, here's how to search for ESTs from Oryza sativa that have been entered or
       modified since 1999.

         my $gb = new Boulder::Genbank( -accessor => 'Entrez',
                                        -fetch    => {
                                            -query => 'Oryza sativa[Organism] AND EST[Keyword] AND 1999[MDAT]',
                                            -db    => 'n'
                                        });

METHODS DEFINED BY THE GENBANK STONE OBJECT

       Each record returned from the Boulder::Genbank stream defines a set of methods that
       correspond to features and other fields in the Genbank flat file record.
       Stone::GB_Sequence gives the full details, but they are listed for reference here:

   $length = $entry->length
       Get the length of the sequence.

   $start = $entry->start
       Get the start position of the sequence, currently always "1".

   $end = $entry->end
       Get the end position of the sequence, currently always the same as the length.

   @feature_list = $entry->features(-pos=>[50,450],-type=>['CDS','Exon'])
       features() will search the entry feature list for those features that meet certain
       criteria.  The criteria are specified using the -pos and/or -type argument names, as shown
       below.

       -pos
           Provide a position or range of positions which the feature must overlap.  A single
           position is specified in this way:

              -pos => 1500;         # feature must overlap position 1500

           or a range of positions in this way:

              -pos => [1000,1500];  # 1000 to 1500 inclusive

           If no criteria are provided, then features() returns all the features, and is
           equivalent to calling the Features() accessor.

       -type, -types
           Filter the list of features by type or a set of types.  Matches are case-insensitive,
           so "exon", "Exon" and "EXON" are all equivalent.  You may call with a single type as
           in:

              -type => 'Exon'

           or with a list of types, as in

              -types => ['Exon','CDS']

           The names "-type" and "-types" can be used interchangeably.
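
       The two criteria can be pictured with the sketch below.  It illustrates the overlap and
       case-insensitive type rules described above and is not the Stone::GB_Sequence
       implementation.  The helper name feature_matches and the plain-hash feature layout
       (type, start, end) are assumptions made for the example.

```perl
use strict;
use warnings;

# Hypothetical helper: does a feature satisfy the -pos and -type criteria?
# $feature is a plain hash: { type => ..., start => ..., end => ... }.
sub feature_matches {
    my ($feature, %criteria) = @_;

    if (defined(my $pos = $criteria{'-pos'})) {
        # A single position is treated as a range of length one.
        my ($lo, $hi) = ref($pos) eq 'ARRAY' ? @$pos : ($pos, $pos);
        return 0 unless $feature->{start} <= $hi && $feature->{end} >= $lo;
    }

    # -type and -types are interchangeable.
    my $types = defined $criteria{'-type'} ? $criteria{'-type'} : $criteria{'-types'};
    if (defined $types) {
        my @wanted = ref($types) eq 'ARRAY' ? @$types : ($types);
        return 0 unless grep { lc($_) eq lc($feature->{type}) } @wanted;
    }
    return 1;   # no criteria given, or all criteria satisfied
}

my $exon = { type => 'Exon', start => 1012, end => 1099 };
print "match\n" if feature_matches($exon, -pos => [1000, 1500], -types => ['exon', 'CDS']);
```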

   $seqObj = $entry->bioSeq;
       Returns a Bio::Seq object from the Bioperl project.  Dies with an error message unless the
       Bio::Seq module is installed.

OUTPUT TAGS

       The tags returned by the parsing operation are taken from the NCBI ASN.1 schema.  For
       consistency, they are normalized so that the initial letter is capitalized, and all
       subsequent letters are lowercase.  This section contains an abbreviated list of the most
       useful/common tags.  See "The NCBI Data Model", by James Ostell and Jonathan Kans in
       "Bioinformatics: A Practical Guide to the Analysis of Genes and Proteins" (Eds. A.
       Baxevanis and F. Ouellette), pp 121-144 for the full listing.
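
       The normalization rule stated above amounts to the following one-line transformation.
       The helper name normalize_tag is hypothetical; this is only a sketch of the rule as
       described, and the module's own code may differ.

```perl
use strict;
use warnings;

# Hypothetical helper: capitalize the first letter, lowercase the rest.
sub normalize_tag {
    my ($tag) = @_;
    return ucfirst(lc($tag));
}

print join(' ', map { normalize_tag($_) } qw(CDS mRNA polyA_site db_xref)), "\n";
```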

   Top-Level Tags
       These are tags that appear at the top level of the parsed Genbank entry.

       Accession
           The accession number of this entry.  Because of the vagaries of the Genbank data
           model, an entry may have multiple accession numbers (e.g. after a merging operation).
           Accession may therefore be a multi-valued tag.

           Example:

                 my $accessionNo = $s->Accession;

       Authors
           The list of authors, as they appear on the AUTHORS line of the Genbank record.  No
           attempt is made to parse them into individual authors.

       Basecount
           The nucleotide basecount for the entry.  It is presented as a Boulder Stone with keys
           "a", "c", "t" and "g".  Example:

                my $A = $s->Basecount->A;
                my $C = $s->Basecount->C;
                my $G = $s->Basecount->G;
                my $T = $s->Basecount->T;
                print "GC content is ",($G+$C)/($A+$C+$G+$T),"\n";

       Blob
           The entire flatfile record as an unparsed chunk of text (a "blob").  This is a handy
           way of reassembling the record for human inspection.

       Comment
           The COMMENT line from the Genbank record.

       Definition
           The DEFINITION line from the Genbank record, unmodified.

       Features
           The FEATURES table.  This is a complex stone object with multiple subtags.  See
           "The Features Tag" for details.

       Journal
           The JOURNAL line from the Genbank record, unmodified.

       Keywords
           The KEYWORDS line from the Genbank record, unmodified.  No attempt is made to parse
           the keywords into separate values.

           Example:

               my $keywords = $s->Keywords;

       Locus
           The LOCUS line from the Genbank record.  It is not further parsed.

       Medline, Nid
           References to other database accession numbers.

       Organism
           The taxonomic name of the organism from which this entry was derived. This line is
           taken from the Genbank entry unmodified.  See the NCBI data model documentation for an
           explanation of their taxonomic syntax.

       Reference
           The REFERENCE line from the Genbank entry.  There are often multiple Reference lines.
           Example:

             my @references = $s->Reference;

       Sequence
           The DNA or RNA sequence of the entry.  This is presented as a single lower-case
           string, with all base numbers and formatting characters removed.

       Source
           The entry's SOURCE field, which often gives clues on how the sequencing was performed.

       Title
           The TITLE field from the paper describing this entry, if any.

   The Features Tag
       The Features tag points to a Stone record that contains multiple subtags.  Each subtag is
       the name of a feature which points, in turn, to a Stone that describes the feature's
       location and other attributes.  The full list of features is beyond the scope of this
       document, but the following are the features that are most often seen:

               Cds             a CDS
               Intron          an intron
               Exon            an exon
               Gene            a gene
               Mrna            an mRNA
               Polya_site      a putative polyadenylation signal
               Repeat_unit     a repetitive region
               Source          More information about the organism and cell
                               type the sequence was derived from
               Satellite       a microsatellite (dinucleotide repeat)

       Each feature will contain one or more of the following subtags:

       Db_xref
           A cross-reference to another database in the form DB_NAME:accession_number.  See the
           NCBI Web site for a description of these cross references.

       Evidence
           The evidence for this feature, either "experimental" or "predicted".

       Gene
           If the feature involves a gene, this will be the gene's name (or one of its names).
           This subtag is often seen in "Gene" and "Cds" features.

           Example:

                   foreach ($s->Features->Cds) {
                      my $gene = $_->Gene;
                      my $position = $_->Position;
                      print "Gene $gene ($position)\n";
                   }

       Map If the feature is mapped, this provides a map position, usually as a cytogenetic band.

       Note
           A grab bag for various text notes.

       Number
           When multiple features of this type occur, this field is used to number them.
           Ordinarily this field is not needed because Boulder::Genbank preserves the order of
           features.

       Organism
           If the feature is Source, this provides the source organism.

       Position
           The position of this feature, usually expressed as a range (1970..1975).

       Product
           The protein product of the feature, if applicable, as a text string.

       Translation
           The protein translation of the feature, if applicable.
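
       Because the Position subtag is returned as text, a caller who needs numeric coordinates
       has to parse it.  The sketch below handles only the two simplest forms, a plain range
       such as 1970..1975 and a single base such as 1989.  Compound positions such as join(...)
       and order(...), and fuzzy ends, are deliberately left unhandled.  The helper name
       simple_range is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: return (start, end) for simple positions,
# or an empty list for anything more complex.
sub simple_range {
    my ($position) = @_;
    return ($1, $2) if $position =~ /^(\d+)\.\.(\d+)$/;   # e.g. 1970..1975
    return ($1, $1) if $position =~ /^(\d+)$/;            # e.g. 1989
    return;   # join(...), order(...), fuzzy ends: not handled here
}

for my $position ('1970..1975', '1989', '1182..(1989.1998)') {
    my @range = simple_range($position);
    print "$position => ", (@range ? "@range" : 'not a simple range'), "\n";
}
```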

SEE ALSO

       Boulder, Boulder::Blast

AUTHOR

       Lincoln Stein <lstein@cshl.org>.

       Copyright (c) 1997-2000 Lincoln D. Stein

       This library is free software; you can redistribute it and/or modify it under the same
       terms as Perl itself.  See DISCLAIMER.txt for disclaimers of warranty.

EXAMPLE GENBANK OBJECT

       The following is an excerpt from a moderately complex Genbank Stone.  The Sequence line
       and several other long lines have been truncated for readability.

        Authors=Spritz,R.A., Strunk,K., Surowy,C.S.O., Hoch,S., Barton,D.E. and Francke,U.
        Authors=Spritz,R.A., Strunk,K., Surowy,C.S. and Mohrenweiser,H.W.
        Locus=HUMRNP7011   2155 bp    DNA             PRI       03-JUL-1991
        Accession=M57939
        Accession=J04772
        Accession=M57733
        Keywords=ribonucleoprotein antigen.
        Sequence=aagcttttccaggcagtgcgagatagaggagcgcttgagaaggcaggttttgcagcagacggcagtgacagcccag...
        Definition=Human small nuclear ribonucleoprotein (U1-70K) gene, exon 10 and 11.
        Journal=Nucleic Acids Res. 15, 10373-10391 (1987)
        Journal=Genomics 8, 371-379 (1990)
        Nid=g337441
        Medline=88096573
        Medline=91065657
        Features={
          Polya_site={
            Evidence=experimental
            Position=1989
            Gene=U1-70K
          }
          Polya_site={
            Position=1990
            Gene=U1-70K
          }
          Polya_site={
            Evidence=experimental
            Position=1992
            Gene=U1-70K
          }
          Polya_site={
            Evidence=experimental
            Position=1998
            Gene=U1-70K
          }
          Source={
            Organism=Homo sapiens
            Db_xref=taxon:9606
            Position=1..2155
            Map=19q13.3
          }
          Cds={
            Codon_start=1
            Product=ribonucleoprotein antigen
            Db_xref=PID:g337445
            Position=join(M57929:329..475,M57930:183..245,M57930:358..412, ...
            Gene=U1-70K
            Translation=MTQFLPPNLLALFAPRDPIPYLPPLEKLPHEKHHNQPYCGIAPYIREFEDPRDAPPPTR...
          }
          Cds={
            Codon_start=1
            Product=ribonucleoprotein antigen
            Db_xref=PID:g337444
            Evidence=experimental
            Position=join(M57929:329..475,M57930:183..245,M57930:358..412, ...
            Gene=U1-70K
            Translation=MTQFLPPNLLALFAPRDPIPYLPPLEKLPHEKHHNQPYCGIAPYIREFEDPR...
          }
          Polya_signal={
            Position=1970..1975
            Note=putative
            Gene=U1-70K
          }
          Intron={
            Evidence=experimental
            Position=1100..1208
            Gene=U1-70K
          }
          Intron={
            Number=10
            Evidence=experimental
            Position=1100..1181
            Gene=U1-70K
          }
          Intron={
            Number=9
            Evidence=experimental
            Position=order(M57937:702..921,1..1011)
            Note=2.1 kb gap
            Gene=U1-70K
          }
          Intron={
            Position=order(M57935:272..406,M57936:1..284,M57937:1..599, <1..>1208)
            Gene=U1-70K
          }
          Intron={
            Evidence=experimental
            Position=order(M57935:284..406,M57936:1..284,M57937:1..599, <1..>1208)
            Note=first gap-0.14 kb, second gap-0.62 kb
            Gene=U1-70K
          }
          Intron={
            Number=8
            Evidence=experimental
            Position=order(M57935:272..406,M57936:1..284,M57937:1..599, <1..>1181)
            Note=first gap-0.14 kb, second gap-0.62 kb
            Gene=U1-70K
          }
          Exon={
            Number=10
            Evidence=experimental
            Position=1012..1099
            Gene=U1-70K
          }
          Exon={
            Number=11
            Evidence=experimental
            Position=1182..(1989.1998)
            Gene=U1-70K
          }
          Exon={
            Evidence=experimental
            Position=1209..(1989.1998)
            Gene=U1-70K
          }
          Mrna={
            Product=ribonucleoprotein antigen
            Position=join(M57928:358..668,M57929:319..475,M57930:183..245, ...
            Gene=U1-70K
          }
          Mrna={
            Product=ribonucleoprotein antigen
            Citation=[2]
            Evidence=experimental
            Position=join(M57928:358..668,M57929:319..475,M57930:183..245, ...
            Gene=U1-70K
          }
          Gene={
            Position=join(M57928:207..719,M57929:1..562,M57930:1..577, ...
            Gene=U1-70K
          }
        }
        Reference=1  (sites)
        Reference=2  (bases 1 to 2155)
        =
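
       The nested tag=value layout shown above can be read back into a Perl data structure with
       a few lines of code.  The following is a minimal sketch based only on the layout visible
       in this example (tag=value lines, braces for nesting, a lone "=" ending the record).  It
       is not the parser used by Boulder or Stone and does not handle escaped characters.  The
       helper name parse_record and the shortened sample record are assumptions made for the
       example.

```perl
use strict;
use warnings;

# Parse "Tag=Value" lines with {...} nesting into a hash reference.
# Every tag maps to an array reference, because tags may repeat.
sub parse_record {
    my ($lines) = @_;                 # array ref of lines, consumed as we go
    my %record;
    while (defined(my $line = shift @$lines)) {
        $line =~ s/^\s+|\s+$//g;      # trim indentation
        last if $line eq '}' || $line eq '=';   # end of sub-record or record
        next unless $line =~ /^([^=]+)=(.*)$/;
        my ($tag, $value) = ($1, $2);
        $value = parse_record($lines) if $value eq '{';
        push @{ $record{$tag} }, $value;
    }
    return \%record;
}

my @lines = split /\n/, <<'END';
Accession=M57939
Accession=J04772
Features={
  Intron={
    Number=10
    Position=1100..1181
  }
  Exon={
    Number=11
    Position=1182..(1989.1998)
  }
}
=
END

my $stone = parse_record(\@lines);
print "Accessions: @{ $stone->{Accession} }\n";
print "Exon position: $stone->{Features}[0]{Exon}[0]{Position}[0]\n";
```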