Provided by: libnhgri-blastall-perl_0.66-1_all

NAME

       NHGRI::Blastall - Perl extension for running and parsing NCBI's BLAST 2.x

SYNOPSIS

DESCRIPTION

       If you have NCBI's BLAST2 or WU-BLAST installed locally and your environment is already
       set up, you can use Perl's object-oriented capabilities to run your BLASTs.  If you also
       have a blastcl3 binary from the toolkit (or binaries from our FTP site), you can run BLAST
       over the network.  There are also methods to blast single sequences against each other
       using the bl2seq binary (also in the toolkit and binaries).  You can blast one sequence
       against a library of sequences using the blast_one_to_many method.  You can format
       databases with the formatdb method.  You can also have NHGRI::Blastall read existing BLAST
       reports.  If you have a database of repetitive DNA or other DNA you would like to mask
       out, you can use the mask method to mask the data against these databases.  You can then
       use either the filter or result methods to parse the report and access the various
       elements of the data.

       RUNNING NEW BLASTS
             use NHGRI::Blastall;
             my $b = new NHGRI::Blastall;
             # If you are running NCBI's Local BLAST
             $b->blastall( p => 'blastn',
                           d => 'nr',
                           i => 'infile',
                           o => 'outfile'
                         );
             # If you are running NCBI's blastcl3 network client
             $b->blastcl3( p => 'blastn',
                           d => 'nr',
                           i => 'infile',
                           o => 'outfile'
                         );
             # If you are running WU-BLAST locally
             $b->wu_blastall( p      => 'blastn',
                              d      => 'nr',
                              nogap  => '!',     #use ! for arguments w/o parameter
                              i      => 'infile',
                              o      => 'outfile'
                            );

             See BLASTALL for more info

       BLASTING 2 SEQUENCES
             use NHGRI::Blastall;
             my $b = new NHGRI::Blastall;
             $b->bl2seq(i => 'seq1',
                        j => 'seq2',
                        p => 'tblastx'
                       );

             See BL2SEQ for more info

       BLASTING 1 SEQUENCE AGAINST A FASTA LIBRARY OF SEQUENCES
             # a library is a FASTA file with multiple FASTA formatted sequences
             use NHGRI::Blastall;
             my $b = new NHGRI::Blastall;
             $b->blast_one_to_many(i => 'seq1',
                                   d => 'seq2.lib',
                                   p => 'tblastx',
                                  );

             See BLAST_ONE_TO_MANY for more info

       INITIALIZING EXISTING BLAST REPORTS
             use NHGRI::Blastall;
             my $b = new NHGRI::Blastall;
             $b->read_report('/path/to/report');

       MASKING SEQUENCES
             use NHGRI::Blastall;
             my $b = new NHGRI::Blastall;
             $masked_seq = $b->mask( type => 'wu_blastall',
                                     p    => 'blastn',
                                     d    => 'alu',
                                     i    => 'infile'
                                   );

             See MASKING for more info

       CREATING BLAST INDEXES
             use NHGRI::Blastall;
             my $b = new NHGRI::Blastall;
             $b->formatdb( i => 'est',
                           p => 'F',
                           o => 'T',
                         );

             See FORMATDB for more info

       PRINTING REPORTS
             $b->print_report();
             # this method only opens the report and prints.  It does not print
             # summary reports

       FILTERING BLAST RESULTS
             @hits = $b->filter( scores     => '38.2',
                                 identities => '.98'
                               );

             # returns an array of hash references.
             See HASHREF for more info on manipulating the results.
             See FILTERING for more info on using the filter method

       GETTING AT ELEMENTS
             @ids = $b->result('id');
             @scores = $b->result('scores',$ids[0]);  # second param must be an id

             See RESULT for more info on using the result method
             See ELEMENTS for element names

       GETTING AT ALL THE DATA
             @results = $b->result();  # returns an array of hashes

             See HASHREF for information on the array of hashes that is returned.
             See DUMP RESULTS to see how to work with the array of hashes

       ADJUSTING THE DEFLINE REGEX
             $b = new NHGRI::Blastall (-DB_ID_REGEX => '[^ ]+');

             See DB_ID_REGEX for more info

BLASTALL

       This method provides a simple object-oriented frontend to BLAST.  This module works with
       NCBI's blastall binary distributed with BLAST 2.x, with WU-BLAST, or over the web
       through NCBI's Web site.  The blastall method accepts key/value pairs in which the keys
       are the command line options (see BLASTALL OPTIONS) and the values are the
       corresponding values for those options.  You may want to set the BLASTALL variable in
       Blastall.pm to the full path of your `blastall' binary, especially if you will be running
       scripts as cron jobs or if blastall is not in the system path.

BLASTALL OPTIONS

       For wu_blastall you need to use NCBI-style switches for the following:
          -i for infile
          -o for outfile
          -p for program
          -d for database
       The rest of the parameters MUST be parameters available through WU-BLAST (e.g. -sump,
       -nogap, -compat1.4, etc.).  Use a `!' to specify that an argument takes no parameter.
       See the example at the top of the manpage.

       These are the options that NCBI's blastall binary accepts; the same options are
       accepted by the blastall and blastcl3 methods.  NOTE: You must set the proper
       environment variables (BLASTDB, BLASTMAT) for the blastall method to work.

       •   p => Program Name

       •   d => Database                                           default=nr

       •   i => QueryFile

       •   e => Expectation value (E)                              default=10.0

       •   m => alignment view                                     default=0
               0 = pairwise,
               1 = master-slave showing identities,
               2 = master-slave no identities,
               3 = flat master-slave, show identities,
               4 = flat master-slave, no identities,
               5 = master-slave no identities and blunt ends,
               6 = flat master-slave, no identities and blunt ends

       •   o => BLAST report Output File                           default=stdout

       •   F => Filter query sequence                              default=T
               (DUST with blastn, SEG with others)

       •   G => Cost to open a gap                                 default=0
               (zero invokes default behavior)

       •   E => Cost to extend a gap                               default=0
               (zero invokes default behavior)

       •   X => X dropoff value for gapped alignment (in bits)     default=0
               (zero invokes default behavior)

       •   I => Show GI's in deflines                              default=F

       •   q => Penalty for a nucleotide mismatch (blastn only)    default=-3

       •   r => Reward for a nucleotide match (blastn only)        default=1

       •   v => Number of one-line descriptions (V)                default=500

       •   b => Number of alignments to show (B)                   default=250

       •   f => Threshold for extending hits, default if zero      default=0

       •   g => Perform gapped alignment (NA with tblastx)         default=T

       •   Q => Query Genetic code to use                          default=1

       •   D => DB Genetic code (for tblast[nx] only)              default=1

       •   a => Number of processors to use                        default=1

       •   O => SeqAlign file                                      Optional

       •   J => Believe the query defline                          default=F

       •   M => Matrix                                             default=BLOSUM62

       •   W => Word size, default if zero                         default=0

       •   z => Effective length of the database                   default=0
                (use zero for the real size)

       •   K => Number of best hits from a region to keep          default=100

       •   L => Length of region used to judge hits                default=20

       •   Y => Effective length of the search space               default=0
                (use zero for the real size)

       •   S => Query strands to search against database           default=3
                (for blast[nx], and tblastx).
                3 is both, 1 is top, 2 is bottom

       •   T => Produce HTML output [T/F]                          default=F

       •   l => Restrict search of database to list of GI's [String]

       NOTE: If you do not supply an `o' option (outfile),
             the following environment variables are checked in order:
             `TMPDIR', `TEMP', and `TMP'.
             If one of them is set, outfiles are created relative to the
             directory it specifies.  If none of them is set, the first
             available one of the following directories is used:
             /var/tmp, /usr/tmp, /temp, /tmp.
             The outfile is deleted after the NHGRI::Blastall object is destroyed.
             It is recommended that you create a tmp directory in your home
             directory, set one of the above environment variables to point
             to it, and set the permissions on this directory
             to 0700.  Writing to a "public" tmp directory can have
             security ramifications.
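
       The recommendation above can be sketched in plain Perl.  This is only an illustration:
       the sub name use_private_tmpdir and the directory name blast_tmp are made up for this
       example, and a throwaway parent directory stands in for your home directory.  The
       variable would presumably need to be set before a BLAST method is called without an
       `o' option.

```perl
use strict;
use warnings;
use File::Spec;
use File::Temp qw(tempdir);

# Create (if needed) a private scratch directory under $parent, restrict it
# to the owner (mode 0700) and point TMPDIR at it.
# "blast_tmp" is an arbitrary name chosen for this sketch.
sub use_private_tmpdir {
    my ($parent) = @_;
    my $dir = File::Spec->catdir($parent, 'blast_tmp');
    unless (-d $dir) {
        mkdir $dir, 0700 or die "Cannot create $dir: $!";
    }
    chmod 0700, $dir or die "Cannot chmod $dir: $!";
    $ENV{TMPDIR} = $dir;
    return $dir;
}

# Demo: a temporary parent stands in for $ENV{HOME}; in real use you would
# pass your home directory instead.
my $parent = tempdir(CLEANUP => 1);
my $dir    = use_private_tmpdir($parent);
printf "TMPDIR=%s mode=%04o\n", $dir, (stat $dir)[2] & 07777;
```

       Setting TMPDIR rather than TEMP or TMP is a choice made here because TMPDIR is the
       first variable checked in the order listed above.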

BL2SEQ

       This method uses the bl2seq binary (distributed with BLAST executables and source) to
       BLAST one sequence against another sequence.  Like the blastall method, the bl2seq method
       accepts the same options that its binary accepts.  Run bl2seq without options from
       the command line to get a full list of options.  An important note about the options: when
       running blastx, the 1st sequence should be nucleotide; when running tblastn, the 2nd
       sequence should be nucleotide.

         use NHGRI::Blastall;
         my $b = new NHGRI::Blastall;
         $b->bl2seq(i => 'seq1.nt',
                    j => 'seq2.aa',
                    p => 'blastx'
                   );

BLAST_ONE_TO_MANY

       This method allows for blasting one sequence against a FASTA library of sequences.  Behind
       the scenes, BLAST indexes are created (in the same directory as the FASTA library) using
       the provided FASTA library and the one sequence is used to search against this database.
       If the program completes successfully, the databases are removed.  To compare two
       sequences, use the bl2seq method which is faster and less messy (no tmp indexes).   This
       method accepts the same options as the blastall binary with the d option corresponding to
       the FASTA library.

         use NHGRI::Blastall;
         my $b = new NHGRI::Blastall;
         $b->blast_one_to_many(i => 'seq.aa',
                               d => 'seq.nt.lib',
                               e => '0.001',
                               p => 'tblastn',
                              );

MASKING

       Screens DNA sequences in FASTA format against the database specified in the blastall 'd'
       option.  The mask method accepts the same parameters as the blastall method.  Any matches
       to the masking database will be substituted with "N"s.  The mask method returns the masked
       sequence.  It performs a function similar to xblast, an old NCBI program written in C.

       Set the type parameter to wu_blastall, blastcl3 or blastall depending on your
       configuration.

         $masked_seq = $b->mask( type => 'blastcl3', # defaults to blastall
                                 p    => 'blastn',
                                 d    => 'alu',
                                 i    => 'infile'
                               );

       To get the mask coordinates back, call the mask method in list context.

           @mask = $b->mask(p    => 'blastn',
                            d    => 'alu',
                            i    => 'infile'
                           );
           $masked_seq = $mask[0];        # same as above masked seq
           $ra_masked_coords = $mask[1];  # reference to array of mask coordinates

FORMATDB

       This method creates BLAST indexes using the formatdb binary, which is distributed with
       BLAST.  It accepts the same parameters as formatdb.  The remove_formatdb_indexes method
       will remove databases created using the formatdb method (if called by the same object).
       By default formatdb leaves a file called formatdb.log in the current working directory
       (if it has permission).  To change this behavior, use the l option to direct the log to
       /dev/null or somewhere else.

         use NHGRI::Blastall;
         my $b = new NHGRI::Blastall;
         $b->formatdb( i => 'swissprot',
                       p => 'T',
                       l => '/dev/null',
                       o => 'T',
                     );

DB_ID_REGEX

       By default Blastall.pm expects FASTA deflines of BLAST databases to be formatted like
       GenBank databases (gi|GINUMBER|DATABASE|ACCESSION|SUMMARY).  The default regular
       expression is

           [^\|]+(?:\|[^\|,\s]*){1,10}

       When using deflines that are not GenBank-formatted, it may become necessary to adjust
       the regular expression that identifies the unique identifier in a defline.  This can be
       done with the -DB_ID_REGEX parameter to the new method.  For example:

           $b = new NHGRI::Blastall( -DB_ID_REGEX => '[^ ]+' );
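
       To see why the adjustment can be needed, the two patterns from this section can be
       tried on sample deflines in plain Perl.  This is a sketch only: how the module anchors
       the pattern internally is not described here, so matching at the start of the defline
       and stripping a leading `>' are assumptions, and extract_id is a hypothetical helper,
       not part of NHGRI::Blastall.  Both sample deflines are made up.

```perl
use strict;
use warnings;

# Default pattern as quoted in this section, and the simpler alternative.
my $default_re = '[^\|]+(?:\|[^\|,\s]*){1,10}';
my $simple_re  = '[^ ]+';

# Return the identifier matched at the start of a defline, or undef.
# Anchoring with ^ and dropping a leading '>' are assumptions of this sketch.
sub extract_id {
    my ($pattern, $defline) = @_;
    $defline =~ s/^>//;
    return $defline =~ /^($pattern)/ ? $1 : undef;
}

my @deflines = (
    'gi|1234567|gb|U19386|MMU19386 Mus musculus mRNA',   # made-up GI number
    'contig_42 assembled from local reads',              # made-up local defline
);

for my $defline (@deflines) {
    for my $re ($default_re, $simple_re) {
        my $id = extract_id($re, $defline);
        print "pattern $re\n  on '$defline'\n  gives ",
              (defined $id ? $id : '(no match)'), "\n";
    }
}
```

       In this sketch the default pattern finds nothing in the defline without pipes, while
       '[^ ]+' takes everything up to the first space in both deflines.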

FILTERING

       The filter method accepts key/value pairs in which the keys are elements of the blast
       report and the values are limits that are put on the result set.

       The following are the Filter elements and their default operations.

           id                  => regular expression match
           defline             => regular expression match
           subject_length      => greater than
           scores              => greater than
           expects             => less than
           identities          => greater than
           match_length        => greater than
           subject_strand      => equals
           query_frames        => equals
           subject_frames      => equals

       So if you would like to limit your results to entries that have scores greater than 38.2
       and identities greater than 98% you would say...

           @hits = $b->filter( scores      => '38.2',
                               identities  => '.98'
                             );

       You can also override the defaults.  If you would like only scores that are less than 38.2
       you could say...

           @hits = $b->filter( scores => '<38.2' );

       Or if you wanted only identities that were equal to 1 and you didn't care about the hits
       array you could say...

           $b->filter( identities => '=1' );

       Regular expression matches are case insensitive.  If you wanted only records with the word
       "human" in the defline you could say...

           @hits = $b->filter( defline => 'HuMaN' );

       After you run the filter method on an object, the object only contains those results which
       passed the filter.  This will affect additional calls to the filter method as well as calls
       to other methods (e.g. result).  To reset the NHGRI::Blastall object you can use the
       unfilter method.

           $b->unfilter();

       See DUMP RESULTS for info on how to manipulate the array of hash refs.
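
       The calls in this section can be combined into one small routine.  The following is a
       sketch rather than a tested recipe: it uses only methods shown in this manpage
       (read_report, filter, result, unfilter), the sub name high_identity_ids is made up,
       and '/path/to/report' is a placeholder.  It returns an empty list when
       NHGRI::Blastall is not installed or the report file does not exist.

```perl
use strict;
use warnings;

# Return the ids of the hits that pass the filter, or an empty list when
# the module cannot be loaded or the report file is missing.
sub high_identity_ids {
    my ($report) = @_;                      # path to an existing BLAST report
    return () unless -e $report;
    return () unless eval { require NHGRI::Blastall; 1 };

    my $blast = NHGRI::Blastall->new;
    $blast->read_report($report);

    # Default operations apply: scores greater than 38.2,
    # identities greater than .98
    $blast->filter( scores => '38.2', identities => '.98' );
    my @ids = $blast->result('id');         # only hits that passed the filter

    $blast->unfilter();                     # restore the full result set
    return @ids;
}

# Placeholder path; pass a real report as the first command-line argument.
my $report = @ARGV ? $ARGV[0] : '/path/to/report';
print "$_\n" for high_identity_ids($report);
```

       Calling unfilter before returning leaves the object in its original state, so later
       calls to filter or result see every hit again.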

RESULT

         The result method has 3 possible invocations.  The first invocation
         is when it is called without parameters.

         @results = $b->result();

         This invocation returns an array of hash references.
         See HASHREF for further explanation of this structure.

         To get a list of all the ids do...

         @ids = $b->result('id');

         These ids can be used to get at specific elements.  If 2 parameters
         are present and the first one is an element (See ELEMENTS for a list
         of ELEMENTS) and the second one is an id then the routine will
         return a list of elements corresponding to the id.

         @scores = $b->result('scores',$ids[0]);  # second param must be an id

         If more than 2 parameters are passed the method will return undef.

ACCESSOR METHODS

       get_report
               returns the filename of the BLAST report.

       get_database_description
               returns description given to the database during formatting of db.
               e.g. All non-redundant GenBank CDStranslations+PDB+SwissProt+PIR

       get_database_sequence_count
               returns the number of sequences in the database.

       get_database_letter_count
               returns the number of total letters in the database.

       get_blast_program
               returns the BLAST program name that appears at the top of the report.
               either BLASTN, BLASTP, BLASTX, TBLASTN or TBLASTX

       get_blast_version
               returns the version of the BLAST program that was used.

ELEMENTS

       id
                An example of an id is `>gb|U19386|MMU19386'.  The initial `>'
                is just a flag.  The characters up to the first pipe name
                the database the subject was taken from.  The characters
                up to the next pipe are the GenBank accession number.  The last
                characters are the locus.  This element is used as a unique
                identifier by the NHGRI::Blastall module.
                (SCALAR)

       defline
               The definition line taken from the subject
                (SCALAR)

       subject_length
               This is the length of the full subject, not the
                length of the match.
                (SCALAR)

       scores
               This is the score (in bits) of the match.
                (ARRAY)

       expects
               This is the statistical significance (`E value') for the match.
                (ARRAY)

       identities
               This is the number of identities divided by the match
               length in decimal format. (Listed as a fraction and a percentage
               in a BLAST report.)
               (ARRAY)

       match_lengths
               This is the number of base pairs that match up.
               (ARRAY)

       query_starts
               This is the number of the first base which matched
               with the subject.
               (ARRAY)

       query_ends
               This is the number of the last base which matched
               with the subject.
               (ARRAY)

       subject_starts
               This is the number of the first base which matched
               with the query.
               (ARRAY)

       subject_ends
               This is the number of the last base which matched
               with the query.
               (ARRAY)

       subject_strands
               This is either plus or minus depending on the orientation
               of the subject sequence in the match.
               (ARRAY)

       query_strands
               This is either plus or minus depending on the orientation
               of the query sequence in the match.
               (ARRAY)

       query_frames
               If you are running a blastx or tblastx search in which the
               query_sequence is translated this is the frame the query
               sequence matched.
               (ARRAY)

       subject_frames
               If you are running a tblastn or tblastx search in which the
               subject sequence is translated, this is the frame where the
               subject sequence matched.
               (ARRAY)

HASHREF

         Each hash ref contains an id, a defline and a subject_length.  Because
         one subject can have multiple matches, the per-match elements (scores,
         expects, identities, match_lengths, query_starts, query_strands,
         subject_starts and so on) are stored as array references.  The
         following is an array containing two hash refs.

         @hits = (
             {'id'                 => '>gb|U79716|HSU79716',
              'defline'            => 'Human reelin (RELN) mRNA, complete cds',
              'subject_length'     => '11580',
              'scores'             => [ 684, 123               ],
              'expects'            => [ 0.0, 3e-26             ],
              'identities'         => [ .99430199, .89256198   ],
              'match_lengths'      => [ 351, 121               ],
              'query_starts'       => [ 3, 404                 ],
              'query_ends'         => [ 303, 704               ],
              'subject_starts'     => [ 5858, 6259             ],
              'subject_ends'       => [ 6158, 6559             ],
              'subject_strands'    => [ 'plus', 'minus'        ],
              'query_strands'      => [ 'plus', 'plus'         ],
              'query_frames'       => [ '+1', '-3'             ],
              'subject_frames'     => [ '+2', '-1'             ],
             },
             {'id'                 => '>gb|U24703|MMU24703',
              'defline'            => 'Mus musculus reelin mRNA, complete cds',
              'subject_length'     => '11673',
              'scores'             => [ 319, 38.2              ],
              'expects'            => [ 2e-85, 1.2             ],
              'identities'         => [ .86455331, 1           ],
              'match_lengths'      => [ 347, 19                ],
              'query_starts'       => [ 3, 493                 ],
              'query_ends'         => [ 303, 793               ],
              'subject_starts'     => [ 5968, 6457             ],
              'subject_ends'       => [ 6268, 6757             ],
              'subject_strands'    => [ 'plus', 'minus'        ],
              'query_strands'      => [ 'plus', 'plus'         ],
              'query_frames'       => [ '+3', '-3'             ],
              'subject_frames'     => [ '+1', '-2'             ],
             }
         );

         See ELEMENTS for explanation of each element.
         See DUMP RESULTS and/or the perlref(1) manpage for clues on working
             with this structure.
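
         As an illustration of working with this structure, the sketch below
         picks out individual matches by identity.  It assumes that the same
         index in each array refers to the same match, which the layout above
         suggests but this manpage does not state outright.  The sub name
         matches_with_identity is made up, and the records are abbreviated
         from the example above.

```perl
use strict;
use warnings;

# Two records abbreviated from the example structure in this section.
my @hits = (
    { id         => '>gb|U79716|HSU79716',
      scores     => [ 684, 123 ],
      identities => [ .99430199, .89256198 ],
    },
    { id         => '>gb|U24703|MMU24703',
      scores     => [ 319, 38.2 ],
      identities => [ .86455331, 1 ],
    },
);

# Return [id, score, identity] for every match whose identity is at least
# $min.  Assumes index $i in each array describes the same match.
sub matches_with_identity {
    my ($min, @records) = @_;
    my @out;
    for my $hit (@records) {
        my $count = @{ $hit->{scores} };
        for my $i (0 .. $count - 1) {
            next if $hit->{identities}[$i] < $min;
            push @out, [ $hit->{id}, $hit->{scores}[$i], $hit->{identities}[$i] ];
        }
    }
    return @out;
}

for my $m (matches_with_identity(0.98, @hits)) {
    printf "%s score=%s identity=%.4f\n", @$m;
}
```

         Unlike the filter method, this leaves the original array untouched
         and reports single matches rather than whole records.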

DUMP RESULTS

         When calling the result method with no parameters, or calling the
         filter method, an array of references to hashes is returned.
         Each element of the array is a reference to a hash containing 1 record.
         See HASHREF for details on this structure.  The following
         routine will go through each element of the array of hashes and
         then print out the element and its corresponding value or values.
         See perlref(1) for more info on references.

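         sub dump_results {
             my @results = @_;    # hash refs returned by result() or filter()
             foreach my $rh_r (@results) {
                 # each record maps element names to a scalar or an array ref
                 while (my ($key, $value) = each %$rh_r) {
                     if (ref($value) eq "ARRAY") {
                         print "$key: ";
                         foreach my $v (@$value) {
                             print "$v ";
                         }
                         print "\n";
                     } else {
                         print "$key: $value\n";
                     }
                 }
             }
         }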

AUTHOR

       •   Joseph Ryan (jfryan@nhgri.nih.gov)

CONTACT ADDRESS

       If you have problems, questions or comments, send them to webblaster@nhgri.nih.gov

COPYRIGHT INFORMATION

       This software/database is "United States Government Work" under the terms of the United
       States Copyright Act. It was written as part of the authors' official duties for the
       United States Government and thus cannot be copyrighted. This software/database is freely
       available to the public for use without a copyright notice. Restrictions cannot be placed
       on its present or future use.

       Although all reasonable efforts have been taken to ensure the accuracy and reliability of
       the software and data, the National Human Genome Research Institute (NHGRI) and the U.S.
       Government does not and cannot warrant the performance or results that may be obtained by
       using this software or data. NHGRI and the U.S.  Government disclaims all warranties as to
       performance, merchantability or fitness for any particular purpose.

       In any work or product derived from this material, proper attribution of the authors as
       the source of the software or data should be made, using
       http://genome.nhgri.nih.gov/blastall as the citation.

ENVIRONMENT VARIABLES

       BLASTDB
           location of BLAST formatted databases

       BLASTMAT
           location of BLAST matrices

       TMPDIR TEMP TMP
           If the `o' option is not passed to the blastall method, then NHGRI::Blastall looks for
           one of these vars (in order) to store the BLAST report.  This report is deleted
           after the NHGRI::Blastall object is destroyed.
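
       Because the blastall method depends on BLASTDB and BLASTMAT, a script can check for them
       before starting a run.  The following is a minimal sketch; missing_blast_env is a
       hypothetical helper, not part of NHGRI::Blastall, and it only tests that the variables
       are set, not that the directories they name exist.

```perl
use strict;
use warnings;

# Return the names of the required variables that are unset or empty.
# Takes a hash ref so the check can be run against any environment;
# normally this is \%ENV.
sub missing_blast_env {
    my ($env) = @_;
    return grep { !defined $env->{$_} || $env->{$_} eq '' } qw(BLASTDB BLASTMAT);
}

my @missing = missing_blast_env(\%ENV);
if (@missing) {
    warn "Not set: @missing\n";
} else {
    print "BLASTDB and BLASTMAT are set\n";
}
```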

SEE ALSO

       perl(1) perlref(1)

       http://www.ncbi.nlm.nih.gov/BLAST/newblast.html

       ftp://ncbi.nlm.nih.gov/blast/db/README

       http://www.ncbi.nlm.nih.gov/BLAST/tutorial/Altschul-1.html