Ubuntu Manpage: Zerg - a lexical scanner for BLAST reports.

NAME

       Zerg - a lexical scanner for BLAST reports.

SYNOPSIS

       use Zerg;

DESCRIPTION

       This manpage describes the Zerg library and its interface for use with Perl.

       The Zerg library contains a C/flex lexical scanner for BLAST reports and a set of
       supporting functions. It is centered on a "get_token" function that scans the input for
       specified lexical elements and, when one is found, returns its code and value to the user.

       It is intended to be fast: it is built with flex, which provides simple regular expression
       matching and input buffering in the generated C scanner. It is also intended to be simple:
       it provides just a lexical scanner, with no features whose support could slow down its
       main function.

   FUNCTIONS
       zerg_get_token() is the core function of this module. Each time it is called, it scans the
       input BLAST report for the next "interesting" lexical element and returns its code and
       value. Codes are listed in the section "EXPORTED CONSTANTS (TOKEN CODES)". Code zero (not
       listed) means end of file.

         ($code, $value) = Zerg::zerg_get_token();

       zerg_open_file($filename) opens $filename in read-only mode and sets it as the input to the
       scanner. If this function is not called, standard input is used.

         Zerg::zerg_open_file($filename);

       zerg_close_file() closes the file opened with zerg_open_file().

       zerg_get_token_offset() returns the byte offset (relative to the beginning of file) of the
       last token read. (See section BUGS).

       zerg_ignore($code) instructs zerg_get_token not to return when it finds a token with code
       $code.

       zerg_ignore_all() does zerg_ignore on all token codes.

       zerg_unignore($code) instructs zerg_get_token to return when it finds a token with code
       $code.

       zerg_unignore_all() does zerg_unignore on all token codes.

         Example:
         Zerg::zerg_ignore_all();
         Zerg::zerg_unignore(QUERY_NAME);
         Zerg::zerg_unignore(SUBJECT_NAME);
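
       The functions above might be combined as in the following minimal sketch, which pairs each
       subject name with the most recent query name. The helper collect_hits() is hypothetical
       and not part of Zerg. It takes the token source as a code reference so that it can be
       tried without the module installed, and it assumes that token codes compare numerically.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Zerg): pairs each subject name with the
# most recent query name. $next_token is any code reference that returns
# ($code, $value) and signals end of input with code zero, which is the
# behaviour described for zerg_get_token().
sub collect_hits {
    my ($next_token, $query_code, $subject_code) = @_;
    my ($query, @hits);
    while (1) {
        my ($code, $value) = $next_token->();
        last unless $code;    # code zero: end of file
        if ($code == $query_code) {
            $query = $value;
        }
        elsif ($code == $subject_code) {
            push @hits, [$query, $value];
        }
    }
    return \@hits;
}

# Assumed usage with Zerg installed ($filename is a BLAST report):
#   use Zerg;
#   Zerg::zerg_open_file($filename);
#   Zerg::zerg_ignore_all();
#   Zerg::zerg_unignore(QUERY_NAME);
#   Zerg::zerg_unignore(SUBJECT_NAME);
#   my $hits = collect_hits(\&Zerg::zerg_get_token, QUERY_NAME, SUBJECT_NAME);
#   print join("\t", @$_), "\n" for @$hits;
#   Zerg::zerg_close_file();
```

       Ignoring every code except the two of interest keeps the loop from being entered for
       tokens it would discard anyway.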

   EXPORTED CONSTANTS (TOKEN CODES)
           ALIGNMENT_LENGTH
           BLAST_VERSION
           CONVERGED
           DATABASE
           DESCRIPTION_ANNOTATION
           DESCRIPTION_EVALUE
           DESCRIPTION_HITNAME
           DESCRIPTION_SCORE
           END_OF_REPORT
           EVALUE
           GAPS
           HSP_METHOD
           IDENTITIES
           NOHITS
           PERCENT_IDENTITIES
           PERCENT_POSITIVES
           POSITIVES
           QUERY_ALI
           QUERY_ANNOTATION
           QUERY_END
           QUERY_FRAME
           QUERY_LENGTH
           QUERY_NAME
           QUERY_ORIENTATION
           QUERY_START
           REFERENCE
           ROUND_NUMBER
           ROUND_SEQ_FOUND
           ROUND_SEQ_NEW
           SCORE
           SCORE_BITS
           SEARCHING
           SUBJECT_ALI
           SUBJECT_ANNOTATION
           SUBJECT_END
           SUBJECT_FRAME
           SUBJECT_LENGTH
           SUBJECT_NAME
           SUBJECT_ORIENTATION
           SUBJECT_START
           TAIL_OF_REPORT
           UNMATCHED

   NOTES ON THE SCANNER
       Some BLAST parsers rely on simple regular expression matches to decide token types and
       values. For example, an input line matching /^Query=\s(\S+)/ leads such a "loose" parser
       to infer that a token was found, that it is a query name and that its value is $1.
       Although improbable, it is perfectly possible for an annotation field to match
       /^Query=\s(\S+)/. Worse, those parsers are often unable to detect corrupt or truncated
       BLAST reports, and may then produce inaccurate information.

       The scanner provided by this library is much more stringent: for a token to match, it must
       be in its place in the context of a BLAST report. For example, in a single BLAST report a
       QUERY_NAME cannot follow another QUERY_NAME. The scanner can be thought of as, and in fact
       is, one big regular expression that matches an entire BLAST report.

       A special token code (UNMATCHED) is provided for cases in which the input text does not
       match any other lexical rule of the scanner. When an unmatched character is found, either
       the report is corrupt or the scanner has a bug.

       If you are interested in only a few token codes, zerg_ignore() as many of the other codes
       as you can. This avoids unnecessary function calls that eat a lot of CPU.
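
       One way to judge how much work ignoring can save is to count how many tokens of each code
       a report yields, since every returned token costs one call. The sketch below does that.
       The helper tally_codes() is hypothetical and not part of Zerg, and it takes the token
       source as a code reference so that it can be tried without the module installed.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Zerg): counts how many tokens of each
# code the scanner returns. $next_token must return ($code, $value) and
# use code zero for end of input, as described for zerg_get_token().
sub tally_codes {
    my ($next_token) = @_;
    my %count;
    while (1) {
        my ($code) = $next_token->();
        last unless $code;    # code zero: end of file
        $count{$code}++;
    }
    return \%count;
}

# Assumed usage with Zerg installed (standard input is read by default):
#   use Zerg;
#   my $count = tally_codes(\&Zerg::zerg_get_token);
#   print "$_\t$count->{$_}\n" for sort { $a <=> $b } keys %$count;
```

       Codes with large counts that the program never uses are the most rewarding ones to ignore.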

EXAMPLES

       This program prints the code and the value of each token it finds.

         #!/usr/bin/perl -w
         use strict;
         use Zerg;

         my ($code, $value);
         while((($code, $value)= Zerg::zerg_get_token()) && $code)
         {
             print "$code\t$value\n";
         }

       The program below is a "syntax checker". The presence of UNMATCHEDs is a strong indicator
       of problems in the BLAST report. (See section NOTES ON THE SCANNER)

         #!/usr/bin/perl -w
         use strict;
         use Zerg;

         my ($code, $value);

         Zerg::zerg_ignore_all();
         Zerg::zerg_unignore(UNMATCHED);

         while((($code, $value)= Zerg::zerg_get_token()) && $code)
         {
             print "UNMATCHED CHAR:\t$value\n";
         }
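
       A variant of the syntax checker could also report where each unmatched character occurs,
       using zerg_get_token_offset(). The following is a sketch under assumptions: the helper
       locate_unmatched() is hypothetical and not part of Zerg, it receives the token source and
       the offset function as code references so that it can be tried without the module, and it
       assumes that token codes compare numerically.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Zerg): records where each unmatched
# character was seen. $next_token returns ($code, $value) with code zero at
# end of input; $offset_of returns the byte offset of the last token read.
sub locate_unmatched {
    my ($next_token, $offset_of, $unmatched_code) = @_;
    my @found;
    while (1) {
        my ($code, $value) = $next_token->();
        last unless $code;    # code zero: end of file
        next unless $code == $unmatched_code;
        push @found, { offset => $offset_of->(), value => $value };
    }
    return \@found;
}

# Assumed usage with Zerg installed ($ARGV[0] is a BLAST report):
#   use Zerg;
#   Zerg::zerg_open_file($ARGV[0]);
#   Zerg::zerg_ignore_all();
#   Zerg::zerg_unignore(UNMATCHED);
#   my $found = locate_unmatched(\&Zerg::zerg_get_token,
#                                \&Zerg::zerg_get_token_offset, UNMATCHED);
#   print "byte $_->{offset}:\t$_->{value}\n" for @$found;
#   Zerg::zerg_close_file();
```

       An empty result suggests that the scanner found nothing it could not match in the report.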

BUGS

       The tokens DESCRIPTION_ANNOTATION, DESCRIPTION_SCORE and DESCRIPTION_EVALUE are scanned
       all at once and released one by one on user request. So, if the user wants to get any of
       these fields, they must be unignored BEFORE scanning DESCRIPTION_ANNOTATION.

       zerg_get_token_offset() may return incorrect values for these tokens and those that are
       modified by the parser, namely: QUERY_LENGTH, SUBJECT_LENGTH, EVALUE, GAPS.

TODO

       Add more tokens to the scanner as the need arises.

AUTHOR

       Apuã Paquola, IQ-USP Bioinformatics Lab, apua@iq.usp.br

       Laszlo Kajan <lkajan@rostlab.org>, Technical University of Munich, Germany
