Ubuntu Manpage: btparse - C library for parsing and processing BibTeX data files

Provided by: libbtparse-dev_0.71-1build1_amd64

NAME

       btparse - C library for parsing and processing BibTeX data files

SYNOPSIS

          #include <btparse.h>

          /* Basic library initialization / cleanup */
          void bt_initialize (void);
          void bt_free_ast (AST *ast);
          void bt_cleanup (void);

          /* Input / interface to parser */
          void   bt_set_stringopts (bt_metatype_t metatype, btshort options);
          AST * bt_parse_entry_s (char *    entry_text,
                                  char *    filename,
                                  int       line,
                                  btshort    options,
                                  boolean * status);
          AST * bt_parse_entry   (FILE *    infile,
                                  char *    filename,
                                  btshort    options,
                                  boolean * status);
          AST * bt_parse_file    (char *    filename,
                                  btshort    options,
                                  boolean * overall_status);

          /* AST traversal/query */
          AST * bt_next_entry (AST * entry_list,
                               AST * prev_entry);
          AST * bt_next_field (AST *entry, AST *prev, char **name);
          AST * bt_next_value (AST *head,
                               AST *prev,
                               bt_nodetype_t *nodetype,
                               char **text);

          bt_metatype_t bt_entry_metatype (AST *entry);
          char *bt_entry_type (AST *entry);
          char *bt_entry_key (AST *entry);
          char *bt_get_text (AST *node);

          /* Splitting names and lists of names */
          bt_stringlist * bt_split_list (char *   string,
                                         char *   delim,
                                         char *   filename,
                                         int      line,
                                         char *   description);
          void bt_free_list (bt_stringlist *list);
          bt_name * bt_split_name (char *  name,
                                   char *  filename,
                                   int     line,
                                   int     name_num);
          void bt_free_name (bt_name * name);

          /* Formatting names */
          bt_name_format * bt_create_name_format (char * parts, boolean abbrev_first);
          void bt_free_name_format (bt_name_format * format);
          void bt_set_format_text (bt_name_format * format,
                                   bt_namepart part,
                                   char * pre_part,
                                   char * post_part,
                                   char * pre_token,
                                   char * post_token);
          void bt_set_format_options (bt_name_format * format,
                                      bt_namepart part,
                                      boolean abbrev,
                                      bt_joinmethod join_tokens,
                                      bt_joinmethod join_part);
          char * bt_format_name (bt_name * name, bt_name_format * format);

          /* Construct tree from TeX groups */
          bt_tex_tree * bt_build_tex_tree (char * string);
          void          bt_free_tex_tree (bt_tex_tree **top);
          void          bt_dump_tex_tree (bt_tex_tree *node, int depth, FILE *stream);
          char *        bt_flatten_tex_tree (bt_tex_tree *top);

          /* Miscellaneous string utilities */
          void bt_purify_string (char * string, btshort options);
          void bt_change_case (char transform, char * string, btshort options);

DESCRIPTION

       btparse is a C library for parsing and processing BibTeX files.  It provides a lexical
       scanner and LL parser (constructed by PCCTS), both of which are efficient and offer good
       error detection and recovery; a set of functions for traversing the AST (abstract syntax
       tree) generated by the parser; and utility functions for manipulating strings according to
       BibTeX conventions.  (Note that nothing in the library assumes that you're using BibTeX
       files for their original purpose of bibliographic data for scholarly publications; you
       could use the file format for any conceivable purpose that fits it.  However, there is
       some code in the library that is really only appropriate for use with strings meant to be
       processed in the same way that BibTeX itself does.  This is all entirely optional,
       though.)

       Note that the interface provided by btparse, while complete, is fairly low-level.  If you
       have more sophisticated needs, you might be interested in my "Text::BibTeX" module for Perl 5
       (available on CPAN).

CONCEPTS AND TERMINOLOGY

To understand this document and use btparse, you should already be familiar with the
BibTeX language---more specifically, the BibTeX data description language. (BibTeX being
the complex beast that it is, one can conceive of the term applying to the program, the
data language, the particular database structure described in the original BibTeX
documentation, the ".bst" formatting language, and the set of conventions embodied in the
standard styles included with the BibTeX distribution. In this document, I'll stick to
the first two meanings---the data language because that's what btparse deals with, and the
program because it's occasionally necessary to explain differences between my parser and
BibTeX's.)

In particular, you should have a good idea what's going on in the following:

    @string{and = { and },
            joe = "Blow, Joe",
            john = "John Smith"}

    @book(ourbook,
          author = joe # and # john,
          title = {Our Little Book})

If this looks like something you want to parse, but don't want to have to write your own
parser for, you've come to the right place.
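
As a rough illustration of what the "author" field above denotes, the following
self-contained C sketch expands the three macros and pastes the results together. It does
not use btparse at all: "struct macro", "expand_macro()" and "paste_macros()" are
hypothetical names invented for this example, and the case-sensitive lookup is a
simplification made for brevity.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical macro table mirroring the @string entry above. */
struct macro { const char *name; const char *text; };

static const struct macro macros[] = {
    { "and",  " and "      },
    { "joe",  "Blow, Joe"  },
    { "john", "John Smith" },
};

/* Look up a macro by name; returns NULL if it was never defined.
   (Case-sensitive comparison is a simplification of this sketch.) */
const char *expand_macro(const char *name)
{
    assert(name != NULL);
    for (size_t i = 0; i < sizeof macros / sizeof macros[0]; i++)
        if (strcmp(macros[i].name, name) == 0)
            return macros[i].text;
    return NULL;
}

/* Expand each macro name in turn and paste the results into out.
   Returns 0 on success, -1 on an undefined macro or a too-small buffer. */
int paste_macros(const char *const names[], size_t count,
                 char *out, size_t outsize)
{
    size_t used = 0;

    assert(names != NULL && out != NULL);
    if (outsize == 0)
        return -1;
    out[0] = '\0';
    for (size_t i = 0; i < count; i++) {
        const char *text = expand_macro(names[i]);
        if (text == NULL)
            return -1;
        size_t len = strlen(text);
        if (used + len >= outsize)      /* keep room for the final '\0' */
            return -1;
        memcpy(out + used, text, len + 1);
        used += len;
    }
    return 0;
}
```

Called with the names "joe", "and" and "john", in that order, paste_macros() fills the
buffer with "Blow, Joe and John Smith"; the spaces around "and" come from the braces in
the macro definition.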

Before going much further, though, you're going to have to learn some of the terminology I
use for describing BibTeX data. Most of it's the same as you'll find in any BibTeX
documentation, but it's important to be sure that we're talking about the same things
here. So, some definitions:

top-level
    All text in a BibTeX file from the start of the file to the start of the first entry,
    and between entries thereafter.

name
    A string of letters, digits, and the following characters:

        ! $ & * + - . / : ; < > ? [ ] ^ _ ` |

    A "name" is a catch-all used for entry types, entry keys, and field and macro names.
    For BibTeX compatibility, there are slightly different rules for these four entities;
    currently, the only such rule actually implemented is that field and macro names may
    not begin with a digit. Some names in the above example: "string", "and".

entry
    A chunk of text starting with an "at" sign ("@") at top-level, followed by a name (the
    entry type), an entry delimiter ("{" or "("), and proceeding to the matching closing
    delimiter. Also, the data structure that results from parsing this chunk of text.
    There are two entries in the above example.

entry type
    The name that comes right after an "@" at top-level. Examples from above: "string",
    "book".

entry metatype
    A classification of entry types that allows us to group one or more entry types under
    the same heading. With the standard BibTeX database structure, "article", "book",
    "inbook", etc. all fall under the "regular entry" metatype. Other metatypes are
    "macro definition" (for "string" entries), "preamble" (for "preamble" entries), and
    "comment" (for "comment" entries). In fact, any entry whose type is not one of
    "string", "preamble", or "comment" is called a "regular" entry.

entry delimiters
    "{" and "}", or "(" and ")": the pair of characters that (almost) mark the boundaries
    of an entry. "Almost" because the start of an entry is marked by an "@", not by the
    "entry open" delimiter.

entry key
    (Or just key when it's clear what we're speaking of.) The name immediately following
    the entry open delimiter in a regular entry, which uniquely identifies the entry.
    Example from above: "ourbook". Only regular entries have keys.

field
    A name to the left of an equals sign in a regular or macro-definition entry. In the
    latter context, might also be called a macro name. Examples from above: "joe",
    "author".

field list
    In a regular entry, everything between the entry delimiters except for the entry key.
    In a macro definition entry, everything between the entry delimiters (possibly also
    called a macro list).

compound value
    (Usually just "value".) The text that follows an equals sign ("=") in a regular or
    macro definition entry, up to a comma or the entry close delimiter; a list of one or
    more simple values joined by hash signs ("#").

simple value
    A string, macro, or number.

string
    (Or, sometimes, "quoted string.") A chunk of text between quotes (""") or braces ("{"
    and "}"). Braces must balance: "{this is a {string}" is not a BibTeX string, but
    "{this is a {string}}" is. ("this is a {string" is also illegal, mainly to avoid the
    possibility of generating bogus TeX code--which BibTeX will do in certain cases.)

macro
    A name that appears on the right-hand side of an equals sign (i.e. as one simple value
    in a compound value). Implies that this name was defined as a macro in an earlier
    macro definition entry, but this is only checked if btparse is being asked to expand
    macros to their full definitions.

number
    An unquoted string of digits.
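
Two of the definitions above, "name" and "string", are easy to state as code. The
following is a minimal standalone sketch, not the library's scanner: "is_name()" and
"braces_balanced()" are hypothetical helpers, and treating the empty string as "not a
name" is an assumption of this example.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Punctuation allowed in a name, copied from the definition of "name". */
static const char NAME_PUNCT[] = "!$&*+-./:;<>?[]^_`|";

/* True if s is a non-empty name.  If no_leading_digit is set, also apply
   the rule for field and macro names (they may not begin with a digit). */
bool is_name(const char *s, bool no_leading_digit)
{
    assert(s != NULL);
    if (*s == '\0')
        return false;                   /* assumption: empty is not a name */
    if (no_leading_digit && isdigit((unsigned char) s[0]))
        return false;
    for (; *s != '\0'; s++) {
        unsigned char c = (unsigned char) *s;
        if (!isalnum(c) && strchr(NAME_PUNCT, c) == NULL)
            return false;
    }
    return true;
}

/* True if every "{" in s has a matching "}" and no "}" comes too early. */
bool braces_balanced(const char *s)
{
    int depth = 0;

    assert(s != NULL);
    for (; *s != '\0'; s++) {
        if (*s == '{')
            depth++;
        else if (*s == '}' && --depth < 0)
            return false;
    }
    return depth == 0;
}
```

With these helpers, braces_balanced() rejects "{this is a {string}" and accepts
"{this is a {string}}", matching the examples given for "string"; is_name("1984", true)
is false because field and macro names may not begin with a digit.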

Working with btparse generally consists of passing the library some BibTeX data (or a
source for some BibTeX data, such as a filename or a file pointer), which it then
lexically scans and parses to build an abstract syntax tree (AST). It returns
this AST to you, and you call other btparse functions to traverse and query the tree.

The contents of AST nodes are the private domain of the library, and you shouldn't go
poking into them. This being C, though, there's nothing to prevent you from doing so
except good manners and the possibility that I might change the AST structure in future
releases, breaking any badly-behaved code. Also, it's not necessary to know the
structural relationships between nodes in the AST---that's taken care of by the
query/traversal functions.

However, it's useful to know some of the things that btparse deposits in the AST and
returns to you through those query/traversal functions. First off, each node has a "node
type," which records the syntactic element corresponding to each node. For instance, the
entry

    @book{mybook, author = "Joe Blow", title = "My Little Book"}

is rooted by an "entry" node; under this would be found a "key" node (for the entry key),
two "field" nodes (for the "author" and "title" fields); and associated with each field
node would be a "string" node. The only time this concerns you is when you ask the
library for a simple value; just looking at the text is not enough to distinguish quoted
strings, numbers, and macro names, so btparse returns the nodetype as well.
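
To see why the text alone is not enough, consider this small standalone sketch, which
classifies a simple value as it appears in the source, delimiters included. The enum and
the function "classify_simple_value()" are hypothetical stand-ins written for this
example; they are not the library's node types or its scanner.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical stand-ins for the three simple-value node types;
   these are not the library's own constants. */
enum value_kind { VALUE_STRING, VALUE_NUMBER, VALUE_MACRO };

/* Classify one simple value as written in the input, delimiters included.
   Assumes raw is a well-formed, non-empty token. */
enum value_kind classify_simple_value(const char *raw)
{
    assert(raw != NULL && raw[0] != '\0');
    if (raw[0] == '"' || raw[0] == '{')
        return VALUE_STRING;            /* quoted or braced string */
    for (const char *p = raw; *p != '\0'; p++)
        if (!isdigit((unsigned char) *p))
            return VALUE_MACRO;         /* any non-digit makes it a name */
    return VALUE_NUMBER;                /* unquoted string of digits */
}
```

Here the quoted value "1984" is a string and the bare value 1984 is a number, yet once
the delimiters are stripped both leave the same four characters behind. That is the
information the nodetype preserves.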

In addition to the nodetype, btparse records the metatype of each "entry" node. This
allows you (and the library) to distinguish, say, regular entries from comment entries.
Not only do they have very different structures and must therefore be traversed
differently by the library, but certain traversal functions make no sense on certain entry
metatypes---thus it's necessary for you to be able to make the distinction as well.

That said, everything you need to know to work with the AST is explained in bt_traversal.

DATA TYPES AND MACROS

       btparse defines several types required for the external interface.  First, it trivially
       defines a "boolean" type (along with "TRUE" and "FALSE" macros).  This might affect you
       when including the btparse.h header in your own code---since it's not possible for the
       code to detect if there is already a "boolean" type defined, you might have to define the
       "HAVE_BOOLEAN" pre-processor token to deactivate btparse.h's "typedef" of "boolean".

       Next, two enumeration types are defined: "bt_metatype" and "bt_nodetype".  Both of these
       are used extensively in the library itself, and are made available to users of the library
       because they can be found in nodes of the "btparse" AST (abstract syntax tree).  (I.e.,
       querying the AST can give you "bt_metatype" and "bt_nodetype" values, so the "typedef"s
       must be available to your code.)

   Entry metatype enum
       "bt_metatype" has the following values:

       •   "BTE_UNKNOWN"

       •   "BTE_REGULAR"

       •   "BTE_COMMENT"

       •   "BTE_PREAMBLE"

       •   "BTE_MACRODEF"

       which are determined by the "entry type" token.  (@string entries have the "BTE_MACRODEF"
       metatype; @comment and @preamble correspond to "BTE_COMMENT" and "BTE_PREAMBLE"; and any
       other entry type has the "BTE_REGULAR" metatype.)
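
       The mapping from entry type to metatype can be sketched in a few lines of standalone
       C.  This is an illustration only: the "META_*" constants and "metatype_of()" are
       hypothetical stand-ins rather than the library's definitions, and comparing entry
       types without regard to case is an assumption of this sketch.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical stand-ins for the metatype values listed above. */
enum demo_metatype { META_REGULAR, META_COMMENT, META_PREAMBLE, META_MACRODEF };

/* Compare two strings, ignoring ASCII case.  Returns 1 if equal, else 0. */
int equal_nocase(const char *a, const char *b)
{
    while (*a != '\0' && *b != '\0') {
        if (tolower((unsigned char) *a) != tolower((unsigned char) *b))
            return 0;
        a++;
        b++;
    }
    return *a == *b;                    /* equal only if both have ended */
}

/* Map an entry type (the name after "@") to its metatype. */
enum demo_metatype metatype_of(const char *entry_type)
{
    assert(entry_type != NULL);
    if (equal_nocase(entry_type, "string"))
        return META_MACRODEF;
    if (equal_nocase(entry_type, "comment"))
        return META_COMMENT;
    if (equal_nocase(entry_type, "preamble"))
        return META_PREAMBLE;
    return META_REGULAR;                /* everything else is "regular" */
}
```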

   AST nodetype enum
       "bt_nodetype" has the following values:

       •   "BTAST_UNKNOWN"

       •   "BTAST_ENTRY"

       •   "BTAST_KEY"

       •   "BTAST_FIELD"

       •   "BTAST_STRING"

       •   "BTAST_NUMBER"

       •   "BTAST_MACRO"

       Of these, you'll only ever deal with the last three.  They are returned when you query the
       AST for a simple value---just seeing the text isn't enough to distinguish between a quoted
       string, a number, and a macro, so the AST nodetype is supplied along with the text.

   String processing option macros
       Since BibTeX is essentially a system for glueing strings together in a wide variety of
       ways, the processing done to its strings is fairly important.  Most of the string
       transformations are done outside of the lexer/parser; this reduces their complexity, and
       makes it easier to switch different transformations on and off.  This switching is done
       with an "options" bitmap which can be specified on a per-entry-metatype basis.  (That is,
       you can have one set of transformations done to the strings in all regular entries,
       another set done to the strings in all macro definition entries, and so on.)  If you need
       finer control than that, it's currently unavailable outside of the library (but it's just
       a matter of making a couple of functions available and documenting them---so bug me if you
       need this feature).

       There are four basic macros for constructing this bitmap:

       "BTO_CONVERT"
           Convert "number" values to strings.  (The conversion is trivial, involving changing
           the type of the AST node representing the number from "BTAST_NUMBER" to
           "BTAST_STRING".  "Number" values are stored as strings of digits, just as they are in
           the input data.)

       "BTO_EXPAND"
           Expand macro invocations to the full macro text.

       "BTO_PASTE"
           Paste simple values together.

       "BTO_COLLAPSE"
           Collapse whitespace according to the BibTeX rules.

       For instance, supplying "BTO_CONVERT | BTO_EXPAND" as the string options bitmap for the
       "BTE_REGULAR" metatype means that all simple values in "regular" entries will be converted
       to strings: numbers will simply have their "nodetype" changed, and macros will be
       expanded.  Nothing else will be done to the simple values, though---they will not be
       concatenated, nor will whitespace be collapsed.  See the "bt_set_stringopts()" and
       "bt_parse_*()" functions in bt_input for more information on the various options for
       parsing; see bt_postprocess for details on the post-processing.
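
       The idea of one options bitmap per entry metatype can be sketched as follows.  This
       standalone example deliberately does not include btparse.h: the "OPT_*" flags, their
       bit values, the "DEMO_*" metatypes, and the functions "set_opts()" and
       "opt_enabled()" are all hypothetical stand-ins chosen for illustration, not the
       library's definitions.

```c
#include <assert.h>

/* Hypothetical stand-ins: the real BTO_* macros come from btparse.h, and
   nothing is assumed here about their numeric values. */
enum {
    OPT_CONVERT  = 1 << 0,              /* numbers become strings   */
    OPT_EXPAND   = 1 << 1,              /* macros are expanded      */
    OPT_PASTE    = 1 << 2,              /* simple values are pasted */
    OPT_COLLAPSE = 1 << 3               /* whitespace is collapsed  */
};

/* Hypothetical metatypes; DEMO_COUNT is the number of real entries. */
enum demo_meta { DEMO_REGULAR, DEMO_COMMENT, DEMO_PREAMBLE, DEMO_MACRODEF,
                 DEMO_COUNT };

/* One bitmap per entry metatype; every transformation starts switched off. */
static unsigned short opts_by_metatype[DEMO_COUNT];

/* Replace the whole bitmap for one metatype. */
void set_opts(enum demo_meta meta, unsigned short options)
{
    assert((int) meta >= 0 && (int) meta < DEMO_COUNT);
    opts_by_metatype[meta] = options;
}

/* Return 1 if the given flag is switched on for this metatype, else 0. */
int opt_enabled(enum demo_meta meta, unsigned short flag)
{
    assert((int) meta >= 0 && (int) meta < DEMO_COUNT);
    return (opts_by_metatype[meta] & flag) != 0;
}
```

       After set_opts(DEMO_REGULAR, OPT_CONVERT | OPT_EXPAND), conversion and expansion are
       enabled for regular entries only; pasting and whitespace collapsing stay off, and the
       other metatypes are unaffected.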

USING THE LIBRARY

       The following code is a skeletal example of using the btparse library:

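           #include <stdlib.h>     /* exit() */
           #include <btparse.h>

           int main (void)
           {
              bt_initialize ();    /* must precede every other btparse call */

              /* process some data */

              bt_cleanup ();       /* frees what bt_initialize() allocated */
              exit (0);
           }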

       Please note the call to "bt_initialize()"; this is very important!  Without it, the
       library may crash or fail mysteriously.  You must call "bt_initialize()" before calling
       any other btparse functions.  "bt_cleanup()" just frees the memory allocated by
       "bt_initialize()"; if you are careful to call it before exiting, and "bt_free_ast()" on
       any abstract syntax trees generated by btparse when you are done with them, then your
       program shouldn't have any memory leaks.  (Unless they're due to your own code, of
       course!)

BUGS AND LIMITATIONS

btparse has several inherent limitations that are due to the lexical scanner and parser
generated by PCCTS 1.x. In short, the scanner and parser are both heavily dependent on
global variables, meaning that thread safety -- or even the ability to have two files open
and being parsed at the same time -- is well-nigh impossible. This will not change until
I get with the times and adopt ANTLR 2.0, the successor to PCCTS -- presuming of course
that it can generate more modular C scanners and parsers.

Another limitation that is due to PCCTS: entries with a large number of fields (more than
about 90, if each field value is just a single string) will cause the parser to crash.
This is unavoidable due to the parser using statically-allocated stacks for attributes and
abstract-syntax tree nodes. I could increase the static allocation, but that would just
decrease the likelihood of encountering the problem, not make it go away. Again, the
chances of this changing as long as I'm using PCCTS 1.x are nil.

Apart from those inherent limitations, there are no known bugs in btparse. Any
segmentation faults or bus errors from the library should be considered bugs. They
probably result from using the library incorrectly (e.g. attempting to interleave the
parsing of two files), but I do make an attempt to catch all such mistakes, and if I've
missed any I'd like to know about it.

Any memory leaks from the library are also a concern; as long as you are conscientious
about calling the cleanup functions ("bt_free_ast()" and "bt_cleanup()"), then the library
shouldn't leak.

AUTHOR

       Greg Ward <gward@python.net>

COPYRIGHT

       Copyright (c) 1996-97 by Gregory P. Ward.

       This library is free software; you can redistribute it and/or modify it under the terms of
       the GNU Library General Public License as published by the Free Software Foundation;
       either version 2 of the License, or (at your option) any later version.

       This library is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY;
       without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
       See the GNU Library General Public License for more details.

       You should have received a copy of the GNU Library General Public License along with this
       library; if not, write to the Free Software Foundation, Inc., 675 Mass Ave, Cambridge, MA
       02139, USA.

AVAILABILITY

       The btOOL home page, where you can get up-to-date information about btparse (and download
       the latest version) is

          http://starship.python.net/~gward/btOOL/

       You will also find the latest version of Text::BibTeX, the Perl library that provides a
       high-level front-end to btparse, there.  btparse is needed to build "Text::BibTeX", and
       must be downloaded separately.

       Both libraries are also available on CTAN (the Comprehensive TeX Archive Network,
       "http://www.ctan.org/tex-archive/") and CPAN (the Comprehensive Perl Archive Network,
       "http://www.cpan.org/").  Look in biblio/bibtex/utils/btOOL/ on CTAN, and
       authors/Greg_Ward/ on CPAN.  For example,

          http://www.ctan.org/tex-archive/biblio/bibtex/utils/btOOL/
          http://www.cpan.org/authors/Greg_Ward

       will both get you to the latest version of "Text::BibTeX" and btparse -- but of course,
       you should always access busy sites like CTAN and CPAN through a mirror.