Provided by: libbtparse-dev_0.71-1build1_amd64

NAME

       bt_language - the BibTeX data language, as recognized by btparse

SYNOPSIS

          # Lexical grammar, mode 1: top-level
          AT                    \@
          NEWLINE               \n
          COMMENT               \%~[\n]*\n
          WHITESPACE            [\ \r\t]+
          JUNK                  ~[\@\n\ \r\t]+

          # Lexical grammar, mode 2: in-entry
          NEWLINE               \n
          COMMENT               \%~[\n]*\n
          WHITESPACE            [\ \r\t]+
          NUMBER                [0-9]+
          NAME                  [a-z0-9\!\$\&\*\+\-\.\/\:\;\<\>\?\[\]\^\_\`\|]+
          LBRACE                \{
          RBRACE                \}
          LPAREN                \(
          RPAREN                \)
          EQUALS                =
          HASH                  \#
          COMMA                 ,
          QUOTE                 \"

          # Lexical grammar, mode 3: strings
          # (very hairy -- see text)

          # Syntactic grammar:
          bibfile : ( entry )*

          entry : AT NAME body

          body : STRING                    # for comment entries
               | ENTRY_OPEN contents ENTRY_CLOSE

          contents : ( NAME | NUMBER ) COMMA fields   # for regular entries
                   | fields                # for macro definition entries
                   | value                 # for preamble entries

          fields : field { COMMA fields }
                 |

          field : NAME EQUALS value

          value : simple_value ( HASH simple_value )*

          simple_value : STRING
                       | NUMBER
                       | NAME

DESCRIPTION

       One of the problems with BibTeX is that there is no formal specification of the language.
       This means that users exploring the arcane corners of the language are largely on their
       own, and programmers implementing their own parsers are completely on their own, except
       for observing the behaviour of the original implementation.

       Other parser implementors (Nelson Beebe of "bibclean" fame, in particular) have taken the
       trouble to explain the language accepted by their parser, and in that spirit the following
       is presented.

       If you are unfamiliar with the arcana of regular and context-free languages, you will not
       have an easy time understanding this.  This is not an introduction to the BibTeX
       language; any LaTeX book would be more suitable for learning the data language itself.

LEXICAL GRAMMAR

       The lexical scanner has three distinct modes: top-level, in-entry, and string.  Roughly
       speaking, top-level is the initial mode; we enter in-entry mode on seeing an "@" at top-
       level; and on seeing the "}" or ")" that ends the entry, we return to top-level.  We enter
       string mode on seeing a '"' or non-entry-delimiting "{" from in-entry mode.  Note that the
       lexical language is both non-regular (because braces must balance) and context-sensitive
       (because "{" can mean different things depending on its syntactic context).  That said, we
       will use regular expressions to describe the lexical elements, because they are the
       starting point used by the lexical scanner itself.  The rest of the lexical grammar will
       be informally explained in the text.

       From top-level, the following tokens are recognized according to the regular expressions
       on the right:

          AT                    \@
          NEWLINE               \n
          COMMENT               \%~[\n]*\n
          WHITESPACE            [\ \r\t]+
          JUNK                  ~[\@\n\ \r\t]+

       (Note that this is PCCTS regular expression syntax, which should be fairly familiar to
       users of other regex engines.  One oddity is that a character class is negated as "~[...]"
       rather than "[^...]".)

       On seeing an "at" token at top-level, we enter in-entry mode.  Whitespace, junk, newlines,
       and comments are all skipped, with the latter two incrementing a line counter.  (Junk is
       explicitly recognized to allow for "bibtex"'s "implicit comment" scheme.)
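
       As a rough illustration, the top-level rules can be approximated with Python's "re"
       module.  This is a sketch only, not btparse's scanner: the function name "skip_to_entry"
       is made up for the example, and the PCCTS negated class "~[...]" is rewritten as "[^...]".

```python
import re

# Top-level token patterns, translated from the PCCTS syntax shown above.
# COMMENT is listed before JUNK so that a "%" line is taken as a comment.
TOP_LEVEL = re.compile(r"""
    (?P<AT>@)
  | (?P<NEWLINE>\n)
  | (?P<COMMENT>%[^\n]*\n)
  | (?P<WHITESPACE>[ \r\t]+)
  | (?P<JUNK>[^@\n \r\t]+)
""", re.VERBOSE)


def skip_to_entry(text, pos=0, line=1):
    """Skip top-level input (hypothetical helper).

    Returns (offset just past the next "@", current line number),
    or (None, line number) if the input ends before any "@" is seen.
    """
    while pos < len(text):
        m = TOP_LEVEL.match(text, pos)  # some alternative always matches
        pos = m.end()
        if m.lastgroup == "AT":
            return pos, line            # caller would now enter in-entry mode
        if m.lastgroup in ("NEWLINE", "COMMENT"):
            line += 1                   # both tokens end with a newline
    return None, line
```

       Note that an "@" inside a "%" comment line is consumed as part of the comment and does not
       start an entry in this sketch.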

       From in-entry mode, we recognize newline, comment, and whitespace identically to top-level
       mode.  In addition, the following tokens are recognized:

          NUMBER                [0-9]+
          NAME                  [a-z0-9\!\$\&\*\+\-\.\/\:\;\<\>\?\[\]\^\_\`\|]+
          LBRACE                \{
          RBRACE                \}
          LPAREN                \(
          RPAREN                \)
          EQUALS                =
          HASH                  \#
          COMMA                 ,
          QUOTE                 \"

       At this point, the lexical scanner starts to sound suspiciously like a context-free
       grammar, rather than a collection of independent regular expressions.  However, it is
       necessary to keep this complexity in the scanner because certain characters ("{" and "("
       in particular) have very different lexical meanings depending on the tokens that have
       preceded them in the input stream.

       In particular, "{" and "(" are treated as "entry openers" if they follow one "at" and one
       "name" token, unless the value of the "name" token is "comment".  (Note the switch from
       top-level to in-entry between the two tokens.)  In the @comment case, the delimiter is
       considered as starting a string, and we enter string mode.  Otherwise, the delimiter is
       saved, and when we see a corresponding "}" or ")" it is considered an "entry closer".
       (Braces are balanced for free here because the string lexer takes care of counting brace-
       depth.)
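
       The decision described above, together with the fallback cases covered in the following
       paragraph, can be sketched as a small Python function.  The function name, the token
       representation as (type, text) pairs, and the result labels are all invented for this
       example; comparing the entry type case-insensitively is likewise an assumption.

```python
def classify_delimiter(prev_tokens, delim):
    """Decide what a '{', '(' or '"' means (hypothetical helper).

    prev_tokens is the list of (type, text) tokens seen so far in the
    current entry, starting with the AT token.
    """
    after_entry_type = (
        len(prev_tokens) == 2
        and prev_tokens[0][0] == "AT"
        and prev_tokens[1][0] == "NAME"
    )
    if delim == '"':
        return "STRING_START"           # a quote always starts a string
    if after_entry_type and delim in ("{", "("):
        # Assumption: the entry type is compared without regard to case.
        if prev_tokens[1][1].lower() == "comment":
            return "STRING_START"       # @comment body is lexed as a string
        return "ENTRY_OPEN"             # scanner remembers which closer to expect
    if delim == "{":
        return "STRING_START"           # any other "{" starts a string
    return "OTHER"                      # case not covered by the text
```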

       Anywhere else, "{" is considered as starting a string, and we enter string mode.  '"'
       always starts a string, regardless of context.  The other tokens ("name", "number",
       "equals", "hash", and "comma") are recognized unconditionally.

       Note that "name" is a catch-all token used for entry types, citation keys, field names,
       and macro names; because BibTeX has slightly different (largely undocumented) rules for
       these various elements, a bit of trickery is needed to make things work.  As a starting
       point, consider BibTeX's definition of what's allowed for an entry key: a sequence of any
       characters except

          " # % ' ( ) , = { }

       plus space.  There are a few problems with this scheme.  First, without specifying
       the character set from which those "magic 10" characters are drawn, it's a bit hard to
       know just what is allowed.  Second, allowing "@" characters could lead to confusing BibTeX
       syntax (it doesn't confuse BibTeX, but it might confuse a human reader).  Finally,
       allowing certain characters that are special to TeX means that BibTeX can generate bogus
       TeX code: try putting a backslash ("\") or tilde ("~") in a citation key.  (This last
       exception is rather specific to the "generating (La)TeX code from a BibTeX database"
       application, but since that's the major application for BibTeX databases, then it will
       presumably be the major application for btparse, at least initially.  Thus, it makes sense
       to pay attention to this problem.)

       In btparse, then, a name is defined as any sequence of letters, digits, and the following
       characters:

          ! $ & * + - . / : ; < > ? [ ] ^ _ ` |

       This list was derived by removing BibTeX's "magic 10" from the set of printable 7-bit
       ASCII characters (32-126), and then further removing "@", "\", and "~".  This means that
       btparse disallows some of the weirder entry keys that BibTeX would accept, such as
       "\foo@bar", but still allows a string with initial digits.  In fact, from the above
       definition it appears that btparse would accept a string of all digits as a "name"; this
       is not the case, though, as the lexical scanner recognizes such a digit string as a number
       first.  There are two problems here: BibTeX entry keys may in fact be entirely numeric,
       and field names may not begin with a digit.  (Those are two of the not-so-obvious
       differences in BibTeX's handling of keys and field names.)  The tricks used to deal with
       these problems are implemented in the parser rather than the lexical scanner, so are
       described in "SYNTACTIC GRAMMAR" below.
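
       A minimal Python sketch of the name/number distinction follows.  The helper "classify" is
       hypothetical, and case-insensitive matching is an assumption made here so that upper-case
       letters count as letters; the character class itself is copied from the NAME rule above.

```python
import re

# Character class copied from the NAME rule; "-", "[" and "]" are escaped
# because they are special inside a Python character class.
NAME_CHARS = r"a-z0-9!$&*+\-./:;<>?\[\]^_`|"

# Rules in priority order: an all-digit word is a NUMBER before it is a NAME.
RULES = [
    ("NUMBER", re.compile(r"[0-9]+")),
    ("NAME", re.compile("[" + NAME_CHARS + "]+", re.IGNORECASE | re.ASCII)),
]


def classify(word):
    """Return the token type matching the whole word, or None (hypothetical helper)."""
    for kind, pattern in RULES:
        if pattern.fullmatch(word):
            return kind     # the first rule that matches wins
    return None
```

       Under these assumptions, "1984" is classified as a number, "1984orwell" as a name, and
       "\foo@bar" is rejected because "\" and "@" are not name characters.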

       The string lexer recognizes "lbrace", "rbrace", "lparen", and "rparen" tokens in order to
       count brace- or parenthesis-depth.  This is necessary so it knows when to accept a string
       delimited by braces or parentheses.  (Note that a parenthesis-delimited string is only
       allowed after @comment; this is not a normal BibTeX construct.)  In addition, it converts
       each non-space whitespace character (newline, carriage-return, and tab) to a single space.
       (Sequences of whitespace are not collapsed; that's the domain of string post-processing,
       which is well removed from the scanner or parser.)  Finally, it accepts '"' to delimit
       quote-delimited strings.  Apart from those restrictions, the string lexer accepts anything
       up to the end-of-string delimiter.
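
       The depth counting and whitespace conversion can be illustrated with a short Python
       sketch for brace- and parenthesis-delimited strings.  It is a simplification, not the
       btparse string lexer: the function name "read_string" is invented, only the delimiter
       pair that opened the string is counted, quote-delimited strings are left out, and
       dropping the outer delimiters from the result is a choice made for the example.

```python
CLOSERS = {"{": "}", "(": ")"}


def read_string(text, pos):
    """Read a delimited string starting at text[pos] (hypothetical helper).

    text[pos] must be "{" or "(".  Returns (contents, offset past the closer).
    """
    opener = text[pos]
    closer = CLOSERS[opener]
    depth = 0
    out = []
    i = pos
    while i < len(text):
        ch = text[i]
        if ch == opener:
            depth += 1
        elif ch == closer:
            depth -= 1
            if depth == 0:
                # Outermost closer reached; drop the opening delimiter too.
                return "".join(out)[1:], i + 1
        # Newline, carriage return and tab each become one space;
        # runs of whitespace are deliberately not collapsed.
        out.append(" " if ch in "\n\r\t" else ch)
        i += 1
    raise ValueError("unterminated string")
```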

SYNTACTIC GRAMMAR

       (The language used to describe the grammar here is the extended Backus-Naur Form (EBNF)
       used by PCCTS.  Terminals are represented by uppercase strings, non-terminals by lowercase
       strings; terminal names are the same as those given in the lexical grammar above.  "( foo
       )*" means zero or more repetitions of the "foo" production, and "{ foo }" means an
       optional "foo".)

       A file is just a sequence of zero or more entries:

          bibfile : ( entry )*

       An entry is an at-sign, a name (the "entry type"), and the entry body:

          entry : AT NAME body

       A body is either a string (this alternative is only tried if the entry type is "comment")
       or the entry contents:

          body : STRING                    # for comment entries
               | ENTRY_OPEN contents ENTRY_CLOSE

       ("ENTRY_OPEN" and "ENTRY_CLOSE" are either "{" and "}" or "(" and ")", depending on what
       is seen in the input for a particular entry.)

       There are three possible productions for the "contents" non-terminal.  Only one applies to
       any given entry, depending on the entry metatype (which in turn depends on the entry
       type).  Currently, btparse supports four entry metatypes: comment, preamble, macro
       definition, and regular.  The first two correspond to @comment and @preamble entries;
       "macro definition" is for @string entries; and "regular" is for all other entry types.
       (The library will be extended to handle @modify and @alias entry types, and corresponding
       "modify" and "alias" metatypes, when BibTeX 1.0 is released and the exact syntax is
       known.)  The "metatype" concept is necessary so that all entry types that aren't
       specifically recognized fall into the "regular" metatype.  It's also convenient not to
       have to "strcmp" the entry type all the time.

          contents : ( NAME | NUMBER ) COMMA fields     # for regular entries
                   | fields                # for macro definition entries
                   | value                 # for preamble entries

       Note that the entry key is not just a "NAME", but "( NAME | NUMBER )".  This is necessary
       because BibTeX allows all-numeric entry keys, but btparse's lexical scanner recognizes
       such digit strings as "NUMBER" tokens.

       "fields" is a comma-separated list of fields, with an optional single trailing comma:

          fields : field { COMMA fields }
                 |

       A "field" is a single "field = value" assignment:

          field : NAME EQUALS value

       Note that "NAME" here is a restricted version of the "name" token described in "LEXICAL
       GRAMMAR" above.  Any "name" token will be accepted by the parser, but it is immediately
       checked to ensure that it doesn't begin with a digit; if it does, an artificial syntax
       error is triggered.  (This is for compatibility with BibTeX, which doesn't allow field
       names to start with a digit.)

       A "value" is a series of simple values joined by '#' characters:

          value : simple_value ( HASH simple_value )*

       A simple value is a string, number, or name (for macro invocations):

          simple_value : STRING
                       | NUMBER
                       | NAME
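
       To show how the productions fit together, here is a small recursive-descent sketch in
       Python that follows the grammar above.  It is not the PCCTS-generated parser: it takes a
       list of (type, text) token pairs assumed to come from a scanner, and the names "Parser"
       and "ParseError", the dictionary results, and the lower-casing of the entry type are all
       assumptions made for the example.

```python
class ParseError(Exception):
    pass


class Parser:
    def __init__(self, tokens):
        self.toks = list(tokens) + [("EOF", "")]
        self.i = 0

    def peek(self):
        return self.toks[self.i][0]

    def expect(self, *kinds):
        kind, text = self.toks[self.i]
        if kind not in kinds:
            raise ParseError(f"expected {' or '.join(kinds)}, found {kind}")
        self.i += 1
        return text

    def entry(self):                    # entry : AT NAME body
        self.expect("AT")
        etype = self.expect("NAME").lower()
        if etype == "comment":          # body : STRING
            return {"type": etype, "body": self.expect("STRING")}
        self.expect("ENTRY_OPEN")       # body : ENTRY_OPEN contents ENTRY_CLOSE
        if etype == "preamble":
            result = {"type": etype, "value": self.value()}
        elif etype == "string":         # macro definition entry
            result = {"type": etype, "fields": self.fields()}
        else:                           # regular entry; key may be all digits
            key = self.expect("NAME", "NUMBER")
            self.expect("COMMA")
            result = {"type": etype, "key": key, "fields": self.fields()}
        self.expect("ENTRY_CLOSE")
        return result

    def fields(self):                   # fields : field { COMMA fields } | (empty)
        out = []
        while self.peek() == "NAME":
            out.append(self.field())
            if self.peek() != "COMMA":
                break
            self.expect("COMMA")        # also accepts one trailing comma
        return out

    def field(self):                    # field : NAME EQUALS value
        name = self.expect("NAME")
        if name[0].isdigit():           # the "artificial" syntax error
            raise ParseError(f"field name {name!r} begins with a digit")
        self.expect("EQUALS")
        return name, self.value()

    def value(self):                    # value : simple_value ( HASH simple_value )*
        parts = [self.simple_value()]
        while self.peek() == "HASH":
            self.expect("HASH")
            parts.append(self.simple_value())
        return parts

    def simple_value(self):             # simple_value : STRING | NUMBER | NAME
        kind = self.peek()
        return kind, self.expect("STRING", "NUMBER", "NAME")
```

       Because the key is parsed with "( NAME | NUMBER )", an all-numeric key such as 1984 is
       accepted, while a field name such as "2nd" is rejected by the digit check in "field".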

SEE ALSO

       btparse

AUTHOR

       Greg Ward <gward@python.net>