

       bt_split_names - splitting up BibTeX names and lists of names


          bt_stringlist * bt_split_list (char *   string,
                                         char *   delim,
                                         char *   filename,
                                         int      line,
                                         char *   description);
          void bt_free_list (bt_stringlist *list);
          bt_name * bt_split_name (char *  name,
                                   char *  filename,
                                   int     line,
                                   int     name_num);
          void bt_free_name (bt_name * name);


       When BibTeX files are used for their original purpose---bibliographic entries describing
       scholarly publications---processing lists of names (authors and editors mostly) becomes
       important.  Although such name-processing is outside the general-purpose database domain
       of most of the btparse library, these splitting functions are provided as a concession to
       reality: most BibTeX data files use the BibTeX conventions for author names, and a library
       to process that data ought to be capable of processing the names.

       Name-processing comes in two stages: first, split up a list of names into individual
       strings; second, split up each name into "parts" (first, von, last, and jr).  The first is
       actually quite general: you could pick a delimiter (such as 'and', used for lists of
       names) and use it to divide any string into substrings.  "bt_split_list()" could then be
       called to break up the original string and extract the substrings.  "bt_split_name()",
       however, is quite specific to four-part author names written using BibTeX conventions.
       (These conventions are described informally in any BibTeX documentation; the description
       you will find here is more formal and algorithmic---and thus harder to understand.)

       See bt_format_names for information on turning split-up names back into strings in a
       variety of ways.


              bt_stringlist * bt_split_list (char *   string,
                                             char *   delim,
                                             char *   filename,
                                             int      line,
                                             char *   description)

           Splits "string" into substrings delimited by "delim" (a fixed string).  The splitting
           is done according to the rules used by BibTeX for splitting up a list of names, in
           particular:

           ·   delimiters at beginning or end of string are ignored

           ·   delimiters must be surrounded by whitespace

           ·   matching of delimiters is case insensitive

           ·   delimiters at non-zero brace depth are ignored

           For instance, if the delimiter is "and", then the string

              Candy and Apples AnD {Green Eggs and Ham}

           splits into three substrings: "Candy", "Apples", and "{Green Eggs and Ham}".

           If there are extra delimiters at the extremities of the string---say, an "and" at the
           beginning of the string---then they are included in the first/last string; no warning
           is currently printed, but this may change.  Successive delimiters ("and and") result
           in a warning and a NULL string being added to the list of substrings.  For instance,
           the string

              and Joe Q. Blow and and Smith, Jr., John

           would split into three substrings: "and Joe Q. Blow", "NULL", and "Smith, Jr., John".

           (If these rules seem somewhat odd, don't blame me: I just implemented BibTeX's
           observed behaviour and added warning messages for one of the more obvious and easily-
           detected mistakes.)
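
           The rules above can be sketched in a few lines of self-contained C.  This is only
           an illustration of the documented behaviour, not btparse's implementation:
           "demo_count_items()" and its helper are hypothetical names, and the sketch merely
           counts the substrings a split would yield for a non-empty string.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Case-insensitive test: does `s` begin with `delim`?  Stops safely at the
   end of `s`, because a NUL never equals a delimiter character. */
static int starts_with_nocase(const char *s, const char *delim)
{
    while (*delim != '\0') {
        if (tolower((unsigned char)*s) != tolower((unsigned char)*delim))
            return 0;
        s++;
        delim++;
    }
    return 1;
}

/* Hypothetical helper (not part of btparse): number of substrings that a
   split of the non-empty string `s` on `delim` would yield. */
int demo_count_items(const char *s, const char *delim)
{
    size_t dlen = strlen(delim);
    size_t i = 0;
    int depth = 0;   /* current brace depth */
    int items = 1;   /* a string with no delimiters is one item */

    if (dlen == 0)
        return items;

    while (s[i] != '\0') {
        if (s[i] == '{') {
            depth++;
        } else if (s[i] == '}' && depth > 0) {
            depth--;
        } else if (depth == 0 && i > 0
                   && isspace((unsigned char)s[i - 1])
                   && starts_with_nocase(s + i, delim)
                   && isspace((unsigned char)s[i + dlen])) {
            /* A delimiter at brace depth zero with whitespace on both sides. */
            items++;
            i += dlen;   /* keep the trailing space so "and and" matches twice */
            continue;
        }
        i++;
    }
    return items;
}
```

           A delimiter at the very start or end of the string has no whitespace on one side,
           so the sketch leaves it inside the first or last item, as described above.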

           The substrings are returned as a "bt_stringlist" structure:

              typedef struct
              {
                 char *  string;
                 int     num_items;
                 char ** items;
              } bt_stringlist;

           There is currently no elegant interface to this structure: you just have to poke
           around in it yourself.  The fields are:

           string
               a copy of the "string" parameter passed to "bt_split_list()", but with NUL
               characters replacing the space after each substring.  (This is safe because
               delimiters must be surrounded by whitespace, which means that each substring is
               followed by whitespace which is not part of the substring.)  You probably
               shouldn't fiddle with "string"; it's just there so that "bt_free_list()" has
               something to "free()".

           num_items
               the number of substrings found in the string passed to "bt_split_list()".

           items
               an array of "num_items" pointers into "string".  For instance, "items[1]" points
               to the second substring.  Since "string" has been mangled with NUL characters, it
               is safe to treat "items[i]" as a regular C string.

           "filename", "line", and "description" are all used for generating warning messages.
           "filename" and "line" simply describe where the string came from, and "description" is
           a brief (one word) description of the substrings.  For instance, if you are splitting
           a list of names, supply "name" for "description"---that way, warnings will refer to
           "name X" rather than "substring X".
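
           Since the structure has to be walked by hand, a loop along the following lines may
           help.  It is a sketch only: "demo_stringlist" is a hypothetical mirror of the
           layout documented above (so that the example compiles without btparse.h), and
           "demo_print_items()" is not a btparse function.  The NULL check covers the case of
           successive delimiters described earlier.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-in with the same fields as the documented
   bt_stringlist; real code would include btparse.h instead. */
typedef struct
{
    char *  string;
    int     num_items;
    char ** items;
} demo_stringlist;

/* Print every item and return how many of them were non-NULL. */
int demo_print_items(const demo_stringlist *list)
{
    int i;
    int present = 0;

    assert(list != NULL);
    for (i = 0; i < list->num_items; i++) {
        if (list->items[i] == NULL) {
            /* An empty slot, e.g. from two delimiters in a row. */
            printf("item %d: (empty)\n", i);
        } else {
            printf("item %d: %s\n", i, list->items[i]);
            present++;
        }
    }
    return present;
}
```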

              void bt_free_list (bt_stringlist *list)

           Frees a "bt_stringlist" structure as returned by "bt_split_list()".  That is, it frees
           the copy of the string you passed to "bt_split_list()", and then frees the structure
           itself.

              bt_name * bt_split_name (char *  name,
                                       char *  filename,
                                       int     line,
                                       int     name_num)

           Splits a single BibTeX-style author name into four parts: first, von, last, and jr.
           This handles most, though not all, names written in the style of the major Western
           European languages.  (Alas!)

           A name is split by first dividing into tokens; tokens are separated by whitespace or
           commas at brace-level zero.  Thus the name

              van der Graaf, Horace Q.

           has five tokens, whereas the name

              {Foo, Bar, and Sons}

           consists of a single token.
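
           The tokenizing rule can be illustrated with a small counter.  This is a sketch
           under the rule as stated here (separators are whitespace or commas at brace-level
           zero); "demo_count_tokens()" is a hypothetical name, not a btparse function.

```c
#include <ctype.h>

/* Hypothetical helper: count the tokens in `name`, where tokens are
   separated by whitespace or commas at brace-level zero. */
int demo_count_tokens(const char *name)
{
    int depth = 0;      /* current brace depth */
    int in_token = 0;   /* are we inside a token right now? */
    int tokens = 0;
    const char *p;

    for (p = name; *p != '\0'; p++) {
        int sep;

        if (*p == '{')
            depth++;
        else if (*p == '}' && depth > 0)
            depth--;

        /* Braces themselves are never separators, so a braced group
           always stays inside a single token. */
        sep = (depth == 0) && (*p == ',' || isspace((unsigned char)*p));

        if (sep) {
            in_token = 0;
        } else if (!in_token) {
            in_token = 1;
            tokens++;
        }
    }
    return tokens;
}
```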

           How tokens are divided into parts depends on the form of the name.  If the name has no
           commas at brace-level zero (as in the second example), then it is assumed to be in
           either "first last" or "first von last" form.  If there are no tokens that start with
           a lower-case letter, then "first last" form is assumed: the final token is the last
           name, and all other tokens form the first name.  Otherwise, the earliest contiguous
           sequence of tokens with initial lower-case letters is taken as the `von' part; if this
           sequence includes the final token, then a warning is printed and the final token is
           forced to be the `last' part.
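
           For names without commas, the partition can be sketched as follows.  The sketch
           assumes the tokens are already split, treats a token as lower-case purely by its
           first character (so a token starting with a brace counts as not lower-case, which
           may differ from what btparse does), and uses the hypothetical names "demo_parts"
           and "demo_split_no_comma()".  It returns only the length of each part; the parts
           appear in the order first, von, last.

```c
#include <ctype.h>

/* Hypothetical result type: number of tokens in each part. */
typedef struct
{
    int first_len;
    int von_len;
    int last_len;
} demo_parts;

static int starts_lower(const char *tok)
{
    return islower((unsigned char)tok[0]) != 0;
}

/* Partition n tokens of a comma-free name into first / von / last. */
demo_parts demo_split_no_comma(const char **tokens, int n)
{
    demo_parts r = { 0, 0, 0 };
    int i = 0;
    int von_start, von_end;

    if (n <= 0)
        return r;

    /* Find the earliest token with an initial lower-case letter. */
    while (i < n && !starts_lower(tokens[i]))
        i++;

    if (i == n) {
        /* "first last" form: the final token is the last name. */
        r.first_len = n - 1;
        r.last_len = 1;
        return r;
    }

    /* Extend over the contiguous run of lower-case tokens. */
    von_start = i;
    while (i < n && starts_lower(tokens[i]))
        i++;
    von_end = i;

    /* If the run swallowed the final token, force that token to be the
       'last' part (the library prints a warning in this case). */
    if (von_end == n)
        von_end = n - 1;

    r.first_len = von_start;
    r.von_len = von_end - von_start;
    r.last_len = n - von_end;
    return r;
}
```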

           If a name has a single comma, then it is assumed to be in "von last, first" form.  A
           leading sequence of tokens with initial lower-case letters, if any, forms the `von'
           part; tokens between the `von' and the comma form the `last' part; tokens following
           the comma form the `first' part.  Again, if there are no tokens following a leading
           sequence of lowercase tokens, a warning is printed and the token immediately preceding
           the comma is taken to be the `last' part.

           If a name has more than two commas, a warning is printed and the name is treated as
           though only the first two commas were present.

           Finally, if a name has two commas, it is assumed to be in "von last, jr, first" form.
           (This is the only way to represent a name with a `jr' part.)  The parsing of the name
           is the same as for a one-comma name, except that tokens between the two commas are
           taken to be the `jr' part.
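
           Which form applies therefore depends only on the number of commas at brace-level
           zero.  A minimal sketch of that decision, using the hypothetical names
           "demo_form" and "demo_name_form()" (neither is part of btparse):

```c
/* Hypothetical labels for the three documented name forms. */
typedef enum
{
    DEMO_FORM_NO_COMMA,     /* "first last" or "first von last" */
    DEMO_FORM_ONE_COMMA,    /* "von last, first"                */
    DEMO_FORM_TWO_COMMAS    /* "von last, jr, first"            */
} demo_form;

/* Count the commas in `name` that sit at brace-level zero. */
static int demo_count_commas(const char *name)
{
    int depth = 0;
    int commas = 0;
    const char *p;

    for (p = name; *p != '\0'; p++) {
        if (*p == '{')
            depth++;
        else if (*p == '}' && depth > 0)
            depth--;
        else if (*p == ',' && depth == 0)
            commas++;
    }
    return commas;
}

/* Pick the form; more than two commas is treated like two, which is
   the case where the library prints a warning. */
demo_form demo_name_form(const char *name)
{
    int commas = demo_count_commas(name);

    if (commas == 0)
        return DEMO_FORM_NO_COMMA;
    if (commas == 1)
        return DEMO_FORM_ONE_COMMA;
    return DEMO_FORM_TWO_COMMAS;
}
```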

           The one case not properly handled by BibTeX name conventions is a name with a 'jr'
           part not separated from the last name by a comma; for example:

              Henry Ford Jr.
              George Herbert Walker Bush III

           Both of these would be incorrectly interpreted by both BibTeX and bt_split_name(): the
           "Jr." or "III" token would be taken as the last name, and the other tokens as a two-
           or four-part first name.  The workaround is to shoehorn the 'jr' into the last name:

              Henry {Ford Jr.}
              George Herbert Walker {Bush III}

           but this will make it impossible to extract the last name on its own, e.g. to generate
           "author-year" style citations.  This design flaw may be fixed in a future version of

           The split-up name is returned as a "bt_name" structure:

              typedef struct
              {
                 bt_stringlist * tokens;
                 char ** parts[BT_MAX_NAMEPARTS];
                 int     part_len[BT_MAX_NAMEPARTS];
              } bt_name;

           Again, there's no nice interface to this structure; you'll just have to access the
           fields individually.  They are:

           tokens
               the name, broken down into a flat list of tokens.  See above for a description of
               the "bt_stringlist" structure.

           parts
               an array of arrays of pointers into the token list.  The major dimension of this
               beast is the "name part"; you should index this dimension using the "bt_namepart"
               enum.  For instance, "parts[BTN_LAST]" is an array of pointers to the tokens
               comprising the last name; "parts[BTN_LAST][1]" is a "char *": the second token of
               the 'last' part; and "parts[BTN_LAST][1][0]" is the first character of the second
               token of the 'last' part.

           part_len
               the length, in tokens, of each part.  For instance, you might loop over all tokens
               in the 'first' part as follows (assuming "name" is a "bt_name *" returned by
               "bt_split_name()"):

                  for (i = 0; i < name->part_len[BTN_FIRST]; i++)
                     printf ("token %d of first name: %s\n",
                             i, name->parts[BTN_FIRST][i]);
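
           The same indexing can be used to glue the tokens of one part back together.  The
           sketch below assumes a hypothetical "demo_name" type that mirrors only the
           "parts" and "part_len" fields (so that it compiles without btparse.h), and the
           "DEMO_*" constants stand in for the "bt_namepart" enum.  For real formatting
           needs, see bt_format_names.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins for the bt_namepart enum and BT_MAX_NAMEPARTS. */
enum { DEMO_FIRST, DEMO_VON, DEMO_LAST, DEMO_JR, DEMO_MAX_PARTS };

/* Hypothetical mirror of the two bt_name fields used here. */
typedef struct
{
    char ** parts[DEMO_MAX_PARTS];
    int     part_len[DEMO_MAX_PARTS];
} demo_name;

/* Join the tokens of one part with single spaces into buf.
   Returns 0 on success, or -1 if buf (of `size` bytes) is too small. */
int demo_join_part(const demo_name *name, int part, char *buf, size_t size)
{
    size_t used = 0;
    int i;

    assert(name != NULL && buf != NULL && size > 0);
    buf[0] = '\0';

    for (i = 0; i < name->part_len[part]; i++) {
        const char *tok = name->parts[part][i];
        size_t len = strlen(tok);
        size_t need = len + (i > 0 ? 1 : 0);   /* token plus separator */

        if (used + need >= size)               /* keep room for the NUL */
            return -1;
        if (i > 0)
            buf[used++] = ' ';
        memcpy(buf + used, tok, len);
        used += len;
        buf[used] = '\0';
    }
    return 0;
}
```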

              void bt_free_name (bt_name * name)

           Frees the "bt_name" structure created by "bt_split_name()" (including the
           "bt_stringlist" structure inside the "bt_name").


       btparse, bt_format_names


       Greg Ward