Provided by: unibetacode_2.3-5_amd64

NAME

       libunibetacode - Library for Beta Code to Unicode conversion

SYNOPSIS

       int
       ub_beta2greek (char *beta_string, int max_beta_string, char *utf8_string, int max_utf8_string);

       int
       ub_beta2coptic (char *beta_string, int max_beta_string, char *utf8_string, int max_utf8_string);

       int
       ub_beta2hebrew (char *beta_string, int max_beta_string, char *utf8_string, int max_utf8_string);

       int
       ub_greek2beta (char *utf8_string, int max_utf8_string, char *beta_string, int max_beta_string);

       int
       ub_coptic2beta (char *utf8_string, int max_utf8_string, char *beta_string, int max_beta_string);

       int
       ub_hebrew2beta (char *utf8_string, int max_utf8_string, char *beta_string, int max_beta_string);

       int
       ub_codept2utf8 (unsigned codept, char *utf8_bytes);

       int
       ub_utf82codept (char *utf8_bytes, unsigned codept);

DESCRIPTION

       libunibetacode  is  a  self-contained  C library with functions to convert between UTF-8 Unicode and Beta
       Code, as adopted by the University of California, Irvine Thesaurus Linguae Graecae (TLG) Program and  the
       Tufts  University  Perseus  Project,  among others.  Beta Code provides a way of encoding polytonic Greek
       characters using plain ASCII characters.  Beta Code also provides some support for  encoding  Coptic  and
       Hebrew.

       The libunibetacode package contains three top-level functions to convert from Beta Code to UTF-8 Unicode,
       and three top-level functions to convert from UTF-8 Unicode to Beta Code.

       The top-level functions to convert Beta Code to UTF-8 Unicode are:

              ub_beta2greek(3) converts a Greek Beta Code input string to a UTF-8 output string.

              ub_beta2coptic(3) converts a Coptic Beta Code input string to a UTF-8 output string.

              ub_beta2hebrew(3) converts a Hebrew Beta Code input string to a UTF-8 output string.

       The top-level functions to convert UTF-8 Unicode to Beta Code are:

              ub_greek2beta(3) converts a Greek UTF-8 input string to a Greek Beta Code output string.

              ub_coptic2beta(3) converts a Coptic UTF-8 input string to a Coptic Beta Code output string.

              ub_hebrew2beta(3) converts a Hebrew UTF-8 input string to a Hebrew Beta Code output string.

       In addition:

              ub_codept2utf8(3) converts a Unicode code point to a UTF-8 output string.

              ub_utf82codept(3) converts a Unicode UTF-8 string to a Unicode code point.

       A Unicode code point is the numeric value assigned to a character or other entity in the Unicode
       Standard.  By convention, Unicode code points are written in the form U+xxxx, where "xxxx" is a
       string of four hexadecimal digits for a code point in the Unicode Basic Multilingual Plane.  Code
       points beyond that plane are written with five or six hexadecimal digits.
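
       The notation can be illustrated with a minimal sketch.  The helper name format_codept is
       hypothetical and is not part of libunibetacode; it only shows how a numeric code point is
       written in the U+xxxx form.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper (not part of libunibetacode): write the
   conventional "U+xxxx" notation for code point cp into buf.
   Returns the number of characters written, or -1 if buf is too small. */
int format_codept(unsigned cp, char *buf, size_t bufsize)
{
    int n;

    assert(buf != NULL);
    /* %04X pads to at least four digits and grows for larger values. */
    n = snprintf(buf, bufsize, "U+%04X", cp);
    return (n < 0 || (size_t)n >= bufsize) ? -1 : n;
}
```

       With this sketch, the code point 0x3B1 (Greek small letter alpha) is written as "U+03B1", and
       0x1D000 is written as "U+1D000".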

       All  of  these  functions  are  non-destructive: they will not alter the input strings that are passed to
       them.

       State is not preserved between calls to any of these functions.

       The Beta Code conversion functions (ub_beta2greek, ub_beta2coptic, and ub_beta2hebrew) expect  the  input
       string  to  contain  only  Beta Code sequences for Greek, Coptic, or Hebrew, respectively.  Likewise, the
       language-specific  UTF-8  to  Beta  Code  conversion  functions   (ub_greek2beta,   ub_coptic2beta,   and
       ub_hebrew2beta)  expect  the  input  string to contain only UTF-8 code points that map to valid Beta Code
       sequences in the respective language.

       The functions ub_codept2utf8 and ub_utf82codept support the entire Unicode space of U+0000 through
       U+10FFFF.  Thus they are not tied to one Beta Code language (Greek, Coptic, or Hebrew), and so can
       complement the other functions.
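
       For readers unfamiliar with the relationship between code points and UTF-8 bytes, the following
       is a rough standalone sketch of the standard UTF-8 encoding and decoding rules.  It is not the
       library's implementation.  The names sketch_encode_utf8 and sketch_decode_utf8 are hypothetical,
       and the decoder returns its result through a pointer, which is an assumption of this sketch
       rather than the signature shown in SYNOPSIS.

```c
#include <assert.h>
#include <stddef.h>

/* Sketch only: encode cp as UTF-8 into out (room for 5 bytes needed),
   null-terminate it, and return the number of bytes written before
   the terminator.  Returns 0 for surrogates and values > U+10FFFF. */
int sketch_encode_utf8(unsigned cp, char *out)
{
    unsigned char *p = (unsigned char *)out;
    int n;

    assert(out != NULL);
    if (cp < 0x80) {
        p[0] = (unsigned char)cp;
        n = 1;
    } else if (cp < 0x800) {
        p[0] = (unsigned char)(0xC0 | (cp >> 6));
        p[1] = (unsigned char)(0x80 | (cp & 0x3F));
        n = 2;
    } else if (cp < 0x10000) {
        if (cp >= 0xD800 && cp <= 0xDFFF) { p[0] = 0; return 0; }
        p[0] = (unsigned char)(0xE0 | (cp >> 12));
        p[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        p[2] = (unsigned char)(0x80 | (cp & 0x3F));
        n = 3;
    } else if (cp <= 0x10FFFF) {
        p[0] = (unsigned char)(0xF0 | (cp >> 18));
        p[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        p[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        p[3] = (unsigned char)(0x80 | (cp & 0x3F));
        n = 4;
    } else {
        p[0] = 0;
        return 0;
    }
    p[n] = 0;
    return n;
}

/* Sketch only: decode the first UTF-8 sequence in s into *cp and
   return the number of bytes consumed, or 0 if the lead byte or a
   continuation byte is malformed.  Overlong forms are not rejected. */
int sketch_decode_utf8(const char *s, unsigned *cp)
{
    const unsigned char *p = (const unsigned char *)s;
    unsigned v;
    int n, i;

    assert(s != NULL && cp != NULL);
    if (p[0] < 0x80)                { v = p[0];        n = 1; }
    else if ((p[0] & 0xE0) == 0xC0) { v = p[0] & 0x1F; n = 2; }
    else if ((p[0] & 0xF0) == 0xE0) { v = p[0] & 0x0F; n = 3; }
    else if ((p[0] & 0xF8) == 0xF0) { v = p[0] & 0x07; n = 4; }
    else return 0;

    for (i = 1; i < n; i++) {
        /* A null terminator fails this check, so s is never over-read. */
        if ((p[i] & 0xC0) != 0x80) return 0;
        v = (v << 6) | (p[i] & 0x3F);
    }
    *cp = v;
    return n;
}
```

       In this sketch, U+03B1 encodes to the two bytes 0xCE 0xB1, and U+1D000 encodes to the four bytes
       0xF0 0x9D 0x80 0x80.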

       libunibetacode supports the language-specific Beta Code letter and punctuation symbol mappings described
       in unibetacode(5).

       The  additional  capabilities  described in unibetacode(5) section "EXTENSIONS FOR ASCII AND UNICODE" are
       not implemented.  There is also  no  function  to  perform  the  equivalent  of  the  standalone  program
       unibetaprep(1).   As  a  consequence,  ub_beta2greek does not support the full Beta Code numeric sequence
       range beginning with '#' and followed by a decimal number.  For  example,  the  Unicode  Byzantine  Music
       Symbols  having  TLG Beta Code encodings of '#2000' through '#2245' (corresponding to Unicode code points
       U+1D000 through U+1D0F5) are not supported.
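
       A calling program that needs these symbols could compute the code points itself.  The sketch
       below assumes that the mapping between the two ranges is linear, which is implied by the
       matching endpoints given above but is not confirmed here.  The helper name
       byzantine_beta_to_codept is hypothetical.

```c
#include <assert.h>

/* Hypothetical helper (not part of libunibetacode).  Maps the decimal
   number n of a TLG Beta Code sequence '#n' in the range 2000..2245 to
   a code point in U+1D000..U+1D0F5.  ASSUMPTION: the mapping is linear
   between the documented endpoints.  Returns 0 if n is out of range. */
unsigned byzantine_beta_to_codept(unsigned n)
{
    if (n < 2000u || n > 2245u)
        return 0;
    return 0x1D000u + (n - 2000u);
}
```

       The resulting code point could then be passed to ub_codept2utf8 to obtain the UTF-8 bytes.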

       The three Beta Code to UTF-8 Unicode functions also do not support the  Unicode  code  point  description
       format  of  the  form  "\uxxxx" that beta2uni(1) supports.  That limits the usefulness of ub_beta2hebrew,
       because the TLG Beta Code specification only contains encodings for Hebrew consonants, not for vowels  or
       cantillation marks.  A user program could use ub_codept2utf8 along with ub_beta2hebrew to fill this gap.
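
       One way a user program might fill that gap is to recognize the "\uxxxx" form itself and hand the
       resulting code point to ub_codept2utf8.  The sketch below is a minimal parser for that form.  The
       helper name parse_u_escape is hypothetical, and accepting exactly four hexadecimal digits is an
       assumption based on the form quoted above; beta2uni(1) may accept other lengths.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper (not part of libunibetacode).  If s begins with
   a backslash, 'u', and exactly four hexadecimal digits, store the
   value in *cp and return 6 (the characters consumed).  Otherwise
   return 0 and leave *cp unchanged. */
int parse_u_escape(const char *s, unsigned *cp)
{
    unsigned v = 0;
    int i;

    assert(s != NULL && cp != NULL);
    if (s[0] != '\\' || s[1] != 'u')
        return 0;
    for (i = 2; i < 6; i++) {
        int c = (unsigned char)s[i];
        /* A null terminator is not a hex digit, so s is never over-read. */
        if (!isxdigit(c))
            return 0;
        v = v * 16u + (unsigned)(isdigit(c) ? c - '0' : tolower(c) - 'a' + 10);
    }
    *cp = v;
    return 6;
}
```

       For example, the input "\u05B8" yields the code point U+05B8 (Hebrew point qamats), which has no
       TLG Beta Code encoding.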

       Balanced  double  quotes  are  supported in ub_beta2greek and ub_beta2coptic, but the opening and closing
       quotation marks must appear in the same input string because state is  not  preserved  between  calls  to
       those functions.  (An input string can contain embedded newlines.)  Quotation marks in ub_beta2hebrew are
       output as the ASCII double quote mark (").
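
       Because of this, a program that reads input in pieces may want to check that a piece holds
       complete quotations before converting it.  The following is a minimal sketch under the
       assumption that quotation marks appear in the Beta Code input as the ASCII double quote
       character; the helper name quotes_balanced is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not part of libunibetacode).  Returns 1 if s
   contains an even number of ASCII double quote characters, so every
   opening quote has its closing quote in the same string; else 0. */
int quotes_balanced(const char *s)
{
    size_t count = 0;

    assert(s != NULL);
    for (; *s != '\0'; s++) {
        if (*s == '"')
            count++;
    }
    return count % 2 == 0;
}
```

       If the check fails, the caller could keep appending input (including newlines) to the same
       buffer until the quotation closes, and only then call ub_beta2greek or ub_beta2coptic.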

       The ub_greek2beta function will determine whether a Greek letter follows a lower-case sigma in the  input
       UTF-8  string, and based upon that convert Greek medial and final small sigma to "s" if context will make
       the conversion back from Beta Code to UTF-8 unambiguous.  If this is not the case, small  sigma  will  be
       converted to "s1" for small medial sigma or "s2" for small final sigma.  For example, if a final sigma is
       followed by a letter, then the final sigma will be converted to  Beta  Code  as  "s2"  to  ensure  proper
       conversion back from Beta Code into UTF-8.
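
       The decision described above can be sketched as follows.  This is an illustration of the rule as
       stated in this section, not the library's source.  The function names are hypothetical, the input
       is assumed to be already decoded into an array of code points, and the "is a Greek letter" test
       is a crude block-range check that also matches some non-letters.

```c
#include <assert.h>
#include <stddef.h>

/* Crude ASSUMPTION: treat any code point in the Greek and Coptic block
   or the Greek Extended block as a Greek letter. */
static int is_greek_letter(unsigned cp)
{
    return (cp >= 0x0370 && cp <= 0x03FF) || (cp >= 0x1F00 && cp <= 0x1FFF);
}

/* Hypothetical helper (not part of libunibetacode).  Returns the Beta
   Code string for the small sigma at cps[i], or NULL if cps[i] is not
   a small sigma.  U+03C3 is medial sigma; U+03C2 is final sigma. */
const char *sigma_to_beta(const unsigned *cps, size_t len, size_t i)
{
    int letter_follows;

    assert(cps != NULL && i < len);
    letter_follows = (i + 1 < len) && is_greek_letter(cps[i + 1]);

    if (cps[i] == 0x03C3)               /* medial form */
        return letter_follows ? "s" : "s1";
    if (cps[i] == 0x03C2)               /* final form */
        return letter_follows ? "s2" : "s";
    return NULL;
}
```

       In this sketch a plain "s" is emitted only when the position alone determines the form: a medial
       sigma before a letter, or a final sigma that no letter follows.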

       Note: Thesaurus Linguae Graecae and TLG are registered trademarks of the University of California.

PARAMETERS

       The top-level functions described in this document take the following parameters:

              beta_string         A null-terminated string with Beta Code sequences for the corresponding script
                                  (Greek, Coptic, or Hebrew).  This  string  is  an  input  for  functions  that
                                  convert from Beta Code to UTF-8, and an output for functions that convert from
                                  UTF-8 to Beta Code.

              max_beta_string     The maximum size of beta_string, in bytes, to prevent accesses past the end of
                                  the array.

              utf8_string         A  null-terminated  string  with UTF-8 Unicode sequences for the corresponding
                                  script (Greek, Coptic, or Hebrew).  This string is  an  output  for  functions
                                  that  convert from Beta Code to UTF-8, and an input for functions that convert
                                  from UTF-8 to Beta Code.

              max_utf8_string     The maximum size of utf8_string, in bytes, to prevent accesses past the end of
                                  the array.

              codept              An  unsigned  32-bit  Unicode code point.  This is an input to ub_codept2utf8,
                                  and an output from ub_utf82codept.

              utf8_bytes          The null-terminated UTF-8 byte string corresponding to the Unicode code  point
                                  stored  in  codept.   This  is  an output from ub_codept2utf8, and an input to
                                  ub_utf82codept.
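
       When choosing a value for max_utf8_string, a caller needs an output array large enough for the
       converted text.  The sketch below gives a conservative bound.  It assumes that every character
       written to the output consumes at least one byte of Beta Code input; that assumption is not
       stated in this manual.  The helper name utf8_capacity_for_beta is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not part of libunibetacode).  Returns a
   conservative size, in bytes, for a UTF-8 output array: a UTF-8
   character is at most 4 bytes, plus 1 byte for the terminating null.
   ASSUMPTION: each output character uses at least one input byte. */
size_t utf8_capacity_for_beta(const char *beta_string)
{
    assert(beta_string != NULL);
    return 4 * strlen(beta_string) + 1;
}
```

       The value returned here would be used both to allocate utf8_string and, converted to int, as the
       max_utf8_string argument.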

   UNICODE GREEK
       The Greek Extended range of The Unicode Standard (U+1F00 - U+1FFF) contains 16 small and  capital  vowels
       that  have  identical  representation  in the Greek and Coptic range (U+0370 - U+03FF).  These are vowels
       with an oxia (acute) accent in the Greek Extended range; they have equivalent glyphs with a tonos (acute)
       accent  in  the  Greek and Coptic range.  Because of this duplication, the use of these 16 Greek Extended
       glyphs is deprecated.

       However, unlike the beta2uni program, by default the function ub_beta2greek maps to those 16 deprecated
       code points.  This was done after observing that in many fonts the glyphs within the Greek Extended
       block look consistent with one another, but do not look consistent with the glyphs in the Greek and
       Coptic block.

       The choice between these two options is compiled in with  a  #define  statement  near  the  beginning  of
       "ub_beta2greek.c",  which  is  in  the  "src/libsrc"  directory  in  the  source  distribution.  To avoid
       conversion to these 16 deprecated code points, change the following two lines:

              // #define GREEK_COMBINING beta2combining
              #define GREEK_COMBINING beta2combining_alt

       to this:

              #define GREEK_COMBINING beta2combining
              // #define GREEK_COMBINING beta2combining_alt

       and then recompile the package by running "make" in the top-level package source directory.
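
       As an alternative to recompiling, a calling program could remap the 16 code points after
       conversion.  The sketch below is one possible approach.  The helper name oxia_to_tonos is
       hypothetical, and the table is taken from the Unicode canonical equivalences for these
       characters, not from the libunibetacode source.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not part of libunibetacode).  If cp is one of
   the 16 Greek Extended vowels with oxia, return the equivalent code
   point with tonos in the Greek and Coptic block; else return cp. */
unsigned oxia_to_tonos(unsigned cp)
{
    static const unsigned map[16][2] = {
        {0x1F71, 0x03AC},  /* small alpha */
        {0x1F73, 0x03AD},  /* small epsilon */
        {0x1F75, 0x03AE},  /* small eta */
        {0x1F77, 0x03AF},  /* small iota */
        {0x1F79, 0x03CC},  /* small omicron */
        {0x1F7B, 0x03CD},  /* small upsilon */
        {0x1F7D, 0x03CE},  /* small omega */
        {0x1FBB, 0x0386},  /* capital alpha */
        {0x1FC9, 0x0388},  /* capital epsilon */
        {0x1FCB, 0x0389},  /* capital eta */
        {0x1FD3, 0x0390},  /* small iota with dialytika */
        {0x1FDB, 0x038A},  /* capital iota */
        {0x1FE3, 0x03B0},  /* small upsilon with dialytika */
        {0x1FEB, 0x038E},  /* capital upsilon */
        {0x1FF9, 0x038C},  /* capital omicron */
        {0x1FFB, 0x038F}   /* capital omega */
    };
    size_t i;

    for (i = 0; i < sizeof map / sizeof map[0]; i++) {
        if (map[i][0] == cp)
            return map[i][1];
    }
    return cp;
}
```

       A caller would decode each UTF-8 character of the output to a code point, pass it through this
       function, and re-encode the result.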

RETURN VALUES

       Each of the four library functions that produce UTF-8 output (ub_beta2greek, ub_beta2coptic,
       ub_beta2hebrew, and ub_codept2utf8) returns the number of bytes in the UTF-8 output string, not
       including the final null character that terminates the string.

SAMPLES

       The  directory  "examples"  in  the  source distribution contains samples with mappings from Beta Code to
       UTF-8 and vice versa.  The "genesis-1-1.beta" and "genesis-1-1.utf8" files show the Bible  verse  Genesis
       1:1  in  Koine  Greek  (from  the  Septuagint),  Hebrew,  and  Bohairic  Coptic  in  Beta Code and UTF-8,
       respectively.

       The program "test/ublibcheck.c" in the source distribution is a sample program that calls ub_beta2greek,
       ub_beta2coptic, and ub_beta2hebrew to convert the above-mentioned Genesis 1:1 passage.  Each of those
       three functions calls ub_codept2utf8 to produce its UTF-8 output.  Hence this program exercises four
       of the library functions.  Once the package has been installed with "make install", the test program
       can be copied to another directory and compiled separately as a starting point for new software as
       follows:

              cc ublibcheck.c -o ublibcheck -lunibetacode

SEE ALSO

       unibetaprep(1), beta2uni(1), uni2beta(1), unibetacode(5)

AUTHOR

       The unibetacode package was created by Paul Hardy.

LICENSE

       libunibetacode is Copyright © 2020 Paul Hardy.

       This  program  is  free  software;  you  can  redistribute it and/or modify it under the terms of the GNU
       General Public License as published by the Free Software Foundation; either version 2 of the License,  or
       (at your option) any later version.

BUGS

       No known bugs exist.  However, not all corner cases have been tested.

                                                   2020 Apr 11                                 LIBUNIBETACODE(3)