Provided by: tcl8.4-doc_8.4.20-7_all bug

NAME

       Tcl_GetEncoding, Tcl_FreeEncoding, Tcl_ExternalToUtfDString, Tcl_ExternalToUtf, Tcl_UtfToExternalDString,
       Tcl_UtfToExternal,  Tcl_WinTCharToUtf,  Tcl_WinUtfToTChar,  Tcl_GetEncodingName,   Tcl_SetSystemEncoding,
       Tcl_GetEncodingNames,    Tcl_CreateEncoding,   Tcl_GetDefaultEncodingDir,   Tcl_SetDefaultEncodingDir   -
       procedures for creating and using encodings.

SYNOPSIS

       #include <tcl.h>

       Tcl_Encoding
       Tcl_GetEncoding(interp, name)

       void
       Tcl_FreeEncoding(encoding)

       char *
       Tcl_ExternalToUtfDString(encoding, src, srcLen, dstPtr)

       int
       Tcl_ExternalToUtf(interp, encoding, src, srcLen, flags, statePtr, dst, dstLen, srcReadPtr, dstWrotePtr,
            dstCharsPtr)

       char *
       Tcl_UtfToExternalDString(encoding, src, srcLen, dstPtr)

       int
       Tcl_UtfToExternal(interp, encoding, src, srcLen, flags, statePtr, dst, dstLen, srcReadPtr, dstWrotePtr,
            dstCharsPtr)

       char *
       Tcl_WinTCharToUtf(tsrc, srcLen, dstPtr)

       TCHAR *
       Tcl_WinUtfToTChar(src, srcLen, dstPtr)

       CONST char *
       Tcl_GetEncodingName(encoding)

       int
       Tcl_SetSystemEncoding(interp, name)

       void
       Tcl_GetEncodingNames(interp)

       Tcl_Encoding
       Tcl_CreateEncoding(typePtr)

       CONST char *
       Tcl_GetDefaultEncodingDir(void)

       void
       Tcl_SetDefaultEncodingDir(path)

ARGUMENTS

       Tcl_Interp          *interp        (in)      Interpreter to use for error reporting, or NULL if no  error
                                                    reporting is desired.

       CONST char          *name          (in)      Name of encoding to load.

       Tcl_Encoding        encoding       (in)      The encoding to query, free, or use for converting text.  If
                                                    encoding is NULL, the current system encoding is used.

       CONST char          *src           (in)      For the Tcl_ExternalToUtf functions, an array  of  bytes  in
                                                    the  specified  encoding  that are to be converted to UTF-8.
                                                    For the Tcl_UtfToExternal and  Tcl_WinUtfToTChar  functions,
                                                    an  array  of  UTF-8  characters  to  be  converted  to  the
                                                    specified encoding.

       CONST TCHAR         *tsrc          (in)      An array of Windows TCHAR characters to convert to UTF-8.

       int                 srcLen         (in)      Length of src or tsrc in bytes.  If the length is  negative,
                                                    the encoding-specific length of the string is used.

       Tcl_DString         *dstPtr        (out)     Pointer to an uninitialized or free Tcl_DString in which the
                                                    converted result will be stored.

       int                 flags          (in)      Various  flag  bits  OR-ed   together.    TCL_ENCODING_START
                                                    signifies  that  the  source  buffer is the first block in a
                                                    (potentially  multi-block)   input   stream,   telling   the
                                                    conversion  routine to reset to an initial state and perform
                                                    any initialization that needs to occur before the first byte
                                                    is  converted.   TCL_ENCODING_END  signifies that the source
                                                    buffer is the last  block  in  a  (potentially  multi-block)
                                                    input  stream, telling the conversion routine to perform any
                                                    finalization that needs to occur  after  the  last  byte  is
                                                    converted   and   then   to   reset  to  an  initial  state.
                                                    TCL_ENCODING_STOPONERROR  signifies  that   the   conversion
                                                    routine  should  return  immediately  upon  reading a source
                                                    character  that  doesn't  exist  in  the  target   encoding;
                                                    otherwise a default fallback character will automatically be
                                                    substituted.

       Tcl_EncodingState   *statePtr      (in/out)  Used when converting a (generally long or indefinite length)
                                                    byte  stream  in  a  piece by piece fashion.  The conversion
                                                    routine stores its current state in *statePtr after src (the
                                                    buffer  containing  the  current  piece) has been converted;
                                                    that state information must be passed back  when  converting
                                                    the next piece of the stream so the conversion routine knows
                                                    what state it was in when it left off at the end of the last
                                                    piece.   May  be NULL, in which case the value specified for
                                                    flags is ignored and the source buffer is assumed to contain
                                                    the complete string to convert.

       char                *dst           (out)     Buffer  in  which  the  converted result will be stored.  No
                                                    more than dstLen bytes will be stored in dst.

       int                 dstLen         (in)      The maximum length of the output buffer dst in bytes.

       int                 *srcReadPtr    (out)     Filled with the number of bytes from src that were  actually
                                                    converted.  This may be less than the original source length
                                                    if there was a problem converting  some  source  characters.
                                                    May be NULL.

       int                 *dstWrotePtr   (out)     Filled with the number of bytes that were actually stored in
                                                    the output buffer as a result of  the  conversion.   May  be
                                                    NULL.

       int                 *dstCharsPtr   (out)     Filled  with the number of characters that correspond to the
                                                    number of bytes stored in the output buffer.  May be NULL.

       Tcl_EncodingType    *typePtr       (in)      Structure that defines a new type of encoding.

       CONST char          *path          (in)      A path to the location of the encoding file.
_________________________________________________________________

INTRODUCTION

       These  routines  convert  between  Tcl's  internal  character  representation,   UTF-8,   and   character
       representations  used by various operating systems or file systems, such as Unicode, ASCII, or Shift-JIS.
       When operating on strings, such as such as obtaining the names of files or  displaying  characters  using
       international  fonts,  the  strings  must  be  translated  into one or possibly multiple formats that the
       various system calls can use.  For instance, on a Japanese  Unix  workstation,  a  user  might  obtain  a
       filename  represented  in the EUC-JP file encoding and then translate the characters to the jisx0208 font
       encoding in order to display the filename in a Tk widget.  The purpose of the encoding package is to help
       bridge the translation gap.  UTF-8 provides an intermediate staging ground for all the various encodings.
       In the example above, text would be translated into UTF-8  from  whatever  file  encoding  the  operating
       system is using.  Then it would be translated from UTF-8 into whatever font encoding the display routines
       require.

       Some basic encodings are compiled into Tcl.  Others can be defined by the user or dynamically loaded from
       encoding files in a platform-independent manner.

DESCRIPTION

       Tcl_GetEncoding  finds an encoding given its name.  The name may refer to a builtin Tcl encoding, a user-
       defined encoding registered by calling Tcl_CreateEncoding, or a dynamically-loadable encoding file.   The
       return  value  is  a token that represents the encoding and can be used in subsequent calls to procedures
       such as Tcl_GetEncodingName, Tcl_FreeEncoding, and Tcl_UtfToExternal.  If the name did not refer  to  any
       known or loadable encoding, NULL is returned and an error message is returned in interp.

       The  encoding  package  maintains  a  database of all encodings currently in use.  The first time name is
       seen, Tcl_GetEncoding returns an encoding with a reference count of 1.  If the  same  name  is  requested
       further  times,  then  the  reference  count  for  that  encoding  is incremented without the overhead of
       allocating a new encoding and all its associated data structures.

       When an encoding is no longer needed, Tcl_FreeEncoding should be called to release it.  When an  encoding
       is  no  longer  in  use  anywhere  (i.e.,  it  has  been  freed  as  many  times  as  it has been gotten)
       Tcl_FreeEncoding will release all storage the encoding was using and delete it from the database.

       Tcl_ExternalToUtfDString converts a source buffer src  from  the  specified  encoding  into  UTF-8.   The
       converted  bytes  are stored in dstPtr, which is then null-terminated.  The caller should eventually call
       Tcl_DStringFree to free any information stored in dstPtr.  When converting, if any of the  characters  in
       the  source  buffer  cannot  be  represented in the target encoding, a default fallback character will be
       used.  The return value is a pointer to the value stored in the DString.

       Tcl_ExternalToUtf converts a source buffer src from the specified encoding  into  UTF-8.   Up  to  srcLen
       bytes  are  converted  from the source buffer and up to dstLen converted bytes are stored in dst.  In all
       cases, *srcReadPtr is filled with the number of bytes that  were  successfully  converted  from  src  and
       *dstWrotePtr  is filled with the corresponding number of bytes that were stored in dst.  The return value
       is one of the following:

              TCL_OK                       All bytes of src were converted.

              TCL_CONVERT_NOSPACE          The destination buffer was not large enough for all of the  converted
                                           data; as many characters as could fit were converted though.

              TCL_CONVERT_MULTIBYTE        The  last  fews  bytes  in  the source buffer were the beginning of a
                                           multibyte sequence, but more  bytes  were  needed  to  complete  this
                                           sequence.   A subsequent call to the conversion routine should pass a
                                           buffer containing the unconverted bytes that  remained  in  src  plus
                                           some  further  bytes  from  the source stream to properly convert the
                                           formerly split-up multibyte sequence.

              TCL_CONVERT_SYNTAX           The source buffer contained an invalid character sequence.  This  may
                                           occur  if  the input stream has been damaged or if the input encoding
                                           method was misidentified.

              TCL_CONVERT_UNKNOWN          The source buffer contained a character that could not be represented
                                           in the target encoding and TCL_ENCODING_STOPONERROR was specified.

       Tcl_UtfToExternalDString  converts  a  source  buffer  src  from  UTF-8 into the specified encoding.  The
       converted bytes are stored in dstPtr, which is then terminated  with  the  appropriate  encoding-specific
       null.   The caller should eventually call Tcl_DStringFree to free any information stored in dstPtr.  When
       converting, if any of the characters in the source buffer cannot be represented in the target encoding, a
       default  fallback  character  will  be  used.   The  return value is a pointer to the value stored in the
       DString.

       Tcl_UtfToExternal converts a source buffer src from UTF-8 into the  specified  encoding.   Up  to  srcLen
       bytes  are  converted  from the source buffer and up to dstLen converted bytes are stored in dst.  In all
       cases, *srcReadPtr is filled with the number of bytes that  were  successfully  converted  from  src  and
       *dstWrotePtr is filled with the corresponding number of bytes that were stored in dst.  The return values
       are the same as the return values for Tcl_ExternalToUtf.

       Tcl_WinUtfToTChar and Tcl_WinTCharToUtf are Windows-only convenience  functions  for  converting  between
       UTF-8 and Windows strings.  On Windows 95 (as with the Macintosh and Unix operating systems), all strings
       exchanged between Tcl and the operating system are "char" based.  On Windows NT, some  strings  exchanged
       between  Tcl and the operating system are "char" oriented while others are in Unicode.  By convention, in
       Windows a TCHAR is a character in the ANSI code page on Windows 95 and a Unicode character on Windows NT.

       If you planned to use the same "char" based interfaces on both Windows 95 and Windows NT, you  could  use
       Tcl_UtfToExternal  and Tcl_ExternalToUtf (or their Tcl_DString equivalents) with an encoding of NULL (the
       current system encoding).  On the other hand, if you planned to use the Unicode interface when running on
       Windows  NT and the "char" interfaces when running on Windows 95, you would have to perform the following
       type of test over and over in your program (as represented in pseudo-code):
              if (running NT) {
                  encoding <- Tcl_GetEncoding("unicode");
                  nativeBuffer <- Tcl_UtfToExternal(encoding, utfBuffer);
                  Tcl_FreeEncoding(encoding);
              } else {
                  nativeBuffer <- Tcl_UtfToExternal(NULL, utfBuffer);
       Tcl_WinUtfToTChar and Tcl_WinTCharToUtf automatically handle this test and use the proper encoding  based
       on  the  current  operating  system.   Tcl_WinUtfToTChar  returns  a  pointer  to  a  TCHAR  string,  and
       Tcl_WinTCharToUtf expects a TCHAR string pointer as the src string.  Otherwise,  these  functions  behave
       identically to Tcl_UtfToExternalDString and Tcl_ExternalToUtfDString.

       Tcl_GetEncodingName  is  roughly  the inverse of Tcl_GetEncoding.  Given an encoding, the return value is
       the name argument that was used to create the encoding.  The string returned  by  Tcl_GetEncodingName  is
       only guaranteed to persist until the encoding is deleted.  The caller must not modify this string.

       Tcl_SetSystemEncoding sets the default encoding that should be used whenever the user passes a NULL value
       for the encoding argument to any of the other encoding functions.  If name is NULL, the  system  encoding
       is  reset  to  the  default  system encoding, binary.  If the name did not refer to any known or loadable
       encoding, TCL_ERROR is returned and an error message  is  left  in  interp.   Otherwise,  this  procedure
       increments  the  reference  count  of  the new system encoding, decrements the reference count of the old
       system encoding, and returns TCL_OK.

       Tcl_GetEncodingNames sets the interp result to a list consisting of the names of all the  encodings  that
       are  currently  defined  or  can  be  dynamically  loaded,  searching  the  encoding  path  specified  by
       Tcl_SetDefaultEncodingDir.  This procedure does not ensure that the dynamically-loadable  encoding  files
       contain valid data, but merely that they exist.

       Tcl_CreateEncoding  defines a new encoding and registers the C procedures that are called back to convert
       between the encoding and UTF-8.  Encodings created by Tcl_CreateEncoding are thereafter  visible  in  the
       database  used  by  Tcl_GetEncoding.   Just  as with the Tcl_GetEncoding procedure, the return value is a
       token that represents the encoding and can be used in  subsequent  calls  to  other  encoding  functions.
       Tcl_CreateEncoding  returns  an  encoding  with a reference count of 1. If an encoding with the specified
       name already exists, then its entry in the database is replaced with the new encoding; the token for  the
       old encoding will remain valid and continue to behave as before, but users of the new token will now call
       the new encoding procedures.

       The typePtr argument to Tcl_CreateEncoding contains information about the name of the  encoding  and  the
       procedures that will be called to convert between this encoding and UTF-8.  It is defined as follows:

              typedef struct Tcl_EncodingType {
                CONST char *encodingName;
                Tcl_EncodingConvertProc *toUtfProc;
                Tcl_EncodingConvertProc *fromUtfProc;
                Tcl_EncodingFreeProc *freeProc;
                ClientData clientData;
                int nullSize;
              } Tcl_EncodingType;

       The encodingName provides a string name for the encoding, by which it can be referred in other procedures
       such as Tcl_GetEncoding.  The toUtfProc refers to a callback procedure to invoke  to  convert  text  from
       this  encoding into UTF-8.  The fromUtfProc refers to a callback procedure to invoke to convert text from
       UTF-8 into this encoding.  The freeProc refers to a callback procedure to invoke when  this  encoding  is
       deleted.   The freeProc field may be NULL.  The clientData contains an arbitrary one-word value passed to
       toUtfProc, fromUtfProc, and freeProc whenever they are called.  Typically, this is a pointer  to  a  data
       structure  containing  encoding-specific  information  that  can be used by the callback procedures.  For
       instance, two very similar encodings such as ascii and macRoman may use the same callback procedure,  but
       use  different  values  of clientData to control its behavior.  The nullSize specifies the number of zero
       bytes that signify end-of-string in this encoding.  It must be 1 (for single-byte or multi-byte encodings
       like  ASCII or Shift-JIS) or 2 (for double-byte encodings like Unicode).  Constant-sized encodings with 3
       or more bytes per character (such as CNS11643) are not accepted.

       The callback procedures toUtfProc and fromUtfProc should match the type Tcl_EncodingConvertProc:

              typedef int Tcl_EncodingConvertProc(
                ClientData clientData,
                CONST char *src,
                int srcLen,
                int flags,
                Tcl_Encoding *statePtr,
                char *dst,
                int dstLen,
                int *srcReadPtr,
                int *dstWrotePtr,
                int *dstCharsPtr);

       The toUtfProc and fromUtfProc procedures are called by the Tcl_ExternalToUtf or Tcl_UtfToExternal  family
       of  functions to perform the actual conversion.  The clientData parameter to these procedures is the same
       as the clientData field specified to Tcl_CreateEncoding when the encoding  was  created.   The  remaining
       arguments  to  the  callback  procedures  are  the  same  as  the  arguments,  documented  at the top, to
       Tcl_ExternalToUtf or Tcl_UtfToExternal, with the following exceptions.  If the srcLen argument to one  of
       those  high-level  functions  is  negative,  the  value  passed  to  the  callback  procedure will be the
       appropriate encoding-specific  string  length  of  src.   If  any  of  the  srcReadPtr,  dstWrotePtr,  or
       dstCharsPtr  arguments  to one of the high-level functions is NULL, the corresponding value passed to the
       callback procedure will be a non-NULL location.

       The callback procedure freeProc, if non-NULL, should match the type Tcl_EncodingFreeProc:
              typedef void Tcl_EncodingFreeProc(
                ClientData clientData);

       This freeProc function is called when the encoding is deleted.  The clientData parameter is the  same  as
       the clientData field specified to Tcl_CreateEncoding when the encoding was created.

       Tcl_GetDefaultEncodingDir and Tcl_SetDefaultEncodingDir access and set the directory to use when locating
       the default encoding files.  If this value is not NULL, the TclpInitLibraryPath routine appends the  path
       to  the head of the search path, and uses this path as the first place to look into when trying to locate
       the encoding file.

ENCODING FILES

       Space would prohibit precompiling into Tcl every possible  encoding  algorithm,  so  many  encodings  are
       stored  on  disk  as  dynamically-loadable  encoding files.  This behavior also allows the user to create
       additional encoding files that can be loaded using the same  mechanism.   These  encoding  files  contain
       information  about  the  tables  and/or  escape  sequences  used  to map between an external encoding and
       Unicode.  The external encoding may consist of single-byte, multi-byte, or double-byte characters.

       Each dynamically-loadable encoding is represented as  a  text  file.   The  initial  line  of  the  file,
       beginning  with a ``#'' symbol, is a comment that provides a human-readable description of the file.  The
       next line identifies the type of encoding file.  It can be one of the following letters:

       [1]   S
              A single-byte encoding, where one character is always one byte long in the encoding.   An  example
              is iso8859-1, used by many European languages.

       [2]   D
              A  double-byte encoding, where one character is always two bytes long in the encoding.  An example
              is big5, used for Chinese text.

       [3]   M
              A multi-byte encoding, where one character may be either one or two bytes long.  Certain bytes are
              a  lead  bytes, indicating that another byte must follow and that together the two bytes represent
              one character.  Other bytes are not lead bytes and represent themselves.  An example is  shiftjis,
              used by many Japanese computers.

       [4]   E
              An  escape-sequence  encoding,  specifying  that  certain  sequences  of  bytes  do  not represent
              characters, but commands that describe how following bytes should be interpreted.

       The rest of the lines in the file depend on the type.

       Cases [1], [2], and [3] are collectively referred to as table-based  encoding  files.   The  lines  in  a
       table-based  encoding  file are in the same format as this example taken from the shiftjis encoding (this
       is not the complete file):
              # Encoding file: shiftjis, multi-byte
              M
              003F 0 40
              00
              0000000100020003000400050006000700080009000A000B000C000D000E000F
              0010001100120013001400150016001700180019001A001B001C001D001E001F
              0020002100220023002400250026002700280029002A002B002C002D002E002F
              0030003100320033003400350036003700380039003A003B003C003D003E003F
              0040004100420043004400450046004700480049004A004B004C004D004E004F
              0050005100520053005400550056005700580059005A005B005C005D005E005F
              0060006100620063006400650066006700680069006A006B006C006D006E006F
              0070007100720073007400750076007700780079007A007B007C007D203E007F
              0080000000000000000000000000000000000000000000000000000000000000
              0000000000000000000000000000000000000000000000000000000000000000
              0000FF61FF62FF63FF64FF65FF66FF67FF68FF69FF6AFF6BFF6CFF6DFF6EFF6F
              FF70FF71FF72FF73FF74FF75FF76FF77FF78FF79FF7AFF7BFF7CFF7DFF7EFF7F
              FF80FF81FF82FF83FF84FF85FF86FF87FF88FF89FF8AFF8BFF8CFF8DFF8EFF8F
              FF90FF91FF92FF93FF94FF95FF96FF97FF98FF99FF9AFF9BFF9CFF9DFF9EFF9F
              0000000000000000000000000000000000000000000000000000000000000000
              0000000000000000000000000000000000000000000000000000000000000000
              81
              0000000000000000000000000000000000000000000000000000000000000000
              0000000000000000000000000000000000000000000000000000000000000000
              0000000000000000000000000000000000000000000000000000000000000000
              0000000000000000000000000000000000000000000000000000000000000000
              300030013002FF0CFF0E30FBFF1AFF1BFF1FFF01309B309C00B4FF4000A8FF3E
              FFE3FF3F30FD30FE309D309E30034EDD30053006300730FC20152010FF0F005C
              301C2016FF5C2026202520182019201C201DFF08FF0930143015FF3BFF3DFF5B
              FF5D30083009300A300B300C300D300E300F30103011FF0B221200B100D70000
              00F7FF1D2260FF1CFF1E22662267221E22342642264000B0203220332103FFE5
              FF0400A200A3FF05FF03FF06FF0AFF2000A72606260525CB25CF25CE25C725C6
              25A125A025B325B225BD25BC203B301221922190219121933013000000000000
              000000000000000000000000000000002208220B2286228722822283222A2229
              000000000000000000000000000000002227222800AC21D221D4220022030000
              0000000000000000000000000000000000000000222022A52312220222072261
              2252226A226B221A223D221D2235222B222C0000000000000000000000000000
              212B2030266F266D266A2020202100B6000000000000000025EF000000000000

       The third line of the file is three numbers.  The first number is the fallback character (in base 16)  to
       use  when  converting  from UTF-8 to this encoding.  The second number is a 1 if this file represents the
       encoding for a symbol font, or 0 otherwise.  The last number (in base 10)  is  how  many  pages  of  data
       follow.

       Subsequent  lines  in  the example above are pages that describe how to map from the encoding into 2-byte
       Unicode.  The first line in a page identifies the page number.  Following it are 256 double-byte numbers,
       arranged as 16 rows of 16 numbers.  Given a character in the encoding, the high byte of that character is
       used to select which page, and the low byte of that character is used as an index to select  one  of  the
       double-byte  numbers  in  that  page  - the value obtained being the corresponding Unicode character.  By
       examination of the example above, one can see that the characters 0x7E and 0x8163 in shiftjis map to 203E
       and 2026 in Unicode, respectively.

       Following  the  first  page will be all the other pages, each in the same format as the first: one number
       identifying the page followed by 256 double-byte Unicode characters.  If a character in the encoding maps
       to  the Unicode character 0000, it means that the character doesn't actually exist.  If all characters on
       a page would map to 0000, that page can be omitted.

       Case [4] is the escape-sequence encoding file.  The lines in an this type of file are in the same  format
       as this example taken from the iso2022-jp encoding:
              # Encoding file: iso2022-jp, escape-driven
              E
              init           {}
              final          {}
              iso8859-1      \x1b(B
              jis0201        \x1b(J
              jis0208        \x1b$@
              jis0208        \x1b$B
              jis0212        \x1b$(D
              gb2312         \x1b$A
              ksc5601        \x1b$(C

       In  the  file, the first column represents an option and the second column is the associated value.  init
       is a string to emit or expect before the first character is converted, while final is a string to emit or
       expect  after  the  last character.  All other options are names of table-based encodings; the associated
       value is the escape-sequence that marks that encoding.  Tcl syntax is used for the values; in  the  above
       example, for instance, ``{}'' represents the empty string and ``\x1b'' represents character 27.

       When  Tcl_GetEncoding  encounters  an  encoding  name  that  has  not been loaded, it attempts to load an
       encoding file called name.enc from the encoding subdirectory of each directory specified in  the  library
       path  $tcl_libPath.   If  the  encoding  file  exists, but is malformed, an error message will be left in
       interp.

KEYWORDS

       utf, encoding, convert