Ubuntu Manpage: unicode - Functions for converting Unicode characters

Provided by: erlang-manpages_16.b.3-dfsg-1ubuntu2.2_all

NAME

       unicode - Functions for converting Unicode characters

DESCRIPTION

       This module contains functions for converting between different character representations. It converts
       between ISO-latin-1 characters and Unicode characters, and it can also convert between different Unicode
       encodings (such as UTF-8, UTF-16 and UTF-32).

       In binaries, the default Unicode encoding in Erlang is UTF-8, which is also the format in which built-in
       functions and libraries in OTP expect to find binary Unicode data. In lists, Unicode data is encoded as
       integers, each integer representing one character and holding the Unicode codepoint of that
       character.
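
       As a minimal sketch of the two internal representations, the module below converts the same two
       characters between a list of codepoints and a UTF-8 binary. The module name unicode_repr_sketch and
       the function demo/0 are hypothetical names chosen for illustration.

```erlang
-module(unicode_repr_sketch).
-export([demo/0]).

%% Shows the same two characters as a list of codepoints and as a UTF-8 binary.
demo() ->
    %% U+00E5 (a with ring above) and U+20AC (euro sign).
    List = [16#E5, 16#20AC],
    %% List -> UTF-8 binary: U+00E5 takes two bytes, U+20AC takes three.
    Bin = unicode:characters_to_binary(List),
    <<16#C3, 16#A5, 16#E2, 16#82, 16#AC>> = Bin,
    %% UTF-8 binary -> list of codepoints again.
    List = unicode:characters_to_list(Bin),
    {List, Bin}.
```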

       Unicode encodings other than integers representing codepoints or UTF-8 in binaries are referred to as
       "external encodings". The ISO-latin-1 encoding, in both binaries and lists, is referred to as
       latin1-encoding.

       It  is  recommended to only use external encodings for communication with external entities where this is
       required. When working inside the Erlang/OTP environment, it is recommended to  keep  binaries  in  UTF-8
       when  representing  Unicode  characters. Latin1 encoding is supported both for backward compatibility and
       for communication with external entities not supporting Unicode character sets.

DATA TYPES

       encoding() = latin1
                  | unicode
                  | utf8
                  | utf16
                  | {utf16, endian()}
                  | utf32
                  | {utf32, endian()}

       endian() = big | little

       unicode_binary() = binary()

              A binary() with characters encoded in the UTF-8 coding standard.

       chardata() = charlist() | unicode_binary()

       charlist() =
           maybe_improper_list(char() | unicode_binary() | charlist(),
                               unicode_binary() | [])

       external_unicode_binary() = binary()

              A binary() with characters coded in a user-specified Unicode encoding other than UTF-8 (that is,
              UTF-16 or UTF-32).

       external_chardata() = external_charlist()
                           | external_unicode_binary()

       external_charlist() =
           maybe_improper_list(char() |
                               external_unicode_binary() |
                               external_charlist(),
                               external_unicode_binary() | [])

       latin1_binary() = binary()

              A binary() with characters coded in ISO-latin-1.

       latin1_char() = byte()

              An integer() representing a valid latin1 character (0-255).

       latin1_chardata() = latin1_charlist() | latin1_binary()

              The same as iodata().

       latin1_charlist() =
           maybe_improper_list(latin1_char() |
                               latin1_binary() |
                               latin1_charlist(),
                               latin1_binary() | [])

              The same as iolist().

EXPORTS

       bom_to_encoding(Bin) -> {Encoding, Length}

              Types:

                 Bin = binary()
                    A binary() such that byte_size(Bin) >= 4.
                 Encoding = latin1
                          | utf8
                          | {utf16, endian()}
                          | {utf32, endian()}
                 Length = integer() >= 0
                 endian() = big | little

              Check  for  a  UTF  byte order mark (BOM) in the beginning of a binary. If the supplied binary Bin
              begins with a valid byte order mark for either UTF-8, UTF-16 or UTF-32, the function  returns  the
              encoding identified along with the length of the BOM in bytes.

              If no BOM is found, the function returns {latin1,0}.
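
              A minimal sketch of how the returned length might be used to drop the BOM before decoding. The
              module name bom_sketch and the helper strip_bom/1 are hypothetical and not part of the unicode
              module.

```erlang
-module(bom_sketch).
-export([strip_bom/1]).

%% Detects a BOM and returns {Encoding, BinaryWithoutBom}.
%% When no BOM is present, Length is 0 and the binary is returned unchanged.
strip_bom(Bin) when is_binary(Bin) ->
    {Encoding, Length} = unicode:bom_to_encoding(Bin),
    <<_Bom:Length/binary, Rest/binary>> = Bin,
    {Encoding, Rest}.
```

              The encoding returned can then be passed as InEncoding to characters_to_list/2 or
              characters_to_binary/3.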

       characters_to_list(Data) -> Result

              Types:

                 Data = latin1_chardata() | chardata() | external_chardata()
                 Result = list()
                        | {error, list(), RestData}
                        | {incomplete, list(), binary()}
                 RestData = latin1_chardata() | chardata() | external_chardata()

              Same as characters_to_list(Data, unicode).

       characters_to_list(Data, InEncoding) -> Result

              Types:

                 Data = latin1_chardata() | chardata() | external_chardata()
                 InEncoding = encoding()
                 Result = list()
                        | {error, list(), RestData}
                        | {incomplete, list(), binary()}
                 RestData = latin1_chardata() | chardata() | external_chardata()

              Converts a possibly deep list of integers and binaries into a list of integers representing
              Unicode characters. The binaries in the input may have characters encoded as latin1 (0 - 255, one
              character per byte), in which case the InEncoding parameter should be given as latin1, or have
              characters encoded as one of the UTF encodings, which is then given as the InEncoding parameter.
              Integers in the list are allowed to be greater than 255 only when InEncoding is one of the UTF
              encodings.

              If InEncoding is latin1, the Data parameter corresponds to the iodata() type, but for unicode, the
              Data  parameter  can  contain integers greater than 255 (Unicode characters beyond the ISO-latin-1
              range), which would make it invalid as iodata().

              The purpose of the function is mainly to be able to convert  combinations  of  Unicode  characters
              into  a pure Unicode string in list representation for further processing. For writing the data to
              an external entity, the reverse function characters_to_binary/3 comes in handy.

              The option unicode is an alias for utf8, as this is the preferred encoding for Unicode  characters
              in  binaries. utf16 is an alias for {utf16,big} and utf32 is an alias for {utf32,big}. The big and
              little atoms denote big or little endian encoding.
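
              The effect of InEncoding and of the aliases can be sketched as follows. The module name
              to_list_sketch and the function demo/0 are hypothetical; each line is a pattern match that fails
              with badmatch if the conversion does not give the stated result.

```erlang
-module(to_list_sketch).
-export([demo/0]).

demo() ->
    %% The bytes C3 A5 are one character (U+00E5) in UTF-8,
    %% but two separate characters when read as latin1.
    Bin = <<16#C3, 16#A5>>,
    [16#E5] = unicode:characters_to_list(Bin, utf8),
    [16#E5] = unicode:characters_to_list(Bin, unicode),
    [16#C3, 16#A5] = unicode:characters_to_list(Bin, latin1),
    %% Deep, mixed lists of integers and binaries are flattened.
    [$a, 16#E5, $b] = unicode:characters_to_list([$a, [Bin], <<"b">>], utf8),
    %% utf16 is an alias for {utf16,big}: the bytes 00 E5 decode to U+00E5.
    [16#E5] = unicode:characters_to_list(<<0, 16#E5>>, utf16),
    [16#E5] = unicode:characters_to_list(<<16#E5, 0>>, {utf16, little}),
    ok.
```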

              If the data cannot be converted, either because of illegal Unicode/latin1 characters in the list
              or because of invalid UTF encoding in any binaries, an error tuple is returned. The error tuple
              contains the tag error, a list representing the characters that could be converted before the
              error occurred, and a representation of the characters including and after the offending
              integer/bytes. The last part is mostly for debugging, as it still constitutes a possibly deep
              and/or mixed list, not necessarily of the same depth as the original data. The error occurs
              when traversing the list, and whatever is left to decode is simply returned as is.

              However, if the input Data is a pure binary, the third part of the error tuple is guaranteed to be
              a binary as well.

              Errors occur for the following reasons:

                * Integers out of range - If InEncoding is latin1, an error occurs whenever an integer greater
                  than 255 is found in the lists. If InEncoding is of a Unicode type, an error occurs whenever
                  an integer is found that is either:

                  * greater than 16#10FFFF (the maximum Unicode character), or

                  * in the range 16#D800 to 16#DFFF (invalid range reserved for UTF-16 surrogate pairs).

                * UTF encoding incorrect - If InEncoding is one of the UTF types, the bytes in any binaries have
                  to be valid in that encoding. Errors can occur for various reasons, including "pure"  decoding
                  errors  (like  the  upper bits of the bytes being wrong), the bytes are decoded to a too large
                  number, the bytes are decoded to a code-point in the invalid Unicode  range,  or  encoding  is
                  "overlong",  meaning  that  a  number  should  have been encoded in fewer bytes. The case of a
                  truncated UTF is handled specially, see the paragraph  about  incomplete  binaries  below.  If
                  InEncoding  is  latin1, binaries are always valid as long as they contain whole bytes, as each
                  byte falls into the valid ISO-latin-1 range.
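
              The error cases listed above can be sketched as follows. The module name error_sketch and the
              function demo/0 are hypothetical. The third element of the error tuple is matched exactly only
              for pure binary input, since its shape for list input is not specified beyond being the remaining
              data.

```erlang
-module(error_sketch).
-export([demo/0]).

demo() ->
    %% The byte 16#FF can never appear in UTF-8, so decoding stops there.
    %% The input is a pure binary, so the third element is a binary as well.
    {error, "ab", <<16#FF, $c>>} =
        unicode:characters_to_list(<<"ab", 16#FF, "c">>, utf8),
    %% A codepoint in the surrogate range is out of range for Unicode input.
    {error, "a", _} = unicode:characters_to_list([$a, 16#D800], unicode),
    %% An integer above 255 is out of range for latin1 input.
    {error, "a", _} = unicode:characters_to_list([$a, 300], latin1),
    %% The same bytes that failed as UTF-8 are valid latin1,
    %% because every byte is in the range 0..255.
    [$a, $b, 16#FF, $c] =
        unicode:characters_to_list(<<"ab", 16#FF, "c">>, latin1),
    ok.
```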

              A special type of error is when no actual invalid integers or bytes  are  found,  but  a  trailing
              binary()  consists  of too few bytes to decode the last character. This error might occur if bytes
              are read from a file in chunks  or  binaries  in  other  ways  are  split  on  non  UTF  character
              boundaries.  In  this case an incomplete tuple is returned instead of the error tuple. It consists
              of the same parts as the error tuple, but the tag is incomplete instead  of  error  and  the  last
              element  is  always guaranteed to be a binary consisting of the first part of a (so far) valid UTF
              character.
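
              A minimal sketch of the incomplete tuple, assuming input that arrives in two chunks split in the
              middle of a three-byte UTF-8 character. The module name incomplete_sketch and the function demo/0
              are hypothetical.

```erlang
-module(incomplete_sketch).
-export([demo/0]).

demo() ->
    %% U+20AC is E2 82 AC in UTF-8; the first chunk ends after two of its bytes.
    Chunk1 = <<"a", 16#E2, 16#82>>,
    Chunk2 = <<16#AC, "b">>,
    {incomplete, "a", Rest} = unicode:characters_to_list(Chunk1, utf8),
    <<16#E2, 16#82>> = Rest,
    %% Putting the leftover bytes in front of the next chunk completes the character.
    [16#20AC, $b] = unicode:characters_to_list([Rest, Chunk2], utf8),
    ok.
```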

              If one UTF character is split over two consecutive binaries in the Data, the conversion succeeds.
              This means that a character can be decoded from a range of binaries as long as the whole range is
              given as input without errors occurring. Example:

                   decode_data(Data) ->
                       case unicode:characters_to_list(Data, unicode) of
                           {incomplete, Encoded, Rest} ->
                               %% Rest holds the start of a truncated character:
                               %% fetch more input and retry with Rest in front.
                               More = get_some_more_data(),
                               Encoded ++ decode_data([Rest, More]);
                           {error, Encoded, Rest} ->
                               handle_error(Encoded, Rest);
                           List ->
                               List
                       end.

              Bit-strings that are not whole bytes are however not allowed, so a UTF character has to  be  split
              along 8-bit boundaries to ever be decoded.

              If  any  parameters are of the wrong type, the list structure is invalid (a number as tail) or the
              binaries do not contain whole bytes (bit-strings), a badarg exception is thrown.

       characters_to_binary(Data) -> Result

              Types:

                 Data = latin1_chardata() | chardata() | external_chardata()
                 Result = binary()
                        | {error, binary(), RestData}
                        | {incomplete, binary(), binary()}
                 RestData = latin1_chardata() | chardata() | external_chardata()

              Same as characters_to_binary(Data, unicode, unicode).

       characters_to_binary(Data, InEncoding) -> Result

              Types:

                 Data = latin1_chardata() | chardata() | external_chardata()
                 InEncoding = encoding()
                 Result = binary()
                        | {error, binary(), RestData}
                        | {incomplete, binary(), binary()}
                 RestData = latin1_chardata() | chardata() | external_chardata()

              Same as characters_to_binary(Data, InEncoding, unicode).

       characters_to_binary(Data, InEncoding, OutEncoding) -> Result

              Types:

                 Data = latin1_chardata() | chardata() | external_chardata()
                 InEncoding = OutEncoding = encoding()
                 Result = binary()
                        | {error, binary(), RestData}
                        | {incomplete, binary(), binary()}
                 RestData = latin1_chardata() | chardata() | external_chardata()

              Behaves as characters_to_list/2, but produces a binary instead of a Unicode list. InEncoding
              defines how input is to be interpreted if binaries are present in the Data, while OutEncoding
              defines in what format output is to be generated.

              The option unicode is an alias for utf8, as this is the preferred encoding for Unicode  characters
              in  binaries. utf16 is an alias for {utf16,big} and utf32 is an alias for {utf32,big}. The big and
              little atoms denote big or little endian encoding.

              Errors and exceptions occur as in characters_to_list/2, but the second element  in  the  error  or
              incomplete tuple will be a binary() and not a list().
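
              A minimal sketch of a few InEncoding/OutEncoding combinations. The module name to_binary_sketch
              and the function demo/0 are hypothetical.

```erlang
-module(to_binary_sketch).
-export([demo/0]).

demo() ->
    %% latin1 bytes -> UTF-8 binary (OutEncoding defaults to unicode).
    <<16#C3, 16#A5>> = unicode:characters_to_binary(<<16#E5>>, latin1),
    %% UTF-8 -> latin1 for a character that fits in one byte.
    <<16#E5>> = unicode:characters_to_binary(<<16#C3, 16#A5>>, utf8, latin1),
    %% Codepoint list -> UTF-16; utf16 alone means big endian.
    <<16#20, 16#AC>> = unicode:characters_to_binary([16#20AC], unicode, utf16),
    <<16#AC, 16#20>> =
        unicode:characters_to_binary([16#20AC], unicode, {utf16, little}),
    %% A truncated UTF-8 character gives an incomplete tuple whose
    %% second element is a binary rather than a list.
    {incomplete, <<"a">>, <<16#E2>>} =
        unicode:characters_to_binary(<<"a", 16#E2>>),
    ok.
```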

       encoding_to_bom(InEncoding) -> Bin

              Types:

                 Bin = binary()
                    A binary() such that byte_size(Bin) >= 0.
                 InEncoding = encoding()

              Create  a  UTF  byte  order  mark  (BOM)  as a binary from the supplied InEncoding. The BOM is, if
              supported at all, expected to be placed first in UTF encoded files or messages.

              The function returns <<>> for the latin1 encoding as there is no BOM for ISO-latin-1.

              Note that the BOM for UTF-8 is seldom used, and that it is not really a byte order mark. There
              are no byte order issues with UTF-8, so the BOM is only there to differentiate UTF-8 encoding
              from other UTF formats.
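
              A minimal sketch of writing a BOM in front of encoded data. The module name write_bom_sketch and
              the helper with_bom/2 are hypothetical and not part of the unicode module.

```erlang
-module(write_bom_sketch).
-export([with_bom/2]).

%% Encodes Chars in Encoding and puts the matching BOM in front.
%% Error and incomplete tuples from the conversion are passed through unchanged.
with_bom(Chars, Encoding) ->
    case unicode:characters_to_binary(Chars, unicode, Encoding) of
        Body when is_binary(Body) ->
            Bom = unicode:encoding_to_bom(Encoding),
            {ok, <<Bom/binary, Body/binary>>};
        Other ->
            Other
    end.
```

              For latin1 the BOM is the empty binary, so the result is just the encoded data.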