Provided by: erlang-manpages_16.b.3-dfsg-1ubuntu2.2_all

NAME

       unicode - Functions for converting Unicode characters

DESCRIPTION

       This module contains functions for converting between different character representations. It converts
       between ISO-latin-1 characters and Unicode characters, and it can also convert between different
       Unicode encodings (such as UTF-8, UTF-16 and UTF-32).

       In binaries, the default Unicode encoding in Erlang is UTF-8, which is also the format in which built-in
       functions and libraries in OTP expect to find binary Unicode data. In lists, Unicode data is encoded as
       integers, each integer representing one character and holding the Unicode codepoint for that
       character.

       Unicode encodings other than integers representing codepoints in lists or UTF-8 in binaries are referred
       to as "external encodings". The ISO-latin-1 encoding, in both binaries and lists, is referred to as
       latin1 encoding.

       It  is  recommended to only use external encodings for communication with external entities where this is
       required. When working inside the Erlang/OTP environment, it is recommended to  keep  binaries  in  UTF-8
       when  representing  Unicode  characters. Latin1 encoding is supported both for backward compatibility and
       for communication with external entities not supporting Unicode character sets.
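
       The representations described above can be compared side by side. The following is a minimal sketch,
       not part of the module; the module name unicode_repr and the function demo/0 are made up for
       illustration. It encodes the characters U+00E5 and U+20AC as a list of codepoints, as a UTF-8 binary,
       as a UTF-16 binary (an external encoding), and, for U+00E5 only, as a latin1 binary.

```erlang
-module(unicode_repr).
-export([demo/0]).

%% Hypothetical helper: returns the same text in several representations.
demo() ->
    %% List form: one integer (the codepoint) per character.
    Codepoints = [16#E5, 16#20AC],
    %% Default binary form inside Erlang/OTP: UTF-8.
    Utf8 = unicode:characters_to_binary(Codepoints),
    %% An "external encoding": UTF-16 (big endian when no endianness is given).
    Utf16 = unicode:characters_to_binary(Codepoints, unicode, utf16),
    %% Only U+00E5 fits in the latin1 range (0-255), so encode that one alone.
    Latin1 = unicode:characters_to_binary([16#E5], unicode, latin1),
    {Codepoints, Utf8, Utf16, Latin1}.
```

       The UTF-8 binary uses two bytes for U+00E5 and three for U+20AC, while the latin1 binary needs a
       single byte for U+00E5 and cannot represent U+20AC at all.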

DATA TYPES

       encoding() = latin1
                  | unicode
                  | utf8
                  | utf16
                  | {utf16, endian()}
                  | utf32
                  | {utf32, endian()}

       endian() = big | little

       unicode_binary() = binary()

              A binary() with characters encoded in the UTF-8 coding standard.

       chardata() = charlist() | unicode_binary()

       charlist() =
           maybe_improper_list(char() | unicode_binary() | charlist(),
                               unicode_binary() | [])

       external_unicode_binary() = binary()

              A binary() with characters coded in a user specified Unicode encoding other than UTF-8 (UTF-16  or
              UTF-32).

       external_chardata() = external_charlist()
                           | external_unicode_binary()

       external_charlist() =
           maybe_improper_list(char() |
                               external_unicode_binary() |
                               external_charlist(),
                               external_unicode_binary() | [])

       latin1_binary() = binary()

              A binary() with characters coded in ISO-latin-1.

       latin1_char() = byte()

              An integer() representing a valid latin1 character (0-255).

       latin1_chardata() = latin1_charlist() | latin1_binary()

              The same as iodata().

       latin1_charlist() =
           maybe_improper_list(latin1_char() |
                               latin1_binary() |
                               latin1_charlist(),
                               latin1_binary() | [])

              The same as iolist().
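
       As an illustration of the chardata() and charlist() types, the sketch below builds a nested list that
       mixes integers, UTF-8 binaries and sublists, and ends in a binary tail (an improper list). The module
       name chardata_demo and the function flatten/0 are hypothetical names chosen for this example.

```erlang
-module(chardata_demo).
-export([flatten/0]).

%% Hypothetical helper: converts mixed, nested chardata() to a flat list.
flatten() ->
    %% $h is an integer, <<"ej">> and <<195,165>> are UTF-8 binaries,
    %% [32] is a nested list, and <<"!">> is the (improper) tail.
    Data = [$h, [<<"ej">>, [32]], <<195,165>> | <<"!">>],
    unicode:characters_to_list(Data).
```

       The result is a flat list of codepoints, with the two bytes 195,165 decoded to the single character
       U+00E5 (229).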

EXPORTS

       bom_to_encoding(Bin) -> {Encoding, Length}

              Types:

                 Bin = binary()
                    A binary() such that byte_size(Bin) >= 4.
                 Encoding = latin1
                          | utf8
                          | {utf16, endian()}
                          | {utf32, endian()}
                 Length = integer() >= 0
                 endian() = big | little

              Checks for a UTF byte order mark (BOM) at the beginning of a binary. If the supplied binary Bin
              begins with a valid byte order mark for either UTF-8, UTF-16 or UTF-32, the function returns the
              encoding identified along with the length of the BOM in bytes.

              If no BOM is found, the function returns {latin1,0}.
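
              One possible use of the returned length is to strip the BOM before decoding. The following is a
              minimal sketch; the module name bom_demo and the function strip_bom/1 are hypothetical and not
              part of this module.

```erlang
-module(bom_demo).
-export([strip_bom/1]).

%% Hypothetical helper: detects a BOM and returns {Encoding, Payload},
%% where Payload is the input with the BOM bytes removed.
strip_bom(Bin) when is_binary(Bin) ->
    {Encoding, Length} = unicode:bom_to_encoding(Bin),
    %% Length is 0 when no BOM was found, so Payload is then all of Bin.
    <<_Bom:Length/binary, Payload/binary>> = Bin,
    {Encoding, Payload}.
```

              For example, bom_demo:strip_bom(<<16#EF,16#BB,16#BF,"abc">>) gives {utf8,<<"abc">>}, and a
              binary without a BOM comes back unchanged with the encoding latin1.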

       characters_to_list(Data) -> Result

              Types:

                 Data = latin1_chardata() | chardata() | external_chardata()
                 Result = list()
                        | {error, list(), RestData}
                        | {incomplete, list(), binary()}
                 RestData = latin1_chardata() | chardata() | external_chardata()

              Same as characters_to_list(Data, unicode).

       characters_to_list(Data, InEncoding) -> Result

              Types:

                 Data = latin1_chardata() | chardata() | external_chardata()
                 InEncoding = encoding()
                 Result = list()
                        | {error, list(), RestData}
                        | {incomplete, list(), binary()}
                 RestData = latin1_chardata() | chardata() | external_chardata()

              Converts a possibly deep list of integers and binaries into a list of integers representing
              Unicode characters. The binaries in the input may have characters encoded as latin1 (0 - 255, one
              character per byte), in which case the InEncoding parameter should be given as latin1, or have
              characters encoded as one of the UTF encodings, which is then given as the InEncoding parameter.
              Integers in the list are allowed to be greater than 255 only when InEncoding is one of the UTF
              encodings.

              If InEncoding is latin1, the Data parameter corresponds to the iodata() type, but for unicode, the
              Data parameter can contain integers greater than 255 (Unicode characters  beyond  the  ISO-latin-1
              range), which would make it invalid as iodata().
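
              The effect of InEncoding on how binaries are interpreted can be sketched as follows. The module
              name inenc_demo and its functions are hypothetical names used only for this example.

```erlang
-module(inenc_demo).
-export([same_bytes/0, wide_integer/0]).

%% Hypothetical helper: decodes the same two bytes under two encodings.
same_bytes() ->
    Bytes = <<195,165>>,
    %% As latin1, every byte is one character: two characters result.
    AsLatin1 = unicode:characters_to_list(Bytes, latin1),
    %% As UTF-8, the two bytes form one character, U+00E5.
    AsUtf8 = unicode:characters_to_list(Bytes, utf8),
    {AsLatin1, AsUtf8}.

%% Hypothetical helper: with a Unicode InEncoding, the list may contain
%% an integer above 255 (here U+20AC), which is not valid iodata().
wide_integer() ->
    unicode:characters_to_list([<<"price: ">>, 16#20AC], unicode).
```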

              The  purpose  of  the  function is mainly to be able to convert combinations of Unicode characters
              into a pure Unicode string in list representation for further processing. For writing the data  to
              an external entity, the reverse function characters_to_binary/3 comes in handy.

              The  option unicode is an alias for utf8, as this is the preferred encoding for Unicode characters
              in binaries. utf16 is an alias for {utf16,big} and utf32 is an alias for {utf32,big}. The big  and
              little atoms denote big or little endian encoding.
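
              The aliases can be checked with a small sketch that decodes the letter A (codepoint 65) from
              binaries in several encodings. The module name alias_demo and the function decode_a/0 are
              hypothetical.

```erlang
-module(alias_demo).
-export([decode_a/0]).

%% Hypothetical helper: every element of the returned list is the
%% string "A", decoded from a differently encoded binary.
decode_a() ->
    [unicode:characters_to_list(<<65>>, unicode),             % unicode means utf8
     unicode:characters_to_list(<<65>>, utf8),
     unicode:characters_to_list(<<0,65>>, utf16),             % utf16 means {utf16,big}
     unicode:characters_to_list(<<0,65>>, {utf16,big}),
     unicode:characters_to_list(<<65,0>>, {utf16,little}),
     unicode:characters_to_list(<<0,0,0,65>>, utf32),         % utf32 means {utf32,big}
     unicode:characters_to_list(<<65,0,0,0>>, {utf32,little})].
```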

              If the data cannot be converted, either because of illegal Unicode/latin1 characters in the list
              or because of invalid UTF encoding in any binaries, an error tuple is returned. The error tuple
              contains the tag error, a list representing the characters that could be converted before the
              error occurred, and a representation of the characters including and after the offending
              integer/bytes. The last part is mostly for debugging, as it still constitutes a possibly deep
              and/or mixed list, not necessarily of the same depth as the original data. The error occurs
              when traversing the list, and whatever is left to decode is returned as is.

              However, if the input Data is a pure binary, the third part of the error tuple is guaranteed to be
              a binary as well.

              Errors occur for the following reasons:

                * Integers out of range - If InEncoding is latin1, an error occurs whenever an integer greater
                  than 255 is found in the lists. If InEncoding is of a Unicode type, an error occurs whenever
                  an integer is found that is either:

                  * greater than 16#10FFFF (the maximum Unicode character), or

                  * in the range 16#D800 to 16#DFFF (invalid range reserved for UTF-16 surrogate pairs).

                * UTF encoding incorrect - If InEncoding is one of the UTF types, the bytes in any binaries have
                  to be valid in that encoding. Errors can occur for various reasons, including "pure" decoding
                  errors (like the upper bits of the bytes being wrong), the bytes decoding to a number that is
                  too large, the bytes decoding to a code point in the invalid Unicode range, or the encoding
                  being "overlong", meaning that a number should have been encoded in fewer bytes. The case of
                  a truncated UTF sequence is handled specially, see the paragraph about incomplete binaries
                  below. If InEncoding is latin1, binaries are always valid as long as they contain whole bytes,
                  as each byte falls into the valid ISO-latin-1 range.
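
              The error cases listed above can be provoked with small inputs. The following is a minimal
              sketch; the module name error_demo and the function classify/2 are hypothetical, and the {ok, _}
              wrapper is an addition of this example, not something the module returns.

```erlang
-module(error_demo).
-export([classify/2]).

%% Hypothetical helper: tags successful conversions with ok and passes
%% error and incomplete tuples through unchanged.
classify(Data, InEncoding) ->
    case unicode:characters_to_list(Data, InEncoding) of
        {error, Converted, Rest}      -> {error, Converted, Rest};
        {incomplete, Converted, Rest} -> {incomplete, Converted, Rest};
        List when is_list(List)       -> {ok, List}
    end.

%% Inputs that give an error tuple:
%%   classify([97, 256], latin1)          integer above 255 with latin1
%%   classify([97, 16#D800], unicode)     integer in the surrogate range
%%   classify([16#110000], unicode)       integer above 16#10FFFF
%%   classify(<<97,255,98>>, unicode)     byte 255 is never valid in UTF-8
%%   classify(<<16#C0,16#80>>, unicode)   "overlong" encoding of 0
%% Any whole-byte binary is valid as latin1:
%%   classify(<<0,255>>, latin1)
```

              For the pure binary input <<97,255,98>>, the result is {error,"a",<<255,98>>}: the character
              converted before the error, followed by the undecoded remainder as a binary.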

              A special type of error is when no actual invalid integers or bytes are found, but a trailing
              binary() consists of too few bytes to decode the last character. This error might occur if bytes
              are read from a file in chunks, or if binaries are otherwise split at positions that are not UTF
              character boundaries. In this case an incomplete tuple is returned instead of the error tuple. It
              consists of the same parts as the error tuple, but the tag is incomplete instead of error and the
              last element is always guaranteed to be a binary consisting of the first part of a (so far) valid
              UTF character.
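
              As a minimal sketch of the incomplete case, the example below cuts the three-byte UTF-8 sequence
              for U+20AC after two bytes. The module name incomplete_demo and the function split_euro/0 are
              hypothetical.

```erlang
-module(incomplete_demo).
-export([split_euro/0]).

%% Hypothetical helper: decodes a truncated and a complete input.
split_euro() ->
    %% <<226,130,172>> is the UTF-8 encoding of U+20AC (8364).
    <<Head:2/binary, Tail/binary>> = <<226,130,172>>,
    %% Only the first two bytes are present: an incomplete tuple results.
    Truncated = unicode:characters_to_list([$a, Head]),
    %% With the missing byte supplied as a further binary, decoding succeeds.
    Complete = unicode:characters_to_list([$a, Head, Tail]),
    {Truncated, Complete}.
```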

              If one UTF character is split over two consecutive binaries in the Data, the conversion succeeds.
              This means that a character can be decoded from a range of binaries as long as the whole range is
              given as input without errors occurring. Example:

                   decode_data(Data) ->
                       case unicode:characters_to_list(Data,unicode) of
                           {incomplete,Encoded,Rest} ->
                               %% A character was cut off: fetch more input and
                               %% retry with the leftover bytes placed first.
                               More = get_some_more_data(),
                               Encoded ++ decode_data([Rest, More]);
                           {error,Encoded,Rest} ->
                               handle_error(Encoded,Rest);
                           List ->
                               List
                       end.

              Bit-strings that are not whole bytes are however not allowed, so a UTF character has to  be  split
              along 8-bit boundaries to ever be decoded.
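
              The same idea can be written as a self-contained sketch that decodes a list of UTF-8 binaries
              which may be split at arbitrary byte positions. The module name chunk_decode, the function
              decode_chunks/1 and the {ok, _} return shape are assumptions made for this example; they are not
              part of this module.

```erlang
-module(chunk_decode).
-export([decode_chunks/1]).

%% Hypothetical helper: decodes a list of UTF-8 chunks in order.
%% Returns {ok, List}, {incomplete, List, PendingBytes} if the input
%% ends in the middle of a character, or {error, List, Rest}.
decode_chunks(Chunks) ->
    decode_chunks(Chunks, <<>>, []).

decode_chunks([], <<>>, Acc) ->
    {ok, lists:append(lists:reverse(Acc))};
decode_chunks([], Pending, Acc) ->
    {incomplete, lists:append(lists:reverse(Acc)), Pending};
decode_chunks([Chunk | More], Pending, Acc) ->
    %% Put the leftover bytes from the previous chunk first, so that a
    %% character split between two chunks is decoded as a whole.
    case unicode:characters_to_list([Pending, Chunk], unicode) of
        {incomplete, Decoded, Rest} ->
            decode_chunks(More, Rest, [Decoded | Acc]);
        {error, Decoded, Rest} ->
            {error, lists:append(lists:reverse([Decoded | Acc])), Rest};
        Decoded when is_list(Decoded) ->
            decode_chunks(More, <<>>, [Decoded | Acc])
    end.
```

              For example, chunk_decode:decode_chunks([<<97,226>>, <<130>>, <<172,98>>]) reassembles the
              three bytes of U+20AC from three separate chunks and gives {ok,[97,8364,98]}.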

              If  any  parameters are of the wrong type, the list structure is invalid (a number as tail) or the
              binaries do not contain whole bytes (bit-strings), a badarg exception is thrown.
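
              A caller that cannot rule out malformed input might catch the exception. The following is a
              minimal sketch; the module name badarg_demo and the function safe_to_list/1 are hypothetical.

```erlang
-module(badarg_demo).
-export([safe_to_list/1]).

%% Hypothetical helper: returns the atom badarg instead of raising
%% the badarg exception.
safe_to_list(Data) ->
    try
        unicode:characters_to_list(Data, unicode)
    catch
        error:badarg -> badarg
    end.

%% Inputs expected to give badarg, following the rules above:
%%   safe_to_list([97 | 98])      list with a number as tail
%%   safe_to_list(<<1:1>>)        bit-string that is not whole bytes
%%   safe_to_list(not_chardata)   parameter of the wrong type
```

              Note that error and incomplete tuples are ordinary return values, not exceptions, so they pass
              through this wrapper unchanged.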

       characters_to_binary(Data) -> Result

              Types:

                 Data = latin1_chardata() | chardata() | external_chardata()
                 Result = binary()
                        | {error, binary(), RestData}
                        | {incomplete, binary(), binary()}
                 RestData = latin1_chardata() | chardata() | external_chardata()

              Same as characters_to_binary(Data, unicode, unicode).

       characters_to_binary(Data, InEncoding) -> Result

              Types:

                 Data = latin1_chardata() | chardata() | external_chardata()
                 InEncoding = encoding()
                 Result = binary()
                        | {error, binary(), RestData}
                        | {incomplete, binary(), binary()}
                 RestData = latin1_chardata() | chardata() | external_chardata()

              Same as characters_to_binary(Data, InEncoding, unicode).

       characters_to_binary(Data, InEncoding, OutEncoding) -> Result

              Types:

                 Data = latin1_chardata() | chardata() | external_chardata()
                 InEncoding = OutEncoding = encoding()
                 Result = binary()
                        | {error, binary(), RestData}
                        | {incomplete, binary(), binary()}
                 RestData = latin1_chardata() | chardata() | external_chardata()

              Behaves as characters_to_list/2, but produces a binary instead of a Unicode list. The InEncoding
              defines how input is to be interpreted if binaries are present in the Data, while OutEncoding
              defines in what format output is to be generated.

              The option unicode is an alias for utf8, as this is the preferred encoding for Unicode  characters
              in  binaries. utf16 is an alias for {utf16,big} and utf32 is an alias for {utf32,big}. The big and
              little atoms denote big or little endian encoding.

              Errors and exceptions occur as in characters_to_list/2, but the second element  in  the  error  or
              incomplete tuple will be a binary() and not a list().
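
              Typical conversions can be sketched as follows. The module name to_binary_demo and its
              functions are hypothetical names used only for this example.

```erlang
-module(to_binary_demo).
-export([latin1_to_utf8/1, utf8_to_utf16le/1, to_utf8/1]).

%% Hypothetical helper: reinterprets latin1 data as UTF-8 output.
latin1_to_utf8(Data) ->
    unicode:characters_to_binary(Data, latin1, utf8).

%% Hypothetical helper: converts UTF-8 input to little endian UTF-16.
utf8_to_utf16le(Data) ->
    unicode:characters_to_binary(Data, utf8, {utf16,little}).

%% Hypothetical helper: both encodings default to unicode (UTF-8).
%% On bad input the second tuple element is a binary, not a list.
to_utf8(Data) ->
    unicode:characters_to_binary(Data).
```

              For example, to_binary_demo:to_utf8(<<97,226,130>>) gives {incomplete,<<"a">>,<<226,130>>},
              where the converted part is a binary.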

       encoding_to_bom(InEncoding) -> Bin

              Types:

                 Bin = binary()
                    A binary() such that byte_size(Bin) <= 4.
                 InEncoding = encoding()

              Creates a UTF byte order mark (BOM) as a binary from the supplied InEncoding. The BOM, if
              supported at all, is expected to be placed first in UTF encoded files or messages.

              The function returns <<>> for the latin1 encoding as there is no BOM for ISO-latin-1.

              Note that the BOM for UTF-8 is seldom used and is not really a byte order mark. UTF-8 has no
              byte order issues, so the BOM is only there to differentiate UTF-8 encoding from other UTF
              formats.
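
              A possible way to combine the BOM with encoded output is sketched below. The module name
              bom_write_demo and the function with_bom/2 are hypothetical; the sketch assumes that Chars is
              valid, since an error or incomplete tuple from characters_to_binary/3 would make the binary
              construction fail.

```erlang
-module(bom_write_demo).
-export([with_bom/2]).

%% Hypothetical helper: encodes Chars in OutEncoding and prepends the
%% matching BOM. Assumes Chars converts without errors.
with_bom(Chars, OutEncoding) ->
    Bom = unicode:encoding_to_bom(OutEncoding),
    Body = unicode:characters_to_binary(Chars, unicode, OutEncoding),
    <<Bom/binary, Body/binary>>.
```

              For example, bom_write_demo:with_bom("A", utf16) gives <<254,255,0,65>>, while latin1 output
              gets no BOM at all.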

Ericsson AB                                       stdlib 1.19.4                                    unicode(3erl)