Provided by: sam_4.3-18.1_i386

NAME

       UTF, Unicode, ASCII, rune - character set and format

DESCRIPTION

       The Plan 9 character set and representation are based on Unicode and on
       a proposed X-Open multibyte  FSS-UCS-TF  (File  System  Safe  Universal
       Character Set Transformation Format) encoding. Unicode represents its
       characters in 16 bits; FSS-UCS-TF, or just UTF, represents such values
       in an 8-bit byte stream.

       In  Plan  9,  a  rune  is  a  16-bit  quantity  representing  a Unicode
       character.   Internally,  programs  may  store  characters  as   runes.
       However, any external manifestation of textual information, in files or
       at the interface between programs, uses  a  machine-independent,  byte-
       stream encoding called UTF.

       UTF is designed so that the characters of the 7-bit ASCII set
       (hexadecimal values 00 to 7F) appear only as themselves in the
       encoding. Runes with values above 7F appear as sequences of two or
       more bytes with values only from 80 to FF.

       The UTF encoding of Unicode is backward compatible with ASCII: programs
       presented  only  with  ASCII work on Plan 9 even if not written to deal
       with UTF, as do programs that deal  with  uninterpreted  byte  streams.
       However,  programs  that  perform  semantic processing on ASCII graphic
       characters must convert from UTF to runes in  order  to  work  properly
       with non-ASCII input.  See rune(3g).
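
       As an illustration of the compatibility property, the sketch below
       splits a UTF-encoded path at its last slash using only a byte search.
       It relies on nothing but the guarantee that every byte of a multibyte
       sequence lies in the range 80 to FF, so the byte 2F can only ever be a
       real slash. The name utf_basename is hypothetical and is not part of
       any Plan 9 library.

```c
#include <string.h>

/*
 * Hypothetical helper (not a Plan 9 routine): return a pointer to the
 * final component of a UTF-encoded path.  A plain byte search is safe
 * because '/' (hex 2F) never occurs inside a multibyte sequence.
 */
const char *utf_basename(const char *path)
{
    const char *slash = strrchr(path, '/');

    return slash ? slash + 1 : path;
}
```

       A program that must classify or case-fold individual characters cannot
       work this way and would need to convert to runes first.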

       Letting  numbers  be  binary,  a rune x is converted to a multibyte UTF
       sequence as follows:

       01. x in [00000000.0bbbbbbb] → 0bbbbbbb
       10. x in [00000bbb.bbbbbbbb] → 110bbbbb, 10bbbbbb
       11. x in [bbbbbbbb.bbbbbbbb] → 1110bbbb, 10bbbbbb, 10bbbbbb

       Conversion 01  provides  a  one-byte  sequence  that  spans  the  ASCII
       character  set  in  a  compatible way.  Conversions 10 and 11 represent
       higher-valued characters as sequences of two or three  bytes  with  the
       high bit set. Plan 9 does not support the 4-, 5-, and 6-byte sequences
       proposed by X-Open. When a value can be encoded in more than one way,
       for example rune 0, the shortest encoding is used.
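
       The three conversions can be sketched in C as follows. This is a
       minimal illustration under stated assumptions, not the library
       implementation: the type Rune16 and the function rune_to_utf are
       hypothetical names, and the caller is assumed to supply a buffer of at
       least three bytes. Testing the smallest range first is what selects
       the shortest encoding.

```c
/* Hypothetical 16-bit rune type; not the declaration used by rune(3g). */
typedef unsigned short Rune16;

/*
 * Encode one rune into buf (at least 3 bytes) and return the number
 * of bytes written.  Ranges are tested smallest first, so the shortest
 * encoding is always chosen.
 */
int rune_to_utf(unsigned char *buf, Rune16 r)
{
    unsigned int x = r & 0xFFFFu;

    if (x <= 0x7F) {            /* conversion 01: 0bbbbbbb */
        buf[0] = (unsigned char)x;
        return 1;
    }
    if (x <= 0x7FF) {           /* conversion 10: 110bbbbb 10bbbbbb */
        buf[0] = (unsigned char)(0xC0 | (x >> 6));
        buf[1] = (unsigned char)(0x80 | (x & 0x3F));
        return 2;
    }
    /* conversion 11: 1110bbbb 10bbbbbb 10bbbbbb */
    buf[0] = (unsigned char)(0xE0 | (x >> 12));
    buf[1] = (unsigned char)(0x80 | ((x >> 6) & 0x3F));
    buf[2] = (unsigned char)(0x80 | (x & 0x3F));
    return 3;
}
```

       For example, rune 0041 yields the single byte 41, rune 00E9 yields
       the bytes C3 A9, and rune 20AC yields the bytes E2 82 AC.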

       In  the  inverse  mapping, any sequence except those described above is
       incorrect and is converted to rune 0080.
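
       A decoder following this rule might look like the sketch below. The
       names utf_to_rune and RUNE_ERROR are hypothetical. Consuming exactly
       one byte on a bad sequence is an assumption made here so that a caller
       can resynchronize; the text above specifies only the resulting rune
       value. Encodings longer than necessary are treated as incorrect,
       consistent with the shortest-encoding rule.

```c
#include <stddef.h>

enum { RUNE_ERROR = 0x0080 };   /* value produced for incorrect input */

/*
 * Decode one rune from the n bytes at s, store it in *r, and return
 * the number of bytes consumed (0 only when n is 0).  Assumption: an
 * incorrect sequence consumes a single byte.
 */
int utf_to_rune(unsigned int *r, const unsigned char *s, size_t n)
{
    unsigned int c0, c1, c2, x;

    if (n == 0) {
        *r = RUNE_ERROR;
        return 0;
    }
    c0 = s[0];
    if (c0 < 0x80) {                        /* 0bbbbbbb */
        *r = c0;
        return 1;
    }
    if (n >= 2 && (s[1] & 0xC0) == 0x80) {
        c1 = s[1] & 0x3F;
        if ((c0 & 0xE0) == 0xC0) {          /* 110bbbbb 10bbbbbb */
            x = ((c0 & 0x1F) << 6) | c1;
            if (x > 0x7F) {                 /* reject overlong forms */
                *r = x;
                return 2;
            }
        } else if ((c0 & 0xF0) == 0xE0 && n >= 3 && (s[2] & 0xC0) == 0x80) {
            c2 = s[2] & 0x3F;               /* 1110bbbb 10bbbbbb 10bbbbbb */
            x = ((c0 & 0x0F) << 12) | (c1 << 6) | c2;
            if (x > 0x7FF) {                /* reject overlong forms */
                *r = x;
                return 3;
            }
        }
    }
    *r = RUNE_ERROR;                        /* anything else is incorrect */
    return 1;
}
```

       Note that the correct two-byte sequence C2 80 also decodes to rune
       0080, so a caller of this sketch cannot tell an error from that rune
       by value alone.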

SEE ALSO

       ascii(7), rune(3g), keyboard(5g), The Unicode Standard.

                                                                       UTF(5G)