Provided by: sam_4.3-18.2_i386


        UTF, Unicode, ASCII, rune - character set and format


        The Plan 9 character set and representation are based on Unicode and on
        a proposed X-Open multibyte  FSS-UCS-TF  (File  System  Safe  Universal
        Character  Set Transformation Format) encoding.  Unicode represents its
        characters in 16 bits; FSS-UCS-TF, or just UTF, represents such values
        in an 8-bit byte stream.

        In Plan 9, a rune is a 16-bit quantity representing a Unicode
        character.  Internally, programs may store characters as runes.
        However, any
        external  manifestation  of  textual  information,  in  files or at the
        interface between programs,  uses  a  machine-independent,  byte-stream
        encoding called UTF.

        UTF is designed so that the 7-bit ASCII set (values hexadecimal 00 to
        7F) appears only as itself in the encoding.  Runes with values above 7F
        appear as sequences of two or more bytes with values only from 80 to
        FF.

        The UTF encoding of Unicode is backward compatible with ASCII: programs
        presented  only  with  ASCII work on Plan 9 even if not written to deal
        with UTF, as do programs that deal  with  uninterpreted  byte  streams.
        However,  programs  that  perform  semantic processing on ASCII graphic
        characters must convert from UTF to runes in  order  to  work  properly
        with non-ASCII input.  See rune(3g).
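
        As a small illustration of this compatibility, the sketch below
        splits a path at its last '/' using only the byte-oriented C
        library.  The function name last_component is hypothetical and
        chosen for this example.  It relies only on the property stated
        above: bytes in the ASCII range never occur inside a multibyte
        sequence, so a byte search for '/' cannot match part of another
        character.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: return a pointer to the text after the
 * last '/' in a UTF-encoded path, or the whole path if it
 * contains no '/'.  No decoding is needed, because every byte
 * of a multibyte sequence has its high bit set and so can
 * never equal the ASCII byte '/'.
 */
const char *
last_component(const char *path)
{
	const char *slash = strrchr(path, '/');

	return slash != NULL ? slash + 1 : path;
}
```

        By contrast, strlen counts bytes rather than runes, so a program
        that needs the number of characters, or that classifies or folds
        them, has to convert to runes first.
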
        Letting  numbers  be  binary,  a rune x is converted to a multibyte UTF
        sequence as follows:
        01. x in [00000000.0bbbbbbb] → 0bbbbbbb
        10. x in [00000bbb.bbbbbbbb] → 110bbbbb, 10bbbbbb
        11. x in [bbbbbbbb.bbbbbbbb] → 1110bbbb, 10bbbbbb, 10bbbbbb
        Conversion 01 provides a one-byte sequence that spans the ASCII charac‐
        ter  set  in a compatible way.  Conversions 10 and 11 represent higher-
        valued characters as sequences of two or three bytes with the high  bit
        set.   Plan  9 does not support the 4, 5, and 6 byte sequences proposed
        by X-Open.  When there are multiple ways to encode a value, for example
        rune 0, the shortest encoding is used.
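
        The three conversions can be sketched in C as follows.  This is
        a minimal illustration, not the library routine documented in
        rune(3g); the names Rune16 and rune_to_utf are hypothetical, and
        the caller is assumed to supply a buffer of at least three
        bytes.

```c
#include <stddef.h>

/* Hypothetical type for this sketch: a 16-bit Unicode value. */
typedef unsigned short Rune16;

/*
 * Encode one rune into buf, which must hold at least 3 bytes.
 * Returns the number of bytes written.  Testing the ranges from
 * smallest to largest always selects the shortest encoding.
 */
size_t
rune_to_utf(unsigned char *buf, Rune16 r)
{
	if (r <= 0x7F) {
		/* conversion 01: 0bbbbbbb */
		buf[0] = (unsigned char)r;
		return 1;
	}
	if (r <= 0x7FF) {
		/* conversion 10: 110bbbbb, 10bbbbbb */
		buf[0] = (unsigned char)(0xC0 | (r >> 6));
		buf[1] = (unsigned char)(0x80 | (r & 0x3F));
		return 2;
	}
	/* conversion 11: 1110bbbb, 10bbbbbb, 10bbbbbb */
	buf[0] = (unsigned char)(0xE0 | ((r >> 12) & 0x0F));
	buf[1] = (unsigned char)(0x80 | ((r >> 6) & 0x3F));
	buf[2] = (unsigned char)(0x80 | (r & 0x3F));
	return 3;
}
```

        For example, rune hexadecimal 00E9 falls under conversion 10 and
        becomes the two bytes C3 A9, while rune 20AC falls under
        conversion 11 and becomes E2 82 AC.
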
        In the inverse mapping, any sequence except those described above is
        incorrect and is converted to rune hexadecimal 0080.
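
        A decoder following this rule might look like the sketch below.
        The names Rune16, BAD_RUNE, and utf_to_rune are hypothetical.
        Two details are assumptions of the sketch rather than statements
        from this page: an encoding longer than the shortest form is
        treated as incorrect, and an incorrect sequence consumes exactly
        one byte so that decoding can resume at the next byte.

```c
#include <stddef.h>

/* Hypothetical names for this sketch. */
typedef unsigned short Rune16;
enum { BAD_RUNE = 0x0080 };	/* result for any incorrect sequence */

/*
 * Decode one rune from the n bytes at s and store it in *r.
 * Returns the number of bytes consumed (0 only when n is 0).
 * Assumption: overlong forms are rejected, and a bad sequence
 * consumes a single byte.
 */
size_t
utf_to_rune(Rune16 *r, const unsigned char *s, size_t n)
{
	unsigned c0, c1, v;

	if (n == 0) {
		*r = BAD_RUNE;
		return 0;
	}
	c0 = s[0];
	if (c0 < 0x80) {			/* 0bbbbbbb */
		*r = (Rune16)c0;
		return 1;
	}
	if (n >= 2 && (s[1] & 0xC0) == 0x80) {
		c1 = s[1] & 0x3F;
		if ((c0 & 0xE0) == 0xC0) {	/* 110bbbbb, 10bbbbbb */
			v = ((c0 & 0x1F) << 6) | c1;
			if (v >= 0x80) {
				*r = (Rune16)v;
				return 2;
			}
		} else if ((c0 & 0xF0) == 0xE0 && n >= 3
		    && (s[2] & 0xC0) == 0x80) {	/* 1110bbbb, 10bbbbbb, 10bbbbbb */
			v = ((c0 & 0x0F) << 12) | (c1 << 6) | (s[2] & 0x3F);
			if (v >= 0x800) {
				*r = (Rune16)v;
				return 3;
			}
		}
	}
	*r = BAD_RUNE;				/* anything else is incorrect */
	return 1;
}
```

        With this sketch, a stray continuation byte such as 80, a
        truncated sequence such as E2 82, and the two-byte form C0 80
        all yield rune 0080.
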

        See also: ascii(7), rune(3g), keyboard(5g), The Unicode Standard.