Ubuntu Manpage: Unicode::String - String of Unicode characters (UTF-16BE)

Provided by: libunicode-string-perl_2.09-5build1_amd64

NAME

       Unicode::String - String of Unicode characters (UTF-16BE)

SYNOPSIS

        use Unicode::String qw(utf8 latin1 utf16be);

        $u = utf8("string");
        $u = latin1("string");
        $u = utf16be("\0s\0t\0r\0i\0n\0g");

        print $u->utf32be;   # 4 byte characters
        print $u->utf16le;   # 2 byte characters + surrogates
        print $u->utf8;      # 1-4 byte characters

DESCRIPTION

       A "Unicode::String" object represents a sequence of Unicode characters.  Methods are provided to convert
       between various external formats (encodings) and "Unicode::String" objects, and methods are provided for
       common string manipulations.

       The functions utf32be(), utf32le(), utf16be(), utf16le(), utf8(), utf7(), latin1(), uhex(), uchr() can be
       imported from the "Unicode::String" module and will work as constructors initializing strings of the
       corresponding encoding.

       "Unicode::String" objects overload various operators, which means that in most cases they can be
       treated like plain strings.

       Internally a "Unicode::String" object is represented by a string of 2-byte numbers in network byte order
       (big-endian).  This representation is not exposed by the API, but it can be useful to know when
       predicting the efficiency of the provided methods.
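
       As a minimal sketch of the points above (the variable names are illustrative and not part of the
       module), the same characters can be built from two different encodings and then stringified through
       the overloaded operator:

```perl
use strict;
use warnings;
use Unicode::String qw(utf8 latin1);

# The same two characters (U+00B5, U+006D) supplied in two encodings.
my $from_utf8   = utf8("\xC2\xB5m");
my $from_latin1 = latin1("\xB5m");

# hex() gives an encoding-independent view of the contents.
print $from_utf8->hex,   "\n";
print $from_latin1->hex, "\n";

# Overloaded stringification; UTF-8 is the default external form.
my $plain = "$from_latin1";
print length($plain), " bytes when stringified\n";
```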

   METHODS
   Class methods
       The following class methods are available:

       Unicode::String->stringify_as
       Unicode::String->stringify_as( $enc )
           This  method  is  used  to  specify  which  encoding  will be used when "Unicode::String" objects are
           implicitly converted to and from plain strings.

           If an argument is provided it sets the current encoding.  The argument should be one of the
           following: "ucs4", "utf32", "utf32be", "utf32le", "ucs2", "utf16", "utf16be", "utf16le", "utf8",
           "utf7", "latin1" or "hex".  The default is "utf8".

           The stringify_as() method returns a reference to the current encoding function.
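
           A short sketch of how the setting affects implicit conversion.  It assumes the default ("utf8") is
           in effect at the start, and it only checks that the return value is a code reference without
           relying on which encoding function is returned:

```perl
use strict;
use warnings;
use Unicode::String qw(utf8);

my $us = utf8("\xC2\xB5m");

# Default: implicit stringification produces UTF-8 bytes.
my $as_utf8 = "$us";

# Switch the class-wide setting; the return value is a code reference.
my $enc = Unicode::String->stringify_as("latin1");
my $as_latin1 = "$us";

# Restore the default so that later code is unaffected.
Unicode::String->stringify_as("utf8");

printf "utf8: %d bytes, latin1: %d bytes\n", length($as_utf8), length($as_latin1);
```

           Because the setting is global to the class, changing it affects every "Unicode::String" object in
           the program, which is why the sketch restores it afterwards.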

       $us = Unicode::String->new
       $us = Unicode::String->new( $initial_value )
           This is the object constructor.  Without argument, it creates an empty "Unicode::String" object.   If
           an  $initial_value  argument  is  given,  it  is  decoded  according  to the specified stringify_as()
           encoding, UTF-8 by default.

           In general it is recommended to import and use one of the  encoding  specific  constructor  functions
           instead of invoking this method.
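
           The difference between the generic constructor and an encoding specific one can be sketched as
           follows, assuming stringify_as() has not been changed from its default:

```perl
use strict;
use warnings;
use Unicode::String qw(latin1);

my $empty    = Unicode::String->new;                # no characters
my $implicit = Unicode::String->new("\xC2\xB5m");   # decoded per stringify_as(), UTF-8 by default
my $explicit = latin1("\xB5m");                     # encoding is visible at the call site

print "empty object is false\n" unless $empty;
print $implicit->hex, "\n";
print $explicit->hex, "\n";
```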

   Encoding methods
       These  methods  get  or  set  the  value  of  the  "Unicode::String"  object  by  passing  strings in the
       corresponding encoding.   If  a  new  value  is  passed  as  argument  it  will  set  the  value  of  the
       "Unicode::String",  and  the previous value is returned.  If no argument is passed then the current value
       is returned.

       To illustrate the encodings we show how the 2 character sample string "µm" (micrometer) is encoded
       for each one.

       $us->utf32be
       $us->utf32be( $newval )
           The string passed should be in the UTF-32 encoding with bytes in big endian order.  The sample "µm"
           is "\0\0\0\xB5\0\0\0m" in this encoding.

           Alternative names for this method are utf32() and ucs4().

       $us->utf32le
       $us->utf32le( $newval )
           The string passed should be in the UTF-32 encoding with bytes in little endian order.  The sample
           "µm" is "\xB5\0\0\0m\0\0\0" in this encoding.

       $us->utf16be
       $us->utf16be( $newval )
           The string passed should be in the UTF-16 encoding with bytes in big endian order. The sample "µm" is
           "\0\xB5\0m" in this encoding.

           Alternative names for this method are utf16() and ucs2().

           If the string passed to utf16be() starts with the Unicode byte order mark in little endian order, the
           result is as if utf16le() was called instead.
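
           A small sketch of the byte order mark handling described above.  The input bytes are made up for
           illustration, and the example prints the hex() view rather than assuming whether the mark itself
           is retained in the string:

```perl
use strict;
use warnings;
use Unicode::String qw(utf16be);

# "hi" as UTF-16LE, preceded by a little endian byte order mark (FF FE).
my $le_bytes = "\xFF\xFE" . "h\0i\0";

# Passed to the big endian constructor, the mark triggers the swap.
my $us = utf16be($le_bytes);

print $us->hex, "\n";
```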

       $us->utf16le
       $us->utf16le( $newval )
           The string passed should be in the UTF-16 encoding with bytes in little endian order.  The sample
           "µm" is "\xB5\0m\0" in this encoding.  This is the encoding used by the Microsoft Windows API.

           If the string passed to utf16le() starts with the Unicode byte order mark in big endian order, the
           result is as if utf16be() was called instead.

       $us->utf8
       $us->utf8( $newval )
           The string passed should be in the UTF-8 encoding. The sample "µm" is "\xC2\xB5m" in this encoding.

       $us->utf7
       $us->utf7( $newval )
           The string passed should be in the UTF-7 encoding. The sample "µm" is "+ALU-m" in this encoding.

           The UTF-7 encoding only uses plain US-ASCII characters for the encoding.  This makes it safe for
           transport through 8-bit stripping protocols.  Characters outside the US-ASCII range are
           base64-encoded and '+' is used as an escape character.  The UTF-7 encoding is described in RFC 1642.

           If the (global) variable $Unicode::String::UTF7_OPTIONAL_DIRECT_CHARS is TRUE, then a wider range of
           characters is encoded as themselves.  It is TRUE by default.  The characters affected by this
           are:

              ! " # $ % & * ; < = > @ [ ] ^ _ ` { | }
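
           The effect of the variable can be sketched like this.  The sample string "a!b" is an arbitrary
           choice, and the example assumes the variable is consulted each time utf7() is called, so a
           temporary "local" override is enough:

```perl
use strict;
use warnings;
use Unicode::String qw(latin1);

my $us = latin1("a!b");

# Default setting: "!" is in the optional set and is passed through.
my $relaxed = $us->utf7;

# With the flag off, "!" has to go through the base64 escape instead.
my $strict;
{
    local $Unicode::String::UTF7_OPTIONAL_DIRECT_CHARS = 0;
    $strict = $us->utf7;
}

print "relaxed: $relaxed\n";
print "strict:  $strict\n";
```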

       $us->latin1
       $us->latin1( $newval )
           The string passed should be in the ISO-8859-1 encoding. The sample "µm" is "\xB5m" in this encoding.

           Characters  outside  the  "\x00"  ..  "\xFF"  range  are  simply removed from the return value of the
           latin1() method.  If you want more control over the mapping  from  Unicode  to  ISO-8859-1,  use  the
           "Unicode::Map8" class.  This is also the way to deal with other 8-bit character sets.

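           To make the lossy behaviour of latin1() concrete, here is a sketch with one character outside the
           ISO-8859-1 range (the EURO SIGN, chosen only as an example):

```perl
use strict;
use warnings;
use Unicode::String qw(utf8);

# EURO SIGN (U+20AC, three bytes in UTF-8) followed by the digit "5".
my $us = utf8("\xE2\x82\xAC" . "5");

# U+20AC has no ISO-8859-1 equivalent, so it is dropped from the result.
my $l1 = $us->latin1;

print "characters in object: ", $us->length, "\n";
print "bytes in latin1 form: ", length($l1), "\n";
```
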
       $us->hex
       $us->hex( $newval )
           The string passed should be plain ASCII where each Unicode character is written as "U+XXXX" and
           characters are separated by a single space.  The "U+" prefix is optional when setting the
           value.  The sample "µm" is "U+00b5 U+006d" in this encoding.
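
           The sample values quoted throughout this section can be reproduced with a short loop over the
           encoding methods.  This is only an illustrative sketch; the byte dump formatting is not part of
           the module:

```perl
use strict;
use warnings;
use Unicode::String qw(latin1);

my $us = latin1("\xB5m");   # MICRO SIGN followed by "m"

# Print each external form as a space separated hex byte dump.
for my $enc (qw(utf32be utf32le utf16be utf16le utf8 utf7 latin1 hex)) {
    my $bytes = $us->$enc();
    my $dump  = join " ", map { sprintf "%02X", ord } split //, $bytes;
    printf "%-8s %s\n", $enc, $dump;
}
```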

   String Operations
       The following methods are available:

       $us->as_string
           Converts  a  "Unicode::String"  to  a  plain  string according to the setting of stringify_as().  The
           default stringify_as() encoding is "utf8".

       $us->as_num
           Converts a "Unicode::String" to a number.  Currently only the digits in the range 0x30  ..  0x39  are
           recognized.  The plan is to eventually support all Unicode digit characters.

       $us->as_bool
           Converts a "Unicode::String" to a boolean value.  Only the empty string is FALSE.  A string
           consisting of only the character U+0030 is considered TRUE, even though Perl considers "0" to be FALSE.
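
           The difference from Perl's own truth rules can be sketched as follows.  The conversions are
           triggered through the overloaded operators rather than by calling as_bool() directly:

```perl
use strict;
use warnings;
use Unicode::String qw(utf8);

my $zero  = utf8("0");
my $empty = Unicode::String->new;
my $num   = utf8("42");

print "object holding '0' is true\n" if $zero;       # overloaded boolean test
print "plain '0' is false in Perl\n" unless "0";
print "empty object is false\n"      unless $empty;
print "numeric value: ", $num->as_num, "\n";
```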

       $us->repeat( $count )
           Returns a new "Unicode::String" where the content of $us is repeated $count times.  This operation is
           also overloaded as:

             $us x $count

       $us->concat( $other_string )
           Concatenates the string $us and the string $other_string.  If $other_string is not a
           "Unicode::String" object, then it is first passed to the Unicode::String->new constructor function.
           This operation is also overloaded as:

             $us . $other_string

       $us->append( $other_string )
           Appends the string $other_string to the value of $us.  If $other_string is not a "Unicode::String"
           object, then it is first passed to the Unicode::String->new constructor function.  This operation is
           also overloaded as:

             $us .= $other_string
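
           The three overloaded operators can be tried together in a short sketch.  The plain string operands
           are ASCII only, so the result does not depend on the stringify_as() setting used by the implicit
           constructor call:

```perl
use strict;
use warnings;
use Unicode::String qw(latin1);

my $us = latin1("ab");

my $rep = $us x 3;       # same as $us->repeat(3)
my $cat = $us . "cd";    # same as $us->concat("cd"); "cd" goes through Unicode::String->new
$us .= "!";              # same as $us->append("!"); modifies $us

print $rep->latin1, "\n";
print $cat->latin1, "\n";
print $us->latin1,  "\n";
```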

       $us->copy
           Returns a copy of the  current  "Unicode::String"  object.   This  operation  is  overloaded  as  the
           assignment operator.

       $us->length
           Returns the length of the "Unicode::String".  Surrogate pairs are still counted as 2.

       $us->byteswap
           This method will swap the bytes in the internal representation of the "Unicode::String" object.

           Unicode reserves the character U+FEFF as a byte order mark.  This works because the swapped
           character, U+FFFE, is guaranteed not to be a valid character.  For strings that have the byte order
           mark as the first character, we can guarantee to get the byte order right with the following code:

              $ustr->byteswap if $ustr->ord == 0xFFFE;
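
           A self-contained sketch of that check.  The first byteswap() call only simulates data that arrived
           in the wrong byte order; the sample text is made up:

```perl
use strict;
use warnings;
use Unicode::String qw(utf16be);

# Byte order mark followed by "hi", in big endian order.
my $ustr = utf16be("\xFE\xFF\0h\0i");

# Simulate wrongly ordered data: the first character becomes U+FFFE.
$ustr->byteswap;
printf "before check: U+%04X\n", scalar $ustr->ord;

# The documented check swaps the string back.
$ustr->byteswap if $ustr->ord == 0xFFFE;
printf "after check:  U+%04X\n", scalar $ustr->ord;
```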

       $us->unpack
           Returns a list of integers each representing a UCS-2 character code.

       $us->pack( @uchr )
           Sets the value of $us to a sequence of UCS-2 characters with the character codes given as arguments.

       $us->ord
           Returns  the  character  code  of  the first character in $us.  The ord() method deals with surrogate
           pairs, which gives us a result-range of 0x0 .. 0x10FFFF.  If  the  $us  string  is  empty,  undef  is
           returned.

       $us->chr( $code )
           Sets the value of $us to be a string containing the character with code $code.  The argument
           $code must be an integer in the range 0x0 .. 0x10FFFF.  If the code is greater than 0xFFFF then a
           surrogate pair is created.
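
           The interaction between chr(), ord(), length() and unpack() for a character above 0xFFFF can be
           sketched as follows.  The code point U+1F600 is an arbitrary example, and uchr() is the imported
           constructor form of chr():

```perl
use strict;
use warnings;
use Unicode::String qw(uchr);

# 0x1F600 is above 0xFFFF, so it is stored as a surrogate pair.
my $us = uchr(0x1F600);

my $code  = $us->ord;       # scalar context: the pair is combined again
my $len   = $us->length;    # counts 16-bit units, so the pair counts as 2
my @units = $us->unpack;    # the two surrogate values

printf "ord:    U+%X\n", $code;
printf "length: %d\n",   $len;
printf "units:  %s\n",   join " ", map { sprintf "0x%04X", $_ } @units;
```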

       $us->name
           In scalar context returns the official Unicode name of the first character in $us.  In list context
           returns the names of all characters in $us.  Also see Unicode::CharName.

       $us->substr( $offset )
       $us->substr( $offset, $length )
       $us->substr( $offset, $length, $subst )
           Returns a sub-string of $us.  Works similarly to the builtin substr() function.

       $us->index( $other )
       $us->index( $other, $pos )
           Locates the position of $other within $us, possibly starting the search at position $pos.

       $us->chop
           Chops off the last character of $us and returns it (as a "Unicode::String" object).
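
           A brief sketch combining index(), substr() and chop().  Positions and lengths are counted in
           characters; the sample text is ASCII only, so the plain string given to index() is decoded the same
           way under any stringify_as() setting that accepts ASCII:

```perl
use strict;
use warnings;
use Unicode::String qw(latin1);

my $us = latin1("hello world");

my $pos  = $us->index("world");     # character offset of the match
my $word = $us->substr($pos, 5);    # new object holding 5 characters
my $last = $us->chop;               # removes and returns the final character

print "pos:  $pos\n";
print "word: ", $word->latin1, "\n";
print "last: ", $last->latin1, "\n";
print "rest: ", $us->latin1,   "\n";
```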

FUNCTIONS

       The following functions are provided.  None of these are exported by default.

       byteswap2( $str, ... )
           This function swaps the two bytes of each 2-byte unit in the strings passed as arguments.  If it is
           called in void context, it modifies its arguments in-place.  Otherwise, the swapped strings are
           returned.

       byteswap4( $str, ... )
           The byteswap4 function works like byteswap2, but reverses the byte order of each 4-byte unit.
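
           The two calling styles can be sketched as follows, using made-up byte strings:

```perl
use strict;
use warnings;
use Unicode::String qw(byteswap2 byteswap4);

# Non-void context: a swapped copy is returned, the argument is untouched.
my $be = "\0h\0i";
my $le = byteswap2($be);

# Void context: the argument itself is modified.
my $buf = "\0\0\0\xB5";
byteswap4($buf);

print unpack("H*", $le),  "\n";
print unpack("H*", $buf), "\n";
```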

       latin1( $str )
       utf7( $str )
       utf8( $str )
       utf16le( $str )
       utf16be( $str )
       utf32le( $str )
       utf32be( $str )
           Constructor functions for the various Unicode encodings.  These return new "Unicode::String" objects.
           The provided argument should be encoded correspondingly.

       uhex( $str )
           Constructs  a  new  "Unicode::String" object from a string of hex values.  See hex() method above for
           description of the format.
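
           For instance, both spellings of the hex format should produce the same object.  This is a minimal
           sketch using the sample string from the encoding methods section:

```perl
use strict;
use warnings;
use Unicode::String qw(uhex);

my $with_prefix    = uhex("U+00b5 U+006d");
my $without_prefix = uhex("00b5 006d");     # the "U+" prefix is optional on input

print $with_prefix->hex,    "\n";
print $without_prefix->hex, "\n";
```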

       uchr( $num )
           Constructs a new one character "Unicode::String" object from a Unicode character code.  This works
           similarly to perl's builtin chr() function.

COPYRIGHT

       Copyright 1997-2000,2005 Gisle Aas.

       This  library  is  free  software;  you can redistribute it and/or modify it under the same terms as Perl
       itself.

perl v5.18.1                                       2005-10-26                                        String(3pm)
