Ubuntu Manpage: I18N::Charset - IANA Character Set Registry names and Unicode::MapUTF8 (et al.) conversion

Provided by: libi18n-charset-perl_1.414-1_all

NAME

       I18N::Charset - IANA Character Set Registry names and Unicode::MapUTF8 (et al.) conversion
       scheme names

SYNOPSIS

         use I18N::Charset;

         $sCharset = iana_charset_name('WinCyrillic');
         # $sCharset is now 'windows-1251'
         $sCharset = umap_charset_name('Adobe DingBats');
         # $sCharset is now 'ADOBE-DINGBATS' which can be passed to Unicode::Map->new()
         $sCharset = map8_charset_name('windows-1251');
         # $sCharset is now 'cp1251' which can be passed to Unicode::Map8->new()
         $sCharset = umu8_charset_name('x-sjis');
         # $sCharset is now 'sjis' which can be passed to Unicode::MapUTF8->new()
         $sCharset = libi_charset_name('x-sjis');
         # $sCharset is now 'MS_KANJI' which can be passed to `iconv -f $sCharset ...`
         $sCharset = enco_charset_name('Shift-JIS');
         # $sCharset is now 'shiftjis' which can be passed to Encode::from_to()

         I18N::Charset::add_iana_alias('my-japanese' => 'iso-2022-jp');
         I18N::Charset::add_map8_alias('my-arabic' => 'arabic7');
         I18N::Charset::add_umap_alias('my-hebrew' => 'ISO-8859-8');
         I18N::Charset::add_libi_alias('my-sjis' => 'x-sjis');
         I18N::Charset::add_enco_alias('my-japanese' => 'shiftjis');

DESCRIPTION

       The "I18N::Charset" module provides access to the IANA Character Set Registry names for
       identifying character encoding schemes.  It also provides a mapping to the character set
       names used by the Unicode::Map and Unicode::Map8 modules.

       So, for example, if you get an HTML document with a META CHARSET="..."  tag, you can
       fairly quickly determine what Unicode::MapXXX module can be used to convert it to Unicode.
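
       The META-tag lookup described above can be sketched as follows.  This is a minimal
       illustration, not part of the module: the helper charset_label_from_html() and its
       deliberately simple regular expression are assumptions made for this example, and a
       real application would more likely use an HTML parser.

```perl
use strict;
use warnings;
use I18N::Charset;

# Hypothetical helper (not provided by I18N::Charset): extract the charset
# label from the first <meta> tag that carries one, using a simple regex.
sub charset_label_from_html {
    my ($html) = @_;
    return $1 if $html =~ /<meta\b[^>]*\bcharset\s*=\s*["']?([^"'\s;>\/]+)/i;
    return;
}

my $html  = '<meta http-equiv="Content-Type" content="text/html; charset=windows-1251">';
my $label = charset_label_from_html($html);

# Normalize the label to its IANA name, then ask for each converter's name.
# Any of the converter lookups may return undef if that module is missing.
my $iana = defined $label ? iana_charset_name($label) : undef;
if (defined $iana) {
    printf "IANA name:        %s\n", $iana;
    printf "Unicode::Map8:    %s\n", map8_charset_name($iana) // '(not available)';
    printf "Unicode::MapUTF8: %s\n", umu8_charset_name($iana) // '(not available)';
    printf "Encode:           %s\n", enco_charset_name($iana) // '(not available)';
}
else {
    print "No recognizable charset label found\n";
}
```

       Resolving the label to its IANA name first keeps the later lookups independent of
       how the document happened to spell the character set.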

       If you don't have the module Unicode::Map installed, the umap_ functions will always
       return undef.  If you don't have the module Unicode::Map8 installed, the map8_ functions
       will always return undef.  If you don't have the module Unicode::MapUTF8 installed, the
       umu8_ functions will always return undef.  If you don't have the iconv library installed,
       the libi_ functions will always return undef.  If you don't have the Encode module
       installed, the enco_ functions will always return undef.
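
       Because each family of functions returns undef when its backing module is missing,
       a caller can probe them in order and use the first one that answers.  The sketch
       below assumes a hypothetical helper, first_backend_name(), and an arbitrary order of
       preference; neither is part of I18N::Charset.

```perl
use strict;
use warnings;
use I18N::Charset;

# Hypothetical helper: try each lookup family in a preferred order and
# return the first (backend label, charset name) pair that is defined.
sub first_backend_name {
    my ($name) = @_;
    my @lookups = (
        [ 'Encode'           => \&enco_charset_name ],
        [ 'iconv'            => \&libi_charset_name ],
        [ 'Unicode::MapUTF8' => \&umu8_charset_name ],
        [ 'Unicode::Map8'    => \&map8_charset_name ],
        [ 'Unicode::Map'     => \&umap_charset_name ],
    );
    for my $lookup (@lookups) {
        my ($backend, $code) = @$lookup;
        my $found = $code->($name);
        return ($backend, $found) if defined $found;
    }
    return;    # no installed backend recognizes this character set
}

my ($backend, $charset) = first_backend_name('windows-1251');
if (defined $backend) {
    print "Use $backend with the name '$charset'\n";
}
else {
    print "No installed conversion backend recognizes this name\n";
}
```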

CONVERSION ROUTINES

       The four main conversion routines are "iana_charset_name()", "map8_charset_name()",
       "umap_charset_name()", and "umu8_charset_name()".  The other routines described below
       cover MIME, Encode, iconv, and MIBenum lookups.

       iana_charset_name()
           This function takes a string containing the name of a character set and returns a
           string which contains the official IANA name of the character set identified. If no
           valid character set name can be identified, then "undef" will be returned.  The case
           and punctuation within the string are not important.

               $sCharset = iana_charset_name('WinCyrillic');
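
           As a small illustration of the case and punctuation rule, the loop below looks
           up several spellings of the same name.  The particular spellings are chosen for
           this example only, and unrecognized ones are reported rather than assumed to
           resolve.

```perl
use strict;
use warnings;
use I18N::Charset;

for my $spelling ('WinCyrillic', 'wincyrillic', 'WINCYRILLIC', 'Win-Cyrillic', 'win_cyrillic') {
    my $iana = iana_charset_name($spelling);
    printf "%-14s => %s\n", $spelling, $iana // '(not recognized)';
}
```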

       mime_charset_name()
           This function takes a string containing the name of a character set and returns a
           string which contains the preferred MIME name of the character set identified. If no
           valid character set name can be identified, then "undef" will be returned.  The case
           and punctuation within the string are not important.

               $sCharset = mime_charset_name('Extended_UNIX_Code_Packed_Format_for_Japanese');
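
           One plausible use is building a Content-Type header.  The sketch below falls
           back to the IANA name when mime_charset_name() returns undef; that fallback is a
           choice made for this example, not behavior of the module.

```perl
use strict;
use warnings;
use I18N::Charset;

my $input = 'Extended_UNIX_Code_Packed_Format_for_Japanese';
my $mime  = mime_charset_name($input);

# Fall back to the IANA name if mime_charset_name() returned undef.
my $label = $mime // iana_charset_name($input);
print "Content-Type: text/plain; charset=$label\n" if defined $label;
```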

       enco_charset_name()
           This function takes a string containing the name of a character set and returns a
           string which contains a name of the character set suitable to be passed to the Encode
           module.  If no valid character set name can be identified, or if Encode is not
           installed, then "undef" will be returned.  The case and punctuation within the string
           are not important.

               $sCharset = enco_charset_name('Extended_UNIX_Code_Packed_Format_for_Japanese');

       libi_charset_name()
           This function takes a string containing the name of a character set and returns a
           string which contains a name of the character set suitable to be passed to iconv.  If
           no valid character set name can be identified, then "undef" will be returned.  The
           case and punctuation within the string are not important.

               $sCharset = libi_charset_name('Extended_UNIX_Code_Packed_Format_for_Korean');

       mib_to_charset_name()
           This function takes a string containing the MIBenum of a character set and returns a
           string which contains a name for the character set.  If the given MIBenum does not
           correspond to any character set, then "undef" will be returned.

               $sCharset = mib_to_charset_name('3');

       mib_charset_name()
           This is a synonym for mib_to_charset_name().

       charset_name_to_mib()
           This function takes a string containing the name of a character set in almost any
           format and returns a MIBenum for the character set.  For IANA-registered character
           sets, this is the IANA-registered MIB.  For non-IANA character sets, this is an
           unambiguous unique string whose only use is to pass to other functions in this module.
           If no valid character set name can be identified, then "undef" will be returned.

               $iMIB = charset_name_to_mib('US-ASCII');
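
           The two MIBenum functions can be combined into a round trip, and the MIBenum
           can serve as a key for checking whether two spellings name the same character
           set.  This is a sketch; the string comparison with "eq" is used because the
           value for non-IANA character sets is described above as a string.

```perl
use strict;
use warnings;
use I18N::Charset;

my $mib = charset_name_to_mib('US-ASCII');
if (defined $mib) {
    my $name = mib_to_charset_name($mib);
    print "MIBenum $mib is ", $name // '(unknown)', "\n";

    # Two spellings refer to the same character set if their MIBenums match.
    my $other = charset_name_to_mib('us-ascii');
    print "Same character set\n" if defined $other && $other eq $mib;
}
```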

       map8_charset_name()
           This function takes a string containing the name of a character set (in almost any
           format) and returns a string which contains a name for the character set that can be
           passed to Unicode::Map8::new().  Note: the returned string will be capitalized just
           like the name of the .bin file in the Unicode::Map8::MAPS_DIR directory.  If no valid
           character set name can be identified, then "undef" will be returned.  The case and
           punctuation within the argument string are not important.

               $sCharset = map8_charset_name('windows-1251');

       umap_charset_name()
           This function takes a string containing the name of a character set (in almost any
           format) and returns a string which contains a name for the character set that can be
           passed to Unicode::Map::new(). If no valid character set name can be identified, then
           "undef" will be returned.  The case and punctuation within the argument string are not
           important.

               $sCharset = umap_charset_name('hebrew');

       umu8_charset_name()
           This function takes a string containing the name of a character set (in almost any
           format) and returns a string which contains a name for the character set that can be
           passed to Unicode::MapUTF8::new(). If no valid character set name can be identified,
           then "undef" will be returned.  The case and punctuation within the argument string
           are not important.

               $sCharset = umu8_charset_name('windows-1251');

QUERY ROUTINES

       There is one function which can be used to obtain a list of all IANA-registered character
       set names.

       "all_iana_charset_names()"
           Returns a list of all registered IANA character set names.  The names are not in any
           particular order.
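
       Since the list comes back unordered, a caller will usually sort or filter it.  In
       the sketch below the function is called by its full package name so that the example
       does not depend on whether it is exported; the "8859" filter is only an
       illustration.

```perl
use strict;
use warnings;
use I18N::Charset;

# Sort case-insensitively for stable, readable output.
my @names = sort { lc($a) cmp lc($b) } I18N::Charset::all_iana_charset_names();
printf "%d registered names\n", scalar @names;

# Example filter: names that mention "8859" (the ISO 8859 family).
my @iso8859 = grep { /8859/ } @names;
print "$_\n" for @iso8859;
```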

CHARACTER SET NAME ALIASING

       This module supports several semi-private routines for specifying character set name
       aliases.

       add_iana_alias()
           This function takes two strings: a new alias, and a target IANA Character Set Name (or
           another alias).  It defines the new alias to refer to that character set name (or to
           the character set name to which the second alias refers).

           Returns the target character set name of the successfully installed alias.  Returns
           'undef' if the target character set name, or the character set name to which the
           target alias refers, is not registered.

             I18N::Charset::add_iana_alias('my-alias1' => 'Shift_JIS');

           With this code, "my-alias1" becomes an alias for the existing IANA character set name
           'Shift_JIS'.

             I18N::Charset::add_iana_alias('my-alias2' => 'sjis');

           With this code, "my-alias2" becomes an alias for the IANA character set name referred
           to by the existing alias 'sjis' (which happens to be 'Shift_JIS').
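
           Because failure is reported only through the return value, it is worth checking
           it.  A minimal sketch follows; the alias name 'my-alias3' and the target
           'no-such-charset-xyz' are made up for this example.

```perl
use strict;
use warnings;
use I18N::Charset;

my $target = I18N::Charset::add_iana_alias('my-alias1' => 'Shift_JIS');
if (defined $target) {
    # The new alias now resolves like any other spelling.
    print "my-alias1 => ", iana_charset_name('my-alias1'), "\n";
}
else {
    warn "Could not install alias: target is not a registered name\n";
}

# An unregistered target should leave the alias uninstalled.
my $bad = I18N::Charset::add_iana_alias('my-alias3' => 'no-such-charset-xyz');
print "my-alias3 was not installed\n" unless defined $bad;
```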

       add_map8_alias()
           This function takes two strings: a new alias, and a target Unicode::Map8 Character Set
           Name (or an existing alias to a Map8 name).  It defines the new alias to refer to that
           mapping name (or to the mapping name to which the second alias refers).

           If the first argument is a registered IANA character set name, then all aliases of
           that IANA character set name will end up pointing to the target Map8 mapping name.

           Returns the target mapping name of the successfully installed alias.  Returns 'undef'
           if the target mapping name, or the mapping name to which the target alias refers, is
           not registered.

             I18N::Charset::add_map8_alias('normal' => 'ANSI_X3.4-1968');

           With the above statement, "normal" becomes an alias for the existing Unicode::Map8
           mapping name 'ANSI_X3.4-1968'.

             I18N::Charset::add_map8_alias('normal' => 'US-ASCII');

           With the above statement, "normal" becomes an alias for the existing Unicode::Map8
           mapping name 'ANSI_X3.4-1968' (which is what "US-ASCII" is an alias for).

             I18N::Charset::add_map8_alias('IBM297' => 'EBCDIC-CA-FR');

           With the above statement, "IBM297" becomes an alias for the existing Unicode::Map8
           mapping name 'EBCDIC-CA-FR'.  As a side effect, all the aliases for 'IBM297' (i.e.
           'cp297' and 'ebcdic-cp-fr') also become aliases for 'EBCDIC-CA-FR'.

       add_umap_alias()
           This function works identically to add_map8_alias() above, but operates on
           Unicode::Map encoding tables.

       add_libi_alias()
           This function takes two strings: a new alias, and a target iconv Character Set Name
           (or existing iconv alias).  It defines the new alias to refer to that character set
           name (or to the character set name to which the existing alias refers).

           Returns the target conversion scheme name of the successfully installed alias.
           Returns 'undef' if there is no such target conversion scheme or alias.

           Examples:

             I18N::Charset::add_libi_alias('my-chinese1' => 'CN-GB');

           With this code, "my-chinese1" becomes an alias for the existing iconv conversion
           scheme 'CN-GB'.

             I18N::Charset::add_libi_alias('my-chinese2' => 'EUC-CN');

           With this code, "my-chinese2" becomes an alias for the iconv conversion scheme
           referred to by the existing alias 'EUC-CN' (which happens to be 'CN-GB').

       add_enco_alias()
           This function takes two strings: a new alias, and a target Encode encoding Name (or
           existing Encode alias).  It defines the new alias referring to that encoding name (or
           to the encoding to which the existing alias refers).

           Returns the target encoding name of the successfully installed alias.  Returns 'undef'
           if there is no such encoding or alias.

           Examples:

             I18N::Charset::add_enco_alias('my-japanese1' => 'jis0201-raw');

           With this code, "my-japanese1" becomes an alias for the existing encoding
           'jis0201-raw'.

             I18N::Charset::add_enco_alias('my-japanese2' => 'my-japanese1');

           With this code, "my-japanese2" becomes an alias for the encoding referred to by the
           existing alias 'my-japanese1' (which happens to be 'jis0201-raw' after the previous
           call).
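
       The chained aliases above can be installed defensively, since add_enco_alias()
       returns undef when Encode or the target encoding is unavailable.  The following is
       a sketch under that assumption, reusing the alias names from the examples above.

```perl
use strict;
use warnings;
use I18N::Charset;

my $first  = I18N::Charset::add_enco_alias('my-japanese1' => 'jis0201-raw');
my $second = defined $first
    ? I18N::Charset::add_enco_alias('my-japanese2' => 'my-japanese1')
    : undef;

if (defined $second) {
    # Both aliases should now lead to the same Encode encoding name.
    print "my-japanese1 => $first\n";
    print "my-japanese2 => $second\n";
}
else {
    print "Alias not installed (Encode or the target encoding may be unavailable)\n";
}
```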

KNOWN BUGS AND LIMITATIONS

       ·   There could probably be many more aliases added (for convenience) to all the IANA
           names.  If you have some specific recommendations, please email the author!

       ·   The only character set names which have a corresponding mapping in the Unicode::Map8
           module are the character sets that Unicode::Map8 can convert.

           Similarly, the only character set names which have a corresponding mapping in the
           Unicode::Map module are the character sets that Unicode::Map can convert.

       ·   In the current implementation, all tables are read in and initialized when the module
           is loaded, and then held in memory until the program exits.  A "lazy" implementation
           (or a less-portable tied hash) might lead to a shorter startup time.  Suggestions,
           patches, comments are always welcome!
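
       Until the module itself loads lazily, a caller that only occasionally needs a
       lookup can defer the startup cost by loading the module at run time.  This is a
       caller-side sketch, not a change to I18N::Charset; the wrapper name
       lazy_iana_charset_name() is hypothetical, and whether it helps depends on how often
       the program actually reaches the lookup.

```perl
use strict;
use warnings;

# Hypothetical wrapper: load I18N::Charset at run time, on first use,
# instead of at compile time with "use".
sub lazy_iana_charset_name {
    my ($name) = @_;
    require I18N::Charset;    # a no-op after the first successful load
    return I18N::Charset::iana_charset_name($name);
}

print lazy_iana_charset_name('WinCyrillic'), "\n";
```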

AUTHOR

       Martin 'Kingpin' Thurn, "mthurn at cpan.org", <http://tinyurl.com/nn67z>.

LICENSE

       This module is free software; you can redistribute it and/or modify it under the same
       terms as Perl itself.
