Ubuntu Manpage: HTML::Encoding - Determine the encoding of HTML/XML/XHTML documents

Provided by: libhtml-encoding-perl_0.61-1_all

NAME

       HTML::Encoding - Determine the encoding of HTML/XML/XHTML documents

SYNOPSIS

         use HTML::Encoding 'encoding_from_http_message';
         use LWP::UserAgent;
         use Encode;

         my $resp = LWP::UserAgent->new->get('http://www.example.org');
         my $enco = encoding_from_http_message($resp);
         my $utf8 = decode($enco => $resp->content);

WARNING

       The interface and implementation are guaranteed to change before this module reaches version 1.00! Please
       send feedback to the author of this module.

DESCRIPTION

       HTML::Encoding helps to determine the encoding of HTML and XML/XHTML documents...

DEFAULT ENCODINGS

       Most routines need to know some suspected character encodings which can be provided through the
       "encodings" option. This option always defaults to the $HTML::Encoding::DEFAULT_ENCODINGS array reference
       which means the following encodings are considered by default:

         * ISO-8859-1
         * UTF-16LE
         * UTF-16BE
         * UTF-32LE
         * UTF-32BE
         * UTF-8

       If you change the values or pass custom values to the routines note that Encode must support them in
       order for this module to work correctly.
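
       Whether Encode supports a given name can be checked before the list is handed to any routine. The
       following is a minimal sketch that relies only on Encode::find_encoding; the candidate list, including
       the deliberately bogus name, is an assumption made for illustration.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical candidate list; 'no-such-encoding' is deliberately invalid.
my @candidates = qw(UTF-8 UTF-16LE ISO-8859-1 no-such-encoding);

# Encode::find_encoding returns undef for names Encode cannot resolve.
my @supported = grep { defined Encode::find_encoding($_) } @candidates;

print "usable: @supported\n";
```

       The filtered list could then be passed to a routine through the "encodings" option as a reference,
       i.e. \@supported.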

ENCODING SOURCES

       "encoding_from_xml_document", "encoding_from_html_document", and "encoding_from_http_message" return the
       encoding source and the encoding name in list context. Possible encoding sources are:

         * protocol         (Content-Type: text/html;charset=encoding)
         * bom              (leading U+FEFF)
         * xml              (<?xml version='1.0' encoding='encoding'?>)
         * meta             (<meta http-equiv=...)
         * default          (default fallback value)
         * protocol_default (protocol default)

ROUTINES

       Routines exported by this module at user option. By default, nothing is exported.

       encoding_from_content_type($content_type)
         Takes  a  byte  string  and  uses  HTTP::Headers::Util  to  extract  the  charset  parameter  from  the
         "Content-Type" header value and returns its value or "undef" (or an empty  list  in  list  context)  if
         there  is  no  such  value.  Only  the  first  component will be examined (HTTP/1.1 only allows for one
         component), any backslash escapes in strings will be unescaped, all leading and  trailing  quote  marks
         and  white-space characters will be removed, all white-space will be collapsed to a single space, empty
         charset values will be ignored and no case folding is performed.

         Examples:

           +-----------------------------------------+-----------+
           | encoding_from_content_type(...)         | returns   |
           +-----------------------------------------+-----------+
           | "text/html"                             | undef     |
           | "text/html,text/plain;charset=utf-8"    | undef     |
           | "text/html;charset="                    | undef     |
           | "text/html;charset=\"\\u\\t\\f\\-\\8\"" | 'utf-8'   |
           | "text/html;charset=utf\\-8"             | 'utf\\-8' |
           | "text/html;charset='utf-8'"             | 'utf-8'   |
           | "text/html;charset=\" UTF-8 \""         | 'UTF-8'   |
           +-----------------------------------------+-----------+

         If you pass a string with the UTF-8 flag turned on the string will be converted to bytes before  it  is
         passed  to  HTTP::Headers::Util.   The return value will thus never have the UTF-8 flag turned on (this
         might change in future versions).

       encoding_from_byte_order_mark($octets [, %options])
         Takes a sequence of octets and attempts to read a byte  order  mark  at  the  beginning  of  the  octet
         sequence.  It  will  go  through the list of $options{encodings} or the list of default encodings if no
         encodings are specified and match the beginning of  the  string  against  any  byte  order  mark  octet
         sequence found.

         The result can be ambiguous: for example, qq(\xFF\xFE\x00\x00) could be either a complete UTF-32LE BOM
         or a UTF-16LE BOM followed by a U+0000 character. It is also possible that $octets starts with
         something that looks like a byte order mark but actually is not.

         encoding_from_byte_order_mark sorts the list of possible encodings by the length  of  their  BOM  octet
         sequence  and  returns  in  scalar  context only the encoding with the longest match, and all encodings
         ordered by length of their BOM octet sequence in list context.

         Examples:

           +-------------------------+------------+-----------------------+
           | Input                   | Encodings  | Result                |
           +-------------------------+------------+-----------------------+
           | "\xFF\xFE\x00\x00"      | default    | qw(UTF-32LE)          |
           | "\xFF\xFE\x00\x00"      | default    | qw(UTF-32LE UTF-16LE) |
           | "\xEF\xBB\xBF"          | default    | qw(UTF-8)             |
           | "Hello World!"          | default    | undef                 |
           | "\xDD\x73\x66\x73"      | default    | undef                 |
           | "\xDD\x73\x66\x73"      | UTF-EBCDIC | qw(UTF-EBCDIC)        |
           | "\x2B\x2F\x76\x38\x2D"  | default    | undef                 |
           | "\x2B\x2F\x76\x38\x2D"  | UTF-7      | qw(UTF-7)             |
           +-------------------------+------------+-----------------------+

         Note however that for UTF-7 it is in theory possible that the U+FEFF combines with other characters  in
         which case such detection would fail, for example consider:

           +--------------------------------------+-----------+-----------+
           | Input                                | Encodings | Result    |
           +--------------------------------------+-----------+-----------+
           | "\x2B\x2F\x76\x38\x41\x39\x67\x2D"   | default   | undef     |
           | "\x2B\x2F\x76\x38\x41\x39\x67\x2D"   | UTF-7     | undef     |
           +--------------------------------------+-----------+-----------+

         This might change in future versions, although this is not very relevant for most applications as there
         should never be a need to use UTF-7 in the encoding list for existing documents.

         If  no  BOM  can  be found it returns "undef" in scalar context and an empty list in list context. This
         routine should not be used with strings with the UTF-8 flag turned on.
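
         The longest-match ordering described above can be sketched with core Perl alone. This is not the
         module's implementation: the %BOM table and the helper sketch_bom_candidates are hypothetical names
         introduced for illustration, and only the Unicode encodings from the default list are covered.

```perl
use strict;
use warnings;

# BOM octet sequences for the Unicode encodings in the default list.
my %BOM = (
    'UTF-8'    => "\xEF\xBB\xBF",
    'UTF-16LE' => "\xFF\xFE",
    'UTF-16BE' => "\xFE\xFF",
    'UTF-32LE' => "\xFF\xFE\x00\x00",
    'UTF-32BE' => "\x00\x00\xFE\xFF",
);

# Hypothetical helper, not part of HTML::Encoding: returns every encoding
# whose BOM prefixes $octets, longest BOM first. Call it in list context.
sub sketch_bom_candidates {
    my ($octets) = @_;
    my @hits = grep { substr($octets, 0, length $BOM{$_}) eq $BOM{$_} } keys %BOM;
    return sort { length($BOM{$b}) <=> length($BOM{$a}) or $a cmp $b } @hits;
}

my @all    = sketch_bom_candidates("\xFF\xFE\x00\x00");  # every candidate
my ($best) = sketch_bom_candidates("\xFF\xFE\x00\x00");  # longest match only
my @none   = sketch_bom_candidates("Hello World!");      # no BOM present

print "all: @all\n";
print "best: $best\n";
```

         For the ambiguous input the sketch lists UTF-32LE before UTF-16LE, mirroring the first two rows of
         the table above.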

       encoding_from_xml_declaration($declaration)
         Attempts to extract the  value  of  the  encoding  pseudo-attribute  in  an  XML  declaration  or  text
         declaration  in  the  character  string  $declaration.  If  there does not appear to be such a value it
         returns nothing. This would typically be used with the return  values  of  xml_declaration_from_octets.
         Normalizes whitespace in the same way as encoding_from_content_type.

         Examples:

           +-------------------------------------------+---------+
           | encoding_from_xml_declaration(...)        | Result  |
           +-------------------------------------------+---------+
           | "<?xml version='1.0' encoding='utf-8'?>"  | 'utf-8' |
           | "<?xml encoding='utf-8'?>"                | 'utf-8' |
           | "<?xml encoding=\"utf-8\"?>"              | 'utf-8' |
           | "<?xml foo='bar' encoding='utf-8'?>"      | 'utf-8' |
           | "<?xml encoding='a' encoding='b'?>"       | 'a'     |
           | "<?xml encoding=' a    b '?>"             | 'a b'   |
           | "<?xml-stylesheet encoding='utf-8'?>"     | undef   |
           | " <?xml encoding='utf-8'?>"               | undef   |
           | "<?xml encoding =\x{2028}'utf-8'?>"       | 'utf-8' |
           | "<?xml version='1.0' encoding=utf-8?>"    | undef   |
           | "<?xml x='encoding=\"a\"' encoding='b'?>" | 'a'     |
           +-------------------------------------------+---------+

         Note  that  encoding_from_xml_declaration()  determines the encoding even if the XML declaration is not
         well-formed or violates other requirements of the relevant XML specification as long as it can find  an
         encoding  pseudo-attribute  in the provided string. This means XML processors must apply further checks
         to determine whether the entity is well-formed, etc.

       xml_declaration_from_octets($octets [, %options])
         Attempts to find a ">" character in the byte string $octets using the encodings in $options{encodings}
         and upon success attempts to find a preceding "<" character. Returns all the strings found this way,
         ordered by number of successful matches, in list context and the best match in scalar context. Should
         probably be combined with the only user of this routine, encoding_from_xml_declaration... You can modify
         the list of suspected encodings using $options{encodings}.

       encoding_from_first_chars($octets [, %options])
         Assuming   that   documents   start   with   "<"   optionally   preceded   by   whitespace  characters,
         encoding_from_first_chars attempts to determine an encoding by matching $octets against something  like
         /^[@{$options{whitespace}}]*</ in the various suspected $options{encodings}.

         This is useful to distinguish e.g. UTF-16LE from UTF-8 if the byte string starts with neither a byte
         order mark nor an XML declaration (e.g. if the document is an HTML document), to get at least a base
         encoding which can be used to decode enough of the document to find <meta> elements using
         encoding_from_meta_element. $options{whitespace} defaults to qw/CR LF SP TB/. Returns nothing if
         unsuccessful. Returns the matching encodings in order of the number of octets matched in list context
         and the best match in scalar context.

         Examples:

           +---------------+----------+---------------------+
           | String        | Encoding | Result              |
           +---------------+----------+---------------------+
           | '<!DOCTYPE '  | UTF-16LE | UTF-16LE            |
           | ' <!DOCTYPE ' | UTF-16LE | UTF-16LE            |
           | '...'         | UTF-16LE | undef               |
           | '...<'        | UTF-16LE | undef               |
           | '<'           | UTF-8    | ISO-8859-1 or UTF-8 |
           | "<!--\xF6-->" | UTF-8    | ISO-8859-1 or UTF-8 |
           +---------------+----------+---------------------+
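
         The idea can be sketched using only Encode: encode "<" and the whitespace characters in each
         suspected encoding and compare the results against the start of the octets. The helper
         sketch_first_chars is a hypothetical name, not part of this module. It assumes CR, LF, SP and TAB as
         whitespace, and unlike the real routine it does not order results by the number of octets matched.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper, not part of HTML::Encoding: lists the encodings in
# which $octets starts with optional whitespace followed by "<".
sub sketch_first_chars {
    my ($octets, @encodings) = @_;
    my @found;
    for my $enc (@encodings) {
        # Encode the characters of interest in the candidate encoding.
        my $lt = encode($enc, '<');
        my @ws = map { encode($enc, $_) } ("\x0D", "\x0A", "\x20", "\x09");
        my $pos = 0;
        # Skip leading whitespace as it would appear in this encoding.
        SKIP: while (1) {
            for my $w (@ws) {
                next unless substr($octets, $pos, length $w) eq $w;
                $pos += length $w;
                next SKIP;
            }
            last;
        }
        push @found, $enc if substr($octets, $pos, length $lt) eq $lt;
    }
    return @found;
}

my @default = qw(ISO-8859-1 UTF-16LE UTF-16BE UTF-32LE UTF-32BE UTF-8);
my @wide    = sketch_first_chars(encode('UTF-16LE', ' <!DOCTYPE '), @default);
my @narrow  = sketch_first_chars('<', @default);

print "wide: @wide\n";
print "narrow: @narrow\n";
```

         A single "<" octet matches both ISO-8859-1 and UTF-8 in this sketch, which corresponds to the
         "ISO-8859-1 or UTF-8" rows in the table above.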

       encoding_from_meta_element($octets, $encname [, %options])
         Attempts to find <meta> elements in the document using HTML::Parser.  It will attempt to decode  chunks
         of  the  byte  string using $encname to characters before passing the data to HTML::Parser. An optional
         %options hash can be provided which will be passed  to  the  HTML::Parser  constructor.  It  will  stop
         processing the document if it encounters

           * </head>
           * encoding errors
           * the end of the input
           * ... (see todo)

         If relevant <meta> elements, i.e. something like

           <meta http-equiv=Content-Type content='...'>

         are  found,  uses  encoding_from_content_type  to  extract  the  charset parameter. It returns all such
         encodings it could find in document order in list context or the first encoding in scalar  context  (it
         will currently look for others regardless of calling context) or nothing if that fails for some reason.

         Note that there are many edge cases where this does not yield "proper" results, depending on the
         capabilities of the HTML::Parser version and the options you pass to it, for example,

           <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN" [
             <!ENTITY content_type "text/html;charset=utf-8">
           ]>
           <meta http-equiv="Content-Type" content="&content_type;">
           <title></title>
           <p>...</p>

         This would likely not detect the "utf-8" value if HTML::Parser does not resolve the entity. This should
         however only be a concern for documents specifically crafted to break the encoding detection.

       encoding_from_xml_document($octets [, %options])
         Uses encoding_from_byte_order_mark to detect the encoding using a byte order mark in  the  byte  string
         and  returns  the  return  value  of  that routine if it succeeds. Uses xml_declaration_from_octets and
         encoding_from_xml_declaration and returns the encoding for which the latter routine found most  matches
         in scalar context, and all encodings ordered by number of occurrences in list context. It does not
         return a value if neither a byte order mark nor inbound declarations declare a character encoding.

         Examples:

           +----------------------------+----------+-----------+----------+
           | Input                      | Encoding | Encodings | Result   |
           +----------------------------+----------+-----------+----------+
           | "<?xml?>"                  | UTF-16   | default   | UTF-16BE |
           | "<?xml?>"                  | UTF-16LE | default   | undef    |
           | "<?xml encoding='utf-8'?>" | UTF-16LE | default   | utf-8    |
           | "<?xml encoding='utf-8'?>" | UTF-16   | default   | UTF-16BE |
           | "<?xml encoding='cp37'?>"  | CP37     | default   | undef    |
           | "<?xml encoding='cp37'?>"  | CP37     | CP37      | cp37     |
           +----------------------------+----------+-----------+----------+

         Lacking a return value from this routine  and  higher-level  protocol  information  (such  as  protocol
         encoding defaults) processors would be required to assume that the document is UTF-8 encoded.

         Note  however  that  the  return  value  depends  on the set of suspected encodings you pass to it. For
         example, by default, EBCDIC encodings would not be considered and thus for

           <?xml version='1.0' encoding='cp37'?>

         this routine would return the undefined value. You can modify the list  of  suspected  encodings  using
         $options{encodings}.

       encoding_from_html_document($octets [, %options])
         Uses encoding_from_xml_document and encoding_from_meta_element to determine the encoding of HTML
         documents. If $options{xhtml} is set to a false value, it uses encoding_from_byte_order_mark and
         encoding_from_meta_element to determine the encoding. The xhtml option is on by default. The
         $options{encodings} can be used to modify the suspected encodings and $options{parser_options}  can  be
         used to modify the HTML::Parser options in encoding_from_meta_element (see the relevant documentation).

         Returns  nothing if no declaration could be found, the winning declaration in scalar context and a list
         of encoding source and encoding name in list context, see ENCODING SOURCES.

         ...

         Other problems arise from differences between HTML and XHTML syntax and encoding detection  rules,  for
         example, the input could be

           Content-Type: text/html

           <?xml version='1.0' encoding='utf-8'?>
           <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN"
           "http://www.w3.org/TR/html4/strict.dtd">
           <meta http-equiv = "Content-Type"
                    content = "text/html;charset=iso-8859-2">
           <title></title>
           <p>...</p>

         This is a perfectly legal HTML 4.01 document and implementations might be expected to consider the
         document ISO-8859-2 encoded, as XML rules for encoding detection do not apply to HTML documents. This
         module attempts to avoid deciding which rules apply to a specific document and would thus by
         default return 'utf-8' for this input.

         On the other hand, if the input omits the encoding declaration,

           Content-Type: text/html

           <?xml version='1.0'?>
           <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN"
           "http://www.w3.org/TR/html4/strict.dtd">
           <meta http-equiv = "Content-Type"
                    content = "text/html;charset=iso-8859-2">
           <title></title>
           <p>...</p>

         it would return 'iso-8859-2'. Similar problems would arise from other differences between HTML and
         XHTML, for example consider

           Content-Type: text/html

           <?foo >
           <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
               "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
           <html ...
           ?>
           ...
           <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN">
           ...

         If this is processed using HTML rules, the first > will end the processing instruction and the XHTML
         document type declaration would be the relevant declaration for the document. If it is processed using
         XHTML rules, the ?> will end the processing instruction and the HTML document type declaration would be
         the relevant declaration.

         In other words, an application would need to assume a certain character encoding (family) to process enough of the
         document  to  determine  whether  it  is XHTML or HTML and the result of this detection would depend on
         which processing rules are assumed in order to process it.  It is thus in essence not possible to write
         a "perfect" detection algorithm, which is why this routine attempts to avoid making  any  decisions  on
         this matter.

       encoding_from_http_message($message [, %options])
         Determines the encoding of HTML / XML / XHTML documents enclosed in an HTTP message. $message is an
         object compatible with HTTP::Message, e.g. an HTTP::Response object. %options is a hash with the
         following possible entries:

         encodings
           array reference of suspected character encodings, defaults to $HTML::Encoding::DEFAULT_ENCODINGS.

         is_html
           Regular expression matched against the content_type of the message to determine whether to  use  HTML
           rules for the entity body, defaults to "qr{^text/html$}i".

         is_xml
           Regular  expression  matched  against the content_type of the message to determine whether to use XML
           rules for the entity body, defaults to "qr{^.+/(?:.+\+)?xml$}i".

         is_text_xml
           Regular expression matched against the content_type of the message to determine whether to use
           text/xml rules for the message, defaults to "qr{^text/(?:.+\+)?xml$}i". This will only be checked if
           is_xml matches as well.

         html_default
           Default encoding for documents determined (by is_html) as HTML, defaults to "ISO-8859-1".

         xml_default
           Default encoding for documents determined (by is_xml) as XML, defaults to "UTF-8".

         text_xml_default
           Default  encoding for documents determined (by is_text_xml) as text/xml, defaults to "undef" in which
           case the default is ignored. This should be set to "US-ASCII" if desired as this module is by default
           inconsistent with RFC 3023 which requires that for text/xml documents without a charset parameter  in
           the HTTP header "US-ASCII" is assumed.

           This requirement is inconsistent with RFC 2616 (HTTP/1.1), which requires assuming "ISO-8859-1". It
           has been widely ignored and is thus disabled by default.

         xhtml
           Whether the routine should look for an encoding declaration in the XML declaration  of  the  document
           (if any), defaults to 1.

         default
           Whether  the  relevant  default value should be returned when no other information can be determined,
           defaults to 1.

         This is further possibly inconsistent with XML MIME types that differ in other ways from
         application/xml,  for  example  if  the  MIME Type does not allow for a charset parameter in which case
         applications might be expected to ignore the charset parameter if erroneously provided.
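
         How the three default patterns classify common media types can be checked in isolation. The sketch
         below copies the documented default values of is_html, is_xml and is_text_xml. The %rule hash, the
         helper sketch_matching_rules and the sample media types are assumptions for illustration and are
         not part of this module.

```perl
use strict;
use warnings;

# Default patterns as documented above.
my %rule = (
    is_html     => qr{^text/html$}i,
    is_xml      => qr{^.+/(?:.+\+)?xml$}i,
    is_text_xml => qr{^text/(?:.+\+)?xml$}i,
);

# Hypothetical helper: names of the rules matching a media type, sorted by name.
sub sketch_matching_rules {
    my ($type) = @_;
    return grep { $type =~ $rule{$_} } sort keys %rule;
}

for my $type (qw(text/html text/xml application/xhtml+xml image/png)) {
    my @hits = sketch_matching_rules($type);
    printf "%-22s %s\n", $type, @hits ? "@hits" : '(none)';
}
```

         In this sketch text/xml satisfies both is_xml and is_text_xml, which is consistent with is_text_xml
         only being checked when is_xml matches.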

EBCDIC SUPPORT

       By default, this module does not support EBCDIC encodings. To enable support for EBCDIC encodings you can
       either change the $HTML::Encoding::DEFAULT_ENCODINGS array reference or pass the encodings to the
       routines you use using the encodings option, for example

         my @try = qw/UTF-8 UTF-16LE cp500 posix-bc .../;
         my $enc = encoding_from_xml_document($doc, encodings => \@try);

       Note  that  there are some subtle differences between various EBCDIC encodings, for example "!" is mapped
       to 0x5A in "posix-bc" and  to  0x4F  in  "cp500";  these  differences  might  affect  processing  in  yet
       undetermined ways.
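
       The difference mentioned above can be observed directly with Encode. This is a small sketch that
       assumes the installed Encode provides both the "cp500" and "posix-bc" encodings; the %octet hash is
       a name introduced for illustration.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Encode "!" in both EBCDIC variants and show the resulting octet in hex.
my %octet;
for my $enc (qw(cp500 posix-bc)) {
    $octet{$enc} = sprintf '0x%02X', ord encode($enc, '!');
    print "$enc: $octet{$enc}\n";
}
```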

TODO

         * bundle with test suite
         * optimize some routines to give up once successful
         * avoid transcoding for HTML::Parser if e.g. ISO-8859-1
         * consider adding a "HTML5" mode of operation?

AUTHOR / COPYRIGHT / LICENSE

         Copyright (c) 2004-2008 Bjoern Hoehrmann <bjoern@hoehrmann.de>.
         This module is licensed under the same terms as Perl itself.

perl v5.10.1                                       2010-09-24                                HTML::Encoding(3pm)
