Provided by: libhtml-defang-perl_1.07-2_all

NAME

       HTML::Defang - Cleans HTML as well as CSS of scripting and other executable contents, and neutralises XSS
       attacks.

SYNOPSIS

         use HTML::Defang;

         my $InputHtml = "<html><body></body></html>";

         my $Defang = HTML::Defang->new(
           context => $Self,
           fix_mismatched_tags => 1,
           tags_to_callback => [ qw(br embed img) ],
           tags_callback => \&DefangTagsCallback,
           url_callback => \&DefangUrlCallback,
           css_callback => \&DefangCssCallback,
           attribs_to_callback => [ qw(border src) ],
           attribs_callback => \&DefangAttribsCallback,
           content_callback => \&DefangContentCallback,
         );

         my $SanitizedHtml = $Defang->defang($InputHtml);

         # Callback for custom handling specific HTML tags
         sub DefangTagsCallback {
           my ($Self, $Defang, $OpenAngle, $lcTag, $IsEndTag, $AttributeHash, $CloseAngle, $HtmlR, $OutR) = @_;

           # Explicitly defang this tag, even though safe
           return DEFANG_ALWAYS if $lcTag eq 'br';

           # Explicitly whitelist this tag, even though unsafe
           return DEFANG_NONE if $lcTag eq 'embed';

           # I am not sure what to do with this tag, so process as HTML::Defang normally would
           return DEFANG_DEFAULT if $lcTag eq 'img';
         }

         # Callback for custom handling URLs in HTML attributes as well as style tag/attribute declarations
         sub DefangUrlCallback {
           my ($Self, $Defang, $lcTag, $lcAttrKey, $AttrValR, $AttributeHash, $HtmlR) = @_;

           # Explicitly allow this URL in tag attributes or stylesheets
           return DEFANG_NONE if $$AttrValR =~ /safesite\.com/i;

           # Explicitly defang this URL in tag attributes or stylesheets
           return DEFANG_ALWAYS if $$AttrValR =~ /evilsite\.com/i;
         }

         # Callback for custom handling style tags/attributes
         sub DefangCssCallback {
           my ($Self, $Defang, $Selectors, $SelectorRules, $Tag, $IsAttr) = @_;
           my $i = 0;
           foreach (@$Selectors) {
             my $SelectorRule = $$SelectorRules[$i];
             foreach my $KeyValueRules (@$SelectorRule) {
               foreach my $KeyValueRule (@$KeyValueRules) {
                 my ($Key, $Value) = @$KeyValueRule;

                 # Comment out any '!important' directive
                 $$KeyValueRule[2] = DEFANG_ALWAYS if $Value =~ /!important/;

                 # Comment out any 'position: fixed' declaration
                 $$KeyValueRule[2] = DEFANG_ALWAYS if $Key =~ /position/ && $Value =~ /fixed/;
               }
             }
             $i++;
           }
         }

         # Callback for custom handling HTML tag attributes
         sub DefangAttribsCallback {
           my ($Self, $Defang, $lcTag, $lcAttrKey, $AttrValR, $HtmlR) = @_;

           # Change all 'border' attribute values to zero.
           $$AttrValR = '0' if $lcAttrKey eq 'border';

           # Defang all 'src' attributes
           return DEFANG_ALWAYS if $lcAttrKey eq 'src';

           return DEFANG_NONE;
         }

         # Callback for all content between tags (except <style>, <script>, etc)
         sub DefangContentCallback {
           my ($Self, $Defang, $ContentR) = @_;

           $$ContentR =~ s/remove this content//;
         }

DESCRIPTION

       This module accepts an input HTML and/or CSS string and removes any executable code including scripting,
       embedded objects, applets, etc., and neutralises any XSS attacks. A whitelist based approach is used
       which means only HTML known to be safe is allowed through.

       HTML::Defang uses a custom HTML tag parser. The parser has been designed and tested to work with nasty
       real-world HTML and to emulate as closely as possible what browsers actually do with strange-looking
       constructs. The test suite is built from examples drawn from a range of sources, such as
       http://ha.ckers.org/xss.html and http://imfo.ru/csstest/css_hacks/import.php, to ensure that as many
       XSS attack scenarios as possible are dealt with.

       HTML::Defang can make callbacks to client code when it encounters the following:

       •   When a specified tag is parsed

       •   When a specified attribute is parsed

       •   When a URL is parsed as part of an HTML attribute, or CSS property value.

       •   When style data is parsed, as part of an HTML style attribute, or as part of an HTML <style> tag.

       The callbacks include details about the current tag/attribute that is being parsed, and also give a
       scalar reference to the input HTML. Querying pos() on the input HTML indicates how far the module has
       got with parsing. This gives the client code flexibility in working with HTML::Defang.

       HTML::Defang can defang whole tags, any attribute in a tag, any URL that appears as an attribute or style
       property, or any CSS declaration in a declaration block in a style rule. This makes it possible to block
       only the most specific unwanted element in the content (for example, just an offending attribute
       instead of the whole tag), while retaining any safe HTML/CSS.

CONSTRUCTOR

       HTML::Defang->new(%Options)
           Constructs a new HTML::Defang object. The following options are supported:

           Options
               tags_to_callback
                   Array reference of tags for which a callback should be made. If a tag in this array is
                   parsed, the subroutine tags_callback() is invoked.

               attribs_to_callback
                   Array reference of tag attributes for which a callback should be made. If an attribute in
                   this array is parsed, the subroutine attribs_callback() is invoked.

               tags_callback
                   Subroutine reference to be invoked when a tag listed in @$tags_to_callback is parsed.

               attribs_callback
                   Subroutine reference to be invoked when an attribute listed in @$attribs_to_callback is
                   parsed.

               url_callback
                   Subroutine reference to be invoked when a URL is detected in an HTML tag attribute or a CSS
                   property.

               css_callback
                   Subroutine reference to be invoked when CSS data is found either as the contents of a 'style'
                   attribute in an HTML tag, or as the contents of a <style> HTML tag.

               content_callback
                   Subroutine reference to be invoked when standard content between HTML tags is found.

               fix_mismatched_tags
                   If set, mismatched tags in the HTML input are fixed. By default, the tags in the default
                   %mismatched_tags_to_fix hash are fixed. This set of tags can be overridden by passing an
                   array reference as the mismatched_tags_to_fix option to the constructor. Any opened tag in
                   the set is automatically closed if no corresponding closing tag is found. If an unbalanced
                   closing tag is found, it is commented out.

               mismatched_tags_to_fix
                   Array reference of tags that are checked for matching opening and closing tags.
                   See the fix_mismatched_tags option.

               context
                   You can pass an arbitrary scalar as a 'context' value that's then passed as the first
                   parameter to all callback functions. Most commonly this is something like '$Self'.

               allow_double_defang
                   If this is true, then tag names and attribute names which already begin with the defang
                   string ("defang_" by default) will have an additional copy of the defang string prepended if
                   they are flagged to be defanged by the return value of a callback, or if the tag or attribute
                   name is unknown.

                   The default is to assume that tag names and attribute names beginning with the defang string
                   are already made safe, and need no further modification, even if they are flagged to be
                   defanged by the return value of a callback.  Any tag or attribute modifications made directly
                   by a callback are still performed.

               delete_defang_content
                   Normally defanged tags are turned into comments and prefixed by defang_, and defanged styles
                   are surrounded by /* ... */. If this is set to true, then defanged content is deleted instead.

               Debug
                   If set, prints debugging output.

       HTML::Defang->new_bodyonly(%Options)
           Constructs a new HTML::Defang object that has the following implicit options:

           fix_mismatched_tags = 1
           delete_defang_content = 1
           tags_to_callback = [ qw(html head link body meta title bgsound) ]
           tags_callback = { ... remove all above tags and related content ... }
           url_callback = { ... explicitly DEFANG_NONE to leave everything alone ... }

           Basically this is an easy way to remove all HTML boilerplate content and return only the HTML body
           content.

CALLBACK METHODS

       COMMON PARAMETERS
           A number of the callbacks share the same parameters. These common parameters are documented here.
           Certain variables may have specific meanings in certain callbacks, so be sure to check the
           documentation for that method first before referring to this section.

           $context
               You can pass an arbitrary scalar as a 'context' value that's then passed as the first parameter
               to all callback functions. Most commonly this is something like '$Self'.

           $Defang
               Current HTML::Defang instance

           $OpenAngle
               The opening angle bracket (<) of the current tag.

           $lcTag
               Lower case version of the HTML tag that is currently being parsed.

           $IsEndTag
               Has the value '/' if the current tag is a closing tag.

           $AttributeHash
               A reference to a hash containing the attributes of the current tag and their values. Each value
               is a scalar reference to the value, rather than just a scalar value. You can add attributes
               (remember to make it a scalar ref, eg $AttributeHash->{"newattr"} = \"newval"), delete attributes,
               or modify attribute values in this hash, and any changes you make will be incorporated into the
               output HTML stream.

               The attribute values will have any entity references decoded before being passed to you, and any
               unsafe values will be re-encoded back into the HTML stream.

               So for instance, the tag:

                 <div title="&lt;&quot;Hi there &#x003C;">

               Will have the attribute hash:

                 { title => \q[<"Hi there <] }

               And will be turned back into the HTML on output:

                 <div title="&lt;&quot;Hi there &lt;">

           $CloseAngle
               Anything after the end of the last attribute, including the closing angle bracket (>).

           $HtmlR
               A scalar reference to the input HTML. The input HTML is parsed using m/\G$SomeRegex/gc constructs,
               so clients can use m/\G$SomeRegex/gc on the input to resume parsing from where HTML::Defang left
               off. One can also use the pos() function to determine where HTML::Defang left off. This, combined
               with the add_to_output() method, should give reasonable flexibility for the client to process the
               input.

           $OutR
               A scalar reference to the processed output HTML so far.
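
           The $AttributeHash and $HtmlR conventions described above can be tried out without the module. The
           following is a minimal sketch in plain Perl: the hash and the HTML string are hand-built stand-ins,
           not data produced by HTML::Defang, and the first regex only pretends to be the parser consuming a tag.
           Note that /g is needed alongside /c for pos() to advance.

```perl
use strict;
use warnings;

# --- $AttributeHash: every value is a scalar reference ---
# Hand-built stand-in for the hash a callback receives (already entity-decoded).
my $AttributeHash = {
  title => \q[<"Hi there <],
  class => \q[note],
};

my $Title = ${ $AttributeHash->{title} };        # read by dereferencing
$AttributeHash->{class}   = \"note important";   # modify: store a new reference
$AttributeHash->{newattr} = \"newval";           # add: the value must be a scalar ref
delete $AttributeHash->{title};                  # delete

my @Keys = sort keys %$AttributeHash;

# --- $HtmlR: continue scanning with \G and pos() ---
my $Html  = '<b>bold</b> trailing text';
my $HtmlR = \$Html;

# Pretend the parser has just consumed the opening <b> tag.
$$HtmlR =~ m/\G<b>/gc or die "expected <b>";
my $PosAfterTag = pos($$HtmlR);

# Resume from that position and take the text up to the next '<'.
my $Text = '';
$Text = $1 if $$HtmlR =~ m/\G([^<]*)/gc;
my $PosAfterText = pos($$HtmlR);

# With /c, a failed match leaves pos() unchanged (the next tag is </b>, not <i>).
my $SawItalic    = $$HtmlR =~ m/\G<i>/gc ? 1 : 0;
my $PosAfterFail = pos($$HtmlR);

print "keys: @Keys\n";
print "text: $Text, positions: $PosAfterTag $PosAfterText $PosAfterFail\n";
```

           Inside a real callback the same operations would be applied to the $AttributeHash and $HtmlR
           arguments passed in by HTML::Defang.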

       tags_callback($context, $Defang, $OpenAngle, $lcTag, $IsEndTag, $AttributeHash, $CloseAngle, $HtmlR,
       $OutR)
           If $Defang->{tags_callback} exists, and HTML::Defang has parsed a tag present in
           $Defang->{tags_to_callback}, the above callback is made to the client code. The return value of this
           method determines whether the tag is defanged or not. More details below.

           Return values
               DEFANG_NONE
                   The current tag will not be defanged.

               DEFANG_ALWAYS
                   The current tag will be defanged.

               DEFANG_DEFAULT
                    The current tag will be processed normally by HTML::Defang as if there was no callback method
                    specified.
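
           As a rough illustration of the three return values, the sketch below uses a lookup table and falls
           back to DEFANG_DEFAULT for any tag without an entry. It is self-contained: the constants are local
           stand-ins with placeholder values (real code would use the ones provided by HTML::Defang), and
           %TagDecision and TagsCallback are hypothetical names.

```perl
use strict;
use warnings;

# Stand-ins so the sketch runs on its own; the values are placeholders.
use constant { DEFANG_NONE => 0, DEFANG_ALWAYS => 1, DEFANG_DEFAULT => 2 };

# Hypothetical per-tag policy.
my %TagDecision = (
  br    => DEFANG_ALWAYS,   # defang even though normally safe
  embed => DEFANG_NONE,     # let through even though normally unsafe
);

sub TagsCallback {
  my ($Context, $Defang, $OpenAngle, $lcTag, $IsEndTag, $AttributeHash, $CloseAngle, $HtmlR, $OutR) = @_;

  # Use // rather than || so that a false-valued constant is not mistaken
  # for a missing entry; tags with no entry get the default treatment.
  return $TagDecision{$lcTag} // DEFANG_DEFAULT;
}

# Direct calls with made-up arguments, in place of calls from the parser.
for my $Tag (qw(br embed img)) {
  my $Result = TagsCallback(undef, undef, '<', $Tag, '', {}, '>', undef, undef);
  print "$Tag => $Result\n";
}
```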

       attribs_callback($context, $Defang, $lcTag, $lcAttrKey, $AttrVal, $HtmlR, $OutR)
           If $Defang->{attribs_callback} exists, and HTML::Defang has parsed an attribute present in
           $Defang->{attribs_to_callback}, the above callback is made to the client code. The return value of
           this method determines whether the attribute is defanged or not. More details below.

           Method parameters
               $lcAttrKey
                   Lower case version of the HTML attribute that is currently being parsed.

               $AttrVal
                   Reference to the HTML attribute value that is currently being parsed.

                   See $AttributeHash for details of decoding.

           Return values
               DEFANG_NONE
                   The current attribute will not be defanged.

               DEFANG_ALWAYS
                   The current attribute will be defanged.

               DEFANG_DEFAULT
                    The current attribute will be processed normally by HTML::Defang as if there was no callback
                    method specified.

       url_callback($context, $Defang, $lcTag, $lcAttrKey, $AttrVal, $AttributeHash, $HtmlR, $OutR)
           If $Defang->{url_callback} exists, and HTML::Defang has parsed a URL, the above callback is made to
           the client code. The return value of this method determines whether the attribute containing the URL
           is defanged or not. URL callbacks can be made from <style> tags as well as style attributes, in which
           case the particular style declaration will be commented out. More details below.

           Method parameters
               $lcAttrKey
                    Lower case version of the HTML attribute that is currently being parsed. However, if this
                    callback is made as a result of parsing a URL in a style attribute, $lcAttrKey will be set to
                    the string 'style', and it will be set to undef if this callback is made as a result of
                    parsing a URL inside a <style> tag.

               $AttrVal
                   Reference to the URL value that is currently being parsed.

               $AttributeHash
                    A reference to a hash containing the attributes of the current tag and their values. Each
                    value is a scalar reference to the value, rather than just a scalar value. You can add
                    attributes (remember to make it a scalar ref, eg $AttributeHash->{"newattr"} = \"newval"),
                    delete attributes, or modify attribute values in this hash, and any changes you make will be
                    incorporated into the output HTML stream. Will be set to undef if the callback is made due to
                    a URL in a <style> tag or attribute.

           Return values
               DEFANG_NONE
                   The current URL will not be defanged.

               DEFANG_ALWAYS
                   The current URL will be defanged.

               DEFANG_DEFAULT
                    The current URL will be processed normally by HTML::Defang as if there was no callback method
                    specified.
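
           One possible shape for a URL callback is a host allowlist, sketched below under stated assumptions.
           The constants are local stand-ins with placeholder values, and %AllowedHost and UrlCallback are
           hypothetical names. The host extraction is deliberately simple: anything that is not a plain
           absolute http(s) URL is handed back to the default processing.

```perl
use strict;
use warnings;

# Stand-ins so the sketch runs on its own; the values are placeholders.
use constant { DEFANG_NONE => 0, DEFANG_ALWAYS => 1, DEFANG_DEFAULT => 2 };

# Hypothetical allowlist of exact host names.
my %AllowedHost = map { $_ => 1 } qw(safesite.com www.safesite.com);

sub UrlCallback {
  my ($Context, $Defang, $lcTag, $lcAttrKey, $AttrVal, $AttributeHash, $HtmlR, $OutR) = @_;

  # $AttrVal is a reference to the URL string.
  my $Url = $$AttrVal;

  # Capture the host of an absolute http(s) URL. The host must be followed by
  # an optional port and then a path, query, fragment or the end of the string,
  # so URLs with a user:password@ part do not match.
  my ($Host) = $Url =~ m{^\s*https?://([^/:?#\s@]+)(?::\d+)?(?:[/?#]|\s*$)}i;

  # Relative or unusual URLs: leave the decision to the default rules.
  return DEFANG_DEFAULT unless defined $Host;

  return $AllowedHost{ lc $Host } ? DEFANG_NONE : DEFANG_ALWAYS;
}

# Direct calls with made-up arguments, in place of calls from the parser.
for my $Url ('http://www.safesite.com/a.png', 'https://evilsite.com/x', 'images/local.png') {
  my $Result = UrlCallback(undef, undef, 'img', 'src', \$Url, {}, undef, undef);
  print "$Url => $Result\n";
}
```

           Matching the whole host name avoids the substring problem of a pattern such as /safesite\.com/,
           which would also accept a URL like http://safesite.com.example.org/.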

       css_callback($context, $Defang, $Selectors, $SelectorRules, $lcTag, $IsAttr, $OutR)
           If $Defang->{css_callback} exists, and HTML::Defang has parsed a <style> tag or style attribute, the
           above callback is made to the client code. The return value of this method determines whether a
           particular declaration in the style rules is defanged or not. More details below.

           Method parameters
               $Selectors
                   Reference to an array containing the selectors in a style tag or attribute.

               $SelectorRules
                   Reference to an array containing the style declaration blocks of all selectors in a style tag
                   or attribute. Consider the below CSS:

                     a { b:c; d:e}
                     j { k:l; m:n}

                   The declaration blocks will get parsed into the following data structure:

                     [
                       [
                         [ "b", "c", DEFANG_DEFAULT ],
                         [ "d", "e", DEFANG_DEFAULT ]
                       ],
                       [
                         [ "k", "l", DEFANG_DEFAULT ],
                         [ "m", "n", DEFANG_DEFAULT ]
                       ]
                     ]

                   So, generally each property:value pair in a declaration is parsed into an array of the form

                     ["property", "value", X]

                    where X can be DEFANG_NONE, DEFANG_ALWAYS or DEFANG_DEFAULT, with DEFANG_DEFAULT as the initial
                    value. A client can change this value to tell HTML::Defang how to treat this
                    property:value pair.

                    DEFANG_NONE - Do not defang

                    DEFANG_ALWAYS - Defang the property:value pair

                   DEFANG_DEFAULT - Process this as if there is no callback specified

               $IsAttr
                   True if the currently processed item is a style attribute. False if the currently processed
                   item is a style tag.
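
           A minimal sketch of flagging declarations, assuming $SelectorRules has exactly the layout shown
           above (one array of [property, value, flag] entries per selector). The SYNOPSIS callback iterates
           one level deeper, so it is worth dumping the real structure (for example with Data::Dumper) before
           relying on this. The constants are local stand-ins with placeholder values and the data is
           hand-built.

```perl
use strict;
use warnings;

# Stand-ins so the sketch runs on its own; the values are placeholders.
use constant { DEFANG_NONE => 0, DEFANG_ALWAYS => 1, DEFANG_DEFAULT => 2 };

# Hand-built parse of:  a { b:c; position:fixed }   j { k:l !important; m:n }
my $Selectors     = [ 'a', 'j' ];
my $SelectorRules = [
  [ [ 'b', 'c',            DEFANG_DEFAULT ], [ 'position', 'fixed', DEFANG_DEFAULT ] ],
  [ [ 'k', 'l !important', DEFANG_DEFAULT ], [ 'm',        'n',     DEFANG_DEFAULT ] ],
];

# Walk the selectors and their declaration blocks in parallel.
for my $i (0 .. $#$Selectors) {
  for my $Declaration (@{ $SelectorRules->[$i] }) {
    my ($Property, $Value) = @$Declaration;

    # Flag '!important' values and 'position: fixed' by changing the third element.
    if ($Value =~ /!important/i
        or (lc($Property) eq 'position' and $Value =~ /\bfixed\b/i)) {
      $Declaration->[2] = DEFANG_ALWAYS;
    }
  }
}

# Show the resulting flag for every declaration.
for my $i (0 .. $#$Selectors) {
  print "$Selectors->[$i]: ",
    join(', ', map { "$_->[0]=$_->[2]" } @{ $SelectorRules->[$i] }), "\n";
}
```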

METHODS

       PUBLIC METHODS
           defang($InputHtml, \%Opts)
               Removes any executable code, including scripting, embedded objects, applets, etc., from
               $InputHtml and defangs any XSS attacks.

               Method parameters
                   $InputHtml
                       The input HTML string that needs to be sanitized.

               Returns the cleaned HTML. If fix_mismatched_tags is set, any unbalanced tags listed in
               @$mismatched_tags_to_fix are automatically commented out or closed.

           add_to_output($String)
               Appends $String to the output after the current parsed tag ends. Can be used by client code in
               callback methods to add HTML text to the processed output. If the HTML text needs to be defanged,
               client code can safely call HTML::Defang->defang() recursively from within the callback.

               Method parameters
                   $String
                       The string that is added after the current parsed tag ends.

       INTERNAL METHODS
           Generally these methods never need to be called by users of the class, because they'll be called
           internally as the appropriate tags are encountered, but they may be useful for some users in some
           cases.

           defang_script_tag($OutR, $HtmlR, $TagOps, $OpenAngle, $IsEndTag, $Tag, $TagTrail, $Attributes,
           $CloseAngle)
               This method is invoked when a <script> tag is parsed. Defangs the <script> opening tag, and any
               closing tag. Any scripting content is also commented out, so browsers don't display them.

               Returns 1 to indicate that the <script> tag must be defanged.

               Method parameters
                   $OutR
                       A reference to the processed output HTML before the tag that is currently being parsed.

                   $HtmlR
                       A scalar reference to the input HTML.

                   $TagOps
                       Indicates what operation should be done on a tag. Can be undefined, integer or code
                       reference. Undefined indicates an unknown tag to HTML::Defang, 1 indicates a known safe
                       tag, 0 indicates a known unsafe tag, and a code reference indicates a subroutine that
                       should be called to parse the current tag. For example, <style> and <script> tags are
                       parsed by dedicated subroutines.

                   $OpenAngle
                        The opening angle bracket (<) of the current tag.

                   $IsEndTag
                       Has the value '/' if the current tag is a closing tag.

                   $Tag
                       The HTML tag that is currently being parsed.

                   $TagTrail
                       Any space after the tag, but before attributes.

                   $Attributes
                        A reference to an array of the attributes and their values, including any surrounding
                        spaces. Each element of the array is added by a 'push' call like the one below.

                         push @$Attributes, [ $AttributeName, $SpaceBeforeEquals, $EqualsAndSubsequentSpace, $QuoteChar, $AttributeValue, $QuoteChar, $SpaceAfterAtributeValue ];

                   $CloseAngle
                        Anything after the end of the last attribute, including the closing angle bracket (>).
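
               To make the $Attributes layout concrete, here is a small sketch with hand-built entries that
               follow the seven-field order of the 'push' call above. It assumes every field is a defined
               string; how the module represents attributes without a value is not covered here.

```perl
use strict;
use warnings;

# Hand-built entries for:  <a href = "x.html" title='Hi' >
# Field order: name, space before '=', '=' plus following space,
# opening quote, value, closing quote, space after the value.
my $Attributes = [];
push @$Attributes, [ 'href',  ' ', '= ', '"', 'x.html', '"', ' ' ];
push @$Attributes, [ 'title', '',  '=',  "'", 'Hi',     "'", ' ' ];

# Joining the seven fields of each entry gives back the attribute text,
# spacing and quoting included.
my $AttributeText = join '', map { join '', @$_ } @$Attributes;

# Reassemble a tag from $OpenAngle, $Tag, $TagTrail, the attributes and $CloseAngle.
my ($OpenAngle, $Tag, $TagTrail, $CloseAngle) = ('<', 'a', ' ', '>');
my $TagText = $OpenAngle . $Tag . $TagTrail . $AttributeText . $CloseAngle;

print "$TagText\n";
```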

           defang_style_text($Content, $lcTag, $IsAttr, $AttributeHash, $HtmlR, $OutR)
               Defangs some raw CSS data and returns the defanged content.

               Method parameters
                   $Content
                       The input style string that is defanged.

                   $IsAttr
                       True if $Content is from an attribute, otherwise from a <style> block

           cleanup_style($StyleString)
               Helper function to clean up CSS data. This function directly operates on the input string without
               taking a copy.

               Method parameters
                   $StyleString
                       The input style string that is cleaned.

           defang_stylerule($SelectorsIn, $StyleRules, $lcTag, $IsAttr, $AttributeHash, $HtmlR, $OutR)
               Defangs style data.

               Method parameters
                   $SelectorsIn
                       An array reference to the selectors in the style tag/attribute contents.

                   $StyleRules
                       An array reference to the declaration blocks in the style tag/attribute contents.

                   $lcTag
                       Lower case version of the HTML tag that is currently being parsed.

                   $IsAttr
                       Whether we are currently parsing a style attribute or style tag. $IsAttr will be true if
                       we are currently parsing a style attribute.

                   $HtmlR
                       A scalar reference to the input HTML.

                   $OutR
                       A scalar reference to the processed output so far.

           defang_attributes($OutR, $HtmlR, $TagOps, $OpenAngle, $IsEndTag, $Tag, $TagTrail, $Attributes,
           $CloseAngle)
               Defangs attributes, defangs tags, does tag, attrib, css and url callbacks.

               Method parameters
                   For a description of the method parameters, see the documentation of the defang_script_tag()
                   method.

           cleanup_attribute($AttributeString)
               Helper function to clean up attributes.

               Method parameters
                   $AttributeString
                       The value of the attribute.

SEE ALSO

       <http://mailtools.anomy.net/>, <http://htmlcleaner.sourceforge.net/>, HTML::StripScripts,
       HTML::Detoxifier, HTML::Sanitizer, HTML::Scrubber

AUTHOR

       Kurian Jose Aerthail <cpan@kurianja.fastmail.fm>. Thanks to Rob Mueller <cpan@robm.fastmail.fm> for
       the initial code, guidance, support and bug fixes.

       Copyright (C) 2003-2013 by FastMail Pty Ltd

       This library is free software; you can redistribute it and/or modify it under the same terms as Perl
       itself.