HTML::StripScripts::Regex - XSS filter using a regular expression


HTML-StripScripts-Regex documentation Contained in the HTML-StripScripts-Regex distribution.

NAME

HTML::StripScripts::Regex - XSS filter using a regular expression

VERSION

Version 0.02

SYNOPSIS

This class subclasses HTML::StripScripts and adds an input() method that tokenizes the document with a regular expression. See HTML::StripScripts for the filtering rules and configuration options.

  use HTML::StripScripts::Regex;

  my $hss = HTML::StripScripts::Regex->new({ Context => 'Inline' });

  $hss->input("<i>hello, world!</i>");

  print $hss->filtered_document;

Using a regular expression to parse HTML is error-prone, and inefficient for large documents. If HTML::Parser is available, use HTML::StripScripts::Parser in preference to this module.
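The preference described above can be expressed as a runtime fallback. The following is a minimal sketch, not part of either module: the helper names pick_filter_class and filter_fragment are hypothetical, and it assumes that HTML::StripScripts::Parser accepts the same configuration hash and is driven through the parse and eof methods it inherits from HTML::Parser.

```perl
use strict;
use warnings;

# Hypothetical helper: return the name of the preferred filter class
# that can be loaded, or undef if neither module is installed.
sub pick_filter_class {
    for my $class (qw(HTML::StripScripts::Parser HTML::StripScripts::Regex)) {
        ( my $file = "$class.pm" ) =~ s{::}{/}g;
        return $class if eval { require $file; 1 };
    }
    return;
}

# Hypothetical helper: filter one complete HTML fragment with whichever
# class is available.
sub filter_fragment {
    my ($html) = @_;
    my $class = pick_filter_class()
        or die "neither HTML::StripScripts::Parser nor ::Regex is installed\n";
    my $hss = $class->new( { Context => 'Inline' } );

    if ( $hss->can('input') ) {
        # Regex-based subclass: takes the whole document in one call.
        $hss->input($html);
    }
    else {
        # Assumption: the HTML::Parser-based subclass is fed via parse/eof.
        $hss->parse($html);
        $hss->eof;
    }
    return $hss->filtered_document;
}

# Only run the demonstration when one of the modules is present.
print filter_fragment('<i>hello, world!</i>'), "\n" if pick_filter_class();
```

Checking can('input') rather than the class name keeps the calling code unaware of which implementation was chosen.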

METHODS

This subclass adds the following method to those of HTML::StripScripts.

input ( TEXT )

Parses an HTML document and runs it through the filter. TEXT must be the entire HTML document to be filtered, as a single flat string: each call starts a new document, processes it, and ends it.
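To see how a flat string is split into events, the following standalone sketch reuses the alternation from the module source. It does not load HTML::StripScripts; the tokenize function and its [kind, text] pair format are hypothetical and exist only for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: split a flat HTML string into [kind, text] pairs.
# The alternation mirrors the pattern used by input() in the module source.
sub tokenize {
    my ($text) = @_;
    my @events;

    while ( $text =~ m[
            (?: <(script|style).*?> (.*?) </\1>        ) |  # CDATA element
            ( <!--.*?-->                               ) |  # comment
            ( <\?.*?>                                  ) |  # processing instruction
            ( <\!.*?>                                  ) |  # declaration
            ( <[a-z0-9]+\b(?:[^>'"]|"[^"]*"|'[^']*')*> ) |  # start tag
            ( </[a-z0-9]+>                             ) |  # end tag
            ( .[^<]*                                   )    # other text
        ]igsx ) {

        # Only the capture groups of the alternative that matched are defined.
        if ( defined $1 ) {
            push @events, [ start => "<$1>" ], [ text => $2 ], [ end => "</$1>" ];
        }
        elsif ( defined $3 ) { push @events, [ comment     => $3 ] }
        elsif ( defined $4 ) { push @events, [ process     => $4 ] }
        elsif ( defined $5 ) { push @events, [ declaration => $5 ] }
        elsif ( defined $6 ) { push @events, [ start       => $6 ] }
        elsif ( defined $7 ) { push @events, [ end         => $7 ] }
        else                 { push @events, [ text        => $8 ] }
    }
    return @events;
}

# Print one event per line: the kind, then the matched text.
printf "%-12s %s\n", @$_ for tokenize('<b>hi</b><!-- c --><script>x()</script>');
```

Because the last alternative accepts any single character, every position in the string is consumed by some event, which is why the whole document has to be supplied in one call.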

SUBCLASSING

The HTML::StripScripts::Regex class is subclassable in exactly the same way as HTML::StripScripts. See "SUBCLASSING" in HTML::StripScripts for details.
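As an illustration, a subclass might change what happens to rejected tags. This is a sketch under assumptions: the package name My::QuietFilter is hypothetical, and reject_start and reject_end are assumed to be the HTML::StripScripts hooks that receive rejected start and end tags. Check the parent documentation before relying on them.

```perl
package My::QuietFilter;    # hypothetical subclass name
use strict;
use warnings;
use base qw(HTML::StripScripts::Regex);

# Assumed hooks from HTML::StripScripts: each receives the text of a
# rejected tag. Emitting nothing drops the tag from the output.
sub reject_start { return }
sub reject_end   { return }

package main;

my $hss = My::QuietFilter->new( { Context => 'Inline' } );
$hss->input('<i>hello</i><blink>world</blink>');
print $hss->filtered_document, "\n";
```

The input method is inherited unchanged, so the subclass is used exactly as in the SYNOPSIS.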

SEE ALSO

HTML::StripScripts, HTML::StripScripts::Parser, HTML::Parser

AUTHOR

Nick Cleaton, <nick at cleaton dot net>

COPYRIGHT & LICENSE


package HTML::StripScripts::Regex;
use strict;
use warnings;
our $VERSION = '0.02';

use HTML::StripScripts;
use base qw(HTML::StripScripts);

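# Tokenize TEXT with a single regular expression and pass each token
# to the corresponding HTML::StripScripts input_* handler.
sub input {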
    my ($self, $text) = @_;

    $self->input_start_document;

    while ( $text =~ m[

            # <script></script> or <style></style> constructs,
            # in which everything between the tags counts as
            # CDATA.
            (?: <(script|style).*?> (.*?) </\1>           ) |

            # An HTML comment
            ( <!--.*?-->                                  ) |

            # A processing instruction
            ( <\?.*?>                                     ) |

            # A declaration (e.g. a DOCTYPE)
            ( <\!.*?>                                     ) |

            # A start tag
            ( <[a-z0-9]+\b(?:[^>'"]|"[^"]*"|'[^']*')*>    ) |

            # An end tag
            ( </[a-z0-9]+>                                ) |

            # Some non-tag text.  A '<' is consumed only as the
            # first character: if we reach this alternative with
            # a leading '<', none of the patterns above matched,
            # so it cannot be the start of a well-formed tag.
            ( .[^<]*                                       )

            ]igsx ) {
        
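        # Exactly one alternative matched; dispatch on the capture
        # group that is defined.
        if    ( defined $1 ) {
            # Only the tag name is passed on; attributes are dropped.
            $self->input_start("<$1>");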
            $self->input_text($2);
            $self->input_end("</$1>");
        }
        elsif ( defined $3 ) {
            $self->input_comment($3);
        }
        elsif ( defined $4 ) {
            $self->input_process($4);
        }
        elsif ( defined $5 ) {
            $self->input_declaration($5);
        }
        elsif ( defined $6 ) {
            $self->input_start($6);
        }
        elsif ( defined $7 ) {
            $self->input_end($7);
        }
        elsif ( defined $8 ) {
            $self->input_text($8);
        }
        else {
            die 'regex failed to act as expected';
        }

    }

    $self->input_end_document;
}

1;