Search::Tools::XML - methods for playing nice with XML and HTML


Search-Tools documentation Contained in the Search-Tools distribution.

Index


Code Index:

NAME

Top

Search::Tools::XML - methods for playing nice with XML and HTML

SYNOPSIS

Top

 use Search::Tools::XML;

 my $class = 'Search::Tools::XML';

 my $text = 'the "quick brown" fox';

 my $xml = $class->start_tag('foo');

 $xml .= $class->utf8_safe( $text );

 $xml .= $class->end_tag('foo');

 # $xml: <foo>the &#34;quick brown&#34; fox</foo>

 $xml = $class->escape( $xml );

 # $xml: &lt;foo&gt;the &amp;#34;quick brown&amp;#34; fox&lt;/foo&gt;

 $xml = $class->unescape( $xml );

 # $xml: <foo>the "quick brown" fox</foo>

 my $plain = $class->no_html( $xml );

 # $plain eq $text




DESCRIPTION

Top

IMPORTANT: The API for escape() and unescape() has changed as of version 0.16. The text is no longer modified in place, as this was less intuitive.

Search::Tools::XML provides utility methods for dealing with XML and HTML. There isn't really anything new here that CPAN doesn't provide via HTML::Entities or similar modules. The difference is convenience: the most common methods you need for search apps are in one place with no extra dependencies.

NOTE: To get full UTF-8 character set from chr() you must be using Perl >= 5.8. This affects things like the unescape* methods.

VARIABLES

Top

%HTML_ents

Complete map of all named HTML entities to their decimal values.

METHODS

Top

The following methods may be accessed either as object or class methods.

new

Create a Search::Tools::XML object.

tag_re

Returns a qr// regex for matching a SGML (XML, HTML, etc) tag.

html_whitespace

Returns a regex for all whitespace characters and HTML whitespace entities.

char2ent_map

Returns a hash reference to the class data mapping chr() values to their numerical entity equivalents.

looks_like_html( string )

Returns true if string appears to have HTML-like markup in it.

Aliases for this method include:

looks_like_xml
looks_like_markup

start_tag( string [, \%attr ] )

end_tag( string )

Returns string as a tag, either start or end. string will be escaped for any non-valid chars using tag_safe().

If \%attr is passed, XML-safe attributes are generated using attr_safe().

singleton( string [, \%attr ] )

Like start_tag() but includes the closing slash.

tag_safe( string )

Create a valid XML tag name, escaping/omitting invalid characters.

Example:

    my $tag = Search::Tools::XML->tag_safe( '1 * ! tag foo' );
    # $tag == '______tag_foo'

attr_safe( \%attr )

Returns stringified \%attr as XML attributes.

utf8_safe( string )

Return string with special XML chars and all non-ASCII chars converted to numeric entities.

This is escape() on steroids. Do not use them both on the same text unless you know what you're doing. See the SYNOPSIS for an example.

escape_utf8

Alias for utf8_safe().

no_html( text [, normalize_whitespace] )

no_html() is a brute-force method for removing all tags and entities from text. A simple regular expression is used, so things like nested comments and the like will probably break. If you really need to reliably filter out the tags and entities from a HTML text, use HTML::Parser or similar.

text is returned with no markup in it.

If normalize_whitespace is true (defaults to false) then all whitespace is normalized away to ASCII space (U+0020). This can be helpful if you have Unicode entities representing line breaks or other layout instructions.

strip_html

An alias for no_html().

strip_markup

An alias for no_html().

escape( text )

Similar to escape() functions in more famous CPAN modules, but without the added dependency. escape() will convert the special XML chars (><'"&) to their named entity equivalents.

The escaped text is returned.

IMPORTANT: The API for this method has changed as of version 0.16. text is no longer modified in-place.

As of version 0.27 escape() is written in C/XS for speed.

unescape( text )

Similar to unescape() functions in more famous CPAN modules, but without the added dependency. unescape() will convert all entities to their chr() equivalents.

NOTE: unescape() does more than reverse the effects of escape(). It attempts to resolve all entities, not just the special XML entities (><'"&).

IMPORTANT: The API for this method has changed as of version 0.16. text is no longer modified in-place.

unescape_named( text )

Replace all named HTML entities with their chr() equivalents.

Returns modified copy of text.

unescape_decimal( text )

Replace all decimal entities with their chr() equivalents.

Returns modified copy of text.

perl_to_xml( ref, root_element [, strip_plural ][, do_not_escape] )

Similar to the XML::Simple XMLout() feature, perl_to_xml() will take a Perl data structure ref and convert it to XML, using root_element as the top-level element.

If strip_plural is a true value and not a CODE ref, any trailing s character will be stripped from the enclosing tag name whenever an array of hashrefs is found. Example:

 my $data = {
    values => [
        {   two   => 2,
            three => 3,
        },
        {   four => 4,
            five => 5,
        },
    ],
 };

 my $xml = $utils->perl_to_xml($data, 'data', 1);

 # $xml DOM will look like:

 <data>
  <values>
   <value>
    <three>3</three>
    <two>2</two>
   </value>
   <value>
    <five>5</five>
    <four>4</four>
   </value>
  </values>
 </data>

Obviously stripping the final s will not always render sensical tag names. Pass a CODE ref instead, expecting one value (the tag name) and returning the tag name to use:

tidy( xmlstring )

Attempts to indent xmlstring correctly to make it more legible.

Returns the xmlstring tidied up.

WARNING This is an experimental feature. It might be really slow or eat your XML. You have been warned.

AUTHOR

Top

Peter Karman <karman@cpan.org>

Originally based on the HTML::HiLiter regular expression building code, by the same author, copyright 2004 by Cray Inc.

Thanks to Atomic Learning www.atomiclearning.com for sponsoring the development of these modules.

BUGS

Top

Please report any bugs or feature requests to bug-search-tools at rt.cpan.org, or through the web interface at http://rt.cpan.org/NoAuth/ReportBug.html?Queue=Search-Tools. I will be notified, and then you'll automatically be notified of progress on your bug as I make changes.

SUPPORT

Top

You can find documentation for this module with the perldoc command.

    perldoc Search::Tools




You can also look for information at:

* RT: CPAN's request tracker

http://rt.cpan.org/NoAuth/Bugs.html?Dist=Search-Tools

* AnnoCPAN: Annotated CPAN documentation

http://annocpan.org/dist/Search-Tools

* CPAN Ratings

http://cpanratings.perl.org/d/Search-Tools

* Search CPAN

http://search.cpan.org/dist/Search-Tools/

COPYRIGHT

Top

SEE ALSO

Top

HTML::HiLiter, SWISH::HiLiter, Rose::Object, Class::XSAccessor, Text::Aspell


Search-Tools documentation Contained in the Search-Tools distribution.
package Search::Tools::XML;
use strict;
use warnings;
use Carp;
use base qw( Search::Tools::Object );
use Search::Tools;    # XS required

our $VERSION = '0.59';

# regexp for what constitutes whitespace in an HTML doc
# it's not as simple as \s|&nbsp; so we define it separately
my @white_hex_pts = qw(
    0009
    000a
    000b
    000c
    000d
    0020
    00a0
    2000
    2001
    2002
    2003
    2004
    2005
    2006
    2007
    2008
    2009
    200a
    200b
    2028
    2029
    202f
    205f
    3000
    feff
);

my @whitesp = ('\s');

# NOTE that the pound sign # needs escaping because we use
# the 'x' flag in our regexp.

for my $w (@white_hex_pts) {
    push @whitesp, sprintf( "&\\#x%s;", $w );                # hex entity
    push @whitesp, sprintf( "&\\#%s;",  hex($w) );           # dec entity
    push @whitesp, sprintf( "%s",       chr( hex($w) ) );    # byte value
}

my $HTML_WHITESPACE = join( '|', @whitesp );
my $WHITESPACE = join( '|', map { chr( hex($_) ) } @white_hex_pts );

# HTML entity table
# this just removes a dependency on another module...

our %HTML_ents = (
    quot     => 34,
    amp      => 38,
    apos     => 39,
    'lt'     => 60,
    'gt'     => 62,
    nbsp     => 160,
    iexcl    => 161,
    cent     => 162,
    pound    => 163,
    curren   => 164,
    yen      => 165,
    brvbar   => 166,
    sect     => 167,
    uml      => 168,
    copy     => 169,
    ordf     => 170,
    laquo    => 171,
    not      => 172,
    shy      => 173,
    reg      => 174,
    macr     => 175,
    deg      => 176,
    plusmn   => 177,
    sup2     => 178,
    sup3     => 179,
    acute    => 180,
    micro    => 181,
    para     => 182,
    middot   => 183,
    cedil    => 184,
    sup1     => 185,
    ordm     => 186,
    raquo    => 187,
    frac14   => 188,
    frac12   => 189,
    frac34   => 190,
    iquest   => 191,
    Agrave   => 192,
    Aacute   => 193,
    Acirc    => 194,
    Atilde   => 195,
    Auml     => 196,
    Aring    => 197,
    AElig    => 198,
    Ccedil   => 199,
    Egrave   => 200,
    Eacute   => 201,
    Ecirc    => 202,
    Euml     => 203,
    Igrave   => 204,
    Iacute   => 205,
    Icirc    => 206,
    Iuml     => 207,
    ETH      => 208,
    Ntilde   => 209,
    Ograve   => 210,
    Oacute   => 211,
    Ocirc    => 212,
    Otilde   => 213,
    Ouml     => 214,
    'times'  => 215,
    Oslash   => 216,
    Ugrave   => 217,
    Uacute   => 218,
    Ucirc    => 219,
    Uuml     => 220,
    Yacute   => 221,
    THORN    => 222,
    szlig    => 223,
    agrave   => 224,
    aacute   => 225,
    acirc    => 226,
    atilde   => 227,
    auml     => 228,
    aring    => 229,
    aelig    => 230,
    ccedil   => 231,
    egrave   => 232,
    eacute   => 233,
    ecirc    => 234,
    euml     => 235,
    igrave   => 236,
    iacute   => 237,
    icirc    => 238,
    iuml     => 239,
    eth      => 240,
    ntilde   => 241,
    ograve   => 242,
    oacute   => 243,
    ocirc    => 244,
    otilde   => 245,
    ouml     => 246,
    divide   => 247,
    oslash   => 248,
    ugrave   => 249,
    uacute   => 250,
    ucirc    => 251,
    uuml     => 252,
    yacute   => 253,
    thorn    => 254,
    yuml     => 255,
    OElig    => 338,
    oelig    => 339,
    Scaron   => 352,
    scaron   => 353,
    Yuml     => 376,
    fnof     => 402,
    circ     => 710,
    tilde    => 732,
    Alpha    => 913,
    Beta     => 914,
    Gamma    => 915,
    Delta    => 916,
    Epsilon  => 917,
    Zeta     => 918,
    Eta      => 919,
    Theta    => 920,
    Iota     => 921,
    Kappa    => 922,
    Lambda   => 923,
    Mu       => 924,
    Nu       => 925,
    Xi       => 926,
    Omicron  => 927,
    Pi       => 928,
    Rho      => 929,
    Sigma    => 931,
    Tau      => 932,
    Upsilon  => 933,
    Phi      => 934,
    Chi      => 935,
    Psi      => 936,
    Omega    => 937,
    alpha    => 945,
    beta     => 946,
    gamma    => 947,
    delta    => 948,
    epsilon  => 949,
    zeta     => 950,
    eta      => 951,
    theta    => 952,
    iota     => 953,
    kappa    => 954,
    lambda   => 955,
    mu       => 956,
    nu       => 957,
    xi       => 958,
    omicron  => 959,
    pi       => 960,
    rho      => 961,
    sigmaf   => 962,
    sigma    => 963,
    tau      => 964,
    upsilon  => 965,
    phi      => 966,
    chi      => 967,
    psi      => 968,
    omega    => 969,
    thetasym => 977,
    upsih    => 978,
    piv      => 982,
    ensp     => 8194,
    emsp     => 8195,
    thinsp   => 8201,
    zwnj     => 8204,
    zwj      => 8205,
    lrm      => 8206,
    rlm      => 8207,
    ndash    => 8211,
    mdash    => 8212,
    lsquo    => 8216,
    rsquo    => 8217,
    sbquo    => 8218,
    ldquo    => 8220,
    rdquo    => 8221,
    bdquo    => 8222,
    dagger   => 8224,
    Dagger   => 8225,
    bull     => 8226,
    hellip   => 8230,
    permil   => 8240,
    prime    => 8242,
    Prime    => 8243,
    lsaquo   => 8249,
    rsaquo   => 8250,
    oline    => 8254,
    frasl    => 8260,
    euro     => 8364,
    image    => 8465,
    weierp   => 8472,
    real     => 8476,
    trade    => 8482,
    alefsym  => 8501,
    larr     => 8592,
    uarr     => 8593,
    rarr     => 8594,
    darr     => 8595,
    harr     => 8596,
    crarr    => 8629,
    lArr     => 8656,
    uArr     => 8657,
    rArr     => 8658,
    dArr     => 8659,
    hArr     => 8660,
    forall   => 8704,
    part     => 8706,
    exist    => 8707,
    empty    => 8709,
    nabla    => 8711,
    isin     => 8712,
    notin    => 8713,
    ni       => 8715,
    prod     => 8719,
    'sum'    => 8721,
    'minus'  => 8722,
    lowast   => 8727,
    radic    => 8730,
    prop     => 8733,
    infin    => 8734,
    ang      => 8736,
    'and'    => 8743,
    'or'     => 8744,
    cap      => 8745,
    cup      => 8746,
    int      => 8747,
    there4   => 8756,
    sim      => 8764,
    cong     => 8773,
    asymp    => 8776,
    ne       => 8800,
    equiv    => 8801,
    le       => 8804,
    ge       => 8805,
    sub      => 8834,
    sup      => 8835,
    nsub     => 8836,
    sube     => 8838,
    supe     => 8839,
    oplus    => 8853,
    otimes   => 8855,
    perp     => 8869,
    sdot     => 8901,
    lceil    => 8968,
    rceil    => 8969,
    lfloor   => 8970,
    rfloor   => 8971,
    lang     => 9001,
    rang     => 9002,
    loz      => 9674,
    spades   => 9824,
    clubs    => 9827,
    hearts   => 9829,
    diams    => 9830,
);

my %char2entity = ();
while ( my ( $e, $n ) = each(%HTML_ents) ) {
    my $char = chr($n);
    $char2entity{$char} = "&$e;";
}
delete $char2entity{q/'/};    # only one-way decoding

# Fill in missing entities
# TODO does this only work under latin1 locale?
for ( 0 .. 255 ) {
    next if exists $char2entity{ chr($_) };
    $char2entity{ chr($_) } = "&#$_;";
}

sub tag_re {qr/<[^>]+>/s}

sub html_whitespace {$HTML_WHITESPACE}

sub char2ent_map { \%char2entity }

sub looks_like_html { return $_[1] =~ m/[<>]|&[\#\w]+;/o }
*looks_like_xml    = \&looks_like_html;
*looks_like_markup = \&looks_like_html;

sub start_tag { "<" . tag_safe( $_[1] ) . $_[0]->attr_safe( $_[2] ) . ">" }
sub end_tag   { "</" . tag_safe( $_[1] ) . ">" }
sub singleton { "<" . tag_safe( $_[1] ) . $_[0]->attr_safe( $_[2] ) . "/>" }

sub tag_safe {
    my $t = pop;

    return '_' unless length $t;

    $t =~ s/[^-\.\w:]/_/g;
    $t =~ s/^(\d)/_$1/;

    return $t;
}

sub attr_safe {
    my $self = shift;
    my $attr = shift;
    return '' unless defined $attr;
    if ( ref $attr ne "HASH" ) {
        croak "attributes must be a hash ref";
    }
    my @xml = ('');    # force space at start in return
    for my $name ( sort keys %$attr ) {
        my $val = _escape_xml( $attr->{$name} );
        push @xml, tag_safe($name) . qq{="$val"};
    }
    return join( ' ', @xml );
}

*escape_utf8 = \&utf8_safe;

sub utf8_safe {
    my $t = pop;
    $t = '' unless defined $t;

    # converts all low chars except \t \n and \r
    # to space because XML spec disallows <32
    $t =~ s,[\x00-\x08\x0b-\x0c\x0e-\x1f], ,g;

    $t =~ s{([^\x09\x0a\x0d\x20\x21\x23-\x25\x28-\x3b\x3d\x3F-\x5B\x5D-\x7E])}
                        {'&#'.(ord($1)).';'}eg;

    return $t;
}

sub no_html {
    my $class                = shift;
    my $text                 = shift;
    my $normalize_whitespace = shift || 0;
    if ( !defined $text ) {
        croak "text required";
    }
    my $re = $class->tag_re;
    $text =~ s,$re,,g;
    $text = $class->unescape($text);
    if ($normalize_whitespace) {
        $text =~ s/\s+/ /g;
    }
    return $text;
}

*strip_html   = \&no_html;
*strip_markup = \&no_html;

sub escape {
    my $text = pop;
    return unless defined $text;
    return _escape_xml($text);
}

sub unescape {
    my $text = pop;
    $text = unescape_named($text);
    $text = unescape_decimal($text);
    return $text;
}

sub unescape_named {
    my $t = pop;
    if ( defined($t) ) {

        # named entities - check first to see if it is worth looping
        if ( $t =~ m/&[a-zA-Z]+;/ ) {
            for ( keys %HTML_ents ) {
                if ( my $n = $t =~ s/&$_;/chr($HTML_ents{$_})/eg ) {

                    #warn "replaced $_ -> $HTML_ents{$_} $n times in text";
                }
            }
        }
    }
    return $t;
}

sub unescape_decimal {
    my $t = pop;

    # resolve numeric entities as best we can
    $t =~ s/&#(\d+);/chr($1)/ego if defined($t);
    return $t;
}

sub _make_singular {
    my ($t) = @_;
    $t =~ s/ies$/y/i;
    return $t if ( $t =~ s/ses$/s/i );
    return $t if ( $t =~ /[aeiouy]ss$/i );
    $t =~ s/s$//i;
    return length $t ? $t : $_[0];
}

sub perl_to_xml {
    my $self          = shift;
    my $perl          = shift;
    my $root          = shift || '_root';
    my $strip_plural  = shift || 0;
    my $do_not_escape = shift || 0;
    unless ( defined $perl ) {
        croak "perl data struct required";
    }

    if ( $strip_plural and ref($strip_plural) ne 'CODE' ) {
        $strip_plural = \&_make_singular;
    }

    if ( !ref $perl ) {
        return
              $self->start_tag($root)
            . ( $do_not_escape ? $perl : $self->utf8_safe($perl) )
            . $self->end_tag($root);
    }

    my $xml = $self->start_tag($root);
    $self->_ref_to_xml( $perl, '', \$xml, $strip_plural, $do_not_escape );
    $xml .= $self->end_tag($root);
    return $xml;
}

sub _ref_to_xml {
    my ( $self, $perl, $root, $xml_ref, $strip_plural, $do_not_escape ) = @_;
    my $type = ref $perl;
    if ( !$type ) {
        ( $$xml_ref .= $self->start_tag($root) )
            if length($root);
        $$xml_ref .= ( $do_not_escape ? $perl : $self->utf8_safe($perl) );
        ( $$xml_ref .= $self->end_tag($root) )
            if length($root);

        #$$xml_ref .= "\n";    # just for debugging
    }
    elsif ( $type eq 'SCALAR' ) {
        $self->_scalar_to_xml( $perl, $root, $xml_ref, $strip_plural,
            $do_not_escape );
    }
    elsif ( $type eq 'ARRAY' ) {
        $self->_array_to_xml( $perl, $root, $xml_ref, $strip_plural,
            $do_not_escape );
    }
    elsif ( $type eq 'HASH' ) {
        $self->_hash_to_xml( $perl, $root, $xml_ref, $strip_plural,
            $do_not_escape );
    }
    else {
        croak "unsupported ref type: $type";
    }

}

sub _array_to_xml {
    my ( $self, $perl, $root, $xml_ref, $strip_plural, $do_not_escape ) = @_;
    for my $thing (@$perl) {
        if ( ref $thing and length($root) ) {
            $$xml_ref .= $self->start_tag($root);
        }
        $self->_ref_to_xml( $thing, $root, $xml_ref, $strip_plural,
            $do_not_escape );
        if ( ref $thing and length($root) ) {
            $$xml_ref .= $self->end_tag($root);
        }
    }
}

sub _hash_to_xml {
    my ( $self, $perl, $root, $xml_ref, $strip_plural, $do_not_escape ) = @_;
    for my $key ( keys %$perl ) {
        my $thing = $perl->{$key};
        if ( ref $thing ) {
            my $key_to_pass = $key;
            my %attr;
            if ( ref $thing eq 'ARRAY' && $strip_plural ) {
                $key_to_pass = $strip_plural->($key_to_pass);
                $attr{count} = scalar @$thing;
            }
            $$xml_ref .= $self->start_tag( $key, \%attr );
            $self->_ref_to_xml( $thing, $key_to_pass, $xml_ref, $strip_plural,
                $do_not_escape );
            $$xml_ref .= $self->end_tag($key);

            #$$xml_ref .= "\n";                  # just for debugging
        }
        else {
            $self->_ref_to_xml( $thing, $key, $xml_ref, $strip_plural,
                $do_not_escape );
        }
    }
}

sub _scalar_to_xml {
    my ( $self, $perl, $root, $xml_ref, $strip_plural, $do_not_escape ) = @_;
    $$xml_ref
        .= $self->start_tag($root)
        . ( $do_not_escape ? $perl : $self->utf8_safe($perl) )
        . $self->end_tag($root);

    #$$xml_ref .= "\n";    # just for debugging
}

sub tidy {
    my $xml    = pop;
    my $level  = 2;
    my $indent = 0;
    my @tidy   = ();

    # normalize tag breaks
    $xml =~ s,>\s*<,>\n<,gs;

    my @xmlarr = split( m/\n/, $xml );

    # shift off declaration
    if ( scalar(@xmlarr) and $xmlarr[0] =~ m/^<\?\s*xml/ ) {
        push @tidy, shift(@xmlarr);
    }

    my $count = 0;
    for my $el (@xmlarr) {

        if ( $count == 1 ) {
            $indent = 2;
        }
        if ( $count == scalar(@xmlarr) - 1 ) {
            $indent = 0;
        }

        #warn "el: $el\n";

        # match opening tag
        if ( $el =~ m/^<([\w])+[^>\/]*>$/ ) {

            #warn "open $indent\n";
            push @tidy, ( ' ' x $indent ) . $el;
            $indent += $level;
        }
        else {
            if ( $el =~ m/^<\// ) {

                #warn "close $indent\n";
                $indent -= $level;    # closing tag
            }
            if ( $indent < 0 ) {
                $indent += $level;
            }
            push @tidy, ( ' ' x $indent ) . $el;
        }

        #warn "indent = $indent\n";

        #Data::Dump::dump \@tidy;
        $count++;
    }

    return join( "\n", @tidy );

}

1;
__END__