NAME
YAPE::HTML - Yet Another Parser/Extractor for HTML
SYNOPSIS
use YAPE::HTML;
use strict;
my $content = "<html>...</html>";
my $parser = YAPE::HTML->new($content);
my ($extor,@fonts,@urls,@headings,@comments);
# here is the tokenizing part
while (my $chunk = $parser->next) {
  if ($chunk->type eq 'tag' and $chunk->tag eq 'font') {
    if (my $face = $chunk->get_attr('face')) {
      push @fonts, $face;
    }
  }
}
# here we catch any errors
unless ($parser->done) {
  die sprintf "bad HTML: %s (%s)",
    $parser->error, $parser->chunk;
}
# here is the extracting part
# <A> tags with HREF attributes
# <IMG> tags with SRC attributes
$extor = $parser->extract(a => ['href'], img => ['src']);
while (my $chunk = $extor->()) {
  push @urls, $chunk->get_attr(
    $chunk->tag eq 'a' ? 'href' : 'src'
  );
}
# <H1>, <H2>, ..., <H6> tags
$extor = $parser->extract(qr/^h[1-6]$/ => []);
while (my $chunk = $extor->()) {
  push @headings, $chunk;
}
# all comments
$extor = $parser->extract(-COMMENT => []);
while (my $chunk = $extor->()) {
  push @comments, $chunk;
}
`YAPE' MODULES
The `YAPE' hierarchy of modules is an attempt at a unified means of parsing and extracting content. It attempts to maintain a generic interface, to promote simplicity and reusability. The API is powerful, yet simple. The modules do tokenization (which can be intercepted) and build trees, so that extraction of specific nodes is doable.
DESCRIPTION
This module is yet another parser and tree-builder for HTML documents. It is designed to make extraction and modification of HTML documents simple. The API allows for easy custom additions to the document being parsed, and allows very specific tag, text, and comment extraction.
USAGE
In addition to the base class, `YAPE::HTML', there is the auxiliary class `YAPE::HTML::Element' (common to all `YAPE' base classes) that holds the individual nodes' classes. There is documentation for the node classes in that module's documentation.
HTML elements and their attributes are stored internally as lowercase strings. For clarification, that means that the tag `<A HREF="FooBar.html">' is stored as
{
  TAG => 'a',
  ATTR => {
    href => 'FooBar.html',
  }
}
This means that tags will be output in lowercase. There will be a feature in a future version to switch output case to capital letters.
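The normalization described above can be illustrated without the module. The following is a minimal sketch in plain Perl, not the module's actual implementation: `normalize_tag' is a hypothetical helper, it only understands double-quoted attribute values, and it simply builds a hash shaped like the one shown above.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of YAPE::HTML): turn one opening tag into
# a hash shaped like the structure shown in the text.
sub normalize_tag {
    my ($html) = @_;

    # Capture the tag name and the raw attribute text that follows it.
    my ($name, $rest) = $html =~ /^<\s*([A-Za-z][A-Za-z0-9]*)([^>]*)>/
        or return;

    my %attr;
    # Assumption: only double-quoted attribute values are handled here.
    while ($rest =~ /([A-Za-z][-\w:]*)\s*=\s*"([^"]*)"/g) {
        $attr{ lc $1 } = $2;    # name is lowercased, value is left untouched
    }

    return { TAG => lc $name, ATTR => \%attr };
}

my $node = normalize_tag('<A HREF="FooBar.html">');
print "$node->{TAG} $node->{ATTR}{href}\n";    # prints "a FooBar.html"
```

Note that only the tag and attribute names change case; the attribute value `FooBar.html' keeps its original spelling.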
Functions
There is a subtle difference between "empty" and "open"
tags. For example, the `<AREA>' tag contains a few
attributes, but there is no text associated with it (nor any
other tags), and therefore, is "empty"; the `<LI>', on the
other hand,
It is strongly suggested that for ease in parsing, any tags
that you do not explicitly close have a `/' at the end of
the tag:
Here's my cat: <img src="cat.jpg" />
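If existing markup does not follow this advice, it can be pre-processed before parsing. The following is a rough sketch under stated assumptions: `slash_empty_tags' is a hypothetical helper, the list of tag names is an illustrative selection of HTML 4 empty elements (not the contents of the module's `%EMPTY' hash), and attribute values containing `<' or `>' are not handled.

```perl
use strict;
use warnings;

# Assumed list of "empty" HTML 4 tags, for illustration only; this is not
# the contents of the module's %EMPTY hash.
my $empty = join '|', qw(area base br hr img input link meta param);

# Hypothetical helper: give every empty tag a trailing " /".
# Tags that already end in "/>" come out unchanged.
sub slash_empty_tags {
    my ($html) = @_;
    $html =~ s{<($empty)\b([^<>]*?)\s*/?>}{<$1$2 />}gi;
    return $html;
}

print slash_empty_tags(q{Here's my cat: <img src="cat.jpg">}), "\n";
# prints: Here's my cat: <img src="cat.jpg" />
```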
Methods for `YAPE::HTML'
It also copies the `%OPEN' and `%EMPTY' hashes, as well as
the `OPEN()' and `EMPTY()' functions, into the `MyExt::Mod'
namespace. This process is designed to save you from having
to place `@ISA' assignments all over the place.
It also copies the `%SSI' hash. This hash is not suggested
to be altered, and therefore it does not have any public
interface (you have to fiddle with it yourself). It exists
to ensure an SSI is valid.
package MyExt::Mod;
use YAPE::HTML 'MyExt::Mod';
# @MyExt::Mod::ISA = 'YAPE::HTML'
# @MyExt::Mod::text::ISA = 'YAPE::HTML::text'
# ...
# being rather strict with the tags
%OPEN = ();
%EMPTY = ();
my $quoted = YAPE::HTML::quote($string);
Extracting Sections
`YAPE::HTML' allows comprehensive extraction of tags, text, comments, DTDs, PIs, and SSIs, using a simple, yet rich, syntax:
my $extor = $parser->extract(
  TYPE => [ REQS ],
  ...
);
TYPE can be either the name of a tag (`"table"'), a regular expression that matches tags (`qr/^t[drh]$/'), or a special string to match all tags (`-TAG'), all text (`-TEXT'), all comments (`-COMMENT'), all DTDs (`-DTD'), all PIs (`-PI'), and all SSIs (`-SSI').
REQS varies from element to element:
Here are some example uses:
my $extor = $parser->extract(qr/^h/ => []);
my $extor = $parser->extract(-TAG => ['align']);
my $extor = $parser->extract(-TEXT => [qr/\bjaphy\b/i]);
my $extor = $parser->extract(
  a => ['href'],
  area => ['href'],
  base => ['href'],
  body => ['background'],
  img => ['src'],
  # ...
);
FEATURES
This is a list of special features of `YAPE::HTML'.
If you aren't enforcing strict HTML syntax and the parser meets a tag that should be closed but is not, the tag is flagged for closing. That means that input like:
<b>Foo<i>bar</b>
will appear as:
<b>Foo<i>bar</i></b>
upon request for output. In addition, tags that are left
dangling open at the end of an HTML document get closed.
That means:
<b>Foo<i>bar
will appear as:
<b>Foo<i>bar</i></b>
If strict checking is off, the only error you'll receive from mismatched HTML tags is a closing tag out-of-place.
On the other hand, if you do enforce strict HTML syntax,
you'll also be informed of tags that should be closed but are not.
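The non-strict closing behavior described above can be pictured as a stack of open tags. The following is a minimal sketch in plain Perl, not the module's implementation: `close_dangling' is a hypothetical helper, and it ignores empty tags, comments, and attribute values containing `>'.

```perl
use strict;
use warnings;

# Hypothetical helper (not YAPE::HTML's implementation): close tags that
# are still open when an outer closing tag, or the end of input, arrives.
sub close_dangling {
    my ($html) = @_;
    my @open;        # stack of currently open tag names
    my $out = '';

    # Split into tags and text, keeping the tags as tokens.
    for my $tok (split /(<[^>]+>)/, $html) {
        next if $tok eq '';

        if ($tok =~ m{^</\s*(\w+)}) {
            my $name = lc $1;
            # A closing tag with no matching opener is the one error
            # that is still reported when strict checking is off.
            die "closing tag </$name> out of place\n"
                unless grep { $_ eq $name } @open;

            # Close everything that was opened after the matching tag.
            while ((my $top = pop @open) ne $name) {
                $out .= "</$top>";
            }
            $out .= "</$name>";
        }
        elsif ($tok =~ m{^<\s*(\w+)}) {
            push @open, lc $1;
            $out .= $tok;
        }
        else {
            $out .= $tok;    # plain text passes through unchanged
        }
    }

    # Anything left open at the end of the document gets closed.
    $out .= "</$_>" for reverse @open;
    return $out;
}

print close_dangling('<b>Foo<i>bar</b>'), "\n";    # prints <b>Foo<i>bar</i></b>
print close_dangling('<b>Foo<i>bar'), "\n";        # prints <b>Foo<i>bar</i></b>
```

A strict mode could be approximated in this sketch by raising an error wherever a tag is closed implicitly, instead of appending the missing closing tag.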
TO DO
This is a listing of things to add to future versions of
this module.
API
Add a flag to the `fullstring' method of objects, `-EXPAND', which will display `&...;' HTML escapes as the character representing them.
Add a flag to the `fullstring' method of objects, `-UPPER', which will display tag and attribute names in uppercase.
DTD-like strictness in regards to nesting of elements -- `<LI>' is not allowed to be outside an `<OL>' or `<UL>' element.
Internals
There's probably some inherent slowness to this method, but it works. And it supports the robust `extract' method.
Make three constants, `CLOSED_NO', `CLOSED_YES', and `CLOSED_IMPL'.
BUGS
Following is a list of known or reported bugs.
Fixed
Visit `YAPE''s web site at http://www.pobox.com/~japhy/YAPE/.
SEE ALSO
The `YAPE::HTML::Element' documentation, for information on
the node classes.
AUTHOR
Jeff "japhy" Pinyan
CPAN ID: PINYAN
japhy@pobox.com
http://www.pobox.com/~japhy/