| HTML-ListScraper documentation | view source | Contained in the HTML-ListScraper distribution. |
HTML::ListScraper - generic web page scraping support
Version 0.05
    use HTML::ListScraper;

    my $scraper = HTML::ListScraper->new( api_version => 3,
                                          marked_sections => 1 );
    # set up $scraper options...
    $scraper->parse($html);
    $scraper->eof;

    my @seq = $scraper->find_sequences;
    my $seq = shift @seq;
    if ($seq) { # is-a HTML::ListScraper::Sequence
        foreach my $inst ($seq->instances) { # is-a HTML::ListScraper::Instance
            foreach my $tag ($inst->tags) { # is-a HTML::ListScraper::Tag
                print "<", $tag->name, ">\n";
                print $tag->text, "\n";
            }
        }
    }
While Perl has good support for, and is often used for, extracting machine-friendly data from HTML pages, most scripts written for that task are ad hoc: they parse just one site's HTML and depend on superficial, transient details of its structure, which makes them brittle and labor-intensive to maintain. This module tries to support more generic scraping for a class of pages: those whose most important part is a list of links.
HTML::ListScraper is a subclass of HTML::Parser, building on its
ability to convert an octet stream - whether strictly valid HTML or
something just vaguely similar to it - to tags and text. HTML parsing
works the same as with HTML::Parser, except you don't need to
register your own HTML event handlers.
When the document is parsed, call find_sequences to find out which
tags in the document repeat, one after the other, more than once (text
and comments are ignored for this comparison). Since there'll probably
be quite a lot of such sequences, HTML::ListScraper tries to find
the "longest one repeating most often". Specifically, it maximizes
log(number of non-overlapping runs) * log(number of tags in the
sequence). More than one sequence can attain that maximum, which
is why the method returns an array (and the array can also be empty -
see below). Your application can then iterate over the returned
structure to find items of interest.
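The scoring rule above can be tried out on its own. The following is a minimal sketch, not the module's implementation: the function name sequence_score and the candidate figures are made up for illustration, and only the formula itself comes from the description above.

```perl
use strict;
use warnings;

# Score as described above: log(runs) * log(tags per run).
sub sequence_score {
    my ($runs, $length) = @_;
    return log($runs) * log($length);
}

# Hypothetical candidates: [label, non-overlapping runs, tags per run].
my @candidates = (
    [ 'short, very frequent', 40,  2 ],
    [ 'medium, frequent',     10,  8 ],
    [ 'long, rare',            2, 30 ],
);

# Pick the candidate with the highest score.
my ($best) = sort {
    sequence_score(@{$b}[1, 2]) <=> sequence_score(@{$a}[1, 2])
} @candidates;

printf "%-22s %.3f\n", $_->[0], sequence_score(@{$_}[1, 2]) for @candidates;
print "best: $best->[0]\n";
```

Because log(1) is 0, a sequence that occurs only once, or that consists of a single tag, always scores 0 under this formula, so neither extreme can win on length or frequency alone.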
This module includes a script, scrape, displaying the sequences
found by HTML::ListScraper, so that you can see which items your
application needs - and if they aren't there, you can try to tweak
HTML::ListScraper's settings with the various scrape switches to
make it find more.
HTML::ListScraper methods are as follows:
new is HTML::ListScraper's constructor. It passes all its parameters to the
superclass and registers HTML::Parser's event handlers start,
text and end.
min_count is the numeric threshold for the frequency of found sequences -
get_sequences returns only those which repeat at least min_count
times. Call it without arguments to get the current value, or with an
argument to set it. The default, which is also the minimum allowed value, is
2.
By default, get_sequences returns only "well-shaped" sequences,
whose every opening tag is followed by the appropriate closing tag,
with an exception for those tags whose closing tag is optional.
For example, <div><br></div> is well-shaped, but neither <div><br>
nor <br></div> is. Tags which don't need a closing tag are
those identified by is_unclosed_tag. Closing tags are paired with
the nearest opening tag with the same name which hasn't been paired
yet. A well-shaped sequence is basically an HTML fragment - like a
tree, except it doesn't have to have a single root.
Well-shaped sequences should be fine when processing valid HTML, but
since this module doesn't restrict itself to valid HTML, that isn't
always good enough. Setting shapeless to a true value removes this
filtering and makes all sequences eligible.
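To make the "well-shaped" rule concrete, here is a minimal sketch of such a check. It is an illustration under stated assumptions, not the module's code: the function is_well_shaped and the %UNCLOSED table are hypothetical (the table is a small, incomplete stand-in for whatever is_unclosed_tag recognizes), and tag names are written in lowercase without angle brackets, with '/' as the first character of closing tags.

```perl
use strict;
use warnings;

# Hypothetical, incomplete stand-in for is_unclosed_tag.
my %UNCLOSED = map { $_ => 1 } qw(br hr img input li p);

sub is_well_shaped {
    my @tags = @_;
    my @open;    # names of opening tags that are not paired yet
    for my $tag (@tags) {
        if ($tag =~ m{^/(.+)$}) {
            my $name = $1;
            # Pair with the nearest unpaired opening tag of the same name.
            my ($idx) = grep { $open[$_] eq $name } reverse 0 .. $#open;
            return 0 unless defined $idx;    # closing tag without an opener
            splice @open, $idx, 1;
        }
        else {
            push @open, $tag;
        }
    }
    # Leftover openers are acceptable only if their closing tag is optional.
    my @dangling = grep { !$UNCLOSED{$_} } @open;
    return @dangling ? 0 : 1;
}

for my $case ([qw(div br /div)], [qw(div br)], [qw(br /div)]) {
    printf "%-12s %s\n", "@$case",
        is_well_shaped(@$case) ? 'well-shaped' : 'not well-shaped';
}
```

The three cases correspond to the examples in the text: only the first is accepted.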
is_unclosed_tag tests for tag names with an optional closing tag. It takes a tag name and
returns true for tags declared in the HTML 4.01 Transitional DTD as having
either an optional closing tag or none. Note that overriding this method in a subclass
won't change HTML::ListScraper's behavior - it delegates to a real
implementation deep in this module's guts, which are not documented
here.
get_sequences is the core of HTML::ListScraper. It takes no arguments and returns an array
of HTML::ListScraper::Sequence objects. The sequences are sorted by
length (shortest first).
"Sequences" with just 1 tag and sequences which don't repeat are never
returned; depending on the value of min_count and shapeless,
get_sequences may also ignore other ones (see min_count and
shapeless).
find_sequences is a generalization of get_sequences. Like get_sequences,
it takes no arguments and returns an array of
HTML::ListScraper::Sequence objects - the same sequences, in fact,
as get_sequences, but with potentially more instances. In addition
to the exact matches, find_sequences tries to find "approximate"
instance matches, that is, tag sequences with a non-zero but low edit
distance from the exact sequence.
The alignment uses Algorithm::NeedlemanWunsch (q.v.) in its local
mode, with fixed scores whose particular values hopefully don't matter
much (see the source of HTML::ListScraper::Sweep if you're really
interested in them). Approximate instances are sought between the
exact ones, from the most similar to a cut-off point of low
similarity.
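As an intuition for "low edit distance" between tag sequences, the sketch below computes a plain Levenshtein distance over tag names. This is deliberately a simpler technique than the scored local alignment the module performs through Algorithm::NeedlemanWunsch; the function name tag_edit_distance and the sample sequences are made up for illustration.

```perl
use strict;
use warnings;
use List::Util qw(min);

# Levenshtein distance between two tag sequences (array refs of tag
# names), where one edit inserts, deletes or replaces a whole tag.
sub tag_edit_distance {
    my ($x, $y) = @_;
    my @prev = (0 .. scalar @$y);
    for my $i (1 .. scalar @$x) {
        my @cur = ($i);
        for my $j (1 .. scalar @$y) {
            my $cost = $x->[$i - 1] eq $y->[$j - 1] ? 0 : 1;
            push @cur, min(
                $prev[$j] + 1,            # delete a tag from $x
                $cur[$j - 1] + 1,         # insert a tag from $y
                $prev[$j - 1] + $cost,    # match or replace
            );
        }
        @prev = @cur;
    }
    return $prev[-1];
}

my @exact = qw(li a /a /li);
print tag_edit_distance(\@exact, [qw(li a img /a /li)]), "\n";    # one extra tag
print tag_edit_distance(\@exact, [qw(li b /b /li)]), "\n";        # two tags replaced
```

A candidate at distance 1 from the exact sequence, such as a list item with an extra img tag, is the kind of near miss an approximate search is meant to pick up.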
Found approximate instances are identified by the
HTML::ListScraper::Instance::match value approx; their score is
available as the value of HTML::ListScraper::Instance::score. That
value isn't always defined, though: if the shapeless flag isn't
set, approximate tag sequences are made to look like valid HTML
fragments by removing unpaired tags. Since that obviously damages the
score, no score is returned for such cut-up instances.
When the "longest sequence repeating most often" found by
HTML::ListScraper isn't quite the sought one, you can specify
exactly which one you want by calling get_known_sequence instead of
get_sequences. get_known_sequence takes a list of tag names
spelled using the same convention as HTML::ListScraper::Tag,
i.e. in lowercase, without angle brackets and with closing tags having
'/' as the first character. If the parsed document doesn't contain the
specified sequence, get_known_sequence returns undef. Otherwise,
it returns an instance of HTML::ListScraper::Sequence.
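Conceptually, looking up a known sequence means locating non-overlapping exact occurrences of a given tag list in the document's tag stream. The sketch below shows that idea in isolation; it is not how the module is implemented, and find_exact_runs and the sample stream are hypothetical. Tag names follow the convention described above.

```perl
use strict;
use warnings;

# Return the start offsets of non-overlapping exact occurrences of
# @$wanted within @$stream (both hold tag names).
sub find_exact_runs {
    my ($stream, $wanted) = @_;
    return unless @$wanted;    # nothing to look for
    my @starts;
    my $i = 0;
    while ($i + @$wanted <= @$stream) {
        my $hit = 1;
        for my $j (0 .. $#$wanted) {
            if ($stream->[$i + $j] ne $wanted->[$j]) {
                $hit = 0;
                last;
            }
        }
        if ($hit) {
            push @starts, $i;
            $i += @$wanted;    # skip past the match so runs don't overlap
        }
        else {
            $i++;
        }
    }
    return @starts;
}

my @stream = qw(ul li a /a /li li a /a /li li b /b /li /ul);
my @starts = find_exact_runs(\@stream, [qw(li a /a /li)]);
print "@starts\n";
```

In this sample the third list item (li b /b /li) is not an exact occurrence, which is the sort of near miss the approximate search is for.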
find_known_sequence is a generalization of get_known_sequence. Like get_known_sequence,
it takes a list of tag names, but it finds both exact
and approximate matches for it. If the parsed document doesn't contain
at least one tag sequence that matches at least approximately,
find_known_sequence returns undef. Otherwise, it returns an
instance of HTML::ListScraper::Sequence.
Attribute start handler. Registered with signature self, tagname,
attr, although the only attribute preserved by HTML::ListScraper
is href. For ultimate flexibility in preprocessing the input HTML,
you can override this method in a subclass, but do call the base version at least
conditionally. Note that if you just want to ignore some tags, there
are simpler ways, e.g. HTML::Parser::ignore_tags.
Text handler. Registered with signature self, dtext. For ultimate
flexibility in preprocessing the input HTML, you can override this
method in a subclass, but do call the base version at least conditionally.
Attribute end handler. Registered with signature self, tagname. For
ultimate flexibility in preprocessing the input HTML, you can override
this method in a subclass.
Requires too much configuration.
Vaclav Barta, <vbar@comp.cz>
Copyright 2007 Vaclav Barta, all rights reserved.
This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.