NAME

YAPE::HTML - Yet Another Parser/Extractor for HTML

SYNOPSIS

      use YAPE::HTML;
      use strict;
      
      my $content = "<html>...</html>";
      my $parser = YAPE::HTML->new($content);
      my ($extor,@fonts,@urls,@headings,@comments);
      
      # here is the tokenizing part
      while (my $chunk = $parser->next) {
        if ($chunk->type eq 'tag' and $chunk->tag eq 'font') {
          if (my $face = $chunk->get_attr('face')) {
            push @fonts, $face;
          }
        }
      }
      
      # here we catch any errors
      unless ($parser->done) {
        die sprintf "bad HTML: %s (%s)",
          $parser->error, $parser->chunk;
      }
      
      # here is the extracting part
      
      # <A> tags with HREF attributes
      # <IMG> tags with SRC attributes
      $extor = $parser->extract(a => ['href'], img => ['src']);
      while (my $chunk = $extor->()) {
        push @urls, $chunk->get_attr(
          $chunk->tag eq 'a' ? 'href' : 'src'
        );
      }
      
      # <H1>, <H2>, ..., <H6> tags
      $extor = $parser->extract(qr/^h[1-6]$/ => []);
      while (my $chunk = $extor->()) {
        push @headings, $chunk;
      }
      
      # all comments
      $extor = $parser->extract(-COMMENT => []);
      while (my $chunk = $extor->()) {
        push @comments, $chunk;
      }

`YAPE' MODULES

The `YAPE' hierarchy of modules is an attempt at a unified means of parsing and extracting content. It aims to keep a generic interface across modules, to promote simplicity and reusability. The modules tokenize their input (a step that can be intercepted) and build trees, so that specific nodes can be extracted.

DESCRIPTION

This module is yet another parser and tree-builder for HTML documents. It is designed to make extraction and modification of HTML documents simple. The API allows for easy custom additions to the document being parsed, and allows very specific tag, text, and comment extraction.

USAGE

In addition to the base class, `YAPE::HTML', there is the auxiliary class `YAPE::HTML::Element' (common to all `YAPE' base classes) that holds the individual node classes. The node classes are described in that module's documentation.

HTML elements and their attributes are stored internally as lowercase strings. For clarification, that means that the tag `<A HREF="FooBar.html">' is stored as

      {
        TAG => 'a',
        ATTR => {
          href => 'FooBar.html',
        }
      }

This means that tags will be output in lowercase. There will be a feature in a future version to switch output case to capital letters.
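
To make the case folding concrete, here is a minimal standalone sketch. It is not `YAPE::HTML''s own code: the helper `normalize_tag' and its regular expressions are assumptions made for illustration, and they only cope with simple double-quoted attributes. It lowercases the tag name and attribute names while leaving attribute values untouched.

```perl
use strict;
use warnings;

# Hypothetical helper: turn one opening tag into the lowercase
# TAG/ATTR structure described above. Handles only name="value" pairs.
sub normalize_tag {
  my ($html) = @_;
  my ($name, $rest) = $html =~ /^<\s*([A-Za-z][A-Za-z0-9]*)(.*?)\/?>$/s
    or return;
  my %attr;
  while ($rest =~ /([A-Za-z_:][-\w:.]*)\s*=\s*"([^"]*)"/g) {
    $attr{lc $1} = $2;   # attribute name is folded, value is kept as-is
  }
  return { TAG => lc $name, ATTR => \%attr };
}

my $node = normalize_tag('<A HREF="FooBar.html">');
printf "%s href=%s\n", $node->{TAG}, $node->{ATTR}{href};
# prints "a href=FooBar.html"
```

Note that only the names are folded; the value `FooBar.html' keeps its original case, as in the structure shown above.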

Functions

        There is a subtle difference between "empty" and "open"
        tags. For example, the `<AREA>' tag contains a few
        attributes, but there is no text associated with it (nor any
        other tags), and therefore, is "empty"; the `<LI>', on the
        other hand,

        It is strongly suggested that for ease in parsing, any tags
        that you do not explicitly close have a `/' at the end of
        the tag:

          Here's my cat: <img src="cat.jpg" />
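
The trailing `/' makes it possible to decide from the tag text alone that no closing tag will follow. The sketch below shows such a check; `is_self_closed' is a hypothetical helper written for this illustration, not a function provided by `YAPE::HTML'.

```perl
use strict;
use warnings;

# Hypothetical check: does this tag close itself with a trailing slash?
sub is_self_closed {
  my ($tag) = @_;
  return $tag =~ m{^<[^<>]*/\s*>$} ? 1 : 0;
}

my @tags = ('<img src="cat.jpg" />', '<img src="cat.jpg">', '<br/>');
print "$_ => ", is_self_closed($_), "\n" for @tags;
```

Only the first and third tags are reported as self-closed; the second would need either a closing tag or knowledge that `<img>' is an "empty" tag.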

Methods for `YAPE::HTML'

        It also copies the `%OPEN' and `%EMPTY' hashes, as well as
        the `OPEN()' and `EMPTY()' functions, into the `MyExt::Mod'
        namespace. This process is designed to save you from having
        to place `@ISA' assignments all over the place.

        It also copies the `%SSI' hash. Altering this hash is not
        recommended, so it has no public interface (you have to
        fiddle with it yourself). It exists to ensure that an SSI
        is valid.

          package MyExt::Mod;
          use YAPE::HTML 'MyExt::Mod';
          
          # @MyExt::Mod::ISA = 'YAPE::HTML'
          # @MyExt::Mod::text::ISA = 'YAPE::HTML::text'
          # ...
          
          # being rather strict with the tags
          %OPEN = ();
          %EMPTY = ();
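
For readers curious how a `use' line can set up inheritance in another package, the following is a minimal sketch of one plausible mechanism. It is an assumption for illustration, not `YAPE::HTML''s actual implementation: `My::Base' and `My::Base::text' are hypothetical stand-ins, and the copying of `%OPEN' and `%EMPTY' is left out.

```perl
use strict;
use warnings;

# Hypothetical base class whose import() wires up inheritance
# for the package name passed on the `use' line.
package My::Base;

sub import {
  my ($class, $target) = @_;
  return unless defined $target;
  no strict 'refs';
  # same effect as writing @MyExt::Mod::ISA = 'My::Base' by hand
  push @{"${target}::ISA"}, $class;
  # the node class gets a matching parent too
  push @{"${target}::text::ISA"}, "${class}::text";
}

sub new {
  my ($class, $content) = @_;
  return bless { CONTENT => $content }, $class;
}

package My::Base::text;

sub type { 'text' }

package main;

# in a real module file this would be: use My::Base 'MyExt::Mod';
My::Base->import('MyExt::Mod');

my $obj = MyExt::Mod->new('<b>hi</b>');
print ref($obj), " isa My::Base\n" if $obj->isa('My::Base');
```

Because `import' receives the target package name as its argument, the extension package never has to mention `@ISA' itself.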

my $quoted = YAPE::HTML::quote($string);

Extracting Sections

`YAPE::HTML' allows comprehensive extraction of tags, text, comments, DTDs, PIs, and SSIs, using a simple, yet rich, syntax:

      my $extor = $parser->extract(
        TYPE => [ REQS ],
        ...
      );

TYPE can be either the name of a tag (`"table"'), a regular expression that matches tags (`qr/^t[drh]$/'), or a special string to match all tags (`-TAG'), all text (`-TEXT'), all comments (`-COMMENT'), all DTDs (`-DTD'), all PIs (`-PI'), and all SSIs (`-SSI').
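
The three kinds of TYPE can be pictured as a small dispatch: a regular expression is matched against tag names, a special string selects a whole node class, and anything else is compared to the tag name. The sketch below is a standalone illustration under those assumptions. The hash-based node records, the `%SPECIAL' table, and `type_matches' are hypothetical and do not reflect how `YAPE::HTML' stores or matches nodes internally.

```perl
use strict;
use warnings;

# Hypothetical node records; YAPE::HTML's real nodes are objects.
my @nodes = (
  { type => 'tag',     tag  => 'table' },
  { type => 'tag',     tag  => 'td' },
  { type => 'text',    text => 'hello' },
  { type => 'comment', text => ' note ' },
);

# Special strings mapped to node types (assumed mapping, for illustration).
my %SPECIAL = (
  -TAG => 'tag', -TEXT => 'text', -COMMENT => 'comment',
  -DTD => 'dtd', -PI   => 'pi',   -SSI     => 'ssi',
);

sub type_matches {
  my ($type, $node) = @_;
  if (ref $type eq 'Regexp') {          # qr/.../ matches tag names
    return $node->{type} eq 'tag' && $node->{tag} =~ $type;
  }
  if (exists $SPECIAL{$type}) {         # -TAG, -TEXT, -COMMENT, ...
    return $node->{type} eq $SPECIAL{$type};
  }
  return $node->{type} eq 'tag' && $node->{tag} eq lc $type;
}

my @cells  = grep { type_matches(qr/^t[drh]$/, $_) } @nodes;
my @tables = grep { type_matches('table', $_) } @nodes;
my @notes  = grep { type_matches(-COMMENT, $_) } @nodes;
printf "%d cell(s), %d table(s), %d comment(s)\n",
  scalar @cells, scalar @tables, scalar @notes;
# prints "1 cell(s), 1 table(s), 1 comment(s)"
```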

REQS varies from element to element:

Here are some example uses:

FEATURES

This is a list of special features of `YAPE::HTML'.

              <b>Foo<i>bar</b>

            will appear as:

              <b>Foo<i>bar</i></b>

            upon request for output. In addition, tags that are left
            dangling open at the end of an HTML document get closed.
            That means:

              <b>Foo<i>bar

            will appear as:

              <b>Foo<i>bar</i></b>

            On the other hand, if you do enforce strict HTML syntax,
            you'll also be informed of any tags that should have
            been closed but were not.
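
The closing behaviour described here can be sketched with a stack of open tags. The code below is a standalone illustration, not `YAPE::HTML''s implementation: `close_tags' is a hypothetical helper that only understands bare `<name>' and `</name>' tags without attributes.

```perl
use strict;
use warnings;

# Hypothetical helper: re-emit HTML with misnested or dangling tags closed.
sub close_tags {
  my ($html) = @_;
  my @open;
  my $out = '';
  for my $tok (split /(<\/?[a-zA-Z][a-zA-Z0-9]*>)/, $html) {
    next if $tok eq '';
    if ($tok =~ m{^</(\w+)>$}) {
      my $name = lc $1;
      # ignore a close tag that matches nothing currently open
      next unless grep { $_ eq $name } @open;
      # close everything opened after the matching tag, then the tag itself
      while (@open) {
        my $top = pop @open;
        $out .= "</$top>";
        last if $top eq $name;
      }
    }
    elsif ($tok =~ m{^<(\w+)>$}) {
      my $name = lc $1;
      push @open, $name;
      $out .= "<$name>";
    }
    else {
      $out .= $tok;   # plain text passes through
    }
  }
  # close whatever is still open at the end of the document
  $out .= "</$_>" for reverse @open;
  return $out;
}

print close_tags('<b>Foo<i>bar</b>'), "\n";   # <b>Foo<i>bar</i></b>
print close_tags('<b>Foo<i>bar'), "\n";       # <b>Foo<i>bar</i></b>
```

Both inputs produce the same output because the inner `<i>' is closed either when its parent closes or when the document ends.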

TO DO

        This is a listing of things to add to future versions of
        this module.

API

Internals

BUGS

Following is a list of known or reported bugs.

Fixed

SEE ALSO

        The `YAPE::HTML::Element' documentation, for information on
        the node classes.

AUTHOR

          Jeff "japhy" Pinyan
          CPAN ID: PINYAN
          japhy@pobox.com
          http://www.pobox.com/~japhy/