Formatter::HTML::HTML - Formatter to clean existing HTML


Formatter-HTML-HTML documentation Contained in the Formatter-HTML-HTML distribution.

NAME

Formatter::HTML::HTML - Formatter to clean existing HTML

SYNOPSIS

  use Formatter::HTML::HTML;
  my $formatter = Formatter::HTML::HTML->format($data);
  print $formatter->document;
  print $formatter->title;
  my $links = $formatter->links;
  print ${$links}[0]->{url};

DESCRIPTION

This module will clean the document using HTML::Tidy. It also inherits from that module, so you can use methods of that class. It can also parse and return links and the title (using HTML::TokeParser).

METHODS

This module conforms with the Formatter API specification, version 0.95:

format($string [, {config_file => 'path/to/tidy.cfg'}])

The format function that you call to initialise the formatter. It takes the plain text as a string argument and returns an object of this class.

Optionally, you may pass a hashref giving the full file name of a tidy configuration file. This lets you have the Formatter return valid XHTML, provided the config file is set up accordingly. Note that you can also break the Formatter, e.g. by configuring tidy to return just a fragment; it is your own responsibility to make sure you don't.
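
As a minimal sketch of the optional argument, the following writes a one-line tidy configuration to a temporary file and passes its path as config_file. The temporary file, the sample input and the variable names are illustrative assumptions; output-xhtml is a standard tidy option. The call is guarded so the sketch still runs where the module is not installed.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Write a small tidy configuration asking for XHTML output.
my ($fh, $config_path) = tempfile(SUFFIX => '.cfg', UNLINK => 1);
print {$fh} "output-xhtml: yes\n";
close $fh or die "Cannot close $config_path: $!";

# The hashref that format() hands on to HTML::Tidy.
my $args = { config_file => $config_path };

# Guarded: only call the formatter if the module can be loaded.
if (eval { require Formatter::HTML::HTML; 1 }) {
    my $formatter = Formatter::HTML::HTML->format('<p>Hello</p>', $args);
    print $formatter->document;
}
```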

document([$charset])

Will return a full, cleaned and valid HTML document. You may specify an optional $charset parameter, in which case an HTML meta element declaring the chosen character set is included. It is still your responsibility to ensure that the document served is actually encoded with this character set.
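
To show roughly what the $charset parameter does to the markup, here is a standalone sketch that applies the same substitution as the module's document method to a hand-written string, so it runs without HTML::Tidy. The helper name add_charset_meta and the sample input are assumptions made for illustration; they are not part of the module.

```perl
use strict;
use warnings;

# Hypothetical helper mirroring the substitution used in document().
sub add_charset_meta {
    my ($html, $charset) = @_;
    # Only insert when a charset is given and none is mentioned yet.
    if ($charset && $html !~ m/charset/) {
        $html =~ s|(<head.*?>)|$1\n<meta http-equiv="Content-Type" content="text/html; charset=$charset">|si;
    }
    return $html;
}

my $in  = '<html><head><title>T</title></head><body><p>Hi</p></body></html>';
my $out = add_charset_meta($in, 'utf-8');
print $out, "\n";
```

The meta element only declares the encoding; the bytes actually sent still have to be encoded accordingly by the caller.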

fragment

This will return only the contents of the body element.

links

Will return all links found in the input string as an arrayref. Each element of the array is a hashref with the keys url and title, the former containing the URL, the latter the text of the link.
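
The returned structure can be walked like any arrayref of hashrefs. The sketch below uses hand-written sample data in that shape rather than a real formatter object, so the URLs and titles are assumptions; the "-" placeholder for an anchor without an href follows the module source.

```perl
use strict;
use warnings;

# Hypothetical data in the shape links() returns; in real use:
#   my $links = $formatter->links;
my $links = [
    { url => 'http://example.org/', title => 'Example' },
    { url => '-',                   title => 'Anchor without href' },
];

# Collect one "title: url" line per link.
my @lines;
for my $link (@{$links}) {
    push @lines, "$link->{title}: $link->{url}";
}
print "$_\n" for @lines;
```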

title

Will return the title of the document as seen in the HTML title element or undef if none can be found.

SEE ALSO

Formatter, HTML::Tidy, HTML::TokeParser

TODO

Both the fragment and document methods use naive regular expressions to strip off elements and add a meta element respectively. This is clearly not very reliable, and should be done with a proper parser.
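
Two ways the regular expressions can go wrong are sketched below, using the same patterns as the module source on hand-written input (not actual Tidy output, which may differ). The sample strings and variable names are assumptions for illustration.

```perl
use strict;
use warnings;

# 1. The word "charset" anywhere in the text suppresses the meta element,
#    because document() only checks for that bare word.
my $doc = '<html><head><title>T</title></head>'
        . '<body><p>Pick a charset.</p></body></html>';
my $needs_meta = ($doc !~ m/charset/) ? 1 : 0;
print "meta would be added: $needs_meta\n";

# 2. A "<body" inside a comment in the head is taken as the start of the
#    body, so head markup leaks into the fragment.
my $tricky = '<html><head><!-- <body> comes later --><title>T</title></head>'
           . '<body><p>Hi</p></body></html>';
my ($fragment) = $tricky =~ m|<body.*?>(.*)</body>|si;
print "fragment: $fragment\n";
```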

SUBVERSION REPOSITORY

This module is currently maintained in a Subversion repository. The trunk can be checked out anonymously using e.g.:

  svn checkout http://svn.kjernsmo.net/Formatter-HTML-HTML/trunk Formatter-HTML-HTML

AUTHOR

Kjetil Kjernsmo, <kjetilk@cpan.org>

COPYRIGHT AND LICENSE

package Formatter::HTML::HTML;

use 5.006;
use strict;
use warnings;
use HTML::Tidy;
use HTML::TokeParser;

use base qw( HTML::Tidy );


our $VERSION = '0.97';

sub format {
  my ($that, $text, $config) = @_;
  my $class = ref($that) || $that;          # works as class or instance method
  my $tidy  = HTML::Tidy->new($config);     # optional hashref, e.g. config_file
  my $clean = $tidy->clean($text);          # HTML::Tidy does the hard work up front
  my $self  = { _out => $clean };           # keep only the cleaned markup
  bless($self, $class);
  return $self;
}


sub document {
  my $self = shift;
  my $charset = shift;
  my $cleaned = $self->{_out};
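  # Add the meta element only if a charset was requested and the text
  # does not already mention one.
  if (($charset) && ($cleaned !~ m/charset/)) {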
    $cleaned =~ s|(<head.*?>)|$1\n<meta http-equiv="Content-Type" content="text/html; charset=$charset">|si;
  }
  return $cleaned;
}


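sub fragment {
  my $self = shift;
  # Greedy capture between the first <body ...> and the last </body>.
  if ($self->{_out} =~ m|<body.*?>(.*)</body>|si) {
    return $1;
  } else {
    return $self->{_out};   # no body element found: return everything
  }
}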

sub links {
  my $self = shift;
  my @arr;
  my $p = HTML::TokeParser->new(\$self->{_out});

  while (my $token = $p->get_tag("a")) {
    my $url = $token->[1]{href} || "-";
    my $text = $p->get_trimmed_text("/a");
    push(@arr, {url => $url, title => $text});
  }
  return \@arr;
}

# Both links and title are taken right from examples in TokeParser!
# Nice of them, huh? :-)


sub title {
  my $self = shift;
  my $p = HTML::TokeParser->new(\$self->{_out});

  if ($p->get_tag("title")) {
    return $p->get_trimmed_text;
  }
  return undef;
}


1;
__END__