HTML::Content::ContentExtractor - Perl module for extracting content from HTML documents.


Contained in the HTML-Content-Extractor distribution.


NAME

HTML::Content::ContentExtractor - Perl module for extracting content from HTML documents.

SYNOPSIS

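  use strict;
  use warnings;

  use HTML::WordTagRatio::WeightedRatio;
  use HTML::Content::HTMLTokenizer;
  use HTML::Content::ContentExtractor;

  # Object that tokenizes the HTML input
  my $tokenizer = HTML::Content::HTMLTokenizer->new('TAG', 'WORD');

  # Object that computes the ratio of words to tags
  my $ranker = HTML::WordTagRatio::WeightedRatio->new();

  # "index.html" is the input filename, "index.extr" the output filename
  my $extractor = HTML::Content::ContentExtractor->new($tokenizer, $ranker, "index.html", "index.extr");

  $extractor->Extract();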

DESCRIPTION

HTML::Content::ContentExtractor attempts to extract the content from an HTML document. It tries to strip tags, scripts and boilerplate text by finding the region of the document that has the maximum ratio of words to tags.
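The idea of a region with the maximum ratio of words to tags can be illustrated with a small, self-contained sketch. This is not the module's implementation: best_region() is a hypothetical helper, the token labels are plain strings, and the score words / (tags + 1) is an assumed stand-in for whatever HTML::WordTagRatio::WeightedRatio actually computes. The search is a brute-force scan over every contiguous region.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of HTML::Content::ContentExtractor.
# Takes a list of token labels ('TAG' or 'WORD') and returns the start index,
# end index and score of the contiguous region that maximizes the assumed
# score words / (tags + 1). Returns (0, -1, 0) when no region scores above 0.
sub best_region {
    my @tokens = @_;
    my ($best_start, $best_end, $best_score) = (0, -1, 0);

    for my $start (0 .. $#tokens) {
        my ($words, $tags) = (0, 0);
        for my $end ($start .. $#tokens) {
            # Extend the region by one token and update the counts
            if ($tokens[$end] eq 'WORD') { $words++ } else { $tags++ }

            my $score = $words / ($tags + 1);
            if ($score > $best_score) {
                ($best_start, $best_end, $best_score) = ($start, $end, $score);
            }
        }
    }
    return ($best_start, $best_end, $best_score);
}

# A made-up token stream: markup at both ends, a run of words in the middle
my @tokens = qw(TAG TAG WORD TAG WORD WORD WORD WORD TAG TAG WORD TAG);
my ($start, $end, $score) = best_region(@tokens);
print "region: $start..$end, score: $score\n";   # prints "region: 4..7, score: 4"
```

With this score, the run of four consecutive words wins over longer regions that also include tags. A real ranker would likely weight tokens differently and avoid the quadratic scan.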

Methods

* my $extractor = HTML::Content::ContentExtractor->new($tokenizer, $ratio, $inputfilename, $extractfilename)

Initializes HTML::Content::ContentExtractor with four arguments: 1) an object that can tokenize HTML, 2) an object that can compute the ratio of words to tags, 3) an input filename and 4) an output filename.

* $extractor->Extract()

Attempts to extract the content from the file named by $inputfilename.
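Since the documentation above does not say how Extract() reports failure, a caller may want to guard the call. The following is a minimal sketch under stated assumptions: run_extraction() is a hypothetical wrapper, not part of the distribution, and it assumes that a successful run leaves a non-empty file at the output filename passed to the constructor.

```perl
use strict;
use warnings;

# Hypothetical wrapper, not part of the distribution.
# $extractor is any object with an Extract() method; $input and $output are
# the same filenames that were passed to its constructor.
# Returns undef on success, or an error message string.
sub run_extraction {
    my ($extractor, $input, $output) = @_;

    return "input file '$input' does not exist" unless -e $input;

    # Trap exceptions, since the failure behavior of Extract() is not documented
    my $ok = eval { $extractor->Extract(); 1 };
    return "Extract() died: $@" unless $ok;

    # Assumption: a successful run leaves a non-empty output file
    return "no content written to '$output'" unless -s $output;

    return;
}

# Intended use, with $extractor built as in the SYNOPSIS:
#   my $error = run_extraction($extractor, "index.html", "index.extr");
#   warn "extraction failed: $error\n" if defined $error;
```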

AUTHOR

Jean Tavernier (jj.tavernier@gmail.com)

COPYRIGHT

SEE ALSO

ContentExtractorDriver.pl (1).

