HTML::WordTagRatio::SmoothedRatio - Default module for determining the ratio of words to tags in a range of tokens in an HTML document.


Contained in the HTML-Content-Extractor distribution.

NAME

HTML::WordTagRatio::SmoothedRatio - Default module for determining the ratio of words to tags in a range of tokens in an HTML document.

SYNOPSIS

  use HTML::WordTagRatio::SmoothedRatio;
  use HTML::Content::HTMLTokenizer;
  use HTML::Content::ContentExtractor;

  my $tokenizer = HTML::Content::HTMLTokenizer->new('TAG','WORD');

  # Read the whole HTML document into a single string.
  open(my $fh, '<', 'index.html') or die "Cannot open index.html: $!";
  my $doc = join("", <$fh>);
  close($fh);

  my ($word_count_arr_ref,$tag_count_arr_ref,$token_type_arr_ref,$token_hash_ref) = $tokenizer->Tokenize($doc);

  my $ratio = HTML::WordTagRatio::SmoothedRatio->new();

  # Score the whole document: from token index 0 to the last token index.
  my $value = $ratio->RangeValue(0, $#{$word_count_arr_ref},
                                 $word_count_arr_ref, $tag_count_arr_ref);

DESCRIPTION

HTML::WordTagRatio::SmoothedRatio computes a smoothed ratio of words to tags for a given range of tokens. In pseudocode, the ratio is

Words/TotalWords/(Tags + 1)/(TotalTags + 1)

Methods

* my $ratio = new HTML::WordTagRatio::SmoothedRatio()

Creates and initializes a new HTML::WordTagRatio::SmoothedRatio object.

* my $value = $ratio->RangeValue($start, $end, \@WordCount, \@TagCount)

$value is computed as follows:

	($WordCount[$end] - $WordCount[$start])/$WordCount[$#WordCount]/($TagCount[$end] - $TagCount[$start] + 1)/($TagCount[$#TagCount] + 1)

That is, the number of words in the range, divided by the total number of words in the document, divided by the number of tags in the range plus one, divided by the total number of tags plus one. Adding one to each tag count prevents division by zero when a range (or the whole document) contains no tags.

$WordCount[$i] is the number of word tokens at or before the ith token in the input HTML document, and $TagCount[$i] is the number of tag tokens at or before the ith token. Because both arrays are cumulative, the differences count the tokens after position $start up to and including position $end.
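
The computation can be illustrated with a small standalone sketch. It is not the module's source: cumulative_counts and smoothed_range_value are hypothetical helpers written for this example, the token list is made up, and the divisions are applied left to right exactly as the expression above is written. If the module groups the terms differently, its results will differ. Returning 0 for a document with no words is also an assumption of this sketch.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the distribution): build cumulative
# word and tag counts from a list of token types.
sub cumulative_counts {
    my @types = @_;
    my (@word_count, @tag_count);
    my ($words, $tags) = (0, 0);
    for my $type (@types) {
        $words++ if $type eq 'WORD';
        $tags++  if $type eq 'TAG';
        push @word_count, $words;
        push @tag_count,  $tags;
    }
    return (\@word_count, \@tag_count);
}

# Hypothetical re-implementation of the documented expression,
# evaluated left to right as written in the documentation.
sub smoothed_range_value {
    my ($start, $end, $word_count, $tag_count) = @_;
    my $total_words = $word_count->[-1];
    return 0 unless $total_words;    # assumption: no words scores 0
    my $words_in_range = $word_count->[$end] - $word_count->[$start];
    my $tags_in_range  = $tag_count->[$end]  - $tag_count->[$start];
    return $words_in_range / $total_words
         / ($tags_in_range + 1)
         / ($tag_count->[-1] + 1);
}

# Made-up token stream: two tags, three words, two tags.
my @types = qw(TAG TAG WORD WORD WORD TAG TAG);
my ($wc, $tc) = cumulative_counts(@types);
# @$wc is (0 0 1 2 3 3 3) and @$tc is (1 2 2 2 2 3 4)

# Range (1, 4] covers the three words and no tags: 3/3/(0 + 1)/(4 + 1)
printf "%.2f\n", smoothed_range_value(1, 4, $wc, $tc);    # prints 0.20
```

A range that holds every word and no tags gets the highest score for this document, while widening it to take in the surrounding tags lowers the score: smoothed_range_value(0, 6, $wc, $tc) is 3/3/(3 + 1)/(4 + 1) = 0.05.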

AUTHOR

Jean Tavernier (jj.tavernier@gmail.com)

COPYRIGHT

SEE ALSO

ContentExtractorDriver.pl (1), HTML::Content::ContentExtractor (3), HTML::Content::HTMLTokenizer (3), HTML::WordTagRatio::Ratio (3), HTML::WordTagRatio::WeightedRatio (3), HTML::WordTagRatio::RelativeRatio (3), HTML::WordTagRatio::ExponentialRatio (3), HTML::WordTagRatio::NormalizedRatio (3).

