Text::WordGrams - Calculates statistics on word ngrams.


Text-WordGrams documentation Contained in the Text-WordGrams distribution.

NAME

Text::WordGrams - Calculates statistics on word ngrams.

VERSION

Version 0.04

SYNOPSIS

    use Text::WordGrams;

    my $data = word_grams( $text );

    my $data = word_grams_from_files( $file1, $file2 );
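
Once you have the result, a typical next step is to rank the n-grams by frequency. The sketch below assumes only the shape of the returned value (a hash reference whose keys are the words of an n-gram joined by single spaces and whose values are counts); the sample hash is hand-written for illustration and is not output from the module.

```perl
use strict;
use warnings;

# Hand-written stand-in for the hash reference word_grams() returns:
# keys are the words of an n-gram joined by single spaces, values are counts.
my $data = {
    'the cat' => 2,
    'cat saw' => 1,
    'saw the' => 1,
};

# Most frequent first; ties broken alphabetically for a stable listing.
my @ranked = sort { $data->{$b} <=> $data->{$a} or $a cmp $b } keys %$data;

for my $gram (@ranked) {
    my @words = split / /, $gram;    # recover the individual words
    printf "%d\t%s\n", $data->{$gram}, join(' + ', @words);
}
```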

FUNCTIONS

word_grams

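Returns a reference to a hash table with word n-gram counts for the given string. Each key is an n-gram with its words joined by single spaces, and each value is the number of times it occurs. Options, if any, are passed as a hash reference in the first argument.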

Options include:

ignore_case

Set this option to a true value to ignore text case. The text is lowercased before counting.

size

Set this option to the n-gram size you want. The value should be greater than or equal to two; any other value falls back to the default of two (bigrams). Keep in mind that the larger the size you ask for, the larger the hash will become. Future releases might include a DB File version to reduce memory consumption.

tokenize

This option is activated by default. Set it to zero if your document is already tokenized. In that case your text will be split on whitespace.
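
To make the options concrete, here is a minimal, self-contained sketch of the counting that the options describe when tokenize is zero. It is not the module's code: count_word_grams is a hypothetical name, the options are passed as plain positional arguments, and it splits with split ' ' (which skips leading whitespace). With the module itself, the corresponding call would be word_grams({ size => 2, tokenize => 0 }, $text).

```perl
use strict;
use warnings;

# Hypothetical stand-in for word_grams() with tokenize => 0:
# split on whitespace, then count every run of $size adjacent tokens.
sub count_word_grams {
    my ($text, $size, $ignore_case) = @_;
    $size = 2 unless $size && $size > 1;    # same default as the module
    $text = lc $text if $ignore_case;

    my @tokens = split ' ', $text;          # skips leading whitespace
    my %count;
    for my $i (0 .. @tokens - $size) {      # empty range if too few tokens
        $count{ join ' ', @tokens[$i .. $i + $size - 1] }++;
    }
    return \%count;
}

my $grams = count_word_grams("the cat saw the cat", 2);
# "the cat" occurs twice; "cat saw" and "saw the" occur once each.
print "$_ => $grams->{$_}\n" for sort keys %$grams;
```

A text of k tokens yields at most k - size + 1 distinct keys, which is why larger sizes on large texts tend to produce larger hashes: longer n-grams repeat less often.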

word_grams_from_files

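Supports the same options as the word_grams function, but receives a list of file names instead of a string. Each file is read one paragraph at a time and the counts are summed across all files. Names that are not plain files are skipped.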
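
The following sketch illustrates the file-based variant under simplifying assumptions: it only counts bigrams, it splits on whitespace instead of using the module's tokenizer, and the helper names bigrams and bigrams_from_files are hypothetical. It follows the same outline as the module source: paragraph-at-a-time reading with the input record separator set to a blank line, and per-paragraph counts added into one hash.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: whitespace-split bigram counts for one chunk of text.
sub bigrams {
    my @t = split ' ', shift;
    my %c;
    $c{ join ' ', @t[$_, $_ + 1] }++ for 0 .. $#t - 1;
    return \%c;
}

# Read each file one paragraph at a time and add the per-paragraph
# counts into a single hash, skipping names that are not plain files.
sub bigrams_from_files {
    my %total;
    for my $file (@_) {
        next unless -f $file;
        local $/ = "\n\n";                   # records end at a blank line
        open my $in, '<', $file or die "Can't open file: $file: $!\n";
        while (my $para = <$in>) {
            my $c = bigrams($para);
            $total{$_} += $c->{$_} for keys %$c;
        }
        close $in;
    }
    return \%total;
}

# Demonstration on a temporary two-paragraph file.
my ($fh, $name) = tempfile(UNLINK => 1);
print {$fh} "a b c\n\nc a b\n";
close $fh;

my $total = bigrams_from_files($name, "no-such-file.txt");
print "$_ => $total->{$_}\n" for sort keys %$total;
```

Because every paragraph is counted separately, no n-gram spans a paragraph boundary: in the demonstration file there is no "c c" entry even though one paragraph ends with c and the next begins with c.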

AUTHOR

Alberto Simões, <ambs@cpan.org>

BUGS

The current method is very, very slow. If you find a faster method, please let me know. I think the bottleneck is in the tokenisation part.

Please report any bugs or feature requests to bug-text-wordgrams@rt.cpan.org, or through the web interface at http://rt.cpan.org/NoAuth/ReportBug.html?Queue=Text-WordGrams. I will be notified, and then you'll automatically be notified of progress on your bug as I make changes.

COPYRIGHT & LICENSE

package Text::WordGrams;

use warnings;
use strict;

require Exporter;

use Lingua::PT::PLNbase;

our $VERSION = '0.04';
our @ISA = "Exporter";
our @EXPORT = ("word_grams", "word_grams_from_files");

sub word_grams {
  my $conf = {};
  $conf = shift if (ref($_[0]) eq "HASH");
  $conf->{size} = 2 unless $conf->{size} && $conf->{size} > 1;

  my $text = shift;
  $text = lc($text) if $conf->{ignore_case};

  my @atoms;
  if (!exists($conf->{tokenize}) || $conf->{tokenize} == 1) {
      @atoms = atomiza($text);
  }
  else {
      $text =~ s/\n/ /g;
      @atoms = split /\s+/, $text;
  }

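  my $data = {};  # always return a hash reference, even when no n-gram is found

  my $previous = shift @atoms;
  my $next;
  # Test definedness so that a token such as "0" does not end the loop early.
  while (defined($next = _get($conf->{size}-1, \@atoms))) {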
    $data->{"$previous $next"}++;
    $previous = shift @atoms;
  }
  return $data
}

sub _get {
  my ($n, $atoms) = @_;
  if ($n <= $#$atoms + 1) {
    return join(" ", @{$atoms}[0..$n-1])
  } else {
    return undef
  }
}

sub word_grams_from_files {
  my $conf = {};
  $conf = shift if (ref($_[0]) eq "HASH");
  my $data;

  for my $file (@_) {
    next unless -f $file;

    local $/ = "\n\n";

    # Three-argument open with a lexical handle; report the system error too.
    open my $fh, '<', $file or die "Can't open file: $file: $!\n";
    while (<$fh>) {
      my $o = word_grams($conf, $_);
      for my $w (keys %$o) {
        $data->{$w} += $o->{$w};
      }
    }
    close $fh;
  }

  return $data;
}

1; # End of Text::WordGrams