Text::Similarity - Measure the pair-wise Similarity of Files or Strings


Contained in the Text-Similarity distribution.

NAME

Text::Similarity - Measure the pair-wise Similarity of Files or Strings

SYNOPSIS

      # this will return an un-normalized score that just gives the
      # number of overlaps

      use Text::Similarity::Overlaps;
      my $mod = Text::Similarity::Overlaps->new;
      defined $mod or die "Construction of Text::Similarity::Overlaps failed";

      # adjust file names to reflect true relative position
      # these paths are valid from lib/Text/Similarity
      my $text_file1 = 'Overlaps.pm';
      my $text_file2 = '../OverlapFinder.pm';

      my $score = $mod->getSimilarity ($text_file1, $text_file2);
      print "The similarity of $text_file1 and $text_file2 is : $score\n";

      # To turn on the verbose option and provide a stoplist, pass
      # those parameters to Overlaps.pm as a hash reference.
      # (This is a separate example; run it as its own script so the
      # variables below do not redeclare the ones above.)

      use Text::Similarity::Overlaps;
      my %options = ('verbose' => 1, 'stoplist' => '../../samples/stoplist.txt');

      my $mod = Text::Similarity::Overlaps->new (\%options);
      defined $mod or die "Construction of Text::Similarity::Overlaps failed";

      # adjust file names to reflect true relative position
      # these paths are valid from lib/Text/Similarity
      my $text_file1 = 'Overlaps.pm';
      my $text_file2 = '../OverlapFinder.pm';

      my $score = $mod->getSimilarity ($text_file1, $text_file2);
      print "The similarity of $text_file1 and $text_file2 is : $score\n";

DESCRIPTION

This module is a superclass for other modules and provides generic services such as stop word removal, compound identification, and text cleaning or sanitizing.

It's important to realize that additional methods of measuring similarity can be added to this package. Text::Similarity::Overlaps is just one possible way of measuring similarity.

Subroutine sanitizeString carries out text cleaning. Briefly, it removes nearly all punctuation except for underscores and embedded apostrophes, converts all text to lower case, and collapses runs of whitespace to a single space.
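
The cleaning steps described above can be sketched roughly as follows. This is an illustrative approximation, not the module's own sanitizeString: the name sanitize_sketch is hypothetical, the exact set of punctuation the module removes may differ, and this version assumes ASCII input.

```perl
use strict;
use warnings;

# Hypothetical helper, NOT the module's sanitizeString.
# Assumes ASCII text; non-ASCII characters are treated as punctuation.
sub sanitize_sketch {
    my ($text) = @_;

    # convert all text to lower case
    $text = lc $text;

    # drop apostrophes that are not embedded between two alphanumerics
    $text =~ s/(?<![A-Za-z0-9])'|'(?![A-Za-z0-9])/ /g;

    # replace everything except letters, digits, underscores,
    # apostrophes and whitespace with a space
    $text =~ s/[^a-z0-9_'\s]/ /g;

    # collapse runs of whitespace and trim both ends
    $text =~ s/\s+/ /g;
    $text =~ s/^ | $//g;

    return $text;
}

print sanitize_sketch("Don't  STOP, the White_House 'tour'!"), "\n";
# prints: don't stop the white_house tour
```

The embedded apostrophe in "don't" survives, while the quoting apostrophes around "tour" are dropped along with the comma and exclamation mark.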

Compounds are meant to be identified in this module, although that feature is currently disabled. When implemented, it will check a list of compounds provided by the user; when a compound is found in the text, it will be designated with an underscore (e.g., white house might be converted to white_house).
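
Since compound identification is disabled in the module, the following is only a sketch of how the described behaviour might look. The name mark_compounds_sketch and the idea of passing compounds as a plain list of space-separated strings are assumptions made for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: joins known multi-word compounds with underscores.
# Assumes the text has already been cleaned (lower case, no punctuation).
sub mark_compounds_sketch {
    my ($text, @compounds) = @_;

    # try longer compounds first so a longer match is not split up
    for my $compound (sort { length($b) <=> length($a) } @compounds) {
        # "white house" becomes "white_house"
        (my $joined = $compound) =~ tr/ /_/;

        # match the words of the compound separated by any whitespace
        my $pattern = join '\s+', map { quotemeta($_) } split ' ', $compound;

        # only match whole words, not fragments of longer words
        $text =~ s/(?<![\w'])$pattern(?![\w'])/$joined/g;
    }
    return $text;
}

print mark_compounds_sketch('the white house is in washington',
                            'white house'), "\n";
# prints: the white_house is in washington
```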

Stop words are removed here. The reported document lengths do not include stop words, and overlaps are found after stop word removal. In effect, including a word in the stoplist says that the word never existed in your input.
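
The effect of the stoplist on document length can be illustrated with a small sketch. The helper name remove_stop_words_sketch is hypothetical, and the stoplist is assumed here to be a plain list of words; the format of the module's stoplist file may differ.

```perl
use strict;
use warnings;

# Hypothetical helper: drops stop words and returns the surviving tokens.
# Assumes the stoplist is a plain list of lower-case words.
sub remove_stop_words_sketch {
    my ($text, @stop_words) = @_;
    my %is_stop = map { $_ => 1 } @stop_words;

    # split on whitespace and keep only tokens not in the stoplist
    return grep { !$is_stop{$_} } split ' ', $text;
}

my @tokens = remove_stop_words_sketch('the cat sat on the mat', qw(the on));

# the length counts only the tokens that survive stop word removal
print scalar(@tokens), " tokens: @tokens\n";
# prints: 3 tokens: cat sat mat
```

The six-word input is reported as three tokens, so any overlap counting that follows never sees "the" or "on".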

BUGS

SEE ALSO

 http://text-similarity.sourceforge.net

AUTHORS

 Ted Pedersen, University of Minnesota, Duluth
 tpederse at d.umn.edu

 Siddharth Patwardhan, University of Utah
 sidd at cs.utah.edu

 Jason Michelizzi

 Ying Liu, University of Minnesota, Twin Cities
 liux0395 at umn.edu

Last modified by : $Id: Similarity.pm,v 1.26 2010/06/12 01:04:23 tpederse Exp $

COPYRIGHT AND LICENSE
