Lingua::Stem::En - Porter's stemming algorithm for 'generic' English


Contained in the Lingua-Stem distribution.

NAME

Lingua::Stem::En - Porter's stemming algorithm for 'generic' English

SYNOPSIS

    use Lingua::Stem::En;
    my $stems   = Lingua::Stem::En::stem({ -words => $word_list_reference,
                                        -locale => 'en',
                                    -exceptions => $exceptions_hash,
                                     });

DESCRIPTION

This routine applies the Porter Stemming Algorithm to its parameters, returning the stemmed words.

It is derived from the C program "stemmer.c" as found in freewais and elsewhere, which contains these notes:

   Purpose:    Implementation of the Porter stemming algorithm documented 
               in: Porter, M.F., "An Algorithm For Suffix Stripping," 
               Program 14 (3), July 1980, pp. 130-137.
   Provenance: Written by B. Frakes and C. Cox, 1986.

I have re-interpreted areas that use Frakes and Cox's "WordSize" function. My version may misbehave on short words starting with "y", but I can't think of any examples.

The step numbers correspond to Frakes and Cox, and are probably in Porter's article (which I've not seen). Porter's algorithm still has rough spots (e.g., current/currency, -ings words), which I've not attempted to cure, although I have added support for the British -ise suffix.

CHANGES

 1999.06.15 - Changed to '.pm' module, moved into Lingua::Stem namespace,
              optionalized the export of the 'stem' routine
              into the caller's namespace, added named parameters

 1999.06.24 - Switch core implementation of the Porter stemmer to
              the one written by Jim Richardson <jimr@maths.usyd.edu.au>

 2000.08.25 - 2.11 Added stemming cache

 2000.09.14 - 2.12 Fixed *major* :( implementation error of Porter's algorithm
              Error was entirely my fault - I completely forgot to include
              rule sets 2,3, and 4 starting with Lingua::Stem 0.30. 
              -- Benjamin Franz

 2003.09.28 - 2.13 Corrected documentation error pointed out by Simon Cozens.

 2005.11.20 - 2.14 Changed rule declarations to conform to Perl style convention
              for 'private' subroutines. Changed Exporter invocation to more
              portable 'require' vice 'use'.

 2006.02.14 - 2.15 Added ability to pass word list by 'handle' for in-place stemming.

 2009.07.27 - 2.16 Documentation fix

    #######################################################################
    # Initialization
    #######################################################################

    use strict;
    require Exporter;
    use Carp;
    use vars qw (@ISA @EXPORT_OK @EXPORT %EXPORT_TAGS $VERSION);
    BEGIN {
        $VERSION     = "2.16";
        @ISA         = qw (Exporter);
        @EXPORT      = ();
        @EXPORT_OK   = qw (stem clear_stem_cache stem_caching);
        %EXPORT_TAGS = ();
    }

    my $Stem_Caching = 0;
    my $Stem_Cache   = {};
    my %Stem_Cache2  = ();

    #
    #V  Porter.pm V2.11 25 Aug 2000 stemming cache
    #   Porter.pm V2.1 21 Jun 1999 with '&$sub if defined' not 'eval ""'
    #   Porter.pm V2.0 25 Nov 1994 (for Perl 5.000)
    #   porter.pl V1.0 10 Aug 1994 (for Perl 4.036)
    #   Jim Richardson, University of Sydney
    #   jimr@maths.usyd.edu.au or http://www.maths.usyd.edu.au:8000/jimr.html

    #   Find a canonical stem for a word, assumed to consist entirely of
    #   lower-case letters.  The approach is from
    #
    #   M. F. Porter, An algorithm for suffix stripping, Program (Automated
    #   Library and Information Systems) 14 (3) 130-7, July 1980.
    #
    #   This algorithm is used by WAIS: for example, see freeWAIS-0.3 at
    #
    #   http://kudzu.cnidr.org/cnidr_projects/cnidr_projects.html

    #   Some additional rules are used here, mainly to allow for British spellings
    #   like -ise.  They are marked ** in the code.

    #   Initialization required before using subroutine stem:

    #   We count syllables slightly differently from Porter: we say the syllable
    #   count increases on each occurrence in the word of an adjacent pair
    #
    #       [aeiouy][^aeiou]
    #
    #   This avoids any need to define vowels and consonants, or confusion over
    #   'y'.  It also works slightly better: our definition gives two syllables
    #   in 'yttrium', while Porter's gives only one because the initial 'y' is
    #   taken to be a consonant.  But it is not quite obvious: for example,
    #   consider 'mayfly' where, when working backwards (see below), the 'yf'
    #   matches the above pattern, even though it is the 'ay' which in Porter's
    #   terms increments the syllable count.
    #
    #   We wish to match the above in context, working backwards from the end of
    #   the word: the appropriate regular expression is

    my $syl = '[aeiou]*[^aeiou][^aeiouy]*[aeiouy]';

    #   (This works because [^aeiouy] is a subset of [^aeiou].)  If we want two
    #   syllables ("m>1" in Porter's terminology) we can just match $syl$syl.

    #   For step 1b we need to be able to detect the presence of a vowel: here
    #   we revert to Porter's definition that a vowel is [aeiou], or y preceded
    #   by a consonant.  (If the . below is a vowel, then the . is the desired
    #   vowel; if the . is a consonant the y is the desired vowel.)

    my $hasvow = '[^aeiouy]*([aeiou]|y.)';
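The two patterns above are written to be matched "working backwards", which in practice means matching them against the reversed word. The following is a minimal sketch of that idea, not the module's own code: the helper names `measure` and `has_vowel` are hypothetical, and counting syllables by repeated `\G`-anchored matches is an assumption made for illustration (the source comments only describe matching `$syl` or `$syl$syl` directly).

```perl
use strict;
use warnings;

# Patterns copied from the module source above; both expect a REVERSED word.
my $syl    = '[aeiou]*[^aeiou][^aeiouy]*[aeiouy]';
my $hasvow = '[^aeiouy]*([aeiou]|y.)';

# Hypothetical helper: count how many consecutive $syl units can be matched
# starting from the end of the word (the start of the reversed string).
sub measure {
    my ($word) = @_;
    my $rev   = scalar reverse $word;    # string reversal
    my $count = 0;
    # \G anchors each attempt where the previous match ended; /c keeps
    # that position when the final attempt fails, so the loop just stops.
    $count++ while $rev =~ /\G$syl/gc;
    return $count;
}

# Hypothetical helper: true if the (reversed) stem contains a vowel in
# Porter's sense, i.e. [aeiou], or 'y' preceded by another letter.
sub has_vowel {
    my ($stem) = @_;
    my $rev = scalar reverse $stem;
    return $rev =~ /^$hasvow/ ? 1 : 0;
}

printf "%-8s m=%d\n", $_, measure($_) for qw(tree trouble oaten yttrium mayfly);
printf "%-8s vowel=%d\n", $_, has_vowel($_) for qw(sky tr);
```

Under this sketch 'yttrium' counts two syllables and 'mayfly' counts one, matching the behaviour described in the source comments.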

METHODS

stem({ -words => \@words, -locale => 'en', -exceptions => \%exceptions });

Stems a list of passed words using the rules of US English. Returns an anonymous array reference to the stemmed words.

Example:

  use Lingua::Stem::En;

  my %exceptions    = ();
  my @words         = ( 'wordy', 'another' );
  my $stemmed_words = Lingua::Stem::En::stem({ -words => \@words,
                                              -locale => 'en',
                                          -exceptions => \%exceptions,
                                            });

If the first element of the list passed as -words is itself a list reference, the stemming is performed 'in place' on that referenced list, modifying it directly instead of copying it to a new array.

This is only useful if you do not need to keep the original list. If you do need it, let 'stem' return a new list as usual: that is faster than making your own copy and then stemming the copy 'in place', because the primary difference between 'in place' and 'by value' stemming is the creation of a copy of the original list. If you don't need the original list, 'in place' stemming is about 60% faster.

Example of 'in place' stemming:

  my %exceptions    = ();
  my $words         = [ 'wordy', 'another' ];
  my $stemmed_words = Lingua::Stem::En::stem({ -words => [$words],
                                              -locale => 'en',
                                          -exceptions => \%exceptions,
                                            });

The 'in place' mode returns a reference to the original list with the words stemmed.
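The difference between the two calling conventions can be illustrated with a small self-contained sketch. Everything here is a stand-in: `toy_stem` and `toy_stem_word` are hypothetical names, and the "stemmer" merely strips a trailing "s". Only the dispatch on the first element of -words mirrors the behaviour documented above.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a real stemmer: strips one trailing "s".
sub toy_stem_word {
    my ($word) = @_;
    (my $stem = $word) =~ s/s\z//;
    return $stem;
}

# Hypothetical dispatcher following the documented calling convention.
sub toy_stem {
    my ($args) = @_;
    my $words = $args->{-words};

    # 'In place' mode: the first element is itself an array reference,
    # so rewrite that array's elements and hand the same reference back.
    if (@$words && ref($words->[0]) eq 'ARRAY') {
        my $target = $words->[0];
        $_ = toy_stem_word($_) for @$target;    # $_ aliases each element
        return $target;
    }

    # 'By value' mode: leave the input alone and return a new list.
    return [ map { toy_stem_word($_) } @$words ];
}

my @originals = ('cats', 'dogs');
my $copy      = toy_stem({ -words => \@originals });   # @originals unchanged

my $list      = ['cats', 'dogs'];
my $same      = toy_stem({ -words => [$list] });       # $list modified

print "by value: @originals -> @$copy\n";
print "in place: @$list (same reference: ", ($same == $list ? 'yes' : 'no'), ")\n";
```

The by-value path has to allocate a second array, which is the copying cost the text above refers to.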

stem_caching({ -level => 0|1|2 });

Sets the level of stem caching.

'0' means 'no caching'. This is the default level.

'1' means 'cache per run'. This caches stemming results during a single call to 'stem'.

'2' means 'cache indefinitely'. This caches stemming results until either the process exits or the 'clear_stem_cache' method is called.

clear_stem_cache;

Clears the cache of stemmed words.
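To make the three caching levels concrete, here is a minimal sketch of the semantics described above. It is not the module's implementation: all `toy_*` names, the trailing-"s" stemmer, and the `$computed` counter are hypothetical and exist only to show how much work each level saves.

```perl
use strict;
use warnings;

my $cache_level = 0;     # 0 = no caching, 1 = per call, 2 = until cleared
my %persistent;          # long-lived cache used only at level 2
my $computed    = 0;     # counts real (uncached) stem computations

# Hypothetical stand-in stemmer: strips one trailing "s".
sub toy_stem_word {
    my ($word) = @_;
    $computed++;
    (my $stem = $word) =~ s/s\z//;
    return $stem;
}

sub toy_stem_caching { my ($args) = @_; $cache_level = $args->{-level}; return }
sub toy_clear_cache  { %persistent = (); return }

sub toy_stem {
    my ($args) = @_;
    # Level 2 reuses the long-lived cache; level 1 gets a fresh one per call.
    my $cache = $cache_level == 2 ? \%persistent : {};
    my @out;
    for my $word (@{ $args->{-words} }) {
        if ($cache_level == 0) {
            push @out, toy_stem_word($word);
            next;
        }
        $cache->{$word} = toy_stem_word($word) unless exists $cache->{$word};
        push @out, $cache->{$word};
    }
    return \@out;
}

# Stem the same two-word list twice at each caching level.
toy_stem_caching({ -level => 1 });
toy_stem({ -words => [qw(cats cats)] }) for 1 .. 2;
my $level1_work = $computed;

$computed = 0;
toy_stem_caching({ -level => 2 });
toy_stem({ -words => [qw(cats cats)] }) for 1 .. 2;
my $level2_work = $computed;

print "level 1 computations: $level1_work\n";
print "level 2 computations: $level2_work\n";
```

In this sketch level 1 saves repeated work only within a single call, while level 2 also saves it across calls, at the cost of holding the cache in memory until it is cleared.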

NOTES

This code is almost entirely derived from the Porter 2.1 module written by Jim Richardson.

SEE ALSO

 Lingua::Stem

AUTHOR

  Jim Richardson, University of Sydney
  jimr@maths.usyd.edu.au or http://www.maths.usyd.edu.au:8000/jimr.html

  Integration in Lingua::Stem by 
  Benjamin Franz, FreeRun Technologies,
  snowhare@nihongo.org or http://www.nihongo.org/snowhare/

COPYRIGHT

BUGS

TODO
