| WAIT documentation | view source | Contained in the WAIT distribution. |
WAIT::Filter - Perl extension providing the basic freeWAIS-sf reduction functions
use WAIT::Filter qw(Stem Soundex Phonix isolc disolc isouc disouc
isotr disotr stop grundform utf8iso);
$stem = Stem($word);
$scode = Soundex($word);
$pcode = Phonix($word);
$lword = isolc($word);
disolc($word);
$uword = isouc($word);
disouc($word);
$trword = isotr($word);
disotr($word);
$word = stop($word);
$word = grundform($word);
@words = WAIT::Filter::split($word);
@words = WAIT::Filter::split2($word);
@words = WAIT::Filter::split3($word);
@words = WAIT::Filter::split4($word); # arbitrary numbers allowed
This tiny module gives access to the basic reduction functions built into freeWAIS-sf.
Stem(word) reduces word using the well-known Porter algorithm.
AU: Porter, M.F.
TI: An Algorithm for Suffix Stripping
JT: Program
VO: 14
PP: 130-137
PY: 1980
PM: JUL
Soundex(word) computes the 4-byte Soundex code for word.
AU: Gadd, T.N.
TI: 'Fisching for Werds'. Phonetic Retrieval of written text in
Information Retrieval Systems
JT: Program
VO: 22
NO: 3
PP: 222-237
PY: 1988
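To illustrate what a 4-byte Soundex code looks like, here is a minimal pure-Perl sketch of the classic American Soundex rules (first letter plus three digits). The name soundex_sketch is hypothetical and this is not the freeWAIS-sf implementation, whose handling of edge cases may differ.

```perl
use strict;
use warnings;

# Hypothetical stand-in, NOT the WAIT::Filter::Soundex implementation.
# Classic American Soundex: first letter followed by three digits.
sub soundex_sketch {
    my ($word) = @_;
    my $w = uc $word;
    $w =~ tr/A-Z//cd;                 # keep ASCII letters only
    return '' unless length $w;

    my $first = substr($w, 0, 1);

    # Map letters to digit classes: vowels and Y -> 0, H and W -> 9.
    (my $codes = $w) =~ tr/AEIOUYHWBFPVCGJKQSXZDTLMNR/00000099111122222222334556/;

    $codes =~ s/(?<=.)9//g;           # H and W do not separate equal codes
    $codes =~ s/(\d)\1+/$1/g;         # collapse runs of the same code
    substr($codes, 0, 1) = '';        # the first letter is kept as a letter
    $codes =~ tr/0//d;                # vowels only act as separators

    # Pad with zeros and truncate to three digits.
    return $first . substr($codes . '000', 0, 3);
}

my $code = soundex_sketch('Robert');  # R163
```

Similar-sounding names such as "Robert" and "Rupert" map to the same code under these rules, which is what makes the code useful as an index key.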
Phonix(word) computes the 8-byte Phonix code for word.
AU: Gadd, T.N.
TI: PHONIX: The Algorithm
JT: Program
VO: 24
NO: 4
PP: 363-366
PY: 1990
PM: OCT
There are some additional functions which convert most ISO Latin-1
characters to upper or lower case. For maximum speed there are also
destructive versions (prefixed with d), which change the argument in
place instead of allocating and returning a copy. For convenience, the
destructive versions also return the argument, so all of the following
are valid and leave the lowercased string in $word:
$word = isolc($word);
$word = disolc($word);
disolc($word);
Here are the hardcoded characters which are recognized:
abcdefghijklmnopqrstuvwxyzàáâãäåæçèéêëìíîïñòóôõöøùúûüýß
ABCDEFGHIJKLMNOPQRSTUVWXYZÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÑÒÓÔÕÖØÙÚÛÜÝß
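The difference between the copying and the destructive variants can be sketched in pure Perl with tr/// over the table above. The names isolc_sketch and disolc_sketch are hypothetical stand-ins, not the module's own code, and the sketch assumes the input is a Latin-1 byte string.

```perl
use strict;
use warnings;

# Hypothetical stand-ins for isolc/disolc, assuming Latin-1 byte strings.
# The ranges follow the character table: A-Z, \xC0-\xCF, \xD1-\xD6 and
# \xD8-\xDD are mapped to their lowercase forms; \xDF (sharp s) has no
# uppercase form and is left alone.

sub isolc_sketch {
    my ($word) = @_;                  # work on a copy of the argument
    $word =~ tr/A-Z\xC0-\xCF\xD1-\xD6\xD8-\xDD/a-z\xE0-\xEF\xF1-\xF6\xF8-\xFD/;
    return $word;
}

sub disolc_sketch {
    # $_[0] is an alias for the caller's variable, so this edits in place
    $_[0] =~ tr/A-Z\xC0-\xCF\xD1-\xD6\xD8-\xDD/a-z\xE0-\xEF\xF1-\xF6\xF8-\xFD/;
    return $_[0];                     # returned for convenience
}

my $word = "\xC4RGER";                # "ÄRGER" in Latin-1
my $copy = isolc_sketch($word);       # $word is unchanged here
disolc_sketch($word);                 # now $word itself is lowercased
```

The destructive form avoids one string allocation per call, which matters when a filter runs once for every word of a large collection.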
$new = isolc($word), disolc($word)
    Converts $word to lower case.
$new = isouc($word), disouc($word)
    Converts $word to upper case.
$new = isotr($word), disotr($word)
    Removes non-letters according to the table above.
$new = stop($word)
    Returns an empty string if $word is a stopword.
$new = grundform($word)
    Calls Text::German::reduce.
$new = utf8iso($word)
    Converts UTF-8 encoded strings to ISO-8859-1. WAIT is currently based internally on the Latin-1 character set, so if you process anything in a different encoding, you should convert it to Latin-1 in the first filter.
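As an illustration of the conversion step that should come first in a filter chain, the following sketch uses the core Encode module to turn UTF-8 bytes into Latin-1 bytes. The function name utf8_to_latin1 is hypothetical; this is not how utf8iso is implemented, only a way to see the intended effect.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical stand-in for utf8iso, built on the core Encode module.
sub utf8_to_latin1 {
    my ($bytes) = @_;
    my $chars = decode('UTF-8', $bytes);      # UTF-8 bytes -> Perl characters
    return encode('ISO-8859-1', $chars);      # characters -> Latin-1 bytes
}

# "Straße" as UTF-8 bytes: the sharp s is the two bytes C3 9F.
my $latin1 = utf8_to_latin1("Stra\xC3\x9Fe");
# $latin1 now holds the single byte DF for the sharp s.
```

After this step every character in the table of recognized letters occupies exactly one byte, which is what the byte-oriented filters expect.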
The splitN functions all take a scalar as input and return a list of words. WAIT::Filter::split acts just like Perl's built-in split(' '). split2 eliminates all words shorter than 2 characters (bytes) from the list, split3 eliminates those shorter than 3 characters (bytes), and so on.
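The behaviour described for the splitN functions can be approximated in pure Perl as a whitespace split followed by a length filter. The helper splitn_sketch and its explicit $n parameter are assumptions made for illustration; the module itself encodes N in the function name.

```perl
use strict;
use warnings;

# Hypothetical helper mimicking splitN: split on whitespace, then keep
# only the words that are at least $n bytes long.
sub splitn_sketch {
    my ($n, $text) = @_;
    return grep { length($_) >= $n } split(' ', $text);
}

my @all   = splitn_sketch(1, " a an the quick  fox ");  # a an the quick fox
my @three = splitn_sketch(3, " a an the quick  fox ");  # the quick fox
```

As with split(' '), leading whitespace and runs of whitespace do not produce empty words.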
Ulrich Pfeifer <pfeifer@ls6.informatik.uni-dortmund.de>
perl(1).