| Search-Tokenizer documentation | view source | Contained in the Search-Tokenizer distribution. |
Search::Tokenizer - Decompose a string into tokens (words)
# generic usage
use Search::Tokenizer;
my $tokenizer = Search::Tokenizer->new(
  regex     => qr/.../,
  filter    => sub { ... },
  stopwords => {word1 => 1, word2 => 1, ... },
  lower     => 1,
);
my $iterator = $tokenizer->($string);
while (my ($term, $len, $start, $end, $index) = $iterator->()) {
  ...
}
# usage for DBD::SQLite (with builtin tokenizers: word, word_locale,
# word_unicode, unaccent)
use Search::Tokenizer;
$dbh->do("CREATE VIRTUAL TABLE t "
." USING fts3(tokenize=perl 'Search::Tokenizer::unaccent')");
This module builds an iterator function that will progressively
extract terms from a given input string. Terms are defined by a
regular expression (for example \w+). Term matching relies on the
builtin "global match" operator of Perl (the 'g' flag), and therefore
is quite efficient.
Before being returned to the caller, terms may be filtered by an auxiliary function, for performing tasks such as stemming or stopword elimination.
A tokenizer returned from the new method is a code reference, not a regular Perl object. To use the tokenizer, call it with a string to parse: this returns another code reference, which works as an iterator. Each call to the iterator returns the next term from the string, until the string is exhausted.
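The two-level design (a tokenizer that returns an iterator) can be pictured with plain closures. The following is only a minimal sketch of the idea in core Perl, not the module's actual source; make_tokenizer is a hypothetical helper, and lowercasing is hard-wired here for brevity.

```perl
use strict;
use warnings;

# Hypothetical helper, NOT part of Search::Tokenizer: builds a tokenizer
# (a code reference) from a compiled regex.
sub make_tokenizer {
    my ($regex) = @_;
    return sub {
        my ($string) = @_;   # private copy: pos() is tracked per iterator
        my $index = -1;
        return sub {
            # a scalar-context /g match resumes where the previous one stopped
            if ($string =~ /$regex/g) {
                my ($start, $end) = ($-[0], $+[0]);
                my $term = lc substr($string, $start, $end - $start);
                $index++;
                return wantarray
                    ? ($term, length($term), $start, $end, $index)
                    : $term;
            }
            return;          # exhausted: empty list, or undef in scalar context
        };
    };
}

my $tokenizer = make_tokenizer(qr/\w+/);
my $iterator  = $tokenizer->("Once upon a time");
while (my ($term, $len, $start, $end, $index) = $iterator->()) {
    printf "%d: %s (%d..%d)\n", $index, $term, $start, $end;
}
```

Because each iterator closes over its own copy of the string, several iterators built from the same tokenizer can run independently. In this sketch, a scalar-context loop over input that may contain the term "0" should test the result with defined(), since "0" is false in Perl.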
This API was explicitly designed for integrating Perl with the FTS3 fulltext search engine in DBD::SQLite; however, the API is general enough to be useful for other purposes, which is why it is published in its own, separate distribution.
my $tokenizer = Search::Tokenizer->new($regex);
my $tokenizer = Search::Tokenizer->new(%options);
Builds a new tokenizer, returned as a code reference.
The first syntax with a single Regexp argument is a shorthand
for ->new(regex => $regex). The second syntax, with
named arguments, has the following available options:
regex => $regex
$regex is a compiled regular expression that
specifies how to match a term; that regular expression should not
match the empty string (otherwise the tokenizer would enter an
infinite loop). The default is qr/\w+/. Here are some examples of more
advanced regexes:
# take 'locale' into account
$regex = do {use locale; qr/\w+/};
# rely on Unicode's definition of "word characters"
$regex = qr/\p{Word}+/;
# words like "don't", "it's" are treated as a single term
$regex = qr/\w+(?:'\w+)?/;
# same thing but also with internal hyphens like "fox-trot"
$regex = qr/\w+(?:[-']\w+)?/;
lower => $bool
If true, the term returned by the $regex is
converted to lowercase. This option is activated by default.
filter => $filter
$filter is a reference to a function that may modify or cancel
a term before it is returned to the caller. The filter takes one
single argument (the term) and returns a scalar (the modified term).
If the value returned from the filter is empty, then this term is canceled.
filter_in_place => $filter
Like filter, except that the filtering function directly
modifies the term in its $_[0] argument instead of returning
a new term. This is useful for example when building a filter
from Lingua::Stem::Snowball
or from Text::Transliterator::Unaccent.
stopwords => $hashref
The keys in $hashref are terms to cancel (usually common terms
for which indexing would consume lots of resources with little
added value). Values in the hash should evaluate to true.
Lists of stopwords for various languages may be found in
the Lingua::StopWords module.
Stopwords filtering is applied after the filter or
filter_in_place function (if any).
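The difference between the filter and filter_in_place calling conventions comes down to Perl's aliasing of @_. The sketch below uses core Perl only and does not call the module; the possessive-stripping rule is an arbitrary example chosen for illustration.

```perl
use strict;
use warnings;

# "filter" style: receives the term, returns the modified term.
my $filter = sub { my $term = shift; $term =~ s/'s$//; return $term };

# "filter_in_place" style: edits the caller's variable through the $_[0] alias
# and its return value is not used.
my $filter_in_place = sub { $_[0] =~ s/'s$//; return };

my $term_a = "dog's";
$term_a = $filter->($term_a);      # caller must take the return value

my $term_b = "dog's";
$filter_in_place->($term_b);       # caller's variable is modified directly

print "$term_a / $term_b\n";
```

The in-place form avoids copying the term, which is presumably why it suits libraries whose functions already operate on their argument.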
Whenever a term is canceled through the filter or stopwords options, the tokenizer does not return that term to the client, but nevertheless remembers the canceled position. For example, when tokenizing "Once upon a time" with
$tokenizer = Search::Tokenizer->new(
  stopwords => Lingua::StopWords::getStopWords('en')
);
we get the term sequence
("upon", 4, 5, 9, 1)
("time", 4, 12, 16, 3)
where terms "once" and "a" in positions 0 and 2 have been canceled.
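The position bookkeeping described above can be reproduced with a short loop in core Perl. This is a sketch under assumptions, not the module's implementation: tokenize_all is a hypothetical helper, and the two-word stopword hash stands in for the much larger list that Lingua::StopWords would supply.

```perl
use strict;
use warnings;

# Hypothetical helper, NOT part of Search::Tokenizer: returns every
# surviving term as [$term, $len, $start, $end, $index].
sub tokenize_all {
    my ($string, %opt) = @_;
    my $regex = $opt{regex} || qr/\w+/;
    my $index = -1;
    my @result;
    while ($string =~ /$regex/g) {
        my ($start, $end) = ($-[0], $+[0]);
        my $term = lc substr($string, $start, $end - $start);
        $index++;                          # counted even if the term is canceled
        $term = $opt{filter}->($term) if $opt{filter};
        next if !defined $term || $term eq '';
        next if $opt{stopwords} && $opt{stopwords}{$term};   # after the filter
        push @result, [$term, length($term), $start, $end, $index];
    }
    return @result;
}

my @terms = tokenize_all(
    "Once upon a time",
    stopwords => { once => 1, a => 1 },    # tiny stand-in stopword list
);
print join(", ", @$_), "\n" for @terms;
```

Incrementing the index before the cancellation tests is what keeps "upon" at position 1 and "time" at position 3, so that phrase and proximity queries still see the original word distances.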
my $iterator = $tokenizer->($text);
# loop over terms ..
while (my $term = $iterator->()) {
  work_with_term($term);
}
# .. or loop over terms with detailed information
while (my @term_details = $iterator->()) {
  work_with_details(@term_details); # ($term, $len, $start, $end, $index)
}
The tokenizer takes one string argument and returns an iterator. The iterator takes no argument; each call returns the next term from the string, until the string is exhausted, at which point the iterator returns an empty result.
If called in a scalar context, the iterator returns just a string; if called in a list context, it returns a tuple composed of:
the term (after filtering)
the term length
the starting offset in the string where this term was found
the end offset (where the search for the next term will start)
the index of this term within the string, starting at 0
Length and start/end offsets are computed in characters, not in bytes (note for SQLite users: the C layer in SQLite needs byte values, but the conversion is taken care of automatically by the C implementation in DBD::SQLite).
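To see why character and byte offsets diverge, the following core-Perl sketch compares both for a string containing one accented letter. It only illustrates the distinction and assumes UTF-8 as the byte encoding; it says nothing about how DBD::SQLite performs its conversion internally.

```perl
use strict;
use warnings;
use Encode qw(encode_utf8);

my $string = "caf\x{e9} noir";     # "café noir", written with an escape
my @offsets;
while ($string =~ /\p{Word}+/g) {
    my ($start, $end) = ($-[0], $+[0]);            # character offsets
    # byte offset = number of UTF-8 bytes in the prefix before the offset
    my $byte_start = length(encode_utf8(substr($string, 0, $start)));
    my $byte_end   = length(encode_utf8(substr($string, 0, $end)));
    push @offsets, [$start, $end, $byte_start, $byte_end];
}
printf "chars %d..%d  bytes %d..%d\n", @$_ for @offsets;
```

The first term spans characters 0..4 but bytes 0..5, because the accented letter takes two bytes in UTF-8; every later offset is shifted accordingly.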
Beware that ($end - $start) is the length of the original term extracted by the regex, while $len is the length of the final $term, after filtering; both may differ, especially if stemming is being applied.
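A small illustration of that difference, again in core Perl and without the module: the "stemmer" below is a deliberately crude stand-in (it merely strips a trailing "ing") and is an assumption made for the example, not a real stemming algorithm.

```perl
use strict;
use warnings;

# Crude stand-in for a real stemmer (hypothetical): strips a trailing "ing".
my $stem = sub { my $term = shift; $term =~ s/ing$//; return $term };

my $string = "walking home";
my @rows;
while ($string =~ /\w+/g) {
    my ($start, $end) = ($-[0], $+[0]);
    my $term = $stem->(substr($string, $start, $end - $start));
    # length of the final term versus width of the matched text
    push @rows, [$term, length($term), $end - $start];
}
printf "%-5s len=%d span=%d\n", @$_ for @rows;
```

For "walking" the final term "walk" has length 4 while the matched span is 7 characters wide; for "home" both values are 4.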
For convenience, the following tokenizers are builtin:
Search::Tokenizer::word
Terms are "words" according to Perl's notion of \w+.
Search::Tokenizer::word_locale
Terms are "words" according to Perl's notion of \w+
under use locale.
Search::Tokenizer::word_unicode
Terms are "words" according to Unicode's notion of
\p{Word}+.
Search::Tokenizer::unaccent
Like Search::Tokenizer::word_unicode, but filtered
through Text::Transliterator::Unaccent
to replace all accented characters by their base character.
These builtin tokenizers may take the same arguments
as new(), for example:
use Search::Tokenizer;
my $tokenizer = Search::Tokenizer::unaccent(lower => 0, stopwords => ...);
Laurent Dami, <lau.....da..@justice.ge.ch>
Please report any bugs or feature requests to bug-search-tokenizer
at rt.cpan.org, or through the web interface at
http://rt.cpan.org/NoAuth/ReportBug.html?Queue=Search-Tokenizer. I
will be notified, and then you'll automatically be notified of
progress on your bug as I make changes.
You can find documentation for this module with the perldoc command.
perldoc Search::Tokenizer
Copyright 2010 Laurent Dami.
This program is free software; you can redistribute it and/or modify it under the terms of either: the GNU General Public License as published by the Free Software Foundation; or the Artistic License.
See http://dev.perl.org/licenses/ for more information.