| Plucene-SearchEngine documentation | view source | Contained in the Plucene-SearchEngine distribution. |
Plucene::SearchEngine::Index - A higher level abstraction for Plucene
use Plucene::SearchEngine::Index;
use Plucene::SearchEngine::Index::File;

my $indexer = Plucene::SearchEngine::Index->new(
    dir => "/var/lib/plucene"
);
my @documents = map { $_->document }
    Plucene::SearchEngine::Index::File->examine("foo.html");
$indexer->index($_) for @documents;
This module makes it easy to write to Plucene indexes. It does so by
providing an interface to the index writer which, in terms of
complexity, sits between Plucene::Index::Writer and
Plucene::Simple; it also provides a framework of modules for turning
data into Plucene::Document objects, so that you don't necessarily
have to parse them yourself. See Document Frontends and Backends for
more on this.
This module is designed to be used with Plucene::SearchEngine::Query; together, the two aim to make it easy for anyone to write a search engine based on Plucene.
my $indexer = Plucene::SearchEngine::Index->new(
    dir      => "/var/plucene/foo",
    analyzer => "Plucene::Analysis::SimpleAnalyzer",
);
This creates a new indexer; you must specify the directory to contain the index, and you may specify an analyzer to tokenize the data.
The index method adds a Plucene::Document to the index.
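Any Plucene::Document will do here, including one put together by hand. The following is a minimal sketch using the Plucene::Document and Plucene::Document::Field classes from the core Plucene distribution. The field names are made up for illustration, and the scratch directory is only there so that the sketch does not touch a real index.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);
use Plucene::Document;
use Plucene::Document::Field;
use Plucene::SearchEngine::Index;

# Build a document by hand. The field names "id", "title" and "text"
# are illustrative; use whatever your queries will search on.
my $doc = Plucene::Document->new;

# Keyword: stored and indexed, but not tokenized.
$doc->add(Plucene::Document::Field->Keyword(id => "note-1"));

# Text: stored, indexed and tokenized.
$doc->add(Plucene::Document::Field->Text(title => "Shopping list"));

# UnStored: indexed and tokenized, but not stored.
$doc->add(Plucene::Document::Field->UnStored(text => "eggs milk bread"));

# Index into a scratch directory that is removed when the script exits.
my $scratch = tempdir(CLEANUP => 1);
my $indexer = Plucene::SearchEngine::Index->new(dir => "$scratch/index");
$indexer->index($doc);
```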
So far so good, but how do you create these Plucene::Documents? You
can, of course, do so manually, but the easiest way is to use the
supplied Plucene::SearchEngine::Index::File or
Plucene::SearchEngine::Index::URL modules.
These two modules are frontends which gather metadata about a file or
URL and then hand the data off to one of the backend modules - there are
backends supplied for PDF, HTML and plain text files. These in turn
return a list of documents found in the file or URL. In most cases,
there'll only be one document, but, for instance, a Unix mbox should
return an object for each email in the box. These objects can be turned
into Plucene::Document objects by calling the document method on
them. This isn't done by default because you may wish to mess with the
hash yourself, or serialize it, or whatever.
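As a rough sketch of that flow, the loop below examines a file, adds an extra field to each object it gets back, and only then converts the object to a Plucene::Document. The "collection" field, the sample file and the scratch directory are assumptions made for the example, and it relies on the supplied plain text backend recognising a .txt file.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);
use Plucene::SearchEngine::Index;
use Plucene::SearchEngine::Index::File;

my $scratch = tempdir(CLEANUP => 1);

# Write a small plain text file to examine.
my $path = "$scratch/note.txt";
open my $fh, ">", $path or die "Cannot write $path: $!";
print {$fh} "Remember to renew the domain.\n";
close $fh or die "Cannot close $path: $!";

my $indexer = Plucene::SearchEngine::Index->new(dir => "$scratch/index");

my $count = 0;
for my $item (Plucene::SearchEngine::Index::File->examine($path)) {
    # The object is not a Plucene::Document yet, so extra data can
    # still be attached. "collection" is a hypothetical field name.
    $item->add_data("collection", "UnStored", "notes");

    # Convert and index only once we are happy with the contents.
    $indexer->index($item->document);
    $count++;
}
print "Indexed $count document(s)\n";
```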
If you want to handle a different type of file, it's relatively easy to
do. All you need to do is create a module called
Plucene::SearchEngine::Index::Whatever; this should inherit from
Plucene::SearchEngine::Index::Base and supply a
gather_data_from_file method. It should also call the
register_handler method to state which MIME types and file extensions
it can handle.
For instance, suppose we want to create a backend which grabs metadata from an image and indexes that. (Not unlike Plucene::SearchEngine::Index::Image...) We'd start off like this:
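package Plucene::SearchEngine::Index::Image;
use strict;
use warnings;
use base 'Plucene::SearchEngine::Index::Base';
use Image::Info qw(image_info html_dim);   # neither is exported by default
use Date::Parse qw(str2time);
use Time::Piece;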
Now we register the MIME types and file extensions we can handle:
__PACKAGE__->register_handler(qw(
    image/bmp  .bmp
    image/gif  .gif
    image/jpeg .jpeg .jpg .jpe
    ...
));
And our gather_data_from_file method will call add_data for
each bit of metadata it can find:
sub gather_data_from_file {
    my ($self, $filename) = @_;
    my $info = image_info($filename);
    # image_info reports failure through an "error" key.
    return if $info->{error};
    $self->add_data("size", "UnStored", scalar html_dim($info));
    $self->add_data("text", "UnStored", $info->{Comment});
    $self->add_data("subtype", "UnStored", $info->{file_ext});
    # Convert the modification time string into a Time::Piece object.
    $self->add_data("created", "Date", Time::Piece->new(
        str2time($info->{LastModificationTime})));
}
See Plucene::SearchEngine::Index::Base for an explanation of add_data.
Because Plucene::SearchEngine::Index uses a plugin architecture,
once this module is installed, it will automatically be called upon to
handle those image types it can deal with, without any additional action
by the user.
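To give a feel for what this kind of dispatch involves, here is a toy, self-contained registry that accepts the same flat list of MIME types and dotted extensions. It is not the code that Plucene::SearchEngine::Index actually runs: find_handler, the My::Backend::* class names, and the choice to try the MIME type before the extension are all invented for the sketch.

```perl
use strict;
use warnings;

# Illustration only: a toy registry, not the bookkeeping done by
# Plucene::SearchEngine::Index::Base.
my (%handler_for_mime, %handler_for_ext);

# Accepts a flat list mixing MIME types and dotted extensions.
sub register_handler {
    my ($class, @specs) = @_;
    for my $spec (@specs) {
        if ($spec =~ /^\./) { $handler_for_ext{ lc $spec }  = $class }
        else                { $handler_for_mime{ lc $spec } = $class }
    }
    return;
}

# Try the MIME type first when one is given, then the file extension.
sub find_handler {
    my ($filename, $mime_type) = @_;
    if (defined $mime_type) {
        my $class = $handler_for_mime{ lc $mime_type };
        return $class if defined $class;
    }
    my ($ext) = $filename =~ /(\.[^.\/]+)\z/;
    return defined $ext ? $handler_for_ext{ lc $ext } : undef;
}

register_handler("My::Backend::Image",
    qw(image/gif .gif image/jpeg .jpeg .jpg .jpe));
register_handler("My::Backend::Text", qw(text/plain .txt));

for my $file ("holiday.JPG", "notes.txt", "archive.zip") {
    my $class = find_handler($file);
    printf "%-12s => %s\n", $file, defined $class ? $class : "(no handler)";
}
```

Matching is case-insensitive here so that "holiday.JPG" and "holiday.jpg" resolve to the same backend; whether the real module does the same is not something this sketch claims.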
For certain types of data, such as emails, news articles, or instant
messages, you may not want to use the file or URL frontends.
Alternatively, if you have a simple piece of data which isn't
file-based, you may just want to do everything yourself. Even then,
Plucene::SearchEngine::Index::Base can help you to create
Plucene::Documents - just inherit from it, and use add_data to add
fields to the document in your examine method. See
Plucene::SearchEngine::Index::Base for more details.
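A minimal sketch of that approach, for data that never lives in a file, might look like this. The package name My::Index::Note and the "author" and "text" fields are hypothetical, and the sketch assumes that the inherited new constructor takes no arguments; check Plucene::SearchEngine::Index::Base before relying on that.

```perl
package My::Index::Note;    # hypothetical package name
use strict;
use warnings;
use base 'Plucene::SearchEngine::Index::Base';

# Class method: takes a list of hash refs, one per note, and returns
# one backend object per note.
sub examine {
    my ($class, @notes) = @_;
    return map {
        my $self = $class->new;    # assumes new() needs no arguments
        $self->add_data("author", "UnStored", $_->{author});
        $self->add_data("text",   "UnStored", $_->{body});
        $self;
    } @notes;
}

package main;

my @items = My::Index::Note->examine(
    { author => "alice", body => "Lunch at noon?" },
    { author => "bob",   body => "Sure, see you there." },
);

# Each object becomes a Plucene::Document only when asked.
my @documents = map { $_->document } @items;
```

The resulting documents can then be passed to the indexer's index method in the usual way.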
Plucene::SearchEngine::Index::File, Plucene::SearchEngine::Index::URL, Plucene::SearchEngine::Index::Base, Plucene::SearchEngine::Query, Plucene::Simple.
Simon Cozens <simon@cpan.org>
This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.